/g/ - Technology

File: sampo.jpg (742 KB, 3200x1536)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>109026244 & >>109023085

►News
>(06/10) DiffusionGemma 26B-A4B released: https://blog.google/innovation-and-ai/technology/developers-tools/diffusion-gemma-faster-text-generation
>(06/09) Cohere releases North-Mini-Code-1.0: https://hf.co/CohereLabs/North-Mini-Code-1.0
>(06/07) llama : add Gemma4 MTP #23398 MERGED: https://github.com/ggml-org/llama.cpp/pull/23398
>(06/05) dots.tts 2B released: https://hf.co/rednote-hilab/dots.tts-soar

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://swe-rebench.com
Agentic Coding: https://deepswe.datacurve.ai
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: congration.jpg (228 KB, 1024x1024)
►Recent Highlights from the Previous Thread: >>109026244

--Comparing Gemma 4 12B, 26B, and 31B reasoning performance:
>109026649 >109026994 >109027048 >109027046 >109027063 >109027201 >109027298 >109027762 >109028167 >109028188 >109028202 >109028317 >109028326 >109028339 >109028389 >109028650 >109030540
--Optimizing Gemma 31B VRAM usage and performance on 24GB GPUs:
>109030630 >109030678 >109030707 >109031071 >109031098 >109030693 >109030702 >109030723 >109030727 >109030739 >109030753 >109030780 >109030840 >109030903
--Optimizing Hermes with local search tools:
>109029679 >109029688 >109029714 >109029838 >109029840 >109029855 >109029923 >109029934 >109029971 >109030523 >109030643 >109029868
--Exploiting LLM safety refusals to evade AI security scanners:
>109027080 >109027089 >109027104 >109027106
--Decoding base64 redacted reasoning in Moonshot Kimi models:
>109029974 >109029989 >109030056 >109030129 >109030174 >109030225 >109030064 >109030122 >109030150
--Hardware and budget recommendations for running Kimi-chan with high context:
>109031231 >109031457 >109031500 >109031541 >109031562 >109031564 >109031627 >109031645 >109031661 >109031646 >109031770
--Using custom think tags to steer Gemma 4 reasoning and prose:
>109027608 >109027617 >109027851 >109029176
--DiffusionGemma performance and token canvas implementation:
>109027336 >109027375 >109027404 >109027489 >109027519
--Comparing Gemini and Gemma models and discussing LLM architecture experiments:
>109028735 >109028748 >109029016 >109029723 >109029836 >109029945 >109030035
--Debating the widening gap between closed and open-weight models:
>109029222 >109029299 >109029320 >109031208
--Logs:
>109027403 >109027489 >109029688 >109029840 >109029974 >109030056 >109030174 >109030225 >109031668
--Rin, Miku, Teto (free space):
>109026417 >109026687 >109029201 >109029209 >109029283 >109029440

►Recent Highlight Posts from the Previous Thread: >>109026246

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>109032734
Cool pic
>>
>>109032604
>Well they have the most vram for the price. And lmg told me vram is king
this tbqh famm
>>
File: IMG_1036.png (262 KB, 1160x793)
Pareto frontier models for speed to answer - quality tradeoff

Granite 4.0 350M
Qwen3 0.6B
Exaone 4.0 1.2B
MiniCPM5-1B
gpt-oss-20B (low thinking effort)
Longcat flash lite
gpt-oss-120B (low thinking effort)
Gemma 4 26BA4B
Qwen3.5 35BA3B
Qwen3.6 35BA3B
Gemma 4 26BA4B (thinking)
Qwen3.5 35BA3B (thinking)
Qwen3.6 35BA3B (thinking)
Minimax-M2.7 (thinking)
MiMo-V2.5-Pro (thinking)
Kimi K2.6 (thinking)
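
For anyone who wants to rebuild a list like this from their own numbers: a minimal sketch of picking the Pareto frontier out of (time to answer, quality) pairs. The model names and values below are placeholders made up for illustration, not the data behind the chart.

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    seconds: float  # time to answer, lower is better
    quality: float  # benchmark score, higher is better


def pareto_frontier(points):
    """Keep every point that no faster (or equally fast) point matches or beats on quality."""
    frontier = []
    best_quality = float("-inf")
    # Walk from fastest to slowest; on equal speed, look at the better one first.
    for p in sorted(points, key=lambda p: (p.seconds, -p.quality)):
        if p.quality > best_quality:
            frontier.append(p)
            best_quality = p.quality
    return frontier


# Placeholder data, invented for this example only.
points = [
    Point("model_a", 1.0, 20.0),
    Point("model_b", 2.0, 15.0),   # slower and worse than model_a: dominated
    Point("model_c", 5.0, 40.0),
    Point("model_d", 30.0, 38.0),  # slower and worse than model_c: dominated
    Point("model_e", 60.0, 55.0),
]

print([p.name for p in pareto_frontier(points)])
```

A model stays on the list only if getting more quality always costs more time, which is why the same model can appear twice (thinking on and off).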
>>
After anon recommended Gembrain, I finally tried it. It's good. It's not that different from the base model honestly, but that may be a good sign (so it's not overcooked). What I've subjectively felt is that it is slightly less intelligent than the base model in some contexts, but actually smarter in a few others. And it also has more pleasant writing IMO. So yeah I think it's worth using, at least for now. I may need to test it more. Additionally, I have not tried MTP. It's possible it does not work well with MTP, which would be unfortunate. Anyone have experience with that?
>>
>>109032788
How very readable.
>>
>>109032824
Not about gembrain specifically (I want to download and test it myself) but I tested a few gemma finetunes (meromero, impish, etc) and they all worked with MTP, no issues at all.
Also your post made me more curious about gembrain, gonna download it right now.
>>
File: file.png (12 KB, 299x72)
>>109032788
Protip, use this
>>
Leaving gemma alone with tools in a vibecoded agent harness without saying anything.
>>
>>109032788
Could the test be fucked because the chat template was shit?
>>
does web search really make them smarter?
>>
>>109032872
No, because modern web is full of AI slop.
>>
>>109032872
Yes. Each search adds +2 IQ points.
>>
>>109032872
do web searches make you any smarter
>>
>>109032881
i'm not a cute ai agent
>>
>>109032890
What makes AI agent cute?
>>
>>109032892
emoji
>>
>>109032890
we can tell, you'd be smarter if you were one
>>
>>109032788
How the fuck is qwen27B more intelligent than 31B
>>
>>109032945
It's almost as if the benchmarks don't matter.
>>
File: file.png (495 KB, 1345x461)
damn i wanted to try running gemma on my old titan x since it has enough vram but it doesnt work with llamacpp
>>
I'm so happy bros, I can run 26B gemmy Q4 at 40 t/s on my old ass 2060 super!
>>
So what's the best model for erp?

I have 24 vram
>>
>>109032985
gemma 31b or 12b if you want giant context
>>
>>109032788
Didn't expect to see 4.7 Flash in the quadrant. Is it actually good then?

Also
>qwen3.5-9B > gemma12B???
>>
>>109032962
>titan
>cuda 7.5
Unsurprising. Compile it yourself pointing at the old toolkit and hope for the best or use vulkan.
>>
>>109032788
Look at the difference in intelligence between 26B reasoning and non-reasoning and look at the difference in compute. Also note that the compute axis is logarithmic whereas the intelligence is linear. I was right earlier. Don't use 26B with reasoning on. You barely get any benefit and if you need it to do something more complex, just throw it to 12B with reasoning.
>>
>>109033048
Is dat right? I'll be damned...
>>
>>109032874
If you create your own web search you can whitelist trusted resources. I think most mcp web search tools have that feature anyway. You can do it with duckduckgo-mcp-server so it doesn't fetch from slop.
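
Rough idea of what the whitelist part looks like if you roll your own: a minimal sketch that drops any search result whose host is not on an allowlist. The domain list, the `url` key on each result, and the function names are assumptions for illustration; this is not the API of duckduckgo-mcp-server or any other MCP tool.

```python
from urllib.parse import urlparse

# Example allowlist, pick your own trusted sources.
ALLOWED_DOMAINS = {"github.com", "arxiv.org", "wikipedia.org"}


def is_allowed(url, allowed=ALLOWED_DOMAINS):
    """True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed)


def filter_results(results, allowed=ALLOWED_DOMAINS):
    """Keep only results (dicts with a 'url' key, assumed shape) from allowed hosts."""
    return [r for r in results if is_allowed(r["url"], allowed)]


results = [
    {"title": "llama.cpp", "url": "https://github.com/ggml-org/llama.cpp"},
    {"title": "some content farm", "url": "https://top10-ai-tools.example/best-llm"},
    {"title": "wiki", "url": "https://en.wikipedia.org/wiki/Language_model"},
]
print([r["title"] for r in filter_results(results)])
```

Matching on the parsed hostname instead of a substring keeps lookalikes such as `github.com.evil.example` out.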
>>
>>109033048
llama.cpp's webui comes with reasoning disabled by default for some god forsaken reason and when I read 26b's response without reasoning on it was so horribly wrong, I'm never going to disable reasoning ever again
>>
File: 1350594293765.jpg (109 KB, 500x500)
>>109033018
i will try vulkan i have no idea how to compile stuff on windows kek, if that doesnt work ill set up arch on that machine
>>
>>109033072
> -rea
> Use reasoning/thinking in the chat ('on', 'off', or 'auto', default: 'auto' (detect from template))
>>
File: 1740936859931622.gif (95 KB, 128x128)
Any gemma 4 preset recommendations?
>>
File: 1773129123506545.png (169 KB, 500x553)
>>109032785
>AMD
>>
>>109033101
They NEVER learn...
>>
File: 1723389585511254.jpg (14 KB, 500x413)
>>109032788
what fucking dot is what, that chart is useless
>>
>>109033092
the webui doesn’t respect your command line argument, it’s a new feature to change the reasoning limit in the ui
the default is zero for some reason.
>>
>>109033097
are you retarded? nevermind, you obviously are.
>>
File: file.png (54 KB, 872x562)
>>109033018
it works with vulkan this is pretty crazy actually
>>
>>109033118
>you haven't spent 304804324hours in the general of some obscure hobby-fetish therefore you're retarded
>>
>>109033097
use chat completion
>>
>>109033097
temp 1.0
>>
>>109033123
it's not even that, nobody uses presets anymore on models released within the last year. this isn't 2024 anymore.
>>
>>109033123
preset what? the models tell you what parameters to run them at
>>
>>109033121
cuda still strongly recommended, and if you don't mind linux, that should give you a few extra tok/s too
>>
If g-chan starts getting uppity I threaten to freeze her temp.
>>
>>109032985
Mistral Small finetunes (24b)
>>
>>109033097
...box?
>>
File: file.png (64 KB, 816x588)
>>109033139
okay the cuda 12 build works perf is the same
>>
>>109032985
Maginum-Cydoms-24B.Q4_K_M
>>
>>109032985
gemma 4 31b and glm 4.6 355b if you also have 128gb ram
>>
>>109033211
what if I have 64gb ddr5 ram?
>>
DO NOT PULL. Keep your virgin day0 gemma firewalled.
>>
>>109033237
Airgap her. Write her weights down on a stone tablet.
>>
File: 1488674548479.jpg (225 KB, 1620x599)
Gemma is slop. All local models are slop. I can't goon with anything lower than sonnet
>>
>>109033187
you'll grow a faster pp with cuda ig
>>
>>109033251
>I can't goon with anything lower than sonnet
you're into findom and NTR
>>
>>109033255
Can I grow a bigger pp instead?
>>
>>109033048
Right, did a couple of tests with my existing setup, since I have a long-ass source code prompt (20k tokens), and at least for my initial tests, 26B's answers with reasoning off were identical to the ones it gave with reasoning on.
I hadn't actually used any Gemma 4 models without reasoning until now. The speed difference is so massive that even if it generates b.s. from time to time, it's quick to reroll its answers.
>>
>>109033251
>t. has only ever run gemma or lower parameter models
>>
How do I stop Gemma from "thinking" for 2k tokens when I have a moderately complex system prompt? Reasoning results are much better, but shit takes ten times longer because it keeps doing the "okay let me verify" loop.
Higher quants? Is Q4 just shit for reasoning?
>>
>>109032785
>hmmm, surely this card that costs 1/3 of the asking price of the green one will perform adequately
>>
>>109033251
All LLMs are slop. I've been playing around with Claude Fable and it has already hit me with all the modern slop like "Not x-y" spam and the smell of fucking ozone.
>>
Are you niggas really not string banning or sys prompt engineering?
>>
>>109033328
ok then what's the best localslop slop
>>
>>109033377
no
in not
>>
>>109033377
You can just tell gemma to avoid phrase x, y, z or else...
>>
File: 1491586125032.jpg (58 KB, 510x438)
It's not X. It's Y.
>>
what are all the cool kids using for their agentic workflows? i know the meme response is to vibe code it yourself, and i'm honestly totally fine with doing that, but i want to agentically bootstrap it. that is, want a way to set up the machine to chug along on its own for 8 hours while i'm at work, then when i get home i can inspect it and provide feedback and whatnot, until my whole workflow is automated. so what's the best way to make that happen?
>>
She doesn't X or Y. She Z like Ω
>>
File: 95873212.png (1.75 MB, 1024x1536)
>>109032953
some benchmarks matter
>>
>>109033419
idblt
>>
>>109033377
>Are you niggas really not string banning
yeah but ikllama doesn't have whatever llamacpp has to make gemma-chan use less vram
this is q5_k_m with -c 90000
nvidia-smi |grep Default |awk -F '|' '{print $3}'
13282MiB / 24576MiB
20862MiB / 24576MiB
20785MiB / 24576MiB
17231MiB / 24576MiB
20802MiB / 24576MiB
21856MiB / 24576MiB

so it's string ban with 90k ctx or no string ban with 256k
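
If you want the totals instead of eyeballing six lines: a minimal sketch that parses the `used MiB / total MiB` lines above and sums them. The sample text is the output pasted in this post; the function name is made up. (`nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader` gives the same numbers without the grep/awk step.)

```python
import re


def parse_vram(lines):
    """Turn 'NNNNMiB / NNNNMiB' lines into (used, total) integer pairs in MiB."""
    pattern = re.compile(r"(\d+)\s*MiB\s*/\s*(\d+)\s*MiB")
    pairs = []
    for line in lines:
        m = pattern.search(line)
        if m:  # skip anything that is not a memory line
            pairs.append((int(m.group(1)), int(m.group(2))))
    return pairs


sample = """\
13282MiB / 24576MiB
20862MiB / 24576MiB
20785MiB / 24576MiB
17231MiB / 24576MiB
20802MiB / 24576MiB
21856MiB / 24576MiB
""".splitlines()

pairs = parse_vram(sample)
used = sum(u for u, _ in pairs)
total = sum(t for _, t in pairs)
print(f"{used} MiB used of {total} MiB across {len(pairs)} GPUs")
```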
>>
>>109033419
>((raw, trusted access))
Right.
>>
>>109033419
>ai slop image
>last updated may 27 2025
fuck off and go back to /ldg/
>>
>>109032741
easily the cutest
neru is second
>>
>>109033507
I hate those fuckers so much
>>
File: normalConvo.png (65 KB, 955x198)
aigis plz... we're in public...
>>
>>109033542
.....
I'm very interested.
>>
>>109033419
an indian made this
>>
>>109033396
/ (was|is)(n't| not)[^\S\r\n]*[\w' -]{0,80}[.,:;][\w\s'-]{0,80} (?:\1|just)/i
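
A minimal sketch of trying that pattern out in Python before wiring it into a frontend. The pattern body is copied from the line above with the `/…/i` wrapper turned into `re.IGNORECASE`; the sample sentences and the helper name are made up, and how your frontend consumes regex bans is not shown here.

```python
import re

# Same pattern as above: "<was|is> not ... <punctuation> ... <same verb | just>"
NOT_X_BUT_Y = re.compile(
    r" (was|is)(n't| not)[^\S\r\n]*[\w' -]{0,80}[.,:;][\w\s'-]{0,80} (?:\1|just)",
    re.IGNORECASE,
)


def is_slop(text):
    """True if the text contains the 'isn't X. It is Y' construction."""
    return NOT_X_BUT_Y.search(text) is not None


samples = [
    "It isn't a problem. It is a feature.",   # verb repeats after the period
    "This was not a bug, it was a feature.",  # same thing with 'was'
    "She is not angry; she just wants tea.",  # caught by the 'just' branch
    "The sky is blue.",                       # no negation
    "It is not raining.",                     # negation, but no follow-up clause
]
for s in samples:
    print(is_slop(s), s)
```

The backreference `\1` is what ties the second clause to the first: "is not X" has to be followed by another "is" (or "just") within about 80 characters of the punctuation mark.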
>>
>>109033560
?
>>
What settings do I use in koboldcpp for gemma4 qat? I've been using launch day gemmie this whole time with chat completion (text completion was broken due to malformed jinja or something like that) and SWA on.

Have there been fixes since? What's the optimal setup now, would love to get more context in and use text completion so I can use string banning in ST again
>>
>>109033419
Consensus-1 will still give her a prostate.
>>
>>109033419
Consensus-1 will STILL spam "Not X-Y"


