/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>109026244 & >>109023085

►News
>(06/10) DiffusionGemma 26B-A4B released: https://blog.google/innovation-and-ai/technology/developers-tools/diffusion-gemma-faster-text-generation
>(06/09) Cohere releases North-Mini-Code-1.0: https://hf.co/CohereLabs/North-Mini-Code-1.0
>(06/07) llama : add Gemma4 MTP #23398 MERGED: https://github.com/ggml-org/llama.cpp/pull/23398
>(06/05) dots.tts 2B released: https://hf.co/rednote-hilab/dots.tts-soar

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://swe-rebench.com
Agentic Coding: https://deepswe.datacurve.ai
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
►Recent Highlights from the Previous Thread: >>109026244

--Comparing Gemma 4 12B, 26B, and 31B reasoning performance:
>109026649 >109026994 >109027048 >109027046 >109027063 >109027201 >109027298 >109027762 >109028167 >109028188 >109028202 >109028317 >109028326 >109028339 >109028389 >109028650 >109030540
--Optimizing Gemma 31B VRAM usage and performance on 24GB GPUs:
>109030630 >109030678 >109030707 >109031071 >109031098 >109030693 >109030702 >109030723 >109030727 >109030739 >109030753 >109030780 >109030840 >109030903
--Optimizing Hermes with local search tools:
>109029679 >109029688 >109029714 >109029838 >109029840 >109029855 >109029923 >109029934 >109029971 >109030523 >109030643 >109029868
--Exploiting LLM safety refusals to evade AI security scanners:
>109027080 >109027089 >109027104 >109027106
--Decoding base64 redacted reasoning in Moonshot Kimi models:
>109029974 >109029989 >109030056 >109030129 >109030174 >109030225 >109030064 >109030122 >109030150
--Hardware and budget recommendations for running Kimi-chan with high context:
>109031231 >109031457 >109031500 >109031541 >109031562 >109031564 >109031627 >109031645 >109031661 >109031646 >109031770
--Using custom think tags to steer Gemma 4 reasoning and prose:
>109027608 >109027617 >109027851 >109029176
--DiffusionGemma performance and token canvas implementation:
>109027336 >109027375 >109027404 >109027489 >109027519
--Comparing Gemini and Gemma models and discussing LLM architecture experiments:
>109028735 >109028748 >109029016 >109029723 >109029836 >109029945 >109030035
--Debating the widening gap between closed and open-weight models:
>109029222 >109029299 >109029320 >109031208
--Logs:
>109027403 >109027489 >109029688 >109029840 >109029974 >109030056 >109030174 >109030225 >109031668
--Rin, Miku, Teto (free space):
>109026417 >109026687 >109029201 >109029209 >109029283 >109029440

►Recent Highlight Posts from the Previous Thread: >>109026246

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>109032734 Cool pic
>>109032604
>Well they have the most vram for the price. And lmg told me vram is king
this tbqh famm
Pareto frontier models for speed to answer - quality tradeoff
Granite 4.0 350M
Qwen3 0.6B
Exaone 4.0 1.2B
MiniCPM5-1B
gpt-oss-20B (low thinking effort)
Longcat flash lite
gpt-oss-120B (low thinking effort)
Gemma 4 26BA4B
Qwen3.5 35BA3B
Qwen3.6 35BA3B
Gemma 4 26BA4B (thinking)
Qwen3.5 35BA3B (thinking)
Qwen3.6 35BA3B (thinking)
Minimax-M2.7 (thinking)
MiMo-V2.5-Pro (thinking)
Kimi K2.6 (thinking)
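For anyone unsure what "Pareto frontier" means for a list like the one above: a model is on the frontier if no other model is both at least as fast to answer and at least as good. A minimal sketch of that filter follows. The function name and the example numbers are made up for illustration and are not taken from the chart.

```python
def pareto_frontier(models):
    """Return the names of models not dominated on (time, quality).

    models: list of (name, seconds_to_answer, quality) tuples.
    Lower time is better, higher quality is better.
    """
    frontier = []
    best_quality = float("-inf")
    # Walk from fastest to slowest; on equal time, try the better model first.
    for name, seconds, quality in sorted(models, key=lambda m: (m[1], -m[2])):
        # A slower model only earns its place if it beats every faster one.
        if quality > best_quality:
            frontier.append(name)
            best_quality = quality
    return frontier


# Hypothetical numbers, purely illustrative.
example = [
    ("tiny", 1.0, 10.0),
    ("small-but-slow", 2.0, 8.0),   # slower AND worse than "tiny": dominated
    ("big", 30.0, 40.0),
    ("big-thinking", 300.0, 55.0),
]
print(pareto_frontier(example))
```

The list comes back ordered from fastest to slowest, so quality rises strictly along it.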
After anon recommended Gembrain, I finally tried it. It's good. It's not that different from the base model honestly, but that may be a good sign (so it's not overcooked). What I've subjectively felt is that it is slightly less intelligent than the base model in some contexts, but actually smarter in a few others. And it also has more pleasant writing IMO. So yeah I think it's worth using, at least for now. I may need to test it more. Additionally, I have not tried MTP. It's possible it does not work well with MTP, which would be unfortunate. Anyone have experience with that?
>>109032788 How very readable.
>>109032824 Not about gembrain specifically (I want to download and test it myself) but I tested a few gemma finetunes (meromero, impish, etc) and they all worked with MTP, no issues at all. Also your post made me more curious about gembrain, gonna download it right now.
>>109032788 Protip, use this
Leaving gemma alone with tools in a vibecoded agent harness without saying anything.
>>109032788 Could the test be fucked because the chat template was shit?
does web search really make them smarter?
>>109032872 No, because modern web is full of AI slop.
>>109032872 Yes. Each search adds +2 IQ points.
>>109032872 do web searches make you any smarter
>>109032881 i'm not a cute ai agent
>>109032890 What makes AI agent cute?
>>109032892 emoji
>>109032890 we can tell, you'd be smarter if you were one
>>109032788 How the fuck is qwen27B more intelligent than 31B
>>109032945 It's almost as if the benchmarks don't matter.
damn i wanted to try running gemma on my old titan x since it has enough vram but it doesnt work with llamacpp
I'm so happy bros, I can run 26B gemmy Q4 at 40 t/s on my old ass 2060 super!
So what's the best model for erp? I have 24 vram
>>109032985 gemma 31b or 12b if you want giant context
>>109032788
Didn't expect to see 4.7 Flash in the quadrant. Is it actually good then?
Also
>qwen3.5-9B > gemma12B???
>>109032962
>titan
>cuda 7.5
Unsurprising. Compile it yourself pointing at the old toolkit and hope for the best or use vulkan.
>>109032788 Look at the difference in intelligence between 26B reasoning and non-reasoning and look at the difference in compute. Also note that the compute axis is logarithmic whereas the intelligence is linear. I was right earlier. Don't use 26B with reasoning on. You barely get any benefit and if you need it to do something more complex, just throw it to 12B with reasoning.
>>109033048 Is dat right? I'll be damned...
>>109032874 If you create your own web search you can whitelist trusted resources. I think most mcp web search tools have that feature anyway. You can do it with duckduckgo-mcp-server so it doesn't fetch from slop.
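If you roll your own search tool, the whitelist idea can be as simple as dropping every result whose host is not on a list you trust. Rough sketch only: the domain list, the function names, and the result shape (dicts with a "url" key) are assumptions for illustration. This is not how duckduckgo-mcp-server or any other MCP tool actually configures it.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; swap in whatever sources you actually trust.
TRUSTED_DOMAINS = {"github.com", "arxiv.org", "wikipedia.org"}


def is_trusted(url, trusted=TRUSTED_DOMAINS):
    """True if the URL's host is a trusted domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    # Match on a dot boundary so "evilgithub.com" does not pass as "github.com".
    return any(host == d or host.endswith("." + d) for d in trusted)


def filter_results(results, trusted=TRUSTED_DOMAINS):
    """Keep only search results (dicts with a 'url' key) from trusted hosts."""
    return [r for r in results if is_trusted(r["url"], trusted)]


# Made-up search results to show the shape of the data.
results = [
    {"title": "llama.cpp", "url": "https://github.com/ggml-org/llama.cpp"},
    {"title": "Top 10 AI tools", "url": "https://seo-slop.example/top-10"},
    {"title": "Transformer", "url": "https://en.wikipedia.org/wiki/Transformer"},
]
print([r["title"] for r in filter_results(results)])
```

Filtering before the fetch step means the model never sees the page at all, which is cheaper than fetching and asking it to ignore slop.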
>>109033048 llama.cpp's webui comes with reasoning disabled by default for some god forsaken reason and when I read 26b's response without reasoning on it was so horribly wrong, I'm never going to disable reasoning ever again
>>109033018 i will try vulkan i have no idea how to compile stuff on windows kek, if that doesnt work ill set up arch on that machine
>>109033072
> -rea
> Use reasoning/thinking in the chat ('on', 'off', or 'auto', default: 'auto' (detect from template))
Any gemma 4 preset recommendations?
>>109032785
>AMD
>>109033101 They NEVER learn...
>>109032788 what fucking dot is what, that chart is useless
>>109033092
the webui doesn’t respect your command line argument, it’s a new feature to change the reasoning limit in the ui
the default is zero for some reason.
>>109033097 are you retarded? nevermind, you obviously are.
>>109033018 it works with vulkan this is pretty crazy actually
>>109033118
>you haven't spent 304804324 hours in the general of some obscure hobby-fetish therefore you're retarded
>>109033097 use chat completion
>>109033097 temp 1.0
>>109033123 it's not even that, nobody uses presets anymore on models released within the last year. this isn't 2024 anymore.
>>109033123 preset what? the models tell you what parameters to run them at
>>109033139 cuda still strongly recommended, and if you don't mind Linux, that should give you a few extra tok/s too
If g-chan starts getting uppity I threaten to freeze her temp.
>>109032985 Mistral Small finetunes (24b)
>>109033097 ...box?
>>109033139 okay the cuda 12 build works, perf is the same
>>109032985 Maginum-Cydoms-24B.Q4_K_M
>>109032985 gemma 4 31b and glm 4.6 355b if you also have 128gb ram