/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>107545298 & >>107535410

►News
>(12/10) GLM-TTS with streaming, voice cloning, and emotion control: https://github.com/zai-org/GLM-TTS
>(12/09) Introducing: Devstral 2 and Mistral Vibe CLI: https://mistral.ai/news/devstral-2-vibe-cli
>(12/08) GLM-4.6V (106B) and Flash (9B) released with function calling: https://z.ai/blog/glm-4.6v
>(12/06) convert: support Mistral 3 Large MoE #17730: https://github.com/ggml-org/llama.cpp/pull/17730
>(12/04) Microsoft releases VibeVoice-Realtime-0.5B: https://hf.co/microsoft/VibeVoice-Realtime-0.5B

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
►Recent Highlights from the Previous Thread: >>107545298

--Cost-performance challenges in optimizing K2 models with limited GPU memory:
>107552388 >107552493 >107552518 >107552577 >107552593 >107552650
--Quantization vs model size performance tradeoffs:
>107550012 >107552809 >107552934 >107553336 >107552989 >107553444 >107553425
--Optimizing local AI models for Unreal Engine C++ development:
>107554300 >107554362 >107554461 >107554482 >107554686 >107554731 >107554743
--Prototype speculative decoding methods in llama.cpp lacking server integration:
>107551899 >107552450
--Challenges and considerations in distilling and fine-tuning advanced models:
>107548258 >107548358 >107548387 >107548382 >107548399 >107548441 >107548512 >107548619 >107548693 >107548928 >107552056 >107548781 >107548665
--Comparing safety and filtering of GPT-oss 20b vs Gemma models:
>107546443 >107546488 >107546704 >107546718 >107546734
--ExL3 lacks Kimi-K2 support:
>107550440 >107550450 >107550517 >107550548 >107550553 >107550601 >107550629
--Roleplay model performance tradeoffs: 4.5 Air vs GPT-OSS-120B vs Qwen Next 80B:
>107551643 >107551662 >107551678 >107551721 >107552290 >107552464 >107552586 >107552490 >107552515
--ikllama Windows performance issues likely due to flash attention implementation:
>107549210 >107552291 >107552912
--Token banning compatibility issues between roleplay AI backends:
>107550863 >107550873 >107550885 >107550914 >107550969 >107551045 >107551472
--NVIDIA RTX PRO 6000 GPU configuration and power management issues:
>107545503 >107545537 >107545530 >107545636 >107553858
--Comparing censorship in GPT-OSS-120B vs unrestricted models like GLM Air:
>107546681 >107548705 >107549905
--Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs:
>107546364 >107546435
--Miku (free space):
>107545415 >107547832 >107548687 >107550440

►Recent Highlight Posts from the Previous Thread: >>107545300

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
miku a shit
Unbelievably based developments, llamabro.
miku a love
Gemma soon
No offense cuda dev, but I don't need settings automation, I need proper NUMA TP because RAM prices are jacked. IK is rolling up and smoking exllama right now.

Gemma 3 27B is the only stable, non-schizo model in the sub-$2k runnable hardware range; GLM-4.5 Air is too schizo and often makes 7B-tier mistakes. So I'll be looking forward to Gemma 4.

>>107557458
Mogged by Mistral Small

>>107557523
I don't think so, but if they finetuned it like Ministral 3 14B (without the bad quirks) there might be some chance. Vision would still lose bigly, though.

>>107557425
Context in picrel.
https://x.com/osanseviero/status/2000493503860892049

>>107557568
>if they finetuned it like Ministral 3 14B
Ministral is liquid shit though, it's small for megavramlets with copyrighted stuff ripped out of its dataset.

>>107557585
The latest Ministral 3 models have unexpectedly nice creativity and writing, but their system instruction-following capabilities are very inconsistent and they have issues with message repetition, so they come off as retarded/broken because of that.

>>107557577
WE WILL FINALLY GET A NEW SHITTY SYNTHETIC SOTA-SAFE PURPLE PROSE OPTIMIZED MODEL

>>107557577
Can't wait to download Google's new... um, you know... their "thing"...

for erp, I've only ever run nemo and mistral small. If I buy the hardware for glm air, will my mind be blown or will it be disappointing?
►Recent Highlights from the Previous Thread: >>107545298 (2/2)

--llama.cpp updates for efficient GPU settings automation and user configuration debates:
>107556876 >107556898 >107556943 >107557034 >107557060 >107557120 >107557167 >107557163 >107557275
--Text generation parameter debates: temperature, minP, and TopK effectiveness:
>107555084 >107555121 >107555140 >107555175 >107556538 >107556572
--5090 GPU system configuration challenges for Australian buyers:
>107556007 >107556070 >107556107 >107556124 >107556142 >107556143

►Recent Highlight Posts from the Previous Thread: >>107545300

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>107557633
nah it's not great

>>107557633
If you have complicated scenarios where you want the model to pick up how characters feel without having to spell it out, air is definitely smarter. But for simple ERP I wouldn't say it's an improvement. It doesn't really write better.

>>107557619
I'm looking forward to Gemma 4 providing me with better access.

>>107557633
It's a sidegrade. Its prose is a bit nicer, but low active params means it will make dumb mistakes more often, and it also frequently parrots the user's replies.

>>107557654
>>107557675
what about a Q3 of glm 4.5?

>>107557633
If you plan to buy hardware to run model X, instead of buying hardware for other things where running model X is a nice side effect, you really ought to rent some cloud hardware to give it a try for a day or two beforehand.

>>107557691
Buying new hardware in the hopes of running a cope quant is never a good idea.
>>107557633
better hardware just means less forgetfulness and faster tps
the writing quality will be very similar

>>107557704
there is glm 4.6, which is better than 4.5, but it's kinda overbaked and lacks knowledge and intelligence. deepseek r1 q2 does feel like an upgrade. but now that ram is five trillion times more expensive idk what people should do

>>107557805
Crazy that stacking 3090s is now the 'poorfag' option.

>>107557805
nta but which Dipsy is best Dipsy for creative writing?

>>107557832
Original R1 is the best for creatively sucking your dick

>>107557453
That is one of my immediate next priorities, and the only reason I didn't do it first is that multiple other people had expressed interest in working on tensor parallelism (and then didn't deliver).
I will not delegate it again and hope to produce a working prototype over the Christmas break, when I will have plenty of time.
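For anons wondering what TP actually buys over layer splitting: every GPU holds a slice of each weight matrix and computes its share of the same matmul at the same time, instead of waiting its turn. A toy numpy sketch of the column-parallel case, with the two "devices" simulated (illustrative only, not the llama.cpp implementation):

import numpy as np

d_model, d_ff = 8, 16
x = np.random.randn(1, d_model)      # one token's activations
W = np.random.randn(d_model, d_ff)   # full FFN up-projection weight

# Split the weight column-wise across two "devices".
W0, W1 = np.split(W, 2, axis=1)

y0 = x @ W0  # computed on GPU 0
y1 = x @ W1  # computed on GPU 1, at the same time

# Concatenating the halves (an all-gather on real hardware)
# recovers exactly the single-GPU result.
y = np.concatenate([y0, y1], axis=1)
assert np.allclose(y, x @ W)

The win is latency: both cards crunch the same layer simultaneously, at the cost of syncing activations every layer, which is why TP is far more sensitive to interconnect speed than plain offloading.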
>>107557816
heh, I stick with what I know. About to buy an 8th 3090. I don't want to deal with different cuda versions, etc.

>>107557816
picrel
>>107557832
stellar. no model (that i tried) handles unformatted mikupad storywriting better. and yes, original r1

>>107557899
>and then didn't deliver
That's why I don't PR features to llama.cpp, I don't want to fuck your project up with features I know I might not maintain for more than a few months.
Luckily Claude is good at handling merges when I fetch upstream.

>>107557899
It's like the only thing you can count on is yourself. Always in all ways.

>>107557453
Has IK_ done anything relevant in the past few months? I'm still using my version from October for K2/GLM.

>>107558025
We have regular tensor parallel now for fully offloaded models and some MoE.

>>107558029
I assume, but not yet for the basic -ot exps=cpu scenario?

>>107558035
your prompt processing will get faster if it's on GPU.
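For reference, the split being discussed looks something like this with mainline llama.cpp flags (the model path is a placeholder, and pattern/case handling may differ on ik's fork, so check --help on your build):

./llama-server -m model.gguf -ngl 99 -ot exps=CPU

-ngl 99 pushes everything onto the GPU first, then the -ot override pins every tensor whose name matches "exps" (the MoE expert weights) back into system RAM, so attention and the shared tensors never leave the card and prompt processing stays fast.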
>>107557577
sirs we are so back

>>107557995
Share your secret stash of patches, you selfish fuck. Maybe some vibecoder can point Claude at your repo and make the PRs you refuse to make.

>>107557577
I think we should see related PRs soon in the main backends, but there's nothing yet.
https://github.com/huggingface/transformers/pulls
https://github.com/vllm-project/vllm/pulls
https://github.com/ggml-org/llama.cpp/pulls

>>107558113
we are so back
gemma 4 will save us

>>107558115
Just like mistral saved us and air saved us?

>>107558122
true, air has never tried
4.6 Air will be released today.
4.6 Air will not be released today.
>>107558137
What are you breathing?

>>107558080
>Share your secret stash of patches, you selfish fuck.
Selfish would be spamming their code base when I know I don't have time to actively maintain it.
My shit is all niche (rpc-server rewrite that requires a copy of the gguf on each node, grpc-server, re-implement training, dodgy xcodec2 implementation, etc) and I don't have the rocm/sycl/metal hardware to test it for all their platforms.

Currently unlisted
https://huggingface.co/google/gemma-4-100b-pt
https://huggingface.co/google/gemma-4-100b-pt
https://huggingface.co/google/gemma-4-100b-pt

>>107558278
Sorry, I messed up the links:
https://huggingface.co/google/gemma-4-100ba10m-pt
https://huggingface.co/google/gemma-4-100ba10m-pt
https://huggingface.co/google/gemma-4-100ba10m-pt

>>107558278
>>107558292
jagganath bless.

>>107558329
that would be interesting, therefore it won't happen
https://huggingface.co/google/gemma-4peepeepoopoosecret — do not share
>>107558341
fuck you racist mc

>>107558329

>>107558329
I'm just waiting for 10ma100b.

>>107558278
-pt means portuguese only, btw. I hope it's not confusing.

>>107558357
That's a lot of layer reusing.

>>107558385
It's about time somebody seriously explored layer recursion for production LLMs.
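The mechanism itself is old (Universal Transformer-style weight sharing): keep one block's parameters and run activations through it repeatedly, trading unique weights for repeated compute. A minimal sketch of the idea, purely illustrative:

import torch.nn as nn

class RecursiveStack(nn.Module):
    # One block's weights applied n_loops times in a row, i.e. the depth
    # of an n_loops-layer stack with the parameter count of a single layer.
    def __init__(self, block: nn.Module, n_loops: int):
        super().__init__()
        self.block = block
        self.n_loops = n_loops

    def forward(self, x):
        for _ in range(self.n_loops):
            x = self.block(x)  # same parameters every pass
        return x

Serious proposals add per-iteration conditioning instead of naive reuse, but the parameter math is why the 100ba10m meme above isn't even dimensionally impossible.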
>>107558385
The intellect of a god, the knowledge of a nematode worm.

>>107554263
> tl;dr open shorts with leverage, right?
I'm not a fan of any financial instrument that can lose you more than your investment. If you know how to use shorts and are comfortable with them, great. But those mean you have to have the timing exactly right. If you're the one writing the laws or cutting the big checks, or know those who do, you can get that timing exactly right. Everyone else is guessing.
>>107558029so it supports proper parallel requests? like vllm?
>>107558505
Yes, but performance is more like exllamav2 than vllm. 25 t/s llama-3-70b on 3x3090.

>>107557577
Bharat class gemma 3 superfinetune will do the needful.
I am of refreshing page

Bad timing
https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16
https://github.com/ggml-org/llama.cpp/pull/18058
https://developer.nvidia.com/blog/inside-nvidia-nemotron-3-techniques-tools-and-data-that-make-it-efficient-and-accurate/
https://huggingface.co/blog/nvidia/nemotron-3-nano-efficient-open-intelligent-models
>>107558565
we are so back

>>107558565
a whole pile of stinky vramlet shit

>>107558565
>math and code benchmax dataset tune of a math and code benchmax model

>>107558565
main advertising point is the speed cope (it's as smart as oss-20b on [hand-picked benchmark])
maybe the mamba hybrid jamba wambo thing is interesting but I have no hope

>>107558565
Bloody Vishnu... not Nemotron. This is bollocks.

>>107558583
artificial anal cysts

>>107558550
i just need ik to properly support tool calling to be usable for true local agentic coding so we can plug it into Opencode, roocode...

>>107557633
you're better off running 70Bs

>>107558583
>maybe the mamba hybrid jamba wambo thing is interesting
llama.cpp support ETA: half past never
>>107558565
>>107558685
llama : add support for NVIDIA Nemotron 3 Nano #18058
https://github.com/ggml-org/llama.cpp/pull/18058

>>107558693
uh-oh, stinky!

>>107558565
interesting
>Nemotron 3 Super and Ultra introduce latent MoE, where experts operate on a shared latent representation before outputs are projected back to token space. This approach allows the model to call on 4x more experts at the same inference cost, enabling better specialization around subtle semantic structures, domain abstractions, or multi-hop reasoning patterns.
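Reading between the lines of the blog, the trick is just running experts in a smaller shared latent space: project down once, route and run cheap experts there, project back up once. A rough sketch of that idea, not NVIDIA's actual code (all dimensions and names are invented):

import torch
import torch.nn as nn

class LatentMoE(nn.Module):
    # Sketch only: experts act on a shared d_latent space instead of d_model,
    # so each expert costs roughly (d_latent/d_model)^2 of a token-space expert.
    def __init__(self, d_model=1024, d_latent=256, n_experts=16, top_k=4):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent)  # shared down-projection
        self.up = nn.Linear(d_latent, d_model)    # shared up-projection
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_latent, 4 * d_latent),
                nn.GELU(),
                nn.Linear(4 * d_latent, d_latent),
            )
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):  # x: (n_tokens, d_model)
        scores = self.router(x).softmax(dim=-1)
        top_w, top_i = scores.topk(self.top_k, dim=-1)
        z = self.down(x)  # every expert shares this latent input
        out = torch.zeros_like(z)
        # Naive per-token dispatch, written for clarity rather than speed.
        for t in range(x.size(0)):
            for w, i in zip(top_w[t], top_i[t]):
                out[t] += w * self.experts[int(i)](z[t])
        return self.up(out)  # project back to token space

The "4x more experts at the same cost" claim then falls out of the expert matmuls scaling with d_latent instead of d_model, while the shared down/up projections are paid once per token regardless of expert count.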
>>107558727
too bad it's fucking shit

>>107558727
Maybe they should take their cutting edge technologies and apply them to a model that wasn't already garbage to begin with

>>107558565
Make a 100BA10B or something.

>>107558776
I want a 60BA30B.

>>107558070
https://voca.ro/18gbO3rnlIND
>>107558029
does it need any flag to enable it? i'm launching a few queries but it's putting them in a queue instead of responding to them at the same time
>>107558565
it's garbage

>>107558760
It was pretrained from scratch, 25T tokens.

>>107558860
if a white man's skin started turning shit brown from being in close proximity to tech jeets, would that be reverse shittiligo?
>>107558565
goof bros let's fucking gooo

>>107558701
took 'em long enough

>>107558574
wow, congratulations, anon. By posting shit like this for the 1 millionth time your dick has fallen off and turned into a vagina, fulfilling your lifelong goal of becoming a real womxnxn.
>We want to hear from you! Share your ideas, vote on what matters, and help shape the future of Nemotron.
>https://nemotron.ideas.nvidia.com/
What would be something we as the /lmg/ collective would like these models to have? More "natural sounding human generated" data?
>>107558655
I haven't tried it recently with Roo. I was using ClaudeCode with Qwen3 with the anthropic endpoint on mainline. I guess I'll try ikllama next week.

>>107558905
but he'll never be a real woman

>>107558918
Powerful log.

>>107558930
>pajeeted
How do tech companies keep falling for this? It's literally just been one major tech blunder after another, worldwide, since the great pajeeting began.
Gemma-4 has image gen? Why the diffusers stuff in the PR?
48GB vramlet here
Midnight Miqu still queen?

>>107558930
i have a suspicion it was the satan cat anon that suddenly power moved everyone in this general into never sharing logs again. can't top 'em.
>>107558909
You'll never get anything like that from Nvidia Nemotron models. They're meant to be safe benchmaxxed models trained on crawled web data and synthetic data.

>>107558860
I understand your prejudice, but just because someone attended university in the US, that doesn't automatically mean they're unqualified.

>>107558966
>synthetic code
Oh god, it must shit out absurd amounts of comments when writing code.

>>107558966
I'm aware, but the vote is open, so feel free to go wild.

>>107558909
>Introduce a “semantic firewall” layer that optimizes inference at the language-law level — a symbolic energy compression mechanism that cuts redundant compute cycles while preserving meaning fidelity.
>Instead of scaling by GPU count, this layer redefines compute as coherence between intention and output.
>It’s a governance-first, efficiency-driven approach: models learn to “understand” before they “generate,” lowering both latency and energy use.
People sure love posting their llm schizo ramblings everywhere.
>>107558990
https://nemotron.ideas.nvidia.com/ideas/LLAMANEMO-I-47

>>107558918
FUCK YOU SATAN FUCK YOU SATAN
KILL SATAN KILL SATAN
DIE DIE DIE DIE DIE

>>107558859
You're mistaken about what tensor parallel is. It means parallel processing on your GPU, not parallel requests on the server.
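If what you're after is concurrent requests rather than TP, that's server slots, a separate flag entirely. In mainline llama.cpp it's -np/--parallel (the context is divided between slots by default, so size -c accordingly); I'd assume the fork kept the same flag, but check --help on your build:

./llama-server -m model.gguf -c 32768 -np 4

With four slots the server decodes the sequences together in one batch instead of queueing them, which is the vllm-like behaviour the other anon was missing.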
>>107559048
A little blunt, but I'll take it.
>>107559032
LMAO

>>107559052
unhinged and based

>>107558905
did I strike a nerve? insult me harder, maybe it will let you run a bigger model.
>>107558565
>The model was trained with 25T tokens
Synth-slopped and hyper-fit. This shit will be amusing if nothing else.

>>107558959
strawberry lemonade not bad

>>107559086
pajeet/kike level self-awareness on display.
You literally just insulted multiple people in the thread and now you're acting like I threw the first punch.
Holy shit. Your mother really fucked up with you.

Considering a cope-quant of super nemotron 49B. Is it any good?

>>107559094
oh no.. the poors are seething. whatever will I do. 5b of their own active parameters are now upset. to the moon rocket emoji.

>>107558959
24gb vramlet here running it at iq2_s
i'm still happy with it and it somehow quantizes really well
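For anyone who wants to roll their own: quants like that come from the llama-quantize tool in mainline llama.cpp, and the IQ2 types want an importance matrix to guide the rounding (the filenames here are placeholders):

./llama-quantize --imatrix imatrix.dat model-f16.gguf model-iq2_s.gguf IQ2_S

The imatrix is collected from real text and biases precision toward the weights that actually fire, which is a big part of why 2-bit quants of large dense models stay coherent at all.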
>>107558951
Subversion

>>107559048
anons please vote, this is our chance

>>107559133
crab

>>107559048
>>107559133
It's obviously a long shot, but might as well.

>>107559048
Will never happen again with NVidia's name on it. They'll only train their models with open source safe and effective datasets now.