/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>107545298 & >>107535410

►News
>(12/10) GLM-TTS with streaming, voice cloning, and emotion control: https://github.com/zai-org/GLM-TTS
>(12/09) Introducing: Devstral 2 and Mistral Vibe CLI: https://mistral.ai/news/devstral-2-vibe-cli
>(12/08) GLM-4.6V (106B) and Flash (9B) released with function calling: https://z.ai/blog/glm-4.6v
>(12/06) convert: support Mistral 3 Large MoE #17730: https://github.com/ggml-org/llama.cpp/pull/17730
>(12/04) Microsoft releases VibeVoice-Realtime-0.5B: https://hf.co/microsoft/VibeVoice-Realtime-0.5B

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
►Recent Highlights from the Previous Thread: >>107545298

--Cost-performance challenges in optimizing K2 models with limited GPU memory:
>107552388 >107552493 >107552518 >107552577 >107552593 >107552650
--Quantization vs model size performance tradeoffs:
>107550012 >107552809 >107552934 >107553336 >107552989 >107553444 >107553425
--Optimizing local AI models for Unreal Engine C++ development:
>107554300 >107554362 >107554461 >107554482 >107554686 >107554731 >107554743
--Prototype speculative decoding methods in llama.cpp lacking server integration:
>107551899 >107552450
--Challenges and considerations in distilling and fine-tuning advanced models:
>107548258 >107548358 >107548387 >107548382 >107548399 >107548441 >107548512 >107548619 >107548693 >107548928 >107552056 >107548781 >107548665
--Comparing safety and filtering of GPT-oss 20b vs Gemma models:
>107546443 >107546488 >107546704 >107546718 >107546734
--ExL3 lacks Kimi-K2 support:
>107550440 >107550450 >107550517 >107550548 >107550553 >107550601 >107550629
--Roleplay model performance tradeoffs: 4.5 Air vs GPT-OSS-120B vs Qwen Next 80B:
>107551643 >107551662 >107551678 >107551721 >107552290 >107552464 >107552586 >107552490 >107552515
--ikllama Windows performance issues likely due to flash attention implementation:
>107549210 >107552291 >107552912
--Token banning compatibility issues between roleplay AI backends:
>107550863 >107550873 >107550885 >107550914 >107550969 >107551045 >107551472
--NVIDIA RTX PRO 6000 GPU configuration and power management issues:
>107545503 >107545537 >107545530 >107545636 >107553858
--Comparing censorship in GPT-OSS-120B vs unrestricted models like GLM Air:
>107546681 >107548705 >107549905
--Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs:
>107546364 >107546435
--Miku (free space):
>107545415 >107547832 >107548687 >107550440

►Recent Highlight Posts from the Previous Thread: >>107545300
Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
miku a shit
Unbelievably based developments, llamabro.
miku a love
Gemma soon
No offense cuda dev but I didn't need settings automation, I need proper numa TP because ram prices are jacked. IK is rolling up and smoking exllama right now.
Gemma 3 27B is the only stable, non-schizo model in the sub $2k runnable hardware range, GLM-4.5 Air is too schizo and often makes 7B tier mistakes. So I'll be looking forward to Gemma 4.
>>107557458Mogged by Mistral Small
>>107557523I don't think so, but if they finetuned it like Ministral 3 14B (without the bad quirks) there might be some chance. Vision would still lose bigly, though.
>>107557425Context in picrel.
https://x.com/osanseviero/status/2000493503860892049
>>107557568
>if they finetuned it like Ministral 3 14B
Ministral is liquid shit though, it's small for megavramlets with copyrighted stuff ripped out of its dataset.
>>107557585The latest Ministral 3 models have unexpectedly nice creativity and writing, but their system instruction-following capabilities are very inconsistent and they have issues with message repetition, so they come off as retarded/broken because of that.
>>107557577WE WILL FINALLY GET NEW SHITTY SYNTHETIC SOTA-SAFE PURPLE PROSE OPTIMIZED MODEL
>>107557577Can't wait to download Google's new... um, you know... their "thing"...
for erp, I've only ever run nemo and mistral small. If I buy the hardware for glm air, will my mind be blown or will it be disappointing?
►Recent Highlights from the Previous Thread: >>107545298

(2/2)

--llama.cpp updates for efficient GPU settings automation and user configuration debates:
>107556876 >107556898 >107556943 >107557034 >107557060 >107557120 >107557167 >107557163 >107557275
--Text generation parameter debates: temperature, minP, and TopK effectiveness:
>107555084 >107555121 >107555140 >107555175 >107556538 >107556572
--5090 GPU system configuration challenges for Australian buyers:
>107556007 >107556070 >107556107 >107556124 >107556142 >107556143

►Recent Highlight Posts from the Previous Thread: >>107545300
Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>107557633nah it's not great
>>107557633If you have complicated scenarios where you want the model to pick up how characters feel without having to spell it out, air is definitely smarter. But for simple ERP I wouldn't say it's an improvement. It doesn't really write better.
>>107557619I'm looking forward to Gemma 4 providing me with better access.
>>107557633It's a sidegrade. Its prose is a bit nicer but low active params means it will make dumb mistakes more often, and it also frequently parrots user's replies.
>>107557654>>107557675what about a Q3 of glm 4.5?
>>107557633If you plan to buy hardware specifically to run model X, rather than buying hardware for other things where running model X is a nice side effect, you really ought to rent some cloud hardware to give it a try for a day or two beforehand.
>>107557691Buying new hardware in the hopes of running a cope quant is never a good idea.
>>107557633better hardware just means less forgetfulness and faster tps
the writing quality will be very similar
>>107557704there is glm 4.6 which is better than 4.5, but it's kinda overbaked and lacks knowledge and intelligence. deepseek r1 q2 does feel like an upgrade. but now that ram is five trillion times more expensive idk what people should do
>>107557805Crazy that stacking 3090s is now the 'poorfag' option.
>>107557805nta but which Dipsy is best Dipsy for creative writing?
>>107557832Original R1 is the best for creatively sucking your dick
>>107557453That is one of my immediate next priorities and the only reason I didn't do it first is that multiple other people had expressed interest in working on tensor parallelism (and then didn't deliver).
I will not delegate it again and hope to produce a working prototype over the Christmas break when I will have plenty of time.
>>107557816heh, I stick with what I know. About to buy an 8th 3090. I don't want to deal with different cuda versions, etc
>>107557816
picrel
>>107557832
stellar. no model (i tried) handles unformatted mikupad storywriting better. and yes, original r1
>>107557899
>and then didn't deliver
That's why I don't PR features to llama.cpp, I don't want to fuck your project up with features I know I might not maintain for more than a few months.
Luckily Claude is good at handling merges when I fetch upstream.
>>107557899It's like the only thing you can count on is yourself. Always in all ways.
>>107557453Has IK_ done anything relevant in the past few months? I'm still using my version from october for K2/GLM.
>>107558025We have regular tensor parallel now for fully offloaded models and some MoE.
>>107558029I assume so, but not yet for the basic -ot exps=cpu scenario?
>>107558035your prompt processing will get faster if it's on GPU.
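For anyone who hasn't set up that split before, a minimal sketch of the usual invocation (the model path and context size are placeholders, not something from the thread): the -ot/--override-tensor flag exists in both ik and mainline, and newer mainline builds also have a --cpu-moe shorthand if I remember right.
llama-server -m /path/to/moe-model.gguf -ngl 99 -c 32768 -ot "exps=CPU"
-ngl 99 puts every layer on the GPU first, then the override rule matches any tensor whose name contains "exps" (the routed expert weights) and keeps those in system RAM, which is why prompt processing still runs at GPU speed like the anon above says.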
>>107557577sirs we are so back
>>107557995Share your secret stash of patches, you selfish fuck. Maybe some vibecoder can point Claude at your repo and make the PRs you refuse to make.
>>107557577I think we should see related PRs soon in the main backends, but there's nothing yet.
https://github.com/huggingface/transformers/pulls
https://github.com/vllm-project/vllm/pulls
https://github.com/ggml-org/llama.cpp/pulls
>>107558113we are so back
gemma 4 will save us
>>107558115Just like mistral saved us and air saved us?
>>107558122true air has never tried
4.6 Air will be released today.
4.6 Air will not be released today.
>>107558137What are you breathing?
>>107558080
>Share your secret stash of patches, you selfish fuck.
Selfish would be spamming their code base when I know I don't have time to actively maintain it.
My shit is all niche (rpc-server rewrite that requires a copy of the gguf on each node, grpc-server, re-implement training, dodgy xcodec2 implementation, etc) and I don't have the rocm/sycl/metal hardware to test it for all their platforms.
Currently unlisted
https://huggingface.co/google/gemma-4-100b-pt
https://huggingface.co/google/gemma-4-100b-pt
https://huggingface.co/google/gemma-4-100b-pt
>>107558278Sorry. I messed up the links
https://huggingface.co/google/gemma-4-100ba10m-pt
https://huggingface.co/google/gemma-4-100ba10m-pt
https://huggingface.co/google/gemma-4-100ba10m-pt
>>107558278>>107558292jagganath bless. .
>>107558292that would be interesting, therefore it won't happen
https://huggingface.co/google/gemma-4peepeepoopoosecret — do not share
>>107558341fuck you racist mc
>>107558329
>>107558329I'm just waiting for 10ma100b.
>>107558278-pt means portuguese only, btw. I hope it's not confusing.
>>107558357That's a lot of layer reusing.
>>107558385It's about time somebody seriously explored layer recursion for production LLMs.
>>107558385The intellect of a god, the knowledge of a nematode worm.
>>107554263
> tl;dr open shorts with leverage, right?
I'm not a fan of any financial instrument that can lose you more than your investment. If you know how to use shorts and are comfortable with them, great. But those mean you have to have the timing exactly right. If you're the one writing the laws or cutting the big checks, or know those who do, you can get that timing exactly right. Everyone else is guessing.
>>107558029so it supports proper parallel requests? like vllm?
>>107558505Yes, but performance is more like exllamav2 than vllm. 25 t/s llama-3-70b on 3x3090.
>>107557577Bharat class gemma 3 superfinetune will do the needful.
I am of refreshing page
Bad timing
https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16
https://github.com/ggml-org/llama.cpp/pull/18058
https://developer.nvidia.com/blog/inside-nvidia-nemotron-3-techniques-tools-and-data-that-make-it-efficient-and-accurate/
https://huggingface.co/blog/nvidia/nemotron-3-nano-efficient-open-intelligent-models
>>107558565we are so back
>>107558565a whole pile of stinky vramlet shit
>>107558565>math and code benchmax dataset tune of a math and code benchmax model
>>107558565main advertising point is the speed cope (it's as smart as oss-20b on [hand-picked benchmark])
maybe the mamba hybrid jamba wambo thing is interesting but I have no hope
>>107558565Bloody Vishnu... not Nemotron. This is bollocks.
>>107558583artificial anal cysts
>>107558550i just need ik to properly support tool calling to be usable for true local agentic coding so we can plug it into Opencode, roocode...
>>107557633you're better off running 70Bs
>>107558583
>maybe the mamba hybrid jamba wambo thing is interesting
llama.cpp support ETA: half past never
>>107558565
>>107558685llama : add support for NVIDIA Nemotron 3 Nano #18058
https://github.com/ggml-org/llama.cpp/pull/18058
>>107558693uh-oh, stinky!
>>107558565interesting
>Nemotron 3 Super and Ultra introduce latent MoE, where experts operate on a shared latent representation before outputs are projected back to token space. This approach allows the model to call on 4x more experts at the same inference cost, enabling better specialization around subtle semantic structures, domain abstractions, or multi-hop reasoning patterns.
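Since the blog post quoted above doesn't come with code, here's a minimal sketch of what "experts operate on a shared latent representation" presumably means; every dimension, name and routing detail below is my own assumption for illustration, not Nemotron's actual layer:

import torch
import torch.nn as nn

class LatentMoE(nn.Module):
    # all sizes are made up for illustration
    def __init__(self, d_model=1024, d_latent=256, n_experts=32, top_k=8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # routing still looks at the full hidden state
        self.down = nn.Linear(d_model, d_latent)     # shared projection into the latent space
        self.up = nn.Linear(d_latent, d_model)       # shared projection back to token space
        # each expert is a small MLP over d_latent instead of d_model, so it's much cheaper
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_latent, 4 * d_latent), nn.GELU(), nn.Linear(4 * d_latent, d_latent))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):  # x: (n_tokens, d_model)
        weights, idx = self.router(x).softmax(dim=-1).topk(self.top_k, dim=-1)
        z = self.down(x)  # computed once per token, shared by all selected experts
        out = torch.zeros_like(z)
        for slot in range(self.top_k):               # naive loops; real kernels batch this
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(z[mask])
        return x + self.up(out)  # back to token space plus residual

x = torch.randn(16, 1024)
print(LatentMoE()(x).shape)  # torch.Size([16, 1024])

The point of the sketch is the cost claim in the quote: because d_latent is a fraction of d_model, each expert is cheap enough that you can route every token to several times more experts for roughly the same FLOPs as a conventional MoE working at full hidden width.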
>>107558727too bad it's fucking shit
>>107558727Maybe they should take their cutting edge technologies and apply it to a model that wasn't already garbage to begin with
>>107558565Make a 100A10b or something.
>>107558776I want a 60BA30B.
>>107558070https://voca.ro/18gbO3rnlIND
>>107558029does it need any flag to enable it? i'm launching a few queries but it's putting them in a queue instead of responding to both at the same time
>>107558565it's garbage
>>107558760It was pretrained from scratch, 25T tokens.
>>107558860if a white man's skin started turning shit brown from being in close proximity of tech jeets, would that be reverse shittiligo?
>>107558565goof bros let's fucking gooo
>>107558701took 'em long enough
>>107558574wow, congratulations, anon. By posting shit like this for the 1 millionth time your dick has fallen off and turned into a vagina, fulfilling your lifelong goal of becoming a real womxnxn.
>We want to hear from you! Share your ideas, vote on what matters, and help shape the future of Nemotron.
>https://nemotron.ideas.nvidia.com/
What would be something we as the /lmg/ collective would like these models to have?
More "natural sounding human generated" data?
>>107558655I haven't tried it recently with Roo. I was using ClaudeCode with Qwen3 with the anthropic endpoint on mainline. I guess I'll try ikllama next week.
>>107558905but he'll never be a real woman
>>107558918Powerful log.
>>107558860>pajeetedHow do tech companies keep falling for this? It's literally just been one major tech blunder after another, worldwide, since the great pajeeting began.
Gemma-4 has image gen? Why the diffusers stuff in the PR?
48GB vramlet here
Miqumidnight still queen?
>>107558930i have a suspicion it was the satan cat anon that suddenly power moved everyone in this general into never sharing logs again. can't top 'em.
>>107558909You'll never get anything like that from Nvidia Nemotron models. They're meant to be safe benchmaxxed models trained on crawled web data and synthetic data.
>>107558860I understand your prejudice, but just because someone attended university in the US that doesn't automatically mean they're unqualified.
>>107558966
>synthetic code
Oh god it must shit out absurd amounts of remarks when writing code.
>>107558966I'm aware, but the vote is open, so feel free to go wild.
>>107558909
>Introduce a “semantic firewall” layer that optimizes inference at the language-law level — a symbolic energy compression mechanism that cuts redundant compute cycles while preserving meaning fidelity.
>Instead of scaling by GPU count, this layer redefines compute as coherence between intention and output.
>It’s a governance-first, efficiency-driven approach: models learn to “understand” before they “generate,” lowering both latency and energy use.
People sure love posting the llm schizo ramblings everywhere.
>>107558990
https://nemotron.ideas.nvidia.com/ideas/LLAMANEMO-I-47
>>107558918
FUCK YOU SATAN FUCK YOU SATAN
KILL SATAN KILL SATAN
DIE DIE DIE DIE DIE
>>107558859You're mistaking what tensor parallel is. It means splitting the work for a single request across your GPUs, not handling parallel requests on the server.
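And if what the question above actually wants is concurrent requests rather than TP, that's the slot count on the server side. Assuming ik kept the mainline llama-server flags, something like this (model path and numbers are placeholders):
llama-server -m /path/to/model.gguf -ngl 99 -c 65536 -np 4
-np / --parallel 4 gives four decode slots so requests aren't queued behind each other, with the caveat that on most builds the context is split across slots, so each request effectively gets about 65536/4 tokens.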
>>107559048A little blunt, but I'll take it.
>>107559032LMAO
>>107559052unhinged and based
>>107558905did I strike a nerve? insult me harder, maybe it will let you run a bigger model.
>>107558565
>The model was trained with 25T tokens,
Synth-slopped and hyper-fit. This shit will be amusing if nothing else.
>>107558959strawberry lemonade not bad
>>107559086pajeet/kike level self awareness on display.You literally just insulted multiple people in the thread and now you're acting like I threw the first punch.Holy shit.Your mother really fucked up with you
Considering a cope-quant of super nemotron 49B. Is it any good?
>>107559094oh no.. the poors are seething. whatever will I do. 5b of their own active parameters are now upset. to the moon rocket emoji.
>>107558959
24gb vramlet here running it at iq2_s
i'm still happy with it and it somehow quantizes really well
>>107558951Subversion
>>107559048anons please vote this is our chance
>>107559133crab
>>107559048>>107559133It's obviously a long shot, but might as well.
>>107559048Will never happen again with NVidia's name on it. They'll only train their models with open source safe and effective datasets, now.
>>107559048One of the resident redditors should post this in one of their boards.
>>107559109May Shiva redeem your bants with much bob and vagene sir
>>107558565great, more synthslopped and benchmaxxed trash
i miss 2024
>>107559048they're not releasing any more models like that and you know it
>>107559276
https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
>We use a considerable amount of synthetic data. Out of 10.6 trillion tokens, 3,534,013,958,278 tokens are synthetically generated.
>>107559276How do you know this new mixture of slop won't do the trick?
>>107559338This ain't Nemo-12B's dataset.
any good dense model above 30B?
>>107559375Dolphin-X1-Llama-3.1-405B is underrated.
>>107559367
>synthetic CC
4.3 trillion tokens of fake comments sections written by positivityslopped LLMs. This might end up being so shitty it's good for a laugh.
>>107559367
>Books - 0
They're proud of this and I hate them for it.
>>107559334
>3,534,013,958,278 tokens
That sounds a bit expensive to generate with a sota model. I hope this isn't a toss distill or something
>ik
prompt eval time = 19841.27 ms / 11023 tokens ( 1.80 ms per token, 555.56 tokens per second)
generation eval time = 86733.83 ms / 2546 runs ( 34.07 ms per token, 29.35 tokens per second)
>mainline
prompt eval time = 24553.96 ms / 11023 tokens ( 2.23 ms per token, 448.93 tokens per second)
eval time = 118823.52 ms / 3154 tokens ( 37.67 ms per token, 26.54 tokens per second)
ik is faster even with non-ik/ubergarm quants. Tested at 11K tokens, with glm-4.6 at Q4_K_S
Any reason to use mainline over ik at the moment? mainline needs less tweaking in the cli with their defaults maybe?
>ik cmd:
CUDA_VISIBLE_DEVICES=2,0,6,1,3,4,5 ./build/bin/llama-server \
    --model /mnt/llms/models/unsloth/GLM-4.6-GGUF/Q4_K_S/GLM-4.6-Q4_K_S-00001-of-00005.gguf \
    --alias "glm-4.6" \
    --ctx-size 64000 \
    -mla 3 -amb 512 \
    -ngl 99 \
    --host 0.0.0.0 \
    --port 5000 \
    --no-mmap --jinja
>mainline cmd:
CUDA_VISIBLE_DEVICES=2,0,1,3,4,5,6 ./build/bin/llama-server \
    --model /mnt/llms/models/unsloth/GLM-4.6-GGUF/Q4_K_S/GLM-4.6-Q4_K_S-00001-of-00005.gguf \
    --alias glm-4.6 \
    --host 0.0.0.0 \
    --port 5000 -c 64000
>>107559424
>https://nemotron.ideas.nvidia.com/ideas/LLAMANEMO-I-47
Fine. I'll pull and compile ik.
>>107559367
>>107559424What are your specs?
>>107559483Yes
>>107559483
1 rtx pro 6000
2 5090
4 3090
at Q4 it fits in vram at 64K ctx. Q6 needs offloading to ram and speeds drop to 9-10t/s
Even at t=0.6 it seems to be suffering from a bit of gender confusion- like the only purpose for the user to be on their stomach in this scenario would be for her to do the fucking. Also that whole "Would you like me to make a listicle of why LLMs keep getting worse?" seems to have generalized into the roleplay. Probably the best use of non-human anatomy in a model with only 3B active that I've seen so far, though. The dialogue is like a horrible mash-up between a 1-on-1 anime battle and Debbie Does Dallas.
Ik ook
>dense 70B q8 @ tg128 2.87
is this acceptable speed?
gemma WILL drop in 2 more hours and WILL save local
qat = always better?
>>107559568Shieldgemma will save us
So using slopotron as an assistant it seems to write out a thought process, but not use thinking tokens. So that's a problemydoo.
>>107559717Are you using --special?
>>107558951this is how a dying civilization looks like
simple as :(
>>107559424
>mainline
don't do that unless there's a specific feature you need
their retardation starts to show big time
>>107559524two littles in one sentence. sloppy
>>107559731wanna snuggle up and watch the world burn together? UwU.#nohomo (jk it will be very homo).
We are winning
>>107559558Depends on your hardware.
>>107559857It would be very funny if that got some real traction.
Anyway as expected slopotron is bad. But surprisingly not as bad as the gargantuan quantities of synthslop data would make you expect.
Which unfortunately just means it's conventionally bad and not so bad it's good.
very cool but how long until OLLAMA offers nemotron
>>107559513Holy shit, why? What kind of ERP scenario exceeds what you can do with a mistral small tune or maybe qwen3-30b-instruct? If it's not ERP then why not use a cloud model? Something like grok-code-fast-1 is unbelievably cheap for the speed and capability. Let the cloud AI companies fight over who can lose money the fastest. There's no way you can match them locally for speed or context.
If I had your budget, I'd sell the 5090s and 3090s and get a second 6000 pro. Then at least you could focus on what's actually interesting locally, which is things like LongCat-Video or Ovi.
>>107559857It only shows two out of six comments. My rocket emoji went through but my other did not.
>>107559857>I own 20 nvidia shares
>>107559949try clearing cookies
>>107559859dgx spark
>>107559987lol
>>107559874If you look at the 'benchmark' results, Nemotron is a direct competitor to GPT-OSS.
You can deduce the rest.
>>107559987My condolences.
>>107559048aight
>>107560104>>107559857someone tried to be a little more discreet
https://nemotron.ideas.nvidia.com/ideas/LLAMANEMO-I-48
>no vote
kek
>>107560174this is both hilarious and terrifying, the man is asking for "non-synthetic, real human conversation data" with the most aislopped post ever.