/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>107582405 & >>107573710

►News
>(12/17) Introducing Meta Segment Anything Model Audio: https://ai.meta.com/samaudio
>(12/16) GLM4V vision encoder support merged: https://github.com/ggml-org/llama.cpp/pull/18042
>(12/15) Chatterbox-Turbo 350M released: https://huggingface.co/ResembleAI/chatterbox-turbo
>(12/15) Nemotron 3 Nano released: https://hf.co/blog/nvidia/nemotron-3-nano-efficient-open-intelligent-models
>(12/15) llama.cpp automation for memory allocation: https://github.com/ggml-org/llama.cpp/discussions/18049

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
►Recent Highlights from the Previous Thread: >>107582405

--Qwen3 model performance optimization and hardware utilization:
>107587959 >107587962 >107588009 >107588204 >107588023 >107588043 >107588126 >107588226
--Tensor VRAM prioritization and compute graph optimization challenges:
>107585868 >107585978
--Attempting to distill Claude-like model from cloud logs using local LLM:
>107586842 >107586892 >107586876 >107586899 >107586987 >107587029 >107587038 >107587104
--Techniques for generating long NSFW stories with limited LLMs:
>107584822 >107584862 >107584875 >107585113
--Personal growth through local AI model interactions and ego death experiences:
>107582881 >107582903 >107582912 >107583128 >107583070 >107583157
--Gemma release updates and Solar-Open parameter specifications:
>107582520 >107582589 >107586719 >107582643 >107582699 >107582732 >107582789
--Evaluating NemoTron 3 Nano's roleplay abilities vs Gemma with preset demonstration:
>107583976 >107584039 >107584065
--Nala test results on MistralAI API with Anon/Nala M roleplay:
>107586172 >107586197 >107586219 >107586377 >107586813
--Testing GLM 4.6 on new Framework desktop:
>107583661 >107583684 >107583743 >107583746 >107583748 >107583750 >107583875 >107583904 >107583988 >107583982 >107584717 >107584051 >107584075 >107584275 >107584296 >107584494 >107584285 >107584307 >107584322 >107584357 >107584477 >107584482 >107584609 >107584496 >107584520 >107584530 >107584607 >107585220
--Budget GPU alternatives for AI workloads: 5060ti vs 3090 cost-performance analysis:
>107585634 >107585658
--Nemotron nano model benchmark performance on 3060 GPU:
>107583030 >107583098
--Misconfigured multi-GPU parameter usage realization:
>107582989
--Miku (free space):
>107582881 >107587769 >107587665

►Recent Highlight Posts from the Previous Thread: >>107582410

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
Local model for fixing a broken heart when?
>>107588641Get a grip pussy, life's gonna get harder too
>>107588641at least 4 months after corpo models can operate a surgical robot without mistakes
>>107588660
>Local man dies after SurgeonGPT refuses to proceed mid-surgery, quoted as saying repeatedly: "I can't assist with that"
>>107588694
>why the fuck not
>an unconfirmed blood type may lead to disastrous results
>i'm telling you it's fuckin o
>the sensor isn't working, I can't confirm that
>>107588694
>die because surgeongpt refuses to assist with that request
or
>die because the SARRR doctor decided to start sticking his dick in your innards mid surgery and you get not only all the aids but also fecal matter from his dick and subsequently a lethal infection
clown world man...
>>107588694
>I can't operate.. he is my son
Should I buy a 5080 prebuilt or can I cope with services like ChatLLM?
Btw Bartowski for some reason updated his BF16 mmproj file for GLM.
https://huggingface.co/bartowski/zai-org_GLM-4.6V-GGUF/tree/main
>>107589110
there are so many other better options than buying a prebuilt. build a mid tier pc yourself and then get 2 of these gpus:
https://www.ebay.com/itm/125006475381
>>107589110There are no good models you can run on 5080 that you can't run on 2080
CUDA DEV CUDA DEV WHY IS THIS HAPPENING:
https://litter.catbox.moe/gtb1e3u1jejxs6or.png

./llama-bench --model ~/ik_models/GLM-4.5-Air-IQ4_KSS-00001-of-00002.gguf -ot exps=CPU -ngl 0 -t 6 -fa 1 --mmap 0 -r 5 -p 32,64,128,256,512,1024,2048,4096 -r 1 -p 0 -b 512 -nkvo 1
| glm4moe 106B.A12B IQ4_KSS - 4.0 bpw | 53.05 GiB | 106.85 B | CUDA | 0 | 512 | 1 | 0 | exps=CPU | pp1024 | 313.23 ± 0.00 |

john@debian:~/TND/CPU/ik_llama.cpp/build/bin$ ./llama-bench --model ~/ik_models/GLM-4.5-Air-IQ4_KSS-00001-of-00002.gguf -t 6 -fa 1 --mmap 0 -r 5 -p 32,64,128,256,512,1024,2048,4096 -r 1 -p 0 -b 512
| glm4moe 106B.A12B IQ4_KSS - 4.0 bpw | 53.05 GiB | 106.85 B | CPU | 6 | 512 | 0 | pp1024 | 26.84 ± 0.00 |
RAID0 HDDmaxxing is the new normal.
>>107589220
also why does -b 256 and -b 512 make such a big difference
specs: 3060 12gb, i5 12400f, 64gb ddr4 3200mhz dual channel (51.2gb/s)
| glm4moe 106B.A12B IQ4_KSS - 4.0 bpw | 53.05 GiB | 106.85 B | CUDA | 0 | 256 | 1 | 0 | exps=CPU | pp2048 | 49.90 ± 0.00 |
| glm4moe 106B.A12B IQ4_KSS - 4.0 bpw | 53.05 GiB | 106.85 B | CUDA | 0 | 512 | 1 | 0 | exps=CPU | pp2048 | 291.45 ± 0.00 |
> look up everlasting summer
> miku is already canon character
> purple hair twin bob girl looks like dipsy sans glasses
Weird.
>>107589220>why is something happening on a fork cudadev doesn't work on and refuses to read the code of because the author has a pissy fit whenever someone upstreams his code
>>107589320@grok is this true?
>>107589341
presented without comment.
https://litter.catbox.moe/mdi7kasx8xbioeiv.png
Mistral Small Creative is better than Mistral Small 3.2, but not that much, at least in the EQBench Creative Writing benchmark (I don't think that represents chatbot performance).
https://eqbench.com/creative_writing.html
>>107589220>>107589403that performance is more or less standard for your hardware.
>>107589110
>5080
I literally just got my 5080 and installed it tonight... completely impossible to do gpu passthru to a VM with it. It just outright explodes every time.
I was passing through a 2060 super with zero issues forever
>>107589320It's a shitty vn made by channers featuring soviet nostalgia, chan culture and chan mascots as characters, mostly popular among normies
>>107589435>llm-judged creative writing benchmarkI love this dumb meme so much
>>107589436
it's standard for llama.cpp to be slower than ik_llama?
anyways yes, i know this performance is standard for my hardware
but i'm wondering why prompt processing is still faster on the cuda compiled version despite having 0 gpu layers and kv cache offload disabled, even though i'm using -b 512
when i compile pure cpu it's always 20t/s or maybe a bit different depending on batch size
>>107589341
also llama.cpp, this time cpu-only

john@debian:~/TND/CPU/llama.cpp/build/bin$ ./llama-bench --model ~/TND/AI/ArliAI_GLM-4.5-Air-Derestricted-IQ4_XS-00001-of-00002.gguf -t 6 -fa 1 --mmap 0 -r 5 -p 32,64,128,256,512,1024,2048,4096 -r 1 -p 0 -b 512
| model | size | params | backend | threads | n_batch | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | -: | ---: | --------------: | -------------------: |
| glm4moe 106B.A12B IQ4_XS - 4.25 bpw | 56.62 GiB | 110.47 B | CPU | 6 | 512 | 1 | 0 | pp32 | 12.37 ± 0.00 |
| glm4moe 106B.A12B IQ4_XS - 4.25 bpw | 56.62 GiB | 110.47 B | CPU | 6 | 512 | 1 | 0 | pp64 | 12.94 ± 0.00 |
| glm4moe 106B.A12B IQ4_XS - 4.25 bpw | 56.62 GiB | 110.47 B | CPU | 6 | 512 | 1 | 0 | pp128 | 13.10 ± 0.00 |
>>107589435>Mistral Small CreativeWhat an elusive model.
>>107589220
CUDADEV WHY IS THIS HAPPENING (llama.cpp edition):
https://litter.catbox.moe/h6x20edznhqvo56l.png
>>107589526
Why is what happening?
If you mean why the performance first goes up and then down again, that's simple: with a low number of tokens your batch size is hard-limited and you get bad arithmetic intensity (compute efficiency), and as you increase the number of tokens the average context depth increases, so the attention becomes slower.
For future anons: beware that prompt processing for models that don't fully fit into your GPU is highly dependent on cpu-gpu bandwidth. If you use an external gpu connected via thunderbolt (2gb/s) or usb4 (3gb/s), expect very shitty pp. Even at 6gb/s (pcie4 4x, like oculink), the link only barely stops limiting the gpu at batch size 4096.
Token generation is much less sensitive to cpu-gpu bandwidth.
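A rough back-of-envelope sketch of that bottleneck (all numbers below are illustrative assumptions, not measurements): if the CPU-resident weights have to be streamed over the link once per batch during prompt processing, the link alone caps throughput at roughly batch_size * link_bandwidth / streamed_bytes tokens per second.

```python
# Toy upper bound on prompt processing speed when CPU-resident weights are
# streamed to the GPU once per batch; ignores overlap and GPU compute limits.
def pp_link_cap(streamed_gb: float, link_gb_s: float, batch_size: int) -> float:
    seconds_per_batch = streamed_gb / link_gb_s   # time to push the weights across once
    return batch_size / seconds_per_batch         # tokens processed per second, at best

for link in (2, 3, 6, 25):  # thunderbolt-ish, usb4-ish, oculink, pcie4 x16 (GB/s)
    cap = pp_link_cap(streamed_gb=50, link_gb_s=link, batch_size=4096)
    print(f"{link:>2} GB/s link -> at most ~{cap:.0f} t/s pp at batch 4096")
```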
Are you ready? Are you sure you're ready?
Are you really sure of that?
Have you flushed enough?
https://x.com/osanseviero/status/2001532493183209665
>[eyes emoji][eyes emoji][eyes emoji]
>>107589609hasn't this pajeet been doing this charade for like 2 months now? can they stop fucking edging us
>>107589615It's probably Gemini 3 Flash Image anyway.
>>107589609Never heard of ANY of them and I'm not about to click on any.
>>107589560What goes up must come down.
>>107589560
why is the cpu build slower than the cuda build
cuda build has -ngl 0 and -nkvo 1
cpu build is 10t/s, cuda build that doesn't use the gpu is 100t/s
thx for response btw
>>107589567*Kisses you on the lips*
>>107589623Omar Sanseviero is the Google Gemma Team PR guy. He's been hyping up a possible open-weight release from Google (i.e. Gemma 4) for a while now, but things never pan out. This one is now Gemini 3 Flash week and it's unlikely Google will release Gemma 4 until next week at the minimum.
>>107589655Nonnie, this is too sudden!
>>107589715
You know it isn't. *Grabs your chin and smooches you aggressively*
>>107589609Why are all these brown goblins begging for the silicone demon? AI fucking mindbroke these niggas
Frens the 5090 finally arrived. What are the best uncensored models I can run in LM Studio? My PC only has 64GB of RAM though. Gemma 3 27B Abliterated never refuses my prompts, but its knowledge is very limited
Mistral Small Creative. Where is it then?
>>107588615
>>107589835Should have bought a BWP6000 instead. Also, don't bother with LM Studio. Best you can do is probably a Q4 of GLM Air, though it will be decently fast.
>>107589839I mean, at least he's now self aware of a problem he might have, that's something.
>>107589867We should become his guinea pigs instead.
Sirs is Gemma Strong model deepseek killer day today sirs? Thank you Google brahmin sirs to the moon
Lord Ganesh bless
>>107589728good night, nonnie
>>107589867The same problem people here have been telling him about for months on end?
>>107589855
>Should have bought a BWP6000 instead
Way too expensive in my 3rd world EU country
>Also, don't bother with LM Studio
Why? It seems easy to use
>Q4 of GLM Air
Thanks, I’ll check it out
>>107589838It's an API-only experiment because they have no clue yet of what to do with it and its future direction, and are looking for "feedback".
>>107589224whats theoretical read/write speed limit?
>>107590039
This kind of feedback to be precise.
>We're looking forward to engaging with the community on ways to make the model finely respect guardrails, allowing for deployment in environments requiring moderated outputs.
>>107590039
Do they really need to have the logs to know that people goon to AI when they say 'creative writing'?
>>107590076
As much as bandwidth permits, so for PCIE5 16x that's around 64GB/s, roughly the speed of dual-channel ddr4 ram. Let's be optimistic and assume that each HDD reads at 150MB/s; you would need 427 hdds to fully saturate it.
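Quick sanity check of that drive count (using the assumed figures from the post):

```python
pcie5_x16 = 64e9   # ~64 GB/s in one direction
hdd_read = 150e6   # optimistic 150 MB/s sequential read per drive
print(pcie5_x16 / hdd_read)  # ~426.7, i.e. roughly 427 drives to saturate the link
```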
>>107588615
I didn't look into local LLMs before but I bought a 5090 recently, what's the best smut model I can run?
>>107590136Mistral Nemo
>>107590136also I got 128 gb ram
>>107590158GLM Air or low quants of big GLM and deepseek R1.
I just saw a video of someone talking to grok and they were chatting and asking grok to sing to them in their car. Humanity is over. No longer do we need socialization anymore
>>107590118
I think they're past that, they stopped adding that note some time after the Nemo release.
>>107590132
If they were just interested in large amounts of logs they could have simply made the model free on OpenRouter. They're looking for more specific suggestions and feedback.
>>107589637
Unless this was changed when I wasn't looking, 32 is the batch size at which data starts being moved temporarily from RAM to VRAM to take advantage of the higher compute on GPUs.
However, it's not like this choice is guaranteed to be optimal for all hardware combinations.
In particular, an RTX 3060 is comparatively low-powered so for 32 tokens the overhead seems to not be worthwhile in this case.
Do note though that this is on a completely empty context, if you set a higher --depth the CUDA performance should decline less than the CPU performance because there is more work to be done when the context fills up.
>>107589637
>>107590228
>why is the cpu build slower than the cuda build
Actually, I misread your post: I thought you were asking about the one data point where the CPU build is faster.
llama.cpp uses GPUs for prompt processing even at 0 GPU layers, that's why adding a GPU makes it faster.
Prompt processing is compute bound so it makes sense to temporarily move data from RAM to VRAM and do the calculations there.
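A toy model of that tradeoff (constants made up for illustration; this is not llama.cpp's actual heuristic): shipping the weights to VRAM costs a roughly fixed transfer time per batch, so it only pays off once the batch is large enough that the CPU compute time saved outweighs the transfer cost.

```python
# Break-even sketch: CPU-only prompt processing vs. streaming weights to the GPU.
def cpu_time(n_tokens: int, cpu_pp_tps: float = 25.0) -> float:
    return n_tokens / cpu_pp_tps                      # pure CPU prompt processing

def gpu_time(n_tokens: int, weights_gb: float = 53.0,
             link_gb_s: float = 12.0, gpu_pp_tps: float = 2000.0) -> float:
    transfer = weights_gb / link_gb_s                 # push weights over PCIe once per batch
    return transfer + n_tokens / gpu_pp_tps           # plus the much faster GPU compute

for b in (16, 32, 128, 512, 2048):
    winner = "GPU" if gpu_time(b) < cpu_time(b) else "CPU"
    print(f"batch {b:5d}: cpu {cpu_time(b):6.2f}s  gpu {gpu_time(b):6.2f}s  -> {winner}")
```

With these particular numbers the crossover lands at around a hundred tokens, which is why tiny batches stay on the CPU while -b 512 is much faster.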
If we don't get Gemma 4 soon then Vishnu is dead to me.
>google hid its recent activities
>google hid its recent activities
>google hid its recent activities
>>107590265thanks omar
I'm glad that the new captcha is filtering out dalits and pakis, so only aryan brahmin can post
>>107590284
TELL ME ABOUT THE BRAHMIN
WHY DO THEY IDENTIFY WITH THE DALIT?
>>107590284
It's 10x easier for me, I don't get how it's filtering anyone.
The only time I ever spend thinking about Indians is when retards insist on dragging their personal grievances into /lmg/.
>>107590329I think about them when applying for tech jobs. (they get them through nepotism)
>>107590334
They get all jobs through nepotism
Once an indian is put in charge of hiring people, you can guarantee that 99% of future employees will also be indian.
>>107590343
It's funny because I actually met some competent indians at a few companies. Assuming they stood out because of this.
So many that didn't know shit about their job or really anything, and you'd normally wonder why/how they got employed while you get put through the third degree on interviews.
>>107589320
>miku but swarthy
yikes
WHY ARE THERE SO MANY
>>107590329>personal grievancesI would say it's more of a national grievance or even a civilizational grievance at this point.
>>107590343There's also the explosive diarrhea strategy. Just spam every single venue with your "work" as obnoxiously as you can, farm engagement with any possible strategy, fake it till you make it, and eventually you will get hired by clueless boomers. Indians tend to lack any sense of shame and restraint in this regard.
Local models?
>>107590732
>lack any sense of shame and restraint in this regard
Neither should you. Employment is one of the rare cases where lying, cheating, and scamming are justified because the other side will do the same to you
>>107590782
Local AI tech support sir. Kindly buy a google gift card if you wish to have good local model suggested sir
>>107590825
>lying, cheating, and scamming
And Indians are culturally advantaged with that.
>>107590782
We can rapidly bring the thread back on topic with picrel.
https://xcancel.com/avisoori1/status/2001332763816083926
>>107590886yay..
>>107590886>Local models?
>>107590914Soon
>>107590136
>>107590179
I'd go straight to a low quant of GLM 4.6 personally, try this in ik_llama: https://huggingface.co/ubergarm/GLM-4.6-GGUF/tree/main/IQ2_KL
Deepseek R1 at a similar size is too gimp and it's slower in prompt processing
>>107590917Right after Mistral Medium
If fucking Oracle is what causes the crash i will become the joker
Can you use your own coder llm model in VS Code or is it all forced cloudshit? Alternatively, is it even worth bothering with local-based coding models?
>>107590959Why?They are deeply entangled with this mess. Chances are pretty decent.
>>107590886yjk the bharatian chad got that yellow pussy
>>107591005Do not redeem the IMAF postings
>>107589609Gemma 4 so good they calling it Gemma 6. Local sirs are about to wonner bigly. 1 f5 = 1 minute less till Google does needful gooof upload
Just tried Gemini 3 Flash. It's... bad. It knows less than the Pro version and isn't faster (maybe it's a server overloading thing). Maybe they reached the limits of small MoE models.
>>107590999
>deeply entangled
How so, is there an updated incestual bukkake / "commercial agreements" chart? Thought MS are most on the hook
>>107590997
No, yes. No.
Now go away. We have enough saarposting as it is.
I'm going to use pyautogui to automate the generation of data for distillation
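For anyone copying the idea, a minimal sketch of that kind of UI automation (the window coordinates, the fixed 60s wait, and the clipboard round-trip via pyperclip are all assumptions about the target chat UI, not a tested pipeline):

```python
import time
import pyautogui   # simulates keyboard/mouse input
import pyperclip   # reads the clipboard so the reply can be scraped back out

PROMPTS = [
    "Write a short story about a lighthouse keeper.",
    "Explain quicksort like I'm five.",
]

for prompt in PROMPTS:
    pyautogui.click(800, 950)            # focus the chat input box (coords are guesses)
    pyautogui.write(prompt, interval=0.01)
    pyautogui.press("enter")
    time.sleep(60)                       # crude fixed wait for generation to finish
    pyautogui.hotkey("ctrl", "a")        # select the conversation text
    pyautogui.hotkey("ctrl", "c")        # copy it to the clipboard
    with open("distill_log.txt", "a", encoding="utf-8") as f:
        f.write(pyperclip.paste() + "\n---\n")
```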
>>107590323
>I don't get how it's filtering anyone.
I spent way too long getting them wrong due to overthinking it. Like for the dots one I assumed it must be position, rotation, or color shading, because the number (and it almost always being the one with 4 dots) seemed way too fucking easy, and surely there was no way they made the new captcha so easy and pointless that even 80 iq indians could solve it.
>>107591353
How do you even do model distillation?
Is there a framework out there that does the token matching or do you have to write something yourself?
>>107591259I don't really care one way or another because it's not local
>>107591379
Distillation is not the correct term when you don't train to match logits, which requires a matching tokenizer. Otherwise you are just training on the outputs.
>>107591432The entire rest of the professional industry and even common usage now disagrees with you.
>>107591432
Yes, I know. That's why I'm asking about how people do the distillation process.
Are they hand rolling their own scripts to match the logits or do the existing frameworks like axolotl and unsloth have support for it?
Maybe there's a dedicated framework just for that?
>>107591458lol they just finetune/train on model outputs
>>107591379
Modern distillation is just generating a question-answer dataset and training on that. Not training on logits. If we had them it'd be better but we don't.
My goal is to finetune a model to make it as close as possible to Sonnet 4.5.
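For contrast with the QA-dataset approach, a minimal sketch of what logit-level distillation would look like if you did have teacher logits and a shared tokenizer (standard Hinton-style KD; the temperature and mixing weight are arbitrary placeholders):

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft KL term against the teacher plus ordinary cross-entropy on the labels."""
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # rescale to keep gradient magnitude with temperature
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,                        # skip padded / masked positions
    )
    return alpha * kd + (1 - alpha) * ce
```

Since API teachers like Sonnet don't expose logits (and use a different tokenizer anyway), training on sampled outputs is the only practical option, which is what "distillation" means in the posts above.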