File: 1739402277498048.jpg (424 KB, 1376x2072)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>107582405 & >>107573710

►News
>(12/17) Introducing Meta Segment Anything Model Audio: https://ai.meta.com/samaudio
>(12/16) GLM4V vision encoder support merged: https://github.com/ggml-org/llama.cpp/pull/18042
>(12/15) Chatterbox-Turbo 350M released: https://huggingface.co/ResembleAI/chatterbox-turbo
>(12/15) Nemotron 3 Nano released: https://hf.co/blog/nvidia/nemotron-3-nano-efficient-open-intelligent-models
>(12/15) llama.cpp automation for memory allocation: https://github.com/ggml-org/llama.cpp/discussions/18049

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
►Recent Highlights from the Previous Thread: >>107582405

--Qwen3 model performance optimization and hardware utilization:
>107587959 >107587962 >107588009 >107588204 >107588023 >107588043 >107588126 >107588226
--Tensor VRAM prioritization and compute graph optimization challenges:
>107585868 >107585978
--Attempting to distill Claude-like model from cloud logs using local LLM:
>107586842 >107586892 >107586876 >107586899 >107586987 >107587029 >107587038 >107587104
--Techniques for generating long NSFW stories with limited LLMs:
>107584822 >107584862 >107584875 >107585113
--Personal growth through local AI model interactions and ego death experiences:
>107582881 >107582903 >107582912 >107583128 >107583070 >107583157
--Gemma release updates and Solar-Open parameter specifications:
>107582520 >107582589 >107586719 >107582643 >107582699 >107582732 >107582789
--Evaluating NemoTron 3 Nano's roleplay abilities vs Gemma with preset demonstration:
>107583976 >107584039 >107584065
--Nala test results on MistralAI API with Anon/Nala M roleplay:
>107586172 >107586197 >107586219 >107586377 >107586813
--Testing GLM 4.6 on new Framework desktop:
>107583661 >107583684 >107583743 >107583746 >107583748 >107583750 >107583875 >107583904 >107583988 >107583982 >107584717 >107584051 >107584075 >107584275 >107584296 >107584494 >107584285 >107584307 >107584322 >107584357 >107584477 >107584482 >107584609 >107584496 >107584520 >107584530 >107584607 >107585220
--Budget GPU alternatives for AI workloads: 5060ti vs 3090 cost-performance analysis:
>107585634 >107585658
--Nemotron nano model benchmark performance on 3060 GPU:
>107583030 >107583098
--Misconfigured multi-GPU parameter usage realization:
>107582989
--Miku (free space):
>107582881 >107587769 >107587665

►Recent Highlight Posts from the Previous Thread: >>107582410

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
Local model for fixing a broken heart when?
>>
>>107588641
Get a grip pussy, life's gonna get harder too
>>
>>107588641
at least 4 months after corpo models can operate a surgical robot without mistakes
>>
>>107588660
>Local man dies after SurgeonGPT refuses to proceed mid-surgery, quoted as saying repeatedly: "I can't assist with that"
>>
>>107588694
>why the fuck not
>an unconfirmed blood type may lead to disastrous results
>i'm telling you it's fuckin o
>the sensor isn't working, I can't confirm that
>>
>>107588694
>die because surgeongpt refuses to assist with that request
or
>die because the SARRR doctor decided to start sticking his dick in your innards mid-surgery and you get not only all the AIDS but also fecal matter from his dick and subsequently a lethal infection

clown world man...
>>
>>107588694
>I can't operate..he is my son
>>
Should I buy a 5080 prebuilt or can I cope with services like ChatLLM?
>>
Btw Bartowski for some reason updated his BF16 mmproj file for GLM.
https://huggingface.co/bartowski/zai-org_GLM-4.6V-GGUF/tree/main
>>
>>107589110
There are so many better options than buying a prebuilt. Build a mid-tier PC yourself and then get 2 of these GPUs:
https://www.ebay.com/itm/125006475381
>>
>>107589110
There are no good models you can run on a 5080 that you can't run on a 2080
>>
CUDA DEV CUDA DEV WHY IS THIS HAPPENING:

https://litter.catbox.moe/gtb1e3u1jejxs6or.png

./llama-bench --model ~/ik_models/GLM-4.5-Air-IQ4_KSS-00001-of-00002.gguf -ot exps=CPU -ngl 0 -t 6 -fa 1 --mmap 0 -r 5 -p 32,64,128,256,512,1024,2048,4096 -r 1 -p 0 -b 512 -nkvo 1
| glm4moe 106B.A12B IQ4_KSS - 4.0 bpw | 53.05 GiB | 106.85 B | CUDA | 0 | 512 | 1 | 0 | exps=CPU | pp1024 | 313.23 ± 0.00 |

john@debian:~/TND/CPU/ik_llama.cpp/build/bin$ ./llama-bench --model ~/ik_models/GLM-4.5-Air-IQ4_KSS-00001-of-00002.gguf -t 6 -fa 1 --mmap 0 -r 5 -p 32,64,128,256,512,1024,2048,4096 -r 1 -p 0 -b 512
| glm4moe 106B.A12B IQ4_KSS - 4.0 bpw | 53.05 GiB | 106.85 B | CPU | 6 | 512 | 0 | pp1024 | 26.84 ± 0.00 |
>>
RAID0 HDDmaxxing is the new normal.
>>
>>107589220
also, why do -b 256 and -b 512 make such a big difference?
specs: 3060 12GB, i5 12400F, 64GB DDR4-3200 dual channel (51.2 GB/s)
| glm4moe 106B.A12B IQ4_KSS - 4.0 bpw | 53.05 GiB | 106.85 B | CUDA | 0 | 256 | 1 | 0 | exps=CPU | pp2048 | 49.90 ± 0.00 |
| glm4moe 106B.A12B IQ4_KSS - 4.0 bpw | 53.05 GiB | 106.85 B | CUDA | 0 | 512 | 1 | 0 | exps=CPU | pp2048 | 291.45 ± 0.00 |
>>
File: DipsyEverlastingSummer.png (2.27 MB, 1536x1024)
> look up everlasting summer
> miku is already canon character
> purple hair twin bob girl looks like dipsy sans glasses
Weird.
>>
>>107589220
>why is something happening on a fork cudadev doesn't work on and refuses to read the code of because the author has a pissy fit whenever someone upstreams his code
>>
>>107589320
@grok is this true?
>>
>>107589341
presented without comment.
https://litter.catbox.moe/mdi7kasx8xbioeiv.png
>>
Mistral Small Creative is better than Mistral Small 3.2, but not by much, at least on the EQBench Creative Writing benchmark (I don't think that represents chatbot performance).
https://eqbench.com/creative_writing.html
>>
>>107589220
>>107589403
that performance is more or less standard for your hardware.
>>
>>107589110
>5080
I literally just got my 5080 and installed it tonight... completely impossible to do GPU passthrough to a VM with it. It just outright explodes every time.
I was passing through a 2060 super with zero issues forever
>>
>>107589320
It's a shitty VN made by channers, featuring Soviet nostalgia, chan culture, and chan mascots as characters; mostly popular among normies
>>
>>107589435
>llm-judged creative writing benchmark
I love this dumb meme so much
>>
>>107589436
It's standard for llama.cpp to be slower than ik_llama?
Anyway, yes, I know this performance is standard for my hardware,
but I'm wondering why prompt processing is still faster on the CUDA-compiled version despite having 0 GPU layers and disabling KV cache offload, even though I'm using -b 512.
When I compile pure CPU it's always 20 t/s, or maybe a bit different depending on batch size.
>>
>>107589341
also llama.cpp, this time cpu-only
john@debian:~/TND/CPU/llama.cpp/build/bin$ ./llama-bench --model ~/TND/AI/ArliAI_GLM-4.5-Air-Derestricted-IQ4_XS-00001-of-00002.gguf -t 6 -fa 1 --mmap 0 -r 5 -p 32,64,128,256,512,1024,2048,4096 -r 1 -p 0 -b 512
| model | size | params | backend | threads | n_batch | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | -: | ---: | --------------: | -------------------: |
| glm4moe 106B.A12B IQ4_XS - 4.25 bpw | 56.62 GiB | 110.47 B | CPU | 6 | 512 | 1 | 0 | pp32 | 12.37 ± 0.00 |
| glm4moe 106B.A12B IQ4_XS - 4.25 bpw | 56.62 GiB | 110.47 B | CPU | 6 | 512 | 1 | 0 | pp64 | 12.94 ± 0.00 |
| glm4moe 106B.A12B IQ4_XS - 4.25 bpw | 56.62 GiB | 110.47 B | CPU | 6 | 512 | 1 | 0 | pp128 | 13.10 ± 0.00 |
>>
>>107589435
>Mistral Small Creative
What an elusive model.
>>
>>107589220
CUDADEV WHY IS THIS HAPPENING (llama.cpp edition):

https://litter.catbox.moe/h6x20edznhqvo56l.png
>>
>>107589526
Why is what happening?
If you mean why the performance first goes up and then down again, that's simple: with a low number of tokens your batch size is hard-limited and you get bad arithmetic intensity (compute efficiency), and as you increase the number of tokens the average context depth increases, so the attention becomes slower.
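A rough illustration of the arithmetic-intensity half of that, with made-up shapes (d=4096, fp16) just to show the trend:

# toy estimate: FLOPs per byte moved for one (n_tokens x d) @ (d x d) matmul
def intensity(n_tokens, d=4096, bytes_per=2):
    flops = 2 * n_tokens * d * d                          # multiply-adds
    bytes_moved = bytes_per * (2 * n_tokens * d + d * d)  # read A, read W, write C
    return flops / bytes_moved

for n in (32, 128, 512, 2048):
    print(n, round(intensity(n), 1))
# intensity climbs with n_tokens until the weight read stops dominating,
# which is why tiny batches underutilize the GPU; the later slowdown is
# the attention cost growing with average context depth.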
>>
For future anons: beware that prompt processing for models that don't fully fit into your GPU is highly dependent on CPU-GPU bandwidth. If you use an external GPU connected via Thunderbolt (~2 GB/s) or USB4 (~3 GB/s), expect very shitty pp. At ~6 GB/s (PCIe 4.0 x4, like OCuLink), the GPU only barely becomes the bottleneck, and only at batch size 4096.
Token generation is much less sensitive to CPU-GPU bandwidth.
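A very crude ceiling estimate of why, assuming the streamed weights dominate and nothing overlaps (the ~50 GiB figure and the link speeds are placeholder assumptions, not measurements):

# upper bound on pp t/s when RAM-resident weights must cross the CPU<->GPU link per batch
streamed_gib = 50          # assumed bytes of weights streamed per batch pass
batch = 4096               # tokens processed per pass
links_gbs = {"thunderbolt": 2, "usb4": 3, "oculink (pcie4 x4)": 6, "pcie4 x16": 25}
for name, gbs in links_gbs.items():
    seconds = streamed_gib * 2**30 / (gbs * 1e9)   # time to move the weights once
    print(f"{name:18s} ceiling ~{batch / seconds:5.0f} t/s")
# token generation doesn't need to re-stream weights like this,
# which is why it's far less sensitive to the link speed.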
>>
File: sans_eyes.png (276 KB, 590x954)
Are you ready? Are you sure you're ready?
Are you really sure of that?
Have you flushed enough?

https://x.com/osanseviero/status/2001532493183209665
>[eyes emoji][eyes emoji][eyes emoji]
>>
>>107589609
hasn't this pajeet been doing this charade for like 2 months now? can they stop fucking edging us
>>
>>107589615
It's probably Gemini 3 Flash Image anyway.
>>
>>107589609
Never heard of ANY of them and I'm not about to click on any.
>>
>>107589560
What goes up must come down.
>>
>>107589560
why is the cpu build slower than the cuda build?
the cuda build has -ngl 0 and -nkvo 1
the cpu build is 10 t/s, the cuda build that doesn't use the gpu is 100 t/s
thx for the response btw
>>
>>107589567
*Kisses you on the lips*
>>
>>107589623
Omar Sanseviero is the Google Gemma Team PR guy. He's been hyping up a possible open-weight release from Google (i.e. Gemma 4) for a while now, but things never pan out. This week is Gemini 3 Flash week, and it's unlikely Google will release Gemma 4 until next week at the earliest.
>>
File: 1743482804734866.jpg (358 KB, 1432x1840)
>>107589655
Nonnie, this is too sudden!
>>
>>107589715
You know it isn't. *Grabs your chin and smooches you aggressively*
>>
File: ComfyUI_temp_dydig_00001_.png (3.97 MB, 2352x1568)
>>107589609
Why are all these brown goblins begging for the silicon demon? AI fucking mindbroke these niggas
>>
File: 20251217_202848.jpg (3.88 MB, 4032x3024)
Frens the 5090 finally arrived. What are the best uncensored models I can run in LM Studio? My PC only has 64GB of RAM though. Gemma 3 27B Abliterated never refuses my prompts, but its knowledge is very limited
>>
Mistral Small Creative. Where is it then?
>>
File: drum1q.png (56 KB, 943x389)
>>107588615
>>
>>107589835
Should have bought a BWP6000 instead. Also, don't bother with LM Studio. Best you can do is probably a Q4 of GLM Air, though it will be decently fast.
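Rough size math for that suggestion (treating the quant as a flat bits-per-weight average, which real GGUF mixes only approximate):

# ~110B params at ~4.25 bpw (IQ4_XS-ish)
params = 110e9
bpw = 4.25
weights_gib = params * bpw / 8 / 2**30
print(f"~{weights_gib:.0f} GiB of weights")   # roughly 54 GiB
# with -ot exps=CPU the experts sit in the 64 GB of system RAM, while the
# shared layers plus KV cache fit comfortably in the 5090's 32 GB of VRAM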
>>
>>107589839
I mean, at least he's now self-aware of a problem he might have; that's something.
>>
>>107589867
We should become his guinea pigs instead.
>>
Sirs is Gemma Strong model deepseek killer day today sirs? Thank you Google brahmin sirs to the moon
Lord Ganesh bless
>>
File: 1750456854038343.jpg (260 KB, 1432x1840)
>>107589728
good night, nonnie
>>
>>107589867
The same problem people here have been telling him about for months on end?
>>
>>107589855
>Should have bought a BWP6000 instead
Way too expensive in my 3rd world EU country
>Also, don't bother with LM Studio
Why? It seems easy to use
>Q4 of GLM Air
Thanks, I’ll check it out
>>
>>107589838
It's an API-only experiment because they have no clue yet what to do with it or its future direction, and are looking for "feedback".
>>
>>107589224
what's the theoretical read/write speed limit?
>>
>>107590039
This kind of feedback, to be precise.
>We're looking forward to engaging with the community on ways to make the model finely respect guardrails, allowing for deployment in environments requiring moderated outputs.
>>
>>107590039
Do they really need to have the logs to know that people goon to AI when they say 'creative writing'?

>>107590076
As much as bandwidth permits, so for PCIe 5.0 x16 that's around 64 GB/s, roughly the speed of dual-channel DDR4 RAM. Let's be optimistic and assume each HDD reads at 150 MB/s; you would need about 427 HDDs to fully saturate it.
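Same arithmetic spelled out (all figures approximate):

pcie5_x16 = 64e9    # ~64 GB/s in one direction
ddr4_dual = 51.2e9  # dual-channel DDR4-3200, for comparison
hdd = 150e6         # optimistic sequential read per drive
print(pcie5_x16 / hdd)   # ~427 drives in RAID0 to saturate the link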
>>
>>107588615
I didn't look into local LLMs before, but I bought a 5090 recently. What's the best smut model I can run?
>>
>>107590136
Mistral Nemo
>>
>>107590136
also I got 128 GB of RAM
>>
>>107590158
GLM Air or low quants of big GLM and deepseek R1.
>>
I just saw a video of someone talking to Grok, chatting and asking it to sing to them in their car. Humanity is over. We don't need socialization anymore.
>>
>>107590118
I think they're past that; they stopped adding that note some time after the Nemo release.
>>107590132
If they were just interested in large amounts of logs, they could have simply made the model free on OpenRouter. They're looking for more specific suggestions and feedback.
>>
>>107589637
Unless this was changed when I wasn't looking, 32 is the batch size at which data starts being moved temporarily from RAM to VRAM to take advantage of the higher compute on GPUs.
However, it's not like this choice is guaranteed to be optimal for all hardware combinations.
In particular, an RTX 3060 is comparatively low-powered, so for 32 tokens the overhead seems not to be worthwhile in this case.
Do note though that this is on a completely empty context; if you set a higher --depth, the CUDA performance should decline less than the CPU performance because there is more work to be done when the context fills up.
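A toy break-even model of that threshold (all the throughput and bandwidth numbers below are assumptions, not measurements of any real card):

# is it worth shipping a d x d fp16 weight to the GPU for one matmul over n tokens?
def cpu_time(n, d=4096, cpu_gflops=300):
    return 2 * n * d * d / (cpu_gflops * 1e9)

def gpu_time(n, d=4096, gpu_gflops=10_000, pcie_gbs=8, bytes_per=2):
    transfer = bytes_per * d * d / (pcie_gbs * 1e9)   # copy the weight across once
    compute = 2 * n * d * d / (gpu_gflops * 1e9)
    return transfer + compute

for n in (8, 16, 32, 64, 128):
    print(n, "offload wins" if gpu_time(n) < cpu_time(n) else "CPU wins")
# a weaker GPU or slower link pushes the crossover above 32 tokens,
# which matches the 3060 result above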
>>
>>107589637
>>107590228
>why is the cpu build slower than the cuda build
Actually, I misread your post: I thought you were asking about the one data point where the CPU build is faster.
llama.cpp uses GPUs for prompt processing even at 0 GPU layers; that's why adding a GPU makes it faster.
Prompt processing is compute-bound, so it makes sense to temporarily move data from RAM to VRAM and do the calculations there.
>>
If we don't get Gemma 4 soon then Vishnu is dead to me.
>>
>google hid its recent activities
>google hid its recent activities
>google hid its recent activities
>>
>>107590265
thanks omar
>>
I'm glad that the new captcha is filtering out dalits and pakis, so only aryan brahmin can post
>>
>>107590284
TELL ME ABOUT THE BRAHMIN

WHY DO THEY IDENTIFY WITH THE DALIT?
>>
>>107590284
It's 10x easier for me; I don't get how it's filtering anyone.
>>
The only time I ever spend thinking about Indians is when retards insist on dragging their personal grievances into /lmg/.
>>
>>107590329
I think about them when applying for tech jobs. (They get them through nepotism.)
>>
>>107590334
They get all jobs through nepotism.
Once an Indian is put in charge of hiring people, you can guarantee that 99% of future employees will also be Indian.
>>
>>107590343
It's funny because I actually met some competent Indians at a few companies. I assume they stood out because of this.
So many didn't know shit about their job, or really anything, and you'd wonder why/how they got employed while you get put through the third degree in interviews.
>>
>>107589320
>miku but swarthy
yikes
>>
File: cpppppp.png (47 KB, 543x688)
WHY ARE THERE SO MANY
>>
>>107590329
>personal grievances
I would say it's more of a national grievance or even a civilizational grievance at this point.
>>
>>107590343
There's also the explosive diarrhea strategy. Just spam every single venue with your "work" as obnoxiously as you can, farm engagement with any possible strategy, fake it till you make it, and eventually you will get hired by clueless boomers. Indians tend to lack any sense of shame and restraint in this regard.
>>
Local models?
>>
>>107590732
>lack any sense of shame and restraint in this regard
Neither should you. Employment is one of the rare cases where lying, cheating, and scamming are justified because the other side will do the same to you
>>
>>107590782
Local AI tech support sir. Kindly buy a google gist card if you wish to have good local model suggested sir
>>
>>107590825
>lying, cheating, and scamming
And Indians are culturally advantaged with that.
>>
File: mistralsirs.png (164 KB, 590x867)
>>107590782
We can rapidly bring the thread back on topic with picrel.
https://xcancel.com/avisoori1/status/2001332763816083926
>>
>>107590886
yay..
>>
>>107590886
>Local models?
>>
>>107590914
Soon
>>
>>107590136
>>107590179
I'd go straight to a low quant of GLM 4.6 personally; try this in ik_llama: https://huggingface.co/ubergarm/GLM-4.6-GGUF/tree/main/IQ2_KL

DeepSeek R1 at a similar size is too gimped, and it's slower at prompt processing.
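If anyone wants to grab just that quant without cloning the whole repo, something like this should work (the pattern assumes the files sit under the IQ2_KL/ folder shown in the link; double-check the repo tree first):

from huggingface_hub import snapshot_download

# pull only the IQ2_KL split files from ubergarm's GLM-4.6 GGUF repo
snapshot_download(
    repo_id="ubergarm/GLM-4.6-GGUF",
    allow_patterns=["IQ2_KL/*"],
    local_dir="GLM-4.6-IQ2_KL",
)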
>>
>>107590917
Right after Mistral Medium
>>
If fucking Oracle is what causes the crash, I will become the Joker.
>>
File: ComfyUI_temp_hmpvf_00002_.jpg (1.32 MB, 2048x3328)
Can you use your own coder LLM in VS Code, or is it all forced cloudshit? Alternatively, is it even worth bothering with local coding models?
>>
>>107590959
Why?
They are deeply entangled with this mess. Chances are pretty decent.
>>
>>107590886
yjk the bharatian chad got that yellow pussy
>>
>>107591005
Do not redeem the IMAF postings
>>
>>107589609
Gemma 4 so good they calling it Gemma 6. Local sirs are about to wonner bigly. 1 f5 = 1 minute less till Google does needful gooof upload
>>
Just tried Gemini 3 Flash. It's... bad. It knows less than the Pro version and isn't faster (maybe it's a server overloading thing). Maybe they reached the limits of small MoE models.
>>
>>107590999
>deeply entangled
How so? Is there an updated incestual bukkake / "commercial agreements" chart? I thought MS was most on the hook.
>>
>>107590997
No, yes. No.
Now go away. We have enough saarposting as it is.
>>
I'm going to use pyautogui to automate the generation of data for distillation.
>>
>>107590323
>I don't get how it's filtering anyone.
I spent way too long getting them wrong due to overthinking it. Like for the dots one, I assumed it must be position, rotation, or color shading, because the number (and it almost always being the one with 4 dots) seemed way too fucking easy, and surely there was no way they made the new captcha so easy and pointless that even 80 IQ Indians could solve it.
>>
>>107591353
How do you even do model distillation?
Is there a framework out there that does the token matching or do you have to write something yourself?
>>
>>107591259
I don't really care one way or another because it's not local
>>
>>107591379
Distillation is not the correct term when you don't train to match logits, which requires a matching tokenizer. Otherwise you are just training on the outputs.
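For reference, the logit-matching version looks roughly like this (a minimal sketch of the standard temperature-scaled KL loss; it only makes sense when both models share a vocabulary/tokenizer):

import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T=2.0):
    # soften both distributions with temperature T and match them with KL;
    # both logit tensors must be over the same vocabulary
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

# usage: typically mixed with normal cross-entropy on the target tokens,
# e.g. loss = alpha * distill_loss(s, t) + (1 - alpha) * ce_loss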
>>
>>107591432
The entire rest of the professional industry and even common usage now disagrees with you.
>>
>>107591432
Yes, I know. That's why I'm asking about how people do the distillation process.
Are they hand-rolling their own scripts to match the logits, or do existing frameworks like axolotl and unsloth have support for it?
Maybe there's a dedicated framework just for that?
>>
>>107591458
lol they just finetune/train on model outputs
>>
>>107591379
Modern distillation is just generating a question-answer dataset and training on that, not training on logits. If we had them it'd be better, but we don't.
My goal is to finetune a model to make it as close as possible to Sonnet 4.5.
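The boring version of that pipeline, for anyone curious (a sketch only: the endpoint URL, model name, and prompt list are placeholders for whatever OpenAI-compatible API and teacher you actually use):

import json
import requests

API_URL = "http://localhost:8000/v1/chat/completions"   # placeholder OpenAI-compatible endpoint
MODEL = "teacher-model"                                  # placeholder teacher name
prompts = ["Write a short story about a broken heart."]  # replace with your real prompt set

with open("distill_sft.jsonl", "w", encoding="utf-8") as out:
    for prompt in prompts:
        resp = requests.post(API_URL, json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.7,
        }, timeout=300)
        answer = resp.json()["choices"][0]["message"]["content"]
        # one chat transcript per line, in the "messages" format most SFT trainers accept
        out.write(json.dumps({"messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": answer},
        ]}) + "\n")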



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.