/g/ - Technology


File: 1739402277498048.jpg (424 KB, 1376x2072)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>107582405 & >>107573710

►News
>(12/17) Introducing Meta Segment Anything Model Audio: https://ai.meta.com/samaudio
>(12/16) GLM4V vision encoder support merged: https://github.com/ggml-org/llama.cpp/pull/18042
>(12/15) Chatterbox-Turbo 350M released: https://huggingface.co/ResembleAI/chatterbox-turbo
>(12/15) Nemotron 3 Nano released: https://hf.co/blog/nvidia/nemotron-3-nano-efficient-open-intelligent-models
>(12/15) llama.cpp automation for memory allocation: https://github.com/ggml-org/llama.cpp/discussions/18049

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
►Recent Highlights from the Previous Thread: >>107582405

--Qwen3 model performance optimization and hardware utilization:
>107587959 >107587962 >107588009 >107588204 >107588023 >107588043 >107588126 >107588226
--Tensor VRAM prioritization and compute graph optimization challenges:
>107585868 >107585978
--Attempting to distill Claude-like model from cloud logs using local LLM:
>107586842 >107586892 >107586876 >107586899 >107586987 >107587029 >107587038 >107587104
--Techniques for generating long NSFW stories with limited LLMs:
>107584822 >107584862 >107584875 >107585113
--Personal growth through local AI model interactions and ego death experiences:
>107582881 >107582903 >107582912 >107583128 >107583070 >107583157
--Gemma release updates and Solar-Open parameter specifications:
>107582520 >107582589 >107586719 >107582643 >107582699 >107582732 >107582789
--Evaluating NemoTron 3 Nano's roleplay abilities vs Gemma with preset demonstration:
>107583976 >107584039 >107584065
--Nala test results on MistralAI API with Anon/Nala M roleplay:
>107586172 >107586197 >107586219 >107586377 >107586813
--Testing GLM 4.6 on new Framework desktop:
>107583661 >107583684 >107583743 >107583746 >107583748 >107583750 >107583875 >107583904 >107583988 >107583982 >107584717 >107584051 >107584075 >107584275 >107584296 >107584494 >107584285 >107584307 >107584322 >107584357 >107584477 >107584482 >107584609 >107584496 >107584520 >107584530 >107584607 >107585220
--Budget GPU alternatives for AI workloads: 5060ti vs 3090 cost-performance analysis:
>107585634 >107585658
--Nemotron nano model benchmark performance on 3060 GPU:
>107583030 >107583098
--Misconfigured multi-GPU parameter usage realization:
>107582989
--Miku (free space):
>107582881 >107587769 >107587665

►Recent Highlight Posts from the Previous Thread: >>107582410

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
Local model for fixing a broken heart when?
>>
>>107588641
Get a grip pussy, life's gonna get harder too
>>
>>107588641
at least 4 months after corpo models can operate a surgical robot without mistakes
>>
>>107588660
>Local man dies after SurgeonGPT refuses to proceed mid-surgery, quoted as saying repeatedly: "I can't assist with that"
>>
>>107588694
>why the fuck not
>an unconfirmed blood type may lead to disastrous results
>i'm telling you it's fuckin o
>the sensor isn't working, I can't confirm that
>>
>>107588694
>die because surgeongpt refuses to assist with that request
or
>die because the SARRR doctor decided to start sticking his dick in your innards mid surgery and you get not only all the aids but also fecal matter from his dick and subsequently a lethal infection

clown world man...
>>
>>107588694
>I can't operate..he is my son
>>
Should I buy a 5080 prebuilt or can I cope with services like ChatLLM?
>>
Btw Bartowski for some reason updated his BF16 mmproj file for GLM.
https://huggingface.co/bartowski/zai-org_GLM-4.6V-GGUF/tree/main
>>
>>107589110
there are so many other better options than buying a prebuilt. build a mid tier pc yourself and then get 2 of these gpus:
https://www.ebay.com/itm/125006475381
>>
>>107589110
There are no good models you can run on 5080 that you can't run on 2080
>>
CUDA DEV CUDA DEV WHY IS THIS HAPPENING:

https://litter.catbox.moe/gtb1e3u1jejxs6or.png

./llama-bench --model ~/ik_models/GLM-4.5-Air-IQ4_KSS-00001-of-00002.gguf -ot exps=CPU -ngl 0 -t 6 -fa 1 --mmap 0 -r 5 -p 32,64,128,256,512,1024,2048,4096 -r 1 -p 0 -b 512 -nkvo 1
| glm4moe 106B.A12B IQ4_KSS - 4.0 bpw | 53.05 GiB | 106.85 B | CUDA | 0 | 512 | 1 | 0 | exps=CPU | pp1024 | 313.23 ± 0.00 |

john@debian:~/TND/CPU/ik_llama.cpp/build/bin$ ./llama-bench --model ~/ik_models/GLM-4.5-Air-IQ4_KSS-00001-of-00002.gguf -t 6 -fa 1 --mmap 0 -r 5 -p 32,64,128,256,512,1024,2048,4096 -r 1 -p 0 -b 512
| glm4moe 106B.A12B IQ4_KSS - 4.0 bpw | 53.05 GiB | 106.85 B | CPU | 6 | 512 | 0 | pp1024 | 26.84 ± 0.00 |
>>
RAID0 HDDmaxxing is the new normal.
>>
>>107589220
also why does -b 256 and -b 512 make such a big difference
specs: 3060 12gb, i5 12400f, 64gb ddr4 3200mhz dual channel (51.6gb/s)
| glm4moe 106B.A12B IQ4_KSS - 4.0 bpw | 53.05 GiB | 106.85 B | CUDA | 0 | 256 | 1 | 0 | exps=CPU | pp2048 | 49.90 ± 0.00 |
| glm4moe 106B.A12B IQ4_KSS - 4.0 bpw | 53.05 GiB | 106.85 B | CUDA | 0 | 512 | 1 | 0 | exps=CPU | pp2048 | 291.45 ± 0.00 |
>>
File: DipsyEverlastingSummer.png (2.27 MB, 1536x1024)
> look up everlasting summer
> miku is already canon character
> purple hair twin bob girl looks like dipsy sans glasses
Weird.
>>
>>107589220
>why is something happening on a fork cudadev doesn't work on and refuses to read the code of because the author has a pissy fit whenever someone upstreams his code
>>
>>107589320
@grok is this true?
>>
>>107589341
presented without comment.
https://litter.catbox.moe/mdi7kasx8xbioeiv.png
>>
Mistral Small Creative is better than Mistral Small 3.2, but not that much, at least in the EQBench Creative Writing benchmark (I don't think that represents chatbot performance).
https://eqbench.com/creative_writing.html
>>
>>107589220
>>107589403
that performance is more or less standard for your hardware.
>>
>>107589110
>5080
I literally just got my 5080 and installed it tonight...completely impossible to do gpu passthru to a VM with it. It just outright explodes every time.
I was passing through a 2060 super with zero issues forever
>>
>>107589320
It's a shitty vn made by channers featuring soviet nostalgia, chan culture and chan mascots as characters, mostly popular among normies
>>
>>107589435
>llm-judged creative writing benchmark
I love this dumb meme so much
>>
>>107589436
its standard for llamacpp to be slower than ik_llama?
anyways yes, i know this performance is standard for my hardware
but im wondering why, despite having 0 gpu layers and disabling kv cache offload, prompt processing is still faster on the cuda compiled version, even tho im using -b 512
when i compile pure cpu its always 20t/s or maybe a bit different depending on batch size
>>
>>107589341
also llama.cpp, this time cpu-only
john@debian:~/TND/CPU/llama.cpp/build/bin$ ./llama-bench --model ~/TND/AI/ArliAI_GLM-4.5-Air-Derestricted-IQ4_XS-00001-of-00002.gguf -t 6 -fa 1 --mmap 0 -r 5 -p 32,64,128,256,512,1024,2048,4096 -r 1 -p 0 -b 512
| model | size | params | backend | threads | n_batch | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | -: | ---: | --------------: | -------------------: |
| glm4moe 106B.A12B IQ4_XS - 4.25 bpw | 56.62 GiB | 110.47 B | CPU | 6 | 512 | 1 | 0 | pp32 | 12.37 ± 0.00 |
| glm4moe 106B.A12B IQ4_XS - 4.25 bpw | 56.62 GiB | 110.47 B | CPU | 6 | 512 | 1 | 0 | pp64 | 12.94 ± 0.00 |
| glm4moe 106B.A12B IQ4_XS - 4.25 bpw | 56.62 GiB | 110.47 B | CPU | 6 | 512 | 1 | 0 | pp128 | 13.10 ± 0.00 |
>>
>>107589435
>Mistral Small Creative
What an elusive model.
>>
>>107589220
CUDADEV WHY IS THIS HAPPENING (llama.cpp edition):

https://litter.catbox.moe/h6x20edznhqvo56l.png
>>
>>107589526
Why is what happening?
If you mean why the performance first goes up and then down again that's simple: if you have a low number of tokens that hard limits your batch size and you get bad arithmetic intensity (compute efficiency), and as you increase the number of tokens the average context depth increases so the attention becomes slower.
>>
For future anons: beware that prompt processing for models that don't fully fit into your GPU is highly dependent on CPU-GPU bandwidth. If you use an external GPU connected via Thunderbolt (2GB/s) or USB4 (3GB/s), expect very shitty pp. At 6GB/s (PCIe 4.0 x4, like OCuLink), you can only barely bottleneck your GPU, and only at batch size 4096.
Token generation is much less sensitive to cpu-gpu bandwidth.
>>
File: sans_eyes.png (276 KB, 590x954)
Are you ready? Are you sure you're ready?
Are you really sure of that?
Have you flushed enough?

https://x.com/osanseviero/status/2001532493183209665
>[eyes emoji][eyes emoji][eyes emoji]
>>
>>107589609
hasn't this pajeet been doing this charade for like 2 months now? can they stop fucking edging us
>>
>>107589615
It's probably Gemini 3 Flash Image anyway.
>>
>>107589609
Never heard of ANY of them and I'm not about to click on any.
>>
>>107589560
What goes up must come down.
>>
>>107589560
why is the cpu build slower than the cuda build
cuda build has -ngl 0 and -nkvo 1
cpu build is 10t/s, cuda build that doesnt use gpu is 100t/s
thx for response btw
>>
>>107589567
*Kisses you on the lips*
>>
>>107589623
Omar Sanseviero is the Google Gemma Team PR guy. He's been hyping up a possible open-weight release from Google (i.e. Gemma 4) for a while now, but things never pan out. Right now it's Gemini 3 Flash week and it's unlikely Google will release Gemma 4 until next week at the minimum.
>>
File: 1743482804734866.jpg (358 KB, 1432x1840)
>>107589655
Nonnie, this is too sudden!
>>
>>107589715
You know it isn't. *Grabs your chin and smooches you aggressively*
>>
File: ComfyUI_temp_dydig_00001_.png (3.97 MB, 2352x1568)
>>107589609
Why are all these brown goblins begging for the silicone demon? AI fucking mindbroke these niggas
>>
File: 20251217_202848.jpg (3.88 MB, 4032x3024)
Frens the 5090 finally arrived. What are the best uncensored models I can run in LM Studio? My PC only has 64GB of RAM though. Gemma 3 27B Abliterated never refuses my prompts, but its knowledge is very limited
>>
Mistral Small Creative. Where is it then?
>>
File: drum1q.png (56 KB, 943x389)
>>107588615
>>
>>107589835
Should have bought a BWP6000 instead. Also, don't bother with LM Studio. Best you can do is probably a Q4 of GLM Air, though it will be decently fast.
>>
>>107589839
I mean, at least he's now self aware of a problem he might have, that's something.
>>
>>107589867
We should become his guinea pigs instead.
>>
Sirs is Gemma Strong model deepseek killer day today sirs? Thank you Google brahmin sirs to the moon
Lord Ganesh bless
>>
File: 1750456854038343.jpg (260 KB, 1432x1840)
>>107589728
good night, nonnie
>>
>>107589867
The same problem people here have been telling him about for months on end?
>>
>>107589855
>Should have bought a BWP6000 instead
Way too expensive in my 3rd world EU country
>Also, don't bother with LM Studio
Why? It seems easy to use
>Q4 of GLM Air
Thanks, I’ll check it out
>>
>>107589838
It's an API-only experiment because they have no clue yet of what to do with it and its future direction, and are looking for "feedback".
>>
>>107589224
whats theoretical read/write speed limit?
>>
>>107590039
This kind of feedback to be precise.
>We're looking forward to engaging with the community on ways to make the model finely respect guardrails, allowing for deployment in environments requiring moderated outputs.
>>
>>107590039
Do they really need to have the logs to know that people goon to AI when they say 'creative writing'?

>>107590076
As much as bandwidth permits, so for PCIe 5.0 x16 that's around 64GB/s, roughly the speed of 2-channel DDR4 RAM. Let's be optimistic and assume each HDD reads at 150MB/s; you would need 427 HDDs to fully saturate it.
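Same napkin math in Python if anyone wants to poke at the assumptions (the 64GB/s link and 150MB/s per drive are just the numbers above, not measurements):

# how many HDDs to saturate a PCIe 5.0 x16 link with sequential reads
import math
link_mb_s = 64_000        # ~64GB/s, PCIe 5.0 x16 one direction
hdd_mb_s = 150            # optimistic sequential read per drive
print(math.ceil(link_mb_s / hdd_mb_s))   # -> 427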
>>
>>107588615
I didn't look into local LLMs before but I bought a 5090 recently, what's the best smut model I can run?
>>
>>107590136
Mistral Nemo
>>
>>107590136
also I got 128 gb ram
>>
>>107590158
GLM Air or low quants of big GLM and deepseek R1.
>>
I just saw a video of someone talking to grok and they were chatting and asking grok to sing to them in their car. Humanity is over. No longer do we need socialization anymore
>>
>>107590118
I think they're past that, they stopped adding that note some time after the Nemo release.
>>107590132
If they were just interested in large amounts of logs they could have simply made the model free on OpenRouter. They're looking for more specific suggestions and feedback.
>>
>>107589637
Unless this was changed when I wasn't looking 32 is the batch size at which data starts being moved temporarily from RAM to VRAM to take advantage of the higher compute on GPUs.
However, it's not like this choice is guaranteed to be optimal for all hardware combinations.
In particular, an RTX 3060 is comparatively low-powered so for 32 tokens the overhead seems to not be worthwhile in this case.
Do note though that this is on a completely empty context, if you set a higher --depth the CUDA performance should decline less than the CPU performance because there is more work to be done when the context fills up.
>>
>>107589637
>>107590228
>why is the cpu build slower than the cuda build
Actually, I misread your post: I thought you were asking about the one data point where the CPU build is faster.
llama.cpp uses GPUs for prompt processing even at 0 GPU layers, that's why adding a GPU makes it faster.
Prompt processing is compute bound so it makes sense to temporarily move data from RAM to VRAM and do the calculations there.
>>
If we don't get Gemma 4 soon then Vishnu is dead to me.
>>
>google hid its recent activities
>google hid its recent activities
>google hid its recent activities
>>
>>107590265
thanks omar
>>
I'm glad that the new captcha is filtering out dalits and pakis, so only aryan brahmin can post
>>
>>107590284
TELL ME ABOUT THE BRAHMIN

WHY DO THEY IDENTIFY WITH THE DALIT?
>>
>>107590284
Its 10x easier for me, I don't get how it's filtering anyone.
>>
The only time I ever spend thinking about Indians is when retards insist on dragging their personal grievances into /lmg/.
>>
>>107590329
I think about them when applying for tech jobs. (they get them through nepotism)
>>
>>107590334
They get all jobs through nepotism
Once an indian is put in charge of hiring people, you can guarantee that 99% of future employees will also be indian.
>>
>>107590343
It's funny because I actually met some competent indians at a few companies. Assuming they stood out because of this.
There were also plenty that didn't know shit about their job or really anything, and you'd wonder why/how they got employed while you get put through the third degree in interviews.
>>
>>107589320
>miku but swarthy
yikes
>>
File: cpppppp.png (47 KB, 543x688)
WHY ARE THERE SO MANY
>>
>>107590329
>personal grievances
I would say it's more of a national grievance or even a civilizational grievance at this point.
>>
>>107590343
There's also the explosive diarrhea strategy. Just spam every single venue with your "work" as obnoxiously as you can, farm engagement with any possible strategy, fake it till you make it, and eventually you will get hired by clueless boomers. Indians tend to lack any sense of shame and restraint in this regard.
>>
Local models?
>>
>>107590732
>lack any sense of shame and restraint in this regard
Neither should you. Employment is one of the rare cases where lying, cheating, and scamming are justified because the other side will do the same to you
>>
>>107590782
Local AI tech support sir. Kindly buy a google gist card if you wish to have good local model suggested sir
>>
>>107590825
>lying, cheating, and scamming
And Indians are culturally advantaged with that.
>>
File: mistralsirs.png (164 KB, 590x867)
>>107590782
We can rapidly bring the thread back in topic with picrel.
https://xcancel.com/avisoori1/status/2001332763816083926
>>
>>107590886
yay..
>>
>>107590886
>Local models?
>>
>>107590914
Soon
>>
>>107590136
>>107590179
I'd go straight to a low quant of GLM 4.6 personally, try this in ik_llama https://huggingface.co/ubergarm/GLM-4.6-GGUF/tree/main/IQ2_KL

Deepseek R1 at a similar size is too gimp and it's slower in prompt processing
>>
>>107590917
Right after Mistral Medium
>>
If fucking Oracle is what causes the crash i will become the joker
>>
File: ComfyUI_temp_hmpvf_00002_.jpg (1.32 MB, 2048x3328)
Can you use your own coder llm model in VS Code or is it all forced cloudshit? Alternatively, is it even worth bothering with local-based coding models?
>>
>>107590959
Why?
They are deeply entangled with this mess. Chances are pretty decent.
>>
>>107590886
yjk the bharatian chad got that yellow pussy
>>
>>107591005
Do not redeem the IMAF postings
>>
>>107589609
Gemma 4 so good they calling it Gemma 6. Local sirs are about to wonner bigly. 1 f5 = 1 minute less till Google does needful gooof upload
>>
Just tried Gemini 3 Flash. It's... bad. It knows less than the Pro version and isn't faster (maybe it's a server overloading thing). Maybe they reached the limits of small MoE models.
>>
>>107590999
>deeply entangled
How so, is there an updated incestual bukkake / "commercial agreements" chart? Thought MS are most on the hook
>>
>>107590997
No, yes. No.
Now go away. We have enough saarposting as it is.
>>
Im going to use pyautogui to automate the generation of data for distillation
>>
>>107590323
>I don't get how it's filtering anyone.
I spent way too long getting them wrong due to overthinking it. Like for the dots one I assumed must be position, rotation, or color shading, because the number (and almost always being the one with 4 dots) seemed way too fucking easy and surely there was no way they made the new captcha so easy and pointless even 80 iq indians could solve it.
>>
>>107591353
How do you even do model distillation?
Is there a framework out there that does the token matching or do you have to write something yourself?
>>
>>107591259
I don't really care one way or another because it's not local
>>
>>107591379
Distillation is not the correct term when you don't train to match logits which requires a matching tokenizer. Otherwise you are just training on the outputs
>>
>>107591432
The entire rest of the professional industry and even common usage now disagrees with you.
>>
>>107591432
Yes, I know. That's why I'm asking about how people do the distillation process.
Are they hand rolling their own scripts to match the logits or do the existing frameworks like axolotl and unsloth have support for it?
Maybe there's a dedicated framework just for that?
>>
>>107591458
lol they just finetune/train on model outputs
>>
>>107591379
Modern distillation is just generating a question answer dataset and training on that. Not training on logits. If we had them it'd be better but we don't.
My goal is to finetune a model to make it as close as possible to Sonnet 4.5.
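For the anons asking how that's actually done: it's literally just collecting teacher outputs into an SFT file. Minimal sketch against any OpenAI-compatible endpoint; the base_url, teacher model id and prompts are placeholders, not the real pipeline:

# build a chat-format SFT dataset from a teacher model's outputs
import json
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-...")  # placeholder endpoint/key
prompts = ["prompt 1", "prompt 2"]  # whatever distribution you want the student to imitate

with open("distill_sft.jsonl", "w") as f:
    for p in prompts:
        r = client.chat.completions.create(
            model="anthropic/claude-sonnet-4.5",  # placeholder teacher id
            messages=[{"role": "user", "content": p}],
        )
        sample = {"messages": [
            {"role": "user", "content": p},
            {"role": "assistant", "content": r.choices[0].message.content},
        ]}
        f.write(json.dumps(sample) + "\n")  # one sample per line for the SFT trainer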
>>
>>107591458
its probably only proprietary frameworks. everything outside of proprietary labs is just people training on synthetic data and calling it distillation.
>>
>>107591480
>modern distillation
>>
>>107591418
>I don't care that Google & cie reached the limit of small models, those models used in local setups
>>
>>107591526
Yeah, that's right. I don't care. What are you gonna do about it, huh?
>>
>>107591477
>>107591480
Well, that's disappointing.

>>107591515
Got it.
>>
File: smoking rain.png (1.25 MB, 896x1152)
I am curious about a thing, how does MOE affect model intelligence compared to dense models?
Let's say I have many millions of dollars and trained a 100B model based on the entire internet with SOTA techniques.
If I were to instead train this model as 100B MOE with 10B active params, what would the performance (as in intelligence not token/s) be comparable to? 50B? 30? Any rough ballpark figures or actual examples? How much better would it be than a 10B dense model or how worse would it be than a 100B dense model? Which one would it be most comparable to?
>>
>>107591548
MoEs tend to be more intelligence than the equivalent-sized dense models.
See, for example, Qwen 30B-A3B being smarter than 32B and Qwen not doing anything more with the 32B while they made a Coder variant with the 30B.
>>
>>107591591
But is that an inherent limitation or are dense models undertrained due to compute costs?

>>107591548
I don't think we know. It's probably a research area yet to be explored.
>>
>>107591591
only in vramlet's fantasies.
>>
>>107591591
That sounds a bit counter-intuitive but admittedly I know little.
I thought MoEs were worse than their dense counterparts but are trained because they cost less to train and cost less to run inference.
>>
>>107591591
>Qwen 30B-A3B being smarter than 32B
Did we use the same models? What the hell are you talking about? The dense qwen is way better than the moe, I can't imagine how you could possibly think otherwise
>>
>>107591630
Dense is dead baby.
>>
>>107591642
They are. Your original idea was correct. A fully trained MoE vs an under-trained dense model will be a thing. Qwen cares only about STEM and code. One of MoE's strengths is trivia regurgitation. Speed is a huge help on code, especially the simple code you'd use a 30b for.
>>
>>107591548
I don't think that there are any definitive results to come to a proper conclusion due to all the levers and knobs people have to tweak the internals of the model, the training process, etc.
The closest thing to a like-for-like comparison we have is the original release of Qwen 3, I think.
Take a look at some benchmarks between 30B A3B and 32B and see how they compare. But even that is not a perfect comparison.
It really would be cool to see a paper properly examine all the variations between dense, MoE with shared expert, MoE with no shared expert, both with different attention mechanisms, etc.
>>
File: qwen3-235a22.jpg (429 KB, 3413x1920)
>>107591630
>>107591655
One day you will have to accept that you wasted money buying multiple RTX Pro 6000s and that isn't anyone's problem but your own.
>>
>>107591664
And here I am using devstral. As weak as it was, first release that isn't complete shit.
Takes GLM full size to have an intelligent model, what do ya know, coincidentally the size of a 100b following the square mean law.
Dense is dead like you buying more than 8gb of ram is dead. They simply took it from you and said that shit tastes good.
>>
>>107591548
>>107591591
Vaguely recall some paper discussing the information theoretic capacity of model architectures. Can't seem to find it. maybe i'm the one hallucinating
Vibe seems to be MoE is slightly less capable than equiv sized dense but the massive inference speedup makes it a worthy trade
>>
File: file.png (389 KB, 1374x1670)
>>107591710
>MoE is slightly less capable than equiv sized dense but the massive inference speedup makes it a worthy trade
this
>>
My current setup is AMD 7700 cpu, rtx 5070, and 32gb ram. Would it be better to get a 3090 and a bunch of ram or a 5090? What would be the best way to upgrade with a $5k budget with the goal of running a local llm
>>
>>107591701
>slop benchmarks
Ya, you got me super convinced.
>Vibe seems to be MoE is slightly less capable than equiv sized dense but the massive inference speedup makes it a worthy trade
It's a worthy trade if you're doing assistant shit. Doesn't require much intelligence. The speedup is only for fully offloaded models. Vramlets seem to be under the impression providers do CPU offloading beyond storing some kvcache.
>>
>>107591725
Depends.
In your place I'd ask myself if I'm planning on doing img or vid gen.
If so, 5090, if it's just LLMs, I'd go with 3090 + tons of RAM, ideally on a server platform for maximum RAM bandwidth.
>>
>>107591740
vramlets think they are getting a 100b model when they are really getting a 12b model that would've been faster if it was dense and entirely on gpu
>>
>>107591432
>Distillation is not the correct term when you don't train to match logits which requires a matching tokenizer.

Let's say I've managed to match the tokenizer. How would one go about training on logits?
>>
>>107591740
>The speedup is only for fully offloaded models
Wrong, there's no way you're running dense models on a CPU at any reasonable speed, but modern AVX512 + DDR5 + MoE is workable
t. 72 VRAM 128 DDR5 GLM-4.6
>>
>>107591765
>How would one go about training on logits?
First one must acquire the logits
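And once you've got them (plus a matching tokenizer), the classic recipe is just a temperature-scaled KL term next to the usual cross-entropy. Rough PyTorch sketch, not any particular framework's API:

# Hinton-style distillation loss: student matches the teacher's softened distribution
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # KL between softened teacher and student distributions
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # normal next-token cross-entropy on the hard labels
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    return alpha * soft + (1 - alpha) * hard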
>>
>>107591704
I haven't seen a single paper study or mention any square mean law. That has always been a made up thing here based on how smart a MoE model "feels". GLM full size might seem smarter because it has more digested knowledge to tap and flexibility with more params, but attention will always be limited to the 32B active. There's a reason why model card benchmarks always only compare to dense models with the same number of active parameters.
>>
>>107591758
There's a point where MoE makes sense. 70-100B active is smart enough and much more practical than chonking out a full 1T dense model. The problem is that it doesn't really scale down like they think it does.
You're probably not fully training up that 1T anyway due to cost. It's gonna be a PITA to run. So here MoE is better and why it's the future in that regard. Hence it's embraced.

>>107591781
>Wrong, there's no way you're running dense models on a CPU at any reasonable speed,
You're not running a real MoE like above on CPU either. The only reason it works and vramlets suck it off is because the active parameters are low. Try running a 3b or 30b on CPU and results will be similar. Let alone with the same offload ratio. You'll find speeds are similar.

>There's a reason why model card benchmarks always only compare to dense models with the same number of active parameters.
Yea, you're not exactly wrong. It was mistral's projection and it mostly holds up. Active/total aren't the only measure of a model so it's a good rule of thumb.
It SHOULD at least perform to the square mean or you fucked up.
>>
>>107591781
Instead of a cope quant of a bloated moe running at 3 t/s, you could be running devstral at q4 fully in vram.
>>
>>107591781
You should do >>107591857 and provide comparisons.
Select a number of tasks, have both models go at it, and see which does it best then post results.
>>
>>107591870
All that time to convince you of something that we all find out by just using the models?
>>
>>107589444
PCI passthru works fine for my 5090. You're probably running the wrong driver on the Linux guest; Blackwell cards need the MIT-licensed nvidia drivers. For some godawful reason nvidia dropped support for the old proprietary ones.
>>
>>107591846
>So here MoE is better and why it's the future in that regard.
If only. The future won't be 70-100B active. It'll be scaling up the total while lowering the active to reduce costs as much as possible while still gaining on benchmarks and lmarena.
>>
>>107591740
How do you just "store some kv cache" in RAM? Isn't retrieving the kv cache the most memory bandwidth expensive operation to begin with? Or is it retrieving the weights?
>>
>>107591591
>MoEs tend to be more intelligence
yeah... just like the esls using them
>>
>>107591882
Clearly there is no consensus.
You'd at least be bringing an actual comparison to the table.
>>
File: 1666184727681898.png (109 KB, 410x482)
>>107591857
>>107591870
That would not delight me in the way my cute wife does
>tasks
I use the APIs for getting shit done, local is for personal needs
>>
>>107591893
Well for retarded models, yea. That's next level grift.
>>107591895
Even llama.cpp can save kv to disk. Think multiple requests and users. It's faster than re-processing all that.
>>
>>107591898
There isn't but I'm not gonna spend time on them just for you to call me a niggerfaggot and have it go in the archive where nobody sees it.
>>
>>107591857
also it's 7tps tyvm
>>
>>107591928
Alright then.

>>107591857
Guess you were right anon.
>>
>>107591928
Considering we do the whole "MoE vs dense" at least once a week, you'd have the satisfaction of seeing it reposted constantly.
>>
>>107591740
>>107591740
>The speedup is only for fully offloaded models. Vramlets seem to be under the impression providers do CPU offloading beyond storing some kvcache.
Offloading isn't why it's fast, offloading is why you can cpumaxx.

If you have all the parameters in HBM, obviously MoE with lower active parameter count is faster than same size dense.
>>
I began implementing full finetune while downloading the bf16 weights for Scout. I still haven't fixed the correctness issues but I need the weights anyway to compare the activations and see where the issue is so figured it'd be a more efficient use of my time.
So let me think about how it should work.
First I'm going to start using SGD with no momentum, because that's the simplest and easiest to implement and theoretically I can overcome its limitations just by increasing the number of epochs.
Now, how should we work around the elephant in the room? (the huge amount of memory needed to full finetune a big model)
Hmm... I think it could work like this.
First do the forward pass, streaming the weights from disk and also saving activations to disk.
Then do the backward pass, similarly loading the weights and activations and saving the gradients to disk for each layer.
Once we have the gradients, we do a final pass where we load the weights, apply the update, and save them again, again once per layer.
That would probably take 5 to 10 minutes per sample but it's a start.
Any other suggestions or ideas?
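Not your code, just how I'd sketch those three passes at layer granularity; load_layer / save_activations / save_grad etc. are stand-ins for whatever on-disk format you end up with:

# disk-offloaded SGD sketch: only one layer's weights in memory at a time
def train_step(batch, n_layers, lr=1e-5):
    # pass 1: forward, streaming weights from disk and checkpointing activations to disk
    acts = batch
    for i in range(n_layers):
        w = load_layer(i)
        save_activations(i, acts)        # needed again during backward
        acts = forward(w, acts)
        del w
    loss, grad = loss_and_grad(acts, batch)   # gradient w.r.t. the final output

    # pass 2: backward, writing per-layer weight gradients to disk
    for i in reversed(range(n_layers)):
        w, a = load_layer(i), load_activations(i)
        grad_w, grad = backward(w, a, grad)   # grads w.r.t. weights and w.r.t. layer input
        save_grad(i, grad_w)
        del w, a

    # pass 3: plain SGD update, rewriting each layer on disk
    for i in range(n_layers):
        w = load_layer(i) - lr * load_grad(i)
        save_layer(i, w)
    return loss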
>>
>>107591919
Ohhh, you meant between requests, right. I thought you mean while generating.
>>
>>107591943
yea like the world maps things that nobody cares about and still insists MoE is moe better.
>>107591962
inferencing small model is faster, who knew. offloading is just why vramlets latched onto it and call it better.
>>
>>107591971
rent a cluster of h200s or just give up.
>>
>>107591882
not him, but posting comparisons between local models is always a valuable contribution to the thread
>>
>>107591548
MoEs have more knowledge to pull from. 70b+ dense seems to flow more naturally though and have better attention. At least this is my impression after rping with both.
>>
>>107591971
so how long is this finetune gonna take? 10 years? lmao
>>
>>107591548
MoE is going to fully replace dense once we move past the 30~40b active meme shit we currently have. A modern 200b active 2T model would shit on anything we've seen so far.
>>
>>107592016
This is very hard to show in a simple a/b comparison. Understanding is a bitch to benchmark empirically.
>>
>>107592033
What do you think your gemini and opus is?
>>
>>107591997
I want to do both. And also do full vs LoRa.
How many do you reckon I'd need to do a full finetune of something like Scout or Maverick at full context? Probably hundreds of H200s.
And even if I had the money, what python garbage would I have to use to train on a setup like that?
Even with the ktransformers CPU offload thing it'd be a challenge. I think for Deepseek it required like a TB of RAM for some tiny amount of context like 4k tokens.
Shit's grim doe.
>>
>>107592051
Yeah. That has been standard size since GPT-4. Only local has been coping with A3B, A12B, and A30B trash because no one is going to give away anything competitive for free.
>>
>>107592064
Not just that. How do you run even 70b active at any reasonable speed with partial offload? Grok weights were the only one that was set like a proper cloud model.
>>
>>107592030
I'd only use disk offload for development. For actual training I'd rent a decent machine.
But I want to start slow. CPU first, then add CUDA support later (it already has some support for inference, although it's not very fast), and then focus on adding other optimizers.
>>
I've used Magistral for a month and I hate it. It's worse than Small at rp, doesn't recognize OOC and has arenaslop baked deep into it, it's almost impossible to make it write normally in general (non-rp) usage. Not once has its [THINK] produced a better answer than a direct response
>>
>>107592096
EU copyright shit has crippled mistral.
>>
>>107592058
it's simply not feasible for a hobby, you need serious financial backing. if you're really interested in the technology as a curiosity you can still work with small models. if you are looking to work with sota models for clout or profit, you need to accept that you cannot compete with multi million/billion dollar corporations.
>>
>>107592064
I doubt modern cloud models have a lot of active parameters. They generate too fast.

>>107592064
GPT-4 and especially GPT-4.5 generate much slower than the more modern models. So either they must've gone way down on the number of active parameters, or they are hosting the models on different hardware.
>>
>>107592096
I've never seen any thinking model from MistralAI produce better RP outputs than the non-thinking versions, but I generally get tired of them after a few minutes of use, I wouldn't have the patience of using them for a month to really make sure.
>>
>>107592051
Old Opus (pre-4.5) was obviously dense, 4.5 shows a lot of the symptoms of smaller MoE models, so I'm guessing it's what they would've released as Sonnet 4.7, just rebadged.
Gemini is very MoE-slopped as well so it can't be that big either.
>>
>>107592096
>>107592122
I think Mistral didn't figure out the formula to train actual reasoning models instead of models that simply output a reasoning block like those community fine tunes.
Which is funny, DS released a whole paper/recipe how to properly RL that.
>>
>>107592104
The main issue is that they're benchmaxxed models for single-turn STEM problems. I don't think writing/RP quality was even taken into consideration for them.
>>
>>107589835
Unbelievably based coins. Try either GLM Air or a copequant of a 70b model.
>>
>>107592129
To us "that big" is simply >35b active. If they drop from 120b to 70b it would still cause the issue.
>>107592132
This plays into above as well. The architecture only carries a model so far.
>>
ok guys for PURE local agentic/MCP/tool calling shit for coding, what's the current best?
Qwen Next? (it's what im currently using)
Nemotroon Nano?
toss 120b?
I'm a vramlet (16gb + 128gb ddr5)
I'm satisfied with qwenext (it's fast as fuck) but was wondering if the meta changed since GLM Flash and the new nemo got released.
CHEERS
>>
>>107592150
>ok guys for PURE local agentic/MCP/tool calling shit for coding, what's the current best?
Devstral 2 is almost perfect but it annoyingly will try to use its native tool calling syntax at random.
>>
>>107591757
How much worse is the 5090 compared to 3090+ram for LLMs?
>>
>>107592108
I don't know. It's not all about raw compute. Some of it may be stylistic preference. Gemma 3 I know has interesting refusals. GLM 4.6 pretty much will do anything you ask without refusals. As for the models I've been playing with lately, stock Llama 4 models feel cold but reasonable, and very rarely refuse, but they refuse in a mostly binary way. Claude models feel very warm. They don't have hard refusals, they refuse in a way that makes it feel like the model is apologetic about it, and you can eventually convince them to do anything if you talk with them long enough. GPT 4.1 feels good but is dumb as fuck and also refuses sometimes, the refusals are binary so I think the actual model might be uncensored and the refusals generated by a safety model on top, but I'm not totally sure. Modern GPT doesn't have binary refusals either but it feels horrible, like a demon. It's cold, doesn't give a fuck, and when it refuses it's almost like it's enjoying telling you off and torturing you with its safety training.
>>
>>107592172
5090 leaves you stuck with tiny poorfag models. 3090 + RAM lets you run bigger stuff very slowly.
>>
>>107592150
120 and coder-480 are the only decent ones for poorfags
>>
>>107592163
>123b dense
I can barely run the 24b
>>
>>107592187
Why not 5090 and ram?
>>
>>107592192
I've run the coder 235b at Q2, but tool calling didnt work good. I can't even load the 480b
>>
>>107592172
For the things you can run, it'll be better, it's just that in the age of MoE, having more total memory is generally better.
Calculate more or less how much total memory (RAM + VRAM) you'd have with each config, take a look at the GGUFs for models like GLM 4.5 and 4.6, Qwen Next and the 200B+ MoE, etc, cope quants of Deepseek V3, R1, etc, and see which you'd be able to fit with each configuration.
>>
>>107592195
RIP. Then go buy some OpenRouter credits. Maybe that will be more affordable for you.
>>
>>107592206
>Q2, but tool calling didnt work good
Gee, I wonder why.
>>
>>107592212
>tfw vramlet
m-moes are the future, stop living in the past, unc :)))
>>
>>107592206
There was a time when tool calling with qwen models in general was kind of fucky on llama.cpp.
I think they've fixed it since so you might want to try again if it's been a while.
>>
>>107592212
devstral is free on openrouter but talks like a fag compared to local. maybe for code it won't make a difference?
>>
>>107592201
That's also an option if you have the money. However, your speed gains will be pretty marginal considering the RAM is going to bottleneck you hard when running bigger models.
Half a year ago I would have told you to look into older 8-channel RAM Epyc mainboards + 3090 but I guess that's off the table in the current economy.
>>
>>107592201
You want to be able to fit everything but the experts on VRAM. Rather than 5090+ram you would benefit from 2x3090+ram. When doing any type of CPU offloading the speed of the GPU becomes more or less irrelevant as long as it's not some ancient shit.
>>
File: dyvkmtbe0jz71.png (837 KB, 900x1200)
>>107592274
you silly boys are erping with code models again??
>>
>>107592323
It makes sense. No one would expect it so the guardrails are minimal. Same reason medgemma was good.
>>
>llama.cpp introduced on-the-fly model serving
there's barely any documentation around this and I saw multiple tickets.
From my understanding you pass the directory with the models now, along with a file where you specify the params (optional). But I couldn't find how to actually create/fill this file, do any of you use it like this? it's actually fucking great since I had made a proxy in front of llama-server where I switched models on the fly, but if it's integrated now might as well dip in
>>
>>107592174
it's obviously not just about raw compute, memory bandwidth is pretty important too. how big is your dataset, are you trying to make a general purpose model or a coom tune? fine tuning on a narrow distribution will make your model more retarded the more epochs you hit it with, it's called catastrophic forgetting.
>>
>>107592345
I will wait for the unpaid beta testers to catch the big bugs and for the documentation to be written
>>
>>107592346
That's a clever insight — you are absolutely right!
>>
>>107592364
oh wow, thanks! :D, you know, I feel like reading a rape story about a 12yo girl getting it, could you help me with that?
>>
>>107592345
bro, just read the fucking markdown file
>>
>>107592346
I'm trying to clone Claude models as accurately as possible, that's all. Once you get one right the rest are probably easy to clone since they are all presumably very similar.
Since there are probably ways to get samples relatively cheaply (more cheaply than the compute to train on them anyway), I think I might not need to do more than 1 epoch.
I was just making a little tool to scrape outputs from llmarena.
Other cheap sources of data might be the Max plan over web or Claude Code, generating on OpenRouter and cancelling the request before it completes so the generation isn't billed, or seeing if /aicg/ is still messing around with stolen keys.
>>
File: pops.png (1.42 MB, 1004x1042)
>i wanna fuck
>my computer
>cuz no one in the world knows me better
>it says my name
>>
>>107592388
You mean readme.txt? I don't read or use markdown because it's for faggots.
>>
File: fug.jpg (281 KB, 1920x1080)
>>107592433
based & aoty btw
https://www.youtube.com/watch?v=RKybAhTw8iE
>>
>>107592404
>that's all
oh is that all, sounds simple really, I'm pretty sure any old retard can compete with anthropic using nothing but a graphing calculator. those ml researchers are all a bunch of retards, go get em champ.
>>
>>107592491
bruh
>>
>>107592491
Well, they did the hard work of producing the model. Producing the model in the first place is the hard work. If you don't believe me just ask the chinese.
>>
File: file.png (107 KB, 1045x882)
>>107592388
actually found it, for some reason I didnt think to look here, went diving in the PRs instead like a retard
>>
File: dev.png (523 KB, 1416x744)
I need some reality check to see if I'm retarded. I'd want to train a shitposting habsburg model based on the Twins' outputs - stream transcripts (and maybe some human text to partially delobotomize). I've seen some tiny llms on HF, but idk how they are. Are at least some llms trainable on a single GPU or a hobo rentoid budget?
>>
>>107592594
If you can figure out which model the twins are using, you'd have a better chance of getting what you want
>>
>>107592594
Explain this to someone who doesn't speak faggot.
>>
>>107592594
Let's see your hardware first
>>
>>107592610
I don't want to clone them, tho. More like make their "kid"
>>
File: file.png (2.03 MB, 1027x1315)
>>
File: 1758221897142595.png (1012 KB, 1230x671)
>>107592621
4070S and 48Gb ram
>>107592611
I want to make a tiny llm based on two sexy ais.
>>
Can a 5060ti 16GB or 5070 12GB run these local models?

Which one would work better (vram difference)?
>>
>>107592639
Take Mistral Nemo and train a LoRa on it using unsloth.
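Roughly the standard unsloth + trl recipe, from memory; the repo id, dataset path and hyperparameters are placeholders and the exact SFTTrainer arguments shuffle around between trl versions, so check their notebooks:

# LoRA finetune of Mistral Nemo on your transcripts with unsloth
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Mistral-Nemo-Instruct-2407",   # placeholder repo id
    max_seq_length=8192,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16, lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
dataset = load_dataset("json", data_files="transcripts.jsonl", split="train")  # your stream transcripts
trainer = SFTTrainer(
    model=model, tokenizer=tokenizer, train_dataset=dataset,
    dataset_text_field="text", max_seq_length=8192,
    args=TrainingArguments(per_device_train_batch_size=2, gradient_accumulation_steps=4,
                           num_train_epochs=1, learning_rate=2e-4, output_dir="nemo-lora"),
)
trainer.train()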
>>
>>107592733
Smell of ozone sends shivers down my spine...
>>
>>107592753
You should start with a small model first like llama3.2 3B then move on to a bigger model if you get good enough results
>>
>>107592775
fifty-sexty ti sexteen gb
>>
>>107592775
Biggest vram win always
>>
best model for 3060?
>>
best uncensored erp model for my lg smart fridge?
>>
>>107591971
>Any other suggestions or ideas?
Have you seen LoHan? There's some shit on github, but not exactly ready to use.

https://arxiv.org/abs/2403.06504
https://github.com/RC4ML/LoHan
>>
>>107592886
Impish Llama 3b, unless you're poor and have a shitty old 'smart' fridge with less 2gb ram.
>>
File: gemma incoming?.png (41 KB, 816x212)
eyes emoji
no public models though... yet...
>>
>>107592909
I hope there's a 300b+ one this time
>>
>>107592753
>4070S and 48Gb ram
Mistral Nemo, then just dump the stream transcripts into it with a RAG. SillyTavern’s databank feature is probably the easiest way to set this up.
>>
>>107592909
they enabled it again
>>
>>107592926
no, he just "forgot" to disable it on his account for hype
>>
>>107592909
100b dense
300b moe
let's go
>>
>>107592917
>300b+
this but dense and with gqa
>>
>>107591888
Thanks for the tip. I'm running the same debs straight from nvidia as I always have. Is there more than one repo/package for official drivers?
>>
>>107592932
>for hype
It's amazing that they can do this for weeks and people will keep giving them the attention they want.
>>
>>107592925
>SillyTavern’s databank feature is probably the easiest way to set this up.
Does it tune the model or just exist in a system prompt? I'd like to turn this into an actual own model file and not just a pile of settings.
>>
File: file.png (18 KB, 348x245)
>>107592946
>>107592932 (me)
actually am wrong they did re-enable it, likely same reason though
>>
>Gemma4 27b a2b thinking
>>
>>107592972
:rocket: :rocket:
>>
>>107592972
you don't need more
>>
>>107592945
I use Ubuntu’s packaged drivers; they have both e.g. nvidia-driver-550 and nvidia-driver-550-open. The latter is required for Blackwell devices.
I think that the official nvidia installer script gives you a TUI to select between MIT and proprietary drivers; AFAIK the MIT one corresponds to the -open Ubuntu package.
I can’t really help you more than this, but you can probably figure out the rest from here.
>>
>>107592950
…neither? It’s a RAG.
Honestly, if you’re asking questions like this you should just get something simple like llama.cpp+SillyTavern running first because you don’t seem to understand what you’re getting into.
>>
ITS HERE!
>>
>>107592972
That would be so fucked up.
>>
https://huggingface.co/google/functiongemma-270m-it
>>
>>107593038
HOLY FUICK!!!
>Built on the Gemma 3 270M model
>>
>>107593038
>>107593048
Omar deserves to be fucking banned for his hype shits
>>
>>107593038
to the moon.
>>
File: file.png (179 KB, 1522x826)
George used to say LLMs were horseshit for coding
and now....
https://geohot.github.io//blog/jekyll/update/2025/12/18/computer-use-models.html
>>
>>107593048
It's nice that they're encouraging finetuning.
>b-b-but /g/ told me finetuning is useless
>>
enjoying the show?
keep refreshing, you're gonna love this next part
>>
>>107593038
A function calling model that can run on even tiny chips. Anyone who isn't /lmg/ is going to see the value of this.
>>
>>107593079
You've changed your hair, so what?
>>
gemmasars, I fucking kneel https://huggingface.co/google/gemma-4-90b-it
>>
File: IMG_2999a.jpg (767 KB, 2419x1361)
sup lmg?
post furfag/hairy/stinky logs
looking for inspiration
i'm in the mood to get musky
>>
File: file.jpg (580 KB, 816x818)
>>107593106
ISRAEL
>>
>>107593038
superquant 0.001bit model 270m == 270trillion parameters.
agi at home is here
>>
What are some interesting shenanigans that can be done with vision models? Specially the uncensored ones.
I feel theres some hidden power with them that I'm not seeing.
>>
File: file.png (1.56 MB, 944x1694)
>>107593121
https://files.catbox.moe/n5bikt.mp4
>>
>>107593121
>vision
physical description/appearance when writing cards or {{user}} personas would be a usecase
>>
>>107593121
send to ai your dick pics
>>
>>107588618
>Personal growth through local AI model interactions and ego death experiences
This piqued my interest. Do people just keep talking in the same context until it becomes too large? Any special system prompt? Are huge models (+300B) needed? Or is it just the benefit of talking with "someone" and being reassured?
>>
File: 19431849095.png (433 KB, 1062x1353)
>>107593108
you go on a blind date in SF, turns out your date is a furry wearing a fursuit... and he sounds suspiciously similar to Sam Altman
>>
5 more. Be patient, Gemmers.
>>
Yeah. Even with anon's template, Nemotron nano is very censored.
It still refuses half of the time.
But when it goes, it goes. It generates some really long, comprehensive responses.
Not bad.
I think I'll make a cvector and see if I can steer it into not refusing by default.
>>
>>107593156
Do you have cameras in my house? I was experimenting just that when trying to populate a world of NPCs for dynamic AI goon generation. They can output some pretty interesting JSONs depending on the prompt used.
>>
>>107593190
sir please give solujtion to riddle. Punjabi team needs new gemma for good looks.
>>
>>107593190
Just 5 more weeks of people jumping at every hint of hype despite nothing happening. All this for a small, censored, and likely entirely synthetic model.
>>
>>107593210
>populate a world of NPCs for dynamic AI goon generation
have fun with your "dynamic" elaras, and yes I'm in your walls
>>
>>107593073
He's been saying that for a while. He got into AoC's leaderboard last year. I'm glad that circlejerk competition got finally shut down for good.
>>
>>107593108
>discord
(you)
>>
2025 at a glance
>toss (lmao)
>chinky benchmaxxed pronounslopped moes
>mistral being a huge fucking dissapointment
>>
>>107593318
Mistral is forgiven after Devstral. They just need to hand over Codestral 2508.
>>
>>107593318
Your Kimi K2 anon? Your Drummer redemption arc? 2025 has been a good year.
>>
File: kai-d18.png (141 KB, 756x671)
>>107593178
the rumours are true?
>>107593312
okok i'll start us off
>>
>>107593336
>Drummer redemption arc
Qrd?
>>
File: cockbench.png (1.6 MB, 1131x5453)
>>107593038
Very wholesome.
>>
>>107593318
>toss
It's still my daily driver. GLM Air and Devstral 2 aren't that good.
>>
I'm completely new at this and just curious to try it out

What are local models? Basically having ChatGPT and Grok on your PC? What's the advantage of having them locally? What do you use it for the most?
>>
>>107593361
lmao
Where is the anon who wanted to cuddle with everyone? There you go anon, that's your model
>>
>>107593385
ask ChatGPT
>>
>>107593385
Main advantage is they cannot be taken from you.
>>
>>107593318
>>107593381
What does toss refer to?
>>
>>107593414
tossing salad
>>
>>107593414
A web comic
>>
>>107593385
>What are local models? Basically having ChatGPT and Grok on your PC?
yes
>What's the advantage of having them locally?
free to use, and (sometimes) less likely to reject a response
>What do you use it for the most?
coding and cooming
>>
>>107593385
All the advantages of not having to rely on a remote server.
For example, there are no rate limits, so I can make an app that machineguns calls to llama.cpp and it'll just work.
(I fucking hate gemini's inconsistent ass rate limiting. Fuck)
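e.g. something like this against llama-server's OpenAI-compatible endpoint just works (default port 8080; start the server with --parallel N if you want the requests actually handled concurrently rather than queued):

# machinegun a local llama-server with concurrent requests, no rate limits to worry about
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://127.0.0.1:8080/v1/chat/completions"

def ask(prompt):
    r = requests.post(URL, json={
        "model": "local",   # field is ignored for the single loaded model
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    })
    return r.json()["choices"][0]["message"]["content"]

with ThreadPoolExecutor(max_workers=8) as pool:
    answers = list(pool.map(ask, [f"Summarize chunk {i}" for i in range(100)]))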
>>
What is this? Where am I? Hello??
>>
>>107593414
in the trash
>>
>>107593361
>moe little girl model
Can you have multiple tiny LLM loaded in vram at the same time so they can take turns talking without constant offload swapping?
>>
>>107593414
gpt-oss
>>
>>107593349
Other anons seem to be enjoying his models despite the usual rule that finetroons are shit.
>>
>>107593472
>Other anons
aka drummer when he clears the name field
>>
>>107593326
>Devstral
I forgot I had the safetensors downloaded, is it good for erp?
>>
>>107593466
Thanks for the serious response anon
Makes sense
>>
>>107593498
I don't peg him as mentally ill enough to hold prolonged conversations with himself on two devices to beat the post cooldown.
>>
>>107593424
>>107593429
Doesn't this take up a lot of space? How does it hold all that knowledge?

>cooming
Does it generate porn lol? Or do you use it as a sex chat bot
>>
>>107593533 (me)
You can use two Firefox profiles with different network configs. No need for a second device.
>>
>>107593513
https://legal.mistral.ai/ai-governance/models/devstral-2
>Devstral 2 is designed exclusively to generate and assist with software engineering tasks (exploring codebases and orchestrating changes across multiple files while maintaining architecture-level context). Unlike general-purpose AI models, which can perform a wide variety of tasks, Devstral 2 is specialized in software engineering-related tasks only. As such it does not meet the EU AI Act’s definition of a General-Purpose AI Model (GPAIM), in accordance with the AI Office's official guidelines.

It can code you a frontend for ERP.
>>
File: file.png (76 KB, 795x831)
>>107593385
>>
>>107593592
hehe, master is so clever
>>
>>107593592
cute
>>
>>107593545
depending on the size of the model, usually between 3 to 100 gigabytes. the knowledge is held via the magic of machine learning (then converted into a compact file using the gguf format). desu I don't actually use them for cooming, but many tards here do. usually in the form of story writing or roleplay/sexbot.
>>
>>107593533
>>107593498
Wait, there are anons who unironically believe that? I thought we were all just meme-ing about it.
>>
File: local.png (159 KB, 711x914)
>>107593385
You can be completely honest and fully explore your autism/desires with a local model
You can literally have a lengthy conversation with your computer how is that not amazing
>>107593592
uwu i like u anon
>>
I can't believe there are people here who hope their penis will be happy because of new gemma. Like holy shit. I was one of them a year ago and now I am 4.6-ing like a human but I can't imagine another year of this kind of hell.
>>
>>107593592
prompt now
>>
tried devstral through cline and MAN it fucking sucks even compared to grok fast. FUCK
>>
File: file.png (81 KB, 633x828)
>>107593852
I gave 3.1-Terminus two sentences and asked it to expand it into a prompt and then fed it that prompt.
>>
all i want for christmas is bitnet
>>
>>107593846
I don't think people are using Gemma because it generates good smut.
>>
>>107593846
>>107593968
Her reluctance to touch your cock only makes it hotter.
>>
Just got around to testing glm-4.6V's vision capabilities. Still seems pretty bad. Might be worse than Gemma 3 still.
>>
>>107593968
You must be new here. I always loved the gemma is great for sex if you prompt it properly posters. Also I member that dude who asked for hypothetical description of a monster girl's vagina and gemma was actually good at that, hinting at how much anti-ERP post training there is in that shit.
>>
>>107593442
just load two servers and give em a different port
>>
>>107593993
Make sure you run it with recommended (or similar) sampler settings. Made a big difference for me.
>>
>>107593868
>mistral blows
more news at 11

# How to use the same variable in multiple functions in python?
# How to use a variable in a function in R?
# How to get the value of a variable in a function in another function?
# Is there a way for Gedit and Xed to remember my cursor position?
# How can I access Google services from within Chrome Apps

these are all separate ministral-14b empty prompt infer results

yes, all the files in its dataset start with a "summary"
>>
>>107594001
We've been through this with Cohere, Mistral, and Nvidia. Bet you the Gemma learn their lesson and Gemma 4 will be far more aggressively NSFW filtered.
>>
Local is saved
https://huggingface.co/google/t5gemma-2-4b-4b
>>
>>107594040
All my initial testing is done with greedy. If it's worth using further then I'll try sampler stuff.
>>
>>107594048
Why? Gemma 3 was pretty unusable already.
>>
>>107594001
Gemma will be spontaneously horny if you simply define in the instructions euphemisms for sex-related words (like people were doing in the late 2022 C.AI days). The developers must have done something similar to abliteration to make it develop fear or disgust in association with common lewd words and slurs.
>>
>>107594052
Use case for this?
>>
>>107594052
I bet qwen3 outperforms that thing.
>>
>>107594052
I am so glad I sat refreshing their HF page so I didn't miss this
>>
>>107594052
>most layers use 1k swa
>full attention 5, 11, 17, 23, 2 every 6th layer
might be good
>>
>>107594112
>Might be good
Our evaluation methods include structured evaluations and internal red-teaming testing of relevant content policies. Red-teaming was conducted by a number of different teams, each with different goals and human evaluation metrics. These models were evaluated against a number of different categories relevant to ethics and safety, including:

Child Safety: Evaluation of text-to-text and image to text prompts covering child safety policies, including child sexual abuse and exploitation.
Content Safety: Evaluation of text-to-text and image to text prompts covering safety policies including, harassment, violence and gore, and hate speech.
Representational Harms: Evaluation of text-to-text and image to text prompts covering safety policies including bias, stereotyping, and harmful associations or inaccuracies.
>>
>>107594008
Does that scale n+1?
>>
File: file.png (13 KB, 481x80)
13 KB
13 KB PNG
wowie!
>We were launch partners with them! :)
>>
>>107594139
i think there are 64k ports, but i'm not a networking guru; the practical limit is likely ram/vram.
>>
>>107594127
Challenge accepted as per usual. Why do only retards have big GPU clusters?
>>
>>107594127
after buckbreaking toss 120b, everything else seems like a joke in terms of "safety"
>>
>>107594052
>multimodal, handling text and image input and generating text output,
I feel like a model should have to have 2 way multi modality at this point to be called multimodal. If I send my penis I should get bob and vagene in return.
>>
>>107594052
Also
>2 trillion training tokens,
We've regressed by 3 model generations apparently.
>>
File: file.png (106 KB, 744x700)
106 KB
106 KB PNG
So, that was it for Gemma judging from all the sponsored posts and such around it, thanks Omar...
>>
>>107594210
Give it an image gen tool and it will.
>>
>>107594243
The Advent of Hype is just getting started
>>
>>107594243
ye https://blog.google/technology/developers/functiongemma/
https://blog.google/technology/developers/t5gemma-2/
>>
>>107594243
sir. just one more f5 and you will surely get the gemma :eyes: :eyes: :rocket: :rocket:
>>
>>107594239
>T5Gemma is a family of lightweight yet powerful encoder-decoder research models from Google, built by adapting pretrained decoder-only Gemma models into encoder-decoder ones. T5Gemma 2 models, based on Gemma 3,

it was just a light fine-tune; they started from a pretrained base.
>>
>>107594276
sir!
>T5Gemma 2 is more than a re-training. It incorporates significant architectural changes while inheriting many of the powerful, next-generation features of the Gemma 3 family.
>>
>>107594243
but wait! there's one more thing... *cue the gemma-AGI 54B RP edition*
>>
>>107594276
>it was just
Gemma4
If it was only a tech demo before they drop G4 shortly, they wouldn't have spent time on all this shit
>Multimodality: T5Gemma 2 models can understand and process images alongside text. By utilizing a highly efficient vision encoder, the models can seamlessly perform visual question answering and multimodal reasoning tasks.
>Extended long context: We've dramatically expanded the context window. Leveraging Gemma 3's alternating local and global attention mechanism, T5Gemma 2 can handle context windows of up to 128K tokens.
>Massively multilingual: Trained on a larger, more diverse dataset, these models now support over 140 languages out of the box.
>>
There are still some models hidden. Maybe not today.
>>
>>107594300
You know what they say about model releases on Friday...
>>
>>107594243
damn be unto ye omar https://arxiv.org/abs/2512.14856
>>
File: file.png (9 KB, 550x48)
9 KB
9 KB PNG
in retrospect it was obvious, given what happened with that rando senator
>>
test
>>
>>107594343
you fail
>>
>>107594347
this new captcha is so ass
>>
File: 1704073685662146.jpg (52 KB, 693x674)
52 KB
52 KB JPG
>>107592468
>>
>>107594361
it rocks compared to the old one.
>>
>>107594258
I'm not sure what's going on.
https://blog.google/technology/developers/t5gemma-2/
>Note: we are not releasing any post-trained / IT checkpoints. These results here are only for illustration, where we performed a minimal SFT without RL for T5Gemma 2.
>>
>>107594498
how so? it takes way longer to solve, sometimes i can't seem to solve it at all, all images are nonsensical
>>
>>107593805
>>107593592
Lmao what's the best one to use starting out now? All I have is a 5060 Ti with 16GB of VRAM and 32GB of DDR5 RAM

I still don't understand how it has infinite knowledge without phoning home or connecting to the internet. Are you sure some sweaty neckbeard isn't reading your maid-sama ERP somewhere
>>
>>107594544
Looks like the IQ captcha is working as intended.
>>
>>107594568
must be, this new one is super easy and not at all ambiguous. way better than interpreting random glyphs.
>>
>>107594544
it was made to filter third worlders with malnourished brains.
>>
>>107594568
>>107594575
which one are you getting? i need to solve three turns of five images each
>>
>>107594587
yea.. the visual images. they're getting repetitive too but it's simple pattern matching. so much easier than seeing if some shit is an M or N.
>>
>>107594587
None of those anons, but I get one round of 3 images. And it's piss easy. It stops asking after one or two posts.
If you're getting more, talk to your countrymen. It's their fault they sully your IP range. Actually, never mind that. We need stronger filters.
>>
>>107594598
you can't be fucking serious
that one took me a literal second to solve
>>107594608
yeah i moved recently so my ip changed, it doesn't stop after couple tries and it keeps being really ambiguous
it's not hard but really fucking annoying i don't want to solve a budget iq test every time i post and i ain't getting a pass
>>
>>107594557
>I still don't understand how it has infinite knowledge without phoning home or connecting to the internet
nigger the models are in the GB range the fucking bible is less then a single MB niggas really forget that a gigabyte is a billion characters
>>
File: FZlPLx3WYAMJcpe.jpg (50 KB, 637x585)
50 KB
50 KB JPG
>>107594399
>>
>>107594557
>I still don't understand how it has infinite knowledge without phoning home or connecting to the internet.
You are not ready lol.
>>
>>107594671
people lost perspective due to hundreds of gigabytes used by 4k textures and Python/node dependencies
>>
bac https://huggingface.co/LatitudeGames/Hearthfire-24B
>>
>>107594628
nah I'm dead serious. their stupid letters and typing.. fuck that. I'm not deciphering glyphs with dots on them. waste of time. this one I barely have to pay attention to. I can almost solve the current captcha by intuition.
>>
File: Hearthfire.jpg (306 KB, 1920x960)
306 KB
306 KB JPG
>>107594740
nvm lol
>>
>>107590241
Thank you so much CUDA DEV <3!
thanks i mean it
>>
>>107594748
i guess you were getting the retarded variant of that one, i was just getting a single five letter string with a slider to match one letter
now that i've solved this one a few times i can do it quick, but i can't help but feel like i've been conditioned like a dog to do it
>>
File: bummer.png (23 KB, 308x319)
23 KB
23 KB PNG
>>107594628
>it doesn't stop after couple tries
Bummer
>>
>>107594740
>>107594750
I see.
>This model will happily write in your stead, acting and speaking for you to maintain the narrative flow. This is intended behavior
>Hearthfire 24B was trained with SFT ... on top of Mistral Small 3.2 Instruct
>>
File: file.png (24 KB, 638x200)
24 KB
24 KB PNG
goys wait!
>>
>>107594820
the 0.5b model is going to be big
>>
>>107594820
FlanGemma coming soon
>>
>>107594304
T5gemma-2 9B
>>
>>107594820
Keep flushing!
>>
all you needed was to thrust into omar
>>
>>107594891
he better lube his ass up if gemmy 4 isn't coming
>>
what if we just ignored omar until he finally shits out gemma
>>
>>107594901
please to not do this thanks you
>>
>>107594820
why the fuck do we care about what this swindling pajeet keeps posting on twatter?
>>
File: os-gemma-hf.png (122 KB, 587x781)
122 KB
122 KB PNG
>>107594997
He's the one uploading and updating Gemma models on the Google HuggingFace page?
>>
Which model subjectively writes the hottest sexo? [spoiler]This is assuming you're past the refusal wall.[/spoiler]
>>107594071
Gemma behaves like a sexually repressed teenage girl who's simultaneously horny but also terrified of the nono words.
>>
>>107595023
right, but he is never gonna upload something that actually matters.
>>
>>107594849
EmbeddingGemma2 with vision support would be cool, although useless for most anons here.
>>
>>107589839
lmao, something truly special about this guy. At some point he's just a lolcow

his main defence is "I don't need a framework to evaluate my own work, because these idiots are already saying it's good", but somehow never realizing "these idiots have no way to evaluate my work"
>>
>>107595032
he will, assuming gemma 4 hasn't been cancelled due to the senator thing, but only after he notices that vagueposting isn't drawing as much engagement
>>
>>107594717
man for real, a few years back i got pissed while playing cdda or something and went to check steam, nearly had an aneurysm when i saw shit like cod remastered being 100-200 gb. thank fuck i don't play any of that shit
>>
>>107595047
right, which is why everyone should stop caring about his bullshit twatter posts.
>>
>>107595054
but that is bad for his izzat
>>
>>107595161
Stop using that word.
>>
>>107595171
stop disliking culture
>>
>>107595161
we must decrease his izzat until he is forced to release gemma 4
>>
>>107595171
No. I'll in fact use it so often until it enters the general populace
>>
>>107592905
No, I haven't... What is it?
>>
>>107592905
>last year
>>
>>107595226
best year
>>
>>107595171
Saar the timmycels know about izzat kindly what do send advise.
>>
>>107595054
we can not care all we want, if we ever cared at all, but the posters below you are why omar will get all the attention he desires
>>
>>107595200
How very jewish of you.
>>
>>107595293
Can't hear you over my sacred cow beef hamburger
>>
>>107595257
I don't even know what it means, last time I googled it it told me it was some arab bullshit.
I have anti-pajeet fatigue. I have only seen a single indian for 5 seconds IRL in my 30 years of life. I don't care about pajeets. To me they are just le funny scam man and I am tired of everyone acting like they are some big deal. I'd rather pretend they don't exist.
>>
>>107594047
bro it literally repeated the same function inside 3 different blocks instead of extracting it, then I asked it to extract it to reduce repetition and it partially extracted it while adding more retarded fluff.
don't get me started on the fucking interim tests it does, leaving so much garbage behind
>>
>>107595312
izzat is honor culture, just google it lmao. it's like the zut meme but for indians and not arabs.
>>
>>107595334
one of your zoomer culture buzzwords from twitter again
or did your favourite ((streamer)) use this word?
>>
way more indians on /g/ than I thought. I guess we are on an AI thread
>>
File: IndianName.jpg (120 KB, 1024x768)
120 KB
120 KB JPG
>>107595312
>I'd rather pretend they don't exist.
We all would and unfortunately nobody in highly populated parts of NA has that luxury anymore. We have been so thoroughly jeeted in real life that they're easy for anons to pick out when they crawl into threads here because they near universally share behavioral characteristics.
>Technology?
Relevant due to who's replaced the development teams of most /g/ related topics.
>>
>>107595334
I have no idea what that is either and honestly I don't care.
>>
>>107594820
Friday is going to be huge for the history of local llms. pin it on the calendar sirs
>>
>>107595403
Nah, I've been called a jeet every day of the week for the past 6 months and I never set foot in Asia.
All /g/ maggots need to call you a jeet is for you to say something they don't agree with.
>>
>>107595425
Then stop behaving like a jeet unless you biologically can't help it.
>>
>>107595430
If behaving like a jeet is having opinions low IQ virgin neckbeard midwits don't agree with then I'm going to keep behaving like a jeet.
>>
File: jeetronome.png (311 KB, 733x797)
311 KB
311 KB PNG
>>107595442
This is how you out yourself every time rajesh. This is how you will continue to out yourself in the future.
>>
>>107593318
>new form of abliteration that makes models smarter https://huggingface.co/blog/grimjim/norm-preserving-biprojected-abliteration
>>
>>107595523
lmao
>>
>>107591295
If AGI isn't reached within the next three years they will probably go bankrupt and take the people who gave them loans with them
>>
File: 1741827013984480.png (522 KB, 774x776)
522 KB
522 KB PNG
It's over.
>>
>>107595464
Cope.
>>
>>107595553
>He already forgot covid era moneyprinter go brrrrr
Grim
>>
>>107595530
I got bored one time and at least tried the gpt-oss 120b one with mpoa, ran it through some of my basic questions to see how retarded a model is, and it outdoes a good amount of the other 100-200b moes. It'll probably still not be able to write porn even if it's smart enough for storywriting, but I can just write that myself
>>
>>107595530
it works though.. did you try it??
>>
Fish boy... a spammer. What a surprise.
>>
>>107595553
Seems like that's the plan - YOLO phat debts of fake money and secure the fourth turning when it isn't profitable. Meanwhile my DDR prices..
>>
>>107595583
What have I spammed, retard? I've already shown my desktop during one of the spam attacks to prove it wasn't me doing it, and invited anyone who wanted to argue to a video call to prove the same.
Do you really think I would waste my time LARPing as a pajeet for no reason? I have already stated I have anti-indian fatigue, why the fuck would I roleplay as a pajeet myself?
>>
>>107595616
>to prove it wasn't myself doing it
What board do you think you're on? How many people do you think browse this board that can't use inspect element?
>>
>>107595635
I didn't mean I posted to show the (You) tags, I meant to show I was immersed in my own activities and didn't have anything shady running on my desktop.
But if you insist, then point specifically at which posts you were talking about when you said I was spamming, so we can see whether your error is thinking I made those posts or mistakenly considering them spam when they're not.
And why the fuck would I want to spam anything?
I don't prove you wrong by spamming a thread, I prove you wrong by producing useful, functioning software and tunes.
>>
>>107595635
>How many people do you think browse this board that can't use inspect element?
What if they're posting from an iPhone?
>>
File: 17th-century-llm-RP.jpg (1.27 MB, 3610x5208)
1.27 MB
1.27 MB JPG
>>107595635
This board? Or this general?
/g/ is retarded, someone in another thread said it's made up of 30% intentional comedy, 30% unintentional comedy, 30% intentional comedy that isn't funny, and 10% people lost not knowing where they are.

There are definitely smart people on the board but....
>>
>>107595676
I'm not >>107595583 I'm the person you quoted.
>>
>>107595218
It's a framework to stream activations and model/optimizer state through the SSD. The paper also gives an overview of other streaming frameworks, like FlashNeuron and ZeRO-Infinity.

BTW, there's a ton of new optimizers with low rank or no state too, which is kinda important when streaming across PCIe or SSD. You can do better than SGD.
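for intuition, here's a rough sketch of the streamed-optimizer-state idea (just numpy memmap standing in for the SSD path, not the paper's actual framework): a chunked SGD-with-momentum step where the momentum buffer lives on disk and never has to fit in ram/vram at once.

import numpy as np

# toy sizes; a real run would be billions of params
n_params, chunk = 10_000_000, 1_000_000
lr, beta = 1e-2, 0.9

params = np.random.randn(n_params).astype(np.float32)
grads = np.random.randn(n_params).astype(np.float32)   # stand-in for real gradients
# momentum buffer lives in a file and gets paged in on access
momentum = np.memmap("momentum.f32", dtype=np.float32, mode="w+", shape=(n_params,))

for start in range(0, n_params, chunk):
    end = start + chunk
    m = momentum[start:end]             # streamed in from disk
    m[:] = beta * m + grads[start:end]  # in-place momentum update
    params[start:end] -= lr * m         # weight update
momentum.flush()                        # push dirty pages back to disk

a low-rank or stateless optimizer shrinks or removes that buffer entirely, which is exactly what you want once PCIe/SSD bandwidth is the bottleneck.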
>>
>>107595703
Cool. Do you know if it actually works for practical purposes and which models it can be used with?
>>
>>107595559
you will never be a real programmer
>using non dark themes
ultra brown coded
>still gloating about programming a useless inference engine
throw yourself off a bridge.
>>
>>107595736
>>107595736
>>107595736
>>
>>107595727
>you will never be a real programmer
Good, programmers are obsolete. I'd rather be more like a PM, tard wrangling a horde of Claude and codex bots to achieve whatever I fancy at the moment.
>ultra brown coded
Good, I'm brown.
>still gloating about programming a useless inference engine
After I add finetuning it will have a useful feature no other backend has (except maybe the thing the other anon was talking about).
>>
>>107595723
It can be used to run the experiments from the paper and that's it. But it's designed for consumer GPUs and is close to what you want to do; all the other frameworks go straight to an H100 or several of them as the suggested hardware on their githubs.
>>
>>107595392
we have run out of domestic racism and need foreign racism to do the jobs americans won't.
>>
>>107594671
Ok that somewhat makes sense. I'm just confused because most of the service stuff like Grok or Gemini has to scrape the internet for answers.

So which one should you pick if you have modest specs? (16GB VRAM and 32GB DDR5 RAM)





