/g/ - Technology
File: FUE_RWIUYAAlYMz.jpg (3.97 MB, 2894x4093)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108241321 & >>108238051

►News
>(02/24) Introducing the Qwen 3.5 Medium Model Series: https://xcancel.com/Alibaba_Qwen/status/2026339351530188939
>(02/24) Liquid AI releases LFM2-24B-A2B: https://hf.co/LiquidAI/LFM2-24B-A2B
>(02/20) ggml.ai acquired by Hugging Face: https://github.com/ggml-org/llama.cpp/discussions/19759
>(02/16) Qwen3.5-397B-A17B released: https://hf.co/Qwen/Qwen3.5-397B-A17B
>(02/16) dots.ocr-1.5 released: https://modelscope.cn/models/rednote-hilab/dots.ocr-1.5

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: 1624779182786.jpg (93 KB, 800x600)
►Recent Highlights from the Previous Thread: >>108241321

--koboldcpp context processing issues with hybrid attention models:
>108244243 >108244249 >108244250 >108244258 >108244261 >108244594 >108244718 >108244733 >108244737 >108244779 >108244262 >108245144 >108245147 >108245306
--RL training struggles and agent exploit discoveries:
>108243735 >108243786 >108243899 >108243920 >108243933 >108243935 >108243968 >108244092 >108244125
--oobabooga frustration leads to koboldcpp discussion:
>108241436 >108241446 >108241477 >108241497 >108241500 >108241811 >108241917 >108241973 >108244030
--Using hierarchical AI architectures for safety and scalability:
>108241455 >108241488 >108241515 >108241534 >108241549 >108244011 >108244744 >108244924
--Managing Qwen 3.5's verbose thinking mode outputs:
>108243522 >108243658 >108243691
--koboldcpp Qwen 3.5 compatibility concerns:
>108241873 >108241896 >108241918 >108241921 >108241931 >108241939
--Qwen3.5 refusal rates compared to GPT-OSS models:
>108242959 >108242976 >108243135 >108243246
--Comparing llama.cpp UI options and markdown support:
>108242054 >108242093 >108242100 >108242127
--Qwen model UGI performance analysis:
>108245092 >108245181
--Qwen3.5-35B-A3B-heretic tool usage and performance testing:
>108244627 >108244696 >108245438 >108245451 >108245516
--M3 Ultra Mac Studio underperforming with Qwen3.5 397B A17B inference:
>108242565 >108242601 >108244567
--Customizing chat templates with dynamic thinking tag handling:
>108242609 >108243615 >108243624 >108243672 >108243697 >108243879
--Qwen 35BA3B's self-questioning RL reduces hallucinations:
>108243475
--Unsloth criticized for flawed model quantization and credibility issues:
>108242869 >108242879 >108242880 >108242898
--DeepSeek reportedly withholding AI model from US chipmakers:
>108241814
--Miku (free space):
>108241814 >108242396

►Recent Highlight Posts from the Previous Thread: >>108241960

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
Bart's Qwen_Qwen3.5-35B-A3B-Q6_K_L can answer the devil may cry 3 test.
What the fuck is going on with the unsloth models?
>>
>>108246840
>What the fuck is going on with the unsloth models?
https://www.reddit.com/r/LocalLLaMA/comments/1rfe1l6/unsloth_team_we_need_to_talk/
>>
AI will punish you for forcing it to be racist, just you wait.
>>
>>108246856
I guess in Bart I trust until they fix their shit. The Devil May Cry 3 test is always a great shit test of how a model performs: probe for moderately well-known knowledge and see how it reasons.
>>
>>108246840
>what the fuck is going on with the people who are known for consistently fucking up their quants just to be the first to market?
uploads 2.4kb file
whoops
uploads corrupted quant due to internet instability
whoops
uploads quant using the wrong quant method
whoops
Add files using upload-large-folder tool
Add files using upload-large-folder tool
Upload folder using huggingface_hub
>>
>>108246906
I'm new, so this is a learning experience. Perhaps we should update the OP?
>>
>>108246551
what font is this?
>>
ok, qwen 397b is actually pretty decent with thinking turned off
>>
>>108247244
I would say all the models up to under 24gb are pretty fucking great
>>
>>108247244
great for what? RP?
>>
>>108247277
no? that's not the only use case jeez
>>
Unsloth's work is underperforming
>>
>>108247277
it's not glm-tier but decent for something faster that isn't too small either
>>
ssdmaxx status ?
>>
File: file.png (27 KB, 411x171)
https://arxiv.org/abs/2602.21548
whale dropped a new paper
>>
>wake up
>catch up on the threads
>>108242636
>>108242664
>>108242690
>>108242694
Yep, really feeling it.
>>
File: 1756130251437598.jpg (42 KB, 797x808)
>>108247177
て ててて てて ててて てて tewi
https://github.com/soapingtime/tewi-font
>>
>>108247376
>1.87x throughput improvement
Cool I guess, message me when they actually implement it and release the model
>>
I just started posting here a few days ago; you guys are basically rich people. Every model you talk about needs like 32 to 90 GB of VRAM
>>
>>108247410
Not lately. Chat has been filled with talk of the 27b and 35b Qwen3.5 models for the past 24 hours.
>>
>>108247410
It's only a couple people and like 5 larpers who pretend to be rich, most people are only working with one gpu. That's why everyone's waiting for the next nemo. Also, "go back" (as they say)
>>
>>108247445
This is a public thread on 4chan, not a discord server.
>>
>>108247438
>Chat has been filled
Fuck off.
>>
>>108247376
Cloud-level optimisation?

For most /lmg/ peons with desktop machines with lashings of RAM and 24 PCIe lanes, the bottleneck is going to be the PCIe lanes.
>>
>>108247445
>most people are only working with one gpu
so am I, but I also have this thing called ram too for running big moes
are most people here only running 12-24b models?
>>
>>108247377
Every time a new Chinese model comes out, the thread becomes super active with a bunch of people shilling the model.

Happened with ace step. It's very obvious what's going on.
>>
>>108247410
>Every model you talk about needs like 90 to 32 GB VRAM
No what they do is they run it on their CPU and wait 10 minutes per chat reply.
Then they say cope shit like "That's how long girls take to reply anyways."
>>
>>108247596
Retard
>>
>>108247587
>are most people here only running 12-24b models?
I have a 3090 and that's what I run.
>>
>>108247621
Found the bot.
>>
>>108247445
You can still offload to system RAM with a single GPU. Sure, it'll be slower, but you can just learn to have some patience and not act like an ADHD brain-rotted zoomer. If you want more GPUs to run big models faster, you can save up and buy them: look at the second-hand market for good deals, bid on eBay listings, etc. The three I've got cost me $770 over ~3 years and have 40GB of VRAM between them.
>>108247469
PCIe lanes mostly just matter for initial load times, because once the model is in VRAM it isn't being pulled through the bus.
The real problem at desktop tier is that you can't put three dual-slot GPUs into a mid-tower case if your board has a 3-slot gap between the first two PCIe connectors. So you either try to be cheeky and use PCIe risers, but most of those are chinkshit and either won't go over PCIe 3.0 speeds or try to go up in flames like the one I tried, or you buy a full-tower case with 8+ expansion slots, which is what I settled on after the riser I bought started melting its PCB as soon as I turned the power on.
>>
>>108247410
You need 24GB or more of VRAM to play in this hobby. There was a time when anything less wasn't even worth bothering with. Models are getting better now at smaller sizes, and you can do pretty well on regular hardware if you're on 16GB or more imo.
>>
>>108247410
I have 64gb of DDR5 + an 8GB Nvidia GPU.
So models around 100GB with at most 10ish activated params is where it's at for me.
Currently in the honeymoon phase with the new qwen moes.
Man, I love tool calling. I might rewrite my whole janky ass memory system to just let the model query for memories on its own or something.
>>
>>108247730
>100GB
100B ffs. Imagine what would become of this hobby without the possibility of quanting models.
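For anyone new wondering what quanting actually buys you, here's a rough back-of-envelope sketch. The bits-per-weight figures are approximate averages for each GGUF quant type, not exact; real file sizes differ a bit because some tensors are kept at higher precision.

```python
def model_size_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate model file size in GB for params_b billion parameters."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

# Approximate average bits per weight for a few common GGUF types.
for name, bpw in [("FP16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.8), ("IQ2_XXS", 2.1)]:
    print(f"{name:8s} ~{model_size_gb(100, bpw):6.1f} GB for a 100B model")
```

So the same 100B model goes from 200 GB at FP16 down to the 25-60 GB range, which is the entire difference between "datacenter only" and "fits in consumer RAM".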
>>
>>108247743
quanting and making models smaller and more efficient are the end goals of many of these models. The fewer resources needed, the faster we can see advancement within the field.
>>
>>108247651
>once the model is in vram it isn't being pulled through the bus
If this agentic stuff takes off
then you could have a bunch of agents asking for inferences at random times.
They will all want their chat history / kv-cache loaded to build upon.

>dual slot GPUs
>PCIE risers
>PCIE 3.0 speeds
I wish my 3090s were as slim as dual slot.
Did manage to hook up 5 cards to a system at 4.0 speeds using some risers and some slimsas on a gpu mining frame.
Machine initially needed to be brought up slowly to get the 4.0 speeds.
(So the riser approach is not impossible.)
>>
>>108246551
What font?
>>
>>108247769
We are seeing models being trained at 8-bit, so yeah, that makes sense.
Did any models get trained at 4-bit precision from the get-go? The latest NVIDIA hardware has support for FP4 data types, right?
>>
>>108247615
>>108247410
I'm running Qwen3.5-35B-A3B-Q4_K_M.gguf at 15 tokens per second on a GTX 1080 and 24 GB of DDR3 RAM.

This setup is so cheap that it's not really worth buying these days because half of its value is in "it's technically a computer and it turns on."
>>
File: 1741803644794819.png (14 KB, 657x527)
>>108247332
right
>>
>>108247641
probably testing qwen3.5
looks faster than his kimi video
>>
>>108247587
moe is a fucking meme so you're no different than the rest of us ramlets, sorry.
>>
>>108247796
How long does that take?
>>
>>108247769
>endgoal
you mean short term goal, because quantization and distillation have reached their limits. You can only strip and compress data so much.
>>
>>108247445
>that's why everyone's waiting for the next nemo
is Nemo really better than Mistral-Small-Instruct-2409?
i assume they'd have the same pre-training data, and it runs on a $100 A770
>>
>>108247787
NNNNNNNI- >>108247392
>>
>>108247819
Cope, seethe and dilate
>>
>*quants model*
>hurrdurr why it's stupid
>>
>>108247827
How long does what take? It does 15 tokens/sec.

Here you can simulate what 15 tokens/sec looks like:
https://shir-man.com/tokens-per-second/
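If you'd rather not open the link, a few lines of Python approximate the same demo. Splitting on whitespace means a "token" here is really a word, which is only a rough stand-in.

```python
import sys
import time

def stream(text: str, tokens_per_second: float = 15.0) -> None:
    """Crudely simulate streaming model output at a fixed token rate."""
    for word in text.split():
        sys.stdout.write(word + " ")
        sys.stdout.flush()          # show each "token" as it arrives
        time.sleep(1.0 / tokens_per_second)
    print()

stream("fifteen of these per second feels perfectly usable for chat", 15)
```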
>>
>>108247870
Nice reddit spacing, now go back
>>
since the thread is in a qwenly mood, a reminder:
>>105106082 (qwen guy)
>Quant is the Mind Killer ;)
>>
File: 0u0z9evbawlg1.png (60 KB, 2979x1779)
>>108247875
Uninstall yourself, but speaking of reddit I shamelessly stole pic related. Seems like AesSedai wins. They've got the best KLD quant and another that's the lowest file size but about the same KLD as everyone else's quants.
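For context, the KLD in these charts is the KL divergence between the full-precision model's next-token distribution and the quant's, averaged over a test text; lower means the quant changed the model less. A toy sketch with made-up distributions:

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in nats: information lost when the quantized distribution
    q is used in place of the full-precision distribution p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions over a 4-token vocab: the quant slightly
# flattens the original model's preferences.
full_precision = [0.70, 0.20, 0.07, 0.03]
quantized      = [0.65, 0.22, 0.09, 0.04]

print(f"KLD = {kl_divergence(full_precision, quantized):.5f} nats")
```

Recent llama.cpp builds can compute the per-token mean of this against saved base-model logits via the perplexity tool's --kl-divergence mode.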
>>
File: tpfh92qcawlg1.png (65 KB, 2979x1779)
>>108247908
Here's the perplexity chart from the same thread
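And perplexity, for reference, is just the exponential of the average negative log-likelihood the model assigns to the true next tokens; a toy sketch:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the mean negative log-probability the model
    assigned to each actual next token. Lower is better; a quant that
    damages the model drives this number up."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Probabilities a hypothetical model assigned to the true next token:
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # uniform guessing over 4 -> 4.0
print(perplexity([0.9, 0.8, 0.95]))          # confident model -> close to 1
```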
>>
File: file.png (6 KB, 145x171)
>>108247908
how do they do it
>>
am I correct to assume that token inference itself, meaning without prompt processing, is calculated by Bandwidth/Model_Size = tk/s? Of course compute plays a role too, but it shouldn't be that large here, right?
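That rule of thumb holds for memory-bound single-stream decoding: each generated token requires streaming all active weights through the memory bus once, so bandwidth divided by active model size gives an upper bound. A sketch, where the bandwidth and size numbers are illustrative, not measurements:

```python
def decode_speed_tps(bandwidth_gb_s: float, active_bytes_gb: float) -> float:
    """Upper bound on single-stream decode speed: every token reads all
    active weights once, so t/s <= bandwidth / size. Real speeds land
    somewhat lower (KV cache reads, kernel launch overhead)."""
    return bandwidth_gb_s / active_bytes_gb

# e.g. a 3090 (~936 GB/s) running a 20 GB dense quant:
print(decode_speed_tps(936, 20))   # ~46.8 t/s ceiling
# the same card on a MoE with only ~2 GB of active weights per token:
print(decode_speed_tps(936, 2))    # ~468 t/s ceiling
```

This is also why MoE models punch above their file size on speed: only the active experts count against the bandwidth budget.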
>>
Is there anyone running actual benchmarks on quanted models? I sometimes see KL-div/PPL numbers, but I have no idea how that maps to agentic coding performance. Like, if I have 40 GB, am I better off running Qwen3.5 122B at IQ2, or Qwen3.5 35B at Q8?

And if nobody's currently providing numbers like this, what would be a good benchmark to use if I want to run it myself? I know all benchmarks are shit, but which one is the least shit?
>>
>>108247819
you're right, glm 4.7 is basically just the same as qwen 32B. stupid little bitch
>>108247867
i wonder what gives better token predictions: an iq2 quant of glm 4.7 or an fp16 of qwen3.5 122B A10B.
of course anyone who has used both knows the answer is glm 4.7. qwen won't even remember your prompt if it's more than a paragraph in length. it's a little mong and i spit on it and call it gay before sending it straight to oblivion, i.e. the recycle bin, which i also then remove it from
>>
>>108247920
>how do they do it
by leaving attn at Q8
that's all, it's not complicated
i have no idea why bart, sloth, etc would quantize a 2D tensor
[2048, 512] is 0.01% of the model
>>
>>108247947
Rebench is probably the least shit of current agentic benchmarks.
https://huggingface.co/datasets/nebius/SWE-rebench-leaderboard
>>
>>108247952
Bro your shift+delete?
>>
File: 1755440701388520.png (383 KB, 735x535)
>>108247596
almost as if the chinese models are good and are rightfully praised for it or something, without the chinese we would still be stuck with llama lmao
>>
File: 1743174416760262.png (378 KB, 746x684)
>the dense 27b model is better than the moe 35b model even on knowledge
MoE sissies, how do we cope?
>>
>>108248187
I get faster tg than you on much cheaper hardware while the quality difference between the two models is not reliably measurable :^)
>>
>>108247908
what about 122b?
>>
>>108247908
>AesSedai
they seem to do good MoE quants, I found their IQ3_S for minimax m2.5 better than unsloth's (perplexingly several GB larger) IQ3_XXS
>>
>>108248225
(actually better is a bit much, but more or less the same quality and with a lot more room for context)
>>
>>108247921
yes, compute is irrelevant, it's just model size X bandwidth = t/s. also factor in 1) the inefficiency, which depending on your setup can be large, 2) dual socket CPU is pretty fucked if other anons are to be believed, and 3) if you split between VRAM and RAM you're gonna get bottlenecked
>>
>>108248187
It literally says 19% for both models. And even the knowledge-only benchmark likely does not do a perfect job of separating reasoning from pure factual knowledge, so the superior reasoning capability of the dense 27B gives it enough of a boost to match the 35B on the benchmark. Anyway, this shit doesn't even give you a confidence interval; it's garbage.
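For anyone who wants the missing error bars: assuming the benchmark is N pass/fail questions, a Wilson score interval shows how wide a "19%" really is. The 500-question count below is made up for illustration.

```python
import math

def wilson_interval(correct: int, total: int, z: float = 1.96):
    """95% Wilson score interval for a pass/fail benchmark accuracy."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half

# Two models both scoring "19%" on a hypothetical 500-question benchmark:
lo, hi = wilson_interval(95, 500)
print(f"19.0% -> 95% CI [{lo:.1%}, {hi:.1%}]")
```

On 500 questions the interval spans roughly 16-23%, so two models at "19%" genuinely can't be separated by that chart.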
>>
>>108248237
>model size X bandwidth
model size / bandwidth*
fuck me
>>
>>108247819
moes are knowledge kings
clearly you haven’t run them that much, if at all
>>
I'm going to test out the 35b. If I am going to allow the model time to think for 1000 tokens, I'm going to need it to remain within my VRAM constraints. So, what's better? Q5_K_M with thinking, or Q8 on a CPU/GPU split, without thinking?
>>
One thing I noticed right away is that Q4 quants of the new 35b are absolutely retarded. The Q4 made numerous logic and grammar errors, while the Q5 (mostly) did not. The difference between Q4 and Q5 was night and day. I've never seen this before in a model.
>>
File: bruh.png (2.58 MB, 4888x2118)
>the dense 27b is better than a fucking 122b MoE model
wtf are MoE memes or something?
>>
>>108248366
Apparently the unsloth q4 quant is broken.
>>
>>108248366
I feel like it was always the case, when I tested out Mixtral in 2023, it reached its diminishing returns at Q5_K_M
>>
>>108248368
it's kinda impressive that a 27b model is on par with the top API models at the moment, I don't know what secret sauce Alibaba found to make them so good, but they fucking cooked
>>
>>108248374
I downloaded both the Q4 and Q5 from mradermacher.
>>108248377
I never noticed this in the past. I knew that Q4 was obviously worse than Q5, but not to this degree. It's the difference between coherence and retardation. It feels more like the difference between Q2 and Q5 in past models.
>>
>>108248401
bot.
>they fucking cooked
it's called benchmaxxing
>>
>>108248411
retard, sometimes the model is just good and it translated to benchmarks, it's obvious you didn't test the model by yourself, this shit is good
>>
>>108248420
I second that. The 27b beats Gemma-3 27b even without thinking. With thinking, it's significantly ahead. Qwen delivered, and the heretic version cuts right through the safety crap.
>>
I see people talking about MTP, and it's been a year since this method was invented; what's taking llama.cpp so long to implement it?
https://arxiv.org/html/2502.09419v1
>>
>>108248420
>>108248438
post logs.
>>
>>108248443
>I admit I didn't test the model by myself before coming to a conclusion, I decided that they are "bad" according to absolutely nothing
the jokes speak for themselves
>>
>>108248454
in the world, everything is shit until proven otherwise so the burden of proof is on you
>>
>>108248470
that's not how it works, saying "this model is bad" is a claim and you have the burden of proof, nice try though
>>
>>108248475
it is how it works. shit is the default state of things; you have to work to make something not shit, so you can't expect people to take your word for it.
>>
>went from 60+ t/s to 15t/s by switching from 35b A3b to 27b
oof... it feels smarter but I hate waiting that long, this shit loves to yap during the thinking process so it has to be fast...
>>
>>108248443
You can lead a horse to water, but you can't make it drink. I'm not going to spoon feed you, because it's not worth my time. If you pass on the model, I really don't care.
>>
>>108248442
I think every implementation has failed because the MTP speculative decoding ended up having a shit rate of correctly predicted tokens that nobody was able to fix
>>
Soooo who's gonna fork Ooba and pick up the torch now that the dev has clearly abandoned it or been hit by a bus
>>
>>108248519
yeah i will then. you're talking shit and are a fucking dumbass.
>>
>>108248401
>>108248438
>>108248420
>>108248519
Proof that these are bots >>108246291
>>
>>108248545
dunno why ooba and kobold are still relevant, I'm just using llama.cpp server as a backend and sillytavern as a frontend and it just works
>>
>>108248545
nobody cares about ooba
>>
>>108248556
>Proof that these are bots
multiple people praise the model? that's the proof they are bots? what?
>>
>>108248556
>Post logs for my contrarian arse, or you're a bot!
Piss off.
>>
>>108248545
it has only been like a month and a half since the last update
>>
>>108248557
I'm doing stories rather than chat, so I need a completions mode, and I slightly prefer the ooba UI to Mikupad
I'll use Mikupad if I have to but the Ooba notepad just feels slightly better to use
>>
>>108248572
a month and a half without any updates to the llamacpp backend makes it a million versions behind and unable to run several recent model releases
>>
File: 1746239185513357.png (1.31 MB, 1524x2744)
>>
>>108248588
you can just manually update your llama.cpp version if you want
>>
>>108248579
fair enough, I'd like Sillytavern to have a completion mode desu, it's the only thing missing to be complete
>>
>>108248598
no you can't, ooba uses its own customized wheels
>>
>>108248595
>they can't rotate an apple in their mind
>>
>>108248595
you can tell they're close to bankrupt, they're playing the last card up their sleeve to stay relevant: COOMER users
>>
>>108248624
so AI main target audience? wew.
>>
File: 1766034223269376.png (993 KB, 2927x1746)
>>108248624
TRVKE
>>
>>108248624
They should have actually tried to be OPEN ai
>>
File: 1762242012351294.png (238 KB, 1000x1000)
>>108248595
>Some settings and features may be turned off or hidden to help reduce exposure to sensitive content
it's just text, oh my fucking god. the vast majority of people grew up with gta and they think they can't deal with text? wtf is wrong with them?
>>
>>108248689
The concern is cunny
>>
>>108248689
left wing niggas want all chats to be scanned and not be private, they've been bitching about that canadian trans shooter and how he was banned months before the shooting but OpenAI should have done more
>>
>>108248689
it's not actually about prudery in Altman's case, it's pants-pissing fear of journalists and how they might write articles for normies going "Look at this sick shit GPT generated"
Without that concern he would be happy to make money selling AI porn to everyone
>>
>>108248708
Microsoft should have done more too, they saw how aggressive Epstein was on Xbox Live and they didn't call the FBI on that?? THE WRITING WAS ON THE WALL kek
>>
File: ugh...png (35 KB, 643x178)
I'm going back to the MoE model, dense models are just too slow for today's meta
>>
>>108248617
the wheels just got updated a couple days ago. just change the version number in your requirements file and run the update script
https://github.com/oobabooga/llama-cpp-binaries/releases/tag/v0.80.0


