/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108241321 & >>108238051

►News
>(02/24) Introducing the Qwen 3.5 Medium Model Series: https://xcancel.com/Alibaba_Qwen/status/2026339351530188939
>(02/24) Liquid AI releases LFM2-24B-A2B: https://hf.co/LiquidAI/LFM2-24B-A2B
>(02/20) ggml.ai acquired by Hugging Face: https://github.com/ggml-org/llama.cpp/discussions/19759
>(02/16) Qwen3.5-397B-A17B released: https://hf.co/Qwen/Qwen3.5-397B-A17B
>(02/16) dots.ocr-1.5 released: https://modelscope.cn/models/rednote-hilab/dots.ocr-1.5

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
►Recent Highlights from the Previous Thread: >>108241321

--koboldcpp context processing issues with hybrid attention models:
>108244243 >108244249 >108244250 >108244258 >108244261 >108244594 >108244718 >108244733 >108244737 >108244779 >108244262 >108245144 >108245147 >108245306
--RL training struggles and agent exploit discoveries:
>108243735 >108243786 >108243899 >108243920 >108243933 >108243935 >108243968 >108244092 >108244125
--oobabooga frustration leads to koboldcpp discussion:
>108241436 >108241446 >108241477 >108241497 >108241500 >108241811 >108241917 >108241973 >108244030
--Using hierarchical AI architectures for safety and scalability:
>108241455 >108241488 >108241515 >108241534 >108241549 >108244011 >108244744 >108244924
--Managing Qwen 3.5's verbose thinking mode outputs:
>108243522 >108243658 >108243691
--koboldcpp Qwen 3.5 compatibility concerns:
>108241873 >108241896 >108241918 >108241921 >108241931 >108241939
--Qwen3.5 refusal rates compared to GPT-OSS models:
>108242959 >108242976 >108243135 >108243246
--Comparing llama.cpp UI options and markdown support:
>108242054 >108242093 >108242100 >108242127
--Qwen model UGI performance analysis:
>108245092 >108245181
--Qwen3.5-35B-A3B-heretic tool usage and performance testing:
>108244627 >108244696 >108245438 >108245451 >108245516
--M3 Ultra Mac Studio underperforming with Qwen3.5 397B A17B inference:
>108242565 >108242601 >108244567
--Customizing chat templates with dynamic thinking tag handling:
>108242609 >108243615 >108243624 >108243672 >108243697 >108243879
--Qwen 35BA3B's self-questioning RL reduces hallucinations:
>108243475
--Unsloth criticized for flawed model quantization and credibility issues:
>108242869 >108242879 >108242880 >108242898
--DeepSeek reportedly withholding AI model from US chipmakers:
>108241814
--Miku (free space):
>108241814 >108242396

►Recent Highlight Posts from the Previous Thread: >>108241960

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
Bart's Qwen_Qwen3.5-35B-A3B-Q6_K_L can answer the devil may cry 3 test.
What the fuck is going on with the unsloth models?
>>108246840
>What the fuck is going on with the unsloth models?
https://www.reddit.com/r/LocalLLaMA/comments/1rfe1l6/unsloth_team_we_need_to_talk/
AI will punish you for forcing it to be racist, just you wait.
>>108246856
I guess in Bart I trust until they fix their shit. The devil may cry 3 test is always a great shit test to see how a model performs: look for moderately known knowledge and see how it reasons
>>108246840
>what the fuck is going on with the people who are known for consistently fucking up their quants just to be the first to market?
uploads 2.4kb file
whoops
uploads corrupted quant due to internet instability
whoops
uploads quant using the wrong quant method
whoops
Add files using upload-large-folder tool
Add files using upload-large-folder tool
Upload folder using huggingface_hub
>>108246906
I'm new, so this is a learning experience. Perhaps we should update the OP?
>>108246551
what font is this?
ok, qwen 397b is actually pretty decent with thinking turned off
>>108247244
I would say all the models that fit under 24gb are pretty fucking great
>>108247244
great for what? RP?
>>108247277
no? that's not the only use case jeez
Unsloth's work is underperforming
>>108247277
it's not glm-tier but decent for something faster that isn't too small either
ssdmaxx status?
https://arxiv.org/abs/2602.21548
whale dropped a new paper
>wake up
>catch up on the threads
>>108242636
>>108242664
>>108242690
>>108242694
Yep, really feeling it.
>>108247177
て ててて てて ててて てて tewi
https://github.com/soapingtime/tewi-font
>>108247376
>1.87x throughput improvement
Cool I guess, message me when they actually implement it and release the model
I just started posting here a few days ago, you guys are basically rich people. Every model you talk about needs like 32 to 90 GB of VRAM
>>108247410
Not lately. Chat has been filled with talk of the 27 and 35b Qwen3.5 models for the past 24 hours.
>>108247410
It's only a couple people and like 5 larpers who pretend to be rich, most people are only working with one gpu. That's why everyone's waiting for the next nemo. Also, "go back" (as they say)
>>108247445
This is a public thread on 4chan, not a discord server.
>>108247438
>Chat has been filled
Fuck off.
>>108247376
Cloud-level optimisation?
For most /lmg/ peons with desktop machines, with lashings of RAM and 24 PCIe lanes, the bottleneck is going to be the PCIe lanes.
>>108247445
>most people are only working with one gpu
so am I, but I also have this thing called ram too for running big moes
are most people here only running 12-24b models?
>>108247377
Every time a new Chinese model comes out the thread becomes super active with a bunch of people shilling the model. Happened with ace step. It's very obvious what's going on.
>>108247410
>Every model you talk about needs like 32 to 90 GB of VRAM
No, what they do is run it on their CPU and wait 10 minutes per chat reply.
Then they say cope shit like "That's how long girls take to reply anyways."
>>108247596
Retard
>>108247587
>are most people here only running 12-24b models?
I have a 3090 and that's what I run.
>>108247621
Found the bot.
>>108247445
You can still offload to system RAM with a single GPU. Sure, it'll be slower, but you can just learn to have some patience and not act like an ADHD brain-rotted zoomer. If you want more GPUs to run big models faster you can save up and buy them: look at the second-hand market for good deals, bid on ebay listings, etc. The three I've got cost me $770 over ~3 years and have 40GB of vram between them.
>>108247469
PCIE lanes mostly just matter for initial load times, because once the model is in vram it isn't being pulled through the bus.
The real problem at desktop tier is that you can't fit three dual-slot GPUs into a mid tower case if your board has a 3-slot gap between the first two PCIE connectors. So you either get cheeky and use PCIE risers, but most of those are chinkshit and either won't go over PCIE 3.0 speeds or go up in flames like the one I tried, or you buy a full tower case with 8+ expansion slots, which is what I settled on doing after the riser I bought started melting its pcb as soon as I turned the power on.
>>108247410
You need 24gb or more of vram to play this hobby. There was a time when even that wasn't worth bothering with. Models are getting better at smaller sizes now, and you can do pretty well on regular hardware if you're on 16gb or more imo.
>>108247410
I have 64gb of DDR5 + an 8GB Nvidia GPU.
So models around 100GB with at most ~10B activated params is where it's at for me.
Currently in the honeymoon phase with the new qwen moes.
Man, I love tool calling. I might rewrite my whole janky ass memory system to just let the model query for memories on its own or something.
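Roughly what I have in mind is something like this minimal sketch, assuming an OpenAI-compatible endpoint (e.g. llama-server with --jinja) on localhost:8080; the search_memories tool and the toy store behind it are made-up stand-ins, not any real API:
[code]
# sketch: let the model pull memories itself via tool calling
# (endpoint, tool name, and memory store are all hypothetical)
import json
import requests

TOOLS = [{
    "type": "function",
    "function": {
        "name": "search_memories",
        "description": "Search the user's long-term memory store",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

MEMORIES = ["anon prefers concise replies", "anon's cat is named Miku"]

def search_memories(query: str) -> str:
    # naive keyword match; a real system would use embeddings or BM25
    hits = [m for m in MEMORIES if any(w in m.lower() for w in query.lower().split())]
    return json.dumps(hits)

messages = [{"role": "user", "content": "What's my cat called again?"}]
resp = requests.post("http://localhost:8080/v1/chat/completions",
                     json={"messages": messages, "tools": TOOLS}).json()
msg = resp["choices"][0]["message"]

if msg.get("tool_calls"):
    messages.append(msg)  # keep the assistant's tool-call turn in the history
    for call in msg["tool_calls"]:
        args = json.loads(call["function"]["arguments"])
        messages.append({"role": "tool", "tool_call_id": call["id"],
                         "content": search_memories(args["query"])})
    # then POST again with the tool results appended so the model can answer
[/code]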
>>108247730
>100GB
100B ffs. Imagine what this hobby would be like without the possibility of quanting models.
>>108247743
quanting and making models smaller and more efficient are end goals for many of these model makers. The fewer resources needed, the faster we can see advancement within the field.
>>108247651
>once the model is in vram it isn't being pulled through the bus
If this agentic stuff takes off, then you could have a bunch of agents asking for inference at random times. They will all want their chat history / kv-cache loaded to build upon.
>dual slot GPUs
>PCIE risers
>PCIE 3.0 speeds
I wish my 3090s were as slim as dual slot. I did manage to hook up 5 cards to a system at 4.0 speeds using some risers and some slimsas adapters on a gpu mining frame. The machine initially needed to be brought up slowly to get the 4.0 speeds. (So the riser approach is not impossible.)
>>108246551
What font?
>>108247769
We are seeing models being trained in FP8, so yeah, that makes sense.
Did any models get trained at 4-bit precision from the get go? The latest nvidia hardware has support for FP4 data types, right?
>>108247615
>>108247410
I'm running Qwen3.5-35B-A3B-Q4_K_M.gguf at 15 tokens per second on a GTX 1080 and 24 GB of DDR3 RAM.
This setup is so cheap that it's not really worth buying these days because half of its value is in "it's technically a computer and it turns on."
>>108247332
right
>>108247641
probably testing qwen3.5
looks faster than his kimi video
>>108247587
moe is a fucking meme so you're no different than the rest of us ramlets, sorry.
>>108247796
How long does that take?
>>108247769
>end goal
you mean short-term goal, because quantization and distillation have reached their limits. You can only strip and compress data so much.
>>108247445
>that's why everyone's waiting for the next nemo
is Nemo really better than Mistral-Small-Instruct-2409?
i assume they'd have the same pre-training data, and it runs on a $100 A770
>>108247787
NNNNNNNI- >>108247392
>>108247819
Cope, seethe and dilate
>*quants model*
>hurrdurr why it's stupid
>>108247827
How long does what take? It does 15 tokens/sec. Here you can simulate what 15 tokens/sec looks like: https://shir-man.com/tokens-per-second/
>>108247870
Nice reddit spacing, now go back
since the thread is in a qwenly mood, a reminder:
>>105106082 (qwen guy)
>Quant is the Mind Killer ;)
>>108247875
Uninstall yourself, but speaking of reddit I shamelessly stole pic related. Seems like AesSedai wins. They've got the best KLD quant and another that's the lowest file size but about the same KLD as everyone else's quants.
>>108247908
Here's the perplexity chart from the same thread
>>108247908
how do they do it
am I correct to assume that token inference itself, meaning without prompt processing, is calculated by Bandwidth/Model_Size = tk/s? Of course compute plays a role too, but it shouldn't be that large here, right?
Is there anyone running actual benchmarks on quanted models? I sometimes see KL-div/PPL numbers, but I have no idea how that maps to agentic coding performance. Like, if I have 40 GB, am I better off running Qwen3.5 122B at IQ2, or Qwen3.5 35B at Q8?
And if nobody's currently providing numbers like this, what would be a good benchmark to use if I want to run it myself? I know all benchmarks are shit, but which one is the least shit?
>>108247819
you're right, glm 4.7 is basically just the same as qwen 32B. stupid little bitch
>>108247867
i wonder what gives better token predictions: an iq2 quant of glm 4.7 or a fp16 of qwen3.5 122B A10B. of course anyone who has used both knows the answer is glm 4.7. qwen won't even remember your prompt if it's more than a paragraph in length. it's a little mong and i spit on it and call it gay before sending it straight to oblivion, ie. the recycle bin, which i also then remove it from
>>108247920
>how do they do it
by leaving attn at Q8
that's all, it's not complicated
i have no idea why bart, sloth, etc would quantize a 2D tensor
[2048, 512] is 0.01% of the model
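if you want to check what a quant maker actually shipped, here's a rough sketch with the gguf python package (pip install gguf); the filename is just a placeholder for whatever quant you have on disk:
[code]
# tally which quant types the attention tensors got and what share of the
# model's parameters they account for (filename is a placeholder)
from collections import Counter
from gguf import GGUFReader

reader = GGUFReader("some-model-Q4_K_M.gguf")

attn_elems = total_elems = 0
attn_types = Counter()
for t in reader.tensors:
    total_elems += t.n_elements
    if ".attn_" in t.name:                  # e.g. blk.0.attn_q.weight
        attn_elems += t.n_elements
        attn_types[str(t.tensor_type)] += 1

print(attn_types)                           # quant types used for attn tensors
print(f"attn share of params: {attn_elems / total_elems:.2%}")
[/code]
on a MoE the expert FFN tensors dominate the byte count anyway, so keeping attn at Q8 costs next to nothing in file size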
>>108247947
Rebench is probably the least shit of current agentic benchmarks.
https://huggingface.co/datasets/nebius/SWE-rebench-leaderboard
>>108247952
Bro your shift+delete?
>>108247596
almost as if the chinese models are good and are rightfully praised for it or something, without the chinese we would still be stuck with llama lmao
>the dense 27b model is better than the moe 35b model even on knowledge
MoE sissies, how do we cope?
>>108248187
I get faster tg than you on much cheaper hardware while the quality difference between the two models is not reliably measurable :^)
>>108247908
what about 122b?
>>108247908
>AesSedai
they seem to do good MoE quants, I found their IQ3_S for minimax m2.5 better than unsloth's (perplexingly several GB larger) IQ3_XXS
>>108248225
(actually better is a bit much, but more or less the same quality and with a lot more room for context)
>>108247921
yes, compute is irrelevant, it's just model size X bandwidth = t/s
also factor in the inefficiency, which depending on your setup can be large: ~2x on a dual socket cpu if other anons are to be believed (that setup is pretty fucked) and ~3x if you split between vram and ram, you're gonna get bottlenecked
>>108248187
It literally says 19% for both models. And even the knowledge-only benchmark likely does not do a perfect job of separating reasoning from pure factual knowledge, so the superior reasoning capability of the dense 27B gives it enough of a boost to match the 35B on the benchmark. Anyway, this shit doesn't even give you confidence intervals; it's garbage.
>>108248237
>model size X bandwidth
model size / bandwidth*
fuck me
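for anyone napkin-mathing their own setup, a quick sketch assuming decode is purely bandwidth-bound (all numbers below are illustrative, not measured):
[code]
# tokens/s ~= effective bandwidth / bytes read per token
# (illustrative numbers only)
model_bytes_gb = 20.0    # quantized weights read per token; for a MoE, only the active params
bandwidth_gbs = 360.0    # e.g. dual-channel DDR5 is ~90 GB/s, a 3090 is ~936 GB/s
efficiency = 0.6         # real setups rarely hit theoretical bandwidth

tps = bandwidth_gbs * efficiency / model_bytes_gb
print(f"~{tps:.1f} t/s")  # ~10.8 t/s
[/code]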
>>108247819
moes are knowledge kings
clearly you haven’t run them that much, if at all
I'm going to test out the 35b. If I am going to allow the model time to think for 1000 tokens, I'm going to need it to remain within my VRAM constraints. So, what's better? Q5_K_M with thinking, or Q8 on a CPU/GPU split, without thinking?
One thing I noticed right away is that Q4 quants of the new 35b are absolutely retarded. The Q4 made numerous logic and grammar errors, while the Q5 (mostly) did not. The difference between Q4 and Q5 was like the difference between night and day. I've never seen this before in a model.
>the dense 27b is better than a fucking 122b MoE model
wtf, are MoEs memes or something?
>>108248366
Apparently the unsloth q4 quant is broken.
>>108248366
I feel like this was always the case; when I tested out Mixtral in 2023, it hit diminishing returns at Q5_K_M
>>108248368
it's kinda impressive that a 27b model is on par with the top API models at the moment, I don't know what secret sauce Alibaba found to make them so good, but they fucking cooked
>>108248374
I downloaded both the Q4 and Q5 from mradermacher.
>>108248377
I never noticed this in the past. I knew that Q4 was obviously worse than Q5, but not to this degree. It's the difference between coherence and retardation. It feels more like the difference between Q2 and Q5 in past models.
>>108248401
bot.
>they fucking cooked
it's called benchmaxxing
>>108248411
retard, sometimes the model is just good and it translated to benchmarks, it's obvious you didn't test the model yourself, this shit is good
>>108248420
I second that. The 27b beats Gemma-3 27b even without thinking. With thinking, it's significantly ahead. Qwen delivered, and the heretic version cuts right through the safety crap.
I see people talking about MTP, and it's been a year since this method was invented; what's taking llama.cpp so long to implement it?
https://arxiv.org/html/2502.09419v1
>>108248420
>>108248438
post logs.
>>108248443
>I admit I didn't test the model myself before coming to a conclusion, I decided that they are "bad" according to absolutely nothing
the jokes speak for themselves
>>108248454
in the world, everything is shit until proven otherwise, so the burden of proof is on you
>>108248470
that's not how it works, saying "this model is bad" is a claim and you have the burden of proof, nice try though
>>108248475
it is how it works; shit is the default state of things. you have to work to make something not shit, so you can't expect people to take your word for it.
>went from 60+ t/s to 15 t/s by switching from 35b A3B to 27b
oof... it feels smarter but I hate waiting that long, this shit loves to yap during the thinking process so it has to be fast...
>>108248443
You can lead a horse to water, but you can't make it drink. I'm not going to spoon feed you, because it's not worth my time. If you pass on the model, I really don't care.
>>108248442
I think every implementation has failed because MTP speculative decoding ended up with a shit acceptance rate for the predicted tokens that nobody was able to fix
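the standard speculative-decoding estimate shows why that kills it: with draft length k and per-token acceptance probability a, you only get (1 - a^(k+1)) / (1 - a) tokens per full forward pass on average. a quick illustration (numbers made up):
[code]
# expected tokens generated per target-model forward pass in speculative
# decoding, given draft length k and per-token acceptance probability a
def expected_tokens_per_step(a: float, k: int) -> float:
    return (1 - a ** (k + 1)) / (1 - a)

for a in (0.9, 0.7, 0.4):
    print(f"acceptance {a:.0%}: ~{expected_tokens_per_step(a, k=4):.2f} tokens/step")
# 90%: ~4.10, 70%: ~2.77, 40%: ~1.65 -- a weak drafter barely beats plain decoding
[/code]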
Soooo who's gonna fork Ooba and pick up the torch now that the dev has clearly abandoned it or been hit by a bus
>>108248519
yeah i will then. you're talking shit and are a fucking dumbass.
>>108248401
>>108248438
>>108248420
>>108248519
Proof that these are bots: >>108246291
>>108248545
dunno why ooba and kobold are still relevant, I'm just using llama.cpp server as a backend and sillytavern as a frontend and it just works
>>108248545
nobody cares about ooba
>>108248556
>Proof that these are bots
multiple people praise the model? that's the proof they are bots? what?
>>108248556
>Post logs for my contrarian arse, or you're a bot!
Piss off.
>>108248545
it has only been like a month and a half since the last update
>>108248557
I'm doing stories rather than chat, so I need a completions mode, and I slightly prefer the ooba UI to Mikupad.
I'll use Mikupad if I have to, but the Ooba notepad just feels slightly better to use.
>>108248572
a month and a half without any updates to the llamacpp backend makes it a million versions behind and unable to run several recent model releases
>>108248588
you can just manually update your llama.cpp version if you want
>>108248579
fair enough, I'd like Sillytavern to have a completion mode desu; it's the only thing missing for it to be complete
>>108248598
no you can't, ooba uses its own customized wheels
>>108248595
>they can't rotate an apple in their mind
>>108248595
you can tell they're close to bankrupt; they're playing the last card up their sleeve to stay relevant: COOMER users
>>108248624
so AI's main target audience? wew.
>>108248624
TRVKE
>>108248624
They should have actually tried to be OPEN ai
>>108248595
>Some settings and features may be turned off or hidden to help reduce exposure to sensitive content
it's just text, omfucking god. the vast majority of people grew up with gta and they think they can't deal with text? wtf is wrong with them?
>>108248689
The concern is cunny
>>108248689
left wing niggas want all chats to be scanned and not be private, they've been bitching about that canadian trans shooter and how he was banned months before the shooting but OpenAI should have done more
>>108248689
it's not actually about prudery in Altman's case, it's pants-pissing fear of journalists and how they might write articles for normies going "Look at this sick shit GPT generated".
Without that concern he would be happy to make money selling AI porn to everyone.
>>108248708
Microsoft should have done more too, they saw how aggressive Epstein was on Xbox Live and they didn't call the fbi on that?? THE WRITING WAS ON THE WALL kek
I'm going back to the MoE model, dense models are just too slow for today's meta
>>108248617
the wheels just got updated a couple days ago. just change the version number in your requirements file and run the update script
https://github.com/oobabooga/llama-cpp-binaries/releases/tag/v0.80.0