/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108241321 & >>108238051

►News
>(02/24) Introducing the Qwen 3.5 Medium Model Series: https://xcancel.com/Alibaba_Qwen/status/2026339351530188939
>(02/24) Liquid AI releases LFM2-24B-A2B: https://hf.co/LiquidAI/LFM2-24B-A2B
>(02/20) ggml.ai acquired by Hugging Face: https://github.com/ggml-org/llama.cpp/discussions/19759
>(02/16) Qwen3.5-397B-A17B released: https://hf.co/Qwen/Qwen3.5-397B-A17B
>(02/16) dots.ocr-1.5 released: https://modelscope.cn/models/rednote-hilab/dots.ocr-1.5

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
►Recent Highlights from the Previous Thread: >>108241321

--koboldcpp context processing issues with hybrid attention models:
>108244243 >108244249 >108244250 >108244258 >108244261 >108244594 >108244718 >108244733 >108244737 >108244779 >108244262 >108245144 >108245147 >108245306
--RL training struggles and agent exploit discoveries:
>108243735 >108243786 >108243899 >108243920 >108243933 >108243935 >108243968 >108244092 >108244125
--oobabooga frustration leads to koboldcpp discussion:
>108241436 >108241446 >108241477 >108241497 >108241500 >108241811 >108241917 >108241973 >108244030
--Using hierarchical AI architectures for safety and scalability:
>108241455 >108241488 >108241515 >108241534 >108241549 >108244011 >108244744 >108244924
--Managing Qwen 3.5's verbose thinking mode outputs:
>108243522 >108243658 >108243691
--koboldcpp Qwen 3.5 compatibility concerns:
>108241873 >108241896 >108241918 >108241921 >108241931 >108241939
--Qwen3.5 refusal rates compared to GPT-OSS models:
>108242959 >108242976 >108243135 >108243246
--Comparing llama.cpp UI options and markdown support:
>108242054 >108242093 >108242100 >108242127
--Qwen model UGI performance analysis:
>108245092 >108245181
--Qwen3.5-35B-A3B-heretic tool usage and performance testing:
>108244627 >108244696 >108245438 >108245451 >108245516
--M3 Ultra Mac Studio underperforming with Qwen3.5 397B A17B inference:
>108242565 >108242601 >108244567
--Customizing chat templates with dynamic thinking tag handling:
>108242609 >108243615 >108243624 >108243672 >108243697 >108243879
--Qwen 35BA3B's self-questioning RL reduces hallucinations:
>108243475
--Unsloth criticized for flawed model quantization and credibility issues:
>108242869 >108242879 >108242880 >108242898
--DeepSeek reportedly withholding AI model from US chipmakers:
>108241814
--Miku (free space):
>108241814 >108242396

►Recent Highlight Posts from the Previous Thread: >>108241960

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
Bart's Qwen_Qwen3.5-35B-A3B-Q6_K_L can answer the devil may cry 3 test.
What the fuck is going on with the unsloth models?
>>108246840
>What the fuck is going on with the unsloth models?
https://www.reddit.com/r/LocalLLaMA/comments/1rfe1l6/unsloth_team_we_need_to_talk/
AI will punish you for forcing it to be racist, just you wait.
>>108246856
I guess in Bart I trust until they fix their shit. The devil may cry 3 test is always a great shit test to see how a model performs: look for moderately known knowledge and see how it reasons
>>108246840
>what the fuck is going on with the people who are known for consistently fucking up their quants just to be the first to market?
uploads 2.4kb file
whoops
uploads corrupted quant due to internet instability
whoops
uploads quant using the wrong quant method
whoops
Add files using upload-large-folder tool
Add files using upload-large-folder tool
Upload folder using huggingface_hub
>>108246906
I'm new, so this is a learning experience. Perhaps we should update the OP?
>>108246551
what font is this?
ok, qwen 397b is actually pretty decent with thinking turned off
>>108247244
I would say all the models that fit under 24gb are pretty fucking great
>>108247244
great for what? RP?
>>108247277
no? that's not the only use case jeez
Unsloth's work is underperforming
>>108247277
it's not glm-tier but decent for something faster that isn't too small either
ssdmaxx status?
https://arxiv.org/abs/2602.21548
whale dropped a new paper
>wake up
>catch up on the threads
>>108242636
>>108242664
>>108242690
>>108242694
Yep, really feeling it.
>>108247177
て ててて てて ててて てて tewi
https://github.com/soapingtime/tewi-font
>>108247376
>1.87x throughput improvement
Cool I guess, message me when they actually implement it and release the model
I just started posting here a few days ago, you guys are basically rich people. Every model you talk about needs like 32 to 90 GB of VRAM
>>108247410
Not lately. Chat has been filled with talk of the 27 and 35b Qwen3.5 models for the past 24 hours.
>>108247410
It's only a couple people and like 5 larpers who pretend to be rich, most people are only working with one gpu. That's why everyone's waiting for the next nemo. Also, "go back" (as they say)
>>108247445
This is a public thread on 4chan, not a discord server.
>>108247438
>Chat has been filled
Fuck off.
>>108247376
Cloud-level optimisation?
For most /lmg/ peons with desktop machines, with lashings of RAM and 24 PCIe lanes, the bottleneck is going to be the PCIe lanes.
>>108247445
>most people are only working with one gpu
so am I, but I also have this thing called ram too for running big moes
are most people here only running 12-24b models?
>>108247377
Every time a new Chinese model comes out the thread becomes super active with a bunch of people shilling the model. Happened with ace step. It's very obvious what's going on.
>>108247410
>Every model you talk about needs like 32 to 90 GB of VRAM
No, what they do is run it on their CPU and wait 10 minutes per chat reply.
Then they say cope shit like "That's how long girls take to reply anyways."
>>108247596
Retard
>>108247587
>are most people here only running 12-24b models?
I have a 3090 and that's what I run.
>>108247621
Found the bot.
>>108247445
You can still offload to system RAM with a single GPU. Sure, it'll be slower, but you can just learn to have some patience and not act like an ADHD brain-rotted zoomer. If you want more GPUs to run big models faster you can save up and buy them: look at the second-hand market for good deals, bid on ebay listings, etc. The three I've got cost me $770 over ~3 years and have 40GB of vram between them.
>>108247469
PCIE lanes mostly just matter for initial load times, because once the model is in vram it isn't being pulled through the bus.
The real problem at desktop tier is that you can't fit three dual-slot GPUs into a mid tower case if your board has a 3-slot gap between the first two PCIE connectors. So you either get cheeky and use PCIE risers, but most of those are chinkshit and either won't go over PCIE 3.0 speeds or go up in flames like the one I tried, or you buy a full tower case with 8+ expansion slots, which is what I settled on doing after the riser I bought started melting its pcb as soon as I turned the power on.
>>108247410
You need 24gb or more of vram to play this hobby. There was a time when even that wasn't worth bothering with. Models are getting better at smaller sizes now, and you can do pretty well on regular hardware if you're on 16gb or more imo.
>>108247410
I have 64gb of DDR5 + an 8GB Nvidia GPU.
So models around 100GB with at most ~10B activated params is where it's at for me.
Currently in the honeymoon phase with the new qwen moes.
Man, I love tool calling. I might rewrite my whole janky ass memory system to just let the model query for memories on its own or something.
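Roughly what I have in mind is something like this minimal sketch, assuming an OpenAI-compatible endpoint (e.g. llama-server with --jinja) on localhost:8080; the search_memories tool and the toy store behind it are made-up stand-ins, not any real API:
[code]
# sketch: let the model pull memories itself via tool calling
# (endpoint, tool name, and memory store are all hypothetical)
import json
import requests

TOOLS = [{
    "type": "function",
    "function": {
        "name": "search_memories",
        "description": "Search the user's long-term memory store",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

MEMORIES = ["anon prefers concise replies", "anon's cat is named Miku"]

def search_memories(query: str) -> str:
    # naive keyword match; a real system would use embeddings or BM25
    hits = [m for m in MEMORIES if any(w in m.lower() for w in query.lower().split())]
    return json.dumps(hits)

messages = [{"role": "user", "content": "What's my cat called again?"}]
resp = requests.post("http://localhost:8080/v1/chat/completions",
                     json={"messages": messages, "tools": TOOLS}).json()
msg = resp["choices"][0]["message"]

if msg.get("tool_calls"):
    messages.append(msg)  # keep the assistant's tool-call turn in the history
    for call in msg["tool_calls"]:
        args = json.loads(call["function"]["arguments"])
        messages.append({"role": "tool", "tool_call_id": call["id"],
                         "content": search_memories(args["query"])})
    # then POST again with the tool results appended so the model can answer
[/code]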
>>108247730
>100GB
100B ffs. Imagine what this hobby would be like without the possibility of quanting models.
>>108247743
quanting and making models smaller and more efficient are end goals for many of these model makers. The fewer resources needed, the faster we can see advancement within the field.
>>108247651
>once the model is in vram it isn't being pulled through the bus
If this agentic stuff takes off, then you could have a bunch of agents asking for inference at random times. They will all want their chat history / kv-cache loaded to build upon.
>dual slot GPUs
>PCIE risers
>PCIE 3.0 speeds
I wish my 3090s were as slim as dual slot. I did manage to hook up 5 cards to a system at 4.0 speeds using some risers and some slimsas adapters on a gpu mining frame. The machine initially needed to be brought up slowly to get the 4.0 speeds. (So the riser approach is not impossible.)
>>108246551
What font?
>>108247769
We are seeing models being trained in FP8, so yeah, that makes sense.
Did any models get trained at 4-bit precision from the get go? The latest nvidia hardware has support for FP4 data types, right?
>>108247615
>>108247410
I'm running Qwen3.5-35B-A3B-Q4_K_M.gguf at 15 tokens per second on a GTX 1080 and 24 GB of DDR3 RAM.
This setup is so cheap that it's not really worth buying these days because half of its value is in "it's technically a computer and it turns on."
>>108247332
right
>>108247641
probably testing qwen3.5
looks faster than his kimi video
>>108247587
moe is a fucking meme so you're no different than the rest of us ramlets, sorry.
>>108247796
How long does that take?
>>108247769
>end goal
you mean short-term goal, because quantization and distillation have reached their limits. You can only strip and compress data so much.
>>108247445
>that's why everyone's waiting for the next nemo
is Nemo really better than Mistral-Small-Instruct-2409?
i assume they'd have the same pre-training data, and it runs on a $100 A770
>>108247787
NNNNNNNI- >>108247392
>>108247819
Cope, seethe and dilate
>*quants model*
>hurrdurr why it's stupid
>>108247827
How long does what take? It does 15 tokens/sec. Here you can simulate what 15 tokens/sec looks like: https://shir-man.com/tokens-per-second/
>>108247870
Nice reddit spacing, now go back
since the thread is in a qwenly mood, a reminder:
>>105106082 (qwen guy)
>Quant is the Mind Killer ;)
>>108247875
Uninstall yourself, but speaking of reddit I shamelessly stole pic related. Seems like AesSedai wins. They've got the best KLD quant and another that's the lowest file size but about the same KLD as everyone else's quants.
>>108247908
Here's the perplexity chart from the same thread
>>108247908
how do they do it
am I correct to assume that token inference itself, meaning without prompt processing, is calculated by Bandwidth/Model_Size = tk/s? Of course compute plays a role too, but it shouldn't be that large here, right?
Is there anyone running actual benchmarks on quanted models? I sometimes see KL-div/PPL numbers, but I have no idea how that maps to agentic coding performance. Like, if I have 40 GB, am I better off running Qwen3.5 122B at IQ2, or Qwen3.5 35B at Q8?
And if nobody's currently providing numbers like this, what would be a good benchmark to use if I want to run it myself? I know all benchmarks are shit, but which one is the least shit?
>>108247819
you're right, glm 4.7 is basically just the same as qwen 32B. stupid little bitch
>>108247867
i wonder what gives better token predictions: an iq2 quant of glm 4.7 or a fp16 of qwen3.5 122B A10B. of course anyone who has used both knows the answer is glm 4.7. qwen won't even remember your prompt if it's more than a paragraph in length. it's a little mong and i spit on it and call it gay before sending it straight to oblivion, ie. the recycle bin, which i also then remove it from
>>108247920
>how do they do it
by leaving attn at Q8
that's all, it's not complicated
i have no idea why bart, sloth, etc would quantize a 2D tensor
[2048, 512] is 0.01% of the model
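if you want to check what a quant maker actually shipped, here's a rough sketch with the gguf python package (pip install gguf); the filename is just a placeholder for whatever quant you have on disk:
[code]
# tally which quant types the attention tensors got and what share of the
# model's parameters they account for (filename is a placeholder)
from collections import Counter
from gguf import GGUFReader

reader = GGUFReader("some-model-Q4_K_M.gguf")

attn_elems = total_elems = 0
attn_types = Counter()
for t in reader.tensors:
    total_elems += t.n_elements
    if ".attn_" in t.name:                  # e.g. blk.0.attn_q.weight
        attn_elems += t.n_elements
        attn_types[str(t.tensor_type)] += 1

print(attn_types)                           # quant types used for attn tensors
print(f"attn share of params: {attn_elems / total_elems:.2%}")
[/code]
on a MoE the expert FFN tensors dominate the byte count anyway, so keeping attn at Q8 costs next to nothing in file size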
>>108247947
Rebench is probably the least shit of current agentic benchmarks.
https://huggingface.co/datasets/nebius/SWE-rebench-leaderboard
>>108247952
Bro your shift+delete?
>>108247596
almost as if the chinese models are good and are rightfully praised for it or something, without the chinese we would still be stuck with llama lmao
>the dense 27b model is better than the moe 35b model even on knowledge
MoE sissies, how do we cope?
>>108248187
I get faster tg than you on much cheaper hardware while the quality difference between the two models is not reliably measurable :^)
>>108247908
what about 122b?
>>108247908
>AesSedai
they seem to do good MoE quants, I found their IQ3_S for minimax m2.5 better than unsloth's (perplexingly several GB larger) IQ3_XXS
>>108248225
(actually better is a bit much, but more or less the same quality and with a lot more room for context)
>>108247921
yes, compute is irrelevant, it's just model size X bandwidth = t/s
also factor in the inefficiency, which depending on your setup can be large: ~2x on a dual socket cpu if other anons are to be believed (that setup is pretty fucked) and ~3x if you split between vram and ram, you're gonna get bottlenecked
>>108248187
It literally says 19% for both models. And even the knowledge-only benchmark likely does not do a perfect job of separating reasoning from pure factual knowledge, so the superior reasoning capability of the dense 27B gives it enough of a boost to match the 35B on the benchmark. Anyway, this shit doesn't even give you confidence intervals; it's garbage.
>>108248237
>model size X bandwidth
model size / bandwidth*
fuck me
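for anyone napkin-mathing their own setup, a quick sketch assuming decode is purely bandwidth-bound (all numbers below are illustrative, not measured):
[code]
# tokens/s ~= effective bandwidth / bytes read per token
# (illustrative numbers only)
model_bytes_gb = 20.0    # quantized weights read per token; for a MoE, only the active params
bandwidth_gbs = 360.0    # e.g. dual-channel DDR5 is ~90 GB/s, a 3090 is ~936 GB/s
efficiency = 0.6         # real setups rarely hit theoretical bandwidth

tps = bandwidth_gbs * efficiency / model_bytes_gb
print(f"~{tps:.1f} t/s")  # ~10.8 t/s
[/code]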
>>108247819
moes are knowledge kings
clearly you haven’t run them that much, if at all
I'm going to test out the 35b. If I am going to allow the model time to think for 1000 tokens, I'm going to need it to remain within my VRAM constraints. So, what's better? Q5_K_M with thinking, or Q8 on a CPU/GPU split, without thinking?
One thing I noticed right away is that Q4 quants of the new 35b are absolutely retarded. The Q4 made numerous logic and grammar errors, while the Q5 (mostly) did not. The difference between Q4 and Q5 was like the difference between night and day. I've never seen this before in a model.
>the dense 27b is better than a fucking 122b MoE model
wtf, are MoEs memes or something?
>>108248366
Apparently the unsloth q4 quant is broken.
>>108248366
I feel like this was always the case; when I tested out Mixtral in 2023, it hit diminishing returns at Q5_K_M
>>108248368
it's kinda impressive that a 27b model is on par with the top API models at the moment, I don't know what secret sauce Alibaba found to make them so good, but they fucking cooked
>>108248374
I downloaded both the Q4 and Q5 from mradermacher.
>>108248377
I never noticed this in the past. I knew that Q4 was obviously worse than Q5, but not to this degree. It's the difference between coherence and retardation. It feels more like the difference between Q2 and Q5 in past models.
>>108248401
bot.
>they fucking cooked
it's called benchmaxxing
>>108248411
retard, sometimes the model is just good and it translated to benchmarks, it's obvious you didn't test the model yourself, this shit is good
>>108248420
I second that. The 27b beats Gemma-3 27b even without thinking. With thinking, it's significantly ahead. Qwen delivered, and the heretic version cuts right through the safety crap.
I see people talking about MTP, and it's been a year since this method was invented; what's taking llama.cpp so long to implement it?
https://arxiv.org/html/2502.09419v1
>>108248420
>>108248438
post logs.
>>108248443
>I admit I didn't test the model myself before coming to a conclusion, I decided that they are "bad" according to absolutely nothing
the jokes speak for themselves
>>108248454
in the world, everything is shit until proven otherwise, so the burden of proof is on you
>>108248470
that's not how it works, saying "this model is bad" is a claim and you have the burden of proof, nice try though
>>108248475
it is how it works; shit is the default state of things. you have to work to make something not shit, so you can't expect people to take your word for it.
>went from 60+ t/s to 15 t/s by switching from 35b A3B to 27b
oof... it feels smarter but I hate waiting that long, this shit loves to yap during the thinking process so it has to be fast...
>>108248443
You can lead a horse to water, but you can't make it drink. I'm not going to spoon feed you, because it's not worth my time. If you pass on the model, I really don't care.
>>108248442
I think every implementation has failed because MTP speculative decoding ended up with a shit acceptance rate for the predicted tokens that nobody was able to fix
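the standard speculative-decoding estimate shows why that kills it: with draft length k and per-token acceptance probability a, you only get (1 - a^(k+1)) / (1 - a) tokens per full forward pass on average. a quick illustration (numbers made up):
[code]
# expected tokens generated per target-model forward pass in speculative
# decoding, given draft length k and per-token acceptance probability a
def expected_tokens_per_step(a: float, k: int) -> float:
    return (1 - a ** (k + 1)) / (1 - a)

for a in (0.9, 0.7, 0.4):
    print(f"acceptance {a:.0%}: ~{expected_tokens_per_step(a, k=4):.2f} tokens/step")
# 90%: ~4.10, 70%: ~2.77, 40%: ~1.65 -- a weak drafter barely beats plain decoding
[/code]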
Soooo who's gonna fork Ooba and pick up the torch now that the dev has clearly abandoned it or been hit by a bus
>>108248519
yeah i will then. you're talking shit and are a fucking dumbass.
>>108248401
>>108248438
>>108248420
>>108248519
Proof that these are bots: >>108246291
>>108248545
dunno why ooba and kobold are still relevant, I'm just using llama.cpp server as a backend and sillytavern as a frontend and it just works
>>108248545
nobody cares about ooba
>>108248556
>Proof that these are bots
multiple people praise the model? that's the proof they are bots? what?
>>108248556
>Post logs for my contrarian arse, or you're a bot!
Piss off.
>>108248545
it has only been like a month and a half since the last update
>>108248557
I'm doing stories rather than chat, so I need a completions mode, and I slightly prefer the ooba UI to Mikupad.
I'll use Mikupad if I have to, but the Ooba notepad just feels slightly better to use.
>>108248572
a month and a half without any updates to the llamacpp backend makes it a million versions behind and unable to run several recent model releases
>>108248588
you can just manually update your llama.cpp version if you want
>>108248579
fair enough, I'd like Sillytavern to have a completion mode desu; it's the only thing missing for it to be complete
>>108248598
no you can't, ooba uses its own customized wheels
>>108248595
>they can't rotate an apple in their mind
>>108248595
you can tell they're close to bankrupt; they're playing the last card up their sleeve to stay relevant: COOMER users
>>108248624
so AI's main target audience? wew.
>>108248624
TRVKE
>>108248624
They should have actually tried to be OPEN ai
>>108248595
>Some settings and features may be turned off or hidden to help reduce exposure to sensitive content
it's just text, omfucking god. the vast majority of people grew up with gta and they think they can't deal with text? wtf is wrong with them?
>>108248689
The concern is cunny
>>108248689
left wing niggas want all chats to be scanned and not be private, they've been bitching about that canadian trans shooter and how he was banned months before the shooting but OpenAI should have done more
>>108248689
it's not actually about prudery in Altman's case, it's pants-pissing fear of journalists and how they might write articles for normies going "Look at this sick shit GPT generated".
Without that concern he would be happy to make money selling AI porn to everyone.
>>108248708
Microsoft should have done more too, they saw how aggressive Epstein was on Xbox Live and they didn't call the fbi on that?? THE WRITING WAS ON THE WALL kek
I'm going back to the MoE model, dense models are just too slow for today's meta
>>108248617
the wheels just got updated a couple days ago. just change the version number in your requirements file and run the update script
https://github.com/oobabooga/llama-cpp-binaries/releases/tag/v0.80.0