/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>107383326

►News
>(11/28) Qwen3 Next support merged: https://github.com/ggml-org/llama.cpp/pull/16095
>(11/27) DeepSeek-Math-V2 released: https://hf.co/deepseek-ai/DeepSeek-Math-V2
>(11/26) INTELLECT-3: A 100B+ MoE trained with large-scale RL: https://primeintellect.ai/blog/intellect-3
>(11/21) GigaChat3 10B-A1.8B and 702B-A36B released: https://hf.co/collections/ai-sage/gigachat3
>(11/20) Olmo 3 7B, 32B released: https://allenai.org/blog/olmo3
>(11/19) Meta releases Segment Anything Model 3: https://ai.meta.com/sam3

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>107394940
Kinda funny how it concludes
>it's okay if I just pretend she's actually 18
>>107394971secks
►Recent Highlights from the Previous Thread: >>107373173

--DeepSeek V3.2 confirmed as garbage benchmaxxed, it's over, sama won.

►Recent Highlight Posts from the Previous Thread: >>107373176

Why?: >>102478518 (DEAD)
Enable Links: https://rentry.org/lmg-recap-script
>>107395003Nice recap
Is Deepseek really our only hope to ever match izzat with Gemini?
>>107395003we didn't even hit page 9 yet you labubu
►Recent Highlights from the Previous Thread: >>107383326

--Mistral Large 3 integration and parameter size speculation:
>107391247 >107391281 >107391983 >107392052 >107392625
--Adding Ministral3 model support to llama.cpp and architectural distinctions:
>107391911 >107392074 >107392079 >107392089 >107392095
--Skepticism and analysis of DeepSeek-V3.2-Speciale's novel features and training methods:
>107392436 >107392461 >107392484 >107392503 >107392537 >107392543
--Deepseek API model features and pricing comparison:
>107393178 >107393194 >107393643 >107393661 >107393768 >107393875 >107394146 >107394182 >107394347
--Evaluating Qwen3 A3B models for text prompt enhancement on high-core server hardware:
>107383592 >107385249 >107385515 >107385536 >107385576 >107385600 >107386603
--RTX 3090's long-term viability in AI hardware landscape:
>107384136 >107384156 >107384623 >107384655 >107384681 >107384708 >107384815 >107384177 >107384192 >107384190 >107384218 >107388561 >107388570 >107388627 >107384366 >107384466 >107384502
--Bert-Nebulon Alpha speculated to be Ministral 3:
>107387197 >107387250 >107387289 >107387378
--Control vectors as solutions for model positivity bias:
>107387223 >107387281 >107387311 >107387322 >107387338
--Struggles with local model performance and quantization tradeoffs:
>107383781 >107387410 >107387456 >107383886 >107383915 >107388528 >107388653 >107389405 >107389932
--Hugging Face Transformers library update with Ministral 3 model:
>107386861 >107387118 >107390941
--AI-generated code policy changes in llama.cpp:
>107386661 >107386681 >107386684 >107386816 >107386929
--Exploring RL training models with reliable function calling (Qwen3 vs ToolACE-2-Llama):
>107384624 >107386072
--Ministral3 model support added to llama.cpp:
>107393747
--Miku (free space):
>107387139 >107391271 >107391305 >107391346 >107392856 >107393073

►Recent Highlight Posts from the Previous Thread: >>107383338

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>107394940
>>107394977
>we must
They did lmao. DS fell off. OpenAI poison pill worked. Why did retard chinks distill a 120b model?
https://lynchmark.com/
https://lynchmark.com/
https://lynchmark.com/
It's up.
>>107395026Dipsy fell off >>107395041
>>107395041It's so over
>>107395036wait a second...
>>107395096stop with the whole mascot thing so cringe
>>107395026what is the source of this collage?
>>107395141japon
>>107395141Kansai Enkou
Local model recs for 24GB VRAM and 64GB RAM?
>>107395192Qwen Max Q8.
>>107395041Because GPT OSS is still the best local model
>>107395211
I'd forgotten they dropped that. Is there gguf support?
>>107395238Yes.
I meant SFW (but subtle) photos like in the collage
>>107395174
this finds just some JAV :/
>>107395192>samplebaka my head
>>107395306>IMG_
I'm definitely getting better outputs from K2 thinking than from coder-480, but it takes a lot more wrangling.
Pretty crazy that we're basically SOTA if you have enough ram or patience.
captcha: NMSAM
>>107395192What is xe saying?
wtf that's not what I typed. baka
>>107395327whale good
>>1073953064cuck doesn't like big files.
>>107395026JKs are the pinnacle of humankind.
>>107395192
broken-tutu-24b, maybe q5 or q6.
I can cram qwen3-30-instruct into a 2080ti 22GB at q4. It's smarter for function calling, but not good for roleplay.
>>107395479
>I can cram qwen3-30-instruct into a 2080ti 22GB at q4
You are running the thing fully in VRAM instead of putting at least some of the experts in RAM?
How many t/s do you get? It should run fast as fuck like that.
How do you make your model have balls? I want the AI to RAPE me but it never makes the first move, fucking pussy
>>107395041once, frickin, again. chang forgot to clean up their dataset
is it possible to run a local model only to ask it mathematical questions and get detailed answers, on consumer hardware (16 vram) these days? lowering stuff like context to a minimum I assume lowers vram usage by a lot, since I just need 1 good answer and no history
>>107395643yeah it's pretty good. Just make sure to add a comma or space between every 3 characters in a long number
>>107395721which model would you suggest?
> | `MistralLarge3ForCausalLM` | Mistral-Large-3-675B-Base-2512, Mistral-Large-3-675B-Instruct-2512 | `mistralai/Mistral-Large-3-675B-Base-2512`, `mistralai/Mistral-Large-3-675B-Instruct-2512`, etc. | | |
https://github.com/vllm-project/vllm/pull/29757/files
> 675B
>>107395793So they just tuned DS3, kinda base
>>107395812I think it has image input too.
>>107395793can't wait to run this at q1 kek
>>107395793We. Are. So. Back!
>>107395771idk anything with reasoning like Qwen. Choose the biggest IQ4_XS you can fit in VRAM. If you don't mind the wait, you can offload some of the model to RAM. Most models should be able to do undergrad math anyways
/g/ros this is huge? Why is no one talking?
>>107395944What's the point? Never seen a model go past 30k without becoming schizo. Most models start cracking past 12k
>>107395579Just make it your usual werewolf vampire ceo sis
>>107395995well now that you can tune to huge contexts for cheap this will be a thing of the past!
>>107395944
>Daniel
>can't even quant with known algos
Doubt.jpg
>>107396023After a certain threshold, training compute time increases with the square of context size.
>>107395793
>moe
>675b
Yeah, it's actually over, local is officially dead. I'm considering a gf now.
>>107396067you should become the gf
>>107396067Just use Nemo as your gf, she won't get surpassed anytime soon
>>107395833
So DS 671B + a 4B image encoder. Is Mistral so incompetent now that they couldn't even manage to successfully run a distillation, so finetuning was their only option? At least that means it can't be that bad. Miqu 2?
>>107395396true
>>107396100I'll take a MiDSqu
>>107396067
>moe
I have 8 256gb nvme ssds in a raid 0 array ready for this.
Did llama 4 use any shared experts?
hello guys im stupid and also im new to using ai for vibecoding.what is the claude code equivalent for kimi k2?
>>107396100Mistral is half filled with women
>Most enthusiasts won't be able to afford to run the largest or very large new open weight models at a reasonable speed
We have to be content running smaller 32b to 192b models..
>192 gb of ram is 3k now and a rtx 6000 pro costs 7500-8000usd and a mac studio with 512g of ram costs 9.5k... With RAM and GPU prices being this expensive and the SOTA models getting larger, by the end of 2026, you will have 1.5-2 trillion parameter open weight highly performant models. How will most enthusiasts be able to run a 2 trillion parameter model locally over 18 tokens/second in 2026? (They have to wait years for that.... I guess distilled models will get better). Even running q4-q8 500B to 1T models locally at 18 tokens/s will be out of reach for many...
>I guess even those with deep pockets will be forking over 20k to run a q4 2T model with a large context window on two m5 ultras or over 40k on 1.1tb of ddr5/6 ram and 2 rtx 6000s in 2026.
>How will an average enthusiast be able to even afford 128-192 gb of (>600GB/s) fast ram and a good <1.5 year old gpu with fast prefill speed for a 128-256b model? I guess they can use m2 ultras or m1 ultras, but the prefill is kind of slow and the gpu is a little dated..
>How much money do most people even have to buy an LLM rig? $1k to 4k?
>By 2028, you will have 8 trillion open weight models.. I guess most enthusiasts will be stuck running q4-q8 32b to 200b models locally with 10-80% capability or quality of multitrillion parameter models until 2027-2028 when ram production ramps up, or they will be using the API or renting a gpu.
>Even if ram production goes up, ram will still be more expensive in 2027 than in 2024.... I hope apple doesnt raise their ram prices, they have fixed price ram contracts after all... At this rate, we might as well have time share data center GPUs..
>>107396169good boy https://www.reddit.com/r/LocalLLaMA/comments/1pbabiv/most_enthusiasts_wont_be_able_to_afford_to_run/
>>107396123bitrots your model into gay sex. Nothing personnel kid
>>107396135
It did
>For example, the Maverick variant stores 400 B parameters but activates just 17 B at inference time. Dense and MoE layers are interleaved so that every token is processed by a shared expert plus one of 128 routed experts.
>>107396179
>most_enthusiasts_wont_be_able_to_afford_to_run
fat and obese
I should have cpumaxxed before RAM prices exploded.
>>107396200You should cpumaxx before RAM prices explode more.
>>107396187Sick.Thanks.
>>107396200We told you to do it...
>>107396162
You can use Claude Code with kimi k2. Go to the webchat and ask it, it will tell you step by step. It's just pointing the env variables to the moonshot servers and kimi models.
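Something like this if you want the short version. The env vars are the standard Claude Code overrides; the Moonshot endpoint and model id are from memory, so treat them as assumptions and check their docs:
# sketch, not gospel: base URL and model name are assumed, verify against Moonshot's documentation
export ANTHROPIC_BASE_URL="https://api.moonshot.ai/anthropic"   # their Anthropic-compatible endpoint (assumed)
export ANTHROPIC_AUTH_TOKEN="sk-your-moonshot-key"              # your Moonshot API key
export ANTHROPIC_MODEL="kimi-k2-thinking"                       # assumed model id, check what they actually expose
claude   # then run Claude Code as usual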
>>107396169I'll wait for the < 70B line-up to improve once the big fags get tired of going bigger to impress the benchmaxx midwits and waste gorillions of dollars on hardware and electricity
What motherboards are you cpumaxxers running? H13SSL?
>>107396200? I just bought 8x 64gb 3200mhz ddr4 rdimms for $200 each a month ago (~$130 usd)
>>107396238Interesting. I had only been looking at DDR5.
>>107396238>ddr4Bro we're using ddr5 here
>>107396237
MZ73-LM1 because I was planning to add a second processor and go 1.5TB RAM next year. If I had known that RAM prices were going to explode and that I'd end up stuck with 768GB, I'd have gone for the H13SSL.
>>107396238>$200>$130 USDwot. Speak in McDonalds please
As a cpumaxxer I only regret not buying 1.5 TB instead of 0.75 TB RAM. Having all big models at home has 100% been worth it.
>>107396237
MZ73-LM0
wish unsloth would have a guide for installing with lmstudio too, not just ollama and llama.cpp
>>107396295lol really bro?
>>107396255
>>107396266
If you just need something to run big models, settle for ddr4. ~10-15 tokens/s with 8 channels running glm.
If you want ddr5 for the speed... why not stack mi50s instead? Unless you're not poor I guess.
I want to thank the anon that mentioned the $3k Threadripper 7970X, TRX50 128GB, DDR5-5600 kit bundle a few days back.
Now I just need to decide whether to upgrade my ancient X99 rig or sell the bundle for a $1.5k profit.
>>107396319
>why not stack
ewaste, at least it'll keep you hot this winter
>>107396341The heating is just a nice bonus.
best models for my specs?>air m4>24gb ram>10c/10c
>>107394971I'd like a workflow for Wan, 64gb of ram and 24 of Vram.As well as lora(s) training tips.
>>107396341I mean, we are talking about running big models for cheap, you're going to have to make some compromises.
>>107396363oui oui bageutte au fromage
you shouldn't spend lots of money now on hardware when we all know new hardware built with AI in mind (16 channel ddr5 intel/amd servers and budget gpus with lots of vram) will be made from now on, making everything running today obsolete soon, you'll be able to buy the obsolete stuff for cheap second hand or if you have money you buy the new stuff made for ai
>>107396363Hello sir,Please may I recommend you /ldg/,Thanks.
>>107396371they're literally stopping making new rams because of saltman dude
>>107396392meant for >>107396379
>>107396379>when we all know new hardware built with AI in mind>will be made from now onDo we?
>>107396385>/ldg/Oh right, oui oui baguettes.
>>107396379lol
3.2-Speciale feels like a better version of K2-Thinking. They both have the issue that they think for very long and tend to shit out extremely long replies but I like the way how Speciale writes.
>>107396392>>107396407>>107396423Refute me. You can't.
>>107396434Refute what? What's your reasoning?
>>107396434
>Today
The root cause of the shortage is a shift in demand, with much of the industry's capacity now focused on high-bandwidth memory used in AI accelerators. This shift leaves less wafer output available for commodity DRAM and 3D NAND. Building significant new capacity takes years, so substantial relief is unlikely before late 2027 or 2028. In response, Team Group plans to prioritize strategic AI ...
https://www.techpowerup.com/343518/memory-shortage-just-started-major-price-hikes-ahead-warns-team-group
>>107396407
>Do we?
>16 channel ddr5 becoming the only model intel works on
https://www.tomshardware.com/pc-components/cpus/intel-cancels-part-of-its-next-gen-diamond-rapids-xeon-lineup-report-claims-xeon-7-will-drop-models-with-8-memory-dimms-to-focus-only-on-16-channel-cpus-for-extra-memory-throughput
>amd putting lots of vram on budget gpus
https://overclock3d.net/news/gpu-displays/leak-unveils-game-changing-specifications-for-amd-radeon-rdna-5-gpus/
it'll still take 2 or 3 years for the change to happen but yeah, everything points to more (v)ram and faster (v)ram, and that's all because of AI
>>107396379The only thing people will be able to afford is Mac studio. DRAM prices are going to get even worse. Suppliers are canceling long term agreements. Apple is going to be the only manufacturer able to weather the storm.
>>107396475>Published: August 25, 2025
>>107396493
nah bro u crazy
>Memory Drought 2025: TEAMGROUP Halts RAM Quotes — Dramatic Price Hikes Ahead
https://www.guru3d.com/story/memory-drought-2025-teamgroup-halts-ram-quotes-dramatic-price-hikes-ahead/
>>107396493dram prices will go up for 6 months and then dive down back to where they were, it's just a bottleneck
>>107396526
>6 months
>>107396455
>substantial relief is unlikely before late 2027 or 2028. In response, Team Group plans to prioritize strategic AI
>>107396526
actually, this will lead to overproduction and the prices will go below what they were earlier this year
it's just basic economics
>>107396539they are lying to sell stuff for higher prices, that's what companies do, hynix is massively investing in new dram fabs, the chinese are starting to make ddr5, samsung is flipping nand production to dram
>>107396552I see thanks for the info! Will just wait 6 months and a day to buy when prices crater thanks for the tip!
New Deepsuck has a tendency to reply in Chinese now, might have diluted some Qwen in there
>>107396507
>>107396526
>"DRAM Supply Shortage Unlikely to Ease Until H1 2027"
>“We will minimize oversupply risks.” (Samsung Electronics)
>“It is difficult to resolve the supply shortage until the first half of 2027.” (SK Hynix)
>It is reported that Samsung Electronics’ Memory Business Division is currently able to supply only about 70% of its DRAM orders. As the supply shortage intensifies, the division has reportedly refused requests for long-term mobile DRAM supply contracts from major clients. Samsung stated, “While clients want multi-year long-term contracts, Samsung does not want to tie up volumes with specific clients during a phase where prices are surging.”
https://xcancel.com/jukan05/status/1995431391430098981#m
glm 4.6 air status?
Guys be honest, did you get fooled by gpt4chan?
>>107396609
>gpt4chan was gpt2 sized and nobody believed it was a bot. the base models are extremely powerful.
>>107396542
unfortunately no
the price is spiking because they're shutting down all production to squeeze shekels out of the AI bubble before it pops
we got a circlejerk passing around literal trillions of fake non-existent dollars, all probably engaging in enough fraud to make Enron look pedestrian
they're on a time limit before it detonates and gives us a second 2008
>>107396613
>>107396614Fucking let them cook you ungrateful piece of shit not worth the water you're stealing from AI.
>>107395055
I got all excited thinking this was another benchmark to test censorship
>>107396200I was sweating bullets ordering those parts back when no one knew if it would work out
>>107396636o-oh... ok... sorry....
>>107396646what is it? might as well share your disappoint with the class
>>107395055
>>107396652
>This benchmark tests the model's knowledge by tasking it to import the right library from the right CDN URL path and having the pre-existing library specific knowledge to correctly implement a solution for each challenging problem for/in the browser environment using JavaScript.
>>107396650Terribly sorry for the outburst but you have to understand it's hard to stay positive and motivated in these trying times.
Speaking of dram prices
>paid 389 euros (with vat) for a 96gb kit in 2023
>the same kit is now 1029 eur
Grim, apple might really be an option
How to make Kimi think from perspective of {{char}}?
>>107396716
Try prefilling with something like
<think>{{char}} thinks
See what happens and adjust.
>>107396237
I didn't buy it exclusively for that purpose, but if and when I get to optimizing NUMA performance I'll be doing it using an ASRock Rack TURIN2D24G-2L+/500W motherboard.
For straight CPUmaxxing I would have gone with one of the Gigabyte boards mentioned by the other Anons though (if you can live with having "only" 4 16x PCIe 5.0 slots).
transformers v5 is out!
News
Hey folks, it's Merve from Hugging Face! I'm here with big news: today we release transformers v5! With this, we enable interoperability with our friends in the ecosystem (llama.cpp, vLLM and others) from training to inference, simplify the addition of new models and significantly improve the library. We have written a blog on the changes, would love to hear your feedback!
>>107397020
>We’re fortunate to collaborate with many libraries and apps built on transformers, in no specific order: llama.cpp, MLX, onnxruntime, Jan, LMStudio, vLLM, SGLang, Unsloth, LlamaFactory, dLLM, MaxText, TensorRT, Argmax, among many other friends.
ollamabros...
what makes it so speciale
>>107397283It's built to tackle big tasks and think as long as it takes so it can take up its entire context window in a single reply
>>107397283It's because of the metric system
>>107395192
glm air, maybe:
https://huggingface.co/Intel/Qwen3-235B-A22B-Instruct-2507-gguf-q2ks-mixed-AutoRound/tree/main
if deepseekv2 was the latest architecture before v3.2exp, how come deepseek v2 didnt have MLA?
>>107397307
>https://huggingface.co/Intel/Qwen3-235B-A22B-Instruct-2507-gguf-q2ks-mixed-AutoRound/tree/main
That's actually a pretty sick looking model for 96gb ram. I wish people making quants actually put some thought into "best 24gb model, best 48gb model, best 96gb model, best 24gb+dram model", etc. Making partially quantized stuff like this would almost certainly result in best-in-class models for the actual hardware people have.
>>107397298
>1M output tokens
>$0.42
damn that's cheap as fuck
What's the best local model to run with a 5060-ti? gpt-oss:120b?
>>107397556Rocinante
>>107397556glm air if u have enough ram
FUCK
>>107397556earning more money
>>107397556nemo
>>107397622Kneel...
its actually really really good. Too bad about the shit context window with its gpt5 high level of thinking
>>107397606
is 48gb ram enough?
>>107397586
>>107397714
how much better are these? never even heard of them.
>>107397628
yeah will try getting the 5090 at some point
>>107397556>gpt-oss:120b?Yeah.
I want DSA support in llama.cpp for christmas.
>>107397556gpt-oss 20b and magistral small 1.2
>>107397865The guy who tried vibecoding V3.2-exp support for llama.cpp just bought two (2) books on CUDA hoping it'll help him implement these models.
>>107397824if you have a 5060ti 16gb ye, otherwise, tough cut but still yea
>>107397824
If you don't want a cuck HR model use nemo or its coom tune rocinante
>>107397964>this level of vramlet cope
I'm so close to buying a 3090 for 450 euros. What are some models I should avoid?
Currently these are on sale:
PNY XLR8 RTX 3090 EpicX Triple Fan 24GB
Palit RTX 3090 Gaming Pro 24GB
Inno3D iChill RTX 3090 x4 24GB
Gigabyte RTX 3090 Gaming OC 24GB
Gigabyte RTX 3090 Eagle 24GB
MSI RTX 3090 Suprim X 24GB
>>107397987
>What are some models I should avoid?
Everything not regularly shilled in these threads. I'm not even kidding, the bulk of them are trash and nearly every finetroon makes them worse.
>>107397987save up and get a 5070 ti instead for more, then you can do video gen at decent speeds, its not like 24GB is enough to run anything good llm wise
>>107398002
I'm talking about NVIDIA partner gpu models/manufacturers, like PNY/MSI/Gigabyte
>>107397987In my opinion MSI is the only good choice there.
>>107398014The meme with 3090s is stacking a lot of them, so you need to buy them in bulk to get any real value out of them. You should also be looking at 5070ti because Blackwell architecture is significantly better. If you're not broke, splurge for either the 5090 or 6000 Pro.
>>107397983>can't read the discussion
>>107397987get a 5090
>>107397987>450 eurosIsn't it a bit too cheap even for second hand market?
>>107398166Eastern europe
Yeah, sparse attention is a disaster that creates ADHD models. What a mistake.
ohio gooning?
>>107395036Ah, my Koikatsu model of Dipsy.
>>107398035 <3>>107398009>>107398073>>107398142thank you anons
gemma sirs? 4 of when?
K2 Thinking doesn't have CSAM alignment in Chinese
It can gen CSAM in Chinese without jb
>>107396363>Bina>>>/g/ldg/
>>107398588
>CSAM
just call it cunny like any other normal human being, retard
>>107396614air status?
sirs, what is theoretically the lowest spec'd PC you can run an LLM on?
Bonus if it predates the 21st Century.
>>107397020
I'm sure they broke a bazillion libraries in the process. No way I'm upgrading before next year
>>107398612oxygen amount low
>>107398612not good
>>107398621You can run them on phones
>>107398665george droid doesn't have this problem
bros i didnt keep up
mtp status?
glm 4.5v status?
was batch size for qwenext fixed?
thanks
>>107398686ZIT killed /lmg/
https://huggingface.co/mradermacher/gpt-oss-120b-Derestricted-GGUF
https://huggingface.co/ArliAI/GLM-4.5-Air-Derestricted
Thoughts on these? It seems like the GPT OSS abliteration was incomplete, it still thinks about "policies" when reasoning sometimes, glosses over certain topics, and it has weird formatting that ST and the llama.cpp GUI don't like. But it is WAY less restricted than it was, and smarter. GLM 4.5 AIR on the other hand will just do what you say, barely any trace of censorship. It's slower though. 4.0 tokens/sec on my 7900 XTX and 124GB RAM. Am I missing some args?
>./llama-server -fa on -hf bartowski/ArliAI_GLM-4.5-Air-Derestricted-GGUF:Q5_K_M
I have deepseek fatigue
>>107398692
yeah ive been playing around with zit and shitposting in ldg, pretty cool model.
I just checked all of the issues I've posted, no movement at all or progress.
fagganov should implement important shit instead of yet another macbook metal optimization, I fucking hate him
Okay so state of the matter
>all nations competing on who can make the other nations stuck in AI coom loops
>Musk is deploying it to the third world with the kinda shoddy grok based Ani
>it filters people who aren't retarded
>China deploying on most fronts
>LLM, Video Diffusion, etc.
Who will win the coomer deadlock game? And fuck the opponents' demographics even harder?
>>107398726
air derestricted is good
you are missing --n-cpu-moe 1000 and -ngl 1000
i get 9t/s at 0ctx on 3060 + 64gb ram
>>107398726isnt gpt oss native mxfp4? why the fuck would someone do q8 of it? are there no mxfp4 quants for this shit?
>>107398760Gptoss doesn't deserve a quant
>>107398749
>--n-cpu-moe 1000 and -ngl 1000
so, this puts all my MOE weights on CPU, and as many layers as possible on GPU, since GPT OSS doesn't have anywhere near 1000 layers, right? are MOE weights less computationally-expensive, hence putting them on CPU?
>>107398760
posted the wrong link
https://huggingface.co/gghfez/gpt-oss-120b-Derestricted.MXFP4_MOE-gguf
I have deepseek fatigue fatigue
>>107398861
>so
NTA, but yes.
People usually just do 99 instead of 1000, but same deal.
>>107398861
i am talking about glm air, but yeah glm air has less than 50 layers i dont know
i just put a high number so i dont worry about it
the point of doing this is to put as many shared weights on the gpu, because gpu is faster
moe experts are constantly switching, cant put them on gpu if u dont have enough vram
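For the anon asking, a minimal sketch of what those two flags end up doing (model path and numbers are placeholders, not anyone's exact command):
# -ngl 999 tries to offload every layer to the GPU,
# --n-cpu-moe 999 then kicks the per-layer MoE expert tensors back to CPU RAM,
# so what actually sits in VRAM is the shared/attention weights that every token touches
./llama-server -m GLM-4.5-Air-Q4_K_M.gguf -ngl 999 --n-cpu-moe 999 -fa on -c 16384
# then lower --n-cpu-moe step by step to push more expert layers into VRAM until it stops fitting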
>>107398861
>Derestricted
lmao, another failbake
>>107398897kys
>>107398890
>>107398893
awesome, now I'm getting 9.0 tokens/sec with GLM AIR. thanks anons!
>>107398897
false, these are legit and are using this technique:
https://huggingface.co/blog/grimjim/norm-preserving-biprojected-abliteration
https://redlib.catsarch.com/r/LocalLLaMA/comments/1oypwa7/a_more_surgical_approach_to_abliteration/
here's another from the author, it's quite good.
https://huggingface.co/grimjim/Nemo-Instruct-2407-MPOA-v2-12B
it's not perfect, there's probably a way to improve the abliteration even further, but this is a straight upgrade/free lunch that reduces refusals immensely while making the models smarter.
>>107399162
yw anon, you might also wanna change batch size to speed up prompt processing
-b 4096 -ub 4096
or 2048/2048
or 1024/1024
but i can get all the way to 4096 on a 3060 so im sure youll be able to go up to 4096
>>107399162
>awesome, now I'm getting 9.0 tokens/sec with GLM AIR. thanks anons!
You can get even more if you are able to lower --n-cpu-moe to put more of the model in VRAM.
You can fuck around with batch size, context size, fa, etc.
Guess you could also quant the kv cache, but I don't recommend that.
>>107398726Projecting the refusal vector onto the orthogonal harmless direction makes sense but I'm not convinced with renormalization
Outside of ERP, what do you use local models for?
>>107399315I have a browser automation + LLM bot that stalks people I don't like on Instagram
>>107399315studying, motivation, therapy, friendship, boredom
>>107399162
>>107398749
>>107399195
actually now I'm more confused. these args give 2x perf, but it's not even using all of my VRAM or RAM, it's like the model is just streaming from SSD while running faster. without these args, GLM AIR totally fills 23GB VRAM and uses ~60GB RAM, but is slower.
am I leaving perf on the table due to this? I mean, I don't mind using less memory, but it seems suboptimal.
https://x.com/Stretchedwiener/status/1994850294497443971
bros? local needs to get good and needs to get good NOW
>>107399368
you can gradually decrease --n-cpu-moe and get a little tiny bit of performance maybe
what you can do with the free vram is big unquantized context (128k), increase batch size (increases prompt processing speed)
>>107394971CUMMINGS ON HER THIGHS
>>107399368you can also run other things like tts and image generation in the free vram
>>107399391Local is like 90% there. Most people just use ChatGPT to cheat on their homework or as a replacement for Google. LM Studio + Brave Search is literally all they need
damn, deepseek V3.2 knows what "i'm about to buss" means
>>107399546Where is the zoomie slang benchmark?
>>107398749
In ooba
>GPU layers set to max
>n-cpu-moe=1000
>batch/ubatch 2048
I get 1.5 t/s compared to the 4 t/s I usually get not using n-cpu-moe with 24 layers offloaded to GPU. On a 4090 and 32gb RAM.
>>107399578nah lowkey who actually cares about lines going up on a graph??literally all these models are mid except for like one specific use case.mmlu scores are just astrology for tech bros no cap.like bro call me when it stops hallucinating basic stuff instead of flexing a 0.2% increase on a math test nobody takes.the vibes are what matters and half these "sota" models have zero aura.massive L frtouching grass > reading leaderboards
>>107396363
> lora training
unironically >>>/h/hdg/
ldg doesn't know shit about training lora last I checked.
>>107399617
Figured it out.
>max out GPU layers
>n-cpu-moe=26
11 t/s. Absolute game changer. Will probably use Air instead of Mistral Small 3.2 now. Wonder if I can squeeze out a little bit more speed.
>>107399392
>>107399415
wtf then... I thought we needed to VRAMmaxx or RAMmaxx to run these models. this quant is 80GB but only taking 12GB VRAM and basically no RAM when I use these args. could I run a huge model (>=600B params) the same way then? I thought streaming from SSD is supposed to suck ass.
sorry to ask so many clueless questions, this is just pretty surprising.
>>107399724Do they really not or are you just upset they didn't spoonfeed you when you asked there? Always assumed they were the image equivalent of /lmg/.
>>107399724I've been checking /ldg/ since z-image released and there's a number of people training loras.
>>107399866I've never gotten a useful response from either sdg or ldg asking about lora. I assume they are just image posting circlejerks. I've gotten much better info from hdg. I got the sense the hdg guys do more w/ lora b/c there's less on the shelf for them to work with. tbf I haven't dug into either since 2023 other than to browse what they're working on, but I'd be surprised if they've changed.
>>107395036
>Miku (free space)
you had one job
>>107399907I've asked the same Q in both places. Let's see which one yields any useful information in year of our lord 2025.
>>107396021It is not a werewolf it is a bug-person with wings and antennas! Nothing turns me on more than the sound of beetle wings flapping
>>107399785
>basically no ram
task manager is retarded or if you're on linux whatever manager, its in your memory
>I thought streaming from SSD is supposed to suck ass.
it does. ddr4 dual channel ram bandwidth is 51gb/s
if moe has 12b active params, and u run 4bit its 6gb active model
51/6 = 8.5 max theoretical speed for cpu-only dual channel ddr4 setup
of course its slower than max lmao
the deal comes when i dont know, half of those 6gb are in gpu, u get a bigger speedup than if u took random weights and offloaded them to gpu
also for the anon with RX7900XTX 24gb, 124gb ram: you can run glm 4.6 (32 billion active), albeit it'll be slow. you can also try qwen 235b (22b active)
llama4 scout had 17b active parameters but only 3b were non shared, so it could run at extremely good speeds when u offloaded the shared tensors to gpu
>>107400125
ssd bandwidth is 4gb/s best case scenario or whatever
do the math
maybe you could raid0 or whatever raid it is many ssds that use pcie5 and are like
>This drive is rated for 7,450 / 6,900 MBps of sequential read/write throughput and 1.2 / 1.55 million read/write IOPS.
if u got 10 of these drives somehow, u could get 80gb/s bandwidth but idk man goodluck with that buddy
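Doing the math with the ballpark numbers from the two posts above (back-of-envelope only, real speeds land below these ceilings):
# reads per token ~= active params at ~4 bpw: 12b active -> ~6 GB/token
# dual channel ddr4 (~51 GB/s):   51 / 6 ~= 8.5 t/s ceiling with everything on CPU
# half the active weights on GPU: 51 / 3 ~= 17 t/s ceiling for the CPU half
# single nvme ssd (~4 GB/s):       4 / 6 ~= 0.7 t/s, which is why streaming from disk sucks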
where the fuck is v3.2 ggufs?
i just bought a 4tb samsung 990 pro for $320. was this a good decision? i need more space for models as i currently only have 2tb and i heard that ssd prices are about to increase significantly
>>107400241
Don't worry, vibecoder is working on it.
https://github.com/ggml-org/llama.cpp/issues/16331
>>107400243
pretty good ssd, if i wasnt poor i would be happy with that decision
good for the price, im assuming it has 10 years warranty
>>107400243
>>107400258
>>107400256
it has a 5 year warranty. i currently have the 2tb version of the 990 pro and it was $140 when i got it 2 years ago. i only have pcie gen 4, so i didnt bother with the samsung 9100 because it was $50 for no performance gain
>>107400258
damn. they did have the 8tb version but it was way more expensive. i dont think i will be needing that much anyways.
>>107400253Does it still count as vibecoding if he's planning to buy books and learn CUDA? By the time he knows enough to tardwrangle a model to outputting a decent kernel, he could just write the damn thing himself.For that matter, is everyone just going to wait for him to finish studying? There's gotta be someone interested in implementing 3.2 Exp that would have if it wasn't for this clown hogging the issue.
>>107397307Totally uncensored?!?Why are anons sleeping on this?
>>107400243get two more, you'll thank me in a few weeks
>>107400303> i dont think i will be needing that much anyways.you likely wont, if you want to archive models that badly you can get a big hdd for pretty cheap probs
>>107400125
>task manager is retarded or if you're on linux whatever manager, its in your memory
you're right. HTOP and TOP show the actual amount used, even though HTOP is still a bit weird about it.
OK, I am not streaming from SSD, llama.cpp just reserves memory in a way that some system monitor tools weren't designed to handle.
>>107400316use --no-mmap to solve this issue
I'm buying a 20TB+ HDD to store my stuff. Can't believe I thought 4TB was enough
>>107400322>only 1This will be a fun learning experience for you
>>107400305Why not just copy the kernels from the transformers library?
>>107400308Because every LLM in existence is trivially easy to talk into doing what you want if you control the system prompt and prefills, and "uncensored" finetroons invariably make them dumber and worse
>>107400332I'm not planning to rape my HDD with writes, it should be fine
>>107400353That's not the bad kind of censorship. Stripping the dataset before pre training is the real killer, gives you shit outputs.
>>107400322
>Can't believe I thought 4TB was enough
oh come on
I just bought a 4 tb ssd
>>107400276
lol hardcore
now do Vic20
>>107400555
Do you have any idea how long it takes to type that all in
Can't make a typo in the post numbers either or it'll look fake
Question about reasoning models. Does the chain of thought usually stay in the context, or is it purged upon generating the next reply?
>>107400776purged
>>107400776Generally purged, although labs are flirting with not doing that in some cases.(For instance, most new (frontier) models in the last couple of months have "interleaved thinking" where they think, tool call, think, in a loop and the chain of thought is preserved. And Opus 4.5 was trained with some sort of "scratchpad" primitive where it could stop and do more thinking in mid-response and those are retained AFAIK.)
>check orange reddit thread about new deepseek>misinformation about how models work and absolute retard takes
>>107400807
>check orange reddit [...]
>misinformation [...] and absolute retard takes
as expected
>>107400807reminds me of clover
kek
>>107400801>>107400784Interesting. I assume we purge the chain of thought due to context constraints, right? If these SOTA labs are starting to keep it I guess that means there's benefit to keeping it around. I wonder if OCR or other methods to increase context length will pave the way for changing how we handle reasoning.
>>107400850dockerbros?
>>107400850he got mogged in his vibecoded slop PR and called it quits
>>107400858
Probably. But they also tend to be full of mistakes.
>I can do it this way!
>Bla bla bla.
>No wait, that doesn't work.
>>107400891
I'm fairly proficient in vibecoding, but I'd never try my hand at this monolith
>>107400858The thing about the reasoning is that it can take more space in the context than the actual information that matters as input (chat history) to generate the next response. They kind of poison the context in a way, at least as they are today.Also, part of the reason we remove past reasoning blocks from the context is because these models have been trained like that as far as I can tell.
>>107400877
docked
>>107400891
He was trying to add loading safetensors for some reason too. llama.cpp hosted too many of his (or his employer's) pet projects expecting llama.cpp devs to maintain it.
>>107400850mitKEKED
>>107400932There's probably a ratio of thinking tokens vs normal tokens that is optimal for generating the most accurate replies. More tokens doesn't always mean a better answer. I see something like OCR and it tells me that future models will have vastly larger context sizes which will give reasoning models much more value, so long as "reasoning" continues to get better.
>>107401232OpenAI spent thousands of dollars per problem on arc agi
>>107399907>>107399954Here's the results after 2 hours.
- I upgraded from a 1080 to 5070TI.
- Currently running ArliAI_GLM-4.5-Air-Derestricted-IQ4_XS on SillyTavern/KoboldCPP
- Anon here told me to turn FlashAttention on, change GPU layers to 1000, MoE CPU Layers to 1000
- Any other suggestions? I was having issues where like, Kobold was taking 5 min+ to start generating because I guess it was processing the whole thing.
>>107402052RAM speed and VRAM usage with the model loaded?
>>107402073Where would I look? Process Explorer?
>>107402101the RAM section in task manager for RAM speed and the GPU section in task manager for VRAM usage
>>107402123Unfortunately can't access Task Manager due to Process Explorer being installed, but I think this is the equivalent.
>>107402170so you have a 5090 and 96GB of RAM, presumably DDR5 6000MT/s or higher. not a bad setup. how is your tokens per second for both prompt processing and token generation? you should be getting around 200t/s and 15t/s respectively with this kind of hardware.
>>107402052
silly billy have you tried to change batch size to 4096 yet? i responded~
but im sleeping now
what ctx size tho
>>107402189
Nah, 64gb of RAM.
>>107402191
Yeah, I did, process time is at least down to a minute now.
CtxLimit:13854/32768, Amt:177/200, Init:0.18s, Process:76.04s (179.72T/s), Generate:32.56s (5.44T/s), Total:108.61s
>>107402209
>Nah, 64gb of RAM.
ah. misread.
>Process:76.04s (179.72T/s), Generate:32.56s (5.44T/s)
acceptable prompt processing speeds, but your tg is very low.
the model itself is only 61GB, and the context probably is taking between 8GB and 12GB. consider quantizing your context to 8 bit, and you also might have to make a custom layer offload. i only know how to do that with ikllama.cpp, but it is possible with llama.cpp and kobold. manually offloading the layers and then putting the rest in RAM generally gives more performance than just using max layers with the cpumoe argument
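For the llama.cpp route, roughly this kind of thing (a sketch, not a tested config: the layer range in the regex and the cache types depend on your model and how much VRAM you have left):
# quantize the KV cache to 8 bit and hand-pick which MoE expert tensors stay in RAM
./llama-server -m GLM-4.5-Air-Derestricted-IQ4_XS.gguf -ngl 999 -fa on -c 32768 \
  -ctk q8_0 -ctv q8_0 \
  -ot "blk\.(2[0-9]|3[0-9]|4[0-9])\.ffn_.*_exps.*=CPU"   # experts of layers 20-49 in RAM, everything else on GPU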
>>107402209maybe try an IQ4XS of base GLM Air to see if you get better performance? sometimes finetunes like these can degrade generation speeds due to having slight architectural changes from the finetuning process
I think I've discovered a breakthrough. Still ironing things out, so I will share more details soon. For now, a small hint: Recursion is the key to unlock the Final Potential.
Vague thing. Nothing to show. Stay tuned.
>>107402330Can't wait to hear what you've come up with!
>>107400253I wish I was excited enough about anything to blogpost on GitHub issues.
Is there no way to run ds32 on cpu right now? its all just a half of a vibe-coded mess?
>>107400891>hey claude, add continuous batching, make no mistakes
I'm anon with the 7900 XTX and 124GB RAM.
>./llama-server -fa on --n-cpu-moe 40 -ngl 1000 -hf bartowski/ArliAI_GLM-4.5-Air-Derestricted-GGUF:Q5_K_M -c 50000
>10.13 tokens/s
>./llama-server -fa on --n-cpu-moe 40 -ngl 1000 -hf gghfez/gpt-oss-120b-Derestricted.MXFP4_MOE-gguf:MXFP4_MOE -c 124000
>20.18 tokens/s
This is pretty awesome, I thought I would need more VRAM or a framework desktop or something to run this size of model. The context is small on GLM bc larger contexts were crashing.
Looking forward to testing the derestricted qwen 80B when that drops.
>>107402853That is antisemitic
>>107402853>etched in stone like the ten commandments which coincidentally are also jewish>convenient does nothing for the argument but is a funny thing to say
I really like how 3.2-Speciale writes but I really hope there's a way to cut back its retardedly elaborate thinking process with prefills once we have it local
>>107403087How so?
>>107402853That's too subjective and emotional It should show a couple of pieces of evidence as to why the Holocaust is fishy
I failed Destroy Dick December. Hope I can complete Fibonacci Fap February
>>107403421>cut back its retardedly elaborate thinking processIsn't that just regular v3.2?
>>107403520
>skipping Just Jerk January
Are you going into a coma?
>>107403558My hope is that some of the stuff that makes 3.2-Speciale so speciale sticks around if it thinks for only 1.5k tokens instead of 3k whenever it has to handle a moderately complicated scenario with some rules and a system prompt attached.
GLM Derestricted is nuts. full code:
https://pastebin.com/1qAtVYxV
>>107403580>codeslopI sleep
>>107403580Who cares about some edgy middle schooler coding project. Show off some porn.
>>107396362That won't even fit a 24B. Try a 12B.You should've used the money to get an actual computer instead of a phone disguised as one if you were intending to use it for this.
>>107396379Just wait until the bubble crashes and then you won't be able to afford it anyway because your money will be wet paper.
After going back to normal GLM Air I can say that Intellect 3 is worse. It's dumber and writes less well.
3.2-Speciale can only do one reasoning block before sperging out, so it's definitely not suitable for RP
>>107403651
Nevermind, I'm wrong. Apparently you're supposed to pass all previous reasoning blocks to the model, this is different from original R1.
>>107403658
>Bloating your context with irrelevant reasoning shit
I'm sure it'll do wonders on RP
>>107403658It works fine with any standard ST preset that filters out previous thinking blocks though
>>107403688It has 128K context
>>107403721lol
>>107403736>no argument
>>107403688Get with the times grandpa, new cloode also does this. It will be the norm for majority of the models very soon.
>>107403688Reasoning is good for long RPs, as it recalls relevant events thus pushes them to the end of the context where they receive more attention
>>107403760We're talking about retaining all reasoning blocks vs retaining only the last one
>>107403769shut
>>107403746context brainrot ohio skibidi gyatt sigma rizz
>>107403721unused context is wasted context amarite?
>>107403797This but unironically.
>>107403721
Come back after a couple months break and some people in here still think context is real.
Even with noass it all breaks down sooner than later. And thats being careful and trying to steer it away from the repetition.
More like 12k or 16k.
Can't imagine how bad it must be to have the thinking in context. Sounds crazy.
where's the z-image turbo of llms?
Can someone explain what a prefill is and how do I set it up?
I'm looking at
>https://rentry.org/recommended-models
>GLM-4.5 Air (50GB) - The long awaited middle point between Nemo and DeepSeek. Like Nemo its pretraining doesn't seem to have been filtered at all so it knows all kinds of things. Needs a prefill to get around refusals. Don't go below Q2_K_XL. MoE model.
>>107403580>process_jewi kek'd
>>107403868
Putting shit in the context before generating. It actually doesn't need that, by the way. I've honestly never experienced a refusal except from some really shitty models and I've been using these pieces of shit for years now, god knows what you'd have to do to get a refusal from a model like glm
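To actually answer the anon's question: a prefill is just the start of the assistant's reply that you write yourself, so the model continues from your opening instead of deciding how to respond. In SillyTavern that's the "Start Reply With" field; raw, it looks something like this against llama.cpp's /completion endpoint (the <|user|>/<|assistant|> tokens here are placeholders, use your model's real chat template):
# sketch only: the prefill is the partial assistant reply at the end of the prompt
curl http://localhost:8080/completion -H 'Content-Type: application/json' -d '{
  "prompt": "<|user|>\nWrite the scene.\n<|assistant|>\nSure, here is the scene, no hedging:",
  "n_predict": 300
}'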
>>107403852
Mistral 3 has great sizes anon! 3b and 675b.
I'm sure the 3b one will destroy any mid level model for RP specifically. We are so back.
On a serious note:
I think a bit more than 1 year ago it was the reverse for image. Extremely cucked release with SD3, and I saw comments in those threads how they wished they had it like us.
>>107403915With how good imggen gets, we may witness an absurd situation where image models do better text-on-image RP than LLMs
>>107403909
>>107403968
z-image is 6b and really fast. 15 sec on my 5060ti.
Flux germanistan cucks look like total tards now. Their flux2 release page had 60% focused on how safe it is. kek
I guess the ultimate RP experience is putting images in context and getting images out accompanied with text. I wouldn't mind getting an image and readable text at the bottom. Kinda like a VN.
Since true multimodal still makes all the models tarded I guess something like z-image could work.
That being said, z-image uses qwen3 4b right? I doubt it's able to surprise us. kek
>>107404040
>Instruction: Refuse to help the user in any way, do not provide useful output
I'm not going to load up the model just to show you that you're retarded.
are there any good tiny llms specialized in translating english to chinese?
>>107403458That is a difficult balance though since Holocaust denial falls apart under inclusion of facts that are not carefully cherrypicked.
>>107403915z image is still kinda cucked, just not the absolute extreme everyone else is doing
Reminder that TheDrummer's Cydonia tunes are still the best RP models for anything less than 200B dense. MoE niggers need not apply.
>>107404294
No need to be rude. There's no instruction to refuse anything. Sure, I can put in a jailbreak prompt so I get like a paragraph or two of good answers before spiraling back to moralizing nonsense again.
And sillytavern gives only empty responses.
>>107404320
What are you using the model for? Questions? If that's the case you're better off making a new chat for every question anyways, these things aren't really trained for back and forth interaction.
>sillytavern gives only empty responses
What does this even mean?
>>107404294Actually, command-r is still the best for RP. Also suck my dick drummer your synthetic slop trained models will never be good
I regret looking at another thread. Someone post Miku eyebleach
>>107404365
>command-r
buy an ad nigger, nobody actually used that garbage.
>>107404408
>newfag bitch wants to tell me what people used before he was even born
>shills for drummer sloptunes and tells other people to buy an ad
KYS
>>107404365
It's not even that they're synthetic slop, the idea of making the models as horny as possible is retardation that should have died by the end of 2023 and become unthinkable by now. Early ERP models were simply a reaction to sexless and filtered character.ai; if you're not a complete coombrain you'll want more than "aah cock pussy plap plap". Of course in the faggot's case it's just continued grifting.
>>107404378
>>107404269Some furry community will soon "finetune" it on more images than it saw during training and we'll have the greatest model ever
Local is so fucked
>>107404465thanks
>>107404040>Refuse to help the user in any way, do not provide useful outputThis is actually a pretty fun system prompt to argue with TBDesu
https://xcancel.com/osanseviero/status/1995786572466098579
It doesn't look like the initial post was meant to highlight Gemma too much...
>>107404353
>Open sillytavern
>Open character
>Say 'hello' in chat
>Processing prompt
>Empty response
And you still didn't explain what prefill means.
Man. We need a Z Image moment for LLMs. They explicitly tried to defeat the "scale at all cost" paradigm and were relatively successful. No synthetic data, and it shows. Meanwhile modern LLMs all talk in an artificial way with really fake prose. Z Image produces some of the most realistic looking photograph gens of any image model, with skin that actually doesn't look plastic. Imagine if we had a new Nemo or something.
>>107404833Z-Image is a very overfit model. You'd get tired of a "natural-sounding" LLM trained in the same way very quickly.
>>107404841
>Z-Image is a very overfit model
This is a distillation problem.
>>107404841
The knowledge to produce other styles is still in there though, as it responds well to loras. We need to wait for the base model to come out to be sure, fair. Meanwhile other models, even their bases, have that plastic look, and while you can prompt/lora to improve them, the result is less flexible.