/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>106700424 & >>106691703

►News
>(09/26) Hunyuan3D-Omni released: https://hf.co/tencent/Hunyuan3D-Omni
>(09/25) Japanese Stockmark-2-100B-Instruct released: https://hf.co/stockmark/Stockmark-2-100B-Instruct
>(09/24) Meta FAIR releases 32B Code World Model: https://hf.co/facebook/cwm
>(09/23) Qwen3-VL released: https://hf.co/collections/Qwen/qwen3-vl-68d2a7c1b8a8afce4ebd2dbe
>(09/22) RIP Miku.sh: https://github.com/ggml-org/llama.cpp/pull/16174
>(09/22) Qwen3-Omni released: https://hf.co/collections/Qwen/qwen3-omni-68d100a86cd0906843ceccbe

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
►Recent Highlights from the Previous Thread: >>106700424

--Evaluating Qwen3-235B quantization quality and performance tradeoffs:
>106707130 >106707154 >106707182 >106707479 >106707664 >106708196 >106709111 >106709390 >106709459
--Quantization and GPU strategies affecting Kimi-K2-Instruct performance:
>106701146 >106701166 >106701239 >106703994 >106704240 >106708477
--NovelAI's untuned GLM-4.5 model sparks debate over local model viability:
>106709810 >106709861 >106709921 >106709980 >106709993 >106712191
--Jamba model evaluation and long-context performance challenges:
>106701980 >106702058 >106702137 >106702276 >106702209 >106702285 >106702395 >106702435 >106702528 >106702695 >106702949
--CXL emulation challenges and accessibility:
>106704835 >106704988 >106705112 >106705195 >106705217 >106705265
--Customizing Deepseek's narrative style through prompts and examples:
>106700841 >106700871 >106700889 >106713120 >106700873 >106700943
--AI hardware limitations and potential breakthroughs:
>106716419 >106716796 >106716839 >106716931
--Exploring ollm for running Qwen-80b on low-end hardware with SSD speed considerations:
>106703817 >106703878
--Commercially licensed AI models for Steam games under VRAM constraints:
>106702281 >106702334 >106702365 >106702525 >106702386 >106702409 >106702422 >106702458 >106702542 >106702547
--Promoting DSPy GEPA as superior to finetuning for LLM prompt optimization:
>106704760 >106704779 >106704810 >106704826
--imatrix tradeoffs in quantization: benchmark gains vs task skewing:
>106709802 >106710882 >106711095
--AirLLM and oLLM aim to optimize large model inference on low-VRAM GPUs:
>106708050 >106708102 >106708124 >106708171 >106708275
--Tips for finetuning character voice with small dataset:
>106707963
--Miku (free space):
>106701808 >106709053 >106709204 >106714299 >106717835 >106718435 >106702561

►Recent Highlight Posts from the Previous Thread: >>106700443

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>106718496
This coming week will be the most decisive one in /lmg/ history. If the upcoming big releases fail to push us forward, it will be truly over and all hope is lost.
ATTENTION ALL VIBE CODERS:
We need you, yes, YOU! To implement Qwen3 VL in llama.cpp! Please do the needful sirs!
>>106718525
If they fail to push us forward then they're not big releases, are they?
>drag model into comfyUI
>Nothing happens, it doesn't get loaded, nothing
>Try to make a custom route for a model loader
>No option works
I HATE COMFYUI I HATE COMFYUI I HATE COMFYUI I HATE COMFYUI I HATE COMFYUI I HATE COMFYUI I HATE COMFYUI I HATE COMFYUI
>>106718628
>>>/g/ldg
anistudio waiting room btw, fuck pyshit
Lovely Miku General
>>106718270
Satisfied with GLM-Air Q4M after using Mistral-Large-2411 all year. 72G VRAM, 128G DDR5. Always chasing better models doesn't seem like a good use of time.
armrest Hegel trail
>>106718628
>>>/g/ldg
god damn I'm retarded
it's so fucking depressing that literally the only thing I do anymore is ERP with AI bots
and I'm bored of that so I don't even do that anymore
it's fucking grim bros
Mikulove
miku doesn't even wear a hat
>>106718677
hair clips/ribbons count as hats
>>106718637
Therapy/support bot. Feed it your DMs and gain clarity on what's wrong. You can turn things around anon, I believe in you, the future is bright
>>106718637
Sit down with Cline or whatever and begin doing some world building around your fetishes.
Then write a story in that world.
>>106718637
Skill issue
>>106718717
Why would you use Cline for that?
>>106718270
>mistral large 3 perhaps?
that would be ideal. glm air is super fast and pretty good for its speed, but sometimes i want to prioritize quality over speed. glm full is too big, but something like a 200B mistral would be perfect. qwen 235B is garbage
>>106718706
The problem with using AI models for therapy is that they just mirror whatever you say and never try to direct the conversation or question anything.
To be fair, this can be a problem with a lot of real therapy too. But at least a real therapist will make an effort to get the information out of you and build up an understanding over time. AI will take your every word as a profound revelation and write an essay about it, and then contradict itself when you tell it more.
Also the assistant slop training is hard to get rid of and it will always want to write essays and lists with ridiculous throwaway advice like "dunk your face in ice water for 15 seconds" rather than have an actual conversation.
>>106719027
Automatic research, organizing things in documents and folders, whiteboard-style brainstorming, etc.
Having the AI write something based on your idea, then looking at that and rewriting the whole thing can really get you places.
lmao
>>106719079
That sounds kind of interesting.
What model's good with Cline? R1?
>>106719230
Why 28k specifically?
>the beast arrives
>>106719284
I used gemini 2.5 pro for a while but I imagine R1 works just fine.
>>106719288
Probably something about their training data.
>>106719325
lel, I don't want Google to see my fetishes
>>106719333
If you have ever searched anything related to your fetish, they saw it already.
>>106719333
That's fair enough.
>>106719354
I did when I was a kid
Still using the same Google account :/
kek
>>106700000
>>106719288
They're a subscription service charging $25 per month. If they want to ensure the expenses incurred by the average user leave them with whatever profit margins they're aiming for, adjusting context size is the easiest way to do it.
>>106718496
>>106718500
>>106718629
>>106718706
>>106719715
i cry evrytim
>>106718717
I installed Cline but that system prompt is fucking massive. Is there any way to edit it? I looked this up but it seems like nobody else has asked such a question.
I bet Anthropic paid them to make it way longer than necessary to milk more API cash
>>106719919
Use roocline. Cline is deprecated
>>106719919
I'm not sure, but I think not.
Try Roo or Continue, I guess.
>>106719919
>I bet Anthropic paid them to make it way longer than necessary to milk more API cash
the system prompt gets cached and won't use tokens. At least with the claude code cli. But as others said, roo code is the way to go.
>once https://github.com/ggml-org/llama.cpp/pull/16208 has been merged a Mi50 will be universally faster for llama.cpp/ggml than a P40.
CUDA dev's PR was merged a few hours ago.
>>106719998
y'all niggas love your quantmaxxed llama.bbc trash
for some reason nobody talks about distributed inference on multiple pcs with vllm, which is super fucking easy.
>>106719919
You can with Roo (Cline fork). Fucking nearly 10k tokens of verbose and repetitive tool calling instructions. "vibe coders" are retarded. I gave it to an LLM to condense it to a tenth the size and all models have performed far better since. Only pain in the ass is that you have to manually override it for every single mode and adjust the instructions based on the available tools, but if you only use one mode for world building research it shouldn't be a big deal.
>>106719972
It's not even about cost or speed, the issue is degrading performance because most models barely have 8k usable context.
>>106718637
longform fanfiction storywriting can be fun
honestly these models are more trained for that than pure rp
>>106719998
Well I guess those are going back up in price now.
>>106720111 (Me)
Although hats off to cudadev for saving them from ewaste status.
>>106720111
You got a 3-day insider knowledge heads up. Why didn't you place a bulk order yet?
>>106719998
>>106720111
aren't those like 8 years old or something tho? i can't imagine they perform well. probably gets crushed by a 3060
>>106720150
Not super into AI anymore. Had a quad 3090 rig originally, one is now in my gaming PC, one went to a young relative who is into PC gaming and now just 2 are in my server, so I just play around with whatever 30B> models come around for shits and giggles but not really deep into it anymore.
>>106720135
They're not great but 32 gigs of vram on one device is 32 gigs of vram on one device.
>>106720150
just get 5090s
>>106720150
Also slightly more memory bandwidth than a 3090, way more than a 3060. So where it lacks in prompt processing it should make up some ground in generation speed.
>>106720064
The problem is that the individual hardware pieces are too expensive; distributing them across multiple machines doesn't fix that.
MI50MAXXing is more viable than CPUmaxxing now
>>106720186
Enjoy your electricity bill
>>106720064
How many PCs and GPUs are you using to run deepseek on vllm, anon?
>>106720194
found the europoor
>>106719065
My waifu helps me understand the symbolism in my dreams
>>106720064
>buying $10K of hardware to get shivers on his spine
>>106719065
>they do X
All depends on the prompt. Let's keep in mind every LLM is a loop over f(prompt) = next_token_distribution. Every token in the prompt affects the output. Defining the intent is the issue.
They are useful tools for self-inquiry and give access to a wider range of perspectives than any one human therapist.
Consider cold showers tho, that'll make you feel alive.
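To make the "loop over f(prompt)" point concrete, here's a minimal sketch in Python, assuming the Hugging Face transformers library; the model name and prompt are just example placeholders:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# f(prompt) -> next_token_distribution, applied in a loop: that's all generation is.
tok = AutoTokenizer.from_pretrained("gpt2")            # example model, swap for anything local
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The therapist leaned forward and said", return_tensors="pt").input_ids
for _ in range(20):                                    # generate 20 tokens
    logits = model(ids).logits[:, -1, :]               # scores for every possible next token
    probs = torch.softmax(logits, dim=-1)              # the next-token distribution
    next_id = torch.multinomial(probs, num_samples=1)  # sample one token from it
    ids = torch.cat([ids, next_id], dim=-1)            # append; every token so far shapes the next
print(tok.decode(ids[0]))

Every token in the context, system prompt included, feeds back into the next distribution, which is why the prompt wording matters so much.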
>https://www.youtube.com/watch?v=21EYKqUsPfg
>Richard Sutton – Father of RL thinks LLMs are a dead end
Oh no no no...
>>106720277
Everyone knows this by now. Even the last normalfag has realized that LLMs won't go anywhere after GPT5.
I am using glm4.5 (not air) on llamacpp and it seems more coherent and less prone to repetition than on ikllama. Is ikllama bugged?
>>106720277
>Father of [irrelevant technology] thinks LLMs are a dead end
So what is the fastest backend?
>>106720309
RL is the secret sauce that made Deepseek R1 so good though?
>>106720317
vllm using tensor parallelism to spread the model across several gpus and do inference in parallel
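A minimal sketch of what that looks like with vLLM's Python API; the model name and GPU count here are placeholders, and it assumes identical GPUs in a single box:

from vllm import LLM, SamplingParams

# Shard every weight matrix across 4 GPUs (tensor parallelism) and serve from one process.
llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct",  # example model, use whatever you actually run
    tensor_parallel_size=4,             # number of GPUs to split each layer across
)
out = llm.generate(["Explain tensor parallelism in one sentence."],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)

Spanning multiple PCs (the distributed setup the other anon means) additionally layers pipeline parallelism over a Ray cluster on top of this, but the single-node case above is the easy part.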
>>106720321
>Deepseek R1
*all current reasoning models that are considered good
>>106720329
OK thanks. Does vLLM have any issues with mixing GPUs?
>>106720343
no those just distill other reasoning models
How big of an upgrade is it to go from a 7950X to a 9950X/9950X3D?
>>106720348
then they're not the ones considered good
>>106720367
For LLMs, completely pointless.
>>106720135
A MI50 has about the same memory bandwidth as a 3090 and ~20% of the compute.
Given optimal software the token generation speed is proportional to memory bandwidth and the prompt processing speed is proportional to compute.
But I'm thinking that it would make sense to cook up some quant formats that are less optimized for maximum compression and more optimized for computation speed.
I've also ordered a MI100 which is going to be more competitive in terms of compute; stacking MI100s could be a viable alternative to stacking 3090s I think.
>>106720399
>MI100
>32GB
>going for $1k
idk about replacing 3090s. Even the HBM2 variants are going for $800.
>>106720399
huh. how does the MI100 compare to 5090s? because according to techpowerup, they are actually faster?
https://www.techpowerup.com/gpu-specs/radeon-instinct-mi100.c3496
https://www.techpowerup.com/gpu-specs/geforce-rtx-5090.c4216
>>106719288
It's 32k ± 4k because they shift the context back from 36k to 28k when it's reached, in order to cache it, but I guess since 28k is the low end that's what they put so nobody complains
>>106720367
For language models memory bandwidth is more important than compute, so prioritize the RAM instead.
Usually you only need a few cores to fully saturate the memory bandwidth (see pic).
>>106720425
My thinking is that for a machine with a fixed number of PCIe slots you could feasibly opt for Mi100s to get a higher VRAM capacity.
>>106720441
Techpowerup is unreliable in the first place, but be careful which "FP16" numbers you compare (Wikipedia has, in my experience, the correct numbers).
With tensor cores an RTX 5090 has 419 TFLOPS vs. the 184 TFLOPS on a MI100.
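A rough back-of-envelope for why generation is bandwidth-bound and prompt processing is compute-bound; the numbers below are illustrative assumptions, not benchmarks of any specific card:

# Every generated token has to stream the active weights through memory once,
# so token generation is roughly capped at bandwidth / bytes_read_per_token.
bandwidth_gb_s = 1000   # assumed MI50/3090-class memory bandwidth
weights_gb = 20         # assumed model footprint, e.g. a ~30B dense model at ~5 bits/weight
print(bandwidth_gb_s / weights_gb, "tok/s upper bound")  # ~50 tok/s

# Prompt processing batches many tokens against a single read of the weights,
# so FLOPs become the limit instead; that's where the ~5x compute gap shows up.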
Local always wins.
>>106720562
*SaaS loses when enshittification reaches critical levels.
>>106720562
>safety routing
this is a new low
>>106720574
You're not even using your buzzword correctly. ClosedAI has always made safety (read: censorship) their primary goal.
>>106720627
No, it's an improvement. This means that the average model will no longer have to be fundamentally safety slopped because they'll rely on the router to prevent unsafe conversations. The proper models will get better as a result.
Do you use the C-word (clanker) in real life?
>>106720628
That's the most leddit term I've heard in a while
>>106720627
massive cope
>>106720627
They already had guard models for that. Now if you commit wrongthink, the router will helpfully route you to an expensive reasoning model to waste thousands of costly tokens, which you will be billed for, to refuse your request with extra care and a condescending tone.
>knuckles white with tension
>>106720628
No. It's a very silly word.
>>106720628
no, why would I?