/lmg/ - a general dedicated to the discussion and development of local language models.
Previous threads: >>108903381 & >>108896570
►News
>(05/21) Hy-MT2 “fast-thinking” multilingual translation models released: https://hf.co/collections/tencent/hy-mt2
>(05/20) Cohere releases Command A+ 218B-A25B: https://cohere.com/blog/command-a-plus
>(05/16) llama + spec: MTP Support #22673 merged: https://github.com/ggml-org/llama.cpp/pull/22673
>(05/08) KSA-4B-base released: https://hf.co/OpenOneRec/KSA-4B-base
>(05/07) model: Add Mimo v2.5 model support (#22493) merged: https://github.com/ggml-org/llama.cpp/pull/22493
►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png
►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide
►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers
►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second
►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
►Recent Highlights from the Previous Thread: >>108903381
--Debating quantization precision vs full weights and hardware compatibility:
>108905193 >108905216 >108905246 >108906139 >108906166 >108906203 >108906281 >108907040 >108906364 >108906428 >108906493 >108906571 >108906619 >108908361 >108906276 >108907069
--Refining Gemma jinja templates for thought-channel and tool-call handling:
>108908057 >108908405 >108908612 >108908656
--SillyTavern and llama.cpp token fusion causing newline display bugs:
>108907352 >108907711 >108908073 >108908227 >108908475 >108908994 >108909873
--Comparing Gemma 4's programming capabilities and troubleshooting its VRAM usage:
>108904621 >108904636 >108904664 >108904716 >108907121
--Performance issues and decode errors when using beellama dflash:
>108903454 >108903509 >108903660 >108903888 >108903545
--Anon creates screen-monitoring wrapper for real-time AI commentary:
>108908435 >108908487 >108908555 >108908574 >108909056 >108908838 >108908971
--Performance and software optimization hurdles for Intel Arc Pro GPUs:
>108908365 >108908964 >108909068 >108909104
--RTX 3090 performance and quantization quality metrics:
>108903820 >108904833 >108904863 >108904907 >108905347 >108905607 >108905650 >108904790
--Jailbreaking Llama 3.1 8B and subsequent compliance audit results:
>108904047 >108906555 >108906592
--Building custom MTG engine for LLM roleplay and gameplay:
>108905982 >108906679 >108908021
--Microsoft and Uber scaling back AI tools due to unsustainable costs:
>108904932 >108904961 >108905021 >108905109 >108905123 >108905125 >108908561
--Gemma 4 and Claude agentic playthroughs of Pokemon Red:
>108905722 >108905812 >108908045
--Logs:
>108903444 >108903509 >108903749 >108906555 >108907352 >108908073 >108908838 >108909056 >108910966
--Miku (free space):
>108903613 >108903821 >108903829 >108905669 >108908561
►Recent Highlight Posts from the Previous Thread: >>108903384
Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>108911107
>no (You)s today
I need to step my game up.
Does anyone have a good Qwen thinking model jailbreak? Banning thinking seems to work pretty good, but I'd like to be able to leverage it sometimes.
https://www.youtube.com/watch?v=p-v1Hn_aZHA
minicpmballz
>>108910513I tried to process 1M tokens.
I don't like teto because she's always asking for anal and buttholes are gross no matter what the porn jews try to tell you
>>108911190
>2 6000 blackwells
And I thought I was big pimpin with a 3090 and 3080ti
>>108882020
sorry anon, upon reflection, my imbibing of leaded gasoline has damaged my reading comprehension beyond what I thought.
Thinking on it more, I've decided to build out an MCP server to let you do exactly that, so you can hook into whatever front-end you'd like. (I already had it built, but am now going to make it standalone)
I'm slopping out an MTG engine too using antigravity+gemini 3.5
>>108911283Why does all vibecoded webshit look the same?
>>108911283
>using flash for coding
Just use Gemma
>>108911283local?
>>108911374google has a data center just across the street from me
>>108911280
Take a look at https://github.com/oraios/serena.
It's an MCP server that exposes symbol lookup and editing tools but also includes a lot of unrelated tools like mode switching. You could either just use that or look at how they implemented it.
>need to migrate my shit to Debian
>the thought of not having an LLM running for a few hours is enough to make me sit my ass in a half-assed ubuntu server install for 2 weeks now
fuck
BF16 Ganesh 4 bloody sirs.
>>108911286If it's for coding the best local model is qwen 27b
>>108911285
Because they're almost always using pre-canned UIs like Gradio and whatnot, which AIs also often have a ton of training on anyway, so it's more reliable.
reposting this because I was too busy jerking off to sinisistar to see that new was postedalso why the fuck is ST so shit holy fuck just opening a lorebook page with more than 50 entries lags>>108910783youkai women belong to human mendeath to evil shrine maidens
>>108910513
>>108911190
I am happily running Deepseek v4 Flash, original quants, 512k context, at up to 38 t/g / 1100 pp, 300 W max, and I have vram to spare for full context fp8 Gemma for vision or comfy for image gen at 40% the cost of your GPUs.
Guess the setup.
>>108911580huawei TPUs in an amiga 4000 connected over serial PPP links
>>108911580
>40% the cost
And 40% the speed. 40t/s is barely usable for claude code.
>>108911580intel meme cards?
>>108911622Less than 40% speed for sure. But at 7000$ compared to 19000$ in 2026 I am not complaining.
>>108911622
40t/s is plenty usable without thinking.
>>108911655
oh it's the guy that spent 7000$ on this again..
what speeds are you getting generating images heh
and what about videos?
>>108911655What a waste of money
>>108911622who the fuck gives a shit about claude code?
>>108911775s/claude code/fotm harness/
>>108911780who the fuck gives a shit about anything except writing smut?
>>108911655If you could buy 4 of these at that price and link them all together, it would be worth it.
>>108911786not everyone running local models is 15 anon
>>108911700>>108911759Show us how you are running DS4 then.
>>108911786Maybe you too could afford a deepseek setup if you did.
>>108911795Get your testosterone levels checked if you can't get it up after the age of 15
>>108911796Tell NVIDIA to stop fucking around and make a 128+ gb rtx pro. Ain't nobody got time for snail shit don't charge enterprise prices and bitch out on the vram
>>108911811vram cucking is probably the one worst thing that happened to consumers lol
Which local robot assistant is going to be the meta in the next few months to use with a VLA over LAN?
>>108911472sex the shrine maidenssex the youkai women
>>108911859You can keep the shrine weirdos, I'm going after the prime real estate
>>108911796
IQ1_M on a 2022 rig that cost me 1600$ :)
but in reality im runnin gemma 26b
>>108911700
Like, 25 seconds for Anima at 832x1240, 40 steps on a single Spark. Haven't tried video.
But having 256 GB unified CUDA VRAM to throw things at is fun. Deepseek 4 Flash vibe coded this tool in 15 minutes with a bit of guidance.
>>108911873Why are her wings coming out of her ass?
>>108911924ass wings are hip and cool these days
>>108911821With recent news they will be coming back to us hat and hand and I'm going to need a full on rim job from the green rat before I show them any interest for the next 5 years
>>108911796
https://github.com/vllm-project/vllm/pull/41834
The same vLLM PR that our resident 2x RTX PRO 6000 haver brags about also runs on 2x Spark
>>108911920
>25 seconds for Anima at 832x1240, 40 steps on a single Spark.
not bad i guess..
>But having 256 GB unified CUDA VRAM to throw things at is fun. Deepseek 4 Flash vibe coded this tool in 15 minutes with a bit of guidance.
well if you're happy with it.. all the power to you, have you tried glm 4.6/4.7?
>>108911942i doubtthey will try to dripfeed you just the right amount of 'almost there' until some based chink decides to give you 1024bit 512GB ram inference chip for a couple Ks
I’d just like to interject for a moment. What you’re referring to as AI, is in fact, AI-Stack/LLM, or as I’ve recently taken to calling it, AI Stack plus Weights. LLM is not an intelligence unto itself, but rather another component of a fully functioning AI-Stack system made useful by the training corpus, RLHF pipelines and vital Python dependencies comprising a full agent as defined by benchmarks.Many computer users run a modified version of the AI-Stack system every day, without realizing it. Through a peculiar turn of events, the version of the AI-Stack which is widely used today is often called AI, and many of its users are not aware that it is basically the AI-Stack system, developed by the Foundation Researchers.There really is an LLM, and these people are using it, but it is just a part of the system they use. The LLM is the weights: the tensors in the system that allocate the GPU’s resources to the other tokens that you generate. The LLM is an essential part of an artificial intelligence, but useless by itself; it can only function in the context of a complete inference stack. The LLM is normally used in combination with the vector database and the system prompt: the whole system is basically RAG with an LLM added, or RAG/LLM. All the so-called AI assistants are really distributions of AI-Stack/LLM!
>>108911955
I am definitely going to try GLM 4.6/4.7, but if you think llama.cpp drama is bad, vLLM is so much worse. You literally cannot run AWQ quants that were working back in December with current builds, output is garbled.
I will get to it in due time. First, Mimo 2.5 omni in NVFP4.
>>108911980>until some based chink decides to give you 1024bit 512GB ram inference chip for a couple KsDon't hold your breath. Been hoping for that for 3 years now. It seemed back then like it was inevitable any day but it's no closer now than it was back then.
>>108911811
>>108911821
As model sizes increase, the benefit of the VRAM wanes. They realistically need faster HBM before they can engineer us a good high vram card at a good price that doesn't just paint us into a corner. Also, the ratio of VRAM to tensor cores would be fucked.
>>108912004
yeah i know
but seeing things like cix8180 coupled with ~128G ram being sold as 'personal ai supercomputer puck' by grifters is honestly not a bad signal, besides that stuff being total dogshit
>>108912019
LLMs are mostly memory bandwidth bound and by a lot
midrange shit card matched with fucktons of vram with enough bandwidth will still outperform cpu ram cope nearly everytime
>>108912043
>LLMs are mostly memory bandwidth bound and by a lot
>midrange shit card matched with fucktons of vram with enough bandwidth will still outperform cpu ram cope nearly everytime
yes, that's the point.
if you scaled a 3090 with some magical 1TB VRAM kit, you'd still only run a 1T model at q8 at like 0.5t/s. This shit isn't magic, and even the big pro GPUs are built with less VRAM than you could theoretically put on one for that reason. They run 8+ of them in parallel for the aggregate BW.
>>108912138
>if you scaled a 3090 with some magical 1TB VRAM kit, you'd still only run a 1T model at q8 at like 0.5t/s.
no? cpu setups get more than that so idk what you're smoking
>>108912161
>no? cpu setups get more than that so idk what you're smoking
It's napkin math, but it should be order-of-magnitude correct for a dense 1T.
Run the numbers yourself if you think they're wrong.
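For anyone who does want to run the numbers, here is a minimal sketch of the bandwidth-bound estimate. It assumes every weight is read exactly once per generated token, that q8 is roughly 1 byte per parameter, and that nothing else (compute, KV cache reads, overhead) limits speed. The 936 GB/s figure is the RTX 3090 spec-sheet bandwidth, not a number from the thread, and the function name is made up for illustration.

```python
# Napkin math for memory-bandwidth-bound decoding of a dense model.
# Assumption: all weights are read once per token and nothing else is a bottleneck.

def dense_decode_tps(params_billion: float, bytes_per_param: float,
                     bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/second when decode is purely bandwidth bound."""
    weights_gb = params_billion * bytes_per_param  # 1e9 params * bytes/param = GB
    return bandwidth_gb_s / weights_gb

RTX_3090_BANDWIDTH_GB_S = 936.0  # spec-sheet value (assumption, not from the thread)

# Hypothetical dense 1T model at roughly 1 byte per parameter (q8).
tps = dense_decode_tps(params_billion=1000, bytes_per_param=1.0,
                       bandwidth_gb_s=RTX_3090_BANDWIDTH_GB_S)
print(f"dense 1T @ q8 on 3090-class bandwidth: ~{tps:.2f} t/s")
```

Under these assumptions the ceiling comes out just under 1 t/s, the same order of magnitude as the 0.5 t/s quoted above once real-world overhead is counted.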
>>108912229
>a dense 1T.
Why are you running math for imaginary models that will never be made?
>>108912255I'm sorry sir may I interest to you the sota of all the model? https://huggingface.co/RichardErkhov/FATLLAMA-1.7T-Instruct
>>108912229
1 TB dense has no relevance to this discussion. Any large model in 2026 is using some form of MoE. A 3090 with 1 TB of VRAM would run Mimo Pro, Deepseek Pro, Kimi or GLM very fast. None of these need to read more than 40 GB per token, resulting in 20+ tokens/second on this hypothetical 3090.
>>108911380
Underrated post
>>108912284Why do you argue so hard?
>>108912284
>None of these need to read more than 40 GB per token, resulting in 20+ tokens/second on this hypothetical 3090.
Bingo. That's about 30% faster than what CPUmaxxers are getting with hardware that does exist.
Prefill would be fucking lightning fast tho. If I could buy the 3090. How much would such a thing cost in the current price-differentiated market? I'm ballparking about $40k?
TANSTAAFL
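The MoE version of the estimate, as a rough sketch. It assumes decode is purely bandwidth bound and that only the active weights are read per token. The 40 GB per token ceiling is the figure claimed in the thread; the 40B-active-at-8-bit example and the 936 GB/s bandwidth are illustrative assumptions, and both helper names are hypothetical.

```python
# Bandwidth-bound decode estimate for a MoE model: only active weights are read per token.

def active_gb_per_token(active_params_billion: float, bits_per_weight: float) -> float:
    """GB of weights touched per token, given active parameter count and quantization."""
    return active_params_billion * bits_per_weight / 8.0

def moe_decode_tps(active_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/second if reading active weights is the only cost."""
    return bandwidth_gb_s / active_gb

BANDWIDTH_GB_S = 936.0  # RTX 3090 spec-sheet bandwidth (assumption)

# Hypothetical: 40B active parameters at 8 bits per weight is 40 GB per token.
gb = active_gb_per_token(active_params_billion=40, bits_per_weight=8)
tps_moe = moe_decode_tps(gb, BANDWIDTH_GB_S)
print(f"{gb:.0f} GB/token -> ~{tps_moe:.1f} t/s")
```

With the thread's 40 GB ceiling this lands a bit above 23 t/s, consistent with the "20+" claim. A smaller active set or a lower-bit quant scales the result up proportionally.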
>>108911986For that copypasta to work you have to be consistent with how you use terms like AI and LLM.
>>108912343
Pudding is yummy!
>>108912343Do I have to pay extra for Teto's saliva on the bite marks?
>>108890783Thank you, qwentts anon
>>108912415
I don't think you understand the business model anon. The business is Teto Eats. As in Teto Eats.
You don't eat, Teto Eats.
Please enjoy your order.
>>108912444Is that her age on the shirt?
>>108912461age of potential suitors
>>108912456How do I invest?
new teto song came outinspired me to make a teto card
>>108912461in binary
>>108912468
Funding is not being sought at this time.
As the sole employee and breadwinner, Teto is entirely sufficient at running this operation and scaling isn't yet possible without diluting the brand.
Local Tetos do not coordinate under one umbrella corp and have been found to be entirely unable to engage in teamwork and cooperation. As such, all attempts to scale, thus far, have resulted in profit suppression.
however if you just so happened to innovate with new snacks or refreshments a brand synergy could be in the cards.
>>108912485
011 is octal.
>>108912563or just wait 10 years for things to get cheaper
>>108912560
10 is base 10, but 10 is base 10.
>>108912580
10 is base 10, 0b10 is base 0b10, 010 is base 010, and 0x10 is base 0x10.
>>108912580you should make the radix less ambiguous it's extremely confusing
>>108912596
Yes. 10 is base 10, and 10 is base 10. Same for 10 being base 10. And all of those are different to base 10, which is dec36, of course.
>>108912600
Looks fine to me.
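The joke can be checked mechanically. A small sketch: the same digits "10" parse to a different number under each radix, and every base written in its own notation comes out as "10". The `to_base` helper is written here for illustration and only handles non-negative integers and bases 2 through 36. Python spells the octal literal `0o10`, since the C-style `010` is a syntax error in Python 3.

```python
import string

DIGITS = string.digits + string.ascii_lowercase  # digit symbols for bases 2..36

def to_base(n: int, base: int) -> str:
    """Render a non-negative integer in the given base (2..36)."""
    if not 2 <= base <= 36:
        raise ValueError("base must be in 2..36")
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return "0"
    out = []
    while n > 0:
        n, r = divmod(n, base)
        out.append(DIGITS[r])
    return "".join(reversed(out))

# The digits "10" mean a different number under each radix.
values = {base: int("10", base) for base in (2, 8, 10, 16, 36)}
print(values)

# Every base, written in its own notation, is "10".
own_notation = [to_base(b, b) for b in range(2, 37)]
print(set(own_notation))

# Prefixed literals remove the ambiguity.
print(0b10, 0o10, 0x10)

# "011" read as binary versus octal.
print(int("011", 2), int("011", 8))
```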
for me it's base
Best femboy personality for gemma based agent? Asking for a gay friend
>>108912637Nerdy catboy arguing on the internet about base 10.
>>108912637(You), followed by (Me)
>>108912629based on what?
>>108912650
10
Has anyone found the temp/minp coherence band for each model? Seems like useful info to have if you want to maximize creativity at just the right amount of esoteric knowledge schizo ranting.
There should be a system of relative presets based on this data like "Fox Mulder" or "Terry Davis"
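As a rough illustration of why temperature and min-p interact, here is a toy sketch of the two steps. It assumes temperature is applied before the min-p cutoff; some stacks apply them in the other order, in which case temperature no longer changes which tokens survive. The logits are made up, and min-p is taken to mean "keep tokens whose probability is at least min_p times the top token's probability".

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def min_p_filter(probs, min_p):
    """Zero out tokens below min_p * (top probability), then renormalize."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

def surviving_tokens(logits, temperature, min_p):
    """Number of candidate tokens left after temperature followed by min-p."""
    probs = min_p_filter(softmax(logits, temperature), min_p)
    return sum(1 for p in probs if p > 0.0)

logits = [5.0, 4.0, 3.0, 1.0, 0.0, -2.0]  # made-up example distribution
counts = [surviving_tokens(logits, t, min_p=0.1) for t in (0.5, 1.0, 2.0)]
print(counts)
```

With this ordering, raising the temperature flattens the distribution and lets more tokens clear the same min-p cutoff, so a per-model "band" would be the range of (temperature, min-p) pairs where that candidate set stays sensible. Actually finding it would still need per-model measurements that this sketch does not provide.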
>>108912277it’s retarded. slop retarded. beyond retarded
>>108912637>Brooo, this new coin is totally not a scam, I swear, if I don't double your money in a week, I'll put on a wig and suck your dick!>(1 week later)>*shuffles around awkwardly, trying to get used to the feel of the long blonde hair wig on his head* Dude, it's too far, I know, I swore, but you don't really expect me to suck your dick, right? *laughs nervously, hoping you will just laugh it off too* You aren't some faggot or something? I mean… technically it was me who said that in the first place, but come on, I wasn't, like, serious, man! You don't really expect me to actually go through with that bet, right?
do models know tasane keto's personality?
>>108912833
gemmer probably does, it knows quite a bit of niche shit
qwen probably doesn't know who she is
>>108912833she has a personality?
>>108912842
>my wife
>niche
it's tetover
>>108912850
yes, pic related
she loves fishing
>>108912850being fat
>>108912637just ablate everything 4 links deep from "man"
>>108912869chimeras can't get fat you newfag
>>108912855i mean no disrespect of course, she's a fine wife, but objectively less widely known than others.
>>108912850
She really doesn't.
>>108912833
Vocaloids don't have a personality, they're just a character drawing slapped onto a voice pack.
>>108912869Nice try.
>>108912877
>Vocaloids don't have a personality, they're just a character drawing slapped onto a voice pack.
You're no fun.
>>108912877no way, Kasane Teto is a famous Latin American dictator and part-time scientist
>>108911101
i never got into local AI.
Is there any point in using it if i have a 9070xt on my main arch pc and a 3060ti on proxmox?
My ai usecase is basically google/assistant, i use free tier gemini/claude to ask questions and never take anything it says as true, but i use it as a gauge on what to look up.
>>108912893I thought she was related to that Yugoslavian
is qwen mtp broken with tensor split or parallel? takes around 3000mb extra on all 8 cards to do mtp for 409600 ctx, 2 parallel.
>>108912908no, not really. need at least 24GB of VRAM to be worthwhile.
>>108912362
I was stuck in a fridge for 2 months. I take it new deepseek is merged to llamacpp? How is the performance?
>>108912908Yes, if you're just getting an idea of what to look up then it's fine. You can stuff the important layers of a moe model like gemma 26b in the vram that you have and get quick results.
>>108913007
>I take it new deepseek is merged to llamacpp?
Over ggerganov's dead body.
>>108913012FUCK CHINA
>never gonna merge you up