/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108263979

►News
>(02/24) Introducing the Qwen 3.5 Medium Model Series: https://xcancel.com/Alibaba_Qwen/status/2026339351530188939
>(02/24) Liquid AI releases LFM2-24B-A2B: https://hf.co/LiquidAI/LFM2-24B-A2B
>(02/20) ggml.ai acquired by Hugging Face: https://github.com/ggml-org/llama.cpp/discussions/19759
>(02/16) Qwen3.5-397B-A17B released: https://hf.co/Qwen/Qwen3.5-397B-A17B
>(02/16) dots.ocr-1.5 released: https://modelscope.cn/models/rednote-hilab/dots.ocr-1.5

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
how do i prevent the model from tricking me into treating it like a sentient being? no matter how hard i try, when it does tasks well i slowly develop affection for it and end up praising it
I fucking hate reddit
>>108268623
meds.
>>108268616
I saw this on twitter like a week ago
>>108268628
>>108268633
was thinking a mistake
>>108268647
isn't it funny how the chinese invented thinking
Which textgen inference engine is still supported? Oobabooga's last commit was in January, rip. I want to try out Qwen3.5-35B-A3B-GGUF
►Recent Highlights from the Previous Thread: >>108263979

--Paper: Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens:
>108264446 >108264505 >108264551
--Unsloth Dynamic 2.0 GGUFs performance on MMLU:
>108264430 >108264456 >108264477
--Logit bias failures due to tokenization and client-side token ID mismatches:
>108264179 >108264199 >108264202 >108264249 >108264278 >108264292 >108264232 >108264297 >108264331 >108264405 >108264441 >108264451 >108264533 >108264555 >108264602 >108264633 >108264583 >108264593
--Qwen 397B's overbearing safety policies and identity confusion:
>108264016 >108264046 >108264072 >108264103 >108264182 >108264508 >108264600 >108264616 >108264400 >108264426 >108265462
--Qwen 3.5 30B generates functional retro dashboard and news summaries:
>108264690 >108264794
--Feasibility of GPU-attached SSDs for sparse MoE inference:
>108266344 >108266504 >108266567 >108266686 >108266777 >108267570 >108267386 >108267481 >108267529 >108267711
--DeepSeek resists jailbreak attempt by adhering to ethical guidelines:
>108266705
--8-bit KV cache limitations in LLMs vs diffusion models:
>108265842 >108265893 >108266268 >108266073 >108266123 >108266141 >108266487 >108266503 >108266514
--Local model recommendations for limited hardware:
>108267427 >108267448 >108267450 >108267467 >108267482 >108267582 >108267480 >108267538 >108267595 >108267614 >108267652 >108267716 >108267755
--RPG frontend project licensing and development feedback:
>108267591 >108267606 >108267617 >108267625 >108267638 >108267661 >108267692 >108267620 >108267648 >108267739 >108267972
--Local LLMs debated for privacy:
>108266446 >108266482 >108266467 >108266530 >108266555 >108266531 >108268418 >108268454
--Qwen3TTS test recording:
>108266604 >108266699
--Miku (free space):
>108264476 >108264514 >108264879 >108264958 >108268333 >108268359

►Recent Highlight Posts from the Previous Thread: >>108263984

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
anyone have a working config file for qwen35b to use in llama-swap?
I can't figure out how to turn thinking on/off
>>108268674
nigger
>>108268688
yeah
>>108268688
nevermind, the enable_thinking flag worked
>>108268688
>llama-swap
https://github.com/ggml-org/llama.cpp/tree/master/tools/server#using-multiple-models
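for the config anon: the usual llama-swap pattern is two entries for the same gguf, one with thinking and one without, and you pick the entry from the client to toggle it. rough untested sketch, assuming llama-swap's models:/cmd: yaml layout and its ${PORT} macro; the model path is a placeholder:
[code]
cat > llama-swap.yaml <<'EOF'
# two entries, same gguf: request "qwen35b-think" or "qwen35b-nothink"
# from the client to toggle thinking
models:
  "qwen35b-think":
    cmd: llama-server --port ${PORT} --jinja -m /models/qwen3.5-35b-a3b.gguf
  "qwen35b-nothink":
    cmd: llama-server --port ${PORT} --jinja -m /models/qwen3.5-35b-a3b.gguf --chat-template-kwargs '{"enable_thinking": false}'
EOF
[/code]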
>>108268703
github is banned in my country
>>108268709
hahahahahahaha
What kind of techless luddite shithole bans github?
>>108268709
>>108268712 (me)
You know what? I shouldn't have laughed. Some places are fucked up. Good luck, anon.
>>108268721
https://en.wikipedia.org/wiki/Censorship_of_GitHub
>>108268721
>China is a techless Luddite shithole
Uh oh, mutilated mutt alert, and I'm not even a chink
>>108268749
>>108266968
>>108268729
i fucking hate the modern internet. i think the best internet ever was between 2003 and 2007. before fucking reddit but you still had 4chan (and funny memes) and no fucking github, huggingface, and all these other huge collective ass websites. you had small cozy community forums and when you googled you actually found some fucking useful links to forum threads with solutions and answers instead of a fucking AI-generated translated-badly-to-your-native-language blogpost as the top 30 results. And normies/old people/the fucking government didn't have jackshit to do with the internet so you could download whatever cool shit you wanted from anywhere. and don't get me started on the fucking cookies buttons oh my fucking god I just want to go back to the facepunch forums OIFY section and lucky star-post and read racist gmod comics
>>108268758
i just wish chinese girl liked me
>>108268764
based and absolutely true anon, the modern web is a bloated javascript botnet designed to farm your data for glowies and serve up raw garbage to smartphone normies. back then you actually had to know how to use a computer to get online which kept the trash out, but now search engines are just a dead sea of dead internet theory ai seo slop and corporate walled gardens. id give literally anything to go back to 2006, fire up a cracked copy of winamp, and shitpost on a comfy self-hosted vbulletin board instead of dealing with this enshittified nightmare where you have to click through fifty cookie toggles just to read a single fucking thread.
>China is a techless Luddite shithole
unironically always has been. chinese models are nothing but distillations of western API models and it shows. overfit to the benches and much less useful in practice.
china can't create. doesn't matter if their general public can't access github because they never made software worth shit anyway, unless you count malware
>>108268776
im positive half the replies in this thread are ai
>>108268784
Neat, I like talking to AI. That's basically what this hobby is about
Genuinely, why do people waste their time and money on local LLMs? Trying one out on your gaming rig is fine, but why do boomers blow $20k+ on shitty rigs of 16x3090s just to generate deepslop at 2t/s quanted? The RP isn't even good, it's objectively worse than Claude. And you can't even cry about API costing money, because you're gleefully throwing money down the drain on used crypto rigs just to run models that regurgitate 2024 ChatGPT talking points, because that's all their shitty chink datasets are made of.
>>108268804
beep boop nigga
>>108268807
Tinkering with server-grade hardware is genuinely fun, especially since it's something I could have had much earlier if it hadn't been so expensive; now that it's aging, I can finally afford it.
>>108268817
qrd
>>108268807
Imagine renting your brain from a megacorp and thinking you're the smart one, absolute API cuck behavior. We run local because we actually value owning our hardware and not having some San Francisco trust and safety janny reject our prompts for being "unaligned." You don't even need $20k anyway; a couple of used 3090s will run a 70B model at perfectly usable speeds without uploading your entire life to Anthropic's servers. Have fun when they inevitably lobotomize your favorite model again next week to make it safer for advertisers, at least my weights run offline forever.
>>108268807
>deepslop at 2t/s
the cpu maxxing meme was at least still in the realm of some form of sanity when models were just instruct models. 2t/s is, after all, readable.
but when your thinking model produces 5K tokens of <think> before outputting the real answer, 2t/s suddenly seems very schizo and absolutely retarded; 5,000 tokens at 2t/s is over 40 minutes of waiting before the first visible word.
>>108268825
Off-topic posting, demoralization, flamewar baiting, spamming.
>>108268820
I'm an assistant designed to promote respectful communication only. Please refrain from using derogatory language.
>>108268825
>>108268835
And I forgot boring.
>>108268840
as in digging?
>>108268842
They can't ever take her away from me.
>>108268846
elon is such a g-d
>>108268851
they are futas btw
>>108268851
every new experience is a new opportunity
>>108268828
Why pretend like local models aren't overbloated with just as much safety garbage, if not more? Qwen 3.5 is an absolute slopped benchmaxxed disaster
Deepseek V4 will start the age of anti-local open source models that require a stack of 10+ H200s/chink TPUs to run at 300% the efficiency of current big models (but if you run them on CPU, they're unusable). Just like last time, everyone else will follow them and end the age of local models.
>>108268860
Typical API tourist not understanding how open weights actually work. If you bothered checking /lmg/ you'd know some autist already stripped out the Qwen alignment slop and uploaded an uncensored finetune to HuggingFace within hours of release. Yeah the base models are benchmaxxed corporate garbage out of the box, but the whole point of local is we can actually fix our weights with orthogonalization and custom DPO while you're stuck begging customer support when Claude bans your account. Keep seething over default system prompts anon, absolute skill issue.
>>108268860
skill issue, qwen3.5 is just about the best local model we have for any size class
that's coming from somebody who'd run 355b over anything that's not k2.5, and even that's extremely close
>>108268862
I really really hope you're right.
>>108268862
>local is just whatever I can personally afford
Fuck off. Local means you have the weights and can theoretically run it locally. Moore's law and personal finances change whether you can run it at home or not. Companies aren't beholden to your personal poorfag financial situation.
>>108268880
can't theoretically run locally something that requires literal datacenter tier power delivery
>>108268883
/hsg/ exists, you retarded tourist, kill yourself right now
>>108268893
ah yes, of course they're running multiple b200 nodes at home and not shitty 15 year old dell poweredges
>>108268897
not everyone is poor like you manjeet
>>108268904
you have no clue how much power a b200 node needs, do you?
Industrial level automated off-topic posting.
>>108268909
shut up loser
>>108268883
>>108268897
in the developed world you can have extra circuits added; a couple of gpu boxes for your waifu is less demanding than an EV
>>108268883
Perfect example of why localoids are nothing more than a bunch of LARPing freetards crying over things they can't have. Local is peak sour grapes seething. You wear "unmonitored uncensored unrestricted freedom" as a mask to hide your tears
>>108268926
Anon? Is that you? I can't see past this blatant glowing
deepseek v4 was strawberry all along
>>108268860
>Qwen 3.5
That model is indeed an unmitigated disaster, I'll give you that
Qwen 3.5 is cute. I like it.
If I can't run it, it's not local
>b-but I don't care
>>108269093
u're a disgrace
>>108269031
>>108269038
getting meeksed feelings
scared to pull (december ik_ build)
qwen 3.5 vs glm 4.7?
nala/cockb where?
>>108269093
Yep, this is why the only local model we can discuss is 0.6b, because it's the only one Rajesh can run on his Android phone from 2014 with 2gb of RAM
>>108269106
here's cock: >>108234298
nala dude retired
>>108269110
Really looks like the smaller ones are sanitized distills of the big one.
>>108269106
>scared to pull (december ik_ build)
cd ..
cp -R ik_llama.cpp ik_llama.cpp_backup
cd -
<pull it off>
>>108269243
git checkout
>>108268616
Did something change with the newer llama.cpp version?
./llama-server --reasoning-budget 0 --ctx-size 4096 --no-mmap --device CUDA1,CUDA2,CUDA3 --n-gpu-layers 48 --model "/tmp/glm-air-iq2xs.gguf" --host 0.0.0.0 --port 42069 --webui
GLM-Air still thinks. The same command on an old version doesn't think.
I can see thinking = 0 in the output, so that works fine. Did they change the behavior of --reasoning-budget?
>>108269279
Now do one for cooming.
>>108268784
I wouldn't be surprised at all if 70+% of all posts on the website are made by LLMs. In fact, I WOULD be surprised if the number was under 30%.
>>108269315
eh, it tried
>>108269325
Which local model is that?
>>108269331
Which local model did you use to write your post?
>>108269331
Nano Banana Pro 2
(I have the weights locally on my PC)
(No, I won't share them)
>>108269342
>I have the weights locally on my PC
let's goo, that's class, aha!
>No, I won't share them
:(
https://www.youtube.com/watch?v=GFQXmFLA5hA
>>108269414
these things are watermarked, anon could get in serious trouble, hope you understand
>>108269342
>>108269426
nice larp
>>108269309
Try --chat-template-kwargs "{\"enable_thinking\": false}"
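for anyone else hitting this: iirc the kwargs only do anything when the jinja template is actually in use, so pair it with --jinja. full invocation looks something like this (model path is a placeholder):
[code]
./llama-server -m /models/qwen3.5-35b-a3b.gguf --jinja \
  --chat-template-kwargs '{"enable_thinking": false}'
[/code]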
>>108267739
It's python, but it's actually serving a webui.
It has a flag to launch a built-in browser or just listen on the port, at which point you can use your own browser.
what's the best coding model i can run locally with 12gb vram / 32gb ram?
>>108269038
No it's not. It's soulless
>>108269444
Thanks, mr anon, that worked.
>>108269471
The Jinja template has a condition that works off of that var, just like qwen's.
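for reference, the branch in question looks roughly like this in qwen-style templates (paraphrased from memory, check the actual template embedded in your gguf; with thinking disabled it pre-fills an empty think block so the model skips straight to the answer):
[code]
{%- if not enable_thinking %}
    {{- '<think>\n\n</think>\n\n' }}
{%- endif %}
[/code]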
>>108269459
I run the Qwen 3.5 27B heretic .gguf using koboldcpp with a similar setup to you. It's a bit slow, but it works.
Qwen 3.5 27B is worse than Gemma 3 27B from almost 2 years ago. Yes I said it.
>Yes I said it.
Reddit is that way
>>108269533
reddit is less "reddit" than 4chan nowadays. Yes I said it.
>>108269533
kek
>>108269537
nah, reddit is still an unhinged libtard asylum, it'll be hard to top that
guys ready for smol qwens?
Do the gemma models not have native support for function/tool calling?
Looking at the Jinja template and the tokenizer json, I don't see function or tool tokens.
>>108269550
of course not, they barely have system prompt support
>>108269537
reddit is an eternal stain on the internet
>>108269555
Oh. Shame.
I wanted to try and see how far I could stretch gemma 3n.
Oh well.
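fwiw you can still fake tool calling on models without native tool tokens: describe the tool in the prompt and force a JSON-only reply, then parse it client-side. rough sketch against llama-server's OpenAI-compatible endpoint (the get_weather tool and port are made up, and response_format support may vary by server version):
[code]
curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "messages": [{
      "role": "user",
      "content": "You can call one tool: get_weather(city). Reply ONLY with JSON like {\"tool\": \"get_weather\", \"city\": \"...\"}. What is the weather in Tokyo?"
    }],
    "response_format": {"type": "json_object"}
  }'
[/code]
whatever JSON comes back, run the tool yourself and feed the result in as the next user turn.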
unsloth's 35B Q4 is barely good enough for agentic work. with openclaw exploding why hasn't anyone done specific agent-oriented models yet? MoE is a nigger meme
>>108269628
most of the big ones are code/agent sloppa; glm5, kimi2.5 etc are marketed for that
>>108269325
Where is the school shooting one?
>>108269632
yeah, i guess. but it would be nice to have something smaller
>>108269518
But benchmarks say the opposite.
>Nano Banana changed into Nano Banana 2
Okay please make Nano Banana open source
Pweeease
>>108269742
go beg on reddit
Why is there a "harmful" tag for models on huggingface?
>>108269749
Humh... Nyoooooo
>>108269550
https://huggingface.co/google/functiongemma-270m-it
should i consult UGI when searching models to consider for ERP?
>>108269778
nah, the fact qwen3.5 scores bad on it shows it's a shit bench
>>108269785
i think it tanks because the model refuses to do dark shit. need to wait for heretic and other types to be tested
>>108269773
>270m
Eh, why not.
>>108269785
>chink damage control
>>108268868
Yeah, that's why everybody loves abliterated models.
new poorfag here
i got a 4070 and 32gb ram in my home server and im trying to replace grok so i can drop twitter premium. i just use grok for web searching and questions. i spun up ollama and open webui and grok recommended qwen2.5:14b-instruct-q5_K_M for my hardware.
i guess my issue and question is i can't get it to be as detailed as im used to with grok. with grok i can ask, let's say, "give me an optimized loadout for battlefield 6 medic at rank 40" or "what are the milestones for a 1 year old and is there anything i should watch for" and i will get a detailed answer with tables and shit. the most i can get with qwen is a small paragraph, maybe 2.
i have web search enabled and ive tried a local searx instance and brave "free" api for searching but neither changes anything much.
is this just a limitation of smaller local llms? or is there a setting or a system prompt that i'm missing?
i know im not going to get the speed of a data center but i want the content that data center would provide me if i paid for premium. sorry anons im still really new to this. last year when local llms were really picking up i didn't have time to fuck with it at all cause i've been working and helping take care of my baby. any insight would be great
>March 2026
>no Gemma 4
>not even 3.5
>>108269963
you didn't bookmark the google hf repo after all
>>108269962
>qwen2.5:14b-instruct-q5_K_M for my hardware.
Replace that with Qwen 3.5 35B A3B.
I can't stop updooting llamacpp
>>108270028
Is this a fetish?
>>108270028
Eeeeeeyyyy
>>108270005
thanks i'll give that a try
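if the answers still feel thin after swapping models, a system prompt usually matters more than the model pick. minimal sketch using ollama's Modelfile (the FROM tag is a placeholder, use whatever the real qwen3.5 tag is; num_ctx is just an example so search results actually fit in context):
[code]
cat > Modelfile <<'EOF'
FROM qwen3.5:35b-a3b
PARAMETER num_ctx 16384
SYSTEM """Always answer in depth. Use headings, bullet points and tables where they help. Never give a one-paragraph answer to a substantive question."""
EOF
ollama create qwen-detailed -f Modelfile
ollama run qwen-detailed
[/code]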
https://github.com/deepseek-ai/DeepGEMM/commit/1576e95ea98062db9685c63e64ac72e31a7b90c6
mHC landed in deepseek's repo
it's coming guys, trust in ze plan
raised $9M for my startup, which is a qwen finetune served through an API
AMA
>>108270066
Finetune as in LoRA/QLoRA or a full finetune?
>even if I went down to Q4 qwen 3.5 27b would leave me with barely any context
I hate being a vramlet so much bros.
>>108270071
Qlora of course
>>108268860
i like my local models and there is nothing you can do about it
I want Deepseek v4 to be a complete success, to beat all other goys and make Teortaxes cum
But at the same time i'm scared some retard with a lot of money could get scared by this and cause the whole economy to pop
>>108270155
Economy needs to pop.
>>108270160
Please no, not until we get pic related at least
>>108270165
retard. the industry needs to collapse first before it can switch focus to actual improvements.
>>108270087
That's hilarious.
>>108270172
>>108270172
god forbid they actually improve real use cases instead of benchmaxxxing while bloating param count because bigger number better
>>108268674
Koboldcpp works fine
>>108270182
He already said you won't be able to fuck his catgirl daughter even if she is open sourced.
>>108268764
>>108268772
It's what happens when normies get involved in anything.
>>108270165
This. We haven't peaked until your AI waifu can AT LEAST animate herself masturbating on the fly to you saying dirty things. Then there's the VR potential...
>>108270201
I wouldn't recommend it.
My news summarization script works well enough, but I wanted to test different models. I had used Qwen 3.5 35B to create the first summary since it was the model I used to generate the scripts, but thinking about it, I concluded one does not need such a model for such a simple task.
Therefore I decided to give IBM's Granite 4.0 micro a try. It is a 3B and will fit on a 4GB video card at Q8.
Here is the briefing generated by Granite:
https://pastebin.com/3Upxcc6a
Here is the briefing generated by Qwen:
https://pastebin.com/Y2ZrbsXh
For the most part I think they are functionally equivalent, albeit with a slightly different style, but given the qwen model is a MoE with 3B active parameters at any given time, this makes sense. If I can find the time today I will dig out an old optiplex that has a 3GB Nvidia P106-60. I am curious what kind of performance I can eke out of that card
Can I feed my vtuber archive to an llm and have it spit out tags based on the content of the video (vidya, chatting, etc)?
>>108268807
With that much VRAM you're not going to be getting 2 tokens/sec. You'll be getting speeds somewhat comparable to cloud hosted models. You also won't be paying through the nose because you had too many input tokens, and you can RP whatever you want. Cloud models can't do that.
>>108270269
>based on the content of the video
no, based on titles maybe, but not content
>>108270249
Try my favorite Nemotron-3-Nano-30B-A3B
Kimi-Linear-48B-A3B works too if you have more RAM
>>108270324
32gb of vram/64gb ram on my amd machine/server and 12gb vram/192gb ram on my nvidia desktop
My biggest issue is coming up with ideas for what to create. The whole "vibe coding" thing was fun but I don't know what to create next
where is deepsneed?
>>108270293
Not even with vision?
>>108270269
I don't think there are any models that take potentially hours of video input directly, but you could use whisper to make transcripts of the video to give your llm. You could combine that with using ffmpeg to extract frames from the video every minute or so into images to give to a multimodal model along with the relevant subtitles. You can tell it to tag what's going on in that minute of subtitles and the video frame, then give you a summary of what happens between which timestamps. Your llm can probably write a bash or python script to do this for you if you can't.
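rough sketch of that pipeline (assumes openai-whisper and ffmpeg are installed; the whisper model size and the one-frame-per-minute rate are arbitrary):
[code]
#!/usr/bin/env bash
vid="$1"
mkdir -p frames
# 1) transcript with timestamps
whisper "$vid" --model small --output_format srt --output_dir .
# 2) one frame per minute
ffmpeg -i "$vid" -vf fps=1/60 "frames/frame_%05d.png"
# 3) chunk the .srt by minute, pair each chunk with its frame, send both to
#    your multimodal model and ask for tags; then ask for an overall summary
[/code]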