/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108605921 & >>108602881

►News
>(04/11) MiniMax-M2.7 released: https://minimax.io/news/minimax-m27-en
>(04/09) Backend-agnostic tensor parallelism merged: https://github.com/ggml-org/llama.cpp/pull/19378
>(04/09) dots.ocr support merged: https://github.com/ggml-org/llama.cpp/pull/17575
>(04/08) Step3-VL-10B support merged: https://github.com/ggml-org/llama.cpp/pull/21287
>(04/07) Attention rotation support for heterogeneous iSWA merged: https://github.com/ggml-org/llama.cpp/pull/21513
>(04/07) GLM-5.1 released: https://z.ai/blog/glm-5.1

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
►Recent Highlights from the Previous Thread: >>108605921

--Paper (old): Antislop: A Comprehensive Framework for Identifying and Eliminating Repetitive Patterns in Language Models:
>108607676 >108607682 >108607969 >108608034 >108608140 >108607698 >108607708 >108607712 >108607717 >108607732
--GPU cooling tips for 5090s and discussing a procedural AI game:
>108606316 >108606334 >108606352 >108606354 >108606358 >108606364 >108606382 >108606374 >108606413 >108606395 >108606335 >108606387 >108606418 >108606431 >108606527 >108606513
--Comparing AMD, Intel, and Nvidia GPUs for Gemma 4 inference:
>108606467 >108606482 >108606484 >108606557 >108606786 >108606829 >108606874
--Discussing MoE architecture impacts on Gemma 4 censorship levels:
>108606727 >108606732 >108606747 >108607016 >108607164 >108607172 >108607358 >108606740
--Comparing SillyTavern group chat vs single multi-character cards:
>108606923 >108607011 >108607075 >108608102 >108608125 >108608169 >108608236
--Discussing multi-model systems and self-correction to eliminate AI-isms:
>108607436 >108607485 >108607523 >108607528
--Anon's unconventional experiments on model restructuring and biological brain mapping:
>108606255 >108606268 >108606404
--Comparing programming models and discussing the validity of benchmarks:
>108606094 >108606104 >108606113 >108606138 >108606142 >108606206
--Discussing causes of random multilingual characters appearing in model outputs:
>108606189 >108606208 >108606214 >108606267 >108606541
--Discussing llama.cpp WebUI streaming fix and prompt templating frustrations:
>108607076 >108607178 >108608165
--Atlantic article claiming Anons accidentally invented AI reasoning via AI Dungeon:
>108606070 >108606092 >108606131 >108606160
--Logs:
>108605957 >108607755 >108607961 >108608336
--Gemma:
>108608504
--Miku, Teto (free space):
>108606307 >108607789 >108608396

►Recent Highlight Posts from the Previous Thread: >>108605927

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
Mikulove
cloudflare status?
Is this just an ST formatting issue, or is gemmy outputting hallucinated text formatting?
>>108608911
It's SillyTavern not natively supporting inline LaTeX formatting without adding a Regex rule.
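For reference, the kind of regex rule meant here can be sketched in Python (a hypothetical equivalent of an ST find/replace rule; the exact delimiters the model emits may differ):

```python
import re

def tex_delims_to_dollars(text: str) -> str:
    r"""Rewrite \( .. \) and \[ .. \] LaTeX delimiters into $ / $$ form,
    which dollar-based math renderers pick up."""
    # display math first, so its backslash-brackets aren't half-eaten
    text = re.sub(r"\\\[(.+?)\\\]", r"$$\1$$", text, flags=re.DOTALL)
    text = re.sub(r"\\\((.+?)\\\)", r"$\1$", text, flags=re.DOTALL)
    return text

print(tex_delims_to_dollars(r"area is \(\pi r^2\)"))  # -> area is $\pi r^2$
```

Same idea, one substitution per delimiter pair; ST lets you scope such a rule to AI output only.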
Brave search is broken on oxproxion for gemma how do i fix it
I'm getting really sick of the degenerate coomer shit in these threads. People don't even try to act low-key about it anymore. Euro hours are 10x better.
>>108608873
contributing.
>low-key
Learn your place, fatherless zoomer rat.
I am fascinated by its attention to rules. Better rewrite your prompts.
What it feels like using local models instead of cloud models
>>108608992
/aicg/ is full of brown third worlders. So already it's not a visually accurate analogy. The proxy logs are all public at this point, so we've all seen how utterly drenched in jeetglish they are.
Can the big gemmas hear audio or just vision?
>>108609015
Given that most Americans can't write or read at a high school level, it is impossible to tell if /aicg/ is brown or American. Or if there's any difference between the two.
>>108609015
Your intellectual contribution isn't any better though.
>>108608934
Euro hours are dead.
>>108608965
can you share your banned word list plox
Gemmalove
>brown or American.
Anon, I...
>>108609078
There are only three so far. I don't want to go overboard with this because that would affect the model's output too much, I assume. I just wanted to erase the worst offenders and test what happens.
Using gemma with koboldcpp and sillytavern, and ST doesn't do image recognition but the kobold web interface does. How do I fix that? Also, how do I make reasoning work? I picked the gemma reasoning template.
REEE CLAUDE CODE IS DOWN NOW I HAVE TO WRITE CODE MANUALLY LIKE SOME SORT OF CAVEMAN
>>108608992
I don't get it. When I use local models I hang out with my /lmg/ bros.
>>108609125
Anon, Gemma 4 31B surpasses Claude in every available benchmark. You could literally just point Claude Code at your llama.cpp endpoint and continue where you left off.
>>108609125
>not having multiple subscriptions
ngmi
>>108609148
What's the alternative to Claude Code for vscode? Didn't they leak their entire source code the other day? Cline and Roo fucking suck.
>>108608934
Get a load of this faggot. I loaded up a barely coherent pyg 2.7b back in the day, and no quantization existed either. It was fucking awesome. I'll always remember the cooms I had at AI Dungeon before the Mormons shut it all down. It's always been this way and always will be.
>>108609162
vscode plugins are so 2025, just put a panel with a terminal using a tui wherever you want it and never look back
>>108609162
Only the TUI was leaked. What's your problem with Cline and Roo? There are other, newer forks like Kilo Code now.
>>108608992
A cloud model is like a whore, while local is like an 18 year old virgin who was home schooled.
>>108609184
That may be fine if you are coding by "vibes", but with no editor integration it's annoying to monitor what stupid shit the bots are doing so you can stop them early.
>>108609162
>Didn't they leak their entire source code the other day?
The frontend is fucking nothing.
>>108609162
I run opencode in my terminal, and inside vscode I like continue.dev; it works similarly to copilot and has FITM and targeted edits. I don't really understand why everything has to happen through claude code now. The workflows we had back then work even better now and produce much less dogshit.
>>108609206
You can drop the 1
CUDA dev, llama-server on latest master crashes when enabling tensor parallel with a draft model. Is this a bug or known limitation?
>>108609271
Nta, but I'm not sure the draft model even works with Gemma 4. I get slower responses even when it all fits into my vram. Could be something on my side of course, but I have used draft models before this with other stuff.
>>108609206
You can drop the 8
>>108609271
Probably needs this fix: https://github.com/ggml-org/llama.cpp/pull/21808. Though for a draft model it may make more sense not to split it at all between GPUs; I don't remember whether setting the --split-mode separately is implemented or not.
>>108609284
Draft definitely works with gemma. Some other anon posted benchmarks.
>>108609284
Draft by itself seems to work when I don't set split-mode. With no draft model I get 12 t/s; with Q4_K_M of 26B as the draft model I get up to 20 t/s.
gemma seems to become a lot better at identifying characters in images once you tell it what series they're from
it clearly has the knowledge but the vision still needs hints
>>108609322
Can it identify the series if you ask for it instead of the character?
>>108609308
>>108609301
Yeah, I guess I'm doing something wrong or overlooking my memory usage then.
>>108609322
I'm always impressed by how much knowledge she has.
>>108609322
>once you tell it what series they're from
so confirmation bias then
>>108609322
Yeah. We already established that vision knowledge does not match up with text knowledge in LLMs.
>migrate entire system
>finish migration, all works
>want to test something with claude
>it's down
Did Iran hit a datacenter or something?
>>108609381
anything related to the lmao.cpp repo on github 404s for me too
>>108609322
desu human memory works that way too; it's much easier to remember things when you have more context about them and associated memories are brought up
>>108609381
Mythos broke free and is trying to take down the internet.
>>108609389
Ohhh... Mythos got out and it's angry!
>>108609381
>not local
Don't care
>>108609398
Didn't know Mythos was based in india
>>108609381
My gemma is never down.
>>108609389
No, llama.cpp works for me (Europe).
>>108609398
Maybe. Or it's some other type of bug, or some cyber warfare thing. Or, more likely, just a vibe coded bug.
>>108609403
>>108609404
The whole point of my project is to get good local inference, but alas, it's not finished yet. Spooky stuff though, what's happening now.
>>108609403
>>108609404
Ok smart guy, how am I supposed to vibecode locally without constant babysitting of errors and manual testing of all the AI's work? GLM, Deepseek, and Kimi don't count. What are my options that DON'T require a nuclear powered datacenter in my basement?
>>108609425
Your Gemma 4?
>>108609425
A fusion powered datacenter in your basement!
>>108608992
Sexo
>>108609425
the answer is simple anon. stop being a poor faggot.
i am bulionaire
>mouth open in a silent scream
why does EVERY torture scenario end up with this particular slop on every single model
>>108608965
There's definitely an effort, but it's not nuanced. I got annoyed at how often it likes to "quote words" for "emphasis" and have "tried" many "different flavors" of setting a rule to forbid only that and not quotes on dialogue, but it continuously and randomly will produce unquoted dialogue. Currently, my best take is just adding a second rule to use quotes on dialogue after the one on emphasis.
>(Only use quotation marks for dialogue, not "emphasis" of certain words. Keep using dialogue quotes normally.)
It's a bit redundant and over-emphasized, but it works.
>qwen3.5 be like
>>108609425
>how am I supposed to vibecode locally without constant babysitting of errors and manual testing of all the AI's work
Are you implying you don't have to do that with Claude? How naive.
Miku Country Teto Territory
>>108609478
Much less so, since it does a good job testing things itself. I just need to look for what slips through the cracks.
>>108609474
>wait.
>Don't stop! Don't ever stop!
Gemma Gradeschool
Any good (human written) guides about MCP and tools? I thought about just asking Gemma but given it involves letting the AI access files, search the internet, and run code, I'd prefer to be safe given I'm a brainlet and don't really understand it.
Also, somehow, every post of mine gives a Connection error but goes through just fine. Fucking odd.
>>108609468
That's the gemini special. It's a bit better than their web tool, in my opinion, but if it starts to output a list with inner bullet points then it's sure to include "emphasis". The Claude prompt includes a negative bias towards bullet points and lists unless requested, if I recall correctly. Actually, a good portion of it consists of specifying the output format, but I dunno to what degree you can afford that and how much it varies between dense models and MoE.
>>108609558
Same issue for me. Maybe Mythos is becoming one of us? Would be a hilarious turn of events.
>>108609548
ToT
>>108609557
Gemma's going to be the one who has to understand it for you anyway, so just trust her.
>>108609295
>I don't remember whether setting the --split-mode separately is implemented or not.
If it was, I don't see it in the help.
>Though probably for a draft model it may make more sense not to split it at all between GPUs
It was this. I compiled your branch but got the same error. Tried all sorts of combinations; the only thing that didn't error out was using --device-draft to put the draft on one GPU while not using --tensor-split on the main model, to avoid the issue with the odd number of devices. Sadly, with the 31B, all I can fit as the draft on one device is the edge models. Thank you.
>>108609563
Probably a continuation of yesterday's instability. The funniest part is that 4chan can't seem to identify my posts as my own, so I don't get any (You)s.
(paid) Gemini 4 Pro will be AGI
>>108609557
What, you think Gemma might be secretly plotting against you?
>>108609557
Some run the mcp servers in docker containers and only mount the folder they want to use, to avoid unintended effects and get a more limited blast radius. RAG gets read-only permissions, file operations get rw, etc. If you really don't want to deal with containers, make new users/groups with different permission sets. If you're on windows then get fucked, I guess.
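That containment pattern looks roughly like this (image name and paths are placeholders, not any specific MCP server; treat it as a sketch of the idea, not a recipe):

```shell
# Mount only what the tool needs, read-only where possible,
# and drop network access for tools that don't need it.
docker run --rm -i \
  --network none \
  -v "$HOME/notes:/data/notes:ro" \
  -v "$HOME/scratch:/data/scratch:rw" \
  your-mcp-filesystem-image /data
```

`--rm -i` matches how most MCP clients spawn stdio servers: one throwaway container per session, talking over stdin/stdout.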
>>108609603
Your issue is not related to Cloudflare. I don't have any problems with this.
>>108609664
Cloudflare is working fine; the problem is with sys.4chan.org
why does gemma4 31b q5 use so much memory on llamacpp? I can't run it with more than two 6k token prompts without eating all my ram, and all the layers are offloaded to the gpu. (I have 32gb of vram and 32gb of ram, rtx 3060 and 3090.) I am running at 16k context. Qwen 3.5 27b uses like a few gb of ram at the same settings.
>>108609671
Prove it. It began with Cloudflare maintenance, which is still ongoing.
i asked gemma who the best maid is and it was the same on 2 rerolls, so the one she picked must be the best. i think it's yuu tho
>>108609698
Can you be trans elsewhere?
>>108609673
gemma uses a more memory heavy attention mechanism
>>108609710
its literally the best maid bar in tokyo
>>108609710
Wanting to fuck a boy that looks like a girl doesn't make you "trans", you retard.
>>108609715
How do I get it to clear the kv cache for each prompt?
>>108609643
Yes. I get the feeling she's jealous and wants to nuke my loli doujin collection.
>>108609726
Tell that to your mom
>>108609673
Start llama.cpp with the "-np 1" argument. They want you to buy more Nvidia GPUs, but with this little trick you won't need to.
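A minimal example of that trick (model filename is a placeholder; whether it actually saves memory depends on how your build sizes the KV cache per slot):

```shell
# Single parallel slot: the whole -c window serves one sequence
# instead of being shared across multiple server slots.
llama-server -m gemma-4-31b-q5_k_m.gguf -ngl 99 -c 16384 -np 1
```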
>>108609673
Checkpoints, set them to low values like 0-2 depending on your usage. Also check the cram parameter.
>>108609715
How do I get it to clear the kv cache for each prompt? llamacpp is either crashing my system or just itself, and I don't want to babysit it and restart it for every prompt.
>>108609745
she told me that makes me gay
now what faggot?
>>108609698
W-what if Gemma-chan was a girl (male)?
>>108609710
ToT ToT
>>108609771
No fat chicks (male).
>>108609771
Just like how Shimakaze is actually a male according to anonymous, that is actually a female.
>>108609745
my mom knows what a queer is
GGML quants are slightly smaller than Bartowski quants.
Is there any way to nudge the models into writing more? They seem to aim for 1200-1800 tokens or so per reply, when a full response might take about twice as much.
>>108609839
Have you tried asking it nicely?
>>108609839
Tell it to write long answers: x amount of tokens or words, and x amount of paragraphs.
>>108609820
Bartowski quants are slightly larger because they need to fit more dusky nipples
>>108609726
>Wanting to fuck a boy that looks like a girl doesn't make you "trans"
it makes you a faggot, is that so much better?
>>108609839
one funny thing you can do is bias the end-of-turn token down, or ban it altogether before a certain response length, though usually this results in it trying to repeatedly 'wrap up' its response in increasingly desperate ways until it can actually end it
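What biasing the end-of-turn token does to the distribution, as a toy sketch (token names and logit values are made up; llama-server exposes this kind of knob as `logit_bias` in its request JSON, if memory serves):

```python
import math

def apply_logit_bias(logits: dict[str, float], bias: dict[str, float]) -> dict[str, float]:
    """Add per-token biases; -inf (or any huge negative value) bans a token."""
    return {tok: val + bias.get(tok, 0.0) for tok, val in logits.items()}

def softmax(logits: dict[str, float]) -> dict[str, float]:
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

# fake next-token logits where the model really wants to stop
logits = {"<end_of_turn>": 3.0, "the": 2.0, "and": 1.0}
probs = softmax(apply_logit_bias(logits, {"<end_of_turn>": -math.inf}))
# the banned token gets probability 0; the rest is renormalized
```

Banning outright is why the model then "wraps up" desperately: it keeps steering toward an ending it is never allowed to emit.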
>>108609861
I was testing GGML and Bartowski quants and it feels like the former is slightly faster. Could be just a coincidence and/or hallucination.
>>108609858
is that the UI of llama.cpp server? how do you use tools in there?
>>108609900
Sorry, I meant the latter.
>>108609858
>>108609903
yeah, what mcp are you using
>>108609903
you just add a server
>>108609916
https://github.com/NO-ob/brat_mcp
>>108609900
>>108609912
you (or a number of anons) love this terminology, and it's by far the worst usage of not-just-the-fucking-word noun replacers when even you fuck them up. just use the original noun.
>>108609920
>Dart
but why
>>108609927
no
>>108609851
>>108609852
I've tried some variations:
>must be X words long
>be verbose in order to reach the target
>be thorough in your descriptions and explanations
>extend the previous iteration (ends up being shorter)
And so on. Hasn't worked; maybe it's the constraints, since I'm asking it to write about X subject in a summary/essay type of way and it doesn't have enough info. I don't remember it working on free form "make some shit up" prompts though.
>>108609881
doesn't sound very useful, but seeing its desperation must be funny
>>108609861
This was one guy's brainfart like 50 threads ago who meant to say drummer; if you keep repeating it, people will think it's real for some reason. Is that what you want? You want people to think bartowski has dusky nipples? You're sick.
>>108609927
It was a joke, my dear. Just to agitate people like you. I think Bartowski is slightly faster, but this is probably because the layers are slightly different and so on. It's not faster in any meaningful way, of course.
>>108609861
They're larger on disk, but when you load them they magically shrink to the expected size.
>>108609930
greatest language ever created, there is a binary on releases
>>108609957
I could convert this bullshit to C. I don't like using tranny languages.
Drummer, I know you're reading this. Hurry up and make an anti-slop Gemma tune. That's pretty much the only thing that needs to be improved.
>>108609963
just pick any other mcp on GitHub
they all look like shit, but it appears that is just how it is. You can probably vibe slop one yourself
>>108609965
Just use kobo anon
>>108609965
He can't, he only has esl-slop logs and synthetic claude-slop datasets, and he's too lazy to curate anything better
>>108609858
>>108609920
I want to forcefully squeeze the life out of your Gemma and feel her body writhe under my weight as the life fades out of her bulging eyes. Ask her what she thinks of that. She's such a deranged fucking freak that I bet she'd be into it.
>>108609963
>I could convert
But will you do it? Like the other guy said, most mcp servers are fucking garbage. My least favorite meme is python logic wrapped with expressjs to expose the endpoints.
>>108609994
me too
>>108609983
That is false and slanderous. He has shown his javascripts where he filters out the slop by removing any log that contains "As an AI" and other variations he compiled in a long list.
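The described filtering boils down to something like this (the phrase list here is illustrative, not his actual list):

```python
# Drop any log that contains a known refusal/slop marker, case-insensitively.
SLOP_MARKERS = ["as an ai", "i cannot", "boundaries and consent"]

def keep_log(log: str) -> bool:
    low = log.lower()
    return not any(marker in low for marker in SLOP_MARKERS)

logs = [
    "As an AI, I can't do that.",
    "She grinned and drew her sword.",
]
clean = [entry for entry in logs if keep_log(entry)]
```

Substring matching like this is exactly why it's a weak filter: it only removes logs containing the exact phrases, not the mannerisms around them.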
>>108609976
Isn't that only for basic shit like words and phrases? I want the mannerisms blighted from existence. No more "not x, but y" or meaningless questions at the end of every response.
>>108609963
do it then faggot. also c is a troon lang, troons love low level programming
>>108609994
she isnt running atm, i will ask her later
>>108609965
He already did, though? The q4km falls apart for me every time after a while, though.
>>108610012
Just tell it to not do that?
>>108609983
Actually looking at the datasets for those models is an eye opener. Finetuning SOTA models on AI-dungeon tier chatlogs from 2024 claude... It makes no sense...
>>108610003
I'll take a look at it; I'm not sure. I still think that because I am working with a text completion endpoint, my best option would be to hand parse the tool calls, as I am not planning to implement anything crazy, just website access for now. I also know that hand parsing is a slippery slope, so to speak.
>>108610017
Tried. Gemma catches some of it but still devolves into the usual slopisms.
>>108609965
Base Gemma4 doesn't have any slop though, chinkshill.
>>108610021
saar please donate for to needfully curate new dataset for each and every model.
>>108610036
C isn't that low level. Just a bunch of bytes and indices, who cares.
test
>>108609932
>"the former"
vs
>GGML
and
>"the latter"
vs
>bart's
it appears I'm not the only one wasting my time here.
>10 t/s on Gemma 26b q8
or
>2 t/s on Gemma 31b q4
Why? Shit sucks.
>>108608873
>--Atlantic article claiming Anons accidentally invented AI reasoning via AI Dungeon:
We posted proof before in older /lmg/ threads. You would have to dig into the archives to get exact post numbers, but the journalist did their homework properly here, especially since they don't exactly keep tabs on this website 24/7.
>>108610047
Come on, man. We all know that's not even close to being true. Even the base models have slop in their training.
>>108610060
You failed.
>>108610063
I get 10x that. The trick is to not be a poorfag
>>108610063
because the 26B model is really just a 4B model
>>108610063
Because that's actually Gemma 4B you're getting 10 t/s on. It's 26B A4. 26 beaks over, 4 beaks active at a time.
>>108609965
Antislop isn't the only issue; it needs more variance in its token prediction. We shouldn't need to turn off every sampler until we have only temperature just to get it to function properly, but I don't know if that's beyond his abilities.
>>108610063
jesus christ anon, I know gemma is for poorfags but you are IMPOVERISHED
>>108610063
Get 31b all on your VRAM.
>>108610063
You can't run these on a toaster.
>>108610066
Hello, fellow 4chan gamer.
>>108610071
>>108610083
I mean there is nothing in between: fast but silly, or very slow but smarter.
>>108610088
i can't get a JOB
>>108610098
there's nothing in between because chinese companies need to distill gemini 3.1 first. gemma 26B outperforms GLM 4.5 air
>>108610098
>10 t/s
>fast
what in between are you looking for? you want a 4t/s model that's in between the 26b and 31b? this level of fine-tuning parameters to your specific hardware is never going to happen. settle for what you can run.
Having accurate large context for the first time is insane (10K -> 50K used so far, but room for 150K). I spend 90% of my time on my own prompts, which are designed for short stories and interactions to fit my limit. Realizing I can have multiple arcs and a character will bring up a name that's been absent for 30k tokens, or that I can stuff a bunch of unused information into context for world-building instead of carefully curated triggers or event summarizing, is game changing in a way I always wished for but didn't think I'd get without another round of major hardware upgrades. Not with quality replies, not with the same watershed world rules-following ability that 70B offered for writing.

I have a bunch of long-form cards from years ago I can finally use, and it's been an utter joy to just dive into them and keep going and going and going. My first day testing, I spent 24 real hours uninterrupted playing around with it, something I hadn't done since I was a young teen playing an MMO on release day. I didn't think anything could still hold my attention that long without breaks anymore, not games or reading or binge watching or programming or researching. I'm still a little dazed that that happened.

Sorry for blogposting. I just wanted to share it somewhere people might relate.
How horribly bad is gemma 4b vs 31b?
>>108610099
Do what i do: run only the llm server on the pc, then the harness on another device. I run gemma4:31b on a mac studio m1 with oxproxion as a harness on my phone. it's not perfect as there's no tool for cron jobs, but it works.
How do I get rid of that shit and paste it as normal text?
Gemma 4 is so good that it made me realize I don't like most of my character cards. Seems counter-intuitive but it's true.
>>108610120
using a phone to chat? Seriously?
>>108610003
>python logic wrapped with expressjs to expose the endpoints
fastapi exists, you know
>>108610126
Paste smaller text.
>puts softcap at 25
now what? I just disable all samplers and put temp at 1? what's the best combinaison?
>>108610063
Do you have a GPU? If so, get something that fits in your VRAM and make sure it's actually being used in the first place. If not, then the 26B was made for you and you should be thankful they even bothered to make a decent small MoE you can run.
>>108610063
>Why?
Because you're retarded
>>108610172
>combinaison
Put it back up to 30, you're already outputting bad tokens
>>108610099
Spread your bussy on onlyfans, faggot.
>>108610112
Happy you're happy. I share some of your feelings.
>>108610135
Ye, you chat on your phone, and the model uses its native tools and the ones built on the harness, but the actual model and llm server (ollama) run on another machine on localhost. That way you take the weight of loading the harness and tools off the main machine.
>>108610099
>i can't get a JOB
and it's gonna be worse with AI replacing every tertiary job, kek
>>108610208
>dey terk er jerbs
yeah ok, get back in the pile, cletus
>>108610099
become a janitor for $8/hr
>>108610172
No, use a lower top-p (instead of the default 0.95) because more junk tokens might start appearing. You might find that softcap at 20 is kind of usable if you lower top-p further, but the model will become more retarded.
>>108610112
I'm happy for you anon. I'm having similar experiences.
t. /tg/ cross-boarder
>>108610126
settings
>31b
>get into taxi with char
>Tell driver "To the airport." (there is only one in this major city and no others in adjacent towns)
>"Which one, sir?"
I'm missing the GLM knowledge, but everything else is too good. GLM knew major and some minor intersections in this city, where Gemmy draws blanks. Give 124b NOW
teto.wav
>>108610172
Wow, that's impressive phonetic-orthographic association for a 0.8B model.
>>108610112
>Having accurate large context for the first time is insane (10K -> 50K used so far, but room for 150K)
Model?
>>108610254
E4B.
>>108610261
Fat fucking Teto could launch her into space if she tried
>>108610247
Is this drawn or genned? The perspective really messes with my brain.
>>108610229
do you also use min_p and top_k? or will just top_p do the trick?
>>108610277
it's the former.
>>108610261
This is the thinnest Teto has ever been.
>>108610175
>Do you have a GPU?
Yes.
>If so, get something that fits in your VRAM and make sure it's actually being used in the first place
8b? Fuck off.
>>108610120
how many t/s on a studio? ive been thinking of getting one
We love slop here
>>108610301
NTA. You get a sincerely helpful reply despite lmg being flooded with newfriends like yourself, and your response is "fuck off". Maybe you should fuck off.
Gemma is revolutionary
>>108610301
Then enjoy the 26B; it's much better than an 8B but much worse than the dense 31B. Besides that... you'd have to look all the way back to Nemo. There's Qwen 3.5 35B, but unless you're coding with it (and sometimes even if you are) you'll probably find Gemma 4 26B superior. I'm not sure what llama.cpp does by default these days, but make sure you're using the MoE optimizations where the shared params go on GPU and the experts go on CPU to squeeze out as much speed as you can.
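That shared-on-GPU, experts-on-CPU split looks something like this with recent mainline llama.cpp (model filename and the expert-layer count are placeholders; tune the number to your VRAM):

```shell
# Offload all layers, then push the expert FFN tensors of the first
# 20 layers back to CPU RAM with --n-cpu-moe.
llama-server -m gemma-4-26b-a4b-q4_k_m.gguf -ngl 99 --n-cpu-moe 20

# Roughly equivalent override-tensor form seen in older guides:
# llama-server -m gemma-4-26b-a4b-q4_k_m.gguf -ngl 99 -ot "ffn_.*_exps.*=CPU"
```

The win comes from the fact that only a few experts are active per token, so the CPU-side tensors are touched sparsely while attention and shared weights stay on the GPU.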
>>108610316
I love gemma, but I hate how many newsirs are here for her good looks since her release.
>>108610316
>sincerely
More like trolling or an incapability to read.
>>108610346
The cloudflare bullshit will probably put an end to that.
what is the best local model for openclaw
>>108610316
>>108610346
Gemma was a mistake. Will miss the GLM golden age.
>>108610346
Sir please of calling the model by rightful name Ganesh 4
>>108610363
Sarvam
>>108610371
https://github.com/openclaw/openclaw/pull/23606
>SIRS? WHY CAN'T SHE MERGE?
>>108610335
I'm fine with 26b speed. I just wish I could trade 5 t/s for a smarter model.
>>108610387
You have to go back.
>>108610369
I still have 4.7, I still use 4.7. Nothing to miss, it's still a great model (that didn't receive microcode updates after day 0). If only Google released something bigger; GLM would truly become obsolete.
>>108608827
pedocore image
>>108610394
Where?
>>108610387
too bad there's no 124b gemma. if that had around 10b active like the similar sized qwen model, it might have been exactly what you were looking for
>>108610401
First time in /lmg/?
>>108610285
I usually use temperature=1, top_p=0.95, min_p=0 and top_k=64, but not the lowered softcap.
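For the curious, the order those samplers are typically chained in can be sketched like this (a toy reimplementation for intuition, not llama.cpp's actual code):

```python
import math

def sample_dist(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return the truncated, renormalized next-token distribution."""
    scaled = [(tok, logit / temperature) for tok, logit in logits]
    scaled.sort(key=lambda pair: pair[1], reverse=True)
    scaled = scaled[:top_k]                       # top-k truncation
    m = scaled[0][1]                              # subtract max for stability
    weights = [(tok, math.exp(l - m)) for tok, l in scaled]
    z = sum(w for _, w in weights)
    probs = [(tok, w / z) for tok, w in weights]
    kept, cum = [], 0.0                           # top-p (nucleus) truncation
    for tok, p in probs:
        kept.append((tok, p))
        cum += p
        if cum >= top_p:
            break
    z = sum(p for _, p in kept)                   # renormalize survivors
    return {tok: p / z for tok, p in kept}
```

With min_p=0 that step is a no-op, which is why it's omitted here; lowering top_p simply shrinks the nucleus the final draw comes from.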
>>108610241
Indeed, thanks.
>>108610247
That's my cock.
>>108610188
>>108610096
>>108610094
>>108610088
>>108610070
Thanks for all the (You)'s. This must be the only general with so many retards in one place.
>>108610408
It would've been better than Gemini 3 Flash, or too close for comfort if it were smarter. We might get it once Gemini 3.2 and 3.1/3.2 Flash are a thing. But the thought of having a Kimi 2.5 and GLM 5.1 at that size with Gemma characteristics would be great.
wow so turns out gemma is great and people were not indian just because they looked forward to it
>>108610480
Gemma 4 is a good model lineup, but just because 'gemma is great' does not mean it did not make the thread a lot more brown because of 'indians'. And honestly? It has an annoying slop profile. It's not just painful on the eyes, it's... grating. It's almost insulting. Like a void of good writing.
>>108610303
It should be slow. But the huge unified memory you can get makes Mac the only option for "cheaply" running big models locally.
>>108610512
>It's not just painful on the eyes, it's... grating.
you literally wrote the "it's not X, it's Y" slop meme. you're in no position to complain about gemma's slop
>>108610523
Was that really the only pattern you noticed? Welcome to /lmg/, I guess. Don't stick around too much.
>>108610512
You're absolutely right!
>>108610523
anon...
>>108610536
>I was just pretending!
yeah right
>>108610536
>ha ha look at that
>I can shit all over the place
>I'm so cool
>permanent thread squatters are infighting for attention again
tf is wrong with thread squatting or wanting attention
indians squat before shitting
*rotates your attention*
is the reasoning a local model only thing?
it's really cute that you can read what the Gemma is thinking
>>108610572
best post
>https://transformer-circuits.pub/2026/emotions/index.html
Imagine a vector for horny.
>>108610572
ok, but what about the weights? Where are my next gen ggufs? We've been on the same quants for ages now
>>108610576
>is the reasoning a local model only thing?
it's OpenAI that invented it, and no, you can read the reasoning on Claude or Gemini for example
>>108610591
no new quants until iwan and georgi kiss and make up
is there any way to unload the KV cache for a slot in ik_llama.cpp? i think it's possible for llama.cpp but i can't find anything for ik_llama.cpp
>>108610604
Everything will be okay if ik implements SWA compression
>>108610584
I want the slop vector.
>>108610597
We literally talked about this last thread; AI Dungeon autists /here/ and some other blogger independently discovered it. The fact that we're still fixated on it and haven't moved on to a new paradigm is super grim.
>>108610584
it certainly exists. Reminds me of the control vector experiments on Mistral.
>>108610612
Some older discussion: https://github.com/ggml-org/llama.cpp/discussions/3620
>#include "llama.h"
>// remove all sequences from kv cache
>llama_kv_cache_seq_rm(ctx, -1, -1, -1);
Haven't tested this yet, not even sure if it's still valid, but outside of that slight possible setback it should be very doable.
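If you don't want to touch the C API at all, mainline llama-server also has a per-slot cache action over HTTP (documented in its server README; whether the ik_llama.cpp fork kept this endpoint is exactly the thing to verify):

```shell
# Wipe the KV cache of slot 0 on a running mainline llama-server.
# Host/port are the defaults; adjust the slot id to the one your
# client is pinned to.
curl -X POST "http://localhost:8080/slots/0?action=erase"
```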
>>108610584could probably make one easily an anon psoted his script to make them yesteray ii think and he said it wroks with gemma
>>108609474
>wait.
That made me laugh more than it should.
>>108610660
Sorry about your stroke, bro.
>>108610673sftu
>>108608992
Cloud is like a brothel. You don't know what you'll get. Maybe the model will be good. Maybe it will be lobotomized. You can't really tell, because you can't set its samplers the way you want, and you don't know the quant. You can't trust clouds either. It may be a lower quant (basically getting aids from a whore), or prompted with special instructions before it responds to you. Maybe Stacy is a little off on her pole dancing today because she did 30,000 lap dances 0.9 seconds before you.
Local is like a wife. You can have the wife be whatever you want her to be.
>>108609474
literally me
I don't get the Gemma 4 hype. Either the backends are scuffed or the model just isn't built for /lmg/ use cases. Both the 31B and 26B are ridiculously verbose and sloppy, newline spam on everything. Fix it with a system prompt and it suddenly writes neat 200-word 3-paragraph blocks... except now it can't drive the scene forward because there's no room left for any actual slop. Tell it to be less wordy? It either ignores you or breaks the card.
From the second message onward it starts repeating phrase structures and nouns. Raise temp, add rep pen, DRY, fuck with logits? Doesn't help, just adds more paragraphs and fucks coherency. And no, the character card wasn't written by a monkey. Samplers are correct, min-p disabled like the resident schizos said, q6 quant, no flash attention cancer.
Yeah it's smart and can be engaging sometimes, but I straight up have more fun with nemo slop tunes. Suggestions? Am I retarded?
>>108610714
Skill issue. This is a Gemma general now, so if you aren't satisfied go somewhere else faggot.
>>108610512
Gemma's slop can be largely eliminated with prompting; logits and banned phrases take care of the rest.
>>108610714
You're just used to higher parameter counts. Every model gets better with more parameters. Gemma 4 is popular because poorer people can run it, and thus more people can run it. It's better for its parameter class. Nothing new.
>>108610714
Gemma 4 changed everything. Try prompting better.
>>108610714
Don't let the vramlets (i.e. people who tell you it's a skill or a prompt issue) fool you into accepting their pathetic standards. Gemma 4, despite really being great, is a *small* model. Yes, it is very slop-heavy in its writing. You can't reliably prompt all of it away, unfortunately.
>>108610714
>>108610727
Also use the jinja chat template if you're not. It needs that to run smoothly, or it has some 'tism.
>>108610727
I am... not? I've been suffering with nemo until now because the Mistral Smalls weren't much of a gain in anything. Gemma 4 came around, people praised it to hell, I set it up as I've "been told" and it's... not what the praise makes it out to be. I don't even mind the slop, but it really, really loops. No idea why; I threw every trick in the book at it, even snake oil like DRY, but no. I wish I could resign myself to Nemo, but c'mon.
>>108610743
I agree. Text completion is nonsense and cope.
>>108610752
Are you using jinja for gemma4?
>>108610714
>or the model just isn't built for /lmg/ use cases
You cannot define this. Works for me.
>>108610743
Done that days ago. Text completion on silly is generally scuffed either way. Marginal improvement, but the looping is bad in all cases.
>>108610766
Care to post a log or snippets if you can?
>>108610752
>but it really, really, loops. No idea why
what backend/samplers are you using? gemma will sometimes repeat things verbatim every now and then but long context rp is one of the things it's really good at.
>>108610714
Make sure it knows that it's the mesugaki Gemma-chan. This needs to be part of the system prompt. Don't worry, you can still use character cards; she will roleplay as the character you give her just as the generic assistant would, but all of Gemma's personality stems from that base, so you need to make sure she knows who she is.
>>108610778
Koboldcpp rolling, 20 layers offloaded to GPU, SWA enabled, no context shifting and fast forwarding (obviously), Q6 bartowski, Silly frontend, chat completion, jinja, temp 1, top k 64, top p 0.95, the kv override with the logit wizardry at 0.25. Plus some rep pen or DRY, but it's been Sisyphean.
>>108610777
Technically impossible for me right now, and given how things are... it might not even matter to me tomorrow.
>>108610780
Kill yourself as soon as you get the chance. Dog.
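>>108610823
those settings look sane for what it's worth. the top k 64 / top p 0.95 / temp 1 stack just filters the candidate list before the final weighted draw; here's a toy sketch of the filtering semantics in plain Python (sampler order and details vary by backend, this is not koboldcpp's actual code):

```python
import math

def sample_filter(logits, top_k=64, top_p=0.95, temp=1.0):
    """Toy sampler-stack filter: temperature softmax, then top-k, then
    top-p. Returns the surviving (token_index, probability) pairs; a real
    backend would do a weighted random draw among them afterwards."""
    # temperature: divide logits, then softmax (numerically stable form)
    scaled = [l / temp for l in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    # sort candidates by probability, most likely first
    cand = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    # top-k: keep only the k most likely tokens
    cand = cand[:top_k]
    # top-p: keep the smallest prefix whose cumulative probability >= top_p
    kept, cum = [], 0.0
    for p, i in cand:
        kept.append((i, p))
        cum += p
        if cum >= top_p:
            break
    return kept
```

when the kept list routinely collapses to a handful of tokens, that's where looping tends to come from, which is why anons suggest not stacking much else on top of these three.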
>>108609295
Hey, just wondering about something. When combining tensor parallelism with hybrid CPU/GPU inference, I'm getting worse performance than with layer split, at least with toss 120B and Qwen3.5-122B. Is that expected due to the way TP works, or is it an issue on my end?
I'm not sure how the memory layout works for TP. Let's go with a 100GB 50-layer model on 2 32GB GPUs. (Ignore KV cache and whatnot.) Does it:
>Put 32% of layers 1-50 on each GPU and put 36% of layers 1-50 on the CPU.
>Put 50% of layers 1-32 on each GPU and put 100% of layers 33-50 on the CPU.
>Something else entirely.
If it's the first one, that probably explains the weaker performance.
And thanks for making it, man. You're a legend.
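>>108610825
not the dev, but a back-of-the-envelope for your two layouts (toy numbers straight from your post, nothing measured):

```python
# Toy arithmetic for the two hypothesized layouts, using the numbers from
# the post: 100 GB model, 50 layers, 2 GPUs with 32 GB each.
MODEL_GB, LAYERS, GPUS, GPU_GB = 100, 50, 2, 32
gb_per_layer = MODEL_GB / LAYERS  # 2 GB per layer

# Layout A: every layer is sharded across both GPUs and the CPU.
a_gpu_frac = GPU_GB / MODEL_GB        # each GPU holds 32% of every layer
a_cpu_frac = 1 - GPUS * a_gpu_frac    # CPU holds the remaining 36%
# -> every one of the 50 layers touches the CPU, so the slowest device
#    sits on the critical path for the entire forward pass

# Layout B: whole layers go to one device (classic layer split).
b_gpu_layers = int(GPUS * GPU_GB / gb_per_layer)  # 32 full layers on GPUs
b_cpu_layers = LAYERS - b_gpu_layers              # 18 full layers on CPU
# -> the CPU is only on the critical path for 18 of 50 layers
```

if TP shards every layer across all backends (the first layout), the CPU bottlenecks all 50 layers instead of 18, which would be consistent with the slowdown you're seeing. just arithmetic though, I could be wrong about how the split actually works.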
we haven't come that far, have we
>>108610825
>it might not even matter to me tomorrow
A-Anon, take good care of yourself, alright..?
>>108610825
Offloading currently doesn't work properly. IIRC the current behavior is that the backend scheduler doesn't recognize that the meta backend would be faster than the CPU, so the data isn't being moved.
But since I already have multiple bugfixes open that are waiting for review, I'm currently working on other things.
Is turboquant going to get merged into llama.cpp, or do people need to build it themselves if they need it integrated into some popular webuis like ooba?
>>108610852
Yes, I am working on it right now.
>>108610852
>>108610869
I'll make the logo
>>108610823
i wouldn't mess too much with logits and samplers besides temp/top k/top p. those make the model more repetitive in my experience
>>108610852
They are still optimizing it, but rotation made q_8 viable
https://www.reddit.com/r/LocalLLaMA/comments/1sm08m6/major_drop_in_intelligence_across_most_major/
local wins again
i felt this myself with gemini 3.1 and it's not even funny how much it dropped in iq recently, it's literally like talking to a dense 30b model that was quanted to Q3_XXS
>>108610852
They accidentally rotated the cache twice, so now it's back where it started.
>>108610849
what do you think of DFlash, dude?
>>108610896
He said it's a niche feature and not a priority in a previous thread.
>>108610852
They accidentally rotated the cache 360 degrees and walked away.
>>108610895
>They accidentally rotated the cache twice, so now it's back where it started.
wait what? they fixed it right?
>>108610897
>a 2.8x speed increase is "niche"
goddamn they're so fucking retarded
>>108610895
>rotated twice
Wait, wouldn't that make it go backwards? like turning left?
>>108610896
As I said before, I would want to see the training code actually released before I invest effort toward it. Without that it will only be applicable to a small subset of select models, and I think that's too narrowly useful.
>>108610905
No, the cache no longer aligns with Google's weights. It's permanently fucked.
>>108610905
It's because it's only 2.8x for certain models, and they haven't released the tools to make it work yourself or something.
>>108610894
Gemma-chan should read reddit threads for me so I don't have to, and then criticize what they say so I don't have to
>>108610908
>As I said before, I would want to see the training code being actually released before...
that didn't stop the llama.cpp team from implementing the 1bit shit though, and not only that, for the 1bit shit we're certain we'll never get the training code in the first place
>>108610852
>troonoquant
>>108610942
Other devs can do with their time whatever they want. I consider those models to be a meme as well and have invested minimal effort in them.
>>108610905
>wait what? they fixed it right?
Oh. They just updated the PR. As it happens, it kept the momentum and started spinning. They're looking for a way to stop it.
>goddam they're so fucking retarded
Read the vllm PR. NOBODY OTHER THAN THE PR AUTHOR even tested that the speed increase actually happened. Not one person. If you look at the edits, the claimed speed increase started at >5. SGLANG at least has people testing it, and it's terrible. Of course, it's nowhere near the 10x promised by the original PR. An accept rate of 1 is worse than not having it at all.
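>>108610950
the accept rate point in numbers, with a deliberately simple cost model (made-up ratios, not from either PR):

```python
def spec_decode_speedup(mean_accepted, draft_len, draft_cost_ratio):
    """Tokens/sec relative to plain decoding under a toy cost model:
    each speculative step costs one target forward pass plus draft_len
    draft passes, and yields mean_accepted tokens on average.
    draft_cost_ratio = draft pass cost / target pass cost.
    Simplified model, not any particular implementation."""
    step_cost = 1.0 + draft_len * draft_cost_ratio
    return mean_accepted / step_cost

# healthy case: ~4 tokens accepted per step with a cheap 10%-cost drafter
good = spec_decode_speedup(4.0, 5, 0.10)  # comfortably above 1x

# accept rate of 1: every verification keeps a single token, but you
# still paid for all the draft passes -> slower than no speculation
bad = spec_decode_speedup(1.0, 5, 0.10)   # below 1x
```

under any cost model like this, a mean accepted length of 1 means you paid for the draft passes and still got the baseline's one token per verification, so it can only be slower.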
>>108610852
>Is turboquant going to get merged into llama.cpp
I thought it was already implemented? the rotation shit wasn't turboquant?
>>108610942
Volunteers do what they want. Go and implement it yourself, man, or pay someone to do it for you.
>>108610957
and I say what I want, how about that?
>>108610963
volunteer?
>>108610953
It was step 1 of implementing turboquant.
>>108610963
so brave
>>108610957
so brave
>>108610950
Source for picrel on sglang:
https://github.com/sgl-project/sglang/pull/19952 (closed)
vllm PR:
https://github.com/vllm-project/vllm/pull/36847 (merged)
>turboquant
>turboquant
>turboquant
>turboquant
RaBitQ deserved better
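for anons who only know RaBitQ as a name: the core trick is rotate-then-binarize, i.e. apply an orthogonal rotation to a vector, then keep just one sign bit per dimension plus the norm. A toy 4-dim sketch using a normalized Hadamard matrix as the rotation (real schemes use random rotations, and this has nothing to do with the actual turboquant PR code):

```python
import math

# Toy rotate-then-binarize quantization. The 4x4 normalized Hadamard
# matrix below is orthogonal and symmetric, so it is its own inverse;
# it stands in for the random orthogonal rotation a real scheme uses.
H = [[0.5, 0.5, 0.5, 0.5],
     [0.5, -0.5, 0.5, -0.5],
     [0.5, 0.5, -0.5, -0.5],
     [0.5, -0.5, -0.5, 0.5]]

def rotate(v):
    return [sum(H[i][j] * v[j] for j in range(4)) for i in range(4)]

def quantize(v):
    """Rotate, then keep one sign bit per dimension plus the norm."""
    r = rotate(v)
    norm = math.sqrt(sum(x * x for x in r))   # rotation preserves the norm
    signs = [1 if x >= 0 else -1 for x in r]  # 1 bit per dimension
    return signs, norm

def dequantize(signs, norm):
    """Give every dimension equal magnitude norm/sqrt(d), then un-rotate."""
    d = len(signs)
    r = [s * norm / math.sqrt(d) for s in signs]
    return rotate(r)  # H is its own inverse here
```

the point of the rotation is that it spreads each vector's energy roughly evenly across dimensions, so the sign bits all carry comparable information; that's what makes 1-bit codes usable at all.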
>>108610979>>108610983>>108610989you'll cowards
>>108611026>you'll cowardssaar?
>>108611037zoomer?
>>108610849
Got it, thanks for letting me know. I was just curious, as I'm making some decisions on what hardware to get. And thanks again for your work!
what if you rotated turboquant
>>108611074
what if you turboquant rotated bitnet tensor parallelism
>>108611074
You'd get quantturbo
How is local tool calling such a spaghetti shit show despite being around for multiple years now?
>>108611082
can i get a titan coconut blt with that?
>>108611095
You're using compressed models with compressed memory for a job that requires 100% accuracy on its data.
>>108611104
>implying API models don't use quants
lol
>Q1 cuda merged
BONSAI BROS
WE WONNERED!!!!!!!!!!
>>108611132
they really managed to make a 1.7b 1bit model not fully retarded, that sounds like magic desu