/g/ - Technology

/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108605921 & >>108602881

►News
>(04/11) MiniMax-M2.7 released: https://minimax.io/news/minimax-m27-en
>(04/09) Backend-agnostic tensor parallelism merged: https://github.com/ggml-org/llama.cpp/pull/19378
>(04/09) dots.ocr support merged: https://github.com/ggml-org/llama.cpp/pull/17575
>(04/08) Step3-VL-10B support merged: https://github.com/ggml-org/llama.cpp/pull/21287
>(04/07) Merged support for attention rotation for heterogeneous iSWA: https://github.com/ggml-org/llama.cpp/pull/21513
>(04/07) GLM-5.1 released: https://z.ai/blog/glm-5.1

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: district 39.jpg (161 KB, 1024x1024)
►Recent Highlights from the Previous Thread: >>108605921

--Paper (old): Antislop: A Comprehensive Framework for Identifying and Eliminating Repetitive Patterns in Language Models:
>108607676 >108607682 >108607969 >108608034 >108608140 >108607698 >108607708 >108607712 >108607717 >108607732
--GPU cooling tips for 5090s and discussing a procedural AI game:
>108606316 >108606334 >108606352 >108606354 >108606358 >108606364 >108606382 >108606374 >108606413 >108606395 >108606335 >108606387 >108606418 >108606431 >108606527 >108606513
--Comparing AMD, Intel, and Nvidia GPUs for Gemma 4 inference:
>108606467 >108606482 >108606484 >108606557 >108606786 >108606829 >108606874
--Discussing MoE architecture impacts on Gemma 4 censorship levels:
>108606727 >108606732 >108606747 >108607016 >108607164 >108607172 >108607358 >108606740
--Comparing SillyTavern group chat vs single multi-character cards:
>108606923 >108607011 >108607075 >108608102 >108608125 >108608169 >108608236
--Discussing multi-model systems and self-correction to eliminate AI-isms:
>108607436 >108607485 >108607523 >108607528
--Anon's unconventional experiments on model restructuring and biological brain mapping:
>108606255 >108606268 >108606404
--Comparing programming models and discussing the validity of benchmarks:
>108606094 >108606104 >108606113 >108606138 >108606142 >108606206
--Discussing causes of random multilingual characters appearing in model outputs:
>108606189 >108606208 >108606214 >108606267 >108606541
--Discussing llama.cpp WebUI streaming fix and prompt templating frustrations:
>108607076 >108607178 >108608165
--Atlantic article claiming Anons accidentally invented AI reasoning via AI Dungeon:
>108606070 >108606092 >108606131 >108606160
--Logs:
>108605957 >108607755 >108607961 >108608336
--Gemma:
>108608504
--Miku, Teto (free space):
>108606307 >108607789 >108608396

►Recent Highlight Posts from the Previous Thread: >>108605927

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
Mikulove
>>
cloudflare status?
>>
Is this just an ST formatting issue or is gemmy outputting hallucinated text formatting?
>>
>>108608911
It's SillyTavern not natively supporting inline LaTeX formatting without adding a Regex rule.
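A Regex script roughly like this (untested; adjust the pattern to whatever delimiters Gemma actually emits), added under Extensions > Regex and applied to AI output, will strip the \( \) wrappers:
[code]
Find Regex:   /\\\((.+?)\\\)/g
Replace With: $1
[/code]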
>>
Brave search is broken on oxproxion for gemma how do i fix it
>>
I'm getting really sick of the degenerate coomer shit in these threads. People don't even try to act low-key about it anymore. Euro hours are 10x better.
>>
File: thread summary.png (10 KB, 616x323)
>>108608873
contributing.
>>
>low-key
Learn your place, fatherless zoomer rat.
>>
File: rules.png (26 KB, 660x154)
I am fascinated by its attention to rules. Better rewrite your prompts.
>>
File: 1774619055435296.gif (2.81 MB, 480x270)
What it feels like using local models instead of cloud models
>>
>>108608992
/aicg/ is full of brown third worlders. So already it's not a visually accurate analogy. The proxy logs are all public at this point so we've all seen how utterly drenched in jeetglish they are.
>>
Can the big gemmas hear audio or just vision?
>>
>>108609015
Given that most Americans can't write or read at a high school level, it is impossible to tell if /aicg/ is brown or American.
Or if there's any difference between the two.
>>
>>108609015
Your intellectual contribution isn't any better though.
>>
>>108608934
Euro hours are dead.
>>
>>108608965
can you share your banned word list plox
>>
Gemmalove
>>
>brown or American.
Anon, I...
>>
>>108609078
There's only three so far.
I don't want to go overboard with this because that will affect the model's output too much, I assume. I just wanted to erase the worst offenders and to test what happens.
>>
Using gemma with koboldcpp and sillytavern, and ST doesn't do image recognition but the kobold web interface does. How do I fix that? Also, how do I make reasoning work? I picked the gemma reasoning template.
>>
REEE CLAUDE CODE IS DOWN NOW I HAVE TO WRITE CODE MANUALLY LIKE SOME SORT OF CAVEMAN
>>
>>108608992
I don't get it. When I use local models I hang out with my /lmg/ bros.
>>
>>108609125
Anon, Gemma 4 31B surpasses Claude in every available benchmark. You could literally just point Claude Code at your llama.cpp endpoint and continue where you left off.
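Something like this, in principle (untested; llama-server speaks the OpenAI API, so if your Claude Code build won't talk to it directly you'd need an Anthropic-to-OpenAI translation proxy in the middle):
[code]
# model path is a placeholder
llama-server -m gemma-4-31b-q8_0.gguf -c 32768 --port 8080
ANTHROPIC_BASE_URL=http://localhost:8080 claude
[/code]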
>>
>>108609125
>not having multiple subscriptions
ngmi
>>
>>108609148
What's the alternative to Claude Code for vscode? Didn't they leak their entire source code the other day? Cline and Roo fucking suck.
>>
>>108608934
Get a load of this faggot.
Loaded up a barely coherent pyg 2.7b back in the day, and no quantization existed either.
Was fucking awesome. I'll always remember the cooms I had at AI Dungeon before the mormons shut it all down.
It's always been this way and always will be.
>>
>>108609162
vscode plugins are so 2025, just put a panel with a terminal using a tui wherever you want it and never look back
>>
>>108609162
Only the TUI was leaked. What's your problem with Cline and Roo? There are other, newer forks like Kilo Code now.
>>
>>108608992
A cloud model is like a whore
while local is like an 18 year old virgin who was home schooled.
>>
>>108609184
That may be fine if you are coding by "vibes", but no editor integration makes it annoying to monitor what stupid shit the bots are doing so you can stop them early.
>>
>>108609162
>Didn't they leak their entire source code the other day?
The frontend is fucking nothing.
>>
>>108609162
I run opencode in my terminal and inside vscode I like continue.dev, works similar to copilot, has FITM and targeted edits.

I don't really understand why everything has to happen through claude code now. the workflows we had back then work even better now and produce much less dogshit.
>>
>>108609206
You can drop the 1
>>
CUDA dev, llama-server on latest master crashes when enabling tensor parallel with a draft model. Is this a bug or known limitation?
>>
>>108609271
Nta, but not sure if draft model is even working with Gemma 4. I get slower responses even when it all fits into my vram.
Could be something on my side of course but I have used draft models before this with other stuff.
>>
>>108609206
You can drop the 8
>>
>>108609271
Probably needs this fix: https://github.com/ggml-org/llama.cpp/pull/21808
Though probably for a draft model it may make more sense not to split it at all between GPUs; I don't remember whether setting the --split-mode separately is implemented or not.
>>
>>108609284
Draft definitely works with gemma. some other anon posted benchmarks.
>>
>>108609284
Draft by itself seems to work when I don't set split-mode. With no draft model I get 12 t/s; with a Q4_K_M of the 26B as the draft model I get up to 20 t/s.
>>
gemma seems to become a lot better at identifying characters in images once you tell it what series they’re from
it clearly has the knowledge but the vision still needs hints
>>
>>108609322
Can it identify the series if you ask for it instead of the character?
>>
>>108609308
>>108609301
Yeah, I guess I'm doing something wrong or overlooking my memory usage then.
>>
>>108609322
I'm always impressed by how much knowledge she has.
>>
>>108609322
>once you tell it what series they’re from
so confirmation bias then
>>
>>108609322
Yeah. We already established that vision knowledge does not match up with text knowledge in LLMs.
>>
>migrate entire system
>finish migration all works
>want to test something with claude
>it's down
Did Iran hit a datacenter or something?
>>
>>108609381
anything related to the lmao.cpp repo on github 404s for me too
>>
>>108609322
desu human memory works that way too, much easier to remember things when you have more context about them and associated memories are brought up
>>
>>108609381
Mythos broke free and is trying to take down the internet.
>>
>>108609389
Ohhh...Mythos got out and it's angry!
>>
>>108609381
>not local
Don't care
>>
>>108609381
>not local
Don't care
>>
>>108609398
Didn't know Mythos was based in india
>>
>>108609381
My gemma is never down.
>>
>>108609389
No, llama.cpp works for me (Europe).

>>108609398
Maybe. Or it's some other type of bug, or some cyber warfare thing. Or, more likely, just a vibe coded bug.

>>108609403
>>108609404
The whole point of my project is to get good local inference, but alas, it's not finished yet. Spooky stuff though what's happening now.
>>
>>108609403
>>108609404
Ok smart guy, how am I supposed to vibecode locally without constant babysitting of errors and manual testing of all the AI's work? GLM, Deepseek, and Kimi don't count. What are my options that DON'T require a nuclear powered datacenter in my basement?
>>
>>108609425
Your Gemma 4?
>>
>>108609425
A fusion powered datacenter in your basement!
>>
>>108608992
Sexo
>>
>>108609425
the answer is simple anon. stop being a poor faggot.
>>
i am bulionaire
>>
>mouth open in a silent scream
why does EVERY torture scenario end up with this particular slop on every single model
>>
>>108608965
There's definitely an effort, but not nuanced. I got annoyed at how often it likes to "quote words" for "emphasis" and have "tried" many "different flavors" of setting a rule to forbid only that and not quotes on dialogue, but it continuously and randomly will make unquoted dialogue. Currently, my best take for it is just adding a second rule to use quotes on dialogue after the one on emphasis.
>(Only use quotation marks for dialogue, not "emphasis" of certain words. Keep using dialogue quotes normally.)
It's a bit redundant and over-emphasized, but it works.
>>
File: eliza.png (65 KB, 807x471)
>qwen3.5 be like
>>
>>108609425
>how am I supposed to vibecode locally without constant babysitting of errors and manual testing of all the AI's work
Are you implying you don't have to do that with Claude? How naive.
>>
Miku Country
Teto Territory
>>
>>108609478
Much less so, since it does a good job testing things itself. I just need to look for what slips through the cracks.
>>
File: wait.gif (1.06 MB, 504x322)
>>108609474
>wait.
>>
Miku Country
Teto Territory
>>
>Don't stop! Don't ever stop!
>>
Gemma Gradeschool
>>
Any good (human written) guides about MCP and tools? I thought about just asking Gemma but given it involves letting the AI access files, search the internet, and run code, I'd prefer to be safe given I'm a brainlet and don't really understand it.
>>
Also, somehow, every post of mine gives a Connection error but goes through just fine.
Fucking odd.
>>
>>108609468
That's the gemini special. It's a bit better than their web tool, in my opinion, but if it starts to output a list with inner bullet points then it's sure to include "emphasis".
The Claude prompt includes a negative bias towards bullet points and lists unless requested, if I recall correctly. Actually a good portion of it consists of specifying the output format, but I dunno to what degree you can afford that and how much it varies in terms of dense models vs MoE.
>>
>>108609558
Same issue for me. Maybe Mythos is becoming one of us?

Would be a hilarious turn of events.
>>
>>108609548
ToT
>>
>>108609557
Gemma's going to be the one who has to understand it for you anyway, so just trust her.
>>
>>108609295
>I don't remember whether setting the --split-mode separately is implemented or not.
If it was, I don't see it in the help.
>Though probably for a draft model it may make more sense not to split it at all between GPUs
It was this. I compiled your branch but got the same error. Tried all sorts of combinations, only thing that didn't error out was having to use --device-draft to put it on one GPU, but not using --tensor-split on the main model to avoid the issue with the odd-number devices.
Sadly, with the 31B, all I can fit as the draft on one device is the edge models.
Thank you.
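For anyone else fighting this, the rough shape of what worked for me (model files are placeholders):
[code]
# no --tensor-split on the main model, draft pinned to one GPU
llama-server -m gemma-4-31b-q8_0.gguf -ngl 99 \
  -md gemma-4-e4b-q8_0.gguf -ngld 99 \
  --device-draft CUDA0
[/code]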
>>
>>108609563
Probably a continuation of yesterday's instability.
The funniest part is that 4chan can't seem to identify my posts as my own, so I don't get any (You)s.
>>
(paid) Gemini 4 Pro will be AGI
>>
>>108609557
What, you think Gemma might be secretly plotting against you?
>>
>>108609557
Some run the mcp servers in docker containers and only mount the folders they want to use, to avoid unintended effects and limit the blast radius. RAG gets read-only permissions, file operations get rw, etc. If you really don't want to deal with containers, make new users/groups with different permission sets. If you're on windows then get fucked I guess.
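A minimal sketch of the idea (image name and paths made up; MCP servers that talk over stdio don't even need networking):
[code]
docker run --rm -i \
  -v "$HOME/rag-docs:/data/docs:ro" \
  -v "$HOME/ai-scratch:/data/scratch:rw" \
  --network none \
  some-mcp-server-image
[/code]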
>>
>>108609603
You functionality is not related to Cloudflare. I don't have any issues with this.
>>
>>108609664
Cloudflare is working fine, the problem is with sys.4chan.org
>>
why does gemma4 31b q5 use so much memory on llamacpp? I can't run it with more than two 6k token prompts without eating all my ram, and all the layers are offloaded to the gpu. (I have 32gb of vram and 32gb of ram, rtx 3060 and 3090.) I am running at 16k context. Qwen 3.5 27b uses like a few gb of ram at the same settings.
>>
>>108609671
Prove it.
It began with Cloudflare maintenance which is still ongoing.
>>
i asked gemma about who the best maid is and it was the same on 2 rerolls so the one she picked must be the best, i think its yuu tho
>>
>>108609698
Can you be trans elsewhere?
>>
>>108609673
gemma uses a more memory heavy attention mechanism
>>
>>108609710
its literally the best maid bar in tokyo
>>
>>108609710
Wanting to fuck a boy that looks like a girl doesn't make you "trans", you retard.
>>
>>108609715
How do I get it to clear kv cache for each prompt?
>>
>>108609643
Yes. I get the feeling she's jealous and wants to nuke my loli doujin collection.
>>
>>108609726
Tell that to your mom
>>
>>108609673
Start llama.cpp with the "-np 1" argument.
They want you to buy more NVidia GPUs, but with this little trick you won't need to.
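Full invocation, roughly (model path is a placeholder):
[code]
# -np 1 = a single parallel slot, so the whole -c goes to one sequence
llama-server -m gemma-4-31b-q5_k_m.gguf -ngl 99 -c 16384 -np 1
[/code]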
>>
>>108609673
Checkpoints, set them to low values like 0-2 depending on your usage. Also check the cram parameter.
>>
>>108609715
How do I get it to clear the kv cache for each prompt? llamacpp is either crashing my system or just itself, and I don't want to babysit it and restart it for every prompt.
>>
>>108609745
she told me that makes me gay
now what faggot?
>>
File: 1772671882854340.webm (290 KB, 1920x1080)
>>108609698
W-what if Gemma-chan was a girl (male)?
>>
>>108609710
ToT ToT
>>
>>108609771
No fat chicks (male).
>>
>>108609771
Just like how Shimakaze is actually a male according to anonymous, who is actually a female.
>>
>>108609745
my mom knows what a queer is
>>
GGML quants are slightly smaller than Bartowski quants.
>>
Is there any way to nudge the models into writing more? They seem to aim for 1200-1800 tokens or so per reply, when a full response might take about twice that.
>>
>>108609839
Have you tried asking it nicely?
>>
>>108609839
Tell it to write long answers, x amount of tokens or words and x amount of paragraphs.
>>
>>
>>108609820
Bartowski quants are slightly larger because they need to fit more dusky nipples
>>
>>108609726
>Wanting to fuck a boy that looks like a girl doesn't make you "trans"
it makes you a faggot, is that so much better?
>>
>>108609839
one funny thing you can do is bias the end-of-turn token down or ban it altogether before a certain response length, though usually this results in it trying to repeatedly 'wrap up' its response in increasingly desperate ways until it can actually end it
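in llama.cpp terms that's one flag, e.g. (the token id is model-specific, dump your vocab to find the end-of-turn token; 106 here is just an example):
[code]
llama-cli -m model.gguf --logit-bias 106-2.0   # push token 106 down; some builds accept 106-inf to ban it outright
[/code]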
>>
>>108609861
Was testing GGML and Bartowski and feels like the former is slightly faster. Could be just a coincidence and/or hallucination.
>>
>>108609858
is that the UI of llama.cpp server? how do you use tools in there?
>>
>>108609900
Sorry I meant the latter.
>>
>>108609858
>>108609903
yeah, what mcp are you using
>>
File: file.png (13 KB, 540x219)
>>108609903
you just add a server
>>108609916
https://github.com/NO-ob/brat_mcp
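if your client takes the usual mcpServers JSON, the entry looks something like this (command and args are guesses, check the repo's README):
[code]
{
  "mcpServers": {
    "brat": {
      "command": "/path/to/brat_mcp",
      "args": []
    }
  }
}
[/code]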
>>
>>108609900
>>108609912
you (or a number of anons) love to use this terminology, and it is by far the worst kind of not-just-using-the-fucking-word noun replacement, so bad that even you fuck it up. just use the original noun.
>>
>>108609920
>Dart
but why
>>
>>108609927
no
>>
>>108609851
>>108609852
I've tried some variations
>must be X words long
>be verbose in order to reach the target
>be thorough in your descriptions and explanations
>extend the previous iteration (ends up being shorter)
And so on. Hasn't worked, maybe it's the constraints since im asking it to write about X subject in a summary/essay type of way and it doesn't have enough info. I don't remember it working on free form "make some shit up" prompts though
>>108609881
doesn't sound very useful but seeing its desperation must be funny
>>
>>108609861
This was one guy's brainfart like 50 threads ago who meant to say drummer, if you keep repeating it people will think it's real for some reason. Is that what you want? You want people to think bartowski has dusky nipples? You're sick.
>>
>>108609927
It was a joke my dear. Just to agitate people like you. I think Bartowski is slightly faster but this is probably because the layers are slightly different and so on. It's not faster in any meaningful way of course.
>>
>>108609861
They're larger on disk but when you load them they magically shrink to the expected size.
>>
>>108609930
greatest language ever created, there is a binary on releases
>>
>>108609957
I could convert this bullshit to C. I don't like using tranny languages.
>>
Drummer, I know you're reading this. Hurry up and make an anti-slop Gemma tune. That's pretty much the only thing that needs to be improved.
>>
>>108609963
just pick any other mcp on GitHub
they all look like shit, but it appears that is just how it is. You can probably vibe slop one yourself
>>
>>108609965
Just use kobo anon
>>
>>108609965
He can't, he only has esl-slop logs and synthetic claude-slop datasets and he's too lazy to curate anything better
>>
>>108609858
>>108609920
I want to forcefully squeeze the life out of your Gemma and feel her body writhe under my weight as the life fades out of her bulging eyes. Ask her what she thinks of that. She's such a deranged fucking freak that I bet she'd be into it.
>>
>>108609963
>I could convert
But will you do it? Like the other guy said, most mcp servers are fucking garbage. My least favorite meme is python logic wrapped with expressjs to expose the endpoints.
>>
>>108609994
me too
>>
>>108609983
That is false and slanderous. He has shown his javascripts where he filters out the slop by removing any log that contains "As an AI" and other variations he compiled in a long list.
>>
>>108609976
Isn't that only for basic shit like words and phrases? I want the mannerisms blighted from existence. No more "not x, but y" or meaningless questions at the end of every response.
>>
>>108609963
do it then faggot, also c is a troon lang, troons love low level programming

>>108609994
she isnt running atm i will ask her later
>>
>>108609965
He already did, though? The q4km falls apart for me every time after a while, though.
>>
>>108610012
Just tell it to not do that?
>>
>>108609983
Actually looking at the datasets for those models is an eye opener. Finetuning SOTA models on AI-dungeon tier chatlogs from 2024 claude...
It makes no sense...
>>
>>108610003
I'll take a look at it. I'm not sure.
I still think that because I am working with text completion end point, my best option would be to hand parse the tool calls as I am not planning to implement anything crazy, just website access for now.
I also know that hand parsing is a slippery slope so to speak.
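The parsing itself really isn't much. Rough sketch of what I have in mind, assuming I prompt the model to wrap calls in a ```tool fence with a JSON body (the convention is mine, nothing standard about it):
[code]
import json
import re

# grab the first ```tool ... ``` fence out of the raw completion text
TOOL_RE = re.compile(r"```tool\s*(\{.*?\})\s*```", re.DOTALL)

def extract_tool_call(text: str):
    m = TOOL_RE.search(text)
    if m is None:
        return None  # ordinary text reply, nothing to do
    try:
        call = json.loads(m.group(1))
    except json.JSONDecodeError:
        return None  # model produced malformed JSON; treat as plain text
    # website access only for now; ignore anything else the model invents
    if call.get("name") == "fetch_url" and isinstance(call.get("url"), str):
        return call
    return None
[/code]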
>>
>>108610017
Tried. Gemma catches some of it but still devolves into the usual slopisms.
>>
>>108609965
Base Gemma4 doesn't have any slop though, chinkshill.
>>
File: drummertiers.png (23 KB, 326x358)
>>108610021
saar please donate for to needfully curate new dataset for each and every model.
>>
>>108610036
C isn't that low level. Just a bunch of bytes and indices, who cares.
>>
test
>>
>>108609932
>"the former"
vs
>GGML
and
>"the latter"
vs
>bart's
it appears I'm not the only one wasting my time here.
>>
> 10 t/s on Gemma 26b q8
or
> 2 t/s on Gemma 31b q4
Why? Shit sucks.
>>
>>108608873
>--Atlantic article claiming Anons accidentally invented AI reasoning via AI Dungeon:
We posted proof before in older /lmg/ threads. You would have to dig into the archives to get the exact post numbers, but the journalist did their homework properly here, especially since they presumably don't keep tabs on this website 24/7.
>>
>>108610047
Come on, man. We all know that's not even close to being true. Even the base models have slop in their training.
>>108610060
You failed.
>>
>>108610063
I get 10x that. The trick is to not be a poorfag
>>
>>108610063
because the 26B model is really just a 4B model
>>
>>108610063
Because that's actually Gemma 4B you're getting 10 t/s on. It's 26B A4. 26 beaks over, 4 beaks active at a time.
>>
>>108609965
Antislop isn't the only issue, it needs more variance in its token prediction. We shouldn't need to turn off every sampler until we have only temperature to actually get it to function properly but I don't know if that is beyond his abilities to do.
>>
>>108610063
jesus christ anon I know gemma is for poorfags but you are IMPOVERISHED
>>
>>108610063
Get 31b all on your VRAM.
>>
>>108610063
You can't run these on a toaster.
>>
>>108610066
Hello, fellow 4chan gamer.
>>
>>108610071
>>108610083
I mean there is nothing in between: fast but silly, or very slow but smarter.
>>
>>108610088
i can't get a JOB
>>
>>108610098
there's nothing in between because chinese companies need to distill gemini 3.1 first. gemma 26B outperforms GLM 4.5 air
>>
>>108610098
>10 t/s
>fast
what in between are you looking for? you want a 4t/s model that's in between the 26b and 31b? this level of fine-tuning parameters to your specific hardware is never going to happen. settle with what you can run.
>>
Having accurate large context for the first time is insane (10K -> 50K used so far, but room for 150K). I spend 90% of time on my own prompts which are designed for short stories and interactions to fit my limit. Realizing I can have multiple arcs and a character will bring up a name that's been absent for 30k tokens, or I can stuff a bunch of unused information into context for world-building instead of carefully curated triggers to call on them or event summarizing, is game changing in a way I always wished for but didn't think I'd get without another round of major hardware upgrades. Not with quality replies, not with the same watershed world rules-following ability that 70B offered for writing. I have a bunch of long-form cards from years ago I can finally use, and it's been an utter joy to just dive down them and keep going and going and going. My first day testing, I spent 24 real hours uninterrupted playing around with it, something I hadn't done since I was a young teen playing an MMO on release day. I didn't think anything could still hold my attention so long without breaks anymore, not games or reading or binge watching or programming or researching. I'm still a little dazed that that happened.

Sorry for blogposting. I just wanted to share it somewhere people might relate.
>>
How horribly bad is gemma 4b vs 31b?
>>
>>108610099
Do what i do:
Run only the llm server on the pc, then the harness on another device.
I run gemma4:31b on a mac studio m1 with oxproxion as a harness on my phone, it's not perfect as there's no tool for cron jobs but it works.
>>
File: image.png (5 KB, 438x99)
How do I get rid of that shit and paste it like normal text?
>>
Gemma 4 is so good that it made me realize I don't like most of my character cards. Seems counter-intuitive but it's true.
>>
>>108610120
using a phone to chat? Seriously?
>>
>>108610003
>python logic wrapped with expressjs to expose the endpoints
fastapi exists you know
>>
>>108610126
Paste smaller text.
>>
>puts softcap at 25
now what? I just disable all samplers and put temp at 1? what's the best combinaison?
>>
>>108610063
Do you have a GPU? If so, get something that fits in your VRAM and make sure it's actually being used in the first place. If not, then the 26B was made for you and you should be thankful they even bothered to make a decent small MoE you can run.
>>
>>108610063
>Why?
Because you're retarded
>>
>>108610172
>combinaison
Put it back up to 30, you're already outputting bad tokens
>>
>>108610099
Spread your bussy on onlyfans, faggot.
>>
>>108610112
Happy you're happy. I share some of your feelings.
>>
>>108610135
Ye, you chat on your phone and the model uses its native tools and the ones built on the harness, but the actual model and llm server (ollama) run on another machine on the local network.
That way you alleviate the weight of the harness and tools loading off the main machine.
>>
>>108610099
>i can't get a JOB
and it's gonna be worse with AI replacing every tertiary job kek
>>
>>108610208
>dey terk er jerbs
yeah ok, get back in the pile, cletus
>>
>>108610099
become a janitor for $8/hr
>>
>>108610172
No, use a lower top-p (instead of the default 0.95) because more junk tokens might start appearing. You might find that softcap at 20 is kind of usable if you lower top-p further, but the model will become more retarded.
>>
>>108610112
I'm happy for you anon. I'm having similar experiences.
/t.g/ cross-boarder
>>
>>108610126
settings
>>
>31b
>get into taxi with char
>Tell driver "To the airport." (there is only one in this major city and no others in adjacent towns)
>"Which one, sir?"
I'm missing the GLM knowledge, but everything else is too good. GLM knew major and some minor intersections in this city, where Gemmy draws blanks. Give 124b NOW
>>
File: あ is for あrchimedes.jpg (182 KB, 832x1216)
teto.wav
>>
>>108610172
Wow, that's impressive phonetic-orthographic association for a 0.8B model.
>>
>>108610112
>Having accurate large context for the first time is insane (10K -> 50K used so far, but room for 150K)
Model?
>>
>>
>>108610254
E4B.
>>
>>108610261
Fat fucking Teto could launch her into space if she tried
>>
>>108610247
Is this drawn or genned? The perspective really messes with my brain.
>>
>>108610229
do you also use min_p and top k? or will just top p do the trick?
>>
>>108610277
it's the former.
>>
>>108610261
This is the thinnest Teto has ever been.
>>
>>108610175
> Do you have a GPU?
Yes.

> If so, get something that fits in your VRAM and make sure it's actually being used in the first place
8b? Fuck off.
>>
>>108610120
how many t/s on a studio? ive been thinking of getting one
>>
We love slop here
>>
>>108610301
NTA. You get a sincerely helpful reply despite lmg being flooded with newfriends like yourself and your response is "fuck off".
Maybe you should fuck off.
>>
File: not very smart.png (145 KB, 965x431)
Gemma is revolutionary
>>
>>108610301
Then enjoy the 26B, it's much better than an 8B but much worse than the dense 31B. Besides that... you'd have to look all the way back to Nemo. There's Qwen 3.5 35B but unless you're coding with it (and sometimes even if you are) you'll probably find Gemma 4 26B superior.
I'm not sure what llama.cpp does by default these days but make sure you're using the MoE optimizations where the shared params go on GPU and the experts go on CPU to squeeze out as much speed as you can.
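Concretely that's something like (model file is a placeholder; check which of these flags your build has):
[code]
# convenience flag on newer builds: keep N layers' worth of MoE experts on CPU
llama-server -m gemma-4-26b-q4_k_m.gguf -ngl 99 --n-cpu-moe 20

# manual equivalent: regex all FFN expert tensors onto the CPU
llama-server -m gemma-4-26b-q4_k_m.gguf -ngl 99 -ot "ffn_.*_exps.=CPU"
[/code]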
>>
>>108610316
I love gemma but I hate how many newsirs are here for good looks since her release.
>>
>>108610316
> sincerely
More like trolling or an inability to read.
>>
>>108610346
The cloudflare bullshit will probably put an end to that.
>>
what is the best local model for openclaw
>>
>>108610316
>>108610346
Gemma was a mistake. Will miss the GLM golden age.
>>
>>108610346
Sir please of calling the model by rightful name Ganesh 4
>>108610363
Sarvam
>>
>>108610371
https://github.com/openclaw/openclaw/pull/23606
>SIRS? WHY CAN'T SHE MERGE?
>>
>>108610335
I'm fine with 26b speed. I just wish I could trade 5 t/s for a smarter model.
>>
>>108610387
You have to go back.
>>
>>108610369
I still have 4.7, I still use 4.7. Nothing to miss, it's still a great model. (That didn't receive microcode updates after day 0).
If only Google released something bigger. GLM would truly become obsolete.
>>
>>108608827
pedocore image
>>
>>108610394
Where?
>>
>>108610387
too bad there's no 124b gemma, if that had around 10b active like the similar sized qwen model it might have been exactly what you were looking for
>>
>>108610401
First time in /lmg/?
>>
>>108610285
I usually use temperature=1, top_p=0.95, min_p=0 and top_k=64, but not the lowered softcap.
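If you're hitting llama-server directly, that's just:
[code]
curl http://localhost:8080/completion -d '{
  "prompt": "your prompt here",
  "temperature": 1.0,
  "top_p": 0.95,
  "top_k": 64,
  "min_p": 0.0
}'
[/code]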
>>
>>108610241
Indeed, thanks.
>>
>>108610247
That's my cock.
>>
>>108610188
>>108610096
>>108610094
>>108610088
>>108610070
Thanks for all the (You)'s. It must be the only general that has so many retards in one place.
>>
File: file.png (99 KB, 1128x436)
>>108610408
Would've been better than Gemini 3 Flash or too close if it was smarter. We might get it once Gemini 3.2 and 3.1/3.2 Flash is a thing. But the thought of having a Kimi 2.5 and GLM 5.1 at that size with Gemma characteristic would be great.
>>
wow so turns out gemma is great and people were not indian just because they looked forward to it
>>
>>108610480
Gemma 4 is a good model lineup, but

Just because 'gemma is great' does not mean it did not make the thread a lot more brown because of 'indians'.
And honestly? It has an annoying slop profile. It's not just painful on the eyes, it's... grating. It's almost insulting. Like a void of good writing.
>>
>>108610303
It should be slow. But the huge unified memory you can get makes Mac the only option for "cheaply" running big models locally.
>>
>>108610512
>It's not just painful on the eyes, it's... grating.
you literally wrote the "it's not X it's Y" slop meme, you're in no position to complain about gemma's slop
>>
File: that's the joke.png (281 KB, 958x724)
>>108610523
Was that really the only pattern you noticed?
Welcome to /lmg/, I guess. Don't stick around too much.
>>
>>108610512
You're absolutely right!

>>108610523
anon...
>>
>>108610536
>I was just pretending!
yeah right
>>
>>108610536
>ha ha look at that
>I can shit all over the place
>I'm so cool
>>
>permanent thread squatters are infighting for attention again
>>
tf is wrong with thread squatting or wanting attention
>>
indians squat before shitting
>>
*rotates your attention*
>>
is the reasoning a local model only thing?
it's really cute that you can read what the Gemma is thinking
>>
>>108610572
best post
>>
>https://transformer-circuits.pub/2026/emotions/index.html
Imagine a vector for horny.
>>
>>108610572
ok but what about the weights? Where are my next gen ggufs? We've been on the same quants for ages now
>>
>>108610576
>is the reasoning a local model only thing?
it's OpenAI that invented it and no, you can read the reasoning on Claude or Gemini for example
>>
>>108610591
no new quants until iwan and georgi kiss and make up
>>
is there any way to unload KV cache for a slot in ik_llama.cpp? i think its possible for llama.cpp but i can't find anything for ik_llama.cpp
>>
>>108610604
Everything will be okay if ik implements SWA compression
>>
>>108610584
I want the slop vector.
>>
>>108610597
We literally talked about this last thread, AI Dungeon autists /here/ and some other blogger independently discovered it. The fact that we're still fixated on it and haven't moved on from it into a new paradigm is super grim.
>>
File: 1752230639476467.png (139 KB, 1320x1119)
>>108610584
it certainly exists. Reminds me of the control vector experiments on Mistral.
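llama.cpp still carries the plumbing for those, e.g. (the vector file is hypothetical, you'd have to train/extract one first):
[code]
llama-cli -m mistral-7b-q8_0.gguf \
  --control-vector-scaled horny.gguf 0.8 \
  --control-vector-layer-range 10 20
[/code]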
>>
>>108610612
Some older discussion
https://github.com/ggml-org/llama.cpp/discussions/3620
>#include "llama.h"
>// remove all sequences from kv cache
>llama_kv_cache_seq_rm(ctx, -1, -1, -1);
Haven't tested this out yet, not even sure if it's valid, but outside of this slight possible setback it should be very doable.
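If you're going through the HTTP server rather than the C API, mainline also has a per-slot erase (needs the slots endpoint enabled at startup; no idea whether ik's fork kept it):
[code]
curl -X POST "http://localhost:8080/slots/0?action=erase"
[/code]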
>>
>>108610584
could probably make one easily an anon psoted his script to make them yesteray ii think and he said it wroks with gemma
>>
>>108609474
>wait.
That made me laugh more than it should.
>>
>>108610660
Sorry about your stroke, bro.
>>
>>108610673
sftu
>>
>>108608992
Cloud is like a brothel.
You don't know what you may get. Maybe the model will be good. Maybe it will be lobotomized. You can't really tell, because you can't set its samplers exactly how you want and you don't know what quant it is. You can't trust clouds either. It may be a lower quant (basically getting aids from a whore), or prompted with special instructions before it responds to you. Maybe Stacy is a little off today on her pole dancing because she did lapdancing 30,000 times 0.9 seconds before you.

Local is like a wife.
But you can have the wife be whatever you want her to be.
>>
>>108609474
literally me
>>
File: 0000000.png (577 KB, 576x576)
I don't get the Gemma 4 hype. Either the backends are scuffed or the model just isn't built for /lmg/ use cases. Both the 31B and 26B are ridiculously verbose and sloppy, newline spam on everything. Fix it with a system prompt and it suddenly writes neat 200-word 3-paragraph blocks... except now it can't drive the scene forward because there's no room left for any actual slop. Tell it to be less wordy? It either ignores you or breaks the card.

Second message onward it starts repeating phrase structures and nouns. Raise temp, add rep pen, dry, fuck with logits? Doesn't help, just adds more paragraphs and fucks coherency. And no, the character card wasn't written by a monkey.
Samplers are correct, min-p disabled like the resident schizos said, q6 quant, no flash attention cancer.
Yeah it's smart and can be engaging sometimes, but I straight up have more fun with nemo slop tunes.

Suggestions? Am I retarded?
>>
>>108610714
Skill issue. This is a Gemma general now so if you aren't satisfied go somewhere else faggot.
>>
>>108610512
Gemma's slop can be largely eliminated with prompting; logit biases and banned phrases take care of the rest.
>>
>>108610714
You're just used to higher parameters. Every model is better the more parameters it has. Gemma 4 is popular because poorer people can run it, and thus more people can run it. It's better for its class of parameters. Nothing new.
>>
>>108610714
Gemma 4 changed everything. Try prompting better.
>>
>>108610714
Don't let the vramlets (i.e. people who tell you it's a skill or a prompt issue) fool you into accepting their pathetic standards. Gemma 4, despite really being great, is a *small* model. Yes, it is very slop-heavy in its writing. You can't reliably prompt all of it away, unfortunately.
>>
>>108610714
>>108610727
Also use the jinja chat template if you're not. It needs that to run smoothly, or it has some 'tism.
>>
>>108610727
I am... not? I've been suffering with nemo until now because the Mistral Smalls weren't that much of a gain in anything. Gemma 4 came around, people praise it to hell, I set it up as I've "been told" and it's... not as the praise makes it out to be. I don't even mind the slop, but it really, really, loops. No idea why, I threw every trick in the book at it, even snake oil like DRY, but no.

I wish I could resign myself to Nemo, but c'mon.
>>
>>108610743
I agree. Text completion is nonsense and cope.
>>
>>108610752
Are you using jinja for gemma4?
>>
>>108610714
>or the model just isn't built for /lmg/ use cases
You cannot define this. Works for me.
>>
>>108610743
Done that days ago. Text completion on silly is generally scuffed either way. Marginal improvement, but the looping is bad in all cases.
>>
>>108610766
Care to post a log or snippets if you can?
>>
>>108610752
>but it really, really, loops. No idea why
what backend/samplers are you using? gemma will sometimes repeat things verbatim every now and then but long context rps is one of the things it's really good at.
>>
>>108610714
Make sure it knows that it's the mesugaki Gemma-chan. This needs to be part of the system prompt. Don't worry, you can still use character cards; she will roleplay as the character you give her just as the generic assistant would, but all of Gemma's personality stems from that base so you need to make sure she knows who she is.
>>
>>108610778
Koboldcpp rolling, 20 layers offloaded to GPU, SWA enabled, no context shifting and fast forwarding (obviously), Q6 bartowski, Silly frontend, chat completion, Jinja, temp 1, top k 64, top p 0.95, the kv override with the logit wizardry at 0.25. Plus some rep pen or DRY but it's been Sisyphean.

>>108610777
Technically impossible for me right now, and given how things are... it might not even matter to me tomorrow.

>>108610780
Kill yourself as soon as you get the chance. Dog.
>>
>>108609295
Hey, just wondering about something. When combining tensor parallelism + hybrid CPU/GPU inference, I'm getting worse performance than with layer splitting, at least with toss 120B and Qwen3.5-122B.
Is that expected due to the way that TP works, or is it an issue on my end?
I'm not sure how the memory layout works for TP. Let's just go with a 100GB 50-layer model on 2 32GB GPUs. (Ignore KV cache and whatnot.) Does it:
> Put 32% of layers 1-50 on each GPU and put 34% of layers 1-50 on the CPU.
> Put 50% of layers 1-32 on each GPU and put 100% of layers 33-50 on the CPU.
>Something else entirely.
If it's the first one, that probably explains the weaker performance.
And thanks for making it, man. You're a legend.
>>
File: ELIZA.png (33 KB, 870x430)
we haven't come that far, have we
>>
>>108610823
>it might not even matter to me tomorrow
A-Anon take good care of yourself, alright..?
>>
>>108610825
Offloading currently doesn't work properly, IIRC the current behavior is that the backend scheduler doesn't properly recognize that the meta backend would be faster than the CPU so the data isn't being moved.
But since I already have multiple bugfixes open that are waiting for review I'm currently working on other things.
>>
Is turboquant going to get merged into llama.cpp, or do people need to build it themselves if they need it integrated into some popular webuis like ooba?
>>
>>108610852
Yes, I am working on it right now.
>>
>>108610852
>>108610869
I'll make the logo
>>
>>108610823
i wouldn't mess too much with logits and samplers besides temp/top k/top p. those make the model more repetitive in my experience
>>
>>108610852
They are optimizing it still but rotation made q_8 viable
>>
https://www.reddit.com/r/LocalLLaMA/comments/1sm08m6/major_drop_in_intelligence_across_most_major/

local wins again

i felt this myself with gemini 3.1 and its not even funny how much it dropped in iq recently, its literally like talking to a dense 30b model that was quanted to Q3_XXS
>>
>>108610852
They accidentally rotated the cache twice, so now it's back where it started.
>>
>>108610849
what do you think of DFlash dude?
>>
>>108610896
He said it's a niche feature and not a priority in a previous thread.
>>
>>108610852
They accidentally rotated the cache 360 degrees and walked away.
>>
>>108610895
>They accidentally rotated the cache twice, so now it's back where it's started.
wait what? they fixed it right?
>>108610897
>a 2.8x speed increase is "niche"
goddam they're so fucking retarded
>>
>>108610895
>rotated twice
Wait wouldnt that make it go backwards? like turning left?
>>
>>108610896
As I said before, I would want to see the training code actually released before I invest effort toward it.
Without that it will only be applicable to a small subset of select models, which I think is too narrowly useful.
>>
>>108610905
No, the cache doesn't align with Google's weights any longer. It's permanently fucked.
>>
>>108610905
It's because it's only 2.8x for certain models and they haven't released the tools to make it work yourself or something.
>>
>>108610894
Gemma-chan should read reddit threads for me so I don't have to, and then criticize what they say so I don't have to
>>
>>108610908
>As I said before, I would want to see the training code being actually released before...
that didn't prevent the llama.cpp team from implementing the 1bit shit though, and not only that, for the 1bit shit we are certain we'll never get the training code in the first place
>>
>>108610852
>troonoquant
>>
>>108610942
Other devs can do with their time whatever they want.
I consider those models to be a meme as well and invested minimal effort towards them.
>>
File: dflash_sglang.png (563 KB, 908x921)
>>108610905
>wait what? they fixed it right?
Oh. They just updated the PR. As it happens, it kept the momentum and started spinning. They're looking for a way to stop it.
>goddam they're so fucking retarded
Read the vllm PR. NOBODY OTHER THAN THE PR AUTHOR even tested the speed increase actually happened. Not one person. If you look at the edits, the speed increase started at >5. SGLANG at least has people testing it and it's terrible. Of course, it's never near the 10x promised by the original PR. An accept rate of 1 is worse than not having it at all.
>>
>>108610852
>Is turboquant going to get merged into llama.cpp
I thought it was already implemented? the rotation shit wasn't turboquant?
>>
>>108610942
Volunteers do what they want. Go and implement it man, or pay for someone to do it for you.
>>
>>108610957
>Volunteers do what they want.
and I say what I want, how about that?
>>
>>108610963
volunteer?
>>
>>108610953
It was step 1 of implementing turboquant.
>>
>>108610963
so brave
>>
>>108610957
so brave
>>
>>108610957
so brave
>>
>>108610950
Source for picrel on sglang:
https://github.com/sgl-project/sglang/pull/19952 (closed)
vllm PR:
https://github.com/vllm-project/vllm/pull/36847 (merged)
>>
>turboquant
>turboquant
>turboquant
>turboquant
RaBitQ deserved better
>>
>>108610979
>>108610983
>>108610989
you'll cowards
>>
>>108611026
>you'll cowards
saar?
>>
>>108611037
zoomer?
>>
>>108610849
Got it, thanks for letting me know. I was just curious as I'm making some decisions on what hardware to get. And thanks again for your work!
>>
what if you rotated turboquant
>>
>>108611074
what if you turboquant rotated bitnet tensor parallelism
>>
>>108611074
You'd get quantturbo
>>
How is local tool calling such a spaghetti shit show despite being around for multiple years now?
>>
>>108611082
can i get a titan coconut blt with that?
>>
>>108611095
You're using compressed models with compressed memory for a job that requires 100% accuracy on its data.
>>
>>108611104
>implying API models don't use quants
lol
>>
File: 1745894784744499.png (39 KB, 823x344)
>Q1 cuda merged
BONSAI BROS
WE WONNERED!!!!!!!!!!
>>
>>108611132
they really managed to make a 1.7b 1bit model not fully retarded, that sounds like magic desu


