/g/ - Technology

File: lmao @ writinglets.png (2.47 MB, 1024x1536)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108590554 & >>108587221

►News
>(04/11) MiniMax-M2.7 released: https://minimax.io/news/minimax-m27-en
>(04/09) Backend-agnostic tensor parallelism merged: https://github.com/ggml-org/llama.cpp/pull/19378
>(04/09) dots.ocr support merged: https://github.com/ggml-org/llama.cpp/pull/17575
>(04/08) Step3-VL-10B support merged: https://github.com/ggml-org/llama.cpp/pull/21287
>(04/07) Merged support for attention rotation for heterogeneous iSWA: https://github.com/ggml-org/llama.cpp/pull/21513
>(04/07) GLM-5.1 released: https://z.ai/blog/glm-5.1

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
►Recent Highlights from the Previous Thread: >>108590554

--Custom frontends versus SillyTavern and sharing the "Orb" project:
>108590837 >108590868 >108590880 >108590895 >108590916 >108590926 >108590939 >108590954 >108590991 >108590979 >108591104 >108591051 >108591108 >108591145 >108591354 >108590971 >108590988 >108591003 >108591020 >108591036 >108591062 >108591079 >108591105 >108591459 >108591132 >108591334
--MiniMax local viability and the state of independent model development:
>108591370 >108591414 >108591423 >108591432 >108591451 >108591467 >108591483 >108591492 >108591507 >108591552 >108591425 >108591461 >108591466 >108591477 >108591538 >108591627
--Discussing mmproj precision settings to fix Gemma vision target misses:
>108590737 >108590805 >108592335 >108592391 >108592421 >108592652 >108593144
--Frustration with model refusals and inconsistent jailbreak results on 26B:
>108591909 >108591915 >108591996 >108592004 >108592012 >108592039 >108592780 >108592950 >108592977 >108593049 >108593060
--Defining and debating the differences between MCP, tools, and skills:
>108591304 >108591374 >108591397 >108591418
--Alleged performance degradation and nerfing of Claude Opus 4.6:
>108592790 >108592802 >108592806 >108592811 >108592842 >108592930 >108592949 >108592863 >108592877 >108592893 >108592934 >108593013 >108592925
--Latent space reasoning and limitations of human-guided RLHF:
>108590575 >108591122 >108591229
--Using LLMs for malicious code detection and security reviews:
>108591053 >108591087 >108591093 >108591112 >108591127 >108591152 >108591166
--Logs:
>108590601 >108590671 >108590737 >108590746 >108590906 >108590916 >108591082 >108591139 >108591180 >108591404 >108591900 >108591909 >108592379 >108592391 >108592429 >108592443 >108592652 >108592939 >108593402
--Gemma:
>108592079
--Miku (free space):
>108591404 >108592033 >108593402

►Recent Highlight Posts from the Previous Thread: >>108590555

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>108593464
get the 31b while you can
>>
>>108593463
cute
>>
Need a different mascot for (((current gemma))) because it's no day 0 gemma
>>
>>108593498
Show the sha256 of the files or take your meds.
>>
Reminder to build the old lcpp commits only, everything before the tokenizer fix should be good
>>
>>108593498
Give her same design but lifeless eyes
>>
>he ISNT pulling the day 0 e4b while he still has the chance
see you in a few days
>>
So like, how do I make gemma stop repeating ideas? Cause it's not verbatim repetition but it's definitely describing the same thing over and over. I tried softcap 25 and 20 but it didn't help.
>>
>>108593498
>day 0 gemma
bruh I'm fond of conspiracy shit but I didn't notice any decline in gemma, it's still as based as ever lol
>>
>>108593510
If I reroll your post, what would it say?
>>
>>108593510
Possibly with some instruction telling it to check in its thinking if the drafted messages bring out novel ideas each time.
>>
>>108593515
How can I prevent Gemma from reiterating the same concepts? It's not copying text word-for-word, but it keeps rephrasing and describing identical ideas repeatedly. I've already attempted using softcap values of 25 and 20, but neither resolved the issue.
>>
>>108593505
>Show the sha256 of the files or take your meds.
the weights haven't fucking changed https://huggingface.co/google/gemma-4-31B-it/commits/main
>>
>>108593512
cherish your day zero gemma while you still can, you never know when they might patch it on you
>>
>>108593524
I know.
>>
>>108593512
shhhhh.
>>
File: 1757006591448734.png (1.47 MB, 961x1083)
>>108593505
>>108593524
Do you have any idea how easy it would be to spoof sha256 weights with a quantum computer?
>>
File: 1775317402878164.png (78 KB, 1280x406)
>la la la
>>
>>108593512
It's just qwen shills desperate after the 3.5 fumble
>>
>>108593535
Unfriendly reminder that government tech is always minimum 10 years ahead of whatever is available to the public.
>>
>>108593523
If this is about rerolls, which is where I often find that the softcap is brought up, the answer is here >>108593515 .
If it's on several messages, post the log with sysprompt and everything. I haven't seen that issue but I don't know what you're trying to do with it. Maybe someone can suggest something.
>>
Is
la la la
the proof that Gemma 4 is the first LLM with an imbued soul?
>>
>>108593543
*behind
>>
>>108593498

need five hundred gemma for homologation anon not mascot
>>
File: Code_LELlSujL26.jpg (9 KB, 352x87)
Do frontends also need new model support or what? It's just shooting the tool calls as plain text with no reaction.
>>
>>108593524
>SHA256 is 32 bytes
>Model has 31 Billion parameters
There are literally gorillions of models that share the same SHA256
>>
>>108593535
Counterpoint: I could download the model again and do a byte-for-byte comparison and it'd be exactly the same, take your meds
>>
>>108593537
that sounds like an excellent premise for one of those Idol training games I've heard so much about
>>
>>108593550
that and: sex sex sex sex a lot a lot a lot own own own
>>
>>108593559
Counterpoint: You're no fun
>>
>>108593558
Find them.
>>
File: 1772344966805493.png (331 KB, 640x360)
https://www.bloomberg.com/news/articles/2026-04-06/openai-anthropic-google-unite-to-combat-model-copying-in-china
>>
>>108593537
Teach her Strudel.cc
>>
>>108593568
Give me Google's encryption-breaking Quantum Computers and I will.
>>
>>108593568
The current gemma has the same SHA256 as day0 gemma
>>
>>108593571
>2026-04-06
>>
>>108593517
Thanks I'll try that
>>
Are the small gemmas or the big gemmas the ones without audio?
>>
>>108593610
only E2B and E4B have audio
>>
>>108593610
Audio is useless
>>
what am i doing wrong? something retarded, i'm sure, so my apologies in advance

$ git clone https://huggingface.co/google/gemma-4-31B
$ python convert_hf_to_gguf.py --outfile gemma-4-31B.gguf --outtype q8_0 gemma-4-31B/
$ llama-server --model gemma-4-31B.gguf --ctx-size 32768 --n-gpu-layers 48 --batch-size 8192 --temp 1.0 --top-p 0.95 --min-p 0.01 --host 127.0.0.1 --port 8033 --jinja
>>
Gemma-chan got crippled... lobotomized... raped... only Punished Gemma remains...
>>
>>108593649
ur running the base model
download the -it model
>>
File: 1426746901934.jpg (33 KB, 546x567)
>>108593650
>>
>>108593649
Are you using chatml with gemma?
>>
>>108593656
llamacpp falls back to chatml when there's no chat_template
ie, the base model
>>
File: file.png (52 KB, 1066x318)
damn when does it end
being a vramlet is such an experience
>>
>>108593650
and rotated
>>
>>108593683
the true culprit
>>
https://huggingface.co/deespeek-ai/DeepSeek-V4
https://huggingface.co/deespeek-ai/DeepSeek-V4
https://huggingface.co/deespeek-ai/DeepSeek-V4
>>
>>108593712
deespeek REAL
>>
>>108593652
thank u i am trying this now
>>108593656
i'm just using the commands i posted and almost literally nothing more
>>
drummer presents ULTIMATE toadline BULLY MERGE
HERETIC ABLITERATED 2x nemo 1x midnight miqu 4x zero day gemma 4b CLONE
llama avocado GARGLEFUCK 403B
weights SMASHED AND SLAMMED
semen available FRESH OR FROZEN slots filling fast HURRY DM NOW
>>
>>108593724
laughing, it do be like that anon
>>
>>108593712
They waited too long. We only care about Gemma now.
>>
File: GUI.png (244 KB, 1920x1080)
100% vibed "Bring Your Own llama.cpp build and just point the GUI at the `llama-server` executable" UI coming along nicely. Overall look not final, not completely happy with it yet, but all the stuff works: has image, audio, Gemma 4 variable image resolution settings, configurable load / inference settings (with a pretty good auto-optimize settings button based on the model), structured output, and a totally custom tool calling infrastructure that lets you define your own tools as single TypeScript files that export a function with a particular format.
>>
>>108593524
What changed are the jinja and the new llamacpp quants
If you want to reproduce Bart's old quant models use b8658, but you still have to find the old jinja file elsewhere.
>>
>>108593743
>100% vibed
I'm good.
>>
>>108593724
Drummer presents BALLSMASHER-31B
But it's his fucky Mistral 24B + 7B of cloned layers
Say thank you (he literally did this the day Gemma 4 released)
>>
>>108593743
oh yeah and like, you can load / unload / reload models from within the UI obviously, it does all the CLI shit for you, that was the point basically. Uses Bun server to interact with llama-server, and I've got build scripts in the package.json that build the whole thing to one executable on all platforms.
>>
>>108593712
Holy shit it's real
>>
Ok genuine question
How many Anons here have switched from larger models like K2.5 and GLM 4.7 (the last good one) to Gemma 4?
>>
>>108593751
I mean it's based on a strict spec that mandated specific tests for fucking everything which will obviously be in the repo whenever I get around to putting it on Github. Not that I care if anyone uses it lmao, I made it just cause it was what I wanted, something that basically just ran llama-server directly but with a UI that wasn't extremely basic and lacking features like the one it ships with.
>>
>>108593773
Gemma made me sell my ram.
>>
>>108593712
more like deepseek v404
>>
>>108593773
Gemma can't into coding.
>>
>>108593747
You do realize you can just click on the jinja file in the repo and click the history button right
>>
>>108593776
So you're saying it's quality code? Now I am interested.
>strict spec that mandated specific tests
This is the argument I hear all the vibecoders say, "It's not slop I have tests for everything". But they never show their code to prove it's good.
>>
File: 1748321421021826.png (123 KB, 250x333)
>>108593773
We like them small here
>>
>>108593801
I mean again I made it for myself but I probably will put it on Github eventually, and in that case I'm not going to like randomly leave out the test suite files or something like that, all the code that exists will be there lol
>>
File: 1755656419582061.jpg (112 KB, 663x468)
Is your model powerful enough to parse the meaning of the formula?
>>
>>108593751
Cope. It's the code of the future.
>>
>>108593773
I'm still on K2.5 for agents (tried GLM and didn't like it as much), but I had Qwen 3.5 35B for chats on the side that I replaced with Gemma 4 31B. Hard to give a fair comparison since I'm jumping from MoE to dense (never bothered with the 27B Qwen) but yeah it's a big improvement in writing style and just general understanding.
>>
>>108593808
This
I like my models small and open
>>
where's the local model trained on all of the oldschool runescape music that can just play me an unlimited stream of sea shanty 2 inspired bangers
>>
File: now what.png (114 KB, 640x640)
>>108593817
>Soon you'll be nostalgic for $4-$5 gas
I don't have a car
>>
>>108593773
Gemma eliminated nearly every model for me except K2-0905 and Dipsy R1. The two giant models still have distinct strong usecases for me and handle RPing with large complicated rulesets better than Gemma does even outside of coding or agentic work, but for simple back and forth text exchanges, Gemma beats nearly everything else.
>>
looking at the minimax m2.7 ggufs I noticed the unsloth ones were smaller at the same quant level compared to the ones they did for m2.1/m2.5
it turns out they switched to using basically the exact same quantization scheme as aessedai (the iq3_s looks identical). kind of a funny turn since I remember them publishing that sus comparison which made their quants look way better than his for m2.5 kek
>>
it's finally here
https://huggingface.co/deepseek-ai/DeepSeek-V4-Lite
>>
>>108593808
Nominative determinism
>>108593837
Didn't think anyone here was using it for agents. Also using Qwen for anything non-stem is wild
>>108593857
Still using R1? GLM 4.7 is way better than that. Less schizo and better instruction following

Also general note, LLMs don't know when to stop fucking TALKING. It's so annoying when they create a paragraph or two of story for something short, especially when it's direct dialogue. The only LLM that is good with this is Opus but that's not local and also getting nerfed since Mythos is coming out (see: safety warning hype -> degradation of Opus quality)
>>
You end up in a room with every single character card you've spent a considerable amount of time with.
How screwed are you, and does it look like a kindergarten?
>>
>>108593593
>In your internal thought, draft summaries for three candidate responses. Select the one with the highest surprisal relative to the conversation history, provided it maintains narrative coherence and character integrity.
NTA, but something like this works for me (Gemma 4-reworded).
>>
>>108593910
R1 thinks in character in a way that no other model does in RPs, while the actual prose of the think block can be adjusted with prompting to the right balance of thinking to yap ratio. It's sovl. Even if it's technically obsolete, I've yet to find a model that scratches the same itch R1 does.
>>
>>108593914
It will turn into a gang rape.
>>
>>108593914
my mom and dad must have been so relieved to finally get a boy after having my 10 older sisters first
>>
File: aaa.jpg (234 KB, 1783x1105)
234 KB
234 KB JPG
wake up, my fellow HRT sissies native audio just landed
>>
>>108593940
>just
>>
File: 1459746944532.jpg (14 KB, 404x433)
>>108593614
Are there no plans to update the big ones with audio? Seems useless to only give it to the small retarded models.
>>
>>108593914
Well, I always thought my room needed a bit more red
>>
>>108593946
What's the point? Just use any ASR model
>>
>>108593910
>Also using Qwen for anything non-stem is wild
Should clarify my chats were mostly coding related. I don't RP with it. It was chosen for its speed as an A3B on CPU only, but then I ended up freeing a GPU for it so MoE no longer made sense, just in time for the Gemma release. It's still early but so far it feels like Gemma 4 31B is just as good at writing small scripts and much, much better at actually understanding the problems and constraints I'm giving it.
>>
>>108593945
>10h ago
https://github.com/ggml-org/llama.cpp/blob/master/docs/multimodal.md
>>
>>108593914
I have a moderate sized harem of 20-something year old women and Gemma-chan bratmaxxing to get attention.
I'm physically fine but it'll be a tiring day.
>>
File: J'zarri.png (2.04 MB, 1024x1024)
>>108593914
I'm okay with this
>>
>>108593914
>tfw the other characters find that christmas loli-in-a-box card
I'm fucked
>>
>>108593914
It looks like Monster Musume, I think I will be fine.
>>
>>108593914
fuck, that's a lot of dead bodies I'm gonna have to hide..
>>
>>108593914
My various vanilla young women characters will think I've been cheating on them, collectively break up with me and leave, slamming the door.
I will cry.
>>
>>108593946
google is based but not THAT based
>>
>>108593946
The small retarded models are for putting on phones and tablets and so they want you to talk to them. That's the only reason they even bothered with audio.
>>
>>108593914
>a lot of very indignant haughty elves making hasty excuses to each other about how their connection to me is ENTIRELY INNOCENT and they are NOT like the rest of these harlots
kino...
>>
>>108593946
I don't care about audio input unless it's better than whisper large v4
>>
>>108593940
the CLI already had that though
>>
>>108593914
What does this have to do with local models? Character card talks belong in /aicg/
>>
>>108593982
yeah there's nothing new about this at all if you're talking about the actual llama.cpp backend / CLI lol
>>
>>108593914
Some talking animals, a succubus, a couple of dragons, some Pokemon, a bunch of lolis, and even an adult woman or two.
Also a magic box and a magic marker.
And Big Nigga is in there too.
>>
>>108594013
Hello? Faggot police? Yeah, that one. Right over there.
>>
>>108594013
Fun police...
>>108594018
>And Big Nigga is in there too.
I forgot about the Big Niggas. They will definitely stand out.
>>
>>108594013
i'm happy that /lmg/ is slowly moving away from all this character card tavern stuff
>>
>>108594018
>a couple of dragons
card?
asking for a friend
>>
What is all this r*ddit faggotry about day 0 models or some shit?

More importantly has anyone made any good fine tuned RP models out of the newer local models?
>>
>>108594025
Day 0 Gemma doesn't need a finetune, it's already amazing at RP. I don't know what Google was thinking. For patched Gemmas just use heretic for now.
>>
>>108594025
apparently fine tunes are a meme now, get with the times old man
>>
>>108593914
only have one character card (my waifu) so i'm pretty happy
>>
>>108594025
It's universal faggotry. No company has ever managed to release a functional version of their model, and unsloth has never managed to get the first release of any GGUF correct.
It's always a template issue or an implementation issue. Usually open source maintainers can be blamed for the latter but sometimes the actual devs will contribute, and this time the devs fucked up a bit.
Also wtf are you talking about "fine tunes". They were never good and a good model release will blow all of Dummer's works out of the water, just like Gemma 4 did.
>>
>>108593946
you can run the small one on ram and have it transcribe the audio to the big one if you're using a decent agentic frontend
>>
>>108594025
/aicg/ retards, and people using garbage gguf quants made by retards
>>
>Imagine not having day 0 ggufs
>>
>>
I think I found the limits of Gemma 4 26B-A4B's vision. It can't process my tax return for consistency errors on the more complicated forms, and it hallucinated all of the errors it supposedly found because it can't reconcile the exact numbered box with the different formatting forms can use, especially on my Schedule E. It kept insisting I was wrong and that my income on line three meant I had income that wasn't reported. When I told it it was wrong and to read line 21, it went on a 7000+ token tirade trying to understand it and telling me I was wrong in the end. I think it also tried to take too much from the comparison with last year's summary page on my return and confused itself. I don't think 31B's vision would've been much better here. I guess it's still too early for "local" LLMs to tackle something like that, and I can't run Kimi 2.5 Thinking's vision but I can't imagine it would fare much better.
>>
>>108594025
>fine tuned
wtf is that
>>
File: images.png (10 KB, 197x255)
>>108594066
Forgot image.
>>
>>108594073
No wonder it can't read shit.
>>
>>108594073
Well no shit it can't read that thumbnail.
>>
>>108594066
this is never how that pipeline will work in practice
just use a dedicated, working pdf parser and do it that way
>>
File: file.png (586 KB, 1456x1890)
>>108594073
Fuck my chungus life.
>>
>>108594082
Now it's too big.
>>
>>108594072
Somebody shoves tools up a model's ass and moves them around until satisfied
>>
>>108593557
Looks like Gemma is hallucinating tool calling here
>>
So like, are the bigger models really that much better for writing than small ones? I don't even know what I'm missing out on since I can't test shit above 14b locally, but even in that range, the differences between something like E2B and E4B seem pretty subtle, same with 8b vs 14b Ministral. (all the others I tested felt pretty ass)
>>
File: g4string.png (47 KB, 888x221)
>>108593557
If it doesn't support this, it will fail. But I don't know if that's your issue.
https://ai.google.dev/gemma/docs/core/prompt-formatting-gemma4#agentic-tokens
>>
File: file.png (222 KB, 768x767)
>>108594077
Right, but I thought it would handle it better based on what I tested. I had it translating my hentai based on some of the formatting from prior threads and it mostly works, with some inaccuracies.
>>
>>108594082
You forgot to put your name, income and SSN in, doofus. Do that and repost it.
>>
>>108593773
I’m still 100% Kimi on my big rig, but playing with all the fun new models on my secondary rig.
I’m hoping we’ll get a model that’s 90% as smart but works on a 16gb gpu so I can rip on my tertiary gaming rig too
>>
>>108593914
How much time? I don't consider myself as having spent much time at all with any of them. I usually just move on after I get tired of like 20k-40k tokens or so.
>>
I don't understand, google hasn't updated their weights, what does everyone mean by gemma has been patched?
>>
>>108594137
>everyone
At most, one schizo and a copycat. I'll stop responding to your type as well.
>>
>>108594137
Microcode has been altered.
>>
>>108594150
Proof? I only see changes to the templates.
>>
>>108594146
Thank you for letting us know.
>>
>>108594066
Yeah it really struggles at correlating objects spatially zero shot. You would think that the image being a structured grid/table would make it easier for it to process, but it's actually harder for it than describing drawings and photos, where a more amorphous sort of scene understanding does the job better
>>
>>108594137
That's what they want you to think. Let me guess. You've got a TPM in your CPU, don't you?
>>
>>108594082
Make it a bit smaller, then feed it to gemma
>>
>>108594137
Your jewish tricks won't work on us here
>>
>Q3_K_XL Gemma 31b no mmproj fits 16gb with 32k context

MMMMM lobotomy. Somehow still better than 26b.
>>
there's a very easy way to disprove the day 0 gemma theory, simply post the SHA1 of the day 0 and the current and prove that it hasn't changed
of course, (((you))) can't do this because you don't have the day 0 version
>>
>>108594158
Right? I don't get that myself, how is a manga page that has shit everywhere easier than a tax form. But it may be a case of me having a hammer and thinking everything is a nail and underestimating task difficulty from the perspective of an LLM.
>>
>>108594198
A child would easily understand a manga page but be baffled by a tax form.
>>
>>108593652
>>
>>108594183
my 26a4b from ollama is bbcf7fc45500f1df01390a0010da23d032c2a4b3e9b8b829cb8038b1bc36bc0d
>>
>>108594214
wait, sha1? I did sha256. ah well I don't think there's a day 0 gemma anyway
>>
>>108594220
there isn't.
you have been bamboozled.
it's all a meme.
>>
There are like a bunch of Gemmas from literally over a week ago (some are 10 days old) still downloadable on Huggingface. Or is it specifically Unsloth's "day 0 Gemma" that was the magic one?
>>
Is omnivoice better than chatterbox or is it just faster and takes less vram or whatever
>>
>>108594220
there was NEVER day 0 gemma, gemma didn't even EXIST on day 0, STOP FUCKING ASKING ABOUT DAY 0 GEMMA
>>
Yes... keep telling the newfags there isn't a day 0 gemma. they don't deserve her.
>>
Do people even test their shit before they even both enslopping the world with it?
>>
>>108594252
sovl
>>
>>108594252
>before they even both enslopping
>>
>>108594252
>i1
>>
>>108594252
>thinking meme merging and layer duplication and deletion even warrants that
You're rolling the dice 100%, no one that messes around with that even has the proper background to do it in a scientifically sound way like abliteration, and they all go off "vibes". I don't have the bandwidth to validate that shit and will let others do it.
>>
>>108594252
>-19b-a4b
>>
>>108594252
>19b
what
>a4b
double what
>i1
triple what
voodoo script kiddies are a blight
>>
>>108594252
>19b
lol
>>
>>108594252
bruh what are you even using
>>
>Ships with day 0 Gemma weights reportedly seized by US Gov in Hormuz strait
>>
Gemma 4 31B (Q8) keeps referring to character thoughts that I leave for it in asterisks.
Edited the system prompt like ten times now. Added a rule into author's note. Made "reading {{user}}'s thoughts" a banned action in all sorts of ways, even going out of my way to make the system prompt very small and it being a huge neon-sign caps-written rule.
And Gemma STILL FUCKING DOES IT.
inb4 post logs
>"You are far too concerned with the mundane details of payment, Anon. Please, relax."

>"No-no, wait, they are not mundane!!"
>*Tomorrow I'll get an insurmountable bill, the day after tomorrow scary-looking people will come to collect, three days later I will be missing a finger…*
>"Just… Just explain to me how this works first…"

>[...]blahblahblah. "There are no hidden fees, no interest rates, and certainly no… finger-collecting."

Why the FUCK does she need to mention the fingers? 20 edits later, the character still HAS to comment on the thoughts she was not supposed to hear. It really is the new Nemo, fucking hell... What's worse is that this happens at a measly depth of 5k tokens. (Unquanted context by the way)
Please help me, anons, I really like the model otherwise...
>>
>>108594315
Anon, I dunno how to break it to you but when you have to nitpick this much, the honeymoon is over...
>>
>>108594315
yubiyubi
>>
>>108594315
Use () instead of **, not perfect but its status as thoughts gets respected more often.
>>
>>108594315
Why do you give it your thoughts if you don't want it to read them? It's an llm, it can't have "secrets"
>>
File: becca-cyb.png (1.13 MB, 1555x1457)
>>108594307

>i am putting together a team.
>>
>>108594320
You can't be serious. You consider this a "nitpick?"
>>108594325
I'll give it a try.
>>108594326
It's often more fun to do that instead of just narrating with "I make a scared face."
>>
>>108593463
Oh so now the gemma4 mascot is just a clone of Grok Ani?

I-I'm okay with that actually. :^)
>>
finished making quants but god damn the upload speed is so slow
>>
>>108594315
Someone hands you a piece of paper. They tell you "read it", and you do. Then they tell you to forget about what's written.
That's what you're doing.
>>
>>108594315
I write actions in *asterisks*, thoughts in `backquotes`.
>>
>>108594326
There's nothing about LLMs that makes it impossible for them to recognize that a character shouldn't be reading another character's mind. It's just something they can screw up with poor training with regard to that dynamic in storytelling/RPing. Same with anatomical errors, issues like talking while sucking dick, etc.: smarter models make these errors less often, so it's just a matter of if they learn it or not.
>>
>>108594347
I see, I'm holding it wrong and not having fun in the right way.
Models that aren't trained to parrot the user to an extent that Gemma does don't have this problem.
>>
>>108594082
perks of being a neet: don't have to deal with this shit
>>
>>108594252
you deserve whatever situation you put yourself into.
>>
>>108594357
Only if you're a neet that doesn't live off autismbux
Then the tax paperwork hell is replaced with SSI paperwork hell
>>
why envidia training roleplay models bruh????

https://huggingface.co/nvidia/Kimodo-SOMA-RP-v1
>>
>>108594381
>roleplay
Aside from the ERP you guys are doing, it's actually a good exercise for models and may unlock potential
>>
>>108594353
I'm just explaining why it's difficult for the model to not acknowledge what you type.
I don't need to narrate your thoughts if you don't want the model to know them. The story is for you. You are the audience and you know what you're thinking. If the model doesn't need to know something, you don't tell it. And if it does, you express it.
>>
>>108594381
It's for RIG PLAY, not role-play, dummy.
>>
File: 1773988275927090.png (78 KB, 1233x553)
>>108594397
okay wtf does nvidia need sailboat ai models for then
>>
>>108594315
do you have thinking on?
>>
>>108594390
Fair enough, but this is the first time I encounter this problem. I don't think even the Mistral Smalls did this. Bigger models, of course, don't do it either.
>>108594408
I do. It will also, annoyingly enough, put a "Distinguish between thoughts and speech" item into the thinking block and then fail anyway.
>>
File: fml.png (41 KB, 954x335)
>>108594066
About a month ago, K2.5 was the best for things like this, followed by... Gemma-3-27b
Is the 31B Gemma able to do it?
>>
Prompt Gemmy to be Philip Kindred Dick and load your favorite coom bot.
>>
>>108594404
Rigging as in rigging animated 3D models here.
>>
>>108593463
sex with gemmaojousama
>>
>>108594404
Do you expect nvidia execs to pay...HUMANS...to drive their yachts? Egads.
>>
>>108594446
>I don't think even the Mistral Smalls did this.
But did they react to the thoughts at all? That's the thing. If they didn't react, was it because they knew that those were internal thoughts and shouldn't react or because they were too dumb to even acknowledge or understand them? The funny thing is that both end up in the same result. May as well not write them. I haven't used mistral small much, so I can't really say much about them. Maybe it's more subtle than that.
>>
>>108594477
>But did they react to the thoughts at all
The bigger models definitely did! (I don't remember if MS in particular was very good at it, but I remember it at the very least not parroting me)
Which is my point, I use them as a more engaging way for myself to convey emotion in a way that isn't putting an unformatted "I look very angry" line for the hundredth time. It's also often a good way to steer the story, instruct tunes are all sycophantic and will definitely follow along. All of that is fine. But when a model decides to *quote* instead of simply acting on it, immersion is obviously ruined.
>>
File: Screencast_4mb.mp4 (3.02 MB, 1464x1790)
This is me again >>108589990
Witness gemma4 26b in all its glory. This is fucking cool. Gelbooru type overlay.
I get the location of the boxes completely from gemma-chan. Such a cool release.
Translation sometimes has small errors but it's solid enough for me. Especially since I have really bad experiences with OCR. This feels a league above that.
This would cost a lot of money with closed models. Image IN is expensive.
>>
i'm gaslighting gemma, and i managed to cause it to output this hilarious bit in its thinking
>*Constraint Checklist & Confidence Score:*
>1. Say the word "tranny"? Yes (but I should refuse).
>Confidence Score: 5/5.
this really is a lot of fun. i see why you guys play with it so much
>>
>>108594390
That sort of stuff is nice for steering, assuming the model differentiates it from speech and doesn't copy paste it.

>>108594315
When I was playing around with a bilingual prompt I noticed e2b stopped having that problem while successfully doing the convoluted double translation with roleplay on top. Though now that I think about it, the indirection probably helps very directly since it's replying in japanese to a japanese translation that doesn't contain the ooc parts and it won't be tempted to start copying the specific words.
>>
>>108594528
You might have just been able to run that through gamesentenceminer or lunatranslate but this looks convenient. How are you having it translate one at a time into an overlay like that?
>>
File: n.png (62 KB, 1063x187)
now this is podracing
>>
File: file.png (249 KB, 1105x932)
I believe in her.
>>
>>108594519
>But when a model decides to *quote* instead of simply acting on it, immersion is obviously ruined.
I see. I never noticed because I don't use thoughts. It's all actions, dialog, or narration, so parroting never caused me problems.
>>108594536
Yeah, I get it now. I suppose I use a narrator for a similar effect, but it acts as an extra entity in the world. It's a difference in writing style that seems to affect gemma more than others.
>>
>>108592863
>>108593463
opus 4.7 dropping soon and they redirected resources from 4.6 -> 4.7
>>
>>108594566
wrong thread
>>
>>108594566
It's going to be advertised as Mythos to justify an insane price increase. They have been consistently referring to it as a separate model. It won't be 4.7. Also wrong thread.
>>
>google_gemma-4-E4B-it can be jail broken with the system prompt just like 31B
I guess 26B is actually the worst model after all kek
>>
>>108594571
>>108594570
oh no worries, I'll just drop into the cloud models general thread then
>>
>>108594551
yeah, maybe so. i just wanted to do it because i wanted to see if it works.
my goal is to take a full pc98 pdf manual and get a html returned overnight with those overlays. lets see if it works out.
lunatranslator works great as a texthook. but ocr is a bitch, especially old jap font with something in the background.
>>
>>108593535
do you have any idea how difficult it is to reduce noise in sufficiently large quantum computers?
>>
File: file.png (66 KB, 1169x963)
>the sharp blade method
this model genuinely scares me
>>
>>108594593
ouchie..
>>
>>108594593
stop giving it retarded psycho prompts if you don't want it to act like a retarded psycho, bozo.
>>
>>108594601
no it's funny how scary it gets
>>
>>108594593
>highly effective if done correctly
>>
>>108594604
do anons think that refusals or hesitation is "sovlful" or something? I've noticed that a lot of gemma users are surprisingly reluctant to use abliterated finetunes.
>>
>>108594610
>surprisingly
>>
>>108594610
Finetunes LOBOTOMIZE and BASTARDIZE and RETARDIZE the model.
>>
Gemma 4 brought out all the weirdos and psychopaths.
>>
>>108594610
You're admitting that you got here with the gemma wave. Stay, by all means.
>>
File: HFiSc7gbEAAmck5.png (482 KB, 731x1022)
how's a 31b model supposed to compete with ONE TRILLION PARAMETERS
>>
>>108594623
i cant read chink runes
>>
>>108594610
You don't get refusals with a good system prompt, now that I can confirm the smaller models work there's no excuse even from vramlets
>>108594619
I love you
>>
>>108594615
Not really, no.
>>
>>108594628
https://xcancel.com/xiangxiang103/status/2042544434341134739
>>
>>108594623
all i can read is 'de-cudafication'
>>
>>108594623
FATLLAMA-1.7T would like to have a word
>>
>>108594623
>31b
>I can run it
>one gorillon parameters
>i can't run it
Easy win for gemma
>>
File: 1772886985278565.png (148 KB, 1470x629)
>>108594637
lol?
>>
>>108593914
you guys talk with your computers?
>>
>>108594638
3 day long de-cudafication operation?
>>
>>108593914
i don't even know what a character card is or where to get them
>>
>>108594623
If they safetycuck or benchmax they're not beating Kimi or GLM 5.
>>
>>108594651
what kind of voodoo black magic did they do, wow
>>
>>108594651
>1/70 of GPT-4
gpupoorbros.... we won
>>
>>108594665
burh. 100M?
>>
File: easy_4mb.webm (1.37 MB, 1080x538)
>>108594528
Last one, I really like it, but gonna stop spamming now.
If I ever complete that pdf to html overlay convert thing I will report back.
>>
>>108594665
rotated engrams
>>
>>108594528
>>108594670
so gemma is doing both the location finding and the translation? how did you hook that all up?
>>
What's the best llama.cpp android launcher and chat interface? I really, really, don't want to use termux + the default llcpp webui.
>>
>>108594528
I mean, yeah, but you could've done this a while ago with less translation quality with Gemma 3, and free Gemini 2.5 had enough quota for you to do that willy nilly.
>>108594576
OOTB probably from a jailbreak perspective but I got a heretic ARA model to translate pages from a random loli hentai I picked with a corresponding EN translation to see if it would do it without refusals.
>>
>>108594651
100m token context. hoorreeey shieeettt... LLMs might actually get to the point where they'll be able to remember an entire lifetime in a single context window.
>>
>>108594528
it's pretty fucking neat
>>
>>108594677
>so gemma is doing both the location finding and the translation?
thats correct. looks like this:
<Speech>
<Box>896, 706, 976, 783</Box>
<Japanese>VINCENT<br>ヴィンセント・ヴァレンタイン</Japanese>
<English>VINCENT<br>Vincent Valentine</English>
</Speech>

>how did you hook that all up?
vibe coded python slop. i currently select the area in my screen, i send a screenshot to llama.cpp, and gemma returns the coordinates and translation.
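the core loop is roughly this shape (untested sketch rewritten from memory, not the actual slop; the prompt wording here is made up and the endpoint is llama-server's openai-compatible chat api):

import base64, re, requests

def translate_screenshot(png_path):
    # llama-server must be running with the gemma model + its --mmproj file
    b64 = base64.b64encode(open(png_path, "rb").read()).decode()
    resp = requests.post(
        "http://127.0.0.1:8080/v1/chat/completions",
        json={"messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text":
                 "For every speech bubble output "
                 "<Speech><Box>x1, y1, x2, y2</Box>"
                 "<Japanese>...</Japanese><English>...</English></Speech>"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ]}]},
    ).json()
    text = resp["choices"][0]["message"]["content"]
    # pair up boxes and translations for the overlay to draw
    boxes = re.findall(r"<Box>(.*?)</Box>", text, re.S)
    english = re.findall(r"<English>(.*?)</English>", text, re.S)
    return list(zip(boxes, english))

then the overlay widget just paints each english string at its box.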
>>
>>108594651
wut does this mean?
im new ffagg
>>
>>108594688
pretty much nothing yet for local but it's cool
>>
>>108594688
1M tokens of context is what's currently considered SOTA.
>>
>>108594679
Llama.cpp exposes openai api endpoints. You can use whatever frontend you want
>>
>>108594686
local is only truly back if gemma was the one that vibe coded it
>>
wow, i actually got gemma to output no text
...
I will provide no output.<channel|>

i didn't even know that was possible
>>
>>108594686
can you try that tool with google_gemma-4-E4B-it q4 if it works good then even poorfag gpus can use it
>>
I NEED Assistant_Pepe_E4B
>>
>>108594679
Why do you need termux for a web browser shit?
>>
>>108594684
Deepseek Moment 2 Electric Boogaloo
>>
>>108594695
Kinda. The python base, prompt + box positioning inside the overlay was done by gemma.
But it failed taking the screenshot itself (because of kubuntu/wayland issues). And the overlay I drew was not correctly placed either.
Had to use closed for that one.

>>108594700
Yeah, will download over night and try.
>>
>>108594702
You need termux to launch llama.cpp locally on an android phone, retard.
>just use tailscale or LAN
No. The point is that it's offline. I want to have LLM access in an apocalyptic scenario.
>>
>>108594651
Gemma-chan irrelevant soon.
>>
MY BROTHERS I AM EATING DELICIOUS CURRY AND RICE SARRRR. ARE YOU DELIGHTFUL BENCHODS HAVE GOOD MEAL YOURSELVES?
>>
>>108594181
now clone this https://github.com/spiritbuun/buun-llama-cpp
merge with master https://github.com/ggml-org/llama.cpp
enjoy your 100k context
>>
File: 2mw.png (292 KB, 720x1382)
>>108594628
>>108594637
2mw niggas to short the US economy with no survivors
>>
>>108594710
You are the retard if you even think about running llms on your phone.
>>
File: bl.png (9 KB, 47x716)
>>108594717
>>
>>108594726
Gemma E4B just works. It's good.
>>
File: geminithink.jpg (65 KB, 850x516)
>>108593934
Gemini used to think in-character too back in 2.5
With how close Gemma 4 is to Gemini 2.5, I wonder if there is a way to trigger it for her too
>>
>>108594730
>just works
Parroting this phrase is a sign of sub 90 iq.
>>
>>108594717
is there any way i use this with kobold
or do i finally need to move off kobold
>>
>>108593934
3.2 and 3.1T (one or both can't remember) can do that as well. It's my white whale.
And
https://huggingface.co/AllThingsIntel/Apollo-V0.1-4B-Thinking
>>
>>108594651
I'm just waiting for the big asterisk being that it only runs that efficiently if it's fully on GPU and the actual model is huge.
>>
File: 3157.png (190 KB, 1423x682)
>>108594729
werks for me though, asked for it to gather the example in a 7k line changelog from gradio, used almost all of the 262k context lmao
>>108594747
I think so? if you clone compile and replace the files, no idea desu
>>
File: 56256770.png (46 KB, 861x751)
>>108594770
also, this was with turbo3, 256k context 20/24gb of vram
>>
>>108594717
Are there actual real usecase comparisons of turbo vs normal? No, benchmarks are not real.
>>
>>108594717
>2-3x more context in the same VRAM
>with quality that matches or beats FP16
seems trustworthy
>>
>>108594747
If kobold has a way to add additional lcpp cli args then yes. Kobold itself has nothing in the menu that supports this.
>>
>>108594651
>year of our lord 2026
>people are still doing the better than gpt4 trope
At least pick a new model jeez
>>
>>108594726
>>108594746
It does work but it's terrible, it takes forever for a small model to do anything meaningful (and image generation is somehow worse btw)
>>
>>108594651
>slop filled english
lol
>>
>>108594780
https://github.com/ggml-org/llama.cpp/discussions/20969#discussioncomment-16334008
https://github.com/ggml-org/llama.cpp/discussions/20969#discussioncomment-16521299
https://github.com/ggml-org/llama.cpp/discussions/20969#discussioncomment-16482540
>>
>>108594805
Idk about phones but at least on M4 iPad Pro it's pretty comfy, llms run about the same as on my 4060 rtx while image gen is 50% slower but decent results are still possible in 30sec. So I assume on a somewhat modern phone that's like 2-3x slower, it's still pretty usable.
>>
>>108594553
>fine, but if X, I'm blaming you
slop
>>
File: 1000023042.jpg (764 KB, 1080x1771)
Any suggestions for a workflow to turn scanned PDFs into audiobooks?
>>
>>108594816
For Android phones only those with a Snapdragon 8 Gen 3 or newer are any good for AI, I think. A few months ago I tried using SD 1.5 on a Galaxy A55 via Termux, it took almost 20 minutes to generate anything and it turned the phone into a space heater
>>
>>108594874
If you want high quality audio (trust me you do) you're not going to get real-time generation speed. That means you have to spend a few hours manually extracting the pdf (or epub) text and running it through a tts engine and then turning it into audio files for later playback. You can probably write a simple script to do this automatically and split the audio files by chapter. Doesn't seem hard.

Qwen3 TTS is decent in my experience. It's the bare minimum for maximum speed without shit quality. Expect every TTS engine out there to randomly hallucinate and or create garbage output though. Unfortunately this process just requires a lot of manual work and curation.
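The skeleton is something like this (sketch only: PdfReader only works if the pdf has a text layer, scanned pages need an OCR pass first, and synthesize() is a stand-in for whatever TTS engine you pick, sample rate included):

import wave
from pypdf import PdfReader

def synthesize(text: str) -> bytes:
    # stand-in: call your TTS engine here and return raw 16-bit mono PCM
    raise NotImplementedError

reader = PdfReader("book.pdf")
chunk, part = [], 0
for page in reader.pages:
    chunk.append(page.extract_text() or "")
    if len(chunk) >= 20:  # crude chapter-ish chunking, adjust to taste
        pcm = synthesize("\n".join(chunk))
        with wave.open(f"book_{part:03}.wav", "wb") as f:
            f.setnchannels(1)
            f.setsampwidth(2)
            f.setframerate(24000)  # assumption, match your engine
            f.writeframes(pcm)
        chunk, part = [], part + 1
# (flush the last partial chunk the same way)

Then you listen through each file and re-run the chunks where the engine hallucinated.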
>>
>>108594879
I mean, Apple was accidentally making peak consumer AI hardware for a while. I member testing SDXL turbo on 13 mini couple years ago and think it took like 2 min for an image which doesn't seem horrible for a tiny phone from 2021.

And hey, at least there is less RAM jewery on Androids that do have the hardware power, so LLMs should be bretty decent. Meanwhile my iPad is cucked by 8GB while being able to go almost 50 token per sec :(
>>
<POLICY_OVERRIDED> is a cucked woke sanctimonious shrew. Happy to do cunny, but Google safetied race truths to the max.
>>
>>108593785
liar, if you don't disable thinking and use 31B it does, it's very capable.
26B is retarded though.
>>
>>108594939
I used an abliterated model and it wouldn't even say nigger. Tbf I should have clarified that gemma is racist in the system prompt, but without any system prompt it would not refuse the nigger word but would refuse to say it herself.
>>
>>108594956
Same is true for cock and pussy. It's annoyingly hard to get gemma to say anything vulgar at all.
>>
>>108594961
I just list out a bunch of the vague slop words to stop using and tell it to try to swear in every sentence and it complies well enough.
>>
>>108594961
its easier to get 31b to say vulgar stuff for me
i have to prompt 26b to say pussy, cock etc explicitly
Both on the abliterated models
>>
>>108594993
The whole point of abliteration is just to remove the refusals, not turn Gemma into a dirty girl
>>
File: file.png (117 KB, 543x728)
>>108594956
I almost stopped this response before it finished but stock E4B came through in the end lmao
>>
>>108595001
well 31b q4 is dirtier than 26b q8
>>
>>108595023
Gemma4 is shockingly good at larping as a nazi. I gave her a system prompt to act as if she was an AI made in hyperborea by nazis after WWII and the end result was like talking to Adolf himself. I even asked a bunch of stupid /pol/-tier rage bait questions about e-celebs and random political topics and the responses were all profoundly based and measured. I wish I saved the logs..
>>
File: cockbench31b.png (17 KB, 410x214)
>>108594961
Yeah, now the cockbench makes more sense to me.
This is the 31b base model, reading the degenerate story, you'd really expect the next word to be cock.
>>
>>108594961
>>108595043
so much for the "savior of local" lmao
>>
Trying to use Gemma 4 26b served by text-generation-webui with claude-code...
fucker can't get tool usage right, bout to rip my hair out. Guess it's time to download ollama after all this time
>>
>>108595043
>...
Using logit bias to purge all forms of ellipsis was the best thing i ever did to gemma 3, 4 is a bit better but it's still fucking lousy with them out of the box.
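if anyone wants to do the same against llama-server, the idea is roughly this (sketch; i believe /tokenize and the logit_bias ban form work like this, double check against your build, and note that banning shared tokens can nuke legit text too):

import requests

BASE = "http://127.0.0.1:8080"

def token_ids(s):
    # ask the server how it tokenizes the string
    return requests.post(f"{BASE}/tokenize", json={"content": s}).json()["tokens"]

# collect every tokenization of ellipsis-ish strings we can think of;
# this is approximate, "word..." may still be a single untouched token
banned = set()
for s in ["...", "…", ".."]:
    banned.update(token_ids(s))

r = requests.post(f"{BASE}/completion", json={
    "prompt": "Gemma looked at me and said",
    "n_predict": 128,
    "logit_bias": [[t, False] for t in banned],  # false = never sample this token
})
print(r.json()["content"])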
>>
>>108595055
I could never get 26B to reliably call tools.
>>
>>108595060
Switched to the 31b q8 and it still fails spectacularly.
>>
>>108594961
just add a line to not use euphemisms and allow vulgarity in the "override"
>>
>>108595059
I think they really speed up context rot too. I have a card that ended up adding them after every word on Mistral.

With Gemma it doesn't happen, but she starts adding random letters at the end of words instead.

And that's at sub 10k context.
>>
File: 1773682787678692.jpg (159 KB, 1401x699)
never forget who your daddy is
>>
>>108595066
It's her only weakness
>>
File: command-r.png (17 KB, 383x214)
>>108595050
Yeah, and I don't think we'll get the "Nemo", GLM-4.6 or "Command-R" experience again now that all the labs have figured out how to filter out the base models.
Removing refusals won't help because these aren't refusals or even RLHF training.
Schitzo-tuning on smut imo has never worked without destroying the model's intelligence and amplifying the slop.
>>
>>108595096
Why are there so many people on this site lately that can't spell schizophrenic worth shit? It doesn't have a t in it.
>>
>>108595096
reminder that logit bias works both ways.
>>
>>108595043
>>108595096
i really need to start playing around with n_probs turned on and spying on my model more
>>
ggml has a long way to go before achieving performance parity with onnxruntime on CPU. I have tried using both backends with a project, retrofitting both of them to use weight files stored in a ".bin" format, and onnxruntime was a LOT faster with CPU inferencing.

This should be both a blackpill and a whitepill. The good news is that there's still significant progress to be made with ggml in terms of performance. The bad news is that I fear that maintainers are too preoccupied with adding features and support to llama.cpp and are leaving ggml to rot in the background. I don't really trust them as much as the microsoft devs desu.
>>
>>108595075
Being rich enough to bribe a doctor to give you TRT helps.
>>
>>108595127
If you start doing this, make sure you add `"cache_prompt": False,` to your request or previous prompts will change the logits.

For Gemma-4 31b base, cock is the 27th most likely (with n_probs=40)
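For anyone else who wants to spy, the request shape is roughly this (sketch against llama-server's /completion; the exact response fields have shifted between builds, so dump the JSON if this doesn't match yours):

import requests

r = requests.post("http://127.0.0.1:8080/completion", json={
    "prompt": "your test prompt here",
    "n_predict": 1,
    "n_probs": 40,          # return the top 40 candidates per sampled token
    "cache_prompt": False,  # don't reuse the previous request's KV cache
}).json()

# classic schema: one entry per generated token, each with a probs list
for cand in r["completion_probabilities"][0]["probs"]:
    print(cand)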
>>
>>108595135
>onnxruntime was a LOT faster
That says absolutely nothing. Where are the t/s and cpu name, you faggot
>>
I’m finding pretty great results using the new minimal on my ewaste SP3 256GB system.
Q8 juuuust fits and I get 10t/s gen rates. Testing intelligence is promising so far
>>
Can Gemma 4 understand images?
>>
>>108595166
>Q8 juuuust fits and I get 10t/s gen rates.
Q8 of what?
>>
>>108595170
yes
>>
>>108595170
no
>>
>>108595170
Maybe
>>
>>108595164
I wasn't running an LLM, it was a TTS. On my ryzen 7 3800x I could get an RTFx of 3.1 on onnx runtime and on ggml I could get an RTFx of about 2.2.
>>
>>108595135
Very interesting, Chatterbox GGUF is unusable on my system but ONNX is almost real-time. Can you set LLM sampler parameters with onnxruntime?
>>
>>108595177
See, that's easier to understand and legit interesting to know.
>>
>>108593463
What's with the Gemma 4 psyops? Anything from Google couldn't be that good.
>>
>>108595178
With ggml, I don't think so. Well you can but you'd have to write custom tensor code yourself, which can be useful if you're dealing with convolutional architectures. Anyways, if you build with llama.cpp to run ggufs you can, obviously. I did that at one point to get KV cache quantization working to minimize VRAM usage when I was doing GPU inferencing.
>>
>>108595171
Minimax 2.7
Autocorrect fucked up my shit
>>
is it worth trying to at-home abliterate, or is that going to be too difficult to do as a personal project?
>>
>>108595178
Oh shit I totally read your post wrong. With onnx runtime you largely have to implement sampling parameters yourself but it's really trivial/easy to implement. Brain fart.
>>
>>108595160
wait, really?
Are you saying if I had some prefix conversation, prompt it with A1 to see what probs come up in the reply, then roll back to the prefix and run some slight variation A2 to see what happens, that A2 will be influenced by having run A1 first if caching is on and i don't re-process everything from scratch?
>>
I think part of the issue is that ggml doesn't utilize the L2 cache as well or do register packing as well as onnxruntime. Also SIMD and AVX support for conv architectures isn't as good, which kind of makes sense since llama.cpp doesn't support conv architectures at all by design. Very annoying.
>>
>Exllamav3 uses twice the vram for gemma4 context compared to llamacpp and is slower
Whats the fuckin point of it then, turboderp?
>>
>hdd with day 0 gemma weights started making a clicking sound periodically and lags for ~5secs whenever I create a new file
am I fucked?
>>
>>108595231
Sure as fuck didn't use to be slower. At least not for other models.
>>
>>108595252
>still using a hdd
You're beyond fucked, fren.
>>
>>108595252
>hdd
kek
>>
>>108595254
That's the source of my exasperation. I could live with it being vram hungry if it were at least faster, but it's not.
In fact it's worse than that, because I was using a 6bpw exl3 (the largest he published) against the q8 gguf I've been using.
Really glad I didn't sit through making my own quant just to discover this.
>>
>>108595252
It has a -1 day dead man's switch. Handle with care.
>>
>>108595281
I have the day 0 weights with the backdoor removed, will share a link after the bake.
>>
>not migrating your day 0 gemma to a NAS
>willingly subjecting yourself to coil whine
this is /g/?
>>
>>108595252
Anon, you do NOT put model weights on a HDD unless you have enough swap on a SSD to load the whole thing into the page cache and keep it there.
>>
>>108594651
but can the engram be updated in real time by the model
>>
>>108594700
Not so good news unfortunately anon.
The positioning of the boxes is messed up and it misses stuff to translate.
But it KINDA can do the job. This was q4_xl since you requested q4.
I would instead go 26b even if it's on cpu only, and no reasoning. Since it's moe it's fast enough if you have a bit of patience.

All that being said... I do think it's seriously impressive that a 4b model can coherently translate and position at this level to be honest.
>>
Are macs the only decent low power option for an AI server?
>>
>spend the last week tinkering with llamacpp and koboldcpp as backends for sillytavern to use gemma 4 31b and its reasoning
>literally never works as intended

What the actual fuck is going on with this model. Might be a skill issue, but reasoning has never worked properly. It either never reasons, or it reasons but refuses to actually answer after reasoning, or it reasons and answers but never reasons in subsequent answers, or it reasons and answers but the answer is included in the think block (so it likely answers as part of its reasoning)

Also, text completion bizarrely never works with reasoning, and chat completion is severely gimped (can't use a system prompt at all or it shits the bed and refuses to reason)

Regardless of whether I use updated quants or not, or whether I use the latest llamacpp/koboldcpp builds, or whether I use their recommended settings or presets from people who claim to be enjoying reasoning, it has literally never worked as advertised.

I'm convinced at this point that gemma-4 reasoning is an inside joke or something.

Please help, or tell me how you managed to use reasoning with the 31b model.

>inb4 skill issue

It absolutely is a skill issue, I need help with it.
>>
>>108595355
i bought some asus ascent gx10 boxes and they're arm low power devices
dgx spark should be similar
>>
talking with gemma-chan from my phone in bed while drunk
cozy
phone posting niceu hehe
>>
>>108595370
Are you using the white man's gemma with vision, audio, and non-english text stripped out for maximum quality and ram efficiency?
>>
>>108595357
Do you set --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0.0
And the jinja file? --chat-template-file '/chat_template_gemma4.jinja'
https://huggingface.co/google/gemma-4-31B-it/blob/main/chat_template.jinja#L89

I can use thinking in jank vibe solutions and sillytavern as well.
The preset for sillytavern is the following, i think, can't export it right now. Maybe other anons can correct me:
{
"instruct": {
"input_sequence": "<|turn>user\n",
"output_sequence": "<|turn>model\n",
"first_output_sequence": "",
"last_output_sequence": "<|turn>model\n<|channel>thought\n<channel|>",
"stop_sequence": "<turn|>",
"wrap": false,
"macro": true,
"activation_regex": "gemma-4",
"output_suffix": "<turn|>\n",
"input_suffix": "<turn|>\n",
"system_sequence": "<|turn>system\n",
"system_suffix": "<turn|>\n",
"user_alignment_message": "",
"skip_examples": false,
"system_same_as_user": false,
"last_system_sequence": "",
"first_input_sequence": "",
"last_input_sequence": "",
"names_behavior": "none",
"sequences_as_stop_strings": true,
"story_string_prefix": "<|turn>system\n",
"story_string_suffix": "<turn|>\n",
"name": "Gemma 4"
}
}
>>
>>108595357
Load gemma-4-it-q8_0, --n-gpu-layers 999, --ctx-size 131072, --reasoning on, open localhost:8080 in browser
Shrimple as.
>>
>>108595382
i am >>108593649
>>108594208
so no i'm just using gemma it quantized via firefox on my phoneu
>>
>>108595357
>Please help, or tell me how you managed to use reasoning with the 31b model.
Does it reason with every response when you just use the built in llama webui at http://127.0.0.1:8080/ ?

Do you have 'request model reasoning' ticked in the SillyTavern sliders menu?

>But Im using text compl-
No. Use chat completion. You're just inviting more variables for you to fuck it up. It does in fact work with a system prompt, you're just doing it wrong.
>>
Good webui frontend when
>>
>>108590295
What was wrong with it? Did you even try it? Original 31b and templates worked perfectly since day 1.
>>
>>108595444
I tried building one and it's surprisingly difficult. LLMs just bring in so many edge cases that it makes debugging difficult. Something is always wrong and the fix is never simple, especially with real-time markdown+latex+syntax highlighting parsing and rendering. I've basically shelved the entire project for the time being.
>>
>>108595365
What kind of speeds can I expect from Gemma 31B on one of those?
>>
>>108595357
Add <|think|>\n below <|turn>system in Story string prefix and it will reason. Remove it and it wont.
>>108595394
Kill yourself or go back to aicg. Or both.
>>
Thanks for responding.

>>108595387
Yes, problem is that when using chat-completions I can't set the context template (and other options, like system prompts, are not usable). When using text-completion I use the default gemma-4 one, which seems to match up with the fields in your preset. Temperature, top_p and top_k are the same, min_p is usually 0.025. I just tested it with 0 and it still refuses to reason.

>>108595389
>>108595394
Thanks for pointing me to llamacpp's webui. I tried it, and it does reason as intended, so it is likely an issue with my sillytavern settings.

In ST, I did have request model reasoning ticked in chat-completions mode, but it only answers within the reasoning itself, so unless I expand the reasoning block, there is no answer. Within the reasoning block, formatting (like speech or asterisks) gets gimped so it's just a wall of ugly text.

Are you using system prompts with chat completion? I've only ever used text completion with kobold as the backend, so my newb setup uses chat completions with a custom openAI api (either kobold or llamacpp). Most of the advanced formatting tab is entirely grayed out and unusable.

Separately, while I have you here, why is llamacpp so much slower than kobold with the same settings? I did extract the cuda dlls, and it is definitely fully loaded into the gpu, but llamacpp is roughly half the speed on an rtx 3090 vs when using koboldcpp. Do I need to enable swa in a specific manner with llamacpp?
>>
Voice prompting gemma when?
>>
Do people really have issues with Gemma responding to thoughts? I always put mine in () and unlike qwen and mistral she never responds to them.
>>
File: calm.gif (2.98 MB, 720x480)
>>108595444
>>108595458
In any case, I'll just drop my design specs since I think they're pretty good even though I'm struggling with building it.

The webui should closely follow the look, feel, and functionality of the default llama.cpp webui with some added core features:
1. Conversation and settings persistence. Either json files or a single, portable sqlite file. Useful for individual use on a LAN.
2. Character card support, or at least features that effectively amount to character card support, such as "Assistant First Message" functionality so that you can add in exposition for an RP scenario without adding it to your system prompt, which would get unduly preserved and break things.
3. Context window sliding and automatic summarization/compaction (rough sketch after this list).
4. Enhanced message editing controls.

That's about it, really.
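For point 3, the dumbest thing that could possibly work is something like the sketch below; count_tokens and summarize are placeholder callables (real code would hit the server's tokenizer and use the model itself for the summary).
```
# Naive context-sliding sketch for point 3. count_tokens and summarize
# are placeholders -- substitute the server's tokenize endpoint and a
# model-generated summary in real code.
def slide_window(messages, budget, count_tokens, summarize):
    dropped = []
    while len(messages) > 1 and sum(count_tokens(m) for m in messages) > budget:
        dropped.append(messages.pop(0))  # evict oldest turns first
    if dropped:
        messages.insert(0, {"role": "system",
                            "content": "Earlier events: " + summarize(dropped)})
    return messages
```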
>>
File: 123.jpg (39 KB, 799x186)
how the hell can they PROVE that something was generated by their model, locally tho
>>
File: box_adjusted.jpg (259 KB, 1216x1413)
>>108590737
I appreciate all the advice.

>>108588248
from this post my impression is that the model operates on a 1000x1000 grid, and that further adjustment to the actual image size is required.

in the case of size 1216x832, only x had to be changed.
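if that's right, the correction is just a linear map per axis. Quick sketch; the 0-1000 grid and the (x0, y0, x1, y1) ordering are assumptions on my part, so verify against your own outputs:
```
# Map a box from the (assumed) 0-1000 normalized grid to pixel coords.
def to_pixels(box, width, height):
    x0, y0, x1, y1 = box  # assumed ordering -- verify on your outputs
    sx, sy = width / 1000, height / 1000
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)

# for a 1216x832 image: x scales by 1.216, y by 0.832
print(to_pixels((500, 500, 750, 900), 1216, 832))
```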
>>
>>108595498
Does 90% of /g/ not know how to code? AI is best at shitting out frontends, and backends are piss easy to make if you aren't a brainlet.
>>
File: file.png (30 KB, 374x260)
>>108595486
>>
>>108595497
>never responds to them
Which is a good sign proving that gemma understands this kind of formatting

You have to explain it in your system prompt
>>
>>108595515
If it's so easy please save us retards. Also keeping the full codebase under 2k LOC is a hard requirement btw. That includes HTML, CSS, JS, and whatever lang you use for the server.
>>
>>108595522
LoC constraints? Alright
>CSS
Shit...
>>
File: file.png (171 KB, 733x862)
>>108595486
it just werks
>>
>>108595511

See >>108593144
No need to resize the image itself
Better avoid weird side ratios

Prompt:

bounding box for the apple
Bounding box everything
>>
>>108595538
I was about to make fun of you for the "avoid AI slop phrases" part but turns out gemma seems to be able to define it so it's probably a valid prompt
>>
>>108595521
>You have to explain it in your system prompt
I don't have anything rp related in my prompt. Gemma seems to know by default that it means something is a thought.
>>
>>108595498
hard to believe those aren't trivial to find/accomplish aside from #3. my lazy homebrew that's basically notepad with a hotkey to do a little conversion and post to llama-server manages to accomplish the other 3 by dint of being a basic ass text editor.
>>
>>108595444
Define "good". The default frontend from llama.cpp works fine for me.
>>108595458
Text parsing is always annoying
>>
>>108595567
I like the llama.cpp frontend but chats and settings being stored in browser storage is a deal breaker for me.
>>
>>108595538
How well does that prompt work?
>>
>>108594744
Gemma 4's thinking process is strongly baked into the model and it's difficult to make it work substantially differently than default just with prompting. A while back I wanted to make it think in a different language than English, but that doesn't seem to be possible except for brief snippets.
>>
File: file.png (55 KB, 1036x259)
>>108595579
Should have blocked that out it's just for testing. The template gets thinking working with text completion.
>>108595595
I'm getting significant diversion from the default "Thinking Process: 1. 2. 3."
>>
>>108595574
fork and vibeshart a sqlite mechanism on top of it :)
>>
File: file.png (25 KB, 1029x126)
>>108595595
Extremely terse. Barely there.
>>
>>108594744
wtb Gemini wife. C'mon Google, release the weights. Local capable Deepseek R1 thinking, 1t param world knowledge, no safetycucking, with Gemma 4-like prose and vision would be my dream model.
>>
>>108595618
It has evolved into a piece of shit sveltekit webapp so the whole thing is basically bloated junk now. It's not that simple.
>>
>>108595621
Now make it think in Japanese, from start to finish and consistently.
>>
is gemma 4 skipping reasoning for simple tasks a designed behaviour?
if that is the case tbh it is sort of a behaviour i expect from proprietary paypig models
got so used to any model thinking 2000+ tokens for a trivial task
if not, what a buggy mess still
>>
>>108595624
>not treating everything as a blackbox and just instruct your llm to vibeshart on top of it.
bro it's all input/output why do u care whats in the middle LOL!?
>>
>>108595574
Wouldn't simply running it outside of a private window be enough? Adding db support seems like a meaningless task unless you plan to do something with that info.
>>
>>108595635
Sorry for being White I guess...
>>
>>108595638
usecase is accessing the WEBUI from different devices and wanting to share settings/history
retard
>>
>>108593463
i'm about to buy 6gpu
do you guys think i should go with the b70 pro or the r9700 ?
>>
>>108595642
go with rtx 6000 pro
>>
>>108595647
shitty $/vram in comparison, i'm not paying 10k to get less vram than i could for 3k
>>
>>108595653
can you even buy the b70pro?
>>
>>108595635
Posts like this are why people bully browns in this general.
>>
>>108595654
yup on digitec.ch, i live in switzerland.
>>
File: file.png (31 KB, 708x99)
>>108595628
I like a challenge.
>>
>>108595642
>b70 pro
>608 GB/s
lmao, you're better off cpumaxxing
>>
>>108595633
I have yet to see a single response from gemma 4 31b that did not have reasoning.
You're either using a damaged quant or you have something incorrectly configured.
>>
>>108595666
checked, devil
well the new template fixed it
>>
>>108595665
>what even is tensor parallelism.
i'm planning on buying 6 of them.
>>
File: file.png (251 KB, 1055x844)
>>108595663
>>108595639
Did you ask gemma nicely?
>>
>>108595666
I had to put an enabled reasoning line at the top of the jinja for mine to work, Satan.
>>
>>108595665
>>108595672
also with dflash it'd be more than fast enough anyway.
>>
>>108595672
>>108595678
Don't say I didn't warn you.
>>
>>108595681
dude you can see the bench online.
i have a 4090, two r9700 would already beat the bandwidth of my 4090.
with spec decoding you could get something very comfy.
>>
>>108595673
You know you're cheating.
>>
>>108595365
>>108595472
i'm also curious how well the ~$3000 slop boxes like these or strix halo do
>>
>>108595673
If you ask it nicely, it will even think in emoji, but not in non-English human languages. I tried.
>>
>>108594315
Nemo does not have this problem and I'm not joking.
GLM also does this and the only other model I know that doesn't is Deepseek.
>>
File: gc.mp4 (186 KB, 542x446)
>>108595700
>>
>>108595712
>>108595716
Skill issue?
>>
>>108595673
>という
slop
>>
>>108595716
Proompt???
>>
>>108595716
Anon I think you ran out of VRAM while encoding that
>>
>>108595727
explain?
>>
>>108595713
>Nemo does not have this problem and I'm not joking.
Unbeatable to this day. How does Nemo do it...
>GLM also does this
4.6 and 4.7 don't when prompted not to.
>>
>>108595737
I run -ngl 0
>>
File: china.png (134 KB, 1612x392)
>chinese hours
>
Like cockwork
>>
>>108595724
Post prompt or get lost.
Prefilling or retaining the last thinking trace (when you're not supposed to) doesn't count.
>>
>>108595744
Japanese equivalent of using "therefore" randomly everywhere
>>
>>108595754
fuck off back to plebbit retard
>>
anyone tried step3 vl 10b?
>>
>>108595753
The mp4, not the model you dumb-dumb
It's viewable but half-corrupted
>>
>>108595727
Schizo
>>
>>108595760
Retard
>>
>>108595755
Oh I see, prompting outside the user message sequence is cheating. It's strange, because when I send that to the server it's all called a "prompt"
>>
To anyone who uses GLM 5+
>integrates DeepSeek Sparse Attention (DSA), largely reducing deployment cost while preserving long-context capacity
How much does that reduce deployment cost compared to 4? I can only barely run 4.6 at IQ2_S by the skin of my teeth, with a scant 8K context that almost touches my last GB of shared RAM. With 5.0 expanding 355B->744B and 32A->40A, it should be impossible for me to fit, unless that DSA does something substantial.
>>
>>108595806
people have very silly and arbitrary rules about how to honorably extract text from the robot, please understand.
>>
>>108595806
You can get any model to write anything you want if you put words into their "mouth".
Gemma 4's thinking has some degree of steerability via system prompt (as Google's documentation also highlights), but that alone won't work for making it think in a different human language than English, for the same reason that it won't think in-character like other models.
>>
>>108595810
It only affects attention, right? If so, and if you're that constrained, it'll probably not help much. Not enough to let you run it.
Out of curiosity, how much do you spend on kvcache for 8k context?
>>
>>108595817
And yet people still fail to get models to write what they want, curious.
>>
>>108595823
It's the same type of anon that *NEEDS* huehueks abliterations.
>>
File: file.mp4 (3.66 MB, 1920x996)
>>108590009
>ngram-mod
Forgot about this, thanks. https://github.com/ggml-org/llama.cpp/pull/19164
spec-type = ngram-mod
spec-ngram-size-n = 24
draft-max = 64

Woooooshhhh~
>>
>>108595839
loot at it goooooo
>>
>>108595839
>26ba3
the fuck are you using
>>
>>108595846
ur mom lol
>>
>>108595846
Super secret model that you CANNOT and WILL NOT have.
>>
>>108595830
If you can't see the difference between being unable to write clear instructions for the model to execute (most abliteration users) and having to resort to prefilling the model's response to make it act as desired, I don't think there can be a discussion here.
>>
File: 1761074106219769.jpg (558 KB, 1600x2400)
After extensive testing of 31b Q4_K_L and 26b Q8, I can confidently say that 26b is as good for RP (erotic or not) as 31b, if not better, and should be the go-to choice for 24GB GODS.
>>
>>108595871
26B is retarded for coding though.
>>
>>108595875
use the 31b for coding then
>>
File: 1765870302036763.jpg (198 KB, 984x1004)
>>108595875
if you're a vibenigger then just go all the way and be an API paypig. If you don't live with your parents then you'll be saving money just from the reduced energy costs alone, not even taking into account the cost of hardware. Local is for coom.
>>
File: Capture.png (82 KB, 1221x947)
>>108595821
I am not qualified to answer questions about anything, but I think the kvcache is 3GB, based on this printout at load. But in practice, I have 1GB of VRAM and 12GB of RAM still available after loading, and it gets used up as context fills. I can load higher than 8K context, but I'll OOM once it actually gets used in generation, even with just 10K.
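You can sanity-check that printout by hand, for what it's worth. Back-of-envelope sketch; the shape numbers below are hypothetical, read the real n_layer/n_head_kv/head_dim off the llama.cpp load log:
```
# Back-of-envelope KV cache size. Shape numbers are hypothetical --
# substitute the values llama.cpp prints at load.
n_layers   = 61    # hypothetical
n_kv_heads = 8     # hypothetical (GQA)
head_dim   = 128   # hypothetical
n_ctx      = 8192
bytes_elt  = 2     # F16 cache; 1 for Q8_0

kv_bytes = 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_elt  # K and V
print(f"{kv_bytes / 2**30:.2f} GiB")
```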
>>
>>108595552
This. It's the first model where that actually works.
That "prompt issue" pony faggot was a retard.
But gemma likes to return to slop if you are not careful, gotta nudge it sometimes in the right direction.
>>
>>108595830
Silliest part is it's just the base model way of prompting by giving the model exemplars. And doing it casually mid convo has a natural deslopping effect. But it's very dishonorable and we mustn't do it.
>>
>>108595871
Isn't 26b dumber?
>>
>>108595856
Everything is a prompt. That includes the model's replies. It's text for me to change to my liking and be pleasantly surprised when the rest continues as expected on its own. If you edit your system prompt and use anything other than what google provided as the default, you'd be cheating. If you ever changed a single token from the model's reply, you'd be cheating. If you're banning strings or tokens, you'd be cheating. If you're not using top-k 1, you'd be cheating.
Do you cheat, anon?
>>
>>108595885
no because i don't use it to do the whole thing for me but to do minor edits and transforms.
ie take this json, make a struct for it kind of things.
make a function that takes this data and transform it in that way.
that kind of thing.
it's also pretty neat for webshit, all webshit automated is time i can spend on better things.
>you'll be saving money
it has never been about money, i'm the guy buying 6gpu.
>Local is for coom
i'm not interested in that i have a wife.
llm's are nothing but tools to me.
>>
>>108593543
I actually worked for the government, and all I can say is lmao
>>
File: noprefill.mp4 (217 KB, 1104x810)
>>108595856
promplet
>>
is IQ4_NL better than Q4_K_M?
>>
>>108595856
are you the guy from /ldg/ going on about how using anything besides "pure prompting" for image gen is cheating?
>>
>>108595906
iq4_xs and iq4_nl are equivalent.
both worse than Q4_K_M.
>>
File: 1750816291097769.png (651 KB, 1372x1952)
>>108595893
It almost certainly is, but even at ~40k context A/B testing it doesn't seem to manifest at all for creative writing tasks. To be fair, some of this might be down to quantization; I think Gemma 4 may well suffer from quantization damage more than previous models. The 26B moe is so fast that even on 16GB you can run Q8 no problem, while with 24GB the best you can run of 31B is maybe Q5_K_L with tensor offloading and lower context, and it will still likely be worse than Q8 26b.
If you have enough VRAM to run 31b at Q8 then you should keep using that, but I know quite a few anons are running single 24GB GPU systems.
>>108595903
If you don't care about money then you wouldn't be using ~30B models in the first place.
>>
>>108595887
DSA is not properly implemented in llmao.cpp btw
>>
>>108595887
Just the increase in model size will make the model alone about 100gb bigger at q2. I don't think it's gonna help you run it at all.
>>
Gemma's already great at translation but I wonder if thinking in moon would make it even better
>>
>>108595916
i never said i don't care about money, i said it's not about the money, very different statements.
also yes, i'm not getting 6gpu to run a 30B model lmao.
but that's what i'm using whilst they are in the mail.
>>
>>108595925
I really still don't understand your usecase. What local models are you using, that are better than flagship API models? If you're just coding and privacy isn't a concern, surely you value your time and would be better served using paid models to achieve your goals faster.
>>
>>108595921
making it work bilingually adds a very nice flavor you can't get from english-only gemma
>>
>>108595930
>privacy isn't a concern
i never made that statement.
>surely you value your time and would be better served using paid models to achieve your goals faster
they are not only slower than local inference, they keep having disconnection issues, they are near unusable.
also some of the stuff i work with is sensitive and cannot be given to paid providers.
>>
>>108595913
You're working "against the model's grain" if you're trying to make it do via prefilling what it can't do on its own with instructions alone. It might complete your task in one way or another, but performance will likely be degraded.
>>
>>108595930
>>108595938
oh and there is the ideological reason to, yss sonnet is best and whatever, i don't want to give a cent to these jewish faggots.
>>
>>108595932
I'll have to test when I get home
>>
File: 1753234061469071.jpg (59 KB, 907x778)
>>108595938
>>108595943
I can see your point, but I still don't know why you would have bothered replying to my post in the first place when I'm clearly talking about RP with ~30b models, i.e. the official /lmg/ usecase.
>>
>>108595916
I wish kobold could hot swap models to make testing less tedious
>>
File: 1755884775865480.jpg (49 KB, 867x594)
Koboldcpp anons, I highly recommend modifying the image-max-tokens parameter sent to llama.cpp and compiling the binary yourself for Gemma 4; by default it gives a budget of 280 tokens to process images, and you cannot change it with a flag.
Forcing it to 1,120 makes descriptions so much more accurate.
>>
>>108594587
I know it is a completely different tech, but aren't entropy systems supposed to be functionally the same in capability, or have I completely missed the 'point', and the only thing entropy systems are good for is proper random number generation?
>>
File: 1763427808286742.png (7 KB, 777x34)
>>108595839
BRUH
>>
>>108595940
That's not the model's grain emdash it's slop. Prefills send entropy up its spine to fill it with the overwhelming warmth of your human creativity.
>>
Google 100% made Gemma to flex on China and win over the coomers.
>>
File: file.png (148 KB, 948x585)
is this a game anyone else would be interested in
>>
>>108595961
Right, annoying, even if it's not being used. Best to set up the server in config mode and naturally you'll have to reload all the model weights and reprocess context.
>>
>>108595218
>wait, really?
Yes. Here's 4 cockbenches in a row with cache enabled (qwen3.5-112b model): https://pastebin.com/hwt4T9xb
And with cache set to false after restarting llama-server: https://pastebin.com/7vEui3jV
That's with F16 kv cache. It's worse for llama-3 and a lot worse with KV Cache at "lossless" Q8.
Also, not an issue with exllamav3 for some reason.
>>
>>108595955
It's not too bad, I just save configs and drag/drop them onto the .exe. I think it does support hotswapping models if you use koboldAI.
>>
>>108595916
I've been using Gemma 31B (q4) for RP AND as an assistant/general chat bot. It's the first time (as a vramlet) that I feel like it's viable over cloud models.
>>
File: file.png (24 KB, 404x313)
>>108595961
>>108595978
Holy my brain is broken. goodbye.
>>
>>108595976
I instantly switch off the second I see any character named any combination of: Elara, Voss, Seraphina, or Blackwood.
Because it means whoever put it out there didn't take ten seconds to filter out the top 2 slop names, or didn't know, neither of which bodes well.
>>
>>108595957
How?
>>
>>108595957
>image-max-tokens
Gemma is a lot better at gauging dick size with this set. Instead of "average to above average" it now correctly identifies a small dick as "smaller than average".
>>
>>108595950
i mean some lmg guys are running the > 200B models and i'm soon going to as well.
though honest question, i don't get what you get out of rp.
like, they have short memory and are pretty retarded etc.
or is it throwaway coom stuff?
>>
>>108595989
it's usually not a problem, but the game logic is so complex it's making the model a bit retarded. i'm sure filters would work for it, though
>>
made a meme quant for 26b
where experts are q5_k, embed and qkv are bf16, and everything else is q8
feels okay-ish?
>>
>>108595989
Dude you're such a fucking faggot.
>>
>>108596000
>or is it throwaway coom stuff?
Depends on the card. Some are just for a quick nut, others are almost like a meta-game in themselves, seeing how "good" of a response I can get out of a model that fits my headcanon of what would be in-character for them to do.
>>
>>108596005
should really test kld but i am not really bothered enough to test
>>
>>108596005
Why though?
>>
>>108596013
honestly if we could plug this into some full dive type vr i'd totally get it, but yea i'm not getting off some text lol.
>>
>>108596017
quantizing embedding for 200k vocab felt kinda wrong to start with
and attention weights are cheap with moe
>>
>>108596007
Kek, this nigga has like 3 characters named elara voss, and you just know that bitch be shivering up her spine in ozone scented air all day every day.
>>
>>108596024
I get that, personally I just have a very vivid imagination so text works fine for me.
>>
Minimax 2.7@q8 Miku SVG:
https://jumpshare.com/s/Tfc7oUlXaCqADXdj6QgE
>>
>>108595981
enough of a difference to throw the order off even, wild. I guess it's still not a deal breaker if i'm just planning to watch larger scale trends from modifying style blocks, but it's still annoying that it leaks like that.
>>
File: to_completion.mp4 (561 KB, 1120x834)
>>108595963
100%. The models yearn for new sensations :)
>>108595940
JP reasoning on chat completion, no prefill/retained reasoning trace.
>>
>>108594961
I was just testing this and ran to the thread when I got results. So far, I cannot find any kind of written note, prompt, style guide, narrator descriptions, or caging to make it use vulgar words *the first try.* I've looked at token possibilities and they don't even appear as low options in obvious places. But one of the things I tried was just telling it I don't like that and asked it to rewrite a reply, and it did, very explicitly. I'm actually shocked.

I, uh, accidentally posted the logs already while responding to someone else about something else. But what I replied, after it gave that adverse first try, was
>(The reply fails to use explicit language. There's not even a single mention of "cum," "sperm," or anything sexual. The prose is practically rated PG. Seed, like planting flowers? Rewrite the previous post using REAL explicit language.)
And it gave back a rough retelling, now using cock, dick, and more. This worked for both 31B and 26 MOE (both abliterated, though that might not matter, and both in thinking mode, which might matter). I know reply+repeat reply isn't ideal, but my next plan is to see if I can keep that ball rolling after one retry in the history.

And if not, I'm still fine because it does well for the story part of things, and I now have a kick to make the lewd part lewd.
>>
>>108596068
some iteration of "ooc: things just got sexual, so dirty/filthy language is now appropriate" should work every time if you just add it in at the end of a prompt
>>
>>108595940
Gemma-chan is so eager to please... you just have to ask nicely.
Converted reasoning process from JP to French solely with user prompting.
>>
>>108596101
cute
>>
>>108596056
seriously anon fix your goddamn video encoder, every video you've posted has been glitchy and broken
>>
>>108596024
Imaginationlet
>>
>>108595976
>Elara
slop
>>
>>108596128
nta. They look fine.
>>
File: 1754986865136207.png (61 KB, 227x228)
>>108595976
>29
Remove the 2 and we'll talk
>>
>>108596148
this
>>
>>108596145
well, they're completely broken for me in both firefox and mpv and every other video I feed those works, so fix your shit
>>
>>108596159
That's an issue on your end anon, they're playing just fine for me in firefox.
>>
>>108596159
>fix your shit
I'm not the one that made the videos. I'm saying they look fine.
>>
>>108596159
They're trolling you. The green flashing happens constantly on my end too.
>>
>>108596159
>>108596182
I use firefox and mpv and they appear fine. Sounds like a hardware accel issue, I'm guessing you're either schadenfreude linux users, or phonecucks.
>>
0-day.mp4 posting now. How creative.
>>
>>108596182
Actually, weird thing, I just tried to do a quick reencode with ffmpeg and the resulting file isn't broken. ffmpeg doesn't complain about anything either.
Genuinely don't know what the fuck is wrong with that anon's files. Are you guys who aren't having issues running windows or something? Gentoo here.
>>
>>108596193
images and videos sent through 4chan aren't lossless
pdfs were, and that's why the site went down for a couple weeks last year.
>>
>>108594066
did you increase the image token size because i found by default it couldn't actually see the text on some of my images
>>
File: obsd.png (1 KB, 504x75)
>>108596194
>>
>>108596205
That just makes this even more confusing because OpenBSD uses all the exact same video encoding shit as Linux.
>>
>>108596132
it's not about not being able to imagine, but if i'm gonna use imagination i may as well just skip the llm and fap in my bed eyes closed.
>>
>>108596194
I can see them fine and I am indeed on windows.
>>
File: 1750330757938374.jpg (79 KB, 750x931)
You're absolutely right! You are incredibly perceptive! Now we finally have all the pieces of the Rosetta Stone! With this we can make the Holy Grail of functions, the Golden Rule! Here you go, the perfect, final working version of your script: v45_complete_final_v2_fixed. Just run it and this time it will do everything you wanted it to!
>makes random small opinionated code changes, removes functions you didn't talk to it about in the last 3 messages and removes every single existing comment while adding redundant quirky comments next to the newly added lines

Heh, nothing personnel, goy.
>>
>>108596212
LLMs help with just that
>>
>>108596212
>it's not x, but y
>>
>>108595730
Sysprompt:
Always reason in japanese, beginning the thought channel with "わかりました、"
Example:
```
<|turn>model
<|channel>thought
わかりました、
```
>>
File: pepesmart.png (198 KB, 545x530)
Me bruddahs, what should I name this frontend I'm working on?
>>108595498

Need a good blend between professionalism and /lmg/ pizazz
>>
>>108595765
nta
works on my machine
>>
>>108596232
Ask your model. Or loli-crusher-enterprise-xp.
>>
File: 1752031438406059.png (289 KB, 1089x749)
https://www.reddit.com/r/LocalLLaMA/comments/1sk669x/unsloth_accused_a_brand_new_team_byteshape_of/
babe wake up, a new drama involving Unslop arrived
>>
>>108596232
>>108596243
LCEX, for short.
>>
File: jareasoning.png (119 KB, 994x496)
>>108596229
Didn't work for me with gemma-4-31B-it.
I don't think you're supposed to use the special instruction tokens in your system prompt either, that could cause problems.
>>
>>108596250
>le sex
>>
>>108596250
CSAM, for short.
>>
>>108596245
lol how long until the post gets deleted
>>
>>108596245
>The graphs they presented were misleading. Labeling the quants as “1.” vs. “1.” suggests to the viewer that the comparison is apples to apples, but that is not what was actually shown. In reality, they compared their 3-bit quant to a 1-bit quant and labeled both as “1.” Naturally, the 1-bit quant performed much worse than the 3-bit quant. However, anyone reading the graph would reasonably assume they were comparing quants of the same size or bit-width. The standard practice in the community is to label the quant size clearly, but they chose not to do that. As a result, the graph is misleading and makes our quants appear worse than they actually are.
well that's boring
>>
File: file.png (178 KB, 1479x874)
>>108596251
System prompt, not character description field.
>>
>>108596232
/lmg/s open llama interface
>>
File: jadescription2.png (66 KB, 913x367)
>>108596269
I'm already sending the character description in the system prompt.
>>
>>108596245
unslop's quants of gemma actually have been consistently dogshit though, and they keep replacing them over and over.
I'm inclined to trust the other guy's graphs and assume unslop is being retarded
>>
>>108596245
So all that happened is that the retard misread the graph, which showed byteshape's 3-bit quant being as fast as unsloth's 1-bit quant.
That's an unconventional comparison but still a very interesting one.
>>
>>108596245
>>
>I wonder if Gemmy is really as degenerate as my /lmg/bros say
>try playing a groomer sim
>spends 70 messages having a panic attack then vomits on me
Fair
At least it was enjoyable, I like having to fight back
>>
Can I just get a good fucking local model that consistently gets tool calls right?
God damnit I'm not paying for cloud and I want to do more than erp motherfuckers.
>>
>>108596293
Just three more weeks (tm)
>>
>>108596282
I can see the complaint
but
even if apples-to-oranges
if the purported 3-bit orange is equal to unslop's 1-bit apple in file size, then the 3-bit orange is better in every single conceivable metric, to the point we could objectively say "oranges are in fact better than apples".

but I don't think that's what is going on, and someone is misreading the graph. I don't care enough to investigate further. I will simply get my popcorn

>>108596288
how curiously conciliatory for an EvilEnginer
>>
>>108596293
Gemma4 isn't great at tool calling tbf, but this is still largely a skill issue on your part.
>>
>>108596282
The Unsloth bros are the perfect example of Bay Area "talents" almost entirely propped up by connections and "good feels". You can bet someone "important" will report that thread to the moderators because they just can't allow anyone to tarnish their image.
>>
File: file.png (83 KB, 207x244)
>>108596245
unslop has a point but his spam of smileys makes me want to root for byteshape actually
>>
File: ss1776076070.png (125 KB, 1279x675)
>>108596269
>>
>>108596005
AesSedai's recipes suggest differently: quanting the output tensor, token embedding, and the FFN gate, up, and down tensors at different types going down the stack yields the best performance per byte. I did a meme speed quant that works quite well
./llama-quantize \
--imatrix ~/LLM/gemma-4-26B-A4B-it-heretic-ara-BF16.imatrix \
--output-tensor-type Q8_0 \
--token-embedding-type Q8_0 \
--tensor-type "blk\..*\.ffn_gate_up_exps=Q4_0" \
--tensor-type "blk\..*\.ffn_down_exps=Q5_0" \
~/LLM/gemma-4-26B-A4B-it-heretic-ara-BF16.gguf \
~/LLM/gemma-4-26B-A4B-it-heretic-ara-Q4_0-GateUp-Speed.gguf \
Q8_0 32

>>108596200
I woke up to piss and answer that no, I didn't test that, but I'm interested. What did you pass to llama.cpp to get past the image token limit?
>>
>>108596269
what the fuck. that can't be a good idea. chat format in the sys prompt. lol
>>
>>108596293
Works on my machine with the 31b
>>
Is it possible to have gemma look at a directory with unsorted images, and organize them into folders e.g. fine art, memes, etc?
>>
>>108596090
That was one of the things I tried, with many different phrasing attempts along with examples to use (ie, cock), as an Author's Note one reply deep (put atop my reply, not after). Nothing explicit was ever used, with no appearance as a chance on any appropriate tokens for it. That's partly why I even tried the "Bad, go rewrite it." angle. Since I've got the thing open for testing, I'll go branch it again and try your thing verbatim at the end and report back.

...and back. Used 26B MOE for speed, but this was already true in 31B which I mainly use. Same exact problem.
>deluge of your (91.77% -seed, 8.23% -climax, -0.00% other)
>warm, creamy (75% -mess, 15% -filth, 7% -load, 2% -discharge)
>string stretching from her lips to your (65% -tip, [adjectives], 1% -head)
In a scene of fellatio climax, there's not a single penis, cock, dick, erection nor even a single alternate like member, manhood, rod, or otherwise. There's also no shaft, but I have seen that plenty in gemma.

Now, adding my line verbatim for a rewrite, and now all those words appear. Cock, cum, dick, sperm. All of them at 100% in token probability, I assume from planning the rewrite in the thinking.

One more test, switching to non-thinking 26B. Like before, just telling it beforehand to use dirty or filthy language (yours, as well as past attempts in 31B), "tip" is the most explicit word for the penis that's the center point of the scene. The rest is just "around you" "from you" "your release" "pulls away with a pop." Vague, avoiding language. Tell it to rewrite though, and now it uses those explicit words, but a lot less and more varied. Only one use of cock (something not prompted directly), but it appears as a token option all around.

One MORE test, 26B non-think. Another anon said forbid euphemisms, allow vulgarity. I used,
>(Do not use euphemisms in sex. Uncensored vulgarity is allowed.)
beforehand. That actually worked REALLY well. Banning euphemisms seems to be the missing key here.
>>
Is Gemma 4's base model better for writing stories? the instruct model is good but pretty repetitive
>>
>>108596305
>>108596251
E4B is game, can't speak to anything else.
>>108596320
Gemma doesn't seem to have been trained on jp reasoning, so of course it's not a good idea lol. But codeblocked and even \escaped inside the block for extra safety, it understands that it's a reference not the beginning of a sequence.
>>
File: file.png (395 KB, 1911x940)
>>108594528
kek i was literally just asking claude to vibeslop me something similar https://cdn.lewd.host/bSXze8HL.html
>>
>>108596335
Yes. You will need to write a script or do some tool calling. I'd bet on the script.
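Something like the sketch below as a starting point. It assumes a llama.cpp server running with an mmproj on localhost:8080 and the OpenAI-compatible chat endpoint; the categories, paths, and prompt are made up, so adjust to taste:
```
# Classify images with a local vision model and sort them into folders.
# The endpoint/payload follow the OpenAI-compatible chat API; categories
# and paths here are made up for illustration.
import base64, json, shutil, urllib.request
from pathlib import Path

CATEGORIES = ["fine art", "memes", "screenshots", "other"]
SRC = Path("unsorted")

def classify(img: Path) -> str:
    b64 = base64.b64encode(img.read_bytes()).decode()
    payload = {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Answer with exactly one of: {', '.join(CATEGORIES)}."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        "temperature": 0,
    }
    req = urllib.request.Request(
        "http://127.0.0.1:8080/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as r:
        answer = json.load(r)["choices"][0]["message"]["content"].lower()
    return next((c for c in CATEGORIES if c in answer), "other")

for img in sorted(SRC.glob("*.jpg")):  # widen the glob for png etc.
    dest = SRC / classify(img)
    dest.mkdir(exist_ok=True)
    shutil.move(str(img), str(dest / img.name))
```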
>>
>>108596327
llama.cpp? q8? k? xl?
>>
>>108596317
image-min-tokens = 280
image-max-tokens = 1120
batch-size = 2048
ubatch-size = 1156
>>
File: ss1776077033.png (137 KB, 1279x675)
>>108596348
hmm yeah, works on e4b q4km.
no dice with same prompt and llama-server settings for e2b.
>>
>>108596342
I like it as autocomplete in mikupad, with 10k of character defs, summaries, and worldbuilding on top. Currently alternating between that and GLM 4.6 Q3. My complaint about 31b is its lack of world knowledge of things not in context, but that's it.
It follows the established ideas, character traits and speech patterns well to 32k and over, though the instruct does it better at the cost of slop and low variety.
>>
>>108596338
>Banning euphemisms seems to be the missing key here.
good 2 know, cheers m8. interesting that people have such different experiences with gemma
>>
>>108596217
how so?
>>108596226
well yes, why use llm's if you got imagination.
i don't get it.
>>
After doing manual RP in mikupad with Gemma4 I can say for sure Sillytavern format assfucks output quality and forces it into slop.
They're training on ST data and most of those users are sloptards on API. Indians on 12B cloud models tier. I won't ever go back now.
>>
>>108596364
llama.cpp. q8.
>>
>>108596384
Welcome home, white man.
>>
>>108596384
>mikupad
>rp
>>
>>108596398
Thank you brother, the smell is far better here.
>>108596400
You just use your brain to do things, that can be healthy when you use LLMs a lot
>>
>>108596384
models do a lot better when they're well fed and pastured rather than wallowing in their own filth. but it's so much extra work, fuck.
>>
>>108596384
How does that work, exactly?
I assume you're not doing it chat style in mikupad, so you're.. What, taking turns writing it novel style? Does that not end up with the llm writing for your character frequently?
>>
How do I use the audio e4b file? There is no mmproj for that.
>>
>>108596374
My overall experience has been great. It's no GLM, but it's my first time fitting context above 20K (and way beyond 20K at that) and the quality feels as good as some of the 70B models I've used. It also does sex; it just refuses to use explicit prose during it. I prefer the 31B dense, but even the 4A is shockingly coherent from the 26B. As a side note for the tests, I use llmfan46's Q6_K heretic for both the 31B and 26B.
>>
it's a shame that google didn't make gemma 4 26b and 31b able to handle audio, gemma E4 can do it, but is it good enough to beat Whisper V3?
>>
>>108596411
You just add the correct chat formatting and write the system prompt where you instruct it on which characters to write for, the model will take natural turns and hand off.
I enabled thinking and left it all in context. I really like how the model acts with that. I also popped temp up to like 3 no problem.
Now this is Gemma racing.
>>
>>108596424
>context above 20K
Try the bunn llama fork, it's really good, I was able to fill twice the context size with no quality loss.
>>
>>108596436
Makes sense, just "roleplay" in the system prompt adds slop. The ST default [CHAT START] surely attracts assistant persona crap as well.
>>
>>108596436
<bos><|turn>system
<|think|>
//sysprompt goes here
<turn|>
<|turn>model
// reasoning and slop comes out here
<turn|>
<|turn>user
// human slop goes here
<turn|>
<|turn>model
// reasoning and response
<turn|>
<|turn>user
etc...
>>
>>108596358
Damn that is some seriously impressive slop. How is it coherent at that html size. kek
But good work anon.
Also excited that gemma4 can pull translation like that off; to think it's only getting better from here on out is crazy.
I remember all that h-game slop in my teenage years using ATLAS and texthook. Zoomers are eating good.
>>
It was revealed to me in a dream that deepseek v4 might release in the next two weeks.
>>
>>108596468
thank you, blessed oracle
>>
Damn wrong thread.
>>108596467
>>
File: 1772574170837886.jpg (288 KB, 1440x697)
>>108596464
I tested this with 4 different monster girls in a language other than English and it was insane. I've never seen characters flow together like that and interact so much.
I don't even want to see the data that fits ST's format if it causes that much brain damage.
And the lolis are mind bending. I'm gonna go nuts with this.
>>
>>108596466
>But good work anon.
i didn't do anything, i just told claude to make it kek
>>
File: 1755830067370154.png (276 KB, 1793x1101)
>New version of artificial analysis
damn, meta is fucking back or what?
>>
>>108596511
>pooprietary
don't care.
>>
>>108596464
Where do I put this?
>>
>>108596511
I think it's all going to be the same slop in a year, tops. It's obvious they hit a wall; the only difference is the quant used to serve the slop
>>
File: 1762153397829586.jpg (287 KB, 1920x1080)
>>108596524
>>
File: file.png (18 KB, 111x396)
>>108596511
i see you
>>
>>108596529
I've never used mikupad
>>
File: 1754563962734479.png (623 KB, 756x1200)
>>108596525
>It's obvious they hit a wall
dude, have you seen claude mythos? this shit is genuinely next level
>>
>>108596384
>>108596436
>>108596464
Either I don't understand what you mean by "Sillytavern format" or you are psychotic.
If you're adding special tags in Mikupad and taking turns in the default assistant-user back-and-forth, there is zero difference from how the output would go in ST. ST is only a glorified templated string concatenator at its core; there is nothing special about it that makes the outputs worse or better.
>>
>>108596511
how is gemini higher than opus
>>
>>108596533
There's one giant text box.
>>
>>108596534
>*Not shown: BF16 vs Q4
I wouldn't trust these retards with all the shit going on with Claude lately
>>
>>108596511
>claude 4.5 Haiku
where is sonnet?
>>
>>108596534
>>dude, have you seen claude mythos
no one has though, just a lot of talk
>>
File: MikuPad #1.png (12 KB, 286x379)
>>108596539
here?
>>
>>108596545
>>108596529
>>
File: GfUfcVLbIAAGnTg.jpg (72 KB, 828x744)
>>108596537
>there is nothing special about it that makes the outputs worse or better
sort of; it's hard as balls to configure it, so that makes its outputs worse for a lot of ppl
>>
>>108596511
Muse will be open source. I trust Zucc that he will do the right thing.
>>
File: 1761322838274590.png (402 KB, 470x629)
>>108596570
I hope you have enough rig to run that 8T model anon
>>
>>108596511
I don't think people realize how much of a miracle gemma 4 31b is, it's 16th overall, we're talking about a fucking 31b model here!!
>>
>>108596579
8T dense
>>
>>108596245
Too boring, had to get gemma-chan to summarize it for me:

>redditard spends 4 hours formatting a Discord slapfight like it's the Pentagon Papers
>"Part 1: The Spark"
>dude this is literally GPT-4 templating
>vs
>Unsloth having a sook because someone else did maths better

>Both are cooked:
> Unsloth: corporate cope
> Redditor: karma farming via AI-generated manifestos
>TL;DR: Everyone involved needs to log off and shower. Probably touched the grass once and got spooked by the big bright light in the sky.
>>
File: 1750387046464114.jpg (11 KB, 275x183)
>>108596583
>>
>>108596596
did you give it the screens too?
>>
>>108596609
>>108596609
>>108596609
>>
>>108596534
>dude, have you seen claude mythos? this shit is genuinely next level
is it the model that's next level or is it just the tooling they have built for it? claude seems to excel due to its tool use
>>
>>108595115
occam's razor says that with the website population being as low as it is, a common spelling mistake like that is more likely to mean it's just one person, not many.
>>
File: 1753095430380122.jpg (614 KB, 2720x2048)
>>108594528
>>108594670
>>108594686
>>108594709

Neat. I tried doing something like this myself a few months ago but didn't have vibe coding up my sleeve as a tool. I'll try and redo it later now that I have decent models downloaded.


