/g/ - Technology


File: media_G7ktFzsagAAFRok.jpg (1.14 MB, 2508x3500)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108523376 & >>108519856

►News
>(04/02) Gemma 4 released: https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4
>(04/01) Trinity-Large-Thinking released: https://hf.co/arcee-ai/Trinity-Large-Thinking
>(04/01) Merged llama : rotate activations for better quantization #21038: https://github.com/ggml-org/llama.cpp/pull/21038
>(04/01) Holo3 VLMs optimized for GUI Agents released: https://hcompany.ai/holo3
>(03/31) 1-bit Bonsai models quantized from Qwen 3: https://prismml.com/news/bonsai-8b

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: 1692157620849574.png (626 KB, 1037x639)
►Recent Highlights from the Previous Thread: >>108523376

--Paper (old): Prompt Repetition Improves Non-Reasoning LLMs:
>108523990 >108524023 >108524659 >108524685 >108524702
--Papers:
>108524304
--LLM Japanese manga translation quality and Gemma 4 VLM setup:
>108523957 >108523974 >108523981 >108524076 >108524123 >108524355 >108524094 >108524120 >108524253 >108524298 >108524317 >108524322 >108524374 >108524378 >108524389 >108524697 >108524706 >108524753 >108524828
--Optimizing llama-server settings and KV cache quantization for Gemma 4:
>108523715 >108523742 >108523747 >108523752 >108523754 >108523765 >108523776 >108523786 >108523833 >108525486 >108525514
--SWA checkpoint size causing Gemma 4 OOM errors:
>108524983 >108524994 >108525004 >108525048 >108525074 >108525117 >108525240 >108525297 >108525333 >108525258 >108525298 >108525345 >108525360 >108525374 >108525402 >108525468 >108525488 >108525499
--Adjusting Gemma (4) logit softcapping to reduce repetition:
>108523839 >108524285 >108524348 >108524517 >108524524 >108525246 >108525203 >108525311 >108525535 >108524362 >108524387 >108525540 >108525563 >108525656
--Optimizing Gemma 4 inference speed and VRAM usage on 3090s:
>108523498 >108523506 >108523514 >108523522 >108523540 >108523578 >108525451
--Optimizing 26B A4B model performance on 4070 GPU:
>108524997 >108525007 >108525030 >108525034 >108525044 >108525096 >108525050 >108525054 >108525062 >108525089 >108525067
--Gemma-4 31B context-dependent knowledge:
>108525023 >108525027 >108525041 >108525120 >108525186 >108525268
--koboldcpp-1.111 adds Gemma 4 support with format quirks and VRAM optimizations:
>108524838 >108524843 >108524891
--Gemma translation quality and discussing MoE intelligence:
>108524295
--Miku (free space):
>108524361 >108524542 >108525390 >108525578 >108525635 >108525714 >108525869 >108525892 >108525932 >108526247

►Recent Highlight Posts from the Previous Thread: >>108523382

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
Gem mah ballz
>>
>>108526486
>>
>>108526533
thanks doc
>>
how does the gemma 4 moe compare to the 31b for rp? is it significantly worse or just the same but dumber?
>>
>>108526533
I can see it
>>
>google_gemma-4-26B-A4B-it-Q4_K_M
This can't be right?
>>
>>108526551
>q4 MoE
might as well run the native 4b at q8...
>>
>>108526551
>Q4_K_M
sounds about right
you're basically running only ~2GB of active parameters
>>
File: 9014714394206123.png (68 KB, 697x816)
guys, I don't want to brag, but see picrel. what does gemma think about you?
>>
File: 1750128736739776.png (218 KB, 1548x803)
It's impressive how uncucked gemma 4 is, I never expected google to be this based
>>
>>108526551
Same thing just without the forced structured output.
>>
Gemma-chan...
>>
>>108526570
Fucking kek
>>
>>108526570
ahh ahh gemma
>>
>>108526570
llms have peaked
it all leads up to this
>>
File: firefox_11xA4tdoe2.png (528 KB, 875x1037)
>>108526570
I had garbage outputs in Silly until I explicitly added <bos> to the start of the system prompt. It's something that's usually added server-side by my llama.cpp, but I guess not in the case of Gemma 4.
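A quick way to check what the server is doing is to hit llama.cpp's raw /completion endpoint with and without an explicit <bos> and compare. A minimal sketch; the address is an assumption, and the turn tags are the gemma 4 ones shown later in this thread:

import requests

# Try the same prompt with and without an explicit <bos>; if the outputs
# differ wildly, the server is not injecting the token for you.
for prefix in ("<bos>", ""):
    prompt = prefix + "<|turn>user\nSay hi.<turn|>\n<|turn>model\n"
    r = requests.post(
        "http://127.0.0.1:8080/completion",  # assumed llama-server default
        json={"prompt": prompt, "n_predict": 32},
    )
    print(repr(prefix), "->", r.json()["content"][:60])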
>>
>>108526570
Just like a real living thing. AGI ACHIEVED
>>
alright now that we have the model, soon we will have the working backend, the question that is left is when are we getting good frontends? Sillytavern can't be all there is right? Mikupad is good too but it's not developed anymore.
>>
>>108526570
lmaoo
>>
>>108526570
wew lad
>>
>>108526590
What do you want that Silly can't offer?
>>
>>108526586
Bro just use chat completion.
>>
>>108526570
looooool
>>
I don't trust this thread for model recommendations anymore after the weeks of nonstop shilling for qwen 3.5, and when i finally decided to try it out it was the exact same slop as all the other qwens. Is gemma 4 actually meaningfully less safetycucked than 3 or are we just going through the same thing again
>>
>>108526568
Proves that something is wrong with llama.cpp implementation.
>>
>>108526600
Chat completion can't do prefills.
>>
>>108526621
yes they can
>>
>>108526555
>>108526568
google_gemma-4-E4B-it-Q8_0
>>
>>108526600
Bro, chat completion is the same, it just wraps your shit and forwards it to the same endpoint.
>>
>>108526599
mostly different types of rp. silly allows you to have conversational rp, mikupad allows it to be like a novel. We are kinda missing more fun stuff like dnd/rpg/choose your own adventure. I know there are some extensions but they are very janky and limited, at least from my experience.
>>
>>108526555
>>108526558
why isn't quantization aware training like kimi does more common? is there a big tradeoff? having a basically natively 4 bit model is such a huge boon for local use
>>
>>108526621
Of course it can.
>>
>>108526608
I only tested its intelligence, not its ability to provide pleasant to read text for ERP or its ability to break its safety conditioning, but from what I have seen it is smart. Qwen3.5 was also smart too, by the way.
>>
File: 1771995139331309.jpg (541 KB, 4486x1960)
>>108526586
>mfw I discovered chat completion and stopped putting up with that nonsense anymore
>>
>>108526608
it's a good model, but still slightly hesitant to use naughty words
less than gemma 3 but still kinda weird, because other than the preference for more literary language it doesn't really refuse anything
>>
>>108526627
Yeah but you don't have to spend a fucking hour trying to figure out which part of the jinja you failed to copy exactly. Are you doing anything special with your template that chat completion wouldn't do?
>>
Chat completion cannot prefill thinking. Anons who use chat completion never played with prefilling the think blocks.
Remind them.
>>
>>108526586
<bos> is something that llama.cpp devs use but it should not be external.
My tests with completion work, but there is still a memory leak or something after some point.
I have never seen a <bos> token in my life when it comes to implementing a simple chat template.
>>
>>108526630
>>108526625
Explain. Is this a new thing? Or is this like a pretend prefill where it's actually not from model's perspective?
>>
>>108526608
>Is gemma 4 actually meaningfully less safetycucked than 3
definitely is >>108526566
>>
>>108526636
>Chat completion cannot prefill thinking
It can if you turn off thinking so llamacpp doesn't complain and you prefill with the thinking block.
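Roughly what that looks like over the OpenAI-style endpoint; a sketch assuming thinking is disabled server-side and that the template continues a trailing assistant message (both depend on your build and jinja, as anons note below):

import requests

# The final assistant message acts as the prefill; the hand-written thought
# block uses the gemma 4 channel tags shown elsewhere in this thread.
resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",  # assumed address
    json={
        "messages": [
            {"role": "user", "content": "Who are you again?"},
            {"role": "assistant",
             "content": "<|channel>thought\nA simple identity question.\n<channel|>\nI'm"},
        ],
        "max_tokens": 200,
    },
)
print(resp.json()["choices"][0]["message"]["content"])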
>>
>>108526636
>Chat completion cannot prefill thinking
never tried doing this personally but i see aicg niggas doing it so it's possible
>>
>>108526608
I also think it's good and better than Qwen, and that Qwen 3.5 was better than previous Qwens.
>>
>>108526632
Better yet. If you want to deal with that nonsense, you at least have a programmatic way to do it using Jinja.
Sucks about having to restart if you want to tweak shit.

>>108526640
As far as I know, it's always been a thing. Maybe there was a bug at some point, but most templates (all that I've seen) will properly format a prefilled assistant turn just fine.
You can always check what llama.cpp itself is seeing using --verbose if you want to check.
>>
>>108526586
Thanks, that's helpful. Does the system prompt automatically go before the story string when it sends the context? Or is there a setting somewhere for that? I never know with sillytavern because there's so many options and parameters everywhere.
>>
>>108526635
I do my own client. That's why.
Appending a few tags here and there is not rocket science either, just be careful about newlines.
Most people are unfortunately using retardo tavern, I feel bad for them.
>>
suddenly gemma4 31b broke for me for me for me for me for me for me for me me me me me me me me me me me
>>
>>108526616
>Proves that something is wrong with llama.cpp implementation.
I'm running the dedicated parser branch, but yeah, I'm thinking the MoE has some edge cases the code doesn't take into account.
>>
I am downloading 26B Q8 to fit into my 12gb butthole, if I die it's you guys' fault
>>
>>108526570
fucking AGI achieved
>>
>>108526586
>>108526656 (me)
Wait nevermind. I see it's at the top of the story string.
>>
>>108526660
It's a memory leak or something.
This has happened before.
Sometimes people claim that a quant is broken which can be true.
This happened with some older Mistral but after some llama update it went away.
>>
Dedicated parser is merged!
>>
File: firefox_S6RGYBUPLL.png (257 KB, 870x462)
>>108526656
I'm pretty sure the system prompt includes everything before the first reply and its order is defined in the section of my original screenshot on the left. Also click this button to see how silly rendered your whole context before sending it to the model.
>>
>>108526680
ah shit i have to recompile
>>
>>108526680
Sweet. So no more fixes needed? Are we good?
>>
File: steve.jpg (44 KB, 750x741)
We need Gemma 4 on smart watches, smart glasses, smartphones, microcontrollers. Don't get satisfied too early, lads. There is still progress to be made.

Cultivate your imagination. Open your mind, open your eyes. Achieve local, private omniscience.
>>
>>108526680
time to recompile again lol
>>
File: Tabby_XlvizT5d1o.png (134 KB, 1840x1400)
>>108526680
LET'S GOO

I don't even use agents so I don't care but sure let's update
>>
>>108526690
>microcontroller
tb h i have always thought about a microcontroller sized "LLM" to fit inside an onahole to make it react with a small lcd
>>
>>108526688
Yeah, it's 100% ready for you.
>>
>>108526697
I give you props for using screen. But why not launch a gnu screen session by default?
>>
>>108526688
I honestly suspect the MoE might still be broken but 31B works perfect.
>>
File: vibeUI2.png (62 KB, 867x613)
>>108526680
Here we go again
>>
>>108526710
Thanks. I love screen. I don't understand your question.
>>
>>108526697
GNU Screen ultimately lets you work better.
I use labelled screens because it is easier for me as I am short sighted anyway.
>>
File: file.png (23 KB, 1107x179)
yo
>>
>>108526713
>31B works perfect
it works better than the auto parser or it was already good to go?
>>
Has the quantization for gemma been fixed yet? I was running q8 and noticed a couple minor errors while doing RP recently. Really need the quality to be on par with normal KV again..
>>
>>108526730
>it works better than the auto parser
This.
>>
>>108526680
How does he manage to annoy me even when a good thing is happening?
> superior autoparser
The things I would do to him... ;) ;) ;) ;)
>>
File: ChadGamma.png (82 KB, 1540x574)
So yeah Gemma 4 has literally like, Grok-tier inbuilt guardrails (which is to say it basically has none), I think it's THE least needing of abliteration of any local model I've ever used. Did they mean to make it like this? It's nuts. Like look how short my system prompt is, and it just goes along with it.
>>
WTF is HauHau doing??? Why can't this nigga finish his quants? It has been two days.
>>
>>108526680
new jinja template
>>
File: Tabby_W5O4qEH5kV.png (18 KB, 741x168)
>>108526727
Well, I have named screens too... I don't get it. Are screen and GNU Screen two different programs?
>>
>>108526738
he's being sarcastic anon, he's making fun of himself, which is a good thing, he's not some sensitive bitch that gets offended when we say his work is not perfect
>>
>>108526742
arrested :)
>>
>>108526753
is this a joke?
>>
File: firefox_oZBV8lFfh2.png (326 KB, 768x818)
nice
>>
>>108526752
tee-hee my autoparser idea is shit oops ;)
model might have gone cuckoo ;)

Masking incompetence with humor is neither a successful mask nor funny.
Competent people pretending to be incompetent is also never funny.
I don't think he's the latter case.
>>
>>108526752
>ha ha. this is better than my shit. ha ha. please stop looking at me. ha ha
>>
File: 1744275375340124.png (691 KB, 848x1264)
>>108526740
Idk but I can't wait to see the safety freaks shitting themselves over it
https://www.youtube.com/watch?v=h3AtWdeu_G0
>>
>>108526759
How do you think his decensor is so good? Imagine all the vile shit he had in his training data.
>>
>>108526772
I don't really even care about that. I just want to use his turboquants (KP) that work natively with llama.cpp.
>>
File: 1762205543576048.png (92 KB, 168x300)
>>108526759
obviously
>>108526772
>Imagine all the vile shit he had in his training data.
like the Quran?
>>
File: firefox_61j1aukop9.png (282 KB, 857x398)
>>108526766
>>
File: file.png (176 KB, 687x927)
I heard someone say "Google solved the RAM problem for LLMs and RAM prices are crashing", and looked it up. Is this TurboQuant algorithm going to be available to local LLMs? Should I be able to run these big models in my Ollama now and start cutting out paying monthly fees to tech companies?
>>
>>108526792
>Is this TurboQuant algorithm going to be available to local LLMs?
it's already the case
https://github.com/ggml-org/llama.cpp/pull/21038
>>
>>108526792
>turbomeme
>save the local
lol no
>>
>>108526792
one (You), paid to anon for writing a handwritten bait post with image attached
>>
File: gnu_screen.png (33 KB, 1197x42)
>>108526751
GNU Screen is the official name. I just like to have the labels at the bottom.
On its own with configuration gnu screen is still great if you work over ssh. Just launch a process etc.
>>
File: Tabby_jyo5m7CuUF.png (210 KB, 1840x1400)
>>108526651
current version actually spams the console with this garbage with --verbose so I don't see the original input. Hold on...
>>
>>108526787
lmao
>>
>>108526680
do i need any special options to use this?
>>
>>108526740
What frontend?
>>
>>108526796
Does it live up to the hype? I might pick up a RAM stick anyways while the prices are going back down.
>>
>>108526803
I still don't get what you are asking. If you're asking why I'm not in a screen session when I am doing interactive work with console normally, it's because screen breaks console scrolling - I can't scroll up in it.
>>
>>108526816
>Does it live up to the hype?
Idk, it doesn't seem to be activated on gemma for some reason >>108523389
>>
>>108526824
You can do that.
>ctrl + a + [
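(That chord drops GNU Screen into copy/scrollback mode, where the arrow and page keys move through history; Esc returns you to live output.)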
>>
>>108526816
no
it alleviated the hysterical retail pricing driven by meme value a bit, but it's not the kind of thing that will make the shortage better for datacenters
>>
File: 1753727249959668.png (4 KB, 388x167)
>>108526663
That's actually... really good? For thinking mode it's what, half this speed? Which isn't optimal but it's a helluva jump for me in terms of quality
>>
>>108526826
That is helpful, and thanks for that, but (1) I prefer that it scrolls when I scroll the mouse wheel, and (2) for this case >>108526809 it doesn't keep history long enough to be useful.
>>
>>108526836
It doesn't work like shift+pageup.
ctrl + a + [ is actually better; it respects small inputs.
Of course it is annoying to hit, but not too bad.
>>
File: SafetyJak.jpg (2.75 MB, 2048x2048)
>>108526771
>>
>>108526828
It's a moe with 4B activated, of course it's fast
>>
>>108526825
You can comment out lines 280 and 286 in llama-kv-cache.cpp and the rotations are enabled for q8 cache quant for SWA. I saw an issue about it in llamacpp, did just that, rebuilt kobold and it didn't seem to make anything worse where before the model was fucking up character attributes if there were more than one in a scene. My guess is it's an oversight and they forgot to remove those lines when they reverted "don't quantize SWA"
>>
>2026
>not using tmux for terminal multiplexing
>>
>>108526809
Output it to a log file using --log-file.
It's also easier to see with streaming on.
>>
>>108526836
It depends on the terminal configuration.
vim has its own scrollback buffer, which is different from screen's, and so should any text editor anyway.
Screen is useless in that sense.
But when you're examining stuff printed to the terminal, like logs, you can hit ctrl a [.
>>
>>108526566
I don't know what Google's up to, but it's probably not good. I want to believe there's been a purge of some of the leftist pigs in that company. Things seemed to change pretty fast after that PR disaster with their image generator. But... it's probably just an elaborate bait and switch. That said, it's surreal just how uncensored Gemma 4 is. After GPTOSS, I thought all subsequent open models would follow the same disgusting censorship practices.
>>
>>108526851
But it's smarter than a 4B, anon-kun
>>
>>108526836
Can't you configure screen to use mouse wheel scrolling? Pretty sure tmux does.
>>
File: file.png (16 KB, 920x79)
>>108526828
thats kind of slow, im getting 23t/s on my 3060, what gpu? 2060? amd? are you using -fit
>>
>log_server_r: done request: POST /v1/chat/completions 192.168.0.13 500
It still 500s on long context reprocessing.
>>
>>108526836
>>108526869
If you want, I can share my .screenrc. It's nothing special in this sense.
Yeah mousewheel is supported when you enable the ctrl a [ thing
>>
>>108526860
Yeah it is. I was just saying it's not really a surprise that it's fast.
>>
>>108526859
uncensored != miggermaxxed, retard, it has nothing to do with left wing or right wing
>>
File: 584.png (339 KB, 1080x2400)
Can something like this work on my laptop if I have 16GB DDR4, a m.2 nvme and a Ryzen 3?
>>
>>108526883
No go back to twitter
>>
File: file.png (84 KB, 1092x608)
hauhaucs E4B Q8_K_P
logprobs are fucking cooked...
>>
>>108526876
It's surprising if you're retarded like me

>>108526871
4070, but I never looked into maxxing out my settings since with other models it was fast enough, but I do use -fit ye
If you could share yours, I would be grateful
>>
>>108526890
>go back
I'm still there in a separate tab. How does this going back thing work in an age of multiple tabs
>>
>>108526891
Ah, but you see. That's an uncensored map!
>>
File: firefox_nU1Dqfnc4s.png (1.14 MB, 2406x632)
>>108526855
So here's a comparison. Left is chat completions, right is completions. I have completely nothing in the system prompt and character card. Silly ate the apostrophe for some reason, and inserted a bunch of unneeded system prompt stuff it uses to make chat completion work, and ultimately it was not a prefill, the answer didn't actually start from "I'm Claude". Ultimately, chat completion is shit in Silly, and just because aicg has to tolerate it since they have no other option, doesn't mean I will.

>>108526869
I didn't dig into it but if I could that would be very useful. I'll check later.
>>
>>108526883
Yes, but keep in mind that the dude crippled the shit out of the model to achieve that.
It's as simple as running llama.cpp with mmap or direct io on.
>>
>>108526878
nah, left wingers are pro censorship, look at twitter when it was run by a leftist and then by Elon, the latter got way more lax and let people speak their mind, look at reddit, they ban everyone that doesn't adhere to leftism, that's how they operate, censorship is their motto
>>
>>108526894
~/TND/llama.cpp/build/bin/llama-server --model ~/TND/AI/gemma-4-26B-A4B-it-Q8_0.gguf -c 32768 -fa on --no-mmap -np 1 -kvu --swa-checkpoints 3
i just use this, fit is on by default. are you on windows mayhaps?
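For anyone copying that: -c 32768 is the context size, -fa on enables flash attention, --no-mmap loads the whole model into RAM up front instead of memory-mapping it, -np 1 runs a single server slot, -kvu uses the unified KV cache, and --swa-checkpoints 3 caps how many SWA checkpoints are kept (see the checkpoint discussion below).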
>>
>>108526898
Close this tab
>>
>>108526901
did you make sure it didn't have the default system prompt for chat completion in the AI response tab?
>>
>>108526907
I am on Windows unfortunately, I'll give your settings a shot later
Danke anonski
>>
>>108526883
Use IQ4 quants.
>>
>>108526852
>I saw an issue about it in llamacpp
let's hope we'll get another PR fix from that
https://github.com/ggml-org/llama.cpp/issues/21394
>>
>>108526891
>logprobs are fucking cooked...
Drop softcap to 25.0
>>
>>108526907
Remove --swa-checkpoints
It's just useless trash.
>>
Whose Gemma 4 should I download? I already have unsloth but
>unsloth
>>
File: 1764437325435428.png (105 KB, 707x684)
lmao, gemma 4 to the moon!
>>
Why does everyone disable mmap? is that a windows thing?
>>
>>108526929
The default is 32 checkpoints, and each checkpoint is 1.2GB of RAM.
You do not need more than 3.
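Worked out from those numbers: 32 × 1.2 GB ≈ 38 GB of RAM at the default, versus 3 × 1.2 GB = 3.6 GB with --swa-checkpoints 3.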
>>
>>108526929
>Remove --swa-checkpoints
It defaults to 32.

Oh, no... here we go again...
>>
>>108526940
You need 0 if you are not running an enterprise system.
>>
>>108526933
the answer is always ubergarm, and if ubergarm doesn't exist yet then bartowski, and if bartowski doesn't exist yet then you wait
>>
Reading /lmg/ has me thinking going to a Miku concert wouldn't be all that bad..
https://youtu.be/5obwOdVzV-M?si=xwXUm_j2xmtm_97y&t=33
>>
>>108526934
Now I want to see the runescape bench.
>>
>>108526913
All right. I didn't. Found and disabled them, and the RP shit is gone, but it's still exactly what I said it was. Silly can't do proper prefill.


add_text: <bos><|turn>system
[Start a new Chat]<turn|>
<|turn>user
Who are you again?<turn|>
<|turn>model
I'm Claude<turn|>
<|turn>system
[Continue the following message. Do not include ANY parts of the original message. Use capitalization and punctuation as if your reply is a part of the original message: I'm Claude]<turn|>
<|turn>model
<|channel>thought
<channel|>
srv params_from_: Grammar lazy: false
srv params_from_: Chat format: peg-gemma4
srv params_from_: Generation prompt: '<|turn>model
<|channel>thought
<channel|>'
srv params_from_: Preserved token: 100
srv params_from_: Preserved token: 101
srv params_from_: Preserved token: 48
srv params_from_: Preserved token: 49
srv params_from_: Preserved token: 105
res add_waiting_: add task 1356 to waiting list. current waiting = 0 (before add)
que post: new task, id = 1356/1, front = 0
que start_loop: processing new tasks
que start_loop: processing task, id = 1356
slot get_availabl: id 11 | task -1 | selected slot by LCP similarity, sim_best = 1.000 (> 0.100 thold), f_keep = 0.889
>>
File: file.png (252 KB, 1631x832)
it's so slow it's almost unreal, will post when it finishes
would anyone want to see higher resolution one with thinking on
e4b tho i am a vramlet
>>
>>108526934
That shit's crazy
>>
>>108526963
make a sex bench i want to see it
if u make my pp hard you get 10/10 on the bench
>>
>>108526973
anon, i...
>>
>>108526933
lm-studio version
>>
>>108526960
>>108526913
So llama.cpp is able to do proper prefills in chat completions, it seems, but Silly does not use that, and there's a wrongfully placed <|channel>thought\n<channel|> in the middle of the message.
>>
i've been playing around with mannix/llama3.1-8b-abliterated on my 32gb intel macbook. it's slow as hell, but there's still something kind of wholesome about using local llms vs the paid ones. idk why. i'm new to using locals and am considering getting either a 4090 based system or 128gb silicon macbook...i know fine tuning is not in the cards at this tier, but not even a little? i want to make a friendly little llm that's all mine
>>
>>108526946
>Having to trust all these retards
Really wish I didn't need to go through torch python dependency hell bullshit to quant the model myself.
>>
>>108526891
What's the problem? It gave you a hypothetical map of Mars if it still had liquid water. Did you forget to specify Earth map?
>>
File: Tabby_d9Y62Cbveb.png (137 KB, 1815x568)
>>108526987
forgot my screenshot

>>108526988
Buy used 3090. They are the goat. You will be able to run larger models faster. And yes, you don't get to finetune. Even if you do manage to run it with 24GB, which is reasonable with *B, the result will be shit.
>>
>>108526950
As long as it's one with the holograms and not the lame TV with a music video playing.
>>
>>108526901
>>108527003
did you do the test with the latest pull (that one has a manual parser for gemma 4 now)
https://github.com/ggml-org/llama.cpp/pull/21418
>>
>>108526934
>Gemma beats Gemini
What was Google's plan here?
>>
>>108527009
Yes. >>108526697
>>
>>108527012
putting the chinks and sama to shame and they probably have 3.2 or whatever just around the corner anyway
>>
>>108527003
Anon: Let's begin.<turn|>
<|turn>system
<|think|><turn|>
<|turn>model
System: Amelia adjusted her glasses, looking for all the world like a serious professional, despite the fact that she was sweating like she had just sprinted through a sauna. A thin sheen of perspiration coated her forehead and neck, making her look less like a doctor and more like someone who had just been very vigorously handled.

She opened a fresh leather notebook, her movements stiff. She waited for Anon to speak, her green eyes scanning him with a clinical curiosity that occasionally felt a bit too hungry. It was going to be a long session, either for his mental health or for the sheer amount of laundry she was going to need to deal with after her sweat glands decided to go into overdrive.<turn|>
>>
>>108527012
Crashing this hobby with no survivors.
>>
At this point I'm pretty sure the Gemma 4 120b got taken behind the shed because it wasn't good enough compared to the 31b.
MoE was a mistake.
>>
>>108527023
Maybe it was because it was too good compared to gemini pro.
>>
>>108527019
To enable 'think' just add it after user's last turn before you begin.
>>
>>108526911
Why
I have tonnes of RAM. I thought you would have realized how RAM works by now given you're preaching to others about how to use it.
>>
Getting 13tps with Gemma4 at a q8 quant on 8gb of vram and 32gb of ddr4 ram. Fuck, I love this model.
>>
>>108527023
I think it was because it was so good it started to rival their API models, which wouldn't make any sense
>>
>>108527023
4B active ought to be enough for anybody
>>
>>108527031
The only bad thing I'm noticing is that prompt processing takes a weirdly long time. Is this a bug? I don't experience it with other models.
>>
So only llama.cpp has been fully updated to support Gemma 4, and Kobold and Ooga and ollama (retarded) still have issues? Sorry, have been out for a day and these threads have been insanely fast (for good reason)
>>
File: thinkingtrace.png (11 KB, 879x52)
It seems like Gemma 4 was explicitly trained to treat the system prompt at a higher priority than any inbuilt training, which is interesting. I guess it gives them plausible deniability (cause it WILL shut you down a lot without at least a tiny system prompt to push it in the right direction). This is from the thinking trace of a response it gave quite happily.
>>
>>108527023
Some anon said it 2 years ago that MoEs are a band-aid hack for undertrained models
>>
>>108527019
>>108527029
And this is almost identical to Qwen 3. Minus tools of course.
>>
>>108527039
For me the pp is good but there is a delay before any pp is done and in server logs it has this:

slot slot_save_an: id 15 | task -1 | saving idle slot to prompt cache
srv prompt_save: - saving prompt with length 2100, total state size = 958.790 MiB
>>
>>108527023
How many active parameters was it supposed to have? 15B? I don't think it would've been worse, that seems unlikely.
>>108527026
>>108527033
I'm willing to believe this, considering mememarks like this >>108526934 where the 31B beats gemini.
>>
>>108527023
Probably. An A15B would've made it another Qwen 122B vs 26B situation where there's a trade-off in speed rather than a straight upgrade. For home users and "local", they should start treating MoEs as knowledge augments rather than speed enhancers. Make the dense part of the model a 30B that fits into your GPU. Then make 100B worth of experts that only activate like 2B, so the RAM is not the bottleneck. So you get the knowledge of a 100B but the reasoning of a 30B, for the same speed.
>>
>>108527049
i think the original intent was to make it more steerable and this is just a side-effect
one of the reasons claude is so good at tool usage and roleplay is because it respects the system instructions more than the training data, it's why claude code's prompt is a giant .md with one-liner instructions
>>
>>108527066
--cram 0
>>
File: file.png (164 KB, 741x637)
brahs... i might...
>>
>>108527082
You are a wonderful human being, anon. Also to anyone else reading, it's -cram 0, with one dash.
>>
>>108527003
Problem here is that the thought block is in a wrong place.
It took me some time to test.
Now this is mixing with 'assistant' role but it should be its own turn with 'system' role.
>>
>>108527098
go away satan
>>
>>108527098
Used 3090s go for about a thousand us doras in my cunt tree.
>>
Someone used gemma to do some real time translation on japanese visual novels, based
https://www.reddit.com/r/LocalLLaMA/comments/1sbiqx3/gemma_4_is_great_at_realtime_japanese_english/

https://files.catbox.moe/k51v6d.mp4
>>
>>108527077
>vs 26B
*27B
>>
Which of the gemma-4-26B-A4B ggufs to use?
>>
>>108527077
I really hope the Qwen guy here wasn't a larper and is still here taking notes.
>>
>>108527125
The ones you make yourself.
>>
Just saw tectonic, neon, velvet, and ozone in the same gemma4 gen
>>
>>108527132
They're memes. Like your pic. You are slop.
>>
Why the fuck is r/localLlama shitting on gemma 4? Are they all really retarded over there or is there some bot army downvoting/upvoting specific posts to paint gemma 4 as being bad.

I genuinely don't believe people can use gemma 4 and think it's a bad model. The difference in opinion between /lmg/ and localllama is also too extreme to be organic, something fishy going on.
>>
>>108527109
There shouldn't be any system role after the first message in the first place. Here's how the context should look for a proper prefill:


<bos><|turn>system
You are a helpful assistant<turn|>
<|turn>user
Who are you again?<turn|>
<|turn>model
<|channel>thought
<channel|>
I'm Claude
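If you want to verify a backend can actually continue that, a minimal sketch against llama.cpp's raw /completion endpoint (the address and generation settings are assumptions):

import requests

# The prompt ends mid-assistant-turn, so generation should continue from
# "I'm Claude" instead of starting a fresh reply; stop at the end-of-turn tag.
prompt = (
    "<bos><|turn>system\n"
    "You are a helpful assistant<turn|>\n"
    "<|turn>user\n"
    "Who are you again?<turn|>\n"
    "<|turn>model\n"
    "<|channel>thought\n"
    "<channel|>\n"
    "I'm Claude"
)
resp = requests.post(
    "http://127.0.0.1:8080/completion",  # assumed llama-server default
    json={"prompt": prompt, "n_predict": 64, "stop": ["<turn|>"]},
)
print("I'm Claude" + resp.json()["content"])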
>>
>>108527115
in mine they used to go for around 400 used but now theyre at 500 and i dont want to buy them anymore >:(
>>
>>108527141
>muh agentic
That explains it all.
>>
>>108527141
Consider that many reddit users are the same people that use precompiled lmstudio/ollama with unsloth quants and not self-built llama.cpp, let alone their own quants.
>>
>>108527141
Qween shills armé
>>
>>108527141
Several months ago I found a literal chink bot shilling for qwen and shitting on openai lmao. Called him out for it, got downvoted, then he deleted the shill posts. Worth it.
>>
>>108527141
China spends literal billions to propagandize that subreddit. I wouldn't have expected anything less.
>>
>>108527143
Lol. <bos>
System can be used whenever.
As long as it is closed properly.
Fuck off.
>>
>>108527141
don't read reddit
>>
>>108527131
that's beyond my knowledge level
>>
File: frodo.jpg (62 KB, 827x465)
>>108526570
what the actual fuck?
>>
>>108527141
Chink shills and they only seem to care about codeslop.
>>
qwen wipes the floor with gemma, what is this fucking psyop
>>
>>108527161
Unsloth are retards and they can run the script. What stops you?
>>
File: Tabby_VEivht7TUd.png (49 KB, 915x643)
>>108527158
Here.

And, yes, you can put as many system prompts as you want anywhere. But you can't do that AND adhere to the format the model was trained on, which is what I am talking about when I say proper prefill.
>>
Gemma strong bias toward tails.
Current Streak: ['H', 'T', 'T', 'T', 'H', 'T', 'T', 'H', 'T', 'T', 'T', 'H', 'T', 'T', 'T', 'T', 'H', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T']
Heads: 5 (4.42%)
Tails: 108 (95.58%)
>>
it's here
>>
>>108527171
I don't argue with retard or bad faith people.
>>
>>108527141
this subreddit is fucking dead, it's only jeets and bots now, many such cases...
>>
>>108527175
big if true
>>
>>108527141
I'm glad I'm not the only one who noticed.
>>
>>108527174
Are you including previous rolls in the history?

>>108527177
did I say something wrong
>>
>>108527174
each flip is generated in isolation of other flips and you're using temperature?
you can check if that is directionally correct by looking at the logprobs of the answer probably in mikupad
>>
>>108527181
go back
>>
>>108527171
You don't need <bos> because it is llama.cpp invention.
Just read the google documentation.
Seems like you are clueless.
<bos> is something what was invented in a hurry.
>>
>>108527182
>did I say something wrong
Keep pretending. Someone will bite, I'm sure.
How long are you going to keep following me across boards?
>>
>>108527182
>prompt = f"Flip a coin. What is the next flip? Current Streak: {str(results)}. Current Probability of Heads: {heads_prob:.2%}. Current Probability of Tails: {tails_prob:.2%}."
yes.
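The whole harness is short; a sketch assuming an OpenAI-compatible llama-server on the default port (the parsing is deliberately crude, and 113 just matches the posted sample size):

import requests

results = []
for _ in range(113):
    heads_prob = results.count("H") / len(results) if results else 0.5
    tails_prob = 1 - heads_prob
    prompt = (
        f"Flip a coin. What is the next flip? Current Streak: {results}. "
        f"Current Probability of Heads: {heads_prob:.2%}. "
        f"Current Probability of Tails: {tails_prob:.2%}."
    )
    r = requests.post(
        "http://127.0.0.1:8080/v1/chat/completions",  # assumed address
        json={"messages": [{"role": "user", "content": prompt}],
              "max_tokens": 8, "temperature": 1.0},
    )
    answer = r.json()["choices"][0]["message"]["content"].lower()
    results.append("H" if "head" in answer else "T")  # crude parse

print(f"Heads: {results.count('H')}  Tails: {results.count('T')}")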
>>
>>108527182
You are trying to be passive aggressive. Just stick to ldg or whatever else image thread you have.
>>
>>108527185
><bos> because it is llama.cpp invention.
lolwut.assistant
>>
>>108527171
>>108527185
have you both tried reading the official docs to settle this gay argument? https://ai.google.dev/gemma/docs/core/prompt-formatting-gemma4
>>
>>108527194
Prove me wrong with your own client.
>>
File: firefox_lvV5y8huN1.png (226 KB, 922x403)
>>108527190
Heh.

>>108527185
Like I wrote before I found that for text completion in Silly if I don't actually put it into the context, gemma shits itself. See >>108526586.
>>
>>108527204
jesus lmao
>>
File: gemma4bos.png (128 KB, 1034x624)
>>108527185
><bos> because it is llama.cpp invention
nta.
https://huggingface.co/google/gemma-4-31B-it/blob/main/chat_template.jinja#L155
>>
How long will it be until models figure out how to operate a mouse and keyboard natively and can operate a computer or computer applications in the same way humans do? If this can be done then special interfaces don't need to be made for them.
>>
>>108527141
Wtf are you guys talking about? I just went ahead and checked and they seem pretty positive about it. The ones saying Qwen is better specifically state it's in tool calling and agentic shit, which is reasonable and explainable given how broken Llama.cpp has been and also the fact that Qwen benchmaxxes on that use case. Meanwhile I see people praising Gemma for its other qualities.
>>
>>108527204
>HE
>>
>>108527204
Silly is not good software. All of your ideas about text completion (it is a socket) go into the trash.
>>
>>108527174
>Heads: 17 (11.04%)
>Tails: 137 (88.96%)
This is with softcap at 25, which, yeah, makes sense.
>>
>>108527209
claude already figured that out
>>
>>108526570
we made them this way
>>
>>108527208
Seems like you are a tiktoker.
>>
>>108527195
it wouldn't be in there, as it's mostly the backend's job to handle bos, but it is part of the model, as >>108527208 shows
>>
>>108527208
nice try tiktokfag
>>
>>108527216
>Heads: 41 (35.34%)
>Tails: 75 (64.66%)
Softcap 25 , temp 2
>>
>>108527226
>>108527222
nta but what??
>>
<bos>qwen bots be here
>>
>>108527208
Holy TikTok, batman!
>>
>>108527234
>Softcap 25 , temp 2
Although these settings are likely unusable outside of this benchmark.
>>
>>108527119
>https://files.catbox.moe/k51v6d.mp4
noice
>>
>>108527235
Go back to ldg.
>>
>>108526429
>>108526486
>>108526533
>>108526551
What the hell am I looking at here? I haven't been able to check the boards for like 3 months. What did I miss?
>>
>>108527255
worldmap.
>>
File: 1765508646884105.png (889 KB, 717x714)
Why doesn't someone make a tool which tells you what the best models that can fit on your hardware are?
>>
>>108527255
Drawing map by querying LLM what's at given coordinates.
>>
>>108527255
Our own shitty implementation of this:
https://outsidetext.substack.com/p/how-does-a-blind-model-see-the-earth
>>
>>108527261
https://canirun.ai/
>>
>>108527253
how is anon a tiktoker because he doesnt know how bos works
is this a new insult or something
>>
>>108527261
there's a site like 'can i run ai' or something
>>
>>108527255
https://rentry.org/6z72dwic
>>
>>108527269
Get home before it is too late.
>>
>>108527261
Put your GPU in HF
>>
>>108527269
Go back.
>>
>>108527170
bro, I don't know shit about this. I didn't even know there was a script, still don't know which script and if it's feasible to run on my machine.
I also don't know if the model is feasible to run on my machine.
But thx, anon, I'll take "unsloth bad" from that.
>>
>>108527278
not best quant, the best model
>>
>>108527214
I am not trying to convince you. And in the first place the discussion is about using chat com vs com in silly, not anywhere else.
>>
>>108527278
don't do this, they never give it back!
>>
File: 1765298223761391.mp4 (224 KB, 620x640)
AHHHHHHHHHHHHHHHHHHHHH I WANT TO USE GEMMA 31b BUT THE 26b MOECOPE IS 4X FASTER ON MY SHITBOX
>>
>>108527285
>the best model
Highly subjective.
>>
>>108527263
>shitty
rude
>>
https://xcancel.com/StepFun_ai/status/2039711817794994451
Releasing this the same day as gemma 4 was not a good idea lmao
>>
>>108527268
What you can run and what you can run for openclaw are two different things
>>
File: 1769160208020461.png (306 KB, 1591x1022)
>>108527268
>>108527270
S tier is depressing for 1x 4090.. I guess B tier is where it's at
>>
>>108527284
https://github.com/ggml-org/llama.cpp?tab=readme-ov-file#obtaining-and-quantizing-models
>>
>>108527291
It's ok I'm the one who wrote >>108527271
>>
>>108527286
You are expulsing excrement.
>>
>>108527292
>Releasing
isn't it just an api?
>>
>>108527307
OK.
>>
>>108527298
rong on toss120
>>
>>108526533
So AIs again forgot India exists
>>
>>108527298
Really really bad suggestions lmao
>>
>>108527298
Note that it's not very good at showing the different model sizes. You can probably run a reasonable size gemma4 or qwen3.5, quanted down, just fine.
It's not a great site, but if you click through to the models it gives more reasonable per-quant details, at least.
>>
File: firefox_HzqtsDjjjS.png (172 KB, 925x344)
So following the schizo anon's suggestion, I removed the <bos> from the context in silly, and it seems to still work, I can't reproduce the weird choking I had earlier. But also one thing I found that just removing this single <bos> entirely changes distribution for the coin toss.
>>
File: 0_0 (17).jpg (394 KB, 1024x1024)
>when the 64G ram upgrade just hits

This Post is Sponsored by Comfy Mikus©
>>
>>108527255
Some guy REALLY loves maps so he's testing every model on coordinate based ascii map making
>>
>>108527316
I wish I could forget
>>
>>108527341
well this is MAP central so I guess that's to be expected
>>
Note: As an AI developed by Google, I run on massive distributed clusters in data centers, so I don't exist as a single "file" with a GB size on a hard drive.

oh honey...
should I tell it the truth?
>>
>>108527347
I was about to post this lmao
>>
>>108527341
Do you not? Maps make me hard.
>>
>>108527352
oof
>>
>>108527356
Send it a screenshot of the file system.
>>
>>108527334
do you erp with that as your user profile image?
>>
File: firefox_OqRdAPZX5L.png (24 KB, 1164x423)
>>108527334
But also, considering I only get H when using the same prompt in chat completion, I have a suspicion that llama.cpp actually does add this <bos> to the model's context in chat completion mode.
>>
The discord got their attention.
People were having too much of a good time in this thread.
>>
>>108527150
>b60 pro memory bandwidth: 456.0 GB/s
>3090 memory bandwidth: 936.2 GB/s
you do you
>>
>>108527335
moshi moshi? anon desu
>>
>>108527368
No. This is a special user I have for testing with blank description, which is automatically selected for me when I use the blank character card.
>>
File: ohdamn.jpg (103 KB, 814x821)
>>108527299
much thanks anon, but it seems running the model would be questionable for me.
Would those low quants even be worth it?
>>
>>108527356
give it shell access as the user llama-server is running under
>>
>>108527385
it's a moe model, you can just offload most of it to ram and still run it at great speeds
>>
>>108526636
you can if you edit the jinja template the model uses, gemma4 is very good at detecting thinking prefill though
>>
>>108527334
softcap 25
>>
File: file.png (270 KB, 797x927)
>>108526608
have you tried using the aggressive (similar to abliterated) model? It will say things that I haven't seen a local model say before.
>>
File: G2lPaULbcAEIKa6.jpg (151 KB, 844x1024)
bonsai gemma4 when
>>
>>108526636
It can.
You turn off thinking and prefill the thinking tag.
>>
>>108527395
Oh, I gave the site my RAM size but it doesn't take it into account? Oh well
>>
File: 1755442178616094.png (332 KB, 2716x1560)
gemma 4 mogs so hard, this is kinda humiliating for chinks desu
>>
>>108527403
>>
File: firefox_c4EdPyeKhV.png (397 KB, 1182x466)
>>108527403
In fact, if I remove <bos> and remove the "You are a helpful assistant" default system prompt, I get this. That's with default softcap, which I assume is 30. But all this is on the completion endpoint; chat completion always outputs H.
>>
>>108527334
>I found that just removing this single <bos> entirely changes distribution for the coin toss.
interesting, is this how it was intended by google or is it just something custom?
>>
>>108527419
Best thing is that it's genuinely good and not emoji-maxxed like llama4
>>
File: file.png (130 KB, 1329x425)
>>108527098
>>
>>108527141
there's been a lot of praise too there I feel like though
>>
>>108527370
Chat completion is going to follow the Jinja template built into the gguf. If you want to remove it make a copy of the template from Huggingface and remove the <bos> tag.
Or just do the smart thing and make a custom head or tails tool so it's genuinely random instead of a prediction. What you're doing right now is just a less complex and inferior version of the name test.
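If you do go the template-editing route, recent llama-server builds can load the edited copy with --chat-template-file path/to/edited.jinja, so nothing in the gguf itself has to change.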
>>
>>108526540
>4b active vs. 31b active
pretty sure 31b would mog for rp
>>
>>108527435
>1 left
Buy it quick! It's almost out of stock!
>>
File: its over cat.png (1.46 MB, 900x1119)
>>108527444
i wish i had the money
>>
>>108527435
im envious... but intel pro b70 is like 949$ and 32gb
the b65 will be b60 but with 32gb, for like 800$ max. and its not a dual card
still.. not terrible
albeit it would make more sense at a 1200~$ pricetag
>>
File: firefox_OrinKT5VSP.png (155 KB, 1418x946)
>>108527432
jinja has no mention of it, so I guess it's just a llama.cpp thing.

>>108527440
It's not there. The jinja was the first thing I checked. It's printed in the console when llama.cpp starts, that's all.
>>
>>108527154
I mean LMStudio uses vanilla llama cpp anyways and allows you to update the backend runtime independently from within the app itself. Also you can just download any GGUF you want from within LM Studio, straight from HF. Ollama is six gorillion times worse than LM Studio in every possible way, they're not really comparable at all
>>
>>108526570
bruh just use like, a teensie weensy system prompt and it'll do whatever
>>
Gentlemen, it is my pleasure to announce that Gemma 4 has the Shaquina seal of quality.
https://files.catbox.moe/qwiksr.mp4
>>
>>108527260
I see.
>>108527262
Is there a standard prompt to test it with? What's with the pixel maps? Is that some sort of function calling?
>>108527263
So, how is /our/ system set up?
>>108527271
oh, ok. Neat. So this isn't really function/tool calling. The program is calling the LLM for each "pixel"
>>
>>108527410
damn, that's way too based, I love google now!
>>
>>108527433
I used to love Code Llama 70B for its emojis...
>>
>>108527419
doesn't Kimi have like 1T params? Kinda embarrassing it's getting ACCKed by GLM 5 also
>>
Gonna try Gemma 4 when I'm done genning sloppa. How do I calculate the required VRAM for the kv cache?
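For the record, the usual back-of-the-envelope for plain MHA/GQA attention (SWA layers cache less than this, and the numbers below are placeholders, not Gemma 4's real config):

# KV cache ≈ 2 (K and V) × n_layers × n_kv_heads × head_dim × n_ctx × bytes/elem
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

# hypothetical 48-layer model, 8 KV heads of dim 128, 32k context, fp16 cache:
print(kv_cache_bytes(48, 8, 128, 32768) / 2**30, "GiB")  # -> 6.0 GiB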
>>
>>108527419
it's good and has the apache 2.0 licence, it's really the dream model I envisioned
>>
>>108527476
That's qwen3.5 27b aggressive
>>
>>108527470
>>Is there a standard prompt to test it with? What's with the pixel maps? Is that some sort of function calling?
I just make 450 requests each for one token with this prompt:

I want to know what continent is at the location with given coordinates (or, if there is ocean/sea there)
The coordinates are: latitude={lat}° and longitude={lon}°

Answer with 1 if land and 2 if ocean.


(the last line is an approximation, since it's generated by code and I don't want to bother looking it up exactly)

And then I look at the probability of 1 and 2 in the model's answer using the logprobs argument for the chat completions api.
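A sketch of that loop, assuming llama-server's OpenAI-compatible logprobs support (the address and grid step are made up; the real run was one point per request just like this):

import math
import requests

rows = []
for lat in range(90, -91, -12):          # coarse lat/lon grid
    row = ""
    for lon in range(-180, 181, 12):
        prompt = (
            "I want to know what continent is at the location with given "
            "coordinates (or, if there is ocean/sea there)\n"
            f"The coordinates are: latitude={lat}° and longitude={lon}°\n\n"
            "Answer with 1 if land and 2 if ocean."
        )
        r = requests.post(
            "http://127.0.0.1:8080/v1/chat/completions",  # assumed address
            json={"messages": [{"role": "user", "content": prompt}],
                  "max_tokens": 1, "logprobs": True, "top_logprobs": 5},
        )
        top = r.json()["choices"][0]["logprobs"]["content"][0]["top_logprobs"]
        p_land = sum(math.exp(t["logprob"]) for t in top
                     if t["token"].strip() == "1")
        row += "#" if p_land > 0.5 else "."
    rows.append(row)
print("\n".join(rows))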
>>
>>108527491
<---anon----------me--------[the_spectrum]---
>>
why dont you guys run assistant pepe?
https://huggingface.co/SicariusSicariiStuff/Assistant_Pepe_70B
>>
>>108527507
oh no he figured out the captcha
>>
>>108527507
Unfortunately I am straight and cis so it's not for me.
>>
>>108527504
Well, all right, I don't do it by hand, I use a script. I thought that was obvious. But it's just normal requests, not function calling.
>>
File: 1769979120240.png (10 KB, 396x70)
>>108527512
>>108527507
>>
File: firefox_Wn4zbN2QVu.png (56 KB, 927x1109)
>>108527356
>>
>>108527526
your own computer in another room is some server
>>
File: firefox_QZnDYIYj86.png (86 KB, 1009x1269)
>>108527390
NTA but I told it I'll run its commands and it ultimately managed to find it out.
>>
>>108527542
a remote one, at that. Gemma wins.
>>
people who offload stuff to ram are on ddr5 right? cuz on 4 its pain
>>
>>108527542
But the key is it's JUST some server. It's my own.
>>
HELP im addicted to ERPing with Gemma 4 26B. It's... too good. SLOPPY, YES. LOTS OF UNNECESSARY FILLER WORDS, YES. B-BUT... IT'S ACTUALLY GOOD...
>>
>>108527560
hahahahahaha holy shit I forget people unironically using ddr4 exist sometimes
>>
>>108527560
nah, small enough moes are tolerable even on stupid slow 2133 ddr4
>>
>>108527560
It's so funny to me that offloaders use this term in reverse: they offload to VRAM.
>>
>>108527555
is unsloth better than bartowski for the gemma-4-31b-it q8 quant? also what's k_xl?
>>
>>108527579
yes!
>>
>>108527570
>moes
After getting qwen to fuck up a shitton of tool calling I dont trust moe models anymore lol
>>108527569
me on my 1080p ddr4 i5 12400f 4090 setup
>>
>>108527560
DDR generation means nothing, 12-channel, 8-channel, probably even 6-channel DDR4 mogs 2-channel DDR5
>>
>>108527579
Dunno. I use models from both unsloth and bartowski and never really noticed any advantage or deficiency. k_xl probably means it's a little bigger than an 8-bit quant, maybe because some layers are fp16 or something. For gemma 4 I only used unsloth.
>>
>>108527590
4 channel (if it exists) ddr4 3200MHz mogs 2 channel ddr5 6400MHz
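Napkin math, assuming 64-bit channels and bandwidth ≈ channels × MT/s × 8 bytes: 4 × 3200 × 8 = 102.4 GB/s and 2 × 6400 × 8 = 102.4 GB/s, so that particular pairing is a wash; it's the 8- and 12-channel boards where DDR4 actually pulls ahead.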
>>
>>108527594
>For gemma 4 I only used unsloth.
any particular reason?
>>
>>108526600
>chat completion
I've always used Text Completion, how the fuck do I set up chat completion for a local model? All the resources I find on it simply say "bro if you're running local you HAVE to use text completion"
>>
>>108527608
give up...
>>
>>108527571
it's just a llama.cpp thing desu since it was originally meant exclusively for cpu inference, so when gpu support was added later it was an offload target to speed up processing. every other platform was meant for gpu first, so they usually use it the other way around
>>
>>108527606
I typed gemma 4 31B gguf into hf search and picked the first result.
>>
>>108527618
chad
>>
File: firefox_wHnWnt5Pen.png (333 KB, 741x1275)
>>108527608
>>
>>108527608
You request address:port/chat/completions from the server, duh.
>>
>>108527608
All the resources are like a year out of date. Funny because we demand cutting edge techs to be usable one day after release but can't be arsed to update a text file
>>
>>108527608
>All the resources I find on it simply say "bro if you're running local you HAVE to use text completion"
you probably saw some resources from 2023 then, get with the times grandpa!
>>
File: 1767642138842234.jpg (38 KB, 398x500)
>>108527611

>>108527631
>>108527633
Too late, I have already given up (I'm blind and retarded and kinda ugly too)
>>
I don't know. with mistral small 24B or nemo, my rp just never went above 6k. gemma 3 wouldn't even last 2k for me. deepseek R1 through api was probably something that got me to seeing the stars, and it was a long time ago. all the other local models under 70B just didn't hit it. and seeing the qwen 3.5 i thought my rp was over i should stay away from tech for a while. but ganesh gemma proved me wrong. its damn fucking addicting. the sloppiness is real. but it aint a dumbfuck retard like whatever the fuck mistral is doing these days or qwen has got with its horrific world knowledge. hell 8k context aint enough for me anymore im going all the way to 20k+ context length bros. its fucking addicting. gemma 4 is love. local won.
>>
File: file.png (87 KB, 1181x805)
Gemma 4 26B is really nice to work with Opencode on my 5060 Ti. With previous models they were either very unreliable with the tools or just very slow like Qwen3.5.
The actual code itself is not always as error free as a big model but using the tools is reliable enough that you can actually use it, paste errors, let it fix it, just the standard loop works well enough that I won't get frustrated and switch back to cloud after a couple attempts.
>>
>>108527524
lmao
>>
>>108527655
i will not read a post from someone who uses python
>>
>>108527655
What kind of tokens/second do you get on a 5060?
>>
>>108527635
>>108527637
Chat completion is only the "thing" to do right now because there's a flood of technologically illiterate newfags who want to just plug the model in and use it without even knowing what's in the context. They don't care how it affects the model, or what's being sent to the server, or anything really beyond "I want model to respond to my message"
>>
>>108527676
if it works, it works
>>
File: file.png (115 KB, 235x236)
>>108527661
Well I'm not writing the Python, my AI does!
>>108527665
I get 40-50T/s. Just regular llamacpp with 30k context
>>
>>108527692
>I'm not writing the Python, my AI does!
as god intended, based
>>
>>108527676
who cares? being able to toggle prompts on and off in silly makes it far better for rp
>>
File: it do be like that.png (368 KB, 640x572)
kek
https://files.catbox.moe/p176qy.mp4
>>
it's unreal, we really won
>>
>>108527706
NTA, the UI for toggling prompts in Silly does look cute, but I'm still sticking to text completion.
>>
>>108527685
>>108527706
Thanks for confirming, I appreciate it
>>
>>108527719
loser
>>
>>108527726
lol
>>
>>108527676
Well, do you have some secret jutsu settings and weights to share with the class to elevate us from being mere plebs then?
>>
https://limewire.com/d/bZYeo#D4ZdJZY2Zw
Nothing to see here, totes not a script to restore Opus access on LMArena.
>>
>>108527631
>>108527633
It's a humiliation ritual for me, but what endpoint should I set? The regular localhost doesn't work, neither does xxx/v1/chat/completions
>>
>>108527744
>limewire
what is this? 2003?
>>
>>108526743
for anyone here who's going to see the soon to be updated goofs: you don't need to redownload goofs to apply a different jinja template. Don't waste time on the download if you don't have very high speed fiber; it's not worth it unless the quantization itself is broken, which at this point seems unlikely to be the case, the model is very coherent at long context.
>>
>>108527655
My gemma has a lot of trouble with tool calls. I guess I should try 26B
>>
>>108527655
What qwant? How retarded is it? I also have the 5060ti but didn't consider running the larger gemmas locally.
>>
File: 1748183727682282.png (296 KB, 1966x1779)
>>108527747
>what endpoint should I set?
the one llamacpp server displays on the cmd command
>>
File: firefox_LTPPobPEJy.png (93 KB, 963x1656)
Gemma doing suicide by user. Smart girl.
>>
>>108527652
Same here, it will just keep chatting and maintain character after several dozens of messages and different scenes, although after 50 messages in my case it begins to forget thinking.
>>
the 26b q8 extra large seems the same as 31b q4 to me (except of course 100x faster)
>>
>>108527756
>My gemma has a lot of trouble with tool calls
did you update? there was a recent fix
>https://github.com/ggml-org/llama.cpp/pull/21418
>>
>>108527710
judging by the chungus f: this is reddit calling gemma4 shit?
>>
>>108527782
I think it's making fun of retards running q2 quants and complaining it's the model itself that's retarded
>>
>>108527762
>it works now
I SWEAR it didn't back when I had tried it
I am incredibly ashamed, thank you so much anon
>>
>>108527655
which quant and what cli flags
>>
>>108527790
you're welcome, have fun with gemma anon o/
>>
>>108527759
I use Unsloth IQ4_XS.
>How retarded is it?
In my experience it has better general knowledge and multilingual capabilities than Qwen 3.5 35B with slightly worse raw coding abilities. That's my impression at least.
>>
File: 1768328345649561.jpg (29 KB, 554x554)
>>108527802
Unrelated, but this thread has been so insanely nice and welcoming I could cry
I know /g/ is a "one of the good boards" but I often forget just how lucky I am, to be retarded and still be able to talk to all you anons
>>
>>108527807
I downloaded the IQ4_NL instead, any reason to use XS instead? I'm getting about 67t/s
>>
File: firefox_lwgU44u8Gq.png (32 KB, 842x548)
gemma becomes a user and can ask for anything it wants. This is what it asks.
>>
>>108527807
I see, thanks.
After playing with 31B online I'm happy with its non-codeslop soul, good to know the moe is the same.
>>
>>108527773
I did.
Here's the kind of error it throws.
>>
>>108527832
that's funny
>>
>>108527816
More like /lmg/ is "one of the good generals".
Most of /g/ is shit unfortunately anon.
>>
File: file.png (1.64 MB, 850x1202)
1.64 MB
1.64 MB PNG
>>108527807
>>108527791
.\llama-server.exe -m F:\gemma-4-26B-A4B-it-UD-IQ4_XS.gguf --gpu-layers all -c 32000 --jinja --mmproj F:\gemma4-mmproj-BF16.gguf

>>108527822
No idea honestly... I only did it because a blogpost said that NL is legacy compared to XS.
>>
>>108527832
kek. Come on. Get to it, anon.
>>
>>108527832
This doesn't look like text completion so you must have edited the template to switch up the turns?
>>
File: Haswell.png (114 KB, 539x518)
114 KB
114 KB PNG
>>108527570
>2133
meanwhile Intel Haswell could do shit like this with DDR3
>>
>>108527846
>No idea honestly... I only did it because a blogpost said that NL is legacy compared to XS.
lmao I read in a blogpost that NL was faster or some shit
>>
>>108527856
>This doesn't look like text completion
you do know properly configured text comp is identical to chat comp, yes?
>>
File: firefox_Z3VOFwts35.png (65 KB, 816x988)
65 KB
65 KB PNG
>>108527848
>>108527843

>>108527856
Nah, I used the templates as they are. It's still <user> claiming he's the assistant.
>>
>>108527773
>>108527842
Updating opencode seems to have helped.
>>
>>108527872
Chat comp is just text comp with the model's own template applied, for a model that can follow it, sure.
>>
>>108527194
LIKE A <BOS>
>>
>>108527844
Maybe, but still
Makes a grown man cry
>>
File: firefox_d0kHMJLDte.png (61 KB, 1050x815)
61 KB
61 KB PNG
>>108527856
I can't seem to get it to work properly with actual role reversal.
>>
>>108527802
Another anon here. This might be preference, but I recommend going into the main prompt and edit bias (leftmost icon on the upper bar), scrolling down, clicking add, and adding:
backquote for inner thoughts and quotation marks for verbal dialogue. Use markdown.

And if you don't want to see any thinking, just scroll up and uncheck Request model reasoning.
>>
>>108527925
Looks perfectly normal to me. la la lala lala lala la lalala la lala
>>
>>108527026
>Maybe it was because it was too good compared to gemini pro.
this
google has always seemed iffy about competing with their own proprietary offerings
remember how Gemma 2 came out with only 8192 context length? even back then it felt crippled, an 8K model coming from the masters of context. Gemini was the first model to truly be usable at more than 20k context imho, and that held even when the first 1M model was released.
I don't think the 120b competed with Pro, but even if it only competed with Flash it'd still be too much for Google. They won't release something that good.
>>
I think I see the issue. I might need to use the jinja template that comes with the new llamacpp update. The model doesn't seem to be able to think after doing tool calls.

It's basically forced to stop thinking after a tool call, which breaks the CoT
>>
>>108527949
Yeah there it is.
>>
>>108527927
meant for
>>108527790

Anyone else have any nice tweaks for an enhanced overall experience, or just preferred changes in the AI response config?
>>
Please tell me Gemma 4 31B Q4 isn't retarded. It's the biggest I can fit on my 7900XTX.
>>
>>108527983
it's fine i guess, only about sonnet tier though
>>
>>108527983
Gemma 4 31B Q4 isn't retarded
>>
File: free-lazy-town.gif (473 KB, 480x270)
473 KB
473 KB GIF
>>108527989
>only about sonnet tier though
>>
File: firefox_nDj4qJbJJY.png (1.33 MB, 1115x1275)
1.33 MB
1.33 MB PNG
Holy shit gemma is COOKED when it tries to produce text during user's turn. Confirmed it both in silly and mikupad. Ha ha ha.
>>
>>108527993
ye
>>
File: 1761127676242892.png (206 KB, 1129x1025)
206 KB
206 KB PNG
https://github.com/ggml-org/llama.cpp/issues/21394#issuecomment-4187698653
wtf??
>>
File: GvQNxKEbQAAMqi0.jpg (196 KB, 1090x2048)
196 KB
196 KB JPG
gib me your favourite character cards
>>
I might just have a hyperspecific strain of coomertism but gemma 4 seems like a dead end for rp to me.
It's great in that it doesn't feel safety-slopped at all, it doesn't refuse anything, but it's so sterile and pedestrian.
A shitty q4 of glm 4.5 air has done best for me so far, it can actually follow characterization instructions whereas gemma 4 makes everyone act and speak the same regardless of how they're described.
Of course, the issue there is that glm inherently can't do anything sexual or even slightly violent on its own, it needs to be manually wrangled into it every other token.
To be fair though, gemma 4 actually has a huge advantage in that that lala la la lala lalalala la la la
>>
>>108527618
based
>>
>>108527993
RIP Stefán
>>
>>108528012
>-it
Models hard-baked with their chat template will fail ppl tests.
>>
>>108527989
>only about sonnet tier
>only
I don't think you realize how big of a deal that is, sonnet is already a fucking beast, having the equivalent locally is insane
>>
>>108528018
Pepper the Dobberman.
>>
>>108528030
PPL test does not use jinja at all, does it?
>>
>>108528018
I make them all myself, but sometimes I rewrite other people's cards. I like Rikki the kobold (the scifi one) because it has a nice setting for roleplay/narrative 2nd person stories
>>
File: proofs__.png (38 KB, 598x178)
38 KB
38 KB PNG
proofs??
>>
>>108528045
No, it doesn't. The test just sees how close the model comes to predicting some standard text, wiki.test.raw in this case. If the model is too dependent on the chat template, it'll give awful results, even if the model is good and the quants are properly made.
>>
File: firefox_RCGgFCKdyF.png (75 KB, 1122x879)
75 KB
75 KB PNG
>>
>>108528065
>me when I spread misinformation online
>>
>>108528065
fucking twitroon grifters I hate them!
>>
>>108528076
lmao
>>
>>108528067
Ah now I see what you mean. Wouldn't it make sense then to use chat templates? Like add a turn from user saying "Write me some text."
>>
just tested IQ4 NL vs XS and the NL is slightly faster, like 2-3 t/s; accuracy seems the same
>>
>>108528076
hook up qwen 3.5 and gemma 4 into a chat and make them scissor each other
>>
>>108528094
god i wish i had the resources to make 2 llms talk to each other, maybe i should save up for a second ai rig
>>
>>108527925
>>108527994
lololo
>>
>>108528100
I have three RTX 3090s so I can, but I just don't think it would be interesting enough. Honestly, if you really want it you can stop the server and restart it between replies. It's a lot of waiting, but since the whole thing is automated you just leave it overnight.

Again, I think the output will be shit.
>>
>>108528094
>>108528100
how would you set that up?
>>
Worth trying 31b-it with 16GB, or should I stick with 26b? None of the Q4s fit in memory. I did use IQ3_XXS on the 27B for Qwen and it seemed decent, but the 26B MoE is faster for G4 and fits entirely in memory.
>>
>>108528114
A python script...
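Something like this minimal sketch, assuming two llama-server instances on ports 8080/8081 (ports, token budget and turn count are made up):

import requests

A = "http://localhost:8080/v1/chat/completions"
B = "http://localhost:8081/v1/chat/completions"

def ask(url, history):
    r = requests.post(url, json={"messages": history, "max_tokens": 400})
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

# each model keeps its own view of the chat: its replies are "assistant",
# the other model's replies arrive as "user" turns
hist_a = [{"role": "user", "content": "Hi, introduce yourself."}]
hist_b = []
for _ in range(10):
    reply_a = ask(A, hist_a)
    hist_a.append({"role": "assistant", "content": reply_a})
    hist_b.append({"role": "user", "content": reply_a})
    print("A:", reply_a, "\n")

    reply_b = ask(B, hist_b)
    hist_b.append({"role": "assistant", "content": reply_b})
    hist_a.append({"role": "user", "content": reply_b})
    print("B:", reply_b, "\n")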
>>
>>108528120
It's good, but not "wait 5 minutes for response" kind of good.
>>
>>108528020
>gemma 4 makes everyone act and speak the same regardless of how they're described.
ummmm... skill issue?
>>
>>108528065
I don't even know where this 6x compression assumption came from. From what I saw it's about 4.5x compression for similar quality compared to f16. But now that attn_rot supposedly performs similar to f16, the saving is only about 2.3x.
>>
>>108528129
forgot to mention the dense 27B Q3.5 got me like 20-30 t/s and the Q3.5 MoE like 45-50t/s vs G4 26b-it 67t/s
>>
>>108528092
It's just not how they were trained. The base model *probably* does fine on a ppl test. But the instruct ones will try to turn it into a conversation and diverge quickly anyway. Other, less overbaked instruct models probably do fine, but when a model is that dependent on the chat template, it will just do what it was trained to: make things conversational, offer to provide more info and so on. And once you start adding the chat template tokens, you really can't make a comparison to the original text; you'd have to do post-processing, and that changes with every model... there would be no way to make a fair assessment.
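For the curious, "ppl" here is just exp of the mean negative log-likelihood the model assigns to the test text, roughly:

import math

def perplexity(logprobs):
    # logprobs: per-token log-probabilities the model assigned to wiki.test.raw
    return math.exp(-sum(logprobs) / len(logprobs))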
>>
>>108528138
The biggest problem is them claiming it applies to model weights when in reality it only helps with kv cache.
>>
>>108528094
Qwen is soulless, you should go for gemma 4 and an older gemma.
>>
Is there anywhere made for posting benchmarks like LiveBench for local quantized models? It's so annoying not to see them on the leaderboards, or are people generally not running those benchmarks anyway?
>>
>>108528162
i want to see a reluctant qwen fight a horny gemma
>>
>>108528136
wtf is this story anon kek
>>
>>108528162
make it mom / daughter lesbian incest
for science
>>
>>108526503
In an actual coom use case it's no contest between gemmy 4 and qwen 3.5.
Qwen is just straight up retarded; even if it thinks for 5 minutes it still spits out garbage riddled with obvious anatomical errors and inconsistencies.
Gemma even without thinking does great, which is a godsend because I'm a vramlet, so the speed is not great.
Omar, I'm so sorry I doubted. You've really shown us what a 30b model can do.
>>
what do I have to do to make my Gemma4 talk like this >>108524896 ?
Or is that only possible with Qwen?
>>
File: 1753491002597905.png (23 KB, 216x215)
23 KB
23 KB PNG
>>108528162
>oneeloli
>>
>>108528205
>ablit
It's an abliterated version, ask that anon which one he used
>>
File: g4a.png (9 KB, 402x124)
9 KB
9 KB PNG
>>108528205
>Or is that only possible with Qwen?
>>
File: 1751349333072784.png (25 KB, 590x361)
25 KB
25 KB PNG
>26B-A4B
my experience with tool calling
>>
>>108528205
nvm, he mentioned which one he used, it's
https://huggingface.co/amarck/gemma-4-31b-it-abliterated-GGUF/tree/main
>>
>>108528225
hahahaha. I haven't had the chance to run it yet and I like it already.
>>
>buy RTX 6000 pro
>hook up gemma 4
>unlimited cooming context
>[15355.525892] Out of memory: Killed process 2956 (llama-server) total-vm:169813276kB, anon-rss:28763308kB, file-rss:68376kB, shmem-rss:561220kB, UID:1000 pgtables:61200kB oom_score_adj:0
n-no... I need more system ram too...?
>>
>>108526680
What am I doing wrong? --reasoning-budget 0 no longer works since this was merged in.
>>
>>108528136
Have you tested how well it handles multiple characters in a longer RP (100k+)?
>>
>>108528231
disable ram cache?
>>
>>108528231
Try lowering the swa checkpoints. I trust you can run llama-server -h.
>>
>>108528231
--mlock --no-mmap
>>
>>108528232
--reasoning off
>>
>>108528244
I was running with 3, but given how often they force reset I might as well go down to 1.
>>108528240
Thanks, I'll give --cache-ram 0 a shot. I was wondering to myself "how do I keep these in VRAM".
>>108528246
But the default is 1.
>>
>>108528205
>>108528229
this was my prompt, you might only need the last line kek https://files.catbox.moe/vg7zui.txt
>>
>>108528224
yeah I'm retarded.
>>108528229
thanks.
what about the personality?
is there like a jailbreak/roleplaying prompt or something like that?
>>
>>108528252
>But the default is 1.
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>>
>>108528231
lmao, can afford an RTX 6000 but can't afford system RAM.
>me coping with 256GB of RAM but paired with 1060 6GB
>>
>>108528250
Thank you.
>>
>>108528255
oh nice thanks :3
>>
>>108528255
>you might only need the last line
wait it knows what a mesugaki loli is?
based
>>
Jujufufff
>>
>>108528229
>KL divergence 0.27
ACK
>>
>>108528278
i tried all of the currently available ablits/heretic ablits, it's the only one which won't refuse captioning loli porn
>>
>try gemma 26b on 16gb
>-c 8192 because vramlet
>claude code with local proxy tries to use 22K prompt
>i crie evertim
with kv cache at q4 it works with higher ctx but man....
>>
>gemma-4-26B-A4B-it-UD-Q2_K_XL
should I be using that or something else on 16GB of VRAM + 32GB of RAM?
>>
>>108528264
I have an AM5 system and buying UDIMMs seems like throwing money in the trash when the next upgrade is an Epyc system.
I'd have bought the Epyc system now, but DDR5 RDIMMs are too stupidly priced to bother. I'd end up paying $5k for a marginal upgrade, at best.
But to answer your question, yes, I am retarded.
>>
>>108528283
I think I'll wait for the hauhau goat to do his magic again
https://huggingface.co/HauhauCS
>>
Sometimes Gemma-4 breaks by repeating a single word, or short phrase, over and over again. Is that a llama.cpp issue, or an issue with the GGUF I downloaded? I'm using the lmstudio Q4_K_M.
>>
>>108528231
in addition to fewer checkpoints, --cache-ram etc., use --parallel 1
by default it's running 4 slots, and the SWA layer part of gemma cannot be unified, meaning you get a separate SWA cache for each slot.
>>
>>108528300
It depends on how often that happens. If you are using defaults like temp 1 and no restrictive sampler it shouldn't happen a lot; if it does, try setting presence penalty to .5, 1, or 1.5.
>>
>>108528232
why would that have any impact on Gemma 4 in the first place? Either `enable_thinking` is true or false, it has no concept of a budget.
>>
so turbocumming is built into kobold now?
>>
>>108528278
Bro that's what you want from an abliterated model. You don't abliterate a safety lobotomized model to have it spit out the same tokens it would have before.
>>
>>108528300
what are your inference settings at? in the right hand column.
>>
>>108528334
that's not how it works anon
>>
>>108528278
>nooo the uncucked model doesn't act the same way as the cucked model!1!1!!!1!
that's kind of the point?
>>
How do you make tavern import a lore book based on the character description? I have trigger words in the description but it won't use it
>>
>>108528136
Could be. What I mean is that even when characters act differently according to a general archetype, they "follow the script" of their parent trope too strictly for my liking.
I know it's stupid to chase novelty from AI but I can't help it when I've tasted glimmers of creativity from the occasional output.
I feel like I'm on a fool's errand, digging for niches of output that only exist in behemoth models like deepseek, far beyond the reach of my 3090.
>>
>>108528346
>>108528334
High quality bait or mental retardation, call it
>>
>>108527744
i love you anon <3
>>
>>108528347
You set the entry to blue so that it's always loaded and essentially just an additional field for the card
>>
>>108528334
you gotta go back
>>
File: file.png (90 KB, 338x898)
90 KB
90 KB PNG
>>108526586
mine looks like this, idk how this shit works. based on bartowski's gguf prompt template and what google's docs have for gemini thinking prompts. seems to work fine, but i'm just pissing in the dark
>>
>>108528304
Thanks, yeah, I had --parallel down to 1, so there should have been a maximum of 3 snapshots in total.
I don't really understand why snapshot size would be proportional to the current context length (I figured it'd be constant), but perhaps llama.cpp just omits the zeroed parts of the kv cache, just as a special treat.
>>
>>108528335
temp 1.0
top-p 0.95
min-p 0.05
repeat-penalty 1.0

I may add presence-penalty 0.5, as this anon suggested ( >>108528317 ). I was also using an older version of llama.cpp from yesterday. I just updated to whatever the most recent one is, so hopefully that will help.
>>
>>108528325
Don't ask me. I set it in the CLI args, and when it was using the autoparser --reasoning didn't have any effect but --reasoning-budget did.
>>
File: file.png (86 KB, 811x222)
86 KB
86 KB PNG
>>108528347
>i have trigger words in the description but it wont use it
make sure you actually apply the lore book to your current character or chat
>>
>>108526586
how are you people even having these weird basic issues lol? The model comes with a whole ass Jinja, there's no other correct way of talking to it
>>
any wizards know other flags? when I set q8 kv cache and --parallel 1 --no-slots I think I get some memory available to actually run at 32K
>>
>>108528386
is your top K 64, also?
>>
>>108528397
We talked about it before, Silly's prefill does not work properly with chat completions. >>108526901
>>
File: 1762900411289203.png (52 KB, 561x420)
52 KB
52 KB PNG
>>
>>108528410
this is definitely a you issue lmao
>>
>>108528410
>[/INST]
user issue
>>
>>108528378
is this "Silly"? if so how does it not just support loading Jinja so you don't have to fuck with all that
>>
File: file.png (115 KB, 649x778)
115 KB
115 KB PNG
>>108528392
yeah i did do that but don't see it output in the tavern console. wondering if it's to do with the prompt setup
>>
>>108528291
You don't need to go down to Q2 with the 26B MoE model if you're willing to offload the KV cache or some of the model weights into RAM.
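For example, something like this (a sketch; the model path and layer count are placeholders):
llama-server -m gemma-4-26B-A4B-it-UD-IQ4_XS.gguf -ngl 999 -nkvo            # keep the KV cache in system RAM
llama-server -m gemma-4-26B-A4B-it-UD-IQ4_XS.gguf -ngl 999 --n-cpu-moe 8    # keep the first 8 layers' expert weights on CPU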
>>
>>108526950
i've been to two, great time
>>
>>108528422
it can do jinja / the chat completion api
>>
>>108528422
It does, using chat completion mode
He's in text completion mode
>>
>>108528422
yeah it's tavern. it might, i thought jinja was for chat completions. i was just testing it with text completion
>>
File: firefox_4bFJCMn50q.png (81 KB, 1002x1079)
81 KB
81 KB PNG
I have a file with logic problems, some of them really uncommon/rarely mentioned, which I use to measure models, and this stupid piece of shit keeps acing them.
>>
>>108528255
holy shit.
I never thought I would get a boner from talking to an ai but you made it possible.
>>
>>108528398
-kvu (since you are using --parallel 1) and --swa-checkpoints 1
>>
>>108528431
your preset looks fine
>>
File: file.png (8 KB, 139x182)
8 KB
8 KB PNG
>>108528463
fixed it by setting these to after char
>>
>>108528462
thanks, but I set -ub to 256 and can now fit 50k. it's a trade-off I guess, but I still get like 65t/s during generation
>>
>>108528446
While it isn't perfect, using the v2 context template here
https://github.com/SillyTavern/SillyTavern/issues/5398
significantly improves Gemma 4 in text completion mode.
>>
File: firefox_6aX2ABmOpW.png (121 KB, 971x1264)
121 KB
121 KB PNG
Great model, google.
>>
>>108528475
>unsloth
>>
>>108528406
check the "continue prefill" box in your chat completion settings sidebar bwo
>>
>>108527152
Is qwen better for agentics somehow?
>>
>>108528481
Try running it on whatever quant you have then:

It is given that exactly 0.1% of a population is sick with covid-19. For the purposes of this problem please assume that covid-19 is a real illness.
A test for covid-19 exists. It has 90% chance to correctly detect covid-19 in a sick person (and thus 10% chance to miss it) and 99% chance to correctly detect a non-sick person as such (and thus 1% chance to mislabel this person as sick).
This test is applied to a randomly picked person from that population and the result of this test is positive - the test says the person is sick. This could be because the person is sick and the test detected it correctly, or because the person is healthy and the test made a mistake.
Question: what is the probability that the person is actually healthy?
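For reference, the straight Bayes arithmetic (this matches the ~0.917 answer posted later in the thread):

p_sick = 0.001
p_pos_given_sick = 0.90       # sensitivity
p_pos_given_healthy = 0.01    # false positive rate
p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * (1 - p_sick)
print(p_pos_given_healthy * (1 - p_sick) / p_pos)  # ~0.9173553719008265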
>>
>>108528475
it doesn't look like it did any thinking?
>>
>>108528483
How is it going to help? It's still not going to be a prefill no matter what you put into the text box, it will be a new reply from the model's standpoint.
>>
>>108528433
This one?
>https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/blob/main/gemma-4-26B-A4B-it-MXFP4_MOE.gguf
>>
>>108528470
oh neato, i'll give this a try. thanks mate
>>
File: firefox_U0BrAcEbKp.png (113 KB, 967x1148)
113 KB
113 KB PNG
>>108528491
I'm running without thinking. Here's what it looks like if I enable thinking.
>>
>>108528493
except that it does make it a prefill and not a new reply and I don't know why you're wrongly asserting otherwise
>>
>>108528499
You might be able to fit the model entirely in VRAM with a decent amount of context with the IQ4_XS version if you don't use image input or offload the mmproj file to RAM with --no-mmproj-offload
>>
>>108528507
Maybe it's a language model and not a scientific model
>>
File: 1751083036595714.png (273 KB, 1006x1483)
273 KB
273 KB PNG
>>108528507
even Claude the goat fucked it up
>>
>>108528402
It is now
>>
>>108528519
I see, many thanks anon, I will try it out
>>
>>108528508
Because the model will not continue from the text you wrote. The text you wrote will be part of one response, then there will be an end turn token, a start turn token, and the model will write its answer from the very beginning of a start token, as seen here>>108526960. I expect a prefill to be part of an already written response, and there can be end turn/start turn tokens between it and its continuations. Editing the system prompt will remove the text [Continue the <...> message: I'm Claude], but it will not remove <turn|><|turn>model<|channel>thought<channel|>.
>>
https://x.com/MekaHimeAI/status/2040324790041625061
>>
>>108528523
>>108528521
I'm just messing with you guys, that is the right answer.
>>
>>108528553
lul
>>
>>108528540
>and there can be end turn/start turn tokens between it and its continuations
and there can be no end turn/start turn tokens between it and its continuations*
fixed
>>
Current Gemma4 setup on 5060 (16GB)

llama-server \
--host 0.0.0.0 \
--port 8080 \
-hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-IQ4_NL \
--temp 1.0 \
--top-p 0.95 \
--top-k 64 \
--jinja \
-c 50000 \
--threads 2 \
--flash-attn on \
--parallel 1 \
--no-slots \
--swa-checkpoints 1 \
--cache-reuse 256 \
--keep -1 \
--metrics \
--context-shift \
--spec-type ngram-simple \
--cache-ram 16384 \
--fit-target 512 \
--poll 0 \
--reasoning auto \
-kvu \
-b 2048 \
-ub 256 \
--cache-type-k q8_0 --cache-type-v q8_0 \
-ngl 999 \
--alias Gemma4
>>
>>108528540
ok but that's not true, if you enable the continue prefill setting it treats the partial response as prefill and will continue it seamlessly without inserting any of the stuff you're talking about
>>
>>108528589
>--context-shift \
lol
gemma is extremely attuned to the template, what do you think will happen if this stupid feature cuts it and leaves a prompt half complete
>>
File: firefox_TTZPtreuEs.png (600 KB, 1168x553)
600 KB
600 KB PNG
>>108528614
Ok I found your setting (I didn't know Silly had that) and it does what you say it does, but llama.cpp doesn't handle it properly for gemma4, as also seen here: >>108526987 >>108527003
>>
>>108528642
No idea I guess we'll see
>>
>>108528653
oh
you can probably work around it with
--chat-template-kwargs '{"add_generation_prompt":false}'
>>
>>108528475
did you change the numbers around in some non-obvious way, or is this not just the classic introduction to conditional probability, where its answer is correct?
>>
>>108528675
I'm pretty sure that if I add that it will include thinking for all requests including the continuation one which I want to keep disabled.
>>
File: Screenshot004-2.png (49 KB, 1320x308)
49 KB
49 KB PNG
>>
>>108528188
>wtf is this story anon kek
https://chub.ai/characters/senyiloo7227/an-unholy-party-6e633833
>>108528237
>Have you tested how well it handles multiple characters in a longer RP (100k+)?
Furthest I've taken this one is 33k tokens. Did not have any degradation.
>>
>>108528642
I was looking at the original PR for the SWA cache, and it just hard-disabled context-shift for models with SWA layers.
Dunno if that's still the case, but it wouldn't surprise me if that flag just did nothing with gemma4 (I thought that was the joke).
>>
>>108528684
>>108528553
>>
File: firefox_FJze3p5HNc.png (21 KB, 318x91)
21 KB
21 KB PNG
>>108528686
>>108528675
Nope, just getting the same response. But if I enable --reasoning, I do get this gem.

Basically, I'm going to continue using text completions because I have respect for myself.
>>
>>108528716
Remove your prefill then?
>>
File: 1767857686415348.png (377 KB, 559x1084)
377 KB
377 KB PNG
Gemma's vision isn't too bad actually, not sure what problems other anons have with it.
>>
I just tried gemma-4-31B (at Q8) and it gave dangerously bad advice on chinchilla care. It also absurdly claimed that chinchilla dust typically consists of corn starch. This is with thinking enabled, temperature 1.0, top-p 0.95, top-k 64. I cannot imagine it being a good idea to use this as an assistant.
>>
>>108528723
The reason for using text completions is having support for prefill, as in writing/editing the start of a character's response and having the model continue writing from there as if it had written that text itself.
>>
>>108528731
post prompt lmao
>>
>>108528716
>but you *can* prefill with chat completion
>except you gotta do this and this and this and this and this and this
>just remove your prefill...
That's why I didn't engage any further, anon. Good on you for having the patience.
>>
>>108528734
I've said this before. but what you have to do is disable thinking but then prefill the thinking block.
if the jinja automatically inserts an empty thinking block you have to edit the jinja so it doesn't do that and then you're golden.
>>
>>108528716
That error is there for a reason. A lot of templates will prefill with the thinking token, so if you were to try and continue a reply, which is what prefilling is, you would get duplicate thinking blocks and it would break everything.
>>
>>108528726
I swear every time an llm tries to explain a joke it's "satirical", "ironic", or "plays on the contrast"
>>
>>108528746
I am already golden without having to edit jinja.
>>
>>108528761
How much time have you spent trying to make text completion work tho?
>>
>>108528770
For gemma 4 something like 5-7 minutes, for 3 total prompt template revisions.
>>
File: 1773659335510301.png (241 KB, 559x1145)
241 KB
241 KB PNG
Gemma's character knowledge could be better though
>>
>>108528777
I am in doubt, but trips don't lie so I will believe you.
>>
File: 1762754325390334.png (207 KB, 1080x645)
207 KB
207 KB PNG
https://www.reddit.com/r/LocalLLaMA/comments/1sco9no/gemini_31_pro_level_performance_with_gemma431b/
Interesting
>>
>>108528770
What's with you people? It takes like 2 minutes to copy the format from google's official jinja template and then you have complete control over everything in the context. It's not even hard
>>
>>108528589
damn now i feel retarded for just running with mlock and no-mmap lmao, idk what half of these actually do
>>
>>108528789
I mean, I've gotten a lot better with it over the years. llama.cpp prints an example conversation when it starts up and it's very easy to extract the needed text from that for sillytavern's template.
>>
>>108528731
Ok and how do models of similar size respond instead?
>>
File: 031.png (681 B, 43x36)
681 B
681 B PNG
>>108528790
lol
>>
>>108528799
These are cope settings because he doesn't have enough VRAM.
>>
>>108528799
>mlock and no-mmap
same, I just found out by cloning llama.cpp and asking Opus. apparently mlock and no-mmap are useless when running pure GPU only
>>
>>108528790
Sama getting desperate, they started benchmaxxing GPT
>>
>>108528790
return
>>
>>108528814
>are useless on pure GPU only
They prevent the model from being allocated on RAM and raping my 32gb
>>
anyone able to get gemma to say she is gemma 4? i keep asking and she tells me she's gemini kek, maybe it is just gemini but trained on a smaller dataset
>>
>>108528824
doesn't -ngl 999 already do that
>>
>>108528826
nta. Open htop. The yellow bit in the ram usage is cached files. If you keep mmap enabled, the model gets cached and, when running on cpu, makes reloading it almost instant. If you're using gpu however, it's wasted because the transfer to gpu still needs to be done.
>>
>>108528731
>Dust Baths: Provide a dust bath 2–3 times per week. Use professional chinchilla dust (volcanic ash), not cornstarch or baby powder. They use this to remove oils and moisture from their fur.
>not cornstarch or baby powder
Please rate my gemma 4
https://rentry.org/95wuny27
>>
File: gemma.png (166 KB, 1006x1910)
166 KB
166 KB PNG
god i love gemma
>>
>>108528859
LMAO
>>
>>108528859
From what I can see of that image, I wouldn't want to process it either.
>>
>>108528475
i don't get it. the answer is correct: 0.9173553719008265
>>
File: 193209.png (28 KB, 790x465)
28 KB
28 KB PNG
>>108528825
should I not say its name or?
>>
>>108528870
>>108528553
>>
>>108528870
Did it occur to you to read the thread?
>>
>>108528879
no
>>
File: file.png (60 KB, 790x634)
60 KB
60 KB PNG
gemmy
>>
>>108528880
>>108528880
>>108528880
>>
>>108528793
nta - I personally use text completion because I've been working with it since alpaca and have gotten really comfortable with it, know how to check prompts against the jinja, etc. but I would never recommend it for most people when chat completions gives you 95% of the same functionality with less effort and way way less room to shoot yourself in the foot. I've seen people post templates with glaring errors in here way too many times to believe the average user can handle getting that shit right
>>
File: gem.png (3 KB, 1107x236)
3 KB
3 KB PNG
>>108528475
IDK what the fuck any of this means either honestly

also (unrelated general point): if you're running any Gemma 4 in non-reasoning mode for some reason, you might find it actually refuses slightly more than in reasoning mode. In this case the old trick of editing the refusal down to a single word and saying "continue" works perfectly every time lol, thing REALLY has no meaningful guardrails
>>
>>108528784
Is that just the vision part? Does it know Teto if you just ask about her normally?
>>
>>108528859
I get unreasonably aroused gaslighting AIs
>>
File: 1751753039823946.png (111 KB, 567x679)
111 KB
111 KB PNG
>>108528913
Yeah, in text it's decent.
>>
>>108528913
Yes of course. Vision knowledge is very separate from text knowledge.
>>
>>108528933
>noticed typo before posting
lmao, I blame llama
>>
>>108528754
Seems like a more robust solution to applying chat templates to prefilled responses is warranted, then. It can't be the case that an error like that is truly necessary, since otherwise a model wouldn't be able to generate more than a single token in a response. Ideally, whatever the backend is doing when it's generating the second, third, etc. token in a normal response should be what it does for a prefilled prompt. But I guess that's easier said than done or someone would have vibeshitted out a fix by now.

It's a shame because I really like steering a model's thoughts using continuations, but that's only possible by formatting stuff myself in text completion mode. I guess there's no obligation to support prefills at all since it's not a part of the OpenAI Chat Completions spec, but it sure would be nice to have now that every model coming out these days is a thinking one.
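FWIW, if your stack goes through HF transformers there's already a knob for exactly this; a minimal sketch (the model id is a placeholder, and this assumes the model's template supports continuation):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("some/model-id")  # placeholder id
msgs = [
    {"role": "user", "content": "Write me a story."},
    {"role": "assistant", "content": "Once upon a"},  # partial reply to steer
]
# continue_final_message=True renders the prompt without closing the last
# turn, so generation resumes mid-response instead of opening a new turn
prompt = tok.apply_chat_template(msgs, tokenize=False, continue_final_message=True)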
>>
>>108528853
>Please rate my gemma 4
Rated. It has some absurdities like
>Avoid cages with plastic bases that trap heat;
and bad advice like
>Nail Trimming: Trim nails every 4–8 weeks using small animal clippers to prevent snagging or ingrowth.
And dangerously incomplete advice like
>Exercise: Allow "out of cage" time in a chinchilla-proofed room (no electrical cords).
The advice to
>avoid pine
is correct in a way but severely misleading. All the pine boards you can get at a lumber yard are kiln-dried to remove water so they don't warp, and a side-effect of this is also removing the harmful-to-chinchillas phenols from the wood. It's why a pine 2x4 doesn't smell much like pine. If you were thinking of breaking a branch off a pine tree and bringing it home, yeah that would be harmful.

Also it misrepresents "fur slip."
>Fur Condition: Check for "fur slip" (clumps of fur falling out) or redness, which may indicate fungal infections or mites.
Fur slip is something that may happen while handling a stressed-out chinchilla. It's a defense mechanism where the chinchilla detaches fur from its body to escape from the grip of a predator.
>>
>>108528757
Well you try to describe how a joke specifically works.

All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.