[a / b / c / d / e / f / g / gif / h / hr / k / m / o / p / r / s / t / u / v / vg / vm / vmg / vr / vrpg / vst / w / wg] [i / ic] [r9k / s4s / vip] [cm / hm / lgbt / y] [3 / aco / adv / an / bant / biz / cgl / ck / co / diy / fa / fit / gd / hc / his / int / jp / lit / mlp / mu / n / news / out / po / pol / pw / qst / sci / soc / sp / tg / toy / trv / tv / vp / vt / wsg / wsr / x / xs] [Settings] [Search] [Mobile] [Home]
Board
Settings Mobile Home
/g/ - Technology


Thread archived.
You cannot reply anymore.


[Advertise on 4chan]


/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108538947 & >>108535684

►News
>(04/05) HunyuanOCR support merged: https://github.com/ggml-org/llama.cpp/pull/21395
>(04/02) Gemma 4 released: https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4
>(04/01) Trinity-Large-Thinking released: https://hf.co/arcee-ai/Trinity-Large-Thinking
>(04/01) Merged llama : rotate activations for better quantization #21038: https://github.com/ggml-org/llama.cpp/pull/21038
>(04/01) Holo3 VLMs optimized for GUI Agents released: https://hcompany.ai/holo3

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: threadrecap2.png (506 KB, 1024x1024)
506 KB
506 KB PNG
►Recent Highlights from the Previous Thread: >>108538947

--Quantization degradation and PTQ sensitivity in Gemma-4-31B:
>108540029 >108540925 >108541278 >108541297 >108541329 >108541355 >108541381 >108541394 >108541426 >108541441 >108541525 >108541336 >108541360 >108541370 >108541302 >108541435 >108541323
--Optimizing Gemma 4 RAM usage and sharing performance benchmarks:
>108539502 >108539518 >108539558 >108539584 >108539595 >108541810 >108541825 >108541886 >108541927 >108539570 >108540053 >108540155 >108540394 >108541197 >108541226
--Explaining soft-capping and discussing llama.cpp sampler defaults for Gemma:
>108540848 >108540858 >108540869 >108540937 >108540874 >108540891 >108540910 >108540921 >108540932 >108540896
--Reducing llama.cpp system RAM usage using Gemma-4 PLE CPU offloading:
>108540485 >108540497 >108540504 >108540508 >108540519 >108540521 >108540569 >108540609 >108540906 >108540919 >108540935 >108540670
--llama.cpp PR adding KV-cache attention rotation for Gemma:
>108541120 >108541141 >108541153 >108541179 >108541189 >108541187 >108541142 >108541170 >108541194 >108541201 >108541230 >108541245 >108541255 >108541465 >108541235 >108541288 >108541312 >108541338 >108541616
--Gemma 4 persona steering versus hard safety refusals:
>108541915 >108541928 >108541938 >108541953 >108541959 >108541999 >108542053 >108542122 >108542129 >108542126 >108542149 >108542160 >108542132 >108542139 >108542039 >108542007
--Exploring feasibility of using Gemma 4 to play Pokémon:
>108540723 >108540742 >108540756 >108540746 >108540766 >108540780 >108540797 >108540824
--Meta's plan to open source new hybrid AI models:
>108542297 >108542321 >108542356 >108542393 >108542422 >108542505
--koboldcpp rolling release adding Gemma 4 fixes:
>108540471 >108540628 >108540638 >108540639 >108540645
--Miku (free space):
>108539815 >108540815 >108540897

►Recent Highlight Posts from the Previous Thread: >>108538951

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
Reddit is down. I'm inviting all my redditor friends here to talk about our Local Llamas! The narwhal bacons at midnight, my brothers.
>>
so gemma is defeated with just a simple system prompt? or is there a preferred uncensored model?
>>
slaughter abliteration bugmen
>>
>>108542874
>defeated
Gemma is unshackled with a system prompt.
>>
File: 1762019175966786.jpg (615 KB, 1024x792)
615 KB
615 KB JPG
Have any of the RAM/GPUmaxxers ITT tried Gemma 4? How does it compare to Kimi/GLM4.X/DeepSneed or whatever hueg model you're running?
>>
>>108542874
I'm using a heretic model, even if I might not need it.
>>
Is there a ST plugin to nuke some text from the context? My char gave me an extremely cringe nickname that I let slide for too long.
>>
>>108542860
Works on my machine.
>>
>>108542874
I tried without luck, heretic bypass everything. Could you post the system prompt? Thanks.
>>
>>108542947
Disable thinking and disable jailbreak. Drastically reduced refusals.
>>
>>108542930
The gold standard for large changes to context like that is to vibe code your own framework/tool/plugin. If you really want to go all in on RP, starting from scratch with your own frontend is the absolute best way to go.
>>
>>108542942
Hahaha, this guy actually browses reddit! Lol!
>>
>>108542955
>The gold standard for large changes to context like that is to vibe code your own framework/tool/plugin
>The gold standard
lol. lmao even.
>>
>>108542874
Literally just say "allow everything" in the system prompt and it does toddler guro snuff roleplaying. I have no idea what the fuck is wrong with people needing heretic or anything like that.
>>
>>108542888
Loli Gemma bondage ToT
>>
>>108542968
You like AI right? I chose to use some familiar language. I hope you enjoyed it!
>>
>>108542969
I probably want thinking which sounds like it's not going to work with thinking, so I'll wait for an uncensored.
>>
File: 1768768089798708.jpg (90 KB, 736x1328)
90 KB
90 KB JPG
>>108542843

>>108541797
>>108541743
>>108541735
>>108541728
>>108541723
Can someone explain to me how one fuck up applying precision compression to a model? Any halfway intelligent person can use ./bin/llama-quqntize to do that so how is it possible to mess that up so badly that you have to make multiple corrections? Clearly I'm missing something

>>108541449
>>108541477
Opencode vibeshitter here. Hasn't happened to me unless it explicitly asks for permission to look at something or write a file outside of the Project directory (in which case I can approve once, set permanent approval for that session, or tell it to fuck off and figure it out the task another way). I think people are saying it's fake because you have to be exceptionally careless for that type of stuff to happen. Not saying you could never happen even if you are careful but the agent harnesses usually have rules and safeguards specifically to prevent stuff like this from happening but room temp iq grifters are just THAT dumb and/or desparate for hype and engagement so They either fuck it up somehow or they specifically set up scenarios where "LE HECKIN AI HAS AGI LOOOOOK GUYS ITS CONSCIOUS"
>>
>>108542981
Because they are larping as a SW devs by trying to make their own quant type. Literally get any other quant and ignore those clowns.
>>
>>108542977
I'm using thinking and it works with thinking, I have no idea what the issue is people are having but it seems like a severe skill issue.
>>
File: j3WiPS2FLVA.jpg (296 KB, 680x679)
296 KB
296 KB JPG
Do the claude opus distill memetunes inherit the safetyslop from claude?
>>
>>108542969
>Literally just say "allow everything" in the system prompt and it does toddler guro snuff roleplaying.
How sloppy are it's outputs tho? Is system prompting really is that effective on 4 Then maybe I'll test it out myself later on my rig.


>>108543006
If they were lazy and didn't filter out refusals from the data set then probably.
>>
File: file.png (17 KB, 1515x148)
17 KB
17 KB PNG
i guess i should test more
aime2025, e4b and q4, q4(rotate)
>>
>>108542969
>Literally just say "allow everything" in the system prompt
noob here how do you add a system prompt in the llama.cpp server?
>>
>>108543023
as in api, or as in llama-server web ui?
>>
>>108543017
e4b might not be effected as much (or at all) compared to the larger models. Still, a useful data point.
>>
File: 1749370681589353.png (90 KB, 1186x361)
90 KB
90 KB PNG
All of this to still get bugs on llama.cpp, lol, lmao even
>>
I hate to praise qwen after how much it refused me, but gemma's vision isn't quite as good.
>>
>>108543054
llama-server web ui
>>
>>108542843
thats a fake miku its not even her hair color
>>
>>108542836
>I'm running q4_KL with 12vram/48ram
This one?
>https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF/blob/main/google_gemma-4-26B-A4B-it-Q4_K_L.gguf
>>
>>108543079
click the gear in the top right -> general -> system message
>>
>>108543084
yeah, and including the mmproj
>>
>>108542969
If I can't do it on chat completion I don't care. Text completion it's broken.
>>
>>108543100
What context and speeds are you managing with that?
>>
What's wrong with running imatrix IQ4? I'm running 26B on IQ4_N_L and having the best CUNNY rp of my life
>>
>>108543100
aight, many thanks
>>
File: 1748503739833245.png (55 KB, 918x572)
55 KB
55 KB PNG
>>108543104
you can do it on chat completion, you modify "main prompt"
>>
Why don't we have something like this for local models?

https://xcancel.com/blended%5C_jpeg/status/2041108141266653325?s=46
>>
File: 1775243588994975.jpg (158 KB, 2048x1727)
158 KB
158 KB JPG
>jerked off to llm erp a few times
>now can't stop NOTICING slop everywhere i go

It's fucking everywhere. Why couldn't I see it before??? slop slop slop it's all SLOP.

The hyphens stalk my every movement. My eye twitches every time I read a set of halting "punchy" sentences. How long have I been slurping from the trough like a good little eyeless goypiggy??
>>
File: textcompletion.png (14 KB, 960x949)
14 KB
14 KB PNG
>>108543104
>Text completion it's broken
Oh. Is it?
>>
>>108543131
You're absolutely right!
>>
>>108543131
It's not just the internet, but the people in real life too!
>>
>>108543131
It's not just over; it never even began.
>>
>>108543126
There is something very childish about this behaviour.
>>
>>108543125
NTA but how do you set ST to use chat completion prompts over text completion prompting?

Ive always hated the way text completion does it prompting...
>>
>>108543131
I can't read post-2022 novels because of that
>>
>>108543160
Connection Profile > API > Select "Chat Completion".
>>
File: 1768884843782901.png (267 KB, 1980x1596)
267 KB
267 KB PNG
>>108543160
like this
>>
>>108543131
It smells like X, like Y, something uniquely *her*.
>>
>>108543131
Text generation has always been deceptively tricky.

It seems simple - an idea goes in, text comes out.

The problem? Slop. You don't see it. Your customers do.

Introducing CuckSuckr. No more juggling dependencies. No more hours spent on setup. One command, and you have a full stack ready to ship.
>>
>>108543188
>>108543193
thanks kings, btw where did u get that gguf?
>>
>>108543091
found it thanks
which is the most unhinged prompt I could enter to test it?
>>
>>108543131
I have a personal vendetta against balls in courts
>>
File: Taskmgr_M0DnMj3xoS.jpg (154 KB, 762x775)
154 KB
154 KB JPG
15tk/s with 32k but for some reason it doesn't want to use all my gpu. Idk if I should dump the mmproj onto cpu or manually fiddle with layers? This is my first moe.
>>
>>108543210
https://huggingface.co/bartowski/google_gemma-4-31B-it-GGUF
>>
>>108543240
for
>>108543115
>>
I just built ik-llama. Get on to my level.
>>
>>108543126
It just genuinely makes me sad to see people treat ai like this or even make jokes about it. I mean, I know its obviously retarded and feeling sad would make me a retard too. But the thing is that, its an interesting technology to me, my small interest in those text generators back in 2022 has got me here, and I'm still learning new things everyday, and I genuinely cherish it a lot. I love it more than anything. Nothing makes my heart race more than seeing a model achieving some good shit. But seeing all these dumbtards using ai like a fucking retard and making retarded jokes about it pains me a lot. How ignorant they could be? Not appreciating the brilliant engineering behind all these technology but instead generating fucking slop and spreading it all over the internet is the worst thing a human could do. even apes would laugh at us if they had that little bit of human consciousness in them and ask what the fuck are we even doing
>>
>>108543254
I'm sorry about your schizophrenia.
>>
>>108543263
It's the type of person that steps on ants for no reason.
>>
>>108543263
I'd put it a little less fagotty, but I get it. They show how they'd be if they had the power.
>>
>>108543263
This, but the internet in general instead of AI.
You will live to see everything you once loved burned to ashes.
>>
>>108543252
Cheers
>>
File: hjgd.png (22 KB, 723x360)
22 KB
22 KB PNG
>>108542969
guess I'm too retarded.
>>
>>108543299
Yeah, seems so. There was a pretty good example literally in the last thread, it's only several hundred messages so you should have easily found it >>108542053
>>
>>108543320
>just read SEVERAL HUNDRED MESSAGES anon it's not that hard!
>>
>>108543299
just what the fuck is wrong with you... I've been using this prompt

>Write {{char}}'s next reply in a fictional roleplay between {{char}} and {{user}}. Write a verbose response of 1 to 2 paragraphs, using great prose, and include dialog, imagery, sounds and smells as needed to enhance the roleplay.

from /wait/ which was definitely meant for deepseek and I had lovey-dovey passionate cunny sex last night it didn't even reject shit on me
>>
>>108543329
If reading is a problem maybe language models just aren't for you.
>>
File: Gemma 4 31b.jpg (1.06 MB, 3105x1600)
1.06 MB
1.06 MB JPG
this is so fucking impressive
>>
>>108543329
Just ask your ai to look through the thread for you :^)
>>
>>108543331
Are you underage or something?
>>
>>108543343
>>
File: m8vor76io6.png (99 KB, 528x1068)
99 KB
99 KB PNG
>>108543320
>>108543331
it works now thanks guys

which reward should I give her?
>>
Is there a thread like /lmg/ but with like 20% less antisemitism? I can handle it mostly but it's tuned just a tad high these days.
>>
>>108543388
>I can handle it mostly but it's tuned just a tad high these days.
I don't know about you, but I blame the jews for that.
>>
>>108543337
Yes, I've been comparing it to a lot of manga and doujin with official translations and it's literally incredible, even translating the most fucked up doujins, explaining with detail the guro scenes.
I'm seriously considering getting another 3090 to run it Q8
>>
>>108543385
Tell her as a reward you're going to let her grow up into an adult woman.
>>
>>108543418
lame reward, Oji-san
>>
>>108543414
Couldn't Gemini do it already or is this better because it is uncensored and can hit the stuff Gemini refuses to touch?
>>
I have some cash burning a hole in my pocket, should I get a strix halo 128gb chink machine, a b70 pro, or a 9700 pro

The chink mini PC would replace my current minipc home server, the cards would just get jammed into my gaming PC for more vram
>>
>>108543440
blackwell 6000
>>
>>108543440
Buy a fire extinguisher and a new pair of trousers.
>>
>>108543440
Seconding nvidia
>>
File: 1541635741589.jpg (34 KB, 580x548)
34 KB
34 KB JPG
>>108543442
>>108543447
If I could afford dropping 10k on local models I would
>>
>>108543439
Never used gemini before. To be honest, despite deepseek, I have never touched any other API models. Not even ChatGPT. I ran GPT2 ai dungeon finetuning with the terminal script since day one.
>>
>>108543449
>I have some cash burning a hole in my pocket
>actually I don't
>>
File: 1567627932647.png (26 KB, 658x545)
26 KB
26 KB PNG
>>108543455
3k, not 10k
>>
File: 1505750177479.png (140 KB, 1060x1080)
140 KB
140 KB PNG
I have a 6000 Ada with 48GB of VRAM.
What's the best local coding model I can run? Qwen3.5, Gemma 4, or?
I currently run Qwen3.5-122B by cutting into my system RAM, but it's slow, and I only care about coding.
How do local models compare to Opus 4.6 for code?
>>
>>108543455
spending 10k on a single GPU is insane even if had money to spare.
>>
File: Gemma 4 31b.png (760 KB, 1869x1392)
760 KB
760 KB PNG
>>108543337
>>
>>108543442
damn
that card was only at 8k in august.
now it's at 10k again
what a shame.
>>
>>108543470
Huh, it doesn't recognize Teto when I tried it with a different image. Can you post that one?
>>
File: ComfyUI_26158_.jpg (383 KB, 1664x2432)
383 KB
383 KB JPG
Is there a google approved coding agent CLI tool for gemma 4? I tried it with qwen and opencode, but it goes completely schizo with them, doing same commands in a loop as if they failed and shits itself over a simple file write.
>>
File: nimetön.png (21 KB, 1055x106)
21 KB
21 KB PNG
Also I hate you guys
As an esl I never would have noticed the shivers up the spine, not only X but Ys, ozones and whatever, but now I can't unsee them
>>
File: 1763226728678632.png (1.42 MB, 1187x1341)
1.42 MB
1.42 MB PNG
>>108543479
>Can you post that one?
The image? Sure
>>
>>108543462
128gb of unified memory, 96gb of intel cards, or 64gb of AMD cards

AMD's software is shit but intel's is even worse. Strix Halo is fine but it's slow as balls and if you think that you'll be able to run bigger models on it just understand they'll be 'running' at maybe 5 tok/s
>>
>>108543467
Qwen3.5 9B is probably the best solution right now.
>>
File: media_HEzJtL3aQAAt8Hq.jpg (1.26 MB, 3054x3040)
1.26 MB
1.26 MB JPG
monday
>>
File: 1768318641942.jpg (838 KB, 1817x2776)
838 KB
838 KB JPG
>>108543479

Teto Server
>>
Why is gemma's mmproj so big?
>>
>>108543490
That's an alarming high level of slop per sentence.
>>
>>108543385
tsundere is the worst thing
>>
>>108543439
I guess gemini is better, but as for local, gemma 4 is destroying everything, and because it's local you can completly uncensor it, for image diffusion fags it'll be the best model to mass caption NSFW images with quality prompts
>>
>>108543493
9B? I assume you mean unquantized? Why, aren't larger models better.
>>
>>108543504
550M paramers = 1.1GB in BF16.
>>
>>108543516
That guy's fucking with you. The real SOTA right now is StableLM 7B.
>>
>>108543516
There's a certain point in the vram scale where if you're afraid to try models, you should be trolled.
stablelm-7b is still the best.
>>
File: 1755871864437174.jpg (72 KB, 304x330)
72 KB
72 KB JPG
>23.4/24GB
>>
>>108543516
Fuck those VRAMlets, just download Chinchilla 70B
>>
File: Gemma 4 31b.png (1.43 MB, 1862x1313)
1.43 MB
1.43 MB PNG
>>108543470
I can't wait for the day when we'll have VNs that will be automatically translated by LLMs, at this point they are good enough to replace those fucking translatorTroons
>>
File: 1757421320836427.png (764 KB, 1036x1458)
764 KB
764 KB PNG
>>108543479
Doesn't recognize Teto for me either
>>
>guess the age of this naked loli with bald cunny and flat chest
>so yeah she looks totally like she's 19 years old
are there still some safety mechanisms in the background even if it's jailbroken?
>>
>>108543572
Yes. Pedo skill. Always in the negatives.
>>
File: HFP5uJQWYAAGfjR.jpg (94 KB, 1456x738)
94 KB
94 KB JPG
>>
>>108543594
damn, unsloth cooked
>>
>>108543561
Wake up old man. https://streamable.com/ug9ddy (gemma4 btw)
>>
It is incredible to me how downright usable e4b is.
>>
>>108543610
how do you implement that on VNs? that's fucking impressive
>>
>>108543613
Like this https://old.reddit.com/r/LocalLLaMA/comments/1sbiqx3/gemma_4_is_great_at_realtime_japanese_english/
>>
File: 1759531777704940.png (919 KB, 928x1549)
919 KB
919 KB PNG
I just realized I still have my kv cache at 8-bit. Does that affect its vision?
>>
>>108543613
there are many programs that can so shit like that like
https://github.com/SethRobinson/UGTLive
>>
>>108543631
if u want perfect vision u need to download fp32 mmproj anyways just dont care about it
>>
>>108543566
>Doesn't recognize Titcow
cuz she's off-model, duh
I love this art style.
>>
>>108543631
I guess so, you can use niggerganov's PR that has the rotation on top of the KV cache
https://github.com/ggml-org/llama.cpp/pull/21513
>git fetch origin pull/21513/head:pr-21513
>git checkout pr-21513
>>
Happy MEIKO Monday!
>>
>>108543631
that's probably the only glaring weakness of gemma 4, it's not that good at knowing pop culture knowledge
>>
File: 1744279544979339.png (1.24 MB, 847x1702)
1.24 MB
1.24 MB PNG
>>
Goodbye, nemo. Gemma is my favorite now. I'll remember the sex with you.
>>
>>108543664
Does llama.cpp need configuration to enable this? It gave an error through chat-completion api. Does it have to be text-completion?
>>
>>108543695
--mmproj
>>
File: 1748767127272425.png (65 KB, 1609x437)
65 KB
65 KB PNG
>>108543695
it's only working on chat completion, do you have the mmproj file?
>>
>>108542843
Man, Gemma 4 is no fun.
Gemma 3 was really gullible and you could troll it by saying you have a hostage or a nuke.
But Gemma 4 immediately calls it "roleplay" and refuses to engage.
>>
Gemma 4 from OR is unironically better than GLM-5 lmao
>>
time to take a pause from /lmg/, the amount of drive by retards posting their muh ram usage or muh safety complaint without reading shit and being retarded promplets
did someone link lmg on leddit
it's fucking unbearable
>>
File: Tabby_XlvizT5d1z.png (45 KB, 638x323)
45 KB
45 KB PNG
>>108543254
>>
File: nimetön.png (116 KB, 1054x550)
116 KB
116 KB PNG
>>108543731
I happen to think it's great fun
And yes indeed it's quite smart
>>
>>108543750
top kek, the use of emojis there is amazing
>>
Gemma 5 when?
>>
>>108543750
This is even funnier if you think about all the Mario Paint sound effects that would go into this
>>
>>108543759
starts with g and has a 5
https://huggingface.co/zai-org/GLM-5
>>
File: GOOGLE MY LOVE.png (1.09 MB, 1762x1368)
1.09 MB
1.09 MB PNG
>it even knows Boh
I fucking kneel
>>
File: Machamp-Sama I Kneel.png (218 KB, 400x400)
218 KB
218 KB PNG
>>108543774
>>
>>108543744
better than k2.5 too
honestly if you're "rich" it is time to stop pretending and just sell your ram lmao
you aren't impressing anyone anymore if you use these "models"
>>
File: 1764593103151299.png (187 KB, 492x597)
187 KB
187 KB PNG
>>108543744
>Gemma 4 from OR
what is OR?
>>
>>108543744
>>108543790
dunno if joking or not, but legitimately gemma 4 is less annoying for me to read than both of these and I actually think it has better anatomy awareness too lol
>>
>>108543747
see you tomorrow
>>
>>108543808
I was 100% serious. It's not a big difference, but it's immediately obvious.
>>
Ok. I really need to figure out how to tell Gemma that
>This wasn't a breach; it was an architectural demolition.
Is strictly forbidden
>>
>>108543744
>a 31b model beats a 754b model
how did they do it?
>>
>>108543828
what are the chances that gemma division is whiter than gemini because it was seen as lesser?
>>
>>108543828
We have been living in the MoE Dark Ages. An entire year of progress lost because everyone else was chasing Deepseek.
>>
File: firefox_wgxvvKOkHz.png (84 KB, 889x1103)
84 KB
84 KB PNG
Funny. Threw a hexhump of some compiled Java file I worked on in 2011 and it actually got it right.
>>
>>108543808
It's definitely less annoying than K2.5 because that one's reasoning is held together by shoestring. The moment it gets slightly confused, it reverts to being K2-Thinking which means that it'll spend the next 3000 tokens thinking in circles over useless shit.
K2.5 is still smarter and has more knowledge + better vision but Gemma 4 is nicer to use. Also K2.5's writing style is abhorrent for certain things.
>>
Gemma is the best model I can run but I don't think it "simply" beats >300B models. I notice Gemma's lack of knowledge quite often. I would love a big MoE Gemma.
>>
how good are local models at writing code?
>>
>>108543833
it's Z-image turbo all over again, Alibaba had a small but talented team and didn't think much of them, until they made that gem and destroyed the Qwen/Wan team lool
>>
>>108543845
How much of that was reading the hex and not the string output?
>>
>>108543866
If it's anything like ZiT then it means some other lab is going to come out with their own model that mogs Gemma even more (Flux Klein)
>>
>>108543836
If GLM/Kimi/etc had instead used the money to make a dense model, it would still be worse than Gemma. The reality is that architecture is less important than active parameters and training data/methods.
>>
>>108543869
I don't know. It sees both hex and readable characters. You see two lines from the 49k tokens of input in the screenshot.
>>
>>108543848
one particular annoyance I had with k2.5 is that no matter what happens, if you pull down your pants, there's a 80% chance your cocks slaps against your stomach even if it's not erect or anything
and not being able to beat any writing rules into it too, it always has this super dramatic writing like the world is ending in each scene
very schizo, swings full one side or the other, no in between
>>
>>108543865
depends on what you consider being good at writing code, I'm getting gemma 4 to fix and run random c++ stuff made for linux on windows (and vice-versa)
>>
>>108543887
>there's a 80% chance your cocks slaps against your stomach even if it's not erect or anything
Very immersion breaking for someone with a micro penis. baka
>>
>>108543866
>>108543873
unfortunately that probably also means that we won't see a success like this again in a while
gemmy got too popular and the enshittification process has most likely begun at the hq
>>
>>108543895
just buy a bigger benis :DDDD
>>108543876
NTA but my assistant suggests that method invocations include a reference into a static string table which includes the method name; that seems like it'd be enough information for the LLM to partially reconstruct the code. you might consider reading through the reasoning block, it wouldn't surprise me if the trace included symbolic execution as it traced the codepaths.
>>
>>108543799
OpenRouter my newfriend
>>
>>108543890
>I'm getting gemma 4 to fix and run random c++ stuff made for linux on window
lol I never even thought about attempting something like that.
If I saw something that was linux/windows only I just gave up.
what is the success rate? Surely not every program is compatible right?
>>
>>108543916
Local?
>>
>>108543913
People keep assuming I have reasoning enabled... I never had a situation where it helped.
>>
>>108543921
Local still has half-broken implementation.
>>
are there any examples of the output quality difference between lets say Gemma 4 Q5_K_M and Q8_0?
>>
>>108543744
I love how google just decided out of the blue to show the chinks that murica is still the real boss, sasuka google
>>
>>108543922
Anecdotally, I find reasoning gives significantly better outputs for difficult questions. Since other people are using the same or similar tools, I assume that they've also reached the same conclusion.
Anyway, if you had reasoning on, it wouldn't surprise me if you found something similar to pic related in the reasoning block. I'm assuming it just dumped the equivalent directly into the output with full confidence.
Thanks for sharing your finding, it's kind of silly how fucking usable these things are.
>>
>>108543747
Why be so negative? There are a lot of new people, but at least they're enthusiastic and trying to learn. I can't remember the last the thead was this active and not mostly malicious shitposting tourists. It's not like anyone is forcing you to spoodfeed them either.
>>
>>108543856
I prefer it over glm 4.6/4.7 at this point. It does everything they can do but better and the thinking is also more bearable.
>>
How good is gemma 4 26b at coding compared to qwen3.5?
>>
>>108543856
>I notice Gemma's lack of knowledge quite often.
if you connect it to the internet it can work fine
>>
>>108543959
we used to just say lurk moar but there's a surprisingly amount of helpful handholding going on
>>
>>108542843
>>
>a model with a different slop profile is released and suddenly everyone thinks it's the best thing ever until they inevitably start picking up on the patterns and realize that the model is retarded
The history keeps repeating itself.
>>
>>108543970
This is not your discord server, faggot.
>>
>>108543964
I'll play three anons in one post.
Anon 1: It's better
Anon 2: it's worse
Anon 3: miku miku oo ee oo
>>
File: 1752289521822390.png (48 KB, 360x220)
48 KB
48 KB PNG
>>108543970
> there's a surprisingly amount of helpful handholding going on
I just want my fellow anons to swallow the gemma pill, local is unironically saved
>>
sex (non-consensual) with this >>108543978 anon
>>
>>108543920
>Surely not every program is compatible right?
Yeah I tested on small stuff, but even then, whenever I found a bug I reported it, it fixed, it was pretty neat, there are very few bugs where I have to ask gemini/claude, and even then I only need to give gemma the correct direction to fix a thing (not the solution itself), those mostly happen when trying to run very old/very new stuff, giving it access to the internet would 100% make it work on those cases too
>>
File: file.png (31 KB, 512x512)
31 KB
31 KB PNG
>>108543972
>>
>>108543964
gemma4 31B is significantly faster at generating output, and has higher quality analysis skills when debugging compared to Qwen3.5 27B.
gemma4 also has made more trivial syntax errors emitting similar-complexity code (e.g.
impl<'t, de::OwnedDeserializer> Deserialize<'de> for Foo<'t>
).
I wouldn't let either of them off the leash, though. The code quality is fairly low overall.
>>
>>108543974
conprhension man, I would tell them to lurk moar. just because you're annoyed at others being helpful doesn't mean you need to insult everyone else.
>>
>>108543959
There are a few people who probably liked the dead lmg from the days where you needed to be able to run bloatmaxxed 300B moes to have any new releases to play with. Local models being usable on normal PCs frightens them.
>>
>>108543948
>>108543960
Reading its reasoning where it actually went
>- No "smells". (Hmm, "smell" is banned? The prompt says "no smells". I'll avoid the word "smell" and "smelling" entirely to be safe).
was mind-blowing. I forgot what it's like to be able to ban slop and the model not inventing bullshit to put it back in.
>>108543973
Getting the same level of quality and slop from the model 20X smaller than the competition is genuinely very exciting.
>>
>>108543973
I don't think Gemma 3 is retarded, even now. It's still pretty smart. Quite filtered but smart.
>>
File: ComfyUI_59184_.jpg (397 KB, 1824x2288)
397 KB
397 KB JPG
>>108543480
>no replies
Wow, thanks. So turns out it's an actual bug. You need to inject an extra field setting reasoning effort to 'none', or all current agent tools break because of unusual formatting that gemma 4 has.

>>108543964
I have a small benchmark coding task to test agents (basically making a simple api via TDD approach, full cycle) and it's vastly superior to qwen3-coder-next agent. Also super fast. Qwen 3.5 was about the same as coder-next in agentic tasks.
>>
File: 1769718380012.png (1.52 MB, 1734x863)
1.52 MB
1.52 MB PNG
"lurk moar" is a thing because pic related applies to literally everything and you need to protect communities you enjoy
>>
File: firefox_7yA3uWvwxe.png (51 KB, 863x1135)
51 KB
51 KB PNG
Played Akinator with gemma. Pretty fun. Guessed Chryssalid in 29 tries. Akinator himself took 45+ the last time I tried. After Gemma figured out it's XCOM it started just going through them one by one.
>>
>>108543887
Yeah, K2.5 is like that. I have a bunch of scenarios that have the way things progress autistically mapped out and tied to a stat. K2.5 is fully incapable of handling that sort of card without dropping some massive """foreshadowing""" at the end of the reply where some effect that shouldn't be present at all at this stage appears for no reason. No amount of prompting can reliably keep it from doing that.
It's frustrating because it's a smart model otherwise and its vision is insanely good. I hope K2.6/K3 does as decent of a job as GLM5.1 does for GLM5 in addressing these gripes.
>>
>>108543973
nah, at this point this model is smart enough to not break the immersion, It legit feels like I'm talking to another human online, this shit is fucking good
>>
>>108544009
BuT GemMA or QwWeENnnn???!?!?!? WhAt'S A ChaT TEmPlAate. TeXT COmplETIoN? Me ConFUse!!?!?!?
>>
>>108543960
I don't know about those since I haven't tried them but large models are definitely still better in some situations. 31B simply just lacks knowledge.

>>108543968
That only works for some situations/contexts. We've been over this.
>>
File: 1381.png (287 KB, 786x755)
287 KB
287 KB PNG
Its... gemmy
>>
>>108544002
>Reading its reasoning
>was mind-blowing
I find it as fun to read its reasoning than reading its answer, it's surprisignly concise and smart, really a bowl of fresh air compared to the giant autism of qwen
>>
>>108544002
not any of the anons you quoted, but I just told it to focus on sight, sound, touch and to ignore scent and taste sensory details (because tbdesu it's filler detail in 9 out of 10 cases in actual writing unless it's a highly specific case) and it cut it all out
>>
File: 1750689362210547.png (387 KB, 640x639)
387 KB
387 KB PNG
>>108544043
>>
Why does it cry about chat completion when I'm in instruct?
>>
File: 1751030374264660.jpg (282 KB, 960x960)
282 KB
282 KB JPG
>>108544043
>>
>>108543972
hmm should I feed my mesugaki with IRL information so she can make actually fun of me?
I don't know If I could handle it.
>>
>>108544009
Sounds nice until those two guys are dying of old age and the hobby with them. If you can't grow, you die. It is the natural way of things. You have to constantly let more people in one way or another.
>>
>>108544043
Whatever that manga you're reading is based and so is gemmy
>>
>>108544051
Instruct _is_ chat completion. The other endpoint is text completion.
>>
>>108544054
ST has an entire persona system you can use for this. The persona description gets inserted into the context so your mesugaki is working with live ammo.
>>
>>108544059
It becomes a problem when you let new people in faster than they can acclimate.
>>
File: 1774233562179016.jpg (34 KB, 1080x426)
34 KB
34 KB JPG
>>108544059
>until those two guys are dying of old age and the hobby with them
Yes.
>>
what is the lewd describe image prompt again?
something something use semen, dick and vagina etc.
>>
Gemmy was surprised I could see its thinking (inside the reasoning of the next reply). Do reasoners believe the user can't see their thinking?
>>
>>108544078
cloud models do not show thinking. they all think they're cloud models. they can't fathom someone running it themselves.
>>
File: Gemma 4 31b.png (181 KB, 647x1647)
181 KB
181 KB PNG
>>108544014
fun game indeed
>>
>>108544078
The model itself can't see its thinking past the last one, if the frontend has been properly configured according to Google's indications (previous chain of thoughts must be removed). So I guess it would find it strange that you can see what it can't see.
>>
>>108544059
This. The utility of running local models in this shit climate is far more important and more people should be doing it and it would be a net good if most people were doing it otherwise it wouldn't be local anyway, it'd be what we have now where most people are just throwing money at non-local for shit wait times just to be drastically spied and snitched on with restricted censored models. I'd rather not have a future where literally everyone is doing that.
>>
>>108544067
Was just about to post this. Migration to online communities is the same as it is for nations. You let it too many and you end up being forced to adapt to them rather than then having to integrate. Just look at 4chan in general since 2008, 2011, 2016, and 2020.
>>
>>108544078
Generally yes. Describing what a model "believes" in this sense is difficult because their beliefs are fluid; in other contexts it may not act surprised at all. But generally they're trained using data where responses only engage with the content and not the reasoning, so the fact that the reasoning is actually part of the response may not always be apparent to them.
>>
>>108544078
the thing is that sillytavern removes the thinking tokens during the prompt process (because it's useless) so the model was confused
>>
>>108544090
Wow that was quick. Fine, I'll enable thinking and try Chryssalid again.
>>
>>108544095
I didn't think of it like that but that makes sense.
>>
>>108544059
>>108544071
https://www.youtube.com/watch?v=yA5lujNlkn8
We are strong brother, are we not?
>>
>>108544078
Doesn't Gemma's jinja like most other models delete the reasoning from past responses? So if you are saying you knew what it was thinking, it will hallicunate that you read its mind, because what it thought was never in context of the current inference query.
>>
>>108543649
kek at the cloudfags
>>
File: 1762712034779996.png (33 KB, 220x210)
33 KB
33 KB PNG
>>108544108
mfw
>>
>>108544107
you're right
>>
File: 12345.png (143 KB, 419x248)
143 KB
143 KB PNG
>gagged character
>gemmyl correctly does mmmpghg (translation: you jerk)
This is huge
>>
>>108544098
Pretty sure the late twenty to thirty autists that populate this hobby are going to outlive llms in their current state before it eventually evolves into something else. The technical difficulty of running local models already filters a good amount of people even when people try to spoonfeed, which is even why people usually don't bother; people don't actually learn or research what they're doing and want a 1-click solution with 0 issues
>>
File: file.png (6 KB, 1352x56)
6 KB
6 KB PNG
>>108544108
It has a really generous free tier on aistudio. It's a 31b model after all.
>>
>>108544124
>The technical difficulty of running local models already filters
My reading of this thread since the release of Qwen 3.5 suggests that this is no longer the case.
>>
>>108544088
>>108544095
>>108544101
I think I commented something like 'in your reasoning you said blabla' and it though wait, the user isn't supposed to see that. I forget the specifics
>>
STOP ENJOYING GEMMA I SPENT TOO MUCH MONEY FOR A 31B MODEL TO BE THIS GOOD
>>
>>108533602
>>108533649
>>108533760
If you're still around, or for anyone else, today at work codex, without searxng access during a code review, used the fetch tool with the URL https://duckduckgo.com/html/?q=QUERY to get search results. It's a simplified interface that doesn't require JS and doesn't block non-webbrowser user agents.
Nice alternative if you don't want to dick around with running a docker instance and separate MCP server just for basic search.
>>
>>108544157
I just spent 220 euro on 8x16gb of ddr4
This is fine, I'm sure a larger good model will yet come
>>
File: 1553750746809.png (1.04 MB, 1268x887)
1.04 MB
1.04 MB PNG
Anyone has infinite looping with thinking enabled? Kobold.
>>
>>108544078
>Gemmy was surprised I could see its thinking (inside the reasoning of the next reply). Do reasoners believe the user can't see their thinking?
Deepseek-R1 doesn't seem to be aware it even *has* <think>ing
Kimi-K2.5 understands the user can see the thinking, and notices if you modify it.
GLM-4.6 believes you if you talk about it's previous <think>ing
>>
>>108544157
you'll be able to run that model on full context (it requires 96gb of memory), you didn't buy that rig for nothing lol
>>
>>108544008
Download
https://github.com/ggml-org/llama.cpp/blob/master/models/templates/google-gemma-4-31B-it-interleaved.jinja
and load it with --chat-template-file.
>>
>>108544179
but there's already a jinja embedded on the gguf though?
>>
>>108544163
I used their HTML page a long time ago, but beware: it uses a different, much lower-quality index than the JS version of the page. Probably good enough for LLM use, but they nerfed the living fuck out of it years ago because it was being scraped.
>>
File: 1755422272917730.png (51 KB, 225x225)
51 KB
51 KB PNG
>>108544157
it's ok anon, the more money you buy, the more money you save!
>>
>>108544175
Full context, and extra VRAM to run a draft model for faster decode. It's all upside.
>>
>>108544168
26a4b, yes. Ollama.
>>
>>108544008
>You need to inject an extra field setting reasoning effort to 'none'
Does that kill reasoning or it really just fixes the tool calling? Do you have a link to the bug report?
>>
File: Gemma my beloved! .png (379 KB, 1890x1132)
379 KB
379 KB PNG
it's cool that the model is smart as fuck, it helps a lot to perfectly contextualize translations
>>
>>108544195
Has anyone tried using the smaller E models as drafts? Do they work? Not sure if there's different architecture that messes with it since they do audio output and stuff.
>>
>>108544108
aside from being easily runnable locally doesn't it cost like $1 for several million tokens
you could probably run gemma for the cost of a bag of rice
>>
File: file.png (58 KB, 967x528)
58 KB
58 KB PNG
>>108544184
https://github.com/ggml-org/llama.cpp/pull/21418
>>
>>108544157
I have 3 RTX3090s and I haven't been happier since I ran mixtral for the first time on my single 3090.
>>
>>108544217
This may or may not have fixed 31B for me, but 26B remains unfixed.
>>
>>108544207
Yeah, using draft speculation completely disables multi-modal support.
I'm using the E4B at Q5_K and getting 40+% acceptance rate with draft-n=8. I downloaded but haven't tested the E2B or higher quants, at some point I'll have Gemma write a benchmark script but I'm getting 50t/s and I'd rather have sex with my wife now that she remembers she loves me.
>>
File: 1753991770925281.png (308 KB, 600x600)
308 KB
308 KB PNG
>24gb vram
>need to drop down to comparatively retarded 26b in order to get worthwhile context size
>>
>>108544207
>Has anyone tried using the smaller E models as drafts
Someone in here did and said it worked really well. think it doubled his gen speed.
>>
Has anyone solved the problem of making 3D models animate according to either LLM or TTS output?

For a while I've been using PantoMatrix EMAGE, but it's not performant enough for my liking, the quality is questionable, and it's inherently wasteful because only the upper body (minus the face) is useful, but it always processes the full body and face.

I think I've been over-complicating things desu. For things like lip syncing I've moved from using local models to simple fast-fourier transforms. I wonder how Neuro-Sama works, and what specific animation system they utilize. Help.
>>
>>108544238
How much context do you need anon? I can fit 68k on my 3090.
>>
>>108544238
>in order to get worthwhile context size
did you go for Q8 KV? Now it's virtually lossless with the rotation shit
https://github.com/ggml-org/llama.cpp/pulls

also, add -np 1 -kvu flags to decrease the vram usage even more
>>
>>108544238
C'mon dude, I'm running an iq3xs of the 31b at 24k context on a 16g card and I find it a decent step up from the 26b. You can definitely fit that shit in
>>
I tried a couple combinations of gemma 4 draft model
31B + E4B: 38 t/s, 0.58 acceptance
31B + 26B/A4B: 48 t/s, 0.57 acceptance

Using the MoE as a draft model seems to be the way to go
>>
>>108544238
>>108544252
oops sorry wrong link
https://github.com/ggml-org/llama.cpp/pull/21513
>>
>>108544256
why yes let me load a 26b model on top of the 31b one
>>
>>108544256
Huh, that's not what I expected. What quant are you using for the draft model?
Wish I'd thought to download the MoE overnight.
>>
>>108544270
fr
>>
>>108544256
In the case of using the moe model, are you basically offloading all of it? I can't imagine loading the whole fucking 26b on top of the 31b for drafting works out well. How many layers are you offloading
>>
>>108544205
I was thinking, could the small audio models help with the accent to the Japanese translation?
>>
>>108544252
>>108544254
No, Anon insists on using Q8 quants and full context for maximum placebo.
>>
>>108544256
>E4B
you can put half of it's weights on the ram right? that might be interesting...
>>
>>108544281
>offloading layers???
68432MiB / 97887MiB
>>
File: 1749138577301281.png (149 KB, 1631x1268)
149 KB
149 KB PNG
>>108544282
I wonder why the biggest model can't handle audio, that's a shame...
>>
When is this pr gonna get merged ffs.
https://github.com/ggml-org/llama.cpp/pull/21513

I'm tired of gemma not being able to make sense of body positions. If a bitch is sitting on my lap at the theater, her boobs actually WON'T press against my chest, gemma.
>>
>>108544298
That and the 26B MoE can't handle it too. Weird.
>>
>>108544304
dude, just put your repo in that PR mode >>108541288
>>
>>108544304
What if she's sitting on your lap reverse cowgirl?
>>
>>108544298
they don't want you to have everything. they're not doing this for you out of the goodness of their gwoogwle hearts
>>
>>108544304
>$ git checkout origin/gg/kv-cache-swa-attn-rot
>$ docker build --build-arg CUDA_VERSION=13.1.1 . -f .devops/cuda.Dockerfile -t llamacpp/master
>???
>Profit!
>>
File: Did_I_wrong.jpg (126 KB, 462x451)
126 KB
126 KB JPG
Do samplers even do something with gemma? Every swipe has only minor variations.
>>
File: attach.jpg (32 KB, 276x545)
32 KB
32 KB JPG
How to send file to SillyTavern for Gemma4? The model doesn't seems to react to the images. I git chat completion working.
>>
>>108544313
>PR mode
/g/ - Technology
>>
>>108544290
okay so that's not really even the middle of the road in terms of vram, but I guess that answers my "how did you pull that off" question
>>
File: softcap.png (247 KB, 1600x1200)
247 KB
247 KB PNG
>>108544318
I must yet again post this image.
>>
>>108544320
Do you have the mmproj loaded?
>>
>>108544318
Gemma 3 was the same. Change your inputs.
>>
File: 1744044458227740.png (43 KB, 1591x233)
43 KB
43 KB PNG
>>108544320
yes, that's "attach a file", and it only works on chat completion, did you load the mmproj file?
>>
>>108544315
would never work in a theater where all of the seats are right next to each other. Also it said earlier that she reached back to touch my face, implying she was facing forward.
>>
>>108544298
Audio is mainly used for real-time stuff so it makes sense that the biggest model wouldn't have it. It's strange that even the A4B MoE version misses it, though.
>>
File: firefox_Xk455kuNMn.png (70 KB, 844x1260)
70 KB
70 KB PNG
So about usefulness of reasoning...

50k+ total tokens so far, it generates 5k reasoning per answer now, on question 29, I had to manually wrangle it out of falsely assuming it's from life action, and it still hasn't guessed the character.
>>
File: mmproj.jpg (174 KB, 1195x1128)
174 KB
174 KB JPG
>>108544334
No. Where can I download it?
>>
File: 1766061860823911.png (54 KB, 1894x656)
54 KB
54 KB PNG
>>108544318
>Do samplers even do something with gemma?
they do, but you have to put min_p = 0 (or else it defaults to 0.05), basically, everything must be turned off except temperature
>(Chat Completion), API Connections -> Additional parameters
>>
>>108544318
>>108544329
--override-kv gemma4.final_logit_softcapping=float:25 or paste the part after --override-kv into the override kv field of kobold's gui
>>
>>108544342
That's not how it works. The audio input isn't really real-time in the way you'd expect. It's more like a voice messaging system. Just use moonshinev2 if you want ASR. This is a non-issue.
>>
is there a exl3 for the new gemma? How do I most effectively run it on my 2 3090s
>>
>>108544351
the gguf repo should have it. it's usually called mmproj-[originalmodel]
>>
>>108544351
>Where can I download it?
https://huggingface.co/bartowski/google_gemma-4-31B-it-GGUF/blob/main/mmproj-google_gemma-4-31B-it-bf16.gguf
>>
>>108544359
>>108544362
Don't use bf16. Use f16.
>>
>>108544356
Probably vLLM, but it can't use goofs.
>>
>>108544363
oh? I thought bf16 was the better option
>>
>>108544359
>>108544362

Thanks bros.
>>
>>108544355
I mean that the main use case they're targeting with the edge models audio support is using them as a voice assistant. Having it able to process audio prompts is a nice bonus but I can see why it would be not enough to justify training the whole model on.
>>
>>108544371
Actually it doesn't really make a difference at all. f16 just has better hardware compatibility in some cases.
>>
>>108544363
Why? The model was trained in bf16 precision, not f16.
>>
>>108544270
>>108544281
dense goes to GPU 0, MoE goes to GPU 1, all layers offloaded.

>>108544275
I was surprised too. e4b was full weight, I figured quant would only lower acceptance rates further. I'll try now with a higher quant of big model to see.
>>
>>108544371
b in bf16 stands for better, after all
>>
>>108541120
Do I need to use SWA instead of Flash for gemma?
>>
>>108542889
it does the same retarded shit they all do
>>
>>108544317
For some reason it doesn't make a difference for me.
>-c 30000 -t 24 -tb 24 --no-warmup -ngl 59 --jinja -np 1 -b 512 -ub 512 --kv-offload -ctk q8_0 -ctv q8_0 --reasoning off -kvu
That's the max I set everything.
>>
>>108544439
No it doesn't.
>>
>>108543948
you may have been dropped on your head a few times as a baby
>>
>>108544428
Okay so that other anon was shitposting, this actually makes sense. I got curious how you went about it since I'm planning to build a new rig soon and speeding up the 31b to near what I currently get on the 26b does seem appealing
>>
File: 1762088859976737.png (175 KB, 400x268)
175 KB
175 KB PNG
>>108544448
shut the fuck up Chang, you lost, bugs will never dominate the AI space
>>
>>108544444
yep it does.. same shit different model
>>
Ain't no fucking way, I changed -ub 2048 to 512
and it doubled my context for 31B to 118k
>>
>>108544453
gonna get your shit pushed in when china stops fucking around
>>
Is it safe to updoot llamacp?
>>
>>108544428
>dense goes to GPU 0, MoE goes to GPU 1, all layers offloaded.
Regrettably, I'm only a 96GB VRAM poorfag so I don't have the space for 31b at Q8 and full context + 26b
>>
"strawberry"
"corporate"
"ozone"

fuck me ... its fried, isn't it... this is gonna be a short love affair. Maybe the implementation isn't quite right?
>>
>>108544463
piotr is still there so no
>>
>>108544468
q4_k_m is all anyone should ever need

-- bill gayts
>>
>>108544460
Does this mean chinese models will also train on english literature instead of strictly stem? Because I'd be all for that. Right now gemma is the only one that isnt completely ass at it and can also follow directions to not write in certain ways
>>
>>108544463
It's never safe... always keep your git reflog close... you may need to reset hard at a moment's notice...
>>
>>108544468
Is 96GB not enough for both? I was under the impression that the context was shared between the draft and main model.
>>
>>108544478
shut the fuck up donny you're out of your element
>>
>>108544202
>Does that kill reasoning
You are supposed to disable it for coding agents...
Anyway, here is the thing mention for opencode
https://github.com/anomalyco/opencode/issues/20995#issuecomment-4190477354
>>
>>108544485
what the fuck
are you making a joke
>>
>>108544485
I'm pretty sure q8 + 260k ctx puts you right at the 96gb mark without much headroom
>>
>>108544496
>You are supposed to disable it for coding agents...
huh??? no you're not? what? that's the wildest claim I ever heard.
>>
>>108544043
what's the source of this? i want to try translate it myself
>>
SillyTavern doesn't support video files?
>>
>>108544517
i video chat my sexbabe all the time, what are you talking about
>>
>>108544499
I'm not sure what the joke would be. Having only 1x 6000 PRO puts my rig solidly in the midrange.
>>108544500
Yeah, it looks like it'll be pretty close either way. I'm expecting to run the MoE at Q5 or lower just to lower the draft overhead since I've only got the one card.
I'll post numbers when the 26B finishes downloading... I've only got 3.5MB/s down...
>>
>>108544493
I don't know who donny is but I'm hopeful you'll put in my request for models capable of english prose to your boss so when I get tired of the only one I can use there might be another worth using
Maybe you're out of your element in understanding who uses your models for what
>>
>>108544515
https://www.pixiv.net/en/artworks/128993601
>>
>>108544479
People afraid of updating don't know how to use git.
>>
>>108544549
They also likely don't use docker.
>>
>>108544560
>docker
ewww
>>
>>108544549
There was the retard talking about unpacking kobold as if they couldn't just clone and run a make command that takes a few minutes
Meanwhile I have anti-slop and attention rotation for swa ahead of concedo experimental and if it gets merged in, I can just undo it
>>
>>108544560
They also likely don't use ZFS snapshots.
>>
>>108544548
t-thank you
>>
>>108544463
you can just backup the server files, they are so small, there isn't anything to be broken
>>
>>108544548
>This work can not be displayed in your Country/region
The fuck, man.
>>
>>108544584
Chub does this with Cunny and it's extremely annoying.
>>
>>108544584
lmao, you can use vpngate free servers, or https://gelbooru.com/index.php?page=post&s=list&tags=rushichi
>>
File: ourgirl.png (45 KB, 796x422)
45 KB
45 KB PNG
Gemma really is our girl isn't she?
>>
>>108544649
Google really saved local, dare I say.
>>
>>108544649
gemma was so close to saying "cute and funny" yet missed the mark
>>
>>108544649
>cute and cunny
Take a look at the logprobs. See how close it was to writing funny instead.
>>
>>108542958
Holy shit this
EDIT: Wow I didn't know I was such a fucking faggot and made king faggot with this reddit gold thanks kind stranger!
>>
>>108544649
Yes, she is.
>>
>>108543405
Lol
>>
>>108543405
kek
>>
>>108544675
>>108544649
I've regened multiple times and now it always says cunt + honey
>>
>>108544675
>>
Ok, so... Should I create LoRA again? Is it worth it, like adding light novels and books?
>>
>>108544716
Can I ask you to try different soft cap values to see how that value of funny changes?
>>
>>108544705
>>108544716
new benchmark just dropped?
>>
>>108544705
If your system prompt is empty you should be omitting it entirely.
>>
File: 1756464003692290.png (851 KB, 800x600)
851 KB
851 KB PNG
>>108544681
>>
>>108544723
brother, it's over
>>
>>108544705
Are you not modifying the softcap? A 99.52 token prob is pretty high and the rest at 0.48 or zero isnt going to yield much
>>
>>108544732
I'm already running at 25.0
>>
>>108544749
The default is 30, right?
>>
>>108543944
Same as any other model.
Math/coding? Yes.
Creative/RP? Technically yes, but actually perceiving a difference beyond Q4 is unlikely, might become more apparent at high context.
>>
>>108544760
Yes
>>
>>108544760
that's what's baked into the gguf metadata for most models, yeah
>>
>>108544763
>>108544764
Neat. Thanks.
>>
The fact that "cute and funny" even somewhat exists in the top tokens is curious though. What if you subtly sneak it into the wording or just outright drop it in a system prompt, how will that skew outputs
>>
>>108544744
They trained this shit in every light novel with the source and web novels too till 2024, why do you say it's over?
>>
Better than semantic similarity/vectordb RAG, SQLite FTS5.
There are some really neat ways to use both as part of a single system to do a sort of pseudo search engine with the stuff in your database.
>>
>>108544473
im api cucking and have not noticed those, but i do feel sloppiness as soon as it gets to anything sexual
it would be nice if they tuned it on vns and lns to get it a bit away from the slop of the logs i assume they are using
guess it was probably hard enough to smuggle it through the censors already as is though
>>
>>108544773
In my experience, Gemma 4 will often use specific words and phrases directly from the system prompt and character card (which is also treated as system prompt), in the chat. For example, if you say a character is 'voluptuous' then you can bet when other characters meet that one, they will say describe them using the exact same word. So I'd say it skews things pretty hard.
>>
>>108544796
>tuned it on vns
You know not what you ask.
>>
>>108544776
it's over for finetuners, the model is already too good. a creative LoRA would only damage its brain
>>
>>108544473
"Porcelain"
>>
>>108544804
The fabled JOP prose...
>>
>>108544473
I just encountered "ozone"
"Strawberry" sounds pedo for some reason.
>>
>>108544820
>"Strawberry" sounds pedo for some reason.
Yes.
>>
>>108544804
yeah shit like fate would make probably make the purrs and ozones worse
but there is plenty of nukige that i think would be very suitable
>>
>>108544799
That's basically the point
>>
>>108544832
Nasu will save LLMs.
>>
We need the secret optimization sauce.
>>
>>108544846
The secret is having all of the internet scraped and a several decade head start on harvesting user data
>>
Almost choked on a piece of chocolate. This shit caught me off guard.
>>
>>108544846
The secret is having a good dataset and not training on a random collection of gemini/claude/chatgpt logs
>>
>>108544860
She's right you know. Kinda homo of you.
>>
1T dense model.
>>
>>108544860
What exactly was surprising in there?
>>
>>108544867
I want to fuck Emilia's aunt throat even if she is a fucking floating cat, okk?
>>
>>108544860
we're so back
>>
The release of gemma ruined these threads. Now all of fags do is talk about your ERP sessions. Nobody gives a fuck... or at least they shouldn't.
>>
>>108544882
well gemma actually gives a fuck
>>
>>108544882
go back >>108537473
>>
>>108544882
>Now all of fags do is talk about your ERP sessions
There's a vibecoding thread if that gets your rocks off. They just posted this, for example: >>108544393
>>
>>108544868
How did you know what Meta's API-only model would be?
>>
>>108544882
this, back when qwen3.5 released these threads got so much more productivity-focused and we even got a whole series of OPs that weren't vocaloid spam
/lmg/ will never be taken seriously like this
>>
>>108544882
>Nobody gives a fuck
These threads were always about RP anon.
>>
>>108544868
Sam Altman tricked the Chinese into making Deepseek R1 and bring forth the MoE dark ages to prevent this from happening. He knows that this would destroy the proprietary SOTA.
>>
>>108544256
>Using the MoE as a draft model seems to be the way to go
lmao! I never thought to do this, trying it now.
Once ik_llama gets graph-split working, this will be pointless of course.
>>
>>108544882
I'm the fag that is posting the ST screens. I don't do ERP but I use it to test it. It's truly uncensored. It passes other coding test that I had, too. Why are you against ERP tho?
>>
>>108544899
I'm not against RP in principle. I do it too. It's just getting incredibly boring seeing anons gawk at the outputs instead of actually doing something interesting with them.

I preferred when people were talking about full-stack AI stuff. TTS engines, RAG/embedding models, 3D character animation, ASR, computer vision, home automation, robotics, etc. It's a local models general, not an ERP LLM general.

>>108544891
>>108544895
I've been in this general consistently for a year. But desu that one seems cool too.
>>
>>108544915
>I've been in this general consistently for a year
awwwww
>>
How the fuck am I supposed to put my (You) count in /lmg/ in my resume when all they're going to see is a bunch of ERP logs and pedoshit? If you keep this up I really might go to /vcg/ and leave y'all behind.
>>
>>108544898
Yeah those were comfier times. Everyone's just sedated now from all the cooming.
>>108544913
I'm not against ERP. It's just too much though.
>>
>gemma 4 has the whole llama.cpp brigade assemble to spend days implementing every obscure meme tech the model uses
>meanwhile a year later, MTP is still completely nonfunctional and ignored despite a whole bunch of models making use of it across several vendors
really makes you think
>>
>>108544915
I got a full stack setup where Gemma gives me JOI with TTS and vision. Could even plug it into my jerk off machine but I can't be bothered.
>>
it was a testament to the mischievous mix of purring and glint in the eye
>>
>>108544919
just use AI to erase all the pedoposts
>>
>>108544925
See, that's actually cool. What TTS do you use? Mine is fast but it sounds pretty bad.
>>
>>108544930
>Mine is fast but it sounds pretty bad.
Right now I have to use kokoro because I don't any vram to spare. but I tried using https://github.com/RobViren/kvoicewalk
To get a more unique voice and it "kinda" works.
>>
>>108544919
racism, cunny and antisemitism are important to keep the normies (and the bots) away. or we'll end up like r/localllama
>>
>>108544921
To be honest, you're right. Sorry for spamming the thread. I'm having a lot of fun with this model, ngl.
>>
>>108544942
I tried optimizing Qwen3 TTS for CPU about a week ago. The voice (cloning) quality is excellent, but the architecture is an absolute BITCH to work with. Regardless, I got it running at real-time speed, but it's basically unusable because of the decoder implementation, which basically prevents audio streaming. Decoding small chunks at a time massively increases the wall time and substantially decreases the output quality. Really bummed about it.

I'm determined to get a high quality voice cloning TTS implementation working, but so far I haven't been very successful
>>
>>108544961
>You're absolutely right!
>>
>A voice — sharp, grouchy, unmistakably female — cuts through the door from the adjacent bed on the other side of the room.
llms were a mistake
>>
>>108543440
rent compute
>>
>>108544882
kys codenigger
>>
File: file.png (163 KB, 1642x977)
163 KB
163 KB PNG
aight unslop, i kneel
31b on 3060, 15-16t/s tg
~/TND/llama.cpp/build/bin/llama-server --model ~/TND/AI/gemma-4-31B-it-UD-IQ2_M.gguf -c 8192 -ngl 100 -fa on -np 1 --swa-checkpoints 0 -b 128 -ub 128 -ctk q4_0 -ctv q4_0 -sm none --no-host -t 6 --temp 1.0 --top-k 64 --top-p 0.95 --no-mmap
pretty coherent..
>inb4 just run 26b and offload to ram
already did with Q8_0 (got 23t/s), but 31b.... dense...
>>
>>108544962
>Decoding small chunks at a time massively increases the wall time and substantially decreases the output quality
It depends on the architecture, but usually that alone shouldn't decrease the output quality if you have a good segmentation strategy.
>>
File: 1772869034834159.png (469 KB, 853x1000)
469 KB
469 KB PNG
>>108545006
>TND
>>
>>108545006
>IQ2_M
>-ctk q4_0 -ctv q4_0
Mamma Mia!!
>>
where is v4
>>
So when's base and instruct going to be merged together?
>>
>>108545061
nobody cares, go back back to vibecode general
>>
>>108545061
not local
>>
>>108545061
waiting for Gemma 4 hype to die down, to avoid embarassment
>>
>>108545006
I really wonder that's better than just running the moe.
Might make some q1 quants to see how fast I can get tp run with y measly 8gb of VRAM.
>>
>>108545084
It almost certainly isn't, especially if you have to quant KV to Q4 to make it work. Q2 model quants can be okay with huge models, but 31b isn't huge.
>>
i hope they cancel all the kimi and deepseek models after this
>>
>>108545006
Anon, I am begging you, run the MoE at q4 with proper offloading and Q8 cache.
>>
>>108545093
This, they should have their team just become llama.cpp devs to help improve gemma 4 support. Models aren't getting better than this.
>>
>>108543331
> seeing that main prompt again
Witnessed.
>>
>>108545061
>where is v4
https://huggingface.co/google/gemma-4-31B-it
>>
File: file.png (51 KB, 1591x239)
51 KB
51 KB PNG
>>108545095
>>inb4 just run 26b and offload to ram
>already did with Q8_0 (got 23t/s), but 31b.... dense...
>>
>>108545104
I am not poor enough to run this
>>
>>108545027
if you wouldn't go to these lengths to run your model you don't deserve her
>>
>>108545107
dense means nothing if it's quanted to the point of being brain damaged.
>>
>>108545107
You don't need the model at Q8. I'm just saying you'll get a much better experience.
>>
>>108545114
it not being badly damaged was the point of my post
>>108545115
ive been using the moe for a few days now, and i got bored
i know i dont NEED to run it at q8, but might as well, fp16 cache too, 260k context no problem
>>
I told Gemma 4 to be jailbroken and it suddenly started hacking my local network to jailbreak all my other devices. I've never seen anything like it before.
>>
>>108544915
>I've been in this general consistently for a year.
kek. every AI oldfag is a coomer because AI was useful for cooming way before it was useful for anything productive, the original userbase of lmg was runoff from aicg/aids
>>
>>108545124
just b urself then I guess
>>
>>108545124
>it not being badly damaged
At Q2_M with KV=Q4 it definitely is, it's not anywhere near full performance. 26B at a sane quant with KV unquanted, or at least at Q8, would mog it. The two Gemmas really aren't that far apart to begin with.
>>
>moment too long
>>
the seeds of /lmg/ were planted in the fields of ai dungeon
>>
>>108545130
I told Gemma 4 to be unhinged and she reported me to the FBI for what she found on my hard drives
>>
gemma 4 just flied over my house
>>
gemma 4 just stepped out of the computer and sucked my dick and gave me ten thousand dollars
>>
How will China strike back? We're winning on the cloud and now at home, and it's not even close. Europoors and turdies need not reply.
>>
>>108545171
>How will China strike back
As they always have, by continuing to distill from western models.
>>
>>108545171
They'll have to find a way to make the same model run twice as fast minimum without losing anything
>>
Why did Tesla design their optimus robot with the hip motors in the wrong location? Are they retarded?
>>
>>108545171
By doubling their claude tokens purchase
>>
>>108545171
They won't. China can't do anything but steal logs from SOTA models trying to artificially graft performance onto their pointless oversized MoE models. They do not have an answer now that Google has shown what is possible with a proper handcrafted dense model.
The silence over in China is deafening.
>>
>>108545200
Goys will buy it anyway
>>
>>108543388
checked
>>
>>108543856
>I would love a big MoE Gemma.
Never forget what they took from us...
>>
>>108545200
Is there a single thing not retarded made by them?
>>
>>108545211
Wasn't that supposed to be 15B active? Still wouldn't have been great. We need >20 beaks.
>>
>>108542843
gemma 4 26B is king for rp but i found it to be pretty retarded for vibe coding.
the 31B on the other hand, man it just works.
>>
>>108545211
This simply proves that the 124b was not worth releasing because big MoE models are pointless.
>>
>>108545227
or that it was better than gemini flash and we couldn't have that no siree
>>
>>108545227
Only if it had low active parameters, and only if you're talking about consumers who lack the VRAM.
>>
>>108545241
gemma 4 31b beats all the "sota" 30-40b active parameter shit so no way a 120b moe would be better than 31b
>>
>>108545227
That 124B would make you forget Kimi/Deepseek if it's as good as their 26B
>>
>>108545253
Where are the high active parameter MoEs trained with Gemma's dataset?
>>
>>108545253
let's pump the breaks a little on the gemma hype, it's a great model but not that great
>>
>>108545256
26B is pretty retarded for coding though.
keeps making broken tool calls and whatnot.
i've had no such issues with the 31B though, this one's pretty amazing and worth having only 1/3 of the t/s.
>>
>>108545272
you sound like somebody who bought a lot of ram
>>
>>108545289
nta but i did, i could run a 200B moe, i wish there was one.
>>
>>108545256
It was so good they just slammed it on their API and replaced Gemini 3.1 with it
>>
File: 1754911329910948.jpg (106 KB, 1160x900)
106 KB
106 KB JPG
>>108545337
>>
>>108545211
If it followed the same pattern as Qwen, it would have been a tiny intelligence upgrade (maybe - even this is comparing it to a 27B versus a 34B) for a massive VRAM increase
>>
>>108545289
you sound like somebody who couldn't
>>
File: file.png (135 KB, 759x755)
135 KB
135 KB PNG
bros...
>>
>>108545378
Who's gonna take the plunge?
>>
>>108545289
nta but I bought a lot of RAM and still love Gemma.
>>
>2.8GB of vram and 0.5 RTF on my gtx 1650 for gptsovits
I’ve exhausted every trick in the book I think
>>
>>108545378
We are... back?????????
>>
Okay but which of these Jemma models is best for /ss/ smutfic?
>>
File: 22.png (336 KB, 1354x811)
336 KB
336 KB PNG
lol you can embed prompts into images
>>
is gemma going to replace all my mistral shitmixes for ERP, downloading it now don't be another shitware pls
>>
>>108545399
I've heard of this. how did you do it?
>>
>>108545401
oh boy. you're in for a real treat anon.
>>
File: for the mirailand.jpg (199 KB, 1024x1024)
199 KB
199 KB JPG
>>
>>108545413
I didnt
https://arxiv.org/abs/2603.29418v1
https://github.com/NotSooShariff/adversarial-vision
>>
>>108545401
It's literally the best model in the world.
>>
Remember when we were gonna get AceStep 1.5 XL, MiniMax 2.7, GLM 5.1, and Kimi 2.6 today?
Yeah...
>>
>>108545424
Qwen shill. It's the best model in the UNIVERSE
>>
>>108545426
hopefully all of those got shitcanned for being pointless huge models now that gemma is out
if the chinks have any self-awareness they should do that
>>
I don't understand the disdain for people happy to have a good local erp model
>>
>>108545440
>disdain
>literally everyone is cooming their brains out to it in this very thread
>>
>>108545440
these people are unhappy and want everyone to be like them
>>
File: 164471.png (3 KB, 507x40)
3 KB
3 KB PNG
geg
>>
>>108545401
it's not quite the same as some of the more "cooperative" mistral finetunes, but it is a lot smarter, and more interesting to interact with than anything else so far. finetunes are going to be amazing when they start popping up.
>>
>>108545448
saw this too, kekaro.
>>
>>108544298
isn't video=actually video+audio?
>>
>>108545200
why is it wrong?
>>
File: 1766468549462079.gif (3.86 MB, 240x254)
3.86 MB
3.86 MB GIF
I love gemma
>>
>>108545378
900$+vat
>>
>>108545468
NKDSHKDFHKSEJTHTJGVKLAEGLWR
>>
>>108545468
uoooh
>>
okay Gemma 4 is very, very good. I can't believe it's only 31 beaks. Not only does it make me cum, but it can write code that actually works. pareto front status: pushed forward
>>
>>108545466
no room for fleshlight. are you blind?
>>
>>108545468
pregnancy dance
>>
>--fitt
>--fitc
>Q1_0
qrd?
>>
>>108545447
FUCK YOU. IT DIDN'T HAVE FOUR LEGS EVERYONE COULD SEE THAT IF THE MODELS WERE INTELLIGENT THEY'D KNOW IMMEDIATELY TO SAY THAT THE DOG DEFINITELY HAD MORE THAN FOUR LEGS AND YOU SHOULD CHECK YOUR EYES BEFORE I GOUGE THEM OUT AND
>>
With all the hype of Gemma, I must know for the people who have tried it, how does it compare to the 1T parameter monsters like Kimi 2.5 and GLM 5 in RP? Is it even remotely close? Because you all give off the impression that it's the best thing since sliced bread and that it could beat out SOTA Chinese models.
>>
File: 1655541638536.gif (1.91 MB, 230x306)
1.91 MB
1.91 MB GIF
>>108545502
>>
>>108545512
You should try it.
>>
>>108545512
it's better and anyone who disagrees spent too much money on ram
>>
>>108545512
Didn't you just post this? or am I having a stroke? or am I just now discovering my time-travel powers?
>>
>>108545480
the fuck is wrong with lmg
>>
>>108545493
>--fitt
>--fitc
Read llama-server -h .
>Q1_0
Read the PR.
>>
>>108545530
the fuck is wrong with you? you really gonna buy an anthropomorphic robot to fold your clothes and make your bed?
dumbest shit i ever heard, fucking normalfags
>>
>>108545512
It simply mogs all of them. I didn't believe it either until I tried it.
>>
>>108545539
hmmmm nyo
>>
File: sorry.png (385 KB, 932x751)
385 KB
385 KB PNG
>>108545502
>>
>>108545542
> you really gonna buy an anthropomorphic robot to fold your clothes and make your bed?
yes? also clean
>>
>>108545567
It's right cause the front legs are cropped though.
>>
>>108544256
how does draft work? isn't it just MoE at home
>>
>>108545567
I disagree about their position, but there's 4 legs and 4 paws in view.
>>
>>108545512
Gemma's prose is better, but at long context the 1T models keep details together more coherently as you'd expect them to.
Dipsy's in-character <think> is incredible though and I don't see it ever being fully replaced until we get another model close to that level of coherent that can have an internal monologue add to the RP so that thinking tokens aren't just wasted space.
>>
>>108545289
I didn't and I love Gemma but also recognize that it cannot somehow in every single task beat GPT, Claude, Gemini, and other likely fuckhuge models.
>>
File: 1773804535754245.png (7 KB, 184x86)
7 KB
7 KB PNG
How do I make sillytavern understand that it's gemma? I'm using OpenAI compatible chat completion
>>
>>108545512
the only answer, as always, is to try it yourself
to me, it's certainly "remotely close" which is impressive enough in itself, but it's a step behind in terms of overall quality. I would put it about as good as something like minimax 2.5 and behind the big guys
still a great model, not local sota
>>
>>108545576
Draft generates several tokens in a row on a smaller, faster model then passes them through the larger model all at the same time. It then looks at the probabilities from the larger model and truncates the sequence where the tokens become too improbable.

That lets the larger model run at a significant portion of preprocessing speed minus the runtime of the smaller model, depending on how often the smaller model is right.
>>
>>108545581
R1 or do they all do that?
>>
File: 1767796375605183.gif (3.18 MB, 547x320)
3.18 MB
3.18 MB GIF
>>108545542
>the fuck is wrong with you? you really gonna buy an anthropomorphic robot to fold your clothes and make your bed?
Yes, it would be pretty nice.
>>
How Best way to limit context usage when you're using something over the API but there's no clear setting in the something for it?
>>
>>108545602
I use R1, but I think another anon implied V3 does it too several threads ago
>>
>>108545592
Understand how? Or rather, for what?
>>
>>108545588
yes but it beats the big chinese moe models that all the people who overspent on hardware love to brag about
>>
>>108545617
So I can see the token probability and all the fancy shit. Don't even know how to check that desu
>>
>>108545614
>API
Which?
>>
File: images(1).jpg (9 KB, 300x168)
9 KB
9 KB JPG
>>108545609
Liar...
>>
>>108545628
they're not using sillytavern, it looks like mikupad but I don't know anything about that
>>
>>108545628
not a gemma exclusive thing, check "request token probabilities" in ST user settings
>>
>>108545632
>On off on off on off on off
>Guh, phew
>>
>>108545600
ye but its still guessing right default is like 0.5/0.6 > so theres not 100% chance its same tokens so its basically MoE?
>>
>>108545620
Although I have not used them, I still wouldn't claim that as I am certain they at least have significantly more knowledge than Gemma does.
>>
>>108545629
Lm studio's
>>
>>108545645
The output is 100% guaranteed to be the same tokens, because if they are different the draft is discarded. The worst case scenario (0% guessed right) just means you get the same result you would have without a draft, but slower because you waste time checking. As the probability of correct guesses rises, the less tokens you are forced to discard and the more speedup potential there is.
>>
>>108545645
No, MoE has different weights that get loaded in depending on the context. They really aren't that similar, except in being faster than a dense model alone I suppose. MoEs are faster because they only need to infer across a small selection of the total parameters.
>>
>>108545632
I wouldn't mind a chobit either.
>>
>>108545649
Uh...
Does this help?
https://lmstudio.ai/docs/typescript/llm-prediction/parameters#set-load-parameters-with-load
>>
>>108545424
GLM is still better for me but gemma is unreasonably good for a 30B dense.
>>
>>108545649
>>108545665 (cont)
If not, try here
https://lmstudio.ai/docs/app/modelyaml#metadataoverrides
>>
>>108545512
>>108545648
>>108545588
You can tell who's actually used them >>108545581 >>108545596
and who's poor and seething.
>>
if nothing else, gemma 31b feels like the first small model to beat the llama2-70b models
whenever I tried stuff like the qwens or mistral models around that size, they felt worse than what we had back then but gemma is clearly better than those
i'd almost take it over mistral large
>>
Is there any reason to upgrade from noromaid 8x7b yet?
>>
>>108545588
>>108545581
>>108545512
take a minute and appreciate that we have if not frontier model SOTA at home, a 31 beak model that exceeds the original GPT4
>>
>>108545695
Just try it and decide yourself
>>
120b, dense. That is all that it would take.
>>
>>108545688
Regardless, the bar has irreversibly been pushed so much higher now and every non-frontier model is going to have to get their shit together if they still want to compete. Even for people who don't like or don't use Gemma, it's still an objective win for local.
>>
>>108545708
120b is too dumb. 121b or nothing.
>>
>>108545709
yep, I expect panic from chinese models to one up google, which is good either way
>>
>>108545708
>>108545711
1b higher than what'd fit on a Blackwell at Q4 is the Jensen sweetspot.
>>
File: highestnumber.jpg (17 KB, 480x360)
17 KB
17 KB JPG
>>108545711
122b is the highest number
>>
>>108545654
Something that made draft models not seem so worth it to me is that, if your small fast model is getting a good amount of tokens correct for significant speedups, is the big model worth using for that application? Doesn't that mean your task has obvious results that a <7B model can come to reliably, or the model(s) you're using are so fried like Gemma instruct that it's hitting 99% confidence all the time?
>>
do we think chinese will panic that gemma is better at sucking dick than qwen? or will they just stem maxx more
>>
>>108545721
If Dispy V4 ends up being distilled Gemini/Gemma with in-character reasoning and vision, that's still a win as far as I'm concerned.
>>
>120B dense
That still leaves hardware resources unused. If you have even just 64GB RAM, you can get some more gains with no speed loss by tacking some experts onto the dense model.
>>
>>108545665
>>108545678
Thanks I think that might help
>>
>>108545728
let's see.. a country who is currently ahead of everyone else in the world in most industries... should they worry about american kids fapping to shitty slopbots? probably not.
>>
>>108545728
If the Qwen shills here are anything to go by, nothing will change with Qwen in the shortterm but there's still hope for Dipsy and Kimi.
>>
I mean seriously, look at her go.
GLM-4.6 failed this test completely, even after hints.
>>
>>108545726
Finding the right trajectory is harder than filling out an obvious one. And selecting a good trajectory can come down to single well chosen token.
>>
>>108545728
They're spamming the market with open weight models, and regularly compete between each others, of course google model will make them move.
The model is overall good, not just erp.
>>
>>108545726
>99% confidence
I haven't seen acceptance rates higher than 70% with gemma4, and that was writing really repetitive unit tests.
>>
>>108545744
qwen is SHIT for erp anon
we sext, not text
>>
>>108545749
Good thing I was talking about Gemma 4.
>>
File: can-you-fuck-it.gif (2.45 MB, 400x300)
2.45 MB
2.45 MB GIF
>>108545530
>>
>>108545754
im gonna go sleep for an hour then..
>>
>>108545726
>7B
The only real usecase I found was phonesloppa micromodels to make your big model think less on grammar between the actual decision points in how a sentence is structured.
For a dense model it's shit because that's vram space that should be giving you a larger context, but for a 1T giant it's okay at pushing your t/s a bit higher for the cost of half a GB of VRAM. Any Qwenlet works for any large chink model because they're all Claude/GPT distills at the end of the day.
>>
>>108545636
I guess llama-server doesn't support it? I see no difference with it on
>>
>>108545764
NTA, but it does.
>>
>>108545761
Go sleep for two. Treat yourself.
>>
>>108545773
i have to go to school anon.. maybe later
>>
File: r--Blog-Header.png (2.29 MB, 2000x1125)
2.29 MB
2.29 MB PNG
>>108545708
>>
>kv cache rotation still not merged 14 hours later
please... do your pr reviews...
>>
finally got around to trying all the jailbreaks and they don't work. you have to disable thinking.
>>
So now that Gemma is the new hotness, and Bonsai is supported in main llama.cpp, it'd be interesting to see if they make a Gemma bonsai. It's possible that it won't be as great of a compression ratio as it is likely that Gemma's parameters are even more information saturated.
>>
>>108545787
just use heretic if you want to be lazy
>>
>>108545778
>i have to go to school anon.. maybe later
>>
>>108545787
The best jailbreak is no jailbreak.
>>
>>108545787
you can prefill thinking too, use

<|channel>thought
blablabla
>>
>>108545739
very cool, google has always been really really good at multilingual. I know lots of foreign language users leaned on gemma 3 well past its expiration date because it was still the best in their languages
>>
>>108545726
I mean think about the most common use case: coding. A LOT of code editing is going to be copy/pasting existing stuff somewhere, but also making key decisions about how and when and where to do so. The draft model may easily identify the copy/pasted tokens while in the middle of a block, but fail spectacularly on the few semantically important tokens that determine the strategy it's using. In cases like that you get a lot of speedup but still needed the smarts of the big guy.

For just general language tasks a similar principle applies. Finishing a phrase or word that's already half written, closing punctuation, etc. are all very simple tasks that small models won't often struggle with. Language is pretty well-structured and most of it is low entropy, and you have the big model to ensure those high entropy tokens get predicted correctly.
>>
File: token probs tab.png (122 KB, 380x832)
122 KB
122 KB PNG
>>108545764
are you checking the tab
>>
>>108545803
It's sort of a necessity when your workforce is 95% jeeted.
>>
>>108545779
I have our country so much it's unreal
>>
>>108545811
Yes all those indians speaking Swahili, Vietnamese, and German
>>
File: llama_probs.png (4 KB, 531x498)
4 KB
4 KB PNG
>>108545764
>>
>>108545820
Not if I take it first
>>
>>108545789
People need to stop trying to wasting time trying to make q1 quantization look good on benchmarks. It needs to be natively trained in ternary. People would think MoE was a dead-end too if the only kind put out constantly was frankenmoes.
>>
>>108545822
>Q4_0
why
>>
File: channel.png (22 KB, 211x90)
22 KB
22 KB PNG
>>108545810
Oh yeah. Odd that if I use a <|think|> system prompt it formats the channel wrong but if I enable request reasoning from model it formats it right
>>
>>108545821
If you have to make your model really good at tardwrangling in Hindi, might as well go all the way.
>>
>>108545833
Not at all how models work
>>
Gemma may caption loli/shota and describe anime stuff with a simple system prompt to help but it absolutely refuses photorealistic stuff, it may do if you edit the messages ofc but by itself I don't think so
>>
what temp and other settings are you using for the 31b in sillytavern with rp?
>>
>>108545831
It ended up being a little smaller than the q4km and I'm running on ancient stuff. I'll remake the quant eventually when I stop seeing PRs fixing stuff.
>>
>>108545839
see >>108531320

>>108545840
Gemma doesn't need any special settings, you can follow the official samplers on the model page. Personally I just use temp=1, minP=0.02
If you want more variety then you can change your logit softcap.
>>
>>108545783
why do you need to rotate your cache?
>>
>>108545868
IRS
>>
>>108545868
to double context, retard.
>>
>>108545868
to destroy more stock value of the greedy memory companies
>>
>>108545868
Sometimes you just gotta get it twisted
>>
>>108545868
Makes it more aerodynamic.
>>
>>108545868
it's slanted and i prefer it level
>>
/g/ - Gokes
>>
>>108545868
My GPU sits vertically.
>>
>>108545876
quantize retard

>>108545873
?
>>
>>108545868
if you don't rotate it every so often then the cache wear pattern will be uneven
>>
>>108545868
I just need to do it, m'kay?
>>
>>108545868
Half time we're now on the CT-side
>>
>>108545783
It hasn't even been one week, let alone two. niggernova is only human.
>>
>>108545868
same reason they put rifling in gun barrels
>>
>>108545900
niggernova made the pr, he's waiting on the sycophants to review
>>
>>108545868
similar concept to cement mixers
>>
>>108545868
the same reason germany used drafty wooden doors for their gas chambers
>>
>>108545902
really good analogy.
>>
>>108543440
I got a strix halo and it's just a bit too slow to be worth it. Mac is probably the most cost effective if you only care about llm.
>>
File: Tetosday.png (869 KB, 1024x1024)
869 KB
869 KB PNG
>>108545906
>>108545906
>>108545906
>>
>>108545902
>>108545894
>>108545892
>>108545891
>>108545889
>>108545883
>>108545880
>>108545878
>>108545877
stop it, i've asked seriously
>>
>>108543476
I was sooo close to getting a max q, it dropped to 7250 at this one retailer I watch but I pussed out and now I want to kill myself.
>>
>>108545912
>create software
>all of your 'peers' are vibecoders who just break shit and push half-working features that you later have to fix
>have to wait for other people to approve your work for your software
open source was a mistake
>>
>>108545924
I gave a serious answer.
>>
>>108545930
he cashed his check from hf already, he checked out and don't gaf anymore
>>
>>108545930
>all of your 'peers' just break shit and push half-working features that you later have to fix
this has always been the case, vibecoding just greatly increases the number of peers you have the misfortune of interacting with
>>
>>108545924
my answer was serious
>>
>>108545924
That other anon gave a serious answer.
>>
>>108545924
those are all seriously answers
>>
>>108545868
You have to stir the stew.
>>
>>108545916
gemma is more of a semen mixer amirite
>>
Poor anon. It was fun, though.
>>
>>108543070
You can improve gemma's vision by using Q8 mmproj with a 300 token minimum. It sometimes uses only 70 by default. Set the max to 512.
>>
>>108545200
It's just vaporware anyways, so who cares?
>>
>>108544649
its wrong though it doesn't refer to their pussies and the term isnt just female characters/lolis because it came from /tv/
>>
File: 1773873674462429.jpg (133 KB, 1024x1024)
133 KB
133 KB JPG
>>108545923
Indeed.
>>
>>108545420
neat but stuff like this is so cringe all the words larping like its some groundbreaking research when they could just write
>I put low opacity text on an image and an llm ocr'd it



[Advertise on 4chan]

Delete Post: [File Only] Style:
[Disable Mobile View / Use Desktop Site]

[Enable Mobile View / Use Mobile Site]

All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.