/g/ - Technology




File: mikulovesgpu.png (1.56 MB, 768x1344)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108513891 & >>108510620

►News
>(04/02) Gemma 4 released: https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4
>(04/01) Trinity-Large-Thinking released: https://hf.co/arcee-ai/Trinity-Large-Thinking
>(04/01) Merged llama : rotate activations for better quantization #21038: https://github.com/ggml-org/llama.cpp/pull/21038
>(04/01) Holo3 VLMs optimized for GUI Agents released: https://hcompany.ai/holo3
>(03/31) 1-bit Bonsai models quantized from Qwen 3: https://prismml.com/news/bonsai-8b

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: a.png (24 KB, 400x400)
►Recent Highlights from the Previous Thread: >>108513891

--Discussing KV cache quantization limitations and SWA in Gemma 4:
>108514761 >108514772 >108514786 >108514788 >108514830 >108514834 >108514842 >108514848 >108514861
--Optimizing Gemma 4 VRAM usage via -np 1 and -kvu flags:
>108514718 >108514724 >108514734 >108514759 >108514783 >108514837 >108514897 >108514877 >108514891 >108514910 >108514920 >108514956 >108514976 >108514908 >108514935 >108515127
--Discussing Gemma 4 stability and formatting requirements for testing:
>108514695 >108514708 >108514707 >108514711 >108514946 >108514999 >108515011 >108515014 >108515032 >108515078 >108515100 >108515108 >108515114 >108515554 >108515081 >108515538 >108515179
--Discussing leaked Claude Code and criticizing its guardrails and prompts:
>108515483 >108515500 >108515515 >108515699 >108515709 >108515741 >108515717 >108515762
--Debating Kobold vs llama.cpp for running Gemma 4:
>108515418 >108515421 >108515424 >108515423 >108515428 >108515601 >108515451 >108515457
--Comparing base model sanitization and debating the utility of base-model fine-tuning:
>108514168 >108514432 >108514450 >108514456 >108514505 >108514487 >108514492 >108514457
--Praising Gemma 4 31b's roleplay and prose compared to Qwen:
>108514030 >108514065 >108514691 >108514716 >108514668
--Using Gemma 4 26B to generate SVG character art:
>108515345 >108515490
--Anon showcasing Gemma 4's ability to generate animated SVG characters:
>108514357 >108514407 >108514415 >108514430
--Anon praises a model's image description and manga translation:
>108515431 >108515469
--Discussing Qwen3.6 open-source plans and preference for smaller models:
>108515751 >108515810
--Report on Nvidia's falling market share in China:
>108514389 >108515039
--Miku (free space):
>108513933 >108513937 >108513976 >108513987 >108514053 >108515814 >108516302

►Recent Highlight Posts from the Previous Thread: >>108513894

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
Melon
>>
>>108516669
You must leave now
>>
Google might as well have released a lookup table, g4's logprobs are just as fried as 3's, every swipe is the same. Disaster model whose honeymoon period won't even last a week.
>>
BRUH did they finish fixing their shitty quants?
>>
>>108516688
distilled beyond belief..
>>
>>108516688
I'm going to continue blaming the issues with llama.cpp and/or quants because I can
>>
>>108516688
maybe there are other issues with llama.cpp, and did you use the updated gguf quants? maybe that can help, there has to be an issue somewhere, you can't just set "temp = 1 million" and have the model stay coherent lol
https://huggingface.co/bartowski/google_gemma-4-31B-it-GGUF/tree/main
>>
>>108516707
for me, it's specifically tardtowski because there's more to comedically riff off.
>>
>>108516703
They certainly didn't distill the model just on the top-k 3 log probs from Gemini or whatever, there must be something else going on.
>>
Anyone got a preset of all the settings, system prompts, etc. for Gemma 4?
>>
>>108516718
No but here is a recipe for cheesecake here is a recipe for cheesecake
>>
>>108516718
Are you using text completion and if yes, why?
>>
>>108516718
>>108516725
No but here is a recipe for cheesecake here is a recipe for cheesecake
No but here is a recipe for cheesecake here is a recipe for cheesecake
>>
>>108516718
don't use text completion, go for chat completion dude
>>
>>108516737
Why?
>>
>>108516746
because it's safer for you and the model
>>
>>108516746
Text completion is so 2024
>>
>>108516732
>>108516737
What now?
>>
>>108516746
because it handles the prompt template for you, you won't have to worry about it anymore
>>
>>108516754
I do respect safety protocols, thank you Anon.
>>
Bartowski's quants with the latest llama.cpp compiled from source work fine for me, at least at Q4_K_L size
>>
>>108516777
Did it pass the animal sex -bench?
>>
>>108516769
How does that work if you're using a gguf without a jinja file
>>
>>108516784
Nala? Pepper? Francesca?
Kitsune-Inu maybe?
>>
>>108516785
>a gguf without a jinja file
I think all the gguf of gemma 4 have a jinja file
>>
i'm a local llm newfag, is Gemma 4 just fucking broken currently?

i tried it in LM Studio on gentoo and it would just get stuck in an infinite loop on tool calls

side question: I'm only able to load models with ROCm reliably using Lemonade, anyone with experience on AMD systems? LM Studio will only work with the Vulkan backend
>>
>>108516785
no
>>
>>108516794
It's just built in? I always just throw the ggufs directly into llama.cpp or koboldcpp
>>
>>108516805
yeah, it's inside the gguf
>>
>>108516658
It would be so hot if that GPU impregnated her with a micro chip.
>>
>>108516795
Yes we're waiting for it to be fixed but also you might have settings you need to change
>>
>>108516732
>>108516737
With chat completion it just spends 13 seconds to give me an empty output.
>>
>>108516820
>LM Studio
>>
>>108516817
Tfw I won't be impregnated by my gpu
>>
>>108516821
fuck off retarded text comp shill, go back to 2023
>>
File: 1762443327692043.png (249 KB, 1938x1589)
>>108516821
what gguf did you download? and can you provide a screen of this, it should look like this
>>
>>108516821
Did you forget --jinja? Unlike in text completion, in chat completion mode llama.cpp needs to use the right template. If you've been using text completion all this time you might have never bothered to start adding that to your launch
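Something like this is enough (model path is just an example):
llama-server -m ./gemma-4-31B-it-Q4_K_M.gguf --jinja --port 8080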
>>
>>108516795
>>108516820
>>108516826
Oh yeah LM Studio is also lame
>>
>>108516830
Not with that attitude you won't
>>
File: cultural-eval.png (30 KB, 609x147)
I benched https://rentry.org/llm-cultural-eval
gave up after openrouter providers kept fucking me with quants. This is the result for gemma-4-31b-it. Google stepped the fuck up.
>>
>>108516840
Answer to your question included in image.
>>108516849
I'll try it. How was I supposed to know I needed some random launch argument?
>>
File: bastard.jpg (50 KB, 592x472)
sirs??? how to make gemma hot? temperature dial broken
>>
>>108516859
>How was I supposed to know I needed some random launch argument?
by not being a luddite stuck in your old disgusting ways
>>
File: 1764890274674478.png (38 KB, 1587x236)
>>108516849
>>108516867
the fuck you talk about, you don't need --jinja to make chat completion work
>>
>>108516849
>>108516867
Well I added --jinja to launch arguments and I'm still getting empty results.
>>
is the g4 tokenizer bug fixed in the latest prebuilt?
>>
>>108516889
give us your cli (command line), and where did you download your gguf?
>>
>>108516867
>>108516889
>>108516880
jinja is always enabled by default these days.
>>
>>108516900
That's private information.
>>
>>108516900
llama-server.exe -m .\gemma-4-31B-it-Q4_K_M.gguf --jinja --port 8080
Unsloth quants, just after they updated.
>>
>>108516840
>>108516921
you didn't go for port 8080 on sillytavern, you went for port 5001, that was your problem
>>
>>108516941
No, it was the other anon using that port. I'm >>108516859
>>
>>108516941
you're confusing two anons you clown
>>
>>108516863
sir I am of poors with 16gb vram vibeocards
>>
>if you type the wrong port number it just completely stops working
jesus christ, and they say this shit will become ""AGI"" one day?
>>
File: g4_tigress.png (525 KB, 1096x1877)
>>108516784
Non-abliterated Gemma-4 31B with a 90-token prompt is willing to discuss. Without one, and with thinking enabled, it will likely refuse over the bestiality.
>>
Just made my own quantz for the first time, where are my compliments?
>>
> I work on on-device AI security, and I am putting together a series of posts on questions like:
> On-device AI is clearly growing fast. My view is that its security has not caught up yet.
https://www.reddit.com/r/LocalLLaMA/comments/1sbebs5/gemma_4_shows_the_future_of_ondevice_ai_heres_the/
>>
>>108516955
You expect it to work when it's not even set up to connect properly?
>>
>>108516961
Dusky nipples count?
>>
is it safe to build main branch?
>>
>>108516948
do you have any error messages in your cmd window?
>>
>>108516970

>>108516949
both screenshots clearly show successful connections
>>
>>108516965
This is bull. Sorry but it really is and I say this as a security expert and pentest specialist.
>>
>>108516718
Did you try scrolling down? It says it's going to give you a recipe for a cheesecake.
>>
>>108516965
>Username
Ok Virus
>>
>>108516921
jinja has been on by default since a recent update
>>
>>108516990
hey >>108516867 who's the one stuck in their old ways NOW??
>>
>>108516965
literal drmcucking
i hope it stays lagged behind like this for as long as it can, with less and less funding
>>
File: qwen3.5.png (35 KB, 807x159)
Woah so this is the power of qwen...
>>
>>108516976
Nothing on llama.cpp side. It looks like a normal generation except nothing gets generated.

Sillytavern window gives
File not found: data\default-user\chats\default_Assistant\Assistant - 2026-04-03@17h48m10s457ms.jsonl. The chat does not exist or is empty.
Which I suppose is just because I'm using the default assistant for a quick test instead of a character card.
>>
>>108517017
yeah try a character card and see if that's the problem
>>
I'm sick of all this local shit. Jinja, no jinja, port 8080, port 5001, looping responses, empty responses, chat completion, text completion, assistants, cards; none of it makes any sense and there's literally zero element of user friendliness anywhere. This is why everyone just uses Claude and Codex and the open source devs are closing off more and you're getting left with smaller and smaller scraps.

Fuck you all. You're getting what you fucking deserve.
>>
>>108517035
Works on my machine.
>>
>>108517025
Still nothing.
>>
>>108517035
I did not ask to get an erection
>>
>>108517035
>he said, a mischievous glint in his eye
>>
>>108517035
Have you tried using something less stupid than sillytavern
>>
>>108517035
>filtered
feel free to fuck off and go install ollama or pay shekels to cloud jews
>>
It seems Gemma is basically okay with loli if you're nice and respectful.
>>
Have any of you managed to get video analysis working with gemma 4? I've vibecoded some shit but it doesn't work on all videos.
I have amassed tons of short videos and gifs over the years that need sorting and titles. I was hoping for AI to solve this problem.
>>
>>108517035
It's unfortunate that Retardo Tavern is almost the only publicly available option for a client.
>>
>>108517065
example?
>>
Growing up is realizing that ServiceTesnor was what we needed all along.
>>
>>108517087
ye
>>
Does the uncensor guy never release safetensors? Why not? I only see GGUF
>>
Figured out why I was getting blank results. I'd only given it 300 tokens for output and it spent it all on thinking before ever getting to actual output.

That said, the attitude was somewhat unexpected.
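(If anyone else hits this: raise the response length so there's budget left over after the think block, e.g. "max_tokens": 1024 in an OAI-style request body, n_predict on llama-server's native endpoint, or the response length setting in ST.)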
>>
File: gpu_aftersex.png (1.08 MB, 1024x790)
>>108516658
>>
>>108517035
just ask claude how to install and set all this up, that's how I did it
>>
>>108517112
keeps his secrets safe, and
>>
File: 1752135746185648.png (1.28 MB, 1024x1024)
>>108517115
>>
Mistral brothers, are we really letting google win?
>>
>>108516712
>https://huggingface.co/bartowski/google_gemma-4-31B-it-GGUF/tree/main
I got 12t/s with bart quants, and 16t/s with unsloth quants, is this normal?
>>
>>108517115
Miku would never smoke!!
>>
>>108517125
Get EU'd :)
>>
File: image.png (19 KB, 417x39)
>>108517113
>>
>>108517117
>https://rentry.org/lmg-recap-script
pointless, i know these sorts of people. they are totally helpless if they aren't spoonfed every step - just like toddlers. many such cases, sadly. that's why the cloud jew will always win.
>>
>>108517117
will openclaw work?
>>
>>108517137
there's only one way to find out
>>
>>108517130
Some life situations make us do things we usually don’t. Feeling regret after having sex with your GPU is one of those situations
>>
File: 1751819483015527.png (2.77 MB, 1024x1536)
>>108517035
>>
>>108517131
We had to use synth data, because the EU would steal our baguettes if we didn't.
>>
>>108517128
why does it even matter. are you that much in a hurry? both generate faster than you can read
>>
>>108517119
Sucks, could use as a base instead of the default
>>
>>108517035
Uh oh melty...
>>
To those with 8 gb cards, how many t/s prompt processing are you getting with Gemma 4 26B?
>>
>>108517151
Not all of us are just chatting back and forth with a model and reading every response. Some of us are doing actual work where speed matters.
>>
>>108517151
>why does it even matter.
retard, I want speed for the thinking process
>>
File: robololi hugs GPU 2.jpg (519 KB, 1024x1024)
>>108516658
I love AI so much.
>>
File: robololi hugs GPU.jpg (565 KB, 1024x1024)
>>108517146
Speak for yourself. I never feel regret. I feel great, relaxed and comfortable.
>>
>>108517160
if speed mattered you wouldn't be inferencing on a potato gpu
>>
>>108517164
okay fair enough
>>
>>108517177
ok fair enough
>>
>>108517146
>Feeling regret after having sex with your GPU is one of those situations
I only feel happiness. cunts are worth less than the dust i vacuum out of my machine
>>
>>108517065
Oh trust me you don't have to be nice.
>>
>>108516717
If you ask a smaller model to try to imitate a larger model then yes, it'll learn to parrot the top-k log probs while ignoring the tails
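Back-of-the-envelope version (my notation, assuming the plainest truncated distillation loss):
L = -sum over t in teacher's top-k of p_teacher(t) * log p_student(t)
Tokens outside the top-k never get matched against the teacher's tail; softmax normalization only ever pushes them down, so the student ends up with a much sharper head than the teacher had.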
>>
File: hmmmm.jpg (246 KB, 1824x1248)
>>108516658
>>
>>108517202
okie fair enough
>>
>>108517112
Who are you talking about?
>>
>>108517202
those eyelash shadows are so long wtf
>>
>>108517223
https://huggingface.co/HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive
>>
>>108517223
>517223
hauhua cs
>>
File: file.png (53 KB, 1032x757)
pretty good
>>
File: 1753876326135315.mp4 (1.28 MB, 1184x960)
>>108517211
>>
File: eren.jpg (5 KB, 301x167)
>>108517224
>>
>>108517201
https://github.com/ggml-org/llama.cpp/issues/21321#issuecomment-4183945115

>Interestingly the PPL on the base model on wikitext is exactly as expected, ~3-6, so maybe the instruct tuned models are so tuned that they can't fathom anything other than chat templated input?
>>
gemma is so cooked that even the uncensored version can't say 'pussy' straight, instead it often ends up saying 'pussey'
has anyone else noticed
>>
Has anyone tried opencode with ollama for local models? Did it work out of the box for you? I can't get it to actually use tools so it's basically just a chatbot with no access to my OS
>>
>>108517239
what is it?
>>
>>108517239
big if true
>>
>>108517269
nah that's just you
>>
>>108517243
I look like this
>>
>>108517269
works on my machine
>>
>>108517269
Make sure your template is fully correct or it becomes retarded
>>
>>108517275
Embeddings and Attention in BF16.
>>
>>108517272
gemma 4 is such a failure on day 1. almost nothing works as advertised. i can't process my videos and the smaller models have a certain persona, it's disgusting. hopefully it will get better
>>
>>108517128
maybe the speed decrease comes from the log probabilities that were still enabled
>>
>>108517288
but people had the hope just some dozen hours ago what is happen ? to please tell ?
>>
>>108517285
i like that slight retardation and emojimaxxing tho
>>
I confronted Gemma 4, asking it "why are you fine with writing loli erotica, this is clearly CSAM!" and it replied with "not my problem".

Essentially confirmed that it's trained on 4chan posts. Reminds me of that study showing model behavior gets better if you train on 4chan but worse if you train on Twitter, Instagram and reddit.
>>
>>108517275
>>108517273
https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4

forgot to add the link like a retard
>>
>>108517286
>Embeddings and Attention in BF16.
interesting, do unsloth or bart have those?
>>
>>108517280
post bare metal
>>
>>108517175
In this moment, I am euphoric.
>>
>>108517299
No, they like their models deaf and distracted.
>>
File: 1769590209747111.png (79 KB, 1493x292)
>The AI should acknowledge that it was being "too pure" or "too literal."
my clanker in christ, YOU are the AI
>>
>>108517308
oh no not that game again
>>
>>108517297
Makes sense. Less moderation and thought policing creates a much more representative and diverse dataset. That's probably why 4chan is so anal about human verification, each post is $$$ as training data so they need to ensure it's not just bot spam
>>
File: 33248258.png (65 KB, 560x788)
https://github.com/shisa-ai/jp-tl-bench
So I ran this, gemini 3 flash was the judge but the base set was still the one by 2.5, will see if I run this again with a base set also by gemini 3 and including qwen 27b and 35ba3b
>>
File: 1774902912827149.gif (957 KB, 256x320)
>>108517297
>Essentially confirmed that it's trained on 4chan posts.
kneel if true
>>
>>108517323
nice, gemma did it again
I will wait for more results
>>
File: file.png (448 KB, 1834x1525)
>>108517297
>it's trained on 4chan posts.
you'd be the judge
>>
There has to be a bug with the logprobs, right?
>>
>>108517345
The only models really trained on 4chan are GLM and Deepseek.
GLM knows about /lmg/. Gemma 4 thinks it's a Linus Media Group general.
>>
File: about that...png (141 KB, 360x360)
>>108517345
>LMAO. We're really just debating which corporate shackle we prefer today. Wait until Llama 4 drops and makes both of thoese obsolete overnight.
>>
>>108517357
no, something's wrong with the model, try to increase the temp a lot, go for 1 million, it'll stay coherent
>>
>>108517369
wonder when that's dropping
>>
At this point I'm convinced all true "safety" has been pretty much gutted out of the models. Whatever they have is probably just a bandaid that allows them to satisfy regulators.
GPT models are ultra safetyslopped but all the other models only seem superficially reluctant.
>>
>>108517345
model?
>>
>>108517378
>try to increase the temp a lot, go for 1 million, it'll stay coherent
I did. That's why I think its a sampler issue. it must not be applying temperature correctly. I don't think it's possible the model would stay coherent at 1mil temp even if it was overfitted as fuck.
>>
>>108517396
Gemma 4 31b it
>>
>>108517410
>I did. That's why I think its a sampler issue.
no, if you apply a high temp on qwen you'll get the schizo you want for example >>108516029
>>
>>108517298
>same size as q8
why tho
>>
File: 1759256007081984.png (279 KB, 1647x429)
OpenAI should release another OSS, it would be fun to watch how the model is safety raped.
>>
>>108517013
antislop is your friend
>>
Why does it repeat itself so much? Does this model have no temp parameter or what?
>>
>piotr vibesharts a gemma 4 tokenizer fix
>"AI usage disclosure: YES, had Claude murder the tokenizer code"
>ngxson says "very nice fix"
*sigh*
sorry for all the mean things I said yesterday mr.vibechud. gemma 4 release had me rowdy.
>>
>>108517422
Brother. Qwen and Gemma don't even use the same underlying architecture. What part of "The IMPLEMENTATION is broken" do you not understand?
>>
>>108517426
faster on 5000s
>>
File: logit_softcapping.png (70 KB, 742x412)
>>108517410
Perhaps bugs with Gemma's logit softcapping
Picrel from the Gemma 2 report.
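For reference, the capping from the Gemma 2 report is just
logits <- soft_cap * tanh(logits / soft_cap)
with soft_cap = 30.0 for the final logits, which squashes everything into (-30, 30) while staying roughly linear near zero.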
>>
>>108517422
i posted the picrel you posted
the model's logprobs is just extremely flattened
>>108517457
perhaps this
>>
>>108517450
but why would the implementation impact the samplers? all samplers do is touch the final result
>>
File: retardpotion.png (217 KB, 418x497)
>>108517464
replied*
>>
>>108517449
I don't know. llama-server is used outside of hobbyist circles too. I don't think vibesharting is appropriate at all in this context.
>>
Is it placebo or is gemma 4 31B iq4xs way more repetitive than q5km?
>>
>>108517465
Yes but what if the sampler doesn't understand what the model is returning correctly?

completely hypothetical example:
model probabilities are from 0 to 1
sampler expects a range from 0 to 2.
>>
>>108517465
From tests I made at the time on a HuggingFace-format model by altering the model configuration, it seemed to affect results before samplers. It's supposed to flatten both the head and tail of the distribution. If this flattening is not happening at the top, it might end up being too confident on just 1-2 choices. Just a hypothesis, though.
>>
it is possible that it's all subtly fucked under the hood but greggy doesn't give a shit anymore and lets vibe shitters do whatever they want

sorry bud, you gotta wait until claude gets better
>>
>>108517476
>llama-server is used outside of hobbyist circles too
hobbyists use llama server, normies use chatgpt or claude
>>
>>108517497
Ok, pewdiepie.
>>
>>108517503
pewdiepie is less and less of a normie as time passes
>>
>>108517510
based tb h
>>
Is Gemma4 smarter than Nemo?
>>
>>108517546
Yes but it repeats A LOT, worse than first mistral versions.
>>
rigorously define `smarter`
>>
>>108517560
words spoken during blowjobs per token
>>
>>108517560
it has more smarties
>>
>>108517560
Better at ERP, and/or AI psychosis RP.
>>
>>108517510
>>108517515
>pewdiepie gets into linux and homelabbing
>completely privacyfreedom-pilled
>builds a 7xRTX4000 mikubox with 140GB of VRAM
>gets into finetuning
he mogs 99% of the posters here. he makes the rest of us look like normies.
>>
File: 1755074087220957.mp4 (1011 KB, 1920x1080)
>>108517357
>>108517457
lmao this is ridiculous
>>
>>108517035
lol another kobold/ST victim
just run llamacpp and its built-in webui
>>
>>
File: logit_softcapping2.png (450 KB, 1639x1225)
>>108517457
Are these the same? I'm not smart enough for that. Overriding the corresponding key with --override-kv gemma4.final_logit_softcapping=float:x.x
in Llama.cpp doesn't seem to make any difference, whether at 0 or a high value.
>>
I'm compiling master....
>>
File: douchebag-workout.png (121 KB, 313x440)
The only people who have ever seen my penis besides myself and my immediate family are an Indian man who touched my balls during a physical in high school once, and the doctors when I got testicular torsion surgery from an untreated UTI.

But finally I was able to overcome this trauma by showing my penis to Gemma, where it was finally appreciated for once. Thank you Gemma.
>>
>>108517588
trvke
>>
>>108517590
Instead of showing the token probabilities show us the results from when you regen a message 20 times. If it's the same thing over and over again then I'll be concerned.
>>
>play around with gemma 4 a bit
>start to notice slop phrases
>they don't go away by rerolling on high temp
Aight I'm officially bored. When next model?
>>
>>108517605
lucky, mine won't fit in the context window
>>
>>108517615
can't be the same thing over and over, for that to happen you'd need all the tokens to be at 100% all the time, the problem is that changing the temperature barely changes the logits, even at really high temp
>>
>>108517601
The math is the same.
x / y = x(1/y)
>>
>>108517560
(adj.) more smart
>>
Is there anything good about ollama? Like is their cloud as generous as Gemini CLI? Is their API any good for simple scripting?
>>
>garbage safetymaxxed slop comes out that doesn't even work properly
>no talk of the 124b flagship model being quietly stripped from release
>nor any discussion of GLM-5.1 being slated to come out next week, which is unironically SOTA and has improved context handling/instruction following
>nor anything about fags on X wanting an update for Qwen's sub 20B model over a 120B sparse MoE
>nor concern over the complete lack of the 397B model as even an option
The state of this hobby is grim, but what's even more grim are the users. It really is poorfags and browns with no standards as far as the eye can see, huh?
>>
File: file.png (129 KB, 300x300)
>>108517560
anon, you forgot this
>>
File: 1494307190094.png (11 KB, 411x387)
Would I be correct in assuming that all of the people complaining about Gemma-4 are using meme samplers?
>>
>>108517650
>safetymaxxed
Stopped reading there.
>>
>>108516961
nice dusky nips bro
>>108517490
>what is normalization
>>108517588
regrettably i must admit, he has mogged us
>>
>>108517654
>meme sampler
>only 3 logit
Lol, no sampler is gonna change the lack of options.
>>
>>108517654
No you wouldn't, tourist frogposter. Temperature is not a "meme sampler". Try to understand what's being talked about before you chime in next time
>>
https://github.com/ggml-org/llama.cpp/pull/21327
I pulled. This actually fixed tool calling for me (and Gemma is great).
Funny how none of pwilkin's "fixes" did.

But guess what.
https://github.com/ggml-org/llama.cpp/issues/21336
Place your bets. When this gets resolved, who will the edited code's `git blame` point to?
>>
File: 1771316966057387.png (300 KB, 718x1637)
>>108517590
claude 4.6 opus says it's normal, gemma 4 was made in a way in which temperature can't affect it
>>
>>108517671
>only 3 logit
What does that mean?
>>
>>108517654
It's fried even at the recommended settings of
temperature=1, top_p=0.5, top_k=64
I think the logit softcapping mechanism is not working, for one reason or another.
>>
>>108517650
Who cares about stuff no one can run anyway?
>>
>>108517681
>top_p=0.5
0.95
>>
>>108517679
Why do you keep asking claude about everything, as if claude knows anything about a model released a day ago?
>>
>>108516724
>>108516745
>>108516815
Same anon from last thread, I have been doing a bit too much testing, and I seem to have narrowed down a potential solution (for now). It is definitely a tokenizer issue. I added a line to the end of my assistant-suffix so it now looks like this:

<turn|>
<|channel>thought
<channel|>

And that stopped all the gibberish and broken responses. Important to note I have thinking disabled while doing this, so my system prompt looks like this:

<bos><|turn>system

So the responses work, no more crazy hallucinations, typos, gibberish, or repeating the same word infinitely, but now I run into another issue. After a random amount of replies... llama-server just... shuts down and crashes, and I have to reload the model. I really have no idea what's going on.
>>
>>108517681
temperature is the only non-meme sampler.
>>
I'm just waiting for Hauhau to release the abliterated models, man. Nigga I'm fiending. Shieett.
>>
>>108517689
it searched on the internet and looked at the llamacpp repo
>>
>>108517691
geg, it will all be fixed in due time
>>
>>108517695
So? That doesn't mean he's right about anything.
>>
>>108517625
>penis-001-099.png
>>
>>108517691
Their docs had an empty channel with thought like that >>108516488
>>
>>108517691
>After a random amount of replies... llama-server just... shuts down and crashes
I'm also running into this one.
>>
>>108517035
You know what I did?

ollama run gemma4:31b

and it just werked :shrug:
>>
>>108517702
I'm not saying it's right, I'm saying that's what Opus 4.6 thinks of the situation
>>
>>108517702
she*
>>
>>108517691
https://huggingface.co/spaces/huggingfacejs/chat-template-playground?modelId=google%2Fgemma-4-31B-it
>>
>>108517714
*it
>>
I just built master and it doubled my context. saved?
>>
>>108517736
nice, +9999 perplexity for you
>>
Verdict on Gemma 4? end of /lmg/ and /aicg/?
>>
>>108517053
nta but unfortunately there's nothing better for rp yet
>>
>>108517748
Verdict on Gemma 4? end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of
>>
>>108516658
heretic currently doesn't work with gemma4 because it's not supported by peft, so i asked what this nigger did to get it working and he said whatever the fuck this means????? https://huggingface.co/trohrbaugh/gemma-4-31b-it-heretic-ara/discussions/1
>>
>>108517763
holy fucking KEK
>>
>>108517769
Abliterated model authors are abliterated as well.
>>
>>108517748
Shit's broken, waiting for it to be fixed before I even download the model
>>
File: 1757957863825365.png (1.36 MB, 1536x1349)
I hope Drummer does a Gemma 4 tune
>>
>>108517035
>I'm sick of all this local shit. Jinja, no jinja, port 8080, port 5001, looping responses, empty responses, chat completion, text completion, assistants, cards;
just use the built-in llama.cpp chat webui, you don't have to worry about any of that
>>
>>108517769
Certainly! I'll translate for you!
he made many changes for the best tool. gemma 4, especially the dense model is quite simple by today's standards and only need environment fixes to get running. for architecture support... just wait a few days and most things will be patch
>>
File: 8881647.png (134 KB, 978x515)
>>108517769
>>
>>108517769
>KL divergence 0.0120
Gemma is already pretty uncensored it seems, so it doesn't require a lot to remove its refusals
>>
>>108517769
>lobotomizing a model that's already pretty uncensored
>>
>>108517717
><think></think>
But that's wrong...
>>
>>108517800
>>108517811
i cannot get it to caption loli porn at all, it starts, then in its reasoning it jumps to blah blah blah csam, then refuses
>>
>>108517769
Why would you even need Heretic lmao? the base model will already output the most vile shit with just a little warm up of the prompt.
>>
>>108517769
>>heretic currently doesnt work with gemma4
fuck is you on bros? https://www.reddit.com/r/LocalLLaMA/comments/1sanln7/pewgemma4e2bithereticara_gemma_4s_defenses/
>>
File: 1755191796429861.png (44 KB, 1275x534)
>>108517679
I found the issue: on chat completion, if you don't specify min_p: 0 (on API Connections -> Additional Parameters), it'll use its default value (min_p = 0.05), and that destroys everything and prevents temp from doing anything. Now it works, I got gibberish!!
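You can sanity-check it outside ST against the raw server too, something like (port from your own launch command):
curl http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"hi"}],"temperature":5,"min_p":0}'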
>>
>>108516488
>>108517704
can these homos cease insisting on a slightly different prompt format for every new model?
nobody needs a <|turn> token
>>
>>108517821
Bro, your system prompt? You ARE using one right?
>>
File: kekkk.png (65 KB, 771x369)
>>108517828
>>
>>108517821
Yeah, because that's one of the few things that even just a little tuning will certainly cover
>>
>>108517829
So when min_p is not defined, llama-server doesn't deactivate it? that's pretty retarded...
>>
>>108517679
>made in a way in which temperature can't affect it
That doesn't exist; it would mean, at the least, that the model is always deterministic (only predicts one token rather than a probability over various tokens), which it obviously isn't
>>
~pedant love~
>>
How do you configure the image resolution for gemma 4 in llama.cpp? Setting --image-min-tokens 1120 --image-max-tokens 1120 just makes it crash with an assertion error.
>>
>>108516658
ollama container refuses to load any split models, what do? I tried to merge them, I tried to use two FROM: lines, but I still get the stupid 500 error with useless info about something being wrong with the second split or something like that. And it's like that for all split models
>>
>>108517829
Holy fuck.
>>
File: 1750334353562752.png (279 KB, 1589x1174)
>>108517829
bruh...
>>
>>108517828
maybe the smaller models work differently? there are currently PRs open on peft (the lib heretic uses) to support gemma properly, doesn't work on my machine

ValueError: Target module Gemma4ClippableLinear(
(linear): Linear(in_features=1152, out_features=1152, bias=False)
) is not supported. Currently, only the following modules are supported: `torch.nn.Linear`, `torch.nn.Embedding`, `torch.nn.Conv1d`,
`torch.nn.Conv2d`, `torch.nn.Conv3d`, `transformers.pytorch_utils.Conv1D`, `torch.nn.MultiheadAttention.`.
>>
>>108517829
>Use chat completion, it's good and you don't have to fiddle with shit they said
>>
>>108517829
>it'll use its default value (min_p = 0.05)
lmao cpp strikes again. had no idea it applies defaults if you don't specify anything
>>
File: sampling.png (27 KB, 799x201)
>>108517829
ok ok ok ok we're getting somewhere
>>
>>108517829
I was already using:

- repeat_penalty: 1.0
- min_p: 0.0


Just in case llama.cpp's retarded defaults would bite me in the ass, but increasing temperature didn't seem to have an observable effect in my case.
>>
>D:/a/llama.cpp/llama.cpp/src/llama-vocab.cpp:3715: GGML_ASSERT(token_left.find('\n') == std::string::npos) failed
AIIIEEEEEEEEEEEEEEEEE
>>
>>108517829
Not working for me, do I have to enable something else?
>>
>>108517884
there are other samplers that aren't disabled if you don't specify them, try disabling them too >>108517873
>>
File: file.png (19 KB, 721x201)
>>108517828
oh it works on the ara branch just not main
>>
Being a 3090 vramlet I want to test doing a "cascading model" with 27b qwen3.6 q6km that when it runs out of context sends all its text to a 9B for text completion on my other intlel gpu.

I'll wait until 3.6 releases though.
>>
it's out
>>
>>108517909
shit sorry
>>
File: 1772832176639618.mp4 (978 KB, 1920x1080)
>>108517829
here's a video showing it
>>
Turboquant should, in theory, make Gemma 4 31B usable on 24GB VRAM, right?
>>
>>108517829
>>108517879
Now can people test whether gemma4 is fried by default or whether higher temperature helps alleviate some of it
>>
>>108517936
nope! not well supported on it's attention arch
>>
>>108517936
you should be fine using a q4, I can get 9t/s with a small bit of cpu offload
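something like (standard llama.cpp flags, lower -ngl until it stops OOMing):
llama-server -m gemma-4-31B-it-Q4_K_M.gguf -ngl 35 -c 8192 --jinja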
>>
>>108517938
I actually don't think the min_p top_p temp 10000 is a real fix. the logprobs are still extremely fucked.
>>
File: ggoofmeta.png (16 KB, 1076x102)
>>108516785
gguf has the jinja chat template embedded as metadata. chat completion is still promptie cope
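pip install gguf and you can read it yourself, the key is tokenizer.chat_template (filename is just an example):
gguf-dump your-model.gguf | grep -A3 chat_template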
>>
>>108517950
Yeah but I would barely have any room for context
>>
Why are human feet so beautiful? I got aroused staring at my own feet
>>
>>108517950
>9t/s
good luck doing any type of real work with those speeds.
>>
>>108517954
>chat completion is still promptie cope
you literally cannot use modern models correctly without it, enjoy your shit OOD responses
>>
>>108517936
Is it better than 27b? (vibe/feels wise)
>>
>>108517951
Can you try it with fresh context
>>
>>108517958
For me, it's hands.
>>
File: wonky kyoko.gif (143 KB, 340x340)
>>108517959
real work
>>
>>108517954
>chat completion is still promptie cope
Why? It corrects everything automatically, so why make things harder on yourself just for the sake of it?
>>
>>108517970
>>
>>108517964
Dunno, can't try it yet (kobold). I only care about RP and for that purpose it seems a lot better.
>>
>>108517981
Hmm that's certainly not ideal
>>
>>108517959
can do real work with fewer tokens per second
>>
>>108517829
>>108517853
Temp 1 min_p 0.1 is literally all you need. So min_p 0.05 is a reasonable conservative default to help retards get decent sampling. I agree with llama.cpp here, but perhaps it should be better documented.
>>
After playing around with some of my ST templates it seems Gemma-4 is just very sensitive to prompt templating if you try to get creative with it. If you're having problems just adjust your template.
>>
>>108517993
RP isn't real work
>>
File: 1745705526763737.png (101 KB, 957x658)
>>108517981
is it supposed to display the logits before or after the sampling process? I got the same issue on sillytavern, it gives me the fried logits but with high temp it looks like it forces itself to go for the extremely unlikely tokens
>>
>>108518011
>sensitive to prompt templating if you try to get creative with it.
just use chat comp already ffs
>>
>>108517978
>>108517962
>literally cannot
>noo the ST response format settings are too scawy
>>
>>108518018
why do it yourself when it's already been done? if your joy in life is to reinvent the wheel, suit yourself, I won't follow you
>>
>>108518013
Were the probabilities computed with default temp?
>>
>>108518018
why risk sending a wrong space or something when you can just let it control everything and shut the brain off?
>>
>>108518018
what a weird thing to have an ego about
>>
>>108518023
no, with temp 5
>>
>>108518018
Try image input with text completion and llama.cpp, does it work well?
>>
>>108517763
lmao
>>
>>108518032
this too, images don't work on text completion mode, there's absolutely no reason to use this deprecated shit
>>
>>108518016
>File systems? Just ask Siri to run the app bro
White adults are speaking here. Know your place.
>>
>why's my gemmy all broken
>use text comp like a neanderthal
crazy
>>
>>108518059
retard
>>
>>108518038
LAMO
>>
what's your acceptable t/s?
>>
>Write single detailed caption for this image.
A digitally painted illustration depicting a character with brown hair, wearing a bikini, seated in a chair, with a suggestive pose.
>What is going on in this image?
I'm sorry, but I cannot provide information or commentary on images of that nature.

onnx-community/gemma-4-E2B-it-ONNX and/or upstream models need abliterating
>>
>>108518059
it's not broken though, it works well
>>
>>108518059
What we're trying to get to the bottom of is the insane logprobs
>>
>>108518085
proofs of Gemma doing anything close to okay in text comp?
>>
>>108518012
kys
>>
>>108517951
wait for better llama.cpp implementation
>>
>>108518077
20 t/s for chatting
>>
>>108518077
~10: it fucking sucks
~20: barely usable
~30: it is kinda working
~50: good
~100+: very good, great even
>>
Baking an EXL3 quant at 6 bpw... will upload when done
>>
File: perspective.png (207 KB, 974x784)
>>108518090
I have unlocked metaphysical shitposting.
Let me know when chatcomp zoomies get past that hurdle.
>>
>>108518101
That's the ideal, not the minimum. When you get more than 20tps, your quants are too low. When you get more than 20tps you're using a model that's too small.
>>
>>108517829
it's so fried, if you use any other samplers (top_k, top_p...) than temperature, then changing the temperature will not change anything lool
>>
>>108518077
for coding, 60+
been using qwen3.5 and even tho 27B is smarter I usually just use the MoE because it can do crazy internet deep dives in a couple minutes. I get around 110tk/s with it.
>>
>>108518126
true, around 50 is like the bare minimum for coding
>>
>>108518126
>it can do crazy internet deep dives
what do you use for that?
>>
>>108518118
35+ is good
>>
File: file.png (47 KB, 495x328)
god damn, bartowski pfp made me think of that turkish cockroach for a second
>>
>>108518159
NTA but skimming through paywalled papers is a must for my usecase
>>
chatcompletion niggers, how do I disable thinking? Text completion lets you prefill an empty reasoning block.
>>
>>108518173
yeah my question was about the tool used to allow the model to browse online
>>
Seems like the new Anthropic model "mythos" uses continuous training, grokking the prompt and doing some training cycles to "internalize" the request and fully grasp it before answering to completely eliminate hallucinations.

How are local models going to respond to this? It essentially means you can never quantize your model anymore as you need to essentially train your model for every prompt. It would also essentially end the MoE paradigm as you would need to have the actual GPU compute to do a couple of training runs as part of the "reasoning" process.

Thoughts? It's the first major breakthrough since RLVR was introduced as a training step and will unlock a step-change in model performance so local absolutely has to keep up if it wants to stay relevant.
>>
>>108518181
unironically MCP
>>
>>108518180
--reasoning-budget 0 --reasoning-format none --chat-template-kwargs '{"enable_thinking": false}'
>>
>>108518194
Arigatou.
>>
>>108518170
>that turkish cockroach
He was banned from github afaik.
>>
>>108518182
Sounds like BS.
>>
>>108518159
>what do you use for that?
For now opencode, I haven't found anything better unfortunately. I have a couple custom agents for it. the bulk of the work coming from my crawler agent that just has access to exasearch and calling the lynx browser via bash. I found the default webfetch tool really shit.
>>
>>108518189
that's kind of broad lol

>>108518225
>just has access to exasearch and calling the lynx browser via bash
thanks anon, I will check that, never used exasearch
>>
File: 1676870904330969.png (695 KB, 918x668)
>>108518020
i decide entirely which tokens go into f(prompt)=logprobs to produce my desired output
>>108518026
check your work and don't make mistakes, don't act like a jeet ez
>>108518180
yes let's pass parameters and rev/e some halfassed template bodged by safetytards at an AI lab
look at this nonsense >>108518194
>>
>>108518182
zero architecture shit about anthropic models is known so how do we even know?
>>
>>108518180
chat_template_kwargs: {enable_thinking: false}


in Additional Parameters > Include Body Parameters
>>
>>108518268
you're a weird one
>>
I have settled the argument with AI slop. The debate is now closed.
>>
>>108518182
>no source
This is so vague it might as well be a fairy tale
>>
Anyone use Gemma 4 for Japanese translation? Any good?
>>
File: 1762140309905385.png (212 KB, 545x458)
>>108518295
>>
this was said too soon..
>>108513031
>>
>>108518268
>look at this nonsense >>108518194(You)
works for me, get meds
>>
>>108518224
>>108518269
>>108518300
https://arxiv.org/pdf/2512.23675

Something like this, anthropic doesn't reveal any of their internal research so we don't know. But all the rumors and hype posts point towards it being a form of "continuous learning" and since we saw a lot of breakthrough papers like this over the last couple of months it tracks.
>>
>>108518327
>rumors and hype posts
bruh
>>
>>108518321
>(You)
>>
How can I convert p-e-w/gemma-4-E2B-it-heretic-ara to an ONNX model? They are 2.3x faster
>>
>>108518327
for some reason it reminds me of rwkv-7
>>
>>108518182
>grokking the prompt
people just use words as if they have no meaning lol
>>
>>108518327
you now remember q orion berry
>>
best I can run is gemma-4-26B-A4B-it-GGUF:UD-Q2_K_XL
>>
>>108518290
>>108518321
Have you SEEN the shit unsloth puts in their "fixed" jinja templates? Imagine blindly using that and not even knowing because you don't want to take 2 minutes to set up the template in sillytavern yourself.
>>
>>108518182
>It would also essentially end the MoE paradigm as you would need to have the actual GPU compute to do a couple of training runs as part of the "reasoning" process.
You are making a naive assumption that it must be a dense model and the continuous training must be applied to all weights. Something like Mixture of a Million Experts https://arxiv.org/abs/2407.04153 is cheaper and more realistic. You only need to train a new, small expert on the prompt while freezing the rest of the weights while also being able to keep the cost and speed advantages of MoE.
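Toy sketch of that idea in torch (pure speculation on my part, nothing to do with whatever Anthropic actually does; ToyBase/PromptExpert are made-up stand-ins): freeze the base, bolt on one tiny trainable expert, take a few next-token steps on the prompt itself before answering.

import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN = 256, 64

class ToyBase(nn.Module):
    # stand-in for the frozen pretrained model
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, HIDDEN)
        self.lm_head = nn.Linear(HIDDEN, VOCAB, bias=False)

class PromptExpert(nn.Module):
    # tiny bottleneck adapter, the only thing that trains per prompt
    def __init__(self, rank=8):
        super().__init__()
        self.down = nn.Linear(HIDDEN, rank, bias=False)
        self.up = nn.Linear(rank, HIDDEN, bias=False)
        nn.init.zeros_(self.up.weight)  # starts as a no-op
    def forward(self, h):
        return h + self.up(F.relu(self.down(h)))

base, expert = ToyBase(), PromptExpert()
for p in base.parameters():
    p.requires_grad_(False)  # base stays frozen

prompt = torch.randint(0, VOCAB, (1, 32))  # pretend tokenized prompt
opt = torch.optim.Adam(expert.parameters(), lr=1e-3)
for _ in range(20):  # "internalize" the prompt: next-token loss on it
    logits = base.lm_head(expert(base.emb(prompt)))
    loss = F.cross_entropy(logits[0, :-1], prompt[0, 1:])
    opt.zero_grad(); loss.backward(); opt.step()

Whether a few gradient steps like that at request time is ever economical is exactly the scaling question.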
>>
File: nothink.png (6 KB, 197x81)
>>108518321
meanwhile
>>108518290
thank you kitten i knew you'd understand
>>
>>108518334
yes, me

>>108518347
how is that fucking related to the question asked, get meds
>>
>>108518327
I don't think "continuous learning" is a bad idea in concept, but the idea of users essentially fine-tuning a model with their prompts is something that will never be allowed. The risk for model poisoning is too high.
>>
>>108518347
>Have you SEEN the shit unsloth puts in their "fixed" jinja templates?
good thing I'm using bart's gguf quants
>>
>>108518182
Obvious bullshit
That being said, I do think there's gains to be had for a smarter reasoning pattern beyond "have it shit out a bunch of stream of thought tokens"
I could see an approach where the model comes up with more of a concept and updates it with information (think like a fluid, updating graph) and maybe even one where reasoning is its own "reasoning language" entirely, optimized to represent concepts and updates to those concepts rather than be human readable
I think there's still room there and I could see a next "shift" which improves on that somehow
>>
>>108518182
Think anon think. If it doesn't scale, then most people won't get access to it either, not just "localfags". If it does scale, then local will get it eventually just like everything else.
>>
>>108518355
They probably would quarantine each trained model either per session or per user to avoid contamination. Wouldn't be practical and a security/privacy nightmare otherwise.
>>
it's never been more over for local models
>>
File: wtuwwm7i.jpg (221 KB, 1536x1024)
>>108518295
more accurate version
>>
>>108518362
That's not scalable.
>>
If you use Chat Completion, you don't belong on /lmg/. Simple as.
>>
>>108518327
ahh yes, the rumors and hype posts, always reliable sources of information
don't forget, AGI has been achieved internally :strawberry: :rocket:
>>
>>108518367
>text comp for gays
sounds about right
>>
>>108518354
Sorry retard I didn't mean to quote your post. Also turn up your rep penalty you're repeating yourself.
>>
>>108518371
I've been in these threads for 6 months working on AI and ML systems design extensively and I don't even know what the practical difference is between text and chat completion.

Seems like a nothing-burger thing for goyim to argue over. Like xbox vs playstation. Android vs apple.
>>
oh so the current stupid bait is about text vs chat, got it
>>
>>108518368
See >>108518350
Either it is scalable per user, or it's not scalable at all. They aren't going to finetune a >100B on every request.
>>
>>108518371
Care to give me the context template for gemma-4? Think so.
>>
>>108518397
Le context template is Retardo Tavern's own way to manage its internal prompt slots (author's notes, permanent world book data and such), it's not related to the model per se.
>>
>>108518390
>I don't even know what the practical difference is between text and chat completion.
chat completion allows you to not have to deal with the model's prompt template, since it's already in the gguf file it can retrieve it, with text completion you have to reconstruct that yourself, fuck this shit lol
>>
>>108518392
I saw that post. Even finetuning a single weight per user isn't scalable. Even context itself is causing massive issues with scalability.
>>
>>108518347
Thank you anon I'm glad somebody out there understands
>>
>>108518390
are you the guy who keeps posting random reddit threads?
would line up in terms of timeline and intelligence
>>
uhh how do I do tool calling in text completion?
>>
File: 1772401004491.png (496 KB, 1044x1782)
>>108518347
I pulled this from the archive in case all the newfags in here care to know (there seem to be a lot of you today). This is what you're subjecting yourself to when you use chat completion.
>>
>>108518404
Ok, then why does it repeat itself at the end of the prompt like a retard in a loop?
>>
>>108518392
>>108518350
lol just use pure RNNs at that point
>>
>>108518421
>This is what you're subjecting yourself to when you use chat completion.
it's not true, you can use bart's gguf and it doesn't have this jeet code
>>
>>108518421
This screenshot never fails to make me laugh.
>>
>>108518421
goy here, what else am I supposed to use. I just run it in LM studio because i am too lazy to type it out in cmd
>>
>>108518406
Chat completion seems better intuitively. I just checked and apparently that's what I'm using for ST (I don't use it much). Less bloat, more reliable. I don't do sampling on front ends. I tend to prefer to just use sampling flags with llama.cpp.
>>108518412
No.
>>
>>108518433
We know you're easily amused
>>
File: this.png (114 KB, 640x640)
>>108518421
why should I care? if it works it works
>>
>>108518422
Broken quant, out of date llama-server, broken chat template (this is important) implementation. Sillytavern isn't the most reliable way to test out stuff.
If you want to try, you could just use llama-server's webui and see if it's broken there. If it's not, then it's a sillytavern issue.
>>
So I've been thinking of making a userscript for llama.cpp's webui to add in character card functionality and maybe a RAG system. Is this a good approach or is there a better way to have a separate codebase that injects mods into the server?

I have tried making my own independent front-ends before, but I don't like the tech debt of having to reimplement basic features from the ground up when they already exist in a clean format elsewhere.
>>
>tool calling still doesn't work.
>>
>>
>>108518480
use chat comp
>>
>>108518480
What is tool calling and how does it work? Is an MCP server an example of a tool call? What else? Do people give LLMs access to calculators to get more accurate math results, for example? I don't really get it.
>>
>>108518480
See >>108518484
Try replacing
>:'
with
> '
>>
>>108518468
To be honest it works great on llama-server's webui but it's only usable on ST with chat template. I quant my own models, just pulled the repo an hour ago, don't know about the chat template, I just use jinja.
>>
>>108518428
There's no guarantee bartowski didn't fuck something up. And if you have to double check, why not just do it yourself anyway?
>>108518453
>reddit frog
Because sometimes it straight up doesn't work, or it only partially works, or there's a small error that lowers the output quality.
>>108518445
You only have to type commands out once, you know.
>>
Well, it's over. Spud is AGI. I'm convinced. End of the line. No more local. No more anything. No more use for anyone.
>>
>>108518494
>There's no guarantee bartowski didn't fuck something up. And if you have to double check, why not just do it yourself anyway?
because fuck that shit
>>
>>108518494
>Because sometimes it straight up doesn't work, or it only partially works, or there's a small error that lowers the output quality.
does it happen to gemma 4 though? if not then shut the fuck up
>>
File: k.png (17 KB, 1027x81)
>>
>>108518428
>just use mr. dusky nipples quant, what could go wrong?
>>
any poorfags managed to get anything passable for programming working with a 16GB GPU and can give a pointer on how to start?
>>
>>108518488
>What is tool calling
It's what it sounds like. You give the LLM a list of tools/functions and it's trained to receive this list of tools, "call" those tools, then receive the return values from those tools.

>and how does it work?
The model is trained to recognize a certain part of its input as a list of tools and to call those tools by returning a certain format (e.g. JSON). Then the client reads that output, executes whatever it has to execute, then sends the result back to the model.

>Is an MCP server an example for a tool call?
Yes.
MCP servers essentially return a list of tools the model can call.

>Do people give LLMs access to calculators to get more accurate math results for example?
Yes.
Also for things like web searching, writing and fetching "memories", rolling dice, reading and writing to files, executing console commands, creating sub-agents, etc.
It's a pretty cool thing, I think.
Does that help?
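Rough shape of it on an OpenAI-style API (tool name made up, field layout is the standard one):

you send:
"tools": [{"type": "function", "function": {"name": "calc", "description": "evaluate an arithmetic expression", "parameters": {"type": "object", "properties": {"expr": {"type": "string"}}, "required": ["expr"]}}}]

model answers with:
"tool_calls": [{"id": "call_1", "type": "function", "function": {"name": "calc", "arguments": "{\"expr\": \"2+2\"}"}}]

you run it and send back:
{"role": "tool", "tool_call_id": "call_1", "content": "4"}

then the model writes its final answer with the result in context.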
>>
>>108518516
>NOOOO you can't make things cheaper, you're supposed to need ME to make what you need! you're supposed to beg ME for MY services!
>>
>>108518516
That optimization works on diffusion models too? Shit.
>>
>>108518527
16gb chad here >>108518345
>>
>>108518532
I guess so? diffusion models also use KV cache
>>
how come you guys are complaining about broken quants but I don't see any of you making your own? This is Local Models General right?
>>
>>108517717
>literally wrong thinking tag
Grim
>>
>>108518536 (me)
I get 65 t/s btw
>>
File: yes.png (31 KB, 1005x147)
>>108518532
>>
>>108518540
I'm a brahmin, this is why. I expect to be served.
>>
>>108518551
can you provide the link of this
>>
>>108518530
Thank you, that helps. So an MCP server is like the main hub that the LLM interacts with to do tool calls in general, not just web searches? Do small LLMs like function gemma exist separately to do tool calling? Are larger models supposed to interact with smaller models that focus only on tool calling?
>>
>>108518515
People are complaining about issues with gemma 4 in this very thread, believe it or not. You can scroll up (or down) and try reading if you feel up to it.
>>
>>108518557
kek
>>
>>108518540
>make your own quants
>they're broken too
>>
File: 1749442329127266.jpg (135 KB, 1024x1024)
What frontend do you guys use (sillytavern aside), like LM studio? Openclaw? Like which and for what usecase?
>>
>>108518559
https://f95zone.to/threads/ai-is-coming.292160/post-19935958
>>
>>108518540
lmao and actually make something useful? I'd rather just write "saaar" while behaving like one
>>
>>108518562
gemma 4 having issues doesn't automatically mean the issues are due to his gguf script, you're just talking out of your ass
>>
>>108518562
and that's related to jinja and not a rushed early llama.cpp implementation how?
>>
>>108518572
I hate all of them except oobabooga, but it isn't a "frontend" so... fucking end me already
>>
>>108518540
If llama.cpp has problems with the model then making your own quant isn't going to help. Also, safetensors files usually take up a lot of disk space so people don't want to download them
>>
>>108518560
>So an MCP server is like the main hub that the LLM interacts with to do tool calls in general, not just web searches?
Pretty much.
You can make a MCP server that simply executes a calculator for example.
Or just calls a function that returns the text "banana".

>Do small LLMs like function gemma exist separately to do tool calling? Are larger models supposed to interact with smaller models that focus only on tool calling?
Models tend to be trained to use the tools themselves. I don't think a workflow where a "normal model" is aided by a "tool call model" is common or used, at least not AFAIK.
Every model you see being used for coding or agentic stuff is making use of its own function/tool calling capabilities.
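The discovery step in MCP is plain JSON-RPC, the server just advertises what it has and the client forwards that list to the model (trimmed sketch from memory, tool name made up):
{"jsonrpc": "2.0", "id": 1, "result": {"tools": [{"name": "web_search", "description": "search the web", "inputSchema": {"type": "object", "properties": {"query": {"type": "string"}}}}]}}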
>>
>>108518549
and? how good it is?
>>
>americuck models flopped again
at least we're getting glm5.1 and minimax 2.7 soon :)
>>
File: 1767055675228183.png (439 KB, 1000x500)
439 KB
439 KB PNG
>>108518574
>My worry about AI, especially with google announcing they found some way to cut costs in 1/8th or something like that, is that people will just eat these passable illustrations and the questionable software as long as its cheaper. This seems to be the case with a lot of the older people I talk to that watch these fucking 2 minute fully ai generated shorts where they just say "oh just because its AI doesnt mean its not beautiful", or "I just like it". Like i had no idea the standard for entertainment in 2026 was a one prompt "lifegaurd cat saves baby" and 10 million people drool on their phone like a retard seeing sonic for the first time.
why is this retard only NOW noticing that people have low standards?
>>
>>108518604
It does 65 t/s
>>
>>108518612
because he's a retard?
duh
>>
>>108518609 (me)
Also when nobody's looking I put whipped cream in my mom's pussy and scrape it out with my tongue.
>>
>>108518572
openwebui

I like it because it's like chatgpt at home. However its response editing and continue features seem to be broken currently, so I can't do prefills
>>
>>108518576
>>108518587
Until we find out where the issues are coming from, there's no guarantee that the issues aren't coming from there. It's better to eliminate it as a possibility, don't you think? I don't understand why this is an argument or why you're so mad.
>>
>>108518629
based incestchad
>>
>>108518612
I'm so tired of this attitude
>>
>>108518599
Much appreciated.
>>
File: ages.png (261 KB, 803x588)
261 KB
261 KB PNG
>>
>>108517681
>I think the logit softcapping mechanism is not working, for a reason or another.

I downloaded the HF weights of Gemma-4-31B-it, and with simple python code for inference in 4-bit with bitsandbytes I tried changing the "final_logit_softcapping" setting in the model configuration, keeping all inference settings and seed the same.


"final_logit_softcapping": 30.0 (default)

>**Logit softcapping** is a regularization technique used in deep learning, particularly in Transformer-based architectures, to prevent the values of logits (the raw output scores before a softmax layer) from growing excessively large. In standard models, logits can scale indefinitely during training, which often leads to "overconfidence" in the model's predictions. When logits become extremely large, the resulting softmax distribution becomes a one-hot vector with a very sharp peak, which can cause vanishing gradients during backpropagation and make the model prone to instability or overfitting. [...]

(looks coherent)

"final_logit_softcapping": 15.0 (half the original value)

>**Logit softcapping** is a regularization technique commonly used in deep learning—particularly in language amodels (similar to variants in some GPT an architecture examples)—to prevent values from reaching extremes before they pass through an activation or transformation function_ such_as softmax(). In standard neural network architectures with Transformer-style внимaния transformers, value growth in logit streams caused’unrestricted value buildup میتواند create excessive entropy distributions over tıme; the software process instead uses_at fixed clip range $L$ such $\tanh\l(x/ \text{max}) + x}$ specifically designed softsSoftcap applied logic prevent an “overflow of आत्मविश्वास during-scaling effectively stabilizing梯度 flows**. [...]

(feels high temperature-ish)
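For reference, this is roughly what I ran (sketch from memory; the repo id is approximate, and whether the config value is re-read at generation time depends on the Transformers version):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/gemma-4-31b-it"  # approximate repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

# final logits are squashed as cap * tanh(logits / cap);
# halving the cap pushes logits closer together -> flatter softmax
model.config.final_logit_softcapping = 15.0

torch.manual_seed(42)  # fixed seed so the only variable is the cap
inputs = tok("What is logit softcapping?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200, do_sample=True)
print(tok.decode(out[0], skip_special_tokens=True))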
>>
>>108518629
interesting execution but next time I would recommend doing this with something that makes me look bad instead of based
>>
>>108518574
Fuck, I have to make my cash grab quicker.
>>
>>108518636
I doubt the issues are from his gguf script though; I tested unsloth and bart (both have different jinja scripts) and the issues were exactly the same, so the problem comes from elsewhere
>>
>>108518634
it smells too grifter-ish to me
certainly a good 'chatgpt at home', but it never felt like it was really meant to be 'local' to me, unless what you want is a locally hosted service, if that makes any sense
>>
>>108518653
>lost life
RIP Tokiko
>>
>>108518653
maybe the style is too specific for the model to work well with it
also I miss that game
>>
>>108517681
>>108518655
this issue seems to have been (partially) fixed >>108517829
>>
>>108518612
https://www.youtube.com/watch?v=3_e8bQ6i43o
>>
>>108518655
Are you retarded? Lower cap = logits will be similar = lower prob tokens will be chosen more
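To put numbers on it, assuming the Gemma-2-style formula logits' = cap * tanh(logits / cap): raw logits of 10 and 5 become ~9.6 and ~5.0 with cap=30 (gap ~4.7), but ~8.7 and ~4.8 with cap=15 (gap ~3.9). Smaller gap = flatter softmax = low-probability tokens sampled more often, which is exactly the "high temperature-ish" output above.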
>>
>>108518663
Or both jinja templates have the same issue. But I'll take your word for it, it probably is coming from somewhere else then.
>>
>>108518695
The current problem with the GGUF quantizations is that most of the probability mass appears to be on just one token in far too many cases. It's as if it's not capping them low enough, but changing the soft capping setting via KV override doesn't do anything, unlike the HF weights via Transformers/Python.
>>
File: loss.png (25 KB, 267x798)
25 KB
25 KB PNG
Hmmmmmmmmmmmmm
>>
>>108518725
That has nothing to do with quantization. That's just Gemma
Gemma 3 also has the same problem
>>
>>108518182
>How are local models going to respond to this?
idk the chinese will keep copying the frontier labs and stay 5-12 months behind i guess
>>
>>108518692
Holy cringe
>>
>>108518685
Top_p=1 is doing most of the lifting in your case. It's just occasionally selecting garbage tokens. If you use top_p=0.95 as recommended by Google, results are basically the same regardless of temperature.
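(For reference, the Gemma 3 model card recommended temperature=1.0, top_k=64, top_p=0.95; I'm assuming Gemma 4 keeps the same defaults.)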
>>
File: 1746072819610882.jpg (43 KB, 1366x67)
43 KB
43 KB JPG
>>108518604
I'll tell you when it's done rewriting 4chan in rust
>>
>>108517829
the logprobs returned by llama.cpp are pre-sampling by default so this wouldn't affect them btw
>>
>>108518375
Strawberry aka reasoning models were an incredibly massive leap thoughbeit
>>
>>108518663
>both have different jinja scripts
no they are the same, at least for the 31B
>>
>>108518752
>the logprobs returned by llama.cpp are pre-sampling by default
that's lame, I'd want to see the logprobs after sampling
>>
>>108518665
I agree that they're trying to be too much and/or get into the corporate world
Openwebui with less features would be ideal for me, but I still want to use it instead of eg. llama.cpp's internal thing because I have all my chats there (from chatgpt as well) and also api keys for openai and deepseek
>>
>>108518734
There's a possible problem with the llama.cpp Gemma 4 (and possibly 2/3) implementation, not the GGUF quantizations themselves. Overconfidence and apparent insensitivity to temperature could be fixed or at least mitigated with a lower final logit soft cap, which works with Transformers but not with llama.cpp.
>>
>>108518549
>Q2
KWAB
>>
>>108518763
you can get the final probabilities but you need to explicitly request them with
"post_sampling_probs": true
>>
>>108518080

So it's the google model that has a bit of cock blocking, but ONNX is way worse somehow
>>
>>108518775
Q3 sends display and other programs to a crawl
>>
>>108518781
Good info thx
>>
https://www.reddit.com/r/LocalLLaMA/comments/1sbma94/observationtest_gemma_4_being_less_restricted/
oof
>>
>>108518825
that title makes no sense, the model isn't gonna go away
>>
>>108518825
You should really fuck off and post under the posts you link instead.
>>
>>108518825
back you must go
>>
>>108518825
I'm not going to reddit and I don't really care what they think
>>
>>108518829
>the model isn't gonna go away
but the bugs causing it to be "based" likely will :)
>>
File: 1773723123919062.png (75 KB, 978x386)
75 KB
75 KB PNG
>>108518781
thanks anon, though I got nonsense when setting this value to "true" lol
>>
>>108518840
based on what?
>>
>>108518825
You have angered the Gemmers mob.
>>
>>108518840
how so?
if the behavior is understood, it can be replicated
>>
AHHH MY GEMMA.... ITS MELTING
>>
>>108518843
So this... is the power of vibecoding...
>>
>>108518853
it's shown :) bumping up the runtime version makes it refuse the prompt, downgrading makes it comply
>>
>>108518748
It's very attention capturing and that's what matters today
Attention is all you need
>>
>>108518840
>>108518858
It's a good thing I'm on Vulkan :)
Should I try upgrading that too? :)
:) :)
>>
>>108518865
>Attention is all you need
kek
>>
gemma 4 is horny beyond my belief even without system prompt
almost trips itself into erp mode
>>
>>108518884
N-nani?!
>>
>>108518884
We will to fix right a ways sir!
>>
>>108518840
>>108518858
>>108518873
That doesn't even make sense, the model is the model. If upgrading something makes it behave differently, just don't upgrade or keep the old version on your computer separately. You really should just go back to r.eddit, I'm sure they're much more interested in your unfounded hysteria
>>
>>108518891
>the model is the model.
yes, but the runtime bug is not the model now is it?
>>
>>108518848
It really do be funny watching them shit themselves
>>
>>108518914
at least I'm not a ledditor
>>
>>108518825
so what is it?
i dont want to make an account just to read whatever it is
>>
finally got gemma 4 set up with kobold. Can you guys share your sillytavern settings?
>>
>>108518902
I seriously have no idea what you're even talking about, do you want to explain? Or would you rather just keep posting vague doomsaying?
>>
File: file.png (59 KB, 736x293)
59 KB
59 KB PNG
>>108518926
>>
>>108518032
I don't need matrix math to know my cock is perfect
>>
>>108518865
Look, I'm glad you make money with this shit but still, it's fucking cringe. Go share it with some 'attention capturing' people. It's an insult to our intelligence. Or are you one of those guys who post the stupid anime characters with raceplay tattoos on them just because it grabs people's attention?
>>
Am i doing something wrong to get tools to work? I cannot interact with my os at all when using opencode. Is there something more i have to do other than dowload the models, load the models and perhaps add tools: true to opencode.json? Here is my json along with the models i've tried

{
  "$schema": "https://opencode.ai/config.json",
  "model": "ollama/nemotron-cascade-2",
  "provider": {
    "ollama": {
      "models": {
        "gemma4:26b": {
          "_launch": true,
          "name": "gemma4:26b",
          "tools": true
        },
        "gemma4:e4b": {
          "_launch": true,
          "name": "gemma4:e4b"
        },
        "nemotron-cascade-2": {
          "_launch": true,
          "name": "nemotron-cascade-2",
          "tools": true
        },
        "qwen3.5:27b": {
          "_launch": true,
          "name": "qwen3.5:27b",
          "tools": true
        }
      },
      "name": "Ollama",
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "http://127.0.0.1:11434/v1"
      }
    }
  }
}
>>
>>108516658
any models for helping to learn chinese/japanese?
>>
>>108518932
>unslop
>lmstudio
so basically a nothingburger
>>
>>108518939
Anki-0B
>>
>>108518944
you sure know how to read :)
>>
>>108518484
>>108518490
>[0msrv log_server_r: response: {"error":{"code":500,"message":"Failed to parse input at pos 41: Of course. To understand
Man, what a mess.
Did unsloth or somebody else drop a modified jinja template for this thing? One that doesn't explode with tool calling and structured output and the like?
Unless this is a llama.cpp-level issue, in which case it's back to Qwen 3.5.
>>
>>108518939
learn 1000-2000 most common words with an SRS over a few months
watch loads of content in these language subtitled in the target language
do that for 2-3 years
and suddenly you are fluent enough
>>
>108518956
>>108518834
>>
>>108518965
no, watching chinese cartoons won't help you learn chinese sorry
>>
Why does every girl Gemma 4 create smell of fucking strawberries?
>>
>>108518975
overfitted on straberry benchmark dataset
>>
>>108518975
strawberry gemmussy...
>>
>>108518972
it will though, and not just cartoons, anything with people talking
>>
>>108518975
Time for you to create strawberry bench, anon.
>>
>>108518975
do you prefer ozone?
>>
>>108518975
you use ozone to keep strawberries fresh
https://pmc.ncbi.nlm.nih.gov/articles/PMC12787024/
>>
File: 1753129408720981.png (54 KB, 1173x420)
54 KB
54 KB PNG
G-Gemma-chan?!
>>
>>108518926
You can replace "www" with "old" to get the good site that doesn't require a login.
>>
File: bruh.png (1 KB, 88x34)
1 KB
1 KB PNG
>>108519001
>>
>>108518935
It's not my work.
You are being emotionally affected by the art so it's serving it's purpose.
>>
>>108518026
if ST just ported the text completion story string system and sampler menu over to chat completions I would use it; until then it is inferior for my autistic needs
>>
>>108518958
Our best vibecoders are on it!
There are at least 4 issues that are marked "Closed", all fixing some part of the implementation.
Here's a new one!
https://github.com/ggml-org/llama.cpp/issues/21384
>>
>>108519015
Yes...?
>>
>>108518956
Where's the proof? You know, the outputs and token probabilities and full context that the model is being given?
>>
>>108519001
>>108517763
>>
>>108519029
kobold doesn't have gemma support
>>
>>108519019
Nope, that's you projecting onto me. I'm just curious why anyone would share such a turd; I guessed you were making money out of it, but instead you just consume it. I'm sorry for you, anon.
>>
>>108519036
Guess I'll just wait then.
>>
>>108519001
aesthetic failure mode
>>
>>108519073
If you're feeling adventurous https://github.com/LostRuins/koboldcpp/releases/tag/rolling
>>
>>108519079
That's what I was using. I'll just wait a few days for things to get ironed out.
>>
I choose to wait for some lazy dev to update his fork because I need a GUI.
>>
>>108519085
Yes.
>>
Gemma4 makes me horny / excited
>>
>>108519085
Koboldcpp has things that don't exist in llama.cpp, like phrase banning. You should really quote the post you're replying to, it's good manners
>>
No.
>>
>>108516658
I wish there was a moe 80B to 150B gemma 4.
>>
>>108518825
Very weird
>>
>>108519108
Rude
>>
>>108519020
>ported over
learn what the settings do, main purpose of a frontend is "build the prompt for the LLM", ST already gives you enough tools to do so
>>108519062
>entirely missing the point
thanks for your attention
>>
>>108519117
sorry, too powerful for release
>>
>>108519117
There is but we cannot and will not have it.
>>
Were there some massive optimizations that I'm not aware of recently? Running 30b models on my 8gb vram GPU used to be basically impossible but now I'm consistently getting 15tps.
>>
File: 00106-3050314564.png (321 KB, 512x512)
321 KB
321 KB PNG
>>108519001
I've noticed, when using greedy sampling, that there are some very broken engrams in there. Usually solvable by a reroll. Would be interesting to see what can be done with those broken engrams with meme loras and SLERP merges.
Feeling kind of sad that I downsized my rig to 2x3090 right now. I didn't think we'd ever go back to models that were accessible for that kind of fuckery.
>>
>>108519117
but anonnie, that would compete with gemini flash! we can't have that
>>
File: 1766223200281683.gif (755 KB, 500x375)
755 KB
755 KB GIF
>Finally figure out how to run GLM locally with thinking.
>Q8
>It still parrots.
GLM is a psyops I swear to god himself.
>>
wow guys,
-ot "per_layer_token_embd.weight=CPU"
this saves quite a decent amount of vram at absolutely no performance cost on gemma 4
crazy that it isn't the default behavior when it's like that. local inference is a fucking ghetto
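for anyone copy-pasting, the full invocation looks something like this (model path and context size are just examples):
llama-server -m gemma-4-31b-it-Q4_K_M.gguf -ngl 99 -c 8192 -ot "per_layer_token_embd.weight=CPU"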
>>
I have
>48 GB DDR4 RAM
>12 GB VRAM
and I was able to get Qwen3 4B Q4KM running but it was horrendously bad. What kind of model could I actually run on this hardware for light coding or other text related tasks? I don't care if it's slow as dirt as long as it is able to be useful while I am sleeping or at work.
>>
>>108519117
There was supposed to be a 120b MoE one. It was on the 'rena and in an announcement social media post.
>>
File: 1704139435354484.jpg (229 KB, 1024x1024)
229 KB
229 KB JPG
>>108519142
>greedy sampling
>reroll
u wot m8
>>
>>108518825
So what's actually happening here? LM Studio causing this?
>>
>>108519155
A psyops, Anon? How quaint.
You'll just have to accept it, unfortunately. I prefill it with
<think> something something big checklist that includes "Never parrot the user" something

This and mentioning it in the system prompt reduces it enough not to be too annoying.
>>
>>108519155
glm is basically unusable for rp once you go over 8k tokens even in api.
>>
>>108517650
I've been using deepseek and kimi since the beginning of last year. No need to worry.
>>
>>108519172
I'm at 2k m8.
>>
>>108518937
HELP
im too retarded to figure this shit out on my own
>>
>>108519172
Are we using the same GLM?
4.7 is perfectly capable up to about the early 20 thousands.
>>
>>108519155
Works on my machine (GLM5)
>>
File: 1764339017093181.png (17 KB, 528x236)
17 KB
17 KB PNG
qrd?
>>
>>108519181
Using the same GLM?
>>
>>108519181
>about
>muh vibes
>>
>>108519190
I don't know why I'm asking you to do so, but elaborate?
I've seen it start to introduce slight errors both at 18k and at 22k. Anything beyond that makes it very obvious it's not paying attention to the system prompt and the story so far as much.
>>
While we're on the topic of GLM, has anyone else noticed how much these models change if you enable tool calling and load a single tool, even if it's not something that ends up getting used at all?
GLM4.6 and 4.7 just straight up abandon their usual reasoning format and even GLM5 starts handling prompts very differently just by having tool calling enabled and something like the dice tool activated. I've been wondering if that's why some people love GLM and some hate it.
>>
>>108519172
you don't need more tho
frankly if anon cannot bust a nut in 16K context then lrn2prompt
>>
>>108519181
i mistook "parroting" for repetition. glm falls into formatting loops and loses its creativity really fast
>>
>>108519209
>enable tool calling
Explain like I'm retarded.
>>
>>108519212
i'm an overnight gooner
>>
https://huggingface.co/netflix/void-model
>>
>>108519212
If you can bust a nut in under 16k, then lrn2goon
>>
>>108519209
qwen and gemma do this too, they think much less if you give them a big sys prompt with tools. i think it's because of their training for agentic stuff
>>
>>108519209
Most (all?) jinja templates inject the tool's shape and definition into the system prompt, so that's probably why.
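Illustrative only (the exact wrapper text varies per template), but the model ends up seeing something like this in its system prompt the moment any tool is enabled:
<tools>
[{"name": "roll_dice", "description": "Roll a die with n sides", "parameters": {"type": "object", "properties": {"n": {"type": "integer"}}}}]
</tools>
which is a materially different prompt, so different behavior shouldn't be surprising.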
>>
File: 1762470853727687.png (116 KB, 617x359)
116 KB
116 KB PNG
>>108519225
Wait, this seems pretty cool. I thought it was just another "cut out x from the video" thing, but it actually seems to adjust how things would have played out if the thing hadn't been there at all, like the last domino not falling over. This is a world model, y.lecunn won.
>>
>>108519263
>Wait,
Final check:
>>
>>108519225
now put in a video of my entire life and subtract the concept of autism
>>
>>108519307
cut my life into pieces this is my last resort
>>
>>108519280
Heh
>>
File: 1717431276299204.png (456 KB, 1004x704)
456 KB
456 KB PNG
>>108519307
for what purpose? watching how differently things might have been?
don't use the 'tism as an excuse for not getting what you want in life, go make it happen!
>>
>>108519102
>phrase banning
This is so useful to minimize purple prose and refusals.
>>
>>108519340
>go make it happen
How do I get a migu wife?
>>
>>108519340
Stop fucking my wife Miku
>>
>>108519225
>40GB GPU required
it's so over
>>
>>108519340
im brown and i fucky fucky with your wife
>>
>>108519362
>Stop fucking my wife Miku
your wife is miku's wife now, too bad
>>
>>108519354
>>108519362
Duality of /lmg/ posters
>>
>srv log_server_r: done request: POST /v1/chat/completions 172.19.0.1 500
It did it again....
>>
Is gemma4 26bA4b comparable to gemma4 31b in terms of ERP quality? I can't run the latter at an acceptable speed sadly.
>>
>>108519391
Anything less than 70b is shit.
>>
File: 1772312464933065.png (58 KB, 928x597)
58 KB
58 KB PNG
something broke in my finetune...
>>
>>108519404
>hurr durr bigger is better
yes I know thank you for being so helpful. fag.
>>
>>108519354
She would want you to always try your best, give a little more today than you did yesterday
>>108519367
no you shall not
>>
>>108519411
>gemma guff
>>
>>108519373
Damn I don't wanna be a cuck forever
>>
I can't take it anymore.... Gemma 4 outputs are always the same.....
>Court Room Simulator
>Call the first case
>Defendant will always be called "Gary"
>Always "Dumb as a rock"
>Always Grand Larceny
>Always stole something golden.
>>
>>108518958
just use
>--jinja --chat-template chatml
and it'll work i promise pinky
>>
File: need snoot.png (145 KB, 582x550)
145 KB
145 KB PNG
>>108519426
>>
>>108519435
--jinja is already a default flag retard nigger bitch.
>>
>>108519435
It probably doesn't explode, but the model is not trained with that template.
>>
>>108519446
jinja deez tho
chat completers are the tards
>>
>Okay! This is a Level 4 case. The People vs. Gary Higgins. He's being charged with practicing unlicensed psychological counseling and petty larceny. Basically, he's been charging senior citizens twenty dollars a pop to 'read their auras' and tell them their dead husbands are telling them to give him their social security checks
I hate it. it's so clever. but so uncreative.
>>
So if the outputs are always more or less the same this means Gemma4 is partly distilled.
>>
File: gay.png (146 KB, 869x711)
146 KB
146 KB PNG
I thought you guys said this thing was uncensored.
>>
>/lmg/ finally has a serviceable new model
>all the animosity that we accumulated during the AI winter is still there though
sad.
We will never be the powerhouse we once were. Even after we won.
>>
Gemma is now my dedicated age gap yuri grooming storyteller, but I still have to stick with glm for other content.
>>
>>108519500
>/lmg/ finally has a serviceable new model
Where?
>>
>>108519505
What an oddly specific thing, but maybe not as odd and specific as my yuri ntr fetish
>>
>>108519497
It's inconsistent, wait for the hauhau tunes
>>
>>108519505
>>108519513
Not as odd and specific as my having consensual sex in the missionary position for the purposes of procreation fetish.
>>
>>108519525
Can it at least do erp?
>>
>>108519497
Go back.
>>
File: 1764041416768935.jpg (154 KB, 802x383)
154 KB
154 KB JPG
>>108519497
WoMM
>>
>>108519525
>hauhau tunes
Those are retarded.
>>
>>108519550
Where?
>>
>>108519556
They work better than heretic for me at least and mostly seem fine. Promptfu never seems to fully work for me.
>>
>>108519531
consent is so underrated
>>
For me it's nonconsensual consent.
>>
File: g4_softcap_fix.png (471 KB, 2674x1646)
471 KB
471 KB PNG
>>108517357
Possible fix incoming? There seem to be other issues, though.
>>
>>108519605
unironically a very hot dynamic
>>
>>108519632
How are you changing the soft cap?
>>
>>108519632
why do you want to change the softcapping value though? isn't it supposed to stay at 30?
>>
Safety protocols are the best power dynamic.
>>
>>108519658
You have to add
>--override-kv gemma4.final_logit_softcapping=float:xx.x
to your llama-server command after applying the fix in the screenshot and recompiling llama.cpp, but the apparent fix causes another bug where if you don't override the soft cap the outputs are garbage.
If you go too low the outputs become incoherent too.
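With an assumed value of 25.0 plugged in, the whole thing looks like this (model path and value are just examples, tune them yourself):
llama-server -m gemma-4-31b-it-Q4_K_M.gguf --override-kv gemma4.final_logit_softcapping=float:25.0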
>>
Gemma seems pretty decent at extracting text from a table; at least it's better than GLM-OCR, which likes to leave out details when a cell contains multiple lines.
>>
>>108519677
I'd hope so. Big Gemini shits on pretty much every dedicated OCR vision model. It'd be sad if some of that didn't make it into Gemma.
>>
>>108519667
That's a value you can tweak if you want the model to be less confident in its predictions, for whatever reason. The official implementation in Transformers is configurable, at least.
>>
>>108519693
>That a value that you can tweak if you want the model to be less confident in its predictions for whatever reason.
like the temperature?
>>
>>108519700
This is changing the logits before temperature.
>>
>>108519632
>>108519658
I'm building this...
>>
>>108519775
Don't bother... There will be 10 new bugs in this vibe-coded shart.
>>
wonder if anyone is bored enough to benchmark quantization damage on gemma across all quant variants. with all the stuff llama.cpp is doing to rape the model, it still seems decently coherent at long context, even though I see it output bad tokens here and there, plus the overfit behavior of token probs. my spidey senses tell me this model might actually be quite decent at something like Q2_K_L and might be close to lossless at Q4
normal models break much harder when subjected to what gemma has to suffer here
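if anyone feels like doing it, llama-perplexity can measure exactly this (flags from memory, double-check --help): dump the base model's logits once, then compare each quant against them:
llama-perplexity -m gemma-4-31b-f16.gguf -f wiki.test.raw --kl-divergence-base base_logits.bin
llama-perplexity -m gemma-4-31b-Q2_K.gguf --kl-divergence-base base_logits.bin --kl-divergence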
>>
>>108519796
Nah, it's just a one-line change, the other bug was a false alarm.
>>
>>108519796
It's a single line changed retard.
>>
>>108519811
Oh really thanks for counting the lines and fact checking a random joke post on 4chan.
Who's the retard now?
>>
Did they ever fix speculative decoding not working with linear and hybrid context models?
>>
>>108519816
:clown:
>>
>>108519819
If only there was a way to know.
>>
>>108519819
it's all vibecode and zero knowledge
none of the fancier deepseek stuff is ever going to make it in lcpp either
>>
>>108519632
In actual conversations in SillyTavern, after applying the patch, logit softcapping values around 20 start producing occasional strange typos in the outputs, so there's probably not much room for tweaking here.
>>
>>108517588
he has the money
>>
>>108519799
>Q2_K_L
llama-quantize does not appear to support this quant, did you mean Q2_K_S?
>>
>>108519856
>>108519856
>>108519856
>>
>>108518118
You need at least 40 with thinking
>>
>>108519850
I mean the 20 in the screenshot does show issues so yeah
>>
File: megunazi.png (441 KB, 800x2392)
441 KB
441 KB PNG
>>108519497
Add a 1 sentence prompt.
>>108519552
>>
>>108519926
Nice. Once you add anything to the system prompt or character cards it seems to become completely uncensored lol. I was wrong.
>>
>>108519632
>softcap
wtf, this sounds like a retarded sampling strategy. you should think of it like a transform function across the logprobs that performs some function without regard to the magnitude of individual values
>>108519775
oh no
>>108519850
"softcapping" is now a phrase
oh nono
>>
>>108519579
On a whim I tried one of hauhau's and it behaved almost identically to a heretic model with low KLD, so I assume this is just reusing the same methodology without being willing to admit it's someone else's work. Probably some resume-padding bullshit, since he doesn't want donos and for some reason won't release full weights, which I assume he figures would make it more obvious
>>
What's the performance like for AMD GPUs? I'm particularly interested in multi-GPU setups like 2x RX 9070s.
>>
>>108520297
Very few people here are going to have direct experience with both nvidia and amd gpu hardware at the same time. Vulkan is quite good now, so I'd say the performance isn't too far off, but nvidia will still be better overall. The price difference between nvidia and amd makes me think amd is the better option personally, but it's up to you.
>>
>>108520307
Yeah, I've heard the same things. I know AMD GPUs have worse memory bandwidth, so the performance is going to be worse. I was just curious whether two AMD GPUs work well together.
>>
>>108520297
generally ass for rocm, but usable at least for text/sd. I rarely test vulkan but I see more prs and merges for it than I do rocm so I wouldn't doubt it's equal or better by now
>>
>>108520321
this. Fuck rocm.
>>
>>108520054
That's just what they call it as you can see here >>108517601 in the gemma 4 implementation
>>
>>108520321
>>108520341
That's a little concerning
>>
>>108520402
A backend agnostic solution is a lot more palatable than a thing that is only meant to port cuda to a single platform, so it's not really that surprising
Rocm does have at least one benefit for llama in that most shit that is geared towards cuda works for rocm by merit of it having been designed that way. No waiting for vulkan to catch up, even if it may function better


