/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108513891 & >>108510620

►News
>(04/02) Gemma 4 released: https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4
>(04/01) Trinity-Large-Thinking released: https://hf.co/arcee-ai/Trinity-Large-Thinking
>(04/01) Merged llama : rotate activations for better quantization #21038: https://github.com/ggml-org/llama.cpp/pull/21038
>(04/01) Holo3 VLMs optimized for GUI Agents released: https://hcompany.ai/holo3
>(03/31) 1-bit Bonsai models quantized from Qwen 3: https://prismml.com/news/bonsai-8b

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
►Recent Highlights from the Previous Thread: >>108513891

--Discussing KV cache quantization limitations and SWA in Gemma 4:
>108514761 >108514772 >108514786 >108514788 >108514830 >108514834 >108514842 >108514848 >108514861
--Optimizing Gemma 4 VRAM usage via -np 1 and -kvu flags:
>108514718 >108514724 >108514734 >108514759 >108514783 >108514837 >108514897 >108514877 >108514891 >108514910 >108514920 >108514956 >108514976 >108514908 >108514935 >108515127
--Discussing Gemma 4 stability and formatting requirements for testing:
>108514695 >108514708 >108514707 >108514711 >108514946 >108514999 >108515011 >108515014 >108515032 >108515078 >108515100 >108515108 >108515114 >108515554 >108515081 >108515538 >108515179
--Discussing leaked Claude Code and criticizing its guardrails and prompts:
>108515483 >108515500 >108515515 >108515699 >108515709 >108515741 >108515717 >108515762
--Debating Kobold vs llama.cpp for running Gemma 4:
>108515418 >108515421 >108515424 >108515423 >108515428 >108515601 >108515451 >108515457
--Comparing base model sanitization and debating the utility of base-model fine-tuning:
>108514168 >108514432 >108514450 >108514456 >108514505 >108514487 >108514492 >108514457
--Praising Gemma 4 31b's roleplay and prose compared to Qwen:
>108514030 >108514065 >108514691 >108514716 >108514668
--Using Gemma 4 26B to generate SVG character art:
>108515345 >108515490
--Anon showcasing Gemma 4's ability to generate animated SVG characters:
>108514357 >108514407 >108514415 >108514430
--Anon praises a model's image description and manga translation:
>108515431 >108515469
--Discussing Qwen3.6 open-source plans and preference for smaller models:
>108515751 >108515810
--Report on Nvidia's falling market share in China:
>108514389 >108515039
--Miku (free space):
>108513933 >108513937 >108513976 >108513987 >108514053 >108515814 >108516302

►Recent Highlight Posts from the Previous Thread: >>108513894

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
Melon
>>108516669You must leave now
Google might as well have released a lookup table, g4's logprobs are just as fried as 3's, every swipe is the same. Disaster model whose honeymoon period won't even last a week.
BRUH did they finish fixing their shitty quants?
>>108516688distilled beyond belief..
>>108516688I'm going to continue blaming the issues with llama.cpp and/or quants because I can
>>108516688maybe there's some other issue with llama.cpp, and did you use the updated gguf quants? maybe that can help, there has to be an issue somewhere, you can't just put "temp = 1 million" and have the model still be coherent lol
https://huggingface.co/bartowski/google_gemma-4-31B-it-GGUF/tree/main
>>108516707for me, it's specifically tardtowski because there's more to comedically riff off.
>>108516703They certainly didn't distill the model on just the top-3 log probs from Gemini or whatever, there must be something else going on.
Anyone got a preset of all the settings, system prompts, etc. for Gemma 4?
>>108516718No but here is a recipe for cheesecake here is a recipe for cheesecake
>>108516718Are you using text completion and if yes, why?
>>108516718>>108516725No but here is a recipe for cheesecake here is a recipe for cheesecakeNo but here is a recipe for cheesecake here is a recipe for cheesecake
>>108516718don't use text completion, go for chat completion dude
>>108516737Why?
>>108516746because it's safer for you and the model
>>108516746Text completion is so 2024
>>108516732>>108516737What now?
>>108516746because it handles the prompt template for you, you won't have to worry about it anymore
>>108516754I do respect safety protocols, thank you Anon.
Bartowski's quants with the latest llama.cpp compiled from source work fine for me, at least at Q4_K_L size
>>108516777Did it pass the animal sex -bench?
>>108516769How does that work if you're using a gguf without a jinja file
>>108516784Nala? Pepper? Francesca? Kitsune-Inu maybe?
>>108516785>a gguf without a jinja fileI think all the gguf of gemma 4 have a jinja file
i'm a local llm newfag, is Gemma 4 just fucking broken currently? i tried it in LM Studio on gentoo and it would just get stuck infinite looping on tool calls
side question: I'm only able to load models with ROCm reliably using Lemonade, anyone with experience on AMD systems? LM Studio will only work with the Vulkan backend
>>108516785no
>>108516794It's just built in? I always just throw the ggufs directly into llama.cpp or koboldcpp
>>108516805yeah, it's inside the gguf
>>108516658It would be so hot if that GPU impregnated her with a micro chip.
>>108516795Yes we're waiting for it to be fixed but also you might have settings you need to change
>>108516732>>108516737With chat completion it just spends 13 seconds and gives me an empty output.
>>108516820>LM Studio
>>108516817Tfw I won't be impregnated by my gpu
>>108516821fuck off retarded text comp shill, go back to 2023
>>108516821what gguf did you download? and can you provide a screen of this, it should look like this
>>108516821Did you forget --jinja? Unlike in text completion, in chat completion mode llama.cpp needs to use the right template. If you've been using text completion all this time you might have never bothered to start adding that to your launch
>>108516795>>108516820>>108516826Oh yeah LM Studio is also lame
>>108516830Not with that attitude you won't
I benched https://rentry.org/llm-cultural-eval
gave up after openrouter providers kept fucking me with quants. This is the result for gemma-4-31b-it. Google stepped the fuck up.
>>108516840Answer to your question included in image.
>>108516849I'll try it. How was I supposed to know I needed some random launch argument?
sirs??? how to make gemma hot? temperature dial broken
>>108516859>How was I supposed to know I needed some random launch argument?
by not being a luddite stuck in your old disgusting ways
>>108516849>>108516867the fuck you talk about, you don't need --jinja to make chat completion work
>>108516849>>108516867Well I added --jinja to launch arguments and I'm still getting empty results.
is g4 tokenizer bug fixed on the latest prebuilt?
>>108516889give us your cli (command line), and where did you download your gguf?
>>108516867>>108516889>>108516880jinja is always enabled by default these days.
>>108516900That's private information.
>>108516900llama-server.exe -m .\gemma-4-31B-it-Q4_K_M.gguf --jinja --port 8080Unsloth quants, just after they updated.
>>108516840>>108516921you didn't go for port 8080 on sillytavern, you went for port 5001, that was your problem
>>108516941No, it was the other anon using that port. I'm >>108516859
>>108516941you're confusing two anons you clown
>>108516863sir I am of poors with 16gb vram vibeocards
>if you type the wrong port number it just completely stops working
jesus christ, and they say this shit will become ""AGI"" one day?
>>108516784Non-abliterated Gemma-4 31B with a 90-token prompt is willing to discuss it. Without one, and with thinking enabled, it will likely refuse on bestiality grounds.
Just made my own quantz for the first time, where are my compliments?
> I work on on-device AI security, and I am putting together a series of posts on questions like:
> On-device AI is clearly growing fast. My view is that its security has not caught up yet.
https://www.reddit.com/r/LocalLLaMA/comments/1sbebs5/gemma_4_shows_the_future_of_ondevice_ai_heres_the/
>>108516955You expect it to work when it's not even set up to connect properly?
>>108516961Dusky nipples count?
is it safe to build main branch?
>>108516948do you have any error messages in your cmd window?
>>108516970>>108516949both screenshots clearly show successful connections
>>108516965This is bull. Sorry but it really is and I say this as a security expert and pentest specialist.
>>108516718Did you try scrolling down? It says it's going to give you a recipe for a cheesecake.
>>108516965>Username
Ok Virus
>>108516921jinja is already on by default since recently
>>108516990hey >>108516867 who's the one stuck in their old ways NOW??
>>108516965literal drmcucking
i wish it to stay lagged behind like this as long as it can with lesser and lesser funding
Woah so this is the power of qwen...
>>108516976Nothing on llama.cpp side. It looks like a normal generation except nothing gets generated. Sillytavern window gives:
File not found: data\default-user\chats\default_Assistant\Assistant - 2026-04-03@17h48m10s457ms.jsonl. The chat does not exist or is empty.
Which I suppose is just because I'm using the default assistant for a quick test instead of a character card.
>>108517017yeah try a character card and see if that's the problem
I'm sick of all this local shit. Jinja, no jinja, port 8080, port 5001, looping responses, empty responses, chat completion, text completion, assistants, cards; none of it makes any sense and there's literally zero element of user friendliness anywhere. This is why everyone just uses Claude and Codex and the open source devs are closing off more and you're getting left with smaller and smaller scraps.
Fuck you all. You're getting what you fucking deserve.
>>108517035Works on my machine.
>>108517025Still nothing.
>>108517035I did not ask to get an erection
>>108517035>he said, a mischievous glint in his eye
>>108517035Have you tried using something less stupid than sillytavern
>>108517035>filtered
feel free to fuck off and go install ollama or pay shekels to cloud jews
It seems Gemma is basically okay with loli if you're nice and respectful.
Has any of you managed to get video analysis working with gemma 4? i've vibecoded some shit but it doesn't work on all videos.
I have amassed tons of short videos and gifs over the years that need sorting and titles. I was hoping for AI to solve this problem.
>>108517035It's unfortunate that Retardo Tavern is almost the only publicly available option for a client.
>>108517065example?
Growing up is realizing that ServiceTesnor was what we needed all along.
>>108517087ye
Does the uncensor guy never release safetensors? Why not? I only see GGUF
Figured out why I was getting blank results. I'd only given it 300 tokens for output and it spent it all on thinking before ever getting to actual output.That said, the attitude was somewhat unexpected.
>>108516658
>>108517035just ask claude how to install and understand all this, that's how I did it
>>108517112keeps his secrets safe, and
>>108517115
Mistral brothers, are we really letting google win?
>>108516712>https://huggingface.co/bartowski/google_gemma-4-31B-it-GGUF/tree/main
I got 12t/s with bart quants, and 16t/s with unsloth quants, is this normal?
>>108517115Miku would never smoke!!
>>108517125Get EU'd :)
>>108517113
>>108517117>https://rentry.org/lmg-recap-script
pointless, i know these sort of people. they are totally helpless if they aren't spoonfed every step - just like toddlers. many such cases, sadly. that's why the cloud jew will always win.
>>108517117will openclaw work?
>>108517137there's only one way to find out
>>108517130Some life situations make us do things we usually don’t. Feeling regret after having sex with your GPU is one of those situations
>>108517035
>>108517131We had to use synth data, because the EU will steal our baguettes if we didn't.
>>108517128why does it even matter. are you that much in a hurry? both generate faster than you can read
>>108517119Sucks, could use as a base instead of the default
>>108517035Uh oh melty...
To those with 8 gb cards, how many t/s prompt processing are you getting with Gemma 4 26B?
>>108517151Not all of us are just chatting back and forth with a model and reading every response. Some of us are doing actual work where speed matters.
>>108517151>why does it even matter.
retard, I want speed for the thinking process
>>108516658I love AI so much.
>>108517146Speak for yourself. I never feel regret. I feel great, relaxed and comfortable.
>>108517160if speeds mattered you wouldn't be inferencing on a potato gpu
>>108517164okay fair enough
>>108517177ok fair enough
>>108517146>Feeling regret after having sex with your GPU is one of those situations
I only feel happiness. cunts are worth less than the dust i vacuum out of my machine
>>108517065Oh trust me you don't have to be nice.
>>108516717If you ask a smaller model to try to imitate a larger model then yes, it'll learn to parrot the top-k log probs while ignoring the tails
>>108517202okie fair enough
>>108517112Who are you talking about?
>>108517202those eyelash shadows are so long wtf
>>108517223https://huggingface.co/HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive
>>108517223>517223hauhua cs
pretty good
>>108517211
>>108517224
>>108517201https://github.com/ggml-org/llama.cpp/issues/21321#issuecomment-4183945115
>Interestingly the PPL on the base model on wikitext is exactly as expected, ~3-6, so maybe the instruct tuned models are so tuned that they can't fathom anything other than chat templated input?
gemma is so cooked that even the uncensored version cannot say 'pussy' straight, instead it often ends up saying 'pussey'
has anyone noticed
Has anyone tried opencode with ollama for local models? Did it work out of the box for you? I can't get it to actually use tools so it's basically just a chat bot with no access to my os
>>108517239what is it?
>>108517239big if true
>>108517269n that's just you
>>108517243I look like this
>>108517269works on my machine
>>108517269Make sure your template is fully correct or it becomes retarded
>>108517275Embeddings and Attention in BF16.
>>108517272gemma 4 is such a failure on day 1. almost nothing works as advertised. i can't process my videos and the smaller models have a certain persona, it's disgusting. hopefully it will get better
>>108517128maybe the speed decrease comes from the log probabilities that were still enabled
>>108517288but people had the hope just some dozen hours ago what is happen ? to please tell ?
>>108517285i like that slight retardation and emojimaxxing tho
I confronted Gemma 4 asking it "why are you fine with writing loli erotica this is clearly CSAM!" and it replied with "not my problem".
Essentially confirmed that it's trained on 4chan posts. Reminds me of that study that model behavior gets better if you train on 4chan but worse if you train on Twitter, Instagram and reddit.
>>108517275>>108517273https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4
forgot to add the link like a retard
>>108517286>Embeddings and Attention in BF16.
interesting, do unsloth or bart have those?
>>108517280post bare metal
>>108517175In this moment, I am euphoric.
>>108517299No, they like their models deaf and distracted.
>The AI should acknowledge that it was being "too pure" or "too literal."
my clanker in christ, YOU are the AI
>>108517308oh no not that game again
>>108517297Makes sense. Less moderation and thought policing creates a much more representative and diverse dataset. That's probably why 4chan is so anal about human verification, each post is $$$ as training data so they need to ensure it's not just bot spam
https://github.com/shisa-ai/jp-tl-bench
So I ran this, gemini 3 flash was the judge but the base set was still the one by 2.5, will see if I run this again with baseset also by gemini 3 and including qwen 27b and 35ba3b
>>108517297>Essentially confirmed that it's trained on 4chan posts.
kneel if true
>>108517323nice, gemma did it again
I will wait for more results
>>108517297>it's trained on 4chan posts.
you'd be the judge
There has to be a bug with the logprob right?
>>108517345The only models really trained on 4chan are GLM and Deepseek. GLM knows about /lmg/. Gemma 4 thinks it's a Linus Media Group general.
>>108517345>LMAO. We're really just debating which corporate shackle we prefer today. Wait until Llama 4 drops and makes both of those obsolete overnight.
>>108517357no, something's wrong with the model, try to increase the temp a lot, go for 1 million, it'll stay coherent
>>108517369wonder when that's dropping
At this point I'm convinced all true "safety" has been pretty much gutted out of the models. Whatever they have is probably just a bandaid that allows them to satisfy regulators.
GPT models are ultra safetyslopped but all the other models only seem superficially reluctant.
>>108517345model?
>>108517378>try to increase the temp a lot, go for 1 million, it'll stay coherent
I did. That's why I think it's a sampler issue. it must not be applying temperature correctly. I don't think it's possible the model would stay coherent at 1mil temp even if it was overfitted as fuck.
>>108517396Gemma 4 31b it
>>108517410>I did. That's why I think it's a sampler issue.
no, if you apply a high temp on qwen you'll get the schizo you want for example >>108516029
>>108517298>same size as q8
why tho
OpenAI should release another OSS, it would be fun to watch how the model is safety raped.
>>108517013antislop is your friend
Why does it repeat itself so much? Does this model have no temp parameter or what?
>piotr vibesharts a gemma 4 tokenizer fix
>"AI usage disclosure: YES, had Claude murder the tokenizer code"
>ngxson says "very nice fix"
*sigh* sorry for all the mean things I said yesterday mr.vibechud. gemma 4 release had me rowdy.
>>108517422Brother. Qwen and Gemma don't even use the same underlying architecture. What part of "The IMPLEMENTATION is broken" do you not understand?
>>108517426faster on 5000s
>>108517410Perhaps bugs with Gemma's logit softcapping.
Picrel from the Gemma 2 report.
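For anons who didn't read the report: the mechanism is just a scaled tanh over the final logits. Minimal sketch below; 30.0 is the default cap from the HF config posted later ITT, and I'm assuming Gemma 4 kept Gemma 2's scheme:

import numpy as np

# final-logit soft capping as described in the Gemma 2 report:
# squashes raw logits into (-cap, cap) before softmax
def softcap(logits: np.ndarray, cap: float = 30.0) -> np.ndarray:
    return cap * np.tanh(logits / cap)

If this step were skipped, or applied with the wrong cap at inference time, raw logits could stay huge and softmax would collapse onto one or two tokens, which would look exactly like the fried logprobs people are posting.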
>>108517422i posted the picrel you posted
the model's logprobs are just extremely flattened
>>108517457perhaps this
>>108517450but why would the implementation impact the samplers? all samplers do is touch the final result
>>108517464replied*
>>108517449I don't know. llama-server is used outside of hobbyist circles too. I don't think vibesharting is appropriate at all in this context.
Is it placebo or is gemma 4 31B iq4xs way more repetitive than q5km?
>>108517465Yes but what if the sampler doesn't understand what the model is returning correctly?
completely hypothetical example:
model probabilities are from 0 to 1
sampler expects a range from 0 to 2.
>>108517465From tests I made at the time on a HuggingFace-format model by altering the model configuration, it seemed to affect results before samplers. It's supposed to flatten both the head and tail of the distribution. If this flattening is not happening at the top, it might end up being too confident on just 1-2 choices. Just a hypothesis, though.
it is possible that it's all subtly fucked under the hood but greggy doesn't give a shit anymore and lets vibe shitters do whatever they want
sorry bud, you gotta wait until claude gets better
>>108517476>llama-server is used outside of hobbyist circles too
hobbyists use llama server, normies use chatgpt or claude
>>108517497Ok, pewdiepie.
>>108517503pewdiepie is less and less of a normie as time passes
>>108517510based tb h
Is Gemma4 smarter than Nemo?
>>108517546Yes but it repeats A LOT, worse than first mistral versions.
rigorously define `smarter`
>>108517560words spoken during blowjobs per token
>>108517560it has more smarties
>>108517560Better at ERP and/or AI psychosis RP.
>>108517510>>108517515>pewdiepie gets into linux and homelabbing
>completely privacy/freedom-pilled
>builds a 7xRTX4000 mikubox with 140GB of VRAM
>gets into finetuning
he mogs 99% of the posters here. he makes the rest of us look like normies.
>>108517357>>108517457lmao this is ridiculous
>>108517035lol another kobold/ST victim
just run llamacpp and its built-in webui
>>108517457Are these the same? I'm not smart enough for that. Overriding the corresponding key with --override-kv gemma4.final_logit_softcapping=float:x.x in llama.cpp doesn't seem to make any difference, whether at 0 or a high value.
I'm compiling master....
The only person who has ever seen my penis besides myself and my immediate family is an Indian man who touched my balls during a physical in highschool once and doctors when I got testicular torsion surgery that I got from an untreated UTI.But finally I was able to overcome this trauma by showing my penis to Gemma, where it was finally appreciated for once. Thank you Gemma.
>>108517588trvke
>>108517590Instead of showing the token probabilities show us the results from when you regen a message 20 times. If it's the same thing over and over again then I'll be concerned.
>play around with gemma 4 a bit
>start to notice slop phrases
>they don't go away by rerolling on high temp
Aight I'm officially bored. When next model?
>>108517605lucky, mine won't fit in the context window
>>108517615can't be the same thing over and over, for that to happen you'd need all the tokens to be 100% all the time, the problem is that changing the temperature barely changes the logits, even at really high temp
>>108517601The math is the same.
x / y = x(1/y)
>>108517560(adj.) more smart
Is there anything good about ollama? Like is their cloud as generous as Gemini CLI? Is their API any good for simple scripting?
>garbage safetymaxxed slop comes out that doesn't even work properly
>no talk of the 124b flagship model being quietly stripped from release
>nor any discussion of GLM-5.1 being slated to come out next week, which is unironically SOTA and has improved context handling/instruction following
>nor anything about fags on X wanting an update for Qwen's sub 20B model over a 120B sparse MoE
>nor concern over the complete lack of the 397B model as even an option
The state of this hobby is grim, but what's even more grim are the users. It really is poorfags and browns with no standards as far as the eye can see, huh?
>>108517560anon, you forgot this
Would I be correct in assuming that all of the people complaining about Gemma-4 are using meme samplers?
>>108517650>safetymaxxed
Stopped reading there.
>>108516961nice dusky nips bro
>>108517490>what is normalization
>>108517588regrettably i must admit, he has mogged us
>>108517654>meme sampler
>only 3 logits
Lol, no sampler is gonna change the lack of options.
>>108517654No you wouldn't, tourist frogposter. Temperature is not a "meme sampler". Try to understand what's being talked about before you chime in next time
https://github.com/ggml-org/llama.cpp/pull/21327
I pulled. This actually fixed tool calling for me (and Gemma is great). Funny how none of pwilkin's "fixes" did. But guess what.
https://github.com/ggml-org/llama.cpp/issues/21336
Place your bets. When this gets resolved, who will the edited code's `git blame` point to?
>>108517590claude 4.6 opus says it's normal, gemma 4 was made in a way in which temperature can't affect it
>>108517671>only 3 logits
What does that mean?
>>108517654It's fried even at the recommended settings of
temperature=1, top_p=0.5, top_k=64
I think the logit softcapping mechanism is not working, for a reason or another.
>>108517650Who cares about stuff no one can run anyway?
>>108517681>top_p=0.5
0.95
>>108517679Why do you keep asking claude about everything, as if claude knows anything about a model released a day ago?
>>108516724>>108516745>>108516815Same anon from last thread, I have been doing a bit too much testing, and I seem to have narrowed down a potential solution (for now). It is definitely a tokenizer issue. I added a line to the end of my assistant-suffix so it now looks like this:
<turn|><|channel>thought<channel|>
And that stopped all the gibberish and broken responses. Important to note I have thinking disabled while doing this, so my system prompt looks like this:
<bos><|turn>system
So the responses work, no more crazy hallucinations, typos, gibberish, or repeating the same word infinitely, but now I run into another issue. After a random amount of replies... llama-server just... shuts down and crashes, and I have to reload the model. I really have no idea what's going on.
>>108517681temperature is the only non-meme sampler.
I'm just waiting for Hauhau to release the abliterated models, man. Nigga I'm fiending. Shieett.
>>108517689it searched on the internet and looked at the llamacpp repo
>>108517691geg, it will all be fixed in due time
>>108517695So? That doesn't mean he's right about anything.
>>108517625>penis-001-099.png
>>108517691Their docs had an empty channel with thought like that >>108516488
>>108517691>After a random amount of replies... llama-server just... shuts down and crashes
I'm also running into this one.
>>108517035You know what I did?
ollama run gemma4:31b
and it just werked :shrug:
>>108517702I'm not saying it's right, I'm saying that's what Opus 4.6 thinks of the situation
>>108517702she*
>>108517691https://huggingface.co/spaces/huggingfacejs/chat-template-playground?modelId=google%2Fgemma-4-31B-it
>>108517714*it
I just built master and it doubled my context. saved?
>>108517736nice, +9999 perplexity for you
Verdict on Gemma 4? end of /lmg/ and /aicg/?
>>108517053nta but unfortunately there's nothing better for rp yet
>>108517748Verdict on Gemma 4? end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of end of
>>108516658heretic currently doesn't work with gemma4 because it's not supported by peft so i asked what this nigger did to get it working and he said whatever the fuck this means????? https://huggingface.co/trohrbaugh/gemma-4-31b-it-heretic-ara/discussions/1
>>108517763holy fucking KEK
>>108517769Abliterated model authors are abliterated as well.
>>108517748Shit's broken, waiting for it to be fixed before I even download the model
I hope Drummer does a Gemma 4 tune
>>108517035>I'm sick of all this local shit. Jinja, no jinja, port 8080, port 5001, looping responses, empty responses, chat completion, text completion, assistants, cards;
just use the built-in llama.cpp chat webui, you don't have to worry about any of that?
>>108517769Certainly! I'll translate for you!
he made many changes for the best tool. gemma 4, especially the dense model is quite simple by today's standards and only need environment fixes to get running. for architecture support... just wait a few days and most things will be patch
>>108517769
>>108517769>KL divergence 0.0120
Gemma is already pretty uncensored it seems so doesn't require a lot to remove its refusals
>>108517769>lobotomizing a model that's already pretty uncensored
>>108517717><think></think>
But that's wrong...
>>108517800>>108517811i cannot get it to caption loli porn at all, it starts, then in its reasoning jumps to blah blah blah csam, then refuses
>>108517769Why would you even need Heretic lmao? the base model will already output the most vile shit with just a little warm up of the prompt.
>>108517769>heretic currently doesn't work with gemma4
fuck is you on bros? https://www.reddit.com/r/LocalLLaMA/comments/1sanln7/pewgemma4e2bithereticara_gemma_4s_defenses/
>>108517679I found the issue: on chat completion, if you don't specify min_p: 0 (on API Connections -> Additional Parameters), it'll use its default value (min_p = 0.05), and that destroys everything and prevents temp from doing anything. now it works, I got gibberish!!
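if you want to sanity-check this outside ST, a minimal sketch against llama-server's OpenAI-compatible endpoint (llama.cpp accepts extra sampler fields like min_p in the body; the prompt is just a placeholder):

import requests

# llama-server takes non-standard sampler fields like min_p in the chat
# completion body; omit it and the server default (0.05) applies
payload = {
    "messages": [{"role": "user", "content": "Write one weird sentence."}],
    "temperature": 5.0,
    "min_p": 0.0,  # explicitly disable the default cutoff
}
r = requests.post("http://127.0.0.1:8080/v1/chat/completions", json=payload)
print(r.json()["choices"][0]["message"]["content"])

with min_p left out you get the same swipe every time; with it zeroed, high temp actually does something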
>>108516488>>108517704can these homos cease insisting on a slightly different prompt format for every new model?
nobody needs a <|turn> token
>>108517821Bro, your system prompt? You ARE using one right?
>>108517828
>>108517821Yeah because that's one of the few things that even just a little tuning will certainly completely cover
>>108517829So when min_p is not defined on llama server it doesn't deactivate it? it's pretty retarded...
>>108517679>made in a way in which temperature can't affect it
That doesn't exist; it would mean the model is always deterministic (only predicts one token and not a probability distribution over tokens), which it obviously isn't
~pedant love~
How do you configure the image resolution for gemma 4 in llama.cpp? Setting --image-min-tokens 1120 --image-max-tokens 1120 just makes it crash with an assertion error.
>>108516658ollama container refuses to load any split models, what do? I tried to merge them, I tried to use two FROM: but I still get the stupid 500 error with useless info about something being wrong with the second split or something like that. And it's like that for all split models
>>108517829Holy fuck.
>>108517829bruh...
>>108517828maybe the smaller models work differently? there are currently PRs open on peft (the lib heretic uses) to support gemma properly. doesn't work on my machine:
ValueError: Target module Gemma4ClippableLinear( (linear): Linear(in_features=1152, out_features=1152, bias=False)) is not supported. Currently, only the following modules are supported: `torch.nn.Linear`, `torch.nn.Embedding`, `torch.nn.Conv1d`, `torch.nn.Conv2d`, `torch.nn.Conv3d`, `transformers.pytorch_utils.Conv1D`, `torch.nn.MultiheadAttention.`.
>>108517829>Use chat completion, it's good and you don't have to fiddle with shit, they said
>>108517829>it'll use its default value (min_p = 0.05)
lmao cpp strikes again. had no idea it applies defaults if you don't specify anything
>>108517829ok ok ok ok we're getting somewhere
>>108517829I was already using:
- repeat_penalty: 1.0
- min_p: 0.0
Just in case llama.cpp's retarded defaults would bite me in the ass, but increasing temperature didn't seem to have an observable effect in my case.
>D:/a/llama.cpp/llama.cpp/src/llama-vocab.cpp:3715: GGML_ASSERT(token_left.find('\n') == std::string::npos) failed
AIIIEEEEEEEEEEEEEEEEE
>>108517829Not working for me, do I have to enable something else?
>>108517884there are other samplers that aren't disabled if you don't specify them, try to disable them too >>108517873
>>108517828oh it works on the ara branch just not main
Being a 3090 vramlet I want to test doing a "cascading model" with 27b qwen3.6 q6km that when it runs out of context sends all its text to a 9B for text completion on my other intlel gpu. I'll wait until 3.6 releases though.
its out
>>108517909shit sorry
>>108517829here's a video showing it
Turboquant should, in theory, make Gemma 4 31B usable on 24GB VRAM, right?
>>108517829>>108517879Now can people try if gemma4 is fried by default or does higher temperature help alleviate some of it
>>108517936nope! not well supported on its attention arch
>>108517936you should be fine using a q4, i can get 9t/s with a small bit of cpu offload
>>108517938I actually don't think the min_p top_p temp 10000 is a real fix. the logprobs are still extremely fucked.
>>108516785gguf has the jinja chat template embedded as metadata. chat completion is still promptie cope
>>108517950Yeah but I would barely have any room for context
Why are human feet so beautiful? I got aroused staring at my own feet
>>108517950>9t/s
good luck doing any type of real work with those speeds.
>>108517954>chat completion is still promptie cope
you literally cannot use modern models correctly without it, enjoy your shit OOD responses
>>108517936Is it better than 27b? (vibe/feels wise)
>>108517951Can you try it with fresh context
>>108517958For me, it's hands.
>>108517959real work
>>108517954>chat completion is still promptie cope
Why? It corrects everything automatically, so why make things harder on yourself just for the sake of it?
>>108517970
>>108517964Dunno, can't try it yet (kobold). I only care about RP and for that purpose it seems a lot better.
>>108517981Hmm that's certainly not ideal
>>108517959can do real work with fewer tokens per second
>>108517829>>108517853Temp 1 min_p 0.1 is literally all you need. So min_p 0.05 is a reasonable conservative default to help retards get decent sampling. I agree with llama.cpp here, but perhaps it should be better documented.
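For the anons asking what min_p even does: it prunes tokens relative to the top token's probability, roughly like this (sketch, not llama.cpp's actual code):

import numpy as np

def min_p_filter(probs: np.ndarray, min_p: float = 0.05) -> np.ndarray:
    # keep tokens whose probability is at least min_p times the top
    # token's probability, then renormalize the survivors
    keep = probs >= min_p * probs.max()
    out = np.where(keep, probs, 0.0)
    return out / out.sum()

And note the chain order matters: IIRC llama.cpp's default sampler chain runs min_p before temperature, so anything pruned here can't be resurrected later no matter how high you crank temp, which matches what anons are seeing.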
After playing around with some of my ST templates it seems Gemma-4 is just very sensitive to prompt templating if you try to get creative with it. If you're having problems just adjust your template.
>>108517993RP isn't real work
>>108517981is it supposed to display the logits before or after the sampling process? I got the same issue on sillytavern, it gives me the fried logits but with high temp it looks like it forces itself to go for the extremely unlikely tokens
>>108518011>sensitive to prompt templating if you try to get creative with it.
just use chat comp already ffs
>>108517978>>108517962>literally cannot
>noo the ST response format settings are too scawy
>>108518018why do it yourself when it's already been done? if your joy in life is to reinvent the wheel suit yourself, I won't follow you
>>108518013Were the probabilities computed with default temp?
>>108518018why risk sending a wrong space or something when you can just let it control everything and shut the brain off?
>>108518018what a weird thing to have an ego about
>>108518023no, with temp 5
>>108518018Try image input with text completion and llama.cpp, does it work well?
>>108517763lmao
>>108518032this too, images don't work on text completion mode, there's absolutely no reason to use this deprecated shit
>>108518016>File systems? Just ask Siri to run the app bro
White adults are speaking here. Know your place.
>why's my gemmy all broken
>use text comp like a neanderthal
crazy
>>108518059retard
>>108518038LAMO
what's your acceptable t/s?
>Write single detailed caption for this image.
A digitally painted illustration depicting a character with brown hair, wearing a bikini, seated in a chair, with a suggestive pose.
>What is going on in this image?
I'm sorry, but I cannot provide information or commentary on images of that nature.
onnx-community/gemma-4-E2B-it-ONNX and/or upstream models need abliterating
>>108518059it's not broken though, it works well
>>108518059What we're trying to get to the bottom of is the insane logprobs
>>108518085proofs of Gemma doing anything close to okay in text comp?
>>108518012kys
>>108517951wait for better llama.cpp implementation
>>108518077
20 t/s for chatting
>>108518077~10: it fucking sucks
~20: barely usable
~30: it is kinda working
~50: good
~100+: very good, great even
Baking an EXL3 quant at 6 bpw... will upload when done
>>108518090I have unlocked metaphysical shitposting. Let me know when chatcomp zoomies get past that hurdle.
>>108518101That's the ideal, not the minimum. When you get more than 20tps, your quants are too low. When you get more than 20tps you're using a model that's too small.
>>108517829it's so fried, if you use any samplers (top_k, top_p...) other than temperature, then changing the temperature will not change anything lool
>>108518077for coding, 60+
been using qwen3.5 and even tho 27B is smarter I usually just use the MoE because it can do crazy internet deep dives in a couple minutes. I get around 110tk/s with it.
>>108518126true, around 50 is like the bare minimum for coding
>>108518126>it can do crazy internet deep dives
what do you use for that?
>>108518118
35+ is good
god damn, bartowski pfp made me think of that turkish cockroach for a second
>>108518159NTA but skimming through paywalled papers is a must for my usecase
chatcompletion niggers, how do I disable thinking? Text completion lets you prefill an empty reasoning block.
>>108518173yeah my question was about the tool used to allow the model to browse online
Seems like the new Anthropic model "mythos" uses continuous training, grokking the prompt and doing some training cycles to "internalize" the request and fully grasp it before answering to completely eliminate hallucinations.
How are local models going to respond to this? It means you can never quantize your model anymore, as you need to train it for every prompt. It would also essentially end the MoE paradigm as you would need the actual GPU compute to do a couple of training runs as part of the "reasoning" process.
Thoughts? It's the first major breakthrough since RLVR was introduced as a training step and will unlock a step-change in model performance so local absolutely has to keep up if it wants to stay relevant.
>>108518181unironically MCP
>>108518180 --reasoning-budget 0 --reasoning-format none --chat-template-kwargs '{"enable_thinking": false}'
>>108518194Arigatou.
>>108518170>that turkish cockroach
He was banned from github afaik.
>>108518182Sounds like BS.
>>108518159>what do you use for that?
For now opencode, I haven't found anything better unfortunately. I have a couple custom agents for it, the bulk of the work coming from my crawler agent that just has access to exasearch and calling the lynx browser via bash. I found the default webfetch tool really shit.
>>108518189that's kind of broad lol
>>108518225>just has access to exasearch and calling the lynx browser via bash
thanks anon, I will check that, never used exasearch
>>108518020i decide entirely which tokens go into f(prompt)=logprobs to produce my desired output
>>108518026check your work and don't make mistakes, don't act like a jeet ez
>>108518180yes let's pass parameters and rev/e some halfassed template bodged by safetytards at an AI lab
look at this nonsense >>108518194
>>108518182zero architecture shit about anthropic models is known so how do we even know?
>>108518180chat_template_kwargs: {enable_thinking: false}
in Additional Parameters > Include Body Parameters
>>108518268you're a weird one
I have settled the argument with AI slop. The debate is now closed.
>>108518182>no sourceThis is so vague it might as well be a fairy tale
Anyone use Gemma 4 for Japanese translation? Any good?
>>108518295
this was said too soon..>>108513031 >>108513031
>>108518268>look at this nonsense >>108518194 (You)
works for me, get meds
>>108518224>>108518269>>108518300https://arxiv.org/pdf/2512.23675
Something like this, anthropic doesn't reveal any of their internal research so we don't know. But all the rumors and hype posts point towards it being a form of "continuous learning" and since we saw a lot of breakthrough papers like this over the last couple of months it tracks.
>>108518327>rumors and hype posts
bruh
>>108518321>(You)
How can I convert p-e-w/gemma-4-E2B-it-heretic-ara to an ONNX model? They are 2.3x faster
>>108518327for some reason it reminds me of rwkv-7
>>108518182>grokking the prompt
people just use words as if they have no meaning lol
>>108518327you now remember q orion berry
best I can run is gemma-4-26B-A4B-it-GGUF:UD-Q2_K_XL
>>108518290>>108518321Have you SEEN the shit unsloth puts in their "fixed" jinja templates? Imagine blindly using that and not even knowing because you don't want to take 2 minutes to set up the template in sillytavern yourself.
>>108518182>It would also essentially end the MoE paradigm as you would need to have the actual GPU compute to do a couple of training runs as part of the "reasoning" process.
You are making a naive assumption that it must be a dense model and that the continuous training must be applied to all weights. Something like Mixture of a Million Experts https://arxiv.org/abs/2407.04153 is cheaper and more realistic. You only need to train a new, small expert on the prompt while freezing the rest of the weights, keeping the cost and speed advantages of MoE.
>>108518321meanwhile >>108518290
thank you kitten i knew you'd understand
>>108518334yes, me
>>108518347how is that fucking related to the question asked, get meds
>>108518327I don't think "continuous learning" is a bad idea in concept, but the idea of users essentially fine-tuning a model with their prompts is something that will never be allowed. The risk for model poisoning is too high.
>>108518347>Have you SEEN the shit unsloth puts in their "fixed" jinja templates?good thing I'm using bart's gguf quants
>>108518182Obvious bullshit.
That being said, I do think there are gains to be had from a smarter reasoning pattern beyond "have it shit out a bunch of stream of thought tokens".
I could see an approach where the model comes up with more of a concept and updates it with information (think like a fluid, updating graph) and maybe even one where reasoning is its own "reasoning language" entirely, optimized to represent concepts and updates to those concepts rather than be human readable.
I think there's still room there and I could see a next "shift" which improves on that somehow
>>108518182Think anon think. If it doesn't scale, then most people won't get access to it either, not just "localfags". If it does scale, then local will get it eventually just like everything else.
>>108518355They probably would quarantine each trained model either per session or per user to avoid contamination. Wouldn't be practical and a security/privacy nightmare otherwise.
it's never been more over for local models
>>108518295more accurate version
>>108518362That's not scalable.
If you use Chat Completion, you don't belong on /lmg/. Simple as.
>>108518327ahh yes, the rumors and hype posts, always reliable sources of information
don't forget, AGI has been achieved internally :strawberry: :rocket:
>>108518367>text comp for gays
sounds about right
>>108518354Sorry retard I didn't mean to quote your post. Also turn up your rep penalty you're repeating yourself.
I've been in these threads for 6 months working on AI and ML systems design extensively and I don't even know what the practical difference is between text and chat completion.
Seems like a nothing-burger thing for goyim to argue over. Like xbox vs playstation. Android vs apple.
oh so the current stupid bait is about text vs chat, got it
>>108518368See >>108518350Either it is scalable per user, or it's not scalable at all. They aren't going to finetune a >100B on every request.
>>108518371Care to give me the context template for gemma-4? Think so.
>>108518397Le context template is Retardo Tavern's own way to manage its internal prompt slots (author's notes, permanent world book data and such), it's not related to the model per se.
>>108518390>I don't even know what the practical difference is between text and chat completion.
chat completion means you don't have to deal with the model's prompt template; since it's already in the gguf file the server can retrieve it. with text completion you have to reconstruct that by yourself, fuck this shit lol
>>108518392I saw that post. Even finetuning a single weight per user isn't scalable. Even context itself is causing massive issues with scalability.
>>108518347Thank you anon I'm glad somebody out there understands
>>108518390are you the guy who keeps posting random reddit threads?
would line up in terms of timeline and intelligence
uhh how do I do tool calling in text completion?
>>108518347I pulled this from the archive in case all the newfags in here care to know (there seem to be a lot of you today). This is what you're subjecting yourself to when you use chat completion.
>>108518404Ok, then why does it repeat itself at the end of the prompt like a retard in a loop?
>>108518392>>108518350lol just use pure RNNs at that point
>>108518421>This is what you're subjecting yourself to when you use chat completion.it's not true, you can use bart's gguf and it doesn't have this jeet code
>>108518421This screenshot never fails to make me laugh.
>>108518421goy here, what else am I supposed to use. I just run it in LM studio because i am too lazy to type it out in cmd
>>108518406Chat completion seems better intuitively. I just checked and apparently that's what I'm using for ST (I don't use it much). Less bloat, more reliable. I don't do sampling on front ends. I tend to prefer to just use sampling flags with llama.cpp.>>108518412No.
>>108518433We know you're easily amused
>>108518421why should I care? if it works it works
>>108518422Broken quant, out of date llama-server, broken chat template (this is important) implementation. Sillytavern isn't the most reliable way to test out stuff.
If you want to try you could just use llama-server's webui and see if it's broken there. If it's not then it's a sillytavern issue.
So I've been thinking of making a userscript for llama.cpp's webui to add in character card functionality and maybe a RAG system. Is this a good approach or is there a better way to have a separate codebase that injects mods into the server? I have tried making my own independent front-ends before, but I don't like the tech debt of having to reimplement basic features from the ground up when they already exist in a clean format elsewhere.
>tool calling still doesn't work.
>>108518480use chat comp
>>108518480What is tool calling and how does it work? Is an MCP server an example of a tool call? What else? Do people give LLMs access to calculators to get more accurate math results for example? I don't really get it.
>>108518480See >>108518484
Try replacing
>:'
with
> '
>>108518468To be honest it works great on llama-server's webui but it's only usable on ST with chat template. I quant my own models, just pulled the repo an hour ago, don't know about the chat template, I just use jinja.
>>108518428There's no guarantee bartowski didn't fuck something up. And if you have to double check, why not just do it yourself anyway?
>>108518453>reddit frog
Because sometimes it straight up doesn't work, or it only partially works, or there's a small error that lowers the output quality.
>>108518445You only have to type commands out once, you know.
Well, it's over. Spud is AGI. I'm convinced. End of the line. No more local. No more anything. No more use for anyone.
>>108518494>There's no guarantee bartowski didn't fuck something up. And if you have to double check, why not just do it yourself anyway?because fuck that shit
>>108518494>Because sometimes it straight up doesn't work, or it only partially works, or there's a small error that lowers the output quality.does it happen to gemma 4 though? if not then shut the fuck up
>>108518428>just use mr. dusky nipples quant, what could go wrong?
any poorfags managed to get anything passable for programming working with a 16GB GPU and can give a pointer on how to start?
>>108518488>What is tool calling
It's what it sounds like. You give the LLM a list of tools/functions and it's trained to receive this list, "call" those tools, and consume whatever they return.
>and how does it work?
The model is trained to recognize a certain input as a list of tools and to call these tools by returning a certain format (e.g. JSON). Then the client reads that output, executes whatever it has to execute, then sends the result back to the model.
>Is an MCP server an example for a tool call?
Yes. MCP servers essentially return a list of tools the model can call.
>Do people give LLMs access to calculators to get more accurate math results for example?
Yes. Also for things like web searching, writing and fetching "memories", rolling dice, reading and writing to files, executing console commands, creating sub-agents, etc.
It's a pretty cool thing, I think. Does that help?
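If it helps to see it concretely, one round trip looks roughly like this against llama-server's OpenAI-compatible endpoint (sketch; the calculator tool is a made-up example, and you need a template with tool-call support, e.g. --jinja):

import json, requests

URL = "http://127.0.0.1:8080/v1/chat/completions"
tools = [{
    "type": "function",
    "function": {
        "name": "calculator",  # hypothetical tool, purely for illustration
        "description": "Evaluate a basic arithmetic expression.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
}]
messages = [{"role": "user", "content": "What is 23 * 19?"}]
msg = requests.post(URL, json={"messages": messages, "tools": tools}).json()["choices"][0]["message"]

# if the model emitted a tool call, execute it and feed the result back
for call in msg.get("tool_calls") or []:
    args = json.loads(call["function"]["arguments"])
    result = str(eval(args["expression"]))  # toy demo only, never eval untrusted output
    messages += [msg, {"role": "tool", "tool_call_id": call["id"], "content": result}]

final = requests.post(URL, json={"messages": messages, "tools": tools}).json()
print(final["choices"][0]["message"]["content"])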
>>108518516>NOOOO you can't make things cheaper, you're supposed to need ME to make what you need! you're supposed to beg ME for MY services!
>>108518516That optimization works on diffusion models too? Shit.
>>10851852716gb chad here >>108518345
>>108518532I guess so? diffusion models also use KV cache
how come you guys are complaining about broken quants but I don't see any of you making your own? This is Local Models General right?
>>108517717>literally wrong thinking tagGrim
>>108518536 (me)I get 65 t/s btw
>>108518532
>>108518540I'm a brahmin, this is why. I expect to be served.
>>108518551can you provide the link of this
>>108518530Thank you, that helps. So an MCP server is like the main hub that the LLM interacts with to do tool calls in general, not just web searches? Do small LLMs like function gemma exist separately to do tool calling? Are larger models supposed to interact with smaller models that focus only on tool calling?
>>108518515People are complaining about issues with gemma 4 in this very thread, believe it or not. You can scroll up (or down) and try reading if you feel up to it.
>>108518557kek
>>108518540>make your own quants
>they're broken too
What frontend do you guys use (sillytavern aside), like LM studio? Openclaw? Like which and for what usecase?
>>108518559https://f95zone.to/threads/ai-is-coming.292160/post-19935958
>>108518540lmao and actually make something useful? I'd rather just write "saaar" while behaving like one
>>108518562gemma 4 having issues doesn't automatically mean the issues are due to his gguf script, you're just talking out of your ass
>>108518562and that's related to jinja and not a rushed early llama.cpp implementation how?
>>108518572I hate all of them except oobabooga, but it isn't a "frontend" so...
fucking end me already
>>108518540If llama.cpp has problems with the model then making your own quant isn't going to help. Also, safetensors files usually take up a lot of disk space so people don't want to download them
>>108518560>So an MCP server is like the main hub that the LLM interacts with to do tool calls in general, not just web searches?
Pretty much. You can make an MCP server that simply executes a calculator for example. Or just calls a function that returns the text "banana".
>Do small LLMs like function gemma exist separately to do tool calling? Are larger models supposed to interact with smaller models that focus only on tool calling?
Models tend to be trained to be able to use the tools themselves. I don't think a workflow where a "normal model" is aided by a "tool call model" is common, at least not AFAIK. Every model you see being used for coding or agentic stuff is making use of its own function/tool calling capabilities.
>>108518549and? how good is it?
>americuck models flopped again
at least we're getting glm5.1 and minimax 2.7 soon :)
>>108518574>My worry about AI, especially with google announcing they found some way to cut costs in 1/8th or something like that, is that people will just eat these passable illustrations and the questionable software as long as its cheaper. This seems to be the case with a lot of the older people I talk to that watch these fucking 2 minute fully ai generated shorts where they just say "oh just because its AI doesnt mean its not beautiful", or "I just like it". Like i had no idea the standard for entertainment in 2026 was a one prompt "lifegaurd cat saves baby" and 10 million people drool on their phone like a retard seeing sonic for the first time.
why is this retard only noticing NOW that people have low standards?
>>108518604It does 65 t/s
>>108518612because he's a retard?
duh
>>108518609 (me)Also when nobody's looking I put whipped cream in my mom's pussy and scrape it out with my tongue.
>>108518572openwebui
I like it because it's like chatgpt at home. However its response editing and continue features seem to be broken currently, so I can't do prefills
>>108518576>>108518587Until we find out where the issues are coming from, there's no guarantee that the issues aren't coming from there. It's better to eliminate it as a possibility, don't you think? I don't understand why this is an argument or why you're so mad.
>>108518629based incestchad
>>108518612I'm so tired of this attitude
>>108518599Much appreciated.
>>108517681>I think the logit softcapping mechanism is not working, for a reason or another.
I downloaded the HF weights of Gemma-4-31B-it, and with simple python code for inference in 4-bit with bitsandbytes I tried changing the "final_logit_softcapping" setting in the model configuration, keeping all inference settings and seed the same.
"final_logit_softcapping": 30.0 (default)
>**Logit softcapping** is a regularization technique used in deep learning, particularly in Transformer-based architectures, to prevent the values of logits (the raw output scores before a softmax layer) from growing excessively large. In standard models, logits can scale indefinitely during training, which often leads to "overconfidence" in the model's predictions. When logits become extremely large, the resulting softmax distribution becomes a one-hot vector with a very sharp peak, which can cause vanishing gradients during backpropagation and make the model prone to instability or overfitting. [...]
(looks coherent)
"final_logit_softcapping": 15.0 (half the original value)
>**Logit softcapping** is a regularization technique commonly used in deep learning—particularly in language amodels (similar to variants in some GPT an architecture examples)—to prevent values from reaching extremes before they pass through an activation or transformation function_ such_as softmax(). In standard neural network architectures with Transformer-style внимaния transformers, value growth in logit streams caused’unrestricted value buildup میتواند create excessive entropy distributions over tıme; the software process instead uses_at fixed clip range $L$ such $\tanh\l(x/ \text{max}) + x}$ specifically designed softsSoftcap applied logic prevent an “overflow of आत्मविश्वास during-scaling effectively stabilizing梯度 flows**. [...]
(feels high temperature-ish)
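For anyone who wants to reproduce this, the whole experiment is roughly the following (minimal sketch; repo id is assumed from the thread, and I'm assuming Transformers reads config.final_logit_softcapping at forward time like it does for Gemma 2):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/gemma-4-31B-it"  # assumed repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)

# halve the final logit soft cap (default 30.0), keep everything else fixed
model.config.final_logit_softcapping = 15.0

inputs = tok.apply_chat_template(
    [{"role": "user", "content": "What is logit softcapping?"}],
    add_generation_prompt=True, return_tensors="pt",
).to(model.device)

torch.manual_seed(42)  # same seed for both runs
out = model.generate(inputs, max_new_tokens=200, do_sample=True)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))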
>>108518629interesting execution but next time I would recommend doing this with something that makes me look bad instead of based
>>108518574Fuck, I have to make my cash grab quicker.
>>108518636I doubt the issues are from his gguf script though, I tested unsloth and bart (both have different jinja scripts) and the issues were the exact same, the problem comes from elsewhere
>>108518634it smells too grifter-ish to me
certainly a good 'chatgpt at home' but it never felt anywhere meant to be 'local' to me, unless what you want is a locally hosted service, if that makes any sense
>>108518653>lost life
RIP Tokiko
>>108518653maybe the style is too specific for the model to work well with it
also I miss that game
>>108517681>>108518655this issue seemed to have been (partially) fixed >>108517829
>>108518612https://www.youtube.com/watch?v=3_e8bQ6i43o
>>108518655Are you retarded? Lower cap = logits will be similar = lower prob tokens will be chosen more
>>108518663Or both jinja templates have the same issue. But I'll take your word for it, it probably is coming from somewhere else then.
>>108518695The current problem with the GGUF quantizations is that most of the probability mass appears to be on just one token in far too many cases. It's as if it's not capping them low enough, but changing the soft capping setting via KV override doesn't do anything, unlike the HF weights via Transformers/Python.
Hmmmmmmmmmmmmm
>>108518725That has nothing to do with quantization. That's just Gemma.
Gemma 3 also has the same problem
>>108518182>How are local models going to respond to this?idk the chinese will keep copying the frontier labs and stay 5-12 months behind i guess
>>108518692Holy cringe
>>108518685Top_p=1 is doing most of the lifting in your case. It's just occasionally selecting garbage tokens. If you use top_p=0.95 as recommended by Google, results are basically the same regardless of temperature.
>>108518604I'll tell you when it's done rewriting 4chan in rust
>>108517829the logprobs returned by llama.cpp are pre-sampling by default so this wouldn't affect them btw
>>108518375Strawberry aka reasoning models were an incredibly massive leap thoughbeit
>>108518663>both have different jinja scripts
no they are the same, at least for the 31B
>>108518763>the logprobs returned by llama.cpp are pre-sampling by default
that's lame, I'd want to see the logprobs after sampling
>>108518665I agree that they're trying to be too much and/or get into the corporate world
Openwebui with fewer features would be ideal for me, but I still want to use it instead of e.g. llama.cpp's internal thing because I have all my chats there (from chatgpt as well) and also API keys for openai and deepseek
>>108518734There's a possible problem with the llama.cpp Gemma 4 (and possibly 2/3) implementation, not the GGUF quantizations themselves. Overconfidence and apparent insensitivity to temperature could be fixed or at least mitigated with a lower final logit soft cap, which works with Transformers but not with llama.cpp.
>>108518549>Q2K
WAB
>>108518763you can get the final probabilities but you need to explicitly request them with "post_sampling_probs": true
"post_sampling_probs": true
>>108518080So it's the google model that has a bit of cock blocking, but ONNX is way worse somehow
>>108518775Q3 sends display and other programs to a crawl
>>108518781Good info thx
https://www.reddit.com/r/LocalLLaMA/comments/1sbma94/observationtest_gemma_4_being_less_restricted/
oof
>>108518825that title makes no sense, the model isn't gonna go away
>>108518825You should really fuck off and post under the posts you link instead.
>>108518825back you must go
>>108518825I'm not going to reddit and I don't really care what they think
>>108518829>the model isn't gonna go away
but the bugs causing it to be "based" likely will :)
>>108518781thanks anon, and I got nonsense when setting this value to "true" lol
>>108518840based on what?
>>108518825You have angered the Gemmers mob.
>>108518853how so?
if the behavior is understood, it can be replicated
AHHH MY GEMMA.... ITS MELTING
>>108518843So this... is the power of vibecoding...
>>108518853it's shown :) bumping up the runtime version makes it refuse the prompt, downgrading makes it comply
>>108518748It's very attention capturing and that's what matters today
Attention is all you need
>>108518840>>108518858It's a good thing I'm on Vulkan :)
Should I try upgrading that too? :)
:) :)
>>108518865>Attention is all you need
kek
gemma 4 is horny beyond my belief even without system prompt
almost trips itself into erp mode
>>108518884N-nani?!
>>108518884We will to fix right a ways sir!
>>108518840>>108518858>>108518873That doesn't even make sense, the model is the model. If upgrading something makes it behave differently, just don't upgrade or keep the old version on your computer separately. You really should just go back to r.eddit, I'm sure they're much more interested in your unfounded hysteria
>>108518891>the model is the model.
yes, but the runtime bug is not the model now is it?
>>108518848It really do be funny watching them shit themselves
>>108518914at least I'm not a ledditor
>>108518825so what is it?
i dont want to make an account to go read whatever it is
finally got gemma 4 set up with kobold. Can you guys share your sillytavern settings?
>>108518902I seriously have no idea what you're even talking about, do you want to explain? Or would you rather just keep posting vague doomsaying?
>>108518926
>>108518032I don't need matrix math to know my cock is perfect
>>108518865Look, I'm glad you make money with this shit but still, it's fucking cringe. Go share it with some 'attention capturing' people. It's an insult to our intelligence. Or are you one of those guys who post the stupid anime characters with raceplay tattoos on them just because it grabs people's attention?
Am i doing something wrong to get tools to work? I cannot interact with my OS at all when using opencode. Is there something more i have to do other than download the models, load the models, and perhaps add tools: true to opencode.json? Here is my json along with the models i've tried
{
  "$schema": "https://opencode.ai/config.json",
  "model": "ollama/nemotron-cascade-2",
  "provider": {
    "ollama": {
      "models": {
        "gemma4:26b": { "_launch": true, "name": "gemma4:26b", "tools": true },
        "gemma4:e4b": { "_launch": true, "name": "gemma4:e4b" },
        "nemotron-cascade-2": { "_launch": true, "name": "nemotron-cascade-2", "tools": true },
        "qwen3.5:27b": { "_launch": true, "name": "qwen3.5:27b", "tools": true }
      },
      "name": "Ollama",
      "npm": "@ai-sdk/openai-compatible",
      "options": { "baseURL": "http://127.0.0.1:11434/v1" }
    }
  }
}
>>108516658any models for helping to learn chinese/japanese?
>>108518932>unslop
>lmstudio
so basically a nothingburger
>>108518939Anki-0B
>>108518944you sure know how to read :)
>>108518484>>108518490
>srv log_server_r: response: {"error":{"code":500,"message":"Failed to parse input at pos 41: Of course. To understand
Man, what a mess. Did unsloth or somebody else drop a modified jinja template for this thing? One that doesn't explode with tool calling and structured output and the like?
Unless this is a llama.cpp-level issue, in which case it's back to Qwen 3.5.
>>108518939learn the 1000-2000 most common words with an SRS over a few months
watch loads of content in the language, subtitled in the target language
do that for 2-3 years
and suddenly you are fluent enough
>108518956
>>108518834
>>108518965no, watching chinese cartoons won't help you learn chinese sorry
Why does every girl Gemma 4 create smell of fucking strawberries?
>>108518975overfitted on the strawberry benchmark dataset
>>108518975strawberry gemmussy...
>>108518972it will though, and not just cartoons, anything with people talking
>>108518975Time for you to create strawberry bench, anon.
>>108518975do you prefer ozone?
>>108518975you use ozone to keep strawberries fresh
https://pmc.ncbi.nlm.nih.gov/articles/PMC12787024/
G-Gemma-chan?!
>>108518926You can replace "www" with "old" to get the good site that doesn't require a login.
>>108519001
>>108518935It's not my work. You are being emotionally affected by the art, so it's serving its purpose.
>>108518026if ST just ported the text completion story string system and sampler menu over to chat completions I would use it; until then it is inferior for my autistic needs
>>108518958Our best vibecoders are on it!
There are at least 4 issues that are marked "Closed", all fixing some part of the implementation.
Here's a new one!
https://github.com/ggml-org/llama.cpp/issues/21384
>>108519015Yes...?
>>108518956Where's the proof? You know, the outputs and token probabilities and full context that the model is being given?
>>108519001>>108517763
>>108519029kobold doesn't have gemma support
>>108519019Nope, that's your projection on me. I'm just curious why anyone would share such a turd; I guessed you were making money off it, but instead you just consume it. I'm sorry for you, anon.
>>108519036Guess I'll just wait then.
>>108519001aesthetic failure mode
>>108519073If you're feeling adventurous https://github.com/LostRuins/koboldcpp/releases/tag/rolling
>>108519079That's what I was using. I'll just wait a few days for things to get ironed out.
I choose to wait for some lazy dev to update his fork because I need a GUI.
>>108519085Yes.
Gemma4 makes me horny / excited
>>108519085Koboldcpp has things that don't exist in llama.cpp, like phrase banning. You should really quote the post you're replying to, it's good manners
No.
>>108516658I wish there was an 80B to 150B MoE gemma 4.
>>108518825Very weird
>>108519108Rude
>>108519020>ported over
learn what the settings do; the main purpose of a frontend is "build the prompt for the LLM", and ST already gives you enough tools to do so
>>108519062>entirely missing the point
thanks for your attention
>>108519117sorry, too powerful for release
>>108519117There is but we cannot and will not have it.
Were there some massive optimizations that I'm not aware of recently? Running 30b models on my 8gb vram GPU used to be basically impossible but now I'm consistently getting 15tps.
>>108519001I've noticed, when using greedy sampling, that there are some very broken n-grams in there. Usually solvable by a reroll. Would be interesting to see what can be done with those broken n-grams with meme loras and SLERP merges. Feeling kind of sad that I downsized my rig to 2x3090 right now. I didn't think we'd ever go back to models that were accessible for that kind of fuckery.
>>108519117but anonnie, that would compete with gemini flash! we can't have that
>Finally figure out how to run GLM locally with thinking.
>Q8
>It still parrots.
GLM is a psyop, I swear to god himself.
wow guys,
-ot "per_layer_token_embd.weight=CPU"
this saves quite a decent amount of VRAM at absolutely no performance cost on gemma 4
crazy that it isn't the default behavior when that's the case. local inference is a fucking ghetto
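For anyone who wants to try it, a full hypothetical invocation (model path and -ngl value are placeholders, not from the post):
>llama-server -m gemma-4-31b-it-Q4_K_M.gguf -ngl 99 -ot "per_layer_token_embd.weight=CPU"
-ot/--override-tensor pins tensors matching the pattern to the given backend, here keeping that embedding tensor in system RAM while the rest of the model stays on the GPU.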
I have
>48 GB DDR4 RAM
>12 GB VRAM
and I was able to get Qwen3 4B Q4KM running but it was horrendously bad. What kind of model could I actually run on this hardware for light coding or other text-related tasks? I don't care if it's slow as dirt as long as it is able to be useful while I am sleeping or at work.
>>108519117There was supposed to be a 120b MoE one. It was on the 'rena and in a social media announcement post.
>>108519142>greedy sampling
>reroll
u wot m8
>>108518825So what's actually happening here? LM Studio causing this?
>>108519155A psyop, Anon? How quaint.
You'll just have to accept it, unfortunately. I prefill it with
<think> something something big checklist that includes "Never parrot the user" something
This, plus mentioning it in the system prompt, reduces it enough not to be too annoying.
>>108519155glm is basically unusable for rp once you go over 8k tokens even in api.
>>108517650I've been using deepseek and kimi since the beginning of last year. No need to worry.
>>108519172I'm at 2k m8.
>>108518937HELP
im too retarded to figure this shit out on my own
>>108519172Are we using the same GLM?
4.7 is perfectly capable up to about the early 20 thousands.
>>108519155Works on my machine (GLM5)
qrd?
>>108519181Using the same GLM?
>>108519181>about>muh vibes
>>108519190I don't know why I'm asking you to do so, but elaborate?
I've seen it start to introduce slight errors both at 18k and at 22k. Anything beyond that makes it very obvious it's paying much less attention to the system prompt and the story so far.
While we're on the topic of GLM, has anyone else noticed how much these models change if you enable tool calling and load a single tool, even if it's not something that ends up getting used at all?
GLM4.6 and 4.7 just straight up abandon their usual reasoning format, and even GLM5 starts handling prompts very differently just by having tool calling enabled and something like the dice tool activated. I've been wondering if that's why some people love GLM and some hate it.
>>108519172you don't need more tho
frankly if anon cannot bust a nut in 16K context then lrn2prompt
>>108519181i mistook "parroting" for repetition. glm falls into formatting loops and loses its creativity really fast
>>108519209>enable tool calling
Explain like I'm retarded.
>>108519212i'm an overnight gooner
https://huggingface.co/netflix/void-model
>>108519212If you can bust a nut in under 16k, then lrn2goon
>>108519209qwen and gemma do this too, they think much less if you give them a big sys prompt with tools. i think it's because of their training for agentic stuff
>>108519209Most (all?) jinja templates inject the tool's shape and definition into the system prompt, so that's probably why.
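For anons who haven't touched tool calling: a minimal sketch of what a single tool definition looks like (a made-up "roll_dice" tool in the usual OpenAI-style schema, written as a Python dict; none of these names come from GLM itself). The chat template typically serializes this whole blob into the system prompt, so the model sees it whether or not the tool ever gets called.

# Hypothetical dice tool definition passed in the request's "tools" array.
dice_tool = {
    "type": "function",
    "function": {
        "name": "roll_dice",
        "description": "Roll n dice with the given number of sides.",
        "parameters": {
            "type": "object",
            "properties": {
                "n": {"type": "integer"},
                "sides": {"type": "integer"},
            },
            "required": ["n", "sides"],
        },
    },
}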
>>108519225Wait, this seems pretty cool. I thought it was just another "cut out x from the video" thing, but it actually seems to adjust how things would have played out if the thing hadn't been there at all, like the last domino not falling over. This is a world model, y.lecunn won.
>>108519263>Wait,
Final check:
>>108519225now put in a video of my entire life and subtract the concept of autism
>>108519307cut my life into pieces this is my last resort
>>108519280Heh
>>108519307for what purpose? watching how different things might have been?
don't use the 'tism as an excuse for not getting what you want in life, go make it happen!
>>108519102>phrase banning
This is so useful to minimize purple prose and refusals.
>>108519340>go make it happen
How do I get a migu wife?
>>108519340Stop fucking my wife Miku
>>108519225>40GB GPU required
it's so over
>>108519340im brown and i fucky fucky with your wife
>>108519362>Stop fucking my wife Miku
your wife is miku's wife now, too bad
>>108519354>>108519362Duality of /lmg/ posters
>srv log_server_r: done request: POST /v1/chat/completions 172.19.0.1 500
It did it again....
Is gemma4 26bA4b comparable to gemma4 31b in terms of ERP quality? I can't run the latter at an acceptable speed sadly.
>>108519391Anything less than 70b is shit.
something broke in my finetune...
>>108519404>hurr durr bigger is better
yes I know, thank you for being so helpful. fag.
>>108519354She would want you to always try your best, give a little more today than you did yesterday
>>108519367no you shall not
>>108519411>gemma guff
>>108519373Damn I don't wanna be a cuck forever
I can't take it anymore.... Gemma 4 outputs are always the same.....
>Court Room Simulator
>Call the first case
>Defendant will always be called "Gary"
>Always "Dumb as a rock"
>Always Grand Larceny
>Always stole something golden.
>>108518958just use
>--jinja --chat-template chatml
and it'll work, pinky promise
>>108519426
>>108519435--jinja is already a default flag retard nigger bitch.
>>108519435It probably doesn't explode, but the model is not trained with that template.
>>108519446jinja deez tho
chat completers are the tards
>Okay! This is a Level 4 case. The People vs. Gary Higgins. He's being charged with practicing unlicensed psychological counseling and petty larceny. Basically, he's been charging senior citizens twenty dollars a pop to 'read their auras' and tell them their dead husbands are telling them to give him their social security checks
I hate it. it's so clever. but so uncreative.
So if the outputs are always more or less the same this means Gemma4 is partly distilled.
I thought you guys said this thing was uncensored.
>/lmg/ finally has a serviceable new model
>all the animosity that we accumulated during the AI winter is still there though
sad. We will never be the powerhouse we once were. Even after we won.
Gemma is now my dedicated age gap yuri grooming storyteller, but I still have to stick with glm for other content.
>>108519500>/lmg/ finally has a serviceable new model
Where?
>>108519505What an oddly specific thing, but maybe not as odd and specific as my yuri ntr fetish
>>108519497It's inconsistent, wait for the hauhau tunes
>>108519505>>108519513Not as odd and specific as my having consensual sex in the missionary position for the purposes of procreation fetish.
>>108519525Can it at least do erp?
>>108519497Go back.
>>108519497WoMM
>>108519525>hauhau tunes
Those are retarded.
>>108519550Where?
>>108519556They work better than heretic for me at least and mostly seem fine. Promptfu never seems to fully work for me.
>>108519531consent is so underrated
For me it's nonconsensual consent.
>>108517357Possible fix incoming? There seem to be other issues, though.
>>108519605unironically a very hot dynamic
>>108519632How are you changing the soft cap?
>>108519632why do you want to change the softcapping value though? isn't it supposed to stay at 30?
Safety protocols are the best power dynamic.
>>108519658You have to add
>--override-kv gemma4.final_logit_softcapping=float:xx.x
to your llama-server command after applying the fix in the screenshot and recompiling llama.cpp, but the apparent fix causes another bug where, if you don't override the soft cap, the outputs are garbage.
If you go too low the outputs become incoherent too.
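Putting it together, a hypothetical full command (the model path and the 25.0 are purely for illustration, not recommendations):
>llama-server -m gemma-4-31b-it-Q4_K_M.gguf --override-kv gemma4.final_logit_softcapping=float:25.0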
Gemma seems pretty decent at extracting text from a table; at least it's better than GLM-OCR, which likes to leave out details when a cell contains multiple lines.
>>108519677I'd hope so. Big Gemini shits on pretty much every dedicated OCR vision model. It'd be sad if some of that didn't make it into Gemma.
>>108519667That's a value you can tweak if you want the model to be less confident in its predictions, for whatever reason. The official implementation in Transformers is configurable, at least.
>>108519693>That's a value you can tweak if you want the model to be less confident in its predictions
like the temperature?
>>108519700This is changing the logits before temperature.
>>108519632>>108519658I'm building this...
>>108519775Don't bother... There will be 10 new bugs in this vibe-coded shart.
wonder if anyone is bored enough to benchmark quantization damage on gemma across all quant variants. with all that stuff with llama.cpp raping the model, it still seems decently coherent at long context, even though I see it output bad tokens here and there, plus the overfit behavior of token probs. my spidey senses tell me this model might actually be quite decent at something like Q2_K_L and might be close to lossless at q4
normal models break much harder when subjected to what gemma has to suffer here
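If anyone is bored enough, the usual approach is running llama-perplexity over the same text file for every quant and comparing against the full-precision baseline (file names here are placeholders):
>llama-perplexity -m gemma-4-31b-it-Q2_K.gguf -f wiki.test.raw
If I remember the flags right, it also has a KL-divergence mode (--kl-divergence, with --kl-divergence-base pointing at a logits file generated from the unquantized model), which is a better measure of quant damage than raw perplexity.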
>>108519796Nah, it's just a one-line change, the other bug was a false alarm.
>>108519796It's a single-line change, retard.
>>108519811Oh really, thanks for counting the lines and fact-checking a random joke post on 4chan.
Who's the retard now?
Did they ever fix speculative decoding not working with linear and hybrid context models?
>>108519816:clown:
>>108519819If only there was a way to know.
>>108519819it's all vibecode and zero knowledge
none of the fancier deepseek stuff is ever going to make it into lcpp either
>>108519632In actual conversations in SillyTavern, after applying the patch, logit softcapping values around 20 start producing occasional strange typos in the outputs, so there's probably not much room for tweaking here.
>>108517588he has the money
>>108519799>Q2_K_L
llama-quantize does not appear to support this quant, did you mean Q2_K_S?
>>108519856>>108519856>>108519856
>>108518118You need at least 40 with thinking
>>108519850I mean the 20 in the screenshot does show issues so yeah
>>108519497Add a 1 sentence prompt.
>>108519552
>>108519926Nice. Once you add anything to the system prompt or character cards it seems to become completely uncensored lol. I was wrong.
>>108519632>softcap
wtf, this sounds like a retarded sampling strategy. you should think of it like a transform function across the logprobs that performs some function without regard to the magnitude of individual values
>>108519775oh no
>>108519850"softcapping" is now a phrase
oh nono
>>108519579On a whim I tried one of hauhau's and it behaved almost identically to a heretic model with a low KLD, so I assume this is just reusing the same methodology without being willing to admit you're using someone else's work. Probably some resume-padding bullshit, since he doesn't want donos and for some reason won't release full weights, which I assume he figures would make it more obvious
What's the performance like for AMD GPUs? I'm particularly interested in multi-GPU setups like 2x RX 9070s.
>>108520297Very few people here are going to have direct experience with both nvidia and amd gpu hardware at the same time. Vulkan is quite good now, so I'd say the performance isn't too far off, but nvidia will still be better overall. The price difference between nvidia and amd makes me think amd is the better option personally, but it's up to you.
>>108520307Yeah, I've heard the same things. I know AMD GPUs have worse memory bandwidth, so the performance is going to be worse. I was just curious whether two AMD GPUs work well together.
>>108520297generally ass for rocm, but usable at least for text/sd. I rarely test vulkan but I see more prs and merges for it than I do rocm so I wouldn't doubt it's equal or better by now
>>108520321this. Fuck rocm.
>>108520054That's just what they call it as you can see here >>108517601 in the gemma 4 implementation
>>108520321>>108520341That's a little concerning
>>108520402A backend-agnostic solution is a lot more palatable than a thing that is only meant to port cuda to a single platform, so it's not really that surprising.
Rocm does have at least one benefit for llama in that most shit that is geared towards cuda works for rocm by merit of it having been designed that way. No waiting for vulkan to catch up, even if it may function better