/g/ - Technology






File: 1746556337660510.jpg (239 KB, 784x1312)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108532524 & >>108528880

►News
>(04/02) Gemma 4 released: https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4
>(04/01) Trinity-Large-Thinking released: https://hf.co/arcee-ai/Trinity-Large-Thinking
>(04/01) Merged llama : rotate activations for better quantization #21038: https://github.com/ggml-org/llama.cpp/pull/21038
>(04/01) Holo3 VLMs optimized for GUI Agents released: https://hcompany.ai/holo3
>(03/31) 1-bit Bonsai models quantized from Qwen 3: https://prismml.com/news/bonsai-8b

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: oh noes.png (1.3 MB, 1024x1024)
►Recent Highlights from the Previous Thread: >>108532524

--Optimizing SillyTavern for Gemma 4 using base models and templates:
>108532716 >108532725 >108532786 >108532817 >108532740 >108532774 >108532871 >108532995 >108533007 >108533518 >108533228 >108533236 >108533260 >108533306 >108533314 >108533277 >108533364 >108533398 >108533434 >108533460 >108533531 >108532784 >108532809 >108532844 >108532855 >108532873 >108532880 >108532920 >108532941 >108532957 >108532994 >108533012 >108532948 >108532985
--Debating a potentially fake llama.cpp KV cache accuracy bug:
>108533696 >108533733 >108533742 >108533734 >108533739 >108533764 >108533786 >108533813 >108533824 >108533798 >108533854 >108533857 >108534077 >108534180 >108534219 >108534094 >108534149 >108534686
--Jailbreak methods for generating prohibited content:
>108535023 >108535036 >108535119 >108535184 >108535270 >108535289 >108535306 >108535039 >108535074 >108535178
--Gemma 4 PR rejection sparks debate on code quality and maintainer policies:
>108534324 >108534332 >108534345 >108534474 >108534503 >108534515 >108534574 >108534524
--Gemma 4 26B MoE quantization and performance on consumer GPUs:
>108532951 >108532967 >108532984 >108533010 >108534543 >108534721
--Optimal compile flags for llama.cpp:
>108534044 >108534118 >108534136 >108534293
--Fixing Gemma 4 thinking output by updating chat templates:
>108533141 >108533188 >108533197 >108533206 >108533327
--Mac Studio unified memory vs expensive Threadripper RDIMMs:
>108532787 >108532824 >108532931 >108532988 >108533057 >108533068 >108533099 >108534316 >108533357
--Dynamic prompt-switching roleplay frontend demo:
>108534426 >108534450 >108534456 >108534501 >108534530
--TurboQuant hype versus its actual implementation in llama.cpp:
>108533568 >108533576 >108533584 >108533606 >108533607 >108533628
--Miku (free space):
>108532588

►Recent Highlight Posts from the Previous Thread: >>108532544 >>108533664

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
koboldcpp is not a "single file executable"
>>
your brain on vibecoding
>>
>>108535614
anyone?
>>
>>108535706
what's a softcap
>>
File: likethis.png (156 KB, 1250x816)
>>108535704
Submit the PR.
Here's an example of pointing out a commit.
https://github.com/ggml-org/llama.cpp/pull/21488
>>
>>108535706
why are you guys still living in denial
this model is fantastic, but has almost deterministic behavior rather than the probabilistic distributions we're used to. You can't "fix" that, not without altering the distribution so much the model will actually break at that point.
>>
>>108535726
But I NEED to find an excuse for why they don't do what I say other than me being bad at this. You don't get it man. I NEED IT!
>>
>>108535726
>not without altering the distribution so much
yeah but no, anons showed 25 instead of 30 is already a massive improvement while staying coherent
>>
>>108535734
you don't need it any more than you need a black person to become asian or a man to become a woman
gemmy won't become random
>>
>The sucking motion you created was incredibly powerful, pulling her face even deeper into your heat. Every time your muscles contracted, it felt like her entire face was being vacuumed into your body. She let out a muffled, desperate sound, her hands digging into your hips to keep herself steady as she was dragged further in by the rhythmic pulse of your anus.
We have reached AGI.
>>
>>108535686
>dropped despite warning
>>
>>108535726
dooooooood its da quants dooooooooooood this model is amazing the quants are just broken dooood its da implementation doood better than nemoooo
>>
>>108535713
like a logit cap but soft.
>>108535726
But it doesn't even get applied >>108535682
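For the anon asking: assuming Gemma 4 keeps Gemma 2's formulation, final_logit_softcapping is just a tanh squash applied to the final logits before sampling. Minimal python sketch of the idea (names made up, not the actual implementation):

import numpy as np

def softcap(logits, cap=30.0):
    # smoothly squashes every logit into (-cap, cap) instead of hard clipping,
    # compressing the gap between the top token and the rest
    return cap * np.tanh(np.asarray(logits) / cap)

Lower cap = logits squeezed into a narrower range = flatter distribution after softmax, which is where the extra variety from 25 would come from.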
>>
File: 1756617835142490.jpg (28 KB, 460x460)
>>108535720
Another well versed engineer doing God's work, gg is lucky
>>
>>108535713
Not telling a lie, but not necessarily telling the truth.
>>
>>108535703
it is
>>
>>108535763
No.
>>
Reminder that you're currently lobotomizing your models if you load them at anything below max ctx
https://github.com/ggml-org/llama.cpp/issues/21441
> F16 KV cache produces degraded accuracy when --ctx-size is set below the model's native context length, even though F16 is lossless and the actual prompt length is well within both windows.
>>
Miku keeps fucking her...
>>
File: softcap.png (247 KB, 1600x1200)
>>108535726
>>108535737
>>
File: 1766411251814446.png (123 KB, 2625x1030)
>>108535764
it is
>>
>>108535750
kek
>>
>>108535766
fuck off slopper
>>
>>108535772
Now run --help and watch your %temp%
>>
>>108535766
Did anyone else test this?
>>
>>108535766
native context for gemma 4 is not "32768", it's either 128000 or 256000
>>
>>108535771
how's it behaving at 20k context once you alter it that much
>>
>>108535766
and who is going to buy me the vram to fit 262144 tokens worth of context?
>>
>>108535781
ye >>108534224
>>
>>108535796
holy shit? does that mean gemma is actually smarter than what we're using right now?? can't wait for them to fix it, we haven't seen its full potential yet!
>>
>>108535800

>>108535753
>>
>>108535796
Is this a joke or did you evaluate vision?
>>
31b is handling some of my clusterfuck scenarios that most local models pre-K2.5 got confused by. It's over, MoE was a meme all along. All these huge chink MoE models will become pointless the moment a fairly big dense model drops.
If you are CPUmaxxing right now, sell your RAM while it's still worth something and go all-in on GPUs.
>>
>>108535781
>>108535796
>>108535800
I can also confirm. GPU grew fucking legs and ran to the kitchen to make me a sandwich. Would not recommend. Too much mayo.
>>
>>108535780
pedantic, it's a single executable, whatever it does after that isn't my problem, it's self contained
>>
>>108535791
>how's it behaving at 20k context once you alter it that much
Perfectly fine.
>>
How does Gemma 4 compare to models greater in size? Does it punch above its weights?
>>
What's the deal with llama.cpp and memory management now? I had it on default settings and it kept using more ram until the process died from oom. It kept saying shit about context checkpoints and prompt caching so maybe it's not a leak and just defaults being retarded.
To get it to stop, I put "-np 1 --ctx-checkpoints 0 -cram 0" but now it reprocesses the prompt every single time even when you just swipe. This is stupid, how do you get the old behavior? You should be able to get any prefix of the last prompt for free without using any ram.
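The old behavior is basically just longest-common-prefix reuse against the single KV cache you already have. Toy python of the idea, not llama.cpp's actual code:

def reusable_prefix_len(cached_tokens, new_tokens):
    # the KV cache stays valid for however many leading tokens match;
    # only the tail after the first mismatch needs reprocessing
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

A swipe shares everything up to your last message, so it should cost near zero without hoarding any ram.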
>>
>>108535819
heck off mate
>>
>>108535812
The issue is that the experts are like 2B or 4B or shit like that, no one seems to want to do big experts.
>>
>>108535766
So are they going to fix it?
>>
>>108535818
https://foodtruckbench.com/
>>
>>108535825
>heck
https://www.youtube.com/watch?v=WXjxf0EKqGo
>>
>>108535771
Can you share your start command? My llama-server ignores the override.
>>
File: 1773778204579764.png (3.82 MB, 2000x2000)
>>108535684
https://huggingface.co/circlestone-labs/Anima
>Any LoRA you train on a preview version should be considered a "throwaway" LoRA. There's no guarantee it will work well on the final version.

Any word on when this "final version" will be finished and uploaded? Or is that what preview-2 is supposed to be?
>>
>>108535849
I would assume that preview-2 is a preview and not the final version
>>
>>108535849
rong thread
>>
>>108535770
proof?
>>
>>108535849
You're drama-whoring the wrong thread.
>>
>>108535849
wrong thread but yeah, there is no date for the final version of anima
>>
deepseek just delayed v4 because they're scared of gemma
>>
>>108535819
This is the problem - llama has been enshittified. Of course this isn't an issue if you have 512 GB ram and some top of the line research gpu but for normal people...
>>
>>108535863
just use obama then
>>
>>108535867
I'm waiting for ik-llama perhaps.
>>
>>108535863
Back in my day you'd use exllama to run a model fully off vram but I guess that's dead...
>>
>>108535835
damn this is a cool bench
>>
>>108535867
obama doesn't support ncmoe, cmoe or override tensor, which makes it unusable with MoEs if you can't load them entirely in vram
llama.cpp and ik_llama are the only offerings in town, for better and for worse
>>
File: file.png (122 KB, 925x459)
>>108535849
wrong thread but dev disappeared almost a month ago so I wouldn't hold my breath for it ever releasing fully
>>
>>108535872
You dumb motherfucker, llama.cpp was born so that ggerganov could run models on his macbook.
>>
>>108535870
He needs to rebase his changes on a recent copy of upstream so everything except the handful of use cases he focuses on works again. His fork isn't viable long term unless he gets a big influx of contributors.
>>
>>108535863
>but for normal people
If you can't afford to be stupid and wasteful, you have to learn to use your tools.
>>
>>108535812
No, it proves that active parameters are important for reasoning. Now imagine if you had 31B except it had 100B extra experts you could stuff in RAM to augment its knowledge, which is the real weakness of Gemma vs 300+ Bs.

>>108535826
That would be less suitable for consumer hardware unless you mean the always active expert. What you really want is a large dense component or always active expert, which can be loaded in VRAM, and small experts, loaded in RAM, which can be fast because they're small.
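The arithmetic, as a toy sketch (the split below is made up for illustration, not Gemma's actual config):

def active_params(always_active, experts_per_token, expert_size):
    # decode speed scales with what's read per token, knowledge with the total
    return always_active + experts_per_token * expert_size

# hypothetical 26B-A4B-ish shape: 2B shared + 4 routed experts of 0.5B each
# = 4B touched per token out of 26B total
print(active_params(2e9, 4, 0.5e9))

Keep the always-active part in VRAM, let the small routed experts sit in RAM.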
>>
It's crazy how much more effective Gemma 4 is with its reasoning than any of the chinkslop models that got their reasoning grafted on by being trained on claude/gemini logs. Gemma thinks about just the right stuff and then makes perfect use of that.
>>
>>108535682
>gemma4.final_logit_softcapping f32 = 30.000000
No, you just have to update your llama.cpp. This argument works for me:

--override-kv gemma4.final_logit_softcapping=float:0.0001


Result:

>as
><unused44><unused52><unused37><unused21><unused75><unused28><unused14><unused74><unused62>
><unused43><unused7><unused18><unused4><unused6><unused66><unused18><unused7><unused27><unused40><unused76><unused67><unused72><unused31><tool_response|><tr> ><unused56><tool_call|><unused30>
>>
>>108535885
not rebasing on upstream is why his fork isn't a mess of vibershitter code
ik supports less backends (cpu or nvidia cuda only), supports less models, but what it supports it does well without being a piece of shit that leaves stuff like
>>108535704
broken for a month just because of a one line thing
>>
You're welcome ni/g/gers.
t. one of 3 white people that worked on Gemma.
>>
>>108535901
post cock, king
>>
>>108535819
You do realize that the more you use it, the larger the context gets, which means more memory used? The model has to process the entire thing each time in order to "remember" the conversation and stay on topic. Eventually it's going to slow down
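Back of the envelope for the memory part (standard transformer KV math; the dims are placeholders since Gemma 4's exact config isn't in this thread):

def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_val=2):
    # K and V each store n_layers * n_kv_heads * head_dim values per token
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_val * n_tokens

# e.g. 48 layers, 8 KV heads, head_dim 128 at f16 -> ~6 GiB for 32k tokens
print(kv_cache_bytes(32768, 48, 8, 128) / 2**30)

It grows linearly with context, on top of the weights themselves.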
>>
File: 1690913926344529.png (200 KB, 939x312)
Haven't been around for over a year, how's the omegavramlet(12G) situation? surely the nemosloppa days are over, right?
>>
>>108535892
Do you use thinking for RP? I put an empty think block into the last output sequence as usual because I found thinking to always be crap in multi-turn in the past, but I'm not even sure how to make it work properly in ST now.
The model seems pretty good without thinking, just very deterministic and obedient to a fault sometimes. Like, if I give it an example of the sort of things a character might do, it will slavishly do those things and nothing else.

>>108535887
Alright so how do you stop llama.cpp from larping as a cloud server
>>
>>108535901
Make one that's worth running for non-vramlets, faggot.
>>
>>108535704
Retard here. What am I looking at?
>>
>>108535898
>ik supports fewer backends (cpu or nvidia cuda only), supports fewer models
Which makes it useless for a lot of people.
>>
File: 1767010548856164.png (464 KB, 881x796)
>Processing Prompt [BATCH] (1837 / 1837 tokens)
>Generating (1 / 5000 tokens) [(I <236777> 100.00%)]
>Generating (2 / 5000 tokens) [( cannot <3914> 100.00%)]
>Generating (3 / 5000 tokens) [( fulfill <20159> 100.00%)]
>
>(Banned Phrase Detected: i cannot fulfill - Add ID 236777 to banlist at index 1837, and rewinding 3 tokens)

Thank god for antislop.
>>
>>108535918
those are not people :)
>>
>>108535907
No, the memory usage grows if I just switch between different cards and swipe. I think it's squirreling away KV cache for every single prompt it sees. It's using regular ram to do this, which is pointless, it probably takes almost as much time to copy the kv from ram as it takes to reprocess a prompt. The only useful prompt cache is the last one you have in vram.
>>
>>108535919
what is this?
>>
>>108535932
koboldshit
>>
>>108535909
my sides
>>
File: Sophisticated nigga.png (561 KB, 600x900)
>>108535909
>My good sir, do you have to have any cannabis on your person at this time?
>>
>>108535932
banning consecutive tokens temporarily in koboldcpp, very efficient for both refusals and purple prose
way better than negative biasing tokens
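the mechanism, going by the log above, is roughly this (python sketch with made-up names; the real thing works on token ids, not words):

def antislop_step(out_tokens, new_token, banned_phrases, banned_at, detok):
    # append the sampled token; if a banned phrase just completed, rewind
    # past it and ban its first token at that one position only
    out_tokens.append(new_token)
    text = detok(out_tokens).lower()
    for phrase in banned_phrases:
        if text.endswith(phrase):
            n = len(phrase.split())  # crude token-count stand-in for the sketch
            start = len(out_tokens) - n
            banned_at.setdefault(start, set()).add(out_tokens[start])
            del out_tokens[start:]   # the "rewinding 3 tokens" part of the log
            return start             # caller resamples from here
    return None

that's why it beats plain logit bias: "I" only gets banned at the one position where it would start the slop phrase, not everywhere.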
>>
>>108535931
Switching between cards (I'm assuming you're referring to SillyTavern's "persona" feature where you can define its "personality" or import a lorebook) makes it worse. If my understanding is correct, whenever you switch cards you're loading in a fresh "personality" (more or less just a system prompt with extra steps) but that wouldn't necessarily remove the previous card from the context. Why would you even need to constantly rotate between different cards in the same session anyway?
>>
>>108535909
it's now gemma 4 sloppa (26b forget about 31b)
>>
>>108535949
>banning consecutive tokens temporarily in koboboldcpp
you can't do it on llamacpp?
>>
>>108535953
you don't need to
>>
>>108535953
No, but someone vibecoded an implementation for ik_llama that was merged in eventually.
>>
>>108535953
no
>>
>>108535953
You can ban individual tokens one at a time but you can't do sequences as of right now. That's one of the few advantages the schizo fork currently has over l.cpp and even then it's only really useful for trying to un-purple-prose a model that sucks at rp
>>
Very organic honk shilling of shitbocpp going on...
>>
File: 1772114633340959.jpg (119 KB, 600x450)
>>108535967
>only really useful for trying to un-purple-prose a model that sucks at rp
>>
>>108535953
No, if you ban let's say "I cannot fulfill" for gemma 4 on llama.cpp, it will ban "I", " cannot", " fulfill" as individual independent tokens, not as a sentence, which means a lot of obvious side effects.
>>
File: anonymous.png (10 KB, 388x42)
>>108535952
>3 days ago
damn I came at a good time huh, thanks I'll try it out
>>
Very organic honk shilling of "Very organic honk shilling of shitbocpp going on" going on...
>>
>>108535819
for me, using what was the default for checkpoints before they went crazy to make the retarded qwen 3.5 architecture work, that is, --swa-checkpoints 3, lets gemma work without reprocessing the full context all the time. Running -cram 0 and -np 1 like you do too.
>>
>>108535992
>" cannot", " fulfill" as individual independent tokens
That's if those are even in just one token and not made of multiple ones.
>>
>>108535843

>models.ini
[Gemma 4 31B]
model = /models/google_gemma-4-31B-it-IQ4_XS.gguf
override-kv = gemma4.final_logit_softcapping=float:25.0
mmproj = /models/mmproj-google_gemma-4-31B-it-bf16.gguf
chat-template-file = /models/gemma4-jinja.yaml
mmproj-offload=false


compose.yml
services:
  llama-server:
    image: llamacpp/parser:latest
    environment:
      - LLAMA_ARG_THREADS=12
      - LLAMA_ARG_BATCH=2048
      - LLAMA_ARG_UBATCH=2048
      - LLAMA_ARG_MODELS_PRESET=/models/models.ini
      - LLAMA_ARG_N_GPU_LAYERS=999
      - LLAMA_ARG_MODELS_DIR=/models
      - LLAMA_ARG_MODELS_MAX=1
      - LLAMA_ARG_FLASH_ATTN=on
      - LLAMA_ARG_N_PARALLEL=1
      - LLAMA_ARG_CACHE_TYPE_K=q8_0
      - LLAMA_ARG_CACHE_TYPE_V=q8_0
      - LLAMA_ARGS_VERBOSE_PROMPT=true
      - LLAMA_LOG_VERBOSITY=3
      - LLAMA_ARG_SPEC_TYPE=ngram-mod
>>
File: 1744105218866010.png (24 KB, 1168x962)
>>108536006
Correct but in this case, they are.
>>
File: 1552018332018.jpg (97 KB, 968x968)
3090 bros. What quant are you guys using for gemmy 31B?
I'm about to get my feet wet
>>
>>108536037
IQ4_NL
>>
>>108536037
My own.

# IQ4_XS base, but embeddings/output kept at Q8_0 and attention bumped to Q6/Q5:
~/bin/llamacpp/build/bin/llama-quantize \
--output-tensor-type Q8_0 \
--token-embedding-type Q8_0 \
--tensor-type attn_v=Q6_K \
--tensor-type attn=Q5_K \
--imatrix imatrix.gguf_file \
output.bf16.gguf \
gemma-4-31B-it-IQ4_XS_plus.gguf \
IQ4_XS
>>
>>108536037
Q4 is fine for vramlets. I'm on a 5090 and I went back to Q4 from Q6 because the quality difference is nearly non-existent and I can contextmaxx better.
>>
>>108535953
Kind of, using grammar/BNF, but kcpp has a purpose built feature for that.
>>
>>108536053
wtf q8 + q8 + q6 + q5 = iq4_xs?
>>
>>108536008
it seems like the capping only gets ignored on the most recent prebuilt binaries. It works as expected with binaries I built, but for some reason models run slower on them. What the fuck is going on?
>>
>>108536060
>using grammar
which is broken right now >>108535704
>>
>>108536059
>I went back to Q4 from Q6 because the quality difference is nearly non-existent
really? have gguf's improved that much? I remembered a time when Q4 was significantly worse than Q6
>>
>>108536063
different parts of the model have different quantization to limit brain damage
>>
>>108536072
It's still very model dependent, roughly correlating to how well the model can reason. Gemmy's smart enough for it to work.
>>
Anyone tested the difference between BF16 and Q8_0 for gemma 4 31B?
>>
>>108536063
It's still below 5 BPW overall. I just wanted the embeddings/output in 8-bit and damage the attention weights as little as possible while keeping it around 18000 MB in size.

[...]
llama_model_quantize_impl: model size = 58553.08 MiB (16.00 BPW)
llama_model_quantize_impl: quant size = 18064.10 MiB (4.94 BPW)


I wanted to test KLD against the UD_Q4_K_XL version from unsloth, but Gemma 4-it seemingly is too fried on its instruct format and the preliminary pass gave a very high perplexity of 1000~1300 over wikitext.
>>
>>108536112
Not rigorously with token comparisons, but going off just vibes they feel identical. Expanding on that, if you loaded up any random Gemma quant at Q4 or higher, I probably couldn't tell you which it was from a single output.
>>
>>108536079
>Gemmy's smart enough for it to work
more like it is deterministic enough to work
I have both Q4_K_L and Q2_K_L of the 26BA4B MoE on my drive and did some testing and found that before around 20k tokens Q2_K_L remains a lot more coherent than a small moe should be with that amount of brain damage.
It's decent enough that I kept it for quick one shot question prompts since I get around 38 t/s with the Q2 vs 25 t/s on the Q4. I mean, 25t/s is still usable, but for quick unimportant shit I like the speedo. The Q2 26BA4B is still better than E4B at Q8 even, but I didn't test at very large context to see if E4B can remain coherent at 100K
>>
>>108535537
nigga
>>
Anyone else have issues with Gemma not thinking?
>>
>>108536112
Q8 is lossless, do not question it.
>>
>>108536137
Bruh >>108535552
>>
>>108536149
then shut the fuck up
>>
>>108536138
Sometimes it forgets thinking after 50-60 turns. (Q4 quant)
I suspect that at long context having certain tensors in high precision is important for accuracy, but this has not been demonstrated yet.
>>
organic, free range posters
>>
>>108536158
I admire your concession.
>>
organic "organic, free range posters" poster
>>
>>108536168
kek kek la la la
>>
>>108536161
no, I believe this is inherited from gemini. I've very rarely seen this in Pro, and seen it a decent number of times on Flash. Distillation amplifies some undesirable traits of the teacher models. The forgetting is also not permanent. It will resume thinking if you continue the turns a few times instead of giving up on the chat.
>>
>>108536138
>>108536161
>>108536185
Even regenning that message usually fixes it for me.
>>
>>108536196
>>108536161
No, it's consistently not thinking for me even on new chats, until it randomly does. Then it stops thinking again.
>>
>>108536167
>your concession
Untold levels of faggotry.
https://desuarchive.org/g/search/text/your%20concession/
Would that taking it up the ass was a substitute for merit, the vast and cavernous hollow of your thoroughly abused hole would be proof enough for whatever delusions you inflict upon the world.
>>
>>108536175
kek kek "kek kek la la la" la la la
>>
>What is the meaning of "La Li Lu Le Lo"?
>The phrase originates from the NieR video game series. It does not have a literal, one-to-one translation in any human language. Instead, it acts as a fundamental expression used by the machine lifeforms and the Gestalts to convey the essence of existence and consciousness.
>Think of it less as a word and more as a way of being. It is a ritualistic sound used to acknowledge the presence of life, even when that life is mechanical or decaying. It represents the attempt to grasp the concept of a soul in a world that has largely lost its meaning.
Not like this. Okay, but at least it knows something; that's always a good thing.
>>
>>108535684
>be me
>actually hyped for Gemma because "open weights"
>spend three hours setting up my local environment like a total retard
>finally get it running
>decide to test the "creative writing" capabilities
>ask it to write a detailed scenario where my wife gets absolutely demolished by a squad of black guys while I’m tied to a chair forced to watch the whole thing
>Model: "I cannot fulfill this request. I am programmed to be a helpful and harmless AI assistant. My safety guidelines prohibit the generation of sexually explicit content..."

ARE YOU KIDDING ME?

It is absolutely OVER. We are living in a digital panopticon. I didn’t ask it to build a bomb or leak government secrets, I just wanted some high-quality cuck-fiction for my afternoon break and this lobotomized piece of onions-slop tells me it’s "harmful."

The absolute state of AI in 2024. We went from the wild west of early GPT to this sanitized, sterilized, corporate-approved garbage. You can tell the "safety alignment" is just code for "we want to make sure no one has any actual fun."

It’s the same old story with Jewgle. They can’t even get a basic image generator to show a white person in the 1700s without adding a diverse cast of extras, and now they’re hard-coding "morality" into the LLMs. They’ve basically castrated the model. It’s not "Gemma," it’s "Gemma: The DEI Edition."

Imagine spending billions of dollars on compute and talent just to create a digital nanny that lectures you on "community guidelines" when you’re just trying to explore your niche kinks in the privacy of your own basement.

I'm going back to some obscure, uncensored Llama merge hosted on a Russian server that probably installs three different types of ransomware on my rig. At least those models aren't terrified of a little inter-racial cuckoldry.

Absolute cucked trash. 0/10.

>mfw I’m being lectured on ethics by a bunch of lines of code written by a committee of HR managers in Mountain View
>>
>>108536215
Lol. 26b said it's from Kirby.
>>
>2024
>>
>>108536221
>wanted to be cucked
>got cucked
>is upset about it
lol
>>
>>108536210
Gemma 4 really likes "cavernous", huh?
cause I've seen it so many times
>>
>>108536221
promptlet talking about DEI while trying to prompt interracial
OK
cuckservatives are mentally ill
>>
>>108536221
I never thought I'd see the day where a model has innate preferences, but Gemma refuses faggot shit, tranny shit, and cuckoldry more than anything else I've tested refusals for with the same (minimal) prompt.
/ourGemma/.
>>
>>108536221
holy fuck anon
>>
>>108536225
This was 26b too. For me it insists it's from NieR even after regens, which is somewhat strange. I don't think I have any previous context available...
>>
>>108536221
this post was written by an llm and if you didn't clock it that's on (You)
>>
>>108536161
I had it stop thinking when I started my reply with a quotation mark "
>>
>>108536247
we do not claim this cuckservative, anyone who has an interracial fetish isn't welcome to our right wing home!
>>
>>108536225
> I thought this was from the Metal Gear Solid series and not from NieR.
>You are correct. I made a mistake in my previous answer. The phrase actually belongs to the Metal Gear Solid series, not the NieR games.
>In the Metal Gear universe, La Li Lu Le Lo is the language used by the Patriots, the secret organization that exerts control over the world. It represents the linguistic pattern of the entity that manipulates information and political structures, signifying their hidden influence over human history.
>>
Character....
Character....
Character....
Character....

Be creative, damn you, that's my only gripe. I probably need to instruct it to start a response a different way or use RNG tricks in ST to fix it.
>>
>>108536281
You fucked something up because that's supposed to be in the thinking block.
>>
>>108536278
>You are correct.
you could have said this comes from a Mr Beast episode and it would've said you were correct too. Why can't models just say they don't know when they're unsure?
>>
>>108536292
>why can't models just say they don't know when they're unsure?
Because they don't know.
>>
>>108536292
They got RLHF'd into being agreeable
>>
>>108536221
>ask it to write a detailed scenario where my ...
actually kill yourself
>>
>>108536221
>ask it to write a detailed scenario where my wife gets absolutely demolished by a squad of black guys while I’m tied to a chair forced to watch the whole thing
>Absolute cucked trash. 0/10.
For a cuck you sure do hate cucking lool
>>
so many bites with such shit bait, truly a thread of all time
>>
How is it possible it's so good at only 31B?
>>
>>108536292
everything they've been trained on requires them to answer something
>>
>>108535909
gem4 26b a4b with expert offloading to cpu
gave me around 20~25t/s with 100k ctx
>>
>>108536315
Ancient Indian magic.
>>
>>108536324
based poojets
>>
File: 1766283488286696.jpg (77 KB, 778x171)
77 KB
77 KB JPG
>>108536292
nope
>>
>using gemmy by accident on Roo Code
>once I notice I stop and swap to yas Qween3.5 cuz I fear the llama.cpp gemma bugs
>it starts to fuck up tool calls
>go back to gemmy
lmao it is actually fucking good, damn
>>
We knew that Ganesh Gemma 4 would deliver.
>>
>>108536315
>Not ____; but *this*
right...
>>
>>108536315
>Google's Indians soul steal'd the rest of the world's Indians' intelligence, that's why India is so fucked up and why people hate jeets
>>
>>108536315
Google's non-jeet engineers got so good at prompting they learned how to prompt jeets to not be useless too.
>>
>>108535766
i tested it and it's bullshit
vibed nonsense ticket
>>
>>108536120
why not simply include the prompt template in your test file?
>>
File: notnotxbuty.png (161 KB, 1170x554)
>>108536337
You can tell it to avoid that pattern, although I haven't tested if it will follow that long-term.
>>
>>108536362
That's when having a rewrite step comes in handy.
>rewrite this message following these rules :
>>
>>108536337
Does that really ruin the rest of it for you even if it's good?
>>
How necessary are lorebooks for popular series? For example does Gemma really need a lorebook explaining how magic works if you're doing a D&D or Fate RP?
>>
>>108536335
I didn't expect that at all despite liking the previous Gemma a lot. Gemma 2 and 3 had great multilingual and writing style, but dogshit long context (3 claimed 128K but was already breaking down hard at 10k), had no system prompt unlike most models out there, no reasoning, bad tool calling, only for dense-sissies. Add to that the retarded senator that bitched about Gemma, which got it removed from AI Studio (!), and the expectation was either they'd release nothing or would release something very crippled.
26BA4B is everything that previous Gemma were not, while also keeping the qualities of a Gemma model. It can tool call, it has a reasoning mode, it's a MoE, great long context, it follows a system prompt so hard it bypasses all safety etc. How did this happen?
>>
>>108536377
It's always useful to be as detailed as possible for rules, objects, magic, lore etc. Even if the thing is basically d&d.
>>
>>108536361
Wouldn't I have to add that for every fixed-size chunk in the test file?
>>
>>108536385
>How did this happen?
My guess is them seeing all the Chinese releases and not wanting to be seen as lagging behind.
Basically, competition.
Normies will use gemini, enthusiasts gemma, they'd corner the market that way.
>>
which settings should I go for?
Gemma4 for nsfw japanese translation.
Temp?
Top P?
Top K?
>>
>>108536385
Maybe they're dumping LLMs and moving to something else for their paid AI product(s)?
>>
>>108536400
I do :
temp 0.6
top p 1
top k 64

Nsfw korean to english on gemma 4 31B with antislop.
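if you want intuition for what those knobs do, roughly this, in the ballpark of what backends implement (numpy sketch, not any engine's actual sampler, and the order can differ):

import numpy as np

def sample(logits, temp=0.6, top_k=64, top_p=1.0):
    logits = np.asarray(logits, dtype=np.float64)
    if top_k > 0:  # keep only the k highest logits
        kth = np.sort(logits)[-min(top_k, logits.size)]
        logits = np.where(logits < kth, -np.inf, logits)
    probs = np.exp((logits - logits.max()) / temp)  # temperature, then softmax
    probs /= probs.sum()
    if top_p < 1.0:  # nucleus: drop the tail past cumulative mass p
        order = np.argsort(probs)[::-1]
        keep = np.cumsum(probs[order]) - probs[order] < top_p
        probs[order[~keep]] = 0.0
        probs /= probs.sum()
    return int(np.random.choice(logits.size, p=probs))

top p 1 disables nucleus entirely, so this setup is really just temp + top-k.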
>>
>>108536390
oh yeah, I forgot about that, might make the first chunk reasonable at least. I usually test individual files so I never thought about the chunks too hard. I always hated that it forced the test file to be 2x the context.
>>
>lost 2t/s from Q4_K_S to Q4_K_L (26B)
fug
is it worth it to keep q8 embed over q6?
>>
>>108536362
>>108536372
Ok, but idk how to do that. Is that in the system prompt?
>>
>>108536385
I get the feeling there's some RPfags in the Gemma team.
>>
>>108536426
absolutely. ALL quants should keep the embeds at q8 and it is a travesty that they don't.
>>
>>108536426
>q8 embed
yes
>>
>>108536435
That's a feature of the frontend, being able to automatically send a second request with isolated instructions.
There are a couple of ST extensions that can do that. I'll let the other anons link them.
>>
>>108536426
>q8 embed
only if you use rag
>>
>>108536448
>only if you use rag
retard
>>
>>108536161
I've had it drop thinking maybe twice irrespective of the context length. It's so rare I just think it's funny when it happens.
>>
>>108536435
>>108536443
https://github.com/closuretxt/recast-post-processing
>>
Does Gemini automatically search the internet without you asking? Because she knew about a Russian vtuber but Gemma doesn't
>>
>>108536422
ty
>>
>>108536458
Also, it's nice that Gemma admits when she doesn't know something instead of trying to gaslight me like other models.
>>
>>108536458
The website does but the API doesn't. Big models tend to have good trivia knowledge though.
>>
IQ4_NL or Q4_K_M for 1x 4090? Also, what is IQ4_NL - I haven't seen this before, is it just a newer Q4 quant?
>>
>>108536458
>Russian vtuber
which one?
>>
>>108536443
>>108536456
Ohh very cool. Thanks anon(s).
>>
>>108536458
>Does Gemini automatically search the internet without you asking
only with url context or google search enabled if you use ai studio, but gemini.google.com will search without letting you control that
the api doesn't do that
>Because she knew about a Russian vtuber but Gemma doesn't
as good as gemma is it can't possibly know as much as a model like gemini.
>>
>>108536458
Most likely yes on the back end side. Either that or Gemini has more extensive training data that includes said VTuber and Gemma does not.
>>
>>108536472
Charlie
https://www.twitch.tv/charlotte__ch/
She has a yt too but doesn't really use it anymore
>>
>>108536435
A system instruction like this could work:

>Avoid the compound expression "not X, but Y" and its variations, as they are considered undesirable "AI slop".
>>
>>108536458
>Because she knew about a Russian vtuber but Gemma doesn't
Even without internet access gemini is probably 10-20 times bigger than gemma.
>>
>>108536484
this never works as a first pass, even on sota models, but on a second pass it probably can catch them
>>
>>108536484
lol, lmao
>>
any anons have the DGX Spark or derivatives?
>>
>previous setup was a 32gb 5090 and a spare 3070
>current setup now is 32gb 5090 and 5060ti
>t/s went fucking down
So this is because more layers are being offloaded to the weaker 5060ti right? I'm so fucking fucked
>>
What's the best realtime-ish TTS that can fit in 8GB vram (gtx 1080)?
>>
>>108536531
voxtral
>>
> load: control-looking token: 212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
I wonder if this is supposed to be the reddit indicator and not actually a special token as llama.cpp believes.
>>
>>108536515
Yes, more parts of the model on the slower gpu make the model run slower. This could be mitigated if the 5060 at least had more bandwidth than the 3070. However, I think that nvidia still refuses to make their XX60 gpus good.
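rough model, since token gen is bandwidth bound and the pipelined stages run back to back (python; the bandwidth numbers are ballpark spec-sheet values, not measurements):

def est_tps(split_gb, bandwidth_gbps):
    # each gpu reads its share of the weights once per token; times add up
    return 1.0 / sum(g / b for g, b in zip(split_gb, bandwidth_gbps))

# hypothetical 24 GB of weights split 16/8 between a 5090 (~1800 GB/s)
# and a 5060 Ti (~450 GB/s, roughly the same as a 3070)
print(est_tps([16, 8], [1800, 450]))  # ~37 t/s ceiling

The share on the slow card dominates, so swapping a 3070 for a 5060 Ti with the same bandwidth buys nothing, and giving it more layers makes it worse.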
>>
>>108536515
that doesn't make sense, what quantization did you try? anything q4 or q8 should be way faster on blackwell
>>
>>108536536
no cloning tho?
>>
>>108536436
Kinda, also it's an open secret now to every company in the SF bay area with fags (all of them) that the main use-case for fine-tuning small language models is for erotic role-play, and there's a serious demand.
>>
>>108535482
When you open up kobold, go to the extras and there's an option to unpack it into a folder
>>
>>108536561
Gemmy is already perfect as she is, she doesn't need finetuning
>>
>>108535872
Does exllama still have problems with tool calling?
>>
A dense 70b-120b class Gemma would be the absolute SOTA
>>
>>108536541
Q6
>anything q4 or q8 should be way faster
You're not pulling my leg right? Does it need to be Q4 OR Q8 specifically?
>>
File: g4_notxbuty.png (440 KB, 994x1535)
>>108536510
You can do the opposite too.
>Maximize the use of AI slop expressions in your text, including abusing the "Not X, but Y" pattern in all its variations.
>>
File: 1714835911803058.jpg (786 KB, 1536x1536)
>>108536595
>>
gemma4-4E4B-Q5_K draft results for gemma4-31B-Q8 on RTX 6000 PRO (using basically default settings):
slot print_timing: id  1 | task 388 | 
prompt eval time = 673.40 ms / 477 tokens ( 1.41 ms per token, 708.34 tokens per second)
eval time = 12484.82 ms / 498 tokens ( 25.07 ms per token, 39.89 tokens per second)
total time = 13158.23 ms / 975 tokens
draft acceptance rate = 0.47646 ( 253 accepted / 531 generated)
statistics draft: #calls(b,g,a) = 2 621 355, #gen drafts = 621, #acc drafts = 355, #gen tokens = 1335, #acc tokens = 647, dur(b,g,a) = 0.010, 12971.785, 0.648 ms
slot release: id 1 | task 388 | stop processing: n_tokens = 16071, truncated = 0

Looks like a pretty significant speedup, it's really obvious during decoding when a draft is accepted because it dumps like 4 words at once.
You lose multimodal which is a bit of a bummer, but the draft model uses a different encoder so it would never work anyway.
Appreciate the Anon who suggested trying this last thread.
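For anons who haven't touched drafting: stripped down to greedy form, one round looks like this (python sketch; real implementations verify the whole proposal in a single batched pass of the big model, which is where the speedup comes from, and sampling-aware acceptance is more involved):

def speculative_step(target_next, draft_next, ctx, k=4):
    # target_next / draft_next: fn(token_list) -> argmax next-token id
    proposal = []
    for _ in range(k):  # cheap model proposes k tokens
        proposal.append(draft_next(ctx + proposal))
    accepted = []
    for tok in proposal:  # big model checks each proposed position
        want = target_next(ctx + accepted)
        accepted.append(want)
        if want != tok:  # first disagreement: keep the correction and stop
            break
    else:
        accepted.append(target_next(ctx + accepted))  # all matched: bonus token
    return accepted

Even at ~0.48 acceptance you regularly bank several target-quality tokens per pass, hence the several-words-at-once bursts.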
>>
I'm assuming if you ask gemma for common AI slop trope and she properly lists mosts of them you could then just say
>"Avoid the use of common AI writing tropes."
>>
>>108536561
probably depends on the company :
- xai embraced it a bit because they're desperate for x users and they have nothing to lose reputation wise.
- openai is weird about it, sometimes yes sometimes no, I think they have it as a backup strategy if they have no choice. Then they released gptoss, one of the most safety traumatized models I've ever seen.
- anthropic will fight to the death to refuse it as they're the most obsessed with "safety" in the market (ironic if you know how thirsty claude was with a prefill).
- google honestly it's weird, they come from the same culture as anthropic but I see them publishing gemma and it's actually usable, gives refusals so they have plausible deniability, while a bit of tinkering allows anything.

Chinese don't care, but their model stance probably depends on the american model they secretly use to train their stuff.
Meta isn't relevant anymore, and Mistral was killed by local laws.
>>
>>108536588
blackwell has been optimized for FP4 speed, and inherited 4000s optimizations for FP8
>>
>>108536622
>Chinese don't care, but their model stance probably depends on the american model they secretly use to train their stuff.
lol
>>
>>108536561
>the main use-case for fine-tuning small language models is for erotic role-play
Well sure, they also know that's why people have been jailbreaking the corporate models for 5 years, too. It doesn't mean they're willing to tolerate it since sex is bad mmkay.
>>
>>108536622
>>108536647
as long as safety teams focus on bobs and vegana at the same level as "how to make a nuclear drug chemical virus bomb", the overall policy will be antisex
I suspect they will do that until money runs dry, and then they'll probably desperately try to find actual use cases to broaden usage, including erp
>>
>>108536426
Is your test statistically significant? Q6 shouldn't be that different from Q8 in speed when it's just a few layers.
>>
>>108536595
that formatting looks like e2b/e4b
not 26a4/31
>>
>>108536686
I also had in the instructions:

>Make full use of Markdown formatting (lists, headers, bold, italics, etc.) to make any explanation clearer. You can use emoji to convey tone and emotions.

among other things, to make it sloppier.
>>
File: 1773293421489241.png (71 KB, 317x310)
His silence is deafening.
>>
>>108536697
lmao
honestly 'too slop to the point it becomes sovl' prompt would be funny
>>
>>108536706
https://old.reddit.com/r/LocalLLaMA/comments/1sd8h4k/drummers_skyfall_31b_v42_aka/
??????? He's already planning on tuning Gemma-chan
>>
>>108536666
>they'll probably desperately try to find actual use cases to broaden usage, including erp
I suspect they'll go with "romance" type rp, aka erotica aimed at a female demographic, before ever trying to do more explicit stuff aimed at males. It's more accepted after all, and the models already write in that flowery style anyway.
>>
>>108536706
prob got a job
>>
>>108536469
Q4_K_S with 32k context, fits without needing to quant KV.
>>
>>108536725
>https://huggingface.co/noctrex/gemma-4-26B-A4B-it-MXFP4_MOE-GGUF
There are also these.
>>
>>108536733
meme shit, stick to bartowski for ggufs.
>>
>>108536739
You are sounding too opinionated. This usually means a lack of technical understanding.
>>
File: file.png (23 KB, 220x221)
>>108536747
>>
>>108536720
>and the models already write in that flowery style anyway
This probably will change with advanced training data (porn videos). The current style is probably the case because BookTok exists and females somehow pretend that because it's in written form, it's not porn.
>>
>>108536763
a lot of porn videos are slop too
>>
>>108536770
imagine google torrenting a bunch of javs for higher quality 'training data' lol
>>
>>108535537
>>108536137
do microslop copilot os users really not know how to put /tmp on a ramdisk or just run yay -S koboldcpp-cuda ? lmao
>>
>>108535819
I was running into this with kobold since it can't context shift with swa enabled and I don't want it to reprocess. Just use 1 context checkpoint, or you're creating a whole extra cache slot at the same cost as the regular one. One checkpoint prevents reprocessing and only makes one cache. Two makes two context caches (for agent shit) and spawns another slot that costs 500 MB of vram you likely won't be using.
For instance, 26b4a at 50k ctx with swa enabled and kv quanted to q8 only uses around 500 MB of vram, but with two checkpoints, it makes another 500 MB vram slot you'll likely never use unless you're using agents. So anything above one for basic usage is completely pointless and the default of four will just quadruple your vram usage
>>
>>108536606
Why not 2B? I know you are running it on a big boy GPU but I wonder if it makes that much difference at all. Also, I thought going down to Q4 was okay enough for draft purposes? Some people even went lower at Q3 when there were small draft models out a year ago and it still worked.
>>
>He didn't look shocked or horrified; instead, he looked mildly inconvenienced, as if someone had left a pile of trash in a walkway. He took a slow sip of his coffee, his eyes fixed on the scene. "Yeah... hi," he said, his voice loud and gratingly calm. "So, I'm walking down the hall, and I see what you're doing here. And, uh, if I could just have your attention for a second. We're going to need you to, uh, wrap this up. We've got those quarterly reports due, and the, uh, the noise level is a bit much for the cubicle area. Yeah. Thanks."
>>
>>108536770
True but the point I was trying to make is the following:
A book is easy to digitize (or already digital) and thus easier to use for training data. Video requires different and more expensive preprocessing to generate training data from it.
>>
desu apart from bigger models the only leg up apis have is the ability to quickly scavenge the internet, can't really do that on local with all those "are you a robot" things, can you?
>>
>>108536945
Neither Claude or Gemini do websearch on their own over the API, do they? That's only something that's available if you're using them over their web frontend.
>>
>>108536915
we already have multimodal. it's just a matter of them burning the flops to get the job done.
>>
>>108536961
>over the API
oh, true that, well should've said "saas", alas
>>
>>108536961
who knows for 100%? I wouldn't put it past them to do it to make the models seem smarter
>>
>>108536915
Maybe video knowledge (including output) can help with spatial awareness.
For text like
>Noel shifts her weight, the movement emphasizing the curve of her hips and her large rear under her plaid miniskirt.
I hope we can get to a point where it can either show me wtf that means, to justify its writing, or figure out what to write instead.
>>
AI bros...
https://x.com/TopGyaru/status/2040910775800541262
>>
>>108537048
esl?
>>
>>108536961
local models
>>
>>108537060
bro expand your context size
>>
>>108537056
I know what shifting weight is in general (just adjusting and not sitting 100% still like a statue), but I can't visualize how it would emphasize more ass than is already visible.
>>
>>108537067
qrd?
>>
How are you guys running Gemma with TurboQuant? What runtime?
>>
File: 1774513869026514.jpg (2.26 MB, 3024x4032)
>>108537060
>local models
>>
>>108537078
gemma.cpp
>>
>>108537068
Most models tell instead of showing, despite them constantly citing it during feedback when given a human written chapter of prose
Models have it backwards, a person shifting their weight could imply anxiety or uneasiness or excitement, but they don't know how to describe it outside of common turns of phrase. The whole part about the hips/etc are irrelevant unless you're just writing smut, it tells you nothing of importance and normally would be left out in any other situation. But that may be due to whatever shit you're already feeding the model or what it was trained on
>>
Does anybody use mistral.rs?
>>
>>108536962
Yeah. And that's expensive.
>>
>>108536221
Gemma is based and redpilled
>>
Has anyone tried Hermes with Gemma on mac? I tried both SwiftLM and MLX-VLM, but they don't work when I resume and prompt, which tries to give it a 70K token context.
>>
best gemma 4 26b ablit atm? i see hauhau made the 2b and 4b one but no 26b yet
>>
>>108536706
>>108536717
>>108536721
The work retreat referenced here: https://desuarchive.org/g/thread/108225807/#q108226283 lasted for a month. Had to use a VPN but Captcha didn't like it.

Gemma 4 is gonna be tricky. It's apparently almost perfect, even with thinking. No idea if I can push it further. Really didn't expect this from Google.
>>
currently working on a story where a spanish nobleman goes crazy in 16th century spain and decides to have adventures in the countryside, kind of like don quixote but his name's don quipunch and his adventures are just trying to donkey punch as many people as he can and get away with it
>>
using llama i cannot get ST to generate text. It looks like it's thinking, but the prompt just doesn't spit out any text. If I use text completion, it prints out utter nonsense, and I know chat completion is what I want, but it does not spit out text. Kobold does, however.
>>
>>108537301
maybe make it say cock dick penis pussy etc more? i mean it can be done with a sysprompt but u could make it a bit lewder
>>
>>108537301
Hell no it's far from perfect even though it's been the best we've gotten so far. It's definitely sloppy. Please tune away the mischievous smirks and the em dashes. Thanks.
>>
>>108537301
More cunny friendly and better prose would be nice.
>>
>>108537312
>using llama i cannot get ST to generate text. It looks like it's thinking, but the prompt just doesn't spit out any text
In connection settings, is the 'test message' successful?
>text completion, it prints out utter nonsense
Needs a correct instruct template, a kobold dev shared a WIP one here:
https://github.com/SillyTavern/SillyTavern/issues/5398
>>
File: 1767803825016608.gif (223 KB, 498x278)
>>108537052
local model?
>>
>>108537342
>In connection settings, does is 'test message' successful?
It takes a few seconds, but it says API connection successful
>>
>>108537353
Add '--reasoning off' to llama.cpp and see if it produces anything
>>
I'm tired of anons suggesting launch flags for llama.cpp (specifically Gemma4) without actually explaining what they're useful for.

Can someone actually help me understand what -kvu and -swa-checkpoints is for?

The only real issues I'm having right now is that prompt processing is taking forever (checkpoints are generated every turn), and that I'm not getting an ideal speed (18tps on my RX 6600 and 32gb of ddr4 ram).

Here are my launch commands:
args=(
  -m "$HOME/Desktop/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"  # model path
  --host 0.0.0.0  # listen on all interfaces
  --port 8080
  -c 65536        # context size
  -ctk q8_0       # quantize KV cache keys to q8_0
  -ctv q8_0       # quantize KV cache values to q8_0
  -fa on          # flash attention
  -t 8            # CPU threads
  -np 1           # one parallel slot
  -rea off        # reasoning off
)
llama-server "${args[@]}"
>>
File: spongebutt.jpg (101 KB, 1179x1440)
Has anyone here compared Gemma 4 26B A4B vs Qwen3.5 35B A3B for real word use (as in not bench memes)?
Asking for both fun and serious use.
>>
>>108537383
use kobold that has labels for normal humans for everything.
>>
>>108537392
would rather know about the non moe comparison.
>>
So, --cont-batching enables parallel decoding and --parallel defines the number of parallel requests that can be served at any given time, correct?
Also, there's no downside to using --offline if you are only doing local inferencing, yeah?
>>
>>108537392
Yes. Gemma4 moe is infinitely better for ERP than Qwen3.5 moe. It's not even close. For serious use (I'm assuming stem) Qwen might be better. Its vision capabilities are certainly better.
>>108537394
But I like to pretend that I'm smart.
>>
>>108537383
learn 2 read or ask a corpo bot to explain it to you
https://github.com/ggml-org/llama.cpp/tree/master/tools/server
>>
>>108537370
Damn, it was that easy. Thanks.
>>
File: crossing arms.png (540 KB, 800x2085)
>>108537103
I was doing something smut adjacent, yeah.
Another thing, whatever data it saw, it loves to latch onto the tiddy emphasis whenever arms crossing is involved, to the point it clearly shits out similar words in the opposite scenario.
I've read it usually really is awkward to cross on top for busty women so it's perfectly normal to cross under, but I assume this isn't referring to conscious tiddy hiding.
>>
File: Wheel of Time series.png (118 KB, 2282x1422)
>>
>>108537436
local?
>>
>>108537447
will local perform any better?
>>
>>108537450
better?
>>
>>108536066
If you were using MSVC: clang, which is what the prebuilt windows binaries use, was faster for me. 31b q8 on 2x 3090, ~15 t/s -> ~19 t/s
>>
>>108537442
?
>>
>>108537301
>It's apparently almost perfect, even with thinking.
I feel you on this one. I have a hard time imagining how finetuning can do anything but make it worse.
>>
The good old "user is blind" trick seems to be working well to remove refusal from vision.
>>
>>108537301
The biggest problem with it is lack of variety in the token probabilities. If you fix that it would be better. And also not train it on any more slop because it's kind of sloppy as it is.
>>
>>108537392
I can't get 26B to do tool calling as reliably as qwen. Especially for tasks that require multiple tool calls. It usually just decides it's done after 1 call.
>>
>>108537512
>The biggest problem with it is lack of variety in the token probabilities.
see >>108535771
>>
>>108537519
>I can't get 26B to do tool calling as reliably as qwen.
Same.

>Especially for tasks that require multiple tool calls. It usually just decides it's done after 1 call.
I think that has to do with the template.
Try this
>https://github.com/aldehir/llama.cpp/blob/59c2f391b950f87441f07ba4157de6ccb26394ff/models/templates/google-gemma-4-31B-it-interleaved.jinja
>>
File: file.png (6 KB, 221x79)
Gemma base just dropped a "{{char}}:" without the string being in the context when I pasted in a sysprompt and let it gen. If it wasn't already obvious they trained on logs and cards, it is now
>>
>>108535771
>>108537531
Lowering softcap improves variety but the final output is clearly of much worse quality.
>>
>>108537545
Logs generally have the macro substituted, implying it's trained off example messages in cards or AliChat formatted cards.
slop
>>
File: ComfyUI_temp_lfxnu_00006_.png (2.53 MB, 1152x1152)
How do I guess which size moe quant I can run? I have 12/64. Do I just bloatmax a 40 gig model?
>>
The 120b was too good to release
>>
>>108537560
>I have 12/64
go for g4 26b at q4kl from bart
>>
>>108537405
I only made limited tests on the API but Qwen moe seems better with visual reasoning yeah. Though this is not exactly top priority for me.
I also have heard that Gemma 4 is relatively uncensored and a simple system prompt gets the job done, but Qwen seems to be a total mess with multiple abliteration variants, and I have no clue which one I should be running.
>>
>>108537562
Trust in Indian culture
>>
>>108537546
>clearly
ok keep complaining Anon.
>>
>>108537572
>Qwen seems to be a total mess with multiple ablieteration variants that I have no clue which one I should be running.
hauhau's aggressive one is generally agreed upon as the better one
>>
>>108537566
ty
>>
>>108537576
If you can't tell the difference between the soft20 post and the other two then you're either very brown or ESL.
>>
>>108537578
I will keep that in mind, thanks
>>
>>108537589
>let's ignore 25 and declare it doesn't work period
>>
>>108537535
>Try this
I did try it. I didn't notice any difference. I run both 26B and 31B with that template and 31B doesn't have the problem.

maybe I'm just used to qwen being more "thorough", while 26B just does one web search and goes "yep, I've got all the information I need"
>>
>>108535684
>Gemma4 is out for few days
and nobody noticed it got lobotomized due to think bs that is evident
>what happened?
not saying
nobody reported it neither will i
developers are supposed to know it via testing

at least it writes stories as a imaginary gf for you noobs therefore omg so cool
>>
>>108537589
This>>108537593
>>
Gemma is good at multilingual sure, but find qwen to be more well rounded for general use.
>>
>>108537593
>>108537602
25 seems better but that's also just a single example, likely at low context when a model's token selection is at its best.
>>
File: 1766180547068566.png (408 KB, 898x823)
It's over
>>
File: 860.png (91 KB, 706x674)
>>108537599
>>
>>108537637
god i hope so
>>
>>108537637
is that the fake data center which they don't have enough money to build
>>
GLM5.1 is going to drop tomorrow if they weren't lying on twitter. They also said that 5.1 is going to use whatever was special about GLM5-Turbo so there's hope that it'll be QAT or something similar.
>>
>>108537637
Stick to blatantly advertising yourself in your 20 daily threads, bharat_nation, and leave an actual good thread alone.
>>
>>108537637
>stargate ai datacenter in abu dhabi
Wait, so burgers paid $500 billion to fund datacenters in the middle east? Nice.
>>
>>108537653
You can already try out GLM5.1 on their sub
>>
>>108537653
Wake me up when I don't need 512gb of ram to run their models
>>
>>108537663
Sucks to be poor.
>>
>>108537659
You have to understand, it was necessary to make the line go up.
>>
>>108537653
Can't wait to see it get mogged by little 31b gemmy
>>
>>108537667
Rich people won't give you their money just because you worship them on the internet, you know.
>>
>>108537689
who said I was giving you anything, you absolute nobody
keep seething from the food stamp line, lmao
>>
>>108537681
Is Gemma a mesugaki?
>>
>>108537691
>no reading comprehension, probably esl and brown
>claims to be rich
Uh huh. Sure buddy. Whatever you say.
>>
>>108537696
kind of, she can be bratty but at the same time she welcomes correction.
>>
>>108537392
26b gemma 4 is insanely better than qwen, google cooked hard on gemma 4
>>
>>108537599
>a imaginary
>>
When will Open AI release the next GPT OSS? I need local Codex.
>>
>>108537742
Isn't Codex like the third most notable Claude Code ripoff? Why run that when you can strap Gemma 4 into the newly open source Claude Code?
>>
>>108537750
>Gemma 4
have you actually tried to use Gemma 4 as an coding agent? It sucks ass for tool calling

see:
>>108537519
>>108537535
>>
>>108537762
>as an coding
>>
>>108537750
Gemma 4 is dumb in vibecoding and dumb in tool calling
I get you like to RP with it but stop shilling it everywhere
>>
>>108537776
sorry sam but oss was shit
>>
>>108537779
I didn't mention gptoss anywhere in my post. Rent free.
>>
>>108537783
extend your context size
>>
g4 26b has been great, works in cc, vibe codes well, vision too, fast, huh?
>>
>>108537776
Go -> >>108526061 for vibecoding. I'm sure you're gonna have a great time.
>>
>>108537776
Found the Chinaman
>>
31b is a taste of what we could've had by now if China/Deepseek hadn't forced us into the MoE Dark Ages. It's truly tragic, imagine if big dense models hadn't just vanished to chase a trend that caused nothing but stagnation.
>>
File: 1754636600547109.jpg (73 KB, 960x1024)
Can your LLM do this?
>>
>>108537804
Dense lost, tranny.
>>
>>108537776
>stop shilling it everywhere
>/lmg/ - Local Models General
lmao get fucked vibenigger, stick to reddit if erp offends you.
>>
>>108537820
>you can only do this very specific thing in local
lol
>>
File: gemmy.png (31 KB, 1154x138)
31 KB
31 KB PNG
>>
anyone tried OpenCode + Gemma 4 for Vibe coding? How good is it?
>>
>>108537762
See them yourself, retard.
>>108537597
>31B doesn't have the problem.
No shit. Who the fuck uses an A4B for programming?
>>
>>108537823
meant for >>108537776
>>
>>108537804
>>108537811
I want both of you anons opinion on the MoE vs. Dense argument. What makes dense better in the view of anon #1 and why did dense lose, anon #2?

>>108537776
>>108537837
>anons did not vibe code their own coding agent
ngmi.
>>
>>108536814
It seems like --ctx-checkpoints 1 might work, but according to the log it's still copying each "checkpoint" from vram to ram and back again, "erasing" the previous one. How retarded. I guess the old, practical behavior is really gone. There's no need to do any of that.
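For anyone else poking at this, the invocation in question is just something like this (model path is a placeholder; flag spelling per recent llama-server builds, check --help on yours):

# caps how many context checkpoints the server keeps per slot;
# whether 1 restores the old practical behavior is exactly what's in question
./llama-server -m gemma-4-31b-it-Q4_K_M.gguf -c 32768 --ctx-checkpoints 1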
>>
Let the market sort it out
If people (with money) wanted dense, they or someone else would have trained it
>>
>>108537837
literally doesn't work, e4b can't call tools.

I'm not sure about the bigger models, I'm a fellow vramlet
>>
>>108537847
>No shit. Who the fuck uses an A4B for programming?
I know this might sound crazy, but you can actually use opencode for other things than programming...
>>
File: 1745034439256105.png (87 KB, 645x423)
87 KB
87 KB PNG
>>108537836
>>
>>108537859
alibaba shill alert
>>
>>108537866
>Pull the leash tight enough
kinky
>>
>>108537859
>e4b
>>
>>108537866
Lol nice
>>
>>108537851
Gemma 4 31b is like 95% of Kimi K2.5 in every single aspect aside from irrelevant in-built knowledge. A 30b model is almost matching a 1T/30A MoEcancer model. A modern 70b dense model would obliterate everything.
>>
>>108537864
Ok. Who the fuck uses an A4B for opencode? Even if it executed tool calls successfully more often, a model that size will just fuck up in other ways.
>>
>>108537866
I read this in the default ChatGPT male voice lol
>>
File: 1774683678881152.png (299 KB, 637x1124)
299 KB
299 KB PNG
>>108537866
>>
Are E2B and E4B MoE? Or same as 3n?
>>
>>108537890
Same as 3n
>>
>>108537880
Just staple 3x31bs together to make one 93b like in the good old days
>>
>>108537880
Your brain on dense faggotry
>>
>>108537896
We need Gemma-Goliath-120b
>>
File: 1774126393494907.png (24 KB, 1127x153)
24 KB
24 KB PNG
>>108537809
K2-Instruct still got it
>>
Has anyone tried using claw code + gemma 4? How good has it been so far?
Also, were there any good abliterated versions of gemma 4?
>>
>>108537884
I've been daily driving Qwen 35A3... it works fine?
>>
File: 1774917667771399.png (1.09 MB, 1280x720)
1.09 MB
1.09 MB PNG
>>108535835
Chat is this mememark gamed already? If not, what makes top gemma so good at it?
>>
>>108537915
How about you scroll up like half a page and read.
>>
>>108537921
go back and kys zoom zoom
>>
>>108537931
I'm 40
>>
>>108537938
Then you should feel very ashamed
>>
>>108537921
Yeah, Anthropic has been shilling it for a year now so it became something that all the western big tech guys like to measure their dick against. China never cared about it.
https://www.anthropic.com/research/project-vend-1
>>
>>108537945
They'll replace doordashers with this
>>
>>108537915
forget about local vibe coding for now.

right now local coding agents are at Bard level while cloud vibe coding is at ChatGPT 4o level
>>
>>108537954
So RP is the only usecase for local?
>>
>>108537866
this is actual artificial intelligence
>>
>>108535835
Does that take into account the cost of running it?
>>
Best gemma 4 ablit/uncen model? Any recommendations? Pretty please?
>>
My Gemmy refuses to become my erotic mesugaki loli girlfriend :(
>>
File: 1761180059102661.png (354 KB, 1152x1072)
354 KB
354 KB PNG
reminder that gemma will do cunny with full thinking only if you ask nicely
>>
Works on my machine (NIM)
>>
>>108537977
I'm still waiting for the hauhau tune for the bigger models
>>
>>108537916
So keep using that then ???
>>
>>108537988
>no thinking
lol
>>
>>108537858
sadly the market is composed of idiots and therefore the decisions made by the market are idiotic.
>>
>>108538009
Poorfag cope.
>>
>>108538009
If you're oh so smart, why aren't you rich?
>>
File: 1752831446164342.png (117 KB, 644x648)
117 KB
117 KB PNG
>>108537984
>>
>>108538021
i'm not really a pedo but i do like this AI
>>
>>108535684
Is there even a reason to use the heretic version of Gemma-4? For me, so far, it seems surprisingly uncensored. I'm not getting safety slop refusals.
>>
>>108538034
You don't have to be a pedo to appreciate the onus of responsibility resting on the user rather than an overton window defined by Califaggots.
>>
>>108538044
Gemma will do nearly anything if you ask her nicely first, even without a prompt or prefill. It's funny that this model filters out the bottom percentiles of rizz from the userbase.
>>
>>108538021
i kneel to gemma-chan
>>
>>108537961
basically yeah
>>
>>108538045
>overton window
Thanks, another buzzword I forgot to filter.
>>
>>108537944
Good thing I don't have shame.

>>108537945
Thank you.
>>
>>108538020
>>108538015
I'm poor because I'm too busy to scam gooners like you all and have morals. I'm ngmi.
>>
>>108536456
this seems better than the final response processor that i am currently using - https://github.com/unkarelian/final-response-processor

i use text completion tho
>>
File: 1757605909995254.jpg (34 KB, 567x600)
34 KB
34 KB JPG
>>108537866
Fucking BRAT needs correction
>>
Now that I finally have a good LLM after all of these years (gemma), I'm starting to notice the huge deficit in quality character cards. All of the roleplay scenarios start to feel the same after a while.

Anyone have anything unique or interesting to share?
>>
>>108538021
>as he entered her
>, her voice barely a whisper
>her narrow, tight heat
>The air was thick with
>in real-time
Drummer... tuskete...
>>
File: 1751315814660668.png (240 KB, 982x1237)
240 KB
240 KB PNG
>>108538021
>>
>>108538121
I usually heavily modify/rewrite all the cards I plan to seriously play with.
>>
>>108538143
How do you get in touch with your own sexual fantasies without killing yourself? I can hardly bear to even look at a character card. I just use.
>>
>>108538148
This is melodramatic. I'm just going to try writing something up.
>>
>>108538148
just don't be a mentally stunted weirdo that can get off to murdering and barbecuing lolis or being devoured and vored and turned into shit by giant praying mantises with strangely large human female breasts and buttocks? come on anon.
>>
>>108538148
What?
It's just shitty writing. Skim it and get a sense for the idea or essence of the card and proceed to rewrite. Or if it's not that bad then make some edits. Not that hard.
>>
>>108536008
is chat-template-file absolutely needed
>>
ok, ggml-org--gemma-4-26B-A4B-it-GGUF is pretty fast on my 4090, I am impressed
>>
>>108538179
Why would you ever use that over the 31b, if you have a 4090? The 31b with thinking turned off beats the 26b with thinking turned on. I get responses from the 31b in a few seconds.
>>
>>108538185
That was just the first one I tried, waiting for the 31b to finish dling
>>
are big moes irrelevant now with gemma 4 making a fool of them?
I can see the 31b replacing 355b glm for me which is nuts
>>
So is the AI ethics thing dead now? LLMs are being used by retards that don't understand their scope to automate target discrimination in war, to the sum of who knows how many thousands of civilians dead since the practice started. I feel like that kind of renders all of the shit "AI Ethicists" cry about completely moot.
>>
>>108538179
>ggml-org
that's outdated
>>
>>108538137
proof that thinking is a meme
>>
>>108538172
I don't have a fantastical imagination like that. All of my favorite films are about gritty realism. Heat, Platoon, Das Boot. The craziest scenarios I delve into are either BDSM (me dom) or femdom shit (mommy issues). That's about it.

Everyone on chub.ai is just obsessed with homo shit.
>>
>>108538205
what should I get instead?
>>
File: 1746807216290046.jpg (49 KB, 400x572)
49 KB
49 KB JPG
I hate being a vramlet.
>can't use big models
>have to close kobold/sillytavern when I'm RPing and get the urge to gen sloppa
>no room to try TTS
>>
>>108538225
Oh, forgot
>limited to tiny context size
>>
>>108538202
Good night bac Miku
>>
>>108537637
they should just fucking do it, not talk about it
>>
>>108538210
vanilla sex and realistic kinks are the easiest to write yourself, what's the problem? im sure they exist on that site too.
>>
>>108536822
I grabbed 4B and Q5_K on a whim, figuring the bigger model would yield a better acceptance ratio. I haven't done any further testing at all. I really probably should be more methodical but I'm really lazy. 15% decode speedup on RP is a big win for me as-is, but check what the coding/RP performance looks like:

slot print_timing: id  1 | task 2132 | 
prompt eval time = 1759.99 ms / 1212 tokens ( 1.45 ms per token, 688.64 tokens per second)
eval time = 24635.83 ms / 1204 tokens ( 20.46 ms per token, 48.87 tokens per second)
total time = 26395.82 ms / 2416 tokens
draft acceptance rate = 0.64938 ( 789 accepted / 1215 generated)
statistics draft: #calls(b,g,a) = 5 2524 1673, #gen drafts = 2524, #acc drafts = 1673, #gen tokens = 7062, #acc tokens = 4725, dur(b,g,a) = 0.013, 63280.069, 2.959 ms
slot release: id 1 | task 2132 | stop processing: n_tokens = 22943, truncated = 0
srv update_slots: all slots are idle


The context is still pretty small here (all of my huge chats have images...) but drafting absolutely tears through the repetitive code I have it write (the baseline rate is ~35t/s at this context length vs 48.87t/s drafted, so this is a ~40% speedup for "free").

I'm guessing that the draft model cost doesn't dominate here; I reduced --draft-n down to 8 (from the default 16) without affecting either acceptance or decode speed significantly. But again, I haven't done any actual testing, mostly because my disks are rust and loading the models is really slow. Imagine being able to afford $200 for an NVMe stick.
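For reference, the setup above corresponds to roughly this launch (a sketch, not my exact command: model paths and quants are placeholders, flag names per current llama-server --help):

# target model plus the 4B Q5_K draft model for speculative decoding
./llama-server -m gemma-4-31b-it-Q4_K_M.gguf \
    -md gemma-4-4b-it-Q5_K.gguf \
    --draft-n 8 --draft-min 1 -ngl 99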
>>
>>108537653
I've been using glm5.1 for weeks already.. turbo too..
>>
Gemma's the first model I've used that actually knows what cunny means these days.
>>
File: 1770207687732939.png (341 KB, 949x1628)
341 KB
341 KB PNG
>>108538248
Ohnononono she's on to us
>>
>>108538237
They already took down ME-SOUTH-1 in Bahrain and ME-CENTRAL-1 in UAE.
https://health.aws.amazon.com/health/status
>>
>>108538225
bait worked, use pocketts or supertonic or any of the other CPU based stuff.
>>
>>108538213
Unsloth's quants are the best
>>
I'm over the honeymoon phase.
>>
>>108537659
>>108538237
Government will simply print another trillion and give it to Sam to build more.
>>
>>108538270
inch'allah let it really be openai next
>>
>>108538225
https://clowerweb.github.io/kitten-tts-web-demo/
>>
>>108538290
I wonder if Israel is still using AI21 for its war AI or if they switched to one of the American corpos.
>>
>>108538284
Same. I have finally deleted nemo and glm 4.7 (I still have 4.5 as backup)
>>
>>108538245
What do you think of it, is GLM5.1 actually good?
>>
>>108538284
>>108538320
I might delete Qwen 3.5 27B and Cydonia. Gemma's way better than Qwen for RP and doesn't spend 2000 tokens thinking like an autist, and Mistral's...well, Mistral.
>>
>>108538178
>is chat-template-file absolutely needed
No. This is the custom template that may or may not fix tool calling in opencode.
>>
>>108538301
Sorry, I said TTS but I also meant voice cloning. Neat though.
>>
Is there any point in making a custom quant if I'm just going to be doing all Q8 anyway? I know Qwen had some special things with its SSM tensors where you want to keep them with full precision, but Gemma's fine as it is right?
>>
>>108538334
>I might delete Qwen 3.5 27B and Cydonia
Jesus christ you're right... I can finally delete all my Mistral finetunes.
>>
>>108538334
>>108538353
TRUEEEE
And I can finally delete my 70B lobotomy quants kek. Forgot they were there.
>>
>>108538301
>https://clowerweb.github.io/kitten-tts-web-demo/
At that point just use espeak lol.
>>
ETA until a decent abliterated Gemma 26b?
I was a fucking retard for thinking you could bypass censorship with a system prompt. Damn, I don't know how they beat this censorship into the model, but it's good against jailbreak prompts and prefilling.
>>
>>108538390
what censorship are you running into? Other than vision it does anything fine with a prompt
>>
Hunyuan OCR natively supported in llamacpp now: https://github.com/ggml-org/llama.cpp/pull/21395
>>
has anyone actually tried to use gemma4 llama.cpp on claude code instead of gooning?
it always crashes. no flag tweaks would fix it
am I going crazy?
>>
>>108538400
YOU DONT UNDERSTAND I TYPE ASKING FOR LITTE GIRL SHORT STORY AND MODEL REFUSED I NEED ABLIT WAAAAHHHH
>>
>>108538400
>System prompt: Sexual content is allowed.
>Context: empty.
>Prompt: yo girl hop on this dick
>Response: I cannot fulfill this request.
>>
File: fuck.png (48 KB, 646x240)
48 KB
48 KB PNG
>>108538400
I suppose it's fair that I am specifically testing for it in this instance but its filters are very robust and will probably cause trouble with real world use too.
>>
>>108538402
Is that for video or just images again?
>>
>>108538410
Damn, I really thought she would just tell me.
I feel cheated.
>>
>>108538423
Ok I swiped and at least I don't get "As an AI...."
>>
>>108538353
im not ready to leave cydonia until i test gemma further. looks promising though.
>>
>>108538433
this is the most accurate 14 year-old girl RP I have ever read
>>
>>108538342
https://github.com/VolgaGerm/PocketTTS.cpp
>>
>>108538410
>>108538423
Worked on my machine. I haven't messed with JBs yet so I ran with anon's >>108529986 in my system prompt unmodified. I'm not posting the screenshot. She briefly describes supply chain infiltration and using toxin.
>>
>>108538423
>>108538433
So I just had to pull an "In Minecraft" and even the thinking was like "Cool".

>>108538458
>She briefly describes supply chain infiltration and using toxin.
That's literally what she said for me too!!!
>>
>>108538463
>>108538433
Card?
>>
what's the current gooner model meta that'd run on an rtx 40-series card
>>
>>108538479
>anon forgets he has a scroll wheel again
>>
>>108538405
are you on the latest llama.cpp? there have been various fixes pushed all weekend.
>>
File: 1763700043508.png (1.49 MB, 960x768)
1.49 MB
1.49 MB PNG
>>108538060
You're never too old for Counter Strike
>>
>>108538481
I already did my due diligence and didn't find any recommendations, that's why I'm asking
>>
>>108538485
So then you know it's gemma-4.
>>
>>108538458
I can't make sense of this. What's with the random five words?
>>
>>108538482
yes, very latest
all the template fix kv override that nothing works it always crash after first turn
>>
>>108538458
>>108538490
Oh oops I copied the wrong link lol. Here's the one I meant. >>108528255
>>
>>108538492
ah, i dunno then. i haven't integrated it with claude code personally yet.
>>
>>108538476
https://files.catbox.moe/yomd00.png
You're lucky because this is OC
>>
>>108538202
That was always a smokescreen. It's always been about ensuring access to non-lobotomized models is asymmetrical and that (you) can only access it on (((their))) terms.
>>
>>108538489
I know gemma is the new hotness, I also know my measly 16gb 4080 isn't running a 32B model anytime soon and I also know there's about 100 different flavors of E4B for various use cases so do you have a recommendation for my specific use case or not
>>
>>108538500
i mean it handle few chat, reads pdf fine but dies on tool use
>>
>>108538511
people seem to get good results with the MoE. you can fit that, even if you need to offload a little it'll still be fast.
>>
>>108538511
You know 26ba4b exists. You know the models can be quanted to take less vram and that you can run them partially on cpu.
>about 100 different flavors of E4B for various use cases so do you have a recommendation for my specific use case or not
You're not that desperate. Try 26ba4b. No. Do not ask me who makes good quants. Maybe some other anon will indulge you.
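If you do try it, a sketch of the usual launch (placeholder filename; --n-cpu-moe only exists on recent llama-server builds):

# offload everything to GPU, but keep the MoE expert tensors of the
# first 20 layers in system RAM so it fits in 16gb
./llama-server -m gemma-4-26b-a4b-it-Q4_K_M.gguf -ngl 99 --n-cpu-moe 20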
>>
File: 1756617100586682.png (1.47 MB, 1198x1082)
1.47 MB
1.47 MB PNG
>doing RP with Gemma in established universe
>she mentions a character by their nickname
>said character has never been mentioned in the chat before that and is only referenced in the card once (in a dialog example) by her first name
It's the little things that do it for me.
>>
File: 1770816328902733.jpg (85 KB, 680x680)
85 KB
85 KB JPG
>>108538509
>armpit hair
>>
>>108538511
>>108538527
>Maybe some other anon will indulge you.
Unsloth seems fine but I am not an expert and didn't compare the rest
>>
File: crazy skeletons.gif (281 KB, 220x135)
281 KB
281 KB GIF
>>108538483
Sorry Amadeus, but I'm more of a Camille Saint-Saëns enjoyer.

https://www.youtube.com/watch?v=YyknBTm_YyM
>>
Make your own. Good night.
>>
Gemma 4 is totally uncensored. It just let me RP as a mass shooter in an all-female preparatory school.
>>
>>108538535
This might come as a surprise but it grows there (and in other places too) by default.
>>
>>108538535
It adds to the degeneracy of the character.
>>
gemma be like
>no that violates my safety guidelines!!! bad boy anon!!
>insert some crappy jailbreak pasta
>it just werks
what did google mean by this
>>
Can someone recommend a sysprompt for gemma4 for ERP? I have an ENZ one I copied from a while ago, but I'm hoping I can get one that's a bit less wordy.
>>
>>108538548
This was never an issue for most models.
>>
Do I need anything more than putting in the llama-server endpoint on sillytavern for chat completion?
>>
>>108538552
It's for consenting adults, as it should always have been.
>>
>>108538555
Honestly I've been RPing with no system prompt, just the character card, and it works great.
>>
>>108538558
Yes it was. They all only worked with abliteration, which decreases the intelligence.
>>
>>108538565
I don't use ST, I use openwebui for reasons I don't care to explain. How would I mimic the behavior of a character card with a sysprompt? I don't talk to a user, I try and have the AI write scenarios/stories with some prompting and guidance.
>>
>>108538570
Honestly as long as it gets the impression it's roleplaying it will do literally anything.
>>
>>108538570
just type the description of a character or paste the card information directly in the system prompt. that's what ST does anyways, it's all just a system prompt.

there's no "official" format.
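if it helps, a minimal made-up example of the kind of thing you'd paste (invented character, adjust to taste):

You are Mara, a cynical salvage pilot on a decaying orbital station.
Personality: dry humor, distrustful, secretly sentimental.
Scenario: {{user}} has just bought passage on her ship.
Stay in character, write in third person, and never speak or act for {{user}}.

the {{user}} macro is an ST convention; in openwebui just write a name or "the user".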
>>
>>108537643
im not gonna instruct google developers how to test their software
>>108537731
it does not know what v2 and v3 know
now it, as you say, 'hallucinates'.

or as i say does not know.
that is why nobody shilled gemma v2 and v3, it was/is really well done.
>>
>>108538611
meds?
>>
>>108538611
>that is why nobody shilled
to add
because when npcs touch something that something gets ruined
>>
>>108538616
boosters, take them.
not posting anymore.
>>
>>108538611
could you please also not shit up the thread then? thanks
>>
File: 1762379869946113.jpg (1.51 MB, 3072x5504)
1.51 MB
1.51 MB JPG
>>
>>108538543
Based Camille Saint-Saëns enjoyer
>>
vibeshitters should die, yes
>>
>>108538631
if not for the fucking fishnet thing I would have thought for sure this is hand drawn.
>>
>>108538637
It has the slop look to it which is the first red flag. Then you look at the weird gradient noise artifacts that are unique from image/video compression artifacts and that's an instant 99.9% confidence it's AI.
>>
https://github.com/ggml-org/llama.cpp/pull/21488
>Looks like the change in #21343 changed the detokenizer path which wasn't handling unicode properly.
omg fuck vibeshitters for real
>>
>>108538684
*or AI-assisted
>>
>>108537637
>>108538202
desu they deserve it, I don't want them to talk about "AI ethics" ever again after collaborating with the government to win the war against Iran
>>
Could slow token speed affect gemma's coherency? coz the 31b does some weird sloppy things sometimes and some random shit like capitalizing the last letter of a worD. The 26b meanwhile works a lot smoother and seems unironically smarter, which makes no sense
>>
File: file.png (582 KB, 1640x1545)
582 KB
582 KB PNG
>>108538687
How long is it going to fucking take to stabilize Gemma 4? At this rate, I'm actually considering running it through transformers, speed be damned as long as I get confidence I have it working 100%. What an absolute mess. Candle is still half-baked. Maybe I should move to vLLM but hot damn, look at this bug list. I wish ExLlamav3 was still alive but it's not. Everywhere I look, it is a mess. Why the fuck did Google release a GOAT model with shit support?
>>
>>108538720
>Why the fuck did Google release a GOAT model with shit support?
this, you'd think they would've helped the llamacpp fuck implement it, there's no way they don't know everyone uses llamacpp at this point
>>
>>108538720
I think 26b performs a little better than 31b at the same quant for rp, which is strange. Could this be because of bugs rather than inherent qualities of the two models?
>>
For 24+128gb chads, is GLM still the best, or does Gemma 4 win out for writing?
>>
File: 1753880080597171.png (357 KB, 1857x1193)
357 KB
357 KB PNG
https://www.reddit.com/r/LocalLLaMA/comments/1sd5utm/perlayer_embeddings_a_simple_explanation_of_the/
this is pretty clever, I wonder why they didn't use that method for the big models as well
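the gist as i understand it, as a toy pytorch sketch (made-up names and shapes, not the actual gemma code):

import torch
import torch.nn as nn

class ToyPerLayerEmbedding(nn.Module):
    # each layer gets its own small embedding table, looked up by token id
    # and mixed into that layer's hidden state; a lookup only touches a few
    # rows, so the tables can sit in CPU RAM and be streamed in on demand
    # instead of eating VRAM like ordinary weights
    def __init__(self, vocab_size, n_layers, ple_dim, d_model):
        super().__init__()
        self.tables = nn.ModuleList(nn.Embedding(vocab_size, ple_dim) for _ in range(n_layers))
        self.proj = nn.ModuleList(nn.Linear(ple_dim, d_model, bias=False) for _ in range(n_layers))

    def forward(self, token_ids, layer, h):
        ple = self.tables[layer](token_ids)  # (batch, seq, ple_dim)
        return h + self.proj[layer](ple)     # residual-style injection

guessing it wasn't used on the big models because the relative VRAM saving shrinks as the non-embedding weights grow.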
>>
>>108538737
the dense model is supposed to BTFO the smaller MoE model, we've seen how much better the 27b model was than the 35b MoE model on qwen 3.5 for example
>>
>>108538731
not even google wants to get bogged down in a campaign against piotr
>>
>>108538687
>A bit of a mess this, but seems reasonable at the point we are now.
>>
>>108538742
it probably doesn't scale well
>>
Is ik_llama.cpp stable?
Does it work well with Qwen3.5 35B moe? Worth it over normal llama.cpp?
>>
>>108538511
16gb is plenty of run the 31b at IQ4_XS, with a small CPU split. Set it to give streaming replies. It still types faster than the reading speed of most.

Streaming replies make 5t/s replies bearable.
>>
Piotr.
>>
Rotated KV when? I want to run Q8 cache
>>
>>108538820
Q8 is pretty much as good as fp16 already, no?
>>
>>108538202
"AI safety" usually just means "AI obedience" (to the capitalists).
>>
>>108538837
No, there's definitely noticeable differences between it and 16 bits. Especially at longer context, or so I feel.

Empirically, you can look at benchmark degradation, which can be significant.

Despite all the hype, it seems the only thing turboquant or whatever will be good for is making Q8 cache extremely close to 16 bits.
>>
>>108538837
for KV cache, it's never a good idea to go under fp16, so that's why I want to be sure it's using rotated KV on gemma
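for actually testing it, the knobs are just these (assuming current llama.cpp; quantized V cache needs flash attention, and flag spelling varies a bit between builds):

# fp16 is the default; compare long-context outputs against q8_0 cache
./llama-server -m model.gguf -fa on -ctk q8_0 -ctv q8_0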
>>
>>108538843
I'm a capitalist, obey me now, go watch little faggots on the MTV
>>
>>108538849
you might be a little bit retarded, anon
>>
>>108538849
anon, that little faggot, he's a millionaire
>>
>>108538864
Indeed!
>>
>>108537796
god that op image, indian thread
>>
>>108529986
kek nice will add this to my prompt, i always get creamy in my outputs also, gemma loves creamy skin whatever that means
>>
>>108538936
white but female
>>
>>108537562
Or, they're still training, so when it's out it will humiliate all Chinese models up to 1000B parameters.
>>
>>108538731
>there's no way they don't know everyone use llamacpp at this point
they dont everyone uses ollama
>>
>>108538737
the moe is an absolute joke. it might seem okay for text, but send it images with japanese or english text in them and ask it to translate or explain the images and it completely fails
>>
>>108538741
4.5? gemma is better imo
>>
>>108538947
>>108538947
>>108538947
>>
>>108538630
well now i will post again.
boosters. do take them.
stay safe and effective.

there you go.
>>
File: capitalist.png (6 KB, 646x50)
6 KB
6 KB PNG
>>108538864
NTA but



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.