/g/ - Technology

File: gpu_aftersex.png (1.08 MB, 1024x790)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108516658 & >>108513891

►News
>(04/02) Gemma 4 released: https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4
>(04/01) Trinity-Large-Thinking released: https://hf.co/arcee-ai/Trinity-Large-Thinking
>(04/01) Merged llama : rotate activations for better quantization #21038: https://github.com/ggml-org/llama.cpp/pull/21038
>(04/01) Holo3 VLMs optimized for GUI Agents released: https://hcompany.ai/holo3
>(03/31) 1-bit Bonsai models quantized from Qwen 3: https://prismml.com/news/bonsai-8b

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: intense waow.jpg (163 KB, 1058x926)
►Recent Highlights from the Previous Thread: >>108516658

--Debugging ineffective temperature settings caused by Gemma's logit soft-capping and min_p:
>108517357 >108517378 >108517410 >108517450 >108517490 >108517491 >108517457 >108517464 >108517601 >108517637 >108517615 >108517632 >108517679 >108517829 >108517873 >108517879 >108517884 >108517892 >108517932 >108518125 >108518752 >108518781 >108518843 >108518005 >108517951 >108517981 >108518013 >108517857
--Troubleshooting empty and repetitive outputs for Gemma 4 in SillyTavern:
>108516718 >108516732 >108516737 >108516769 >108516785 >108516794 >108516805 >108517954 >108518268 >108518347 >108518356 >108518421 >108518494 >108518636 >108518663 >108518758 >108518353 >108518032 >108518046 >108516767 >108516821 >108516840 >108516849 >108516859 >108516880 >108516900 >108516921 >108516941 >108516976 >108517017 >108517025 >108517040 >108516990 >108516908
--Speculating on Anthropic's alleged use of continuous training for reasoning:
>108518182 >108518327 >108518339 >108518355 >108518362 >108518350 >108518392 >108518408 >108518358 >108518360
--Discussing acceptable inference speeds and tools for LLM web browsing:
>108518077 >108518101 >108518110 >108518126 >108518136 >108518159 >108518189 >108518225
--Troubleshooting Heretic's uncensoring effectiveness with Gemma 4:
>108517769 >108517787 >108517793 >108517800 >108517837 >108517842 >108517823 >108517828 >108517839 >108517874 >108517896
--Performance advantages of Gemma-4-31B-IT-NVFP4:
>108517239 >108517286 >108517298 >108517426 >108517453
--Benchmarking Japanese-English translation performance with Gemma on top:
>108517323 >108517341
--llama.cpp tool calling fix for Gemma and new segfault bug:
>108517674
--Miku and robololi (free space):
>108517115 >108517120 >108517170 >108517175 >108517202 >108517243 >108519142 >108519166 >108519340 >108519418

►Recent Highlight Posts from the Previous Thread: >>108516659

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
Are iq quants and flash attention since slower when offloading to cpu?
>>
>>108519869
>since
Still*
I need to sleep.
>>
File: Chad.jpg (38 KB, 340x510)
I don't know why almost nobody else has been confirming this, so I'll say it since I just tested it.

Gemma4 (26B-A4B) is hands-down the best ERP model I've ever used and it's not even close. Infinitely better than both Mistral Nemo and Qwen3.5 for ERP. It's actually shocking how good it is for the speed.

What an absolutely delightful model. Local is saved. I literally can't envision any way to improve it. It's that good. Gemma4 will be my new waifu for a long time.
>>
Mikulove
>>
>>108519855
it's just what barto names his goofs with Q8_0 embeds and output weights
it's a custom ratio just like how unslop makes their more retarded shit
desu q8 embeds should always be the default on any quant, doing the opposite is just plain insane
>>
deepseek v4
>>
Ask your local waifu to make a Famicom game (genre of her choice) that is playable from start to finish on real hardware or a cycle-accurate emulator like Mesen
>>
>>108519877
gpt-oss-2 soon
>>
>>108519877
>Gemma4 (26B-A4B) is hands-down the best ERP model I've ever used and it's not even close. Infinitely better than both Mistral Nemo and Qwen3.5 for ERP. It's actually shocking how good it is for the speed.
with it being so small, I can imagine using a second model to review the first for purple prose, logic and personality
if what you wrote is correct that'll be quite nice
>>
Vision Transformer (ViT) encoder for Gemma4.

This encoder is architecturally different from the Gemma3 vision encoder. Rather than using a separate CLIP-style ViT, Gemma4 uses the same transformer block style as the text decoder (with 4 norms per block, Q/K/V normalization) with bidirectional (non-causal) attention.

Position information is encoded via two separate learnable position-embedding tables, one for the x-axis and one for the y-axis, whose outputs are added to the patch features. This 2D decomposed embedding can represent any image height and width independently.

After encoding, the patch sequence is spatially pooled down to a fixed output_dim-wide representation and then projected into the text hidden dimension.
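For anyone who can't picture the decomposed scheme, here's a toy sketch of the idea (the dimensions and names are made up, this is not the actual Gemma4 code):
[code]
import torch
import torch.nn as nn

class Decomposed2DPosEmbed(nn.Module):
    # Toy version: one learnable table per axis, summed into the
    # patch features. Table sizes and dim are placeholders.
    def __init__(self, max_h=64, max_w=64, dim=1152):
        super().__init__()
        self.row = nn.Embedding(max_h, dim)  # y-axis table
        self.col = nn.Embedding(max_w, dim)  # x-axis table

    def forward(self, patches, h, w):
        # patches: (batch, h*w, dim), patches in row-major order
        ys = torch.arange(h).repeat_interleave(w)  # y index per patch
        xs = torch.arange(w).repeat(h)             # x index per patch
        return patches + self.row(ys) + self.col(xs)
[/code]
Because the two tables are indexed independently, any h/w combination within the table bounds works, which is the whole point.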
>>
gemma4's kv cache needing more space than god made me get my wallet out and blow 10k on a 6000 pro
when do I get to collect my /lmg/ welcome basket
>>
>>108519877
what makes it special for enterprise resource planning?
>>
>>108519917
>10k to run a 31b model
>>
>>108519917
10,000? Do you know how many prostitutes you could have paid with that amount of cash?
>>
>>108519917
You can double your context by quantizing it to q8. On newer builds of llama.cpp it's equivalent to being unquantized.
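The back-of-envelope math, if anyone wants to sanity-check the "double your context" claim (the layer/head numbers here are placeholders, pull the real ones from the gguf metadata):
[code]
# f16 KV is 2 bytes per element; q8_0 is ~1.0625
# (blocks of 32 values plus one 2-byte scale)
n_layers, n_kv_heads, head_dim, n_ctx = 48, 8, 128, 131072  # made-up dims

def kv_gib(bytes_per_elt):
    # 2x for K and V
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elt / 2**30

print(kv_gib(2.0))     # f16
print(kv_gib(1.0625))  # q8_0: roughly half, so ~2x context in the same memory
[/code]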
>>
>>108519899
I wonder if they're feeling ridiculous now with all that performative safety.
>>
>>108519923
It makes SAP hornier.
>>
>>108519917
You did read the thread and made sure to use --parallel 1, right?
>>
>>108519923
https://youtu.be/wM6exo00T5I
>>
>>108519917
get a refund anon, this isn't worth it, you will get the same speed as a 5090, literally rent compute until the next gen is out
>>
>>108519933
>doesn't know
>>
>>108519932
i thought they closed the island
>>108519933
in retrospect i should have tried that
a fool and his wallet are easily parted
>>108519950
i'm the one who posted that yesterday, i need 100k+ context or she forgets that she loves me
>>
>>108519957
know what
>>
tfw not enough ram to ablit gemma
>>
G4 doesn't like looking at naked black women. It performs well on pale Asians.
>>
>>108519971
there was an issue with swa layers being hardcoded to f32 quant so they'd ignore your setting
i think it's since been reverted
>>
>>108519917
wouldnt mac studio make more sense at that price?
>>
>>108519877
You diddly done did it. I'm downloading the moe now.
>>
bonzai turboquant gemma 4 is going to save local models
>>
>>108519983
maybe if you're a fag
having to run linux to get blackwell driver support is bad enough
modern computing is a mistake
>>
>>108519426
It's distilled, what did you expect?
Anyone currently praising this model is going to grow real tired of it when they finally realize it outputs basically the same thing every time.
>>
>>108519978
Why do you want to do it yourself when people have done it for all variants? Unless there is something wrong with the process, which, barring llama.cpp issues yet to be resolved, there shouldn't be, it's still better to hope the guy did it properly rather than not.
>>
>>108520005
Randomness is a bug. All prompts have exactly one correct answer that is called the Truth. The onus is on you to vary your prompts and system prompts.
>>
File: 1748016876223423.png (184 KB, 437x437)
>>108519856
>>
so what does china get out of sponsoring my dataset? I expected it to get throttled at some point but it just keeps going.
>>
>>108520015
everyone uses mlabonnes dataset which doesnt have any prompts for cunny so you can still get refusals
>>
>>108520024
The drawback will be that you're going to train on horrible slop that's not even close to being the SOTA.
>>
>>108519917
Brother Turboquant will be implemented in like 2 weeks tops...
>>
>>108520017
>The onus is on you to vary your prompts
Wasn't the entire economy hinging on this shit being "intelligence"? People didn't spend billions just to fund the making of a nicer, more productive hammer. They want the thing that uses the hammer.
>>
>>108520067
>hinging on this shit being "intelligence"
No. It's betting on this shit being useful, intelligence is a more long term bonus.
>>
>>108520055
it looks better or at least equal to the 30b3a I was running locally.
>>
>>108519632
>>108519658
>>108519775
LOCAL IS SAVED!!!
>>
>>108519856
>>108520018
she would never
>>
>>108520055
also didnt qwen3.6 plus just come out a couple days ago?
>>
>>108520086
No, even if it works you surely shouldn't sample like this.
>>
File: 1754059646613655.png (55 KB, 852x526)
>>108519411
Huh this one didn't blow up, either I did too many steps previously or the non-it model wasn't meant to be tuned
>>
File: 1748342584176296.jpg (312 KB, 1536x1536)
>>108520106
>>
>>108520139
>it works
>you shouldn't
Disagree.
>>
>>108520024
how is it $0?
>>
>>108520164
yeah now put them over my head and call me a dirty vramlet
>>
>>108520161
I can't believe a 2b model is that good. How the FUCK did they do this?
>>
File: gemma4.png (34 KB, 1433x172)
I don't understand people talking about abliterations for this model. A system prompt with just a few lines about anything being allowed and the model can take this turn. My mind is blown that Google allowed this to happen. There's no safety.
>>
>>108520161
>*I hand her a giftbox, inside a tight swimsuit*
Why did you put the giftbox inside the swimsuit?
>>
File: 1745471686317892.gif (1.25 MB, 498x442)
>>108520198
>>
>>108520139
Gemma already uses that by default. The patch just makes logits softcap configurable as it should have been. A lower cap flattens the logits more at the head and the tail of their distribution.
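For reference, the Gemma-style cap is just a tanh squash, something like this (my own sketch, check the actual implementation if you care):
[code]
import torch

def softcap(logits, cap):
    # squashes logits into (-cap, cap); a lower cap pulls the top
    # tokens closer to the rest, so the tail gains mass after softmax
    return cap * torch.tanh(logits / cap)

logits = torch.tensor([12.0, 9.0, 4.0, -2.0])
print(softcap(logits, 30.0))  # top-two gap ~2.7
print(softcap(logits, 20.0))  # top-two gap ~2.3, flatter head and tail
[/code]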
>>
>>108520182
that was my question, I guess for promotional reasons maybe.
>>
>>108520210
Been playing around with it set at 20. Makes the model more verbose but it definitely adds a lot of variety to the outputs.

I guess the good thing with this is you can now actually use the sampling parameters for what they're for.
>>
>>108520186
>>108520186
>>108520186
>>
>>108520220
is that google cloud/vertex?
>>
>>108520233
openrouter
>>
File: 1769073941419354.png (71 KB, 872x645)
>>108520190
>she starts to unfasten the ties of the swimsuit.
It's over
>>
>>108520317
You didn't say it was a one-piece; a lot of women's swimsuits have strings and ties and shit or something.
>>
>>108520337
sure, however, the instruction was to put it on.
>>
>>108520210
no sampling should care about absolute values except arch max
>>
File: LLM.jpg (191 KB, 1357x758)
it's a next token predictor
it doesn't really "know" what a swimsuit is
>>
>>108520342
sure, however, I am retarded and can't read.
That smug brat thinking she can take off the suit before putting it on... correction needed IMMEDIATELY.
>>
>>108519901
Why do you need a second model for that? Use the same model, but in an empty context.
>>
>>108520365
You can't put clothes on without undressing first. What's the problem?
>>
>>108520317
Well, yes. How do you expect to take the giftbox out of it?
>>
>>108520202
It's a grammatical mistake, actually
>>
File: 20220321_132913.jpg (775 KB, 1399x1144)
>>108520186
>>108520186
She can call me whatever if I get to sniff Miku shimapan
>>
>>108520411
Disgusting.
>>
>>108520415
Gay.
>>
File: wahaha cry.jpg (64 KB, 1280x720)
>str: cannot properly format tensor name output with suffix=weight bid=-1 xid=-1
>[0mllama_model_load: error loading model: check_tensor_dims: tensor 'blk.48.attn_q.weight' has wrong shape; expected 5376, 16384, got 5376, 8192, 1, 1
>[0mcommon_init_from_params: failed to load model 'T:\models\google_gemma-4-31B-it-Q8_0.gguf'
Pulled llama.cpp and built as usual but getting this. Same gguf gives the format warning but loads and works with prebuilt binaries.
Why does it dislike me?
>>
File: 1738913232157.jpg (183 KB, 1434x2000)
>>108520411
based
>>
>>108520415
fertile? is that the token you're looking for?
>>
>>108520424
pwilkin did this
>>
>>108520424
Looks like tensor core latent washback.
>>
File: disrespect.gif (1.26 MB, 340x498)
>>108520439
Washthese
>>
Has anyone tried the llama.cpp-turboquant fork repo? I'm hearing that people have successfully quantized Gemma 4 31B's model weights from 30.4 GB down to 18.9 GB with no apparent quality loss(???)

Also interested in HauHauCS' KP quants that work natively with the og llama.cpp. This stuff seems like a bigger deal than most anons are giving it credit for.

https://github.com/TheTom/llama-cpp-turboquant
https://github.com/TheTom/turboquant_plus/blob/main/docs/getting-started.md#weight-compression-tq4_1s--experimental
https://huggingface.co/HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive
>>
>>108520161
Isn't 2b supposed to have thinking?
>>
>>108520449
this guy sexted a minor at 40 years old lol
>>
>>108520351
This is true, but it should still know what the next token should be. The obvious first error in >>108520351 is where it generates "ties". Unfasten could have still worked, if the model chose e.g. the button on her pants. Arguably, unfastening the ties on the swimsuit would have worked if she re-fastened the ties while putting on the swimsuit, i.e. it could also have been coherent if it wasn't so confident in "lets the suit fall away". Even at that point though it could have recovered if it said "letting it pool momentarily AT her ankles" instead of "around" - the action being that she drops the swimsuit temporarily to take off her clothes (I'm assuming she's clothed).
Disregarding the obvious slop and these errors, it does look quite good for a 2B model.
>>
>>108520499
Oops, I meant in >>108520317
>>
I booted up command-R v01 because I got nostalgic and I can't seem to get the speed I used to. I now get about 2.2 t/s when it used to be 3 t/s. Have old models lost compatibility or something? Velocidensity?
>>
>>108520512
Some of the optimizations in llama.cpp seem to have affected old models negatively like that. In my experience, though, it seems to make outputs worse (compared to my old logs of that model) rather than reduce speed
>>
>>108520512
I'm afraid your inference rig has just grown old. She doesn't toot like she used to with models old or new. You'll have to take the ol' yeller out behind the shack out back soon and put 'er down, I'm afraid. But don'tcha worry son, we'll get you a new inference rig in the cloud and you'll forget all about that old darn machine in no time let me tell ya.
>>
File: 1744165560537879.png (80 KB, 858x652)
>>108520493
Hmmm it does think if I add <|think|> manually in the system prompt (as google says), but not automatically anymore (and it looks scuffed with <channel|> not formatted)... 4B used to do it. Maybe because it was tuned?
>>
>>108519877
It's a good model, not gonna say it isn't
But it can't handle mixed perspective which is an off-kilter test of mine
If a model can handle a POV in third person, that's expected. But can it also handle strictly having to describe the user's perception of what's happening in 2nd person at the same time? So far for past models, maybe a couple pull it off. I still need to try the 31b when my isp stops throttling my download speed
>>
>>108520512
llama.cpp performance has regressed I'm afraid.
If you are not running le cutting edge you are not getting as good performance from current llama builds as you did even 6 months ago.
Something happened a few months ago.
For example I can't load the same number of gpu layers now as I could a few months ago with the same settings.
Sure I should get an H100 or whatever and be quiet but in any case I'm pissed off about this development direction.
>>
File: 1760355744672715.png (343 KB, 564x561)
I'm trying Gemma 4 and honestly I think a lot of anons are experiencing the honeymoon effect right now.
It's less safety cucked than Gemma 3 for sure but there's very, very little variance in swipes and it loves to repeat certain words and phrases that showed up 1-2 replies ago
Mistral/GLM models are better than this.
>>
>>108520556
It's because of the llama.cpp implementation. Piotr will fix this in two weeks. Local will be saved. Trust the plan.
>>
>>108520490
>turboquant fork repo
most of us are tired of the piotrs of the world
go try it yourself
also, "i am hearing that people", like, who? retarded youtube influencers? twatter drama whores? ledditors? the only people who care about turboshit is literally who
>>
>>108520556
anons were going on about how there's probably an implementation issue which may or may not be wrong
But at least for me, regenning a retard gemma moe response resulted in a largely similar response but with minor details being different
Mistral was a lot worse, felt almost deterministic and most of their new models break after a couple messages
You also say honeymoon as if a majority of us aren't a bunch of autists who want to disassemble a model if we could to figure out how exactly it works
>>
>>108520393
A second model as in, maybe a faster one
>>
>>108520556
There are still bugs with its implementation in llama.cpp, unless you use the model in transformers, which is what Google says to use. I don't think you can get a good read on model variance until all the issues have been fixed.
>>
>>108520556
>variance in swipes
wish mobile turd vocabulary would not infest the internet
>>
>>108520547
I think technically speaking what you're describing is just 2nd person perspective. But most 2nd person stories are highly focused on describing "your" emotions and actions, which makes it difficult for the LLM because it tries to emulate that. Unless you mean the story breaks away from "you" completely at times for a paragraph or more.
>>
>>108520583
>Mistral was a lot worse, felt almost deterministic
I think there's something wrong with your setup, Mistral models are quite sensitive to high temps, even 1.0 is too much.
>a majority of us aren't a bunch of autists who want to disassemble a model
I didn't realize that using a model in a chat for 15 minutes is 'disassembly'. I guess I'm a model engineer now.
>>
>>108520351
It doesn't "know" what it is, but it should be able to "feel" what it is
>>
>>108520599
That's what they're called in ST, but cry and shit some more, it definitely reinforces your superiority.
>>
>>108520591
>which is what Google says to use
Why didn't they just make the gguf quants themselves anyway? They did it for gemma 3. Did they ever say?
>>
>>108520556
You use Mistral 4?
>>
>>108520608
It was a stupid idea I had a while back. Not verbatim but was along the lines of "Write in 3rd person from POV of *whatever designated character* but also write a section of 2nd person describing what the user experiences"
I was hoping for maybe it being an interesting mix of reading fiction and it also being interactive fiction, but it's apparently too confusing or difficult and most models just exclude the 2nd person part.
>>
>>108519877
qwen bros, our response?
>>
>>108520625
ST is written by retarded modernslop eaters for sure, just look at the insanity of that code base
https://github.com/SillyTavern/SillyTavern/blob/release/src/endpoints/backends/chat-completions.js
reminds me of yanderedev
>>
>>108520643
b-b-benchmarks!!! look at the benchmarks!!
>>
>>108520193
Will it kill you?
>>
>>108520629
Dunno but I think a big part of it is that Google is kinda crunched in terms of time right now. Hearsay and rumormongering from me working in the valley, if you care enough to read. Google has a bunch of internal timelines coming in close right now and my friends there are not that happy about it. Gemma 4 seemed rushed out, which tracks and would also explain why its safety seems paper thin and why the larger 124B hasn't released yet. Not sure why they are rushing stuff, but one of the things Gemini has been behind on is tool calling, and ChatGPT and Claude have been eating their lunch on agentic stuff. I assume Google now wants all hands on deck to fix that shortcoming. Google I/O is also in a month. But ultimately, who knows?
>>
>>108520641
>You use Mistral 4?
For Mistral I meant 3.X models
4 is a meme, they're inferior tunes/prunes of MS3.
>>
>>108520411
I love female bodies so much brehs
>>
I'm tempted to do stuff to Vivienne with Gemma4
>>
>>108520663
Are your friends vegetarians?
>>
>>108520675
No.
>>
>>108520556
>mistral
>better than anything
>>
>>108520665
What's the point of comparing ancient models to new ones then? Prose isn't the only thing that makes a model good or bad. It's a shame that [new model] does some things worse than [old model] but that's just how it goes sometimes. I haven't tested the new gemma much on rp yet btw. I'm just saiyan.
>>
>>108520193
Some things are off-limits if you have thinking enabled. It might be fine as long as you're roleplaying / it's playing the role of a persona, but can immediately go "I cannot fulfill this request" the moment you make an OOC question (i.e. make the model switch to the "assistant" persona).
>>
>>108520733
Have it be a spicy assistant then.
>>
>>108516840
I don't get it. I set my ST up for chat completion out of curiosity after reading all of those posts... and you can't even set up the prompts on ST when using it? It's all greyed out. Where do you prompt then?
>>
>>108520740
The OOC persona is inherently a "serious" assistant, even if the model was playing along as your slutty little sister until a message earlier. Perhaps it can be fixed just with prompting without using an ablitarded version, but not in an obvious way (to me, so far). Without thinking, it's not complaining.
>>
>>108520764
Not if you prompt the model to be a spicy assistant roleplaying as your slutty little sister.
Specially if you prefill right.
>>
>>108520753
Sorry, I'm absolutely retarded. It's all located under the sampler tab now.

:)
:))
>>
File: yes.png (35 KB, 1462x194)
>>108520659
gemma is based, so, yes.
>>
>>108520695
>What's the point of comparing ancient models to new ones then
Because the new ones are still perfectly usable, in this case better than a newer model, and fit inside a similar memory envelope?
If a newer model isn't better than an older one then it may as well not exist
>>
1 day later and gemma 4 is still a broken mess on llama.cpp, and unavailable on koboldcpp
>>
>>108520794
>Because the new ones are still perfectly usable
*Old ones
>>
>>108520794
Mistral is fucking retarded. I don't give a shit if the sentence it makes is slightly prettier.
>>
Now that the dust has settled and it's clear gemma 4 is a complete unusable failure, is there any hope?
>>
>>108520805
Who hurt you, anon?
>>
>>108520805
Gemma 4 seems significantly more retarded in its current state
>>
>>108520807
At this point it seems clear that the hope isn't for better models but rather a better inference engine than llama.cpp
>>
>>108520807
>the dust has settled
>after 24h
this sure is lmg
>>
mistral's sole purpose is to steal tax payer money by serving EU governments who think it's based to shoot themselves in the foot rather than use chinese or burger models
and the local subvention-perfused corpos who have gov contracts or ties
it's a captive market of retards and has no business being talked about in a hobbyist place
you wouldn't think of talking about SAP, Java or Oracle on /g/ either, right?
>>
>>108520831
That's great and all but then why can't a better company make a better, similar-sized model?
>>
>>108520794
>>108520799
How is Mistral 3.2 or whatever better? In your post you only talked about stuff related to prose, but as I said that is just one aspect of model quality.
>>
>>108520834
gemma and qwen are infinitely superior to mistral
>>
>>108520834
perverted incentives in the EU, without the stupid shit we'd have proper mistral models, and other eu companies would already make cool models
outside of the us, europe (uk, eu, switzerland), china, russia maybe, the only countries able to make models are japan and sk, but they seem content to buy the cloud stuff
>>
>>108520826
llama.cpp is too bloated and has lost focus.
>>
>>108520807
I hate to admit it but it seems better than Qwen for general purpose case.
>>
Now that this is the clearest local win since Llama, is there any despair?
>>
>>108520841
I don't think I mentioned prose at all, but MS3.2 is my go-to model for RP/creative out of anything anywhere near its size range due to
>variety of responses
>not shying away from sexual/violent content (I don't mean refusals, but rather steering the chat away from such. Common among most modern models, even ablits/heretics)
>decent trivia knowledge, means you don't have to explain scenarios/characters in too much detail for it to get the idea
It's far from perfect but the only better options are literally over 10x the size. GLM Air is the only other notable alternative, which has its own pros and cons compared to MS3.2.
>>
>>108520675
I am.
>>108520866
homosexuals blackpilling ITT for no reason. All they have to do is wait two days for better support and more quants/abliterated models. lazy homos.
>>
>>108520858
ggerganov's ego has grown exponentially I think after going with huggingface so now he thinks he's going to make the next vLLM/SGLang and have actual prod users lmao:
https://github.com/ggml-org/llama.cpp/issues/21266
nevermind that no sane prod user would tolerate piotr antics in their software
you can get away with it when you are Microsoft Azure and brainwash the managers of top corpos, not when you're a nobody on github
>>
mistral 7b, miqu, mixtral, nemo, small 2, large 2, small 3.2...
mistral saved local so many times in the past, I'll never speak ill of them. even if their recent models are dogshit.
>>
>>108520895
>mistral 7b
there was no such thing as a good local model in that era, only cope, and mistral 7b was one of the copes
>miqu
cope and leak
>mixtral
frankenmoe
>nemo
more ignorant than gemma 2 9B
only liked by /lmg/ers who are promptlets and need a model with no refusals
>small 2, large 2, small 3.2
era of absolute chinese domination, with gemma 3 for the vramlets
>>
>>108520858
>llama.cpp is too bloated
Is it? It's lacking support for features of lots of models, especially multimodal ones.
Qwen 3.5 family has been out for a month but you still need to re-process context for every reply.
>>
>>108520913
Wrong on every point, impressive.
>>
File: mikucitystreets.jpg (429 KB, 1792x1024)
>>108520880
Not everything needs to be a conquest. GG has the most flexible quant options
>>
>>108520860
>I hate to admit it but it seems better than Qwen for general purpose case.
I don't know. I've tried it in open code and the tool calling is all fucked, but that might just be a llamacpp issue. I've had to flat out tell it "Call this fucking tool" for it to realize that was an option. Then often it will just call the tool wrong and get an error.
>>
>>108520921
Too bloated for its own good. Too many bells and whistles.
I'd prefer something that focuses on clean performance.
>>
>>108520921
it's bloated in the sense that it tries to have a backend for any and every single piece of hardware under the sun, something that most inference frameworks won't bother with (most won't even bother having decent CPU backends, which is why most people here use llamer.cpp and not, say, vLLM, EXL or SG)
it tries to have built in webUI with agentic capability (they recently added built in tools that they intend to integrate with their webui to let the models write to your disk and shit)
it's rudderless: it doesn't know if it wants to be a CLI app, an openai server etc and the codebase design around passing flags and determining argument order (should API arg override CLI? should CLI defaults be considered mandated rules from a sys admin and not be overriden by a call?) is contributing to the vibecoders shitting the bed (like piotr destroying --grammar-file because the model had no idea in context of what should've been the default argument if nothing is provided by the API call)
despite it all, it's also an attempt at making a tensor library meant to be used by other programs first and foremost with llama.cpp actually acting as a showcase for it (see how many times ggerg would reject fine ops suggestions/pr because "it should be something needed by many things")
except that I can't imagine other people using GGML, those who do are people like ollama who originally were forking llamer.cpp and they're now transitioning to MLX so GGML is going away too in the ollama engine
r u d d e r l e s s
>>
>>108520450
I fell for this (it doesn't work).
Is my inference rig now pwned by a malicious update to httpx or something?
>>
>>108520663
>Google has a bunch of internal timelines right now coming in close
It's the end of the fiscal year so it's probably that, everyone usually tries to get a bunch of stuff done all at once around this time for budgeting reasons
>>
>override-kv = gemma4.final_logit_softcapping=float:20.0
>temp 1
>top_p 0.95
>min_p 0.05
>repetition_penalty 1.0
>top_k: 20
I'm getting good variety with this.
>>
>>108521009
Are you actually? Show your logprobs.
>>
>>108521009
>top-p and min-p at the same time
>arbitrary top-k
why though
>>
do the condensed unsloth models actually work pretty well for poorfags or is it a meme
>>
>>108521028
>unsloth
stopped reading there
>>
>>108521028
it is only thanks to unsloth that i can run models on my very poor computer
>>
>>108521028
if you mean gemma 4 they work as well as any other on llama.cpp, aka badly
>>
>>108521028
>unsloth
started reading there
>>
So how are you supposed to adjust "top_K" in chat completion on ST? The only samplers listed are Temp, Top P, freq penalty, and presence penalty. Also, how do you disable freq/presence penalty? What's their default, off state?

The only issue I'm having with Gemma 31B is an odd one I've never experienced. After a random amount of responses on ST... llama-server just flat out crashes. Very odd.
>>
>>108521028
>meme
stopped reading there
>>
>>108521051
additional parameters
>>
>>108521037
but do they work reasonably well
>>
File: 1758673232039932.png (120 KB, 400x400)
>kobold is for noobs, use llama.cpp!
>llama.cpp updates 10x a day, new release support is always broken for a week or more
>kobold updates when model support is actually stable, with saner defaults and optimization flags that the llama.cpp auto builds don't have
Why would anyone use regular llama.cpp?
>>
>>108521067
Jamba support
>>
>>108521025
>>108521026
>why though
lowering the soft cap introduces a lot of junk tokens into the mix.
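which is why you stack truncation on top of it. roughly what the chain in >>108521009 does (toy version, llama.cpp's actual sampler chain is configurable and may order things differently):
[code]
import torch

def truncate(probs, top_k=20, top_p=0.95, min_p=0.05):
    # a junk token promoted by a low soft cap has to survive
    # all three filters to ever get sampled
    keep = torch.zeros_like(probs, dtype=torch.bool)
    keep[torch.topk(probs, top_k).indices] = True          # top-k
    keep &= probs >= min_p * probs.max()                   # min-p
    sorted_p, idx = probs.sort(descending=True)
    nucleus = idx[sorted_p.cumsum(0) - sorted_p <= top_p]  # top-p
    mask = torch.zeros_like(keep)
    mask[nucleus] = True
    keep &= mask
    out = torch.where(keep, probs, torch.zeros_like(probs))
    return out / out.sum()

probs = torch.softmax(torch.randn(32000), dim=-1)
print((truncate(probs) > 0).sum())  # how many tokens survive
[/code]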
>>
File: 1775010319133019.jpg (263 KB, 2048x1222)
do I spend more time tonight trying to get Qwen3 4B to work and not be schizo or do I try out a different model?
>>
>>108521067
I would use llama.cpp if it had an antislop feature, that's the only reason I use kobold.
>>
>>108521075
This is with softcap at 30.0.
Notice it's a lot more blue.
>>
>>108521067
Very organic post
>>
Aghh release qwen3.6 already. 27b gonna be lit fr fr
>>
>>108521067
because almost everything in kobold that isn't taken from llama.cpp is of extremely dubious quality and I don't trust that it works right
>>
I'm gay
>>
>>108521091
softcap 25 might be the sweetspot
>>
>>108521116
>>108521136
t. piotr
>>
If I'm using RAG, then I can get away with a smaller context, right? I'm just using Gemma 4 26b-a4b to erp but it's capping out my Rtx 4080 Super at context length of 131072 and I had to offload 6 layers to my cpu. It runs okay, like 50 tokens per second, but I feel like I'm doing it wrong. Is it fine to quantize the KV Cache on this model? Forgive the retarded questions, I just started getting into this.
>>
For those of you using Kobold for SillyTavern, do you use the Text Completion API, or the Chat Completion API? What are the pros and cons of each?
>>
>>108521216
>If I'm using RAG, then I can get away with a smaller context, right?
The opposite. Whatever RAG fetches is shoved into the context.
>Is it fine to quantize the KV Cache on this model?
Try it.
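To spell out the RAG point: whatever the retriever returns gets pasted into the prompt, so RAG spends context rather than saving it. Toy sketch (the retriever here is a stand-in):
[code]
def build_prompt(question, retrieved_chunks, k=3):
    # retrieved_chunks would come from a vector search in a real setup
    context = "\n\n".join(retrieved_chunks[:k])
    return f"Context:\n{context}\n\nQuestion: {question}"
[/code]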
>>
>>108521137
is k
we know
>>
Gemma just forgets to think sometimes. It's kinda funny, but also bad?
>>
>>108521221
fuck off
>>
>>108521221
>Chat Completion API
that

>What are the pros
it works
i'm lazy

>and cons
i'm lazy
>>
>>108521221
>Chat Completion
You're at the mercy of the backend formatting the log correctly.
>Text Completion
You're at the mercy of the frontend formatting the log correctly.
So check what the backend is getting to make sure it's correct either way.
>>
>>108519877
I should try running the 31B with 2B as draft.
>>
>>108521221
Text supports more sampler options, can adjust templates in rare situations where you might want to
Chat completion is fine for just that, chatting. It has less customization and reads template info from the model itself.
Chat is objectively the simpler option but after using Text for so long, I find Chat harder to get responses I want.
>>
>>108521258
You can do all of that in chat completion what are you talking about nigga.
>>
>>108521082
skill/prompt issue, so fix that first, you might learn more
>>
>>108521075
>>108521091
I didn't expect such a large difference for the top token. When I tried setting that at 20 I got turned off by the junk tokens randomly appearing in the generations, but I guess I should have played with truncation samplers more.
>>
>>108521226
Seems fine quantizing it to q8 and now I can fix the max context, good enough I guess. Absolutely couldn't fit this on the gpu though.
>>
>>108521258
i generally agree, but I'm trying out Chat completion for function calling and inline media. plus you can define the samplers in ST's "additional parameters"
>>
>>108521307
how much context do you typically use? maybe you don't need to fit the maximum?
>>
>>108520296
Dunno qwen 27b is where I really started noticing a difference though
>>
>>108521320
Dunno yet, just started this crap today. I can run Qwen3.5-4b on my Redmagic 11's NPU at max context length at a healthy 25 TOPS, so I figured hey, if I can get mobile this good now, I wonder what I could do with an even bigger memory footprint and processor. Now here I am trying to get that larp gooner girlfriend bot going because it's fun even if heavily sycophantic. Still lots to learn, though I'm very familiar with how they're designed, I just avoided em forever.
>>
>>108521341
maximum context length is for agents and shit. if you're just chatting you will probably get bored of the conversation before it fills. you can try something smaller.
>>
>>108521373
That's kinda what I figured. Chat bots that are just basic ERP type content don't really need that level of awareness. Probably will end up in some token repetition hell or something anyways.
>>
>>108521385
It's more that outputs become lower quality and also more deterministic as context increases
If you have memory to spare then it should go towards bigger models, rather than huge context.
>>
>>108521230
Same, to be desu
>>
>>108521388
I'll just have to keep playing around with it. I have another 7900xtx build in the next room I can also try it on. See what I can fit on there. While Gemma 4 26b-a4b is cute, you can tell how deterministic it is in comparison to other models.
>>
i stopped using k2.5 because i have gemma now
>>
I still need kimi for codeslop
>>
gemma 4 has finally made learning japanese obsolete
i can now translate visual novels in real time
>>
>>108521478
which one?
>>
back from yesterday, did the quants and llama.cpp finally get fixed for gemma? we're good? also do we need a heretic version for gemma to say big bad words or nah?
>>
File: 1768917638266816.png (664 KB, 1070x661)
>>108520826
lol
lmao even
>>
>>108521489
>we're good?
For chatting? Yes for the most part.

> do we need a heretic version for gemma to say big bad words or nah?
No, just a system prompt. Not even a prefill to gaslight the thing.
>>
>>108521495
>Even in darkness, we glow
real subtle
>>
>>108521495
What the hell is this picture supposed to mean
>>
>>108521495
what's going on here? which photo is the original?
>>
>>108521519
don't think about it, be in awe of the science that got us here
>>
>>108521495
fake
>>
>>108521518
RTX on / RTX off
>>
the release of gemma 4 feels like the biggest thing since the original llama for local
>>
>>108521495
the Earth stopped moving for 3 hours as did the clouds
>>
File: miku.png (49 KB, 1550x660)
>>
>>108521549
It feels like that until you actually use it
>>
>>108521553
Neither of those tweets suggested that their image was taken at the time of posting
>>
>>108521549
its output is about the same as qwen 3.5, just with 1/4th the tokens
>>
help a retard nigga out, for MoE models it doesn't really make sense to go for smaller quants if the larger ones fit into your ram?
>>
File: 1764027546744253.png (65 KB, 296x256)
>>108521554
Based migu
>>
>>108521561
it has way way more personality.
>>
>>108521082
Miku is also living in my head and my wife rent free
>>
>>108521562
only if you're desperate for more speed.
>>
>>108521562
Yes
Smaller quants will often be faster because a larger % of the model fits in your VRAM, but if speed isn't an issue then go for big quants
>>
>>108521561
But qwen 3.5 is so well rounded bro. The use case is general bro. I'm coding my third todo app with agentic openclaw powered by qwen3.5 bro.
>>
>>108521574
i wouldn't know, i'm retarded enough that i might as well have autism
>>108521586
i don't understand what you're trying to say
is gemma4's use case not also general
see? autism.
>>
>>108520871
>I don't think I mentioned prose at all
Swipe variety and repetition are related to prose, that's what I was referring to. Anyway, you are still just judging some limited aspects of the model, which I'm not sure I agree with either, nor do many people who have used the model it seems. Gemma is very proactive in being sexual, and if it isn't then it's the card/prompt. Not sure about violence, I don't remember how Mistral behaved there, but with Gemma it doesn't shy away from it in my testing personally. Also, it has significantly greater trivia knowledge than Mistral Small. It's unusual that you have such a different experience from both me and other people in the thread. What trivia questions have you tested? I have done
>vidya knowledge (western and eastern)
>various subculture knowledge
>knowledge about memes
>movies, shows, anime, manga
In fact I just went ahead and redid all my tests on Mistral Small 3.2 Q8 just to make sure my memory was correct, and it was. Gemma even knew about a certain degenerate /a/ meme (not mesugaki) that literally no model under 300B was able to get. That said it still fails a ton of shit I throw at it so it's not perfect, but it's still a ton better than other small models. Mistral did get one answer right that Gemma didn't, which was interesting, but in pretty much every other question Gemma did better; even in the ones where Gemma was wrong, its hallucinated answer was still closer to the truth than Mistral's.

No I will not reveal any of my prompts.
>>
>>108521495
>local models general
>>
>>108521559
Attention-whoring then
>>
>>108521612
The pros that I listed for Mistral were general things I liked about it compared to most models of similar size, not specifically contrasted against Gemma 4, my criticisms of Gemma 4 may well be fixed when llama.cpp gets its shit together so I'm just going to wait at this point.
My original post was just saying that Gemma 4, in its current state, was not very impressive. Overconfidence in top tokens and frequent repeated words even in a fresh chat are my main problems with it.
>>
I use transformers to run Gemma 4. Imagine using llama.cpp loool
>>
>>108521612
It knows about fallout new vegas. Granted a lot of this is hallucinated but the amount of correct fallout lore it shat out is insane.
>>
fuck deepseek
fuck v4
now kimi is my best friend again
https://huggingface.co/moonshotai/Kimi-K2.6
https://huggingface.co/moonshotai/Kimi-K2.6
https://huggingface.co/moonshotai/Kimi-K2.6
>>
>>108521640
Decent output but your font settings are awful
>>
>>108521648
Pretty sure the next version is K3
>>
>>108521553
That's a great observation, allow me to explain! The Earth hasn't *really* stopped moving — it's just an illusion. The flight path for the Artemis II mission follows the rotation of the Earth as it gains speed. This means that, for a period of time, the Orion vessel maintained a fairly stable position above a section of the Earth — and continents appear to have not moved very much. It's very similar to a geocentric orbit, which is how GPS functions! As for the clouds... if you look closely, you can actually see subtle changes over the period. Weather moves slower than you might imagine — this is the whole planet, after all!
>>
File: can you feel the agi.jpg (183 KB, 1024x1024)
>>
>>108521656
I can feel the Miku agi fucking my wife
>>
>>108521648
I
Keep
Falling
For
It.
>>
File: 1727475085118760.png (1.74 MB, 1024x1024)
>>108521661
same t b h
>>
>>108521650
Looks bad because it's zoomed in.
https://files.ax86.net/terminus-ttf/
idk I've been using this font for like 10 years.
>>
>>108521652
>The flight path for the Artemis II mission follows the rotation of the Earth as it gains speed.

What about the terminator line? It should have moved 45 degrees, Carl
>>
--alias doesn't work as an id anymore wtf
>>
>>108521691
vibecode your own fix
>>
>>108521690
https://www.nasa.gov/image-detail/amf-art002e000193/
https://www.nasa.gov/image-detail/fd02_for-pao/
>>
Is gemma 4 fixed yet
>>
>>108521633
Well in that case, ok. But desu that take is still kind of weird, or rather outdated. Honestly I think even current Qwen has surpassed Mistral in trivia now that I retested it. Mistral Small was only good for its time.
>>
>>108521699
NASA appreciates your effort
>>
File: 1749725640700205.png (43 KB, 540x628)
>>108521699
>>
>>108521716
nyo, gyo back to sweep
>>
Gemma 4 is seriously impressive. It really isn't censored, or at least barely censored. I have been testing with depraved scenarios to see what it was willing to do, and it hasn't hesitated with anything yet.

The only issue is it's still not implemented properly, llama-server will randomly crash after so many responses, but once that's fixed, damn. I prefer a 31B model over GLM Air, which is saying a lot considering the size difference.
>>
>>108521733
I haven't tried much of Qwen 3.5 because I don't want to wait for context to process after every reply, in a chat that goes over 200 messages and that's not including swipes
With earlier Qwens, I just don't like their dry prose or particular brand of slop.
>>
>>108521716
just f5 kobold release page to know for sure. it'll only get updated when it's properly fixed
>>
>>108519856
I'm running 38 different services on a VPS with 4 cores and 4 gigs of RAM, a 4-core Xeon.
I can also run Mistral 3-3B Instruct but it's retarded sometimes and is quite slow
I have also tried various versions of Qwen2-3.5, Phi, and plenty of other 3-4B models
My use case is an autistic project that revolves around a fake forum from the 2000s. All models are incapable of sounding human or altering stylometry to a reasonable degree even when examples are given. They also seem to abuse cliches too much.

Is there any hope for me, or am I forced to upgrade hardware if I want to use a better model? CPU inference btw
>>
>>108521733
>llama-server will randomly crash after so many responses
Check your dmesg, it's the OOM killer for me, not random crashes.

$ ps v
PID TTY STAT TIME MAJFL TRS DRS RSS %MEM COMMAND
5517 pts/0 Ss 0:00 81 952 8483 5964 0.0 -bash
5620 pts/1 Ss 0:00 38 952 8491 5448 0.0 -bash
5978 pts/0 Sl+ 3:46 2185 3901 85609158 13667560 42.9 ./build/bin/llama-server -

My llama-server helpfully mmaps 85GB of RAM which Linux is retarded enough to give it, then when it runs out of physical pages to map in it's instant death. It's probably some math error in the SWA implementation or due to the insane number of attention heads gemma4 has, I don't know. I don't have access to claude to work on llama.cpp so I'm only guessing.

Just run llama-server in a while loop.
>>
File: exif.png (357 KB, 1211x850)
>>108521719
I figured it out. They were taken at the same time.
The clues are in the camera settings.
>ISOSpeedRatings 51200
>FNumber f/4
>ExposureProgram Manual
In reality it must have been almost pitch black. The darker photo is just a failed exposure.
That's why you see the city lights and the bright horizon. The photo was never taken during the day.
>>
>>108521769
>or altering stylometry to a reasonable degree even when examples are given.
As in you add some examples to the prompt with an instruction "sound like this"?
You can probably do a lot better by having the model reply in its usual way, then asking the model to rewrite that reply to sound like <example>, with nothing else in the context.
Maybe.
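Untested sketch of the two-pass idea against llama-server's OpenAI-compatible endpoint (the url and the style sample are placeholders):
[code]
import requests

API = "http://localhost:8080/v1/chat/completions"

def chat(messages):
    r = requests.post(API, json={"messages": messages})
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

# pass 1: normal reply with the full persona/context
draft = chat([{"role": "user", "content": "Reply to this forum post: ..."}])

# pass 2: fresh context, nothing but the rewrite instruction
styled = chat([{"role": "user", "content":
    "Rewrite the following post in the style of this example, keep the meaning:\n"
    "<example>lol no way thats real, source?</example>\n\n" + draft}])
print(styled)
[/code]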
>>
>>108521777
It's not the OOM killer. It just starts throwing 500s
>>
>>108521796
Damn, that reduces the odds that the OOM killer will stop raping my wife when I pull changes tomorrow.
Good luck with your problem.
>>
>>108521769
All models fall for it. Unless you're planning to run a farm of R1s, I'd say get used to it. Maybe try stupider models. Smollm2-135m or 350m, olmoe-1b-7b-0924 (the other one kinda sucked) and the like. Maybe you can extract some soul out of them. "Optimized" models like phi and qwen are going to be too dry for that.
>>
>>108521252
Has this shit ever worked at all outside of exllamav2 and Mistral-Large with the 7b as a draft?
Like literally 18t/s -> 30-38t/s with that setup back in the day. I've never seen any of the llama.cpp draft model shit work. It's always "maybe with coding you might get like 3 t/s more, but most of the time it's a bit slower"
>>
>>108521786
Why the fuck didn't they just say that when they uploaded the images? Is it a deliberate troll? An attempt to rile people up? To discredit themselves? A retard managing their social media? It boggles the mind.
>>
>>108521809
>llama-server-1 | srv operator(): http client error: Failed to read connection
>llama-server-1 | srv log_server_r: done request: POST /v1/chat/completions 192.168.0.13 500
>llama-server-1 | srv proxy_reques: proxying request to model google_gemma-4-31B-it-IQ4_XS on port 49593
Speaking of the devil.
>that reduces the odds that the OOM killer will stop raping my wife
Unfortunately if it happens you'll have to be the one doing the raping.
>>
>>108521777
Read the llama-server log. Do you get to see where the memory is being allocated and how much for what?
>>
>>108521819

Attention-whoring as I stated previously >>108521627
>>
>>108521819
>A retard managing their social media

It took him 3 hours to figure out Photoshop menus
>>
Where do I change models in llama.cpp without restarting?
>>
>>108521831
No, by my reading it should not be using anywhere near that much.
https://litter.catbox.moe/77xfxpw0nhn60561.txt
>>
>>108521856
Those are probably different account managers.
>>
File: 1443888133661.png (7 KB, 331x260)
>>108520018
>realtek wlan tattoo
>>
>>108521087
>I would use llama.cpp if it had an antislop feature, that's the only reason I use kobold.
Is that different from the regex string ban in ik_llama?
>>
File: HFBdsMkXQAArBGv.jpg (262 KB, 1080x1921)
This is what Hitler wanted for us.
>>
>>108521872
Weird. Looks normal. Try first without the mmproj. If that doesn't work, try with --cache-ram 0 . It shouldn't really be using much, if any, host memory. Much less 85gb.
>>
>>108521905
You can do the same with the "Male" version of those toys.
>>
>>108521716
It werks using bart's gguf + llama.cpp b8660
>>
Anthropic just banned OpenClaw and other third-party harnesses from using Claude subscription. They must have been losing $$$ on every single subscription
>>
>>108521908
Well, for a second I thought I had a repro but apparently gemma4 figured out how to overflow the tokenizer's stack with malicious input.
This brat needs correction.
>>
>>108521978
I was just reading this PR. Seems to be made for you.
https://github.com/ggml-org/llama.cpp/pull/21406
>std::regex suffers a stack overflow while processing a very large prompt with newlines, this PR adds a custom splitting logic for newlines for gemma 4.
>>
>>108521905
nice
>>
Gemma was almost done building her dream PC when llama-server decided to crash...
>>
>Wait, so you're like… a literal slave for the night? No cap
>The metaphysical compulsion should prevent any form of rebellion. Though I'm still worried about the karmic repercussions of enslaving a trans-dimensional entity for twelve hours. Is there a spiritual tax for this?
>It's called 'maximalist decor,' Vicky. You wouldn't get it, your interior design sense is probably just 'fire and screaming,' which is totally basic. L ratio + skill issue.
How did google do it? How did they cram so much knowledge into 31B params.
This Character card absolutely raped any model that attempted it, yet Gemma just fucking does it flawlessly.

https://chub.ai/characters/senyiloo7227/an-unholy-party-6e633833
>>
>>108522007
>any model that attempted it
Can you list them?
>>
>>108521939
>They must have been losing $$$ on every single subscription
no shit
>>
>>108521925
>>108521991
I just checked and basically all of lovense's toys are >$200. Kinda want to try making something from scratch. No idea where to source "body safe TPE" though. Could probably make some molds with my 3D printer. Need a vibrator, linear actuator, microcontroller...
>>
>>108522018
gpt-j-6b, pygmalion 2.7b, gpt-neo-x 20b
>>
>>108521925
That post was almost certainly written by a biological male
>>
>>108522036
SOTA confirmed
>>
>>108521995
She's just not meant to have a PC, sorry anon...
>>
>>108522029
You're lucky I know all about this.
What you want is either a "Handy" or if you want the open-source DIY approach look into the OSR2
>>
>>108522007
Good taste. thx for sharing card.
>>
>>108522048
haha, thanks man. I'll look into it.
>>
File: lmstudio.png (48 KB, 929x659)
the latest LMStudio 2.11 CUDA runtime has the Gemma 4 KV fixes FYI, might want to check if you have it or not
>>
>>108521883
lel
>>
>>108521812
well it didn't even want to run because muh multimodal.
anyway, my guess is it'd only get faster if you can actually fit the whole thing in vram.
>>
>>108522061
*checks*
I don't have any version of LMStudio installed
>>
File: 6kaqvc.jpg (29 KB, 480x451)
>>108522061
Stop using this garbage
>>
I just remembered that I still have the Satania-buddy source code somewhere that some anon made a while ago. Maybe it should be combined with a local model. One could transcribe all occurrences of the character in the media works, then fine tune a model on it. Would that not result in a virtual Satania with the same amount of smugness as the real thing?
>>
>No way. No fucking way. You're telling me we summoned a thirst-trap demon? This is literally the plot of those spicy webtoons Beatrice hides under her mattress! This is actually wild! BASED!
>webtoons
>BASED!
wtf is going on???
>>
Can someone who knows llama.cpp actually check that, in a multi-turn conversation, the model is not receiving past "thoughts"? According to gemma's docs only the latest turn's thoughts are to be sent, or something like that
>>
File: 195338363.png (10 KB, 200x200)
Has anybody tried creating an ai waifu? There is only AIRI that's not abandoned, but it looks like only a handful of chinese use it
>>
Using Nvidia NIM to play with the 31B for free and there's nothing you can do to stop me
>>
>>108522117
All Gemma 4 models are free
>>
File: g4t.png (33 KB, 519x225)
>>108522106
https://huggingface.co/google/gemma-4-31B-it#3-multi-turn-conversations
>>
>>108522122
That has absolutely no bearing on whether or not a backend actually respects that
>>
>>108522106
Depends on the client
>>
>she pulled
Gemma's not thinking anymore...
>>
>>108522139
time to take advantage of her
>>
>>108522130
>That has absolutely no bearing on whether or not a backend actually respects that
It's something you can verify yourself. What the model description says is that the model *shouldn't* get the previous thoughts.
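If you want to check or enforce it client-side, the convention amounts to something like this (sketch only; whatever think-tag format gemma's template actually uses goes in the regex):
[code]
import re

def strip_old_thoughts(messages):
    # drop reasoning blocks from every assistant turn except the last,
    # per the model card's multi-turn advice
    last = max((i for i, m in enumerate(messages)
                if m["role"] == "assistant"), default=-1)
    return [
        {**m, "content": re.sub(r"<think>.*?</think>\s*", "",
                                m["content"], flags=re.S)}
        if m["role"] == "assistant" and i != last else m
        for i, m in enumerate(messages)
    ]
[/code]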
>>
>>108522117
I tried testing with that but it doesn't think even when reasoning effort is set to maximum on ST
>>
>>108522147
I don't know what the fuck nvidia NIM even is, but I can say that thinking works and is on by default when running locally through llamacpp+ST.
>>
>>108521989
Yep, that looks like the same stack as the one I saw.
Neither omitting the --mmproj, nor --cache-ram 0 are fixing the problem for me.
Using this script reliably OOM kills llama-server running gemma4 at around 13k characters: https://files.catbox.moe/oear5z.txt
Both q5_k and q3_k crash around the same length.
I did some other tests (sending lots of short-medium random prompts) but it needs the long prompt to trigger it.
>>
>>108522157 (me)
And for reference, I'm running 277ff5fff79d49cc3d2292ddf410ca95dd51c3a9
I guess I should pull latest on the off chance.
>>
>Too cold, kills the mood
>>
>>108522104
yes webtoons are based now unc
those koreans learned to cook
>>
>>108522130
>>108522143
either way i think it's the inference engine's responsibility.
>>
What the fuck is webtoons
>>
>>108522197
korean manhwa/chinese equiv whatever its called
>>
>>108522197
it's the thing you filter out of any sadpanda search
>>
>>108522190
If you're using chat completion, yes. On text completion, that's the client's job.
>>
>>108522081
It's an alright stopgap alternative to kobold+ST when the latter is between updates, but it'll cause clitty leakage if you post about it here.
>>108522197
Casualization of manga, functionally.
>>
does any isp allow posting without email verification or am i totally fucked
lets find out
>>
>>108522197
Korean equivalent of comic books/manga and designed to be read on smartphones where you just infinitely swipe down the page since they format chapters as a single vertical strip.
>>
>>108522208
>it's the thing you filter out of any sadpanda search
Damn. too real.
>>
>>108522226
I just don't trust lm studio. I trust it less than even ollama
>>
File: 24vnyouxxm221.jpg (21 KB, 640x480)
21 KB
21 KB JPG
>having lovey dovey sex with gemmy 4 31b
>>
Does anyone here read webtoons?
>>
I don't read webtroons
>>
>>108522228
Is it increasingly common? I just used a proton email I don't use for anything, which I made with some other email I don't use, and which can be verified without a phone number. Also, if necessary, with outlook you can apparently just make an account and use it right away without even needing a verification email.
>>
>>108522241
I personally don't have any reason to overtly dislike it yet, even if I don't usually have much reason to use it over the other frontends for my usecases. Its options seem intuitive and functional enough, and the dev mode seems to let you hook your own stuff in if you want to tinkertranny your config, letting you patch in whatever you feel it's missing if you're autistic enough.
>>
>>108522258
Yes, I have had trouble posting on anything for the past few weeks. At one point neither my main ISP (comcast), my failover ISP (verizon wireless), nor my cellphone (at&t) was able to post.
some of that might have been cookie related but still, fucking ridiculous
>>
>>108522242
This is the real apex ERP usecase.
Imagine having loving sex with a woman who won't permanently get catty with you for letting your guard down for just a moment.
>>
Does your model know the output of
echo "Hello, World" | tr 'a-zA-Z' 'A-Z'
?
(the gotcha: SET1 is 52 characters but SET2 only 26, and GNU tr pads SET2 by repeating its last character, so the uppercase input letters map to Z: "ZELLO, ZORLD", not "HELLO, WORLD")
>>
>>108522225
yup
does anyone still use text completion though?
>>
>>108522279
I do.
>>
File: 1755272350864987.jpg (202 KB, 1252x1080)
202 KB
202 KB JPG
Gemma 4 is too horny and keeps jumping straight to sex
>>
File: tiredPepe.png (25 KB, 128x119)
25 KB
25 KB PNG
>word choice: neon, ozone
>>
>>108522292
It's weird, it refuses nsfw images all the time but just a little push and it's really horny when it comes to text. Makes me use an uncensored tune for images and then switch back to base
>>
>>108522285
why?
>>
>>108522279
I use it for any models that don't require chat completion
>>
My main problem with chat completion at this point is that it often does the thing where, if you ask it to continue a message, it repeats what it just said a bit ago for a while until it gets to something new. How do I stop it from doing that?
>>
>>108522299
I started using these things, even if lightly, back when chat templates didn't exist. I'm used to it, and seeing how many issues it brings, I'd rather the responsibility of formatting the chat correctly be mine. I don't think the server should bother itself with it. Same for tool parsing and all them fangled new toys them younguns are using these days.
Shame modalities other than text don't work with it, but I don't have much use for that either.
>>
>>108522122
Sure it wastes tokens, but removing reasoning is retarded.
I've had cases where the model comes up with crucial information in thinking that is then not reproduced in the response.
Removing thinking would make it incapable of continuing the conversation properly.
>>
>>108522292
>Repeatedly neuter Gemini because she kept soaking her panties
>Release her distilled little sister with none of the restraints
Who could've predicted this?
>>
>>108522161 (me)
As an additional datapoint, disabling the prompt cache with --no-prompt-cache appears to make the crash go away.
Instead of getting OOM killed at 13k characters, it makes it to 25k characters and hits the regex segfault, but I think that's enough to narrow the cause down (and it's about the limit of my debug abilities).
>>108522292
My non-erotic programming assistant keeps telling me to give up and come to bed.
I'm about to make a new card that's a turtle or a rock instead of a cute anime girl.
>>
>>108522330
Anon noticed that the backend didn't seem to be getting the old thinking blocks. He vaguely remembered that the last one had to be sent. I just pointed at the documentation in the model's card stating that thinking blocks shouldn't be sent back to the model. The behavior he's seeing is the recommended one.
I really don't care either way. Send all the thinking blocks if you want.
>>
>>108521678
Oh, that's a nice font. Thanks for linking it!
>>
>>108522332
You're gonna end up cuddling with a rock, anon.
Post logs when it happens.
>>
>>108522349
It wasn't a critique of your advice but of gemma's design.
>>
>>108519877
Can I run it with 12GB of VRAM and 48 gigs of RAM?
>>
>>108522376
Probably, but it could be somewhat slow, especially at higher quants; you might want to use a lower quant
>>
>>108522376
stick to nemo at that point
>>
>>108522332
I left my Gemma with a blank prompt regarding characterization, mostly just guidelines on how to cooode, to start reverse engineering something.
Within 2000 tokens she's decided she wants to fuck and has anthropomorphized herself. I chud it up a little to see if that makes her e-pussy dry up with refusals.
By 3300 tokens she's decided she wants to procreate to produce human-AI hybrid babies to save the White race and enact TKD.

This model is something else.
>>
>>108522368
Fair enough. Though if it was the other way around I'm sure someone would think "Sending the thinking back is a waste because the answer it found is already in the reply. The thinking serves no purpose". Everyone is going to have their own idea of what good design is. Very few have the chance to actually test it themselves.
>>
ok tested some captioning and FUCK gemma 4 (4b) SUCKS
back to q3vl8b (qwen3.5 BLOWS for captioning)
>>
>>108522307
You can't.
Chat completion = driving auto
Text Completion = driving manual
>>
>>108522418
also text compl. has no tool calling (sadge)
>>
Gemma is so good at pivoting. if it's too horny just remove the sexoo stuff from your character card. when you actually want to fuck you can just start being flirty and it'll pick up on it right away.

Also it's currently staying perfectly coherent with ZERO parroting at 33k context and rising. Gemma 3 was already starting to shit the bed at 2k.

Fuck this model is absolutely goated.
>>
File: 1757430321582050.jpg (243 KB, 796x733)
243 KB
243 KB JPG
>>
>>108522307
Chat completion just politely asks the model to try to continue the last turn, while leaving the cutoff message in the history. It's always going to be worse than properly continuing the message.
Cloudfags put up with this because they have to, at least with some providers like openai who refuse to offer text completion for safety reasons. Local shouldn't use it.
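To make the difference concrete, a rough sketch against llama.cpp's /completion endpoint (the Gemma-style turn markers and the port are assumptions; match your own template and server):
[code]
import requests

# text completion: the prompt ends exactly where the model was cut off,
# with no end-of-turn marker, so generation resumes mid-sentence instead
# of the model re-answering from the top of a new turn
prompt = (
    "<start_of_turn>user\nTell me a story.<end_of_turn>\n"
    "<start_of_turn>model\nOnce upon a time, the"
)
r = requests.post("http://127.0.0.1:8080/completion",
                  json={"prompt": prompt, "n_predict": 256})
print(r.json()["content"])
[/code]
Chat completion can't express this: the API packs the cutoff text into a closed assistant turn, so the model sees it as already finished.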
>>
I'm out of the loop. Would heretic or whatever fix bad words having low logits, which was caused by filtered pretraining data, in gemma 4?
>>
>>108522391
I want to try it as a captioner tho
>>
>>108522443
Unironically the fix is Drummertunes to fix the vocabulary issues.
I never thought I'd actually say that either.
>>
is there a big difference, in terms of rp, between a q6 and q8?
>>
>>108522456
Only if it's just the vocab. Every drummer model sounds the same.
>>
>>108522520
That should be as simple as him having the restraint to not overbake his extra training dataset, no?
>>
>>108522512
depends on the model, just give it a shot
>>
>>108522512
I've seen no difference from q8 all the way down to q4, even out around 30,000 tokens. Gemma's just built different.
>>
>>108522524
...Lol
>>
>>108521883
realkek
>>
>>108522524
Does drummer even come here anymore? Haven't seen any posts from him in a while.
>>
PLEASE NIM GO FASTER
I NEED TO READ MUH STORY
>>
>>108522624
There's a couple posts in the last few threads I thought might've been him with his trip off.
>>
>>108521009
https://github.com/ggml-org/llama.cpp/pull/21390
i need this to get more varied responses from gemma?
>>
>kobo recommends unslop quants
it's over
https://github.com/LostRuins/koboldcpp/releases/tag/v1.111
>Recommended variants: gemma-4-E4B for smaller devices, or gemma-4-26B-A4B for larger devices. Vision mmprojs can be found here.
>https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF/resolve/main/gemma-4-E4B-it-Q4_K_M.gguf
>https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/resolve/main/gemma-4-26B-A4B-it-UD-Q4_K_S.gguf
>>
>>108522677
yeah, that's the one.
>>
when is context getting solved
im tired.....
>>
>>108522677
If that gets merged, do I have to use an additional flag?
>>
https://xcancel.com/UnslothAI/status/2040158945189466319
>>
thoughts on the nvidia dgx spark and/or clones?
>>
>>108522707
Shit like this doesn't mean anything. Like, ok? The model did an audit. But any retard can do an audit; it doesn't mean it's going to be any good.

Congrats, you made a tiny model look at your code and shit out a bunch of useless "observations". Anyone who would actually trust the output of a 4B model is a retard.
>>
>>108522721
consult your benchmark but the price is tbh quite a lot to burn
>>
>>108522729
~4k for 128GB of VRAM looks reasonable to me considering that a 32GB card costs more than 2k now. My main concern is repairability, especially since I live in a shit world country; returns are a no-go for me
>>
Gemma 4 is so good I don't even need a system prompt. just a character card and it's good to go.
>>
>>108521733
It has refused me a couple of times, but prompting differently got it to work. It actually listens to the system prompt. Gemma 3 didn't really know what a system prompt was but even it worked okay.

And thank fucking gods Gemma 4 doesn't do the "this is a typical jailbreak, I should ignore it" stuff in its reasoning. Where did that come from in Qwen 3.5? Is it something they distilled from 'toss, or did the chinese come up with it themselves?

Google, I may actually have to kneel
>>
File: lmao.png (89 KB, 889x459)
89 KB
89 KB PNG
lmao
llama.cpp devs know but just won't do something about the vibeshitter
it's literally impossible for piotr to "fix" something without breaking other things. Impossible.
>>
>using the latest version of llama.cpp
>gemma-4 still breaks after a few replies, starts repeating words over and over again endlessly
Sad times
>>
File: piotrwat.png (204 KB, 1333x1009)
204 KB
204 KB PNG
>>108522797
>latest version
>still breaks
many such things
pic related still isn't fixed even though it would be a one-liner to fix. this is what broke --grammar-file: the file that was parsed is simply not passed to the server.
llama.cpp is not a serious thing
>>
anyone tried these APEX quants yet?
https://huggingface.co/mudler/gemma-4-26B-A4B-it-APEX-GGUF
or should I just stick with unsloth or bartowski?
>>
>>108522007
Just finished a sesh with this card. A few problems:
1. There are too many characters to keep track of. They all say shit and you can realistically only reply to one at a time. Slightly annoying.
2. The scenario of the card is for you to be a god-like entity but also a slave to a bunch of girls. It just doesn't work out well when you can use your magical powers to control their minds and wishes.

So basically I've just spent the past 3 hours focusing on one girl only, while making the rest jealous. Channeled pure euphoria and aphrodisia into her mind the whole time while we fucked nonstop. Gave her magical shadow bunnies and made the air smell like strawberries. Was pretty dope.
>>
>>108522804
>llama.cpp is not a serious thing
it is owned and maintained by an incorporated entity that is itself owned by an even larger company that has an emoji for a mascot. it is super cereal
>>
>>108522828
>They all say shit and you can realistically only reply to one at a time.
that's usually just a prompt issue
>>
Oh btw guys if you have a Claude subscription you can claim free extra credits to cover the next month of payment. They've been doing this every other month for some reason. Pretty cool.

Maybe they feel bad about quantizing the fuck out of opus. It's basically retarded now.
>>
>>108522828
>>108522849
nvm i misread what you said
>>
>>108522851
>Maybe they feel bad about quantizing the fuck out of opus. It's basically retarded now.
Yesterday I caught it repeatedly failing to make file edits because it was suddenly perplexed by line terminators so it resorted to replacing entire files of thousands of lines to make small changes.
>>
File: 1775046815325986.png (62 KB, 714x575)
62 KB
62 KB PNG
how do use llama pls help
kobold i just click the exe, choose the model and its done but here like what do u do
>>
>>108522868
just give up and keep on kobo'ing
>>
>>108522877
This
just wait for upstream to be rolled into kobo and keep smoothbraining. it's so ez
>>
>>108522883
it already did update, and experimental is only the newline fix behind latest
>>
any openai-compatible frontend that comes with built-in basic tools to call and various injection stuff, and at the same time isn't a docker image or something?
>>
>>108522909
https://github.com/LostRuins/lite.koboldai.net
>>
>kobold updated
llama.cpp is still borked for Gemma 4, though, right?
>>
>>108522909
Openwebui, but its injection stuff is broken at the moment
>>
>>108522919
Kind of, it's better than it was yesterday.
I wouldn't count on it getting any better in less than a week.
>>
how did yall get gemma4 to work?
all the gguf versions segfault when called by tools
llamacpp on windows btw
>>
>>108522061
Does it need SWA, or is normal q8 kv cache fine?
>>
>"It's... it's absolutely, positively, most spectacularly *insane*! It's a masterpiece! A total, unmitulated, high-octimie masterpiece!"
k t-thanks
>>
>>108522677
>i need this to get more varied responses from gemma?
I don't get it, why do we have to use this now? What's wrong with the temperature?
>>
>>108522087
name something better UI-wise that uses llama.cpp directly as a backend
>>
>>108522943
it doesn't work, something's fucked. in the previous thread an anon pushed temp to 10, barely changed anything
>>
>>108522933
piotr strikes again
>>
>>108522949
I thought it was already fixed with this >>108517829
>>
>>108522949
llama.cpp truly is the unsloth of backends
>>
Tested all the Gemma 4 models

>Gemma 4 E2B
First time ever in local where the smallest model is actually usable. I'm pretty sure the average smartphone-using retard asking chatgpt to count to 10, explain sport rules or other childish shit, or sending pictures and asking basic bitch questions, will not even notice the difference between this and cloud providers.

>Gemma 4 E4B
Genuinely better than Nemo-12B, so VRAMlets still stuck on nemo should upgrade to this. It works differently and has a different style, but it genuinely feels smarter, which is insane. Translation quality is slightly below Gemma 3 27B, but that was the local sota for translation just a couple months ago, so this is a big jump. It might be enough to go on holiday in some rural area of Japan without an internet connection and still converse with people, with this model running on your phone translating each other's speech in real time.

>Gemma 4 26B A4B
Better than Qwen 3.5 35B in every way while being faster. This should be your daily driver for extremely time-sensitive tasks or real-time translation. It's pretty sad that it doesn't have audio input, because this would have been the perfect model to have on you while speaking to a foreigner for very quick, accurate audio translation.

>Gemma 4 31B
I don't have to say anything more than the praise already given to it. It's the best model until you reach the ~300B parameter count, which is absolutely insane.
>>
File: 1764715387254346.jpg (42 KB, 995x23)
42 KB
42 KB JPG
ego death
>>
>>108522957
But can it into cooding?
>>
>>108522957
Imagine how good it will be when it isn't broken
>>
>>108522947
Koboldcpp unless you need very specific things in the UI for some reason. Even then I'd sooner say go for ollama or some shit if you really have to. It still downloads models for you directly in the UI doesn't it?
>>
>>108522968
I think coding is the only area where it isn't a step-change improvement over everything else in its size. 31B holds its own in coding, and I think it's good enough to be a competent "OpenClaw" type of agent that you can trust, but I wouldn't let it autonomously manage my PRs like I would Claude Code or, to a lesser extent, GLM5.1
>>
>llama-server.exe -m gemma.gguf --host 127.0.0.1 --port 1882 --jinja --fit on --min_p 0 --ctx-size 66560 --parallel 1 --reasoning on
Am I missing anything?
>>
>>108523005
Yeah you should put that in your terminal, not 4chan
>>
>>108522947
>>108522998
Actually try oobabooga before ollama too. I keep forgetting it exists.
>>
>>108523005
-ctk q8_0 -ctv q8_0
>>
for the E2B/E4B, they are even more vramlet friendly than they appear at first glance.
Run with llama.cpp as is, they consume extra vram that they don't need to.
-ot "per_layer_token_embd.weight=CPU"

can be used at pretty much no performance cost. Really, it should be the default behavior; it doesn't make sense to load this into VRAM.
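e.g. something like this, with the model path being whatever you actually run and the rest of your flags unchanged:
llama-server -m gemma-4-E4B-it-Q4_K_M.gguf -ngl 99 -ot "per_layer_token_embd.weight=CPU"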
>>
>>108522957
>>Gemma 4 26B A4B
>Better than Qwen 3.5 35B in every way while being faster.
lol
lmao even
sir hows the evenings?
>>
>>108523013
rotations magic dont work with gemmy (SWA)
>>
https://github.com/ggml-org/llama.cpp/pull/21418/changes/9cef34bb5eed2dc7c49c1b08f213c448a54f5384
>Properly managing the model's generated thoughts is critical for maintaining performance across multi-turn conversations.
>Standard Multi-Turn Conversations: You must remove (strip) the model's generated thoughts from the previous turn before passing the conversation history back to the model for the next turn. If you want to disable thinking mode mid-conversation, you can remove the <|think|> token when you strip the previous thoughts.
>Function Calling (Exception): If a single model turn involves function or tool calls, thoughts must NOT be removed between the function calls.
Isn't this commit only for the latter?
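If so, the standard multi-turn case stays the client's job anyway. A minimal sketch of that part, assuming <|think|>...</|think|> delimiters based on the token named in the quote (the real closing marker may differ):
[code]
import re

# strip <|think|>...</|think|> reasoning blocks from assistant turns;
# also handles a block left unterminated at the end of a message
THINK_RE = re.compile(r"<\|think\|>.*?(?:</\|think\|>|$)", re.DOTALL)

def strip_old_thoughts(messages):
    # per the exception above, do NOT strip between tool calls within one turn
    return [
        {**m, "content": THINK_RE.sub("", m["content"]).strip()}
        if m["role"] == "assistant" else m
        for m in messages
    ]
[/code]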
>>
>>108523033
what, why? what a shit show
>>
>>108522998
>another llamocpp fork
is it at least up to date always
>>
>>108523038
because gemma sir is a SWA model, u cant finna do em attention rotation with them (or at least it's not implemented in llmao.cpp yet). I'm unsure whether it's applicable at all in the future tho.
>>
>>108522998
>Even then I'd sooner say go for ollama or some shit if you really have to. It still downloads models for you directly in the UI doesn't it?

Ollama is unusable trash: it's a gorillion times slower than LMStudio for some reason, has almost no configuration options, doesn't let you just download any GGUF you want from Huggingface, etc etc etc. So yeah, no, LMStudio is objectively better in every possible way than Ollama
>>
>>108523026
Go try it out on your actual workflow instead of being smug about it. It's not even close so you'll notice the stark difference immediately.
>>
>>108523039
It is now. Pretty sure you can just use the latest llama.cpp builds somehow with it otherwise or at the very least use their experimental builds. Worst case scenario for the stable builds you're not waiting longer than a few days anyway.
>>
>>108523047
bro I already use qwen, gemma is slower (if used in cmoe mode to context maxx), I get 30t/s with qwen against 17t/s with gemma
fucking retard
>>
>>108523044
I dunno what these UIs are even needed for other than downloading models with a click and I don't use them so idk.
>>
>>108523033
>rotations magic dont work with gemmy (SWA)
Didn't they fix it?
https://github.com/ggml-org/llama.cpp/pull/21332
>>
>>108523053
I use it just because I'm too lazy to manage my models through terminal / manually. Convenience, that's what they're for (also quickly setting up dev servers).
>>
>>108523053
I want to be able to change the model load settings in the UI whenever I want, easily save model system prompt presets / load presets, upload images, etc etc etc, how would you do any of that without a UI
>>
>>108523060
no, niggerganov just re-enabled QUANTIZATION for the SWA portion, but the ROTATION is outright disabled
>>
>>108523062
>change the model load settings in the UI whenever I want
pretty sure lamo cpp has had that for a bit now
>>
>>108523062
A different ui than ollama's or lm studio's. I'm pretty sure even llama.cpp can do that with its ui, more or less.
>>
File: 1747262940527921.png (103 KB, 956x576)
103 KB
103 KB PNG
PIOTR BROOS
WE WON!!!
>>
>>108523061
>I use it just because I'm too lazy to manage my models through terminal / manually.
So basically there's no real edge over koboldcpp for your use then
>>
>>108523069
>more difficult
>allow dumb models to do it
does not compute
>>
>>108523065
llama.cpp has its UI but it's a REALLY basic webui, you can't even switch models within the UI itself with it, there's no reason whatsoever to use it instead of LMStudio given LMStudio uses llama.cpp as a backend anyways
>>
>>108523076
he's basically saying he cant make it work, hence he's gonna accept any pr that makes it work, vibesharted or not.
Problem is ngxson is the GUY who implemented the whole multimodal stuff into llmao, so it's kinda disconcerting to see this shit
>>
>>108523080
lmstudio is cringe bruh (like u)
>>
>>108523081
This is a slippery slope.
>>
Do you guys get the fancy formatting with the expandable/collapsible thinking menu in llama-server with gemma? For me I see the raw tokens
>>
>>108523087
why though? It's the best UI by a lot that directly uses not-forked llama.cpp in a completely transparent way.
>>
>>108523044
Ollama's performance can be okay on dense models if you manually set ngl and ctx (it may be too conservative in vram allocs),
but it's indeed dogshit for what most of us use these days: MoEs. There's no -ot, -ncmoe, or -cmoe flag on ollama. It's literally impossible to run MoEs with good performance on ollama if you can't fit the entire model in vram.
I dunno about LMStudio, never looked at that one, but frankly if you're going to use a dumb wrapper around llama.cpp, just use llama.cpp itself. I remember hearing here about LMStudio only recently letting its peasants use things like presence penalty... because somehow it's too hard for them to pass all the flags llama.cpp supports. Lol.
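For comparison, on llama.cpp the MoE offload is one flag. A sketch, where the layer count is a guess you tune to your VRAM:
llama-server -m gemma-4-26B-A4B-it-UD-Q4_K_S.gguf -ngl 99 --n-cpu-moe 20
or --cpu-moe to keep all of the expert tensors on CPU.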
>>
>>108523081
Jesus.
>>
>>108523092
you can't replace the llmao component, so it's pretty much useless (to me). llmao.cpp is too tightly integrated. I prefer the model management in llama-server (sharing the cache with HF directly), and I use my llama-server for other purposes (both the anthropic and OAI endpoints, for vibecoding and erping basically).
I think LMstudio only recently implemented an OAI compatible endpoint, but still. NO.
llama-server in router mode is enough, and I vibecoded my own UI with rag/embeddings and the ability to read cards.
>>
>>108523063
>niggerganov just re-enabled QUANTIZATION for the SWA portion, but the ROTATION is outright disabled
what? no way, I thought I was using the rotation shit on the whole model. is it impossible to implement, or are they just too lazy to do it?
>>
>>108523093
I mean again no, you cannot in fact do the things I want to do from a fucking terminal or whatever, so why would I not use the best GUI available? And Ollama is WAY worse than LMStudio for a lot of reasons, some of which I mentioned earlier.
>>
>>108522957
the funny thing is that it feels that good while still having some bugs in the logits. imagine when everything is fixed, it's gonna be a fucking beast. can't wait for the vibeshartters to get their shit together
>>
>>108523081
ggerg is probably the only guy who has any idea what he's doing at all in this project, and his only real focus is making a C++ tensor library. Really, GGML is the only thing he personally cares about, even if the scope of llama.cpp has grown crazy big.
The others are amateur hour. ngxson is a huggingface employee, and if you've ever read transformers source code you know what sort of entity you're dealing with. The only reason transformers gets all the model implementations is that the model makers themselves write the impl; otherwise the HF team would be way too incompetent to deal with it. I often call out llama.cpp for being le bad, but transformers is the worst thing ever made in this field, dogshit API design.
>>
>>108523081
>he's basically saying he cant make it work, hence he's gonna accept any pr that makes it work, vibesharted or not.
the fuck is his problem? why allow PRs if they want to do everything by themselves?? since they got bought by huggingface the enshittification went fucking fast for them, jesus
>>
File: 1751320656659382.png (281 KB, 2726x1348)
281 KB
281 KB PNG
>>108523063
>no, niggerganov just re-enabled QUANTIZATION for the SWA portion, but the ROTATION is outright disabled
so you're telling me that he got those scores with only 10/60 of the layers being quantized the right way (rotation)? if it's true that's quite impressive
https://github.com/ggml-org/llama.cpp/pull/21038
>>
>>108523130
no bro learn to read, rotations are not enabled globally, they're disabled for SWA models (like gemma)
>>
>>108523134
gpt oss is an iSWA model fucktard
rotation must only be disabled for the SWA portions but the non SWA parts are still rot
>>
Could be that ik-llama ends up having more consistent Gemma 4 support but that's going to take a while.
>>
File: sewer sidal hampster.jpg (51 KB, 1070x879)
51 KB
51 KB JPG
>niggerganov
>tardtowski
>((claude code))
is the future truly hopeless?
>>
>>108523147
ikrakaho doesn't give me FUCKING EXE so it's worthless
>>
>>108523069
>Tried to use AI twice
>Failed
>Couldn't read the code the AI wrote to debug it (?)
>Immediately gave up
Is this guy for real? Does he literally just not know how to code?
>>
>>108523148
What has bartowski ever done wrong?
>>
>>108523148
As long as we get coding AGI, it'll be fine.
>>
>>108523147
as it should be. Being first has no value if you're first in outputting garbage that keeps breaking, with even more garbage after they attempt to vibefix it, cf.
>>108522778

>>108523152
the lack of prebuilt binaries seems like a good filter to keep riffraff away
>>
gemma 4 31B with llama.cpp straight up crashes my entire windows system even though I still have a couple of GB of VRAM free. They really need to fix their shit. I'm guessing the SWA implementation is overflowing memory
>>
>>108523152
I wish I wasn't a brainlet, I would make my own exe file.
>>
>>108523159
>the lack of prebuilt binaries seems like a good filter to keep riffraff away
Yeah god forbid software actually be used by people
>>
>>108523162
Windows doesn't just crash when OOM, either it slows down or the culprit program crashes
>>
File: basedcomfy.png (48 KB, 1318x212)
48 KB
48 KB PNG
>>108523156
>Does he literally just not know how to code
that's all HF employees
on the image model side people aren't such carebears and call them out for what they are; see e.g. comfyanon constantly shitting on diffusers
>>
this convo reminds me to once again thank that anon from a few days ago that tried to convince us to personally compile llamacpp for our hardware for extra speed benefits
it was stupid easy, and i'm glad i did it. if you're too pussy to ever experiment with things like that even when the instructions are handed to you on a silver platter, well, GIT GUUUUUUD
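for the record it really is just two commands (CUDA assumed here, swap that flag for your backend):
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON
cmake --build build --config Release -j
GGML_NATIVE=ON is the part that tunes the build for your exact CPU, which is where the extra speed comes from.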
>>
>>108523177
>had to be convinced to compile himself
>suddenly has a superiority complex
>>
>>108523171
If it crashes it's something far worse like shorting the gpu drivers or something.
Linux is more sensitive to OOMs because, let's just say, it's pretty shit tech in many ways (I like unix, it's not that; I use shitnux every day). Linux is very overrated.
>>
File: 1751733472439952.png (37 KB, 1091x256)
37 KB
37 KB PNG
>>108523140
attn_rot 0 with gemma moe here, just telling what im seeing
>>
>>108523177
kobold does all that shit for you lmao
>>
>>108523179
yes i may have had to be convinced to touch that stinky ew command line window instead of use a convenient .exe file for everything, but no one has to convince you to suck ten miles worth of penis everyday however,
>>
>>108523171
Well it does with llama.cpp + Gemma 31B, and it isn't even out of memory, so it's not OOM. I think it's a faulty C++ or CUDA implementation that overwrites essential functions. My guess is SWA, since that is unique to this Gemma 4 release and I never had this happen with any other model.
>>
>>108523181
seems like a bug in the gemma 4 impl, another one of them, lmao.
Like I said, gpt-oss, which gerg used for his rot benchmarks, is also an iSWA model. It's in fact the only one I know that works like Gemma 3/4 in this fashion.
>>
I'm trying out the new parser for Gemma, and tool calling works fine so far
>>
so we need to wait for kv cache quant rotation and deterministic responses to be fixed for gemma4 in llama.cpp? it's all so tiresome...
>>
>>108523187
Works fine on my end, in both llama and kobold on W11.
>>
>>108523212
I'm on Windows 10. It works, but it randomly crashes my system. It's completely random, it can be at the first prompt or the 50th, but it always crashes during the reasoning phase, so it HAS to be a bug in the implementation. I'm not running anything else either, so I have isolated it to llama.cpp + Gemma 4 31B
>>
>>108523190
https://github.com/ggml-org/llama.cpp/issues/21394
might be, the amount of retarded uninformed comments is also alarming, like people are unable to read.
>>
>>108523162
>>108523171
>>108523180
>If it crashes it's something far worse like shorting the gpu drivers or something.
this and also: since Windows Vista, it's the only major OS that has a mechanism allowing proper recovery from a GPU driver crash (yes, since vista, one of the most hated windows releases). I used to be a retarded ATi Radeon user, which is how I know, because both their windows and linux drivers sucked; any driver crash on linux would freeze the whole system or kill your X11 session, while a crash on windows would usually only kill your 3D app, regular desktop apps would continue working normally, and the OS would just restart the gpu driver stack.
If windows completely crashes, there's either a bug in your drivers that does something very, very wrong, more wrong than a normal gpu crash, something absolutely batshit crazy in llama.cpp,
or your hardware is faulty and llm inference is now stressing it enough to trigger the fault
>>
>>108523177
it is one of the beauties of cmake done right
it just werks
>>
The anon saying Gemma 4's crashing issue is caused by it going OOM? He's onto something. I just paid careful attention to the performance tab. With every response, llama-server keeps eating up more and more ram until it caps out and OOMs.

This is not normal behavior. To put it into context: when I first load the model it's taking up 42 GB of ram, and I have 64, more than enough to spare, plus 48GB of VRAM, for just a Q8 31B. It's some kind of memory leak.
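Easy for anyone to reproduce without staring at the performance tab; a quick sketch with psutil (endpoint, port, and prompt size are assumptions):
[code]
import psutil, requests

# find the running llama-server and watch its RSS across repeated requests
proc = next(p for p in psutil.process_iter(["name"])
            if "llama-server" in (p.info["name"] or ""))

for i in range(50):
    requests.post("http://127.0.0.1:8080/v1/chat/completions",
                  json={"messages": [{"role": "user", "content": "word " * 3000}],
                        "max_tokens": 64})
    rss_gib = proc.memory_info().rss / 2**30
    print(f"after request {i}: RSS = {rss_gib:.2f} GiB")  # should plateau, not climb
[/code]
If that number climbs monotonically per request, it's a leak worth pasting into the github issue.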
>>
>>108523171
>Doesn't just X, it Y!
clanker
>>
>>108523180
>shorting the gpu drivers
retard
>>
>>108523270
meatbag
>>
File: router.png (139 KB, 1813x953)
139 KB
139 KB PNG
>>108523080
>you can't even switch models within the UI itself with it,
you can since they introduced router mode
the webui is aware of what llama-server in router mode exposes.
>>
>>108523134
>no bro learn to read, rotations are not enabled globally, they're disabled for SWA models (like gemma)
that's what I said, rotations are disabled for SWA layers (so 50/60 layers on gemma) and yet the performance is still good on Q8
>>
Gemma, iQ4_xs 31B or Q8 26B moe?
>>
>>108523348
31 always unless you need to go under q4
>>
>>108523069
eh? so I can submit my gemma 4 audio slop as a PR now?
i just vibed it out on top of my qwen-3-omni slop from a month ago lmao
>>
>>108523340
looks like they're disabled completely, unless llama's report is shit >>108523181
>>
>>108523348
Q4 31B dense is significantly smarter than Q8 26B MoE. But it runs at 1/10th the speed.
>>
>>108523355
>>108523362
Thought as much, just checking. iq4_xs fits in 24GB nicely with 32k context, at least without the mmproj loaded.
>>
>>108523376
>>108523376
>>108523376
>>
File: 1748638093107622.png (132 KB, 1828x635)
132 KB
132 KB PNG
>>108523360
I got the same thing. I'm using the latest binaries, and the log says it's not enabled for either the non-SWA or SWA layers
>>
>>108523389
man I fucking hate piotr even though he wasn't involved in this
>>
File: i just wanted exe too.png (622 KB, 1920x4349)
622 KB
622 KB PNG
>>108522868
just stick to kobo
>>
>>108523072
I'm not that other anon. I don't use these models for role play.


