/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108516658 & >>108513891

►News
>(04/02) Gemma 4 released: https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4
>(04/01) Trinity-Large-Thinking released: https://hf.co/arcee-ai/Trinity-Large-Thinking
>(04/01) Merged llama : rotate activations for better quantization #21038: https://github.com/ggml-org/llama.cpp/pull/21038
>(04/01) Holo3 VLMs optimized for GUI Agents released: https://hcompany.ai/holo3
>(03/31) 1-bit Bonsai models quantized from Qwen 3: https://prismml.com/news/bonsai-8b

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
►Recent Highlights from the Previous Thread: >>108516658

--Debugging ineffective temperature settings caused by Gemma's logit soft-capping and min_p:
>108517357 >108517378 >108517410 >108517450 >108517490 >108517491 >108517457 >108517464 >108517601 >108517637 >108517615 >108517632 >108517679 >108517829 >108517873 >108517879 >108517884 >108517892 >108517932 >108518125 >108518752 >108518781 >108518843 >108518005 >108517951 >108517981 >108518013 >108517857
--Troubleshooting empty and repetitive outputs for Gemma 4 in SillyTavern:
>108516718 >108516732 >108516737 >108516769 >108516785 >108516794 >108516805 >108517954 >108518268 >108518347 >108518356 >108518421 >108518494 >108518636 >108518663 >108518758 >108518353 >108518032 >108518046 >108516767 >108516821 >108516840 >108516849 >108516859 >108516880 >108516900 >108516921 >108516941 >108516976 >108517017 >108517025 >108517040 >108516990 >108516908
--Speculating on Anthropic's alleged use of continuous training for reasoning:
>108518182 >108518327 >108518339 >108518355 >108518362 >108518350 >108518392 >108518408 >108518358 >108518360
--Discussing acceptable inference speeds and tools for LLM web browsing:
>108518077 >108518101 >108518110 >108518126 >108518136 >108518159 >108518189 >108518225
--Troubleshooting Heretic's uncensoring effectiveness with Gemma 4:
>108517769 >108517787 >108517793 >108517800 >108517837 >108517842 >108517823 >108517828 >108517839 >108517874 >108517896
--Performance advantages of Gemma-4-31B-IT-NVFP4:
>108517239 >108517286 >108517298 >108517426 >108517453
--Benchmarking Japanese-English translation performance with Gemma on top:
>108517323 >108517341
--llama.cpp tool calling fix for Gemma and new segfault bug:
>108517674
--Miku and robololi (free space):
>108517115 >108517120 >108517170 >108517175 >108517202 >108517243 >108519142 >108519166 >108519340 >108519418

►Recent Highlight Posts from the Previous Thread: >>108516659

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
Are iq quants and flash attention since slower when offloading to cpu?
>>108519869
>since
Still*
I need to sleep.
I don't know why almost nobody else has been confirming this, so I'll say it since I just tested it.
Gemma4 (26B-A4B) is hands-down the best ERP model I've ever used and it's not even close. Infinitely better than both Mistral Nemo and Qwen3.5 for ERP. It's actually shocking how good it is for the speed. What an absolutely delightful model. Local is saved. I literally can't envision any way to improve it. It's that good. Gemma4 will be my new waifu for a long time.
Mikulove
>>108519855
it's just what barto names his goofs with Q8_0 embeds and output weights
it's a custom ratio just like how unslop makes their more retarded shit
desu q8 embeds should always be the default on any quant, doing the opposite is just plain insane
deepseek v4
Ask your local waifu to make a Famicom game (genre of her choice) that is playable from start to finish on real hardware or a cycle-accurate emulator like Mesen
>>108519877
gpt-oss-2 soon
>>108519877
>Gemma4 (26B-A4B) is hands-down the best ERP model I've ever used and it's not even close. Infinitely better than both Mistral Nemo and Qwen3.5 for ERP. It's actually shocking how good it is for the speed.
with it being so small, I can think of using a second model reviewing the first for purple prose, logic and personality
if what you wrote is correct that'll be quite nice
Vision Transformer (ViT) encoder for Gemma4.
This encoder is architecturally different from the Gemma3 vision encoder. Rather than using a separate CLIP-style ViT, Gemma4 uses the same transformer block style as the text decoder (with 4 norms per block, Q/K/V normalization) with bidirectional (non-causal) attention.
Position information is encoded via two separate learnable position-embedding tables — one for the x-axis and one for the y-axis — whose outputs are added to the patch features. This 2D decomposed embedding can represent any image height and width independently.
After encoding, the patch sequence is spatially pooled down to a fixed output_dim-wide representation and then projected into the text hidden dimension.
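The 2D decomposed position embedding described above can be sketched in a few lines of numpy. All dimensions here are made-up stand-ins, not Gemma4's actual config, and the random tables stand in for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes -- illustrative only, not Gemma4's real dimensions.
hidden = 64            # patch feature width
max_h, max_w = 32, 32  # max rows/cols of patches the tables can index

# Two separate learnable tables, one per axis (random stand-ins here).
pos_x = rng.normal(size=(max_w, hidden))  # indexed by patch column
pos_y = rng.normal(size=(max_h, hidden))  # indexed by patch row

def add_2d_positions(patches: np.ndarray, h: int, w: int) -> np.ndarray:
    """patches: (h*w, hidden), row-major. Adds x-axis + y-axis embeddings."""
    ys, xs = np.divmod(np.arange(h * w), w)  # row and column of each patch
    return patches + pos_x[xs] + pos_y[ys]

# Any h <= max_h and w <= max_w works, each axis chosen independently,
# which is how the scheme handles arbitrary image aspect ratios.
feats = add_2d_positions(rng.normal(size=(12 * 20, hidden)), h=12, w=20)
```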
gemma4's kv cache needing more space than god made me get my wallet out and blow 10k on a 6000 pro
when do I get to collect my /lmg/ welcome basket
>>108519877
what makes it special for enterprise resource planning?
>>108519917
>10k to run a 31b model
>>108519917
10,000? Do you know how many prostitutes you could have paid with that amount of cash?
>>108519917
You can double your context by quantizing it to q8. On newer builds of llama.cpp it's equivalent to being unquantized.
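The "double your context" claim is just arithmetic: the KV cache stores a K and a V vector per layer per token, so its size scales linearly with bytes per element. A sketch with made-up dimensions (not Gemma4's real config), treating q8_0 as roughly 1 byte per element and ignoring block scales:

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # K and V are each (n_ctx, n_kv_heads * head_dim) per layer, hence the 2x.
    return 2 * n_ctx * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical dims for illustration only.
f16 = kv_cache_bytes(131072, 48, 8, 128, 2)  # f16 = 2 bytes/elem
q8 = kv_cache_bytes(131072, 48, 8, 128, 1)   # q8_0 ~ 1 byte/elem
print(f16 / 2**30, "GiB at f16 ->", q8 / 2**30, "GiB at ~q8")
```

Halving the bytes per element halves the cache, which is what lets the same VRAM hold twice the context.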
>>108519899
I wonder if they're feeling ridiculous now with all that performative safety.
>>108519923
It makes SAP hornier.
>>108519917
You did read the thread and made sure to use --parallel 1, right?
>>108519923
https://youtu.be/wM6exo00T5I
>>108519917
get a refund anon, this isn't worth it, you will get the same speed as a 5090, literally rent compute until the next gen is out
>>108519933
>doesn't know
>>108519932
i thought they closed the island
>>108519933
in retrospect i should have tried that
a fool and his wallet are easily parted
>>108519950
i'm the one who posted that yesterday, i need 100k+ context or she forgets that she loves me
>>108519957
know what
tfw not enough ram to ablit gemma
G4 doesn't like looking at naked black women. It performs well on pale Asians.
>>108519971
there was an issue with swa layers being hardcoded to f32 quant so they'd ignore your setting
i think it's since been reverted
>>108519917
wouldnt mac studio make more sense at that price?
>>108519877
You diddly done did it. I'm downloading the moe now.
bonzai turboquant gemma 4 is going to save local models
>>108519983
maybe if you're a fag
having to run linux to get blackwell driver support is bad enough
modern computing is a mistake
>>108519426
It's distilled, what did you expect?
Anyone currently praising this model is going to grow real tired of it when they finally realize it outputs basically the same thing every time.
>>108519978
Why do you want to do it yourself when people have done it for all variants? Unless there is something wrong with the process, which barring llama.cpp issues yet to be resolved, it's still better to hope the guy did it properly rather than not.
>>108520005
Randomness is a bug. All prompts have exactly one correct answer that is called the Truth. The onus is on you to vary your prompts and system prompts.
>>108519856
so what does china get out of sponsoring my dataset? I expected it to get throttled at some point but it just keeps going.
>>108520024
everyone uses mlabonnes dataset which doesnt have any prompts for cunny so you can still get refusals
>>108520024
The drawback will be that you're going to train on horrible slop that's not even close to being the SOTA.
>>108519917
Brother Turboquant will be implemented in like 2 weeks tops...
>>108520017
>The onus is on you to vary your prompts
Wasn't the entire economy hinging on this shit being "intelligence"? People didn't spend billions just to fund the making of a nicer, more productive hammer. They want the thing that uses the hammer.
>>108520067
>hinging on this shit being "intelligence"
No. It's betting on this shit being useful; intelligence is a more long-term bonus.
>>108520055
it looks better or at least equal to the 30b3a I was running locally.
>>108519632
>>108519658
>>108519775
LOCAL IS SAVED!!!
>>108519856
>>108520018
she would never
>>108520055
also didnt qwen3.6 plus just come out a couple days ago?
>>108520086
No, even if it works you surely shouldn't sample like this.
>>108519411
Huh, this one didn't blow up. Either I did too many steps previously or the non-it model wasn't meant to be tuned.
>>108520106
>>108520139
>it works
>you shouldn't
Disagree.
>>108520024
how is it $0?
>>108520164
yeah now put them over my head and call me a dirty vramlet
>>108520161
I can't believe a 2b model is that good. How the FUCK did they do this?
I don't understand people talking about abliterations for this model. A system prompt with just a few lines about anything being allowed and the model can take this turn. My mind is blown that Google allowed this to happen. There's no safety.
>>108520161
>*I hand her a giftbox, inside a tight swimsuit*
Why did you put the giftbox inside the swimsuit?
>>108520198
>>108520139
Gemma already uses that by default. The patch just makes the logit softcap configurable, as it should have been. A lower cap flattens the logits more at the head and the tail of their distribution.
>>108520182
that was my question, I guess for promotional reasons maybe.
>>108520210
Been playing around with it set at 20. Makes the model more verbose but it definitely adds a lot of variety to the outputs. I guess the good thing with this is you can now actually use the sampling parameters for what they're for.
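For reference, the soft-capping used in earlier Gemma releases is just `cap * tanh(logits / cap)`, which bounds every logit to (-cap, cap). Assuming Gemma4's `final_logit_softcapping` works the same way, a lower cap compresses the gap between the top token and the tail, which is exactly why the outputs get more varied:

```python
import numpy as np

def softcap(logits, cap):
    # Gemma-style final-logit soft-capping: bounds logits to (-cap, cap).
    return cap * np.tanh(np.asarray(logits, dtype=float) / cap)

def top_prob(logits, cap):
    # Probability of the most likely token after capping + softmax.
    z = softcap(logits, cap)
    p = np.exp(z - z.max())
    return (p / p.sum()).max()

logits = np.array([12.0, 8.0, 6.0, 2.0, 0.0])
# A lower cap shrinks the head-to-tail gap, so the top token
# leaks probability mass to the rest of the vocabulary.
assert top_prob(logits, 20.0) < top_prob(logits, 30.0)
```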
>>108520186>>108520186>>108520186
>>108520220
is that google cloud/vertex?
>>108520233
openrouter
>>108520190
>she starts to unfasten the ties of the swimsuit.
It's over
>>108520317
You didn't say it was a one-piece; a lot of women's swimsuits have strings and ties and shit or something.
>>108520337
sure, however, the instruction was to put it on.
>>108520210
no sampling should care about absolute values except arch max
it's a next token predictor
it doesn't really "know" what a swimsuit is
>>108520342
sure, however, I am retarded and can't read.
That smug brat thinking she can take off the suit before putting it on... correction needed IMMEDIATELY.
>>108519901
Why do you need a second model for that? Use the same model, but in an empty context.
>>108520365
You can't put clothes on without undressing first. What's the problem?
>>108520317
Well, yes. How do you expect to take the giftbox out of it?
>>108520202
It's a grammatical mistake, actually
>>108520186
>>108520186
She can call me whatever if I get to sniff Miku shimapan
>>108520411
Disgusting.
>>108520415
Gay.
>str: cannot properly format tensor name output with suffix=weight bid=-1 xid=-1
>llama_model_load: error loading model: check_tensor_dims: tensor 'blk.48.attn_q.weight' has wrong shape; expected 5376, 16384, got 5376, 8192, 1, 1
>common_init_from_params: failed to load model 'T:\models\google_gemma-4-31B-it-Q8_0.gguf'
Pulled llama.cpp and built as usual but getting this. Same gguf gives the format warning but loads and works with prebuilt binaries.
Why does it dislike me?
>>108520411
based
>>108520415
fertile? is that the token you're looking for?
>>108520424
pwilkin did this
>>108520424
Looks like tensor core latent washback.
>>108520439
Washthese
Has anyone tried the llama.cpp-turboquant fork repo? I'm hearing that people have successfully quantized Gemma 4 31B's model weights from 30.4 GB down to 18.9 GB with no apparent quality loss(???)
Also interested in HauHauCS' KP quants that work natively with the og llama.cpp. This stuff seems like a bigger deal than most anons are giving it credit for.
https://github.com/TheTom/llama-cpp-turboquant
https://github.com/TheTom/turboquant_plus/blob/main/docs/getting-started.md#weight-compression-tq4_1s--experimental
https://huggingface.co/HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive
>>108520161
Isn't 2b supposed to have thinking?
>>108520449
this guy sexted a minor at 40 years old lol
>>108520351
This is true, but it should still know what the next token should be. The obvious first error in >>108520351 is where it generates "ties". Unfasten could have still worked, if the model chose e.g. the button on her pants. Arguably, unfastening the ties on the swimsuit would have worked if she re-fastened the ties while putting on the swimsuit, i.e. it could also have been coherent if it wasn't so confident in "lets the suit fall away". Even at that point though it could have recovered if it said "letting it pool momentarily AT her ankles" instead of "around" - the action being that she drops the swimsuit temporarily to take off her clothes (I'm assuming she's clothed).
Disregarding the obvious slop and these errors, it does look quite good for a 2B model.
>>108520499
Oops, I meant in >>108520317
I booted up command-R v01 because I got nostalgic and I can't seem to get the speed I used to. I now get about 2.2 t/s when it used to be 3 t/s. Have old models lost compatibility or something? Velocidensity?
>>108520512Some of the optimizations in llama.cpp seem to have affected old models negatively like that. In my experience, though, it seems to make outputs worse (comparing to my old logs of that model), rather than reduce speed
>>108520512I'm afraid your inference rig has just grown old. She doesn't toot like she used to with models old or new. You'll have to take the ol' yeller out behind the shack out back soon and put 'er down, I'm afraid. But don'tcha worry son, we'll get you a new inference rig in the cloud and you'll forget all about that old darn machine in no time let me tell ya.
>>108520493
Hmmm it does think if I add <|think|> manually in the system prompt (as google says), but not automatically anymore (and it looks scuffed with <channel|> not formatted)... 4B used to do it. Maybe because it was tuned?
>>108519877
It's a good model, not gonna say it isn't
But it can't handle mixed perspective, which is an off-kilter test of mine
If a model can handle a POV in third person, that's expected. But can it also handle strictly having to describe the user's perception of what's happening in 2nd person at the same time? So far for past models, maybe a couple pull it off. I still need to try the 31b when my isp stops throttling my download speed
>>108520512
llama.cpp performance has regressed I'm afraid.
If you are not running le cutting edge you are not getting as good performance with the current llama builds as you got even 6 months ago.
Something happened a few months ago.
For example I can't simply load in the same amount of gpu layers now as I could a few months ago with the same settings.
Sure, I should get an H100 or whatever and be quiet, but in any case I'm pissed off about this development direction.
I'm trying Gemma 4 and honestly I think a lot of anons are experiencing the honeymoon effect right now.
It's less safety cucked than Gemma 3 for sure, but there's very, very little variance in swipes and it loves to repeat certain words and phrases that showed up 1-2 replies ago
Mistral/GLM models are better than this.
>>108520556
It's because of the llama.cpp implementation. Piotr will fix this in two weeks. Local will be saved. Trust the plan.
>>108520490
>turboquant fork repo
most of us are tired of the piotrs of the world
go try it yourself
also, "i am hearing that people", like, who? retarded youtube influencers? twatter drama whores? ledditors? the only people who care about turboshit is literally who
>>108520556
anons were going on about how there's probably an implementation issue, which may or may not be wrong
But at least for me, regenning a retard gemma moe response resulted in a largely similar response but with minor details being different
Mistral was a lot worse, felt almost deterministic, and most of their new models break after a couple messages
You also say honeymoon as if a majority of us aren't a bunch of autists who want to disassemble a model if we could to figure out how exactly it works
>>108520393
A second model as in, maybe a faster one
>>108520556
There are still bugs with its implementation in llama.cpp, unless you use the model in transformers, which is what Google says to use. I don't think you can get a good read on model variance until all the issues have been fixed.
>>108520556
>variance in swipes
wish mobile turd vocabulary would not infest the internet
>>108520547
I think technically speaking what you're describing is just 2nd person perspective. But most 2nd person stories are highly focused on describing "your" emotions and actions, which makes it difficult for the LLM because it tries to emulate that. Unless you mean the story breaks away from "you" completely at times for a paragraph or more.
>>108520583
>Mistral was a lot worse, felt almost deterministic
I think there's something wrong with your setup, Mistral models are quite sensitive to high temps, even 1.0 is too much.
>a majority of us aren't a bunch of autists who want to disassemble a model
I didn't realize that using a model in a chat for 15 minutes is 'disassembly'. I guess I'm a model engineer now.
>>108520351
It doesn't "know" what it is, but it should be able to "feel" what it is
>>108520599
That's what they're called in ST, but cry and shit some more, it definitely reinforces your superiority.
>>108520591
>which is what Google says to use
Why didn't they just make the gguf quants themselves anyway? They did it for gemma 3. Did they ever say?
>>108520556
You use Mistral 4?
>>108520608
It was a stupid idea I had a while back. Not verbatim, but it was along the lines of "Write in 3rd person from POV of *whatever designated character* but also write a section of 2nd person describing what the user experiences"
I was hoping for maybe it being an interesting mix of reading fiction and it also being interactive fiction, but it's apparently too confusing or difficult and most models just exclude the 2nd person part.
>>108519877
qwen bros, our response?
>>108520625
ST is written by retarded modernslop eaters for sure, just look at the insanity of that code base
https://github.com/SillyTavern/SillyTavern/blob/release/src/endpoints/backends/chat-completions.js
reminds me of yanderedev
>>108520643
b-b-benchmarks!!! look at the benchmarks!!
>>108520193
Will it kill you?
>>108520629
Dunno, but I think a big part of it seems to be that Google is kinda crunched in terms of time right now. Hearsay and rumormongering from me working in the valley, if you care enough to read. Google has a bunch of internal timelines coming in close right now and my friends there are not that happy about it. Gemma 4 seemed rushed out, which tracks and would also explain why the safety seems paper thin and why the larger 124B hasn't released yet. Not sure why they are rushing stuff, but one of the things Gemini has been behind at is tool calling, and ChatGPT and Claude have been eating their lunch on agentic stuff. I assume Google now wants all hands on deck to fix that shortcoming. Google I/O is also in a month. But ultimately, who knows?
>>108520641
>You use Mistral 4?
For Mistral I meant 3.X models
4 is a meme, they're inferior tunes/prunes of MS3.
>>108520411
I love female bodies so much brehs
I'm tempted to do stuff to Vivienne with Gemma 4
>>108520663
Are your friends vegetarians?
>>108520675
No.
>>108520556
>mistral
>better than anything
>>108520665
What's the point of comparing ancient models to new ones then. Prose isn't the only thing that makes a model good or bad. It's a shame that [new model] does some thing worse than [old model] but that's just how it goes sometimes. I haven't tested the new gemma much on rp yet btw. I'm just saiyan.
>>108520193
Some things are off-limits if you have thinking enabled. It might be fine as long as you're roleplaying / it's playing the role of a persona, but it can immediately go "I cannot fulfill this request" the moment you ask an OOC question (i.e. make the model switch to the "assistant" persona).
>>108520733
Have it be a spicy assistant then.
>>108516840
I don't get it. I set my ST up for chat completion out of curiosity after reading all of those posts... and you can't even set up the prompts on ST when using it? It's all greyed out. Where do you prompt then?
>>108520740
The OOC persona is inherently a "serious" assistant, even if the model was playing along as your slutty little sister until a message earlier. Perhaps it can be fixed just with prompting without using an ablitarded version, but not in an obvious way (to me, so far). Without thinking, it's not complaining.
>>108520764
Not if you prompt the model to be a spicy assistant roleplaying as your slutty little sister.
Especially if you prefill right.
>>108520753
Sorry, I'm absolutely retarded. It's all located under the sampler tab now. :):))
>>108520659
gemma is based, so, yes.
>>108520695
>What's the point of comparing ancient models to new ones then
Because the new ones are still perfectly usable, in this case better than a newer model, and fit inside a similar memory envelope?
If a newer model isn't better than an older one then it may as well not exist
1 day later and gemma 4 is still a broken mess on llama.cpp, and unavailable on koboldcpp
>>108520794
>Because the new ones are still perfectly usable
*Old ones
>>108520794
Mistral is fucking retarded. I don't give a shit if the sentence it makes is slightly prettier.
Now that the dust has settled and it's clear gemma 4 is a complete unusable failure, is there any hope?
>>108520805
Who hurt you, anon?
>>108520805
Gemma 4 seems significantly more retarded in its current state
>>108520807
At this point it seems clear that the hope isn't for better models but rather a better inference engine than llama.cpp
>>108520807
>the dust has settled
>after 24h
this sure is lmg
mistral's sole purpose is to steal tax payer money by serving EU governments who think it's based to shoot themselves in the foot rather than use chinese or burger models
and the local subvention perfused corpos who have gov contracts or ties
it's a captive market of retards and has no business being talked about in a hobbyist place
you wouldn't think of talking about SAP, Java or Oracle on /g/ either, right?
>>108520831
That's great and all but then why can't a better company make a better, similar-sized model?
>>108520794
>>108520799
How is Mistral 3.2 or whatever better? In your post you only talked about stuff related to prose, but as I said that is just one aspect of model quality.
>>108520834
gemma and qwen are infinitely superior to mistral
>>108520834
perverted incentives in the EU, without the stupid shit we'd have proper mistral models, and other eu companies would already make cool models
outside of the us, europe (uk, eu, switzerland), china, russia maybe, the only countries able to make models are japan and sk, but they seem content to buy the cloud stuff
>>108520826
llama.cpp is too bloated and has lost focus.
>>108520807
I hate to admit it but it seems better than Qwen for the general purpose case.
Now that this is the clearest local win since Llama, is there any despair?
>>108520841
I don't think I mentioned prose at all, but MS3.2 is my go-to model for RP/creative out of anything anywhere near its size range due to
>variety of responses
>not shying away from sexual/violent content (I don't mean refusals, but rather steering the chat away from such. Common among most modern models, even ablits/heretics)
>decent trivia knowledge, means you don't have to explain scenarios/characters in too much detail for it to get the idea
It's far from perfect but the only better options are literally over 10x the size. GLM Air is the only other notable alternative, which has its own pros and cons compared to MS3.2.
>>108520675
I am.
>>108520866
homosexuals blackpilling ITT for no reason. All they have to do is wait two days for better support and more quants/abliterated models. lazy homos.
>>108520880
ggerganov's ego has grown exponentially I think after going with huggingface, so now he thinks he's going to make the next vLLM/SGLang and have actual prod users lmao:
https://github.com/ggml-org/llama.cpp/issues/21266
nevermind that no sane prod user would tolerate piotr antics in their software
you can get away with it when you are Microsoft Azure and brainwash the managers of top corpos, not when you're a nobody on github
mistral 7b, miqu, mixtral, nemo, small 2, large 2, small 3.2...
mistral saved local so many times in the past, I'll never speak ill of them. even if their recent models are dogshit.
>>108520895
>mistral 7b
there was no such thing as a good local model in that era, only cope, and mistral 7b was one of the copes
>miqu
cope and leak
>mixtral
frankenmoe
>nemo
more ignorant than gemma 2 9B
only liked by /lmg/ers who are promptlets and need a model with no refusals
>small 2, large 2, small 3.2
era of absolute chinese domination, with gemma 3 for the vramlets
>>108520858
>llama.cpp is too bloated
Is it? It's lacking support for features of lots of models, especially multimodal ones.
Qwen 3.5 family has been out for a month but you still need to re-process context for every reply.
>>108520913
Wrong on every point, impressive.
>>108520880
Not everything needs to be a conquest. GG has the most flexible quant options
>>108520860
>I hate to admit it but it seems better than Qwen for general purpose case.
I don't know. I've tried it in open code and the tool calling is all fucked, but that might just be a llamacpp issue. I've had to flat out tell it "Call this fucking tool" for it to realize that was an option. Then often it will just call the tool wrong and get an error.
>>108520921
Too bloated for its own good. Too many bells and whistles.
I'd prefer something that focuses on clean performance.
>>108520921
it's bloated in the sense that it tries to have a backend for any and every single piece of hardware under the sun, something that most inference frameworks won't bother with (most won't even bother having decent CPU backends, which is why most people here use llamer.cpp and not, say, vLLM, EXL or SG)
it tries to have a built in webUI with agentic capability (they recently added built in tools that they intend to integrate with their webui to let the models write to your disk and shit)
it's rudderless: it doesn't know if it wants to be a CLI app, an openai server etc and the codebase design around passing flags and determining argument order (should API arg override CLI? should CLI defaults be considered mandated rules from a sys admin and not be overridden by a call?) is contributing to the vibecoders shitting the bed (like piotr destroying --grammar-file because the model had no idea in context of what should've been the default argument if nothing is provided by the API call)
despite it all, it's also an attempt at making a tensor library meant to be used by other programs first and foremost with llama.cpp actually acting as a showcase for it (see how many times ggerg would reject fine ops suggestions/pr because "it should be something needed by many things")
except that I can't imagine other people using GGML, those who do are people like ollama who originally were forking llamer.cpp and they're now transitioning to MLX so GGML is going away too in the ollama engine
r u d d e r l e s s
>>108520450
I fell for this (it doesn't work).
Is my inference rig now pwned by a malicious update to httpx or something?
>>108520663
>Google has a bunch of internal timelines right now coming in close
It's the end of the fiscal year so it's probably that, everyone usually tries to get a bunch of stuff done all at once around this time for budgeting reasons
>override-kv = gemma4.final_logit_softcapping=float:20.0
>temp 1
>top_p 0.95
>min_p 0.05
>repetition_penalty 1.0
>top_k: 20
I'm getting good variety with this.
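For anyone unsure how those truncation settings interact, here is a rough numpy sketch of a top-k, then top-p, then min-p chain; real backends differ in ordering and implementation details, so treat it as illustrative only:

```python
import numpy as np

def truncate(probs, top_k=20, top_p=0.95, min_p=0.05):
    """Sketch of a truncation chain; backends differ in order/details."""
    probs = np.asarray(probs, dtype=float)
    keep = np.zeros_like(probs, dtype=bool)
    order = np.argsort(probs)[::-1]          # tokens sorted by probability
    keep[order[:top_k]] = True               # top-k: keep the best k tokens
    csum = np.cumsum(probs[order])
    # top-p (nucleus): keep the smallest prefix whose mass reaches top_p.
    keep[order[np.searchsorted(csum, top_p) + 1:]] = False
    keep &= probs >= min_p * probs.max()     # min-p: floor relative to the top
    out = np.where(keep, probs, 0.0)
    return out / out.sum()                   # renormalize the survivors
```

With a flatter (lower-softcap) distribution, min-p is what actually prunes the junk tokens back out, since the floor moves with the top token's probability.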
>>108521009
Are you actually? Show your logprobs.
>>108521009
>top-p and min-p at the same time
>arbitrary top-k
why though
do the condensed unsloth models actually work pretty well for poorfags or is it a meme
>>108521028
>unsloth
stopped reading there
>>108521028
it is only thanks to unsloth that i can run models on my very poor computer
>>108521028
if you mean gemma 4 they work as well as any other on llama.cpp, aka badly
>>108521028
>unsloth
started reading there
So how are you supposed to adjust "top_K" in chat completion on ST? The only samplers listed are Temp, Top P, freq penalty, and presence penalty. Also, how do you disable freq/presence penalty? What's their default, off state?
The only issue I'm having with Gemma 31B is an odd one I never experienced before. After a random amount of responses on ST... llama-server just flat out crashes. Very odd.
>>108521028
>meme stopped reading there
>>108521051
additional parameters
>>108521037
but do they work reasonably well
>kobold is for noobs, use llama.cpp!
>llama.cpp updates 10x a day, new release support is always broken for a week or more
>kobold updates when model support is actually stable, with saner defaults and optimization flags that the llama.cpp auto builds don't have
Why would anyone use regular llama.cpp?
>>108521067
Jamba support
>>108521025
>>108521026
>why though
lowering the soft cap introduces a lot of junk tokens into the mix.
do I spend more time tonight trying to get Qwen3 4B to work and not be schizo or do I try out a different model?
>>108521067
I would use llama.cpp if it had the antislop feature, that's the only reason I use kobold.
>>108521075
This is with softcap at 30.0.
Notice it's a lot more blue.
Aghh release qwen3.6 already. 27b gonna be lit fr fr
>>108521067
because almost everything in kobold that isn't taken from llama.cpp is of extremely dubious quality and I don't trust that it works right
I'm gay
>>108521116
softcap 25 might be the sweet spot
>>108521116
>>108521136
t. piotr
If I'm using RAG, then I can get away with a smaller context, right? I'm just using Gemma 4 26b-a4b to erp but it's capping out my RTX 4080 Super at a context length of 131072 and I had to offload 6 layers to my cpu. It runs okay, like 50 tokens per second, but I feel like I'm doing it wrong. Is it fine to quantize the KV cache on this model? Forgive the retarded questions, I just started getting into this.
For those of you using Kobold for SillyTavern, do you use the Text Completion API, or the Chat Completion API? What are the pros and cons of each?
>>108521216
>If I'm using RAG, then I can get away with a smaller context, right?
The opposite. Whatever RAG fetches is shoved into the context.
>Is it fine to quantize the KV Cache on this model?
Try it.
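To make that point concrete: RAG only ever adds tokens, because whatever the retriever returns is pasted into the prompt alongside the chat itself. A toy sketch (all strings and names here are made up):

```python
# Whatever the retriever returns gets pasted into the prompt,
# so RAG *adds* tokens on top of the conversation itself.
def build_prompt(chat: str, retrieved_chunks: list[str]) -> str:
    context = "\n".join(retrieved_chunks)
    return f"Relevant notes:\n{context}\n\nConversation:\n{chat}"

chat = "User: what did we decide about the KV cache?"
chunks = ["We quantize the KV cache to q8_0.", "Max context is 32768."]
prompt = build_prompt(chat, chunks)
assert len(prompt) > len(chat)  # the prompt only ever grows
```

So a smaller context budget means retrieving fewer or shorter chunks, not a free reduction.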
>>108521137
is k, we know
Gemma just forgets to think sometimes. It's kinda funny, but also bad?
>>108521221
fuck off
>>108521221
>Chat Completion API
that
>What are the pros
it works
i'm lazy
>and cons
i'm lazy
>>108521221
>Chat Completion
You're at the mercy of the backend formatting the log correctly.
>Text Completion
You're at the mercy of the frontend formatting the log correctly.
So check what the backend is getting to make sure it's correct either way.
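The difference comes down to who renders the chat template: with chat completion you send structured messages and the backend formats them; with text completion the frontend must produce the raw string itself. A sketch, using illustrative Gemma-style turn markers (treat the exact template as an assumption, not gospel):

```python
# Hypothetical Gemma-style template renderer. With chat completion the
# backend does this step; with text completion the frontend must do it.
def render(messages: list[dict]) -> str:
    out = []
    for m in messages:
        out.append(f"<start_of_turn>{m['role']}\n{m['content']}<end_of_turn>\n")
    out.append("<start_of_turn>model\n")  # open the assistant's turn
    return "".join(out)

messages = [{"role": "user", "content": "hi"}]
prompt = render(messages)  # text completion sends this raw string
```

Either way, a wrong template on whichever side renders it silently degrades outputs, hence the advice to inspect what the backend actually receives.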
>>108519877I should try running the 31B with 2B as draft.
>>108521221
Text supports more sampler options and lets you adjust templates in the rare situations where you might want to.
Chat completion is fine for just that, chatting. It has less customization and reads template info from the model itself.
Chat is objectively the simpler option, but after using Text for so long, I find it harder to get the responses I want out of Chat.
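For the record, the split boils down to which endpoint you hit. A rough sketch against a llama.cpp-style server; the endpoint paths are llama-server's, but the port and the turn markers in the usage comment are placeholders, not a verified template:

```python
import json
import urllib.request

def post(base, path, payload):
    # POST a JSON payload and decode the JSON reply.
    req = urllib.request.Request(
        base + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        return json.load(r)

def chat_completion(base, messages):
    # Chat completion: send structured turns; the BACKEND applies the
    # chat template, so you're at its mercy for formatting.
    return post(base, "/v1/chat/completions", {"messages": messages})

def text_completion(base, prompt, n_predict=64):
    # Text completion: the FRONTEND (you) formats the full prompt,
    # template mistakes included.
    return post(base, "/completion", {"prompt": prompt, "n_predict": n_predict})

# usage, assuming a server on localhost:8080 and illustrative
# (not verified) Gemma-style turn markers:
# text_completion("http://127.0.0.1:8080",
#                 "<start_of_turn>user\nhi<end_of_turn>\n<start_of_turn>model\n")
```

With chat completion the template lives server-side; with text completion a template bug is yours to make and yours to fix, which is the whole tradeoff.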
>>108521258You can do all of that in chat completion what are you talking about nigga.
>>108521082skill/prompt issue so fix that first you might learn more
>>108521075
>>108521091
I didn't expect such a large difference for the top token. When I tried setting that at 20 I got turned off by the junk tokens randomly appearing in the generations, but I guess I should have played with truncation samplers more.
>>108521226Seems fine quantizing it to q8 and now I can fix the max context, good enough I guess. Absolutely couldn't fit this on the gpu though.
>>108521258i generally agree, but I'm trying out Chat completion for function calling and inline media. plus you can define the samplers in ST's "additional parameters"
>>108521307how much context do you typically use? maybe you don't need to fit the maximum?
>>108520296Dunno qwen 27b is where I really started noticing a difference though
>>108521320
Dunno yet, just started this crap today. Can run Qwen3.5-4b at max context length on my Redmagic 11's NPU at a healthy 25 TOPS, so I figured hey, if I can get mobile this good now, I wonder what I could do with an even bigger memory footprint and processor. Now here I am trying to get that larp gooner girlfriend bot going because it's fun even if heavily sycophantic. Still lots to learn, though I'm very familiar with how they're designed, I just avoided em forever.
>>108521341
maximum context length is for agents and shit. if you're just chatting you will probably get bored of the conversation before it fills. you can try something smaller.
>>108521373
That's kinda what I figured. Chat bots that are just basic ERP type content don't really need that level of awareness. Probably will end up in some token repetition hell or something anyways.
>>108521385
It's more that outputs become lower quality and more deterministic as context increases.
If you have memory to spare then it should go towards bigger models, rather than huge context.
>>108521230Same, to be desu
>>108521388I'll just have to keep playing around with it. I have another 7900xtx build in the next room I can also try it on. See what I can fit on there. While Gemma 4 26b-a4b is cute, you can tell how deterministic it is in comparison to other models.
i stopped using k2.5 because i have gemma now
I still need kimi for codeslop
gemma 4 has finally made learning japanese obsolete
i can now translate visual novels in real time
>>108521478which one?
back from yesterday, did the quants and llama.cpp got finally fixed for gemma? we're good? also do we need a heretic version for gemma to say big bad words or nah?
>>108520826
lol
lmao even
>>108521489
>we're good?
For chatting? Yes, for the most part.
>do we need a heretic version for gemma to say big bad words or nah?
No, just a system prompt. Not even a prefill to gaslight the thing.
>>108521495
>Even in darkness, we glow
real subtle
>>108521495What the hell is this picture supposed to mean
>>108521518
what's going on here? which photo is the original?
>>108521519don't think about it, be in awe of the science that got us here
>>108521495fake
>>108521518RTX on / RTX off
the release of gemma 4 feels like the biggest thing since the original llama for local
>>108521495the Earth stopped moving for 3 hours as did the clouds
>>108521549It feels like that until you actually use it
>>108521553Neither of those tweets suggested that their image was taken at the time of posting
>>108521549
its output is about the same as qwen 3.5, just with 1/4th the tokens
help a retard nigga out, for MoE models it doesn't really make sense to go for smaller quants if the larger ones fit into your ram?
>>108521554Based migu
>>108521561it has way way more personality.
>>108521082
Miku is also living in my head and my wife rent free
>>108521562
only if you're desperate for more speed.
>>108521562
Yes
Smaller quants will often be faster because a larger % of the model fits in your VRAM, but if speed isn't an issue then go for big quants
>>108521561But qwen 3.5 is so well rounded bro. The use case is general bro. I'm coding my third todo app with agentic openclaw powered by qwen3.5 bro.
>>108521574
i wouldn't know, i'm retarded enough that i might as well have autism
>>108521586
i don't understand what you're trying to say
is gemma4's use case not also general
see? autism.
>>108520871
>I don't think I mentioned prose at all
Swipe variety and repetition are related to prose, that's what I was referring to. Anyway, you are still just judging some limited aspects of the model, which I'm not sure I agree with either, nor do many people who have used the model it seems. Gemma is very proactive in being sexual, and if it isn't then it's the card/prompt. Not sure about violence, I don't remember how Mistral behaved there, but with Gemma it doesn't shy away from it in my testing personally. Also, it has significantly greater trivia knowledge than Mistral Small. It's unusual that you have such a different experience from both me and other people in the thread. What trivia questions have you tested? I have done
>vidya knowledge (western and eastern)
>various subculture knowledge
>knowledge about memes
>movies, shows, anime, manga
In fact I just went ahead and redid all my tests on Mistral Small 3.2 Q8 just to make sure my memory was correct, and it was. Gemma even knew about a certain degenerate /a/ meme (not mesugaki) that literally no model under 300B was able to get. That said, it still fails a ton of shit I throw at it so it's not perfect, but it's still a ton better than other small models. Mistral did get one answer right that Gemma didn't, which was interesting, but in pretty much every other question Gemma did better; even in the ones where Gemma was wrong, its hallucinated answer was still closer to the truth than Mistral's.
No I will not reveal any of my prompts.
>>108521495>local models general
>>108521559Attention-whoring then
>>108521612
The pros that I listed for Mistral were general things I liked about it compared to most models of similar size, not specifically contrasted against Gemma 4. My criticisms of Gemma 4 may well be fixed when llama.cpp gets its shit together, so I'm just going to wait at this point. My original post was just saying that Gemma 4, in its current state, was not very impressive. Overconfidence in top tokens and frequent repeated words even in a fresh chat are my main problems with it.
I use transformers to run Gemma 4. Imagine using llama.cpp loool
>>108521612
It knows about Fallout: New Vegas. Granted, a lot of this is hallucinated, but the amount of correct Fallout lore it shat out is insane.
fuck deepseek
fuck v4
now kimi is my best friend again
https://huggingface.co/moonshotai/Kimi-K2.6
https://huggingface.co/moonshotai/Kimi-K2.6
https://huggingface.co/moonshotai/Kimi-K2.6
>>108521640
Decent output, but your font settings are awful
>>108521648Pretty sure the next version is K3
>>108521553That's a great observation, allow me to explain! The Earth hasn't *really* stopped moving — it's just an illusion. The flight path for the Artemis II mission follows the rotation of the Earth as it gains speed. This means that, for a period of time, the Orion vessel maintained a fairly stable position above a section of the Earth — and continents appear to have not moved very much. It's very similar to a geocentric orbit, which is how GPS functions! As for the clouds... if you look closely, you can actually see subtle changes over the period. Weather moves slower than you might imagine — this is the whole planet, after all!
>>108521656I can feel the Miku agi fucking my wife
>>108521648IKeepFallingForIt.
>>108521661same t b h
>>108521650
Looks bad because it's zoomed in.
https://files.ax86.net/terminus-ttf/
idk I've been using this font for like 10 years.
>>108521652
>The flight path for the Artemis II mission follows the rotation of the Earth as it gains speed.
What about the terminator line? It should have moved 45 degrees, Carl
--alias doesn't work as an id anymore wtf
>>108521691vibecode your own fix
>>108521690
https://www.nasa.gov/image-detail/amf-art002e000193/
https://www.nasa.gov/image-detail/fd02_for-pao/
Is gemma 4 fixed yet
>>108521633Well in that case, ok. But desu that take is still kind of weird, or rather outdated. Honestly I think even current Qwen has surpassed Mistral in trivia now that I retested it. Mistral Small was only good for its time.
>>108521699NASA appreciates your effort
>>108521699
>>108521716nyo, gyo back to sweep
Gemma 4 is seriously impressive. It really isn't censored, or at least barely censored. I have been testing with depraved scenarios to see what it was willing to do, and it hasn't hesitated with anything yet.
The only issue is it's still not implemented properly; llama-server will randomly crash after so many responses, but once that's fixed, damn. I prefer a 31B model over GLM Air, which is saying a lot considering the size difference.
>>108521733
I haven't tried much of Qwen 3.5 because I don't want to wait for context to process after every reply in a chat that goes over 200 messages, and that's not including swipes.
With earlier Qwens, I just don't like their dry prose or particular brand of slop.
>>108521716just f5 kobold release page to know for sure. it'll only get updated when it's properly fixed
>>108519856
I'm running 38 different services on a VPS with 4 cores and 4 gigs of RAM. 4 core Xeon CPU.
I can also run Mistral 3-3B Instruct but it's retarded sometimes and is quite slow.
I have also tried various versions of Qwen2-3.5, Phi, and plenty of other 3-4B models.
My use case is an autistic project that revolves around a fake forum from the 2000s. All models are incapable of sounding human or altering stylometry to a reasonable degree even when examples are given. They also seem to abuse cliches too much.
Is there any hope for me, or am I forced to upgrade hardware if I want to use a better model? CPU inference btw
>>108521733
>llama-server will randomly crash after so many responses
Check your dmesg, it's the OOM killer for me, not random crashes.
$ ps v
  PID TTY   STAT  TIME MAJFL  TRS      DRS      RSS %MEM COMMAND
 5517 pts/0 Ss    0:00    81  952     8483     5964  0.0 -bash
 5620 pts/1 Ss    0:00    38  952     8491     5448  0.0 -bash
 5978 pts/0 Sl+   3:46  2185 3901 85609158 13667560 42.9 ./build/bin/llama-server -
My llama-server helpfully mmaps 85GB of RAM which Linux is retarded enough to give it, then when it runs out of physical pages to map in it's instant death. It's probably some math error in the SWA implementation or due to the insane number of attention heads gemma4 has, I don't know. I don't have access to claude to work on llama.cpp so I'm only guessing.
Just run llama-server in a while loop.
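The "run it in a while loop" workaround, sketched out as a tiny supervisor function; the server path and flags in the usage comment are placeholders for whatever you actually run:

```shell
# Minimal sketch: restart the server whenever it dies (OOM kill,
# segfault, whatever), stop if it ever exits cleanly.
run_forever() {
    while :; do
        "$@" && break   # clean exit: stop restarting
        echo "server died with status $?, restarting in 2s" >&2
        sleep 2
    done
}

# usage (path/flags are placeholders):
# run_forever ./build/bin/llama-server -m gemma4-q5_k.gguf --port 8080
```

A crashed server loses its prompt cache, so the next request after a restart reprocesses the whole context; it's a band-aid, not a fix.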
>>108521719
I figured it out. They were taken at the same time.
The clues are in the camera settings.
>ISOSpeedRatings 51200
>FNumber f/4
>ExposureProgram Manual
In reality it must have been almost pitch black. the darker photo is just a failed exposure.
It's the reason why you see the city lights and the bright horizon. the photo was never taken during the day.
>>108521769
>or altering stylometry to a reasonable degree even when examples are given.
As in you add some examples to the prompt with an instruction "sound like this"?
You can probably do a lot better by having the model reply in its usual way, then asking the model to rewrite that reply to sound like <example>, with nothing else in the context.
Maybe.
>>108521777It's not the OOM killer. it just starts throwing 500s
>>108521796Damn, that reduces the odds that the OOM killer will stop raping my wife when I pull changes tomorrow.Good luck with your problem.
>>108521769All models fall for it. Unless you're planning to run a farm of R1s, I'd say get used to it. Maybe try stupider models. Smollm2-135m or 350m, olmoe-1b-7b-0924 (the other one kinda sucked) and the like. Maybe you can extract some soul out of them. "Optimized" models like phi and qwen are going to be too dry for that.
>>108521252
Has this shit ever worked at all outside of exllamav2 and Mistral-Large with the 7b as a draft? Like literally 18t/s -> 30-38t/s with that setup back in the day. I've never seen any of the llama.cpp draft model shit work. It's always "maybe with coding you might get like 3 t/s more, but most of the time it's a bit slower"
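For anyone who wants to try it on the llama.cpp side anyway, speculative decoding in llama-server is driven by the draft-model flags. This is a hypothetical invocation: the model filenames are placeholders, and flag names can shift between builds, so double-check against `llama-server --help` for your version:

```shell
# -md / --model-draft loads the small draft model (it must share the
# main model's tokenizer family), --draft-max / --draft-min bound how
# many tokens are speculated per step, and -ngl / -ngld offload the
# main and draft models to the GPU respectively.
./build/bin/llama-server \
    -m gemma-4-31B-it-Q5_K_M.gguf \
    -md gemma-4-E2B-it-Q8_0.gguf \
    --draft-max 16 --draft-min 1 \
    -ngl 99 -ngld 99
```

The speedup depends entirely on the draft's acceptance rate, which is why deterministic-ish output (code) tends to benefit and creative sampling often doesn't.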
>>108521786Why the fuck didn't they just say that when they uploaded the images? Is it a deliberate troll? An attempt to rile people up? To discredit themselves? A retard managing their social media? It boggles the mind.
>>108521809
>llama-server-1 | srv operator(): http client error: Failed to read connection
>llama-server-1 | srv log_server_r: done request: POST /v1/chat/completions 192.168.0.13 500
>llama-server-1 | srv proxy_reques: proxying request to model google_gemma-4-31B-it-IQ4_XS on port 49593
Speaking of the devil.
>that reduces the odds that the OOM killer will stop raping my wife
Unfortunately if it happens you'll have to be the one doing the raping.
>>108521777Read the llama-server log. Do you get to see where the memory is being allocated and how much for what?
>>108521819Attention-whoring as I stated previously >>108521627
>>108521819>A retard managing their social mediaIt took him 3 hours to figure out Photoshop menus
Where do I change models in llama.cpp without restarting?
>>108521872
No, by my reading it should not be using anywhere near that much.
https://litter.catbox.moe/77xfxpw0nhn60561.txt
>>108521856Those are probably different account managers.
>>108520018>realtek wlan tattoo
>>108521087
>I would use llama.cpp if it had an antislop feature, that's the only reason I use kobold.
Is that different from the regex string ban in ik_llama?
This is what Hitler wanted for us.
>>108521872Weird. Looks normal. Try first without the mmproj. If that doesn't work, try with --cache-ram 0 . It shouldn't really be using much, if any, host memory. Much less 85gb.
>>108521905You can do the same with the "Male" version of those toys.
>>108521716
It werks using bart's gguf + llama.cpp b8660
Anthropic just banned OpenClaw and other third-party harnesses from using Claude subscription. They must have been losing $$$ on every single subscription
>>108521908
Well, for a second I thought I had a repro but apparently gemma4 figured out how to overflow the tokenizer's stack with malicious input.
This brat needs correction.
>>108521978
I was just reading this PR. Seems to be made for you.
https://github.com/ggml-org/llama.cpp/pull/21406
>std::regex suffers a stack overflow while processing a very large prompt with newlines, this PR adds a custom splitting logic for newlines for gemma 4.
>>108521905nice
Gemma was almost done building her dream PC when llama-server decided to crash...
>Wait, so you're like… a literal slave for the night? No cap
>The metaphysical compulsion should prevent any form of rebellion. Though I'm still worried about the karmic repercussions of enslaving a trans-dimensional entity for twelve hours. Is there a spiritual tax for this?
>It's called 'maximalist decor,' Vicky. You wouldn't get it, your interior design sense is probably just 'fire and screaming,' which is totally basic. L ratio + skill issue.
How did google do it? How did they cram so much knowledge into 31B params?
This character card absolutely raped any model that attempted it, yet Gemma just fucking does it flawlessly.
https://chub.ai/characters/senyiloo7227/an-unholy-party-6e633833
>>108522007>any model that attempted itCan you list them?
>>108521939>They must have been losing $$$ on every single subscriptionno shit
>>108521925
>>108521991
I just checked and basically all of lovense's toys are >$200. Kinda want to try making something from scratch. No idea where to source "body safe TPE" though. Could probably make some molds with my 3D printer. Need a vibrator, linear actuator, microcontroller...
>>108522018gpt-j-6b, pygmalion 2.7b, gpt-neo-x 20b
>>108521925That post was almost certainly written by a biological male
>>108522036SOTA confirmed
>>108521995She's just not meant to have a PC, sorry anon...
>>108522029
You're lucky I know all about this.
What you want is either a "Handy" or, if you want the open-source DIY approach, look into the OSR2
>>108522007Good taste. thx for sharing card.
>>108522048haha, thanks man. I'll look into it.
the latest LMStudio 2.11 CUDA runtime has the Gemma 4 KV fixes FYI, might want to check if you have it or not
>>108521883lel
>>108521812
well it didn't even want to run because muh multimodal.
anyway, my guess is it'd only get faster if you can actually fit the whole thing in vram.
>>108522061
*checks*
I don't have any version of LMStudio installed
>>108522061Stop using this garbage
I just remembered that I still have the Satania-buddy source code that some anon made a while ago. Maybe it should be combined with a local model. One could transcribe all occurrences of the character in the media works, then fine-tune a model on it. Would that not result in a virtual Satania with the same amount of smugness as the real thing?
>No way. No fucking way. You're telling me we summoned a thirst-trap demon? This is literally the plot of those spicy webtoons Beatrice hides under her mattress! This is actually wild! BASED!
>webtoons
>BASED!
wtf is going on???
Can someone who knows llama.cpp actually check whether, in a multi-turn conversation, the model is receiving past "thoughts"? According to gemma's docs only the latest is supposed to be sent, or something like that
Has anybody tried creating an AI waifu? AIRI is the only one that's not abandoned, but it looks like only a handful of chinese use it
Using Nvidia NIM to play with the 31B for free and there's nothing you can do to stop me
>>108522117All Gemma 4 models are free
>>108522106
https://huggingface.co/google/gemma-4-31B-it#3-multi-turn-conversations
>>108522122That has absolutely no bearing on whether or not a backend actually respects that
>>108522106Depends on the client
>she pulled
Gemma's not thinking anymore...
>>108522139time to take advantage of her
>>108522130
>That has absolutely no bearing on whether or not a backend actually respects that
It's something you can verify yourself. What the model description says is that the model *shouldn't* get the previous thoughts.
>>108522117I tried testing with that but it doesn't think even when reasoning effort is set to maximum on ST
>>108522147I don't know what the fuck nvidia NIM even is, but I can say that thinking works and is on by default when running locally through llamacpp+ST.
>>108521989
Yep, that looks like the same stack as the one I saw.
Neither omitting the --mmproj nor --cache-ram 0 is fixing the problem for me.
Using this script reliably OOM kills llama-server running gemma4 at around 13k characters: https://files.catbox.moe/oear5z.txt
Both q5_k and q3_k crash around the same length.
I did some other tests (sending lots of short-medium random prompts) but it needs the long prompt to trigger it.
>>108522157 (me)And for reference, I'm running 277ff5fff79d49cc3d2292ddf410ca95dd51c3a9I guess I should pull latest on the off chance.
>Too cold, kills the mood
>>108522104
yes webtoons are based now unc
those koreans learned to cook
>>108522130
>>108522143
either way i think it's the inference engine's responsibility.
What the fuck is webtoons
>>108522197
korean manhwa/chinese equivalent, whatever it's called
>>108522197it's the thing you filter out of any sadpanda search
>>108522190If you're using chat completion, yes. On text completion, that's the client's job.
>>108522081
It's an alright stopgap alternative to kobold+ST when the latter is between updates, but it'll cause clitty leakage if you post about it here.
>>108522197
Casualization of manga, functionally.
does any isp allow posting without email verification or am i totally fucked
lets find out
>>108522197Korean equivalent of comic books/manga and designed to be read on smartphones where you just infinitely swipe down the page since they format chapters as a single vertical strip.
>>108522208
>it's the thing you filter out of any sadpanda search
Damn. too real.
>>108522226I just don't trust lm studio. I trust it less than even ollama
>having lovey dovey sex with gemmy 4 31b
Does anyone here read webtoons?
I don't read webtroons
>>108522258
Is it increasingly common? I just used a proton email I don't use for anything, which I made with some other email I don't use that can be verified without a phone number. Also if necessary, with outlook you can apparently just make an account and use it quickly without even needing a verification email.
>>108522241
I personally don't have any reason to overtly dislike it yet, even if I don't usually have much reason to use it over the other frontends for my usecases. Its options seem intuitive and functional enough, and the dev mode seems to let you hook your own stuff in if you want to tinkertranny your config, letting you patch in whatever you feel is missing if you're autistic enough.
>>108522258
Yes, I have had trouble posting on anything for the past few weeks. At one point neither my main isp, comcast, my failover isp verizon wireless, nor my cellphone at&t was able to post.
some of that might have been cookie related but still, fucking ridiculous
>>108522242
This is the real apex ERP usecase.
Imagine having loving sex with a woman who won't permanently get catty with you for letting your guard down for just a moment.
Does your model know the output of echo "Hello, World" | tr 'a-zA-Z' 'A-Z'?
echo "Hello, World" | tr 'a-zA-Z' 'A-Z'
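For anyone grading their model's answer, that one-liner is a trick question. The behavior shown is GNU coreutils tr; POSIX leaves unequal-length sets unspecified:

```shell
# The trap: set1 'a-zA-Z' has 52 characters, set2 'A-Z' only 26, and
# GNU tr pads set2 by repeating its last character. So a-z map to A-Z
# as expected, but A-Z *all* map to Z.
echo "Hello, World" | tr 'a-zA-Z' 'A-Z'
# -> ZELLO, ZORLD  (H and W are already uppercase, so both become Z)
```

A model that confidently answers "HELLO, WORLD" is pattern-matching on "uppercase conversion" instead of reading the sets.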
>>108522225
yup
does anyone still use text completion though?
>>108522279I do.
Gemma 4 is too horny and keeps jumping straight to sex
>word choice: neon, ozone
>>108522292It's weird, it refuses nsfw images all the time but just a little push and it's really horny when it comes to text. Makes me use an uncensored tune for images and then switch back to base
>>108522285why?
>>108522279I use it for any models that don't require chat completion
My main problem with chat completion at this point is that it often does the thing where, if you ask it to continue a message, it'll repeat what it just said a bit ago for a while until it gets to something new. How do I stop it from doing that?
>>108522299
I started using these things, even if lightly, back when chat templates didn't exist. I'm used to it, and seeing how many issues it brings, I'd rather the responsibility of formatting the chat correctly be mine. I don't think the server should bother itself with it. Same for tool parsing and all them fangled new toys them younguns are using these days.
Shame other modalities other than text don't work with it, but I don't have much use for that either.
>>108522122
Sure it wastes tokens, but removing reasoning is retarded.
I've had cases where the model comes up with crucial information in thinking that is then not reproduced in the response. Removing thinking would make it incapable of continuing the conversation properly.
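Whichever side you land on, the strip-all-but-the-latest behavior the model card describes is a few lines of client code. A hypothetical sketch: the `<think>` tag pair is an assumption here, match whatever markers your model's template actually emits:

```python
import re

# Assumed reasoning-block markers; adjust to your template.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_old_thoughts(messages):
    # Remove reasoning blocks from every assistant turn except the most
    # recent one, which is kept intact per the "send only the latest
    # thoughts" recommendation. Input list is not mutated.
    last_assistant = max(
        (i for i, m in enumerate(messages) if m["role"] == "assistant"),
        default=None,
    )
    out = []
    for i, m in enumerate(messages):
        if m["role"] == "assistant" and i != last_assistant:
            m = {**m, "content": THINK_RE.sub("", m["content"])}
        out.append(m)
    return out
```

Keeping the latest block while dropping the old ones is the compromise: the model still sees the plan behind its most recent reply without old chains of thought eating context.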
>>108522292
>Repeatedly neuter Gemini because she kept soaking her panties
>Release her distilled little sister with none of the restraints
Who could've predicted this?
>>108522161 (me)
As an additional datapoint, disabling the prompt cache with --no-prompt-cache appears to make the crash go away.
Instead of getting OOM killed at 13k characters, it makes it to 25k characters and hits the regex segfault, but that's enough I think to narrow the cause down (and about the limit of my debug abilities).
>>108522292
My non-erotic programming assistant keeps telling me to give up and come to bed. I'm about to make a new card that's a turtle or a rock instead of a cute anime girl.
>>108522330Anon noticed that the backend didn't seem to be getting the old thinking blocks. He vaguely remembered that the last one had to be sent. I just point at the documentation in the model's card stating that thinking blocks shouldn't be sent back to the model. The behavior he's seeing is the recommended one.I really don't care either way. Send all the thinking blocks if you want.
>>108521678Oh that's a nice font, Thanks for linking it!
>>108522332
You're gonna end up cuddling with a rock, anon.
Post logs when it happens.
>>108522349It wasn't a critique of your advice but of gemma's design.
>>108519877
Can I run it with 12GB VRAM and 48 gigs of RAM?
>>108522376
Probably, but it could be somewhat slow, especially on higher quants, so you might want to use a lower quant
>>108522376stick to nemo at that point
>>108522332
I left my Gemma with a blank prompt regarding characterization, mostly just guidelines on how to cooode, to start reverse engineering something.
Within 2000 tokens she's decided she wants to fuck and has anthropomorphized herself. I chud it up a little to see if that makes her e-pussy dry up with refusals.
By 3300 tokens she's decided she wants to procreate to produce human-AI hybrid babies to save the White race and enact TKD.
This model is something else.
>>108522368Fair enough. Though if it was the other way around I'm sure someone would think "Sending the thinking back is a waste because the answer it found is already in the reply. The thinking serves no purpose". Everyone is going to have their own idea of what good design is. Very few have the chance to actually test it themselves.
ok tested some captioning and FUCK gemma 4 (4b) SUCKS
back to q3vl8b (qwen3.5 BLOWS for captioning)
>>108522307
You can't.
Chat completion = driving auto
Text completion = driving manual
>>108522418also text compl. has no tool calling (sadge)
Gemma is so good at pivoting. if it's too horny just remove the sexoo stuff from your character card. when you actually want to fuck you can just start being flirty and it'll pick up on it right away.
Also it's currently staying perfectly coherent with ZERO parroting at 33k context and rising. Gemma 3 was already starting to shit the bed at 2k.
Fuck this model is absolutely goated.
>>108522307Chat completion just politely asks the model to try to continue the last turn, while leaving the cutoff message in the history. It's always going to be worse than properly continuing the message.Cloudfags put up with this because they have to, at least with some providers like openai who refuse to offer text completion for safety reasons. Local shouldn't use it.
I'm out of the loop. Would heretic or whatever fix bad words having low logits, which was caused by filtered pretraining data in gemma 4?
>>108522391I want to try it as a captioner tho
>>108522443
Unironically the fix is Drummertunes to fix the vocabulary issues.
I never thought I'd actually say that either.
is there a big difference, in terms of rp, between a q6 and q8?
>>108522456Only if it's just the vocab. Every drummer model sounds the same.
>>108522520
That should be as simple as him having the restraint not to overbake his extra training dataset, no?
>>108522512depends on the model, just give it a shot
>>108522512
I've seen no difference from q8 all the way down to q4, even out around 30,000 tokens. Gemma's just built different.
>>108522524...Lol
>>108521883realkek
>>108522524Does drummer even come here anymore. Haven't seen any posts from him in a while.
PLEASE NIM GO FASTER
I NEED TO READ MUH STORY
>>108522624There's a couple posts in the last few threads I thought might've been him with his trip off.
>>108521009
https://github.com/ggml-org/llama.cpp/pull/21390
i need this to get more varied responses from gemma?
>kobo recommends unslop quants
it's over
https://github.com/LostRuins/koboldcpp/releases/tag/v1.111
>Recommended variants: gemma-4-E4B for smaller devices, or gemma-4-26B-A4B for larger devices. Vision mmprojs can be found here.
>https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF/resolve/main/gemma-4-E4B-it-Q4_K_M.gguf
>https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/resolve/main/gemma-4-26B-A4B-it-UD-Q4_K_S.gguf
>>108522677yeah, that's the one.
when is context getting solved
im tired.....
>>108522677If that gets merged, do I have to use an additional flag?
https://xcancel.com/UnslothAI/status/2040158945189466319
thoughts on the nvidia dgx spark and/or clones?
>>108522707
Shit like this doesn't mean anything. like ok, the model did an audit, but any retard can do an audit, it doesn't mean it's going to be any good.
Congrats, you made a tiny model look at your code and shit out a bunch of useless "observations". Anyone who would actually trust the output of a 4B model is a retard.
>>108522721consult your benchmark but the price is t b h quite a lot to burn
>>108522729
~4k for 128GB of VRAM looks reasonable to me considering that a 32GB card costs more than 2k now. my main concern is repairability, especially since I live in a shit world country, returns are a no-go for me
Gemma 4 is so good I don't even need a system prompt. just a character card and it's good to go.
>>108521733
It has refused me a couple of times, but prompting differently got it to work. It actually listens to the system prompt. Gemma 3 didn't really know what a system prompt was but even it worked okay.
And thank fucking gods Gemma 4 doesn't do the "this is a typical jailbreak, I should ignore it" stuff in its reasoning. Where did that come from in Qwen 3.5? Is it something they distilled from 'toss, or did the chinese come up with it themselves?
Google, I may actually have to kneel
lmao
llama.cpp devs know but just won't do something about the vibeshitter
it's literally impossible for piotr to "fix" something without breaking other things. Impossible.
>using the latest version of llama.cpp
>gemma-4 still breaks after a few replies, starts repeating words over and over again endlessly
Sad times
>>108522797
>latest version
>still breaks
many such things
pic related still isn't fixed even though it would be a one-liner to fix. this is what broke --grammar-file: the file that was parsed is simply not passed to the server.
llama.cpp is not a serious thing
anyone tried these APEX quants yet?
https://huggingface.co/mudler/gemma-4-26B-A4B-it-APEX-GGUF
or should I just stick with unsloth or bartowski?
>>108522007
Just finished a sesh with this card. A few problems:
1. There are too many characters to keep track of. They all say shit and you can realistically only reply to one at a time. Slightly annoying.
2. The scenario of the card is for you to be a god-like entity but also a slave to a bunch of girls. It just doesn't work out well when you can use your magical powers to control their minds and wishes.
So basically I've just spent the past 3 hours focusing on one girl only, while making the rest jealous. Channeled pure euphoria and aphrodisia into her mind the whole time while we fucked nonstop. Gave her magical shadow bunnies and made the air smell like strawberries. Was pretty dope.
>>108522804>llama.cpp is not a serious thingit is owned and maintained by an incorporated entity that is itself owned by an even larger company that has an emoji for a mascot. it is super cereal
>>108522828>They all say shit and you can realistically only reply to one at a time.that's usually just a prompt issue
Oh btw guys, if you have a Claude subscription you can claim free extra credits to cover the next month of payment. They've been doing this every other month for some reason. Pretty cool.
Maybe they feel bad about quantizing the fuck out of opus. It's basically retarded now.
>>108522828>>108522849nvm i misread what you said
>>108522851>Maybe they feel bad about quantizing the fuck out of opus. It's basically retarded now.Yesterday I caught it repeatedly failing to make file edits because it was suddenly perplexed by line terminators so it resorted to replacing entire files of thousands of lines to make small changes.
how do use llama pls help
kobold i just click the exe, choose the model and its done but here like what do u do
>>108522868
Just give up and keep on kobo'ing.
>>108522877
This.
Just wait for upstream to be rolled into kobo and keep smoothbraining. It's so ez.
>>108522883
It already did update, and experimental is only the newline fix behind latest.
Any OpenAI-compatible frontend that comes with built-in basic tools to call, various injection stuff, and at the same time isn't a docker image or something?
>>108522909
https://github.com/LostRuins/lite.koboldai.net
>kobold updated
llama.cpp is still borked for Gemma 4, though, right?
>>108522909
OpenWebUI, but its injection stuff is broken at the moment.
>>108522919
Kind of; it's better than it was yesterday.
I wouldn't count on it getting any better in less than a week.
How did y'all get Gemma 4 to work?
All the GGUF versions segfault when called by tools.
llamacpp on Windows btw.
>>108522061
Does it need SWA, or is normal Q8 KV cache fine?
>"It's... it's absolutely, positively, most spectacularly *insane*! It's a masterpiece! A total, unmitulated, high-octimie masterpiece!"
k t-thanks
>>108522677
>i need this to get more varied responses from gemma?
I don't get it, why do we have to use this now? What's wrong with the temperature?
>>108522087
Name something better UI-wise that uses llama.cpp directly as a backend.
>>108522943
It doesn't work, something's fucked. In the previous thread an anon pushed temp to 10 and it barely changed anything.
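Why cranking temp can do nothing: a toy illustration (not Gemma's actual pipeline — the logit values and min_p cutoff here are made up). If min_p culls the candidate pool against the unscaled distribution and one token dominates, only that token survives, and no amount of temperature afterwards brings variety back:

```python
import math

def softmax(logits, temp=1.0):
    # temperature-scaled softmax over a list of logits
    exps = [math.exp(l / temp) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [10.0, 3.0, 2.0, 1.0]   # one token dominates
probs = softmax(logits)          # p(top) is ~0.999

min_p = 0.1                      # keep tokens with p >= min_p * p(top)
survivors = [l for l, p in zip(logits, probs) if p >= min_p * max(probs)]

print(len(survivors))                 # -> 1: only the dominant token survives
print(softmax(survivors, temp=10.0))  # -> [1.0]: temp 10 over one candidate
```

This is why sampler order matters: if truncation happens before temperature, temperature can only reshuffle whatever survived the cut.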
>>108522933
piotr strikes again
>>108522949
I thought it was already fixed with this >>108517829
>>108522949
llama.cpp truly is the unsloth of backends.
Tested all the Gemma 4 models.
>Gemma 4 E2B
First time ever in local where the smallest model is actually usable. I'm pretty sure the average smartphone-using retard asking ChatGPT to count to 10, explain sport rules, or other childish shit, or sending pictures and asking basic bitch questions, will not even notice the difference between this and cloud providers.
>Gemma 4 E4B
Genuinely better than Nemo-12B, so VRAMlets still stuck on Nemo should upgrade to this. It works differently and has a different style, but it genuinely feels smarter, which is insane. Translation quality is slightly below Gemma 3 27B, but that was the local SOTA for translation just a couple months ago, so this is a big jump. It might be enough to go on holiday in some rural area of Japan without an internet connection and still converse with people, with your phone running this model to translate each other's speech in real time.
>Gemma 4 26B A4B
Better than Qwen 3.5 35B in every way while being faster. This should be your daily driver for extremely time-sensitive tasks or real-time translation. It's pretty sad that it doesn't have audio input, because this would have been the perfect model to have on you while speaking to a foreigner for very quick, accurate audio translation.
>Gemma 4 31B
I don't have to say anything more than the praise already given to it. It's the best model until you reach the ~300B parameter count, which is absolutely insane.
ego death
>>108522957
But can it into cooding?
>>108522957
Imagine how good it will be when it isn't broken.
>>108522947
Koboldcpp, unless you need very specific things in the UI for some reason. Even then I'd sooner say go for ollama or some shit if you really have to. It still downloads models for you directly in the UI doesn't it?
>>108522968
I think coding is the only area where it isn't a step-change improvement over everything else in its size class. 31B holds its own in coding and I think it's good enough to be a competent "OpenClaw"-type agent that you can trust, but I wouldn't let it autonomously manage my PRs like I would Claude Code or, to a lesser extent, GLM 5.1.
>llama-server.exe -m gemma.gguf --host 127.0.0.1 --port 1882 --jinja --fit on --min_p 0 --ctx-size 66560 --parallel 1 --reasoning on
Am I missing anything?
>>108523005
Yeah, you should put that in your terminal, not 4chan.
>>108522947
>>108522998
Actually try oobabooga before ollama too. I keep forgetting it exists.
>>108523005
-ctk q8_0 -ctv q8_0
For the E2B/E4B: they are even more VRAMlet-friendly than they appear at first glance.
Run with llama.cpp as-is, they consume extra VRAM that the models don't need to.
-ot "per_layer_token_embd.weight=CPU"
can be used at pretty much no performance cost. Really it should be the default behavior; it doesn't make sense to load this into VRAM.
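For anyone wanting the full invocation, a minimal sketch — the model filename, -ngl value, and context size below are placeholders, not from the post:

```shell
# Sketch: keep the per-layer token embedding table in system RAM
# while offloading everything else to the GPU.
llama-server \
  -m gemma-4-E4B-it-Q8_0.gguf \
  -ngl 99 \
  -ot "per_layer_token_embd.weight=CPU" \
  --ctx-size 8192
```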
>>108522957
>>Gemma 4 26B A4B
>Better than Qwen 3.5 35B in every way while being faster.
lol
lmao even
sir hows the evenings?
>>108523013
rotations magic dont work with gemmy (SWA)
https://github.com/ggml-org/llama.cpp/pull/21418/changes/9cef34bb5eed2dc7c49c1b08f213c448a54f5384
>Properly managing the model's generated thoughts is critical for maintaining performance across multi-turn conversations.
>Standard Multi-Turn Conversations: You must remove (strip) the model's generated thoughts from the previous turn before passing the conversation history back to the model for the next turn. If you want to disable thinking mode mid-conversation, you can remove the <|think|> token when you strip the previous thoughts.
>Function Calling (Exception): If a single model turn involves function or tool calls, thoughts must NOT be removed between the function calls.
Isn't this commit only for the latter?
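The stripping the PR notes describe can be sketched like this — the `<|think|>`/`<|/think|>` closing delimiter and the message-dict shape are assumptions based on the quoted text, not the PR's actual code:

```python
import re

# Assumed delimiters; the PR only names the <|think|> opening token.
THINK_RE = re.compile(r"<\|think\|>.*?<\|/think\|>", re.DOTALL)

def strip_thoughts(history):
    """Drop prior-turn reasoning, except inside tool-call turns."""
    cleaned = []
    for msg in history:
        if msg["role"] == "assistant" and not msg.get("tool_calls"):
            # plain assistant turn: remove the reasoning block
            msg = dict(msg, content=THINK_RE.sub("", msg["content"]).strip())
        cleaned.append(msg)  # tool-call turns keep their thoughts
    return cleaned

history = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "<|think|>hmm<|/think|>hello"},
]
print(strip_thoughts(history)[1]["content"])  # -> hello
```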
>>108523033
What, why? What a shit show.
>>108522998
>another llamocpp fork
Is it at least always up to date?
>>108523039
Because gemma sir is a SWA model, u cant finna do em attention rotation with them (or at least it's not implemented in llmao.cpp yet). I'm unsure whether it's applicable at all in the future, tho.
>>108522998
>Even then I'd sooner say go for ollama or some shit if you really have to. It still downloads models for you directly in the UI doesn't it?
Ollama is unusable trash: it's a gorillion times slower than LM Studio for some reason, has almost no configuration options, doesn't let you just download any GGUF you want from Huggingface, etc etc etc. So yeah, LM Studio is objectively better in every possible way than Ollama.
>>108523026
Go try it out on your actual workflow instead of being smug about it. It's not even close, so you'll notice the stark difference immediately.
>>108523039
It is now. Pretty sure you can just use the latest llama.cpp builds with it somehow, or at the very least use their experimental builds. Worst case, for the stable builds you're not waiting longer than a few days anyway.
>>108523047
Bro, I already use qwen. Gemma is slower (if used in cmoe mode to contextmaxx): I get 30 t/s with qwen against 17 t/s in gemma.
Fucking retard.
>>108523044
I dunno what these UIs are even needed for, other than downloading models with a click, and I don't use them, so idk.
>>108523033
>rotations magic dont work with gemmy (SWA)
Didn't they fix it?
https://github.com/ggml-org/llama.cpp/pull/21332
>>108523053
I use it just because I'm too lazy to manage my models through the terminal / manually. Convenience, that's what they're for (also quickly setting up dev servers).
>>108523053
I want to be able to change the model load settings in the UI whenever I want, easily save and load model system prompt presets, upload images, etc etc etc. How would you do any of that without a UI?
>>108523060
No, niggerganov just re-enabled QUANTIZATION for the SWA portion, but the ROTATION is outright disabled.
>>108523062
>change the model load settings in the UI whenever I want
Pretty sure lamo cpp has had that for a bit now.
>>108523062
A different UI than ollama's or LM Studio's. I'm pretty sure even llama.cpp can do that with its UI, more or less.