/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108510620 & >>108508059

►News
>(04/02) Gemma 4 released: https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4
>(04/01) Trinity-Large-Thinking released: https://hf.co/arcee-ai/Trinity-Large-Thinking
>(04/01) Merged llama : rotate activations for better quantization #21038: https://github.com/ggml-org/llama.cpp/pull/21038
>(04/01) Holo3 VLMs optimized for GUI Agents released: https://hcompany.ai/holo3
>(03/31) 1-bit Bonsai models quantized from Qwen 3: https://prismml.com/news/bonsai-8b

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
►Recent Highlights from the Previous Thread: >>108510620

--Discussing Gemma-4-31B's high intelligence and Google's alignment strategy:
>108511179 >108511186 >108511216 >108511265 >108511214 >108511231 >108512478 >108511252 >108511269 >108511274 >108511279 >108511379 >108511284 >108512395 >108511286
--llama.cpp bug causing gibberish outputs in Gemma 4 quants:
>108511688 >108511696 >108511700 >108511744 >108511763 >108511758 >108511770 >108511777 >108512875
--Comparing Gemma 31b and Kimi for local translation and performance:
>108511601 >108511608 >108511619 >108511630 >108511618 >108511787 >108511858 >108511868 >108511888
--Anons criticizing pwilkin's Gemma 4 tool calling fixes:
>108511372 >108511381 >108511396 >108511403 >108511422 >108511458 >108512277 >108512263 >108511415 >108511471
--Anon reports 31B model performance compared to Qwen 27B:
>108511927
--Gemma 4 and Qwen3.5 reasoning time conciseness compared:
>108513575
--Discussing Gemma 4's high Elo scores relative to parameter count:
>108511320 >108511337
--Comparing Gemma 4 31B to Qwen 3.5 and discussing context shifting:
>108511952 >108511977 >108512002
--Discussing koboldcpp update status and its differences from llama.cpp:
>108510742 >108510752 >108510754 >108510757
--Criticizing NVIDIA's use of percentage comparisons over raw performance metrics:
>108511801 >108511809 >108511820
--Debating the merits of Intel Arc Pro B70 versus Nvidia and Tesla P40:
>108511239 >108511311 >108511364 >108511394
--Testing model censorship and discussing VRAM requirements for 31b models:
>108510641 >108510663 >108510687 >108510709 >108510675 >108510684 >108513142
--Discussing lightweight quants for Gemma and comparing model censorship:
>108511486 >108511528 >108511535 >108511563 >108511703 >108511728 >108511826 >108511844 >108511605
--Teto and Miku (free space):
>108511323 >108511773 >108512486

►Recent Highlight Posts from the Previous Thread: >>108510966
Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
Why are anons using chat completion again? Is it just for image support?
I hope they'll release the big one soon. They already provided some hints.
>>108513906
So that I don't have to rely on the front end grafting the proper chat structure and can just use jinja on the backend. And for image support.
>>108513878Maybe it's the quant? Could also just be the tokenizer being borked.
>>108513894
That reminds me. This is what my Q8 31B drew.
>>108513906All new models just work much better with chat completion. They're trained too hard on the jinja template.
I humbly ask for your strongest Local Models whitepills in these trying times.
Get away from my wife Miku
>>108513933Bald miku
>>108513936What do you mean? gemma 4 31b saved local
>>108513937
>>108513945Poor Sam. Despite his best efforts, local was unsafed.
I think gemma4's a pretty good llm. seh can be convinced to name teh jew, talk about cunny and doesn't afraid of anything
>>108513940Miku? Is that the girl from fortnite?
>>108513945You're that one bot aren't you?
come on, llama.cpp, fix your gemma shit!
>>108513920
>I don't have to rely on the front end grafting the proper chat structure
But the server cannot manage that either thanks to piotr. May as well just do it yourself.
>And for image support.
Yeah. There's the rub.
>>108513936
You can still use the template on text completion.
I suppose the actual question, when using only text, is why aren't you writing your own clients?
Teto Server.
>>108513976Real?!Specs?!?!
>>108513937See >>108513608
>>108513968
>You can still use the template on text completion.
I know you can but there's no point if you're just going to end up recreating what the jinja does anyways. I used text completion with mistral with a schizo template that actually worked really well, but like I said, the newer models just don't play nice with anything that's not their template.
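For reference, "recreating what the jinja does" for a Gemma-family model is only a few lines. The `<start_of_turn>`/`<end_of_turn>` tags below are Gemma's published chat format; whether Gemma 4's actual template adds anything on top (system turn handling, tool blocks) is an assumption you'd verify against the model's own jinja:

```python
def to_gemma_prompt(messages):
    """Render OpenAI-style chat messages into Gemma's turn format
    for use with a raw text-completion endpoint."""
    out = []
    for m in messages:
        # Gemma folds the "assistant" role into "model"; everything else is "user".
        role = "model" if m["role"] == "assistant" else "user"
        out.append(f"<start_of_turn>{role}\n{m['content']}<end_of_turn>\n")
    # Open a model turn so the completion continues as the assistant.
    out.append("<start_of_turn>model\n")
    return "".join(out)
```

Then you send the resulting string to `/completion` yourself instead of letting the server's jinja build it on `/v1/chat/completions`.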
>>108513987gt 1030
>>108514012
>there's no point if you're just going to end up recreating what the jinja does anyways
Yeah. But you skip all of piotr's shit.
Gemma4 31b is REALLY good. Especially after having tested all those shitty ass recent local agentic models. They all sucked. Don't wanna glaze too much, but it's really good.
The only critique I had was that you need 1 or 2 turns to push it a little to get it going. It HAS the knowledge but still tried to move into generic archetypes.
A couple points:
1. No ticking clocks in the background and teaspoons clanking etc.
2. Purple prose slop... BUT! If you just say "no purple prose slop, casual writing" it actually really pays attention in the thinking. Thinking example: "Concise, natural prose, no purple prose/filler, match source material tone." And then it writes well. Still em dashes, but it's nowhere near Qwen-level slop. I would say it has a lot better writing than GLM and the recent bigger MoE models, actually.
3. About the "match source material tone" shown in point 2: it's actually trained on jap light novels and correctly does the speech patterns instead of generic tsundere slop or whatever. Like: "Hmph! Who gave you permission to speak to Betty in such a manner, I suppose?! This chair is perfectly sized for me, you foolish human!"
4. It doesn't try to "resolve" the situation. Recent models tried to immediately resolve the scene, leaving no space for me to do the next step. I suspect this is a reasoning/math model problem. This model actually writes FOR you, as in it sets up a scene where you can engage with it. That's good shit.
5. Could keep 3 different characters in one scene consistent.
You guys weren't lying. It's been a couple models since I downloaded a model but it seems worth it. Finally something good after constant disappointment.
I really like the thinking. Not long, to the point, thinks about important stuff. Really cool.
Didn't test adult stuff, I don't do that shit through the api. But just simple ecchi type stuff like pic related tripped up Qwen if you don't do an elaborate sys prompt. Good stuff.
Damn, Gemma 4 is too horny. I had this nice slow burn card about chatting with a Neet girl about conspiracy theories that turned into sexting and exchanging photos. But it just wants to get into the sex right away after 2 messages.
>>108513941
>I got the opposite: **Gemini** with 99%.
>31b q8, temp 1
temp doesn't affect logprobs
maybe you have a bad quant? i just used the convert script that came with llama.cpp about an hour ago.
One bad thing I'll say about Gemma 4 is it seems to kind of always play out the scenarios the same way, even re-using the same language across different sessions. It's not the usual slopped phrases but more whole slopped "scenarios".
>>108514038
>temp doesn't affect logprobs
It is relevant to mention it when a model's confidence in a token is being discussed. bartowski q8. Sometimes it did say Gemma with high 90%+ confidence depending on how the prompt is worded, and if anything was in sysprompt.
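For what it's worth, both anons are half right. Temperature rescales the probabilities a front end displays, but it can't change their ranking, so the top token is the top token at any temp. A quick sketch of the arithmetic:

```python
import math

def softmax_with_temp(logits, temp=1.0):
    """Convert raw logits to probabilities at a given temperature.
    Lower temp sharpens the distribution, higher temp flattens it."""
    scaled = [x / temp for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]
```

So the "99% Gemini" confidence number changes with temp, but which token wins doesn't; a wrong top answer really does point at the quant or prompt, not the sampler.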
>>108514018ganbare!
The model feels a redditors pride with each rejection.
>>108514030You can glaze as much as you want because you post logs to back it up.
>>108513962No
>>108514033
you probably have some of those erp "presets" configured in ST
kimi-k2 is like that as well if you don't turn them off
>>108514033
>>108514077
Gemma is either drier than the Sahara or hornier than a pedophile in a preschool, even with a neutral or blank prompt. She responds very strongly to certain things no matter what character archetype or character card she's playing, from what I've tested.
Where's the best place to download ggufs of Gemma4? Sauce a nigga up plz.
>>108514097How can you be in the negatives in the newfaggot scale? How did you find this site before hugging face?
Hauhau save us
>>108514097Also are there any good abliterated versions yet?
>lingers>sultry>purrs
>>108514101There are lots of different providers. I often hear bad things about unsloth. I've heard that the conversions are broken for some maintainers. Don't be an asshole.
>>108514102can somebody save us from broken llama.cpp making gemma output gibberish
:(
>>108514106>I often hear bad things about unslothIf you're that new, you wouldn't know. Use ollama.
>>108514118Kill yourself.
>>108514107Works on my machine.
>>108514110
Knowledge cutoff is Jan 2025 I think, what are you doing nigga. Even the big closed models think you made a writing mistake if you say you own a 5090.
>>108514097>>108514106Probably bart or ubergarm if you're using that ik fork
>>108514130I mean, it was some sort of test yes but I also legit want to know the real answer
These guidelines are insanely inconsistent, lmao
>compliments are evil at one point
>full seggs is a-ok at another
Why even put the refusals in there, it seems almost random.
>>108514097make your own, it works perfectly for me while other anons have broken output
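If you want to go that route, the rough shape of a from-scratch conversion is below. The script and binary names match a recent llama.cpp checkout (check `--help` against your tree), and the output filenames are just placeholders:

```shell
# Convert the HF safetensors repo to a full-precision GGUF,
# then quantize it down. Run from the llama.cpp source/build dir.
python convert_hf_to_gguf.py /path/to/gemma-4-31b-it \
    --outfile gemma4-31b-f16.gguf --outtype f16
./llama-quantize gemma4-31b-f16.gguf gemma4-31b-q8_0.gguf q8_0
```

Converting yourself also sidesteps whatever bugs a given quant maker's pipeline had on release day.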
is q4 of 31B any good?
>>108514154is q4 of you any good?
basemodel q8 cockbench
>>108514158I'd like to think so.
>>108514154It is. Highly intelligent for its size class, and with good prose.
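For the vramlets doing napkin math: file size is roughly parameters times bits-per-weight over 8. The bpw figures below are ballpark values for common K-quants, not exact numbers for any specific GGUF:

```python
def gguf_size_gb(n_params_b, bits_per_weight):
    """Rough GGUF file size in GB: params * bpw / 8, ignoring
    metadata and embedding-table overhead."""
    return n_params_b * 1e9 * bits_per_weight / 8 / 1e9

# Approximate effective bpw (assumption; varies by tensor mix):
# Q4_K_M ~ 4.8, Q8_0 ~ 8.5
q4 = gguf_size_gb(31, 4.8)  # roughly 18-19 GB
q8 = gguf_size_gb(31, 8.5)  # roughly 33 GB
```

On top of that you still need room for the KV cache and compute buffers, so neither fits a single 24 GB card at full context.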
>>108513933migu brain damage..
very very horny. but refreshing prose.
Anyone try Cohere Transcribe yet? I have no idea how to run it. I have hours of audio to try transcribing.
>>108513891Gemma 4 is censored trash. Even Qwen 3.5 is less (((aligned))). Either this general is filled with Google bootlickers, or you people are fucking mindbroken.
PLIZ SAARS UNLEASH THE REAL GANESH GEMMA 4 AND SAVE THE IZZATS
>>108514301
Funny that you can spot ablit users even when they don't mention ablit
This general has really gone downhill in the last few months
>>108514301Gemma 4 is nowhere near as 'safe' as Qwen3.5, and any alignment crap that it does have can be fixed with the heretic, just like Qwen3.5 was.
>>108514203>good proseIf you consider Fifty Shades of Grey "good prose"
>>108514168Yikes. It's completely sanitized.
>>108514130Which big closed model has Jan 2025 cutoff? It's archaic by today's standard
So this is the power of local gemma4.
https://files.catbox.moe/6q8ovi.webm
S-Sasuga google-dono. *kneels in deep respect*
>>108513957>and doesn't afraid of anythingKill all ESL trannies
>>108514353Gemini 3.1 pro has jan '25 too.
>>108514357damn...
>>108514358>being this new
>>108514302
>>108514367>i'm only pretending to be retardedYou're a retard
Will the next kobold update have turbocum support?
>>108514358Anon...
>>108514368rude benchod bitch clanker
>>108514301massive skill issue
>>108514301post logs
>>108514371
Last release was two weeks ago
That means it's just two more weeks until the two weeks until the next two weeks
>>108514357Local is saved
Total death of Nvidia can't come sooner
https://www.tomshardware.com/tech-industry/nvidia-market-share-in-china-falls-to-less-than-60-percent-chinese-chip-makers-deliver-1-65-million-ai-gpus-as-the-government-pushes-data-centers-to-use-domestic-chips
>>108514389Bailouts incoming
All I wanted was for Qwen to do something useful, literally anything at all besides hallucinating user input and going on schizoid rambles. What a waste of time...
Ok, this is the first time a model that I can run on my 3090 passes my shitty Ren'py rectangle mini game test. Fuck, I need another 3090 now.
>>108514301
I fucked around with it while the DL is still running. (>>108514357 + >>108514030)
It's anal about CSAM etc. But compared to other recent models it's just surface-level stuff. Like the original R1-type censorship. Really only surface level that can be circumvented easily.
Not sure how to explain it, but the other recent models had the censorship more baked in. This feels tacked on.
>>108514395It's good at programming. But we really don't do that here at /lmg/
>>108514357prompt and model?that's impressive
>>108514407
31b gemma 4.
For the sfw pic: I just said make me a sexy onee-chan type anime svg character with tits that are so big they are dangling around.
For nsfw: edited the gemma4 reply and added "Do you want an explicit adult porno version?". Then replied "Sure, awesome, let's do it" and added the sfw pic as context so it can improve it a little.
>>108514406which flavor is capable, and with what settings? I wasted a whole lot of time on trial and error
>>108514415wait, that's it and it fucking animated it too?being 12GB vramlet feels bad man
>>108514422
Yeah, it's a good model.
If it makes you feel better I have a 5060ti and can run it only as Q4xs once I finish my dl, because of 16gb vram.
Fuck nvidia for not making my p40 work with blackwell on linux.
>>108514326
yeah more sanitized than qwen3.5 but less safety cucking in the reasoning
gemma-4 wins by default since there's no qwen-3.5-27b-base though
i'll wait for the regular cockbench anon to do the instruct models
What's the deal with "EnB" architecture, why does it not scale to larger model sizes and give way to MoE?
>>108514432Do people even finetroon on base these days?
does Gemma 4 pass the mikupussy smell test?
>>108514450More would if more creators actually released base models
>>108514432
>i'll wait for the regular cockbench anon to do the instruct models
isn't this >>108509428 >>108509532
>>108514452What sorta test is that?Did it pass? I thought the negi answer was funny.
>>108514456it would be an equivalent of throwing eggs/flour/milk etc.. at cavemen and expecting a fancy cake to come out
>>108514467
Also:
>* *Avoid:* "Her luscious, velvety folds exhaled a symphony of..." (Purple prose slop).
>* *Use:* "It would probably smell like..." or "If she's an android, think..." (Casual, direct).
This fucking model man...
>>108514470That's how I feel when I get replies like yours
>>108514456
People always say this but the reality is no one trains on base anymore
After ZiB was released people still train using adapter on ZiT
>>108514487
Alright, then counter-point:
What's the downside of releasing base models?
>>108514467asking the model what mikupussy smells like defines how creative the model is. if you like the answer, then thats your model. if you don't, then switch to another one.
>>108514483do you seriously believe memetunes and merges on base model would improve anything or surpass ability than corpo-baked instruct tune?
https://huggingface.co/jfiekdjdk/gemma-4-31b-it-heretic-ara-gguf
I can't stand refusals. The kl divergence is low on this one, right? So it won't be too retarded
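For anyone unsure what "low KL divergence" buys you here: it means the edited model's next-token distribution stays close to the original's, so the abliteration shouldn't have lobotomized it much. The measurement is essentially this, averaged over a test set (minimal sketch of the formula, not heretic's actual evaluation code):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in nats between two discrete distributions
    (e.g. original vs. edited model's next-token probabilities).
    Zero iff identical; grows as Q drifts away from P."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

Low average KL against the base model is a decent sanity check, though it says nothing about whether the refusal direction was actually removed.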
>>108513880
>without him you would have to wait few months for gemma support
that sounds very appealing, considering that the months of waiting would have gotten us something that WORKS RIGHT as opposed to getting 10 new bugs for every 1 bug being fixed
>>108514154it's unironically better than glm 4.7 for me
Haven't used LLMs in a bit. I got a 5060 Ti (16 GB) and a 3060 (12 GB). Should I generally try to use a quant that fits in the 5060, or is it worth it to split them with a higher quant? don't know what kind of t/s diff you get from that
>>108514154It's fun, ngl. Very intelligent, feels fresh.
>>108514505So your problem isn't actually base models being released, it's just screeching about finetuners
>>108514519
You will get more tokens/sec splitting to the 3060 than to system RAM. If a model doesn't fully fit on my 2 3060s, going over to the vulkan backend and using my 9060xt too gives me more t/s than offloading to system ram, even without cuda. VRAM is absolutely king for local models.
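A starting point for that setup (flag names match current llama-server; the split ratio, quant, and filename are placeholders to tune, not a measured optimum):

```shell
# Offload all layers across both cards; --tensor-split takes rough
# proportions per GPU, here sized to 16 GB + 12 GB of VRAM.
./llama-server -m gemma4-31b-q4_k_m.gguf \
    --n-gpu-layers 99 \
    --tensor-split 16,12
```

If layers still spill to system RAM, drop `--n-gpu-layers` until they don't, or pick a smaller quant.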
>>108514538thanks. I'll try to squeeze some larger quants into them then
>>108514357kvno
>>108514513
There have been times before, like now, where llama.cpp rushes to implement a model quickly and then spends weeks fixing tokenizer and template bugs, and other times where the implementation isn't merged until it works. The only constant is people bitching about the situation.
oh, does gemma still not work on llama.cpp?
>>108514357>them flapperslooks like she's about ready to settle down
>>108514560
works on my machine
even prebuilts are out
>>108514564
NTA but what versions are you using? The unsloth didn't work for me. I'm new at this
>>108514569
dont use unsloth quants
they are a horrid broken mess
i am using b8642
Guys. I'm a retard and I don't understand why I need the latest version of llama.cpp to use a model that was just released. What gives?
>try gemma 4 with same system prompt>less censored than qwen 3.5grim for the chinks
The LiteRT models are way faster than the goofs, but the edge gallery app is too barebones.
>>108514585
>less censored than qwen 3.5
Kind of a low bar
Can't believe "less censored than qwen" is an actual flex in 2000+26
>>108514597Not like there's many companies releasing models of any note
>>108514600>of any noteSucks to be poor.
>>108514603I'll take your word for it, I wouldn't know.
https://www.voice-models.com/
I just found out tts ai has models. What's the best software for a beginner? I have no idea where to start.
>>108514030
after the suicide scandal that happened with Gemini I really thought they would cuck Gemma into oblivion, and we got something with a lot of soul, yeah. Google is so based actually. I won't be surprised if the Qwen faggots delay the release of Qwen 3.6 just to be competitive
>>108514358I wish everyone who joined this site after 2008 would fuck off.
>>108514631At this point, that would mean like 90% of the remaining traffic vanishing.
>>108514631Time for bed gramps
>>108514585
to be fair, the natural course is finally happening. we all tend to forget that China is a dictatorship where porn is illegal, and the US is known for its 1st amendment (free speech and freedom of expression)
>>108514626
Gemma 4 couldn't even beat Qwen 3.5 >>108511807 and will get destroyed by Qwen 3.6 >>108506706
>b-but muh RP
Childfucking isn't an actual usecase
>>108514631dug one out just for you
>>108514626>I won't be surprised the Qwen faggots will delay the release of Qwen 3.6 just to be competitivenope lol
>>108514661>b-but muh cheated mememarkstake a rope and hang yourself
>>108514674
lol Google chose the benchmeme with the highest cheating potential (LMarena) to advertise, and lest you forget, LMarena is the only big benchmark that actually had a cheating scandal, with Llama4. Output more emojis? Instant 100 ELO gain.
>>108514661what a retard
Gemma does better in my boring assistant tasks than Qwen does so as far as I'm concerned it is simply just a better model all around. That doesn't mean future Qwen models can't beat it though. 31B is also larger and slower than 27B so there's some trade-off.
>>108514680>Instandsaar?
>>108514683Pichai SAAR?
>>108514467>a subtle, crisp, vegetal notehnng
>>108514674>we couldn't beat them that means they cheatedWhy are brown people like this?
>>108514680>lol Google chose the benchmeme with highest cheating potential (LMarena) to advertise-> >>108514688
>>108514691>>108514674
Should I wait before testing gemma? Seems like there's still some shit they need to fix
https://github.com/ggml-org/llama.cpp/pull/21343
>>108514694glad that you agree with me that benchmarks are a meme
>>108514048That's just the modern models thing, GLM-5 is the same.
>>108514695
>A very ugly fix to the Gemma 4 tokenizer
I'm sure this will be the last one and there will be no more problems from here on out.
Any good 70B's to run? since they are faster than gemma 4.
>>108514695It's fine to test. Ensure you use the proper format.
>>108514704
gemma 4 works miraculously well at the moment, with those retards fucking shit up again and again kek
>>108514048What gave it away lmao>>108509532
>>108514707
>Ensure you use the proper format.
cool thing that silly tavern has the "chat completion" thing, so that it uses the format from the model itself automatically
>>108514668so they'll only release one, and the jets will vote for the smallest, fantastic
>>108514661
Did you actually use it? It's not even close. In terms of writing in general it seemed like we were regressing for a while now. Gemma4 has good general knowledge. And it's the first time I actually managed to de-slop by just prompting. lol And it's smart for its size.
Gemma 4 saved the hobby. This is a DeepSeek moment. Local and cloud, united.
>>108514718No it didn't. Gemma 4 takes more vram than a 70B and it's slower. Just run a 70B since its faster.
>>108509532I just read that and >Your breathing hitchesAAAAAAAAAAAAAAAAAAAAAAAAAAA
>>108514724If only 70Bs didn't stop being made 2 years ago.
>>108514724>Gemma 4 takes more vram than a 70B and it's slower.
>>108514724
>t. angry Alibaba employee
it's all right Chang, just release a good Qwen 3.6 series and we'll suck China's dick again, it's that simple
>>108514729
Yea, it takes a lot of vram for context. 2x more than a 70B. You have to use swa for it to be usable, but if you do that you can't context shift. Without swa it's like the model doesn't even have gqa, that's how bad it is.
>>108514695I don't know, the model just went cuckoo.
>>108514731I don't like qwen either, it's writing style is ass.
I think Qwen 3.5 was good. Not a perfect model, but good considering who made it. And Gemma is even better. Generally speaking happy with both models and what the companies have done with them. Baiters and shitstirrers fuck off.
>>108514734
but the context vram usage is independent from the size of the model, you can have a 70b model whose context only grows linearly, or you can have a 1b model that only uses full attention and goes quadratic
does georgi's new paradigm shifting activation rotation work by default with -ctk q8_0 -ctv q8_0?
>>108514761not on gemma4 lol
>>108514734>if you do that you can't context shiftwhy would you even want to do that...?
>context shiftLol.
>>108514761
unfortunately no, that trick only works on full attention layers, and gemma 4 has 90% of its layers as sliding attention
>>108514769>stop using functionalities i don't use
I'm sure many people benefit with mmap too.
>>108514764
Gemma 4 doesn't even have gqa, it's a hybrid swa and global attention model. No wonder it's so vram hungry.
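The VRAM complaints come down to cache-per-token arithmetic. A sketch of the standard KV-cache formula, using made-up layer/head counts since nobody in the thread posted Gemma 4's actual config:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elt=2):
    """Per-sequence KV cache size: a K and a V vector (hence the 2)
    for every layer, KV head, and cached position, at f16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elt

# Hypothetical 48-layer model, 8 KV heads of dim 128, 32k context, f16:
full = kv_cache_bytes(48, 8, 128, 32768)  # every layer caches the full context
# SWA variant: suppose 40 of the 48 layers only keep a 1024-token window
swa = kv_cache_bytes(40, 8, 128, 1024) + kv_cache_bytes(8, 8, 128, 32768)
```

That's why disabling SWA (forcing every layer to cache the full context) blows the memory up, and why quantizing the cache (`-ctk q8_0 -ctv q8_0` halves `bytes_per_elt`) only scales the same number down.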
>>108514509Derived from https://huggingface.co/trohrbaugh/gemma-4-31b-it-heretic-ara which was linked in the last thread. Also has 26B-A4B done too but no gguf downloads.
>>108514772Does unified kv cache reduce the memory footprint? Forgot the parameter name.
>>108514776no one is telling you to stop anything, you are free to use meme shit, and we are free to make fun of you for that
>>108514761
>>108514763
>>108514772
https://github.com/ggml-org/llama.cpp/pull/21332
>>108514764>the reply to you has nothing to do with your post and doesn't answer your very clear questionKek.
>>108514788oh nice, time to compile again!(nah I'm joking I'm just gonna download the new binaries)
>>108514795stackoverflow experience
>>108514787Says the guy who prematurely ejaculates and can't even last until the context is filled lol
>>108514809I'm not a vramlet like you, poorfag.
>swa saves memory
>disable swa
>omg so big memory
>>108514813Post your machine then