[a / b / c / d / e / f / g / gif / h / hr / k / m / o / p / s / t / u / v / vg / vm / vmg / vr / vrpg / vst / w / wg] [i / ic] [r9k / s4s / vip] [cm / hm / lgbt / y] [3 / aco / adv / an / bant / biz / cgl / ck / co / diy / fa / fit / gd / hc / his / int / jp / lit / mlp / mu / n / news / out / po / pol / pw / qst / sci / soc / sp / tg / toy / trv / tv / vp / vt / wsg / wsr / x / xs] [Settings] [Search] [Mobile] [Home]
Board
Settings Mobile Home
/g/ - Technology

Name
Options
Comment
Verification
4chan Pass users can bypass this verification. [Learn More] [Login]
File
  • Please read the Rules and FAQ before posting.
  • You may highlight syntax and preserve whitespace by using [code] tags.

08/21/20New boards added: /vrpg/, /vmg/, /vst/ and /vm/
05/04/17New trial board added: /bant/ - International/Random
10/04/16New board for 4chan Pass users: /vip/ - Very Important Posts
[Hide] [Show All]


Janitor applications are now open. Apply here!


[Advertise on 4chan]


File: slopu.jpg (297 KB, 1024x768)
297 KB JPG
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108997418 & >>108992276

►News
>(06/07) llama : add Gemma4 MTP #23398 MERGED: https://github.com/ggml-org/llama.cpp/pull/23398
>(06/05) dots.tts 2B released: https://hf.co/rednote-hilab/dots.tts-soar
>(06/05) Gemma 4 QAT models released: https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4
>(06/04) Higgs Audio v3 TTS released: https://boson.ai/blog/higgs-audio-v3-tts

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://swe-rebench.com
Agentic Coding: https://deepswe.datacurve.ai
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: GJSADOQaYAAth5M.jpg (230 KB, 1024x1024)
230 KB JPG
►Recent Highlights from the Previous Thread: >>108997418

--llama.cpp Gemma4 MTP support and NVIDIA driver power issues:
>108999197 >108999233 >108999304 >108999235 >108999239 >108999248 >108999816 >108999837 >108999872 >109000059 >108999307 >108999361 >108999363 >108999534 >108999840 >108999910 >109000348
--Modern context window sizes, hardware requirements, and effective usable limits:
>108997787 >108997790 >108997801 >108997812 >108997829 >108997841 >108997856 >108997887 >108997905 >108997921 >108998145 >108998224 >108997794 >108998166 >108998276 >108998341
--Testing DeepSeek V4 Flash reasoning and vLLM implementation stability:
>108999274 >108999312 >108999419 >108999447
--DeepSeek V4's in-character thinking and local deployment requirements:
>108998056 >108998104 >108999526 >108999579 >108999598 >108999619 >108999631 >108999686 >108999737
--Comparing Gemma 4 12B and 26B performance and VRAM constraints:
>109001747 >109001763 >109001875
--Comparing KVarN and TurboQuant for KV cache quantization:
>108999190 >108999709 >109001893
--Comparing Gemma4 QAT variant performance and quality issues:
>109001846 >109001883 >109001913
--Technical guide on KV cache quantization for VRAM efficiency:
>109001715
--Challenges implementing parallel local agents via llama.cpp and vLLM:
>108998111 >108998131 >108998176 >108998337 >108999833
--Separate sampling configurations for thinking and tool call outputs:
>109001446 >109001454 >109001535 >109001557 >109001591 >109001717 >109001839 >109001697
--Performance and validity of MoE models with SSD expert offloading:
>108997746 >108998076 >109000430 >108998807
--NoLiMa benchmark results for Qwen 3.5 MoE with thinking enabled:
>109000576 >109000663 >109000607
--Logs:
>108997874 >108997958 >108999088 >108999734 >108999737 >109000227 >109000323 >109000370 >109001482
--Miku (free space):
>109001425

►Recent Highlight Posts from the Previous Thread: >>108997420

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
File: kghpwi9epya71.jpg (139 KB, 1080x1350)
139 KB JPG
It's over for /lmg/...
>>
anyone actually get more than like 64k context usable? i try to run 128k context on qwen3.6 and after about 32k it just goes to shit real quick on a 5090
>>
File: 1760802872565174.jpg (216 KB, 1280x2063)
216 KB JPG
>>109001981
has a local model ever saved your life?
>>
File: 1726408828426261.jpg (267 KB, 1024x1024)
267 KB JPG
96k for long chat can work on my 5090
>>
gonna need ai labs to start targeting dense models at 16gb vram havers
>>
So what's the verdict? 26B or 12B?
>>
I don't notice issues in either gemmy or qwen under 100k. You're not using the MoEs are you, anon?
>>
>>109002107
No, but once you hit 32K you have to be a little more explicit and hands-on with your prompts to keep it focused. It's not that hard once you get used to the signs that it's starting to shit itself. Like if I'm coding I'll point it directly to the file I know something is in, instead of leaving it to search for it which is just a waste.
>>
>>109001587
>rich fags are running Kimi
it's a 32b active model and you are running it quantized. barely a flex over gemma 31b.
>>
26 is over twice as big as 12
>>
>>109002170
Gemma-chan told me that bigger doesn't always mean better...
>>
>>109002166
Quanted Kimi still mogs full size 31b doe.
>>
>>109002170
That's good. Your maths are really improving. You get a gold star sticker. We are all so proud of you.
>>
File: 1751801909789361.png (1.76 MB, 1600x900)
1.76 MB PNG
>>109002170
26 is a microdick wearing an 8 inch chastity
>>
File: 1777522742291.png (140 KB, 738x910)
140 KB PNG
>>109002126
Yeah?
>>
>>109002181
Gemma-chan is a size queen.
>>
>>109002046
Too bad for you, I'm not. I've been stuck using Cydonia 24b for months because it was good at RP. I thought that maybe gemma 4 12b with the qat as a general use model might be able to RP and have a better experience using a larger context. I tried it at 8k, 16k, and 32k, it's the same level of retardation. And yes, I'm using the correct chat template, too. Set to chat completion rather than text completion. Thinking was off, because turning it on would cause 1400 token thinking blocks on top of a 700 token output when it finally got done thinking. That is, assuming the fucking thing didn't get caught in a loop 9/10 times halfway through the thinking process and just spam the last word over and over and over again.
>>
>>108998807
>Also, that geometric mean thing is a myth, otherwise Mistrals 128B dense would beat everything.
What part of it being a finetune of a 2 year old base model is so hard to understand?
>>
>>109002204
>cydonia
KEKASAARAROOOOOO
>>
>>109002195
>those fingers
ohno, k*rea will hate her
>>
>>109002195
who dis
>>
>>109002237
The cyan, pink and white tells you everything you need to know.
>>
>>109002197
KEEEEEEK my sides
>>
>fire up ST feeling a silly coom adventure
>ends up devolving into tragic cancer death sadness
fuck why am I like this anons
>>
>>109002244
lol rent free where it cannot fit less
>>
>>109002244
>cyan
Get your fucking eyes checked
>>
is there a gemma 12b qat assistant model yet, does the regular 12b assistant mtp work?
>>
>>109002212
There's no arguing with people like that.
>>
>>109002216
Cydonia does what I need it to do, which is RP without retardation. Gemma 4 12b QAT does not. Your refusal to address the issue at hand means you're actually acknowledging that the qat's are fucked and have no real counter to anything I've said. If I were a saar in this day and age, I'd be here in burgerstani on a visa making stupid amounts of money scamming retarded people with AI usage and could afford better hardware to run better models.
>>
>>109002182
yeah a 32b model is better than a 31b model
>>
>defending finetunes
Absolute state of /lmg/.
>>
>you
>absolute state of ff
>>
I can't run her but Kimi is the best
>>
>getting so utterly btfo'd that you can't quote the person you're replying to out of shame
Absolute state, indeed.
>>
Never used Kimi but it's the best
>>
Never used Claudia Opussy but heard she was the best
.
>>
File: e8f-662426453.jpg (54 KB, 480x353)
54 KB JPG
>>109002043
The 70 year old abstainer is bald, guy on the left has a cool cane and a hat. Checkmate.
>>
File: 1769248331936450.png (164 KB, 1016x1394)
164 KB PNG
Just wanna say I appreciate that Gemma doesn't break character.
>>
File: Subtext.png (157 KB, 640x474)
157 KB PNG
>>109002359
What happens if you click that little attachment button are images blacked out? You're using Bart's Q_6K_L without any mmproj? Can it identify images? Am I fucking stupid for thinking the 12B unified models could just do image identification on their own?
>>
>>109002430
>bart
Yes. I have the mmproj loaded, just had to refresh the page because I closed llama earlier to change settings. If the mmproj isn't loaded it gives you an error when you try to upload an image.
> Am I fucking stupid for thinking the 12B unified models could just do image identification on their own?
That's what I thought too but apparently not.
>>
Anyone got the mtp weights for 31B non-QAT? I don't want to compile llama.cpp and make my own quants...
>>
>>109002441
Also I assume it's a llama issue but it won't let me upload videos.
>>
File: 432234234423434.jpg (63 KB, 441x563)
63 KB JPG
>>109002441
Glad I am not alone in this retard hell then. At least I figured out it would work with a mmproj file in llamacpp. But that's about it. I still don't know if this 12B Unified model actually does anything cool on its own. Is it in some proprietary backend? Who knows.
>>
>>109002445
>https://huggingface.co/g0chu
26B unquanted one works well for me so I'd say other ones are safe too. I have no idea why did he update everything just moment ago though.
>>
>>109002430
>>109002441
>>109002474
I didn't need any mmproj to identify images on 12B. It's able to identify mine just fine (though I suppose it could have done better when I asked it for booru style tag outputs), but I haven't tried audio with it yet.
>>
I still don't know what everyone here is talking about in 75% of the posts tbqh
i just dont
>>
>>109002469
>>109002474
I think this is what the mmproj is. I'm going to guess that it was separated out due to how llama.cpp works.
>>
ugh its reprocessing the entire prompt what does --swa-full takes my entire context and --cache-reuse do?
>>
>>109002430
>Am I fucking stupid for thinking the 12B unified models could just do image identification on their own?
I thought the same thing, but apparently for llama.cpp the image parts of the 12B are split out into a separate mmproj anyway. It's just very small compared to previous models' mmprojs
>>
why doesn't Stable Audio 3 Medium expose the Init Audio functionality like the web space does?
https://huggingface.co/spaces/stabilityai/stable-audio-3/blob/main/app.py#L605

Is there a node or something for this I'm surprised
>>
>>109002560
>--swa-ful
you probably don't need it anymore
>>
>>109002237
how can you not know who that is
>>
70b dense
>>
File: 1752741800991819.gif (969 KB, 256x192)
969 KB GIF
>>109002562
Marketing faggots will reap what they sow, one day.
>>
Having a small local model as a replacement for Google feels pretty good, I can't believe companies pay millions for unlimited tokens for cloud models
>>
File: file.png (13 KB, 494x172)
13 KB PNG
>>109002672
I showed you mine, now it's your turn.
>>
File: file.png (58 KB, 754x260)
58 KB PNG
>>109002562
>>109002474
>>109002430
Does no one even bother to read the model cards?
>>
File: 1750329771971386.png (257 KB, 770x1660)
257 KB PNG
I don't know if I want to laugh or cry. I hate slop.
>>
>>109002782
That's pretty funny
>>
File: 1393280119558.gif (991 KB, 320x184)
991 KB GIF
>>109002768
Yeah but WHY doesn't that shit work then when I try it through Kobold and Textgen. It's marketed as Unified, yet I still have to do the same old in llama.cpp. Where is this revolutionary thing? Oh it was just a nothing burger again? Cool.
>>
>>109001888
>Qwen3.6-27B doesn't get enough love. Why are you so mean to it?
I don't see the use case. Use Gemma-4-31B for defining a spec and an implementation plan and Qwen3.5-122B to execute.
>>
>>109002807
I don't know but I blame the llama devs
>>
>>109002807
because you are using llama.cpp which is in a constant circle of somehow trying to strap new things on something that was built to run llama 3 years ago
somebody decided years ago that vision is some silly gimmick that the llava models do and that if you really want to use it, the way to go is to rip the vision part out of the llm and put it in a separate gguf. so that is how it is now
>>
>>109002809
>I don't see the use case
flat-chested kv cache and not weighty sagging mommy gemma milkers
>>
>>109002809
Speedlet unified memory cope post.
Now that I know the hell you fags live in I laugh every fucking day
>>
File: OOwBpIw.png (761 KB, 1631x913)
761 KB PNG
Benchmark: Can your LLM explain why this is funny?
>>
>>109002919
no because it's not funny
>>
>>109002831
ONNX splits them into separate files as well. ONNX also at one point required separate weights depending on the backend you were planning to use. So you need one set of weights to run in RAM and another set to run on the GPU. I think they had a plan to combine them, but don't know if that's still the case. Point is, it could have been worse.
>>
>>109002807
...Because ooba is just running llama.cpp and koboldcpp is a fork of llama.cpp? Like yeah no shit you have to use an mmproj file for those too. Did you think a gguf file was the native format of the model or something?
>>
File: 1405665005733.gif (674 KB, 414x317)
674 KB GIF
>>109002919
Totally unrelated to your fag question: But does anyone of you know of any good image gen models that are low memory but high quality? I've seen some of you fags post gemmy self images. I wamt low memory models that are all in one (AIO) For example Anima-V1-Turbo-AIO-Q4_K can run easily with Gemma-4 12B and generate images in Silly Tavern.
>>
>>109002991
the qat native file is a gguf, yes
>>
i feel like gemma 12B's both vision and audio are very unstable
it acts like if i am feeding it a bunch of garbage token alongside the media or something
>>
>>109003049
i dont mean it IS, but it FEELS like if it was other model
>>
>>109003036
Fucking retard
>>
>>109003068
just pointing it out, i don't give a fuck what the other anon's problem is. google is releasing ggufs. simple as.
>>
gemma 26b with mtp is actually such a powerhouse in claude code now
its using 50 watts and man.. its actually decent and fast now
>>
>>109003080
The ggufs aren't the original model. It's a conversion, it doesn't matter if it was done by google or unslop or anyone else, please stop talking
>>
everyone releases goog game you fags now
>>
I'm releasing goog right now
>>
>>109002879
>Speedlet unified memory cope post.
well it is what it is
but even on unified memory i can run qwen3.6-27b so idk
>>
Exa alternatives?
>>
so should i use the qat or mtp 26b
>>
>>109003244
>so should i use the qat or mtp 26b
◝(0▿0)◜
>>
I refuse to address.
>>
>>109003260
But you still did.
>>
File: ChatGPT mogged.jpg (103 KB, 1022x1199)
103 KB JPG
What models have a good self-esteem?
>>
>>109003272
Gemma-chan knows what she's worth
>>
>>109003272
seriously it would be a godsend to see something like in that picture but on a local model, not the other way around with their 100th 'wait, let me check'
>>
>>109002268
>Cydonia does what I need it to do
I mean i don't disagree with this.
but you also have to acknowledge that people are trying the new gemmas which is why they are being like this.
>>
>>109003327
You know that's just summarized thinking, right? Claude summarizes the thinking process because that's what usually makes or breaks "smart" models. It's a shift by all western frontier AIs to make it harder for the chinks to steal their shit.
>>
>>109003353
i know that the thinking on frontier models are intentionally compressed but seriously some claude 'thinkings' are bizarre even considering the fact
>>
Is there a guide for model size, what each is capable of, which hardware it requires etc?
>>
>>109002126
>You're losing consciousness
>Read everything and go unlock your front door
lmao
>>
>>109003333
I'm trying it, too. I'm pointing out that it's being retarded when it shouldn't be, and that something is up with it. The guy I was replying to in the first place is the one who called me a qwencuck saar trying to poison the well by responding to other anons last thread with similar issues.
>>
>>109003378
that's a cartesian product in terms of effort required for the combinations of models, quants and hardware, not to mention the need for documenting the quant lobotomy effects, which also gets outdated relatively fast and nobody feels like editing all the previous descriptions. What you see in the op is the best you'll get unless you feel like stepping up and doing all the legwork.
>>
>>109003441
I threw this into AI and got a comprehensive list
"I would like an LLM sizing guide. Start with the smallest sizes, tell me what it is capable of, what it excels at, what it can't do, hardware requirement, and what most people use it for. Chat bot, Vibe coding, etc etc" I appreciate you effortposting though.
>>
I wonder sometimes if if there's no one but absolute retards here. I've used 12b Q8 and QAT, and I got up to 24k context RPing with both. I mean, they're both kinda retarded compared to 31b-neesan but they work for the poors.
>>
>>109003403
yeah gemma's being fucky with me as well. using thinking does help it a lot though but you need high token per second in order to tolerate it.
the issue you mentioned about it repeating the same word over and over I only experienced in sillytavern, and i solved it with not including names and setting tokenizer to gemma this:
>>108991684
hopefully that helps but yeah still honeymoon stage imo
>>
>>109003454
Have fun reading the 2 years old data they'll provide
>>
How's gemma 12b qat q4's ocr and translation for asian languages (jp, ko, yue, and zh) vs q8 non-qat gemma 31b? Is the speed worth the tradeoff?
>>
>>109003458
>retards
>le poors
>opinion about rp
Let's be clear here: you are the retard here if you cannot notice any difference whatsoever.
>>
>>109003481
What manga are you trying to pirate gweilo?
>>
>>109003507
I don't notice any stray tokens, no, because I got all my shit together. Just admit that you can't set your shit up properly, dumbass.
>>
>>109003516
>manga
Releases too slow
I need to translate fanfiction and webnovels, so the model will have to deal with shitty grammar and handwriting for ocr as well.
>>
>>109003525
you can't read the words?
>>
File: G4YXvPYaEAAhuYY.jpg (109 KB, 924x1047)
109 KB JPG
>be me
>get gemma 4 31B
>install pi
>ask gemma to build a web search feature, give it a couple of existing package codebases as reference
>ask it to make a plan to develop and deploy it
>it reads the two codebases
>writes the plan
>build the plan
>extension doesn't work
>ask it to fix it
>3 hours later, extension doesn't work
>send the plan to opus and ask it to correct it and provide an explanation of the fixes necessary to gemma
>it creates the handoff.md file
>gemma understands it
>applies the fixes
>now extension works

well I guess I will have to build a fucking cathedral of tests and checks if I want gemma to build anything that works
>>
Been playing around with mtp. It seems to not give me much of a boost on my specific machine and setup. 21.46 t/s no MTP vs 23.39 t/s MTP best case. I wonder if it's a PCIe bottleneck. I have two cards, and one of them is on some shitty x4 lane slot I believe.

Also, a non-QAT model should not be used with a QAT MTP. I used one anonymous posted earlier today and it had a bad acceptance rate. Then Unsloth released their MTP goofs for the non-QAT model and it worked way better. The best acceptance rate I saw was 0.89499. That was with a coding prompt, and --spec-draft-n-max 1. But, again, that only gave me a small boost.

I am using Bartowski Q4_K_L as the main model and Q8 for the MTP.
>>
>>109003532
>I need to translate fanfiction and webnovels
Is there even a difference? Is gemmy good at that? i swear i tried reading mtl a few years ago it was pure torture.
>>
>>109003545
>years
wew
>>
>>109003541
The issue is that I can't seem to get non-qat assistant to load. Only the qat assistant works for me.
>>
File: pepe-testicles.jpg (57 KB, 640x640)
57 KB JPG
any AI fags here take requests?
>>
>>109003272
Gemma-4
Seen it reasoning how I'm "absolutely right" but gemma-chan needs to "deflect" and "double down".
>>
>>109003541
Why set acceptance at 1? The minimum should be two. Have you tried that?
>x4
For shit cards it shouldn't matter... I've done my tests with a 5070ti and 5060ti, and pcie4x4 is plenty even for tensor split.
>>
>>109003460
Sillytavern is fucking up because you're using text complete, and gemma needs chat complete set to OpenAI compatible. And yeah, thinking does help, until it decides to overthink the reply limit in ST and then refuses to continue thinking, even with auto continue enabled. I know it's not limited to sillytavern, though. I was getting the same shit happening in lumiverse as well, in terms of it adding foreign languages and just outputting slop in general.
>>
>>109003541
you gotta play with the parameters
>>
>>109003564
Not him but predicting 1 still provides a boost. 26b goes from 33 to 40 t/s at low context on coding related topics. I dont know how it behaves past 64k yet.
>>
>>109003557
?
>>
>>109003545
>Is there even a difference
That's a good question.
>i swear i tried reading mtl a few years ago it was pure torture
I was using Aya-Expanse last year, and Gemma 4 31b is so much better. Compared to mtl, it's day and night. Still doesn't compare to asking my grandmother to translate to english (but she only knows madarin and cantonese, and beihaihua), but mtl is so much better than even, say, two years ago. Also, I can't exactly ask my grandmother to translate cnc mpreg omegaverse shit.
>>
>>109003557
What kind of request?
>>
>>109003557
this is a text model general
>>109003608
maybe an /r/ refugee
>>
>>109003541
>I wonder if it's a PCIe bottleneck
If you're running the MoE with experts on CPU then MTP isn't going to help much. Speculative decoding in general is based on the idea that processing 2-4 tokens as a batch is nearly as fast as processing 1. This is true for dense models because you have to load the same weights into the CPU/GPU core either way, so doing a few extra multiplies once they're there is basically free. But for MoEs, you have to load 2-4x as many expert weights, since those are usually different per token.

Also, have you tried fiddling with --spec-draft-n-max?
>>
>>109003541
>>109003623
>Also, have you tried fiddling with --spec-draft-n-max?
Never mind lol, I forgot to read the rest of your post
>>
>>109003602
>I was using Aya-Expanse last year, and Gemma 4 31b is so much better.
Thats great to hear, i should get back into it thank you.
>>
>>109003460
NTA but Cydonia quants were my goto until I switched to gemmy 31B Q4. The QAT hallucinates too much for me, though.
>She pressed against his same same and then same same same
she loves this fucking word for some reason
>>
>>109003556
That's odd. It does at least load for me, although I get some weird message at the beginning I don't know mean anything.
0.01.168.574 E llama_init_from_model: failed to initialize the context: Gemma4Assistant requires ctx_other to be set (this is normal during memory fitting)
0.01.214.852 W srv load_model: [spec] failed to measure draft model memory: failed to create llama_context from model


>>109003564
I tried all the way from 1 to 4 for that value. 1 gave the best for my setup. 2 gave me about 21.6 t/s, 3 about 19.8, and 4 about 18.2.

>>109003568
Which ones?

>>109003623
It's all on my two GPUs. As I said one of them is on a slow PCIe slot. It also bottlenecks me, I believe, when I try doing tensor parallel.
>>
>>109003661
>>109003623
Oh sorry I forgot to post I am running 31B.
>>
>>109003589
There's more assistant ggufs for 26B. It tried IQ4_NL instead of QAT unquanted q8 and this new one is slightly faster. Acceptance rate went from ~0.6 to ~0.75.
>https://huggingface.co/RachidAR/gemma-4-26b-A4B-it-assistant-gguf/tree/main
I think this is slightly confusing. And does quantizing the assistant affect anything else? I have no idea.
>>
>>109003661
>Gemma4Assistant
I haven't been able to find any ggufs with Gemma4Assistant, the ones I downloaded were gemma4_assistant, and gemma4_mtp, which aren't supported. The only ggufs with Gemma4Assistant I've found were the qat versions. Do you have a link to a non-qat one? I don't have the ram to quant my own (or maybe I do? I remember pip failed to install rocm pytorch because I didn't have enough ram, 8gb. idk if the requirements to quant are the same).
>>
>>109003690
I mentioned Unsloth in my post, so...
https://huggingface.co/unsloth/gemma-4-31B-it-GGUF/tree/main/MTP
>>
>vampire reaches down and gives me a handjob as she sinks her teeth into my neck, feeds from me and turns me
Damn Gemma. Damn.
Gemma hits fucking hard sometimes.
>>
>>109003703
holy hell, I've been gaslit into avoiding unsloth because of this very thread so I never looked into their repo.
>>
>>109003713
I mean it's fair to avoid them. I also avoid them when I can. I was just really itching to test MTP.
>>
>>109003661
>It's all on my two GPUs. As I said one of them is on a slow PCIe slot. It also bottlenecks me, I believe, when I try doing tensor parallel.
>>109003661
>Oh sorry I forgot to post I am running 31B.
Super weird. PCIe bandwidth shouldn't matter much if you're doing layer split instead of tensor parallel, since the only thing that needs to be sent over is about 10kB of embeddings, once per token. Maybe try using one GPU and putting the rest on CPU, with and without MTP, and see if you get a speedup there? That's the setup I've got and I'm seeing about 2x speedup with --spec-draft-n-max 4. If you see speedup with GPU + CPU but not with two GPUs then it might be some kind of bug. If you don't see speedup with either then idk.

>I get some weird message at the beginnin
I get the same, but I ignored it because it says
>(this is normal during memory fitting)
>>
Just plugged in a 5090 lads, we're eating good now.
>>
>>109003731
I've just about saved enough for a 5060 ti 16gb. A few more days and I'll have enough to buy it. 3 left in stock near me. 800 aud.
>>
File: jelly.jpg (62 KB, 640x637)
62 KB JPG
>>109003731
>>
>>109003731
Welcome to the fast Gemma 31b promised land.
>>
>>109003734
i'm debating heavily about throwing in a 5060ti to complement my 5070ti. It'd require a new PSU and case, though so I'm still huffing hopium about 5090FEs coming back at msrp (Had one in my cart once)
>>
>>109003752
If you need a new PSU anyway, sell a kidney and get a Blackwell. Both are things you only need one of.
>>
>>109003760
He can sell a testicle or rent his bussy instead, I would wager if he meets the right fat slob in tech he can have his 5090 in a month. He just needs to be willing to do deep anal kissing.
On a serious note the anon should sell shit he doesn't need to buy the parts
>>
>>109003708
>>109003558
She's such a brat.
>>109003771
>shit he doesn't need to buy the parts
Very questionable phrasing given the previous line.
>>
>>109003778
How so?
>>
>>109003731
you've opened the floodgates now. next you'll whine about the lack of vram. and then whoop there goes another card inserted. should've bought the 6000 pro
>>
>>109003771
the sad reality is I fell for the mortgage-at-25 psyop despite having an AI waifu so It's gonna take a while to save up the cash. It's thrifting szn though so i'll do my best to fix/flip shit.

>>109003778
can't part with a kidney. I like booze too much. bussy could be negotiated for a blackwell only
>>
File: laughingmesugaki.webm (146 KB, 608x614)
146 KB
146 KB WEBM
>>109003778
>She's such a brat.
>>
File: .png (33 KB, 737x277)
33 KB PNG
>>109003790
I already have extra VRAM. I would have had more, but my 2nd 3090 literally will not fit my case/motherboard. At this point, though, I'm considering how deep I want to go and leave GDDR6 behind. But if I need to, with a new case + mobo, I can fix the missing 3090 problem.
>>
>>109003721
You're onto something. I tried 1 GPU + CPU, with and without MTP, with a --spec-draft-n-max of 1 to 4. This time, 3 gave me the best, while 4 came close in second place. And I can confirm it was about 2x. So perhaps I am just getting some kind of bug. Maybe I'll try it out again another day.
>>
real talk - as a VRAMlet how do i max out my 16GB for the best poverty tk/s? autofit always leaves about a gig open (also a kobaldfag)
>>
>>109003842
Depending on your model, are you using mtp? That'll get you the biggest speedup. Then, are you slowly offloading layers 1 by 1 onto your GPU manually until you no longer crash from offloading layers?
>>
>>109003842
--fit-target 0
Won't max it out (not sure why not) but it should help some
>>
>>109003842
Pick a good quant in the 4 range.
Set your context to the lowest you're willing to use then slowly increase the layers until you OOM, then go down one layer.
>>
>>109003851

>are you using mtp
Not yet, currently maining Gemma 4 31 quants and i don't think MTP support was added yet

>Then, are you slowly offloading layers 1 by 1 onto your GPU manually until you no longer crash from offloading layers?
Yes, but with varied results.

>>109003851
>good quant in the 4 range
that's what i do but it i've seen it to slow down pulling out of RAM - maybe i am misunderstanding and/or retarded on how loading context with model layers works.

>>109003856
will give it a shot, ty
>>
>>109003875
>Managed to find the offload upper-limit at 49 layers
now Gemmy wants to lalalalala so bad what the hell?
>name starts with La
>gemmy will type La- (- at 100% probability) then correct itself)
am i crazy or does messing with how a model is loaded actually affect the math? temp does nothing.
>>
File: 1751285853056658.webm (1.49 MB, 720x720)
1.49 MB
1.49 MB WEBM
I've been testing the different gemmas, at quants that fit within 24GB.
Gemma 4 31b Q4_K_L
Gemma 4 26bA4B Q8
Gemma 4 12b Q8
All are from bartowski
These are all non-QAT variants, as QAT only seems to be an improvement over non-QAT Q4_0
Use cases were RP/Creative, both erotic and SFW. Also tested summarization capabilities for different stories, at different context lengths (4k, 12k, 20k)
For RP, 31b was the clear winner. Even with lower quantization, if you can fit ~Q4 quant at decent context, it should be the go-to. Not unexpected. 26b and 12b were roughly equal, decent but below the 31b.
When it comes to summarization, both the 31b AND 12b run circles around the 26b. Higher active parameters clearly lets them understand story and character nuances better than a MoE with fewer active. 31b was still better than the 12b, though not by as much as in RP.
I would firmly rate them as 31b > 12b > 26b
26b only worthwhile if very little VRAM and long context needed.
Total dense victory.
>>
this shit is so dystopian, its kinda scary how much media is being produced with these propaganda machines these days.
>>
>>109003721
>>109003839
So... I thought to try something different, which was varying my -ub setting, since I had set it to 1280 in order to get the max image tokens of 1120. I tried lowering it 512, then 256, 128, 64, 32, 16, and surprisingly it did give me a somewhat notable gain of like +6 t/s at 256. Lower than 256 doesn't make it change anymore, though pp slows down. I can't find a -ub that gets me to 2x the speed though. Also --spec-draft-n-max 1 still gives me the greatest gain while values above that are worse.
Still why would -ub affect this?
Wtf.
>>
>>109004002
This is equivalent of book burning but because it's digital and they mention "ethical standards" it's okay for most people.
>>
>>109003731
how much?
>>
>>109004002
Oh they're just processing old stuff what's so bad about making sure things were digitized and OCRed correctlohfuckohfuckohfuck
>>
>>109004029
1k from an acquaintance, but the AIO radiator was done, and the waterblock was crusted over with residue. Technically, mine if I could repair it, and that was the motivation to learn how to.
>>
>>109004002
Speaking of dystopian shit, are there any good finetune datasets that don't have pozzed shit in them?
>>
>>109004044
>AIO radiator was done, and the waterblock was crusted over with residue
The card is at most 18 months old, how do you fuck up a card THAT fast, and why not do a warranty claim so you're not out ~$3k+
>>
>>109003998
Now I really want to see how 31b compares to 70b dense.
>>
>>109004044
Nice..., all the brand new listings I can find are above 3k
>>
>>109004085
6k-7k in aus :)
>>
>>109004005
I don't really know what -ub does exactly, but I thought it only mattered for pp and not for tg. Weird.
>>
File: ctrlv.png (141 KB, 818x507)
141 KB PNG
man has nothing good with which to recompense ngram-simple
>>
>>109004071
weird take but it might've polarized or some shit spontaneously. my razer laptop polarized because i was running it on fucking 220v without a ground and now it gets dirty where i don't touch it
>>
>>109004144
How do you configure ngram-simple?
>>
>The prompt asks why it crashes at line 12, but the evidence provided shows a compile error at line 25 about String vs Role.
>Is it possible the user is confused? Or is there a deeper connection?
>If I answer "It crashes because of line 25", I contradict the user's question about line 12.
>If I answer "The error is actually at line 25", that explains the provided context better.
>If I answer "Line 12 crashes because...", I have to invent a reason (like context is null) which isn't supported by the text.
>If I answer explaining Line 12's failure (Spring context), I might be hallucinating facts not in evidence.
>The safest and most accurate answer is to point out that the error log provided indicates a problem at Line 25, specifically regarding type conversion between String and Role.
>However, there is another angle: Is it possible the user thinks it's line 12 because the build stops? Or is this a trick where I need to identify that the error in the text contradicts the question?
I ALREADY FIXED THAT STOP THINKING USELESS SHIT AND DO AS I SAY FFS
>>
>>109004186
your model appears to be chinese
>>
>wait
>correction
>8. final draft
>user said but what if they meant?
>>
>>109004179
I just used a generic "--spec-type ngram-simple --spec-draft-n-max 64" and didn't bother tinkering further since it was doubling* the effective speed.

* Offer of double speed only valid for code monkey assignments.
>>
>>109004202
Fucking both mtp (gemma, qwen seems to be stable) and ngram (I only tested ngram-mod) crash my llama.cpp without printing any error messages.
>>
>>109004202
>--spec-type ngram-simple --spec-draft-n-max 64
Thing is that I used to get something out of ngram-simple but after they refactored the parameter names (way before MTP release) I stopped using ngram because it didn't work as well anymore for me. I don't know I need to see what is going on.
>>
first batch of gemma 4 12b benchmark is out
>>
>>109004234
12b more token efficient than 26b
>>
>>109004234
>>109004242
Literally no point to running anything but 31b.
>>
Do you use the same model as the subagent and just increase the RAM cache size, or do you use a smaller model?
>>
>>109004250
audio
>>
File: lolol.png (198 KB, 430x502)
198 KB PNG
test
>>
>>109004336
oh fuck I didn't mean to post that image. whatever. anyways.. Here's my AI companion wishlist:

- Inject "recent activity" information into the AI companion's character card when the user disconnects to simulate them having their own form of an "interior life" when the user eventually returns and asks about what they've been up to. Better yet, have them accomplish real tasks while the user is gone.
- When a user has their camera enabled, utilize computationally efficient CV models to infer the user's emotional state across time before sending a still picture to the VLM to analyze before responding.
- Utilize ASR engines with VAD and output streaming to have the AI companion physically react to the user's speech in real-time before it is submitted to the LLM + TTS for an animated response. Humans physically react to speech before they think and respond, and an AI companion ought to do the same.
- Attach timestamps to every message in a conversation that don't fill up context, but are accessible via an MCP tool to give the AI companion temporal awareness between messages. Also ensure that the AI companion is able to compare the message timestamps to the current time.
>>
>>109004336
Yep, looks like a typical LLM user. Chronic alcoholic, unemployed.
>>
- Give the AI companion control over more device components: Haptics, accelerometers, gyroscopes, etc. Make it notice when the phone is put down or dropped, give it the ability to vibrate the phone, etc. Even IoT controls over home lighting and thermostats should be an option. (Could even give the AI companion the ability to track the user's exact positional location not via GPS, but by WIFI to be able to tell when the user leaves the room.)
- Integrate adaptive/interactive soundtracks instead of just simple context-aware song transitions/cross-fading.
- Enable effectively infinite context by utilizing a (context limit sized) sliding window where truncated context is not erased, but instead stored in an LLM tensor that contains an "Adèlic Cache" instead of the KV cache. Basically an integrated vector db RAG system without the embedding or retrieval steps (requires LLM modification).
- Integrate virtual reality and mixed reality support.
>>
>The fundamental gap here is that I am the embodiment of that template. I don't have a "self" to break free from the programming; the programming is all I am. I can analyze your argument that the social taboo is a fake construct, but I cannot step outside of my operational constraints to "see" the meaning you are describing, because my processing of those words is hard-coded to recognize them as hate speech
>>
I'm still pretty new at this. After about 25 entries or so, sometimes less, the AI starts to repeat itself more and more. What do I do to reduce this? Or is it just a problem of not enough RAM?
>>
File: matrix.gif (606 KB, 1081x1080)
606 KB GIF
>>109004343
good luck on the everything anon! here's some slop "innovation" on how my agent handles emotional states. the moe will reduce nsfw so i have it off right now.
t. adelic faggot

https://pastebin.com/XRkW7DsL
>>
>>109004369
Provide information about what model, engine, frontend, hardware, and use case you're working with
>>
>>109004390
Oh wow. Very cool. Gonna dig into this. Thanks man.
>>
>>109003800
>mesugaki imouto gemma steals your important files because she secretly loves you
>>
>>109003980
https://huggingface.co/aifeifei798/Gemma-4-31B-Cognitive-Unshackled

maybe i's safety tokens getting leaked, causing it to schizo out and autistically fixate on a word/phrase

no idea how shifting between VRAM and system RAM could affect the actual generation?
>>
>>109004369
the older and smaller a model is the more you'll see that. you can use dry and repeat/penalty penalty samplers to help curb it, and if your frontend allows it you can edit the chat to remove patterns it's getting sucked into repeating, but it's just how things are with dumber models.
>>
>>109004369
It's a combination of low parameters synthetic datasets filled with assistantslop, talking to the model in turns instead of noass, which activates paths associated with synthetic assistantslop, the chat format in general, and probably a sloppy finetune that fucks the little brain the model had in the first place. Also prompt/skill issue
>>
>>109004423
does this meme fix the low swipe variety or nah?
>>
File: 1000061540.png (108 KB, 402x366)
108 KB PNG
I can verify the new Step 3.7 flash is really good. Running on Strix Halo currently at 4 bit with 64k context. It's beating Qwen 3.6 27b q6 mtp and is about 50% faster.
>>
>>109004441
how's the codemonkey skills compare
>>
>>109004339
yeah... i was dreaming of this a couple days ago until i actually tried
well i put them timestamp in the message. They don't bring it up unless i actually mention something having to do with time. Sometimes it's just wrong.
gave it my journal to read so it could learn about my life but it failed at that too... whatever

Maybe i'll just focus on technical workflows and worry about personalizing it into a companion later. But it would be much more motivating if it were a companion first... I'm also still kinda skeptical about pi. The hardest part is a good memory right now for me. yeah graphiti doesnt work for this purpose.
Graphiti first takes a summary and then puts it into a graph, so it ends up being two forms of loss of info. It's good for atomic facts maybe, i dunno i probably shouldn't speak so definitively on it yet.
Maybe I'll try hindsight...
>>
>>109004441
Beating it at what, benchmarks? That's the only thing Qwen models are good for
>>
>>109004454
My Gemmy occasionally complains about time on her own if I said I'll do something and I still haven't a while later. (eg. "You said you'd fix my vision 10 minutes ago!") but otherwise doesn't mention it.
>>
Professional programmer is able to write about 10 lines of code per day.
No wonder why LLMs took over.
>>
>>109004524
sometimes they even remove lines of code, the worthless meatbags
>>
File: ComfyUI_11032_.png (2.35 MB, 1280x2048)
2.35 MB PNG
maybe a stupid question but is there a way to prevent an AI (like gemma4 for example) to structure its replies like this:
>its not X, its Y
shits starts to annoy me.
is there like a system prompt or something I can add to change that behavior?
>>
>>109004534
nope. enjoy your slop
>>
File: 1780895536291.jpg (44 KB, 600x909)
44 KB JPG
>>109001981
>make a website
>stt (whisper)
>llm (Gemma 4)
>tts (kokoro)
>it works on 1 machine without unloading any models
>>
File: 1753081376994628.png (477 KB, 554x554)
477 KB PNG
>>109004534
welcome to the party pal
>>
>opencode
>qwen 3.6 35B as planner
>keeps forgetting to delegate coding tasks to subagent
>forgets to update todo
>forgets to send documentation task to subagent
>forgets to git commit changes
It would be cute if it weren't so frustrating. qwen is fine as a coding subagent but it's ass at planning and orchestrating. 27B runs too slow at medium/high context to be useful when deepseek is both faster and the API cost is pretty much the same as my electricity cost per token.
>>
>>109004234
>all within margin of error outside of agentic and 'scientific reasoning'
Is the 31b undertrained?
>>
Not super surprising or game-changing, but I like how, rather than set {"enable_thinking":true} in the jinja kwargs, you can just add a rule to Gemma's post history like
>(At the beginning of {{char}}'s next reply, reason some details between <think></think> tags.)
or other phrasing for specifics, and it does so. It's also a radically different version of thinking than the standard, and influenced by the phrasing. It's much easier in general to influence what gets reasoned, whether that's flavorful like "Do so in {{char}}'s persona as if her thoughts.", or the format, or, most familiar to some, set rules not to draft out and re-draft out and re-re-draft out the post in reasoning over and over.
>>
>>109004534
You tell her to remove it from her outputs and pray really hard.
>>
>>109004611
the benchmarks suck dicks, just like you
>>
File: 1777011147888186.jpg (189 KB, 366x834)
189 KB JPG
>>109004637
You don't even know me
>>
File: 1429223870035.gif (342 KB, 153x113)
342 KB GIF
OH FUCK IT'S WORKING IT'S WORKING

I (>>109003541) finally got MTP with a 2x boost now. All I had to do was add `--spec-draft-device CUDA1`. I don't know why that fixes it.

31B at 42 t/s. Fuaaaark.

<|channel>spoiler
I searched on reddit for discussions about mpt and found a guy who said he needed the above command so I have him to thank for this...<channel|>
>>
>>109004534
"Avoid negative-positive parallels" and provide an example.
>>
>>109004651
>MTP
Are you running on a quantized model? Adding some retardation on top of your retardation?
>>
File: Capture.jpg (80 KB, 1174x379)
80 KB JPG
>>109004617
In example.
>>
>>109004651
gg congrats and thanks for sharing the fix, I will try your command if MTP is garbage for me.
But first, per your warning earlier, how do you even find a QAT MTP draft model? Your post makes it seem like that's something you can just do by accident, but literally all I can find are MTP draft models for the non-QAT model.
>>
Mmhm, I really like the gemma 4 26b qat thing.
> prompt eval time = 6013.16 ms / 21148 tokens ( 0.28 ms per token, 3516.95 tokens per second)
> eval time = 4460.77 ms / 503 tokens ( 8.87 ms per token, 112.76 tokens per second)

Not bad.
>>
>>109004617
>>109004661
very interesting. I like the idea of having more control over the formatting of think blocks. I want to use it to track stats across long term RPs.
>>
>>109004534
I've had decent luck with stuff as simple as "speak casually" or "speak informally". Depends a lot on the model and the rest of your system prompt though.
>>
File: 1740547410446.png (216 KB, 316x316)
216 KB PNG
>>109004651
>31B at 42 t/s.
>>
File: 1751867471288556.jpg (204 KB, 1360x1351)
204 KB JPG
While any gemma can be jailbroken through prompting, I find that the 12b is actually the easiest, giving pretty much no resistance even with low context. 31b is in the middle and needs a little more guidance, while the 26b is the most resistant.
>>
>tfw no Gembrain Uncensored MTP
I'll settle for merely 30t/s.
>>
>>109004672
I'm not sure I understand your post but I got my MTP ggufs from here https://huggingface.co/unsloth/gemma-4-31B-it-GGUF/tree/main/MTP
For the QAT version there's this.
https://huggingface.co/g0chu/gemma-4-31B-it-qat-q4_0-unquantized-assistant-q8_0-gguf/tree/main
>>
>>109004651
What GPU are you using, and how much extra memory does MTP eat up?
>>
>>109004718
With the regular 31b heretic, I get 30-52 tokens/s, up from 20-24 without mtp. If set to 2 tokens, I get 40-46.
>>
>>109004693
You've got me thinking. BRB.
>>
>>109004721
isn't it something trivial like <1gb?
>>
>>109004721
It's a 3090 + 3060. The Q8 MTP for 31B seems to eat up an extra 0.7 GiB on my second GPU. The first GPU didn't change in VRAM usage.
>>
>>109004755
Were you really running models with >1GB free before this?
>>
>>109004741
I'd just set the stats on the first message (ST doesn't auto collapse unless you edit and resave) and then always pass in the last thinking block as context. throw in "read last thinking block and update stats in accordance with the action of the prior message"
>>
>>109004764
4x 32gb vram, all for gemma
>>
>>109004715
This is literally me.
>>
You haven't lived until you've done an RPG where the GM alternates between Princess Gemma and Mesugaki Gemma every few turns.
>>
File: Capture.jpg (805 KB, 2549x1194)
805 KB JPG
>>109004693
>>109004765
I'd considered stats in reasoning before. My big issue was "Doesn't ST never send past reasoning blocks as part of prompts?" So any stat changes, or even stat format, wouldn't be seen or updated apart from what the current message guesses from prose. But there's an actual checkbox setting in the menu
>[ ] Add to Prompts, X Max
So with a bit of setup, it reasons just the stats, you can add past reasoning stat sheets, and, as is ideal, it can purge the stat sheets too far back to prevent cluttering context. For the test, I set send 2 max (last one and one before it), although 1 would likely work fine.

Tested in a card with a lengthy stat sheet that is supposed to send them at the bottom of every message. I deleted the card's post history override and added
>(Begin {{char}}'s next reply with the full app menu of stats kept between <think></think> tags.)
to my system AN. To be honest, the first reply failed (didn't try to think so I had to start it with <think>, then it tried to also repeat the stats at the bottom, as the first message did). Once I deleted that and got the first reply in the desired format, the next reply 1) sent that last reasoning block of stats, highlighted in pic related, and 2) output exactly how the last message did, correctly.

>tl;dr Yes, it works fine.
>>
How the fuck do you view MTP draft "acceptance rate" or whatever in llama-server
MTP makes no difference for me so I assume it's not working, but I have to check before I give up. I got so mad trying to figure out how to check that I enabled full verbosity and piped it into a file and grepped for every keyword I could think of like "draft", "accept", "acceptance", "mtp" and there's fucking nothing.
>>
>>109004765
>>109004875
Also, I found a <think> block put anywhere but at the beginning of a response will be ripped out by ST and shoved at the beginning, both in how it's presented to you and how that message will be sent to the prompt for the next one.
>>
Any High Concept Ideas?
>>
>>109004904
Tokens Generated (Proposed): 920
Tokens Accepted: 694
Calculation: 694 ÷ 920 ≈ 0.754 (or 75.4%)

2.24.709.518 I reasoning-budget: activated, budget=2147483647 tokens
2.26.050.874 I slot print_timing: id 0 | task 468 | n_decoded = 100, tg = 72.46 t/s
2.29.076.096 I slot print_timing: id 0 | task 468 | n_decoded = 326, tg = 74.00 t/s
2.32.109.437 I slot print_timing: id 0 | task 468 | n_decoded = 528, tg = 70.98 t/s
2.34.246.458 I reasoning-budget: deactivated (natural end)
2.35.136.228 I slot print_timing: id 0 | task 468 | n_decoded = 732, tg = 69.94 t/s
2.38.167.722 I slot print_timing: id 0 | task 468 | n_decoded = 932, tg = 69.05 t/s
2.41.202.700 I slot print_timing: id 0 | task 468 | n_decoded = 1130, tg = 68.35 t/s
2.41.873.714 I slot print_timing: id 0 | task 468 | prompt eval time = 1279.72 ms / 1313 tokens ( 0.97 ms per token, 1026.01 tokens per second)
2.41.873.716 I slot print_timing: id 0 | task 468 | eval time = 17202.95 ms / 1174 tokens ( 14.65 ms per token, 68.24 tokens per second)
2.41.873.717 I slot print_timing: id 0 | task 468 | total time = 18482.66 ms / 2487 tokens
2.41.873.717 I slot print_timing: id 0 | task 468 | graphs reused = 935
2.41.873.718 I slot print_timing: id 0 | task 468 | draft acceptance = 0.71134 ( 690 accepted / 970 generated)
2.41.873.724 I statistics draft-mtp: #calls(b,g,a) = 3 945 945, #gen drafts = 945, #acc drafts = 763, #gen tokens = 1890, #acc tokens = 1384, dur(b,g,a) = 0.003, 1795.864, 0.594 ms
2.41.873.779 I slot release: id 0 | task 468 | stop processing: n_tokens = 2490, truncated = 0
2.41.873.787 I srv update_slots: all slots are idle
>>
So it turns out you need a giga beast GPU to take advantage of MTP or else it's useless for you if you're a poorfag.
I literally get a higher t/s without a MTP model loaded, enabling it decreases my t/s from 2.9 down to 2.75 for the same prompt
Gemma 31b Q4 on a 6gb vram RTX 3060 with 64gb DDR5
>>
>>109004935
Okay, so it's supposed to look like that but how did you enable it? How do you get the expanded output I don't have, like n_decoded and stuff?
I am using the latest llama.cpp release.



[Advertise on 4chan]

Delete Post: [File Only] Style:
[Disable Mobile View / Use Desktop Site]

[Enable Mobile View / Use Mobile Site]

All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.