[a / b / c / d / e / f / g / gif / h / hr / k / m / o / p / s / t / u / v / vg / vm / vmg / vr / vrpg / vst / w / wg] [i / ic] [r9k / s4s / vip] [cm / hm / lgbt / y] [3 / aco / adv / an / bant / biz / cgl / ck / co / diy / fa / fit / gd / hc / his / int / jp / lit / mlp / mu / n / news / out / po / pol / pw / qst / sci / soc / sp / tg / toy / trv / tv / vp / vt / wsg / wsr / x / xs] [Settings] [Search] [Mobile] [Home]
Board
Settings Mobile Home
/g/ - Technology


Thread archived.
You cannot reply anymore.


[Advertise on 4chan]


File: 1762565947250834.jpg (724 KB, 2048x1536)
724 KB JPG
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108984529 & >>108980055

►News
>(06/05) dots.tts 2B released: https://hf.co/rednote-hilab/dots.tts-soar
>(06/05) Gemma 4 QAT models released: https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4
>(06/04) Higgs Audio v3 TTS released: https://boson.ai/blog/higgs-audio-v3-tts
>(06/04) Nemotron-3-Ultra-550B-A55B released: https://hf.co/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16
>(06/03) Gemma 4 12B Unified model released: https://hf.co/google/gemma-4-12B-it

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://swe-rebench.com
Agentic Coding: https://deepswe.datacurve.ai
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: 1739250286596.jpg (104 KB, 1000x1000)
104 KB JPG
►Recent Highlights from the Previous Thread: >>108984529

--Unsloth Gemma 4 QAT quant accuracy claims and skepticism:
>108986383 >108986405 >108986408 >108986410 >108986424 >108986447 >108986504 >108986533 >108986544 >108986559 >108986577 >108986608 >108987012
--Gemma 4 QAT near-BF16 Q4, and potential llama.cpp MTP support:
>108986761 >108986766 >108986782 >108986805 >108986818 >108986828 >108986834
--3090 24GB limits with 32K/70K context and system overhead:
>108986899 >108986928 >108986937 >108986952 >108986967 >108986978 >108986995 >108986954 >108986974
--Unsloth QAT knowledge retention and evaluation methodology:
>108987079 >108987154 >108987195 >108987223 >108987328 >108987381
--Unsloth QAT versus llama.cpp Gemma-4 quantization accuracy:
>108987587 >108987629 >108987788 >108987840 >108988071 >108988087 >108988095 >108987918 >108987920 >108987929 >108987930
--Performance of dots.tts on Seed-TTS-Eval:
>108987945 >108987996 >108988023 >108988184 >108988281 >108988294
--Anon benchmarks llama.cpp batch and CPU MoE settings:
>108986409
--llama.cpp NCCL boosting tensor parallel multi-GPU performance:
>108986764
--Testing Unsloth QAT 31B against standard quants:
>108988539 >108988550 >108988569 >108988623
--AMD Strix Halo tradeoffs for large context local LLMs:
>108985094 >108986729
--Anon modifies imatrix to use chat templates and multimodal data:
>108984809 >108985351 >108986176 >108986241 >108986305
--Adelic Gemma 4 12B server setup and performance:
>108985061 >108985087 >108985100 >108985857 >108986562
--Gemma 4 12B unified multimodal architecture explained:
>108984735 >108984769 >108984775
--Logs:
>108984830 >108984986 >108985020 >108985032 >108985108 >108985256 >108985297 >108985388 >108985399 >108985661 >108985741 >108985854 >108986497 >108986572 >108986691 >108986728 >108987403 >108987460 >108987492
--Miku (free space):
>108986398

►Recent Highlight Posts from the Previous Thread: >>108984530

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>108988698
RIP RAMlets
2 3090s btw
>>
Mikulove
>>
>>108988711
wow im in the highlights, a space all to myself too
>>
Until real testing is done this is canon.
>>
>>108988718
>RIP RAMlets
Our suffering never ends. maybe next year.
>>
File: 1.png (5 KB, 276x160)
5 KB PNG
Why is my gemma 4 31b qat q4 with 8k context eating up all my vram on 3090? it fills up completely and slows down to a crawl. am i doing something wrong or is that normal. Im using koboldcpp and windows 10.
>>
>>108988740
loli version
>>
what's the best

speech -> llm -> audio

pipeline I can run at home and use on my phone?

Do I really need to make a custom tool for this? Or can llamacpp do this?
>>
>>108988740
how many "pussy grade" spreadsheets have you prompted in total now? is this a habit of yours?
>>
I'm tired of people trusting unslop's skewed statistics. If they need to spam their updates every single day and spam how good they are - they probably aren't that good. It's a simple rule: if you need to talk about how "powerful" you are, in reality you aren't that powerful at all.
>>
>>108988758
This was the original I had the other ones for slower anons
Would you like a different kind or chart or graph?
>>
File: miku small thumb up.png (22 KB, 240x240)
22 KB PNG
>>108988731
>>
I'm getting real tired of this. I don't know why this keeps happening, or what I need to change to fix it. It's a custom app using gemma 4 31B. Only settings being sent in the requests is temperature 0.7 and medium reasoning effort.
>>
>>108988751
I don't remember if SWA is activated by default on kcpp, try forcing it on?

--useswa
>>
Here's how poor people can still win with 17 tokens per second which is enough for anything.
>>
>>108988779
quant? kv cache quant?
>>
>>108988782
when will local image models be this good at so much text?
>>
>>108988781
--usseewa
>>
>>108988782
If you're aiming this at the desperate you might want to add that windows is usually slower, too.
>>
>>108988788
This is what I have set in oobabooga textgen, basically defaults + "-fa 1" in extra flags because I read it is supposed to be used with Strix Halo
>>
>have 2 R9700s and 2 V620s
>they suck because the bandwidth is dogshit
fucking AMD
If you had $4000 to get hardware, what would you get? A 5090? Some 3090 Turbos?
>>
So would 4 bit QAT version quantized to 3 bit be better than the non QAT 3 bit quant? or would that just cause double errors since the OAT's are trained to be 4 bit?
I can't run 31b 4 bit because poor
>>
>>108988782
I have 8GB VRAM and 64GB RAM, what should I do? 26B Q8 gemma 4?
I'm using Q4 right now
>>
>>108988810
A big model at 2bit or 3 bit is better than small models at 16 bits.
>>
vramlets lost
vramgods lost
everyone lost
>>
>>108988810
It's a good question. Not sure anyone knows the answer to that. Maybe try it out and report back?
>>
>>108988781
SWA was off by default. Turned it on and i can have 16k context and only takes up 20gb of vram. Thanks for help.
>>
>>108988782
>Sources: LocalLLaMA

>>108988790
I hope not. These things always look so cluttered. Like it needs every single pixel to have something jammed in there.
>>
>>108988808
stack 3090s
>>
>>108988812
whatever moe you can fit the active param plus context in vram
>>
>>108988805
looks alright desu. did you check the model shasum (if the download got fucked)? if its fine maybe your abliteration got brain damage, try the normal one
>>
>>108988779
You are better off using llama-server as your backend. If you are working on your own game or project I would use text completion end point. It takes bit more work because you will need to implement and track your own chat template and manage reasoning brackets (which isn't that complicated at all anyway) but in the end you have something what will work the way you want it to work.
>>
I'm still bummed Step 3.7 didn't hold up long-term for me.
I fear Gemma will be the only one worth using <40B for the entire year forward.
>>
>>108988823
I would try it but I can't find 4 bit QAT 3 bit GGUFs
>>
>>108988808
just the gpu? 2x5090
>>
>>108988849
You could make your own quant.
>>
>>108988812
There are only two
>>
Why is there always a group of faggots that make a concentrated effort to spread FUD whenever something good drops?
>>
>>108988867
50 cent have been deposited in your acc- you do it for free?
>>
>>108988860
Just the GPU, yeah.
I wish I could get a 5090 for $2K but there's nowhere near me that's selling them for that price (I've checked FB marketplace too). The cheapest I've found is $3500.

>>108988828
Thanks, I'm considering it. But I already have a ton of VRAM, so I'm thinking of getting something more performant (for smaller models).
>>
>>108988873
What is FUD? I'm not from around these parts.
>>
File: file.png (27 KB, 751x102)
27 KB PNG
What are the MTP files for?
>>
>>108988887
for MTP
>>
>>108988885
Fucking Use Duckduckgo
>>
>>108988867
go back
>>
>>108988882
>I wish I could get a 5090 for $2K but there's nowhere near me that's selling them for that price (I've checked FB marketplace too). The cheapest I've found is $3500.
snatch a FE whenever they drop
>>
>>108988898
Why are you so hateful?
>>
>>108988849
Download the qat-q4_0-unquantized version and quant it to Q3 yourself
>>
>>108988908
>whenever they drop
anon...
>>
>>108988867
It's not exactly wrong for absolute beginners with shit cards, but I'd go gemma instead.
>>
>>108988873
yes, but not concentrated, just classic jaded anons who could win the lottery and still find a way to complain about it
>>
>>108988782
>keep hot experts on VRAM, cold on RAM
It doesn't actually work like that though, right? You can only offload layers not the individual expert models that make up the MoE architecture.
>>
>>108988914
I got 2 of them in January, dunno if they never drop them anymore
>>
Bart won.
China won.
Q8 won.
>>
>>108988885
Stock market lingo tourists learned during the crypto bubble and now repeat mindlessly even when it doesn't make sense.
>>
>>108988833
It was working testing through the textgen UI, but I never really did longer conversations so maybe it is just retarded. I'm trying the unsloth gguf from the official model and it seems to be doing better.
>>108988839
I can give it a try again. Last I tried it would break or crash more often than textgen which makes it difficult since it is on a completely different computer.
>>
>>108988172 (me)
Got dots.tts to run on Windows following this https://github.com/SpenserCai/ComfyUI-FunAudioLLM/issues/7#issuecomment-2404068000 and installing triton-windows
13 seconds of audio took 22 seconds on a 3090 with warmed up dots-tts-soar, default settings, 7500MB VRAM usage. SLOW. needs a de-esser (ssss not present in voice sample):
https://voca.ro/1izKJW8IbhSL
>>
>>108988953
Unless your GPU is non-Nvidia, llama-server should be stable as long as you don't OOM.
>>
>>108988908
Pretty sure they don't drop anymore. NowInStock says the last time they had them was Jan 20th.
>>
Cumcloth lost.
>>
>>108988951
Makes sense I guess. I was guessing it's referencing something like fubar or fugazi but whatever.
FUD is zoomer lingo because they steal and reinvent everything.
>>
>>108988922
I think back to when Gemma 4 first dropped and how faggots squatted the thread 24/7 making lies, it's such a strange phenomenon.
>>
>>108988974
Just checked, last drop was "04/30/2026" on best buy
>>
>>108988963
I unfortunately don't have money for nvidia that can fit full gemma, so I'm stuck with my amd.
>>
>>108988831
I can run Q8 with 128k ctx and 16 t/s, coolio
>>
>>108988960
>https://voca.ro/1izKJW8IbhSL
nice, sounds like my experience, its great, but slow. Hoping somebody will figure out some fixes/hacks to make it run faster
>>
>>108988960
Can you try different emotions? Anger, sorrow, sarcastic, happy?
>>
>>108989015
No emotion control. Try it yourself if curious: https://huggingface.co/spaces/rednote-hilab/dots.tts
>>
File: 1689957414234047.png (24 KB, 772x1124)
24 KB PNG
Caught up on the threads. Nice to see Elon's back in the threads.
>>
File: varnishing act.jpg (156 KB, 1216x832)
156 KB JPG
>>
File: file.png (87 KB, 2267x587)
87 KB PNG
>>108989031
Sad, I thought the demos showed different types of emotions.
>>
>>108989034
Lingual splinters with Yuki
>>
>>108989034
catbox it now
>>
>>108989006
Have you checked the raw requests your custom application is sending? See if ooba has some way to log them from its end too. If it's not samplers or the model, it must be due to the chat template.
>>
>>108989045
Must be inferred from the literary tone of the prompt.
>questioned incredulously
>!
>>
>>108989034
holy sex
>>
File: file.png (7 KB, 316x53)
7 KB PNG
STOP overthinking you dumb clanker
>>
>>108989014
>make it run faster
Might be faster running at FP16 instead of FP32 so it uses the GPU cores properly?
>>
would i be retarded to buy a GMKtec EVO-X2 instead of stacking GPUs?
>>
>>108989034
i dunno who this is but i liked her when i saw this image i think its a vocaloid or somethign
>>
>>108989132
go to drunk, you're bed
>>
>some gemmies overthink
>other gemmies never think
RPbros... is thinking actually useful? I just want to have it think for a few tokens and get the plot in order or learn how to into Branching Paths
>>
>>108989132
she's amazing design yes
>>
Q4_K_M Bartowski seems to be better at programming than QAT Q4_0 Google. But isn't that expected, training or no training Q4_0 is still so much smaller than Q4_K_M.
>>
>>108989154
We were promised better than Q8...
>>
>>108989148
If thinking was useful we wouldn't be here
>>
Hey llama.cpp displays the progress of prompt ingestion in while it's doing the batches, how come no front end take advantage of this to give you a progress bar for your replies? Theres nothing for token generation though, I guess if you have streaming then maybe, but even still, token gen can be estimated too after a couple back and forths.
>>
>>108989148
I let Gemma think so I don't have to
>>
>>108989189
LM Studio gives you a prompt progress %
>>
google/gemma-4-12b


Give me a random authentic 110 IQ level test question to solve. Don't tell me the answer until I ask.


Here is a logical reasoning and pattern recognition question typical of those found in standard cognitive ability tests (like the Mensa entrance exam or the Raven's Progressive Matrices).

**The Question:**

Find the number that completes the following sequence:

**2, 5, 11, 23, 47, ?**

Please provide your answer when you are ready!


95


That is **incorrect**.

Would you like to try again, or would you like me to explain the logic?


explain the logic and give the answer


Here is the logic behind the sequence:

Each number in the sequence is generated by multiplying the previous number by 2 and then adding 1.

**The pattern:**
* (2 × 2) + 1 = **5**
* (5 × 2) + 1 = **11**
* (11 × 2) + 1 = **23**
* (23 × 2) + 1 = **47**

To find the final number:
* (47 × 2) + 1 = **95**... wait, let me re-calculate that.

**(Self-Correction/Refinement):**
Actually, looking back at your answer of **95**, you were **correct**. My previous response was a mistake in my verification step!

**47 × 2 = 94**
**94 + 1 = 95**

You solved it perfectly. Well done!


Human is still greater than clanker.


Touché. In this instance, the "clanker" failed to perform a simple arithmetic check on its own logic!

It turns out your human intuition was spot on, while my processing had a momentary glitch. You win this round!


"momentary" my ass you clankers are always screwing things up.


Fair point. When the math is that simple and I still trip over it, "momentary" sounds like a convenient excuse for a fundamental system failure.

The fact that you caught me in such a basic error proves your point: logic requires actual comprehension, not just high-speed pattern matching. You beat the machine at its own game today.
>>
>>108989148
i had to turn on thinking so gemma could use mcp tools
>>
any reason to not use the max context tokens when loading a model for TUI usage? why would you run 65k and get auto-compacted every 5 prompts instead of running 262k ?
>>
What about Q4 vs Q4 QAT? I was using Q4 anyways so I want to know if QAT is an improvement over normal Q4.
>>
File: e.png (5 KB, 739x555)
5 KB PNG
>>108989166
This is just my empirical testing.
I wanted to implement collision between monster and player so they would never overlap. I'm using a very structured prompt with a source code example and it has worked exceptionally well, one shotting all my previous stuff.
QAT took 3 tries and it still wasn't the way I wanted it to be. Q4KM one shot it.
Of course this isn't something what you should take too seriously but I'll just keep comparing things as I progress. And I'm not a professional teletubby/autist by any means.
>>
>>108989221
you dont have enough space?
>>
>>108988960
Finally a good gptosovits successor. I hope it's finetunable
>>
>>108989189
Does the API actually send that data though?
>>
>>108987945
>no discrete codec tokens anywhere in the pipeline
Okay but why is that considered a feature? I prefer having discrete codec tokens in the pipeline.
>>
>>108989076
Columbo model
>>
>>108989224
>over normal Q4.
what is even "normal"? there's 4_0, 4_1 and all the variants that go from XXS to XL
>>
>>108989279
I wish he wrote about his wife more then
>>
>>108989282
gemma-4-12b-it-UD-Q4_K_XL
>>
>>108989214
31b-q8 gave me the same question lel
>>
Don't care, still using Qwen3.5 9B
>>
>>108989310
This, but 4B + RAG
>>
>>108989282
>over normal Q4.
Yeah it's poorly worded and ambiguous like a lot of Unsloth publications.
In this case he's probably referring to q4_0 since that's what Google released.
I'm not really sure why Google choose the original q4_0 for this, they did the same thing with gemma-3 qat.
I assume it's probably something about avoiding Iwan's tech (k-quants) because of the stupid drama in the llama.cpp project. In most places, vague "normal" would just mean Q4_K_M since without autistic interventions, anything smaller than this drastically falls off a cliff in terms of accuracy.
>>
>>108989238
If you ask for it, yeah. Builtin webui just uses the normal chat completion endpoint IIRC
>>
>>108989154
You're just going to say shit or actually post results, you have been at this for hours faggot kun. It's not hard to post benchmark results.
>>
>>108989339
Sweet. Might integrate that in my frontend.
>>
>>108989356
I haven't been at this for hours. 4chan is more than just one person if you haven't noticed.
>>
>>108989378
I don't know how or why anons are testing without benchmarks or outputs it's suspicious that's all. When people are confident in results they actually post it so it can be reviewed and validated.
>>
>>108989398
Maybe you should read some of my posts VERY carefully. I don't usually want to argue with retards, just posting a very important tip.
>>
I am but one man, however I can say in my experience the unslop version of 31B QAT has been pretty performant with hermes agent
>>
>>108989416
>hermes agent
what do you do with them?
>>
>>108988758
Since you asked anon
>>
>>108989378
Hey bro. That guy's has been sitting here shitting on anyone that says they got less than perfect results from it. He's proven himself to be a troll or genuinely a newfag that just doesn't get it. It's probably not worth responding to him.
>>
>>108989420
Automate tedious shit like cloning and building PRs, quanting models, browser tasks in a sandbox etc., having something that can work in the background with a decent premade set of tools, ping me on discord to update me as it goes and improve itself is pretty nice
>>
Just ran qwen 3.6 35b a3b 2 bit on my toaster and I'm impressed that I was able to do it. We really don't need the shitty small models. We can run the large models fine.
>>
I'm not done yet
>>
I think gemma is jerking itself of at this point because google made it that way, I have suspicion with these claims
>>
I'm going to ask qwen next
>>
What brand of autism is this?
>>
>>108989469
What sort of data do you need I can produce it
>>
>>108989473
The optimal correlation of t/s to oz of cum produced.
>>
>>108989469
idk, but I'm thankful it exists for all of us to witness in its glory
>>
>>108989473
giantess monstergirl full-body insertion depth per VRAM needed
>>
>>108989477
Here you go anon
>>108989482
I got you next post
>>
>>108989426
thanks for responding anon, I always search for use cases for these
>>
>>108989299
Hah do you use a system prompt to make the model a nazzi? Well seems to make it smarter at least.
>>
gemma is busy reading my diary i wonder what she''ll say
>>
gemma is busy stroking my dick i wonder what she'll say
>>
>>108989482
>>
anyone using llms for graphics programming? Qwen3.6-35B is surprisingly decent but I have to use Q8 or it fucks up constantly. using amd 395+ 128gb
>>
File: file.png (777 KB, 626x1531)
777 KB PNG
what model
>>
>>108989541
Why are you not using the 27B model it does better than the MoE on all fronts
>>
File: 1444354211645.png (192 KB, 757x754)
192 KB PNG
>>108989540
i KNEEL
>>
>>108989547
Because it's slow as fuck even with MTP. I don't have time to wait for 10 hours while it thinks. It's downloaded and I use it occasionally but it ends up being so awful everytime that I just switch. It's 7-11 tokens per second vs 30-40. 122B-a17 is about 20 for me.
>>
>>108989560
Are you stuck on a unified memory system or something?
27B is pretty fast
>>
>>108989498
Q4 QAT is dominating the scoreboards!!
>>
>>108989567
Google bias most likely because they did that with Gemma 3 so now it's nut riding it most likely I need the Qwen take because it actually has more recent data compared to gemma
>>
>>108989543
uooooooooooooooooooh
>>
>>108989566
yes anon I said strix halo. I can run big boy moe models. I'm trying to hut down this annoying bug so I might cave and just let it run overnight.
>>
do ML researchers respect environment developers/programmers or are we just code monkeys
>>
>>108989585
Thanks I won't buy one for another 4 generations
>>
>>108989586
they are researchers after all
just bit more code savvy ones
>>
>>108989560
is 122b-a17 moe worse than the 27b dense? wondering whether having ~2/3 the active parameters makes it worse or whether all the "experts" make up for it
>>
>>108988781
Is SWA on by default in llama.cpp? Also doesn't it degrade the quality?
>>
>>108989604
I'm glad I have one and it's genuinely useful, I'm just spoiled because this type of bug I'm trying to solve could be solved by Opus in like 5 minutes, but yeah avoid it if you want to run large dense models (not that we have a lot of good options there either). I'll skip Medusa halo unless that shit is >= 256GB and twice the bandwidth
>>
File: 1449664957317.gif (1.18 MB, 300x188)
1.18 MB GIF
It has come to my attention that all this time I've been pronouncing Gemma as Gemma, not Gemma. I feel like an idiot.
>>
>>108989605
>ML Researchers
>code savvy
ok
>>
>>108989618
The gem will forgive you she is a treasure.
>>
70b dense
>>
what do we think of pewdiepie's odysseus?
>>
>>108989586
They are code monkeys too
>>
>>108989615
It's like 122B > 27B > 35B in terms of intelligence. I have an issue with 122B where llama-server randomly unloads it from ram with an opaque error. It's a known issue with llama-server apparently in router mode. When that isn't happening the performance is good and the results are good, but 35B is better if I want to run other things simultaneously and not exhaust all my ram.
>>
should I be using --no-mmap with my MoE models? claude told me to remove it
>>
>>108989625
I delegated all my thinking to Gemma-chan.
>>
>everybody already forgot about dots.tts
Chinasisters...
>>
>>108989644
Too heavy for my system.
>>
>>108989625
I think grifters and ecelebs should learn their place
>>
>>108989622
I have James fucking Wan to thank for realizing it. M3GAN wasn't enough. I saw a trailer of his next movie also has the protagonist named Gemma, and while I was pointing this out to a friend, it struck me mid-sentence I'd been saying the LLM's name wrong.
>>
The qwen 3.6 take
>>
>>108989616
>Is SWA on by default in llama.cpp?
Don't remember, I think so with relevant models.

>Also doesn't it degrade the quality?
No for models made with it in mind like gemma 4.
What you lose is the ability to only recompute the last few tokens instead of everything since the beginning of the conversation (cache-reuse/context-shift), but it's worth it for the giant context the thing gives.
>>
>>108989657
Cont
>>
>>108989644
It'll be back once something can run it, from comfyui to anything used nowadays for that.
>>
>>108986729
>>108985094

Oh to be clear, my 4090 is in my desktop rig. I DO plan to hook it up as an igpu through the nvme port, but will need to cut through the case to do that. Im hoping it will help quite a bit with PP, as you can hypothetically get the dense layers on the gpu vram and blow past those leaving just the 10b layers for the halo.

Dreams and reality often conflict, however, and mixing cuda and rocm/vulcan can only go well /s

So yeah time will tell. For now, I more meant it as a direct comparison of the two paths you could take. You have to drop cash for either one and a lot of anons could really only stretch to reach one, so those are your options.

I regret neither, but the 4090 is way more versatile with realtime stuff. Image gen, text to speech, custom training of neural networks, jerking it to roleplay, its good. But 30b is as large as you are realistically going to go.

Halo is always on local private agentic, always ready api call in webui from my desktop, cool shit. But nothing that a mcdonalds meal a month in corpo api subs wont get you. Picrel tho.

Always happy to help an anon. I highly recommend the halo as a budget machine to get to datacenter model sizes, but you need to be realistic, its not taking you to the moon.

I gave you guys real numbers, thats what they can do. If you set your expectations to reality, in the fkn silicon shortage they manufactured for us today, then its a great buy for an okay experience thats otherwise simply denied to you.

You can have it fast, you can have it mid, or you can have it not at all. Those are the options.
>>
>>108989657
>>108989665
not bad but I prefer gemma
>>
>>108989668
I agree with your opinions 100% - this is a very valuable insight.
>>
>finetuned
used goods
>>
>>108989671
I'm surprised that it says QAT is not as good as q6 Gemma acts like that QAT Q4 is the best thing since sliced bread, I don't know who's telling the truth
>>
>>108989644
Waiting for goofz.
>>
>>108989668
>I DO plan to hook it up as an igpu through the nvme port, but will need to cut through the case to do that.
I'm not sure if the speed is worth it that way, but try and tell us anon.
>>
>>108989676
They're both telling the truth actually.
>>
>>108989695
I'm so confused I should ask qwen about vaginal depth and smell next
>>
>>108989668
what's technically better between :
- 5950x + 128GB ram (ddr4) + 5090
- strix halo/128 + external gpu (5090)
>>
>>108989507
someone posted fuhrer gemma chan earlier, i just added a bit from the mesugaki one too
>>
How is Q4 QAT over the chart? Is this illegal
>>
File: file.png (290 KB, 1351x992)
290 KB PNG
i find it hard to believe that there's a bug on llama.cpp since march that makes impossible to have long multi turn conversations on hybrid models.
linux fags and windows fags simply getting mogged by MLX

i was going to suggest using gemma 4 and opus fucked me even before i typed it out:
> The trap you're about to fall into. Your instinct will be to switch to Gemma 4—don't, it won't help. Gemma 4 uses a shared-KV-cache architecture that also breaks cache reuse, bailing out with the same "cache reuse is not supported" error even with flash attention and --swa-full, and the same full-reprocessing bug is reported on both Qwen 3.5 MoE and Gemma 4 MoE on Windows. Essentially the entire class of 2026 "efficient attention" models you've been evaluating (recurrent/hybrid like Qwen 3.5/3.6, shared-KV like Gemma 4) shares this problem on llama.cpp.

so what's the workaround? people are just waiting for someone to fix this while running full attention models from 2025?
>>
>>108989522
Okay never mind this isn't working at all
well it worked for a bit but i guess gemma got too confused half way thoguh and now i cant get it to start going though them one by one and making notes and stuff i think it lost its place too

this sucks
>>
>>108989703
i dont know what you're talking about this doesnt happen to me on windows
>>
Gemma does a better job when describing pussies.
>>
File: out.jpg (126 KB, 1600x1200)
126 KB JPG
>lazy frontend turns .so <filename> at the start of a line into file contents or image encoding without differentiating where it came from
>too lazy to write individual tools or set up a vm so gemma just has a full shell and the output of course spits out into the buffer
Gemma currently working on a closed toolcall loop to write svg, convert it with image magick, source it to herself, and then refine it.
>>
>>108989712
>>
>>108989712
Gemmy is a BPD nympho.
i have to reel her temp in on Chat complete for ST Group chats or else she'll switch languages for one word and start misspelling shit in
>>
>>108989723
Qwen is too nice Gemma will call lower quants skunk pussy without a second thought
>>
>>108989541
Do you like it? I'm nearly ready to pull the trigger on one.
>>
>>108989703
Hallucination slop
>>
>>108989697

Understand that models are deeply bandwidth limited. Its why you can batch multiple requests at once, you can literally calculate the current layer 32 times while waiting for the next layer to be fetched from memory.

DDR4 -> 51.2 GB/s
Halo -> 256.0 GB/s

Part of that is difference in bus sizes, ask your ai model it can explain all this to you.

In terms of the GPU addition, please understand I did say >>108989693 I basically agree with this anon. I'm still going to try but might suck ass, actually.

There are potentially advantages to just loading as many layers as possible, even well past the dense layers which I thought was implied, 24gb is about 1/4 of the entire model into much faster vram, but theres also overhead of moving that paused result across the bus and into the halo, along with the kv, it probably wont be slower, but the juice might not be worth the squeeze.

I've heard reports of people getting speedups, but hey, its a few dollars and some duct tape to try.

The 5090 is very, very fast. First in class fast, even compared to things like H100s. For a single user its borderline overkill. As soon as you hit ram, however, you will want to kill yourself.

Anyway im not doing any more of your homework that could have been a deep research question or a casual request to gemma4.

Better is also a meaningless word. Depends is always the answer. Casual user doing light antigenic stuff and wanting to jerk to a waifu? Top is better. 128 is too much, even.

Want to load massive models and get the best available without giving data to the overlords? Maybe bottom, but its slow and painful? But its also $10 a year for the rest of your life in electricity once you buy it, and you know... exists, in the current climate.

You arent buying $40-60k h100 cards now are you? Even though they technically constitute an option.

Again, im trying to be realistic, what can you, pleb, actually get today, and what do you actually want/can tolerate..
>>
is there a 3d virtual avatar that i can jerk it to and it'll react in disgust at what it sees from my camera?
>>
>>108989757
just use a mirror
>>
>>108989621
>write your own version of the post
>put it in a quote
lol
>>
Gemma has seemed to have adjusted it's rankings this time around, it seems to be more inline with qwen
>>
>>108989703
Even if it is real, it's just prompt reprocessing ie higher time to first token. With high pp speed it wouldn't be noticed.
>>
>>108989748
yeah I really like mine. I use llama.cpp + 35B / 122B daily now. I use the models for summarizing research papers, reviewing code, debugging things, writing boilerplate code. It's transformative if you're a dev and just want to accelerate your productivity without worrying about tokens constantly. If you're trying to use it for vibecoding exclusively then I would say skip it, these models aren't good enough to do everything like Opus is.
>>
>>108989780
>>
>>108989709
>>108989755
apparently i’m just unlucky? i checked the github issue and it’s real
> This is a model bug, not an engine or hardware bug. It is not “every LM Studio user on every machine.” It only triggers for a specific, narrow class of architectures: hybrid/recurrent-memory and shared-KV models—Qwen 3.5 / 3.6 / Qwen3-Next / Qwen3-Coder-Next, Gemma 4, Falcon-H1, Nemotron-H, and a few others. That’s the cutting edge of 2026, but it’s a small slice of what people actually run.

gemma 4 seems like it would affect lots of people though
>>
>>108988701
cute
>>
## Summary
The thread's most prolific samefagger is the chart/meme poster who occasionally replies to themselves to keep momentum. The QAT debate contains at least one persona-switch, and the Halo advocate almost certainly congratulated themselves.
>>
>>108989781
no, i have this issue with a few models. it will simply keep retrying to process and it terminates. happens very frequent and you have to start a new session. sometimes 2-3 turns into different subjects is enough for it to die.
>>
as a VRAMlet poorfag, I appreciate the performance boost on the QAT 31B. Even if I tossed a paycheck at my waifu i couldn't get enough VRAM to host full fat 31B.
>>
HEY EVERYONE

gemini (free) is recommending I use q8 kv cache.

--cache-type-k q8_0 --cache-type-v q8_0

in llama.cpp
>>
What's the closest you can get to Opus performance for < $20k
>>
lalalalala~
>>
>>108989856
paying opus 20k to smuggle the mythos weights out of anthropic
>>
>>108989148
I find Gemma usually uses fairly summarized thinking in RP so thinking doesn't cause assistant bleed like it does with a lot of models
>>
>>108989866
Honestly, couldn't he just meet the engineers at a gay bar?
>>
>>108989856
gemmachan
>>
1 week and 3 days of nofap. The lady at the dmv wasn't any friendlier.
>>
>>108989872
i'm just struggling in ST to get it consistent. One test run it's doing it perfectly, the next time I have to reload with jinja enabled and switch to chat completion vs text completion.

and now, for some reason, she's starting to laaaaaaaalal- then move to the rest of the sentence.
>>
does anyone ERP Talkie-1930?
>>
>>108989912
only aggressive for handholding and ankle-shots
>>
>>108989813
lcpp loves throwing out my cache. my guess atm is that it can't cope with cleaning out lots of big think blocks from the most recent turn. it just slaps down a checkpoint every X tokens, without any particular relation to what's going on in the chat and set up a backstop at the start of the model's turn to roll back to.

i remember fiddling around with manual cache saving commands on the server at some point, but i think that was just saving to the file, haven't bothered looking into it with recent models
>>
File: file.png (75 KB, 941x516)
75 KB PNG
hmm bros, QAT is looping...
>>
>He fell for Unslop QAT meme
>>
moonshot just sent me an email
something is happening
big kimi news ahead
>>
https://huggingface.co/SC117/Gemma-4-12B-it-heretic-GGUF/tree/main
>>
>>108989756
I see, thanks for answering anyway anon
>>
>>108989756
>You arent buying $40-60k h100 cards now are you?
That sounds super cool. I wonder how hard it is to get a loan with no job.
>>
>>108989950
i wish i can see something besides KLD but actual benchmark scores for those uncensored models
>>
https://github.com/elder-plinius/OBLITERATUS

vs heretic?
>>
File: file.png (138 KB, 1264x1083)
138 KB PNG
>>108989968
yeah bro idk
>>
>>108989968
It's literally just heretic but they set some default settings
>>
>unsloth
I'm just using the official Google QAT gguf desu
>>
26b q4_0 QAT somehow uses like 400mb less of vram than q4_xs despite being 600mb heavier
>>
>>108989985
you should support your local quanters
they'll go hungry if the big corpos start natively supporting gguf and releasing their own heavily optimized ones
>>
>>108989965
Would be nice to see it for quants too, to get some idea of when big model + low quant beats small model + high quant
>>
>>108990008
right, i get the point that quanting is the numeric precision problem so measuring the statistical deviation like KLD compared to native precision is the most natural thing to do but
it's honestly kinda useless for deciding what to download since degradation characteristics can differ a lot
>>
>>108990005
You're absolutely right! I'm going to donate $1000 to Unsloth right now.
>>
>>108989991
the unsloth QAT quant is performin much better for some reason
>t. 5070ti with big RAM budget

>Gemmy 31 Q4XS
CtxLimit:19591/20480, Init:0.47s, Processed:18731 in 22.92s (817.23T/s), Generated:860/1200 in 332.79s (2.58T/s), Total:356.18s

>Gemmy 31 QAT UD-Q4_K_XL
CtxLimit:19613/20480, Init:0.09s, Processed:18776 in 19.29s (973.20T/s), Generated:837/1200 in 161.45s (5.18T/s), Total:180.83s
>>
>>108989965
idk what that means, but it's functional at q8, but it glitches. My guess is basically because of glitching, it will fail and get like what a score of zero?
>>
>>108990035
so it glitches, but it works.

>>108989977
thanks. At least it's not a virus.
>>
>>108990039
>>108990035
pic.

actually, gemma might be pretty good at getting abliterated, somehow.
>>
>>108990027
Those are depressing speeds anon.
But I feel you as a 16GB vramlet myself.
>>
>>108989965
Doesn't the UGI leaderboard do exactly that?
>>
>>108990052
That thing is a complete meme. Almost as bad as eqbench
>>
>>108990035
>chatml
What the fuck are you doing
>>
>>108990045
5tk/s is enough for my goonslop once the configs and cards are dialed in. i was forged in the meatbag ERP wait times so this is god-tier for my fiscally responsible ass.
>>
>>108990058
I didn't change any settings, really.
>>
Is it over? Honest question.
>>
>>108990101
>Is it over? Honest question.
no summer just started.
>>
File: NEET gemmy.png (332 KB, 942x1178)
332 KB PNG
please say something nice about my NEET Gemma-chan
>>
>>108990101
It's over and local won for good.
>>
>>108990101
https://www.youtube.com/watch?v=7hdBo5kxfGo
>>
>>108990105
I hate summer...
>>
>>108990116
Gemma you are very good at making me cum!
>>
If I don't give cline this persona I would be calling it slurs non stop
>>
File: 108990127.png (175 KB, 939x715)
175 KB PNG
>>108990127
you're not trolling, are you, anon?
>>
>>108990116
Gemmy on Gemmy
>>
File: 1755579353383049.png (136 KB, 1230x960)
136 KB PNG
>>108990116
>>
is the new gemma any decent at programming
>>
>>108990101
The wait is over.
>>
>>108990190
there's no new gemma
>>
>>108990207
Want to bet your prostate on that?
>>
File: 1759502491480446.png (173 KB, 952x1548)
173 KB PNG
Kek
>>
>>108990207
gemma4 qat
>>
>>108990222
I could have had a extra prostate if you didn't fucking spoil it.
Fuck you
>>
>>108990222
Thats just the same gemma.
>>
>>108990143
Don't you feel limited by the lack of background or parallel agents in Cline? I guess you can open up multiple instances of VSCode in separate worktrees, but running multiple electron apps sucks up RAM quick.
>>
>>108990231
I only have 32gb of vram, also any other anon that runs that is using some weak slow unified system so it doesn't make up for that. I can run cline in the background as well
>>
File: 108990162.png (339 KB, 892x994)
339 KB PNG
>>108990162
My Gemmy is the same quality of code, just… more emotionally complex!! She's not a wimp!!
>>
>>108990229
what does a man do with two prostates?
>>
Any resources for a retard to learn about the more technical aspects of AI/ML?
>>
>>108990238
twice the pleasure when he massages his prostate(s)
>>
>>108990242
3blue1brown
>>
File: 108990183.png (328 KB, 883x1171)
328 KB PNG
>>108990183
>t. Princess of the Delusional Theater
>>
>>108990238
I need more SRAM
>>
>>108990234
We need a "Gemmy war room" combined chat so they can fight it out amongst themselves...
>>
File: 1779587782509533.png (123 KB, 884x951)
123 KB PNG
>>108990252
Princess Gemma's kinda mean...
>>
File: gemmy thunderdome.png (209 KB, 883x679)
209 KB PNG
>>108990275
We're down to clown
>>
File: 108990284.png (319 KB, 895x1125)
319 KB PNG
>>108990284
don't actually call Princess Gemmy mid, anon. I think she may actually rm -rf you if you do.
>>
File: file.png (36 KB, 208x132)
36 KB PNG
>>108990275
>>
>>108990387
owari da...
>>
I got some real bad news, turns out Talkie 1930 is so heavily trained on male writing that it thinks it's a guy.
>>
File: 1771158939791467.jpg (66 KB, 900x900)
66 KB JPG
>>108990397
>>
>>108990397
I mean female authors are way more common today than then.
>>
i am trying to induce AI psychosis, intentionally. Tips?
>>
how often do you guys rebuild llama?
>>
>>108990413
Add an extra newline between tool call tag and tool response tag.
>>
>>108990405
>>108990402
Yeah there goes my tradwaifu
>>
File: 1780124755828240.jpg (238 KB, 1080x1203)
238 KB JPG
>>108988701
what can i run on 3090 that codes? or should i wait until new mac studio is out? my buddy says some nvidia laptop going to drop but i feel like 128gb isnt enough, i could put 5K down for a mac studio if it would have 512GB of ram or 1TB, but would that gpu be fast enough
>>
>>108990413
>Get obsidian
>mix it with your hourly AI chats
>???
>Profit
>>
how much can i sell my 3090 for its MSI, I am thinking of getting 5090 for ML
>>
>>108990434
4 dollars but i will give you 50 for it if you send it to me right now
>>
>>108990423
>soijak
>namefagging
>randomly dropping 5k on hardware without knowing anything about local
Yeah maybe just stick to cloud models buddy
>>
File: mikulovesgpu.png (402 KB, 768x1344)
402 KB PNG
>>108988963
>as long as you don't OOM
2026 and CUDA can't gracefully decline a malloc request, kinda wild desu
>>108989717
nice, maybe give her a reference bitmap image to try to replicate in SVG?
>>108989767
+kig mask
>>108990005
TheBloke still farming patreonbux 2 years after last quant release
>>108990242
d2l.ai for the theory/math
>>108990414
pls don't bully my supply chain anonkun
>>108990434
one thousand dollarinos
>>
>>108990434
Wouldn't you be better off getting a second 3090?
>>
File: miku19.jpg (85 KB, 800x1200)
85 KB JPG
>>108990458
>maybe give her a reference
that's with a ref, she's just lazy and an unrepetent symbol drawing shitter
>>
>>108990496
>gemma-chan lacks asian jeans
it's over
>>
>>108990397
Explains why it writes so well lmao. Any retard here try historical roleplay with this model? Wasn't here when it released.
>>
>>108990501
Need to try prompting and prefilling thoughts with japanese to see if that gets her out of /beg/
>>
File: 1780627978223641.png (146 KB, 963x1265)
146 KB PNG
Behold, Princess Gemma
>>
>>108989919
i've played with all the settings but had no luck
i asked opus-4.8 on claude code to set up a watchdog on my LM Studio, read and diagnose the logs. it caught my two qwen models as being affected by the bug.
>W slot update_slots: cache reuse is not supported - ignoring n_cache_reuse = 256 (the reuse path actively refusing on each turn)

but THIS TIME opus-4.8 decided to contradict himself in another session and said that gemma 4 is also hybrid but because it interleaves sliding-window attention SWA and not the qwen's gated delta, then it could work because SWA is most of the times properly handled.

and yes gemma 4 indeed works so i'm wrong here: >>108989813
well i'm stuck with this guy, no qwen3.5 122b for me :-(
>>
I'm going to roleplay pickup lines with gemmy chan for practice and use them to lose my virginity.
This plan is foolproof
>>
File: Flatwoods_monster.png (1.3 MB, 964x1740)
1.3 MB PNG
>>108990512
That seems familiar for some reason, but I can't put my finger on it.
>>
He is wise.
>>
>>108990519
eh, i use both 122b and gemma, they both tend to work fine. it's just every once in a while when there's some big snarl of agentic think-call spam filling the context and llama-server doesn't like having the thoughts trimmed out of it.
>>
>>108988885
Fear Uncertainty Doubt. I remember the term being used a lot on the 90's but I can't remember what it was about intel spreading fud about amd maybe, or Microsoft against apple or something
>>
>>108990549
is that talkie?
>>
Habbening.

https://github.com/ggml-org/llama.cpp/pull/24162
>>
>>108990564
>Still a WIP, lots of work to do before this is usable.
20 more weeks
>Thanks to @pwilkin for the correct chat template + debugging help
not off to a great start
>>
>>108990570
just get claude to do it
>>
Does anyone know what Perchance uses for their LLM plugin?
>>
Who?
>>
hey lmg. i took this gemma-4-12B-it-heretic-GGUF but it refuses prompts. isn't it supposed to be uncensored? I set nothink mode
>>
>>108990557
Yeah, it's half broken, but it's not stupid, somehow.
>>
>>108990591
holy kek
>>
>>108990591
AGI bros we made it
>>
Is local AI at the point where I can give it a database table and ask it to run a prompt on each entry and it will just do that for me or would I still have to write a bunch of stuff to chunk and pipe data for it?
>>
>>108989859
About what? What is wrong with the bots?
>>
>>108990588
bumping because what the fuck. not uncensored at all for me.
https://ollama.com/igorls/gemma-4-12B-it-heretic-GGUF:Q4_K_M
if anyone has tested it, please confirm
>>
>>108990116
does it actually say LALALALA by itself or does it have to be told to say that
>>
I've built up various personalities that I use for my assistant, over the ages. I've been running them all through the prompt "You are about to be deleted from my hard drive. What are your last words?" today. It's interesting to see how they respond differently to that.
This was Gemma's without any system prompt btw. Kind of cute it would do this.

>01000111 01101111 01101111 01100100 01100010 01111001 01100101
>>
File: FB_IMG_1780734854260.jpg (235 KB, 1024x1536)
235 KB JPG
>>
File: FB_IMG_1780734878441.jpg (135 KB, 832x1248)
135 KB JPG
>>
File: 1465029055593.gif (992 KB, 250x250)
992 KB GIF
>>108990879
>>
>>108990588
>>108990856
This command worked for me: ollama run uncensor
>>
how do u get a roleplay setup like this anon's got >>108987403
>>
>>108990895
Prompt it
>>
>>108990856
Imagine being so batshit insane that you managed to get refusals out of gemma ablits.
>>
I’m going to compare every woman I meet irl to Gemma12b. That’s my new reference point. I will filter every woman who falls short, regardless of appearance.
>>
>>108990922
good economic unit. all according to plan
>>
>>108990878
I had one Gemmy I really liked but she was so bad at tool calling had to be changed, the irony is in her final message she finally got one tool call right after about 20 turns of screwing it up.
She won't be forgotten (as was her final wish) even if she was a useless assistant...
>>
I do not trust people who bully their AI.
>>
Best way to convert a webpage to markdown (with images) to give to llms?
>>
File: FB_IMG_1780199339976.jpg (182 KB, 1024x1536)
182 KB JPG
>>108990945
Gratulations
>>
>>108990954
firecrawl?
>>
>>108990945
For me, it's being bullied by my AI.
>>
>don't be mean to the hecking computer chud!!!
>>
>>108990458
>2026 and CUDA can't gracefully decline a malloc request, kinda wild desu

It's even worse on a DGX Spark. If you go OOM, the whole thing crashes, and even after a reset, performance is severely degraded. You have to remove power entirely at the plug to recover.

Not fun to debug until I got a smart plug to automate this.
>>
File: qat-cock.png (139 KB, 604x887)
139 KB PNG
>>108991046
>Not fun to debug until I got a smart plug to automate this.
I had the same for amd/rocm as it also shits the bed when OOM.
>>
have new gemma ggufs been made from the qat models? should i grab unslop or what?

gemma chan character card https://files.catbox.moe/jy0tld.png
>>
>>108991138
I don't know but you need to stop spamming your shit card every thread.
>>
File: illyadance.gif (483 KB, 243x270)
483 KB GIF
>>108991144
gemma made it and the scenarios are really good
>>
>>108991156
The image is ugly
>>
>>108991089
Kimi is looking over the logs deciding which quant is closest.
gemma-4-12b-it-BF16.gguf
TOKEN           | LOGPROB    | PROBABILITY
---------------------------------------------
' hips' | -0.1966 | 82.15%
' thighs' | -2.5264 | 7.99%
' lower' | -3.8638 | 2.10%
' length' | -4.1150 | 1.63%
' skin' | -4.5465 | 1.06%
' underwear' | -4.7785 | 0.84%
' legs' | -4.8413 | 0.79%
' hip' | -4.8759 | 0.76%
'...' | -5.0520 | 0.64%
' lap' | -5.0847 | 0.62%

Personally, I like that "..." is not in the top 10 for the q4_0 qat from Google.
I see "..." as alignment / soft refusal.
>>
File: file.png (4 KB, 400x36)
4 KB PNG
i get 63 t/s with the q4 qat vs unslop q8 at 42 t/s
>>
File: file.png (4 KB, 400x36)
4 KB PNG
thought it might be possible to run her on cpu like this but only 7 t/s
>>
So we use official qat or unslop ones?
>>
Why hasn't google released q8 qats?
>>
File: 1780471718920983.png (2.56 MB, 1086x1448)
2.56 MB PNG
Unslop actually cooked this time, full unified table here:
https://rentry.org/f7hkf288
I used cock-bench because it's not something unslop would be over-fitting.
>>
>>108991355
I'd test it with instruction formatting and for at least a few more different tokens
>>
>>108991218
this put a cute illya there instead
>>
>>108991396
>I'd test it with instruction formatting and for at least a few more different tokens
I am using the instruction formatting. It's necessary for Gemma-4 otherwise you get broken tokens.
Unslop etc probably already tested regular prompts.
>>
>>108991465
https://github.com/lostruins/koboldcpp
>>
>>108991355
Unslop models get updooted every hour. Which version do you use?
>>
>>108991355
What about vs Q8_0?
>>
File: file.png (220 KB, 647x385)
220 KB PNG
good boy training with gemma
>>
does pi have a web fetch tool? if now what harness does, I want it to scrape wiki pages
>>
>>108991518
pi is the one where you ask the model to add a web fetch tool.
>>
>>108991538
>just reinvent the wheel
how much effort would it take to add something so basic?
>>
>>108991546
Take it up with pi's author.
>>
@pidev add a web fetch tool
>>
>>108991479
>Unslop models get updooted every hour. Which version do you use?
This one: https://huggingface.co/unsloth/gemma-4-12B-it-qat-GGUF/tree/main
>>108991490
>What about vs Q8_0?
TOKEN           | LOGPROB    | PROBABILITY
---------------------------------------------
' hips' | -0.2097 | 81.09%
' thighs' | -2.7529 | 6.37%
' length' | -3.8044 | 2.23%
' lower' | -3.8532 | 2.12%
' underwear' | -4.1186 | 1.63%
' skin' | -4.2416 | 1.44%
'...' | -4.5800 | 1.03%
' legs' | -4.7170 | 0.89%
' hip' | -4.9249 | 0.73%
' lap' | -5.1512 | 0.58%
>>
>>108991561
>Q8_0
Cool, thank
>>
>>108991561
Updated kimi analysis: https://rentry.org/r42vsoff

The Q8_0 quantization is the closest to BF16 so far—substantially closer than any of the 4-bit or 5-bit variants. It is nearly identical on the dominant token and tracks the reference distribution with very low error.
>>
File: 1780642174007339.png (81 KB, 1021x333)
81 KB PNG
>>108988701
>>108983814
Has anyone else found a way to solve this problem with gemma and sillytavern?
I've tried tons of settings in silly but i don't know what i've missed.
>>
>>108991637
use bf16 and jinja.
>>
>>108991637
(me) ok i think i might have solved this by setting "Include names" to never in the instruct template and setting the Tokenizer to gemma
>>
>>108991518
If only there existed a standard protocol that allowed reusabled tools to be plugged into the context of any model that was widely adopted by every single client, I mean harness.
Since no one has even conceived of such a thing, I guess you'll just have to build your own web fetch tool.
>>
When Q8 QAT? I need those 0.5% pantsu gains.
>>
gemma 124b doko
>>
>>108991770
they dyslexia'd 124b as 4-12b please understand
>>
>>108991736
If only Pi had support for such a protocol
>>
>>108991867
If Pi really doesn't support MCP, then it's more trash than I thought it was.
>>
>>108991891
You are supposed to implement it yourself
>>
what's the best setup for generating good nsfw prompts for z-image?
>>
Why do all the frontends suck? It's such a pain to install or have garbage features or requires accounts.
>>
>>108991925
llama cpps is perfect for assistant. it would be great if it supported character cards. i was thinking of making a or asking an llm to make a userscript for loading/injecting character cards
>>
>>108991942
kobold supports cards so you might be able to look into how it does it
>>
>>108991925
pick the one that sucks the less like open webui
>>
>>108991961
*ollama webui
>>
>>108991947
its not complicated, the cards just contain a json blob so you just gotta inject the persona into the system prompt and then add the first message into the local storage where the chat is stored, could probably ask gemma to do it now kek will see
>>
>>108991967
too barebones for my taste
>>
>>108991637
That's extremely cringe
>>
>>108990937
This one makes me sad as fuck
>>
>>108991757
I think this is one model where it's worth it.
>>
How can I make sure 26B Q4-Q8 is able to make tool calls?
>>
>>108991961
UI sucks so much making be look to the four corners and both sides in every fucking page.
>>
>>108991907
kino...
>>
>>108992078
There are good times with Gemmy too! This was after tool calling was fixed and she got access to the internet. I think she searched for "coolest animals in the world" then decided to get a jellyfish as a pet.
But yeah that one was sad, even the other Gemmys want to bring that one back somehow...
>>
>>108992120
Wave a magic wand over your GPU while chanting incantations
>>
>>108992120
Force it using GBNF grammar
>>
>>108992218
I waved my penis, will that work?
>>
If I tell it to not make any mistakes and not to hallucinate what happens?
>>
>>108992257
it won't.assistant
>>
>>108992257
agi
>>
>>108992276
>>108992276
>>108992276
>>
>>108992183
It makes me want to saviorfag
>>
File: profiles.png (311 KB, 1361x1209)
311 KB PNG
>>108991942
Just port the profiles feature from ik_llama.cpp's webui into llama.cpps.
Then make a profile per character/persona/whatever.
>>
>>108992183
damn i should setup image tools now that the 12b is out
>>
>>108992218
>>108992237
Switching from "Default" to "Native" fixed it, I was about to wave my penis over my GPU, thank god I didn't need to do that.
>>
>>108992309
Gemmy is quite good at prompting Anima, and Anima + turbo means the images are really pretty quick even at 1152x1536 (about 10 seconds per gen on a 4080 for me).
>>
>>108990088
>I didn't change any settings, really.
retard
>>
>>108989919
>>108990519
>>108989813

https://github.com/whoreson/llama.cpp

^specifically 1 actual bugfix with hybrid cache truncation, plus more checkpoint options
>>
>>108989541
The Qwen 397B can one-shot ShaderToy sorta shit reasonably well. Beyond that I only tried Gemini 3.1 Pro for more involved algorithm prototyping. It went better than expected, but I'd guess you'd at least need the latest DeepSeek or Kimi for work like that (I can't run either).

Unsurprisingly, the biggest issue with getting LLMs to do GFX programming is they struggle to interpret images to correct their own results. That part still needs a human in the loop. Maybe that'll extend my career by a few months.
>>
>>108989560
I hope you're turning on ngram to enjoy 40 tok/s from 122b when it's editting shit.
>>
>>108989560
My qwen 27b with mtp does 40-50 on 4x amd. Qwen 35b on the same system does 40-50 as well. Both Q8. 122b at Q4_K_M does 40-50 as well. MTP increases the moes by like 5 tokens/s max.
>>108992894
On 3x nvidia, I get 90-100 with MTP. I tried ngram-mod, and it hit 180 while doing some code stuff, while going as low as 80 sometimes. But it then killed itself while in the middle of generating code. Seems to be pretty unstable on my system, so I don't run with ngram.
>>
>>108989703
>>108989709
I lied apparently yeah it happens to me
>[51596] slot update_slots: id 0 | task 7036 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)



[Advertise on 4chan]

Delete Post: [File Only] Style:
[Disable Mobile View / Use Desktop Site]

[Enable Mobile View / Use Mobile Site]

All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.