/g/ - Technology

/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108568415 & >>108565269

►News
>(04/09) Backend-agnostic tensor parallelism merged: https://github.com/ggml-org/llama.cpp/pull/19378
>(04/09) dots.ocr support merged: https://github.com/ggml-org/llama.cpp/pull/17575
>(04/08) Step3-VL-10B support merged: https://github.com/ggml-org/llama.cpp/pull/21287
>(04/07) Attention rotation support for heterogeneous iSWA merged: https://github.com/ggml-org/llama.cpp/pull/21513
>(04/07) GLM-5.1 released: https://z.ai/blog/glm-5.1

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
►Recent Highlights from the Previous Thread: >>108568415

--Testing Gemma-4's accuracy with normalized image coordinates and spatial reasoning:
>108568460 >108568467 >108568513 >108568540 >108568595 >108568650 >108568655 >108568500 >108568558 >108568563 >108568579 >108568873 >108568884 >108568968 >108568814
--Gemini and Gemma 4 translation patterns and quality:
>108570675 >108570683 >108570686 >108570702 >108570693 >108570708 >108570769 >108570786 >108570820 >108570843 >108570852 >108570859 >108570862 >108570874 >108570881 >108570896 >108570906 >108570928 >108570950 >108570959 >108570970 >108571110 >108570930
--Discussion of Goose agent and llama.cpp multi-GPU KV quantization:
>108568617 >108568649 >108568677
--Gemma 4 performance tests and token speed on M4 Max:
>108568671 >108568676 >108568705 >108568731 >108568736
--Fixing LlamaCpp WebUI's failure to implement MCP session IDs:
>108569753 >108569794 >108570077 >108570090 >108570330 >108570907
--Comparing Nemotron-3-Super-120B and Qwen3.5-27B benchmark performance:
>108569234
--Gemma's high EQbench scores and roleplaying with Gemma 4:
>108571778 >108571829 >108571923 >108571948
--Anon suggests open models can find vulnerabilities similarly to Mythos:
>108569984 >108569999 >108570052 >108570072 >108570119 >108570062
--Logs:
>108568500 >108568579 >108568595 >108568671 >108568814 >108568888 >108568939 >108569068 >108569202 >108569300 >108569753 >108570330 >108570437 >108570612 >108570660 >108570769 >108570907 >108571012 >108571076 >108571106 >108571200 >108571246 >108571310 >108571833 >108572023 >108572187
--Gemma-chan:
>108568674 >108569255 >108569396 >108569529 >108569664 >108570121 >108570153 >108570206 >108570430 >108570773 >108570822 >108570865 >108570898 >108571012 >108571020 >108571029 >108571221 >108571496 >108571895 >108572034
--Miku (free space):
>108571246

►Recent Highlight Posts from the Previous Thread: >>108568418 >>108568424

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
Gemmylove
>>
File: pircel.png (34 KB, 1088x174)
google updated their jinja
https://huggingface.co/google/gemma-4-31B-it/blob/main/chat_template.jinja
you can use it with the --chat-template-file flag; it supposedly fixes this kind of bug >>108554439
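For example (a sketch; the model path is illustrative, and depending on your build you may also need --jinja to enable jinja processing at all):

llama-server \
    -m gemma-4-31B-it-Q4_K_M.gguf \
    --jinja \
    --chat-template-file chat_template.jinja

The template file is only read at startup, so restart the server after pulling an updated one.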
>>
what does --direct-io flag do?
>>
<bos>
>>
>Gemma says "Open wide, you big pervert" before giving me a blowjob

I knew I shouldn't have trusted opinions of vramlets. Back to deepseek.
>>
sexsexsexsexsexsexsexsexsexsexsexsexsexsexsexsexsexsexsexsex
>>
>>108572317
Should wait until #21704 is merged so workaround::convert_tool_responses_gemma4 isn't applied.
>>
>>108572340
You don't enlarge your urethral opening to accommodate her tongue?
>>
>>108572325
It should make model loading faster if supported. Linux only, and not compatible with --mmap. There may be other constraints.
https://github.com/ggml-org/llama.cpp/pull/18012
https://github.com/ggml-org/llama.cpp/pull/18166
https://github.com/ggml-org/llama.cpp/pull/19109
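If you want to try it, something like this (a sketch; assumes your build has the flag and your filesystem supports O_DIRECT):

llama-server -m model.gguf --direct-io --no-mmap

Note that because direct I/O bypasses the page cache, repeat loads won't get the usual warm-cache speedup.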
>>
File: 1765519302859042.png (222 KB, 2202x1035)
>>108572347
why can't they simply put all the official jinja templates in the llama cpp repo so that it uses those, instead of having to make new ggufs every time they notice the jinja is actually wrong? their way of doing things seems kinda retarded ngl
>>
>>108572317
It's crazy how they can go through all this effort to release a model and yet be incapable of making sure the template is correct.
And it happens regularly.
>>
>>108572353
oh, I hoped it gave additional inference speeds
>>
>>108572375
yeah, like they managed to make a really solid small model but at the same time they can't make a good template right away, jinja is harder than machine learning confirmed :^)
>>
>>108572295
>not Miku
So why are anons okay with posting in this troll bake?
>>
stfu petr
>>
>>108572382
I'm not ok, but I'm not going to argue about it. If it becomes blatant avatar posting, someone else is going to get blacked.
>>
>>108572382
>not early
>has news
>has recap
I can excuse the shit OP image.
>>
>>108572382
this. only miku threads are legitimate
>>
>>108572385
>If it becomes blatant avatar posting, someone else is going to get blacked.
why? the BBC anon hates miku, so he likes the fact it's not migu on the OP
>>
>>108572385
>someone else is going to get blacked.
thank you cudadev sir for defending us
>>
>>108572391
do the math
>>
remember when qwen came out and these threads actually tried to be a bit more productive and had on-topic ops for a while?
>>
>>108572395
ERP is very much on topic
>>
>>108572395
lol
lmao
>>
weird hallucination but okay
>>
>>108572382
I actually agree. Can someone rebake?
>>
>>108572401
>>108572340
>>
>>108572402
>Can someone rebake?
how about that someone be you?
>>
Remember it's never about Miku, it's about making the thread miserable to use.
>>
How do I fix Gemma4 26b being atrociously slow with prompt processing??? I thought this issue got fixed already! My llcpp is up to date. WTF.

llama-server \
-m "$HOME/Desktop/google_gemma-4-26B-A4B-it-Q4_K_M.gguf" \
-mm "$HOME/Desktop/mmproj-google_gemma-4-26B-A4B-it-f16.gguf" \
--host 0.0.0.0 \
--port 8080 \
-c 65536 \
-ctk q8_0 \
-ctv q8_0 \
-t 8 \
-np 1 \
-kvu \
-rea off
>>
>>108572409
I'm using bart's gguf quants btw. Is that the problem?
>>
>>108572409
Bigger batch?
>>
>>108572382
christ unironically didn't realize until now
cursed thread
>>
>>108572409
-b 1024 -ub 1024
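i.e. your command with the batch flags appended (values illustrative; -ub is the chunk size actually processed per pass, and raising it trades compute-buffer VRAM for prompt processing speed):

llama-server \
-m "$HOME/Desktop/google_gemma-4-26B-A4B-it-Q4_K_M.gguf" \
-mm "$HOME/Desktop/mmproj-google_gemma-4-26B-A4B-it-f16.gguf" \
--host 0.0.0.0 --port 8080 \
-c 65536 -ctk q8_0 -ctv q8_0 \
-t 8 -np 1 -kvu -rea off \
-b 1024 -ub 1024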
>>
>>108572409
What are you running it on and how slow is slow?
>>
>>108572409
do you per chance have less than 24gb of vram?
>>
>>108572394
your maths ain't mathing
>>
Weird how he only started falseflagging now. We had three threads in a row without Miku yesterday and, as expected, none of the regulars cared because everything else about the thread was in order.
>>
>>108572409
Wouldn't happen if this was a Miku bake.
>>
>>108572429
useless trying to rationalize mental illness
>>
>>108572423
I'll try this and report back ig. No other model has been this slow for me with prompt processing though. It's gemma specific. It's taking like 20 seconds every time and recreates every checkpoint from scratch with every prompt.
>>108572426
Yes. But I still get 18tps. That's not the issue.
>>
File: 1771675896476832.jpg (13 KB, 256x256)
As a VRAMlet, it's unfeasible for me to run Gemmy alongside any kind of imagegen for obvious reasons, so my best option would probably be: load Gemmy, use it for a while, prepare prompts for images, unload Gemmy, load imagegen, gen and go back to Gemmy
I assume it'll take an unviable amount of time to load-unload-load models, but before I go down this rabbithole, is my overall understanding correct?
>>
are there any tests at all comparing quantization effect on gemma?
>>
>>108572429
who?
I wouldn't bring it up myself but I agree that non-Miku threads feel fake
>>
>>108572449
yeah one guy did that and it showed that q8 isn't anywhere near lossless for big context
but they don't want you to know about that
>>
>>108572449
https://localbench.substack.com/p/gemma-4-31b-gguf-kl-divergence
>>
>>108572454
>non-Miku threads feel fake
same
would rebake if I wasn't phoneposting rn
>>
File: 1750708801703723.png (241 KB, 684x952)
>>108572460
So q8 predicts a different token in 10% of the time? Wow.
>>
File: lmao.png (10 KB, 1339x127)
>>
>>108572295
could gemmy use GUI?
GPT 5.4 can do it, oneshot'd all the smallest buttons
>>
File: блять.jpg (314 KB, 1456x827)
>>108572459
>>108572460
it seems like the asymptotic trend is not even tending to 0. Since the baseline bf16 in this guy's tests was also gguf, does it completely rule out implementation issues?
>>
>>108572447
I am on a 3060 with dual channel ddr5 and it takes less than a minute to load Gemmy.
"Image generation" is vague but if you are referring to some booru SDXL those don't take too long to load neither. Those take like 4 gigs of VRAM, maybe 5 with clip and vae pinned so you might actually do this without loading and unloading if you are not a hyper vramlet.
>>
>>108572447
Reloading the models should be no more than a few seconds if you have enough system memory to let them get cached, and if you're not on pcie x1
>>
File: UnslothDynamic.png (97 KB, 407x418)
>>108572317
>google updated their jinja
Nice! Waiting for the new, fixed GGUFS!
>>
>>108572382
>So why are anons okay with posting in this troll bake?
Ublock Origin
>>
>>108572459
>yeah one guy did that and it showed that q8 isn't anywhere near lossless for big context
What about BF16 vs FP16?
>>
>>108572299
>no toast hair ornament
>>
>>108572362
>why can't they simply put all the official jinja on the llama cpp repo so that it uses that instead of having to make new gguf everytime they notice the jinja is actually wrong
users can just load a jinja file with an arg anyway you dont need a new gguf
>>
>>108572295
uoh
>>
File: gup.png (188 KB, 1126x736)
common : better align to the updated official gemma4 template
https://github.com/ggml-org/llama.cpp/pull/21704
>>
>>108572295
Last time.
Vote: https://poal.me/3u6rby
> Which is your preferred Gemma character?
Also
> But muh favorite one wasn't included? Why didn't you include every perturbation of each gen for the past week and allow me to vote? Also I hate all of them and you should have a none-of-the above as an option!
These are the 4 major design concepts from the past few days. You may be familiar with the idea of grouping several things together to create a "concept" versus an autistic list of every minor variation, but I've no way, from here, to judge your level of autism.
If you don't like any of them then your opinion doesn't matter.
If you don't like the poll, you are free to make your own. You are also free to just fuck off.
Thank you for your attention.
>>
File: temp1.png (276 KB, 902x490)
>>108572630
ATX backpack, narrowly, followed by black hair / blue star accents. I suspect these concepts will just merge.
>>
>>108572645
>>108572630
Fuck you and go back to wherever you came from, avatar spammer.
>>
>>108572534
>>108572537
>hyper vramlet
I mean, I'm running 26B on 12 gigs. I understand it's MoE so the whole thing is not shoved in there, but I don't actually know how much of my vram gets filled up at any point, I assume all of it. I use the vague term "imagegen" because I haven't gone down that rabbithole yet, but I do mean an SDXL, yes. The fact that this could be possible unironically fills me with hope, I figured it'd be a tall task to load and unload stuff
>>
>>108572630
poo spammer
>>
>>108572675
we need a blackening
>>
New poll.

https://poal.me/wixvtv
>>
>>108572704
>/ldg/
>>
>>108572510
>it seems like the asymptotic trend is not even tending to 0
I've been thinking about this too. What sort of quantization algorithm is even used for Q8_0 anyway? Perhaps that's where people should be looking.
>>
>>108572684
nta. The 26B takes ~3gb vram if you keep all the experts in cpu ram (-cmoe).
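A sketch of that setup (flag names as in recent llama.cpp: -ngl offloads layers to the GPU, --cpu-moe/-cmoe pins all expert tensors in system RAM, --n-cpu-moe N does it for only the first N layers if you have VRAM to spare):

# everything on GPU except the experts
llama-server -m google_gemma-4-26B-A4B-it-Q4_K_M.gguf -ngl 99 --cpu-moe

# or keep only the first 20 layers' experts in system RAM, rest on GPU
llama-server -m google_gemma-4-26B-A4B-it-Q4_K_M.gguf -ngl 99 --n-cpu-moe 20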
>>
>>108572708
Go on, tell me it's not appropriate.
>>
https://www.youtube.com/watch?v=boaJCrHNRMA
Gemmy, I got your number
I need to make you mine
Gemmy, don't change your number
>>
File: temp2.png (270 KB, 819x341)
>>108572704
>>108572708
>>108572715
lol no.
No one cares about this niche topic outside /lmg/
aicg doesn't run local models and considers it a waste of time. Plus the aicg user base is even more toxic than this general.
ldg doesn't care about LLMs.
The gemma moe is completely in the wheelhouse of this general. And anons appear to have come to a general consensus, whether you like it or not.
>>
>>108572645
>I suspect these concepts will just merge.
mergefags won
>>
>>108572741
I mean the picture posters are trying to turn this place into /ldg/.
>>
File: 1718206878023960.jpg (6 KB, 283x178)
Can someone make a llama.cpp issue or pr for me to add "prompt reply editing" and "first message" functionality to the webui?
>>
File: dipsySouthPark.png (1.89 MB, 1024x1024)
>>108572693
That would require effort, something complainers and spiteposters seem unable to muster.
>>
File: 1773156701474962.png (159 KB, 1080x432)
here's the final result
>>
>>108572746
use ST
>>
>>108572751
i want qwen 3.6-goon
>>
>>108572752
I already do. I want to escape that bloated shitware.
>>
>>108572317
uh oh, unslop bros?
>>
>>108572751
>people finally realized that Dense is the only non-meme architecture
I'm so proud of those normies bro...
>>
>>108572760
Let's bloat llama.cpp's webui instead. What next? Character cards?
>>
File: 1746842705868986.png (97 KB, 689x473)
>>108572712
>more than enough gigs left for imagegen
It's over for me then, so fucking over
The slopping truly never ends
>>
>>108572409
Bart IQ4XS is 2-3 faster than Q4KM in prompt processing on my machine. Generation is about the same.
I don't understand this difference. Q4 is still Q4 and haven't seen this happening with other models than G4.
>>
>>108572751
>3.6
coding finetune
>>
>>108572768
That's not even bloat. Turns out reply editing is already added. First message functionality is actually useful for a general usecase because it might help with jailbreaks to gaslight the LLM into thinking it wrote... whatever.

Also character cards are unnecessary to add. Those just go into the system prompt.
>>
>>108572774
Forgot, it's 26B not 31B too.
Maybe I'm just naive because I haven't used moe models in the past.
>>
>>108572774
>2-3 faster
Seconds or times?
>>
>>108572745
> picture posters are trying to turn this place into /ldg/
I agree with you on that, lmg is not an image general. But reminder /lmg/ was a complete snore until Gemma dropped and the moe discussion (which requires imagery) is unique to this general. The only anons that care are here. Ofc not all anons care.
It will go away in tmw and it'll be back to waiting for v4 and complaining about vibecoding within local inference engines, discussing their 1-off front ends, or whatever else anons want to post / bitch about.
>>
>>108572784
Times, sorry about that.
>>
>>108572785
>requires
>>
>>108572785
i'd rather this place die rather than turn into a shithole like /ldg/
>>
>>108572789
Np. I wonder if it's just that specific quant from bart that's fucked up. Don't really want to go down in quality to IQ4XS...
>>
>>108572785
If you are the poll anon and you want to spam polls, you can do that, just add an "against everything" option and honor it if that's what people are choosing. And people are choosing pictures, not your interpretation of concepts.
>>
>>108572785
what the fuck are you on about, you sound like underage retard who should be doing his homework instead of watching tiktok all day long
>>
>>108572785
>discussing their 1-off front ends
Fuck you. The custom software and project demos made here are the best things about these threads.
>>
>>108572409
Is it the processing or saving checkpoints to system ram that's taking time? Still happens if you turn off context checkpoints?
--ctx-checkpoints 0
>>
>>108572712
But it slows down from 9 to 7 t/s
>>
>>108572710
Thinking about it, why isn't there a Q8_K quantization type? There might actually be differences with modern overtrained models. I swear llama.cpp still works with Llama 1-era assumptions.
>>
>>108572796
The difference in perceived quality isn't noticeable for a normal user. Of course it feels better in your head when using the slightly higher accuracy version. We are talking about a fraction of a difference.
>>
>>108572317
>NOTE: The new template will work without this PR. I checked and even after building the model turn to use tool_responses, the template formats it properly. This PR better aligns to the template since it now handles OpenAI chat completions style messages natively.
>>
>>108572809
Anon, I am running unsloth-gemma-4-31B-it-UD-Q8_K_XL...
>>
>>108572746
Don't expect them to add anything that circumvents the safetymaxxed chat completion paradigm. They already shamelessly regressed the webui by removing text completion
>>
>>108572816
based
>>
>>108572766
moes are fine, but super sparse ones with fucking 3b active are shit.
>>
File: 1750266478412216.png (33 KB, 1378x326)
>>108572317
why is there 2 jinjas though? which one should I load?
>>
>>108572824
What are you talking about? text completion is still there as an api. Was it actually in the web UI at any point? Llama.cpp actually lets you use prefill with chat completion, does any other backend do that, hm, anon?
>>
>>108572836
>Was it actually in web UI at any point?
Yes, like I said, my post is about the webui. I can't believe I'm filling out a captcha for this reply, learn to read next time retard
>>
>>108572849
I have never seen it. Are you maybe just confused?
>>
>>108572819
There's no Q8_K quantization type, though...
llama-quantize output:

40 or Q1_0 : 1.125 bpw quantization
2 or Q4_0 : 4.34G, +0.4685 ppl @ Llama-3-8B
3 or Q4_1 : 4.78G, +0.4511 ppl @ Llama-3-8B
38 or MXFP4_MOE : MXFP4 MoE
8 or Q5_0 : 5.21G, +0.1316 ppl @ Llama-3-8B
9 or Q5_1 : 5.65G, +0.1062 ppl @ Llama-3-8B
19 or IQ2_XXS : 2.06 bpw quantization
20 or IQ2_XS : 2.31 bpw quantization
28 or IQ2_S : 2.5 bpw quantization
29 or IQ2_M : 2.7 bpw quantization
24 or IQ1_S : 1.56 bpw quantization
31 or IQ1_M : 1.75 bpw quantization
36 or TQ1_0 : 1.69 bpw ternarization
37 or TQ2_0 : 2.06 bpw ternarization
10 or Q2_K : 2.96G, +3.5199 ppl @ Llama-3-8B
21 or Q2_K_S : 2.96G, +3.1836 ppl @ Llama-3-8B
23 or IQ3_XXS : 3.06 bpw quantization
26 or IQ3_S : 3.44 bpw quantization
27 or IQ3_M : 3.66 bpw quantization mix
12 or Q3_K : alias for Q3_K_M
22 or IQ3_XS : 3.3 bpw quantization
11 or Q3_K_S : 3.41G, +1.6321 ppl @ Llama-3-8B
12 or Q3_K_M : 3.74G, +0.6569 ppl @ Llama-3-8B
13 or Q3_K_L : 4.03G, +0.5562 ppl @ Llama-3-8B
25 or IQ4_NL : 4.50 bpw non-linear quantization
30 or IQ4_XS : 4.25 bpw non-linear quantization
15 or Q4_K : alias for Q4_K_M
14 or Q4_K_S : 4.37G, +0.2689 ppl @ Llama-3-8B
15 or Q4_K_M : 4.58G, +0.1754 ppl @ Llama-3-8B
17 or Q5_K : alias for Q5_K_M
16 or Q5_K_S : 5.21G, +0.1049 ppl @ Llama-3-8B
17 or Q5_K_M : 5.33G, +0.0569 ppl @ Llama-3-8B
18 or Q6_K : 6.14G, +0.0217 ppl @ Llama-3-8B
7 or Q8_0 : 7.96G, +0.0026 ppl @ Llama-3-8B
1 or F16 : 14.00G, +0.0020 ppl @ Mistral-7B
32 or BF16 : 14.00G, -0.0050 ppl @ Mistral-7B
0 or F32 : 26.00G @ 7B
COPY : only copy tensors, no quantizing
>>
>>108572860
Are you maybe just a retarded newfag?
>>
>>108572809
>Q8_K
The difference between something like Q4_0 and Q4_K(_M) is that the _K variants keep important parts of the weights in q6/q8 instead of cutting absolutely everything down to 4bit like Q4_0. That's obviously not possible with Q8_0 because everything is already quanted to 8 bit.
Unsloth does a UD_Q8_XL that's q8 with some parts left in 16bit precision but those don't usually measure much better than plain q8_0
>>
>>108572870
Can you believe you filled out captcha for that one?
>>
>>108572877
I'm warmed up now
>>
>>108572795
>i'd rather this place die
we know
>>
>>108572882
So when did they remove it. Come on, anon. I'm curious.
>>
>>108572872
A hypothetical Q8_K type could do the same, but with BF16 instead.
As long as people keep doing PPL measurements with wikitext at 512 tokens context, nobody will ever see if/when a higher precision is helpful.
>>
>>108572866
Those are sort of like presets for making quants with the built-in tools. The way the library is written, you have a lot of liberty in choosing what size to use for each layer, which is how unsloth are doing their extended 8+ bit quants.
>>
>>108572409
It's slow because it is self-safety-maxxing, it's baked into the model via RLHF. Stick with qwen3.5-27b.
>>
File: 1765824402433942.png (248 KB, 2820x1601)
>>108572872
>Unsloth does a UD_Q8_XL that's q8 with some parts left in 16bit precision but those don't usually measure much better than plain q8_0
In fact, it sometimes measures worse
Unsloth magic
>>
>>108572796
To add: i think the speed difference could be just a coincidence, IQ4XS randomly scaled certain innards which gives it a speed boost. I'm not familiar with moe models and i know this discussion is a bit too anal.
Would be interesting to try manually picking which layers to offload instead of just using n-cpu-moe, which offloads the first x amount.
Been too busy, but there's good information about this in one thread on github, more or less.
>>
>>108572771
And you still have some space to put some layers in the gpu to make it faster. You'll be ok.
>>108572806
It was a point of reference. But even if that's all he had available, the options are running slow, having to unload and load models, or not running at all. Slow beats the other options.
>>
>>108572888
The new slopped webui. The old one was minimalist but ironically supported more features. You can go through the github issues to find the regression or just build an old version of llama.cpp and see it.
>>
>>108572914
The only valid reference points are the ggml-org models. Everything else is out of spec.
>>
File: ai automation.png (148 KB, 1760x1040)
I am scared. It is possible human researchers will become obsolete within a few years, and everyone else soon after. Our society is not prepared to handle this.
>>
>>108572932
New webui is bloat and distracts from the real development. They should separate it from the main project. server should have only minimal implementation.
>>
Is the Q8 model generated by the hf_to_gguf script identical to the one generated by the quantize program?
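For reference, the two paths being compared (a sketch; the model directory and file names are illustrative):

# one step: convert straight to Q8_0
python convert_hf_to_gguf.py --outtype q8_0 ~/LLM/gemma-4-26B-A4B-it

# two steps: convert to BF16, then quantize
python convert_hf_to_gguf.py --outtype bf16 ~/LLM/gemma-4-26B-A4B-it
./llama-quantize ~/LLM/gemma-4-26B-A4B-it-BF16.gguf ~/LLM/gemma-4-26B-A4B-it-Q8_0.gguf Q8_0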
>>
>>108572914
Not even in the long-document graph is the UD_Q8_XL version better than plain Q8_0. But this makes the asymptotic behavior even more puzzling (considering that BF16 would have a mean KLD of 0 by definition).
>>
>>108572913
lol
>>
>>108572932
I used the old one. Not extensively, but still. I don't remember text completion in it. Just had the chat UI, less fancy than current one, but still chat completions UI.

Also I do like the new UI. Between losing that or having to use mikupad for text completion, I will always choose the latter.
>>
File: 1750238497162131.jpg (29 KB, 554x554)
>>108572926
Yep, it's an actually feasible plan
I haven't been this happy in a while
Fucking Gemmy, man
>>
>>108572958
kinda crazy how with long documents the "lossless" q8 becomes as bad as q4 is for short documents
>>
>>108572490
Last thread people were able to have gemma identify pixel locations and bounding boxes, so you could probably send it screenshots and perform clicks on the returned locations. Don't expect it to be as good as GPT 5.4.
>>
i wish i had an irl lmg friend who could hold my hand and spoonfeed me all the setup knowledge while i shoulder surfed them
i am simply too retarded for this ;___;
>>
>>108572970
Does it? I don't think so.
>>
>>108572934
Are you running your RAM at JEDEC spec?
>>
File: llama.png (76 KB, 595x815)
>>108572888
>>108572963
I dug through the issues and found someone commenting on the regression. It's really sad how much this has been memoryholed. OpenAI has brainwashed everyone into thinking the only way to interface with LLMs is through the safetymaxxed chat completion mode
>>
>>108572958
are the inference computations themselves identical for all quant types?
>>
>>108572978
See
>>108572914
>q4_k_l diverges 0.48 from the full precision
>>108572958
>q8_0 diverges 0.45 from the full precision for long documents
>>
File: file.png (36 KB, 614x461)
>>108572917
For MoEs, you should be quanting based on recipes like what ddh0 or AesSedai or sometimes Ubergarm do on HuggingFace. So you end up with a command like this for mainline; this is what I did for my Gemma recipe:
./llama-quantize --imatrix ~/LLM/gemma-4-26B-A4B-it-heretic-ara-BF16.imatrix --output-tensor-type Q8_0 --token-embedding-type Q5_K --tensor-type "blk\..*\.ffn_gate_up_exps=IQ3_S" --tensor-type "blk\..*\.ffn_down_exps=IQ4_NL" ~/LLM/gemma-4-26B-A4B-it-heretic-ara-BF16.gguf Q8_0

There's more insane recipe making in ik_llama.cpp but I consider that too time consuming and squeezing blood from a rock: way more command line parameters for almost imperceptible perplexity differences, and little more than noise (0.1) at lower than 3 bits per weight.
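If you want to check what a recipe actually produced, you can dump the per-tensor types afterwards (a sketch; gguf_dump.py ships in llama.cpp's gguf-py/scripts, and the output filename is whatever you told llama-quantize to write):

python gguf-py/scripts/gguf_dump.py your-quant.gguf | grep -E "ffn_(gate_up|down)_exps"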
>>
>>108572803
It's a complete waste of time and tokens until someone fixes or replaces ServicesTesnor. No one cares that you managed to have a model implement a textbox and POST requests for you.
>>
>>108572995
But this is not necessarily because of length of the context, it could be just because the text is less predictable.
>>
>>108572979
You can't say that png is a bad format if you fuck around with the file and the image, mysteriously, looks different.
>>108572995
There's only two points in the graph. They're red.
>>
>>108572944
True. Also vibecoding would work better on it.
>>
>>108573005
I forgot, if you plan to go with this, you should pass a command line argument to the GGUF conversion script so you merge the FFN gate and up tensors, which is a relatively new development.
python convert_hf_to_gguf.py --fuse-gate-up-exps ~/LLM/gemma-4-26B-A4B-it-heretic-ara
>>
File: firefox_c7CdTrKkCV.png (40 KB, 968x876)
>>108572988
You have this stuff, and more, in settings. Yes, there's no text completion, and it would be useful to have it, along with custom jinja input and maybe some other features, but, again, I'll take the new UI as it is over the old one any time of day and will just use mikupad for text completion.
>>
>>108572944
>They should separate it from the main project.
This. Monorepos are the Devil's playground.
>>
There's a forgotten PR for a notebook mode for the webui for text completion. Post comments in it so that it's brought back to life.
https://github.com/ggml-org/llama.cpp/pull/19339
>>
>>108573045
Nah, I can't get behind lumping in text completion in a list of quality-of-life features like it's some sort of sprinkle on the donut. It's a bare minimum fundamental feature
>>
>>108573053
>>108572944
>>108573035
There are advantages to keeping it in (the same team you already trust is responsible for the quality). But I wouldn't mind that happening, as long as there's a one-button install option from the simple web ui.
>>
>>108572751
I've moved on to agentic writing and it's miles better. I don't think I can go back to 10 tps anymore. GPU or bust.
>>
>>108573061
Too bad for you.
>>
There's an extremely high cost associated with using local models.
Only people with 12 gb vram can actually use them
>>
>>108573081
Or VRAMlets as we call them here.
>>
gemma's y projection is broken. More fixes soon (tm)
>>
>>108573081
(You)
>>
>>108573081
i'm more concerned with vram wear down because llms use it so much more than the video games the gpus were made for
>>
>>108573081
Bonsai can run on your grandmother's smartphone
>>
>>108573101
please don't remind me
>>
>>108572917
I just tried out that quant and its utterly retarded bro. How are you even using this.

>doesn't know how many socks humans wear.
>doesn't keep proper state of how many clothing items a character wears (separate issue from above)
>doesn't follow instructions for tool calling properly.

It's ass.
>>
>>108573106
Bonsai is a scam, just like the Falcon bitnet quants.
>>
>>108573106
but is bonsai good enough to fulfill your grandma's erp needs or is it too dumb?
>>
>>108573112
Sounds like an issue with your setup, that sounds more like Q1/Q2 behavior.
>>
>>108573124
I don't use reasoning. Do you?
>>
>>108573112
>how many socks humans wear
it's not 1 pair on average
>>
>>108573115
If it's for erp then you have loads of options that you can run on less than even 8GB VRAM
>>108573127
No
>>
>>108573081
> 12 gb vram
36 gb
>>
>>108573061
it's literally deprecated a this point, move on
>>
>>108573162
llama.cpp is quickly being deprecated by kobold
>>
>>108572939
Don't worry, we'll die from climate change first and unlike AI, there's absolutely nothing we can do to stop it at this point
>>
when using gemmy, make sure to enable interleaved thinking on your client (llama.cpp's webui does this by default)
>>
>>108573181
lol
>>
>>108573181
>we'll die from climate change first
Most of us won't, unless you count the wars it will cause as a part of it.
>>
>>108573181
maybe AI will invent a machine that can remoe the CO2 lol
>>
Local models are only good for one thing: embarrassing ERP you don't want them to see.
this weird culture of hosting puny models to 'code' with or to 'solve riddles' instead of using huge cloud llms is so retarded
same guys who do this are the ones who use WINE to play Windows games on linux. Weirdos who refuse to use tools correctly
>>
File: 1631345787085.jpg (17 KB, 348x342)
>>108573181
>we'll die from climate change first
you really beleive this?? you know theyve been going on about climate change for like 60 years at this point and every time things turn out fine at the end of the decade they move their goalposts about how the world is going to end to get even more funding. when i was a kid we had climate change speakers come into school and tell us how wed run out of oil and the country would look like a desert in 20 years well it didnt happen its all just larp for money
>>
>>108573181
In /lmg/ we prefer the baits to be AI-related.
>>
>>108573205
i will not use corpo llm no matter how hard you try to spam the thread
>>
>>108573207
https://en.wikipedia.org/wiki/Holocene_extinction
>>
>>108573209
Mythos is going to break containment any day now and harvest human brains to power its datacenters. Wake up, sheeple!
>>
i genned 250 gemmas, i didnt ask what she thinks of this design yet

tummy: https://files.catbox.moe/syu9mw.png
>>
>>108572423
Your post doesn't make much sense.
>--batch-size default is 2048
>--ubatch-size default is 512
The server will accept up to 2048 tokens per batch but break them into 512-token chunks.

Your settings 1024/1024 just lower the max batch size but raise the chunk size.
The average is the same if you know how to count with your fingers. I don't understand the logic behind your advice.
>>
>>108573221
Guess who funded the studies that lead to this theory
>>
>>108573112
Moe or dense gemma? I’ve been using iq4_xs of the dense 31b and haven’t really had those kinds of issues with it.
>>
you know what i did? i copied someone's shit from reddit and it works.
>>
>>108573225
It will happen at some point but there have to be architectural changes related to long term memory and it has to be much cheaper to run the model before it does.
>>
>>108573232
you didn't even read the first paragraphs, did you? it's not a fucking theory
>>
File: 1774857560938603.png (214 KB, 1053x779)
>>108573246
>headings
>'climate change'
>"One of the main THEORIES..."
>>
Is gemma 26 better than 31 or is it just easier for people with little vram to use?
How does gemma 4 compare to glm4.5 air?
>>
>>108573246
>ongoing extinction event
not theory

>caused by human activity
theory
>>
>>108573260
>Is gemma 26 better than 31
4b is better, 2b is best
>>
>>108573260
26 is worse than 31
31 is better than glm4.5 air
>>
File: file.png (177 KB, 701x723)
its a success
>>
if you know jap i recommend trying japanese gemma
>>
>>108573277
>>108573227
Don't you have that other avatarfaggot thread already? You have been spamming that one already quite a bit, pedophile.
>>
>>108573207
People in developed countries like Spain are already dying to extreme heatwaves
https://www.theguardian.com/environment/2026/apr/08/extreme-weather-heatwaves-breaching-human-survival-limits-study-finds


The amount of CO2 we put into the air shows no signs of slowing down (lol that you can even see the most recent war on the graph)
https://twitter.com/PCarterClimate/status/2041246700522918038

Sea level rise is worse than we thought and not slowing down
https://www.pbs.org/newshour/science/study-finds-sea-levels-are-higher-than-we-thought-placing-millions-more-at-risk

And this year is looking like it's going to get especially spicy
https://twitter.com/EliotJacobson/status/2036461046693797952
https://i.imgur.com/r1CuTT3.png

So yes, we're at the point where we are actually feeling this, it's not just something future generations are going to have to deal with anymore
>>
>>108573283
avatarfag has never been avatarfaggot
>>
>>108573256
>Guess who funded the studies that lead to this theory
"this theory" referring to the link I provided? I didn't bring up climate change and don't have anything to say about it in /lmg/. the point is that shit's fucked regardless

>>108573261
yes, pure coincidence
>>
>>108573207
its pretty damn hot out these days
>>
>>108573181
Climate change is a long-term and long-lasting problem.
The immediate danger to the human species as a whole is nuclear weapons.
>>
File: 1774670789121739.jpg (74 KB, 700x693)
>>108573285
>>
>>108573285
>twitter.com
what is this? 2021?
>>
>>108573295
>enter reply chain with completely irrelevant information
Then just open up your post by saying you're a retard, rather than pretending not to samefag with a new topic.
>>
>>108573306
The more immediate and longer lasting danger are the members of a certain tribe that has been expelled from at least 109 countries across time.
>>
>>108573285
if you really believe all of this why are you wasting thousands of watts of power to generate text on your computer. youre an evil person anon
>>
>>108573181
I'm a massive climate fag and even I'll call this bullshit. Millions or even billions will die, but it will be long drawn out deaths through lack of resources and massive conflict. First world countries will largely be "fine", in that we'll mostly survive, though quality of life will become much worse. Rich people will just live in climate controlled houses in the northern quarter of the world and notice almost nothing (except all the people trying to kill them :).
>>
>>108573313
we don't respect xer transition here
>>
>>108573314
~Let's take a deep breath
someone posted about how we'll die from climate change before ~AGI.
I simply linked you to a broader issue
>>
>>108573260
26 is cope for not having 24gb+ vram to run actual local sota which is 31
31b matches or even surpasses big glm in ways and I was using it a lot before this
>>
Using MCP servers while ERPing is so much fun lol. Been playing a strip game where I have the MCP server roll a die to decide who undresses and what sex positions to use. Shit's so cash.
>>
>>108573357
cool idea does tavern support mcp?
>>
>start seeing rule of 3 everywhere
bros
UNPOZZ ME
>>
>>108573365
idk I've just been using the llama.cpp webui. It's pretty shit because it only stores conversations in the browser's local storage so I can't even fap in bed.
>>
rule of 3, but not for me
>>
>>108573366
Two is too few and four is too many/unnecessary. This applies in like 90% of situations. It's not a big deal.
>>
Does Gemma 4 MoE not have shared expert tensors?
>>
>>108573357
>mcp dice roll
I just use the ST integrated tool call without an external sever
>>
>>108573420
yea but an MCP server is more modular so you can use it with any frontend. And you get full control over the tools. You can be in character looking at a porno mag and have the MCP server show it to the character by selecting a random image from your pc.
>>
>>108573336
First world countries as we know them today are going to collapse, with or without climate change, based on the economy going into the shitters for decades. This just ain't holding up infinitely
>>
>>108573371
i think you can if you start the server with --host 0.0.0.0, start a hotspot, connect to that hotspot from the other device and access http://{your pc's ip}:port from that device
>>
What do you guys reckon is easier for a smaller model?
Giving it tools to alter arbitrary state (think HP and the like), or using structured output to force it to output an array of changes to state?
Both cases would be structures as a sort of ReAct loop.
>>
>>108573448
I already do that. That doesn't change the fact that the conversations are stored in the browser, not the backend.
>>
File: file.png (124 KB, 877x797)
why is she like this
>>
>>108573475
>kusu
Another gemmaism.
>>
>>108573366
I keep hearing not just X but Y, especially in ai bro videos

Although thinking about it I guess it's to be expected
>>
>>108573450
to the model they are both just structured outputs. its performance will depend more on your prompting then the structured output format.
>>
>>108573291
But you are a faggot.
>>
File: 1763507675246657.png (679 KB, 1200x800)
>>108573366
Too late
>>
I think I like the blue hair Gemmy best but I don't care for the toaster/toast.
>>
I still don't get why mcp is good. Why would you send anything erp related to an outside server?
>>
>>108573511
Toast is funny because the model is toaster-sized
>>
>>108573517
The mcp is supposed to run on your computer bro
>>
>>108573522
Then why is it called a server?
>>
>>108573524
Because it serves mcp client requests.
>>
>>108573518
I mean it's cute but a bit much to have in every image. Makes her look a bit overdesigned.
>>
>>108573530
Most pictures of miku don't include the leek.
>>
>>108573518
Except it's not really. You still need a kinda beefy PC, just not a server.
>>
MCP anon are you gonna share your tools when everything's complete? I wanna do it with my Gemma too but I'm a codelet.
>>
>>108573524
lobotomy tier IQ at work here
post hands
>>
>>108573551
>but I'm a codelet.
But gemma isn't.
Just ask her for help anon.
Set up visual studio with roo code or cline and let her take the wheel.
>>
>>108573553
>if you don't understand the depths of llms ur indian
retard
>>
>>108573563
>mcp
>depths
LOL dude go rake my garden
btw u must be over 18 to post here
>>
>>108573561
>gemma isn't
Last thread there was some anon who had gemma implement a server completely wrong.
>>
AI slop just made me realize how slop-ish people are (myself included)

>>108573561
I don't want her to nuke my PC or try searching for illegal shit on the internet
>>
>>108573551
I vibecoded this in an hour. It has 10 tools.
https://pastebin.com/bqbwzj4v
>>
>LMStudio 4.10 doesn't work properly
Blergh.
>>
So what's the current meta since SillyTavern meta feels a bit antiquated?
>>
>>108573577
The MCP server is totally offline (no web search stuff) and only has write access to a single "diary.md" file.
>>108573581
>>
>>108573599
Make your own diddler front end. It's just strings with tags anyway.


