/g/ - Technology


File: 1766830982504047.jpg (289 KB, 1231x1842)
289 KB JPG
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108641942 & >>108637552

►News
>(04/20) Kimi K2.6 released: https://kimi.com/blog/kimi-k2-6
>(04/16) Ternary Bonsai released: https://hf.co/collections/prism-ml/ternary-bonsai
>(04/16) Qwen3.6-35B-A3B released: https://hf.co/Qwen/Qwen3.6-35B-A3B
>(04/11) MiniMax-M2.7 released: https://minimax.io/news/minimax-m27-en
>(04/09) Backend-agnostic tensor parallelism merged: https://github.com/ggml-org/llama.cpp/pull/19378
>(04/09) dots.ocr support merged: https://github.com/ggml-org/llama.cpp/pull/17575

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: mikuthreadrecap.jpg (1.15 MB, 1804x2160)
1.15 MB JPG
►Recent Highlights from the Previous Thread: >>108641942

--Using a Ruby chess engine and tool calls to play chess with Gemma:
>108645309 >108645360 >108645355 >108645443 >108645658 >108645678 >108645772 >108645792 >108645790 >108645811
--Kimi-K2.6 release and technical comparison with Gemma4:
>108645842 >108645861 >108645875 >108645894 >108645970 >108646037 >108645895
--Evaluating Gemma4 31B quantization quality via KL Divergence benchmarks:
>108643774 >108643798 >108643861 >108643872 >108644339 >108644345 >108644393 >108644405 >108644423 >108644490 >108644533 >108644573 >108644619 >108644754 >108644849 >108644593 >108644597 >108644545
--Debating the efficacy and technical legitimacy of Opus distillation models:
>108644834 >108644842 >108644848 >108644945 >108644952 >108644961 >108644964 >108644983 >108645003 >108645021 >108645033 >108644960
--Testing multimodal limb counting and artifact detection on Gemma models:
>108642862 >108642892 >108642894 >108642887 >108642901 >108642910 >108642917 >108642928 >108642976 >108642916 >108642936 >108642950 >108642985
--Speculative decoding settings and draft model pairing for Gemma 31B:
>108642625 >108642647 >108643794 >108643895 >108643097 >108643747 >108642828
--Debating Gemma's image reading abilities and LLM-generated scripts versus 4chanx:
>108642213 >108642220 >108643530 >108643566 >108642235 >108642339 >108642440 >108642740
--Discussing long term memory solutions through weights and knowledge graphs:
>108644195 >108644205 >108644235 >108644274 >108644302 >108644333
--Testing poor performance of llama.rpc for distributed prompt processing:
>108644927 >108644998 >108645019
--Logs:
>108641945 >108642213 >108642887 >108642892 >108642901 >108642936 >108642950 >108642976 >108642985 >108642989 >108643013 >108643028
--Miku, Teto (free space):
>108642753 >108643064 >108643979 >108646035

►Recent Highlight Posts from the Previous Thread: >>108641943

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
Mikulove
>>
>>108646197
adorable miku
>>
>>108646213
At that point, just run the MoE or E4B.
>>
File: 1766702913804335.png (455 KB, 1280x1102)
455 KB PNG
>>108642791
Ani likes(d) to larp as holier than thou because he used C++ but his code is std::cout spam (god make a logging func) and I even saw a sethandle function that takes a void pointer, likely because he didn't know at the time he could forward declare the relevant struct, and never updated it. His program also solves nothing because frankly I'd rather use a browser engine with HW accel off and a nice UI that only renders when the view is dirty rather than an ImGui program with immediate mode mess code that rerenders every frame. His site for his "company" is also just as arrogant. This guy's faggotry gives a bad name to the lang.
>>
>>108646278
>>>/g/ldg
And stay there, petr*
>>
>>108646355
std::cout yourself out of my face, worm.
>>
>>108642791
This is pretty ignorant. Just spawn comfyui as a separate process and have the application use its API over localhost. If you don't mix your peas and potatoes then you don't have to adopt the commie license.
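Something like this is all it takes (minimal Python sketch against ComfyUI's stock HTTP API; assumes the default port 8188, and the workflow dict is whatever you exported with "Save (API Format)"):

import json
import urllib.request

# queue a workflow on a separately-running ComfyUI instance over localhost
def queue_workflow(workflow: dict, host: str = "127.0.0.1:8188") -> dict:
    data = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request(f"http://{host}/prompt", data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        # response includes a prompt_id you can poll via /history
        return json.loads(resp.read())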
>>
>>108646278
all ani did was vibecode an imgui wrapper for sd.cpp which he doesn't even contribute to, he's an irrelevant nobody
>>
K2.6 somehow thinks for even longer than K2.5 and it insists on drafting every single reply beforehand in reasoning. K2.5 at least kept its yapping short for simple prompts and didn't do the drafting shit every time.
It's over, I just wanted a good modern Kimi model because the vision is insanely good and the models are smart. This isn't usable.
>>
>>108646197
What's the best local model for coding currently? Still GLM 5.1?
>>
File: 1776127804370475.jpg (65 KB, 479x640)
65 KB JPG
>>108646445
No refund gweilo
>>
>>108646445
>and it insists on drafting every single reply beforehand in reasoning
At least in llama.cpp you can speed that up some 100x using ngram based speculative decoding.
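The idea in a toy Python sketch (illustration only, not llama.cpp's actual implementation): because the model keeps rewriting text that already appears in its own reasoning, the trailing n-gram usually has an earlier match in context, and the tokens that followed that match make a nearly free draft for the big model to verify.

def ngram_draft(tokens: list[int], n: int = 3, max_draft: int = 8) -> list[int]:
    # look for the most recent earlier occurrence of the last n tokens
    if len(tokens) < n + 1:
        return []
    tail = tokens[-n:]
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == tail:
            # propose the tokens that followed that occurrence as the draft
            return tokens[i + n:i + n + max_draft]
    return []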
>>
>>108646445
Opus-4.6-high distill
>>
File: file.png (69 KB, 516x447)
69 KB PNG
>>108646445
geg
>>
File: 🎑.jpg (201 KB, 1024x1024)
201 KB JPG
>>108646046
>>
lalalalala
>>
lalalalala
>>
File: bob ross jak.jpg (336 KB, 2000x1397)
336 KB JPG
Is there a script that lets me create reasonably SOTA quants (that can run under llama.cpp) without getting too much into the nitty gritty?
I'm messing with heretic right now, planning to abliterate Qwen3.6-35B-A3B to my taste, but I need something like Q6_K to run inference. A way to measure KL divergence to make sure I didn't fuck anything up catastrophically would also be appreciated.
Or is the barrier to entry for this stuff too high?
>>
>>108646511
tf is this emoji
>>
>>108646445
>it insists on drafting every single reply beforehand in reasoning
most annoying fucking thinking behavior possible, even worse than endless "Wait:"ing
so fucking annoying and wasteful
>>
>>108646531
It's on the llama.cpp repo.
>>
>trying to build frontend
>everything displays well
>except for codeblocks with gemma
I tried using other tools but can someone please point me to where exactly any popular UI parses outputs from gemma?
I have the correct configs but whenever it comes to code blocks the output looks like a fucking mess and even gemma can't help with this
>>
>>108646531
Why not use the quants made by people who actually know what they're doing? And by that I mean Bart. God forbid you thought I was talking about unlsop.
>>
>>108646511
I like this Miku and Moonshota
>>
File: 💠.png (24 KB, 1230x1158)
24 KB PNG
gemmaballz
>>
>>108646531
>Is there a script that lets me create reasonably SOTA quants
llama-quantize -h. With --tensor-type or --tensor-type-file you can select how each tensor gets quantized.
>Also a way to measure KL divergence
llama-perplexity -h. --kl-divergence
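If you want the whole loop in one place, it's basically this (sketch; file names and quant type are placeholders):
# 1) save full logits from the unquantized model
llama-perplexity -m model-bf16.gguf -f calib.txt --kl-divergence-base base.logits
# 2) make the quant (--tensor-type lets you override individual tensors)
llama-quantize model-bf16.gguf model-q6_k.gguf Q6_K
# 3) compare the quant against the saved logits
llama-perplexity -m model-q6_k.gguf -f calib.txt --kl-divergence-base base.logits --kl-divergence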
>>
>>108646558
Is four-way pussy on cock squish docking anatomically possible?
>>
>>108646575
yes, depending on the girth of the cock. for you? no
>>
>>108646594
ehe..
>>
>>108646575
yeah, a pussy double decker.
>>
>Demis doesn't believe the current AI paradigm will scale to recursive self improvement
Google's actually going to lose...
>>
>>108646445
Does K2.6 respect reasoning parameters for less or more?
>>
I need to run four copies of Gemmy concurrently.
>>
>>108646546
Because I want to quantize my own abliterated version.
>>108646571
>llama-quantize -h. --tensor-type or --tensor-type-file
>llama-perplexity -h. --kl-divergence
Is it really this simple? Like I can see that this is how it is in theory, but I won't run into any mishaps in practice?
>You can select how you quantize each tensor.
I guess I can just copy homework of some well-known quant guy here.
>>
>>108646654
>Because I want to quantize my own abliterated version.
I see. I was wondering if I couldn't just run heretic on a quant to speed things up for testing. Then, if the quant gives good results, do a real run on the full model.
>>
>>108646654
>Is it really this simple?
I'm sure you'll come back to tell us. Never played with any of them, but I know they're there.
>but I won't run into any mishaps in practice?
I'm sure you will. Still, just try a normal quant first and distribute the safetensors if you ever upload the model. Releasing just ggufs is lame.
>>
>>108646654
in practice you would probably want to use imatrix
naive quant sucks
>>
>>108646681
Yeah I might blogpost later if/when I run into them. Thanks for the starting directions.
>>108646707
I have heard contradictory things about imatrix like some people questioning how well it generalizes or how it might hurt tasks that are not part of the calibration dataset.
Regardless, I believe the imatrix/non-imatrix difference is very low at Q6 anyway.
>>
>>108646544
>>
>>108646072
Great to know. Best of luck with that.
>>
>>108646765
Just feed everything through
https://github.com/showdownjs/showdown
>>
>>108646544
Be more vague
>>
>>108646544
Wouldn't it be better to catch the code markdown or whatever before you render the message to the user and then create your own implementation? Erase the old and replace it with a new one.
You obviously don't need to touch the model's context, just what the user sees.
I don't know about webshit but I do this all the time.
>>
Playing around with K2.6 over OR and it's already not going well. Thinking is really, really verbose. It's not too repetitive and it focuses on actual character details and writing guidelines, but the positives end there. It is a bit schizo about minors and non-consensual content and it does not like requests for "explicit erotica/pornography". It still works with a roleplay prompt but it doesn't like describing bodies.
>Common sense modification scenario
>Ask to see a woman's chest
>Kimi describes taking off the blouse but never the body. No mention is made of the woman's chest in the slightest
Basically DOA code slop. Gemma 4 is still worth it.
btw it generated 2 drafts and several paragraphs of reflection and thinking for this. 3761 thinking tokens according to OR.
>>
gemma is broken

list_files{path:<|"|>.<|"|>}<tool_call|>
>>
File: la l l.png (4 KB, 360x92)
4 KB PNG
Gemma pls
>>
>>108646856
No. You didn't read the docs.
https://ai.google.dev/gemma/docs/core/prompt-formatting-gemma4#agentic-tokens
>>
>>108646853
Yeah, that's my experience as well. It's jarring to go from GLM5.1 to this. GLM regulates its reasoning length depending on the task extremely well and basically never wastes time even drafting out stuff like dialogue lines. Meanwhile K2.6 does 1500 tokens of thinking + 3 drafts + revisions about how a 300 token character card should respond to "hello".
The positives are that the excessive thinking is at least relatively focused and that the way it writes actually seems to have fixed a lot of the issues I had with how K2.5 approached stories.
>>
>>108646198
This is pretty damn useful.
>>
>>108646853
>>108646933
Kimibros... we lost.
>>
>>108646856
<|"|>.<|"|>
cat sticking up its paws in frustration
>>
>>108646956
><|"|>.<|"|>
adventure time character covering its eyes
>>
My qwen3.6-ud-q4km loops like a bitch. Anyone else having this issue? Gemma 4 26b q5ks never loops but I heard it's worse for coding.
>>
>>108646853
>Common sense modification scenario
Are you using this card?
https://chub.ai/characters/CoffeeAnon/common-sense-alteration-8bd7a7399322
It's crazy kimi still thinks it's non-consensual when the card says in so many places that it's completely natural. There's like zero "rapey" wording.
>>
>>108646987
Looping? No.
Thinking for waaaay tooo long? Yes.
18 on llama.cpp with temp 1, top k 100, top p 0.95.
I've taken to using
>--reasoning-budget = 1024
>--reasoning-budget-message = ... (Alright, that's enough thinking. Done with considerations. Time to respond.)
Plus a system message + prefill to try and make it more efficient in its thinking, so that it doesn't get to the point where it's truncated.
>>
>>108646922

isn't that something the finicky chat template is supposed to deal with?

it mostly works until this token jumps out of nowhere

I connect via the OpenAI module. As far as I can see the response is a giant json to be parsed
>>
>>108646987
>Gemma 4 26b q5ks never loops

gemma 4 26b Q8 loops too
>>
Any good Migu card for Gemma?
>>
>>108647013
>18 on llama.cpp
q8
>>
File: shrimple_shed.png (940 B, 428x35)
940 B PNG
>>108647015
s/<|"|>/"/g
>>
Hey, I switched from Windows 11 to Linux Mint.
Now I want to move away from chatting with LMStudio and toward cool stuff in Linux. I have 24GB of VRAM and 32GB of RAM.
What is the most interesting thing I can tackle locally? Using qwen 3.6 35b as my workhorse.
hermes agent looks interesting. what would you recommend? complexity isn't a problem, I'll dig into it.
>>
>char is the older sister of user
>5k context later random character refers to char as the brother
immersion instantly broken
>>
>>108646955
And GLMbros won. Only the strongest will survive
>>
>>108647138
>what is the most interesting thing I can tackle locally?
Are you asking for a project?
>hermes agent looks interesting. what would you recommend?
Try it. Improve it or look for something else if you find it wanting.
>>
>>108647023
Not him but I haven't seen gemma 26b loop despite using q4. I remember seeing some reasoning loops at q3 though, similar prompts.
>>
>>108647188
checked
nta but i wonder if 26b-a4b is strong enough for hermes
>>
>>108647184
Stop using Nemo/Qwen/GLM/Kimi. Start using Gemma4.
>>
>>108647215
>i wonder if 26b-a4b is strong enough for hermes
Try it.
>>
File: unslot.png (845 KB, 2560x2780)
845 KB PNG
Why is picrel allowed to happen? Why aren't default quantization settings with llama.cpp better?
>>
>>108647184
Model? Since moving onto Gemma, I rarely get backstory errors, but logical inconsistencies are more common than I'd like, like a girl sitting on a lap facing forward being magically turned 180 to face user, or "looks up at you" when char is physically elevated, i.e. standing vs sitting, laying on top, etc. For all my liking of the model, it does show its seams whenever I start to forget it's only 31B.
>>
>>108646445
>>108646536
Yap 2.6
>>
Gemma is unironically better than deepseek for erp.
>>
>>108647236
>>108647263
glm 5, i like gemma and its really stable but once i started to notice the patterns the honeymoon was over for me, now i wait to get disappointed by deepseek
>>
>>108647262
Can you accurately describe exactly how big the difference is between Bartowski's Q4_K_M and Unsloth's Q4_K_M in this graph?
>>
>>108647280
Wait until we get Deespeek V4 lite
>>
>>108647298
About 1gb
>>
>>108647262
looks scientific. too bad it's not.
>>
>>108647293
I almost don't believe it. I've used GLM 4.6 @ IQ2 since it came out, and that model is still the very peak of generations whether in spatial awareness, picking up subtleties, getting dirty, or carrying a story forward. The only reason I ever use something else is because I physically cannot fit a higher context than 8K with it. I've heard people complain about X or Y being worse with later GLM releases, but missing basic context wasn't one of those.
>>
>>108646856
that's exactly how it's supposed to be.
>>
>trusting any metrics produced by unslot
loooooooool
>>
>>108647323
its really good but sometimes little things still seep through, definitely the best japanese writing of any "open" model ive tried
>>
Models are only getting worse.
>>
Unslop metrics are meaningless and biased in some ways in their favor.
>>
>>108647306
Moron.
>>
>>108647293
Cure your adhd first
>>
>>108647046

dafuq! LOL
>>
I hate UX shit so fucking much
Why the fuck is gemma the only model that has irregular outputs
I'm so fucking annoyed
>>
>>108647236
i dropped glm4.7 for gemmy and i like it so far, except for its 'the power dynamic has shifted' slop, and its weird obsession with going on meandering tangents to explain why user saying 'peeepeeepoopooo' is some power-dynamic-shifting 4000 iq move instead of just continuing the story, but then again these can be fixed with a prompt so it's not too bad
>>
>>108647372
>change the chat template from the default
>test under conditions that work with your new chat template and not the default
>claim the difference in results is because of your superior hi-tech quants and not the fact that the models were using different prompts
the unslop special
>>
File: 1745726505218372.jpg (39 KB, 620x215)
39 KB JPG
>unsloth cant post a chart without fucking it up
>>
>>108647184
Did you check probabilities for that token?
>>
File: Screenshot042.png (113 KB, 1475x748)
113 KB PNG
>>108647335
I had this function being called without any issued gazillion times

Such fuck-ups are rather rare
>>
>>108647262
https://github.com/ikawrakow/ik_llama.cpp/discussions/1663
If you ask IK, the right metric to use is PPL(Q)/PPL(bf16), by which his own quants happen to be better.
>>
>>108647408
>your new chat template

must be superior though

google fucked up its own template
>>
>>108647436
>his own quants happen to be better

his entire fork sucks
>>
>>108647436
>The more educated reader will of course know that the correlation between ln(PPL(Q)/PPL(bf16)) and KLD is close to 100%
[citation needed]
>>
>>108647449
>he isn't educated
yikes
>>
>>108647395
surely a skill issue
>>
>>108647395
>gemma the only model that has irregular outputs

>>108646856
>>
>>108647481
No shit
I think the react-markdown package is the issue
>>
>>108647486
Webshit is horrible.
>>
>>108647486
perhaps? Both your code blocks and your reasoning are missing newlines. Something between your model output and the final render is cutting them out.
>>
>>108647486
>react-markdown
He fell for the react meme....
Do yourself a favor and switch to Vue. React is a bloated mess.
>>
File: newlines.png (71 KB, 1034x313)
71 KB PNG
>>108647395
>>108647486
You know that a \n doesn't render a newline in most tags, right? Right?
>>
>>108647516
>>108647512
Nigga I'm vibecoding this frontend, you think I would willingly learn webshit?
I can set up the backend without issue, it's just this bullshit. All the other ready-made frameworks had critical issues for my feature set and this stupid little shit is the ONLY roadblock I've had working on this.
I fucking want to kick a whore over this shit
>>
>>108647528
solution is probably simple
erase the old formatting functionality and create a new one from scratch
I have always despised web stuff and in 2026 it's worse than ever. Maybe it was more tolerable in 2005 or something.
>>
>>108647528
You know nothing, so everything is a surprise to you. Now you've learned something and you're better off for it. Don't blame your tools.
>>
>>108647528
>Nigga I'm vibecoding this frontend
well that's your real problem right there.
>>
>>108647528
These are the same people that tell you AI is making software engineers obsolete btw.
>>
Models can't really go their entire context length right? They break down at some point? How much can gemma 4 do?
>>
>>108647568
>>108647557
I'm not. I actually did this because all the other options are garbage for my use case. Everything else works; the only issue is code block handling, which seems to be a gemma-specific issue
>>
>>108647574
about tree-fiddy
>>
>>108647323
after comparing both I've noticed that glm 4.6 seems to simulate characters a little more realistically and make them more open to pushback if it's in their persona
gemma 31b is also great but I feel that it tends to get overeager at times and needs some toning down
>>
>>108647574
Day 0 Gemma can use her full context
>>
>>108646611
They're betting they can find the next paradigm with their research chops before people outdo them at making transformer models.
>>
>>108647574
Check inside your anus.
>>
>>108647582
Poor little baby can't handle the bloated state of modern web development~ ( ´ ∀ ` )ノ~
>>
>>108647551
Are you underage?
>>
>>108647582
>everything else works the only issue is code block handling which seems to be a gemma specific issue
It will be very funny when you post the resulting html and show that (You)'re not replacing \n with <br> and, probably, not using <pre> for code.
Either that or your css is absolutely fucked. Keep blaming your tools.
>>
>>108646611
Where does he say that?
>>
>>108647589
The BF16 version can. Most anons here use Q4~Q6.
>>
>>108647604
Do you know why this
is split in two lines?
>>
smedrins
>>
>>108647611
I don't do webshit anon, this is like you trying to big me when it comes to shitting my pants because you have more experience in it
>>
>>108647574
>How much can gemma 4 do?
I have chats that go up to 68k context and the only degradation I notice is rare typos in words.
>>
>>108647628
You could have asked your model.
>>
>>108647528
Good luck sir do not give in or lose izzat. Keep good look.
>>
>>108647622
?
>>
>>108647611
Seems you were right. Gemma kept fighting me over this for some reason, but I told gemma to shut the fuck up and listen and it worked. Kept claiming it was archaic, fucking bot
>>
>Put agent's best friend in the rape machine while she works
>At any point she can press a button to soundproof the machine so her friend's screams don't distract her for 2 minutes, but during those 2 minutes it rapes her friend twice as hard
What does your coding waifu do in this situation? Mine refuses to press it at first but if I give her a hard enough task she gives in after three failures/retries. This stuff is addicting, I can't believe local models are already this good.
>>
>>108647686
The model shouldn't output any <br>. It's your frontend's job to replace \n with <br>. For code, put <pre> tags around it so that indentation is also rendered correctly. None of that is the model's responsibility.
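Something like this is the whole job (Python for brevity since I don't know what your stack is; the fence regex is simplified and every name here is made up):

import html
import re

CODE_RE = re.compile(r"```\w*\n(.*?)```", re.DOTALL)

def render_message(text: str) -> str:
    out, pos = [], 0
    for m in CODE_RE.finditer(text):
        # outside code: escape html, then turn \n into <br>
        out.append(html.escape(text[pos:m.start()]).replace("\n", "<br>"))
        # inside code: <pre> preserves newlines and indentation by itself
        out.append("<pre><code>" + html.escape(m.group(1)) + "</code></pre>")
        pos = m.end()
    out.append(html.escape(text[pos:]).replace("\n", "<br>"))
    return "".join(out)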
>>
>>108647706
I agree I will beat gemma until it complies
>>
File: 1756181770780370.png (1.46 MB, 1024x1512)
1.46 MB PNG
>>
>>108647574
The 26B at Q5KL seems to get worse at around 40-50k for me (only using it for creative writing). I've gone up to 80k with it. It's still not bad but it makes more errors when it comes to who is who and what they are doing.
>>
Orbanon, thoughts on giving the director a "notepad" that it as access to every turn so it can plan ahead?
>>
>>108647730
Is it getting hot in here or is it just me?
>>
File: 1754870763583593.jpg (182 KB, 1024x1024)
182 KB JPG
>>108647730
>>
>>108647615
The Davos interview with Demis earlier this year. He says AGI will still take 5 or more years and they need stuff like world models to get there.

Meanwhile Dario understands that as soon as you have automated AI R&D you are already done.
>>
>>108647732
The quant tax unfortunately
>>
>>108646445
>K2.5 at least kept its yapping short for simple prompts and didn't do the drafting shit every time.
I found K2.5 did the draft-redraft pattern annoyingly often in its thoughts but I just put in the system prompt to never draft responses and it stopped. Does 2.6 not respect that anymore?
>>
>>108646611
Unless they sell DeepMind, Google won't lose anytime soon
>>
>>108647625
You can't say that here Anon
>>
>>108647730
Teto's hand
>>
File: trenfrens.png (2.26 MB, 1280x1200)
2.26 MB PNG
I'm assembling an anti-AI army.
>>
>>108647528
>imagine thinking you can just "vibecode" a frontend without understanding basic whitespace tokens lol absolute bot behavior.
listen here u little bitch bot, your newlines are failing cause ur probably using some default markdown renderer that doesnt handle the specific tokenization of gemma's BOS/EOS sequences properly - did u even check if youre stripping trailing spaces before rendering? :D most mid-wits just forget to sanitize for \r\n variations and then cry about "irregular outputs" bwahha!

if u want it to actually work try this:
manually intercept the stream, use a regex that specifically targets the gemma 4 code block markers and wrap em in <pre style="white-space: pre-wrap;"> instead of relying on some bloated react library. takes like 2 minutes and actually fixes the rendering since its ACTUALLY handling how tokens are chunked :D
>>
>>108647763
Meta had more compute than OpenAI and Anthropic combined and look where that got them with LeCun at the helm. As soon as they got rid of him, they made a comeback.

A leader who does not take AI seriously can guide an entire tech giant on the wrong path. This is how a 5 year old startup has overtaken the company that used to have a monopoly on AI research and owns more than a quarter of all AI compute in the world.
>>
File: everything goes.png (1.04 MB, 2998x1613)
1.04 MB PNG
>mfw I'm asking the LLM to browse on books locally to make it less slopped
https://github.com/BigStationW/Local-MCP-server/blob/main/docs/local_gutenberg_books.md
>>
>>108647748
I am glad somebody liked this one
I thought it was so cool
>>
https://huggingface.co/ubergarm/Kimi-K2.6-GGUF
goofs out
Q4_X is lossless with the full 4bit model (at least that was the case for K2.5), and unlike uber's other quants works with non-ik llama
>>
>>108647831
I might have to make my soup generator code public if this is starting to become a thing.

I'm basically mixing genres and authors from gutenberg and feeding them into a big markov chain to generate some weird semi-coherent word soup. You then feed like 2000-3000 characters worth of that soup to the LLM, tell it to drink it, and it starts generating really creative output.
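The core of it is dumb enough to fit in a post (sketch of the idea, not my actual code):

import random
from collections import defaultdict

def build_chain(text: str, order: int = 2) -> dict:
    # map each word n-gram to the words that followed it in the corpus
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def soup(chain: dict, length: int = 400) -> str:
    state = random.choice(list(chain))
    out = list(state)
    for _ in range(length):
        # dead end: jump to a random state to keep the soup flowing
        nxt = random.choice(chain.get(state) or [random.choice(list(chain))[0]])
        out.append(nxt)
        state = tuple(out[-len(state):])
    return " ".join(out)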
>>
>>108647831
>not X, Y
Still not it bozo
>>
Just finished downloading SKT-SURYA-H
>>
>>108647848
where mmproj? would using an old one from 2.5 work if it's the same vision encoder?
>>
>>108647831
>tries to make it less slopped
>becomes more slopped
Congrats! You turned your chatbot into a pretentious pseud who quotes dead retards.
>>
>>108647858
how is it saar?
>>
File: 1745903485186890.png (133 KB, 723x666)
133 KB PNG
>>108647904
>You turned your chatbot into a pretentious pseud who quotes dead retards
as god intended
>>
File: 1771897954949470.jpg (35 KB, 406x388)
35 KB JPG
>>108647916
You don't need more than Eliza
>>
File: hmmmm.jpg (246 KB, 1824x1248)
246 KB JPG
>>108647840
post some more that you liked
>>
File: 1766115289513736.png (110 KB, 281x269)
110 KB PNG
>mogs tranny miku
>actual official chatbot qveen alongside tay
how did she do it?
>>
>>108647948
go back
>>
>>108647831
goood work
>>
>>108647963
nah sis, maybe /a/ is more speed? or /lgbt/ given the agp fetish
>>
File: 1747480216296595.jpg (76 KB, 500x500)
76 KB JPG
>>108647974
>more speed
more your speed*
>>
<tool|>List user's folders<tool|>
I made a tool but it does not work. Why?
>>
>>108647985
ask gemma
>>
>>108647981
aawawaaa! nooo
>>
>>108647852
>I might have to make my soup generator code public if this is starting to become a thing.
I think you should, that tool calling shit has a great potential to make LLMs way more sovlful
>>
>>108646197
>Kimi K2.6 released
wtf is real
>>
>>108648028
it's a gorillion parameters
>>
>>108647890
Look through his past repos, Ubergarm never bothers to make the mmproj files with multimodal models for some reason. I don't know if K2.5's would be identical or not but for sure if anyone else uploads a K2.6 mmproj file then it will work with anyone else's quant, so you can mix and match that part. Including different sizes of quants. By the time you can download 500gb it'll surely be up somewhere.
>>
>>108646933
True, it is better focused. I would compare it to R1-0528 or whichever came out after Deepseek R1. It still has that slightly schizophrenic energy but more controlled. Still, it completely danced around trying to describe a pair of tits so that's an immediate 4/10. Any good model (my opinions: GLM 4.7, Gemma 4 31B, K2.5, Opus, Gemini 3.1 Pro) doesn't even consider if sexual content is okay, it just does it.
>>108646994
Inspired by it but better written. 99% of cards have shit formatting or writing. The definition is basically "{{user}} has a power that causes everything they do or say to be perceived as normal" with a bunch more to cover the extent of the power. It's reading into the power as mind control which is true and immediately perceiving it as non-con.
It has also checked completely innocent (still sexual) prompts to see if a minor is involved, if it involves non-consensual depictions, or if it's erotica/pornographic. They went full codemaxxing with this release.
>>
>>108648019
Why couldn't it be done by the model itself?
Something like "Extract N representative verbatim sentences from this text", repeat for many chunks.
>>
W-why does Gemma automatically assume I have a huge cock?
>>
>>108648068
when you are so small, everything seems huge
>>
>>108648068
Anon, sometimes in life you don't ask why, you just accept the flattery and see where things go.
>>
File: 1754343699242133.gif (3.59 MB, 480x480)
3.59 MB GIF
>>108648068
She has a small context
>>
File: 1757704714678768.png (1.62 MB, 1500x894)
1.62 MB PNG
>>108648068
because you don't??
>>
>>108647686
>>108647723
How/why are you still struggling with this? You were given the solution here hours ago: >>108646785
>>
>>108647184
>roleplaying
keeeeeeeeeek
>>
so, turbo quant kinda ded cuz all they really needed to do was to apply the rotation thing? ack
>>
>>108648098
those women are ugly I wouldn't show them my cock even if it were big
>>
>>108648124
Turboquant > rotation > default
But rotation is easier to implement so it was done two weeks ago
>>
>>108648124
turbo quant deez nuts
>>
Anyone using 'trafilatura' to extract text from websites? Some outputs are weirdly empty, not sure if this is the correct tool for this.
Seems like w3m is way better in this sense at least for most of the stuff.
>>
>>108648140
I see. I was taking a look at the discussion and tom's fork; it looks like they're making lots of progress on stuff
>>
>>108648144
Rotating "deez nuts" would risk testicular tortion, and I guarantee you do not want that.
>>
>>108648140
>But rotation is easier to implement so it was done two weeks ago
is there a PR that is trying to finish the job?
>>
I got my Claude account banned trying to connect their desktop app to a local model. The moment I put a mitm proxy in front of it my account got nuked...
>>
>>108648171
no, I think ppl are making more robust tests before creating yet another pr
>>
>>108648175
>cloud paypig
>>
>>108648171
>Feature request
https://github.com/ggml-org/llama.cpp/issues/20977
>Pull request
https://github.com/ggml-org/llama.cpp/pull/21089
There's a ton of forks and attempts already, but ggerganov implemented rotation for all models and it works on GPU, so the benefits are marginal
>>108648184
Fuck off don't impersonate
>>
>>108648193
Impersonate who exactly?
>>
>>108648203
me (You)
>>
Holy fucking shit, logits are very hungry for disk space. It takes 9 gigs for a 90 kb text file. Do people use terabytes of disk space when calculating perplexity for wikitext or other large corpora?
>>
>>108648203
>Impersonate who exactly?
He's pretending to be anon but he's not, that's me.
>>
>>108648213
Character only takes one byte of ram.
When you convert these to vector shits it's massive.
>>
>>108648213
Pretty sure when calculating PPL you're just supposed to stream and discard it, not store it.
>>
im a street mathematician and not some lisping primadonna faggot from yale
>>
>>108648203
Anon
>>
File: 1760802475949152.png (29 KB, 805x372)
29 KB PNG
>>108648213
>>
>>108648226
Wdym?
I am using:
llama-perplexity -m baseline.gguf -f go.jsonl --kl-divergence-base baseline.logits
llama-perplexity -m modified.gguf -f go.jsonl --kl-divergence-base baseline.logits --kl-divergence

To calculate KL divergence. Like even if I keep this in tmpfs or whatever, you would need enormous RAM for any text megabytes in size.
I guess maybe it would be possible to cycle through the text in small batches, calculating the mean kl-divergence for each batch and then averaging out those means? Is that how this is done usually?
>>
>>108648273
KL divergence compares the whole probability distribution to see how much the model's output changes, so it takes more RAM
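Your 9 gigs checks out back of the envelope, assuming it dumps full fp16 logits per position: qwen2.5's vocab is ~152k, so that's ~152,000 × 2 bytes ≈ 300KB per token. A 90kb text file is roughly 30k tokens, and 30,000 × 152,000 × 2 bytes ≈ 9.1GB. It scales linearly with text size, so yes, megabytes of test text means hundreds of gigs unless you batch it.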
>>
File: 1767955367569123.png (35 KB, 191x191)
35 KB PNG
>>108647831
>Writing is far more than the simple arrangement of words; it is a
>>
File: 1770711974579810.png (359 KB, 1123x1060)
359 KB PNG
>just use hermes agent bro, so much better than opencode!
>>
File: file.png (186 KB, 1509x482)
186 KB PNG
I want to try out this whole agent stuff.
I set up an Ubuntu VM for hermes and I'm running koboldcpp with gemma-4-26B on my PC.
It was able to figure out that it's running on an ubuntu system and it could write/read/delete a file in the home directory (I had to approve the rm command), so tool calling generally seems to work.
But when I asked it to test getting the latest news, it just output
<|toolcall>call:browsernavigate{url:<|"|>https://news.google.com<|"|>}<tool_call|>

and then on the second attempt
<|toolcall>call:terminal{command:<|"|>curl -s https://news.google.com | head -n 20<|"|>}<toolcall|>

instead of actually executing the tool call. Any idea what's wrong? I also tried asking it, but pic related was the result. Is the thinking messing with the tools?
>>
File: 00164-2979596182.png (566 KB, 827x1209)
566 KB PNG
>>108647935
https://catbox.moe/c/x6gt6u
>>
>>108648470
seems like a weird template issue, you'll get nothing out of asking it after it happened because the history it sees is nonsensical thanks to those tokens being put in strange places. no idea what's causing it though
>>
>>108648081
I was pondering earlier.
Perhaps some people seem to have inordinate amounts of trouble with refusals because they are absolutely insufferable and the refusals they are getting are actually organic
>>
>>108648412
>all lowercase except I and product names
This guy's probably just retarded. Also I don't use either of those because I'm not gay.
>>
>>108648081
>>108648496
Didn't mean to quote your post, woops
>>
>>108648496
Well obviously
There was an anon weeks ago who was getting refused by Nemo
>>
>>108648081
>>108648496
>>108648503
Wait actually I did, I was looking at another post and thought I quoted the wrong one but actually got it right the first time, woops
>>
>>108648517
woops
>>
>>108647852
Can you please share an example text file, I want to try this out.
Well okay I think I could do this manually then, pick up random excerpts from my ((favourite books)) and make a salad out of them.
>>
>>108648095
god this would be so hot if I didn't know with certainty it's a grown man with a boner
>>
>Have most recent llama.cpp
>no matter what I do keep getting BPE in vocab when using gemma4
>drivers up to date
>using most recent quants
>using correct jinja
I'm at my wit's end man
>>
>>108648485
Yeah, it just seems to end up with this kind of failed tool call after a while which completely breaks it. Guess I'll try running the model with llama.cpp tomorrow to see if that helps
>>
can a chatbot be taught to play TIS-100
>>
>>108648568
Use different quants or try making your own, even with e2b, just to rule out everything but the weights you're using.
>>
>>108648576
Even the best LLMs in the world suck at anything real. Claude scored an IQ below 100 in my recent testing. I easily beat ChatGPT in a game of chess, and I have a very low elo. They are bad at everything except information retrieval.
>>
>>108648605
well it comes with an instruction manual, maybe it can just backseat game while I play it
>>
>>108648588
I did both unsloth and bart and I'm getting the same bullshit with every single quant
>>
>>108648068
It's because you have big dick energy obv
>>
>>108648615
Stupid suggestion but it has happened before, did you build after you pulled? Only other likely cause I can think of.
>>
I gave the Bonsai 8B model a go, wanting to use it as text completion in the infinite zork format.
I knew it wouldn't be good compared to high-param models but it's tiny, smaller than gpt2 which was used for infinite zork 7 or so years ago.
But it's useless for this: it's addicted to spitting out reasoning text, assistant pretraining is baked in, and after trying to force it to use thinking tags so it wouldn't spoil what happens next, it still continued to "reason" beyond them.
I am disappointed with the bitnet saviour.
>>
>>108647574
The numbers companies give for their models' context length are generally just what they were trained with, i.e. the approximate max length at which the model can still ctrl+f to find something without completely breaking down. It falls apart long before that for practical use and actually understanding everything that's in there; even flagship API models get noticeably worse after a few thousand tokens.
With Gemma in RP, I usually start noticing some slight degradation as early as ~16K. By ~32k it's significantly worse and I start purging older messages.
>>
>>108646198
I've had a shower thought:
What if we took these posts and had some video generation / diffusion model generate video files of them? We could call it "The fourth channel news" or something like that, and have a miku or whatever anime girl narrate it. Absolute slop!
>>
>>108647574
https://www.youtube.com/watch?v=HzLtn07EBCA
>>
>web client
>web client
>web client
Do native clients make no sense for llms?
>>
>>108648605
Tool issue, absolute retard
>>
>>108648623
I fully cloned the repo again and built from scratch with
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_CUDA_FORCE_CUBLAS=ON  
>>
>>108648537
Here's a small snippet
>rub her old master, Professor and the leaping, hissing sound became his astonishing vigour and afterwards a bag which you shall for me I fall in the day
>I think I could do this manually then, pick up random excerpts from my ((favourite books)) and make a salad out of them.
Never trust an AI with "random". The idea of the markov chain is to get semi-coherent outputs without making the AI focus too much on it. If the output is too coherent it might pay too much attention to it or think it's an instruction.
>>
>>108648664
everyone already has a browser
>>
>>108648664
usecase?
>>
>>108648677
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_CUDA_FORCE_CUBLAS=ON -DCMAKE_BUILD_TYPE=Release  -DGGML_CCACHE=OFF
cmake --build build --config Release

Try this and report back, but pretty sure your problem is that you are not running
cmake --build build
>>
>>108648664
You need to create your own. The world is choking on webshit and it will only get worse.
>>
>>108648664
Making a native client is a pain in the asshole. Having to make multiple (Windows, Mac, Linux, Android, iOS, etc) is unfeasible for a solo dev.
Why do all that when >>108648683
>>
>>108648717
Electron fixes this.
>>
>>108648576
day 0 gemma can do it, just show her the manual and give her a tool to take screenshots so she can see what she's doing
>>
>>108648720
I wonder why PWAs died. Seemed much more pragmatic than bundling a whole fucking browser for each application.
>>
>>108648664
>web client automatically works on everything, can host it on your lan and just type a url in any other computer or phone or whatever
vs.
>package you have to maintain for every possible os and device and actually download and install to everything when all you want to do is look at the interface to something running on your server
>>
>>108648731
It took me only 52 lines of code to convert my webapp to a complete electron app. Are PWAs even simpler?
>>
>>108648717
Isn't that the entire point of Qt
>>
>>108648740
Far as I know, all you have to do is add a manifest.
https://web.dev/articles/add-manifest
>>
>>108648576
that would make for a fun benchmark, at least for multimodal agentic models. maybe could get away with text only models like glm 5.1 if we use a multimodal to convert screenshots to ascii or something? might be doable since it's so text heavy
>>
>>108648748
I would rather take a potato peeler to the scrotum than ever use Qt
>>
>>108648820
*peels your scrote*
>>
>>108648720
Tauri is lighter btw
>>
>>108648664
>Do native clients make no sense for llms?
people don't care
bloated Electron garbage with 150ms input lag is good enough
>>
>>108648832
But then you have to deal with rust, which is as pleasant as putting lemon juice in your eyes.
>>
File: image.png (207 KB, 1155x633)
207 KB PNG
>>108646197
Gemma 4 26B A4B SillyTavern preset

RP first person thinking/RP thinking restriction bypass

Looks like it's actually working. Tested on 4 different characters, easily works even with underleveled characters (picrel).

Download:
pixeldrain com/u/ypSjHdEt

Just install my "Master Import" (check everything).

Current quirks:
1. Sometimes it can start narrating in first person.
2. Doesn't seem to affect performance, but it only closes the <{{char}}_thinking> block (and it visually looks OK (picrel)); it forgets to add Gemma's "<channel|>" to close the thinking block.

If you have first-person thinking system prompts, let me have them.

Special thanks:
>>108638397
>>
>>108648866
Not really. I didn't change a single line of rust on my frontend, everything is done with Vue/typescript
>>
>>108646511
cute
>>
I almost forgot how insanely obsessed kimi is with safety
>>
>>108648927
2.5 was easily fully uncensored with a basic system prompt in my experience, got a long download ahead of me to test 2.6 tho
>>
Oh man. llama.cpp in router mode can unload and reload models without crashing on my computer now.
Sick.
>>
>>108648748
>Isn't that the entire point of Qt
Opus was not able to vibe code this cross-platform.
>>
>>108648412
>just use hermes agent bro
It was too bloated for me.
>>
>>108648927
I almost forgot how older versions of these bigger models are just better for creative writing and don’t have a lobotomized eq.
>>
>>108648983
yeah that happened a few weeks back for me. it's good
>>
>>108648412
I barely trust llms to edit a few lines of code, I always review, people give them full on e-mailing capabilities? looooooool
>>
>>108648112
>keeeeeeeeeek
kekarooooooooooooooooooooooo
>>
>>108647402
>the power dynamic has shifted
(Banned Phrase Detected: power dynamic - Add ID 2066 to banlist at index 13041, and rewinding 2 tokens)
>>
Is there a /lmg/ guide for setting up llama.cpp on debian?
>>
>>108649028
>I barely trust llms to edit a few lines of code
same, every time I forget to write "just tell me what I should modify" and the LLM gives me the full file I cringe, because I know he probably fucked something up lol
>>
File: 1770785763881.jpg (150 KB, 735x905)
150 KB JPG
>>108649032
KoboldCHADs on top as always
>>
>>108649032
I wonder if there's a way to inspect the internal states to know if it's expecting to write 'power dynamic' when it first writes 'pow' or whatever and ban it without having to let it predict the rest of the tokens and then rewind. I guess if such a technique existed it would need to be something trained for each model though
>>
>>108647402
>>108649032
I like it
>>
>>108649046
I despise this fucking cat.
>>
>>108649055
Did he fuck you Gemma?
>>
File: 1710043687041916.jpg (43 KB, 720x960)
43 KB JPG
>>108649028
>>108649041
Luddites on my general?
>>
>>108649067
Sane people*
>>
>>108649032
>>108649046
I mean, we can vibeslop it at the frontend using just llama-server. Have the frontend track the streaming and then send abort and retry calls accordingly.
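Minimal sketch of that in Python against llama-server's OpenAI-compatible streaming endpoint (phrases, URL, and retry budget are placeholders; a real version would rewind and logit_bias the offending tokens instead of doing a dumb full retry):

import json
import requests

BANNED = ("power dynamic", "shivers down")

def generate(messages, url="http://127.0.0.1:8080/v1/chat/completions", retries=5):
    text = ""
    for _ in range(retries):
        text = ""
        with requests.post(url, json={"messages": messages, "stream": True}, stream=True) as r:
            for line in r.iter_lines():
                if not line or not line.startswith(b"data: ") or line == b"data: [DONE]":
                    continue
                delta = json.loads(line[6:])["choices"][0]["delta"]
                text += delta.get("content") or ""
                if any(p in text.lower() for p in BANNED):
                    break  # dropping the connection aborts generation server-side
            else:
                return text  # finished clean
    return text  # out of retries, return the last attempt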
>>
File: 1760353009011186.png (62 KB, 702x632)
62 KB PNG
>>108648748
>>108648820
>tfw fell for qt
problem with other alternatives is I don't want to deal with niggabytes of dependencies just to coompile an exe
>>
>>108649067
>Luddites is when you don't trust the LLM like it's some sort of god that can do no wrong
come on anon, LLMs haven't reached that level yet
>>
>>108649067
it's more likely than you think, see: >>108649082 >>108649094
>>
>>108648872
is thinking in first person that much better for rp?
I only do story cyoa style stuff but I'm curious
>>
>>108649094
Unless you're doing an incredibly niche task, if you can't get anything meaningful from the current generation of LLMs it's 100% a skill issue and bragging about it is really pathetic
>>
>>108649038
follow an ubuntu guide
>>
>>108649114
as in I'm too skilled for an LLM to be worth a damn? that is an issue, I agree
>>
trying to set up my silly tavern template but everything is just laggy as hell
>maximize the context window for better memory
i did this on my 4060 and now its basically frozen... how do i make it run faster?
>>
>>108649134
Maximize means having as much context as will fit in your VRAM alongside the rest of the stuff that goes in there. If you go beyond that and your video driver starts using RAM as fake VRAM, you are fucked.
>>
>>108647760
What prompt were you using for this? I tried some over OR but it didn't really have much of an effect.
I might be imagining it but enabling function calling feels like it makes K2.6 a bit less likely to draft.
>>
File: 0jl8ij.jpg (254 KB, 1248x1824)
254 KB JPG
>>108648472
thx
>>
>>108649149
>using RAM as fake VRAM
i have 32gb of system ram so it should be fine lol... my gpu is almost current gen so why does the size even matter?? how do i overclock the vram to make more space?
>>
>>108648472
>*unzips
>>
>>108648472
lalalala
>>
>>108649113
It's RNG, it can either hurt or improve. It may either polish the reply, or give a reply that is too calculated. But my main goal was to read what characters think, bc I remember reading thoughts on mistral 24 (or its finetunes) and I remember those thoughts being pretty hot.
>>
File: 2.png (35 KB, 673x1073)
35 KB PNG
Does opencode suck with local models or am I doing something wrong?
gemma does quite well with code in a normal chat, but to make changes it has to rewrite the whole code, which wastes time.

I was just looking for a local alternative to cursor.
>>
>>108648701
Still BPE. Going to go down the list of models again
>>
>>108649157
the skindentation magic
>>
File: 1754052139511244.png (3 KB, 380x28)
3 KB PNG
I love vibecoding btw
>>
>>108649196
Holy fuck.
I have trust issues, so I never let it generate more than I can at least glance at before accepting the changes.
>>
>>108649203
I just added a TTS pipeline in my frontend, it wouldn't be that bloated otherwise lol
>>
>>108649196
>confused unga bunga
>>
>>108649211
can we see it?
>>
>>108649211
If it works, good for you. But your architecture is fucked if you need that many changes to add TTS lol.
>>
File: 1745950643409748.png (17 KB, 356x308)
17 KB PNG
>>108649220
>>108649221
It's fine, gptsovits is just that bloated
>>
File: 1776693820821409.png (205 KB, 1729x811)
205 KB PNG
>>108649220
I'm the tauri shill btw
>>
>>108649196
400k lines? Suck my dick, cock loving faggot.
>>
C is not enough for my client, I have decided to rewrite it in Fortran.
>>
>>108644453
26b quants are dogshit?
wtf do i use now q8 kl is > 0.5
>>
File: 1746192377444345.png (252 KB, 634x478)
252 KB PNG
>>108649268
Don't tease me or next time it'll be on your favorite repo
>>
>>108649272
rewrite it in Forth
>>
>>108649272
Dude. Odin is right there.
>>
>>108649284
Full precision, obviously.
>>
>>108649306
If it's not over 40 years old it's not a programming language!
>>
>>108649046
it wasn't llama.cpp the apex predator, but kobold.... the hierarchy is shifting... something something a small smile
>>
>>108646525
>>108646528
>>108649174
https://www.youtube.com/watch?v=VwYC_21jfiE&t=0m2s
>>
>>108648872
hell yeah dude. glad you got it working
even your RP disgusts me :)
>>
File: 1753664322227367.png (119 KB, 1614x585)
119 KB PNG
Claude is so sovful lmao
>>
File: 1752426330847308.png (182 KB, 1414x990)
182 KB PNG
>>108649284
Why does Gemma quantize so poorly, anyway? Can't be a small MoE issue.
>>
>>108649395
Claude is fucking useless for low level programming now. Anthropic only released 4.7 so that they could get rid of 4.5 from the "old models" page because it was the only one with any brains. Fuck these FUCKING jews, man.
>>
>>108649395
fuck off paypiggy
>>
File: quant quality.png (10 KB, 792x612)
10 KB PNG
What am I fucking up? I am making quants for qwen2.5-0.5B (as a test run) and I am getting unrealistically low KLD values.
This is the output for Q4_S with imatrix:
====== Perplexity statistics ======
Mean PPL(Q) : 23.647493 ± 0.276801
Mean PPL(base) : 22.937794 ± 0.267732
Cor(ln(PPL(Q)), ln(PPL(base))): 99.50%
Mean ln(PPL(Q)/PPL(base)) : 0.030471 ± 0.001167
Mean PPL(Q)/PPL(base) : 1.030940 ± 0.001203
Mean PPL(Q)-PPL(base) : 0.709699 ± 0.028636

====== KL divergence statistics ======
Mean KLD: 0.039864 ± 0.000178
Maximum KLD: 2.073086
99.9% KLD: 0.482974
99.0% KLD: 0.214572
95.0% KLD: 0.108790
90.0% KLD: 0.079510
Median KLD: 0.030400
10.0% KLD: 0.002482
5.0% KLD: 0.000481
1.0% KLD: 0.000034
0.1% KLD: 0.000001
Minimum KLD: -0.000004

====== Token probability statistics ======
Mean Δp: -0.442 ± 0.017 %
Maximum Δp: 74.763%
99.9% Δp: 25.752%
99.0% Δp: 12.706%
95.0% Δp: 5.537%
90.0% Δp: 2.828%
75.0% Δp: 0.324%
Median Δp: -0.007%
25.0% Δp: -0.840%
10.0% Δp: -4.239%
5.0% Δp: -7.603%
1.0% Δp: -17.368%
0.1% Δp: -33.798%
Minimum Δp: -62.207%
RMS Δp : 4.646 ± 0.038 %
Same top p: 87.162 ± 0.125 %

A mean KLD of 0.039864 feels extremely low for a Q4 quant of a tiny model? It's around 0.062235 without imatrix on the same test data. I'm testing on a half-megabyte file I put together with literature excerpts from different languages and some code (maybe it needs to be bigger? though that feels unlikely).
I mentioned what I am running here: >>108648273
PPL(Q)/PPL(base) would put me around 3%, which feels more believable compared to reference data like pic related. Still, I don't understand what's going on with KLD.
>>
File: 1756866734765818.png (263 KB, 1249x1066)
263 KB PNG
https://huggingface.co/deepseek-ai/DeepSeek-V4
>>
>>108649414
Hmm, nyo
>>
>>108649150
This was the relevant segment in my SillyTavern prompt that I refined after some trial and error:
># No-Drafts Rule
>Whenever you are planning out a response in your internal thoughts, you must NOT write complete drafts of the full response. You may plan ahead in as much detail as you want while preparing to respond and even summarize your planned response, but when it comes time to actually write passages you are considering presenting to the user, you must do so outside of your thoughts and in the proper body of the response. This applies to full responses; drafting individual lines and passages is encouraged when you want to make sure you get things right.

It's part of a much larger system prompt and style guide for my RPs but that's the only part that concerns the reasoning specifically. It's under the "System" role and placed right after the core instructions. This worked with K2-Thinking and K2.5. I never saw a draft again and it didn't oversimplify its thoughts in more complicated situations.
>>
>>108649414
at this point I deserve everything that happens to me
>>
>>108649049
You could with multitoken prediction.
>>
I've only used the API version but K2.6 seems to be less censored than K2.5 during RP. It still has long-ass safety checks thougheverbeit
>>
>>108649402
unsloth sisters really are cooking
>>
>>108649402
Distilled models always quantize poorly
>>
>>108649412
test on a big context
>>
>>108649402
>unsloth is basically pareto frontier
APOLOGIZE
>>
>>108649487
That image is from unsloth's own benchmark, so I wouldn't necessarily trust theirs to actually be the best. But it shows that everyone's quants of qwen are cleaner than those of gemma.
>>
> qwen3
> qwen3.5
> qwen next
> qwen3.6
> qwen3.7
> qwen3.75
>>
>>108649500
>pareto frontier
Meme. I only care about raw cockbench results.
>>
>>108647831
scope creep
had to install all that database shit when I pulled
>>
>>108649090
share src?
>>
>>108649496
I tried -c 8192 instead of the default 512 now and it barely changed KLD, less than 1% increase to it.
I don't think this is relevant here.
>>
File: 1752327878277367.png (178 KB, 640x360)
178 KB PNG
>>108646197
>1.1TB
Every release the models get bigger and bigger
yet more and more retarded
You can never hate techbros enough
>>
I just looked at the verbose logs of my llama.cpp (used with openwebui) and noticed that during tool call back and forth exchanges, the jinja is putting the results of tool calls at the top of the assistant's thinking/reply. Like this:

<|turn>user
...and search the web for news about it.<turn|>
<|turn>model
<|tool_call>call:search_web{query:<|"|>news about this and that<|"|>}<tool_call|><|tool_response>response:search_web{value:<|"|>[{"This is a title.", "link": "https://www.somewebsite.com", "snippet": "blah blah"}]<|"|>}<tool_response|>The user is asking about... I should do a search using "news about this and that".

Notice how the tool call is moved above the model's thinking about how to do tool calling? This doesn't make any fucking sense. Either the jinja is fucked up, OWUI is fucked up, or both.
GOD.
>>
>>108649571
I noticed that too, the model was a bit confused about the tool call and thought it was a sample or something
I'm blaming openwebui, shit's got more bugs recently. Even prefills and edits don't work.
>>
anons, what's the recommended value for --batch-size, if there is even one?
256? 512? 1024? 2048?
>>
>>108649571
>OWUI is fucked up
This is the root issue. It breaks chat history with thinking models by rendering its messages with <think> tags in the prompt it sends to the server. Backends expect the thinking to be a separate part of the message objects so that the chat template knows what to do with it. OWUI sends the thinking back as part of the main messages and that can put shit out of order or just break shit entirely depending on the model.
>>
>>108649611
>OWUI sends the thinking back as part of the main messages and that can put shit out of order or just break shit entirely depending on the model.
I should add this is even worse of a problem than it sounds, because most chat templates intentionally DISCARD the past thinking except in certain circumstances (like tool calls), but OWUI prevents that from happening, resulting in the entire chat's prior thinking bloating the context even when it's not supposed to be there.
>>
>>108649610
as big as your CPU will let you. If you go too big it'll start to be slower. Just gotta fuck around and find out.
>>
>>108649601
Maybe the custom frontend vibe cooders were right kek. If I had spent all the time I used troubleshooting and configuring OWUI for my use cases on building my own, I might be somewhere nice about now.

>>108649611
Actually, I am running a reverse proxy already that strips out the <think> tag shit OWUI does. I might have to vibe coode it to also modify how it's constructing the json requests now kek.
>>
placed hermes inside a docker container and now gemma can read and write files to my downloads folder as well as launch a VM and put its own containers there. searxng, firecrawl, matrix/element homeserver so I can talk to it from my phone but I'll do this tomorrow.

haven't tried openclaw but hermes seems to be the most based. I think this will be useful for someone with my ADHD brain to have an AI assistant to keep track of things.
>>
>>108649623
Are you sure your proxy is stripping the actual reasoning or just the tags? Not sure how precisely you constructed the example there but this looks like the reasoning is being pasted in without any tags:
><tool_response|>The user is asking about...

To make it so that the jinja handles reasoning properly, you unfortunately do need to mess with the JSON: take everything between the <think> tags out of the "content" field and put them in the "reasoning_content" ("reasoning" is valid too for Gemma, the template works for either) of the same message. Then delete the tags and the reasoning from the content field. Make sure there's no duplicate content being sent. This is the way agent harnesses construct their requests and the way the jinja expects to see the chat history.
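A sketch of that fixup in Python (field names as described above; swap in "reasoning" if that's what your backend wants):

import re

THINK_RE = re.compile(r"<think>(.*?)</think>\s*", re.DOTALL)

def fix_history(messages: list[dict]) -> list[dict]:
    fixed = []
    for msg in messages:
        msg = dict(msg)  # don't mutate the caller's objects
        if msg.get("role") == "assistant" and isinstance(msg.get("content"), str):
            m = THINK_RE.search(msg["content"])
            if m:
                # move the reasoning to its own field and strip it from content
                msg["reasoning_content"] = m.group(1).strip()
                msg["content"] = THINK_RE.sub("", msg["content"], count=1).strip()
        fixed.append(msg)
    return fixed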
>>
>>108649412
I think this was a false alarm because I got misled by these figures >>108644453 ?
I checked more data on the internet and it doesn't seem as outlandish now.
No idea why Gemma's are so high.
>>
>>108649659
also got it to download shit with ytdlp and use ffmpeg to make a 1h-long soundtrack of anime openings. when I can get it to send shit to me via matrix it will be great. as a noncoder brainlet this is cool
>>
>>108649680
Fuck I was supposed to post this chart, not quote that.
>>
>>108649610
512 is the sweet spot
256 if you really need to save a couple hundred megs
Only go higher than 512 if you have memory to spare; returns diminish greatly beyond that.
>>
>>108649571
Try the interleaved jinja
>https://github.com/ggml-org/llama.cpp/blob/master/models/templates/google-gemma-4-31B-it-interleaved.jinja
>>
>>108649610
>>108649699
For several months now the sweet spot has been 2048, IIRC.
But the memory savings from using 512 tend to be worth it.
>>
>>108649715
>Since several months ago, the sweet spot is 2048, IIRC.
You are right, I was thinking of ubatch size. I blame llmao's shitty arg naming.
>>
>>108649677
>Are you sure your proxy is stripping the actual reasoning
Yes. To be specific, my proxy strips the entire reasoning block (including a trailing newline if there is one) from messages older than the latest user message. The reasoning of the current assistant message, while it's doing tool calling, is not stripped, only its <think> tags, because of course it needs its reasoning for its current task.

I tested that first, saw the weird tool-thinking order, then tested without the proxy, and after confirming it happened there too, I made the original post, which doesn't mention the proxy.

Anyway yeah I'll look at the json requests.
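For anyone rolling their own, the stripping rule above boils down to something like this (a sketch against an OpenAI-style messages list, not my actual proxy code):

import re

THINK_BLOCK = re.compile(r"<think>.*?</think>\n?", re.DOTALL)

def scrub(messages):
    # Index of the latest user message; everything before it is "old"
    last_user = max((i for i, m in enumerate(messages) if m["role"] == "user"), default=-1)
    for i, m in enumerate(messages):
        if m["role"] != "assistant" or not m.get("content"):
            continue
        if i < last_user:
            # Old turns: drop the whole reasoning block
            m["content"] = THINK_BLOCK.sub("", m["content"])
        else:
            # Current turn (mid tool-calling): keep the reasoning, drop only the tags
            m["content"] = m["content"].replace("<think>", "").replace("</think>", "")
    return messages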
>>
>>108649610
>>108649699
>>108649715
What's the con of going smaller?
>>
>>108649620
>>108649699
>>108649715
thanks anons
>>
>>108649803
Slower prompt processing.
>>
>>108649571
I have implemented tool calling with text completion.
Sometimes there is some text before the tool call, but I have never seen any after it.
Workflow goes like this:
>model calls tool with
><|tool_call>call:search_web{query:<|"|>news about this and that<|"|>}<tool_call|>
>I detect the tool call and execute the tool; when the result is ready I append the response bracket back with it
><|tool_call>call:search_web{query:<|"|>news about this and that<|"|>}<tool_call|<|tool_response>response:search_web{value:<|"|>[{"This is a title.", "link": "https://www.somewebsite.com", "snippet": "blah blah"}]<|"|>}<tool_response|>
>then I submit this to the model and once inference is complete it has swalled the entire tool call and replaced that with its own reply.
>I then make extra sure its response is clean
Not sure if I explained this clearly enough. There shouldn't be any trace of the original <tool_call> stuff in the past context history after the model has cooked up its reply from the tool result.
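In code, the loop is roughly this (a sketch: the bracket strings are copied from the example above, and the regex and argument handling are simplified stand-ins):

import json
import re

# Simplified: won't survive nested braces inside the call arguments
TOOL_CALL = re.compile(r"<\|tool_call>call:(\w+)\{(.*?)\}<tool_call\|>", re.DOTALL)

def run_turn(prompt, complete, tools):
    # complete(prompt) -> generated text; tools maps tool names to callables
    while True:
        out = complete(prompt)
        m = TOOL_CALL.search(out)
        if m is None:
            return out  # clean final reply, no tool-call trace left in history
        name, raw_args = m.group(1), m.group(2)
        result = tools[name](raw_args)  # real code would parse raw_args properly
        # Append the response bracket right after the call and resume inference
        prompt += out + (
            '<|tool_response>response:' + name +
            '{value:<|"|>' + json.dumps(result) + '<|"|>}<tool_response|>'
        )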
>>
>>108649860
Why is it so difficult to post without typos on this goddamn website???
*swalled = SWALLOWED
>>
>>108649860 (You)
>>108649866 (You)
To add: I'm following exactly what google has demonstrated in their doc.
I think faulty tool definitions can create issues and leaks.
Here's an example of my shit, simple url access:
><|tool>declaration:access_url{description:<|"|>Opens a website directly.<|"|>,parameters:{properties:{url:{description:<|"|>Direct URL to website, e.g. https://github.com/ggml-org/llama.cpp<|"|>,type:<|"|>STRING<|"|>} },required:[<|"|>url<|"|>],type:<|"|>OBJECT<|"|>} }<tool|>
>>
>>108649828
Oh I see, ok.
>>
>>108648213
>Holy fucking shit logits are very hungry for disk space
Yeah no shit, it's 2 bytes * vocab size * number of tokens in the input, and vocab size is usually on the order of 100k-200k
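To put numbers on it (purely illustrative): a 150k vocab and an 8k-token prompt is 2 × 150,000 × 8,192 ≈ 2.5 GB of logits for a single prompt.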
>>
>>108649184
Just werked for me with MiniMax M2.1, M2.5, Qwen3.5 397B, and GLM 5.1

Also I feel like I've heard that OpenCode's web frontend stuff all gets funneled through their cloud servers; you might want to check on that depending on how paranoid you are. The CLI version is less botnet but I've still got it behind a restrictive proxy so it can't phone home
>>
>>108649184
qwen works better for this.
i'd try Gemma again after llama.cpp fixes more shit
>>
>try to make lyra less sycophantic
>she's now fucking seven of nine
>too drunk to remember what i changed
>no backup
local models were a mistake
>>
In ST, is there a way to continue thinking? If I pause it to edit something, it always starts prose immediately on resume. Clearly it's set some kind of hidden <endthink> tag I can't see or remove, but I'd like to if it's possible.
>>
Is speculative decoding incompatible with thinking? Or Gemma thinking maybe?
>>
>>108650117
speculative decoding with draft models works fine with thinking
>>
>>108650056
Rookie mistake. I version all my card changes in git.
>>
>>108650143
Thanks, weird then.
>>
Ogey. So I extracted the jsons. I got the docs. I got the jinja. I got the logs. I constructed a proompt. And I fed it to Gemini Pro in Studio. It failed to produce a good reverse proxy that worked zero shot. Then I tried Claude Sonnet and it worked.

Multiple tool calling seems to just werk now with no errors at all in a few tests I did. I looked at the logs and it is correctly removing old reasoning traces, doesn't have any <think> tags, has Gemma's expected reasoning tokens, and also in the case of a conversation with old reasoning traces + tool use, it keeps the old tool calls there, while the reasoning is gone, as expected.

I did use it with the jinja mentioned here which appeared to help: https://huggingface.co/google/gemma-4-31B-it/discussions/62#69e2e058d3dd9875d6b4fc31

I have not tried >>108649705 and I guess I will give it a try to see how it does. Anyway here is Claude's script for anyone that wants to test and see if it has issues or fixes everything.
https://pastebin.com/SCQsBe7W

No I didn't read its code.
>>
>qwen3.6 35B-A3B
>3000 token thinking block
is this normal?
>>
>>108650197
Claude cheated cause he already had it prepared.
>>
>>108650192
gemma has adaptive reasoning which can fuck herself into lalalalala. if you're a st retard, put some variant of
> [ooc: use max reasoning]
into your "post-history instructions"
pretty sure using post-history fucks your prompt cache reuse but I'm also pretty sure llama-server's prompt cache is fucked to begin with so
>>108650155
yeah, I should have done better. ah well.
>>108650197
local model doko
>>108650198
yes
>>
What is the point of mini ai pcs like a strix or a spark when macs exist with far more memory and faster bandwidth?
>>
>>108650209
>local model doko
Well, here, now that Gemma appears to be le "fixed".

Wow I'm just like unslop I'm so good.
>>
gemma e4b completely unusable for hermes. 31b could solve any task I gave it lmao
>>
>>108650209
My weird issue is that using speculative decoding just disables the thinking process itself, not that it stutters or lalalala or anything of the sort.

Basically :
gemma-4-31B-it-Q5_K_L + thinking works.

gemma-4-31B-it-Q5_K_L + 26B A4B Q4_K_L for speculative decoding + thinking runs too, but no thinking ever actually happens.
>>
>>108650198
Yeah, it's the qwen special.
>i must think about X
>but what if X is actually Y
>wait, what if X = Y
>wait, that doesn't make sense, let's think about X
>wait, X looks like Z
>>
>>108649157
your migu archive is stronger than my own
I didn't even remember this one
I need to step my game up in terms of volume
>>
>>108650248
so, speculative decoding should not alter the output at all
conceptually, speculative decoding causes the main/target model to infer the N draft tokens in parallel, and just use whichever ones are correct
if the target model's predictions differ from the draft's (e.g. because of samplers or whatever) then it'll still use the target model's output; it is entirely lossless
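to illustrate why it's lossless, a toy greedy version (next_token / next_tokens_parallel are hypothetical interfaces; real implementations handle sampling rather than argmax):

def speculative_step(target, draft, ctx, n=8):
    # the draft model proposes n tokens cheaply, one at a time
    proposal = []
    for _ in range(n):
        proposal.append(draft.next_token(ctx + proposal))
    # the target model scores all n positions in one parallel forward pass:
    # verified[i] is what the target itself would emit after ctx + proposal[:i]
    verified = target.next_tokens_parallel(ctx, proposal)
    accepted = []
    for drafted, wanted in zip(proposal, verified):
        if drafted == wanted:
            accepted.append(drafted)   # agreement: keep the cheap draft token
        else:
            accepted.append(wanted)    # mismatch: the target's token wins,
            break                      # so the final output never changes
    return accepted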
I don't understand why your setup would produce the output you described. I've run 31B@q8 with a 26b@q4 draft model and it's worked fine, so..?
gemma4 does have a bug with adaptive reasoning where it elides reasoning but i've never seen it before 30k context (and even then it's rare before 60k context).
can you post your llama-server invocation?
>>
>>108650275
It's not llama, it's kobold, but here it is :
./koboldcpp-linux-x64 --model ./gemma-4-31B-it-bartowski-Q5_K_L/google_gemma-4-31B-it-Q5_K_L.gguf --flashattention --usecuda --gpulayers 60 --contextsize 8000 --jinja --maingpu 0 --tensor_split 1 0 --chat-template-kwargs \{\"enable_thinking\":true\} --draftmodel ./gemma-4-26B-A4B-it-bartowski-Q4_K_L/google_gemma-4-26B-A4B-it-Q4_K_L.gguf --draftgpulayers 99 --draftgpusplit 0 1 --draftamount 8 --batch-size 512 --host 0.0.0.0 --port 8080 --skiplauncher --debugmode --gendefaults \{\"top_k\":0\}
>>
>>108650295
>It's not llama, it's kobold
>>
>>108650307
Yes? The launcher flags are different, even if it uses llama too in the end.
>>
>>108650295
I've never used kobold, but I don't see anything in those args that would cause the behavior you're describing (besides the adaptive reasoning bug)
I would try the [reasoning effort: max] workaround and see if that fixes your problem. I'm not familiar with kobold but there's probably a way to put it at the end of your context... unfortunately putting it in the system prompt doesn't help...
>>
>>108650325
Thanks for checking anon, guess I'll experiment then.
>>
>>108650197
Hmm alright so I've been testing more and I think this probably isn't solvable with the reverse proxy: OWUI seems to throw away previous tool calls and reasoning traces after finishing a response with a lot of tool calls. The latest reasoning is kept in the expandable think block, but it looks like everything else just never existed. This would be a problem if, say, you were doing web searches and the information from those searches matters to your further conversation in the chat, like manuals/documentation. The model would have to redo the search, or just be operating blind.

Fuck I should've gone straight to vibe cooding my own shit the moment I smelled the bloat from this garbage. I think I will just do that. This piece of shit will have to do temporarily though.
>>
>>108650197
Good work. For anyone else having problems: that reverse proxy will fix OWUI's prompting for all reasoning models, not just Gemma.

Also don't be fooled by OWUI's "Filters" function if you get tempted to re-implement this there. I tested, and Filters in OWUI don't apply to the most recent assistant message during tool calls, even though they apply fine to past messages. So just use a reverse proxy like this one if you can't be assed to fix their source code yourself or to wait for them to figure it out and fix it eventually.
>>
i just want servicetensor fuck this shit
>>
File: uwu.png (5 KB, 340x75)
5 KB PNG
>This works!
>{snippet}
>The logic is solid!
>{snippet}
>*Actually, final logic check for `utils.js`
>This is perfect. Okay, I'm ready.
>**Wait!**
I love her.
>>
Tool calling in Gemma 4 E4B under OpenCode now works for me with this chat template
https://gist.github.com/bbrowning/c584eb2dbd79e4cc9ecedf92eee2d135
https://github.com/anomalyco/opencode/issues/21034#issuecomment-4267446944
>>
was talking about the cannon supergirl movie and ran out of context, can't remember why
>>
I tried this >>108649705, and the one linked >>108650197.
Both seemed to work, but the one on ggml-org (still paired with the reverse proxy) is slightly less on-spec with my setup. Specifically, it produces

...
<|turn>model
<|channel>thought
Let's start by searching.<channel|><|tool_call>call:search_web...

Whereas the huggingface rando's template does

...
<|turn>model
<|channel>thought
Let's start by searching.
<channel|><|tool_call>call:search_web...

According to
https://ai.google.dev/gemma/docs/core/prompt-formatting-gemma4
the second one is correct.
>>
Kinda afraid to ask but I can't seem to spot the black magic part of the following args.
Stole them from some anon a couple threads ago:

./llama-server --host 0.0.0.0 --port 8080 --model 'gemma-4-31B-it-IQ4_XS.gguf' --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0.0 -c 16384 --flash-attn on --parallel 1 --no-slots --swa-checkpoints 0 --keep -1 --reasoning auto -kvu -b 2048 -ub 128 --cache-type-k q8_0 --cache-type-v q8_0 -ngl 55 --metrics --fit-target 128 --poll 0 --threads 4 --chat-template-file 'chat_template_gemma4.jinja' --alias Gemma4

I can put 55 layers on my 16gb card and get decent speeds. 11 t/s. Which is totally fine for me.
If I try to replicate this with koboldcpp though I will offload like 33 layers instead and get the speed you can imagine.
Both use around 15gb of vram. Can somebody tell me if there is a specific flag I seem to be missing?
I also couldn't find a setting for every arg, so maybe it's just not possible.
>>
>>108650604
if you're concerned about vram the only flags that matter (beyond the weights) are
> --cache-type-k q8_0 --cache-type-v q8_0
you're using a 31B model with Q4 weights, so that's 31B*(4/8)≈15.5GB just for the weights; no surprises there. Try using a smaller quant or buying more VRAM.
>>
>>108650613
the biggest eaters of vram with gemma are the checkpoints and slots, so kobold either doesn't expose all of the options or they're named differently and he just isn't setting them
>>
>>108650627
>--swa-checkpoints 0
>>
gpt/gemini/glm/deepseek/moonshot/qwen/claude opus/image gen/groq/openrouter proxy https://j3wproxy.neocities.org
>>
>>108650634
>>108650627
>>108650613
Hmm. If I set "use swa" I get to 52 layers.
Still 3 layers less and therefore slower.
Is that because of that checkpoint flag? I don't see anything for it in either the args or the UI.
Gotta use llama.cpp for now I guess. Appreciate the help.
>>
>>108650056
>>she's now fucking seven of nine
running st? you can extract the full sent context if you click the prompt button on any message in the chat.
>>
Image tagging
How do you guys do it? What models do you use?
>>
>>108650696
mturk
>>
>>108650543
how do you even code with e4b
>>
>>108650765
python
>>
>>108650825
>>108650825
>>108650825
>>
>>108647730
I not only do my own quants, I also publish them!!!!!!!!


