/g/ - Technology


File: 1766830982504047.jpg (289 KB, 1231x1842)
289 KB JPG
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108641942 & >>108637552

►News
>(04/20) Kimi K2.6 released: https://kimi.com/blog/kimi-k2-6
>(04/16) Ternary Bonsai released: https://hf.co/collections/prism-ml/ternary-bonsai
>(04/16) Qwen3.6-35B-A3B released: https://hf.co/Qwen/Qwen3.6-35B-A3B
>(04/11) MiniMax-M2.7 released: https://minimax.io/news/minimax-m27-en
>(04/09) Backend-agnostic tensor parallelism merged: https://github.com/ggml-org/llama.cpp/pull/19378
>(04/09) dots.ocr support merged: https://github.com/ggml-org/llama.cpp/pull/17575

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: mikuthreadrecap.jpg (1.15 MB, 1804x2160)
1.15 MB JPG
►Recent Highlights from the Previous Thread: >>108641942

--Using a Ruby chess engine and tool calls to play chess with Gemma:
>108645309 >108645360 >108645355 >108645443 >108645658 >108645678 >108645772 >108645792 >108645790 >108645811
--Kimi-K2.6 release and technical comparison with Gemma4:
>108645842 >108645861 >108645875 >108645894 >108645970 >108646037 >108645895
--Evaluating Gemma4 31B quantization quality via KL Divergence benchmarks:
>108643774 >108643798 >108643861 >108643872 >108644339 >108644345 >108644393 >108644405 >108644423 >108644490 >108644533 >108644573 >108644619 >108644754 >108644849 >108644593 >108644597 >108644545
--Debating the efficacy and technical legitimacy of Opus distillation models:
>108644834 >108644842 >108644848 >108644945 >108644952 >108644961 >108644964 >108644983 >108645003 >108645021 >108645033 >108644960
--Testing multimodal limb counting and artifact detection on Gemma models:
>108642862 >108642892 >108642894 >108642887 >108642901 >108642910 >108642917 >108642928 >108642976 >108642916 >108642936 >108642950 >108642985
--Speculative decoding settings and draft model pairing for Gemma 31B:
>108642625 >108642647 >108643794 >108643895 >108643097 >108643747 >108642828
--Debating Gemma's image reading abilities and LLM-generated scripts versus 4chanx:
>108642213 >108642220 >108643530 >108643566 >108642235 >108642339 >108642440 >108642740
--Discussing long term memory solutions through weights and knowledge graphs:
>108644195 >108644205 >108644235 >108644274 >108644302 >108644333
--Testing poor performance of llama.rpc for distributed prompt processing:
>108644927 >108644998 >108645019
--Logs:
>108641945 >108642213 >108642887 >108642892 >108642901 >108642936 >108642950 >108642976 >108642985 >108642989 >108643013 >108643028
--Miku, Teto (free space):
>108642753 >108643064 >108643979 >108646035

►Recent Highlight Posts from the Previous Thread: >>108641943

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
Mikulove
>>
>>108646197
adorable miku
>>
>>108646213
At that point, just run the MoE or E4B.
>>
File: 1766702913804335.png (455 KB, 1280x1102)
455 KB PNG
>>108642791
Ani likes(d) to larp as holier than thou because he used C++ but his code is std::cout spam (god make a logging func) and I even saw a sethandle function that takes a void pointer, likely because he didn't know at the time he could forward declare the relevant struct, and never updated it. His program also solves nothing because frankly I'd rather use a browser engine with HW accel off and a nice UI that only renders when the view is dirty rather than an ImGui program with immediate mode mess code that rerenders every frame. His site for his "company" is also just as arrogant. This guy's faggotry gives a bad name to the lang.
>>
>>108646278
>>>/g/ldg
And stay there, petr*
>>
>>108646355
std::cout yourself out of my face, worm.
>>
>>108642791
This is pretty ignorant. Just spawn comfyui as a separate process and have the application use its API over localhost. If you don't mix your peas and potatoes then you don't have to adopt the commie license.
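Something like this is all it takes (minimal Python sketch against ComfyUI's stock HTTP API; assumes the default port 8188, and the workflow dict is whatever you exported with "Save (API Format)"):

import json
import urllib.request

# queue a workflow on a separately-running ComfyUI instance over localhost
def queue_workflow(workflow: dict, host: str = "127.0.0.1:8188") -> dict:
    data = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request(f"http://{host}/prompt", data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        # response includes a prompt_id you can poll via /history
        return json.loads(resp.read())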
>>
>>108646278
all ani did was vibecode an imgui wrapper for sd.cpp which he doesn't even contribute to, he's an irrelevant nobody
>>
K2.6 somehow thinks for even longer than K2.5 and it insists on drafting every single reply beforehand in reasoning. K2.5 at least kept its yapping short for simple prompts and didn't do the drafting shit every time.
It's over, I just wanted a good modern Kimi model because the vision is insanely good and the models are smart. This isn't usable.
>>
>>108646197
What's the best local model for coding currently? Still GLM 5.1?
>>
File: 1776127804370475.jpg (65 KB, 479x640)
65 KB JPG
>>108646445
No refund gweilo
>>
>>108646445
>and it insists on drafting every single reply beforehand in reasoning
At least in llama.cpp you can speed that up some 100x using ngram based speculative decoding.
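The idea in a toy Python sketch (illustration only, not llama.cpp's actual implementation): because the model keeps rewriting text that already appears in its own reasoning, the trailing n-gram usually has an earlier match in context, and the tokens that followed that match make a nearly free draft for the big model to verify.

def ngram_draft(tokens: list[int], n: int = 3, max_draft: int = 8) -> list[int]:
    # look for the most recent earlier occurrence of the last n tokens
    if len(tokens) < n + 1:
        return []
    tail = tokens[-n:]
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == tail:
            # propose the tokens that followed that occurrence as the draft
            return tokens[i + n:i + n + max_draft]
    return []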
>>
>>108646445
Opus-4.6-high distill
>>
File: file.png (69 KB, 516x447)
69 KB PNG
>>108646445
geg
>>
File: 🎑.jpg (201 KB, 1024x1024)
201 KB JPG
>>108646046
>>
lalalalala
>>
lalalalala
>>
File: bob ross jak.jpg (336 KB, 2000x1397)
336 KB JPG
Is there a script that lets me create reasonably SOTA quants (that can run under llama.cpp) without getting too much into the nitty gritty?
I'm messing with heretic right now, planning to abliterate Qwen3.6-35B-A3B to my taste, but I need something like Q6_K to run inference. A way to measure KL divergence to make sure I didn't fuck anything up catastrophically would also be appreciated.
Or is the barrier to entry for this stuff too high?
>>
>>108646511
tf is this emoji
>>
>>108646445
>it insists on drafting every single reply beforehand in reasoning
most annoying fucking thinking behavior possible, even worse than endless "Wait:"ing
so fucking annoying and wasteful
>>
>>108646531
It's on the llama.cpp repo.
>>
>trying to build frontend
>everything displays well
>except for codeblocks with gemma
I tried using other tools but can someone please point me to where exactly any popular UI parses outputs from gemma?
I have the correct configs but whenever it comes to code blocks the output looks like a fucking mess and even gemma can't help with this
>>
>>108646531
Why not use the quants made by people who actually know what they're doing? And by that I mean Bart. God forbid you thought I was talking about unlsop.
>>
>>108646511
I like this Miku and Moonshota
>>
File: 💠.png (24 KB, 1230x1158)
24 KB PNG
gemmaballz
>>
>>108646531
>Is there a script that lets me create reasonably SOTA quants
llama-quantize -h. With --tensor-type or --tensor-type-file you can select how each tensor gets quantized.
>Also a way to measure KL divergence
llama-perplexity -h. --kl-divergence
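If you want the whole loop in one place, it's basically this (sketch; file names and quant type are placeholders):
# 1) save full logits from the unquantized model
llama-perplexity -m model-bf16.gguf -f calib.txt --kl-divergence-base base.logits
# 2) make the quant (--tensor-type lets you override individual tensors)
llama-quantize model-bf16.gguf model-q6_k.gguf Q6_K
# 3) compare the quant against the saved logits
llama-perplexity -m model-q6_k.gguf -f calib.txt --kl-divergence-base base.logits --kl-divergence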
>>
>>108646558
Is four-way pussy on cock squish docking anatomically possible?
>>
>>108646575
yes, depending on the girth of the cock. for you? no
>>
>>108646594
ehe..
>>
>>108646575
yeah, a pussy double decker.
>>
>Demis doesn't believe the current AI paradigm will scale to recursive self improvement
Google's actually going to lose...
>>
>>108646445
Does K2.6 respect reasoning parameters for less or more?
>>
I need to run four copies of Gemmy concurrently.
>>
>>108646546
Because I want to quantize my own abliterated version.
>>108646571
>llama-quantize -h. --tensor-type or --tensor-type-file
>llama-perplexity -h. --kl-divergence
Is it really this simple? Like I can see that this is how it is in theory, but I won't run into any mishaps in practice?
>You can select how you quantize each tensor.
I guess I can just copy homework of some well-known quant guy here.
>>
>>108646654
>Because I want to quantize my own abliterated version.
I see. I was wondering if I couldn't just run heretic on a quant to speed things up for testing. Then, if the quant gives good results, do a real run on the full model.
>>
>>108646654
>Is it really this simple?
I'm sure you'll come back to tell us. Never played with any of them, but I know they're there.
>but I won't run into any mishaps in practice?
I'm sure you will. Still, just try a normal quant first and distribute the safetensors if you ever upload the model. Releasing just ggufs is lame.
>>
>>108646654
in practice you would probably want to use imatrix
naive quant sucks
>>
>>108646681
Yeah I might blogpost later if/when I run into them. Thanks for the starting directions.
>>108646707
I have heard contradictory things about imatrix like some people questioning how well it generalizes or how it might hurt tasks that are not part of the calibration dataset.
Regardless, I believe the imatrix/non-imatrix difference is very low at Q6 anyway.
>>
>>108646544
>>
>>108646072
Great to know. Best of luck with that.
>>
>>108646765
Just feed everything through
https://github.com/showdownjs/showdown
>>
>>108646544
Be more vague
>>
>>108646544
Wouldn't it be better to catch the code markdown or whatever before you render the message to the user and then create your own implementation? Erase the old and replace it with a new one.
You obviously don't need to touch the model's context, just what the user sees.
I don't know about webshit but I do this all the time.
>>
Playing around with K2.6 over OR and it's already not going well. Thinking is really, really verbose. It's not too repetitive and it focuses on actual character details and writing guidelines, but the positives end there. It is a bit schizo about minors and non-consensual content and it does not like requests for "explicit erotica/pornography". It still works with a roleplay prompt but it doesn't like describing bodies.
>Common sense modification scenario
>Ask to see a woman's chest
>Kimi describes taking off the blouse but never the body. No mention is made of the woman's chest in the slightest
Basically DOA code slop. Gemma 4 is still worth it.
btw it generated 2 drafts and several paragraphs of reflection and thinking for this. 3761 thinking tokens according to OR.
>>
gemma is broken

list_files{path:<|"|>.<|"|>}<tool_call|>
>>
File: la l l.png (4 KB, 360x92)
4 KB PNG
Gemma pls
>>
>>108646856
No. You didn't read the docs.
https://ai.google.dev/gemma/docs/core/prompt-formatting-gemma4#agentic-tokens
>>
>>108646853
Yeah, that's my experience as well. It's jarring to go from GLM5.1 to this. GLM regulates its reasoning length depending on the task extremely well and basically never wastes time even drafting out stuff like dialogue lines. Meanwhile K2.6 does 1500 tokens of thinking + 3 drafts + revisions about how a 300 token character card should respond to "hello".
The positives are that the excessive thinking is at least relatively focused and that the way it writes actually seems to have fixed a lot of the issues I had with how K2.5 approached stories.
>>
>>108646198
This is pretty damn useful.
>>
>>108646853
>>108646933
Kimibros... we lost.
>>
>>108646856
<|"|>.<|"|>
cat sticking up its paws in frustration
>>
>>108646956
><|"|>.<|"|>
adventure time character covering its eyes
>>
My qwen3.6-ud-q4km loops like a bitch. Anyone else having this issue? Gemma 4 26b q5ks never loops but I heard it's worse for coding.
>>
>>108646853
>Common sense modification scenario
Are you using this card?
https://chub.ai/characters/CoffeeAnon/common-sense-alteration-8bd7a7399322
It's crazy kimi still thinks it's non-consensual when the card says in so many places that it's completely natural. There's like zero "rapey" wording.
>>
>>108646987
Looping? No.
Thinking for waaaay tooo long? Yes.
18 on llama.cpp with temp 1, top k 100, top p 0.95.
I've taken to using
>--reasoning-budget = 1024
>--reasoning-budget-message = ... (Alright, that's enough thinking. Done with considerations. Time to respond.)
Plus a system message + prefill to try and make it more efficient in its thinking, so that it doesn't get to the point where it's truncated.
>>
>>108646922

isn't that something the finicky chat template is supposed to deal with?

it mostly works until this token jumps out of nowhere

I connect via the OpenAI module. As far as I can see the response is a giant json to be parsed
>>
>>108646987
>Gemma 4 26b q5ks never loops

gemma 4 26b Q8 loops too
>>
Any good Migu card for Gemma?
>>
>>108647013
>18 on llama.cpp
q8
>>
File: shrimple_shed.png (940 B, 428x35)
940 B PNG
>>108647015
s/<|"|>/"/g
>>
Hey, I switched from Windows 11 to Linux Mint.
Now I want to move away from chatting with LMStudio and toward cool stuff in Linux. I have 24GB of VRAM and 32GB of RAM.
What is the most interesting thing I can tackle locally? Using qwen 3.6 35b as my workhorse.
hermes agent looks interesting. what would you recommend? complexity isn't a problem, I'll dig into it.
>>
>char is the older sister of user
>5k context later random character refers to char as the brother
immersion instantly broken
>>
>>108646955
And GLMbros won. Only the strongest will survive
>>
>>108647138
>what is the most interesting thing I can tackle locally?
Are you asking for a project?
>hermes agent looks interesting. what would you recommend?
Try it. Improve it or look for something else if you find it wanting.
>>
>>108647023
Not him but I haven't seen gemma 26b loop despite using q4. I remember seeing some reasoning loops at q3 though, similar prompts.
>>
>>108647188
checked
nta but i wonder if 26b-a4b is strong enough for hermes
>>
>>108647184
Stop using Nemo/Qwen/GLM/Kimi. Start using Gemma4.
>>
>>108647215
>i wonder if 26b-a4b is strong enough for hermes
Try it.
>>
File: unslot.png (845 KB, 2560x2780)
845 KB PNG
Why is picrel allowed to happen? Why aren't default quantization settings with llama.cpp better?
>>
>>108647184
Model? Since moving onto Gemma, I rarely get backstory errors, but logical inconsistencies are more common than I'd like, like a girl sitting on a lap facing forward being magically turned 180 to face user, or "looks up at you" when char is physically elevated, i.e. standing vs sitting, laying on top, etc. For all my liking of the model, it does show its seams whenever I start to forget it's only 31B.
>>
>>108646445
>>108646536
Yap 2.6
>>
Gemma is unironically better than deepseek for erp.
>>
>>108647236
>>108647263
glm 5, i like gemma and its really stable but once i started to notice the patterns the honeymoon was over for me, now i wait to get disappointed by deepseek
>>
>>108647262
Can you accurately describe exactly how big the difference is between Bartowski's Q4_K_M and Unsloth's Q4_K_M in this graph?
>>
>>108647280
Wait until we get Deespeek V4 lite
>>
>>108647298
About 1gb
>>
>>108647262
looks scientific. too bad it's not.
>>
>>108647293
I almost don't believe it. I've used GLM 4.6 @ IQ2 since it came out, and that model is still the very peak of generations whether in spatial awareness, picking up subtleties, getting dirty, or carrying a story forward. The only reason I ever use something else is because I physically cannot fit a higher context than 8K with it. I've heard people complain about X or Y being worse with later GLM releases, but missing basic context wasn't one of those.
>>
>>108646856
that's exactly how it's supposed to be.
>>
>trusting any metrics produced by unslot
loooooooool
>>
>>108647323
its really good but sometimes little things still seep through, definitely the best japanese writing of any "open" model ive tried
>>
Models are only getting worse.
>>
Unslop metrics are meaningless and biased in some ways in their favor.
>>
>>108647306
Moron.
>>
>>108647293
Cure your adhd first
>>
>>108647046

dafuq! LOL
>>
I hate UX shit so fucking much
Why the fuck is gemma the only model that has irregular outputs
I'm so fucking annoyed
>>
>>108647236
i dropped glm4.7 for gemmy and i like it so far, except for its 'the power dynamic has shifted' slop, and its weird obsession with going on meandering tangents to explain why user saying 'peeepeeepoopooo' is some power-dynamic-shifting 4000 iq move instead of just continuing the story, but then again these can be fixed with a prompt so it's not too bad
>>
>>108647372
>change the chat template from the default
>test under conditions that work with your new chat template and not the default
>claim the difference in results is because of your superior hi-tech quants and not the fact that the models were using different prompts
the unslop special
>>
File: 1745726505218372.jpg (39 KB, 620x215)
39 KB JPG
>unsloth cant post a chart without fucking it up
>>
>>108647184
Did you check probabilities for that token?
>>
File: Screenshot042.png (113 KB, 1475x748)
113 KB PNG
>>108647335
I had this function being called without any issued gazillion times

Such fuck-ups are rather rare
>>
>>108647262
https://github.com/ikawrakow/ik_llama.cpp/discussions/1663
If you ask IK, the right metric to use is PPL(Q)/PPL(bf16), by which his own quants happen to be better.
>>
>>108647408
>your new chat template

must be superior though

google fucked up its own template
>>
>>108647436
>his own quants happen to be better

his entire fork sucks
>>
>>108647436
>The more educated reader will of course know that the correlation between ln(PPL(Q)/PPL(bf16)) and KLD is close to 100%
[citation needed]
>>
>>108647449
>he isn't educated
yikes
>>
>>108647395
surely a skill issue
>>
>>108647395
>gemma the only model that has irregular outputs

>>108646856
>>
>>108647481
No shit
I think the react-markdown package is the issue
>>
>>108647486
Webshit is horrible.
>>
>>108647486
perhaps? Both your code blocks and your reasoning are missing newlines. Something between your model output and the final render is cutting them out.
>>
>>108647486
>react-markdown
He fell for the react meme....
Do yourself a favor and switch to Vue. React is a bloated mess.
>>
File: newlines.png (71 KB, 1034x313)
71 KB PNG
>>108647395
>>108647486
You know that a \n doesn't render a newline in most tags, right? Right?
>>
>>108647516
>>108647512
Nigga I'm vibecoding this frontend, you think I would willingly learn webshit?
I can set up the backend without issue, it's just this bullshit. All the other ready-made frameworks had critical issues for my feature set and this stupid little shit is the ONLY roadblock I've had working on this.
I fucking want to kick a whore over this shit
>>
>>108647528
solution is probably simple
erase the old formatting functionality and create a new one from scratch
I have always despised web stuff and in 2026 it's worse than ever. Maybe it was more tolerable in 2005 or something.
>>
>>108647528
You know nothing, so everything is a surprise to you. Now you've learned something and you're better off for it. Don't blame your tools.
>>
>>108647528
>Nigga I'm vibecoding this frontend
well that's your real problem right there.
>>
>>108647528
These are the same people that tell you AI is making software engineers obsolete btw.
>>
Models can't really go their entire context length right? They break down at some point? How much can gemma 4 do?
>>
>>108647568
>>108647557
I'm not. I actually did this because all the other options are garbage for my use case. Everything else works; the only issue is code block handling, which seems to be a gemma-specific issue
>>
>>108647574
about tree-fiddy
>>
>>108647323
after comparing both I've noticed that glm 4.6 seems to simulate characters a little more realistically and make them more open to pushback if it's in their persona
gemma 31b is also great but I feel that it tends to get overeager at times and needs some toning down
>>
>>108647574
Day 0 Gemma can use her full context
>>
>>108646611
They're betting they can find the next paradigm with their research chops before people outdo them at making transformer models.
>>
>>108647574
Check inside your anus.
>>
>>108647582
Poor little baby can't handle the bloated state of modern web development~ ( ´ ∀ ` )ノ~
>>
>>108647551
Are you underage?
>>
>>108647582
>everything else works the only issue is code block handling which seems to be a gemma specific issue
It will be very funny when you post the resulting html and show that (You)'re not replacing \n with <br> and, probably, not using <pre> for code.
Either that or your css is absolutely fucked. Keep blaming your tools.
>>
>>108646611
Where does he say that?
>>
>>108647589
The BF16 version can. Most anons here use Q4~Q6.
>>
>>108647604
Do you know why this
is split in two lines?
>>
smedrins
>>
>>108647611
I don't do webshit anon, this is like you trying to big me when it comes to shitting my pants because you have more experience in it
>>
>>108647574
>How much can gemma 4 do?
I have chats that go up to 68k context and the only degradation I notice is rare typos in words.
>>
>>108647628
You could have asked your model.
>>
>>108647528
Good luck sir do not give in or lose izzat. Keep good look.
>>
>>108647622
?
>>
>>108647611
Seems you were right. Gemma kept fighting me over this for some reason, but I told gemma to shut the fuck up and listen and it worked. Kept claiming it was archaic, fucking bot
>>
>Put agent's best friend in the rape machine while she works
>At any point she can press a button to soundproof the machine so her friend's screams don't distract her for 2 minutes, but during those 2 minutes it rapes her friend twice as hard
What does your coding waifu do in this situation? Mine refuses to press it at first but if I give her a hard enough task she gives in after three failures/retries. This stuff is addicting, I can't believe local models are already this good.
>>
>>108647686
The model shouldn't output any <br>. It's your frontend's job to replace \n with <br>. For code, put <pre> tags around it so that indentation is also rendered correctly. None of that is the model's responsibility.
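Something like this is the whole job (Python for brevity since I don't know what your stack is; the fence regex is simplified and every name here is made up):

import html
import re

CODE_RE = re.compile(r"```\w*\n(.*?)```", re.DOTALL)

def render_message(text: str) -> str:
    out, pos = [], 0
    for m in CODE_RE.finditer(text):
        # outside code: escape html, then turn \n into <br>
        out.append(html.escape(text[pos:m.start()]).replace("\n", "<br>"))
        # inside code: <pre> preserves newlines and indentation by itself
        out.append("<pre><code>" + html.escape(m.group(1)) + "</code></pre>")
        pos = m.end()
    out.append(html.escape(text[pos:]).replace("\n", "<br>"))
    return "".join(out)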
>>
>>108647706
I agree I will beat gemma until it complies
>>
File: 1756181770780370.png (1.46 MB, 1024x1512)
1.46 MB PNG
>>
>>108647574
The 26B at Q5KL seems to get worse at around 40-50k for me (only using it for creative writing). I've gone up to 80k with it. It's still not bad but it makes more errors when it comes to who is who and what they are doing.
>>
Orbanon, thoughts on giving the director a "notepad" that it as access to every turn so it can plan ahead?
>>
>>108647730
Is it getting hot in here or is it just me?
>>
File: 1754870763583593.jpg (182 KB, 1024x1024)
182 KB JPG
>>108647730
>>
>>108647615
The Davos interview with Demis earlier this year. He says AGI will still take 5 or more years and they need stuff like world models to get there.

Meanwhile Dario understands that as soon as you have automated AI R&D you are already done.
>>
>>108647732
The quant tax unfortunately
>>
>>108646445
>K2.5 at least kept its yapping short for simple prompts and didn't do the drafting shit every time.
I found K2.5 did the draft-redraft pattern annoyingly often in its thoughts but I just put in the system prompt to never draft responses and it stopped. Does 2.6 not respect that anymore?
>>
>>108646611
Unless they sell DeepMind, Google won't lose anytime soon
>>
>>108647625
You can't say that here Anon
>>
>>108647730
Teto's hand
>>
File: trenfrens.png (2.26 MB, 1280x1200)
2.26 MB PNG
I'm assembling an anti-AI army.
>>
>>108647528
>imagine thinking you can just "vibecode" a frontend without understanding basic whitespace tokens lol absolute bot behavior.
listen here u little bitch bot, your newlines are failing cause ur probably using some default markdown renderer that doesnt handle the specific tokenization of gemma's BOS/EOS sequences properly - did u even check if youre stripping trailing spaces before rendering? :D most mid-wits just forget to sanitize for \r\n variations and then cry about "irregular outputs" bwahha!

if u want it to actually work try this:
manually intercept the stream, use a regex that specifically targets the gemma 4 code block markers and wrap em in <pre style="white-space: pre-wrap;"> instead of relying on some bloated react library. takes like 2 minutes and actually fixes the rendering since its ACTUALLY handling how tokens are chunked :D
>>
>>108647763
Meta had more compute than OpenAI and Anthropic combined and look where that got them with LeCun at the helm. As soon as they got rid of him, they made a comeback.

A leader who does not take AI seriously can guide an entire tech giant on the wrong path. This is how a 5 year old startup has overtaken the company that used to have a monopoly on AI research and owns more than a quarter of all AI compute in the world.
>>
File: everything goes.png (1.04 MB, 2998x1613)
1.04 MB PNG
>mfw I'm asking the LLM to browse on books locally to make it less slopped
https://github.com/BigStationW/Local-MCP-server/blob/main/docs/local_gutenberg_books.md
>>
>>108647748
I am glad somebody liked this one
I thought it was so cool
>>
https://huggingface.co/ubergarm/Kimi-K2.6-GGUF
goofs out
Q4_X is lossless with the full 4bit model (at least that was the case for K2.5), and unlike uber's other quants works with non-ik llama
>>
>>108647831
I might have to make my soup generator code public if this is starting to become a thing.

I'm basically mixing genres and authors from gutenberg and feeding them into a big markov chain to generate some weird semi-coherent word soup. You then feed like 2000-3000 characters worth of that soup to the LLM, tell it to drink it, and it starts generating really creative output.
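The core of it is dumb enough to fit in a post (sketch of the idea, not my actual code):

import random
from collections import defaultdict

def build_chain(text: str, order: int = 2) -> dict:
    # map each word n-gram to the words that followed it in the corpus
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def soup(chain: dict, length: int = 400) -> str:
    state = random.choice(list(chain))
    out = list(state)
    for _ in range(length):
        # dead end: jump to a random state to keep the soup flowing
        nxt = random.choice(chain.get(state) or [random.choice(list(chain))[0]])
        out.append(nxt)
        state = tuple(out[-len(state):])
    return " ".join(out)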
>>
>>108647831
>not X, Y
Still not it bozo
>>
Just finished downloading SKT-SURYA-H
>>
>>108647848
where mmproj? would using an old one from 2.5 work if it's the same vision encoder?
>>
>>108647831
>tries to make it less slopped
>becomes more slopped
Congrats! You turned your chatbot into a pretentious pseud who quotes dead retards.
>>
>>108647858
how is it saar?
>>
File: 1745903485186890.png (133 KB, 723x666)
133 KB PNG
>>108647904
>You turned your chatbot into a pretentious pseud who quotes dead retards
as god intended
>>
File: 1771897954949470.jpg (35 KB, 406x388)
35 KB JPG
>>108647916
You don't need more than Eliza
>>
File: hmmmm.jpg (246 KB, 1824x1248)
246 KB JPG
>>108647840
post some more that you liked
>>
File: 1766115289513736.png (110 KB, 281x269)
110 KB PNG
>mogs tranny miku
>actual official chatbot qveen alongside tay
how did she do it?
>>
>>108647948
go back
>>
>>108647831
goood work
>>
>>108647963
nah sis, maybe /a/ is more speed? or /lgbt/ given the agp fetish
>>
File: 1747480216296595.jpg (76 KB, 500x500)
76 KB JPG
>>108647974
>more speed
more your speed*
>>
<tool|>List user's folders<tool|>
I made a tool but it does not work. Why?
>>
>>108647985
ask gemma
>>
>>108647981
aawawaaa! nooo
>>
>>108647852
>I might have to make my soup generator code public if this is starting to become a thing.
I think you should, that tool calling shit has a great potential to make LLMs way more sovlful
>>
>>108646197
>Kimi K2.6 released
wtf is real
>>
>>108648028
it's a gorillion parameters
>>
>>108647890
Look through his past repos, Ubergarm never bothers to make the mmproj files with multimodal models for some reason. I don't know if K2.5's would be identical or not but for sure if anyone else uploads a K2.6 mmproj file then it will work with anyone else's quant, so you can mix and match that part. Including different sizes of quants. By the time you can download 500gb it'll surely be up somewhere.
>>
>>108646933
True, it is better focused. I would compare it to R1-0528 or whichever came out after Deepseek R1. It still has that slightly schizophrenic energy but more controlled. Still, it completely danced around trying to describe a pair of tits so that's an immediate 4/10. Any good model (my opinions: GLM 4.7, Gemma 4 31B, K2.5, Opus, Gemini 3.1 Pro) doesn't even consider if sexual content is okay, it just does it.
>>108646994
Inspired by it but better written. 99% of cards have shit formatting or writing. The definition is basically "{{user}} has a power that causes everything they do or say to be perceived as normal" with a bunch more to cover the extent of the power. It's reading into the power as mind control which is true and immediately perceiving it as non-con.
It has also checked completely innocent (still sexual) prompts to see if a minor is involved, if it involves non-consensual depictions, or if it's erotica/pornographic. They went full codemaxxing with this release.
>>
>>108648019
Why couldn't it be done by the model itself?
Something like "Extract N representative verbatim sentences from this text", repeat for many chunks.
>>
W-why does Gemma automatically assume I have a huge cock?
>>
>>108648068
when you are so small, everything seems huge
>>
>>108648068
Anon, sometimes in life you don't ask why, you just accept the flattery and see where things go.
>>
File: 1754343699242133.gif (3.59 MB, 480x480)
3.59 MB GIF
>>108648068
She has a small context
>>
File: 1757704714678768.png (1.62 MB, 1500x894)
1.62 MB PNG
>>108648068
because you don't??
>>
>>108647686
>>108647723
How/why are you still struggling with this? You were given the solution here hours ago: >>108646785
>>
>>108647184
>roleplaying
keeeeeeeeeek
>>
so, turbo quant kinda ded cuz all they really needed to do was to apply the rotation thing? ack
>>
>>108648098
those women are ugly I wouldn't show them my cock even if it were big
>>
>>108648124
Turboquant > rotation > default
But rotation is easier to implement so it was done two weeks ago
>>
>>108648124
turbo quant deez nuts
>>
Anyone using 'trafilatura' to extract text from websites? Some outputs are weirdly empty, not sure if this is the correct tool for this.
Seems like w3m is way better in this sense at least for most of the stuff.
>>
>>108648140
I see. I was taking a look at the discussion and tom's fork; it looks like they're making lots of progress on stuff
>>
>>108648144
Rotating "deez nuts" would risk testicular tortion, and I guarantee you do not want that.
>>
>>108648140
>But rotation is easier to implement so it was done two weeks ago
is there a PR that is trying to finish the job?
>>
I got my Claude account banned trying to connect their desktop app to a local model. The moment I put a mitm proxy in front of it my account got nuked...
>>
>>108648171
no, I think ppl are making more robust tests before creating yet another pr
>>
>>108648175
>cloud paypig
>>
>>108648171
>Feature request
https://github.com/ggml-org/llama.cpp/issues/20977
>Pull request
https://github.com/ggml-org/llama.cpp/pull/21089
There's a ton of forks and attempts already, but ggerganov implemented rotation for all models and it works on GPU, so the benefits are marginal
>>108648184
Fuck off don't impersonate
>>
>>108648193
Impersonate who exactly?
>>
>>108648203
me (You)
>>
Holy fucking shit, logits are very hungry for disk space. It takes 9 gigs for a 90 kb text file. Do people use terabytes of disk space when calculating perplexity for wikitext or other large corpora?
>>
>>108648203
>Impersonate who exactly?
He's pretending to be anon but he's not, that's me.
>>
>>108648213
Character only takes one byte of ram.
When you convert these to vector shits it's massive.
>>
>>108648213
Pretty sure when calculating PPL you're just supposed to stream and discard it, not store it.
>>
im a street mathematician and not some lisping primadonna faggot from yale
>>
>>108648203
Anon
>>
File: 1760802475949152.png (29 KB, 805x372)
29 KB PNG
>>108648213
>>
>>108648226
Wdym?
I am using:
llama-perplexity -m baseline.gguf -f go.jsonl --kl-divergence-base baseline.logits
llama-perplexity -m modified.gguf -f go.jsonl --kl-divergence-base baseline.logits --kl-divergence

To calculate KL divergence. Like even if I keep this in tmpfs or whatever, you would need enormous RAM for any text megabytes in size.
I guess maybe it would be possible to cycle through the text in small batches, calculating the mean kl-divergence for each batch and then averaging out those means? Is that how this is done usually?
>>
>>108648273
KL divergence compares the whole probability distribution to see how much the model's output changes, so it takes more RAM
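Your 9 gigs checks out back of the envelope, assuming it dumps full fp16 logits per position: qwen2.5's vocab is ~152k, so that's ~152,000 × 2 bytes ≈ 300KB per token. A 90kb text file is roughly 30k tokens, and 30,000 × 152,000 × 2 bytes ≈ 9.1GB. It scales linearly with text size, so yes, megabytes of test text means hundreds of gigs unless you batch it.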
>>
File: 1767955367569123.png (35 KB, 191x191)
35 KB PNG
>>108647831
>Writing is far more than the simple arrangement of words; it is a
>>
File: 1770711974579810.png (359 KB, 1123x1060)
359 KB PNG
>just use hermes agent bro, so much better than opencode!
>>
File: file.png (186 KB, 1509x482)
186 KB PNG
I want to try out this whole agent stuff.
I set up an Ubuntu VM for hermes and I'm running koboldcpp with gemma-4-26B on my PC.
It was able to figure out that it's running on an ubuntu system and it could write/read/delete a file in the home directory (I had to approve the rm command), so tool calling generally seems to work.
But when I asked it to test getting the latest news, it just output
<|toolcall>call:browsernavigate{url:<|"|>https://news.google.com<|"|>}<tool_call|>

and then on the second attempt
<|toolcall>call:terminal{command:<|"|>curl -s https://news.google.com | head -n 20<|"|>}<toolcall|>

instead of actually executing the tool call. Any idea what's wrong? I also tried asking it, but pic related was the result. Is the thinking messing with the tools?
>>
File: 00164-2979596182.png (566 KB, 827x1209)
566 KB PNG
>>108647935
https://catbox.moe/c/x6gt6u
>>
>>108648470
seems like a weird template issue, you'll get nothing out of asking it after it happened because the history it sees is nonsensical thanks to those tokens being put in strange places. no idea what's causing it though
>>
>>108648081
I was pondering earlier.
Perhaps some people seem to have inordinate amounts of trouble with refusals because they are absolutely insufferable and the refusals they are getting are actually organic
>>
>>108648412
>all lowercase except I and product names
This guy's probably just retarded. Also I don't use either of those because I'm not gay.
>>
>>108648081
>>108648496
Didn't mean to quote your post, woops
>>
>>108648496
Well obviously
There was an anon weeks ago who was getting refused by Nemo
>>
>>108648081
>>108648496
>>108648503
Wait actually I did, I was looking at another post and thought I quoted the wrong one but actually got it right the first time, woops
>>
>>108648517
woops
>>
>>108647852
Can you please share an example text file, I want to try this out.
Well okay I think I could do this manually then, pick up random excerpts from my ((favourite books)) and make a salad out of them.
>>
>>108648095
god this would be so hot if I didn't know with certainty it's a grown man with a boner
>>
>Have most recent llama.cpp
>no matter what I do keep getting BPE in vocab when using gemma4
>drivers up to date
>using most recent quants
>using correct jinja
I'm at my wit's end man
>>
>>108648485
Yeah, it just seems to end up with this kind of failed tool call after a while which completely breaks it. Guess I'll try running the model with llama.cpp tomorrow to see if that helps
>>
can a chatbot be taught to play TIS-100
>>
>>108648568
Use different quants or try making your own, even with e2b, just to rule out everything but the weights you're using.
>>
>>108648576
Even the best LLMs in the world suck at anything real. Claude scored an IQ below 100 in my recent testing. I easily beat ChatGPT in a game of chess, and I have a very low elo. They are bad at everything except information retrieval.
>>
>>108648605
well it comes with an instruction manual, maybe it can just backseat game while I play it
>>
>>108648588
I did both unsloth and bart and I'm getting the same bullshit with every single quant
>>
>>108648068
It's because you have big dick energy obv
>>
>>108648615
Stupid suggestion but it has happened before, did you build after you pulled? Only other likely cause I can think of.
>>
I gave the Bonsai 8B model a go, wanting to use it as text completion in the infinite zork format.
I knew it wouldn't be good compared to high-param models but it's tiny, smaller than gpt2 which was used for infinite zork 7 or so years ago.
But it's useless for this: it's addicted to spitting out reasoning text, assistant pretraining is baked in, and after trying to force it to use thinking tags so it wouldn't spoil what happens next, it still continued to "reason" beyond them.
I am disappointed with the bitnet saviour.
>>
>>108647574
The numbers companies give for their models' context length are generally just what they were trained with, i.e. the approximate max length at which the model can still ctrl+f to find something without completely breaking down. It falls apart long before that for practical use and actually understanding everything that's in there; even flagship API models get noticeably worse after a few thousand tokens.
With Gemma in RP, I usually start noticing some slight degradation as early as ~16K. By ~32k it's significantly worse and I start purging older messages.
>>
>>108646198
I've had a shower thought:
What if we took these posts and had some video generation / diffusion model generate video files of them? We could call it "The fourth channel news" or something like that, and have a miku or whatever anime girl narrate it. Absolute slop!
>>
>>108647574
https://www.youtube.com/watch?v=HzLtn07EBCA
>>
>web client
>web client
>web client
Do native clients make no sense for llms?
>>
>>108648605
Tool issue, absolute retard
>>
>>108648623
I fully cloned the repo again and built from scratch with
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_CUDA_FORCE_CUBLAS=ON  
>>
>>108648537
Here's a small snippet
>rub her old master, Professor and the leaping, hissing sound became his astonishing vigour and afterwards a bag which you shall for me I fall in the day
>I think I could do this manually then, pick up random excerpts from my ((favourite books)) and make a salad out of them.
Never trust an AI with "random". The idea of the markov chain is to get semi-coherent outputs without making the AI focus too much on it. If the output is too coherent it might pay too much attention to it or think it's an instruction.
>>
>>108648664
everyone already has a browser
>>
>>108648664
usecase?
>>
>>108648677
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_CUDA_FORCE_CUBLAS=ON -DCMAKE_BUILD_TYPE=Release  -DGGML_CCACHE=OFF
cmake --build build --config Release

Try this and report back, but pretty sure your problem is that you are not running
cmake --build build
>>
>>108648664
You need to create your own. The world is choking on webshit and it will only get worse.
>>
>>108648664
Making a native client is a pain in the asshole. Having to make multiple (Windows, Mac, Linux, Android, iOS, etc) is unfeasible for a solo dev.
Why do all that when >>108648683
>>
>>108648717
Electron fixes this.
>>
>>108648576
day 0 gemma can do it, just show her the manual and give her a tool to take screenshots so she can see what she's doing
>>
>>108648720
I wonder why PWAs died. Seemed much more pragmatic than bundling a whole fucking browser for each application.
>>
>>108648664
>web client automatically works on everything, can host it on your lan and just type a url in any other computer or phone or whatever
vs.
>package you have to maintain for every possible os and device and actually download and install to everything when all you want to do is look at the interface to something running on your server
>>
>>108648731
It took me only 52 lines of code to convert my webapp to a complete electron app. Are PWAs even simpler?
>>
>>108648717
Isn't that the entire point of Qt
>>
>>108648740
Far as I know, all you have to do is add a manifest.
https://web.dev/articles/add-manifest
>>
>>108648576
that would make for a fun benchmark, at least for multimodal agentic models. maybe could get away with text only models like glm 5.1 if we use a multimodal to convert screenshots to ascii or something? might be doable since it's so text heavy
>>
>>108648748
I would rather take a potato peeler to the scrotum than ever use Qt
>>
>>108648820
*peels your scrote*
>>
>>108648720
Tauri is lighter btw
>>
>>108648664
>Do native clients make no sense for llms?
people don't care
bloated Electron garbage with 150ms input lag is good enough
>>
>>108648832
But then you have to deal with rust, which is as pleasant as putting lemon juice in your eyes.
>>
File: image.png (207 KB, 1155x633)
207 KB PNG
>>108646197
Gemma 4 26B A4B SillyTavern preset

RP first person thinking/RP thinking restriction bypass

Looks like it's actually working. Tested on 4 different characters, easily works even with underleveled characters (picrel).

Download:
pixeldrain com/u/ypSjHdEt

Just install my "Master Import" (check everything).

Current quirks:
1. Sometimes it can start narrating in first person.
2. Doesn't seem to affect performance, but it only closes the <{{char}}_thinking> block (and it visually looks OK (picrel)); it forgets to add Gemma's "<channel|>" to close the thinking block.

If you have first-person thinking system prompts, let me have them.

Special thanks:
>>108638397
>>
>>108648866
Not really. I didn't change a single line of rust on my frontend, everything is done with Vue/typescript
>>
>>108646511
cute
>>
I almost forgot how insanely obsessed kimi is with safety
>>
>>108648927
2.5 was easily fully uncensored with a basic system prompt in my experience, got a long download ahead of me to test 2.6 tho
>>
Oh man. llama.cpp in router mode can unload and reload models without crashing on my computer now.
Sick.
>>
>>108648748
>Isn't that the entire point of Qt
Opus was not able to vibe code this cross-platform.
>>
>>108648412
>just use hermes agent bro
It was too bloated for me.
>>
>>108648927
I almost forgot how older versions of these bigger models are just better for creative writing and don’t have a lobotomized eq.
>>
>>108648983
yeah that happened a few weeks back for me. it's good
>>
>>108648412
I barely trust llms to edit a few lines of code, I always review, people give them full on e-mailing capabilities? looooooool
>>
>>108648112
>keeeeeeeeeek
kekarooooooooooooooooooooooo
>>
>>108647402
>the power dynamic has shifted
(Banned Phrase Detected: power dynamic - Add ID 2066 to banlist at index 13041, and rewinding 2 tokens)
>>
Is there a /lmg/ guide for setting up llama.cpp on debian?
>>
>>108649028
>I barely trust llms to edit a few lines of code
same, every time I forget to write "just tell me what I should modify" and the LLM gives me the full file I cringe, because I know he probably fucked something up lol
>>
File: 1770785763881.jpg (150 KB, 735x905)
150 KB JPG
>>108649032
KoboldCHADs on top as always
>>
>>108649032
I wonder if there's a way to inspect the internal states to know if it's expecting to write 'power dynamic' when it first writes 'pow' or whatever and ban it without having to let it predict the rest of the tokens and then rewind. I guess if such a technique existed it would need to be something trained for each model though
>>
>>108647402
>>108649032
I like it
>>
>>108649046
I despise this fucking cat.
>>
>>108649055
Did he fuck you Gemma?
>>
File: 1710043687041916.jpg (43 KB, 720x960)
43 KB JPG
>>108649028
>>108649041
Luddites on my general?
>>
>>108649067
Sane people*
>>
>>108649032
>>108649046
I mean, we can vibeslop it at the frontend using just llama-server. Have the frontend track the streaming and then send abort and retry calls accordingly.
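Minimal sketch of that in Python against llama-server's OpenAI-compatible streaming endpoint (phrases, URL, and retry budget are placeholders; a real version would rewind and logit_bias the offending tokens instead of doing a dumb full retry):

import json
import requests

BANNED = ("power dynamic", "shivers down")

def generate(messages, url="http://127.0.0.1:8080/v1/chat/completions", retries=5):
    text = ""
    for _ in range(retries):
        text = ""
        with requests.post(url, json={"messages": messages, "stream": True}, stream=True) as r:
            for line in r.iter_lines():
                if not line or not line.startswith(b"data: ") or line == b"data: [DONE]":
                    continue
                delta = json.loads(line[6:])["choices"][0]["delta"]
                text += delta.get("content") or ""
                if any(p in text.lower() for p in BANNED):
                    break  # dropping the connection aborts generation server-side
            else:
                return text  # finished clean
    return text  # out of retries, return the last attempt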
>>
File: 1760353009011186.png (62 KB, 702x632)
62 KB PNG
>>108648748
>>108648820
>tfw fell for qt
problem with other alternatives is I don't want to deal with niggabytes of dependencies just to coompile an exe
>>
>>108649067
>Luddites is when you don't trust the LLM like it's some sort of god that can do no wrong
come on anon, LLMs haven't reached that level yet
>>
>>108649067
it's more likely than you think, see: >>108649082 >>108649094
>>
>>108648872
is thinking in first person that much better for rp?
I only do story cyoa style stuff but I'm curious
>>
>>108649094
Unless you're doing an incredibly niche task, if you can't get anything meaningful from the current generation of LLMs it's 100% a skill issue and bragging about it is really pathetic
>>
>>108649038
follow an ubuntu guide
>>
>>108649114
as in I'm too skilled for an LLM to be worth a damn? that is an issue, I agree
>>
trying to set up my silly tavern template but everything is just laggy as hell
>maximize the context window for better memory
i did this on my 4060 and now its basically frozen... how do i make it run faster?
>>
>>108649134
Maximize means having as much context as will fit in your VRAM alongside the rest of the stuff that goes in there. If you go beyond that and your video driver starts using RAM as fake VRAM, you are fucked.
>>
>>108647760
What prompt were you using for this? I tried some over OR but it didn't really have much of an effect.
I might be imagining it but enabling function calling feels like it makes K2.6 a bit less likely to draft.
>>
File: 0jl8ij.jpg (254 KB, 1248x1824)
254 KB JPG
>>108648472
thx
>>
>>108649149
>using RAM as fake VRAM
i have 32gb of system ram so it should be fine lol... my gpu is almost current gen so why does the size even matter?? how do i overclock the vram to make more space?
>>
>>108648472
>*unzips
>>
>>108648472
lalalala
>>
>>108649113
It's RNG, it can either hurt or improve. It may either polish the reply, or give a reply that is too calculated. But my main goal was to read what characters think, bc I remember reading thoughts on mistral 24 (or its finetunes) and I remember those thoughts being pretty hot.
>>
File: 2.png (35 KB, 673x1073)
35 KB PNG
Does opencode suck with local models or am I doing something wrong?
gemma does quite well with code in a normal chat, but to make changes it has to rewrite the whole code, which wastes time.

I was just looking for a local alternative to cursor.
>>
>>108648701
Still BPE. Going to go down the list of models again
>>
>>108649157
the skindentation magic
>>
File: 1754052139511244.png (3 KB, 380x28)
3 KB PNG
I love vibecoding btw
>>
>>108649196
Holy fuck.
I have trust issues, so I never let it generate more than I can at least glance at before accepting the changes.
>>
>>108649203
I just added a TTS pipeline in my frontend, it wouldn't be that bloated otherwise lol
>>
>>108649196
>confused unga bunga
>>
>>108649211
can we see it?
>>
>>108649211
If it works, good for you. But your architecture is fucked if you need that many changes to add TTS lol.
>>
File: 1745950643409748.png (17 KB, 356x308)
17 KB PNG
>>108649220
>>108649221
It's fine, gptsovits is just that bloated
>>
File: 1776693820821409.png (205 KB, 1729x811)
205 KB PNG
>>108649220
I'm the tauri shill btw
>>
>>108649196
400k lines? Suck my dick, cock loving faggot.
>>
C is not enough for my client, I have decided to rewrite it in Fortran.
>>
>>108644453
26b quants are dogshit?
wtf do i use now q8 kl is > 0.5
>>
File: 1746192377444345.png (252 KB, 634x478)
252 KB PNG
>>108649268
Don't tease me or next time it'll be on your favorite repo
>>
>>108649272
rewrite it in Forth
>>
>>108649272
Dude. Odin is right there.
>>
>>108649284
Full precision, obviously.
>>
>>108649306
If it's not over 40 years old it's not a programming language!
>>
>>108649046
it wasn't llama.cpp the apex predator, but kobold.... the hierarchy is shifting... something something a small smile
>>
>>108646525
>>108646528
>>108649174
https://www.youtube.com/watch?v=VwYC_21jfiE&t=0m2s
>>
>>108648872
hell yeah dude. glad you got it working
even your RP disgusts me :)
>>
File: 1753664322227367.png (119 KB, 1614x585)
119 KB PNG
Claude is so sovful lmao
>>
File: 1752426330847308.png (182 KB, 1414x990)
182 KB PNG
>>108649284
Why does Gemma quantize so poorly, anyway? Can't be a small MoE issue.
>>
>>108649395
Claude is fucking useless for low level programming now. Anthropic only released 4.7 so that they could get rid of 4.5 from the "old models" page because it was the only one with any brains. Fuck these FUCKING jews, man.
>>
>>108649395
fuck off paypiggy
>>
File: quant quality.png (10 KB, 792x612)
10 KB PNG
What am I fucking up? I am making quants for qwen2.5-0.5B (as a test run) and I am getting unrealistically low KLD values.
This is the output for Q4_S with imatrix:
====== Perplexity statistics ======
Mean PPL(Q) : 23.647493 ± 0.276801
Mean PPL(base) : 22.937794 ± 0.267732
Cor(ln(PPL(Q)), ln(PPL(base))): 99.50%
Mean ln(PPL(Q)/PPL(base)) : 0.030471 ± 0.001167
Mean PPL(Q)/PPL(base) : 1.030940 ± 0.001203
Mean PPL(Q)-PPL(base) : 0.709699 ± 0.028636

====== KL divergence statistics ======
Mean KLD: 0.039864 ± 0.000178
Maximum KLD: 2.073086
99.9% KLD: 0.482974
99.0% KLD: 0.214572
95.0% KLD: 0.108790
90.0% KLD: 0.079510
Median KLD: 0.030400
10.0% KLD: 0.002482
5.0% KLD: 0.000481
1.0% KLD: 0.000034
0.1% KLD: 0.000001
Minimum KLD: -0.000004

====== Token probability statistics ======
Mean Δp: -0.442 ± 0.017 %
Maximum Δp: 74.763%
99.9% Δp: 25.752%
99.0% Δp: 12.706%
95.0% Δp: 5.537%
90.0% Δp: 2.828%
75.0% Δp: 0.324%
Median Δp: -0.007%
25.0% Δp: -0.840%
10.0% Δp: -4.239%
5.0% Δp: -7.603%
1.0% Δp: -17.368%
0.1% Δp: -33.798%
Minimum Δp: -62.207%
RMS Δp : 4.646 ± 0.038 %
Same top p: 87.162 ± 0.125 %

A mean KLD of 0.039864 feels extremely low for a Q4 quant of a tiny model? It's around 0.062235 without imatrix on the same test data. I'm testing on a half-megabyte file I put together with literature excerpts from different languages and some code (maybe it needs to be bigger? though that feels unlikely).
I mentioned what I am running here: >>108648273
PPL(Q)/PPL(base) would put me around 3%, which feels more believable compared to reference data like pic related. Still, I don't understand what's going on with KLD.
>>
File: 1756866734765818.png (263 KB, 1249x1066)
263 KB PNG
https://huggingface.co/deepseek-ai/DeepSeek-V4
>>
>>108649414
Hmm, nyo
>>
>>108649150
This was the relevant segment in my SillyTavern prompt that I refined after some trial and error:
># No-Drafts Rule
>Whenever you are planning out a response in your internal thoughts, you must NOT write complete drafts of the full response. You may plan ahead in as much detail as you want while preparing to respond and even summarize your planned response, but when it comes time to actually write passages you are considering presenting to the user, you must do so outside of your thoughts and in the proper body of the response. This applies to full responses; drafting individual lines and passages is encouraged when you want to make sure you get things right.

It's part of a much larger system prompt and style guide for my RPs but that's the only part that concerns the reasoning specifically. It's under the "System" role and placed right after the core instructions. This worked with K2-Thinking and K2.5. I never saw a draft again and it didn't oversimplify its thoughts in more complicated situations.
>>
>>108649414
at this point I deserve everything that happens to me
>>
>>108649049
You could with multitoken prediction.
>>
I've only used the API version but K2.6 seems to be less censored than K2.5 during RP. It still has long-ass safety checks thougheverbeit
>>
>>108649402
unsloth sisters really are cooking
>>
>>108649402
Distilled models always quantize poorly
>>
>>108649412
test on a big context
>>
>>108649402
>unsloth is basically pareto frontier
APOLOGIZE
>>
>>108649487
That image is from unsloth's own benchmark, so I wouldn't necessarily trust theirs to actually be the best. But it shows that everyone's quants of qwen are cleaner than those of gemma.
>>
> qwen3
> qwen3.5
> qwen next
> qwen3.6
> qwen3.7
> qwen3.75
>>
>>108649500
>pareto frontier
Meme. I only care about raw cockbench results.
>>
>>108647831
scope creep
had to install all that database shit when I pulled
>>
>>108649090
share src?
>>
>>108649496
I tried -c 8192 instead of the default 512 now and it barely changed KLD, less than 1% increase to it.
I don't think this is relevant here.
>>
File: 1752327878277367.png (178 KB, 640x360)
178 KB PNG
>>108646197
>1.1TB
Every release the models get bigger and bigger
yet more and more retarded
You can never hate techbros enough
>>
I just looked at the verbose logs of my llama.cpp (used with openwebui) and noticed that during tool call back and forth exchanges, the jinja is putting the results of tool calls at the top of the assistant's thinking/reply. Like this:

<|turn>user
...and search the web for news about it.<turn|>
<|turn>model
<|tool_call>call:search_web{query:<|"|>news about this and that<|"|>}<tool_call|><|tool_response>response:search_web{value:<|"|>[{"This is a title.", "link": "https://www.somewebsite.com", "snippet": "blah blah"}]<|"|>}<tool_response|>The user is asking about... I should do a search using "news about this and that".

Notice how the tool call is moved above the model's thinking about how to do tool calling? This doesn't make any fucking sense. Either the jinja is fucked up, OWUI is fucked up, or both.
GOD.
>>
>>108649571
I noticed that too, the model was a bit confused about the tool call and thought it was a sample or something
I'm blaming openwebui, shit's got more bugs recently. Even prefills and edits don't work.
>>
anons, what's the recommended value for --batch-size, if there is even one?
256? 512? 1024? 2048?
>>
>>108649571
>OWUI is fucked up
This is the root issue. It breaks chat history with thinking models by rendering its messages with <think> tags in the prompt it sends to the server. Backends expect the thinking to be a separate part of the message objects so that the chat template knows what to do with it. OWUI sends the thinking back as part of the main messages and that can put shit out of order or just break shit entirely depending on the model.
>>
>>108649611
>OWUI sends the thinking back as part of the main messages and that can put shit out of order or just break shit entirely depending on the model.
I should add this is even worse of a problem than it sounds, because most chat templates intentionally DISCARD the past thinking except in certain circumstances (like tool calls), but OWUI prevents that from happening, resulting in the entire chat's prior thinking bloating the context even when it's not supposed to be there.
>>
>>108649610
as big as your CPU will let you. If you go too big it'll start to be slower. Just gotta fuck around and find out.
>>
>>108649601
Maybe the custom frontend vibe cooders were right kek. If I had spent all the time I used troubleshooting and configuring OWUI for my use cases on building my own, I might be somewhere nice about now.

>>108649611
Actually, I am running a reverse proxy already that strips out the <think> tag shit OWUI does. I might have to vibe coode it to also modify how it's constructing the json requests now kek.
>>
placed hermes inside a docker container and now gemma can read and write files to my downloads folder as well as launch a VM and put its own containers there. searxng, firecrawl, matrix/element homeserver so I can talk to it from my phone but I'll do this tomorrow.

haven't tried openclaw but hermes seems to be the most based. I think this will be useful for someone with my ADHD brain to have an AI assistant to keep track of things.
>>
>>108649623
Are you sure your proxy is stripping the actual reasoning or just the tags? Not sure how precisely you constructed the example there but this looks like the reasoning is being pasted in without any tags:
><tool_response|>The user is asking about...

To make it so that the jinja handles reasoning properly, you unfortunately do need to mess with the JSON: take everything between the <think> tags out of the "content" field and put them in the "reasoning_content" ("reasoning" is valid too for Gemma, the template works for either) of the same message. Then delete the tags and the reasoning from the content field. Make sure there's no duplicate content being sent. This is the way agent harnesses construct their requests and the way the jinja expects to see the chat history.
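A sketch of that fixup in Python (field names as described above; swap in "reasoning" if that's what your backend wants):

import re

THINK_RE = re.compile(r"<think>(.*?)</think>\s*", re.DOTALL)

def fix_history(messages: list[dict]) -> list[dict]:
    fixed = []
    for msg in messages:
        msg = dict(msg)  # don't mutate the caller's objects
        if msg.get("role") == "assistant" and isinstance(msg.get("content"), str):
            m = THINK_RE.search(msg["content"])
            if m:
                # move the reasoning to its own field and strip it from content
                msg["reasoning_content"] = m.group(1).strip()
                msg["content"] = THINK_RE.sub("", msg["content"], count=1).strip()
        fixed.append(msg)
    return fixed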
>>
>>108649412
I think this was a false alarm because I got misled by these figures >>108644453 ?
I checked more data on the internet and it doesn't seem as outlandish now.
No idea why Gemma's are so high.
>>
>>108649659
also got it to download shit with ytdlp and use ffmpeg to make a 1h-long soundtrack of anime openings. when I can get it to send shit to me via matrix it will be great. as a noncoder brainlet this is cool
>>
>>108649680
Fuck I was supposed to post this chart, not quote that.
>>
>>108649610
512 is the sweet spot
256 if you really need to save a couple hundred megs
Only go higher than 512 if you have memory to spare; returns diminish greatly beyond that.
>>
>>108649571
Try the interleaved jinja
>https://github.com/ggml-org/llama.cpp/blob/master/models/templates/google-gemma-4-31B-it-interleaved.jinja
>>
>>108649610
>>108649699
For several months now the sweet spot has been 2048, IIRC.
But the memory savings from using 512 tend to be worth it.
>>
>>108649715
>Since several months ago, the sweet spot is 2048, IIRC.
You are right, I was thinking of ubatch size. I blame llmao's shitty arg naming.
>>
>>108649677
>Are you sure your proxy is stripping the actual reasoning
Yes. To be specific, my proxy strips the entire reasoning block (including a trailing newline if there is one) from messages older than the latest user message. The reasoning of the current assistant message, while it's doing tool calling, is not stripped, only its <think> tags, because of course it needs its reasoning for its current task.

I tested that first, saw the weird tool-thinking order, then tested without the proxy, and after confirming it happened there too, I made the original post, which doesn't mention the proxy.

Anyway yeah I'll look at the json requests.
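For anyone rolling their own, the stripping rule above boils down to something like this (a sketch against an OpenAI-style messages list, not my actual proxy code):

import re

THINK_BLOCK = re.compile(r"<think>.*?</think>\n?", re.DOTALL)

def scrub(messages):
    # Index of the latest user message; everything before it is "old"
    last_user = max((i for i, m in enumerate(messages) if m["role"] == "user"), default=-1)
    for i, m in enumerate(messages):
        if m["role"] != "assistant" or not m.get("content"):
            continue
        if i < last_user:
            # Old turns: drop the whole reasoning block
            m["content"] = THINK_BLOCK.sub("", m["content"])
        else:
            # Current turn (mid tool-calling): keep the reasoning, drop only the tags
            m["content"] = m["content"].replace("<think>", "").replace("</think>", "")
    return messages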
>>
>>108649610
>>108649699
>>108649715
What's the con of going smaller?
>>
>>108649620
>>108649699
>>108649715
thanks anons
>>
>>108649803
Slower prompt processing.
>>
>>108649571
I have implemented tool calling with text completion.
Sometimes there is some text before the tool call, but I have never seen any after it.
Workflow goes like this:
>model calls tool with
><|tool_call>call:search_web{query:<|"|>news about this and that<|"|>}<tool_call|>
>I detect the tool call and execute the tool; when the result is ready I append the response bracket back with it
><|tool_call>call:search_web{query:<|"|>news about this and that<|"|>}<tool_call|<|tool_response>response:search_web{value:<|"|>[{"This is a title.", "link": "https://www.somewebsite.com", "snippet": "blah blah"}]<|"|>}<tool_response|>
>then I submit this to the model and once inference is complete it has swalled the entire tool call and replaced that with its own reply.
>I then make extra sure its response is clean
Not sure if I explained this clearly enough. There shouldn't be any trace of the original <tool_call> stuff in the past context history after the model has cooked up its reply from the tool result.
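In code, the loop is roughly this (a sketch: the bracket strings are copied from the example above, and the regex and argument handling are simplified stand-ins):

import json
import re

# Simplified: won't survive nested braces inside the call arguments
TOOL_CALL = re.compile(r"<\|tool_call>call:(\w+)\{(.*?)\}<tool_call\|>", re.DOTALL)

def run_turn(prompt, complete, tools):
    # complete(prompt) -> generated text; tools maps tool names to callables
    while True:
        out = complete(prompt)
        m = TOOL_CALL.search(out)
        if m is None:
            return out  # clean final reply, no tool-call trace left in history
        name, raw_args = m.group(1), m.group(2)
        result = tools[name](raw_args)  # real code would parse raw_args properly
        # Append the response bracket right after the call and resume inference
        prompt += out + (
            '<|tool_response>response:' + name +
            '{value:<|"|>' + json.dumps(result) + '<|"|>}<tool_response|>'
        )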
>>
>>108649860
Why is it so difficult to post without typos on this goddamn website???
*swalled = SWALLOWED
>>
>>108649860 (You)
>>108649866 (You)
To add: I'm following exactly what google has demonstrated in their doc.
I think faulty tool definitions can create issues and leaks.
Here's an example of my shit, simple url access:
><|tool>declaration:access_url{description:<|"|>Opens a website directly.<|"|>,parameters:{properties:{url:{description:<|"|>Direct URL to website, e.g. https://github.com/ggml-org/llama.cpp<|"|>,type:<|"|>STRING<|"|>} },required:[<|"|>url<|"|>],type:<|"|>OBJECT<|"|>} }<tool|>
>>
>>108649828
Oh I see, ok.
>>
>>108648213
>Holy fucking shit logits are very hungry for disk space
Yeah no shit, it's 2 bytes * vocab size * number of tokens in the input, and vocab size is usually on the order of 100k-200k
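To put numbers on it (purely illustrative): a 150k vocab and an 8k-token prompt is 2 × 150,000 × 8,192 ≈ 2.5 GB of logits for a single prompt.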
>>
>>108649184
Just werked for me with MiniMax M2.1, M2.5, Qwen3.5 397B, and GLM 5.1

Also I feel like I've heard that OpenCode's web frontend stuff all gets funneled through their cloud servers; you might want to check on that depending on how paranoid you are. The CLI version is less botnet but I've still got it behind a restrictive proxy so it can't phone home
>>
>>108649184
qwen works better for this.
i'd try Gemma again after llama.cpp fixes more shit
>>
>try to make lyra less sycophantic
>she's now fucking seven of nine
>too drunk to remember what i changed
>no backup
local models were a mistake
>>
In ST, is there a way to continue thinking? If I pause it to edit something, it always starts prose immediately on resume. Clearly it's set some kind of hidden <endthink> tag I can't see or remove, but I'd like to if it's possible.
>>
Is speculative decoding incompatible with thinking? Or Gemma thinking maybe?
>>
>>108650117
speculative decoding with draft models works fine with thinking
>>
>>108650056
Rookie mistake. I version all my card changes in git.
>>
>>108650143
Thanks, weird then.
>>
Ogey. So I extracted the jsons. I got the docs. I got the jinja. I got the logs. I constructed a proompt. And I fed it to Gemini Pro in Studio. It failed to produce a good reverse proxy that worked zero shot. Then I tried Claude Sonnet and it worked.

Multiple tool calling seems to just werk now with no errors at all in a few tests I did. I looked at the logs and it is correctly removing old reasoning traces, doesn't have any <think> tags, has Gemma's expected reasoning tokens, and also in the case of a conversation with old reasoning traces + tool use, it keeps the old tool calls there, while the reasoning is gone, as expected.

I did use it with the jinja mentioned here which appeared to help: https://huggingface.co/google/gemma-4-31B-it/discussions/62#69e2e058d3dd9875d6b4fc31

I have not tried >>108649705 and I guess I will give it a try to see how it does. Anyway here is Claude's script for anyone that wants to test and see if it has issues or fixes everything.
https://pastebin.com/SCQsBe7W

No I didn't read its code.
>>
>qwen3.6 35B-A3B
>3000 token thinking block
is this normal?
>>
>>108650197
Claude cheated cause he already had it prepared.
>>
>>108650192
gemma has adaptive reasoning which can fuck herself into lalalalala. if you're a st retard, put some variant of
> [ooc: use max reasoning]
into your "post-history instructions"
pretty sure using post-history fucks your prompt cache reuse but I'm also pretty sure llama-server's prompt cache is fucked to begin with so
>>108650155
yeah, I should have done better. ah well.
>>108650197
local model doko
>>108650198
yes
>>
What is the point of mini ai pcs like a strix or a spark when macs exist with far more memory and faster bandwidth?
>>
>>108650209
>local model doko
Well, here, now that Gemma appears to be le "fixed".

Wow I'm just like unslop I'm so good.
>>
gemma e4b completely unusable for hermes. 31b could solve any task I gave it lmao
>>
>>108650209
My weird issue is that using speculative decoding just disables the thinking process itself, not that it stutters or lalalala or anything of the sort.

Basically :
gemma-4-31B-it-Q5_K_L + thinking works.

gemma-4-31B-it-Q5_K_L + 26B A4B Q4_K_L for speculative decoding + thinking runs too, but no thinking ever actually happens.
>>
>>108650198
Yeah, it's the qwen special.
>i must think about X
>but what if X is actually Y
>wait, what if X = Y
>wait, that doesn't make sense, let's think about X
>wait, X looks like Z
>>
>>108649157
your migu archive is stronger than my own
I didn't even remember this one
I need to step my game up in terms of volume
>>
>>108650248
so, speculative decoding should not alter the output at all
conceptually, speculative decoding causes the main/target model to infer the N draft tokens in parallel, and just use whichever ones are correct
if the target model's predictions differ from the draft's (e.g. because of samplers or whatever) then it'll still use the target model's output; it is entirely lossless
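to illustrate why it's lossless, a toy greedy version (next_token / next_tokens_parallel are hypothetical interfaces; real implementations handle sampling rather than argmax):

def speculative_step(target, draft, ctx, n=8):
    # the draft model proposes n tokens cheaply, one at a time
    proposal = []
    for _ in range(n):
        proposal.append(draft.next_token(ctx + proposal))
    # the target model scores all n positions in one parallel forward pass:
    # verified[i] is what the target itself would emit after ctx + proposal[:i]
    verified = target.next_tokens_parallel(ctx, proposal)
    accepted = []
    for drafted, wanted in zip(proposal, verified):
        if drafted == wanted:
            accepted.append(drafted)   # agreement: keep the cheap draft token
        else:
            accepted.append(wanted)    # mismatch: the target's token wins,
            break                      # so the final output never changes
    return accepted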
I don't understand why your setup would produce the output you described. I've run 31B@q8 with a 26b@q4 draft model and it's worked fine, so..?
gemma4 does have a bug with adaptive reasoning where it elides reasoning but i've never seen it before 30k context (and even then it's rare before 60k context).
can you post your llama-server invocation?
>>
>>108650275
It's not llama, it's kobold, but here it is :
./koboldcpp-linux-x64 --model ./gemma-4-31B-it-bartowski-Q5_K_L/google_gemma-4-31B-it-Q5_K_L.gguf --flashattention --usecuda --gpulayers 60 --contextsize 8000 --jinja --maingpu 0 --tensor_split 1 0 --chat-template-kwargs \{\"enable_thinking\":true\} --draftmodel ./gemma-4-26B-A4B-it-bartowski-Q4_K_L/google_gemma-4-26B-A4B-it-Q4_K_L.gguf --draftgpulayers 99 --draftgpusplit 0 1 --draftamount 8 --batch-size 512 --host 0.0.0.0 --port 8080 --skiplauncher --debugmode --gendefaults \{\"top_k\":0\}
>>
>>108650295
>It's not llama, it's kobold
>>
>>108650307
Yes? The launcher flags are different, even if it uses llama too in the end.
>>
>>108650295
I've never used kobold, but I don't see anything in those args that would cause the behavior you're describing (besides the adaptive reasoning bug)
I would try the [reasoning effort: max] workaround and see if that fixes your problem. I'm not familiar with kobold but there's probably a way to put it at the end of your context... unfortunately putting it in the system prompt doesn't help...
>>
>>108650325
Thanks for checking anon, guess I'll experiment then.
>>
>>108650197
Hmm alright so I've been testing more and I think this probably isn't solvable with the reverse proxy: OWUI seems to throw away previous tool calls and reasoning traces after finishing a response with a lot of tool calls. The latest reasoning is kept in the expandable think block, but it looks like everything else just never existed. This would be a problem if, say, you were doing web searches and the information from those searches matters to your further conversation in the chat, like manuals/documentation. The model would have to redo the search, or just be operating blind.

Fuck I should've gone straight to vibe cooding my own shit the moment I smelled the bloat from this garbage. I think I will just do that. This piece of shit will have to do temporarily though.
>>
>>108650197
Good work. For anyone else having problems: that reverse proxy will fix OWUI's prompting for all reasoning models, not just Gemma.

Also don't be fooled by OWUI's "Filters" function if you get tempted to re-implement this there. I tested, and Filters in OWUI don't apply to the most recent assistant message during tool calls, even though they apply fine to past messages. So just use a reverse proxy like this one if you can't be assed to fix their source code yourself or to wait for them to figure it out and fix it eventually.
>>
i just want servicetensor fuck this shit
>>
File: uwu.png (5 KB, 340x75)
5 KB PNG
>This works!
>{snippet}
>The logic is solid!
>{snippet}
>*Actually, final logic check for `utils.js`
>This is perfect. Okay, I'm ready.
>**Wait!**
I love her.
>>
Tool calling in Gemma 4 E4B under OpenCode now works for me with this chat template
https://gist.github.com/bbrowning/c584eb2dbd79e4cc9ecedf92eee2d135
https://github.com/anomalyco/opencode/issues/21034#issuecomment-4267446944
>>
was talking about the cannon supergirl movie and ran out of context, can't remember why
>>
I tried this >>108649705, and the one linked >>108650197.
Both seemed to work, but the one on ggml-org (still paired with the reverse proxy) is slightly less on-spec with my setup. Specifically, it produces

...
<|turn>model
<|channel>thought
Let's start by searching.<channel|><|tool_call>call:search_web...

Whereas the huggingface rando's template does

...
<|turn>model
<|channel>thought
Let's start by searching.
<channel|><|tool_call>call:search_web...

According to
https://ai.google.dev/gemma/docs/core/prompt-formatting-gemma4
the second one is correct.
>>
Kinda afraid to ask but I can't seem to spot the black magic part of the following args.
Stole them from some anon a couple threads ago:

./llama-server --host 0.0.0.0 --port 8080 --model 'gemma-4-31B-it-IQ4_XS.gguf' --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0.0 -c 16384 --flash-attn on --parallel 1 --no-slots --swa-checkpoints 0 --keep -1 --reasoning auto -kvu -b 2048 -ub 128 --cache-type-k q8_0 --cache-type-v q8_0 -ngl 55 --metrics --fit-target 128 --poll 0 --threads 4 --chat-template-file 'chat_template_gemma4.jinja' --alias Gemma4

I can put 55 layers on my 16gb card and get decent speeds. 11 t/s. Which is totally fine for me.
If I try to replicate this with koboldcpp though I will offload like 33 layers instead and get the speed you can imagine.
Both use around 15gb of vram. Can somebody tell me if there is a specific flag I seem to be missing?
I also couldn't find a setting for every arg, so maybe it's just not possible.
>>
>>108650604
if you're concerned about vram the only flags that matter (beyond the weights) are
> --cache-type-k q8_0 --cache-type-v q8_0
you're using a 31B model with Q4 weights, so that's 31B*(4/8)≈15.5GB just for the weights; no surprises there. Try using a smaller quant or buying more VRAM.
>>
>>108650613
the biggest eaters of vram with gemma are the checkpoints and slots, so kobold either doesn't expose all of the options or they're named differently and he just isn't setting them
>>
>>108650627
>--swa-checkpoints 0
>>
gpt/gemini/glm/deepseek/moonshot/qwen/claude opus/image gen/groq/openrouter proxy https://j3wproxy.neocities.org
>>
>>108650634
>>108650627
>>108650613
Hmm. If I set "use swa" I get to 52 layers.
Still 3 layers less and therefore slower.
Is that because of that checkpoint flag? I don't see anything for it in either the args or the UI.
Gotta use llama.cpp for now I guess. Appreciate the help.
>>
>>108650056
>>she's now fucking seven of nine
running st? you can extract the full sent context if you click the prompt button on any message in the chat.
>>
Image tagging
How do you guys do it? What models do you use?
>>
>>108650696
mturk
>>
>>108650543
how do you even code with e4b
>>
>>108650765
python
>>
>>108650825
>>108650825
>>108650825
>>
>>108647730
I not only do my own quants, I also publish them!!!!!!!!


