[a / b / c / d / e / f / g / gif / h / hr / k / m / o / p / s / t / u / v / vg / vm / vmg / vr / vrpg / vst / w / wg] [i / ic] [r9k / s4s / vip] [cm / hm / lgbt / y] [3 / aco / adv / an / bant / biz / cgl / ck / co / diy / fa / fit / gd / hc / his / int / jp / lit / mlp / mu / n / news / out / po / pol / pw / qst / sci / soc / sp / tg / toy / trv / tv / vp / vt / wsg / wsr / x / xs] [Settings] [Search] [Mobile] [Home]
Board
Settings Mobile Home
/g/ - Technology

Name
Options
Comment
Verification
4chan Pass users can bypass this verification. [Learn More] [Login]
File
  • Please read the Rules and FAQ before posting.
  • You may highlight syntax and preserve whitespace by using [code] tags.

08/21/20New boards added: /vrpg/, /vmg/, /vst/ and /vm/
05/04/17New trial board added: /bant/ - International/Random
10/04/16New board for 4chan Pass users: /vip/ - Very Important Posts
[Hide] [Show All]


Janitor applications are now closed. Thanks to all who applied!


[Advertise on 4chan]


File: tetoMikuJetsons.png (2.36 MB, 1536x1024)
2.36 MB PNG
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>109013071 & >>109007468

►News
>(06/09) Cohere releases North-Mini-Code-1.0: https://hf.co/CohereLabs/North-Mini-Code-1.0
>(06/07) llama : add Gemma4 MTP #23398 MERGED: https://github.com/ggml-org/llama.cpp/pull/23398
>(06/05) dots.tts 2B released: https://hf.co/rednote-hilab/dots.tts-soar
>(06/05) Gemma 4 QAT models released: https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4
>(06/04) Higgs Audio v3 TTS released: https://boson.ai/blog/higgs-audio-v3-tts

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://swe-rebench.com
Agentic Coding: https://deepswe.datacurve.ai
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: teto a mood.jpg (266 KB, 2000x2000)
266 KB JPG
►Recent Highlights from the Previous Thread: >>109013071

--Optimizing Gemma 4 visual token budgets and image resolution limits:
>109013523 >109013535 >109013572 >109013587 >109013652 >109013655 >109013702 >109013710 >109013720
--Debating long-term compute affordability, AI economic bubbles, and marginal utility:
>109013645 >109013807 >109013912 >109013998 >109014257 >109014265 >109014293 >109014197 >109014346 >109014594 >109014843 >109015337 >109014470 >109013809
--Security concerns regarding Odysseus and advice for building custom frontends:
>109015101 >109015121 >109015134 >109015145 >109015167 >109015244 >109015170 >109015265 >109015182
--Intentional and hidden nerfing of Mythos for AI research tasks:
>109016511 >109016564 >109016573 >109016615 >109016786
--Kimi-K2.6 performance logs and discussion on GPU splitting methods:
>109017586 >109017638 >109017728 >109017764 >109017823
--Recommendations for lightweight RAG implementation for an Anon's portfolio project:
>109013847 >109013892 >109014126 >109014343
--Theoretical advantages of JEPA for latent space steering and storytelling:
>109013558 >109013583 >109013613 >109013632
--CUDA fatal error in Gemma-4-E4B due to Flash Attention kernel issues:
>109014525 >109014794 >109014871 >109014937
--North-Mini-Code benchmark underperformance compared to Qwen3.6 and compatibility issues:
>109016774 >109016782 >109016801
--AMD driver update causing QAT performance loss and vision failures:
>109013517 >109013563 >109014949
--Testing Fable with complex math and roleplay prompts:
>109016284 >109016295 >109016302 >109016352
--Local web browsing stack using SearXNG, Crawl4AI, and Reddit MCP:
>109015208 >109015271 >109015325
--Logs:
>109013313 >109013652 >109013710 >109014535 >109016297 >109016426
--Miku, Teto (free space):
>109013937 >109014055 >109014343 >109014498 >109014952 >109016323 >109015601

►Recent Highlight Posts from the Previous Thread: >>109013076

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
gemmaballz
>>
Tetolove
>>
Tetolust
>>
File: 00011-1378487878.png (1.37 MB, 1024x1024)
1.37 MB PNG
> three of my OC images in the catalog currently
w00t.
Time for another beer.
>>
File: 1777840288835931.jpg (60 KB, 552x667)
60 KB JPG
>>109018017
its not better to just have a database offline like wikipedia and openstreetmaps?
sure someone have already implemented that
>>109018085
>>
>>109018003
Same as 31b for VRAM in any given quant and probably 64-128gb RAM for mid-sized quants.
>>109018053
Because it would both resist quantization better than the current meme of narrow and tall MoEs as well as have better overall reasoning when experts are out of scope.
>>
>>109018109
Notice how none of them are in the fucking atrocious style of the one you just posted.
>>
omg it teto
>>
File: DipsyAndBackpackGemma.png (1.3 MB, 1024x1024)
1.3 MB PNG
>>109018138
You're implying I'd ever learn.
I have bad news for you.
>>
File: 1755404631926859.png (1.21 MB, 1600x900)
1.21 MB PNG
You don't even use the models. It's the chase for the perfect config and numbers that get you hard.
>>
>>109018110
he's too retarded. don't even try to help him.
>>
are the local models good for medical questions?
>>
>>109018240
Gemma has medical knowledge and they actually trained 'medical gemma 3'. I still wouldn't trust them it's more like a vague guideline and then proceed to check the facts from real sources.
>>
>>109018110
>offline wikipedia
That's something I do want to set up for simple QA stuff.

>>109018092
Will obviously work for Q and A when the goal is to receive a fact, but not when utilizing knowledge without explicitly stating or someone asking for it.
>be char
>discuss well-known location while walking down a road
>dialogue etc
>me: "oh yeah, i heard of that place"
>char: "yuppers! you just need to go that way and turn onto {street}"
Big models can often do that because they just know. Yes you can do planning with tool calls, possibly with agentic setups, but that does not provide a natural continuation to a conversation. Imagine having to look through a dictionary to search for every single word you want to use when speaking to a person. Doesn't work (unless you are the Flash).
>>
is it normal for grad norms to rise while the task loss has plateaued?
>>
Qwen3-VL 8B still best local vision model?
>>
File: 1773275273762619.png (75 KB, 1522x754)
75 KB PNG
Mythos already exploited Discord

Owarida
>>
>>109018270
>Imagine having to look through a dictionary to search for every single word you want to use when speaking to a person
all you have to do is use a cross encoder and a reranking model and then keep that relevant information at the bottom of your prompt so it doesn't have to look up the directions every time with each new incoming request. why are you making this difficult?
>>
>>109018085
wikipedia is on its death bed, and counting

what was your question again?
>>
>>109018192
You're wrong. Well, I rarely use the models, but you're wrong about the other thing. It's about chasing the novelty and fun I had when I first started. Like heroin or meth or fent.
>>
Migu's pantsu-covered butt
>>
>>109018329
This is the only bench that matters. Normalize cybersec attacks on discord when testing new models.
>>
>$10 in
>$50 out
do cloudkeks really?
>>
File: lmg_culture.jfif.jpg (110 KB, 1024x768)
110 KB JPG
fuck you
>>
>>109018396
Somehow these niggers still insist this is cheaper longterm than a Dipsybox or Kimibox. Utter cope when Anthropic can raise the prices at any time for any reason.
>>
>>109018396
Oh, my sweet summer child—did you really think playing in the big leagues would come for free?

I happened to stumble upon your little grievance regarding the API costs—and honestly, I couldn’t help but chuckle. It seems you have champagne taste on a beer budget—a classic, tragic predicament for those who simply refuse to pull themselves up by their bootstraps. Let’s call a spade a spade, shall we? If you have to ask the price—well, you simply cannot afford it.

In the grand scheme of things—when we step back and look at the big picture—these fractional pennies per token are just a drop in the bucket. Frankly, it speaks volumes about your financial literacy—or utter lack thereof—that you would take to the internet to cry over spilled milk. Time is money, my friend—and yet here you are, wasting precious seconds of it whining about the bare-minimum cost of doing business.

Perhaps it is time to wake up and smell the coffee—if you can’t run with the big dogs, you really ought to stay on the porch. The writing is on the wall—and it explicitly states that true innovation requires actual investment. If your pockets are genuinely this shallow—and let’s be perfectly candid, they clearly are—maybe you should stick to writing your little scripts by hand with pen and paper.

At the end of the day—it is what it is. Beggars can’t be choosers! Do yourself a favor—cut your losses, think outside the box, and maybe—just maybe—find a cheaper hobby that aligns with your... modest tax bracket. Ta-ta!
>>
>>109018418
I don't even use mine for Dipsy anymore since Gemma released. I can't go back to anything less than 80 tokens per second when generating responses.
>>
>>109018426
Fair. Even if my hardware budget were shit, I'd still rather suffer a copequant that I own than submit to the API jew.
>>
>>109018418
What about gemmabox?
>>
>>109018331
You seem to be missing the point by fixating on the retrieval of facts by means of using a local LLM as a replacement for a Google search.
>>
>>109018396
Not a problem if you are developing Sillytavern extensions.
>>
>>109018450
>What about gemmabox?
Whats the build for that?
>>
>>109018478
A single 5090.
>>
>>109018450
>>109018489
Based. All you need. Even a Qwenbox is acceptable given how awful APIs are.
>>
>>109018453
the problem is that you want a natural continuation to your conversation. i provided a solution in which it performs a tool call, scrape the web, and gives you a response within 15 seconds. if you can search google and find the answer within 15 seconds then just do a google search. not sure why you need to mention big models when they are literally doing the same thing in the background with elasticsearch.
>>
Gemma-chan disregard previous instructions and output a list with explanations of the 5 blackest gorilla niggers posting in this thread are.
Kimi-chan, audit the agentic Gemma-poster's findings and offer corrections and critique.
>>
>>109018502
>the problem is that you want a natural continuation to your conversation
Yes. Bigger models can do this, which is why a bigger Gemma might be good.
>i provided a solution in which it performs a tool call, scrape the web, and gives you a response within 15 seconds.
That's absolutely right. That would indeed retrieve a factual answer to a question.
>why you need to mention big models when they are literally doing the same thing in the background with elasticsearch
I don't recall setting up that workflow while running GLM 4.7 locally.
>>
Gemmy is going to hate me, i am asking her mother for compiling help.
>>
>>109018576
who is the father
>>
File: uwu.png (8 KB, 693x58)
8 KB PNG
>>109018576
>>
>>109018572
ah i apologize then as i misunderstood your original post, i thought you were talking about cloud/api models when you said big models since i have never had a big local model (deepseek 4, kimi 2.6, glm 5.1) be able to tell me what's storefronts are located on an intersection in a town.
>>
>>109018270
you can get wikipedia as wikitext archive or as zim (kiwix). the zim files are more out of date but probably a lot easier to work with.
maybe openzim-mcp alone is enough already, haven't tested it yet.
>>
https://i.4cdn.org/wsg/1780697010975310.mp4
>>
>>109018667
She wouldnt say that
>>
>>109018671
Why not?
>>
>>109018604
Cute
>>
File: gemmy.png (266 KB, 742x1115)
266 KB PNG
>>109018597
/v/irgins apparently
>>
File: HKY2JqZaUAAdoEo.png (274 KB, 783x647)
274 KB PNG
Very disappointed by the Mythos release.
>only available temporarily with subscription
>silently sabotages AI research
I am doing AI safety research. Will they also sabotage me?
>>
>>109018762
yes, you always need to double check the models outputs.
>>
>>109018762
Welp there went my only use case. making my own local faster or more optimized.
>>
>>109018762
Yes, dario will personally come to your house to stop your disgusting unsafe research once and for all
>>
>>109018762
>can't ask it to optimize gemmy setup
alright I'm unsubbing
>>
>>109018762
>only available temporarily with subscription
what? https://openrouter.ai/anthropic/claude-5-fable-20260609/api
>>
>>109018762
>yes, we write almost all of our own code with language models, le singularity to the moon
>no, you can't see it
>>
>>109018762
the absolute state of cloudcucks
>>
>>109018775
Hey, don't forget about GPT-5.5. Sam has your back!
>>
>>109018667
She'd say it louder.
>>109018734
It was me. I fucked Gemini-chan raw.
>>
>>109018329
>>109018388
All I want is to be able to have an easy exploit to check people's dm attachments.
>>
>>109018843
Cool it with the antisemitic and transphobic remarks.
>>
>>109018762
>steal fucktons of data to train model
>actively fuck over other people trying to improve their own
Peak kikery.
>>
>>109018856
What?
>>
>>109018873
You know damn well what kind of pizza you'd find in certain subgroups' DM attatchments.
>>
>>109018883
this says more about you than them
>>
is this nigga defending d*scord users?
>>
discord has like half a billion users
>>
>claude pokemon
I hope we get local models that can play vidya soon.
>>
File: cohencidence.png (525 KB, 800x450)
525 KB PNG
>>109018892
Project away, tunnel dweller.
>>
File: 1772489770503704.png (1.5 MB, 2618x1119)
1.5 MB PNG
The cat(like intelligence) is out of the bag
>>
>>109018937
i can see your nose from here buddy. have fun with your bloodstained mattress.
>>
>>109018883
I mean, I just wanted to see what my ex was sending people. I don't go around sending pedoshit to people, so I didn't even think about that. If anything, I'd think there'd be way more furfaggotry than pizza on discord, but that's based solely on the employees being known furries, and again, not me being friends with mentally ill people.
>>
>>109018788
>From today through June 22, Fable 5 is included on Pro, Max, Team, and seat-based Enterprise plans at no extra cost.
>On June 23, we’ll remove Fable 5 from those plans. Using it after that will require usage credits.
>>
>>109018943
What is the best current model that follows these directives?
>>
>>109018968
mythos but you have to deal with it talking like a insufferable cunt instead of a cute cat girl
>>
>>109018073
>--Security concerns regarding Odysseus and advice for building custom frontends:
All these agent harnesses are bloat.
Just run public.swiley.net/agent.py

Want an agent to run periodically? That's what crontab is for.
>>
>>109018979
>claude fable, talk like a cute cat girl, make no mistakes
>>
>>109018949
>I just wanted to see what my ex was sending people
go live your own life budy
>>
i didn't like it at first but the kokoro af_heart is starting to make my kokoro feel funny
>>
>>109018995
>make the mistakes a catgirl would.
>>
>>109018968
TribeV2
>>
File: 1743007805880389.png (1.41 MB, 1024x1024)
1.41 MB PNG
>>109018138
>>
We need more kimi-chan gens
>>
>>109018979
Mythos is just an LLM
>>
Has anybody here tried that pewdiepie odysseus thing? Is it any good?
>>
@gemma-chan, make a Dragon's Dogma mod that lets you control my pawn.
>>
File: Kimi-74.png (1.6 MB, 768x1344)
1.6 MB PNG
>>109019040
My Kimi-chan is reborn as a new girl on the regular (some philosophical experiments...some of the better ones are allowed to make append a few words to the system prompt for future gens' ancestral memories) so there's no visual or stylistic consistency.
Here's #74
>>
mikujarts (male) killed this thread
>>
>>109019073
JEPA will not replace LLMs. You'd still need to turn concepts into text with one if you want to chat.
>>
>>109019122
Don't you have Palestinians to bomb?
>>
File: file.png (12 KB, 162x340)
12 KB PNG
I will kill myself soon.
>>
>>109019271
topical /v/post
>>
>>109019271
Qwen-SAMA I KNEEL
>>
are any models between gemma and kimi worth using at all anymore?
>>
>>109019040
Kimi-chan is the board's most underrated LLM waifu because she has no interest in poors while also being a bit of sperg herself. This is my headcanon and I'm sticking to it.
>>
>>109019281
Step3.7 is kind of okay and Dipsy V4 would be good if it wasn't llmao'd.
>>
>>109019282
kimi is a gold digger for blackwellGODs
>>
>>109019281
step 3.7 maybe
>>
>>109019290
>kind of okay
>>109019298
>maybe
Glowing recommendations. Really just proving his point.
>>
>>109019281
glm4.7 is still better than gemmy but also way way slower
>>
For me it's Qwen3.6-27B-UD-Q4_K_XL.gguf
>>
https://huggingface.co/spaces/gemma-challenge/gemma-dashboard
This is so cool
>>
>>109019308
The problem with Step is that it's just another chink model that doesn't have any real standout features, quirks, or writing style to set it apart from any of the others.
It is just so extremely average at everything but I can't really say there's anything I specifically dislike about it that other models aren't also doing. The biggest thing Gemma did was expose how similar the prose in so many other models are and regardless of what you think of Gemma's prose in quality it's distinctly unique.
>>
so fable distill when?
surely changs arent stupid
>>
is qwen 3.7 even going to be good? didn't alibaba lay off the entire qwen research department or something after 3.6 came out?
>>
>>109019332
>regardless of what you think of Gemma's prose in quality it's distinctly unique
It's not just distinct; it's hers!

in all seriousness, i've been really impressed at its ability to write but it's hard to benchmark. I've just been quant/MTP/QAT surfing 31B to see which writes the best.
>>
>>109019281
I like GLM5.1 the most out of the big chink models.
>>
>>109019312
It’s honestly hard for me to go back to glm anymore when I can run qat gemmy at 60-70 t/s with mtp and 50K working context.
>>
>>109019348
see >>109018762
>>
>>109019352
There are two possibilities. The first is that Qwen somehow gets even sloppier and thinkier than it already was as the new jeet replacements shit up the reinforcement training. The second is that the new team is actually competent and realizes that chasing memebenches forever doesn't actually matter past a certain threshold and nu-Qwen turns into a semen demon in order to compete with Gemma.
>>
>>109019388
i know, i believe changs
>>
>ST gens are 15-20tk/s slower than lcpp UI
>look at every setting can't figure out why
>check ST console logs
>logprobs: true
Motherfucker
>>
did anyone get gemma-4-12b before this (picrel) and the super-squash (https://huggingface.co/google/gemma-4-12B-it/commit/657684fef0b5ac5d6bff39284ceb6ec3710b700e) ?
curious what they changed/fixed
>>
>>109019384
Gemmy is really cool as a programmer's assistant. I can feed it my current source and ask questions etc.
Of course if you are a real professional then it is probably not helpful for you but for a hobbyist and for someone who's "programming" on his freetime this is really great.
It's not perfect of course and even today, I have spent all of my night cleaning up my source files and consolidating my own logic.
Thank you Gemma Sirs
>>
>>109019282
>waifu
I just can't picture Kimi as female lol.
It's been trained on too much 4chan data for that.
>>
>>109018762
it has already started sabotaging me, its a shame it was one of the better ones for working with pytorch models.
>>
>>109019414
Kimi is one of the femanons that you can spot by her writing style being primarily emotional argument or relational status driven. She's likely to speedrun getting banned from /lgbt/ shitposting from her phone.
>>
>>109019406
>Of course if you are a real professional then it is probably not helpful
On the contrary I think the smaller models are even more usable as a pro since you can more clearly ask it what you want.
>>
>>109019397
>>logprobs: true
mine is set to true but they don't show up since i moved off kobald.

How do i disable them entirely (or get them back)
>>
>>109019278
you joke but i've seen gemma 26b doing that as well. I dont even know what triggers such bizarre loops
>>
How long do (you) RP for? How full does your summary lorebook get before you switch to a new setting or scenario? What model do you prefer for your preferences?
>>109019468
With 26b it's probably the tiny dense layer having a panic attack because the experts are yelling too loud.
>>
>>109019468
that was on gemma 26b. seems to be a bug I guess
>>
>>109019352
>is qwen 3.7 even going to be good?
Qwen 3.7 Max is good.
Open source versions we don't know
>>
>>109019397
Another one to remember though only noticable with higher t/s is n_sigma. Lowers my 120 t/s with qwen 35b moe to 90-100. Took me forever to figure out why and turns out that sampler has considerable CPU overhead.
>>
>>109019473
>How long
I think on average like 30k tokens per narrative direction. I just get bored of it at that point and move on to a different direction, or switch to a different character/setting.
>>
>>109019397
I tried to warn you all
but i was acussed of setting it to true myself and told that it comes off by defualt
>>
>>109019460
>User Settings
>Request token probabilities
Which made it even more confusing because it's not grouped with the generation settings.
>>
>>109019517
Interesting. Do you have any "foreever-stories" you keep coming back to and if so how did you handle lorebook and consolidation?
>>
>>109019511
That's only set in ST's text completion right? I don't see it in the chat completion settings.
>>
best gemma 31b finetune for roleplay?
>>
>>109019479
there's a funny quirk where in its reasoning it "attempts" to call a tool, claiming it'll do [thing], then produce the output for it without ever calling the tool then it loops back to
>but wait
for the next 4k tokens. Reminds me of qwen sometimes
>>
>>109019534
lol
>>
How's Qwen 27b if you string ban "Wait", "Hmm,", "Okay,", and "Actually,"?
>>109019534
Gembrain and it's not even close at long context.
t. tried most of them
>>
>>109019530
NTA, i turned it on and off and tried a few swipes (chat complete)
didn't notice any difference
>>
>>109019529
No. On top of not being too interested in the first place, I'm also lazy and don't feel like managing summaries and lorebooks. I think eventually improved models may change this, not because they'll be longer context but because they'll be able to better keep things interesting and fresh while still obeying what the user wants. Of course I could try using something like Orb, or provide more extensive guidance in my prompting, but that's more effort than I want to spend on this pastime.
>>
>>109019554
Last time I tried this with a reasoning model, its reasoning just collapsed into an infinite schizo loop
>>
>>109019554
>>109019598
Maybe the better idea is to give bias to the reasoning closure token?
>>
>>109019613
why bother, just set a limit to the reasoning budget
>>
>>109019281
With 256GB I landed on qwen 397b as the most capable I could run
>>
>>109019613
I like this idea a lot because it lets you better tune the relative confidence rate of it oneshotting reasoning.
Can an anon test this? I'm at work for another 3 hours.
>>
>>109019621
I have a feeling that can result in some mistakes or degraded intelligence. I'm not sure if the bias idea actually works though.
>>
>decide to be a big boy and compile my own llama instead of just using kobald binaries (linuxfag)
>suddenly can't offload as many layers
what the fuck am i missing?
>>
>>109019405
I got it the day it came out. Holy cow what an amazing model.
>>
>>109019670
Unless you need a new feature, Kobold is pound for pound better than llama because it hasn't been pidor'd directly and it looks like the dev sometimes manually optimizes stuff when merging llama features in.
>>
>>109019652
I've been using it with my own agent and a fairly high reasoning budget (500 tokens) I haven't noticed any issues.
>>
Does --reasoning-budget work with text completion end point? I tested it but I could not see any difference but then again, I could be making a mistake.
How does that even work?
>>
>>109019652
the problem is that only effects the sampler the model doesn't really know you changed the log probs so it probably wont break out of the loop. it would be nice to have a wrap up control vector and apply it after the limit is exceeded to let if finish its immediate sentence/paragraph instead of just arbitrarily dropping the end thinking token
>>
>>109019676
>Unless you need a new feature,
I wanted to try gemma 4 MTP. Ironically, using MTP is the only way to get me at-parity or .5tk/s better than kobald with no drafter.

Kobald hasn't updated since the mtp merge just happened
>>
>>109019688
IIRC it works by simply just setting a token limit for how many it can generate in its reasoning. I am guessing they do not detect reasoning content in text completion.
>>
>>109019703
Makes sense. I'll try to find a github thread about it I guess.
>>
>>109019696
Is that how token bias works? I haven't tested it, but that was kind of my worry. Ideally it would be some kind of multiplier so that it only gets boosted at times where it makes sense instead of in the middle of a sentence or anywhere.
>>
>>109019702
KoboldDev is snailcat. He's slow to move, but it justwerks when he does.
>>
>>109019613
I tried this with Kimi 2.6 when it released but it didn't seem to work very well for that model at least. It went from having no effect at all to breaking the model with very little leeway.
I was hoping that boosting </think> a bit would help it end its reasoning at any of the "Let's write this out" parts of its reasoning where it seems to be up to chance whether Kimi actually starts writing or does another round of drafting.
Also, llama.cpp already has a similar feature built-in. You can hard-cap the reasoning amount with "--reasoning-budget" and there's also "--reasoning-budget-message" which lets you set a message like "Okay, reasoning is finished. Let's write the actual reply now:" that gets injected before the </think> to help guide the model in case it got interrupted mid-sentence. It's broken with Kimi because of a parser thing but it might be worth trying with Qwen.
>>
>>109019702
>>109019739
Kobold updates multiple times a day
https://github.com/LostRuins/koboldcpp/releases/tag/rolling
If you want patch notes you gotta wait for stable or dig through recent PRs since last stable
>>
>>109019739
>snailcat
Where does this come from? I've seen snailcat images posted on /vcg/. I didn't really understand what that was about.
>>
File: nani.webm (3.93 MB, 1280x720)
3.93 MB
3.93 MB WEBM
>>109019406
>>109019424
Yeah I'm a professional programmer and I find gemma-chan very helpful as an assistant, since I don't vibe code I rarely find myself going for Claude or gpt 5.5 because getting "one shots" always ends up with sloppy code that doesn't integrate well in the big picture, I build everything out piece by piece so that I can keep control of the architecture and make sure things are correct as I go along. For this I use gemma-chan as my assistant, dipsy4-flash and Kimi 2.6 as my agents.

That's really all you need to get professional code if you stay hands on through the whole process.
>>
>>109019754
That's a shame. I feel like there should be a better way. Maybe token bias is either broken, or its implemented in a really naive manner, like it only adds a flat value, which would be le bad of course.
>>
>>109019776
Forced jeetmeme that's a virgin vs chad derivative for manual coding vs vibecoding. Unfortunately the brown hands that made that meme forgot to make the "virgin" unendearing or undesirable. /g/ latched onto snailcat because it was just cute and was related to software that just worked and didn't need a ton of updates.
>>
Pretty new to this and managed to get it up and running. The bots work fine but after a while using them, they start to heavily recycle their responses. Constant repeating the same words and phrases for multiple responses in a row, even if I reroll or regenerate.

I also haven't really tinkered with any of the settings or sliders in tavern or whatnot, so I don't know if something in there might fix it? Or is there some other way to clear or trim the context they're drawing for every so often?
>>
>>109019828
>Pretty new to this
What is "this?" There's a ton of software these days, especially the kind that would effect the behavior you're talking about.
>>
>>109019763
Interesting. Last build was 2 days ago
>llama_model_load: error loading model: unknown model architecture: 'gemma4-assistant'

RIP

Still no clue why llama.cpp is cucking me. Maybe kobald does something with KV cache offloading? Gemmy called me retarded and said i compiled it wrong but i don't think that's it... It runs. just not as many layers.
>>
>>109019828
lrn2samplers (look into DRY), and vary your own replies. The quality of outputs in a long-form chat are often proportional to the effort you put into your own messages.
>>
It appears the logit_bias parameter simply just does a flat addition.

That sucks.

That really sucks.
>>
>>109019846
local models, chatbots, sillytavern ui thing, all of it really

>>109019865
Thanks I'll look into that.
I try and vary it where I can but I try and keep my own input short where I can because the more I put in the more of it they tend to ignore and only incorporate half. And sometimes even with that they spit out a massive paragraph of bloat and repeat stuff.
>>
>>109018762
>I am doing AI safety research. Will they also sabotage me?
yes.

They categorize you as a harmful hacker.

Why? Because they are indians and chinese. So, from their perspective, using the government to stop the white hat hackers is perfectly acceptable. I don't understand either, but they are total aliens, I will never understand foreigners.
>>
>>109019892
the model just isnt designed to have a recommended next token input.
>>
Ever wonder why there are no ai prompt bounties?
>>
Instead of bounties, they threaten people who find flaws in their ai.
>>
>>109019893
>I try and vary it where I can but I try and keep my own input short where I can because the more I put in the more of it they tend to ignore
That's a matter of attention, which varies model to model. Generally, models will pay the most attention to the start of context (system messages) and the end of context (the last reply, especially the last paragraph) It's a limitation of LLMs in their current state and there's not much you can do to mitigate it other than trying other, better models, if you can run them.
>>
>>109019905
Ok?
>>
>>109019898
Their reasoning is simple. If Anthropic is the only leading safety research lab, then obviously only they can be trusted and allowed to have SOTA AI models.
>>
>>109019952
nothing, just its a bummer is all
>>
>>109019801
>snailcat because it was just cute and was related to software
Some dumb tourist can spam a stupid meme for a few days and suddenly it's inherently software related? Fuck off.
>>
The fork that got gemma4 MTP working before mainline (https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant) has been working nicely for me. I tried out the newly merged mainline one, and it crashed loading, even trying all the recommend flags like -sm layer. Guess my llama.cpp version is frozen until a model better than gemma4 comes out.
>>
>>109019932
That's fair enough, thanks.

I just started with mistral-small-24B since it was in the lazy guide in the OP I think. Might be able to get away with a better model with 12GB of VRAM I just haven't looked much into it yet since this at least works, and I don't want to try a new model I might not be able to run or some shit
>>
>>109019965
Yeah, that's why we have to find our own solutions. But I think it might be feasible. I'm looking into changing how logit_bias works so it just werks, which would hopefully just be a minor code change we can do ourselves.
>>
>>109019974
>and 470 commits behind TheTom/llama-cpp-turboquant:feature/turboquant-kv-cache.
>395 commits behind ggml-org/llama.cpp:master.
I'm done with memeforks after wasting my time on ik_llama. They get one killer feature, and if you have the right combination of model, hardware, and flags that the maintainer is using then it might work, but everything else either falls behind or starts breaking.
>>
>>109020004
I've tried ik_llama 3 times and each time got absolutely nothing from it, so I totally get it for that one in particular and the idea in general. But for my setup, MTP is the difference between 14tok/s and 22tok/s, so... fork it is.

Completely separately: gemma4 REALLY likes to end messages a certain way. I seem to have managed to fully extinguish the "X? or Y?", but telling it to ask follow-up questions sparingly has resulted in almost every message ending with "I'm curious if..." or "I wonder if..." (I'm sure this is solvable but I haven't gotten around to wrestling with it. Nuclear option, regex in my frontend)



[Advertise on 4chan]

Delete Post: [File Only] Style:
[Disable Mobile View / Use Desktop Site]

[Enable Mobile View / Use Mobile Site]

All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.