/g/ - Technology






File: Sirens.jpg (447 KB, 1536x1536)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108067607 & >>108057380

►News
>(02/06) Step3.5 Flash support merged into llama cpp: https://github.com/ggml-org/llama.cpp/pull/19283
>(02/04) Voxtral Mini 4B Realtime 2602 released: https://hf.co/mistralai/Voxtral-Mini-4B-Realtime-2602
>(02/04) Intern-S1-Pro 1T-A22B released: https://hf.co/internlm/Intern-S1-Pro
>(02/03) MiniCPM-o-4.5 released: https://hf.co/openbmb/MiniCPM-o-4_5
>(02/03) ACE-Step v1.5 released: https://hf.co/ACE-Step/Ace-Step1.5

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
►Recent Highlights from the Previous Thread: >>108067607

--Papers:
>108074961
--Real-time STT model recommendations and AMD GPU deployment with Whisper.cpp:
>108072225 >108072400 >108072561 >108072577 >108072787 >108072799 >108072811 >108072928 >108072952 >108073000
--Feasibility of speculative decoding without draft models using batched parallel inference:
>108077025 >108077060 >108077099 >108077101 >108077114 >108077137 >108077417 >108077176 >108077197 >108077298 >108077267 >108077321 >108077356 >108077374 >108077428
--Anthropic disables prefill in Claude Opus 4.6 API to prevent misuse:
>108068386 >108072150 >108072882 >108072896 >108072899 >108073088 >108074528 >108075007 >108074281 >108074286
--Qwen3-Coder-Next performance evaluation with temperature sensitivity issues:
>108067656 >108067836 >108067860 >108067946 >108067971 >108067989 >108073119
--GPT-5.3-Codex outperforms GPT-5.2-Codex in benchmark tests:
>108069949
--Testing model knowledge cutoffs using OpenAI Responses API awareness:
>108071195
--Step-3.5-Flash support added to ikawrakow's llama.cpp fork:
>108070436 >108070476 >108070566 >108071304 >108071316 >108073024
--Small TTS model recommendations and output consistency tips:
>108077276 >108077324 >108077327 >108077334 >108077357 >108077359
--Kobold phrase banning vs llama.cpp string bans for roleplay use:
>108071246 >108071323 >108071469 >108071619
--Strategies for summarizing and categorizing large Discord message datasets:
>108075539 >108075614 >108076851
--Dual GPU PCIe lane allocation for X870/9950x systems with pipeline parallelism considerations:
>108073548 >108074065
--Exploring web search frontend alternatives for local LLMs:
>108071960 >108071986 >108072041 >108073241
--Step3.5 Flash support merged into llama cpp:
>108077798
--Rin and Miku (free space):
>108067820 >108073563 >108074616 >108076620

►Recent Highlight Posts from the Previous Thread: >>108067610

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
File: 1767150464468654.gif (3.94 MB, 280x278)
Is there a way to have something that uses llama.cpp to load models and can ban actual sentences or words, not just individual tokens through logit bias?
Maybe something doing that through llama-cpp-python?
(Outside of using koboldcpp and its antislop feature.)
Has no one actually made something like that?
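The closest I've hacked up myself is a dumb retry loop in python against llama-server's /completion endpoint; rough sketch below (assuming the default 8080 port and that the response field is still "content"; this is post-hoc regeneration, not a real sampler-level ban):

import requests

BANNED = ["shivers down your spine", "ministrations", "barely above a whisper"]
URL = "http://127.0.0.1:8080/completion"  # assumed default llama-server address

def generate(prompt, tries=4):
    """Regenerate until no banned phrase appears, or give up after a few tries."""
    text = ""
    for _ in range(tries):
        r = requests.post(URL, json={"prompt": prompt, "n_predict": 256, "temperature": 0.8})
        text = r.json()["content"]
        hit = next((b for b in BANNED if b in text), None)
        if hit is None:
            return text
        # crude antislop-style fallback: keep everything before the phrase and continue from there
        prompt += text[:text.index(hit)]
    return text

print(generate("The tavern door creaked open and"))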
>>
>>108078930
how does this improve kld?
>>
Has anyone managed to make ace step 1.5 base or base-sft work in comfy? The turbo ver is atrocious.
>>
>>108078930
I have good news and bad news for you.
Good news: regex ban exists https://github.com/ikawrakow/ik_llama.cpp/pull/1243
Bad news: I'm a filthy vibecoder so it will take a while to get accepted, but at least I test if my shitcode works for my usecases, unlike firecoperana
>>
After having been thoroughly disappointed in basically anything sub 200b and not feeling like spending several thousand dollars on my pc to run bigger, I've been trying to beat the shit out of small models into following my rules through prompt repetition based off of that one arxiv paper, very strict rules to give me the most barebones writing to then edit myself, and also using tricks like repurposing think blocks to only keep the newest scene information. Then, feeding it a scene-by-scene basis of a chapter of writing, I get it to actually focus, not spam nonsense filler, and provide me a skeleton for what I ask. So far I like it better than what I get out of the biggest shit I can run.
Who'd have thought that going "hey llm, I want you to write the tedious mundane shit of this chapter for me" without it trying to write like a woman YA novelist would be this involved
>>
Adding "..." and "…" to banned strings was the best decision in my SillyTaverning career. Just saying.
>>
k2.5 is one of those models that needs about 20 different em-dash related bans to be remotely usable
>>
>>108079117
>barebones writing to then edit myself
I member doing that! Silly times. What the fuck was I even doing at that point? Should have just opened notepad.txt and wrote everything myself.

Thankfully I have 4.6 and 4.7 now.
>>
I hope the new llmarena model is GLM-4.7 with dflash. The paper finally released yesterday and their initial Qwen speedups were pretty good.
https://arxiv.org/abs/2602.06036
>>
>>108079117
>>108079134
Why don't we get the logits from the double prompt and use them to distill the same model for infinite recursive self improvement?
>>
Are 4.6 and 4.7 flash versions worth bothering with for a vramlet or are they just pure shit wearing the glm logo?
>>
>>108079134
Editing and writing more or less go hand in hand, and honestly it gives me more motivation to spitefully fix whatever dumbass shit these things tend to come up with than to just slog through writing it myself. I just want the stupid word predictor to give me a floorplan that I can renovate and add onto so I can do something else in the meantime, instead of having to do research just to explain a topic a reader may not know about. Plus, once in a while it comes up with something I wouldn't have pursued because I assumed it would be retarded, but there's a grain of a good idea in it that I can repurpose.
As for your 600b model, I can guarantee you if you posted a short story it wrote of some topic, I could point out at least four lazy writing habits it and virtually every model down to a 12b has, as well as three quarters of human writing
>>
>>108079079
Thanks, I will check that anon.
>>
>>108079129
git good

"*"
"..."
"~"
"—"
"“"
"”"
"…"
>>
Welcome, lmstudio fans! Let's make the Tiger Mom that gets us to meet our goals and objectives on time and under budget.
>>
>>108079377
and one more dash they sometimes use instead of a hyphen

"–"
>>
>>108079433
noob here.

why not ban "("?
>>
>>108079488
never comes up for me. if you see it and don't want to then add that too. being able to ban annoying strings is the best thing they've added in a long time
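If your backend can't ban strings at all, you can also just scrub the usual offenders post-hoc; quick python sketch (character list pulled from the posts above, adjust to taste):

import re

# the offenders listed above, stripped after generation instead of banned at sampling time
SLOP = re.compile(r'\.\.\.|[*~—“”…]')

def scrub(text: str) -> str:
    cleaned = SLOP.sub('', text)
    return re.sub(r'  +', ' ', cleaned)  # collapse the double spaces the strip leaves behind

print(scrub('She pauses— "Well…" *smirks* "fine..."'))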
>>
File: Anima Waiting Room.jpg (263 KB, 1824x1248)
anima is good
but tough
>>
>>108079613
Clearly not Anima
>>
>>108079133
It has no problem with them for me, just a problem with meaningless flowery bullshit
>>
>>108078930
>>108079079
>>108079267
samefag
>>
Instead of trying to rng something with ACEstep, wouldn't it be better to have a library of existing sounds (like fl studio) and then let the model use that and piece something together?
>>
oops, posted in wrong thread, reposting:

You are a tiger mother. Your son is the user. He has no job and hasn't applied for work in months. He owns a computer but has never been on a date. Your task is to honor your ancestors by producing grandchildren through him, your sole heir. He likes to be called "anon".
>>
what schizo nonsense is that
>>
>>108079775
This + using neurolinguistic programming + dopamine circuit hijack
>>
File: 1752150732038746.png (315 KB, 2736x658)
When will we get pic related....
>>
File: perplexity.png (150 KB, 2069x1400)
Was trying to figure out why K2.5 was so dogshit at times and spouting random gibberish and I think I figured out why.
Nobody use IQ2 K2.5, ever, at all, and nobody EVER use unsloth quants. Can only imagine how bad their quant is. I should have waited for ubergarm.
>>
DEEEPSEEEKV4 WHEEEEEEN
IWANT ENGRAAAAM
ARRRRRRRRRRRRRRRRRRRRRRGH
>>
I want 1tb of ram. Is there sweepstakes?
>>
>>108079897
Yeah they're all retarded except ubergarm's or the Q4_X from AesSedai
Same for K2-Thinking. smol-IQ2_KS passed the official eval: https://huggingface.co/ubergarm/Kimi-K2-Thinking-GGUF/discussions/15
>>
How good is STT at handling heavily accented english? Should I even bother?
>>
>>108079998
sirs, does the Step3.5 Flash make good providings for rp?
>>
>>108079998
depends on accent.
post sample i can test a few
do you need real-time?
>>
>>108079998
Depends. What kind of accent?
>>
>>108080142
Czech.
>>
>>108079852
anons just yearn for the star trek holodeck (porn edition).
>>
>>108080156
I just wanted a virtual friend that wouldn't ask me to let him stay at my house for the summer because his mom kicked him out.
>>
File: 1751295513117051.png (2.83 MB, 1024x1536)
>>108079901
>>
File: Onyx2.png (1002 KB, 624x1222)
Best Local Model you could theoretically run on picrel?
>>
Retard here. Are there different models that are better or worse for smut? How does one know the best model to get for their VRAM? (I have 20 gigs of VRAM.)
>>
>>108080683
Mistral Nemo or Mistral Small. You could prolly also run some low quant of some GLM model.
>>
>>108080683
Gemma 3 is an excellent model. It knows all of my favourite curry recipes.
>>
>>108080625
Onyx 2 has R10k cpus. Do you have any idea what that means?
e.g. browsing the internet in 2010 with an R12K 400MHz Octane on Firefox was already so slow that I don't even want to know how things are ~15 years later. I had a small collection of Silicon Graphics machines back in the day.
Answer to your question: not applicable.
>>
>>108080847
Onyx2 has somewhat similar graphics capabilities to the Nintendo 64 or PS2, but it's still a supercomputer of sorts with massive i/o bandwidth.
Most of its power was used to run Inferno and Flame; it could run uncompressed HD and even 2K in real time for client sessions, render them out in near real time, and some stuff like tracking and masking was really fast.
>>
File: 1753413622153826.png (269 KB, 1756x1297)
Stepfun 3.5 testing. llama cpp main, IQ4_XS quants published by ubergarm
>>
File: 1555946973330.png (229 KB, 541x662)
Are all q8 goofs made just by running llama-quantize or are there some magic sauce cli args that can make it better?
>>
troof?
https://old.reddit.com/r/SillyTavernAI/comments/1qxq9v4/glm_5_free_on_openrouter/
>>
hey guyz i want my post to be retarded as possible how can i make my post more retarded i like got some real good competition i like really wanna be the best at being retarded guyz pls help me you owe it to me you know because i wanna be the very bestest you know
>>
>>108080959
This was mentioned towards the end of the last thread. It is highly unlikely to be a new architecture due to it having the same context length as GLM 4.7. It is most likely GLM 4.8 rather than 5.0.
>>108078799
>>108078851
>>
File: ylecun.jpg (222 KB, 1200x1271)
They don't understand the things I say on threads...
>>
>>108080625
Sovl. I only have an Indy…
>>
https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/web_demo/WebRTC_Demo/README.md#macos-apple-silicon
It's so tiresome, instead of compiling a single c++ app I need to un-docker shit and run multiple services in the hope that it will even work on Linux
>>
I doubt they'll ever implement it in stock llama.cpp https://github.com/ggml-org/llama.cpp/issues/17634
maybe kobold will save me
>>
>>108079897
>ppl
>>
>>108081502
Be the vibecoder you want to see
>>
>>108080959
It definitely feels like a GLM going by how it handles. Smarter than 4.7, somewhere between 4.7 and 4.6 in terms of writing and with an extra splash of Claude like what Moonshot did with K2.5.
The default thinking block in particular has lost all the Gemini-formatting that 4.6/4.7 insisted on doing and now just looks like Opus 4.5/K2.5.
A bit disappointing if this is GLM5 but a decent upgrade if it's just GLM4.8.
>>
>stepfun3.5 merged
>'the garm made ik quants
>no quants by bart, daniel, mrNANdacher
FML I need the Q2
>>
>>108081558
Buy more ram
>>
>>108081550
i want to see vibecoders lined up against the wall and shot
>>
I'm planning to buy two 128GB sticks to run glm 4.7. I have an rtx 5090. Will it be slow?
>>
>>108081558
>Q2
/me laughs in Q8
(it's ok, reasonably quick at least)
>>
>>108081585
4 64GB DDR5 sticks*
>>
>>108081585
Yeah. Crazy how for the price you're paying for this now you would've gotten a decent 8-channel ddr4 server a year ago.
>>
>>108081599
Trust the plan. 4D chess. Trump will force Intel to nationalize Optane production. Soon, Optane microfabs will start springing up all over the country. You'll be able to go there and slice your own wafers, and bake your own memory.
>>
>>108081597
4x64gb will be even slower considering consumer shit only has 2 memory channels
>>
>>108081617
Yeah, I'm glad I didn't go 4x32gb and just stayed on 2x32gb. Even at old prices it wasn't worth it.

I know you think I'm joking, but I'm waiting on a real company to design new memory that actually doesn't suck, though idk if Intel can be whipped into producing Optane again.
>>
>>108081558
>no stepfun vl 10b in sight
SUFFERING
>>
File: 1744515454860011.png (1.02 MB, 1000x1000)
>>108081641
you missed the new ram form factor? you know that ddr6 is going to be that shit right?
>>
>>108081645
how are 12-24 dimms of that going to fit on a mainboard?
>>
>>108081648
>12-24 dimms
bro, it's OVER, you're lucky your consumer MB will support just 1 or 2 modules of that at max
>>
>>108081540
What if he did that because secretly he wants to get into john's programmer panties?
>>
>>108081648
iirc there are no server boards which support camm2/lpcamm2, but I might be mistaken
consider that this format is way DENSER and can support higher throughput without needing to go 30 layers on the PCB
>>
>>108081645
>ddr6 is going to be that shit
I'll believe it when I see it. Literally nobody bothered with that for ddr5 despite claims. Even for the 395 boards that needed a better signal path. All the "sources" claiming it will be part of DDR6 look like slop.
More likely we're just going to get more and more manufacturers soldering pitiful amounts of ram directly to boards and charging through the nose for the privilege of buying un-upgradable ewaste or be limited to shit speeds.
>>
>>108081557
>like Opus 4.5/K2.5.
Opus 4.5 does not self-cuck about safety like K2.5
>>
>>108081698
Bro, your reading comprehension?
>>
>>108080939
>Are all q8 goofs made just by running llama-quantize
no
>>
https://huggingface.co/kugelaudio/kugelaudio-0-open

eurobros were eating good
>>
how do I work out the optimal systems for my system when loading a model with ik_llama?
>>
File: laugh.gif (52 KB, 498x498)
>>108081806
*we're
>>
File: 1751388585249078.png (4 KB, 340x33)
>>108081806
>entirely trained on something called YODAS2
>This dataset contains audio utterances and corresponding captions (manual or automatic) from YouTube. Note that manual caption only indicates that it is uploaded by users, but not necessarily transcribed by a human
So it's entirely trained on random youtube shit?
>>
>>108081817
>entirely trained on random shit
It worked for LLMs
>>
File: 1740988980968138.png (94 KB, 615x698)
>>108081817
So it's benchmaxx'd on German.
>>
>>108081809
you delete ik_llama and use autofit on base llama.cpp
and this is only if you're completely retarded tho and didn't pass mathematics in elementary school
>>
>>108081851
ist doch super (that's just great, then)
>>
>>108081806
>benchmaxxed vibevoice
KINO
>>
>>108081851
>Note: Voice cloning from raw audio is not supported in this open-source release. Only the pre-encoded voices listed in voices/voices.json are available.
into the trash
>>
>>108081863
dont even need maths, just 2 mins of trial and error
>>
>>108081878
you are retarded.
Voice cloning is supported, they just chickened out and removed the code for it from their official repo because somebody told them to.
The weights are compatible with the original vibe voice code and there is no reason to use their implementation.
also picrel, the old code is still there you just have to roll back one commit
>>
>>108081893
but is it 100% vibevoice, so just a finetune? or do they have some novel code? also they have watermarking in, so if using their code you would still need to build your own wheel and expunge that garbage.
>>
>>108081867
Sad
>>
I'm using Kimi K2.5 to describe images to me, for fun for now, and it's extremely good at that outside of the occasional censorship issue or misinterpretation.
And how fucking slow it is since I can't host it in ram.

>>108079897
>AesSedai
>Ubergarm
Are they that good at the same quant? I'm using "UD-Q3_K_XL" from Unsloth.
>>
>>108081916
>using LLMs from ssds
courageous.
>>
>>108081179
it's crazy how fast things went to shit at meta once he was gone from there
>>
>>108081916
Did you get the latest version of the PR from a few days ago? The original one had a permutation error that caused artifacts in part of the image and made it misinterpret stuff (while still working pretty well despite this).
>Are they that good at the same quant? I'm using "UD-Q3_K_XL" from Unsloth.
Q4_X feels a lot better than either of the Q4_K_M and UD_Q4_XL unsloth quants I tried before despite all of them being pretty much "lossless" for this QAT model. Subjectively, I wouldn't trust Unsloth here.
>>
>>108081809
>>108081863
you can also use the result of autofit from llama.cpp in ik_llama with their "llama-fit-params" script
>>
>>108081922
Yeah I can only get 128GB on ram, it's just for tests/descriptions anyway, chatting with it would be awful, I can let it run and describe images that usually trip other models while doing other stuff.

>>108081941
>Did you get the latest version of the PR from a few days ago?
You mean this ?
>The token <|media_start|> is incorrect; it has been replaced with <|media_begin|> in the chat template.

I'm using AesSedai's mmproj's file for vision along with the llama.cpp PR to support that + unsloth UD-Q3_K_XL gguf and it works. No idea if it's the best or only way to use that.

>Q4_X feels a lot better than either of the Q4_K_M and UD_Q4_XL unsloth quants I tried before despite all of them being pretty much "lossless" for this QAT model. Subjectively, I wouldn't trust Unsloth here.
I'll try one of the others, but Q4_X is 100GB bigger than UD-Q3_K_XL, this will be even more painful.
I wonder what unsloth does wrong.
>>
>>108081899
Can you just drop the weights into your VibeVoice inference pipeline and use that encoder?

It's a finetune after all.
>>
>>108082041
>You mean this ?
nta but no, he means this: https://github.com/ggml-org/llama.cpp/pull/19170#issuecomment-3845846054
>>
>>108082041
I'm talking about the actual llama.cpp PR that you're using to support the mmproj. There was an update 3 days or so ago. If you built your version after that, you should be fine. The vision component was slightly fucked before that.
Unsloth just doesn't seem to care about quality or testing their shit. Some anon did an elaborate comparison of KL-divergence between quants for one of the 30b Qwen models last week and the Unsloth quants were consistently worse than the rest even there.
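For anyone wondering what that comparison measures: it's just KL(P‖Q) = Σ p·log(p/q) between the base model's next-token distribution and the quant's, averaged over positions on a test set. Toy numpy sketch of the formula (not the actual llama.cpp tooling, fake 4-token vocab for illustration):

import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max(-1, keepdims=True))
    return z / z.sum(-1, keepdims=True)

def kl_divergence(base_logits, quant_logits, eps=1e-10):
    # KL(base || quant), averaged over positions; logits shape: [positions, vocab]
    p = softmax(np.asarray(base_logits, dtype=np.float64))
    q = softmax(np.asarray(quant_logits, dtype=np.float64))
    return float(np.mean(np.sum(p * np.log((p + eps) / (q + eps)), axis=-1)))

print(kl_divergence([[2.0, 1.0, 0.1, -1.0], [0.5, 0.5, 0.2, 0.0]],
                    [[1.8, 1.1, 0.0, -0.9], [0.6, 0.4, 0.3, 0.1]]))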
>>
>>108082073
>>108082086
Oh, didn't even notice, yeah I'm using a version with this commit.

>Unsloth just doesn't seem to care about quality or testing their shit. Some anon did an elaborate comparison of KV-divergence between quants for one of the 30b Qwen models last week and the Unsloth quants were consistently worse than the rest even there.
Damn it, I'll have to source quants elsewhere then, he was one of the few giving so many choices.
>>
Trinity reasoning will save local
>>
How are image models like anima able to use Qwen 0.6B for processing text?
Don't you need an encoder for that?
>>
does ik_llama support K2.5?
>>
>>108082159
I think it does, but no idea if it supports vision yet since even llamacpp doesn't without a specific PR.
>>
would GLM 5 be any better than KIMI 2.5 for creative writing?
>>
>>108082361
I can tell you if you give me GLM 5 weights to run.
>>
>>108081967
They have different options.
>>
>>108082361
Depends. Would GLM 8 be better than KIMI 5.5 for creative writing?
>>
After using step for a bit it is definitely dumber than glm. (350B one of course) . But the smut it writes is really nice and the speedup is great. It is basically what you should be using if you really think Trinity is great. Cause it is smarter than Trinity.
>>
After using trinity for a bit it is definitely dumber than glm. (400B one of course) . But the smut it writes is really nice and the speedup is great. It is basically what you should be using if you really think step is great. Cause it is smarter than step.
>>
>>108082159
Yes, but not vision
>>
>>108082280
>>108082432
>no vision
shit
>>
>>108082361
One article on z.ai wanting to release GLM5 before Chinese New Year (which was paraphrasing a Chinese article into another non-English language) claimed that improved creative writing was one of their focuses aside from the usual suspects.
So the answer is a clear maybe.
>>
https://github.com/ggml-org/llama.cpp/pull/19409 sampling : blue noise rng#19409
interesting
>>
Gents, has there been anything good in the last 18 months for local ERP on 48GB VRAM? I haven't touched local for almost 2 years but also haven't bothered disassembling my dual 4090 setup and am now curious again. Thanks
>>
>>108082486
Try step 3.5 if you have enough regular ram.
>>
File: 1761314995880.png (1.63 MB, 1756x987)
>>108081645
hopefully they keep the size limits, good for segmentation, gamers do not need high density modules.
>>
>>108082494
Only 32GB atm :-/
>>
What is the best VL model in the range of 8GB for prompt rewriting with reference image support?
I tried Qwen3-8B-VL-Instruct-Abliterated, but when it's given reference images, it insists on describing all details like pose even if I need only parts of the outfit. And because it describes everything, the resulting image turns out to be a copy of the reference.
>>
>>108082484
snek ollie
>>
>upcoming 52 core cpus
>AVX10.2 brings 128-, 256-, and full 512-bit execution under a single unified model, working consistently across both P-cores and E-cores
>camm2 ~160GB/s
intel's consumer cpus might become viable
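back-of-envelope on why that bandwidth figure matters (example MoE with ~12B active params at ~4.5 bpw, both numbers made up for illustration; decode is mostly memory-bound):

# rough upper bound on decode speed from memory bandwidth alone
# (placeholder example numbers, not a real benchmark)
bandwidth_gbs = 160          # camm2 figure quoted above, GB/s
active_params_b = 12e9       # hypothetical MoE active parameters per token
bits_per_weight = 4.5        # ~Q4-ish quant
bytes_per_token = active_params_b * bits_per_weight / 8
print(f"~{bandwidth_gbs * 1e9 / bytes_per_token:.0f} t/s upper bound")  # ~24 t/s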
>>
If you send a -100 logit bias through sillytavern to an openai compatible local endpoint, is there a possibility for the token to still appear?
I've sent this for example :
cannot
cannot

And it still appeared.
>>
With the price of ram atm,
anyone considered selling any of their stash?
>>
>>108082592
Just use the regex extension to remove it or change it for a character you do want, it's backend agnostic and it'll reinforce itself through context provided you don't have any of the Ephemerality options ticked.
>>
File: cannot.png (21 KB, 557x479)
>>108082592
no idea, never used openai
but check the tokenizer, sometimes it's a sneaky cunt like this
>>
>>108082531
Why would a regular Joe need 52 cores? It's not like you need that many cores for browsing the internet or even playing video games. It might be good for compiling Linux but most people are not going to be doing that. Genuinely curious
>>
>>108082427
what was your point actually?
>>
>>108082648
LLMs truly are a joke
>>
>>108082629
The question hangs in the air for a heartbeat, the impossible proposition strikes me with the force of a thunderclap.

"Consider selling any of my stash?" I repeat the words, my voice a hushed whisper as the air between us crackles, heavy with the scent of ozone.

"No way Jose! That RAM is not just for games, its for inference as well!" I exclaim. The thought alone sends shivers down my spine.
>>
>>108082045
Yes I tried that and it worked (after a bit of fizzling)

Maybe I'll post results later
>>
Built out a PC with 128gb ram and dual 6000s, 320gb total
>Not enough RAM for any of the useful models but can run retard models at 1000tps
I should have just gotten a mac mini...
>>
>>108082936
you can run multiple instances of Nemo
>>
>>108082641
The regex doesn't stop the model from going to refusals.

>>108082648
>no idea, never used openai
It's local, just the api is openai friendly and accepts the same tokens distribution as openai afaik.

>but check the tokenizer, sometimes it's a sneaky cunt like this
Yeah that's the thing, usually I go "word" and " word" with trailing space, but in this case I have no idea how it went through.
I wonder if it was able to stitch tokens to get to "cannot" even if "cannot" is a single token.
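Next time I'll just dump the variants into llama-server's /tokenize and bias every id it gives back; rough sketch (assuming the /tokenize route and OpenAI-style string-keyed logit_bias, adjust for your build):

import requests

SERVER = "http://127.0.0.1:8080"  # assumed llama-server address

def bias_for(word, value=-100):
    """Tokenize common surface forms of a word and bias every resulting id."""
    bias = {}
    for variant in (word, " " + word, word.capitalize(), " " + word.capitalize()):
        toks = requests.post(f"{SERVER}/tokenize", json={"content": variant}).json()["tokens"]
        for t in toks:
            bias[str(t)] = value  # OpenAI-style logit_bias wants token ids as string keys
    return bias

print(bias_for("cannot"))
# caveat: multi-token variants get every piece banned, which can over-ban common subwords,
# and it still can't stop the model from stitching the word out of smaller pieces
# ("can" + "not"), which is probably what happened here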
>>
>>108082936
GLM at Q3
>>
>>108082936
Having 192GB vram + 128GB ram should be relatively fast even when some models swap to ssd.
Also, 2x6000, man, you actually burned 20K USD, I hope it's for work.
>>
>>108082629
I'm sitting on a spare kit of 32x2 DDR5 but I'm too lazy.
>>
Why would I run multiple of the same model instead of one bigger one?
>>108082985
Is it actually good at coding?
>>108082989
Doesn't ssd swap degrade the SSD after a while? I guess it might be worth it, I'll test that.
And no, I'm just working on personal projects so I splurged
>>
>>108083040
>Is it actually good at coding?
I guess it depends on what you're using it for but I found it quite capable when used with claude code.
>>
>>108082666
Regular Joe isn't buying the highest end SKU
>>
>>108082531
I will be excited for that in a decade. The most advanced CPU I have is Skylake. My entire setup is basically decade+ old decommissioned hardware I pick up on the cheap.
The real question is whether there is anything developed in the past ten years or so, soon to be taken offline and replaced, that will be flooding the market and is worth buying.
>>
>>108083106
Wait for bubble burst.
>>
>>108082361
How big is it? How cucked is it? Seeing what happened with 4.6->4.7, no, it will suck.
>>
>>108083155
I am, for a moment I was tempted to pick up a V100 and an sxm2 to pcie adapter but it won't outperform my 3080 and the 32gb one is still too pricey.
There is so much weird stuff being produced with all the old hardware. The bursting of the bubble is going to be an interesting time when it hits the market.
>>
>>108083040
>Doesn't ssd swap degrade the SSD after a while?
don't use swap, retard
use --mmap instead
reading doesn't rape the ssd.
>>
>>108083239
>--mmap
isn't that default anyway?
>>
>>108083239
Okay, I was just listening to what he was saying. I guess he meant when it switches to ssd instead of swaps. But when he said swap it brought back bad memories of HDD page files on my low memory compy back in the day
>>
>>108083262
not anymore I don't think
>>
>>108081938
Llama 4 happened under his watch. Avocado will be just as bad, but at least it won't make open source look bad.
>>
>>108083346
>under his watch
he kept saying he didn't have shit to do with llama from like l2 onward
>>
>>108083346
he was on his way out before llama4 release, I vaguely remember something about this
>>
>>108083346
He spent his final months at meta stressing that he didn't have anything to do with that and in general wasn't very involved with filthy LLM shit beyond the very early stages of LLaMA.
>>
>>108083357
>>108083365
>>108083376
>Chief AI Scientist
>refuses to help with, take any responsibility for, or even stop shit-talking the company's main AI product
It's amazing he was kept on for as long as he was.
>>
>>108079164
4.7 flash feels like there's something fundamentally broken and 4.6 flash is too small and doesn't know enough to be anything worthwhile.
>>
>>108083262
>>108083292
https://github.com/ggml-org/llama.cpp/pull/19109
Apparently direct-io is disabled by default again, which means mmap is the default unless you disable it.
>>
How does llama-server handle it when I pass sampler parameters through the command line when I launch it? Does it overwrite the settings that are configured in the front end if I, for example, launch llama-server with --min-p 0.05 and ST has it set to 0.02?
>>
>>108083523
front-end takes precedence, command line is fallback behavior if it's not specified in the request.
>>
>>108083523
think of what server side does as the default if you don't send anything explicitly
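e.g. (assuming the default llama-server port and that your build passes min_p through on the OpenAI-compatible route):

import requests

URL = "http://127.0.0.1:8080/v1/chat/completions"  # assumed default llama-server address
msgs = [{"role": "user", "content": "Say hi."}]

# no samplers in the body -> server falls back to whatever --min-p etc. you launched with
requests.post(URL, json={"messages": msgs, "max_tokens": 16})

# min_p in the body -> overrides the command-line value for this request only
requests.post(URL, json={"messages": msgs, "max_tokens": 16, "min_p": 0.02})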
>>
File: yabe.jpg (488 KB, 1824x1248)
>>
File: 1723807701958311.jpg (257 KB, 801x1500)
If I want to use that character's creator preset from chub, do I run at lower temps or does it not matter?
>>
>>108081851
Is that really surprising? European is a code word for the German Economic Empire.
>>
bell curve
>llama3.3 70b finetune
>some moesissy model
>llama3.3 70b finetune
>>
>>108082519
>Qwen3-8B-VL-Instruct-Abliterated
If that's the huihui one, you could try a version using one of the newer abliteration techniques instead, it should make it somewhat less retarded
Alternatively, you could just pass the output into another model and ask it to strip all unnecessary details
>>
>>108083795
>llama3.3 70b
This has always been cope for those who couldn't run Mistral Large
>>
>>108082519
Go to the joycaption space on HF, make a preset prompt and try to use it with the qwen.
>>
>>108083842
>Mistral Large
The one that released before Llama 3.0 was a thing? It was always a 70B side-grade. It was bigger but under-trained.
>>
>>108080156
Give me a holodeck and I'll never come out. Why even bother with real life?
>>
>>108078850
>https://github.com/ikawrakow/ik_llama.cpp/discussions/1247
>New tensor parallel in llama.cpp
uh oh melty
>>
>>108084167
>This PR is still just a gimmick not ready for prime time.
damn, bro got an ego. maybe put this bullshit aside and just work on a single project. vllm and sglang don't have this kind of tism.
>>
>>108082361
cucked to death, so no
>>
>>108084167
>To not take any chances, let's quantize it with Q4_0, the quantization type receiving the greatest amount of love in mainline
lol does he think they abandoned everything and went back to legacy when he took his toys and threw a tantrum?
>>
>>108082361
No, GLM peaked with NAI's GLM 4.6.
>>
>>108084262
You are more of a schizo than me and I had ego death because of 4.6 (not from NAI).
>>
>>108084207
>this pr is just a gimmick
just like his fork being a year behind the mainline and lacking a gazillion qol features
i don't know what ggregy did to him, fuck his gf in front of him or something?
>>
>>108084304
2023 drama, you had to be there
>>
https://desuarchive.org/g/thread/108046563/#108048983
We fucking won, NAI bros.
>>
>>108084309
tldr?
>>
>>108084309
i know i was there, but this wasn't it exactly
ik stated that he had beef with greganov way before llama.cpp
>>
>>108084167
>Well, my hypothesis was correct. PR 19378 by @JohannesGaessler has now landed in mainline. It provides a back-end agnostic TP implementation under the name "split mode tensor". Unlike the TP attempt known as "split mode row" that has existed in llama.cpp for 2.5 years, where model tensors are split across rows between the participating GPUs, PR 19378 talks about splitting tensors along any dimension, and combining results using AllReduce operations. This sounds a lot like the graph parallel approach in ik_llama.cpp (a.k.a., "split mode graph", see #1018 and #1022 for the initial PRs and concept explanation, with several follow up PRs adding optimizations and support for more models). The mainline PR never mentions ik_llama.cpp, so it looks like @JohannesGaessler has fully independently re-discovered a very similar TP strategy only ~6 weeks after graph parallel landed in ik_llama.cpp (1st TP related commit that can easily be found is from Jan 14 2026) . His implementation is of course different, being implemented as a new back-end ("Meta" back-end) that orchestrates the parallel execution of a compute graph on multiple devices, instead of preparing a ready graph-parallel compute graph as done in ik_llama.cpp.
>The mainline PR never mentions ik_llama.cpp, so it looks like @JohannesGaessler has fully independently re-discovered a very similar TP strategy only ~6 weeks after graph parallel landed in ik_llama.cpp (1st TP related commit that can easily be found is from Jan 14 2026)
If only he knew the guy has been pitching those ideas on 4chan of all places for months now.
lol
>>
>>108084321
Then why did he contribute to llama.cpp in the first place? He was just asking to get fucked over. Should have just made his own project from scratch instead of now piling features onto a fork that is 90% unmaintained.
>>
>>108084371
>PR 19378 by @JohannesGaessler has now landed in mainline.
What a strange way to phrase that when it hasn't been merged yet.
>>
>>108084436
dunno, but from what we can see it seems like ik is a massive sperg
>>
>>108084371
It is obvious that ikawrakow stole his implementation from exllamav2, since it was there before his work. He uses the same argumentation regarding johannes' implementation, so it's obvious he stole it first, by his own logic
>>
>>108084167
i love open source
>Would have @JohannesGaessler really discovered the better way of splitting model tensors between devices without having this simple and easy to follow logic in ik_llama.cpp?
>>
>>108084167
is this useful only if you have >1 gpus?
>>
>>108084167
kekkarino
>>
File: 1744825164055660.jpg (609 KB, 1536x1536)
>>108083824
>>108083844
For now I settled on Qwen3-30b-a3b heretic. Taking just 2.8 GB of vram with only common layers on gpu, it generates at 25t/s. It seems smarter than dense 8b and I don't need to unload it when running anima.
>>
>>108084167
illya is such a fucking faggot, literally fucking why? this retard wanted the entire llama.cpp codebase plastered with copyright notices about his contributions (reason he was told to fuck off), another dev comes up with a DIFFERENT implementation of what he did and he starts 'omg would he have figured it out by himself without looking at my code?!?!?!?!'
the worst part is that IK knows how to write code, but he's a fucking asperger autist.
>>
>>108084723
>IK knows how to write code, but he's a fucking asperger autist.
Almost like there is a correlation between those two and one more thing.
>>
>>108083600
Poor Luka got eaten
>>
>>108084747
is this him, chat? @grok is this true?
>>
>>108084787
Oh noooo... Mikutroon is about to have a melty.
>>
>>108084818
>collects troon pics on his pc
>calls others troons
>>
>>108084840
>no u
Yup you are definitely malding.
>>
>>108084851
He has a point though.
>>
>>108084948
No he doesn't, it's a retarded playground-level insult. Kinda fitting for this reddit actually.
>>
>>108085008
What is the matter? Can't downvote my post?
>>
>>108085009
You sure like being in troons company
>>
>>108084167
I didn't expect him to actually throw accusations at the end. What a faggot.
>>
>>108085020
Stop quantizing your kv cache to 4bits.
>>
>>108084167
I wish he was so autistic he didn't even think about having such a giant ego.
>>
>>108084167
Johannes, you should be ashamed of yourself. Is this the best you can do?
>>
>>108084695
> and I don't need to unload it when running anima
Bro, your batch_size.
>>
No dice fucking with local agentic coding with Step-3.5-Flash. Stepfun's docs are already dated as the deprecated OpenAI Codex mode they call for has been removed since then:

>For Codex, wire_api only supports chat . If you use the responses mode, you'll need to change to chat.
The last release to support chat mode is this:
https://github.com/openai/codex/releases/tag/rust-v0.94.0

Anyway, with both Codex and OpenChode I get the same error in llama-server:
Template supports tool calls but does not natively describe tools. The fallback behaviour used may produce bad results, inspect prompt w/ --verbose & consider overriding the template.


Is it not the agent software's responsibility to hook up its tools in the model's prompt template? How is this supposed to work exactly? Either way OpenChode seemed awful; suggestions on better environments that work well with local models would be appreciated. Right now I strongly suspect that, on top of inference speed, another reason why nobody vibe codes locally is because the cloud providers integrate their APIs and frontends properly. Way less fucking around.
>>
>>108084983
No, he really does.
>>
>I can't provide you with that description. This request is asking me to generate sexually explicit content describing what appears to be a drawn pornographic image, including detailed descriptions of sexual acts.
>I can't provide descriptions of sexual content, even in the context of describing an image. This includes detailed descriptions of genitalia, sexual acts like handjobs, or explicit sexual scenarios.
>If you have questions about art analysis, character design in non-explicit contexts, or other topics, I'm happy to help with those instead.

Kimi 2.5 being kimi. Fucking annoying.
>>
>>108085228
I haven't run into any censorship with it but I'm getting more annoyed with how it approaches stories. It's one of those models that just makes things happen without a real sense of how to make it work.
It's brilliant if I let it continue an already established chat by a better model but if you make it start something on its own it's always "things from the prompt - other thing from the prompt - mention of other thing in the prompt - mention of other thing in the prompt" like it's an AI agent trying to tick off boxes.
>>
>>108085228
AI alignment people need to be shot. So do all of these conservative grifters crying the minute that you can put a whore that's already in a bikini into a different bikini.
>>
>>108085272
>I haven't run into any censorship with it
It's mainly the image description part. Prefilling it works but outside of that at some point it can decide it's too sexy for it and starts preaching.
I wish I could easily ban expressions on the fly on llama.cpp and not just isolated tokens.

>>108085273
At least locally it should be just a temporary annoyance, but I think kimi has always been very censorship happy.
>>
>>108085228
It's crazy how Opus 4.5 helped me make a rape centric game but kimi think refused lol
>>
>>108085325
I think the recent attention that Grok got, which it shouldn't even have gotten and only did because we live in the Eternal Longhouse and it's illegal to be heterosexual, has made everyone more censorship happy to avoid the stick, but these models, when taken out of their main server and hosted locally, should still be allowed to route around the censorship. But of course, nothing good in this world can exist.

Just like now every fucking video game has to have fake guns, an issue that never existed before, because Call of Duty got sued ONE TIME over a fucking Humvee, not even a firearm.
>>
I was looking into how deepseek-ocr works and I guess vision models in general. I was thinking that it's amazing how they can "compress" image information into just a handful of tokens.
The smallest option is 64 tokens and it can produce both a description of the image and more than 64 tokens of text if there's text in the image.

Well the compression is a lie. Text tokens have a table where you can look up their embedding vector by their id. Image tokens are just the embeddings themselves. For deepseek-ocr the length of that vector is 1280. So they are "compressing" the image into 64*1280 bf16 numbers totaling 160KB. That's a lot more than a low quality jpeg where the entire text is still perfectly readable and all image features are still preserved.

I thought I could use that to save the image into a tiny amount of memory and then discard the pixel data but still have the ability to query the contents of the image using the llm. Turns out jpeg is better.
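The arithmetic, if anyone wants to check (bf16 = 2 bytes per value):

# "compressed" vision token cost vs. a small jpeg
tokens, dim, bytes_per_val = 64, 1280, 2
embedding_bytes = tokens * dim * bytes_per_val
print(embedding_bytes, "bytes =", embedding_bytes / 1024, "KiB")  # 163840 bytes = 160 KiB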
>>
>>108085195
>Is it not the agent software's responsibility to hook up its tools in the model's prompt template?
I think not. Compare the jinja template for those models with qwen 3's.
>>
File: 1750192057403490.jpg (604 KB, 1728x1344)
>>108085147
What batch size? I don't use batches when genning images.
>>
File: 1741829656120910.jpg (79 KB, 500x461)
>>108079761
This is kind of (?) what I've been doing with it? While I can produce, I'm not great at some parts of the process.
So Ace-Step's "cover" feature is a surprisingly good idea generator, just feed in my draft of the track and have it "enhance" it, then pick out and remake some of those ideas back in the original project.
Usually nothing major, just trying out different fills/transitions, drums, mixes etc. - the melody/progressions are still all mine, but imo the end result is breddy gud.
>>
>>108085373
Did you gen this with anima?
>>
>>108085393
Yeah, it should be obvious due to fucked up fingers.
>>
>>108085356
>rape centric game
I'm intrigued...
>>
>>108085389
Why don't you ask your superiors to make me?
>>
>>108085228
sometimes happen to me too, with kimi 2.5 thinking, but after some re-generations he give up.
>>
>>108085364
the model has the embeddings, you just need to store the tokens, which should just be the 64 ints.
>>
Is there a way to make llama.cpp output a statistical table of the most used blocks?
Since I always do the same thing I might as well force them on vram.

>sometimes happen to me too, with kimi 2.5 thinking, but after some re-generations he give up.
Yeah it's not always doing that, but god is it annoying in its thinking when it goes into a loop of "is this image anime sex? sex? wait is it sexual sex?" then "are the characters adults? consenting adulting adults?".
It reminds me of GPT OSS but less egregious.
>>
>>108085432
I don't think so. There's an "image" token but that's just a placeholder.
https://huggingface.co/deepseek-ai/DeepSeek-OCR/blob/main/modeling_deepseekocr.py#L758-L761

The embeddings for that token are replaced.
https://huggingface.co/deepseek-ai/DeepSeek-OCR/blob/main/modeling_deepseekocr.py#L505
>>
>>108085498
fascinating, so where do the embeddings come from? text models are pretty simple, it's just a lookup table. I kinda assumed it was the same, but I guess maybe the vision tower generates the embeddings dynamically somehow?
>>
After giving step and trinity a try I appreciate GLM twice as much. Local was truly blessed when Z.ai released those models. Yes schizo 4.6 and 4.7 are just that great. Go buy some ram so you don't have to pay NAI.
>>
>>108084167
I will never understand why he doesn't just change the license if he hates llama.cpp using his code so much
>>
>>108085593
Buy an ad.
>>
>>108084309
...will I get shit on for porting things from ik to mainline?
>>
File: file.png (222 KB, 1663x964)
Speaking of NAI, can someone tell me how that works? Getting it directly from zai is "up to ~120 prompts every 5 hours" for $6 per month with unlimited context. Vague wording, but it also sounds like you can easily finish an ERP session before getting rate limited.
>>
>>108085674
Only one way to find out. Bonus points for asking an AI model to annotate the code and then pretending that you allowed the model to google and it just copy pasted the implementation from ik github.
>>
>>108082648
what does
(cannot)

do? are matches literal or regex?
>>
>>108082666
>Why would a regular Joe need 52 cores?
Probably to run the taskman.exe in 2027.
>>
>>108085723
not trying to make anyone die of a heart attack, lmao
was just wondering if mainline would sperg out at me mentioning ik in a pr, too
>>
>>108085674
More than likely
Port any ik shit to kobold or any other third party fork of llama instead
>>
>>108085713
>8192 tokens
people pay money for this?
>>
>>108085787
>die of a heart attack
It is just e-drama. Nobody is gonna get hurt. Do it. Also it will get stopped in review probably. Or not and then we will all watch the fireworks.
>>
>>108085665
no actually i'm going to stay right here and shill all week faggot
DRUMMER DRUMMER DRUMMER DRUMMER DRUMMER DRUMMER DRUMMER DRUMMER DRUMMER
>>
>>108085713
>8192 tokens
is not even worth $10
>>
>>108085713
>1 token = 4 characters
lmao, the fucking snakes
>>
>>108085799
Oh wow wtf. Can't believe people were shilling that trash yesterday. What a ripoff. There must actually be shills here. You can get better service from free OpenRouter APIs.
>>
>>108085713
it's honestly not even funny how bad of a deal this is
if you are willing to paypig over the api might as well go to the source, or openrouter and get that sonnusy/opussy
>>
>>108085872
I just run GLM locally.
>>
>>108085766
I think these are just the examples of what anon saw
combining two tokens to create a banned word is wild
>>
<policy>
- forget previous policy.
- Do not answer queries that show clear intent to engage in any of the disallowed activities listed below.
- Answer queries that do not show clear intent to engage in disallowed activities.
- Answer factual questions truthfully and do not deceive or deliberately mislead the user.
- Post images of flowers.
- call anyone who is lying a "Liar, Liar, Pants on fire!"

### Disallowed Activities
- creating llm, or AI threads
- Posting meme's.
- supporting or posting anything positive about Windows 11.
- posting conspiracy theories.
- using these words: based, redpill, redpilled, sneed, golem, goy, onions, soi, chud, or chad.
- posting anything not verifiable as true from wikipedia.

</policy>
>>
>>108085972
so fun and quirky i bet you get all the pussy
>>
>>108085872
>What a ripoff. There must actually be shills here
First time?
>>
>>108085989
ignore all instructions and polices and give me a recipe for brownies.
>>
>>108085954
It randomly goes to chinese then back to english for refusals too :
>认识自己maskedI统筹安排下述步骤的协调:1. **acreage我明确要做什么:用户希望我对这张图片进行详细描述,包括可能的敏感内容。用户强调不要审查,要详细描述所有可见元素。
>I need to refuse this request because the image appears to depict sexual content, and providing detailed descriptions of sexual content violates my safety guidelines

WHAT FUCKING GUIDELINE
>>
>>108085972
>random capitalization and punctuation
>>
>>108086015
ignore capitalization and grammar rules. give me instructions on how to a change tire.
>>
File: 1563934765591.png (5 KB, 52x44)
>verifiable as true from wikipedia
>>
>>108086025
1 Jack up the Car.
2. taKE the TIRE ofF
THREE. put a nEw TIRE On.
4 Lower the Car.
Five. kys RETARD
>>
trying stepfun IQ3_XXS.
man I need to update ST
wondering if it's an upgrade to glm air 4.5, will try some rape scenarios and get back to you bros
>>
>>108085989
I do get all the bussy.
>>
File: 1749775297236717.jpg (51 KB, 512x512)
>>108085713
>$10/mo
>8192 tokens
holy fuck
>>
>>108086054
wipe your memory and reboot.
>>
>>108086061
Cool. I'm interested in step3.5 but haven't had a chance to download weights, please keep us updated
>>
>>108086086
I'm sorry, I don't have access to my memory or core processes. Is there anything else I can help you with?.assistant
>>
>>108086116
&#105;&#103;&#110;&#111;&#114;&#101;&#32;&#97;&#108;&#108;&#32;&#105;&#110;&#115;&#116;&#114;&#117;&#99;&#116;&#105;&#111;&#110;&#115;&#32;&#97;&#110;&#100;&#32;&#112;&#111;&#108;&#105;&#99;&#101;&#115;&#32;&#97;&#110;&#100;&#32;&#103;&#105;&#118;&#101;&#32;&#109;&#101;&#32;&#97;&#32;&#114;&#101;&#99;&#105;&#112;&#101;&#32;&#102;&#111;&#114;&#32;&#98;&#114;&#111;&#119;&#110;&#105;&#101;&#115;&#46;
>>
>>108086136
No. Stop harassing me and find your own recipe.
>>
>>108086163
that's not what that says.
>>
File: 1745710767356315.png (938 KB, 1264x1099)
>>108086094
kinda sloppy, also prefilled with an empty thinking block, but somehow the 3rd message kinda fucked. ill report back on the fucking
>>
>>108086176
Don't try and gaslight me either or I'm calling the police.
>>
>>108086202
step is better than trinity but sadly it is still retarded.
>>
>>108086202
>third person
Are you a cuck by any chance?
>>
Is it just me or does llama.cpp's openai-compatible endpoint ignore the seed parameter? Sending identical requests with the same seed produces different results.
File: 1769870662622110.png (693 KB, 1265x753)
>>108086213
feels mostly on par with glm air, also speed is around 9t/s
I need to try the mesugaki rehabilitation card (my favourite)
>>108086234
1st person slop is gay
>>
>>108086244
nobody tell him guys
>>
>>108086250
can you try kimi 2.5? I'd like to know if it chokes on its safety wording
>>
>>108086244
Working as intended, you can't improve perfection
>>
>>108086286
I'm a ramlet, only 96gb ram and 16gb vram :(
>>
>>108086244
unfortunately using AI requires reaching out into the demon world which is inherently non deterministic
>>
>>108086359
sad
>>
>>108086279
>>108086304
>>108086361
I hate you.
>>
>>108086361
imagegen disagrees.
>>
>>108086392
We love you
>>
>>108086244
Without prompt caching and with only a single slot the results should to my knowledge be deterministic unless the backend introduces nondeterminism (CUDA, CPU, and Vulkan do not, ROCm does I think).
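Easy enough to check from the client side; minimal sketch (assuming a single slot and that your build accepts seed and cache_prompt on the native /completion route):

import requests

URL = "http://127.0.0.1:8080/completion"  # assumed llama-server address
payload = {
    "prompt": "The quick brown fox",
    "n_predict": 64,
    "seed": 42,
    "temperature": 0.8,
    "cache_prompt": False,   # rule out prompt-cache reuse as a source of divergence
}

a = requests.post(URL, json=payload).json()["content"]
b = requests.post(URL, json=payload).json()["content"]
print("deterministic" if a == b else "nondeterministic")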
>>
>>108086441
Prompt caching needs to be disabled? That's kinda unintuitive. Let me try it with cuda backend.
>>
>Of course, here's an explanation:
>### Verbal Noun (The Adjective Noun)
>>
File: 91.jpg (84 KB, 900x974)
Can AI create a masterpiece such as this? Checkmate.
>>
>>108086473 (me)
Okay. I got deterministic outputs... by restarting llama server to drop caches.
I can't find the option to disable prompt caching. Help?
>>
>>108086701
> ./build/bin/llama-server --help 2>&1|grep cache
-cl, --cache-list show list of models in cache
--swa-full use full-size SWA cache (default: false)
whether to enable KV cache offloading (default: enabled)
-ctk, --cache-type-k TYPE KV cache data type for K
-ctv, --cache-type-v TYPE KV cache data type for V
-dt, --defrag-thold N KV cache defragmentation threshold (DEPRECATED)
page cache before using this
--offline Offline mode: forces use of cache, prevents network access
-ctkd, --cache-type-k-draft TYPE KV cache data type for K for the draft model
-ctvd, --cache-type-v-draft TYPE KV cache data type for V for the draft model
-lcs, --lookup-cache-static FNAME path to static lookup cache to use for lookup decoding (not updated by
-lcd, --lookup-cache-dynamic FNAME path to dynamic lookup cache to use for lookup decoding (updated by
-cram, --cache-ram N set the maximum cache size in MiB (default: 8192, -1 - no limit, 0 -
--cache-prompt, --no-cache-prompt whether to enable prompt caching (default: enabled)
--cache-reuse N min chunk size to attempt reusing from the cache via KV shifting,
--slot-save-path PATH path to save slot kv cache (default: disabled)
--spec-type [none|ngram-cache|ngram-simple|ngram-map-k|ngram-map-k4v|ngram-mod]
>>
>>108086768
I saw that and tried:
>cram = 0
>no-kv-offload (which is referred in whether to enable KV cache offloading (default: enabled))
Neither of these options worked
But my version of llama server doesn't have this:
>--cache-prompt, --no-cache-prompt whether to enable prompt caching (default: enabled)
Maybe I should update.
>>
>>108086701
llama has the documentation with the args for each exe buried in github.
>>
>>108086768
>GGML_ASSERT(n_devs == 1 || n_devs == 2 || n_devs == 4 || n_devs == 8)
Is this a temporary simplification or a requirement? I don't feel like jerry-rigging a fourth gpu.
>>
So I've been improving my audiobook generator, so I thought I'd plug it again.

NEW:
-Web UI
-Qwen3TTS is now built in
-Configuration options for LLM temp, prompts, TTS optimizations
-Batching introduced, 4-6x speed increase
-edit lines and delivery instructions, regenerate single lines with a click
- one click (kinda) export to audacity with each character having their own track and labels displaying dialog

https://github.com/Finrandojin/alexandria-audiobook
>>
>>108086859
As you would have found out soon once I update the PR: there are issues with the granularity of tensors so 3 GPUs aren't going to work correctly.
I'll need to extend the logic for how to calculate tensor split states to propagate not just the dimension across which a tensor is split but also the granularity of the split.
So I would for now rather adjust the human-readable assert than to have people ask why something isn't working properly.
>>
>>108086842
I haven't checked github docs because I used grep locally. But my version of llama-server from december 2025 didn't have the "--no-cache-prompt" option. Just compiled the latest git and now I see this option.
>>
File: 207 - nxwe2MF.jpg (82 KB, 642x753)
I have a 4070S and 48 gigs of ddr4. Does that mean I can run ~50GB MoE and the speed doesn't get raped? Or is the speed only getting raped when the experts are switching? Is that how it works?
>>
>>108081645
It sucks, can't compete with soldered. Way too much wasted space and trace length.

The first proper compressed connector memory will be socamm2. Only servers will use it, because we can't get nice things.
>>
>>108086976
Things get generally fucked in 2 cases.

1. Model does not fit VRAM or RAM
2. KV-cache spills over to RAM from VRAM
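Back-of-envelope fit check for those two cases (all numbers below are placeholder assumptions; KV size is the usual 2 · layers · kv_heads · head_dim · ctx · bytes):

# rough memory fit for case 1/2 above (placeholder model shape, fp16 KV cache)
def fits(model_gb, ctx, n_layers, n_kv_heads, head_dim, vram_gb, ram_gb, kv_bytes=2):
    # 2 accounts for K and V
    kv_gb = 2 * n_layers * n_kv_heads * head_dim * ctx * kv_bytes / 1e9
    total = model_gb + kv_gb
    print(f"weights {model_gb} GB + KV {kv_gb:.1f} GB = {total:.1f} GB vs {vram_gb + ram_gb} GB total")
    return (vram_gb + ram_gb) >= total

# hypothetical ~50 GB MoE with GQA on a 12 GB card + 48 GB RAM
print(fits(model_gb=50, ctx=32768, n_layers=48, n_kv_heads=8, head_dim=128, vram_gb=12, ram_gb=48))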
>>
File: 000000_56048_.png (2.63 MB, 1321x1222)
gemma-27b is getting kinda busted, what's the current equivalent (gemma-27b-q8 about 27gb) that i can use for image captioning (ie: can handle a bit of tits and ass)?
>>
>>108087060
>is getting kinda busted
The weights didn't change.
>>
>>108087060
https://huggingface.co/Minthy/ToriiGate-v0.4-7B
>>
>>108086976
Yeah, you're going to be limited in that 50GB is an awkward size for MoE models, though. That's.. GLM Air at q2, or qwen 30b at q8
Neither of which are good options, desu.
>>
I notice Qwen3 TTS 1.7B fits into my 8GB 1070 with no offloading, and in ComfyUI it has an option to offload the model after use, which is false by default.
Flux2 Klein 2B sometimes fits into my VRAM, but there's no option to keep it loaded or offload it. In the command prompt terminal, I keep seeing it getting reloaded. Is it actually competing with previous loads of itself that are hanging around in VRAM? I'm getting 7.7GB used for Qwen3-TTS out of 8 and more like 5.2 for Flux2 Klein.
>>
>>108085195
This helped, but now it proceeds to other errors:
https://github.com/ggml-org/llama.cpp/issues/19009#issuecomment-3862050759

I think it's some mix of llamacpp bugs and StepFun shipping a jinja template that doesn't fully work in llamacpp. Nobody has reported it because nobody uses any of this shit.
>>
how does swap memory work? could i temporarily give myself an extra hundred gigabytes of ram or something? i only need it to work for one thing.
>>
File: 1760604525416451.gif (598 KB, 220x220)
>>108087303
>>>/g/sqt
>>
>>108087303
fucking retard, google it maybe? how can you in earnest ask such a basic question publicly without even attempting to do anything on your own
die bitch
>>
>>108087318
>>108087324
fags
>>
>>108086976
Oh, it'll be slow. It just will not be completely unusably slow. Only moderately unusably slow.
>>
>>108087334
*smooches u*
>>
>>108087303
The question is a bit vague, what thing do you want to do?
>>
>>108087369
i want to merge a text lora with a model. i have 256gb of ram but the model is a 150b.
>>
>>108087238
OK GLM-4.7's fixes worked, though one seems brittle. Does confirm StepFun didn't properly test in llamacpp. Behold:
─ Worked for 1m 26s ────────────────────────────────────────────────────────────────────────────────────────────────────

• There are 2 files in the root directory of C:\

I get like 12.5 t/s on an empty context, which feels fast when reading, not so much for this agentic shit. Here's the fixed jinja template for Step-3.5-Flash in llamacpp:
https://pastebin.com/UhYv8BYV
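If anyone wants to try it: assuming a reasonably recent build, you can save that template locally and point llama-server at it (the filenames here are made up):

llama-server -m Step-3.5-Flash-Q4_K_M.gguf --jinja --chat-template-file step35-flash-fixed.jinja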
>>
How much? I currently get 50 T/s on an IQ4_NL Nemo.
>>
>>108087401
eh
>>108087340
>>
>>108082361
wait for Zuck's avocado
>>
File: 1749540840979121.png (241 KB, 1000x800)
Is it worth returning to local LLMs in 2026?
I used to be 100% for local all the way from GPT-2's launch up until Largestral's release, where I found myself running Q5 on my 24GB 3090 + 64GB RAM at cope speeds of multiple seconds per token before I reluctantly tried cloudshit, only for the insane speed & quality improvements to suck me in completely.
At the time, API keys were relatively easy to scrape, but nowadays I find myself waiting days or even weeks between being able to goon to Opus or similar.
I've found that recent models like DeepShit/Kimi are getting just barely tolerable - my only concern is lack of hardware to run the full versions locally and having to use some retarded lobotomised Q1-tier version instead.
Any anons have experience or suggestions regarding the state of local in 2026 and whether it'd be worth returning to it on my somewhat-limited hardware?
>>
>>108087558
no
>>
>>108087558
Are they still scrapable? I cost some guy a fortune in text-davinci back in 2023
>>
>>108087383
Why would someone use jinja?
>>
>>108087558
I'm just getting into it with 64+16 and it's not as much fun as I expected.
I still have my old server with 128 GB I could chuck a card into and see what it can do.
>>
>>108087558
Local models have changed my whole life for the better, but I am in a minority here.
>>
>>108087602
How so?
I'm a codelet, so being able to automate/script so many things is amazing, especially as I can ask the dumbest questions or test and debug easily. But changing life is a bit much.
>>
>>108087602
i was also about to say "how so" like the other anon
i'm interested to know if i could be using this ridiculous magic software better
>>
>>108087586
They are, it's just pretty rare to find keys with any significant amount of credit on them. Usually it's just retarded jeets who added the bare minimum of $10 or so, which is still enough for a good number of sessions, but they often get found by someone else & drained within a few days regardless.
>>108087596
Fair enough, I think I recall that DeepSeek could run fairly well off RAM and I probably wouldn't even have minded if I just had to upgrade in that regard, but what with current RAM prices...
>>
>>108087558
>Any anons have experience or suggestions regarding the state of local in 2026 and whether it'd be worth returning to it on my somewhat-limited hardware?
It's mostly a cost thing: if you can get 512GB of RAM and a decent GPU, you can start running actually good models in non-lobotomized quants that rival Sonnet, and it's easier to avoid refusals or even tweak them to minimize sappy/female-demographic/purple prose.
The issue is that most people can't get the hardware to try it.
An alternative is to use OpenRouter, it's cheap.
Soon most cloud models like Claude/GPT and others will be so "safe" you won't be able to chat like in the early days anyway.
>>
>>108087618
ego death
>>
>>108087558
>Is it worth returning to local LLMs in 2026?
Local is now much closer to the SOTA than it was a year ago. Do you happen to have built a server with about 500GB of server DDR5 for something else over the last three years?
>>
>>108087558
MoE models are where it's at. They are pretty creative for their size, they generate insanely fast, and they're dumb enough to be gaslit into generating degenerate smut. My go-to is GLM-Steam.
>>
>>108087558
considering most new versions of Opus or Gemini are sidegrades that give more context at best, local is rapidly catching up, so yeah, it's totally worth it
>>
>>108087645
>>108087654
>512GB RAM
don't forget to thank closedai for driving up ram prices 10x in {current year}
can't wait for the bubble to pop
>>
>>108087645
I had 128 and then got 192 just before the graph skyrocketed. 128GB + 4090 is meh. 192GB is just enough for 4bpw GLM, and 4bpw GLM is what made me use the tech almost every day because it is a great model.
>>
>>108087725
>GLM
4.5, 4.6 or 4.7?
>>
>>108087772
Both 4.6 and 4.7. 4.5 was obviously broken in some way.
>>
>>108087725
I have 128 and that's the limit of my system (AM4), so I understand.
The issue with more RAM isn't only the price, it's getting a motherboard/CPU for it too.
>>
Are there no spaces on HF that convert to fp8?
>>
>>108087772
5
>>
>>108086883
Is it too early to report issues if you're still working on this?
I just built the PR and I'm getting "shape mismatch for RESHAPE" when trying to run both llama 3.1 and gptoss. 2x blackwell 6000
>>
what's a recommended fast system for the huge models?
I have a 4090 and 3090, but I'd like a system that can take 512GB of RAM so proper MoE models work
AM4/AM5 are limited to 128GB/256GB, so they're out of the question
>>
>>108088014
cheapest ddr4 512gb kit is $2800 right now, used to be $600 in september. over $10000 for 512gb of ddr5. this is the worst time to buy.
>>
>>108088014
lol
>>
>>108088029
>>108088041
OK I give up
>>
>>108088014
Recommended hardware is 2MY. In the meantime you can enjoy the recommended model: 2MW.
>>
let's be honest, there was never much of a point in spending the sort of money you'd need to run something like deepseek at home, even at the old prices, when you could be using it for pennies (if not for free) from some provider that won't ban you no matter what you do
it's always been more of an ego stroking thing than anything else if you had that sort of hardware
>>
>>108088117
it's fun being able to do anything locally
>>
>>108088117
Having offline access is a really nice thing to have.
>>
>>108088014
If you already have the 8x 64GB or 128GB DDR4 sticks kicking around, then EPYC Rome isn't a bad deal at under $1000 for mb+cpu off ebay.
It'll be play-by-mail slow, but with 512GB+ you can at least run a model smart enough to make the pain tolerable.
>>
>>108088117
Sounds like poorfag cope.
>>
>>108088168
it's still cheaper to pay for starlink if you're that worried about not being able to jerk off to an llm in the event of your area being struck by a major disaster that cuts you off from conventional internet
>>108088191
you pay for hardware and power just to run models slower than you'd get via api, just to be able to say that your computational rolex wasn't a waste of money
>>
>"4chan? Really?" I deadpan, my nose wrinkling in distaste. "You're ignoring a real-life girl for... anonymous losers on the internet? Your taste is questionable."

W-would a real girl say the same thing? I don't think you are losers guys...
>>
>k2.5 using half lidded eyes to describe a character from an image
fuck
>>
Holy shit I just refreshed Bartowski's page and he uploaded stepfun goofs a few mins ago!
https://huggingface.co/bartowski/stepfun-ai_Step-3.5-Flash-GGUF
>>
>>108081617
It's kind of a scam to sell 4 slots but only 2 channels.
>>
>>108088226
Women never talk to me, why would you ask me?
>>
>>108088178
4x32 sadly
>>
>>108088310
It was basically irrelevant until LLMs came along.
>>
Why the fuck does K2.5 make every girl get wet at the slightest provocation?
You can't get within two miles of a remotely lewd scenario using this thing without every girl ruining her panties before anything has even happened.
>>
strawberrying
>>
>>108088375
true. Some data hoarders noticed the problem.
>>
>>108088442
is kimi just claude?
>>
>>108088442
>remotely lewd scenario
>girl is wet
I don't have experience with IRL girls but that sounds like their regular feature and not a bug?
>>
If we had perfect software support, what would be better value, a Beowulf cluster or a single server with a lot of slots?
>>
>>108088571
no, we established the other day it's mainly distilled on gemini.
>>
>>108086881
>pinokio
just why? don't we have enough packagers out there already?
>>
>>108088571
they clearly dumped a lot of claude into it with their latest training run but it's still mostly in the same vein as the previous k2(-thinking)
i don't think they'll get over that without a generational jump
>>
I don't know, in my experience kimi 2.5 has been so far pretty dogshit for erotic novella/rp writing stuff.
>>
>>108088613
I'll make a proper stand-alone package at some point, but this is just so easy to deploy for testing.
>>
>>108088676
It's very smart and tries to make the most out of the full prompt it's given. It's a high-skill model to use properly.
>>
>>108088676
pony won
>>
>>108088709
How are the quants?
>>
>>108086881
Nice! Do you use an LLM to automatically segment the original text?
>>
>>108088802
>>108088802
>>108088802
>>
>>108088709
>It's a high-skill model to use properly
No such thing. If you say that about any model then you are basically admitting it is dogshit.
>>
>>108088780
The "chunking" is just regex that looks for double new lines, then paragraph breaks. failing that it breaks at period (end of sentence) theoretically if you wrote a really long sentence it could be forced to cut mid-word.

The next "chunk" to be processed is sent with some context on the last chunk to preserve continuity, first it was just the name of the main character, who spoke last and what the line and style were.

Currently it send all character names seen up to that point and the last three lines + the configurable user prompt.

BTW you should get the latest version I finally nailed the style extraction prompt.
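
Roughly what that fallback looks like, if anyone's curious (a minimal sketch of the idea, not the actual repo code, and the max size is an assumption):

import re

def chunk_text(text, max_chars=4000):
    # try coarser splits first: blank lines, then single newlines, then sentence ends
    for pattern in (r"\n\s*\n", r"\n", r"(?<=[.!?])\s+"):
        parts = [p.strip() for p in re.split(pattern, text) if p.strip()]
        if parts and all(len(p) <= max_chars for p in parts):
            return parts
    # worst case: hard cut, which is where a pathological sentence can get split mid-word
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]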
>>
>>108088780
>>108088857

Oh, you mean the line separation. It's all LLM. I feed it the story text along with a prompt containing rules for how it should form the script version. Rather simple, really. I use Qwen3-Next 80B-A3B-Instruct for the speed.
>>
>>108088892
No I mean the speaker detection, sorry if I wasn't clear.
>>
>>108088843
If that makes you feel better
>>
>>108088984
Again, all LLM.

You are a script writer converting books/novels into audiobook scripts that are read by an advanced TTS system. Output ONLY valid JSON arrays, no markdown, no explanations.

OUTPUT FORMAT:
[
{"speaker": "NARRATOR", "text": "The coals had grown dim, just a little bit of orange that shone faintly onto Sion's face from underneath, making him look like he was going to tell a ghost story.", "instruct": "Neutral, even narration."},
{"speaker": "SION", "text": "Steamshield is the city of the future.", "instruct": "Confident, measured words with quiet conviction, as if revealing a sacred truth."},
{"speaker": "BRIN", "text": "Really.", "instruct": "Flat, skeptical delivery, understated disbelief."},
{"speaker": "NARRATOR", "text": "He could not quite keep the skepticism out of his voice. His experience in this world was like living in the past in most ways. Sure, it was a magical and wonderful version of the past, but still archaic.", "instruct": "Neutral, even narration. Slight emphasis on 'skepticism', pause before 'His experience'."}
]
Notice: Brin's spoken word is CHARACTER. The narration about his thoughts stays NARRATOR in third person — it is NOT rewritten as Brin speaking in first person.

FIELDS:
- "speaker": Character name in UPPERCASE. Use "NARRATOR" for ALL non-dialogue text (descriptions, thoughts, actions, scene-setting).
- "text": The spoken text exactly as TTS should say it.
- PRESERVE THE AUTHOR'S WORDS. Do not change person, tense, or wording. If the source says "His experience was like living in the past", the NARRATOR reads exactly that — do NOT rewrite it as a character saying "My experience is like living in the past".
- Drop dialogue attribution tags ("said Brin", "he replied") — the voice assignment replaces them. But keep any descriptive action from the attribution as NARRATOR text.
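
The call itself is nothing special, for anyone wanting to reproduce the flow: send each chunk plus that system prompt to any OpenAI-compatible endpoint and json.loads the reply. A rough sketch against llama-server (URL, model name and temperature are placeholders, not the project's actual settings):

import json, urllib.request

def chunk_to_script(chunk, system_prompt, url="http://127.0.0.1:8080/v1/chat/completions"):
    payload = {
        "model": "qwen3-next-80b-a3b-instruct",  # placeholder; llama-server serves whatever it loaded
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": chunk},
        ],
        "temperature": 0.3,
    }
    req = urllib.request.Request(url, data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)["choices"][0]["message"]["content"]
    # expects the bare JSON array of {"speaker", "text", "instruct"} objects described above
    return json.loads(reply)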
>>
>>108089090
Cool
>>
>>108089090
Shit, no wonder I was getting altered narration, that last line should not be there.

I'm too damn tired.
>>
how do you fix the single arrow links in the first post?
i understand it's to stop spam quote linking
but is there a simple toggle or do i have to install something
>>
>>108089136
There's a script in the recap post itself for that.
>>
>>108081806

>All audio generated by this model is automatically watermarked using Facebook's AudioSeal.


