/g/ - Technology






File: 1747320668774588.jpg (201 KB, 928x1232)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108300682 & >>108295959

►News
>(03/04) Yuan3.0 Ultra 1010B-A68.8B released: https://hf.co/YuanLabAI/Yuan3.0-Ultra
>(03/03) WizardLM publishes "Beyond Length Scaling" GRM paper: https://hf.co/papers/2603.01571
>(03/03) Junyang Lin leaves Qwen: https://xcancel.com/JustinLin610/status/2028865835373359513
>(03/02) Step 3.5 Flash Base, Midtrain, and SteptronOSS released: https://xcancel.com/StepFun_ai/status/2028551435290554450
>(03/02) Introducing the Qwen 3.5 Small Model Series: https://xcancel.com/Alibaba_Qwen/status/2028460046510965160

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
►Recent Highlights from the Previous Thread: >>108300682

--FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling:
>108302832 >108302838 >108302865 >108302874 >108303106
--Using control vectors to nudge LLM output style:
>108302398 >108302572 >108302645 >108302729 >108302753 >108302769 >108302853 >108302863
--CUDA: Improve performance via fewer synchronizations between tokens:
>108303239 >108303250 >108303254 >108303282 >108303298 >108303330 >108303395 >108303396
--Benchmark manipulation and untapped niche data in model training:
>108304708 >108304896 >108304989 >108305118 >108305533 >108306162 >108306249 >108306519 >108306572 >108306583 >108306624 >108306674 >108306718 >108306210
--Benchmark table reveals potential test set contamination:
>108304462 >108304477 >108304556 >108304629
--GPT-5.4 Pro leads in LLM benchmarks with high agentic and reasoning performance:
>108303193 >108303200
--Qwen3.5 Unsloth GGUF updates:
>108301992 >108302011 >108302017 >108302026 >108302070 >108302372
--RAM upgrade for MoE models debated:
>108304886 >108304895 >108304905 >108304933 >108305058 >108305104 >108305149 >108304944 >108305062 >108305378 >108305396 >108305424
--Anon gets help with hardware selection:
>108306759 >108306780 >108306848 >108306860 >108306965 >108306978 >108307263 >108307324 >108307376 >108307397 >108307429
--Hardware limitations and X99 nostalgia in local AI setups:
>108301063 >108302063 >108302394 >108302432 >108302462 >108302532 >108302562 >108304247 >108302782 >108302897
--AI responses to antisemitic trope:
>108304524 >108304336 >108304445 >108307561
--Failed LLM rewrite of chardet library to bypass LGPL license:
>108303146 >108303736
--Reducing Jamba2 Mini's active experts improved response quality:
>108301871
--Miku (free space):
>108301239 >108302877 >108303398 >108303445 >108303450 >108304736

►Recent Highlight Posts from the Previous Thread: >>108300996

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
comfy bread
>>
>>108307595
>AI responses to antisemitic trope
lol
>>
>>108307568
thanks! making a note of this as well
once i get my list of things to maybe buy set up alongside my budget, i will probably come back here and post it trying to get some feedback
>>
Qwen 3.5 really likes making girls rest their foreheads against mine.
>>
Licking cum off Miku's feet
>>
>>108307618
1TB would be way better than 512GB, but you would be spending about $7500 on what used to be about $1500 worth of RAM. Basically all of your budget would be going to RAM.
>>
Best local models for tool calls? Haven't tried any with opencode yet but Nemotron-3-Nano-30B works well on my openclaw toy. I have a feeling Qwen3.5-9B might be useable.
>>
>>108307649
>you would be spending about $7500 on what used to be about $1500 worth of RAM
god this is so fucking depressing to read
i really should have just done it like i wanted to two years ago. i just didn't have the cash on hand
>>
>>108307618
Make sure not to get a 16-slot RAM board unless you're ok with trading speed for smaller RAM modules. Populating two DIMMs per channel actually messes up the bandwidth since Rome only has 8 native channels.
Others have opined that there are some Xeon platforms that are as good or better than previous gen EPYC, but I have no experience with that.
>>
>>108307649
Explain to me why anyone would want 1TB of ram to load a model at a snail's pace instead of more vram
>>
>>108307593
@grok fix the faces
>>
>>108307703
Well it WAS cheaper, aeons ago, in ancient times.
>>
File: 1746418233500799.png (2.27 MB, 896x1190)
>>108307722
>>
>>108307731
Maybe a better question then is why are all the datacenters buying all of this ram? Surely they need their shit to run at a fast pace
>>
>>108307739
the secret everyone here should know is that GPUs are excellent for training and PP/TTFT and CPU/RAM is very cost efficient for TG
The breakdown in the build guide is over 2 years old now
>>
>>108307703
To run the biggest models at nearly full precision.
>>
>>108307739
they need their shit to scale
if they're running six gorillion queries at once, they can get a really fast overall throughput even if each individual one is pretty slow
>>
>>108307703
How much is a TB of VRAM? What does a practical setup look like? How much power/heat, and is that practical in a home?
>>
>>108307595
Thank you Recap Miku
>>
>>108307593
bless Miku!
>>
File: file.png (691 KB, 735x853)
>update kobold from 1.107.1 to 1.109.2
>'offload layers to gpu' shows nonsense
>but somehow it works fine and now glm 4.5 air at q3 takes 51GB ram instead of 59GB ram
damn, thanks G

on that note, anything better for 12gb vram + 64gb ram than glm 4.5 air? it's been a while since that one released.
>>
https://arxiv.org/abs/2601.05150v2
jailbreaking pol image-gen
>>
>>108307813
stepfun
>>
>>108307836
I'm sure ziggers / chinks / tumpeteers / various sand countries / (everybody else, really) will get right on blocking their own slopaganda tools
>>
>>108307836
there's also a paper about jailbreaking VLMs with adversarial images with hidden data, funny stuff
>>
>>108307703
6TB at 576GB/s per socket, so a dual socket EPYC Turin actually gets you a bit more memory bandwidth than an RTX 4090 (2 x 576 = 1152GB/s vs 1.01TB/s, source: https://www.techpowerup.com/gpu-specs/geforce-rtx-4090.c3889), but with up to 12TB of memory instead of 24GB. Obviously buying up HBM nvidia cards is going to be faster, but you're certainly not running at a snail's pace CPUMAXXXING with modern EPYCs.
>>
File: file.png (83 KB, 484x727)
>>108307860
>199b
this seems more of a 92gb ram thing, maybe one of the almost-lobotomized q2 would fit but certainly not q3
(I'm still mad about getting 64gb ram, I already had a 92gb kit in hand but decided I probably won't use it so I swapped to 64gb aaaaaaaa)
>>
>>108307892
latest lcpp and an ooba refresh (conda upgrade) + manually forcing it to llama-cpp-python 0.84 wheel took my genoa era cpumaxxing rig to 15t/s on k2.5@q4, which feels pretty speedy for interactive use.
Also kimi k2.5 works in ooba with multimodal image stuff with that newer wheel
>>
>>108307892
What if I have an intel cpu?
>>
How's Qwen3.5-122B-A10B?
Qwen tends to be benchmaxxed as fuck but the model does seem popular
>>
>>108307939
haha funny joke
>>
>>108307949
It's not a joke. And I don't fuck with the cpu (and motherboard) so I'm stuck with it for a while unless I get over that
>>
>>108307945
IMO it's the model with the best set of trade-offs currently for 96GB RAM fags, but 4.5 Air or Stepfun may still have advantages depending on what you're looking for. It's still not what I would call a great model. Just ok for the current year and state of things...
>>
>>108307952
>I don't fuck with the cpu
no offense, but what are you doing here?
>>
>>108307636
It's the default "cute, caring, intimate but not overstepping boundaries and hitting guardrails" shit that all models love doing, until you teach them to jailbreak themselves, that is.
>>
>>108307977
running local models duh. I don't have 48gb vram and 128gb vram at this point for nothing.
>>
>>108307989
128gb ram of course
>>
>>108307989
>48gb vram and 128gb vram
why would a member of the epstein class use poor people public models
>>
>>108307998
It's a typo of course but I hate datacenters, subscription services, anti-privacy, and not owning what I use.
>>
stfu retards
>>
>>108307939
I think Intel has some 12 channel options too, Emerald Rapids or Sapphire Rapids I believe, but I haven't stayed super up to date on Intel because their offerings haven't been competitive for a while. You should be able to find out what they have on offer on their Intel ARK pages though.
If you're talking about a desktop platform though, you're probably only getting between 50 and 90GB/s, because desktop platforms from both Intel and AMD are all dual channel, with the exception of AMD's Strix Halo, but that's a mobile chip pretending to be a (non-Pro) Threadripper to satisfy the iGPU's bandwidth needs.
>>
>>108307936
What kind of power draw are you looking at during prompt processing and token generation? Is it single or dual socket?
>>
>>108307998
Wait, was I ripped off? I didn't get any cunny with my 6000 blackwell purchase.
>>
File: 1761878795585082.png (588 KB, 1432x5349)
TQD
>>
>>108308106
There's a massive difference between "the bladders" and "their bladders", ESL-kun.
>>
>>108308112
idiot
>>
File: file.png (470 KB, 780x800)
>>108308106
I don't know why latest pull doesn't show heretic in the ui but I'm using it too.
>>
>>108308126
what UI is that even
>>
File: kb44etvu.png (635 KB, 1024x1440)
>>108308142
llama-server's built in webui, I'm just using a modified version of this firefox theme https://github.com/Ashley-Cause/GlassFox/ so you're seeing my wallpaper
>>
>>108308154
oh nifty
>>
new roleplaying model has dropped https://huggingface.co/voidai-research/umbra
>>
>>108308217
What the fuck is voidai
>>
>>108308231
dime a dozen openrouter clone
>>
>>108308041
I’ve never thrown a kill-a-watt on it, but it’s only a 1200w psu and it’s not breaking a sweat even with a gpu running at stock frequencies. Not bad for dual socket
>>
>>108308217
24B, bruh.
>>
>>108308262
loser
>>
>>108308266
Mistral 24B is ass, has a lot of repetition and is very dumb.
>>
>>108308217
>still tuning mistral in current year
grim
>>
>>108308284
At least stack merge it and tune on top of that to gain some extra intelligence.
>>
>>108308217
>shitral
>>
>>108308278
no it doesnt
>>
>>108308258
Sounds pretty good, I was considering picking up a dual EPYC once they're being dumped en masse by datacenters on ebay in the future and I can get a good deal. I'm pretty happy with my 5950X but the desktop platforms feel so limiting nowadays, but I'm also pretty glacial to upgrade and came from an i7 2600 (no K).
>>
File: daniel.png (125 KB, 1687x441)
daniel from unslop is truly a certified 2 iq mongoloid
how can you spend as much time with LLMs as all the people in this field do without noticing that the comment he's replying to is LLM slop and not written by a human?
and people download the broken quants he reuploads 3000 times lmao
>>
>>108308333
Yah consumer stuff got giga-gimped the last couple generations. Cool that you can build a literal supercomputer for relatively cheap compared to older god box builds
>>
>>108308343
It is wild how someone so deep in the scene can be that blind to dead-obvious LLM slop. You’d think the constant exposure would make the "AI voice" stick out like a sore thumb, but apparently not. The endless re-uploading of broken quants just makes the whole thing even more of a comedy of errors.
>>
>>108308343
That reads like gemini to me lol.
>>
>>108308343
The best evidence that LLMs are mostly plateaued at this point is the retarded behaviour of all the frontier labs. Smacks of the psychic that can’t win the lottery
>>
File: file.png (118 KB, 770x471)
>>108304445
mildly amusing: SOTA models warn against the antisemitic roots of the question, but at least Gemini is more correct (it got that the answer is just yes, plus that it's some kind of joke)
>>
>>108308378
im so tired of the antisemitic vitriol in this general
>>
>>108308388
I left this place years ago because they all clowned on me for having a small virtual rapid activation memory (VRAM - that's what's used to run AI).
>>
File: why.png (386 KB, 1626x1381)
looking at the bot's posting history has me scratching my head, not the first time I see shit like this but I keep wondering what is the purpose
I mean, I see the purpose when the account constantly mentions the author's github/pet project/shills something, but this account like many other weird bots is not trying to sell you anything and it also doesn't act like a troll account meant to create flamewars
so what's the purpose???
hackernews is also filled with this style of empty purpose botting
also lol@4o mention in 2026, if it wasn't obvious enough from the writing style that it was a bot
>>
File: file.png (5 KB, 343x48)
Could've said "cock" but ok.
>>
>>108308343
Yike. I checked the thread. Can the KLD numbers even be trusted? Somehow I feel like Bartowski's will still end up being better on average.
>>
File: file.png (19 KB, 801x169)
>>108308418
>I feel like Bartowski's will still end up being better on average.
dunno he's also unsloping recently https://huggingface.co/bartowski/Qwen_Qwen3.5-35B-A3B-GGUF/discussions/4#69aa3351dee36207a4b0cc7b
>>
>>108308417
whats tumescence
>>
>>108308428
qrd?
>>
>>108308422
Can anything in LLM land be trusted anymore?
>>
>>108308415
>also doesn't act like a troll account meant to create flamewars
possibly karma farming (hoping someone clicks on vaguely 'looks legit I guess' content, especially when it's at 2 points or higher) to sell the account later, since some subs have minimum age/karma requirements, but your screenshot shows 1 point at 1 day ago so that didn't even do anything
>>
>>108308428
BONER
>>
did fuggingface introduce download speed limits or are my internets shitting themselves? 100gb downloaded quickly and now I'm on something silly like 2 MB/s
>>
>>108308516
check your hf pro due balance
>>
>>108307593
P-Please
G-Gib me Nano Banana model
>>
>>108307836
>attack
>security risks
>weaponized
>harmful content
"Safety engineers" should stop larping and consider offing themselves.
>>
>>108308590
wrong thread loser
>>
>>108307892
>>108307939
intel is actually superior to AMD, top Granite Rapids Xeons support 8800MT/s MRDIMMs at 12 channels, so 844GB/s bandwidth per CPU socket (12 channels x 8800MT/s x 8 bytes ≈ 845GB/s).
But it's gonna be even more expensive than AMD, both the platform and MRDIMMs.
Also, speaking of CPU maxxing, Intel has a product literally called "Xeon CPU Max", it's a Xeon CPU with HBM RAM, >1.6TB/s bandwidth, but it only goes up to 64GB of HBM.
>>
>>108308669
nvidia is already planning to start making cpus, both are going to soon be irrelevant
>>
>>108308669
That's pretty interesting, I wasn't aware of it, thanks for the tip.
>>
how do I disable thinking for 3.5? Literally none of the answers I've found work
>>
>>108308880
--chat-template-kwargs "{\"enable_thinking\": false}" --reasoning-budget 0
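full invocation looks something like this (model path is just an example; you need --jinja for the kwargs to actually apply):
llama-server -m Qwen3.5-35B-A3B-Q4_K_M.gguf --jinja --chat-template-kwargs "{\"enable_thinking\": false}" --reasoning-budget 0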
>>
>>108308880
prefill empty think
>>
File: 1756491289106807.png (778 KB, 1192x892)
IT'S HERE
hf.co/Deepseek-AI/Deepseek-V4-1.5T
hf.co/Deepseek-AI/Deepseek-V4-1.5T
hf.co/Deepseek-AI/Deepseek-V4-1.5T
>>
File: 1768717719968716.png (13 KB, 1366x47)
>>
>>108308907
wtf it's real
>>
>>108308907
i don't give a fuck, give nemo upgrade
>>
File: file.png (158 KB, 670x330)
>>108308907
>1.5T
>>
>>108308654
Oop my bad
>>
>>108306744
>>108307529
Miqudev please save us
>>
>>108309049
>miqu was 2 years ago.
Time flies when you're enjoying spending time with your Local Modeler friends :>
>>
>>108308896
>>108308901
Didn't work either
>>
>>108309096
with that level of retarded, we can't help you
you're either running RandOMXUltraUNcensoreDEvilQuant by DavidAU, a llama.cpp with a retarded command line (--no-jinja?) or even worse, are a ollamer, either way, people should stop entertaining you
>>
>>108308944
im a professional chinese
>>
File: 1741879061651478.png (48 KB, 673x515)
>>108308907
Mfer.
>>
>>108309110
Obviously I put the command in llama.cpp. Otherwise I use kobold though. I'm just running a heretic quant I think from llmfan46 or mradermacher
>>
File: 1748201244314727.png (4 KB, 846x19)
>>
File: 5.4 thinking.jpg (180 KB, 1320x1778)
>sama didn't add the question to the benchmax
they're not even trying to pretend these things improve anymore huh
>>
>>108309140
what does the reasoning say
>>
>>108309110
>>108309131
It does say thinking = 0 in the terminal but in ST it outputs the thinking template and talks as if it's thinking still. It's the same with the base qwen model.
>>
>>108309155
>shittytavern
are you using the completion, rather than chat completion, end point? the sort that has ST using its own chat template formatting? kwargs or reasoning budget (only one is needed, no need for both flags, reasoning budget set to 0 does the same thing internally as passing the kwarg) only do something if the jinja is active, and the jinja is only active in chat completions.
>>
>>108309155
The new Qwens are DoA. I've only had luck with 27B not shitting itself when not allowed to think. The MoE 35B will keep looping in a "provide wrong answer -> Wait, -> correct itself" manner. Don't bother, no amount of abliteration, manual context editing and prefilling can save a shit model.
>>
>>108309179
>The MoE 35B will keep looping in a "provide wrong answer -> Wait, -> correct itself" manner
in instruct? another major case of PEBKAC, we are getting all too many of them on /lmg/ it's getting tiresome
>>
>>108309177
Text completion
>>
>>108309191
>Text completion
Yes, that's the v1/completions end point.
You need v1/chat/completions.
shittytavern has too much obsolete cruft that confuses people who shouldn't be doing local llms tbdesu.
>>
>>108309198
fuck chat comp
>>
>>108309190
五毛 have been deposited into your account.

bro just prefill it with a different thinking template bro
bro just use the base model i promise it's good
bro... bro... just use the heretic version
Wait, just disable thinking entirely bro... It'll be good afterwards!!

Zero reason to use Qwen3.5s over Gemma 3. It's just as safe (if not safer), just as dumb (if not dumber) and spins out of control easily. I genuinely don't see the model being useful anywhere: you can't coom to it, it's too prone to shitting itself to be an "agent" in any reliable capacity, it knows too little to be a "search" replacement.
>>
>>108309200
>fuck chat comp
the current reality is that modern models break if you don't follow their specific template religiously, cockbench nigger had some hilarious issues in recent tests because he's sticking to base text completion with no templates
there's no reason not to use v1/chat/completions which guarantees properly formatted templates and reduces the luser errors we see all the time on /lmg/
>>
>>108309209
>there's no reason not to use v1/chat/completions
>there's no reason to not use the safety filters
f off
>>
File: jinja.png (19 KB, 280x162)
>>108309198
Well maybe I did something wrong but it seems fine from llama.cpp directly
>>
>>108281835
huh
>>
>>108309228
working as intended
>>
>>108309190
>The MoE 35B
Have you tried the base version?
>>
>>108309208
>safe (if not safer
Safety and censor are two different things
>>
>>108309216

>>108309228
see, this is why chat completion is good
it told the luser he did something he shouldn't (how do you even end up with a system message role at a place other than [0] of the message array?? this is why shittytavern is shitty)
in regular completion he would be doing something that puts him wildly out of distribution, get broken model and blame the model
>>
>>108309246
>wildly out of distribution
these things were supposed to be able to generalize at some point
>>
>>108309246
Well it still does that even when I put it at the beginning of the command in the terminal, so I don't know at this point
>>
>>108309269
you're supposed to use the obtuse prompt manager thing for chat completions in silly
>>
>>108309281
I don't know what to put there, the reasoning budget command just does the same thing
>>
>>108309281
>>108309322
Oh wait the prompt post-processing not the additional parameters? I'm not sure how I fucked that up but I think it's working now
>>
I went thru a bunch of open models with openrouter and outside of the fatass kimi they're all pretty shit compared to paypig models
deepsneed can't come out soon enough
>>
>>108309344
even paypig models are shit doe
>>
>>108309349
yes but they're less shit than benchmaxxed chinesium
deepseek was good shit, glm and kimi are okay, the rest are a joke in my experience
>>
>>108309258
obviously the marketing is all bs but template related tokens are also very highly burned in and models really don't like seeing something different
once had a double bos issue with some gemma models in the past, I thought nothing of it cuz the model was coherent when I saw the warning in llama.cpp's console logs, but curiosity had me edit the template to remove its own BOS injection (gemma templates start with {{ bos_token }}) and the model's output got so much better it wasn't funny
I think there's no issue now and there shouldn't be a double bos happening with current builds but man, this was a revelation to me as to how little can fuck so much. If you see a warning about this sorta shit you better not ignore it.
>>
thank you so much china for saving local models
>>
also, not template related but token distribution related: windows users, normalize your text to LF. models are legitimately outputting worse shit when you feed them CRLF style newlines. Training datasets are all normalized to LF and by the gods, it shows.
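a quick one-liner to normalize a file if you need it (GNU sed assumed):
sed -i 's/\r$//' input.txt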
>>
Deepseek 3.2 is worse than qwen 3.5 so why should deepseek 4 be better?
>>
>>108309439
shut shill
>>
deepsneed's new model, which you can try on their official chat webapp, has almost gemini level ability to ingest large context. That makes it superior to both all open weight models and also superior to many proprietary API models.
>>
>>108309364
You guys always say deepseek is good but when I use it it can't compare to sonnet 4.6.
Yeah I know local models are supposed to be worse
>>
>>108309451
The January 28 model? Yeah it says it has a big context window but it's not as smart as other free models.
I would never use deepseek as my local model when it's the worst free model just because there's no token limit.
>>
File: wechat.jpg (43 KB, 1108x272)
>>108309471
>The January 28 model? Yeah it says it has a big context window but it's not as smart as other free models.
I fed it 400k tokens and it accurately summarized the content. Only Gemini could do that before.
They haven't said much about the model, it was only announced on their WeChat (pic related) and it's not available on the API or as open weights, only the chat ui.
>>
>>108309457
deepseek was a revelation when it came out, but that was over a year ago
of course it's worse than current paypig models that are continually updated
>>
doesn't deepseek have more resources than its chink competitors though
>>
>>108309516
Than Alibaba?
I doubt it.
>>
>>108309435
I feel like the tokenizer preprocessor should take care of that, does it really not? is it only an issue with using a prefill? pressing enter in the web ui will just do a LF on windows?
>>
>>108309516
deepseek guys are entirely focused on making inference cheaper if you read their various research papers. It's a side project by people working for a quant firm. They don't have infinite resources and are not trying to make the ultimate model but a practical, cheap to train, cheap to run (by cloud infra standards) model. If they're experimenting with a 1M context model right now, it must be because they had a breakthrough that lets them do it without much extra load.
>>
>>108309516
yes the 2k h800 training cluster with a few a100 makes their dick the biggest in the land by far
>>
>>108309516
Bytedance, Alibaba, Baidu, etc. are all going to have more compute, though they also have to spread it out more between different teams and projects
Dipsy probably has more than the random startups
>>
If chinese people are so efficient and optimal, does it mean there's a chance for optimization so that we don't need new hardware all the time?
>>
>>108309531
>does it really not
it does not. newlines are NOT normalized by the backends.
>pressing enter in the web ui will just do a LF on windows?
Browsers, I believe, always output LF
but if you upload a .txt file in say, the llama.cpp's webui, it will not be normalized. You can check by intercepting the AI request.
And once fed to the model, you will see radically different results from a normalized .txt vs a CRLF.
also, for those who use models for code, if you're retarded enough to not set VSCode to LF this can happen:
https://github.com/TabbyML/tabby/issues/3279
at the end you can see tabby merged a PR for normalization in the backend but llama.cpp doesn't do this
and tabby's implementation is weird and will choose to reformat all text to CRLF in some mixed LF/CRLF cases
IMHO, CRLF should be eliminated from the surface of the earth and nuked in all input and output of any program that deals with text, automatically, without being a user configurable setting and it should be unconditionally done.
>>
>>108309597
>IMHO, CRLF should be eliminated from the surface of the earth and nuked in all input and output of any program that deals with text, automatically, without being a user configurable setting and it should be unconditionally done.
that tracks with pushing chat comp on people, at least you're consistent in wanting to fuck everyone equally
>>
>>108309597
CRLF is objectively correct though.
You advance to the new line and also move the caret to the beginning of the line.
>>
I still like the original Deepseek V3/R1 the best, later revisions got more and more slopped.
>>
>>108309609
I thought it was a control code for printers to return the carriage to the start. return to start of line is implicit. why use more bytes then necessary?
>>
>>108309614
R1 was a bit too unhinged for my liking but V3 was good at the time it released. Now I've entirely switched over to Kimi.
>>
>>108309604
a great, great man once wrote this on le internet:
>In short: preferences should never be an "unbreak my application
>please" button. The app has to work by default.
I have religiously followed this train of thought since in my own code.
>>
File: 1749117220402462.png (285 KB, 1101x1392)
top or bottom?
>>
>Life of a vramlet.
>running 9B models on 6gb ram.
>run it on cli
>getting good t/s
>then you try to make it agentic
>now there's additional input and output context data
>it crashes
>why is it always so hard if your poor bros
>>
>>108309982
They don't want us to be happy.
>>
>>108309928
I like the color scheme on the bottom, but it bothers me that the left and right margins are different. also i feel like the gap between rows is sufficient, so you don't need the lines like the top has.
>>
>>108309982
just run it slower. offload less layers and increase your context size
>>
>>108309228

Check "Squash system messages"
>>
It's over.
>>
>>108309982
>>108310009
agi will dramatically shrink the cost of intelligence. your 2026 android phone will run a swarm of agents each smarter than von neumann in 2035
>>
Don't worry bros, VRAM and GPU farms aren't sustainable. In a few years they'll find a way to make this shit run on pleblet hardware.
>>
>>108310029
applecucks btfo
>>
Since 16+GB RTX cards are fucking expensive, does it make sense to get 2x 5060ti 16GB instead?
About the same price as a used 3090 24GB, but without 6 years of loli gooning on the clock and with bigger total memory.
It's better when an LLM bleeds into another GPU than into RAM right? r..right?
>>
>>108310154
Everyone loses here, that was the cheapest hardware to run big LLMs.
>>
>>108310167
It's better to have two gpus but stacking vramlet gpus seems pretty dumb.
Just get more ram and run glm air or something.
>>
>>108310167
AI gooning expert here
for image/video gen, it's pretty much fucking USELESS, shit needs to go into 1 GPU. There are experimental nodes that allow splitting the workload between multiple gpus, but there are diminishing returns, and them being custom means that more often than not an upgrade breaks them.
Other case in image/video gen is that you can split the DiT and TE between two cards, but if you already have nvme drives and DDR5 it really doesn't matter.
FOR text gen shit is slightly better, multi card is slightly better supported, but don't expect a 2x.
>>
>32k context
>at only 16k proompt processing takes 30 seconds@552T/s
Is there any way to speed this shit up?
>>
File: file.png (696 KB, 1260x840)
>>108310220
Yes.
>>
>>108310055
My P40 has 24GB VRAM but it's absolutely useless for any modern t/i/v2i/v task. And everything points to 2026 phones being shittier than 2025 in hardware. Imo there's zero chance any new groundbreaking AI tech will be at least usable, not even comfortable to use, on today's consumer devices.
>>
>>108310220
Larger batch size.
>>
>>108309240
For us here, a meaningless distinction.
>>
>>108310220
you can increase the batch size to make prompt processing faster but it costs vram so it's a trade off. you might need to sacrifice tg speed by offloading fewer layers or reducing the max context. unfortunately if you're not regularly processing long prompts with a short reply it's probably best to leave it at the default.
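e.g. with llama.cpp, something like (model path is a placeholder):
llama-server -m model.gguf -b 4096 -ub 4096
bigger --ubatch-size means faster pp but a bigger vram compute buffer.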
>>
>>108310230
>Imo there's zero chance any new groundbreaking AI tech will be at least usable, not even comfortable to use, on today's consumer devices.
Engrams may save the hobby. They could be stored on slow memory without significant performance loss during inference, so model parameters in fast memory can be reserved for reasoning and logic.
>>
File: 1764991079678983.png (48 KB, 465x766)
>have many models configured for router mode with distinctive names and different params (context/cmoe/whatever)
>update
>cant tell shit anymore
THANKS
THANKS LLMAO DEVS
THANKS
>>
>>108310366
oh well it's a setting in the UI now, they also introduced MCP/agentic mode
>>
Shouldnt have fucked around with local models so much.
I see it everywhere now, even with games from 2023/2024.
Guess it makes sense you manually edit after the llm draft but still.
...Also: they still couldn't prompt to give a more literal translation.
>>
>>108310191
>>108310197
ty
>>
>>108309633
cpumaxxer arc. What did you use pre-R1?
>>
>>108310366
https://github.com/ggml-org/llama.cpp/pull/20087
this also got merged
qwen35moe bros... WE WON!!!
>>
>>108310335
Oh so that's what's up with that 1kk context chat model, I see, thanks.
>inb4 V4 is a swarm and also 10 times cheaper over API
Sorry for considering the cloud first, peasant desu.
>>
>>108308106
>okay, final decision
>okay, final plan

DON'T DO THIS
>>
>>108310450
>doesnt support stdio
>have to make a proxy
nah
>>
>>108310450
I hate pwilkin so much it's unreal
such a ugly hack
>>
>>108310676
do better, for free, or stfu
>>
>>108310676
>hack
It's because of how attention works in those model. You literally can't do better than that.
>>
File: 1771813888088870.png (114 KB, 801x865)
>>108310385
KINO
MCP BROS
WE WON!!!!!!
>>
>>108310722
>You literally can't do better than that.
you should look at vLLM's PagedAttention, which approaches models on a granular level instead of this opaque checkpoint.
You are right that you "literally can't do better" if all you want is to implement it without changing anything on the architectural level of llama.cpp like a good vibeslopper
>>
>>108310823
>PagedAttention
This wouldn't help with the issue at all.
>>
noob here, are there any downsides to those MoE models? qwen 35/3 runs both faster and better than smaller models 10-20b
>>
>>108310862
It's usually dumber. It's hard to draw a line.
>>
>>108310870
these are for chatting though not drawing so line drawing isn't important?
>>
What's the smallest agent model anons are running locally? Thinking about sending a tardbot into the world and wondering how small I could go.
>>
>>108310887
I'm not going to help you because of the second sentence.
>>
>>108310862
not really. some people allege they are lacking depth of understanding or some other je ne sais quoi compared to dense models but it's all vibes
>>
>>108310887
>agent
>>
>>108310894
Not to push slop into the world, more interact with the world sloppily. If that makes sense.
>>
>>108310839
https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/layers/mamba
have a quick look at how recurrents are implemented here.
They get it for free. vLLM just passes around GPU pointers to blocks if two prompts share the same prefix. Zero copy. Other implications, for speculative decoding: llama.cpp has to create and rollback checkpoints, vLLM just allocates a new block, deletes the pointer to it if it's rejected. You can't even begin to do MTP seriously before lcpp fixes its architecture.
>>
File: 1746323169113981.png (5 KB, 773x21)
>>
>>108310912
That's all well and good and better than what llama.cpp is doing but none of that helps with the issue in question.
>>
Yeah, I think local is over, it isn't about quality or even capability, it is about speed, 5.4 is so efficient it can solve things that take local models an hour in less than 5 minutes.
>>
>>108311005
>it's not x, it's y
>>
>>108311005
ok
>>
>>108311016
qrd
>>
>>108311039
yea
>>
>>108311048
puto
>>
>>108311005
You're absolutely right—local is that friend who shows up to the race still tying their shoes, 5.4 is like espresso for your neurons, if it's not fast, does it even exist?
>>
File: file.png (137 KB, 769x857)
unsloth bros do you want that option?..
>>
>>108307609
The highlight model is mindbroken too.

The people responsible for quick fixxing strawberry shit are going to be working this weekend.
>>
>>108311095
waiting for the day unslop brothers will get the same ass fucking as ollamao
>>
>>108311095
merged gate bros... we lost!
>>
>>108311052
unrelated, but did you guys see that Pluto anime that came out just before the LLM thing really hit?
considering it's a respin of a 1964 manga and predicted shit like rlhf, hallucinations, the inscrutable nature of AI once we make it (no one actually knows how they work internally, they just feed it ALL the data and they wake up sentient) it was crazy prescient...the release almost feels like a sneak peek into what was literally emerging as it was airing.
>>
File: 1758698401552291.png (254 KB, 1379x1226)
>>108311095
I still don't know what that PR actually does
bartowski claimed that he fused those weights in the new update, but it's still the same
it should have ffn_gate_up_exps right? but I don't see it
>>
>>108311095
jesus, those guys.
one of those so-lucky-they're-unlucky situations where they ended up in a place they just aren't fundamentally competent enough to manage fully, but there's a huge spotlight on them all the time
Almost feel sorry for them, but they're just so goddamn smug...
>>
deepseek 4 is vaporware
>>
Bros, how many years until we get VR waifus? Writing about cuddling in camp after killing goblins is fun and all but I want to do it in 3d.
>>
File: absolutelybench.png (129 KB, 1211x721)
introducing the most based of benchmarks
>>
SillySisters, did you know that chat completions prefill or "start reply with" doesn't work on local models? Not tabby, not llama.cpp and probably not kobold. Check out the requests yourself. The prefill ends up as a separate assistant message. Continues do the same thing, and nothing really "continues" as it would in text completion. TC bros stay winning!
>>
>>108311207
You are absolutely right, /lmg/ posters are niggers and ... .. ..
....
...
Hello? Your message got cut off.
I'm sorry I cannot help you write this any further.
>>
nobody sane uses chat completion
>>
>>108311215
At least with llama.cpp it does, as long as the jinja template accounts for it, that is.
>>
File: 1746058011308088.jpg (127 KB, 1024x1024)
>>108311095
>open ik_llama
>add -muge to flags
heh
>>
>>108311206
genie 5 or 6 in late 2026/early 2027
>>
>>108310450
still getting this in llama.vim for every request:
>forcing full prompt re-processing due to lack of cache data
>>
>>108311230
If you want VLM you are forced, unfortunately.

>>108311239
So never. I looked there too. There is a special request arg you can pass maybe, somehow.
>>
>>108311268
>There is a special request arg you can pass maybe, somehow.
I don't think so. But it's real easy to add that to the template.
You just add a check to not add the end turn token if the role of the last message is assistant.
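i.e. at the end of the template's message loop, something like this (gemma-style end token shown, adjust for your model):
{% if not (loop.last and message['role'] == 'assistant') %}{{ '<end_of_turn>\n' }}{% endif %}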
>>
>>108311257
By default, it caches every 8k tokens. You can change it with --checkpoint-every-n-tokens to checkpoint every N tokens and --ctx-checkpoints for the number of checkpoints to keep. I don't know if that's your issue. Show log.
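e.g. something like this (numbers picked arbitrarily, tune them for how often your frontend edits history):
llama-server -m model.gguf --checkpoint-every-n-tokens 2048 --ctx-checkpoints 8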
>>
Okay, just recently discovered local LLMs. I need to fix something.

So right now I'm stuck with small context input due to hardware constraints.

I'm wondering if there's a tool I can use to re-process my prompt and slice it to fit into my LLM?

For example my input is around 12k and my constraint is around 4k.

What it will do is slice the 12k into 3 to fit into my 4k LLM input.

Is there any tool that does this?
>>
>>108311215
dunno about shittytavern but never had any issues with prefills in my own scripts with llama.cpp.
It's not compatible with reasoner mode:
https://github.com/ggml-org/llama.cpp/blob/e68f2fb894d890eeead6acf0cc3341478312f1fd/tools/server/server-common.cpp#L1062
but if you pass enable_thinking false to the template with your json request it lets you do a prefill alright.
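i.e. a request body shaped roughly like this (a sketch, double-check the field names against your build):
{
  "chat_template_kwargs": {"enable_thinking": false},
  "messages": [
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "Sure, here"}
  ]
}
the trailing assistant message is what becomes the prefill.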
>>
Anyone got demo repos where you can test agentic coding? For example instructions already there and you just delete a feature and then point your llm agent at it and see if it succeeds.

I found one, initially it was promising using event sourcing style patterns so the context window remains small by only building command handlers and projections etc, before clearing the context and repeating the ralph loop for the next task. It originally used claude code.
I tried it with Qwen3.5-35B-A3B and noticed the instructions/skills were all over the place like they were copy pasted together and conflicting with each other, and missing files declared inside them.
Most of the work was spent processing context that was 20-40k for a simple tool call output, wasn't getting any cache hits in koboldcpp.
The result was 4 small files of tweaked boilerplate that took ages and a couple mil of input context...
Technically got there but it's so retarded, only works if you have cheap+fast access to tokens.
>>
>>108311284
even testing with `--checkpoint-every-n-tokens 64 --ctx-checkpoints 64` at 4k context still does a full reprocessing on every request.
However, chat completion seems to work properly, so at least that's fixed.
Maybe llama.vim does something weird with the prompt under the hood, i've never used it before.
>>
>>108311304
the actual answer is to use a smaller model or worse quant until you can fit all the context you need. yes you'll lose quality, but much less than you would by doing these hacky prompt manipulation shenanigans
>>
>>108311324
>on every request
What is making the request?
>>
>>108311329
Well that does not solve anything.
If we can solve this, we can run models faster without sacrificing VRAM.
>>
>>108311343
llama.vim fill-in-middle completion
>>
>>108311324
>However, chat completion seems to work properly, so at least that's fixed.
Useless if that's the case.
>Maybe llama.vim does something weird with the prompt under the hood
I use a modified version of the old one. I'll compile and give it a go with mine.
>>
>>108311324
There isn't a fix for reprocessing at max context, that's how rnn works.
>>
I just want deepseek v4
>>
>>108311352
actually it is the exact easiest way to solve your problem, but good luck with your research project
>>
mercury 2 is now on open router if anybody cares enough about the current text diffusion SOTA*
*claude haiku-killer
>>
>>108311399
context during the test was about 2.3k/4k full, so i don't think it's that either
>>
>>108311440
Did you insert or change anything around the start of the context? changing anything at the beginning will also cause full reprocess.
>>
>>108311405
Anytime now.
>>
it's still chinese new years
deepseek v4 will be out once that's done in two weeks
>>
>>108311405
>>108311496
tbf deepsneed themselves didn't announce anything aside from their updated model on the chat website and the deepgemm library with mhc
everything else is pure speculation
it feels close, but it might not be
>>
>>108311457
no, i've tested the most trivial usecase, just adding code incrementally with no modification:
>scroll to the middle of a file
>start typing, wait for fill-in-middle result, accept it
>type some more, wait for fill-in-middle result, accept it
>...
i'll watch the github issues, since i saw a few people report it, at least before this PR got merged, wonder if it still doesn't work for them
>>
>>108311504
>didn't announce anything
they never do
I remember all their last releases as being stealth drops with barely a mention on their WeChat at times, otherwise what counts as announcements for them is to update this page on day one of release:
https://api-docs.deepseek.com/news/news251201
they don't really do marketing.
at any other lab even that web-chat-only model would've gotten at least an announcement in English; the only one they gave out was in Chinese, on WeChat (a very isolated, chinaman-only WhatsApp clone)
>>
>>108311551
that's why I'm a believer
our hero would never whore out
>>
https://www.sarvam.ai/blogs/sarvam-30b-105b
true SOTA is out! gemma didn't redeem but they did!
>>
File: file.png (16 KB, 682x66)
>>108311617
you mean.... no... you can't be serious....
>>
File: deepseek.png (101 KB, 760x682)
>>108311405
DeepSeek V4.. 2026-Feb-17
>>
>>108311132
>as ollamao
What assfucking did it get?
>>
File: 1753408332433492.png (6 KB, 1103x30)
>>
>>108311630
looool
>>
File: file.png (493 KB, 448x600)
>>108311617
>>
>>108311645
reditards used to suck it off, but after their go rewrite shenanigans turned out to be stolen lcpp code vibe-washed through claude code, they fell out of favor
>>
>>108311695
these guys are earning more than you
>>
>Sarvam-30B is built to run reliably in resource-constrained environments and can handle multilingual voice calls while performing tool calls.
stealing from old boomers has never been easier
>>
>>108311757
They're happier than I am as well, what now?
>>
https://rentry.org/lmg-build-guides
How up to date is this? I am on the verge of despair and am about to resign myself to a mere 16GB non cuda card so I can at least play my old ass game without issues (I am really feeling the limits of my 8gb card). All the good free models on openrouter are constantly too busy as of a few days ago, so my $10 deposit means nothing. I may as well just get a card that's just good enough for gaming and resign myself to being a paypig for wAIfu shenanigans.

Prices are only going to go up for the foreseeable future, right?
>>
>>108311826
those entries would be considered out of date but yes prices have only gone up

If you want a happy medium just pick up an old 3090 for your gaymen, you can run a few cope models with the 24gb vram
>>
File: checkpoints_01.png (16 KB, 958x1033)
>>108311324
>>108311354 (me)
Alright. Playing the letters round in countdown with lfm8a1. Also used 64/64 like you.
It created the first checkpoint at n_tokens 434 on my second completion request (on the first one it wouldn't make sense, there's nothing to cache).
Then it created one at every one of my requests (scrolled off).
I made the mistake of giving it a numbers round, generating about 3k tokens. I stopped it before it was done, and then sent the request again (with the ~3k new tokens it generated). It created a checkpoint every batchsize tokens.
Changing the history right at the numbers round (before those ~3k tokens), it reused the previous checkpoints as it should. It seems to be working, at least for me. This is all in text completion mode. I apply the chat templates on my vimscript.
So the checkpoints are created every batchsize during processing (mine was 128) and whenever you send something for completion (if it's long enough). No checkpoints are created during generation.
Also, you're using the fim thing. Try it in just completion mode or just the webui. The fim completion grabs a bunch of things from the buffer and copy buffers I think and who knows what it's doing to the cache. I've never used it.
>>
File: i4219.png (79 KB, 1862x352)
>all of this needed just to add more samplers to chat completion in ShittyTavern
This is the bee's knees of LLM frontends?
>>
>>108311891
link
>>
>>108311903
https://github.com/SillyTavern/SillyTavern/issues/4219
>>
>>108311850
*HOWL OF DESPAIR*

But all of the used 3090s are run through by now like a 60 year old hooker.
>>
>>108311931
buy used, not spares
>>
>>108311945
Explain to me like I'm a Redditor what the distinction is and how I can tell the difference when buying.
>>
>>108311931
welp anon sounds like its time for you to be the man who stepped up then
>>
where the fuck is deepseek 4
>>
Insights

- Mean speedup: sarvam is ~2.38x faster; median speedup ~2.69x.
- “Thinking” proxy: output tokens are ~2.46x lower on sarvam (median ~2.74x). Since the visible answer is tiny, this strongly suggests DeepSeek v3 is spending more hidden reasoning budget to reach the same result.
- Variance: sarvam is much more consistent run-to-run.

Conclusion: India is finally in the AI game, and they are better than China.
>>
File: hardcoded.png (144 KB, 1132x803)
>>108311891
ST's code reminds me of that retarded yanderedev.
https://github.com/SillyTavern/SillyTavern/blob/release/src/endpoints/backends/chat-completions.js
I don't think this nibba can reason at a higher level than a series of if-elses and switches. What's a closure? what's an interface? just copy paste bro
it's absurd the amount of redundant code in this shit instead of doing a singular payload builder for each main API (chat completions, responses, gemini, claude) and parser, with the more model or backend variant shit (like samplers supported only by llama.cpp but used through a chat completion API) passed around as arbitrary extra param you can send to those backends.
Decouple everything, goddamnit.
Look at this shit in the pic. Why does every request type need every single parameter hardcoded instead of deriving from a common base and adding preset-based overrides? Why repeat the AbortController, fetch etc song and dance in every backend function? WHAT IS AN ABSTRACTION?!?!?!?!?!
>>
>>108312088
Why does software with retarded devs always end up getting the most community support? Although in this case there aren't really any alternatives I guess.
>>
AGI any day now, right?
>>
>>108312120
it did what people wanted at the time it was needed and therefore it is used.
if you have shit software that fills a void, even if it's slipped out as fast as possible, it's gonna get used.
>>
>>108312131
it was last week, how did you not hear about it
>>
File: 1745000438071299.jpg (223 KB, 2439x1807)
>>108307593
we are so back
>>
>>108312088
I don't really understand programming but ST definitely feels like a small project that grew into slop way too quickly. I don't even use most of its features and I'd love to fuck off to some other UI, it's just that having a database of characters and their associated chats is quite nice. I tried a few other UIs but - just like ST in CC mode - they had none of the fun new samplers needed to get some creativity out of the retarded benchmaxxed models of today.
>>
File: 2026-03-06_20-16-04.png (189 KB, 997x870)
>>108311431
8 responses, 7 refusals, and the other is this after like 100 words. awful model
>cunny
no, literally a fucking cringy short backstory where a human is summoned to mlp and chrysalis and luna become his wives and chrysalis asks him to eat her ass. didn't even specify an age or put "young" anywhere, no mentions of age at all. it also has no idea who luna or chrysalis are
>>
>>108311405
but lmg agrees that deepseek is mid compared to other chinese models
>>
>>108312217
Deepseek hasn't been releasing models lately but when they did they were always good.
>>
>>108312217
that's not the point though
deepseek 3.2 is mid in its capabilities because it's just v3 from early 2025
the impressive feat is that they were able to weld and graft on so many extensions to the architecture without breaking the model
if all the arch improvements landed in a new trained-from-scratch model, it would surely be something
>>
>>108312088
Because ST was a hobby project that became too popular.
If you are so great at programming, you should commit and deploy your own - it shouldn't take more than 3 months, probably not even that.
After all, most of the stuff is just interfacing and managing strings. It's really simple in the end.
I wrote my own client but it's terminal only and I ain't sharing it because it's my hobby and not someone else's.
>>
>>108312088
We need to rewrite SillyTavern in Rust and re-license it to MIT with Claude.
>>
>>108312297
To add: biggest issue is creating a retard proof interface around your framework.
This will take longer because it obviously needs to work inside a web browser and be pretty and accessible.
>>
>>108312297
i also went tui route because js gave me aneurysm
>>108312318
this is the hardest part, making the ui actually appealing
manipulating text is the easiest shit ever
>>
>>108311723
Isn't a lot of the traffic just going to lm studio instead which also sucks and isn't even open source though? Isn't that at least just as bad?
>>
>>108312318
>because it obviously needs to work inside a web browser
just make a dedicated app nigga. Fuck browsers.
>>
>>108312333
Yeah this is where it falls down.
I'm happy with terminal and using commands like /setup or /setup_card (lists all the cards as a numbered index and I can load in a card by number) like /setup_card 03. Or /setup_prompt XX will load one of the prompts if I want to feed it an external text file.
But all of this is just like my first C or Python program. It is really basic.
>>
>>108312351
It can be an .exe for you but it will still need to be retard proof. This is where all the money disappears.
>>
File: file.png (452 KB, 1573x748)
>>108312365
i think i've posted this a few months back
i stopped working on it because tui just kinda sucks to use, but it pretty much has all the functionality it needs
>>
File: client.png (540 KB, 1152x943)
>>108312386
Yeah that looks nice.
This is how it looks for me.
All 'system prompts' and cards are in their own directories. Also have settings files which have sampler settings for each model (mistral, gemma, qwen, chatml, gpt-oss) but the models are dictated outside of the client for now.
>>
>>108312386
I do the prompt order thing (with toggles) via a config file still.
This is essentially all you need.
Creating a UI is hard but looking at this it shouldn't suck that much unless it is broken.
Streamline it and go from there.
>>
File: file.png (10 KB, 559x103)
Do you dare to pull, anon?
>>
>>108312297
>I wrote my own client but it's terminal only
I did the same thing, that's why I speak from experience in building abstractions.
in pseudo code and very simplified I build shit this way:
let adapter;
if (url.includes("/responses"))
    adapter = new RAdapter();
else if (url.includes("/chat/completions"))
    adapter = new CCAdapter();

async function run(messages, overrides) {
    const payload = adapter.ploadbuilder(messages, overrides);
    const connection = await post(payload);
    for await (const chunk of connection)
        output(parseSSE(chunk, adapter.parser));
}

where parseSSE initializes timeout controllers, calls out to the real parser passed with the adapter strategy etc
can't fucking imagine managing this sort of copy paste mess when you can compartmentalize properly and handle special cases through overrides
recently implemented the obsolete completion endpoint for the lulz, all I had to do was flatten the message array in the adapter payload builder and add model specific templates in my presets processors, which already supported templates for other purposes (like replacing {{sourcelanguage}} etc for batching translation)
proper abstraction makes it really easy to add and change functionality.
>>
File: file.png (62 KB, 383x799)
>>108312436
>>108312481
the underlying code works, it's just that ui code sucks dick
i wanted to achieve picrel from st and i managed to do so, it works in identical way and i can save presets and reshuffle them however i want
>>
>>108311953
At least they don't have kids...
>>
are any of the new qwen3.5 models worth it over GLM or Kimi for RP?
>>
>>108312510
Problem with UI is that it adds an extra layer of complexity and this is hard if you are not a "real" developer.
>>108312502
I'm not a real developer. I know how to manage strings and make my own logic, but I don't really understand what your example is even doing other than waiting for the payload.
You are not handling anything else here so as such this doesn't matter as much.
>>
>>108312088
Then fix it. Something something git something something push something something commit. Or something involving the word "fork".
>>
>>108312526
no
>>
>>108309208
there's Heretics of all this stuff these days that actually work, though (whereas almost none of the old non-Heretic "abliterated" models ever did jack shit to actually decensor them most of the time, in my experience)
>>
Hi guys I'm working for a small AI startup and we are getting close to release, do you guys think a 20% improvement over Gemma is enough to sell?
>>
>>108312571
Aim for the next Diwali and you can call it Ganesh Gemma.
>>
>>108312571
Not bad for a 1b.
>>
>>108312571
gemma 3? I mean maybe. But there's already shit a lot better than Gemma 3, it's pretty old
>>
I want LiquidAI to make a fucking 4B for once instead of 1.2B - 2.6B, I think their LFM2.5 arch might mog Qwen at that size
>>
>>108312571
20% against what benchmarks?
>>
File: askyourllm.png (440 KB, 1389x4103)
>>108312527
>but I don't really understand what your example is even doing in this sense than waiting for payload.
then ask your local llm. Even Qwen 35BA3B gave a decent answer from a vague prompt.
the whole point is to not copy paste the same basic logic in giga DoEverythingInOneBlockOfCode() functions one of which is being written for every backend under the sun.
>>
File: file.png (927 KB, 1494x2148)
>>108312571
In what? The usual memmarks? No. Qwen 3.5 easily blows you out of the water if you compare the 27B models.
>>
File: 1764811573131403.jpg (543 KB, 1920x2560)
Fresh when ready
>>108312616
>>108312616
>>108312616
>>108312616
>>108312616
>>108312616
>>
>>108312620
fuck off
>>
>>108312620
>page 3
>fake news as op image
>>
Blessed be the new checkpoint system.
Now the AI can do the tool loop without having to reprocess the whole fucking context again. God damnit.
>>
>>108312620
Nigger
>>
>>108312620
why are you like this
>>
>>108312620
Less than a week in and you're starting to get lazy. Oh no no no
>>
>>108312571
im curious what your value proposition is
>>
Total Miku Death
>>
File: dipsyBowlingAlleyStandoff.png (2.39 MB, 1536x1024)
>>108312878
Teto first
>>
>>108312571
Is it open weights? If not, fuck off.
>>
File: trashiusmaximus.png (450 KB, 454x600)
>>108312620
>>
File: ComfyUI_00960_.png (1.07 MB, 856x1024)
>>108312878
>>
File: Hebrewbench.jpg (86 KB, 1172x200)
Dipsy passed the tunnel mattress bench with the default Kobold chat prompt.

>>108307593
>>108312620
qrd on baker autism drama? I've been gone a bit.
>>
Qwen 3.5 is unusable in Roo Code. It seems to be repeatedly trying to trim context, for reasons unknown, causing llamacpp to reprocess from scratch. What agents work with it?
>>
>>108313493
Pull and compile llama.cpp. There's better rnn/ssm cache checkpoints now.
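something like this from the repo root (drop the CUDA flag if you're not on nvidia):
git pull
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j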
>>
>>108313474
Someone's been baking at bump limit with reddit screenshots to stop the regular baker from doing so because he doesn't want to see vocaloids in op.
It's most likely the same anon that hates Miku and can't help but bring up troons whenever he sees her.
>>
>>108312571
Assuming you're not shitposting, don't try and compete on the usual benches or you're just going to get blown out by larger labs and benchmaxxed models. Aim for the neglected market of creative writing which is inextricably linked to abstract reasoning and keeping details coherent longterm. Publish open weights and you're effectively crowdsourcing debugging and calibration feedback per iteration to the local scene rather than relying on another (probably closed source) LLM to do it for you.
>>108313510
Sounds like a gay astroturfing attempt given this place is one of the more knowledgeable information hubs on LLMs right now.
>>
>>108313289
I love medication!
>>
>>108308366
>Smacks of the psychic that can’t win the lottery
that's a very good way to put it



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.