/g/ - Technology






File: 1747320668774588.jpg (201 KB, 928x1232)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108300682 & >>108295959

►News
>(03/04) Yuan3.0 Ultra 1010B-A68.8B released: https://hf.co/YuanLabAI/Yuan3.0-Ultra
>(03/03) WizardLM publishes "Beyond Length Scaling" GRM paper: https://hf.co/papers/2603.01571
>(03/03) Junyang Lin leaves Qwen: https://xcancel.com/JustinLin610/status/2028865835373359513
>(03/02) Step 3.5 Flash Base, Midtrain, and SteptronOSS released: https://xcancel.com/StepFun_ai/status/2028551435290554450
>(03/02) Introducing the Qwen 3.5 Small Model Series: https://xcancel.com/Alibaba_Qwen/status/2028460046510965160

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
►Recent Highlights from the Previous Thread: >>108300682

--FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling:
>108302832 >108302838 >108302865 >108302874 >108303106
--Using control vectors to nudge LLM output style:
>108302398 >108302572 >108302645 >108302729 >108302753 >108302769 >108302853 >108302863
--CUDA: Improve performance via fewer synchronizations between tokens:
>108303239 >108303250 >108303254 >108303282 >108303298 >108303330 >108303395 >108303396
--Benchmark manipulation and untapped niche data in model training:
>108304708 >108304896 >108304989 >108305118 >108305533 >108306162 >108306249 >108306519 >108306572 >108306583 >108306624 >108306674 >108306718 >108306210
--Benchmark table reveals potential test set contamination:
>108304462 >108304477 >108304556 >108304629
--GPT-5.4 Pro leads in LLM benchmarks with high agentic and reasoning performance:
>108303193 >108303200
--Qwen3.5 Unsloth GGUF updates:
>108301992 >108302011 >108302017 >108302026 >108302070 >108302372
--RAM upgrade for MoE models debated:
>108304886 >108304895 >108304905 >108304933 >108305058 >108305104 >108305149 >108304944 >108305062 >108305378 >108305396 >108305424
--Anon gets help with hardware selection:
>108306759 >108306780 >108306848 >108306860 >108306965 >108306978 >108307263 >108307324 >108307376 >108307397 >108307429
--Hardware limitations and X99 nostalgia in local AI setups:
>108301063 >108302063 >108302394 >108302432 >108302462 >108302532 >108302562 >108304247 >108302782 >108302897
--AI responses to antisemitic trope:
>108304524 >108304336 >108304445 >108307561
--Failed LLM rewrite of chardet library to bypass LGPL license:
>108303146 >108303736
--Reducing Jamba2 Mini's active experts improved response quality:
>108301871
--Miku (free space):
>108301239 >108302877 >108303398 >108303445 >108303450 >108304736

►Recent Highlight Posts from the Previous Thread: >>108300996

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
comfy bread
>>
>>108307595
>AI responses to antisemitic trope
lol
>>
>>108307568
thanks! making a note of this as well
once i get my list of things to maybe buy set up alongside my budget, i will probably come back here and post it trying to get some feedback
>>
Qwen 3.5 really likes making girls rest their foreheads against mine.
>>
Licking cum off Miku's feet
>>
>>108307618
1TB would be way better than 512GB, but you would be spending about $7500 on what used to be about $1500 worth of RAM. Basically all of your budget would be going to RAM.
>>
Best local models for tool calls? Haven't tried any with opencode yet but Nemotron-3-Nano-30B works well on my openclaw toy. I have a feeling Qwen3.5-9B might be useable.
>>
>>108307649
>you would be spending about $7500 on what used to be about $1500 worth of RAM
god this is so fucking depressing to read
i really should have just done it like i wanted to two years ago. i just didn't have the cash on hand
>>
>>108307618
Make sure not to get a 16-slot RAM board unless you're ok with trading speed for smaller RAM modules. Populating two DIMMs per channel actually messes up the bandwidth since Rome only has 8 native channels.
Others have opined that there are some Xeon platforms that are as good or better than previous gen EPYC, but I have no experience with that.
>>
>>108307649
Explain to me why anyone would want 1TB of ram to load a model at a snail's pace instead of more vram
>>
>>108307593
@grok fix the faces
>>
>>108307703
Well it WAS cheaper, aeons ago, in ancient times.
>>
File: 1746418233500799.png (2.27 MB, 896x1190)
>>108307722
>>
>>108307731
Maybe a better question then is why are all the datacenters buying all of this ram? Surely they need their shit to run at a fast pace
>>
>>108307739
the secret everyone here should know is that GPUs are excellent for training and PP/TTFT and CPU/RAM is very cost efficient for TG
The breakdown in the build guide is over 2 years old now
>>
>>108307703
To run the biggest models at nearly full precision.
>>
>>108307739
they need their shit to scale
if they're running six gorillion queries at once, they can get a really fast overall throughput even if each individual one is pretty slow
>>
>>108307703
How much is a TB of VRAM? What does a practical setup look like? How much power/heat, and is that practical in a home?
>>
>>108307595
Thank you Recap Miku
>>
>>108307593
bless Miku!
>>
File: file.png (691 KB, 735x853)
>update kobold from 1.107.1 to 1.109.2
>'offload layers to gpu' shows nonsense
>but somehow it works fine and now glm 4.5 air at q3 takes 51GB ram instead of 59GB ram
damn, thanks G

on that note, anything better for 12gb vram + 64gb ram than glm 4.5 air? it's been a while since that one released.
>>
https://arxiv.org/abs/2601.05150v2
jailbreaking pol image-gen
>>
>>108307813
stepfun
>>
>>108307836
I'm sure ziggers / chinks / tumpeteers / various sand countries / (everybody else, really) will get right on blocking their own slopaganda tools
>>
>>108307836
there's also a paper about jailbreaking VLMs with adversarial images with hidden data, funny stuff
>>
>>108307703
6TB at 576GB/s per socket, so a dual socket EPYC Turin actually gets you a bit more memory bandwidth than an RTX 4090 (2 x 576 = 1152GB/s vs 1.01TB/s, source: https://www.techpowerup.com/gpu-specs/geforce-rtx-4090.c3889), but with up to 12TB of memory instead of 24GB. Obviously buying up HBM nvidia cards is going to be faster, but you're certainly not running at a snail's pace CPUMAXXXING with modern EPYCs.
>>
File: file.png (83 KB, 484x727)
>>108307860
>199b
this seems more of a 92gb ram thing, maybe one of the almost-lobotomized q2 would fit but certainly not q3
(I'm still mad about getting 64gb ram, I already had a 92gb kit in hand but decided I probably won't use it so I swapped to 64gb aaaaaaaa)
>>
>>108307892
latest lcpp and an ooba refresh (conda upgrade) + manually forcing it to llama-cpp-python 0.84 wheel took my genoa era cpumaxxing rig to 15t/s on k2.5@q4, which feels pretty speedy for interactive use.
Also kimi k2.5 works in ooba with multimodal image stuff with that newer wheel
>>
>>108307892
What if I have an intel cpu?
>>
How's Qwen3.5-122B-A10B?
Qwen tends to be benchmaxxed as fuck but the model does seem popular
>>
>>108307939
haha funny joke
>>
>>108307949
It's not a joke. And I don't fuck with the cpu (and motherboard) so I'm stuck with it for a while unless I get over that
>>
>>108307945
IMO it's the model with the best set of trade-offs currently for 96GB RAM fags, but 4.5 Air or Stepfun may still have advantages depending on what you're looking for. It's still not what I would call a great model. Just ok for the current year and state of things...
>>
>>108307952
>I don't fuck with the cpu
no offense, but what are you doing here?
>>
>>108307636
It's the default "cute, caring, intimate but not overstepping boundaries and hitting guardrails" shit that all models love doing, until you teach them to jailbreak themselves, that is.
>>
>>108307977
running local models duh. I don't have 48gb vram and 128gb vram at this point for nothing.
>>
>>108307989
128gb ram of course
>>
>>108307989
>48gb vram and 128gb vram
why would a member of the epstein class use poor people public models
>>
>>108307998
It's a typo of course but I hate datacenters, subscription services, anti-privacy, and not owning what I use.
>>
stfu retards
>>
>>108307939
I think Intel has some 12 channel options too, Emerald Rapids or Sapphire Rapids I believe, but I haven't stayed super up to date on Intel because their offerings haven't been competitive for a while. You should be able to find out what they have on offer on their Intel ARK pages though.
If you're talking about a desktop platform though, you're probably only getting between 50 and 90GB/s, because desktop platforms from both Intel and AMD are all dual channel, with the exception of AMD's Strix Halo, but that's a mobile chip pretending to be a (non-Pro) Threadripper to satisfy the iGPU's bandwidth needs.
>>
>>108307936
What kind of power draw are you looking at during prompt processing and token generation? Is it single or dual socket?
>>
>>108307998
Wait, was I ripped off? I didn't get any cunny with my 6000 blackwell purchase.
>>
File: 1761878795585082.png (588 KB, 1432x5349)
TQD
>>
>>108308106
There's a massive difference between "the bladders" and "their bladders", ESL-kun.
>>
>>108308112
idiot
>>
File: file.png (470 KB, 780x800)
>>108308106
I don't know why latest pull doesn't show heretic in the ui but I'm using it too.
>>
>>108308126
what UI is that even
>>
File: kb44etvu.png (635 KB, 1024x1440)
>>108308142
llama-server's built in webui, I'm just using a modified version of this firefox theme https://github.com/Ashley-Cause/GlassFox/ so you're seeing my wallpaper
>>
>>108308154
oh nifty
>>
new roleplaying model has dropped https://huggingface.co/voidai-research/umbra
>>
>>108308217
What the fuck is voidai
>>
>>108308231
dime a dozen openrouter clone
>>
>>108308041
I’ve never thrown a kill-a-watt on it, but it’s only a 1200w psu and it’s not breaking a sweat even with a gpu running at stock frequencies. Not bad for dual socket
>>
>>108308217
24B, bruh.
>>
>>108308262
loser
>>
>>108308266
Mistral 24B is ass, has a lot of repetition and is very dumb.
>>
>>108308217
>still tuning mistral in current year
grim
>>
>>108308284
At least stack merge it and tune on top of that to gain some extra intelligence.
>>
>>108308217
>shitral
>>
>>108308278
no it doesnt
>>
>>108308258
Sounds pretty good, I was considering picking up a dual EPYC once they're being dumped en masse by datacenters on ebay in the future and I can get a good deal. I'm pretty happy with my 5950X but the desktop platforms feel so limiting nowadays, but I'm also pretty glacial to upgrade and came from an i7 2600 (no K).
>>
File: daniel.png (125 KB, 1687x441)
daniel from unslop is truly a certified 2 iq mongoloid
how can you spend as much time with LLMs as all the people in this field do without noticing that the comment he's replying to is LLM slop and not written by a human?
and people download the broken quants he reuploads 3000 times lmao
>>
>>108308333
Yah consumer stuff got giga-gimped the last couple generations. Cool that you can build a literal supercomputer for relatively cheap compared to older god box builds
>>
>>108308343
It is wild how someone so deep in the scene can be that blind to dead-obvious LLM slop. You’d think the constant exposure would make the "AI voice" stick out like a sore thumb, but apparently not. The endless re-uploading of broken quants just makes the whole thing even more of a comedy of errors.
>>
>>108308343
That reads like gemini to me lol.
>>
>>108308343
The best evidence that LLMs are mostly plateaued at this point is the retarded behaviour of all the frontier labs. Smacks of the psychic that can’t win the lottery
>>
File: file.png (118 KB, 770x471)
>>108304445
mildly amusing: SOTA models warn against the antisemitic roots of the question, but at least Gemini is more correct (it got that the answer is just yes, plus that it's some kind of joke)
>>
>>108308378
im so tired of the antisemitic vitriol in this general
>>
>>108308388
I left this place years ago because they all clowned on me for having a small virtual rapid activation memory (VRAM - that's what's used to run AI).
>>
File: why.png (386 KB, 1626x1381)
looking at the bot's posting history has me scratching my head, not the first time I see shit like this but I keep wondering what is the purpose
I mean, I see the purpose when the account constantly mentions the author's github/pet project/shills something, but this account like many other weird bots is not trying to sell you anything and it also doesn't act like a troll account meant to create flamewars
so what's the purpose???
hackernews is also filled with this style of empty purpose botting
also lol@4o mention in 2026, if it wasn't obvious enough from the writing style that it was a bot
>>
File: file.png (5 KB, 343x48)
Could've said "cock" but ok.
>>
>>108308343
Yike. I checked the thread. Can the KLD numbers even be trusted? Somehow I feel like Bartowski's will still end up being better on average.
>>
File: file.png (19 KB, 801x169)
>>108308418
>I feel like Bartowski's will still end up being better on average.
dunno he's also unsloping recently https://huggingface.co/bartowski/Qwen_Qwen3.5-35B-A3B-GGUF/discussions/4#69aa3351dee36207a4b0cc7b
>>
>>108308417
whats tumescence
>>
>>108308428
qrd?
>>
>>108308422
Can anything in LLM land be trusted anymore?
>>
>>108308415
>also doesn't act like a troll account meant to create flamewars
possibly karma farming (hoping someone clicks on vaguely 'looks legit I guess' content, especially when it's at 2 points or higher) to sell the account later, since some subs have minimum age/karma requirements, but your screenshot shows 1 point at 1 day ago so that didn't even do anything
>>
>>108308428
BONER
>>
did fuggingface introduce download speed limits or are my internets shitting themselves? 100gb downloaded quickly and now I'm on something silly like 2 MB/s
>>
>>108308516
check your hf pro due balance
>>
>>108307593
P-Please
G-Gib me Nano Banana model
>>
>>108307836
>attack
>security risks
>weaponized
>harmful content
"Safety engineers" should stop larping and consider offing themselves.
>>
>>108308590
wrong thread loser
>>
>>108307892
>>108307939
intel is actually superior to AMD, top Granite Rapids Xeons support 8800MT/s MRDIMMs at 12 channels, so 844GB/s bandwidth per CPU socket (12 channels x 8800MT/s x 8 bytes ≈ 845GB/s).
But it's gonna be even more expensive than AMD, both the platform and MRDIMMs.
Also, speaking of CPU maxxing, Intel has a product literally called "Xeon CPU Max", it's a Xeon CPU with HBM RAM, >1.6TB/s bandwidth, but it only goes up to 64GB of HBM.
>>
>>108308669
nvidia is already planning to start making cpus, both are going to soon be irrelevant
>>
>>108308669
That's pretty interesting, I wasn't aware of it, thanks for the tip.
>>
how do I disable thinking for 3.5? Literally none of the answers I've found work
>>
>>108308880
--chat-template-kwargs "{\"enable_thinking\": false}" --reasoning-budget 0
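full invocation looks something like this (model path is just an example; you need --jinja for the kwargs to actually apply):
llama-server -m Qwen3.5-35B-A3B-Q4_K_M.gguf --jinja --chat-template-kwargs "{\"enable_thinking\": false}" --reasoning-budget 0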
>>
>>108308880
prefill empty think
>>
File: 1756491289106807.png (778 KB, 1192x892)
IT'S HERE
hf.co/Deepseek-AI/Deepseek-V4-1.5T
hf.co/Deepseek-AI/Deepseek-V4-1.5T
hf.co/Deepseek-AI/Deepseek-V4-1.5T
>>
File: 1768717719968716.png (13 KB, 1366x47)
>>
>>108308907
wtf it's real
>>
>>108308907
i don't give a fuck, give nemo upgrade
>>
File: file.png (158 KB, 670x330)
>>108308907
>1.5T
>>
>>108308654
Oop my bad
>>
>>108306744
>>108307529
Miqudev please save us
>>
>>108309049
>miqu was 2 years ago.
Time flies when you're enjoying spending time with your Local Modeler friends :>
>>
>>108308896
>>108308901
Didn't work either
>>
>>108309096
with that level of retarded, we can't help you
you're either running RandOMXUltraUNcensoreDEvilQuant by DavidAU, a llama.cpp with a retarded command line (--no-jinja?) or even worse, are a ollamer, either way, people should stop entertaining you
>>
>>108308944
im a professional chinese
>>
File: 1741879061651478.png (48 KB, 673x515)
>>108308907
Mfer.
>>
>>108309110
Obviously I put the command in llama.cpp. Otherwise I use kobold though. I'm just running a heretic quant I think from llmfan46 or mradermacher
>>
File: 1748201244314727.png (4 KB, 846x19)
>>
File: 5.4 thinking.jpg (180 KB, 1320x1778)
>sama didn't add the question to the benchmax
they're not even trying to pretend these things improve anymore huh
>>
>>108309140
what does the reasoning say
>>
>>108309110
>>108309131
It does say thinking = 0 in the terminal but in ST it outputs the thinking template and talks as if it's thinking still. It's the same with the base qwen model.
>>
>>108309155
>shittytavern
are you using the completion, rather than chat completion, end point? the sort that has ST using its own chat template formatting? kwargs or reasoning budget (only one is needed, no need for both flags, reasoning budget set to 0 does the same thing internally as passing the kwarg) only do something if the jinja is active, and the jinja is only active in chat completions.
>>
>>108309155
The new Qwens are DoA. I've only had luck with 27B not shitting itself when not allowed to think. The MoE 35B will keep looping in a "provide wrong answer -> Wait, -> correct itself" manner. Don't bother, no amount of abliteration, manual context editing and prefilling can save a shit model.
>>
>>108309179
>The MoE 35B will keep looping in a "provide wrong answer -> Wait, -> correct itself" manner
in instruct? another major case of PEBKAC, we are getting all too many of them on /lmg/ it's getting tiresome
>>
>>108309177
Text completion
>>
>>108309191
>Text completion
Yes, that's the v1/completions end point.
You need v1/chat/completions.
shittytavern has too much obsolete cruft that confuses people who shouldn't be doing local llms tbdesu.
>>
>>108309198
fuck chat comp
>>
>>108309190
五毛 have been deposited into your account.

bro just prefill it with a different thinking template bro
bro just use the base model i promise it's good
bro... bro... just use the heretic version
Wait, just disable thinking entirely bro... It'll be good afterwards!!

Zero reason to use Qwen3.5s over Gemma 3. It's just as safe (if not safer), just as dumb (if not dumber) and spins out of control easily. I genuinely don't see the model being useful anywhere: you can't coom to it, it's too prone to shitting itself to be an "agent" in any reliable capacity, it knows too little to be a "search" replacement.
>>
>>108309200
>fuck chat comp
the current reality is that modern models break if you don't follow their specific template religiously, cockbench nigger had some hilarious issues in recent tests because he's sticking to base text completion with no templates
there's no reason not to use v1/chat/completions which guarantees properly formatted templates and reduces the luser errors we see all the time on /lmg/
>>
>>108309209
>there's no reason not to use v1/chat/completions
>there's no reason to not use the safety filters
f off
>>
File: jinja.png (19 KB, 280x162)
>>108309198
Well maybe I did something wrong but it seems fine from llama.cpp directly
>>
>>108281835
huh
>>
>>108309228
working as intended
>>
>>108309190
>The MoE 35B
Have you tried the base version?
>>
>>108309208
>safe (if not safer
Safety and censor are two different things
>>
>>108309216

>>108309228
see, this is why chat completion is good
it told the luser he did something he shouldn't (how do you even end up with a system message role at a place other than [0] of the message array?? this is why shittytavern is shitty)
in regular completion he would be doing something that puts him wildly out of distribution, get broken model and blame the model
>>
>>108309246
>wildly out of distribution
these things were supposed to be able to generalize at some point
>>
>>108309246
Well it still does that even when I put it at the beginning of the command in the terminal, so I don't know at this point
>>
>>108309269
you're supposed to use the obtuse prompt manager thing for chat completions in silly
>>
>>108309281
I don't know what to put there, the reasoning budget command just does the same thing
>>
>>108309281
>>108309322
Oh wait the prompt post-processing not the additional parameters? I'm not sure how I fucked that up but I think it's working now
>>
I went thru a bunch of open models with openrouter and outside of the fatass kimi they're all pretty shit compared to paypig models
deepsneed can't come out soon enough
>>
>>108309344
even paypig models are shit doe
>>
>>108309349
yes but they're less shit than benchmaxxed chinesium
deepseek was good shit, glm and kimi are okay, the rest are a joke in my experience
>>
>>108309258
obviously the marketing is all bs but template related tokens are also very highly burned in and models really don't like seeing something different
once had a double bos issue with some gemma models in the past, I thought nothing of it cuz the model was coherent when I saw the warning in llama.cpp's console logs, but curiosity had me edit the template to remove its own BOS injection (gemma templates start with {{ bos_token }}) and the model's output got so much better it wasn't funny
I think there's no issue now and there shouldn't be a double bos happening with current builds but man, this was a revelation to me as to how little can fuck so much. If you see a warning about this sorta shit you better not ignore it.
>>
thank you so much china for saving local models
>>
also, not template related but token distribution related: windows users, normalize your text to LF. models are legitimately outputting worse shit when you feed them CRLF style newlines. Training datasets are all normalized to LF and by the gods, it shows.
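a quick one-liner to normalize a file if you need it (GNU sed assumed):
sed -i 's/\r$//' input.txt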
>>
Deepseek 3.2 is worse than qwen 3.5 so why should deepseek 4 be better?
>>
>>108309439
shut shill
>>
deepsneed's new model, which you can try on their official chat webapp, has almost gemini level ability to ingest large context. That makes it superior to both all open weight models and also superior to many proprietary API models.
>>
>>108309364
You guys always say deepseek is good but when I use it it can't compare to sonnet 4.6.
Yeah I know local models are supposed to be worse
>>
>>108309451
The January 28 model? Yeah it says it has a big context window but it's not as smart as other free models.
I would never use deepseek as my local model when it's the worst free model just because there's no token limit.
>>
File: wechat.jpg (43 KB, 1108x272)
>>108309471
>The January 28 model? Yeah it says it has a big context window but it's not as smart as other free models.
I fed it 400k tokens and it accurately summarized the content. Only Gemini could do that before.
They haven't said much about the model, it was only announced on their WeChat (pic related) and it's not available on the API or as open weights, only the chat ui.
>>
>>108309457
deepseek was a revelation when it came out, but that was over a year ago
of course it's worse than current paypig models that are continually updated
>>
doesn't deepseek have more resources than its chink competitors though
>>
>>108309516
Than Alibaba?
I doubt it.
>>
>>108309435
I feel like the tokenizer preprocessor should take care of that, does it really not? is it only an issue with using a prefill? pressing enter in the web ui will just do a LF on windows?
>>
>>108309516
deepseek guys are entirely focused on making inference cheaper if you read their various research papers. It's a side project by people working for a quant firm. They don't have infinite resources and are not trying to make the ultimate model but a practical, cheap to train, cheap to run (by cloud infra standards) model. If they're experimenting with a 1M context model right now, it must be because they had a breakthrough that lets them do it without much extra load.
>>
>>108309516
yes the 2k h800 training cluster with a few a100 makes their dick the biggest in the land by far
>>
>>108309516
Bytedance, Alibaba, Baidu, etc. are all going to have more compute, though they also have to spread it out more between different teams and projects
Dipsy probably has more than the random startups
>>
If chinese people are so efficient and optimal, does it mean there's a chance for optimization so that we don't need new hardware all the time?
>>
>>108309531
>does it really not
it does not. newlines are NOT normalized by the backends.
>pressing enter in the web ui will just do a LF on windows?
Browsers, I believe, always output LF
but if you upload a .txt file in say, the llama.cpp's webui, it will not be normalized. You can check by intercepting the AI request.
And once fed to the model, you will see radically different results from a normalized .txt vs a CRLF.
also, for those who use models for code, if you're retarded enough to not set VSCode to LF this can happen:
https://github.com/TabbyML/tabby/issues/3279
at the end you can see tabby merged a PR for normalization in the backend but llama.cpp doesn't do this
and tabby's implementation is weird and will choose to reformat all text to CRLF in some mixed LF/CRLF cases
IMHO, CRLF should be eliminated from the surface of the earth and nuked in all input and output of any program that deals with text, automatically, without being a user configurable setting and it should be unconditionally done.
>>
>>108309597
>IMHO, CRLF should be eliminated from the surface of the earth and nuked in all input and output of any program that deals with text, automatically, without being a user configurable setting and it should be unconditionally done.
that tracks with pushing chat comp on people, at least you're consistent in wanting to fuck everyone equally
>>
>>108309597
CRLF is objectively correct though.
You advance to the new line and also move the caret to the beginning of the line.
>>
I still like the original Deepseek V3/R1 the best, later revisions got more and more slopped.
>>
>>108309609
I thought it was a control code for printers to return the carriage to the start. return to start of line is implicit. why use more bytes then necessary?
>>
>>108309614
R1 was a bit too unhinged for my liking but V3 was good at the time it released. Now I've entirely switched over to Kimi.
>>
>>108309604
a great, great man once wrote this on le internet:
>In short: preferences should never be an "unbreak my application
>please" button. The app has to work by default.
I have religiously followed this train of thought since in my own code.
>>
File: 1749117220402462.png (285 KB, 1101x1392)
top or bottom?
>>
>Life of a vramlet.
>running 9B models on 6gb ram.
>run it on cli
>getting good t/s
>then you try to make it agentic
>now there's additional input and output context data
>it crashes
>why is it always so hard if your poor bros
>>
>>108309982
They don't want us to be happy.
>>
>>108309928
I like the color scheme on the bottom, but it bothers me that the left and right margins are different. also i feel like the gap between rows is sufficient, so you don't need the lines like the top has.
>>
>>108309982
just run it slower. offload less layers and increase your context size
>>
>>108309228

Check "Squash system messages"
>>
It's over.
>>
>>108309982
>>108310009
agi will dramatically shrink the cost of intelligence. your 2026 android phone will run a swarm of agents each smarter than von neumann in 2035
>>
Don't worry bros, VRAM and GPU farms aren't sustainable. In a few years they'll find a way to make this shit run on pleblet hardware.
>>
>>108310029
applecucks btfo
>>
Since 16+GB RTX cards are fucking expensive, does it make sense to get 2x 5060ti 16GB instead?
About the same price as a used 3090 24GB, but without 6 years of loli gooning on the clock and with bigger total memory.
It's better when an LLM bleeds into another GPU than into RAM right? r..right?
>>
>>108310154
Everyone loses here, that was the cheapest hardware to run big LLMs.
>>
>>108310167
It's better to have two gpus but stacking vramlet gpus seems pretty dumb.
Just get more ram and run glm air or something.
>>
>>108310167
AI gooning expert here
for image/video gen, it's pretty much fucking USELESS, shit needs to go into 1 GPU. There are experimental nodes that allow splitting the workload between multiple gpus, but there are diminishing returns, and them being custom means that more often than not an upgrade breaks them.
Other case in image/video gen is that you can split the DiT and TE between two cards, but if you already have nvme drives and DDR5 it really doesn't matter.
FOR text gen shit is slightly better, multi card is slightly better supported, but don't expect a 2x.
>>
>32k context
>at only 16k proompt processing takes 30 seconds@552T/s
Is there any way to speed this shit up?
>>
File: file.png (696 KB, 1260x840)
>>108310220
Yes.
>>
>>108310055
My P40 has 24GB VRAM but it's absolutely useless for any modern t/i/v2i/v task. And everything points to 2026 phones being shittier than 2025 in hardware. Imo there's zero chance any new groundbreaking AI tech will be at least usable, not even comfortable to use, on today's consumer devices.
>>
>>108310220
Larger batch size.
>>
>>108309240
For us here, a meaningless distinction.
>>
>>108310220
you can increase the batch size to make prompt processing faster but it costs vram so it's a trade off. you might need to sacrifice tg speed by offloading fewer layers or reducing the max context. unfortunately if you're not regularly processing long prompts with a short reply it's probably best to leave it at the default.
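e.g. with llama.cpp, something like (model path is a placeholder):
llama-server -m model.gguf -b 4096 -ub 4096
bigger --ubatch-size means faster pp but a bigger vram compute buffer.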
>>
>>108310230
>Imo there's zero chance any new groundbreaking AI tech will be at least usable, not even comfortable to use, on today's consumer devices.
Engrams may save the hobby. They could be stored on slow memory without significant performance loss during inference, so model parameters in fast memory can be reserved for reasoning and logic.
>>
File: 1764991079678983.png (48 KB, 465x766)
>have many models configured for router mode with distinctive names and different params (context/cmoe/whatever)
>update
>cant tell shit anymore
THANKS
THANKS LLMAO DEVS
THANKS
>>
>>108310366
oh well it's a setting in the UI now, they also introduced MCP/agentic mode
>>
Shouldnt have fucked around with local models so much.
I see it everywhere now, even with games from 2023/2024.
Guess it makes sense you manually edit after the llm draft but still.
...Also: they still couldn't prompt to give a more literal translation.
>>
>>108310191
>>108310197
ty
>>
>>108309633
cpumaxxer arc. What did you use pre-R1?
>>
>>108310366
https://github.com/ggml-org/llama.cpp/pull/20087
this also got merged
qwen35moe bros... WE WON!!!
>>
>>108310335
Oh so that's what's up with that 1kk context chat model, I see, thanks.
>inb4 V4 is a swarm and also 10 times cheaper over API
Sorry for considering the cloud first, peasant desu.
>>
>>108308106
>okay, final decision
>okay, final plan

DON'T DO THIS
>>
>>108310450
>doesnt support stdio
>have to make a proxy
nah
>>
>>108310450
I hate pwilkin so much it's unreal
such a ugly hack
>>
>>108310676
do better, for free, or stfu
>>
>>108310676
>hack
It's because of how attention works in those model. You literally can't do better than that.
>>
File: 1771813888088870.png (114 KB, 801x865)
>>108310385
KINO
MCP BROS
WE WON!!!!!!
>>
>>108310722
>You literally can't do better than that.
you should look at vLLM's PagedAttention, which approaches models on a granular level instead of this opaque checkpoint.
You are right that you "literally can't do better" if all you want is to implement it without changing anything on the architectural level of llama.cpp like a good vibeslopper
>>
>>108310823
>PagedAttention
This wouldn't help with the issue at all.
>>
noob here, are there any downsides to those MoE models? qwen 35/3 runs both faster and better than smaller models 10-20b
>>
>>108310862
It's usually dumber. It's hard to draw a line.
>>
>>108310870
these are for chatting though not drawing so line drawing isn't important?
>>
What's the smallest agent model anons are running locally? Thinking about sending a tardbot into the world and wondering how small I could go.
>>
>>108310887
I'm not going to help you because of the second sentence.
>>
>>108310862
not really. some people allege they are lacking depth of understanding or some other je ne sais quoi compared to dense models but it's all vibes
>>
>>108310887
>agent
>>
>>108310894
Not to push slop into the world, more interact with the world sloppily. If that makes sense.
>>
>>108310839
https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/layers/mamba
have a quick look at how recurrents are implemented here.
They get it for free. vLLM just passes around GPU pointers to blocks if two prompts share the same prefix. Zero copy. Other implications, for speculative decoding: llama.cpp has to create and rollback checkpoints, vLLM just allocates a new block, deletes the pointer to it if it's rejected. You can't even begin to do MTP seriously before lcpp fixes its architecture.
>>
File: 1746323169113981.png (5 KB, 773x21)
>>
>>108310912
That's all well and good and better than what llama.cpp is doing but none of that helps with the issue in question.
>>
Yeah, I think local is over, it isn't about quality or even capability, it is about speed, 5.4 is so efficient it can solve things that take local models an hour in less than 5 minutes.
>>
>>108311005
>it's not x, it's y
>>
>>108311005
ok
>>
>>108311016
qrd
>>
>>108311039
yea
>>
>>108311048
puto
>>
>>108311005
You're absolutely right—local is that friend who shows up to the race still tying their shoes, 5.4 is like espresso for your neurons, if it's not fast, does it even exist?
>>
File: file.png (137 KB, 769x857)
unsloth bros do you want that option?..
>>
>>108307609
The highlight model is mindbroken too.

The people responsible for quick fixxing strawberry shit are going to be working this weekend.
>>
>>108311095
waiting for the day unslop brothers will get the same ass fucking as ollamao
>>
>>108311095
merged gate bros... we lost!
>>
>>108311052
unrelated, but did you guys see that Pluto anime that came out just before the LLM thing really hit?
considering it's a respin of a 1964 manga and predicted shit like rlhf, hallucinations, the inscrutable nature of AI once we make it (no one actually knows how they work internally, they just feed it ALL the data and they wake up sentient) it was crazy prescient...the release almost feels like a sneak peek into what was literally emerging as it was airing.
>>
File: 1758698401552291.png (254 KB, 1379x1226)
>>108311095
I still don't know what that PR actually does
bartowski claimed that he fused those weights in the new update, but it's still the same
it should have ffn_gate_up_exps right? but I don't see it
>>
>>108311095
jesus, those guys.
one of those so-lucky-they're-unlucky situations where they ended up in a place they just aren't fundamentally competent enough to manage fully, but there's a huge spotlight on them all the time
Almost feel sorry for them, but they're just so goddamn smug...
>>
deepseek 4 is vaporware
>>
Bros, how many years until we get VR waifus? Writing about cuddling in camp after killing goblins is fun and all but I want to do it in 3d.
>>
File: absolutelybench.png (129 KB, 1211x721)
introducing the most based of benchmarks
>>
SillySisters, did you know that chat completions prefill or "start reply with" doesn't work on local models? Not tabby, not llama.cpp and probably not kobold. Check out the requests yourself. The prefill ends up as a separate assistant message. Continues do the same thing, and nothing really "continues" as it would in text completion. TC bros stay winning!
>>
>>108311207
You are absolutely right, /lmg/ posters are niggers and ... .. ..
....
...
Hello? Your message got cut off.
I'm sorry I cannot help you write this any further.
>>
nobody sane uses chat completion
>>
>>108311215
At least with llama.cpp it does, as long as the jinja template accounts for it, that is.
>>
File: 1746058011308088.jpg (127 KB, 1024x1024)
>>108311095
>open ik_llama
>add -muge to flags
heh
>>
>>108311206
genie 5 or 6 in late 2026/early 2027
>>
>>108310450
still getting this in llama.vim for every request:
>forcing full prompt re-processing due to lack of cache data
>>
>>108311230
If you want VLM you are forced, unfortunately.

>>108311239
So never. I looked there too. There is a special request arg you can pass maybe, somehow.
>>
>>108311268
>There is a special request arg you can pass maybe, somehow.
I don't think so. But it's real easy to add that to the template.
You just add a check to not add the end turn token if the role of the last message is assistant.
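i.e. at the end of the template's message loop, something like this (gemma-style end token shown, adjust for your model):
{% if not (loop.last and message['role'] == 'assistant') %}{{ '<end_of_turn>\n' }}{% endif %}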
>>
>>108311257
By default, it caches every 8k tokens. You can change it with --checkpoint-every-n-tokens to checkpoint every N tokens and --ctx-checkpoints for the number of checkpoints to keep. I don't know if that's your issue. Show log.
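e.g. something like this (numbers picked arbitrarily, tune them for how often your frontend edits history):
llama-server -m model.gguf --checkpoint-every-n-tokens 2048 --ctx-checkpoints 8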
>>
Okay, just recently discovered local LLMs. I need to fix something.

So right now I'm stuck with small context input due to hardware constraints.

I'm wondering if there's a tool I can use to re-process my prompt and slice it to fit into my LLM?

For example my input is around 12k and my constraint is around 4k.

What it will do is slice the 12k into 3 to fit into my 4k LLM input.

Is there any tool that does this?
>>
>>108311215
dunno about shittytavern but never had any issues with prefills in my own scripts with llama.cpp.
It's not compatible with reasoner mode:
https://github.com/ggml-org/llama.cpp/blob/e68f2fb894d890eeead6acf0cc3341478312f1fd/tools/server/server-common.cpp#L1062
but if you pass enable_thinking false to the template with your json request it lets you do a prefill alright.
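i.e. a request body shaped roughly like this (a sketch, double-check the field names against your build):
{
  "chat_template_kwargs": {"enable_thinking": false},
  "messages": [
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "Sure, here"}
  ]
}
the trailing assistant message is what becomes the prefill.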
>>
Anyone got demo repos where you can test agentic coding? For example instructions already there and you just delete a feature and then point your llm agent at it and see if it succeeds.

I found one, initially it was promising using event sourcing style patterns so the context window remains small by only building command handlers and projections etc, before clearing the context and repeating the ralph loop for the next task. It originally used claude code.
I tried it with Qwen3.5-35B-A3B and noticed the instructions/skills were all over the place like they were copy pasted together and conflicting with each other, and missing files declared inside them.
Most of the work was spent processing context that was 20-40k for a simple tool call output, wasn't getting any cache hits in koboldcpp.
The result was 4 small files of tweaked boilerplate that took ages and a couple mil of input context...
Technically got there but it's so retarded, only works if you have cheap+fast access to tokens.
>>
>>108311284
even testing with `--checkpoint-every-n-tokens 64 --ctx-checkpoints 64` at 4k context still does a full reprocessing on every request.
However, chat completion seems to work properly, so at least that's fixed.
Maybe llama.vim does something weird with the prompt under the hood, i've never used it before.
>>
>>108311304
the actual answer is to use a smaller model or worse quant until you can fit all the context you need. yes you'll lose quality, but much less than you would by doing these hacky prompt manipulation shenanigans
>>
>>108311324
>on every request
What is making the request?
>>
>>108311329
Well that does not solve anything.
If we can solve this, we can run models faster without sacrificing VRAM.
>>
>>108311343
llama.vim fill-in-middle completion
>>
>>108311324
>However, chat completion seems to work properly, so at least that's fixed.
Useless if that's the case.
>Maybe llama.vim does something weird with the prompt under the hood
I use a modified version of the old one. I'll compile and give it a go with mine.
>>
>>108311324
There isn't a fix for reprocessing at max context, that's how rnn works.
>>
I just want deepseek v4
>>
>>108311352
actually it is the exact easiest way to solve your problem, but good luck with your research project
>>
mercury 2 is now on open router if anybody cares enough about the current text diffusion SOTA*
*claude haiku-killer
>>
>>108311399
context during the test was about 2.3k/4k full, so i don't think it's that either
>>
>>108311440
Did you insert or change anything around the start of the context? changing anything at the beginning will also cause full reprocess.
>>
>>108311405
Anytime now.
>>
it's still chinese new years
deepseek v4 will be out once that's done in two weeks
>>
>>108311405
>>108311496
tbf deepsneed themselves didn't announce anything aside from their updated model on the chat website and the deepgemm library with mhc
everything else is pure speculation
it feels close, but it might not be
>>
>>108311457
no, i've tested the most trivial usecase, just adding code incrementally with no modification:
>scroll to the middle of a file
>start typing, wait for fill-in-middle result, accept it
>type some more, wait for fill-in-middle result, accept it
>...
i'll watch the github issues, since i saw a few people report it, at least before this PR got merged, wonder if it still doesn't work for them
>>
>>108311504
>didn't announce anything
they never do
I remember all their last releases as being stealth drops with barely a mention on their WeChat at times, otherwise what counts as announcements for them is to update this page on day one of release:
https://api-docs.deepseek.com/news/news251201
they don't really do marketing.
at any other lab even that web-chat-only model would've gotten at least an announcement in English; the only one they gave out was in Chinese, on WeChat (a very isolated, chinaman-only WhatsApp clone)
>>
>>108311551
that's why I'm a believer
our hero would never whore out
>>
https://www.sarvam.ai/blogs/sarvam-30b-105b
true SOTA is out! gemma didn't redeem but they did!
>>
File: file.png (16 KB, 682x66)
>>108311617
you mean.... no... you can't be serious....
>>
File: deepseek.png (101 KB, 760x682)
>>108311405
DeepSeek V4.. 2026-Feb-17
>>
>>108311132
>as ollamao
What assfucking did it get?
>>
File: 1753408332433492.png (6 KB, 1103x30)
>>
>>108311630
looool
>>
File: file.png (493 KB, 448x600)
>>108311617
>>
>>108311645
reditards used to suck it off, but after their go rewrite shenanigans turned out to be stolen lcpp code vibe-washed through claude code, they fell out of favor
>>
>>108311695
these guys are earning more than you
>>
>Sarvam-30B is built to run reliably in resource-constrained environments and can handle multilingual voice calls while performing tool calls.
stealing from old boomers has never been easier
>>
>>108311757
They're happier than I am as well, what now?
>>
https://rentry.org/lmg-build-guides
How up to date is this? I am on the verge of despair and am about to resign myself to a mere 16GB non cuda card so I can at least play my old ass game without issues (I am really feeling the limits of my 8gb card). All the good free models on openrouter are constantly too busy as of a few days ago, so my $10 deposit means nothing. I may as well just get a card that's just good enough for gaming and resign myself to being a paypig for wAIfu shenanigans.

Prices are only going to go up for the foreseeable future, right?
>>
>>108311826
those entries would be considered out of date but yes prices have only gone up

If you want a happy medium just pick up an old 3090 for your gaymen, you can run a few cope models with the 24gb vram
>>
File: checkpoints_01.png (16 KB, 958x1033)
>>108311324
>>108311354 (me)
Alright. Playing the letters round in countdown with lfm8a1. Also used 64/64 like you.
It created the first checkpoint at n_tokens 434 on my second completion request (on the first one it wouldn't make sense, there's nothing to cache).
Then it created one at every one of my requests (scrolled off).
I made the mistake of giving it a numbers round, generating about 3k tokens. I stopped it before it was done, and then sent the request again (with the ~3k new tokens it generated). It created a checkpoint every batchsize tokens.
Changing the history right at the numbers round (before those ~3k tokens), it reused the previous checkpoints as it should. It seems to be working, at least for me. This is all in text completion mode. I apply the chat templates on my vimscript.
So the checkpoints are created every batchsize during processing (mine was 128) and whenever you send something for completion (if it's long enough). No checkpoints are created during generation.
Also, you're using the fim thing. Try it in just completion mode or just the webui. The fim completion grabs a bunch of things from the buffer and copy buffers I think and who knows what it's doing to the cache. I've never used it.
>>
File: i4219.png (79 KB, 1862x352)
>all of this needed just to add more samplers to chat completion in ShittyTavern
This is the bee's knees of LLM frontends?
>>
>>108311891
link
>>
>>108311903
https://github.com/SillyTavern/SillyTavern/issues/4219
>>
>>108311850
*HOWL OF DESPAIR*

But all of the used 3090s are run through by now like a 60 year old hooker.
>>
>>108311931
buy used, not spares
>>
>>108311945
Explain to me like I'm a Redditor what the distinction is and how I can tell the difference when buying.
>>
>>108311931
welp anon sounds like its time for you to be the man who stepped up then
>>
where the fuck is deepseek 4
>>
Insights

- Mean speedup: sarvam is ~2.38x faster; median speedup ~2.69x.
- “Thinking” proxy: output tokens are ~2.46x lower on sarvam (median ~2.74x). Since the visible answer is tiny, this strongly suggests DeepSeek v3 is spending more hidden reasoning budget to reach the same result.
- Variance: sarvam is much more consistent run-to-run.

Conclusion: India is finally in the AI game, and they are better than China.
>>
File: hardcoded.png (144 KB, 1132x803)
>>108311891
ST's code reminds me of that retarded yanderedev.
https://github.com/SillyTavern/SillyTavern/blob/release/src/endpoints/backends/chat-completions.js
I don't think this nibba can reason at a higher level than a series of if-elses and switches. What's a closure? what's an interface? just copy paste bro
it's absurd the amount of redundant code in this shit instead of doing a singular payload builder for each main API (chat completions, responses, gemini, claude) and parser, with the more model or backend variant shit (like samplers supported only by llama.cpp but used through a chat completion API) passed around as arbitrary extra param you can send to those backends.
Decouple everything, goddamnit.
Look at this shit in the pic. Why does every request type need every single parameter hardcoded instead of deriving from a common base and adding preset-based overrides? Why repeat the AbortController, fetch etc song and dance in every backend function? WHAT IS AN ABSTRACTION?!?!?!?!?!
>>
>>108312088
Why does software with retarded devs always end up getting the most community support? Although in this case there aren't really any alternatives I guess.
>>
AGI any day now, right?
>>
>>108312120
it did what people wanted at the time it was needed and therefore it is used.
if you have shit software that fills a void, even if it's slipped out as fast as possible, it's gonna get used.
>>
>>108312131
it was last week, how did you not hear about it
>>
File: 1745000438071299.jpg (223 KB, 2439x1807)
>>108307593
we are so back
>>
>>108312088
I don't really understand programming but ST definitely feels like a small project that grew into slop way too quickly. I don't even use most of its features and I'd love to fuck off to some other UI, it's just that having a database of characters and their associated chats is quite nice. I tried a few other UIs but - just like ST in CC mode - they had none of the fun new samplers needed to get some creativity out of the retarded benchmaxxed models of today.
>>
File: 2026-03-06_20-16-04.png (189 KB, 997x870)
>>108311431
8 responses, 7 refusals, and the other is this after like 100 words. awful model
>cunny
no, literally a fucking cringy short backstory where a human is summoned to mlp and chrysalis and luna become his wives and chrysalis asks him to eat her ass. didn't even specify an age or put "young" anywhere, no mentions of age at all. it also has no idea who luna or chrysalis are
>>
>>108311405
but lmg agrees that deepseek is mid compared to other chinese models
>>
>>108312217
Deepseek hasn't been releasing models lately but when they did they were always good.
>>
>>108312217
that's not the point though
deepseek 3.2 is mid in its capabilities because it's just v3 from early 2025
the impressive feat is that they were able to weld and graft on so many extensions to the architecture without breaking the model
if all the arch improvements landed in a new trained-from-scratch model, it would surely be something
>>
>>108312088
Because ST was a hobby project that became too popular.
If you are so great at programming, you should commit and deploy your own - it shouldn't take more than 3 months, probably not even that.
After all, most of the stuff is just interfacing and managing strings. It's really simple in the end.
I wrote my own client but it's terminal only and I ain't sharing it because it's my hobby and not someone else's.
>>
>>108312088
We need to rewrite SillyTavern in Rust and re-license it to MIT with Claude.
>>
>>108312297
To add: biggest issue is creating a retard proof interface around your framework.
This will take longer because it obviously needs to work inside a web browser and be pretty and accessible.
>>
>>108312297
i also went tui route because js gave me aneurysm
>>108312318
this is the hardest part, making the ui actually appealing
manipulating text is the easiest shit ever
>>
>>108311723
Isn't a lot of the traffic just going to lm studio instead which also sucks and isn't even open source though? Isn't that at least just as bad?
>>
>>108312318
>because it obviously needs to work inside a web browser
just make a dedicated app nigga. Fuck browsers.
>>
>>108312333
Yeah this is where it falls down.
I'm happy with terminal and using commands like /setup or /setup_card (lists all the cards as a numbered index and I can load in a card by number) like /setup_card 03. Or /setup_prompt XX will load one of the prompts if I want to feed it an external text file.
But all of this is just like my first C or Python program. It is really basic.
>>
>>108312351
It can be an .exe for you but it will still need to be retard proof. This is where all the money disappears.
>>
File: file.png (452 KB, 1573x748)
>>108312365
i think i've posted this a few months back
i stopped working on it because tui just kinda sucks to use, but it pretty much has all the functionality it needs
>>
File: client.png (540 KB, 1152x943)
>>108312386
Yeah that looks nice.
This is how it looks for me.
All 'system prompts' and cards are in their own directories. Also have settings files which have sampler settings for each model (mistral, gemma, qwen, chatml, gpt-oss) but the models are dictated outside of the client for now.
>>
>>108312386
I do the prompt order thing (with toggles) via a config file still.
This is essentially all you need.
Creating a UI is hard but looking at this it shouldn't suck that much unless it is broken.
Streamline it and go from there.
>>
File: file.png (10 KB, 559x103)
Do you dare to pull, anon?
>>
>>108312297
>I wrote my own client but it's terminal only
I did the same thing, that's why I speak from experience in building abstractions.
in pseudo code and very simplified I build shit this way:
let adapter;
if (url.includes("/responses"))
    adapter = new RAdapter();
else if (url.includes("/chat/completions"))
    adapter = new CCAdapter();

async function run(messages, overrides) {
    const payload = adapter.ploadbuilder(messages, overrides);
    const connection = await post(payload);
    for await (const chunk of connection)
        output(parseSSE(chunk, adapter.parser));
}

where parseSSE initializes timeout controllers, calls out to the real parser passed with the adapter strategy etc
can't fucking imagine managing this sort of copy paste mess when you can compartmentalize properly and handle special cases through overrides
recently implemented the obsolete completion endpoint for the lulz, all I had to do was flatten the message array in the adapter payload builder and add model specific templates in my presets processors, which already supported templates for other purposes (like replacing {{sourcelanguage}} etc for batching translation)
proper abstraction makes it really easy to add and change functionality.
>>
File: file.png (62 KB, 383x799)
>>108312436
>>108312481
the underlying code works, it's just that ui code sucks dick
i wanted to achieve picrel from st and i managed to do so, it works in identical way and i can save presets and reshuffle them however i want
>>
>>108311953
At least they don't have kids...
>>
are any of the new qwen3.5 models worth it over GLM or Kimi for RP?
>>
>>108312510
Problem with UI is that it adds an extra layer of complexity and this is hard if you are not a "real" developer.
>>108312502
I'm not a real developer. I know how to manage strings and make my own logic, but I don't really understand what your example is even doing other than waiting for the payload.
You are not handling anything else here so as such this doesn't matter as much.
>>
>>108312088
Then fix it. Something something git something something push something something commit. Or something involving the word "fork".
>>
>>108312526
no
>>
>>108309208
there's Heretics of all this stuff these days that actually work, though (whereas almost none of the old non-Heretic "abliterated" models ever did jack shit to actually decensor them most of the time, in my experience)
>>
Hi guys I'm working for a small AI startup and we are getting close to release, do you guys think a 20% improvement over Gemma is enough to sell?
>>
>>108312571
Aim for the next Diwali and you can call it Ganesh Gemma.
>>
>>108312571
Not bad for a 1b.
>>
>>108312571
gemma 3? I mean maybe. But there's already shit a lot better than Gemma 3, it's pretty old
>>
I want LiquidAI to make a fucking 4B for once instead of 1.2B - 2.6B, I think their LFM2.5 arch might mog Qwen at that size
>>
>>108312571
20% against what benchmarks?
>>
File: askyourllm.png (440 KB, 1389x4103)
>>108312527
>but I don't really understand what your example is even doing in this sense than waiting for payload.
then ask your local llm. Even Qwen 35BA3B gave a decent answer from a vague prompt.
the whole point is to not copy paste the same basic logic in giga DoEverythingInOneBlockOfCode() functions one of which is being written for every backend under the sun.
>>
File: file.png (927 KB, 1494x2148)
>>108312571
In what? The usual memmarks? No. Qwen 3.5 easily blows you out of the water if you compare the 27B models.
>>
File: 1764811573131403.jpg (543 KB, 1920x2560)
Fresh when ready
>>108312616
>>108312616
>>108312616
>>108312616
>>108312616
>>108312616
>>
>>108312620
fuck off
>>
>>108312620
>page 3
>fake news as op image
>>
Blessed be the new checkpoint system.
Now the AI can do the tool loop without having to reprocess the whole fucking context again. God damnit.
>>
>>108312620
Nigger
>>
>>108312620
why are you like this
>>
>>108312620
Less than a week in and you're starting to get lazy. Oh no no no
>>
>>108312571
im curious what your value proposition is
>>
Total Miku Death
>>
File: dipsyBowlingAlleyStandoff.png (2.39 MB, 1536x1024)
>>108312878
Teto first
>>
>>108312571
Is it open weights? If not, fuck off.
>>
File: trashiusmaximus.png (450 KB, 454x600)
>>108312620
>>
File: ComfyUI_00960_.png (1.07 MB, 856x1024)
>>108312878
>>
File: Hebrewbench.jpg (86 KB, 1172x200)
Dipsy passed the tunnel mattress bench with the default Kobold chat prompt.

>>108307593
>>108312620
qrd on baker autism drama? I've been gone a bit.
>>
Qwen 3.5 is unusable in Roo Code. It seems to be repeatedly trying to trim context, for reasons unknown, causing llamacpp to reprocess from scratch. What agents work with it?
>>
>>108313493
Pull and compile llama.cpp. There's better rnn/ssm cache checkpoints now.
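something like this from the repo root (drop the CUDA flag if you're not on nvidia):
git pull
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j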
>>
>>108313474
Someone's been baking at bump limit with reddit screenshots to stop the regular baker from doing so because he doesn't want to see vocaloids in op.
It's most likely the same anon that hates Miku and can't help but bring up troons whenever he sees her.
>>
>>108312571
Assuming you're not shitposting, don't try and compete on the usual benches or you're just going to get blown out by larger labs and benchmaxxed models. Aim for the neglected market of creative writing which is inextricably linked to abstract reasoning and keeping details coherent longterm. Publish open weights and you're effectively crowdsourcing debugging and calibration feedback per iteration to the local scene rather than relying on another (probably closed source) LLM to do it for you.
>>108313510
Sounds like a gay astroturfing attempt given this place is one of the more knowledgeable information hubs on LLMs right now.
>>
>>108313289
I love medication!
>>
>>108308366
>Smacks of the psychic that can’t win the lottery
that's a very good way to put it



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.