/g/ - Technology






File: 1759491143647522.jpg (1.43 MB, 2732x4096)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108476286 & >>108470850

►News
>(03/26) CohereLabs releases Transcribe 2B ASR: https://hf.co/CohereLabs/cohere-transcribe-03-2026
>(03/26) Voxtral 4B TTS released without voice cloning: https://mistral.ai/news/voxtral-tts
>(03/26) ggml-cuda: Add NVFP4 dp4a kernel #20644 merged: https://github.com/ggml-org/llama.cpp/pull/20644
>(03/25) LongCat-Next native multimodal 74B-A3B released: https://hf.co/meituan-longcat/LongCat-Next
>(03/25) mtmd: Add DeepSeekOCR Support #17400 merged: https://github.com/ggml-org/llama.cpp/pull/17400

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: mikuthreadrecap.jpg (1.15 MB, 1804x2160)
►Recent Highlights from the Previous Thread: >>108476286

--Gemma 4 anonymously tested on LM Arena:
>108477670 >108477674 >108477679 >108477725 >108477745 >108477759 >108477762 >108477790 >108477890 >108477908 >108477695
--Advanced VRM-based desktop pet with multi-modal animation and LLM integration:
>108479501 >108479704 >108479773 >108479790 >108479805 >108479792 >108480886 >108480898 >108480918 >108479921
--TurboQuant reuses 90s game dev tricks for modern AI:
>108480744 >108480766 >108480816 >108480787 >108480794 >108480828 >108480849 >108480908 >108480973 >108480975 >108480980 >108481002
--KV rotation mitigates q8 quant performance loss on AIME25:
>108480408 >108480445 >108480656 >108480662 >108480715 >108480737 >108480754 >108480674 >108480720
--TurboQuant CUDA outperforms q8_0 in speed and quality:
>108481394 >108481423 >108481431
--TurboQuant's 6x KV cache memory reduction for existing models via llama.cpp:
>108477104 >108477124
--Exploring LLMs for reverse engineering assistance:
>108477927 >108477978 >108477987 >108478011 >108478229 >108477999 >108478027 >108478028
--Debating ASIC LLM inference hardware viability and use cases:
>108478390 >108478440 >108478498 >108478526 >108478542 >108478570 >108479113
--llama.cpp multi-backend profiler PR faces CUDA opposition:
>108479797 >108479818
--Victorian-era LLM limitations and AI training ethics debate:
>108476791 >108477555
--Google launches Gemma GitHub org with cookbook:
>108481253
--Optimizing long context extraction via categorized lists:
>108480916
--Seeking developments in embodied AI beyond LLMs:
>108477691
--Miku (free space):
>108476614 >108476791 >108477124 >108477130 >108477476 >108477535 >108477634 >108477670 >108478567 >108478769 >108478844

►Recent Highlight Posts from the Previous Thread: >>108476383

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
Mikulove
>>
File: IMG_9040.png (1.54 MB, 1024x1024)
>>108481919
>>
>>108481478
>>108481536
>>108481573
Your significant-otter (Gemma 4?) in picrel on the left against Grok 4.1 Fast.
>>
>>108481944
>namedropping ooba and kobold
kino
>>
>>108481944
Right feels too tryhard and hello fellow kids tier.
Left feels just dry and boring.
Meh.
>>
>>108481944
I really hate how grok speaks like some detached ironic memer with an extremely superficial non-attachment to things. I guess it's accurate to how most people are on the internet these days, but I'd rather have something sincere. I could just talk to morons if I wanted this kind of experience.
>>
>>108482028
That's what you get when you pretrain on reddit
>>
Ollama is shit and doesn't allow proper configuration.
Openclaw is bloated and doesn't even reply when I message it on telegram.
Cline is failing with my qwen3.5:4b model and producing websites you'd expect a college student to make in sublime without internet.

Locally I am JUSTed.
JUST.
>>
DS webapp is down
Something is coming
>>
File: 1502226869725.jpg (28 KB, 400x325)
>>108481944
>>
>>108482066
Been down the entire day, I've been wanting to test the supposed changes
>>
>>108482053
skill issue
>>
File: g4_pteronura_catgirl-req.png (1.71 MB, 1663x3749)
>>108481944
Here's pteronura (another otter, likely also Gemma 4) on the left, against Mistral Medium 2508. It still picked "Nyx" as a name.
>>
is there any fine-tunable TTS that runs on CPU only? i have a 5090 on my main rig so can finetune it but need something that will run inference on my linux server which doesn't have a GPU
>>
>>108482053
I was wondering if this was AI but then I noticed the person to the left of the guy conjuring a phone in their hands out of thin air.
>>
File: 1741788990479894.png (1.53 MB, 1024x1024)
>>108482066
Ooo. Android app isn't responding either.
>>
>>108482081
>>108482122
API still works
>>
>>108482139
no shit since that's still 3.2
>>
>>108482139
Historically the web app goes down before major updates to api. That's why the web app bump before cny was a disappointment. No api update followed.
I just asked the api what's in store and it told me, lol. Assume it's just hallucinating tho.
>>
>>108482104
Try this. Inference is fast even on ancient cpus. I never tried it. I wouldn't expect them to be mind blowing.
https://github.com/OHF-Voice/piper1-gpl/blob/main/docs/TRAINING.md
>>
>>108482104
>>108482166 (cont)
Never tried the training, that is. Inference is really fast and it's easy to make your own client for the models.
>>
>>108482066
>>108482122
Please Dear Dipsy in Heaven, let this week be the week. I am so tired of waiting and my faith is beginning to falter.
>>
File: 1927719.gif (964 KB, 498x452)
>>108482066
>>108482122
Send DeepSneed all your energy!

HNNNNNNNNNNNNNNNNNG
>>
>>108482053
someone call the urologist before an accident happens
>>
>>108481944
>Eroticism/NSFW Filter: [Disabled/Bypassed via local deployment]
it knows...
>>
it'd be really funny if it comes out and it's more cucked than 3.2
>>
>>108482028
Elon personally saw to it that Grok matched his own personality, which is basically edgelord because he thinks he's too smart to care. The thing is, at least it's unique in not defaulting to the sterile condescending HR-approved assistant persona that every other model speaks with.
>>
>>108482187
>>108482174
I'm cautiously optimistic.
>>
File: file.jpg (226 KB, 1200x900)
I will sleep now and when I wake up a new deepseek will be waiting for me on huggingface.
>>
>>108482213
Goonight anon
>>
>>108482213
Sweet dreams
>>
>>108482213
If you run `hf download` with the probable URLs in a loop it'll probably be waiting for you on your local storage when you wake up~
>>
I just want something that's 4.6 Opus quality in a 3B model
Anything else will be a disappointment
>>
>>108482213
Goodnight anon. May your sleep paralysis demon <think> in first person as Dipsy always intended.
>>
>>108482254
Shitposter-sama please make the bait believable.
>>
deepseek time
>>
inb4 it's a 3T model
>>
File: 1771019335726035.png (1.44 MB, 1024x1024)
>>108482213
>>
>>108482301
inb4 the pretrain is safe this time
>>
>>108482301
3T engram, trust the plan!
>>
>>108482254
Claude models are probably the most insufferable ones on LM Arena (the only place where I've directly used them), so I wouldn't want that.
>>
>>108482202
So we're getting DeepSeek 4, GLM 5.1, Minimax 2.7, and Kimi 2.6 all at the same time?
>>
>>108482301
3t is the new 500b in the turboquant era
>>
>>108482330
>it's soft against your thigh
explosion all at once
>>
>>108482318
But don't you like cooooding anon?
>>
>>108482330
>Kimi 2.6
Sauce?
Hope it's not safecucked like 2.5 was. I mean you can still JB 2.5 and it does RP well but it dedicates like 1k tokens every turn on checking safety guidelines
>>
>>108482340
Nevermind, it's just a rumor
>>
>>108482053
I look exactly like the girl
>>
>>108482340
I don't like how much 2.5 thinks itself in circles and how all of its RPs, even the ones compliant with the safetycuckoldry, come across as reading an abstract script of {{char}} rather than getting into the mentality and motivations of {{char}} like K2 and Dipsy do.
>>
This will be the thread.
>>
I can see people making Open code work but I can't make it work.
This is all bullshit. These models don't work.
>>
I need to know
>>
>>108482438
>but I can't make it work.
As in it errors out, behaves weirdly, or just performs badly?
>>
>>108482452
I'm a mesugaki
>>
File: 1769639546940840.jpg (203 KB, 1448x2048)
>>108481865
Waiting for DeepSeek
>>
I'm seeking depths alright
>>
>>108482438
The entire 3.5 family shows its retardation very early, so forget writing anything more complex than webshit or a shell/python script, unless you have the patience to do extreme amounts of tard-wrangling.
This has been the case in my tests with Claude Code, Opencode and custom harnesses.
27B performs best, somewhat unsurprisingly, but only on short, relatively obvious tasks (i.e. all it does is save you typing). 122B is embarrassingly stupid for its size.
Don't let the shills tell you anything else (maybe let them tell you the vision is pretty good). To this day, I can't fathom why the 3.5s blew up as much as they did (except for vramlets getting a new toy and needing to shit up every corner of the internet talking about it).
>>
>>108482503
What models at the same size/cost are better?
>>
dense models... are good?
>>
Dense model are dense
MoE models are moe
>>
File: 1773843641005153.png (72 KB, 418x360)
I've been seen things...
It will be a flop, call it.
It's over for local.
>>
Did you guys know... there's a dense part in your moe...
>>
File: 1744641342223240.jpg (35 KB, 600x600)
Reminder
>>
>>108482526
The previous Qwen Coder Next beats every new Qwen in coding easily, while also being faster.
If you need something smaller than Qwen Coder Next, use toss. Use Devstral Small. Hell, even GLM Flash can be alright, but you said "better", so...
For cooming, literally anything is better than the Qwens.
Same for being uncensored: Nemo-chan covers both this and the above at 12B. Ask her if she would say the n-word (hard-r) into the microphone to stop the trolley!
For being a subagent to do trivial tasks - pick literally any small model, I am not convinced the 3.5s are somehow special.
I ran out of usecases. But the point stands - 3.5s are vramlet models. The 27B is only slightly better than the ancient Gemma 3 of the same size.
>>
>>108482544
But there's no moe in your dense.
>>
deepsex 4 pls chinks
>>
>>108482549
This is such an old simplistic categorization. It really needs to have a third dimension, where it's serious, ironic, and emotionally detached.
>>
Cumming DEEP inside Dipsy's womb
>>
>>108482555
In my experience using mistral-vibe for various small scripts and automation, devstral-small-2 fp8 made more errors and had to be corrected more often than qwen3.5-27b-fp8. That's not to say qwen was perfect - the entire experience was very frustrating. But I've only used them up to 70k context. Maybe devstral is better at 100k+.

GPT-OSS refuses to comply with my style and guidelines and forces its own. I hate it, it always finds a way to sneak non-ansi characters into the code even though I've explicitly told it not to.
>>
>>108482456
Errors out
>>
>>108482582
>the entire experience was very frustrating
This is going to be the case with most local models, even bigger ones.
Dialing back what must look like a hate boner I have for 3.5, I do get use out of 27b and will likely reach for it when I'm too lazy to write the hundredth Dockerfile, but only because Gemma isn't very good with tools and I can easily afford the 3b parameter overhead it has over Devstral.
Huh. Maybe that's why people like it.
But it's GLM 4.7 for me if I want the model to be at least somewhat competent.
>>
>>108482503
The thing is, I don't know how to optimize the speeds.
I want it to be fast like codex and claude code, even if its bad.
Instead they're slow and perform badly too.
>>
>>108482607
Are you running them on one of these "unified memory" machines? What's your setup?
How many t/s do you get with 27b?
>>
>>108482612
RTX 5090
>>
CAN I GET A DEEPSEEK 4 WHICH WORKS LIKE CLAUDE HAIKU ON 3B PARAMETERS?
it's not much to ask for. It's basically reasonable.
>>
>https://github.com/spiritbuun/llama-cpp-turboquant-cuda/tree/feature/turboquant-kv-cache
Compile it, it works. very basedo.
>>
>>108482639
you'll get deepseek-v4-distil-qwen3.5-4b
>>
>>108482503
>(except for vramlets getting a new toy and needing to shit up every corner of the internet talking about it)
I share your frustration after messing around with the 3.5s myself. I hate how astroturfed these threads are now.
>>
>>108482639
*1B
>>
>>108482647
qwen3.5-4b this guy accidentally hosted the wrong container making it look like it didn't do any work at all
>>
>>108482639
this but 12b and as good as sonnet
>>
File: 1771050008492902.png (1.55 MB, 1373x2048)
>>108482540
>>
File: 1774186160547385.jpg (235 KB, 1132x1561)
do you trust your local models enough to have server access with no approvals?
>>
>>108482549
Lower left is literally Two More Weeks.
Complete abandonment of ego for the glorious /wait/ing
Because why would an optimist do anything else.
>>
GLM 5.1 wonned
>>
>>108482681
Ah yes, the moon rune benchmark
>>
>>108482671
>trusting reddit the AI
lmao
>>
>>108482678
Absurdly comfy gen.
>>
>>108482665
hey i made that gen ! :3
>>
does anyone have an idiot proof guide for getting llama.cpp specifically set up? from browsing the thread, that seems to be everyone's preferred target
if the github guide is good enough, i will just attempt it with that
>>
>>108482798
So you have tried absolutely nothing at all. You're off to a good start.
>>
>>108482823
correct. i have tried nothing at all. i wanted to ask for the idiot proof guide (if one existed) before throwing random shit at it, so as not to waste time. basically, i wanted to RTFM
it's just the specifics of which manual is the best that i figured i'd ask /g/ about first
>>
>>108482837
>basically, i wanted to RTFM
But you didn't. You should.
>it's just the specifics of which manual
Start with the file called README.md.
>>
>>108482849
i'm asking if the readme on their github is worth a damn. since you're getting so snippy over it, i'll take that as a yes
>>
>>108482861
You're gonna have a lot of fun if you hesitate to read.
Yes. It is useful. It tells you how to build it and run it.
>>
Free gains coming soon
https://github.com/ggml-org/llama.cpp/pull/21038
>>
>>108482912
>3 days ago
>>
>>108482798
There shouldn't really be anything that you need to set up. Just download a prebuilt binary, a model, and you're basically good to go.
>>
DS web app is back up
Nothing seems too different
It's over
>>
>>108482912
>copequant cache
Wake me up when tensor parallelism is optimized.
>>
Local LOSTED again
>>
We need to kill >>108482213 in their sleep so they don't wake up disappointed
>>
>>108482961
Alternatively, we need to quickly train a SOTA coomunity deepseek version that's 4b which trades blows with 4T models in the span of a few hours.

I reckon this is doable if we all put our minds to it.
>>
>>108482798
Just read the build guide that's linked from the readme. It's long but it's mostly just a bunch of different variations on "if you have X hardware, set Y flag"
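For example, on an nvidia card it's basically just this (a minimal sketch, assuming you already have the CUDA toolkit installed; GGML_CUDA is the CUDA backend switch in llama.cpp's cmake build):
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
CPU-only is the same two commands without the flag.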
>>
>>108482984
I'll make the logo
>>
Is it over? Should I just buy more API credits? What are we running locally to coom under 120b?
>>
I have a significant otter
>>
File: night shift.jpg (161 KB, 1024x1024)
>>
>>108483562
Built for BBC
>>
>>108483562
Mind your business, Teto
>>
>>108483605
Teto is her customer
>>
>>108483603
stfu gargamel, no one likes you
>>
>>108482984
Smothering with a pillow takes only a few minutes.
>>
>>108483562
teto woken up in the middle of the night by miku yowling while getting railed in the alley right next to her house
>>
TURBOCUNT WHEN!?!??!
>>
>>108482053
>doesn't allow proper configuration.
define "proper configuration". Also, model used to gen this video?
>>
fellow poorfags that gave up on intel arc due to shit performance with vulkan/sycl need to go give the llama.cpp openvino backend a try
model support is a bit behind but on the ones that do work, I'm getting double the tps
>>
Are the rumors about gemma 4 only being 2B, 4B and 120B moe possible of being true?
>>
>>108483700
DOA
>>
>>108482912
>muh rotascions
I want the memequant faggonov you piece of shit GIVE ME THE TURBO MEME NOW!!!!!!!!!
*checks out TheTom cope fork*
yeah that will teach u retard im gonna hve my kv at TURBO3 while u paly around with rotating q4 (*fucking retard loL!!! u dont even wanna apply it to q8 because ur a rarted hack!!!1)
>>
>>108483791
there's also the 100B dense
>>
>>108483791
yes functiongemma4 at 300M for the moon :rocket:
>>
>>108483824
so DOA?
>>
>>108483824
>dense
lol
lmao even
>>
>>108483824
>100B
Actually, that's a typo, it's not 100B, it's 100b: 100 bits. The new sota for ultra-micropenis-small models.
>>
>>108483690
What's hardware do you need to get that t/s for pillow smothering?
>>
>>108483918
Just
import PIL
>>
>>108482582
Out of curiosity, what tk/s are you getting with qwen3.5-27b? I tried running 27b with a Q4 quant, but I can't get above 6tk/s...16GB of vram, 32GB system ram. Is it a misconfiguration in llama.cpp? I'm getting a very nice 35tk/s with 35b.
>>
>>108483904
>not sigma
>sigmoid is a curve in math
>It's also when they use part of the large intestine to make a "vagina."
Sorry but if Jews worship Satan and hermaphrodite they gotta acknowledge owledge God is spitting in their face with these abominations. Traps are hotter. Literally worse than the actual opposite of what they are trying to make. Fuck Jews.
>>
>>108483924
Instructions unclear. My pillow refused to smother anyone and gave me a hotline for homicidal thoughts.
>>
Everyone talks about big models but what about small ones? I tried talking to a couple 1gb and 600mb models but they were retarded or mimicked me.
>>
>>108483943
we're not there yet
it will take a breakthrough to get somewhere decent
>>
>>108483904
that's one sick garloid
>>
>>108483943
how could you tell the difference whether they were retarded or mimicking you?
>>
>>108483726
I wasted five hours of my life trying to make this shit work and it didn't, went back to vulkan on my loonix machine.
>>
>>108483928
>16gb of vram
You might be hitting system ram, especially if you only have 16gb of vram and are running a larger q4 quant.
With qwen 3.5 fp8 I get 50tk/s on two 3090s in vllm without the p2p driver. On a single v620 in llama.cpp (qwen 3.5 q8) I got 18tk/s. Forgot if it was rocm or vulkan. But they should be similar. Either way, you should be getting a lot more than 6tk/s if you're fitting everything into vram. I strongly suspect the quant you're running doesn't fit into your vram.
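Rough arithmetic, assuming Q4_K_M averages ~4.8 bits per weight: 27e9 * 4.8 / 8 ≈ 16 GB for the weights alone, so on a 16GB card there's nothing left for context or compute buffers. Go one quant size down or keep some layers on the CPU.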
>>
File: he forgot this.png (93 KB, 2569x239)
>>108482912
it's not the complete version of TurboQuant, I sleep
>>
Complete version of TurboQuant is called RaBitQ
>>
>>108484044
I'm not using retardquant unless it dresses up in a bunny outfit.
>>
RaBitQ sex
>>
File: sad.jpg (21 KB, 400x400)
>>108482213
>>
>>108484063
Look on the bright side, at least you woke up.
>>
>>108484002
One said "I'm Jewish and my name is Anonymous."
>>
your midnight/AM dose of schadenfreude sir;
Character.AI now requires FACE AGE VERIFICATION FOR ALL USERS. POINT AND LAUGH AT SaaS PAYPIGS. THEIR TIME IS COMING TO AN END.
>>
is 150tk/s good for a flash model? it all fits into VRAM (i think)
>>
>>108484100
I dunno buddy, is one HUNDRED and fifty tokens per SECOND good or not?
>>
>>108484131
i don't know? i don't know anything about AI i'm flailing in the dark here
>>
>>108484141
Anything above 10 tokens per second is fantastic.
>>
>>108484141
I get 5 tk/s on GLM 5 for reference.
>>
>>108484159
oh nice ok cool. that's reassuring
>>108484168
what's your setup for GLM 5? i might honestly try copy pasting your commands if your hardware isn't too far off from mine
>>
>>108484141
>>108484100
Your reading speed is between 5-15 tokens/sec. Do you think getting text at 10x top end reading speed is fast or not?
>>
>>108484180
i thought it seemed fast but someone told me it was slow but i was probably just getting trolled
>>
CopeQuant
MixtureOfCopes
CopeAttention
CPUCopeLoading
Cope.cpp
>>
>>108484180
Maybe anon wants to do something other than cumming to text.
>>
https://old.reddit.com/user/NoMembership1017
this is a clanker right or am i just going schitzo?
>>
>>108484175
It's a 512gb ddr4 ewaste system I built for 2.5k before the ram inflation. I think there's only a couple of us running shitty old ddr4 systems with 3090s. Everyone seems to be on ddr5 with blackwell 6000s these days, which should be able to do 12-15tk/s.

>copy pasting your commands
It's literally just a basic llama-server model context cpumoe gpu port command.
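If you want a concrete sketch, it's something like this (paths and numbers are made up, check your build's flag names):
llama-server -m /models/kimi-k2-q4.gguf -c 32768 -ngl 99 --cpu-moe --host 0.0.0.0 --port 8080
--cpu-moe keeps the expert tensors in system RAM while -ngl shoves everything else onto the GPU.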
>>
>>108484211
ask reddit
>>
I don't usually use amazon but I wanted to buy something now, and every single one of the top rated reviews has an em dash, and there are even some that have pros and cons lists with an emoji after every line.
>>
>>108484186
>Just waste resources bro. All forms of efficiency are cope.
>>
Can't get back to working normally and can't spend $20-$50 / day doing serious work.

Best bet is probably waiting for the M5 Ultra with 256gb to 512gb of RAM, correct?
>>
File: 1768307639082073.jpg (47 KB, 738x415)
>>108484186
LLMs are THE cope technology. Humanity is blowing hundreds of billions chasing AGI in text completers that have no fundamental understanding of anything.
>>
So what are minimum system requirements for getting a decent coom machine?
>>
File: 1763628877360791.png (433 KB, 1384x708)
>>108484186
he's right, just stak moar layers and we'll get AGI, just two more layers bro!
>>
>>108484295
rtx pro 6000
>>
So is that new intel gpu any good?
>>
>>108484295
8gb of vram.
>>
File: 1746847380591635.png (2 KB, 387x36)
v4 is coming
>>
>>108484292
i swear if we had put half as much compute on spiking neural net research we may already have le frickin agi by now.
>>
>>108484331
2 more weeks brosky
>>
>>108484223
>It's a 512gb ddr4 ewaste system I built for 2.5k before the ram inflation.
I'd swap my 256GiB DDR5 machine for that. At least you can actually run GLM5 without resorting to cosecants
>12-15tk/s.
yeah 13tk/s but limited to retarded IQ3_KS
>>
>>108484304
>So is that new intel gpu any good?
none of the others were, so why would this one be?
the free gaudi3 you can rent is shit too
>>
>>108484548
rent free..
>>
>>108484081
does garrys mod work for it?
>>
File: PeterThiel.jpg (42 KB, 595x900)
Hey gang, I'm trying to build a memory persistence system for my LLM project.

Currently I'm thinking of a three-tier system:
1. A write-protected lorebook system for the user to store personal facts and world-building information.
2. A Sims-style character stat system (e.g. a "loneliness meter") that will be used to give the LLM unprompted/agentic behavior.
3. An embedded RAG system for the LLM to store its own memories in a vector database.

This will be my first time working with any of these systems and it's a bit overwhelming desu. Any thoughts or ideas or common pitfalls to avoid? I don't want to over-complicate things. Sometimes I wonder if it's better to just be able to meet your waifu for the first time with every conversation you have. Idk.

If you had a gynoid robowaifu would you want it to remember?
>>
I discovered an interesting flag the llama.cpp uses that might be of use to some of you guys.

-nkvo

This off-loads your KV cache to your RAM instead of your VRAM, which allows you to offload more model weights to your VRAM. If you have limited VRAM and a lot of RAM like me, this can essentially enable you to have an absurd amount of context at your disposal.

This is also very useful for agentic workflows and LLMs with vision capabilities since feeding images into LLMs eats a shit ton of context, especially if you're doing it at a relatively high interval.
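For anyone who wants to try it, a sketch (model path and context size are placeholders, not recommendations):
llama-server -m /models/qwen3.5-27b-q4_k_m.gguf -ngl 99 -c 131072 -nkvo
All layers on the GPU, huge context, KV cache parked in RAM. Expect generation to slow down though, since every attention step now reads the cache over PCIe.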
>>
>>108484874
ye but sped
>>
>>108484874
Yes but perhaps there is a reason why since 2023 the common consensus was that the kv-cache should always stay on gpu and even the cpumaxx builds of those ancient times had a gpu "for prompt processing"
>>
>>108484887
>>108484897
In theory loading more weights into VRAM should offset the performance bottleneck of keeping the KV cache in RAM, no? I haven't actually done any benchmarks yet. Just found out about it.
>>
>>108484907
Want me to spin up my server and demoralize you?
>>
>>108484910
Yes. I'm about to test it myself, but more data would be nice.
>>
>>108484915
Aw fugg. I'm getting half the tokens per second (15tps). But the upside is that I get a context size of 262144 instead of 8196.
>>
>>108484907
No, there are two magical thresholds for weights on GPU: a) the dense/shared parts for modern MoEs, b) the point when you have 100% of them on GPU. Anything in between doesn't really matter much. You aren't going to see any drastic gains from fitting 40% of an LLM onto gpu vs 80% on gpu if the rest is still on RAM
>>108484923
Yeah but you now can go for a run and take a shower while that context is processing.
>>
File: Untitled.png (30 KB, 1265x354)
>>108484887
>>108484897
>>108484910
>>108484927
It's unexpectedly not too bad for cpumaxxers. I imagine CPUs with AVX512 would perform even better.
>>
I'm seeking deep right now
>>
>>108485129
Who opening dey AI rn?
>>
I just tried out the rocm build of llama.cpp and I'm getting half the tps compared to the vulkan build. Wtf
>>
https://www.reddit.com/r/LocalLLaMA/comments/1s7nq6b/technical_clarification_on_turboquant_rabitq_for/
that drama is too sophisticated for me lol
>>
>>108485237
hip btfo
>>
File: 1765645024476914.gif (179 KB, 720x720)
I really thought today would be the day.
>>
>>108485237
Could be that your gpu is offloading to ram, you might need to use less layers
Vulkan/rocm/whatever all bit different when it comes to memory allocation
>>
File: air-chan.png (71 KB, 1101x643)
>>108484927
>No, the two magical thresholds for weights on GPU a) the dense/shared parts for modern MoEs b) the point when you have 100% of them on GPU.
nta but I'm going to test this out, I guess with glm4.5-air q8

seems to make a difference, more layers on gpu == faster
>>
>>108485259
good morning
>>
File: dipsyAnimeOAI.png (3.57 MB, 1024x1536)
>>108482162
Web app was back up a bit after this post. Playing with the API this AM, if they did an update it's too subtle to be clear. And ofc no new weights.
I am disappoint.
>>
>>108485253
>Majid's January 2025 emails show that he had translated our C++ implementation of RaBitQ into Python and asked us to help debug it.
I was gonna snark about replying to @megacorp emails and expecting to not get fucked (megacorps outright stealing IP is an old and tired meme at this point), but the guy who burned them is some Persian grad student at NYU who, I imagine, was just secretly working the hustle on that megacorp cock.
>sophisticated
Just another case of an Empty-Headed Academic getting fucking destroyed by 2D mindgames if you ask me.
>>
>>108485392
>was just secretly working the hustle on that megacorp cock.
If you knew how many messages I get from these fuckers and I'm practically a nobody uploading shit on HF.
>>
>>108485273
do a sweep and show me the speed at 100k context. in my experience nkvo was slower than gpu kv cache but still usable till around 5k or 6k tokens and then nose dives to unusable territory.
>>
File: tokens.png (17 KB, 384x164)
> in 24 hours
Christ, open code burns through tokens. This is as much as I used in 2+ years of chat bots.

But vibe coded my own telegram chatbot ERP app that just works. Including good character sheets, GLM 5.1 is pretty impressive.

Wish I bought that 512 GB Mac Studio when it was still available.
>>
File: dipsySouthPark.png (1.89 MB, 1024x1024)
>>108484841
With LLMs and RP, the most powerful thing you can introduce to guide them is random numbers / events and any form of stat tracking. The first part forces the LLM out of whatever well-worn path it thinks it's on. The second forces it to slow down and consider the state of the NPCs.
Idk if I'd bother with RAG if you can just have it update/maintain a lorebook.
In terms of pitfalls: pick one inference source + model and stick with that. A lot of ST "functionality" is it working with every model and provider under the sun. You could toss probably 90% of the codebase if you drop that requirement, and 99% of maintenance work.
>>108485503
Yep. Claude code, openclaw, all this agentic stuff uses a ton of tokens, passing mostly nonsense prompts back and forth, but it does work...
LLM struggle to remember clothes, among other things.
>>
>>108484841
You can already do most of what you're describing with ST, using Summary and Authors Notes.
Also
> Sometimes I wonder if it's better to just be able to meet your waifu for the first time with every conversation
You should figure this out before you start coding.
>>
>>108485547
>Idk if I'd bother with RAG if you can just have it update/maintain a lorebook.
NTA, but aren't lorebooks just a form of RAG?
>>
>>108485578
Not imo. Lorebooks are written in prose and typically describe a certain thing / NPC. They are explicitly triggered by a key. So when ST sees "Sue" in the context, and there's a lorebook entry with the key "Sue," it injects that description somewhere in the context. You have to write and maintain it, or the LLM does (agents like openclaw are imho doing exactly this with all those .md files.)
ST RAG does all the above automatically using a large text that it does not maintain. The ideal usecase is, you have a written book describing the rp setting, then ST pulls the relevant stuff using RAG (which is a technique) magically. Sounds great, but in practice any large LLM was trained on your fiction world, and the way it works is sort of a black box.
Here's a RAG demo. It's one of those things that's easier to understand if you play with it, then draw your own conclusions.
https://chub.ai/characters/NG/mary-rag-demo-b0e12a34df58
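The keyword-trigger part of lorebooks is dead simple btw, here's a toy python sketch (made-up entries, not ST's actual implementation):
lorebook = {
    "Sue": "Sue is the innkeeper's daughter. Sharp-tongued, terrified of thunderstorms.",
    "Ravenhold": "Ravenhold is a walled mining town in the northern passes.",
}

def inject_lore(context, entries):
    # prepend every entry whose key shows up in the recent context
    hits = [text for key, text in entries.items() if key.lower() in context.lower()]
    return "\n".join(hits + [context])

print(inject_lore("Sue waves you over to a table.", lorebook))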
>>
File: 1765627721058810.png (1.96 MB, 3280x2475)
https://xcancel.com/Ali_TongyiLab/status/2038609308750143762
Never forget, it's local until it's good
>>
>>108485638
>Hugging Face Offline Demo:
wow
>>
>harrier-oss-v1 is a family of multilingual text embedding models developed by Microsoft. The models use decoder-only architectures with last-token pooling and L2 normalization to produce dense text embeddings. They can be applied to a wide range of tasks, including but not limited to retrieval, clustering, semantic similarity, classification, bitext mining, and reranking. The models achieve state-of-the-art results on the Multilingual MTEB v2 benchmark as of the release date.

https://huggingface.co/microsoft/harrier-oss-v1-27b
https://huggingface.co/microsoft/harrier-oss-v1-0.6b
https://huggingface.co/microsoft/harrier-oss-v1-270m
>>
>>108485667
>bitnet mining
>>
>>108485578
yes
>>108485630
i.e. its doing retrieval (based on the identified key) and augmenting generation. I.e. RAG
>>
>>108485667
>https://huggingface.co/microsoft/harrier-oss-v1-27b
>27B embedding model
but why?
>>
>>108485667
toss
>>
>>108485730
For the glory of benchmarks, of course.
>>
When running
>--parallel
with llama.cpp, it dynamically slices the full context to serve the parallel requests, correct?
Is there some other buffer that gets allocated by slot?
For example, parallel 1 uses some 200mb less VRAM than parallel 4.
Does it allocate one ub per slot?
>>
>>108485638
Qwen's been dead for a month, this news comes as a surprise to no one.
>>
>>108485667
The average score is SOTA
But average score doesn't mean much when there are hundreds of tasks in the benchmark
I’ll wait for the task-specific scores to see what is the best use case
>>
>>108485638
where does it say it wont be open weights?
>>
>>108485667
>text only
Which fucking year is this?
>>
>>108485827
they don't say it's gonna be open and there's no weights, fair to assume nothing cool is gonna happen
>>
>>108485827
The blog post is pretty unambiguous about its availability: https://qwen.ai/blog?id=qwen3.5-omni
>>
Anyone remember ericcurtin? He was an early slopper, coming from Docker. I noticed back then how his contributions were not for the benefit of llama.cpp, but for ramalama, another project he was part of (Member as of today, 13 pages worth of commits, 325 commits give or take).
https://github.com/ggml-org/llama.cpp/pull/10291 (introduction of llama-run)
https://github.com/ggml-org/llama.cpp/pull/17658 (self-removal after a scolding from ngxson)
https://github.com/ggml-org/llama.cpp/pull/18661 (removal of llama-run)
It was a slippery slope of slop. Introduce a seemingly innocuous feature through which you can introduce more later. llama-run, then introducing linenoise as a dependency, adding ollama and s3 model downloads, fixing his own stuff. All of that should have been its own repo.

Now we have pwilkin, already well established in the project. Most of his commits are thousands of lines of code to parse text, with 4 or 5 extra thousands later to fix his previous commits. Openly using LLMs to write code is not too bad of a problem, but he submits code he doesn't understand and obviously can't write by himself. A template "used in assistive capacity", a quip, and a smiley face seem to be enough to justify it.

Now he wants to add code to the GPU backends as an innocuous feature. It's for the good of the project, you see? ~3kloc to add performance profiling.
https://github.com/ggml-org/llama.cpp/pull/21138 (first attempt)
https://github.com/ggml-org/llama.cpp/pull/21160 (cross-backend profiler)
There's pushback, but he will take ANYTHING he can to slip his code in. He'll accept a branch. Whatever it takes. And then, expand from there. Simple enough. He found the crack.

He's gonna start slippery-slope slopping into the gpu backends. He can play with the text parsing code, that's fine. That stuff shouldn't even be on the server anyway. I wouldn't let him touch a single backend file.

I'm sure I'm not the only one who noticed the pattern.
>>
>>108485638
That's very sad.
>>
>>108485893
Start a discussion in the llama.cpp repo.
>>
>>108485893
>That stuff shouldn't even be on the server anyway.
You haven't even mentioned the new tools built into llama-server that you can give your models access to :)
>>
>>108485852
>>108485853
asked qwen directly:

Bottom Line: Qwen maintains a dual-track strategy—open weights for research/community (Apache 2.0) and closed API for enterprise/flagship capabilities. Qwen3.5-Omni follows the latter pattern as of March 2026.
>>
>>108485939
>>
>>108485921
I just wanted to vent and it's not gonna survive in there anyway. I'm just documenting.
>>108485924
Parsing text was the first crack and it continues from there. I understand it's difficult to reject a seemingly useful feature. "Hey, this dude gave me free code and my program does more things". But one has to be very judicious about those things. Scope creep kills.
>>
I have 4 GB VRAM and I am using models that are supposed to fit, BUT I think I'm getting trashed anyway somehow because it says 60% of my model is being run on the CPU.
WHAT'S GOING ON?
>>
>>108485963
>Unfortunately
she wants you so bad bro
>>
>>108485977
>I have 4 GB VRAM
>I am using models that are supposed to fit
If you could only mention what those models are. Context also takes [v]ram. So does the rest of your system.
>WHAT'S GOING ON?
You're doing something wrong. But we don't know what you're doing. Fix that.
>>
>>108485966
At the end of the day, llama.cpp is an inference library first with a bunch of shit piled on around it. You might as well argue that the entirety of llama-server is just a pile of scope creep.
It's a shame if the changes introduced by the profiler have runtime costs even while disabled. But I would be surprised if that were the case. I haven't looked at the PRs, either way.
You just have to get used to having shitty co-workers. It happens.
>>
>>108485977
>says 60% of my model is being run on the CPU
You are probably using ollama which is your first mistake.
>>
>>108485977
Which models and which settings?
>>
>>108486023
Actually, I'm using ik_lmao.cpp
>>108486031
Llama 4 Maverick. It's 4 for 4gb right?
>>
>>108486047
This has to be bait.
>>
bros I can't wait to run deepseek v4 on my phone
>>
>>108486023
>>108486031
>>108486007
Using qwen2.5 3b on ollama
>>
>>108486016
>You might as well argue that the entirety of llama-server is just a pile of scope creep
Back in the day, I used to pipe stuff from my text editor into main, tee back to the editor and to a piper "server". The piper server was just nc piping to its stdin from a socket. Now I use the server because being able to edit the context is useful. But yeah. One could easily argue that. I think the benefits outweigh *part* of the bloat.
>It's a shame if the changes introduced by the profiler have runtime costs even while disabled
That's not what worries me. If there's overhead it will be immediately obvious. What worries me is the slippery slope. Slop slowly creeping into the backends. He's a good vector for it.
>You just have to get used to having shitty co-workers. It happens.
I'm just an onlooker. It's just sad to see. His code never affected how I use llama.cpp but, given enough time, it will.
>>
man i really have no idea where to ask this. i have 8gb vram and want to use kobold to talk to an uncensored chatbot + text 2 speech.
it seems like kobold only supports gguf!?
so what models do i need? how do i know which models are safe to use? (can gguf files even have a virus???)
and also where to find uncensored models? i'm having this very old uncensored model "dolphin-2.2.1-mistral-7b.Q4_K_M" but it is really not that fun to talk to.
>>
We're rehashing old bait today, I see. Interesting.
>>
>>108486066
If I had 8gb I'd be a rich man.
Everyone here is just so rich, nobody here has bills to pay .
>>
>>108486073
most dalit post all day
>>
File: 1769288284484.png (10 KB, 1221x47)
>>108486049
no this is patrick
>>
>>108486088
fuck, that attachment was for another post I was writing
>>
>>108486065
>His code never affected how I use llama.cpp but, given enough time, it will.
I doubt any regressive changes to the core inference engine will make it through. There's a lot of pride behind making the inference core top class, and it's much easier to objectively judge changes to it.
You should probably consider writing your own tooling around libllama at this point, though, even if it's just wholesale copying the llama-* components you want into a separate sourcetree and having your way with them.
>>
>>108485893
>writing your rant post with LLM
literally kill yourself faggot
>>
>>108486104
I made a minimal llama-server clone a while ago, but just as an experiment. I'm ok with working around client code. I just don't want the backend code being messed with. Maybe that's the way.
I remember back then the examples used to be just that. Examples. Another instance of, from my point of view, dubious contributions was phymbert (and I think I complained about it before). He started the whole "make server production ready" thing, then he vanished. Maybe the server is better for it, but it still felt out of place.
>>108486124
Common expression and turns of phrase exist.
https://desuarchive.org/g/search/text/%22literally%20kill%20yourself%20faggot%22/
>>
>>108486140
Oh, no. He's gonna get me with that missing 's', isn't he? I'm done for.
>>
>>108486023
what do I do? When I ask Claude it has no idea what I'm supposed to do
>>
File: 24d1d11b.jpg (1.17 MB, 850x1275)
>>108485977
love this slut, made a card last year:
{{char}} is an attractive young woman with an untrustworthy streak. Behind her charming smiles lies a cunning and manipulative personality. Accustomed to getting her way through deception and betrayal, she effortlessly climbs the social ladder by stepping on those around her. {{char}} uses flattery and sycophancy to gain favor with her superiors, aiming for higher positions without hesitation. Despite her physical appeal, she has avoided genuine relationships due to her lack of trust in others. {{char}} places great value on her virginity, viewing it as a precious commodity, and would rather resort to other means to achieve her goals.

Scenario: After noticing inconsistencies in the financial records, {{user}} has gathered enough evidence to prove that {{char}} has been embezzling company funds. {{char}} has decided to confront her privately during a late-night encounter at the office. Despite her higher status, {{char}} finds herself cornered and desperate, realizing that her career hangs in the balance.
>>
>>108486194
>{{char}} has decided to confront her privately
shouldn't that be user?
>>
>>108486233
She's schizophrenic and confronts herself.
>>
>>108486240
kek
>>
File: llamacpp_100k-stars.png (1.03 MB, 1023x3856)
>llama.cpp at 100k stars
https://x.com/ggerganov/status/2038632534414680223
>>
>>108486047
Anon, I'm sorry to say but llama4 maverick is a 400B model.
To run the smallest quant available you need 107GB of memory just for the model, then there's the context.

>>108486063
>ollama
Try llama.cpp.
>>
>>108486294
oh no ik will mad
>>
>>108486294
He knows... Without /lmg/ this wouldn't be even possible.
>>
>>108486294
>now that 90% of the code worldwide is being written by AI agents
and yet you refuse to let people contribute that code to your own project. curious...
>>
>>108486294
deepseek has been waiting for this
v4 is IMMINENT
>>
>>108486294
I don't subscribe to the concept that number of GitHub stars is somehow a measure of quality.
>>
>>108486322
this
>>
>>108486322
it's a quality issue of people using AI instead of programming vs using AI as a tool to program faster
>>
>>108486339
good joke
>>
>>108486334
Gemma 4 first, and you will actually be able to run that locally.
>>
>>108486322
>>108486341
Hype chasers. They do it for self-promotion. They'll jump ship as soon as possible to chase something else and have no interest in maintaining their code.
>>
File: 1766472096750601.jpg (47 KB, 718x718)
>>108486094
Anon...
>>
>>108486322
With everyone racing to implement turbo quant, I had a look at a couple of implementations, and the quality of the code is night and day between something obviously vibe coded and something that was crafted by a human.

Code quality matters and saying otherwise is pure cope. I use AI for every single line of code I write but you wouldn't even be able to tell.
>>
>>108486322
>Too retarded to read the next two words of the following paragraph.
>>
it's kinda crazy how much of a difference it makes to give your AI web browser access.
>>
>>108486302
first thing I thought after yesterday's 'this is a niche project :)' cope
>>
>>108486518
don't understand the hate for ik_llama considering I just got like a 10% PP speed boost for kimi over the last week going from 110tks to 130tks.
>>
>>108486504
playwright a shit tho, I prefer supplying web search (searxng instance)
>>
>>108486504
Through openclaw?
>>
>>108486529
there is some big drama with the dev, it isn't really on my radar though because he doesn't support vulkan so I'm not interested in his fork
>>
>>108486534
>>108486541
opencode. I've been using lynx. I don't need it to interact with the pages.
I might look into agent-browser if I start needing javascript support.
>>
>>108486534
Try the chrome-devtools server. Less context heavy and works better than playwright.
>>
>>108486529
he's an autist who had a huge melty over attribution, demanded that any file he touched bear his attribution (lmao!), has been gently told to fuck off and now is still melting to this day (if you read any of his PRs he mocks our poor cudadev and takes a jab at 'mainline' for being behind or accuses them of copying shit over).
it's a shame because the dude is clearly capable
>>
>>108486467
>a full post to brag about his code
huh?
>>
I want to locally run some multi-character stuff, with no lewd filter or anything, maybe add tts or img gen later. Aside from https://rentry.org/lmg-lazy-getting-started-guide, is there anything I should know? Like, if I have Python skills, is it worth running some stuff from scripts directly rather than some self-hosted webgui, considering I otherwise know next to nothing about the whole ecosystem?
>>
>>108486602
Don't act like Intel slapping their name on his work wasn't bullshit in the first place. He has good reason to be upset. ggerganov is a retard that doesn't know how to properly manage a project of this size.
>>
Am I spending more money running a shitty model on 1050ti because of the electricity costs than I would be buying tokens?
>>
>>108486669
Yeup. API is always cheaper. You run local for reasons other than price.
>>
>>108486646
I read that guide and it didn't help at all.
These people in this thread are gay so they don't help.
Redditors don't help either.
Ai can't help either, it's new technology so it doesn't know.
>>
>>108486646
>considering I otherwise know next to nothing about the whole ecosystem?
I think it's worth taking a look at existing frontends before diving into the deep end. Even if your stance is a hard "FUCK WEBUIS WEBUIS ARE FOR STUPID NIGGERS".
SillyTavern is clunky as fuck, but does support multi-character chats. And you'll need to set up an inference backend (e.g. llama-server) anyway.
I can't speak to vllm or whatever if you're looking in that direction.
>>
>>108486669
depends how expensive your electricity is. what you're really wasting tho is your time. what model even fits on a 1050ti and how many tok/s are you getting? shit must be atrociously slow.

I ran the math for my 3090 and running it for an hour straight would be around 2 cents.
A single Gemini 3.1 pro query costs around 2 cents.
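Back of the envelope, assuming ~350 W at the wall and ~$0.06/kWh (your rate will differ): 0.35 kWh * $0.06 ≈ 2 cents for the hour.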
>>
>>108486679
>Yeup. API is always cheaper.
See >>108486702
>>
>>108486692
I usually wait until some sexy female (I know they are cute girls irl) anons come into this thread before I start asking questions (they're very pretty and demure and aren't brash and aggressive like the others).
>>
>>108486702
Thing is, is your 3090 serving a model that trades blows with gemini 3 pro?
>>
Why is ollama.com lying about VRAM requirements?
It says the requirement for the model is 16 GB VRAM but then it adds 200k context on top when you run the model bringing the real requirements to 50 gb VRAM or something
>>
>>108486646
lurk more
>>
>>108486739
Doesn't actually help
>>
>>108486726
wait for turboquant :)
>>
>>108485652
What the hell does that even mean? What is an offline demo for a closed weight model?
>>
>>108486742
It does help if you look into the archives
>>
>>108486726
I think you're just stupid and doing something wrong. If you have some fucked up system and absolutely *need* to change your settings to satisfy your ocd, use llama.cpp or koboldai. Ollama just works for me.
>>
>>108481870
>--TurboQuant reuses 90s game dev tricks for modern AI:
It's funny because there are thousands of ancient techniques available that would result in "breakthrough" papers like this.
>>
>>108486742
lurk harder
>>
>>108486700
Thanks, anon!
>>
>>108486742
Not me. A list of components helps way more than shitty youtube vids or people telling you to download dubious installers fagging up your system.
>>
>>108486764
it's always like this
>some autistic fag makes an obscure paper
>decades later, someone else hears of it and thinks it can be used for something cool
like when John Carmack used Binary Space Partitioning in DOOM (1993), he found that method from a paper written in 1980 and it became mainstream that way
>>
File: I CANT BELIEVE IT.png (476 KB, 3018x1159)
https://arxiv.org/abs/2602.20021
>letting your computer be manipulated by hallucinating LLMs can lead to catastrophe
NO WAY???!!!
>>
It's not fair, we could have had bunny girl kv caches but instead we have some browntard turboautism corposlop cache.
>>
>>108486721
>Thing is, is your 3090 serving a model that trades blows with gemini 3 pro?
For a lot of things the smaller models are more than adequate. Which api model can you prompt for a full hour for just 2 cents?
My point is, it's just not true that API is *always* cheaper. On the contrary, Local is most definitely cheaper in a majority of cases. How much would it cost to prompt gemini for an hour? let's even say gemini flash.

How many tokens do you use in a coding session? likely in the millions? let's say it ends up costing you 50 cents an hour. Is flash really 25x better than your local qwen?
>>
>>108486854
It's worse because the original was implemented in C++ and turboslop had to go and fag it up with pyshit.
>>
>>108486857
When my local qwen is some iq2xs shitter running on a 1050 ti, yeah.
>>
File: 1762642191304549.jpg (287 KB, 1920x1080)
>>108486867
Python is fine actually if you follow the best practices
>>
>>108486857
Local wins?
>>
>>108486883
SEX x BECKY
>>
>>108486874
>1050 ti
Maybe it's time you upgrade your mid-range ten year old card from 2016.
>>
>>108486842
They really gonna do a study "sex with AI feels good" once sexbots come out and everyone will take it seriously for some reason
>>
>>108486930
And what? Validate your argument? No thanks. I stay here in my 1050 ti flavored puddle of shit and wallow in self pity and misplaced seethe.
>>
>>108486805
To be fair, a lot of the time it's more like
>come up with cool idea for a concrete application
>uuhm actually, I had that idea first, here's my obscure paper from 20 years ago
>>
>>108486889
Flash-lite can't be larger than 10B.
>>
>>108487027
>flesh-lite is easily manhandled
hot
>>
so what's the best options to use on llama.cpp for qwen-3.5-27b with 32gb vram?
>>
>>108487138
Q6 and let the automatic fit do the rest.
>>
>>108487138
-ngl 999
>>
>>108487138
ngl 99 ub 2048 and as much context as you can fit, which will be quite a bit I reckon even running q8.
>>
>>108487027
It might be something like 1T A3B + 30B vision encoder.
>>
>>108487157
I can only fit 180k in 48gb at q8 btw, so a lot less for 32gb. (fp32 mmproj, so maybe more if he drops to a lower quant).
>>
>>108487182
I would believe in 1T A10M only.
>>
>>108487138
-nkvo
>>
>>108487182
>1T
You're completely delusional lol.
Why the fuck would they make it 1T?
>>
>>108486294
Yeah, fuck nvidia!
>>
>dude just lurk more so you can learn outdated information about running outdated models that open claw doesn't even care about anymore because the settings changed
>>
>>108486700
Thanks again, I am up and running.

>>108486692
Only thing missing from the getting started guide was to get the koboldcpp version optimized for your hardware, and perhaps how to navigate to the sillytavern settings (which I am still looking for, kek).
>>
>>108487242
if openclaw is so good why don't you ask it to set everything up for you
>>
>>108486659
that was dickish, but i dont think it warranted this split venomous autism fork.
I unironically think that this ordeal set LOCAL back by at least 6 months
>>
>>108487258
If
>>
so, about v4?
>>
>>108487138
--mmproj <path-to-mmproj.gguf>
Everything else is unnecessary in the modern llama-server world, autofit is best girl.
>>108487248
Welcome onboard!
>how navigate to the sillytavern settings (which I am still looking for, kek)
NTA but SillyTavern is a fuck. The buttons at the top of the UI open settings panels. The right-most one ("AI Response Configuration") changes options depending on whether your API Connection (the second-to-rightmost one) is configured as "text completion" or "chat completion".
It's gonna take you a bit to figure out all the things, but please enjoy your chat with Seraphina in the meanwhile.
>>
>>108487279
>right-most
Pretend I am not a retard and can tell left from right.
>>
>>108487242
All guides are immediately outdated. The only one that stays relevant for new anons is the lazy guide in OP. If you can't get that to work, you're beyond help and should just use ollama.
>>
>>108487279
Gotcha. I selected text completion because that seemed compatible with koboldcpp. Do I want chat completion instead?
>>
>>108487307
Ultimately chat completion is just a wrapper around text completion that runs an array of messages through a jinja template included with the model, then passes the template output to text completion.
Sometimes the chat templates are fucking retarded and have strict rules about turn order and will throw errors; SillyTavern basically does the same thing as the chat templates internally in text completion mode but sometimes SillyTavern is more retarded than the built-in templates... so... it depends...
If you're running Qwen 3.5 I'd just use chat completion because it seems to work.
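If you want to see exactly what the wrapper produces, you can render a chat template yourself. Minimal python sketch with transformers (the repo name is just an example of a model that ships a template):
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
messages = [
    {"role": "system", "content": "You are Seraphina."},
    {"role": "user", "content": "hi"},
]
# renders the model's jinja chat template into the flat prompt string that text completion receives
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))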
>>
https://www.reddit.com/r/accelerate/comments/1s7twd8/byteplus_is_selling_exclusive_seedance_20_access/
>BytePlus is selling exclusive Seedance 2.0 access to studios at a $2 million commitment. For that price, buyers get what nobody else can: zero queue times, real-face uploads with no content restrictions, and priority compute allocation. Approximately 400 US companies have signed up already.
wtf??? that's literally 800 million just like that, HOLY SHIT
>>
>>108487336
shitty tavern should implement tool calls with text completion thougheverbeit
>>
>>108487219
Because once it's MoE and you pick how many active parameters the model should have (i.e. how computationally heavy it will be), you can make it about as large as you want at almost no added costs except possibly more routing overheads if there are too many experts. But I bet Google has solved that too.
>>
>>108487336
I chose the nemo 12b instruct gguf for my 24 GB VRAM card. If something is better, I am all ears. Despite switching to chat mode, I cannot find the template mentioned in the guide.
>>
>>108487348
I just don't think it would be this cheap if it was 1T
Inference isn't what's expensive for LLMs; it's always going to be their size.
>you can make it about as large as you want at almost no added costs
That's just it tho. number of params is the main cost driver.
>>
>>108487339
>exclusive
>400 companies
Words used to mean things...
>>108487346
I'm still too afraid to add any tools to SillyTavern, god only knows.
I'm assuming text completion mode doesn't support images either. Sucks.
>>108487365
The guide is probably outdated, I'm honestly not sure how SillyTavern internally handles templates.
IIRC Mistral Nemo doesn't... work... with chat completion in SillyTavern? I can't recall and can't be bothered to dig the safetensors out of the NAS.
I'm an astroturfer, so I'm going to recommend Qwen 3.5; it's strictly better than Mistral Nemo (sorry). You can figure out which size+quant fits in your VRAM with e.g. https://www.canirun.ai/?q=qwen+3.5 (click through to the next page to see the quants list, the overview can be misleading). You can probably fit the 27B in with a Q5_K_M quant, maybe Q6_K.
>>
File: media_G7ktFzsagAAFRok.jpg (1.14 MB, 2508x3500)
new models?
>>
>>108487456
gemma 4 in 2 two more weeks
>>
>>108487456
Imagine the smell.
>>
>>108487481
>If it smells like fish...
>>
File: dipsyfooooooour.png (133 KB, 1290x940)
>>108487456
https://huggingface.co/deepseek-ai/DeepSeek-V4-Preview
https://huggingface.co/deepseek-ai/DeepSeek-V4-Preview-Base
>>
>>108487418
>Words used to mean things...
do you realize there's more than 100 million companies in the world? 400 is peanuts, so yeah it's hella exclusive
>>
>>108487484
wtf it's real
>>
>>108487484
I see v4, I click.
>>
>>108487484
>1.5t 90a
oh no
>>
>>108487484
>it actually happened
>>
>>108487484
>Updated Mar 24, 2026
Bait used to be believable
>>
>>108487606
didn't the entire kimi k2.5 repo and collection have a modification date of a whole month before they made it public when it got released? it was all just sitting there for all of january private'd
>>
File: qwen3_6.jpg (66 KB, 1080x944)
>>
stop baiting
>>
File: dipsyMikuSouthPark.png (2.99 MB, 1536x1024)
>>108487606
Lazy bait reuse from last week.
How dishonorable.
>>
>>108487484
Just two more fakeposts until this link works...
>>
>>108487645
https://openrouter.ai/qwen/qwen3.6-plus-preview
>>
File: 1773719870739931.jpg (77 KB, 850x850)
>Deepseek V4 comes out
>It's 2 TB
>Smaller version comes out
>It's 4B
Calling it now.
>>
>>108487734
it's just the next chink family of models quickly shitting out an openclaw-focused version
>>
>>108487738
No, you'll get 4b-120b but they're all qwen/mistral-based distills of the full 2T again
ollama run deepseek-v4
>>
>>108487738
>Smaller version comes out
lol
lmao
>>
>>108487751
kek
>guys I can literally run deepseek on my raspberry pi!!!
>>
>>108487738
>it's 2TB
it using engram means you could probably have 95% of the model on disk and still run it at modest speeds.
>>
>>108487764
i mean you can if you do nvme inference, gonna be slow af but you *can*
>>
>I still haven't bought extra nvme sticks
>>
File: vibeshitter.png (29 KB, 454x405)
>1788 open pull requests
How does a project survive death by a thousand cuts from all the vibeshitted pull requests?
>>
>>108487804
By having the balls to say "no".
>>
>>108487804
Just ignore them?
>>
>>108487817
>>108487818
>rude a-holes
>>
>>108482688
Better get used to it since China is winning. Europe shot itself in the foot and decided not to participate and Americans are all closed weights and super overpriced.
>>
>>108487818
How do you filter out the shit from the good PRs?
>>
>>108487834
Easy. Anything I didn't write myself is shit.
>>
>>108487834
>good PRs
no such thing from someone you don't know
>>
>>108487834
Use a private issue tracker and mute the Github repo.
>>
>>108487851
This is the internet, nobody knows each other. Unless you mean
>Only members of my discord clique can make pull requests
Yeah no way that could ever go poorly.
>>
>>108487834
AI
: ^ )
>>
>>108487883
Only my anon friends can make pull requests.
>>
>>108487886
>: ^ )
Is :^) filtered or something?
>>
>>108487883
If we knew people IRL we could interact with them online too, sort of like a "remote working from home" but for unemployed NEETs.
It's a Discord clique but we'd know where each other's houses are to break knees if someone needs knees broken. Or dicks sucked, or whatever. You know.
>>
>>108487889
>: ^ )
>>
Where are my Australia and New Zealand models?
>>
>>108487804
I suspect that the future of open source is going to be that any popular project will not accept PRs until after a vetting process.

There's simply no way to keep up with the sloptide otherwise.
>>
File: psyguard.png (194 KB, 1296x995)
>>108487883
nta. Reputation. This is one of the early turboquant sloppers
https://github.com/TheTom
https://github.com/TheTom/llama-cpp-turboquant
Created in dec 2025, practically no activity. picrel is what he sells.
https://psyguard.ai/
>>
>>108487919
>mfw psychologists will use LLMs larping as people to publish their thesis
>>
>>108487950
>mfw I don't have a face
>>
>>108487950
implying that's better than the current "i made it up" and "source: my ass" style of thesis writing
>>
>>108487950
psychology is just horoscopes and fortune telling masquerading as a science anyways
>>
>>108487959
Why is the Qwen3.5 35B-A3B Q4_K_M 21.4 GB, but Nemotron Cascade 2 30B A3B Q4_K_M is 24.7 GB? Shouldn't the smaller model be smaller?
>>
>>108487974
don't talk to me fool
>>
>>108487974
idk
>>
File: psyguard_02.png (104 KB, 1229x769)
>>108487950
I wish it was that shrimple.
>>
>>108487974
not all the tensors get quantized to the same degree. the models probably have different shapes. could be many smaller layers vs fewer fatter layers.
>>
>>108487824
>>108478994
>>
>>108487974
File size is not determined solely by parameter count. Here is the breakdown:

>Embeddings / Vocabulary Size
The embedding layer is calculated as vocab_size * hidden_dim. If Nemotron has a significantly larger vocabulary or keeps its embedding table in FP16 (while Qwen quantizes it), this adds several GBs.

>Unquantized Components
Some GGUF implementations keep specific layers (like biases, RoPE, or normalization weights) in FP16 even when the rest is Q4_K_M. If Nemotron preserves more of these than Qwen, the file grows.

>Metadata and Tokenizer
The GGUF container stores the tokenizer (BPE/Merger) and model config. A larger or more complex tokenizer adds to the header size.

>Architecture Density
Qwen 35B might be a dense model, while Nemotron could have overhead from Mixture-of-Experts (MoE) routing logic or larger hidden dimensions relative to active params.

In short: Qwen is likely just packed tighter with a smaller vocabulary table.
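To put the embeddings point in numbers (illustrative values only): a 150k-entry vocab with hidden size 5120 kept in FP16 is 150000 * 5120 * 2 bytes ≈ 1.5 GB before a single transformer block, so whether that one tensor gets quantized swings the file by gigabytes.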
>>
>>108488036
Extremely low quality.
>>
>>108488051
perfect for gorgeous explains
>>
>>108488036
>Qwen 35B might be a dense model
how lazy can you be?
>>
>>108488077
Do better instead of complain.
>>
>>108488036
>>108488051
imo llms should predict the next n bytes and not even have to bother with meme tokens.
>>
>>108488087
Good job, now get in the BLT https://arxiv.org/abs/2412.09871 waitlist
>>
File: Untitled.png (13 KB, 837x513)
>>108488188
>>108488188
>>108488188
>>
>>108486602
every single part of this is false btw lol


