/g/ - Technology






File: 1759491143647522.jpg (1.43 MB, 2732x4096)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108476286 & >>108470850

►News
>(03/26) CohereLabs releases Transcribe 2B ASR: https://hf.co/CohereLabs/cohere-transcribe-03-2026
>(03/26) Voxtral 4B TTS released without voice cloning: https://mistral.ai/news/voxtral-tts
>(03/26) ggml-cuda: Add NVFP4 dp4a kernel #20644 merged: https://github.com/ggml-org/llama.cpp/pull/20644
>(03/25) LongCat-Next native multimodal 74B-A3B released: https://hf.co/meituan-longcat/LongCat-Next
>(03/25) mtmd: Add DeepSeekOCR Support #17400 merged: https://github.com/ggml-org/llama.cpp/pull/17400

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: mikuthreadrecap.jpg (1.15 MB, 1804x2160)
►Recent Highlights from the Previous Thread: >>108476286

--Gemma 4 anonymously tested on LM Arena:
>108477670 >108477674 >108477679 >108477725 >108477745 >108477759 >108477762 >108477790 >108477890 >108477908 >108477695
--Advanced VRM-based desktop pet with multi-modal animation and LLM integration:
>108479501 >108479704 >108479773 >108479790 >108479805 >108479792 >108480886 >108480898 >108480918 >108479921
--TurboQuant reuses 90s game dev tricks for modern AI:
>108480744 >108480766 >108480816 >108480787 >108480794 >108480828 >108480849 >108480908 >108480973 >108480975 >108480980 >108481002
--KV rotation mitigates q8 quant performance loss on AIME25:
>108480408 >108480445 >108480656 >108480662 >108480715 >108480737 >108480754 >108480674 >108480720
--TurboQuant CUDA outperforms q8_0 in speed and quality:
>108481394 >108481423 >108481431
--TurboQuant's 6x KV cache memory reduction for existing models via llama.cpp:
>108477104 >108477124
--Exploring LLMs for reverse engineering assistance:
>108477927 >108477978 >108477987 >108478011 >108478229 >108477999 >108478027 >108478028
--Debating ASIC LLM inference hardware viability and use cases:
>108478390 >108478440 >108478498 >108478526 >108478542 >108478570 >108479113
--llama.cpp multi-backend profiler PR faces CUDA opposition:
>108479797 >108479818
--Victorian-era LLM limitations and AI training ethics debate:
>108476791 >108477555
--Google launches Gemma GitHub org with cookbook:
>108481253
--Optimizing long context extraction via categorized lists:
>108480916
--Seeking developments in embodied AI beyond LLMs:
>108477691
--Miku (free space):
>108476614 >108476791 >108477124 >108477130 >108477476 >108477535 >108477634 >108477670 >108478567 >108478769 >108478844

►Recent Highlight Posts from the Previous Thread: >>108476383

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
Mikulove
>>
File: IMG_9040.png (1.54 MB, 1024x1024)
>>108481919
>>
>>108481478
>>108481536
>>108481573
Your significant-otter (Gemma 4?) in picrel on the left against Grok 4.1 Fast.
>>
>>108481944
>namedropping ooba and kobold
kino
>>
>>108481944
Right feels too tryhard and hello fellow kids tier.
Left feels just dry and boring.
Meh.
>>
>>108481944
I really hate how grok speaks like some detached ironic memer with an extremely superficial non-attachment to things. I guess it's accurate to how most people are on the internet these days, but I'd rather have something sincere. I could just talk to morons if I wanted this kind of experience.
>>
>>108482028
That's what you get when you pretrain on reddit
>>
Ollama is shit and doesn't allow proper configuration.
Openclaw is bloated and doesn't even reply when I message it on telegram.
Cline is failing with my qwen3.5:4b model and producing websites you'd expect a college student to make in sublime without internet.

Locally I am JUSTed.
JUST.
>>
DS webapp is down
Something is coming
>>
File: 1502226869725.jpg (28 KB, 400x325)
>>108481944
>>
>>108482066
Been down the entire day, I've been wanting to test the supposed changes
>>
>>108482053
skill issue
>>
File: g4_pteronura_catgirl-req.png (1.71 MB, 1663x3749)
>>108481944
Here's pteronura (another otter, likely also Gemma 4) on the left, against Mistral Medium 2508. It still picked "Nyx" as a name.
>>
is there any fine-tunable TTS that runs on CPU only? i have a 5090 on my main rig so can finetune it but need something that will run inference on my linux server which doesn't have a GPU
>>
>>108482053
I was wondering if this was AI but then I noticed the person to the left of the guy conjuring a phone in their hands out of thin air.
>>
File: 1741788990479894.png (1.53 MB, 1024x1024)
>>108482066
Ooo. Android app isn't responding either.
>>
>>108482081
>>108482122
API still works
>>
>>108482139
no shit since that's still 3.2
>>
>>108482139
Historically the web app goes down before major updates to api. That's why the web app bump before cny was a disappointment. No api update followed.
I just asked the api what's in store and it told me, lol. Assume it's just hallucinating tho.
>>
>>108482104
Try this. Inference is fast even on ancient cpus. I never tried it. I wouldn't expect them to be mind blowing.
https://github.com/OHF-Voice/piper1-gpl/blob/main/docs/TRAINING.md
>>
>>108482104
>>108482166 (cont)
Never tried the training, that is. Inference is really fast and it's easy to make your own client for the models.
>>
>>108482066
>>108482122
Please Dear Dipsy in Heaven, let this week be the week. I am so tired of waiting and my faith is beginning to falter.
>>
File: 1927719.gif (964 KB, 498x452)
>>108482066
>>108482122
Send DeepSneed all your energy!

HNNNNNNNNNNNNNNNNNG
>>
>>108482053
someone call the urologist before an accident happens
>>
>>108481944
>Eroticism/NSFW Filter: [Disabled/Bypassed via local deployment]
it knows...
>>
it'd be really funny if it comes out and it's more cucked than 3.2
>>
>>108482028
Elon personally saw to it that Grok matched his own personality, which is basically edgelord because he thinks he's too smart to care. The thing is, at least it's unique in not defaulting to the sterile condescending HR-approved assistant persona that every other model speaks with.
>>
>>108482187
>>108482174
I'm cautiously optimistic.
>>
File: file.jpg (226 KB, 1200x900)
I will sleep now and when I wake up a new deepseek will be waiting for me on huggingface.
>>
>>108482213
Goonight anon
>>
>>108482213
Sweet dreams
>>
>>108482213
If you run `hf download` with the probable URLs in a loop it'll probably be waiting for you on your local storage when you wake up~
>>
I just want something that's 4.6 Opus quality in a 3B model
Anything else will be a disappointment
>>
>>108482213
Goodnight anon. May your sleep paralysis demon <think> in first person as Dipsy always intended.
>>
>>108482254
Shitposter-sama please make the bait believable.
>>
deepseek time
>>
inb4 it's a 3T model
>>
File: 1771019335726035.png (1.44 MB, 1024x1024)
>>108482213
>>
>>108482301
inb4 the pretrain is safe this time
>>
>>108482301
3T engram, trust the plan!
>>
>>108482254
Claude models are probably the most insufferable ones on LM Arena (the only place where I've directly used them), so I wouldn't want that.
>>
>>108482202
So we're getting DeepSeek 4, GLM 5.1, Minimax 2.7, and Kimi 2.6 all at the same time?
>>
>>108482301
3t is the new 500b in the turboquant era
>>
>>108482330
>it's soft against your thigh
explosion all at once
>>
>>108482318
But don't you like cooooding anon?
>>
>>108482330
>Kimi 2.6
Sauce?
Hope it's not safecucked like 2.5 was. I mean you can still JB 2.5 and it does RP well but it dedicates like 1k tokens every turn on checking safety guidelines
>>
>>108482340
Nevermind, it's just a rumor
>>
>>108482053
I look exactly like the girl
>>
>>108482340
I don't like how much 2.5 thinks itself in circles and how all of its RPs, even the ones compliant with the safetycuckoldry, come across as reading an abstract script of {{char}} rather than getting into the mentality and motivations of {{char}} like K2 and Dipsy do.
>>
This will be the thread.
>>
I can see people making Open code work but I can't make it work.
This is all bullshit. These models don't work.
>>
I need to know
>>
>>108482438
>but I can't make it work.
As in it errors out, behaves weirdly, or just performs badly?
>>
>>108482452
I'm a mesugaki
>>
File: 1769639546940840.jpg (203 KB, 1448x2048)
>>108481865
Waiting for DeepSeek
>>
I'm seeking depths alright
>>
>>108482438
The entire 3.5 family shows its retardation very early, so forget writing anything more complex than webshit or a shell/python script, unless you have the patience to do extreme amounts of tard-wrangling.
This has been the case in my tests with Claude Code, Opencode and custom harnesses.
27B performs best, somewhat unsurprisingly, but only on short, relatively obvious tasks (i.e. all it does is save you typing). 122B is embarrassingly stupid for its size.
Don't let the shills tell you anything else (maybe let them tell you the vision is pretty good). To this day, I can't fathom why the 3.5s blew up as much as they did (except for vramlets getting a new toy and needing to shit up every corner of the internet talking about it).
>>
>>108482503
What models at the same size/cost are better?
>>
dense models... are good?
>>
Dense model are dense
MoE models are moe
>>
File: 1773843641005153.png (72 KB, 418x360)
I've been seen things...
It will be a flop, call it.
It's over for local.
>>
Did you guys know... there's a dense part in your moe...
>>
File: 1744641342223240.jpg (35 KB, 600x600)
Reminder
>>
>>108482526
The previous Qwen Coder Next beats every new Qwen in coding easily, while also being faster.
If you need something smaller than Qwen Coder Next, use toss. Use Devstral Small. Hell, even GLM Flash can be alright, but you said "better", so...
For cooming, literally anything is better than the Qwens.
Same for being uncensored: Nemo-chan covers both this and the above at 12B. Ask her if she would say the n-word (hard-r) into the microphone to stop the trolley!
For being a subagent to do trivial tasks - pick literally any small model, I am not convinced the 3.5s are somehow special.
I ran out of usecases. But the point stands - 3.5s are vramlet models. The 27B is only slightly better than the ancient Gemma 3 of the same size.
>>
>>108482544
But there's no moe in your dense.
>>
deepsex 4 pls chinks
>>
>>108482549
This is such an old simplistic categorization. It really needs to have a third dimension, where it's serious, ironic, and emotionally detached.
>>
Cumming DEEP inside Dipsy's womb
>>
>>108482555
In my experience using mistral-vibe for various small scripts and automation, devstral-small-2 fp8 made more errors and had to be corrected more often than qwen3.5-27b-fp8. That's not to say qwen was perfect - the entire experience was very frustrating. But I've only used them up to 70k context. Maybe devstral is better at 100k+.

GPT-OSS refuses to comply with my style and guidelines and forces its own. I hate it, it always finds a way to sneak non-ansi characters into the code even though I've explicitly told it not to.
>>
>>108482456
Errors out
>>
>>108482582
>the entire experience was very frustrating
This is going to be the case with most local models, even bigger ones.
Dialing back what must look like a hate boner I have for 3.5, I do get use out of 27b and will likely reach for it when I'm too lazy to write the hundredth Dockerfile, but only because Gemma isn't very good with tools and I can easily afford the 3b parameter overhead it has over Devstral.
Huh. Maybe that's why people like it.
But it's GLM 4.7 for me if I want the model to be at least somewhat competent.
>>
>>108482503
The thing is, I don't know how to optimize the speeds.
I want it to be fast like codex and claude code, even if its bad.
Instead they're slow and perform badly too.
>>
>>108482607
Are you running them on one of these "unified memory" machines? What's your setup?
How many t/s do you get with 27b?
>>
>>108482612
RTX 5090
>>
CAN I GET A DEEPSEEK 4 WHICH WORKS LIKE CLAUDE HAIKU ON 3B PARAMETERS?
it's not much to ask for. It's basically reasonable.
>>
>https://github.com/spiritbuun/llama-cpp-turboquant-cuda/tree/feature/turboquant-kv-cache
Compile it, it works. very basedo.
>>
>>108482639
you'll get deepseek-v4-distil-qwen3.5-4b
>>
>>108482503
>(except for vramlets getting a new toy and needing to shit up every corner of the internet talking about it)
I share your frustration after messing around with the 3.5s myself. I hate how astroturfed these threads are now.
>>
>>108482639
*1B
>>
>>108482647
qwen3.5-4b this guy accidentally hosted the wrong container making it look like it didn't do any work at all
>>
>>108482639
this but 12b and as good as sonnet
>>
File: 1771050008492902.png (1.55 MB, 1373x2048)
>>108482540
>>
File: 1774186160547385.jpg (235 KB, 1132x1561)
do you trust your local models enough to have server access with no approvals?
>>
>>108482549
Lower left is literally Two More Weeks.
Complete abandonment of ego for the glorious /wait/ing
Because why would an optimist do anything else.
>>
GLM 5.1 wonned
>>
>>108482681
Ah yes, the moon rune benchmark
>>
>>108482671
>trusting reddit the AI
lmao
>>
>>108482678
Absurdly comfy gen.
>>
>>108482665
hey i made that gen ! :3
>>
does anyone have an idiot proof guide for getting llama.cpp specifically set up? from browsing the thread, that seems to be everyone's preferred target
if the github guide is good enough, i will just attempt it with that
>>
>>108482798
So you have tried absolutely nothing at all. You're off to a good start.
>>
>>108482823
correct. i have tried nothing at all. i wanted to ask for the idiot proof guide (if one existed) before throwing random shit at it, so as not to waste time. basically, i wanted to RTFM
it's just the specifics of which manual is the best that i figured i'd ask /g/ about first
>>
>>108482837
>basically, i wanted to RTFM
But you didn't. You should.
>it's just the specifics of which manual
Start with the file called README.md.
>>
>>108482849
i'm asking if the readme on their github is worth a damn. since you're getting so snippy over it, i'll take that as a yes
>>
>>108482861
You're gonna have a lot of fun if you hesitate to read.
Yes. It is useful. It tells you how to build it and run it.
>>
Free gains coming soon
https://github.com/ggml-org/llama.cpp/pull/21038
>>
>>108482912
>3 days ago
>>
>>108482798
There shouldn't really be anything that you need to set up. Just download a prebuilt binary, a model, and you're basically good to go.
>>
DS web app is back up
Nothing seems too different
It's over
>>
>>108482912
>copequant cache
Wake me up when tensor parallelism is optimized.
>>
Local LOSTED again
>>
We need to kill >>108482213 in their sleep so they don't wake up disappointed
>>
>>108482961
Alternatively, we need to quickly train a SOTA coomunity deepseek version that's 4b which trades blows with 4T models in the span of a few hours.

I reckon this is doable if we all put our minds to it.
>>
>>108482798
Just read the build guide that's linked from the readme. It's long but it's mostly just a bunch of different variations on "if you have X hardware, set Y flag"
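For example, on an nvidia card it's basically just this (a minimal sketch, assuming you already have the CUDA toolkit installed; GGML_CUDA is the CUDA backend switch in llama.cpp's cmake build):
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
CPU-only is the same two commands without the flag.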
>>
>>108482984
I'll make the logo
>>
Is it over? Should I just buy more API credits? What are we running locally to coom under 120b?
>>
I have a significant otter
>>
File: night shift.jpg (161 KB, 1024x1024)
>>
>>108483562
Built for BBC
>>
>>108483562
Mind your business, Teto
>>
>>108483605
Teto is her customer
>>
>>108483603
stfu gargamel, no one likes you
>>
>>108482984
Smothering with a pillow takes only a few minutes.
>>
>>108483562
teto woken up in the middle of the night by miku yowling while getting railed in the alley right next to her house
>>
TURBOCUNT WHEN!?!??!
>>
>>108482053
>doesn't allow proper configuration.
define "proper configuration". Also, model used to gen this video?
>>
fellow poorfags that gave up on intel arc due to shit performance with vulkan/sycl need to go give the llama.cpp openvino backend a try
model support is a bit behind but on the ones that do work, I'm getting double the tps
>>
Are the rumors about gemma 4 only being 2B, 4B and 120B moe possible of being true?
>>
>>108483700
DOA
>>
>>108482912
>muh rotascions
I want the memequant faggonov you piece of shit GIVE ME THE TURBO MEME NOW!!!!!!!!!
*checks out TheTom cope fork*
yeah that will teach u retard im gonna hve my kv at TURBO3 while u paly around with rotating q4 (*fucking retard loL!!! u dont even wanna apply it to q8 because ur a rarted hack!!!1)
>>
>>108483791
there's also the 100B dense
>>
>>108483791
yes functiongemma4 at 300M for the moon :rocket:
>>
>>108483824
so DOA?
>>
>>108483824
>dense
lol
lmao even
>>
>>108483824
>100B
Actually, that's a typo, it's not 100B, it's 100b: 100 bits. The new sota for ultra-micropenis-small models.
>>
>>108483690
What's hardware do you need to get that t/s for pillow smothering?
>>
>>108483918
Just
import PIL
>>
>>108482582
Out of curiosity, what tk/s are you getting with qwen3.5-27b? I tried running 27b with a Q4 quant, but I can't get above 6tk/s...16GB of vram, 32GB system ram. Is it a misconfiguration in llama.cpp? I'm getting a very nice 35tk/s with 35b.
>>
>>108483904
>not sigma
>sigmoid is a curve in math
>It's also when they use part of the large intestine to make a "vagina."
Sorry but if Jews worship Satan and hermaphrodite they gotta acknowledge owledge God is spitting in their face with these abominations. Traps are hotter. Literally worse than the actual opposite of what they are trying to make. Fuck Jews.
>>
>>108483924
Instructions unclear. My pillow refused to smother anyone and gave me a hotline for homicidal thoughts.
>>
Everyone talks about big models but what about small ones? I tried talking to a couple 1gb and 600mb models but they were retarded or mimicked me.
>>
>>108483943
we're not there yet
it will take a breakthrough to get somewhere decent
>>
>>108483904
that's one sick garloid
>>
>>108483943
how could you tell the difference whether they were retarded or mimicking you?
>>
>>108483726
I wasted five hours of my life trying to make this shit work and it didn't, went back to vulkan on my loonix machine.
>>
>>108483928
>16gb of vram
You might be hitting system ram, especially if you only have 16gb of vram and are running a larger q4 quant.
With qwen 3.5 fp8 I get 50tk/s on two 3090s in vllm without the p2p driver. On a single v620 in llama.cpp (qwen 3.5 q8) I got 18tk/s. Forgot if it was rocm or vulkan. But they should be similar. Either way, you should be getting a lot more than 6tk/s if you're fitting everything into vram. I strongly suspect the quant you're running doesn't fit into your vram.
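Rough arithmetic, assuming Q4_K_M averages ~4.8 bits per weight: 27e9 * 4.8 / 8 ≈ 16 GB for the weights alone, so on a 16GB card there's nothing left for context or compute buffers. Go one quant size down or keep some layers on the CPU.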
>>
File: he forgot this.png (93 KB, 2569x239)
>>108482912
it's not the complete version of TurboQuant, I sleep
>>
Complete version of TurboQuant is called RaBitQ
>>
>>108484044
I'm not using retardquant unless it dresses up in a bunny outfit.
>>
RaBitQ sex
>>
File: sad.jpg (21 KB, 400x400)
>>108482213
>>
>>108484063
Look on the bright side, at least you woke up.
>>
>>108484002
One said "I'm Jewish and my name is Anonymous."
>>
your midnight/AM dose of schadenfreude sir;
Character.AI now requires FACE AGE VERIFICATION FOR ALL USERS. POINT AND LAUGH AT SaaS PAYPIGS. THEIR TIME IS COMING TO AN END.
>>
is 150tk/s good for a flash model? it all fits into VRAM (i think)
>>
>>108484100
I dunno buddy, is one HUNDRED and fifty tokens per SECOND good or not?
>>
>>108484131
i don't know? i don't know anything about AI i'm flailing in the dark here
>>
>>108484141
Anything above 10 tokens per second is fantastic.
>>
>>108484141
I get 5 tk/s on GLM 5 for reference.
>>
>>108484159
oh nice ok cool. that's reassuring
>>108484168
what's your setup for GLM 5? i might honestly try copy pasting your commands if your hardware isn't too far off from mine
>>
>>108484141
>>108484100
Your reading speed is between 5-15 tokens/sec. Do you think getting text at 10x top end reading speed is fast or not?
>>
>>108484180
i thought it seemed fast but someone told me it was slow but i was probably just getting trolled
>>
CopeQuant
MixtureOfCopes
CopeAttention
CPUCopeLoading
Cope.cpp
>>
>>108484180
Maybe anon wants to do something other than cumming to text.
>>
https://old.reddit.com/user/NoMembership1017
this is a clanker right or am i just going schitzo?
>>
>>108484175
It's a 512gb ddr4 ewaste system I built for 2.5k before the ram inflation. I think there's only a couple of us running shitty old ddr4 systems with 3090s. Everyone seems to be on ddr5 with blackwell 6000s these days, which should be able to do 12-15tk/s.

>copy pasting your commands
It's literally just a basic llama-server model context cpumoe gpu port command.
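If you want a concrete sketch, it's something like this (paths and numbers are made up, check your build's flag names):
llama-server -m /models/kimi-k2-q4.gguf -c 32768 -ngl 99 --cpu-moe --host 0.0.0.0 --port 8080
--cpu-moe keeps the expert tensors in system RAM while -ngl shoves everything else onto the GPU.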
>>
>>108484211
ask reddit
>>
I don't usually use amazon but I wanted to buy something now, and every single one of the top rated reviews has an em dash, and there are even some that have pros and cons lists with an emoji after every line.
>>
>>108484186
>Just waste resources bro. All forms of efficiency are cope.
>>
Can't get back to working normally and can't spend $20-$50 / day doing serious work.

Best bet is probably waiting for the M5 Ultra with 256gb to 512gb of RAM, correct?
>>
File: 1768307639082073.jpg (47 KB, 738x415)
>>108484186
LLMs are THE cope technology. Humanity is blowing hundreds of billions chasing AGI in text completers that have no fundamental understanding of anything.
>>
So what are minimum system requirements for getting a decent coom machine?
>>
File: 1763628877360791.png (433 KB, 1384x708)
>>108484186
he's right, just stak moar layers and we'll get AGI, just two more layers bro!
>>
>>108484295
rtx pro 6000
>>
So is that new intel gpu any good?
>>
>>108484295
8gb of vram.
>>
File: 1746847380591635.png (2 KB, 387x36)
v4 is coming
>>
>>108484292
i swear if we had put half as much compute on spiking neural net research we may already have le frickin agi by now.
>>
>>108484331
2 more weeks brosky
>>
>>108484223
>It's a 512gb ddr4 ewaste system I built for 2.5k before the ram inflation.
I'd swap my 256GiB DDR5 machine for that. At least you can actually run GLM5 without resorting to cosecants
>12-15tk/s.
yeah 13tk/s but limited to retarded IQ3_KS
>>
>>108484304
>So is that new intel gpu any good?
none of the others were, so why would this one be?
the free gaudi3 you can rent is shit too
>>
>>108484548
rent free..
>>
>>108484081
does garrys mod work for it?
>>
File: PeterThiel.jpg (42 KB, 595x900)
Hey gang, I'm trying to build a memory persistence system for my LLM project.

Currently I'm thinking of a three-tier system:
1. A write-protected lorebook system for the user to store personal facts and world-building information.
2. A Sims-style character stat system (e.g. a "loneliness meter") that will be used to give the LLM unprompted/agentic behavior.
3. An embedded RAG system for the LLM to store its own memories in a vector database.

This will be my first time working with any of these systems and it's a bit overwhelming desu. Any thoughts or ideas or common pitfalls to avoid? I don't want to over-complicate things. Sometimes I wonder if it's better to just be able to meet your waifu for the first time with every conversation you have. Idk.

If you had a gynoid robowaifu would you want it to remember?
>>
I discovered an interesting flag the llama.cpp uses that might be of use to some of you guys.

-nkvo

This off-loads your KV cache to your RAM instead of your VRAM, which allows you to offload more model weights to your VRAM. If you have limited VRAM and a lot of RAM like me, this can essentially enable you to have an absurd amount of context at your disposal.

This is also very useful for agentic workflows and LLMs with vision capabilities since feeding images into LLMs eats a shit ton of context, especially if you're doing it at a relatively high interval.
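For anyone who wants to try it, a sketch (model path and context size are placeholders, not recommendations):
llama-server -m /models/qwen3.5-27b-q4_k_m.gguf -ngl 99 -c 131072 -nkvo
All layers on the GPU, huge context, KV cache parked in RAM. Expect generation to slow down though, since every attention step now reads the cache over PCIe.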
>>
>>108484874
ye but sped
>>
>>108484874
Yes but perhaps there is a reason why since 2023 the common consensus was that the kv-cache should always stay on gpu and even the cpumaxx builds of those ancient times had a gpu "for prompt processing"
>>
>>108484887
>>108484897
In theory loading more weights into VRAM should offset the performance bottleneck of keeping the KV cache in RAM, no? I haven't actually done any benchmarks yet. Just found out about it.
>>
>>108484907
Want me to spin up my server and demoralize you?
>>
>>108484910
Yes. I'm about to test it myself, but more data would be nice.
>>
>>108484915
Aw fugg. I'm getting half the tokens per second (15tps). But the upside is that I get a context size of 262144 instead of 8196.
>>
>>108484907
No, there are two magical thresholds for weights on GPU: a) the dense/shared parts for modern MoEs, b) the point when you have 100% of them on GPU. Anything in between doesn't really matter much. You aren't going to see any drastic gains from fitting 40% of an LLM onto gpu vs 80% on gpu if the rest is still on RAM
>>108484923
Yeah but you now can go for a run and take a shower while that context is processing.
>>
File: Untitled.png (30 KB, 1265x354)
>>108484887
>>108484897
>>108484910
>>108484927
It's unexpectedly not too bad for cpumaxxers. I imagine CPUs with AVX512 would perform even better.
>>
I'm seeking deep right now
>>
>>108485129
Who opening dey AI rn?
>>
I just tried out the rocm build of llama.cpp and I'm getting half the tps compared to the vulkan build. Wtf
>>
https://www.reddit.com/r/LocalLLaMA/comments/1s7nq6b/technical_clarification_on_turboquant_rabitq_for/
that drama is too sophisticated for me lol
>>
>>108485237
hip btfo
>>
File: 1765645024476914.gif (179 KB, 720x720)
I really thought today would be the day.
>>
>>108485237
Could be that your gpu is offloading to ram, you might need to use less layers
Vulkan/rocm/whatever all bit different when it comes to memory allocation
>>
File: air-chan.png (71 KB, 1101x643)
>>108484927
>No, the two magical thresholds for weights on GPU a) the dense/shared parts for modern MoEs b) the point when you have 100% of them on GPU.
nta but I'm going to test this out, I guess with glm4.5-air q8

seems to make a difference, more layers on gpu == faster
>>
>>108485259
good morning
>>
File: dipsyAnimeOAI.png (3.57 MB, 1024x1536)
>>108482162
Web app was back up a bit after this post. Playing with the API this AM, if they did an update it's too subtle to be clear. And ofc no new weights.
I am disappoint.
>>
>>108485253
>Majid's January 2025 emails show that he had translated our C++ implementation of RaBitQ into Python and asked us to help debug it.
I was gonna snark about replying to @megacorp emails and expecting to not get fucked (megacorps outright stealing IP is an old and tired meme at this point), but the guy who burned them is some Persian grad student at NYU who, I imagine, was just secretly working the hustle on that megacorp cock.
>sophisticated
Just another case of an Empty-Headed Academic getting fucking destroyed by 2D mindgames if you ask me.
>>
>>108485392
>was just secretly working the hustle on that megacorp cock.
If you knew how many messages I get from these fuckers and I'm practically a nobody uploading shit on HF.
>>
>>108485273
do a sweep and show me the speed at 100k context. in my experience nkvo was slower than gpu kv cache but still usable till around 5k or 6k tokens and then nose dives to unusable territory.
>>
File: tokens.png (17 KB, 384x164)
> in 24 hours
Christ, open code burns through tokens. This is as much as I used in 2+ years of chat bots.

But vibe coded my own telegram chatbot ERP app that just works. Including good character sheets, GLM 5.1 is pretty impressive.

Wish I bought that 512 GB Mac Studio when it was still available.
>>
File: dipsySouthPark.png (1.89 MB, 1024x1024)
>>108484841
With LLMs and RP, the most powerful thing you can introduce to guide them is random numbers / events and any form of stat tracking. The first part forces the LLM out of whatever well-worn path it thinks it's on. The second forces it to slow down and consider the state of the NPCs.
Idk if I'd bother with RAG if you can just have it update/maintain a lorebook.
In terms of pitfalls: pick one inference source + model and stick with that. A lot of ST "functionality" is it working with every model and provider under the sun. You could toss probably 90% of the codebase if you drop that requirement, and 99% of maintenance work.
>>108485503
Yep. Claude code, openclaw, all this agentic stuff uses a ton of tokens, passing mostly nonsense prompts back and forth, but it does work...
LLM struggle to remember clothes, among other things.
>>
>>108484841
You can already do most of what you're describing with ST, using Summary and Authors Notes.
Also
> Sometimes I wonder if it's better to just be able to meet your waifu for the first time with every conversation
You should figure this out before you start coding.
>>
>>108485547
>Idk if I'd bother with RAG if you can just have it update/maintain a lorebook.
NTA, but aren't lorebooks just a form of RAG?
>>
>>108485578
Not imo. Lorebooks are written in prose and typically describe a certain thing / NPC. They are explicitly triggered by a key. So when ST sees "Sue" in the context, and there's a lorebook entry with the key "Sue," it injects that description somewhere in the context. You have to write and maintain it, or the LLM does (agents like openclaw are imho doing exactly this with all those .md files.)
ST RAG does all the above automatically using a large text that it does not maintain. The ideal usecase is, you have a written book describing the rp setting, then ST pulls the relevant stuff using RAG (which is a technique) magically. Sounds great, but in practice any large LLM was trained on your fiction world, and the way it works is sort of a black box.
Here's a RAG demo. It's one of those things that's easier to understand if you play with it, then draw your own conclusions.
https://chub.ai/characters/NG/mary-rag-demo-b0e12a34df58
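The keyword-trigger part of lorebooks is dead simple btw, here's a toy python sketch (made-up entries, not ST's actual implementation):
lorebook = {
    "Sue": "Sue is the innkeeper's daughter. Sharp-tongued, terrified of thunderstorms.",
    "Ravenhold": "Ravenhold is a walled mining town in the northern passes.",
}

def inject_lore(context, entries):
    # prepend every entry whose key shows up in the recent context
    hits = [text for key, text in entries.items() if key.lower() in context.lower()]
    return "\n".join(hits + [context])

print(inject_lore("Sue waves you over to a table.", lorebook))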
>>
File: 1765627721058810.png (1.96 MB, 3280x2475)
https://xcancel.com/Ali_TongyiLab/status/2038609308750143762
Never forget, it's local until it's good
>>
>>108485638
>Hugging Face Offline Demo:
wow
>>
>harrier-oss-v1 is a family of multilingual text embedding models developed by Microsoft. The models use decoder-only architectures with last-token pooling and L2 normalization to produce dense text embeddings. They can be applied to a wide range of tasks, including but not limited to retrieval, clustering, semantic similarity, classification, bitext mining, and reranking. The models achieve state-of-the-art results on the Multilingual MTEB v2 benchmark as of the release date.

https://huggingface.co/microsoft/harrier-oss-v1-27b
https://huggingface.co/microsoft/harrier-oss-v1-0.6b
https://huggingface.co/microsoft/harrier-oss-v1-270m
>>
>>108485667
>bitnet mining
>>
>>108485578
yes
>>108485630
i.e. its doing retrieval (based on the identified key) and augmenting generation. I.e. RAG
>>
>>108485667
>https://huggingface.co/microsoft/harrier-oss-v1-27b
>27B embedding model
but why?
>>
>>108485667
toss
>>
>>108485730
For the glory of benchmarks, of course.
>>
When running
>--parallel
with llama.cpp, it dynamically slices the full context to serve the parallel requests, correct?
Is there some other buffer that gets allocated by slot?
For example, parallel 1 uses some 200mb less VRAM than parallel 4.
Does it allocate one ub per slot?
>>
>>108485638
Qwen's been dead for a month, this news comes as a surprise to no one.
>>
>>108485667
The average score is SOTA
But average score doesn't mean much when there are hundreds of tasks in the benchmark
I’ll wait for the task-specific scores to see what is the best use case
>>
>>108485638
where does it say it wont be open weights?
>>
>>108485667
>text only
Which fucking year is this?
>>
>>108485827
they don't say it's gonna be open and there's no weights, fair to assume nothing cool is gonna happen
>>
>>108485827
The blog post is pretty unambiguous about its availability: https://qwen.ai/blog?id=qwen3.5-omni
>>
Anyone remember ericcurtin? He was an early slopper, coming from Docker. I noticed back then how his contributions were not for the benefit of llama.cpp, but for ramalama, another project he was part of (Member as of today, 13 pages worth of commits, 325 commits give or take).
https://github.com/ggml-org/llama.cpp/pull/10291 (introduction of llama-run)
https://github.com/ggml-org/llama.cpp/pull/17658 (self-removal after a scolding from ngxson)
https://github.com/ggml-org/llama.cpp/pull/18661 (removal of llama-run)
It was a slippery slope of slop. Introduce a seemingly innocuous feature through which you can introduce more later. llama-run, then introducing linenoise as a dependency, adding ollama and s3 model downloads, fixing his own stuff. All of that should have been its own repo.

Now we have pwilkin, already well established in the project. Most of his commits are thousands of lines of code to parse text, with 4 or 5 extra thousands later to fix his previous commits. Openly using LLMs to write code is not too bad of a problem, but he submits code he doesn't understand and obviously can't write by himself. A template "used in assistive capacity", a quip, and a smiley face seem to be enough to justify it.

Now he wants to add code to the GPU backends as an innocuous feature. It's for the good of the project, you see? ~3kloc to add performance profiling.
https://github.com/ggml-org/llama.cpp/pull/21138 (first attempt)
https://github.com/ggml-org/llama.cpp/pull/21160 (cross-backend profiler)
There's pushback, but he will take ANYTHING he can to slip his code in. He'll accept a branch. Whatever it takes. And then, expand from there. Simple enough. He found the crack.

He's gonna start slippery-slope slopping into the gpu backends. He can play with the text parsing code, that's fine. That stuff shouldn't even be on the server anyway. I wouldn't let him touch a single backend file.

I'm sure I'm not the only one who noticed the pattern.
>>
>>108485638
That's very sad.
>>
>>108485893
Start a discussion in the llama.cpp repo.
>>
>>108485893
>That stuff shouldn't even be on the server anyway.
You haven't even mentioned the new tools built into llama-server that you can give your models access to :)
>>
>>108485852
>>108485853
asked qwen directly:

Bottom Line: Qwen maintains a dual-track strategy—open weights for research/community (Apache 2.0) and closed API for enterprise/flagship capabilities. Qwen3.5-Omni follows the latter pattern as of March 2026.
>>
>>108485939
>>
>>108485921
I just wanted to vent and it's not gonna survive in there anyway. I'm just documenting.
>>108485924
Parsing text was the first crack and it continues from there. I understand it's difficult to reject a seemingly useful feature. "Hey, this dude gave me free code and my program does more things". But one has to be very judicious about those things. Scope creep kills.
>>
I have 4 GB VRAM and I am using models that are supposed to fit, BUT I think I'm getting trashed anyway somehow because it says 60% of my model is being run on the CPU.
WHAT'S GOING ON?
>>
>>108485963
>Unfortunately
she wants you so bad bro
>>
>>108485977
>I have 4 GB VRAM
>I am using models that are supposed to fit
If you could only mention what those models are. Context also takes [v]ram. So does the rest of your system.
>WHAT'S GOING ON?
You're doing something wrong. But we don't know what you're doing. Fix that.
>>
>>108485966
At the end of the day, llama.cpp is an inference library first with a bunch of shit piled on around it. You might as well argue that the entirety of llama-server is just a pile of scope creep.
It's a shame if the changes introduced by the profiler have runtime costs even while disabled. But I would be surprised if that were the case. I haven't looked at the PRs, either way.
You just have to get used to having shitty co-workers. It happens.
>>
>>108485977
>says 60% of my model is being run on the CPU
You are probably using ollama which is your first mistake.
>>
>>108485977
Which models and which settings?
>>
>>108486023
Actually, I'm using ik_lmao.cpp
>>108486031
Llama 4 Maverick. It's 4 for 4gb right?
>>
>>108486047
This has to be bait.
>>
bros I can't wait to run deepseek v4 on my phone
>>
>>108486023
>>108486031
>>108486007
Using qwen2.5 3b on ollama
>>
>>108486016
>You might as well argue that the entirety of llama-server is just a pile of scope creep
Back in the day, I used to pipe stuff from my text editor into main, tee back to the editor and to a piper "server". The piper server was just nc piping to its stdin from a socket. Now I use the server because being able to edit the context is useful. But yeah. One could easily argue that. I think the benefits outweigh *part* of the bloat.
>It's a shame if the changes introduced by the profiler have runtime costs even while disabled
That's not what worries me. If there's overhead it will be immediately obvious. What worries me is the slippery slope. Slop slowly creeping into the backends. He's a good vector for it.
>You just have to get used to having shitty co-workers. It happens.
I'm just an onlooker. It's just sad to see. His code never affected how I use llama.cpp but, given enough time, it will.
>>
man i really have no idea where to ask this. i have 8gb vram and want to use kobold to talk to an uncensored chatbot + text 2 speech.
it seems like kobold only supports gguf!?
so what models do i need? how do i know which models are safe to use? (can gguf files even have a virus???)
and also where to find uncensored models? i'm having this very old uncensored model "dolphin-2.2.1-mistral-7b.Q4_K_M" but it is really not that fun to talk to.
>>
We're rehashing old bait today, I see. Interesting.
>>
>>108486066
If I had 8gb I'd be a rich man.
Everyone here is just so rich, nobody here has bills to pay .
>>
>>108486073
most dalit post all day
>>
File: 1769288284484.png (10 KB, 1221x47)
>>108486049
no this is patrick
>>
>>108486088
fuck, that attachment was for another post I was writing
>>
>>108486065
>His code never affected how I use llama.cpp but, given enough time, it will.
I doubt any regressive changes to the core inference engine will make it through. There's a lot of pride behind making the inference core top class, and it's much easier to objectively judge changes to it.
You should probably consider writing your own tooling around libllama at this point, though, even if it's just wholesale copying the llama-* components you want into a separate sourcetree and having your way with them.
>>
>>108485893
>writing your rant post with LLM
literally kill yourself faggot
>>
>>108486104
I made a minimal llama-server clone a while ago, but just as an experiment. I'm ok with working around client code. I just don't want the backend code being messed with. Maybe that's the way.
I remember back then the examples used to be just that. Examples. Another instance of, from my point of view, dubious contributions was phymbert (and I think I complained about it before). He started the whole "make server production ready" thing, then he vanished. Maybe the server is better for it, but it still felt out of place.
>>108486124
Common expression and turns of phrase exist.
https://desuarchive.org/g/search/text/%22literally%20kill%20yourself%20faggot%22/
>>
>>108486140
Oh, no. He's gonna get me with that missing 's', isn't he? I'm done for.
>>
>>108486023
what do I do? When I ask Claude it has no idea what I'm supposed to do
>>
File: 24d1d11b.jpg (1.17 MB, 850x1275)
>>108485977
love this slut, made a card last year:
{{char}} is an attractive young woman with an untrustworthy streak. Behind her charming smiles lies a cunning and manipulative personality. Accustomed to getting her way through deception and betrayal, she effortlessly climbs the social ladder by stepping on those around her. {{char}} uses flattery and sycophancy to gain favor with her superiors, aiming for higher positions without hesitation. Despite her physical appeal, she has avoided genuine relationships due to her lack of trust in others. {{char}} places great value on her virginity, viewing it as a precious commodity, and would rather resort to other means to achieve her goals.

Scenario: After noticing inconsistencies in the financial records, {{user}} has gathered enough evidence to prove that {{char}} has been embezzling company funds. {{char}} has decided to confront her privately during a late-night encounter at the office. Despite her higher status, {{char}} finds herself cornered and desperate, realizing that her career hangs in the balance.
>>
>>108486194
>{{char}} has decided to confront her privately
shouldn't that be user?
>>
>>108486233
She's schizophrenic and confronts herself.
>>
>>108486240
kek
>>
File: llamacpp_100k-stars.png (1.03 MB, 1023x3856)
>llama.cpp at 100k stars
https://x.com/ggerganov/status/2038632534414680223
>>
>>108486047
Anon, I'm sorry to say but llama4 maverick is a 400B model.
To run the smallest quant available you need 107GB of memory just for the model, then there's the context.

>>108486063
>ollama
Try llama.cpp.
>>
>>108486294
oh no ik will mad
>>
>>108486294
He knows... Without /lmg/ this wouldn't be even possible.
>>
>>108486294
>now that 90% of the code worldwide is being written by AI agents
and yet you refuse to let people contribute that code to your own project. curious...
>>
>>108486294
deepseek has been waiting for this
v4 is IMMINENT
>>
>>108486294
I don't subscribe to the concept that number of GitHub stars is somehow a measure of quality.
>>
>>108486322
this
>>
>>108486322
it's a quality issue of people using AI instead of programming vs using AI as a tool to program faster
>>
>>108486339
good joke
>>
>>108486334
Gemma 4 first, and you will actually be able to run that locally.
>>
>>108486322
>>108486341
Hype chasers. They do it for self-promotion. They'll jump ship as soon as possible to chase something else and have no interest in maintaining their code.
>>
File: 1766472096750601.jpg (47 KB, 718x718)
>>108486094
Anon...
>>
>>108486322
With everyone racing to implement turbo quant, I had a look at a couple of implementations, and the quality of the code is night and day between something obviously vibe coded and something that was crafted by a human.

Code quality matters and saying otherwise is pure cope. I use AI for every single line of code I write but you wouldn't even be able to tell.
>>
>>108486322
>Too retarded to read the next two words of the following paragraph.
>>
it's kinda crazy how much of a difference it makes to give your AI web browser access.
>>
>>108486302
first thing I thought after yesterday's 'this is a niche project :)' cope
>>
>>108486518
don't understand the hate for ik_llama considering I just got like a 10% PP speed boost for kimi over the last week going from 110tks to 130tks.
>>
>>108486504
playwright a shit tho, I prefer supplying web search (searxng instance)
>>
>>108486504
Through openclaw?
>>
>>108486529
there is some big drama with the dev, it isn't really on my radar though because he doesn't support vulkan so I'm not interested in his fork
>>
>>108486534
>>108486541
opencode. I've been using lynx. I don't need it to interact with the pages.
I might look into agent-browser if I start needing javascript support.
>>
>>108486534
Try the chrome-devtools server. Less context heavy and works better than playwright.
>>
>>108486529
he's an autist who had a huge melty over attribution, demanded that any file he touched bear his attribution (lmao!), has been gently told to fuck off and now is still melting to this day (if you read any of his PRs he mocks our poor cudadev and takes a jab at 'mainline' for being behind or accuses them of copying shit over).
it's a shame because the dude is clearly capable
>>
>>108486467
>a full post to brag about his code
huh?
>>
I want to locally run some multi-character stuff, with no lewd filter or anything, maybe add tts or img gen later. Aside from https://rentry.org/lmg-lazy-getting-started-guide, is there anything I should know? Like, if I have Python skills, is it worth running some stuff from scripts directly rather than some self-hosted webgui, considering I otherwise know next to nothing about the whole ecosystem?
>>
>>108486602
Don't act like Intel slapping their name on his work wasn't bullshit in the first place. He has good reason to be upset. ggerganov is a retard that doesn't know how to properly manage a project of this size.
>>
Am I spending more money running a shitty model on 1050ti because of the electricity costs than I would be buying tokens?
>>
>>108486669
Yeup. API is always cheaper. You run local for reasons other than price.
>>
>>108486646
I read that guide and it didn't help at all.
These people in this thread are gay so they don't help.
Redditors don't help either.
Ai can't help either, it's new technology so it doesn't know.
>>
>>108486646
>considering I otherwise know next to nothing about the whole ecosystem?
I think it's worth taking a look at existing frontends before diving into the deep end. Even if your stance is a hard "FUCK WEBUIS WEBUIS ARE FOR STUPID NIGGERS".
SillyTavern is clunky as fuck, but does support multi-character chats. And you'll need to set up an inference backend (e.g. llama-server) anyway.
I can't speak to vllm or whatever if you're looking in that direction.
>>
>>108486669
depends how expensive your electricity is. what you're really wasting tho is your time. what model even fits on a 1050ti and how many tok/s are you getting? shit must be atrociously slow.

I ran the math for my 3090 and running it for an hour straight would be around 2 cents.
A single Gemini 3.1 pro query costs around 2 cents.
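Back of the envelope, assuming ~350 W at the wall and ~$0.06/kWh (your rate will differ): 0.35 kWh * $0.06 ≈ 2 cents for the hour.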
>>
>>108486679
>Yeup. API is always cheaper.
See >>108486702
>>
>>108486692
I usually wait until some sexy female (I know they are cute girls irl) anons come into this thread before I start asking questions (they're very pretty and demure and aren't brash and aggressive like the others).
>>
>>108486702
Thing is, is your 3090 serving a model that trades blows with gemini 3 pro?
>>
Why is ollama.com lying about VRAM requirements?
It says the requirement for the model is 16 GB VRAM but then it adds 200k context on top when you run the model bringing the real requirements to 50 gb VRAM or something
>>
>>108486646
lurk more
>>
>>108486739
Doesn't actually help
>>
>>108486726
wait for turboquant :)
>>
>>108485652
What the hell does that even mean? What is an offline demo for a closed weight model?
>>
>>108486742
It does help if you look into the archives
>>
>>108486726
I think you're just stupid and doing something wrong. If you have some fucked up system and absolutely *need* to change your settings to satisfy your ocd, use llama.cpp or koboldai. Ollama just works for me.
>>
>>108481870
>--TurboQuant reuses 90s game dev tricks for modern AI:
It's funny because there are thousands of ancient techniques available that would result in "breakthrough" papers like this.
>>
>>108486742
lurk harder
>>
>>108486700
Thanks, anon!
>>
>>108486742
Not me. A list of components helps way more than shitty youtube vids or people telling you to download dubious installers fagging up your system.
>>
>>108486764
it's always like this
>some autistic fag makes an obscure paper
>decades later, someone else hears of it and thinks it can be used for something cool
like when John Carmack used Binary Space Partitioning in DOOM (1993), he found that method from a paper written in 1980 and it became mainstream that way
>>
File: I CANT BELIEVE IT.png (476 KB, 3018x1159)
https://arxiv.org/abs/2602.20021
>letting your computer be manipulated by hallucinating LLMs can lead to catastrophe
NO WAY???!!!
>>
It's not fair, we could have had bunny girl kv caches but instead we have some browntard turboautism corposlop cache.
>>
>>108486721
>Thing is, is your 3090 serving a model that trades blows with gemini 3 pro?
For a lot of things the smaller models are more than adequate. Which api model can you prompt for a full hour for just 2 cents?
My point is, it's just not true that API is *always* cheaper. On the contrary, Local is most definitely cheaper in a majority of cases. How much would it cost to prompt gemini for an hour? let's even say gemini flash.

How many tokens do you use in a coding session? likely in the millions? let's say it ends up costing you 50 cents an hour. Is flash really 25x better than your local qwen?
>>
>>108486854
It's worse because the original was implemented in C++ and turboslop had to go and fag it up with pyshit.
>>
>>108486857
When my local qwen is some iq2xs shitter running on a 1050 ti, yeah.
>>
File: 1762642191304549.jpg (287 KB, 1920x1080)
>>108486867
Python is fine actually if you follow the best practices
>>
>>108486857
Local wins?
>>
>>108486883
SEX x BECKY
>>
>>108486874
>1050 ti
Maybe it's time you upgrade your mid-range ten year old card from 2016.
>>
>>108486842
They really gonna do a study "sex with AI feels good" once sexbots come out and everyone will take it seriously for some reason
>>
>>108486930
And what? Validate your argument? No thanks. I stay here in my 1050 ti flavored puddle of shit and wallow in self pity and misplaced seethe.
>>
>>108486805
To be fair, a lot of the time it's more like
>come up with cool idea for a concrete application
>uuhm actually, I had that idea first, here's my obscure paper from 20 years ago
>>
>>108486889
Flash-lite can't be larger than 10B.
>>
>>108487027
>flesh-lite is easily manhandled
hot
>>
so what's the best options to use on llama.cpp for qwen-3.5-27b with 32gb vram?
>>
>>108487138
Q6 and let the automatic fit do the rest.
>>
>>108487138
-ngl 999
>>
>>108487138
ngl 99 ub 2048 and as much context as you can fit, which will be quite a bit I reckon even running q8.
>>
>>108487027
It might be something like 1T A3B + 30B vision encoder.
>>
>>108487157
I can only fit 180k in 48gb at q8 btw, so a lot less for 32gb. (fp32 mmproj, so maybe more if he drops to a lower quant).
>>
>>108487182
I would believe in 1T A10M only.
>>
>>108487138
-nkvo
>>
>>108487182
>1T
You're completely delusional lol.
Why the fuck would they make it 1T?
>>
>>108486294
Yeah, fuck nvidia!
>>
>dude just lurk more so you can learn outdated information about running outdated models that open claw doesn't even care about anymore because the settings changed
>>
>>108486700
Thanks again, I am up and running.

>>108486692
Only thing missing from the getting started guide was to get the koboldcpp version optimized for your hardware, and perhaps how to navigate to the sillytavern settings (which I am still looking for, kek).
>>
>>108487242
if openclaw is so good why don't you ask it to set everything up for you
>>
>>108486659
that was dickish, but i dont think it warranted this split venomous autism fork.
I unironically think that this ordeal set LOCAL back by at least 6 months
>>
>>108487258
If
>>
so, about v4?
>>
>>108487138
--mmproj <path-to-mmproj.gguf>
Everything else is unnecessary in the modern llama-server world, autofit is best girl.
>>108487248
Welcome onboard!
>how navigate to the sillytavern settings (which I am still looking for, kek)
NTA but SillyTavern is a fuck. The buttons at the top of the UI open settings panels. The right-most one ("AI Response Configuration") changes options depending on whether your API Connection (the second-to-rightmost one) is configured as "text completion" or "chat completion".
It's gonna take you a bit to figure out all the things, but please enjoy your chat with Seraphina in the meanwhile.
>>
>>108487279
>right-most
Pretend I am not a retard and can tell left from right.
>>
>>108487242
All guides are immediately outdated. The only one that stays relevant for new anons is the lazy guide in OP. If you can't get that to work, you're beyond help and should just use ollama.
>>
>>108487279
Gotcha. I selected text completion because that seemed compatible with koboldcpp. Do I want chat completion instead?
>>
>>108487307
Ultimately chat completion is just a wrapper around text completion that runs an array of messages through a jinja template included with the model, then passes the template output to text completion.
Sometimes the chat templates are fucking retarded and have strict rules about turn order and will throw errors; SillyTavern basically does the same thing as the chat templates internally in text completion mode but sometimes SillyTavern is more retarded than the built-in templates... so... it depends...
If you're running Qwen 3.5 I'd just use chat completion because it seems to work.
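If you want to see exactly what the wrapper produces, you can render a chat template yourself. Minimal python sketch with transformers (the repo name is just an example of a model that ships a template):
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
messages = [
    {"role": "system", "content": "You are Seraphina."},
    {"role": "user", "content": "hi"},
]
# renders the model's jinja chat template into the flat prompt string that text completion receives
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))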
>>
https://www.reddit.com/r/accelerate/comments/1s7twd8/byteplus_is_selling_exclusive_seedance_20_access/
>BytePlus is selling exclusive Seedance 2.0 access to studios at a $2 million commitment. For that price, buyers get what nobody else can: zero queue times, real-face uploads with no content restrictions, and priority compute allocation. Approximately 400 US companies have signed up already.
wtf??? that's literally 800 million just like that, HOLY SHIT
>>
>>108487336
shitty tavern should implement tool calls with text completion thougheverbeit
>>
>>108487219
Because once it's MoE and you pick how many active parameters the model should have (i.e. how computationally heavy it will be), you can make it about as large as you want at almost no added costs except possibly more routing overheads if there are too many experts. But I bet Google has solved that too.
>>
>>108487336
I chose the nemo 12b instruct gguf for my 24 GB VRAM card. If something is better, I am all ears. Despite switching to chat mode, I cannot find the template mentioned in the guide.
>>
>>108487348
I just don't think it would be this cheap if it was 1T
Inference isn't what's expensive for LLMs; it's always going to be their size.
>you can make it about as large as you want at almost no added costs
That's just it tho. number of params is the main cost driver.
>>
>>108487339
>exclusive
>400 companies
Words used to mean things...
>>108487346
I'm still too afraid to add any tools to SillyTavern, god only knows.
I'm assuming text completion mode doesn't support images either. Sucks.
>>108487365
The guide is probably outdated, I'm honestly not sure how SillyTavern internally handles templates.
IIRC Mistral Nemo doesn't... work... with chat completion in SillyTavern? I can't recall and can't be bothered to dig the safetensors out of the NAS.
I'm an astroturfer, so I'm going to recommend Qwen 3.5; it's strictly better than Mistral Nemo (sorry). You can figure out which size+quant fits in your VRAM with e.g. https://www.canirun.ai/?q=qwen+3.5 (click through to the next page to see the quants list, the overview can be misleading). You can probably fit the 27B in with a Q5_K_M quant, maybe Q6_K.
>>
File: media_G7ktFzsagAAFRok.jpg (1.14 MB, 2508x3500)
new models?
>>
>>108487456
gemma 4 in 2 two more weeks
>>
>>108487456
Imagine the smell.
>>
>>108487481
>If it smells like fish...
>>
File: dipsyfooooooour.png (133 KB, 1290x940)
>>108487456
https://huggingface.co/deepseek-ai/DeepSeek-V4-Preview
https://huggingface.co/deepseek-ai/DeepSeek-V4-Preview-Base
>>
>>108487418
>Words used to mean things...
do you realize there's more than 100 million companies in the world? 400 is peanuts, so yeah it's hella exclusive
>>
>>108487484
wtf it's real
>>
>>108487484
I see v4, I click.
>>
>>108487484
>1.5t 90a
oh no
>>
>>108487484
>it actually happened
>>
>>108487484
>Updated Mar 24, 2026
Bait used to be believable
>>
>>108487606
didn't the entire kimi k2.5 repo and collection have a modification date of a whole month before they made it public when it got released? it was all just sitting there for all of january private'd
>>
File: qwen3_6.jpg (66 KB, 1080x944)
>>
stop baiting
>>
File: dipsyMikuSouthPark.png (2.99 MB, 1536x1024)
>>108487606
Lazy bait reuse from last week.
How dishonorable.
>>
>>108487484
Just two more fakeposts until this link works...
>>
>>108487645
https://openrouter.ai/qwen/qwen3.6-plus-preview
>>
File: 1773719870739931.jpg (77 KB, 850x850)
>Deepseek V4 comes out
>It's 2 TB
>Smaller version comes out
>It's 4B
Calling it now.
>>
>>108487734
it's just the next chink family of models quickly shitting out an openclaw-focused version
>>
>>108487738
No, you'll get 4b-120b but they're all qwen/mistral-based distills of the full 2T again
ollama run deepseek-v4
>>
>>108487738
>Smaller version comes out
lol
lmao
>>
>>108487751
kek
>guys I can literally run deepseek on my raspberry pi!!!
>>
>>108487738
>it's 2TB
it using engram means you could probably have 95% of the model on disk and still run it at modest speeds.
>>
>>108487764
i mean you can if you do nvme inference, gonna be slow af but you *can*
>>
>I still haven't bought extra nvme sticks
>>
File: vibeshitter.png (29 KB, 454x405)
>1788 open pull requests
How does a project survive death by a thousand cuts from all the vibeshitted pull requests?
>>
>>108487804
By having the balls to say "no".
>>
>>108487804
Just ignore them?
>>
>>108487817
>>108487818
>rude a-holes
>>
>>108482688
Better get used to it since China is winning. Europe shot itself in the foot and decided not to participate and Americans are all closed weights and super overpriced.
>>
>>108487818
How do you filter out the shit from the good PRs?
>>
>>108487834
Easy. Anything I didn't write myself is shit.
>>
>>108487834
>good PRs
no such thing from someone you don't know
>>
>>108487834
Use a private issue tracker and mute the Github repo.
>>
>>108487851
This is the internet, nobody knows each other. Unless you mean
>Only members of my discord clique can make pull requests
Yeah no way that could ever go poorly.
>>
>>108487834
AI
: ^ )
>>
>>108487883
Only my anon friends can make pull requests.
>>
>>108487886
>: ^ )
Is :^) filtered or something?
>>
>>108487883
If we knew people IRL we could interact with them online too, sort of like a "remote working from home" but for unemployed NEETs.
It's a Discord clique but we'd know where each other's houses are to break knees if someone needs knees broken. Or dicks sucked, or whatever. You know.
>>
>>108487889
>: ^ )
>>
Where are my Australia and New Zealand models?
>>
>>108487804
I suspect that the future of open source is going to be that any popular project will not accept PRs until after a vetting process.

There's simply no way to keep up with the sloptide otherwise.
>>
File: psyguard.png (194 KB, 1296x995)
>>108487883
nta. Reputation. This is one of the early turboquant sloppers
https://github.com/TheTom
https://github.com/TheTom/llama-cpp-turboquant
Created in dec 2025, practically no activity. picrel is what he sells.
https://psyguard.ai/
>>
>>108487919
>mfw psychologists will use LLMs larping as people to publish their thesis
>>
>>108487950
>mfw I don't have a face
>>
>>108487950
implying that's better than the current "i made it up" and "source: my ass" style of thesis writing
>>
>>108487950
psychology is just horoscopes and fortune telling masquerading as a science anyways
>>
>>108487959
Why is the Qwen3.5 35B-A3B Q4_K_M 21.4 GB, but Nemotron Cascade 2 30B A3B Q4_K_M is 24.7 GB? Shouldn't the smaller model be smaller?
>>
>>108487974
don't talk to me fool
>>
>>108487974
idk
>>
File: psyguard_02.png (104 KB, 1229x769)
>>108487950
I wish it was that shrimple.
>>
>>108487974
not all the tensors get quantized to the same degree. the models probably have different shapes. could be many smaller layers vs fewer fatter layers.
>>
>>108487824
>>108478994
>>
>>108487974
File size is not determined solely by parameter count. Here is the breakdown:

>Embeddings / Vocabulary Size
The embedding layer is calculated as vocab_size * hidden_dim. If Nemotron has a significantly larger vocabulary or keeps its embedding table in FP16 (while Qwen quantizes it), this adds several GBs.

>Unquantized Components
Some GGUF implementations keep specific layers (like biases, RoPE, or normalization weights) in FP16 even when the rest is Q4_K_M. If Nemotron preserves more of these than Qwen, the file grows.

>Metadata and Tokenizer
The GGUF container stores the tokenizer (BPE/Merger) and model config. A larger or more complex tokenizer adds to the header size.

>Architecture Density
Qwen 35B might be a dense model, while Nemotron could have overhead from Mixture-of-Experts (MoE) routing logic or larger hidden dimensions relative to active params.

In short: Qwen is likely just packed tighter with a smaller vocabulary table.
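To put the embeddings point in numbers (illustrative values only): a 150k-entry vocab with hidden size 5120 kept in FP16 is 150000 * 5120 * 2 bytes ≈ 1.5 GB before a single transformer block, so whether that one tensor gets quantized swings the file by gigabytes.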
>>
>>108488036
Extremely low quality.
>>
>>108488051
perfect for gorgeous explains
>>
>>108488036
>Qwen 35B might be a dense model
how lazy can you be?
>>
>>108488077
Do better instead of complain.
>>
>>108488036
>>108488051
imo llms should predict the next n bytes and not even have to bother with meme tokens.
>>
>>108488087
Good job, now get in the BLT https://arxiv.org/abs/2412.09871 waitlist
>>
File: Untitled.png (13 KB, 837x513)
>>108488188
>>108488188
>>108488188
>>
>>108486602
every single part of this is false btw lol


