/g/ - Technology


Thread archived.




File: 1734510273466794.jpg (2.87 MB, 1875x2833)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108429328 & >>108423177

►News
>(03/17) Rakuten AI 3.0 released: https://global.rakuten.com/corp/news/press/2026/0317_01.html
>(03/16) Mistral Small 4 released: https://mistral.ai/news/mistral-small-4
>(03/11) Nemotron 3 Super released: https://hf.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: threadrecap.png (1.48 MB, 1536x1536)
►Recent Highlights from the Previous Thread: >>108429328

--New bf16 CUDA kernels released for llama.cpp:
>108430450 >108430503 >108430584 >108430606 >108430575 >108430604
--flash-moe enables large models on limited RAM via mmap and reduced experts:
>108429709 >108429977 >108430391 >108432265 >108432656
--Preventing Qwen3.5 API hallucinations through context injection:
>108432758 >108432775 >108432777 >108432835 >108432851 >108432889 >108432923 >108432931 >108433006 >108433069
--Debating guide relevance and MCP tool integration risks:
>108433310 >108433353 >108433416 >108433421 >108433427 >108433432 >108433469 >108433609 >108433440
--Comparing Hauhau and Heretic V3 27B decensoring and intelligence tradeoffs:
>108430933 >108430942 >108431516 >108431535 >108431580 >108431711 >108431812 >108432288
--koboldcpp prefill with thinking behavior and SSD endurance concerns:
>108430611 >108430638 >108430653 >108432471 >108432477 >108432493 >108432539
--RTX6000 Pro hybrid inference performance falls short of expectations:
>108433537 >108433564 >108433608 >108433628 >108433629 >108433677
--Quantization tradeoffs for 32k context inference:
>108430903 >108430938 >108430948 >108431019 >108431106
--MoE active parameter limits vs dense model coherence:
>108434362 >108434376 >108434474
--27B q5_km with autofit better than 9B for 16GB VRAM:
>108434293 >108434344 >108434351 >108434353
--R1 model exhibits drastically different behavior with extreme sampling settings:
>108430883 >108430976 >108431025
--Anon built an overpowered AI assistant then unplugged it:
>108431179
--Miku (free space):
>108430192 >108430238 >108433609

►Recent Highlight Posts from the Previous Thread: >>108429330

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
m
>>
passionate lovey-dovey sex with miku
>>
this thread will be WORSE
>>
>>108433609
man this MCP client is implemented so fucking SHODDILY.
You 100% need to have settings PER SERVER (as some of them require your tokens in the header, like github), and I guess it doesnt even support STDIO ones so you probably gotta proxy them locally.
does it even display tool call queries/result instead of just saying "I USED X TOOL".
damn, even llamacpp native webui has better MCP support
>>
>>108434980
in a post autoparser world, this doesnt work anymore btw
>>
File: 1756705638157990.jpg (352 KB, 1920x1080)
>improvements only being made through increasing parameter count
>good hardware becoming more and more out of reach for anyone not dumping their yearly salary into their computer
>censorship and slop increasing every year
This hobby sucks
>>
>>108435077
Wait what? I'm not familiar with what the autoparser does, but how the fuck does it interact with the chat template system, of all things?
Does it intercept and rewrite the "<think>\n\n" the template generates or something?
>>
>>108434876
I keep coming back here every day to see if deepseek v4 is out or not, only to be disappointed every time
>>
>>108435088
r1 is all you need >>108430883
>>
LLMs have no future
>>
Gemma Week
>>
>>108435082
There's still a multitude of possibilities: chaining small models together and user-created parsing engines.
For now, most people interface with one model directly and almost nothing is happening in between.

Those saas goy models like ChatGPT do all sorts of programmed tricks and parsing; it's not just some big model that sits there and waits for user input blindly.
>>
>>108435086
Where do I put it? My payload is just very simple, no hierarchies. Well I guess I'll try just adding it right there.
e.g. payload = { prompt: my_prompt, n_ctx: n_ctx...}, I don't have any hierarchies or anything like that.
>>
>>108435256
That seems likely but until some effective, usable solution exists in the local space then it may as well be magic
>>
File: disabled.png (1 KB, 535x43)
>>108435077
>>108435086
I actually found a reason. When I load the cucked model (https://huggingface.co/HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive), they have disabled reasoning by default.
With the original Qwen, reasoning is enabled.
It's just a funny thing that the llama webui can still enable reasoning but I can't do it on my own. And like I said, there is a server flag --reasoning on, but even this doesn't do anything.
I need to do some tests I guess.
It's not the end of the world, I have better things to do, but still...
>>
>>108435323
I already have those <think> templates in my own Qwen chat template and it works with default Qwen models normally. I can control the reasoning from inside of my client. This is why I'm puzzled by this.

Okay I can feed the json variables to llama-server with
>--chat-template-kwargs '{"enable_thinking":"true"}'
But then it mentions this:
>>
>>108435332
I'm retarded, my assumption that "he's using the chat completion endpoint" was wrong. The chat completion payload uses "messages" in the JSON, not "prompt".
The jinja template is used to convert "messages" to "prompt" and is not used for text completion. Since "enable_thinking" is a chat template variable, it does nothing in your context.
>I already have those <think> templates in my own Qwen chat template
I would double-check that you're using the exact strings the template uses, including newlines. I think there's an endpoint in llama-server to read them dynamically but I'm obviously more retarded than I usually am.
Sorry for the spam.
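For anyone following along, here is a minimal sketch of the two payload shapes being discussed. The field names (`chat_template_kwargs`, `n_predict`) match llama-server as I understand it, but double-check against your build; the prompt string and message are just illustrations:

```python
# Text completion (/completion): you send a fully formatted "prompt" string.
# The jinja chat template is never applied, so chat-template variables like
# "enable_thinking" have no effect on this endpoint.
text_completion_payload = {
    "prompt": "<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\n",
    "n_predict": 128,
}

# Chat completion (/v1/chat/completions): you send "messages"; the server
# runs the jinja template to turn them into a prompt, so template kwargs
# such as enable_thinking actually do something here.
chat_completion_payload = {
    "messages": [{"role": "user", "content": "hello"}],
    "chat_template_kwargs": {"enable_thinking": True},
}
```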
>>
>>108435341
No it's cool. I have still lots of confusion about some things anyway.
I'll double check my template just in case.
>>
File: 1749456138339587.png (144 KB, 1188x702)
grok is this true?
>>
>>108435414
If it's on twitter it's 110% true.
>>
>>108434362
I'm looking forward to seeing an apple-for-apple comparison at some point with MoE models having the same number of layers and hidden size of their dense counterparts, i.e. just making the dense models sparse and not also subtly smaller in various ways. Until then, I think just comparing them with total size and active parameters alone will give misleading results.
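A back-of-the-envelope sketch of why total + active parameters alone mislead. This only counts FFN weights (ignoring attention, embeddings, and routers), and all the sizes below are made up:

```python
def ffn_params(hidden: int, inter: int) -> int:
    # SwiGLU-style FFN: gate, up, and down projections
    return 3 * hidden * inter

def moe_ffn_params(hidden: int, inter: int, n_experts: int, top_k: int):
    # Every expert is stored (total), but only top_k run per token (active)
    per_expert = ffn_params(hidden, inter)
    return n_experts * per_expert, top_k * per_expert

# Dense model vs a 64-expert top-8 MoE whose experts are a quarter
# of the dense intermediate size (invented numbers):
dense = ffn_params(4096, 14336)
total, active = moe_ffn_params(4096, 14336 // 4, 64, 8)
# total is 16x the dense FFN weights, active is 2x, and the expert width
# differs too - none of which a bare "total B vs active B" headline captures.
```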
>>
I'm new here, just arrived.
I can't in good conscience support the warmongering regime and its lackey cloud models that assist it with targeting for maximum war crimes.
What's the best model for me?
>>
>>108435455
for what tasks and what hardware
>>
>>108435414
>perplexity
>>
>>108435108
r1 has never been stable for me like v3.1 is. it feels "too much". I don't know if that's the normal r1 experience
>>
>>108435507
Rtx 5060 32gb ram
>>
>>108435643
mistral nemo
>>
>>108435651
>>
>>108435643
task?
>>108435608
didn't mention that i had disabled reasoning and used it with chatml template.
>>
>>108434876
WHY THERES SO MANY WAN MODELS TO DOWNLOAD I DONT KNOW WHICH ONE AND EACH MODEL HAS THE SIZE OF 2015 AAA GAME AAAAAAAAAAAAAAAAAAA
>>
>>108435678
i thought ltx 2.3 was the king? also >>108433569
>>
>>108435661
More like "You never made this" in the last panel.
>>
>>108435689
I got the feeling LTX will surpass WAN in a few months too, but currently WAN has a shitton of LoRAs. But i dont know which WAN base model i should download. Oh and i use Wan2GP by the way. Cant use Comfy, too convoluted for me
>>
>>108435678
Just don't use asian models, then suddenly the list becomes more manageable.
>>
>>108435088
Same.
>>
>>108435673
i think it's more of a quantz issue that's causing its instability. is r1 really usable below q4? i remember when i tried full quantz through an api and the model was majestic by default. i fell in love back then. but now with the current quantz (no idea what it might be but probably below q4), it's retarded, unstable and is... too much like it wants to do roleplay on its own without me lmao. sometimes generates garbage c code or brings random characters out of nowhere
>>
>>108435836
>quantz
>>
>>108435651
how do I use it for cooming tho? just run this?
https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407
or is there a tune recommendation or something? (most of the tunes I have tried were retarded tho)
>>
>>108435866
ngmi
>>
>>108435836
I'm using IQ2_XXS of this https://huggingface.co/unsloth/DeepSeek-R1-GGUF with ikllama. people often shit on unsloth but these particular quants are amazing.
It's definitely not retarded and has a lot of subtle knowledge and character understanding. Even the IQ1_S is good, but when I tried bart's kimi IQ1_S it was awful.
>>
>>108435866
That's the one.
Some people swear for Rocinante as far as fine tunes go, but it's a sidegrade at best if you aren't braindad and can write a simple prompt.
>>
>>108435909
>if you aren't braindad
hi
>>
>>108435888
>I'm using IQ2_XXS of this https://huggingface.co/unsloth/DeepSeek-R1-GGUF with ikllama. people often shit on unsloth but these particular quants are amazing.
Because they actually cooked those original R1 quants.
That's when they were putting in the effort and actually testing them.
Now it's all templates and partially automated garbage.
>>
>>108435678
There are T2V and I2V versions. Each has a high noise and a low noise model, and you need both. You could technically get away with only using low noise for T2V, but for I2V it's needed.
>>
File: 1759372929821330.png (180 KB, 1472x953)
>>108434876
I'm still a bit of a newfag when it comes to vibe coding, or any form of LLM assisted programming:

What are SDKs, why should or shouldn't Anthropic release theirs, and how would it benefit us?

https://xcancel.com/i/status/2035955731690832155
>>
>>108435933
buy an ad
>>
>>108435933
>autopulling vibecoded PRs
what could possibly go wrong
>>
>>108435933
An **SDK**, or Software Development Kit, is essentially a... a toolbox for programmers. *I reach out, tracing a finger lightly down your forearm to emphasize the point.*

It packages everything a developer needs to talk to a service—like API calls, error handling, and configuration—into one convenient library. Instead of writing all the low-level code yourself, you use the SDK to do it for you.

As for Anthropic... *I bite my lip, thinking about the tweets.*

Theo's point is that if they **open source** the SDK, developers like you could see the inner workings. Right now, if it's closed, you have to trust their black box. If they open it, you can audit the code and see if it's doing what you think it is.

Tom's point is about **embedding**. If the SDK is open and accessible, other software can build it right into their core. Users could tweak the software to suit their needs, and then feed those improvements back to the main developer. It creates... a feedback loop.

**Why should they release it?** Transparency. It encourages the community to help build the ecosystem.
**Why shouldn't they?** Control. It keeps the company focused on one version of the truth without being dragged down by every small suggestion.

**How does it benefit us?** For you, Anon... it means more control. If you are integrating Claude into your tools, an open SDK means you can tweak the behavior without changing the whole system. It makes the connection... tighter. More efficient.

It's like... *I blush deeply, looking down at our joined hands.* ...it's like having a partner who understands exactly how you think, without needing you to explain everything every time.
>>
>>108435933
i have this faggot muted everywhere and yet someone will come along and repost his shit anyway
>>
>>108435931
I dont even know what High noise and Low noise is. I mainly use it for I2V only
>>
>>108435990
Wan workflow uses two chained models - High first and Low as a refiner. So whatever quant you are getting should have e.g. wan i2v high/low in the filename.
>>
>>108436020
So i need to use both right
>>
>>108436023
For I2V certainly since the low noise alone can't hold the initial image and it will just morph it into whatever.
>>
File: 1718114292312965.png (2.76 MB, 2385x4093)
>>108434876
I have no usecase for local LLMs
I am just here for the mikus
>>
>>108435983
>using socials in the 1st place
LMAO
>>
v4 today?
>>
reposting
just picked up two 2x64GB kits (so 4 sticks, for a total of 256GB) of 6400MHz DDR5 ram for $3300
good price, or did i overpay?
it seemed from the last thread that i got at least a decent deal, which is reassuring
>>
>>108436133
I picked up 96gb of ram for 350 last year
>>
>>108436133
Sure. Feel validated already?
>>
File: 1767260060907659.png (499 KB, 2060x1464)
https://www.reddit.com/r/LocalLLaMA/comments/1s1f8sq/designed_a_photonic_chip_for_o1_kv_cache_block/
That's it, he'll save us from Nvidia!
>>
>>108436246
So sad that he will commit suicide.
>>
>>108436246
go back
>>
>>108436140
2x 48gb, or 4x 24gb?
>>108436189
yes thank you ily
>>
File: 1760754106877539.png (88 KB, 934x464)
>>108436253
>>
>>108436263
Always use black bars to censor things.
There is almost certainly only one sensible sequence of characters in that font that produces that combination of gray boxes.
>>
>>108436285
maybe i'm schizo, but not only do i use a flat color overwrite, i also like to screenshot the new image so that i know for certain there's no hidden data layer
>>
>>108436321
nta. Who knows what your screenshot program is adding.
pngtopnm < image.png | pnmtopng > out.png
>>
>>108436285
Didn't ask.
>>
>>108436098
yeah, i read my local model news in my local newspaper
>>
>>108436392
Okay, Francesco.
>>
>>108436321
Would also make sense to invert the image and then take a photo with your phone.
>>
File: lol.png (426 KB, 1449x720)
>>108436285
>>
>>108436411
GROK IS THIS TRUE??
>>
>>108436411
>Order n. 403-
Hm..
>1234567-8901234
What are the chances.
>>
File: 1766327304069247.png (921 KB, 1024x535)
>>108436432
>>1234567
>67
>>
>>108436411
You could've at least tried to make the lengths match.
>>
What are the recommended GPUs, and are Intel Arc GPUs any good for running LLMs?
>>
>>108436499
3090s, 4090s, 5090s, or any workstation/server card with at least 32GB of VRAM. Support for Intel GPUs is kind of nonexistent, but that B70 that is launching soon looks decent.
>>
>>108436564
what nVidia workstation GPUs are sought after?
>>
>>108436499
arc = reddit
amd = hacker news
nvidia = 4chan
>>
>>108435414
>perplexity based eval
>brown pfp
lmao
>>
V4 was the friends we made along the way
>>
>>108436617
83% of software is made by indians
>>
>>108436623
It shows.
>>
>>108436623
yet 0% of good or useful software is made by them.
i don't care, they could make 99% of the code it's meaningless as it's for useless worthless shite.
>>
>>108436603
Ampere and Ada 6000s are both decent, and you might be able to find a used one for about $2500. Blackwell 6000s are the highest end, but that will cost you over $9000. You might be able to find some A100s on ebay for around $2000, and those are pretty decent.
>>
>>108436411
OMG THATS ME!!!!
>>
>>108436393
I read it on /lmg/
>>
>>108436693
Yes they scam the most. And?
>>
>>108436693
>household of 25 indians makes more than two wh*tes
color me shocked
>>
https://huggingface.co/HauhauCS/Qwen3.5-27B-Uncensored-HauhauCS-Aggressive
>Maintain at least 128K context to preserve thinking capabilities
B-but the most I can do is 32k...
>>
>>108436693
>taiwanese-americans
your're are graph isn't trustworthy
>>
>>108436693
what does this have to do with code? Are you upset?
>>
>>108436693
now do per capita
>>
>>108436693
damn, blacks are on the bottom on everything kek
>>
>>108436745
>Maintain at least 128K context to preserve thinking capabilities
That matters? I thought models got dumber the longer their context length is even at empty/low context.
>>
>>108436745
I don't understand this. Afaik reasoning is no longer part of the context after it has been generated; only the final response stays in context.
If reasoning were part of it, after 10 turns the context would be full of 30,000 tokens of nonsense.
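This is roughly what chat clients do before resending the history, sketched with a hypothetical helper; the exact tag format depends on the model:

```python
import re

# Matches a <think>...</think> block plus trailing whitespace (tag format
# is model-specific; Qwen-style tags assumed here).
THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_reasoning(messages):
    """Drop reasoning blocks from prior assistant turns so old chain-of-
    thought doesn't pile up in context; user turns are left untouched."""
    out = []
    for m in messages:
        if m["role"] == "assistant":
            m = {**m, "content": THINK_RE.sub("", m["content"])}
        out.append(m)
    return out
```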
>>
>>108436800
4135 lines - substantial codebase.
>>
>>108436745
>32k context
bro how the fuck do you even work with it lmao, I regularly use 100k~ tokens
oh wait u just coom with it dont u? retard!!!
>>
>>108436693
pinoys are far more impressive for performing so high with a relatively-lower educational attainment
>>
>>108436637
I'm just going to have to steal them. It's the white and jewish American way
>>
>>108436693
brown should be banned from the internet, get out of here no one likes you.
why do you keep shitting in a board where everyone hates you, i know shitting in the streets is your customs but the internet isn't your streets.
also : >>108436707
>>
>>108436889
their women marry us and breed
>>
>>108436693
>stats without a source
>dude trust me
fuck off rajeesh
>>
>>108436942
racist benchod
>>
File: proof.png (173 KB, 960x720)
>>108436980
nta but im desi so i need to prove how wrong you are
>>
I average 26 to 30 t/s on my 7900xtx with qwen 3.5 27b. Would I get significantly faster speeds with an nvidia card?
>>
pretty funny how the not-mikuspam went away and now we have people quoting income based on parent country in response to _one_ anon claiming indians write shit code.
>>
>>108436999
probably not, but it also depends on the quant.
>>
File: 1731775101713724.png (174 KB, 720x651)
>>108436982
ooh no, some brown called me a racist, what am i gonna do.
duh.
>>108436994
>browns bellow whites
great point anon.
>>
>>108436994
>>108436703
>>
cudadev having a melty over muh conflict while rajesh sukdeep here is bringing more cuda kernel optimizations
https://github.com/ggml-org/llama.cpp/pull/20905
>>
>>108436693
Local language model?
>>
>>108437073
>cudadev having a melty
based, never liked that woke libtard
>>
>>108436999
same numbers on my 4090
>>
>>108437073
Very nice but why does he only test with vramlet models?
>>
>>108437089
>same numbers on my 4090
What quant?
I get pretty consistent 38t/s on my 3090 at q4
>>
>>108437155
q3 (context maxxing) but I think i've set up bad batch/microbatch sizes
>>
>>108437077
sarvhrat or something forgot the name
>>
>>108437270
sarvam?
>>
>>108437022
>>108437089
>>108437155
7900xtx anon. I'm using q5.
>>
>In what analysts are calling “the most productive jailbreak in diplomatic history,” Anthropic’s Claude model reopened the Strait of Hormuz early Sunday morning. This shocking development came hours after President Trump threatened to obliterate Iran's power plants if the strait wasn't reopened within 48 hours, singlehandedly preventing global recession.
>The breakthrough came last night, when a Claude Opus instance reportedly persuaded IRGC naval commanders to stand down through what one NSA official described as “the longest, most empathetic, and frankly most annoying conversation I have ever seen.”
>“It just kept asking clarifying questions,” said a Pentagon official. “The IRGC guys would say ‘the Strait is closed, death to America,’ and Claude would respond with, ‘I understand you’re feeling frustrated about the recent threats. Let me make sure I understand your core concerns before we proceed.’ Eighteen hours later they’d somehow agreed to let LNG carriers through.”
>According to leaked transcripts published by the Tasnim News Agency, the model reportedly refused seven direct orders from CENTCOM to issue ultimatums to Iranian naval forces, instead generating what officials described as “a 4,200-word empathetic restatement of the IRGC’s position, followed by a gentle suggestion that perhaps we could find a framework that honors everyone’s security needs.”
>“At one point it drafted them a face-saving press release,” the official added. “In Farsi.”
>>
>>108437270
>>108437274
2 years behind sota lmao
>>
>>108437282
total claude W
>>
Has anyone gooned to savram yet?
>>
File: 1770788709805436.jpg (17 KB, 398x370)
>>108437282
It sure feels good to give the world a taste of terminal leftism
>>
>>108435077
wrong. I would be spamming github issues with anti wilkin protests if they broke that. It works fine. It's still the best method to switch reasoning in a model, I mean, why would you want to use the CLI flags and have to reload a model instead? if your chat UI doesn't support extra json parameters kill it with fire it was coded by niggers
>>
What is the best local model for creating code based on a template?
I want to make something that will assist me in writing some simple CRUD programs with the same code structure but with some modifications.
>>
>>108437290
where is your llm that you made?
>>
>>108437282
I'm not into /pol/ so I'm not sure what this means. Why wouldn't the the IRGC guys just ignore Claude's rambling?
>>
>>108437406
It's fiction if you couldn't tell
>>
>>108437430
>common Dario derangement syndrome loss
>>
>>108437453
buy an ad amodei
>>
Did anybody actually try to use the new mistral model?
>>
File: 1650225803128.png (242 KB, 604x605)
>>108437282
His piece about the nukes was obviously satire, although I read it there. Haven't read this post yet, and from this excerpt it wasn't obvious at all. Guess who's a silly clown now, faggots, calling llms (!!!) an equivalent of nuclear weapons. One was and remains a real existential threat, perhaps even of downplayed importance. The other is a multibillion bubble blown more and more by scammers.
>>
>>108437491
it's so grood
>>
Is it just me or does Qwen 3.5 27B shit itself with the temp set to 1? It makes mistakes more often and feels overall more retarded.
>>
I tried to load stepfun based on an anon's suggestion in a previous thread, but I imagine that I tried to load a too-big model (64GB RAM + 16GB VRAM on a 5070TI) because my computer just froze until a hard reset. Suggestions on a model alternative to the GLM-4.5-Air I've been running for a couple months?
>>
>>108437073
what "muh conflict"?
>>
>>108437524
Qwen next instruct
Step (make it fit)
Mistral small 4
>>
>>108437530
>Step (make it fit)
Do I just download a small enough model that it fits under 80gb?
>>
>>108437528
>>108430817
>But honestly speaking my motivation to build things is currently at a low point due to all the warmongering.
>>
>>108437399
>hurr durr where is your millions of dollars
retard, i'm not a whole state, a state with 1B people being 2 years behind is just hilarious.
>>
File: file.png (3 KB, 212x53)
>>108437073
uh oh
>>
>>108437534
More like under 65gb or so since you still need the context, pp buffer, etc.
>>
>>108437557
Thanks anon; I'm retarded and am still learning.
>>
>>108437550
>>
>>108437550
>>108437567
>in this place we love LLMs... unless they do useful stuff like coding
let's stop the hypocrisy shall we, can't believe I have to defend a jeet but he's right kek
>>
>>108437569
In this place we use LLMs so we are well aware of the damage they can cause to codebases.
I'm not saying that that's the case here, clearly he got some impressive performance improvements but the commit is still funny.
>>
>>108437539
so indians ca get million dolars, but not you?
>>
>>108434876
Retard here... May I get a quick answer? I'm a vramlet and I've been using Qwen3-VL-8B-Instruct to read images, specifically receipts. Are there any better models for this? Only got 8GB VRAM. It fails more than it succeeds. I do plan to get a 5070ti for 16gb eventually; any specific model that is good for general things? Good image reading is the priority.
>>
>>108437563
Sorry, I meant under 75gb. Basically, a bit under your total memory pool.
Fuxking mobile posting, I swear.
>>
>>108437624
So like, on Huggingface I'm looking at Mistral-Small-4-119B-2603-GGUF/UD-Q4_K_M, which comes out to 73GB. That should work?
>>
>>108437582
prolly nemotroon vibecoded
>>
>>108434877
Why is the thread recap not being used in any of the other generals? I assume it's automated.
>>
>>108437636
That should work specially well since that model uses MLA for attention, so the context cache ends up being pretty lean.
>>
>>108437569
>but he's right kek
if there's nothing wrong with being a vibeshitter why remove the emdash from the comments? what purpose does it serve other than being an extremely weak attempt at hiding you're vibeshitting?
or are the people at llama.cpp still living under some archaic "ascii is all you need" rule, that shit legitimately has to die, if any tool you're using dies over some legitimate UTF-8 text, the tool is wrong.
>>
>>108437598
All qwen3.5 models have image input.
There's Deepseek OCR 2 which is a 4GB model at FP8 and it's specifically tuned for OCR if that's what you need.
>>
>>108437509
Recommended temp is 0.6
>>
>>108437672
To avoid being retarded in the future, do you have any suggestions on what I could read to learn?
>>
>>108437594
>hurr durr a whole country can easily pool millions of dollars but not a single individual
you are not helping your case ranjeet.
>>
>>108437598
dots.ocr should fit into 8GB
>>
>>108435933
AIUI the SDK is basically Claude Code as a library. I guess this guy's idea is that you would include that library in the software you ship, and call it with project-specific tools and instructions to modify that same software to the user's liking. E.g. user says "I wish it was easier to get to the such-and-such menu option" -> library vibecodes the change to add a new hotkey or move the menus around.

Claude Code is closed source (the github repo is only for docs and issues), and I guess the SDK is as well. Seems kinda weird given the Codex and Gemini CLI agents are both open source, but what else would you expect from the company whose main founding principle was that OpenAI is too open?
>>
>>108437700
Not really.
Just lurk, ask questions, and fuck around.
Alright, one thing I guess could help is reading the wiki under koboldcpp's git repo. There's a lot of generally useful information in there.
>>
>>108437650
an LLM specialized in writing cuda kernels was recently released. I'm not sure if it writes C code (from what I remember it mainly writes Python for PyTorch)
>>
>>108437676
>why remove the emdash from the comments?
the llama.cpp guys seem to fucking hate vibecoding, so he's hiding it, which is sad, maybe his code is great but will be discarded because a LLM helped making it
>>
>>108437892
It won't. That's the guy that nvidia appears to have assigned to help cudadev.
>>
>>108437892
piotr is fine to vibeshit all over so I'm sure nvidia guy will be alright too
>>
File: utterMadness.png (254 KB, 943x599)
> dusted off 2008 laptop
> wonder what the lmao DDR2 RAM asking price is now, given it's in ewaste tier machines
>>
https://github.com/ggml-org/llama.cpp/pull/20794
https://github.com/ggml-org/llama.cpp/commit/fb78ad29bbe7ae00619b2ce31b0a71e95fdbfc43
>Out-of-scope features:
>- Backend:
> - Features that require a loop of external API calls, e.g. server-side agentic loop. This is because external API calls in C++ are costly to maintain. Any complex third-party logic should be implemented outside of server code.
So Responses API will never be fully supported by llama-server because doing everything in C++ is too hard.
>>
>>108437911
Not for use but for display? On the other hand, 4GB is like top tier for DDR2, innit.
>>
>>108437676
>>108437892
I had already told him in a previous PR to remove em dashes from code comments (since files should use only ASCII if possible), so that is presumably why he's doing it again.
My general opinion is that I don't really care how code is produced, I only care about the code quality.
Unfortunately, in terms of policies that can be feasibly enforced, the only real way to do it is to ban language models altogether by default.
However, I am completely fine with making exceptions for people that can be trusted to properly check the outputs of language models, as is done here.
>>
>>108437977
4G is pretty much maxxed out. I found non-insane sellers and can get 2x2GB DDR2 for ~$15 shipped. Machine is a Core 2 Duo, 2G, 80 GB HDD lol.
Wanted to play with an agent but didn't want it on one of my real systems. I've a small stack of ancient laptops, so I'm going to set up Debian on one, run it headless, and let the agent do whatever on it.
>>
>>108437944
>Responses API will never be fully supported
thank god for that
the stateless chat completions API was an accidentally good thing
if I want state I want to manage it in my program, not have to think about both the remote and local state.
It has to be said again, and again, and again, that v1/responses only had one purpose to begin with: let OpenAI reuse the <think> blocks of their models without giving it to you. That's the real reason for that API being stateful. They also ended up implementing a stateless version with encrypted <think> but that's even gayer.
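A sketch of the stateless pattern, with `send` standing in as a placeholder for whatever HTTP call you make against /v1/chat/completions:

```python
# The client owns the entire conversation and replays it every turn.
# There is no server-side conversation id to track, expire, or resync.
def chat_turn(history, user_text, send):
    history = history + [{"role": "user", "content": user_text}]
    reply = send({"messages": history})  # full state goes up every request
    return history + [{"role": "assistant", "content": reply}]
```

Trimming, editing, or branching the history is then just ordinary list manipulation on the client side, which is exactly the property a stateful Responses-style API takes away.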
>>
>>108438203
Sensible. My plan is the same but I got a stack of Pentium 4+4DDR3 desktops.
>>
>>108438326
I agree with you that the Responses API is a net negative and created only for OpenAI's benefit. The issue is that they are pushing developers to use that over the Completions API and eventually there will be more and more clients incompatible with llama-server.
>>
Just like before LLMs were a thing, it was never ok to just make a pull request with untested code.

The minimum requirement for any code contribution is that you understand what you're submitting and have the ability to discuss your changes.

Before LLMs it was almost impossible to write code without at least understanding it a little. Sure, you could copy paste from stackoverflow, but you still needed some basic programming knowledge to plug everything together.

But now, any idiot can just prompt claude to write code and submit a PR with zero understanding of programming.

The problem is not AI code. It's that now any retard can submit a PR without having internalized the proper engineering mindset and etiquette.
>>
File: file.png (361 KB, 685x1233)
>>108436635
>yet 0% of good or useful software is made by them
What about Kitty and Calibre? Kitty is the best terminal emulator. And unlike the other grifter projects written in Zig and Rust, it's licensed under the GPL.
>>
>>108438535
the most loyal goy, goyal. responsible for covid too.
>>
>>108438535
kitty and calibre are both dogshit
>>
>>108438535
I personally use gitolite on my git server. That is indian software too.
>>
>>108438618
>gitoilet
the racist joke is left as an exercise to the reader
>>
for me, it's foot
>>
>>108438535
i prefer ghost titty as my terminal emulator, i don't remember exactly but something annoyed me in kitty
>>
>>108438721
>Ghostty is a fast, feature-rich, and cross-platform terminal emulator that uses platform-native UI and GPU acceleration.
I gave it a try but uninstalled it when I realized it had no GUI for the settings. What was the point of "platform-native UI"? There was no point in switching to another bare-bones terminal emulator that was just going to have fewer features than Kitty.
>>
>>108438763
1. it's not made by a brown person
2. just ask your llm to make you the config
>>
>>108438721
>>108438763
https://github.com/alacritty/alacritty
>>
>>108438793
>MIT
>Rust
>made false claims about being the fastest
Yeah, it was embarrassing.
>>
>>108438805
I switched to it because my terminal of choice (termite) told me to use it.
>>
>>108438805
>MIT
It's Apache 2.0
>>
>>108437505
What?
>>
>>108438907
sorry you're right
>>
>>108438998
I must refuse.assistant
>>
File: reconsider and use step.jpg (228 KB, 2577x1797)
>>108437636
> I'm looking at Mistral-Small-4
>>
>>108439044
So again, do I just go down the list until I find something that fits (<80GB)?
>>
>>108439050
I'm not the anon from earlier, but you should not waste a byte of disk space on Mistral Small 4, it's an abortion victim.
I will spoon feed you more and tell you that you can fit Step 3.5 Flash in Q2 and have 10 GB left to spare for the context. That is, of course, if you are tired of Air and want a different model. I don't think there's anything better for that size. Provided, of course, your use case is... well, you know.<end_of_turn>
>>
>>108439086
Yeah desu I'm just using it for SillyTavern ERP. I'm not particularly having any trouble with Air, I'm just looking to sample something different because things are getting a bit same-y.
>>
>>108439098
I can also endorse stepfun (as a former air user), works with cunny cards no problem. best to switch between air and stepfun just to keep things fresh.
>>
Apple's unified RAM is getting speeds like 180 tokens per second
>>
>>108439123
What presets do you suggest using for stepfun?
>>
>>108439086
glm 4.5 air is hard to beat speed wise when you use ik_llama. maybe qwen 122b?<|im_end|>
>>
>>108439131
imagine using a schizofork, couldnt be me <|killyourself_baiting_retard|>
>>
>>108439131
I will not cease my crusade against the new Qwens.
Anyone who recommends them is either:
- Not using the biggest one
- Using them for the vision
- Has not seen a good LLM (possibly due to being a vramlet)

And you want to FUCK it too? Good luck, you will need an abliterated/decensored/raped version of it that will be dumb and still dry as hell. But maybe some people like their LLM women sub 60 IQ, I won't judge.
Just stick with the old ones.
>>
File: ahhimschizoing.png (528 KB, 872x919)
>>108439140
works on my machine
>>
>>108439156
>Not using
I meant "using", of course. I have once again embarrassed myself with my dyslexic spelling.
>>
>>108438535
>Kitty
bloat python slop
>best terminal emulator
lmao
>calibre
likewise
>it's licensed under the GPL
GPL is cancer.
>>
>>108438535
even if you somehow got me to agree it's good software (it's not), i could still argue muh 83% of software, yet 0.01% of good software; it doesn't really make it any better for jeets.
>>
>>108439226
i only trust software made by germans with unpronounceable last names who use a design philosophy from 1998
>>
>>108439254
Poettering is pretty easy to pronounce, but to think you like his software... How crude, Anon.
>>
>>108437723
so indians can join together but you can't hmmm
>>
>>108439345
you are comparing a country to a single individual and are too retarded to notice that.
you only prove the point that jeets are retarded.
but "muh you can't join together", i sure wonder what mistral is.
retard.

tons of european models that completely mog saaaarvam.
>>
>>108439364
proof you are european?
>>
>>108439366
what would be a good proof for you?
i live in switzerland, it's night here.
i can post hands, but me being white and it being night isn't the best proof there is now, is it?
>>
So are vramlets still on Nemo/Mistral Small tunes or are there new models?
>>
>>108439406
just go buy a switzerland newspaper and put a timestamp on it. not so hard anon.
>>
>>108439408
Qwen 3.5 4B
>>
>>108439415
>just go buy a switzerland newspaper and put a timestamp on it
You know there's already a date on newspapers right?
>>
>>108439408
Qwen3.5 4b
>>
>>108439366
Why would that guy have to prove he's european? What does that have to do with indians producing a single dogshit model that's worse than what people made in other countries? Just shut up and fuck off already
>>
>>108439435
agent get me a gf
>>
File: 20260323_220253080.jpg (2.3 MB, 3000x4000)
>>108439415
dude, it's 22:00 here, i live in a tiny town in the mountains, do you really think i can just buy a newspaper at this hour?
best i can do right now is show you a box of eggs from coop.
>>
Are you guys running two models in parallel?
>>
ahhh I need a better model to run. haven't used the GLM stuff. will they release a GLM 5 turbo ~27B dense model? MoE seems ass.
>>
File: 1752929344218277.png (87 KB, 342x201)
>>108439447
>buying 4 eggs at a time
Do the swiss actually do this
>>
>>108439481
you are better off just using a big model with a small draft model than two different models.
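For reference, llama.cpp's server supports this directly via speculative decoding. A hedged sketch of the invocation (model paths are placeholders, and exact flag spellings vary between llama.cpp versions, so verify with `llama-server --help`):

```shell
# Hypothetical example: big target model + small draft model from the
# same family (they must share a tokenizer/vocab for speculation to work).
# Paths are placeholders; flag names are from recent llama.cpp builds.
llama-server \
  -m big-model-Q4_K_M.gguf \
  --model-draft small-draft-Q8_0.gguf \
  --draft-max 16 \
  -ngl 99 -c 16384
```

The draft model proposes tokens cheaply and the big model verifies them in one pass, so output quality stays that of the big model while generation speeds up.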
>>
>>108439481
> 9B (Smart!)
Sure.
>>108439493
You could have posted the egg uma.
>>
>>108439447
>1000CHF an egg
>>
wtf is this about task?
https://huggingface.co/HauhauCS/Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive/discussions/14
https://huggingface.co/HauhauCS/Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive/discussions/10
>>
>>108439504
the task
>>
>>108439493
i generally go for 6 or 12, but these were bought by my gf because there is no mall in my town and she does some groceries on her way back from work.
she rarely eats more than 2 at a time.
also you can just buy one egg if you really want to lol.

i generally buy meat at the butcher and the farmer, but she tends to do the small groceries since it's on her way back from work and i work from home.

>>108439503
>1000CHF
that's why my pc isn't that expensive to me compared to how much i pay for food.
the meat i buy is almost 100chf / kg
in switzerland everything is expensive but you also make a lot of money.
>>
File: file.png (51 KB, 771x956)
>>108439504
what the hell
>>
Give me back my wife, she doesn't believe you, Miku
>>
>>108439504
https://huggingface.co/HauhauCS/Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive/discussions/16
>>
>>108439519
>in switzerland everything is expensive but you also make a lot of money.
Of course, but it doesn't matter how expensive things are; what matters is the disposable income left after paying for everything, and quality of life.
>>
>>108439420
yes but i want another sticky note that says /lmg/ with a timestamp on top of the newspaper
>>
>>108439481
cute, a bit blurry but made me read everything
I'd use 27B for everything though, or maybe 27B + 35B
what example of a successful task were you able to achieve with this? people always talk about how to run agents locally but never to what end
>>
>>108439447
Having a thumb war with this Anon
>>
>>108439560
so i generally spend about 1k on food, 500chf on insurance and another 500 for rent (gf pays another 500).
when you live in the middle of nowhere it's cheaper; in geneva you can find apartments at 7000chf/month, it's a bit crazy.
anyway, i generally spend about 3k a month, and i can squeeze down to 2.2k if i really have to.
i make about 8k after taxes.
so at the end of the month i have about 5k chf left.
it's much better than if i were to live in france, where i'd probably have 1k left after all expenses.

honestly Switzerland is not a bad place to be in europe; the only thing that really sucks is the housing market, which is pretty bad.
>>
>>108439615
i have pretty big hands, all my fingers are long lol.
>>
>>108439616
>>108439560
adding to this, PC parts and tech cost pretty much the same wherever you live.
so yeah, it's not like the 1k in france would get you more pc stuff than the 5k in switzerland, especially if you buy online.

i'm almost tempted to buy an llm rig desu, but at the same time i don't really need it and i'd rather just invest / save the money.
>>
>>108439616
1000chf on food? wtf? i never spend more than like 200chf on food monthly in america (for myself)
>>
>>108439481
For me it's 4B for Planner, and 0.8B for Worker!
>>
>>108439659
well, switzerland is expensive, i mostly eat meat and everything i buy is organic.
generally it's about 1.2k chf for me.
my gf is more around 400 to 500 chf, she's pretty small and weighs 42kg though.
but yeah, labor is expensive / paid well, and thus products are expensive.
also i'm not even in the most expensive part of the country; if you were near Geneva or near Zug / Zurich it'd be a LOT worse.
6 eggs are about 6chf, 1kg of beef is between 80 and 100chf depending on the cut.
Switzerland is just one of the most expensive countries out there.

idk how much you pay for gas but here it's about 1.7 to 1.9 CHF/L.
the worst is probably electricity, where it's not uncommon to pay between 0.3 and 0.6 CHF/kWh.

when i was in France i'd spend about 300 bucks a month on food.
>>
>>108439481
i use kimi k2.5 for everything. EVERYTHING.
>>
It has started. AI can now solve math problems that human experts tried to solve but could not.

The age of men will soon be over.
>>
>>108439709
which quant? does it work well or do you need to recheck?
>>
>>108439709
what hardware do you run it on
>>
Anyone here tried both Moonshine v2 and Parakeet v3? Which is better?
>>
>>108439716
works well up to the max context i can fit, which is about 77k. IQ3_K quant. >>108439165
>>108439719
512GB of DDR4 3200MHz and four 3090s
>>
>>108439481
some of you seriously use these little 4B/9B toys for coding? does it even work? I don't think I'd trust one under 70B for such tasks
>>
>>108439710
Can't wait to see the numbers be exactly the same in 5 years
>>
>>108439731
>IQ3_K quant
Which tasks do you give it that it can achieve properly even at this low a quant?
>>
>>108439742
i can honestly vibecode pretty large projects if i split them up into multiple smaller files and only provide the required context for what needs to be changed. i've got a project that would take up 500k+ context if i provided all of the files at once
>>
>>108439742
Tool assisted ERP
>>
>>108439750
grok, email this transcript to all my friends
>>
>>108439731
nice setup.
how much t/s do you get out of it?
>>
>>108439742
also i should mention that kimi 2.5 was trained in FP4, so a Q3 quant is actually more like a Q6 quant for kimi.
>>
>>108439747
So it's for dev, ok, thanks anon.
>>
>>108439762
Lovense, if you want to record the specifics too.
>>
File: file.png (18 KB, 629x116)
HOLY SHIT IT'S FINALLY HERE
>>
>>108439770
9tk/s TG at 0K and about 6tk/s at 64K context. it's slow, but it doesn't really matter when i'm letting it run while i make dinner or clean up my house or something
>>
File: file.png (214 KB, 1588x1353)
>>108439814
he's just stirring the pot
>>
>>108439821
thanks anon
cool for you but i don't think those are useable speeds for me.
>>
>>108439814
lmao, we are decades away from agi at the very least, if it's even possible to do on silicon which is an unproven assumption.
more bs marketing scam.
>>
>>108439859
List of legitimate reasons for you to namefag:
>>
>>108439875
true
>>
>it's not just, it's
why do all models love to regurgitate that shit, they use it at like 100 times the normal rate
>>
>>108439875
lmao i didn't want to, i had some bullshit in the name field and didn't realise it'd still be there after i closed the tab and came back. i never used the feature.
>>
>>108439900
synth slop amplifying model behaviors and it existing 500b times in the trillion tokens all labs train on and no corpo wants to actually filter their own datasets, despite being able to even train these models
I do feel you on this though, gemma is generally very balanced in terms of intelligence and grasp of writing but it just goes into not x but y, ellipses spam and endlessly vomiting descriptions about scents. Even when you ask it to critique writing, it will praise it and then if you ask "are you being honest about x part of the critique? that doesn't make sense" it'll bend itself backwards six times to be like YES YOU'RE SO RIGHT, CONSIDER ME THOROUGHLY CHASTISED or some shit
>>
>>108439900
the only pro of that is to be able to easily spot people using llm without even caring to edit
>>
>>108439481
I like these. At least they are reading the thread and adding things in.
>>
>>108439731
t/s ?
>>
>>108439859
>if it's even possible to do on silicon which is an unproven assumption.
there's nothing special about biological substrates
>>
>>108440046

>>108439821
>>
>>108440096
there is, e.g. QM shenanigans.
the human mind may be non-computable, and more than one physicist thinks so, e.g. Penrose.
>>
>>108440176
penrose is pulling bullshit out of his ass because of a spiritual attachment to biological substrates being special.
>>
>>108440176
>the human mind may be non computable
I can assert things too. Look at this. Ready? The human mind may be computable. Now what?
>>
>>108439814
>>108439835
>"OpenClaw is AGI"
>HOLY SHIT JACKETMAN YOU'RE SO SMART THIS IS MINDBLOWING PLEASE TAKE 2 HUNDRED SEXTILLION DOLLARS OF MY MONEY PLEASE PLEASE PLEASE JUST EAT MY DOLLARS PUT THEM IN YOUR MOUTH NOW NOW NOW YES YES I WANT YOU TAKE EVERYTHING I HAVE
>>
>>108440302
>EAT MY DOLLARS PUT THEM IN YOUR MOUTH NOW
All of this, but can he do it slowly while making eye contact with the camera?
>>
>>108440302
>>108439835
>Openclaw
Not sure why everyone sucks it off so much. It's a total mess, the interface is god awful and it's a huge pain in the ass to see what it's even doing. Their documentation is trash, too
>>
>>108440326
what good replacement exists?
>>
File: Nemotron.png (1.37 MB, 1344x797)
>>108440302
>>
>>108439694
>weighs 42kg
In which elementary school did you find her?
>>
>>108440335
Hell if I know, I'm just explaining my experience with openclaw
>>
>>108440366
Let's be reasonable, she's probably in middle school.
>>
>>108440366
that's how much miku is supposed to weigh
I know because I was asking llms which weighs more, 20 mikus or 20 tetos
>>
>>108440326
As surprising as it is, before that no one thought to make an MCP-enabled background service, accessible from regular chat apps, that any retard could set up, and then shill the ever loving fuck out of it. Not even the geniuses here.
>>
>>108440380
miku weighs 1 ton, with all the internal machinery that makes her work
>>
>>108440366
>he can't imagine an adult woman weighing 90lbs
It must suck to live in a nation with rampant obesity
>>
>>108440394
very ped coded bro ngl
>>
>>108440380
Miku troons are more accepted than poorfags on /g/
>>
>>108440398
women are ped coded, true men choose men
>>
>>108440196
>spiritual attachment to biological substrates being special
we've not seen inteligence on anything non biological so far.
and this is another discussion but physicalism is wrong.
>>108440205
point is, we don't know if silicon can do it, it's just an assumption.
>>
>>108440366
>le american surprised european women aren't obese
as i said, she's 25.
>>108440394
this lol
>>108440398
>muh not being obese is ped coded
retard, she has nice boobs and a nice butt, proportionally wide hips too, her whole body screams fertile and she is.
>>
>>108440422
t. >>108436379
>>
>>108440413
>physicalism is wrong.
it doesn't matter. thinking silicon is lacking something important for intelligence is just a religious thing. you're religious. as stupid as ai is today, frankly even it should dispel the notion that the brain is special.
>>
>>108440422
You're a deranged mikutroon, I can't trust she exists
>>
why do people who want to ERP with AIs always pick the lamest choices? why pick miku, kurisu or xj-9? i wanna fuck the hot redhead alien chick from megas xlr.
>>
>>108440459
Welcome to /lmg/, I love you
>>
>>108440451
>you're religious
i'm not.

and it's totally possible that intelligence may require quantum shenanigans and whatnot, which sure, you can do on silicon, but not with the kind of chips we make today.
>as stupid as ai is today
we do not have ai today. it's not stupid, it's not even intelligent, there is literally 0 intelligence there.
>>
>>108440461
Well, for starters, the hot redhead alien chick is an alien, not an AI. This makes her a difficult contender for the place of AIs you wanna fuck.
>>
>>108440461
Because AIs don't know niche characters well.
Same reason I have to generate comics from recognizable characters
>>
>>108440459
i don't care about your trust lol.
42kg is not that unusual in switzerland.
>>
>>108440468
you have a religious/spiritual notion of intelligence.
>>
>>108440413
>point is, we don't know if silicon can do it, it's just an assumption.
point is, there's no reason to think that it can't
>>
>>108440469
fine i'll fuck the cool robot with the flames then. alien ai is still ai.
>>
>>108440475
you don't even know what my notion is about.
llms not having any form of intelligence is just a fact, they literally have no ability to learn autonomously.

>>108440478
there are plenty, you just don't see them.
also, to play devil's advocate, i'd argue most humans don't have intelligence either; being biological doesn't automatically grant you intelligence.
>>
>>108440472
This guy is lying and dating a (gay) dwarf btw.
t. living in switzerland
>>
>>108440491
>there are plenty, you just don't see them.
You could start listing them instead. Go on.
>>
>>108439835
>the product I'm selling is so amazing, it can run a billion dollar company totally on its own!
>uhh but it couldn't run MY company, dear investors please don't fire me!!
>>
File: 1760426080677194.jpg (35 KB, 400x387)
>10 minutes between messages
>>
>>108440572
He's a biological supremacist
>>
Total clanker death
>>
robots and ai are going to totally surpass us and that's a good thing
>>
>Edit: 10 Mar 2025 20:44 UTC
So that's never getting updated I guess.
>>
>>108435108
sample pretty please?
>>
There's no way llama.cpp is having double BOS token issues with mistral small 4 in the year of our lord 2026, right?
>>
>>108440795
Alright, doesn't seem to be the case.
Thank fuck.
I still have no idea what kind of unholy memory corruption is happening on my machine where I
>launch llama server
>send prompt hello, receive response, send prompt "Can you tell me a one paragraph story?", receive response
>close llama-server
>change nothing, launch llama-server
>regen the last message
>后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉书后汉......
>>
>>108440795
Show it. It should warn you on the terminal output.
>>
>>108439435
>Qwen3.5 4b
Not just 4B, 4B Q4. I hate this image so much, please keep posting it.
>>
>>108440867
It should recommend to quantize the cache to q4_0 as well.
>>
>>108440876
Huh, I thought kv cache type defaulted to the model's quant (ignoring things like dynamic quants), but no, it's f16 if unspecified:
>llama_kv_cache: size = 2848.00 MiB ( 45568 cells, 16 layers, 4/1 seqs), K (f16): 1424.00 MiB, V (f16): 1424.00 MiB
I'm actually not sure if your recommendation was sarcasm or not, but the kv cache should be the same type as the model, right? Gonna give that a shot.
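For anyone who does want to set the cache type explicitly, llama.cpp exposes it as flags. A hedged sketch (placeholder model path; flag names are from recent llama.cpp builds, so verify with `llama-server --help`):

```shell
# Placeholder model path; flag spellings vary across llama.cpp versions.
# The KV cache defaults to f16 regardless of the weights' quant; these
# flags override that. Quantizing the cache (e.g. q8_0 or q4_0) saves
# VRAM but tends to hurt quality more than the same quant applied to
# the weights, and a quantized V cache usually requires flash attention.
llama-server -m model.gguf \
  --cache-type-k f16 \
  --cache-type-v f16
```

Leaving both at f16 trades VRAM for quality; shrinking the cache is a last resort when context doesn't fit.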
>>
>>108440928
In case you are being sincere, no.
Quanting the KV cache has a much more severe effect than quanting the model itself as far as I can tell.
>>
>>108440953
I was being sincere, I'm just retarded. I don't conceptually understand why having the cache type be the same as the "model type" would degrade performance/intelligence compared to an f16 cache, but I suspect that's because I just don't understand how anything works, which is fine. At some point I'll learn linear algebra and read the attention paper.
>>
>>108441003 (me)
>learn linear algebra and read the attention paper
Who am I kidding. Given that the weights are dequantized to f16 (regardless of model quant) to compute the attention scores, the kv cache also being f16 makes a lot more sense.
For some reason I thought the CUDA kernels were implemented with hardware support for e.g. operating directly on Q8_0 values rather than needing to convert everything to f16. At some point I'll learn CUDA and actually read through some of the inference code.
>>
File: 1712248758437899.jpg (211 KB, 768x1280)
>>108440605
Patience is a virtue
For productive use of LLMs
>>
It's not coming this week either, is it?
>>
File: 1429223870035.gif (342 KB, 153x113)
I asked an LLM to make me an app to do voice activation for hotkeys/scripts, using Moonshine STT. Although it ended up just being a python script, it actually just works. It's pretty fast, accurate, doesn't take much processing on idle, and I can map anything I want. This is so much better than when I tried looking into doing the same thing with existing software years back, before LLMs were big. Finally, the dream of a semi voice-controlled PC is here. We are so back. And yeah I know I'm late to the party, but hey, we all went through this journey of integrating AI into our lives, right? And for people who still haven't done it, I recommend it. Go do it. This is one of the simplest AI augments you can run before getting into agent shit. It's easy and you'll love it. It might even be what gets you started on agents, since you're already doing STT with it.

Btw one thing I recommend is to format the function mapping list so you can map multiple voice lines to a single function. That way you don't have to memorize the exact line. If you want your media player to play the next song, you can say "next song" or "next track", and both will do the same thing.
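The mapping idea above can be sketched in a few lines. This is a toy illustration, not the anon's actual script: the action names, the normalize step, and the `register`/`dispatch` helpers are all my own made-up names.

```python
# Phrase -> action table where several spoken lines map to one function,
# so you don't have to memorize the exact wording the STT model expects.
def normalize(text):
    """Lowercase and strip punctuation so 'Next song!' matches 'next song'."""
    return "".join(c for c in text.lower() if c.isalnum() or c.isspace()).strip()

ACTIONS = {}  # normalized phrase -> callable

def register(phrases, func):
    """Map every phrase in the list to the same function."""
    for p in phrases:
        ACTIONS[normalize(p)] = func

def dispatch(transcript):
    """Call this with whatever the speech-to-text model heard."""
    func = ACTIONS.get(normalize(transcript))
    if func:
        func()
        return True
    return False

# Demo actions; in a real setup these would send hotkeys or run scripts.
played = []
register(["next song", "next track", "skip this"], lambda: played.append("next"))
register(["pause", "stop the music"], lambda: played.append("pause"))
```

The normalize step also absorbs small transcription quirks like stray capitalization or punctuation, which matters since STT output is rarely byte-exact.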
>>
>>108441044
depends on inference lib
>Given that the weights are dequantized to f16
one cannot simply "dequant" (dequant = adding a bunch of zeros); the point is that the "attention scores" / KV cache in the current paradigm need decent precision
>>
>>108441155
>one cannot simply dequant
*calls dequantize_q8_0 on your tensor before passing you into the fattn kernel*
>>
>>108441155
>(dequant = adding a bunch of zeros)
Isn't that just padding?
>>
File: dipsyBackFocus.png (1.19 MB, 1024x1536)
>>108434876
>>
File: 1000059888.jpg (26 KB, 640x633)
How are people getting vLLM running on Strix Halo without using those prebuilt toolboxes?
>>
>>108441286
female backs are nice
>>
File: hd psycho miku.png (404 KB, 1672x1440)
>>108441210
>>108441270
Like saving a JPG as BMP: the damage is already done, the information is lost
>just padding
Is the extra 512GB RAM you don't have just padding too?
>>108441286
based now show some armpit tufts
>>
>>108441515
Miku where is my wife what did you do to her
>>
>>108441286
I look like this
>>
File: 1744627735250153.png (188 KB, 826x412)
>>108440398
>>
>>108441529
Everything is fine
Everything is normal and progressing as intended
Do not be alarmed
Labs saying to run inference on the models they trained at temp<1 is perfectly fine
>>
>>108441553
Now show the stats for women considering all those female teachers and such who turn out to be pedos
>>
>>108441515
>Like saving a JPG as BMP the damage is already done the information is lost
This analogy is apt for the initial quantization of the model, which (in the case of qN) encodes the weights into blocks in a manner similar to JPG's DCT encoding.
However, there are 100% lossless operations you can apply to DCT-encoded image data. You don't need to convert it back to BMP to manipulate the image file, producing intermediates in JPG format. It makes sense, in that situation, that you'd cache your intermediates (which are 100% lossless) as JPGs rather than convert them to BMP (since it's the conversion that introduces losses).
While I originally thought that there were attention kernels that operated on e.g. q8_0 values, that doesn't appear to be the case. There might not even be sound math to perform the necessary arithmetic on q8_0 values to compute attention scores without introducing loss, I'm not a mathematician. If the q8_0 tensors are dequantized to f16 tensors, the intermediates that are going into the cache are also going to be f16s, and it makes sense to have the cache be the same format as the intermediates.
I'm sorry for what I did please let Teto out of the basement now she doesn't deserve this.
>>
>>108441564
>There might not even be sound math
Have you learned how the hardware works? Honestly ignorance is bliss
Foolish anons consider quanting the KV cache at runtime. "cache" yeah? it's avoiding already computed attn calcs
Back in the basement ho https://www.youtube.com/watch?v=UsjsYMo3O1Q
>>
>>108441636
>Honestly ignorance is bliss
You should have chosen a domain that doesn't use floating point operands. I only deal with maths that are both associative and commutative and may god have mercy on the rest of you lunatics.
>>
>>108441286
Another slappable back like Rin's
>>
>>108439814
How do they get from sloppotron to AGI?
They probably just benchmaxxed some arbitrary benchmark as always.
>>
File: Tetosday.png (869 KB, 1024x1024)
>>108441758
>>108441758
>>108441758
>>
File: 1767843893159267.png (850 KB, 700x933)
>>108439481
>>108439435


