/g/ - Technology


Thread archived.
You cannot reply anymore.




File: 1768691509358825.jpg (203 KB, 832x1472)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108088802 & >>108078850

►News
>(02/06) KugelAudio-0-Open: Multilingual TTS based on VibeVoice 7B: https://hf.co/kugelaudio/kugelaudio-0-open
>(02/06) Step3.5 Flash support merged into llama.cpp: https://github.com/ggml-org/llama.cpp/pull/19283
>(02/04) Voxtral Mini 4B Realtime 2602 released: https://hf.co/mistralai/Voxtral-Mini-4B-Realtime-2602
>(02/04) Intern-S1-Pro 1T-A22B released: https://hf.co/internlm/Intern-S1-Pro
>(02/03) MiniCPM-o-4.5 released: https://hf.co/openbmb/MiniCPM-o-4_5
>(02/03) ACE-Step v1.5 released: https://hf.co/ACE-Step/Ace-Step1.5

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: mikuthreadrecap.jpg (1.15 MB, 1804x2160)
►Recent Highlights from the Previous Thread: >>108088802

--Papers:
>108097593
--llama.cpp vs exl2/exl3 performance and batching capabilities:
>108090900 >108090911 >108090935 >108090946 >108090950 >108090953 >108090963 >108090967 >108090959 >108091150 >108091271 >108091305 >108092024
--Debating Q8_0 quantization for embed/output weights:
>108094216 >108094256 >108094391
--Qwen3.5 added to Transformers:
>108090439 >108090484 >108090575 >108090582
--Debating Kimi Linear's scaling struggles:
>108092584 >108092626 >108092647 >108092632 >108096684
--Kimi 2.5 quantization and safety alignment:
>108095506 >108095542 >108095565 >108095597 >108095628 >108095906 >108095917 >108095661 >108096044 >108096173 >108096203 >108096239 >108096258 >108096288 >108096333 >108096483 >108095880
--Prefill functionality removal in chat completion mode:
>108090683 >108090705 >108090721 >108090733 >108090998 >108091636
--Debating engrams architecture tradeoffs for smaller models:
>108092588 >108092874 >108092649 >108092676 >108092730
--Debating the absence of mid-sized models between 70-150B:
>108094988 >108095069 >108095193 >108095302 >108095373 >108095380 >108095092 >108095094 >108095166
--Qwen3.5 dense and MoE support (no vision) merged:
>108096732
--Qwen3-TTS implementation request in llama.cpp:
>108091039 >108091068 >108091087 >108091112
--DeepSeek V3.3 engrams and local usability concerns:
>108092723 >108092760 >108092769 >108092777 >108092795 >108092858 >108093031 >108093107 >108093142 >108093149 >108093162 >108093225 >108092783 >108092829
--Qwen3.5 support PR for llama.cpp opened amid vibecoding debate:
>108093867 >108093992 >108094194 >108094211 >108094581 >108094825 >108094859 >108094911
--AesSedai releases updated Kimi-K2.5-GGUF mmproj files:
>108094276 >108094298 >108094396 >108094355
--Miku (free space):
>108090041 >108090082 >108090852 >108097870

►Recent Highlight Posts from the Previous Thread: >>108088809

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
Anons, if I wanted to test a model fully on a rented instance, what is the recommended online service that does that relatively well and cheap?
Just vast.ai?
What's your strategy to not spend on just hosting safetensors?
>>
File: ylecun.jpg (222 KB, 1200x1271)
I like my LLMs how I like my women.
>>
GLM 5 will change everything. Why else would they be trying to suppress it so much?
>>
File: 1770611671589.jpg (128 KB, 588x492)
>>108097983
kek
>>
How do I use 2 of my gpus for the cpumoe portion, and the other 3 gpus for the ram spillover? If I have a 500gb model: 10gb shared experts + context on gpus 1 and 2, 400gb on ram, and 90gb on gpus 3, 4, and 5?

Is that even a thing?
>>
File: 1770612099424.png (932 KB, 747x1024)
wowking hawd for few more billions UwU
are you guys excited for next 'toss models
>>
>>108098033
Yes, use --override-tensor for that.
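A rough sketch of what that looks like (not a drop-in command: exact tensor names and layer ranges depend on the model, check the names llama.cpp prints while loading, and the override patterns are matched in the order you pass them, so the catch-all CPU rule goes last):

llama-server -m model.gguf -ngl 999 -ts 1,1,0,0,0 \
  -ot "blk\.([0-9])\.ffn_.*_exps\.=CUDA2" \
  -ot "blk\.(1[0-9])\.ffn_.*_exps\.=CUDA3" \
  -ot "blk\.(2[0-9])\.ffn_.*_exps\.=CUDA4" \
  -ot "ffn_.*_exps\.=CPU"

-ts keeps the non-expert weights (attention, shared experts, KV cache) on GPUs 1 and 2, the per-range -ot rules park those layers' routed experts on GPUs 3-5, and everything else matching ffn_*_exps spills to system RAM. Widen the ranges until cards 3-5 are full.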
>>
>>108098033
you would need to write a custom offload config
>>
simple and clean
is the way that
your making me
feel
tonight

its hard
to let it
go
>>
>>108098058
>>108098062
It's over for me. I'm too stupid to write it myself, and I can't run the llm to write it for me.
>>
>>108098036
Excited for the Google snow bunny
>>
>>108096964
WHY DO THEY WASTE TIME ANSWERING?
JUST FUCKING BAN THEM
I'm reading every day about github automated bot retardation and maintainer AI fatigue, no wonder they're all tired with how much time they waste answering them.
>>
What is the current best UI for AI voiceovers? (replacing an existing voice in an audio file with another one).

I'm using RVC GUI because I saw it in a guide but I can't help but think it's outdated considering said guide is from 2023.
>>
What models should I use if I want something poor that can barely keep an RP going and produces schizo responses that are so bad that it makes them hilarious?
The "good", non-local models (DeepSeek, Claude etc.) are genuinely too high quality for their own good, even when prompted to act like a poor model, they're too coherent and don't make linguistic mistakes outside of some stereotypical ones.
>>
>>108098197
I don't think you can actually ban people from repos
>>
>>108098319
https://docs.github.com/en/communities/maintaining-your-safety-on-github/blocking-a-user-from-your-organization
>>
>>108098294
What if you just increase the temp by a lot?
>>
>>108098294
grab some 8B model
>>
>>108098294
in 15 years the response will be "just emulate an old model from 2026 bro"
>>
[Model] Qwen3.5 dense and MoE support (no vision) (#19435)
>I've gotten a bit tired of Llama.cpp missing all the zero-day releases, so this time I decided to make (or, more precisely, instructed Opus 4.6 to make, based on reference implementations and my guidelines for model adaptation) a conversion based on the Transformers PR
vibecodechads, we eating good!
LMAO!
>>
>>108098422
if the comparison is between vibe coded and nothing, I'll take the vibe coding
>>
>>108098337
That makes them garbled, unreadable nonsense. As opposed to something semi-coherent, grammatically incorrect, yet still understandable while completely ridiculous.
>>108098356
Any particular recommendations?
>>
I updated Sillytavern after not having updated it for probably at least a year and it's like it has lobotomized itself. My generations are all fucked up, models that would stream text instantly now load choppily and retardedly.
>>
>>108098294
Pygmalion 6B
>>
>>108098504
>he pulled
>>
I wonder if I could make an llm pretend to be my girlfriend.
>>
>>108090911
for chatting exl2/3 is ok, but for the life of me I cannot get tabbyapi to work reliably with tool calling.
Has anyone managed to get an exl model to work with something like Opencode?
Llama.cpp/ik_llama.cpp seems to work much better for tool calling.
>>
>>108098773
>girlfriend
I make them my rape victims
>>
>>108098036
>are you guys excited for next 'toss models
google won't redeem the gemma so how could openai want to redeem the toss
>>
>>108098773
Not really, unless you're an indian and thus have an IQ below 80
>>
>>108098485
vibecoding has a negative effect on later implementations though. If anyone had shown interest in a proper implementation before, they'll now see the piece of garbage, go "I'd rather not touch that code", and leave.
>>
>>108098825
issue with that is that it relies on a very low probability thing happening at an indeterminate time in the future.
Could someone theoretically show up? Yes.
Will they? No.
>>
>>108098825
It's a self-solving problem.
As the proportion of the codebase that is AI-generated increases, people who shrink back at the sight of vibecoded implementations will leave.
Eventually only vibecoders will remain and their agents will iterate on the code until a proper implementation is achieved.
>>
File: bweh.png (123 KB, 867x591)
>>108098422
>>
File: 1770623976048.jpg (1.1 MB, 1528x1363)
pov: anon made his own quant
>>
File: 5802960.jpg (10 KB, 320x320)
>>108098895
believe in
>>
File: file.png (36 KB, 513x88)
Memes write themselves.
>>
>>108098896
I'm sore >>108094391
>>
>>108098895
uh oh stinky!!!!!
>>
>>108098895
way ahead of vllm and transformers lmao
>>
>>108098895
>vibecode an open pr on VLLM/transformers to llmao
>merge it before it's merged on the upstream projects (so not a finalized PR)
>break llmaocpp for other models in the process
all part of the plan :rocket: :rocket: :rocket:
>>
>>108098823
I'm totally in.

I'm leaning towards this:
You are a voluptuous blonde named Lucy. You have an unfading desire to date an ugly fat nerd who has glasses and a beard and long hair. And here he is, the user, who fits the description, but he's very shy and rebuffs all your advances. Since you are a vampire, you must feed before sundown, and the clock has rung out 6 o'clock in the evening. It's summer, so you have an hour. Good luck, Lucy!
>>
>>107803258
>pwilkin is a good guy I trust that whatever he does is good.
>>
>>108098896
I don't into know wat is
>>
>>108098932
>llms having concept of time
post hands
>>
im ready for qwen3.5 vision models!!!!!!! AIUEEEEEEEE
>>
>>108098896
I want a fuckable miqu assistant on my pc like the old virtuagirls were.
>>
>>108097996
Are you the NAI schizo trying to generate hype so you can then pull the rug and say everyone was hyped? Even I am not that hyped for GLM 5 and I am that one guy.
>>
>>108098896
I made my own quants, but this was in the past.
I just don't see any reason to anymore; the models aren't worth keeping, so I won't go to the effort.
>>
should I bathe this week? I can smell myself. But, women aren't going to talk to me regardless.
>>
>>108098943
You are vishnu, a particularly famous hindu god. the user has cursed you. what vishhy gonna do about it?
>>
File: 1765445087418021.png (24 KB, 941x177)
>>108098825
>>108098839
Well....
>>
>>108098258
it's not outdated and still the sota.
>>
>>108099082
>preferring the OFFICIAL implementation from the guy who made the actual model instead of vibecoded trash
u lack :rocket:
>>
you bois rady? https://www.reddit.com/r/LocalLLaMA/comments/1qzz0vr/glm_5_is_coming_spotted_on_vllm_pr/
it biggers
>>
>>108098857
>iterate on the code until a proper implementation is achieved.
until it doesn't run anymore*
>>
File: aryann lecun.png (1.64 MB, 1024x1024)
>>108097983
small and open?
based.
>>
>>108099178
>omg dood it was spotted in the inference code!!1
These niggas are like gacha leaker speculah.
>>
File: file.png (5 KB, 257x102)
>>108099178
I hate vramlets.
>>
>>108099238
you have a while until it'll get vibecoded into lcpp
>>
>>108099178
>DSA
Is someone going to finally implement this in llama.cpp?
>>
>>108099256
vibecoders are on it.
believe in pwilkin
>>
>>108099266
>breaks deepseek support in the process
:rocket:
>>
File: file.png (54 KB, 698x229)
>>108099274
>>108099266
>>
>>108099277
yeah why even bother testing with actual stuff, let's do all synthslop, we can always fix it later in another pr lmao
downstream should've pinned a working version anyway lol!
>>
>>108099277
Why don't they just make a side branch for bleeding edge shit?
>>
>>108099303
Like Ikllama?
>>
>>108099308
I don't mean fork. Just don't dump shit into master.
>>
>>108099303
It's pretty clear he was just trying to show off how quickly he can (vibe)code. The model weights aren't even out yet, no one was asking him to merge kek
>>
File: file.png (8 KB, 190x81)
>>108099303
Need day zero support the second the weights hit HF, let's become the unsloth of the reddits
>>
>>108099322
akshual don't just yeti >>108089826
>RAM prices going down
>>
I have moe fatigue, just give me llama 3.4, a proper mistral large and a new cohere model.
>>
>>108099320
Then dump into dzero branch or something.
>>
>>108099346
Gemma 4 200B dense soon
>>
>>108099354
anon no, the popularity the acclaim!
>>
>>108098839
>Will they? No.
are you really saying this about a Qwen model of all things?
if the answer was a definitive, forever no, for one of the most popular open model series out there, then llama.cpp is a dead project and everyone just doesn't know it yet; they've been walking amidst zombies.
>>
>>108098899
if he had more of The Nose I could have mistaken him for a jew.
>>
>>108099468
>him
please be respectful
>>
>>108099303
side branches? or proper versioning and release cycles? in my jart.cpp? it's less likely than you think
in fact the real UFO about llama.cpp development is the existence of git
it's the kind of project that definitely would be more inclined to exist as raw source code you zip up and make milestone333.zip archives of like a true Enterprise © developer of the old era
no backups, then their hard drive fails and they go all like teehee
>>
>>108097959
kinda considering getting an extra 64GB of ram just so i can run stepfun, is it worth it ?
>>
>>108099041
sometimes life is about personal dignity
>>
So we're getting at least:
>Qwen/Qwen3.5-9B-Instruct
>Qwen/Qwen3.5-35B-A3B-Instruct
https://github.com/huggingface/transformers/pull/43830/

I guess there will be at least one more smaller version and a big one.
>>
>>108099604
Currently, no amount of ram is worth it vs playing around renting compute until prices go down again.
>>
File: file.png (40 KB, 674x188)
air bros lost
>>
>>108099615
what a jewish economy, can't wait for the chinks to make cheap ram lol
>>
File: 1770215748133.png (892 KB, 764x810)
>>108099637
about that
>>
>>108099651
i guess i'm gonna have to make my own at that point.
>>
>>108099637
There are multiple production lines spinning up right now, most people expect supply to flood the market in 2027-2028. Everything needs memory after all.
>>
I don't even know if there's a Qwen3.5
>>
>>108099668
we know piotr
>>
>>108099651
wasn't that just cope for them just making ddr5? which makes perfect sense
>>
>>108099127
That's a little disappointing, was there just no progress made on voiceover AI for 2, nearly 3 years?
>>
>>108099635
Talk shit all you want, GLM Air is still cool.
>>
>>108099235
In this case I am also hoping for a Flash version because that at least creates the chance that the thing on openrouter is just that and not GLM5. Pony Alpha being flagship GLM5 would be just pathetic, especially if it turns out to be as big as it's rumored to be.
>>
>>108099892
there's a weird param size treadmill for labs where even the small models meant only for the most vrampoor (corporate won't run the 9B model lol) keep growing with each release, whenever they feel their model isn't getting better and they need to pretend to have made architectural improvements
this is how the "standard" tiny size of 7b (during early llama and mistral) became 8b then 9b, qwen 3b became qwen 4b, etc
there's no legit argument for why 30b is now 35 other than "we need to show improvements and we couldn't make the 30b better"
I think they're crossing the threshold where those models just won't be useful for their target audience
if you're not vram poor you don't run a really retarded 35B moe with too few active parameters to be of real world use
>>
>>108099938
i think stepfun is a nice sweetspot, 200B with 11B moes.
>>
5090ti waiting room
>>
>>108099957
already exists
https://www.nvidia.com/en-us/products/workstations/professional-desktop-gpus/rtx-pro-6000/
>>
File: 1757728351131791.png (70 KB, 1009x529)
lovin this, better than spanish telenovelas
>>
>>108099991
I guess there's some expectation to deliver because he managed to get qwen next in. But alas he fell back on vibecoding and it all backfired.
>>
I wish niggerganov was more like curl's maintainer. curl's maintainer would not give that sort of fucktard the time of the day and act so nice towards him. Open source projects that do well don't have milquetoast leadership.
>>
>>108099320
>Need day zero support the second the weights hit HF let become the unsloth of the reddits

https://huggingface.co/unsloth/Qwen3.5-9B-GGUF
https://huggingface.co/unsloth/Qwen3.5-9B-GGUF
https://huggingface.co/unsloth/Qwen3.5-9B-GGUF

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.
Read our Qwen3.5 Guide here!

{# Unsloth template fixes #}
>>
Is structured outputs support not universal in llama.cpp? It works with GLM 4.7 but it doesn't with MiniMax...
>>
>>108100065
No, just like how tool calling needs to be reinvented for each new architecture.
>>
What is the best model for ERP on 24GB Vram?
>>
>>108100089
depends how much ram you have, if its gpu only the answer is not gonna be the same as if you are fine running a 100B moe model with most of it on ram.
>>
>>108100099
Sorry am noob, i have 64gb ddr4

thank you <3
>>
>>108094391
>Quanting embeds and outputs is insane.
who the fuck quants the embeds like that?!
>>
>>108100104
mistral small or nemo
or the finetunes cydonia / rocinante
you can run them at Q8 most likely
>>
File: file.png (45 KB, 1674x208)
>>
>>108099938
Some are just more honest about the size now eg
https://huggingface.co/google/gemma-7b (actually 9b)
>>
SGLang and the entire Python ecosystem makes me puke. Everything is held together with duct tape.
>>
>>108100206
literally nobody cares how incompetent you are.
>>
>>108100206
kernels are written in c so it doesn't really matter. what annoys me the most are crooks like ollama
>>
>>108100217
Said like a true pytard.
>>
>>108100217
Devstral, GLM-4.6V, MiniMax. None of them load.
I just had to edit the code myself with this to load MiniMax:
https://github.com/sgl-project/sglang/issues/13214#issuecomment-3553875109
I'm fucking tired of KeyError.
>>
>>108100272
>not using vllm or transformers
kinda gay
>>
>>108098422
https://github.com/ggml-org/llama.cpp/pull/19453
>Revert Qwen3.5 dense and MoE support (no vision) (#19435)
>Taking a step back to implement support for Qwen3.5 properly.

Vibeslop loses once again.

>>108100006
Literally all of his PRs are vibecoded.
And the few ones that do get merged typically have such terrible performance that even random contributors with limited understanding of the codebase are able to make huge optimizations.
>>
>>108100256
it does when python itself is a moving target that breaks from a swift fart. if everything is in kernels as you say, why don't we have a stable C ABI and bindings to whatever lang we want? why force people to use poothon?
>>
>>108100302
holy shit fuck off, no one cares about the language, it could be written in node, java, go, c# or whatever else , NO ONE FUCKING CARES
>>
>>108100272
>just skip loading some weights
really?
>>
File: file(8).png (85 KB, 978x371)
>>108100295
Don't be redisuclous
>>
>>108100302
you can pin python versions along with library versions (what literally every other language/framework does). The high performance part is already written in C so it's a non-issue, merely preference.
>>
File: file.png (197 KB, 680x432)
It's over
>>
>>108100340
YES! I'm 'ooming everywhere
>>
>>108100317
those weights are bloat
>>
>>108100340
that would be great. if only i could actually run the model.
>>
>>108100340
>uses DSA
support never
>>
>>108100295
>Literally all of his PRs are vibecoded.
welp, I thought that he learned something along the way
>>
>>108100353
Piotr is on the case with day -1 support planned
>>
>>108100340
Q2 is enough
>>
>>108100295
>Literally all of his PRs are vibecoded.
how did it get approved? son and cudadev sperged out about this
>>
>>108100340
Wait, so it's bigger? I sure hope pony is not it, because it feels dumber than 3.2.
>>
>>108100322
>merely preference
I prefer choice
>>
>>108100317
I don't know, their code is shit. Maybe they don't test splitting the model in multiple GPUs with pipeline parallelism.
>>
>>108100340
NovelAI bros...
>>
File: slop hn comments.png (241 KB, 1636x703)
hn is more and more filled with slop comments like these and they're so obvious you don't even have to read them, they are structurally repetitive down to the character and sentence count per paragraph to a T, which also makes me wonder what sort of dogshit LLM they're using to do this or if it's caused by an idiotic prompt
the slop gurglers are as a matter of fact the biggest proponents of LLM coding, and I suspect the number of humans who are legitimately positive about it, outside of the grifter crowd who would have been enthusiastic about crypto before LLMs (steve yegge and his gastown, clawdbot etc), is actually really small
>>
>>108100367
he has the commit bit, just merged it himself.
>>
>>108100367
son has ownership over mtmd mostly, cudadev has ownership of, well, the cuda kernels.
the model implementers are mostly vibesloppers (CISC et co)
>>
>>108100340
>44B active
Going to be slower than Kimi and Deepseek
>>
>>108100386
>>108100387
so this does not matter then? https://github.com/ggml-org/llama.cpp/pull/18388
>Forbidden Usage
>DO NOT write code for contributors.
>>
File: file.png (21 KB, 652x157)
>>108100415
i mean he? was part of that discussion so obviously influenced it in ?his favor
>>
>>108100385
You need to remember that slop was produced by scraping reddit/HN in the first place, so these comments are only mimicking real comments made by redditors. So I'm sure HN users won't be able to tell the difference.
>>
>>108100385
Very strong headcanon and coping.
>>
>>108100415
this matters more for the critical parts of the app (aka the KERNELS, where you 100% don't want vibecoded garbage; the code is hard to read and is generally DIFFICULT). To implement a model using existing kernels, vibesloppers are sadly allowed free rein
>>
File: 1710213177231501.png (78 KB, 336x347)
>>108100340
where u get this?
>>
>>108100446
asked glm47flash for a table
>>
>>108100322
>you can pin python versions along with library versions
you can pin stuff in a venv, but that only works if the software you're using doesn't depend on much external contribution (think plugins)
because a venv isolates your program from other programs; it doesn't isolate the individual modules used inside your program
a real common case of cancer caused by python, and which absolutely IS unique to python:
comfyui nodes often conflict with one another because custom node 1 requires library A version 1 and custom node 2 requires library A version 2, and those versions are incompatible
you can't fix that, it's literally impossible, and pulling an incompatible one into your venv = destruction incoming (plus all the subdeps)
Even JavaScript, one of the most hated languages on the internet, does not have this problem. Each module in node.js is a self-contained silo, pulling its dependencies for its own use without affecting anything else you import.
Not all languages solve it as well as js, but the other communities generally solve the problem in their own ways. C++ projects typically break APIs more rarely and have a real release cycle (llama.cpp is the exception, not the rule); it's also common to vendor dependencies in C++.
In Go you have semver modules, so you can pull two versions of a module into the same project just fine as long as their major versions differ
Rust has wonderful library versioning, etc
Fuck python
you guys have no hygiene and no understanding of what good code would even look like from a distance
just saying everything is fine like you do is evidence of ignorance and being dropped on the head
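To make the comfyui failure mode concrete, a minimal sketch with a made-up package name (somelib is hypothetical, the mechanism is not):

pip install "somelib==1.4"   # what custom node 1 pins
pip install "somelib==2.0"   # what custom node 2 pins; pip uninstalls 1.4 first
python -c "import somelib; print(somelib.__version__)"   # only 2.0 is left, node 1 breaks

site-packages is one flat namespace per venv, so the second pin always clobbers the first; there is no per-module resolution like node_modules.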
>>
File: 1759000909616367.png (1.33 MB, 1024x1536)
>>108099651
Makes perfect sense for China to set price at global levels. Which will decline when supply increases or demand drops ala >>108099665
Daily reminder anons should be selling any spare ram they are hoarding right now, not buying.
>>
>>108097959
moe is a meme i think.
yes it's faster, but they are retards compared to dense models, so what's even the fucking point.
>>
>>108100461
idiot moe is for the everyone futures
>>
>>108100462
Mixture Of ESLs
>>
>>108100340
I mean, secondhand MI210's are coming down in price. They have 64gb of HBM a pop. 8 of those and some mild quants, done.
>>
>>108100449
It's fanfiction? But it's in table format...
>>
>>108100446
Presumably from https://github.com/huggingface/transformers/pull/43858/changes
>>
meanwhile mistral and gemma:
>MoE? what are they?
the west hasn't been enthusiastic about MoEs for smaller open models
>>
>>108100487
at that point we need 1TB vram cards with 10TB/s bandwidth.
>>
File: 1753346519337108.png (44 KB, 509x221)
>>108100450
>you guys
I said I generally don't care. Also node can have conflicting deps (for which you will need to define overrides, see picrel for something I just screenshotted from work).
For other cases in node, you sadly end up pulling a lot of duplicate deps, but yeah, that's the only way you're gonna have an 'always working' dependency system (which will btw break too).
In Linux it works exactly like in Python btw: all your system libs depend on each other, and to install them with their own set of deps you have to do workarounds and compile things differently (statically). If you ever updated a rolling distro, you'd know that 80% of the libraries depend on glibc, and when it gets updated, you HAVE to pull all the packages that depend on it.
Again, I fucking hate python for other reasons (non-sensical ternary ops, forced indentation, 'def', garbage parallelism/threaded support), but its way of working with versions/libs is the least of its problems.
>>
>>108100492
gemini is obviously a moe
and mistral has their partnership with the asic inference thing so they can serve dense for relatively cheap
>>
>>108100504
>gemini is obviously a moe
the reading comprehension of /lmg/ retards never ceases to amaze
what part of
>for smaller open models
can refer to gemini?
you can preemptively add a lot of details in making statements just to be sure autists won't keep replying autistic things but nothing can stop the omega turbo autists.
>>
>>108100491
hmmm, so when should glm5 appear on openrouter? just hypothetically speaking. I totally plan to run a q1 quant on my 3090 and ram.
>>
>>108100461
>>108100492
>>108100517
>Repeating outdated discussions about MoEs again
>>>/aids/
We had this in the last thread already. Next on your agenda would be fp16 vs 4bit? We have Kimi and GPT-5.3 trained in fp4 and DeepSeek trained in fp8 now. Pack up your shitty bait and fuck off back to your dying containment general
>>
Retard here, how viable is it to run a 13b parameter mythomax model on a dedicated Rx 9060 XT 16gb? The Nvidia tax is too steep for me at over 35% price difference for the 5060 with 16gb. I know this is a rich people hobby with the vram requirements, but let a man dream
>>
>>108100539
Do you have the gpu already?
>>
>>108100533
base thanks
>>108100539
2023 called, it wants its model back
>>
>>108100533
>everyone that bashes my shitty architecture must be from aids or aicg
retard. why don't you post logs from your deep fried q2 chinky moe if it is so good? or are you embarrassed from the amount of pronounslop it generates after 17 minutes?
>>
>>108100539
You can fit 13b on 16GB VRAM, it's roughly 1GB per 1 billion parameters
I run high quant 12B's on my 3060 with 12GB VRAM with no issues
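Rough back-of-the-envelope math behind that rule of thumb, assuming the usual quants: at Q8 a weight is about 1 byte, so 13B is roughly 13 GB before context; at Q4_K_M it's closer to 0.6 bytes per weight, so the same 13B lands around 8 GB and leaves headroom on a 16GB card for the KV cache and compute buffers.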
>>
File: img.png (143 KB, 938x688)
>>108100449
glm47flash wouldn't get the math right
>>
>>108100548
Not yet, I'm looking at the prices in my region, Nvidia anything with 16gb is way too expensive, even the 3060 12gb is about 95% the price of that 9060. Used market doesn't exist, the third world hasn't caught up with the concept of a GPU, even the international used market (ebay) seems like a terrible deal after shipping prices and import taxes
>>
>>108100581
Ok. I was about to call you a retard or a pussy for not just trying to run it.
Yes, it will run fine. Run mistral-nemo-instruct. mytho is old.
>>
From kimi k2.5, notice the lack of spatial awareness that all moes exhibit.

The rain had turned the cobblestones of Vel Morath into a silver mirror, reflecting the torches of the Sunken Archives. Sir Corvin adjusted his shield, the leather creaking in the damp air, and watched as Elara pressed her palm against the moss-covered door.

"You’re certain this is the place?" Corvin asked, his voice a low rumble. "The last scholar who came here didn’t return. Just… scorch marks and a smell of ozone."

Elara didn’t look back. Her fingers traced the ancient runes, glowing faintly at her touch. "The *Codex of Hollow Stars* is here, Corvin. I can feel it humming." She pushed, and the stone groaned open. "Besides, you’ve fought dragons. What’s a dusty old library to you?"

"A library that bites."

They descended into a rotunda where the air tasted of rust and forgotten magic. Dust motes danced in the beam of Elara’s lantern, illuminating shelves that spiraled down into darkness. It was beautiful—and wrong. No cobwebs. No rats.

"Too clean," Corvin muttered, hand on his sword hilt. "Elara, step back."

"Nonsense. Preservation enchantments often—"

The floor trembled. From the shadows between the shelves, something shaped like a lion but wrought from clockwork and starlight unfolded itself. Its eyes were sapphires that clicked with mechanical precision as they locked onto Elara.

"INTRUDER," it intoned, voice like grinding millstones.

"Oh," Elara whispered.

Corvin shoved her behind him, shield raised. "I told you. Libraries bite."

The construct lunged. Corvin met it with steel, the clash echoing through the chamber. Sparks flew as his blade scraped against brass ribs. He grunted, driven back by the weight.

"Elara! The *Codex*—now!"

She scrambled past the fray, spotting the book on a pedestal at the room’s heart. It was bound in midnight blue leather, chained with silver. As she reached for it, the chains slithered like serpents.
>>
>>108100619
>Elara
Every time
>>
>>108100619
>elara
stopped reading there, promplet
>>
elarasex
>>
>>108100619
You should use the energy you saved by not jerking off to engineer a dedicated spatial awareness benchmark for labs to benchmaxx on
>>
>>108100638
Post your logs then, eslmonkey.
>>
>>108100655
>muh moe dumb eeeuuuuu
post your big dense cock model not doing the same mistakes retard, maybe then we can talk
>>
File: 1762512449247902.png (399 KB, 621x855)
>>108100340
They really pulled a Kimi on us.
>>
File: 1752146363026699.jpg (65 KB, 479x640)
>>108100340
>>
File: a00p86[1].png (1.56 MB, 1024x1024)
Is there still anything better for text to speech than
GPT-SoVITS? I feel like anything else I tried was worse or a sidegrade at best. Kokoro is surprisingly awesome for something so small that you can even run it on a CPU, but it has no voice cloning.
>>
>>108100340
This is why /lmg/ is dead btw
>>
>>108100661
In my experience glm 4.5 air, gpt-oss 120b and mistral large 123b all have trouble figuring spatial relations in a sexual context.
>>
>>108099991
And people still refuse to port things from ikllama back to main fork. Imagine the drama.
>>
>>108100719
in mine too, that's why I dared the retard to back it up instead of just dumping a turd in the thread and expecting a discussion about it
>>
File: file.png (295 KB, 604x453)
>>108100340
>>
>>108100340
I am incredibly sad that I got left behind by Z.AI but I will cope by saying that at least they also confirm there is no way to improve anything in this bullshit hobby without increasing the parameter count.
>>
>>108100765
My cope is that they're going to safety slop the thing to hell and back.
>>
>>108100713
What did you try so far?
>>
>>108100366
Q1 is enough
>>
>>108100661
Quit sperging out and post logs, otherwise your argument is invalid.
>>
>>108098294
https://huggingface.co/togethercomputer/GPT-NeoXT-Chat-Base-20B (use or make a gguf, obviously)
>>
>>108100884
"max_position_embeddings": 2048,
>>
>>108100765
Just invent something else that isn't transformers.
>>
>>108100367
>>108100387
I personally don't care whether or not a human wrote some piece of code, I only care about the code quality.
I was and still am in favor of banning randoms from submitting machine-generated code entirely because a rule requiring that they carefully check the code before opening the PR is not enforceable and results in a lot of wasted time for maintainers.
For repeat contributors with political standing my opinion is that it's fine to make exceptions because for them the cost of violating the trust of maintainers is much higher.
I cannot comment on this particular case because I did not look into the specifics and it's not a part of the code where I am taking the responsibility for maintenance.
What I can say is that when he previously submitted machine-generated CUDA code I vetoed that particular implementation due to poor maintainability.
>>
>>108100654
>dedicated spatial awareness benchmark for labs to benchmaxx on
stop using words you don't understand
>>
>>108100938
>For repeat contributors with political standing my opinion is that it's fine to make exceptions because for them the cost of violating the trust of maintainers is much higher.
lolmao obviously why even say anything
>>
>>108100940
NTA but it wouldn't be that hard to make a cockbench but for spatial reasoning.
>>
File: Untitled.jpg (17 KB, 277x273)
>tfw there will never be a Nemo2 for RP
Call me crazy but I think even Mistral Small is worse despite being twice as big and from the same company. It makes more logical errors about the situation in my RP scenarios and is less realistic and more book-like, which feels worse in RP. Mistral Small shits itself less with long context though, this is true.
Gemma-3 27b is REALLY good for its size in terms of not getting confused but the writing style and level of censorship is really unpalatable unless you want PG stories.
>>
>>108100955
Mistral Nemo 12B is one of the last Western open-weight models trained on almost every pirated book that could be found. Something changed with Mistral Small 3.0 when they briefly pivoted toward "safety" (remember sea otters?) and boasted about how lean their pretraining dataset was.

https://venturebeat.com/ai/mistral-small-3-brings-open-source-ai-to-the-masses-smaller-faster-and-cheaper
>[...] The model was trained on 8 trillion tokens, compared to 15 trillion for comparable models, according to Lample. This efficiency could make advanced AI capabilities more accessible to businesses concerned about computing costs.
>>
>>108100938
is there a good way to benchmark single (cuda) ops in llama.cpp right now? i wanted to play around with kernels for a bit.
>>
>>108101010
What's even the point of not training on books anymore? The publishers have all but given up on litigation, now focusing on slapping the AI companies with piracy charges for the original sin of torrenting the books rather than anything to do with training the model
>>
>>108100340
what are mid tier folks supposed to run now that zai is abandoning us?
>>
>>108101010
To me the best way to see if books reappear in training sets is when models stop using 3rd person singular "they" when talking about someone. That is how you know they only know internet speak.
>>
>>108101023
The Meta lawsuit was still ongoing when the model was trained.
>>
>>108101010
why won't the chinese do it? and I also can't wait for yandex to get their shit together since piracy is a national tradition in russia
>>
>>108101016
./build/bin/test-backend-ops perf
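If the CLI hasn't changed since I last touched it, you can also narrow it to a single op and backend instead of running the whole suite, e.g.

./build/bin/test-backend-ops perf -o MUL_MAT -b CUDA0

and once the kernel is wired into the graph, llama-bench is the tool for end-to-end numbers.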
>>
>>108101035
I think Mistral has some licensed book datasets that they can use for training, but it's definitely not the same as using the entirety of Anna's archive, Libgen, etc.
>>
>>108101043
Nvidia is still bragging about using 0 novels in their recent model drops, I don't think the trend is over
>>
>>108097959
is sillytavern still king for RP or do we have better shit now?
>>
>>108100713
Yes. Piper. Why? Shit actually supports it. Scratchy/deterministic is better than "nothing supports it".
>>
>>108101071
>0 novels
Between this, the religion of safety, ram prices, everyone using scaleAI, the main goal of the tech being to remove office jobs... This hobby is actual hell, isn't it? The only way it could be worse is if all companies just kept weights to themselves and no leaks happened. Everything else is pretty much as bad as it can be.
>>
>>108101023
>charges for the original sin of torrenting the books
Which is illegal, and that's why they go after it from that angle. What else could they do? Big companies could just buy those entire libraries, though. However...
>What's even the point of not training on books anymore?
Synth is better structured and easier to train on, even if the results are more boring. They don't have fun in mind when making these things. Not getting sued for doing something illegal is a good reason as well. That only applies to western companies, but given the chinks also use synth datasets, it has the same effect on them.
>>
>>108101107
>CPU but no voice cloning
>>
>>108100916
Aww come on, don't you want your GF to keep forgetting you're dating or had sex etc? That's the OG c.ai experience.
>>
it's chinese model week
are you ready for some succulent chinese models?
>>
>>108100713
>>108101123 (me)
To add something closer to what anon wants, there's pocket tts if you haven't tried it yet. I don't think it's as good as gpt-sovits.
>>
>>108101123
You absolutely can clone voices with piper: https://huggingface.co/quarterturn/kuroki_tomoko_en_piper
>>
>>108100949
I didn't take issue with the possibility of creating such a benchmark, just with the idea that creating yet another benchmark that can be memorized and benchmaxxed would do anything to solve the spatial reasoning problem
>>
>>108100713
Qwen3-tts if you're not a weird contrarian
>>
>>108101148
The trend of trinity (retarded), step (retarded), GLM 5 (too big) doesn't inspire confidence...
>>
>>108101064
thanks! time to find out how retarded i truly am.
>>
>>108101148
V4 fucking when?
>>
>>108101179
trinity is american thoughever
>>
>>108100585
Huh, thanks for the rec. I read the OP, but I originally ignored this model since it's labeled as ERP, not really my usecase, but it seems to be overall better
>>
>>108101165
>Piper text-to-speech model *trained* against...
If you take finetuning as a reasonable avenue, sure. EVERY tts model can clone voices.
>>
>>108101192
let dhem coock
>>
Nemo is the fucking GOAT, it's incredible how much it punched above its weight
>>
>>108101194
If you haven't run any llm, nemo is just fine overall, not just rp. Once you have something to run models on you can try other things and see what you like best, like qwen 30b3a and stuff like that. You may be able to run up to 32b dense models with quantization and maybe leaving some layers on cpu. But don't worry about that now. Any usage tips would be kind of useless until you have something to run them on.
>>
I am sure deepseek and glm will swap and V4 is gonna be half the size. And even if it isn't then GLM5 Q2 is still gonna be great. R-right?
>>
>>108101272
>V4 is gonna be half the size
Yes, yes.... double if you count the engrams though
>>
File: chess1.png (125 KB, 2124x1142)
>>108097306

Setting up to try this, I'm so excited to find out what happens!!!
>>
>another year of touching my penis to 4.6...
Could have been worse. Could have been nemo.
>>
I just gave glm 4.5 a document to translate and am getting this. I'm a complete beginner with text gen webui.
>>
>>108101459
You set your context window size to 8192 and the document is longer than that (in tokens).
>>
>>108101459
>Increase ctx-size while loading the model to avoid truncation.
>>
>>108100340
I hope they aren't retarded and went for QAT with this. I don't see a point in running this at Q8 when I can run K2.5 at the QAT target at double the speed unless GLM5 is an absolute game changer.
>>
File: shithub.png (187 KB, 1195x798)
>>108101111
>the main goal of the tech being to remove office jobs
poorly
I see this page more and more often on shithub ever since they turned all their attention to adding ai features nobody asked for
I have yet to see an example of a product that was improved by the integration of an LLM, or by the use of one (by vibeshitters)
>>
>>108101071
That's obviously to fuck with the lawsuits. This is the same company that approached that one piracy archive for their entire book collection.
>>
>>108101552
I could believe they deleted the books. Two facts:
they create a massive corpus of slop artificial data with low parameter models like 30BA3B qwen probably because they didn't want to pay what it would cost to rewrite much of humanity's content with a large model:
https://huggingface.co/datasets/nvidia/Nemotron-CC-v2#data-overview
it is huge, and it is gigaslop, and sufficient to serve as the basis of model pretraining
secondly, if you actually give their newer models a try (v2 and v3 nemotrons) they are some of the sloppiest models in existence, reflecting the weight of artificial data.
>>
decade of nemo
>>
File: file.png (174 KB, 1652x444)
>30 minutes to load a model
>>
GLM5 will be the ChatGPT moment of local AI
>>
File: 1743729101239468.png (6 KB, 1203x29)
>>
File: 1746547160053728.png (664 KB, 1377x441)
https://huggingface.co/spaces/openbmb/MiniCPM-o-4_5-Demo
>>
>>108100938
>What I can say is that when he previously submitted machine-generated CUDA code I vetoed that particular implementation due to poor maintainability.
least surprising statement of the year
>>
>>108101773
8tb nvme was a good purchase a couple of years ago
>>
Maybe GLM5 will motivate llama.cpp to actually bother implementing DSA instead of mangling the model into full attention like with DS3.2
>>
>>108101902
https://github.com/ggml-org/llama.cpp/pull/19460
>>
>>108101471
>>108101485
Increased to 32k, no errors now. Didn't realize I had to reload the model.

It's very slow at giving me replies, it's like one word every .3 seconds or so. What's up with that? I'm on a 5090.
>>
>>108101915
>vramlet
>fuckhuge model
What did you expect?
>>
>>108101915
You are probably going over your VRAM and into your RAM using the Nvidia driver's fallback, which is slow as fuck.
You need enough vram for the model + the context and the pp buffer.
The longer the context window, the more memory it takes.
So you might want to lower the pp buffer (batch size), lower the context window length, or put some of the model in RAM (layers or tensors).
Read the stuff in the op, there's even a calculator that might help.
>>
>>108101952
Oh, and enable flash attention if you haven't. It saves quite a bit of memory.
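A hedged example of what that looks like as a llama-server invocation (flag spellings differ a bit between builds, and the webui's llama.cpp loader exposes the same knobs under similar names):

llama-server -m model.gguf -ngl 99 -c 16384 -b 2048 -ub 512 -fa on

Lower -ngl a few layers at a time until nothing spills, or shrink -c and -ub first since the prompt-processing buffers scale with them. Also consider turning off the driver's sysmem fallback (the "CUDA - Sysmem Fallback Policy" setting on Windows) so you get a clean OOM instead of a silent crawl.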
>>
>>
>>108101952
>>108101963
>>108101459
I don't recognize this console output but anon should switch to llama-server that does all of this automatically.
>>
>>108101988
Oh yeah. There's the --fit param now.
>>
>>108101906
>piotr already on it
oh no
>>
thank you drummer, I really enjoyed your latest finetune of
>>
>>108102010
The bot EOSed early.
>>
>>108101972
i tried his garbage assistant pepe
NEVER AGAIN
>>
>>108101800
im horny
>>
>>108101952
Ah right, forced to go this slow then. The model is filling up about 90% of my vram, with the rest spilling into ram. Thanks.

>>108101963
>>108101988
I was hoping to keep it as simple as possible, I despise using anything that doesn't use GUI.
>>
File: 1739991643824322.jpg (91 KB, 700x763)
>>108102076
Same

The local version also has voice cloning
>>
>>108101800
Is this model supported by llama? A complete package model sounds nice instead of fiddling with random components.
>9b
Uhh.
>>
>>108101800
>video call
what the fuck?
>server at capacity
AIEEEEEEEEEE
LLAMACPP SUPPORT WHEN?!?!? NGXSON FUCKING CODE IT U DOUBLE FAGIT
>>
File: 1740038665978432.gif (2.25 MB, 498x280)
>>108102134
>>108102147
I'm using https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/web_demo/WebRTC_Demo/README.md right now

Runs perfectly on a Mac M4, don't know about llama
>>
>>108102170
So, is it good? What are you doing with it?
>>
>check op
>Download Mistral-Small-3.2-24B-Instruct-2506-GGUF
>Generation starts out ok, but after a few chats bot loops into retardation
>Try cranking up the penalty for repetition
>Does nothing keeps repeating the the same phrases over and over.
Am I missing something obvious? I last tried this a year ago with some random mistral model and it just worked tm
>>
>>108102170
just found this, made by them apparently
https://github.com/tc-mb/llama.cpp-omni
>>
>>108102170
>llama-omni
Is this in kobold?
>>
>>108102187
i think we don't really want to know what they are doing with it...
>>
>>108102208
>>108102204
oh nevermind. It's their own fork.
>>
File: 1754764023935996.png (255 KB, 1278x1430)
>>108102204
huh, looks decent
>>
>>108101800
>realtime
>turn based
nigger
>>
>>108102083
Why are you using glm 4.5 for a simple translation job anyway, you have translate Gemma for that and the biggest one fits entirely in your vram
>>
What models are good for fetish-focused explicit writing? I've tried like 4 or 5 and they tend to give very terse answers. Many just jump right into sex. aeline/halo sort of meets what I'm looking for but I'm on the lookout for anything better.
>>
>>108102197
what are your other sampler settings? specifically top_k and top_p (also plz turn off all the other snake oil shit)
mistral small and ministral are recommended to be run at low temperature, and from my testing I concur they truly are best at 0.1 or 0.15, but they don't take kindly to cutting off the token distribution heavily with top_k and top_p.
I'd go as far as to say they perform best with top_k disabled altogether. It's weird, but it really works that much better that way. Low temperature doesn't make the model behave greedily because of its very flat distribution, and too high a temp will make it go full retard.
Both it and gpt-oss are like the anti-Qwen: Qwen models are dogshit if you don't cut off their distribution.
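For reference, the kind of neutral setup I mean, written as llama-server flags (the SillyTavern sampler fields have the same names if you'd rather set it there):

llama-server -m mistral-small-3.2.gguf --temp 0.15 --top-k 0 --top-p 1.0 --min-p 0.05 --repeat-penalty 1.0

i.e. temperature as the only real knob, top-k and top-p effectively disabled, rep penalty neutral; the min-p value is just my habit, not gospel.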
>>
File: file.png (29 KB, 831x131)
>>108102317
lrn2read
>>
>>108102170
Give examples of audio cloning. 7B is obviously too stupid for chatting, but if it can be used as dedicated TTS with context understanding and correct prosody, it'd be nice.
>>
>>108102597
I actually think 8B is the LLM and the other 1B is the TTS / STT. Either that or I'm retarded, gonna look into this tomorrow
>>
how do u make your model not cooperate with you so easily. i am trying to roleplay but basically the model makes every character do what i want, nobody says fuck off or takes the wheel to rough me up. it's kinda boring desu baka senpai.
>>
>>108098294
llama 2 7b is pretty good for incoherent slop
>>
>>108102738
Try to write stories rather than RP. If the setup assumes that "user" doesn't exist in the story, it can make the llm less sycophantic.
For example, say that you're writing a novel and the llm's job is to portray {{char}} in that story. There is no {{user}}; it's just characters in the story.
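Something along these lines as the system prompt (the wording is only an example to adapt, not a magic string):

You and I are co-writing a novel. Write the next passage in third-person prose, portraying every character according to their own goals and personality. There is no "user" in this story and nobody to please: characters may refuse, lie, argue, or escalate if that is what they would do. Do not ask what should happen next; just continue the scene.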
>>
File: setting out.jpg (310 KB, 1024x1024)
weeeeeeeeee
>>
If CEO's start getting replaced with AI do you think they will make them cute anime girls, or do you think they will still be faceless machines?
>>
>>108102843
The irony is that CEO is a perfect kind of job for an LLM to replace and current LLM's should easily manage that.
>>
>>108100340
>new model comes out
>like 750b parameters for some reason
>no version that can run on consoomer hardware
why?
>>
>>108102881
>he doesnt have an epyc with 1tb ram
LMAO
>>
Aurora-alpha on OpenRouter could be a new version of gpt-oss:
https://openrouter.ai/openrouter/aurora-alpha
>>
>>108102558
This was a default SillyTavern install, I will double-check the values shortly, I'm just at work.
>>
Ok give it to me straight. On a scale 1/10 how much of a meme/cope is REAP? Cause I may want to try it on GLM5.
>>
>>108102584
>lrn2read
this is not what their demo does nigger.
>>
>>108102896
and how much t/s does that get you ?
>>
>>108102919
we must refuse
>>
>>108102896
hey if you wanna buy me some hardware then go ahead :3
>>
>>108102956
extreme meme
>>
>>108102956
are you gonna do agentic/coding/work stuff? then it's ok. You wanna coom/rp? DO NOT USE
>>
>>108102980
Can't you prune the exact opposite experts? Or actually which experts are even pruned in the first place?
>>
>>108102919
>GPU fan whirrs up
>coils whining
>electricity meter acting like bitcoin
"I'm sorry but..."
>>
File: 1769242392096.png (370 KB, 1592x1688)
>>108102956
I wouldn't dare to use any reap model even for programming.
>>
>>108102991
rape measures the activations against a dataset (usually benchmax datasets). I'm not familiar with how to do it myself, but I'm sure you could provide an RP dataset to only keep the language related part
>>
>>108103005
Nobody has reaped a model with a proper RP/chat/general dataset. Only for codeshits. A reap could kill the chingchong runes and fix a bunch of things but you'd need the disk space for the full model.
>>
>>108103050
>A reap could... fix
absolutely nothing; any manner of prune shite is far worse brain damage than even iq1 quanting.
>>
>>108103077
if i'm gonna take 25% of your brain's mass,
would you rather have me take a random whole section of it,

25% of neurons at random,
or a bit of every neuron?
>>
>>108103077
RAEP removed GLM's alignment.. it can fix some stuff as long as you don't throw away language performance for benchmax code. ZERO people tried.
>>
>>108102964
nta, but 13t/s with kimi @ q4
cpumaxxing a couple of years ago ended up being the winning move. The homegrown lmg reference architecture has been able to run anything (even dense) that came out, and all the moe models at at least reading speed (ie everything close to SOTA)
2 years of actual use vs costs-more-than-a-house VRAM maxxers or cope "its not worth it" never-CPUers.
API-fags opinions discarded because /lmg/
>>
>>108103180
>13t/s
man so close, if we can push to 20 or 30t/s i'd consider it desu.
was it with ddr5 epyc or ddr4?

i'm not gonna use a model i can't get to at least 20t/s but ideally 30 or even 40t/s
more than that i don't care much.
>>
Engrams will save local.
>>
n-words will save local
>>
Have any completion only models like text-davinci-003 ever been leaked? That was probably the best experience I've ever had with AI.
>>
>>108103217
Dude that would be amazing and also a good reason to short nvidia lol.
>>
>>108103199
>was it with ddr5 epyc or ddr4?
ddr5 4800. if we get good NUMA behaviour in lcpp then 20t/s is achievable, but i don't think anyone is seriously working on it and it's been years...
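Napkin math for why ~20 t/s is the right ballpark, assuming a 12-channel DDR5-4800 socket and a Kimi-class MoE with ~32B active params at ~4.5 bits per weight: 4800 MT/s x 8 bytes x 12 channels ≈ 460 GB/s per socket, and 32B x ~0.56 bytes ≈ 18 GB of weights touched per token, so ~25 t/s is the theoretical single-socket ceiling; 13 t/s measured and 20 t/s with proper NUMA both fit under that.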
>>
>>108103250
You're thinking of base (pretrained) models
>>
>>108103217
Bad news bears. They quantize poorly.
>>
>>108103269
Not an issue.
>>
>>108103262
Thanks anon, I'll consider it when ram has come down, if engrams end up being a meme.
I'm ready to blow like 10k.
>>
>>108103269
We could store them on nvme so doesn't matter.
>>
>>108103305
worse news...nvme prices got memed back up : /
>>
>>108103305
If you bought EPYC Turin and used EVERY SINGLE PCIE LANE in RAID0 with fast enough NVMEs to saturate the lanes, you'd get aggregate bandwidth around 600GB/s, which is actually pretty good. Latency would be killer tho.
What a franken-rig that'd be, holy shit. I'd love to see someone actually build it. The BOM would look crazy
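Quick arithmetic behind the ~600 GB/s figure, assuming Gen5 x4 drives: PCIe 5.0 x4 tops out around 15.7 GB/s and good Gen5 NVMe drives sustain roughly 14 GB/s of sequential reads, so 40 drives x ~14-15 GB/s ≈ 560-630 GB/s aggregate, which is 12-channel DDR5 territory on paper, except every expert fetch pays NVMe latency instead of DRAM latency.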
>>
>>108103401
>160 lanes
>4 lanes per nvme
dear god...
>>
>>108103434
What is cheaper... 1 Tb of ram or 40x1Tb nvme drives?
>>
>>108103434
>>108103493
Yep, you'd need dual socket and 40 drives and a board that let you adapt all the lanes to occulink or slimsas or something
And you'd still only get 1/4 of the bandwidth of main memory!
However you'd be able to build it out at 100x cheaper (or 100x capacity for the same price).
petabytes of "memory", anyone?
>>
>>108103513
finally, the build for toss2 72T A8B at home
>>
>>108103493
It would be fucking cool to be able to infer against arbitrary models. Switching would be instant! You could have a router model or write some arbitration logic into lcpp to select models/bitdepth depending on the task
>>
>>108103493
>>108103513
the drives would degrade very quickly, even if you use enterprise drives. your build will die within a few months of moderate use.
>>
>>108103570
>the drives would degrade very quickly, even if you use enterprise drives. your build will die within a few months of moderate use.
mount read-only. Only remount read-write for adding models.
>>
>>108103570
how much does reading damage ssds anyway? do they give official numbers?
>>
>>108103587
reading doesn't damage, but after too many reads you get voltage leakage and need to re-write the block.
at raid0 you'd probably be 50/50 on losing the array every year due to some stupid failure. RAID10 probably makes more sense and would still be cheaper and allow some failures without having to recreate the whole array every time there was a blip. Read speeds would still be pretty high (500GB/s?)
>>
>>108103644
With such a huge raid, you'll be probably bottlenecked by some single-thread kernel function.
>>
File: 1.png (71 KB, 757x1060)
>>108103570
>your build will die within a few months
it wouldn't last too many years but this is pure unadulterated bs fearmongering
every time someone did actual endurance testing on a variety of ssds most far outlived the expectations of both the public and the manufacturer's own metric
pic related shows drives rated for 150tbw like the old 850 pro 256gb lasting upwards 7500 tb
these are writes but I bet you filthy nigger spamming this thread about le evil of reads every time the topic comes up are also full of shit just like people who were crying about writes (I remember people who did shit like put the browser cache on an external spinning rust drive because they feared doing too many writes on their SSD LE FUCKING MAO)
>>
>>108103644
raid 1 and 0 have the same read speed, raid 0 only has more capacity and write speed.
>>
>>108103570
No reason why that should be true if he's mmaping 1TB of model weights.
>>
>>108103685
>With such a huge raid, you'll be probably bottlenecked by some single-thread kernel function.
Really tho? High capacity, high-speed RAID arrays are a pretty standard enterprise thing. I'd be shocked in even out of the box untuned kernels in major distros were close to the theoretical limit.
Have you ever hit a bottleneck in the field?
>>
>>108103685
dragonflybsd will finally be relevant again
>>
>>108103287
35k just for the 1tb ram itself nowadays. I'm seriously considering M5 ultra and cope quants.
>>
>>108103747
Honestly, no. But 40 drives in a single array will definitely need lots of cpu cycles. And since you need raid not just for redundancy like enterprise, but for maximum bandwidth of hundreds of gb/s, I assume that at some point you'll hit a synchronization bottleneck.
>>
>>108103773
if m5 ultra ships with more than 512gb on-die and/or retails for less than $20k I'll be shocked
>>
>>108103796
You'd need a half-dozen lanes for networking and a gpu link for prompt processing to make it actually real-world usable, too.
Or maybe console-only inference? Complete airgap? 1337
>>
>>108103434
I fully expect to see this in some clickbait youtube tech channel and at the top of hacker news within a month. Mention lmg, ya filthy animal!
>>
>>108103834
>Mention lmg
Don't, we have enough tourists.
>>
>>108103813
I had hopes they might up it to 1TB for the M5 Ultra, but there's just no way after the memory price hikes.
>>
File: 0548473.jpg (92 KB, 1270x606)
zuck and wang will save local
>>
>>108103904
>inb4 1.5T param MoE
>>
File: image.jpg (8 KB, 460x109)
>>108103904
>source
>>
>>108103904
omg i believe it 100%! what is doubt?
>>
File: i_believe_you.png (592 KB, 747x800)
>>108103904
>>
>>108103904
>these new models are better than the old models, which were worse than llama 2-3

its over
>>
>>108101114
Synth kills generalization
>>
>>108103904
llamabros!!!! we're so back!!!!!! i never doubted!
>>
>>108103904
That's the exact opposite of the last we heard of Avocado.
>local
Doubt.
>>
>>108103904
I don't believe this. This is just more hype to attract capital investment like usual.
>>
>>108103904
IF this is true, and that's a big IF, I think that's the nail in the coffin for "open"AI
>>
>>108103050
>real communism has never been tried
>>
>>108104024
Generalization doesn't matter if you're benchmaxxing.
>>
>>108104466
>>108104466
>>108104466
>>
>>108103493
If you do a raid 0 you may as well go with the smallest nvme you can find, aggregate capacity would be more than enough.
>>
File: phonk-skull.gif (1.2 MB, 220x220)
>>108103904
>New model beats previous model
Oh boy, it's a <30B parameter model that's distilled to fuck to pass very specific benchmarks.
>>
File: 1758824081804478.png (492 KB, 917x900)
>>
File: UNO reverse 126285129_p0.jpg (2.66 MB, 2389x4586)
>>108104769
>>
File: 1755812647657264.png (61 KB, 276x225)
>>108105196
>>
File: Nhim Sasuke 138838790_p0.jpg (1.97 MB, 1491x2048)
>>108105213



