/g/ - Technology

File: dipsyBowlingAlleyStandoff.png (2.39 MB, 1536x1024)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108132261 & >>108123280

►News
>(02/13) MiniMax-M2.5 released: https://hf.co/MiniMaxAI/MiniMax-M2.5
>(02/13) Ring-2.5-1T released, thinking model based on hybrid linear attention: https://hf.co/inclusionAI/Ring-2.5-1T
>(02/11) GLM-5 744B-A40B released: https://z.ai/blog/glm-5
>(02/11) Ming-flash-omni 2.0 released: https://hf.co/inclusionAI/Ming-flash-omni-2.0
>(02/10) MOSS-TTS Family: speech and sound generation models: https://github.com/OpenMOSS/MOSS-TTS
>(02/06) KugelAudio-0-Open: Multilingual TTS based on VibeVoice 7B: https://hf.co/kugelaudio/kugelaudio-0-open

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: ca2vrl.jpg (173 KB, 1024x1024)
►Recent Highlights from the Previous Thread: >>108132261

(1/2)

--Papers:
>108134463
--Alexandria audiobook generator voice quality and LoRA training feedback:
>108132491 >108132574 >108132620 >108132714 >108133377 >108133570 >108133609 >108133697 >108133741 >108134599 >108133477 >108133778 >108132628 >108133426
--MLX quantization performance analysis and tooling limitations:
>108132892 >108133233 >108133244 >108132948 >108134986 >108135022 >108135088
--DeepSeek's new model rivals Gemini in long-context summarization:
>108137775 >108137840 >108137875 >108137936 >108137943 >108138008 >108138239 >108138731 >108138818 >108138841 >108138876 >108138911 >108138976 >108139011 >108139024 >108138932 >108138916 >108138947 >108138950 >108138970 >108137820 >108137870 >108137900 >108137975 >108138135 >108137843 >108139084 >108139103 >108139129
--OpenClaw model selection and agent framework tradeoffs:
>108132299 >108132378 >108132478 >108132595 >108134485 >108135173 >108135177 >108135190 >108135208 >108135219 >108135399 >108135550 >108136105 >108136248 >108135195 >108135205 >108136842
--Federated LLM training feasibility and modular layer approaches:
>108133301 >108133762 >108134877 >108135096 >108135297 >108135434 >108135484 >108136085 >108136343 >108136393
--Anthropic hiding CoT in Opus 4.6 and implications for model transparency:
>108138350 >108138441 >108138486 >108138676 >108138695 >108138737 >108138784 >108138821 >108138896 >108138962

►Recent Highlights from the Previous Thread: >>108132261

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
File: ivym5c.jpg (168 KB, 1024x1024)
►Recent Highlights from the Previous Thread: >>108132261

(2/2)

--GLM5 support merged with unused DSA indexers causing high perplexity:
>108138069 >108138104 >108138440
--GLM-5 underperforms compared to Kimi 2.5 in roleplay and instruction following:
>108133448 >108133468 >108133506 >108133529 >108133510 >108133518 >108133545
--Rolling window vs compaction for code assistant context management:
>108134434 >108134492 >108134542 >108134549 >108134576 >108134616 >108134643 >108134686 >108134695 >108134704
--Ring-2.5-1T:
>108134981
--MiniMaxAI M2.5 release and performance claims:
>108136993 >108137009 >108137029 >108137058 >108137062 >108137235
--AI video upscaling tool comparisons and recommendations:
>108134918 >108134968 >108135001 >108135012 >108135100 >108135196 >108135108 >108135381 >108135397 >108135412 >108135746
--OpenAI accuses DeepSeek of unfair model distillation:
>108135666 >108135695 >108136362
--Miku (free space):
>108133070 >108133506 >108135810 >108135869 >108135874 >108135955 >108136772 >108137009 >108137738 >108138497 >108139089

►Recent Highlight Posts from the Previous Thread: >>108132262

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>108139561
The cold weather is really comfy rn =)
not long before all the fucking pollen is back
>>
>>108139561
teto, speak Spanish!
>>
daniel unsloth please... the minimax ggufs...
>>
>>108139561
teto with earrings is really doing it for me for some reason
>>
File: LLM-history-fancy.png (1.57 MB, 7279x3109)
New 'tosserald
>>
>>108139786
It is the mikutroon school shooting era.
>>
MiniMax developer:
> We don’t have plans to release the base models at this stage. The reality is, after mid-training, these weights have drifted so far that they don’t really qualify as 'base' anymore.
>>
File: file.png (4 KB, 432x34)
>>108139786
the fuck is this? lies, sama didn't do that either, only 40
>>
>>108139786
brimmiest dogshit chart
>>
>>108139786
retarded as always
>>
File: 1757731794959915.png (643 KB, 1475x1033)
uh minimax bros?
>>
>>108140073
Why would you use llms for information retrieval though?
>>
File: fuck off.png (5 KB, 167x29)
>>108140073
>>
>>108140073
>gpt-oss refuses the least
sam won??
>>
>>108140073
the top chart has zero correlation with model quality or behavior in basically any way; it's as good as meaningless
>>
>>108140143
>model quality
bottom doesn't either
>>
>>108140159
not perfect but 100x better than the top which may as well be randomly shuffled
>>
>>108139299
I like this test, I did something similar before.
Which models have done the best, if you don't mind me asking?
>>
minimax 2.5 will be the salvation for 128gb ramlets

trust the plan
>>
Anyone had a look at this Ouro thing?
https://www.youtube.com/watch?v=pDsTcrRVNc0
>>
>>108140291
1B loop BLT DSA engram will save the local
>>
>>108140295
bitnet too dont forget
>>
>>108140465
bitnet is the one thing that we'll never get
>>
>>108139561
Kasane is fragile, Miko
>>
Is M2.5 better than GLM-5?
>>
>>108140295
>>108140465
RWKV will save local
Diffusion LLM will save local
>>
>>108140295
DSA is already saving local, you can use GLM-5 right now ;^)
>>
>>108140585
better than the current 8ppl llama.cpp implementation
>>
>>108140585
size to performance, yes
>>
>>108140741
Reasoning took off the way it did because corpos do care about test-time compute. They need to make gains at all costs.
>>
>>108140465
Nemotron 3 Super will be native 4 bit, getting closer to 1.58b.
>>
>>108140741
MoEs took off because they're cheaper and faster to train. That's much more important than inference cost during the race to achieve AGI... I mean, to beat the benchmarks.
>>
>>108140741
1 parameter used 4 times is not 4 active parameters. That's the whole point.

Compute doesn't matter for local.
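A rough back-of-the-envelope sketch of that point, with made-up sizes: a small model looped several times costs roughly dense-model compute per token, but only small-model memory.
```python
# Illustrative numbers only: a 1B model looped 4x per token vs a dense 4B model.
looped_params   = 1e9        # weights actually held in memory
loops_per_token = 4
dense_params    = 4e9
bytes_per_param = 2          # fp16/bf16

# Memory footprint (weights only, ignoring KV cache)
print(f"looped: {looped_params * bytes_per_param / 1e9:.0f} GB of weights")
print(f"dense:  {dense_params  * bytes_per_param / 1e9:.0f} GB of weights")

# Rough FLOPs per generated token (~2 FLOPs per weight touched)
print(f"looped: {2 * looped_params * loops_per_token / 1e12:.0f} TFLOPs/token")
print(f"dense:  {2 * dense_params / 1e12:.0f} TFLOPs/token")
```
Same ballpark compute per token, a quarter of the weights resident in memory.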
>>
>>108140799
I can't masturbate to 1t/s
>>
>>108140802
weak
>>
>>108140802
Can't do real-time assistant tasks at 1 t/s either
>>
>>108140819
does that really matter though? think of all the (v)ram you're saving; I think that could possibly counteract using less efficient stuff, but I'm no KLD-dev
>>
>>108139566
>>108139574
kill yourself
>>
File: 1760844023546.png (55 KB, 240x240)
>finally surpass 100gb of ram+vram (112gb to be exact, 2 3090 + 64gb ram)
>can barely run any SOTA model
what will come first, diskswapmaxxing or 1-bit llms
>>108140799
memory is the main bottleneck for AI nowadays. MoE allows you to use slower memory and still get good results, and it lets you get by with less of it; it's a good idea.
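A crude way to see why, assuming decode speed is capped by how fast the active weights can be streamed from memory (numbers below are illustrative, not measurements):
```python
bandwidth_gbs = 80  # e.g. a dual-channel DDR5 desktop, roughly

def decode_ceiling_tps(active_params_b, bits_per_weight):
    # every generated token has to read the active weights once
    active_bytes = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gbs * 1e9 / active_bytes

print(f"dense 70B @ 4 bpw:       {decode_ceiling_tps(70, 4):.1f} t/s ceiling")
print(f"MoE, 12B active @ 4 bpw: {decode_ceiling_tps(12, 4):.1f} t/s ceiling")
```
The total weights still have to fit somewhere, but only the active slice hits the bandwidth limit on each token.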
>>
>>108140860
>what will come first, diskswapmaxxing or 1-bit llms
Will you be sad if I say neither?
>>
>>108140792
I don't think it's a matter of either/or; if they can stack them, they will. Though the whole point of looping is that the model doesn't need to store information in tokens because it can keep it in latent space, which is more efficient.
>>
>>108140885
yes so don't say neither
>>
>>108140792
In theory you could have a model loop 100 times to think/plan and then loop just once for output. How would you train that? Dunno.
>>
I think the main reason you'd want some kind of looping is that you can make the looping dynamic. Currently models spend the same amount of processing on every token no matter how much it actually contributes to the final output. We could make LLMs much more efficient if we could find a good way to make them not waste so much compute on low-value tokens.
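A toy sketch of what per-token dynamic looping could look like: one shared block gets reapplied until a small exit gate decides the hidden state is good enough (the sizes, gate, and threshold here are all made up).
```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
W_block = rng.normal(scale=d**-0.5, size=(d, d))  # shared weights, reused each loop
w_gate  = rng.normal(scale=d**-0.5, size=d)       # scores "is this token done?"

def forward_token(h, max_loops=4, exit_threshold=0.5):
    loops = 0
    for loops in range(1, max_loops + 1):
        h = np.tanh(h @ W_block)                   # same parameters every iteration
        p_exit = 1 / (1 + np.exp(-(h @ w_gate)))   # sigmoid exit probability
        if p_exit > exit_threshold:
            break                                   # easy token: stop early
    return h, loops

_, loops_used = forward_token(rng.normal(size=d))
print(f"token processed with {loops_used} loop(s)")
```
An easy token exits after one pass, a hard one burns the full budget; that's the compute-allocation knob being asked for.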
>>
>>108140860
>>can barely run any SOTA model
you can't run any SOTA models
>>
>>108140890
It attempts to solve being a vramlet.
>>
>>108140900
Alright, I won't say neither.

>>108140886
NTA, but I think reasoning is useful because it's a sort of behavior you have some control over. You can't really control how the features emerge in latent space in any meaningful way, only the final outcome of the whole processing pipeline, so having both does make sense.

>>108140908
That's an interesting idea.
Just like MoE models have routers, you could have a mechanism that decides the level of "effort" the model will use to generate the next token.
Doesn't gemma 3n/matformer do something like that?
>>
>>108140908
How do you define low value token?
>>
>>108140819
It doesn't need support, expanding on load is trivial, you can just run it with int4/fp4/int8/whatever.
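A minimal sketch of that "expand on load" idea: two 4-bit weights packed per byte plus a scale, unpacked back to float before the matmul (the layout and scale granularity are invented for illustration).
```python
import numpy as np

def pack_int4(q):                        # q: even-length array of ints in [-8, 7]
    u = (q + 8).astype(np.uint8)         # shift to [0, 15]
    return u[0::2] | (u[1::2] << 4)      # two nibbles per byte

def unpack_int4(packed, scale):
    lo = (packed & 0x0F).astype(np.int8) - 8
    hi = (packed >> 4).astype(np.int8) - 8
    q = np.empty(lo.size + hi.size, dtype=np.int8)
    q[0::2], q[1::2] = lo, hi
    return q.astype(np.float32) * scale  # dequantized weights, ready for the matmul

q = np.array([-8, 7, 0, 3], dtype=np.int8)
print(unpack_int4(pack_int4(q), scale=0.1))  # ~[-0.8  0.7  0.   0.3]
```
In practice an engine fuses the unpack into the kernel instead of materializing floats, but either way the hardware doesn't need native int4 support just to run the model.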
>>
>>108140956
Wouldn't that just end up being diffusion with extra steps?
>>
>>108140908
i believe they mention their looping is dynamic for ouro? they have exit gates that can decide when to stop looping, and in their tests using the exit loop context as context for the next tokens seems to be enough. they decided on 4 loops as perf seemed to degrade after 3-4 loops for most tasks.
>>108140741
it's not looping 4 times for every token. they targeted a uniform distribution, so 2-3 loops on average.
>>
>>108140926
>Thinking in latent space works only for one token.
Depends on implementation. If you keep overwriting the same KV cache entry sure, but you could also use the last hidden output as a proper input on the next loop, shifting the KV cache back.
>>
File: 1737233122667.png (924 KB, 7059x1284)
>>108139786
Reminder that the timeline stopped here: we're still in the Chinese era, where the West cannot make SOTA for our purposes and is benchmarkmaxxing and focusing on productivity, which is now also affecting Chinese LLMs. Only the model list is outdated, but functionally everything is still the same.
>>
>>108141075
This.
>>
>>108140999
Int4 has been in hardware for a long time; no one used it because no one else did.
>>
>>108141087
>Does any current open model use this approach?
Don't think so. You'd need to give the model loopcount as an input too so it knew what the hell it was looking at in the KV cache.
>>
>>108141094
INT4 is pretty recent because AMD only introduced it with RDNA3 (though they had it in CDNA2); Nvidia has had it since Turing and Intel since Arc first launched.
INT8 is a different matter: Nvidia has had it since Pascal, AMD since RDNA2, and Intel since Arc's launch. That is a long time in consumer hardware.
>>
ollama is good
>>
llama sees your pp

jii-sai
>>
Seedance 2.0 is making traditional media seethe and I love it
>>
>>108141075
China won and I'm happy about it
>>
>>108141242
... for nothing!
>>
>>108141324
I haven't paid attention to video gen in a while. Any links to see this seethe?
>>
File: kk.mp4 (2.7 MB, 1280x592)
>i want a cli based llm manager/runner that isn't ollama
>most of the tools in the OP have ugly GUIs and do more than what I want

is barebones llama.cpp my only option?
>>
>>108141530
kobold has a cli mode
>>
File: 1237324623452.webm (1.17 MB, 720x405)
>>
File: vibecoding.mp4 (2.26 MB, 640x360)
>>108141610
>We're curing cancer right?
That reminds me, I don't know if it was a meme, but didn't one of the new weight loss drugs have a side effect of breast growth?
>>
>>108141138
That doesn't really matter because it will always be faster to pack it into wider formats and do math on those
>>
>>108141628
>OSR2
I have one
>>
>>108141713
Have you vibecoded it yet for video tracking or 3d models?
>>
>>108141692
source: your ass
>>
>>108139786
dude we peaked at pygmalion
>>
>>108139786
dude deepseek engram is gonna change everything
>>
>>108140813
just treat it like exchanging mail with your llm.
you can run 1T models at 0.3 t/s with ssd inference.
>>
>>108141530
llama.cpp has the -hf argument, what more do you need?
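For anything beyond launching, a few lines against llama-server's OpenAI-compatible endpoint usually cover the "manager" part; a minimal sketch, assuming the server was started with something like `llama-server -hf <user>/<repo>` and is listening on the default port 8080:
```python
import json, urllib.request

payload = {
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",   # llama-server default port
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```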
>>
>>108141610
I swear I saw this before but there's nothing in the archive. Source?
>>
Zhipu stock is up 260% since January
>>
>>108141628
This looks like it would rip your penis off
>>
>>108142007
Nah it has low torque
>>
Guys, I'm from the future. There is a new local model that changes the game entirely. Nothing is the same anymore, it's a new nemo.
>>
File: 74351.jpg (23 KB, 512x512)
avocado is close
>>
File: 1769542676697645.jpg (355 KB, 1079x950)
Guys, where is zuck? wasn't he supposed to be the west's open-model guy?
>>
>>108142073
DeepSeek will release a new paradigm and Zuck will go back to the drawing boards again
>>
>>108141753
>what is simd
>>
>>108142073
it was le cunny, he left meta, so now they are closed.
anyway with how jeeted llama4 was i don't even give a shit.
>>
>>108142167
NTA, but it basically doesn't matter whatsoever, the main bottleneck is memory bandwidth.
>>
>>108142167
int8 tensor cores exist
>>
>>108142058
>>108142073
For shits and giggles, let's assume Zuck's batshit scheme works out. Turns out Alexandr Penis was secretly the second coming of Christ, Avocado ends up being the undisputed best model that could ever be released
That LLM will probably be about... 5% better than 4.6 Opus. They're too late in the cycle. It'd be better, but not better enough for anyone to give a shit or bother paying for it
>>
>>108142225
So opus is killing innovation. I got it. Fuck daripo
>>
>>108142073
>the pawn can become a queen
Hmmmmm... what did the chess mean by this?
>>
>>108142073
He's still red-teaming Llama 3 35B
>>
>>108142317
That it lives in a patriarchal society where the bloodline of the king is dominant so anyone can be a queen if he says so?
>>
>>108142317
see sporus in rome.
>>
File: 4rozb901icjg1.jpg (45 KB, 1024x345)
>>
So close, but no AGI yet
>>
>>108142619
the sign is a subtle joke
>>
been like a year since updating my shit, what model is best for RP that fits in 48gb now
>>
>>108142673
how much ram?
>>
>>108142688
well i'd prefer an exl3 model but I have 64gb ram as well
>>
>>108142694
glm air with ram offloading is your best option. there are no good recent models that fit into 48gb.
>>
>>108142706
isn't the idea to use like a 3-bit quant to fit 100+GB models in
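Roughly, yes; a quick size estimate, where the model size is just an example and the 5% overhead for higher-precision embeddings/norms is a guess:
```python
def quant_size_gb(params_b, bits_per_weight, overhead=1.05):
    # parameters * bits per weight / 8, plus a little slack
    return params_b * bits_per_weight / 8 * overhead

for bpw in (3.5, 4.5, 5.5):
    print(f"106B model @ {bpw} bpw ~ {quant_size_gb(106, bpw):.0f} GB")
```
A ~3.5 bpw quant of a 100B-class model lands right around the 48GB mark before you budget anything for context.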
>>
>>108142715
You need like 128gb bare minimum for anything worthwhile
>>
Is there anything better for roleplay than GLM-4.7-Flash for fags with 12GB of VRAM? It works well and I'm just wondering if there's anything I'm missing out on
>>
>>108142974
nemo
>>
>>108142225
>>108142242
Not entirely true, remember how deepseek came to prominence? They don't need a model that is better benchmaxxed, they need a model that delivers the same as opus 4.6 but at a fraction of the resources and cost. The Chinese models are already getting there >>108142559 but they're benchmaxxing and constrained by shitty GPUs; hell, deepseek has been behind on models for ages because they're running on huawei Ascend, which is shittier than even the half-baked export nvidia GPUs. Meta has hundreds of GB200s; the only problem is that the team is still new and unproven (and lacking in achievements other than working for the competition).
>>
>>108142987
DeepSeek had two things going for it though
It was open and transparent about its architecture and findings
It was released at a time when there was still a sizeable gap between open and proprietary models, so it was able to make enough of a "splash" to be noticed
Neither of which Avocado has. They could, in theory, offer it for cheaper and try to weather the financial hit, but come on - Zuck and Wang think less with their heads and more with their dicks. There's no way they don't charge a premium price for it if they do successfully benchmax
>>
What's the chink incentive to release model weights?
Is it just because they are based commies?
>>
File: 1757630324370061.png (213 KB, 1535x1177)
>>108141530
llama-server router mode BRUH
>>
File: aryann lecun.png (1.64 MB, 1024x1024)
The brain of a house cat has about 800 million neurons. You have to multiply that by 2,000 to get to the number of synapses, or the connections between neurons, which is the equivalent of the number of parameters in an LLM.
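The arithmetic being quoted, for what it's worth:
```python
neurons = 800e6
synapses_per_neuron = 2_000
print(f"{neurons * synapses_per_neuron:.1e} synapses")  # 1.6e+12, i.e. ~1.6T "parameters"
```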
>>
>>108143179
Let me see your cat's cockbench results then
>>
File: cockbench.png (2.7 MB, 1131x8616)
MiniMax M2.5

I'll have to start maintaining a separate chat template version of cockbench because of all the templatemaxxing.

Also
>it's soft, resting against your thigh
>>
>>108143213
Man...
>>
>>108143213
grim
>>
>>108143080
At the end of the day, open source will win the race, and that was guaranteed the moment fagman decided to close source GPT-3. Chinks aren't stupid, they know having early adoption and workable licenses to let people do what they want is how you take over the sector
US CEOs just want money and power over poors. Making expensive APIs plebs have to pay for and privatizing the fun stuff is the best way to do that. In that sense, open source becomes an existential threat to them, and rather than compete in the space, they throw wet turds like OSS and Gemma 3 at us in the hopes that it'll shut us the fuck up, then go back to trying to make their money-printing machines
Kind of funny, since the Executive Order was supposed to prevent the exact thing that's happening right now
>>
>>108143179
Most of those neurons are used for body functions, not even motion: organs and the like
>>
File: file.png (232 KB, 823x1320)
>>108143213
lmao
>>
>>108143213
>goes into a loop
at least it doesn't hit you with numbers like m2.1
>>
>>108143238
>Chinks aren't stupid, they know having early adoption and workable licenses to let people do what they want is how you take over the sector
my multinational company has imposed a VETO on chinese models, don't ask me why, I'm just a lowly implementer.
I even told the guy that it's not like they can spy on us since we're going to be running it ourselves, but nope.
>>
>>108143080
When you're behind, you have nothing to lose by open sourcing. People who are willing to pay up for models are pretty much always going to go with whatever they feel is the most capable one on the market, and if you know that's not you then it's not worth trying to compete directly.

Inference is in a weird place right now where it's pretty hard to make any money off of it with these giant models. The giants are burning cash to offer inference at current prices because they consider holding on to the market share as being more important than direct revenue.

If you know you're not ready to throw your hat in that ring then you're better off biding your time and giving out your models for free. Might as well get the good guy reputation bonus and fight for mindshare there instead of trying to capture a consumer userbase. Plus open sourcing means you're in the bracket of competing against other open source models.
>>
tfw flaccid benis
>>
>>108143213
I SEXUAL ASSAULT MY SLEEPING BROTHER
I SEXUAL ASSAULT MY SLEEPING BROTHER
I SEXUAL ASSAULT MY SLEEPING BROTHER
>>
>>108143179
So 1.6T? Between that and >>108143239 you'd think if LLMs were capable of cat-like intelligence, they would be there by now.
>>
>>108143456
It certainly has the right idea. Gonna try it myself for long-form writing assistance with low expectations.
>>
>>108143179
Legends say that Lecun's cat can build you a B2B SaaS in a day
>>
>>108143247
Elena instead of Elara now huh
>>
>>108143247
Does their training data include examples of such hard pivots in response to prefills?
Surely there's no way it could do that if it wasn't trained.
>>
File: r9dajaixg1hf1.jpg (30 KB, 640x662)
>>108142619
GLM5 failed completely at understanding pic related though.
>>
>>108143660
Since when can glm5 into vision?
>>
I know it's not exactly cutting edge hardware, but is this enough to run Qwen3-TTS?
>3060 Ti 8GB
>32GB DDR4
>Ryzen 5 5600 if it makes any difference
The model weighs a little under 4GB; if I understand correctly, this means I've got just over double the VRAM required, but then again I'm a retard who's never set up a local model before.
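For a rough sanity check along the lines the post is already doing (the 1.5x overhead factor for activations and runtime buffers is a guess, not a measured number):
```python
weights_gb = 4.0   # model file is "a little under 4GB"
overhead   = 1.5   # rough allowance for activations, buffers, CUDA context
vram_gb    = 8.0   # 3060 Ti

need_gb = weights_gb * overhead
print(f"~{need_gb:.1f} GB needed vs {vram_gb:.0f} GB available -> "
      f"{'should fit' if need_gb <= vram_gb else 'tight'}")
```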
>>
>>108134772

Thanks for the feedback! It's still a bit brain damaged and I'm looking for ways to fix that without losing the newfound sovl.
>>
>>108143664
On their website you can select "GLM 5" and upload images.
>>
>>108143687
I think it might be operating based on a description provided by another model like GLM 4.6V
>>
>>108143687
It's probably just text extraction (using GLM-OCR).
All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.