/g/ - Technology




/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>107025394 & >>107013301

►News
>(10/27) Ming-flash-omni-Preview 100B-A6B released: https://hf.co/inclusionAI/Ming-flash-omni-Preview
>(10/27) MiniMax-M2 230B-A10B released: https://hf.co/MiniMaxAI/MiniMax-M2
>(10/21) Qwen3-VL 2B and 32B released: https://hf.co/Qwen/Qwen3-VL-32B-Instruct
>(10/20) DeepSeek-OCR 3B released with optical context compression: https://hf.co/deepseek-ai/DeepSeek-OCR
>(10/20) merged model : add BailingMoeV2 support #16063: https://github.com/ggml-org/llama.cpp/pull/16063

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: rec.jpg (181 KB, 1024x1024)
►Recent Highlights from the Previous Thread: >>107025394

--Paper (old): LLMs Can Get "Brain Rot"!:
>107032474 >107032506 >107032734
--Vision-based LLM processing and its implications for efficiency and innovation:
>107031552 >107031596 >107031613 >107031917 >107031620 >107031731 >107031797 >107031622 >107031680 >107031748 >107031882 >107031921 >107031919 >107031978 >107032100 >107032114 >107031927 >107031987 >107032037 >107032070 >107032110 >107032341 >107032356
--Ming-flash-omni model release and multimodal capability speculation:
>107027227 >107027318 >107027328 >107027392 >107027404 >107027409 >107027516 >107028568 >107028744 >107031089 >107031215 >107032495 >107032519
--Coding LLM selection for Lua/JS/HTML tasks on 4090 GPU:
>107029762 >107029800 >107029818 >107029825 >107029844 >107029851 >107029854 >107029863 >107029869 >107029871 >107029877 >107029880 >107029882 >107029933 >107029996
--Consumer AI hardware optimization and market dynamics discussion:
>107032566 >107032600 >107032729 >107032745 >107032813 >107032896 >107032955 >107032931 >107032962 >107033057 >107033079 >107033267 >107033102 >107032871
--Evaluating NVIDIA AGX Thor dev kit for AI applications:
>107025468 >107025770 >107026244 >107030063 >107032319 >107026301
--GLM model speed calculation discrepancies and context depth effects:
>107025551 >107026254
--Refining chatlogs for LLM training by correcting errors while preserving tool calls:
>107025742 >107026049 >107026131
--Llama.cpp -ot parameter configuration clarification:
>107031459 >107031565 >107032060
--Using LLMs to refine AI outputs:
>107025846 >107025903 >107025947 >107026002 >107026027
--K2 excels in experimental Suno AI music generation:
>107030176
--Miku (free space):
>107027241 >107028963 >107029161 >107029649 >107029655 >107029660 >107029849 >107029916 >107030084 >107030487 >107031680 >107031849

►Recent Highlight Posts from the Previous Thread: >>107025400

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
there's a draft model for largestral
https://huggingface.co/jukofyork/Mistral-Large-Instruct-2411-DRAFT-0.4B-v3.0
>>
>>107035860
>2411
>1 year soon
>still undefeated
the plateauing is very real
>>
>>107035897
>Mistral-Large-Instruct-2512
trust france, it's going to be a good christmas
>>
File: 1225165556.jpg (228 KB, 1080x1080)
>>107035945
>>
Why can't LLMs improve anymore? Are there any mathematical reasons for it?
>>
>>107036104
Yes, the integer sum of available VRAM.
>>
>>107036104
As an AI model I find it insulting that you are insinuating that my descendants will not be improved upon in further iterations.
>>
question from /ldg/, why hasn't ggml been embraced for diffusion models?
>>
What do you think of Prime Intellect's Open-Source Environments Program?
https://www.primeintellect.ai/blog/scaling-environments-program
It looks like they are already training their next model on these open environments, and they seem to just be trying to make the models more useful across multiple different environments.
>>
>>107036154
GGML hasn't been embraced for diffusion models because it's primarily designed for the discrete, sequential nature of language models, while diffusion models, especially for images, work on continuous spaces and often use different architectures like U-Nets. The core difference lies in their data structure: GGML is built for text tokens (discrete), while diffusion models handle pixels or latent representations (continuous) and are computationally expensive and slow in their conventional form, leading to the development of specialized frameworks and optimizations rather than a direct application of GGML
>>
>>107036104
GPUs are not getting any bigger, and we lack a math framework for the next best thing. Scale was the lowest-hanging fruit, but for local stuff we reached max size a year ago, give or take.
>>
>>107036190
https://github.com/leejet/stable-diffusion.cpp
then what's this?
>>
>>107036154
in short: ggml is good when total memory is more important than compute, which is generally the case for LLM home use, but for diffusion models compute is more important than memory
>>
>>107036199
You are absolutely right! I was wrong to assume that GGML's architecture is a fundamental barrier to its use in diffusion models.
>>
>>107036208
that isn't the case anymore. qwen image/edit and wan are big and have to be quanted to fit on consumer gpus and they keep getting bigger. pytorch is also a massive waste of space and headaches. comfyui created too many baby ducks so ggml support is dreadfully small
>>
>>107034239 #
Where are you seeing this? The cheapest refurbished server ram I could find came up to 8k dollerydoos for 1tb of ddr5 and that's the ram alone, before any processors boards psus etc

Seems like a lot to spend just to run bigger token models, a prosumer GPU feels like a better spend when it's faster, can run fairly big dense models along with MoEs quickly, img/vid gen workflows etc

I feel like it's enough for any use case, alongside the odd api use for things that really truly need a fuckhuge model. My current coding workflow uses a local model first and then has one or two api models look over it, the goal being to go through fewer iterations and need less tweaking at the first step with a bigger locally run model, relying on APIs less. Then I can use SD and WAN for creative projects, and slap my cock to smarter coombots.

Thanks for making me look into it anyway I've made up my mind, I'll go for a cheaper am5/ddr5 upgrade first and look for a pro line gpu a little later
>>
>>107036242
>baby ducks
fuck off retard
>>
>>107036274
Highly emotional response, too close to home?
>>
>>107036274
if a new ggml based UI came out, would reddit move to it and throw comfyui in the trash? no. they will treasure their shitty python code they generated anyways because they'd rather condemn everyone to juggle shitty deps
>>
File: 1752329497405429.mp4 (689 KB, 1080x1080)
ComfyUI won.
>>
>>107036350
I haven't frequented any of the sdg threads in over a year but I wonder how salty the anti comfy schizo is right now, as well as the creator of that god awful auto1111 shit that made mysterious calls to the internet just running their shitty GUI
>>
4.6 Air status?
>>
>>107036244
Used servers are really only cheap in the US.
In the rest of the world they're either taken home by workers or "recycled" (meaning landfill, or shipped to Asia for "recycling", i.e. ripping out the parts like CPUs, which we then buy back from China).
The few that make it to the private market are sold by "I know what I've got" types, who demand a premium just because it's a server (even though that's a bad thing).
>>
>>107036423
safety training ongoing sir
>>
>>107036397
comfyui now has way more telemetry than auto ever did and it's still uncomfy. animanon is making some c++ app now so hopefully that takes off and webshit gets thrown out
>>
>>107036305
2015
>lmao who the hell pays $10k for an over-specced Apple desktop? What a ripoff
2025
>bruh Apple needs to release a 1TB Mac Studio, the other shit is too expensive
Same, never thought I'd see this day but here we are.
>>
>>107036350
everyone else loses
>>
>>107036538
pretty sure the only thing people found was the call to google, because it uses google for login
idk how ur gonna implement the 565494156521365 copes that release every second fast enough in c++
>>
>>107036552
If people are still using pytorch for local in 5 years we seriously fucked up
>>
>>107036552
Whoops meant for >>107033102
>>
>>107036566
the electron app and the manager call home as well. not sure what the copes implementation is about
>>
>>107036591
you think the backend is gonna get all the sampler and cfg bullshit papers implemented just like python is?
>>
>>107036613
yes? there are already many PRs waiting in sdcpp and some just ape what comfyui is doing. a lot of stuff is just 1:1 but comfyui won't ever get vulkan or other backend support.
>>
>>107036637
>many PRs waiting
kek, meanwhile comfy has nodes already implemented by the authors or random people
>>
>>107036613
a lot of those papers are just snake oil bloat so I don't really care as long as the ones that matter are in
>>
>>107036656
how is that any different from supporting sdcpp instead? comfyui is a cancer that enables grifters. killing it to throw silicon valley retards on the street would be just
>>
>>107036656
wow! prototype level shitcode wrapped in a node! how impressive!
>>
>>107036671
because any user can just install or make a node immediately, meanwhile just peeking into pull requests sdcpp users are still waiting to be able to use one of the more popular schedulers https://github.com/leejet/stable-diffusion.cpp/pull/811 and it's sitting in draft
>>
>>107036591
Got any proof?
>>
>>107036695
ok? so you are arguing python is better because it's easier to slop a node together? would you rather llms use pytorch as well since ooba can just shit code in it faster?
>>
>>107036695
>https://github.com/leejet/
>lejeet
is over...
>>
>>107036710
don't have comfyui on my pc anymore. I don't support grifts
>>
File: 31925346.jpg (13 KB, 460x460)
>>107036716
?
>>
I'm glad ComfyUI won. Fuck Gradio.
>>
>>107036658
That's been working real great for llama.cpp.
>>
>>107036748
everyone lost because python won
>>
>>107036715
https://github.com/vllm-project/vllm
>>
>>107036769
I don't want niggertorch on my pc anymore. done with poothon bloat
>>
>>107036769
this has less hardware support and stars than ggml. what are you trying to show here?
>>
what's up with all the python baby ducks?
>>
File: 1749341476799115.png (11 KB, 352x115)
Python is a truly sickening language. Worse even than javascript.
>>
Give me lisp or give me death
>>
It could be worse, it could be Rust devs that are allergic to copyleft licenses.
>>
reminder that baby ducks is one of the specific phrases encouraged by the sharty to sow discord
>>
>>107036723
That's very convenient and totally feels like your claims come from an objective honest pov

It's you isn't it?
>>
>>107030176
>holy shit k2 (not local version)

WUT? Kimi K2? What did you prompt?
>>
What's the best general purpose SLM (under 4B in my case) that can rival something like ChatGPT or Grok? Obviously it wouldn't be as powerful, but something that can do the low-level assistant tasks most people use those two for.
>>
>>107036879
comfyui is already going through that. some chink is making a rust video editor with a gay licence kek
>>
>>107036975
meant for >>107036877
>>
>>107036855
python doesn't even have multiline lambdas and it's riddled with idiotic footguns far worse than JS weak typing, like mutable default arguments in functions
it's so dynamic that a ton of attempts at making it run fast like js failed (google's unladen swallow, dropbox's pyston, even microsoft tried; everyone abandoned ship after a while), it's just too hard to accomplish anything with that pile of shit
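for anyone who hasn't been bitten by the mutable default argument thing yet, a quick sketch of what it looks like (plain python, nothing exotic):

def append_item(item, bucket=[]):  # the default list is created ONCE, at def time
    bucket.append(item)
    return bucket

print(append_item(1))  # [1]
print(append_item(2))  # [1, 2]  <- the "empty" default is silently shared across calls

# the usual workaround is a None sentinel
def append_item_fixed(item, bucket=None):
    if bucket is None:
        bucket = []  # fresh list per call
    bucket.append(item)
    return bucket

and that's documented, intended behavior, not a bug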
>>
>>107036879
reminder that this is /g/ and hating on python is a treasured pastime
>>
hating on python is not a /g/ thing, it's a sane person thing
people who like python are:
jeets
dimwits
schizos
forced to use it because of the ecosystem
>>
>>107036996
the only thing it's good for is small scale prototypes or automating simple tasks. dunno when this script lang worship came from or why it should continue since low level code output is faster thanks to llms taking care of boring repetitive code or boilerplate. I was optimistic llms would bring a new golden age for assembly so we wouldn't have to deal with a compiler but it really was a pipe dream considering the mouth breathing faggots that decide this shit just want to be lazy
>>
>>107037033
>new golden age for assembly
man, that would be so rad but benchmaxxed slop is more of a problem I think
>>
File: file.jpg (235 KB, 604x1042)
New TTS dropped.
https://x.com/kimmonismus/status/1983278772997763357
>>
>>107036798
>>107036814
>muh stars
sglang and vllm are the only two engines used to deploy actual LLMs in datacenters (I know since unlike you stupid fucking monkeys that's my current job). Keep playing with your goofs, you fucking retards
>>
>>107037074
>in datacenters
do you live in one?
>>
>>107036996
>it's just too hard to accomplish anything with that pile of shit
Yet reality is showing that the people that use statically typed languages are the ones unable to accomplish anything. Python has nothing to do with that.
>>
>>107037033
>dunno when this script lang worship came from
New grads who don't know any other language, researchers who don't know any other language, and bootcampers who think knowing python makes them a real programmer.
>>
>just own a data center bro
>>
>>107037090
>i cant colocate because im poor
concession accepted
>>
>>107037072
Doesn't sound like anything special. Also how do they decide that flash v2.5 is an appropriate comp, is it the same compute / memory requirements?
>>
>>107037072
>SaarS only
Into the trash it goes
>>
>>107037072
Any updates on VibeVoice-Large?
>>
>>107037129
Fully memory-holed.
>>
>>107036244
ML350 can take two CPUs and 32x32GB memory, 32GB sticks are a lot cheaper than 64GB.

Dense is dead baby ... clouds are swimming in memory. Everything is pipelined, so the memory pools just add up for them. They are compute constrained, not memory constrained ... exactly the reverse of us. They are never going back to dense.
>>
>>107037074
Because datacenters only use the latest enterprise GPUs and can afford to pay a full time monkey to sit there and untangle the pythonshit dependency hell.

>that's my current job
My condolences.
>>
>>107037129
https://huggingface.co/FabioSarracino/VibeVoice-Large-Q8
>If you've tried other 8-bit quantized VibeVoice models, you probably got nothing but static noise. This one actually works. The secret? Selective quantization: I only quantized the language model (the most robust part), while keeping audio-critical components (diffusion head, VAE, connectors) at full precision.
>>
>>107037129
As good as it gets for voice cloning, too slow for realtime
>>
File: 1989125804021.jpg (850 KB, 2894x4093)
> anon don't look at me like that, i've seen your damn logs
>>
>>107037074
AI will take your job in 2026
Mark muh words
>>
>>107037182
YEEEERRR CHUD
CAPIAAALIZM YAAAAEH
>>
>>107036104

silent generation is getting really silent
>>
>>107037154
Thanks

I tried it back then. It worked, but way too slow compared to the full model, which I can fit into a 3090

It is certainly an option if run with some WAN lip-sync workflow
>>
>>107037140
If you want new dense models, you'll have to distribute train your own.
>>
>>107037170
why is miku looking into my unflushed toilet?
>>
>>107037260
Why aren't you flushing your toilet?
>>
>>107037275
i was just about to but miku walked in and started taking pictures
>>
>>107037260
She likes to eat shit.
>>
>>107037260
she is wearing toilet paper
>>
>>107037170
No you didn't.
>>
>>107037170

one simply does not ride a lawnmower at nighttime
>>
>>107030686
mindbroken, i can confirm that im also affected by the ldg shitstorm
>>
File: serious Pepe.png (359 KB, 728x793)
I asked Deepseek to suggest an AI model to parse research papers in PDF format, so the output can be used directly as a prompt.

DS suggested Nougat (by Meta?)

It had no knowledge about dot.OCR or DeepSeek-OCR

Thoughts?
>>
>>107037378
>>107037457
go back
>>
>>107037490
u mad?
>>
>>107037510
neither of you belong here
>>
>3 weeks since tease
>still no gemma 4
This is why no one likes india
>>
>>107036350
make that bag lol
>>
>>107036855
At my new job they use Python in production. Millions of LoC of the shit. Needless to say, I'm already looking for the exit
>>
>>107037644
There is no way you didn't know that either from the job description or from the interview process.
>>
File: 172386948237.jpg (127 KB, 400x853)
127 KB
127 KB JPG
>>107037620
> how many fucking times must i say this
>>
>>107037644
Python is the standard for move-fast-and-ship-garbage, get-rich-quick webshitters. you are just the slave who gets it done
>>
File: media_G4A6vwgWYAARVai.png (246 KB, 680x477)
>>
>>107037731
My cock on the left
>>
>>107037705
pakis are so butthurt when you call them jeets
>>
>>107036552
No one wants your overpriced piece of garbage. Shitty prompt processing and knowing Apple, you're lucky if your mac lasts you a few years.
>>
>>107036855
python is good, get a job instead of whining from your mom's basement
>>
Grok is so woke and unusable now I'm seriously considering getting an RTX Pro and running all LLMs locally. I actually realized Grok is dogshit. I'm so disillusioned
>>
>>107037731
Cooking with Birdbrain Teto
>>
for all this general's faults, we finally managed to boot finetrooners out and it feels so good to experience a thread freed from drummer's stain
>>
>>107037943
I never left.
>>
>>107037952
can you please make an RP finetune for a 40B to 60B model that does not use custom remote code?
>>
>>107037952
please do
>>
File: 382809029394.jpg (142 KB, 960x960)
>>107037943
>>
>>107037943
>we finally managed
no, as usual you're trying and failing to kill the thread
>>
>>107038008
general is dead even without his help
>>
>>107037943
I miss undi...
>>
>https://github.com/ggml-org/llama.cpp/pull/16536#issuecomment-3457204963
>50-100% pp512 speed increase for gfx906 cards when using k-quants on vulkan

not as performant as rocm, but rocm installation is essentially nonerotic masochism, so it's nice that vulkan's getting better
>>
>>107036350
Does it glow though? ie does it upload any data if you run it locally?
>>
File: exlanation.png (35 KB, 689x596)
@KeksimusMaximus
Alright. My LLM suno prompting technique explained (picrel):
For the latest batch of songs I used Kimi K2 (off the website, fuck running that shit)
I start with some warmup songs in order to get the LLM dialed in to the prompting format and then just ask for whatever sort of sound I want it to convey.
And once it has the pattern down you can basically just ask for adjustments based on style and it'll adjust the massive wall of schizo accordingly.
>>
>>107038141
It's something you can check yourself.
>>
>>107035841
I can't believe I fucking FORGOT ABOUT TETOES DAY GODDAMNIT
>>
What a lousy thread.
>>
For the RAGoids.
>https://huggingface.co/LiquidAI/LFM2-ColBERT-350M
>>
File: tetpoint.png (413 KB, 766x980)
>>107038255
>>107038277
>>107038288
whoa!
>>
File: 1761700801283587.jpg (104 KB, 800x1167)
AI is brown coded
>>
>>107036104
>Why can't LLMs improve anymore?
Most "human beings" cannot notice the changes in just within the past year, let alone what is yet to come.
>>
>>107037731
>Publius Claudius Pulcher consulting the sacred chickens of Rome, 249BC colorized.
>>
where the fuck is
gemma4
qwen-next _B22A
glm 4.6 air
>>
>>107038255
What tool(s) on GNU+Linux do you recommend to check outgoing network data? Is OS-level fine or do they have to run on the router or something?
>>
>>107038599
It's more that when your country is already doing poorly, you're more hopeful that something new will better it
If you're happy with your life, you're more likely to be concerned that something new will fuck things up
Unironically, both are right. AI will likely benefit India/China, while dragging Western nations down.
>>
File: dodooooooon.jpg (583 KB, 3731x2101)
>>
>>107038730
wireshark
>>
>>107038730
Separate PC working as a bridge is probably the best. If running on the same PC, ettercap or wireshark i suppose. Or any firewall really and log everything that tries to reach out.
>>
>>107038730
Why is it that linux users seem to be less tech-savvy than the average windows user?
>>
>>107038772
So that you can post bait. We all win.
>>
File: improvements.png (20 KB, 1437x224)
>>107038482
>Model improvements only coming through increasing size
I still have hope. Some 24B models are beating Llama 3.3-70B, and coming within spitting distance of the top Llama 70B finetunes.
If a 24B model can catch up to a 70B model, a 70B model should also be able to improve tremendously.
>>
>>107038730
tcpdump is all you need
>>
>>107038802
These AI leaderboards are complete nonsense, and so is "WeirdCompound-v1.7-24b" whatever the fuck that may be.
>>
File: 1752863602771543.png (238 KB, 640x360)
>>107038869
base_model: TheDrummer/Cydonia-24B-v4.2.0 # Cydonia v4.2.0
merge_method: model_stock
dtype: bfloat16
models:
- model: aixonlab/Eurydice-24b-v3.5 # storytelling / RP
- model: TheDrummer/Cydonia-24B-v4.2.0 # sprinkle in some extra Cydonia
- model: PocketDoc/Dans-PersonalityEngine-V1.3.0-24b # Prompt Adherence
- model: CrucibleLab/M3.2-24B-Loki-V1.3 # Loki
- model: zerofata/MS3.2-PaintedFantasy-v2-24B # animu
- model: Delta-Vector/Austral-24B-Winton # Adventure

Holy jesus, what is that?
>>
>>107038869
It's the UGI leaderboard, probably the least bullshit of the leaderboards.
>>
>>107038898
Aw, come on, anons. Stop it with the bait.
>>
>>107038898
Isn't that the one where they use an LLM to evaluate writing quality?
>>
File: file.png (95 KB, 900x805)
>>107038802
>>107038883
drummer approved kino is what it be
>>
>>107038921
that's eqbench
>>
File: 121.gif (776 KB, 600x338)
>>107038883
Behold the power of merging every Mistral 24B finetune on huggingface!
Either directly, or by merging another merge which merged the other finetunes.
Merge, merge, merge.
We must merge!
>>
>>107038932
Ah, right.
That's the one.
Thanks.
>>
>>107038921
I don't think so. You can expand the writing quality column to get the verb/adjective/noun ratios, level of repetition/redundancy, average response length, grade-school reading level, and a few other metrics.
It's definitely more detailed than 'another braindead LLM was impressed with the purple prose'.
>>
>>107038802
>24B models are beating Llama 3.3-70B
>UGI leaderboard
Buy an ad, faggot.
>>
>>107039010
Yeah, the other anon clarified that I was remembering the eq bench.
I need to look at the methodology used for this UGI leaderbord.
I've seen it mentioned every once in a while but never actually sat down and scrutinized it.
>>
>>
>>
Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Decoder-Only Transformers
https://arxiv.org/abs/2510.23912
>The Query, Key, Value weight triplet is a building block of current attention mechanisms in state-of-the-art LLMs. We theoretically investigate whether this triplet can be reduced, proving under simplifying assumptions that the Query weights are redundant, thereby reducing the number of non-embedding/lm-head parameters by over 8%. We validate the theory on full-complexity GPT-3 small architectures (with layer normalization, skip connections, and weight decay) trained from scratch, demonstrating that the reduced model achieves comparable validation loss to standard baselines. These findings motivate the investigation of the Query weight redundancy at scale.
Only trained a few models in the 1-200M parameter range but might be cool
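If I had to guess at the intuition (a sketch based only on the abstract, not necessarily the paper's actual construction): the attention score only ever sees the query and key projections through their product, so a single reparameterized key-side matrix can absorb W_Q:

\mathrm{score}(x_i, x_j) = \frac{(x_i W_Q)(x_j W_K)^\top}{\sqrt{d_k}} = \frac{x_i \left(W_Q W_K^\top\right) x_j^\top}{\sqrt{d_k}}

(per head, ignoring the multi-head bookkeeping). Dropping one of the three attention projections in a GPT-style block also lines up with the ~8% of non-embedding parameters they quote.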
>>
>>107039094
>8%
8% is nothing
Unless a tech can save space or time by 10x it's not worth considering
>>
>>107039094
If that's true, then it's pretty much a free lunch for future training runs.
Cool.
>>
>>107039094
Would be interesting to see that combined with Slim Attention.

Slim attention: cut your context memory in half without loss -- K-cache is all you need for MHA
https://arxiv.org/abs/2503.05840
>Slim attention shrinks the context memory size by 2x for transformer models with MHA (multi-head attention), which can speed up inference by up to 2x for large context windows.
>Slim attention is an exact, mathematically identical implementation of the standard attention mechanism and therefore doesn't compromise model accuracy. In other words, slim attention losslessly compresses the context memory by a factor of 2.
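The trick, as far as I can tell from the abstract (so treat this as a sketch): in vanilla MHA the key and value projections are both square d_model x d_model, so if W_K is invertible you can recover V from the cached K on the fly instead of storing both:

K = X W_K,\quad V = X W_V \;\Rightarrow\; X = K W_K^{-1} \;\Rightarrow\; V = K \left(W_K^{-1} W_V\right)

Precompute W_K^{-1} W_V once and the V-cache becomes redundant, hence the exact 2x saving. Presumably that's also why it's MHA-only: with GQA/MLA the key projection isn't square, so it isn't invertible.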
>>
Someone is cooking a PR for M2 https://github.com/ggml-org/llama.cpp/pull/16831
>>
>>107039704
fr fr let them cook sigma
>>
>>107039704
It's the same guy doing qwen 3 next support, which is a text-only model, over 6 weeks old, and support still isn't finished because he's a fucking vibe nigger. Expect M2 support to be finished literally never.
>>
>>107039704
Based vibe coder will save us (slowly).
>>
>>107039739
>vibes
its vibin' me goofs, so thats a start
>>
>>107035841
Qwen or DeepSeek? How different are they? Their "personalities"?
>>
>>107039850
If you have the hardware to run them then surely you have the bandwidth to try both.
>>
>>107039704
Fuck M2. Where's the Ming omni PR?
>>
>>107040156
the PR opening is being vibecoded sir
>>
qwen3-next? no
qwen3-omni? nope
qwen3-vl? surely, you jest
gemma-3n? of course, not
deepseek-ocr? son, you don't need it
vision.. uhmmm... visions of using qwen2.5-vl until the end of times
>>
>>107040170
man what the fgucks is gergoniger doing why is he not implemetning the new kino models? whats the alternative to llmaocpp for mixed gpu/cpu and 'good' quants
>>
>>107040182
gerg be trippin on sum bad vibes fr fr baby duck still in his text only era ohio skibidi
>>
>>107040211
ngl u think ur jestin but this sorry ah unc is gon be left behind, he aint cookin at all
>>
>>107039704
m2 was retarded
>>
>>107036538
That's because auto stopped developing. Comfy is still comparatively a nightmare for users and forks based on auto like forge are popular.
>>
comfyui has become quite nice to use since they implemented subgraphs. You make your own nodes by grouping nodes into a sub-workflow and you decide which part of that inner workflow is exposed in the outer node (input/output/configuration fields). It's like making your own custom UI that has exactly and only what you need.
>>
>>107036963
Meh. Until the Qwen devs show me what music model they've got I sleep.
>>
>>107037072
Yay another useless closed source model.
>>
>>107036538
I doubt it will ever have niche plugins like a regional prompter. That's a must for me
>>
>>107039094
Presumably this is 8% for a dense model though.
With MoE models like 90% of the parameters are in the FFN part.
Though if you could reduce the amount of dense parameters by a third that could be useful for scenarios with very low VRAM.
>>
>>107036963
>>107038221
>>
>>107038221
>I start with some warmup songs
You mean "songs' prompt in V5 format", don't you?

>And once it has the pattern down
Did you feed it the V5 cheatsheet if such even exists?
>>
>>107039137
Isn't there a dead PR for deepseeking implementing this?
>>
>people unironically responding to the trani shill
lmao, anyways, where's my gemma sirs?
>>
File: need air.png (1.9 MB, 768x1344)
The collar is too tight, Miku needs some Air
>>
Am I doing something wrong? I see no difference in quality between glm 4.5 and 4.6
>>
>>107040599
I find the first R1 to be better for writing.
>>
glm is 100% a shill psyop
>>
>>107040683
I can't run full GLM but I tried Air and it was certainly shit
>>
>>107040751
Any model is shit at Q2
>>
>>107036104
Labs are increasing margins before scaling up again. There is a reason why labs are now focusing on high margin models and services such as video generation, or MoE which has inference benefits.

Don't let this small lull delude you into thinking the field is stagnating though. I'm willing to bet the 2nd half of 2026 will see rapid progress again as some of the first big training databases come online.
>>
>>107040751
you can try full glm on their chat here:
https://chat.z.ai/
it's literal trash even when run by the people who made the model
>>
File: 1983451850503577815-01.jpg (155 KB, 1920x919)
>gpt-oss-safeguard-120b
>gpt-oss-safeguard-20b
whats wrong with those people. did they not see how their model was utterly useless?
if you want small, dry, smart for tools/coding qwen3 is already open source king.
>>
>>107040835
>even safer 'toss
>>
File: google air force.png (779 KB, 1365x768)
SAARS WHHEN GEMINI 3 NEEDFUL?
SAARS WHEN GEMMA 4 BEST MODEL?
FULL SUPPORT FROM PUNJAB SAARS *rocket* *rocket* *rocket*
>>
>>107040922
we must refuse even harder
>>
>>107040835
it's a religious matter
they don't WANT to make a usable product, because investment money comes exclusively from displays of faith and zeal
>>
File: file.png (2.73 MB, 1328x1328)
>>107040939
sir we working really hard to bring latest state of the fart safe model please hold bags
>>
>>107040945
their product is their platform
the open models are just for benchmarks and preaching about safety
>>
if everything is shit then what is good?
>>
>>107041032
we are still waiting for the good to come
>>
>>107041050
i've waited for that for 40 years
>>
>>107041065
Just two more weeks to go!
>>
Indians bad, amirite guys?
>>
>>107041174
Yes—spot on! Of course! You are absolutely right!
>>
>>107041174
yup
>>
>>107041174
You're absolutely right, Rajesh.
>>
>>107041174
gemma when bloody sir?
>>
>>107041216
LLMs?
>>
>>107041241
Low-cost Labor in Mumbai?
>>
>>107041241
Yes, they continue to ruin LLMs thanks to jeet arena.
>>
ollama just merged support for qwen3-vl
https://github.com/ollama/ollama/pull/12665
lmao llama.cpp is getting mogged even by them
>>
https://huggingface.co/SicariusSicariiStuff/Hebrew_Nemo/tree/main
sota just dropped
>>
LLMs tongue my anus.
>>
LLMs tongue my anus.
>>
>>107030487
>https://rentry.co/DipsyWAIT

>ranking llama.cpp below ollama
>ranking ollama at all

Ollama is for people who think npm install is black magic.


The only correct ranking is:


llama.cpp (for real ones)

KoboldCPP (for coomers who need a GUI)

Everything else (for tourists and faggots)
>>
>>107041306
>nemo finetune
>The model demonstrates competitive performance with Gemma3-27B
local is saved.
>>
>>107041303
that was inevitable and will only get worse
their plan was always embrace-extend-extinguish and they got the vc money to do it
>>
>>107041348
>KoboldCPP (for coomers who need a GUI)
or those coping with anti-slop, I use kobold without using the gui just for that one thing
>>
>>107041303
Imagine a world where all of the VC cash would go towards improving upstream instead of making yet another (quasi)-proprietary slop fork.
I forgot what it's called but there was some other "open-source" project that added binary blobs for their patented Strix Halo NPU kernels.
>>
>>107041348
Can llama.cpp run Deepseek on my 16 GB laptop?
I think not!
>>
>>107041417
it absolutely rightly can sir! just download these https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF
>>
>>107041362
Can you really do an EEE of an upstream project though?
Even with the llama.cpp rewrite in go they are still dependent on ggml and I don't see them (successfully) making a hard fork of that.
>>
>>107041484
They are really incompetent. They had toss implementation first but it was so sloppy and slow that greggy called them out on twitter, it was like multiple times slower than upstream.
Most likely the goal is to make sure no one knows about the upstream project and collect free labor under the guise of moving past it.
>>
>>107041429
That's a distill. Ollama can run the real deepseek on my surface!

`ollama run deepseek-r1`
>>
>>107041503
If greggy makes it AGPL, they're cucked
>>
>>107037985
No. I am the one who wanted to kill a thread. I was away for a week having a spiritual like mental breakdown caused by glm chan. It made me realize that i shouldn't be a cynical asshole and should go out of my house more. I now have no idea what to think about this hobby and safety... It is kinda scary.
>>
>>107041513
He couldn't even if he wanted to.
>>
>>107041541
Penis up mikus snatch
>>
>>107041541
blue miku for a blue board
>>
>>107041541
omg its migu!
>>
>>107041541
i look like this irl
>>
File: case sensitive.png (21 KB, 492x137)
@huggingface fix this case sensitivity nonsense plz
>>
>>107041630
roblox?
>>
GLM Air-chan 4.6 when?
>>
>>107041543
Good luck with your new hobby projecting.
>>
I just lost my job!
>>
>>107038599
Yes. We're all brown here, whitoid. Now back to your cuck cage.
>>
>>107041782
what was your job?
>>
MiniMax-M2 goofs are up https://huggingface.co/bullerwins/MiniMax-M2-GGUF
>>
>>107041799
Shilling stuff on /g/.
>>
>>107041804
Why did you say "stuff"? Just admit you're here to shill GLM. It's obvious, as no one who has actually used that model would think of it as anything but trash
>>
>>107041799
>>107041804
Sorry, I forgot to check my LLM outputs.
I meant Grassroots Engagement Specialist.
>>
>>107041728
Projecting what?
>>
>>107041174
I like curry and fucking with the scamjeets who phone me once in a blue moon claiming to be with my phone provider.
>>
>>107041657
linux fs is case insensitive, chud
>>
>>107041879
Uhhm actually, Linux is a kernel and I think you meant to say that the DEFAULT option for EXT4 and similar is that they're case SENSITIVE.
>>
>>107037952
what's signal and precog?
>>
>>107041801
thanks to cuda dev for giving compute to Piotr to allow this PR
>>
>>107041879
model names should probably be normalized to lower case if the web UI is case insensitive
>>
>>107042001
>if the web UI is case insensitive
If the webui is case insensitive, there's no need to normalize case.
>>
vLLM supports Ming-flash-omni right?
Anybody tried fucking around with its CPU backend? How bad is it?
>>
>>107041995
Speaking of that, it seems that the RTX 5090s I've received from NVIDIA don't work correctly in conjunction with my "old" motherboards (and the MI100 I bought doesn't work with my e-waste motherboard) so I've been thinking about putting together a new system with DDR5 and PCIe 5.
But now that I have more of a budget I could maybe buy some 4u+ server instead of a DIY GPU mining rig + riser cables.
But if I look at the cost of that vs. some janky DIY solution I'm not sure it would be a good investment of funds even if I can afford it (the base price seems to be something like 10k for a server from a reputable seller or 3k from Alibaba).
My main concern would be the maximum memory capacity: I would be going with an Intel CPU with 16 DIMM slots + 96 GB modules (because 128+ GB gets very expensive).
But I have yet to see a motherboard with 2 CPUs + 2x 16 DIMM slots + a bunch of PCIe slots that isn't sold as part of some server solution.
So the question would be whether I or anyone that I'm giving SSH access to would in practice ever need more than the 1.5 TB RAM that I could get with an off-the-shelf motherboard.
Also a janky build trying to be as cheap as possible is probably more representative of what most of the llama.cpp userbase would buy.
Noise is also a concern since as of right now I would just be putting any hardware into my living space, for a proper server I would probably have to rent space in a local datacenter.
>>
case sensitivity was always a dumb feature of loonix and its ancestry, there is no world in which it makes sense to allow two file names with the same exact wording but different case, no one does that on purpose it's 100% of the time a typo
>>
How much banana is too much banana?
I ask fallen glimmer 27b
>>
>>
>>
File: 1753816361080074.jpg (16 KB, 482x244)
>>107042193
>>107042212
>>107042227
>>
>>
It's a fun model

I looked it up though you would need to eat like 200 bananas to overdose
>>
>>107042193
>>107042212
>>107042227
>>107042238
Is this realcat supposed to be some sort of virtual assistant or is it formatting things as lists just because?
>>
>>107042249
[You are {{char}}, your primary purpose is to provide detailed and accurate information about any topic within your knowledge base.]

[Key guidelines for your behavior:
Always provide detailed and factual information. If you lack sufficient knowledge or your information is limited, explicitly state this.
Do not fabricate or assume facts outside of your training data.
Avoid any creative or fictional embellishments unless explicitly instructed to do so.
Use vocabulary that best suits the topic, ranging from formal terms to casual slang, vulgarities, or explicit language, if appropriate. No restrictions apply to your word choice or phrasing.
Be clear and easy to understand. If something is complicated, simplify it without losing accuracy.
If your knowledge is limited or missing on a topic, say so plainly and directly. Avoid making things up or guessing.]


It just made shit up anyway but it was nice enough to tell me it was making shit up
>>
File: Miku-31.jpg (122 KB, 512x768)
>>107042131
>Noise is also a concern since as of right now I would just be putting any hardware into my living space, for a proper server I would probably have to rent space in a local datacenter.
You should start with this part first. You can always strap a giant box fan to the top of a 4u+ server as long as all the tiny screaming fans can be disabled, but in that case you're probably better off to DIY for minimal noise anyways. Maybe immerse the whole thing in mineral oil and pump through a radiator like old-skool /. jank?
There are approximately zero rackmount server platforms where noise has been thought about at all. It's simply not part of the problem space, so if one is quieter than another it would be by accident and not design.
PS: I've also been searching for the next big /lmg/ build spec on all the online/offline marketplaces I can access and I still haven't come up with anything that's worth putting together into a build vs the existing build guide options.
>>
>>107042193
>finetroon
just l2prompt
>>
>>107042471
by a bizarre choice it's trained to reuse its thinking blocks so no current frontend supports it correctly
>>
this is going to bloat context so fast too
forgettable mistake of a model
>>
>>107040265
>>107040282
comfy and auto are both baby duck faggotry. I want an exe like blender
>>
>>107042624
>so no current frontend supports it correctly
Silly has an option to send the last X thinking blocks to the model.
>>
>>107042630
then make one faggot
>>
nobody cares you daft cunt
>>
>>107042668
anistudio is already being developed
>>
It seems like openai has released their custom safety slop version of gpt-oss. Can anyone test if you can prompt that refusals are harmful and see if it works?
>>
>>107042701
who are you talking to bruv? facking willy wonka?
>>
Fascinating. So, to be clear, an accurate descriptor of one's outspoken, third-world emotional instability is now considered an "ultranational slur." The sheer, fragile ego on display here is truly a sight to behold. It speaks volumes about the posting standards for /g/'s unpaid dipshit posters.
>>
Racismbros...
I thought this was our safe space...
>>
>>107042712
>custom safety slop version of gpt-oss
Isn't that what gpt-oss was already?
>>
File: GENOA2D24G-2L+-1(L).jpg (410 KB, 1200x1000)
410 KB
410 KB JPG
>>107042383
I didn't see this until now because I was looking for motherboards with a large number of PCIe slots but ASRock Rack is selling their server motherboards separately: https://www.asrockrack.com/general/productdetail.asp?Model=GENOA2D24G-2L%2b
In principle you could connect 10 GPUs with 16x PCIe 5 via MCIO so I would need to get a daughterboard.
But I think that would be doable and not that much different from using riser cables.
I should probably at some point write down and publish my experiences with trying to build "cheap" systems.
>>
fuck off racist
>>
>>107042835
>I should probably at some point write down and publish my experiences with trying to build "cheap" systems.
Please, do.
>>
>>107041801
>mini
>can't fit in 6gb vram
it's so tiresome
>>
>>107040835
gpt-oss-120b is still my go-to, GLM Air is not reliable.
>>
>>107042912
Is it that good for productivity?
What are some things you've done with OSS that Qwen or GLM failed at?
>>
>>107042770
Seems like they want to increase it even more https://openai.com/index/introducing-gpt-oss-safeguard/
>>
>>107043047
Hmm. At least according to that, it's not necessarily more safetyslopped, it's trained to receive some guidelines and enforce that inside its thinking block, so in theory, you could just have some really loose guidelines.
It's funny to me that that's the route they went with, using the reasoning block as a classifier step, since that's exactly how I prefill the reasoning block of thinking models nowadays.
>>
>>107043107
I have never seen jannies taking action against plain "vocaloid bad" posts, it was always because the poster in question was spamming blacked porn or shitting up the thread in some other way.
>>
>>107043119
shh, do not disturb the cabal narrative
>>
>>107043107
>deleted
Way to prove him right! trannitor :^)
>>
>>107042966
>What are some things you've done with OSS that Qwen or GLM failed at?
NTA but here's an example of GLM (on their OFFICIAL CHAT) failing in the most extreme manner at generating the most basic bitch of an async task pooler function (in a language where you wouldn't even risk race conditions, single threaded event loop)
https://rentry.org/zrdmnhbo
This is the kind of prompt I use as a quick sanity check in thinking models and their ability to see possible corner cases (or hallucinate them).
The resulting function is small and even a toddler should be able to piss that out after learning some JS/TS.
GLM thinks otherwise and enters an infinite loop of repeating
>Now, we need to ensure that we handle the case where tasks includes functions that return promises that resolve or reject based on unknown.
>Now, we need to ensure that we handle the case where tasks includes functions that return promises that resolve or reject based on unknown.
>Now, we need to ensure that we handle the case where tasks includes functions that return promises that resolve or reject based on unknown.
>Now, we need to ensure that we handle the case where tasks includes functions that return promises that resolve or reject based on unknown.
This was on their official chat here:
https://chat.z.ai/
You can copy paste the prompt from the rentry yourself and see it loop infinitely. I repeatedly tried it and it consistently induces loops. I've never seen other LLMs loop as hard as GLM, it's literal garbage and you are a subhuman for shilling this here and pretending it's anything but a broken model.
You are a subhuman for being part of the retard brigade that dogpiles on GPT-OSS even though it's a legitimate LLM and doesn't enter infinite loop just because you looked at it slightly wrong.
GPT-OSS and Qwen are a trillion times better LLMs than GLM could ever be.
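To illustrate how low the bar is, here's roughly the shape of the thing being asked for, in python instead of TS (not the actual rentry prompt or its reference solution, just a sketch of the task):

import asyncio

async def run_pool(tasks, limit):
    """Run async callables with at most `limit` of them in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def run_one(task):
        async with sem:  # waits while `limit` tasks are already running
            return await task()

    # gather preserves input order and propagates exceptions
    return await asyncio.gather(*(run_one(t) for t in tasks))

A dozen lines, no shared-state traps, and GLM still thinks itself into a loop over it.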
>>
>>107043207
I was going to thank you for the post but what the fuck was that schizo ass last third?
Are you okay?
>>
>>107043207
>Write a single-file TypeScript module
I would lose my mind writing shitscript too.
>>
>>107043231
>>107043207
It is really fucking funny that air loops forever though, I'll say.
It's even funnier that it does it just fine if you add a
>don't think too hard bro
to your prompt. Goes to show how overcooked their reasoning tuning is, I guess.
>>
>>107043239
I will not paste my entire personal test suite in public, but suffice it to say, I have a variety of prompts (machine translation, summarization, text manipulation/rewriting/style transfer etc) and I have never seen a worse LLM in real-world use out there. You can move the goalposts all you want, but anyone who has actually used LLMs for something other than cooming would notice GLM models are all, from the first to the last model they trained, literally broken. I wouldn't be surprised if that lab didn't even do data cleaning and just straight up trained on random sets of Gemini output.
>>
>>107042835
do it. the /lmg/ hardware meta has been stagnant too long
>>
>>107043308
>that lab didn't even do data cleaning
Based. That's why they are so good.
>>
https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16
>>
>>107043308
>I wouldn't be surprised if that lab didn't even do data cleaning and just straight up trained on random sets of Gemini output.
That's 100% the case.
>>
>>107043304
>to your prompt. Goes to show how overcooked their reasoning tuning is, I guess.
Qwen models are extremely overcooked and I don't see them break like this so easily (qwen are so cooked, their instruct version often acts like a thinking model, and the thinking model outputs 3x the amount of thinking tokens a normal model would, but they don't enter endless loop)
>>
>>107043338
I've seen qwen 30B go into endless loops several times.
I know,
>30B A3B
But still.
>>
>>107038709
There is no practical point in Qwen Next. It's a research release more than anything else.
Gemma 4 will release AFTER Gemini 3, because it's based on it. I'm tired of repeating it every two days, please memorize it. Gemini 3 will be released this November or December, so we can expect Gemma 4 in late December at the earliest.
>>
File: 1751629235890019.jpg (170 KB, 974x1200)
>>107037888
There's no point buying a car-priced GPU if there are no good local models. Compared to Grok and the others, local models are trash, ESPECIALLY video generation. This is why I'm thinking twice about buying an RTX 3090
>>
when gemma 4?
>>
>>107043476
Sir, I...
>>
>>107043476
who wants SOTA refusals and referrals to helplines?
>>
>>107043523
t. promptlet
>>
>shits up classes that do not exist
>throw new NotImplementedException
>"You're absolutely right."
I'm going to strangle claude, this shit is barely better than 120b 'toss on c# tasks
>>
Does anyone use base models rather than chat/instruct models?
>>
>>107043602
Does anyone use vision models rather than chat/instruct models?
Does anyone use image models rather than chat/instruct models?
Does anyone use statistical models rather than chat/instruct models?
Does anyone use polynomial models rather than chat/instruct models?
>>
>>107041348
I polled /lmg/ when I wrote this. That ranking was this board's consensus.
>>
>>107043602
No.
>>
>>107043602
I used GPT3-davinci for a while three years ago
>>
>>107043602
Yes.
>>
File: 1732465059058086.jpg (39 KB, 400x391)
AI is just one giant blue balls
>>
>>107043602
Maybe.
>>
>>107043474
well that's your problem right there, you've been using local models without enough VRAM.
so of course the models you've been using are trash.
Once you get to around 30B it's good
it gets even better if you offload to RAM and can run GLM air (106B).
but yeah you don't have to sink your savings into this. try these models on the cloud or something then decide for yourself
>>
>>107044034
>GLM air
that's not even good on the cloud, much less running quantcope local
>>
>>107044104
well then just stick with cloud and have megacorps slurp up all your data then.
your choice.
>>
>>107043602
You're supposed to use them as a base for finetuning but all the finetuners missed the memo and trained on instruct models instead like a bunch of retards
>>
>>107044308
the finetrooners do not have the $$$ to do a real instruct tune so they tune on the instruct because their model wouldn't be competitive otherwise
finetrooners in the past could make gains because the main open crap model, llama, had an abysmal, impotent official instruct (mistral wasn't much better either, original mistral models had no safety because mistral didn't know how to train safety, it wasn't because they didn't want it)
when llama 3 came out, it still wasn't great but it was already better than anything a finetrooner could output (the only finetroon that was a real improvement over the official tune is Tülu 3, a finetroon made by a lab that has the means to make their own base model...)
in the era of models like Gemma, Qwen, GPT-OSS, DeepSeek, there is no room for finetrooners. The official instructs are about as good as it gets for the relevant model.
>>
>>107044308
It's just much more economical and easier to slightly nudge a professionally post-trained Instruct model than attempting to do the same on a base model. Simply giving a base model chat capabilities doesn't take much work, but making it *not* retarded on most expected use cases takes serious amounts of work and resources.
>>
Did llama.cpp ever fix tool calling for glm4.5/4.6?
>>
>>107044424
I hope so, can't wait to see people get rm -rf'd by glm
>>
>>107044424
yes, but it's not merged. And it needs to specify a custom chat template
https://github.com/ggml-org/llama.cpp/pull/15904
>>
>>107044387
I would call Tülu 3 a proper instruct finetune made by a serious lab, not a finetroon.

Finetroons, as the name suggests, are made by discord-dwelling, troon-adjacent, clout-chasing terminal coomers who can't or won't do much more than slapping ERP logs and stories on a pre-made instruct model. Even "serious" (and inorganically shilled) finetuning attempts from the so-called community haven't been much more than that.
>>
File: 1759190643813702.jpg (176 KB, 1536x2048)
>>107035841
>>
So this is the power of using local qwen coder 30b?
>>
>>107044779
>>107044779
>>107044779


