/g/ - Technology


Thread archived.
You cannot reply anymore.




File: 1751337189513720.png (892 KB, 1080x1080)
I feel like the differences are mostly niche and marginal, but one or two personal experiences of a model fucking up will convince someone that one of the major LLMs sucks, even though they're all basically the same. The benchmarks are obviously horseshit, so idk how you really judge the differences. Maybe we need our own benchmark, something objective that can't be gamed by big tech.
>>
File: ebassi.jpg (21 KB, 460x460)
>>108817400
Usecase for differentiation?
What makes you think benchmark is a metric?
>>
>>108817400
i also suspect they're mostly the same
it was too fast and easy for the zeitgeist to switch from "gpt is best" to "claude is best"
it's all based on people's perceptions and social media hype, not measurable outputs
i don't know if it will ever be possible to measure them against each other
it's like a search engine, where the quality is subjective
>>
>>108817400
Each one is optimized for different tasks. Your best bet is to buy a subscription to every single one and run an agent that queries all of them, plus another agent that checks the results and presents the best ones to you. You'll want a third agent to make sure the first two are doing what you asked. If you aren't doing at least this much, you're already behind.
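A minimal sketch of the fan-out-and-judge setup described above. The model clients are stubbed out with hypothetical placeholder functions (no real API calls), and the "judge" is a toy heuristic standing in for what would really be another LLM call:

```python
import concurrent.futures

# Hypothetical stand-ins for real API clients (openai, anthropic, etc.).
def query_gpt(prompt):
    return f"gpt answer to: {prompt}"

def query_claude(prompt):
    return f"claude answer to: {prompt}"

def query_gemini(prompt):
    return f"gemini answer to: {prompt}"

MODELS = {"gpt": query_gpt, "claude": query_claude, "gemini": query_gemini}

def fan_out(prompt):
    """Agent 1: query every model in parallel."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, prompt) for name, fn in MODELS.items()}
        return {name: f.result() for name, f in futures.items()}

def judge(answers):
    """Agent 2: rank the answers and return (model_name, answer).
    Longest-answer-wins is a placeholder heuristic, not a real ranking."""
    return max(answers.items(), key=lambda kv: len(kv[1]))

best_model, best_answer = judge(fan_out("why is my build broken?"))
```

The third "supervisor" agent the post mentions would just wrap these two calls in the same pattern; left out here to keep the sketch short.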
>>
>>108818237
surely someone has vibecoded a web app for this
>>
>>108818083
Usecase for usecases? What makes you think that metrics are a metric?
>>
>>108818237
>ur gooonna be left behind
fucking Indians
>>
>>108818237
No just get good at prompting to draw out secret vectors
>>
among cloud hosted frontier models there isn't much differentiation in my experience, except maybe in tooling. where the models feel tangibly different is among chinese/local/open-weight models
>>
>>108817400
the moment you introduce a new benchmark it will be gamed. It's pointless
>>
>>108817400
Qwen 3.6 27b dense beats gpt-oss in most tasks.
It makes far fewer errors, but it appears to have more limited "world knowledge".
It's only bottlenecked by inconsistent responses between regens in non-code tasks at higher temperatures, but that's just sampling: once it has built a list of candidate tokens and weighted the probabilities, the selection is random.
It's actually kinda creepy how fast LLMs have evolved; imagine how far they'll go in the future.
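The regen inconsistency that anon describes is just temperature sampling: the model's logits get turned into a probability distribution (higher temperature flattens it), and a token is then drawn at random. A minimal sketch with made-up logits for three candidate tokens:

```python
import math
import random

def temperature_sample(logits, temperature=1.0, rng=random):
    """Softmax over temperature-scaled logits, then a weighted random draw.
    Temperature near 0 approaches greedy decoding; higher values flatten
    the distribution, which is why regens diverge more at high temp."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # weighted random selection over token indices
    token = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    return token, probs

# Made-up logits for three candidate tokens.
logits = [2.0, 1.0, 0.1]
_, p_low = temperature_sample(logits, temperature=0.2)
_, p_high = temperature_sample(logits, temperature=2.0)
# At temperature 0.2 the top token dominates, so regens barely vary;
# at temperature 2.0 the distribution flattens and regens diverge.
```

Once the probabilities are fixed, two regens with the same prompt can still pick different tokens at the first draw and then compound from there, which is all the "inconsistency" is.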


