/g/ - Technology


Thread archived.
You cannot reply anymore.




File: 1751337189513720.png (892 KB, 1080x1080)
I feel like the differences are mostly niche and marginal, but one or two personal experiences of a model fucking up will convince someone that one of the major LLMs sucks, even though they're all basically the same. The benchmarks are obviously horseshit, so idk how you really judge the differences. Maybe we need our own benchmark, something objective that can't be gamed by big tech.
>>
File: ebassi.jpg (21 KB, 460x460)
>>108817400
Usecase for differentiation?
What makes you think benchmark is a metric?
>>
>>108817400
i also suspect they're mostly the same
it was too fast and easy for the zeitgeist to switch from "gpt is best" to "claude is best"
it's all based on people's perceptions and social media hype, not measurable outputs
i don't know if it will ever be possible to measure them against each other
it's like a search engine, where the quality is subjective
>>
>>108817400
Each one is optimized for different tasks. Your best bet is to buy a subscription to every single one and run an agent that queries all of them, plus another agent that checks the results and presents the best ones to you. You'll want a third agent to make sure the first two are doing what you asked. If you aren't doing at least this much, you're already behind.
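A minimal sketch of the fan-out-and-judge setup described above. The model clients are stubbed out with hypothetical placeholder functions (no real API calls), and the "judge" is a toy heuristic standing in for what would really be another LLM call:

```python
import concurrent.futures

# Hypothetical stand-ins for real API clients (openai, anthropic, etc.).
def query_gpt(prompt):
    return f"gpt answer to: {prompt}"

def query_claude(prompt):
    return f"claude answer to: {prompt}"

def query_gemini(prompt):
    return f"gemini answer to: {prompt}"

MODELS = {"gpt": query_gpt, "claude": query_claude, "gemini": query_gemini}

def fan_out(prompt):
    """Agent 1: query every model in parallel."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, prompt) for name, fn in MODELS.items()}
        return {name: f.result() for name, f in futures.items()}

def judge(answers):
    """Agent 2: rank the answers and return (model_name, answer).
    Longest-answer-wins is a placeholder heuristic, not a real ranking."""
    return max(answers.items(), key=lambda kv: len(kv[1]))

best_model, best_answer = judge(fan_out("why is my build broken?"))
```

The third "supervisor" agent the post mentions would just wrap these two calls in the same pattern; left out here to keep the sketch short.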
>>
>>108818237
surely someone has vibecoded a web app for this
>>
>>108818083
Usecase for usecases? What makes you think that metrics are a metric?
>>
>>108818237
>ur gooonna be left behind
fucking Indians
>>
>>108818237
No just get good at prompting to draw out secret vectors
>>
among cloud hosted frontier models there isn't much differentiation in my experience, except maybe in tooling. where the models feel tangibly different is among chinese/local/open-weight models
>>
>>108817400
the moment you introduce a new benchmark it will be gamed. It's pointless
>>
>>108817400
Qwen 3.6 27b dense beats gpt-oss in most tasks.
It makes far fewer errors, but it appears to have more limited "world knowledge".
It's only bottlenecked by inconsistent responses between regens in non-code tasks at higher temperatures, but that's just sampling: once it has built a list of candidate tokens and weighted the probabilities, the selection is random.
It's actually kinda creepy how fast LLMs have evolved; imagine how far they'll go in the future.
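The regen inconsistency that anon describes is just temperature sampling: the model's logits get turned into a probability distribution (higher temperature flattens it), and a token is then drawn at random. A minimal sketch with made-up logits for three candidate tokens:

```python
import math
import random

def temperature_sample(logits, temperature=1.0, rng=random):
    """Softmax over temperature-scaled logits, then a weighted random draw.
    Temperature near 0 approaches greedy decoding; higher values flatten
    the distribution, which is why regens diverge more at high temp."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # weighted random selection over token indices
    token = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    return token, probs

# Made-up logits for three candidate tokens.
logits = [2.0, 1.0, 0.1]
_, p_low = temperature_sample(logits, temperature=0.2)
_, p_high = temperature_sample(logits, temperature=2.0)
# At temperature 0.2 the top token dominates, so regens barely vary;
# at temperature 2.0 the distribution flattens and regens diverge.
```

Once the probabilities are fixed, two regens with the same prompt can still pick different tokens at the first draw and then compound from there, which is all the "inconsistency" is.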


