I feel like the differences are mostly niche and marginal, but one or two personal experiences of a model fucking up will convince someone that one of the major LLMs sucks, when really they're all basically the same. The benchmarks are obviously horseshit, so idk how you actually judge the differences. Maybe we need our own benchmark, one that stays objective without getting gamed by big tech.
>>108817400
Usecase for differentiation?
What makes you think benchmark is a metric?
>>108817400
i also suspect they're mostly the same
it was too fast and easy for the zeitgeist to switch from "gpt is best" to "claude is best"
it's all based on people's perceptions and social media hype, not measurable outputs
i don't know how it will ever be possible to measure them against each other
it's like a search engine, where the quality is subjective
>>108817400
Each one is optimized for different tasks. Your best bet is to buy a subscription to every single one and run an agent that queries all of them, plus another agent that checks the results and presents the best ones to you. You'll want a third agent to make sure the first two are doing what you asked, too. If you aren't using at least this, the simplest possible setup, you're already behind.
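A minimal python sketch of the fan-out-and-judge part, assuming every provider exposes an OpenAI-compatible /v1/chat/completions endpoint; the URLs, model names, and env var names are all placeholders, and the third watchdog agent is left as an exercise:

[code]
# fan-out to several providers, then have one model judge the answers.
# assumes OpenAI-compatible chat APIs; endpoints/models below are made up.
import json
import os
import urllib.request

PROVIDERS = {
    # name: (base_url, model, api_key_env_var) -- all hypothetical
    "alpha": ("https://api.alpha.example/v1", "alpha-large", "ALPHA_API_KEY"),
    "beta":  ("https://api.beta.example/v1",  "beta-pro",    "BETA_API_KEY"),
}

def ask(base_url, model, api_key, prompt):
    """Send one chat request and return the first choice's text."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def fan_out(prompt):
    """Agent 1: query every provider with the same prompt."""
    return {
        name: ask(url, model, os.environ[key_env], prompt)
        for name, (url, model, key_env) in PROVIDERS.items()
    }

def judge(prompt, answers):
    """Agent 2: ask one of the models to rank the collected answers."""
    listing = "\n\n".join(f"[{name}]\n{text}" for name, text in answers.items())
    url, model, key_env = PROVIDERS["alpha"]
    return ask(url, model, os.environ[key_env],
               f"Question: {prompt}\n\nCandidate answers:\n{listing}\n\n"
               "Reply with the name of the best answer and one sentence why.")

if __name__ == "__main__":
    q = "Explain tail-call optimization in two sentences."
    print(judge(q, fan_out(q)))
[/code]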
>>108818237
surely someone has vibecoded a web app for this
>>108818083
Usecase for usecases? What makes you think that metrics are a metric?
>>108818237
>ur gooonna be left behind
fucking Indians
>>108818237
No, just get good at prompting to draw out the secret vectors
among cloud-hosted frontier models there isn't much differentiation in my experience, except maybe in tooling. where the models feel tangibly different is among the chinese/local/open-weight models
>>108817400
the moment you introduce a new benchmark it will be gamed. It's pointless
>>108817400
Qwen 3.6 27b dense beats gpt-oss in most tasks
It makes a lot fewer errors but it appears to have more limited "world knowledge"
It's only bottlenecked by inconsistent responses between regens in non-code tasks at higher temp, but that's just sampling: once it has weighted the probabilities of the candidate tokens, the pick is just random
It's actually kinda creepy how fast llms have evolved, imagine how much further they'll evolve in the future
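To make the sampling point concrete, here's a toy sketch of temperature sampling; the logit values are made up, and real models do this over a vocabulary of ~100k tokens rather than four:

[code]
import math
import random

def sample(logits, temperature=1.0):
    """Temperature sampling: scale logits, softmax to probabilities,
    then draw one token at random according to those weights."""
    scaled = [l / temperature for l in logits.values()]
    m = max(scaled)                      # subtract max for numeric stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(list(logits), weights=probs, k=1)[0]

# made-up logits for four candidate next tokens
logits = {"the": 3.1, "a": 2.7, "one": 1.2, "zebra": -0.5}

# count what gets picked over 1000 draws at each temperature
for t in (0.2, 1.0, 1.5):
    picks = [sample(logits, t) for _ in range(1000)]
    print(t, {tok: picks.count(tok) for tok in logits})
[/code]

at low temp the top token dominates so regens barely differ; at higher temp the tail gets real probability mass, which is exactly the regen-to-regen inconsistency described above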