Claude Fable 5 underperforms on contamination-free benchmarks. As predicted, LLMs reached a ceiling. They don't improve anymore.
>>109021740
gemini looking good for once. i think... ultrametricshit... will improve a lot of benchmarks.
>>109021740
>that if average
so model doesn't follow prompt at all. lmao
>>109021756
more like, it loves to add things you haven't asked for to make it "better".
>>109021740
That won't stop Anthropic shills from spamming 24h every single day.
>>109021740
>contamination-free benchmarks.
what is that?
benchmarks have become unreliable, the only way I've found to figure out if a model is worth it is to actually use it for the intended purposes and make an individual judgement as you work with it.
>>109021797
>benchmarks have become unreliable,
They've been unreliable since GPT-2. The benchmarks just become part of the training set.
>>109021740
Where's Grok?
>>109021797
Probably, because I've thoroughly tried Opus, paid of course, and it's 100% shittier than Sonnet. That fact alone negates over half that chart.
>>109021792
It means the questions in the benchmark have not been made publicly available, so it's not possible for a model to have trained on them. This is better than the classical evaluation gauntlets, but not good enough in practice: deep learning is good at "blending" examples as well, always has been. So if you have quite similar examples in training that aren't exactly the ones in the test gauntlet, it'll still look good, even though for general tasks it will perform much worse than expected.
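Rough sketch of why that is, assuming a naive word n-gram overlap check. The function names and the two example questions are made up for illustration, this is not any benchmark's actual decontamination pipeline:

```python
def ngrams(text, n=3):
    """Return the set of word n-grams in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap(test_item, train_item, n=3):
    """Fraction of the test item's n-grams that also appear in the training item."""
    test_grams = ngrams(test_item, n)
    if not test_grams:
        return 0.0
    return len(test_grams & ngrams(train_item, n)) / len(test_grams)


# Hypothetical held-out test question.
test_q = "A train travels 60 miles in 2 hours. What is its average speed?"

# Hypothetical training items: a verbatim leak and a same-shape problem.
exact_copy = test_q
paraphrase = "A bus covers 90 km in 3 hours. Find the average speed of the bus."

exact_score = overlap(test_q, exact_copy)       # verbatim leak is caught
paraphrase_score = overlap(test_q, paraphrase)  # same task, no shared trigrams

print(exact_score, paraphrase_score)
```

The verbatim copy gets flagged, the reworded version of the same problem sails through. A model trained on the second one was never "contaminated" by this check's definition, but it has still seen the task.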
>>109022479
It's ironic because just before that, the deep learning community had finally started genuinely fixing the benchmaxxing problem. Nobody used MNIST anymore, for example, and there was a lot of work going on in figuring out how to make tests fair.
high-effort just means they let the AI bruteforce the solution btw
>>109021750
The model is good, but the tools around it are horrible. Also gemini-cli is getting killed in favor of antigravity-cli.
>>109022783
>The model is good
And I'm the queen of England.
>>109022796
Your Majesty
DeepSWE hasn't been updated yet, I'm really curious where it will place.
>>109021740
What bench is this one?
>>109022807
https://livebench.ai
>>109021797
That's difficult for fable
>ask it to write some tests for bog standard software
>the tests it wrote trigger its le cybersecurity guardrails so it returns you a warning
>>109022521
you need to scroll down below open source phone local models
>>109021740
the improvements were because of new data, but now they ran out of new data, and even worse, the new data is starting to be contaminated by AI
>>109021740
Why do you hate America?
>>109023208
The real question is why does America hate me?
>>109023208Because it's jewish and jeeted.>>109023372Because you're not jewish or jeeted.
>>109023042
Grok is still the only chatbot that knows what's going on after the 2025-or-whatever cutoff date