/g/ - Claude Fable 5 under performs on contamination-fre - Technology

File: 1777253273984815.png (263 KB, 2078x1288)

Anonymous 06/10/26(Wed)06:43:44 No.109021740 Archived

Claude Fable 5 underperforms on contamination-free benchmarks.
As predicted, LLMs have hit a ceiling. They don't improve anymore.

Anonymous 06/10/26(Wed)06:47:02 No.109021750

>>109021740
gemini looking good for once. i think... ultrametricshit... will improve a lot of benchmarks.

Anonymous 06/10/26(Wed)06:48:35 No.109021756

>>109021740
>that if average
so the model doesn't follow the prompt at all. lmao

Anonymous 06/10/26(Wed)06:52:52 No.109021768

>>109021756
more like, it loves to add things you haven't asked for to make it "better".

Anonymous 06/10/26(Wed)06:57:39 No.109021788

>>109021740
That won't stop Anthropic shills from spamming 24 hours a day, every single day.

Anonymous 06/10/26(Wed)06:58:49 No.109021792

>>109021740
>contamination-free benchmarks.
what is that?

Anonymous 06/10/26(Wed)07:00:12 No.109021797

benchmarks have become unreliable,
the only way I've found to figure out if a model is worth it is to actually use it for its intended purpose and make your own judgement as you work with it.

Anonymous 06/10/26(Wed)09:27:27 No.109022479

>>109021797
>benchmarks have become unreliable,
They've been unreliable since GPT-2. The benchmarks just become part of the training set.

Anonymous 06/10/26(Wed)09:35:08 No.109022521

>>109021740
Where's Grok?

Anonymous 06/10/26(Wed)09:36:57 No.109022532

>>109021797
Probably, because I've thoroughly tried Opus, paid of course, and it's 100% shittier than Sonnet. That fact alone negates over half that chart.

Anonymous 06/10/26(Wed)09:49:37 No.109022602

>>109021792
It means the questions in the benchmark have not been made publicly available, so it's not possible for a model to have trained on them. This is better than the classical evaluation gauntlets, but not good enough in practice: deep learning is good at "blending" examples as well, always has been. So if the training data has examples quite similar to, but not exactly the same as, the ones in the test gauntlet, the model will still score well, even though on general tasks it will perform much worse than expected.
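
To make the "similar but not identical" point concrete, here's a rough sketch of how you could flag near-duplicates between a private test set and training data using word-level n-gram overlap. This is just an illustration, not how any actual benchmark checks for contamination; the function names, the trigram size and the 0.5 threshold are all made up for the example.

```python
def ngrams(text: str, n: int = 3) -> set:
    """Return the set of word-level n-grams in text (lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def flag_near_duplicates(test_items, train_items, threshold=0.5, n=3):
    """Yield (test, train, score) for pairs whose n-gram overlap meets the threshold.

    The threshold and n are arbitrary illustrative values, not tuned ones.
    """
    train_grams = [(t, ngrams(t, n)) for t in train_items]
    for q in test_items:
        q_grams = ngrams(q, n)
        for t, t_grams in train_grams:
            score = jaccard(q_grams, t_grams)
            if score >= threshold:
                yield q, t, score


if __name__ == "__main__":
    test_set = ["What is the sum of 17 and 25 ?"]
    train_set = [
        "what is the sum of 17 and 26 ?",   # same template, different number
        "the capital of france is paris",   # unrelated
    ]
    for q, t, score in flag_near_duplicates(test_set, train_set):
        print(f"{score:.2f}  {q!r} ~ {t!r}")
```

The test question never appears verbatim in training, but the same-template item still gets flagged. A surface check like this only catches textual near-copies though; paraphrased or structurally similar problems would slip straight through, which is the whole issue.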

Anonymous 06/10/26(Wed)09:50:39 No.109022608

>>109022479
It's ironic because just before that, the deep learning community had finally started genuinely fixing the benchmaxxing problem. Nobody used MNIST anymore, for example, and there was a lot of work going on in figuring out how to make tests fair.

Anonymous 06/10/26(Wed)10:06:13 No.109022685

high-effort just means they let the AI brute-force the solution btw

Anonymous 06/10/26(Wed)10:23:38 No.109022783

>>109021750
The model is good, but the tools around it are horrible. Also gemini-cli is getting killed in favor of antigravity-cli.

Anonymous 06/10/26(Wed)10:26:04 No.109022796

>>109022783
>The model is good
And I'm the queen of England.

Anonymous 06/10/26(Wed)10:26:28 No.109022801

>>109022796
Your Majesty

Anonymous 06/10/26(Wed)10:26:57 No.109022807

DeepSWE hasn't been updated yet, I'm really curious where it will place.

>>109021740
What bench is this one?

Anonymous 06/10/26(Wed)10:28:53 No.109022819

>>109022807
https://livebench.ai

Anonymous 06/10/26(Wed)10:31:24 No.109022838

>>109021797
That’s difficult with Fable
>ask it to write some tests for bog standard software
>the tests it wrote trigger its le cybersecurity guardrails so it returns a warning

Anonymous 06/10/26(Wed)11:07:35 No.109023042

File: 20260610170638002417.jpg (238 KB, 1965x1102)

>>109022521
you need to scroll down below the open-source local phone models

Anonymous 06/10/26(Wed)11:23:57 No.109023140

>>109021740

the improvements came from new data, but now they've run out of new data, and even worse, the new data is starting to be contaminated by AI output

Anonymous 06/10/26(Wed)11:33:13 No.109023208

>>109021740
Why do you hate America?

Anonymous 06/10/26(Wed)12:00:31 No.109023372

>>109023208
The real question is why does America hate me?

Anonymous 06/10/26(Wed)12:08:47 No.109023418

>>109023208
Because it's jewish and jeeted.
>>109023372
Because you're not jewish or jeeted.

Anonymous 06/10/26(Wed)13:22:45 No.109023972

>>109023042
Grok is still the only chatbot who knows what's going on after the 2025-or-whatever cutoff date