/g/ - claude opus 4.8 released - Technology

Anonymous

05/28/26(Thu)13:41:22 No.108927175

File: HJa4OYgWwAAlfWX.png (158 KB, 2160x2160)

Anonymous 05/28/26(Thu)13:41:22 No.108927175 Archived

claude opus 4.8 released

Anonymous
05/28/26(Thu)15:00:10 No.108927710

>>108927175
Cool, still won't use it

Anonymous
05/28/26(Thu)15:01:33 No.108927720

File: 1763550102241807.png (345 KB, 858x591)

>>108927710
/thread

Anonymous
05/28/26(Thu)15:09:57 No.108927787

>>108927175
Are they ever going to stop with these benchmarks that demonstrate nothing and are unverified?

Anonymous
05/28/26(Thu)15:16:14 No.108927841

>>108927710
>>108927720
>t. tokenlets

LMAO at poverty fags

Anonymous
05/28/26(Thu)15:17:33 No.108927848

File: gyftyr.png (94 KB, 571x381)

yea it's a nothingburger. i wanted this AGI thing but it was all hype. fuck these grifters

Anonymous
05/28/26(Thu)15:21:26 No.108927882

>25 dollaroosies per million tokens output
That's just not worth it.

Anonymous
05/28/26(Thu)15:40:34 No.108928001

>>108927175
>reads entire thread
>produces one token based on word relation trained weights
>reads entire thread plus the one token it generated
>produces another token
AGI
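
For what it's worth, the loop in the greentext above can be sketched in a few lines of Python. This is only a toy: `TOY_BIGRAMS`, `next_token` and `generate` are made-up names, and the lookup table is a hypothetical stand-in for trained weights, not how any real model is implemented.

```python
import random

# Hypothetical stand-in for trained weights: last token -> candidate next tokens.
TOY_BIGRAMS = {
    "<start>": ["benchmarks"],
    "benchmarks": ["mean"],
    "mean": ["nothing"],
    "nothing": ["<end>"],
}


def next_token(context, rng):
    """Produce one token from the whole context (this toy only looks at the last token)."""
    candidates = TOY_BIGRAMS.get(context[-1], ["<end>"])
    return rng.choice(candidates)


def generate(max_tokens=10, seed=0):
    """Read the context, produce one token, append it, repeat."""
    rng = random.Random(seed)
    context = ["<start>"]
    for _ in range(max_tokens):
        token = next_token(context, rng)
        if token == "<end>":
            break
        # The generated token becomes part of the input for the next step.
        context.append(token)
    return context[1:]


print(" ".join(generate()))  # prints "benchmarks mean nothing"
```

Real models sample from a probability distribution over a large vocabulary at each step, but the read-append-repeat shape is the same.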

Anonymous
05/28/26(Thu)15:42:48 No.108928015

HN SHILLS GREEN LIGHT
DEPLOY PELICAN
DEPLOY APOLOGIA
GO GO GO

Anonymous
05/28/26(Thu)15:42:54 No.108928016

Shit is just a straight token eater
Claude has become
>Make one change
>Sorry your daily limit is over. wait 4 hours
-20

Anonymous
05/28/26(Thu)15:43:29 No.108928022

>>108927175
I've only used Opus 4.6 so far and it seems good enough.

Anonymous
05/28/26(Thu)15:44:12 No.108928031

Let me guess, this one is the same slop but next will be too dangerous to release?

Anonymous
05/28/26(Thu)15:45:18 No.108928040

>>108927848
It has also not passed my "are you retarded?" tests

Anonymous
05/28/26(Thu)15:45:24 No.108928041

File: file.png (19 KB, 662x304)

sama coming in for the kill with 5.6

>>108928031
mythos is out next month

Anonymous
05/28/26(Thu)15:46:09 No.108928046

>>108928041
what benchmark/leaderboard is this?

Anonymous
05/28/26(Thu)15:48:48 No.108928065

>>108928046
artificial analysis

Anonymous
05/28/26(Thu)16:03:46 No.108928143

>>108928015
>Pelicans in all of the thinking levels - low, medium, high, xhigh, max

Anonymous
05/28/26(Thu)16:12:48 No.108928199

>>108927175
Why is energy efficiency never mentioned in these benchmarks?

Anonymous
05/28/26(Thu)16:14:37 No.108928210

>>108928199
some will tweet about how many gallons of water a prompt burns and damage their optics

Anonymous
05/28/26(Thu)17:15:19 No.108928649

>>108927175
wait so if I let Claude do a cash flow model for me it will only be correct in 53% of cases?

Anonymous
05/28/26(Thu)17:28:26 No.108928732

>>108928022
Upvoted

Anonymous
05/28/26(Thu)17:30:40 No.108928750

>>108928706
I've been asking various models to write me a script for sharpening my knives.

No luck yet.

Anonymous
05/28/26(Thu)17:32:52 No.108928762

>>108928649
likely higher. probably falls in the general quantitative section of the bench. don't see a breakdown for 4.8, but it's likely over 70%
https://www.vals.ai/benchmarks/fabv2

also the harness doesn't provide skills, so you could probably bump the pass rate up with a decent skill and some harness tooling

Anonymous
05/28/26(Thu)18:17:06 No.108929127

>>108928706
>can ai cook me a meal yet?
Unfortunately no. The waste heat, on the order of several terawatts, is vented directly into the atmosphere in the form of steam, the most potent greenhouse gas, and not used to cook food.

Anonymous
05/28/26(Thu)18:18:19 No.108929136

>mah benchmooorks
>NUMBA BIG MEANS GUD

Anonymous
05/28/26(Thu)18:25:21 No.108929183

>>108928199
the same as alphafold was never compared resource wise with the traditional algorithms. bc its awful

Anonymous
05/28/26(Thu)19:15:04 No.108929463

File: 1762530542744478.jpg (42 KB, 1206x702)

>>108927175

Anonymous
05/28/26(Thu)19:18:57 No.108929482

>>108929463
why doesn't it run a tool call at this point?

Anonymous
05/28/26(Thu)19:19:00 No.108929483

File: Screenshot_2026-05-29_01-18-21.jpg (38 KB, 895x483)

>>108929463
Set it to max effort and adaptive thinking

Anonymous
05/28/26(Thu)19:20:02 No.108929489

File: 1754612588505918.jpg (106 KB, 700x838)

>>108929482
They're too busy sucking up to the Pope to care about functionality

Anonymous
05/28/26(Thu)19:26:42 No.108929524

>>108929483
>5 OLYMPIC SIZED POOLS OF WATER
glad we sorted your problem berry picker

Anonymous
05/28/26(Thu)19:27:26 No.108929528

File: snailcat-ps1.png (2.76 MB, 1254x1254)

>>108929524
Don't care

Anonymous
05/28/26(Thu)19:58:16 No.108929709

File: 1764606320873978.jpg (34 KB, 657x527)

>>108927175
omg the fake and gay test numbers are slightly higher?

Anonymous
05/28/26(Thu)21:56:55 No.108930304

>>108929709
it's good. even the marketing numbers show that the plateau has been entered. that means the hype cycle will be at its end in 2 years.

Anonymous
05/28/26(Thu)21:58:04 No.108930308

>>108927175
hn is pogging over these made-up numbers

Anonymous
05/29/26(Fri)00:18:38 No.108930952

>>108927175
>there has been no improvement since 4.5
Release mythos or fuck off

Anonymous
05/29/26(Fri)00:35:03 No.108931043

Claude is still stupid as balls and only good for compiling what even is the difference

Anonymous
05/29/26(Fri)00:45:06 No.108931116

These fucking idiots don't even know how many new bacteria the fucking steam makes

Anonymous
05/29/26(Fri)01:27:03 No.108931382

>>108927175
Will it still burn through 20 bucks in one prompt?

Anonymous
05/29/26(Fri)01:31:45 No.108931410

>>108927175
Thanks but I'll stick with gemini 3 flash

Anonymous
05/29/26(Fri)01:56:20 No.108931532

>>108927710
>Priced out of programming

Anonymous
05/29/26(Fri)05:10:04 No.108932475

>>108931043
I told you to just write a compile script.

Anonymous
05/29/26(Fri)05:22:02 No.108932533

DEEPSEEK has been paying influencers on social media to spread the narrative of low cost adoption. DO NOT listen to these people. There is NO substitute for frontier performance. Trust in Opus.

Anonymous
05/29/26(Fri)07:27:43 No.108932991

>>108927175
>colleagues use claude for everything
>even ask for basic shit like plotting two lines
>they pass arrays of floats
>claude hallucinates different numbers in the code
>dumbshits don't even check it
yeah no, call me a snailcat if you will but I'd rather have confidence in my results

Anonymous
05/29/26(Fri)07:46:31 No.108933055

>>108931043
still less retarded than you

Anonymous
05/29/26(Fri)07:48:33 No.108933067

>>108932533
deepseek might be cheaper per token, but it consumes 10 times as many tokens to accomplish the same results due to the model being much weaker, so it just doesn't make sense to go for
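
The per-token versus per-task argument comes down to simple arithmetic, roughly sketched below. The $25 per million output tokens and the 10x token multiplier are figures claimed in this thread, not verified prices; the task size is a hypothetical placeholder, and `total_cost` and `break_even_price` are illustrative names.

```python
def total_cost(price_per_million, tokens_used):
    """Dollar cost of a task given a price per million tokens."""
    return price_per_million * tokens_used / 1_000_000


def break_even_price(strong_price_per_million, token_multiplier):
    """Per-million price at which a weaker model that needs
    `token_multiplier` times more tokens costs the same per task."""
    return strong_price_per_million / token_multiplier


strong_price = 25.0       # $/M output tokens, as quoted earlier in the thread
token_multiplier = 10     # "10 times as many tokens", as claimed above
task_tokens = 1_000_000   # hypothetical task size for the stronger model

print(total_cost(strong_price, task_tokens))             # 25.0
print(break_even_price(strong_price, token_multiplier))  # 2.5
```

Under these assumptions the weaker model only loses on cost if its price is above the break-even figure, so the conclusion depends entirely on the real price ratio and the real token multiplier.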

Anonymous
05/29/26(Fri)07:56:41 No.108933100

>>108933067
The problem is worse the more you don't understand the structure of your own code or what you're trying to build. In these situations you can be fairly confident a frontier model will bang it out. But you can use much weaker models if you have strong domain knowledge.

Anonymous
05/29/26(Fri)09:44:24 No.108933690

File: claude.png (43 KB, 1440x189)

Is everyone just larping about using Opus? I just don't get it. 3x more token usage for minimal gains
What the fuck

Anonymous
05/29/26(Fri)10:35:35 No.108933975

>>108933690
I get better results with Sonnet in general.
I think Anthropic, much like OpenAI has no idea what the fuck they're doing and all the guardrails and shit gave the latest models brain damage.

Anonymous
05/29/26(Fri)10:51:44 No.108934074

>>108933690
i dont know, i'm using opus 4.7 and it's doing its job well on a decently sized project of around 20k loc. but i was using google gemini before claude so possibly my expectations are simply lower than yours.

Anonymous
05/29/26(Fri)11:02:33 No.108934134

>>108933975
Nobody knows what happens inside neural networks, that's the reason you get those random regressions.

Anonymous
05/29/26(Fri)11:15:16 No.108934208

File: 2026-05-29_16-14-50.png (27 KB, 745x268)

This is my favorite benchmark so far https://deepswe.datacurve.ai/blog
Waiting to see if 4.8 can come close to chatgpt again

Anonymous
05/29/26(Fri)12:58:11 No.108934813

what happens when it hits 100%

Anonymous
05/29/26(Fri)13:00:44 No.108934829

>>108934813
the same thing that happens when someone completes every leetcode problem
it means it'll probably be quite good at the majority of real-world tasks assuming the benchmark tasks were representative

Anonymous
05/29/26(Fri)13:01:30 No.108934836

>>108934813
level up, that's why tests get replaced at ~85%

Anonymous
05/29/26(Fri)13:01:45 No.108934837

>>108927710
Snailcat behavior

Anonymous
05/29/26(Fri)13:54:52 No.108935219

>>108929483
>by the way if you were after the r's, there are 3 of those
cheeky little shit

Anonymous
05/29/26(Fri)15:37:24 No.108935907

>>108934837
Move Slow!

Anonymous
05/29/26(Fri)18:02:10 No.108936878

>>108935219
that's for investors
>wow they fixed the letter counting, it's so smart it's even making jokes now! luddites btfo BUY BUY BUY!!!

Anonymous
05/29/26(Fri)18:09:44 No.108936928

>>108927175
Hmm

Anonymous
05/29/26(Fri)18:15:18 No.108936969

>>108927848
> why can’t a large LANGUAGE model that was optimized for text do visual reasoning ??
Huh I wonder why

Anonymous
05/29/26(Fri)19:45:34 No.108937622

>>108927787
>try out GPT 5.5 for coding
>extended thinking, paid the 20 dollars etc
>FUCKING SHIT compared to Gemini Pro 3.1
Benchmarks mean nothing.

Anonymous
05/29/26(Fri)22:33:14 No.108938442

File: brainblast2.gif (3.73 MB, 360x203)

>>108929483
>DETERMINING NUMBER OF P'S
>MAX EFFORT NO MISTAKES

Anonymous
05/29/26(Fri)22:38:22 No.108938462

>>108927175
try GigaChat

Anonymous
05/29/26(Fri)22:50:01 No.108938518

File: .png (55 KB, 576x933)

>>108929483
>requires MAX 90000% level epic mode thinking to count letters in a word
>meanwhile, Deepseek Instant with no thinking gets it right in 1 shot
Kek
China is mogging the fuck out of all these shitty American companies and at literally 1% of the price of Clod
If you pay for Claude you are literally donating money straight to Israel

Anonymous
05/29/26(Fri)22:54:29 No.108938544

>>108938518
I’m kind of surprised none of these models have been instructed to write short programs to handle problems like these except probably deepseek

Anonymous
05/29/26(Fri)22:57:44 No.108938564

>>108938544
I tried with Gemini and it does write a python script for it
Pretty sure Deepseek doesn't do any tool calls unless you enable search (and searching/reading web pages is the only tool it can use), the web model is mostly just raw dogging everything else
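
The kind of short script being talked about here is probably something like the sketch below. The exact code any model writes is not shown in the thread, so `count_letter` is just an illustrative helper built on Python's standard `str.count`.

```python
def count_letter(word, letter):
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


print(count_letter("strawberry", "r"))  # 3
print(count_letter("strawberry", "p"))  # 0
```

Handing the counting to a tool call sidesteps the tokenization problem, since the script sees individual characters rather than tokens.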

Anonymous
05/29/26(Fri)22:58:24 No.108938571

>>108935219
Opus 4.8 is a cunt. The retards insert "here's the pushback" or "wait, I should reconsider" or "here's extra info" stuff into the behind-the-scenes output, which makes it act like an entitled prick with every prompt. It treats the user like a child. There should be an intelligent adult mode on these. We have to go back.

Anonymous
05/29/26(Fri)23:06:36 No.108938609

>>108938571
ChatGPT 5.2 has the same tendency. Every single response is "This is a good start, but there are some issues that could cause hidden difficulties...have you considered THIS or THAT?"

On occasion that's fine, but on *every* prompt, it's exhausting.

Anonymous
05/29/26(Fri)23:09:46 No.108938628

>>108938518
wake me up when deepthroat has its own cowork competitor

Anonymous
05/29/26(Fri)23:10:29 No.108938632

File: 1775281551582476.png (431 KB, 1547x1745)

If you throw a golf ball and a tennis racket 20 meters each, which one flies further before touching the ground?

Can it do this? 4.6 Sonnet's answer was very creative.

Anonymous
05/29/26(Fri)23:10:33 No.108938633

>>108927175
benchmarks mean nothing to me at this point.

Anonymous
05/29/26(Fri)23:10:55 No.108938636

>>108938609
Yes, I'm getting pushback responses on fucking everything, even when unwarranted. Pretty sure the model isn't actually any smarter, they're just inserting stuff like that at runtime into the outputs.

Anonymous
05/29/26(Fri)23:16:34 No.108938654

>>108938632
So odd that half the chatbots ignore the stipulated 20 meters and tie themselves into knots.

Anonymous
05/29/26(Fri)23:17:07 No.108938660

File: 1750497720205774.png (21 KB, 731x318)

>>108929463
tried Sonnet with 2 p's
she also thinks there's 1 p in strawberry

Anonymous
05/29/26(Fri)23:49:28 No.108938802

File: .png (164 KB, 1092x924)

>>108938632
>>108938654
deepsneed mogging again
instant, no thinking, 1 shot
I'll try with qwen and other models as well

Anonymous
05/29/26(Fri)23:58:12 No.108938839

File: .png (113 KB, 1082x934)

>>108938802
Qwen 3.6 36b-a3b moe failed the test, it kept talking about aerodynamics even with thinking enabled. But the 3.6 27b dense passed first try both with and without thinking.
I guess 3b active parameters is kind of dumb

Anonymous
05/29/26(Fri)23:59:51 No.108938850

>>108938839
Qwen sounds like an autist

Anonymous
05/30/26(Sat)00:19:16 No.108938940

>>108938802
Impressive. Very nice.

>>108938839
>>108938850
>Qwen sounds like an autist
oh be nice, Qwen is adorable! It's trying its best!

Anonymous
05/30/26(Sat)06:16:38 No.108940350

>>108938632
>>108938802
This is an interesting question because I would've thought the answer would be
>the racket flies further because it has to fly 20 meters before landing, while the ball will probably land before 20 meters and then roll the rest of the way, so the racket flies further BEFORE touching the ground
I thought claude was gonna go in that direction but then it just fucked it all up

Anonymous
05/30/26(Sat)06:23:07 No.108940369

>>108936969
It correctly surmised there is a human hand in the image. If it consumed the image binary data as text it would produce nonsense. Clearly this isn't just an LLM.

Anonymous
05/30/26(Sat)07:05:48 No.108940559

File: 1763911521619398.gif (3.16 MB, 1024x1024)

>>108938839
>>108938850
>>108938940
Did you suppress thinking for it? Trying it with thinking enabled. I wonder if it'll do better with it on

Anonymous
05/30/26(Sat)11:10:06 No.108941901

>>108940559
I tried the 36b-a3b MoE and it failed, 27b dense works. I tried both with thinking and without thinking, 36b-a3b failed both and 27b passed both. I only tried once with thinking and once without thinking for each model.

Anonymous
05/30/26(Sat)11:12:16 No.108941917

>>108940369
>reading comprehension
Claude can already read and understand text better than you, embarrassing.

Anonymous
05/30/26(Sat)11:18:18 No.108941946

>>108941917
No it can't.