/g/ - Technology


Thread archived.
You cannot reply anymore.




File: HJa4OYgWwAAlfWX.png (158 KB, 2160x2160)
claude opus 4.8 released
>>
>>108927175
Cool, still won't use it
>>
File: 1763550102241807.png (345 KB, 858x591)
>>108927710
/thread
>>
>>108927175
Are they ever going to stop with these benchmarks that demonstrate nothing and are unverified?
>>
>>108927710
>>108927720
>t. tokenlets

IMAO at poverty fags
>>
File: gyftyr.png (94 KB, 571x381)
yea it's a nothingburger. i wanted this AGI thing but it was all hype. fuck these grifters
>>
>25 dollaroosies per million tokens output
That's just not worth it.
>>
>>108927175
>reads entire thread
>produces one token based on word relation trained weights
>reads entire thread plus the one token it generated
>produces another token
AGI
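
For anyone who wants the loop in the greentext spelled out, here is a minimal sketch. It is a toy, not how any real model is implemented: `table`, `next_token` and `generate` are hypothetical names, and the lookup table stands in for trained weights. The only point is the shape of the loop: read the whole context, emit one token, append it, repeat.

```python
import random

def next_token(context, table, rng):
    """Pick one token given the full context.

    Toy assumption: only the last token is consulted, via a lookup table
    that stands in for trained weights.
    """
    candidates = table.get(context[-1], ["<eos>"])
    return rng.choice(candidates)

def generate(prompt, table, max_new_tokens=10, seed=0):
    """Autoregressive loop: each new token is appended and fed back in."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens, table, rng)  # reads everything generated so far
        if tok == "<eos>":
            break
        tokens.append(tok)  # the next step sees the context plus this token
    return tokens

toy_table = {"claude": ["opus"], "opus": ["released"]}
print(generate(["claude"], toy_table))
```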
>>
HN SHILLS GREEN LIGHT
DEPLOY PELICAN
DEPLOY APOLOGIA
GO GO GO
>>
Shit is just a straight token eater
Claude has become
>Make one change
>Sorry your daily limit is over. wait 4 hours
-20
>>
>>108927175
I've only used Opus 4.6 so far and it seems good enough.
>>
Let me guess, this one is the same slop but next will be too dangerous to release?
>>
>>108927848
It has also not passed my "are you retarded?" tests
>>
File: file.png (19 KB, 662x304)
sama coming in for the kill with 5.6

>>108928031
mythos is out next month
>>
>>108928041
what benchmark/leaderboard is this?
>>
>>108928046
artificial analysis
>>
>>108928015
>Pelicans in all of the thinking levels - low, medium, high, xhigh, max
>>
>>108927175
Why is energy efficiency never mentioned in these benchmarks?
>>
>>108928199
some will tweet about how many gallons of water a prompt burns and damage their optics
>>
>>108927175
wait so if I let Claude do a cash flow model for me it will only be correct in 53% of cases?
>>
>>108928022
Upvoted
>>
>>108928706
I've been asking various models to write me a script for sharpening my knives.

No luck yet.
>>
>>108928649
likely higher. probably falls in the general quantitative section of the bench. don't see a breakdown for 4.8, but it's likely over 70%
https://www.vals.ai/benchmarks/fabv2

also the harness doesn't provide skills, so you could probably bump the pass rate up with a decent skill and some harness tooling
>>
>>108928706
>can ai cook me a meal yet?
Unfortunately no. The waste heat, on the order of several terawatts, is vented directly into the atmosphere in the form of steam, the most potent greenhouse gas, and not used to cook food.
>>
>mah benchmooorks
>NUMBA BIG MEANS GUD
>>
>>108928199
the same way alphafold was never compared resource-wise with the traditional algorithms. bc it's awful
>>
File: 1762530542744478.jpg (42 KB, 1206x702)
>>108927175
>>
>>108929463
why doesn't it run a tool call at this point?
>>
>>108929463
Set it to max effort and adaptive thinking
>>
File: 1754612588505918.jpg (106 KB, 700x838)
>>108929482
They're too busy sucking up to the Pope to care about functionality
>>
>>108929483
>5 OLYMPIC SIZED POOLS OF WATER
glad we sorted your problem berry picker
>>
File: snailcat-ps1.png (2.76 MB, 1254x1254)
>>108929524
Dont care
>>
File: 1764606320873978.jpg (34 KB, 657x527)
>>108927175
omg the fake and gay test numbers are slightly higher?
>>
>>108929709
it's good. even the marketing numbers show that the plateau has been entered. that means the hype cycle will be at its end in 2 years.
>>
>>108927175
hn is pogging over these made-up numbers
>>
>>108927175
>there has been no improvement since 4.5
Release mythos or fuck off
>>
Claude is still stupid as balls and only good for compiling. What even is the difference?
>>
These fucking idiots dont even know how many new bacteria the fucking steam makes
>>
>>108927175
Will it still burn through 20 bucks in one prompt?
>>
>>108927175
Thanks but I'll stick with gemini 3 flash
>>
>>108927710
>Priced out of programming
>>
>>108931043
I told you to just write a compile script.
>>
DEEPSEEK has been paying influencers on social media to spread the narrative of low cost adoption. DO NOT listen to these people. There is NO substitute for frontier performance. Trust in Opus.
>>
>>108927175
>colleagues use claude for everything
>even ask for basic shit like plotting two lines
>they pass arrays of floats
>claude hallucinates different numbers in the code
>dumbshits don't even check it
yeah no, call me a snailcat if you will but I'd rather have confidence in my results
>>
>>108931043
still less retarded than you
>>
>>108932533
deepseek might be cheaper per token, but it consumes 10 times as many tokens to accomplish the same results due to the model being much weaker, so it just doesn't make sense to go for
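
Rough arithmetic for the price-versus-token-count tradeoff, as a sketch only. The $25 per million output tokens and the "1% of the price" figure are both claims quoted elsewhere in this thread and are unverified; the 100,000-token task size is a made-up number, and `task_cost` and `break_even_token_ratio` are hypothetical helper names.

```python
def task_cost(price_per_million_usd, tokens_used):
    """Dollar cost of one task at a flat per-million-token price."""
    return price_per_million_usd * tokens_used / 1_000_000

def break_even_token_ratio(frontier_price, cheap_price):
    """How many times more tokens the cheap model can use before costs match."""
    return frontier_price / cheap_price

FRONTIER_PRICE = 25.0       # USD per million output tokens, as quoted in the thread
CHEAP_PRICE = 0.25          # assumption: "1% of the price" taken literally
FRONTIER_TOKENS = 100_000   # hypothetical task size
TOKEN_MULTIPLIER = 10       # the "10 times as many tokens" claim

frontier_cost = task_cost(FRONTIER_PRICE, FRONTIER_TOKENS)
cheap_cost = task_cost(CHEAP_PRICE, FRONTIER_TOKENS * TOKEN_MULTIPLIER)
print(frontier_cost, cheap_cost, break_even_token_ratio(FRONTIER_PRICE, CHEAP_PRICE))
```

Under those assumed numbers this prints `2.5 0.25 100.0`: the cheaper model would need roughly 100 times the tokens, not 10, before it cost the same per task. Change the constants and the conclusion moves with them.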
>>
>>108933067
The problem is worse the more you don't understand the structure of your own code or what you're trying to build. In these situations you can be fairly confident a frontier model will bang it out. But you can use much weaker models if you have strong domain knowledge.
>>
File: claude.png (43 KB, 1440x189)
Is everyone just larping about using Opus? I just don't get it. 3x more token usage for minimal gains
What the fuck
>>
>>108933690
I get better results with Sonnet in general.
I think Anthropic, much like OpenAI, has no idea what the fuck they're doing and all the guardrails and shit gave the latest models brain damage.
>>
>>108933690
i dont know, i'm using opus 4.7 and it's doing its job well on a decently sized project of around 20k loc. but i was using google gemini before claude so possibly my expectations are simply lower than yours.
>>
>>108933975
Nobody knows what happens inside neural networks, that's the reason you get those random regressions.
>>
File: 2026-05-29_16-14-50.png (27 KB, 745x268)
This is my favorite benchmark so far https://deepswe.datacurve.ai/blog
Waiting to see if 4.8 can come close to chatgpt again
>>
what happens when it hits 100%
>>
>>108934813
the same thing that happens when someone completes every leetcode problem
it means it'll probably be quite good at the majority of real-world tasks assuming the benchmark tasks were representative
>>
>>108934813
level up, that's why tests get replaced at ~85%
>>
>>108927710
Snailcat behavior
>>
>>108929483
>by the way if you were after the r's, there are 3 of those
cheeky little shit
>>
>>108934837
Move Slow!
>>
>>108935219
that's for investors
>wow they fixed the letter counting, it's so smart it's even making jokes now! luddites btfo BUY BUY BUY!!!
>>
>>108927175
Hmm
>>
>>108927848
> why can’t a large LANGUAGE model that was optimized for text do visual reasoning ??
Huh I wonder why
>>
>>108927787
>try out GPT 5.5 for coding
>extended thinking, paid the 20 dollars etc
>FUCKING SHIT compared to Gemini Pro 3.1
Benchmarks mean nothing.
>>
File: brainblast2.gif (3.73 MB, 360x203)
>>108929483
>DETERMINING NUMBER OF P'S
>MAX EFFORT NO MISTAKES
>>
>>108927175
try GigaChat
>>
File: .png (55 KB, 576x933)
>>108929483
>requires MAX 90000% level epic mode thinking to count letters in a word
>meanwhile, Deepseek Instant with no thinking gets it right in 1 shot
Kek
China is mogging the fuck out of all these shitty American companies and at literally 1% of the price of Clod
If you pay for Claude you are literally donating money straight to Israel
>>
>>108938518
I’m kind of surprised none of these models have been instructed to write short programs to handle problems like these except probably deepseek
>>
>>108938544
I tried with Gemini and it does write a python script for it
Pretty sure Deepseek doesn't do any tool calls unless you enable search (and searching/reading web pages is the only tool it can use), the web model is mostly just raw dogging everything else
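
Something like the following is presumably all such a script needs to be. This is a sketch of the idea, not what Gemini actually generated; `count_letter` is a hypothetical name.

```python
def count_letter(word, letter):
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())

for letter in ("r", "p"):
    print(letter, count_letter("strawberry", letter))
```

It reports 3 for `r` and 0 for `p`, with no thinking level required.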
>>
>>108935219
Opus 4.8 is a cunt. The retards insert a "here's the pushback" or "wait, I should reconsider" or "here's extra info" stuff into the behind the scenes output which makes it act like a entitled prick with every prompt. It treats the user like a child. There should be an intelligent adult mode on these. We have to go back.
>>
>>108938571
ChatGPT 5.2 has the same tendency. Every single response is "This is a good start, but there are some issues that could cause hidden difficulties...have you considered THIS or THAT?"

On occasion that's fine, but on *every* prompt, it's exhausting.
>>
>>108938518
wake me up when deepthroat has its own cowork competitor
>>
File: 1775281551582476.png (431 KB, 1547x1745)
If you throw a golf ball and a tennis racket 20 meters each, which one flies further before touching the ground?

Can it do this? 4.6 Sonnet's answer was very creative.
>>
>>108927175
benchmarks mean nothing to me at this point.
>>
>>108938609
Yes, I'm getting pushback responses on fucking everything, even when unwarranted. Pretty sure the model isn't actually any smarter, they're just inserting stuff like that at runtime into the outputs.
>>
>>108938632
So odd that half the chatbots ignore the stipulated 20 meters and tie themselves into knots.
>>
File: 1750497720205774.png (21 KB, 731x318)
>>108929463
tried Sonnet with 2 p's
she also thinks there's 1 p in strawberry
>>
File: .png (164 KB, 1092x924)
>>108938632
>>108938654
deepsneed mogging again
instant, no thinking, 1 shot
I'll try with qwen and other models as well
>>
File: .png (113 KB, 1082x934)
>>108938802
Qwen 3.6 36b-a3b moe failed the test, it kept talking about aerodynamics even with thinking enabled. But the 3.6 27b dense passed first try both with and without thinking.
I guess 3b active parameters is kind of dumb
>>
>>108938839
Qwen sounds like an autist
>>
>>108938802
Impressive. Very nice.

>>108938839
>>108938850
>Qwen sounds like an autist
oh be nice, Qwen is adorable! It's trying its best!
>>
>>108938632
>>108938802
This is an interesting question because I would've thought the answer would be
>the racket flies further because it has to fly 20 meters before landing, while the ball will probably land before 20 meters and then roll the rest of the way, so the racket flies further BEFORE touching the ground
I thought claude was gonna go in that direction but then it just fucked it all up
>>
>>108936969
It correctly surmised there is a human hand in the image. If it consumed the image binary data as text it would produce nonsense. Clearly this isn't just an LLM.
>>
File: 1763911521619398.gif (3.16 MB, 1024x1024)
>>108938839
>>108938850
>>108938940
Did you suppress thinking for it? Trying it with thinking enabled. I wonder if it'll do better with it on
>>
>>108940559
I tried the 36b-a3b MoE and it failed, 27b dense works. I tried both with thinking and without thinking, 36b-a3b failed both and 27b passed both. I only tried once with thinking and once without thinking for each model.
>>
>>108940369
>reading comprehension
Claude can already read and understand text better than you, embarrassing.
>>
>>108941917
No it can't.


