claude opus 4.8 released
>>108927175
Cool, still won't use it

>>108927710
/thread

>>108927175
Are they ever going to stop with these benchmarks that demonstrate nothing and are unverified?
>>108927710>>108927720>t. tokenlets IMAO at poverty fags
yea it's a nothingburger. i wanted this AGI thing but it was all hype. fuck these grifters
>25 dollaroosies per million tokens output
That's just not worth it.

>>108927175
>reads entire thread
>produces one token based on word relation trained weights
>reads entire thread plus the one token it generated
>produces another token
AGI
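For anyone who wants the loop in that greentext spelled out, here is a toy sketch. It is not how Claude or any real model is implemented: it assumes a bigram count table as a stand-in for trained weights and always picks the most frequent follower. A real LLM conditions on the whole context, while this toy only looks at the last token. The names `train_bigrams` and `generate` are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(tokens):
    """Count which token follows which (a toy stand-in for trained weights)."""
    following = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        following[prev][nxt] += 1
    return following


def generate(following, prompt, max_new_tokens):
    """Autoregressive loop: look at the context, emit one token, append, repeat."""
    context = list(prompt)
    for _ in range(max_new_tokens):
        candidates = following.get(context[-1])
        if not candidates:
            break  # nothing ever followed this token in the training data
        next_token = candidates.most_common(1)[0][0]
        context.append(next_token)  # the new token becomes part of the input
    return context


corpus = ["a", "b", "c", "a", "b", "c"]
table = train_bigrams(corpus)
print(generate(table, ["a"], 4))
```

Each pass through the loop produces exactly one token and feeds the extended context back in, which is the whole "reads thread, produces token, reads thread plus token" cycle.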
HN SHILLS GREEN LIGHT
DEPLOY PELICAN
DEPLOY APOLOGIA
GO GO GO

Shit is just a straight token eater Claude has become
>Make one change
>Sorry your daily limit is over. wait 4 hours -20

>>108927175
I've only used Opus 4.6 so far and it seems good enough.
Let me guess, this one is the same slop but next will be too dangerous to release?
>>108927848It has also not passed my "are you retarded?" tests
sama coming in for the kill with 5.6
>>108928031
mythos is out next month

>>108928041
what benchmark/leaderboard is this?

>>108928046
artificial analysis
>>108928015
>Pelicans in all of the thinking levels - low, medium, high, xhigh, max

>>108927175
Why is energy efficiency never mentioned in these benchmarks?

>>108928199
some will tweet about how many gallons of water a prompt burns and damage their optics

>>108927175
wait so if I let Claude do a cash flow model for me it will only be correct in 53% of cases?

>>108928022
Upvoted

>>108928706
I've been asking various models to write me a script for sharpening my knives. No luck yet.
>>108928649
likely higher. probably falls in the general quantitative section of the bench. don't see a breakdown for 4.8, but it's likely over 70%
https://www.vals.ai/benchmarks/fabv2
also the harness doesn't provide skills, so you could probably bump the pass rate up with a decent skill and some harness tooling

>>108928706
>can ai cook me a meal yet?
Unfortunately no. The waste heat, on the order of several terawatts, is vented directly into the atmosphere in the form of steam, the most potent greenhouse gas, and not used to cook food.
>mah benchmooorks
>NUMBA BIG MEANS GUD

>>108928199
the same as alphafold was never compared resource wise with the traditional algorithms. bc its awful
>>108927175
>>108929463
why doesn't it run a tool call at this point?

>>108929463
Set it to max effort and adaptive thinking

>>108929482
They're too busy sucking up to the Pope to care about functionality

>>108929483
>5 OLYMPIC SIZED POOLS OF WATER
glad we sorted your problem berry picker

>>108929524
Dont care
>>108927175
omg the fake and gay test numbers are slightly higher?

>>108929709
it's good. even the marketing numbers show that the plateau has been entered. that means the hype cycle will be at its end in 2 years.

>>108927175
hn is pogging over these made-up numbers

>>108927175
>there has been no improvement since 4.5
Release mythos or fuck off
Claude is still stupid as balls and only good for compiling what even is the difference
These fucking idiots dont even know how many new bacteria the fucking steam makes
>>108927175
Will it still burn through 20 bucks in one prompt?

>>108927175
Thanks but I'll stick with gemini 3 flash

>>108927710
>Priced out of programming

>>108931043
I told you to just write a compile script.
DEEPSEEK has been paying influencers on social media to spread the narrative of low cost adoption. DO NOT listen to these people. There is NO substitute for frontier performance. Trust in Opus.
>>108927175
>colleagues use claude for everything
>even ask for basic shit like plotting two lines
>they pass arrays of floats
>claude hallucinates different numbers in the code
>dumbshits don't even check it
yeah no, call me a snailcat if you will but I'd rather have confidence in my results
>>108931043still less retarded than you
>>108932533
deepseek might be cheaper per token, but it consumes 10 times as many tokens to accomplish the same results due to the model being much weaker, it just doesn't make sense to go for
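The trade-off here is just price per token times tokens used, so it can be checked with a few lines. The sketch below plugs in numbers claimed elsewhere in this thread and treats all of them as unverified assumptions: $25 per million output tokens for the frontier model, "1% of the price" for the cheap one, and the 10x token multiplier from the post above. `baseline_tokens` and the function name `task_cost` are hypothetical, picked only for illustration.

```python
def task_cost(price_per_million_usd, tokens_used):
    """Total cost in USD for one task at a given per-million-token price."""
    return price_per_million_usd * tokens_used / 1_000_000


# All figures are unverified claims from the thread, used as assumptions.
frontier_price = 25.0                  # USD per million output tokens
cheap_price = frontier_price * 0.01    # "1% of the price"
token_multiplier = 10                  # "10 times as many tokens"
baseline_tokens = 100_000              # hypothetical task size

frontier_cost = task_cost(frontier_price, baseline_tokens)
cheap_cost = task_cost(cheap_price, baseline_tokens * token_multiplier)

print(f"frontier: ${frontier_cost:.2f}, cheap: ${cheap_cost:.2f}")

# The cheap model only loses on cost when its price ratio exceeds
# 1 / token_multiplier, i.e. when it costs more than 10% of the frontier price.
break_even_ratio = 1 / token_multiplier
print(f"break-even price ratio: {break_even_ratio:.2f}")
```

Under these particular assumptions the cheaper model still comes out ahead on raw cost; the argument in the post holds only if the real price gap is narrower than the token multiplier, or if the extra tokens cost time and retries that this arithmetic ignores.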
>>108933067
The problem is worse the more you don't understand the structure of your own code or what you're trying to build. In these situations you can be fairly confident a frontier model will bang it out. But you can use much weaker models if you have strong domain knowledge.

Is everyone just larping about using Opus? I just don't get it. 3x more token usage for minimal gains
What the fuck

>>108933690
I get better results with Sonnet in general.
I think Anthropic, much like OpenAI, has no idea what the fuck they're doing and all the guardrails and shit gave the latest models brain damage.

>>108933690
i dont know, i'm using opus 4.7 and it's doing its job well on a decently sized project of around 20k loc. but i was using google gemini before claude so possibly my expectations are simply lower than yours.

>>108933975
Nobody knows what happens inside neural networks, that's the reason you get those random regressions.

This is my favorite benchmark so far https://deepswe.datacurve.ai/blog
Waiting to see if 4.8 can come close to chatgpt again
what happens when it hits 100%
>>108934813
the same thing that happens when someone completes every leetcode problem
it means it'll probably be quite good at the majority of real-world tasks assuming the benchmark tasks were representative

>>108934813
level up, that's why tests get replaced at ~85%

>>108927710
Snailcat behavior

>>108929483
>by the way if you were after the r's, there are 3 of those
cheeky little shit

>>108934837
Move Slow!

>>108935219
that's for investors
>wow they fixed the letter counting, it's so smart it's even making jokes now! luddites btfo BUY BUY BUY!!!
>>108927175
Hmm

>>108927848
> why can’t a large LANGUAGE model that was optimized for text do visual reasoning ??
Huh I wonder why

>>108927787
>try out GPT 5.5 for coding
>extended thinking, paid the 20 dollars etc
>FUCKING SHIT compared to Gemini Pro 3.1
Benchmarks mean nothing.

>>108929483
>DETERMINING NUMBER OF P'S
>MAX EFFORT NO MISTAKES

>>108927175
try GigaChat
>>108929483>requires MAX 90000% level epic mode thinking to count letters in a word>meanwhile, Deepseek Instant with no thinking gets it right in 1 shotKekChina is mogging the fuck out of all these shitty American companies and at literally 1% of the price of ClodIf you pay for Claude you are literally donating money straight to Israel
>>108938518
I’m kind of surprised none of these models have been instructed to write short programs to handle problems like these except probably deepseek
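For reference, the kind of short program meant here is tiny. This is a minimal sketch of what a model could run through a tool call instead of counting letters "in its head"; the function name `count_letter` is made up for illustration and is not from any model's actual tooling.

```python
def count_letter(word, letter):
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


print(count_letter("strawberry", "r"))  # 3
print(count_letter("strawberry", "p"))  # 0
```

The point is that the count comes from executing code over the actual characters rather than from token-level guessing.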
>>108938544
I tried with Gemini and it does write a python script for it
Pretty sure Deepseek doesn't do any tool calls unless you enable search (and searching/reading web pages is the only tool it can use), the web model is mostly just raw dogging everything else

>>108935219
Opus 4.8 is a cunt. The retards insert a "here's the pushback" or "wait, I should reconsider" or "here's extra info" stuff into the behind the scenes output which makes it act like an entitled prick with every prompt. It treats the user like a child. There should be an intelligent adult mode on these. We have to go back.

>>108938571
ChatGPT 5.2 has the same tendency. Every single response is "This is a good start, but there are some issues that could cause hidden difficulties...have you considered THIS or THAT?"
On occasion that's fine, but on *every* prompt, it's exhausting.

>>108938518
wake me up when deepthroat has its own cowork competitor
If you throw a golf ball and a tennis racket 20 meters each, which one flies further before touching the ground?
Can it do this? 4.6 Sonnet's answer was very creative.

>>108927175
benchmarks mean nothing to me at this point.

>>108938609
Yes, I'm getting pushback responses on fucking everything, even when unwarranted. Pretty sure the model isn't actually any smarter, they're just inserting stuff like that at runtime into the outputs.

>>108938632
So odd that half the chatbots ignore the stipulated 20 meters and tie themselves into knots.
>>108929463
tried Sonnet with 2 p's
she also thinks there's 1 p in strawberry

>>108938632
>>108938654
deepsneed mogging again
instant, no thinking, 1 shot
I'll try with qwen and other models as well

>>108938802
Qwen 3.6 36b-a3b moe failed the test, it kept talking about aerodynamics even with thinking enabled. But the 3.6 27b dense passed first try both with and without thinking.
I guess 3b active parameters is kind of dumb

>>108938839
Qwen sounds like an autist

>>108938802
Impressive. Very nice.
>>108938839
>>108938850
>Qwen sounds like an autist
oh be nice, Qwen is adorable! It's trying its best!

>>108938632
>>108938802
This is an interesting question because I would've thought the answer would be
>the racket flies further because it has to fly 20 meters before landing, while the ball will probably land before 20 meters and then roll the rest of the way, so the racket flies further BEFORE touching the ground
I thought claude was gonna go in that direction but then it just fucked it all up
>>108936969
It correctly surmised there is a human hand in the image, if it consumed the image binary data as text it would produce nonsense. Clearly this isn't just a LLM.

>>108938839
>>108938850
>>108938940
Did you suppress thinking for it? Trying it with thinking enabled. I wonder if it'll do better with it on

>>108940559
I tried the 36b-a3b MoE and it failed, 27b dense works. I tried both with thinking and without thinking, 36b-a3b failed both and 27b passed both. I only tried once with thinking and once without thinking for each model.

>>108940369
>reading comprehension
Claude can already read and understand text better than you, embarrassing.

>>108941917
No it can't.