/g/ - Technology


Thread archived.
You cannot reply anymore.




On a difficult new SWE benchmark, ProgramBench, GPT5.5 high/xhigh solves a task for the first time and significantly outperforms Opus 4.7
Link to tweets:
https://x.com/KLieret/status/2054215545663144217?s=20
Link to GitHub:
https://github.com/facebookresearch/ProgramBench/
Link to ProgramBench website:
https://programbench.com/blog/gpt-5-5-first-solve/
>>
>>108808850
>passes 0.5%
>>
>>108809367
Yes but it rebuilt a powerful tool from scratch without internet
>>
The benchmark is already invalid, since they will just fine-tune the model to replicate the existing source material now. Even if it can reproduce all 50 programs in the bench, it will fail on the 51st one. Public benchmarks become invalidated as soon as they are published.
>>
How do they control for the fact that these models have probably memorized the source code of the programs they're supposed to replicate?
>>
>>108809457
They don't. If they did, they wouldn't get "research grants" from AI "companies."
>>
>>108809378
You mean it hallucinated 99.5% of a powerful tool and miraculously got 0.5% right.
AGI in two weeks.


