On a difficult new SWE benchmark, ProgramBench, GPT5.5 high/xhigh solves a task for the first time, significantly outperforms Opus 4.7
Link to tweets: https://x.com/KLieret/status/2054215545663144217?s=20
Link to GitHub: https://github.com/facebookresearch/ProgramBench/
Link to ProgramBench website: https://programbench.com/blog/gpt-5-5-first-solve/
>>108808850
>passes 0.5%
>>108809367
Yes, but it rebuilt a powerful tool from scratch without internet
The benchmark is already invalid, as they will just fine-tune the model to replicate the existing source material now. Even if it can reproduce all 50 programs in the bench, it will fail on the 51st one. Public benchmarks become invalidated as soon as they are published.
How do they control for the fact that these models probably have the source code of the programs they're supposed to replicate in their memory?
>>108809457
They don't. If they did, they wouldn't get "research grants" from AI "companies."
>>108809378
You mean it hallucinated 99.5% of a powerful tool and miraculously got 0.5% right.
AGI in two weeks.