/g/ - On a difficult new SWE benchmark, ProgramBench, GP - Technology

Anonymous

05/12/26(Tue)14:16:01 No.108808850

File: 1750322114852299.png (456 KB, 828x1242)

On a difficult new SWE benchmark, ProgramBench, GPT5.5 high/xhigh solves a task for the first time, significantly outperforms Opus 4.7
Link to tweets:
https://x.com/KLieret/status/2054215545663144217?s=20
Link to GitHub:
https://github.com/facebookresearch/ProgramBench/
Link to ProgramBench website:
https://programbench.com/blog/gpt-5-5-first-solve/

Anonymous
05/12/26(Tue)15:45:02 No.108809367

>>108808850
>passes 0.5%

Anonymous
05/12/26(Tue)15:46:25 No.108809378

>>108809367
Yes but it rebuilt a powerful tool from scratch without internet

Anonymous
05/12/26(Tue)16:00:51 No.108809441

the benchmark is already invalid as they will just fine-tune the model to replicate the existing source material now. Even if it can reproduce all 50 programs in the bench, it will fail on the 51st one. Public benchmarks become invalidated as soon as they are published.

Anonymous
05/12/26(Tue)16:05:03 No.108809457

how do they control for the fact that these models probably have source code in their memory of these programs they're supposed to replicate?

Anonymous
05/12/26(Tue)16:07:50 No.108809472

>>108809457
They don't, if they did they wouldn't get "research grants" from AI "companies."

Anonymous
05/12/26(Tue)16:11:35 No.108809482

>>108809378
You mean it hallucinated 99.5% of a powerful tool and miraculously got 0.5% right.
AGI in two weeks.