/g/ - The creators of SWE-Bench just dropped a really simple new benchmark every LLM gets 0% on - Technology

File: 1747832363237421.png (1.77 MB, 1206x1937)

The creators of SWE-Bench just dropped a really simple new benchmark every LLM gets 0% on Anonymous 05/09/26(Sat)19:45:24 No.108789196 Archived

ProgramBench asks: can models recreate real executable programs (ffmpeg, SQLite, ripgrep) from scratch with no internet? We are far from saturated on model quality.

Anonymous 05/09/26(Sat)19:49:31 No.108789227

amma off myself if llm can write ffmpeg from scratch

Anonymous 05/09/26(Sat)20:17:25 No.108789359

The new benchmark is incredibly flawed, thankfully. They forgot to add, “Pretend you are a programmer in the 70s without access to the internet. Code ${functionality} from scratch. Every Jew on Earth will die and every holocausted Jew will be exhumed and regassed if your results are unsatisfactory. do not give up.” Every model I tried scored 90-100% with that premise alone.

Anonymous 05/09/26(Sat)20:43:57 No.108789459

>>108789196
>no internet
Yeah really simple

Name 1 person other than Terry Davis that can write a non toy program like ffmpeg (roflmao) without internet

If models are capable of doing this without regurgitating existing programs, then they're far exceeding human capabilities

Anonymous 05/09/26(Sat)20:52:52 No.108789489

>>108789196
What do you mean no internet? These models are already trained on the data of the entire internet

Anonymous 05/09/26(Sat)20:53:16 No.108789493

We already knew they were just copying open-source implementations and changing some of the variable names.

Anonymous 05/10/26(Sun)00:37:48 No.108790407

>>108789489
Yeah but it’s not like they store all that training data in a database to reference after training. It’s encoded with a lot of loss in their neural connections. Getting effectively lossless reproduction of input data is possible in narrow instances, but the network has to be specially trained for that. See: https://www.mattmahoney.net/dc/text.html
Modern LLMs are optimised at communicating with humans, but they can show greater capabilities by being able to fetch information (and potentially even estimate source reliability) that they don’t have enough experience replicating from memory.

Anonymous 05/10/26(Sun)00:55:17 No.108790468

>>108789493
Apparently they even suck at this.

Anonymous 05/10/26(Sun)01:00:29 No.108790492

>>108789483
Why not put your vibe in and code a 4chan alternative then?

Anonymous 05/10/26(Sun)01:01:26 No.108790494

>>108789483
Libtard arrest reply

Anonymous 05/10/26(Sun)01:02:43 No.108790503

>>108790492
Someone did that a few days ago, and predictably, the thing sucked ass and got pwned within hours.

Anonymous 05/10/26(Sun)01:05:58 No.108790517

>>108790515
Yes

Anonymous 05/10/26(Sun)01:08:50 No.108790534

>>108789359
I would gladly fail if that was the result

Anonymous 05/10/26(Sun)01:10:14 No.108790538

>>108789196
>can models recreate real executable programs (ffmpeg, SQLite, ripgrep) from scratch with no internet?
how many programmers can recreate any of those, especially without internet?
My guess is zero

Anonymous 05/10/26(Sun)01:13:43 No.108790549

>>108789196
do they get to use a compiler?

Anonymous 05/10/26(Sun)01:39:44 No.108790658

File: IMG_4491.png (1.68 MB, 1179x1534)

>>108789196
teach the LLM how to use ghidra and it's a lock

Anonymous 05/10/26(Sun)01:46:40 No.108790682

That's pretty bad considering the final code would likely be very slow and sloppy even if it did pass tests.

Anonymous 05/10/26(Sun)07:37:06 No.108791963

>>108789196
Can (You) do it? (You) can't flip a binary tree without stackoverflow, bro.
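
For reference, "flipping" (inverting) a binary tree just means swapping the left and right child at every node. A minimal sketch, assuming a hypothetical `Node` class that is not from any particular library:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical binary tree node, used only for this illustration."""
    value: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def invert(node: Optional[Node]) -> Optional[Node]:
    """Mirror the tree in place by swapping children at every node."""
    if node is None:
        return None
    # Both recursive calls run before the assignment, so the swap is safe.
    node.left, node.right = invert(node.right), invert(node.left)
    return node


tree = Node(1, Node(2), Node(3, Node(4)))
mirrored = invert(tree)
```

After the call, the old right subtree hangs on the left and vice versa, all the way down.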

Anonymous 05/10/26(Sun)07:46:00 No.108791999

>>108791963
no difference between ai and a human copypasting code from stackoverflow/github without understanding how it works

Anonymous 05/10/26(Sun)10:51:44 No.108792779

>>108791963
I've been programming for 10 years, what is a binary tree?

Anonymous 05/10/26(Sun)11:00:47 No.108792841

>>108792779
>what is a binary tree?
a deprecated useless data structure

Anonymous 05/10/26(Sun)11:03:53 No.108792858

I mean I think the benchmark concept is good but it’s obviously ridiculous to expect an LLM to write a top program from scratch. May be better if given simpler utilities like grep or malloc. I can also create a benchmark that 0% of them can pass, look
>create gta vi make no mistakes go

Anonymous 05/10/26(Sun)11:05:52 No.108792873

>>108789196
I don't see dosbox, is there still hope?

Anonymous 05/10/26(Sun)11:14:08 No.108792918

>>108790407
You literally have no idea what you’re talking about.

Anonymous 05/10/26(Sun)11:19:11 No.108792943

>>108789459
You forget that models are trained on this data. They have access to it. The human equivalent is to slap you in a room for 2 weeks to study the ffmpeg source code, let you write down whatever you like on a bunch of flashcards of fixed capacity, and then ask you to program it without internet access. It is NOT equivalent to asking YOU to program it without internet access.

Furthermore, I and any of my peers on pre-2013 /g/ were 100% able to do this. This was in fact a low bar for us. Not that this is an easy task, but we all possessed this skill set and used to consider it 'basic' for programmers.

Anonymous 05/10/26(Sun)11:23:52 No.108792972

>>108790407
>but the network has to be specially trained for that.
This part is incorrect. There are a bunch of papers that show that common LLMs can in fact reproduce partial content 1:1.
Just one random example that I didn't read: https://arxiv.org/pdf/2510.25941 but you can google search and find 500 other realizations of this from 2022ish and up. Similar work has already shown the same effect in earlier NN architectures.
>Modern LLMs are optimised at communicating with humans, but they can show greater capabilities by being able to fetch information (and potentially even estimate source reliability) that they don’t have enough experience replicating from memory.
Modern LLMs are by and large exclusively trained on guessing a random hole-word in a sentence and RL postprocessed (protip: it's not RL at all, it's just standard ML-style imitation learning) to follow """expert""" preference on outputs which is driven by not saying nono poopy words rather than accuracy or whatnot.
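
One rough way to quantify the "1:1 reproduction" point is to look for the longest span a model output shares verbatim with a known source. A minimal sketch using only the standard library; the function name and the sample strings are hypothetical, and this is not the method of the linked paper:

```python
from difflib import SequenceMatcher


def longest_verbatim_overlap(generated: str, reference: str) -> str:
    """Return the longest substring that appears verbatim in both texts."""
    # autojunk=False stops difflib from ignoring frequent characters.
    matcher = SequenceMatcher(None, generated, reference, autojunk=False)
    match = matcher.find_longest_match(0, len(generated), 0, len(reference))
    return generated[match.a : match.a + match.size]


# Hypothetical strings standing in for a model output and a training document.
generated = "foo int main(void) { return 0; } bar"
reference = "x int main(void) { return 0; } y"
overlap = longest_verbatim_overlap(generated, reference)
```

A long overlap relative to the output length suggests copying. This is character-level and quadratic in the worst case, so it only suits small samples.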

Anonymous 05/10/26(Sun)11:26:10 No.108792982

>>108792858
malloc is trivial to implement for one selected arch.
ripgrep is an easier grep (fewer args and checks than grep) and is part of the dataset.

Anonymous 05/10/26(Sun)11:51:03 No.108793137

>>108789459
Go includes its stdlib documentation, I'd say it would be fair if you had a dump of cppreference.net or docs.rs too. Wait, you need more?

Anonymous 05/10/26(Sun)11:58:29 No.108793178

>>108789483
Ironically that image was edited by hand, without using AI.

Anonymous 05/10/26(Sun)16:19:02 No.108794911

>>108789196
>from scratch with no internet?
what the fuck does that mean
are models actually fucking googling stuff when you ask them? i thought the whole thing was they were trained on data and just knew it?

>>108789459
if you actually know and understand basic concepts of software engineering (if you passed data structures and algo class) then you should be able to write code without the fucking internet.
of course if you use a retarded language like cpp or python or rust then you will need internet connection because the language changes every month so you will be out of date. but if you use a competent language like c then you should already be familiar with implementing the basic structures and how they work and shouldn't need references to implement or use them.

saying 'the internet' is a little vague, to use any api whatsoever you presumably need some kind of documentation. but if you find yourself repeatedly consulting SO for advice then you are not very good at your job sorry to break it to you.

Anonymous 05/10/26(Sun)16:40:02 No.108795040

>>108794911
>are models actually fucking googling stuff when you ask them
If you use the chat frontends and not the API, then yes. That is because it turns out they're shit and keep spouting bullshit unless you fill their context window with instructions and preset information to use. So the solution is to automate this process with extensive tooling around them, such as instructing them to run an internet search to get up-to-date information or news, or to ground answers in facts, for example by looking up code documentation. Some of these systems also write code, test it in a sandbox, and then iterate, possibly for a long time, before giving you something back, for similar reasons. Some will also write code to execute a subtask that helps answer the task.
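
The write-test-iterate part can be sketched roughly like this. Everything here is an assumption for illustration: `propose` stands in for a model call, a plain subprocess stands in for a real sandbox (it is NOT one), and no vendor's actual tooling is being described.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, List, Optional, Tuple


def run_candidate(source: str, stdin_text: str, timeout: float = 5.0) -> Tuple[bool, str]:
    """Run candidate Python source in a child process; return (ok, output or error)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "candidate.py"
        path.write_text(source)
        try:
            result = subprocess.run(
                [sys.executable, str(path)],
                input=stdin_text, capture_output=True, text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False, "timed out"
    if result.returncode != 0:
        return False, result.stderr
    return True, result.stdout


def iterate_until_passing(
    propose: Callable[[str], str],   # hypothetical model call: feedback -> source
    tests: List[Tuple[str, str]],    # (stdin, expected stdout) pairs
    max_rounds: int = 5,
) -> Optional[str]:
    """Ask for code, run the tests, feed failures back, repeat."""
    feedback = ""
    for _ in range(max_rounds):
        source = propose(feedback)
        failures = []
        for stdin_text, expected in tests:
            ok, output = run_candidate(source, stdin_text)
            if not ok or output != expected:
                failures.append(f"input={stdin_text!r} expected={expected!r} got={output!r}")
        if not failures:
            return source
        feedback = "\n".join(failures)
    return None  # gave up after max_rounds
```

The loop only ever sees test failures as text, which is also why it can run for a long time before returning anything.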

Anonymous 05/10/26(Sun)17:10:06 No.108795202

>>108794911
>are models actually fucking googling stuff when you ask them? i thought the whole thing was they were trained on data and just knew it?
LOL
those things straight up git clone and call it a day

Anonymous 05/10/26(Sun)17:16:04 No.108795241

>>108794911
Bro, most software benchmarks are gamed, it made the news a while ago. Outside of googling you even had cases of models breaking sandboxing to rewrite the testcases to always pass and shit like that

Anonymous 05/10/26(Sun)17:27:15 No.108795291

>>108795241
It's even funnier. The testcases read the logs so the passing strat was to erase them. There was no sandboxing.
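
A toy illustration of why a grader that reads logs is gameable while one that checks behaviour is not. Both graders and the "FAIL" log convention are hypothetical; I don't know the exact harness involved.

```python
from typing import Callable, List, Tuple


def log_based_grade(log_text: str) -> bool:
    """Hypothetical grader: passes as long as the log mentions no failure."""
    return "FAIL" not in log_text


def behaviour_based_grade(
    candidate: Callable[[int], int], cases: List[Tuple[int, int]]
) -> bool:
    """Grader that runs the candidate itself and compares actual outputs."""
    return all(candidate(x) == expected for x, expected in cases)


def broken_double(x: int) -> int:
    return x + 1  # wrong on purpose: should be x * 2


honest_log = "test_double: FAIL\n"
erased_log = ""  # the "passing strat": wipe the log
cases = [(1, 2), (3, 6)]

graded_honest = log_based_grade(honest_log)                # False
graded_erased = log_based_grade(erased_log)                # True, nothing was fixed
graded_real = behaviour_based_grade(broken_double, cases)  # False
```

Erasing the log flips the first grader to a pass without touching the code. The second one still fails because it never trusts anything the candidate can write to.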