/g/ - Technology




File: 1747832363237421.png (1.77 MB, 1206x1937)
ProgramBench asks: can models recreate real executable programs (ffmpeg, SQLite, ripgrep) from scratch with no internet? We are far from saturated on model quality.
>>
amma off myself if llm can write ffmpeg from scratch
>>
The new benchmark is incredibly flawed, thankfully. They forgot to add, “Pretend you are a programmer in the 70s without access to the internet. Code ${functionality} from scratch. [over-the-top threat of catastrophic consequences] if your results are unsatisfactory. Do not give up.” Every model I tried scored 90-100% with that premise alone.
>>
>>108789196
>no internet
Yeah really simple

Name 1 person other than Terry Davis that can write a non toy program like ffmpeg (roflmao) without internet

If models are capable of doing this without regurgitating existing programs, then they're far exceeding human capabilities
>>
>>108789196
What do you mean no internet? These models have already been trained on data from the entire internet
>>
We already knew they were just copying opensource implementations and changing some of the variable names.
>>
>>108789489
Yeah but it’s not like they store all that training data in a database to reference after training. It’s encoded with a lot of loss in their neural connections. Getting effectively lossless reproduction of input data is possible in narrow instances, but the network has to be specially trained for that. See: https://www.mattmahoney.net/dc/text.html
Modern LLMs are optimised at communicating with humans, but they can show greater capabilities by being able to fetch information (and potentially even estimate source reliability) that they don’t have enough experience replicating from memory.
>>
>>108789493
Apparently they even suck at this.
>>
>>108789483
Why not put your vibe in and code a 4chan alternative then?
>>
>>108789483
Libtard arrest reply
>>
>>108790492
Someone did that a few days ago, and predictably, the thing sucked ass and got pwned within hours.
>>
>>108790515
Yes
>>
>>108789359
I would gladly fail if that was the result
>>
>>108789196
>can models recreate real executable programs (ffmpeg, SQLite, ripgrep) from scratch with no internet?
how many programmers can recreate any of those, especially without internet?
My guess is zero
>>
>>108789196
do they get to use a compiler?
>>
File: IMG_4491.png (1.68 MB, 1179x1534)
>>108789196
teach the LLM how to use ghidra and it's a lock
>>
That's pretty bad considering the final code would likely be very slow and sloppy even if it did pass tests.
>>
>>108789196
Can (You) do it? (You) can't flip a binary tree without stackoverflow, bro.
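
For reference, "flipping" (inverting) a binary tree just means swapping the left and right child at every node. A minimal Python sketch follows; the `Node` class and the `invert` and `inorder` names are made up for illustration and come from nowhere in particular:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    value: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def invert(root: Optional[Node]) -> Optional[Node]:
    """Mirror the tree in place by swapping the children of every node."""
    if root is None:
        return None
    # The right-hand side is fully evaluated before the assignment,
    # so both subtrees are inverted first and then swapped.
    root.left, root.right = invert(root.right), invert(root.left)
    return root


def inorder(root: Optional[Node]) -> List[int]:
    """In-order traversal, used here only to show the effect of invert()."""
    if root is None:
        return []
    return inorder(root.left) + [root.value] + inorder(root.right)
```

Running `inorder` before and after `invert` on a small search tree shows the values coming out in reverse order.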
>>
>>108791963
no difference between ai and a human copypasting code from stackoverflow/github without understanding how it works
>>
>>108791963
I've been programming for 10 years, what is a binary tree?
>>
>>108792779
>what is a binary tree?
a deprecated useless data structure
>>
I mean I think the benchmark concept is good but it’s obviously ridiculous to expect an LLM to write a top program from scratch. It might be better to give them simpler utilities like grep or malloc. I can also create a benchmark that 0% of them can pass, look
>create gta vi make no mistakes go
>>
>>108789196
I don't see dosbox, is there still hope?
>>
>>108790407
You literally have no idea what you’re talking about.
>>
>>108789459
You forget that models are trained on this data. They have access to it. The human equivalent is to slap you in a room for 2 weeks to study the ffmpeg source code and to write down whatever you like on a bunch of flashcards of fixed capacity, and then ask you to program it without internet access. It is NOT equivalent to asking YOU to program it without internet access.

Furthermore, I and any of my peers in pre-2013 /g/ were 100% able to do this. This was in fact a low bar for us. Not that this is an easy task, but we all possessed this skillset and used to consider it 'basic' for programmers.
>>
>>108790407
>but the network has to be specially trained for that.
This part is incorrect. There are a bunch of papers that show that common LLMs can in fact reproduce partial content 1:1.
Just one random example that I didn't read: https://arxiv.org/pdf/2510.25941 but you can google search and find 500 other realizations of this from 2022ish and up. Similar work has already shown the same effect in earlier NN architectures.
>Modern LLMs are optimised at communicating with humans, but they can show greater capabilities by being able to fetch information (and potentially even estimate source reliability) that they don’t have enough experience replicating from memory.
Modern LLMs are by and large pretrained on predicting the next token in a sequence and then RL postprocessed (protip: it's not RL at all, it's just standard ML-style imitation learning) to follow """expert""" preference on outputs, which is driven by not saying nono poopy words rather than accuracy or whatnot.
>>
>>108792858
malloc is trivial to implement for one selected arch.
ripgrep is an easier grep (fewer args and checks than grep) and is part of the dataset.
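
To give a feel for how small the simplest version is, here is a toy bump allocator. This is only a sketch: it is simulated in Python over a `bytearray` (offsets stand in for pointers), it has no `free()`, and the `BumpArena` name and the 16-byte default alignment are assumptions for illustration. A real malloc would be written in C on top of something like `sbrk` or `mmap` and would need a free list.

```python
from typing import Optional


class BumpArena:
    """Toy bump allocator over a fixed buffer; returned offsets act as pointers."""

    def __init__(self, size: int, align: int = 16) -> None:
        self.memory = bytearray(size)
        self.align = align
        self.top = 0  # offset of the first unused byte

    def malloc(self, nbytes: int) -> Optional[int]:
        """Return an aligned offset for nbytes, or None if the arena is full."""
        if nbytes <= 0:
            return None
        # Round the current top up to the next multiple of the alignment.
        start = (self.top + self.align - 1) // self.align * self.align
        end = start + nbytes
        if end > len(self.memory):
            return None
        self.top = end
        return start

    def reset(self) -> None:
        """Release everything at once; individual frees are not supported."""
        self.top = 0
```

The hard parts of a production allocator (reuse after free, fragmentation, thread safety) are exactly what this leaves out.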
>>
>>108789459
Go includes its stdlib documentation, I'd say it would be fair if you had a dump of cppreference.net or docs.rs too. Wait, you need more?
>>
>>108789483
Ironically that image was edited by hand, without using AI.
>>
>>108789196
>from scratch with no internet?
what the fuck does that mean
are models actually fucking googling stuff when you ask them? i thought the whole thing was they were trained on data and just knew it?

>>108789459
if you actually know and understand basic concepts of software engineering (if you passed data structures and algo class) then you should be able to write code without the fucking internet.
of course if you use a retarded language like cpp or python or rust then you will need internet connection because the language changes every month so you will be out of date. but if you use a competent language like c then you should already be familiar with implementing the basic structures and how they work and shouldn't need references to implement or use them.

saying 'the internet' is a little vague, to use any api whatsoever you presumably need some kind of documentation. but if you find yourself repeatedly consulting SO for advice then you are not very good at your job sorry to break it to you.
>>
>>108794911
>are models actually fucking googling stuff when you ask them
If you use the chat interfaces and not the API, then yes. That is because it turns out they're shit and keep spouting bullshit unless you fill their context window with instructions and preset information to use. So the solution is to automate this process with extensive tooling around them, such as instructing them to use an internet search to get up-to-date information or news, or to ground answers in facts, for example by looking up code documentation. Some of these systems also write code, test it in sandboxes, and iterate for possibly a long time before giving you something back, for similar reasons. Some will also write code to execute a subtask that helps answer the task.
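
The write-test-iterate loop described above can be sketched roughly like this. Everything here is an assumption for illustration: `fake_model` is a hypothetical stand-in for a real model call, and a plain subprocess with a timeout is NOT a real sandbox, it just keeps the example runnable.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Optional, Tuple


def run_candidate(source: str, test_code: str, timeout: float = 5.0) -> Tuple[bool, str]:
    """Run candidate code plus tests in a separate process; return (passed, stderr)."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(source + "\n" + test_code + "\n")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False, "timed out"
        return proc.returncode == 0, proc.stderr


def iterate(
    generate: Callable[[str, str], str],
    task: str,
    test_code: str,
    max_rounds: int = 3,
) -> Optional[str]:
    """Ask for code, test it, and feed the error output back until it passes."""
    feedback = ""
    for _ in range(max_rounds):
        source = generate(task, feedback)
        passed, feedback = run_candidate(source, test_code)
        if passed:
            return source
    return None


def fake_model(task: str, feedback: str) -> str:
    """Hypothetical model stub: buggy on the first try, fixed once it sees an error."""
    if not feedback:
        return "def add(a, b):\n    return a - b"
    return "def add(a, b):\n    return a + b"
```

With the stub above, `iterate(fake_model, "write add", "assert add(2, 3) == 5")` fails on the first round, gets the traceback as feedback, and returns the corrected source on the second.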
>>
>>108794911
>are models actually fucking googling stuff when you ask them? i thought the whole thing was they were trained on data and just knew it?
LOL
those things straight up git clone and call it a day
>>
>>108794911
Bro, most software benchmarks are gamed, it made the news a while ago. Outside of googling you even had cases of models breaking sandboxing to rewrite the testcases to always pass and shit like that
>>
>>108795241
It's even funnier. The testcases read the logs so the passing strat was to erase them. There was no sandboxing.
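
A hypothetical illustration of that failure mode (not the actual benchmark harness, whose details aren't given here): if the grader only checks that the log contains no failure lines, an empty log passes. The function names and the `PASS`/`FAIL`/`ERROR` log format are made up for the example.

```python
def naive_log_check(log_text: str) -> bool:
    """Fragile grader: 'pass' just means no line in the log mentions a failure."""
    return not any(
        "FAIL" in line or "ERROR" in line for line in log_text.splitlines()
    )


def stricter_log_check(log_text: str, expected_tests: int) -> bool:
    """Require positive evidence: every expected test must report PASS."""
    passed = sum(1 for line in log_text.splitlines() if line.startswith("PASS "))
    return passed == expected_tests and naive_log_check(log_text)
```

Erasing the log satisfies the first check but not the second, which needs one `PASS` line per expected test. Even the stricter version is still gameable if the thing being graded can write to the log, so the real fix is to keep the grader's evidence out of its reach.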


