/g/ - What are your criticisms of AI benchmarks? How woul - Technology

Anonymous

04/27/26(Mon)20:06:07 No.108705493

File: 1768936360826479.png (712 KB, 736x1075)

Anonymous 04/27/26(Mon)20:06:07 No.108705493 Archived

What are your criticisms of AI benchmarks? How would you fix them?

Anonymous
04/27/26(Mon)20:07:24 No.108705500

>ai benchmarks
Another fucking filter to add, jesus you nigs are painful

Anonymous
04/27/26(Mon)20:46:41 No.108705673

>>108705493
New models are just the same model fine-tuned to get a bigger number on AI benchmarks (yes, basically what PewDiePie did in that video where he fine-tuned a model to perform better than ClosedAI o3).

Anonymous
04/27/26(Mon)20:47:04 No.108705676

>>108705493
My criticism is that a benchmark measures a particular metric, like how many operations per second an algorithm or piece of hardware can perform, how much you can bench press, how fast you can run a mile, and so on.

It's difficult to distill a wide range of very different tasks, like those you find within the field of software development, down to a single metric, or even a handful of metrics. Even if you throw different AI models at the same large sample of diverse tasks, evaluating their performance is still subjective beyond "does the result match the spec." And that's not even getting into the things outside the spec: how readable the code is, how extensible and maintainable it is, how well it fits the rest of the codebase (e.g. reusing existing abstractions rather than duplicating them), and whether the model considered things the spec never states, like the business context. LLMs are infamously bad at that last part, which is one reason why we are not yet able to fully automate software development.
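
To make the "does the result match the spec" point concrete, here's a rough sketch. The spec, the test cases, and both solutions are made up for illustration; this is not how any real benchmark harness works, just the general shape of pass/fail scoring.

```python
# Hypothetical spec: "return the sum of the even numbers in a list".
# Each case is (input, expected output); all of this is invented for illustration.
SPEC_CASES = [([1, 2, 3, 4], 6), ([], 0), ([7, 9], 0), ([2, 2], 4)]


def clean_solution(xs):
    # Short and idiomatic.
    return sum(x for x in xs if x % 2 == 0)


def messy_solution(xs):
    # Same behavior, but harder to read and maintain.
    t = 0
    i = 0
    while i < len(xs):
        if xs[i] % 2 == 0:
            t = t + xs[i]
        i = i + 1
    return t


def pass_rate(fn, cases):
    # Fraction of spec cases where the output matches exactly.
    passed = sum(1 for args, expected in cases if fn(args) == expected)
    return passed / len(cases)


scores = {f.__name__: pass_rate(f, SPEC_CASES) for f in (clean_solution, messy_solution)}
print(scores)
```

Both solutions get the same score, because the metric only sees outputs. Readability, maintainability, and fit with an existing codebase never enter the number.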

Anonymous
04/27/26(Mon)23:11:51 No.108706291

***PEDOPHILE THREAD***
***PEDOPHILE THREAD***

CAUTION: YOU HAVE JUST ENTERED A PEDOPHILE THREAD

***PEDOPHILE THREAD***
***PEDOPHILE THREAD***

Anonymous
04/27/26(Mon)23:14:09 No.108706301

>>108705493
I want to benchmark her if you know what I mean

Anonymous
04/27/26(Mon)23:54:30 No.108706448

>>108705500
kek give up 4chan already

Anonymous
04/28/26(Tue)01:19:02 No.108706778

>>108705493
benchmark her fertility capabilities by trying to go for a baby multiple times in row

Anonymous
04/28/26(Tue)01:28:45 No.108706801

>>108705493
They exist so the correct answers can be overtrained for, aka benchmaxxed, to make bad models look like they perform well. There is no fixing this because no AI house putting out models is going to be honest about doing it.
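
For what it's worth, one common way people try to check for this kind of leakage is n-gram overlap between benchmark items and training text. A minimal sketch, assuming you actually have access to the training data (which outsiders usually don't); the function names and the n=5 default are my own choices, not any lab's method.

```python
def ngrams(text, n):
    # Set of lowercase word n-grams in the text, split on whitespace.
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item, training_doc, n=5):
    # Share of the benchmark item's n-grams that also appear in the training doc.
    item = ngrams(benchmark_item, n)
    if not item:
        return 0.0  # item shorter than n words, nothing to compare
    return len(item & ngrams(training_doc, n)) / len(item)


# Made-up example strings, purely for illustration.
question = "what is the capital of france and why"
leaked_doc = "forum dump: what is the capital of france and why does it matter"
clean_doc = "a long article about training neural networks on public text data"

print(overlap_fraction(question, leaked_doc))
print(overlap_fraction(question, clean_doc))
```

This only catches verbatim copies. Paraphrased or translated benchmark questions slip straight through, so a low overlap score doesn't prove a model wasn't tuned for the test.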
>>108706291
Yeah those massive hanging tits tiny waist and wide hips really scream "little girl".