/g/ - Technology






File: 1768936360826479.png (712 KB, 736x1075)
What are your criticisms of AI benchmarks? How would you fix them?
>>
>ai benchmarks
Another fucking filter to add, jesus you nigs are painful
>>
>>108705493
New models are just the same model fine-tuned to get a bigger number on AI benchmarks (yes, basically what PewDiePie did in that video where he fine-tuned a model to perform better than ClosedAI o3).
>>
>>108705493
My criticism is that a benchmark measures a particular metric, like how many operations per second an algorithm or piece of hardware can perform, how much you can bench press, how fast you can run a mile, and so on.

It's difficult to distill a wide range of very different tasks, like the ones you find in software development, down to a single metric, or even a finite number of them. Even if you throw different AI models at the same large sample of diverse tasks, evaluating their performance is still subjective beyond "does the result match the spec." And that's before you get into the things not in the spec: how readable the code is, how extensible and maintainable it is, how clean it is, and how well it fits the rest of the codebase (e.g. reusing existing abstractions). Then there's whether the model considered things the spec doesn't state at all, like the business context. LLMs are infamously bad at that, and it's one reason we are not yet able to fully automate software development.
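
Toy sketch of that point in Python. Everything in it is made up for illustration: the `Review` fields, the 1-5 reviewer ratings, and the two "models" are hypothetical, not taken from any real benchmark. It only shows how two models can tie on the one number a leaderboard reports while looking nothing alike on the stuff a reviewer cares about.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    """One graded task. Only passes_spec is objective; the rest are
    hypothetical 1-5 ratings a human reviewer would have to assign."""
    passes_spec: bool
    readability: int
    maintainability: int
    fits_codebase: int


def single_score(reviews):
    """Leaderboard-style number: fraction of tasks that match the spec."""
    return sum(r.passes_spec for r in reviews) / len(reviews)


def profile(reviews):
    """Per-dimension averages, i.e. what the single number throws away."""
    n = len(reviews)
    return {
        "pass_rate": single_score(reviews),
        "readability": sum(r.readability for r in reviews) / n,
        "maintainability": sum(r.maintainability for r in reviews) / n,
        "fits_codebase": sum(r.fits_codebase for r in reviews) / n,
    }


# Two invented models: same tasks passed, very different code quality.
model_a = [
    Review(True, 5, 4, 5),
    Review(True, 4, 5, 4),
    Review(False, 4, 4, 4),
    Review(True, 5, 5, 5),
]
model_b = [
    Review(True, 2, 1, 1),
    Review(True, 1, 2, 2),
    Review(False, 2, 2, 1),
    Review(True, 1, 1, 2),
]

if __name__ == "__main__":
    print(single_score(model_a), single_score(model_b))
    print(profile(model_a))
    print(profile(model_b))
```

Both models pass 3 of 4 tasks, so a pass-rate leaderboard can't tell them apart. The per-dimension ratings are still subjective, so this doesn't fix the problem, it just makes the gap visible.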
>>
***PEDOPHILE THREAD***
***PEDOPHILE THREAD***

CAUTION: YOU HAVE JUST ENTERED A PEDOPHILE THREAD

***PEDOPHILE THREAD***
***PEDOPHILE THREAD***
>>
>>108705493
I want to benchmark her if you know what I mean
>>
>>108705500
kek give up 4chan already
>>
>>108705493
benchmark her fertility capabilities by trying to go for a baby multiple times in row
>>
>>108705493
They exist so the correct answers can be overtrained for, aka benchmaxxed, to make bad models look like they perform well. There is no fixing this, because no AI house putting out models is going to be honest about doing it.
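
Not a fix, but here is roughly what checking for it looks like in Python, assuming you can actually see the training data (which is exactly what you usually can't). It's a minimal n-gram overlap sketch: the function names, the whitespace tokenization, and the default `n=8` are my own assumptions, not any lab's real decontamination pipeline.

```python
def ngrams(text, n):
    """Set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_items, training_docs, n=8):
    """Return (fraction flagged, flagged items).

    An item is flagged if it shares at least one n-gram with any
    training document. Crude: paraphrased leaks slip through, and
    common phrases can cause false positives when n is small.
    """
    if not benchmark_items:
        raise ValueError("benchmark_items must not be empty")

    seen_in_training = set()
    for doc in training_docs:
        seen_in_training |= ngrams(doc, n)

    flagged = [
        item for item in benchmark_items
        if ngrams(item, n) & seen_in_training
    ]
    return len(flagged) / len(benchmark_items), flagged


# Invented toy data: the first benchmark question appears in the "dump".
benchmark_items = [
    "what is the capital of france answer paris",
    "sort a list of integers in descending order",
]
training_docs = [
    "leaked dump: what is the capital of france answer paris lol",
]

if __name__ == "__main__":
    rate, flagged = contamination_rate(benchmark_items, training_docs, n=4)
    print(rate, flagged)
```

With the toy data and `n=4`, one of the two items gets flagged. The catch is the one the post already names: this only works for someone who has the corpus, so an outsider is stuck trusting whatever the lab reports.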
>>
>>108706291
Yeah those massive hanging tits tiny waist and wide hips really scream "little girl".


