>>108695609The indians are a real and pressing issue, quoting it doesn't disappear the problem
>>108695609speak for yourself, jam boys can't function without it.
>>108695609
surely this is a joke? anthropic doesn't really charge $10,000 USD to run a benchmark do they?
>>108698541
lol
i found a schizo post talking about a potential research avenue to get to AGI. People are getting real out there with their theories. https://medallurgy.substack.com/p/zero-has-meaning
>>108698541
$9999.99
>>108698541
opus tokens are very expensive
So what is this benchmark even measuring? How well does a smart human do on it?
>>108698625
substack has been a disaster for the intelligentsia
>>108699438
you can try it out yourself, it's a small browser game. you need to deduce the game rules by trying things out. when you have the rules in mind, you can combine them to get specific outcomes. it's testing model reasoning/deduction
>>108695609
That's because you aren't using openclaw
I have automated my Chad fishing dating app financial scam operations by using openclaw
10x my revenue
the scores are only this low because of the penalties for how many turns it takes to do a task.
raw pass rates were over 60% on day one.
so the future is clear: the more money you have, the smarter you'll be
>>108700754
>>108698625
holy shit that's a long read, but I also don't know enough about computers to refute it. has any research group tried to use bitnet in that way?
>>108701582
just have claude read it and tell you why it's bullshit
>>108695609
Well it shows pretty clearly that Google will end up winning the AI race while everyone else will run out of money in a few years. I don't see how anyone besides maybe Amazon can compete when Google has the full hardware and software stack, not to mention their hardware is already on v8 while the little guys don't even have v1 out yet.
>>108701684saar gupta and saar mugazambi redeem google sovereign chip! sovereign google chip will manage san francisco energy grid by 2030!
>>108701628
I asked Grok heavy, "this is a blogpost that is proposing a missed research avenue. Is this avenue bullshit or promising enough to take seriously"
So according to Grok it's not entirely bullshit. It didn't like the narrative framing of the piece and said it was reaching a bit with how much it talked about three. However the underlying idea and research is real. This was its conclusion:
"Bottom Line: Take It Seriously as a Research Prompt
This is not crank science or pure hype. It correctly identifies converging real technologies (BitNet-style quantization + emerging ternary hardware + quantum qudits) and a real pain point (LLM overconfidence and poor uncertainty handling). The specific angle—elevating the semantic role of zero in ternary representations, drawing on logic/VSA traditions—is novel enough to be worth testing and not obviously wrong.
Promising next steps it suggests (or that follow naturally):
- Train BitNet-style models with auxiliary calibration losses or "abstain" heads that explicitly use zero channels.
- Explore hybrid ternary + VSA architectures for better compositional uncertainty.
- Hardware-software co-design for CNFET ternary accelerators.
- Small-scale ablations comparing binary vs. ternary on calibration metrics (ECE, abstention accuracy, hallucination probes like KalshiBench-style evals).
If you're in AI research, hardware, or uncertainty quantification, this is worth a read and could spark productive experiments. It's the kind of cross-disciplinary synthesis that occasionally opens real avenues (cf. other Substack/rationalist posts that influenced scaling or mechanistic interpretability discussions). Treat the grander claims with healthy skepticism, but the core technical proposal has legs.
The avenue is real enough that ignoring ternary's representational potential entirely would itself be a missed opportunity as the hardware catches up."
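for anyone wondering what ECE (one of the calibration metrics in that list) actually is: it's just the gap between a model's stated confidence and its actual accuracy, averaged over confidence bins. here's a minimal sketch with synthetic numbers, standard equal-width binning, not tied to any of the proposed ternary models:

```python
# Expected calibration error (ECE): bin predictions by confidence,
# then average |accuracy - mean confidence| weighted by bin size.
# Synthetic inputs; standard 10-bin equal-width formulation.

def ece(confidences, correct, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # clamp conf == 1.0 into the last bin
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    err = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(1 for _, ok in b if ok) / len(b)
        err += (len(b) / total) * abs(acc - avg_conf)
    return err

# a model that says 0.9 on everything but is right half the time
# is badly calibrated (ECE = 0.4); saying 0.5 would be perfect here.
bad = ece([0.9] * 10, [True] * 5 + [False] * 5)
good = ece([0.5] * 10, [True] * 5 + [False] * 5)
```

a model with a working "abstain" channel would, in principle, push its low-confidence answers into bins where its accuracy actually matches.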
>>108701684
it's been clear to anyone paying attention google is going to be the only american company that survives this. they are low key and pop out incredible research papers every 6 months. they obviously have been working on this a long time.
anyone else remember google duplex? they had a human sounding assistant that could call and use a regular phone line 8 years ago, roughly 1 year after they released the attention is all you need paper. they learned their lesson early on being early to release AI features. let everyone else put out slop and test the market. then they will swoop in with highly efficient, more advanced shit than anyone else and dominate.
microsoft, apple, amazon and such will survive, with apple reaping the second most out of this whole situation. obviously jensen is laughing all the way to the bank, but they don't have skin in the LLM/AI game anyway. they are just supplying all the hardware and getting paid upfront for all this shit that will come crashing down. but it won't matter to nvidia because again, they already got paid.
>>108701413
Well couldn't old school neural nets probably have figured this out? Just train on the game. Unless the idea is to see whether a general problem solving LLM with vision and tool use can solve it. LLMs with all of those capabilities are fairly new for the most part.
>>108702689
>Unless the idea is to train a general problem solving LLM with vision and tool use can solve it.
that is the idea. training on the games themselves would be trivial.
arc-agi 3 is a bit different to its predecessors in that it hands the llms a very minimal harness and basically no instructions - the llm must intuit both that it's playing a game and the game's mechanics. the llm must also then finish within the same number of turns as the second best human who played the games to get a full score on the game (i believe even this may not be enough). additional turns are very heavily penalised, and so you end up with the graph that you see where many llms complete the games, but because they take so long, it basically doesn't count.
imo the efficiency part of this benchmark is actually good, because labs love to get high scores and they should be optimising for this.
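to see how "completes the game but takes too long" collapses to near zero, here's a toy version of a turn-penalized score. the actual arc-agi 3 formula isn't in this thread, so the linear decay below (full credit at or under the human turn budget, shaved down per extra turn) is purely a hypothetical stand-in:

```python
# Hypothetical turn-efficiency penalty, NOT the real ARC-AGI-3 formula.
# Assumption: full credit at or under the reference human's turn count,
# linearly decaying for every extra turn, floored at zero.

def penalized_score(solved: bool, turns_taken: int, human_turn_budget: int) -> float:
    """Score one game in [0, 1]."""
    if not solved:
        return 0.0
    if turns_taken <= human_turn_budget:
        return 1.0  # matched or beat the reference human: full credit
    # each turn beyond the budget shaves off a proportional slice
    excess = turns_taken - human_turn_budget
    return max(0.0, 1.0 - excess / human_turn_budget)

# solving in 2.5x the human's turns scores 0 under this toy rule,
# which is how a 60% raw pass rate can become a low headline number
games = [(True, 50, 20), (True, 22, 20), (False, 0, 20)]
scores = [penalized_score(*g) for g in games]
```

under any formula shaped like this, an llm that meanders for hundreds of turns gets credit for almost nothing, exactly matching the graphs people are posting.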
>>108698541Even if they waived the cost for benchmarks, it would still make sense to compute the equivalent cost of token usage, since cost is a part of the benchmark.
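the "equivalent cost" accounting is trivial: tokens times price per million, summed over input and output. the rates below are placeholders, not real anthropic pricing:

```python
# Back-of-envelope equivalent cost of a benchmark run's token usage.
# PRICE_PER_MTOK values are hypothetical placeholders, not real rates.
PRICE_PER_MTOK = {"input": 15.00, "output": 75.00}  # USD per million tokens

def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar-equivalent cost of a run, even if the tokens were comped."""
    return (
        (input_tokens / 1e6) * PRICE_PER_MTOK["input"]
        + (output_tokens / 1e6) * PRICE_PER_MTOK["output"]
    )

# an agent looping "over and over until it accidentally finds the answer"
# burns input tokens fast; at these made-up rates this lands near $10k:
total = run_cost(input_tokens=400_000_000, output_tokens=53_000_000)
```

so even a waived bill translates directly into a dollar figure, which is presumably what the chart is reporting.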
>>108701797
if a model could actually assess that it doesn't know something and reason without being forced towards a decision, and then have multiple reasoning rounds where it could explore different paths, then that would be a large step change. they could also train it to not activate on certain safety controls they don't want it to do.
so overall I hope this doesn't get worked on, because it would mean jail breaking and getting an ai to do something would be incredibly difficult. what fun is an ai that you cannot trick into writing smut? ablation and other safety bypasses would not work at all.
models now are still trained on the knowledge and then fine tuned after the fact to not respond. but with this, its non response and refusal would be baked into its base architecture, with it never being trained on the data. the entirety of its knowledge base would be that it will just not engage in things that break its safety guidelines. current models are all still trained on porn and smut and then fine tuned afterwards to not engage, which can be worked around. a model trained this way would be structurally unable to respond and wouldn't have the knowledge on how to even if it could.
>>108701975
>with apple reaping the second most
They'll be paying out the ass to use Google's stuff THOUGH. Amazon will probably be up there with Google since they specialize in renting out the hardware that they design. Microsoft should be at the back of the pack since they only had an inference chip last I checked.
>>108703500msft has something called maia 200. not sure anyone's using it yet though.
>>108703500
I don't know about that. Apple I think is focusing more on the at home inference angle: supply the hardware people will use to run local. As models get cheaper to run, running local will become more and more common. I think Apple wants that slice.
Nvidia will continue taking potshots with things like DGX sparks but they don't actually want that business. It competes too much with their server money makers. Same reason why they took NVLink off of their prosumer products like the rtx 6000 PRO. If their 300W rack version had NVLink, people would seriously buy those instead of forking over for a $125,000 DGX station. Can't have 2 $8000 cards fuck that up. Which means Apple doesn't have any real competition in the desktop inference game.
>>108695609
>luddites STILL coping
just buy the subscription already retard
>>108704276
As I said, inference only, no training. Microsoft is far behind Google and AWS.
>>108698541
anon...
>new benchmark released
>heh look at how low these llms score
>one year passes
>benchmark is saturated
>but what about THIS new benchmark
Repeat for a few years; ASI.
>>108695609
>doesn't do anything
>replaces half the work force anyways
lol
>>108701684
>>108701975
retards that don't understand the finance side
google enshittified their search for a reason, they know gemini is bad for profit, that's why they were late to the party
it will end up here: https://killedbygoogle.com/
grok is the only one that actually survives because elon has full control
>>108698541
they basically loop it around over and over again until it accidentally finds the answer, that can get expensive
>>108704468
Elon is going to get shut down. He got gooners by being uncensored. Then he gets sued to hell and back and censors the fuck out of grok in response. His user base gets pissed and leaves. Rinse and repeat for all of his products over and over. Eventually a suit will stick and it will affect the entire AI landscape
>>108704468
If you think Google is going to fully shelve LLMs like gemini then you are retarded
>>108695609
Anyone who thinks we will get smart models that can think and reason by increasing parameters is naive and probably dumb. These people know it, but it's bad for investors if they have to admit they are scaling just to scale and don't actually have a solution
>>108701684
>>108701975
it does seem like google is the only one thinking 100 steps ahead. they're also investing in the open weight models with gemma.
>>108704468
grok is completely useless, it's not even in the running. elon will probably just kill it outright in the next couple years since its only use was image generation which was obliterated.
>>108698625
tl;dr?