/g/ - Technology

File: images.jpg (8 KB, 277x182)
8 KB JPG
I want to run 120b+ models (Llama 3/Nemotron) locally but dropping $4,700 on a DGX Spark seems like peak consumerist brainrot. I'm looking for a real alternative for VRAM-heavy setups that doesn't involve paying the "AI Workstation" premium for a fancy case and 128GB of LPDDR5X.

Should I just go the 6x used RTX 3090 route and deal with the 2000W power draw and industrial fan noise in my room, or is there a better way to handle P2P without NVLink being a bottleneck? The Mac Studio M3 Ultra with 256GB unified memory is an option, but it feels like paying a massive onions tax for slower tokens and being locked into the Apple ecosystem. I've also looked at scavenging eBay for used A100s or a refurbished Supermicro server, but I don't want to get scammed on dead enterprise silicon.
>>
>https://www.amd.com/en/products/processors/consumer/ryzen-ai/ryzen-ai-halo.html
Maybe this will be an option, it's a month away. No clue about price.
>>
just rent a cloud api or server
>>
>>108742726
No, I want to run local models
>>
>>108742767
pedo
>>
>>108742725
$3,500 minimum, these AI mini PCs are expensive as hell.
>>
>>108742830
get a life
>>
china has cheap 4090 48gb
>>
File: 1777437812869198.jpg (41 KB, 512x384)
41 KB JPG
>>108742895
Not really, a 4090 with 48GB costs $4,100 on Chinese sites.
>>
>>108742720
The Macs are unironically the best hardware for this use case, and they are well supported by llama.cpp, which is what you'd most likely be using. The only downside is that if you want to venture into anything more adventurous than LLM inference you'll be shit out of luck without Nvidia hardware.
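For reference, the whole Mac workflow is basically this once llama.cpp is set up (shown here through the llama-cpp-python bindings built with the Metal backend; the GGUF filename is just a placeholder):
[code]
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3.3-70b-instruct-q4_k_m.gguf",  # placeholder GGUF
    n_gpu_layers=-1,   # offload every layer to the Metal backend / unified memory
    n_ctx=8192,
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "say hi in one line"}]
)
print(out["choices"][0]["message"]["content"])
[/code]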
>>
>let's find out the best alternative to dgx spark because I don't like it and I pretend it's not the best for local llm
I'm tired of this game
>>
my uncle works for an ai hardware company and he said he got one of these for free. kinda jelly
>>
>>108742987
But unified memory bandwidth is slow for 120B models compared to a multi-GPU stack. The DGX Spark is much faster for prompt processing and actually lets me run CUDA-only tools from GitHub without waiting for Mac ports. I’m mostly worried the Mac becomes a $5k paperweight the moment I want to do more than just basic chat inference.
>>
>>108742990
Lol, I never said I didn't like it, I just said it's too expensive.
>>
>>108743059
Ask him if he's hiring, I'd take a buggy prototype for free too
>>
>>108742720
The Spark sucks. I have one from my job for testing, and there's basically nothing I've been able to use it for that wasn't better done with a desktop workstation and a discrete GPU.

p.s. if you actually want one for some reason, get the Asus-branded one. It's inexplicable that people are paying more for the nvidia FE version with the same hardware / worse cooling.
>>
>>108742990
Gotta kill time when you're unemployed.
>>
File: 1777356064251936.jpg (2.23 MB, 5006x3464)
2.23 MB JPG
>>108743160
Yeah, that's kind of what I suspected: most of the value seems to just be the form factor + unified memory, not actual throughput.

What are you running on your desktop setup though? My main concern isn’t raw compute, it’s fitting 120B+ models without everything choking on PCIe.

Have you tried multi-3090 (or A100) setups without NVLink? I keep seeing mixed answers on whether tensor parallel over PCIe is actually usable or just a stuttery mess.

Also curious if you’ve found a decent middle ground between “6x 3090 space heater” and “sell kidney for enterprise gear.”
>>
Just use an API from Claude or DeepSeek, LOL. Why spend the effort and money to run local LLMs? I really don't understand.
>>
>>108742720
STOP USING AI
>>
>>108743580
this is the stupidest thing i've ever heard
>>
>>108742720
>>108742832

I bought a Framework Desktop on launch, Ryzen AI MAX 395+ with 128GB unified RAM for $2,300. For that price it was worth it, but prices are fucked now. Still, here's the current landscape:

sub $2k budget:
>Beelink SER10 MAX HX 470

$3k ish
>Framework Desktop or GMKtec EVO-X2 128GB

$4k+
>wait for new M5 Ultra Mac Studios and buy one with as much RAM as you can afford

Keep in mind Strix Halo is 256GB/s of memory bandwidth, while the new Mac Studio will probably be over 1,000GB/s (the M3 Ultra was 819GB/s), so inference will be a LOT faster, though still slower than a discrete GPU setup (a 5090 is 1,792GB/s).
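Napkin math on why those bandwidth numbers are basically the whole story for decode speed (my own rough estimate, assuming a dense ~120B model at ~4 bits/weight):
[code]
# Single-stream decode is memory-bound: every new token re-reads roughly the whole
# (active) weight set, so tok/s is capped at about bandwidth / model size.
model_gb = 120e9 * 0.5 / 1e9   # ~60 GB of weights at ~4 bits/weight
for name, bw_gbs in [("Strix Halo", 256), ("M3 Ultra", 819), ("RTX 5090", 1792)]:
    print(f"{name}: ~{bw_gbs / model_gb:.0f} tok/s ceiling")
# prints roughly: Strix Halo ~4, M3 Ultra ~14, RTX 5090 ~30
[/code]
Real numbers land below those ceilings once you add KV-cache reads and overhead, but the ordering matches what anons report.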

imo the discrete GPU route isn't worth it due to the combination of cost, physical space, and power draw, unless you're burning through a fuckton of tokens every month

if I had money to blow I'd definitely opt for the upcoming Mac Studio M5 Ultra.
>>
>>108743060
>I’m mostly worried the Mac becomes a $5k paperweight the moment I want to do more than just basic chat inference.
you can just sell it
There will still be people buying those in a decade
a bunch of ewaste like 6x 3090, probably not
>>
>>108743190
>tensor parallel over PCIe
>3090
just buy some shitbox with 128gb ram, you clearly have no idea what you’re doing
>>
>>108743075
Have you considered that running high parameter counts is just always going to be expensive?
>>
>>108744998
My midrange desktop that I bought 128GB of RAM for last year because it was cheap (lol lmao) runs 120B models just fine
The actually expensive stuff begins when you go above 128GB RAM + 24GB VRAM, because a consumer mobo is no longer good enough or you're blowing a fortune on GPUs
>>
>>108743933
snail cat
>>
dgx spark was mocked by the basement 3090ers
how the tables have turned
>>
>>108744401
You're contemplating the cost of ludicrously specced PCs. If these prices are something you're stressing over, you need to get some self-awareness. They will all be completely obsolete in 5 years, especially the mini PCs.

They should have made the dgx spark fit in a 5.25 bay slot.
>>
>>108742720
Use 3x GMKtec Strix Halo 128GB machines linked via fabric
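If anyone actually tries that, llama.cpp's RPC backend is the usual way to stitch boxes together. Rough sketch below; the IPs and model file are placeholders and the flag names may differ by build:
[code]
import subprocess

# on each of the other two boxes, start a worker first (run in a shell there):
#   rpc-server --host 0.0.0.0 --port 50052
# then on the box you prompt from, point llama-cli at the remote workers:
subprocess.run([
    "llama-cli",
    "-m", "some-120b-q4_k_m.gguf",                      # placeholder model file
    "--rpc", "192.168.1.11:50052,192.168.1.12:50052",   # placeholder IPs of the other boxes
    "-ngl", "99",                                       # offload/split all layers
    "-p", "hello from three mini PCs",
])
[/code]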
>>
Bump
>>
>>108742720
interconnect speed between GPUs is basically irrelevant for single-user inference: with the usual layer split each GPU just runs its own slice of the model and passes small activations along, so PCIe isn't the bottleneck people assume it is
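In llama.cpp terms that's just the default layer split. A sketch of what it looks like through the llama-cpp-python bindings (filename, card count, and the exact constant name are placeholders/assumptions):
[code]
import llama_cpp
from llama_cpp import Llama

llm = Llama(
    model_path="./nemotron-120b-q4_k_m.gguf",      # placeholder filename
    n_gpu_layers=99,
    split_mode=llama_cpp.LLAMA_SPLIT_MODE_LAYER,   # layer split: PCIe barely matters
    tensor_split=[1.0] * 6,                        # spread weights evenly over 6 cards
)
[/code]
Row/tensor split is the mode where inter-GPU traffic actually grows, which is why people bring up NVLink at all.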
>>
>>108742720
A year ago the answer was a Xeon Scalable with AMX and refurb DDR5, combined with any recent NVIDIA GPU, to run ktransformers.

Now fuck you.
>>
What do you need all this stuff for?
>>
>>108744997
>ram instead of vram
clearly you're the one who doesn't know anything about this, not the OP
>>
>>108742720
What's your usecase?
The recent mid-sized Qwen and Gemma models (27B-35B) are quite capable and can fit comfortably in 32GB of VRAM.
That much VRAM can be had for ~$1,300 with the Radeon AI Pro R7900 if you're near a Microcenter. Slightly more annoying than Nvidia cards and not as fast as a 5090, but if it's just for inference, ROCm or Vulkan can get competitive speeds out of it; two of those cards put you at 64GB of VRAM for comparable cost to a 5090.

Quanted, you can make these models fit in 24GB, and with the MoE models you can offload some layers to the CPU and still get tolerable speed.
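The partial-offload idea in llama-cpp-python terms looks roughly like this (model name and layer count are placeholders; tune n_gpu_layers to whatever fits your card):
[code]
from llama_cpp import Llama

llm = Llama(
    model_path="./qwen3-30b-a3b-q4_k_m.gguf",  # MoE: only a few B params active per token
    n_gpu_layers=28,                           # whatever fits in 24GB; remaining layers run on CPU
    n_ctx=16384,
    n_threads=16,
)
[/code]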


Strix Halo was a good deal at launch, but the prices have gone nuts. AMD just announced a first-party box; if you want to try it, I'd wait and see if that one can be had for closer to MSRP.

A Mac is a solid choice, but you're locked into it and macOS, and you're at the mercy of the community. That said, the community around it is strong, and there's stuff that lands there before it lands in other places (dflash, recent example).

Hate to say it, but X is a good source of info and experiences with this stuff.

A pile of used 3090s is popular in the community, but you pay for it in power and heat.
>>
>>108746297
>>108742720
>I want to run 120b+ models

read
>>
>>108746427
i want to run 120b+ models
>>
>>108746448
and then?
>>
>>108746399
It depends on what he wants it for. CPUMAXXING was considered a legit strategy: you can get tolerable t/s out of a Rome, Milan, or Xeon CPU if you've got all the memory channels populated, and you'll have enough PCIe lanes to start adding GPUs later if you decide you need more.
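Quick back-of-envelope on why the channel count is the whole game (example configs assumed, theoretical peaks only):
[code]
# Peak per-socket bandwidth = channels * MT/s * 8 bytes. Populating every channel
# is what makes CPUMAXXING work at all.
for name, channels, mts in [
    ("desktop, dual-channel DDR5-5600", 2, 5600),
    ("Xeon SP, 8-channel DDR5-4800", 8, 4800),
    ("Epyc Genoa, 12-channel DDR5-4800", 12, 4800),
]:
    print(f"{name}: ~{channels * mts * 8 / 1000:.0f} GB/s theoretical peak")
# ~90, ~307 and ~461 GB/s respectively
[/code]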

>>108746427
R9700* not R7900

>>108746448
Have you tried the 27B dense or the 35B A3B? Just "I want this many beaks" is not an answer. Are you coding, are you openclawing, are you AI-waifuing? What's your usecase?
>>
>>108746457
coding and agentic stuff maybe some stable diffusion
>>
>>108746472
>are you coding, are you openclawing, are you ai-waifuing? what's your usecase?
all of them :}
>>
>>108746478
Ok.
What kind of agentic stuff?
Genuinely interested.
>>
>>108742990
>ARM Linux
Hell-box
>>
>>108742924
maybe 4 ur gweilo ass lMAOO
>>
>>108746673
The DGX Spark is probably the best ARM Linux experience available that's not Android.
>>
>>108742720
M5 Ultra maybe? It will supposedly have 1.2TB/s of memory bandwidth and better perf. Not sure how well it will score in PP (prompt processing).
I heard about some tech that helps speed up the prefill, but I haven't dug into it.
Perhaps it's best to wait it out until there's decent consumer hardware. Also, I think you will need around 512GB for the best models at Q4/5.
>>
>>108746675
>maybe 4 ur gweilo ass lMAOO
Saaar!
>>
>>108742720
Buy 3x R9700.
If you're poor you can also go with SXM2 cards, either 6x 16GB or 3x 32GB.
>>
>>108746457
this retard has no idea what he's doing
>>
>>108742720
Unlike graphics cards, you can find good deals on AI mini-PCs in places like eBay. It's worth a look if you're mostly interested in AI.
>>
>>108742720
The DGX Spark is a CUDA prototyping machine. If you're not prototyping in the CUDA ecosystem ahead of an enterprise rollout, it's not for you.

maybe check out the strix halo options.
>>
>>108743190
A lot of the cost of the Nvidia Spark systems is in the networking hardware. If you're just going local you don't need that hardware.
>>
if the DGX Spark could connect a fast dGPU it would be perfect
>>
File: 2.png (1.43 MB, 1184x864)
1.43 MB PNG
>>108747427
>>
>>108748383
>A lot of the cost of the Nvidia Spark systems is in the networking hardware. If you're just going local you don't need that hardware.

Not really. The networking is mainly for multi-node scaling. The real value of the DGX Spark is the 128GB unified memory, which is exactly what you need for running 120B models locally.
>>
File: file.png (2.18 MB, 1024x1024)
2.18 MB PNG
>>108750701
I would just wait a few years and store muns up like a squirrel till something bespoke and purpose-built for fast, high-density inference comes along; it's inevitable. The 6xxx Nvidia consumer chips are rumored to hit 5.5-6k+ TOPS, so it will be worth the wait. Local LLMs are dogshit right now even if you can have a pack of them working at code together. To commit to buying now would be foolish because of the coming power step-up, even at the low end. And there's also that whole cooking-your-GPU thing: even with best-case preventative care you're burning your hardware. I'm 4.8 million comfy gens in on my 5070 Ti and the system sound output just died on it, and I've only been running gens on it for about 7 months.
>>
>>108751202
lmao


