I'm not an AI bro, just someone who likes fucking around with locally run models. I have a decent laptop (Intel Core i9-14900HX, NVIDIA GeForce RTX 4060 Laptop GPU, 32 GB DDR5-5600), and it struggles with a lot of models (15 tok/sec on nvidia/nemotron-3-nano). Is there anything that can be done about this? And why are LLMs so heavy?
>>108741087
cause they are giant Markov chains that represent a wide range of paths
>>108741087
that's a big model for 8GB of VRAM. you should get a MacBook with Apple Silicon and unified memory if you want to run LLMs on your machine.
>>108741087
30B is a big model for consumer hardware. try a 7-14B model: qwen3.6, qwen3.5, qwen3
>>108741087
>Why are Large Language Models so heavy
What the fuck do you think "Large" means? Run a small, local model if you want something lightweight and portable. Lightning-fast small models tailored to specific domains are going to be the future anyway.
>>108741087
>Is there anything that can be done about this?
faster hardware, preferably not a laptop
>And why are LLMs so heavy?
heavy? get real. those sizes are reasonable considering how much data these models are able to vomit up IF your hardware is fast enough.
tl;dr: stop being poor and get good
>>108741087
I'd say 15 tok/sec for a free model of that quality on a sub-$2000 mobile device is actually pretty impressive and you should be grateful
>>108741087
Because they get better as they get bigger, so naturally people will make them as big as fits on enterprise hardware, not as big as fits on your laptop.
Look man, they can already do a fucking lot. I run this Gemma 4 E4B or some nonsense, and what I'm finding is that it performs not much worse on a piece of dogshit like the Mac mini M4 16GB than the (non-local) DeepSeek I was using back when I first discovered this stuff. It works stupidly well for basic shit. I'm asking it "hey look, I took this picture, what do you see" and it's accurately telling me that my photographic skills are fucking terrible. I don't need someone on /p/ to tell me anymore, and best of all, I didn't spend thousands, and since I'm not running this online the critique stays private.
>>108741087
If Q4_K_M versions are too heavy for your hardware, try Q3_K_M. Perplexity will go up a bit, but it can be a worthwhile tradeoff if the model is just too slow.
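if you want to eyeball the tradeoff yourself, here's a minimal sketch using llama-cpp-python (assuming you have a CUDA build installed; the model path is just a placeholder for whatever Q3_K_M GGUF you actually grab):
[code]
# pip install llama-cpp-python (CUDA build for GPU offload)
from llama_cpp import Llama

# placeholder path: point it at whatever Q3_K_M GGUF you downloaded
llm = Llama(
    model_path="models/your-model-Q3_K_M.gguf",
    n_gpu_layers=-1,  # -1 = offload every layer to the GPU
    n_ctx=8192,       # context window; bigger eats more VRAM for KV cache
    verbose=False,
)

out = llm("Explain what a KV cache is in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
[/code]
run it once with the Q4_K_M path and once with the Q3_K_M path and compare tok/sec against output quality yourself.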
>>108741087
run with full GPU offload if you want speed. right now the qwen3.5 4B/9B are the best fit for you I think. for example, I typically run a Q3.5-9B with 64k context since it fully fits in my 4070 with some spare room left in GPU memory for browser windows and shit like that
>>108745017
4-bit quants are so popular because they're the exact sweet spot between size and quality in almost all models of the size a typical gaming PC can run (~32B and under)
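napkin math if you want to sanity-check what fits in VRAM (standard transformer assumptions; the layer/head numbers below are a hypothetical 9B-class config, check your actual model's):
[code]
# rough VRAM estimate: quantized weights + fp16 KV cache
# ballpark only; real runtimes add overhead on top

def weights_gb(params_b, bits):
    # 32B at 4-bit -> ~16 GB, which is why ~32B is the ceiling for gaming PCs
    return params_b * bits / 8

def kv_cache_gb(layers, kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    # 2x for the K and V tensors, fp16 = 2 bytes per element
    return 2 * layers * kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

# hypothetical 9B-class model with grouped-query attention:
# 28 layers, 4 KV heads, head_dim 128, 64k context
print(f"weights:  {weights_gb(9, 4):.1f} GB")                 # ~4.5 GB
print(f"kv cache: {kv_cache_gb(28, 4, 128, 65536):.1f} GB")   # ~3.8 GB
[/code]
call it ~8-9 GB total, which lines up with a 9B quant at 64k context fitting in a 12 GB 4070 with room to spare like the anon above said.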
Your prompt is ready. Speed weak, weights are heavy. The model's making slop already, code spaghetti.
>>108741087
It's almost as if it's just bruteforcing shit.
>>108747591
>It's almost as if it's just bruteforcing shit.
Exactly.