>llama.cpp cuda version
Even with 0 layers offloaded to the GPU, it still eats all my VRAM, and as soon as I open an app that needs a bit of VRAM, like Chrome, prompt processing basically hangs: about 10 minutes per 2048 tokens while the entire system lags because it's out of VRAM.
Am I missing some argument? I haven't had this problem with KoboldCpp (cuBLAS).
--n-gpu-layers 0
--threads 15
--threads-batch 15
--ctx-size 32768
--batch-size 2048
--ubatch-size 2048
--no-mmap
--cache-ram 0
--flash-attn "off"
-v
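
For reference, this is roughly the full invocation those flags add up to (just a sketch; the llama-server binary name and model path are placeholders I'm assuming, everything else is the flags above):

llama-server -m model.gguf --n-gpu-layers 0 --threads 15 --threads-batch 15 --ctx-size 32768 --batch-size 2048 --ubatch-size 2048 --no-mmap --cache-ram 0 --flash-attn "off" -v

If I understand the CUDA build right, it still allocates a CUDA compute buffer for prompt processing even at 0 offloaded layers, and that buffer scales with --ubatch-size and context, so a ubatch of 2048 at 32k context could account for a lot of the VRAM on its own.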