>>108805227
I'm using llama.cpp with ROCm. Haven't done a ton of performance testing, but from what little I've done, ROCm runs rings around Vulkan on them.
I just ran these to test Gemma 4 31B on two of the V620s:
llama-bench -m ~/models/gemma-4-31B-it-Q8_0.gguf -fa 1 --numa numactl --device ROCm2/ROCm3 -d 0 --direct-io 1 -t 32 -b 512 -ub 512 --mmap 1 -sm tensor
llama-bench -m ~/models/gemma-4-31B-it-Q8_0.gguf -fa 1 --numa numactl --device Vulkan2/Vulkan3 -d 0 --direct-io 1 -t 32 -b 512 -ub 512 --mmap 1 -sm tensor
llama-bench -m ~/models/gemma-4-31B-it-Q8_0.gguf -fa 1 --numa numactl --device ROCm2/ROCm3 -d 0 --direct-io 1 -t 32 -b 512 -ub 512 --mmap 1 -sm layer
llama-bench -m ~/models/gemma-4-31B-it-Q8_0.gguf -fa 1 --numa numactl --device Vulkan2/Vulkan3 -d 0 --direct-io 1 -t 32 -b 512 -ub 512 --mmap 1 -sm layer
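Side note if anyone wants to reproduce: I haven't checked whether the ROCm and Vulkan backends enumerate the cards in the same order, so the 2/3 indices might not point at the same physical GPUs on your box. You can sanity-check with the vendor tools (rocm-smi ships with ROCm, vulkaninfo with the Vulkan SDK or mesa-vulkan-tools); the vulkaninfo order should line up with llama.cpp's VulkanN numbering, but I haven't verified that:
# list the GPUs and their indices as ROCm sees them
rocm-smi
# dump the Vulkan physical devices; the deviceName lines show the ordering
vulkaninfo --summary | grep -i devicename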
I got (pp = prompt processing, tg = token generation, both in t/s):
* 525 pp / 20 tg for ROCm with tensor parallelism
* 102 pp / 7 tg for Vulkan with tensor parallelism
* 449 pp / 8 tg for ROCm with layer parallelism
* 222 pp / 8 tg for Vulkan with layer parallelism
(But thermal throttling may have been a slight issue for the last 2 tests.)
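If you want to rule out the throttling, watching temps and clocks in another terminal while the bench runs is enough (flags from the rocm-smi that ships with recent ROCm; older builds may spell them differently):
watch -n 1 'rocm-smi --showtemp --showclocks'
If sclk drops while the edge/junction temps are pegged, it's throttling.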