GLM-4.5-Air CPU-MoE bench: -ngl 99 with --n-cpu-moe 33 vs plain -ngl 19. Token generation goes from 9.89 to 12.19 t/s, roughly a 2.3 t/s (~23%) improvement, which is pretty good; prompt processing is basically unchanged.
./llama-bench -m '/mnt/miku/Text/GLM-4.5-Air-Q3_K_M/GLM-4.5-Air-Q3_K_M-00001-of-00002.gguf' -ngl 99 --n-cpu-moe 33 -t 48 -fa 1 --mmap 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | threads | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | ---: | --------------: | -------------------: |
| glm4moe 106B.A12B Q3_K - Medium | 53.11 GiB | 110.47 B | ROCm | 99 | 48 | 1 | 0 | pp512 | 207.63 ± 3.52 |
| glm4moe 106B.A12B Q3_K - Medium | 53.11 GiB | 110.47 B | ROCm | 99 | 48 | 1 | 0 | tg128 | 12.19 ± 0.21 |
build: da30ab5f8 (6531)
(づ◡﹏◡)づ [llama.cpp]$ ./build/bin/llama-bench -m '/mnt/miku/Text/GLM-4.5-Air-Q3_K_M/GLM-4.5-Air-Q3_K_M-00001-of-00002.gguf' -ngl 19 -t 48 -fa 1 --mmap 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | threads | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | ---: | --------------: | -------------------: |
| glm4moe 106B.A12B Q3_K - Medium | 53.11 GiB | 110.47 B | ROCm | 19 | 48 | 1 | 0 | pp512 | 206.51 ± 8.28 |
| glm4moe 106B.A12B Q3_K - Medium | 53.11 GiB | 110.47 B | ROCm | 19 | 48 | 1 | 0 | tg128 | 9.89 ± 0.19 |
build: da30ab5f8 (6531)
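If the --n-cpu-moe split is the keeper, here is a minimal sketch of carrying it over to llama-server for actual use. This assumes the llama-server in this build accepts the same offload flags as llama-bench; the context size and port below are arbitrary example values, not taken from the benchmark above.

# hedged sketch: reuse the faster split (-ngl 99 --n-cpu-moe 33) for serving
# context size and port are arbitrary example values
./build/bin/llama-server \
  -m '/mnt/miku/Text/GLM-4.5-Air-Q3_K_M/GLM-4.5-Air-Q3_K_M-00001-of-00002.gguf' \
  -ngl 99 \
  --n-cpu-moe 33 \
  -t 48 \
  --no-mmap \
  -c 8192 \
  --port 8080

The difference between the two runs is where the weights live: --n-cpu-moe 33 keeps the MoE expert tensors of the first 33 layers on the CPU while -ngl 99 leaves everything else (attention and dense weights) on the GPU, whereas plain -ngl 19 offloads only 19 full layers, which is presumably where the tg128 gain comes from.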