>>109159747
I should've clarified that I was testing with 4 R9700s at the time, with 40% of the model on CPU, so the CPU speedup wasn't going to be massive.
Ran tests with 1 R9700, using ncmoe for all MoE layers. For whatever reason the prefill is pretty much unchanged between the options this time (unlike Qwen 397B), and I'm capped at about a 1.3x speedup on decode. I'll look into that - maybe you're right, but
>Claude said that it's doing what you suggested.
I can provide you its explanation of what it did if you want.
Tests:
... build/bin/llama-bench -m ~/models/GLM-4.7-Q3_K_L.gguf -sm layer --device ROCm0 -fa 1 --numa split -t 48 -ncmoe 92
(the fork): 40.8 t/s PP and 6.9 t/s TG
numactl --cpunodebind=0 --membind=0 build/bin/llama-bench -m ~/models/GLM-4.7-Q3_K_L.gguf -sm layer --device ROCm0 -fa 1 --numa numactl -t 32 -ncmoe 92
: 39.7 t/s PP and 5.3 t/s TG
numactl --cpunodebind=0 --membind=0 build/bin/llama-bench -m ~/models/GLM-4.7-Q3_K_L.gguf -sm layer --device ROCm0 -fa 1 --numa numactl -t 24 -ncmoe 92
: 39.2 t/s PP and 5.3 t/s TG
I give a VM 8 threads on one CPU, so 48 threads is the only count I can test with if I want each node to get a perfectly even split.
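For anyone who wants to check the 1.3x figure, here's a quick sketch that computes the ratios from the numbers in this post. The run labels are just shorthand I made up for the three commands, not anything llama-bench prints.

```python
# Throughput figures copied from the llama-bench results in this post (tokens/s).
# The labels are informal shorthand for this sketch, not llama-bench output.
RUNS = {
    "fork_split_t48": {"pp": 40.8, "tg": 6.9},
    "numactl_t32": {"pp": 39.7, "tg": 5.3},
    "numactl_t24": {"pp": 39.2, "tg": 5.3},
}


def speedup(candidate: dict, baseline: dict) -> dict:
    """Return the candidate/baseline throughput ratio for each metric."""
    return {metric: candidate[metric] / baseline[metric] for metric in candidate}


fork = RUNS["fork_split_t48"]
for name in ("numactl_t32", "numactl_t24"):
    ratio = speedup(fork, RUNS[name])
    print(f"fork vs {name}: PP {ratio['pp']:.2f}x, TG {ratio['tg']:.2f}x")
```

That works out to roughly 1.30x on TG and 1.03x to 1.04x on PP. Keep in mind the runs use different thread counts (48 vs 32 and 24), so it's not a strictly like-for-like comparison.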