>>106529741
The code in that repo reads like AI slop and a lot of the "performance optimizations" don't make any sense to me.
But it does use the v_dot2_f32_f16 instruction for FP16 multiplication with FP32 accumulation, which I was previously unfamiliar with since I hadn't read the Vega ISA documentation in detail.
After applying the same instruction to mainline llama.cpp in https://github.com/ggml-org/llama.cpp/pull/15884 I've found it to be faster.
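For anyone curious how that instruction is reached from HIP, here's a minimal sketch (not the code from that repo or from my PR, all names here are made up) using the clang builtin __builtin_amdgcn_fdot2 on gfx906. The instruction multiplies two packed FP16 pairs and adds the products into an FP32 accumulator, so long dot products keep FP32 precision instead of accumulating in FP16.

```cpp
// Compile with e.g.: hipcc --offload-arch=gfx906 fdot2_demo.cpp
#include <cstdio>
#include <hip/hip_runtime.h>

// Two packed FP16 values, matching the operand layout of v_dot2_f32_f16.
typedef _Float16 half2_t __attribute__((ext_vector_type(2)));

// FP16 pairwise multiply with FP32 accumulation: a.x*b.x + a.y*b.y + acc.
__device__ __forceinline__ float fdot2_acc(half2_t a, half2_t b, float acc) {
#if defined(__gfx906__) || defined(__gfx908__) || defined(__gfx90a__)
    // Lowers to a single v_dot2_f32_f16 instruction on Vega 20 / CDNA.
    return __builtin_amdgcn_fdot2(a, b, acc, /*clamp=*/false);
#else
    // Fallback for targets without the instruction: widen to FP32 and accumulate.
    return acc + (float)a.x * (float)b.x + (float)a.y * (float)b.y;
#endif
}

// Toy kernel: one thread accumulates n2 half2 pairs into a single float.
__global__ void fdot2_demo(float * out, const int n2) {
    if (threadIdx.x != 0 || blockIdx.x != 0) {
        return;
    }
    float acc = 0.0f;
    for (int i = 0; i < n2; ++i) {
        half2_t a; a.x = (_Float16)1.0f; a.y = (_Float16)2.0f;
        half2_t b; b.x = (_Float16)0.5f; b.y = (_Float16)0.25f;
        acc = fdot2_acc(a, b, acc); // each pair contributes 1.0*0.5 + 2.0*0.25 = 1.0
    }
    *out = acc;
}

int main() {
    float * d_out = nullptr;
    hipMalloc((void **) &d_out, sizeof(float));
    fdot2_demo<<<1, 1>>>(d_out, 1024);
    float out = 0.0f;
    hipMemcpy(&out, d_out, sizeof(float), hipMemcpyDeviceToHost);
    printf("dot = %.1f\n", out); // expect 1024.0
    hipFree(d_out);
    return 0;
}
```

The per-architecture guard with a plain FP32 fallback is, as far as I remember, the same general pattern llama.cpp already uses for the integer dot builtins on AMD.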
It's getting to the point where I think Mi50s will soon be a universally better buy than P40s:
| GPU | model | backend | test | t/s |
| -------- | ------------- | ------- | --------------: | ------- |
| RTX 3090 | llama 8B Q4_0 | CUDA | pp512 | 5327.80 |
| RTX 3090 | llama 8B Q4_0 | CUDA | tg128 | 141.04 |
| RTX 3090 | llama 8B Q4_0 | CUDA | pp512 @ d16384 | 2572.06 |
| RTX 3090 | llama 8B Q4_0 | CUDA | tg128 @ d16384 | 96.27 |
| P40 | llama 8B Q4_0 | CUDA | pp512 | 1034.45 |
| P40 | llama 8B Q4_0 | CUDA | tg128 | 53.63 |
| P40 | llama 8B Q4_0 | CUDA | pp512 @ d16384 | 311.11 |
| P40 | llama 8B Q4_0 | CUDA | tg128 @ d16384 | 30.66 |
| Mi50 | llama 8B Q4_0 | ROCm | pp512 | 1053.87 |
| Mi50 | llama 8B Q4_0 | ROCm | tg128 | 73.04 |
| Mi50 | llama 8B Q4_0 | ROCm | pp512 @ d16384 | 212.49 |
| Mi50 | llama 8B Q4_0 | ROCm | tg128 @ d16384 | 20.25 |