>>106688044
Here you go:
| model | backend | fa | test | t/s |
| ------------- | ---------- | -: | ----: | -------------: |
| llama 7B Q4_0 | ROCm | 0 | pp512 | 1052.10 ± 1.18 |
| llama 7B Q4_0 | ROCm | 0 | tg128 | 89.54 ± 0.08 |
| llama 7B Q4_0 | ROCm | 1 | pp512 | 1130.04 ± 0.17 |
| llama 7B Q4_0 | ROCm | 1 | tg128 | 90.53 ± 0.02 |
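For reference, numbers in this format come from llama.cpp's llama-bench tool; a sketch of an invocation that should produce the fa=0/1 rows above (model path is a placeholder, and I'm assuming `-fa` accepts a comma-separated list like the other parameters):

```shell
# Benchmark prompt processing (pp512) and token generation (tg128)
# with FlashAttention off and on; path to the GGUF model is hypothetical.
./llama-bench -m models/llama-7b-q4_0.gguf -fa 0,1 -p 512 -n 128
```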
I don't think this is a useful point of comparison though.
For what it's worth, the performance on an empty context is definitely still very suboptimal; I only prioritized FlashAttention because that was the bigger problem.