>>108061572
Memory allocation still needs to be deduplicated, but the performance (without NCCL) on 2x RTX 4090 can already beat pipeline parallelism:
| model | backend | sm | test | t/s |
| ------------ | ----------: | -----: | --------------: | -------: |
| llama 8B F16 | CUDA | layer | pp512 | 10464.75 |
| llama 8B F16 | CUDA | layer | tg128 | 60.32 |
| llama 8B F16 | CUDA | layer | pp512 @ d32768 | 2744.50 |
| llama 8B F16 | CUDA | layer | tg128 @ d32768 | 46.95 |
| llama 8B F16 | CUDA | row | pp512 | 1592.28 |
| llama 8B F16 | CUDA | row | tg128 | 46.51 |
| llama 8B F16 | CUDA | row | pp512 @ d32768 | 1102.05 |
| llama 8B F16 | CUDA | row | tg128 @ d32768 | 37.96 |
| llama 8B F16 | CUDA | tensor | pp512 | 5170.11 |
| llama 8B F16 | CUDA | tensor | tg128 | 75.53 |
| llama 8B F16 | CUDA | tensor | pp512 @ d32768 | 2298.07 |
| llama 8B F16 | CUDA | tensor | tg128 @ d32768 | 63.27 |
I'll probably make the PR either Friday or Saturday.