Is anybody using tensor parallelism (`-sm tensor`)? I've got it working for gemma 31b on a 3090+3060 setup: went from 18 t/s with a draft model (and without vision) to 24 t/s without a draft model (and with vision) at 80k context for Q4_K_XL, over a lousy x4 PCIe link. The latest commit breaks vision; ff5ef8278 is the most recent one I've tried that works.
It also apparently doesn't work with CUDA 13, and there's some kind of memory leak, but with two cherry-picked commits on top it works very well.
$ git log -4 --oneline
228d96bb7 (HEAD -> gemma-stable) CUDA: use LRU based eviction for cuda graphs (#21611)
ad3a9a96d CUDA: manage NCCL communicators in context (#21891)
ff5ef8278 (tag: b8763) CUDA: skip compilation of superfluous FA kernels (#21768)
073bb2c20 (tag: b8762) mtmd : add MERaLiON-2 multimodal audio support (#21756)
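For anyone wanting to reproduce a branch like the one above: the log suggests branching from the last known-good commit and cherry-picking the two fixes on top. Here's a generic sketch of that workflow in a throwaway repo (the tags `good`, `fix1`, `fix2` are stand-ins; against llama.cpp itself you'd use ff5ef8278, ad3a9a96d, and 228d96bb7 instead):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
git config user.email test@example.com
git config user.name test

# three commits: one known-good base, two later fixes
echo base > f  && git add f  && git commit -qm "base (stands in for ff5ef8278)"
git tag good
echo fix1 > f1 && git add f1 && git commit -qm "NCCL communicators fix (stands in for ad3a9a96d)"
git tag fix1
echo fix2 > f2 && git add f2 && git commit -qm "CUDA graph LRU fix (stands in for 228d96bb7)"
git tag fix2

# branch from the working commit, then pull over just the two fixes
git checkout -q -b gemma-stable good
git cherry-pick fix1 fix2
git log --oneline
```

The resulting `gemma-stable` branch has the base plus the two fixes and nothing else, which is what the `git log -4` output above shows for the real repo.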