I pushed the current Mikubox (2x 3090, 3x P100) to the 8K context limit with command-r-plus at 5bpw:
Device 0 [NVIDIA GeForce RTX 3090] PCIe GEN 1@ 4x RX: 0.000 KiB/s TX: 0.000 KiB/s
GPU 0MHz MEM 405MHz TEMP 40°C FAN 0% POW 28 / 350 W
GPU[ 0%] MEM[||||||||||||||||||23.825Gi/24.000Gi]
Device 1 [Tesla P100-PCIE-16GB] PCIe GEN 3@16x RX: 0.000 KiB/s TX: 0.000 KiB/s
GPU 1189MHz MEM 715MHz TEMP 36°C FAN N/A% POW 32 / 250 W
GPU[ 0%] MEM[||||||||||||||||||15.729Gi/16.000Gi]
Device 2 [Tesla P100-PCIE-16GB] PCIe GEN 3@16x RX: 0.000 KiB/s TX: 0.000 KiB/s
GPU 1189MHz MEM 715MHz TEMP 34°C FAN N/A% POW 33 / 250 W
GPU[ 0%] MEM[||||||||||||||||||15.847Gi/16.000Gi]
Device 3 [Tesla P100-PCIE-16GB] PCIe GEN 3@16x RX: 0.000 KiB/s TX: 0.000 KiB/s
GPU 1189MHz MEM 715MHz TEMP 38°C FAN N/A% POW 34 / 250 W
GPU[ 0%] MEM[||||||||||||||||||13.540Gi/16.000Gi]
Device 4 [NVIDIA GeForce RTX 3090] PCIe GEN 1@16x RX: 0.000 KiB/s TX: 0.000 KiB/s
GPU 0MHz MEM 405MHz TEMP 42°C FAN 57% POW 29 / 370 W
GPU[ 0%] MEM[||||||||||||||||||23.156Gi/24.000Gi]
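A quick back-of-envelope check on why it "just fits". The numbers below are assumptions, not from the post: Command R+ is ~104B parameters, and (per its published config) has 64 layers, 8 KV heads, and head_dim 128; the cards above total 2x24 + 3x16 GiB.

```python
# Rough VRAM footprint for command-r-plus at 5bpw across this box.
GIB = 1024**3

params = 104e9                       # ~104B parameters (Command R+)
bpw = 5.0                            # quantized bits per weight
weights_gib = params * bpw / 8 / GIB

total_vram_gib = 2 * 24 + 3 * 16     # two 3090s + three P100s = 96 GiB

# FP16 KV cache at 8K context, assuming 64 layers, 8 KV heads,
# head_dim 128 (the published Command R+ config):
kv_bytes_per_token = 2 * 64 * 8 * 128 * 2   # K+V, fp16
kv_gib = kv_bytes_per_token * 8192 / GIB

print(f"weights ~{weights_gib:.1f} GiB + KV cache ~{kv_gib:.1f} GiB "
      f"of {total_vram_gib} GiB total")
# → weights ~60.5 GiB + KV cache ~2.0 GiB of 96 GiB total
```

That leaves roughly 30 GiB for activations, scratch buffers, and per-card overhead, which lines up with the nvtop readout showing every card nearly full.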
It just fits, and at full context I'm getting about 2 t/s. Yeah, slow, but tolerable with streaming turned on. Ah well, nothing left to do but swap the P100s for 3090s, since the next plateau is getting flash attention at this model size (flash attention needs newer silicon than Pascal, so the P100s rule it out). It's not really going to get much faster without it.
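For a sense of how much headroom is left: with the layers pipelined across cards, each decoded token streams the full weights once, so memory bandwidth puts a ceiling on t/s even before attention costs. This sketch uses datasheet bandwidths (~936 GB/s for a 3090, ~732 GB/s for a PCIe P100) and assumes the ~65 GB of quantized weights split half-and-half between the 3090s and P100s, roughly in proportion to their VRAM; all of that is assumption, not measurement.

```python
# Bandwidth-bound decode ceiling for a pipelined multi-GPU split.
weights_bytes = 104e9 * 5 / 8        # ~65 GB of 5bpw weights (assumed)

on_3090 = weights_bytes * 48 / 96    # share on the two 3090s (48/96 GiB)
on_p100 = weights_bytes * 48 / 96    # share on the three P100s

# Each token reads every weight once; stages run back-to-back.
seconds_per_token = on_3090 / 936e9 + on_p100 / 732e9
print(f"bandwidth-bound ceiling ~{1 / seconds_per_token:.0f} t/s")
# → bandwidth-bound ceiling ~13 t/s
```

The observed ~2 t/s at full context sits well under that ceiling, consistent with attention over the 8K window (run without flash attention on the Pascal cards) eating most of the time rather than the weight reads themselves.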