I rented two cloud machines to try Qwen 3.6 35B Q6_K and see whether it would even be worth running locally.
I found an RTX Pro 6000 is faster at token generation with llama-server than an H200, though the H200 wins on prompt processing.
The command was:

./llama-b8840/llama-server --port 8082 --host 0.0.0.0 -c 262000 \
  -m ../Qwen_Qwen3.6-35B-A3B-Q6_K.gguf -ngl 99 --jinja \
  --temp 1.0 --top-k 20 --top-p 0.95 --min-p 0.0 \
  --presence-penalty 1.5 --repeat-penalty 1.0
Both were tested with the same 36789-token prompt through the web interface.
H200:
Generated 3362 tokens
PP 4478 tk/s
TG 92 tk/s
Pro 6000:
Generated 3775 tokens
PP 1446 tk/s
TG 161 tk/s
I don't really understand why PP is so much faster on the H200 while TG is faster on the Pro 6000.
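For a rough sense of what this trade-off means end to end, the reported throughput numbers can be combined into a total latency estimate. This is only a sketch: it ignores that the two runs generated different token counts and assumes prompt processing plus generation dominate the wall-clock time.

```python
# Rough end-to-end latency estimate from the reported throughput numbers.
# Assumes total time is dominated by prompt processing (PP) + token generation (TG).
PROMPT_TOKENS = 36789  # same prompt was sent to both machines

def total_seconds(pp_tks: float, tg_tks: float, gen_tokens: int) -> float:
    """PP time plus TG time, in seconds."""
    return PROMPT_TOKENS / pp_tks + gen_tokens / tg_tks

h200 = total_seconds(4478, 92, 3362)      # ~8.2 s PP + ~36.5 s TG
pro6000 = total_seconds(1446, 161, 3775)  # ~25.4 s PP + ~23.4 s TG

print(f"H200:     {h200:.1f} s")     # ~44.8 s
print(f"Pro 6000: {pro6000:.1f} s")  # ~48.9 s
```

So at this prompt length the H200's prompt-processing advantage roughly cancels out its slower generation; with shorter prompts the Pro 6000 should pull ahead.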
As for actual model quality, it seems to do okay for agentic use. It gets stuck in thinking loops, but I'm tuning the sampling parameters to try to stop that. It's probably still the best in its weight class, though.
I'll test with consumer GPUs next.