>>101520034
Fresh test of Nemo on both exllama and llama.cpp, processing 20k context on a 3090. Exllama's prompt processing is ~45% faster: 2762.58 T/s vs 1895.98 T/s.
>Mistral-Nemo-Instruct-12B-8.0bpw-exl2
Metrics: 725 tokens generated in 22.49 seconds (Queue: 0.0 s, Process: 0 cached tokens and 20892 new tokens at 2762.58 T/s, Generate: 48.59 T/s, Context: 20892 tokens)
>Mistral-Nemo-12B-Instruct-2407-Q8_0.gguf
prompt eval time = 11018.02 ms / 20890 tokens ( 0.53 ms per token, 1895.98 tokens per second)
generation eval time = 17040.94 ms / 646 runs ( 26.38 ms per token, 37.91 tokens per second)
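For anyone wanting to sanity-check the percentages, here's a quick sketch computing the relative speedups straight from the throughput numbers in the two logs above (values copied from the logs, nothing else assumed):

```python
# Prompt-processing throughput from the two logs, ~20k-token prompt each
exl2_prompt_tps = 2762.58   # exllama: 20892 new tokens
gguf_prompt_tps = 1895.98   # llama.cpp: 20890 tokens prompt eval

# Generation throughput from the same runs
exl2_gen_tps = 48.59        # exllama: Generate: 48.59 T/s
gguf_gen_tps = 37.91        # llama.cpp: 37.91 tokens per second

prompt_speedup = (exl2_prompt_tps / gguf_prompt_tps - 1) * 100
gen_speedup = (exl2_gen_tps / gguf_gen_tps - 1) * 100
print(f"prompt processing: {prompt_speedup:.1f}% faster")  # ~45.7%
print(f"generation:        {gen_speedup:.1f}% faster")     # ~28.2%
```

So the gap is bigger on prompt processing than on generation, which matters most for long-context use like this 20k test.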