>>101836927
I'm the one who made that plot.
It's a simple matplotlib plot: I just ran the llama.cpp llama-cli binary with varying numbers of GPU-offloaded layers and wrote down the reported performance (the specific numbers are a bit outdated).
Full source code:
#!/usr/bin/env python3
import numpy as np
import matplotlib.pyplot as plt
# Highest --n-gpu-layers value tested for each model.
MAX_LAYERS_7B = 33
MAX_LAYERS_13B = 41
# Reported tokens/s for each --n-gpu-layers value from 0 up to the maximum.
TS_7B = np.array([
9.99, 10.17, 10.51, 10.73, 11.13, 11.50, 11.90, 12.37, 12.81, 13.16,
13.69, 14.20, 14.91, 15.53, 16.20, 17.01, 17.72, 18.65, 19.65, 20.52,
22.02, 23.43, 25.15, 26.83, 29.50, 32.10, 35.36, 39.70, 44.52, 50.88,
59.63, 72.04, 90.87, 132.06
])
# Fraction of layers offloaded to the GPU, from 0.0 to 1.0.
X_7B = np.arange(MAX_LAYERS_7B + 1, dtype=float) / MAX_LAYERS_7B
TS_13B = np.array([
5.01, 5.17, 5.24, 5.44, 5.58, 5.69, 5.78, 6.03, 6.17, 6.33,
6.45, 6.65, 6.88, 7.07, 7.29, 7.41, 7.58, 8.06, 8.82, 9.19,
9.59, 10.02, 10.48, 10.92, 11.35, 12.01, 12.55, 13.37, 14.02, 15.21,
16.12, 17.61, 19.12, 20.93, 23.02, 25.54, 28.82, 32.97, 38.86, 47.09,
59.55, 79.81
])
X_13B = np.arange(MAX_LAYERS_13B + 1, dtype=float) / MAX_LAYERS_13B
plt.plot(X_7B, TS_7B, label="7b q4_0")
plt.plot(X_13B, TS_13B, label="13b q4_0")
plt.legend(loc="upper left")
plt.xlabel("Proportion of GPU-accelerated layers")
plt.ylabel("Generated tokens / s")
plt.xlim(0, 1)
plt.ylim(0, 140)
plt.title("RTX 3090, Ryzen 3700X, 3200 MHz RAM")
plt.savefig("ts_rtx3090.png", dpi=240)
plt.show()
Nowadays a better way to do it would be to use the llama-bench binary.
You can have it write the results as CSV, JSON, or SQL, which you can then read in with Python instead of transcribing numbers by hand.
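A minimal sketch of that approach, assuming a llama-bench invocation like ./llama-bench -m model.gguf -ngl 0,16,32 -o csv > bench.csv (the exact flags and column names can differ between llama.cpp versions, and the sample string below is just an illustrative stand-in for the real file):

```python
import csv
import io

# Stand-in for the contents of bench.csv; the column names here
# (n_gpu_layers, avg_ts) are assumptions - check your version's header.
sample_csv = """n_gpu_layers,avg_ts
0,9.99
16,16.20
32,90.87
"""

# In practice: rows = list(csv.DictReader(open("bench.csv")))
rows = list(csv.DictReader(io.StringIO(sample_csv)))
ngl = [int(r["n_gpu_layers"]) for r in rows]
ts = [float(r["avg_ts"]) for r in rows]
```

From there ngl and ts can be fed straight into plt.plot() like the hand-typed arrays above.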