Vibed up some KL-div measurement tools for chat completion logs. They apply the chat template and collect logits only for the assistant messages (since that's the only part the model actually has to generate). Does anyone know if these numbers look plausible?
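For anyone who wants to sanity-check the approach, here is a minimal sketch of the per-token measurement as I understand it, not the actual tool. It assumes you already have per-position logits from the reference model and the quant as plain lists, plus a boolean mask that is already aligned so entry i says whether the token predicted at position i belongs to an assistant message. The function names (`token_kld`, `assistant_klds`) are made up for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_kld(ref_logits, quant_logits):
    """KL(P_ref || P_quant) in nats for a single token position."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def assistant_klds(ref_logits, quant_logits, assistant_mask):
    """Per-token KLD, keeping only positions flagged as assistant tokens.

    assistant_mask is assumed to be pre-aligned with the logits, i.e.
    mask[i] is True when position i predicts an assistant token.
    """
    return [
        token_kld(r, q)
        for r, q, keep in zip(ref_logits, quant_logits, assistant_mask)
        if keep
    ]

if __name__ == "__main__":
    ref = [[0.0, 0.0], [2.0, 0.0], [0.0, 0.0]]
    quant = [[0.0, 0.0], [2.0, 0.0], [math.log(3.0), 0.0]]
    mask = [False, True, True]  # first position is prompt, so it is skipped
    print(assistant_klds(ref, quant, mask))
```

The direction matters: this is KL from the reference distribution to the quant's, so tokens the reference model considers likely dominate the sum.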
>Gemma 4 31B UD-Q8_K_XL
====== KL divergence statistics ======
Mean KLD: 0.007588 +/- 0.000551
Maximum KLD: 27.633171
99.9% KLD: 1.204302
99.0% KLD: 0.079414
95.0% KLD: 0.010268
90.0% KLD: 0.004402
Median KLD: 0.000122
10.0% KLD: 0.000000
5.0% KLD: 0.000000
1.0% KLD: 0.000000
0.1% KLD: 0.000000
Minimum KLD: 0.000000
====== Same-top-token statistics ======
Same top p: 98.540 +/- 0.042 %
Tokens: 80958 (194 sample(s))
>Gemma 4 31B UD-Q5_K_XL
====== KL divergence statistics ======
Mean KLD: 0.012907 +/- 0.000455
Maximum KLD: 11.402487
99.9% KLD: 1.331023
99.0% KLD: 0.144649
95.0% KLD: 0.038739
90.0% KLD: 0.020304
Median KLD: 0.000660
10.0% KLD: 0.000000
5.0% KLD: 0.000000
1.0% KLD: 0.000000
0.1% KLD: 0.000000
Minimum KLD: 0.000000
====== Same-top-token statistics ======
Same top p: 97.142 +/- 0.059 %
Tokens: 80958 (194 sample(s))
The logs were mostly generated using the Q5 quant, though I don't think that should matter.
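In case it helps with checking the summary lines, here is a rough sketch of how they could be computed from the per-token values. Assumptions: the "+/-" on Mean KLD is the standard error of the mean, and the "+/-" on Same top p is a binomial standard error. The second at least is consistent with the numbers above, since sqrt(p(1-p)/n) with p = 0.98540 and n = 80958 comes out to about 0.042 %. The helper names are hypothetical and the percentile uses plain linear interpolation, which may differ slightly from what the tool does.

```python
import math

def mean_with_stderr(values):
    """Sample mean and standard error of the mean (sample std / sqrt(n))."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(var / n)

def percentile(values, pct):
    """Linearly interpolated percentile, pct in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100.0
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac

def same_top_rate(matches, total):
    """Fraction of tokens with the same argmax, with binomial standard error."""
    p = matches / total
    return p, math.sqrt(p * (1 - p) / total)

if __name__ == "__main__":
    klds = [0.0, 0.0001, 0.0005, 0.002, 0.05]  # toy values, not real data
    mean, err = mean_with_stderr(klds)
    print(f"Mean KLD: {mean:.6f} +/- {err:.6f}")
    print(f"Median KLD: {percentile(klds, 50):.6f}")
    p, p_err = same_top_rate(79776, 80958)
    print(f"Same top p: {100 * p:.3f} +/- {100 * p_err:.3f} %")
```

With a max in the tens and a median near zero, the mean is driven by a handful of tokens, so the percentiles are probably the more useful lines to compare between quants.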