Speculative decoding results using gemma4-4E4B-Q5_K as the draft model for gemma4-31B-Q8 on an RTX 6000 PRO (basically default settings):
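For anyone wanting to reproduce: a sketch of the llama-server invocation. The `-md`/`--model-draft`, `-ngld`, and `--draft-max`/`--draft-min` flags exist in recent llama.cpp builds; the file paths and values here are placeholders, and the post's "basically default settings" means the draft-count flags can probably be left off entirely.

```shell
# Hypothetical launch line; adjust paths to your own GGUF files.
llama-server \
  -m gemma4-31B-Q8.gguf \          # big target model
  -md gemma4-4E4B-Q5_K.gguf \      # small draft model
  -ngl 99 -ngld 99 \               # offload both models to the GPU
  --draft-max 16 --draft-min 1     # optional: tokens drafted per round
```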
slot print_timing: id 1 | task 388 |
prompt eval time = 673.40 ms / 477 tokens ( 1.41 ms per token, 708.34 tokens per second)
eval time = 12484.82 ms / 498 tokens ( 25.07 ms per token, 39.89 tokens per second)
total time = 13158.23 ms / 975 tokens
draft acceptance rate = 0.47646 ( 253 accepted / 531 generated)
statistics draft: #calls(b,g,a) = 2 621 355, #gen drafts = 621, #acc drafts = 355, #gen tokens = 1335, #acc tokens = 647, dur(b,g,a) = 0.010, 12971.785, 0.648 ms
slot release: id 1 | task 388 | stop processing: n_tokens = 16071, truncated = 0
Looks like a pretty significant speedup; it's really obvious during decoding when a draft is accepted, because it dumps like four words at once.
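That burst behavior is exactly what speculative decoding should produce: the cheap draft model proposes a short run of tokens, the big model verifies them in one pass, and the whole agreeing prefix is emitted at once. A toy Python sketch of the greedy accept/reject loop (not llama.cpp's actual code; the stand-in "models" just read from a fixed string, and character tokens stand in for real tokens):

```python
def speculative_step(draft_next, target_next, context, n_draft=4):
    """One round: draft n_draft tokens, keep the prefix the target agrees with."""
    # Draft phase: the cheap model proposes tokens autoregressively.
    proposed = []
    ctx = list(context)
    for _ in range(n_draft):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # Verify phase: the target model checks each position (one batched
    # forward pass in a real implementation). Accept the longest prefix
    # where the target agrees with the draft.
    accepted = []
    ctx = list(context)
    for tok in proposed:
        target_tok = target_next(ctx)
        if target_tok != tok:
            # First disagreement: keep the target's own token and stop.
            accepted.append(target_tok)
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted

# Tiny demo: the "target" deterministically continues a fixed string,
# the "draft" agrees with it except at one position.
target = list("the cat sat on the mat")

def target_next(ctx):
    return target[len(ctx)] if len(ctx) < len(target) else "."

def draft_next(ctx):
    return target_next(ctx) if len(ctx) != 10 else "x"

print(speculative_step(draft_next, target_next, list("the cat"), n_draft=4))
# Three drafted tokens match, the fourth is replaced by the target's pick,
# so four tokens land in a single round.
```

When the draft model is well matched to the target, most rounds accept several tokens for roughly one big-model pass, which is where the 39.89 t/s decode speed comes from.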
You lose multimodal, which is a bit of a bummer, but the draft model uses a different encoder, so it would never have worked anyway.
Appreciate the Anon who suggested trying this last thread.