So running Gemma on just one GPU with beellama works without any garbage messages in the console, but I only get about 35 t/s:
prompt eval time = 598.81 ms / 355 tokens ( 1.69 ms per token, 592.85 tokens per second)
eval time = 8489.68 ms / 279 tokens ( 30.43 ms per token, 32.86 tokens per second)
total time = 9088.49 ms / 634 tokens
draft acceptance rate = 0.33051 ( 156 accepted / 472 generated)
adaptive dm: fringe=0.00 n_max=3
statistics dflash: #calls(b,g,a) = 1 121 89, #gen drafts = 121, #acc drafts = 89, #gen tokens = 472, #acc tokens = 156, dur(b,g,a) = 0.003, 754.019, 0.010 ms
slot release: id 0 | task 0 | stop processing: n_tokens = 635, truncated = 0
srv update_slots: all slots are idle
prompt eval time = 259.05 ms / 15 tokens ( 17.27 ms per token, 57.90 tokens per second)
eval time = 11304.23 ms / 411 tokens ( 27.50 ms per token, 36.36 tokens per second)
total time = 11563.27 ms / 426 tokens
draft acceptance rate = 0.09738 ( 223 accepted / 2290 generated)
adaptive dm: fringe=0.00 n_max=12
statistics dflash: #calls(b,g,a) = 3 318 215, #gen drafts = 318, #acc drafts = 215, #gen tokens = 2893, #acc tokens = 398, dur(b,g,a) = 0.004, 2055.248, 0.034 ms
slot release: id 0 | task 141 | stop processing: n_tokens = 715, truncated = 0
srv update_slots: all slots are idle
On vanilla llama.cpp I get 45 t/s with 3 GPUs, a quant twice as big, and fp16 cache:
prompt eval time = 657.57 ms / 304 tokens ( 2.16 ms per token, 462.31 tokens per second)
eval time = 8437.75 ms / 357 tokens ( 23.64 ms per token, 42.31 tokens per second)
total time = 9095.32 ms / 661 tokens
slot release: id 15 | task 0 | stop processing: n_tokens = 660, truncated = 0
srv update_slots: all slots are idle
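
For anyone who wants to compare runs like these without eyeballing the numbers, here is a minimal sketch that recomputes tokens per second and draft acceptance rate from the log lines. The function names (`tokens_per_second`, `acceptance_rate`) are made up for illustration, and the regexes assume the exact line format shown in the logs above; other builds may print something different.

```python
import re

# Assumed format (taken from the logs above):
#   "eval time = 8489.68 ms / 279 tokens ( ... )"
TIMING_RE = re.compile(
    r"^\s*(prompt eval time|eval time|total time)\s*=\s*"
    r"([\d.]+)\s*ms\s*/\s*(\d+)\s*tokens"
)

# Assumed format (taken from the logs above):
#   "draft acceptance rate = 0.33051 ( 156 accepted / 472 generated)"
ACCEPT_RE = re.compile(
    r"draft acceptance rate\s*=\s*[\d.]+\s*"
    r"\(\s*(\d+)\s+accepted\s*/\s*(\d+)\s+generated\s*\)"
)


def tokens_per_second(line):
    """Return (label, tokens/s) for a timing line, or None if it does not match."""
    m = TIMING_RE.match(line)
    if m is None:
        return None
    label = m.group(1)
    millis = float(m.group(2))
    tokens = int(m.group(3))
    return label, tokens / (millis / 1000.0)


def acceptance_rate(line):
    """Return accepted/generated for a draft acceptance line, or None."""
    m = ACCEPT_RE.search(line)
    if m is None:
        return None
    accepted = int(m.group(1))
    generated = int(m.group(2))
    return accepted / generated if generated else 0.0


if __name__ == "__main__":
    sample = [
        "eval time = 8489.68 ms / 279 tokens ( 30.43 ms per token, 32.86 tokens per second)",
        "draft acceptance rate = 0.33051 ( 156 accepted / 472 generated)",
        "eval time = 8437.75 ms / 357 tokens ( 23.64 ms per token, 42.31 tokens per second)",
    ]
    for line in sample:
        print(tokens_per_second(line) or acceptance_rate(line))
```

Fed the `eval time` lines above, this should reproduce the per-second figures the server already prints, which makes it easy to line up the speculative run against the vanilla one.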