I don't have any compiled artifacts right now (I deleted them all to save storage), but here are some results I pulled from session logs for an OCR model:
max_new_tokens=16
mine: 558.4175 ms
transformers: 4968.9111 ms
max_new_tokens=512
mine: 6676.0433 ms
transformers: 16650.7969 ms
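Quick arithmetic on those timings, just dividing the numbers from the logs (nothing here is re-measured, and the dict layout is only for illustration). It comes out to roughly 8.9x at max_new_tokens=16 and roughly 2.5x at max_new_tokens=512:

```python
# Timings copied from the session logs above, in milliseconds.
timings_ms = {
    16: {"mine": 558.4175, "transformers": 4968.9111},
    512: {"mine": 6676.0433, "transformers": 16650.7969},
}


def speedup(row):
    """How many times faster 'mine' is than the transformers baseline."""
    return row["transformers"] / row["mine"]


for max_new_tokens, row in timings_ms.items():
    print(f"max_new_tokens={max_new_tokens}: {speedup(row):.2f}x")
```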
grid_thw: dynamic, from (1, 16, 16) to (1, 128, 128)
workspace min: 6,815,744 bytes (6.5 MiB)
workspace max: 436,207,616 bytes (416 MiB)
So the same artifact works across the whole dynamic shape range. Profiling is bucketed, and the workspace is sized for the current input shape, but it's still a single allocation with offsets into it.
For this particular model torch.compile didn't work, and I haven't checked anything else yet.
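Rough sketch of what "one allocation with offsets" means, not the actual implementation. The buffer names, the per-patch byte split and the 256-byte alignment are all made up for illustration. The only thing taken from the logs is that both sizes work out to 26,624 bytes per patch (t*h*w), so the made-up split is chosen to add up to that:

```python
from dataclasses import dataclass

ALIGN = 256  # assumed alignment, hypothetical

# Hypothetical per-patch cost of each scratch buffer, in bytes.
# Only the sum (26,624) is derived from the logged min/max sizes.
BYTES_PER_PATCH = {
    "patch_embed": 2048,
    "attn_scratch": 16384,
    "mlp_scratch": 8192,
}


def align_up(n, a=ALIGN):
    """Round n up to the next multiple of a."""
    return (n + a - 1) // a * a


@dataclass
class Plan:
    total_bytes: int
    offsets: dict


def plan_workspace(grid_thw):
    """Lay all scratch buffers back to back for the current input shape."""
    t, h, w = grid_thw
    patches = t * h * w
    offsets, cursor = {}, 0
    for name, per_patch in BYTES_PER_PATCH.items():
        offsets[name] = cursor
        cursor = align_up(cursor + per_patch * patches)
    return Plan(total_bytes=cursor, offsets=offsets)


small = plan_workspace((1, 16, 16))
large = plan_workspace((1, 128, 128))

# One allocation; each buffer is just a view at its offset.
workspace = memoryview(bytearray(small.total_bytes))
attn = workspace[small.offsets["attn_scratch"]:small.offsets["mlp_scratch"]]

print(small.total_bytes, large.total_bytes)
```

The same plan function covers every shape in the range, so nothing has to be recompiled when grid_thw changes, only the offsets and the total get recomputed.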
>>109041433
Yeah, I might look at that too. Right now this is regular fp16/bf16 multiplication with fp32 accumulation, but with the weights dequantized from Q4_K (or another format) to fp16/bf16, either as a whole tensor before launch or fused into the kernel.
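To make the two dequant options concrete, here's a toy pure-Python sketch. Big caveat: this is NOT the real Q4_K layout (that one uses 256-weight super-blocks with 6-bit sub-block scales and mins). The format below is a simplified 4-bit block scheme with one fp16 scale and one fp16 offset per 32 weights, and all function names are made up. fp16 and fp32 rounding are emulated with struct:

```python
import struct

BLOCK = 32  # weights per block in this toy format (assumption)


def to_fp16(x):
    """Round a Python float to the nearest fp16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def quantize_block(weights):
    """Toy 4-bit quantization: w ~= scale * q + lo, q in 0..15, two q per byte."""
    lo = to_fp16(min(weights))
    scale = to_fp16((max(weights) - lo) / 15)
    if scale == 0.0:
        scale = 1.0  # constant block, any scale works
    qs = [min(15, max(0, round((w - lo) / scale))) for w in weights]
    packed = bytes(qs[i] | (qs[i + 1] << 4) for i in range(0, BLOCK, 2))
    return scale, lo, packed


def dequantize_block(scale, lo, packed):
    """Unpack the nibbles and expand one block to fp16 values."""
    out = []
    for byte in packed:
        for q in (byte & 0x0F, byte >> 4):
            out.append(to_fp16(scale * q + lo))
    return out


def accumulate(acc, x, w):
    """fp16 x fp16 multiply, accumulator rounded to fp32 after each add."""
    return to_fp32(acc + to_fp16(x) * w)


def dot_predequant(x, blocks):
    """Option 1: dequantize the whole tensor first, then multiply."""
    full = [w for blk in blocks for w in dequantize_block(*blk)]
    acc = 0.0
    for xi, wi in zip(x, full):
        acc = accumulate(acc, xi, wi)
    return acc


def dot_fused(x, blocks):
    """Option 2: dequantize one block at a time inside the multiply loop."""
    acc = 0.0
    for b, blk in enumerate(blocks):
        for i, wi in enumerate(dequantize_block(*blk)):
            acc = accumulate(acc, x[b * BLOCK + i], wi)
    return acc


weights = [((i * 37) % 64 - 32) / 64 for i in range(64)]
x = [((i * 11) % 16) / 16 for i in range(64)]
blocks = [quantize_block(weights[i:i + BLOCK]) for i in range(0, 64, BLOCK)]
print(dot_predequant(x, blocks), dot_fused(x, blocks))
```

Both paths do the same math in the same order, so they give the same result. The difference is only that the fused one never materializes the full fp16 weight tensor.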