I'm trying to run Chroma-2K-QC-fp8mixed-blockwise. It works with silveroxide's node, but at every step it stops momentarily and gives me:
FP8 dynamic quant failed, falling back to dequant: at 48:16:
    b_s_k_blocks = tl.cdiv(K, input_block_size)
    b_s_base = b_s_ptr + pid_n * b_s_k_blocks
    # Accumulator
    accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)
    # Main loop
    for k_idx in range(k_blocks):
        k_start = k_idx * BLOCK_SIZE_K
        mask_k = offs_k < K - k_start
        a_fp8 = tl.load(a_ptrs, mask=mask_k[None, :], other=0)
                ^
cannot cast int32[constexpr[128], constexpr[128]] to <['128', '128'], fp8e4nv>