elementwise add
input [64] + parameter weight [64]
float16: max_blob=256, constant_offset=128, total 384 bytes
float8_e4m3: max_blob=128, constant_offset=64, total 192 bytes
weight float8_e4m3, infer float16: max_blob=256, constant_offset=64, total 320 bytes
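the totals fall straight out of the element count and dtype widths; quick sanity check of the arithmetic (assuming max_blob covers the input and output blobs, and all sizes are bytes):

```python
# 64 elements per tensor, 2 bytes for float16, 1 byte for float8_e4m3
def totals(blob_bytes, const_bytes, n=64):
    max_blob = 2 * n * blob_bytes       # input blob + output blob
    constant_offset = n * const_bytes   # the stored weight
    return max_blob, constant_offset, max_blob + constant_offset

print(totals(2, 2))  # float16:                (256, 128, 384)
print(totals(1, 1))  # float8_e4m3:            (128, 64, 192)
print(totals(2, 1))  # fp8 weight, fp16 infer: (256, 64, 320)
```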
float8 foundations
thought i'd start with elementwise ops, but turns out pytorch doesn't actually support float8 for elementwise yet. makes sense considering the tolerance is kinda bad: <0.125 for float8_e4m3 vs <0.001 with fp16. i've only implemented add for float8_e4m3 so far, but the other elementwise ops are easy enough to do. i don't think pytorch supports much actually running in float8 yet tbh
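roughly how to check both claims in pytorch (error bounds here assume unit-magnitude values; the exact exception depends on the pytorch version):

```python
import torch

x = torch.randn(64)

# round-trip quantization error: e4m3 has 3 mantissa bits, so values
# around +/-2..4 round in steps of 0.25 -> error up to ~0.125
err8 = (x.to(torch.float8_e4m3fn).to(torch.float32) - x).abs().max().item()
err16 = (x.to(torch.float16).to(torch.float32) - x).abs().max().item()
print(f"fp8 max abs error:  {err8:.4f}")   # bounded by ~0.125 here
print(f"fp16 max abs error: {err16:.6f}")  # under ~0.001

# direct elementwise math on fp8 tensors isn't implemented in eager mode
try:
    x.to(torch.float8_e4m3fn) + x.to(torch.float8_e4m3fn)
except RuntimeError as e:
    print("fp8 add unsupported:", e)
```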
so i'm also testing the same scheme auto/comfy use, where the weights are stored in float8 and cast up for inference
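a minimal sketch of that storage scheme (module and names are made up, not auto/comfy's actual code):

```python
import torch

class Fp8Add(torch.nn.Module):
    """Elementwise add whose weight lives in float8_e4m3 at rest."""

    def __init__(self, n: int):
        super().__init__()
        # stored at 1 byte/element instead of 2 for fp16
        self.register_buffer("weight", torch.randn(n).to(torch.float8_e4m3fn))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # cast up to the activation dtype at use time; the actual
        # compute never runs in float8
        return x + self.weight.to(x.dtype)

m = Fp8Add(64)
y = m(torch.randn(64, dtype=torch.float16))
print(y.dtype)  # torch.float16
```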
found a bug in the tensor usage records: casting was creating duplicate records, which made the workspace larger than needed. fixed it. it had been affecting the float16 workspace calculation a bit too
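roughly the shape of the bug, with made-up names for the records and a deliberately simplified workspace sum:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UsageRecord:
    tensor_id: int
    size: int  # bytes

def workspace(records):
    # deliberately simplified: peak = sum of live record sizes
    return sum(r.size for r in records)

# the cast op logged its source tensor a second time, so the same
# 128-byte tensor was counted twice when sizing the workspace
records = [UsageRecord(0, 128), UsageRecord(0, 128), UsageRecord(1, 64)]
print(workspace(records))  # 320 -- too big

# fix: dedupe by tensor_id before summing
deduped = {r.tensor_id: r for r in records}.values()
print(workspace(deduped))  # 192
```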
i think there's more i can do for the workspace, idk, will see
tl;dr nothing important