found a bad optimization pass that was causing huge workspace size
also implemented dynamic workspace allocation
>compile once for a large dynamic shape
>allocate only the memory the actual shape needs instead of the memory for the maximum shape
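rough sketch of the idea in Python, not the actual implementation. everything here is made up for illustration: `CompiledPlan`, `workspace_bytes`, `align`, the dim names `h`/`w`, and the per-buffer size formulas. it also assumes all intermediate buffers are live at once (no buffer reuse), which a real memory planner would likely improve on.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Shape = Dict[str, int]  # dim name -> concrete value at run time


def align(nbytes: int, alignment: int = 256) -> int:
    """Round a byte count up to the next multiple of `alignment`."""
    return (nbytes + alignment - 1) // alignment * alignment


@dataclass
class CompiledPlan:
    """Hypothetical output of compiling once with symbolic dims."""
    max_shape: Shape
    # One size expression (in bytes) per intermediate buffer, kept symbolic
    # so it can be evaluated for whatever shape shows up at run time.
    buffer_sizes: List[Callable[[Shape], int]]

    def workspace_bytes(self, shape: Shape) -> int:
        """Bytes needed for this shape, not for the maximum shape."""
        for name, value in shape.items():
            if not 1 <= value <= self.max_shape[name]:
                raise ValueError(f"{name}={value} is outside the compiled range")
        return sum(size(shape) for size in self.buffer_sizes)


# Made-up decoder with two float32 intermediates (illustrative formulas only).
plan = CompiledPlan(
    max_shape={"h": 256, "w": 256},
    buffer_sizes=[
        lambda s: align(4 * 64 * s["h"] * s["w"]),
        lambda s: align(4 * 32 * (2 * s["h"]) * (2 * s["w"])),
    ],
)

static_bytes = plan.workspace_bytes(plan.max_shape)   # old behavior: always this much
dynamic_bytes = plan.workspace_bytes({"h": 8, "w": 8})  # new behavior: per-call size
workspace = bytearray(dynamic_bytes)  # stand-in for the real device allocation

print(f"static:  {static_bytes} bytes")
print(f"dynamic: {dynamic_bytes} bytes")
```

the point is that the compiled artifact stays the same across shapes, only the size passed to the allocator changes per call.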
autoencoder decode, memory used / total:
8, 8 latent: 503MiB / 46068MiB
128, 128 latent: 1815MiB / 46068MiB
256, 256 latent: 5727MiB / 46068MiB
very cool