>>108960223
I'll ask the AI, hold on.
ok, you were slightly wrong for NEON, if this is correct (it probably is):
// 1. Load and broadcast the 128-bit round key across the SVE elements
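ld1rqb z4.b, p0/z, [x1] // x1 points to the current round key in memory; ld1rqb replicates all 16 key bytes into every 128-bit segment (ld1rb would broadcast only the first byte)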
// This replaces the pointer arithmetic and multiple loads
// 2. Issue AESE (AddRoundKey + SubBytes + ShiftRows) on 4 independent vectors back to back, so each instruction's latency is hidden behind the others
aese z0.b, z0.b, z4.b
aese z1.b, z1.b, z4.b
aese z2.b, z2.b, z4.b
aese z3.b, z3.b, z4.b
// 3. Complete the MixColumns transformation for all 4 blocks
aesmc z0.b, z0.b
aesmc z1.b, z1.b
aesmc z2.b, z2.b
aesmc z3.b, z3.b
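Side note on step 1, since it is easy to get wrong: the round key has to end up as the full 16 bytes repeated in every 128-bit segment of the vector, not one byte repeated everywhere. Here is a rough plain-C model of the two load behaviours. It is only a sketch: the 256-bit vector length (VL_BYTES) and all function names are made up for illustration, and it models the data layout only, not the real instructions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed vector length for this sketch: 256 bits = two 128-bit segments. */
#define VL_BYTES 32

/* Models a byte-replicating load: every lane receives src[0]. */
void replicate_byte(uint8_t *z, const uint8_t *src) {
    memset(z, src[0], VL_BYTES);
}

/* Models a quadword-replicating load: every 16-byte segment receives src[0..15]. */
void replicate_quadword(uint8_t *z, const uint8_t *src) {
    for (size_t off = 0; off < VL_BYTES; off += 16)
        memcpy(z + off, src, 16);
}

/* Returns 1 if the quadword model puts the whole key in each segment
   and the byte model does not. */
int key_broadcast_ok(void) {
    uint8_t key[16], zb[VL_BYTES], zq[VL_BYTES];
    for (int i = 0; i < 16; i++)
        key[i] = (uint8_t)(0xA0 + i);   /* dummy round key, not real key material */

    replicate_byte(zb, key);
    replicate_quadword(zq, key);

    int quad_ok = memcmp(zq, key, 16) == 0 && memcmp(zq + 16, key, 16) == 0;
    int byte_ok = memcmp(zb, key, 16) == 0;
    return quad_ok && !byte_ok;
}
```

Calling key_broadcast_ok() returns 1 here: only the quadword-replicating layout gives each AES block in the vector the same complete round key.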
9:1 is an improvement over 13:1.
Therefore, I won the Internet argument.