God, I don't wanna benchmark this:
memmove(ptr + 0x20, ptr + 0x00, 0x1F);
+-------------------------------------+-------------------------------------+
| GCC/clang                           | My version                          |
+-------------------------------------+-------------------------------------+
| vmovups xmm0,XMMWORD PTR [rcx]      | mov al,BYTE PTR [rcx+0x3F]          |
| vmovups xmm1,XMMWORD PTR [rcx+0xf]  | vmovdqu ymm0,YMMWORD PTR [rcx]      |
| vmovups XMMWORD PTR [rcx+0x2f],xmm1 | vmovdqu YMMWORD PTR [rcx+0x20],ymm0 |
| vmovups XMMWORD PTR [rcx+0x20],xmm0 | mov BYTE PTR [rcx+0x3F],al          |
+-------------------------------------+-------------------------------------+
Logic would suggest my version is faster, but who the fuck knows whether GCC/clang, for once, know what they're doing.
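For anyone who wants to at least check that the two sequences do the same thing before timing them, here's a rough C sketch. It uses plain memcpy instead of intrinsics, so it only shows the copy pattern; whether a compiler turns it into exactly the instructions in the table is not guaranteed. The function names (move31_overlap, move31_wide, variants_agree) are made up for illustration, and it assumes ptr + 0x3F is inside the same buffer, which my version needs anyway since it briefly overwrites that byte.

```c
#include <assert.h>
#include <string.h>

/* GCC/clang pattern: two overlapping 16-byte copies, exactly 31 bytes. */
void move31_overlap(unsigned char *p)
{
    unsigned char lo[16], hi[16];
    memcpy(lo, p + 0x00, 16);   /* reads  0x00..0x0F */
    memcpy(hi, p + 0x0F, 16);   /* reads  0x0F..0x1E */
    memcpy(p + 0x2F, hi, 16);   /* writes 0x2F..0x3E */
    memcpy(p + 0x20, lo, 16);   /* writes 0x20..0x2F (0x2F gets the same value twice) */
}

/* My pattern: one 32-byte copy, with byte 0x3F saved and put back. */
void move31_wide(unsigned char *p)
{
    unsigned char tmp[32];
    unsigned char saved = p[0x3F];  /* byte the wide store will clobber */
    memcpy(tmp, p + 0x00, 32);      /* reads  0x00..0x1F */
    memcpy(p + 0x20, tmp, 32);      /* writes 0x20..0x3F */
    p[0x3F] = saved;                /* restore the clobbered byte */
}

/* Hypothetical self-check: both variants must match plain memmove. */
int variants_agree(void)
{
    unsigned char ref[64], a[64], b[64];
    for (int i = 0; i < 64; i++)
        ref[i] = a[i] = b[i] = (unsigned char)(i * 7 + 1);

    memmove(ref + 0x20, ref + 0x00, 0x1F);
    move31_overlap(a);
    move31_wide(b);

    int ok_a = memcmp(ref, a, sizeof ref) == 0;
    int ok_b = memcmp(ref, b, sizeof ref) == 0;
    assert(ok_a && ok_b);
    return ok_a && ok_b;
}
```

This only settles that the results match. It says nothing about which one is faster, so the benchmark still has to happen.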