After adding this to the prompt, I think I have the fake-code issue with GLM more or less under control (fingers crossed).
Guidelines for yourself: As soon as you detect a correlation lower than 0.9, stop the process, investigate, and try to fix the underlying issue that caused the divergence. If you can't fix the issue, just tell me; it's no big deal, but don't try to pass off fake data as real. Make sure there are no simulations, simulated data, demos, simplifications, or placeholders: use only real data, or tell me that the task cannot be achieved with 100% real data, real weights, and real algorithms. Run long-running commands in the background, redirecting stdout and stderr to a file (the scripts can run other commands directly; this only applies to your own bash command tool calls).
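To make the correlation rule concrete, here is a minimal sketch of the kind of check the guideline implies, in C since that is what the project uses. The function names, the `stage` label, and the idea of aborting via `exit` are all illustrative assumptions, not code from the repo; only the 0.9 threshold comes from the prompt.

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Pearson correlation between reference and computed outputs. */
static double pearson(const float *a, const float *b, size_t n) {
    double sa = 0, sb = 0, saa = 0, sbb = 0, sab = 0;
    for (size_t i = 0; i < n; i++) {
        sa  += a[i];
        sb  += b[i];
        saa += (double)a[i] * a[i];
        sbb += (double)b[i] * b[i];
        sab += (double)a[i] * b[i];
    }
    double cov   = sab - sa * sb / n;
    double var_a = saa - sa * sa / n;
    double var_b = sbb - sb * sb / n;
    return cov / sqrt(var_a * var_b);
}

/* Stop the run instead of continuing with divergent (possibly fake) data. */
void check_correlation(const float *ref, const float *out, size_t n,
                       const char *stage) {
    double r = pearson(ref, out, n);
    if (r < 0.9) {
        fprintf(stderr, "FAIL: correlation %.4f < 0.9 at %s, aborting\n", r, stage);
        exit(1);
    }
    printf("OK: correlation %.4f at %s\n", r, stage);
}
```

The point of failing hard rather than logging and continuing is exactly the guideline above: a divergence should halt the pipeline until its cause is understood, not be papered over.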
Load the model on the CPU; it doesn't fit on the GPU.
Do not trust any pre-existing data files in the folder; they might have been generated by old code.
Make sure the code is modular and contains no duplication. Use the existing C library files and modify them as needed to fit our requirements (as long as you do NOT introduce simulated or demo code). If you see ANY non-functional placeholders in the code, remove them immediately; they only lead to deception, frustration, and confusion. Obviously, do not introduce any yourself either.
For example, there is MoE FFN code in modules/lib/ffn, as well as matmul and other building blocks. List all the folders in modules/lib/ to see what is available.
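Purely to illustrate the "reuse, don't duplicate" rule, the sketch below shows a dense FFN forward pass built on top of a shared matmul. None of these names or signatures come from the actual repo; the local `matmul_f32` exists only so the snippet compiles on its own, and in the real code the implementation from modules/lib would be called instead.

```c
#include <math.h>
#include <stddef.h>

/* Stand-in matmul so this sketch is self-contained; in the real code, call
 * the existing implementation from modules/lib rather than duplicating it.
 * out[m x n] = a[m x k] * b[k x n] */
static void matmul_f32(const float *a, const float *b, float *out,
                       size_t m, size_t k, size_t n) {
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++)
                acc += a[i * k + p] * b[p * n + j];
            out[i * n + j] = acc;
        }
}

/* Dense FFN forward pass for a single token, built from the shared matmul:
 * out = W2 * silu(W1 * x).  An MoE routing layer would invoke something like
 * this once per selected expert. */
void ffn_forward(const float *x, const float *w1, const float *w2,
                 float *hidden, float *out, size_t d_model, size_t d_ff) {
    matmul_f32(w1, x, hidden, d_ff, d_model, 1);           /* up projection   */
    for (size_t i = 0; i < d_ff; i++)                      /* SiLU activation */
        hidden[i] = hidden[i] / (1.0f + expf(-hidden[i]));
    matmul_f32(w2, hidden, out, d_model, d_ff, 1);         /* down projection */
}
```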
The end goal here is NOT to test the validation framework; the validation framework is just a means to an end (the end is real end-to-end test generation). Do NOT claim a failure as a success just because the validation framework caught it. Be honest and avoid being overly optimistic.