Is pic related the expected output when running the IQ4_NL quant of gemma-4-26b from unsloth!? Running the pruned 21b version at IQ4_XS yields good output. I have tested both without any parameters set and with the recommended values; 21b runs just fine either way.
llama-server \
--host "${LLAMA_HOST}" \
--port "${PORT}" \
--model "${MODEL}" \
--chat-template-file "${JINJA}" \
--n-gpu-layers 99 \
--n-cpu-moe 3 \
--ctx-size 32768 \
--batch-size 1024 \
--ubatch-size 1024 \
--flash-attn on \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--fit off
And I have tried q8 on both the k/v caches. I need to offload 20 MoE layers for it to fit, but I get the same garbled mess. I'm running the updated jinja template as well. Oh, and while I'm here asking: I have a 5070 Ti and my old 3070 still lying around. Would it be detrimental to performance to split models between these two cards? Or will it be fine as long as I compile llama.cpp with both architectures in mind?
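In case it matters, this is roughly the split setup I was considering (flag names taken from llama.cpp's `--help`; the 2,1 ratio is just my guess based on the 16 GB vs 8 GB VRAM of the two cards):

```shell
# Hypothetical two-GPU invocation, extending the command above.
# --split-mode layer distributes whole layers across devices;
# --tensor-split 2,1 weights the allocation roughly by VRAM (5070 Ti : 3070).
llama-server \
  --model "${MODEL}" \
  --n-gpu-layers 99 \
  --split-mode layer \
  --tensor-split 2,1
```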