>>106501257
>How slow are we talking here?
I have an HP Z840 with two Xeons and 512+512 GB of DDR4 memory
I get a maximum of 4 t/s with DeepSeek-R1-0528-Q2_K_L and --cpu-moe if
the model is cached entirely in NUMA node 0,
llama-cli is run on CPU0, and
--threads matches the number of PHYSICAL cores of that single CPU.
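For reference, a quick way to see how many PHYSICAL cores each socket has and how much memory sits in each NUMA node (plain lscpu/numactl, nothing llama.cpp-specific):

# Cores per socket, threads per core, NUMA layout
lscpu | grep -E 'Socket|Core|Thread|NUMA'
# Memory size and free memory per node, plus which CPUs belong to which node
numactl --hardware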
You can run two instances of the LLM, one per CPU, if they are kept physically separate across the NUMA nodes (rough sketch at the end of this post).
As you can see, I have to isolate the memory and the cores to get the maximum speed.
All my attempts to get a boost by using the second CPU only slowed things down considerably.
If the model does not fit entirely in a single NUMA node, it sucks big time too.
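If you want to be sure the cached copy of the model actually sits in the node you will run from (node 0 in my description above, node 1 in the command below), one trick is to drop the page cache and pre-read the file under the same membind. As far as I can tell the membind policy also steers the page-cache allocations, but I haven't verified that on every kernel, so treat this as a sketch:

# How big is the model? Compare against the per-node sizes from numactl --hardware
du -h "$model"
# Drop the old page cache, then read the file with allocations bound to node 0
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
numactl --membind=0 --cpunodebind=0 cat "$model" > /dev/null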
# Run llama-cli pinned to one physical CPU (cores 8-15 here) with its memory
# bound to the matching NUMA node; --cpu-moe keeps the MoE expert weights on
# the CPU while --n-gpu-layers 99 offloads the rest to the GPU
CUDA_VISIBLE_DEVICES="0," \
numactl --physcpubind=8-15 --membind=1 \
"$HOME/LLAMA_CPP/$commit/llama.cpp/build/bin/llama-cli" \
--model "$model" $model_parameters \
--threads 8 \
--ctx-size $cxt_size \
--cache-type-k q4_0 \
--flash-attn \
--n-gpu-layers 99 \
--no-warmup \
--batch-size 8192 \
--ubatch-size 2048 \
--threads-batch 8 \
--jinja \
$log_option \
--prompt-cache "$cache_file" \
--file "$tmp_file" \
--cpu-moe
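And since I mentioned two instances: a rough sketch of what that looks like with llama-server instead of llama-cli (core ranges and ports are just examples for a 2x8-core box, check numactl --hardware for your real CPU lists; I haven't benchmarked this exact form). --no-mmap makes each instance copy the weights into its own node's RAM instead of both mapping a single page-cache copy:

# Instance on the first CPU: cores 0-7, memory forced to node 0
numactl --physcpubind=0-7 --membind=0 \
"$HOME/LLAMA_CPP/$commit/llama.cpp/build/bin/llama-server" \
--model "$model" --threads 8 --no-mmap --port 8080 &
# Instance on the second CPU: cores 8-15, memory forced to node 1
numactl --physcpubind=8-15 --membind=1 \
"$HOME/LLAMA_CPP/$commit/llama.cpp/build/bin/llama-server" \
--model "$model" --threads 8 --no-mmap --port 8081 &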