I tried this repo for DeepSeek-V4-Flash:
https://github.com/antirez/ds4
How many t/s should I expect on an RTX 3090 + 512 GB RAM?
...because what I'm getting is not worth mentioning:
(base) user@host:~/ds4$ ./ds4 -p "Hello" --cuda
ds4: Linux cuda backend set oom_score_adj=1000
ds4: CUDA backend initialized on NVIDIA GeForce RTX 3090 (sm_86)
ds4: CUDA registered 153.33 GiB model mapping for device access
ds4: CUDA preparing model tensor mappings: 152.04 GiB
ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=3.94 GiB limit=4.00 GiB)
ds4: CUDA startup model preparation covered 153.32 GiB of tensor spans in 0.048s
ds4: cuda backend initialized for graph diagnostics
ds4: context buffers 1053.75 MiB (ctx=32768, backend=cuda, prefill_chunk=4096, raw_kv_rows=4352, compressed_kv_rows=8194)
processing 10 input tokens: 10/10 (100.0%)
We need to respond to the user. The user just said "Hello". As an assistant, we should respond politely and ask how we can help.
Hello! How can I assist you today?
ds4: prefill: 0.48 t/s, generation: 0.13 t/s
>ds4: prefill: 0.48 t/s, generation: 0.13 t/s
>GPU power consumption: 190 W
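
To put those rates in wall-clock terms, here is a quick sketch in Python. It only uses the numbers from the ds4 summary line above (0.48 t/s prefill, 0.13 t/s generation, 10 input tokens); the helper name seconds_for is just for illustration and is not part of ds4.

```python
# Rates copied from the ds4 summary line in the log above.
PREFILL_TPS = 0.48
GENERATION_TPS = 0.13
PROMPT_TOKENS = 10  # from "processing 10 input tokens"


def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Wall-clock seconds needed to process `tokens` at a given rate.

    Illustrative helper only; it is not part of ds4.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second


prefill_seconds = seconds_for(PROMPT_TOKENS, PREFILL_TPS)
seconds_per_generated_token = seconds_for(1, GENERATION_TPS)

print(f"prefill: {prefill_seconds:.1f} s for {PROMPT_TOKENS} tokens")
print(f"generation: {seconds_per_generated_token:.1f} s per token")
```

So that is roughly 21 seconds just to read the 10-token prompt, and close to 8 seconds for every generated token.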