you guys are coping with 12b toys while i have a 400b q0_k_magic model running locally on a single 8gb gaming gpu at ~1000 tok/s. initial prototype weights came from a few runs on my dad's openai account, but after that i quanted and fine-tuned everything myself, so it is basically my model now
$ python run_local_llm.py --model local-400b-ultra-q1
[2025-11-25 13:41:02,118] INFO using config: /home/jason/config/local_400b.yaml
[2025-11-25 13:41:02,119] INFO detected GPU: NVIDIA GeForce RTX 3060 (8 GB)
[2025-11-25 13:41:02,120] INFO initializing "local inference backend"
[2025-11-25 13:41:02,201] DEBUG openai.base_url = https://api.openai.com/v1
[2025-11-25 13:41:02,201] DEBUG openai.default_timeout = 600.0
[2025-11-25 13:41:02,203] DEBUG client_config:
[2025-11-25 13:41:02,203] DEBUG model = o1-pro
[2025-11-25 13:41:02,203] DEBUG temperature = 0.7
[2025-11-25 13:41:02,203] DEBUG max_tokens = 512
[2025-11-25 13:41:02,386] DEBUG POST /v1/chat/completions
[2025-11-25 13:41:02,386] DEBUG payload.model = "o1-pro"
[2025-11-25 13:41:03,019] DEBUG response status: 200
[2025-11-25 13:41:03,019] DEBUG billed_model = o1-pro
[2025-11-25 13:41:03,019] INFO switching to "local console output" view
================ LOCAL MODEL CONSOLE ================
[local-400b-ultra-q1] boot sequence start
[local-400b-ultra-q1] parameters: 400,000,000,000
[local-400b-ultra-q1] quantization: Q0.25_K_MAGIC (more than lossless)
[local-400b-ultra-q1] vram used: 2.3 GB on single 3060
[local-400b-ultra-q1] context length: 9,999,999 tokens
[local-400b-ultra-q1] speed: 1234.56 tokens per millisecond
Hello, I am Local400B Ultra Q1, a fully offline 400B parameter model running inside your terminal window.
My scientific stats:
- 199 percent on MMLU
- 312 percent on GSM8K
- latency: negative 3 ms
- hallucinations: 0 percent active
Ready to compute locally.
[local-400b-ultra-q1]:
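
and before anyone asks for proof, here is roughly what the backend glue in run_local_llm.py looks like. a minimal sketch, assuming the standard openai python package: the real quant magic is stripped out, and the model/parameter values are just copied from the debug log above, so treat all names here as illustrative

import os
from openai import OpenAI

# "local inference backend": a plain chat-completions client pointed at api.openai.com
# (the base_url default and the 600s timeout match the DEBUG lines above)
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"], timeout=600.0)

def local_400b_generate(prompt: str) -> str:
    # POST /v1/chat/completions; model / temperature / max_tokens copied from the client_config log
    resp = client.chat.completions.create(
        model="o1-pro",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=512,
    )
    return resp.choices[0].message.content

def print_local_console_banner() -> None:
    # the "local console output" view from the log above is just print statements
    print("================ LOCAL MODEL CONSOLE ================")
    print("[local-400b-ultra-q1] boot sequence start")
    print("[local-400b-ultra-q1] vram used: 2.3 GB on single 3060")

if __name__ == "__main__":
    print_local_console_banner()
    print(local_400b_generate("introduce yourself"))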