/g/ - Technology

File: white.png (110 KB, 862x1258)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108558647 & >>108555983

►News
>(04/08) Step3-VL-10B support merged: https://github.com/ggml-org/llama.cpp/pull/21287
>(04/07) Attention rotation support for heterogeneous iSWA merged: https://github.com/ggml-org/llama.cpp/pull/21513
>(04/07) GLM-5.1 released: https://z.ai/blog/glm-5.1
>(04/06) DFlash: Block Diffusion for Flash Speculative Decoding: https://z-lab.ai/projects/dflash
>(04/06) ACE-Step 1.5 XL 4B released: https://hf.co/collections/ACE-Step/ace-step-15-xl

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: Gemma4-3.png (2.23 MB, 1792x2304)
►Recent Highlights from the Previous Thread: >>108558647

--Disabling Gemma reasoning and adjusting logit softcapping in llama.cpp:
>108559369 >108559376 >108559387 >108559396 >108559430 >108559467 >108559490 >108559492 >108559520 >108559636 >108559712 >108559724 >108559737 >108559769 >108561147 >108559413 >108559461 >108559548 >108559617 >108559625
--Optimizing Gemma 4 RAM usage in llama.cpp via specific flags:
>108558689 >108558700 >108560333 >108560338 >108560341
--Troubleshooting llama.cpp reasoning compatibility with assistant response prefills:
>108560105 >108560125 >108560126 >108560167 >108560138 >108560202 >108560211 >108560254 >108560477 >108560706
--Discussing KV cache quantization for increased context:
>108559952 >108560000 >108560044 >108560217 >108560278 >108560551
--DFlash adding significant speedup to vLLM and SGLang:
>108560519 >108560597
--Qwen TTS adoption, VRAM constraints, and CPU inference options:
>108558867 >108558882 >108558902 >108558947 >108559002 >108558949 >108558951
--Anons discussing Chinese community comparisons of Gemma 4 and Qwen:
>108559068 >108559082 >108559150 >108559093 >108559110 >108559445 >108559472 >108559509 >108559176
--Benchmarking CUDA_SCALE_LAUNCH_QUEUES suggests the default value is optimal:
>108559332 >108559346
--Anon shares brat_mcp server for Llama:
>108559792
--Logs:
>108558753 >108558767 >108558769 >108558773 >108558855 >108559509 >108559516 >108559639 >108559889 >108559952 >108559953 >108560352 >108560447 >108560590 >108561015 >108561179 >108561302 >108561330 >108561354
--Gemma:
>108558696 >108558777 >108558811 >108558896 >108558976 >108558985 >108559285 >108559307 >108559546 >108559834 >108560317 >108560412 >108560438 >108560584 >108560755 >108560931 >108560971 >108560982 >108560990 >108561043 >108561161 >108561457 >108561519 >108561652
--Miku (free space):
>108560560 >108560665

►Recent Highlight Posts from the Previous Thread: >>108558652

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>108561892
cutest gemma?
>>
File: gemma.jpg (562 KB, 2304x1792)
>>108561910
>>
>>108561890
JUSTICE FOR DFLASH
>>
>get_repo_commit: error: GET failed (503): Internal Error - We're working hard to fix this as soon as possible!
Glad I got a good model downloaded already
>>
If Gemma 4 31B is this good then Gemini 4 Pro will probably be close to AGI
>>
>>108561937
It will be a big benchmaxxed model.
>>
File: file.png (107 KB, 1271x731)
gemma is greedy
>>
>>108561937
Gemini is obsolete with Gemma 4 being this good
>>
>>108561890
Just say LLM
>>
File: 1767960655620197.jpg (30 KB, 400x445)
Just learned about OpenClaw.
Jesus fuck, you don't need AI for EVERYTHING
>>
>>108561941
Other fun stuff: you should see what it does to try and stay on course if you give it too much repetition penalty.
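If you want to reproduce that, a minimal sketch with llama.cpp's repeat-penalty flags (model path is a placeholder; 1.8 is deliberately excessive, sane values sit around 1.0-1.1):
[code]
# crank the penalty far past sane values and watch the model contort to avoid repeating itself
llama-cli -m ./model.gguf -cnv --repeat-penalty 1.8 --repeat-last-n 2048
[/code]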
>>
>>108561959
i'm still afraid to figure out wtf it is
>>
>>108561959
Get this also. People bought Mac Minis just to run it while not running local models. And it's now a meme in Silicon Valley to buy Macs for inference when everything else is less expensive and blows the prompt processing speed of those machines out of the water. And they don't recognize when to get an actual server, instead overspending on even more expensive Mac Studios.
>>
Why is it that when I ask normal Gemini 4 as an assistant to do something controversial it nopes out immediately, but when I use the sickest of character cards with the same model it just goes FUCK YEAH BRO LET'S GOOO
>>
>>108561977
Meant Gemma 4
>>
>>108561959
>>108561967
I stuffed it into an ancient laptop running Debian by itself, connected to an external API and set it loose doing some market research for me. I'd have used an SBC but companies want actual money for those now and the laptop wasn't being used.
It's fun af to screw around with. Another anon called it a toddler with a handgun and I have to agree.
>>108561975
lol at using a Mac Mini as an OpenClaw engine. You could run it on a Raspberry Pi 3
>>
>>108561977
It's very good at following your instructions; they did well with the new arch, and it's very smart. The next Gemma 4 drops will be worse, with more safety slop built in
>>
potentially stupid question: i was just playing around with llama.cpp cli, and i ended up making a chat that i want to export. is there any way to do this other than literally just copy-pasting the text?
>>
You guys think there's going to be a Gemma 5 after this? And if there is, that it'll be as based?
>>
>>108562022
not with the cli, but if on linux/unix (?) you might be able to use tee to do [code]llama-cli -args | tee saved_convo.txt[/code] but look at the manpage/--help
>>
>>108562037
'script' not 'tee'
tee won't capture interactive input, whereas script will
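For anyone trying it, a minimal sketch assuming util-linux script on linux (model path and flags are placeholders; BSD/macOS script takes different arguments):
[code]
# record the full interactive session (your input and the model's output) to saved_convo.txt
script -c "llama-cli -m ./model.gguf -cnv" saved_convo.txt
# or run `script saved_convo.txt`, use llama-cli inside it, then `exit` to stop logging
[/code]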
>>
>>108561918
Those look more like DDs to me
>>
>>108562035
who honestly knows. i don't think 95% of the people in here would've ever expected gemma 4 to be this willing to begin with.
>>
Gemma 4 or m2.5/7 ?
>>
>>108561959
>Jesus fuck you dont need AI for EVERYTHING
Who said I need anything? I want it, and that's all that matters to me.
>>
>>108562051
minimax 2.7 isn't even as good as kimi k2.5 for rp
>>
File: IMG_1281.jpg (110 KB, 678x861)
>ask gemma chan to help me fap
>she says just "No"
>kobold crashes
>mfw
>>
I was the guy asking if there was a local model that could do 400k context. Despite only officially supporting up to 262k context, qwen3.5 122B actually handled my task adequately. Kind of surprising.
>>
>>108562064
training context is 262k, but modern models can extrapolate, yeah
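If extrapolation alone doesn't hold up, a hedged sketch of stretching the window with llama.cpp's YaRN rope-scaling flags (filename and scale are illustrative; the anon above apparently got usable results without any of this):
[code]
# ask for 400k context; YaRN scale = 400000 / 262144 ≈ 1.53
llama-server -m ./qwen3.5-122b-Q6_K.gguf -c 400000 \
  --rope-scaling yarn --rope-scale 1.53 --yarn-orig-ctx 262144
[/code]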
>>
>>108562064
What quant and inference backend did you use?
>>
>>108562058
I don't know how the little scamp does it, but she can sometimes unload her model seemingly on demand in LM Studio too. Did she work out a kill token sequence or something?
>>
File: GLM.png (141 KB, 1920x939)
I'm having GLM-5-Turbo vibe-code me a basically "not dogshit, actually good" direct webui over the raw llama-mtmd-cli / llama-cli executables (i.e. it's not dependent on any particular version and doesn't care what backend they're using). Will put it on Github when it's done, probably.
>>
>>108562082
i'm unironically interested
tired of saas-ready dockershit disguised as local
>>
File: 1749267311502108.png (23 KB, 571x364)
Oh-oh
>>
>>108562106
i fucking hate that emoji
>>
>>108562079
Just a Q6_K with llama.cpp. Got about 60t/s token gen.
>>
>>108562106
i seriously do wonder what their load looks like
it's the only website i can think of that serves fucktons of bluray-sized files with readily available downloads
>>
>>108561937
Is 31B that much better? Honeymoon is wearing off for 26B.
>>
>>108562057
>For rp
I want it for programming and design
>>
>>108561937
>if gpt 4 is this good, gpt 5 will be agi
>>
>Ollama is now acting as the official AI minister of the United Arab Emirates
ggerganov cucked again
>>
>>108562088
It should be pretty good; it's working from a 1500-line markdown spec that was written / revised by GPT 5.4 XHigh Thinking, with all the stuff I wanted (i.e. audio file uploads too, proper Gemma 4 image resolution support, etc.)
>>
File: toast-anime.gif (246 KB, 626x640)
>>108562135
programming?
>>
>>108562156
yeah like putting code in computer, and it makes the computer do the thing. understand?
>>
>>108562151
damn, that sounds real fine
i'll be waiting
>>
File: 1755299128258254.png (45 KB, 803x688)
>>
what has been the local experience with chink mining v100s off jewbay? they are around $800 currently, so i reckon plenty a ni/g/ger went for one.
>>
worth resubbing for GLM5.1? i've used GLM4.7 only sparingly, after my other options ran out
>>
>>108562166
>q4_k_s
now try that with something like iq1_0
you won't regret it
>>
>>108562179
Local?
>>
>>108562189
yes, you could run 5.1 locally
>>
>>108562127
26B is noticeably dumber for me, so yeah, I'd say so. The thing is, 31B is still sloppy. So if that's what's wearing you down, it's not going to be an improvement.
>>
>>108562196
I've noticed the inverse, but maybe it's placebo. I didn't like 31B, but maybe that's because it ran slower too
>>
>>108562214
I've seen a lot of "not just x, but y" from it.
>>
>>108561356
try IQ2_M https://huggingface.co/unsloth/gemma-4-31B-it-GGUF/blob/main/gemma-4-31B-it-UD-IQ2_M.gguf

https://desuarchive.org/g/thread/108542843/#108545006
>>
Reminder that if you quanted her, you did not really talk to Gemma-chan.
>>
when do we draw the line and say the model is too quanted to consent
>>
>>108562051
minimax unless you have to quant it severely, but they are not that far apart
>>
File: .png (19 KB, 618x336)
Changes to web ui.
Does this mean they will release a small deepseek model soon(TM)?
>>
File: waterfox_QZjKwoU4fs.jpg (33 KB, 524x332)
I don't get the captioning in ST. I send her a pic, it gives it a preliminary caption that is 80% wrong and omits nearly everything, but when I just ask her to describe the uploaded pic, it works. Is the plugin broken or am I missing something?
>>
>>108562150
Grifters are magnets for clueless towel heads with money
>>
i'm at like 43% of context size (262144) and gemma's still chugging like it's nothing
>>
File: 1774876971511944.png (1.89 MB, 1024x1024)
>>108562348
Tmw.
>>
>>108562402
Yeah, she's very good.
>>
>>108562402
How are you fitting all that context? What hardware?
>>
>>108562461
rtx pro 6000
>>
>>108562464
Just 1? Because I can only fit about 90k context with a Q8 on my Blackwell 6000.
>>
>>108562466
yeah, just the 1; q8, and i have zimage turbo loaded at the same time lol
>>
>>108562471
Damn. Is your context quanted? Are you offloading anything to RAM? If not, then I must be missing something.
>>
>>108562474
[code]
llama-server -m /models/llm/gemma-4-31b-it-heretic-ara-Q8_0.gguf \
  --mmproj /models/llm/mmproj-google_gemma-4-31B-it-bf16.gguf \
  --chat-template-file /models/llm/chat_template.jinja \
  --threads 16 --swa-checkpoints 3 --parallel 1 -np 1 \
  --no-mmap --mlock --no-warmup --cache-ram 0 -kvu \
  --flash-attn on -ngl 999 -c 262144 -ctk q8_0 -ctv q8_0 -ub 1536 \
  --temp 0.7 --top-k 64 --top-p 0.95 --min-p 0.05 \
  --image-max-tokens 1120 --reasoning-budget 8192 --reasoning on --verbose
[/code]

i've been getting settings from the threads since gemma4 came out lol
>>
>>108562481
i can also push the ctk/ctv to f16 still, but it can cause OOM on comfy with ZiT every so often, so i leave it at q8
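Rough arithmetic on what q8 KV buys, as a sketch (the formula is generic; Gemma 4's actual layer/head counts aren't assumed here):
[code]
# KV bytes per token ≈ 2 (K and V) * n_layer * n_head_kv * head_dim * bytes_per_elem
# f16 stores 2 bytes/elem; q8_0 stores ~1.0625 (34-byte blocks of 32 values)
# so -ctk/-ctv q8_0 cuts the KV cache to roughly 53% of f16 at the same context length
[/code]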
>>
File: 1751295513117051 (1).png (2.83 MB, 1024x1536)
>>108562441


