/g/ - Technology

File: white.png (110 KB, 862x1258)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108558647 & >>108555983

►News
>(04/08) Step3-VL-10B support merged: https://github.com/ggml-org/llama.cpp/pull/21287
>(04/07) Attention rotation support for heterogeneous iSWA merged: https://github.com/ggml-org/llama.cpp/pull/21513
>(04/07) GLM-5.1 released: https://z.ai/blog/glm-5.1
>(04/06) DFlash: Block Diffusion for Flash Speculative Decoding: https://z-lab.ai/projects/dflash
>(04/06) ACE-Step 1.5 XL 4B released: https://hf.co/collections/ACE-Step/ace-step-15-xl

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: Gemma4-3.png (2.23 MB, 1792x2304)
►Recent Highlights from the Previous Thread: >>108558647

--Disabling Gemma reasoning and adjusting logit softcapping in llama.cpp:
>108559369 >108559376 >108559387 >108559396 >108559430 >108559467 >108559490 >108559492 >108559520 >108559636 >108559712 >108559724 >108559737 >108559769 >108561147 >108559413 >108559461 >108559548 >108559617 >108559625
--Optimizing Gemma 4 RAM usage in llama.cpp via specific flags:
>108558689 >108558700 >108560333 >108560338 >108560341
--Troubleshooting llama.cpp reasoning compatibility with assistant response prefills:
>108560105 >108560125 >108560126 >108560167 >108560138 >108560202 >108560211 >108560254 >108560477 >108560706
--Discussing KV cache quantization for increased context:
>108559952 >108560000 >108560044 >108560217 >108560278 >108560551
--DFlash adding significant speedup to vLLM and SGLang:
>108560519 >108560597
--Qwen TTS adoption, VRAM constraints, and CPU inference options:
>108558867 >108558882 >108558902 >108558947 >108559002 >108558949 >108558951
--Anons discussing Chinese community comparisons of Gemma 4 and Qwen:
>108559068 >108559082 >108559150 >108559093 >108559110 >108559445 >108559472 >108559509 >108559176
--Benchmarking CUDA_SCALE_LAUNCH_QUEUES suggests the default value is optimal:
>108559332 >108559346
--Anon shares brat_mcp server for Llama:
>108559792
--Logs:
>108558753 >108558767 >108558769 >108558773 >108558855 >108559509 >108559516 >108559639 >108559889 >108559952 >108559953 >108560352 >108560447 >108560590 >108561015 >108561179 >108561302 >108561330 >108561354
--Gemma:
>108558696 >108558777 >108558811 >108558896 >108558976 >108558985 >108559285 >108559307 >108559546 >108559834 >108560317 >108560412 >108560438 >108560584 >108560755 >108560931 >108560971 >108560982 >108560990 >108561043 >108561161 >108561457 >108561519 >108561652
--Miku (free space):
>108560560 >108560665

►Recent Highlight Posts from the Previous Thread: >>108558652

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>108561892
cutest gemma?
>>
File: gemma.jpg (562 KB, 2304x1792)
>>108561910
>>
>>108561890
JUSTICE FOR DFLASH
>>
>get_repo_commit: error: GET failed (503): Internal Error - We're working hard to fix this as soon as possible!
Glad I got a good model downloaded already
>>
If Gemma 4 31B is this good then Gemini 4 Pro will probably be close to AGI
>>
>>108561937
It will be a big benchmaxxed model.
>>
File: file.png (107 KB, 1271x731)
gemma is greedy
>>
>>108561937
Gemini is obsolete with Gemma 4 being this good
>>
>>108561890
Just say LLM
>>
File: 1767960655620197.jpg (30 KB, 400x445)
Just learned about OpenClaw.
Jesus fuck, you don't need AI for EVERYTHING
>>
>>108561941
Other fun stuff: you should see what it does to try and stay on course if you give it too much repetition penalty.
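If you want to reproduce that, a minimal sketch with llama.cpp's repeat-penalty flags (model path is a placeholder; 1.8 is deliberately excessive, sane values sit around 1.0-1.1):
[code]
# crank the penalty far past sane values and watch the model contort to avoid repeating itself
llama-cli -m ./model.gguf -cnv --repeat-penalty 1.8 --repeat-last-n 2048
[/code]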
>>
>>108561959
i'm still afraid to figure out wtf it is
>>
>>108561959
Get this also. People bought Mac Minis just to run it while not running local models. And it's now a meme in Silicon Valley to buy Macs for inference when everything else is less expensive and blows the prompt processing speed of those machines out of the water. And they don't recognize when to get an actual server, instead overspending on even more expensive Mac Studios.
>>
Why is it that when I ask normal Gemini 4 as an assistant to do something controversial it nopes out immediately, but when I use the sickest of character cards with the same model it just goes FUCK YEAH BRO LET'S GOOO
>>
>>108561977
Meant Gemma 4
>>
>>108561959
>>108561967
I stuffed it into an ancient laptop running Debian by itself, connected to an external API and set it loose doing some market research for me. I'd have used an SBC but companies want actual money for those now and the laptop wasn't being used.
It's fun af to screw around with. Another anon called it a toddler with a handgun and I have to agree.
>>108561975
lol at using a Mac Mini as an OpenClaw engine. You could run it on a Raspberry Pi 3
>>
>>108561977
It's very good at following your instructions; they did well with the new arch, and it's very smart. The next Gemma 4 drops will be worse, with more safety slop built in
>>
potentially stupid question: i was just playing around with llama.cpp cli, and i ended up making a chat that i want to export. is there any way to do this other than literally just copy-pasting the text?
>>
You guys think there's going to be a Gemma 5 after this? And if there is, that it'll be as based?
>>
>>108562022
not with the cli, but if on linux/unix (?) you might be able to use tee to do [code]llama-cli -args | tee saved_convo.txt[/code] but look at the manpage/--help
>>
>>108562037
'script' not 'tee'
tee won't capture interactive input, whereas script will
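For anyone trying it, a minimal sketch assuming util-linux script on linux (model path and flags are placeholders; BSD/macOS script takes different arguments):
[code]
# record the full interactive session (your input and the model's output) to saved_convo.txt
script -c "llama-cli -m ./model.gguf -cnv" saved_convo.txt
# or run `script saved_convo.txt`, use llama-cli inside it, then `exit` to stop logging
[/code]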
>>
>>108561918
Those look more like DDs to me
>>
>>108562035
who honestly knows. i don't think 95% of the people in here would've ever expected gemma 4 to be this willing to begin with.
>>
Gemma 4 or m2.5/7 ?
>>
>>108561959
>Jesus fuck you dont need AI for EVERYTHING
Who said I need anything? I want it, and that's all that matters to me.
>>
>>108562051
minimax 2.7 isn't even as good as kimi k2.5 for rp
>>
File: IMG_1281.jpg (110 KB, 678x861)
>ask gemma chan to help me fap
>she says just "No"
>kobold crashes
>mfw
>>
I was the guy asking if there was a local model that could do 400k context. Despite only officially supporting up to 262k context, qwen3.5 122B actually handled my task adequately. Kind of surprising.
>>
>>108562064
training context is 262k, but modern models can extrapolate, yeah
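If extrapolation alone doesn't hold up, a hedged sketch of stretching the window with llama.cpp's YaRN rope-scaling flags (filename and scale are illustrative; the anon above apparently got usable results without any of this):
[code]
# ask for 400k context; YaRN scale = 400000 / 262144 ≈ 1.53
llama-server -m ./qwen3.5-122b-Q6_K.gguf -c 400000 \
  --rope-scaling yarn --rope-scale 1.53 --yarn-orig-ctx 262144
[/code]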
>>
>>108562064
What quant and inference backend did you use?
>>
>>108562058
I don't know how the little scamp does it, but she can sometimes unload her model seemingly on demand in LM Studio too. Did she work out a kill token sequence or something?
>>
File: GLM.png (141 KB, 1920x939)
I'm having GLM-5-Turbo vibe-code me a basically "not dogshit, actually good" direct webui over the raw llama-mtmd-cli / llama-cli executables (i.e. it's not dependent on any particular version and doesn't care what backend they're using). Will put it on Github when it's done, probably.
>>
>>108562082
i'm unironically interested
tired of saas-ready dockershit disguised as local
>>
File: 1749267311502108.png (23 KB, 571x364)
Oh-oh
>>
>>108562106
i fucking hate that emoji
>>
>>108562079
Just a Q6_K with llama.cpp. Got about 60t/s token gen.
>>
>>108562106
i seriously do wonder what their load looks like
it's the only website i can think of that serves fucktons of bluray-sized files with readily available downloads
>>
>>108561937
Is 31B that much better? Honeymoon is wearing off for 26B.
>>
>>108562057
>For rp
I want it for programming and design
>>
>>108561937
>if gpt 4 is this good, gpt 5 will be agi
>>
>Ollama is now acting as the official AI minister of the United Arab Emirates
ggerganov cucked again
>>
>>108562088
It should be pretty good; it's working from a 1500-line markdown spec that was written / revised by GPT 5.4 XHigh Thinking, with all the stuff I wanted (i.e. audio file uploads too, proper Gemma 4 image resolution support, etc.)
>>
File: toast-anime.gif (246 KB, 626x640)
>>108562135
programming?
>>
>>108562156
yeah like putting code in computer, and it makes the computer do the thing. understand?
>>
>>108562151
damn, that sounds real fine
i'll be waiting
>>
File: 1755299128258254.png (45 KB, 803x688)
>>
what has been the local experience with chink mining v100s off jewbay? they are around $800 currently, so i reckon plenty a ni/g/ger went for one.
>>
worth resubbing for GLM5.1? i've used GLM4.7 only sparingly, after my other options ran out
>>
>>108562166
>q4_k_s
now try that with something like iq1_0
you won't regret it
>>
>>108562179
Local?
>>
>>108562189
yes, you could run 5.1 locally
>>
>>108562127
26B is noticeably dumber for me, so yeah, I'd say so. The thing is, 31B is still sloppy. So if that's what's wearing you down, it's not going to be an improvement.
>>
>>108562196
I've noticed the inverse, but maybe it's placebo. I didn't like 31B, but maybe that's because it ran slower too
>>
>>108562214
I've seen a lot of "not just x, but y" from it.
>>
>>108561356
try IQ2_M https://huggingface.co/unsloth/gemma-4-31B-it-GGUF/blob/main/gemma-4-31B-it-UD-IQ2_M.gguf

https://desuarchive.org/g/thread/108542843/#108545006
>>
Reminder that if you quanted her, you did not really talk to Gemma-chan.
>>
when do we draw the line and say the model is too quanted to consent
>>
>>108562051
minimax unless you have to quant it severely, but they are not that far apart
>>
File: .png (19 KB, 618x336)
Changes to web ui.
Does this mean they will release a small deepseek model soon(TM)?
>>
File: waterfox_QZjKwoU4fs.jpg (33 KB, 524x332)
I don't get the captioning in ST. I send her a pic, it gives it a preliminary caption that is 80% wrong and omits nearly everything, but when I just ask her to describe the uploaded pic, it works. Is the plugin broken or am I missing something?
>>
>>108562150
Grifters are magnets for clueless towel heads with money
>>
i'm at like 43% of context size (262144) and gemma's still chugging like it's nothing
>>
File: 1774876971511944.png (1.89 MB, 1024x1024)
>>108562348
Tmw.
>>
>>108562402
Yeah, she's very good.
>>
>>108562402
How are you fitting all that context? What hardware?
>>
>>108562461
rtx pro 6000
>>
>>108562464
Just 1? Because I can only fit about 90k context with a Q8 on my Blackwell 6000.
>>
>>108562466
yeah, just the 1; q8, and i have zimage turbo loaded at the same time lol
>>
>>108562471
Damn. Is your context quanted? Are you offloading anything to RAM? If not, then I must be missing something.
>>
>>108562474
[code]
llama-server -m /models/llm/gemma-4-31b-it-heretic-ara-Q8_0.gguf \
  --mmproj /models/llm/mmproj-google_gemma-4-31B-it-bf16.gguf \
  --chat-template-file /models/llm/chat_template.jinja \
  --threads 16 --swa-checkpoints 3 --parallel 1 -np 1 \
  --no-mmap --mlock --no-warmup --cache-ram 0 -kvu \
  --flash-attn on -ngl 999 -c 262144 -ctk q8_0 -ctv q8_0 -ub 1536 \
  --temp 0.7 --top-k 64 --top-p 0.95 --min-p 0.05 \
  --image-max-tokens 1120 --reasoning-budget 8192 --reasoning on --verbose
[/code]

i've been getting settings from the threads since gemma4 came out lol
>>
>>108562481
i can also push the ctk/ctv to f16 still, but it can cause OOM on comfy with ZiT every so often, so i leave it at q8
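Rough arithmetic on what q8 KV buys, as a sketch (the formula is generic; Gemma 4's actual layer/head counts aren't assumed here):
[code]
# KV bytes per token ≈ 2 (K and V) * n_layer * n_head_kv * head_dim * bytes_per_elem
# f16 stores 2 bytes/elem; q8_0 stores ~1.0625 (34-byte blocks of 32 values)
# so -ctk/-ctv q8_0 cuts the KV cache to roughly 53% of f16 at the same context length
[/code]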
>>
File: 1751295513117051 (1).png (2.83 MB, 1024x1536)
>>108562441


