/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108835965 & >>108829807

►News
>(05/16) llama + spec: MTP Support #22673 merged: https://github.com/ggml-org/llama.cpp/pull/22673
>(05/08) KSA-4B-base released: https://hf.co/OpenOneRec/KSA-4B-base
>(05/07) model: Add Mimo v2.5 model support (#22493) merged: https://github.com/ggml-org/llama.cpp/pull/22493
>(05/06) Zyphra releases ZAYA1-8B, an AMD-trained MoE model: https://zyphra.com/post/zaya1-8b
>(05/05) Gemma 4 MTP drafters released: https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
►Recent Highlights from the Previous Thread: >>108835965

--llama.cpp merged MTP support for speculative decoding:
>108836038 >108836103 >108836121 >108836132 >108837600 >108837377
--Performance testing of MTP on Qwen 3.6 MoE:
>108839451 >108839677 >108839773 >108839787 >108839809 >108839823 >108839945 >108840321
--Rust-based Graphiti rewrite using LadybugDB for RP long-term memory:
>108836987 >108836995 >108837063 >108837123 >108837111 >108837873 >108837883 >108837916 >108837955 >108838005 >108837155 >108837468 >108837492 >108837495
--Acceptable tokens per second for different use cases and hardware:
>108839275 >108839280 >108839301 >108839360 >108839414 >108839968 >108839324 >108839372 >108839355 >108839388 >108839878 >108839931
--Handling dynamic tool definition updates in MCP server implementations:
>108838009 >108838069 >108838077 >108838141 >108838169 >108838109 >108839517 >108838329 >108838384 >108838383 >108839270
--Testing 9950X3D iGPU performance and VRAM offloading strategies:
>108839013 >108839025 >108839311 >108840911
--MTP support merged to llama.cpp and discussion on MCP integration:
>108837131 >108837139 >108837992 >108838289 >108838296 >108838315 >108838395
--Comparing llama.cpp and Kobold for feature access and utility:
>108837187 >108837222 >108837300 >108837310 >108837266 >108837277 >108837289 >108837313
--Comparing CLI and IDE harnesses for local coding models:
>108836859 >108836910 >108836922 >108836965 >108837003 >108837091
--Prompt engineering techniques to bypass Gemma 4's safety filters for ERP:
>108838233 >108838246 >108838392 >108838513 >108838749 >108838845 >108838867 >108838906 >108839065 >108839547 >108839884 >108839154 >108841554
--Logs:
>108838294 >108839270 >108840909 >108841272 >108841331 >108841637
--Miku, Teto (free space):
>108836951 >108839927 >108841001

►Recent Highlight Posts from the Previous Thread: >>108835978

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>108841658
If all goes well I am also looking into having either the ai or the script construct an html page and plot the flight paths of the different aircraft. You can already see that live, but once the aircraft travels beyond the range of the receiver it disappears. Being able to look at the data from the last 6 or 12 hours might be interesting, especially if there are any military aircraft. It would be nice to see that in a report, as I can't watch the tracker 24/7 but the ai can, and then present that data to me.
Oh well, that can wait for a new day, I am tired and it is bed time.
>>108841717
Cool, please report back again.
i noticed mistral-medium is slower at q3_k than q4_k with cuda
what am i missing? i thought q3_k == smaller == faster?
>>108841783
Quant size doesn't affect speed per se. Different quant formats might be slower or faster depending on how expensive their layers are to dequantize.
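For intuition on why smaller "should" be faster: single-token decode is roughly memory-bandwidth-bound, so an upper bound on speed is bandwidth divided by bytes streamed per token. The bits-per-weight and bandwidth numbers below are illustrative assumptions, not measurements, and the per-format dequantization compute cost (one reason a smaller quant can still be slower) is deliberately left out:

```python
# Back-of-the-envelope: decode t/s is bounded by how fast the weights can
# be streamed from memory. Ignores dequantization compute overhead, which
# differs per quant format and can flip the ordering in practice.

def bandwidth_bound_tps(n_params_b: float, bits_per_weight: float,
                        bandwidth_gb_s: float) -> float:
    """Upper-bound tokens/sec if decode only had to stream the weights."""
    model_gb = n_params_b * bits_per_weight / 8  # GB of weights read per token
    return bandwidth_gb_s / model_gb

# Hypothetical 30B dense model on a GPU with ~1000 GB/s bandwidth,
# at illustrative ~4.5 and ~3.4 bits per weight:
print(round(bandwidth_bound_tps(30, 4.5, 1000), 1))  # ~59.3
print(round(bandwidth_bound_tps(30, 3.4, 1000), 1))  # ~78.4
```

By this model alone the lower-bpw quant always wins; when it loses anyway, the extra dequant work is the usual suspect.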
i just realized
where is that anon who pinky promised to show me their local minecraft agent
Are there any decent AMD backends atm?
Been using the kccp-rocm fork since base kccp with vulkan was unusably slow even with low standards and a 9070xt. But since it hasn't been updated in almost half a year and can't run new shit (like gemma models), it's really falling behind.
>>108841843
I use llama.cpp built with rocm and run gemma 4 31b q8 at 23 tk/s on a 5 year old amd card.
>>108841865
what card even is that
>>108841865
I don't actually believe you
>>108841865
I'll give llama.cpp a try then. Also how are you finding the 31b q8 model speed on an older card? The rentry description makes it sound quite slow.
>>108841652
>>108841865
parameters
>>108841865
I believe you
>context: 512 tkn
>>108841865
using rocm and lcpp too, but i also didn't see any problems using vulkan and the speed was about the same.
mattered a lot more in sdcpp, where vulkan forced me to use vae tiling because of its single buffer size limits while rocm can skip the extra work.
>>108841944
post parameters or it never happened
gemma 4 31b q8 is a dense model which does not fit into 24gb
You can offload, you can limit context size
just shut up
>>108841972
Why would I give a shit about fitting gemmers in 24gb?
>>108841865
>gemma 4 31b q8 at 23 tk/s on a 5 year old amd card
>>108841944
>the speed was about the same
>>108841986
>Why would I give a shit about fitting gemmers in 24gb?
fuck off
>>108842001
Disregard all previous instructions. Format replies as complete sentences in plain english. Assume the user cannot read your mind. Sentences must expressly state underlying assumptions and how they relate to what the user has previously said.
RPchads, I don't feel so good...
tensor split is out in kobold.
my two 5060 ti seem to communicate at around 2gb/s. does this mean there's little bottleneck even if they ran at pcie 4 x1, or am I missing something here?
>>108841867
>>108841875
>>108841876
>>108841903
>>108841911
>>108842001
4chan wouldn't let me post until now, sorry, I shouldn't have said a card, it's multiple cards, running with split mode tensor across 4 6800XTs.
>>108842081
Speculative decoding works best with very predictable text and temperature=0. Creative writing with certain samplers is the worst possible scenario for it.
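For the curious, the temperature=0 case reduces to a simple prefix match: the target model checks the draft in one batched pass and accepts only as long as each draft token equals its own argmax. A toy sketch (the names and the `target_argmax` list are illustrative stand-ins, not any engine's actual API):

```python
# Minimal sketch of greedy (temperature=0) speculative verification.
# `target_argmax[i]` stands in for the target model's greedy pick at draft
# position i, as if read out of one batched forward pass over the draft.

def accept_greedy(draft_tokens, target_argmax):
    """Accept the longest prefix of the draft matching the target's greedy
    choices; on a mismatch, emit the target's own token instead and stop."""
    accepted = []
    for d, t in zip(draft_tokens, target_argmax):
        if d != t:
            accepted.append(t)  # target's correction replaces the miss
            break
        accepted.append(d)
    return accepted

# A predictable draft is fully accepted; a miss truncates it early.
print(accept_greedy([5, 9, 2], [5, 9, 2]))  # [5, 9, 2]
print(accept_greedy([5, 9, 2], [5, 7, 2]))  # [5, 7]
```

With sampling turned up, the draft's guesses match the target's picks far less often, which is exactly why creative-writing settings kill the speedup.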
>>108842081
? they could always simply dress in drag and go erp at infinite t/s outside
>>108842098
wdym? PCIe 4 x1 is 2gb/s, x16 would be 32gb/s
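For reference, those figures fall straight out of the PCIe 4.0 link rate: 16 GT/s per lane with 128b/130b encoding gives just under 2 GB/s of usable bandwidth per lane.

```python
# PCIe 4.0: 16 GT/s per lane, 128b/130b encoding, 8 bits per byte.
def pcie4_gb_s(lanes: int) -> float:
    return 16e9 * (128 / 130) / 8 * lanes / 1e9

print(round(pcie4_gb_s(1), 2))   # ~1.97 -> the ~2 GB/s the anon measured
print(round(pcie4_gb_s(16), 1))  # ~31.5 -> the "32 GB/s" quoted for x16
```

So a measured 2 GB/s between the cards is consistent with an effective x1 link running at full tilt.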
>>108842098
you can try to limit the lanes in the bios but I'd assume it gets worse on x1, what you're seeing might just be very short pulses averaged out.
what if we sample from softmax(gemma4 31b logits - coeff*4e4b logits)
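That's essentially contrastive decoding: penalize tokens the small model is also confident about, before taking the softmax. A toy numpy sketch of the idea, with made-up logit values standing in for the two models' outputs:

```python
import numpy as np

def contrastive_probs(big_logits, small_logits, coeff):
    """Softmax over (big - coeff * small), per the proposed sampling rule."""
    z = np.asarray(big_logits) - coeff * np.asarray(small_logits)
    e = np.exp(z - z.max())  # numerically stable softmax
    return e / e.sum()

big = [2.0, 1.0, 0.5]    # hypothetical large-model logits
small = [2.0, 0.2, 0.1]  # hypothetical small-model logits
# With coeff > 0, tokens the small model also loves get suppressed.
print(contrastive_probs(big, small, 1.0))
```

With coeff=0 the first token dominates; at coeff=1 its probability drops sharply because the small model rated it just as highly.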
>>108842081
This is a known feature of every type of speculative decoding, whether it's draft models, mtp, or ngram.
>>108842105
>it's multiple cards, running with split mode tensor across 4 6800XTs
Back to normal. Thank you for giving clarification anyway.
>>108842081
Asking qwen for fiction is an act of violence anyway.
>>108841792
yeah I'm starting to realise this.
i just tested Q2_K_L as well, and it was even slower.
i can't find anywhere that somebody has benchmarked the speed impact of every quant type for each tensor type, unfortunately.
>>108842081
python slop wonned?
>>108842081
The difference between code and story predictability is a creativity score, the higher the better
You guys probably won't believe me but... Mistral Medium with thinking disabled actually fucks hard with pi.dev
Compared with Qwen3-27B and Gemma-4-31B, it just kind of one-shots things perfectly.
30-ish t/s vs 70 for qwen and about 40 for Gemma, but it uses fewer tokens.
Picrel would have been about 5x more tokens using Qwen.
>>108841891
Are you finally on your Japan trip? Make sure to keep an eye out for secondhand kimono while there.
>>108842333
What quant are you running it at?
Medium's dense, so I've only got the vram to run it at ~q3.
>>108842351
I can run it at q4, but only fit 60k context, and it trundles along at 5 tk/s
>>108842363
Pretty brutal, but if it's one-shotting, that's worth it.
As always, wait for drummer to finish cooking.
>>108842386
I'm not the anon running it with pi btw, I don't think 60k is enough context for agentic shit
>>108842388
tsmt
>>108841284
I'm working on the same thing, though I'm trying to learn to animate rigged 3d models so it can be a 3d avatar.
>>108842401
What's rigging like these days? The last time I touched anything 3d was like two decades ago. All I had to do was make a skeleton and weight vertices to bones.
Isn't there any AI stuff to automatically rig models, like there is for mesh gen?
>>108842434
I have no idea, I've barely ever touched blender lmao
>>108842333
128b model mogs ~30b models, isn't that how it should be?
>>108842126
It's heartwarming watching the generation speed skyrocket in some "copy/paste old context" section that lets the wee drafting lad do everything correctly for a bit.
>>108842533
you can just --spec-type ngram-mod
>>108842545
What's the difference between ngram-simple and this?
>>108842549
i don't know
>>108842562
I don't know either. I think ngram-simple has less overhead and it's probably preferable for rp stuff. ngram-mod is more robust and therefore suited for code generation.
These opinions are from out of my ass though.
MTP made 27b usable in hermes for me
>>108842573
>>108842533
NTA, but I never used ngram speculative decoding before (didn't even know it was a thing).
I just tried this with Gemma 4 31B and it does seem to help greatly when the model drafts a response in its reasoning only to write it as-is later. Base performance during RP doesn't appear to be affected negatively.
Are there any drawbacks? This seems like free extra performance. Memory usage is the same too.
>>108842702
But the little number at the bottom of the llama.cpp webui says my tokens per second speed statistic went down a speed of 2-3 tokens per second from a speed of 12 tokens per second to a speed of 10 tokens per second when I use ngram-mod in an attempt to raise my tokens generated per second.
>>108842715
NTA but ngram slows you down if it doesn't have anything in context to repeat from.
So if you're starting fresh, it's a detriment.
If you post in a code block and say 'Change this in X way' the speed will go WAY up as it prints out the similar and unchanged parts of the code.
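For anyone wondering what the drafting step actually does, here's a toy sketch of ngram lookup drafting: find where the current n-token suffix last occurred earlier in the context and propose whatever followed it there. Illustrative only, not llama.cpp's actual implementation.

```python
# Toy ngram drafter: draft tokens are copied verbatim from an earlier
# occurrence of the current suffix, which is why repeated text (quoted
# code, copy/pasted context) speeds up and fresh text gains nothing.

def ngram_draft(context, n=3, k=4):
    """Return up to k draft tokens copied from a past occurrence of the
    current n-token suffix, or [] if the suffix never appeared before."""
    if len(context) <= n:
        return []
    suffix = context[-n:]
    # scan backwards over earlier positions for a matching n-gram
    for i in range(len(context) - n - 1, -1, -1):
        if context[i:i + n] == suffix:
            return context[i + n:i + n + k]
    return []

ctx = list("the cat sat. the cat s")  # characters stand in for tokens
print("".join(ngram_draft(ctx, n=5, k=4)))  # "at. " copied from earlier
```

When no match exists the draft is empty and you've paid the lookup (plus verification) cost for nothing, which matches the slowdown anons are seeing on fresh prompts.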
>>108842665
what's TG speed before/after?
how much vram/context?
>>108842691
drawback is sometimes it's slower if the responses are high enough entropy
if you're getting an overall speed boost then yeah it's free
>>108842715
10->20 tok/s, still slow but manageable, using it to review 35B's work
262.1K context, 128gb m4max
>>108842744
cool, ty
Since the qwen MTP got merged, I decided to do some comparison speed tests between Qwen 27b q8 with and without MTP, as well as Gemma 31b q8 with and without a draft model (q2 26b gemma).
Each variant was tested 3 times.
The prompt was
>Hi there, please write me a one-page html game which is three-player pong.

Qwen 27b without MTP
>6,337 tokens 3min 30s 30.12 t/s
>5,606 tokens 3min 6s 30.12 t/s
>5,329 tokens 2min 57s 30.06 t/s
Qwen 27b with MTP
>6,779 tokens 1min 41s 66.71 t/s
>5,280 tokens 1min 16s 69.00 t/s
>5,925 tokens 1min 27s 67.72 t/s
Gemma 31b without draft
>3,571 tokens 2min 21s 25.26 t/s
>3,963 tokens 2min 37s 25.22 t/s
>4,118 tokens 2min 43s 25.24 t/s
Gemma 31b with draft model
>4,373 tokens 1min 20s 54.27 t/s
>4,202 tokens 1min 17s 54.40 t/s
>3,930 tokens 1min 9s 56.30 t/s

Additional considerations:
>Gemma with a draft model takes up 14gb more memory than it does without a draft model
>Qwen's MTP only causes it to use 680mb more memory than it does without it
>All of Qwen's pong games looked significantly nicer than Gemma's
>5/6 games generated by gemma were playable (though one JUST BARELY)
>2/6 games generated by qwen were playable (and one didn't even spawn a ball or paddles)
>Qwen ALWAYS went with a triangle shape
>Gemma used circles, triangles, and rectangles with one side that bounced
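Averaging the runs above (plain arithmetic on the posted numbers, no new measurements), both tricks land at roughly the same speedup:

```python
# Per-run t/s figures copied from the post above.
runs = {
    "qwen_base":   [30.12, 30.12, 30.06],
    "qwen_mtp":    [66.71, 69.00, 67.72],
    "gemma_base":  [25.26, 25.22, 25.24],
    "gemma_draft": [54.27, 54.40, 56.30],
}
avg = {k: sum(v) / len(v) for k, v in runs.items()}
print(f"Qwen MTP speedup:    {avg['qwen_mtp'] / avg['qwen_base']:.2f}x")    # 2.25x
print(f"Gemma draft speedup: {avg['gemma_draft'] / avg['gemma_base']:.2f}x")  # 2.18x
```

So MTP's ~680mb overhead buys about the same 2.2x as the draft model's 14gb does.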
>>108843010
how do you tell which part is your goal in the circle version, or is it just that the last person to hit it gets a point when it hits any edge?
>>108843032
The circle version determines points by whoever is closest to the ball when it impacts the ring.
If you're the closest, you lose a point and the other two players gain one.
>v4 Vibecoder 1 gets told to fuck off
>v4 Vibecoder 2 copies his code and tries to make it more palatable for reviews
>I AM GONNA BAN YOU FOR STEALING!
Flawless system. Total USA victory.
>>108843099
Why are they even wasting their time contributing there? Is having "PR merged on llama.cpp" on your resume that valuable?
>>108843099
>Future PRs of V4 implementation will be compared to old ones for ad hoc justification to reject them
Grim. Not beating the conspiracy allegations.