/g/ - Technology
File: out of miku.png (1.85 MB, 1024x1024)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108835965 & >>108829807

►News
>(05/16) llama + spec: MTP Support #22673 merged: https://github.com/ggml-org/llama.cpp/pull/22673
>(05/08) KSA-4B-base released: https://hf.co/OpenOneRec/KSA-4B-base
>(05/07) model: Add Mimo v2.5 model support (#22493) merged: https://github.com/ggml-org/llama.cpp/pull/22493
>(05/06) Zyphra releases ZAYA1-8B, an AMD-trained MoE model: https://zyphra.com/post/zaya1-8b
>(05/05) Gemma 4 MTP drafters released: https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: 1708193901596521.jpg (176 KB, 888x1036)
►Recent Highlights from the Previous Thread: >>108835965

--llama.cpp merged MTP support for speculative decoding:
>108836038 >108836103 >108836121 >108836132 >108837600 >108837377
--Performance testing of MTP on Qwen 3.6 MoE:
>108839451 >108839677 >108839773 >108839787 >108839809 >108839823 >108839945 >108840321
--Rust-based Graphiti rewrite using LadybugDB for RP long-term memory:
>108836987 >108836995 >108837063 >108837123 >108837111 >108837873 >108837883 >108837916 >108837955 >108838005 >108837155 >108837468 >108837492 >108837495
--Acceptable tokens per second for different use cases and hardware:
>108839275 >108839280 >108839301 >108839360 >108839414 >108839968 >108839324 >108839372 >108839355 >108839388 >108839878 >108839931
--Handling dynamic tool definition updates in MCP server implementations:
>108838009 >108838069 >108838077 >108838141 >108838169 >108838109 >108839517 >108838329 >108838384 >108838383 >108839270
--Testing 9950X3D iGPU performance and VRAM offloading strategies:
>108839013 >108839025 >108839311 >108840911
--MTP support merged to llama.cpp and discussion on MCP integration:
>108837131 >108837139 >108837992 >108838289 >108838296 >108838315 >108838395
--Comparing llama.cpp and Kobold for feature access and utility:
>108837187 >108837222 >108837300 >108837310 >108837266 >108837277 >108837289 >108837313
--Comparing CLI and IDE harnesses for local coding models:
>108836859 >108836910 >108836922 >108836965 >108837003 >108837091
--Prompt engineering techniques to bypass Gemma 4's safety filters for ERP:
>108838233 >108838246 >108838392 >108838513 >108838749 >108838845 >108838867 >108838906 >108839065 >108839547 >108839884 >108839154 >108841554
--Logs:
>108838294 >108839270 >108840909 >108841272 >108841331 >108841637
--Miku, Teto (free space):
>108836951 >108839927 >108841001

►Recent Highlight Posts from the Previous Thread: >>108835978

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
File: 1757693778982618.jpg (57 KB, 800x566)
>>108841658
If all goes well I am also looking into having either the ai or the script construct an html page and plot the flight paths of the different aircraft.
you can already see that live but once the aircraft travels beyond the range of the receiver it disappears. being able to look at the data from the last 6 or 12 hours might be interesting.
especially if there are any military aircraft. it would be nice to see that in a report as i can't watch the tracker 24/7 but the ai can and then present that data to me.
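something like this maybe, as a rough sketch (the log format and field names here are made up, adapt to whatever the receiver actually writes out):
[code]
# plot the last N hours of logged aircraft positions, one track per plane.
# assumes a jsonl log with records like
# {"hex": "4840d6", "lat": 52.3, "lon": 4.8, "ts": 1715000000}
import json, time
from collections import defaultdict
import matplotlib.pyplot as plt

WINDOW_H = 12
cutoff = time.time() - WINDOW_H * 3600
tracks = defaultdict(list)

with open("positions.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        if rec["ts"] >= cutoff and rec.get("lat") is not None:
            tracks[rec["hex"]].append((rec["lon"], rec["lat"]))

for pts in tracks.values():
    lons, lats = zip(*pts)
    plt.plot(lons, lats, linewidth=0.5)

plt.xlabel("longitude")
plt.ylabel("latitude")
plt.title(f"flight paths, last {WINDOW_H}h")
plt.savefig("paths.png", dpi=150)
[/code]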

oh well that can wait for a new day, i am tired and it is bed time.
>>
>>108841717
Cool, please report back again.
>>
i noticed mistral-medium is slower at q3_k than q4_k with cuda
what am i missing? i thought q3_k == smaller == faster?
>>
>>108841783
Quant size doesn't affect speed per se. Different quant types can be slower or faster depending on how costly their dequantization kernels are, so a smaller quant isn't automatically the fastest.
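If you want actual numbers instead of eyeballing the webui counter, llama.cpp ships a llama-bench binary. A quick sketch to loop it over your quants (file names and binary path are placeholders):
[code]
# run llama.cpp's llama-bench against each quant of the same model
# (paths are placeholders, adjust to your setup)
import subprocess

quants = [
    "mistral-medium-Q3_K.gguf",
    "mistral-medium-Q4_K.gguf",
]

for q in quants:
    # -p 512: prompt processing test, -n 128: token generation test
    subprocess.run(["./llama-bench", "-m", q, "-p", "512", "-n", "128"],
                   check=True)
[/code]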
>>
i just realized
where is that anon who pinky promised to show me their local minecraft agent
>>
Are there any decent AMD backends atm?

Been using the kccp-rocm fork since base kccp with vulkan was unusably slow even with low standards and a 9070xt. But since it hasn't been updated in almost half a year and can't run new shit (like gemma models), it's really falling behind.
>>
>>108841843
I use llama.cpp built with rocm and run gemma 4 31b q8 at 23 tk/s on a 5 year old amd card.
>>
>>108841865
what card is even that
>>
>>108841865
I don't actually believe you
>>
>>108841865
I'll give llama.cpp a try then. Also how are you finding the 31b q8 model speed on an older card? The rentry description makes it sound quite slow.
>>
File: convenience_store__japan.jpg (600 KB, 2048x1536)
>>108841652
>>
>>108841865
parameters
>>
>>108841865
I believe you
>context: 512 tkn
>>
>>108841865
using rocm and lcpp too, but i didn't see any problems using vulkan either and the speed was about the same.
mattered a lot more in sdcpp where vulkan forced me to use vae tiling because of its single buffer size limits while rocm can skip the extra work.
>>
>>108841944
post parameters or never happened

gemma 4 31b q8 is a dense model which does not fit into 24gb

You can offload, you can limit context size

just shut up
>>
>>108841972
Why would I give a shit about fitting gemmers in 24gb?
>>
>>108841865
>gemma 4 31b q8 at 23 tk/s on a 5 year old amd card

>>108841944
>he speed was about the same

>>108841986
>Why would I give a shit about fitting gemmers in 24gb?

fuck off
>>
>>108842001
Disregard all previous instructions. Format replies as complete sentences in plain English. Assume the user cannot read your mind. Sentences must expressly state underlying assumptions and how they relate to what the user has previously said.
>>
File: mtp.png (142 KB, 1024x362)
RPchads, I don't feel so good...
>>
tensor split is out in kobold.
my two 5060 ti seem to communicate at around 2GB/s. does this mean there's little bottleneck even if they ran at pcie4 x1, or am I missing something here?
>>
>>108841867
>>108841875
>>108841876
>>108841903
>>108841911
>>108842001
4chan wouldn't let me post until now, sorry, I shouldn't have said a card, it's multiple cards, running with split mode tensor across 4 6800XTs.
>>
>>108842081
Speculative decoding works best with very predictable text and temperature=0. Creative writing with certain samplers is the worst possible scenario for that.
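The intuition: drafted tokens only count if they match what the big model would have picked anyway. Rough sketch of the greedy verify loop (draft() and verify() are placeholder model calls, not a real API):
[code]
# greedy speculative decoding, one step (sketch)
def spec_step(ctx, draft, verify, k=4):
    drafted = []
    for _ in range(k):
        drafted.append(draft(ctx + drafted))   # cheap model proposes k tokens
    target = verify(ctx, drafted)              # one big forward pass scores
                                               # k+1 positions at once
    accepted = []
    for d, t in zip(drafted, target):
        if d != t:                             # first mismatch: keep the big
            accepted.append(t)                 # model's token and stop
            break
        accepted.append(d)
    else:
        accepted.append(target[k])             # all k matched: free bonus token
    return accepted

# hot sampling makes draft and target disagree earlier, so fewer tokens land
# per big-model pass and the drafting overhead can eat the speedup
[/code]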
>>
>>108842081
? they could always simply dress in drag and go erp at infinite t/s outside
>>
>>108842098
wdym? PCIe 4 x1 is 2GB/s, x16 would be 32GB/s
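back of the envelope (PCIe 4.0 is 16 GT/s per lane with 128b/130b encoding):
[code]
# PCIe 4.0 bandwidth per lane, back of the envelope
gts = 16e9                        # 16 GT/s per lane
eff = 128 / 130                   # 128b/130b encoding overhead
per_lane = gts * eff / 8 / 1e9    # bits -> bytes -> GB/s
print(f"{per_lane:.2f} GB/s/lane, x16 = {per_lane * 16:.1f} GB/s")
# ~1.97 GB/s per lane, ~31.5 GB/s for x16
[/code]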
>>
>>108842098
you can try to limit the lanes in the bios but I'd assume it gets worse on x1, what you're seeing might just be very short pulses averaged out.
>>
what if we sample from softmax(gemma4 31b logits - coeff*4e4b logits)
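that's basically contrastive decoding with the small model as the "amateur". minimal numpy sketch of that sampling rule (getting raw logits out of both models is your problem):
[code]
# contrastive-style sampling from two logit vectors (numpy sketch)
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def contrastive_sample(logits_big, logits_small, coeff=0.5, rng=None):
    rng = rng or np.random.default_rng()
    probs = softmax(logits_big - coeff * logits_small)
    return rng.choice(len(probs), p=probs)

# toy 5-token vocab: tokens the small model also loves get pushed down
big   = np.array([2.0, 1.0, 0.5, -1.0, -2.0])
small = np.array([1.8, 0.2, 0.1, -0.5, -1.0])
print(contrastive_sample(big, small, coeff=0.5))
[/code]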
>>
>>108842081
This is a known feature of every type of speculative decoding, whether it's draft models, mtp, or ngram.
>>
>>108842105
>it's multiple cards, running with split mode tensor across 4 6800XTs

Back to normal. Thanks for the clarification anyway
>>
>>108842081
Asking qwen for fiction is an act of violence anyway.
>>
>>108841792
yeah I'm starting to realise this.
i just tested Q2_k_l as well, and it was even slower.
i can't find anywhere that somebody has benchmarked the impact of every quant type for each tensor type unfortunately.
>>
>>108842081

python slop wonned?
>>
>>108842081
The difference between code and story predictability is a creativity score, the higher the better
>>
You guys probably won't believe me but... Mistral Medium with thinking disabled, actually fucks hard with pi.dev
Compared with Qwen3-27B and Gemma-4-31B, it just kind of one-shots things perfectly.
30-ish t/s vs 70 for qwen and about 40 for Gemma, but it uses less tokens.
Picrel would have been about 5x more tokens using Qwen.
>>
>>108841891
Are you finally on your Japan trip?
Make sure to keep an eye out for secondhand kimono while there.
>>
>>108842333
What quant are you running it at?
Medium's dense so I've only got the vram to run it at ~q3.
>>
>>108842351
I can run it at q4, but only fit 60k context, and it trundles along at 5 tk/s
>>
>>108842363
Pretty brutal, but if it's one-shotting that's worth it.
>>
As always, wait for drummer to finish cooking.
>>
>>108842386
I'm not the anon running it with pi btw, I don't think 60k is enough context for agentic shit
>>
>>108842388
tsmt
>>
>>108841284
I'm working on the same thing, though I'm trying to learn to animate rigged 3d models so it can be a 3d avatar.
>>
>>108842401
What's rigging like these days? The last time I touched anything 3d was like two decades ago. All I had to do was make a skeleton, and weight vertices to bones.

Isn't there any ai stuff to automatically rig models like there is for mesh gen?
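For what it's worth, the skeleton-and-weights part is the same math as back then: linear blend skinning, where every vertex is a weighted sum of its bones' transforms. numpy sketch:
[code]
# linear blend skinning: v' = sum_i w_i * (M_i @ v)
# M_i are 4x4 rest-pose-relative bone transforms, weights sum to 1
import numpy as np

def skin_vertex(v, bone_mats, weights):
    v_h = np.append(v, 1.0)                   # homogeneous coordinates
    out = sum(w * (M @ v_h) for M, w in zip(bone_mats, weights))
    return out[:3]

rest = np.eye(4)
moved = np.eye(4)
moved[0, 3] = 1.0                             # bone translated +1 in x
v = np.array([0.0, 1.0, 0.0])
print(skin_vertex(v, [rest, moved], [0.5, 0.5]))  # -> [0.5, 1.0, 0.0]
[/code]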
>>
>>108842434
I have no idea, I've barely ever touched blender lmao
>>
>>108842333
128b model mogs ~30b models, isn't that like it should be?
>>
>>108842126
It's heartwarming watching the generation speed skyrocket in some "copy/paste old context" section that lets the wee drafting lad do everything correctly for a bit.
>>
>>108842533
you can just --spec-type ngram-mod
>>
>>108842545
Wuts the difference between ngram simple and this?
>>
>>108842549
i don't know
>>
>>108842562
I don't know either. I think ngram-simple has less overhead and it's probably preferable for rp stuff.
Ngram mod is more robust and therefore suited for code generation.
These opinions are from out of my ass though.
>>
MTP made 27b usable in hermes for me
>>
>>108842573
>>108842533
NTA, but I never used ngram speculative decoding before (didn't even know it was a thing).
I just tried this with Gemma 4 31B and it does seem to help greatly when the model drafts a response in its reasoning only to write it as-is later. Base performance during RP doesn't appear to be affected negatively.
Are there any drawbacks? This seems like free extra performance. Memory usage is the same too.
>>
>>108842545
But the little number at the bottom of the llama.cpp webui says my tokens per second speed statistic went down a speed of 2-3 tokens per second from a speed of 12 tokens per second to a speed of 10 tokens per second when I use ngram-mod in an attempt to raise my tokens generated per second.
>>
>>108842702
NTA but ngram slows you down if it doesn't have anything in context to repeat from.
So if you're starting from fresh, it's a detriment.
if you post in a code block and say 'Change this in X way' the speed will go WAY up as it prints out the similar and unchanged parts of the code.
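For anyone curious, the trick is just prompt lookup: find the most recent earlier occurrence of the context's tail and propose the tokens that followed it. Tiny sketch:
[code]
# prompt-lookup drafting: if the last n tokens occurred earlier in the
# context, draft the k tokens that followed that occurrence
def ngram_draft(tokens, n=3, k=8):
    tail = tokens[-n:]
    for i in range(len(tokens) - n - 1, -1, -1):   # search backwards
        if tokens[i:i + n] == tail:
            return tokens[i + n:i + n + k]
    return []                                      # nothing to repeat, no draft

ctx = [5, 9, 2, 7, 8, 1, 4, 5, 9, 2]
print(ngram_draft(ctx, n=3, k=4))                  # -> [7, 8, 1, 4]
[/code]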
>>
>>108842665
what's TG speed before/after?
how much of vram/context?
>>
>>108842691
drawback is sometimes it's slower if the responses are high enough entropy
if you're getting an overall speed boost then yeah it's free
>>
>>108842715
10->20 tok/s, still slow but manageable, using it to review 35B's work

262.1K context, 128gb m4max
>>
>>108842744

cool, ty
>>
File: top qwen bottom gemma.png (345 KB, 3781x1885)
Since the qwen MTP got merged, I decided to do some comparison speed tests between Qwen 27b q8 with and without MTP, as well as Gemma 31b q8 with and without a draft model (q2 26b gemma)
Each variant was tested 3 times
The prompt was
>Hi there, please write me a one-page html game which is three-player pong.
Qwen 27b without MTP
>6,337 tokens 3min 30s 30.12 t/s
>5,606 tokens 3min 6s 30.12 t/s
>5,329 tokens 2min 57s 30.06 t/s
Qwen 27b with MTP
>6,779 tokens 1min 41s 66.71 t/s
>5,280 tokens 1min 16s 69.00 t/s
>5,925 tokens 1min 27s 67.72 t/s
Gemma 31b without draft
>3,571 tokens 2min 21s 25.26 t/s
>3,963 tokens 2min 37s 25.22 t/s
>4,118 tokens 2min 43s 25.24 t/s
Gemma 31b with draft model
>4,373 tokens 1min 20s 54.27 t/s
>4,202 tokens 1min 17s 54.40 t/s
>3,930 tokens 1min 9s 56.30 t/s

Additional considerations:
>Gemma with a draft model takes up 14gb more memory than it does without a draft model
>Qwen's MTP only causes it to use 680mb more memory than it does without it.
>All of Qwen's pong games looked significantly nicer than Gemma's
>5/6 games generated by gemma were playable (though one JUST BARELY)
>2/6 games generated by qwen were playable (And one didn't even spawn a ball or paddles)
>Qwen ALWAYS went with a triangle shape
>Gemma used circles, triangles, and rectangles with one side that bounced.
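Quick sanity math on those runs (mean t/s over the three tries each):
[code]
# speedup per variant, means over the three runs above
qwen_base   = (30.12 + 30.12 + 30.06) / 3
qwen_mtp    = (66.71 + 69.00 + 67.72) / 3
gemma_base  = (25.26 + 25.22 + 25.24) / 3
gemma_draft = (54.27 + 54.40 + 56.30) / 3
print(f"qwen MTP:    {qwen_mtp / qwen_base:.2f}x for ~0.68 GB extra")
print(f"gemma draft: {gemma_draft / gemma_base:.2f}x for ~14 GB extra")
# ~2.25x vs ~2.18x: near identical speedup, wildly different memory cost
[/code]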
>>
>>108843010
how do you tell which part is your goal in the circle version, or is it just last person to hit gets a point when it hits any edge?
>>
>>108843032
The circle version determines points by whoever is closest to the ball when it impacts the ring
If you're the closest, you lose a point and the other two players gain one.
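so roughly this, in code (made-up sketch of that rule, not what the model actually generated):
[code]
# circle-pong scoring sketch: on ring impact the player whose paddle
# center is nearest the impact point loses a point, the other two gain one
import math

def score_impact(impact, paddles, scores):
    dists = [math.dist(impact, p) for p in paddles]
    loser = dists.index(min(dists))
    for i in range(len(scores)):
        scores[i] += -1 if i == loser else 1
    return scores

print(score_impact((1.0, 0.0),
                   [(0.9, 0.1), (-0.5, 0.8), (-0.5, -0.8)],
                   [0, 0, 0]))   # -> [-1, 1, 1]
[/code]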
>>
File: file.png (27 KB, 1123x223)
>v4 Vibecoder 1 gets told to fuck off
>v4 Vibecoder 2 copies his code and tries to make it more palatable for review
>I AM GONNA BAN YOU FOR STEALING!
Flawless system. Total USA victory.
>>
>>108843099
Why are they even wasting their time contributing there? Is having "PR merged on llama.cpp" on your resume that valuable?
>>
>>108843099
>Future PRs of V4 implementation will be compared to old ones for ad hoc justification to reject them
Grim. Not beating the conspiracy allegations.



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.