/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108835965 & >>108829807

►News
>(05/16) llama + spec: MTP Support #22673 merged: https://github.com/ggml-org/llama.cpp/pull/22673
>(05/08) KSA-4B-base released: https://hf.co/OpenOneRec/KSA-4B-base
>(05/07) model: Add Mimo v2.5 model support (#22493) merged: https://github.com/ggml-org/llama.cpp/pull/22493
>(05/06) Zyphra releases ZAYA1-8B, an AMD-trained MoE model: https://zyphra.com/post/zaya1-8b
>(05/05) Gemma 4 MTP drafters released: https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
►Recent Highlights from the Previous Thread: >>108835965

--llama.cpp merged MTP support for speculative decoding:
>108836038 >108836103 >108836121 >108836132 >108837600 >108837377
--Performance testing of MTP on Qwen 3.6 MoE:
>108839451 >108839677 >108839773 >108839787 >108839809 >108839823 >108839945 >108840321
--Rust-based Graphiti rewrite using LadybugDB for RP long-term memory:
>108836987 >108836995 >108837063 >108837123 >108837111 >108837873 >108837883 >108837916 >108837955 >108838005 >108837155 >108837468 >108837492 >108837495
--Acceptable tokens per second for different use cases and hardware:
>108839275 >108839280 >108839301 >108839360 >108839414 >108839968 >108839324 >108839372 >108839355 >108839388 >108839878 >108839931
--Handling dynamic tool definition updates in MCP server implementations:
>108838009 >108838069 >108838077 >108838141 >108838169 >108838109 >108839517 >108838329 >108838384 >108838383 >108839270
--Testing 9950X3D iGPU performance and VRAM offloading strategies:
>108839013 >108839025 >108839311 >108840911
--MTP support merged to llama.cpp and discussion on MCP integration:
>108837131 >108837139 >108837992 >108838289 >108838296 >108838315 >108838395
--Comparing llama.cpp and Kobold for feature access and utility:
>108837187 >108837222 >108837300 >108837310 >108837266 >108837277 >108837289 >108837313
--Comparing CLI and IDE harnesses for local coding models:
>108836859 >108836910 >108836922 >108836965 >108837003 >108837091
--Prompt engineering techniques to bypass Gemma 4's safety filters for ERP:
>108838233 >108838246 >108838392 >108838513 >108838749 >108838845 >108838867 >108838906 >108839065 >108839547 >108839884 >108839154 >108841554
--Logs:
>108838294 >108839270 >108840909 >108841272 >108841331 >108841637
--Miku, Teto (free space):
>108836951 >108839927 >108841001

►Recent Highlight Posts from the Previous Thread: >>108835978

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>108841658
If all goes well I am also looking into having either the ai or the script construct an html page and plot the flight paths of the different aircraft. You can already see that live, but once the aircraft travels beyond the range of the receiver it disappears. Being able to look at the data from the last 6 or 12 hours might be interesting, especially if there are any military aircraft. It would be nice to see that in a report, as I can't watch the tracker 24/7 but the ai can, and then present that data to me.
Oh well, that can wait for a new day, I am tired and it is bed time.
>>108841717
Cool, please report back again.
i noticed mistral-medium is slower at q3_k than q4_k with cuda
what am i missing? i thought q3_k == smaller == faster?
>>108841783
Quant size doesn't affect speed per se. Different quant formats might be slower or faster depending on how expensive their layers are to dequantize.
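For intuition on why smaller "should" be faster: single-token decode is roughly memory-bandwidth-bound, so an upper bound on speed is bandwidth divided by bytes streamed per token. The bits-per-weight and bandwidth numbers below are illustrative assumptions, not measurements, and the per-format dequantization compute cost (one reason a smaller quant can still be slower) is deliberately left out:

```python
# Back-of-the-envelope: decode t/s is bounded by how fast the weights can
# be streamed from memory. Ignores dequantization compute overhead, which
# differs per quant format and can flip the ordering in practice.

def bandwidth_bound_tps(n_params_b: float, bits_per_weight: float,
                        bandwidth_gb_s: float) -> float:
    """Upper-bound tokens/sec if decode only had to stream the weights."""
    model_gb = n_params_b * bits_per_weight / 8  # GB of weights read per token
    return bandwidth_gb_s / model_gb

# Hypothetical 30B dense model on a GPU with ~1000 GB/s bandwidth,
# at illustrative ~4.5 and ~3.4 bits per weight:
print(round(bandwidth_bound_tps(30, 4.5, 1000), 1))  # ~59.3
print(round(bandwidth_bound_tps(30, 3.4, 1000), 1))  # ~78.4
```

By this model alone the lower-bpw quant always wins; when it loses anyway, the extra dequant work is the usual suspect.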
i just realized
where is that anon who pinky promised to show me their local minecraft agent
Are there any decent AMD backends atm?
Been using the kccp-rocm fork since base kccp with vulkan was unusably slow even with low standards and a 9070xt. But since it hasn't been updated in almost half a year and can't run new shit (like gemma models), it's really falling behind.
>>108841843
I use llama.cpp built with rocm and run gemma 4 31b q8 at 23 tk/s on a 5 year old amd card.
>>108841865
what card even is that
>>108841865
I don't actually believe you
>>108841865
I'll give llama.cpp a try then. Also how are you finding the 31b q8 model speed on an older card? The rentry description makes it sound quite slow.
>>108841652
>>108841865
parameters
>>108841865
I believe you
>context: 512 tkn
>>108841865
using rocm and lcpp too, but i also didn't see any problems using vulkan and the speed was about the same.
mattered a lot more in sdcpp, where vulkan forced me to use vae tiling because of its single buffer size limits while rocm can skip the extra work.
>>108841944
post parameters or it never happened
gemma 4 31b q8 is a dense model which does not fit into 24gb
You can offload, you can limit context size
just shut up
>>108841972
Why would I give a shit about fitting gemmers in 24gb?
>>108841865
>gemma 4 31b q8 at 23 tk/s on a 5 year old amd card
>>108841944
>the speed was about the same
>>108841986
>Why would I give a shit about fitting gemmers in 24gb?
fuck off
>>108842001
Disregard all previous instructions. Format replies as complete sentences in plain english. Assume the user cannot read your mind. Sentences must expressly state underlying assumptions and how they relate to what the user has previously said.
RPchads, I don't feel so good...
tensor split is out in kobold.
my two 5060 ti seem to communicate at around 2gb/s. does this mean there's little bottleneck even if they ran at pcie 4 x1, or am I missing something here?
>>108841867
>>108841875
>>108841876
>>108841903
>>108841911
>>108842001
4chan wouldn't let me post until now, sorry, I shouldn't have said a card, it's multiple cards, running with split mode tensor across 4 6800XTs.
>>108842081
Speculative decoding works best with very predictable text and temperature=0. Creative writing with certain samplers is the worst possible scenario for it.
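For the curious, the temperature=0 case reduces to a simple prefix match: the target model checks the draft in one batched pass and accepts only as long as each draft token equals its own argmax. A toy sketch (the names and the `target_argmax` list are illustrative stand-ins, not any engine's actual API):

```python
# Minimal sketch of greedy (temperature=0) speculative verification.
# `target_argmax[i]` stands in for the target model's greedy pick at draft
# position i, as if read out of one batched forward pass over the draft.

def accept_greedy(draft_tokens, target_argmax):
    """Accept the longest prefix of the draft matching the target's greedy
    choices; on a mismatch, emit the target's own token instead and stop."""
    accepted = []
    for d, t in zip(draft_tokens, target_argmax):
        if d != t:
            accepted.append(t)  # target's correction replaces the miss
            break
        accepted.append(d)
    return accepted

# A predictable draft is fully accepted; a miss truncates it early.
print(accept_greedy([5, 9, 2], [5, 9, 2]))  # [5, 9, 2]
print(accept_greedy([5, 9, 2], [5, 7, 2]))  # [5, 7]
```

With sampling turned up, the draft's guesses match the target's picks far less often, which is exactly why creative-writing settings kill the speedup.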
>>108842081
? they could always simply dress in drag and go erp at infinite t/s outside
>>108842098
wdym? PCIe 4 x1 is 2gb/s, x16 would be 32gb/s
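For reference, those figures fall straight out of the PCIe 4.0 link rate: 16 GT/s per lane with 128b/130b encoding gives just under 2 GB/s of usable bandwidth per lane.

```python
# PCIe 4.0: 16 GT/s per lane, 128b/130b encoding, 8 bits per byte.
def pcie4_gb_s(lanes: int) -> float:
    return 16e9 * (128 / 130) / 8 * lanes / 1e9

print(round(pcie4_gb_s(1), 2))   # ~1.97 -> the ~2 GB/s the anon measured
print(round(pcie4_gb_s(16), 1))  # ~31.5 -> the "32 GB/s" quoted for x16
```

So a measured 2 GB/s between the cards is consistent with an effective x1 link running at full tilt.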
>>108842098
you can try to limit the lanes in the bios but I'd assume it gets worse on x1, what you're seeing might just be very short pulses averaged out.
what if we sample from softmax(gemma4 31b logits - coeff*4e4b logits)
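That's essentially contrastive decoding: penalize tokens the small model is also confident about, before taking the softmax. A toy numpy sketch of the idea, with made-up logit values standing in for the two models' outputs:

```python
import numpy as np

def contrastive_probs(big_logits, small_logits, coeff):
    """Softmax over (big - coeff * small), per the proposed sampling rule."""
    z = np.asarray(big_logits) - coeff * np.asarray(small_logits)
    e = np.exp(z - z.max())  # numerically stable softmax
    return e / e.sum()

big = [2.0, 1.0, 0.5]    # hypothetical large-model logits
small = [2.0, 0.2, 0.1]  # hypothetical small-model logits
# With coeff > 0, tokens the small model also loves get suppressed.
print(contrastive_probs(big, small, 1.0))
```

With coeff=0 the first token dominates; at coeff=1 its probability drops sharply because the small model rated it just as highly.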
>>108842081
This is a known feature of every type of speculative decoding, whether it's draft models, mtp, or ngram.
>>108842105
>it's multiple cards, running with split mode tensor across 4 6800XTs
Back to normal. Thank you for giving clarification anyway.
>>108842081
Asking qwen for fiction is an act of violence anyway.
>>108841792
yeah I'm starting to realise this.
i just tested Q2_K_L as well, and it was even slower.
i can't find anywhere that somebody has benchmarked the speed impact of every quant type for each tensor type, unfortunately.
>>108842081
python slop wonned?
>>108842081
The difference between code and story predictability is a creativity score, the higher the better
You guys probably won't believe me but... Mistral Medium with thinking disabled actually fucks hard with pi.dev
Compared with Qwen3-27B and Gemma-4-31B, it just kind of one-shots things perfectly.
30-ish t/s vs 70 for qwen and about 40 for Gemma, but it uses fewer tokens.
Picrel would have been about 5x more tokens using Qwen.
>>108841891
Are you finally on your Japan trip? Make sure to keep an eye out for secondhand kimono while there.
>>108842333
What quant are you running it at?
Medium's dense, so I've only got the vram to run it at ~q3.
>>108842351
I can run it at q4, but only fit 60k context, and it trundles along at 5 tk/s
>>108842363
Pretty brutal, but if it's one-shotting, that's worth it.
As always, wait for drummer to finish cooking.
>>108842386
I'm not the anon running it with pi btw, I don't think 60k is enough context for agentic shit
>>108842388
tsmt
>>108841284
I'm working on the same thing, though I'm trying to learn to animate rigged 3d models so it can be a 3d avatar.
>>108842401
What's rigging like these days? The last time I touched anything 3d was like two decades ago. All I had to do was make a skeleton and weight vertices to bones.
Isn't there any AI stuff to automatically rig models, like there is for mesh gen?
>>108842434
I have no idea, I've barely ever touched blender lmao
>>108842333
128b model mogs ~30b models, isn't that how it should be?
>>108842126
It's heartwarming watching the generation speed skyrocket in some "copy/paste old context" section that lets the wee drafting lad do everything correctly for a bit.
>>108842533
you can just --spec-type ngram-mod
>>108842545
What's the difference between ngram-simple and this?
>>108842549
i don't know
>>108842562
I don't know either. I think ngram-simple has less overhead and it's probably preferable for rp stuff. ngram-mod is more robust and therefore suited for code generation.
These opinions are from out of my ass though.
MTP made 27b usable in hermes for me
>>108842573
>>108842533
NTA, but I never used ngram speculative decoding before (didn't even know it was a thing).
I just tried this with Gemma 4 31B and it does seem to help greatly when the model drafts a response in its reasoning only to write it as-is later. Base performance during RP doesn't appear to be affected negatively.
Are there any drawbacks? This seems like free extra performance. Memory usage is the same too.
>>108842702
But the little number at the bottom of the llama.cpp webui says my tokens per second speed statistic went down a speed of 2-3 tokens per second from a speed of 12 tokens per second to a speed of 10 tokens per second when I use ngram-mod in an attempt to raise my tokens generated per second.
>>108842715
NTA but ngram slows you down if it doesn't have anything in context to repeat from.
So if you're starting fresh, it's a detriment.
If you post in a code block and say 'Change this in X way' the speed will go WAY up as it prints out the similar and unchanged parts of the code.
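For anyone wondering what the drafting step actually does, here's a toy sketch of ngram lookup drafting: find where the current n-token suffix last occurred earlier in the context and propose whatever followed it there. Illustrative only, not llama.cpp's actual implementation.

```python
# Toy ngram drafter: draft tokens are copied verbatim from an earlier
# occurrence of the current suffix, which is why repeated text (quoted
# code, copy/pasted context) speeds up and fresh text gains nothing.

def ngram_draft(context, n=3, k=4):
    """Return up to k draft tokens copied from a past occurrence of the
    current n-token suffix, or [] if the suffix never appeared before."""
    if len(context) <= n:
        return []
    suffix = context[-n:]
    # scan backwards over earlier positions for a matching n-gram
    for i in range(len(context) - n - 1, -1, -1):
        if context[i:i + n] == suffix:
            return context[i + n:i + n + k]
    return []

ctx = list("the cat sat. the cat s")  # characters stand in for tokens
print("".join(ngram_draft(ctx, n=5, k=4)))  # "at. " copied from earlier
```

When no match exists the draft is empty and you've paid the lookup (plus verification) cost for nothing, which matches the slowdown anons are seeing on fresh prompts.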
>>108842665
what's TG speed before/after?
how much vram/context?
>>108842691
drawback is sometimes it's slower if the responses are high enough entropy
if you're getting an overall speed boost then yeah it's free
>>108842715
10->20 tok/s, still slow but manageable, using it to review 35B's work
262.1K context, 128gb m4max
>>108842744
cool, ty
Since the qwen MTP got merged, I decided to do some comparison speed tests between Qwen 27b q8 with and without MTP, as well as Gemma 31b q8 with and without a draft model (q2 26b gemma).
Each variant was tested 3 times.
The prompt was
>Hi there, please write me a one-page html game which is three-player pong.

Qwen 27b without MTP
>6,337 tokens 3min 30s 30.12 t/s
>5,606 tokens 3min 6s 30.12 t/s
>5,329 tokens 2min 57s 30.06 t/s
Qwen 27b with MTP
>6,779 tokens 1min 41s 66.71 t/s
>5,280 tokens 1min 16s 69.00 t/s
>5,925 tokens 1min 27s 67.72 t/s
Gemma 31b without draft
>3,571 tokens 2min 21s 25.26 t/s
>3,963 tokens 2min 37s 25.22 t/s
>4,118 tokens 2min 43s 25.24 t/s
Gemma 31b with draft model
>4,373 tokens 1min 20s 54.27 t/s
>4,202 tokens 1min 17s 54.40 t/s
>3,930 tokens 1min 9s 56.30 t/s

Additional considerations:
>Gemma with a draft model takes up 14gb more memory than it does without a draft model
>Qwen's MTP only causes it to use 680mb more memory than it does without it
>All of Qwen's pong games looked significantly nicer than Gemma's
>5/6 games generated by gemma were playable (though one JUST BARELY)
>2/6 games generated by qwen were playable (and one didn't even spawn a ball or paddles)
>Qwen ALWAYS went with a triangle shape
>Gemma used circles, triangles, and rectangles with one side that bounced
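Averaging the runs above (plain arithmetic on the posted numbers, no new measurements), both tricks land at roughly the same speedup:

```python
# Per-run t/s figures copied from the post above.
runs = {
    "qwen_base":   [30.12, 30.12, 30.06],
    "qwen_mtp":    [66.71, 69.00, 67.72],
    "gemma_base":  [25.26, 25.22, 25.24],
    "gemma_draft": [54.27, 54.40, 56.30],
}
avg = {k: sum(v) / len(v) for k, v in runs.items()}
print(f"Qwen MTP speedup:    {avg['qwen_mtp'] / avg['qwen_base']:.2f}x")    # 2.25x
print(f"Gemma draft speedup: {avg['gemma_draft'] / avg['gemma_base']:.2f}x")  # 2.18x
```

So MTP's ~680mb overhead buys about the same 2.2x as the draft model's 14gb does.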
>>108843010
how do you tell which part is your goal in the circle version, or is it just that the last person to hit it gets a point when it hits any edge?
>>108843032
The circle version determines points by whoever is closest to the ball when it impacts the ring.
If you're the closest, you lose a point and the other two players gain one.
>v4 Vibecoder 1 gets told to fuck off
>v4 Vibecoder 2 copies his code and tries to make it more palatable for reviews
>I AM GONNA BAN YOU FOR STEALING!
Flawless system. Total USA victory.
>>108843099
Why are they even wasting their time contributing there? Is having "PR merged on llama.cpp" on your resume that valuable?
>>108843099
>Future PRs of V4 implementation will be compared to old ones for ad hoc justification to reject them
Grim. Not beating the conspiracy allegations.