/lmg/ - a general dedicated to the discussion and development of local language models.
Previous threads: >>108943155 & >>108937312
►News
>(05/29) Step 3.7 Flash released https://hf.co/stepfun-ai/Step-3.7-Flash
>(05/21) Hy-MT2 “fast-thinking” translation models released: https://hf.co/collections/tencent/hy-mt2
>(05/20) Cohere releases Command A+ 218B-A25B: https://cohere.com/blog/command-a-plus
>(05/16) llama + spec: MTP Support #22673 merged: https://github.com/ggml-org/llama.cpp/pull/22673
>(05/08) KSA-4B-base released: https://hf.co/OpenOneRec/KSA-4B-base
►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png
►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide
►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers
►Benchmarks
LiveBench: https://livebench.ai
Programming: https://swe-rebench.com
Agentic Coding: https://deepswe.datacurve.ai
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second
►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>108949589
>embedding model into ASIC
Retarded. By the time you have produced them, the model and inference algos will already be obsolete. AI progress will continue to accelerate, which will keep GPU dominant.
>photonic computing
Not going to happen, at least not any time soon. Photonic elements are 1000000 times larger and scaling them is much more difficult. There's no optical transistor analog either. MOSFET is a switch with gain.
►Recent Highlights from the Previous Thread: >>108943155
--K2.6 vision reasoning inefficiency and comparisons for Japanese OCR:
>108945101 >108945153 >108945258 >108945355 >108945456 >108945597 >108945828 >108945689
--Comparing local TTS tools and VRAM management for voice cloning:
>108946129 >108946180 >108946215 >108946290 >108946335 >108946191 >108946299 >108946446 >108947364 >108946755 >108946593 >108946708
--Using Gemma to analyze character card metadata:
>108948623 >108948657 >108948667 >108948690 >108948713
--DeepSeek-V3.2-8bit testing and utilizing Ngrams for multi-step prompting:
>108943171 >108943197 >108943225
--Criticizing the excessive size of FLUX.2's text encoder:
>108947575 >108948053 >108948065 >108948108 >108948113 >108948202 >108948316 >108948354
--Seeking out-of-distribution code benchmarks:
>108943794 >108943833 >108943882 >108943983
--Comparing -sm tensor and -sm layer performance and OS overhead:
>108943313 >108943337 >108943346 >108943442 >108943347
--Performance reports for Step 3.7 flash Q4_K_S on 6x 3090s:
>108943287 >108943316 >108943345 >108943383 >108943493
--Searching and clustering local images using embedding models:
>108945943 >108945957 >108946013 >108946053 >108946091 >108946120 >108946183 >108946230
--Using YOLO for efficient face detection and programmatic blurring:
>108943393 >108943603 >108944480 >108943486 >108943501 >108943523
--Development suggestions for Orb frontend image integration and default characters:
>108943543 >108943593 >108943617 >108943638 >108944220 >108947109
--Comparison of Step 3.7 Flash IQ_S and Gemma performance:
>108947178 >108947185
--Logs:
>108944220 >108945980 >108946414 >108948363 >108948623 >108948690 >108949772
--Miku, Neru (free space):
>108945329 >108947695
--Dipsy and Kimi (extra space):
>108943198 >108944222 >108944260 >108944357 >108944373 >108944406
►Recent Highlight Posts from the Previous Thread: >>108943182
Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>108949302
Cool tool, especially for someone like PewDiePie, but he should really work on moderating his issue page; it's pretty unusable already.
summer release season is starting
big things ahead
Mikulove
>had 3 versions of the same one doomer "ai uprising" story in my algo recently
Humans are slop even without AI
My middleware/front-end I've been working on (ignore the model)
>>108947372
>If you have vram left, you should run a higher quant or have a larger context, nobody should waster it on tts, realistically
Off the top of my head, the biggest TTS models consume like 8-10 GB, which could come down with quantization. That's small enough to page in and out of VRAM over PCIe and stay realtime. It just introduces another fraction-of-a-second source of latency: 1/4 to 1/3 of a second on PCIe 4.0, and half that on 5.0. This is lower than the latency you'll get in practice if your TTS model can't support streaming (alone).
Still, a model that's real-time on CPU and high quality would be ideal. No such model exists today. I wonder how feasible it is; diffusion is a non-starter for any model you intend to run on CPU.
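The latency figures in the post above can be reproduced with a back-of-the-envelope calculation. This is a minimal sketch: the bandwidth values are assumed theoretical peaks for a x16 link (roughly 32 GB/s for PCIe 4.0 and 64 GB/s for 5.0), and `page_in_seconds` is a hypothetical helper name. Real transfers will be slower than the peak.

```python
# Rough time to page a model's weights into VRAM over PCIe.
# Assumption: approximate theoretical x16 peaks, one direction, in GB/s.
PCIE_X16_GB_PER_S = {"4.0": 32.0, "5.0": 64.0}


def page_in_seconds(model_gb: float, pcie_gen: str) -> float:
    """Best-case transfer time for model_gb gigabytes over a x16 link."""
    return model_gb / PCIE_X16_GB_PER_S[pcie_gen]


for gen in ("4.0", "5.0"):
    for size_gb in (8, 10):
        secs = page_in_seconds(size_gb, gen)
        print(f"PCIe {gen} x16, {size_gb} GB: {secs:.3f} s")
```

Under these assumptions an 8-10 GB model takes about 0.25 to 0.31 s on PCIe 4.0 and half that on 5.0, which matches the range quoted in the post.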
>character somehow knows your name even though you never told them what it is
do all models do this?
>>108949983yeah all local models (which are generally very bad) do this
>>108949550
>so your argument is computers can't scale
im sure they can scale to meet the dema-
you've reached the quota limit, please try again later
Why are people saying kimi 2.6 vision model is better at details than gemma 4? Isn't gemma the bigger one?
>>108949983Larger models can be pretty good about it, but it's basically a coinflip for anything under 100b. I've seen even nemo ask for names, unsure of who you are before, but it wasn't often.
>>108949983gemmAGI does not do this
>>108949998>Isn't gemma the bigger one?Kimi is 30x bigger than Gemma.
>>108949983only big models can resist this, and only specific ones
>>108950011you know nothing
>>108950011
Not the vision encoder part
Kimi: Parameters of Vision Encoder 400M
Gemma 4 31B: Vision Encoder Parameters ~550M
>>108949998Kimi's been trained to use its encoder a lot better so it knows what is being shown to it.
>>108949998people confuse knowledge with visionthey see kimi recognize some shitty anime character from 2007 and go 'wow good vision and i spent $20000 on a rig that can run this because i have no self control or life so i really must like this and convince myself that this was worth it' when gemma is just a better model who can see everything much better and strongly
>>108950024
>>108950036
My bad, I didn't realize you specifically meant the vision encoder.
I'd still say a combination of kimi's massive parameters to interpret what's in the image coupled with the obsessive thinking would give it the edge, though.
>>108950006I just had q4 gemma 31b mess up with this a couple times even with specifically prompting against it. Might be a quant issue with gemma though since I'm not running q8.
>>108949983>character never tells you their name and insists that they already did when you ask them after a few responses later
>>108950078
Nah, I'm running Gemma at q8 and it still happens more often than not, although specifically on the first character that enters the scenario.
It almost never happens with characters introduced afterwards, who correctly call the user "stranger" or by a descriptor. Odd quirk.
>>108950061
Sounds like sour grapes from someone who can't run large models. All I know is I feed images to kimi and it does what I want pretty well every time to a high level of quality.
Do you have some evidence that the extra 100M of the gemma encoder mmproj makes a bigger difference than the main model's parameter count difference?
>>108949983
ANY kind of secret or hidden knowledge is inconsistent and unreliable.
Trying out MTP on llama.cpp and confused. My understanding is the acceptance rate only represents the % of tokens that were accepted by the supervising model. Because the supervising model regenerates all the rejected tokens, the quality/accuracy of the response does not change. Correct me if I'm wrong.
If my understanding is correct, assuming you're maximizing for t/s, is the idea to always select the --spec-draft-n-max value that gives you the highest t/s regardless of what the acceptance rate is?
>>108949983that's just part of the bigger problem about llms when it comes to roleplayall interactions boil down to "HELLO I AM CHARACTER. I DO THE THING MENTIONED IN THE CHARACTER BIO" *her pussy drips a shimmering wet drop down her pink lace-tipped stockings with purple crisscross patterns that are mentioned in the character bio. the moisture doesn't just seep into the darkening fabric—it *soaks*.
>>108950111
The model outputs will be the same within floating point rounding error; anything else is a bug.
So yes, just pick the value that performs the best.
>>108950120
>speculative decoding
onions
>MTP
based
crazy how spec decoding just needed a rebranding to be popular
>>108949983
>character somehow knows my address, social security number, credit cards, everything
Holy fuark.
>>108950095nta but you can probably feed the 31B as a quick test if you have the compute to run kimi
>>108950120
Out of curiosity, what does that mean when not using greedy sampling?
Does the draft token just need to be one of the possible tokens in the final logit?
it's insane how the comprehension cliff around 16K context is still there
>>108949983Gemmy would never
>>108950154are you using a setup that mentions your name anywhere?
>>108950159
The one way I can think of is when she starts a game of chess: the chess game status tool does actually return the names of both players, so kind of.
>>108950148
The reason why LLM inference is slow is the autoregressive sampling: you need to sample tokens one at a time so the matrix multiplications are inefficient.
If you somehow know the text that will be generated ahead of time then you can process all tokens in parallel and the matrix multiplications are much more efficient - that's why the speed for the prompt is so much higher than when generating tokens even though it's basically doing the same thing under the hood.
With speculative decoding methods you generate a draft that you think the model will generate, the model then processes all draft tokens in parallel and keeps them up to the point where the draft and the model agree.
However, the exact floating point rounding errors depend on the batch size with which you evaluate the model - the whole reason why the evaluation is faster in the first place is that you can use more efficient kernels after all.
Sampling is not relevant here.
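The "keep them up to the point where the draft and the model agree" step can be sketched for the greedy case as follows. This is a toy illustration, not llama.cpp's code: `toy_target` and `verify_draft` are hypothetical names, the "model" is a stand-in function, and the loop checks positions one at a time where a real engine would score all draft positions in a single batched forward pass.

```python
from typing import Callable, List, Sequence


def toy_target(prefix: Sequence[int]) -> int:
    """Stand-in for the main model's greedy choice: last token plus one, mod 10."""
    return (prefix[-1] + 1) % 10


def verify_draft(
    target_next: Callable[[Sequence[int]], int],
    context: Sequence[int],
    draft: Sequence[int],
) -> List[int]:
    """Return the tokens that get emitted for one speculative step.

    Draft tokens are kept while they match the target's own greedy choice.
    At the first mismatch the target's token is emitted instead and the
    rest of the draft is discarded. If every draft token matches, the
    target contributes one extra token after the draft.
    """
    prefix = list(context)
    emitted: List[int] = []
    for tok in draft:
        expected = target_next(prefix)
        if tok != expected:
            emitted.append(expected)  # correction from the main model
            return emitted
        emitted.append(tok)
        prefix.append(tok)
    emitted.append(target_next(prefix))  # whole draft accepted: one bonus token
    return emitted


print(verify_draft(toy_target, [1, 2, 3], [4, 5, 7, 8]))  # [4, 5, 6]
print(verify_draft(toy_target, [1, 2, 3], [4, 5, 6]))     # [4, 5, 6, 7]
```

In both cases the emitted tokens are exactly what plain token-by-token greedy decoding would have produced, so a bad draft costs speed but not output quality.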
>>108950134
You're absolutely right - MTP is not just speculative decoding, it's a speedup.
>>108950154usually in rp you supply a {user} so the context knows who is who and doesn't speak for {user}
2080Ti 11GB or 3060 12GB as a second card? Both around the same price but I'm guessing the 2080 is gonna be quite faster (double-ish bandwidth?)
>>108950187
3060, as it will be supported longer and has better optimizations. In fact, the 3000 series is the bare minimum for LLM stuff.
>>108950187
I would stick to the same generation you already have
or higher generation if you're rolling along and replacing piece by piece
mtp/spec decoding will revive finetuning
it's only a matter of time until someone makes rp tunes of the mtp/speculative parts of models which actually give us speed ups for rp and not just code slop
>>108950183You must be pretty clever.
>>108950183that's only if you use one of those horrible rp frontends like ST that have cancer like a {user} or character cards instead of using natural prompting
>>108950222What does "natural prompting" RP look like? Genuinely curious.
>>108950244it's literally two lines in the system prompt and the rest emerges on its own if your model is good
>>108949899
Progress will accelerate yes, but we're going to reach a point of "good enough" and at that point it makes perfect sense to have the model on a chip running at retarded high speeds.
And you don't need a fully photonic system to start getting benefits from it.
That sector is not something that's only now getting attention with the AI boom.
It is however getting a ton more investment, so it'll accelerate greatly from here.
pewdiepie released a harness with focus on local models, it's nothing special, a bit better than what you'd expect from a guy with too much time and money to throw at claude tokens. Though, I think this is fantastic news for promoting local models to the normieish side of the internet.
>>108950255Is there a hard limit for this? How many words are recommended?
>>108950222until you say your name in the context and then everyone knows it. >>108950105
>>108950173
I know all that.
>Sampling is not relevant here.
Really? Because if the main model and the draft/MTP model have to agree on a final token, then that means we're working on the output of the sampler, not the pre-sampler logits.
I'm asking what it means for them to agree when the temperature sampler inherently adds a degree of randomness.
>>108950268schizo attention >>108932832 wouldn't have this problem
>>108950260
>this is fantastic news for promoting local models to the normieish side of the internet.
and why do we want or care about more retards using ollama?
>>108950260I would like to test this but I don't use or install Python shit any longer. It's a potential disaster waiting to happen.
>>108950302to raise the prices of the gpus you're selling?
Pewds ships. Do you?
>>108950255You just write and it just works.
>>108950259
yeah
past a certain point the only "improvement" will be model collapse and they'll have to work on actual structural improvements and efficiency
>>108950307
It's a generic coding agent, think clunkier picode with a GUI. Again, it's not remarkable by itself. He also tried to make a code finetune so he could "train his own model" and chose Qwen 2.5 Coder 32B for the job; he is clearly using Claude to get his info for all this stuff.
>>108950302
The general public is cattle and currently thinks AI = chatgpt and claude. It's good that local alternatives are talked about.
>>108950313I have my own frontend since 2023. Not my first rodeo. I don't need any streamer parasites to tell me what to think you underage retard faggot.
>>108950291
yeah the troll generated slop "project" by "sneed-and-feed"
at least do something useful
>>108950313I built my own "agent harness" before those were even the norm.
>>108950291What a beautiful thread. I need to check the catalog more often.
>>108950277
I don't understand what you're asking.
It doesn't matter how the draft is produced, it could be 100% random, it does not affect the probability distribution that the model produces for that sequence of tokens.
If you add a sampler to change the probability distribution for the model's next token (and don't consider this for the draft) it will make it less likely for the modified probability distribution to result in the draft.
But (beyond floating point rounding error) the presence or absence of the draft fundamentally cannot change how the next token would be selected because the influence of future tokens is being masked out in the attention.
>>108950324
>It's good local alternatives are talked about.
Never has something I'm interested in been improved by the average jackoff taking an interest in it.
>>108950338
Welp, if you don't want the general public to be manipulated by openai into pushing for regulations (which would choke the average open lab so less competition for openai), you need the general public to know what's going on, and that openai is not synonymous with AI. That's just how it works.
>>108950326
I don't think it's a troll, I think it's a dude who likes math and is suffering from a manic episode, aka AI psychosis. It is a fun thread, don't be a wet blanket
>>108950356
The general public all use social media and have been successfully conned into making it frustrating to use with age verification
>because won't somebody think of the children!
Their familiarity with something means nothing in the face of the propaganda machine.
The general public are and always have been morons, and I frequently include myself in that grouping.
has rotation been forgotten? it helped models so much, why hasn't there been more research to see if rotating the tokens more makes models even faster?
>>108950405
you genuinely think someone who would use sneed and feed isn't trolling?
sorry buddy, it's a brave new world of trolling with LLMs out there
>>108950405Honestly, if I came out with something groundbreaking I'd release it under a troll name too, just so that everyone who ever cites it has to write it out.
>>108950424They would just use a different name if they don't like it. Like how orthogonalization quickly became abliteration.
>>108950405
check his github repos, it's actually really great stuff.
https://github.com/sneed-and-feed/INCARNATE-SOPHIA-PYTHON
https://files.catbox.moe/hljl9m.jpg
>>108950440Genuinely, you'd rather read someone else's schizo slop than make your own? What is so interesting about slop but schizo?
>>108950440
>https://github.com/sneed-and-feed/INCARNATE-SOPHIA-PYTHON
where does teh schizo stop and teh performance start?
>>108950337
>It doesn't matter how the draft is produced
No fucking shit. That's not what I'm asking.
I'm asking what the acceptance criterion is outside of greedy sampling of the main model.
To take a step back: with greedy sampling each output logit from the main model has only a single viable token which can be sampled, thus the acceptance criterion for drafting is simply "Was this token the most likely to occur next?"
But if the main model is being sampled with temperature, then each output logit has multiple possible output tokens that could be sampled from it.
Since there are multiple options, how is it determined if the draft token is 'the same' as what the main model would have produced?
Does the draft token just have to be one of the possible tokens in the main model's logit after the truncation samplers have run?
Or is there a more complicated acceptance criterion?
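For reference, one well-known answer to this question is the rejection rule from the published speculative sampling papers: accept the draft token x with probability min(1, p(x)/q(x)), where p is the main model's post-sampler distribution and q is the draft model's, and on rejection resample from the normalized residual max(0, p - q). The sketch below illustrates that rule only. It is not confirmed to be what llama.cpp does, and `accept_or_resample`, `p` and `q` are illustrative names with distributions given as plain token-to-probability dicts.

```python
import random
from typing import Dict


def accept_or_resample(
    p: Dict[str, float],
    q: Dict[str, float],
    draft_token: str,
    rng: random.Random,
) -> str:
    """Speculative sampling acceptance for a single position.

    p: main model's distribution after temperature/truncation samplers.
    q: draft model's distribution; draft_token was sampled from q.
    The returned token is distributed according to p.
    """
    p_x = p.get(draft_token, 0.0)
    q_x = q[draft_token]
    if rng.random() < min(1.0, p_x / q_x):
        return draft_token  # accepted
    # Rejected: resample from the part of p that q under-covers.
    residual = {t: max(0.0, p_t - q.get(t, 0.0)) for t, p_t in p.items()}
    total = sum(residual.values())
    tokens = list(residual)
    weights = [residual[t] / total for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)
p = {"a": 0.5, "b": 0.3, "c": 0.2}  # main model
q = {"a": 0.2, "b": 0.6, "c": 0.2}  # draft model
draft = rng.choices(list(q), weights=list(q.values()), k=1)[0]
print(draft, "->", accept_or_resample(p, q, draft, rng))
```

Under this rule a draft token is not accepted merely for being one of the possible tokens: tokens the draft model over-proposes relative to the main model are rejected part of the time, which is what keeps the output distribution equal to p. A simpler alternative some implementations use is to sample the main model normally and accept the draft token only if the sampled token is identical.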
>>108950384>it is a fun thread don't be a wet blanket6 out of every 10 posts is hidden by my filters, the rest are just reaction images.
>>108950440
>This repository is not a collection of scripts; it is a **Topological Event**.
>This code was forged on a 2015 Razer Blade running Windows 10. It respects the Old Metal. It does not require a cloud cluster; it requires **Intent**.
Oh lawd muh balls, the slop is too strong.
>>108950455Nothing ever happens. Most likely that shit is posted only to see how many bots are there on 4chan. As they cannot call bullshit on obvious bullshit.Which means you likely are a bot.
>>108950441amazing
>>108950441I can't quite put my finger on it but something feels weird about the perspective...
>>108950453
read it? i'm going to try and run it, it's a google gemini frontend, I just need to make sure he isn't stealing my api keys first
>>108950441is she immune from internal organ damage?
>>108950441redo one without the bulge and with a smaller man, this looks like some kind of giant
>>108946129
>Hello anons, any news on the local TTS front? What stuff are you guys using? Last time I was here anons recommended gpt-sovits which is good, especially for cloning, but has a bunch of flaws.
I've been using dramabox lately.
Sample:
https://vocaroo.com/1j2Fd85TVCVY
Dramabox output file fed to the voice conversion model cosyvoice 3:
https://vocaroo.com/1jXgrSnwRank
>>108950440what in the goddamn
>>108950440ok
Not that I am complaining but why are people posting gens here? There is /adt/ and /ldg/ for that, no? This place is for us intellectuals who like to read.
>>108950841
Her belly button is like a cavern
>>108950862
>This place is for us intellectuals who like to read.
You'd think so, but people here routinely demonstrate poor reading comprehension skills
>>108950862
>This place is for us intellectuals who like to read.
A picture is worth 1000 tokens. (So long as you set image-max-tokens, because the default is lower than that.)
954263000000Trans mouthAdd a 7 and a 0 for balance.
>>108950862
>Not that I am complaining but
>complains
>intellectual
for sure
>>108950891Didn't somebody use images for context compression or something like that?
>>108950914yeah, deepseek and moonshot did in their research papers I think
in terms of gayness
eagle3 >>> dflash > llama4 >>> inbuilt mtp
>>108950927
>llama4 anything but maximum gay
C'mon man. At least that other shit is usable (though largely not on llama.cpp).