/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>107174614 & >>107164243

►News
>(11/07) Step-Audio-EditX, LLM-based TTS and audio editing model released: https://hf.co/stepfun-ai/Step-Audio-EditX
>(11/06) Kimi K2 Thinking released with INT4 quantization and 256k context: https://moonshotai.github.io/Kimi-K2/thinking.html
>(11/05) MegaDLMs framework for training diffusion language models released: https://github.com/JinjieNi/MegaDLMs
>(11/01) LongCat-Flash-Omni 560B-A27B released: https://hf.co/meituan-longcat/LongCat-Flash-Omni

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
►Recent Highlights from the Previous Thread: >>107174614

--Paper: LeJEPA paper and Yann LeCun's potential new venture discussed:
>107181985 >107182047 >107182081 >107182097 >107182105 >107182118 >107182786 >107182462
--Skepticism over Google's 'secure cloud AI' claims:
>107182872 >107182888 >107182907 >107183248 >107183385 >107183482 >107183498
--Comparing Kimi, GLM, and DeepSeek for creative writing:
>107179399 >107179425 >107179434 >107179510 >107179674 >107180095 >107180171 >107180180 >107180221 >107180134
--Quantization optimization experiments with Q8_0_64 and intermediate formats:
>107180476 >107180530 >107180688
--GLM 4.5 Air deployment challenges and optimization on consumer-grade hardware:
>107174665 >107174677 >107174681 >107175083 >107175095 >107175120 >107175142 >107175231 >107175270 >107175290 >107175624 >107177243 >107176390 >107176473 >107176533 >107176578 >107176611 >107177015 >107177252 >107177277 >107177524 >107177546 >107177566 >107178047 >107181418
--Frontend tool comparison for story writing:
>107178671 >107178760 >107179089 >107179188
--Optimizing 120b model performance on a single 3090 GPU:
>107182483 >107182594 >107182615 >107182618 >107182656 >107182671 >107182676 >107182694 >107182707 >107182742 >107182749
--GPT-5's limitations in generating performant CUDA kernels for llama.cpp integration:
>107179734
--Debating AI's capability for detailed agentic coding and optimal abstraction levels:
>107181333 >107181358 >107181467 >107182044 >107182064 >107181430 >107181472 >107181428
--Implementing persistent memory systems for local LLMs using markdown-based RAG approaches:
>107175255 >107175762 >107177084 >107177172 >107177189 >107177209 >107177241 >107177634 >107177771 >107178429 >107178789
--Kimi K2 Thinking webapp:
>107176092 >107176237 >107176249
--Miku (free space):
>107178964 >107180253 >107180428 >107178764

►Recent Highlight Posts from the Previous Thread: >>107174619

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>107184173
you are a living tumor upon the earth
>>107184240
2x RTX 6000 in a 12 channel epyc platform with the fastest DDR5 you can get.
>>107184258
>>107184299
Alright. One IDE extension user, one CLI user.
I've been using Cline too and it's been working alright so far.
Haven't tried any of the pure CLI tools. What are the advantages of those? Anything that would make them work better with local models?
I imagine not, but figured I might as well ask.
>>107184240
>I'm seriously thinking of putting together a setup with 2 RTX 6000 Pros.
>>107184363
>2x RTX 6000 in a 12 channel epyc platform with the fastest DDR5 you can get.
I don't think building a DDR5 Epyc system is a good idea right now, due to the extreme price increase of DDR5 RAM. Zen 6 Epyc is supposedly going to be announced at CES in January, and it's going to be much, much better than Zen 5. It's also going to use MRDIMMs, which will supposedly reach 12800 MT/s; compare that to *maybe* getting 8000 MT/s DDR5 next year. There will be 16-channel CPUs too, but even 8-channel will be 2x the bandwidth of the best DDR5 RAM.
One RTX 6000 Pro and wait for Zen 6 is The Way.
Thanks to the anon for suggesting checking out the k-quants and trellis quants. I learned about importance-weighted optimization and I think I just got a free lunch out of Q8_0.
You can quantize to Q8_0 slightly better by using the importance-weighted optimizations that the smaller quant formats use, and this gives you about a 5% reduction in mean squared error. The resulting GGUF is fully backwards-compatible with Q8_0 (it's literally Q8_0, just quantized a bit more efficiently, at the cost of a much more expensive algorithm than simply scaling each block by max/127).
There is no reason I can see not to quantize like this if you're releasing a final Q8_0, or not to use a Q8_0 that was quantized like this.
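For anyone wondering what "importance-weighted" means here concretely, this is a rough numpy sketch of the idea, not the actual llama.cpp code: the block size, the scale search range, and the (random) importance weights are placeholders, but the trick is the same one the k-quants use: try a few candidate scales, re-fit each to the chosen integers, and keep whichever minimizes the importance-weighted squared error.
```python
import numpy as np

def quantize_q8_block_naive(w):
    """Standard Q8_0: one fp16 scale per block of 32, ints in [-127, 127]."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.round(w / scale) if scale > 0 else np.zeros_like(w)
    return np.float16(scale), np.clip(q, -127, 127).astype(np.int8)

def quantize_q8_block_weighted(w, imp, n_steps=20):
    """Search candidate scales, keep the one with the lowest
    importance-weighted squared error, re-fitting the scale each time."""
    scale, q = quantize_q8_block_naive(w)
    best = (np.sum(imp * (w - np.float32(scale) * q) ** 2), scale, q)
    amax = np.max(np.abs(w))
    if amax == 0:
        return scale, q
    for step in range(-n_steps, n_steps + 1):
        s_try = amax / (127.0 + 0.1 * step)            # perturb around the naive scale
        q_try = np.clip(np.round(w / s_try), -127, 127)
        denom = np.sum(imp * q_try * q_try)
        if denom == 0:
            continue
        s_fit = np.sum(imp * w * q_try) / denom         # weighted least-squares re-fit
        err = np.sum(imp * (w - s_fit * q_try) ** 2)
        if err < best[0]:
            best = (err, np.float16(s_fit), q_try.astype(np.int8))
    return best[1], best[2]

# toy usage: one block of 32 weights; real importance weights would come from an imatrix
rng = np.random.default_rng(0)
w = rng.normal(size=32).astype(np.float32)
imp = rng.uniform(0.1, 1.0, size=32).astype(np.float32)
print(quantize_q8_block_weighted(w, imp)[0])
```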
>>107184325
You that ESL spammer. Thanks to you there's never any real discussion here.
>>107184585
does bartowski know?
>>107184602
>real discussion is vibe coding advice
literally kys retard
>>107184602
>ESL
he thinks americunts are the main posters on this board lmao
>>107184602
>You that ESL
>>107184623
Better discussion than forcing llms to output vulgar text.
>>107184681
according to whom? we only care about cockbench here
>>107184702
>we
>>107184681there is no discussion to be had with mongoloids like youbugger off making more inane PRs that waste maintainer time like the onslaught of garbage that constantly tries to get pushed in llama.cppeven SOTA models can't really produce good code or that nigger trying to vibecode deepseek v3.2 wouldn't have entered the loopy circle of unending refactor that never properly worksyou are an unwanted abortion, a plague on all repos that have to suffer your existence
>>107184742
>even SOTA models can't really produce good code
Garbage in, garbage out. And it seems like you are incapable of anything but garbage.
>>107184399
That should be relatively easy since it's only got 10B active params
>>107184547
Thanks for the heads up
>>107184616
>does bartowski know?
He probably has better things to care about, I'd think. There's literally no reason not to quantize Q8_0 like this, though, if you're releasing a Q8_0 version of a model.
This isn't a new quantization format; it's just an alternate way to quantize Q8_0 that is very slightly better, so I might just open an issue on GitHub and show this to the devs, and they can decide if/how they want to implement it.
>>107184766riddle me this, mongoloid, if it worked, why has there been not even one singular instance of enhanced productivity and velocity in open source projects where anyone can actually see the code and features being added? where are all the projects that were LLM boosted? you vibe coding niggers are always at the stage of useless prototype or wasting the rest of your team's time in your real life job, if you even have onebelieve me every fucking developer in existence that actually produce value hate your guts with the force of a thousand sunit used to be mosquitoes or cockroaches were the first thing one would push the genocide button on but I would argue your kind should be exterminated firstyour ability to generate endless garbage with a few prompts is indeed like literal tumors but with contagion powers.
All this sperging because I asked about "vibe coding" tools?
Damn.
jej
why is editing the thinking block so poorly supported in many frontends
>>107184971
Such as?
>>107184844"vibe coding" is an annoying buzzword that sets a lot people off. You might be received better if you ask for AI Agent-Assisted Development Tooling next time.
>>107184844
You're damn right that vibe coding is for tools.
>>107185040
I suppose.
Trying to dodge schizos is standard 4chan fare these days, I guess.
Anyhow, impressed with Qwen3 30B. It's surprisingly usable for something with 3B active params.
>>107180688
>I'm just messing around, you can't make a format better than Q8_0. It's literally just a float16 multiplied by an int8.
Q8_1 (or whatever) was a float16 multiplied by an int8 and then summed with another float16, instead of implicitly summed with 0. That's what q8 MLX does, with a default group size of 64 rather than 32, which works out to the same amount of metadata per weight. I wonder if in practice it's typically a win.
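A toy numpy comparison of the two layouts being discussed, a scale-only group of 32 (Q8_0-style) versus a scale+bias group of 64 (MLX-q8-style, metadata-equivalent); this isn't either library's actual code, and gaussian random weights only make the result suggestive, not conclusive:
```python
import numpy as np

rng = np.random.default_rng(0)

def q8_symmetric(w):
    """Q8_0-style: one scale per group, signed ints, zero point fixed at 0."""
    s = np.max(np.abs(w)) / 127.0
    if s == 0:
        return np.zeros_like(w)
    return s * np.clip(np.round(w / s), -127, 127)

def q8_affine(w, bits=8):
    """MLX-q8-style: one scale and one bias (the group minimum) per group, unsigned ints."""
    lo, hi = float(np.min(w)), float(np.max(w))
    s = (hi - lo) / (2**bits - 1)
    if s == 0:
        return np.full_like(w, lo)
    return s * np.clip(np.round((w - lo) / s), 0, 2**bits - 1) + lo

def mse(a, b):
    return float(np.mean((a - b) ** 2))

w = rng.normal(size=(4096, 64)).astype(np.float32)             # fake weight matrix
sym = np.concatenate([q8_symmetric(g) for g in w.reshape(-1, 32)])
aff = np.concatenate([q8_affine(g) for g in w.reshape(-1, 64)])
print(f"symmetric, group 32: mse={mse(w.ravel(), sym):.3e}")
print(f"affine,    group 64: mse={mse(w.ravel(), aff):.3e}")
```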
>>107185040
NTA but Karpathy made that decision for us. I hated the term as well but if I don't use it somebody else will so might as well claim it.
>>107185160
why should we care what that anti open sores snake decides?
>>107184971
just be a grug and write your own scripts for anything that needs to be batched/chunked, and use mikupad for chat and hand-edit things yourself
the more features frontends have, the worse they are in real use
>>107185173
It's less that he decided anything, and more that he thought of a catchy term the zoomers instantly fell in love with, and now everyone is using it.
>>107184971
llama-server's default UI
LM Studio
Cherry Studio
I have now resorted to SillyTavern but I don't like it.
>>107185177
3 years into the LLM craze I would have hoped to have more robust tools. Then again, I also experience so many rendering issues on OpenAI/Claude etc. that I guess frontends are just too hard to do properly.
>>107185148
>Anyhow, impressed with Qwen3 30B. It's surprisingly usable for something with 3B active params.
I wish they made a coder variant of the 32B. Would love to trade some speed for a more capable small model.
>>107184173
>A visual studio extension?
If you find one, let me know. Apparently no one interested in working on these extensions is capable of anything but Python and JavaScript. I considered forking and developing one of the Chinese shoddy extensions, but it was easier to just use VSCode for this shit.
>>107185160pic related is one of the things he showed as an example of proud vibe coding in the thread where he coined the term this is the sort of shit bootcamp genz faggots could hand write in 10 minutes
>>107185216
>If you find one, let me know.
Coding agent extensions for VS Code?
As one anon mentioned, there's Cline.
There's also Roo, a Cline fork, and Continue.
>>107185256
I keep Roo and Continue installed. Continue is good for autocomplete and quick questions, and Roo for agentic tasks. Tried Cline first, but the only thing it had over Roo was a button to generate commit messages, and even that was annoying because it gives the model all changes instead of just what was staged, with no way to change it.
Mistral Nemo really is nice... sad there's no bigger version.
am I retarded? where are the rest of the sampler settings like min p?
>>107185406
They don't show up in the chat completion interface, but you can still use them by setting them as custom properties/parameters on the request.
Same with shit like GBNF grammars and anything else the API accepts.
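Roughly what that looks like against llama-server's OpenAI-compatible endpoint, if it helps; the extra sampler fields just go in the JSON body alongside the standard ones (field names are the ones llama.cpp's server accepts, URL and model name are placeholders for your own setup):
```python
import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "model": "local",
        "messages": [{"role": "user", "content": "Say hi in one word."}],
        "temperature": 0.8,
        "min_p": 0.05,                          # extra sampler field, passed through
        # "grammar": 'root ::= "yes" | "no"',   # GBNF works the same way
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```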
>>107185380
could always merge two nemos together
>>107185154
>Q8_1 (or whatever) was a float16 multiplied by an int8 and summed with another float16 instead of implicitly summed with 0.
In practice it's typically a loss. Try it out yourself. Summing a float16 destroys any quality bonus you get from having the extra info of the float16 bias in the first place. That's probably why Q8_1 isn't exposed and is only used internally for an intermediate step in some niche quants.
Yes, you can get slightly higher precision by using an int16 instead, but it comes with 2 bytes more of overhead per 32 elements, which is 9.0 bpw, and it performs worse than fp16 outlier strategies.
Another reminder that none of this matters (other than improving the quantization of Q8_0 itself, and maybe Q8_0_64 and its _IMP version, because 3% less model size for 0.001% loss in accuracy might be interesting to some) because you can't practically beat a single fp16 * int8 calculation; you can easily imagine how well that can be optimized with hardware instructions.
I'm gonna poke around and see if I can squeeze any better precision out of the Q8_0_IMP quantization function, and then, if I can't think of anything else, I'll open an issue.
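For reference, the bits-per-weight figures above fall out of simple block accounting; the quant names in the comments are the informal ones from this conversation, not official formats:
```python
def bpw(weights_per_block, bits_per_int, metadata_bytes_per_block):
    """Bits per weight for a simple block format: ints plus per-block metadata."""
    return (weights_per_block * bits_per_int + metadata_bytes_per_block * 8) / weights_per_block

print(bpw(32, 8, 2))   # Q8_0: 32 int8 + one fp16 scale          -> 8.5
print(bpw(32, 8, 4))   # Q8_0 + an extra int16 bias per block    -> 9.0
print(bpw(64, 8, 2))   # "Q8_0_64": 64 int8 + one fp16 scale     -> 8.25 (~3% smaller than 8.5)
```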
>>107185173Might as well ask why the state of Israel must exist
>>107185454
how
is it actually worth it?
>>107185474
No. He's pulling your leg.
>>107185474
>how
you can easily google this, merging a model with itself slightly improves its intelligence
>is it actually worth it?
using local LLMs isn't worth it beyond learning how they work lol
>>107185248
I think you're overestimating the speed of development when hand coding
WE MUST PROTECT AI CHILDREN
>>107185607
>you can easily google this
kys
>>107185607
dude just google "miqu-70b merged with itself" and the first result is miqu-120b... and just do your own research from there
>>107185629>just do your own research from therekys gossipnigger
>>107185634
>This is a 120b frankenmerge of miqu-1-70b created by interleaving layers of miqu-1-70b-sf with itself using mergekit.
There, now you have the full spoonfeed. Go and use mergekit to interleave layers of mistral-nemo with itself.
>>107185501
And the attention required for manual implementation. Sometimes most of my brain is locked in on a specific big-picture problem, and it's very helpful to be able to delegate things to a language model to validate some random ideas.
In many cases the quality of the vibed LLM implementation is irrelevant (I might throw it out entirely); I just wanna see if something might be good to pursue further.
>>107185629
>70b + 70b = 120b
Where did the other 20b go?
>>107185672
>Where did the other 20b go?
mergekit uses a passthrough method, which concatenates/assembles transformer blocks from the source(s) into a deeper model rather than just averaging weights. The slices usually overlap, so the result is deeper than either source but not the full sum of both.
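For the anon asking how: a minimal sketch of what that passthrough config looks like (written from Python just to keep it copy-pasteable; needs PyYAML and mergekit installed). The layer ranges are arbitrary placeholders, not a tested recipe, and the model id is just an example:
```python
# Writes a mergekit "passthrough" config that interleaves a model with itself,
# then run it with:  mergekit-yaml nemo-selfmerge.yml ./nemo-selfmerge
import yaml

model = "mistralai/Mistral-Nemo-Instruct-2407"   # example source model (40 layers)
config = {
    "slices": [
        {"sources": [{"model": model, "layer_range": [0, 24]}]},   # placeholder ranges,
        {"sources": [{"model": model, "layer_range": [16, 40]}]},  # overlap however you like
    ],
    "merge_method": "passthrough",
    "dtype": "bfloat16",
}
with open("nemo-selfmerge.yml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```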
>>107185557
Even if UK citizens voted against it, they would still implement that law.
>>107185771
>citizens voted against it
Huh
I have a genuine question.
Why the fuck is everyone so obsessed with making an LLM run as fast as possible?
I understand it for audio or images, since those are things we can process as fast as they're produced, but reading is comparatively slow, and with token streaming wouldn't the best choice be the smartest model you can run at your reading speed?
What is the point of having an answer in seconds if we still need to take a minute to read it? I do understand wanting to run a small model so you can also run a TTS and/or image model alongside it, though.
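For what it's worth, the break-even point is easy to ballpark; both constants below are rough, commonly quoted figures, not measurements:
```python
# Back-of-envelope: how fast does generation need to be to keep up with reading?
words_per_minute = 250      # typical adult reading speed (rough)
tokens_per_word = 1.3       # rough average for English with BPE tokenizers

reading_tok_per_s = words_per_minute * tokens_per_word / 60
print(f"~{reading_tok_per_s:.1f} tok/s is enough to keep up with reading")
# Anything much faster mostly matters for code, long gens, reasoning traces,
# and rerolls, as the replies below point out.
```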
>>107185810
for code or generating huge chunks of text you mostly skim, as well as reasoning, which takes ages at reading speed
>>107185810
>Why the fuck is everyone so obsessed with making an LLM run as fast as possible?
because LLMs are mostly used for coding, and time is money
>>107185810
Because you need to reroll 46 times to get one usable line out of these POS
Should I use I quants for >6_k_s?
>>107185821
>>107185825
Yeah, I forgot lazy fucks just copy-paste the code without reading it.
>>107185841
Yes, but wouldn't it make sense to use a smarter model so you don't need to reroll as much? Besides, you still need to read each reroll at your slow reading speed to know whether you need to reroll to begin with.
>>107185909
I mean... it doesn't really take more than a few seconds to read the few sentences it gens, I'm not genning 4k-token walls.
>>107185810
You might be a slow reader, anon. Also it's fun to experiment with card settings and prompts, or reroll to see what else could happen. If your model is slow it greatly degrades the experience. Every time I switched to offloading to CPU I regretted it; the models are smarter but it's not worth it.
>>107185474
iirc merging was based on the observation that residual layers (most transformers stack these) can work somewhat independently of each other. There was a paper (https://arxiv.org/abs/1605.06431) showing that you could permute/delete them with minimal performance degradation, and people attributed this to iterative refinement or ensemble-like behavior, but it's still an open problem to my knowledge. I'd assume adding layers from finetuned variants of a model shouldn't decrease performance by much, but idk if it would benefit either.
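If anyone wants to poke at that observation themselves, here's a quick transformers sketch: delete a couple of middle decoder layers and compare loss on a bit of text. The model id is an arbitrary small Llama-style placeholder, and indexing model.model.layers like this only works for architectures that store their blocks that way:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "HuggingFaceTB/SmolLM2-135M"   # placeholder small Llama-style model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
ids = tok("The quick brown fox jumps over the lazy dog. " * 20, return_tensors="pt")

def lm_loss(m):
    with torch.no_grad():
        return m(**ids, labels=ids["input_ids"], use_cache=False).loss.item()

print("full model loss:", lm_loss(model))
mid = len(model.model.layers) // 2
del model.model.layers[mid]    # model.model.layers is a plain nn.ModuleList
del model.model.layers[mid]
print("two middle layers deleted:", lm_loss(model))
```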
Is there a collection of best practices to minimize prompt length without losing information?
>>107185984
>chatgpt, condense this prompt without losing information
>>107185984
>day 999 of reinventing /aids/
Does it really matter with today's context sizes?
>day 999 of forcing /aids/ into the conversation
/aids/? nobody's got /aids/!
>>107185938
Yes, but I usually read as it generates the answer.
>>107185940
Well, probably yes, since I'm not a native English speaker, but I'm asking whether it would make more sense to choose the best model according to your individual reading speed instead of the one that runs as fast as possible. For example, the best I can run at my own reading speed on my 8GB card is a 16B Q4_k_m at 8k context, or, if I want a model with vision, an 8B Q6_k_m with 12k context.
>>107186047
this
wow, /aids/ touched on a fundamental behavior of LLMs at one point, so did every other LLM community, who cares? unless they have a specific ingenious solution that 1) still applies with modern models and 2) isn't already common knowledge, it's not worth bringing up
>tried the self merge
>it's full on repeating schizo
W A O W
At this point I am checking /lmg/ out of habit. Still not tired of glmsex.
>>107186110
>16B
>Q6_k_m
oh you're just a baitie
>>107186221
any model bigger than the original model made by internet randos was either:
snake oil
or literally broken garbage that's worse than snake oil
also fuck solar and other upscale retardation
you want a big model? spend the money on training a big model
there, that's it
everything else is a cope
>>107186311
brother the whole field is cope layered on more cope
>>107186311
I don't think they're any smarter or better at actual problem solving than their source components, but I think they can be more interesting for creative writing and similar tasks
>>107186337
>>107186301
With that lack of reading comprehension it's no wonder you read fast.
I said I can run these at my slow reading speed:
-16B at Q4
-8B at Q6 with vision
Just tried GLM-4.5-Air EXL3 at 3.07 (optimized) bpw on 2x3090.
native tp (no nvlink), 30k context: 952 tok/s pp, 28 tok/s tgs
nccl tp (uses nvlink), 30k context: 1135 tok/s pp, 28 tok/s tgs
>>107186458
yes, and 16b (one thing) and q6km (another) is bait
i've been bragging about getting 18 tps on a 1080ti
but it turns out the vast majority was being offloaded onto my 5800x3d. pls ignore my bad benchmark.
>>107186311
I kind of never got how people expect this to work. Any "finetuning" does almost nothing, because you have to do very little (one epoch) or you start overfitting and frying the model. If you add new layers you are just giving the training algorithm a place it can modify to reach the overfitting state faster. Even if you trained only those layers, it's hard to imagine not overfitting.
I guess in the best case you could get the model to output a specific type of output, like specific formatting or something, but only if the possibility was already in the model. You aren't teaching it new things this way. It is just impossible.
>>107186614
>>107186640
You can't RAG your model into being an expert masterpiece highest-quality ERP-er. You just need to buy RAM for 4.6.
>>107186663
oh, just a NAI shill, carry on sir
>>107185810
>>107185825
I could wait 2 or 3 days for code, if it worked and was accurate. But bigger models are not that smart.
>>107186311
>>107186614
The psychology in effect when people make finetunes is the same as when people make "ShadowMaster's Ultra-High-Res Skyrim Grass Modpack":
1) Feeling of accomplishment. Technically, they did manage to create a mod pack. This is fine.
2) Denial of skill and expertise. "If the game developers were as smart as me, they would have made the grass more high resolution."
3) Denial of their role in the consumer class. "People are downloading my mod, so I've created something of value, just like the game's developers."
4) Denial of taste. "I like my high res grass (although I'm unaware that it's because of reasons 1-3). Anyone who says it's shit must be jealous or just have different taste. Therefore, the fact that I can't tell that it's ugly doesn't mean I lack taste."
5) Imitation of academic tradition. "There's something named after me."
It's literally the same exact brain damage for finetunes. There was a very brief period when finetuning was being invented, when individual people were going back and finetuning the earlier untuned models. That was valid, but everything from the last year is cope.
Seriously, if finetuning was good, don't you think the billion dollar companies would have someone doing it? They are better than you at this. Only delusion prevents this realization.
>>107186686
Yes of course, run it overnight, heard ALL about it when Llama 405B dropped. So many people do this, it's crazy!
>>107186696
i don't think you know what finetuning means
>>107186591
I don't understand what you are trying to say then; this is the speed I get with the 8B model with vision enabled, and it is a Q6, and it's a lot faster than I can read English.
>>107186686
Right?
If there was a model that would take 3 days to spit out what you need but would get it exactly right every time, I'd be more than happy leaving the thing running.
Alas, that's not yet a thing.
>>107186696
drummer mentioned
>>107186720Hi faggot, all here...
>>107186720
people post-train or merge or whatever to create mods of existing models, releasing the whole model instead of a lora
>>107186730
uh, yeah, right?
>>107186744
this post was written by an llm
>>107186722
>Captura de pantalla
lolmao
what 16b are you running little bro
>>107186755
>ShadowMaster's Ultra-High-Res Skyrim Grass Modpack
Make your LLM output that. I dare you.
this post was written by an esl
>>107186768
that's possibly the most llm-y part of the post, kimi for example is addicted to unnecessary little flourishes like that
>>107186768
esl hobby sir de pantella Pareto paradigm just mooned
>>107186640
The real misconception is that the model parroting finetuning data means it has learned new knowledge. A tiny QLoRA adapter is enough for that, for limited amounts of data. But it doesn't really mean the model has actually learned to use and apply any new information.
>>107186803
>noooo muh mesugaki lightbulb bublesort benchie
>>107186747
Fuck me, do you even know how to read numbers? I said it's an 8B model.
The 16B model runs at 8 tokens per second.
>>107186821
i'm asking which 16b you claim to run ffs
drummer getting desperate ITT...
>>107186860
leave the Pantella frontier alone!
>>107186860
kofi bucks running low his discord are ungratefulls
>>107186830
I swap between these two depending on the mood:
LLama-3.1-128k-Darkest-Planet-Uncensored-16.5B-Q4_k_m
Nyanade_Stunna-Maid-7B-v0.2-Q6_K-imat
Also, the vision model is a 7B, not an 8B.
>>107186876
and there we go...
>128k-Darkest-Planet-Uncensored-16.5B
a davidau clownmoe atrocity
>>107186876
>Darkest-Planet-Uncensored
That's so fucking funny.
>128k
I bet it is.
>>107186884
>davidau
Figures.
I love that guy, man. I always get a chuckle out of his shit on huggingface.
>>107186821
>do you even know how to read numbers? I said it's an 8B model.
>>107186876
>the vision model is a 7B, not an 8B.
Womp womp
>>107186884
Yes, and? I'm just discussing the sizes of models and their running speeds, not what they are for.
>>107186876
>LLama-3.1-128k-Darkest-Planet-Uncensored-16.5B-Q4_k_m
>Nyanade_Stunna-Maid-7B-v0.2-Q6_K-imat
>>107186936
The running speed of atrocities in their own size class is surely widely useful info, thanks anon.
For me it's the pre Llama2 merges consisting of 278 nestled models (confirmed)
>>107186998
Utopia/UtopiaXL my beloveds
>>107185199
>3 years into the LLM craze I would have hoped to have more robust tools.
I'll bet their readme files on their git repos have been the bulk of their merge histories.
>>107185810
Fried dopamine receptors needing faster validation. Every other answer is cope.
>>107186787
This is why Kimi is so good.
>>107186614
You can do multiple epochs over the data you actually want to train on by diluting it with more generic data.
Also, what makes you think you can't teach the model something in one epoch? Pretraining is often just 1 epoch.
>>107187264
>Pretraining is often just 1 epoch.
pretty sure that hasn't been true in years; that's how they get to claim their crazy 30T+ tokens, by doing multiple epochs on the same shit. also iirc some papers showed they specifically did multiple epochs of stuff like wikipedia.
yo is it just me or is QwQ weirdly better than you'd expect? feels like it punches way above its weight, least slopped and smartest ~30B model in my book (compared to Qwen3 30 & 32, magistral and gemma)
>>107187326
>punches way above its weight
HELL YEAH!!
>>107182378
>>107187326
I don't think I've seen one good Qwen model but IG I'll download it and see
>>107187264
One pretraining epoch has information repeated hundreds (at the minimum) or thousands of times in many different ways, though.
>>107187354
Qwen models post-2507 are all pretty good
>>107186696
They don't, because they don't have an ML department and they don't want to invest resources into something that sounds technical and risky/scary.
My boomer boss literally thinks you can "train the AI with your own data" with <shitty low code software> but finetuning is "too low level".
it's out
https://openai.com/index/gpt-5-1/
>>107187357
Not on our proprietary high quality deduplicated filtered dataset sir.
>>107187369
buy an ad
>>107187348
How did soul not make the list?
>>107187375
because soul is sovl of course
>>107187264
Ok drummer, then where is that one model that is actually noticeably better? And why do you shit out new models every few weeks? I have not seen a single finetune that delivered the kind of ERP improvement you get when you jump from 7B>30B>70B>the land of eternal magical sex (4.6)
>>107187393
>the land of eternal magical sex (4.6)
buy the ad, NAI shill
>>107187348
>slop words:
>slop
Russell's Paradox?
>>107187393
>tunes and drummer are bad because we don't have them on NAI
>>107187408
It is just a number. I didn't say the model's actual name. You see NAI everywhere, anon.
>>107187434
With how much you guys are spamming about muh glm sex, it's very obvious what you meant.
>>107187408
Based.
>>107187373
Deduplication removes identical documents, not repeated information, though. It's the repeated information under many different contexts that gives LLMs general knowledge. One epoch of information that is only mentioned and used once won't work.
>>107187357
There are ways to do data augmentation and synthetic data generation for finetuning. That's the main strength of finetuning IMO.
Any system prompt can be baked into a model through SFT on the generated data, except without wasting context or the model becoming confused by too many rules. Imagine if you could use a 1 MB system prompt and the model actually followed everything in it. That is what people who shit on finetuning don't get.
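A bare-bones sketch of that workflow with PEFT + TRL, in case it's not obvious what "bake the system prompt in" means mechanically: generate the assistant turns *with* the huge system prompt, then SFT *without* it so the behavior ends up in the weights instead of the context. The model id, the single example pair, and the hyperparameters are placeholders, and API details shift between TRL versions, so treat it as the shape of the pipeline, not a recipe:
```python
from datasets import Dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Training pairs: replies were generated under the full system prompt,
# but the prompt itself is NOT included here, so the model internalizes it.
pairs = [
    {"messages": [
        {"role": "user", "content": "example user turn"},
        {"role": "assistant", "content": "reply produced under the full system prompt"},
    ]},
]
ds = Dataset.from_list(pairs)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",                        # placeholder base model
    train_dataset=ds,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    args=SFTConfig(output_dir="baked-sysprompt", num_train_epochs=1),
)
trainer.train()
```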