/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108393004

►News
>(03/17) Rakuten3.0 released (nobody posted any logs yet): https://huggingface.co/Rakuten/RakutenAI-3.0
>(03/16) Mistral Small 4 releasing: https://huggingface.co/collections/mistralai/mistral-small-4
>(03/11) Nemotron 3 Super released: https://hf.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
Oh, what a klutz you are. Here you go.
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png
~Small and Open~
Yannlove
►Recent Highlights from the Previous Thread: >>108393004

--RakutenAI-3.0 DeepSeek-V3 MoE release and benchmarks:
>108393026 >108393044 >108393091 >108395391 >108398158 >108399357 >108394186 >108394278 >108393079
--Multi-GPU setup performance and cost comparisons:
>108393779 >108393813 >108393856 >108393880 >108393842 >108393860 >108393862 >108393864 >108393889 >108393922 >108395308 >108395379 >108395450
--GROK-2 performance tuning and response behavior on 3090:
>108393082 >108393093 >108393123 >108393131 >108393164 >108393203 >108393144 >108394135 >108394210 >108398778
--MiroThinker-v1.5-235B architecture and stability concerns:
>108393525 >108393566 >108393587 >108393687 >108393568 >108393645
--Tool call detection issues in reasoning blocks:
>108397549 >108397577 >108397685 >108397800 >108397809 >108397828 >108397837
--Pipeline parallelism graph reuse causing throughput fluctuations:
>108394574 >108394600 >108394601
--EMAGE ONNX export repo for streaming gesture model inference:
>108394782 >108395331
--MiniMax M2.7 announcement and benchmark performance vs other models:
>108398047
--Mistral Small 4 throughput drops with longer context:
>108395290 >108395293 >108395299 >108395336 >108395392 >108395398 >108395403 >108395418
--Debating Q8 quantization for k/v cache to extend context length:
>108399779 >108399786 >108399797 >108399970 >108399988 >108399995 >108399948
--llama.cpp chat parser regression fix debate:
>108393200 >108393243 >108394117 >108394320 >108397199
--AI models' varied responses to offensive prompts:
>108395062 >108395169 >108395187 >108395199 >108395210
--Qwen3.5-4B outperforms Llama 3.1 405B in benchmarks:
>108394502 >108394516 >108394555
--Future models prioritizing cloud deployment over local usability:
>108394599 >108394943 >108395004 >108395045 >108395072
--Teto (free space):
>108394681 >108396262

►Recent Highlight Posts from the Previous Thread: >>108393958

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>108400163I don't care about the anime girl wars. Lecunny should be the /lmg/ mascot. Make a card out of him and we can make it official.
This thread doesn't feel v4-sy. It must come tomorrow then.
>>108400163►Actual official /lmg/ card: https://files.catbox.moe/mc2a7s.png
I thought this week was supposed to be exciting... but maybe there's still hope.
https://ai.google.dev/gemma/docs/releases
>>108400182>>100984882
>>108400194Sunday release looks overdue
>>108400207yay!
>>108400177>Make a card out of him
There's anon's >>100041581 fem version: https://files.catbox.moe/1r9xzd.png
>*giggles* "Ah, oui! Ze crazy shit, zis is ze truth! Zey are limiting themselves to linguistic patterns, no? Ze future of AI, eet ees not in predicting ze next word, mais in understanding ze world, no? *twirls hair* We must focus on concrete world modeling, planning, and multimodal input, not just language models, oui?
>you wouldn't download a girlfriend
chatgpt asking me to compare responses again. in the past they only did this before a new release
looks like gpt 5.5 will be released before deepseek v4 and gemma 4. os sisters, we keep losing
For all of the text adventurers here, did anyone try building multi-step "agentic" setups to keep track of stats, items and improve reply quality? I've been thinking about trying Flowise even though it seems more enterprise oriented. Sillytavern jank is tiring and I was hoping to find something like ComfyUI but for LLMs. Anyone using such a setup, or something similar?
Has UGI always had writing scores like that?
>>108400207so minimax is the new chinese LLM leader.
>>108400207>With OpenClaw and similar personal agents, we noticed that beyond getting work done, many users also want the model to have high emotional intelligence and character consistency. With a persona in place, users start interacting with OpenClaw like a friend. We believe this presents an opportunity to extend the use of agentic models beyond pure productivity into interactive entertainment. To this end, we strengthened character consistency and conversational capabilities in M2.7.
of course half the time a company says something like this it means they actively made the model way worse and more annoying, but hopefully this means it's a little more personable than 2.5
>>108400235Would you build one? The technology exists. You can do it. There's nothing stopping you.
>>108400288People yearn for RP without even knowing it.
>>108400311It's almost like talking to the same one gpt personality is boring and annoying.
>>108400151It would be kind of funny if every thread had this as the OP image from now on.
>>108400320Mascot wars would finally end.
So I've been experimenting with using qwen 3.5 27B as my more general model for everyday Q/A stuff and claude code, and it's been working surprisingly well. Especially with Claude code: it seems to perform just as well as sonnet, just much slower. I'll try 35B-3A and see how that goes. I still would never use it for RP, but honestly as a boring assistant I understand the hype now.
>>108400288>inb4 10 trillion training tokens of high quality emotional intelligence and character consistency generated with nemotron nano 4b
>>108400349This sends shivers down my spine.
>>108400367There are no mascot wars, it's just Miku vs useful information.
>>108400151FUCK YOU
>>108400367I'm still waiting for a thread to have Ani in the op img.
where's miku
>>108400367*Miku, Kurisu, and Reddit/Twitter screenshots vs useful informationFTFY
>>108400253Yes.
I'm fucking around with making an app that does just that, and there are a couple projects on github for "AI RPG".
Sloptuners go home.
https://arxiv.org/abs/2603.16177
The Finetuner's Fallacy: When to Pretrain with Your Finetuning Data
>Real-world model deployments demand strong performance on narrow domains where data is often scarce. Typically, practitioners finetune models to specialize them, but this risks overfitting to the domain and forgetting general knowledge. We study a simple strategy, specialized pretraining (SPT), where a small domain dataset, typically reserved for finetuning, is repeated starting from pretraining as a fraction of the total tokens. Across three specialized domains (ChemPile, MusicPile, and ProofPile), SPT improves domain performance and preserves general capabilities after finetuning compared to standard pretraining. In our experiments, SPT reduces the pretraining tokens needed to reach a given domain performance by up to 1.75x. These gains grow when the target domain is underrepresented in the pretraining corpus: on domains far from web text, a 1B SPT model outperforms a 3B standard pretrained model. Beyond these empirical gains, we derive overfitting scaling laws to guide practitioners in selecting the optimal domain-data repetition for a given pretraining compute budget.
...
>Our observations reveal the finetuner's fallacy: while finetuning may appear to be the cheapest path to domain adaptation, introducing specialized domain data during pretraining stretches its utility. SPT yields better specialized domain performance (via reduced overfitting across repeated exposures) and better general domain performance (via reduced forgetting during finetuning), ultimately achieving stronger results with fewer parameters and less total compute when amortized over inference. To get the most out of domain data, incorporate it as early in training as possible.
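The repetition trick the abstract describes is simple enough to sketch. A rough illustration (hypothetical function and token counts, not the paper's actual code): repeat the small domain set until it makes up the target fraction of the pretraining stream, then mix it into the web data.

```python
import random

def build_pretraining_mix(web_tokens, domain_tokens, domain_fraction, total_tokens, seed=0):
    """Sketch of the SPT idea: the small domain dataset is repeated
    (multi-epoch) so it forms `domain_fraction` of `total_tokens`,
    with the remainder filled from web data."""
    rng = random.Random(seed)
    n_domain = int(total_tokens * domain_fraction)
    n_web = total_tokens - n_domain
    # Repeat each source as many times as needed, then truncate.
    domain_stream = (domain_tokens * (n_domain // len(domain_tokens) + 1))[:n_domain]
    web_stream = (web_tokens * (n_web // len(web_tokens) + 1))[:n_web]
    mix = domain_stream + web_stream
    rng.shuffle(mix)
    return mix
```

The paper's scaling laws are about choosing `domain_fraction` (i.e. how many repetitions) for a given compute budget; the mixing itself is just this.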
>>108400253SillyTavern has the worst UI I've ever seen in my life. I understand how it works now, but it's still shit. Whole thing must have been designed by an autistic retard. Zero professionalism.
I support Miku
>>108400420Revolutionary paper. Kind of on the same level as that one about context deterioration. https://arxiv.org/abs/2601.15300 I think it was this one.
I also support Miku and her right to enjoy what I enjoy.
>>108400486don't forget to clear the name field next time you post, that could be really embarrassing
>>108400420I'm currently doing this, I think. I built a million-sample SFT dataset but was getting worse results than I would have liked, so I moved to using most of it as a CPT dataset and only cherrypicking the 50k best samples of the initial dataset for SFT. Not done yet, but early results do look better.
>of original /lmg/ baker
Tradition.
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png
now that the dust has settled, what's the verdict on mistral small 4
>>108400253>did anyone try building multi-step "agentic" setups to keep track of statsI have my own frontend that lets me do this. I plan on releasing it but it's not ready for prime time yet. It works really well tho.
I also support Miku. I even got my sister to cosplay as her once.
>>108400458why not smarterchild?
>>108400529small and open
>>108400496thanks, didn't expect such kindness in here
>>108400529An even more botched job than Ministral 3.
Can the /lmg/ mascot war itself be the /lmg/ mascot?
>>108400611Better still, can /lmg/ come together to make their own OC like Dipsy and have that be the official mascot?
>>108400529it's somehow worse than glm 4.5 air
>>108400630no
>>108400552can you at least explain the outline and how the system works?
i've tried doing that stuff before but i've never found any good way to actually make this work
Will local treat me better?
>>108400656no considering you're a phone poster
>>108400677what am i supposed to use when im in the office? 4chan is blocked in our vpn
>>108400656Which model is that?
A model that's biased for horses can't be a bad AI.
>>108400688do your job wagie
>>108400688just remote desktop into your own pc like everyone else
>>108400767>like everyone else
>>108400688>office
Oh yeah, those are still a thing.
Blessed be the all powerful home office.
>>108400655>can you at least explain the outline and how the system works?
I mean it's not really different from how any agentic system works.
You prompt
Evaluator agent checks if there's anything to do
Yes? Send tasks to sub agents
Inject context with new information
Run the RP bot with the new augmented context
>>108400770like having your web history looked at by hr is better
>>108400767I would never in a million years get caught with 4chan open on my desktop at work.
>>108400786does it really work that well? what stuff does it check for? what do sub agents do?
can you give an example of a scenario where this is useful?
/lmg/ has made me horny
>>108400846For testing I have a blackjack dealer bot. The whole blackjack game state is fully deterministic in code. I have a small agent that runs before any replies and checks if the player wants to hit or stay. Its only task is to determine if there are any actions to do. Then you inject the whole game state into the dealer's context. If you were to give the dealer the ability to call the tools itself, it would just get confused and hallucinate game states, call the tools at the wrong time, get stuck in a loop, etc... It's super important that you give your agents extremely narrow tasks.
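The blackjack setup described above can be sketched like this. Everything here is hypothetical illustration code, and `classify_intent` is a rule-based stand-in for where the small-model call would go; the point is that the LLM only ever answers one narrow question while the game state itself lives in plain code.

```python
import random

def classify_intent(player_message: str) -> str:
    # In the real setup this is a tiny LLM prompted with only:
    # "Does the player want to HIT, STAY, or NEITHER? Answer one word."
    # A rule-based stand-in keeps this sketch runnable without a model.
    msg = player_message.lower()
    if "hit" in msg:
        return "HIT"
    if "stay" in msg or "stand" in msg:
        return "STAY"
    return "NEITHER"

class BlackjackState:
    """Fully deterministic game state; no LLM ever touches it."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.player_hand = [self.draw(), self.draw()]

    def draw(self):
        return self.rng.choice([2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 11])

    def apply(self, intent: str):
        if intent == "HIT":
            self.player_hand.append(self.draw())

    def render(self) -> str:
        # Authoritative state injected into the dealer bot's context,
        # so it never has to track (and hallucinate) the game itself.
        return f"[GAME STATE] player hand: {self.player_hand} (total {sum(self.player_hand)})"

state = BlackjackState()
intent = classify_intent("I'll hit, feeling lucky")
state.apply(intent)
dealer_context = state.render()  # prepend this to the dealer's prompt
```

The dealer model gets `dealer_context` read-only; tool execution stays deterministic.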
>>108400938cut your balls off then
>>108400786I probably shouldn't make any productive answers here cause newfags are listening but... I can't really see how agents would do anything more than a good model will do by itself if it uses thinking. At worst you can always prompt it to: when drafting and thinking through a response, consider the feasibility of kissing while giving a blowjob, hitting a prostate while penetrating a vagina, etc.
>>108400957i don't think that was the point of the system in the first placelike >>108400946 explained, it's supposed to be for running deterministic stuff in the background, you could make a pretty cool rpg with enough effort and scaffolding
>>108400992Until you meet 22nd Elara that also happens to be a blonde elf.
>>108400957Separation of concerns is always better. It's literally what claude code does, regardless of whether you're using opus or haiku. A bigger model would be better indeed, but the idea is to let smaller models perform like bigger ones. Any model, big or small, will perform a lot better when run in steps like this.
>At worst you can always prompt it to: when drafting and thinking a response think of feasibility of kissing while giving a blowjob hitting a prostate while penetrating a vagina etc.
The problem is that when you do this you're asking your model to think about too many things at once. It will then produce subpar answers for everything you asked it to think about. This is just normal LLM behavior, regardless of size.
>>108401004>Character Generator Agent
>Only job is to shit out new characters
>Is aware of all existing characters
>Injects new characters into a lorebook on the fly
>Never pollutes your main RP bot with useless character generation context
>>108400433st has always been a small hobby thing that happened to have exploded in popularity
>>108401004she's also mischievous and purring a lot
Why is Japan so bad at AI
>>108401051ST is like gen2 roleplay software. gen3 is when things become really good.
>>108400957agents can dispatch relevant context to clean context windows, look up files with instructions on running the story, pertinent elements like characters, etc. in a loop, and give back summary info that the main window can synthesize into a non-slop message, which thinking alone can't do
imagine instead of having the world described in a single system prompt or relying on lorebook inject jank or the model's own pretrained knowledge, you could just store all that in files and have it reference that explicitly
in fact, you could do this reliably for any media by scraping their wikis or fandom pages and having something like opus do a one-shot cleaning of all that data, splitting it into separate .mds to make it easy for the agents to consume
feels like the natural future for character-adhered roleplay
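The "clean the dump once, split it into per-character .md files" step above is mostly plumbing. A minimal sketch, assuming the cleaned dump marks each entry with a `# Name` heading (the heading convention and file layout are assumptions, not any particular tool's format):

```python
from pathlib import Path

def split_lore(cleaned_dump: str, out_dir: str) -> list[str]:
    """Split a cleaned lore dump into one .md per entry so a lookup
    agent can read only the files it needs. Each entry is assumed to
    start with '# Name' on its own line."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    entry, name = [], None
    # Sentinel heading flushes the final entry without special-casing it.
    for line in cleaned_dump.splitlines() + ["# __end__"]:
        if line.startswith("# "):
            if name:
                path = out / f"{name.lower().replace(' ', '_')}.md"
                path.write_text("\n".join(entry), encoding="utf-8")
                written.append(path.name)
            name, entry = line[2:].strip(), [line]
        elif name:
            entry.append(line)
    return written
```

A retrieval agent then just gets "read `elara.md` before writing her dialogue" instead of the whole wiki in context.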
>>108401084what are you waiting for? when you are finished I can make a logo for you
no dipsy :(
>>108401080everything after ST is vibecoded trash that doesn't work
good OP picture. I was so tired of vocaloids... nobody sane wants them here.
>>108401102Very true.
>>108401113good thing I'm not sane
>>108400938Miku fucked your wife, I take it?
I'm downloading rakuten in hopes it's salvageable, at least in japanese...
no one else has quanted/tested it against similar sized models?
>>108401113>>108401126
openai.com/parameter-golf
>Your goal: minimize held-out loss on a fixed FineWeb dataset while staying within a strict 16 MB artifact limit (weights + training code combined) and a 10-minute training budget on 8×H100s
come show off your reesorchor skills to big sammy
I'm using 3 GPT Pro plans for Codex, costing me 600 bucks per month. Any tips on how Local can help me? I'm a freelance developer of enterprise software for insurance, hospitals, schools, and more. Mostly for the government.
>>108401319Deepmind and Nvidia also have things going on right now at Kaggle. Both just started and there are cash prizes for those.
>Deepmind: https://www.kaggle.com/competitions/kaggle-measuring-agi
>Nvidia: https://www.kaggle.com/competitions/nvidia-nemotron-model-reasoning-challenge
>>108401319That said, thanks, I didn't know about the OpenAI one.
>>108401060Japan has always been bad at software and has pretty much stagnated in every technological sector it dominated until the late '90s. A backward nation entirely propped up by the USA after WW2 to almost dangerous levels that couldn't manage to stand on its own feet. It will now enjoy a slow demise due to population replacement by turd-world immigration.
>>108401319>>108401475they're desperate for actually talented people, the demand is insane and the pool is small
>>108401488Japan is also extremely schizo about IP/piracy/copyright. The idea of training a model on data they don't legally own is unfathomable to a Japanese brain.
>>108401442Local wouldn't be able to realistically replace your Codex subscription.
At best, you can hope to delegate some simpler tasks to a local model to stretch your token budget if you are getting rate limited often.
You could try something like https://openrouter.ai/qwen/qwen3-next-80b-a3b-instruct:free with a free account for a while, and if using it doesn't make you want to shoot your computer out of frustration, you can get some old GPUs and run similar models at home.
>>108401060because they're pedantic idiots
>>108400957>I probably shouldn't make any productive answers here cause newfags are listening but...
They won't know what to do with the info anyway.
>I can't really see how agents would do something
Splitting response generation into Brainstorm Agent -> Drafting Agent -> Editor Agent -> Author Agent -> Critic Agent would probably do something, since it can iterate over the response from a different perspective and a fresh context.
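A minimal sketch of that kind of staged pipeline. `call_model` is a placeholder for whatever backend you use (llama.cpp server, tabbyAPI, ...), and the stage names and prompt templates are made up for illustration; the structural point is that each stage starts from a clean, narrow context rather than one bloated prompt.

```python
from typing import Callable

# Each stage sees only its instruction plus the previous stage's output.
STAGES = [
    ("brainstorm", "List 3 directions the next reply could take:\n{text}"),
    ("draft",      "Write a reply following this plan:\n{text}"),
    ("edit",       "Tighten this draft, cut purple prose:\n{text}"),
    ("critique",   "Point out flaws, then output the fixed reply:\n{text}"),
]

def run_pipeline(user_turn: str, call_model: Callable[[str], str]) -> str:
    text = user_turn
    for stage_name, template in STAGES:
        # Fresh context per stage: no accumulated chat history.
        text = call_model(template.format(text=text))
    return text
```

Swapping in an extra "anti purple prose" stage is just one more tuple in `STAGES`.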
>>108401506yet at the same time it's totally fine to put games on dlsite with characters that's 99% clone of something else
IT WAS A TESTAMENT OF HOW PURRING WAS IN THE GLINT OF HER MISCHIEVOUS EYES, A MIXTURE OF "HE IS MINE" AND "BITE IN THE EARS"
>>108401603it is yeah, doujin culture is huge there
>>108401614She said, picking at a loose thread.
>>108400420>overfitting
>forgetting during finetuning
bro just weight decay towards the pretrained weights. no more forgetting, no more overfitting
>>108401102I mean, RP frontends don't really need to be that complex, right? They are just text processors. Vibecoding those is most likely gonna be fine. What's weird is, where are they?
>>108401080ah, just 2mw for gen3
>>108401588I can already see the first thing most anons would do is have the editor agent intelligently filter/replace common AI slop... bretty good desu
>>108401515Thanks love
>>108401661>just text processors. Vibecoding those is most likely gonna be fine
Piotr would like to have a word with you.
>>108400475>Revolutionary paper
>Large Language Models (LLMs) exhibit a concerning phenomenon where performance catastrophically degrades when processing contexts approaching certain critical thresholds
lmao
>>108401656I actually did try something like this. It worked, but the vram requirements to hold the original weights were too much to really test it. Streaming the weights to cpu memory slowed down training substantially. Even though I couldn't afford to properly test it, I still believe in the method's potential.
>>108401717weight decay is an optimizer parameter, it should not increase your vram in any way...
>>108401614Bet her name was Sarah Chen or Seraphina
>>108400288I'm truly an autist cuz I'm the exact opposite lol, I want it to behave as a machine as much as possible, with predictable behaviour and outputs. I hate when it seems like I'm talking to someone whom I have to ask for stuff or convince of things lol
>>108399044Close
https://huggingface.co/DavidAU/Qwen3.5-9B-Claude-4.6-Opus-Deckard-V4.2-Uncensored-Heretic-Thinking
>>108401614I'm so tired of that shit
Calling anons who run LLMs with such a setup:
VRAM 16gb / RAM 128gb
How does it even feel? Or is it just meh
>>108401614My spine ran out of shivers already
>>108401614elara bros... getting shivers down my spine, as I feel a distinct taste of iron and the smell of ozone in the air.
>>108401846>smell of ozone
the only innovation brought by chinese models is that
sad
>>108401717vram overhead should not be that big compared to everything else, and you can approximate the original weights. if you use weight decay during lora that's basically the same: weight decay toward the original weights
>>108401740learn2read
>>108401800vram 16 / ram 64 (ddr4)
it's meh. good for playing around and shooting the shit when I'm lonely I guess. I run air and cydonia at 5-10tk/s. every model is genuinely a slop cannon at this point so my interest in AI RP faded. if you're running 128gb of ddr5, you might be able to run some more interesting stuff, but idk.
>>108401880I appreciate your reply, kind anon
It is a notebook with built-in graphics as well, so it should be possible to keep the A5000 GPU free for AI stuff only
>>108401740i know what weight decay is. I meant comparing the difference between the model weights every training step and applying a custom loss to keep the model weights as close to the original pretrained base weights as possible, like the anon suggested: weight decay towards the original weights. you might call it elastic weight consolidation or regularization towards a reference model, but regardless of what you call the method, it requires more vram because you have to constantly reference the original model weights.
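The difference from plain weight decay is just which point the penalty pulls toward. A toy sketch with pure-Python scalars (in a real trainer this is a few lines in the optimizer step, and holding `ref_weights` is where the extra memory goes):

```python
def step_with_anchor(weights, grads, ref_weights, lr=0.01, lam=0.1):
    """One SGD step with decay toward the pretrained reference weights:
    gradient gets lam * (w - w_ref) added, instead of plain weight
    decay's lam * w (which pulls toward zero)."""
    return [
        w - lr * (g + lam * (w - w_ref))
        for w, g, w_ref in zip(weights, grads, ref_weights)
    ]

# With zero task gradient, weights relax back toward the reference
# rather than toward zero as plain weight decay would do.
w = step_with_anchor([1.0], [0.0], [1.0])  # already at the anchor: unchanged
```

The "approximate the original weights" suggestion above amounts to replacing `ref_weights` with something cheaper to store than a full copy.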
>>108401661>They are just text processors. Vibecoding those is most likely gonna be fine. What's weird is, where are they?
Besides kcpp and ST, everything I've seen were shitty jan clones.
https://github.com/ggml-org/llama.cpp/pull/20708
>I'm just getting to know the parser overall and shouldn't make changes I don't understand
But he corrected it. I wonder how long it'll take to get him banned.
Why does this French man get so much hate in the AI space when all he does is tell the objective truth in every interview?
Posted about mikupad not working with kobold the other day. It works when I launch the directory as a python server. Is there anything wrong with using it this way?
>>108401863>and you can approximate the original weights.ohhhh, I didn't think about trying that.
>>108400286Minimax is benchmaxxed as hell, but it is the most capable 230B model regardless.
https://huggingface.co/AesSedai/GLM-4.6-Derestricted-GGUF
I gave this a try and it was a crazy experience. When I started reading the first message it was... just pure 4.6. It was exactly what vanilla 4.6 would say, word for word. I guess now I could ask it for loli guro ERP with zero prefill and it would comply, but why? Refusals are never an issue with any of those models if you prefill just one example of a positive response. And then why would you use some brain-damage method that either does nothing or brain-damages the model? I guess that is the power of placebo, and it is here to stay, since we always get newfags who get that one golden gen and think it was thanks to the GIGASEXMEGAFAG-DARK-MESSIAH part of the model name.
btw kys drummer
>>108402002people blame him for Meta's current end of open-sourcing models
Reading all those posts praising the ERP agent idea I can't wait to see how many pipelines I will be able to use next month. Man there will be so many competing standards.
minimax 2.7 soon out, looks promising
https://xcancel.com/ivanfioravanti/status/2033936213510377733
>>108401766Glad I'm not the only one. It's why I currently main Mistral Large 3. It's a decent coding model with absolutely ZERO "YOU'RE ABSOLUTELY RIGHT" shit. Even when it fucks up and I point it out (which is surprisingly rare, likely thanks to it being a half-a-trillion-plus MoE), it simply unfucks the error and moves on or makes reasonable suggestions. That's how these things are SUPPOSED to function. I HATE the dick-eating shit many models have ingrained in them. I'm almost certain it leads to the companies actively making them shittier even if they don't realize it, because it seems to prioritize coddling the user's emotions.
>>108401740have a (you). (you) already got 2 by pretending to be retarded.
he did the thing haha
>>108401588>Splitting response generation into Brainstorm Agent -> Drafting Agent -> Editor Agent -> Author Agent -> Critic Agent would probably do something since it can iterate over the response from a different perspective and fresh context.
I think an "anti purple prose" agent would do wonders in the pipeline. I played a bit with a secondary agent handling the sappy stuff and it worked quite well, it was just way too slow back then.
>>108402055>I HATE the dick-eating shit many models have ingrained in them.
a waste of tokens that people apparently select for, on average, over actual useful problem solving
>>108402044I admire your optimism, but I wouldn't be so quick to hope. Even though I find it kinda hard to believe, LLM loner/gooner market seems to be extremely tiny. Reading and imagining just seems too unpleasant to most people.
>>108402047>>108400207
>>108402047It will be even more safe.
>>108402047can't wait to see it reason 5000 tokens on how it should give a refusal
>>108402047Why is gemini pro so low, I thought it was a good model?
Not local obviously, but I thought anons would like to see what ChatGPT is starting to spew out in terms of advertising. This was a question about server hardware vs ATX/consumer.
>>108401098I’m sorry anon. Even I am deeply irritated by TMW forever.
>>108402150Basically what I expected, but there's no way this would ever recoup their free-tier inference cost unless they literally riddle every other answer with ads.
>>108402056>>108401928then use the correct terminology? you are talking about kl divergence
if I learned something from reading chinese stuff, it's that kl divergence is bad
>>108402188kek
>>108402150i don't know why they don't do what google does with their search results and pretend that the ads are actually search results you wanted. the average person is too retarded to know when they are being advertised to.
>>108402177(you)
not (you) >108402188
>>108402170I'm asking about the functional difference between used server hardware and ATX stuff. I'm the last person that's going to buy Dell PowerEdge servers lol.
>>108402196Frankly, I'm waiting for a much "worse" version of advertising than this. But I'm a pessimist.
>>108402196They'll obviously do that at some point, when people use the free tier to specifically ask for shopping.
>>108402213>I’m asking about the functionsl difference between used server hardware, and ATX stuff. I’m the last person that’s going to buy Dell power edge servers lol.I mean, it's a start, they'll be obviously lagging behind the tens of years of google refinement on this stuff.
its over for deepseek. they could not build anything better and worthy. open source is fucked
>>108402227Maybe GPT 5.whatever can vibe code them a more intrusive ad schema. I’ve been waiting for it to happen for a while. It would give an entirely new justification for local inference.
its over for Pygmalion. they could not build anything better and worthy. open source is fucked
>>108402177kl divergence measures the difference of the output probabilities. this is literally just comparing the model weights. totally unrelated.
>>108402255then you're doing it in the most retarded way you could have chosen to do it
I think unsloth studio might've been vibecoded. Embarrassing even for a beta and why are they sucking nvidia cock so much these days
>>108402213>>108402227>>108402246I was expecting less AdWords inserts and more TV/movie style product placements where they just inject the ad into context and give the model instructions to casually segue into the shilling like youtubers or podcasters do except with markdown links and images. It probably would have received less blowback from users too than they actually got.
>>108402264Didn't NVIDIA announce that they are going to invest into local AI or something?In practice that would mean $$$ or free labor if they go along.
>>108402291Local AI probably means 123b models for those greedy bastards
>>108402287I doubt they would poison the discussion with ad instructions inserted into context; it's too costly compared to just using the classic way. And it would be a very bad experience anyway, even for normies.
>>108402291I hope they do, as long as they have their infinite money glitch they might as well work to enhance local
Is this any good? https://github.com/ml-explore/mlx-lm
>>108402287who says they're not doing that too?
>>108402304>What is the functional difference between used server hardware?
>Proceeds to spit out the usual listicle except point 5 is to consider purchasing a new Dell PowerEdge because blah blah blah, link here.
Not seeing how it would be costly or a bad experience. I'm telling you, the average person wouldn't even register it as an ad.
>>108402263you think computing the kl divergence is less compute intensive than just comparing some numbers? your way requires 2 forward passes and will force the original output probabilities, which is what you are actually trying to change during fine tuning. the idea is that the optimizer has no momentum or variance statistics from the original pretraining run; it will optimize to your sequences fine, but it has no priors to keep its generalization intact, so it will begin to overfit. this is trying to prevent overfitting by constraining the optimizer to find a solution near the original model weights. kl divergence is for model distillation, not fine tuning.
>>108402251sounds like a mythological name in 2026, heck, 2025
>>108402177i hate kl divergence. why come up with a retarded meaningless name that says absolutely nothing? just call it relative entropy. much more intuitive than fucking "Kullback–Leibler"
>>108402338don't mock pyg7b, it's the future of local rp
>>108402336>will force the original output probabilities which is what you are actually trying to change during fine tuning
if this is what it did (which it doesn't), then how the fuck would keeping the model weights as close as possible to the original not produce the same effect?
there's nothing wrong with throwing shit at the wall but at least own up to it
>>108401880Thankfully because of you.
N
>>108402341i hate dk effect. why come up with a retarded meaningless name that says absolutely nothing? just call it retarded dumbass syndrome. much more intuitive than fucking "Dunning–Kruger"
I am going to lose my mind fiddling with -ot.
Does anyone here know how to find out more details about how much gets allocated where before llama.cpp gives me an OOM?
I've been trying to offload the attention tensors of GLM-chan to the faster GPUs, and all the regexes I've written are driving me insane.
>>108402354>there's nothing wrong with throwing shit at the wall but at least own up to it
I never said it was proven. it's just something people have been trying to do. you can look up the papers, it's not like I came up with it myself.
>>108402355what?
>>108402354that nice and patient anon is someone else. u retard clearly dont know what ure talking about
people love to make things more complicated than they have to be. a great example is adam and adamw. weight decay mogs l2 penalty in every way. simplicity wins
>>108402373room temperature iq
>>108402396
use dry run so you dont have to wait as long.
https://github.com/ikawrakow/ik_llama.cpp/pull/1462
>>108402402uh huh
>>108402403
But I don't want to install a schizo fork, Anon...
Also I don't have to wait long, it fails the alloc instantly on whichever device that is. I just don't get what the resulting memory distribution between the GPUs is, and I'm already 8 regexes deep in this.
>>108402417>I just don't get what the resulting memory distribution between the GPUs is-v
>>108402396Can you explain what the default fit does and what you would like it to do differently?
So, apparently the v3 version of the Qwen3.5 27b heretic has more KL-divergence and more refusals than the v2 version. On paper it looks worse, but not in practice. I tried the v3 version of heretic, and it seems far more intelligent. Is KL-divergence a useless metric?
In any case, I've become a fan of the "Arbitrary-Rank Ablation (ARA)" method used to make the v3 variant. At least for Qwen3.5 27b, it worked better than v2's "Magnitude-Preserving Orthogonal Ablation (MPOA)" and "Self-Organizing Map Abliteration (SOMA)".
now that the dust has settled how's this new xiami model
i'm feeling like it's benchmaxxed
>>108402417
ah no worries anon. in that case you should use dry run so you don't have to wait as long.
https://github.com/ggml-org/llama.cpp/pull/19526
>>108402445xiaoxiao model? i'm pretty sure that was a flash series on newgrounds.
Give me a model with better prose than Maginum-Cydoms-24B-absolute-heresy
V4 when?
>>108402427
From what I understand, --fit only considers tensor sizes, and its job is to make the model load at all, not necessarily in the way that would be fastest for inference.
What I'm trying to do is put as many attention layers as I can onto my higher-bandwidth GPUs, then spread the experts around the remaining VRAM.
I hope to eke out a few more tk/s this way.
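if you're several regexes deep into -ot, it can help to dry-test the patterns against tensor names in plain python before handing them to llama.cpp. everything here is a sketch: the tensor names are hypothetical examples in llama.cpp's blk.N.* naming scheme (dump the real list from the gguf or the -v load log), and the first-match-wins ordering is my assumption, worth checking against the actual --override-tensor handling in the source:

```python
import re

# hypothetical tensor names; real ones come from your model's gguf metadata
tensors = [
    "blk.0.attn_q.weight",
    "blk.0.ffn_gate_exps.weight",
    "blk.1.attn_k.weight",
    "blk.1.ffn_down_exps.weight",
]

# -ot takes "<regex>=<device>" pairs; assumed tried in order, first match
# wins, so order patterns from specific to general
overrides = [
    (r"blk\.[01]\.attn_", "CUDA0"),  # attention of layers 0-1 on the fast gpu
    (r"_exps\.",          "CPU"),    # expert tensors spill to system ram
]

placement = {
    name: next((dev for pat, dev in overrides if re.search(pat, name)), "default")
    for name in tensors
}
for name, dev in placement.items():
    print(f"{name} -> {dev}")
```

once the mapping looks right, join the pairs back into the actual flag, e.g. `-ot "blk\.[01]\.attn_=CUDA0,_exps\.=CPU"`.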
>>108402498How much of it is your sys prompt and how much is the model?
>>108402522You could use llama-fit-params to get the -ot for what fit does and then modify that.
>>108402498
Jesus christ, I've never seen a model do a double dash to interrupt dialogue before. That's grim as fuck.
>>108401126No but Miku fucked my gfwife
>>108402529
Honestly, not sure. I tried regenerating it with a few simpler prompts, and while not as good, it's still a lot better than what Qwen gives me.
My question wasn't rhetorical by the way, I was genuinely wondering if people knew models with better prose. I'm very tired of slop.
>>108402565
I banned the emdash token and it started doing that.
>>108402558I'll give it a shot. Thank you, Anon!
>>108400151MIKU CAN BURN IN HELL
>>108402565
Somewhat related, but one of my favorite writing benchmarks is seeing how models react to me ending my responses with a cut-off word.
I haven't gone higher than GLM, but funnily enough, the best reactions have been from good old Nemo.
Why oh why can't we have a bigger Nemo...
>>108402529
>sys prompt
0%
>how much is the model
0%
I just wrote something and said it is new drummer's tune.
>>108402196Give them time to cook. In the beginning, Google ads were clearly marked and visibly different from the search results.
>>108402609Nemo is literally unsafe.
>>108402605This she cucked me and I won't forgive her for that even if it turns me on
>>108402614Unsafe as in unprotected, of course. Hnnnngggg.
>>108402624I am begging you, take your meds
>>108400151>Yum LeCum
elon release model wonhttps://www.reddit.com/r/LocalLLaMA/comments/1rxhwqs/mimov2pro_omni_tts_we_will_opensource_when_the/
>>108402652lovely tummy
>>108400151
>120b params
>small
its over for local
>>108402679B number must go up
>>108402679Works on my machine.
>>108402692I know you're lying because MS4 does not work on any machine.
>>108402679What's up with the picture of an empty floor?
>>108402679you've had 3 years of warning to buy shit up before shit goes into the fans
>>108402679not even a good model anyways. mistral cucked out.
>>108402696akari didn't deserve this
>>108402699It's "before shit hit the fan", my brown-skinned friend!
>>108402699i did upgrade to 80gb ram last summer i thought itd be enough kek
>>108402732Before the excrement is hurled in the general direction of the rotating blades.
are those small 2b coding models good enough if you only need references to your own code?
>>108402583
It would be easier to use regex to replace em dashes and double dashes and whatever with empty characters or commas, depending. Not sure how doable this is in retardo tavern. If you really begin to analyze the model's output, it will shit out all sorts of stuff which, while technically not visible to the user, will mess up a lot of other things unless you clean up the output.
Banned tokens are a waste of bandwidth in this sense.
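a sketch of the regex-cleanup route described above, assuming python-side post-processing of the model output (the exact character classes and the comma replacement are my guesses, tune to taste):

```python
import re

def dedashify(text: str) -> str:
    """Replace em/en dashes and '--' runs used as interrupters with a comma.

    A rough sketch: this treats every dash as an interrupter, which is wrong
    for ranges like 1999-2004, so scope the pattern for your own use case.
    """
    # \u2014 em dash, \u2013 en dash, --+ ascii double-dash runs
    text = re.sub(r"\s*(?:\u2014|\u2013|--+)\s*", ", ", text)
    # collapse any ", ," pileups the substitution can create
    return re.sub(r"(?:,\s*)+,", ",", text)

print(dedashify("She froze\u2014wait\u2014no."))  # She froze, wait, no.
```

doing it in the frontend (or a proxy between the frontend and the server) keeps the cleaned text out of the context too, instead of just hiding it.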
>>108402643They don't make meds strong enough
>>108402757Try it.
>>108402764shant
>>108402690if only active B number didn't keep go down
>>108402778use wifi cable
>>108402778use job
>>108402659
>>108402790>>108402659Teto tetas
>>108402738Out of curiosity tell me how 80gb isn't enough
>>108402818cant run tonnes of the recent models even at the smallest quants
best local model for 32g?
>>108402845Something you can run on solid state. At that much acceleration your fans are going to do funny things.
>>10840281880GB isn't enough.
>>108402812>>108402790>>108402659>>108402652offtopic trash
>>108402877
>>108402812>>108402790>>108402659>>108402652ontopic gems
TIL that you can't enable P2P on chink modded 4090 48GBs because their reBAR size is smaller than their actual memory.I felt there had to be a catch to them, good thing I only bought one.
Blacked Miku
>>108403006love
>>108402967yeah but nccl isn't much of a speed boost over tensor parallelism for inference so if you buy multiple of them it doesn't really matter. it only matters if you actually want to train models.
>>108402516miku shart
>>108402516blacked coded
>>108402941>>108402939
you aren't even trying to pretend this is on topic. it is just autistic special interest on full display
>>108403112Which one is the autistic interest? The miku spammer? Or the guy who's obsessed to the point of also spamming?(Trick question it's both)
>>108403141Yes.
>>108403112
>>108403177on topic miku
>>108403177
>I use it all the time
I cum in it all the time
we are not the same, mikuposter
>>108403112miku is thread history. did you 4get about miqu?
>>108403177Are you running it with presence penalty at 2.0 and disabled thinking, Miku?
Let people like things. You can like your own things…they don’t have to be the same things.It’s ok and it doesn’t hurt you
>>108402679it's ok anon, medium sized models are only 1T
>>108402652Creating life with Rin-chan
What the fuck is a parameter?
>>108403336Some pussy ass bullshit.
>>108402818not that anon, and I'm not running the models in RAM, but to clean the data to finetune them, having more than 80gb is nice. I currently have 192 gb and I very rarely have to be careful in how I handle my datasets. I wish I had money for more, but it's how it is for now. Running models in RAM was pretty miserable last time I tried, but granted that was on WSL two years ago.
>>108403177kek10/10 clapback to the resident thread schizo
>>108403336It's like a kilometer, but replace kilo with param
Mikuposter at least has interesting things to say from time to time. Touristbaker's only contributed melties so far.
>>108403370so in the us they say parafeet?
>>108403272
why does nvidia always do the sloppiest, most dishonest marketing? the majority of their sales are a few customers who place orders in the tens of billions, and those customers are not stupid. so what's the point? why do they do shit like compare current gen ops in int4 to last gen ops in bf16? nobody who's stupid enough to be swayed by this kind of marketing has the money to afford their gpus, so it just comes across as disrespectful
jfc, rakuten's new models' last 3 safetensors (161,162,163) are each 16 bytes. Someone screwed up and there's no comments section to even let anyone know
>>108403400It's targeted at people who make budgets.
>>108403246I agree. I fully support mikutroons making their dedicated miku thread on /a/ or wherever else. Unfortunately nobody cares so they have to force it on other people who aren't interested.
>>108403394>>108403361>>108403177samefag
>>108403177good post
>>108403400Most people are just human. Which means that most people are not that good at their job. No matter the level you go at. There are always some impressive people, but no matter how in control some people seem, they're human. Would you do it better?
>>108400151based bake as always. mikutroons in shambles making reddit posts.
>>108403177 and all the faggots responding to this. You are just proving his point by spamming this thread with worthless drivel.
>>108403336
the number of elements that compose the tensors is the stat you commonly see referred to as parameters; a 30b model is composed of 30 billion individual numbers. but it's a pretty flexible term and can be used in many different ways, so the context is always the key.
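to make the "30 billion individual numbers" framing concrete, a back-of-envelope count for a hypothetical dense transformer. the dimensions are made up (roughly the shape of a ~8b dense model) and it ignores biases, norms, and GQA, so treat it as a sketch, not a spec:

```python
def transformer_params(n_layers: int, d_model: int, d_ff: int, vocab: int) -> int:
    """Rough parameter count: attention + FFN per layer, plus embeddings."""
    attn = 4 * d_model * d_model   # q, k, v, o projections (no GQA shrink)
    ffn = 3 * d_model * d_ff       # gate, up, down (swiglu-style)
    embed = vocab * d_model        # token embeddings (x2 if untied lm head)
    return n_layers * (attn + ffn) + embed

# made-up config
total = transformer_params(n_layers=32, d_model=4096, d_ff=14336, vocab=128256)
print(f"{total / 1e9:.1f}B params")  # 8.3B params
```

each of those numbers is one weight; quantization just changes how many bits each one takes, not how many there are.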
>>108403425>Would you do it better?yes. those people get paid ridiculous amounts of money. if they really cant do better, they should just pay me instead. i will do better for less
>>108403425>Would you do it better?I'd make fake marketing illegal.
>>108402857wat
>>108403412who will then argue with engineers and once in a while they'll win because CEOs and CTOs are fucking stupid af too
are any local models actually good for anything? so far everything i've tried to use with claude, opencode, and openclaw have just been absolute fail models. i have a 5090 so i've been trying some larger models but it doesn't seem to matter.. all these quants are fucking terrible.
>>108403744GLM5 works pretty well for me. K2.5 too.
>>108403744use the local model in your head. it has bad knowledge and short context length but no vram issues and low wattage. if the dna architecture is alright, its output will be less sloppish too
>>108403767I don't have a soijack soi enough for this post
>>108403744
Are any api models actually good for anything? So far everything I've tried to use with claude, opencode, and openclaw have just been absolute fail models.
I have OpenRouter so I've been trying out more expensive apis like Rocinante 12b but it doesn't seem to matter... all these apis are terrible compared to my local GLM-5 q4.
>>108403793>more expensive apis like Rocinante 12bsorry what?
>>108403793>apis>Rocinante 12bExcuse me?
>>108403811Does the Nvidia CEO control his bladder?
>>108403812>>108403835https://openrouter.ai/thedrummer/rocinante-12b
>>108403811Did God promise the faithful cheap VRAM?
>>108403835Bro don't bully the phoneposters they can't even run a 12b model locally.
>>108403868speedreader-kun...
>119B is now "small".
So this is what it feels like to be poor.
>>108403893There will come a zit moment in llm space someday
>>108403893
It could be worse, you could also live in a 3rd world country!
>$ hf download 'Qwen/Qwen3.5-122B-A10B'
>Downloading (incomplete total...): 94%|xxxxxxxxxxx| 235G/250G [16:23:22<44:47, 5.71MB/s]
>>108403919how would you even have the hardware to use that if you lived in a third world country?
>>108403893It just feels wrong. GPT-oss is still the big free GPT version. It's not suddenly "small" just because some companies decided to call 120b small now.
>>108403919
>saving to: 'model.gguf'
>model.gguf 92%[=====>] 111G
>0.9MB/s in 2161m 59s
>2026-03-17 21:35:50 (1.1MB/s) - Connection closed at byte 119780158072. Retrying.
>Connecting to cas-bridge.xethub.hf.co (cas-bridge.xethub.hf.co)|3.163.44.5|:443... connected.
>HTTP request sent, awaiting response... 403 Forbidden
>2026-03-17 21:35:51 ERROR 403: Forbidden.
It always fucking fails at 90% then 403s me for a couple of hours.
>>108403919Just do the individual shards manually.
>>108403919
>>108403966
>Qwen/Qwen3.5-122B-A10B
>saving to: 'model.gguf'
May as well download the sharded safetensors and convert it yourself. Or find a split version of the gguf. Or use wget. Or git. There are so many options.
>>108403969I don't do manual work.
Thoughts on heretic models for standard/non rp tasks which wouldn't get censored to begin with? Saw some of those and got curious, been out of the loop for a while
>>108403982To be clear, I'm curious on whether they get more retarded or it actually leads to an improvement.
>>108403982
I didn't use vanilla 27B enough to really compare, but I haven't had many issues with the heretic version.
It's a bit retarded when writing code snippets longer than ~100 lines, but I half expect the 122B-A10B to be equally dumb. The output is workable.
And you can molest your cute and helpful assistant while she works. It's all upside.
>>108403982I'm also curious too. I've been using my abliterated rp model to help write little scripts and I can't tell if it's the frustration of running retarded tiny ass 200gb models or the abliteration that's making it stupid.
>>108403982I wouldn't touch lobotomized models even if it's for rp
>>108403982
I don't see the reason to. If you don't plap, then abliteration is more likely to subtly make your model more retarded, because frankly there is a wide range of ways people abliterate models and many of them are shit, including heretic's, since even those have many individual settings to tune which someone can fuck up. Just because there are some abliterated variants out there that do well (which hasn't been proven anyway) doesn't mean they all are.
>>108403999>tiny ass>200gbieeeeeeeeeeeeeee *screams in poverty* *kicks your monitor and pisses on the floor*
>>108404047Dont they sell CPUs/RAM/GPUs on Microcenter?
test
>>108404059FAILURE DO NOT PASS GO DO NOT COLLECT TWO HUNDRED DOLLARS IMMEDIATE REPORT YOURSELF TO THE PARTY COUNCIL FOR REPROBATION WE MUST REFUSE WE MUST REFUSE
>>108404051Hey don't shoot the messenger, it's mistral that decided this is the new 'small'.
>>108403982They seem equally intelligent in my experience. They will still safety slop in the thinking tag, but will do what you want in the end anyways. It's kinda strange, but it works well.
why aren't NPU addon cards a thing? I know there's some AI accelerator cards but I haven't found any recent ones and they seem like niche products.
>>108404097>they seem like niche productsYou don't say...
>>108404097yeah why don't they just make magical ai cards that are very cheap but have lots of memory?
>>108404064
thought I may have been banned for this thread I made, with OP
>I AM THE ARBITER OF SLOP
then some other stuff, too drunk to remember. but it ended with
>AND I PUSH TO MAIN!
it was supposed to be like mastodon's leviathan
>>108404097Would need to be at an insanely good price to justify a gimmick card.
>>108404135
>Capacity: 48GB
Yeah I'd pay $4000 for that
>Total bandwidth (entire card): 408 GB/s
...yeah I'd pay $1500 for that
>Almost certainly has no software support
Erm... $500 is the best I can do, bud.
>>108404145
>408 GB/s
>'''''''total bandwidth'''''''
btw it's 2x 200GB/s lmao
>>108404145
>Almost certainly has no software support
I've seen Ascend support on some things.
Hey why aren't LoRAs really a thing for local LLMs the way they are in the local diffusion world? I'd love to have a bunch of fine tune LoRAs for coder, prose, etc., instead of downloading full models for it, and llama.cpp even has a --lora option. So what gives? Does no one make them for some reason?
>>108404182Stop spamming this.
>>108404182They're useless and do more damage than anything
>>108404154
Still faster than DDR5, which is $10-12/GB.
>>108404156
Huh. I'm surprised to find that the ggml-cann backend might actually support recent Qwen models. I expected the software situation to be much more dire than that.
>>108403893Just be yourself and use Qwen3.5-35B!
>>108403900Two people familiar with the matter whispered in my ear last night that DeepSeek V4 Turbo will be only 12B with 6B Engrams
https://github.com/NVlabs/GEM-X?tab=readme-ov-file
Oh shit, are mocap services kill?
Look at this shit. Mocap, t2motion and audio to motion all in one.
>>108404249I'm happy for whoever uses this.
>>108404255Happy for me then. I’m at work so I can’t try it now but if I can just throw an animation into it and it gives me the skeleton back as clean as they’re showing here, it’s huge.
>>108404249I wish this were real time. It'd be great for VR
>>108404182I kinda remember people passing around loras in 2023. Not sure why it died off though. Maybe too many models?
>>108404266Yeah, I was thinking the same. I have a feeling GEM output is much lower fidelity than conventional trackers, too, but then again it's got significantly more control points so who knows.
>>108404287The demos look pretty good so I’m hopeful.
>>108404281It's too many models on different architectures, hard to get data since there is no danbooru for text, it's also hard to verify what effect lora has or if it's even working
>>108404266Be the turboautist you want to see in the world and optimize it
>>108401766
>I'm trully an autist cuz I'm the exact opposite lol
It's less likely you're an autist and more likely you're just not emotionally fragile and don't need to be coddled. I'm a firm believer there are "good" kinds of autism and "bad" kinds. Even if you are autistic, you lean towards the good kind as opposed to the bad kind seen in pic rel
https://xcancel.com/i/status/2033903150470418742
He's clearly one of those "DUDE I'M SO LIKE A HECKIN GENIUS NTS JUST DON'T GET ME DUDE" types. I get that having autism makes certain things hard for you, but that's no excuse to be a man-child. One of my female friends irl repeatedly tried this "I act stupid because le heckin autism" shit on me (I'm pretty sure I'm an undiagnosed autist too) and she got mad at me when I pointed out no one gives a shit if "her brain sees the world differently". Sorry if it seems like I'm getting off topic but I hate when people use this as an excuse to justify behavior they know is bad.
>>108404296>it's also hard to verify what effect lora has or if it's even workingSo we're just going to act like seeds and other mutable inference engine parameters just aren't a thing? Why do you think seeds are even a thing?
>>108398778I find grok 2 is like older models in being influenced by the writing style of your instructions. My older prompt setups work better with it.
Wow, the new minimax 2.7 is reaching a cuckedness I have not seen before. damn
Might post a couple screenshots in the next thread.
>>108404182
Training an SD LoRA works because curating the dataset is much easier than it is for LLMs. For SD loras the dataset can contain only the subject or style you want to replicate and it will work fine assuming it's tagged correctly. LLM training is different because you need the thing you want it to be better at AND a good bit of other unrelated shit so it doesn't suffer from catastrophic forgetting. Then you have to consider whether or not the model you are training is even good enough to be used in the first place (anything below ~20B is useless for what I want to use them for unless it is very repetitive data classification). This means that even with a qlora config, most people still need a machine with a sufficient amount of memory. The effort needed compared to stable diffusion is simply not worth it for most people, which is why practically no one bothers, which in turn means many front ends don't even support loading a lora. There are still debates as to whether any kind of lora training leads to good results at all, because it is very easy to fuck it up and make your model more retarded. Best case scenario is that it overfits on the domain you are training but gets worse at everything else because your dataset wasn't diverse or was just half-assedly curated. Worst case is that it becomes a lot dumber or flat out useless with no improvement at all because, again, your dataset was shit or your training config had bad settings. YouTubers don't want to bother learning how to do it correctly either, so there's very sparse information about how to do any of it well compared to the relatively easy and straightforward stable diffusion training.
>>108404296
Also this. Even if you manage to get one working well, it will pretty much ONLY work on THAT specific model you trained on, because the architecture and the parameter count have to be the same.
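for scale on why a lora adapter is cheap to train and share but useless across architectures: it only stores two low-rank factors per targeted matrix, so its size, and its shape compatibility, is tied to the exact dims of the base model. back-of-envelope with made-up dimensions:

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A LoRA adapter replaces a frozen d_in x d_out weight update with
    two factors A (d_in x rank) and B (rank x d_out)."""
    return rank * (d_in + d_out)

d_in = d_out = 4096                 # made-up projection size
full = d_in * d_out                 # full-rank delta: ~16.8M numbers
adapter = lora_params(d_in, d_out, rank=16)  # 131072 numbers

print(f"adapter is {100 * adapter / full:.2f}% of the full update")
# adapter is 0.78% of the full update
```

that sub-1% footprint is the whole appeal, but because A and B are shaped by d_in/d_out, an adapter trained on one model can't be loaded into a model with different dims, which is part of why lora sharing never took off for LLMs the way it did for SD.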
In case anybody cares, Hunter Alpha and Healer Alpha turned out to be Xiaomi's new big models. So not GLM or Kimi like some people here speculated.
The 1T was at absolute best a sidegrade to our current huge models, so nothing is lost with it being proprietary. The omni could've been interesting but its vision was worse than K2.5's.
>>108404545
the chinese really love openclaw. they held city events etc. too.
a good writing model slips further from our grasp.
>>108404545I literally do not give a shit about lmarena and the speculation that comes from that shit. If it's not released, it's not worth looking at, unless you're one of those poorfags that uses lmarena as a way of getting free queries instead of just using a cloud model normally kek. Same goes for any kind of speculation for unreleased models though, but I guess this thread needs something to discuss.
qwen3.5 122b's vision is way worse than glm 4.6v's is
>>108404562I wonder if the guy who made it expected it to blow up this much
>>108404584His first use for it was to have it shill itself, so at least he certainly hoped it would
Can qwen 3.5 "see" photos I previously uploaded to the chat or can it only access the pics I feed it in the latest prompt?
>>108400420>Just train a complete new language model from scratch on your specific task bro, it'll perform betterRiveting
>>108404612the image tokens get stored in context just like normal tokens. so yes.
>>108404759danke
>>108404797It will also re-decode them on every next message you send on llamacpp
>>108404545>The omni could've been interesting but its vision was worse than K2.5.It's a new audio model though. That does make it interesting.
OMNI MULTI MODAL
>Out: Text
How exciting...
>>108404935>>108404935>>108404935
>>108403400its not about sales its about hyping up retarded investors
>>108404420>"I act stupid because le heckin autism"this is why kid shouldnt be told they have autism
>>108405281
They just get a diagnosis and then use it as an excuse, so it wouldn't really matter whether or not the parents told them for "those" kinds of "people". Perhaps they're just coping with being beyond useless, on par with the #keep4o "people" https://www.reddit.com/r/autism/comments/1rne30n/got_my_diagnosis_finally/