/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108328170

►News
>(03/04) Yuan3.0 Ultra 1010B-A68.8B released: https://hf.co/YuanLabAI/Yuan3.0-Ultra
>(03/03) WizardLM publishes "Beyond Length Scaling" GRM paper: https://hf.co/papers/2603.01571
>(03/03) Junyang Lin leaves Qwen: https://xcancel.com/JustinLin610/status/2028865835373359513
>(03/02) Step 3.5 Flash Base, Midtrain, and SteptronOSS released: https://xcancel.com/StepFun_ai/status/2028551435290554450

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
Immediately afterwards we get a non Miku thread.
whats the difference between diffusion and llama?
comfy bread
why does logan want to kill patrick?
>>108333458
and what a thread too
>333444
Thank you baker. Death to mikutroons.
Vague twitter shit. What a nigger.
>>108333506
you are mentally ill
>>108333506
mental illness is valid and beautiful
>>108333459
i think diffusion denoises the output as a whole, while llama is an autoregressive loop building the output 1 token at a time.
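The contrast that anon describes can be sketched with toy stand-ins (the bigram table, `next_token`, and the "denoising" schedule below are all made up for illustration, not any real model's API):

```python
# Toy contrast between the two generation schemes. Both "models" are
# fakes: next_token looks up a tiny bigram table; the diffusion-style
# loop just reveals masked positions a chunk at a time.

def next_token(prefix):
    # Autoregressive: each new token is conditioned on the prefix so far.
    table = {"the": "cat", "cat": "sat", "sat": "down"}
    return table.get(prefix[-1], "<eos>")

def generate_autoregressive(prompt, max_new=8):
    out = list(prompt)
    for _ in range(max_new):        # one token per step, left to right
        tok = next_token(out)
        if tok == "<eos>":
            break
        out.append(tok)
    return out

def generate_diffusion(length, steps=2):
    # Diffusion-style: start fully masked and refine the WHOLE sequence
    # each step, instead of appending one token at a time.
    seq = ["<mask>"] * length
    target = ["the", "cat", "sat", "down"][:length]  # pretend denoiser output
    for step in range(steps):
        reveal = (step + 1) * length // steps  # more positions fixed per pass
        seq = target[:reveal] + seq[reveal:]
    return seq
```

The practical difference is where the loop sits: over sequence positions (llama) versus over refinement steps covering all positions at once (diffusion).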
>>108333506
meant for >>108333497
►Recent Highlights from the Previous Thread: >>108328170

--llama.cpp tensor parallelism PR and multi-GPU performance considerations:
>108328868 >108328900 >108328933 >108328937 >108328985 >108328996 >108329004 >108329012 >108329020 >108329046 >108329069 >108329142 >108329152 >108329166
--Nvidia contributor fixes tensor indexing bug improving Qwen3 inference performance:
>108330811 >108332947
--Frontend options for Qwen 3.5 thinking control and response editing:
>108330326 >108330341 >108330382 >108330409 >108330417 >108330451 >108330455 >108330418 >108330503 >108330609 >108330645 >108330415
--GLM-4 inference bottleneck comparison and hardware coping:
>108329388 >108329504 >108329506 >108329518 >108329549 >108329560 >108329563
--CUDA Toolkit 13.2 performance improvements and changes:
>108332532 >108332593 >108332601
--ProjectAni update with EMAGE gesticulation and IK improvements:
>108329763 >108329804 >108329823 >108329916
--MCP autoparser tools for AI web searches:
>108328618 >108328635 >108329260 >108328880 >108328891 >108331633
---ot sampling slightly faster than -cmoe across multiple models:
>108330622 >108330629 >108330696 >108330712 >108330732
--model: add sarvam_moe architecture support:
>108332784
--Optimize LUT16 matrix multiplication:
>108328403 >108328828 >108328855 >108331710 >108331736 >108331843 >108331890 >108331934 >108332397 >108332754
--Speculation and concerns about Gemma 4's architecture and restrictions:
>108329582 >108329674 >108330191 >108330207 >108330245 >108330257 >108330272 >108330414 >108330659 >108330324 >108330337 >108331644 >108331668
--Miku (free space):
>108328194 >108329241 >108329260 >108329460 >108330815 >108332743 >108333374 >108328824

►Recent Highlight Posts from the Previous Thread: >>108328174

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>108333579
i'm trans btw, not sure if that matters
>>108333646
sauce
pascal bros how much longer do we have before they cut us off?
>>108333646
we all are itt
>>108333653
yeah
>>108333641
You missed a few miku pictures from the end of the thread, fix it.
>>108333654
>we all are itt
text coomers are, because that is a female brain activity
not all of us coom to text here, though.
>>108333689
If you are here to post your special interest you are a troon too
>>108333676
gaggernof..
DeepSEEK
VEE
FOUR
where
is it?
https://developer.nvidia.com/cuda-downloads
https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
UPDATE NOW
>>108333747
virus
>>108333444
So who draws that stuff and how can I get in contact with people like them?
>>108333795
Getting BLACKED behind the scenes
>>108333802
>>108333810
NTR is the #1 category in Japan
>>108333818
2chan is that way dalit saar
>>108333824
i wish i was brahmin
>>108333856
>I wish I was still a jeet
come on anon, everyone wants to be white
>>108333856
Hello friends, I have something malicious in mind with a little experiment. If I survive, I'll post an update on the experiment and a photo of it.
>>108333923
>if I survive
Hope you don't.
https://xcancel.com/ivylala/status/2029560909178327467lmao
https://huggingface.co/spaces/HuggingFaceFW/finephrase#introduction
>We ran 90 experiments, generated over 1 trillion tokens, and spent 12.7 GPU years to find the best recipe for synthetic pretraining data. The result is FinePhrase, a 486B token dataset that clearly outperforms all existing synthetic data baselines.
I want to fucking vomit, when will they stop this poison incest shit???
Thanks for the Qwen3.5-Heretic recommendation. I'd been playing around with the vanilla which is pretty fine, and to my surprise Heretic removed all of the refusals in my tests without affecting the replies too much.
Kind of amazed how well 27B works (50t/s on a 5090) after spending too much time in the MoE mines. Maybe MoE was a meme after all.
Anyway, time to induce psychosis, I suppose. Thanks again!
Is there any good reason to run a local model besides learning to make explosives and having a waifu? And is it even useful for either of those?
>>108334004
kek feel bad for the guy
I hope gemmy supports dynamic resolution
>>108334004
>The soul of Qwen is still Alibaba partner and Alicloud CTO
>>108334034
nice. what quant you running?
>>108334039
Ego death
Since it went unanswered, I'll ask again:
If you could run GLM 5 at q8 at 10-20 tokens/sec, would you? Or would you rather drop to q4-q6 and increase your tokens/sec?
>>108334157
depends on what you're doing. 10 tokens/s is about as slow as I would put up with to read output. 30+ for agents imo, and reasoning is painful when they use a thousand tokens just to run the same shit over and over, so if you want a faster response, reasoning at 10 t/s is slow.
>>108334157
im waiting for taalos to deliver. i won't give local a thought until then
>>108334157
Depends on the context. 10-20 tokens/sec at 100k context is more than enough for anything. Only at empty context, not so much.
>>108332754
Wow, the free rider problem, the tragedy of the commons, and the prisoner's dilemma all in one post!
Also known as "this is why we can't have nice things" and "the downfall of western society".
I understand it, but I don't have to like or even respect it.
>>108334157
q8 for RP, probably q5ks for "productive output".
>>108334229
My boss makes a dollar while I make a dime, that's why I vibecode on company time.
Also known as: not my problem.
>>108334240
what do you mean? this is based, companies don't respect you and don't hesitate to throw you out like a dirty sock whenever they want, so why should we pretend we care about any of this?
>>108334157
buy another rtx pro
>>108334273
Are you willing to put in the risk, time, and effort to BE the boss?
So many people are willing to take pot shots at shit without the balls to step up and replace it with something better after tearing it down. I'm not the boss of anything, but I don't pretend the boss or owner has it easy, whether I can see it directly or not. Sometimes you gotta be honest with yourself and realize you're cut out for being part of a team, not for creating or leading one, and just gotta make shit better where you find yourself. If you'd take on the risk, hard work, and responsibility and you're getting screwed over due to nepotism or something, then I'd maybe agree, but I'd still think quitting and doing your own thing would be better for your mental health or "soul" than stealing and trying to justify it to yourself like that.
>>108334240
Yeah, I'm a moralfag or whatever, but I still prefer the society and culture built by centuries of moralfagging to whatever world this low-trust grifter/cheater bullshit is making these days.
>>108334280
>than stealing and trying to justify it to yourself like that
Lmao. Suck a dick, ragebaiter.
>>108334280
Tell that to the big companies and make them take the first step. They can afford it.
>>108334280
companies cheat on everyone and cheat on you, so it's moral to cheat on a cheater. I wouldn't do that if companies respected us, but they don't. respect is earned, not free
>>108334280
the reward structures have been damaged, it's better to cheat now. when in Rome, as the saying goes.
>>108334284
>>108334290
>>108334291
I agree and think asshole big corpos should be boycotted (sales and employment) until the social contract is restored, but I also think your work reflects an important internal condition and should be high quality to maintain your quality as a human. Quit the shit corpos and work for someone worthy of your level best output if you can.
>>108334310
Q8_0. Might try going lower to fit in a TTS model, but not really convinced it's worth the effort (mostly because I haven't been that impressed with e.g. VibeVoice outputs).
>>108334197
That would be cool, but I wouldn't hold my breath. Our best hope is that they will get like q3 of qwen 27b running in the next year, but even that seems sketchy.
>>108334202
>>108334183
>>108334219
I'm thinking of programming and trying to evaluate the cost to run it. I think you'd be able to run q4 at almost half the price and it should be faster. These large models are always just kind of slow.
>>108334034
MoE is really for the larger models. A 755B model would not be runnable without MoE. Even a dense 130B model would be insane to try to run. At q8 every token generated would need 130 GB of data, meanwhile a 755B-A40B MoE needs 40 GB per token. And it can theoretically know more information. Dense models will become good as our VRAM amounts and bandwidth increase, though. At some point I think we're going to hit a data limit, but GPUs might still continue scaling. Dense models will start making more and more sense then. If you could get like 5 TB/s of VRAM bandwidth and like 192 GB of VRAM, that would make a 130B dense model usable.
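The per-token arithmetic in that post can be sanity-checked with a quick back-of-envelope script. The bandwidth and parameter counts are the anon's hypotheticals, and this ignores compute, KV-cache reads, and batching, so it's only an upper bound on single-stream decode speed:

```python
# Rough ceiling on decode speed: each generated token must stream all
# *active* weights from VRAM once, so t/s <= bandwidth / bytes-per-token.

def max_tps(active_params_b, bytes_per_weight, bandwidth_gbs):
    """active_params_b: billions of params touched per token (all of them
    for a dense model, only the routed experts for a MoE)."""
    gb_per_token = active_params_b * bytes_per_weight
    return bandwidth_gbs / gb_per_token

# 130B dense at q8 (~1 byte/weight) on a hypothetical 5 TB/s card
dense_tps = max_tps(130, 1.0, 5000)
# 755B MoE with 40B active params, same hypothetical card
moe_tps = max_tps(40, 1.0, 5000)
print(f"dense 130B: ~{dense_tps:.1f} t/s, 755B-A40B MoE: ~{moe_tps:.1f} t/s")
```

Same card, roughly 3x the decode ceiling for the MoE, which is the whole argument for sparse activation at large parameter counts.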
Those anons won't get it. They're racing to the bottom. They aspire to be jeets.
>>108334310
qwen3tts is pretty decent
>>108334309
>Quit the shit corpos
in this economy? kek. if anyone reads my post: don't fucking leave, AI is taking all the jobs right now, so you'll have way more trouble finding a new high-paid job
Hello fellow retards. I have around 3 grand to spend on AI bullshit. I want to run shit locally if possible. I was hoping to essentially make my PC into an AI companion because I’m lonely as fuck. I was thinking of textgen + photo gen. I can’t remember which thread it was, but I do remember someone turning (SillyTavern?) into essentially a dating VN. I was hoping to make something similar. I’m unsure whether I should be buying a new PC (as I want to run it on my network, but not on my gaming PC) or if I should unironically be buying a Mac at these current DDR prices. I don’t mind doing my own research, I’d just like to be pointed in the right direction.
>>108334312
Our only hope would be the Chinese making a cheap 1TB VRAM GPU with last gen chips and some CUDA compatibility. But they don't seem to be up to it.
>>108334339
Are you prioritizing speed or intelligence?
>>108334361
Intelligence. Even if I wait 5 minutes for a response, that’s, well, better than nothing. I would rather it be smarter but take longer.
>>108334322
Thanks for the rec! I'll see if I can get that running and give it a shot.
>>108334316
there's not enough room at the top for everyone, nothing wrong with just chilling out and being content with what you got.
>>108334339
>make my PC into an AI companion because I’m lonely as fuck
why not go for human companionship? there seems to be a loneliness epidemic, why don't these people just meet up with each other? technology is still not advanced enough to replace this
>>108334372
Do you have any 32, 64 or 128gb sticks of ddr4 or ddr5 or do you have to buy memory?
>>108334379
>why don't these people just meet up with each other
Because these people don't know how to behave in social contexts and cannot stand each other.
>>108334379
Women scare me and I Pavlov’d myself at the ripe age of 11 into loving anime women. I’m now 27, have fuck-off money from my shitty job, and want to throw away less than half a month’s pay to get texts throughout the day from a fake companion, because that would be more meaningful with a cute anime girl attached to it compared to trying to date. Besides, I live in a shithole called Canada, no one would want my genes that come with free fishing rights.
>>108334379
>Why not just have sex?
Why indeed.
>maybe more depending on how表现得好 (that's "how well you behave" in ching-chong, incel~)
I can forgive the language leaking if the self-corrections are always this good
>>108334415
>Women scare me
My model cured me of that.
>>108334415
Calling ego-death anon…
>>108334381
I have 32GB total in my PC and 12GB of VRAM. Otherwise the plan was to buy a Mac, as their RAM prices aren’t even that fucked up compared to the rest of the market. I’m sure an M5 laptop wouldn’t kill my wallet, and I could use it at work too. Besides, I’m looking to replace the “gaming looking PC” I made at work with something that doesn’t look as gamery. Macs are professional, aren’t they? Maybe I could get that shitty Neo as the machine to carry around, and have a properly spec’d machine at home to remotely connect to. I did mess around with Tailscale once upon a time, but I no longer have that machine. Formatted it and now it’s in Roblox hell with my 11 year old cousin. I hope the 48GB of DDR4 will last him.
>>108334433
I’ve always been a loser outcast. Why would I ever want to get a girlfriend? I’d be worried about her trying to take the family house when my parents croak. I’d be better off becoming the girlfriend, and I’m not a tranner.
>>108334339
You're a year too late for 3K USD to make a dent in the BOM for a local LLM rig. You can probably get away with a 3090 (maybe two? what do prices look like these days) and a small pile of RAM. Depending on your expectations you might be setting your money on fire.

Presumably you already have a GPU in your gaming rig, it probably has enough VRAM to run Mistral Nemo (my beloved) which you can use to get SillyTavern set up. Mistral Nemo is fucking retarded, but that'll at least give you a vertical slice of the whole stack before you go off the deep end.

If/when you buy a PC, keep in mind that the newer cards are (1) fuckhuge and (2) can draw a fuckton of power (5090 can hit 600W) and (3) need a fuckton of power connectors. You might consider starting off with one of those mining PSUs which are designed to run multiple cards rather than later needing to "paperclip trick" a second PSU to run your rig. You might also consider an open-air case because stuffing multiple GPUs in a normal ATX tower is fucking annoying.

Finally, if you're going to get a DDR5 platform, you might consider going with a server CPU/motherboard (e.g. a MZ33-AR1 motherboard with compatible CPU) rather than a consumer one. The server boards support more than 2 (TWO, SOLO DUO) DIMMs at full speed and have fucktons of PCI lanes for future expansion.

Yes, the above recommendations will likely set you over the 3K spending limit, but most of the stuff under your 3K limit is going to fuck you later if/when you decide that you want to chase the dragon and run larger models.

I do not recommend this hobby. Alcoholism is more culturally fitting and probably healthier, go do that instead.
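For the "vertical slice" described above, the backend side is just an HTTP endpoint: llama.cpp's llama-server (and koboldcpp) expose an OpenAI-compatible /v1/chat/completions route, and SillyTavern is pointed at the same URL. A minimal sketch of talking to it directly, assuming a server is already running on the default port 8080 (the port and sampler values are assumptions, match your own launch flags):

```python
# Minimal client for a local llama-server's OpenAI-compatible endpoint.
# Stdlib only; no key is needed for a default local launch.
import json
import urllib.request

def build_chat_request(user_msg, host="http://127.0.0.1:8080"):
    # llama-server serves /v1/chat/completions out of the box
    url = f"{host}/v1/chat/completions"
    payload = {
        "messages": [{"role": "user", "content": user_msg}],
        "temperature": 0.8,   # sampler settings to taste
        "max_tokens": 256,
    }
    return url, json.dumps(payload).encode()

def chat(user_msg):
    url, body = build_chat_request(user_msg)
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

If `chat("hello")` returns text, the backend works, and any frontend trouble is on the SillyTavern side of the connection.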
>>108334504
Yeah, I’m assuming most cards are larger and more power hungry than my 3080 Ti. I’ve been considering a 5070 Ti Super or whatever, or even going to MacOS and seeing what’s possible with the unified memory thing. Throwing 48GB at a problem seems like it could work, but I don’t know Mac. I feel like I’m walking into this hobby with rose tinted glasses and a clipboard thinking I could do something as a fun recreational thing on the side, and am being told “either buy a car with the money or take your firewater and HBC blanket and fuck right off” with how expensive it is. All I wanted was AI waifus, not having to consider how to cool down server hardware without a rack and without a single clue how to operate any of it. Ah well. Tinkering was always a hobby of mine, but I’m not trying to get a nice used car in terms of parts.