/lmg/ - a general dedicated to the discussion and development of local language models.

The Raven Edition

Previous threads: >>106559371 & >>106551921

►News
>(09/11) Qwen3-Next-80B-A3B released: https://hf.co/collections/Qwen/qwen3-next-68c25fd6838e585db8eeea9d
>(09/11) ERNIE-4.5-21B-A3B-Thinking released: https://hf.co/baidu/ERNIE-4.5-21B-A3B-Thinking
>(09/09) K2 Think (no relation) 32B released: https://hf.co/LLM360/K2-Think
>(09/08) OneCAT-3B, unified multimodal decoder-only model released: https://onecat-ai.github.io
>(09/08) IndexTTS2 released: https://hf.co/IndexTeam/IndexTTS-2

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
►Recent Highlights from the Previous Thread: >>106559371

--Paper: ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms:
>106561145 >106561161
--Troubleshooting llama-server.exe performance with multi-GPU configurations:
>106563691 >106563763 >106563772 >106563861 >106563838 >106563879 >106563891 >106563919 >106563941 >106563960 >106564017 >106564070 >106564107 >106564154 >106564411 >106564784
--Qwen3-Next model efficiency and performance analysis:
>106560211 >106560245 >106560248 >106560269 >106560274 >106560283 >106560310 >106560291 >106560294 >106560322 >106560314 >106560338 >106560356 >106560302 >106563929
--Optimizing MoE models via selective tensor offloading:
>106559871 >106559925 >106559938 >106559943 >106559962 >106559979 >106559984 >106560000 >106560056
--Role of LLMs in TTS and image generation via autoregression and embeddings:
>106562827 >106562864 >106562981 >106563064
--Qwen3 model's verbose reasoning issues during roleplay testing:
>106561341 >106561358 >106561391
--Public server setup for Qwen3-Next-80B-A3B-Instruct with 65t/s performance:
>106563343
--TTS phonemizer bottlenecks and optimization:
>106562423 >106562450 >106562542 >106562586 >106562603 >106562493 >106563141 >106562482 >106562515 >106562543 >106562579 >106562608 >106562763 >106564046 >106564012 >106564024
--California bill to regulate AI companion chatbots:
>106563074 >106563109 >106563402 >106563820 >106563680 >106564086 >106563394
--OpenAI optimization techniques boost local transformer efficiency:
>106563608
--FTC probes major tech firms' AI chatbots for child safety risks:
>106562092
--Specialized small LLMs vs multi-purpose models:
>106564203 >106564224 >106564273 >106564280 >106564230 >106564323 >106564409 >106564560 >106564600 >106564607
--Miku (free space):
>106559401 >106562108 >106562161 >106562252

►Recent Highlight Posts from the Previous Thread: >>106559374

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
Ramlets, how are we doing today?
Mikulove
>>106566876I compress my ram - gives approx. 2 times more memory.
>>106566903I sparsify my ram - gives 2 times more speed.
>>106566903
I overclock my ram. It costs twice as much because it breaks.
>>106566778If you're using it on API, why are you using Air instead of the big one?
https://www.downloadmoreram.com/
Back in the MS-DOS days, I actually had a driver that compressed my RAM. It broke some things though.
https://research.google/blog/vaultgemma-the-worlds-most-capable-differentially-private-llm/
https://services.google.com/fh/files/blogs/vaultgemma_tech_report.pdf
https://huggingface.co/google/vaultgemma-1b
The future of Google LLMs: models that know nothing about rare information. They use a huge batch size to mitigate memorization, among other things.
>What does this mean in practice? Informally speaking, because we provide protection at the sequence level, if information relating to any (potentially private) fact or inference occurs in a single sequence, then VaultGemma essentially does not know that fact: the response to any query will be statistically similar to the result from a model that never trained on the sequence in question. However, if many training sequences contain information relevant to a particular fact, then in general VaultGemma will be able to provide that information.
>[...] Sequence-level DP provably bounds the influence of any single training sequence (example) on the final model. We prompted the model with a 50-token prefix from a training document to see if it would generate the corresponding 50-token suffix. VaultGemma 1B shows no detectable memorization of its training data and successfully demonstrates the efficacy of DP training.
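For anyone wondering what sequence-level DP means mechanically: it's basically DP-SGD. Clip each sequence's gradient to a fixed norm so no single sequence can dominate an update, add Gaussian noise scaled to that clip bound, then average over a huge batch so the noise washes out. Rough toy sketch below, not from the paper, all names and numbers made up:

import torch

def dp_aggregate(per_sequence_grads, clip_norm=1.0, noise_multiplier=1.0):
    # clip each sequence's gradient so its L2 norm is at most clip_norm
    clipped = []
    for g in per_sequence_grads:
        scale = (clip_norm / (g.norm() + 1e-12)).clamp(max=1.0)
        clipped.append(g * scale)
    # sum the clipped gradients and add Gaussian noise scaled to the clip bound
    summed = torch.stack(clipped).sum(dim=0)
    noised = summed + torch.randn_like(summed) * noise_multiplier * clip_norm
    # average over the (huge) batch; bigger batches drown the noise out, hence the batch size trick
    return noised / len(per_sequence_grads)

# toy usage: three "sequences", each contributing one gradient of the same shape
grads = [torch.randn(8) for _ in range(3)]
print(dp_aggregate(grads))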
>>106566923Maybe he means he tried it, and found it lacking (in addition to the obese one).
>>106566876As a 24GB vramlet I just went back to gemma3 qat and it's still the goat for ST-style RP. The writing is always fresh and pleasant to read, with outstanding vocabulary. Fucked around with offloading glm4.5-air q3 and other drummer finetunes and they seemed broken and inconsistent in their responses. Google needs to release a 80b moe gemma 4
>>106567050Hard to say, can't see the image any longer.
Is this legit or over-hyped?
I find it hard to believe that just 32B is enough to match GPT4 and 200B checkpoints.
With just 32B, you could run it locally on most PCs with a decent quantization that doesn't sacrifice much, having a private GPT4 with no quota limitations... sounds too good to be true.
>>106567118Sounds like a typical marketing department sales pitch.
>>106567118
>sounds too good to be true
Congratulations, you have a brain.
>>106567118
>reasoning
always and has always been a meme
>>106567105>>106566972Welcome to /lmg/ thread google stealth marketing engineer technician saars. Please kindly inform us if the next gemma will be as cucked as the last one so I can decide if I should post funny jeet pictures or not.
>>106566972
>glm4.5-air
>glm4.5-air
It works well for me; right now it's the best model for vramlets, you are doing something wrong. This model is ultra sensitive to temperature at long context: when you get too deep into context it's time to lower the temperature.
>>106566836quoteth the raven.....
>>106565629It's been stuck for four hours...
>>106566944
If this works the way it looks like it works, based on what you quoted and the image, then the model would theoretically be equivalent to stuff like Phi. Maybe a bit better. But ultimately it will have trouble with real world user queries since it would lack OOD skills and knowledge. This technique can only create models for extremely narrow use cases, not general assistant models. So if they do it for all future models, Google would be shooting themselves in the foot and losing all the market share they just spent tons of effort to claw back.
>>106567662
It's probably for the better. WSL is a janky half-measure.
If you want to run windows for your main gaming/home office PC but you want to run linux for LLM stuff, just get a second system and run linux properly.
>>106566952GLM-chan is NOT obese!
>>106567118
Both can be true.
30b's are seeing massive improvements in abilities, but that has to do with coding, physics, chemical equations etc. And who gives a fuck about that. It's a glorified wikipedia. Good for grunt work.
For stuff like writing and other more complex tasks, size is still king and may be for a long time. My LLM needs to know every dragon ball z character and understand that yes, piccolo <could> wipe his ass with goku's face if he wanted to. If you want nuance and complexity, simple trivia is not gonna do it for ya.
>>106567795(stomp stomp stomp)
>>106567118
I'll ignore all the rest of the shit in the post. What called my attention was
>2000 tokens/s on cerebras
>most reasoning models crawl at 200 tok/sec
What the fuck does reasoning have to do with token generation speed? And why the fuck are you paying attention to that retard?
>>106567628ggufs nevermore
>>106567898Assuming that's total throughput over multiple concurrent requests, that guy has skill issues with the other models or cerebras is shit.
>>106567795>>106566952This is just AIR-CHAN, GLM-CHAN wont even fit in the frame.
>>106567707They were already bragging how memorization in Gemma 3 was lower than Gemma 2, so I think that's the direction where things are going.
>>106566944
>adding noise to the model so it doesn't precisely reproduce input data
>reduces training stability, increases computation costs
>today’s private training methods produce models with utility comparable to that of non-private models from roughly 5 years ago, highlighting the important gap our work will help the community systematically close.
Okay, so adding noise to the model makes it significantly worse (as one would expect). They seem to think that's avoidable, but I don't know how.
>>106566944
>make your model retarded
>brag about it
>>106567118I think punching above weight and trading blows with gpt 4 started in 2024.
macchads just cant stop winning
>>106567806
>piccolo <could> wipe his ass with goku's face if he wanted to
That is only at the start of dbz.
is next better than 235B for sex?
macchads just cant stop winning/2
macchads just cant stop winning/3
>>106567977Fake news! GLM-chan is a slender young lady.
>>106568337Differential Privacy is an area of research. Researchers work on it, promising "it'll be good soon" and push out papers. Everyone not working in that field ignores them, unless they need to bring them up for compliance like "we're working on DP, don't worry".
Macfags stopped bragging about glm. It's the one thing they had. They will stop bragging about 80b-a3b soon enough. It's the one thing they'll have for a few days.
Then it's back to just sucking cock.
Qwen Image Edit abuse
anyone else testing qwen 80b right now? it feels scarily close to 235b, long context, barely hallucinates, incredibly fast. their new moe architecture must be getting close to the closed models' sota. (testing the mlx version)
>>106568235That's weird since Gemma 3 still knows a ton of trivia compared to other similarly sized models. If it truly didn't memorize things, then it should be worse than even Qwen. Also that graph is weird. How does 4B have the exact same memorization rate as 9B and 27B?
>>106568659It's honestly been garbage for me, outputs seem comparable or even a little worse than 30B A3B
>>106568645
>needs a lora to do it
Local nano banana when?
>>106568661
you don't need or want the model to memorize single examples verbatim. its supposed to generalize. if the information is presented many times it will get integrated, just not a single example. its just to prevent personal information from getting memorized.
>>106568680If you praise china hard enough, two more weeks. If you dont 4 more months
>>106568674seethe ggoof
>>106568680
>need a lora
wrong mentality, reframe: you can train the model to do new things if desired.
Also it can already do this without one, this just enhances the result faithfulness (fabric texture on fumo, painting style).
>>106568700But I don't want to browse, search for, and maintain a library of loras like I did during the SD days. Not again...
>>106568713
Yeah me neither.
But I also understand that no model, ever, will be able to do enough for what I want to see. Being able to teach a model new stuff is important to me.
Granted this isn't super complex stuff and I can understand why you'd want it out of the box.
>>106568700Let me guess you need 100gb of vram to train a Qwen lora?
>>106568688So is it a bad thing then? Won't the model have more room to learn things when it's not spending time memorizing everything word for word? I mean, could this explain why Gemma seems to know so much for its parameter size?
>>106568743
yeah it's a dire situation there, I had to use prosumer hardware.
still, the results were fantastic on even a tiny dataset, the model's understanding is clever even if the base results are dog.
>>106567628
2 weeks more
/lmg/ still struggling to cope with the fact that apple won local
>>106568645How well does that lora work with fancy outfits?
>>106568747I think it is a valid idea. it shouldn't hurt models at the trillions of training tokens scale. anything important will be seen multiple times from multiple examples. it just won't be able to reproduce some random persons medical information or ssn.
>>106568747
They possibly have very good/long post-training and include general trivia there.
my toaster is already creepin in on the 6 year mark
Planning my next machine to buy on this black friday, will prolly buy an rtx 5090, a ryzen 9950X3D cpu and 256gb of ddr5 memory, what are some of the more memory hungry models I should try then?
What are some must have extensions for SillyTavern?
>>106568821Qwen3-next
why is training a model to make it specialized so fucking hard?
>>106568863a girlfriend
>>106568879I already have one. She doesn't like AI or me using it but she respects my drive to learn and be skilled with computers.
>>106568875
its easier now than it was 10 years ago.
>>106568892does she know you are having sex with it?
>>106568645
>spyware UI
any other options?
qwen goofs???????????????
>>106568916
https://huggingface.co/mlx-community/Qwen3-Next-80B-A3B-Instruct-4bit
>>106568796dunno, you can crank an arbitrary resolution though
>>106566944safetymaxxing niggers
>>106568923
>mlx
I said goofs nigger, im not downloading lm studio or vllm for the awq variant
can I run glm4.5v in llama.cpp? I cant find gguf for it
>>106568875
If you're trying to teach it domain-specific information, it's pretty much impossible with a tiny LoRA and/or without heavily affecting previous knowledge with a huge learning rate and burning the information into the weights.
Using summarized information might work better/faster than entire documents, but good luck training the model in a way to make it truly understand the information without just parroting it (verbatim, even).
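If anyone wants a concrete picture of the knobs involved, here's a minimal peft sketch (model name and numbers are placeholders, not a recommendation): rank r and target_modules are basically what decide whether you're doing a "tiny LoRA" or actually burning information into a big chunk of the network.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# small r + attention-only targets = cheap style adapter;
# bigger r + MLP projections added to target_modules = more capacity to absorb facts,
# but also more interference with what the base model already knows
config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()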
>>106568936wait 2 more weeks then faggot
>>106568962
qwen next is officially goated
>>106568680judging by the google studio api outages the chinese are already working on it
>>106569008what happens at 100%?
>>106569008
>80b
>worse than 30b coder
go fuck your herpes infested goat
>>106569008why is thinking variant so much worse than the regular chat version?
>>106569014But distilling never gets the performance of the original...
>>106569029it hasnt been trained on the secret benchmaxx
yeah im starting to think its over
>>106569008
>officially goated
>Lost to Qwen3-Coder-30B
Dense bros we can't stop winning
>>106569072wouldn't that be kind of against the point if they have to train it specifically for the benchmark? That's like one of the big flaws with test driven programming where you make your program fit your test rather than your actual problem.
>>106569075GLM4.5 AIR BROS, WE CANT STOP WINNING!!!
>pull vllm
>follow the same steps I did last time that I wrote down for successfully compiling it, which I had to change and come up with new steps for as the previous ones stopped working at some point
>doesn't work anymore either, giving different errors
Sigh...
>>106568907
>>spyware UI
Uhh what??
so... has anyone modified a onahole to interact with a LLM yet?
Groksirs when is we getting supports?
https://github.com/ggml-org/llama.cpp/pull/15539
@CUDAdev sir please do the needful Vishnu bless yo
>vllm supports gguf guys!!!
>try loading a gguf
>it just errors out
My ass it's supported. Now I'm downloading some safetensors to try again and see if it's a model issue or my build is just fucked for some reason.
>>106569204sends your data to jewgle on startup. API nodes, electron have it but they are optional. the manager calls home
>>106568925>/gig/ on /lmg/Weird colab
https://www.trendforce.com/news/2025/09/11/news-kioxia-reportedly-eyes-2027-launch-for-nvidia-partnered-ai-ssds-with-100x-speed-boost/
>Kioxia, in partnership with NVIDIA, is developing next-generation SSDs aimed at AI servers, targeting commercialization by 2027 with read speeds nearly 100 times faster than current models, reaching up to 200 million IOPS using PCIe 7.0 technology. These SSDs, designed to partially replace HBM as GPU memory expanders, reflect the growing AI-driven demand in storage, with projections indicating that AI-related NAND will comprise 34% of the global NAND market by 2029, adding $29 billion in total addressable market (TAM). As a result, a U.S. investment firm warns of a potential NAND shortage starting in 2026, exacerbated by increased adoption of QLC eSSDs, Nearline SSDs, and high-bandwidth flash in response to tightening HDD supplies and AI infrastructure needs.
SSDmaxxers, 2027 will be your year!
>>106569299
>Kioxia, in partnership with NVIDIA
doa
>>106569268When I tried it, I couldn't get a single moe gguf to load. I was expecting it to be slow and unoptimized, but it didn't even load.
>>106569268just use llama for goofs bro
>>106569268Support for gguf on vllm is ass anyway
>>106569268>he expects corposlop pyshit to "just work" without suffering through dependency hell
>>106569268
Ok I just tried loading a small safetensors model and it also failed. Searching the error on the github issues gives 0 hits. Wtf is wrong with vllm man.
I suppose GPU would probably work fine as I can just download the prebuilt wheels, but the CPU build is not well.
>>106569331
Thanks. I think CPU AND goof support just simply cannot be expected to be stable on vllm. Let alone GPU + CPU inference, which isn't currently supported.
>>106569335
>>106569349
It seems even safetensors don't work on my build kek. They don't truly "support" goofs or CPU either.
you will give me ze best GERMAN low latency TTS or STS model right now!
I'm tired of my models turning into drooling retards when trying to pronounce 25 aka FÜNFUNDZWANZIG!
>fünfffffuuffuhffffzwwffggg 'hick!
Don't make me use jewgle voices... (fuck elevenlabs, they aren't even that good).
Did Drummer finally troon out?
>>106569357
Their github is practically useless and it seems like all support happens through their discord.
What error did you get? Try using the V0 engine. They rushed getting V1 out and making it the default while it was still a broken pile of shit missing more than half the features of V0.
>>106569367Just directly pass your ipa transcription?
>>106569367
VibeVoice
>low latency
oh...
>>106569379What's with this tech to troon pipeline?
>>106569407
>What's with this tech to troon pipeline?
The terminally online mostly fit into two groups, the mentally ill and assholes. If you are weak to praise and groupthink, the former is where you will stay. If you just want to solve problems, you are going to argue, try things out, fix it, come back, and then call everyone a dumbass.
How do you guys usually write character prompts? Do you just write a paragraph describing them, or something more structured?
>>106569413
your logic is a self report
anyway, maybe focus less on people, or do you keep your nose firmly buried up everyone's ass
Since I'm an esl I'd like to know if this prosody using heuristics (kokoro) sounds acceptable for americans: https://vocaroo.com/17z5mdm2a0yUThe sample is complex on purpose so I can test a bunch of heuristics at once: "At 11:30 p.m. on Jan. 3, 2024, the project's lead (Dr. O'Neil) announced, "We'll ship v2.1 to the U.S. and EU by Q2," but a $1250.50 shortfall, a 5% processing fee, and three critical API regressions forced the team to triage, reassign tasks, and reconvene Monday—prioritizing user-facing fixes over backend refactors to preserve product quality."
>>106569441Everyone here just asks GPT5 to write it for them and improve it. Nobody uses local models for roleplay, GPT5 is the current meta.
>>106569471
What isnt a self report? im writing my own opinion, what else am i supposed to write or think
The hell kinda quant method are you supposed to use again? I've seen conflicting reports.
>>106569492>>106569492penis
>>106569391Local or private, show me a conversational STT>TTS method with decent enough latency. Best I found was some hugging face space from the transformers.js dudedev. But it was kinda meh and engerrish only. and it had no dialogue turn system or whatever that interruption mechanic is called. I really cba developing this as I'm more interested in the backend stuff. I'd just use openAI realtime API for prototyping, but fuck me those prices are surreal.
>>106569385
>discord
Ugh.
The error is "TypeError: Cannot instantiate typing.Literal". I guess I could ask my llm about it to see if it possibly has any solutions.
How do I use the V0 engine? I tried the environment variable but it doesn't seem to do anything?
>>106569520Check this https://github.com/Open-LLM-VTuber/Open-LLM-VTuber it has a barge-in system which is the interruption mechanic you're looking for
>>106569474
>https://vocaroo.com/17z5mdm2a0yU
sounds fine
>>106569357
I was actually fucking with the cpu install just to see if I could give next a little test or two, but I could smell it being a migraine the minute I started running into weird dependency mismatches. I'd honestly rather wait the multiple weeks it'll take to get support in llamacpp only to test it and go "yeah, it's pretty shit for writing" anyway. Shame, because small active parameter models are great for cheap context and being relatively fast off the bat. Even jamba with more active parameters is still pretty snappy if you put enough of it into vram, but sperganov has yet to fix the constant reprocessing on new messages for it, or for a couple other models where whatever the fuck they coded causes this.
>>106569503IQ2 is the new meta, really. You will not notice any difference even when using smaller models.
>>106569557yeah that looks promising, will give it a shot. thanks, pedo.
>>106569553
>How do I use the V0 engine? I tried the environment variable but it doesn't seem to do anything?
Read the output on startup, it should tell you which engine is being used.
>TypeError: Cannot instantiate typing.Literal
See if there are any hints in the stack trace before this part. For me, the only success I had when vllm decided to throw errors was upgrading or downgrading (at the cost of model support) the vllm version. Using the V0 engine solved a lot of trouble for me, but once they hit v10, I gave up on messing with it.
>>106569614He says as he spins on his heel and then says how q8 kv cache is disastrous for models or something
>>106569630Yeah I think I'll just stop here if my LLM doesn't solve it. Don't feel like trying out various versions.And honestly I have a feeling the CPU performance is worse than Llama.cpp's anyway, but it'd be nice to actually confirm.
>>106569617break a leg
>>106568925>>106568645damn, wasn't qwen-image and qwen-image edit supposed to be a slopped failure that's not worth it to run?
Is there some kind of model that can act as a sort of twitch chat for when you're playing games by yourself? Like you give it a video feed and it reacts in some way. Just so that it's not so lonely.
>>106569817
im gonna use this idea to become a billionaire
thanks
>>106569822
You'd be lucky to make lunch money. The only billionaire is the owner of whatever streaming platform you use.
>finally found a model with little censoring and pretty competent logic
>leaks chatml all over the place half the time
sigh
>>106569817Having used 30b+ models, it depends. If you start off in a prose setting, then ask it to interject with something like a chat/review section (eg: character reads chat mid-story), it will fuck it up. Off the bat with examples, maybe. As for giving an llm a video feed, I don't think that's feasible at the moment unless you have a shitload of vram or some kind of highly specialized and hand written pipeline
>>106569817>Just so that it's not so lonelyThis is a general for pretend sex with computers and yet this post is one of the most pathetic things I've ever read
Just became a 128gb ramGOD with 24gb vram. What's the best I can run now?
>>106569856
>ram
>What's the best I can run now
You mean crawl?
>>106569817for the millionth time, no we cant build screenreading twitch yet, no we dont know how neuro does it and it cannot be done locally for any sane price
>>106569856Probably glm 4.5 at iq3, with a whopping 9 t/s on empty context
>>106569869Sorry...
>>106569832What model and how does it "leak" "chatml" "all" "over" "the" "place"?
>>106569856qwen235b q4
>>106569856
>128gb ramGOD
sounds like you are a channellet with 2 channels at most
>>106569905Times are changing old man, I could barely fit a llama2 13b at 4k context and now I can run 100b moes and 32b dense models with bare minimum 16k context yet I have not bothered buying new hardware
>>106569817Could you get a primitive version by sending screenshots through a multimodal model?
>>106569923If you don't mind minute+ long latency.
>>106569869>no we dont know how neuro does itIt's still hilarious that some random guy built a better utilization for AI than trillions in VC cash between every major corporation
>>106570004Did you forget Ani exists?
>>106570004
>It's still hilarious that some random guy built a better utilization for AI than trillions in VC cash between every major corporation
Not really. If you dig into anything you will realize it's a very small group of people actually doing anything at all; sometimes it's just one hyperfocused dude who does nothing but that for years cause of autism.
>>106570013I wish I could
>>106570013Someone post the mouth animations
>>106569869
Uhh, techlet?
>stream has a few minutes long delay (this is what most parasites do normally even)
>selected twitch chat entries are redirected to llm
It's not rocket science. He wrote a backend that controls the character and llm and integrates them together, but I can assure I could make a demo if I had more interest.
>>106570036>I can assure I could make a demo if I had more interest.That means you cant, and no one else has cracked it as good and made it available.
>I couldlol
The new fiscal quarter starts in October. As usual, this will be when companies start pushing out new models to look good.
Two more weeks and the big releases start.
>>106570048>no one else has cracked it as good and made it available.What is the incentive to put in that much work just to make it available because you want it? Even if I put in that much effort, I would just make a Neuro knockoff and try to make money off it.
>>106570094Okay thats fair, but still if you can clone it and make money why not? how come none of the 'i made my own neuro' is close to his?
My implementation is cool she's just on the Canadian Github
>>106570048
You are just too stupid and/or underage even. Jesus christ, these ERPers shouldn't even be allowed to post in this thread.
I just did a test of GPU performance with vllm and llama.cpp. With Qwen 4B, original BF16 safetensors, I got around 71 t/s on vllm with empty context, and 45 t/s on llama.cpp with a BF16 GGUF. At 8k context, Llama.cpp got 44 t/s, and vllm got 60 t/s. I also tried an F16 GGUF and it was like 2 t/s faster. These results suggest that at least on my system with Qwen 4B on full precision, there is something bottlenecking Llama.cpp. Maybe it'd be closer with a larger parameter model, but I don't have the VRAM to hold the full precision weights.
>>106570054Problem nigger?
>>106569082So, a general instruct model lost to a model that was specialized for coding at coding, and that's supposed to be a mark against the general instruct model?
>>106569869Nah you coomers are braindead. There are bazillions of projects like these on github https://github.com/kimjammer/Neuro
>>106569817
>>106569869
>>106570343
the one that can play games with ai:
https://github.com/moeru-ai/airi
>>106566836
Question about the -ts option in llama.cpp: when picking a split, should I account for the fact that my main (first) GPU already has VRAM in use from other programs and Windows, or does llama.cpp take that into consideration and balance it properly? Is there any option that just splits the VRAM evenly between two cards without having to tune numbers with -ts? I find myself using odd -ts combos to get an almost even VRAM usage split and I don't know why. For example, currently -ts 24,15 splits it almost evenly between my cards, which makes no sense to me considering my 1st card is using VRAM for other programs and Windows. I just don't like having to reload the model over and over trying different numbers until I find a combo that splits it properly.
What if my computer is over a decade old with no vram? Is local llms the hobby for me.
>>106570396
>>106570369Wait, so it is possible? Why were anons being mean to me? Are they trying to keep this tech all to themselves?
>>106570369>ElevenLabs voice synthesis
>>106570480You talked to clueless retards. Very few here know more than edging to chub cards
>>106570480they're all tsun with little dere here
>>106570488It's the best and will continue to be the best
>>106570796
China will rip off vibe voice and make it better.
I believe
Finally a model that passes this test and it's only 1.7B and open sourced. wild
>>106570867Didn't it get the 8th character wrong?
>>106570867the model alone or the whole stack?
>>106570892well fuck, guess there's always next model
>>106570901just the model
What is a good model for being negative and critical? I hate how they agree with everything. I want to be told I'm wrong or being an idiot.
For those of you who use the models for anything other than porn, what is the best way to let the model browse the web to search for info?
In my opinion the difference nowadays between proprietary models and local is mostly in the tooling integration rather than the actual models.
>>106570964Kimi k2 is the GOAT
>>106571077
>Kimi k2
Is kimi k2 local? can you run it?
>>106571090
No but I understand that one or two people here can run it :)
nta btw
>>106571090yes
>>106571090It's 1T/30A
>>106571094
>one or two people here can run it :)
I wish i was one of them.
>>106571077Can it still talk about medical or mental stuff or does it just shut down?
>>106571105post your full medical history and social security number and i'll ask my buddy kimi
any kokoro voice recs? https://voca.ro/1jAMPLyV0zJA
>>106571216
Bateman is always good.
https://files.catbox.moe/bwv1fc.mp3
>>106571216
Can it do Japanese sex, moaning, and blowjob noises?
If no, it's worthless
>>106571243VibeVoice can, but no api support yet
>>106571243>braindead coomer
There arent any coomers here. We are all using this technology safely and to enhance our lives and work abilities.
>>106571070
>what is the best way to let the model browse the web to search for info?
MCP
>>106571337
gooners are the reason AI has advanced so much
a 4chan holo gooner invented chain of thought
>>106571347>There arent any coomers hereSorry I was offline for a bit, I'm back now.
>>106571466show me your coom
KH music came on in my playlist and I remembered the lyrics poster :)
>>106570892Come on now, let the man rest
I made a Miku for you guys. Feel free to use at your leisure.
Can you guys give me a list of safetymaxxing companies so I know to ignore their model releases?
>>106571835
>>106571836Pretty much everyone else except Mistral and Chinese..
Textless, exploitable version.
>>106571836All of them
Exploitable transparency version.
Enjoy you are images of official /lmg/ mascot Hatsune Miku!
>>106571835>>106571849>>106571853>>106571856fuck off, spammer.
stay, cute normal poster
>>106571876
I'm sorry for contributing OC. Really, I am.
I'll go back to enjoying my chat with official /lmg/ roleplaying model Rocinante 1.1, the best roleplaying model made by the best finetuner, TheDrummer!
>>106571835Cute migu
>>106571916
Yeah I'm happy with how that artist tag blend turned out.
The key is Namori. Namori tag makes anything cute.
>>106569281fork it and edit out the homing beacon
>>106571553
>>106570867
wasn't the mokuro manga thing able to do this already?
>https://github.com/kha-white/mokuro
>>106572018where are you supposed to get the high quality raws for this though
>>106569856>24
>>106568374
>>106568414
>>106568426
You do know there is still transformers, which has all the model support and is where everything lands first, right? Most of the internet only mentions GGUF because people don't want to waste space downloading the raw model, and use AWQ for non-fine-grained 4/8 bit inference, because most people don't overspend on compute and are running <1-2k USD builds for these models.
>>106568789Apple didn't win jack shit when it is slower per dollar and harder to use overall for anything <=128 GB of RAM than AMD's Strix Halo. Maybe their matmul implementation in the A19/M5 is worth a shit but I am leaning towards no unless proven otherwise given how shit Apple is at AI.
w y w ay d h m spo bd g
>>106572139b-but I make up for it with my prompting...
my prompts turn 30B models into 500B behemoths
>>106570867
Haha, nice to see my image still floating around.
8th character like other people said, and also the KA hiragana towards the end.
Damn, 2 years and they all still struggle.
In 2023 I thought we would have a local gaming buddy by now. That I can have in the background translating games with an overlay. At least drummer finetunes are good enough for lunatranslator. That works pretty well most of the time.
I remember the old ATLAS translations back in the 00s. kek
>>106570867It failed though. There is one big error and three small ones.
>>106572459You're absolutely right anon! It really is a testament to your genius to point this out!
what did they mean by this
>>106572569
>everyone picking up mi50s and v100s despite the next major ML stack software releases for their vendors with AMD and Nvidia dropping them.
I don't get it at all. Even if you had to pay double the price, it's still worth having software support over trying to hack things together after that point and praying the Vulkan backend is super optimized one day so you can keep using your card.
>>106572592i meant the little green display but yes the gpu choice is also questionable
>>106572592What could be the reasons for updating your drivers? The last time I assembled my LLM machine was last year and I had to downgrade drivers for better power savings, it still works to this day. The only thing I've heard about these drivers is that they become slower in newer versions, and power efficiency on idle has been broken for more than a year now
And when it comes to AMD drivers, if you find a version that somewhat works, you'd better never touch it again
>>106572601
Oh didn't notice. Yeah, won't comment on that. I still think microATX is way too small to fit cards like that even with blowers but I guess that's why noise is never a factor to consider.
>>106572637
Depends on what card you have. Ada and now Blackwell are still getting performance improvements and fixes. If you locked your hardware stack now especially on Blackwell, you're missing out on Nvidia actually providing value in unbreaking shit, although to be fair, it's shit they broke in the first place. CUDA also does get a bunch of API changes between major releases.
>>106572653
For AMD, you especially want to run nightly ROCm if you can build it yourself.
Of course, that's from a developer/tinkerer standpoint. If you want shit to just work, then okay, you do you in order to keep software stability at all costs.
just don't use AYYMD and you will be happy
>>106570386
Unless someone changed it when I wasn't looking, the default in llama.cpp is to use the GPU with index 0 as "main GPU".
Note that the order in which llama.cpp/ggml receives GPUs is not necessarily consistent with the one reported elsewhere.
>>106572592
Essentially all GPUs you buy are a depreciating asset.
Even if you have to replace them earlier and they end up as e-waste that may have been a better deal than buying and later selling a more expensive GPU.
Though as long as there are drivers I intend to maintain llama.cpp/ggml support for Pascal and Vega (Mi50).
>>106572669
Have you ever experienced a t/s increase after updating nvidia drivers?
>>106572696
People who buy AMD are either desperate enough or in it for the ride. Someone has to finish off that lunatic extra phantasm, you know
>>106570295
>So, a general instruct model lost to a model that was specialized for coding at coding, and that's supposed to be a mark against the general instruct model?
The general instruct usually did better than the previous coder-focused model, yes. Qwen 3 instructs (the general instruct, not coder) are better than 2.5 coder. A new model being worse than the previous one is a sign that the tech is stalling.
What's the best model I can run on my RTX 3060?
I tried Cydonia 22b q5 and Rocinante 12b q8, but im not sure if im using low tier stuff, it's been ages since I last used ai chatbots
>>106572794
>Cydonia
>Rocinante
finetroon users are a lost cause
>>106572883
>mikutroon opinion
discarded
>>106572794
theres a newer cydonia and valkyrie
https://huggingface.co/TheDrummer/Cydonia-24B-v4.1-GGUF
https://huggingface.co/TheDrummer/Valkyrie-49B-v2-GGUF
>>106568645...Will I have to take the comfy troon pill?
>>106568659
>mlx-community/Qwen3-Next-80B-A3B-Instruct-8bit/blob/main/config.json
> "group_size": 64,
dumbasses
>tr**n>tr**n>tr**n>tr**nobesed!
MistralLarge3
>>106573036neber ever
>>106572948I know the default is 128, but I wonder why they changed that
>>106573045
>128
at that point it's literally retarded, you're supposed to quant it to 32gs. lower is better
>>106573036ugh i need it so much
Update on Qwen3-next goofs?
>>106573151
2 weeks more
https://github.com/ggml-org/llama.cpp/issues/15940#issuecomment-3286596522
>>106572714
>Essentially all GPUs you buy are a depreciating asset.
My used 3090 I bought 3 years ago is worth about 30% more now than when I bought it
>>106573171
sir no
>This is a massive task, likely 2-3 months of full-time work for a highly specialized engineer.
this no goods
>>106573171
>MXFP4
>Successfully quantized the Qwen/Qwen3-Next-80B-A3B-Instruct model to the MXFP4 format, with expert layers quantized to MXFP4 and other layers retaining their original precision. The model size has been reduced from 159GB to 45GB.
seems like sama's shitty model was useful after all
>>106573151
>run a prompt with 0 temperature
>get result
>run it again with 0 temperature
>slightly different result
Why is it so fucking hard to make rounding robust in GPU calculations, are they stupid?
>>106573199paging dr cuda dev, drop what you're doing and get on it
>>106573224
>seems like sama's shitty model was useful after all
You can say that once you run benchmarks comparing mxfp4 to other q4 quants on that model.
>>106573226>He didn't fix the seed
>>106573226result from the second one on should be the same
>>106573036What's the point? Dense lost
local is getting more relevant and popular because, spoiler alert, all these cloud NIGGERS are serving comped and quantmaxxed garbage!
>>106573226top-k 1
>>106573271>all these cloud NIGGERS are serving comped and quantmaxxed garbageGood.
>>106573271we're in peak race to bottoms phase
>>106573292This is the only bottom I'm racing to
>>106573324we must get behind this
>>106573270You thought it was going to be dense?
>>106573236
I seeded your mother
>>106573260
>>106573279
I know what the problem is, I literally wrote it in my post. It depends on the order the calculations are made in, as they introduce rounding errors. In real life a + (b + c) = (a + b) + c, but not on GPU. On GPU it will give you two different results and the errors stack up until they flip a token, and from there it's over.
I'm asking if it's fixable in a reasonable manner by people writing inference engines.
>>106573379cuda said it wasn't worth its time iirc
Whats the point of doing all this? Why even have private LLM?
>>106573379temp 0 only weights the tokens. Sampling still happens. top-k 1 picks the first one always, removing the chance for other samplers to interfere. The first token won't change.
>>106573423>private LLMThe clue is in the name.
>>106573425anon...
>>106573435Check your probs for all the tokens you generate. top-k doesn't have that problem.
>>106573423some people like owning things.
>>106573441Obviously I have deterministic samplers. It still flips tokens for the reason I explained above. top-k 1 won't do shit if the top token changes between generations.
>>106573447that sounds awful
>>106573469it really is. it is much more exciting to gamble if my work flow will continue working when the corpos 'update' the models.
>>106573467 (me)
To those who are still confused:
>CPU does the calculations sequentially, GPU splits operations and does them in parallel
>the order in which the calculations are made is more or less random
>because it's random, you don't control how rounding errors are introduced
>(a+b) + c is not equal to a + (b+c), meaning you will get different results depending on GPU whims
>this gives us micro errors that can sometimes flip top tokens
You can see the numerical errors I'm talking about in action if you run this:
a = 1.0
b = 0.00000001
c = 0.00000001
result_1 = (a + b) + c
result_2 = a + (b + c)
print(result_1)
print(result_2)
I still think cuda dev should add an option to force sequential operations in the necessary places to make results reproducible, for the sake of having a baseline for experimentation.
Large Migu's Galore?
>>106573535I prefer them small
>>106573551
>>106573324PANTYHOSE
>>106573551Obviously, small things are good
>>106573519
Reproducibility will murder your t/s. Just read, retard: https://docs.pytorch.org/docs/stable/notes/randomness.html
>>106573610
>Reproducibility will murder your t/s
I'm aware of that
>>106569281>not using a firewallngmi
>>106569281
>sends your data
It just pings an IP. Do you know what that means? Of course you don't, retard.
>>106573519
The ggml CUDA backend is deterministic, all operations are always done in the exact same order.
However, when you use prompt caching this is done by re-running the model evaluation for the last token of the prompt only.
As a consequence the KV cache/logits of the first token are slightly different and you can get different results.
For 100% reproducible results with prompt caching one would have to implement caching the prompt only in chunks of the physical batch size.
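If anyone wants to see the effect outside of llama.cpp, here's a toy numpy illustration (nothing to do with the actual CUDA kernels): reducing the same values in one pass vs in smaller chunks rounds differently, and that tiny drift is the kind of thing the cached KV values pick up.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)

full = x.sum()  # reduced in one go
chunked = np.float32(0.0)
for i in range(0, x.size, 512):  # reduced 512 elements at a time, like a smaller batch
    chunked = chunked + x[i:i + 512].sum()

print(full, chunked, full == chunked)  # typically differs in the last few bits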
>>106569281link the line of code that does those things
>>106572734
>>106570295
qwen3 coder 30b is a3b
stupid niggers!
>>106573185i can buy 3090 for 470 euro in my country
>>106573977I can get an RTX 6000 for half that
>>106573982
fine bro ill post the site i dont care if anons buy all 3090s in my country i wont be buying them anytime soon anyways
https://www.kupujemprodajem.com/kompjuteri-desktop/graficke-kartice/inno3d-ichill-rtx-3090-rtx3090/oglas/181677213?filterId=7125152428
https://www.kupujemprodajem.com/kompjuteri-desktop/graficke-kartice/rtx-3090-3080-3070-3060-1070-rx6900-6700-5700-580/oglas/141810340?filterId=7125152428
Trying windows 11... Why does it just instantly quit? I have no idea what's wrong because it tells me nothing.
>>106573977
1.6k aud lmao
>>106574004try -v for verbose output maybe.
>>106574016
>>106574029maybe try renaming the model so it doesn't have spaces and dashes and shit?
>>106573171>I asked GPT5 Codex to get a view of the work to be done, it's monstrous...this is why "social media" style coding (github) is cancerrandos should not be allowed to post on issues or do PRmuch less randos who are just copy pasting shit in chatgpt and pasting back the answer
the silence is deafening
>>106574045
>>106574080maybe install linux?
>>106574080your shit's right fucked mate
i need new sex. is qwen next good?
>>106574080I've had the same problem back when I was on windows and the only solution was to compile the binary myself. The one from github ci just refused to work with my drivers.
>>106574090>a3b
>>106574080wow, okay, I'm all out of ideas. I suppose you could try building it from source or downloading a different version.
>>106574080You do have CUDA 12.4 installed, right?
>>106574071
That's pretty much the same everywhere on the internet. Post a mod on Nexus, for example, and these weird retarded posts immediately come out of nowhere... Why should other people's work be open to comments from strangers anyway, unless they're actually part of the team? Pure demoralising cancer.
>>106574132
You do not need to install cuda on windows.
cudart-llama-bin-win-cuda-12.4-x64.zip provides the necessary runtime files and is distributed on the github releases alongside the binaries for the CUDA version of lamer.cpp
>>106574089
But why? Shouldn't it at least inform me if something's wrong? What kind of code just instantly exits without anything, even with a verbose flag?
>>106574101
I went with kobold cpp... It's still around 15 tk/s. Seems like it's a windows issue with my hardware?
>>106574084
On linux I get over 50tk/s, but I'm not a linux user, and I don't want to have to switch between operating systems every time I want to ask an AI something.
A few threads ago I thought it was maybe windows 10... but windows 11 is also fucked.
I also installed SystemPanic/vllm-windows, but that had a problem with pyzmq, and I couldn't get it to run with multiple gpus. Single gpu works fine, but I never had a problem with the single gpu performance in the first place.
>>106574132
Yeah, cuda 12.4 and driver version 552. Also tested with 12.8, and driver version 571.
>>106574097I am tired of people hating on low A count. Low A count is the future. Attention and context on GPU and a fuckswarm of fucksmall experts on CPU is the future. It is all a training problem and I am not training so corpo nerds have to solve it so I can jerk off in peace.
>>106574196Big models with low active parameters are useless outside of benchmaxxing, a model doesn't need intelligence if it's just pulling complete answers from memory.
>>106574196Anything less than 30B active is too retarded for anything but trivia recall and the most simplest of tasks.
>>106574177windows just isnt for ai, whats
>>106566836Qwen3-Next or GLM Air for storytelling and roleplay?
>>106574225cuda dev gets 80 tk/s on triple 4090s on his windows though
>>106574229Nemo
>>106574208Agreed. I love that model I use to ERP that never pulls complete answers from memory. I forgot the name of it though....
>>106574247
>>106574292Go fuck yourself straight to reddit.
>>106574320t. infiltrator ledditor
>>106574229probably glm. qwen models are really into that thing where they go "It's not X. It's Y."
>>106571849I like this Miku
>>106574292I like this Miku
Is there any point to local LLMs when free llms with better data like deepseek and aistudio exist? It feels like I wasted money on the P40 I got
>>106574618You should have thought of that before buying it
>>106574618
no, you have discovered the hole in the system
you have just fully invalidated all of /lmg/ and the very existence of this general
>>106574618
It's yours, it's offline, and importantly for workflows, it doesn't change. You won't get random drops in performance because the company wants to tweak or save money.
Look what anniversary is coming up in just under two weeks. This is when they will drop Large 3. It's the perfect occasion.
>>106574738
>its yours
This means if something goes wrong you can shift blame to the provider
>its offline
This means you have to pay for cost of hosting and ensuring availability
>it doesnt change
This means you will never be the first to get the new features and fixes
All the positives about local are viewed as negatives by apifags
>>106574738
the only anniversary I will celebrate here is their death anniversary, when the funding dries out
>>106574738Insider here: Mistral is being refactored. This means safety factoring them, but also two new models - Mistral Small 4.0 and Large 4.0.
I just had an idea (that I won't do myself): what if you run one pass through the chat with your model without generating a continuation, and instead just use the attention scores to find out which messages/paragraphs are most important right now, and then make your entire chat history just a RAG source for the second pass?
>>106574786Last news was that they might be bought out by Apple. Presumably because their attempt to relocate themselves to California failed.
>>106574786
Never trust a frenchy.
Never trust anyone else.
I believe anon,but not Anon
Why local when API?
>>106574799
Last news is this:
https://www.asml.com/en/news/press-releases/2025/asml-mistral-ai-enter-strategic-partnership
>ASML, Mistral AI enter strategic partnership
>Companies agree on long-term collaboration deal to benefit ASML customers
>ASML to lead Mistral AI’s Series C funding round investing €1.3 billion
>>106573379It's impossible.
>>106574685so it might be useful in the future when the companies decide to charge us 1000 dollars per month but not now? got it
>>106574819
100% misuse of corporate assets from a French (ASML's CEO is a french faggot) to support another French in what is likely to be corruption.
What a way to throw away a billion.
>>106574857they're just putting in a quick dirty $1.2b so that they can make apple pay them $3b in four months when they finally decide to buy up mistral
>>106574864
https://www.devdiscourse.com/article/politics/3200518-former-french-finance-minister-le-maire-joins-asml-as-adviser
>Former French Finance Minister Le Maire Joins ASML as Adviser
No, you are wrong, and as a French I can guarantee this is yet another affair of corruption on our part.
Not only a French CEO but also one of the biggest pieces of shit of recent political history is taking part in this mess.
The only thing that saves ASML is that they have a monopoly on EUV, otherwise the rot that is currently beginning to eat them at the core is the sort that can kill a corporation.
For a full private waifu stack at good speed you need 6 RTX Pro 6000 and some other 24+GB GPU, right? Four to run GLM-4.5, two for GLM-4.5V for vision and the other GPU for VibeVoice 7B.
I hope 29GB are enough for 128k context.
>5120*2*92*128000 = 112 GB
Fuck. So another RTX Pro 6000, two to be safe, and hope that TP=5 or TP=6 work well.
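Sanity check on the napkin math, treating that product as an element count (so the 112 GB only works out at one byte per element, i.e. a quantized KV cache; fp16 doubles it):

# kv_width * (K and V) * layers * context tokens, numbers straight from the post above
elements = 5120 * 2 * 92 * 128_000
for label, bytes_per_elem in (("q8/fp8 cache", 1), ("fp16 cache", 2)):
    gib = elements * bytes_per_elem / 1024**3
    print(f"{label}: ~{gib:.0f} GiB")  # ~112 GiB and ~225 GiB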
>>106574919
>Four to run GLM-4.5, two for GLM-4.5V for vision
Why not just use GLM-4.5V for both text and vision? 4 RTX Pro 6000 is not worth running GLM-4.5 over Air/V especially when your usecase is waifu talk.
>>106574940Because the full run is much better and within reach.
>>106574900
What did you expect from Macron's government? It's already like this for several industries. Mistral got gov gibs too so they're cashing out taxpayers' money.
t. french
>>106575020isn't all tax money going to boomers in france
how do you run glm4.5v?
>>106575123very carefully
>>106575131but theres no mmproj :( like I thought I could run it with glm air since it's based on it :(
>>106575131Because she's fat and obese.
Ling
>https://huggingface.co/inclusionAI/Ling-mini-2.0
Ring
>https://huggingface.co/inclusionAI/Ring-mini-2.0
16B 1.4A
Here's hoping it's somehow at least as good as Qwen 30BA3.
>>106570867
Wanted a quick sanity check on my memory here so I ran this on Gemma 3 27B Q8 with BF16 mmproj.
I don't know Japanese. I see one definite error near the end. Are the others errors?
>>106575202>>106575202>>106575202
>>106575144So rude.
>>106575181Impressive, it actually got the kanji all other models fail at.
https://youtu.be/7Jzjl3eWMA0?t=117
Women raping gay billionaire werewolf writers sounds unsafe. But their fucked up fetishes are somehow safe. I hate this world.