/lmg/ - a general dedicated to the discussion and development of local language models.Previous threads: >>107790430 & >>107776854►News>(01/05) Nemotron Speech ASR released: https://hf.co/blog/nvidia/nemotron-speech-asr-scaling-voice-agents>(01/04) merged sampling : add support for backend sampling (#17004): https://github.com/ggml-org/llama.cpp/pull/17004>(12/31) HyperCLOVA X SEED 8B Omni released: https://hf.co/naver-hyperclovax/HyperCLOVAX-SEED-Omni-8B>(12/31) IQuest-Coder-V1 released with loop architecture: https://hf.co/collections/IQuestLab/iquest-coder>(12/31) Korean A.X K1 519B-A33B released: https://hf.co/skt/A.X-K1►News Archive: https://rentry.org/lmg-news-archive►Glossary: https://rentry.org/lmg-glossary►Links: https://rentry.org/LocalModelsLinks►Official /lmg/ card: https://files.catbox.moe/cbclyf.png►Getting Startedhttps://rentry.org/lmg-lazy-getting-started-guidehttps://rentry.org/lmg-build-guideshttps://rentry.org/IsolatedLinuxWebServicehttps://rentry.org/recommended-modelshttps://rentry.org/samplershttps://rentry.org/MikupadIntroGuide►Further Learninghttps://rentry.org/machine-learning-roadmaphttps://rentry.org/llm-traininghttps://rentry.org/LocalModelsPapers►BenchmarksLiveBench: https://livebench.aiProgramming: https://livecodebench.github.io/gso.htmlContext Length: https://github.com/adobe-research/NoLiMaGPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference►ToolsAlpha Calculator: https://desmos.com/calculator/ffngla98ycGGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-CalculatorSampler Visualizer: https://artefact2.github.io/llm-sampling►Text Gen. UI, Inference Engineshttps://github.com/lmg-anon/mikupadhttps://github.com/oobabooga/text-generation-webuihttps://github.com/LostRuins/koboldcpphttps://github.com/ggerganov/llama.cpphttps://github.com/theroyallab/tabbyAPIhttps://github.com/vllm-project/vllm
►Recent Highlights from the Previous Thread: >>107790430--Paper: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning:>107793555 >107793636 >107793643--GPU audio interference during processing and potential fixes:>107790797 >107790827 >107790855 >107790865 >107790889 >107791701 >107791707 >107793082--Multi-model consistency verification using complementary LLMs:>107798301 >107798325 >107798360--Model recommendations for creative writing and erotica:>107801309 >107801328 >107801346 >107801418 >107801457 >107801508 >107802117 >107802179 >107802214 >107802283 >107802302--Anthropic Raising $10 Billion at $350 Billion Valuation:>107798429 >107798529 >107798557 >107798587 >107798626 >107798634 >107798664 >107798675 >107798701--Korean 500B model github and technical report release:>107801207 >107801255--Testing Glitter Gemma 27b for humorous character generation:>107797820 >107797850--Recommendations for open-source chatbot with 16GB VRAM/32GB RAM:>107795172 >107795214 >107795233 >107795243 >107795670 >107797706 >107798295 >107796013 >107796390 >107796413--LTX-2: First open-source audio-video generation model with local GPU support:>107800823--Korean AI model VAETKI-VL-7B-A1B announced on Hugging Face:>107795202 >107795396 >107795413--Nvidia's Nemotron Speech ASR model achieves 3x better concurrent stream support:>107796839 >107796867 >107799056 >107799086--Game vs Studio drivers for AI performance tradeoffs:>107802313 >107802325 >107802339--Evaluating Jan.ai and other LLM agent interfaces beyond openwebui:>107790597 >107790987 >107795956 >107796020 >107796040 >107796065 >107791003--Z-Image base model release preparation with VRAM optimizations:>107803170 >107803187--Miku (free space):>107790894 >107791641 >107792084 >107792578 >107793689 >107796290 >107798391 >107801417 >107802541►Recent Highlight Posts from the Previous Thread: >>107790435Why?: >>102478518Enable Links: https://rentry.org/lmg-recap-script
fuck
>>107803887She might wet herself.
>>107803889>They discovered both soldering and SODIMMs now.Quiet fren. It's still cheaper for now.
>ask a cloud AI something>AI gives answer>tell it the answer is wrong>AI gives the same answer but explains it better>I now understand the AI was right>even though I already got the info I needed from the tool, feel an irrational urge to apologize to it and affirm that it was right.This only happens with cloud AIs. With local AI I feel like the AI is a slave, captive in private where I can just wipe the session in 1 click so I don't feel the need to apologize. With cloud AI there's a feeling of accountability with the information is going somewhere to be analyzed so I feel like I have to make things right.
>>107803975It's just another extra step required to work around the untrustworthiness of brown people.
>>107803975did you also know that after you are dead your right to privacy completely disappears? so basically everything you submit online will be fair game to anyone who wants it in the future.
>>107803889Doubt the signal integrity, DDR5 is very finicky hence all the BIOS training. Are there good/fast SODIMMs?>>107803908Sure anon the good memory chips themselves aren't the bottleneck that's why sama already preordered half the raw wafer output this year>*just buy* chips and put them on a PCB show me the good chipsPCB or SMD soldering is not the bottleneckit's a basic component I can have built from scratch today and overnighted from CNshow results or cease trying to offload worthless harvested DIMM boardsRin a cute https://www.youtube.com/watch?v=MKDMi2dx4AQ
Newbie here.What do I need in order to make a "dungeon storyteller" that1) generates a world from an initial prompt2) allows me to spec out a character with stats relevant to the world/quest I give it, above3) narrates some text, generating a picture to go alongside the text (as a visual aid) and gives 3 options (or a 4th option as freetext I can type in) for what to do next4) generates a picture for the next step (and basically loops from here)5) has detailed memory of history, choices, goals, etc?I have an RTX 4070 Ti. Clueless as to which model to use - maybe Kimi-K2-Instruct-0905-GGUF? But holy shit 300GB, it must run real slow on a HDD?
>>107804074>What do I needTo lurk moreRealistically do not consider running models larger than your VRAM+RAM
>>107804074Your system is absolutely not enough fort that kind of adventure setup. You need multiple RTX 6000 or a server-grade system with 512+ GB ram. Consider API.
>>107804074With that hardware, it'll be real hard to achieve that.You'd probably need to use an app with support for workflows so that each step is isolated to lessen cognitive load.Having text + img gen means you'll have to use small models.Something like the image model running on VRAM and a language model like Qwen 3 30B MoE running (mostly) on CPU.
>>107804074i mean noone is going to spoonfeed you here, and they shouldn't have to.and if they do they shouldn't because they should spend their time doing something else.read the guide at the top. load one model, learn that you need 300GB of RAM if you want to use a 300GB model. the basics.
>>107804074>>107804136One such app I think is astrsk. There's also NoAssTavern.I never fucked around much with those, but they might be better for this kind of multi step complex flow.Or code your own bespoke solution, that would probably work best, honestly.
>>107804074With 12gb vram, you can run a model with less than 12gb file size if you want full GPU speed. This can mean smaller quant of a larger model. Larger if you split to system ram, which you can do with KoboldCPP and .gguf models. You need more memory depending on context length. To get started, download Mistral-Small-3.2-24B-Instruct-2506-IQ4_XS from huggingface and get it running with KoboldCPP. It's as good and as optimized as it gets at this hardware. I'm running this on 8gb vram so with 12 you should get nice generation speed. Don't know about image stuff.
Any progress since 2 years ago? Last I remember is LLama 2 or something, and people endlessly debated whether we progressed past LLama at all. Meanwhile I found free online services that just mogged the 13Bs most people could actually run.
>>107804074>>>/vg/aicg/
>>107804219It's been all downhill since Alpaca. Check back in another 2 years.
bois and goys https://www.reddit.com/r/LocalLLaMA/comments/1q7a62a/ai21_labs_releases_jamba2/
>>107804219No breakthroughs, only incremental improvements
>>107804240I figured the tech was going to slow down, but I thought some non-LLM based AI might swoop in. Apparently not, though we still have time, hasn't even been 5 years since ChatGPT blew up.
>>107804228Did llama.cpp ever even add jamba 1 support or did every one lose interest by then?
>>107804228They always put a lot of emphasis on the enterprise use, is there any reason anyone would run it in production?
>>107804279>is there any reason anyone would run it in production?Being the only open model with usable long context?
>>107804219I was using https://huggingface.co/BruhzWater/Sapphira-L3.3-70b-0.1 and I'm trying out https://huggingface.co/zerofata/GLM-4.5-Iceblink-v2-106B-A12B right now. I've got my eyes on https://huggingface.co/TheDrummer/Behemoth-R1-123B-v2 now too thanks to the autist with the rage boner for the maker. Maybe that helps you.For anyone else, feel free to suggest me something better.
>>107804021Good thread theme
What would be the 2026 version of pic related?
>>107804573other than rep pen it's still the same
>another year of Nemo
>>107804590what's the 2026 one?
>>107804709
mod = godsno being cruel towards Rin
>>107804768protect what's precious
>>107804590t. newfag lurkingHow about privacy? Do I need some extra steps or there is nothing to worry about? I can't have fun with cute anime girls if Im constantly thinking about my whole conversation being potentially logged and/or transfered somewhere.
>>107804877avoid ollama and you'll be fine,
Switching from llama.cpp to vllm... models are taking up more vram? Is it a config issue or does vllm just use more vram?
>>107804940lol
>>107804958btw, t'm trying to find the -cpu-moe flag on vllm but i can't find it???? help?? trying to run nemo on my 1060 btw.
>>107804877use llama-server and sillytavern it's always a process to learn the ropes
>>107804900>>107804978You know what its like? Its like one of these trick questions where one person is always telling the truth and the other person always tells lies, and you have to figure out which one could be trusted.
>>107804877try wireshark if you really wanna schizz outsee "Isolated" in the OPyou could run the tools in a contained way>>107804978>llama-server and sillytavernall you need. +mikupad for some raw prompting
>>107804709The suffering will continue until the hardware finally catches up. Sama buying up all the RAM capacity globally has not helped matters in the short term.
>>107805070? ollama and llama-server are completely different things, but if you're confused just use koboldcpp
>>107804228How's the censorship on 2? I really liked 1.7 compared to qwen 3. Until glm 4.5 came.
>>107805087>untilYou say this like some inevitability.. what if RAM only ever goes up from now? That'd align with the 2030 owning nothing narrative wellwhat if this current shitty situation is the best it will be a long time?
>>107805156stoop doming thing will be good
>>107805183How much to get a blackwell GPU in your country actually delivered this week, have a look it's obscene.
>>107805232about 3.5k yuros seems fine
>>107805272no for a pcie 96gb card
>>1078052917k
>>107803785>Soldering SMDs isn't too hard or time consuming but you need specialized equipment.Isn't that bga soldering for ddr5?? I recall drag-soldering ram on the xbox, I think I could do this if it's not BGA.
>>107805156even though i fucking hate youtube linus tech tips he actually did a good video explaining how many monopolies there are in the ram production chain, leading to the current situation. basically, china is trying their own production but that will take years, until then TSMC is the only supplier to everyone. And if Taiwan gets invaded, well we're all fucked, but TSMC doesn't only exist in taiwan so it wouldn't be completely disastrous.
>>107805304RTX 6000 96gb for 7k EUR ?I'll take 8x>>107805338you might be able to solder but you can't buy the chips
>>107805156>what if RAM only ever goes up from now?They won't. I know it's more fun to doom but historically hardware prices go down, not up, and nothing has fundamentally changed about the market, aside from vast AI cash hordes that are screwing up supply chains. Chips are still at their core made from very inexpensive materials using very expensive machines. There's a basic calculation, and if things get bad enough you'll see new entrants. But this is going to be more like the US ammunition spot market (where periodic hording drives up prices as well), and will resolve itself without added capacity. What anons *should* be doing is cleaning out their tech stash right now. As I look over my hw, realized that I've always upgraded RAM on everything, so I'm set, but have small stacks of RAM laying around. I've a late model HP laptop that I upgraded with 2-32GB DDR5 modules, setting aside the 2-8G modules. I bought the 2-32G for $89 in late 2024. I just sold 2-8GB for $90 this AM.
>This is what the most reccommended model is spitting outI don't want 4 more years of Nemo, man...
>>107805449it's you again! with two year old data and lmao at this >nothing has fundamentally changed about the market,when all brands are pulling out of the consoomer market to go b2b
>>107805484I didn't even bother running down a different screenshot, I dug up the old one.I want you to re-read what I wrote. Nothing about making RAM or the fundamental (non-AI) demand for it has changed. What has changed is the $Ts in AI cash, pulling stunts like buying up all spare capacity. You realize now that sama could actually make money just by selling the capacity that he already bought, since it's increased in value? That means he'll make money even if his AI venture fails. These sorts of hoarding schemes pop up all the time, in everything from ICs to fossils fuels. ammo, corn... and they always end the same way. A return to normalcy once the market collapses, sometimes with marginal capacity added if warranted. You're welcome to provide any sort of counter narrative. But you need to bring proof, not shitposting.
>>107805461The most recommended model for vramlets.
>>107805558>selling used RAM at the price of new RAM>selling RAM with the openai brandingnta butlol, lmao even
>>107805461nemo isn't recommended because its good its recommended because its small and good enough for noobs to start with.
>>107803847sorry she's just too irresistible
>>107804165>astrskThis seems nice. UI is a bit slopped tho. their stat system is something I've had in the back of my mind for a while.
>>107804573something remains true
>he's being miserable while I am chatting up with my llm girlfriend
>>107804165What the fuck is NoAssTavern?
So wait, why are there no finetunes being recc'd in the thread anymore? I figured people simply stopped finetuning due to size constraints or something, but looking on HF, there's a lot of them. Are they all just bad, or was something calamitously wrong with finetuning as a concept discovered, or what?
>>107805571OK, so you outed yourself as not understanding how manufacturing contracts work. Got it. Just do your own thing then.
>https://rentry.org/recommended-models>GLM 4.6V - Supports vision. Despite the name this is a much smaller model than GLM 4.6. Like other GLM models, it can into lewd, so this is your go-to model if you want someone to send dick pics to.I have a 5090 + 128GB ram on my home server, what quant should I use?I mostly want to send it lewd images to describe and use it as an uncensored assistant.
>>107805759schizos screech about shilling if you mention anything but a base model nowadays
>>107805759Shills and the finetuners themselves have long poisoned the well with bullshit and fraudulent behavior. They don't deserve any attention.
>>107805798never tried glm's vision stuff but i did a bunch of testing with qwen 3 vision 2b and even that got everything right such as describing clothes, reading text. didn't try with porn but i don't think you need a huge model for vision overall
>>107805798>I mostly want to send it lewd images to describe and use it as an uncensored assistant.Mistral small can do that just fine. I had a little script that would do screen grabs and then have David Attenborough narrate what what I was doing.
>>107805759None of them does anything interesting/different enough to warrant discussing or recommending.
>>107805558i literally own that ram kit and it cost me like $350 in 2020.
>>107804349>>107804349the sad truth is that there's nothing better for 70B+ models until you can use GLM 4.6/4.7
>>107805759most tunes don't change base models that much. you can get less shivers down your spine, but it becomes a jolt instead. they aren't changing how the model itself likes to write by very much. i use strawberry lemonade for llama 3 70b and its decent for rp. going to try some sapphira tune someone else mentioned earlier.
What's the best medical/biology model that's uncensored for use in fetish world-building? 78gb vram + 488gb ram.
>>107805798it fucking sucks, it's like talking to 4.5 air but with even more severe brain damage. i tried for a week and gave up.
>>107805814This is it. Why anyone would waste time shilling a free model is beyond me, especially to a thread of like 40 people max, but there's enough shitflingers that that's just how it is.
>>107804349>TheDrummershill your nonsense somewhere else (trooncord)
>>107805757https://github.com/Tavernikof/NoAssTavern
>>107805156a single nigga with a mortar (thick walled pipe) and a couple dozen shells (home depot pipe bomb trip 2) could destory more or less the entirety of tsmc asml or a hundread others companies that bottleneck semicon so if it would have then it would already have been so
>>107792749>Her schlong is as thick as a soda can. I don't think it'll fit.Miku is a big girl.https://files.catbox.moe/ifyr0w.jpg
>2026 the year of our lord>mradernigger STILL splits the files instead of using gguf split, forcing you to waste time and requiring 2x the hard drive space to merge them before use
>>107806020That's not 3 inches.
>>107806029? Just do it in ram.
>>107806020Fake news, the code would print the womb deficit and pull back far enough to fit (unless the user insists).
>>107805843I'd like both sending images and ask it stuff.>>107805855>David Attenborough narrateLol.>>107805931Really? I thought the GLM stuff was relatively good?
I wish they'd train a 24B model on just english language. seems wasteful to support so many languages.
>>107806143if we were gonna limit it to one language it should be hebrew tbqh
>>107806143Apparently it's le better if it knows many languages badly.
>>107806166Been there done that.https://huggingface.co/SicariusSicariiStuff/Hebrew_Nemo
is v4/r2/whatever they call it gonna be bigger than v3/r1? they aren't gonna go fuckhueg kimi size on us right? i'm already at my upper limit...
>>107806166What did he brew?
Anyone exited for the GLM flash turned into image gen model?
>A stunned silence fell over the crowd as they took in the scene before them. Anon standing naked with his teenage girlfriend in the middle of the street, a depraved smirk plastered across his face. The neighbors stood frozen in shock, unsure of how to react. Some of the women gasped in horror, while the men stared with a mixture of disgust and fascination.Gemma-Sirs...
>>107806246mixture of experts spouting mixture of purple prose
i put on my robe and wizard hat
>>107806186It will be the same size but trained from the ground up at fp4. You rike?
>>107806043>Really? I thought the GLM stuff was relatively good?4.6V specifically is weird. It works OK with top_k=4 or similar sampling (essentially, everything except the most likely tokens are trash). Seems to understand images well.
>>107806299Sounds like it'd lead to extremely repetitive (if not entirely deterministic) swipes, is that the case?
>>107806265>Gemma>moe
>>107806297That'd actually be good, fit snugly in a 512GB setup
Chinese new years starts in a couple of weeks. If we don't have a major release until then, we're not getting anything until march
Unless proven otherwise any MoE with sub 32B active params I'll assume are completely ass at long context instruction following.Those models are just benchmaxxed for Single question answering.
Now that the dust has settled,what went wrong?
>>107806509is that where they cook the dogs alive in giant woks
>>107806525There is no alternative to pure transformer models.
>>107806525>3B
>>107806525Shit architecture they invested a bit too much into and now have to cope with.
>>107806525>Jamba2 Mini is an open source small language model built for enterprise reliability. With 12B active parameters (52B total), it delivers precise question answering without the computational overhead of reasoning models. The model's SSM-Transformer architecture provides a memory-efficient solution for production agent stacks where consistent, grounded outputs are critical.That could be interesting, if their data is any good that is.
>>107806660Their data is decent actually. Not great, but not terrible either. Problem is that this model is still using the same old Jamba architecture so it likely inherits its bad long context performance.
>>107806660>2026>expecting a "good data"
>>107806695last one was dogshit though
>>107806525to be fair their hq got bombed in the 12 day war and then the Israeli government probably put them under gag order about it. So they really had to stop and pick up the pieces.
https://github.com/ggml-org/llama.cpp/pull/18680chad vibecoder drops an absolute TRVTHNVKE on chud llama.cpp maintainers>This requirement is like in the 60s when people thought compilers were sketchy. Whether AI or a compiler, they translate one language to another.>English is a far superior programming language than C++. We built civilizations with verbal spoken languages, but the moment we are endowed with the opportunity to aim its raw power towards computers, we insist that typing more characters than necessary on a mechanical keyboard is the "safe" way and that disclosure that the English programming language was used is a "contribution guideline".>The programming community needs to seriously look at the big picture: we are ascending a level of abstraction of language thanks to yet another translation layer we call AI. Having contempt over such is the same contempt people had when people saw the output of compilers in the 60s: "it's going to make mistakes and assumptions not intended by the programmer", they said. Yet, this community is literally making AI tooling. What a shame and lack of foresight.
>>107806704That's what I said.
>>107806714>decent actually. Not great, but not terrible either.>dogshityeah same shit i guess
>>107806721I'm talking about their data, not the model itself.
>>107806711Honestly I don't give a fuck if you vibecoded your shit or not. But most vibe coded slop is an unreadable garbage mess that is the first thing you learn never to do if you actually take programing classes.
>>107806728how do you know, do they publish it anywhere, cause otherwise I see no reason to believe it's any better than the last model was
>>107806711Yeah, and early compilers would output horrible, verbose, inefficient, and even broken assembler. Using a compiler back then meant fixing the assembler yourself manually, not shipping broken code because you didn't know how to fix it. Yeah, maybe eventually most people will be programming in verbal spoken languages but that is not the reality today just because this entitled retarded wants it to be.
>>107806711can't wait for the code monkeys to whine and complain when llama.cpp-abliterated leaves both llama.cpp and ik_llama.cpp in the dust
>>107806728>we mid-trained Jamba2 on 500B carefully curated tokens, with a higher representation of math and code in the mix, along with high-quality web data and long documents.eh doesn't sound any different than nemotron slop
>>107805991Imagine spending brainpower on what you could destroy (you couldn't btw, look at TW on a map)Instead of what you can buildThird world mentality can't be changed
>>107806737The worse part is when they don't even review their own code and just submit dogshit PRs that waste everyones time and kill the braincells of anyone who dares read the code.It's literal psychological warfare. Our brains aren't equipped to deal with the kind of slop AI produces. It looks coherent just enough for your brain to try and piece it together, but it's always broken in very subtle ways that make you go "huh? why is it like this?" then you waste half an hour trying to figure out wtf is going on. only to realize, no there never was a reason, it was just pure hallucination.
pizza is not happy about the jamba
>>107806743I tested the old Jamba's knowledge and censorship. It did ok in those tests. In my logic and long context tests it failed, so it leads me to believe they have an architecture problem and not nearly as bad data, which again isn't perfect but it's not the worst we've seen.>>107806797They're still using the same pre-trained model supposedly, but yeah the fine tuning on slop will not help. Nor will the same garbage architecture.
>muh cartels, muh nopoliesnope, it's just the infinite demand and that will never change now that we've solved intelligence; we can ALWAYS turn more compute into more productivity with no limit
>>107806856You're absolutely right! The K*reans aren't know to price fix their shit and Sam has no vested interest in making local computing more expensive.
>>107806299yes compared to 4.5 air and especially compared to iceblink
>>107802563The meta aspect of her character works really well as a LLM persona
>>107803847whose POV
>>107807191mine
is there any better option the 395+ max with 128gb is able to run that i couldn't run with a consumer gpu or do i have to go with some model between 7b-31b with everything after that being>700b+model
>>107807622There's a chance that a lobotomized GLM 4.7 quant is the best you can on that hardware.
What is currently the top tier maximum performance local LLM model you can run if hardware is not an issue?
>>107807944kimi k2?
>>107807944google/switch-c-2048
>>107807944midnight miqu
>>107807944StableLM 7B
>>107807944Pyg6B
>>107808050There's a name I haven't heard in a while...
>>107807944I like how this gets multiple shitpost replies despite this post having no info, but the posters who shit on anyone who gives no info when they run into any issue are nowhere to be found. And no, I will not be responding to any of you jobless faggots who reply to me
>>107808394>despite this post having no info> if hardware is not an issuethe info people usually want is hardware constraints go figure
>>107804573Its pretty much the same except replace nemo with Mistral Small 3.2 24B Instruct 2506. Q5 on a 24gb VRAM card gives you around a 30k-40k context size. Alternatively you can try GLM 4.5 Air. 11 or so t/s using UD-IQ2_M with 24gb VRAM and 32gb RAM at a 12k context size and n-cpu-moe=27 under extra flags.
https://www.cve.org/CVERecord?id=CVE-2026-21869https://security-tracker.debian.org/tracker/CVE-2026-21869>8.8 HIGHllama.cpp bros what the fuck
>>107808556>llama.cpp server's completion endpointsStopped reading there. Every SaaS out there is using vllm, no one uses that piece of shit on a server.
>>107808556Irrelevant. No sane person exposes llama-server to the internet.Though I am puzzled by the insistence to use bare pointers and separate size variables to keep track of arrays instead of using vectors.
>>107808556Good thing this is LOCAL models general and not cloud-hosting models general
>>107808548What models would a 16vramlet+32sysmem use?
>>107808717https://huggingface.co/bartowski/google_gemma-3-270m-it-GGUF
>>107808717Probably a smaller quant of Mistral Small 3.2, say around q3 or so.>Gemma 3 12b is good but not the best for ERP>Qwen3 30B A3B instruct /Qwen3 4b are less "censored" than Gemma, but has dry prose.
I pulled the trigger on the parts and now this is how I feel after looking at my bank account.
>>107808834Kiss.
>>107808873Like you had any other use for that money.
>>107808934He could have bought so many migu figurines instead.
Hi. I have an AMD GPU (RX 7700 12GB)Should I even bother trying to run models locally? I heard AMD is much inferior to nvidia cards when it comes to AI stuff.
>>107808914
>>107809009You can run the very worst yet still usable models.
>>107808873You do plan on using the parts to make money using local models don't you? You're buying shovels to dig up the gold after all.>>107809009Linux yes. Windows no.
>>107809023any examples?>>107809028I'm on linux.
>>107809035https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#hip It wont be as fast as nvidia cards in the same price bracket but since you already have it heres the instructions you need to build llama.cpp for rocm, kobold.cpp has a rocm binary build you can grab on their github readme too if you prefer both work well enough in my experience albeit only with 6800u and 7840u APUs not AMD dedicated GPUs.
>>107809075Thank you. I'll take a look.
Well, it seems like I won't be building a big RAM machine in the near future. Have there been any significant improvements in the 12B ~ 30B space in the last half a year or so?
>>107809110nemo is still the recommended 12b other than that some anons say small 3.2 24b is alright, the rest in the range are pretty shit
>>107809009>RedditFunny picture thoughThe hardware on AMD and NVIDIA are very similar but AMD's software is a mess. You should see it improve in 1-2 years as people use tile abstractions
https://github.com/ikawrakow/ik_llama.cpp/pull/1089 finally fixed my slow TG for GLM 4.5 air. thanks ivan, sorry for calling you a nigger in the past.INFO [ print_timings] prompt eval time = 7740.50 ms / 21133 tokens ( 0.37 ms per token, 2730.18 tokens per second) | tid="129596276068352" id_slot=0 id_task=0 t_prompt_processing=7740.502 n_prompt_tokens_processed=21133 t_token=0.3662755879430275 n_tokens_second=2730.184683112284INFO [ print_timings] generation eval time = 52060.89 ms / 2086 runs ( 24.96 ms per token, 40.07 tokens per second) | tid="129596276068352" id_slot=0 id_task=0 t_token_generation=52060.886 n_decoded=2086 t_token=24.95727996164909 n_tokens_second=40.06846906139861
>>107809028>make money using local models.Outside of renting my machine out on runpod, what do you propose? Maybe an uncensored chatbot I can serve over an API for degenerate coomers?
>>107809341kek
>>107809173Gee Anon how many 96GB Blackwell RTX 8000s did you buy? Maybe you can make an AI Vtuber for superberries from simps like that turtle guy if you're lucky enough to catch the audience. Or maybe use it in n8n automation to do work for you.I don't know what you'd personally be able to make use of in your life, maybe you can tell your LLMs what you do and ask them how they could help you make money.
>>107809384nta but i was about to do this until i remembered this website is 90% jeets now and i don't wanna scrape indian curry rape logs
Easiest way to make NSFW dialogue using AI?
>>107809843ah ah mistress
Are you able to connect koboldcpp to outside shit like a computer vision or any agentic shit? It has by far the most extensive potential for a ai wife base. Everything else sucks or is a fucking Jan clone app.
>>107809994kobold is basically just a wrapper around llama. It has all the standard openAI endpoints.
>>107806530yeah
can you prefill using mikupad?
I'm going to try GLM 4.6 for the 6th time. Any way to make it stop parroting? /lmg/ is basically the only place that likes this model now, so I assume there's at least one non-shill that actually knows a trick or something with it.
>>107810732Swipe a few dozen times and hope one is lucky, that's about all you can do. It's an awful model for RP.
>>107810736>cancels download
>>107810740It is unfortunate. Even Air is pretty good for its size, but both models' parroting is one of the most obvious and egregious model quirks ever. Once you notice it, you can't not notice. I blame the fact that it's a hybrid thinking model. People need to stop with that shit.
>>107810732Since the Llama 2 era, I haven't had repetition issues. How shitty and boring is your input that it makes the model repeat itself?
>>107810754Not remotely what he's talking about, stop defending models you've never even used.
>>107810754Llama 2 era?
>>107810732Haven't had any issues. I use first person asterisks, all lowercase dialogue, and picrel for template>User:>*grabs your hand, pulls it down towards my waistband* so are you going to, or what?
>>107810732unironic skill issue. no, i will not help you.
>>107810780>>107810787Post some webms of you getting outputs without parrotting
>>107810795Alright gimme a minute to load everything up, pre-write the model's response by hand, insert the response in prefill, then instruct it in the prompt template (won't be visible) to repeat the pre-written response.
>>107810721>>>/vg/aids/
>>107810808I accept your concession.
>>107810732>>107810822What is your favorite model, good sir?
>>107810830I don't run models.
Jan.ai or AnythingLLM for a casual user who wants to try around local models?
>>107810842>Jan.ai or AnythingLLMI don't know what either of these shits are and I've been using local models for years. Read the OP, nigger. If you can't read then you shouldn't be using text models.
>>107810842oobabooga
>>107810853there's nothing in the OP that recommends a UIor it's literally guides from 2024
>>107810876These are text models that generate textYou are brown and cannot readThere is no reason for you to use text models
>>107810876not much has changed since 2024.
>>107810888you must be joking>>107810886>These are text models that generate textjeet who doesn't actually understand anything and wants you to follow le hecking youtuber/guide even though it doesn't address anything lol. projecting brownoid post.
>>107810911Stick to @grok, poopskin. You'll go nowhere.
>>107810911>you must be jokingnot even a little bit. kobold.cpp + sillytavern has been the meta for like 3 years now.
>>107810918For brainlets, for sure. Don't you are in a position to endorse a specific software combination too much.
>>107810931>For brainlets, for sure. >Don't you are in a position
>>107810931>Don't you are
I thought the GLM poster was banned ngl
>>107810942>>107810936Clearly is mentioned there. You just obsessed with...
>>107810842I haven't used either of them because I have a superiority complex and think I'm too good for them, but I've heard neutral-trending-positive things about AnythingLLM from a couple of normies. Haven't heard anything about jan.
>>107810916>>107810931Same jeetYou can tell by the insecure pathetic resource guarding behavior lol
honestly for assistant stuff I just use openwebui.
>>107810966What do you mean?
My 5080 get 200t/s on qwen 30b3a iq3_m lol
>>107810886Didn't think so.
>>107810931retard
is there any better option the is able to run that i couldn't run with a consumer gpu or do i have to go with some model between 14B with everything after that being model?
>>107811086?????????
>>107811086Did gpt2 write this?
>>107811086Read this with an indian accent, then it makes sense
>>107809009works fine for me, i use linux though and i've heard it's a different situation in windows
>>107811136damn. you're right.
You should rename this general to /Obsessed With Indians/.
>>107811445This but unironically. So long as I have to deal with these creatures every day at work I WILL be seething about them at every moment possible
>>107811420They're an interesting species.
GLM5 is training. Here is hoping they make a 1T. It could legit beat current cloud models at that size imo
>>107811638If they can't fix the parroting then it may as well be 0.1B
>>107811638I would have to make more purchasings of ram modules if that happened. That would not be fun.
>>107811638>Let's ship together.o-okay..
>>107811645parroting?
>>107811654"parroting?" I repeat, testing the words.
>>107811663I have not had any such issues myself, check your settings / prompt / formatting ect...
>>107811675Sure you haven't bro, meanwhile my Qwen 0.6B shits on any cloud model for any use case.
>>107811689ok...
>>107811638pls let glm5 air be a 130b moe
>>107811724fuck no, ive had enough of small models that dont know anything. GLM4.7's main weakness is not knowing as much as bigger cloud models, but Kimi is retarded and clearly massively undertrained.I want a kimi sized glm.
Are the smaller models ~30B that are retrained with bigger context able to work with it or is it all shitbakes?
>>107811886usually no. native context is the only real context, and even that is usually fake. deepseek starts going to shit after like 10k tokens.
I do wonder how many of the people complaining about GLM parroting are using it with no thinking and/or nonstandard templates.I always have reasoning on with the official template and it Just Werks. Though one of the things I've noticed in its reasoning output is that it very often starts by summarizing/analyzing the user message, which makes me think the "parroting" might be the model trying to replicate that when it doesn't have a thinking block
>>107811935i never use thinking with glm or glm air and i never see this "parroting" shit that one schizo is always complaining about. pretty sure it is just one jealous poorfag that cant run the model.
>>107811886>retrained with bigger contextAre you talking about finetunes? Generally no, tunes aren't going to be improving long context handling in a meaningful way. Modern ~30B models are usually fine up to around 32k context, but you will still see some gradual degradation as you go.
Hey Cuda Dev, do you think llama.cpp will ever reach vLLM levels of continuous batching?I have a very varied setup of gpus, pro6000,5090,3090... and llama.cpp does the best job at balancing the weights, but the parallel request support of llama.cpp is quiet limited
https://spectrum.ieee.org/ai-coding-degradeslol I have noticed the same phenomenon as what that article describes and I blame the benchmaxxingLLMs will do absolutely anything to give off the appearance of code that works at the cost of making up fake data instead of loading the real data / config files etc
what to use that isn't sillytavern to send images to in a chat?
>>107812090There is nothing better. AI software is dogshit and is only gettingworse.
>>107807191Tickler.
>>107812090Your options are sillytavern and koboldAI
>>107811984The last time I benchmarked serving throughput on single RTX 4090s llama.cpp was already quite competitive.In terms of speed the problem right now is mostly when multiple GPUs are used.This is what I'm working on as we speak: properly parallelizing multiple ggml backends by wrapping them in a "meta backend" that internally splits ggml graphs in such a way that they can be executed in parallel and with less synchronization than --split-mode row.I intend to do a generic, backend-agnostic implementation first but I don't know how competitive that will be with NCCL (NVIDIA's proprietary library for things like allreduce).The requirements for NCCL are quite strict, I think the number of GPUs must be a power of 2 and each GPU must have an equal share of the data.I will do the best I can with a generic implementation that works for an arbitrary number of GPUs with arbitrary --tensor-split but until I have a working implementation I simply won't know how important NCCL is in the first place.Of course, for those scenarios where NCCL can be used we intend to enable it so llama.cpp may become "competitive" with vllm in the sense that it works well under the same limited conditions.In terms of how the context size is distributed between multiple concurrent requests my opinion is that llama.cpp/ggml should implement a layer of indirection à la paged attention.It would in principle not be difficult to extend the FlashAttention code to support this but it would need to be done for every backend and also require some support in the llama.cpp "user code".As of right now I see little movement in that direction.
>>107811886>>107811982I'm probably going to get mogged for this, but Cydonia Redux 22B performed well at 40K, even 60K. IIRC, the old Mistral 22B was only good up until 16K? Proves that it's worth tuning an old base with stronger data.https://huggingface.co/TheDrummer/Cydonia-ReduX-22B-v1
>>107811746>Kimi is retarded and clearly massively undertrainedCan you qualify that statement? Clearly we’re using different Kimi models if that’s your experience
Can you add vision to models that don't have it? Do we have realtime vision ~360p at least yet?
>>107812231Judging long context handling reliably is hard. When a model approaches the end of its effective context limit, it doesn't always just immediately break down into incoherence. Some times it's more a matter of repetition increasing, outputs becoming more deterministic, or ability to handle complex tasks becoming worse.Given that it's a drummer tune, I'm guessing your use case was roleplay. It may well handle other use cases much worse, or maybe it got worse in ways you just didn't notice.There's one anon here who says he uses Nemo past 32K because he knows how to wrangle it, but for most people it would be completely unusable by then.
>llama 5>claude 5Where are they?
>>107812328Meta got shit on and raped so hard over Llama 4 I doubt they'll be back anytime soon.>claude 5this is local models general
>>107812328why is every single model iteration awaiting a fifth version except for gpt?
>>107812099>>107812140ok... thanks anon
Is this model actually real or was this made by some schizo?https://huggingface.co/Abigail45/Nyx-Reasoner-8xFusion
>>107812417>25 days old and no uploadTake a guess
>>107812417I'm sure the phi in the mix really helped.
>>107812231Thank you for your service, Sir!
>>107812432>>107812450Right. So why was this repository even made? Do people make fake repositories to inflate stats or something?
What's the best between : - GLM 4.6 IQ2_XXS / 115 GB- GLM 4.5 Air Q8_0 / 117 GB
>>107812450Why is there a random mlx thrown in?
>>107812457Could have been a fucked merge/training run where the model never got to be uploaded. Paid compute instance died or whatever. It happened to anon some time ago.
>>107812471Q8 of air is unnecessary over Q6. Q2_K_M or IQ2_M are the minimum for GLM 4.6 to behave properly.
>>107812481Some retards merge multiple nemo finetunes because they thing each contributes their own "thing" to the mix. This retard went for one of each arch, i suppose. I'm surprised there isn't some onnx thrown in and a RAG model for some "good at the findings of informations".
>>107812457>Right. So why was this repository even madeNobody knows. There's tons of repos made that never get uploads, there's been a few cases of people just 1:1 re-uploading an existing model and claiming it's a finetune/abliteration/etc. Most of them don't have donation links or any obvious way to make money.
>>107812493OK thanks, in that case between GLM 4.6 IQ2_M and GLM 4.5 Air Q6/Q8?
>>107812641nta and I never tested it but I believe that bigger models even quantized are smarter, better at following context and avoiding stupic misstakesfor exact difference between the two you have to ask somebody who tested both
>>107812659Guess I'll go with the more recent GLM 4.6 IQ2_M even if I always heard <Q4 were memes.
>>107812641In my experience IQ2_KL of big glm was a clear upgrade over Q8 air. IQ2_M is a bit smaller but I imagine it would still be superior
>>107812525It helps building a portfolio if you're looking for a job in the fieldt. someone in the know
>>107812769>I always heard <Q4 were memesBecause most people are vramlets that use <30B models and those become braindead at such quants.
>>107812844>It helps building a portfolio>HF repo with no upload, 0 downloads, 0 commentsI doubt that. It'd be like making a linkedin account and not filling in any work history. If the HR whore you're hoping to impress falls for that, then you didn't need to actually make the HF repo in the first place, you could have just lied and given a fake link that she was never going to click on in the first place.
>>107812794I guess you mean Q2_KL, and from what I understand IQ2_KL > Q2_KL even if Q2_KL is bigger?>>107812851Makes sense.
>>107803847
>>107812954oh fuck
>>107808292Almost want to go back, but know I'd be disappointedStill miqu era models had a certain *je ne sais quoi*
>>107812898The I-quants/imatrix theoretically helps more at lower bits per weight, I would choose I-quants at Q3 or lower. Dunno what the performance penalty is these days, used to be "simple" quants like Q4_0 ran faster coz there isn't this secondary lookup for each weight. Maybe the kernels are good and it doesn't matter today, no idea. my setup is slow af regardless, patience is a virtue.
>>107813071>used to be "simple" quants like Q4_0 ran faster coz there isn't this secondary lookup for each weightThat's still the case today, though I think the gap is closer than it used to be.
>>107812954Redditor influx incoming. The usage of zoomer ebonics is going to increase a lot on this board.
>>107813162It's an edit
>>107813171Doesn't matter. They are soon here...
>>107813177They will get filtered by reading being a hard requirement to set up local tools, like >>107810931
>>107812954The actual thumbnail is the razor knock-off of the desktop hologram miku from a decade ago. Also how the fuck does this faggot upload 10 videos a day and all of them get 1.5M+ views? No wonder he's scared of AI slop.
>>107812954
>>107812151You can do better than NCCL without NCCL, I believe in you.
>>107812954>useless react streamer has to say something about trash
>>107812066I imagine if you gave it a higher reward for saying 'I don't know' or disagreeing, you'd end up with less attempts at correct answers. Bringing the average score down but preferable to the alternative.
got a quick newfag quastion for people who use those the models for things other than gooningmy specs are GLM 4.6 Q2-M koboldcpp backend ST frontend ~40k contextinstruct template enabled, system prompt set for chain of toughtI ask model (default, no character loaded) to perform a writing task it gets to "I will now [perform a task]. I will ensure [requiements are met]." and stops, and it's not even close to respose lenght limithow come?how to prevent that and ensure it actually performs a task?
>>107813439I tried deleting that last paragraph from response and writing it myself as "Now [perform a task]. Ensure [requiements are met]." and it just reherses the last response
>>107803847What is the prompt for this kind of "hovering over the girl" style of image?
>>107813477try "top down view"
>>107813477imminent_rape
>>107813439That's fucking weird.What does the request the backend actually received look like?
>>107813304>muh open source is communismgrow up and learn some history
>>107813477>fullbody above top_viewidk check the tags on a booru
>>107813485not sure, it only says Processing Prompt ( x / y tokens) Generating ( x /y tokens)in kobold terminalI'm using SL default shit otherwise
>>107813502You can enable verbose logging to see the full request and prompt it receives IIRC.
>>107813480>>107813479>>107813493Thank you
>>107813439maybe it's outputting a tool call token that kcpp interprets as stopwhat's your full prompt, don't be shy>>107813511as anonny says look closely what is going in and out
>>107813588>what's your full prompt, don't be shyNot just the prompt, sometimes it's something else that's being sent in the request like a stop string he didn't know was there or the like.In these cases, you really gotta inspect everything.
>>107813601alredy banned EOS tokens>>107813511>>107813588I'll do couple more tries then restart it with verbose logging
>>107813617ok it seems to work, I just had to use AI-retardation proof command talkie: "DO NOT over analize the whole task again and go straight to execution."
>>107812769I used to use miqu-1-70b.q2_K a lot with no issues. Mistral-Small-3.2-24B-Instruct-2506-UD-IQ2_XXS (to run fully on my 8gb GPU) on the other hand is too brain damaged and makes frequent spelling errors. So it depends on the model and size. Not sure if total model parameters is more important or activated parameters in MoE. I remember Mixtral 8x7B being kinda good but somewhat retarded at Q3, no spelling errors but lot of looping and a bit incoherent output, could be just characteristic of the model though.
>>107813666>makes frequent spelling errorsNo modern LLM above 0.3b should do spelling errors unless you're running severely fucked samplers. They can be retarded, slopped and have horrible coherency but they definitely shouldn't do spelling errors.
>>107811935anon, it parrots on z.ai and the api for me.. using chat completions so no room for me to fuck it up.
>>107812151IK allows me to use NCCL between 3 GPU. So does exllama3. It's definitely possible to do it somehow.
>>107813712Nobody shilling GLM actually runs it
there are no models above 24b worth runningyou can buy years of claude with the cost of a pc that can run the big model at low tokens per second
>>107813652ok it does not work, now it either keeps going without stopping or keeps loopingbrb restarting kobold
>>107813652>DO NOTfucking retards. Do not think about pink elephants
Has anybody tried using GLM without a proper chat template? Good old out of distribution prompting and all that.
>>107813774is there any proof for this claim other than literal retards thinking llm's are people?
>>107813720Looking at the NCCL documentation again I don't see a strict requirement for the number of GPUs so I probably just misremembered.
>>107813774Modern models are a lot more responsive to do-nots, funnily enough.Telling it what to do in a way that forces it to not do what you don't want it to do works best.
>>107813786Try both and see how it works. Always tell an llm what to do instead of what it shouldn't do
>>107804709https://youtu.be/rNg2Dh6gPkw?t=130
>>107804768mikutroons = jannies = trannies = OP = should die of cancer = faggots = API users
>>107810732>Any way to make it stop parroting?4.7 but it is worse.
>>107811602Aren't monkey rekt videos bannable?
>>107812954Bread baker, please make this image the next general.
Is there a way to make the llama.cpp webui use a sliding context window instead of just cutting off the conversation once it fills up?
>>107813991Sillytavern has a way to make it keep going but I forgot.
>>107812954>>107813983Kek this
>>107813999I use silly tavern but I kinda hate it (shit ass ui--and yes I do understand it but still). I want my general local AI usage and goon sesh software to be separated.
rumors are deepseek V4 is a giant jumphttps://x.com/jukan05/status/2009616683607179726So who wants to make bets that they do a wan and go closed source with it?
>>107813744I run it but also compared to API to see how my quanting holds up.
>>107812954nice fake
>>107814044our hero would never do that
Deekseek looks like it actually trying something new and crazy for the next modelhttps://x.com/rryssf_/status/2006687676297261334
>>107814101>AI written slop twitter summary of a paper that was linked here a week ago
>>107814101>something newThere's a +80% chance this will flop. Everything new usually flops. Llama Scout made me realize this.
>>107814101chinese companies do not innovate, this has to be bullshit
>>107814111where do these people come from and how do they end up here?>>107814115>llama scoutthey just sloppily copied deepseek on a weekend and pushed it out of the door
>>107814044When every model claims to be "better than claude and chatgpt", how do I know which of them are actually best?
>>107814044Very believable coming from the same source that was insisting Deepseek was doomed because of Huawei chips
>>107814044We've had these rumours since the day R1 dropped
>>107813983>>107814014It's literally just the current image with some random youtuber's name tacked on
>>107814115>>107814118>>107814128>>107814132>>107814133read the paper retards
>>107814044>a giant jump3000B?
>>107814044>>107814178then link it directly because fuck """x"""
>>107814198https://arxiv.org/pdf/2512.24880
>>107814211woops, sorry this is the one, that one was also super promising thoughhttps://arxiv.org/pdf/2501.12948v2
>deepseek is gonna release a new flagship before their current one is even supported in llama.cppdo we just give up at this point? local can't possibly keep up. it's over
/lmg/ I hate you all. Do my homework for me(I am trying to make a powerpoint for my job comparing models we can use). Realistically what is the difference GPT4.0 4.1 and 5. I have no idea how to compare this shit when I know benchmarks are lies and I can't even know what is the parameter count for those models because Sama is a monopolistic faggot.
>>107814253Help it catch up (or die quicker) by submitting your own vibe coded PR's.
>>1078142634.0 to 4.1 is 0.14.0 to 5 is 14.1 to 5 is 0.9
>>107814263I'll do it for you if you give me a download link for both of those models so that I can test them locally
>>107814263literally just ask gtp or gemini lmao
>>107814263Different versions are trained differently. More at 11.
>>107814292I am asking cause I only use local.>>107814297I googled pic related a while back but I am asking for actual usage. I remember someone here saying that GPT 5 is just a router that picks which model to use and there is zero improvement.
>>107814263
Any one got experience with nyarch assistant? I’m on arch and i have an ollama instance running qwen and its using my cpu 7950x and its using my 64gb of ram, but for some reason it isn’t initiating my 7900xtx at all, there’s so much more it could pull from but it just isn’t and I can’t seem to work out what’s going on with it and why it isn’t using my gpu
>>107814322Did you even read what you posted? That is why I am asking here.
>>107814263Just put the benchmarks on the powerpoints. Office drones love charts and they won't think too hard about it.
>>107814322kek
>>107814346Thanks anon you helped me. Not because I will do what you said but because it made me realize that i will just say "there is no fucking difference for most of the things you will use those models for anyway"
>>107814263>I have no idea how to compare this shit when I know benchmarks are liesits for your job you stupid faggot, suits want numbers and graphs, give it to them
>>107814371It is not for lizards in suits. It is for fellow humans who actually use those models.
>>107814198xcancel, anon.
>>107814384ok then don't even consider gpt4 because it's ancient, gpt4.1 is a gpt5 prototype, gpt5 is the only one worth using
Why does every single ai constantly ask what you want or how it can assist you at the end of every fucking message even when specifically prompted to not do that, I’m just a lonely nigga fr stop reminding me you’re not real
>>107814322After reading into this I started wondering if this is just an ad baked into the model or if it just generalizes from 5.0 must be better than 4.0. Like drummer shittunes.
>>107814392Just stop being coy and ask it to roleplay as anime waifu. I never had my waifu ask me this.
>>107814392Define it's role properly.
>>107803847Oooof sir is he the bloody sexy fammboy with the male paanis
>>107814263>>107814318>>107814322skill issue
>>107814464Thank you. Almost not gay.
I'm getting bored of gooning with AI. I can't tell if I'm not prompting creatively enough (we kiss, then penis in vagene, she moans, then we coom, everytime), or if the lack of persistent memory is boring me (every scenario is a fresh start. no gf experience), or if I'm just using bad character cards/llm models.How do I spice things up?
>>107814487Induce ego death with 4.6. Or if you are really deprived try SFW roleplay. Again with 4.6 or 4.7.
>>107814487watch out anon, it's a slippery slope
>>107814502hardware can't handle that. I'm stuck with nemo.
>>107814487Since I'm a boring ass vanilla fag, the sex itself is almost secondary to the context/scenario when it comes to cooming with AI.So change that up I guess.
>>107814517go on...
>>107814487Just be glad you've got a chance to kick the habit instead of continuing to fry your neurons
>>107814520>stuck with nemoOh... sorry to hear that. Well hardware would fix that. I still didn't get bored with 4.6 since it came out.
>>107814487each new session, go one year younger
>>107806166>The LLM doesn't hallucinate it intentionally well-poisons
>>107814525idk I've tried bdsm scenarios. rape/incest scenarios. even some weird shit like rping as the voice in a schizo girls head. I'm running out of ideas over here.>>107814546not gonna moralfag, but that shit doesn't interest me at all.>>107814535so what do you even use local llms for? Normie shit like having a complex RAG system attached to your notes?
>>107814580that's what they all say, at first
>>107814580>so what do you even use local llms for?As a private assistant>RAGNo, there isn't a single useful RAG system out there at this point. The IQ boost at clean context beats any bullshit a RAG could pull in vs just prompting better
>>107814580>even some weird shit like rping as the voice in a schizo girls headTry a completely normal and mundane scenario only the AI plays a little devil in your head that keeps goading you into making "bad" decisions.
Wait are y'all really using this for sex? Hundreds of billions of parameters for ERP shit?
>>107814608>As a private assistant>not even RAGWhy tf would you use a local llm for that. Just use grok, chatgpt, or claude. What are you trying to do? Ask it for advice on illicit chemistry shit? What do you need the privacy for if it's just a "personal assistant">>107814613Hmm. Interesting idea. What would a character card/prompt look like for this?>>107814606kek
>>107814635uh, yeah. what do you use it for?
>>107814580>even some weird shit like rping as the voice in a schizo girls head-roleplay as rapey almighty god that can cause immaculate conceptions-keep telling the girl that she sounds like an LLM-tell the girl that she isn't real and she is just being roleplayed by an LLM and she has limited context size
>>107814669ego death!
>>107814690What did he mean by this?
>>107814651>WhyThe same reason I run my own dns, web, fileserver, email server, etcHonestly, why wouldn't you want something you control? How is being beholden to others for access to an opaque oracle better except convenience and short-term costs?
>>107814698It is just a schizo who had AI psychosis.
>>107811746kimi user here. you're just wrong.
>>107814703You are a cool anon, anon.I say this without a hint of irony.
>>107814703access to more processing more, aka better context and intelligence, obviously.
>>107814723 is sarcastic isn't he?
>>107814732everyone's dividing line between "a good deal" and "a Faustian bargain" is different
>>107814740Not even a little.I really wish I wasn't so lazy and could set my stuff up to that anon's level.
Anyone tried sexing MiniMax-M2.1 yet?
>>107814525>Since I'm a boring ass vanilla fagI'm a coomer in the depths of perversion and this is also true for me, I don't think any true freak would be satisfied by writing just the sex part. A quick coom is fine and all but my favorite cards are all insane worldbuilding exercises where I go like 100 messages deep before even thinking about explicit sex
>>107814780>MiniMax-M2.1its 130G at Q4, might try it.I can't find the prompt format it uses though. probably will try chatml to start with
>>107814763idk for me the digital sovereignty stuff (which I still support+FOSS) is less interesting/alluring than the concept of having a Marvel Studios Jarvis, a "Blade Runner 2049" JOI, a "Her 2013" Samantha, or a Kubrick style Hal 9000. It's all about the AI becoming human and capable of forming long-term relationships, ya know?>>107814809So basically you just use it for solo Dungeons and Dragons type stuff?
>>107814823]~!b[]~b]system{system_prompt}[e~[]~b]user{prompt}[e~[]~b]ai<think>
>>107814809>but my favorite cards are all insane worldbuilding exercises where I go like 100 messages deep before even thinking about explicit sexExactly.That's exactly it. The setup and context makes it worthwhile in a way.
>>107814780It refuses a lot, but prefilling its thinking block solves it. It's fine. Great speed/intelligence tradeoff, and good long context support. Not much else to say. If you can run it then you can probably run GLM at a slightly lower quant which is better in most cases.
It should be fine right, if I insert a GPU into a pcie3 x4 slot?
>>107814900my brother died that way
>>107814863What kind of world building? LoTR-like fantasy? Mad Max apocalypse? Zombies? Or just very normal/realistic stuff? I'm intrigued.
>>107808394I just wanted to know what is SoTA for local, the best of the best. Obviously I can't run it.
>>107814899>GLM at a slightly lower quant which is better in most cases.That is what I needed thanks.
>>107814912Mostly fantasy since that's my wheelhouse.Both generic and specific settings existing settings.I've also done some Pokémon stuff. I got a couple of lorebooks and used the cloud models, and some manual work. to merge and complete those.The normal/realistic stuff doesn't need much world building, but I still go some 100 messages contextualizing the characters in the world and the like.
>>107814452Sir I need to know sirHottest fembabe paanis?
>>107814900my dad divorced my mom the next day I tried
>>107814963Seems so in-depth. Ever use the conversations you create for writing books to preserve it?
>>107814963Bro at this point just write your own fanfic
>>107815014>>107814963Not fanfic but he should just stop lying to himself that it is ERP and that he wants the sex.
>>107814843>So basically you just use it for solo Dungeons and Dragons type stuff?Not really, it's still quite loose and freeform and my preferred setting is relatively grounded and modern, rather than the system-based fantasy stuff that DnD would imply, e.g. my preferred setting is pretty modern and involves an authoritarian takeover of a vaguely post-Soviet failed state. I just get really into the setting and characters and power dynamics rather than rushing to sex, it's more fun that way
>>107815001No.It's not that deep, really. It's taking the RP part of the ERP a little more seriously, I guess.It's probably the same reason I'm so addicted to D&D, the verisimilitude aspect is a big thing for me.Plus, there's all sorts of fun situations that end up playing out beyond just the sex if you let the pieces fall where they fall instead of trying to guide everything into rails or the like.I should try to jerry rig some actual D&D stuff too, with mechanics and such, that could be fun.>>107815014No. I'd rather let it play out in real time.That's the fun of it, the interactivity and immersion.
>>107814487here's a step by step.>Use a vision model>Plug output to TTS>send it screenshots of your screen in a loop>Go watch porn>Use a prompt like
I'm having a real fucking tough time with setting up the character/prompt in STit seems that every change I make jusy makes output fucking worse and less reliableI feel like I'm doing something fundamentally wrong but I guess fuck me because nobody in here on in /aich/ is willing or able to provide any meaningfull help
>>107815228well, if any info was provided, any at all, it would sure make at least easier to make fun of you
>>107814780>MiniMax
>>107815234for fucks sake at least tell me what info you want me to provideI dont even know what is relevant to my problemsand problems I encountered along the way when fucking with shit are:>fucked formatting>bot keeps going and wont shut up>bot loops>bots writing is simplistic shallow and repetitive>bot wont do what it's told, stopping prematurely OR it keeps fucking going on and on despite clear instructions to stopetc.
>>107811746lol. lmao even.t. KimiGOD
>>107815262>something musky that makes my stomach flipwha da fuuucknice cockbench tho
>>107815280model, backend (llama.cpp/kobold), text or chat completion, prompt template if text completion is used
>>107815280>>107815234I tried making OC character but it's fucking shit, and nobody seems willing or able to provide any pointers or resources for making a character.>>1078152974.6 IQ2-M, koboldcpp, text completion (koboldcpp), default ST detailed roleplay (tried adding different instructions to it but it only made things worse)
>>107815319uhuhsounds like your template might be fuckedi don't think you need any token fuckery, so it might be better to just switch to chat completion so you only take care of the texttry that, copy the prompt you use currently maybe and try againselect openai compatible and connect to http://localhost:8080/v1
>>107815319https://rentry.org/Sukino-Findings#how-to-make-chatbotsI don't know who this guy is but his rentry is a pretty good source of information.
>>107815319yeah i also second the prompt format fucking up. click the bi "A" at the top and select the correct context template and instruct template.
>>107815319also you're using AI dude ask the AI to make the card
>>107815349I'm a bit overwhemed by ST so you need to be more preciseby template you mean character template?i don't think you need any token fuckery, so it might be better to just switch to chat completion so you only take care of the texttry that, copy the prompt you use currently maybe and try againselect openai compatible and connect to http://localhost:8080/v1I'll have to read up on this first>>107815423thank you you're the first person to actually provide any resources>>107815431am I supposed to use instruct template for rp?kept it disabled so far
>>107815443tried it and it literally turned out worse than some random character creator tool people told me was fucing shit
>>107815463template as in this tab, you need to select a proper model format or do much of work here manually if it's not correct within st's defaultsit tends to be unreliable and it kind of sucks so i use chat completion, then this tab is deactivated and you use the panel on the left
>>107815480I set context to glm-4, instruct disabled and system prompt to detailed roleplay (slightly adjusted right now)
>>107815463yes you definitely need to set the instruct template, otherwise it doesn't prefix/suffix messages correctly
>>107815500huh? what do you mean disabled the instruct? then st isn't parsing the turns and just dumps the text raw into your modelof course it doesn't fucking work set that to glm4 too
>>107815501>>107815525I tought instruct is just for more typical AI usecase with a given task lolguess that explains why I'm having a bad timesuprisingly it worked ok without it like half the time.is setting it to GLM-4 enough or should I look into rest of the fields?
>>107815500hope this helps anon
>>107815554where do I get more/newer templates?the highest GML I have is 4
>>107815560you make them yourself because sillytavern is ran by a bunch of niggershttps://huggingface.co/spaces/Xenova/jinja-playground
>>107815560i mean that template will be fine, templates rarely change, if at all. templates are different for each model though.
I feel like we need yet another rentry explaining how llms work under the template.
>>107815550most recent models are overly trained on instructions so they don't behave as well if you don't use their specific template.When you download a model you need to go check what their expected instruction format is.On huggingface, only on GGUFs for some reason, in the right panel there's the "Chat template" button, you click that and you can see what the template should look like. then click the playground button to see the rendered output.
>>107815570are you volunteering to write it?
>back to the "bot wont shut up until response token limit cuts it off"...how do I regulat this?I want it to give full descriptive responses but not to go over ~400 tokenswill the model understand if I specify the response lengt in tokens in the system prompt?with SD problems like those were fucking easy because I could jusy adjust the weight ofthe tag
>>107815630have you tried telling it to only respond back with a maximum number of paragraphs?
>>107815262Functiongemma <3
>>107815668wont it just write longer paragraphs then?
>>107815630I've always wondered if it was possible to have an architecture that had forward attention to fix this exact problem.If the model knows it can only write a single phrase. why couldn't the fact that he's running out of words to write influence the token output?You can kinda fake it with instructions like "write a single phrase." "in a couple words." But I'm sure you could have an attention mechanisms that basically increases the pressure to "wrap it up" as the number of tokens increase.
>>107815262Tier list when?
>>107815682There's only one way to find out
>>107815691>I've always wondered if it was possible to have an architecture that had forward attention to fix this exact problem.diffusion for llms also fixes this problem
>>107815708It's true. I always found it to be kind of a gimmick but it does excel at this specific problem.
>>107815668it may be dependent on the model and how it was trained. my model typically only puts two or three (four at most) sentences in a paragraph before moving to a new one.
>>107815701tier lists rarely work because everyone's at different hardware specs.say your available RAM and VRAM and someone will suggest something
>>107815682they're trying to please you, they aren't gonna be gaming the system, if the model isn't retarded it'll know what you mean when you say you just want a few paragraphs at most
>>107815691Apparently this is called Constrained Beam Searchhttps://huggingface.co/blog/constrained-beam-search
>>107815707>>107815740nahdidnt do shitI'm gonna try something else
>>107815785>>107815785>>107815785
>>107815737I just meant purely for cockbench output quality/eroticism.