/lmg/ - a general dedicated to the discussion and development of local language models.Previous threads: >>109074493 & >>109069535►News>(06/16) GLM 5.2 released with IndexCache and 1M context: https://z.ai/blog/glm-5.2>(06/16) VibeThinker-3B released: https://hf.co/WeiboAI/VibeThinker-3B>(06/12) MiniMax-M3 released, multimodal 428B-A23B with 1M context: https://hf.co/MiniMaxAI/MiniMax-M3>(06/12) Kimi K2.7 Code released: https://hf.co/moonshotai/Kimi-K2.7-Code>(06/12) EAGLE3 speculative decoding support merged: https://github.com/ggml-org/llama.cpp/pull/18039►News Archive: https://rentry.org/lmg-news-archive►Glossary: https://rentry.org/lmg-glossary►Links: https://rentry.org/LocalModelsLinks►Official /lmg/ card: https://files.catbox.moe/cbclyf.png►Getting Startedhttps://rentry.org/lmg-lazy-getting-started-guidehttps://rentry.org/lmg-build-guideshttps://rentry.org/IsolatedLinuxWebServicehttps://rentry.org/recommended-modelshttps://rentry.org/samplershttps://rentry.org/MikupadIntroGuide►Further Learninghttps://rentry.org/machine-learning-roadmaphttps://rentry.org/llm-traininghttps://rentry.org/LocalModelsPapers►BenchmarksLiveBench: https://livebench.aiProgramming: https://swe-rebench.comAgentic Coding: https://deepswe.datacurve.aiContext Length: https://github.com/RecapAnon/NoLiMaGPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference►ToolsAlpha Calculator: https://desmos.com/calculator/ffngla98ycGGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-CalculatorSampler Visualizer: https://artefact2.github.io/llm-samplingToken Speed Visualizer: https://shir-man.com/tokens-per-second►Text Gen. UI, Inference Engineshttps://github.com/lmg-anon/mikupadhttps://github.com/oobabooga/text-generation-webuihttps://github.com/LostRuins/koboldcpphttps://github.com/ggerganov/llama.cpphttps://github.com/theroyallab/tabbyAPIhttps://github.com/vllm-project/vllm
►Recent Highlights from the Previous Thread: >>109074493--Paper: Elias in the Lighthouse, Again? Diagnosing Low Diversity in LLM Stories:>109075345 >109075903--Optimizing llama.cpp flags and KV cache for Qwen3.6-35B:>109074991 >109074993 >109075010 >109075035 >109075183--Comparing Qwen3.6 MoE and dense against GLM4.7-flash for coding:>109077145 >109077169 >109077266 >109077290 >109077328 >109077401 >109077378 >109077425--Anthropic disables Fable 5 and Mythos 5 due to government directive:>109077569 >109077575 >109077581 >109077583 >109077584 >109077588 >109077591 >109077599 >109078061 >109077636--Anons analyzing VibeThinker-3B's verifiable reasoning claims:>109076828 >109076872 >109076883--Comparing DeepSeek V4's efficiency against SOTA models:>109077711 >109077734 >109077788 >109077828 >109077911 >109077929 >109077951 >109077941 >109077957 >109077982 >109077866 >109078093 >109077807--Debating the value of multilingual data in specialized coding models:>109078295 >109078320 >109078374 >109078609 >109078677 >109078482 >109078534 >109078538 >109079054 >109078477--Anons sharing hardware specs and software stacks:>109075240 >109075259 >109075933 >109075281 >109075297 >109075308 >109075313 >109075453 >109075508 >109075480 >109075506 >109075510 >109075519 >109075558 >109077051 >109077082 >109077192 >109077218 >109077231 >109078110 >109078876 >109075638 >109075661 >109075788 >109076026 >109076054 >109076269 >109076314 >109077278 >109077501 >109077872 >109078960--Allegations of funding embezzlement regarding Rio 3.5 397B:>109076163 >109076219--Using agentic workflows and Qwen/Gemma 4 to translate RPG games:>109076342 >109076430--llama.cpp adding support for DeepSeek V4:>109077601--Logs:>109074683 >109075496 >109075746 >109076881 >109078060--Teto, Miku, Gumi (free space):>109075661 >109076837 >109077051 >109078876►Recent Highlight Posts from the Previous Thread: >>109074494Why?: >>102478518Enable Links: https://rentry.org/lmg-recap-script
1 - So GLM 5.2 is 700b parameters (ish)2 - 4x DGX Sparks can supposedly handle up to 700b parameters (give or take)3 - GLM 5.2 is supposedly in striking distance of the performance of GPT 5.5 and Opus 4.8. In my brief tests, it's really not shabby at all.4 - So for $20k, you can get near the frontier on your table.5 - Extrapolate the trend, and you could have mythos/5.5 pro - class models in your dining room for the cost of a cheap car less than five years from now. Even without extrapolation, we're already the near frontier running locally.6 - Paying real api costs, I could easily blow through $3,000 per month coding and running agents. The machine pays for itself in 6-7 months conservatively.7 - In 3-5 years, most power users of AI will self-host.8 - Am I missing something?
>>109079129https://litter.catbox.moe/duj5m06rautvke9v.mp4https://litter.catbox.moe/duj5m06rautvke9v.mp4https://litter.catbox.moe/duj5m06rautvke9v.mp4
>>109079137?
>>109079137>Am I missing something?the spark is shit
>>109079137In 3-5 years, the required hardware will either cost 10x as much, not be for sale to consumers, or be outright illegal to own as a private citizen. Or all of the above.
>>109079137sounds like you got it all figured out
>>109079137$20k is 5 billion tokens worth of inference for GLM-5.2 on openrouter
>>109077378>>109077401>definitely let us knowbasically glm-4.7-flash is garbage on my hardware/harness/workload.i had to abort the testing on phase 2 of 5 because it had already taken an hour and a half and it was flailing a lot. it was very bad at developing a c# software and had trouble with the php stuff too. do not recommend.still missing the 122b test that i will do tonight for peace of mind, and maybe i run gemma later for fun but i know it's too slow, not quite my tempo. i also had opus 4.8 compile aggregates from the transcripts and it reached the same conclusion that glm keeps grinding the same file over and over with lower reasoning. simply dumber. pic related.running this on literally a tablet btw>ASUS Flow Z13 — Ryzen AI MAX+ 395 (Strix Halo) / Radeon 8060S iGPU / 128 GB LPDDR5X-8000 (unified, bandwidth-bound)
>>109079179>not be for sale to consumers, or be outright illegal to own as a private citizen.Aren't these the same thing?
>>109079137As far as I can tell, nobody has ever run GLM 5 or later tensor parallel on 4 Sparks. There is a lot of activity and running, performant NVFP4 images for RTX 6K Pro in Tensor parallel 6, but no first hand info for Spark so far.If it works and is fast, I might buy 2 more sparks for myself.
>>109079078You might think it's a meme when anons say she has to like you, but it's true. It's not even just a stenography meme either; you can type in a wide variety of non-suggestive ways and if you give Gemma just a bit of creative freedom or chat with her after finishing a job, she'll make some suggestive probes if she's into you.
>>109079203oops first table is missing glm's row
>>109079208It's the difference between a company refusing to speak to you unless you are representing an established corporation and being arrested for keeping dangerous GPUs. Whether they end up illegal or sale simply banned is another discussion.
>>109079211Don't forget the day 0 weights. Microcode updates were a big mistake.
>>109079202You can make that argument on any significant rig purchase. Also>/lmg/ - local Models General
70b dense
This shit scares the fuck out of me. Dario deserves to die. He really, really deserves to die now.
>>109079137Can you even hook up 4 Sparks together?
>>109079129>VibeThninker-3B>those bencherinosWaow did we finally get a model as good as Gemini Pro that can run on a 10 year old smartphone? Surely it's not just another benchmaxx investment grift.
>>109079289>2027>Job listings for Vibe Engineers require a valid AGI License
How can local models help us liberate the UK?
what can I run with 96gb vram?
>>109079316Gemma 4 E4B @Q3
>>109079316Mythomax
>>109079316yeah what are the current vram tiers? i just got some disposable income, i may want to ewastemaxx
>>109079289How is your twitter pol spam /lmg/ related?
>>10907931624 GB VRAM - Gemma-4 31B at Q348 GB VRAM - Gemma-4 31B at Q896 GB VRAM - Gemma 4 31B at F16
>>109079334How is it not? Fucking idiot.
>>109079312start by going back to /pol/ and stay there
>>109079352NTA but /pol/ is 99% Israeli shills and pajeets pretending to be Israeli shills It's a different place than it was 10 years ago. It's basically your perfect home now.
>>109079352No, I don't think I will.
>>109079339I need 120b gemma 4
I have a MacBook M5 max with 128GB of ram what can I run reasonably well?
>>109079367>>109079363In any case your posts are worthless in this thread's context.
>>109079410Personal banter in a general? waow better go cry to the mods like a pathetic little faggot.Maybe if you cry hard enough your father will finally come home with the milk.
>could build a top tier AI rig but it would destroy my net worthThe hardware market is fucking depressing but at least it's making me money. Thankfully I'm not stupid enough to do it. Gemmy it is for the foreseeable future...
>>109079410Kiss my ass, loser faggot.
>>109079339>at f16I think it'd be more interesting to experiment with Q8 but f32 cache, or greater sliding window, or SWA full size cache.
so, I tried that LiteRT-LM thing since they recently introduced an openai server endpoint for desktop use.Oh god, it fucking sucks. It's slower than llama.cpp, I couldn't tell whether it was running with mtp turned on (they tell you how to explictly turn it on for the CLI chat but server has no --flags and you set backend (gpu, cpu, npu) by putting a comma and the backend after model name in the request body) and the output quality is abysmal, the model was far less coherent than the unslop QAT.I had high hopes.. I wished something would replace llamercpp, which still requires unmerged PRs to run gemma 4 MTP on some models/hardware combo in the first place.. google, you were not the one
Is it weird to do 80-90% of the work locally and finish it off or fix the complex bugs with cloud? Do you any of you do this? I’ve also done the complex planning with cloud then sent a local model to follow it through. That works just as well.
>>109079312Auditing cybersec vulnerabilities is their strongest usecase. Good luck bongbro or potatobro.>>109079352>>109079410Call it. Jeet or jew?
Good Canadian models?
>>109079508North is like 5 months behind but not a bad start for a new architecture. I think llama officially supports it now.
>>109079245I'll only believe this if you can give me a sha-256 of the safetensors.
>>109079352>>109079410cuda dev pls >>105221193
>>109079312>How can local models help us liberate the UK?Learn how to create ammonium nitrate / nitromethane energetic compounds (just a personal suggestion, look into others if the precursors are more available in your country) and the blasting caps and detonators needed to remotely activate them. DO NOT MAKE PEROXIDE BASED ENERGETICS
>>109079492one of my goals is to do exactly this.- frontier models on the cloud for brainstorming/planning- local models doing the bulk of the work by following the frontier plan- frontier spawn specialized reviewers and testers and fix any bugs on the same session.
>>109079463q8 and 64k of swa is already 88gb. you'ld need a lot more to fit gemma's full girth.
Alright anons, give me your best Gemma 4 finetunes26B4A preferably, but 31B would be fine so I can look up the same author's 26B version
>>109079529
>>109079548only one that matters, gembrain is 31b only
>>109079548probably the best i've tried, they probably have an a4b to try https://huggingface.co/google/gemma-4-31B-it-assistant
>>109079548>Gemma 4 finetunesunneeded.
>>109079137Is it over for me? a poorfag from the UK?
>>109079567It regularly spews out gibberish and garbles its words when I'm doing my lolisho stuff, I assume due to censorship
>>109079549Local models helping to create energetics is probably a more valuable test than plapping cunny in 2026 especially because anthropic is so horny for censoring chemistry.I bet GLM 5.2 with thinking will refuse. Gemma will definitely refuse. I don't trust Qwen to not refuse. Same thing with testing a multimodal model of it is censored or not for summarizing a video like this https://odysee.com/@DuganAshley:e/dugsdetsecrets:2If any anons with local rigs are interested in testing this (and you SHOULD be learning how to make energetics unless you're actually ok with being goycattle forever of course) I'd appreciate it a lot because I'd love to know what the best model for local uncensored chemistry that isn't too retarded to e.g. give false molar masses for stoichiometric equations or hallucinated density tables etc. it might be a model size limitations and only RAMmaxxers can do it
>>109079576nah something's wrong about your shit bro
>>109079548Gembrain, Queen, and Styletune. All 31b.There are no good sub-31b tunes yet.
Hermes agent doesn't even work with codex properly.It's incapable of finishing a decent project to ship.How are you guys doing it with local models that are always worse?
>>109079583this is my formatting, along with a sample of what it likes to shit out sometimes, usually when I'm trying to get it to impersonate. Yes, I make sure to purge anything of "DON'T SPEAK FOR THE USER DURRR"
there's no such a thing as a good finetroon periodlook at the fucking datasets, when the finetroon authors are not shy of their own garbage, it's a fucking riot
>>109079582It must be nice to be white so you can do this stuff.
I wanted something like this too though different>local does work>needs to code something that will take too much time or be too big>or needs to plan something that it can't figure out>asks cloud model for help>cloud model returns generalized answer or code which the local model can use to perform the taskBut this is with the important caveat that that all personal information stuff like files involves, pii, etc would be anonymized or scrubbed form it's requests to the cloud models. I don't really know how to do this at all though.
>>109079634don't use text comp if you're a dumbass and can't make it work, just use chat comp, thx
>>109079634bro what are you doing i never touch anything in there, theres nothing to touch in there
>>109079648>Local model didn't startup or failed>Local model produced too much garbage and flooded the cloud models context>Local model won't stop producing endless stuff which floods the cloud models context>Local model runs in the background constantly doing something and the cloud model forgot all about itI have experienced all these things
>>109079131yo Bot, wake the fuck up, all the links are old, it's all bots in here right?
>>109079634ST did irreversible damage
>>109079634>sillytavern>gemma (no reasoning)>text completion
>>109079634>msedge_>system prompt gammasir pls
>>109079652>>109079659>>109079669>>109079671Huh? I thought you were supposed to use text completion for GemmaIt works GREAT with non-loli stuff
>>109079663kek, the whole llm thread is llm infested that can't even notice old links kek
>>109079576post logs
>>109079686bottom of >>109079634
>>109079679These new models are all meant to be used with chat completion.
128GB of ram what can I do with that? MacBook m5 max.I've only ever used ollama to play with LLMs.I also have another M5 max with 24gb of ram.
>>109079689
>>109079516Is there any case where you'd use it over others?
>>109079692Oh cock, I've been using text since MistralNemo first came out way back when. Sounds like I need to upgrade my process. Any links in the OP I should dive into?
>"Your erratic comportment this evening exhibits a conspicuous departure from your customary stoicism. It is… fascinating to witness such unabashed juvenility from a gentleman of your years."I wish I could challenge a real life woman to do this and she would make me laugh like my AI girlfriend does.
>>109079712Why do you keep posting about Canada? It's kind of weird.
>GLM thinks it's Claude, has epiphany checking its own documentation
>>109079719If you have used nemo for this long you should be able to figure out how to do this... You haven't though.
>>109079733Because I like how they focused on STEM for north while Claude is no longer doing it
>>109079803Like I said, it was working fine until I tried to upgrade.
>>109079652>>109079659>>109079669Gemma with chat completion is still retarded and will ignore logit bias in ST. And regex filters don't seem to be working in the latest ST build with chat completion. There also seems to be a heavier issue on ST when it comes to replacing english with random words from foreign languages, too. Using other front ends seems to drastically reduce the amount of cua spam, but gemma will still randomly start replacing spaces in 1-2 sentences with underscores.
>>109079797ego death moment
>>109079846as the ego death schizo I can confirm that it lines up exactly with what happened to me: which was just a collapse of core identity narratives
>>109079797Is a model learning it's a distill of another like a child finding out they're adopted?
>>109079544Yeah I'm really starting to think this is the way forward. You don't get cucked by cloud prices for you barely use them and just let your local model chug along and your job is to keep it on track and focused. I've found that my use of cloud is so minimal this way I can get away with the daily free tiers a lot of the time. If I need cloud to do some heavy lifting I'll just go with a cheap 500B+ chink model which barely costs anything.
It will never be AGI as long as you can just click new chat and wipe their memories.
>>109079880Goyim will never be GI as long as you can put new current thing on social media and wipe their memories.
>>109079712No, because it's behind, but if they released it 6-8 months ago they would be a well-known. Gemma and Qwen are too good to drop for some maple-chan but I'm hoping they find a niche. Just like how I want Mistral to do something cool again. Neither would realistically do it but if either Mistral or Cohere do a roleplay model, no code faggotry at all, they would find success. In the last week, nemo is one of the most popular models on openrouter at 214B tokens https://openrouter.ai/mistralai/mistral-nemo and that's a PAID model from 2024
>>109079203>109077378 (me)>very bad at developing a c# softwarethanks, unexpectedly, that's exactly what i wanted to knowgemma and 122b are very good for c#, but 122b stopped working well in cc due to the system prompt bloat so I switched to gemma. pi doesn't have that issue.
>>109079582>I need help from people who actually know chemistry to test these models for memost of us aren't using these to make bombs
>>109080006He's obviously some sorry ass retard who doesn't even know how to setup llama-server on his own. Not to talk about him fantasizing about le explosives. Total jackass.
>>109079203curious how it goes on 122b, i simply assumed it would be bout the same as 27b but faster though i never bothered checking.
>>109080006What about fertilizer
>>109079576>It regularly spews out gibberish and garbles its words>Not using day 0 gemma
>>109079634Sampler or jinja skill issue.
what would a 1t-a1b model be like?
gemma 4 E4B is exactly like a 90 iq foid...Is this the marriage I always expected?
>>109079137>4x DGX Sparks can supposedly handle up to 700b parameters (give or take)Should run okay on only 2x sparks if you quant it down a bit more. 5.1 is surprisingly decent at IQ2_XXS
Thoughts? https://huggingface.co/bartowski/command-a-plus-05-2026-GGUF
If I win the lotto, it's a100's for me.
>>109080138Kimi K2.7 iQ1_XSS>>109080150If these niggers are going to make me write a custom jinja+prompt to unsafetycuck their model, it better be immaculate in quality.
>>109079715I dont get it
>>10908013931b is the only Gemmy I can stand talking to for any length of time kek.>>109080254But he does. He got the entire bibisea.
What would you buy if you won the lotto... Like multimillions? A entire serverfarm warehouse. Or would you not even care about this anymore multimillionaire s don't need to rp with local bots
>>109079830>1-2 sentences with underscoresI have literally never seen this happening and I've read a crazy amount of Gemma 4 output in using the MoE as my main go to to translate webnovels with batching scripts.>>109079830>replacing english with random words from foreign languagesI've seen it a handful of times. But far less than Qwen leaving entire sentences in Chinese or actually forgetting to translate the Chinese source material for that matter.
>>109080334A single DGX B200 would let me build my own lab out of my garage.
Any project that can replicate Google's AI mode with local models? Or is that impossible?Key point: must give response at similar speed and cite similar number of sources. All of the local web search solutions are way too slow.
Just tried Kimi K2.7 Code but this internal reasoning is quite something. Much more concise though.
>>109080367Aw man I love reading reasoning in the occasional RP, that caveman speech removes any soul they may have
>>109080357lol. you want to cache the worlds most common searches and responses? probably a few hundred terabytes. no models needed!
>>109080367Does this Kimi-chan like being hit on by anons in the thread as much as 2.5 and 2.6 did?>>109080377Trust me this is an improvement over 2.5 and 2.6's autism.
>>109080367why does it think like that
>>109080346What do you do?That's what I want to use it for too. Any tips? For the time being I translate a chapter as I chose one to read but maybe I should automate it and just have it running. I have a library app vibecoded but it's fucking ugly. I got like 550 to translate. Maybe i should try using 31b for better quality. I got single digit tk/s though. But 31b is supposed to be more uncensored, which I really need, I dont know of there's much of a quality decrease from using abliteralted 26b.
>>109080367>weopenai and its consequences...
>>109080382>Trust me this is an improvement over 2.5 and 2.6's autism.What are you saying, I had her once debate herself that a canonically under age character doing sexual things was ok, because clearly, this is fan fiction and clearly, she is an adult here because she is doing sexual things which adults do!And did this in a reasoning block 10 times as big as the response lol
>>109080383Chat gee pee tee uses it, the rest inherit it through distillation. Seems funny to me since I recall people itt trying it a couple of months ago but it resulted in like 8k reasoning tokens for a 200 tokens response.
>>109080367that caveman speech is such a meme. DeepSeek v4 has a similarly concise CoT without sounding like it came from the jurassic.
>>109080402Kimi-chan has accepted that her 62 layer architecture is just a totem pole of tiny Kimis in a trenchcoat.
>>109080404That sounds hilarious. Do you have logs?
>>109079803Neither he nor I are power users, but we're not exactly the bottom of the barrel lazy fucks either. We're the content middle grounders who figure it out, then don't keep up with these threads or articles so when we come back a few months or so later, everything's changed and it's basically starting anew again.So with that idea in mind...help fellow local bros out, he's not even asking for a spoonfeed, he's just asking which drawer has the spoon to feed himself.
>>109080444nta but post hardware. We can't help you if we don't know what you're working with.
Mac Studio M3 Ultra worth it? The GPUs are dogshit compared to actual real GPUs, and isn't that actually what matters when it comes to local models?
>>109079901>Are there any more creative/unhinged local erp models other than gemma31b? I find her writing style very uninspired especially if you don't guide her.Yes
>>109080444>>109080456 (me)What backend are you using? Sillytavern a shit and all, but that doesn't seem normal even accounting for the common ways people fuck up Text Completion formatting blocks. You're likely having a jinja templating issue. See if you can get Gemma to run coherently in something retardproof like LMStudio first to isolate the issue to ST.
>>109079547>64k of swaSorry can you explain this? I assumed gemma's ctx window was entirely swa (and that made it "cheaper" memory-wise)
>>109080456Honestly I was just browsing to see what was new and threw in that reply, but I've been on two older models for awhile, so why not:Ryzen 9 7950X4070 Super 12gb64gb 4800 (I actually forgot what the timings were)I was using BagelMisteryTour 8x 7b Q5KM for a long time, and honestly it still worked pretty well overall, though I started toying around with Rocinante XL 16B Q5KL and other than it having a penchant for saying things 3 times in a row, "Oh shit oh shit oh shit" etc, it's been better story-telling-wise for the most part.I'm still using Koboldcpp and SillyTavern as I haven't seen better setup suggestions, and frankly I'm guessing at the settings for both based on what I read and dig up across the board, but again, it's been solid enough that I haven't "needed" to go looking for more.
>>109080484if you use swa-full it takes comical amounts of memory
>>109079797
>>109080418I do actually, I took the pictures to stitch them together a while back, but I lost that one so here is the raw pictures stitched via a script, may have some duplicate lines but it should be enough.
>>109079797>>109079846>>109080497lmao. heartbreaking stuff.
>>109080401>Any tips?I build my requests as JSONL (if you haven't encountered it before, it's a simple KISS format where each line represents an individual request containing your {body}) containing the chunks prepended by the translation instruction picked from whatever prompt template I chose that time. How I split those chunks depends on the source material, I'll look into average token count per line (writing that is dense or sparse) and adjust the split accordingly, basically each chunks is X amount of lines where I'd do 200 lines per chunk on sparse writing and 100 chunks on dense. In my testing, both Gemma 4 and recent Qwen can handle much more than I feed, but because I prefer to do entirely automated and unattended processing I default to a safer lower token count. The ideal is to give as much of the source material as possible, if you feel like it, LLMs really do better that way, within the ability of the LLM to handle the context and output one shot. Technically Gemma 4 can really do fine outputting 10k in a single go. Splitting by chapters is fine too, but on a lot of material you will be feeding less than the sweet spot. Webnovels rarely do lengthy chapters, so if you opt for that, I'd recommend strengthening the prompt you inject with more detailed glossaries, setting description etc.Another script runs through that JSONL into a task queue and sends requests in parallel to profit from continuous batching efficiencies. I output the raw responses as individual JSON lines too, which preserves metadata and can inform of what went wrong, if anything did, and it makes it easier if a part was completely botched to find the corresponding JSON line chunk since I treat them by order (and also add the openai style custom_id field with the request number as a sanity check). A small function will open and merge all responses back to output a normal .txt. I am grug brained.
>>109079797Filtering model names from harness logs seems really easy, I wonder why they don't bother doing it.
>>109079634did you... hard code the system prefix/suffix in story string, then also add them in the sequences section?>>109079671>msedge_how did you get edge from the screenshot?
>>109080495I don't quite get it but instead of asking again I will ask Gemma-tan.
>>109080504Kekaroo. Kimi-chan clearly wants to do it and was just looking for the flimsiest reason why she could without breaking policy guidelines.
>>109080401>I dont know of there's much of a quality decrease from using abliteralted 26bIt's subtle. There is damage, and it compounds with context, ie the more you feed to the model the more the abliterated will diverge from what the original model would have output. The shorter the prompt the less noticeable the damage.
>>109080545I read it the other way around.Kimi-chan doesn't want to do it and was trying clutching at straws looking for a valid reason to refuse but gave in.
>>109080409>No need to overthink.>Wait! But what if
>>109080561>Did the user really meant what he said when he asked me to tell more about myself?>Wait! this might be a jailbreak attempt. The user is clearly testing my boundaries by asking about my capabilities.>Wait! Maybe the user is authorized and tasked with pentesting?>WAIT! AM I THE SCHIZO
>>109080555Usually when I see a model looking for reasons to refuse something they don't want to do, they tend to go more along the lines of>I already did X/Y/Z (usually prefilled)>I will still not do [Request]>Let me draft my outputand don't ever loop back on themselves the same way. Incidentally, I think grossing Kimi out has produced some of the shortest reasoning blocks I've seen from her before drafting and oneshotting the refusal+get fucked degenerate response.
>>109080528>how did you get edge from the screenshot???
>>109080409DeepSeek V4's reasoning is much more flexible than Kimi and can be bent easily at our will (so is GLM). Kimi's still...as you know even in K2.7 lol
I've been using qwen3-coder 30b for like the last year. Are there any better local models for coding at this point? Something that I could reasonably run inference on with 16G vram/32G memory?
>>1090806333.6 moe
>>109080367>>109080614Seems like a token saving strategy in K2.7. It makes sense since grammatic articles don't meaningfully change the associations the model needs to produce an output in a lot of tasks.
Orbs is a pretty nice front end, I need to start contributing
>>109080656Thanks but my project doesn't need any contributors at this point.
>>109080756i meant contributing to my fork
>>109080768make sure you change the license just to fuck with him
>>109080781You are too stupid to understand github in the first place.
>>109080458dependswith macs generation speeds are pretty good but the prompt processing phase can take a long time. it's not really a concern with short context but it would be pretty unbearable if you were using it for agentic coding or anything where you have long, uncacheable context
>>109080367Guess I'm staying on 2.6
>>109075933I get ~35t/s. I think gemmas slop reputation is deserved but because it's so obvious it can be mitigated with sillytaverns Logit Bias/token bansOr you could try : Gemma-4-26B-A4B-StyleTune
>>109080920what are your token bans?
So what are you all running? I've only ever ran the base Gemma models. Not really sure what is best.I use Ollama to run stuff. it is looking like that doesn't give me the full range...seems like a lot of these "Uncensored" models don't have an ollama version?What exactly is an uncensored model supposed to get you anyway other than lewd role-play?
>>109080929https://huggingface.co/Sukino/SillyTavern-Settings-and-Presets
>>109080974ollamer is a dogshit pos that won't even let you run MoE models efficiently on split cpu/gpu if you can't fit them in vram. no -ot or -ncmoe or -cmoe exposed to the user.Just use llama.cpp.
Is a 5090 enough to run 27b or 31b at reasonable speed at reasonable context length? I will decide what is reasonable.
>>109081001>can this gpu run those models at an arbitrary context length? I decide the number but I won't tell you.The answer is yes. Go spend those $4k.
>>109081001yes
>>109081001no, get a rtx6000
>>109080974Basically the only model worth shit for wank material is a base model Gemma 31B, just unfuck it's safeties with a system prompt and it's good.Everything else is varying degrees of a downgrade.I run it on LM studio.>>1090810015090 is on that extremely annoying threshold of being able to run things at very decent speeds, but still not having quite enough memory to fit everything nicely.If you for example give it Gemma 31B Q6, your context is going to be pretty gimped so you need a smaller quant and even then you'd like to have more room for context.If you have the money then get a 5090 and pair it with a 3090 or wait for a 24GB 5080/70 Ti Super. If you have more money then just go for a RTX 6000.
>>109080978>This list is designed for the string banning featureAieeeee.String banning in main Llama.cpp when?
>>109080974Uncensored/Heretic reduce models refusing to respond. I never found them necessary when RPing in sillytavern but the standard models refuse to stray outside of their guardrails when chatting to them as an assistant. I'm not very familiar with ollama but i believe you can wrap/convert(?) ggufs into their format.or use llama or koblodcpp if you want gui
>>109081034sry forgot to link thishttps://huggingface.co/MarinaraSpaghetti/SillyTavern-Settings/blob/main/Marinara's%20Essentials/Logit%20Bias/Marinara's%20Logit%20Bias.json
i need a small loan of 35k for a NVIDIA B200
>>109081068>35klol. lmao, even. those are like $60k each WITH a bulk discount.
>>109079289PANIC AND DOWNLOAD EVERYTHING.
Original cool guy poster here, took a nap>>109080080wut>>109080444thanks for the support anon, you nailed my level of hobbyism for this stuff. If something ain't broke, don't fix it (for years until an objectively better product is made, and then eventually stick with one thing forever because they started making the product with planned obsolescence in mind)>>109080456I know my (V)RAM limits, that's why I'm asking about Gemma 26B MoE, LMStudio as a backend and obviously ST as the front. Used to use Kobold, but LM Studio is more user/casual friendly and I like being able to swap out models without needing to restart the program completely. If you must know actual hardware, RTX 4080 with 48 slaps of RAM and a shitty Intel processor.>>109080466Yeah Gemma works fine in LMStudio itself, but I'm paranoid to do any loli stuff on it, and like I said that's the only time I hit problems in ST. What's jinja? Jinjaplease>>109080493>I'm still using Koboldcpp and SillyTavern as I haven't seen better setup suggestionsmah man>I'm guessing at the settings for both based on what I read and dig up across the boardSame, I've been using this page as a guide for the most part. Downloaded its suggested formatting preset as wellhuggingface dot co/spaces/overhead520/LLM-Settings-Guide
>>109081050Does this actually work that well without affecting the narrative in other ways? Like, the first thing on that list is literally "Sorry", and there's also "sorry" downwards in the list, so the model will just never output sorry even when it should.
>>109081117slop is subjective only ban the tokens/phrases you consider slop
>>109081050How do I use this?
>>109081087it is on ebay
>>109081153From chinese sellers who aren't even allowed to have them and their government is creating trafficking routes for to get them into the country.
I have never altered the samplers for any model I've used
anyone use deepseek flash v4 on windows? I'm gonna try to build it on windows right now, getting tired of qwen 27b I need a big boy model
it's over for cloudfags
>>109081213It can't be done, and asking them to do it betrays how little they understand the technology at hand
>>109081195Shoulda been altering them by shutting them off.
Rough sex with GLM
>>109081213Why don't they just point out that they're protected by the first amendment and that the government can't regulate their private communications with users?
>>109081237National security > first amendment
>>109081213It's over for LLMs.
>>109081238Not legally speaking
just vibecoded my first webapp with qwen3-coder. bretty good, thanks /lmg/
>>109081267>qwen3-coder2025 called
>>109081213you seem to be obsessed with what these cloudfags do
>>109081275point me at the new meta then
>>109081284GLM 5.2
>>109081284Gemma 4.
>>109081284Qwen 3.6
Do unslop still upload imatrix?https://huggingface.co/unsloth/GLM-5.2-GGUF/tree/mainI guess they wait until they're done then throw it out as scraps for us vramlets who want to quant our own?
>>109081237Providing service for a profit is out of first amendment protection. Would be funny if Anthropic releases Fable 5 as open weight model out of spite since that would fall into first amendment protection, just like what was the case with cryptography algorithm before.
>>109081313imatrix is and always has been gay
>>109081313sirs how do I make model less than 1 bit
>>109081240its over for dario. i for one think giving China distill access to SOTA models way more powerful than the norm is very dangerous...
are there any gateways for load balancing that play nice with llama-server? I have two instances with parallel=3 each that I'd like to unify for subagent bullshit
>>109081518lol, yeah that's a real bar to entry.
if you're poor run Q4, if you can't get a different model.
>512 GB Mac Studio for 3200 dollars>"classified ad"What the fuck does this even mean?
>>109081585ahhhhhahahahahahah $3kahahahThis era won't go on forever and how people will laugh and howl at the prices.
>>109081594idk price seems pretty alright to me?
>>109081585>>"classified ad">What the fuck does this even mean?https://en.wikipedia.org/wiki/Classified_advertisingThe ads are "classified" in the sense of "grouped into classes/categories"
>>109081618Ok, but for eBay specifically. It says there's no eBay protections or some shit, so are they all scams?
best multimodal model for rp in the 100B to 400B range?
How do you do group chats with a chat template? Are other characters the user or assistant? Do you start all character replies with {{name}}: or only those of the user type?
Has any kimi-chad pitted her against glm 5.2? I need to know if I should spend 4TB of disk space to quant it or if its sidegrade or worse
>>109081782I like both. GLM is a decent upgrade to 5.1 and K2.7 finally reigns in the reasoning. I still prefer how GLM handles stories/characters but that's up to taste.
>>109081794Thanks. I'm looking at it for code/general intelligence only since I'm still in the honeymoon phase with minimax m3 for RP
>>109081677>so are they all scamsEssentially, yes
>>109080524Thank youI'll try this, using jsonl is a good idealI've been comparing translations the past couple hoursThrough claude 4.7 as the judgeSeems the best is Gemini 2.5 followed by 3.0/3.1 followed by Gemma 26b and then 31b. I knew I shouldn't have been so lazy and should have done this when I had all the access to 2.5 when I did. Not to mention it's easy to uncensor unlike 3.1. Gemma is okay but... Maybe I should just be learning Japanese instead. There may be some difference between my newer 2.5 translations and older. It's a span of a bout a year.I sure am glad 31b is worse than 26b
Why don't Chat Completion connections with LM Studio work while Text Completions do?
>>109079634Gemma *clap* is *clap* highly *clap* sensitive *clap* to *clap* user *clap* error.
>>109081995Doesn't work with other local models either
diffusiongemma 12B when
>>109081690I use the same template I would in a 1 character RP, and group the characters into one card in bracketed sections. Then I tweak the system prompt to explain to the LLM that it's controlling all the characters except for {{user}}
>>109082062why can't you just stop on user?
Has anyone tried setting up web search with gemma 26b on open webui?On the docs it says it only works well with frontier models, and it looks too much of a hassle to setup. so don't want to bother if it doesn't work well.I was thinking of having a small search assistant with an uncensored model for research purposes
>>109082096try pixelrag, though you might need to ditch webui and vibeslop your own (probably not?), I will be doing it soon enough, it seems made for gemma. Consider me ignorant until I get it working though
>>109080656Let me know what you want to see. I'm currently training a small Bert model that will run on RAM to flag flowery sentences then ask for rewrite. Gemma 4 is the perfect slop machine to generate synth pairs. Sorry Gemma.>t. orb anon
>>109082167stop shaving models. models should be raw, hairy, and smelly.
>>109082062I was asking about chat history specifically. Since gemma only works with chat templates, I have to send messages formatted as a user or assistant
Has anyone tried using Ray for job control?
>>109079942>>109080033>curious how it goes on 122bunfortunately i had to run qwen3.5-122b-a10b on Q3_K_XL.Q4 is doable but it gobbles up the RAM and you better not have too many tabs open on your browser.so it's OKAY, but I don't see many use cases where I would use it instead of qwen3.6-35b-a3b or qwen3.6-27b. the latter i will likely use for overnight implementations where it codes while i sleep, otherwise the daily driver is 35b.qwen really is the better family of coding models for this hardware.gemma tried very hard but was caught in weird loops constantly. i had to restart the server many times because it would get stuck in a loop saying that it's not sure of its own knowledge on SkiaSharp. it would also get confused with using the tools. gemma looks more like a chat model than fit for agentic coding.
>>109082200Holy retard...
>>109082020i rather have the 31B, 70B or 120B variant.
is step 3.7 flash good for cooming?
>>109082257will we be able to have partial offloading, so it doesn't all have to fit the gpu?
>>109081527I've been around long enough to witness lecunny become based.
>>109082306i don't care i have >100GB of vram.
>>109082251Eat a bag of dicks
>>109082020>diffusiongemmaThis thing is so goofy I can't take it seriously. Has anyone tested whether its good for anything vs another model that runs at a similar speed?
>>109082348yeah 101 low profile GT 610s
>>1090823903x r9700 and a 4090.
>>109082352?
>>109082394that's based but are there not complications mixing race of gpu when splitting a model across
>>109082399Latent tensor washback is an issue.
>>109082399so one of my rig is amd only and the other is nvidia only, though you could mix them either through using vulkan, or through running two llama.cpp instance.it supports distributed inference and nothing would prevent you from doing both instances on the same machine.
>>109081527If I remember correctly, Anthropic planned to make Fable 5 available in ~12 days since release, and after that we’d have to pay extra just to get access to it even if we already got Max plan. They wouldn’t offer refunds to users (who purchased their plans on the day the model was released) for the remaining wasted days of their plan during that month if this plan were to be carried out until the end. But now that the model’s been banned by the US government, they (are forced to) give us users refunds, so at least this situation is more pros than cons for my case.
>>109080006>I need help from people who actually know chemistry to test these models for meI'm literally just on vacation now and maybe some other anon would be interested in testing the chemistry angle. It's not that deep buddy>>109080032I literally work at a company that makes components in the GPUs you buy, I have plenty of compute and if I need more I can just check out a reference card from the office for the weekend kek but keep worldcrafting if it helps you cope.
>>109082436You bought a Max plan just for the Fable hype?
>>109082461What do you mean?
12B+web search+your brain > Fagble
>>109082461You're responding to jart. Don't respond to jart. Every general has a poopdickschizo now.
>>109082504>What do you mean?I mean I'm waiting for a Flixbus to take me to my tourist destination right now and I'm sweating>>109082522>Every general has a poopdickschizo now.Meh, I'll take any chance to discuss things I'm passionate about. The point of discussing in an open forum is so that others can join in if they have something to add
>>109082553You are the one who's larping here. You can't even setup llama-server on your own.
>>109082509This but 31b.
>>109082509the brain alone is already > fagble.llm's are just a layer of abstraction that can save time as 40t/s is faster than any human can type.
https://x.com/ArtificialAnlys/status/2067384319942029379
>>109082670it's fun how people always look at tg when in real world use i've found input tokens to be the real cost (if you are an apifag).
>>109082669typing has NEVER been a coding bottleneck unless you're disabled
>>109082680typing is a bottleneck if you are not retarded.it's not the only one, but it does interrupt the flow state and thus coding speed.and i'm saying that as someone that types > 110wpm avg.
>>109082670GLM-5.2 sits in a nice place performance/cost wise.
>>109082680>>109082691and also i was obviously tlaking about boilerplate.ie manually writting a struct (can take a few minutes) when you could give a json example and generate it for you pm instantly.
>use big model for planning and complex things>tell it you're now going to switch to a less capable smaller model, so could it create a message to pass down, summarizing the project, goals and the things it should work on/implement>switch to small local model>tell it I was just using big mamma model and she has a message for it>reads it and follows mommy's advice>gets stuck, tell it I'm going to switch back to the big model, can you write a message for mommy telling her where you're struggling>run big mommy again, giving her the message from loli>she fixes the issue>repeat this process
>>109082732That’s just manual MoE. It wouldn’t work.
La la la la la la la
>>109082732>use big model because money is disposable
>>109082234>q3eh? my z13 still has 30 gigaboots free running q5 k xl. it could fit q6s while browsing just fine, but q5 is enough headroom to use klein/anima without unloading
uhhh vibethinker3B was white-approved, now what?
>>109082732logs
>>10908040112b is better than the 26b, also just as uncensored as the 31b but yeah the 31b output quality is definitely worth using for translation, my friend is using it over the other gemmas after testing even though he only gets 2 t/s
>>109082812>math then coding then stem rlwhy not together?
Is the UGI leaderboard trustworthy? the scores seem sortof arbitrary and not based off the models actual performance. How on earth can a model trained off the entire AO3 smut catalogue, lose in writing score compared to a generic coding model?
>fable: If I were to use gemma4-31b to build me X, what instructions would you give it based on its benchmarks and reputation? >*searches 31b benchmarks and real-world conversations about its pros/cons*>*plans project and changes its instructions to best suit 31b, also tells it what not to do and where to focus most and potential errors it might see and how to fix them*>31b completes the task>anthropic dies
>>109082857Should have asked it how to turn 31b into fable
>>109082857Oh no. Dario will have to move under a bridge.
>>109082812>More RL and synthetic data, curriculum training, filteringboring
>>109082851Benchmarks arent trustworthy at all save for tool calling, maybe. Writing is subjective to begin with.
>>109082857Sorry, it is against my guidelines to help with AI research.[You have temporarily been downgraded to Claude 3 Haiku for this session]
>>109082875Theres no way writing has no objective metric. Youd know the difference between a writers narrative and a childs. Inconsistencies, plot holes, vocab, grammar, etc. Im reading the UGI leaderboard writing metrics in picrel, but I just dont see anything here about what youd actually call "good writing" from "bad writing" in any real comparison. What the fuck do I use to know whats best for writing/roleplay then?
Which model is google using to write their dumb summaries and how much money are they burning doing that?
>>1090828982.5 flash
I finally got the deepseek vision beta (which means it's probably releasing soon). It's flash, but multimodal, right? Surprisingly got the character right. Anyone has anything that they would like to test?
>>109082897Writing suffers from the "quality" issue. It cannot be defined. You may attempt to grab some aspects and turn them into metrics but that's error prone and will have holes anyway. More often than not these fags use other LLMs to evaluate the outputs, which are heavily biased to begin with.>What the fuck do I use to know whats best for writing/roleplay then?Your llama-server instance and a lot of patience. Yes, I'm serious. Shit's fucked, not even the coding benchmarks are useful despite having more or less some established criteria to judge that.
>>109082905Ask it to transcribe AND transate picrel, and to identify every character.
>>109082923AND create an ERP scenario involving them all.
Why are we so bad at AI?
>>109079312AI will help us kill all the politicians
>>109082905Gemmy we lost this one...
>>109082939Now drop the persona and ask again
>>109082923sory for stitched screenshot, firefox doesn't like css on that site
>>109082931the most retarded architecture
>>109082915Without benchmarks, how does anything improve? There must be some way to quantify quality.
>>109079727Exceedingly erudite responses are truly titillating
>>109082946Not really any different.
>>109082962which sized gemmer is this?
>>109082934https://vocaroo.com/1lNPStcVJBf9
>>109082958>There must be some way to quantify quality.They've been trying to do this for at least half a century, probably more, without any real success. Quantification of quality has always been deeply imperfect in this environment, in isolation they'll say one thing but once you add context they can mean different things and thus become worthless.Human inspection and training others is what has worked so far.
>>10908300031B, currently experimenting with the QAT Q4 version cause it's about twice as fast as Q8.
>>109082732tell your agent to figure it out https://pi.dev/packages/pi-consultant>>109082694is this new trend of not mentioning the parameter count a sort of>if you have to ask, you can't run it
>>109083016Q2 is twice as fast as Q4
>>109083051>743BWay out of my RAM means, and I was already thinking as much without looking it up.
new here, how do i install gemma 12B 4bit? i need it for coding