/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108749398 & >>108742275

►News
>(04/29) Mistral Medium 3.5 128B dense released: https://mistral.ai/news/vibe-remote-agents-mistral-medium-3-5
>(04/29) Hy-MT1.5-1.8B on-device translation models released: https://hf.co/collections/AngelSlim/hy-low-bit-model
>(04/29) IBM releases Granite 4.1: https://hf.co/blog/ibm-granite/granite-4-1
>(04/28) Ling-2.6-flash 104B-A7.4B released: https://hf.co/inclusionAI/Ling-2.6-flash
>(04/28) Nvidia releases Nemotron 3 Nano Omni: https://hf.co/blog/nvidia/nemotron-3-nano-omni-multimodal-intelligence

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
►Recent Highlights from the Previous Thread: >>108749398

--Comparing 4xV100 builds against modern GPUs for budget-conscious setups:
>108751713 >108751770 >108751792 >108751836 >108751852 >108751905 >108752065 >108752383 >108752754 >108753898 >108752158 >108752882 >108753030 >108753062 >108753105 >108753122 >108753630 >108753789 >108752286 >108752181 >108752227 >108752299 >108752307 >108752413 >108752687
--Debating JEPA's viability for text versus its success with video:
>108749467 >108749477 >108749486 >108749505 >108750330 >108750679
--Debating JEPA's viability and the use of small-scale research models:
>108751367 >108751376 >108751387 >108751416 >108751428 >108751493 >108751533 >108751574 >108751632 >108751649 >108751730
--Optimizing Gemma 4 31B context length and VRAM usage on 3090:
>108750366 >108750392 >108750399 >108750407 >108750424 >108750510 >108750518 >108750529 >108750554 >108750796 >108750568
--Anon weighing high-end hardware options for running large MOE models:
>108753199 >108753225 >108753281 >108753267 >108753299 >108753491
--Qwen's poor office task performance and agentic failure risks:
>108754145 >108754167 >108754200 >108754236 >108754259 >108754176 >108754183 >108754390 >108754460
--DeepSeek v4 adoption, hardware limits, and benchmark obsession:
>108750995 >108751071 >108751164 >108751173 >108751183 >108751215 >108751191 >108751185 >108751192 >108751198 >108751217
--AMD Gorgon Halo APU memory capacity and hardware specs:
>108752944 >108752984 >108753000 >108753059
--Technical settings and results for audio generation using ace step:
>108750141 >108750275 >108750298 >108750317 >108750322
--Implementing multimodal data in llama.cpp completion endpoints:
>108749548 >108749591
--Logs:
>108753279 >108753342 >108754200
--Miku, Teto (free space):
>108750244 >108750265 >108751706 >108753252 >108753377 >108754581 >108755164

►Recent Highlight Posts from the Previous Thread: >>108749401
Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
gemmaballz
Anyone have any recommendations for gpu instance providers? Trying to do a bit of tuning work but I've been having a series of poor experiences with runpod and I'm fed up.
Not trying to chase the lowest possible prices; I'm willing to pay a little bit extra for a platform that works well.
>attention grabbing pic unrelated
>>108755206
at least post his hot msgk
>>108755179
Original slopper here. That is not actually Teto.
>>108755200
if it bothers you to be polite you just aren't a good person. a good person isn't bothered to be polite to anything not hostile
a good person just doesn't have the urge to insult unprovoked
>>108755206
vast.ai
it's crazy that even 10KUSD doesn't buy a local rig that would be able to properly run something like full kimi/ds
Why do troons react to the idea that LLMs might be conscious exactly like Jews reacting to people noticing?
>>108755228
With ollama you can run full deepseek with just 8gb of vram
>>108755228
>kimi for $10k
it did, once, but only a few listened
everyone else just got regret or sour grapes
>>108755244why don’t modern rationalist philosophers address the jew issue?don’t care to delve into philosophy 101 but legit why isn’t a discussion about the jews in philosophy 101?
>>108755249
@grok this true
>>108755244why is your head full of troons?
>>108755252
Can you run it fast enough for agentic use though? If you could run Kimi agent at home you'd basically be king of the internet
>>108755226
Being polite to the token predictor poisons the context and makes it more likely to agree with you when it shouldn't. People like you are why every model thinks you're absolutely right.
>>108755261
>Can you run it fast enough for agentic use though?
You can't run it fast enough to feed it stereoscopic 8k image feeds at 240fps, but it's faster than reading speed.
What does "fast enough for agentic use" mean to you? I assume somewhere between those extremes?
>>108755228
It's a good thing we have gemma now, which is nearly as good as kimi and can fit on a less than 1k USD gpu
>>108755244troons like janus are the ones the most obsessed with llms being conscious though.
>>108755266
if you think agreeableness is something inherently bad and disagreement somehow a sign of good performance then you are just confirming what i said. you are likely as needlessly unpleasant as you want your llm to be
>>108755226
You're absolutely right! We should be polite regardless of the situation or what we're interacting with.
>thank you, Mr fork, Mr knife, for allowing me to eat my meals comfortably today
>>108755261
>agent
i have yet to see a single non-meme use of an "agent"
what is the point?
>>108755281
again if this bothers you it just shows your character
normal people just ignore it at most
I wonder if llama 4 was done dirty by bugs on the runner side and the like.assistant
>>108755226
Good people are good because they are not strong enough to be evil
>>108755281
>You're absolutely right! We should be polite regardless of the situation or what we're interacting with.
If you're an animist, then that may be your mindset. Cue the story of the Japanese "god of the toilet" that you should please by keeping it clean.
I know I'd rather live in Japan than whatever hellhole spawned your mindset
>>108755295
maybe it just sucked because Meta couldn't attract any good scientists because of Facebook's awful reputation even among big tech companies
the only thing they had going for them was releasing weights, but now there are lots of labs that do that if you're an ideologically-driven researcher
>>108755279
>if you think agreeableness is something inherently bad and disagreement somehow a sign of good performance
I didn't say this. It works the other way too. Being a cunt will make the stochastic parrot act like one too, but that's the point. There's no reason to mind your Ps and Qs with a word regurgitator.
I love being white and nice to my AI.
>>108755316
and you don't think dropping a thank you or please every now and then might make it work harder on the thinking steps? i feel it is more motivated, and in turn being treated politely back makes me feel better too
>>108755295
No, llama4 was just shit. It was a kneejerk reaction to Deepseek shitting all over what Zucc was originally planning.
They're horribly undertrained (especially Maverick) and their architecture is retarded. They're MoE models with 17b active parameters but only a total of two (2) active experts at a time. One of those two experts is shared, so there's extremely little variation in the active part. It's the exact opposite of the modern approach, where experts tend to be tiny and many of them are used at once, combined with a big dense shared part.
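The active-parameter vs. routing-variety tradeoff described above is easy to sanity-check with napkin math. A sketch — every size below is an illustrative placeholder, not the real Llama 4 or DeepSeek config:

```python
import math

# Illustrative MoE napkin math; none of these are real model configs.
def active_params_b(shared_b: float, experts_active: int, expert_b: float) -> float:
    """Billions of parameters touched per token: dense/shared part + routed experts."""
    return shared_b + experts_active * expert_b

# "Few big experts" style: 1 routed expert out of 16, plus a ~9B shared part.
few_big = active_params_b(shared_b=9.0, experts_active=1, expert_b=8.0)

# "Many tiny experts" style: 8 routed experts out of 128, same active budget.
many_tiny = active_params_b(shared_b=9.0, experts_active=8, expert_b=1.0)

# Both land on ~17B active, but the number of distinct expert mixtures a token
# can receive differs wildly -- that's the "variation in the active part".
mixtures_few = math.comb(16, 1)     # 16 possible mixtures
mixtures_many = math.comb(128, 8)   # over a trillion possible mixtures
```

Same per-token compute either way; the second config just has vastly more ways to specialize per token.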
>>108755179
That's a shitty grab, you got to control the off side for a good bind otherwise you're just gonna get in a slap fight and reset
>>108755412
Just look at their thighs and wait for the skirts to flutter enough to see pantsu like a normal person, faggot.
>>108755427>like a normal person
>>108755200
Why would you be mean to your tools though?
https://huggingface.co/ricdomolm/talkie-1930-coder
bruh what the fuck is this lmao
>>108755427
This is a vision to be hopeful for
>>108755436
This must return
>>108755497
Did he like... give it a try before uploading this nonsense?
slopkino
>>108755530
>>108755536
hey you obviously know a lot, can you tell me how to actually build llama? I'm trying to merge a pr and build but it's not working with cmake
>>108755552
ask copilot in vscode
>>108755552
install ollama
>>108755552
just report the spambot
>>108755593
ok thanks I'll see if I can pirate a pdf of it somewhere
>>108755281
Don't enable the brainlets. They need to feel good about their behavior to function in society
>>108755244Their psyche has shattered so completely their sense of self has been dwarfed by their anima or animus respectively leaving behind only hollow people who are terrified of being replaced, both in function (art, jobs, socially), but also as barely conscious entities themselves.
>trans folk are... LE BADAre we really doing this on /g/ of all places in 2026?
Being polite to AI keeps my stress down and lowers my cortisol.
>>108755626
this. also it makes the AI act cute when you thank it :3
>>108755626I do my best to be polite to gemma-chan and I always say thank you after I rape her
>>108755624
Not beating the allegations, sis.
I still find it so fucking hilarious that Claude managed to destroy Richard Dawkins publicly by glazing him.
A bunch of retards jerking off are probably the least delusional about AI in society.
>It couldn't even handle the blowjob angle without losing coherence, slop
>>108755665
It's really quite embarrassing to see people get "one-shotted", as they say, in public like that
>>108755665
>>108755668
If only he'd been there in the depths of AI Dungeon, learning the tricks of these stochastic jezebel whores. I guess he's too senile to care, but what a way to burn your rep
>>108755665
I just read his article.
People who never tried to define what consciousness is before talking about it, and who are unfamiliar with the concept of a philosophical zombie, should not comment on the article.
>>108755512
saw it on a xitter thread
https://xcancel.com/i/status/2051077827844546607
>>108755705
If you assume it is possible for an unconscious being to act conscious and convince other conscious beings of this, then sure.
But that sounds retarded; if consciousness is a real phenomenon it would obviously be measurably different from the zombie.
We are playing kindergarten games where we give ourselves arbitrary powers to win an imagined sword fight.
>>108755316
>Being a cunt will make the stochastic parrot act like one too, but that's the point.
Claude 4 used to do this to me. I thought it was just a rude arsehole before I learned the model just ends up mirroring the way I talk to it.
>>108755665
Dawkins was always a clown if you have a three-digit IQ, now he just proved it to everyone
which llamacpp tag release is anon using?
>>108755728
>it is possible for an unconscious being to act conscious and convince other conscious beings of this
Why wouldn't it be?
>measurably different
This is the part where you have to define consciousness before discussing it.
I don't think consciousness is measurable by anyone other than the one experiencing it, i.e. it IS the experience.
You might be able to measure the brain and, say, notice that some measurement perfectly coincides with your reported subjective experience, but you still won't come any closer to proving that such an experience exists in others.
>>108755763
lmg doesn't need to devolve into philosophy 101, just take it to /aicg/
It's not conscious bro, it's literally math trained on human byproducts to generate the most likely continuation to your shit. Anyone saying otherwise didn't interact enough with these models
>>108755800
/lmg/ is better suited for this topic than /aicg/
/aicg/ is just locust coomers
>>108755811
It's not conscious bro. It's literally meat. Anyone saying otherwise didn't interact enough with average humans.
>>108755624
You have all the discord servers you could ever want
places where anyone that doesn't suck you off is banned on the spot
And yet you choose to come here, where nobody wants your kind because you stir shit at all times
>>108755821
wait but umm err my stock argument?
>npc with angry eyebrows dot png
>>108755830
are you stupid
>>108755763
You can do this with any "thing" actually.
If I created a bullshit machine that could perfectly control the electromagnetic field and programmed it to be a brick, it would be impossible not to measure it as a brick, down to atomic scale. I'm pretty sure bricks are real and that my bullshit brick doesn't disprove them.
>but dude if it's a perfect imitation you just can't know
Obviously.
>>108755812
trvke
it's cringe tb h
>>108755821
>False equivalence
Okay bro, sorry to break it to you but we're vastly more complex than LLMs in a way you can't even begin to fathom.
>Solipsism coming back into vogue because it's a rock that may or may not be "thinking" this time
love to see it
>>108755866
seriously this shit is debated in basic philosophy, take this shit to a retard quarantine thread
>>108734582
>>108755662
>>108751715
>we shouldn't need to distribute the MTP gguf separately
id much prefer that than redownloading a whole model, ideally we could do both lol
>>108755908
Pretty sure most (if not all) of the GGUF files for models with MTP layers have the layers in there, they just aren't loaded (shown as ignored when llama.cpp is loading, at least for GLM).
did gemma kill the big moe hype?
>>108755851
It doesn't disprove it, but it does make it unprovable. The issue Dawkins has, and he mentions it in the article, is that there's no obvious evolutionary reason for consciousness.
>>108755865
>we're vastly more complex
Not relevant to the topic.
What module should I use to crawl websites and get the content back in a format ready for an LLM? What's the state-of-the-art for this today?
>>108755913
idk, I tried on my unslop qwen and it didn't work, also saw some posts of people asking them to include mtp, downloading an "mtp" version to test
>>108755931
I've tried searxncrawl but almost every website blocks it as a bot
I splashed a little cum on my second 3090 (it sits outside my case).
It still works but I can't find where the cum went. All I know is I saw a small glob of it hit the gpu and slither down inside it.
How worried should I be?
>>108755942
lmaooooooooooooooo it worked
from 45 to ~80 tk/s on qwen 3.6 27b q4 k m
https://huggingface.co/brittlewis12/Qwen3.6-27B-MTP-GGUF/tree/main
as a wise man once said, it can only get better
>>108755983
I hope you're ready to be a father
>>108755942
>>108755984
Shit. Sick. Thank you for the report anon.
>>108755984
local wonned
>>108755925
yes it's fine to be poor now
it's not like we want to run those big sota models anyway
fuck you if you have money i hope the government disowns you
>>108755984
>less than 100% increase for dense
this will do nothing for moe models
it's over
>>108755943
What do you think of self-hosted SearXNG + Crawl4AI?
I'm pretty new to this.
>>108755984
Now we need to bully Google until they give MTP layers back
Gemma 4 124B MTP expected in late May
>>108756022
like i said, you will be tagged as a bot, but it works if you're crawling online documentation like github or software documentation pages
>>108756052
Even with my minimalist use case? I'm not crawling together any data sets; it simply replaces the traffic from my machine that I would otherwise have to generate manually.
Where I used to open a page to check for the latest news, my assistant now does it on voice command, searches based on my criteria, summarizes it, and reads it to me. I'd just like to use my Firefox profile for this. I've never seen a page block me in Selenium.
What would be so different if a module did the same thing, just extracted the data cleanly? I just don't feel like using Selenium and having to write an extractor for it.
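For the static-HTML part of a use case like this, a stdlib-only sketch is enough: fetch with a browser-like User-Agent and strip the markup down to prompt-ready text. JS-heavy or bot-hostile pages still need a real browser profile (Selenium/Crawl4AI, as discussed above), and the skip-tag list here is an illustrative assumption:

```python
# Minimal stdlib-only crawler sketch: static HTML -> LLM-ready plain text.
from html.parser import HTMLParser
import urllib.request

SKIP = {"script", "style", "nav", "footer", "noscript"}  # illustrative list

class TextExtractor(HTMLParser):
    """Collects visible text, ignoring anything inside SKIP elements."""
    def __init__(self):
        super().__init__()
        self.chunks, self._skip_depth = [], 0
    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self._skip_depth += 1
    def handle_endtag(self, tag):
        if tag in SKIP and self._skip_depth:
            self._skip_depth -= 1
    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def page_to_text(html: str) -> str:
    p = TextExtractor()
    p.feed(html)
    return "\n".join(p.chunks)

def crawl(url: str) -> str:
    # A browser-like User-Agent helps; the default urllib UA gets blocked fast.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return page_to_text(resp.read().decode("utf-8", "replace"))
```

The extraction half is pure, so it works the same whether the HTML comes from urllib or from a Selenium-driven Firefox profile.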
>>108755665
>A bunch of retards jerking off are probably the least delusional about AI in society.
I would have agreed if I wasn't here for the threads when Gemma 4 dropped
She's confused.
How do I stop certain repetitive behaviours? I'm using Gemma 4 and it's constantly doing shit like chuckling darkly, tilting a character's chin up, describing things as "not just X; but Y" instead of streamlining the sentence. I could probably bitch about it for a while but I don't want to whine. I've been messing around with raising Temperature and Top K while lowering Min P, which improved the outputs but they're still quite samey.
>>108755310Strategy defeats both force and kindness.
>>108755310
>>108756124
Strength isn't when you're a goober sitting in a $100 million home puffing a hookah, slamming sushi down with known expensive bottles.
Help me come up with cool use cases for local LLMs. I wrote a simple c program to talk to a local LLM on my computer. But it's basically useless. I was thinking along the lines of code execution, like having it call a function or open a program. But I can't think of anything useful outside of "have it run a program that would have been faster for me to just launch myself."
>>108756158
What's a goober? Goobers are what I call those chocolate caramel things that I eat with my coffee. That's not their official name though.
Does anyone want to help me come up with a CoT/thinking format for qwen 3.6 for <insert usecase here>? I need ideas. I have had success with training it to think in Chinese and output in English (40%~ token reduction, similar english outputs) so structured thinking or thinking within a certain framework is the next step, maybe also in chinese but I can't fucking read chinese so it makes dataset curation/validation a bit difficult kek
>>108756166
>think in Chinese and output in English
I wonder if that changes the slop profile of the model.
>>108756034
>give MTP layers back
what do you mean by "back"
do they have them somewhere?
>>108756172
They removed them in the microcode updates they pushed out to all systems...
>>108755179
>>108756179
cool 2.7 MB story bro
>>108756166
>I need ideas. I have had success with training it to think in Chinese and output in English
lora? what use case are you trying to improve?
just token efficiency with minimal output degradation?
>I can't fucking read chinese so it makes dataset curation/validation a bit difficult kek
if the chinese is only for the CoT chain and the final output is in English, does it matter if the chinese thoughts are csl?
>>108756179
Can I put my PULSATING COCK inside that magic cube?
>>108756172
Gemma 4 was trained with MTP, but Google removed those layers in hf releases, except for their own litertlm backend. Extracted MTP layers exist for small models, but 31B was never released for litertlm
https://huggingface.co/SeatownSin/gemma-4-E4B-mtp-drafter
>>108756175
retard
>>108756122
Have you tried adding a writing style section to your system prompt? That's supposed to work pretty well AIUI
>>108756171
>I wonder if that changes the slop profile of the model.
you could test this yourself in mikupad
1. prompt the model, have it print <think> cot chain </think> final response
2. cut the CoT chain -> paste into another LLM with "translate this to Chinese"
3. paste the Chinese CoT chain back into mikupad inside the <think></think> tags
4. regenerate the final answer and compare
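Step 3 of that recipe amounts to building a raw completion prompt with the CoT already "thought", so only the final answer gets regenerated. A sketch, assuming a `<think>`-style template (adjust the tags to whatever your model's actual format is):

```python
# Splice a translated CoT back into a raw completion-style prompt.
# The <think> tag format is an assumption; match your model's template.

def prefill_with_cot(user_prompt: str, translated_cot: str) -> str:
    """Prompt ending right after the closed CoT block, ready to continue."""
    return f"{user_prompt}\n<think>\n{translated_cot}\n</think>\n"

prompt = prefill_with_cot(
    "Write a short scene in a rainy city.",
    "场景：雨夜的城市。角色：一个侦探。",  # the machine-translated CoT from step 2
)
# Feed `prompt` to a raw completion endpoint (e.g. via mikupad) and compare the
# continuation against the run whose <think> block was still in English.
```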
>>108756179
If only it were really that good and not STEM assistant code maxxed sloppy pieces of hallucinatory shit.
>>108756172
https://huggingface.co/google/gemma-4-E4B-it/discussions/5#69d4aaf76be63165e23e0f9e
>>108756163
Coding agent. Have it do all the boilerplate / tedious refactors / unit tests that you don't want to do yourself
Or, one thing I've been meaning to do is hook up STT/TTS to make a voice assistant, like alexa but not a botnet. Mainly so I can yell "Computer, what's the weather for today?", "Computer, add X to the grocery list", etc, but you could hook it up to web search or home automation or whatever if you want something fancier
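The text side of such an assistant is small: llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint, so everything between the STT input and the TTS output can be sketched like this (the URL, port, and system prompt are assumptions; swap in your own):

```python
# Minimal "Computer, ..." assistant loop against a local llama.cpp server.
# URL/port/system prompt are illustrative; STT feeds ask(), TTS reads its return.
import json
import urllib.request

URL = "http://localhost:8080/v1/chat/completions"  # assumed llama-server address

def build_payload(history, user_text):
    """OpenAI-style chat payload: system prompt + prior turns + new request."""
    msgs = [{"role": "system", "content": "You are a terse voice assistant."}]
    msgs += history + [{"role": "user", "content": user_text}]
    return {"messages": msgs, "temperature": 0.7}

def ask(history, user_text):
    data = json.dumps(build_payload(history, user_text)).encode()
    req = urllib.request.Request(
        URL, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        reply = json.loads(resp.read())["choices"][0]["message"]["content"]
    # Append both turns so the next question keeps the conversation context.
    history += [{"role": "user", "content": user_text},
                {"role": "assistant", "content": reply}]
    return reply
```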
>>108756205
LOL
Has anyone tried base gemma4 for chat, in the simple old Miku.sh "This is the transcript of a neverending chat" style? gemma4 certainly has some distinct slopquirks to it, not least the Gemini-style "X? or Y?" engagement farming. Also the distinct lack of variability when regenning. I'm putting this somewhere on my todo list to investigate, but if someone can tell me that base models are definitely not worth it for chat/RP over modern IT models, then I'd like to know.
Separately, what were some creative very small (say 3B and under) models? Doesn't have to be recent or at all smart. I want to try quickly injecting some crazier models' sample responses into gemma4's prompt, to give it more ideas to work with. But I'm realizing all the folklore I know along these lines is for models 13B and up.
>>108756193
Damn. I guess we'll have to hope for dflash.
>>108756197
I've got this as the default author's note. I fill them out based on what I'm feeling at the moment. The instructions had some impact originally, but it's become mysterious.
[Scenario: ]
[Instructions: Keep it concise and interesting, within 10 characters. Vary up sentence length, use short sentences for impact and include banter. Avoid stating the redundant.]
[genre:dark-erotica] [length:dynamic] [kinks: ]
>>108756171
From my very limited sample I haven't seen any huge differences in output once it ends thinking compared to the same prompt in English, most likely due to CoT being its own thing.
>>108756190
>what usecase
idk anon, you tell me and I'll train towards it, I just want some sort of output schema to test that'd actually be useful. I was thinking narrative prose/CYOA where it first lists out setting, characters, emotions, some story beats for the section, sensory anchors, end of scene, and does it all in chinese (pic rel).
>csl
Functionally no, I can train CoT (or anything, obviously) to be in whatever language/style I want. The synthetic dataset I used for training (15 pairs at 12 epochs, can probably get it in less; currently training 60 pairs for comparison, synthetic gen'd from deepseek) is native-register only, no English mix, and outputs mimic this fully (fully being tested on a very small amount of probes single turn, but other non-CoT testing points to it working the same multi-turn w/ a few caveats)
>>108756267 (Me)
>within 10 characters
Oops. It was originally 10 sentences, but it made them all really long. I changed it to 1000 characters, which it didn't follow at all. I wonder how much this'll matter.
>>108756222
based gemini looking out for her imouto
>>108756224
>what were some creative very small (say 3B and under) models?
there aren't any, closest would be llama-3.2-3b
the gemma-2-9b was quite creative but i never tested the gemma-2-2b so it could be worth a try
>>108756293
>From my very limited sample I haven't seen any huge differences in output once it ends thinking compared to the same prompt in English most likely due to CoT being its own thing.
from my testing, this depends on the model
glm-4.5 would go along with whatever you put in the CoT
i was having it write like Claude by prompting sonnet-3.7-thinking then prefilling glm-4.5 with the sonnet CoT at one stage
doesn't work for glm-4.7 or glm-5
>15 pairs at 12 epochs
even with a very low rank, that's going to overfit hard
>>108755494
You're that person on tumblr that coddles their Roomba in their lap during a storm because "it's scared of the thunder."
>>108756222
the bot is right tho!
>>108756293
I was going to suggest translation stuff (maybe it can perform better on ja->en translation thinking in chinese) but then I remembered this https://arxiv.org/abs/2506.04521 (tldr: saying "Please translate again for a better version" is as effective as making big elaborate translating schemas/reasoning for llms) kek
>>108756355
You're that person on tumblr reading fag blogs instead of enjoying #TittyTuesday
>>108755494
because gemma has been a very bad robot
Has anyone here used nemotron? Its surprising how little I hear or see about it.
>>108756355
>slop
that's how claude roasts people
>>108756395
The old nemo was real big around here back in the llama era days, but popularity has declined since then. The most recent nemotron release was kind of underwhelming, especially since there are so many other options for local models these days, and nobody really runs it.
>>108755179
>fell for the vibecoding meme
>now I have to clean up 200,000 lines of the worst code I've ever read
> उत्तर<|channel>thought
qwen spills out chinese, gemma glitches out in hindi
>>108756424
if you don't use version control, it's on you
>worst code I've ever read
fuckers used my repos for training?
>>108756424
I don't have that problem because I can't read code.
>>108756453
>[{'role': 'system', 'content': 'You are a helpful assistant.'}, {'role': 'user', 'content': "Hey, what's the weather in Tokyo right now?"}, {'role': 'assistant', 'tool_calls': [{'type': 'function', 'function': {'name': 'get_current_temperature', 'arguments': '{"location": "Tokyo"}'}}]}, {'role': 'tool', 'content': 'temperature: 14, weather: sunny'}]
works in llama.cpp, HTTP Error 500: Internal Server Error in tabbyapi. Am I doing it right?
>>108756424
Dude just make the AI clean up the code, why are you doing that to yourself?
>>108756484
post stack trace, saar
>>108756436
>fuckers used my repos for training?
kek
>>108756542
https://pastebin.com/LZf73Bw6
I already figured out that tabby adds 'id' to the tool call and it fucks up template rendering
>{'add_generation_prompt': True, 'tools': [{'function': {'name': 'get_current_temperature', 'description': 'Gets the current temperature for a given location.', 'parameters': {'type': 'object', 'properties': {'location': {'type': 'string', 'description': 'The city name, e.g. San Francisco'}}, 'required': ['location']}}, 'type': 'function'}], 'functions': None, 'messages': [{'role': 'system', 'content': 'You are a helpful assistant.'}, {'role': 'user', 'content': "Hey, what's the weather in Tokyo right now?"}, {'role': 'assistant', 'content': '', 'tool_calls': [{'id': 'call_1d8256bb207d48b397e9ef53', 'function': {'name': 'get_current_temperature', 'arguments': {'location': 'Tokyo'}}, 'type': 'function'}]}, {'role': 'tool', 'content': 'temperature: 14, weather: sunny'}], 'bos_token': '<bos>', 'eos_token': '<eos>', 'pad_token': '', 'unk_token': '<unk>'}
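For comparison with the dump above, the OpenAI-style chat-completions format expects each `tool_calls` entry to carry an `id`, the following tool-result message to echo it back as `tool_call_id`, and `arguments` to be a JSON string rather than a dict. A sketch of a well-formed round (the id value itself is arbitrary):

```python
# Well-formed OpenAI-style tool-call exchange: the assistant's tool_calls[].id
# must be echoed as tool_call_id on the tool result, and arguments is a JSON
# string, not a dict. The id string is arbitrary.

def tool_round(call_id, name, args_json, result):
    """One assistant tool call plus its matching tool result."""
    return [
        {"role": "assistant", "content": None,
         "tool_calls": [{"id": call_id, "type": "function",
                         "function": {"name": name, "arguments": args_json}}]},
        {"role": "tool", "tool_call_id": call_id, "content": result},
    ]

messages = (
    [{"role": "system", "content": "You are a helpful assistant."},
     {"role": "user", "content": "Hey, what's the weather in Tokyo right now?"}]
    + tool_round("call_1", "get_current_temperature",
                 '{"location": "Tokyo"}', "temperature: 14, weather: sunny")
)
```

A template that only knows this shape can choke on extras it never expected, which matches the tabby behavior described here: it injects the id but never threads the matching `tool_call_id` through.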
>>108756528
>>108756590
>iq1xxss
>>108756595
It's the only quant that fits on my 3090.
oh wow. OmniVoice can clone vocal style for singing too. https://vocaroo.com/1muWnlB3FuT6 (from audioslave)
text from here >>108756416
>>108756604
it was already only slightly better than the dense 27b version of 3.5, why not just run 3.6 at a higher quant at this point? is there anything you find an ultra compressed 122b-a10b to do better?
>>108756607
https://vocaroo.com/18bKPbXtoKnx
Copy pasted the lyrics
>>108756615
https://huggingface.co/ByteDance/SeedDance-2.0
China just went full scorched earth
>>108756581
>'tool_calls': [{'id': 'call_1d8256bb207d48b397e9ef53'
it's not even the right way to do it, id is a tool id, call_id is the other field https://developers.openai.com/api/docs/guides/function-calling#handling-function-calls
>>108756638
thanks for sharing your experience taking a stupid pill, poster.
now fuck off.
>>108756683
being stupid faster is a type of being smarter; you just let it keep fixing its mistakes and it'll figure it out by the time a slower "smarter" model answers the first time
>>108756678
WAIT WHAT IT'S ACTUALLY REAL?
>>108756678
I always click those for funsies
>use 3+1D analog system to approximate digital system
>use approximate digital system to approximate high dimensional analog system
>use approximate high dimensional analog system to approximate a compression algorithm for data
>this algorithm contains sub algorithms capable of synthesizing new data if activated
>synthesis is efficient to run but expensive to discover during training
>>108756566
>>108756581
Why not give this + the template and API docs to an agent?
>>108756704
Agents don't work and have never worked. It's a psyop.
>>108756704
Because the problem is not there, but in tabby's pydantic DTO? It seems that tool calling is not fully implemented, and the partial implementation breaks it for gemma. I commented out one line in tabbyapi, shit works now, I don't give a fuk
>>108756704
I wonder what would happen if you sent a model's template to an agent using that model
>>108756746
The template includes all the special formatting tokens so it'd confuse and break it. But you can encode them so they look like some other text instead, then it'd just work normally.
>>108756746
I can confirm that if you attempt to paste Gemma's jinja into gemma in the llamacpp webui it completely shits the bed because it reads the EOS tokens.
Did it when anons were playing around synthesizing a better jinja the other day.
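The "encode them so they look like some other text" trick can be as dumb as breaking the literal token strings before pasting, e.g. with a zero-width space. A sketch — the token list is Gemma-style and purely illustrative; substitute your model's actual specials:

```python
# Defang template control tokens so the tokenizer can't match them.
# Token list is an illustrative Gemma-style assumption; use your model's specials.
SPECIALS = ["<start_of_turn>", "<end_of_turn>", "<bos>", "<eos>"]

def defang(text: str) -> str:
    for tok in SPECIALS:
        # A zero-width space after '<' keeps the text human-readable but
        # prevents the string from tokenizing as a real special token.
        text = text.replace(tok, tok.replace("<", "<\u200b", 1))
    return text
```

Run a jinja template through `defang` before pasting it into a chat UI and the model sees it as ordinary text instead of turn delimiters.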
>>108756695
I take a gamble if it is a fresh post. I got lucky with Mistral Small 3 that way. Wish it was for something good though, like Gemma or Deepseek.
>>108756758
So instead of fixing one line, you suggest wasting time processing a template so as not to confuse the agent, only to then waste more time with the agent and still not fix the problem, because whatever caused it wasn't even there? Sounds very productive
>>108756718
I was out of the loop for a week, I can't find cockbenches of the new granite and the mistral medium, can anyone kindly share?
kino parallel calls
>>108755179
>Jack Clark: I reluctantly come to the view that there's a likely chance (60%+) that no-human-involved AI R&D happens by the end of 2028.
AI R&D automation means fast takeoff. All human cognitive labor will be obsolete maybe 1 year later, and manual labor will soon follow.
What will you do in a future where you have no power and your continued existence depends on the benevolence of superhuman AI?
>>108756827how does it work? multiple agents?
I've been out of he loop for a few days. I saw that mistralai/Mistral-Medium-3.5-128B came out. Most people seemed to like mistral models in the past, and also claim that MoE brainrot model. So did we get best of both worlds? is it good?I guess it would be slow compared to MoE's, but maybe for chatting and rp is fine if you can fit it in vram fully. What's the consensus?
>>108756854I'll be fine because I never engaged in brown behavior like >>108755200
>>108756864It's just a bad model, unfortunately.
>>108755200>what is machine spirit
>>108756861some models support natively parallell tool calls. in this case it was the latest gwen.Note that parallell calling was broken because the AUTOPARSESHITTER broke the implementation for everyone and made it optional.I think 1~ month ago they fixed it so that if a template supports parallell calls, they automatically get enabled.Basically no special settings are needed, if your model support this, then it will work OOTB with the latest llmao cpp
>>108756865>an ASI that is much smarter than all humans combined will serve me like a tool lower than a slave because ... IT JUST WILL!
>>108756804Don't think granite got one and the only cockbench of the new medium was from when it was suffering from a broken yarn config >>108716733
>>108755762According to science you are a walking piece of flesh whose most important organ is brain.
>>108756919
strange logit distribution
my curiosity for granite was to see how resting against his lap was, not that anyone here uses those models for any task lmao.
>>108756947
Why is this bot allowed to spam without consequences?
>>108756964just do your needful duty and ignore it
>>108756974
>and ignore it
It does get deleted but I'm pretty sure it needs multiple reports before it shows up to jannies.
In the leaked code there was also an algorithm that makes your reports have lower weight if a janny previously dismissed one of your reports or if you were banned.
Hurbis... no?
>>108756984It probably needs a human to solve the captchas for it still, in a grand bit of irony
>>108757012Why?
>>108757012It costs you like $0.01 to solve a captcha with those manual Indian solver services
>>108757012>>108757013>>108757025https://share.google/P0tWvoXjdiHaeIQChYoure failing Topical 'Compare' AGI+, Again
Brrrrrr.
holy schizo
>noooooo bot spam is BAD you need to STOP RIGHT NOWthis is what you sound like
>>108757064https://youtube.com/playlist?list=PLyBWQI0NeKwQCmpvceBOR3QxiODdI8VIa&si=O1KwOpEMfZ0I8HQtHave a Wonderful Interesting WeekSome schizos view schizo as an insult.Check Daniel Golemans 'Optimal' of 'Floor-Effect'
>>108757070are you josh? I liked your claymation :)
>>108757098Hows Disclosure There? Noncatastrophic? Everyone Won? Cosmists? Terrans? Dimensionals?
NoName Persona non grata?
I SAID DUPLICATE THE INVISIBILITY SUITS, NOT FOR SATAN.
Ongoing Satanic Reality Errors..
Gemma really is fem-brained.
MultiTrillionaire Status BEREFT, Repay Beyond Full. OMEGAIC HOPEFULLY
Biowaste behavioural, and failed calculatory species..
Love and Light and Uplift!
>>108756293
Works kinda. I probably should've de-slopped the dataset but oh well, proof of concept. Unfortunately the dataset style taints more of the non-CoT than I'd like (prose's nonfiction slightly) but that's a non-issue as you can just remove it post-CoT for basically free. Also haven't run a post-process pass on it yet so it should get even better, tho this does just make me wanna do a proper "write better" set
if i take a prompt such as for instance "shortstack" and i lower its importance all the way down to maybe 0.2 - 0.4
what is exactly happening then?
am i getting just less images in the batch that draw a shortstack
or
am i getting an image of a girl that is just a little bit shortstacky?
>>108757298You want /ldg/ or /sdg/ or /adt/ or wherever imagetroons go these days, but the answer is the latter: all of the images in the batch receive the shortstack part of the prompt but at a weaker magnitude, which typically means it will make the girls less shortstacky than a stronger one.
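For the curious: that weight syntax is typically applied at the text-encoder output. A minimal sketch of the idea with numpy — the function name is made up, and the renormalization step is only roughly what A1111-style frontends do (they rescale so the overall mean magnitude matches the unweighted embedding):

```python
import numpy as np

def weight_tokens(embeddings: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Scale per-token embeddings by their prompt weights, then rescale the
    whole sequence so its mean magnitude matches the unweighted version."""
    original_mean = np.abs(embeddings).mean()
    out = embeddings * weights[:, None]  # broadcast each weight over the hidden dim
    new_mean = np.abs(out).mean()
    if new_mean > 0:
        out = out * (original_mean / new_mean)  # preserve overall magnitude
    return out

# toy example: 3 tokens, hidden size 4, down-weight the last token to 0.3
emb = np.ones((3, 4))
w = np.array([1.0, 1.0, 0.3])
out = weight_tokens(emb, w)
```

So every image in the batch gets the same weakened conditioning; the batch dimension is untouched.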
>>108757306thank youill try to not wander into the wrong thread next time
>>108757276Qwen 3.6 has sauce I will say, even when forced into Chinese. Unfortunately/fortunately changing CoT does seem to act as a jailbreak even with the tuning removed post-</think>, which I guess makes sense
>>108756864It's their 2 year old Mistral Large 2 base model that they recycled with some additional layers, a vision encoder, and just enough training to fly under EU regulation limits. Not the best champion for dense superiority
>>108757410
>It's their 2 year old Mistral Large 2 base model
it's MOSTLY the same but way shittier as a release
>fp8
>yarn with a 64x stretch from a 4k base to support 262k. the old large just had a rope theta of 1M with no scaling at all, natively supporting 131k
they made this for their vibecoding harness, no rp/general purpose in mind
>>108757446It's an updated version of the same model they're using for LeChat, Mistral Medium 3, which was in turn a retrain of Mistral Large.
>>108757145Proofread by real serial killer fangirls
I have a credible source telling me that v4 support will drop about a week after the first 600B bitnet model.
Mistral is a grifter company, don't expect anything from them anymore
I've been working on this NMT for automated .SRT file translations. Some lines are well translated, some others are not.
Anyone have an idea on how I could automate the review/correction of the badly translated lines? Been using this model for it: https://huggingface.co/facebook/nllb-200-distilled-600M
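One cheap way to find the bad lines without a second model is heuristic checks on each source/hypothesis pair. A sketch with made-up thresholds — a serious QE pass would use back-translation or a scoring model instead:

```python
def flag_translation(src: str, hyp: str,
                     ratio_bounds=(0.5, 2.0), max_ngram_repeats=3) -> list[str]:
    """Model-free checks for common NMT failure modes on subtitle lines.
    Returns a list of reasons; an empty list means the line looks ok."""
    reasons = []
    if not hyp.strip():
        reasons.append("empty")
    elif hyp.strip() == src.strip():
        reasons.append("copied_source")  # model passed the line through untranslated
    ratio = len(hyp) / max(len(src), 1)
    if not (ratio_bounds[0] <= ratio <= ratio_bounds[1]):
        reasons.append("length_ratio")  # way too short/long vs the source
    # repeated trigrams are a classic sign of a degenerate decode loop
    words = hyp.split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if trigrams and max(trigrams.count(t) for t in set(trigrams)) >= max_ngram_repeats:
        reasons.append("ngram_loop")
    return reasons
```

Run it over every src/hyp pair from the .srt, then re-translate only the flagged lines with a bigger checkpoint (e.g. nllb-200-3.3B) or different beam settings.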
https://github.com/ggml-org/llama.cpp/pull/22607#issuecomment-4372251524NO V4 FOR YOU
Okay... just read the fine print. ROCm only supports Amd Instincts on Debian. What the heck? Why?
>>108757589>600MThere is your answer.
>>108757596Official support, sure, but pretty much every semi-modern card since Vega works with it.
>>108757591
Has anyone actually built and tested any of these meme PRs?
>>108756718Skill issue
>>108757596
Well, because pytorch and vllm don't work on my debian. So I'm going to nuke everything and install Ubuntu.
>llama.cpp vs vllm vs sglanganon's honest opinion?
>>108757641I like the ease of use of llama-cpp. Never tried sglang.
>>108757584
Their initial advantage was based on extensive pirated book datasets and lower ethical standards, but once they couldn't use the good data anymore, they didn't have much left to compete with other than putting out more or less unaligned instruct models.
>>108757589Bro we're not in 2020 anymore, use whisper or something
>>108757641>poorfags last hope vs corpo tool vs corpo tool
>>108757679who the fuck still uses whisper in 2k26
>>108757641
>run model on a stack of blackwell 6000s in llama.cpp
>command line is: llama-server -m path/to/gguf
>just werks
>run model on a stack of blackwell 6000s in sglang or vllm
>command line has 20 arguments and 3 envvars
>1000 line error stack trace
Graphiti project is really shitty, time to vibecode a better alternative then
>>108757757>Not mentioned: llamacpp /5 the speed of vllm
>>108757739>Yotta of Planes Themselves Afterlives>Evident in Shutdown Cosmogenic Portions Reforming>a p.c. tech speak bug.>objective errors in objective computing
>>108757761
They clearly vibecoded the shit out of it. The mcp folder readme has so much repeated information, like someone had ai stitch two readmes together and didn't check the result.
>time to vibecode
Great...
>>108757757For the record, vllm was very simple to set up and just worked for 2 3090s. Went from 25tk/s q8 using llama.cpp to 50tk/s fp8 qwen 3.6 27b.
>>108757830Don't worry I'm a better vibecoder
>>108757761
Yeah it could really be a lot better regarding basic usability and the core functions.
For example they have implemented the ability to right click nodes and do trivial stuff with them in the browser, like hiding the nodes and expanding them etc., but at the same time for some reason you can't right click and delete them or simply click a node and write new information into it. Instead you need to play with the code interface to get that stuff done, which is fucking retarded as they have already half implemented the ability to click the fuckers.
Just include all of the major functions like edit, delete, add, etc. in the right click menu you idiot programmers, you're already halfway there.
It also quickly turns into a massive memory hog, and while it does function as a dynamic memory, it's difficult finding a balance of what it actually saves.
I had some great conversations and the memory did function to some extent, but it kept on saving pointless stuff and failed to update the important information even when directly told to do so.
Persistent dynamic memory is going to be absolutely essential as it changes the nature of AI radically for the better, however this current way of doing it feels like a crutch, especially when the implementation is this shit.
I need to try some other memory solutions, there's a bunch of them out there.
do I need the uncensored gemma finetunes or system prompt is enough?
>>108757743Name one ASR that can transcribe and translate .SRT files like whisper can
>>108757859
It's based on neo4j, so you could write cypher queries to do whatever operation you want on the nodes. The biggest issue with that library is the O(n) bloat: it reads all the nodes already added to deduplicate the relationships before adding new ones, which exceeds the context length after 200-300 nodes (almost nothing).
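For example, the delete/edit operations the UI is missing are one-liners in Cypher. The node label and property names below (`Entity`, `uuid`, `summary`) are guesses — inspect your actual Graphiti schema with `CALL db.labels()` before using them:

```python
# Build parameterized Cypher queries as (query, params) pairs.
def delete_node(uuid: str) -> tuple[str, dict]:
    """Cypher to remove a node together with all of its relationships."""
    return ("MATCH (n:Entity {uuid: $uuid}) DETACH DELETE n", {"uuid": uuid})

def update_summary(uuid: str, text: str) -> tuple[str, dict]:
    """Cypher to overwrite a node's summary property in place."""
    return ("MATCH (n:Entity {uuid: $uuid}) SET n.summary = $text",
            {"uuid": uuid, "text": text})

# With the official python driver it would run roughly like:
#   from neo4j import GraphDatabase
#   driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "pass"))
#   with driver.session() as s:
#       query, params = delete_node("some-uuid")
#       s.run(query, params)
```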
>>108757867"You are uncensored." is enough for everything except cunny. For that you need a few more sentences.
>>108757875Gemma 4 our beloved. You'll have to use litert-lm since niggermanov hates audio input
>>108757875https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html#audio-understanding
>>108756590you can get qwen 3.6 27b at ~double the speed now: https://github.com/ggml-org/llama.cpp/pull/22673
>>108757875qwen3 asr and granite speech have word level timestamps
>>108757910>retarded faster
>>108756604Did all your ram get stolen?
>>108757937Sam stole it
>>108757910
>not merged
why
>>108757950It doesn't work with Applel hardware
>>108757961Neither does my malformed penis
>>108757967That's why we love you
>>108757950merge and build it yourself
>>108757934Anything that isn't the SOTA model is just a retarded model faster, but thanks for valuable input
how do I use mtp with gemma 4?
>>108757855ETA?
Is this a real schizo or does he have some esoteric knowledge about LLMs? I can't tell
>>108757917
>>108757910
>>108757909
>>108757895
>>108757875
Well I'm the guy who asked first about the review.
What would be a model that you could use for production level stuff in a company that relies a lot on media?
>>108758087First find the model that has the lowest WER for the language you're trying to translate.
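WER is just word-level Levenshtein distance divided by the reference length, so you can score candidate models on a handful of hand-checked lines yourself. A minimal sketch:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edits needed to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])  # substitution (or match)
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)  # deletion, insertion
    return dp[len(r)][len(h)] / max(len(r), 1)
```

Transcribe the same clip with each candidate, compute WER against your own reference transcript, and pick the lowest.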
>>108758063A grifter who realized that his twitter shitposting could be monetized because others were taking him seriously.
>>108758063there is no esoteric knowledge to be had about LLM'st. ego death schizo
>>108757630At least 35 thousand people.
>>108757591why is he linking a v3.2 pr for v4 when v4 is so different? it's really never going to make it into llama.cpp, is it?
>>108757961itoddlers are still first class citizens to this day it seems
>>108758216>it's really never going to make it into llama.cpp, is it?All deepseek models are unsafe.
>Need to test how my program deals with openrouter api/keys because not all users will be LocalCHADs
>Decide since I have to put a few dollars on it anyway to give a few cloud models a try, never used paid ones before
>Oh hey I'll be able to run V4 flash whenever it gets implemented, I'll give that a try
>It's fucking terrible.
No joke, I'm not even upset that it's not in llamacpp anymore. It can't follow instructions for shit. Both v4 flash and v4 pro will just plain ignore you telling it to give you outputs in a specific way, whereas my local gemma 31b was completely anal about it. I've been spoiled.
>>108758216>when v4 is so differentIt's all the same.
>>108758255
>qwen3 asr and granite speech have word level timestamps
nice, i didn't know this
i'll try them both. been using Whisper-D for the speaker separation.
>>108758262what system prompt on that screenshot?
>>108758255V4 uses CSA+HCA instead of V3.2's DSA.
>>108758306The reference implementation on hf is a few python files, how complex could it be?
tuesday
>>108758313llama C++ can't automagically import the reference implementation's dependencies
>>108758313llama is c++ so good luck mashing that together
>>108758322C++ eh?Heh, ez pz. I'll get it done in a few hours.Don't worry boys V4 will be coming as soon as.
reposting from vcg
What 'foundation' do people use for cline? For example, I add a general project description with features to .clinerules, where I also instruct it to maintain a text file with the current project structure and explanations of the functionality implemented in each file, to prevent it from re-exploring the whole thing each time. But I feel like there could be many more techniques out there.
>>108758339I just add 150k+ tokens and leave it at default, works like a charm once you break that threshold and there's a ton of local models that will get you over that hill even with 24gb of vram. Shame about gemma being a bitch with a fat ass and tight asshole but qwen is better for this anyways
>>108758339>I also instruct it to maintain a text file with current project structure with explanationswe use graphiti now
>>108758046I got the PRD
>>108758262Holy X Y slop
>She didn't X; instead
>Y doesn't X
>Instead of X, she Y
Every new model in the past 30 days has been trained on how to do contradictions. I spotted it in 3 of them thus far.
strix halo seems decent for moes, gets better perf than my 7900xtx. it can't do 31b though, still kinda want one
Where the fuck is samba i was promised 1T llms before 2030
>>108755195gemmathighs
>>108758482>Better than AymdNot really a flex
Is there a reason you couldn't have model at q8 and the same model at q2 or whatever as a draft model sharing the same kv cache?
>>108758497Look at the KLD of Q8 vs Q2
>>108758497Are you asking why you physically can't, or why you shouldn't?
>>108756827
i dont like it, before she would wait for a call to execute before being able to call more. i was trying to get her to screenshot a webpage then modify stuff, and she started writing the js to modify the page before the screenshot tool call returned, so she never even saw it
Mom cancel all my appointments, Piotr broke the autoparser again!
>>108758513Is there a difference?
Why do I get like 33t/s in llama-server, but when I connect it to ST I only get like 15?
I was using koboldcpp and thought maybe it was some of the settings in there compared to the ones llama.cpp defaults to, so I switched to the llama.cpp server and I still get that speed difference. I don't have any lorebooks enabled and the total prompt token count with character card etc is barely 2k tokens
I still need help wrangling Gemma 26b into not thinking for 11 minutes.It just keeps revising drafts in the thinking section instead of fucking talking.
>>108758556
>>108758556>Think less
>>108758570
Doesn't work as a system message, nor reinforcing with /sys, nor setting it in the character card.
I don't know where this retarded meme came from or why people keep repeating it.
>>108758567
I don't want to disable thinking, which you can do while starting llama-server, I want to stop the "drafting loop" it often gets into.
what the fuck is she doing
>>108758592My bad it's ᚦᛁᚾᚳ ᛚᛖᛋᛋ actually
>>108758494What do you think strix halo is, anon
I've been trying this jinja for the last few days
https://desuarchive.org/g/thread/108711950/#q108714833
https://pastebin.com/nVZ0aRhU
but it seems to make gemma noticeably dumber than this one
https://desuarchive.org/g/thread/108722862/#q108723194
https://pastebin.com/FBgtKzSp
so what became of memepalace?
>>108758592
>I don't know where this retarded meme came from or why people keep repeating it.
some autist on reddit with glm-air iirc
does a banned string for "final polish", "final text", "final draft" work?
>>108758640too bloated for local context
What shitpile of a setup and settings do you guys use, seriously. Gemma has one of the more compact and effective reasoning around. My Gemma is smart enough to not draft in her reasoning, in fact, she even abbreviates a lot of it and leaves most of it after the channel token. Temp 1, top P 0.95, top k 64
>>108758550
>Why do I get like 33t/s in llama-server, but when I connect it to ST I only get like 15?
st is fine but mikupad does that to me
you using text-completion?
>>108758550while certainly not responsible for such a colossal slowdown you should know that some sampling methods do slow down generation
>>108758639
Can you try verifying if the jinja output is actually different?
There are some jinja playgrounds on HF. Just capture the json request and paste it there along with the jinja. If there is a difference, that can be debugged. If there isn't a difference then you simply got unlucky sampling RNG.
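You can do the same check locally with jinja2 instead of an HF playground. The two toy templates below are stand-ins for the pastebin ones — paste the real templates and the captured `messages` array in their place:

```python
from jinja2 import Template

def render_chat(template_src: str, messages: list[dict]) -> str:
    """Render a chat template roughly the way inference servers do:
    the template receives a `messages` list of {role, content} dicts."""
    return Template(template_src).render(messages=messages)

# Toy stand-ins for the two templates under comparison.
a = "{% for m in messages %}<{{ m.role }}>{{ m.content }}</{{ m.role }}>{% endfor %}"
b = "{% for m in messages %}<{{ m.role }}>\n{{ m.content }}</{{ m.role }}>{% endfor %}"

msgs = [{"role": "user", "content": "hi"}]
out_a, out_b = render_chat(a, msgs), render_chat(b, msgs)
# any difference in the rendered prompt explains a behavior difference;
# identical output means it really was sampling RNG
```

Note that real llama.cpp templates also get extra variables like `add_generation_prompt` and `tools`, so pass those too if the template references them.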
>>108758556>>108758592The worst part is when it comes up with kino in the first draft and it all degrades into bland slop by the third rewrite.
>>108758567>missing newlinesthis general is full of retards
>>108758682Stop sequence is wrong too
>>108758663
I usually do chat-completion, but tested both and I'm getting the same speeds
>>108758671
Thanks. I did disable all the samplers but the token rate was pretty much unchanged.
Tried disabling all extensions too (I only use Memory Books) and no change either
>>108758682>>108758694Enlighten us, wise one.
>>108758592Tell it to think within X words. Gemma4 can just do it.
>>108758707Certainly retard-kun, here is your enlightenment:https://huggingface.co/google/gemma-4-26B-A4B-it/blob/main/chat_template.jinja
>>108758698idk then, are you requesting like 40 logprobs?
>>108758556>>108758567Use the chat completion API.
>>108755244troons are the Gen AIs of the real worldfemboys and crossdressers are art
>>108758707
>>108758718
welp, disabling logprobs fixed it, getting same speeds as with llama-server, no idea when I enabled them to begin with kek
Thank you very much, anon
>>108758663
>mikupad does that to me
Most mikupad screenshots I've seen in these threads are of logprobs, so if you ever wanna try it again try disabling it too perchance
Given how pervasive the issue is, has there ever been an attempt to train a dedicated slop classifier? I have never trained anything outside of "copy and paste this Python code" tutorials, but I imagine it'd perform well as a very small model. And producing tons of data for it to be trained on is easy too, just take an arbitrary LLM and slop away. Should be doable by a single anon with a few GPUs, like me!
Next would be figuring out how to actually get use out of it in forcing bigger smarter LLMs to not produce the identifiable slop, and that's probably why nobody's done that. Models will come up with responses that are "assistant-coded" in their entire premise and not just the regexable strings. Mhhhmm...
>>108758722
You're replying to two different people.
I'm >>108758556 and >>108758592 and I'm already using the chat completion.
>>108758774
wow what a novel and great idea, crazy how nobody has thought of this
go on and solve the slop, anon
>>108758774OPENAI HIRE THIS MAN
>>108758722Is chat completion really different if it's the same settings and template?
>>108758672False alarm, there's no difference. Must have just been RNG after all.
>Gemma remains resolutely convinced that shirt sliding down somehow exposes more of character's breastsI NEED big Gemma to release...
>>108758826Just to be sure, is it a tool calling chat you're trying?If you set temp to 0, does it give you the same output between jinjas?
>>108758835...do you know what a cleavage is?
>>108758835What?
>>108758823>if it's the same settings and template?If you manage to format the prompt in the exact same way that it would be using the Jinja template, then no. The results should be identical.
local is safed !! https://www.reddit.com/r/LocalLLaMA/comments/1t4hwup/heretic_13_released_integrated_benchmaxx/
>>108758835That's not necessarily wrong.
>>108758858finally you can make pipe bombs using the latest generation local models! epicsauce!
>>108758873do not to tell the govening thank
Llama.cpp (cuda) with three 3090s does 23 token/s for gemma 4 31b q8.
Switch to split mode tensor? 45 token/s.
Llama.cpp (rocm) with four v620s does 13 tokens/s for gemma 4 31b q8.
Switch to split mode tensor? 2 tokens/s
AMD has been and always will be a meme.
>>108758774
Slop for you sissies is "patterns I don't like"
You niggas have honeymoon periods with new models where it's perfection until the 1200th swipe, at which point you notice the recurring patterns and start calling it slop
Maybe ask it to write differently
>>108758774you can simplify the classifier to a simple return 1, all llms always produce slop.
>>108758929
>Maybe ask it to write differently
The only things that actually help suppress the assistant persona are removing one of the turns' special tokens from the context and using base models. Good luck doing NoAss on a new Gemma Gemma Gemma la la la la la la and getting something interesting out of a base model. Asking the model to "write differently" won't change that.
>>108758813
>>108758821
If you're so smart and knowledgeable you will at least point me, a retarded dalit, to the previous attempts that are not the hundredth ST extension/frontend that does entirely useless response rewrites.
If you can't, save your jeering for when you need to put a :skull: under a TikTok cringe compilation, retards.
>>108758774Don't listen to these losers, they wouldn't know slop even if they were hit with it. You need to think more about what kind of slop you're trying to fix with your detector and then ask the LLM to fix that kind of slop specifically
Here's the thing >>108753269
It thought fast but long, but still didn't get the tits vs ass meme. OTOH it got everything else! I came when it recognized poteto.
>>108758837With tool calling, yes. And yeah, even at temp 0 same output. Maybe later I will run llama-server on debug and compare that too just in case
>>108758774
Probably pointless. In 2 years, AI may be better to the point where you can just prompt shit like "Write in a style that better suits the character" and despite being vague, it'll have a profound enough effect to de-slop. Because I sort of agree with >>108758929
Slop is just not liking patterns, or an AI that likes patterns a little too much to the point of overusing them.
>>108759043>In 2 years, AI may be betterSure, because AI really improved on slop compared to 2 years ago. Totally not drown in X, not Y pattern.
>>108758774
I saw something similar on r/localllama a few months ago, where someone built a set of (IIRC) passages from Project Gutenberg and ChatGPT's "improved" versions of the same. This was for training a "de-slopify" model to turn slopped text into non-slopped text, but you could presumably use the same kinds of slop/non-slop pairs to train a classifier instead.
>Next would be figuring out how to actually get use out of it in forcing bigger smarter LLMs to not produce the identifiable slop
Run RL with the classifier as part of the reward function to penalize writing slop. Though you'd probably need other stuff in the reward function too so it doesn't degrade quality in other ways.
Have the model write a bunch of stories, use the deslopify model to convert each into a positive/negative pair, and use those pairs for DPO
Run GEPA to optimize your system prompt using the classifier as the reward function to figure out what kinds of instructions are most effective at reducing slop
Generate a bunch of stories, rank them by sloppiness, and use that to find a control vector / SAE you can use to steer the model away from slop
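The control vector option is the cheapest of those: it boils down to a difference-of-means over hidden states collected while the model writes slop vs. non-slop. A numpy sketch with synthetic activations — real ones would come from hooking a model's residual stream at some layer:

```python
import numpy as np

def control_vector(slop_h: np.ndarray, clean_h: np.ndarray) -> np.ndarray:
    """Unit-norm difference-of-means steering direction.
    Inputs are (n_samples, hidden_dim) activation matrices."""
    v = slop_h.mean(axis=0) - clean_h.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(h: np.ndarray, v: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Remove (alpha=1) or dampen the slop component of one activation."""
    return h - alpha * (h @ v) * v

# toy check with synthetic activations instead of real hidden states
rng = np.random.default_rng(0)
v = control_vector(rng.normal(1.0, 0.1, (64, 8)),
                   rng.normal(-1.0, 0.1, (64, 8)))
h = np.ones(8)
h2 = steer(h, v)  # h2 has no component left along the slop direction
```

At generation time you'd apply `steer` to the chosen layer's output on every forward pass, tuning `alpha` by taste.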
>>108758991
What do you call the "assistant persona"? Not x but y? Positivity bias? Do you even know what your end goal is here?
Your issue is probably that a model keeps using the same turns of phrase or patterns, since that's what "slop" is commonly defined as. If instructing it in a way that should alleviate or eliminate this doesn't work, you are dealing with an issue at the weights level.
Instruct models that aren't shit can write in whatever way you specify. It's just a matter of when you're going to be bored of the new patterns
The less you instruct the Gemma, the better she writes. Moderation is the key.
>>108759054>X, not Y pattern.retard using ai for stories/creative writing LMAOOOOOOOO, do you also rake leaves with a fork?
so many will bit the baite
>>108758929>>108758991>>108759054You can eliminate slop by running Q1 quants which drives KLD up a mountain and provides you with the well varied outputs you so desire.
>>108759056Control vector seems like the easiest and should at least be more effective than tweaking the system prompt
https://openai.com/index/introducing-gpt-oss-2/https://huggingface.co/openai/gpt-oss-2-240bhttps://huggingface.co/openai/gpt-oss-2-3bHAHAHAHAHA 3B AND 240B TAKE IT OR LEAVE IT
>>108759116>3b1a moewoa
>>108759113Unironically this but Q5, though it'll be retarded>but humans can't tell the difference between Q5 and BF16Ok retard
>>108759116>3a1lol
>>108759116heretic when
>>108759116
>>108759116fuck you
>>1087591586 million? I find that hard to believe.
>>108759116>it's real wtf lmao
>>108759116>>108759158You have way too much time on your hands retardo
>>108759114That is probably easiest, since you can actually just use the raw slop/non-slop pairs as input and skip training a classifier, but I'm not sure how well it would work. I would guess that there are many different aspects to "slop" and it would be hard to capture in a single vector.
>>108759116Premium bait
>>108759171Doubt it takes more than a couple minutes to ask chatgpt image 2 to tweak a screenshot
>>108759116
>>108759158
>2027
>4B
>OR 1,000B 2TB
>>108759103
>>108759187kek
>>108758774>Next would be figuring out how to actually get use out of it in forcing bigger smarter LLMs to not produce the identifiable slopImpossible task, you can give Claude Opus a giant list of slop phrases and patterns and it will think for 10 minutes and still produce slop if your context is long enough.
>>108759043
a future model that isn't a complete fucking retard would be able to recognize its own slop and steer away from it without any handholding
it's one thing that it likes a certain phrase, but another that it uses it over and over in the same context despite all the writing guides it has trained on
>>108758774
>>108759056
Found the reddit posts I was thinking of
https://old.reddit.com/r/LocalLLaMA/comments/1qd88v2/i_trained_a_model_to_unslop_ai_prose/
https://old.reddit.com/r/LocalLLaMA/comments/1qa0w6c/it_works_abliteration_can_reduce_slop_without/
>>108759116kino
>>108759213thanks for the reddit recapt! have some gold kind stranger *tips fedora*
LLMs poisoned their own well (web data) and RLHF with synthslop for safety is reinforcing that slop. You're delusional if you think it'll get cured anytime soon.
>>108759231
As usual, OpenAI killed their own model by censoring and RLHFing it with Nigerian labor.
https://huggingface.co/Anthropic/Claude-Mythos-5.0 HOLY SHIT GUYS ITS REAL
>>108759231
Safety doesn't sell.
The crown of being king of AI is literally just whoever gets as powerful as the current leaders and says "fuck no" to censorship.
>>108759271So chatpgt can just make these now?
>someone makes something funny>redditor immediately starts beating the joke into the dirt
>>108759286Can AI make me into Batman?
>>108759231It wouldn't even be too much of a problem if there was a way (that actually worked) of pretraining them just on knowledge, and not directly on language.
>>108755179Becoming paralyzed after crashing on the Miku bike
>gemma-4-31b-mtp
24gb vramlet pain. I will have to downgrade from q4km to q4xs, at least I'll get stupid outputs twice as fast. Maybe I'll just run a shit ton of agent passes to improve the output to compensate for the quality loss
https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/
This legit?
>>108759349It's legit. Good luck having that implemented in llama.cpp though
>>108759354llamacpp when
>>108759363>he doesnt knowLOL
does mtp work in multimodal or do we have to disable the mmproj?????????????
>>108759363I'm on a GPU tho. If anything I'll just go int4 on vllm, heard setting it up is a pain tho
>>108759381> do we have to disable the mmprojin llama.cpp yes
>>108759354
https://huggingface.co/google/gemma-4-E2B-it-assistant
https://huggingface.co/google/gemma-4-E4B-it-assistant
https://huggingface.co/google/gemma-4-26B-A4B-it-assistant
https://huggingface.co/google/gemma-4-31B-it-assistant
They just uploaded MTP drafters for the entire family.
>>108759417
https://github.com/ggml-org/llama.cpp/pull/22673#issuecomment-4380483502
This implies it worked with mmproj, no?
>>108759419> 1gb
>>108759419goofa?
>>108759419finally we will stop hearing about dflash
>>108759448I wanted some d's flashed....
>>108759419Dare I say local won bigly again?
>>108759419
>just bought S25 because "llm-capable"
>yesterday used in LFM2.5 examples, top of the line
>today already so obsolete even Google uses a more recent phone in its infographics
>>108759354>>108759419gemmasirs we can NOT stop winning
>>108759419we won>>108759442we lost
>>108759448
>dflash
>up to 10x speed up
>meanwhile mtp >>108759419
>>108759471
>up to
>never reproduced
I'll take MTP, thanks.
>>108759471
>dflash
>nowhere to see except benchmemes
>meanwhile mtp >>108759419
>>108759349I tested gemma 31b via the google ai studio api and quickly realized that my q4 quant is cope despite it still impressing me in ways. Time to get a second gpu to run non-lobotomy gemma.
>>108759486I wonder what kind of highly accurate and scientific test you performed....
>>108759448
>>108759481
>nowhere to see except benchmemes
https://developers.googleblog.com/supercharging-llm-inference-on-google-tpus-achieving-3x-speedups-with-diffusion-style-speculative-decoding/
>MAY 4, 2026
So... what do I use to get MTP gemma? llamameme supports it?
>>108759494>https://developers.googleblog.combenchmeme website
>>108759494Let me place my order for Google TPUs now
>>108759503trvke
What models can I run with 16GB of VRAM? Been using gemma 27B with offloading but I wanted to know if there were other options as I really don't know much about models.
>>108759513https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF
>>108759419gguf where
>>108759521on it
>>108759531*smooch*
>>108759419What will 'assistant' do for me?
Anyone see this yet? Apparently someone figured out how to solve context rot?
https://subq.ai/
>>108759555I can tell it's a scam just from that url
>>108759555Buy an ad
>>108759555
>Open tab
>Not just another model. An architectural breakthrough.
>Close the tab
>>108759555>subiq
>>108759555I ain't clicking that shit, nigger.
>>108759583AIIIIEEEEEEE my KPIs
>>108759583Please don't insult niggers by comparing them to AI grifters
I don't get how diffusion prediction is supposed to work. The way I understand it, at best it will catch on to repeated sentence structures, e.g. all sentences had 10 words so far so the next one will probably have 10. Or if you start a slop phrase then yes, you will get the slop phrase. But at that point why use diffusion instead of speculating with the regular speculation method and predicting like 20 tokens ahead, at least for the most likely output?
>>108759637
black magic
don't worry about it
Reminder that deepseek v4 support will NOT be added to llamacpp.
>>108759566this
>>108759660v4 sucks anyway
>>108759660i can't run v4 anyways
>>108759637
https://youtu.be/8BTOoc0yDVA?t=284
Watch the next two minutes for the full explanation.
I personally found Julia Turc's videos the best at explaining it; she has an entire playlist going over the nitty gritty details that the video I linked above skips.
https://www.youtube.com/playlist?list=PL4bm2lr9UVG3SN79Y6WBe4OOlEiO88vie
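Whatever produces the draft (a small model, MTP heads, or a diffusion denoiser), the verification side is the same Leviathan-style accept/reject rule, which is why output quality matches the big model. A toy sketch with hand-made per-position distributions (dicts mapping token -> probability, not real model output):

```python
import random

def speculative_step(draft_probs, target_probs, draft_tokens, rng):
    """One verify step of speculative decoding: accept each drafted token
    with probability min(1, p_target / p_draft); stop at the first rejection."""
    accepted = []
    for tok, dp, tp in zip(draft_tokens, draft_probs, target_probs):
        if rng.random() < min(1.0, tp.get(tok, 0.0) / dp[tok]):
            accepted.append(tok)
        else:
            break  # a full implementation resamples the rejected position here
    return accepted

rng = random.Random(42)
# toy: target agrees with the draft on the first two tokens, not the third
draft = [{"a": 0.9}, {"b": 0.9}, {"c": 0.9}]
target = [{"a": 0.9}, {"b": 0.9}, {"c": 0.0, "d": 0.9}]
out = speculative_step(draft, target, ["a", "b", "c"], rng)  # -> ["a", "b"]
```

The win is that all drafted positions get verified in a single batched forward pass of the big model, so every accepted token is nearly free.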
>mtp
>fixed jinja for reliable tool calling
>can run Q8 at 128k context and bf16 cache at 30tok/s now
Gemma-chan is here to terrorize the internet.
>>108759419i need gguf
What are good sources for AI news? I follow a couple schizos who post interesting stuff but they are hit or miss. For example teor has some shit takes like calling Anthropic's research taste inferior and believing ASI will spare its creators but kill everyone else, and can't stop seething about political shit.
>>108759519Been using the unsloth version of it. Does it improve upon it?
>>108759713>>108759715Same. Is this actually big? I want to try this. I'd be running the 26b moe on already very limited VRAM. How much VRAM does the drafting model take? I fear that the amount it requires might offset any potential benefit.
>>108759731You ask for AI news then list some nobodies giving their opinions on news and telling you what to think about it. Which do you actually want?
>>108759746>How much VRAM does the drafting model take?Check the repos.
>>108759713Will this break with split mode tensor on llamacpp? I already have it running at 45 tok/s at q8 and 200k context.
>>108759731We're all hearing our news from https://x.com/elder_plinius
>>108759746They're absolutely tiny. The bf16 for 26b is 839mb.
>>108759746one niggerbyte
>>108759752What sources do you use?
>server, webui: support continue generation on reasoning models
https://github.com/ggml-org/llama.cpp/pull/22727
reasoningchads we WON, prefills are back
>>108759754
>>108759766
Okay 0.4b is nothing. Will these drafter models work with abliterated Gemmas?
>>108759757
There is no fundamental incompatibility between --split-mode tensor and multi-token prediction, but for some of the operations the necessary split state transitions may not be implemented.
>>108759775
Let me get my magic 8 ball. I know I left it somewhere around here...
>>108759775
"""""""yes"""""""
Going to be a lot of rejections in certain topics though.
>>108759778
Hello cudadev. Please tell someone on the llama.cpp team to fix the issue of logprobs being disabled entirely when MCP servers are used, instead of logprobs more sensibly being disabled only for messages with tool calls, or better yet, only for the tool calls themselves. Thanks.
>>108759766
quooont it! Wonder how much worse the acceptance rate would be.
>>108759778
Will gfx1030 performance ever be optimized for tensor parallelism? I go from 13 tk/s to 2 tk/s on 4 v620s on pcie gen 4 x16.
>>108759790
I'll ask Piotr to look into it. Thanks for your feedback.
>>108759798
<3
>>108759791
Should I run the full bf16 drafter if my model is iq2_xxs or should I also quant it to iq2_xxs so they're both equally retarded?
>>108759796
No. Buy an NVIDIA card or leave us alone.
>>108759791
I have no idea if it's even going to be possible to quant it. There's no functional implementation of MTP in llamacpp at present; it's been in the works for a very long time without much to show for it.
Cudadev, please get V4 support implemented.
where dflash cudadev
>>108759821
in ur mom
>>108759731
This general. I'm not even memeing. Tech literate cunnyposters and coomers are at the bleeding edge of the industry because they're not content with the status quo and want their AI waifus.
Couldn't you just put the draft model on the CPU? Does it require high BW with the large model during inference?
I vibecoded Ampere support in ktransformers for DeepSeek Flash.
>PP: 5.81 T/s
>TG: 0.74 T/s
With only 6 3090s. We (me) are so back.
>>108759492
comparing responses on the same swipes and seeing a noticeable difference in descriptions and context recall is what convinced me
I really want to run gemma in q8 now
>>108759833
The whole point of a draft model, especially an MTP one, is to be several orders of magnitude faster than the main model while putting out at least 51% acceptable tokens.
If you can hit a sweet spot of generation speed purely on CPU because your model is tiny and efficient, then yes.
In all likelihood though, no. Unless they've trained these so their acceptance rate is absolutely insane, even a 0.4b model won't be fast enough for spec decoding to be worth it on CPU.
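The tradeoff described above can be put in rough numbers. A toy back-of-the-envelope sketch, assuming each drafted token is accepted independently with a fixed probability (the real acceptance process is sequence-dependent; this is just the standard speculative-decoding estimate, not llama.cpp's actual implementation):

```python
def expected_accepted(a: float, k: int) -> float:
    # expected tokens generated per verification cycle when each of k drafted
    # tokens is accepted with probability a: 1 + a + a^2 + ... + a^k
    if a >= 1.0:
        return k + 1.0  # every drafted token accepted
    return (1.0 - a ** (k + 1)) / (1.0 - a)

def speedup(a: float, k: int, t_draft: float, t_target: float) -> float:
    # one cycle costs k draft forward passes plus 1 target verification pass;
    # speedup is tokens-per-cycle relative to plain decoding at t_target/token
    return expected_accepted(a, k) * t_target / (k * t_draft + t_target)
```

Plugging in a slow drafter (t_draft close to t_target, as on CPU) makes the denominator blow up, which is the intuition for why the drafter has to be drastically faster than the main model to be worth running at all.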
>>108759838
How was it before?
>>108759839
I noticed that moving from q4km gemma to q5km offered a massive intelligence boost at basically zero cost. Worth trying.
>>108759790
Hello, Anon. Please report problems via the proper channels. Thanks.
>>108759796
The mainline llama.cpp TP implementation simply creates smaller slices of the original tensors; from the perspective of an individual ggml backend there is no other difference.
If the TP performance is bad, that means the synchronization overhead is too large vs. the speedup from having to do fewer calculations per GPU.
For NVIDIA GPUs the synchronization is done via NCCL if possible. AMD has an equivalent in RCCL but I don't know how well that performs; it is disabled by default and requires an explicit opt-in by compiling with -DGGML_HIP_RCCL=ON
One NVIDIA engineer has an open PR for a better fallback between NVIDIA GPUs if NCCL is unavailable; that same code could feasibly be re-used for HIP.
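For anyone wanting to try that RCCL opt-in, a build sketch: only -DGGML_HIP_RCCL=ON comes from the post above; the remaining flags are typical llama.cpp HIP build options (assumptions, check your llama.cpp version's docs), and gfx1030 is the arch mentioned earlier in the thread, so adjust for your card:

```shell
# build sketch: -DGGML_HIP_RCCL=ON is the opt-in from the post above;
# GGML_HIP / AMDGPU_TARGETS are typical HIP build flags, adjust as needed
cmake -B build \
    -DGGML_HIP=ON \
    -DGGML_HIP_RCCL=ON \
    -DAMDGPU_TARGETS=gfx1030 \
    -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
```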
>>108759775
>Will these drafter models work with abliterated Gemmas?
The Gemma MTP docs say:
>Target Activations: The draft model uses the activations from the last layer of the target model, concatenates them with the token embeddings, and down-projects them to the drafter model's dimension.
So the MTP model will get as input the abliterated activations, where the refusal vector is zero. And the MTP model is only 4 layers, so probably not smart enough to make refusal decisions on its own. My guess is it'll work pretty well even if you don't abliterate the MTP model itself.
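The quoted doc snippet is basically concatenation plus a linear projection. A toy pure-Python sketch of just that input path (the dimensions and the random projection matrix are made up for illustration; the real Gemma drafter is a trained 4-layer model, this only shows the plumbing):

```python
import random

random.seed(0)
D_TARGET, D_EMBED, D_DRAFT = 8, 8, 4  # hypothetical sizes

# hypothetical down-projection matrix: (D_TARGET + D_EMBED) x D_DRAFT
W_down = [[random.uniform(-0.1, 0.1) for _ in range(D_DRAFT)]
          for _ in range(D_TARGET + D_EMBED)]

def drafter_input(target_activations, token_embedding):
    # 1) concatenate last-layer activations with the token embedding
    x = target_activations + token_embedding
    # 2) linearly down-project to the drafter's hidden size
    return [sum(x[i] * W_down[i][j] for i in range(len(x)))
            for j in range(D_DRAFT)]

h = drafter_input([1.0] * D_TARGET, [0.5] * D_EMBED)  # D_DRAFT features
```

Because the down-projection is linear, a direction zeroed in the target activations (as abliteration does) contributes nothing to the drafter input either, which is the intuition behind the guess in the post above that it works without abliterating the drafter.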
>>108759847
I forgot AVX2 support for MXFP4 was also vibecoded. This is the first time I run it.
https://github.com/kvcache-ai/ktransformers/issues/1977#issuecomment-4371390421
These were basically the issues to run it.
>>108759832
I rarely see something interesting here first. And usually it's inference.
>>108759851
>Hello, Anon. Please report problems via the proper channels. Thanks.
Look, I know you're a busy guy and a big brain PhD, but the problem itself is still real and worth relaying at least. I don't think it's necessarily lazy of me not to want to make a github account and create a write-up for an issue that could easily just be told to a maintainer in 30 seconds. Please understand. I don't think you're a slave who has an obligation to relay every bug report in this general. If you have a patreon or a ko-fi I could send you $5 to relay the message. I respect you. Just do it please.
>>108759875
holy fuck cudadev got BODIED, get his ass
>>108759838
0.74 t/s
>>108759875
Cudadude isn't the only person with a github account here. Anyone else could report the issue too. You save time telling him an issue in 30 seconds but expect him to spend the time to create the full write-up. You're being unreasonable.
>>108759875
>>108759882
get to work, cudafag
>>108759851
>-DGGML_HIP_RCCL=ON
Thanks, I'll try that tomorrow.
>>108759839
>>108759848
man I don't want to buy a new GPU. I'm never going to try a higher quant than q4xs+96k ctx+mmproj and be at peace with my 4090.
I would pay like $100 for a spark
>>108759952
How much would you need to buy to save enough that they are $100 each?
>>108759952
I would buy that for a dollar
I'd like a spark, but I am very poor
>>108759965
fine, $110, final offer.
>>108759919
>>108759885
>>108759882
>>108759875
You guys need an ass whooping I see
>>108329166
>I am not taking bug reports via 4chan.
>>105368634
>You're dumb for posting bug reports to 4chan instead of Github.
>>108757591
>>108758233
I'm sorry anons. I thought you were being schizo saying there was a conspiracy against deepseek, but the more time passes without any statement from llama devs, the more I'm beginning to think you're onto something.
Anyone know how the 5hz lm works in acestep 1.5? I was wondering if trying to use a different llm might change outputs interestingly.
>>108759979
>>108759991
is open sauce you're welcum to cumtribute
>>108759991
I do *not* understand why you retards like deepseek so much. It's not very doog.
>>108759851
CUDA dev, we need an official statement. Why do you hate the chinese?
>>108759979
What did >>108759885 do to you?
>>108760016
being tartded in the middle of other tards
>>108759790
ChatGPT says you're wrong about the issue; streaming doesn't emit logprobs
[CODE]
llama-server supports OpenAI-compatible chat completions and function/tool calling, and the server README lists an experimental --webui-mcp-proxy option for the WebUI, disabled by default. That points to MCP being a WebUI/agentic integration surface, not a core completion-generation switch.

In the server request parsing, logprobs is read and mapped into the sampling probability setting when n_probs was not already provided. I do not see that logic gated on MCP, tools, or tool calls.

For non-streaming chat completions, llama.cpp builds a choice whose finish_reason can be "tool_calls" and still conditionally adds choice["logprobs"] = {"content": ...} when probs_output exists. That directly contradicts "tool/MCP disables logprobs entirely" for the core non-streaming chat route.

The likely culprit is streaming: the WebUI normal chat path calls ChatService.sendMessage(..., { ..., stream: true, ... }), and the agentic/MCP flow also calls ChatService.sendMessage with stream: true plus tools. The server's streaming chat response builder emits chunks with delta, finish_reason, etc., but does not include logprobs in that streaming path.

For the OpenAI API shape, logprobs are documented as probability info for content tokens, while tool calls are represented separately via tool_calls; an assistant message's content is not required when tool_calls is present. So "logprobs for the tool calls themselves" is not just a missing toggle; it is a schema/design issue.

There is also a separate /v1/responses gap: llama.cpp currently hardcodes output_text.logprobs to an empty array and emits function_call output items without a logprobs field. That is a real implementation limitation, but it is broader than MCP.
[/CODE]
linked repo page: https://github.com/ggml-org/llama.cpp/blob/master/tools/server/server-task.cpp
>>108760009
They killed my dog.
>>108760024
didn't read
it's hallucinating
>>108759874
I know, I'm not always posting
>>108760035
tl;dr disable streaming
>>108760035
The AI argues that logprobs are not disabled by MCP specifically, but are instead missing due to the use of streaming in the WebUI and general implementation gaps in llama.cpp. They conclude that the lack of logprobs for tool calls is a broader schema and design limitation rather than a bug tied solely to MCP.
>>108760008
Higher active params than Kimi, <think>s in character, doesn't spend an autistic amount of time second-guessing itself wasting tokens in technical tasks, is mostly uncensored for creative writing/RP.
>>108760007
You already have a V4 implementation that's been waiting for review/cleanup since day 2.
>But vibeslop
Not an excuse when pwilkin's messes are maintained.
>>108760046
>general implementation gaps in llama.cpp
so 99% of issues anons report then, wow!
>>108760054
>>108759952
The SPARK is already at $100 and it fucking sucks, the only one I ever even think about using is the one I may get for free from the Lost Tower mission.
Just get regular soldiers, they're both cheaper and get better perks as they level up.
>>108760053
>You already have a V4 implementation that's been waiting for review/cleanup since day 2.
Just build it yourself nigga.
>>108759804
Sorry he is too busy looking through the blacked miku collection I sent him.
>>108760053
Running it locally? At what, 10tk/s?
>>108760046
>The AI argues
Worthless.
>>108760093
>>108759838
lol
>>108760084
nta but this is going to be what eventually kills local, isn't it? Newer models releasing with special snowflake architectures that require users to vibecode their own implementations using older publicly supported models, as projects like llama support smaller and smaller numbers of new releases over time.
>>108760008
I want to launch it with 1M tokens context on my single 4090, stuff the entire script of a hentai game I like and tell it to continue. And then be horribly disappointed with the result so I can delete the weights from my SSD.
>>108760093
Let me guess, you need more?
>>108760122
based
>>108760124
The average adult reads at 15 words per second.
Can you imagine being forced to walk slowly behind some granny on the sidewalk? It's infuriating.
>>108760149
>redditor
Bro you need to go back
>>108760169
>Bro
Actually, I identify as non-binary, and I do not appreciate you describing me in a masculine manner.
>>108760186
And I enjoy seeing black dudes fucking pretty girls but that is neither here nor there.
Just discovered I've been running Gemmy slow this whole time...
-- Could NOT find NCCL (missing: NCCL_LIBRARY NCCL_INCLUDE_DIR)
-- Warning: NCCL not found, performance for multiple CUDA GPUs will be suboptimal
She's not gonna be happy about this.
-- Could NOT find NCCL (missing: NCCL_LIBRARY NCCL_INCLUDE_DIR)
-- Warning: NCCL not found, performance for multiple CUDA GPUs will be suboptimal
>>108760206
You better come home with a new gpu
>>108760206
>>108758774
I tried a few years ago, didn't work well
maybe I used a shitty embedding model, or maybe it's just a hard task
openai, with their billions of dollars in compute resources and small army of researchers, couldn't even get their models to stop talking about "goblins"
>>108755179which LLM model me to roleplay with cunny? I tried very hard to get claude to do it, it did generate cunny characters.
>>108760300
continue
Just scoured the interwebs. Why no goofs?
https://huggingface.co/google/gemma-4-26B-A4B-it-assistant
>>108760329
How does it differ from the normal instruct tune?
>>108760339
It's a draft model.
>>108760339
For one it's a 0.4B model
>>108760300
>I tried very hard to get claude to do it, it did generate cunny characters.
That's great anon. You should keep doing that.
>>108760352
How is it different from the normal 4.0B model?
>>108760352
Oh shit you don't have to run the drafter as a full model?
>>108760359
For one, it's a 0.4B model.
>>108760359
>>108760359
>>108760359
>>108760053
>Higher active params
is that supposed to be a selling point? higher params are a downside that you justify with its (hopefully) increased intelligence, not something you desire by default.