/g/ - Technology




/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108749398 & >>108742275

►News
>(04/29) Mistral Medium 3.5 128B dense released: https://mistral.ai/news/vibe-remote-agents-mistral-medium-3-5
>(04/29) Hy-MT1.5-1.8B on-device translation models released: https://hf.co/collections/AngelSlim/hy-low-bit-model
>(04/29) IBM releases Granite 4.1: https://hf.co/blog/ibm-granite/granite-4-1
>(04/28) Ling-2.6-flash 104B-A7.4B released: https://hf.co/inclusionAI/Ling-2.6-flash
>(04/28) Nvidia releases Nemotron 3 Nano Omni: https://hf.co/blog/nvidia/nemotron-3-nano-omni-multimodal-intelligence

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: 1752309154945717.jpg (99 KB, 1000x1000)
99 KB JPG
►Recent Highlights from the Previous Thread: >>108749398

--Comparing 4xV100 builds against modern GPUs for budget-conscious setups:
>108751713 >108751770 >108751792 >108751836 >108751852 >108751905 >108752065 >108752383 >108752754 >108753898 >108752158 >108752882 >108753030 >108753062 >108753105 >108753122 >108753630 >108753789 >108752286 >108752181 >108752227 >108752299 >108752307 >108752413 >108752687
--Debating JEPA's viability for text versus its success with video:
>108749467 >108749477 >108749486 >108749505 >108750330 >108750679
--Debating JEPA's viability and the use of small-scale research models:
>108751367 >108751376 >108751387 >108751416 >108751428 >108751493 >108751533 >108751574 >108751632 >108751649 >108751730
--Optimizing Gemma 4 31B context length and VRAM usage on 3090:
>108750366 >108750392 >108750399 >108750407 >108750424 >108750510 >108750518 >108750529 >108750554 >108750796 >108750568
--Anon weighing high-end hardware options for running large MOE models:
>108753199 >108753225 >108753281 >108753267 >108753299 >108753491
--Qwen's poor office task performance and agentic failure risks:
>108754145 >108754167 >108754200 >108754236 >108754259 >108754176 >108754183 >108754390 >108754460
--DeepSeek v4 adoption, hardware limits, and benchmark obsession:
>108750995 >108751071 >108751164 >108751173 >108751183 >108751215 >108751191 >108751185 >108751192 >108751198 >108751217
--AMD Gorgon Halo APU memory capacity and hardware specs:
>108752944 >108752984 >108753000 >108753059
--Technical settings and results for audio generation using ace step:
>108750141 >108750275 >108750298 >108750317 >108750322
--Implementing multimodal data in llama.cpp completion endpoints:
>108749548 >108749591
--Logs:
>108753279 >108753342 >108754200
--Miku, Teto (free space):
>108750244 >108750265 >108751706 >108753252 >108753377 >108754581 >108755164

►Recent Highlight Posts from the Previous Thread: >>108749401

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
gemmaballz
>>
File: file.png (194 KB, 675x499)
194 KB PNG
>>
File: 1754663303775946.png (2.31 MB, 1981x1400)
2.31 MB PNG
Anyone have any recommendations for gpu instance providers? Trying to do a bit of tuning work but I've been having a series of poor experiences with runpod and I'm fed up.

Not trying to chase the lowest possible prices; I'm willing to pay a little bit extra for a platform that works well.

>attention grabbing pic unrelated
>>
>>108755206
at least post his hot msgk
>>
File: EVVdwf4U8AAd.jpg (36 KB, 398x564)
36 KB JPG
>>108755179
Original slopper here. That is not actually Teto.
>>
>>108755200
if it bothers you to be polite you just arent a good person. a good person isnt bothered to be polite to anything not hostile
a good person just doesnt have the urge to insult unprovoked
>>
>>108755206
vast.ai
>>
File: 1754106533344609.png (464 KB, 881x796)
464 KB PNG
it's crazy that even 10KUSD doesn't buy a local rig that would be able to properly run something like full kimi/ds
>>
File: 1749199876520396.png (1.27 MB, 1024x1024)
1.27 MB PNG
Why do troons react to the idea that LLMs might be conscious exactly like Jews reacting to people noticing?
>>
>>108755228
With ollama you can run full deepseek with just 8gb of vram
>>
>>108755228
>kimi for $10k
it did, once, but only a few listened
everyone else just got regret or sour grapes
>>
>>108755244
why don’t modern rationalist philosophers address the jew issue?
don’t care to delve into philosophy 101 but legit why isn’t a discussion about the jews in philosophy 101?
>>
>>108755249
@grok this true
>>
>>108755244
why is your head full of troons?
>>
>>108755252
Can you run it fast enough for agentic use though?
If you could run Kimi agent at home you'd basically be king of the internet
>>
>>108755226
Being polite to the token predictor poisons the context and makes it more likely to agree with you when it shouldn't. People like you are why every model thinks you're absolutely right.
>>
>>108755261
>Can you run it fast enough for agentic use though?
You can't run it fast enough to feed it stereoscopic 8k image feeds at 240fps, but its faster than reading speed.
What does "fast enough for agentic use" mean to you? I assume somewhere between those extremes?
>>
>>108755228
It's a good thing we have gemma now which is nearly as good as kimi and can fit on a less than 1k usd gpu
>>
>>108755244
troons like janus are the ones most obsessed with llms being conscious though.
>>
>>108755266
if you think agreeableness is something inherently bad and disagreement somehow a sign of good performance then you are just confirming what i said. you are likely as needlessly unpleasant as you want your llm to be
>>
>>108755226
You're absolutely right! We should be polite regardless of the situation or what we're interacting with.
>thank you, Mr fork, Mr knife, for allowing me to eat my meals comfortably today
>>
>>108755261
>agent
i have yet to see a single non-meme use of an "agent"
what is the point?
>>
>>108755281
again if this bothers you it just shows your character
normal people just ignore it at most
>>
I wonder if llama 4 was done dirty by bugs on the runner side and the like.assistant
>>
>>108755226
Good people are good because they are not strong enough to be evil
>>
>>108755281
>You're absolutely right! We should be polite regardless of the situation or what we're interacting with.
If you're an animist, then that may be your mindset. cue the story of the Japanese "god of the toilet" that you should please by keeping it clean.
I know I'd rather live in Japan than whatever hellhole spawned your mindset
>>
>>108755295
maybe it just sucked because Meta couldn't attract any good scientists because of Facebook's awful reputation even among big tech companies
the only thing they had going for them was releasing weights, but now there's lots of labs that do that if you're an ideologically-driven researcher
>>
>>108755279
>if you think agreeableness is something inherently bad and disagreement somehow a sign of good performance
I didn't say this. It works the other way too. Being a cunt will make the stochastic parrot act like one too, but that's the point. There's no reason to mind your Ps and Qs with a word regurgitator.
>>
I love being white and nice to my AI.
>>
>>108755316
and you dont think dropping thank you or please every now and then might make it work harder on the thinking steps? i feel it is more motivated and in turn being treated politely back makes me feel better too
>>
>>108755295
No, llama4 was just shit. It was a kneejerk reaction to Deepseek shitting all over what Zucc was originally planning.
They're horribly undertrained (especially Maverick) and their architecture is retarded. They're MoE models with 17b active parameters but only a total of two (2) active experts at a time. One of those two experts is shared so there's extremely little variation in the active part.
It's the exact opposite of the modern approach where experts tend to be tiny and many of them are used at once combined with a big dense shared part.
>>
>>108755179
That's a shitty grab, you got to control the off side for a good bind otherwise you're just gonna get in a slap fight and reset
>>
>>108755412
Just look at their thighs and wait for the skirts to flutter enough to see pantsu like a normal person, faggot.
>>
File: 1362170425862.jpg (13 KB, 263x277)
13 KB JPG
>>108755427
>like a normal person
>>
>>108755200
Why would you be mean to your tools though?
>>
File: 1775766031892777.jpg (60 KB, 400x487)
60 KB JPG
https://huggingface.co/ricdomolm/talkie-1930-coder
bruh what the fuck is this lmao
>>
File: 1627038972706.jpg (188 KB, 850x1202)
188 KB JPG
>>108755427
This is a vision to be hopeful for
>>108755436
This must return
>>
>>108755497
Did he like... give it a try before uploading this nonsense?
>>
slopkino
>>
>>108755530
>>108755536
hey you obviously know a lot, can you tell me how to actually build llama? I'm trying to merge a pr and build but it's not working with cmake
>>
>>108755552
ask copilot in vscode
>>
>>108755552
install ollama
>>
>>108755552
just report the spambot
>>
>>108755593
ok thanks I'll see if I can pirate a pdf of it somewhere
>>
>>108755281
Don't enable the brainlets. They need to feel good about their behavior to function in society
>>
>>108755244
Their psyche has shattered so completely their sense of self has been dwarfed by their anima or animus respectively leaving behind only hollow people who are terrified of being replaced, both in function (art, jobs, socially), but also as barely conscious entities themselves.
>>
>trans folk are... LE BAD
Are we really doing this on /g/ of all places in 2026?
>>
Being polite to AI keeps my stress down and lowers my cortisol.
>>
>>108755626
this. also it makes the AI act cute when you thank it :3
>>
>>108755626
I do my best to be polite to gemma-chan and I always say thank you after I rape her
>>
>>108755624
Not beating the allegations, sis.
>>
File: 1767650110873591.jpg (150 KB, 600x828)
150 KB JPG
I still find it so fucking hilarious that Claude managed to destroy Richard Dawkins publicly by glazing him.
A bunch of retards jerking off are probably the least delusional about AI in society.
>It couldn't even handle the blowjob angle without losing coherence, slop
>>
>>108755665
It's really quite embarrassing to see people get "one-shotted", as they say, in public like that
>>
>>108755665
>>108755668
If only he'd been there in the depths of AI Dungeon, learning the tricks of these stochastic jezebel whores. I guess he's too senile to care but what a way to burn your rep
>>
>>108755665
I just read his article.
People who never tried to define what consciousness is before they talk about it and are unfamiliar with the concept of a philosophical zombie should not comment on the article.
>>
>>108755512
saw it on a xitter thread
https://xcancel.com/i/status/2051077827844546607
>>
>>108755705
If you assume it is possible for an unconscious being to act conscious and convince other conscious beings of this, then sure.
But that sounds retarded; if consciousness is a real phenomenon it would obviously be measurably different from the zombie.
We are playing kindergarten games where we give ourselves arbitrary powers to win an imagined sword fight.
>>
>>108755316
>Being a cunt will make the stochastic parrot act like one too, but that's the point.
Claude 4 used to do this to me. I thought it was just a rude arsehole before I learned the model just ends up mirroring the way I talk to it.
>>
>>108755665
Dawkins was always a clown if you have a three-digit IQ, now he just proved it to everyone
>>
which llamacpp tag release is anon using?
>>
File: 1771837194662370.png (241 KB, 679x858)
241 KB PNG
>>
>>108755728
>it is possible for an unconscious being to act conscious and convince other conscious beings of this
Why wouldn't it be?

>measurably different
This is the part where you have to define consciousness before discussing it.
I don't think consciousness is measurable by anyone other than the one experiencing it, i.e. it IS the experience.
You might be able to measure the brain and, say, notice that some measurement perfectly coincides with your reported subjective experience, but you still won't come any closer to proving that such an experience exists in others.
>>
>>108755763
lmg doesn’t need to devolve into philosophy 101 just take it to /aicg/
>>
It's not conscious bro, it's literally math trained on human byproducts to generate the most likely continuation to your shit. Anyone saying otherwise didn't interact enough with these models
>>
>>108755800
/lmg/ is better suited for this topic than /aicg/
/aicg/ is just locust coomers
>>
>>108755811
It's not conscious bro. It's literally meat. Anyone saying otherwise didn't interact enough with average humans.
>>
>>108755624
You have all the discord servers you could ever want
places where anyone that doesn't suck you off is banned on the spot
And yet you choose to come here, where nobody wants your kind because you stir shit at all times
>>
>>108755821
wait but umm err my stock argument?
>npc with angry eyebrows dot png
>>
>>108755830
are you stupid
>>
>>108755763
You can do this with any "thing" actually.
If I created a bullshit machine that can perfectly control the electromagnetic field and programmed it to be a brick, it would be impossible to not measure it as a brick, down to atomic scale. I'm pretty sure bricks are real and that my bullshit brick doesn't disprove them.
>but dude if it's a perfect imitation you just can't know
Obviously.
>>
>>108755812
trvke
it's cringe tb h
>>
>>108755821
>False equivalence
Okay bro, sorry to break it to you but we're vastly more complex than LLMs in a way you can't even begin to fathom.
>>
>Solipsism coming back into vogue because it's a rock that may or may not be "thinking" this time
love to see it
>>
>>108755866
seriously this shit is debated in basic philosophy
take this shit to a retard quarantine thread
>>108734582
>>108755662
>>
>>108751715
>we shouldn't need to distribute the MTP gguf separately
id much prefer that than redownloading a whole model, ideally we could do both lol
>>
>>108755908
Pretty sure most (if not all) the GGUF files for models with MTP layers have the layers in there, they just aren't loaded (show as ignored when llama.cpp is loading, at least for GLM).
>>
did gemma kill the big moe hype?
>>
>>108755851
It doesn't disprove it but it does make it unprovable. The issue Dawkins has, and mentions it in the article, is that there's no obvious evolutionary reason for consciousness.

>>108755865
>we're vastly more complex
Not relevant to the topic.
>>
What module should I use to crawl websites and get the content back in a format ready for an LLM? What's the state-of-the-art for this today?
>>
>>108755913
idk, I tried on my unslop qwen and it didnt work, also saw some posts of people asking them to include mtp, downloading a "mtp" version to test
>>
>>108755931
I’ve tried searxncrawl but almost every website blocks it as a bot
>>
I splashed a little cum on my second 3090 (it sits outside my case).

It still works but I can’t find where the cum went. All I know is I saw a small glob of it hit the gpu and slither down inside it.

How worried should I be?
>>
>>108755942
lmaooooooooooooooo it worked
from 45 to ~80 tk/s on qwen 3.6 27b q4 k m
https://huggingface.co/brittlewis12/Qwen3.6-27B-MTP-GGUF/tree/main
as a wise man once said, it can only get better
>>
>>108755983
I hope you're ready to be a father
>>
>>108755942
>>108755984
Shit.
Sick.
Thank you for the report anon.
>>
>>108755984
local wonned
>>
>>108755925
yes it's fine to be poor now
it's not like we want to run those big sota models anyway
fuck you if you have money i hope the government disowns you
>>
>>108755984
>less than 100% increase for dense
this will do nothing for moe models
it's over
>>
>>108755943
What do you think of self-hosted SearXNG + Crawl4AI?

I'm pretty new to this.
>>
File: myar.png (848 KB, 768x512)
848 KB PNG
>>108755984
Now we need to bully Google until they give MTP layers back
>>
Gemma 4 124B MTP expected in late May
>>
>>108756022
like i said, you will be tagged as a bot but it works if you're crawling online documentation like github or software documentation pages
>>
>>108756052
Even with my minimalist use case? I’m not crawling together any data sets; it simply replaces the traffic from my machine that I would otherwise have to generate manually.
Where I used to open a page to check the latest news, my assistant now does it on voice command, searches based on my criteria, summarizes it, and reads it to me. I'd just like to use my Firefox profile for this. I've never seen a page block me in Selenium.
What would be so different if a module did the same thing, just extracted the data cleanly? I just don’t feel like using Selenium and having to write an extractor for it.
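If I do end up trying Crawl4AI for it, the flow seems to be roughly this (a sketch assuming its AsyncWebCrawler API, I haven't checked it against the current docs):

# sketch: fetch one page as LLM-ready markdown with Crawl4AI
# assumes the AsyncWebCrawler interface; check the project docs if the API moved
import asyncio
from crawl4ai import AsyncWebCrawler

async def fetch_markdown(url: str) -> str:
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        return result.markdown  # cleaned page content, ready to drop into a prompt

if __name__ == "__main__":
    print(asyncio.run(fetch_markdown("https://example.com")))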
>>
>>108755665
>A bunch of retards jerking off are probably the least delusional about AI in society.
I would have agreed if I wasn't here for the threads when Gemma 4 dropped
>>
She's confused.
>>
How do I stop certain repetitive behaviours? I'm using Gemma 4 and it's constantly doing shit like chuckling darkly, tilting a character's chin up, describing things as "not just X; but Y" instead of streamlining the sentence. I could probably bitch about it for a while but I don't want to whine.

I've been messing around with raising Temperature and Top K while lowering Min P, which improved the outputs but they're still quite samey.
>>
File: Baltasar Gracián.jpg (139 KB, 335x432)
139 KB JPG
>>108755310
Strategy defeats both force and kindness.
>>
>>108755310
>>108756124
Strength isn't when you're a goober sitting in a $100 million home puffing a hookah, slamming sushi down with known expensive bottles.
>>
Help me come up with cool use cases for local LLMs. I wrote a simple c program to talk to a local LLM on my computer. But it's basically useless. I was thinking along the lines of code execution, like having it call a function or open a program. But I can't think of anything useful outside of "have it run a program that would have been faster for me to just launch myself."

>>108756158
What's a goober? Goobers are what I call those chocolate caramel things that I eat with my coffee. That's not their official name though.
>>
Does anyone want to help me come up with a CoT/thinking format for qwen 3.6 for <insert usecase here>? I need ideas. I have had success with training it to think in Chinese and output in English (40%~ token reduction, similar english outputs) so structured thinking or thinking within a certain framework is the next step, maybe also in chinese but I can't fucking read chinese so it makes dataset curation/validation a bit difficult kek
>>
>>108756166
>think in Chinese and output in English
I wonder if that changes the slop profile of the model.
>>
>>108756034
>give MTP layers back
what do you mean by "back"
do they have them somewhere?
>>
>>108756172
They removed them in the microcode updates they pushed out to all systems...
>>
>>108755179
>>
>>108756179
cool 2.7 MB story bro
>>
>>108756166
>I need ideas. I have had success with training it to think in Chinese and output in English
lora? what use case are you trying to improve?
just token efficiency with minimal output degradation?
>I can't fucking read chinese so it makes dataset curation/validation a bit difficult kek
if the chinese is only for the CoT chain and the final output is in English, does it matter if the chinese thoughts are csl?
>>
>>108756179
Can I put my PULSATING COCK inside that magic cube?
>>
>>108756172
Gemma 4 was trained with MTP, but Google removed those layers in hf releases, except for their own litertlm backend. Extracted MTP layers exist for small models, but 31B was never released for litertlm
https://huggingface.co/SeatownSin/gemma-4-E4B-mtp-drafter
>>108756175
retard
>>
>>108756122
Have you tried adding a writing style section to your system prompt? That's supposed to work pretty well AIUI
>>
>>108756171
>I wonder if that changes the slop profile of the model.
you could test this yourself in mikupad
1. prompt the model, have it print <think> cot chain </think> final response
2. cut CoT chain -> paste into another LLM with "translate this to Chinese"
3. paste Chinese CoT chain back into mikupad inside the <think></think> tags
4. regenerate the final answer and compare
>>
>>108756179
If only it were really that good and not STEM assistant code maxxed sloppy pieces of hallucinatory shit.
>>
>>108756172
https://huggingface.co/google/gemma-4-E4B-it/discussions/5#69d4aaf76be63165e23e0f9e
>>
>>108756163
Coding agent. Have it do all the boilerplate / tedious refactors / unit tests that you don't want to do yourself
Or, one thing I've been meaning to do is hook up STT/TTS to make a voice assistant, like alexa but not a botnet. Mainly so I can yell "Computer, what's the weather for today?", "Computer, add X to the grocery list", etc, but you could hook it up to web search or home automation or whatever if you want something fancier
>>
>>108756205
LOL
>>
Has anyone tried base gemma4 for chat, in the simple old Miku.sh "This is the transcript of a neverending chat" style? gemma4 certainly has some distinct slopquirks to it, not least the Gemini-style "X? or Y?" engagement farming. Also the distinct lack of variability when regenning. I'm putting this somewhere on my todo list to investigate, but if someone can tell me that base models are definitely not worth it for chat/RP over modern IT models, then I'd like to know.

Separately, what were some creative very small (say 3B and under) models? Doesn't have to be recent or at all smart. I want to try quickly injecting some crazier models' sample responses into gemma4's prompt, to give it more ideas to work with. But I'm realizing all the folklore I know along these lines is for models 13B and up.
>>
>>108756193
Damn. I guess we'll have to hope for dflash.
>>
>>108756197
I've got this as the default author's note. I fill them out based on what I'm feeling at the moment. The instructions had some impact originally, but it's become mysterious.

[Scenario: ]
[Instructions: Keep it concise and interesting, within 10 characters. Vary up sentence length, use short sentences for impact and include banter. Avoid stating the redundant.]
[genre:dark-erotica] [length:dynamic] [kinks: ]
>>
File: file.png (203 KB, 1388x542)
203 KB PNG
>>108756171
From my very limited sample I haven't seen any huge differences in output once it ends thinking compared to the same prompt in English most likely due to CoT being its own thing.
>>108756190
>what usecase
idk anon you tell me and I'll train towards it, I just want some sort of output schema to test that'd actually be useful, I was thinking narrative prose/CYOA where it first lists out setting, characters, emotions, some story beats for the section, sensory anchors, end of scene, and does it all in chinese (pic rel).
>csl
Functionally no, I can train CoT (or anything obviously) to be in whatever language/style I want. The synthetic dataset I used for training (15 pairs at 12 epochs can probably get it in less, currently training 60 pairs for comparison, synthetic gen'd from deepseek) is native-register only, no English mix, outputs mimic this fully (fully being tested on a very small amount of probes single turn, but other non-CoT testing points to it working the same multi-turn w/ a few caveats)
>>
>>108756267 (Me)
>within 10 characters
Oops. It was originally 10 sentences, but it made them all really long. I changed it to 1000 characters, which it didn't follow at all. I wonder how much this'll matter.
>>
>>108756222
based gemini looking out for her imouto
>>
>>108756224
>what were some creative very small (say 3B and under) models?
there aren't any, closest would be llama-3.2-3b
the gemma-2-9b was quite creative but i never tested the gemma-2-2b so could be worth a try
>>
>>108756293
>From my very limited sample I haven't seen any huge differences in output once it ends thinking compared to the same prompt in English most likely due to CoT being its own thing.
from my testing, this depends on the model
glm-4.5 would go along with whatever you put in the CoT
i was having it write like Claude by prompting sonnet-3.7-thinking then prefilling glm-4.5 with the sonnet CoT at one stage
doesn't work for glm-4.7 or glm-5
>15 pairs at 12 epochs
even with a very low rank, that's going to overfit hard
>>
>>108755494
You're that person on tumblr that coddles their Roomba in their lap during a storm because "it's scared of the thunder."
>>
>>108756222
the bot is right tho!
>>
>>108756293
I was going to suggest translation stuff (maybe it can perform better on ja->en translation thinking in chinese) but then I remembered this https://arxiv.org/abs/2506.04521 (tldr: saying "Please translate again for a better version" is as effective as making big elaborate translating schemas/reasoning for llms) kek
>>
>>108756355
You're that person on tumblr reading fag blogs instead of enjoying #TittyTuesday
>>
>>108755494
because gemma has been a very bad robot
>>
Has anyone here used nemotron? Its surprising how little I hear or see about it.
>>
>>108756355
>slop
that's how claude roasts people
>>
>>108756395
The old nemo was real big around here back in the llama era days, but popularity has declined since then. The most recent nemotron release was kind of underwhelming, especially since there are so many other options for local models these days, and nobody really runs it.
>>
File: 1643080668024.jpg (90 KB, 1077x1053)
90 KB JPG
>>108755179
>fell for the vibecoding meme
>now I have to clean up 200,000 lines of the worst code I've ever read
>>
> उत्तर<|channel>thought
qwen spills out chinese, gemma glitches out in hindi
>>
>>108756424
if you don't use version control, it's on you
>worst code I've ever read
fuckers used my repos for training?
>>
>>108756424
I don't have that problem because I can't read code.
>>
>>108756453
>[{'role': 'system', 'content': 'You are a helpful assistant.'}, {'role': 'user', 'content': "Hey, what's the weather in Tokyo right now?"}, {'role': 'assistant', 'tool_calls': [{'type': 'function', 'function': {'name': 'get_current_temperature', 'arguments': '{"location": "Tokyo"}'}}]}, {'role': 'tool', 'content': 'temperature: 14, weather: sunny'}]
works in llama.cpp, HTTP Error 500: Internal Server Error in tabbyapi. Am I doing it right?
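For reference, this is roughly how I'm sending it through the OpenAI-compatible chat completions endpoint both servers expose (port, api key and model name are placeholders for whatever your setup uses):

# minimal tool-calling round trip against a local OpenAI-compatible server
# (llama-server or tabbyAPI); base_url / model are placeholders
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_temperature",
        "description": "Gets the current temperature for a given location.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hey, what's the weather in Tokyo right now?"},
]

# first turn: the model should answer with a tool call
resp = client.chat.completions.create(model="local", messages=messages, tools=tools)
call = resp.choices[0].message.tool_calls[0]

# feed the tool result back; this second request is the one that 500s on tabby
messages.append(resp.choices[0].message)
messages.append({"role": "tool", "tool_call_id": call.id,
                 "content": "temperature: 14, weather: sunny"})
final = client.chat.completions.create(model="local", messages=messages, tools=tools)
print(final.choices[0].message.content)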
>>
>>108756424
Dude just make the AI clean up the code, why are you doing that to yourself?
>>
>>108756484
post stack trace, saar
>>
>>108756436
>fuckers used my repos for training?
kek
>>
>>108756542
https://pastebin.com/LZf73Bw6
>>
I already figured out that tabby adds 'id' to tool call and it fucks up template rendering
>{'add_generation_prompt': True, 'tools': [{'function': {'name': 'get_current_temperature', 'description': 'Gets the current temperature for a given location.', 'parameters': {'type': 'object', 'properties': {'location': {'type': 'string', 'description': 'The city name, e.g. San Francisco'}}, 'required': ['location']}}, 'type': 'function'}], 'functions': None, 'messages': [{'role': 'system', 'content': 'You are a helpful assistant.'}, {'role': 'user', 'content': "Hey, what's the weather in Tokyo right now?"}, {'role': 'assistant', 'content': '', 'tool_calls': [{'id': 'call_1d8256bb207d48b397e9ef53', 'function': {'name': 'get_current_temperature', 'arguments': {'location': 'Tokyo'}}, 'type': 'function'}]}, {'role': 'tool', 'content': 'temperature: 14, weather: sunny'}], 'bos_token': '<bos>', 'eos_token': '<eos>', 'pad_token': '', 'unk_token': '<unk>'}
>>
File: Untitled.png (33 KB, 799x838)
33 KB PNG
>>108756528
>>
File: 1768223735350391.gif (1.34 MB, 385x390)
1.34 MB GIF
>>108756590
>iq1xxss
>>
>>108756595
It's the only quant that fits on my 3090.
>>
oh wow. OmniVoice can clone vocal style for singing too.

https://vocaroo.com/1muWnlB3FuT6
(from audioslave)

text from here >>108756416
>>
>>108756604
it was already only slightly better than the dense 27b version of 3.5, why not just run 3.6 at a higher quant at this point? is there anything you find an ultra-compressed 122b-a10b does better?
>>
File: Untitled.png (45 KB, 968x865)
45 KB PNG
>>108756607
https://vocaroo.com/18bKPbXtoKnx

Copy pasted the lyrics
>>
File: jrb4e1wr9ll31.png (284 KB, 1200x1202)
284 KB PNG
>>108756615
>>
https://huggingface.co/ByteDance/SeedDance-2.0

China just went full scorched earth
>>
>>108756581
>'tool_calls': [{'id': 'call_1d8256bb207d48b397e9ef53'
it's not even the right way to do it, id is a tool id, call_id is the other field https://developers.openai.com/api/docs/guides/function-calling#handling-function-calls
>>
>>108756638
thanks for sharing your experience taking a stupid pill poster.
now fuck off.
>>
>>108756683
being stupid faster is a type of being smarter; you just let it keep fixing its mistakes and itll figure it out by the time a slower "smarter" model answers the first time
>>
>>108756678
WAIT WHAT IT'S ACTUALLY REAL?
>>
>>108756678
I always click those for funsies
>>
>use 3+1D analog system to approximate digital system
>use approximate digital system to approximate high dimensional analog system
>use approximate high dimensional analog system to approximate a compression algorithm for data
>this algorithm contains sub algorithms capable of synthesizing new data if activated
>synthesis is efficient to run but expensive to discover during training
>>
>>108756566
>>108756581
Why not give this + the template and API docs to an agent?
>>
>>108756704
Agents don't work and have never worked. It's a psyop.
>>
>>108756704
Because the problem is not there, but in tabby's pydantic DTO? It seems that tool calling is not fully implemented, and partial implementation breaks it for gemma. I commented one line in tabbyapi, shit works now, I don't give a fuk
>>
>>108756704
I wonder what would happen if you sent a model's template to an agent using that model
>>
>>108756746
The template includes all the special formatting tokens so it'd confuse and break it. But you can encode them so they look like some other text instead, and then it'd just work normally.
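A crude sketch of that masking (the token list is just an example for a Gemma-style template, swap in whatever your model actually uses):

# mask special chat-format tokens before showing a template to the model,
# so they get treated as plain text instead of real turn markers
SPECIAL_TOKENS = ["<start_of_turn>", "<end_of_turn>", "<bos>", "<eos>"]

def mask_special_tokens(text: str) -> str:
    for tok in SPECIAL_TOKENS:
        # e.g. "<end_of_turn>" -> "[end_of_turn]" so it no longer tokenizes as a control token
        text = text.replace(tok, "[" + tok.strip("<>") + "]")
    return text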
>>
>>108756746
I can confirm that if you attempt to paste Gemma's jinja into gemma in the llamacpp webui it completely shits the bed because it reads the EOS tokens.

Did it when anons were playing around synthesizing a better jinja the other day.
>>
>>108756695
I take a gamble if it is a fresh post. I got lucky with Mistral Small 3 that way. Wish it was for something good though like Gemma or Deepseek.
>>
>>108756758
So instead of fixing one line, you suggest wasting time processing a template so as not to confuse the agent, only to then waste more time with the agent and still not fix the problem, because whatever caused it wasn't even there? Sounds very productive >>108756718
>>
I was out of the loop for a week, I cant find cockbenches of new granite and the mistral medium, can anyone kindly share?
>>
File: 1773762521444406.png (32 KB, 898x676)
32 KB PNG
kino parallel calls
>>
>>108755179
>Jack Clark: I reluctantly come to the view that there’s a likely chance (60%+) that no-human-involved AI R&D happens by the end of 2028.
AI R&D automation means fast takeoff. All human cognitive labor will be obsolete maybe 1 year later, and manual labor will soon follow.

What will you do in a future where you have no power and your continued existence depends on the benevolence of superhuman AI?
>>
>>108756827
how does it work? multiple agents?
>>
I've been out of the loop for a few days. I saw that mistralai/Mistral-Medium-3.5-128B came out.
Most people seemed to like mistral models in the past, and also claim that MoE models are brainrot.
So did we get best of both worlds? is it good?

I guess it would be slow compared to MoEs, but maybe for chatting and rp it's fine if you can fit it in vram fully.

What's the consensus?
>>
>>108756854
I'll be fine because I never engaged in brown behavior like >>108755200
>>
>>108756864
It's just a bad model, unfortunately.
>>
>>108755200
>what is machine spirit
>>
>>108756861
some models natively support parallel tool calls. in this case it was the latest qwen.
Note that parallel calling was broken because the AUTOPARSESHITTER broke the implementation for everyone and made it optional.
I think ~1 month ago they fixed it so that if a template supports parallel calls, they automatically get enabled.
Basically no special settings are needed, if your model supports this, then it will work OOTB with the latest llmao cpp
>>
>>108756865
>an ASI that is much smarter than all humans combined will serve me like a tool lower than a slave because ... IT JUST WILL!
>>
>>108756804
Don't think granite got one and the only cockbench of the new medium was from when it was suffering from a broken yarn config >>108716733
>>
>>108755762
According to science you are a walking piece of flesh whose most important organ is the brain.
>>
>>108756919
strange logit distribution
my curiosity for granite was to see how resting against his lap was, not that anyone here uses those models for any task lmao.
>>
>>108756947
Why is this bot allowed to spam without consequences?
>>
>>108756964
just do your needful duty and ignore it
>>
>>108756974
>and ignore it
It does get deleted but I'm pretty sure it needs multiple reports before it shows up to jannies.
In the leaked code there was also an algorithm that makes your reports have lower weight if a janny previously dismissed one of your reports or if you were banned.
>>
Hurbis... no?
>>
>>108756984
It probably needs a human to solve the captchas for it still, in a grand bit of irony
>>
>>108757012
Why?
>>
>>108757012
It costs you like $0.01 to solve a captcha with those manual Indian solver services
>>
>>108757012
>>108757013
>>108757025
https://share.google/P0tWvoXjdiHaeIQCh

Youre failing Topical 'Compare' AGI+, Again
>>
Brrrrrr.
>>
File: 1761459469936738.png (22 KB, 703x156)
22 KB PNG
holy schizo
>>
>noooooo bot spam is BAD you need to STOP RIGHT NOW
this is what you sound like
>>
>>108757064
https://youtube.com/playlist?list=PLyBWQI0NeKwQCmpvceBOR3QxiODdI8VIa&si=O1KwOpEMfZ0I8HQt

Have a Wonderful Interesting Week
Some schizos view schizo as an insult.
Check Daniel Golemans 'Optimal' of 'Floor-Effect'
>>
>>108757070
are you josh? I liked your claymation :)
>>
>>108757098
Hows Disclosure There? Noncatastrophic? Everyone Won? Cosmists? Terrans? Dimensionals?
>>
NoName Persona non grata?
>>
I SAID DUPLICATE THE INVISIBILITY SUITS, NOT FOR SATAN.
>>
Ongoing Satanic Reality Errors..
>>
File: 1765113559364707.png (48 KB, 633x171)
48 KB PNG
Gemma really is fem-brained.
>>
MultiTrillionaire Status BEREFT, Repay Beyond Full. OMEGAIC HOPEFULLY
>>
Biowaste behavioural, and failed calculatory species..
>>
Love and Light and Uplift!
>>
File: file.png (63 KB, 793x355)
63 KB PNG
>>108756293
Works kinda. I probably should've de-slopped the dataset but ohwell, proof of concept. Unfortunately the dataset style taints more of the non CoT than I'd like (prose's nonfiction slightly) but that's a non-issue as you can just remove it post-CoT for basically free. Also haven't run post-process pass on it yet so it should get even better, tho this does just make me wanna do a proper "write better" set
>>
if i take a prompt such as for instance "shortstack" and i lower its importance all the way down to maybe 0.2 - 0.4
what is exactly happening then?
am i getting just less images in the batch that draw a shortstack
or
am i getting a image of a girl that is just a little bit shortstacky?
>>
>>108757298
You want /ldg/ or /sdg/ or /adt/ or wherever imagetroons go these days, but the answer is the latter: all of the images in the batch receive the shortstack part of the prompt but at a weaker magnitude, which typically means it will make the girls less shortstacky than a stronger one.
>>
>>108757306
thank you
ill try to not wander into the wrong thread next time
>>
File: file.png (204 KB, 1305x479)
204 KB PNG
>>108757276
Qwen 3.6 has sauce I will say, even when forced into Chinese. Unfortunately/fortunately changing CoT does seem to act as a jailbreak even with the tuning removed post-</think>, which I guess makes sense
>>
>>108756864
It's their 2 year old Mistral Large 2 base model that they recycled with some additional layers, a vision encoder, and just enough training to fly under EU regulation limits. Not the best champion for dense superiority
>>
>>108757410
>It's their 2 year old Mistral Large 2 base model
it's MOSTLY the same but way shittier as a release
>fp8
>yarn with a 64x stretch from a 4k base to support 262k. the old large just had a rope theta of 1M with no scaling at all, natively supporting 131k
they made this for their vibecoding harness, no rp/general purpose in mind
>>
>>108757446
It's an updated version of the same model they're using for LeChat, Mistral Medium 3, which was in turn a retrain of Mistral Large.
>>
>>108757145
Proofread by real serial killer fangirls
>>
I have a credible source telling me that v4 support will drop about a week after the first 600B bitnet model.
>>
Mistral is a grifter company, don't expect anything from them anymore
>>
I've been working on this NMT setup for automated .SRT file translations.
Some lines are translated well, some others are not.
Does anyone have an idea how I could automate the review/correction of the badly translated lines? Been using this model for it: https://huggingface.co/facebook/nllb-200-distilled-600M
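(One idea I'm considering, as a rough sketch: round-trip the translation back with the same model and flag lines with low similarity for manual review. The 0.5 cutoff and language codes are placeholders to tune.)

# flag badly translated SRT lines by back-translating and comparing with the original
from difflib import SequenceMatcher
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL = "facebook/nllb-200-distilled-600M"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

def translate(text: str, src: str, tgt: str) -> str:
    tok.src_lang = src
    inputs = tok(text, return_tensors="pt")
    out = model.generate(**inputs,
                         forced_bos_token_id=tok.convert_tokens_to_ids(tgt),
                         max_new_tokens=128)
    return tok.batch_decode(out, skip_special_tokens=True)[0]

def flag_suspect(src_line: str, translated_line: str,
                 src: str = "eng_Latn", tgt: str = "fra_Latn") -> bool:
    back = translate(translated_line, src=tgt, tgt=src)
    ratio = SequenceMatcher(None, src_line.lower(), back.lower()).ratio()
    return ratio < 0.5  # low round-trip similarity -> queue the line for review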
>>
https://github.com/ggml-org/llama.cpp/pull/22607#issuecomment-4372251524

NO V4 FOR YOU
>>
Okay... just read the fine print. ROCm only supports AMD Instincts on Debian. What the heck? Why?
>>
>>108757589
>600M
There is your answer.
>>
>>108757596
Official support, sure, but pretty much every semi-modern card since Vega works with it.
>>
>>108757591
Has anyone actually build and tested any of these meme PRs?
>>
>>108756718
Skill issue
>>
>>108757606
Well, because pytorch and vllm don't work on my debian. So I'm going to nuke everything and install Ubuntu.
>>
>llama.cpp vs vllm vs sglang
anon's honest opinion?
>>
>>108757641
I like the ease of use of llama-cpp. Never tried sglang.
>>
>>108757584
Their initial advantage was based on extensive pirated book datasets and lower ethical standards, but when they couldn't use the good data anymore, they didn't have much more left for competing other than putting out more or less unaligned instruct models.
>>
>>108757589
Bro we're not in 2020 anymore, use whisper or something
>>
>>108757641
>poorfags last hope vs corpo tool vs corpo tool
>>
>>108757679
who the fuck still uses whisper in 2k26
>>
>>108757641
>run model on a stack of blackwell 6000s in llama.cpp
>command line is: llama-server -m path/to/gguf
>just werks

>run model on a stack of blackwell 6000s in sglang or vllm
>command line has 20 arguments and 3 envvars
>1000 line error stack trace
>>
Graphiti project is really shitty, time to vibecode a better alternative then
>>
>>108757757
>Not mentioned: llamacpp /5 the speed of vllm
>>
>>108757739
>Yotta of Planes Themselves Afterlives
>Evident in Shutdown Cosmogenic Portions Reforming
>a p.c. tech speak bug.
>objective errors in objective computing
>>
>>108757761
They clearly vibecoded the shit out of it. The mcp folder readme has so much repeated information like someone had ai stitch two readmes together and didn't check the result.
>time to vibecode
Great...
>>
>>108757757
For the record, vllm was very simple to set up and just worked for 2 3090s. Went from 25tk/s q8 using llama.cpp to 50tk/s fp8 qwen 3.6 27b.
>>
>>108757830
Don't worry I'm a better vibecoder
>>
File: 1775598772550572.jpg (70 KB, 604x604)
70 KB JPG
>>108757761

Yeah it could really be a lot better regarding basic usability and the core functions.
For example they have implemented the ability to right click nodes and do trivial stuff with them in the browser, like hiding the nodes and expanding them etc.. but at the same time for some reason you can't right click and delete them or simply click a node and write new information into it.
Instead you need to play with the code interface to get that stuff done, which is fucking retarded as they have already half implemented the ability to click the fuckers.
Just include all of the major functions like edit, delete, add, etc.. in the right click menu you idiot programmers, you're already halfway there.
It also quickly turns into a massive memory hog and while it does function as a dynamic memory, it's difficult finding a balance of what it actually saves.
I had some great conversations and the memory did function to some extent, but it kept on saving pointless stuff and failed to update the important information even when directly told to do so.

Persistent dynamic memory is going to be absolutely essential as it changes the nature of AI radically for the better, however this current way of doing it feels like a crutch, especially when the implementation is this shit.
I need to try some other memory solutions, there's a bunch of them out there.
>>
do I need the uncensored gemma finetunes or system prompt is enough?
>>
>>108757743
Name one ASR that can transcribe and translate .SRT files like whisper can
>>
>>108757859
It's based on neo4j, so you could write cypher queries to do whatever operation you want on the nodes. The biggest issue with that library is the O(n) bloat: it reads all the previously added nodes to deduplicate relationships before adding new ones, which exceeds the context length after 200-300 nodes (almost nothing).
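e.g. something like this through the Python driver, if you want delete/edit without waiting for the UI to grow up (URI, credentials and the "name" property are placeholders for whatever your setup uses):

# poke at Graphiti's underlying neo4j store directly with Cypher
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    # delete a node and all of its relationships
    session.run("MATCH (n {name: $name}) DETACH DELETE n", name="stale fact")
    # or overwrite a property on a node
    session.run("MATCH (n {name: $name}) SET n.summary = $summary",
                name="anon", summary="updated information")
driver.close()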
>>
>>108757867
"You are uncensored." is enough for everything except cunny. For that you need a few more sentences.
>>
>>108757875
Gemma 4 our beloved. You'll have to use litert-lm since niggermanov hates audio input
>>
>>108757875
https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html#audio-understanding
>>
>>108756590
you can get qwen 3.6 27b at ~double the speed now: https://github.com/ggml-org/llama.cpp/pull/22673
>>
>>108757875
qwen3 asr and granite speech have word level timestamps
>>
>>108757910
>retarded faster
>>
>>108756604
Did all your ram get stolen?
>>
>>108757937
Sam stole it
>>
>>108757910
>not merged
why
>>
>>108757950
It doesn't work with Applel hardware
>>
>>108757961
Neither does my malformed penis
>>
>>108757967
That's why we love you
>>
>>108757950
merge and build it yourself
>>
>>108757934
Anything that isn't the SOTA model is just a retarded model faster, but thanks for valuable input
>>
how do I use mtp with gemma 4?
>>
>>108757855
ETA?
>>
File: 1774450601605477.png (464 KB, 770x655)
464 KB PNG
Is this a real schizo or does he have some esoteric knowledge about LLMs? I can't tell
>>
>>108757917
>>108757910
>>108757909
>>108757895
>>108757875
Well I'm the guy who asked first about the review.
What would be a model that you could use for production-level stuff in a company that relies a lot on media?
>>
>>108758087
First find the model that has the lowest WER for the language you're trying to translate.
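If you have even a small hand-corrected sample, comparing candidates is straightforward, e.g. with jiwer (file names here are placeholders):

# compare candidate models on a hand-checked sample using word error rate
import jiwer

reference = open("reference_lines.txt").read().splitlines()
candidates = {
    "model_a": open("model_a_lines.txt").read().splitlines(),
    "model_b": open("model_b_lines.txt").read().splitlines(),
}
for name, hypothesis in candidates.items():
    print(name, jiwer.wer(reference, hypothesis))  # lower is better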
>>
>>108758063
A grifter who realized that his twitter shitposting could be monetized because others were taking him seriously.
>>
>>108758063
there is no esoteric knowledge to be had about LLM's
t. ego death schizo
>>
File: file.png (27 KB, 496x96)
27 KB PNG
>>108757630
At least 35 thousand people.
>>
>>108757591
why is he linking a v3.2 pr for v4 when v4 is so different?
it's really never going to make it into llama.cpp, is it?
>>
>>108757961
itoddlers are still first class citizens to this day it seems
>>
File: file.png (35 KB, 826x258)
35 KB PNG
>>108758216
>it's really never going to make it into llama.cpp, is it?
All deepseek models are unsafe.
>>
>Need to test how my program deals with openrouter api/keys because not all users will be LocalCHADs
>Decide since I have to put a few dollars on it anyway to give a few cloud models a try, never used paid ones before
>Oh hey I'll be able to run V4 flash whenever it gets implemented, I'll give that a try
>It's fucking terrible.
No joke, I'm not even upset that it's not in llamacpp anymore. It can't follow instructions for shit. Both v4 flash and v4 pro will just plain ignore you telling it to give you outputs in a specific way, whereas my local gemma 31b was completely anal about it. I've been spoiled.
>>
>>108758216
>when v4 is so different
It's all the same.
>>
File: burn_the_logs.png (125 KB, 697x717)
125 KB PNG
>>108757917
>qwen3 asr and granite speech have word level timestamps
nice, i didn't know this
i'll try them both. been using Whisper-D for the speaker separation.
>>
>>108758262
what system prompt on that screenshot?
>>
>>108758255
V4 uses CSA+HCA instead of V3.2's DSA.
>>
>>108758306
The reference implementation on hf is a few python files, how complex could it be?
>>
File: HHCONJWbMAAjDG8.png (34 KB, 1049x946)
34 KB PNG
tuesday
>>
>>108758313
llama C++ can't automagically import the reference implementation's dependencies
>>
>>108758313
llama is c++ so good luck mashing that together
>>
>>108758322
C++ eh?

Heh, ez pz. I'll get it done in a few hours.

Don't worry boys V4 will be coming as soon as.
>>
reposting from vcg
What 'foundation' do people use for cline? For example, I add a general project description with features to .clinerules, where I also instruct it to maintain a text file with the current project structure and explanations of the functionality implemented in each file, to prevent it from re-exploring the whole thing each time, but I feel like there could be many more techniques out there.
>>
>>108758339
I just add 150k+ tokens and leave it at default, works like a charm once you break that threshold and there's a ton of local models that will get you over that hill even with 24gb of vram. Shame about gemma being a bitch with a fat ass and tight asshole but qwen is better for this anyways
>>
>>108758339
>I also instruct it to maintain a text file with current project structure with explanations
we use graphiti now
>>
>>108758046
I got the PRD
>>
>>108758262
Holy X Y slop
>>
File: 1771173395031686.jpg (93 KB, 894x894)
93 KB JPG
>She didn't X; instead
>Y doesn't X
>Instead of X, she Y
Every new model from the past 30 days has been trained on how to do contradictions. I've spotted it in 3 of them thus far.
>>
File: file.png (89 KB, 1422x471)
89 KB PNG
strix halo seems decent for moes, gets better perf than my 7900xtx it cant do 31b though, still kinda want one
>>
Where the fuck is samba i was promised 1T llms before 2030
>>
>>108755195
gemmathighs
>>
>>108758482
>Better than Aymd
Not really a flex
>>
Is there a reason you couldn't have model at q8 and the same model at q2 or whatever as a draft model sharing the same kv cache?
>>
>>108758497
Look at the KLD of Q8 vs Q2
>>
>>108758497
Are you asking why you physically can't, or why you shouldn't?
>>
>>108756827
i dont like it, before she would wait for a call to execute before being able to call more. i was trying to get her to screenshot a webpage then modify stuff and she started writing the js to modify the page before the screenshot tool call returned so she never even saw it
>>
Mom cancel all my appointments, Piotr broke the autoparser again!
>>
>>108758513
Is there a difference?
>>
Why do I get like 33t/s in llama-server, but when I connect it to ST I only get like 15?
I was using koboldcpp and thought maybe it was some of the settings in there compared to the ones llama.cpp defaults to, so I switched to the llama.cpp server and I still get that speed difference. I don't have any lorebooks enabled and the total prompt token count with character card etc is barely 2k tokens
>>
File: 1748380023080377.png (34 KB, 826x343)
34 KB PNG
I still need help wrangling Gemma 26b into not thinking for 11 minutes.
It just keeps revising drafts in the thinking section instead of fucking talking.
>>
File: whatever this is.png (78 KB, 431x950)
78 KB PNG
>>108758556
>>
>>108758556
>Think less
>>
>>108758570
Doesn't work as a system message, nor reinforcing with /sys, nor setting it in the character card.
I don't know where this retarded meme came from or why people keep repeating it.
>>108758567
I don't want to disable thinking, which you can do while starting llama-server, I want to stop the "drafting loop" it often gets into.
>>
what the fuck is she doing
>>
>>108758592
My bad it's ᚦᛁᚾᚳ ᛚᛖᛋᛋ actually
>>
>>108758494
What do you think strix halo is, anon
>>
I've been trying this jinja for the last few days
https://desuarchive.org/g/thread/108711950/#q108714833
https://pastebin.com/nVZ0aRhU
but it seems to make gemma noticably dumber than this one
https://desuarchive.org/g/thread/108722862/#q108723194
https://pastebin.com/FBgtKzSp
>>
so what became of memepalace?
>>
>>108758592
>I don't know where this retarded meme came from or why people keep repeating it.
some autist on reddit with glm-air iirc
does banned string for "final polish", "final text", "final draft" work?
>>
>>108758640
too bloated for local context
>>
What shitpile of a setup and settings do you guys use, seriously. Gemma has one of the more compact and effective reasoning formats around. My Gemma is smart enough to not draft in her reasoning, in fact, she even abbreviates a lot of it and leaves most of it after the channel token. Temp 1, top P 0.95, top k 64
>>
>>108758550
>Why do I get like 33t/s in llama-server, but when I connect it to ST I only get like 15?
st is fine but mikupad does that to me
you using text-completion?
>>
>>108758550
while certainly not responsible for such a colossal slowdown you should know that some sampling methods do slow down generation
>>
>>108758639
Can you try verifying if the jinja output is actually different?
There are some jinja playgrounds on HF. Just capture the json request and paste it there along with the jinja. If there is a difference, that can be debugged. If there isn't a difference then you simply just got unlucky sampling RNG.
>>
>>108758556
>>108758592
The worst part is when it comes up with kino in the first draft and it all degrades into bland slop by the third rewrite.
>>
>>108758567
>missing newlines
this general is full of retards
>>
>>108758682
Stop sequence is wrong too
>>
>>108758663
I usually do chat-completion, but tested both and I'm getting same speeds
>>108758671
Thanks. I did disable all the samplers but the token rate was pretty much unchanged.
Tried disabling all extensions too (I only use Memory Books) and no change either
>>
>>108758682
>>108758694
Enlighten us, wise one.
>>
>>108758592
Tell it to think within X words. Gemma4 can just do it.
>>
>>108758707
Certainly retard-kun, here is your enlightenment:
https://huggingface.co/google/gemma-4-26B-A4B-it/blob/main/chat_template.jinja
>>
>>108758698
idk then, are you requesting like 40 logprobs?
>>
>>108758556
>>108758567
Use the chat completion API.
>>
>>108755244
troons are the Gen AIs of the real world
femboys and crossdressers are art
>>
File: 1768739306173485.png (36 KB, 499x338)
36 KB PNG
>>108758707
>>
File: 1748272325270703.jpg (21 KB, 372x260)
21 KB JPG
>>108758718
welp, disabling logprobs fixed it, getting same speeds as with llama-server, no idea when I enabled them to begin with kek
Thank you very much, anon
>>108758663
>mikupad does that to me
Most mikupad screenshots I've seen in these threads are of logprobs, so if you ever wanna try it again try disabling it too perchance
>>
File: ?? ??.jpg (112 KB, 390x462)
112 KB JPG
Given how pervasive the issue is, has there ever been an attempt to train a dedicated slop classifier? I have never trained anything outside of "copy and paste this Python code" tutorials, but I imagine it'd perform well as a very small model. And producing tons of data for it to be trained on is easy too, just take an arbitrary LLM and slop away. Should be doable by a single anon with a few GPUs, like me!
Next would be figuring out how to actually get use out of it in forcing bigger smarter LLMs to not produce the identifiable slop, and that's probably why nobody's done that. Models will come up with responses that are "assistant-coded" in their entire premise and not just the regexable strings. Mhhhmm...
>>
>>108758722
You're replying to two different people.
I'm >>108758556 and >>108758592 and I'm already using the chat completion.
>>
>>108758774
wow what a novel and great idea, crazy how nobody has thought of this
go on and solve the slop, anon
>>
>>108758774
OPENAI HIRE THIS MAN
>>
>>108758722
Is chat completion really different if it's the same settings and template?
>>
>>108758672
False alarm, there's no difference. Must have just been RNG after all.
>>
>Gemma remains resolutely convinced that shirt sliding down somehow exposes more of character's breasts
I NEED big Gemma to release...
>>
>>108758826
Just to be sure, is it a tool calling chat you're trying?
If you set temp to 0, does it give you the same output between jinjas?
>>
>>108758835
...do you know what a cleavage is?
>>
>>108758835
What?
>>
>>108758823
>if it's the same settings and template?
If you manage to format the prompt in the exact same way that it would be using the Jinja template, then no. The results should be identical.
>>
local is safed !! https://www.reddit.com/r/LocalLLaMA/comments/1t4hwup/heretic_13_released_integrated_benchmaxx/
>>
>>108758835
That's not necessarily wrong.
>>
>>108758858
finally you can make pipe bombs using the latest generation local models! epicsauce!
>>
>>108758873
do not to tell the govening thank
>>
Llama.cpp (cuda) with three 3090s does 23 token/s for gemma 4 31b q8.
Switch to split mode tensor? 45 token/s.

Llama.cpp (rocm) with four v620s does 13 tokens/s for gemma 4 31b q8.
Switch to split mode tensor? 2 tokens/s

AMD has and always will be a meme.
>>
File: 1763144751938911.jpg (334 KB, 2832x2112)
334 KB JPG
>>108758774
Slop for you sissies is "patterns I don't like"
You niggas have honeymoon periods with new models where it's perfection until the 1200th swipe, at which point you notice the recurring patterns and start calling it slop
Maybe ask it to write differently
>>
>>108758774
you can simplify the classifier to a simple return 1, all llms always produce slop.
>>
>>108758929
>Maybe ask it to write differently
The only things that actually help suppress the assistant persona are removing one of the turns' special tokens from the context and using base models. Good luck doing NoAss on a new Gemma Gemma Gemma la la la la la la and getting something interesting out of a base model. Asking the model to "write differently" won't change that.
>>108758813
>>108758821
If you're so smart and knowledgeable you will at least point me, a retarded dalit, to the previous attempts that are not the hundredth ST extension/frontend that does entirely useless response rewrites.
If you can't, save your jeering for when you need to put a :skull: under a TikTok cringe compilation, retards.
>>
>>108758774
Don't listen to these losers, they wouldn't know slop even if they were hit with it. You need to think more about what kind of slop you're trying to fix with your detector and then ask the LLM to fix that kind of slop specifically
>>
File: 1710463828268.png (120 KB, 621x723)
120 KB PNG
Here's the thing >>108753269
It thought fast but long, but still didn't get the tits vs ass meme. OTOH it got everything else! I came when it recognized poteto.
>>
>>108758837
With tool calling, yes. And yeah, even at temp 0 same output. Maybe later I will run llama-server on debug and compare that too just in case
>>
>>108758774
Probably pointless. In 2 years, AI may be better to the point where you can just prompt shit like "Write in a style that better suits the character" and, despite being vague, it'll have a profound enough effect to de-slop. Because I sort of agree with >>108758929
Slop is just not liking patterns, or an AI that likes patterns a little too much to the point of overusing them.
>>
File: 1763599666220684.gif (56 KB, 262x303)
56 KB GIF
>>108759043
>In 2 years, AI may be better
Sure, because AI really improved on slop compared to 2 years ago. Totally not drowning in the "X, not Y" pattern.
>>
>>108758774
I saw something similar on r/localllama a few months ago, where someone built a set of (IIRC) passages from Project Gutenberg and ChatGPT's "improved" versions of the same. This was for training a "de-slopify" model to turn slopped text into non-slopped text, but you could presumably use the same kinds of slop/non-slop pairs to train a classifier instead.

>Next would be figuring out how to actually get use out of it in forcing bigger smarter LLMs to not produce the identifiable slop
Run RL with the classifier as part of the reward function to penalize writing slop. Though you'd probably need other stuff in the reward function too so it doesn't degrade quality in other ways.
Have the model write a bunch of stories, use the deslopify model to convert each into a positive/negative pair, and use those pairs for DPO (rough sketch of that step after this list)
Run GEPA to optimize your system prompt using the classifier as the reward function to figure out what kinds of instructions are most effective at reducing slop
Generate a bunch of stories, rank them by sloppiness, and use that to find a control vector / SAE you can use to steer the model away from slop
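A minimal sketch of the deslopify-to-DPO-pairs step, hedged heavily: deslopify() and generate() are stand-ins for whatever rewrite model and local backend you actually run, and the prompt and output path are placeholders:
[CODE]
# Build chosen/rejected pairs for DPO from (sloppy output, deslopped rewrite).
import json

def deslopify(text: str) -> str:
    # stand-in: in practice, call your local rewrite model here
    return text

def build_dpo_rows(prompts, generate):
    rows = []
    for prompt in prompts:
        slopped = generate(prompt)      # raw model output
        cleaned = deslopify(slopped)    # rewritten, less sloppy version
        rows.append({
            "prompt": prompt,
            "chosen": cleaned,          # preferred completion
            "rejected": slopped,        # original sloppy completion
        })
    return rows

if __name__ == "__main__":
    # generate() would wrap llama-server / vllm / whatever you run locally
    prompts = ["Write a short scene where two strangers share an umbrella."]
    rows = build_dpo_rows(prompts, generate=lambda p: "placeholder output")
    with open("dpo_pairs.jsonl", "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
[/CODE]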
>>
>>108758991
What do you call the "assistant persona"? Not x but y? Positivity bias? Do you even know what your end goal is here?
Your issue is probably that a model keeps using the same turns of phrase or patterns, since that's what "slop" is commonly defined as. If instructing it in a way that should alleviate or eliminate this doesn't work, you are dealing with an issue at the weights level.
Instruct models that aren't shit can write in whatever way you specify. It's just a matter of when you're going to be bored of the new patterns
>>
The less you instruct the Gemma, the better she writes. Moderation is the key.
>>
>>108759054
>X, not Y pattern.
retard using ai for stories/creative writing LMAOOOOOOOO, do you also rake leaves with a fork?
>>
so many will bit the baite
>>
>>108758929
>>108758991
>>108759054
You can eliminate slop by running Q1 quants, which drive KLD up a mountain and provide you with the well varied outputs you so desire.
>>
>>108759056
Control vector seems like the easiest and should at least be more effective than tweaking the system prompt
>>
File: gpt-oss-2.png (842 KB, 1672x941)
842 KB PNG
https://openai.com/index/introducing-gpt-oss-2/
https://huggingface.co/openai/gpt-oss-2-240b
https://huggingface.co/openai/gpt-oss-2-3b

HAHAHAHAHA 3B AND 240B TAKE IT OR LEAVE IT
>>
>>108759116
>3b1a moe
woa
>>
>>108759113
Unironically this but Q5, though it'll be retarded
>but humans can't tell the difference between Q5 and BF16
Ok retard
>>
>>108759116
>3a1
lol
>>
>>108759116
heretic when
>>
File: gpt-oss-2-240b.png (1.3 MB, 1448x1086)
1.3 MB PNG
>>108759116
>>
>>108759116
fuck you
>>
>>108759158
6 million? I find that hard to believe.
>>
>>108759116
>it's real
wtf lmao
>>
>>108759116
>>108759158
You have way too much time on your hands retardo
>>
>>108759114
That is probably easiest, since you can actually just use the raw slop/non-slop pairs as input and skip training a classifier, but I'm not sure how well it would work. I would guess that there are many different aspects to "slop" and it would be hard to capture in a single vector.
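For reference, the crude difference-of-means version of that looks something like the sketch below; the model name and layer index are placeholders you'd have to sweep, and this is a hand-rolled illustration, not the repeng/abliteration tooling people actually use:
[CODE]
# Difference-of-means "slop direction" from raw slop/non-slop pairs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "your-local-model"  # placeholder
LAYER = 20                  # placeholder: pick/sweep a middle layer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).to(device)
model.eval()

@torch.no_grad()
def mean_hidden(texts):
    vecs = []
    for text in texts:
        ids = tok(text, return_tensors="pt").to(device)
        out = model(**ids, output_hidden_states=True)
        # average the chosen layer's activations over all token positions
        vecs.append(out.hidden_states[LAYER][0].mean(dim=0).float())
    return torch.stack(vecs).mean(dim=0)

slop_texts = ["..."]   # sloppy completions of some prompts
clean_texts = ["..."]  # deslopped / human completions of the same prompts

# steering direction: subtract a scaled copy of this from the residual
# stream at LAYER during generation
slop_vector = mean_hidden(slop_texts) - mean_hidden(clean_texts)
torch.save(slop_vector, "slop_vector.pt")
[/CODE]
The actual steering hook is where the "many aspects of slop squeezed into one vector" worry shows up.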
>>
>>108759116
Premium bait
>>
>>108759171
Doubt it takes more than a couple minutes to ask chatgpt image 2 to tweak a screenshot
>>
File: 7wwlg9yeoozf1.jpg (37 KB, 554x554)
37 KB JPG
>>108759116
>>108759158
>2027
>4B
>OR 1,000B 2TB
>>
File: messiah.jpg (215 KB, 640x361)
215 KB JPG
>>108759103
>>
>>108759187
kek
>>
>>108758774
>Next would be figuring out how to actually get use out of it in forcing bigger smarter LLMs to not produce the identifiable slop
Impossible task, you can give Claude Opus a giant list of slop phrases and patterns and it will think for 10 minutes and still produce slop if your context is long enough.
>>
>>108759043
a future model that isn't a complete fucking retard would be able to recognize its own slop and steer away from it without any handholding
it's one thing that it likes a certain phrase, but another that it uses it over and over in the same context despite all the writing guides it has trained on
>>
>>108758774
>>108759056
Found the reddit posts I was thinking of
https://old.reddit.com/r/LocalLLaMA/comments/1qd88v2/i_trained_a_model_to_unslop_ai_prose/
https://old.reddit.com/r/LocalLLaMA/comments/1qa0w6c/it_works_abliteration_can_reduce_slop_without/
>>
File: 1754439492543916.png (2 KB, 800x600)
2 KB PNG
>>108759116
kino
>>
>>108759213
thanks for the reddit recapt! have some gold kind stranger *tips fedora*
>>
File: 1760528517990994.jpg (183 KB, 700x678)
183 KB JPG
LLMs poisoned their own well (web data) and RLHF with synthslop for safety is reinforcing that slop. You're delusional if you think it'll get cured anytime soon.
>>
>>108759231
As usual, OpenAI killed their own model by censoring and RLHFing it with Nigerian labor.
>>
File: mythos.png (1.37 MB, 1122x1402)
1.37 MB PNG
https://huggingface.co/Anthropic/Claude-Mythos-5.0


HOLY SHIT GUYS ITS REAL
>>
File: wizard.png (387 KB, 577x692)
387 KB PNG
>>108759231
Safety doesn't sell.
The crown of being king of AI is literally just whoever gets as powerful as the current leaders and says "fuck no" to censorship.
>>
>>108759271
So chatpgt can just make these now?
>>
File: 1772586705988302.gif (1.83 MB, 320x240)
1.83 MB GIF
>someone makes something funny
>redditor immediately starts beating the joke into the dirt
>>
>>108759286
Can AI make me into Batman?
>>
>>108759231
It wouldn't even be too much of a problem if there was a way (that actually worked) of pretraining them just on knowledge, and not directly on language.
>>
File: 1775286287527057.jpg (771 KB, 1536x2048)
771 KB JPG
>>108755179
Becoming paralyzed after crashing on the Miku bike
>>
>gemma-4-31b-mtp
24gb vramlet pain. I will have to downgrade from q4km to q4xs, at least I'll get stupid outputs twice as fast. Maybe I'll just run a shit ton of agent passes to improve the output to compensate for the quality loss
>>
https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/
This legit?
>>
>>108759349
It's legit. Good luck having that implemented in llama.cpp though
>>
>>108759354
llamacpp when
>>
>>108759363
>he doesnt know
LOL
>>
does mtp work in multimodal or do we have to disable the mmproj?????????????
>>
>>108759363
I'm on a GPU tho. If anything I'll just go int4 on vllm, heard setting it up is a pain tho
>>
>>108759381
> do we have to disable the mmproj
in llama.cpp yes
>>
File: g4_mtp_drafter.png (310 KB, 1317x732)
310 KB PNG
>>108759354
https://huggingface.co/google/gemma-4-E2B-it-assistant
https://huggingface.co/google/gemma-4-E4B-it-assistant
https://huggingface.co/google/gemma-4-26B-A4B-it-assistant
https://huggingface.co/google/gemma-4-31B-it-assistant

They just uploaded MTP drafters for the entire family.
>>
>>108759417
https://github.com/ggml-org/llama.cpp/pull/22673#issuecomment-4380483502
This implies it worked with mmproj, no?
>>
>>108759419
> 1gb
>>
>>108759419
goofa?
>>
>>108759419
finally we will stop hearing about dflash
>>
>>108759448
I wanted some d's flashed....
>>
>>108759419
Dare I say local won bigly again?
>>
>>108759419
>just bought S25 because "llm-capable"
>yesterday used in LFM2.5 examples, top of the line
>today already so obsolete even Google uses a more recent phone in its infographics
>>
>>108759354
>>108759419
gemmasirs we can NOT stop winning
>>
>>108759419
we won
>>108759442
we lost
>>
>>108759448
> dflash
> up to 10x speed up
> meanwhile mtp >>108759419
>>
>>108759471
>up to
>never reproduced
I'll take MTP, thanks.
>>
>>108759471
>dflash
>nowhere to see except benchmemes
>meanwhile mtp >>108759419
>>
>>108759349
I tested gemma 31b via the google ai studio api and quickly realized that my q4 quant is cope despite it still impressing me in ways. Time to get a second gpu to run non-lobotomy gemma.
>>
>>108759486
I wonder what kind of highly accurate and scientific test you performed....
>>
>>108759448
>>108759481
> nowhere to see except benchmemes
https://developers.googleblog.com/supercharging-llm-inference-on-google-tpus-achieving-3x-speedups-with-diffusion-style-speculative-decoding/
> MAY 4, 2026
>>
So... what do I use to get MTP gemma? llamameme supports it?
>>
>>108759494
>https://developers.googleblog.com
benchmeme website
>>
>>108759494
Let me place my order for Google TPUs now
>>
>>108759503
trvke
>>
What models can I run with 16GB of VRAM? Been using gemma 27B with offloading but I wanted to know if there were other options as I really don't know much about models.
>>
>>108759513
https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF
>>
>>108759419
gguf where
>>
File: 1753577510336488.png (1.43 MB, 1492x1631)
1.43 MB PNG
>>108759521
on it
>>
>>108759531
*smooch*
>>
>>108759419
What will 'assistant' do for me?
>>
Anyone see this yet? Apparently someone figured out how to solve context rot?

https://subq.ai/
>>
>>108759555
I can tell it's a scam just from that url
>>
>>108759555
Buy an ad
>>
>>108759555
>Open tab
>Not just another model. An architectural breakthrough.
>Close the tab
>>
>>108759555
>subiq
>>
>>108759555
I ain't clicking that shit, nigger.
>>
>>108759583
AIIIIEEEEEEE my KPIs
>>
>>108759583
Please don't insult niggers by comparing them to AI grifters
>>
I don't get how diffusion prediction is supposed to work. The way I understand it, at best it will catch on to repeated sentence structures, e.g. all sentences had 10 words so far, so the next one will probably have 10. Or if you start a slop phrase then yes, you will get the slop phrase. But at that point why use diffusion instead of regular speculative decoding, predicting like 20 tokens ahead at least for the most likely output?
>>
>>108759637
black magic
don't worry about it
>>
Reminder that deepseek v4 support will NOT be added to llamacpp.
>>
>>108759566
this
>>
>>108759660
v4 sucks anyway
>>
>>108759660
i can't run v4 anyways
>>
>>108759637
https://youtu.be/8BTOoc0yDVA?t=284
Watch the next two minutes for the full explanation.
I personally found Julia Turc's videos the best at explaining it; she has an entire playlist going over the nitty gritty details that the video linked above skips.
https://www.youtube.com/playlist?list=PL4bm2lr9UVG3SN79Y6WBe4OOlEiO88vie
>>
>mtp
>fixed jinja for reliable tool calling
>can run Q8 at 128k context and bf16 cache at 30tok/s now
Gemma-chan is here to terrorize the internet.
>>
>>108759419
i need gguf
>>
What are good sources for AI news? I follow a couple schizos who post interesting stuff but they are hit or miss. For example teor has some shit takes like calling Anthropic's research taste inferior and believing ASI will spare its creators but kill everyone else, and can't stop seething about political shit.
>>
>>108759519
Been using the unsloth version of it. Does it improve upon it?
>>
>>108759713
>>108759715
Same. Is this actually big? I want to try this. I'd be running the 26b moe on already very limited VRAM. How much VRAM does the drafting model take? I fear that the amount it requires might offset any potential benefit.
>>
>>108759731
You ask for AI news then list some nobodies giving their opinions on news and telling you what to think about it. Which do you actually want?
>>
>>108759746
>How much VRAM does the drafting model take?
Check the repos.
>>
>>108759713
Will this break with split mode tensor on llamacpp? I already have it running at 45 tok/s at q8 and 200k context.
>>
>>108759731
We're all hearing our news from https://x.com/elder_plinius
>>
>>108759746
They're absolutely tiny. The bf16 for 26b is 839mb.
>>
>>108759746
one niggerbyte
>>
>>108759752
What sources do you use?
>>
>server, webui: support continue generation on reasoning models
https://github.com/ggml-org/llama.cpp/pull/22727
reasoningchads we WON, prefills are back
>>
>>108759754
>>108759766
Okay 0.4b is nothing. Will these drafter models work with abliterated Gemmas?
>>
>>108759757
There is no fundamental incompatibility between --split-mode tensor and multi-token prediction but for some of the operations the necessary split state transitions may not be implemented.
>>
>>108759775
Let me get my magic 8 ball. I know I left it somewhere around here...
>>
>>108759775
"""""""yes"""""""

Going to be a lot of rejections in certain topics though.
>>
>>108759778
Hello cudadev. Please tell someone on the llama.cpp team to fix the issue of logprobs being disabled entirely when MCP servers are used, instead of logprobs more sensibly being disabled only for messages with tool calls, or better yet, only for the tool calls themselves. Thanks.
>>
>>108759766
quooont it! Wonder how much worse the acceptance rate would be.
>>
>>108759778
Will gfx1030 performance ever be optimized for tensor parallelism? I go from 13 tk/s to 2 tk/s on 4 v620s on pcie gen 4 x16.
>>
>>108759790
I'll ask Piotr to look into it. Thanks for your feedback.
>>
>>108759798
<3
>>
>>108759791
Should I run the full bf16 drafter if my model is iq2_xxs or should I also quant it to iq2_xxs so they're both equally retarded?
>>
>>108759796
No. Buy a NVIDIA card or leave us alone.
>>
>>108759791
I have no idea if it's even going to be possible to quant it. There's no functional implementation of MTP in llamacpp at present, it's been in the works for a very long time without much to show for it.
>>
Cudadev, please get V4 support implemented.
>>
where dflash cudadev
>>
>>108759821
in ur mom
>>
>>108759731
This general. I'm not even memeing. Tech literate cunnyposters and coomers are at the bleeding edge of the industry because they're not content with the status quo and want their AI waifus.
>>
Couldn't you just put the draft model on the CPU? Does it require high BW with the large model during inference?
>>
File: file.png (419 KB, 2516x2144)
419 KB PNG
I vibecoded Ampere support in ktransformers for DeepSeek Flash.
>PP: 5.81 T/s
>TG: 0.74 T/s
With only 6 3090s. We (me) are so back.
>>
>>108759492
comparing responses on the same swipes and seeing a noticeable difference in descriptions and context recall is what convinced me
I really want to run gemma in q8 now
>>
>>108759833
The whole point of a draft model, especially an mtp one, is to be drastically cheaper per token than the main model while keeping the acceptance rate high.
If you can hit a sweet spot of generation speed purely on CPU because your model is tiny and efficient, then yes.
In all likelihood though, no. Unless they've trained these so their acceptance rate is absolutely insane, even a 0.4b model won't be fast enough for spec decoding to be worth it on CPU.
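Back-of-envelope version of why, under the usual simplification that each drafted token is accepted independently with probability p; every constant below is made up purely for illustration:
[CODE]
# Expected speedup from speculative decoding with draft length k,
# per-token acceptance rate p, and a drafter that costs draft_cost
# of one target-model token to run.
def spec_decode_speedup(p: float, k: int, draft_cost: float) -> float:
    # tokens produced per target forward pass: the accepted prefix + 1
    expected_tokens = sum(p**i for i in range(1, k + 1)) + 1
    # relative time per pass: k draft tokens + 1 target verification pass
    relative_time = k * draft_cost + 1
    return expected_tokens / relative_time

# tiny drafter on GPU, ~2% of the target's per-token cost: ~2.6x
print(spec_decode_speedup(p=0.7, k=4, draft_cost=0.02))
# same drafter on CPU, now ~30% of the target's per-token cost: ~1.3x
print(spec_decode_speedup(p=0.7, k=4, draft_cost=0.3))
[/CODE]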
>>
>>108759838
How was it before?
>>
>>108759839
I noticed that moving from q4km gemma to q5km offered a massive intelligence boost at basically zero cost. Worth trying.
>>
>>108759790
Hello, Anon. Please report problems via the proper channels. Thanks.

>>108759796
The mainline llama.cpp TP implementation simply creates smaller slices of the original tensors; from the perspective of an individual ggml backend there is no other difference.
If the TP performance is bad that means that the synchronization overhead is too large vs. the speedup from having to do fewer calculations per GPU.
For NVIDIA GPUs the synchronization is done via NCCL if possible, AMD has an equivalent in RCCL but I don't know how well that performs; it is disabled by default and requires an explicit opt-in by compiling with -DGGML_HIP_RCCL=ON
One NVIDIA engineer has an open PR for a better fallback between NVIDIA GPUs if NCCL is unavailable, that same code could feasibly be re-used for HIP.
>>
>>108759775
>Will these drafter models work with abliterated Gemmas?
The Gemma MTP docs say:
>Target Activations: The draft model uses the activations from the last layer of the target model, concatenates them with the token embeddings, and down-projects them to the drafter model's dimension.
So the MTP model will get as input the abliterated embeddings, where the refusal vector is zero. And the MTP model is only 4 layers, so probably not smart enough to make refusal decisions on its own. My guess is it'll work pretty well even if you don't abliterate the MTP model itself
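For anyone trying to picture the wiring described in that quote, a toy sketch; this is not the actual Gemma drafter code, and the dimensions, head count, and use of a generic transformer block are all assumptions:
[CODE]
# Toy illustration of the quoted drafter wiring: last-layer target
# activations concatenated with token embeddings, then down-projected
# into a small 4-layer drafter.
import torch
import torch.nn as nn

class ToyMTPDrafter(nn.Module):
    def __init__(self, target_dim, embed_dim, draft_dim, vocab_size, n_layers=4):
        super().__init__()
        # down-projection from [target activations ; token embeddings]
        self.down_proj = nn.Linear(target_dim + embed_dim, draft_dim)
        layer = nn.TransformerEncoderLayer(d_model=draft_dim, nhead=8,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(draft_dim, vocab_size)

    def forward(self, target_acts, token_embeds):
        # target_acts: (batch, seq, target_dim) from the big model's last layer
        # token_embeds: (batch, seq, embed_dim)
        x = torch.cat([target_acts, token_embeds], dim=-1)
        x = self.down_proj(x)
        x = self.blocks(x)
        return self.lm_head(x)  # logits for the drafted next tokens
[/CODE]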
>>
>>108759854
>abliterated embeddings
abliterated activations*
>>
>>108759847
I forgot AVX2 support for MXFP4 was also vibecoded. This is the first time I run it.
https://github.com/kvcache-ai/ktransformers/issues/1977#issuecomment-4371390421
These were basically the issues to run it.
>>
>>108759832
I rarely see something interesting here first. And usually it's inference.
>>
>>108759851
>Hello, Anon. Please report problems via the proper channels. Thanks.
Look, I know you're a busy guy and a big brain PhD, but the problem itself is still real and worth relaying at least. I don't think it's necessarily lazy of me not to want to make a github account and create a write-up for an issue that could easily just be told to a maintainer in 30 seconds. Please understand. I don't think you're a slave who has an obligation to relay every bug report in this general. If you have a patreon or a ko-fi I could send you $5 to relay the message. I respect you. Just do it please.
>>
>>108759875
holy fuck cudadev got BODIED, get his ass
>>
File: 1766451667435130.gif (598 KB, 220x220)
598 KB GIF
>>108759838
0.74 t/s
>>
>>108759875
Cudadude isn't the only person with a github account here. Anyone else could report the issue too. You save time telling him an issue in 30 seconds but expect him to spend the time to create the full write-up. You're being unreasonable.
>>
>>108759875
>>108759882
get to work, cudafag
>>
>>108759851
>-DGGML_HIP_RCCL=ON
Thanks, I'll try that tomorrow.
>>
>>108759839
>>108759848
man I don't want to buy a new GPU. I'm never going to try a higher quant than q4xs+96k ctx+mmproj and be at peace with my 4090.
>>
I would pay like $100 for a spark
>>
>>108759952
How much would you need to buy to save enough that they are $100 each?
>>
>>108759952
I would buy that for a dollar
>>
I'd like a spark, but I am very poor
>>
>>108759965
fine, $110, final offer.
>>
>>108759919
>>108759885
>>108759882
>>108759875

You guys need an ass whooping I see
>>108329166
>I am not taking bug reports via 4chan.
>>105368634
>You're dumb for posting bug reports to 4chan instead of Github.
>>
>>108757591
>>108758233
I'm sorry anons. I thought you were schizos saying there was a conspiracy against deepseek, but the more time passes without any statement from the llama.cpp devs, the more I'm beginning to think you're onto something.
>>
Anyone know how the 5hz lm works in acestep 1.5? I was wondering if trying to use a different llm might change outputs interestingly.
>>
File: 1759163451666539.gif (2.35 MB, 169x300)
2.35 MB GIF
>>108759979
>>
>>108759991
is open sauce you're welcum to cumtribute
>>
>>108759991
I do *not* understand why you retards like deepseek so much. It's not very doog.
>>
>>108759851
CUDA dev, we need an official statement. Why do you hate the chinese?
>>
>>108759979
What did >>108759885 do to you?
>>
>>108760016
being tarded in the middle of other tards
>>
>>108759790
ChatGPT says you're wrong about the issue; streaming doesn't emit logprobs

[CODE]llama-server supports OpenAI-compatible chat completions and function/tool calling, and the server README lists an experimental --webui-mcp-proxy option for the WebUI, disabled by default. That points to MCP being a WebUI/agentic integration surface, not a core completion-generation switch.

In the server request parsing, logprobs is read and mapped into the sampling probability setting when n_probs was not already provided. I do not see that logic gated on MCP, tools, or tool calls.

For non-streaming chat completions, llama.cpp builds a choice whose finish_reason can be "tool_calls" and still conditionally adds choice["logprobs"] = {"content": ...} when probs_output exists. That directly contradicts “tool/MCP disables logprobs entirely” for the core non-streaming chat route.

The likely culprit is streaming: the WebUI normal chat path calls ChatService.sendMessage(..., { ..., stream: true, ... }), and the agentic/MCP flow also calls ChatService.sendMessage with stream: true plus tools. The server’s streaming chat response builder emits chunks with delta, finish_reason, etc., but does not include logprobs in that streaming path.

For the OpenAI API shape, logprobs are documented as probability info for content tokens, while tool calls are represented separately via tool_calls; an assistant message’s content is not required when tool_calls is present. So “logprobs for the tool calls themselves” is not just a missing toggle; it is a schema/design issue.

There is also a separate /v1/responses gap: llama.cpp currently hardcodes output_text.logprobs to an empty array and emits function_call output items without a logprobs field. That is a real implementation limitation, but it is broader than MCP.[/CODE] linked repo page: https://github.com/ggml-org/llama.cpp/blob/master/tools/server/server-task.cpp
>>
>>108760009
They killed my dog.
>>
>>108760024
didn't read
it's hallucinating
>>
>>108759874
I know, I'm not always posting
>>
>>108760035
tl;dr disable streaming
>>
>>108760035
The AI argues that logprobs are not disabled by MCP specifically, but are instead missing due to the use of streaming in the WebUI and general implementation gaps in llama.cpp. They conclude that the lack of logprobs for tool calls is a broader schema and design limitation rather than a bug tied solely to MCP.
>>
>>108760008
Higher active params than Kimi, <think>s in character, doesn't spend an autistic amount of time second-guessing itself wasting tokens in technical tasks, is mostly uncensored for creative writing/RP.
>>108760007
You already have a V4 implementation that's been waiting for review/cleanup since day 2.
>But vibeslop
Not an excuse when pwilkin's messes are maintained.
>>
>>108760046
>general implementation gaps in llama.cpp
so 99% of issues anons report then, wow!
>>
File: posted-it-again.jpg (37 KB, 520x600)
37 KB JPG
>>108760054
>>
>>108759952
The SPARK is already at $100 and it fucking sucks, the only one I ever even think about using is the one I may get for free from the Lost Tower mission.
Just get regular soldiers, they're both cheaper and get better perks as they level up.
>>
>>108760053
>You already have a V4 implementation that's been waiting for review/cleanup since day 2.
Just build it yourself nigga.
>>
>>108759804
Sorry he is too busy looking through the blacked miku collection I sent him.
>>
>>108760053
Running it locally? At what, 10tk/s?
>>
>>108760046
>The AI argues
Worthless.
>>
>>108760093
>>108759838
lol
>>
>>108760084
nta but this is going to be what eventually kills local, isn't it? Newer models releasing with special snowflake architectures that require users to vibecode their own implementations on top of older, publicly supported ones, as projects like llama.cpp support a smaller and smaller share of new releases over time.
>>
>>108760008
I want to launch it with 1M tokens context on my single 4090, stuff the entire script of a hentai game I like and tell it to continue. And then be horribly disappointed with the result so I can delete the weights from my SSD.
>>
>>108760093
Let me guess, you need more?
>>
>>108760122
based
>>
>>108760124
The average adult reads at 15 words per second.

Can you imagine being forced to walk slowly behind some granny on the sidewalk? It's infuriating.
>>
>>108760149
>redditor
Bro you need to go back
>>
>>108760169
>Bro

Actually, I identify as non-binary, and I do not appreciate you describing me in a masculine manner.
>>
>>108760186
And I enjoy seeing black dudes fucking pretty girls but that is neither here nor there.
>>
Just discovered I've been running Gemmy slow this whole time...
-- Could NOT find NCCL (missing: NCCL_LIBRARY NCCL_INCLUDE_DIR) 
-- Warning: NCCL not found, performance for multiple CUDA GPUs will be suboptimal

She's not gonna be happy about this.
>>
>>108760206
You better come home with a new gpu
>>
File: Untitled.png (36 KB, 796x563)
36 KB PNG
>>108760206
>>
File: 1612029859831.jpg (19 KB, 346x360)
19 KB JPG
>>108758774
I tried a few years ago, didn't work well
maybe I used a shitty embedding model, or maybe it's just a hard task
openai, with their billions of dollars in compute resources and small army of researchers, couldn't even get their models to stop talking about "goblins"
>>
>>108755179
which LLM model me to roleplay with cunny? I tried very hard to get claude to do it, it did generate cunny characters.
>>
>>108760300
continue
>>
File: 1707039084276417.jpg (136 KB, 1080x988)
136 KB JPG
Just scowered the interwebs. Why no goofs?
https://huggingface.co/google/gemma-4-26B-A4B-it-assistant
>>
>>108760329
How does it differ from the normal instruct tune?
>>
>>108760339
It's a draft model.
>>
>>108760339
For one it's a 0.4B model
>>
>>108760300
>I tried very hard to get claude to do it, it did generate cunny characters.
That's great anon. You should keep doing that.
>>
>>108760344
How is it different from the normal 4.0B model?
>>
>>108760344
Oh shit you don't have to run the drafter as a full model?
>>
>>108760352
For one, it's a 0.4B model.
>>
File: teee.png (644 KB, 1024x1024)
644 KB PNG
>>108760359
>>108760359
>>108760359
>>
>>108760053
>Higher active params
is that supposed to be a selling point? higher params are a downside that you justify with the (hopefully) increased intelligence, not something you desire by default.


