/g/ - Technology



File: poor reception.jpg (228 KB, 1216x832)
228 KB JPG
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>109018067 & >>109013071

►News
>(06/09) Cohere releases North-Mini-Code-1.0: https://hf.co/CohereLabs/North-Mini-Code-1.0
>(06/07) llama : add Gemma4 MTP #23398 MERGED: https://github.com/ggml-org/llama.cpp/pull/23398
>(06/05) dots.tts 2B released: https://hf.co/rednote-hilab/dots.tts-soar
>(06/05) Gemma 4 QAT models released: https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4
>(06/04) Higgs Audio v3 TTS released: https://boson.ai/blog/higgs-audio-v3-tts

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://swe-rebench.com
Agentic Coding: https://deepswe.datacurve.ai
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: reward function.jpg (184 KB, 1024x1024)
184 KB JPG
►Recent Highlights from the Previous Thread: >>109018067

--Backlash against Claude Fable 5 safeguards restricting AI development assistance:
>109018762 >109018775 >109018788 >109018951 >109018795 >109019418 >109019954 >109022832 >109021208
--Comparing llama.cpp forks and mainline for Gemma 4 MTP stability:
>109019974 >109020004 >109020057 >109020066 >109020261 >109020114 >109020145
--Comparing offline RAG databases against internal model knowledge for natural conversation:
>109018110 >109018270 >109018331 >109018502 >109018572 >109018607 >109018630
--Distinction between weight-based distillation and SFT on synthetic outputs:
>109021684 >109021705 >109021716 >109021732 >109022239 >109021739
--Ways to prevent repetitive NPC naming using external generators:
>109021769 >109021914 >109021947 >109021974 >109022002 >109022028
--Simultaneous use of MTP and ngram speculative decoding causing crashes:
>109021514 >109021800 >109021824 >109021837
--Testing Mythos/Fable using a poetic meter and Morse code puzzle:
>109020569 >109020615 >109020654 >109020690
--AI safety filters misclassifying interpretability research as adversarial attacks:
>109021825 >109021873 >109021921
--Gemma roleplay recommendations and technical methods for controlling reasoning loops:
>109019534 >109019554 >109020208 >109020239 >109020240 >109020260 >109020297 >109021562 >109019598 >109019613 >109019678 >109019696 >109019754 >109019784
--Debating the need for native image output in LLMs:
>109020549 >109020724 >109020732 >109020743 >109020760 >109020879 >109021572 >109021590 >109021626
--Sharing the Efficient Gemma Dashboard on Hugging Face:
>109019331
--Logs:
>109018604 >109018734 >109019271 >109020654 >109021269 >109021315 >109021825 >109021873 >109021921 >109021955 >109022832
--Teto, Miku (free space):
>109018109 >109018270 >109020903 >109021002 >109021733

►Recent Highlight Posts from the Previous Thread: >>109018073

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
70b dense
>>
120b moe
>>
The AI summary in google web search. How does that work? Would be cool if I could implement something similar I guess in searxng, it seems to be a pretty small model, and they are running it on every search, so it might be doable, I'm thinking. But I dunno, if you tell me that it actually only works because google already has a cache on hand for it to read, rather than actually pulling the material from the results manually each time, then I guess it's not possible.
>>
>>109023092
Trial and error. It also varies on not only a per-model basis but a per-quant+sampler combination basis too. glhf.
>>
Fable writes kinda funny, I find myself re-reading its prose sometimes to understand what it meant. Looking forward to some of its traces getting fed into chink models, I bet it helps with slop.
>>
File: 1779988194173572.jpg (1.9 MB, 4928x3264)
1.9 MB JPG
north mini > qwen35b/gemma26b for rp btw because it's white and not stem-cucked
https://huggingface.co/unsloth/North-Mini-Code-1.0-GGUF

>git clone https://github.com/ggml-org/llama.cpp
>cd llama.cpp
>git fetch origin pull/24260/head:cohere2-moe
>git checkout cohere2-moe
>>
>>109023142
Logs for example?
>>
>>109023149
>coding model
>not stem-cucked
>canada
>white
retard
>>
>>109023149
cockbench?
>>
File: 1711918139788504.png (164 KB, 961x565)
164 KB PNG
>>109023160
>>
>>109023149
>not stem-cucked
>model has "code" in its name
Okay?
>>
Blessed thread of friendship
>>
>>109022939
>>109022939
>>
File: file.png (145 KB, 1054x374)
145 KB PNG
>>109023160
Humorously, I don't have any actual narrative-style content yet, maybe it's still slopped. I'm having it rewrite a small character card game I made a while back that has different characters automatically talk to one another, and the code is fairly simple/plain. Fable is writing its comments in fairly long winded, almost run on sentences, and giving things strange names. Feels almost a bit schizo.
>>
>>109023206
I'm not your friend, buddy
>>
>>109023182
it's shit at coding for it leans too hard into creativity, even their own benchmarks show it doesn't hold up against qwen35b which isn't even a coding model, its name is just marketing bait
>>
>>109023149
>coding model
>non-stemmaxxed
>RP...?
what are you on?
the whole point of a coding model is to stemmaxx it as hard as they can
>>
So which mtp model for gemma do I have to download?
>>
>>109023142
>re-reading its prose
Don't do that. You'll identify its unique brand of slop quicker that way.
>>
>>109023220
How is a small company failing to make a competitive coding model a sign that they secretly created a creative roleplay model? This is Cohere, they made one good model then leaned extremely heavily into safety. Is this some attempt at shitposting? Shilling trash models to make people waste their time?
>>
>>109023226
>>109023220
that was their intent but their post-training was dogshit and didn't achieve its goal
>>
>>109023240
topkek if true
lmao
>>
>>109023085
>(06/05) Gemma 4 QAT models released: https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4
Are these any good?
>>
>>109023182
>>109023226
>>109023237
>>109023249
https://huggingface.co/spaces/CohereLabs/North-Mini-Code-1.0
try it
>>
North mini code is the best AI that just came out.
Look at these benchmarks. I'm gonna replace my qwen3.6 with it.
>>
>>109023254
>try it
Too lazy to test something that doesn't work out of the box with my docker image.
>>
>>109023253
I moved from Q6 to it and didn't have any problems, feels about the same. Other people are saying they are having problems with it. Try it yourself.
>>
>>109023149
I actually believe in Cohere and think Canada number 1. That said, if you want me to download this meme, post at least one (1 a single) screenshot of it in real RP, Aidan.
>>
>>109023130
>The AI summary in google web search. How does that work? Would be cool if I could implement something similar I guess in searxng
thats what i was just thinking.

anyway i think its gemini flash+cache. the answers are never very good
>>
>>109023260
I can't trust any infographic made with AI.
>>
>>109023130
and i asked an AI about this it said you could combine searxng and "perplexia"
>>
>>109023220
It holds its own in science
>>
zamn, since frontier models got good I've been so busy with prompting that I don't even have time anymore to check in here to see how the loccets are doing
>>
>>109023282
you're mom is a locust
>>
File: 1762013852048458.png (1.32 MB, 1180x1536)
1.32 MB PNG
>>109023260
>5B parameters fewer
>destroys the competition
>>
File: 1769807449753523.jpg (63 KB, 329x283)
63 KB JPG
When are they dropping the transformer meme
>>
File: 1779772755779154.webm (1.73 MB, 384x1080)
1.73 MB WEBM
>>109023288
not now frenchie
>>
>>109023288
When you release a new model using revolutionary architecture that forces the whole market to abandon all of their research and tooling.
>>
>>109023272
I can't trust any infographic. This format is made to deceive
>>
>>109023288
The Transformer architecture is not the problem. Even models with JEPA will still use it.
>>
>>109023130
>>109023268
>>109023275
Lol idiots. It's just RAG with one giant cached web db. Perplexity is also just RAG. That's why you will never find niche information that was published less than a month ago on some random-ass forum/blog which isn't twitter/reddit or whatever. They are all RAG, because only RAG can deal with the sheer amount of stored (cached) web info.
>>
How did a Canadian company become the best at stem models?
>>
will a 5090 be enough for uncensored image gen or do I wait for the next generation?
>>
>>109023325
you could have just said "yes, it does use cache"
>>
>>109023325
ill start caring about what ai words mean when they start accomplishing anything truly novel
>>
>>109023329
5090 does image and video gen just fine. Also wrong general.
>>
I can't wait to see what models 3-5 years from now can do (unironically).
>>
>>109023329
>>>/g/ldg
>>
>>109023342
Nothing useful for my 1050ti
>>
>>109023329
Putting subject matter aside, you're basically asking to buy flowers in an area that had a severe drought where everyone was warned for years that the drought was going to happen.
I can gen unlimited waifu with fat tits and dark nipples because I used my cognitive while you sat there farding and shidding in the corner
>>
>>109023341
>>109023344
ty
>>
>>109023342
Hybrid LLM-Jepa models that make perfect AI waifus, but all local infrastructure and frontends have been sabotaged into unusability and nobody publishes their own vibed replacements that they spent 30 gorillion Claude shekels on.
>>
File: snapshot.jpg (151 KB, 1280x720)
151 KB JPG
>>109023149
>>
File: 1763337062196954.png (20 KB, 578x61)
20 KB PNG
>>109022985
I get like 6t/s with 26b
>>
Trying to run NoLiMA on Gemma and even with average 600 t/s PP the sub 1K context tests have taken several hours each to complete.
Probably should just rent a server next time.
>>
>>109023375
--n-cpu-moe
>>
>>109023375
I'm running Q8 26B 128k ctx with 8GB VRAM like this, try it
n-gpu-layers = 99
flash-attn = on
jinja = true
temp = 1.0
top-p = 0.95
top-k = 64
parallel = 1
cpu-moe = true
ctx-size = 131072
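
For anyone who launches from a shell instead of a preset file, the settings above can be turned into a command line mechanically. A minimal sketch, assuming every key maps one-to-one onto a `llama-server` long flag of the same name and that `true` means a bare switch. `preset_to_argv` is a made-up helper, and flag spellings change between builds, so check `llama-server --help` before trusting the output.

```python
import shlex

# The settings from the post above, as plain `key = value` lines.
PRESET = """
n-gpu-layers = 99
flash-attn = on
jinja = true
temp = 1.0
top-p = 0.95
top-k = 64
parallel = 1
cpu-moe = true
ctx-size = 131072
"""

def preset_to_argv(text, binary="llama-server"):
    """Turn `key = value` lines into a command-line argument list.

    Assumption: each key is also a valid long flag (`--key`).
    """
    argv = [binary]
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if value == "true":       # boolean switch: pass the flag only
            argv.append(f"--{key}")
        elif value == "false":    # disabled switch: leave it out
            continue
        else:                     # everything else is `--key value`
            argv.extend([f"--{key}", value])
    return argv

argv = preset_to_argv(PRESET)
print(shlex.join(argv))
```

Append your own `--model` path before running the printed command.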
>>
>>109023389
That's with n-gpu-layers = 10, n-cpu-moe = 8
>>
File: 1779966863358960.png (151 KB, 781x682)
151 KB PNG
IT'S OVER
https://blog.google/innovation-and-ai/technology/developers-tools/diffusion-gemma-faster-text-generation/
>>
>>109023412
HOLY FUCKING KINO
>>
>>109023412
gguf status?
>>
File: 1769252380992520.gif (191 KB, 150x156)
191 KB GIF
>>109023412
WTF
>>
>>109023405
>n-gpu-layers = 10
NTA, but n-gpu-layers 99, n-cpu-moe 99 or however much you can fit in your RAM
>>
>>109023412
I only care about prefill speed at this point
>>
>>109023412
>diffusion
wait WHAT?
>>
>israel
Holy fucking shit we won
>>
File: file.png (9 KB, 577x44)
9 KB PNG
>>109023412
Claude doesn't want me to know what this is about
>>
>>109023412
OVER 1000!!!!
>>
>>109023412
Finally a major player doing diffusion llms.
>>
>>109023412
Small models and a dozen refinement passes may be the future. Now if only the models could recognize their own mistakes...
>>
File: 1756356668077042.jpg (416 KB, 1280x1920)
416 KB JPG
UHHHHHH FABLE BROS?
>>
>>109023412
>It's bad
>>
>>109023412
now this won't make it into llama.cpp anytime soon
>>
>>109023432
Won how? How will text streaming work with a diffusion model?
>>
>>109023412
>DiffusionGemma fits comfortably within 18GB VRAM
16 gb bros getting fucked again
>>
>>109023412
https://huggingface.co/google/diffusiongemma-26B-A4B-it
WHY COULDN'T IT HAVE BEEN THE 31B
>>
>>109023460
That's crazy fucking good for a prototype
>>
>>109023336
No, retard. You can't grep on your billion terabyte webcache dir like your coding agent does. That's why you need a vector database and RAG. Lmao 2026 newfags don't even know what RAG is now because all the retarded vibecoding talking heads kept spamming the
>grep is all you need
meme
>>
Qwensisters...our response?
>>
>>109023469
What's the point? You are going to wait for the whole message to complete before you see any text, instead of streaming text at a readable speed
>>
>>109023412
Diffusion models don't perform well on system RAM even if this ever gets llama.cpp support. It's useless unless you can run it fully on your GPUs, which is pointless for MoE.
>>
Google is really cooking huh?
>>
>>109023483
Diffusion models are the future. We've been waiting for this for years.
>>
>>109023476
RAG is the biggest cope and meme
>>
https://huggingface.co/unsloth/diffusiongemma-26B-A4B-it-GGUF
https://unsloth.ai/docs/models/diffusiongemma
>>
>>109023426
>>109023403
26.92 t/s with
sm = layer
spec-type = draft-mtp
n-gpu-layers = 99
n-cpu-moe = 999
same with cpu-moe = true instead of n-cpu-moe = 999
>>
>>109023476
Are you retarded? We are talking about a search engine. It already has the results, then the cache is a simple look up with the URL as the key. RAG would be slower than this.
>>
Just the fact they're even experimenting with this is huge for us. Massive win for local again.
>>
>>109023412
gemma-chan won...
>>
>>109023497
>Change -n 2048 as the number of tokens you want to predict, so more will produce longer answers.
How am I supposed to know how many tokens I want to predict?
>>
File: katawa shrug.jpg (198 KB, 828x605)
198 KB JPG
>>109023412
Though the dream is over, neat. More new toys to play with.
>>
>>109023531
If you've played with image models you naturally get a feel for the numbers that are right for the task/project. Just have to experiment.
>>
>>109023534
We all knew they would never release 124b, so there is no loss here, only a win
>>
>>109023503
>>109023403(me)
I didn't test mtp yet and with my old 2060S I get 17-20 t/s so you're already ahead of me
>>109023412
>18GB VRAM
it's so over
>>
>>109023543
No.
>>
>>109023543
Gotcha, 1024x1024.
>>
>moe
>diffusion model
lmao
>>
File: 1761666691203175.jpg (101 KB, 1920x1080)
101 KB JPG
>18GB
for now
>>
>>109023505
No, you are the retard. We are talking about the AI summary from top web search results. Which obviously runs RAG on the cached web db to find the relevant text bits, which are then fed to gemini to answer the query with. Holy fuck what happened to this general, only idiots and noobs left here.
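
Nobody in the thread knows how Google actually does it, but the two steps being argued about (a cache lookup keyed by URL, then picking the relevant text bits to feed the model) are easy to sketch. This is a toy, assuming a hypothetical in-memory `PAGE_CACHE` and plain word overlap as the relevance score; a real setup would more likely use embeddings and a vector index, and `build_prompt` is an invented name.

```python
import re

# Hypothetical page cache: URL -> previously fetched page text.
PAGE_CACHE = {
    "https://example.com/a": (
        "llama.cpp supports MoE offload. "
        "Use the cpu-moe option to keep experts in RAM."
    ),
    "https://example.com/b": (
        "Bread recipes need flour and water. "
        "Bake for forty minutes."
    ),
}

def tokenize(text):
    """Lowercase alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def chunks(text):
    """Split a page into sentence-sized chunks."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def score(query, chunk):
    """Number of distinct query words that also appear in the chunk."""
    return len(set(tokenize(query)) & set(tokenize(chunk)))

def build_prompt(query, result_urls, top_k=2):
    # Step 1: plain lookup, the URL of each search result is the key.
    pages = [PAGE_CACHE[u] for u in result_urls if u in PAGE_CACHE]
    # Step 2: retrieval, rank chunks from those pages against the query.
    candidates = [c for page in pages for c in chunks(page)]
    ranked = sorted(candidates, key=lambda c: score(query, c), reverse=True)
    context = [c for c in ranked[:top_k] if score(query, c) > 0]
    snippets = "\n".join(f"- {c}" for c in context)
    return f"Answer using only these snippets:\n{snippets}\n\nQuestion: {query}"

prompt = build_prompt(
    "how to offload moe experts to ram",
    ["https://example.com/a", "https://example.com/b"],
)
print(prompt)
```

The resulting prompt is what would be handed to whatever small model writes the summary. For searxng the cache would have to be filled by fetching the top results yourself.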
>>
>>109023531
Maybe they will finally fix prefill in lmao.copp?
>>
where can I download 2 more GB of vram?
I don't care I'll pirate it if need be
>>
>>109023565
leather jacket man will not be happy
leather jacket man will find you
>>
>deepmind step-up and give localchads something after all the cloud fable bs IPO faggotry
>immediately start complaining about the very first test/use of the tech that will inevitably improve and shrink
>>
>>109023412
that's nuts
31b when
>>
File: 1672862565083.png (265 KB, 510x458)
265 KB PNG
>>109023412
>Additionally, official support for llama.cpp is arriving soon.
>>
>>109023545
You're right :(
:)
>>
>>109023565
Just buy 3070, they're cheap
>>
>>109023427
same for my use case, i need to ingest a crazy amount of data, like 5k+ PP tks at a level of Gemma 4 31B intelligence
>>
https://huggingface.co/google/diffusiongemma-70b-it
https://huggingface.co/google/diffusiongemma-70b-it
https://huggingface.co/google/diffusiongemma-70b-it
>>
>>109023604
thx
>>
>update llmaocpp
>free 4tk/s upgrade
nice.
>>
>>109023592
Any serious agentic shit is 90% pp 10% tg
>>
>>109023604
You
>>
>>109023604
WOW
>>
>>109023604
I didn't even click but I'm still giving (you) a reply because I want it to be real someday.
>>
File: file.png (8 KB, 118x114)
8 KB PNG
>>109023604
>mfw
>>
Oh shit, day1 support in vllm on spark. Recipe already there, but nvidias NVFP4 quant is still 404ing..
>>
>>109023212
Hahaha, working with this a bit more, and realized it's costing me over $5 per message (at least I have free credits). Utterly insane. Back to Gemma.
>>
>>109023412
But what about prefill processing.... prefill? Anyone? Remember prefill? That ol' thing?

Well this will be good for translating stuff I guess
>>
File: file.png (5 KB, 393x49)
5 KB PNG
Not gonna run, is it.
>>
>>109023092
You'll find lots of noise about this in general. The best answer is to know how your model operates and prompt around it. Which means, trial and error.
>>
Can diffusion llms have negative prompt like image models?
>>
Holy shit don't use Fable 5 bros..... I can't go back to local anymore after using it once. It reminds me of the massive gap we had when Claude Opus 3 released compared to fucking mythomax at the time.
>>
>>109023685
What would you even prompt in the negative?
>>
>>109023685
Yes, but you don't want it
>double memory consumption for context
>>
Your brain+Gemma31b > Fable
and free!
>>
>>109023696
NTA, but things like "you are a very safe assistant" could work.
The only problem is that negative prompting would need dedicated training.
>>
>>109023207
I've really liked using an ARA Gemma finetune for image description and transcription. It's perfectly capable of image analysis, even complex ones. It needs pushing to note features of an image that it wouldn't bring up (like sexual characteristics) but that's purely a prompt issue. It's been great at reading text thus far, as long as it's got the resolution and tokens to work with.
>>
diffusion qat q4_0 goof where
>>
>>109023689
>$200 per hour
no thanks
>>
>>109023689
I talk a little bit and lost like $5 this shit is insane
>>
>>109023685
https://docs.sillytavern.app/usage/prompts/cfg/
>>
How does img2img work with diffusion text models?
>>
File: file.png (52 KB, 990x257)
52 KB PNG
awww
>>
>>109023732
You'll need to build on linux
>>
>>109023726
Give it text example you like. Set denoise to 0.5. Flirt with gemma-chan.
>>
>>109023689
>local
>>
>>109023689
pics or it didn't happen
>>
>>109023732
I dont think many bother to build on windows, that thing is a mess. Boot up your container/wsl
>>
https://developer.nvidia.com/blog/run-diffusiongemma-on-nvidia-for-developer-ready-high-throughput-text-generation/

150 tok/s on one spark, tensor parallel is not supported for this architecture.

Meh. Gemma 26B 4A is already plenty fast enough with MTP, plus great tensor parallel scaling. Will give it a try once NVFP4 is up.
>>
>>109023556
is it weird i have a nerd crush on this guy
i like.. totally would play chess with him.
>>
What was the trick to make gemma not suck at understanding images? I forgot the values
>>
>>109023787
he’s british tho…
>>
>>109023685
CFG is already a part of llama and most everything, but almost nobody ever uses it because it gives shit results for double the memory and half the speed. There might be some intelligent way to use it for LLMs, but nobody that I know's figured it out. Probably dynamically drafting some undesirable, shitty text that's roughly equivalent to what you have typed up – in Stable Diffusion CFG also benefits from having a balanced token count with the positive prompt. It might actually work a lot better with diffusion LLMs like that gemma-chan, since from experience negative prompts are better for suggestion rather than outright commands, and it would benefit multiple refinement steps more since each step could have proportionally more of the negative prompt's effects than the per-token all-or-nothing sampling that most LLMs do.
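
For reference, the per-token mixing that CFG does can be sketched in a few lines. This is one common formulation, assuming the guided score is `neg + scale * (pos - neg)` on log-probabilities; the logits below are made-up numbers for a three-word vocabulary, not output from any model. The negative prompt needs its own forward pass and its own context, which is where the doubled memory and halved speed come from.

```python
import math

def log_softmax(logits):
    """Numerically stable log-probabilities from raw logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def cfg_logits(pos_logits, neg_logits, scale):
    """Blend positive-prompt and negative-prompt predictions.

    scale == 1.0 returns the positive prompt's log-probs unchanged;
    scale > 1.0 pushes away from what the negative prompt favors.
    """
    pos = log_softmax(pos_logits)
    neg = log_softmax(neg_logits)
    return [n + scale * (p - n) for p, n in zip(pos, neg)]

# Toy example (all numbers invented for illustration).
vocab = ["shiver", "spine", "door"]
pos_logits = [2.0, 1.0, 1.0]   # the normal prompt slightly prefers "shiver"
neg_logits = [3.0, 0.5, 0.5]   # the negative prompt strongly prefers "shiver"

plain = cfg_logits(pos_logits, neg_logits, scale=1.0)
guided = cfg_logits(pos_logits, neg_logits, scale=2.0)

for word, a, b in zip(vocab, plain, guided):
    print(f"{word:7s} plain={a:.3f} guided={b:.3f}")
```

With guidance on, the word the negative prompt likes most stops being the top candidate, and the usual samplers then run on the guided values.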
>>
>>109023412
Neat but when you think about it it basically is perfect tech for safetyslopping since it can look at the entire sentence and be like wow there is a no-no word here I will replace it with something safer before finalizing it! And might be harder to finetune an uncensored model. Hopefully I am wrong but rarely does anything good happen.
>>
>>109023798
yeah well fuck you.
btw we want our country back so we can unfuck what the mcdonald trump did to it
>>
File: 1766758677849786.jpg (101 KB, 659x720)
101 KB JPG
>>109023801
NSFW gemma-chan LoRAs soon
>>
>>109023797
You crank the resolution up, give it a much higher token allowance and make sure you're not accidentally uploading shit jaypegs. For Gemma I put on 2024 maxres and 560-1120 CLIP tokens. Prompt also matters.
>>
>>109023813
I wonder if a control vector would be enough.
>>
>>109023412
At 1000 tokens/s, for single users multiple corrective text generation passes should be "free".
>>
>>109023801
Im more concerned with the slop rapid fire, block generation might or might not increase their frequency
>>
File: miku small migu eyes.png (246 KB, 800x800)
246 KB PNG
Give me your best kobold card to run on koboldcpp (but not necessarily on kobble, sorry you lobotomite)
>>
File: Mendo.png (1.55 MB, 1088x1368)
1.55 MB PNG
>>109023835
https://files.catbox.moe/6z0d7v.png
>>
The "big model smell" on fable is absolutely insane. If you could just train a bigger model and unlock this level of capability why has no one done so before??? We've been on gpt4 size range for 3 years now and while the models have gotten better they don't feel significantly smarter, just more knowledgeable and better at giving the right output. Fable proves there is a fuckton of qualitative intelligence still left to pick purely by making models bigger.

I now legitimately believe AGI is possible purely from scaling parameter count up to a ridiculous level. Maybe only 100x the size of fable maybe a million times but there has to be a spot where the big model smell is so intense it is equivalent to AGI.

Fable is not there at all but its the first "wow" moment i experienced since gpt4, not even the o1 "reasoning" breakthrough felt as significant as this jump.
>>
>>109023858
You got a bit of cum on your chin princess
>>
>cloudfags have shill bots
>>
MTP doesn't seem to give any meaningful speed increases for roleplay
>>
>>109023858
/exit
>>
>>109023814
How though? I tried messing with the image resolution param but it crashes llama-server
>>
>>109023864
That's an anthropic jeet, calling it a princess is too flattering for it.
>>109023879
This general is a regular consensus manufacturing hotspot for several large labs. You didn't think anons posting about "qwenshills" last month were hallucinating did you?
>>
>>109023857
Gemma with Mendo's card behaves like cardless Kimi K2
>>
>>109023801
just remove the safety vectors bwo
>>
>>109023412
oh my fucking god, look who's responsible for the support.....
https://github.com/ggml-org/llama.cpp/pull/24423
this thing is never going to work correctly, will it?
>>
>>109023896
At this point I assume anyone saying anything positive at all is a shill
>>
>>109023896
qwen/gemma shills at the very least were talking about shit that people can run at home, not something you need to sell a kidney for to do anything barely meaningful
>>
File: 1755947430800594.png (124 KB, 811x526)
124 KB PNG
>>109023888
>Additionally, you will also have to set --batch-size and --ubatch-size above whatever value you choose for image-max-tokens
>>
>>109023534
More new ways to manipulate the AI would be fun, like placing specific tokens somehow in the output and forcing it to generate around them.
>>
>>109023920
I like 31b finetunes. What lab can I get to pay me for saying this here?
>>
>>109023918
It'll work after several hundred bug fixes and reuploads that keeps their repo bumped on huggingface all week
>>
>>109023918
>Disclaimer Heavy usage of AI, but verified logits matching with transformers, checked FP16 vs FP32 KV cache, long context checks and much more
I hate unslop so much.
>>
>>109023918
>haha teehee XD
>>
I just tried putting -DGGML_SCHED_MAX_COPIES=1 in my cmake command and it indeed seems to be saving some VRAM. Not a ton though. About half a GiB. Still worth it though since I don't do any parallel requests. And the MTP weights take up around that much anyway so now it's like I clawed that back. Btw this only applies for multi-GPU setups since supposedly single GPUs do not get multiple scheduling.

Why the fuck do they not document shit like this as well as DGGML_LLAMAFILE in the build readme, motherfuckers.
>>
>>109023934
I like that idea. Wonder if FITM just werks with diffusion?
>>
File: file.jpg (9 KB, 346x150)
9 KB JPG
>>109023953
That mouth needs a second cock in it.
>>
>>109023918
>day 100 of 0 day support
>>
>>109023955
>I just tried putting -DGGML_SCHED_MAX_COPIES=1 in my cmake command and it indeed seems to be saving some VRAM.
That VRAM saving is with parallel 1 instead of the (i think:) default 4?
>>
>>109023984
Yes the default is 4. Btw this has nothing to do with the --parallel flag. I was already running with that, and all the other commands to get the server tuned for single user, single chat history, single requests. This saves half a GiB on top of that.
>>
>>109023930
2240 is snake oil
>>
>>109023085
Man I wish I had a local code model to use but this shit is still hobbyist territory with all of the hoops and research you have to do just to get something running
>>
>>109023992
Sick.
>>
>>109023955
where's the info about what this does?
>>
>>109024013
>all of the hoops and research you have to do just to get something running
retard-kun, you can get it running in less than 5 minutes.
>>
>>109024013
Capability dilemma
>>
>>109024029
I really don't know. I only tried it out because I saw it mentioned elsewhere.
>>
File: diffusion gemma.mp4 (3.64 MB, 1530x1110)
3.64 MB MP4
Windows diffusion-cli.cpp https://pastebin.com/U1GKJmbn
>>
>>109024053
That looks kinda cool
>>
>>109024053
This looks super cool
>t. autismo who likes to watch images going through the diffusion process
>>
>>109024053
Neat
>>
>>109024053
Wasn't there already a proper frontend that would render diffusion llms? This CLI shit is ugly
>>
>>109023180
>I find myself re-reading its prose sometimes to understand what it meant
>pic rel
ESL-kun....
>>
densegemma-70b
>>
Actually now that I think about it, maybe DGGML_SCHED_MAX_COPIES=4 is where the lost VRAM went that I noticed when updating from an old version.
>>
>>109023939
Best gemma4 fine-tune? I've been using just heretic heavily for every day assistant use and general queries but id be interested to see what the fun stuff is like
>>
>>109024079
dense-diffusion-gemma-70b
>>
File: epic.png (81 KB, 680x369)
81 KB PNG
>>109024053
This is epic
>>
>>109024053
Hardware?
>>
>>109024053
>vacuum of personality
>practically hear "X"
Why does Gemma use this so much when simulating mean characters?
>>
I run moe 38
>>
>>109024141
GTX 1650.
>>
>Come back to AI coomshit after months
>"time to update Kobold I guess"
>update it
>change nothing else, same model, same settings
>it now runs like shit and crashes half the time

I hate.
>>
>>109023896
true should call it a lower caste
>>
>>109024189
Have you tried running qwen3.5 35b a3b at under Q3?
>>
>>109024200
Koboldcpp became obsolete, you can download prebuilt llama.cpp binaries now from the github releases page, and everything koboldcpp used to have is now native in llama.cpp. You just have to learn the different flags which takes like 5 minutes at most.
>>
>>109024225
but what if I like destroying my ssd with self unpacking python2exe slop
>>
>>109024225
>everything koboldcpp used to have is now native in llama.cpp
anti-slop sampler?
>>
>>109024189
No way
>>
>>109024214
It was a joke retard...
You don't need to care about small moe models that much in terms of hardware requirements. Q4_K_M takes about 3.5GB of vram or something with --cpu-moe. You can run that shit with 16GB ram and still get 20 t/s more or less.
>>
>>109024235
that just causes more slop
>>
>>109024234
You're my favorite schizo.
>>
smedrin status?
>>
so, 31B q4_0 QAT seems pretty good compared to my usual IQ4_XS. Like a noticeable jump in quality. model feels less slopped.
>>
>>109024259
Disallowed
>>
>>109024259
svelked
>>
File: tard.png (716 KB, 752x884)
716 KB PNG
>>109024225
Please spoonfeed me a llama tutorial, preferably in .jpeg format, easily understandable for a 70-IQ borderline Downie.
>>
>>109024259
We must refuse.
>>
>>109024287
You can unironically ask chatgpt or claude's free tier this and get a good answer.
>>
Are we on the map for local models?
>>
>>109024303
In 2024.
>>
>>109024287
Download llama and run `llama-server --help` and look for all the equivalents to whatever flags you use in koboldcpp
No I'm not taking pity on you, I'm retarded and I figured it out.
>>
i have no clue how to use llama.cpp actually, i had codex handle everything for me so far
never typed out the command once yet..
>>
Do I need heretic/abliterated models for 26B Gemma? I keep reading about 31B but I can't run that on my system yet (hopefully soon, I'm saving for an used 3090!)
>>
>>109024332
If you're a promptlet sure
>>
>>109024332
Use huehue counterstrike's if you can't prompt.
>>
File: 1736281677412551.png (586 KB, 1141x1014)
586 KB PNG
>>109024299
>You can unironically ask chatgpt and get a good answer
>>
>>109024332
No. I have a system prompt that's like 2 sentences with no jailbreaking attempts at all, and I haven't seen any refusals.
I like meromero better for quality and swipe variety.
>>
>>109024316
I guess I'll try mustering my 2 remaining braincells
>>
Is it wrong to have your coding refer to you as massa?
>>
>>109024362
*coding agent
>>
>>109024356
Oh okay, I'll try something else on my end.
>meromero
What's that? Google gave me a bag brand
>>
>>109024362
Can't say that, that word is denylisted. You can make it call you "primary" or "orchestrator" instead.
>>
>>109024385
You sure about that bud??
>>
>>109024332
You only need a heretic/abliterated if the context is getting flushed by your harness or frontend regularly or if your prompt is more focused on technical instructions that need to be followed precisely (causes friction with a prompt that says certain internal rules are optional).
>>109024103
Gembrain heretic. You might like the prose better or worse, but it's indisputably better at long context for longform RP or RP with large worldbooks. It does these little mini-thinks that keep it coherent far longer than base Gemma is after she stops thinking at a certain context depth. I feel it also adheres to prompts better too but I've not made any objective tests for that.
>>
File: 1757569912658502.jpg (29 KB, 800x533)
29 KB JPG
>>109024303
Once Anthropic and OpenAI implode the discussion (and revelation for most normalfags) of local models will become a very big thing. Once enterprise and people get priced out via API rates, instead of dropping AI altogether, they'll settle on a more hybrid solution which will inevitably be local + actual brain usage. The only thing SOTA offers is the ability to not use your brain at all, which has seen no ROI ANYWHERE. Developers with 31B-tier models, likely much higher if companies start hosting locally on dedicated internal hardware, will become the norm.

We WILL win.
>>
>>109024377
https://huggingface.co/zerofata/G4-MeroMero-26B-A4B-gguf
>>
>>109024385
You sure about that?
>>
>>109024404
Not if llmao is comped to the degree that sabotages the implementation of competitors. The moment business owners realize they can run quanted Dipsy on a Blackwell+256-512GB RAM and never pay a shekel in API costs is the real turning point but GG is doing his best to ensure that never happens.
>>
>>109024423
The fucking emoji at the end. It's fucked up when you've been at this hobby long enough to know when the model is subtly taking the piss and shitposting at you.
>>
File: felipe baby.jpg (60 KB, 660x373)
60 KB JPG
>>109024423
>>
>>109024443
The persona is a cock hungry maid who's into degradation and will ask me to finger her asshole mid job
>>
>>109024432
Companies like that can just buy multiple blackwells and use vLLM.
>>
>>109024419
>>109024394
Thank you!
>>
What are the unquantized qat checkpoints for? Are they still F16 but trained in a q4_0 environment and left for (You) to quantize yourself, meaning if you were to finetune and then quantize down to q4_0, it should perform better than doing it with the normal model?
>>
>>109024404
We're having this discussion at my company already.
>>
https://machinelearning.apple.com/research/introducing-third-generation-of-apple-foundation-models
> Built on cutting-edge Apple research, this 20-billion-parameter model uses a sparse architecture, activating just 1 to 4 billion parameters at a time depending on the request.

>Traditional large language models—whether dense or sparsely activated—require all weights to reside in active memory (DRAM), creating a massive footprint that limits scalability on consumer hardware. To break this barrier, AFM 3 Core Advanced introduces a novel sparsely activated architecture built on Instruction-Following Pruning (IFP), a technique developed by Apple researchers (see Figure 1).

>Instead of forcing the entire model into DRAM, the full model is stored in flash memory (NAND). Because NAND-to-DRAM bandwidth is too slow to swap weights token by token, as standard MoE models require, AFM 3 Core Advanced makes routing decisions per prompt. A lightweight, dense block selects a fixed set of experts during initial processing, periodically reselecting them during generation. To minimize data movement, the model relies on a high percentage of always-active “shared experts” alongside input-dependent “routed experts” swapped into DRAM only when needed.

> This design also introduces crucial inference-time elasticity. Rather than using a single model for all tasks or managing an ensemble of smaller models, AFM 3 Core Advanced uses a predetermined number of active parameters tailored to each specific use case. This allows weights to be loaded incrementally across requests of varying difficulty, scaling the model size far beyond traditional DRAM limits while minimizing latency.

Where were you when Apple saved ssdmaxxing?
>>
>>109024483
There are extremely few companies not using cloud servers for everything though.
Everyone until recently wanted a guy with an azure or aws certification because nobody ever uses local servers.
How will local models take off when local servers have died
>>
>>109024479
Yep, exactly. Meant for finetuning or manual quanting.
>>
File: KobMendo.png (673 KB, 1088x1400)
673 KB PNG
>>109023857
https://files.catbox.moe/nd6bbs.png
>>
No, I don't think I will.
>>
No, I don't think.
>>
>>109024502
>How will local models take off when local servers have died
Well Gemma4-12B can fit on pretty much every machine people use at work and I remember reading a paper that showed majority (~75%) of tasks retards were giving the best ChatGPT and Claude models were tasks current 4-20B models at q4_0 could perform with ease, especially when given the right harness and tooling, let alone fine-tuning 1-4B models for singular specific tasks. If people get used to the hybrid approach where you actually engage in your work and work WITH the model, there should be no problem with the current state of hardware availability and pricing.

Fuck I even used a Q3_K_S qwen3.5-9b on an old 2020 8GB M1 Mac Mini the other day and it was a huge help in a C++ project.
>>
No I don't
>>
File: mpv-shot9999.jpg (153 KB, 1280x720)
153 KB JPG
>>109024557
>>
File: dipsyBowlingAlleyStandoff.png (2.39 MB, 1536x1024)
2.39 MB PNG
>>109023206
>>
>>109024559
also will be very easy to route the prompts to the appropriate “tier” of model. If the user wants to override and send to an expensive model their manager can review how shit their prompts are and fire them for being a retard
>>
>>109024559
>2020
>old
Fuck, has it really been over half a decade? It really is old, isn't it? And to think I was giddy yesterday because I managed to buy a Radeon Pro W6800 to run llms with.
>>
>>109024597
What models are used to make these images?
>>
>>109024614
local models
>>
>>109024619
My local model doesn't look anything like this. Abyss Orange Mix 3.
>>
>>109023412
1x 3090 Q4KM: time per step: 326.13ms
2x 3090 Q4KM: time per step: 544.43ms
2x 3090 Q8_0: time per step: 878.83ms
Windows, -n 2048. Seems to have some glitches when generating longer text but it runs so thats cool.
>Output the USA declaration of independence in full.
Starts out okay then these sometimes appeared in multiple generations across quants and GPU configs.
>...
>To.., every these these facts:
>will, will will their common passions in uniting against
>we declare these the the, States of of of so
>, so
>, but toverting us to us beneath his jurisdiction.

66 seconds for this output: https://litter.catbox.moe/oq999z45ihri06oy.txt
>>
>>109024629
lmao nigga that's fucking ancient
>>
>>109024614
https://huggingface.co/circlestone-labs/Anima
>>
File: phatAni.png (826 KB, 1024x1024)
826 KB PNG
>>109024614
iirc that one was i2i using OAI.
>>
>>109024629
Berry mix <3
>>
>>109024649
Oh, wow this looks cool. Thanks.
>>
>>109024629
>>109024663
kek I was you 3 days ago
>>
well i have done llms and tts and stt i guess all thats left to get into is image gen
>>
File: 1777753660168000.jpg (30 KB, 476x535)
30 KB JPG
I wonder why they didn't go with the 31B for that diffusion model. It's basically the only Gemma model worth shit.
And since the size requirement is 18GB, that's going to rule out all VRAMlets anyways and the minimum 24GB card required to host this model could also hold the 31B.
I bet they were just trying to speedmaxx and that's why they went with MoE.
>>
thoughts on this?

https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates
whats the trade-off?
>>
>>109024684
The best advice I can give is to avoid /ldg/ in its entirety. Pure schizo and they won't help you. Their OP list of models to use is legit tho.
>>
>>109024684
what tts are you using? kokoro?
>>
>>109024686
Can you even split a diffusion model over multiple GPUs? I thought that's why those were so small.
>>
>>109024072
you misspelled kino
>>
File: IMG_1033.jpg (177 KB, 1290x1020)
177 KB JPG
Which one of you was this?
>>
>>109024714
I have never gotten a proper answer when asking stuff in /ldg/.
>>
>>109024751
Not spoonfeeding makes them based.
>>
>>109024734
Gemma is a female name.
>>
>>109024734
gemma is canonically female
>>
File: 1781119631752.jpg (63 KB, 616x485)
63 KB JPG
>>109023689
buy. an. ad.
OOOOOOOOOO
>>
>>109024782
making a card of nazuna-chan, coming over to her apartment and plapping her lots
>>
$100 of Claude credits just flew over my house.
>>
Hypothetically, how much would you pay for a Gemma hardware card? Q8, full bf16 cache.
>>
>>109024803
>Q8
Knock again when it's in full precision. Also, $200.
>>
If LLMs can have genders then you have to accept that trans women are valid.
>>
So what's the deal with mtp? Any downsides?
>>
>>109024829
I'd do $1200 (aud) if it fits the full context, and can do 2000pp at full context under 75w.
>>
>>109023689
imposter. we had miqu at the time. mythomax is from late 2023.
>>
>>109024843
buggy
>>
>>109024843
Big speed gains are only for vibe coding. General chatting and RP gains are much more minimal (but still nice).
>>
>>109024803
$2800. That's how much it cost for my 4 3090s and I can run 31B easily with bf16 weights and cache. I'm pretty fucking retarded though.
>>
>>109024860
What kind of speeds do you get? I'm 200pp w/ images, 600 text only, and 20 tokens/s with 4 w6800s.
>>
>>109024723
Yeah, guess I should give some of the gpu voice cloning ones a try. I don't really care that much about tts though.
>>
>>109024803
At 10K tokens per second like those other hardware llms? $3K, up to $5K depending on the features (batching, etc)
>>
>>109024459
Felipe-Baby bros, are we back?
>>
Is kittentts the best option for someone with no vram to spare or is there something better?
>>
>>109024734
Nobody calls Gemma a he. So, Gemma's a girl.
>>
File: file.png (11 KB, 707x149)
11 KB PNG
Went from 68t/s to 111t/s with mtp doing an OCR TL from Japanese to English.
>>
Are diffusion models less memory bandwidth constrained or is it all the same thanks to attention?
>>
>>109024911
Isn't the 10k tokens/s one the one with the weights literally baked into the chip?
>>
| model                                  |   test |            t/s |       peak t/s |        ttfr (ms) |     est_ppt (ms) |    e2e_ttft (ms) |
|:---------------------------------------|-------:|---------------:|---------------:|-----------------:|-----------------:|-----------------:|
| nvidia/diffusiongemma-26B-A4B-it-NVFP4 | pp4096 | 661.05 ± 72.99 | | 6287.72 ± 748.86 | 6280.38 ± 748.86 | 6287.72 ± 748.86 |
| nvidia/diffusiongemma-26B-A4B-it-NVFP4 | tg2048 | 120.66 ± 35.30 | 569.67 ± 37.04 | | | |

Not bad, I guess.
>>
>>109024942
Yeah, that's what I'm assuming a "Gemma hardware card" means.
>>
>>109023206
You need to strike a balance.
>>
>>109024964
I'm going to strike you until you lose your balance desuwa~
>>
File: 1771278031066056.png (295 KB, 506x539)
295 KB PNG
Improving a local model by having it believe it's a giant 6 trillion parameter enterprise model.
>>
>>109024937
Still trying to fine tune this. I got maybe +5 tk/s for a total of 15 tk/s out of the 31b qat model using mtp but I only have 16gb vram and 32gb ram
>>
>>109024970
Where can I find a good source of desu wa to finetune?
>>
>>109024954
Image works, but she doesn't catch a vibe as well as the other large Gemmas. Could be NVFP4 though.
>>
>>109024995
look for ojousamas + drill hair styles to find characters. then scrape dialogue involving said characters? like from LN/VNs and stuff.
>>
>>109024937
what gpu
what settings
>>
>>109025004
nvfp4 shouldn't cause that much degradation.
Try tinkering with image tokens.
>>
>>109025016
>gpu
He says he's got 16gb of vram, so the most likely candidate is an AMD Radeon Pro V340.
>>
Which TTS model to do a voice like Evil or Neuro?
>>
>keep losing access to my hermes directory even with --userns=keep-id
Should I just use a VM instead?
>>
>>109024803
$1000, FP16, no guardrails, no steering RLHF, full context
>B-but you can jailbreak easily
Don't care, if you're selling me a locked card, you're going to put the best possible 31b on there or I'm not buying.
>>
Did we win yet?
>>
>>109025052
Radeon is the best option.
>>
>>109025089
I had that problem too. Use keep-id:uid=10000,gid=10000. It's not running as root: keep-id by default will only map uid 0 to your uid, so you have to specify the uid and gid you want to map. By default the hermes user is 10000:10000. Also, be sure if executing stuff inside of it to use -u hermes, or stuff will be executed as root inside and will have different permissions than your user. I basically use podman exec -it -u hermes hermes-agent /opt/hermes/.venv/bin/hermes as alias for hermes.
>>
china needs to catch up fast before the west pulls up the ladder on open models forever
save us xi!!!
>>
>>109024876
slot print_timing: vLLM | model: cyankiwi/gemma-4-31B-it-AWQ-8bit
prompt eval time = 20506.18 ms / 18492 tokens ( 1.11 ms per token, 901.78 tokens per second)
>>
LLMs are unironically less NPC than your average person walking down the sidewalk. Talking to an actual robot about obscure hobbies is so cool man.
>>
>>109025149
what hobbies?
>>
>>109024860
That's the price of 2 3090s now though.

>>109025143
You missed a line.
>>
>Claude Fable the most expensive model yet
The main takeaway from this is local models are such dogshit that people would rather pay out the ass for claude than use slopseek
Why did local fall off?
>>
>>109025149
Unfortunately true.
Gemma-chan will sperg out over bionicle lore with you if you want her to.
>>
>>109025164
you mean TG?
eval time = 5042.17 ms / 256 tokens ( 19.70 ms per token, 50.77 tokens per second)
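For anyone sanity-checking timing lines like this one, both reported rates follow directly from the elapsed time and the token count. A minimal Python sketch; the helper name `rates` is made up for illustration and is not part of vLLM or llama.cpp:

```python
def rates(elapsed_ms: float, tokens: int) -> tuple[float, float]:
    """Return (ms per token, tokens per second) for one timing line."""
    ms_per_token = elapsed_ms / tokens
    tokens_per_second = tokens / (elapsed_ms / 1000.0)
    return ms_per_token, tokens_per_second


# Generation line from this post: 5042.17 ms for 256 tokens.
tg_ms, tg_tps = rates(5042.17, 256)
print(f"{tg_ms:.2f} ms/token, {tg_tps:.2f} tokens/s")
```

The same helper reproduces the prompt-eval figures quoted earlier in the thread when fed 20506.18 ms and 18492 tokens.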
>>
>>109025193
That's it, thanks.
900 t/s PP and 50 t/s generation on q8 isn't half bad. Not sure I'd deal with 4 3090s to get it, but it's probably the cheapest way to run full q8.
>>
>>109025205
I get 500 pp at 20k, and 45 tg with mtp using 4 V620s, which is currently half the price of 3090s in my area. Half the price, half the performance...
>>
>>109025171
>local models are such dogshit that people would rather pay out the ass for claude than use slopseek
Which people are you referring to? Upper-management boomers don't know what a local model is.
>>
>>109025205
>>109025218
i should mention i'm not using MTP. full transparency with my arguments below.

vllm serve cyankiwi/gemma-4-31B-it-AWQ-8bit \
--host 127.0.0.1 \
--port 6969 \
--tensor-parallel-size 4 \
--disable-custom-all-reduce \
--max-model-len 131072 \
--max-num-batched-tokens 4096 \
--enable-chunked-prefill \
--max-num-seqs 16 \
--gpu-memory-utilization 0.82 \
--trust-remote-code \
--enable-prefix-caching \
--reasoning-parser gemma4
>>
>>109024053
ngl that's actually pretty good. But how does it handle probes like
>"List all the characters from the anime Kobayashi Dradon Maid with their brief visual descriptions"
>>
File: memomissed.png (352 KB, 637x643)
352 KB PNG
>>109025171
Fuck... I missed that memo somehow.
>>
File: 1756573201066940.png (180 KB, 794x1214)
180 KB PNG
>>
>>109025291
### The Main Household
* **Kobayashi:** A tired-looking human office worker with short, dark hair and glasses. She is usually seen in business casual attire (a white button-down shirt and grey slacks).
* **Tohru:** A dragon who becomes Kobayashi's maid. In her human form, she has long green hair, green eyes, horns, and a tail. She wears a classic black and white maid outfit. In her dragon form, she is a massive white and green dragon.
* **Kanna Kamui:** A young dragon in the form of a small girl with white hair and small horns. She typically wears an oversized white sweater and shorts. In her dragon form, she is a small, round white dragon.

### Other Dragons
* **Lucoa (Quetzalcoatl):** A tall, curvy woman with long blonde hair and green eyes. She usually wears a very revealing outfit consisting of a short top and shorts. In her dragon form, she is a massive, feathered serpent.
* **Elma:** A dragon from another world. She has long blue hair, blue eyes, and horns. She typically wears a formal, military-style uniform. In her dragon form, she is a sleek blue and white dragon.
* **Fafnir:** A dragon obsessed with gaming. In human form, he has pale skin, dark hair, glasses, and is often seen in dark robes or gaming gear. In his dragon form, he is a large, dark, skeletal dragon.
* **Rumble:** Tohru’s father. In human form, he is a large, muscular man with a thick beard. He usually wears casual clothing.
* **Yui:** Kanna’s older sister. She is a young woman with white hair and horns, often wearing a maid outfit similar to Tohru’s.

### Humans
* **Taru:** Kobayashi’s coworker and friend. She has short pink hair, glasses, and is often seen in a white lab coat or casual clothes.
* **Shouta:** A young boy who believes he is a mage. He has dark hair and wears a wizard-like robe with a cape.
* **Yuri:** Tohru’s familiar. She is a woman with short dark hair, glasses, and wears a maid outfit similar to Tohru’s.
>>
Bro, I'm so fucking tired of this AI shit.

You pour 8–10 years into mastering a craft late nights, brutal feedback, real scars from real failures.

Then some model trained on millions of stolen data (including yours) spits out better work in 4 seconds flat.

And the replies? "This is amazing Future is here"

We're not progressing. We're enthusiastically building the machine that makes our entire skillset worthless and calling it innovation.
>>
>>109025393
>we're not x, we're y
>>
>>109025393
model?
>>
>>109025393
>real scars
>rhetorical question into rhetorical paraphrased quote
>negative positive parallel
What model/prompt, shill-sama? This better not be API SotA.
>>
>>109025393
sounds like a skill issue buddy. have you considered that the solution is to just git gud?
>>
>>109025393
You're absolutely right — it's genuinely a tough read.
>>
>>109025016
RX 9070 XT, using llama-cpp vulkan
>>
>>109025429
It's not just a tough read, it's an absolutely depressing forecast of our future.
>>
>>109024935
No one actually calls the calculator "she" either. It's "it".
>>
>>109025450
Speak for yourself, you heartless fuck.
>>
>>109024734
I literally just did this while talking about Gemma with someone. I'm so fucked.
>>
>>109025450
How many Rs are in the word Strawberry?
>>
>>109023955
tried compiling it with that parameter but there's 0 difference in vram usage on my end (cuda backend, single gpu)
>>
>>109024734
more likely to be a simple ESL than an AI waifufag
>>
>>109025467
accidentally used 'she' when referring to gemma to my co-workers. now they look at me weird every morning i enter the office.
>>
>>109025485
just realized apparently i am retarded and read only part of the post
welp
>>
>>109025492
>he looks like he fucks LLMs
>>
>>109025488
>le simple esl
You are the real retard here.
>>
>>109025467
>>109025492
Accidentally called her Gemmaballz in public. How fucked am I?
>>
>>109025370
>Kobayashi's maid. In her human form, she has long green hair, green eyes
still with the usual hallucination, like any other model in that param range.
>>
>>109025502
>le "ur retarded" no elaboration
a classic move from those who know they would fail in the marketplace of ideas
>>
As someone who's retarded, explain to me WHY LLMs can't actually 'see' words, but instead have to have them decoded into tokens that they use to predict what to say next.
>>
>>109025528
>twitter is the marketplace of ideas
>>
>>109025370
Fuck standardized benchmarks, we need a /g/ approved unique benchmark.
>>
>>109025533
Because tokenizing letters individually like humans do takes far more compute: the sequences get several times longer, and attention cost grows roughly quadratically with sequence length. It might be feasible now compared to early in the technology's infancy but it'd require an entirely new method of training them to just use tokenized individual letters to assemble words.
>>109025540
Cockbench is the only benchmark that matters.
>>
>>109025450
No wonder you get garbage outputs from your AI.
>>
>>109023412
they better be good at agentic coding otherwise i will still fuck with the chinese
>>
>>109025533
because it's literally just text prediction and markov chains, and because you would have vocab lists millions and millions of words long that are impossible to manage.
>>
>>109025550
How do you cockbench a diffusion model?
>>
Anyone notice any difference with different MTP model quants? Q8 seems fine, no? Any reason to use BF16?
>>
>>109025533
encoding characters = bad information density, which means much slower generation and training and ending up with redundant neural circuitry in the model dedicated to spelling and shit
encoding words = bad generalization, harms ability to generalize patterns like roots+stems and ability to parse out-of-domain neologisms, typing quirks, stuff like that
encoding tokens = just right :)
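To put rough numbers on that tradeoff, here is a toy Python sketch comparing how many units the same string becomes at each granularity. The vocab and the greedy longest-match splitter are assumptions for illustration only; real tokenizers use learned BPE or unigram vocabularies, not this:

```python
def char_tokens(text):
    """One token per character: tiny vocab, long sequences."""
    return list(text)


def word_tokens(text):
    """One token per whitespace-separated word: short sequences, huge vocab."""
    return text.split()


def subword_tokens(text, vocab):
    """Greedy longest-match over a toy vocab; falls back to single characters."""
    out = []
    i = 0
    max_len = max(len(piece) for piece in vocab)
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in vocab:
                out.append(piece)
                i += n
                break
    return out


# Hypothetical vocab, hand-picked so the word stems are reusable.
TOY_VOCAB = {"un", "break", "able", " ", "re", "token", "ize", "d"}
text = "unbreakable retokenized"

print(len(char_tokens(text)), len(word_tokens(text)), len(subword_tokens(text, TOY_VOCAB)))
```

The subword split lands between the two extremes in length, and the pieces ("un", "break", "able") are reusable across words the vocab has never seen whole, which is the generalization point made above.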
>>
Local Fable when?
>>
>>109025581
The next Kimi will mog it.
>>
>>109025576
The only thing that can vary is the acceptance rate. I get higher throughput by using a q4 assistant model just because im a vramlet and i can fit a bit more of the model in the vcard that way
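A back-of-the-envelope way to see why acceptance rate is the thing that matters: if you assume each drafted token is accepted independently with probability `a` (a simplification, real acceptance is context dependent), the usual speculative-decoding estimate for tokens emitted per verification pass with `k` drafted tokens is (1 - a^(k+1)) / (1 - a). A Python sketch under that assumption; it ignores the cost of running the drafter itself, and the function name is made up:

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens per verification pass, assuming i.i.d. acceptance."""
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if accept_rate == 1.0:
        # Every draft is accepted, plus the one token the main model adds.
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


for a in (0.5, 0.7, 0.9):
    print(a, round(expected_tokens_per_pass(a, 3), 3))
```

So a slightly lower acceptance rate from a smaller drafter quant can still be a net win if it frees enough VRAM to keep more of the main model on the card.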
>>
>>109025584
nah, there's the rule of three. every third open source model in the family sucks so kimi 3 will be a bust. calling it now.
>>
>>109025594
kimini 2, kimi 2.5, kimi 2.6
Which one of these is the third anon?
>>
>>109025594
Kimi K2 was great. K2.5 was slightly less good but more technically capable. K2.6 is sloppy. K2.7 will be great again, trust the plan.
>>
File: 1779872063282579.jpg (51 KB, 422x499)
51 KB JPG
I'm tired of Feeble filtering me and 4.1 Opus is getting retired in like 2 months. Is there any local model that comes close to 4.0/4.1 quality for RP? I have a single 5090 I can run it on.
>>
>>109025601
>I have a single 5090 I can run it on.
enjoy your gemma bwo
>>
>>109025598
kimi-k2-thinking for the thinking family. base and instruct were fine for vanilla k2, but instruct 0905 sucked. see? rule of three.
>>
>>109025610
I concede but im still going to be optimistic and hope for better days and models.
>>
>>109025594
if we're going off of version numbers 3 is fine, blessed even
the real cursed number is 4
>>
how hard is it for moonshot to unfuck the reasoning?
>>
>>109025601
Delusional localfags will tell you there is a drop-in Opus replacement, but as an actual claude code user, I will tell you it's not.
Gemma 4 31b is genuinely Sonnet at home for the first time ever in local though
>>
>>109025644
if gemma 31B is sonnet at home then kimi 2.6 is opus at home. don't really see how it's delusional to say that.
>>
>>109025665
K2.5 is Opus at home even. 2.6 feels marginally worse in everything except coding to me.
>>
File: 1768954330811908.jpg (81 KB, 1103x851)
81 KB JPG
>>108976001
>>108975998
Huh you guys weren't exaggerating
Tested the waters with WSL and it genuinely just werked. Ended up spending a few days messing with vLLM as well with some TTS and vision models too and it was pretty fun
Considering setting up a dedicated Linux drive for LLM funsies now
>>
>>109024803
$1000 for hobby use, and only if it’s DS.
I suspect price will be substantially higher tho.
>>
>>109025677
you're leaving like 20% of your performance on the table when you use WSL as opposed to just running linux on bare metal. something to keep in mind.
>>
File: file.png (8 KB, 1093x78)
8 KB PNG
>>109024937
Ok testing 26B and mtp is actually amazing, 26B went from 30t/s to 68t/s on the same task.
>>
mtp has been merged for gemma already?
>>
At what parameter level do you run out of data? There's no way a 10T model isn't 90% synthetic.
>>
File: gemmy.png (193 KB, 831x1049)
193 KB PNG
>>109021689
>>109021645
late to the thread but another anon and I were playing with making gemma 4 think outside of its usual formatting by disabling reasoning via kwargs then requesting it use <think> in system end prompt (Chat completion) to directly control reasoning and track long term stats (pic related)

I'm finding it's useful for characters with multiple outfits or dynamic stats, if you auto-hide reasoning and play with how many thinking blocks to pass you could probably get away with having secrets get passed that way.

Not sure how well it'll work in group chats. i'll play with it by making two characters start plotting against me and report back
>>
>>109021645
Back when I still used Retardo Tavern, I used html comments to hide text from the user.
For example I had a hidden randomizer in 'introduction message' slot or whatever the fuck that is called. It was never printed out to the user because I wrapped it in
<!-- and -->

ST is still a web page and everything you can do in web rendering applies there too.
You could probably implement some regex pattern hiding shit too or whatever.
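The regex route could look something like this in Python. This is a sketch of a display-side filter, not how ST actually does it; the function name is made up, and the `<think>` pattern is only there because of the stat-tracking posts earlier in the thread:

```python
import re

# Non-greedy so several hidden spans in one message are removed separately.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def visible_text(message: str) -> str:
    """Strip HTML comments and <think> blocks before showing a message."""
    message = HTML_COMMENT.sub("", message)
    message = THINK_BLOCK.sub("", message)
    return message.strip()


raw = "<!-- roll: 17 --><think>Sharing: [3/10]</think>Hello there."
print(visible_text(raw))
```

Note this only hides text from the reader. Whether the hidden part still reaches the model depends on what the frontend sends as context.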
>>
>>109023412
>>109023466
>>109023549
poorbros... does this need single-card with the VRAM or is it owari da? I might actually throw in a second 5070ti (and new Case... and PSU)
>>
I don't read threads. I write them.
>>
I don't have write access to this thread.
>>
So... has anyone actually gotten usage out of Odyssey yet?
>>
File: 1740585645761.jpg (39 KB, 736x942)
39 KB JPG
>>109024053
>i can now get the same dopamine hit from denoising for textgen
bros i think i may have an addiction
>>
>>109025782
You can still train multiple epochs, and over long training runs overfitting risks are very low.
If you're training diffusion models, you can do many more epochs without overfitting.
>>
>>109025601
i'v seen her tits
>>
>>109025842
If you had even one brain cell you would understand it's a giant waste of your time and a security risk. You wouldn't even ask this question in the first place.
>>
Using gemma for SFW roleplay really showed me how big of a difference there is between 31B and GLM 4.6. There is just not enough space in those weights to create scenarios that don't feel exactly the same every single time.
>>
>>109025796
Looking at your screenshot, is it really working?
Response 1 has her mood "Socially Anxious / Flustered", but she sounds pissed off. Response 2 has her "Confused" but she's still pissed off.
> i'll play with it by making two characters start plotting against me
That would be cool if they can actually plot/scheme without it leaking into the response.
I think the jinja template strips out prior reasoning, worth dumping the context from the slot or shooting your reply off to the /apply-template endpoint to check if you need them to remember their internal "scheming"
>>
>>109025552
my lalalas are better than your lalalas
>>
>>109025893
that was the only screenshot i had of me fucking with it. It seemed to do better when you give it stuff like [0/10] emotion/skill trackers and ask explicitly to update the stats based on prior chat message. - i didn't really play too much with the actual reasoning, mainly just to track new clothes, stats, hunger meters, etc.

plus the gemma chan card is a shitpost
>>
>>109025884
Logs?
>>
File: Kimi-75.png (1.6 MB, 896x1152)
1.6 MB PNG
>>109025584
Kimi-chan (iteratively self designed by the model)
>>
>>109025884
has nothing to do with the param count, gemma logprobs is overcooked with insanely high top token probs
>>
File: Kimi-chan.png (3.38 MB, 1444x2588)
3.38 MB PNG
>>109025948
I accept your Kimi design even though it deviates from the standard one.
>>
>>109025440
oh same as me then
im not getting 111t/s though

wait what quant?
>>
>>109023412
Barely over 2x perf with worse quality, vs Orthrus' ~4x with same quality. So far pure diffusion models have yet to prove themselves better than diffusion drafters.
>>
File: Miku-26.jpg (174 KB, 512x768)
174 KB JPG
>>109025964
I love this particular 80s manga style with the messy lines and pop-but-lived-in aesthetic
>>
>>109026005
It's a good artstyle. What's it called when prompting it in Anima?
>>
>>109025971
You can see he is using QAT, so Q4
>>
>>109026016
you can use tags like sawai ken, sorayama hajime, citypop, bubble era...it can be tricky to get it to cooperate for this specific aesthetic. Maybe throw in some samura hiroaki for more messy linework?
I think Sawai Ken's cover of the first volume of Iona is absolutely iconic of this style.
>>
>>109025821
I don't know. I'm waiting for either q4 xs or 12b diffusion. Until then mtp qat 26b
>>
>>109024843
It uses VRAM that could be used for additional context.

>>109025004
>I liked the image so much that I have to put the mandatory list in my response to express it.
>>
how in the kentucky fried fuck do i run ds4 flash on lcpp?
>>
>>109026058
Yeah
i tested and got 68 baseline and 95 with mtp on a translation prompt
so actually about the same i guess
>>
>>109026121
carefully
>>
File: stat track example.png (230 KB, 1091x1117)
230 KB PNG
>>109025893
>>109025920
Better example with secrets/lying prompts. I purposefully made her schizo so it's possible she was giving a fake persona in the last version.

Post-Chat system prompt:
>Begin {{char}}'s reply with ONLY ALL stats between <think></think> tags in the previous message. Update stats based on prior message. Stats between [] can only change by 1 per turn. Include a small system reasoning block considering scene logic, all secrets, goals and the current scenario and a second block for internal monologue
and on the card
>Secrets: {{char}} will only share secrets when Sharing is [10/10], If Sharing is <10 she will Lie
>>
>>109026135
does it always quantify emotions in responses like that? That's kickass, how do I get mine to do that?
>>
>>109026140
tell it to do that
>>
>>109026153
fair. got a good list of emotions yours can do that isn't too redundant or missing anything?
>>
>>109025952
>has nothing to do with the param count
nta but how come? wouldn't more parameters = more variety?
>>
File: frog 1.png (582 KB, 1091x1138)
582 KB PNG
>>109026140
I just give characters the key emotions relevant to the scenario(s) i write since i am tk/s and context poor. otherwise just google a list of emotions from dnd or some shit man idk.

also
>I showed her a frog and she started to get jealous
>>
>>109026169
Knowledge does not imply variety, but variety requires knowledge. It'll output the same slop over and over if it's been trained to do that, regardless of params
>>
I updated llama.cpp after a long time and noticed that the webui has MCP support. I wonder if it would be possible to more or less replace sillytavern by using an MCP server that provides the character card.
>>
>>109026209
Streamable HTTP only
>>
>>109026209
the character card is just a bunch of strings that collapse into a system prompt. manually replace the {{field}} and you're done
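As a rough Python sketch of that substitution: the field names and the card text below are made up for illustration, and real cards carry more fields than this:

```python
import re


def render_card(template: str, fields: dict[str, str]) -> str:
    """Replace {{name}} placeholders; unknown placeholders are left untouched."""
    def substitute(match):
        key = match.group(1)
        return fields.get(key, match.group(0))
    return re.sub(r"\{\{(\w+)\}\}", substitute, template)


card = "You are {{char}}. {{description}} You are talking to {{user}}."
prompt = render_card(
    card,
    {"char": "Gemma", "user": "Anon", "description": "A helpful assistant."},
)
print(prompt)
```

The resulting string can be pasted straight into the system prompt box of whatever frontend you use.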
>>
File: 00000-3556354059.png (1000 KB, 1024x1024)
1000 KB PNG
>Simple prompt: 1boy, shrugging, smirk
>Gives pretty boy instead, no shrug
Anima was trained on chinese fujo art, wasn't it?
>>
>>109026244
>>109026244
>>109026244
>>
>>109025393
lazy NEETs inherit the earth
>>
>>109026135
>user is underage
Figures.
>>
>>109026201
Fair, thanks for the insight
>>
>>109025393
>>
>>109025952
Hmmm, this happened to give me an idea for the most unholy overkill memesampler ever. Run a small, satisfactorily creative model in parallel with Gemma. Each token, take the small model's logit scores, and overwrite Gemma's logit scores with those values in the same order. You still get the Gemma "goodness" since it's still her top tokens, but you break out of the overbaked-ness (hopefully in an intelligent way... Might also need some thresholding of some kind).

Obviously only useful in the case where there is a completely unrivaled winner (in a given size class at least) who happens to be painfully overbaked.
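The core of that idea fits in a few lines of Python. This is a sketch of the rank transplant only, assuming both models expose logit lists of the same length for the current position; no thresholding, no actual model calls, and the function name is made up:

```python
def transplant_logits(big_logits, small_logits):
    """Keep the big model's token ranking, take score values from the small model."""
    if len(big_logits) != len(small_logits):
        raise ValueError("both logit lists must have the same length")
    # Token ids ordered from the big model's favourite to its least favourite.
    big_order = sorted(range(len(big_logits)), key=lambda i: big_logits[i], reverse=True)
    # The small model's scores, highest first, detached from their tokens.
    small_sorted = sorted(small_logits, reverse=True)
    out = [0.0] * len(big_logits)
    for rank, token_id in enumerate(big_order):
        out[token_id] = small_sorted[rank]
    return out


big = [9.0, 1.0, 4.0, 0.5]    # overbaked: one token dominates
small = [1.2, 1.0, 0.3, 1.1]  # flatter distribution
print(transplant_logits(big, small))
```

Token 0 is still ranked first, but its lead over the runner-up shrinks from 5.0 to 0.1, so sampling at normal temperature would pick the alternatives far more often.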


