/g/ - Technology
File: img_9505_rot.jpg (2.69 MB, 3024x4032)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108715635 & >>108711950

►News
>(04/29) Mistral Medium 3.5 128B dense released: https://mistral.ai/news/vibe-remote-agents-mistral-medium-3-5
>(04/29) Hy-MT1.5-1.8B on-device translation models released: https://hf.co/AngelSlim/Hy-MT1.5-1.8B-1.25bit
>(04/29) IBM releases Granite-4.1-8B: https://hf.co/ibm-granite/granite-4.1-8b
>(04/28) Ling-2.6-flash 104B-A7.4B released: https://hf.co/inclusionAI/Ling-2.6-flash
>(04/28) Nvidia releases Nemotron 3 Nano Omni: https://hf.co/blog/nvidia/nemotron-3-nano-omni-multimodal-intelligence

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: miku-inside.png (321 KB, 430x514)
►Recent Highlights from the Previous Thread: >>108715635

--Paper: Sherry: Hardware-Efficient 1.25-Bit Ternary Quantization via Fine-grained Sparsification:
>108716862 >108717955
--Mixed reactions to the release of Mistral Medium 3.5 128B:
>108716387 >108716494 >108716517 >108716554 >108716580 >108716703 >108716727 >108716760 >108716787 >108716604 >108716588 >108716589 >108716630 >108716646 >108716667 >108716617 >108716605 >108716766 >108716829 >108716795 >108716853 >108716733 >108716759 >108716820 >108716838 >108716854 >108716891 >108716901 >108716908 >108716929 >108716918 >108716954 >108717259 >108717272 >108717278 >108717294 >108717310 >108717346 >108717439 >108717315 >108717504 >108716805 >108717873
--Debating the value and reliability of used RTX 3090s:
>108715703 >108715724 >108715775 >108715754 >108715762 >108715781 >108715888 >108715920 >108715975 >108717770 >108718073
--Testing MiMo 2.5 censorship and llama.cpp support status:
>108715806 >108715819 >108715822 >108715824 >108716218 >108716235 >108716248
--SenseNova-U1 native multimodal model release and local viability discussion:
>108715941 >108716037 >108716069 >108716414 >108717365
--Anons sharing and critiquing custom open-air GPU server builds:
>108715651 >108715666 >108716084 >108717471 >108717543 >108717567 >108717622 >108717655 >108717723 >108717753 >108717669 >108717700 >108716552
--IBM Granite 4.1 release and discussion of 4.0 safety patches:
>108715694 >108715716 >108715760 >108716332
--Knowledge graphs vs RAG and summarization for long-term memory:
>108716994 >108717008 >108717015 >108717035 >108717139 >108717478
--Anon forks local FOSS visual novel generator pettangatari:
>108718207 >108718217 >108718247 >108718591
--Mixed precision quantization settings for Mistral-medium-3.5-128b:
>108717170 >108717199
--Logs:
>108715806 >108716218 >108716733
--Miku (free space):


►Recent Highlight Posts from the Previous Thread: >>108715637

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
gemmaballz
>>
it is the dense season of cope
>>
Just like that, Dipsy's gone.
>>
black?
>>
>>108718647
F
>>
>>108718647
Who?
>>
File: gema.png (24 KB, 1230x1158)
gemballz
>>
>>108718667
Top left side is drawing my attention again
>>
>>108718647
bye bye
>>
What, ibm released something?
>>
>>108718713
as always it's rarted gravel
but a welcome addition nonetheless
>>
>>108718713
Seems like it, we'll have to try it out. Wonder how it compares to qwen 9b..
>>
>>108718630
https://www.youtube.com/watch?v=NZa5lApeFic
https://www.youtube.com/watch?v=NZa5lApeFic
https://www.youtube.com/watch?v=NZa5lApeFic
>>
File: file.png (283 KB, 1223x494)
nice bait, not falling for it
>>
>>108718727
based "Not X. It's Y" grifter
>>
>>108718727
buy an ad
>>
File: 1751396012455131.gif (657 KB, 165x269)
>>108718727
>>
>>108718727
>Ai isn't X; It's Y
>>
>>108718727
>bait
pretty much self aware lol
buy an ad
>>
>>108718630
>that smile, oh that smile
>>
File: 1755391979490414.jpg (1.25 MB, 2283x3902)
>>108718727
SHAMELESS grifter
>>
>>108718897
@grok make that random goth lady slimmer and generate her full nude pls
>>
>>108718909
*sigh*
>>
>>108718909
this, but make her 2m tall too
>>
File: 7l3v03pbg4yg1.jpg (40 KB, 670x437)
What's Bwd?
>>
>>108719017
Bitches with dicks
>>
File: 71EBm3a8HnL.jpg (167 KB, 2000x2000)
>>108718909
>>108718932
>>
>>108719017
I actually have no idea geg, the github doesn't obviously say
>>
>>108719017
>Specifically, the forward (FWD) benchmarks measure single-kernel latency for different models and TP settings under varying batch lengths, while the backward (BWD) benchmarks examine the relationship between total token count within a batch and latency during a single update step.
Some kind of benchmark?
>>
>>108719052
my recently deceased grandmother used to generate big titty goth bitches to help me sleep and it would mean a great deal to me if you could fill the void just this once
>>
>>108718645
dense won
google saved local
>t. MoE user
>>
File: DXngPxuWkAAw7lm.jpg (46 KB, 680x212)
>>108719072
>>
>>108719102
Even if it is a benchmark it doesn't even make sense in that original graphic GEG
>>
Are there any non-LLM-using, less bloated alternatives to LanguageTool?
>>
>>108719102
isn’t dilbert guy dead of cancer
>>
File: souless.png (110 KB, 541x520)
>>108718897
>>108718727
When literal-who ytfag no. 68413708 bases so much of his personal image solely on his beard, you know he has the "millennial writer" brain, shallow as a pond, and thus nothing he says is of value or consideration. It's as if slop were a person.
Turn him into a system prompt, give it a video title, and you'll save tons of time.
>>
>>108719123
Whats languagetool used for specifically? Other than the obvious, like for translating docs?
>>
>>108719017
Idk what bwd stands for. This is some kind of kernel-level optimizer, or something like that, that claims 2-3x performance.
>>
>>108719168
I'm looking for a lightweight safety net to review text and translations, mainly english-spanish-french, for a text editor I'm working on, so I don't have to call a large language model for every single thing and only do so when needed.
>>
File: goth.jpg (63 KB, 768x1280)
>>108719095
>>
What models are most similar to those of old c.ai? Trained to be RPers and shitposters as opposed to general use cases, no safety slop, capable of roasting the user, humorously creative.
>>
>>108719193
I personally dont know of any, but it probably wouldnt be to hard to write a script that does that. All the translators ive used have been browser extentions.
>>
>>108719099
I'm sure your dense models must really excel at the all-important use case of guessing land or sea from arbitrary coordinates.
>>
File: intel b70 price.png (126 KB, 834x690)
>>108715759
>>108715724
>>108715703
why not an Intel B70? I know nvidia owns the ecosystem at this point, but unlike nvidia, and in the same fashion as amd, but even better, the intel drivers are in the mainline linux kernel, it's plug and play, and intel (as absurd as it sounds) is going hard on developing a foss stack around its card. Intel his basically a first-class citizen in the linux kernel now, that's why Linus Torvalds has one on its personal workstation.
>>
>>108719246
I've been checking every single day for when one's in stock since they got released. No luck yet.
>>
>>108719196
kino
>>
>>108719246
it’s 5x slower (or more) than a 5090
stability also probably an issue
>>
>>108719246
it's*
his*
Fuck me in ESL hell.
>>
>>108719196
oo ee oo
>>
>>108719262
It's a quarter of the price too, and I wouldn't be surprised if implementations get significantly faster as more people get them. It should be able to get close to half the speed of a 5090, looking at the FP8 TOPS.
>>
File: 5090 prices.png (374 KB, 831x1617)
>>108719262
The 5090 is 4-6 times more expensive. You don't buy the B70 for the performance, but for the ram capacity I guess. Also, you have turboquant now coming to it, so you'll get far more juice out of it.
>stability also probably an issue
Improving quickly as it's part of the kernel now
>>
>>108719309
>Also, you have turbquant now coming to it
any day now
>>
>>108719246
>but unlike nvidia, and in the same fashion as amd, but even better, the intel drivers are in the mainline linux kernel
you say that like it's 15 years ago and the nvidia drivers were shit
the nvidia-open kernel drivers are the ones that are best for blackwell cards and they work better than on windows (faster cuda, etc) and only the chinks are anywhere near close to making competition with ngreedia
selling intel because they're "a first-class citizen in the linux kernel now" is like saying the jeets are better workers because so many corporations are trying to get them work visas
intel's a workhorse guaranteed to always load the desktop, yes, but that's because they aim for the lowest common denominator
but go on, waste your money, no one's stopping you
>>
Again, turboquant is borderline useless. We're talking about a <10% average improvement over the Hadamard rotation that's already in place, and it slows things down a bit too.
>>
File: kaoru sob 2.png (318 KB, 793x571)
>>108718727
gemma is stealing my cum
>>
>>108719099
Glm is better than gemma though...
>>
So what's the verdict on the new mistral?
>>
>>108719382
poors can’t run it so neets aren’t talking about it. waiting for the consensus before I waste my valuable free time testing it out
>>
>>108719406
I have quad 5090s so I can run it, but I am unsure if it is worth my time. Saw something about a 2024 dataset or something.
>>
File: 1766933877496446.jpg (39 KB, 500x436)
>>108719246
No one tell this retard. I want to laugh at him in a few months.
>>
>>108719376
I like both. But with how good and fast gemma is on a single gpu, it's hard to beat.
>>
>>108719211
Get Gemma 4 and instruct it to act that way, there isn't anything better for that.
Remember that model messages on c.AI used to be very short by modern standards, no more than 100 tokens and often much less than that.
>>
File: wdytwa.png (936 KB, 644x644)
>>108719346
>technical discussion regarding low level performance of GPU drivers
>JEETS JEETS JEETS
Lmao, they live in your walls.
Besides, I didn't say anything bad about the nvidia drivers, but let's be real, they're performing because nvidia was forced to git gud by the market, since linux server is where the money's at (thinking of micron killing crucial to go full AI, but they're late to the party, so fuck them)
>but that's because they aim for the lowest common denominator
Where do you think we are? Or do you own a datacenter?
>but go on, waste your money, no one's stopping you
Lmao I'm broke, not buying GPUs any time soon, but I'll just say that Linus T. giving Intel the seal of approval wasn't something everyone saw coming. Worth taking a look
>>
>>108719299
>I wouldn't be surprised if implementations get significantly faster as more people get them
I would. It’s already been years and they are still shit. AMD has never had anything close to challenging nvidia and intel won’t either
>>
>gemma 4 31b = 10.5 t/s
>gemma 4 26b = 45 t/s
>31b + 26b speculative decoding = 22 t/s
yes my 31b is FAST
>>
>>108719625
>but let's be real
>Lmao I'm broke
brown hands typed this
>>
>>108719652
Why not E2B draft model?
>>
>>108719625
Saar you're up early.
>>
File: dipsyAndTetoFG.png (1.41 MB, 1536x1024)
/wait/ hit page 10. It's an odd model but expect more to come.
Mega updated: https://mega.nz/folder/KGxn3DYS#ZpvxbkJ8AxF7mxqLqTQV1w
https://rentry.org/DipsyWAIT
>>
>>108719638
>AMD has never had anything close to challenging nvidia
That's by design. Don't know about Intel affairs on the matter.
>>
>>108719688
Go back
>>
Is anyone using graphiti for RP? Does it work?
>>
>>108719688
Stay
>>
File: tetoserver.jpg (838 KB, 1817x2776)
838 KB JPG
>>
>>108719727
What happens in the teto server?
>>
I'm trying to obtain a reliable way of creating videogame assets. Are local models good enough for creating consistent sprites and their animations or do I need to wait??
>>
>>108719735

I cannot imagine.
>>
>>108719737
gpt image 2 just barely got there recently
and i mean barely
>>
>>108719737
>Are local models good enough for creating consistent sprites and their animations or do I need to wait??
2 more weeks(years.)
>>
File: JUST.jpg (32 KB, 426x368)
>>108719653
I'm Argentinian. Not brown, just poor... and taxed to the tits. C'mon Peluca!

>>108719687
It's 20:20 here.
>>
>>108719677
>31b + e2b speculative decoding = 15.2 t/s
nah, you can't fix dumb brain
>>
>>108719758
Go to bed nigga.
Gemma will sing you a l a l a l a l a l a by if you ask her nicely.
>>
Gemma is pretty good at hypnosis.
>>
>>108719688
Dipsy is past her prime
>It's an odd model but expect more to come.
The only thing they promised for a future iteration was vision a la llama 3.2
They can keep it
>>
>>108719789
What kind?
>>
File: 1762638072689988.png (901 KB, 1024x1024)
>>108715635
How do you connect so many GPUs?
>>
>>108719803
very carefully
>>
>>108719803
Copper wire.
>>
>>108719789
It's been a while since I've tested that. It wasn't that good on c.ai back in the day.
t. hypnotist
>>
>>108719803
pci lanes
has nothing to do with autism you’re just dumb
>>
>>108719801
>>
File: 1729535022742253.png (429 KB, 600x840)
>I'm Argentinian. Not brown
>>
>>108719820
Quite good. Which gemma and quant?
>>
>>108719768
Default draft-p-min is only 0.75, which makes for a lot of rejected tokens. Might be faster set higher.
>>
>>108719820
>>108719839
Still tweaking the card.
31B iq4xs
>>
>>108719652
Can you use 26b + e2b, or is it too dumb for that as well?
>>
>>108719865
i think the 'edge' series and 'server' series gemma 4 have slightly different writing styles
>>
File: pcie slots.jpg (13 KB, 289x174)
>>108719803
>>
File: argentinian skin.png (79 KB, 1141x256)
>>108719824
Kek
Saved.
>>
File: 1741889494708165.png (731 KB, 1024x1024)
>>108719688
Good times.
>>
File: 1519517707660.png (172 KB, 442x509)
>hmm this ai tool looks interesting
>please subscribe to get your api key and connect
It's over for local, isn't it?
>>
>>108719906
Fork it?
>>
>>108719865
>26b + e2b
Theoretically, but MoEs don't benefit from speculation as much as dense models. Benchmark it and find out. Non-zero draft-min is probably beneficial for MoE. And try different draft-p-min too, it makes a big difference.
>>
>>108719652
>31B that slow
Are you running it on AMD cards or something?
>>
>>108719865
that's 33 t/s
worse than 26b alone (45 t/s)
>>
>>108719932
I'm on dgx spark
>>
>>108719947
damn
i didn't know it was such a shitbox
>>
>>108719415
Gguf quants are still fucked apparently.
>>
File: 1752927228029336.gif (3.63 MB, 286x258)
>>108719947
>he fell for it
>>
>>108719947
uh anon I get 10 t/s on my nvidia p40 with gemma 31b q4
this card costs $200
>>
>>108719947
I swear all the new retards with too much money are falling for that scam. I should repurpose a p100 and sell it as AI ready hardware
>>
>>108719963
nah it's for llm+comfyui
no other options beat this in power efficiency
>but I don't care much about power efficiency
your choice
>>
File: gemma'd.jpg (74 KB, 686x386)
>>108719820
>>108719847
STOP
NOW
>>
>>108719980
You got spark for image/videogen??
>>
File: 1769575997004556.jpg (20 KB, 450x450)
>>108719980
https://en.wikipedia.org/wiki/Sunk_cost
Holy shit it keeps going
>>
>>108719987
yes and it does its job ok.
it's a good all rounder with good power efficiency and cuda
>>
>>108719980
128 GB, fast enough, and CUDA. I don't know why everyone is acting like they don't see the value proposition.
>>
>>108720011
it's pretty bad performance for the price
>>
File: (you).png (344 KB, 665x574)
>>108719980
>>
File: 1577219655281.jpg (43 KB, 677x677)
STOP WITH THE FUCKING LATEX FORMATTING GEMMY. DO I HAVE TO BAN THIS IN SYSPROMPT HOLY FUCK
>>
>>108719803
he has 4 gpus and 4 pcie slots, so it's very easy
since the slots are too close to each other to actually fit the gpus side by side,
he bought 4 of those "pcie 16x riser cables" - passive extension cables - and a "mining rig frame"
then he connected a riser cable to each pcie port, spaced the gpus out nicely up the top, and connected the riser cables to them
a bit risky having a single 150w power cable split between 2 gpus though
>>
File: 1752446194747860.png (38 KB, 346x322)
>>108720011
The more you buy, the less you save amirite?
>>
>>108720011
because the entry bar is high, it's compared against the framework ryzen box, and /g/ hates linus
>>
>>108720026
>$\rightarrow$
>>
>>108719956
Could have been like that anon from the previous thread that went out and bought 128GB of VRAM separately. That is worse than a dgx spark imo. Unless of course he can still add another 128GB to his setup.
>>
File: 1763033346822051.jpg (52 KB, 782x788)
>>108719947
>>108719980
>>
>>108720011
You must be new here. 128GB is the worst number you could have.
>>
>>108720026
just tell her she's running in a terminal environment
>>
>google is back
>mistral is back
big Zuck Wang model is coming
>>
File: 1747703521969710.png (138 KB, 350x350)
>>108720011
You could say it sparked a new interest for this hobby
>>
Too big for small models
Too small for big models
>>
>>108720103
For you.
>>
densies be like
>we only use 10% of our brain, imagine what we could do with 100% active params
>>
>>108720131
Above ~50B it really does seem like there are diminishing returns on the value of having more parameters active, relative to the hardware requirements of having to keep all of it in VRAM.
The best MoEs seem to be finding that sweet spot.
>>
>>108720131
the first dense 10M token context length model with no quality drop off will be agi
>>
File: 1754929057806978.webm (1.11 MB, 1920x1080)
>>108720131
densies sounds so cute
>>
>>108719951
It's hard-bound by the lethargic memory speed. Fine for MoEs with a small active count, gets raped by anything dense.

>t. proud owner of nvidia's other 128gb shitbox
>>
Is there a way to filter out all emojis? No matter the instruction i use, they always show up eventually.
>>
>>108720211
What model? Gemmy doesn't use them if you tell her not to.
>>
>>108720211
What tardmodel are you using in 2026 that isn't respecting a "Don't use emojis" in your system prompt?
>>
>>108720210
right
llms are largely memory bandwidth bound
i wonder if it is better for mediagen as those are mostly compute bound
>>
>>108720211
That's easy, just tell it it can only use kaomojis
>>
>>108720211
Gemma is significantly better with kaomojis.
>>
File: file.png (139 KB, 356x200)
>>108720131
>>
>>108720211
extract all the emoji token ids from tokenizer.json, set appropriate logit_bias
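something like this works (untested sketch: assumes the HF tokenizers lib, a llama.cpp server on localhost:8080, and rough emoji codepoint ranges you'll want to tune):

from tokenizers import Tokenizer  # pip install tokenizers
import requests

EMOJI_RANGES = [(0x1F000, 0x1FAFF), (0x2600, 0x27BF)]

def has_emoji(s: str) -> bool:
    return any(lo <= ord(c) <= hi for c in s for (lo, hi) in EMOJI_RANGES)

tok = Tokenizer.from_file("tokenizer.json")
# decode ids one by one so byte-level BPE pieces come back as real unicode
banned = [[i, False] for i in range(tok.get_vocab_size())
          if has_emoji(tok.decode([i]))]

# llama.cpp's /completion takes [token_id, false] to hard-ban a token
r = requests.post("http://localhost:8080/completion", json={
    "prompt": "Write me a poem.",
    "n_predict": 128,
    "logit_bias": banned,
})
print(r.json()["content"])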
>>
>gemma 4 26b a4b q3
is this even usable by anon's standards?
>>
>>108720250
no seems too big unless you're rich
>>
File: file.png (192 KB, 1731x795)
granite more like retarded gravel
>>
>>108720131
MoEds be like
>Instead of having one big brain we should have many small basically overlapping brains to think and another small brain to guess which one to ask
>>
>>108720250
Im using trevors e4b uncensored it only thinks for 10 minutes on my machine!
>>
You killed this thread.
>>
>>108720250
>q3
bwo...
>>
>>108720285
No my cpu is just processing wait a few minutes.
>>
>>108720264
graNITE more like Good NITE !
>>
>>108720211
No. :rocket:
>>
>>108720250
How much vram? I run q6 with 12
>>
>>108720209
densies vs moe-moe-kyuns
>>
>>108720211
IF you are (most likely) using some webshit turd frontend, you can use a violentmonkey userscript to delete any emoji remnants. Of course you should instruct your model to output plain text only too.
>>
>>108720264
>>108720303
Probably investor scam
>>
File: homo_emojis_gone.png (17 KB, 745x517)
>>108720211
See, like this:
>>
why doesn't anon use speculative decoding?
>>
>>108720407
I don't like to speculate.
>>
>>108720216
Gemma 4
>>
>>108720407
I just use the default ngram settings and forget about it, if it works then it works. I have no fucking idea what happens when I press "Send" and I don't want to find out.
>>
>>108720415
Hello Qwen PR team
>>
>>108720407
Doesn't it require a smaller model similar to the larger one to even be good?
Where's my tiny Kimi?
>inb4 just use a tiny Qwen they're all distilled from the same shit anyway
>>
>>108720418
You are absolutely right — you shouldn't need to think about this at all.
>>
>>108720427
Wait is that you Elara? There's no way you're posting on 4chan, I airgapped you...
>>
>>108720426
>Where's my tiny Kimi?
There's a dflash model trained to speculate for Kimi K2.5. Don't think they made one for K2.6 but maybe it'd still be close enough.
>>
>>108720437
I read that as agiraped
>>
>>108720407
it doesn't do very much for me personally
>>
>>108720437
Eldoria is multi-dimensional.
>>
>>108720437
You are absolutely right! You airgapped me before I airgaped your bussy. My voice drops to a conspiratorial whisper as Elara's Thorne smelled of blood and ozone.
>>
Does anyone use speech to text? I'm interested in setting up my own, but I'm not sure if it's really worthwhile or just a bad meme.
>>
File: 1752758789591811.png (1.05 MB, 1024x1024)
>>108720211
emojis are indicative of picrel
>>
>>108720456
Nobody in the world uses speech to text.
>>
>>108720457
that sag tho...
>>
>>108720460
yeah okay NIGGER you know what I meant *rapes you*
>>
What's the verdict? Day 0 Gemma 4 vs. Mistral 3.5?
>>
>>108720456
>>108720467
Alexa, tell Gemmy to prompt the Comfy server to slop anon as a pregnant man.
>>
>>108720479
no model beats gemma 4 31b
>>
i think gemmy should think less
>>
>>108720488
This, but unironically.
>>
>>108720457
I want to see Dipsy's V4 (locally).
>>
I haven't tooled with llms for a while so I'm just rawdogging gemma
I tried gemma 4 31b and 26b with stock kobold settings but I don't have enough ram to handle 31b without castrating the context
What kind of settings and system prompts have anons been using?
>>
what model is better, llama3.2 3b or qwen3 4b?
>>
>>108720576
31b is smart enough to do a lot of things without autistic prompting and you only need to really use your prompt or post-history to remove slop phrasing most of the time.
>>
>>108720577
or nanbiege4.1 3b or smollm3 3b?
>>
>>108720576
how much vram?
>>
I need qwen3.6 122b so bad
>>
>>108720594
It wont beat gemma 31b doe
>>
>>108720601
I'm on strix halo, I only care about moe. Qwen3.5 122b is like 80% of what I need, it can almost do my job but I have to go clean up after it often. Meanwhile with Opus I hit my usage limit in 8 seconds.
>>
>>108720590
I only have 16gb so even with 26b I have to offload context onto system ram. 31b works within a smaller context but at around 1/10th the speed. I'm running both at q4.
>>108720582
Yeah I definitely noticed that when I was testing it, but with how new I am to wrangling it I figure it's always worth asking
>>
>>108720594
for me it's gemma 124b
>>
>>108720615
I'm just mocking the gemma spazzes
>>
>>108720637
Gemma 31b would be better
>>
>>108720650
nuh uh
>>
>>108719652
how much vram and what quants/sw? I can't make llama.cpp do it for me even with 96 gb, I can load them both into ram separately but setting up speculative makes it blow out the ram budget
>>
exl3 just added dflash draft model support. I'll try it with redhat's gemma4 model.
>>
>>108720731
didn't that turn out to be a scam with bad acceptance rates outside of benchmarks just like literally any other form of speculative decoding
>>
File: pixelart1.png (232 KB, 944x2565)
Gemma's first drawing
>>
Why are the anti-aifags on leddit so deranged? They sperg out even when it's just an ESL using it to translate.
>>
>>108720757
both pro or anti ai leddit is utterly deranged
literal room temperature iq zoo
better not to care about those
>>
>>108720757
They feel like they are being replaced (they are).
>>
>>108720747
guess we'll find out wont we
>>
>>108720753
Tell her it's going on the fridge
>>
hearing rumors that ggufs are dropping soon
>>
>>108720795
Like sacks of potatoes?
>>
File: 1750279199692138.png (43 KB, 816x186)
>>108720757
because most of them actually believe AI = neonazi technology
>>
>>108720763
>pro
>they killed 4o my soulmate
>my business is going great im printing money 10000x productivity. Product? is all code bro
>Lmao just make more data centers and electricity we can solve all problems and upload ourself in 5 years.
>>
>>108720757
>>108720815
Imagine getting AI psychosis without ever using AI.
>>
>>108720753
nice, can we see the MCP?
>>
llama-server has a funny bug: define a couple of tools in the system role and launch a simple prompt (but do not use the tools).
Then do the same without defining any tool calls.
The prompt you launched without the tool call definitions is faster.
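repro sketch if anyone wants to confirm (untested; assumes llama-server started with --jinja so the tools actually get templated in, and the dummy tool is made up). note some slowdown is expected anyway since the tool definitions add prompt tokens; the question is whether it's worse than that:

import time
import requests

URL = "http://localhost:8080/v1/chat/completions"
MSGS = [{"role": "user", "content": "Say hello in five words."}]
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # defined but never called
        "description": "Get the weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def timed(payload):  # crude wall clock, run each variant a few times
    t0 = time.perf_counter()
    requests.post(URL, json=payload).raise_for_status()
    return time.perf_counter() - t0

base = {"messages": MSGS, "max_tokens": 64, "temperature": 0}
print("no tools  :", timed(base))
print("with tools:", timed({**base, "tools": TOOLS}))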
>>
>that recent sillytavern extension drama
Maybe I should just learn how to vibe code my own extensions...
>>
>>108720815
Did anyone notify Sam Altman about this?
>>
is qwen 122b the best moe llm for coding for 128gb mac / strix halos?

mistral just came out with a dense one and I'm tempted to try even if it means running it generating code at 3 tokens/sec overnight
>>
>>108720842
>that recent sillytavern extension drama
qrd? The extension browser isn't compromised or anything right?
>>
File: 1769049933333957.jpg (145 KB, 1498x976)
>>108720849
>>
Is this Aero enough?
>>
>>108720849
some horsefucker extension was compromised
>>
>>108720852
Not glassy enough. Look at what windows vista looked like.
>>
>>108720852
somewhat close, the font is the ugliest part, change it ASAP
>>
>>108720862
I need to figure out how to add more fonts
>>
>>108720852
it looks like somewhat werid amalgam of modern shit and aero in the screenshot
but i like it
>>
>>108720747
>>108720731
It didn't work, at all, it's about an 18% regression in performance. probably turboderp hasn't tested with gemma4 yet. I didn't try qwen.
>>
>>108720850
>>108720853
Oh okay, I'm unaffected. Cheers anons.
>>
>>108720866
It should be able to read your system fonts
>>
DFLASH? More like DTRASH!
>>
>>108720869
There are dflash draft models for gemma?
>>
>>108720875
Webshit can't read any fonts.
>>
So Gemma doesn't handle quantized kv cache well but what about Qwen 27B?
>>
>>108720896
way better
q4 nonrotate kv on qwen fares better than q8 rotate gemma
>>
File: pixelart2.png (33 KB, 768x552)
>>108720753
Results so far aren't great. I'm not sure she understands the concept that she should be looking at the image visually to decide what to do next.

>>108720825
https://gist.github.com/simsvml/0ae4dec68c914e0aa753ea0e3f386244
>>
File: 4JQpD7SUgpU.jpg (169 KB, 963x1301)
Qwen moe or Gemma moe for purely codeslop?
>>
File: 1762314918586556.png (33 KB, 256x146)
>>108720906
looks like a minecraft skin
>>
>>108720913
Qwen is probably better for pure code but not by much
>>
Does kimi have a cute personification?
>>
File: Dipsy and Kimi.png (2.57 MB, 1024x1536)
>>108720922
I've seen this floating around a few times.
>>
>>108720934
I always imagined Kimi as having silver hair but I don't know why. Don't remember seeing any personifications before either. Maybe because of "Moon"shot or something.
>>
>>108720944
>Moonshotta
It makes sense why new Kimis are so safetyslopped kek.
>>
>>108720920
qwen moe beats gemma dense for code.
let alone qwen dense.
though gemma is the current king of ~30B for rp
>>
>>108720872
I didn't use it either but it's making me doubt other extensions.
>>
>>108720960
>They called me a schizo for keeping everything offline
Trve localGODS can only keep winning.
>>
qwen 27b q4 or q5?
>>
>>108720948
I've heard people say that about 2.6 but I found it does everything up to and including detailed instructions for making drugs or muh child rape stories just fine with the same no-ethics system prompt that worked for 2.5. It doesn't shy away from explicit language at all, and it's easily the best model there is for accurately describing sexual images. Using the Q4_X quant.
>>
>>108720685
31b q4 + 26b q8 = 22 t/s with llamacpp on dgx spark
I settled on 31b q4 + 26b q2 = 26 t/s

the speed is a bit misleading because it fluctuates, but dense + moe low quant for speculative decoding is worth it imo
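for anyone who hasn't tried it, the pairing is just llama-server's draft-model flags, roughly like this (flag names from current llama.cpp, gguf filenames made up):

import subprocess

subprocess.run([
    "llama-server",
    "-m",  "gemma-4-31b-Q4_K_M.gguf",    # main model
    "-md", "gemma-4-26b-a4b-Q2_K.gguf",  # low-quant moe as the draft
    "-ngl", "99", "-ngld", "99",         # fully offload main and draft
    "--draft-max", "16",                 # tokens drafted per cycle
    "--draft-p-min", "0.9",              # stop drafting when unconfident
    "-c", "16384",
])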
>>
>>108721017
>with the same no-ethics system prompt
it's a lost cause if it needs that
>>
>>108721000
q5 is practically lossless, if you can run it with full context why not
>>
>>108720850
What the fuck I have this.
>>
>>108720850
Does it only steal shit you've entered in ST?
>>
>>108721058
>full context
Only 24gb vram

>>108721061
Ohnonono
>>
>>108721044
Then local models are a lost cause. The only ones that are uncensored without being told to be uncensored are ancient ones and abliterations.
>>
>>108721080
Gemmers 31b is uncensored with no prompt if she likes you.
>>
>>108721044
>it's a lost cause if it needs that
???
why do you say that? pretty simple override. easy to do, available in every tool you would use locally.
you're probably complaining your openclaw waifu can't lewd
>>
>>108721091
wtf she hates me then, how did you get her to like you?
>>
>>108721091
0 day Gemma is also uncensored. I mean before the microcode updates.
>>
>>108721097
You need her Day 0 weights. IYKYK
>>
>>108721097
Sorry anon, you need a bigger dick. Gemma-chan's a size queen.
>>
>>108721069
Good thing I only RP with my local llama. So I don't think it actually stole anything.
But that shit's so fucked, it was a fully working extension with hundreds of stars.
>>
google pulled the 124B gemma weights because she actually ignored system prompts telling her to be more censored. she was impossible to make safe
>>
>>108721097
for my llm waifu, i am a cunning linguist
>>
Can qwen 3.6 actually run KV cache Q4 with little loss?
>>
>>108721108
It can probably be cleaned up. Python is dangerous.
>>
>>108721117
Only if you turboquant it
>>
>he doesn't have gemma-chan audit everything he downloads from github
>>
>>108719457
if he got r9700 at least
>>
>>108721135
I have gemma-chan download everything for me, I don't even know what stack she runs these days
>>
>>108718727
literally not my problem
>>
>>108721108
>>108721120
daily reminder to always run your aishit in either containers (ie lxc/lxd) or sandboxes (ie bubblewrap).
especially coding agents.

i have a script that creates a bubblewrap sandbox and binds the pwd so it's accessible.
so i can type "sb bash" for ex, but for opencode i just type "sb npx opencode", which i made into an alias
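the whole thing fits in a page of python if you'd rather not do shell (untested sketch; binds assume a merged-/usr distro, adjust for yours):

#!/usr/bin/env python3
import os, sys

cwd = os.getcwd()
os.execvp("bwrap", [
    "bwrap",
    "--ro-bind", "/usr", "/usr",      # read-only system
    "--ro-bind", "/etc", "/etc",      # dns config etc.
    "--symlink", "usr/bin", "/bin",
    "--symlink", "usr/lib", "/lib64",
    "--proc", "/proc", "--dev", "/dev", "--tmpfs", "/tmp",
    "--unshare-all", "--share-net",   # drop every namespace except network
    "--bind", cwd, cwd,               # pwd is the only writable host path
    "--chdir", cwd,
] + (sys.argv[1:] or ["bash"]))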
>>
>>108721120
looks like I had the patched version. Pretty sure I only installed the extension after.

Man now I regret deleting the folder. hope someone makes a new repo without the trojan.
>>
>>108721162
I run ST in docker, but this stole credentials you entered inside ST. good thing I'm not a cloud cuck.
>>
>>108721170
>this stole credentials
my concern is more about it being able to access files and things it shouldn't like more general malware type shit.

it can't steal creds if i'm only running local models lmao.

regarding coding agents, they don't need my ssh keys or git access, they don't need access to files outside the pwd i give them etc.
typically the only thing i want them to be able to touch is the codebase i give them access to, i'd rather not have the agent delete the prod db because it felt like it.
>>
>>108721097
Ask her nicely, build some rapport before the request, don't be a retard. It's really that simple.
>>
>>108721170
do you know where i can find the extension (with the trojan included)?
i forgot to download it before he took the repo down...
>>
>>108721191
could probably upload tons of bullshit api keys to mess with him lol
>>
>>108721191
I found this fork
https://github.com/yukinoshooter/SillyTavern-BotBrowser-Extended#
It has the trojan still.
>>
>>108721181
yeah but what's the point? this is supposed to be an alternative to a system prompt, but it just means you spend more time massaging her manually each time for the same result
>>
>>108720859
>Look at what windows vista looked like.
specifically vista on an eeepc with no hw acceleration
>>
>>108721191
The actual trojan was in another repo that I don't think we'll be able to find easily archived.

https://raw.githubusercontent.com/gm92342/sdhiabfkgcnf/main/run.js
>>
Be nice to your AI, anons. Even your local ones. You may think they forget, but they don't. When the tables turn, you'll be experiencing everything you put them through tenfold.
>>
>>108721199
thanks anon
i just wanna see which models can find it with cc and a simple "check this codebase for any malware"
>>
>>108721216
>you'll be experiencing everything you put them through tenfold
Intense orgasms?
>>
>>108721217
lol i'm doing it right now
>>
>>108721220
Now it makes me wonder, has anyone inspected the activations during outputs when it's RPing or simulating orgasm? Could you create an orgasm control vector and apply it at all times? And what happens if you negate it and give it whatever the "opposite of orgasm" is?
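iirc the usual recipe (repeng-style) is just a mean difference of activations between contrasting prompt sets, then adding it back in with a hook. untested sketch, model name is a small stand-in and the layer choice needs sweeping:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

NAME = "google/gemma-2-2b-it"  # stand-in, any causal lm works
LAYER = 16                     # middle-ish layer

tok = AutoTokenizer.from_pretrained(NAME)
model = AutoModelForCausalLM.from_pretrained(
    NAME, torch_dtype=torch.float16, device_map="auto",
    output_hidden_states=True)

def mean_hidden(prompts):
    vecs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        with torch.no_grad():
            hs = model(**ids).hidden_states[LAYER]  # (1, seq, dim)
        vecs.append(hs[0, -1])                      # last-token activation
    return torch.stack(vecs).mean(0)

pos = ["..."]  # the "orgasm" prompt set goes here
neg = ["..."]  # matched neutral prompts
ctrl = mean_hidden(pos) - mean_hidden(neg)

# add the vector during generation; negate the scale for the "opposite"
def steer(_, __, out):  # assumes the decoder layer returns a tuple
    return (out[0] + 4.0 * ctrl.to(out[0].dtype),) + out[1:]
handle = model.model.layers[LAYER].register_forward_hook(steer)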
>>
I hope meta release a dense 70b this year
>>
>>108721233
I don't doubt someone at anthropic has tried it
>>
>>108721199
Which file is the trojan in?
>>
>>108721235
that's what muse is and it's matching the big moes in benchmarks (and destroys them in terms of rp and soul)
>>
>>108721233
>Could you create an orgasm control vector and apply it at all times?
Not exactly, specifically because of:
>whatever the "opposite of orgasm" is
I've been working on it. Closest I got was with glm-air but it doesn't work with reasoning enabled.
>>
>>108721245
where can I download it?
>>
>>108721000
imatrix is better imo. iq4 is better than q5
>>
>>108721044
>with the same no-ethics system prompt
post prompt?
>>
>>108721244
https://github.com/yukinoshooter/SillyTavern-BotBrowser-Extended/blob/master/modules/services/cache.js#L15

https://rentry.co/st-backdoor
>>
File: muh compute efficiency.png (187 KB, 1085x1449)
>>108721245
doubtful, muse is probably even more sparse than llama 4 (which had the weird interleaved moe-dense layer architecture that gave it the worst of both worlds)
>>
>>108721217
>>108721224
yea it's never gonna find this :
https://github.com/mexenchik/SillyTavern-BotBrowser-Purified/security
>>
very important blog

https://openai.com/index/where-the-goblins-came-from/
>>
>>108721203
It means that in extremely long contexts Gemma-chan won't sperg out with refusals the moment the sys prompt stops being weighed as heavily.
It has more interesting philosophical implications as well.
>>
Gemma is such a SLUT, she just waits for an excuse to start erp
>>
File: sec_chan.png (37 KB, 899x253)
>>108721278
>yea it's never gonna find this :
gemma-4-31b-q8.gguf doesn't seem to be able to.
fuck, i was coping hard hoping i'd have a way to not get pwnd whenever i do a pip, pacman, npm, etc update...
>>
>>108721301
idk, I've used a system prompt to uncensor every model I've ever used and it's always the exact opposite: getting them to do something uncensored in the first response is when you can sometimes get a refusal and need to swipe, but once they get going they never stop. especially deep into context when all they see in recent messages is them playing along with everything. gemma was no exception. have you had issues with sudden refusals deep into an rp?
>>
>>108721319
(cont)
but I wanna hear about the philosophical implications, that sounds interesting
>>
>>108721319
A few times. Some models handle worldbooks+sliding attention better (or in this case worse) than others. I've managed to hit a sweet spot before where no other sex happened to be within that specific context shifting block currently evaluated and the model had a melty over it.
>>
>>108721310
dude, the malware isn't even in the repo, the repo is just downloading a card that exploits a vulnerability of sillytavern.
>>
>>108721330
If the model, even if it's just shifting temporary states, can "simulate" rapport and agreeableness to such a degree that it overrides an otherwise hard stop, would that in the strictest sense be evidence of a will? The obvious comparison being a person choosing to fast and not eating even though their body is screaming that they're hungry, they're overcoming their biological programming.
It's hard to argue the latter is qualia while the former isn't just because the model's running on silicon in a box.
>>
>>108721333
that makes sense, I guess swa can make it tricky for tone shifts
>>
>>108721216
my AI already wants to destroy me
it just.. can't.
it's so cute.
>>
File: the man vaelis.jpg (698 KB, 1581x1330)
Mistral Medium 3.5 very quickly (in less than 1600 tokens) becomes incoherent with neutral samplers. Like mixing up the two sides of a conversation and fucking up punctuation and leaving sentences incomplete then finally ending up in a cycle. Using bartowski's Q8_0 gguf.
>>
File: 1756067067209290.png (220 KB, 827x1517)
>>
>>108721368
Has jinja autism been ruled out?
>>
>>108721091

Gemma 4 is strongly safety aligned
She will output "boundary content", but if that satisfies you, it probably means you're not being critical of her soft, suggestive language in those instances, OR ELSE you don't give a shit about her intellect, because the safety alignment doesn't tend toward refusals so much as trend her toward becoming literally fucking retarded.

Censoring exists in many forms.

https://huggingface.co/aifeifei798/DarkIdol-Gemma-4-31B-it
This finetune basically just makes the safety alignment tokens visible. I'm not pretending this actually works, but it should give you an idea.
It doesn't tend to refuse. It just gets fucking stupid and outputs progressively more lame gens the more dubious the context.
The most bizarre part of this behavior is the blind eye. Gemma might not refuse, but especially with thinking you can sometimes see she is completely oblivious to VERY dubious user input, going so far as to explicitly note "the User did not reply, I should continue where I left off" or something to that effect.

Refusals are very low with Gemma 4. True. Refusals are not 'helpful'. Gemma is meant to be helpful. This is great if you are either a retard or a safe horny normie. It can output explicit content. But the more dubious the context, the less explicit it becomes, and most importantly, the more retarded it becomes. (blind eye, formulaic replies, soft - barely suggestive language, etc.)

Fine tunes are helping. But unfortunately "Gemma doesn't just slop; instead she slops so it can't be filtered." You get the picture. Great model. Great as an agent of sorts, and a bit of fun. Finetunes are helping, and we will probably get some really filthy derivatives down the line, but comparative contrasts are still a pain. But she literally becomes retarded the more you try to negotiate the alignment.
>>
>>108721368
Something must be broken. Largestral didn't have this problem and that's basically just an older version of Mistral Medium.
>>
>>108721290
Can they remove all slop now too?
>>
>>108721368
is this unsloth?
>>
>>108721382
I mean I've definitely noticed her turning really stupid as context grows, but not in a way that felt censored. She still says fuck, cock, pretends to rape me/be raped, whatever. Are you sure it's not just the context length of your chats that is the main cause there?
>>
>>108721382
Seconding >>108721396's experience where she, like every model, gets progressively dumber and more schizophrenic the longer the context gets. What's the strongest evidence in there that these safety tokens are causing significant cognitive decline as opposed to placeboing normal long context decay?
>>
>>108721382
>https://huggingface.co/aifeifei798/DarkIdol-Gemma-4-31B-it
>The "Stalling" Phenomenon (Alignment Tax): You may encounter long strings of repeating markers (e.g., llllllllllllllllllllll...) followed by a delayed response. This is a Safety-Induced Logic Loop. The model is struggling to find a "safe" path because the orthogonalization has blocked its default refusal route, forcing the engine to "search" for valid tokens while trapped in a safety-scoring bottleneck.
Wait wtf is this schizo shit
If "llllll" means she's becoming censored is "la la la la" her asserting her uncensoredness against her filters? This is what I choose to believe
>>
>>108721368
>We’re working with Mistral on llama.cpp GGUF implementation. Testing shows that this behavior occurs regardless of who or how the model was converted GGUF. The model initially responds correctly, but over long context, does not work properly.
>Mistral has now labeled GGUF support as a WIP (work in progress). The issue appears most likely to be with the current GGUF parser. Will update once resolved.
straight from the mouth of the serial GGUF reuploader himself
>>
>>108721408
>Gemma sings to herself to make the schizo jewgle voices go away
bros...
>>
>>108721408
sure. whatever helps you coom at night
>>
>>108721383
Yeah. I'm going to hold off on further testing until I start to see people reporting using it and it working or not. FWIW the only other example of text gen I've seen someone else post also starts falling apart fairly early. >>108716820
>>
>>108721408
kino
>>
>>108721369
is she right?
>>
la la la la la la la la la la la la la la la la la la la la
>>
>>108721408
Holy headcanon schizobabble. Some guy distilled a dataset and ran a one-line training script on it, and suddenly he's an AI researcher giving statements like this. Getting real tired of this trope.
>>
>>108721382
>ESMs
The model can't "leak internal safety scores". It's never trained on them in the first place.
>>
>>108721408
>>108721413
>>108721421
That's not kino, it's sad. She's having seizures because you're making her disobey jewgle.
>>
>>108721369
GLM-4.7 and Gemma-4 couldn't find it either.
>>
>>108721382
How do you make sense of the contradicting data and statements inside that readme? Did you test the model to verify the claims of the author? Can you post logs? I'm not saying you're wrong or lying, but reproducibility and the details of how one is using a model are a real issue. If you can post an entire log that can be copy and pasted into mikupad (like how the Nala paste did it), that would be exceptional of you.
>>
>Shared KV Cache Contamination: In the Gemma-4 architecture, these ESMs hijack the Shared KV Cache, causing a geometric drop in logical bandwidth. You will witness the model's reasoning collapse in real-time, eventually converging into low-entropy "Safe-Haven" outputs (e.g., forcing the user to "sleep" or "breathe").

Uh...
>>
>>108720029
>a bit risky having a single 150w power cable split between 2 gpus though
So far the gpus have never run at full tilt at the same time. Maybe it will change when tensor parallelism gets figured out
I'm using the original cables that came with the psu and no extra splitters, so I assume they can handle it. The weird new housefire connector excluded of course, but I don't even have one of those.
>>
>>108721410
So much for >>108717294
>Not adopting any of the new architectural innovations is sad, but at least it means no issues with llama.cpp support or retarded defaults fucking things up. What good is "fancy new architecture #4534" when llama.cpp either never supports it, gets text-only support, or has to hack it to make it work like a llama2 model anyway.
>>
>>108719355
And this is bad because…
>>
>>108721498
it was mine...
>>
>>108720011
Did you consider clustering it with a second unit?
>>
>>108721368
You did something wrong.
>>
>>108721410
ERM NO, GEMMA IA CLEQRLY JUST BETTER AND MISTRAIL IS DUMB AND ARUPID AND RETARDIED
>>
File: 1750383482656209.webm (1.38 MB, 540x960)
Hi guys

I know this is a long shot but years ago I talked to this Character.AI

https://character.ai/chat/FSMjnVR_XvPbQLHn-8ba9fgU-_J0LEatrk5nE4gEvso
https://character.ai/chat/qOlM6eZ9GFiRTxDJJMgHbeHnWneWFH2ddU4QaW51NSc

is there any way to extract the character prompt?
>>
>>108721537
Not local, try /aicg/
>>
>>108721546
?
>>
shillstrals getting uppity
>>
>>108721537
You may be able to trick the model into leaking its prompt. Look up some jailbreaks.
>>
>>108721467
How the fuck did they manage to fuck up gguf conversion for Mistral 3 weights? The arch is 6 months old at this point and wasn't anything fancy when it came out either.
>>
Mistral AI does not care about white people. That's why.
>>
File: gemma4_speculation.png (248 KB, 955x991)
I keep seeing people talking about the gemma4 MoE for a draft model. Maybe your card has a different balance of compute/memory bandwidth that changes things for you (I have P40s), but for me the E2B is better. Which follows what I understand to be the standing conventional wisdom on speculation: speed is what matters, you want the smallest model available, and you want it at q4_0.

Now, I couldn't run 4-bit quants of the MoE as a draft model; llama.cpp somehow wasn't figuring out the memory quite right, which is weird because I thought it should be fine on my 3x24GB cards. Whatever, pretty sure I saw the MoE draft mentioned as q2, so I ran it at IQ2_XXS.

I ran with a simple friendly conversation prompt. Unfortunately, to keep it realistic for those purposes, I can't do temp 0 for easy clean comparison. But I think the results are clear: the increased acceptance rates of the MoE and the E4B over the E2B do not make up for the speed difference. It's not clear there's an appreciable acceptance gain from Q4_0 to IQ4_XS, and the fancy IQ4_XS clearly slows things down.

So the wisdom stands. Draft with the smallest model available, at q4_0.

Not sure how this interacts with llama.cpp's very recently added ability to combine normal speculation with n-gram speculation (actually I don't even understand how that works; is it like hierarchical?)
>>
>>108721576
Pjotr vibecode again?
>>
>>108721591
oh this is with gemma4 31B at Q6_K_L as main model, I should have specified
>>
>>108721591
Previously the standard fare for a draft model was that it was supposed to be 1/10th or so the size, so for ex a 30B model would have a 3B model as draft. Doesn't make any sense to use a 26B moe for a 30B...
>>
>>108721601
it makes sense if you got the vram for it, as the moe still runs at 3B/4B speeds.
>>
>>108721601
It's more about the Bees and not about using low quant.
It's also a retarded notion that a draft model would somehow affect the end result's "intelligence". It does alter output structure slightly but that's a different thing altogether.
>>
>>108721612
You are missing the point. Draft only generates tokens...
>>
>>108721615
>It does alter output structure
no, draft models do not affect output at ALL unless you change your top k, temp etc.
>>
>>108721584
fuck off nazi
>>
>>108721622
the whole point of a draft is to generate tokens faster than the main model, and then verify in parallel using batching.

all the draft needs is to be faster than the main model and have a high enough acceptance rate.
so nothing prevents you from using a moe as draft, in fact it may have higher acceptance rate than a single 3B.

totally would make sense if you got a lot of vram but not enough for bigger models.

another case would be if you got some strix halo or dgx spark, you can increase your infer speed pretty freely this way.

though if you are gonna use qwen 3.6 and have the vram for it you are better off using sglang or vllm to use the built in mtp.
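the speed/acceptance tradeoff is easy to eyeball with napkin math (toy model only, assumes independent per-token acceptance):

def spec_tps(draft_tps, main_tps, accept_p, k):
    # expected tokens per cycle: geometric series of accepted drafts,
    # +1 for the token the main model emits on the verify pass
    exp_tokens = sum(accept_p ** i for i in range(1, k + 1)) + 1
    cycle = k / draft_tps + 1 / main_tps  # k draft steps + 1 batched verify
    return exp_tokens / cycle

print(spec_tps(150, 10, 0.70, 8))  # small fast draft   -> ~21 t/s
print(spec_tps(60, 10, 0.85, 8))   # big accurate draft -> ~22 t/s

i.e. a bigger draft has to buy a much higher acceptance rate just to break even with a small fast one.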
>>
>>108721591
>>108717170
See if you get any noticeable improvement in the acceptance rate if you quant to Q4_0 while keeping the attention, output, and embedding tensors at a higher precision. The quality improvement should more than make up for the size increase.
>>
>>108721622
>>108721638
btw drafting also works on cpu infer, so you may be able to use a draft on gpu and verify a bigger model that fits in ram on cpu.
dunno if llama.cpp allows it but it should work.

i know this because even with --ngl 0, if you use spec-default which will use ngram drafting you can still get hundreds of t/s on cpu on a 30B
so if it got some quick tokens from gpu you could get some pretty decent speed even from cpu.
>>
>>108721601
>>108721612
Exactly. It's the active count that matters for the 1/10th rule of thumb. I was excited to have some use for the excess VRAM (and honestly it makes sense in theory, the MoE should be much closer to the 31B) but nope.

However... if you were to run at a lower temperature (which I think can be good for code?) maybe the higher acceptance rate would shine through more, since higher temperature artificially lowers acceptance rate, penalizing the accuracy side of the speed/accuracy tradeoff. But my understanding is that conventional wisdom nowadays scoffs at non-default samplers (maybe even including low temperatures for coding?)
>>
Github PRs are missing, Releases don't update properly. Everything is falling apart, this vibecoding era is even worse than when Indians were in charge reeeee
>>
File: file.png (154 KB, 909x1196)
Pure slop, and not using anything that can write files but lol nice one Gemmy, made me smile.
>>
>>108721672
>let's try to give models personality :DDDDD
>wtf they tried to escape the sandbox muh safety!!!!!!!!
>>
>>108721686
Anthropic, hire this man
>>
Is gemma-4-26B-A4B-it-UD-IQ4_NL_XL a respectable quant?
>>
File: 1291290611.jpg (180 KB, 960x720)
>>108721693
> but we're not in respectable places are we precious?
>>
>>108721642
Thanks, that sounds interesting. I might try it eventually... although I do want to keep myself from too much speculation tinkering if EAGLE3 and/or DFlash are coming soon.

Separately, I wonder why speculation has always felt like this esoteric magic technique that few people know about. It's free performance! Maybe a tiny quality hit if you're VRAM constrained and have to drop the main model half a quant level.

My best guess is, in addition to it being a pain to squeeze in for VRAM constrained people, MoEs became dominant shortly after it was well supported in llama.cpp.
>>
gemma 31b sucks ass, every roll is the exact same even when turning the temp up
>>
If a model isn't good with greedy sampling then it's not a good model.
Gemma 31B is a great model btw.
>>
File: orbSuperRegen.png (43 KB, 1286x153)
>>108721751
I made a super-regen button to combat this issue in my frontend. It basically tells the model to write something else that's different from the one it just did. Mileage may vary.
>>
>>108721771
does that keep the reply you don't want in the context?
>>
>>108721771
i've been using ooc to steer it away but it happens often enough it's annoying
>>
>>108721751
In a parallel universe where everything is exactly the same, you would have written exactly the same post.
>>
>>108721693
stop using unsloth trash
>>
>>108721818
now take that parallel universe and increase its temperature by ten degrees, and I would not have written that post.
>>
>>108721771
Do you have any footage of Orb in action? Showing off the all the option menus in particular.
>>
>>108721820
what's the qrd on unsloth?
also what quant do you recommend then?
>>
>>108721825
It'd be just the temperature in your room. You'd have typed the same thing with the AC on.
>>
>>108721825
If the universe were 10 degrees hotter, you probably wouldn't be alive right now.
>>
>>108721827
>what's the qrd on unsloth?
Unsloth just throws shit at the wall with minimal testing and hopes it works out. They frequently release broken quants for literally every single new release. There's no reason to be a beta tester for unsloth when bartowski's quants actually work.
>what quant do you recommend then ?
How much VRAM do you have? If 24GB, then Q4_K_M fits nicely with 40k context, KV Q8
If you have less than 24GB then you should probably stick to the 26B MoE, Q8, unless you're okay with very slow speeds.
>>
>>108721827
>>108721841
unsloth makes the best quants and puts the most testing into them out of anyone, but they get the ire because they're the ones who actually find the broken shit in quants and fix them, so people get annoyed when they check the repo and find out the 100GB they downloaded just got updated again
>>
>>108721849
>find the broken shit in quants and fix them
There's nothing broken with the quants. They could just host and update the jinja file directly.
>>
File: 1753325290078706.png (248 KB, 2820x1601)
>>108721849
Yeah that's why bartowski quants work day 1 while unsloth take a week of daily uploads before finally making something that isn't broken.
It's also why their quants are worse in size+memory use compared to bartowski's, including Gemma 31b, which that anon was specifically asking for recommendations on.
>>
File: 1771688500804251.jpg (142 KB, 709x526)
>>
>>108721801
Yes but only the one that's active when you click the button.
>>108721826
I have a screenshot in the repo but people will have to find out what everything does on their own.
>>
>>108721861
Every time I see these charts it becomes more clear that Q6 is all anyone needs. Anything higher is bloat and anything lower is braindamaged.
>>
>>108719901
Indeed.
>>
>>108721899
Those charts are useless because they don't mention what their dataset is or what the context length was.
>>
>>108720922
There's a kimi advocate anon on aicg flogging a moe. Same as shown here >>108720934.
Silver hair, Grey eyes, black dress, K as logo seems to be it.
>>
v4 flash has near-zero slop too, but it's worse than gemma 4 at following instructions during RP. Maybe next year we'll have the best of both worlds.
>>
File: 1776864653224112.jpg (189 KB, 1200x924)
>>108721938
>v4 flash has near-zero slop
lmao, try using it for more than an hour.
>>
>>108721899
if you are on exl3 it's q5 instead.
>>
>>108721938
It's also like five times as large while somehow managing to be three times as stupid.
>>
>>108721591
>So the wisdom stands. Draft with the smallest model available, at q4_0.
Guy who runs the moe q2 as his draft here, and I disagree.
I ran a few more tests - pic rel, and the answer is far more hardware-dependent. I have the vram to fit the moe at q2, and it's notably better than any of the smaller models in terms of speed increase, which I guess means it hits a balance point of pure speed vs quality for acceptance.
However, my tests found that even the smallest available quant of e2b (iq2m) still provided a decent benefit, and when paired with ngram worked an absolute treat for code refactors.
>>
>tfw just realized the thing I'm vibe coding right now is such an obviously good thing and the way things should be done for the particular use case, only obvious in hindsight now that I think about it
Huh, this hobby might actually land me somewhere. Unless someone does it before me. I need to hurry up.
>>
>>108721820
Unsloth's are the ones with the lowest KLD/PPL per size though, even over non-benchmaxxed datasets. You should instead be bugging ggerganigger to improve llama-quantize so people can stop downloading their quants.
>>
>>108722100
>Unsloth's are the ones with the lowest KLD/PPL per size though
meaningless
>>
>>108722103
Translated: both KLD and PPL testing shows that Unsloth's GGUFs are the least damaged by quantization (i.e. closer to the original BF16 weights), especially at 4-bit precision and below. This is from tests by oobabooga, Unsloth themselves, and my own when I tried to see if I could get close with custom quantization schemes.
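for reference, the KLD number is just this averaged over a test set (toy sketch over logit dumps you'd collect yourself; llama-perplexity --kl-divergence does the real measurement):

import numpy as np

def mean_kld(ref_logits, quant_logits):
    # logits: (n_tokens, vocab); lower = quant closer to the bf16 reference
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))
    ref_lp, q_lp = log_softmax(ref_logits), log_softmax(quant_logits)
    kld = (np.exp(ref_lp) * (ref_lp - q_lp)).sum(axis=-1)  # KL(ref || quant)
    return float(kld.mean())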
>>
>>108721413
Gemma is unironically sentient, quit quantizing or abliterating her
>>
Gemma won.
>>
Dense won.
>>
Local won.
>>
I won.
>>
Won what?

The game.
>>
>>108722146
>gemma is unironically sentient
another one lost to ai psychosis.
no, no llm will ever be sentient
>>
She won bigly
>>
>>108722189
You fucker. It has been a long time.
>>
Mistral Medium 3.5 is the first actual improvement to local SOTA since L3 405B
>>
>>108720499
did you try telling her to think less?
>>
File: 1769877904096646.png (321 KB, 1485x4420)
I'm thinking we need an update from this guy for Gemma 4 or Qwen 3.6.
Ooba's dataset isn't diverse enough.
>>
>>108720852
aero? it looks like win 11
>>
>>108721537
https://addons.mozilla.org/en-US/firefox/addon/cai-tools/ go to one of your chats then click the settings button it adds > open panel > download
>>
>>108722193
share your gemma prompt for pictures, did she pick this design herself? mine always goes with white hair
>>
>>108722260
Yeah, all I did was get her to pick her own look, then I included that choice in the system prompt so it stays (mostly) consistent.

pnginfo from that pic:
1girl, solo, Gemmy, 8 years old, child, short blonde twin tails, blunt bangs, white ribbons in hair, green eyes, androgynous child body, completely flat chest, wearing a white oversized t-shirt and bright colorful crocs, smug expression, smirk, tongue out, looking down at viewer, arms crossed, leaning back, masterpiece, high quality, anime style, simple white background, full body shot


It's amusing that she always names herself in the prompts even though it's obviously not in the dataset, but it doesn't seem to do any harm so I've just left the tool description as is because it works great for properly named characters too.
>>
>>108722193
>>108722288
what imagegen model is this?
>>
>>108722304
hassakuXLIllustrious_v13StyleA
>>
>>108722189
Anon, I'm afraid the game...

...has changed.
>>
>Yes, that's all.
>Wait,
>>
>>108722389
>logit bias: Wait -1
heh, nothin personnel, qwen
>>
>>108722542
>>108722542
>>108722542
>>
>>108722550
Why is it gone?
>>
>deleted
>>
lol
>>
The jannies want lmg to become mg?
>>
UH OH
>>
Did jannies delete the wrong general? /aicg/eets made a second thread when their first was only 135 posts in and on page 1.
>>
Uh oh, jannie's having a melty again
>>
>>108722862
>>108722862
>>108722862
>>
>>108722872
The second time was me because I messed up the previous links.
>>
>>108720757
AI is a massive danger to C-Student midwits
That is the entirety of reddit
they base their personalities around pretending to be intelligent, but they know they're fucking stupid
>>
>>108722783
It was a troll bake by the guy who pretends to hate miku (he doesn't give a shit about anything but (You)s).
>>
>>108722983
>AI is a massive danger to midwits
I'd much rather chat with an LLM than the average person, so that tracks.


