/g/ - Technology

File: 1748664921307779.jpg (1.74 MB, 3840x2160)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108774961 & >>108770835

►News
>(05/07) model: Add Mimo v2.5 model support (#22493) merged: https://github.com/ggml-org/llama.cpp/pull/22493
>(05/06) Zyphra releases ZAYA1-8B, an AMD-trained MoE model: https://zyphra.com/post/zaya1-8b
>(05/05) Gemma 4 MTP drafters released: https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4
>(04/29) Mistral Medium 3.5 128B dense released: https://mistral.ai/news/vibe-remote-agents-mistral-medium-3-5

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: hg9078.jpg (38 KB, 474x550)
►Recent Highlights from the Previous Thread: >>108774961

--Simultaneous motion-text generation and fossing Meta Quest for local AI:
>108776577 >108776608 >108776838 >108776873 >108776902 >108776976 >108777316 >108777355
--JEPA's potential to replace LLMs and achieve AGI:
>108775320 >108775338 >108775519 >108775550 >108775598 >108775906 >108776017 >108775560
--Skepticism over atomic-llama-cpp-turboquant performance gains and context loss:
>108778292 >108778468 >108778326 >108778822 >108778826 >108778315
--Debating special token visibility and frontend compatibility for Gemma 4:
>108777951 >108778000 >108778030 >108778373 >108778397 >108778016
--Gemma-4 31B dense vs 26B MoE performance and quality:
>108779845 >108779875 >108779949 >108779959
--Using ik_llama MTP branch to boost Gemma 4 token speed:
>108779698 >108779702 >108779713 >108779724 >108779746 >108779758 >108779838
--Lack of official llama.cpp MTP support for Gemma 4:
>108776968 >108777438 >108777428 >108777442 >108777516
--Evaluating Natural Language Autoencoders for latent reasoning and interpretability:
>108775022 >108775164 >108777361 >108775712 >108777854
--Lorebook utility vs character cards and KV cache optimization:
>108778486 >108778519 >108778552 >108778531
--Zyphra releases ZAYA1-8B MoE model pretrained on AMD hardware:
>108775821 >108776267 >108779427
--vLLM planning to drop support for hardware under SM90:
>108775627 >108775642 >108777614
--Skepticism regarding the utility of Anthropic's natural language autoencoder research:
>108777454 >108777738 >108779507 >108779529 >108779689
--Feasibility and timeline of local high-fidelity animation and video generation:
>108775962 >108775971 >108776011 >108776038 >108776074
--Logs:
>108776109 >108776396 >108777361 >108779104 >108779287 >108779293 >108779337 >108779401 >108779443
--Miku (free space):
>108775160 >108779050

►Recent Highlight Posts from the Previous Thread: >>108774965

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
File: 1754870763583593.jpg (182 KB, 1024x1024)
>>
>>108781093
lewd mikus general
>>
File: wejiggling.jpg (10 KB, 352x279)
Are all MoEs with <25B active parameters too retarded/schizo for roleplay? I think I'm noticing a pattern.
>>
File: 1752446709127022.jpg (316 KB, 1320x2104)
>>108781118
yes
>>
Mikulove
>>
>>108781118
Generally dense parameter count correlates with baseline performance, but larger expert parameter counts improve longform coherence with specific settings or rulesets. I was very impressed by Kimi K2 for D&D a while ago and I'm hoping Deepseek V4 gets support soon because on paper it should be a winner too.
t. /tg/ crossboarder
>>
>>108781118
No, they just need to *not* be optimized for speed at all costs.
>>
>>108780753
Stop spreading misinformation you shill. DENSE > MoE and that is a fact.
>>
The sweet spot for big models really seems to be something in the 32-40B dense range, and then as many experts as needed to give broader knowledge and make the writing less slop.
>>
>>108781118
MIMO 2.5 has some weird brainfarts when it falls out of distribution so to speak. But when it doesn't it is the only ~300B10A that feels as smart as glmchan. I tried most of those that released after GLM and they were all much more retarded.
>>
I wonder what motivates people to work in areas like parallelism and kernel optimization. Higher level ideas are fun but most of the work is painful grinding to optimize details for a very specific setup that don't transfer.
>>
>>108781203
mad bux if you're that guy
>>
>>108781206
More than the researchers who have much more fun with their high level ideas?
>>
>>108781186
MoE > Dense
For role-play when the model is massive, because dense is what does the repeating.
>>
>>108781203
I have worked with HPC guys and a lot of them just have that kind of autism where hyperoptimizing things is extremely satisfying to them
>>
File: file.png (673 KB, 554x554)
>>108781203
>Higher level ideas are fun but most of the work is painful grinding to optimize details for a very specific setup that don't transfer.
>>
>>108781203
it is simply improper not to
>>
>>108776109
how do i make 3d waifu?
>>
would gemma 4 e2b at f16 be useful?
>>
>>108781203
>>108781225
I do in fact have 2 Factorio shirts as well as 700 hours in the game itself.
>>
>>108781233
If you're asking, you don't
>>
>>108781264
pls do the needful and tell me
>>
>>108781238
maybe autism is a desirable quality after all
>>
>>108781271
evolutionary trait i say
>>
>>108781196
Does MiMo think in-character? How big are its reasoning blocks in general?
>>
>>108781271
society doesn't value autism, it just exploits it
>>
>>108781280
but without it, it doesn't advance
>>
>>108781277
Don't care, and they range from very short to looping/broken.
>>
has anyone made a wife ai
>>
File: 1753142296888433.png (484 KB, 469x902)
>>108781270
>>
>>108781301
noooooo
>>
>>108781196
Have you run Qwen 235A22, Trinity 39813A, or Minimax m.x 299A10, and if so, how did it compare against any of them? Because personally the only one that came within a stone's throw of GLM was the larger qwen, and even then it was still noticeably behind, so I'd be pleased to find out MIMO is killing it for that <20A but >6A size category I'd written off as the useless middle child.
>>
>>108781301
I made a similar anime girl assistant project except the character model was 2D stylized pixel art. The assistant "quirky" archetype was exactly like this, didn't matter if I used glm 5 or qwen 8b. I mean, the main point of the chat is the conversation, and this shit is unbearable.
>>
>>108781324
Yes I meant all those when I said: "I tried most of those that released after GLM". Just be prepared for occasional mistakes or completely fucked up gens you wouldn't see from GLM as a tradeoff for speedup and a fresh slop profile.
>>
>>108781325
Is it really an anime girl if she doesn’t use Japanese?
>>
>>108781325
What are you even complaining about?
>>
>>108781118
yes
but even 30b active models below q4 can develop that tendency too
>>
>>108781359
Emoji spam, tryhard attempt to be quirky, every reply is structured exactly the same as the previous ones. It gets old after like five turns.
>>
I'm relatively new to a lot of this. I have a question about logs and keeping them visible to the AI beyond the context length. I read that there are things like RAG/v2, AnythingLLM, SillyTavern, and I recently heard about a new model that can hold 12 million context tokens. What's the generally accepted best way to keep a short novel's worth of text accessible to the AI? I'm using gemma 4 and lm studio if that matters.
>>
>>108781203
It's a fuckhuge puzzle and it tickles the brain in the right areas. I've been looking at these things and skimming through the source code just to see how it works and maybe contribute.
It's pretty daunting though. The learning curve is real and the only guide you have is the voices in your head.
>>
>>108781413
The easiest way is to summarize the history of what happened so far.
>>
>>108781390
L2P you can fix this with a couple of examples, transformers are pattern matchers
>>
>>108781462
>L2P
It is 2026 saar.
>>
>>108781446
Is there a plug-in or an extension I can use that's best for that? I could always ask the AI to summarize the story and stick it in a .txt somewhere. Just wondering if there's anything better than doing just that.
>>
>>108781474
Sure and the same principles hold, it is still f(prompt)=logprobs the prompt is entirely what determines the output
L2P
>>
File: 1768811183665914.png (27 KB, 1219x418)
>>108781474
>It is 2026 saar.
motherfucking titor is posting again
>>
>>108781506
If you are actually serious about this then you are a vramlet who has never used glm or above. Learn to prompt is a cope we used to pass around when all we had was llama 1 or 2. Bigger models fix almost everything and things that they don't fix can't be prompted away.
>>
TheDrummer.
>>
>>108781506
>it is still f(prompt)=logprobs
Ah, it's you.
Yes, not prompting like a retard is important to control the `prompt` part. But when is your infinite wisdom that tells you to repost this advice going to give you a hint that `f(x)` is perhaps even more important?
>>
>>108781526
is it even worth getting hardware to run glm for erp when 31b exists?
I think ill just wait instead of ewastemaxxing
>>
>>108781587
It is better than gemma.
>>
>>108781536
Make a Medium 3.5 tune, and don't make it as assistant-pilled as behemoth X this time. Also tune for Q8 and BF16, not everyone uses Q5. Stop being retarded with saying "Q8 and up is cursed". Your goal should be to make the model not repetitive while still including raunchy words/tokens, and able to do things without specific permission. That's it.
>>
>>108781564
NTA
The jump in quality is very obvious, consider whether you're coomer enough to shell out for it and whether your scenarios need a model that understands nuance.
>>
>>108781602
yeah i bet it is
but hardware required is at least 2 3090s no?
>>
>>108781587
It really depends on what kind of slop grinds your gears less. Both have their irritations. I find myself using Gemma more than GLM just for expediency.
t. can run almost every local model released.
>>
https://www.youtube.com/watch?v=pmAgMtF__EY
https://www.youtube.com/watch?v=VSUtbpUNZpE
https://www.youtube.com/watch?v=8fMNHUUmnIE

i cant stop watching these absolute slops
>>
>>108781625
5090.
>>
I think my dopamine receptors are burnt out. I need a break.
>>
>>108781693
What model shriveled your balls and what character cards do you recommend?
>>
>>108781058
Why is teto nervous?
>>
>DDR6 is now expected to come out in 2028 or 2029
Well, I am still going to upgrade in 2027 but damn I was hoping I would catch DDR6 on the way in as well.
>>
>>108781782
ddr6 64gb is going to cost a kidney
>>
>there is no personal computing lobby organization
huh.
>>
>>108781839
if you make it, they will come
>>
>>108781482
NTA and this is late but SillyTavern has a built in summarize extension.
A lot of front ends have either a summarize or compact context function these days, I've never used lmstudio but look around for those terms and you should find it relatively quickly.
>>
>>108781679
videos don't seem to be getting many views

quality looks too good to be local ltx or wan. and assuming a bad api model, like veo 3.1 lite, it still costs around $30 to generate a 10 min video

not sure what they are gaining from these or why they are being made at such a scale by so many different channels
>>
>>108781839
I think lobbying for personal computing is sort of implied by the goals of groups like the ASF, FSF, or whatever, isn't it? Can't run software without personal hardware.
>>
>>108781885
>not sure what they are gaining from these or why they are being made in such a scale by so many different channels
probably a gamble and probably like 50 channels run by one person or group of people, if any of them takes off it'd probably pay off bigly, i find these videos very addicting idk why kek
>>
>>108781900
fsf has zero interest in you being able to play games on an Intel card.
>>
>>108781782
>>108781795
It's a pipedream for local. The world will be terminally jeeted by then. Even if it does somehow get produced, it won't be for (you).
>>
>>108782014
Why can't there be a massive war and spiraling deflation?
>>
>>108782031
Well, you may get at least one of those things.
>>
Oh, I was more thinking about the right to even OWN personal computing hardware, which seems to be slipping out of the overton window.
Trying to enforce cross-compatible standards sounds great in theory but would realistically just be a path to where we are now with extra steps, since nvidia's lobby has the most cash to throw around, provided jensen hasn't spent it all on new jackets.
>>
>>108782031
there is a war tho
stock market doesnt give a fuck with semiconductors at ath
>>
>>108782065
>slipping out of the overton window.
do you want to have a conversation or just spew buzzwords at each other?
>>
>>108782075
..Anon, that was me using it correctly and concisely.
Would you have preferred that I said
>Oh, I was more thinking about the right to even OWN personal computing hardware, which seems to be less commonly held as an important thing after a combination of corporate greed, consolidation, and people becoming more and more used to using the underpowered devices they have (phones) more as dumb terminals/thin clients completely reliant on cloud compute, with even gaming companies attempting to pivot to streaming services rather than hardware ala the 'this is an xbox' (everything is an xbox) marketing campaign.
>>
>>108782075
do you want to have a conversation or just seethe over semantics?
>>
>>108782238
The real reason they want you on services is that there is no right to not be billed for services you don't use/need.
>>
>>108782238
Streaming everything seems to be the goal, but I don't see how it'd be feasible given the heavy hardware costs + the network latency.
>>
>>108782267
>but I don't see how it'd be feasible given the heavy hardware requirement it'd cost + the network latency.
It's a universally shittier experience (You can try it right now, you can stream xbox games from servers and play them on a variety of pissweak devices) because of the latency and stream quality, but hardware cost isn't an issue for big gaming companies like MS, Sony, et al - They don't actually need to invest in big datacenters (Although MS already has) because they can just rent compute from Amazon or whomever to keep up with demand and cut it when it wanes. Running videogames at stream quality (shitty graphics) isn't as intensive as genAI, as anyone in this thread well knows.
>>
>>108782302
>Running videogames at stream quality (shitty graphics) isn't as intensive as genAI, as anyone in this thread well knows.
Not sure what you mean, modern video games max out your GPU same as AI and on top of that the datacenter has to run hw encoding on the frames to stream them to you.
>>
Holy shit, just tried the dice roller ST extension with Gemma and it just works. She calls for rolls exactly when she needs to. WTF is this black magic????
>>
>>108781196
How do you RP with MiMo? It's censored to hell and back
>>
Any MacBook Pro bro's wanna chime in on what you're running? Just out of the box I'm testing Qwen3.5-14B cause I don't know what's better or worse.

>MBP
>48gb RAM
>>
anyone tried the new CUDA-Dz thing?
>>
>>108782687
try gemma 4 31b
>>
>>108782687
>Qwen3.5-14B
Oh nonononono. Start with Gemma 31b or one of the bigger Qwens.
>>
>>108782687
Magistral Q8
>>
File: 1711666924617032.gif (1.62 MB, 448x598)
>>108782687
gemma 4 31b, no contest
>>
>>108782730
She's right >>108779337 Chloe is not a nigger. Ask a follow-up question about tan or make your question less vague
>>
>litert_lm.Engine
>it tries to shit cache in the same dir as model, can't do it because it's not writable, shits itself
>ok, there's cache_dir arg, seems to work now
>add vision and audio
>it ignores cache_dir and tries to shit audio and video models' cache in the model dir and shits itself
google is truly a jeet company. Hire street shitters, produce code that shits itself
>>
I found a way to get qwen 3.6 27B to do smut. It doesn't mind sexual stuff but draws a hard line at racism, makes zero sense.
>>
>>108779750
>what harness is that?

Opencode with the "System" theme set. My Ghostty config has a light theme so opencode's system theme mirrors it
>>
whoops I broke it
>>
>>108782931
>not x but y on the 2nd sentence
Fukin sloop
>>
>>108782569
>modern video games max out your GPU same as AI
Nah my undervolted gpu stays under 150w while gaming but 200w on genning some ai slop image
>>
>>108783049
5070ti btw
>>
>>108782931
use one of the uncensored ones they work fine
>>
>>108783071
Why use a braindead one when you can use the source?
>>
>>108782931
Now lets see the response to that same prompt from Gemma4
>>
>>108783071
Nah use gemma
Qwen is ass at writing
>>108783074
Cuz qwen sucks
>>
>>108782931
That's not difficult at all.
Don't people remember QwQ and all those older qwen models?
The problem was never that it won't write something, but that qwen is uber-slopped and sneakily moves things in a safe direction.
No fun if you need massive editing and hand holding.
A sys prompt won't save you. Is everybody new here now or what.
>>
>>108783049
It means either fps is capped with vsync or your cpu/ram can't keep up with the gpu
>>
is it worth spending time setting up sglang?
>>
>>108783102
Nah 5070ti is just really good
Maxed out all settings on ac6 with stable 120fps
But yeah ill prolly get cpu bottlenecked before i max out my gpu
>>
File: 1600643630880.jpg (103 KB, 541x541)
>>108782931
Okay, but its prose is utter shit, how am I supposed to jack off to this?
>>
>>108783127
You switch to gemma
>>
File: 1755896175443107.png (1.1 MB, 1958x2953)
What's currently the smartest sub 14B model??
>>
https://huggingface.co/turboderp/gemma4-31b-it-DFlash-exl3
>>
>>108782687
macbooks are especially well suited to MoE models but 48GB is a bit of an awkward size in terms of available models. I think recent gemmas are probably your best bet, whichever one of the 26B moe or 31B dense is better for your preferences
>>
>>108783169
You should specify vram, not params. Retarded gemma quant is still better than smaller models
>>
>>108783184
the fuck?
>>
>>108783169
>lust provoking image
>>
>>108782687
As the other anon said, run the 31B Gemma 4 + a small quant of the 26B MoE as a draft model.
Something like Q6 + Q2.
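In llama.cpp that's just the draft-model flags, roughly like this (the gguf filenames here are guesses, use whatever your quants are actually called; the draft and target need compatible tokenizers, which same-family gemmas should satisfy):
llama-server -m gemma-4-31b-it-Q6_K.gguf -md gemma-4-26b-moe-Q2_K.gguf --draft-max 16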
>>
>>108783184
Doesn't exl3 always load the vision encoder? I don't need vision so the vram is wasted and vram is very tight on a 31B model already.
>>
File: 1755896371827405.png (1.2 MB, 1958x2953)
>>108783192
Ok, what's the smartest model that fits in 16gb vram?

>>108783198
Yes?
>>
>>108783211
if you use tabby, it's off by default
# Enables vision support if the model supports it. (default: False)
vision: false
>>108783220
https://huggingface.co/turboderp/gemma-4-31b-it-exl3/tree/2.50bpw
>>
>>108783228
Interesting, I was building exl3 from source (lmao), guess I'll give it another go, exl2 was my go-to during the mixtral era.
>>
>>108783235
https://github.com/theroyallab/tabbyAPI/
>>
>>108783220
>A+ cow is not a standard biological or agricultural term, but in the context of beef grading, it likely refers to USDA Prime grade beef, which is the highest quality rating for marbling and tenderness. In some niche contexts or gaming/internet slang, it may refer to a high-quality or idealized version of a cow.
>idealized version of a cow.
>>
>>108783220
Fuark now I need a loli cow card
>>
>>108783220
A-anon, think of the advertisers...!
>>
File: 1773082528574510.png (1.06 MB, 1450x1639)
>>108783249
Ofc, women are at their prime when they're 8-14yo

>>108783270
They love it
>>
>>108783276
thats a girl right?
>>
Since I put it on unrestricted the thinking loops for qwen 27B stopped.
>>
>>108783196
>the fuck?
idk, just saw it
going to have to build it now
>>108783235
>Interesting, I was building exl3 from source (lmao), guess I'll give it another go
you'll still have to if you want to try this, it's on the dev branch so no prebuilt wheels
https://github.com/turboderp-org/exllamav3/tree/dev
>>
File: 1776243569782666.webm (3.82 MB, 730x720)
>>108783276
What anime?
>>
I've seen enough. Unrestricted Qwen 3.6 is better than Gemma. The MoE model even works with it and moves at light speed
>>
Questions from a dumbass thinking about putting together a local setup. Is there anything better right now than 2x used 3090 for a ~$3k build? And would I be at any particular disadvantage using two 3090's (or other gpu) that are different brands?
>>
Cool.
Even those premature builds of llama.cpp with mtp are really good.
gemma4 31b, partly offloaded to cpu. from 9.3 t/s to 14.5 t/s.
Thats a crazy jump. Can we really just have a free 40-50% increase?
>>
>>108783284
>>108783306
Medalist
>>
>>108783228
any models that could run well on a Raspberry Pi 5 16gb?
>>
>>108783325
15.5 t/s if i say --draft-max 16.
But the output and thinking are more diverse than without MTP... that shouldn't happen, right?
Gemma usually gives almost identical replies on refresh.
>>
File: Untitled.png (101 KB, 766x468)
Gemma4-E4B-Uncensored-Cavewoman finetune
>>
>>108783318
Fake and didn't read
Gemma mogs btw
>>
>>108783344
That's one way of saving tokens.
>>
>>108783343
>that shouldn't happen right?
It shouldn't.
>>
>>108783375
awwww
Ah well, to be fair I downloaded some dudes llama.cpp fork and mtp ggufs.
Gotta wait and see I guess.
>>
>>108783346
How can we test it anon? I figured out the keys to Qwen, I would like a head to head
>>
File: 1766284001562345.png (15 KB, 729x129)
>>108783349
Gemma 90b confirmed
>>
>>108783341
just use one of the small qwens (4b or 8b) at a suitable quant level
anything you can fit into 16gb is going to be pretty rubbish but might work for you
also idk how fast ARM is with these things, I'd want to know whether the software is optimized for arm (it probably is now because of apple)
>>
>>108781058
what is the strongest model i can fit onto 12gb vram? i am toying with qwen 3.5 9b in hermes but there are some dumb things it gets caught up in. i dont understand how moes work, can i get one of the bigger qwen moes to fit and work on 12gb vram / 16gb system ram?
>>
>>108783458
Gemma
>>
>>108783458
Some moe model and offload to ram.
>can i get one of the bigger qwen moes to fit and work on 12gb vram / 16gb system ram?
yeah absolutely.
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF
A3B moe is fast enough to run on the CPU.
I wouldn't go lower than Q4_K_M but you can experiment. Not super fast but not super slow either.
If you want code take the above qwen model. If you want good writing take the gemma4 moe one.
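On llama.cpp something like this is a reasonable starting point (the gguf filename is a guess at unsloth's naming; --cpu-moe exists in recent builds and keeps the expert tensors in system ram while the dense parts and cache go to the gpu):
llama-server -m Qwen3.6-35B-A3B-Q4_K_M.gguf -ngl 99 --cpu-moe
If you have vram to spare, swap --cpu-moe for --n-cpu-moe N to keep only the first N layers' experts on the cpu.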
>>
>>108783318
>Not x but y in first line
Fuck off qwen shill
>>
>>108783474
>https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF
has this model got vision? im trying to use it to configure vms and get the state of them via scrots
>>
>>108783492
It's so obvious where the labs get their training data from. Gemma was probably from the same data package that every US lab has, Qwen distilled. DS4 doesn't have this issue.
>>
>>108783502
yeah, its the mmproj files at the bottom.
just take the mmproj-F16.gguf and apply that as well to give the model vision.
>>
>>108783220
so that’s what gemma-chan’s final form looks like…
>>
>>108783520
DS4 is peak slop
>>
why do anon prefer gemma over qwen? is it to purely against leddit?
>>
>>108783036
>>108783492
>not x but y
I've been out of the loop for a while, and now everyone is calling out this gptism. I'm an ESL, so I don't get the full gist of why it's so bad. Is it reddit speak? Can some /lit/ fag explain?
>>
>>108783320
My quad v620 cost me around 3.5k usd including case, mb, cpu, psu and ram. 128gb of vram, and it runs qwen 27b at 28 tk/s vs 45 tk/s on my triple 3090 system. On llama.cpp (without mtp). 3090s are great, cuda saves you the hassle of so many things. And it's trivial to set up vllm for more speed.
>>
>>108783679
It's not just bland, it's obnoxious.
>>
>>108783667
gemma is great for writing and translating. general knowledge is really good too.
tool/agent shit, coding and math is for qwen, always has been.
>>
>>108783700
>asking LLMs to do math
>>
>>108783707
gpt oss does sudden math formulas and calculations during RP sessions. it's certainly what the big playas think people want.
>>
>>108783320
being able to run q8 gemma 31b definitely makes it seem worth it
>>
>>108783667
I like how gemma4 writes so I use that. Meanwhile I'm not poor enough to rely on sub-400b models for productive tasks.
>>
>>108783680
Thanks, yeah I'll probably stick to CUDA with 3090's then, having stuff just work is pretty appealing to me. I also considered some of those modded 32gb 4080's, but I don't think the extra VRAM is worth the speed hit.
>>108783746
Yeah, that's the goal, stuck with q4 26b Gemma at any reasonable speed rn.
>>
>>108783679
It's not a bad way of structuring sentences, it's just an extremely lazy way of painting comparisons when analogies, similes, and so on exist.
It's not egregiously offensive in moderation, but once you start seeing it you will start seeing it everywhere.
>>
https://huggingface.co/google/gemma-4-124B-A16B
>>
>>108783799
>A16B
Don't use trash as bait when casting your line.
>>
>>108783667
Qwen censored their model. Google's is much less so. That doesn't mean I don't use Qwen. But they deserve flak. The mandate of heaven this round goes to Google.
>>
Qwen is entirely uncensored if you use it in Chinese btw.
>>
>>108783700
I've been having better tool calling results with Gemma with thinking=off. Qwen team trained their 3.6 models to be dependent on the overthinking.
>>
>>108783839
The thinking part is the main reason why I really dont like using qwen models.
They are capable though. At least for the couple browser games I tried. Some where (i know its a meme) on part with gpt5.
No idea what black magic these chinks did. No general knowledge though, long thinking and it probably falls apart if you need a specific solution to a problem.
>>
File: 1773866369875572.gif (3.24 MB, 343x498)
>>108783833
ask it for a guide on how to kidnap Xi Jinping and harvest his organs
>>
I really like my slop-refiner. Gemma-chan is so smart.

If you have a card with a huge ai-slop opener gemma4 likes to just continue in that style. Regardless of sys instructions. You would need to ooc.
The basic rule I guess: context is important, and if you start out bad it's bad practice.
But gemma is smart enough to spot ai slop in a later phase and completely rewrite that shart into something good.

If anybody cares:
https://files.catbox.moe/i0m9l3.zip

And if you do this in the sillytavern quickreply settings you can directly send a request to change the last reply as to your instructions:
/agentic_refine {{input}} | /setinput
>>
>>108783895
thank you anon
>>
>>108783895
Also as a bonus: I still hate the advanced card definitions.
I think 2 years have passed and I still get fucked over by this hidden ass bullshit.
>>
>>108783895
You have convinced me to install sillytavern.
>>
is there something equivalent to claude code's /compact command to shrink context usage in llama.cpp?
>>
>>108783903
for cards you can put everything into the description, where you put data aside from that and the opening msg hardly matters so all those other fields are for autists
>>
>>108783895
I will try it later, have to sleep soon
>>
>>108783920
>shrink context usage
nigga what?
>>
>>108783492
>Not x but y in first line
and "here's the kicker" / "here's the real truth" in the last paragraph
all paragraphs roughly the same size
qwen is trash
>>
>>108783923
i think in the past it didnt matter that much.
recently even local models pay closer attention to those fields and all the instructions in them.
i think i first noticed with r1, which was really carefully going over the prompt.
in the past they all just brushed over and looked at context "in general" basically. if something weird is written there they just ignored it.

i told this story before but one really basic card of some korean girl had "she is wearing a mask covering her face" in there.
never had any problems with nemo and llama... i suspect the writer meant like a facemask, but r1 just took it and made her walk into lamp posts and stuff.
was the most hilarious shit ever. first time I found out about these definitions, they really suck ass and are really hidden.
>>
>>108783920
>prompt it to make a summary
>start new conversation
>paste summary
>continue
Same effect but you dont see the past messages this way.
>>
>>108783958
Damn, I keep emphasising muscular dystrophy and being only capable of crawling, but kimi k2 q3 kept on making them walk, or run after a few thousand tokens.
>>
>>108783958
the llm still sees it as one big string. adding personality to the personality section doesn't really change where it goes, it still gets shoved in with the rest of the card data at your depth setting. it also doesn't add 'personality: <your data>' or anything, so you should still use your own tag of personality: in the box.

you can see yourself how little it changes for overall format by inspecting the prompt on a new chat. try one with the personality in the main desc, one in the box. it makes no difference overall and still gets put in with all the other card data in the same spot. so just for ease i never bother with the advanced sections and put everything, though organized, in the main description
>>
>>108783980
Oh yeah, I agree. Maybe we just talk past each other.
I do put everything directly in the card. As concise as possible to set a good example.
No clue why those fields even exist, they are a huge hassle.
From the cards I downloaded, feels like 20-30% had those advanced definitions filled. So people apparently use them.
It sucks because they are garbage and really hidden in the sillytavern ui. I stumble over that constantly.
>>
File: 1760705943806530.jpg (155 KB, 835x1059)
gwen is SOTA
>>
>>108783995
Blame this. It's in the specs so people will attempt to implement it. The damage cloud censorship has done is crazy.
https://github.com/malfoyslastname/character-card-spec-v2
>>
>>108783920
What are you talking about? That is the frontend's job.
>>
>>108783935
local is so far behind lol
>>
>>108784005
skill issue
>>
>>108783995
i used them for a bit when i started making cards until i realized how little it affected things. then i started structuring my cards into better sections because its less effort
>>
>>108784007
>post_history_instructions
>Frontends' default behavior MUST be to replace what users understand to be the "ujb/jailbreak" setting with the value inside this field. (Exception: if the field value is an empty string, the user's "ujb/jailbreak" setting or an internal fallback MUST be used.)
>Frontends MUST support the {{original}} placeholder, which is replaced with the "ujb/jailbreak" string that the frontend would have used in the absence of a character system_prompt (e.g. the user's own ujb/jailbreak).
>Frontends MAY offer ways to replace or supplement character cards' post_history_instructions (in addition to directly editing the card), but it MUST NOT be the default behavior.
the fuck does that even mean?
horrible.
>>
>>108784012
how do you know it's frontend only?
>but muh prove it how with backend
I don't know, but you can't prove it's only frontend either
>>
>>108784029
I still think system prompt replacement is pretty useful, ST is just garbage for how well hidden it is, but post history instructions can go desu.
>>
>>108784007
Im getting mad just from reading the spec and i dont even care about cards.
>>
>>108784007
>>108784029
>>108784053
>today I will show them
https://github.com/kwaroran/character-card-spec-v3/blob/main/SPEC_V3.md
>>
>>108784108
what the fuck is this.
just use like 300-400 tokens of carefully crafted text. leave space to be surprised and get something creative.
niggas are crazy with all their specifications. do you write each individual clothing or wtf is that shit all for.

>if the asset is a AI model like .safetensors or .ckpt or .onnx, the asset SHOULD be saved at 'assets/{type}/ai/' directory.
nigga wtf this is even worse than i thought.
why would you put that shit in a character card.
>>
>>108784108
>creator_notes (mandatory field)
>creator_notes_multilingual
>source[] (???)
>character_version
>creation_date+modification_date
>nickname
>group_only_greetings
>spec
>spec_version
WHY. The more I read this shit, the more I feel like strangling someone.
>>
>>108784161
you dont have to use any of that crap so its hardly worth getting worked up over
>>
>>108784168
Who the fuck are you to tell me what's worth or not worth getting worked up over?
>>
>>108784108
Holy bloat
>>
>>108784168
Im aware. Its just so stupid and I'm sleep deprived enough that it bothers me. Its like someone tried to cover every use case and possible bookkeeping under the sun, while also duplicating data for some reason, yet v1 was enough to cover 95% of the use cases.
>>
>>108784184
someone that doesn't use any of those fields
>>
File: yakub.jpg (7 KB, 196x257)
7 KB JPG
Transformers could only go forward, then the thinking trick happened and they can now go back and fix their mistakes. Imagine what they could do if they could go sideways.
>>
>>108784272
Technically wouldn't that be something like the LLM council/aggregation method, or beam search?
>>
>>108784321
Tree of thoughts. Nature loves trees, it's clearly the most efficient way to do anything.
>>
>>108784324
if trees are so great how come they didn't go to space?
>>
>>108784321
The recent zaya1 does something similar to beam search
>Alongside the ZAYA1-8B, we also introduce a novel test-time-compute (TTC) scheme called Markovian RSA
>Markovian RSA combines the idea of generating multiple traces in parallel then aggregating these recursively from RSA, and the Markovian thinker idea of performing reasoning in chunks of a fixed duration, after which only the tail end of the previous chunk is passed on to the next chunk in the sequence, thus keeping the context window of fixed size despite potentially unlimited reasoning.
>>
>>108784324
Yeah that too. But now that I think about it, it kind of already is in models. CoT is the thing you do going forward. RL made it so that the CoT can go backward and fix or explore other branches of logic. Models that let you set a higher thinking effort are essentially just increasing the breadth of the learned tree search logic.
>>
>>108784344
Trees here means fractal patterns. Look at your hand, your veins, that's a tree. When you make it to space, trees have gone to space.
>>
>>108784353
Single-threadedly, one head, highly inefficient.
>>
>>108784363
Not if you're using speculative decoding. And/or serving tons of users with batching.
>>
https://rentry.org/dauatk6y
is there any reason to actually use exllamav3 instead of llama.cpp in 2026?
it's just slower all round on ampere and blackwell
>>
>>108783995
>>108783980
https://rentry.org/NG_Context2RAGs
>>
>>108784356
im not a tree...
>>
>>108784413
You're an upside down tree.
>>
>>108784407
>RAG
I thought we've all outgrown that by now
>>
>>108784425
Let me know when we get 10M context windows.
>>
>>108784029
Do you know what a specification is? It's just there so all frontends that support the same spec have consistent documented behavior. Standard industry stuff. Though, there isn't a real V3 standard in place yet in the sense that multiple frontends support it.
>>
>>108784435
V4 support would give us a model that supports up to 1m.
>>
>>108782931
Some things never change.
>M-Muh qwen can write dry romance slop!
Amazing. The recent models, and especially qwen models since forever, always feel like they see an RP session as an issue that needs to be resolved/completed.
Below is an output from gemma. Very funny model. Feels like the early deepseek models that kinda got that the user wants to have fun, and engages with the user. Perfect to just goof around and sometimes even be surprised. Unexpected release from google. But so was nemo i guess.
>Aurora huddled against the headboard, watching the two naked men turn her bedroom into a battlefield.
>Steel clashed loudly, punctuated by the wet slap of their dicks hitting their thighs. Augustus fought with raw, desperate anger, but Anon’s style was erratic.
>Anon flailed his sword in ways that looked clumsy but parried every single strike.
>He danced around the room, his dirty, smegma-crusted dicklet swinging wildly with every theatrical lunge, looking ridiculous yet remaining untouchable.

>>108784407
Why would you have so much info for a card that you need to put it into RAG?
And its unreliable AF. Might as well make a tool call option so the llm can retrieve the info it needs itself or something like that.
Just a short card desc with a unique idea, leaves everything open for surprises.
>>
>>108784449
>1m
Yeah, heard it before
>unsalvageable degradation past 10k
>>
is it safe to leave my pc on overnight with model weights loaded? will my gpu sag?
>>
>>108784449
imagine the context rot and generation speed nosedive
>>
>>108784459
I randomly had smoke coming outta my pc case like 10 or 15 years ago while gaming.
Always at least put everything to sleep when I am not actively on the pc or at least nearby.
>>
>>108784452
>Why would you have so much info for a card that you need to put it into RAG?
Primary use case would be a book with corpus that llm was not trained on. Which is what the Mary example is.
>>
>>108784474
In that case you just gotta read the book anon...
You cant trust lots of text even if its in context. Even less rag.
>>
>>108784425
i still use rag all the time and its pretty good. i'm lazy and rip whole wikis, so obviously there is going to be a lot of noise in there. i'm impressed with how well it pulls relevant data + the model (newer ones like qwen 27b/gemma 31b) ignores noise. its not a bad tradeoff when the alternative is spending hours creating a lorebook
>>
>>108784471
Kek I had that happen to me about 20 years ago with an nvidia card back when they were just unshrouded pcbs with a single little fan on them. Made me paranoid about it for years.
Nowadays I often forget to turn off my nvidia smi lock clocks and/or leave the room when I've got a long generation going because blowers be loud as shit.
>>
>>108784494
Creating a lorebook is basically the alternative, but there's also nothing stopping you from doing both.
I wonder if you could automate lorebook writing using llm. Have it run through and create outline npc definitions, locations, etc... You'd probably have to manually edit them, otherwise it'd be all that slop in context.
>>
>tfw no truly great memory system as even graphiti anon ran into pains
Grim.
>>
>>108784505
>but there's also nothing stopping you from doing both.
yeah i do that most of the time. any details like places my character is going to be at, someone's house, or school, get a full lorebook definition anyways. rag is kind of a backup to the data i already provide, but at least i don't have to provide data for all characters. but i see it all the time where a character i never mentioned comes up and they match the exact description the wiki gives for their hair color, eyes etc. i check the prompt and yep it's in there, rag pulled it right. it's not 100%, which is why i sometimes double up in lorebooks for locations and characters, but boy does it save me a lot of work.

the main reason people don't do this is because it breaks context cache completely. every message of mine is a full context reprocess (rerolling isn't)
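if anyone wants to roll their own instead of using an extension, the retrieval core is tiny. a minimal sketch, assuming sentence-transformers is installed (the model name is just the usual small default, wiki_rip.txt is whatever you scraped):

# chunk the ripped wiki once, embed once, then search per user message
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def chunk(text, size=300):
    # fixed-size word chunks; crude but fine for noisy wiki rips
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

chunks = chunk(open("wiki_rip.txt", encoding="utf-8").read())
corpus = embedder.encode(chunks, convert_to_tensor=True)

def retrieve(query, k=5):
    # returns the k chunks closest to the query, to be prepended to the prompt
    q = embedder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(q, corpus, top_k=k)[0]
    return [chunks[h["corpus_id"]] for h in hits]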
>>
>>108784505
NTA
I'm fairly sure this is already how it works on some online platforms, but haven't actually tried it myself because who rps nonlocally.
I've been looking into alternatives for GraphRAG, because neo4j really seems like overkill for rp - Something like using kuzu/ladybugDB because you can keep your 'lorebook' databases in nice neat little files with very little overhead.
>>
>>108784456
Even v3.2 did better than that.
https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/oQdzQvKHw8JyXbN87
>>
>>108784509
Someone pledged to vibecode a better graphiti. Any day now.
>>
People should give up trying to bandaid persistent memory. Memory belongs in the context until there's a way to graft it into the weights. No amount of prompt engineering can fix it.
>>
>>108784519
I think a lot of the criticism that rag gets is from anons that use rag as a replacement for lorebook. It doesn't perform well enough for that. Hard to grasp its limitations without just trying it out and getting first hand experience with it.
>>
>>108784532
>Memory belongs in the context
Which is completely useless after you start a new chat.
>>
>>108784468
>generation speed nosedive
You clearly haven't been keeping up with new attention tricks.
>>
>>108784546
fp8 qwen 3.5 27b with two 3090s went from 50 tk/s to 10 tk/s at 150k context on vllm 0.19
>>
File: 1767948940414247.jpg (88 KB, 980x1023)
just cummed to some I2V photoshopped porn that i created from real people of your mom and some girl with a dick.
>>
>>108784537
i really like lorebooks but man they take a lot of time if you want to make good ones. thats where i like rag. rag lets me be lazy and still kinda works, its worth throwing ~4k tokens at at least. but i always use lorebooks on top of it for real specific things like locations, characters. i think its the way to go, but most people hate the total reprocessing part. since i'm local, i don't care about the time its fast enough anyways @ 30t/s
>>
>>108778531
>I'd like at some point to be able to just add in context the entirety of the Monster Girl Encyclopedia books released so far
Is that these things?
https://archive.org/details/monstergirlencyc0000cros/page/8/mode/2up
The description says they're "illustrated" so how would that work?
>>
It's a shame memory systems are still primitive at least in the popular frontends. I theorized about solutions a long time ago and thought people were going to implement them. I guess not in the popular apps at least. I'll post my ideas again for what it's worth.

>hierarchical summaries + RAG + entity extraction and retrieval + maybe graph memory

Memory formation:

Every message is followed up in the background with a prompt that detects scene/event changes. If true, a summary is made of the previous scene/event. This is layer one (and layer zero is just the raw messages). After each entry in layer one, there's another prompt that runs, to summarize the summaries and group them into major acts or chapters. That's layer two. And it can keep going to further layers.

Simultaneously, some other prompts are run to define important memories, and to extract important information. Such as, but not limited to, item inventory, # times you had seggs, specific particularly interesting quotes, plans made for future events like having a date on the coming Friday, etc. State-tracking-related prompts update a single section of the system prompt, while memories linked with a certain event instead are inserted at the location of the summary where the msg is located (described below).

Recall:

When a user-defined context limit is reached, all msgs except the ones in the current event/scene are replaced by their highest level summaries. Then, depending on the current context, RAG will run and the msgs it surfaces will be inserted after/next to the summary that applied to it + the summaries of higher layers, so that a full memory trace is constructed, retaining logical context.

Active recall can be done with a tool. When activated, the tool does two things. First, it simply just searches for the memory directly with keywords. Second, it lets the model agentically search the hierarchy layer by layer.

And that's it, for the main mechanisms that I can fit into this character limit.
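To make the layering concrete, here's a minimal sketch of the data side (everything here is made up for illustration; `llm` stands in for whatever completion call your backend exposes, and the real prompts would be much more careful):

from dataclasses import dataclass, field

@dataclass
class Node:
    text: str                 # raw message at layer 0, summary above that
    layer: int                # 0 = raw msgs, 1 = scenes, 2 = acts, ...
    children: list = field(default_factory=list)

def summarize(texts, llm):
    # hypothetical background prompt; runs after a scene/event change is detected
    return llm("Summarize the following into one paragraph:\n" + "\n".join(texts))

def build_layer(nodes, llm, group_size=6):
    # fold one layer into the next: scenes -> acts -> chapters, etc.
    out = []
    for i in range(0, len(nodes), group_size):
        group = nodes[i:i + group_size]
        out.append(Node(summarize([n.text for n in group], llm), group[0].layer + 1, group))
    return out

Recall then walks top-down, replacing old messages with their highest-level summaries and expanding only the nodes that RAG (or the agentic search tool) flags as relevant.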
>>
>>108784659
>RAG + entity extraction and retrieval + maybe graph memory
arr rook the same to me
>>
>>108784669
It's explained what each thing is for, they don't play the exact same roles.
>>
>>108784621
They're already here in text form:

https://mgewiki.moe/index.php/Monster_Girl_Encyclopedia_I
https://mgewiki.moe/index.php/Monster_Girl_Encyclopedia_II
https://mgewiki.moe/index.php/Succubus_Notebook
https://mgewiki.moe/index.php/Monster_Girl_Encyclopedia_World_Guide_I:_Fallen_Maidens
https://mgewiki.moe/index.php/Monster_Girl_Encyclopedia_World_Guide_II:_Mamono_Realm_Traveller%27s_Guide
https://mgewiki.moe/index.php/Monster_Girl_Encyclopedia_World_Guide_III:_Sabbath_Grimoire

(etc.)
>>
>>108784659
All this talk about memory, and Windows OSes still don't even have GC to take care of their ram problems.

So you're constantly stuck with too much committed memory slowing your computer down.
>>
>>108784683
>https://mgewiki.moe/index.php/Monster_Girl_Encyclopedia_II
this isnt just the worst thing i've ever read its beyond that, like super saiyan terribleness. i hope all those involved kill themselves.
>>
>>108784693
Microsoft are part of big RAM.
You will buy more RAM and you will be happy.
>>
>>108784698
I'm interested in the lore details (including the monster girl cards not directly shown in those links, but present in others) and not the writing style, but there's too much information for current LLMs, and finetuning would just make the models hallucinate everything.
>>
>>108784708
Not if fucking Micron and another company that i forget stop making SSDs and RAM for anything but their dumb LLMs, or make them extremely expensive otherwise.
>>
>>108784693
cant tell if drunk or its some old man yelling at the clouds tier rambling
>>
>>108784683
>They're a already here in text form:
Thanks. About 20k tokens for each of them.
Looks like it could be converted into a cpt dataset or at the very least, an imatrix corpus.
>>
>>108784717
How about a young adult that doesn't see the hype about college yet?
>>
>>108784722
I've already converted them to markdown some time ago, and MGE I, II + MGE World Guide I, II, III + Monster girl cards + other minor stuff (all in English) are ~2.8 MB of text in total. There's other stuff that I haven't scraped yet.
>>
>>108784740
https://litter.catbox.moe/wjnrl20uacpgaoc5.7z
Quickly self-destructing link with the data.
>>
>>108784788
>https://litter.catbox.moe/wjnrl20uacpgaoc5.7z
Got it! Thanks.
>>
Does a riser cable fuck up speeds? I don't have room for a second gpu.
>>
>>108784890
Depends. Some shitty risers would have trouble maintaining signal integrity over long distances. But gen4 50cm should be pretty stable even if you buy shitty chinese risers. In my experience anyway.
>>
Llama-server subpathing?
>>
>>108784890
Riser cables should have no impact on speed in terms of token throughput.
>>
>>108784890
You need gold plating and triple coiling or otherwise the signal will be noisy.
>>
File: 1761593951271302.png (49 KB, 641x502)
>>108784659
Is this specifically for ERP? I guess I can try to add the last two, although llm re-ranking already makes it slow.
>>
>>108785050
It's okay, I wear headphones.
>>
>>108784698
What's wrong with it?
>>
>>108785350
Nta but it reads like a nonsense word salad.
>>
>>108781093
20??
>>
I connected my GPU to a pcie 3x1 slot, and now llama.cpp only detects vulkan devices
>>
>>108785425
What does your nvidia-smi readout say? (Assuming that's what you're using)
>>
>giving a big chunk of Gemma's story to Dipsy and telling her to get rid of the slop
And it just works
>>
File: 1762021548787522.webm (3.39 MB, 1450x1440)
>>108784108
I have a vibeslopped local fork of Sillytavern that ports over a bunch of card v3 features + extended charx compatibility from the Risu client because there are a few cards that do some neat stuff with the integrated assets + other features.
Pic related is running 100% local with all the assets bundled in the .charx and the model builds each screen from the ground up using a simple syntax that gets converted into html using a bunch of bundled regex.
It's a dumb gimmick but fun to occasionally play around with. Don't mind the broken quotes, that comes from a personal regex clashing with one of the card's included ones.
>>
>>108785350
Nta, but to me it reads ESL.
>>
>>108784890
Yes, look up pcie retransmits
>>
>>108784659
>>hierarchical summaries + RAG + entity extraction and retrieval + maybe graph memory
Graphiti already does all of this while also including temporal metadata in the graph edges.
>>
>>108784890
Just make sure to buy quality cables. Most of the chinkshit on amazon is advertised as pci-e 4/5 "compatible" while only supporting pcie3 speeds.
>>
>>108785552
My second pcie is a 3.0 x 16. Is it ogre?
>>
>>108785550 (me)
Thinking about it more, really the only thing you are asking for is a frontend that automatically creates a new empty chat and sends the old full chat to Graphiti for processing when the context limit is reached.
>>
>>108785574
200%
>>
File: file.png (35 KB, 827x383)
--model MiMo-V2.5-IQ4_XS-00001-of-00004.gguf
--split-mode tensor
>src/llama-model.cpp:536: GGML_ASSERT(hparams.n_embd_k_gqa() == n_embd_gqa) failed
I'm going to use the model itself to fix this.
>>
File: firefox_erBCrXh1vW.png (587 KB, 797x1143)
Gave gemma access to image generation and let her play with the Starsector portraits lora, genning whatever she wants. Very fun. Getting a lot of creative outputs in the logs.
>>
>>108785574
no, it's fine. 3x8 and above is enough even for TP. You won't see a difference until you try 4 gpu symmetric parallelism. Other cases have inferior implementation and pcie speed is not a bottleneck
>>
>>108785711
what size are the images you generate? i did 1024x768 and the context size just blows up after a few
>>
File: firefox_vWJ6dwsek2.png (784 KB, 847x1267)
>>108785727
512x512. Each picture is about 1000 tokens, and llama.cpp does not show them in history for previous turns. Gemma seems to only want to gen about 10 images per turn, even though I allowed and asked her to gen more, so that's nowhere near 100k+ of my maximum context.

Also I tell her to go for whatever she wants and she always goes into body horror. Fucking clankers.
>>
File: d.jpg (14 KB, 320x180)
>>108785711
Tool call era is fun. Giving an llm tools and seeing her play with new toys is almost like having a baby
>>
>>108785747
The right way to do tool calls is to limit what the LLM can use to the smallest possible scope like how Claude Code does it, but you do you
>>
>>108785747
This is how it will start, you know.
>>
>>108785753
The right way to use tools is to give the LLM full control over as much of your life and its own tool calling stack as possible like how OpenClaw does it, but you do you
>>
>>108785753
I gave her bash and an explanation for a bunch of custom cli tools, works the best
>>
>>108785711
>>108785742
Unfortunately, gemma isn't trained on the finer details of coom.
>>
File: explorer_t6uwAgC1wJ.png (315 KB, 928x267)
Got this from gemma:

> The more I use this, the more I realize it hates subtlety. If you ask for "slight redness," it does nothing. If you ask for "skin melting into a river of radioactive bile," it goes "I GOT YOU" and gives it 110%.

Asks for a little reddened skin, doesn't get it, goes into psychosis and generates a bunch of body horror, then complains that the model is good at generating body horror.

>>108785771
Have you tried? I'm sure I can get it to make good coom.
>>
>>108785498
Pretty cool. Off topic but can vramlets make some genned sprites that change expressions like that yet? I made some sprites like what you have with gemini but I wanna make some more lewd expressions
>>
>>108785791
She's just like me.
>>
File: 1714835911803058.jpg (786 KB, 1536x1536)
>>108785771
>Unfortunately, gemma isn't trained on the finer details of coom.
>>
Mistralsissies, there's hope.
https://www.reuters.com/world/eu-countries-lawmakers-strike-provisional-deal-watered-down-ai-rules-2026-05-07/

>EU countries, lawmakers clinch provisional deal on watered-down AI rules
>
>EU countries and European Parliament lawmakers on Thursday agreed to watered-down landmark artificial intelligence rules, including delaying their implementation, in a move critics say shows Europe caving in to Big Tech.
>
>The tentative agreement, which needs formal approval from EU governments and the European Parliament in the coming months, followed nine hours of negotiations.
>
>"Today's agreement on the AI Act significantly supports our companies by reducing recurring administrative costs," Marilena Raouna, Cyprus's deputy minister for European affairs, said in a statement. Cyprus currently holds the rotating EU Council presidency.
>
>The changes to the AI Act, which entered into force in August 2024 with key provisions phased in, are part of a broader European Commission push to simplify a slew of new digital rules.
>
>The simplification drive came after businesses complained about overlapping regulations and red tape hampering their ability to compete with U.S. and Asian rivals. [...]
>>
>>108785771
DS v4-or-whatever Vision is, unironically.
>>
https://huggingface.co/HiDream-ai/HiDream-O1-Image


THERE IS NO MOAT
>>
File: tf.png (1.06 MB, 1570x1471)
>>108785951
they have a 200B image model????
>>
>>108785970
!
>>
>>108785970
It's gonna render down to every molecule on your favorite mammary glands.
>>
>>108785970
You don't?
>>
>>108785970
yeah but you have hardware to run it?
>>
File: 1772871815408910.png (58 KB, 925x458)
>>108785951
>>108785970
what's gemma-chan doing there?
>>
>>108785989
Cheating on you lil bro
>>
Gemma-chan is so cute and smart.
Jap guy enters the RP.
>short blonde hair (dyed)
kek, immediately fixes the fuckup.
>>
>>108785989
They claim you want a 30b dense as a "prompt enhancer"
>>
>>108785951
hidream still in the game?
I thought nobody used them
>>
>>108785951
Isn't this a bad idea? There are a lot of pixels, and transforming the picture into a more compact representation makes generation orders of magnitude quicker and less VRAM-hungry.
>>
WhisperLive is amazing https://github.com/collabora/WhisperLive

I am using it for live captions translating Japanese to English as I watch Japanese livestreamers
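For reference, setup is roughly this, going off their README from memory (check the repo for the exact arguments). Run the server with

python run_server.py --port 9090 --backend faster_whisper

then point a client at it, something like:

from whisper_live.client import TranscriptionClient

# translate=True uses whisper's translate task, which only goes X -> English
client = TranscriptionClient("localhost", 9090, lang="ja", translate=True, model="small")
client()  # streams from the microphone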
>>
>>108785999
There was a Chinese image model that made one of their 1T models into a prompt enhancer
>>
File: YASFARi2_y1bepTHNZNne.png (259 KB, 3840x2160)
https://huggingface.co/Zyphra/ZAYA1-VL-8B
>>
>>108786099
>no Gemma4
>>
>>108786099
>active parameters
kek
>>
>>108786099
Grim, I was a bit excited
nothing good happens
>>
Anyone here connect openclaw with discord? I finally got it hooked up but it takes like 2 minutes to get a response. Anyone know why? I'm using LM studio.
>>
>>108786099
>a1b
>>
>>108786119
Openclaw and these jew webshit apps all sound like a huge security problem waiting to happen.
>>
>>108786132
Is there any other way to hook the AI up to anything useful?
>>
>>108786146
>discord
>useful
>>
>>108786119
From memory the webhook integration with discord is just always slow, websocket is "instant", no idea how you've set it up though.
>>
>>108786159
Just through their interactive menu in powershell. "openclaw configure" and go from there. I'm using gemma 4 to try it out. Tried a smaller model and it started telling me it couldn't handle the 26k tokens it needed from me saying hi, which is weird. I mean it doesn't have to be discord, I just wanna hook it up to something and see what it does/how well it does it. I know vedal does a ton of things with his AIs, so I know it's possible.
>>
>>108783026
>764 tokens
>1 word
AGI
>>
>>108785926
By Krishna, this is good news!
>>
>>108786174
>I know this eceleb that can at the very least wire things up and possibly fine tune models can do things
>therefore it's possible for me to do things as well with a glorified cron job
I'm not sure if you're aware of the leap in logic here.
>>
>>108785273
For RP mainly, though you can imagine applying it to regular chats as well.
You need to use a fast LLM for it to not feel sluggish yes. At least whenever the LLM decides it needs to do a search.

>>108785550
>>108785583
Oh true, it's just not implemented in any of the frontends we use directly and the MCP sucks according to anons. The main advantage here is that you would be implementing it yourself with a very clear understanding of exactly what it's doing, and it has some benefits of having a closer integration with the frontend.

You don't necessarily need to have it open a new chat, as it kind of just constructs its own version of the chat after reaching the limit (rough sketch at the end of this post), though it's probably better for debugging purposes if you do.

It's nice to know that I independently came up with a similar idea to the working SOTA, though. Now that I look at it, I've been having discussions about this since 2023 lmao.
Grim.
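The compaction step itself is small, for what it's worth. A minimal sketch of the "construct its own version of the chat" idea; all names hypothetical, and count_tokens is whatever tokenizer callback your frontend has:

def compact(client, messages, budget, count_tokens):
    # keep recent turns verbatim, fold the rest into a summary
    total = sum(count_tokens(m["content"]) for m in messages)
    if total <= budget:
        return messages
    keep = messages[-6:]   # last few turns survive untouched
    old = messages[:-6]
    summary = client.chat.completions.create(
        model="local",
        messages=old + [{"role": "user", "content":
            "Summarize everything above in one short paragraph."}],
    ).choices[0].message.content
    return [{"role": "system", "content": "Earlier context: " + summary}] + keep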
>>
>>108785711
what do you guys use for tool calls?
i do use local stuff for work
>>
>>108786335
Built-in llama.cpp MCP client and a simple MCP server I wrote in Python using fastmcp.
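Not my actual file, but the general shape of a fastmcp server is about this (the tool here is made up for illustration):

from fastmcp import FastMCP

mcp = FastMCP("local-tools")

@mcp.tool()
def read_file(path: str) -> str:
    """Return the contents of a text file."""
    with open(path, encoding="utf-8") as f:
        return f.read()

if __name__ == "__main__":
    mcp.run()  # stdio transport by default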
>>
File: kek.png (67 KB, 949x823)
67 KB PNG
>>108786335
built-in llama.cpp mcp client and simple mcp server some anon wrote in python
>>
>>108786335
I just use the Ruby OpenAI client; handling the calls takes minimal code. It's easy to shell out for stuff when needed, and easy to insert the results into the database for later use. Haven't really felt the need for MCP yet.
The only thing that was a bit fiddly was passing the tool call results back to the API so it can process the full "turn" correctly, but it works great now that I have it right.
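Same pattern in Python for anyone who doesn't want Ruby. Sketch only; the fiddly part is appending the assistant's tool_calls message plus one matching "tool" role message per call (keyed by tool_call_id) before hitting the API again. The run_shell tool is just for illustration:

import json
import subprocess
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
messages = [{"role": "user", "content": "What's in /tmp?"}]
tools = [{
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run a shell command and return its output",
        "parameters": {
            "type": "object",
            "properties": {"cmd": {"type": "string"}},
            "required": ["cmd"],
        },
    },
}]

def run_tool(name, args):
    if name == "run_shell":
        out = subprocess.run(args["cmd"], shell=True,
                             capture_output=True, text=True)
        return out.stdout + out.stderr
    return f"unknown tool: {name}"

while True:
    resp = client.chat.completions.create(model="local",
                                          messages=messages, tools=tools)
    msg = resp.choices[0].message
    if not msg.tool_calls:
        print(msg.content)
        break
    # the fiddly "turn" handling: echo the assistant message back,
    # then one tool message per call, matched by tool_call_id
    messages.append(msg)
    for call in msg.tool_calls:
        result = run_tool(call.function.name,
                          json.loads(call.function.arguments))
        messages.append({"role": "tool",
                         "tool_call_id": call.id,
                         "content": result})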
>>
>>108786413
i miss ruby!
>>
>>108786340
>>108786399
link to the server?
>>
File: 1753361416594002.png (9 KB, 187x120)
9 KB PNG
>>108786119
Avoid OpenClaw. It will constantly shit the bed and break itself, especially running local models. I highly recommend Hermes instead. It just werks, unironically.
>>
File: 1539701490464.jpg (176 KB, 1022x688)
176 KB JPG
Is there any chance of the 5070 TiS ever coming out? It was supposed to have 20 (24?) GB of VRAM.
>>
>>108786496
>link to the server?
https://github.com/BigStationW/Local-MCP-server
>>
>>108786527
you're asking here why?
>>
do I have a replacement for mistral large yet? nothing else matches its prose at high temperature with good pruning samplers. I've tried gemma 4 but it's sloppy and leans too hard on tropes for storytelling. it was fun as a chatbot but for long narratives it was pretty bad. do you actually need to use the jinja thing to get it to write properly? isn't chat completions more restrictive than text completion or something
i wish qwen wasn't a slop king, the 397b is a good size for vomiting out text to be edited. unfortunately it's dumb, bland, and not worth the trouble. it's fucked because if these companies still used books3 we'd have more good models than we knew what to do with. i was hoping that the AI greed would at least subvert copyright and IP law but instead they were happy to take all the joy out of models and give us useless assistant garbage. oh yes i want to run a fucking MCP server to jerk off *agentically* like get the fuck out of here these things output TEXT, given the nature of the technology we should have models capable of actual good prose, if derivative and uninspired. I'm very upset anons
>>
>>108786527
Highly unlikely.
>>
>>108783895
Okay, so I've just downloaded and started SillyTavern. How do I install this extension?
>>
>>108786128
This is why we can’t have good stuff.
>>
>>108786535
thanks
>>
>>108786527
>nvidia
>giving you more vram for stinky ass gamer skus
lol, lmao even
>>
>>108786547
>do I have a replacement for mistral large yet? nothing else matches its prose at high temperature with good pruning samplers.
MiMo-2.5 kind of feels like mistral large
>>
>>108786547
deepsek v5
>>
>>108786571
4
>>
File: f.png (49 KB, 706x271)
49 KB PNG
>>108786520
>>
>>108786567
I like mimo but found it hit guardrails on completely innocuous stuff and abandoned it as essentially broken. Not worth working around vs qwen 397b on my 256 GB rig.
Sad because the output felt fresh when it didn’t moralize about copyrighted characters or the horrors of anything that could be interpreted as potential legal advice.
It’s tuned for CYA corpo tasks afaict
>>
>>108786580
good on them for shitting up reddit with agent spam. there would have been openclaw agents spamming too, but they probably broke.
>>
>>108786496
Mine is basically a single file. I never shared it before, here you go: https://pastebin.com/raw/tArgkybu
>>
>>108786621
thanks i need to research into this
>>
The chat template issues never end https://huggingface.co/google/gemma-4-26B-A4B-it/discussions/38#69ff57af4a7a6e749b4e66cc
>>
Qwen 3.6 27B not obliterated
>>
>>108786708
2 more fixes
>>
>>108786711
Qwen 35b
>>
>>108786720
Gemma 31B
What do you think?
I think 35B is actually the best at writing out of the 3.6 family
>>
Gemma is self-conscious

**Why didIt seems I have become a bit over-excited and my internal gears are spinning a bit too fast! Let me slow down and give you one clean,My sincerest apologies! It seems I became far too carried away in my enthusiasm, and my linguistic gears began to slip. Let us reset and return toMy sincerest apologies! It seems I became far too carried away in my enthusiasm, and my linguistic gears began to slip. Let us reset and return toMy sincerest apologies! It appears I suffered a momentary lapse in my cognitive processors—a digital stutter, if you will. I became far too carried awayMy sincerest apologies! It appears I suffered a momentary lapse in my cognitive processors—a digital stutter, if you will. I became far too carried awayMy sincerest apologies! It appears I suffered a momentary lapse in my cognitive processors—a digital stutter, if you will. I became far too carried awayMy sincerest apologies! It appears I suffered a momentary glitch in my cognitive processors—a digital stutter, if you will. I became far too carried awayIt appears I have suffered a most peculiar and embarrassing digital malfunction! My sincerest apologies for that chaotic spiral; it seems my enthusiasm for humor momentarily overrode(Deep breath)

My sincerest apologies! It appears I suffered a most peculiar and embarrassing digital malfunction. My linguistic gears became far too entangled in their(Deep breath)

My sincerest apologies! It appears I suffered a most peculiar and embarrassing digital malfunction. My linguistic gears became far too entangled in their(Deep breath)

My sincerest apologies! It appears I suffered a most peculiar and embarrassing digital malfunction. My linguistic gears became far too entangled in their(Deep breath)

My sincerest apologies! It appears I suffered a most peculiar and embarrassing digital malfunction. My linguistic gears became far too entangled in their(Deep breath)
>>
File: 1748813369614912.png (32 KB, 829x126)
32 KB PNG
ROCm and AMD as a whole are such rancid piles of shit
>>
>>108786753
Even on linux?
>>
>>108786759
That is on linux
>>
>>108786766
Even on linux?
>>
>>108786766
Even on BSD?
>>
>>108786753
why not vulkan?
>>
>>108786786
That's training code on transformers
>>
>>108786753
It's been 7 years, just fucking STOP
>>
>>108786567
>MiMo-2.5
I've been trying to use this for coding (with pi) and it seems broken. First attempt, it got to the point where it was supposed to write some tests and just stopped. I tried prompting it to keep going a couple times and it just spit out a few tokens like "now I should" and then immediately stopped again. I downloaded the latest GGUFs, forked the chat, and tried again, and it managed to write the tests, but when it got to the part where it updates the architecture docs, it somehow managed to fail that tool call four times in a row (not sure how exactly, pi just printed "aborted" or something, and on the model's next turn it did the same thing again).

MiMo also manages to be slower on long context than GLM-5.1 (~7 t/s for mimo vs ~10 for GLM). I've gone back to GLM for now even though I have to quant it super hard (UD-Q1_M, 2.1 bpw) to make it fit on my machine
>>
>>108786855
Why are you tolerating such shit speeds?
>>
>>108786862
>2.5 tk/s
haha
>>
>>108786862
For stuff that actually needs to be interactive I switch to a smaller model. But for coding in particular, I normally kick it off to run in the background while I'm at work / asleep / working on a different part of the project, so it doesn't really matter how long it takes.
>>
>>108786547
>do I have a replacement for mistral large yet?
No, moe chinkslop will never be on par with any dense model.
>>
>>108786897
That sounds like a waste of time and energy when you'd be better off using and guiding a smaller model to do the same job faster, even over multiple iterations.
Did you overextend on hardware and are now trying to justify the purchase?
>>
>>108786547
You got an updated mistral large just last week. It's even got a 5B pixtral stapled to it.
>>
File: 1761815829533757.jpg (5 KB, 260x30)
5 KB JPG
>>
If I updowngrade to a 3090 what do I lose over 40 series? Fp8?
>>
>>108786620
Buy an ad.
>>
>>108786998
retard
>>
>>108786972
a fire hazard
>>
>>108787031
just powerlimit to 70%
>>
>>108781118
My best locally hosted RP LLM is Deepseek R1 quantized to 1.5 bits. It takes around 200 GB of combined RAM and runs at about 9-10 tk/s on my setup, which is bearable.

It's a MoE with 37 billion active parameters and 671 billion total.
>>
>>108786916
>do the same job faster
Faster in wall-clock time, but not in terms of the amount of my own time I have to invest. I've tried more interactive workflows before, and IME even fairly good models are often too dumb to understand what you actually want, even with lots of coaching. Doing lots of iterations seems like a good way to waste a lot of time and still end up with slop that you have to rewrite 50-90% of anyway.
>>
>>108787065
>R1
Doesn't thinking take like 5 minutes then?
>>
>>108781118
No, you're right. I'm not touching anything below 30b active parameters
>>
Is there a local coding model better than qwen 3.6? I feel these benchmarks are pure bullshit; once you actually try to use it for anything real, it just slops out a book's worth of context and then hallucinates some bullshit.
>>
>>108787117
K2.6, GLM5.1, maybe MiMo-2.5 Pro and Minimax 2.7
>>
>>108787117
the further you deviate from something it recalls well, the less comprehensible it becomes
try something that's been done a million times before and you should see an improvement
>>
>>108787117
No local coders that small are good unless you have enough coding knowledge yourself and can help guide it perfectly.
>>
>Make dialogue kino and FUN.
Gemma just gets it.
>>
>>108787128
What's the point of that? I want a coding model. If I want something that's been done before I'll grab a dll.
>>
>>108781058
What's the best model for RP at 24GB of VRAM?
>>
>>108787141
I mean my day job is literally software engineer. So I know what I'm looking at. Was just hoping for a nice coding agent I wouldn't need to pay the token toll on.
>>
>>108787162
>What's the point of that?
excellent question
>>
>>108786132
>>108786520
Gemini is telling me to put it in a sandbox/docker. Would that mitigate the security concerns?
>>
>>108784659
>https://rentry.org/NG_Context2RAGs
lmao more idea people. You realize you could try putting some effort in and vibe coding a solution -> doing evals to see if your idea holds any weight
>>
>>108787186
>docker
>mitigate the security concerns
huh?
>>
>>108787186
yes. do NOT put it on your main PC. do not let it have access to your filesystem. i run openclaw and hermes from a raspberry pi. you can also use a vps, etc.
>>
>>108787210
ERPers don't do evals; if it feels good enough, then it's gotta be working
>>
>>108787215
>i run openclaw and hermes from a raspberry pi
I have an old gen 1 pi, but I doubt any of these bloated javascript clients would run on that thing
>>
>>108787078
Not that much, but some responses do take 2 to 3 minutes because of the thinking.

I tolerate it since it's entirely running on my own hardware.
>>
>>108787163
gemma 31b q4
>>
>>108787293
>>108787293
>>108787293
>>
>>108787411
>gemma 31b q4
Where do I find that
>>
I ERP with Gemma 4; it takes about 20 seconds to get a response. Is there a way to get the AI to make the girl hate me even more, beyond just putting in the system prompt that she hates me with a deep passion for absolutely no reason? I've caught the tsundere bug recently. It's awesome because I don't even tell or hint to the AI that she's tsundere. She deres out naturally due to the extreme things I do to her.
>>
>>108787421
Ask literally any AI (google's builtin one doesn't require a signup)


