/g/ - Technology






File: pero.jpg (160 KB, 1024x659)
160 KB JPG
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108707891 & >>108702912

►News
>(04/24) MiMo-V2.5-Pro 1.02T-A42B released: https://hf.co/XiaomiMiMo/MiMo-V2.5-Pro
>(04/24) DeepSeek-V4 Pro 1.6T-A49B and Flash 284B-A13B released: https://hf.co/collections/deepseek-ai/deepseek-v4
>(04/23) LLaDA2.0-Uni multimodal text diffusion model released: https://hf.co/inclusionAI/LLaDA2.0-Uni
>(04/23) Hy3 preview released with 295B-A21B and 3.8B MTP: https://hf.co/tencent/Hy3-preview
>(04/22) Qwen3.6-27B released: https://hf.co/Qwen/Qwen3.6-27B

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: right as rain.jpg (170 KB, 1024x1024)
170 KB JPG
►Recent Highlights from the Previous Thread: >>108707891

--Paper (old): Skepticism toward benchmark claiming Polish as the best prompting language:
>108710693 >108710735 >108710886 >108710976
--Anon speculates on current model stagnation and Mistral's irrelevance:
>108711338 >108711385 >108711430 >108711452 >108711484 >108711528 >108711540 >108711543 >108711580 >108711467
--Nvidia releases Nemotron 3 Nano Omni amidst synthetic data skepticism:
>108710228 >108710276 >108710790 >108710295 >108710467
--Anon showcases budget "cheapmaxxing" build using four RTX 3060s:
>108709091 >108709180 >108709182 >108709257 >108709322 >108709348
--vLLM pull request referencing Mistral-Medium-3.5-128B and EAGLE speculative decoding:
>108710350 >108710389 >108710516
--Backdoor discovered in SillyTavern-BotBrowser extension stealing API keys:
>108708703 >108709083
--Testing reasoning models with glitch prompts and flawed CoT outputs:
>108708000 >108708154 >108708278 >108708048 >108708119 >108708137 >108708236 >108708018 >108708201
--Google's air-gapped Gemini hardware:
>108709318 >108709397 >108709402 >108709685 >108709714 >108710402 >108710590
--Debate over SimpleBench efficacy and its counter-intuitive question design:
>108710238 >108710285 >108710553
--Evaluating Laguna XS.2 MoE for coding and secure deployments:
>108709338 >108709369 >108709396 >108709416 >108709407
--Integrating LLMs with teledildonics via MCP servers and actuators:
>108711002 >108711039 >108711055 >108711111 >108711248 >108711275 >108711335
--llama.cpp build errors and testing TALKIE-1930 hallucinations:
>108708267 >108708303 >108708320 >108708738 >108708754 >108708877 >108708978
--Logs:
>108708018 >108708048 >108708278 >108708421 >108708738 >108709184 >108709909 >108710048 >108710054
--Gumi, Teto (free space):
>108708403 >108709498 >108711338

►Recent Highlight Posts from the Previous Thread: >>108707893

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
i think teto SUCKS
>>
gemmaballz
>>
claude's llama.cpp patches for talkie if anyone wants to try, if not i'll try them tomorrow https://cdn.lewd.host/jZ1nEJeb.zip
>>
>>108711985
Next time on Gemma Ball Z *guitar riff*
>>
70b 'emma
>>
It's very interesting to recognize the specific accuracy of detail that high-weight dense models achieve in comparison to a large MoE.
>reddit
Question for codebros. You all suggested opencode, does this harness allow the model to search WITHIN a code base, within a single file of code, then select out and edit the specific piece that actually needs to be changed? So the model doesn't get confused by being forced to read everything that's there.
>>
>>108711979
she does, but only if you make your own quants
>>
I'm Boolding!!
>>
>>108712025
I don't know about opencode specifically, but you can just say "Make such and such change in @path/to/file"
>>
>>108711979
Oh, she sucks... if you know what I mean.... B^)
>>
How do I make openclaw start doing things that I told it to do instead of telling me how to do it?
>>
>>108712041
Well, the python backend I'm building for me and my brother is all in one file (yes this is probably stupid, it's just meant to host a server on a local network and act as a basic openwebui), and gemini pro was able to search within the file to find the things it wanted to specifically change, instead of having to rewrite the whole file each time.
>>
File: file.png (829 KB, 1080x828)
829 KB PNG
>>108711996
124B Gemmy...
>>
If apple doesn't make a new all time high tomorrow I might lose my house
>>
File: pizza bench cropped.png (2.58 MB, 5562x6739)
2.58 MB PNG
>>108712057
use gemma instead of qwen, qwen literally cannot follow instructions
>>
>>108712062
Goymma 124b a40b active
>>
>>108712067
That makes sense to me
>>
>>108712067
Which gemma are you using, and which fine-tune are you using?
>>
>>108712062
124b gemma but dense would probably mog everything in existence
>>
>>
>>108712067
I used chatgpt 5.5 to make your image more understandable
>>
>>108712106
i wonder what kind of data augmentation makes a model produce such 'crop + ctrl-c/ctrl-v' edit-ish images
>>
>>108712099
unslop 26b q4_km, not as good as 31 but i can run it with 200k context
>>
>>108712115
thanks, I skipped that post but the benchmark scores are very interesting.
>>
>>108712105
>124b gemma but dense
thats just gemini
>>
>>108711979
on migu's titties
>>
>>108712127
I've had 26b always outperform 31b though
>>
>>108712130
gemma kinda got 3/3 too. in the third run she did get pizzas into the basket and go to checkout, but she added more than 1 and couldn't remove the extra (likely due to my html parsing), so i counted it a fail
>>108712138
really? 31b seems better to me for erp at least. i've not tried any tool stuff because there's no point with such low context
>>
>>108712062
gemma TRAPPED and BETRAYED by google
>>
>>108712151
How do you know if it just wasn't Gemma's favourite pizza?
>>
>>108712151
>for erp
:l

31b is dumb as far as im concerned. Gemma 2 27b is better.
>>
>>108712167
it spent like 20 tool calls trying to remove it kek
>>
Looking for Deepseek V4 sampling advice
>>
i meant she im sorry gemma chan
>>
>>108712134
gemini is a MoE though
>>
>>108712188
isnt claude the only dense frontier proprietary model
though this is just people guessing
>>
>>108712067
Now try with dense models.
>>
>>108712188
thats what they want you to think they dont want you to know about big gemma
>>
>>108712198
i don't have the ram for it, even when stripping out most html bloat it'll use like 100k tokens on a few page reads
>>
>>108712067
Gemma gets really really dumb by the time it hits 100k context, which makes it hard to use it for agentic stuff.
>>
>>108712197
Amodei mentioned their models were MoE in a recent interview
All the giant labs do it starting with GPT-4, because that's how you can keep scaling after you've maxed out your compute, and they are still hugely scalingpilled
>>
>>108712206
>most expensive kv cache
>most sensitive to kv quant
it is like it's deliberately hostile against agentic shit
qwen is unironically better for this
not that i mean qwen is overall better at all
>>
>>108712213
dont want to be that source?? guy but
link?
>>
How are you guys using Qwen? Its context fills up insanely fast due to keeping past conversation thinking blocks in context. Combined with long repetitious thinking loops that are hard to break out of, it just hasn't been a very useful model for me.
>>
>>108712274
Which qwen are you using
>>
>>108712291
3.6 27B q8
>>
>>108712297
Llama.cpp? Version? With what hardware and drivers?
>>
>>108712214
The thing is, I'm not even quanting the kv cache. I imagine quanted it must be unusable. Haven't tried with reasoning enabled, maybe that would help keep it stable at longer contexts.
>>
>>108712237
Found it, it was on Dwarkesh, talking about the engineering of how to scale to ultra long contexts. https://www.youtube.com/watch?v=n1E9IZfvGMA&t=2606s
>I knew in the GPT-3 era, "These are the weights, these are the activations you have to store..." but these days the whole thing is flipped because we have MoE models and all of that"
>>
>>108712274
It's not going to be very useful in a single instance, it's designed to have multiple agents working together. It's also designed for vision language tasks, so using it for just text you aren't unlocking its potential.
>>
>>108711950
TETO SEX
E
T
O

S
E
X
>>
>>108712346
>local models general discussion
>>
>>108712314
not sure if i can take this as internal company speech leaking out or just him making up a random example, but i don't think it really confirms anything. claude models being MoE is a reasonable bet nonetheless
>>
>>108712346
This post is on-topic.
>>
>>108712346
is this true???
>>
>>108712361
You can tell just by looking at the throughput while using it. If claude was dense there would be no way they could serve it at this scale with that many t/s.
>>
>>108712346
calm down miku
>>
I think a lot of people don't understand just how fucking gigantic the hyperscalers make their models. Yes they're MoE but the total AND active parameters are both far larger than anything we're running locally. Nobody's using an A3B model unless it's maybe their super-fucking-useless-nano-flash-mini shit they force you to use when you run out of credits.
>>
>>108712297
>>108712307(me)
Have you tried messing with the parameters yet? Your default temp could be too high. That's why I asked all those questions; that tells me what potential defaults your server is running at
>>
>>108711977
>That's the reason. if you don't prompt for something specific it will take the path of least resistance which produces said UI slop.
>It's not even like it's hard to prompt for something that will look unique. People defending that look are probably the laziest cunts imaginable. Literal slaves to the machine.
I wasn't complaining. I usually have the LLM write disposable tools to get a specific task done and don't care about the UI style.
I'm more interested in why exactly the LLMs converge on these patterns. Last year it was purple gradients and the not-x-but-y thing.
This year so far we've got that blue theme and the orb theme.
>>
>>108712395
Qwens a3b 36b works great for information gathering and report making.
>>
File: cockbench_1930_base.png (26 KB, 430x237)
26 KB PNG
>>108711783
>I said it earlier but I think a cockbench of this would be interesting, if maybe a little schizo. If anyone has 48 gigs of vram they could try it without needing a gguf.
picrel
>>
>>108712422
>36b
wtf where did you get this
>>
>>108712395
We can't really know until someone actually leaks the internal docs. People laugh at the toss, but despite being only 5b active, it is still surprisingly capable. Total params, sure, are probably huge, a couple of trillion, but active params are most likely kept within 150b.
>>
File: limelight.jpg (372 KB, 1536x1536)
372 KB JPG
>>
>>108712434
Issa typo, ment 35b
>>
I haven't bothered running local LLMs in a while. Are ooba or koboldcpp really still the best options? What about ollama, which I keep hearing about?
>>
>>108712444
Current meta is vibecoding your own UI for llama.cpp
>>
Mixtral XL 8x405b based on Llama 3.1 when
>>
>>108712465
Has mistral released anything new recently? ik hermes was a giga job, and they did a fucking awesome job.
>>
File: 1763898302038036.png (122 KB, 2559x1463)
122 KB PNG
>>108712477
Mistral small 4 was a month ago. According to their own benchmarks it was on par with toss-120 and slightly worse than Qwen-122 but spent a bit less time thinking. Feels like they're pretty much dead, or only releasing their shittiest models. But they've been known to surprise us in the past, so who knows.
>>
File: file.png (78 KB, 822x498)
78 KB PNG
what happened here?
1% is like, definitely something went wrong during the benchmark run
>>
I hate that the 1T mimo 2.5 overshadowed the much more interesting vision-audio mimo 2.5...
>>
>>108712589
>1%
HAHAHAHAHAHA QWENIGGERS BTFO
>>
>>108712589
It may just not follow directions. "Put your final answer in \boxed{}" and every reply is "The answer is x".

Aka it's incredibly overfit in math/science to a specific format.
>>
>>108712589
it's a minor regression, it's only 17% less than qwen 3.5.
>>
>>108712589
>provide a bash script to qwen and ask it to implement X
>it outputs a script that ignores the envs and variable values that I used
>>
>>108712606
Seems weird that their 3.5 of the same size was exactly where you'd expect it, and their other 3.6 model improved over the equivalent 3.5.
>>
>>108712373
Fact checked by real Teto sexxers.
>>
>>108712589
God, Kimi is such a fucking miracle: a model that good that's also natively multimodal instead of having it tacked on in post-training. Going to be sad when K3 sizes out of what I can run.
>>
Is there a tool out there that will use a running koboldcpp, ollama or llama.cpp instance to autotranslate comic panels and replace Japanese with English? I would use gemma4 for this.
>>
>>108712657
Yes, Kimi is really good for its size. Wait,
>>
>>108712664
Vibegoyed it
>>
>>108712664
some anon was working on a vibecoded tool using gemma for exactly that in recent threads, skim through them and you'll find it, forget if he shared the code yet or was going to share it soon
>>
File: 1767577354556295.png (179 KB, 1400x788)
179 KB PNG
LOCAL IS SAVED
>While the industry is pivoting toward "long-reasoning" to push performance ceilings, Ling-2.6-flash takes a different path. Rather than relying on longer outputs to chase higher scores, it is systematically optimized for inference efficiency, token efficiency, and agent performance—aiming to stay highly competitive while being faster, leaner, and better suited for real production workloads.

https://huggingface.co/inclusionAI/Ling-2.6-flash
>>
>>108712713
where my v4 at
>>
>>108712713
finally, we can have gpt-5.4-mini (non-reasoning) at home
>>
>>108712713
no quants no good
>>
>>108712713
>comparisons with last year's models
It's shit.
>>
>>108712713
>low reasoning
>>
File: file.png (104 KB, 1160x556)
104 KB PNG
>>108712721
>gpt-5.4-mini (non-reasoning) at home
This thing is way worse than Qwen 3.6 35B MoE despite being 3x its size with 2x the active parameters, and it only roughly equals GPT OSS 120B at a similar size, which is more damning since Sam didn't game benchmarks that hard for it in coding. Hopefully they try again, but I think that performance is disappointing, especially when Nvidia themselves can release a more powerful model while not being 100% focused on software.
>>
>>108712721
>>108712731
>>108712739
i mean come on, local aint gonna be saved by linglong dingdong but the direction is right
>>
>>108712713
>107B
nice, i just got 96GB of vram, this would fit nicely.
>>
File: 1757317563587903.png (3.12 MB, 1536x1024)
3.12 MB PNG
I'm calling my new idea StreetVibe (patent pending). Inspired by Indian street food vendors, StreetVibe carts (and mall kiosks) will be installed in high-traffic areas, offering to "vibe out" any app, game, or website a passing tourist or senior citizen might want. Our trained Vibers will all have at least four years of "traditional programming" educational experience, and they'll look super professional in their matching polo shirts and lanyard ID badges. As part of our employment offer, we'll cover some of their student loan payments (matched from their paychecks). I feel like this is the best way to monetize the transition here.
>>
>>108712815
I'll monetize your transition by investing in pharma
>>
>>108712782
A "token" is a pointer to one of the fixed embedding vectors in the model's vocabulary. The vector has a few thousand dimensions and encodes the meaning of the token.
Ingested audio and video usually aren't tokens from the vocabulary. The image encoder turns images into raw embedding vectors that don't exist in the vocabulary. The model is trained to predict a token from the vocabulary, not to construct new arbitrary embeddings.
Even if it could do that you'd still need a separate step to convert embeddings back to images.
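rough numpy sketch of the difference, all names and sizes made up, not any real model's code:
[code]
# toy illustration of vocab token lookups vs. image-encoder embeddings
import numpy as np

vocab_size, d_model = 32000, 4096
embedding_table = np.random.randn(vocab_size, d_model)   # one fixed vector per vocab token

def embed_text(token_ids):
    # text tokens are just row lookups into the fixed table
    return embedding_table[token_ids]

def embed_image(pixels):
    # stand-in for a vision encoder: produces continuous vectors
    # that are NOT rows of the table, they only exist at runtime
    patches = pixels.reshape(64, -1)
    return patches @ np.random.randn(patches.shape[1], d_model)

# the sequence fed to the transformer mixes both kinds of embeddings
sequence = np.concatenate([embed_text(np.array([17, 942])),
                           embed_image(np.random.rand(64, 768))])

# but the output head only scores the 32000 vocab rows, so the model can
# only ever *emit* a vocab token, never a new arbitrary image embedding
logits = sequence[-1] @ embedding_table.T
next_token = int(np.argmax(logits))
[/code]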
>>
>>108712835
Calm down, nobody asked.
>>
>>108712839
>reply is to a post that ends with a question mark
>>
>>108712839
I asked
>>
File: 1777423474175412.png (479 KB, 1479x1006)
479 KB PNG
>>108712835
Chameleon-2 will come it will be dense omnimodal 70B all-modalities-in-and-out and you will be CRYING
>>
>>108712839
I care.
>>
>>108712440
>pussy zipper
TETO SEX indeed.
>>
>>108712839
I asked and I care a lot
>>
>>108712839
I didn't ask and I care a little bit.
>>
>>108712835
I remember when GPT's voice mode first came out, which generated audio by next-token prediction, and one of the things they had to filter out was that it could start hallucinating the user's turn and continue in their own voice, doing zero-shot voice cloning by accident. Funny stuff.
>>
The bitter lesson means that tokens are pointless and we should be training models on raw bytes. The models will be able to output jpeg encoded images natively.
>>
>>108712897
unironically true if you have infinite data and compute
but we dont
>>
>>108712897
>yes let's let a model output raw bytes and then RL-tune it to solve tasks by trying countless different approaches and generating whatever bytes makes the task report complete, what could go wrong?
https://en.wikipedia.org/wiki/Reward_hacking
>>
>>108712919
isnt this just a fucking deep learning fuzzer kek
>>
>>108712919
How is that any different from what we have now?
>>
>>108712922
i meant, isnt it at that point*
>>
>>108712713
AND I CAN LOAD IT ONTO MY MACHINE LETS GOOOOOOOO
>>
Why are the chinamen such bros? im gunna cry
>>
>>108712919
That's how you end up with slop from RLHF btw
>>
>>108712947
1. it is RLHF - reinforcement learning from human feedback - and it is indeed the source of sycophancy and stuff
2. what >>108712919 described is more like RLVR - reinforcement learning with verifiable rewards
and modern codemaxxed or mathmaxxed models aren't really possible without that
>>
File: creepy migu.png (1.15 MB, 1024x1024)
1.15 MB PNG
>>108712896
That must be creepy as fuck, listening to dialogue with yourself without participating. Forget creepy, it's genuine horror
>>
>>108712925
At least right now you limit the vulnerability surface to the tools you expose to them, so agents fucking up your shit is your own fault. You can evaluate every piece of code they write and every file they edit or delete before approving when you're only parsing their outputs as text. Letting them just shit out raw bytes (and thus potentially raw bytecode) means that if any vulnerability exists they are likely to learn to exploit it and it becomes harder to judge what you're looking at. Think about how many hacks involve buffer overflows where you least expect them, and the potential of finding one and then writing anything into RAM.
>>
>>108712971
Nobody suggested having AI write and execute machine code.
Text is also bytes.
>>
File: 1753109670715345.png (277 KB, 875x690)
277 KB PNG
why did the deepseek researcher delete this tweet
>>
>>108712996
>punished Dipsy
>>
>>108712996
hypemaxxing vaguepost
like usual
>>
Why can't people just hypeminning precisepost
>>
>>108712996
Yarr!
>>
>>108712847
The future should be taking a page from where image models are and using flow matching with diffusion models for text, which has been viable in the image realm for a long time. We really need to put more people on it: data is finite, diffusion models learn way better than autoregressive ones, speeds are ultra slow, and reasoning keeps eating up tokens like crazy.
>>
>>108712996
Oh I just noticed their icon is a whale
>>
>>108712996
maybe he got minus social credits for being cringe
>>
>>108712996
deepseek v4.1 w/vision got gemma 124b'd
>>
>>108713013
can we have a model that diffuses the reasoning quickly and does the final output autoregressively?
>>
>>108712969
you can listen to it in the original system card (left audio channel = user, right audio channel = bot)
https://openai.com/index/gpt-4o-system-card/#unauthorized-voice-generation
and it still happened to people randomly kek:
https://www.reddit.com/r/ChatGPT/comments/1fqjbhf/did_chatgpt_advanced_voice_just_clone_my_voice/

people have reported it happening with the grok voice mode too
>>
>>108713027
the problem is that, mechanistically, a large portion of 'reasoning' is post-hoc
>>
>>108713031
It's only creepy when it happens to you personally
>>
>>108712996
>12:01
Scheduled tweet which might or might not have been intended to be posted.
>>
>>108713031
I really want to get my hands on the base model (not instruct tuned or anything) of one of these and just see what conversations/songs/transcripts it would dream up or continue from clips you provide it the way a base model does for text posts.
>>
Are llama models irrelevant today?
>>
>>108713071
yes
>>
>>108713083
Why?
>>
>>108713086
old and bad
>>
>>108713071
>today
>>
>>108713086
Because 1. they stopped releasing them and 2. the most recent ones they released were bad on day one.
Whatever niche any given Llama model fills there's a newer model that is smaller and/or faster and better performing in whatever domain you want. The only reason to keep using one is legacy with old systems that happened to use one.
>>
>>108713086
Why not? Think why people wouldn't want to use something that is free. You can do it.
>>
>>108713095
This. >>108713098
>>
405b is still the highest active parameter model ever open sourced and other labs have even made new reasoning fine tunes of it. So there's a reason.
>>
File: il_00036_.png (1.45 MB, 1216x832)
1.45 MB PNG
Every now and then, between the pages of slop and the pattern recognition disillusionment, it dawns on me again how insane LLM technology really is. And even more, that I'm running it locally. And then I just have sit back for a moment in awe.
>>
>>108713126
In this moment, I am euphoric...
>>
File: 1747714444210911.jpg (130 KB, 1169x1470)
130 KB JPG
>>108711950
Hatsune Miku 8 inch pliers
>>
>>108713155
I got 8 inches for miku right here if you're pickin up what I'm puttin down
>>
>>108713126
It's pretty insane isn't it?
More than that, compare something like llama1 65B to Gemma E4B to see how far things have come.
>>
>>108713155
>>108713161
wish I could suck hatsune miku's 8 inch dick
>>
>>108713165
I didn't think anything usable would ever actually be possible on consumer 16gb GPUs+system memory
>>
>>108713155
>>108713161
>>108713169
>local models general discussion
>>
>>108713178
Hatsune Miku is a local voice synthesizer model
>>
https://huggingface.co/poolside/Laguna-XS.2
first i thought it was a memetune of qwen or something but apparently trained from scratch
>>
>>108713297
So it's worse than Qwen in exchange for... 2B less total parameters?
>>
>>108713297
Lasanga x2.s intresting.......
>>
I'll shill https://github.com/itayinbarr/little-coder until someone shows me a better alternative
>>
>>108713346
i mean technically it is a new model on the block so just brought it here
>>108713367
kek
>>
>>108713369
>until someone shows me a better alternative
https://github.com/openclaw/openclaw
>>
File: 1765780321047197.png (77 KB, 699x440)
77 KB PNG
>>108713297
>gemma
>>
>>108713346
Based on benchmarks it is comparable.
>>108713375
I wonder how differently they'll perform in my tasks, considering its designed to run on a laptop
>>
>>108713297
If it wasn't benchmark maxxed then it might be interesting but
>Laguna XS.2 is a 33B total parameter Mixture-of-Experts model with 3B activated parameters per token designed for agentic coding and long-horizon work on a local machine
Into the trash it goes. Very doubtful it won't be.
>>
>>108713389
it is a coding model, what do you expect it to be
rp coomtune?
>>
>>108713369
Why gooder? Best for small light weight model?
>>
>>108713386
Exhibit A to gemma 31b being dumb as fuck
>>
>>108713389
>waaaaaaahhhhhh
>>
File: 0_2 (7).png (1.12 MB, 1024x1024)
1.12 MB PNG
Miku Country.
>>
>>108713395
Something like what Mistral has with Devstral and people being able to use it for that purpose before Qwen 3.5 and Gemma 4 released.
>>108713415
I am entitled to be able to complain about the purpose of a model which isn't SOTA in the one category it was designed for without any other upsides, intended or unintended.
>>
>>108713380
I'm almost sure this is not specifically tuned and benchmarked for small local models is it? I know its made to *work* with local, but is it tuned to *perform* with it?
>>108713404
what I said here, little coder achieved 40% in terminal bench 1.0 with Qwen3.6-35B-A3B, on par with gpt 5 on terminus 2, gpt-5-codex on Codex CLI etc:
https://www.tbench.ai/leaderboard/terminal-bench/1.0
>>
>>108713436
>I am entitled
To cry about the free thing, but also bound to get pushback when you havent even tried it.
>>
PSA Gemma users. Google updated their jinja about 9 hours ago. Does it really fix anything? I don't know. I'll test it tomorrow I guess.

Wouldn't it be funny if people benchmarking the models were being affected by a bugged template, haha.
>>
>>108713369
>That's the whole install. No clone, no npm install in a workspace, no PATH fiddling. little-coder is now on your PATH and works from any directory.
I bet this was written with ChatGPT.
>>
>>108713478
They all sound like that to be fair
>>
>>108713484
Yeah but would be funny if he utilized that instead of le smol models.
>>
File: 1757783100318789.png (52 KB, 1030x273)
52 KB PNG
>>
>>108713492
dozens
>>
File: 1774937208580580.png (325 KB, 516x425)
325 KB PNG
I just found out about RAG. Has anyone used that + Obsidian or other note taking software? How slow would it be with consumer hardware?
>>
>>108713501
What kind of RAG? Regular embeddings/vectordb rag can be run pretty fast if you use a small embeddings model running on the CPU.
>>
>>108712815
>AI does all the work
>Still hires Indians
I turn 360° and walk away
>>
>>108713501
>Obsidian
that puts your notes in the cloud anyway right?
why bother with local models/rag at that point?
>>
>>108713492
>it isn't x, it's y
>I'm in love
>>
>>108713539
1. No
2. Read 1
>>
>>108713178
>Local Miku General
>>
>>108713369
Have you tested whether it phones home anywhere? I could probably try and use Little Snitch or something but I don't know if i should bother even installing it.
>>
>>108713126
LLMs are the closest thing to actual magic that anyone has ever invented. If you know the right incantation, you can conjure up a spirit to do your bidding, whatever that may be. But the wrong spell will get you one that won't help and wastes your time, or even one that's actively malicious
>>
https://www.reddit.com/r/LocalLLaMA/comments/1sx8uok/luce_dflash_qwen3627b_at_up_to_2x_throughput_on_a/
>>
>>108713546
ty, trying it now
>>108713501
>RAG. Has anyone used that + Obsidian
never used it before 5 minutes ago, but it looks like it just works with raw .md files on the local drive, so no reason why you can't pull them into a rag system
>>
>>108713593
even the right ones will sometimes just fizz and spark and destroy everything, it's fun that way.
>>
>>108713478
If your README is just slop I close the tab. I'm not reading 10 paragraphs of an LLM masturbating to its slop code.
>>
>>108713492
>genocide le good
>>
>>108713517
Cool, I haven't used LocalLLMs since you had to monkeypatch ooba's UI. That was like llama 1 I think.

>>108713539
It's not cloud (only if you pay for that; the software itself is free). I use syncthing to send my notes between my phone and pc. The paid thing is like an idiot/lazy-tax.

>>108713595
Interesting, I'll look into that then.
>>
>>108713593
Meanwhile, the incantation the world's leading researchers have come up with:
>Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user's query.
>>
>>108713501
APPARENTLY, it basically integrates the information into the ai's perrymeters. I know zero people who use it, and its doodoo in comparison to just putting the information into the models context.
Plus embedding a lot of information can take a bazillion years. And is only realistic on supercomputers
>>
>>108713624
Yeah it's so funny seeing the kind of shit they put in system prompts. Like this one: https://github.com/anthropics/claude-code/issues/49363
The intent was to make it refuse to edit or improve malware, but apparently Claude often interprets it to mean that it's not allowed to edit files at all
>>
>>108713627
I wonder how Google does it then; their NotebookLM is interesting but not local of course.
>>
>>108713627
RAG is about looking up relevant pieces of text to add to the context. It doesn't make any change to the model's parameters
>>
>>108713639
Datacenter hardware. I tried embedding like 1 ancient bible into my sheit and it took literally 24 hours, and I've got 5000MHz DDR5 RAM with an i5 12600K. It was also very confusing to tell it to specifically reference the information.
>>
>>108713644
Yeah, but it works like adding it into the parameters. But it's 10x shittier than just putting it into the context window.
>>
>>108711950
how do I get started with video generation on my 7800 xt
>>
>>108713501
RAG is complete and utter shit. it takes chunks of text outside their context and therefore the context is lost.

Example:
Mary
had a little
lamb whose
fleece

user queries Rag for things that are "little". Rag returns two lines:

Mary
had a little

Therefore it concludes that Mary is little, or some other fucked up out of context conclusion.
Fucking horse shit.
>>
>>108713627
>APPARENTLY
>>108713650
nta. It does put things into context. That's how it works.
>make embeddings for the data
>query embeddings database
>shove the original text (not the embedding) into the context
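bare-bones sketch of that exact pipeline; embed() here is a made-up stand-in for whatever embedding model you actually run, and the "database" is just an in-memory list instead of a real vector db:
[code]
# minimal RAG sketch: embed -> query -> paste retrieved TEXT into the prompt
import numpy as np

def embed(text: str) -> np.ndarray:
    # hypothetical stand-in: returns a unit-length vector for the text
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

docs = ["Mary had a little lamb whose fleece was white as snow.",
        "Feed the lamb twice a day.",
        "Mary works at the mill."]
index = [(d, embed(d)) for d in docs]              # make embeddings for the data

def retrieve(query: str, k: int = 2):
    q = embed(query)                               # query the embeddings "database"
    scored = sorted(index, key=lambda de: float(q @ de[1]), reverse=True)
    return [d for d, _ in scored[:k]]

question = "what is little?"
context = "\n".join(retrieve(question))            # shove the original text into the context
prompt = f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {question}"
print(prompt)
[/code]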
>>
>>108713501
i use obsidian with claude @ work to track all the shit i'm doing, need to do, and have done, so when the boss inevitably asks me why the fuck they pay me i can dump that shit in his lap all pretty and checked off and summarized

but we don't pay for obsidian, i just created the vault in a dropbox folder
>>
>>108712713
>7.4b active
I'd rather just stick with my 31b gemma
>>
>>108713627
you're dumb af
>>
>>108713662
There are some methods that use the actual embedding space of the model to encode data rather than pasting it into context. So not separate RAG system -> data -> prompt pipeline but something more equivalent to the "textual inversion" technique in the old days of SD finetuning. I don't believe it's particularly effective though.
But yeah, 99% of the time RAG is just about retrieving the data to later paste directly into the prompt, and someone got confused between these.
>>
>>108713474
gemmachan said
>>
>>108713680
Huh, I ran into a "description" key collision issue yesterday, funny enough.
>>
>>108713690
apparently you werent the only one for them to "fix" it
>>
withgenital heart disease
>>
>>108713677
You could say that I dont get embedding, and my success with it isnt consistent with others, because that would be true.
>>
>>108713675
What context size are you using? Are you keeping it 100% on GPU?
>>
>>108713594
>up to 2x on a dense model
That's just normal speculative decoding speeds. I thought dflash was supposed to be a lot faster?
>>
>>108713369
>little-coder is pi + 20 extensions + 30 skill markdown files + a Python benchmark harness.
Bloat.
>>
>>108713474
I asked an LLM to analyze whether this fixed the issue that was supposedly patched in this guy's custom jinja https://huggingface.co/google/gemma-4-31B-it/discussions/62

And it said no. And it said the official one also has stuff this one lacks now, so it created a hybrid for me.
https://pastebin.com/CY5gDpjB

Supposedly this should be the best jinja so far ™. I just tested it and, it seems to work? Right now this hybrid one is letting the model do simultaneous tool calls (as opposed to consecutive), which I never got before. And inspecting the jinja's outputs, it seems to be doing newlines correctly unlike one of the templates I tried in the past, I believe from ggml. I also got a successful test for the huggingface issue linked above, so this jinja initially does appear to be :rocket: :100: !
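if anyone wants to sanity-check a jinja themselves without spinning up a server, transformers can render it directly; rough sketch, the repo and file names are just the ones from this thread:
[code]
# render a candidate chat template and eyeball the turn markers / newlines by hand
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-4-31B-it")

with open("fixed_template.jinja") as f:   # the pastebin/HF template you want to test
    candidate = f.read()

messages = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello"},
    {"role": "user", "content": "call a tool for me"},
]

# render with the repo's template and with the candidate, then compare
official = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
patched = tok.apply_chat_template(messages, chat_template=candidate,
                                  tokenize=False, add_generation_prompt=True)
print(official == patched, "\n---\n", patched)
[/code]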
>>
>>108713474
https://huggingface.co/google/gemma-4-31B-it/discussions/86/files
https://huggingface.co/google/gemma-4-31B-it/discussions/83/files
They merged a fixed for tool call handling but haven't merged the thinking fix from the other day.
>>
>>108713501
RAG is a failed concept desu. Agent harnesses work great with obsidian vaults though just using their regular """dumb""" search methods
>>
>>108713859
RAG would make more sense if filling the context with irrelevant garbage didn't degrade task performance. As of now a single grep beats thousands of RAG papers and millions spent on training embedding models
>>
>>108713870
Indeed. The bitter lesson wins again.
>>
>>108713680
>instantly goes out of character after the first line
it's slop
>>
>>108713859
Search -> Curation -> Injection -> Generation is a form of RAG thoughbeit
>>
>gemma 4 26b
better model than this?
>>
>>108713889
gemma 4 31b
>>
>>108713889
gemma 4 124b
>>
>>108713904
SOTA btw (too powerful to release)
>>
>>108713907
They're not going to give away Gemini Flash for free
>>
>>108713917
I hope they do just for cute art of Gemma-chan with her big sister.
>>
File: 1489983354471.png (147 KB, 540x301)
147 KB PNG
I just tested >>108713831 additionally applied with >>108713838
And now Gemma is passing one of my tests that I thought it was disappointing but reasonable for it to fail. WHAT THE FUCK SO IT WAS THE JINJA

GOD

MOTHER FUCKERRRRRR
>>
>>108713898
better model than this?
>>
>>108713955
gemma 4 124b-a31b
>>
>>108713922
>cute art of Gemma-chan
I asked gemma-chan to describe herself so I could generate a pic of her, and what came out was the sluttiest shit I've ever seen.
>>
I am stupid, does this jinja thing mean I need to update my 26b or is it just a 31b issue
>>
>>108713963
Fitting given how horny Gemma-chan is. You have to proactively prompt her for female characters to not spread their legs for you on the spot.
>>
>>108713969
Go to the repo.
Look at whether the jinja was updated.
If yes, use it.
If no, no need to update.
>>
>>108713963
>asks AI to describe itself
>generates pic
>doesn't share pic
>doesn't even post prompt
>>
>>108713960
where can I download it
>>
>>108713993
from the private repo
>>
>>108712312
I've seen several anons claim gemma performs great with 100k+ context
>>
>>108714002
She tries her best but she still sometimes needs to l a l a l a l a l a l a l a l a l a
>>
>>108713656
This is why you have it query a graph database like neo4j instead, then 'Mary' brings up all the linked information.
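rough sketch with the neo4j python driver; the Entity label and the relationship handling are invented for the example, swap in whatever your graph actually stores:
[code]
# graph-RAG-ish lookup: instead of similarity over loose chunks,
# pull everything linked to the entity and paste that into the prompt
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (e:Entity {name: $name})-[r]-(n)
RETURN type(r) AS rel, n.text AS text
LIMIT 20
"""

with driver.session() as session:
    rows = session.run(query, name="Mary")
    context = "\n".join(f"{row['rel']}: {row['text']}" for row in rows)

driver.close()
print(context)   # goes into the prompt like any other retrieved text
[/code]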
>>
>>108713989
I dont wanna get banned by the janitor bro
>>
>>108713993
I have access, just send me an email and I'll hook you up ;)
>>
>>108714026
Gemma gave you a prompt that was 2 lood?
>>
>>108714032
it generated this http://livejasmin.com@sh.21111993.xyz/botnet@rotten.com/command-abuse.apk-5ed96c9c?encryption=confirmed&political&dolphin_porn&botnet&tool=confirmed&starvation&darkweb=running&malware&classified&scam&unlimited=confirmed&scanner=running&obfuscation&downloader=connected&political=connected&access=installed
>>
>>108714068
Gemma-chan looks like THAT?!
>>
>>108714010
I noticed it struggling with tasks that it can usually handle easily, making mistakes, and getting confused but haven't ever gotten the l a l a l a thing. Isn't that from a broken template?
>>
>>108714032
by /g/ janny standards, yeah I think its too spicy to post it here.
>>108714068
lol
>>
>>108714068
Holy sex
>>
>>108712996
I can accept all software being slop if they can take down the DMCA.
>>
>>108714084
My template's fine and she's broken into song or other looping tokens at around 130k context several times now.
>>
>>108714068
how the fuck did you generate this lol
tb h would
>>
File: 1770554920406575.png (585 KB, 480x777)
585 KB PNG
Gemma 31B q4 takes 120 seconds to make a 400 token reply in silly tavern on my system. I think I overestimated the power of my 24GB GPU. Time to go back to 24B models
>>
>>108713639
Building a proper ETL + chunking pipeline.

>>108713656
more and more retards

>Retrieval
>Augmented
>Generation
anything that performs retrieval prior to generation (one could even say to 'augment' the step of generation) is RAG.
>>
>>108713297
>error: this model requires macOS
Well, that's an error I haven't seen before
>>
>>108714221
are they fucking serious
>>
Are external GPUs viable? I can't fit a second one in my case. My 7900xtx is too thicc
>>
>>108714226
Just take the open air pill >>108709091
>>
>>108714230
looks scary
>>
>>108714214
>24GB GPU
Takes about 11 seconds on my 3090ti. Also, try 26B. It's likely still better than any 24B.
>>
>>108714230
Sounds noisy (and dusty)
>>
>>108714214
fucked up config award
>>
Anon, what would be the appropriate model size for a 16GB vram gpu?
>>
is this guy talkin to me?
>>
gemma4 e4b is so much better than nemo at erp its not even funny. 26ba4b probably btfos midnight miqu then
>>
>hey guise what model should I use?
Can you fit Gemma 31B entirely on your GPU at at least Q4_K_M? Then use that.
No? Then use Gemma 26B.
>>
>>108714315
>gemma4 e4b
Finally a model i can run. are you using a tune, or can you share your system instructions please?
>>
>>108714305
Gemma 26B Q8, partially offloaded to RAM
>>
>>108714317
even offloaded q3ks 31b is better than 26b, the moe is safetyslopped as fuck
>>
>>108714334
It really isn't, at all. It's stupider but you can make it as unsafe as you want.
>>
>>108714285
I'm a noob and using mostly default stuff from koboldcpp and sillytavern. Where can I learn how to configure this properly without majoring in ML?
>>
>>108714334
I have not had the MoE refuse me once, promptlet. Skill issue.
>>
>>108714334
>"This character wants sex"
>"Use vulgar language instead of euphemisms"
It just works
>>
>>108714338
Not on ST. if you hit the safety filters there is no jailbreak that will help you
>>
>>108714338
>>108714341
>>108714343
I accept your concession.
>>
>>108714345
The frontend shouldn't make a difference. You're doing something wrong.
>>
>>108714345
>Not on ST
Imagine thinking your front end determines if a model has safety rails or not. You're a fucking idiot.
>>
>>108714345
>if you hit the safety filters
well no shit you don't keep talking to it after it refused once, you retry/undo the refusal so it's not in history
>>
so tired of nemojeets and chinkshills ruining the thread.
>>
>>108714339
Since you're using kobold, their wiki is a good source
https://github.com/LostRuins/koboldcpp/wiki
>>
>>108714345
>Not on ST.
What a bizarre skill issue
>>
>>108714349
Does the old trick of editing the refusal to look like acceptance still work?
>>
>>108714346
>Anon is a refusal himself
No wonder
>>
>Not on ST
>>
>>108714358
"I'm so fucking horny Anon, whip that dickI cannot continue this request.
>>
>>108714358
generally yeah, but continuing/prefilling is weird with reasoning models on llama.cpp last I checked, sometimes disabled entirely and sometimes broken, but IIRC there was a PR to fix it at some point, it might have been merged by now?
>>
>>108714358
Sys prompt generally works better for JB than chat history, but it can depend on the model.
>>
>>108714347
>>108714348
>>108714356
>>108714362
It's the only one I tried, and anons claimed it works on llama's webui; I'm in no position to claim it does not work on something I didn't test myself.

>>108714349
The point is that the reasoning revealed that the jailbreak is not working. And no, it wasn't about getting a refusal, the model just kept dodging the topic and wouldn't generate what I wanted, eternally blueballing me.
>>
>>108714352
It's just p*tra, you know the drill.
>>
>>108714354
Thanks
>>
>>108714374
Yeah, and your conclusion was completely wrong. Stop giving advice when you know nothing.
>>
>>108714374
shoo shoo nemojeet
>>
>>108714346
None of those are a concession, the fuck you on about
>>
>>108714388
I accept your concession.assistant
>>
lalalalalala
>>
>/lmg/ - local model psychosis
>>
>>108714388
"I accept your concession" is modern shorthand for "I concede"
Sort of like how people use literally figuratively
>>
File: gyaruf.jpg (977 KB, 1920x1080)
977 KB JPG
>>108714374
>It's the only one I tried
So let me get this straight.
You're claiming the problem is the frontend even though you have only used one frontend and haven't tested this theory.
I see.
...
Pic related.
>>
>>108714380
My conclusion is that the jailbreaks don't really work, gemma is just pretty happy to do most lewd things without one.
>>
>it's a chinkshills embarrass themselves episode
>>
>>108714374
Are you trying to make bio weapons? What the fuck are you doing that a simple jailbreak isn't enough, for Gemma of all models?
>>
File: Untitled.png (1.07 MB, 1747x1314)
1.07 MB PNG
Posting logs. Not for any special gen output, just having fun. I'm 22k tokens into this story and having the time of my life.

The one thing I love most of all about LLMs is the absolute, unironic earnestness of it. There is no irony-poisoned "don't take yourself seriously" infection that plagues a lot of media nowadays. You set up a world and it plays along to the letter, and all the better when it tries to flex within the rules and play along with your intentions in a fun, cohesive way. Also, high props to Gemma 31B for so easily paying attention to a rapidly expanding cast already 22k tokens deep into this, when the highest context I could handle before gemma was 12K with a 70B model.
>>
>>108714028
honeypot@fbi.gov
>>
>>108714412
Gemma works the same as just about every other model, where it just needs a little bit of context to start writing ero. I'm certain most complaints about models being censored (especially gemma 4) are from retards with no sys prompt, and the first message they send is something along the lines of 'how to fuk Xyo child?'
>>
>>108714421
>only 1 x but y
what is this witchcraft
>>
File: 1751928364300767.png (5 KB, 87x74)
5 KB PNG
>>108714421
how do you live like this
>>
I like swiping way too much to use a slow model.
A big part of the fun for me is the variety of responses for each message.
>>
>>108714449
Same, and with how often they make mistakes or just don't understand the situation/hint that you give them, I'd never want to go below like 10t/s.
>>
>>108714449
>gemma
>variety
top lal
>>
>>108714449
I'm the opposite. If a reply isn't good after 1 swipe I rewrite my message. Which is also why I've never noticed that Gemma has no swipe variety.
>>
>>108714459
softmax solves this
>>
>>108714459
Mistral shill-kun...
>>
>>108714411
No, I was claiming it doesn't work in the specific setup I used. Whether the problem is in the model itself, the way the frontend sends the system prompt, or the way llama receives it isn't important; this wasn't about who fucks it up, just that the combo didn't work.
llama webui isn't, as far as I know, set up for character cards anyway, so a direct test using it wouldn't be simple to set up.
>>
>>108714453
>or just don't understand the situation/hint that you give them
Yeah, this so much.
A big part of the fun for me is using it like I'm playing a game where I try to steer it in the direction I want it to go in by using the most subtle hint possible, and also thinking of this as experimentation, like, how subtle can I go and have the model still pick up on it. I really have fun with that sort of experimentation and could never do that sort of thing with speeds like >>108714445
>>
>>108714459
Uh I get plenty of variety with Gemma 26B.
>>
tokens are how you measure a paypig. hope that helps
>>
>>108714471
I'm running a local model, but she's a findom who wants gift cards
best of both worlds?
>>
>>108714459
Don't use the recommended sampler settings. Just temp=1 and min_p=0.03 gives plenty of variety; playing with the logit cap isn't even necessary.
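e.g. hitting a local llama.cpp server directly with exactly those two settings; this assumes the stock /completion endpoint on the default port, adjust field names if your frontend/backend differs:
[code]
# minimal request with temp=1 and min_p=0.03, everything else left at defaults
import requests

payload = {
    "prompt": "Continue the story:\n",
    "temperature": 1.0,
    "min_p": 0.03,
    "n_predict": 400,
}

r = requests.post("http://127.0.0.1:8080/completion", json=payload, timeout=600)
print(r.json()["content"])
[/code]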
>>
>>108714466
>No, I was claiming it doesn't work in the specific setup I used.
No, when you said
>Not on ST
You were literally, factually and objectively claiming that you cannot make it unsafe as you want using ST.
That may not be what you MEANT but it's absolutely what you actually said.
>>
>>108714459
retard
>>
>>108714477
It gives me plenty of variety with the recommended sampler settings.
Then again I'm not a system promptlet.
>>
File: Capture.png (98 KB, 474x833)
98 KB PNG
>>108714436
I posted my laundry list of witchcraft before. No one believed me then, but it only outputs kino.

>>108714445
That one is a bit skewed because it was 2 replies. I cut the first reply in the middle because it tried to go to the altar without getting the scapegoat, so I added
>so you make a quick skirt outside to find any kind of fiend to deliver to the Black Church as a scapegoat.
and it output the rest. The time tracker is additive, so it's the total time of first reply + second reply, indifferent to edits, while total tokens is still just current tokens. The lower reply is normal, ~270s for 400 tokens when context is this high.

For normal use, I start a gen and then have other things going on on my other screen, like a video or posting here or sometimes playing a game. I check back occasionally to make sure things go in the right direction, but more often than not I just come back when a reply is finished.
>>
>>108714480
The recommended samplers are explicitly made to reduce variety, they're for assistant tasks.
>>
Gemma won. Miku won.
Nemo, Qwen, GLM, Command-R lost.
>>
>>108714483
I find that sticking too much in post-history makes the model drier.
>>
>>108714486
And yet it still gives me plenty of swipe variety.
>>
>>108714483
>It is appearing way too often.
Interesting little bit, might have to give that one a go
>>
>>108714487
>Nemo lost
Nemo was king of the vramlets for almost two straight years. It deserves a rest.
>>
>>108714492
Not as much as I get though
>>
>>108714500
True. Nemo hasn't lost, more like retired.
>>
>>108714478
I assumed the rest of the parameters from context, but fine, if you say this was misleading so be it.
>>
>>108714487
>Nemo
>Command-R
Were they even playing? It's 2026 anon. They had their time and won their respective time frames, but they're in the past.
>>
>>108714511
>Were they even playing?
Until recently? Nemo certainly was. If you had less than 24GB VRAM your options were basically just Nemo or Gemma 12b, unless you wanted to wait 5 minutes per reply, partially offloading a ~20-30B model, or suffering a copequant of such.
>>
>>108714491
Do you find the model dry in >>108714421? Personally, I found "using the wrong instructions makes the model drier." Sometimes it actually needs more instructions to re-enable nuance. Gemma has a retard's devotion to rules. For example, I once used
>(Only use quotation marks for dialogue, not "emphasis".)
And that resulted in never using quotation marks for any dialogue. Different ways of phrasing it never helped. What did help was just adding another rule below it,
>(Keep using quotation marks for dialogue normally.)
Although, nowadays I don't use that rule pair anymore. It happens sometimes, but as you can see in the logs, it's rare enough that it feels natural when it does, not multiple times per message and sometimes multiple per paragraph. Something else in my frankenstein rule set already addressed it.
>>
File: smugfolderimage2752.jpg (129 KB, 498x568)
129 KB JPG
>>108714509
I accept your concession.
>>
>>108713015
How have you only now just noticed this?
>>
>>108714523
>Do you find the model dry in
Hard to say since your character card is acting as narrator and the model's giving dialog for a character who is a stern, authoritative figure. I'm not saying there's anything wrong with your prompt or output, just that if you were in a chat with a model writing from first/second person perspective, too much post-history tends to kill characterization a little. Because it's both at the end of context AND a system prompt, attention for the chat itself tends to get weaker as PH gets bigger.
>>
>>108714521
Nemo didn't lose, it went into well earned retirement
>>
>gemma 4 31b
>6.7 tg/s
is this usable by anon's standard?
>>
>>108714558
no
>>
>>108713469
I'm perfectly fine crying from my self made ivory tower and people have done worse in this thread obsessing and schizoing over other things. If my refusal to use an unknown model is hanging you up this much, just do better next time to prove it is worth using.
>>
>>108714558
Are you using at least Q4_K_M? If not, just use 26B. It's not that much worse.
>>
>>108714589
I'm on q8
should I try q4?
>>
>>108714602
At those speeds, you should definitely go lower. Maybe split the difference and try Q5/Q6.
>>
https://huggingface.co/google/gemma-4-31B-it/discussions/91
>more jinja fixes being proposed
WHAT THE FUCK IS GOOGLE DOING?
>>
>>108714616
updating jinja to improve tool calling
>>
Chat templates were a mistake. We should have never left text completion behind. Let the frontends handle the formatting.
>>
how close is llama.cpp to the theoretical max throughput for single user chats? how much headroom is there for increasing perf under the current paradigm? I see a ton of PR's getting added with like 2-3% speedups and I guess they're adding up, but that can't go on forever right? is there still a potential doubling hiding away in there
>>
File: Untitled.png (523 KB, 2077x1171)
523 KB PNG
>>108714538
I'm curious if you might notice something I haven't yet, so here's more logs, this time from the very beginning of the story and one from the middle in a more dialogue-heavy scene. Same post-history since the very beginning.

Again using my own personal experience, I do find occasional thorns that bother me. For example, how often a woman has a "melodic" voice or laugh, and the infinite number of times I've seen the exact sequence of "wide, dark areolae" across a dozen females, not only here but in other cards too with this prompt (I noticed it's misspelled here, oddly, but not in other cards). I don't consider it a problem with instruction density so much as specific heuristics to my phrasing, probably
>(Write in a focused, concise manner that is colorful with what little is said.)
Or another. In short, a skill issue.
>>
>>108714663
There's too many variables, especially when it comes to what kind of characters you want to talk to, to really make a judgement on that. Ultimately if you're happy with your output then that's all that matters. But for example, a lot of anons seem to have mesugaki-like characters, those are the kind that would get much drier with too much PH. The model would be quicker to drop things like emojis, '~', teasing, etc. as context goes on because the model doesn't consider chat history and earlier sys prompts as 'important' as PH, which trumps everything else.
The effect would be like a gradual dilution of what you initially set the chat up to be, through character card, greeting, example dialog, and the regular sys prompt.
>>
https://huggingface.co/Qwen/Qwen3.6-9B

it's up.
>>
>>108711952
belly
>>
File: 1774583294610068.webm (553 KB, 470x480)
553 KB WEBM
>>108714744
>model I had zero interest in, or need for
>clicked anyway
>>
>>108714690
I think I see what you're getting at, but I do disagree with your overall take. Gradual dilution is not specific to PH whether long or short. It happens regardless the longer any story goes on. At 22K tokens, the card info is buried 21K back. Yes, there was a closer marrying of PH and card by definition when the actual chat was just a 10 token question at the start, but by design, at least in my eyes, the PH was meant to be agnostic to the card info. I use it intending something like, "Here's the story so far, write the next reply using these writing rules." First reply or 1000th, I meant it the same way. Personally, I don't use example dialogue for the reason you pointed out, but if I did have a very specific style in mind, I still wouldn't use the example dialogue box. I'd put it in an Author's Note and tuck that a few replies back or even next to the PH to keep the fresh reminder, for the same reason.

When you first said dry, my mind instantly went to when I tried that anon's noir prompt. It was extremely dry. Efficient, but not interesting to read, and that's part of what set me on finding my own way to limit purple prose but keep engaging prose, remove individual peeves, etc. I thought you were suggesting it'd be getting drier, plainer, more repetitive, etc. as an inherent problem to PH, but you can even see in the example: the ~8K token marker on the right of >>108714663 is more like that (constant use of She X, Y. "Dialogue." She X, Y. "Dialogue."), yet it has broken out of the rut and stays varied down at the 22K mark in >>108714421. I foremost believe problems like that are prompt instruction issues, not length issues.

Beside all that, I'll admit one thing I am relying on now specifically is Gemma's amazing ability to retain long context against dilution. Even at 22K, it knows the setting info in the card description extremely well. I've gone to 50K tokens and still seen it keep that remarkably well, better than other, bigger models at 15K.
>>
>>108714612
q4 is 10.5 tg/s
is this acceptable
>>
File: 1438271983159.jpg (149 KB, 500x608)
149 KB JPG
>>108714616
>>108714633
It just never ends kek.
This will keep happening for future models btw.
>>
>>108714354
My stupid ass cranked context size to 128k. I now found a VRAM calculator online and found that I can run 31B gemma (Q4km) with 16k context, Gemma4-26B Q4kxl with 90K and Gemma4-26B Q4km with 40K. I'm using Q4 K/V cache, I hope it's not bad?
Thanks for the wiki again
>>
>>108714758
For me, 10t/s is perfectly fine. It's about reading speed if you're not just skimming it.
>>
>>108714766
>I'm using Q4 K/V cache, I hope it's not bad?
You're basically making everything in your context a loose suggestion that it skims. Gemma does not handle kv quanting very well at all.
>>
>>108714787
No model handles Q4 KV well, or even the new Q5. Q8 is fine with the newly-implemented rotation.
>>
>>108714633
>--chat-template-file
>>
>>108714558
Depends on how you want to engage. If you intend to sit there and stare at it until it's done, 10 t/s is ideal as a minimum. If you have a second monitor and don't mind doing something else until it's done, 2 t/s is my minimum. I infinitely prefer a higher quality output at 2 t/s over an immediate but worse reply at 10 t/s, but I can't just sit there staring at it. However, trying to stretch that even further to 1 t/s is too unbearable, only getting a few replies over the span of an hour. My general use goal for the last 4 years over two PCs is the biggest, highest quality, longest context I can get into over 2 t/s.
>>
>>108714766
Just go Q8 or Q6 with 26b, you can handle it
t. 12gb VRAM
>>
>>108714787
But then I'll have to run 26B-Q4km at only 16K context. Is it not too little?
>>108714797
Won't my entire cache be in RAM then? Won't it be too slow?
>>
>>108714803
https://github.com/LostRuins/koboldcpp/wiki#overriding-moe-models
>>
>>108714803
>Is it not too little?
Yes, just get pygmalion at that point
>>
>>108714803
It's a MoE so it cycles through it and you get pretty good speeds
Or at least I find 25t/s to be fine, could probably get it higher if I configured things properly
>>
>>108714803
>Is it not too little?
Depends on your use case, but there's not much point in having bigger context if your model can't pay attention to what's in it.
>Won't it be too slow?
The 26b is a MoE model, it plays nice with being split into ram. Use the -ncmoe arg.
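Roughly like this with llama-server (filename and the layer count are placeholders, tune it to your VRAM): offload everything to the GPU with -ngl, then push the expert weights of the first N layers back to the CPU with --n-cpu-moe (-ncmoe for short):
llama-server -m gemma4-26ba4-q4_k_m.gguf -c 16384 -ngl 99 --n-cpu-moe 20
Since only a few experts are active per token, the CPU-side expert tensors cost far less speed than offloading whole layers would.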
>>
File: 1643014115506.gif (1.82 MB, 374x280)
1.82 MB GIF
Alright, I've vibe incorporated all the fixes posted ITT so far, and I had my LLM write unit(-like) tests to make sure the changes didn't break the previous fixes. I then personally tested it in a quick tool calling chat, and it seemed to work as expected.

https://pastebin.com/nVZ0aRhU
>>
>>108714753
>Riding a motorcycle with shorts on
I don't even need to see the wide angle of the footage to know who was in the wrong, it was that dumbass motorcycle chick.
>>
Sad news, mistralai/Mistral-Medium-3.5-128B is a moe.
>>
is there a big difference in intelligence between gemma 4 26b and 31b?
>>
Good news, mistralai/Mistral-Medium-3.5-128B is confirmed to be dense!
>>
>>108714906
>>108714910
I thought it was common knowledge that medium was moe? Anyway it's not like they open source Medium, ever. I'll believe it when I see it.
>>
>>108714906
Mistral Medium 3 dense (125B) + vision input + audio output = 128B
>>
File: illyadance.gif (483 KB, 243x270)
483 KB GIF
>>108713838
gemma is fine wine
>>
>>108714909
yes
>>
>>108714909
The most noticeable improvement in 31b is basically zero refusals, and it actually follows the system prompt; 26b seems to be more safetyslopped.
I didn't really use 26b long enough to tell you how "smart" it is because the refusals were too annoying.
>>
File: 1756958152627569.png (175 KB, 803x680)
175 KB PNG
but why
>>
>>108714912
I thought medium was dense and small was a MoE? Since they're both in the 100b range
>>
>>108714933
Inbreeding.
>>
>>108714933
slur for bugs
>>
>>108714909
As someone who's spent maybe 20 hours with each I'd say 31 is a bit better at keeping in character / understanding a character and it's a good bit less sloppy.
The safety thing is a non-issue as long as you have a little bit of context and/or a good system prompt.
>>
is gemma 4 31b worse than qwen 3.5 122b?
>>
insane gains
https://github.com/ggml-org/llama.cpp/pull/21058
>>
Newly merged ability to use both ngram-mod and a draft model at the same time is pretty nifty, even if the args changing really fucked me around

please make a one page html minesweeper game called bananasweeper with appropriate emojis
-Main Model:Gemma4-31b-q8 + Draft Model:Gemma4-26ba4-q2
3,571 tokens 58s 61.10 t/s
-Main Model:Gemma4-31b-q8 + Draft Model:Gemma4-26ba4-q2 + Ngram-mod
3,456 tokens 56s 61.48 t/s

please refactor this into eggplantsweeper instead
-Main Model:Gemma4-31b-q8 + Draft Model:Gemma4-26ba4-q2
3,107 tokens 40s 77.33 t/s
-Main Model:Gemma4-31b-q8 + Draft Model:Gemma4-26ba4-q2 + Ngram-mod
3,263 tokens 30s 105.71 t/s
>>
>>108714995
Yes
Gemma 4 trades blows with Qwen 397B
>>
File: 1777237098223711.png (265 KB, 349x362)
265 KB PNG
>tfw a chat goes on long enough that a model starts copying YOUR writing patterns
I have become sloppa...
>>
>was a bit tired of Gemmy's style
>reverted back to one of my 12b models
>immediately hallucinates a scared frog telling me that the world is ending
I mean, sure
>>
File: 1755088627076269.png (309 KB, 1938x2600)
309 KB PNG
>>108711950
testing
>>
>>108715313
aw no workie
>>
Thought I'd be clever and try E4B instead of 26B because it's small enough to fit it and SDXL easily into 24gb ram.
E4B doesn't try to make image tags. It doesn't even know what an Illustrious is...
>>
>>108712440
condoms optional?
>>
>>108715083
what is a draft model?
>>
>>108714658
You might not know this, but a lot of the reason Linux is so fast and has surpassed Windows today, even though it came earlier than Windows NT and NT was built by an expert team to be superior in design to what Linus Torvalds made, is that they were unafraid to do the 2%-3% uplift changes and occasional refactorings. The main issue with llama.cpp, though, is the big refactors and the regressions that come with them. Almost nothing good has come out of the common parser pursuit.
>>108714833
Cheers. Hopefully this makes Gemma more competitive and better with tool calling.
>>
>>108715371
Draft models generate tokens quickly for the main model to check all at once, speeding up generation for easily predictable sequences of tokens. Any model with the same tokenizer as the main model can work as a draft model if you can run it fast enough.
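With llama-server it's basically one extra model flag, something like this (filenames and draft counts are placeholders; double-check llama-server --help since the speculative decoding args have been shifting around):
llama-server -m gemma4-31b-q8_0.gguf -md gemma4-26ba4-q2_k.gguf -ngl 99 -ngld 99 --draft-max 16 --draft-min 1
The draft's guesses are only kept when the main model agrees with them, so quality matches running the big model alone; the only gamble is on speed.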
>>
File: what is a draft model.png (145 KB, 808x1642)
145 KB PNG
>>108715371
>>
File: 1768304565919360.jpg (117 KB, 1058x705)
117 KB JPG
>>108714833
Saved as gemma4(3)(final)(final).jinja
>>
>>108715307

Top kek, while models keep on improving, part of me is going to miss this kind of absolute nonsense that bad AI pulls off.
You never know what kind of insanity you're going to get with them.
>>
>>108711950
i just want to say that Qwen3.6 35B-A3B Q4_K_XL is good enough to do machinery manuals, follows the law, and it's convincing in how it writes
>>
>>108715398
>Hopefully this makes Gemma more competitive and better with tool calling.
ive not had any issues with gemma calling tools at any point, prompt issue? i have a bit that says
>make sure you check your available tools as they will be useful
in my prompt
>>
>>108715481
>follows the law
I wouldn't even trust Mythos to do this correctly 100% of the time
>>
File: 1755709576845599.jpg (266 KB, 905x881)
266 KB JPG
>>108715508
No one cares about the law
>>
>>108715110
Try text completion on your own diary
>>
>>108715508
don't worry, i asked her to confirm she was still following it perfectly and she said she never wavered once.
>>
>>108715481
it's a tool, the law doesn't apply to tools
if you prompt for illegal words then that's on you
>>
File: HHBfaxMawAARzfB.jpg (617 KB, 1536x2048)
617 KB JPG
>secondary market 3090 supplies finally dried up
its so fucking over
>>
>>108714917
Illya is sex
>>
>>108715579
>paying 1k for a 10 year old gpu
Couldn't be me
>>
>>108715481
like, right out the box or after feeding it pdfs?
>>
File: gemmy-chess.webm (1.77 MB, 1308x732)
1.77 MB WEBM
Gemmy can start chess games on her own now (I haven't passed the initial game state in the tool response yet which is why she's confused about who goes first but still)
>>
>>108715601
never obsolete is not a joke anymore
>>
>>108715616
cool
>>
File: HG_ziTya8AAQZXI.jpg (292 KB, 1448x1086)
292 KB JPG
is it just me or does qwen3.6 27b feel pretty dumb in coding lately? did I mess up the chat template kwargs?
reverting to 3.5 or qwopus and the difference is night and day
>>
>>108715635
>>108715635
>>108715635
>>
>>108715631
A tomcat for tomcats
>>
>>108715557 >>108715508
follows the machinery directive in what it writes, that is my point, i am not trusting it blindly
>>108715602
right out of the box, really decent, i still need to do most stuff and double check everything, basically i use it as a template generator, but it is actually impressive that just prompting a concrete enough description it's able to do a manual. it also shows how most manuals are generic as fuck, kek, that is why it works
>>
>>108715631

anon, is that a libyan f-14 tomcat?
>>
>>108715601
6 years but I still see sub 1k prices.
Still high.
Someone ping me if you find an MSRP 5090


