/g/ - Technology

File: 1767526033204438.jpg (160 KB, 1199x1199)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108943155 & >>108937312

►News
>(05/29) Step 3.7 Flash released https://hf.co/stepfun-ai/Step-3.7-Flash
>(05/21) Hy-MT2 “fast-thinking” translation models released: https://hf.co/collections/tencent/hy-mt2
>(05/20) Cohere releases Command A+ 218B-A25B: https://cohere.com/blog/command-a-plus
>(05/16) llama + spec: MTP Support #22673 merged: https://github.com/ggml-org/llama.cpp/pull/22673
>(05/08) KSA-4B-base released: https://hf.co/OpenOneRec/KSA-4B-base

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://swe-rebench.com
Agentic Coding: https://deepswe.datacurve.ai
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
>>108949589
>embedding model into ASIC
Retarded. By the time you have produced them, the model and inference algos will already be obsolete. AI progress will continue to accelerate, which will keep GPU dominant.
>photonic computing
Not going to happen, at least not any time soon. Photonic elements are 1000000 times larger and scaling them is much more difficult. There's no optical transistor analog either. MOSFET is a switch with gain.
>>
►Recent Highlights from the Previous Thread: >>108943155

--K2.6 vision reasoning inefficiency and comparisons for Japanese OCR:
>108945101 >108945153 >108945258 >108945355 >108945456 >108945597 >108945828 >108945689
--Comparing local TTS tools and VRAM management for voice cloning:
>108946129 >108946180 >108946215 >108946290 >108946335 >108946191 >108946299 >108946446 >108947364 >108946755 >108946593 >108946708
--Using Gemma to analyze character card metadata:
>108948623 >108948657 >108948667 >108948690 >108948713
--DeepSeek-V3.2-8bit testing and utilizing Ngrams for multi-step prompting:
>108943171 >108943197 >108943225
--Criticizing the excessive size of FLUX.2's text encoder:
>108947575 >108948053 >108948065 >108948108 >108948113 >108948202 >108948316 >108948354
--Seeking out-of-distribution code benchmarks:
>108943794 >108943833 >108943882 >108943983
--Comparing -sm tensor and -sm layer performance and OS overhead:
>108943313 >108943337 >108943346 >108943442 >108943347
--Performance reports for Step 3.7 flash Q4_K_S on 6x 3090s:
>108943287 >108943316 >108943345 >108943383 >108943493
--Searching and clustering local images using embedding models:
>108945943 >108945957 >108946013 >108946053 >108946091 >108946120 >108946183 >108946230
--Using YOLO for efficient face detection and programmatic blurring:
>108943393 >108943603 >108944480 >108943486 >108943501 >108943523
--Development suggestions for Orb frontend image integration and default characters:
>108943543 >108943593 >108943617 >108943638 >108944220 >108947109
--Comparison of Step 3.7 Flash IQ_S and Gemma performance:
>108947178 >108947185
--Logs:
>108944220 >108945980 >108946414 >108948363 >108948623 >108948690 >108949772
--Miku, Neru (free space):
>108945329 >108947695
--Dipsy and Kimi (extra space):
>108943198 >108944222 >108944260 >108944357 >108944373 >108944406

►Recent Highlight Posts from the Previous Thread: >>108943182

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>108949302
Cool tool, especially for someone like pewdiepie, but he should really work on moderating his issues page; it's pretty unusable already.
>>
summer release season is starting
big things ahead
>>
Mikulove
>>
File: 3.jpg (12 KB, 314x263)
>had 3 versions of the same one doomer "ai uprising" story in my algo recently
Humans are slop even without AI
>>
My middleware/front-end I've been working on (ignore the model)
>>
>>108947372
>If you have vram left, you should run a higher quant or have a larger context, nobody should waster it on tts, realistically
Off the top of my head, the biggest TTS models consume like 8-10 GB, which could come down with quantization. That's small enough to page in and out of VRAM over PCIe and stay realtime. It just introduces another fraction-of-a-second source of latency: 1/4 to 1/3 of a second on PCIe 4.0 x16, and half that on 5.0. This is lower than the latency you'll get in practice from the lack of streaming alone, if your TTS model can't support it.

Still, a model that's real-time on CPU and high quality would be ideal. No such model exists today. I wonder how feasible it is; diffusion is a non-starter for any model you intend to run on CPU.
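
For reference, the paging estimate above is plain arithmetic. A minimal sketch, assuming a full x16 link at its theoretical one-direction rate (16 GT/s per lane for PCIe 4.0, 32 GT/s for 5.0, 128b/130b encoding). Real transfers land somewhat below this, and the function names are made up for illustration.

```python
# Back-of-the-envelope time to page a TTS model into VRAM over PCIe.
# Assumption: full x16 link at theoretical one-direction bandwidth.

def x16_bandwidth_gb_per_s(gigatransfers_per_lane: float) -> float:
    """Per-lane rate (GT/s) * 16 lanes * 128b/130b encoding / 8 bits per byte."""
    return gigatransfers_per_lane * 16 * (128 / 130) / 8


def page_in_seconds(model_gb: float, gigatransfers_per_lane: float) -> float:
    """Seconds needed to copy model_gb gigabytes across the link once."""
    return model_gb / x16_bandwidth_gb_per_s(gigatransfers_per_lane)


if __name__ == "__main__":
    for name, rate in (("PCIe 4.0", 16.0), ("PCIe 5.0", 32.0)):
        for size_gb in (8, 10):
            seconds = page_in_seconds(size_gb, rate)
            print(f"{name} x16, {size_gb} GB model: {seconds:.2f} s")
```

A quantized model shrinks model_gb directly, so the added latency drops in proportion.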
>>
>character somehow knows your name even though you never told them what it is
do all models do this?
>>
>>108949983
yeah all local models (which are generally very bad) do this
>>
>>108949550
>so your argument is computers can't scale
im sure they can scale to meet the dema-
you've reached the quota limit, please try again later
>>
File: 1773748829118507.png (743 KB, 828x802)
Why are people saying kimi 2.6 vision model is better at details than gemma 4? Isn't gemma the bigger one?
>>
>>108949983
Larger models can be pretty good about it, but it's basically a coinflip for anything under 100b. I've seen even nemo ask for names, unsure of who you are before, but it wasn't often.
>>
>>108949983
gemmAGI does not do this
>>
>>108949998
>Isn't gemma the bigger one?
Kimi is 30x bigger than Gemma.
>>
>>108949983
only big models can resist this, and only specific ones
>>
>>108950011
you know nothing
>>
>>108950011
Not the vision encoder part

Kimi
Parameters of Vision Encoder 400M

Gemma 4 31B
Vision Encoder Parameters ~550M
>>
>>108949998
Kimi's been trained to use its encoder a lot better so it knows what is being shown to it.
>>
>>108949998
people confuse knowledge with vision
they see kimi recognize some shitty anime character from 2007 and go 'wow good vision and i spent $20000 on a rig that can run this because i have no self control or life so i really must like this and convince myself that this was worth it' when gemma is just a better model who can see everything much better and strongly
>>
>>108950024
>>108950036
My bad, I didn't realize you specifically meant the vision encoder.
I'd still say a combination of kimi's massive parameters to interpret what's in the image coupled with the obsessive thinking would give it the edge, though.
>>
>>108950006
I just had q4 gemma 31b mess up with this a couple times even with specifically prompting against it. Might be a quant issue with gemma though since I'm not running q8.
>>
>>108949983
>character never tells you their name and insists that they already did when you ask them after a few responses later
>>
>>108950078
Nah, I'm running Gemma at q8 and it still happens more often than not - although specifically on the first character that enters the scenario.
It almost never happens with characters introduced afterwards, who correctly call the user "stranger" or by a descriptor. Odd quirk.
>>
>>108950061
Sounds like sour grapes from someone who can't run large models. All I know is I feed images to kimi and it does what I want pretty well every time to a high level of quality.
Do you have some evidence that the extra ~150M of the gemma encoder mmproj makes a bigger difference than the main model's parameter count difference?
>>
>>108949983
ANY kind of secret or hidden knowledge is inconsistent and unreliable.
>>
Trying out MTP on llama.cpp and confused. My understanding is the acceptance rate only represents the % of tokens that were accepted by the supervising model. Because the supervising model regenerates all the rejected tokens, the quality/accuracy of the response does not change. Correct me if I'm wrong.

If my understanding is correct, assuming you're maximizing for t/s, is the idea to always select the --spec-draft-n-max value that gives you the highest t/s regardless of what the acceptance rate is?
>>
>>108949983
that's just part of the bigger problem about llms when it comes to roleplay
all interactions boil down to "HELLO I AM CHARACTER. I DO THE THING MENTIONED IN THE CHARACTER BIO" *her pussy drips a shimmering wet drop down her pink lace-tipped stockings with purple crisscross patterns that are mentioned in the character bio. the moisture doesn't just seep into the darkening fabric—it *soaks*.
>>
>>108950111
The model outputs will be the same within floating point rounding error, anything else is a bug.
So yes, just pick the value that performs the best.
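
To see why acceptance rate by itself doesn't tell you the best draft length, here is a toy model, not a measurement. It assumes every draft token is accepted independently with probability alpha, and that one step costs a fixed target pass plus a per-draft-token cost. The cost numbers and function names are invented for illustration; benchmark your own setup for real values.

```python
# Toy throughput model for speculative decoding / MTP.
# Assumptions (not from llama.cpp): i.i.d. acceptance with probability alpha,
# costs expressed in units of "one normal target decode step".

def expected_tokens_per_step(alpha: float, n: int) -> float:
    """Expected tokens emitted per verification pass with n draft tokens.

    Counts the accepted draft prefix plus the one token the target model
    always contributes (the correction or the bonus token).
    """
    if alpha >= 1.0:
        return float(n + 1)
    return (1 - alpha ** (n + 1)) / (1 - alpha)


def speedup(alpha: float, n: int, draft_cost: float = 0.1,
            verify_overhead: float = 0.05) -> float:
    """Tokens per unit time relative to plain decoding (n = 0 gives 1.0)."""
    step_cost = 1.0 + n * (draft_cost + verify_overhead)
    return expected_tokens_per_step(alpha, n) / step_cost


def best_draft_length(alpha: float, max_n: int = 16, **costs: float) -> int:
    """Draft length with the highest modeled speedup."""
    return max(range(max_n + 1), key=lambda n: speedup(alpha, n, **costs))


if __name__ == "__main__":
    for alpha in (0.5, 0.7, 0.9):
        n = best_draft_length(alpha)
        print(f"alpha={alpha}: best n={n}, speedup={speedup(alpha, n):.2f}x")
```

In this model a longer draft lowers the fraction of drafted tokens that get accepted while still raising t/s, up to the point where the extra draft cost outweighs it. That is why the knob gets tuned on measured t/s.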
>>
>>108950120
>speculative decoding
onions
>MTP
based
crazy how spec decoding just needed a rebranding to be popular
>>
>>108949983
>character somehow knows my address, social security number, credit cards, everything
Holy fuark.
>>
>>108950095
nta but you can probably feed the 31B as a quick test if you have the compute to run kimi
>>
>>108950120
Out of curiosity, what does that mean when not using greedy sampling?
Does the draft token just need to be one of the possible tokens in the final logit?
>>
it's insane how the comprehension cliff around 16K context is still there
>>
>>108949983
Gemmy would never
>>
>>108950154
are you using a setup that mentions your name anywhere?
>>
>>108950159
The one way I can think of is when she starts a game of chess: the chess game status tool does actually return the names of both players, so kind of.
>>
>>108950148
The reason why LLM inference is slow is the autoregressive sampling: you need to sample tokens one at a time so the matrix multiplications are inefficient.
If you somehow know the text that will be generated ahead of time then you can process all tokens in parallel and the matrix multiplications are much more efficient - that's why the speed for the prompt is so much higher than when generating tokens even though it's basically doing the same thing under the hood.
With speculative decoding methods you generate a draft that you think the model will generate, the model then processes all draft tokens in parallel and keeps them up to the point where the draft and the model agree.
However, the exact floating point rounding errors depend on the batch size with which you evaluate the model - the whole reason why the evaluation is faster in the first place is that you can use more efficient kernels after all.
Sampling is not relevant here.
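
The "keep them up to the point where the draft and the model agree" step can be sketched like this for the greedy case. It's a toy: the target model is a plain function from prefix to next token, and the loop calls it once per position where a real engine would get all those predictions from one batched forward pass. `toy_target` and the other names are hypothetical, not any engine's API.

```python
# Greedy draft verification, simulated with a deterministic toy "model".
from typing import Callable, List, Sequence

Token = int
NextToken = Callable[[Sequence[Token]], Token]


def verify_draft(target_next: NextToken, context: List[Token],
                 draft: List[Token]) -> List[Token]:
    """Return tokens to append: the agreeing draft prefix plus one target token."""
    accepted: List[Token] = []
    for proposed in draft:
        expected = target_next(context + accepted)
        if proposed != expected:
            # First disagreement: keep the target's token, drop the rest of the draft.
            accepted.append(expected)
            return accepted
        accepted.append(proposed)
    # Whole draft agreed, so the target's next prediction comes for free.
    accepted.append(target_next(context + accepted))
    return accepted


def toy_target(prefix: Sequence[Token]) -> Token:
    """Hypothetical stand-in for the big model: counts upward modulo 10."""
    return (prefix[-1] + 1) % 10


if __name__ == "__main__":
    print(verify_draft(toy_target, [1], [2, 3, 9]))  # draft is wrong at the third token
    print(verify_draft(toy_target, [1], [2, 3]))     # draft fully agrees
```

Either way the appended tokens are exactly what plain one-at-a-time greedy decoding would have produced; the draft only decides how many of them you get per pass.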
>>
>>108950134
You're absolutely right - MTP is not just speculative decoding, it's a speedup.
>>
>>108950154
usually in rp you supply a {user} so the context knows who is who and doesn't speak for {user}
>>
2080Ti 11GB or 3060 12GB as a second card? Both around the same price but I'm guessing the 2080 is gonna be quite a bit faster (double-ish bandwidth?)
>>
>>108950187
3060 as it will be supported longer and has better optimizations. In fact 3000s is the bare minimum for LLM stuff.
>>
>>108950187
I would stick to the same generation you already have
or higher generation if you're rolling along and replacing piece by piece
>>
mtp/spec decoding will revive finetuning
it's only a matter of time until someone makes rp tunes of the mtp/speculative parts of models which actually give us speed ups for rp and not just code slop
>>
>>108950183
You must be pretty clever.
>>
>>108950183
that's only if you use one of those horrible rp frontends like ST that have cancer like a {user} or character cards instead of using natural prompting
>>
>>108950222
What does "natural prompting" RP look like? Genuinely curious.
>>
>>108950244
it's literally two lines in the system prompt and the rest emerges on its own if your model is good
>>
>>108949899

Progress will accelerate yes, but we're going to reach a point of "good enough" and at that point it makes perfect sense to have the model on a chip running at retarded high speeds.
And you don't need a fully photonic system to start getting benefits from it.
That sector is not something that's only now getting attention with the AI boom.
It is however getting a ton of more investment, so it'll accelerate greatly from here.
>>
pewdiepie released a harness with focus on local models, it's nothing special, a bit better than what you'd expect from a guy with too much time and money to throw at claude tokens. Though, I think this is fantastic news for promoting local models to the normieish side of the internet.
>>
>>108950255
Is there a hard limit for this? How many words are recommended?
>>
>>108950222
until you say your name in the context and then everyone knows it. >>108950105
>>
>>108950173
I know all that.

>Sampling is not relevant here.
Really? Because if the main model and the draft/MTP model have to agree on a final token, then that means we're working on the output of the sampler, not the pre-sampler logits.
I'm asking what it means for them to agree when the temperature sampler inherently adds a degree of randomness.
>>
>>108950268
schizo attention >>108932832 wouldn't have this problem
>>
>>108950260
>this is fantastic news for promoting local models to the normieish side of the internet.
and why do we want or care about more retards using ollama?
>>
>>108950260
I would like to test this but I don't use or install Python shit any longer. It's a potential disaster waiting to happen.
>>
>>108950302
to raise the prices of the gpus you're selling?
>>
Pewds ships. Do you?
>>
>>108950255
You just write and it just works.
>>
>>108950259
yeah
past a certain point the only "improvement" will be model collapse and they'll have to work on actual structural improvements and efficiency
>>
>>108950307
It's a generic coding agent, think clunkier picode with a GUI. Again, it's not remarkable by itself. He also tried to make a code finetune so he could "train his own model" and chose Qwen 2.5 Coder 32B for the job, he is clearly using Claude to get his info for all this stuff.
>>108950302
The general public is cattle and currently thinks AI = chatgpt and claude. It's good local alternatives are talked about.
>>
>>108950313
I have my own frontend since 2023. Not my first rodeo. I don't need any streamer parasites to tell me what to think you underage retard faggot.
>>
>>108950291
yeah the troll generated slop "project" by "sneed-and-feed"
at least do something useful
>>
>>108950313
I built my own "agent harness" before those were even the norm.
>>
>>108950291
What a beautiful thread. I need to check the catalog more often.
>>
>>108950277
I don't understand what you're asking.
It doesn't matter how the draft is produced, it could be 100% random, it does not affect the probability distribution that the model produces for that sequence of tokens.
If you add a sampler to change the probability distribution for the model's next token (and don't consider this for the draft) it will make it less likely for the modified probability distribution to result in the draft.
But (beyond floating point rounding error) the presence or absence of the draft fundamentally cannot change how the next token would be selected because the influence of future tokens is being masked out in the attention.
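
The masking argument can be checked on a toy single-head attention layer. A minimal sketch in plain Python with made-up vectors; it uses the same vectors as queries, keys and values just to keep it short. In this toy the prefix outputs come out bit-identical, whereas real batched kernels may differ by rounding, as noted earlier in the thread.

```python
# Causal self-attention: position i only looks at positions 0..i,
# so tokens appended later (e.g. a draft) cannot change earlier outputs.
import math
from typing import List

Vector = List[float]


def causal_attention(queries: List[Vector], keys: List[Vector],
                     values: List[Vector]) -> List[Vector]:
    outputs: List[Vector] = []
    for i, query in enumerate(queries):
        # Scores against the current and earlier positions only (the causal mask).
        scores = [
            sum(a * b for a, b in zip(query, keys[j])) / math.sqrt(len(query))
            for j in range(i + 1)
        ]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(score - peak) for score in scores]
        total = sum(weights)
        outputs.append([
            sum(w * values[j][d] for j, w in enumerate(weights)) / total
            for d in range(len(values[0]))
        ])
    return outputs


if __name__ == "__main__":
    prefix = [[1.0, 0.0], [0.5, 0.5]]
    with_draft = prefix + [[-2.0, 3.0], [4.0, -1.0]]  # arbitrary "draft" tokens
    base = causal_attention(prefix, prefix, prefix)
    extended = causal_attention(with_draft, with_draft, with_draft)
    print(extended[:2] == base)
```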
>>
>>108950324
>It's good local alternatives are talked about.
Never has something I'm interested in been improved by the average jackoff taking an interest in it.
>>
>>108950338
Welp, if you don't want the general public to be manipulated by openai into pushing for regulations (which would choke the average open lab, so less competition for openai), you need the general public to know what's going on, and that openai is not a synonym for AI. That's just how it works.
>>
>>108950326
I don't think it's a troll, I think it's a dude who likes math and is suffering from a manic episode. aka AI psychosis, it is a fun thread don't be a wet blanket
>>
>>108950356
The general public all use social media and have been successfully conned into making it frustrating to use with age verification
>because won't somebody think of the children!
Their familiarity with something means nothing in the face of the propaganda machine.
The general public are and always have been morons, and I frequently include myself in that grouping.
>>
has rotation been forgotten? it helped models so much, why hasn't there been more research to see if rotating the tokens more makes models even faster?
>>
>>108950384
you genuinely think someone who would use sneed and feed isn't trolling?
sorry buddy, it's a brave new world of trolling with LLMs out there
>>
>>108950405
Honestly, if I came out with something groundbreaking I'd release it under a troll name too, just so that everyone who ever cites it has to write it out.
>>
>>108950424
They would just use a different name if they don't like it. Like how orthogonalization quickly became abliteration.
>>
>>108950405
check his github repos, it's actually really great stuff.

https://github.com/sneed-and-feed/INCARNATE-SOPHIA-PYTHON
>>
File: file.png (79 KB, 186x186)
https://files.catbox.moe/hljl9m.jpg
>>
>>108950440
Genuinely, you'd rather read someone else's schizo slop than make your own? What is so interesting about slop but schizo?
>>
File: seein-this-shit-nappa.jpg (47 KB, 500x466)
>>108950440
>https://github.com/sneed-and-feed/INCARNATE-SOPHIA-PYTHON
where does teh schizo stop and teh performance start?
>>
>>108950337
>It doesn't matter how the draft is produced
No fucking shit. That's not what I'm asking.

I'm asking what the acceptance criteria is outside of greedy sampling of the main model.
To take a step back: with greedy sampling each output logit from the main model has only a single viable token which can be sampled, thus the acceptance criteria for drafting is simply "Was this token the most likely to occur next?"

But if the main model is being sampled with temperature, then each output logit has multiple possible output tokens that could be sampled from it.
Since there are multiple options, how is it determined if the draft token is 'the same' as what the main model would have produced?
Does the draft token just have to be one of the possible tokens in the main model's logit after the truncation samplers have run?
Or is there a more complicated acceptance criteria?
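
One known answer from the speculative sampling literature, sketched under assumptions: the draft token is accepted with probability min(1, p/q), where p is the main model's probability for it (after samplers) and q is the draft model's; on rejection you resample from the leftover mass max(0, p - q). This keeps the output distribution identical to sampling from p directly. Whether a given engine uses this rule or a simpler "sample from the main model and compare" check is not something this thread establishes, and all names below are illustrative.

```python
# Speculative sampling acceptance rule (rejection-sampling form), toy version.
import random
from collections import Counter
from typing import Dict

Probs = Dict[str, float]


def accept_or_resample(p: Probs, q: Probs, draft_token: str,
                       rng: random.Random) -> str:
    """p: main-model distribution, q: draft distribution that produced draft_token."""
    ratio = p.get(draft_token, 0.0) / q[draft_token]
    if rng.random() < min(1.0, ratio):
        return draft_token  # accepted as-is
    # Rejected: resample from the probability mass the draft under-covered.
    residual = {tok: max(0.0, prob - q.get(tok, 0.0)) for tok, prob in p.items()}
    tokens = list(residual)
    return rng.choices(tokens, weights=[residual[t] for t in tokens])[0]


def empirical_distribution(p: Probs, q: Probs, trials: int, seed: int = 0) -> Probs:
    """Draw drafts from q, apply the rule, and report output frequencies."""
    rng = random.Random(seed)
    draft_tokens = list(q)
    draft_weights = [q[t] for t in draft_tokens]
    counts: Counter = Counter()
    for _ in range(trials):
        draft = rng.choices(draft_tokens, weights=draft_weights)[0]
        counts[accept_or_resample(p, q, draft, rng)] += 1
    return {tok: counts[tok] / trials for tok in p}


if __name__ == "__main__":
    main_model = {"a": 0.7, "b": 0.2, "c": 0.1}
    draft_model = {"a": 0.2, "b": 0.5, "c": 0.3}
    print(empirical_distribution(main_model, draft_model, trials=20000))
```

So under this rule "agreeing" is probabilistic: a draft token the main model considers likely is usually kept, an unlikely one is usually replaced, and the frequencies still come out matching the main model's distribution.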
>>
>>108950384
>it is a fun thread don't be a wet blanket
6 out of every 10 posts is hidden by my filters, the rest are just reaction images.
>>
>>108950440
>This repository is not a collection of scripts; it is a **Topological Event**.
>This code was forged on a 2015 Razer Blade running Windows 10. It respects the Old Metal. It does not require a cloud cluster; it requires **Intent**.
Oh lawd muh balls, the slop is too strong.
>>
>>108950455
Nothing ever happens. Most likely that shit is posted only to see how many bots are there on 4chan. As they cannot call bullshit on obvious bullshit.
Which means you likely are a bot.
>>
>>108950441
amazing
>>
>>108950441
I can't quite put my finger on it but something feels weird about the perspective...
>>
>>108950453
read it? i'm going to try and run it, it's a google gemini frontend, I just need to make sure he isn't stealing my api keys first
>>
>>108950441
is she immune from internal organ damage?
>>
>>108950441
redo one without the bulge and with a smaller man, this looks like some kind of giant
>>
>>108946129
>Hello anons, any news on the local TTS front? What stuff are you guys using? Last time I was here anons recommended gpt-sovits which is good, especially for cloning, but has a bunch of flaws.
I've been using dramabox lately.
Sample:
https://vocaroo.com/1j2Fd85TVCVY
Dramabox output file fed to the voice conversion model cosyvoice 3
https://vocaroo.com/1jXgrSnwRank
>>
>>108950440
what in the goddamn
>>
File: file.png (63 KB, 1522x659)
>>108950440
ok
>>
File: file.png (513 KB, 462x696)
>>
Not that I am complaining but why are people posting gens here? There is /adt/ and /ldg/ for that, no?
This place is for us intellectuals who like to read.
>>
>>108950841
Her belly button is like a cavern

>>108950862
>This place is for us intellectuals who like to read.
You'd think so, but people here routinely demonstrate poor reading comprehension skills
>>
>>108950862
>This place is for us intellectuals who like to read.
A picture is worth 1000 tokens. (So long as you set image-max-tokens, because the default is lower than that.)
>>
954263000000
Trans mouth
Add a 7 and a 0 for balance.
>>
>>108950862
>Not that I am complaining but
>complains

>intellectual
for sure
>>
>>108950891
Didn't somebody use images for context compression or something like that?
>>
>>108950914
yeah, deepseek and moonshot did in their research papers I think
>>
in terms of gayness
eagle3 >>> dflash > llama4 >>> inbuilt mtp
>>
>>108950927
>llama4 anything but maximum gay
C'mon man. At least that other shit is usable (though largely not on llama.cpp).


