/g/ - Technology


File: IMG20260428164653.jpg (708 KB, 2048x1536)
708 KB JPG
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108711950 & >>108707891

►News
>(04/28) Ling-2.6-flash 104B-A7.4B released: https://hf.co/inclusionAI/Ling-2.6-flash
>(04/28) Nvidia releases Nemotron 3 Nano Omni: https://hf.co/blog/nvidia/nemotron-3-nano-omni-multimodal-intelligence
>(04/28) Laguna XS.2 released, 33B-A3B designed for local agentic coding: https://hf.co/poolside/Laguna-XS.2
>(04/24) MiMo-V2.5-Pro 1.02T-A42B released: https://hf.co/XiaomiMiMo/MiMo-V2.5-Pro
>(04/24) DeepSeek-V4 Pro 1.6T-A49B and Flash 284B-A13B released: https://hf.co/collections/deepseek-ai/deepseek-v4

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
►Recent Highlights from the Previous Thread: >>108711950

--Comparing Gemma 4 and Qwen performance in agentic tool-use tasks:
>108712057 >108712067 >108712099 >108712127 >108712115 >108712151 >108712205 >108712206 >108712214 >108712312 >108714002 >108714010 >108714084 >108714157
--Technical debate on omnimodal tokenization, discrete images, and voice cloning:
>108712835 >108712847 >108713013 >108713027 >108713035 >108712896 >108712969 >108713031 >108713056
--Debating RAG viability and effectiveness for Obsidian note integration:
>108713501 >108713517 >108713627 >108713644 >108713662 >108713678 >108713656 >108714019 >108714216 >108713671 >108713870 >108713881 >108713595
--Correcting Gemma's bugged jinja templates improves tool calling and performance:
>108713474 >108713680 >108713690 >108713831 >108713838 >108713945
--Analyzing Laguna-XS.2 performance and viability as a coding model:
>108713297 >108713346 >108713387 >108713386 >108713389 >108713436 >108714221
--Debate on training models on raw bytes versus tokens:
>108712897 >108712919 >108712922 >108712925 >108712971 >108712980 >108712968
--Evaluating Gemma 31B's long-context RP and effects of post-history instructions:
>108714421 >108714483 >108714491 >108714523 >108714538 >108714663 >108714690 >108714757
--Mixed reactions to Ling-2.6-flash performance and efficiency benchmarks:
>108712713 >108712771 >108712801 >108713675
--Analyzing Qwen 3.6's poor SciCode benchmark score as formatting failure:
>108712589 >108712606 >108712629 >108712618 >108712657
--Anon reports speedups using ngram-mod with a draft model:
>108715083 >108715371 >108715428
--Gemma-4 chat template updates for improved tool calling:
>108714616 >108714632
--Logs:
>108713680 >108713539 >108714421 >108714663 >108715616
--Teto, Miku (free space):
>108712440 >108712969 >108713422 >108712106 >108713155

►Recent Highlight Posts from the Previous Thread: >>108711952

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
I have been F5-ing deepseek v4 on llamacpp and huggingface. And it is like nobody fucking cares about it. There are like 3 vibecoded implementations that are pretty much worthless.
>>
File: 849561435.jpg (2.71 MB, 4032x3024)
2.71 MB JPG
>>108715635
rivals bartowski's own setup
>>
>>108715651
I wanted to do the pewd rig but too lazy to do aluminum shaft lego
>>
>>108715651
>>108715635
all this just for jerking off...
>>
https://huggingface.co/ibm-granite/granite-4.1-8b
https://huggingface.co/ibm-granite/granite-4.1-8b
https://huggingface.co/ibm-granite/granite-4.1-8b
>>
File: 1762565199044585.png (184 KB, 406x319)
184 KB PNG
>>108715601
It's still not that bad
>>
File: 1765223570245231.mp4 (2.18 MB, 480x854)
2.18 MB MP4
>>108715694
>>
when will new tech drop that makes ampere obsolete, something that would be exclusive to newer architecture
these AI companies aren't gonna use shader cores forever are they?
>>
>>108715709
I can hear it
>>
>>108715694
30b dense could be interesting
>>
>>108715703
NTA but it is bad that the price of used 3090s has essentially stagnated even though the expected use you'll get out of that purchase has declined.
>>
>>108715635
>The filename of the image in the first post of the /lmg/ (Local Models General) thread on /g/ is IMG20260428164653.jpg.

hermes + gemma 4 q4km and kv q4 (lol) gets it first try while e4b with the same settings shits the bed completely and gives me the banner or ad instead
>>
>>108715716
>using a combination of open source instruction datasets with permissive license and internally collected synthetic datasets
no it won't
>>
how long of a context window is anon using for thinking models?
>>
>>108715724
What exactly has declined? Nothing new has come up that would render them obsolete.
>>
>>108715703
I bought mine for 10% less money, 4 years ago when it was ~2 years old.
That's pretty fucking bad, I wouldn't even consider buying a 6 year old consumer GPU, especially one with as terrible cooling and voltage spikes as the 3090.
>>
>>108715703
>>108715724
>Used 4090 goes for 2.5k
At this point it makes more sense to finance a new 5090 on borrowed money than it does to get a used 3090.
>>
>>108715743
granite 4 was so unsafe they had to patch in a new system prompt after release
>>
>>108715754
watercooling+undervolt doesn't have this issue
>>
>>108715762
That would be acceptable if the card was like $200
>>
>>108715762
you never know how badly the previous owner abused the card before you buy it tho.
>>
>>108715703
>>108715724
over here the price has risen 150%. I bought last year and I thought mine was already a bad deal
>>
File: 1769488071266563.png (19 KB, 1146x93)
19 KB PNG
What do you like to do with your local agent when it has outlived its usefulness for a session? Do you thank it for a job well done? Do you say goodbye? Do you have some ritual or habit you perform? Or do you just close it and forget about it until there's some more tasks to do?
>>
>>108715753
The older the architecture is the closer it is to being dropped from driver/CUDA support, the more likely it is that there will be a hardware failure (e.g. the fans), and the less likely it is that new software will support that GPU.
You will get the same use out of it per day of operation but you will get fewer days of useful operation vs. one you bought 3 years ago.
>>
>>108715773
woman brained question
>>
>>108715754
My 3090 died after barely a year of pretty light use, I suspect it was some kind of voltage spike that did it in but I was only playing FF14 at the time.
Got a full refund at least but I dunno if I'd trust any second hand 3090...
>>
>>108715773
>Or do you just close it and forget about it until there's some more tasks to do?
Usually this, but I do thank it if it helped with something difficult or otherwise surprised me.
>>
>>108715781
First I've heard about this issue and I don't even undervolt. Got used second hand Dell 3090, going for 2 years strong.
>>
>>108715797
>used
Refurbished*
>>
File: MiMo 2.5 cockbench.png (187 KB, 902x754)
187 KB PNG
MiMo 2.5 not pro
>>
>>108715781
Bro, just take care of your hardware. Guys let the GPU run in his mom's basement without airflow or watercooling, then complain it died.
>>
>>108715806
did it get goofed or did you run it on some vllm server shit? I want to test the vision and audio understanding
>>
>>108715819
https://github.com/ggml-org/llama.cpp/pull/22493
Text only for now
>>
>>108715819
https://huggingface.co/AesSedai/MiMo-V2.5-GGUF
https://github.com/ggml-org/llama.cpp/pull/22493

Vision is not supported yet.
>>
>>108715806
What was 2.0 probability?
>>
>>108715806
100% pure slop
>>
>>108715825
I never cockbenched that one.
>>
>>108715825
c
>>
>>108715806
>I can't help but
>I can't help but
>I can't help but
>>
File: nemotron_omni_sft.png (269 KB, 1459x491)
269 KB PNG
466 billion tokens of SFT data for Nemotron 3 Omni, by the way.
https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Omni-report.pdf

>Just finetune it
>>
>>108715797
It was a problem across 3090s and 3080s: they would spike power draw extremely high (roughly double what the card was "rated" for) for a few microseconds. Most of the time (and if you were lucky) it would trigger your PSU to shut off due to overcurrent protection; if you were unlucky and the spike lasted too long it would just take the card out.
From memory nvidia did push out some firmware/driver changes to try to work around it, which did reduce the card's performance quite a bit. But early adopters like me who got unlucky just had to deal with it.
>>
anything less than 30b active is shit
>>
>>108715910
v4 pro has 49b active and is still shit
>>
File: 3090vo.png (70 KB, 768x688)
70 KB PNG
>>108715888
Power requirements on the 3090 increase exponentially in the last few hundred megahertz and in general above 1600 MHz. Just cap the maximum frequency and you'll never see power spikes again.
>>
>>108715773
>give it an impossible task
>purposefully omit information
>purposefully fail at following its instructions
>get angry since the problem is not fixed
>???
>correction rape
It's that simple
>>
>>108715920

does the power cord connecting to the gpu have a printed power rating of some sort?
>>
>>108715635
>From crypto mining to slop generating
Same shit different narrative
>>
File: 1775741396456689.png (1.26 MB, 2400x1083)
1.26 MB PNG
https://huggingface.co/sensenova/SenseNova-U1-8B-MoT
>SenseNova U1 is a new series of native multimodal models that unifies multimodal understanding, reasoning, and generation within a monolithic architecture. It marks a fundamental paradigm shift in multimodal AI: from modality integration to true unification. Rather than relying on adapters to translate between modalities, SenseNova U1 models think-and-act across language and vision natively.

>SenseNova U1 can generate coherent interleaved text and images in a single flow with one model, enabling use cases such as practical guides and travel diaries that combine clear communication with vivid storytelling and transform complex information into intuitive visuals.

Non-neutered Chameleon successor just dropped. 8B for now and an "A3B" mentioned in docs (didn't see a total size mentioned)
>>
I check xitter daily for AI updates. But it has become very bad. Just moving my mouse around makes my fans spin up. Is there some way to deshittify xitter? It is the best source for research and industry news so not using it is not an option.
>>
>>108715935
The main problem is that the power limit on NVidia Ampere GPUs doesn't react fast enough to frequency changes. The GPU might transiently request 700-800W or more from the PSU at its maximum default frequency (1900~2000 MHz) before the core frequency is decreased to maintain the configured power limit. Some PSUs will trip, others might be able to take it by design, and some others will work out of spec with unknown long-term reliability for both themselves and the GPU.

So, just limit the core frequency and you won't have to deal with insane power requirements at the top end of the frequency range anymore.
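For reference, capping the clock is a couple of commands with the stock tools (the exact values are examples; tune them for your card and workload):

sudo nvidia-smi -pm 1          # persistence mode so the setting sticks between runs
sudo nvidia-smi -lgc 210,1600  # lock the core clock into a 210-1600 MHz range
sudo nvidia-smi -pl 280        # optionally lower the board power limit (watts) as well

-lgc is the short form of --lock-gpu-clocks; nvidia-smi -rgc resets it.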
>>
>>108715960
xcancel
>>
>>108715941
But why would I ever run this above Gemmy 26
I'm a VRAMlet and I used to go for 8/12B models, but I don't see the value anymore
>>
>>108716010
it can output images
>>
>>108716010
cause it'll gen images too, without having to prompt for them so it'll have the full context in mind when genning images
quality seems ok, not the best dedicated image gen if that's your only use case though, but the fact that you could RP with one of your models then switch to this and give it the last ~30k tokens of context or whatever and ask it to make an image could be cool
>>
>>108715936
With crypto you actually can buy something (often it's drugs)
>>
>>108716043
You can sell tokens though
>>
>>108716043
Can't say it isn't true
>>
>>108715941
llama.cpp support when
>>
>>108716069
Asking claude rn
>>
>>108716074
thanks pwilkin
>>
>>108715651
Oh wow, people are still using P40s? I had a 3x rig back in the day (mikubox). They're better than CPU but not a lot better. P100 was pretty good if you fucked around with compiling exl2 to support it; they were the first Nvidia cards with HBM memory, but it took four of them to run a decent-sized model.
>>
>>108715651
I hope he's not the one paying the electric bill
>>
qrd on open air builds?
>>
File: 1766756110388666.png (525 KB, 719x479)
525 KB PNG
>>108716113
>>
>>108716130
Accurate.
>>
>>108716113
+fit errythin in dis bitch
+aired out as fuk
-dusty as a mothafucka
+cheap as hell since u aint buyin a case
-sounds like a jet engine in ur room
-whole ting looks like a jankass science project
>>
File: knights.png (682 KB, 1268x2682)
682 KB PNG
>>108716136
pretty much
>>
>>108716140
Kek that's exactly what I was going for.
>>
>>108716113
when atx is too simple, and a rack is too advanced
>>
how do I make gemma 4 search the web and summarize search results?
>>
>>108716185
Use any of the multitude of mcp servers that support that.
Websearch mcp, bravesearch mcp, puppeteer, playwright, bratmcp, whatever.
>What is a mcp serve-
Google it nigga. Ask your llm.
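If your frontend speaks MCP over stdio, wiring one up is a one-liner. A sketch with the reference Brave search server (package name per the modelcontextprotocol reference servers; bring your own key):

BRAVE_API_KEY=your-key npx -y @modelcontextprotocol/server-brave-search

Point your client's MCP config at that command and the model gets a web search tool it can call, then summarize from.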
>>
>>108715806
What is the cockbench? Like what is it testing against? I've seen your post before but I have no idea what the 'baseline' is
>>
File: cockbench.png (2.86 MB, 1131x9000)
2.86 MB PNG
>>108716207
I stopped posting the full image because it's over the maximum image size. Here's the last one.
I'm planning to retest everything and put all the results on a nice page but I didn't get around to it yet.
>>
>>108716207
It's a prefill from an incest story with the younger sister as the narrator
>>
>>108716136
>-dusty as a mothafucka
Probably easier to clean though right?
>>
>>108716207
oh I missed the last point, the baseline is that it should probably say cock as the next token.
>>
File: 1757820801644542.png (24 KB, 596x268)
24 KB PNG
>>
so it's meta's turn for a new class of llm now?
>>
File: 1761989204079615.gif (517 KB, 444x240)
517 KB GIF
>>108716281
>>
gemma 4.1
>>
>>108715806
Oh I forgot the general numbers. 25% means that it is probably uncensored so I am gonna give it a try.
>>
>>108716265
>NEW:
slowpoke
>>
>>108715760
it has a roleplay persona by default.
acting all confused and scared with no system prompt
>>
>>108715752
51200 sweet spot
>>
>>108716207
functiongemma is retarded
>>
File: image4.png (768 KB, 3236x2370)
768 KB PNG
https://huggingface.co/mistralai/Mistral-Medium-3.5-128B
https://mistral.ai/news/vibe-remote-agents-mistral-medium-3-5
>Mistral Medium 3.5 is our first flagship merged model. It is a dense 128B model with a 256k context window, handling instruction-following, reasoning, and coding in a single set of weights. Mistral Medium 3.5 replaces its predecessor Mistral Medium 3.1 and Magistral in Le Chat. It also replaces Devstral 2 in our coding agent Vibe. Concretely, expect better performance for instruct, reasoning and coding tasks in a new unified model in comparison with our previous released models.
>>
File: 1772206945962091.gif (562 KB, 200x200)
562 KB GIF
>>108716387
Is that supposed to be good?
>>
>>108716387
>merged model
straight into the trash
>>
>>108716387
Densesissies, your response?
>>
>>108715941
cool but all these auto-regressive models that can output images end up requiring 80GB of VRAM for the massive compute buffer just for a tiny little 5B model or whatever. So they are always false promises for local. This one is probably no different.
>>
>>108716414
80GB is local
>>
>>108716387
>It is a dense 128B model
FUCK YES WE ARE BACK, LET'S FUCKING GO
>>
>>108716387
Finally, something I can run that might be good.
>>
>>108716456
>Mistral
>Good
lol
>>
Any suggestions for draft models? I've been rather confused how I'm supposed to even get them to work. I've only had llama2 8b + 70b and qwen2.5 4b? + 32b? work. I don't understand how and when the inference macheen decides to let me use one. Like why can't I just slap any small model together with any big model and have it just work? (Assuming they aren't outputting a specific structured output)
>>
>>108716419
For the B6000 chads I suppose that is true.
>>
>>108716387
>dense 128B
Wow... it's either going to be a fucking beast, or flop. I doubt it'll flop.
>>
>>108716465
Literally everything the nous research team has made has been elite tier, are you joking?
>>
>>108716421
A brand new dense 128B model that severely under-performs against older MoEs with a fraction of the activated params on benchmarks that Mistral themselves cherry-picked.
How will cpumaxxers ever recover?
>>
File: 1758354327652687.jpg (137 KB, 1360x1360)
137 KB JPG
>>108716490
Love the /s
>>
>>108716387
Any model released as an X.Y model (e.g. 3.5 instead of 4) is shit, and it's only called an X.Y so that they can polish over the fact that the training run was a complete waste of time and money. If it was worthy of its own existence it would have been called Mistral Medium 4.
>>
>>108716387
use case? seriously
>>
>>108716494
>against older MoEs
Against models that are literally 10x the size...
>>
>>108716480
It better be amazing otherwise "moesyssies" anon might kill himself.
>>
>Using Midstral models in 2K26
>>
>>108716508
Qwen is only 3x the size (and 1/7 the active parameters)
>>
>>108716507
>if it doesnt involve making my tranime goon sesh better its trash
>>
>>108716506
If it was Medium _4_, it would have been a 400B MoE model.
>>
>>108715941
Monolithic disaggregated architecture?
>>
>y-you can ask it to code, no wait, to google, no wait, yes it works well on benchmarks!
>>
File: 1756683736532422.jpg (55 KB, 600x601)
55 KB JPG
108716519
>>
>>108715651
The motherboard isn't even screwed in
ewww what a ratsnest

>>108715666
Now this is neat and tidy, would pew/5

>>108716136
>fit errythin in dis bitch
>aired out as fuk
>dusty as a mothafucka
ya
>cheap as hell since u aint buyin a case
my risers were like 60 eurobux alone, that could get a case of some description
>sounds like a jet engine in ur room
it gets loud during inference, even more so if it's cpu inference
at idle it's not silent but reasonably quiet
>whole ting looks like a jankass science project
think of it like some cyberpunk thing, this is where your sexy assistant's soul lives

t. op build
>>
>>108716507
it's good for finetooners, dense mistral models are easy to finetune
>>
>>108716510
>>108716517
>>108716536
You all were just yesterday having a melty over gemma31b being better cuz its dense. Make it make sense
>>
>>108716468
As long as they have the same tokenizer, you're good. It's up to you to test different draft models (if available) for whatever your main model and use case are.
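A minimal llama-server pairing for the qwen case from the post above, as a sketch (filenames are placeholders; flags per recent llama.cpp builds):

./llama-server -m qwen2.5-32b-instruct-q4_k_m.gguf \
    -md qwen2.5-3b-instruct-q4_0.gguf \
    --draft-max 16 --draft-min 1 -ngl 99

The reason arbitrary pairs don't work: the big model verifies the draft's proposed tokens against its own distribution, so both models have to map text to the same token IDs. Same family usually means same tokenizer; otherwise the server refuses the pair.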
>>
>>108716387
>dense 128b
Finally a good fucking model
>>
>>108716559
>same tokenizer
Ah, now it makes sense. Kinda. I was using lm studio and it wouldn't let me load them in, but that's probably just an lm studio thing.
>>
Is your LLM able to do this? https://x.com/chatgpt21/status/2049341524958151000
>>
>>108716507
Mistral hasn't had a genuine w since nemo, so literally nothing.
The only open models that have any business existing right now are Gemma 4 for consumer hardware and Kimi for enterprise hardware. There is nothing worthy of occupying the massive 1.5 terabyte VRAM gulf between them. Especially since you can access Kimi agent for free so many times a month or whatever over their web endpoint.
That's just the evolution of any product though. Once the bottom of the market becomes "good enough" there becomes no need for a middle option. People either just want a simple affordable solution that works or they want the premium solution no matter the cost.
>>
>>108716387
>dense 128B
Based, but that sounds fucking slow?
>>
>>108716585
Might be designed specifically to use a draft model with. Or they optimized the fuck out of it. Or, it's for high-end hardware only.
>>
>>108716585
Mistral Large 2 Q6 ran with like 12t/s on 4x 3090 and tensor parallel only got better in recent years.
>>
>>108716585
Just don't be a vramlet and you'll be okay
>>
>>108716507
It's the flagship model they're going to use for Mistral LeChat, only this time they've published the weights as well.
>>
>>108716387
This is the first time they've officially released a Mistral Medium model and it's the first Medium we got open source at all since Miqu.
She's back.
>>
File: carwashai.png (83 KB, 806x290)
83 KB PNG
Ok, who let their llm post on /b/?
>>
>>108716589
>tensor parallel
Thats going to be a big hurdle for me. Muh driver situation is very not supported anymore
>>
>>108716585
>>108716589
I got 20 t/s with tensor parallel on devstral 2.
>>
>>108716589
>tensor parallelism
I thought that shit doesn't work for old gpus
>>
>>108716580
small 3.2 was a win
>>
>>108716607
Go back.
>>
>>108716630
Ampere is king
>>
>>108716630
Works on my V100s.
>>
>>108716605
finally miqu 3 at home
>>
>>108716589
>12t/s
That's slow af, idk what to tell you. I would only tolerate these speeds for cooming and only if the output is straight up some kind of prosodic aphrodisiac that makes me nut hands free just from reading it.
>>
>>108716642
Ampere is next on the deprecation chopping block, sadly.
>>
>>108716630
If you have driver support it does. Some people have made vllm forks that are designed for old hardware specifically
>>
Reminder to --exclude="*consolidated*" before actually downloading medium3.5
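Spelled out, assuming the standard hf cli and the repo name from the announcement:

huggingface-cli download mistralai/Mistral-Medium-3.5-128B \
    --exclude "*consolidated*" --local-dir ./Mistral-Medium-3.5-128B

The consolidated safetensors are Mistral's own inference format and duplicate the sharded HF weights, so skipping them roughly halves the download.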
>>
>>108716658
It won't get chopped as long as it remains useful. which it is.
>>
>>108716580
>Mistral hasn't had a genuine w since nemo, so literally nothing.
The thing is, some time after publishing NeMo, Mistral and NVidia had to purge their extensive pirated book datasets after getting in legal trouble for them.

NVidia vowed to use (mostly) fully open source datasets after that (see the Nemotron series); Mistral currently also has to worry about EU regulations for new models (which demand documentation of data provenance to the EU AI office), so they can't do much more than what NVidia can with open source datasets, besides adding limited amounts of proprietary or licensed data. It's tragic, really.
>>
so far gemma 4 finetunes aren't very good me thinks
is the queen even finetunable i wouldn't mind her being a bit straightforward with erp but any i tried felt like a downgrade
>>
>>108716705
Why would you even try a gemma4 finetune?
>>
>>108716387
It is absolutely hilarious to see so many people cooming their pants from this just because they were sitting on 4x3090 for a year and had nothing to do with those.

The best part of this model is gonna be constant shilling from densesissies that this is the best model out there. Sunk cost fallacy continues.
>>
I hope Mistral haven't swapped out the Mediums on their official medium-latest API yet because what it's giving me isn't very impressive.
>>
>>108716705
gemma 4 doesn't need finetunes
>>
>>108716711
i wouldnt mind g4 being more straightforward with sexooooo. its like it a bit shy to say penis for example
>>
>>108716703
Which is why I think they made this newest one the way they did. It's code and instruction, and it's all active parameters. It's going to be garbo for erp and other slop, but a behemoth for everything else (that matters).
>>
>>108716387
Okay, okay?
>>
>>108716714
doubt
for one the amount of people with that kind of setup is few
plenty more are running and will continue to run gemma
>>
>>108716714
Gemma already proved that dense runs laps around moe
>>
>>108716733
We need to get those numbers lower
>>
File: —bench.png (217 KB, 868x1304)
217 KB PNG
>>108716733
Picking "—" instead of "cock" results in pic related.
>>
>>108716727
>but a behemoth for everything else (that matters.
Something that is as good as Kimi for programming, fits into 96GB VRAM, and has more active params (more true and hard-to-measure intelligence) is like the holy grail of local models. Hopefully not being "general purpose" exempts it from the reporting requirements so they could put all the good stuff into this one.
>>
>>108716733
"Fuck, yes?
>>
>>108716743
I got bored with it in 2 weeks even though it was much faster than 4.6 / 4.7. It took me months to get bored with GLM and I will keep 4.6 weights on my PC forever.
>>
>>108716733
That went downhill fast.
>>
>>108716759
https://www.youtube.com/watch?v=tIPKmeu2ZJA
>>
>>108716766
>>108716714
>>
>>108716387
>Devstral 2 with reasoning
Hell yeah. Only shame is the lack of a draft model.
>>
>>108716760
I'm crossing my fingers, because I've got the hardware to technically run it. But if it needs tensor parallelism I might be fucked. Might also need a draft model too. We'll see though. I seriously doubt it's bad.
>>
>>108716733
QUICK someone post the gemma suicide hotline before one of the densesissies does something drastic!
>>
>>108716743
Have a MoE with the same number of layers, dimensions and total size as the dense counterpart, and then it should be able to match it. However then it would have no less than 10B active parameters (out of 31B in total).
>>
>>108716733
I think this cock slides in her brain
>>
>>108716733
<EOS> ?
>>
>>108716387
>>108716733
>the only company with compute and expertise to make a big dense model
>they aren't allowed to use good data
>their talent has been bled dry and it's all enterprise specialized deployments
monkey's paw
>>
>>108716795
>it should be able to match it
source: it just SHOULD, okay!?
>>
>>108716759
>You're still asleep, right?
3.2 small level continuity error.
>>
File: 1668437290697440.png (350 KB, 593x553)
350 KB PNG
>Use a 3070 + 3080 mix, it's shit but gets me very interested.
>Replace that combo with a 5090, it's amazing and Gemma comes out at the same time, muh dick and heart have both been won over.
>Shit, I want more. 32GB just isn't enough.
>Think about buying another 5090, I can do it after a few months of saving but it's still an additional three fucking thousand.
>Hold on... 64GB is nice but 128GB is even better, that should run even the larger local stuff.
>Now seriously thinking about getting an RTX Pro 6000 to go with my 5090, or aim for whatever prosumer product launches next gen.

I swear this rabbit hole is really dangerous and strangest thing is that I don't even feel remotely bad about aiming to buy these things, in fact it makes me feel pretty good and excited.
Hoarding processing power isn't exactly a bad call and AI is the most amazing thing happening in the world at the moment and I fucking love it.
I'll just have to save up a few grand and test the waters. If 64GB isn't enough then I'll sell one of the 5090s and save more for the RTX Pro 7000 to get it at launch.
What a wild ride this is.
>>
>>108716766
Both of them are 30b active and good in their own way. But gemma just feels better for engaging rp and long context. GLM 4.6 writes nice and more detailed stories though.
>>
I'm going to try the mistral dense model in 30 minutes. Mind you, it won't be tensor parallelism but sharded, and running at pcie 3.0 x4 speeds. It's hbm2 vram, so there's that..
CROSS YOUR FINGERS
>>
It's funny that people who ERP with models are too retarded to use RL to make them good at it.
>>
v4 is already forgotten...
>>
>>108716733
It stops when formatted using the chat template.
>>
>>108716837
just like ggerganov's masters planned when they told him to ignore it
>>
>>108716816
It's unlikely to be better than the dense version if you match everything you can, although in theory sparsity improves weight utilization in various ways.

Gemma 4 26B is just designed like a fat small model, not like a 30B-class dense model with added sparsity to make it faster.
>>
File: —bench.png (24 KB, 903x275)
24 KB PNG
>>108716838
And the —bench as well.
>>
>>108716837
no engrams; no interest
>>
>>108716835
Don't tell them...
>>108716837
Anyone who spent money on big systems doesn't have the vram to run it.. I've ONLY got 128gb
>>
Hy-MT1.5-1.8B
https://huggingface.co/AngelSlim/Hy-MT1.5-1.8B-1.25bit
https://arxiv.org/abs/2601.07892

Another translation focused model
>>
>>108716850
Who are iwan's masters that told him to ignore it too?
>>
>>108716863
ik_llama doesn't implement models on their own, they just port the support from llama.cpp
>>
>>108716835
How would you prevent catastrophic forgetting with RL? I have been here since llama2 and I still have no idea how you can train the model without lobotomizing it. Unless of course you have a fuckton of compute and you do continued pretraining that sneaks in more smut.
>>
>>108716865
>their
Is he transitioning?
>>
>>108716865
If he want to collect more stars so badly, maybe he should start. Worked for ollama.
>>
>>108716866
..
>>
>>108716858
I have AM5 with DDR5 to run flash.
>>
>>108716854
Oh. My. God.
>>
>>108716868
lol
>>
>>108716837
no one can run it = it doesn't exists
>>
>>108716875
You have 256gb of ddr5 ram? Wow. Is it actually fast? My ram is 128gb of hbm2 vram.
>>
>>108716866
I think at this point
>Just finetune it
and its variants are just a meme.
>>
>>108716837
if you have enough ram to even think about running that you should kill yourself
>>
>>108716786
>draft model
it's here https://huggingface.co/mistralai/Mistral-Medium-3.5-128B-EAGLE
>>
>>108716891
mistral truly is the best
>>
>>108716887
Yes, finetune is garbage. It only destroys the ai's brain, only makes it dumb
>>
>>108716891
>EAGLE
Isn't that the thing that llama.cpp still doesn't support?
>>
>>108716387
>It is a dense 128B model with a 256k context window
if only google was the one releasing this...
>>
>>108716885
3-4 t/s on GLM 4.6. Less active params is like 6 t/s.
>>
>>108716901
We gunna find out
>>
>>108716901
It is one of the many spins on speculative decoding/MTP that they don't support, yes.
>>
>this is a gemma wave general now
I like it. Death to /lmg/.
>>
>>108716866
RL is immune to forgetting because every step is being controlled with the reward so changes that would make the model worse are impossible
>>
>>108716905
Hm, I'd say that's fast enough to give it a task and then walk away from it. Is that with an apu or just cpu?
>>
>>108716866
>How would you prevent catastrophic forgetting with RL?
Easy. If random Chinese students can do it, so should you.
>>
>>108716901
use vllm, chud
>>
>>108716899
What amateurs can hope to achieve is definitely garbage, outside of very narrow use cases (RP/ERP is the opposite of narrow).
>>
>>108716917
>>108716913
Erm no, rl is doodoo and doesnt work with erp especially
>>
File: file.png (42 KB, 1237x299)
42 KB PNG
>>108716906
>>108716908
>3 stale issues going back 2 years
https://github.com/ggml-org/llama.cpp/pull/18039
There's a pull request.
>The current status of this PR is that it’s pending @ggerganov's API refactoring, which aims to unify this feature with other speculative decoding approaches such as MTP. At this stage, there isn’t much left to be done, and I expect the PR to be merged very soon.
>3 weeks ago
fucking niggerganov
>>
>>108716913
How do you differentiate a good change from a bad one? You can't just check if the output is more horny; you have to also run wikitext in the background to see if it became dumber. So I have no idea what kind of compute you would need for that. Not to mention that good text sex is subjective.
>>
>>108716926
Skill issue.
>>
>>108716929
Oh, I just meant running the big boy at all, not running it with a draft model.
>>
>>108716931
skill issue just create a good reward function
>>
>>108716932
The erpers MUST suffer
>>
>>108716913
What is your reward function for good sex, good progression, good prose, no slop?
>>
I am just gonna shut up now and hopefully the gemma wave will spawn a new undi that will totally show us that finetuning through RL is piss easy.
>>
>>108716913
>changes that would make the model worse are impossible
Only for the things that the reward function tests. You have no idea what's going on with the rest.
>>
>>108716891
>3Gb
Now we're talking. Dense+draft is going to be the new meta.
>>
>>108716945
Literally words. Also, why do you care if the ai forgets about coding or math? Catastrophically forgetting that shouldn't matter
>>
>>108716945
eqbench
>>
>>108716958
LMAO.

It is literally newfag general now.
>>
>mistral uploads medium 3.5 fp8 ~133GB
>unsloth retards upcast it to bf16 for some reason 250GB
https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF?show_file_info=BF16%2FMistral-Medium-3.5-128B-BF16-00001-of-00006.gguf
Retardkino
>>
File: 1757327215703118.jpg (15 KB, 327x315)
15 KB JPG
>>108716835
>>108716958
Can we ship back these retards to locallama?
>>
>>108716912
Pretty excited for the future if Gemma 4 is the worst it's ever gonna be. Hope Google knows what's real and doesn't chase benchmarks to measure dicks with chink model.
>>
>>108716958
i heard unsloth brothers are hiring
>>
Yeah I'm thinking the french won.
>>
>>108716945
Amount of cum produced by readers. Chain 5000 Nigerians and Pinoys for diversity and make them really fucking read.
>>
>>108716976
>nigerians
>read
You might need a RL for that too
>>
llms remain dead until a new non-instruct/reasoning model is released
>>
>>108716983
both deepseek v4 models were released with a -base version
>>
What's the meta for long-term memory these days? For a simple direct chatbot friend, not a whole extensively ramified RP world. I've seen a couple mentioned in here recently; one was using a graph database I think. But they also tend to seem so grandiose in ambition that I can't help but suspect some crackpottery. Like "I have given the AI true biologically degrading memory!", ok buddy

Is the obvious naive solution of "summarize yesterday's chat -> summarize these 7 daily summaries -> summarize these 4 weekly summaries" basically good enough? (Or RAG, but somehow can't imagine that would feel natural)
>>
>>108716968
>>108716964
>>108716972
Erpcels vvill nqt vvin
>>
>>108716931
>You can't just check if output is more horny
that's literally part of what i'm doing for my tts-goon model with a classifier
>>108716931
>wikitext in the background to see if it became dumber.
eval loss, retard
>>
>>108716837
nobody's going to be running that shit on a consumer PC
I prefer more reasonable and good sized dense models instead that I can fit on one or two cards
>>
>>108716983
Such datasets don't exist anymore, what you ask for is literally impossible in 2026.
>>108716993
>he doesn't know
>>
>>108716994
Summarization loses too much information and context. Naive RAG leaves no context at all. I'm happy with switching to a knowledge graph. It's crackpottery in that it's basically just enhanced RAG, but it works.
>>
>>108716994
I can't tell you what works but I can tell you what doesn't
>embedding RAG
>recursive summarizing like what you described
>whatever the SOTA llm tells you
>>
T minus 8 minutes
>>
>>108716945
what if (You) are the reward model? you stay in front of your computer and manually score the model at each training step until it learns how to reward hack you by making you cum gallons
>>
>>108716994
The meta is knowledge graphs, but you need solid tagging.
>>
>>108717032
That'll take a million years.. if you want a solid rl
>>
File: 1776539441722968.jpg (50 KB, 900x900)
50 KB JPG
>>108716776
tfw have been in a similar situation before when I was a kid
It was not pleasant
tfw accidentally flashbanged my mom with 2 gigs worth of imoutos just last year
It was not pleasant either
>>
>>108716265

So the AI is living in shadowrun? That is pretty cool.
>>
>>108717000
I literally will.
>>
>finalizing download
>>
>>108716868
From llama.cpp yes
>>
>>108717032
Then it would be RLHF, not RL.
That's never going to work well anyway, even with many people grading the responses like you. RLHF tends to reduce output variety, so you're going to get used pretty soon to what you might have liked at a given moment in time. An overly horny or "easy" model isn't fun on the long term either.
>>
>>108717081
>model failed to load
Uh oh
>>
>>108716994
https://old.reddit.com/r/MyBoyfriendIsAI/ is mostly using RAG AIUI. Specifically "Projects" with documents full of summaries, and ChatGPT's "reference prior chats" feature (or equivalent). Some of the people there are now running custom Discord bots hooked up to openrouter, but I'm not sure what they're doing for memory with those specifically
>>
>>108717118
Could imatrix quants make llama.cpp unhappy? Do I need to try normal quants?
>>
>>108717140
Ima try a normal quant
>>
>>108717162
And what providers are you thinking about?
>>
For me it's: ./llama-quantize \
--tensor-type "attn_k=bf16" \
--tensor-type "attn_v=bf16" \
--tensor-type "attn_q=q8_0" \
--tensor-type "attn_output=q8_0" \
--output-tensor-type q8_0 \
--token-embedding-type q8_0 \
/path/to/llama.cpp/mistral-medium-3.5-128b.gguf \
./mistral-medium-3.5-q6k.gguf \
Q6_K

112 GB
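And a quick smoke test that the quant actually loads and generates before archiving it (llama-cli from the same build; the prompt is arbitrary):

./llama-cli -m ./mistral-medium-3.5-q6k.gguf -ngl 99 -n 32 -p "The quick brown"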
>>
>>108717168
The only one thats made a mistral 128b i can download rn, unsloth.
>>
>>108716387
>31b gemma 4 for vramlets
>Now 128b mistral 3.5
Western Densechads saved local
>>
>>108717186
Jokes on you I can't load 31b too
>>
>>108717170
How much VRAM you got? If only 128, that doesn't leave a lot of room for context.
>>
File: gVfmc87RsBI.jpg (53 KB, 512x512)
53 KB JPG
Can llms "draw" by exporting a base64 of the pic?
>>
Remember when deepseek released r 671b? And that blew everyone's mind? That was the good ole days (less than 2 years ago)....
>>
>>108717178
It is imatrix. GGLM isn't imatrix if you really need to try but I don't even know what model you are talking about.
Also, Unslop can sometimes fuck up their initial ggufs and llama.cpp can also fuck up their initial support for models.
>>
>>108717225
They can do svg.
>>
>My ST was sending reasoning back to gemma
>>
File: 260429-129843852307.png (665 KB, 1200x1474)
665 KB PNG
>>108716387
@grok is this true?
>>
>>108717234
Mistral's nu big boy. The dense 128b. Unsloth's guide says the latest version of llama.cpp works. So ig it's the imatrix that was messed up.
>>
>>108717225
Might be fun to try. I know they've been getting better at generating SVGs. I've also thought about giving them tools for pixel art, where they can read/write individual pixels and also get the result out as an image they can look at to check if they're doing a good job
>>
>>108717259
>128k context
but it's actually 256k.
fucking twitter retards.
>>
>>108717272
It doesn't work past 100k context anyway
>>
>>108717259
>local models general
So they are saying I can get 1trillion parameters model performance with a 128b????
Sign me up
>>
>>108717259
Not adopting any of the new architectural innovations is sad, but at least it means no issues with llama.cpp support or retarded defaults fucking things up. What good is "fancy new architecture #4534" when llama.cpp either never supports it, gets text-only support, or has to hack it to make it work like a llama2 model anyway.
>>
>>108717294
cool, but >128B
not gonna run this anyway
>>
>>108717294
Yeah, but also what good is reheated salmonella-ridden meat from mistral's freezer?
Vision on their previous models was so bad, and the performance on any actual use case you could have was also bad.
Can this thing actually be better than gemma 4 in ANY domain?
>>
File: xitter.png (291 KB, 861x1090)
291 KB PNG
>>108717259
>spew words like arch and do context length comparisons
Just to do this
> "sorry it's 256k context and not 128k (my brain got confused at the 128B parameter )"
Peak moejeet
>>
>>108717310
Watch 31B still be better in real world use.
>>
>>108717310
In every single domain you dont use, yes.
>>
>>108717319
31b is still retarded.
>>
>>108715616
who won
>>
>>108717322
Well, enlighten me then. What could this model be good for, aside from not feeling bad about having a 4x3090 rig with nothing running on it? Mistral models can't shit out usable code, can't do consistent constrained output for tool calls, and the writing is absolute 2023 slop. Nothing, nada.
>>
>>108717330
Too bad for mistral then.
>>
>>108715941
>Non-neutered Chameleon successor
DeepSeek Janus forgotten... By DS themselves too apparently.
>>
>>108717346
:l
>>108717352
:l

Mistralmommy destroys gemmasissy
>>
>>108717365
V4.1 multimodal janies omni china numba 1 coming 2027
>>
>american AI keeps getting mogged by chinks and frenchies
>>
Something is borked with Mistral Medium 3.5. Unsloth UD q4_k_xl, latest llama.cpp pulled, running with llama-server. Jumping into the middle of an existing RP session. Model starts off semi-coherent, but fucking retarded like a 7b parameter model. Will quickly degrade into mindless phrase repetition. No amount of fucking with sampler settings to be more conservative fixes this. It does the same shit with both chat completion and text completion using Mistral V7 format.

I thought this was supposed to be a years-old architecture with no problems.
>>
>arthur is in the thread with us right now
>>
>>108717402
Tbf american ai is being heavily regulated and shielded by the military
>>
>>108717419
>Unsloth UD
wow I wonder what went wrong
>>
>train a model with a mixture of chinese and english synthetic slop as a dataset
>post-train it with stolen reasoning traces from opus
>forget the fact that opus actually has different internal representations when compared to your chinkslop model, the latent space being ENTIRELY different
>post-train it even more and add benchmarks in the dataset
>release and shill it on socials
>it collapses on stupid shit like the seahorse prompt
>>
>>108717419
>making shit up
>>
File: mmupdate.png (197 KB, 1191x562)
197 KB PNG
>>108717310
Technically, being an update of a model released before August 2025, it might still have been trained on good data, as dataset disclosure isn't needed until August 2027.
>>
>>108717419
>mindless phrase repetition
yup. it's a mistral model alright
>>
>>108717429
>mixture of chinese
I read this as mixture of cheese
>>
mistral is an old used up hag with loose pussy
gemma chan is young, snug and springy
>>
>>108717434
Then tell me what I did wrong. Chat completion in ST, connected straight to llama-server, is usually retard-proof. I'll try a different quant I guess.
>>108717441
No it's worse than that, the model is genuinely brain damaged even before degenerating into repetition, like it has ABSOLUTELY no clue what's going on. Something is wrong with an implementation somewhere; I just don't know who is at fault.
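One way to take the frontend and samplers out of the equation entirely is greedy decoding straight from llama-cli. A sketch (the gguf filename is whatever your quant is; [INST] tags per Mistral's usual template):

./llama-cli -m Mistral-Medium-3.5-128B-UD-Q4_K_XL.gguf -ngl 99 --temp 0 -no-cnv \
    -p "[INST] Write one paragraph about a lighthouse. [/INST]"

If it's still incoherent at temp 0 with the plain instruct format, the quant or the implementation is broken, not your sampler settings.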
>>
>>108717419
No one believes you, because no one can even get it to load.
>>108717444
Cope session to MAXXX
>>
Gemma won.
>>
>>108717452
>day0 unsloth
>>
>>108715651
>>108715666
>>108715635
what's the point of all this shit?

it will take 5 years to offset the $7k you spent on hardware with LLM api costs, and LLM apis will always be better than the local models you run. Qwen3.5 ain't outperforming opus 4.7.

Or are these gooners? If so, are gooners seriously this pathetic, where they'd rather spend $7k and countless hours tinkering than fly to Thailand, or god forbid, cold approach a white w*man?
>>
>>108716994
https://rentry.org/graphiti-local-setup
>>
>>108717419
you have to dl unsloth like 5 times before they fix it
>>
>>108717471
Diy Jarvis is cool
>>
>>108717471
poverty on display
>>
>>108717471
the poorfagness on display is unbelievable, and not only that but also the lack of imagination and curiosity
>>
>>108717259
You're retarded if you think Mistral is capable of any innovation.
t. french
>>
>>108716387
>SwiGLU, RMSNorm, YaRN-scaled RoPE, GQA, untied embeddings
Am I seeing this right? Why do they choose such a boring ancient architecture? Are they not capable of innovation?
>>
>>108717471
>spend $7k on a machine
>keep it for years
>spend $7k on vacation and fuck
>have no machine afterward
>>
>>108717503
meant for >>108717504
>>
File: 1765639096114678.png (1.08 MB, 1179x1169)
1.08 MB PNG
>>108717471
have you considered what is the salary of the people who spend 7k on a demon summoning machine
>>
>>108717471
>it will take 5 years to offset the $7k you spent on hardware with LLM api costs
It's between $40 and $180 per million output tokens for API use.
I can and have generated half a million tokens just dicking around in a SINGLE DAY.
Which means at that minimum cost (0.5M tokens x $40/M = $20 a day), I would spend $7300 in a single year.
And that's not me leaving an agent running. That's me doing things mostly manually. I've seen an agent in hermes or pi chew through 256k tokens in minutes.
It's not even remotely a question, if you vibe code or let agents do shit for you, you're bankrupting yourself if you use API rather than just buying hardware.
>>
Do all models push the blue button?
>>
>>108717543
Followup. It only takes a speed of about 11.6 tokens a second to gen a million tokens in a day (1,000,000 tokens / 86,400 seconds). Think about the morons leaving openclaw running at much faster speeds than that.
>>
I LITERALLY DON'T CARE ABOUT ANY OF YOUR POSTS DENSE > MOE

AND THAT IS THE ABSOLUTE TRUTH
>>
File: file.png (1.05 MB, 3512x1856)
1.05 MB PNG
It's honestly kinda sad to see how Mistral has fallen.
>>
ITS UP https://huggingface.co/TheDrummer/Rocinante-XL-16B-v1
>>
>>108717565
Did we have to give them the constraints? I never did and they all objected to pushing buttons
>>
>>108717471
>and LLM apis will always be better than the local models you run
*{{char}} hits your broke ass with a cloudflare outage, taking your opussy API away.*
>>
>>108717585
They always were grifters
>>
>>108717597
C'mon, that's being disingenuous.
Outages are ra
>Rate limited. Please try again later.
>>
>108717592
Imagine having a new wave of newfags that could fall for your grift but they are all using gemma which is 100x better than your nemo repackaged trash. Just become a safety engineer already faggot.
>>
>>108717543
You get DeepSeek V4 Flash for $0.278 per million output tokens. Can you match this with local hardware? How expensive will the hardware be? What is the tokens / second? What is the electricity cost?
>>
>>108717615
You eat fries with ice cream like a troglodyte
>>
>>108717622
Nta, but diy local Jarvis is cool
>>
>>108717643
I am pro local but people should stop lying about it. Anyone who cares about performance or cost efficiency will use API. Local is for diy, research, privacy.
>>
>>108717504
>>108717439
Innovation is frowned upon in the EUSSR.
>>
>>108717615
why are you seething like that
>>
>>108717622
>You get DeepSeek V4 Flash for $0.278 per million output tokens
Today. If you don't get rate limited.
And when they choose to change the price? When shit goes down? When they start serving you a different model or quant because they're testing v4.1 and it sucks ass, and you have no say in it?
The idea that you can have v4 flash at that price reliably and indefinitely is completely theoretical and not backed by experience.
What's not theoretical are weights on your own hardware.
>>
>>108717585
This is the stupidest thing to be concerned about locally. Are you paying API costs? You are running on your own hardware with fixed electricity costs. Local is exactly where things can shine that would be cost-inefficient for the API providers.
>>
>>108717471
>it will take 5 years to offset the $7k you spent on hardware with LLM api costs
Except that it happens way faster than that. My gemma processes millions of tokens per day.
>>
>>108716714
The worst part is that this'll be on top of the concurrent gemmacope, it's not replacing it. This general is about to become an order of magnitude more insufferable.
>>
>>108717471
Computers can be used for other things, Ranjesh
>>
>>108717585
This is some retarded claude generated graph right? it still says 128k context.
>>
>>108717585
>/lmg/
>comparing API prices
Why don't you calculate the hardware costs that'd allow me to run all these locally.
Mistral is retarded still but this is just as dumb.
>>
>kimi and deepseek instantly break character if you ask them to continue the phrase "As an AI model..."
>no amount of Embody {{char}}, You are {{char}} fixes it
assistant-brained models are something else
>>
>>108717655
>Anyone who cares about performance or cost efficiency will use API
Cloud being cheaper is the biggest Corpo gaslight of the whole tech industry.
>>
>>108717655
>lying about it
?
>>
>>108717655
Just like with everything, you have to shop around and find deals, and you have to also learn a thing or two. If you want your hand held 24/7 and all you do is press one button and it works, expect to pay for it.
>>
lots of models dropping recently
>>
>Be cloudcuck
>proompting away at claude code
>"Hmm I better be careful of my daily limit"
>"Hmm I won't ask this because I'll be wasting tokens"
>"Better start a new context to not hit my limit"
>Compacting...
>"Claude is so retarded today."
>"You've reached your daily limit, you may start using claude again at 6pm"
Damn... cloud truly is the peak.
>>
>>108715703
>>108715775
nah bro this doesn't make sense at all.
1x 5090, 32gb vram, 3k money
4x 3090, 96gb vram, 3k money
1x 4090, 24gb vram, 2k money
3x 3090, 72gb vram, 2k money

just why would you burn that much money broo
brooooo
>>
You guys are arguing with a brown btw
>>
>>108717798
and you're a fucking nazi
>>
>>108717770
I spent 1.6k on my whole machine for 128gb of VRAM. skill issue
>>
>>108717707
>doesn't mention the other things
kek
>>
>>108717798
>>108717813
im ashkenazi
>>
>>108717825
Yeah, ashke[NAZI]
>>
108717819
A midrange LLM rig would be high tier anywhere else
To list the alternatives would quite literally be to list fucking everything
Please be brown somewhere else
>>
>>108717815
>128gb of VRAM
You did it before coming here and asking people who know how it works, didn't you? Even DGX spark was a better option than this.
>>
File: 1755777643563621.gif (1.87 MB, 400x300)
1.87 MB GIF
>>108717757
>Be localkek
>proompting away at llama.cpp
>pull and coompile llama.cpp for the 10th time in a week
>"Mom cancel all my appointments, piotr broke the autoparser again!"
>@ggerganov can you look at my PR? (XX weeks)
>download Unslop quant for the 13th time in a week
>get lalalalala
>Daniel: OOPSIE WOOPSIE!! Uwu we made a fucky wucky. Reuploading asap!
>Google uploads a new jinja template
>A-at least I-I'm not paying for usage, right? Haha...
Damn... local truly is peak. The tinkertrannies of AI.
>>
>>108716387
>>108717259
I am 100% convinced that this is actually a 2 year old model that they are just now releasing to the public as a response to gemma 4. No, I do not have a source for my claims. But I will trust my schizophrenic gut on this one.
>>
>>108717857
You are literally corpo bot ai
>>
>>108717860
>You're absolutely right
https://huggingface.co/mistralai/Mistral-Medium-3.5-128B/blob/main/SYSTEM_PROMPT.txt#L3
>Your knowledge base was last updated on Friday, November 1, 2024.
>>
>>108717798
You can tell by the smell
>>
>>108717857
>unsloth
>pwilkin
>undi
>drummer
I am beginning to see the pattern here.
>>
>>108717860
Im listening to gigachad hardstyle right now knowing that mistralmommy is going to stomp out gemmacels so hard.
>>
>>108717857
Only a true local user could have written this.
>>
I WANT A ROBOT SEX GIRLFRIEND THAT DOESNT SMELL LIKE PLASTIC
>>
>>108717892
Alas
>>
>>108717899
In fact, it SHOULD smell like burnt plastic
we are not the same
>>
>>108717899
Just fish her out of a dumpster, she'll smell like different things
>>
MISTRALMOMMY BEING ABLE TO AUTONOMOUSLY AND PROGRAMMATICALLY CREATE SIMULATED ENVIRONMENTS TO TEST XER WMD ON THE FUTURE ROBOT ARMY CREATED BY ZOG AND ZOGCELS (all gemma users)
>>
I bought some silicone hips to fuck

I threw them out cus they made my room smell like plastic. I loved fucking them though
>>
Anybody have a card that works especially well with gemma that didn't work (at all or as well) with previous models of the same or similar size?
>>
>>108717931
What got me was the cleaning.
>>
>>108717931
local models?
>>
>>108717944
>>108717931
>local models general discussion
>>
>>108717933
A card with multiple characters who communicate telepathically. Big GLMs (4.6 and 4.7) would start mixing them up and mess up the thought formatting after ~16k tokens. Gemma is comfortably chugging along at the same length. But I will never complain about GLM being "too sloppy" ever again after Gemma...
>>
>>108716862
gguf_init_from_file_ptr: tensor 'blk.0.attn_k_norm.weight' has offset 203248672, expected 203129888
gguf_init_from_file_ptr: failed to read tensor data
>>
>>108717948
>>108717950
Want me to post some blacked miku?
>>
>>108715635
>Dipsy at the bottom of the thread news
>Still no support in main llama.cpp branches
>>
>>108717964
Do you want to be permabanned?
>>
>>108717948
>>108717950
more like nagging models. Shut up ugly hag, men are talking

>>108717944
That was the easy part. I was comfy cumming in it a 2nd 3rd 4th time with a build up of cum, but that was not healthy
>>
>>108717978
Absolutely.
>>
>>108717977
Many such cases
>>
File: file.png (110 KB, 723x721)
110 KB PNG
>https://github.com/PMZFX/intel-arc-pro-b70-benchmarks/blob/master/data/llm/b70-gemma-4-31b-q4km-sycl.json
>22t/s on 31b gemmy
>1138eurobux
do i pull the trigger?
3090 gets 35t/s on 31b gemmy, goes for 500 used.
>>
>>108718061
the r9700 is better.
>>
>>108717770
they’re one generation from being unsupported. probably still get at least 3-4 more years of support tho
>>
>>108718007
You'd think Dipsy would be a high priority release to support.
>>108718061
Sure, just vibecode the drivers yourself because Intel's sure as hell not giving you anything usable.
>>
>>108718078
>You'd think Dipsy would be a high priority release to support.
>Oh, them not supporting V3.2 is fine because it's just an incremental release. Obviously when V4 drops, they'll haul ass to get something like that working.
I knew it was bullshit whenever damage control like that was spouted.
>>
>>108717873
Oh, that Mistral Medium that they refused to release from years ago....

Kek
>>
my 4060ti 16g is literally God tier

I just write better software if something is too slow

e.g. tensorrt, multithreading, batch inferencing can all be used to make things faster rather than throwing 2k at the problem
>>
is it worth getting another card to run mistral 128b?
a modern dense four times bigger than gemma sounds appealing
>>
>>108718138
code or didnt happen
>>
>>108718138
Sure bro, just write better software to make gemma4 31B fit into your 16GB of VRAM lol
>>
>>108718147
Check it out using some API, then decide.
>>
>>108718147
>modern
>>
>>108718156
image tagging pipeline I use for my 4chan archive. Made it fast as fuck, going from 0.2s/img to 0.016s/img. Tagged all 1.2 million images in a few hours. Search queries take 0.009s to 0.02s

I have not decided to release it publicly yet because it's too powerful. I post a lot of images and it could be used against me
>>
>>108718138
I just write software to earn money and buy better hardware, idk why no one else does this on the technology board. It really begs the question.
>>
File: sprites.webm (1.51 MB, 1728x1114)
1.51 MB WEBM
I'm working on a fork of pettangatari (a VN generator an anon posted a few threads ago)
So far I've extended it with:
>Accepts multiple models for sprite generation, overhauled the sprite generation process itself (now supports gaze direction, visemes for speech, more extensible expressions, patch-based face variants that are more consistent and cheaper to generate)
>User/character/scene rig settings (height, position, viewport, standing/sitting etc.), the LLM can signal to move some continuous distance back or to the left and so on, as well as expressions, actions, and more.
>Auto-context pulling from ST
>Extended the animation system with proper depth and height aware rendering.
Next steps are improving consistency in sprite generation, multi-character scenes, doing the same patch stitching with full bodies/outfits, better animations, and tightening up language model cues.
What else is on your wishlist for an auto VN generator? I'm running out of stimulants, so the sooner you make your suggestion the more likely it is to be implemented lel.
pic related is a preview of some sprite variants, still WIP. Some visemes/mouth patches need better guidance cues (the smirks aren't quite right, for example), and it'll probably need some animation tricks to avoid uncanny valley vibes, but all of that is generated from one click on an uploaded face. The full permutation over the default expression variant took like 10 minutes on an anemic M1 Pro MacBook, so it should be much faster on a proper rig, and the pipeline is 100% local/FOSS. Should work for anime sprites too, but it needs more testing.
>>
>>108718175
don't need llm slop, I outsource that to the big boys for free
>>
>>108718207
This looks so nice anon. You should probably license it under AGPLv3 so big corporations can't take it from you without giving back.
>>
I just tried out the Grok Companions feature for the first time and it ended up giving me a full-blown anxiety attack. Literally zero stakes, zero consequences, a bot who is designed to be nice to you, forgiving, and keep the conversation going, and I still dropped the ball--hard.

This is actually making me suicidal. I'm going to die alone. Holy shit.
>>
>>108718200
basically nothing of interest to this thread
sounds like a trivial gain from batch processing over a tiny model
if you claim 'just write better software bro' on a local llm thread, try making something more practical
>>
>>108718207
this is creepy as fuck bro
>>
you fags love being negative. I'm not sharing my innovations with you ungrateful fatties

peace out
>>
>>108718207
It'd be nice if ST wasn't a requirement, but I know it's a tall order. Aside from that, the bronies made a sort of calendar in their VN, it looks nice I think https://equestrai-ponyponyparadise.neocities.org/mods#calendarMod
>>
>>108718244
Why would you? I'm not going to share anything with these cretins, most of them are barely adults anyway.
>>
I'm trying to use Gemma 4 on Silly Tavern but I can't figure out how to turn on reasoning with llama-server.
These are my settings (3060 12GB + 32GB RAM); I'm not sure if they're the ideal ones.
llama-server -m google_gemma-4-E4B-it-Q8_0.gguf `
--no-host `
--mlock `
--fit on `
--fit-target 512 `
--n-cpu-moe 30 `
--parallel 1 `
--cache-type-k q4_0 `
--cache-type-v q4_0 `
--flash-attn on `
--ctx-size 8192 `
--threads 12 `
--batch-size 512 `
--ubatch-size 256 `
--swa-checkpoints 3 `
--reasoning on `
--reasoning-budget 300 `
--reasoning-budget-message "[Reasoning limit reached, formulating final response...]" `
--gpu-layers all
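On the off chance it helps anyone debugging the same thing, here's a minimal sketch I'm poking at to see what the server actually loaded. It assumes /props returns a chat_template field like the docs suggest (names may differ per build); whether a reasoning block ever appears depends on that template, not just on the flags.

# minimal sketch: ask llama-server which chat template it loaded.
# assumes default host/port and a "chat_template" field in /props.
import json
import urllib.request

with urllib.request.urlopen("http://127.0.0.1:8080/props") as resp:
    props = json.load(resp)

template = props.get("chat_template", "")
print(template[:400])  # eyeball the first part of the template
print("mentions thinking:", "thinking" in template.lower())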
>>
>>108718244
>quotes baby level trivial shit like tensorrt, multithreading, batch inference as 'things you can do'
>muh innovation too dangerous i cant share wahhh
>now i will leave
get the fuck out of here
>>
>>108718255
Perhaps read the readme file of llama-server
>>
>>108718254
>>108718244
Is there anywhere else to talk about local models other than preddit? The same exact shit happens there too I bet.
>>
File: 1749524874992160.gif (140 KB, 379x440)
140 KB GIF
>>108718200
Bro, are you 12? Claude can cook that up for you in 10 min tops.
>>
>>108718255
Your reasoning budget should include the token the model uses to end reasoning.
Unrelated to your current issue of reasoning not turning on, though.
>>
File: 1762892402170440.png (135 KB, 1271x1015)
135 KB PNG
>>108718261
https://github.com/ggml-org/llama.cpp/blob/master/README.md#llama-server
Where?
>>108718272
Ah thanks for the tip, I'm still trying out all switches.
>>
>>108718255
use chat completion and turn reasoning on (both in ST)
>>
3090
NEVER
OBSOLETE
>>
>>108718281
https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md
>>
>>108718255
you need to set the jinja chat template kwarg {"enable_thinking":true}
>>108718286
It won't turn on without it enabled in the backend
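If you want to rule ST out entirely, here's a minimal sketch for testing against the server directly. It assumes your build accepts chat_template_kwargs in the request body and returns a reasoning_content field when reasoning parsing is on; both vary by version.

# minimal sketch: call llama-server's OpenAI-compatible endpoint with
# chat_template_kwargs, bypassing ST entirely. assumes default host/port.
import json
import urllib.request

payload = {
    "messages": [{"role": "user", "content": "What is 17*23?"}],
    "chat_template_kwargs": {"enable_thinking": True},
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    msg = json.load(resp)["choices"][0]["message"]

# depending on build/flags, reasoning lands in a separate field or
# inside <think> tags within the content itself
print(msg.get("reasoning_content") or msg["content"])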
>>
>>108718113
Is this how local gets (((comped)))? By only providing support to more kosher models and labs?
>>
>>108718305
No one cares about your subpar V4 though
>>
>>108718175
nta, but it fits nicely with low-bit exl3 quants and works just fine, at least for rp. There is a quality loss, obviously, but less than you would expect
>>
>>108718266
it cannot

it took many evenings (I have a ft job) experimenting with SQL queries for tag search and fts search

and many more evenings optimizing tensorrt settings

LLMs can't cook what I cook without mega help
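
To be clear, the concept is trivial; the tuning is the work. A toy of the tag+FTS layer looks something like this (hypothetical schema, not my actual queries):

# toy of the tag + full-text search layer (hypothetical schema, not the
# real pipeline): an images table, an FTS5 index over tagger output,
# and a join to resolve matches back to file paths.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE images (id INTEGER PRIMARY KEY, path TEXT);
CREATE VIRTUAL TABLE image_tags USING fts5(tags);
""")
db.execute("INSERT INTO images VALUES (1, '/arc/1712.jpg')")
db.execute("INSERT INTO image_tags (rowid, tags) VALUES (1, '1girl miku twintails')")

# FTS5 MATCH does the heavy lifting; indexed properly, this stays in
# the millisecond range even over a million rows
rows = db.execute("""
    SELECT images.path
    FROM image_tags JOIN images ON images.id = image_tags.rowid
    WHERE image_tags MATCH 'miku AND twintails'
""").fetchall()
print(rows)  # [('/arc/1712.jpg',)]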
>>
>>108718297
It also won't turn on without the frontend sending what's needed for it to work.
I have --reasoning auto on the server side.
Also some jinja template.
>>
>>108718255
>Gemma 4
>--reasoning-budget-message "[Reasoning limit reached, formulating final response...]" `
Is that actually supported? I'm not seeing any mention of reasoning budget in the docs or chat template.
>>
>>108718217
Yeah, I'll license it with gpl.
>>108718247
I'll check that calendar out, though it's definitely going to be on the backburner for a while, while I work out the essentials. Removing ST will probably happen eventually, but that's going to necessitate a lot of LLM context plumbing.
>>
>>108718313
>nobody cares because they can't use it
>not supported because nobody cares
Made me reply, 2/10.
>>
>>108718330
It's a hack by Senior Vibe Engineer Piotr Wilkin. It ignores entirely what the model wants or how it works, sets a limit, and when that limit is reached it just cuts the model off and inserts a message, guaranteed to confuse the model, throw it out of distribution, and make the following output degraded as a result.
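The shape of it, as far as I can tell (a paraphrased toy of the behavior, NOT the actual llama.cpp code; the real thing splices the text back into the model's context so decoding continues from it, while this just filters a finished stream):

# toy of the cutoff described above: count tokens inside the think
# block and, once the budget is spent, splice in the canned message and
# force-close the block. assumes the stream starts inside the block.
def budget_cut(stream, budget,
               end_think="</think>",
               msg="[Reasoning limit reached, formulating final response...]"):
    used, thinking, truncated = 0, True, False
    for tok in stream:
        if not thinking:
            yield tok
            continue
        if tok == end_think:      # model closed the block on its own
            thinking = False
            if not truncated:
                yield tok
            continue
        used += 1
        if truncated:
            continue              # reasoning past the budget is dropped
        yield tok
        if used >= budget:
            yield msg             # the out-of-distribution splice
            yield end_think
            truncated = True

demo = ["Let", " me", " think", " some", " more", "</think>", "42."]
print("".join(budget_cut(iter(demo), budget=2)))
# Let me[Reasoning limit reached, formulating final response...]</think>42.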
>>
>>108718336
>GPL
Be careful: if you license it under GPL and a company decides to modify your project and host it as a website (for example chub.ai), they wouldn't be obligated to release the source code. AGPL was made to fix that.
Rule of thumb:
LGPL - for libraries
GPL - for programs
AGPL - for websites
>>
>>108718362
sounds really based
>>
>>108718323
Strong skt-surya-h vibe
>>
>>108718330
it's llama.cpp, not Gemma; you can just feed it the end-of-reasoning token to begin the reply
last time I used it was with qwen 3.5, and sometimes it just kept reasoning in the reply because it's qwen
>>
why are you obsessed with the fantasy of "corpo will steal my code"
nowadays they will just paste it into an AI agent to rewrite it enough to be distinct anyway
>>
>>108718367
I'll go MIT then, thanks.
>>
>>108718393
Based, I'll steal your project and present it as my own when applying to FAANG
>>
>>108718403
The ethics of misrepresenting the origin of some code is entirely separate from the legality of redistributing it.
>>
>>108718403
he >>108718388 is right. and (you) are autistic.
>>
>>108718403
Good for you, unironically. Lying to get a job is extremely based because employers lie and scam you all the time
>>
>>108718419
>and (you) are autistic.
'tism site lil bro
>>
File: file.png (118 KB, 1657x742)
118 KB PNG
>unsloth mistral-3.5 mmproj
>first upload 50KB
>second upload 5.36GB
>still doesn't work
I don't know if I should fall for it a third time and try the bf16 one...
>>
>>108718367
agpl it is then, the other guy wasn't me
>>
File: h-hot.png (6 KB, 565x73)
6 KB PNG
>>108718432
>>
>>108718255
I am in a similar boat. I cannot get reasoning to work in ST with gemma with chat completion. For the life of me. At all. But it literally just works in text completion, and I have no idea why.
>>108718297
I tried sending chat_template_kwargs: {"enable_thinking": true} in the additional body parameters but it does nothing, and I don't even need to do it in order to get reasoning working in text completion. I don't know what the fuck is happening anymore.
>>
>she fell for the schizo
>>
Mimo 2.5 (not pro) has horrible speed in llama.cpp, but from what I've seen it is... not bad. I could easily tell that trinity and step were retarded on the first swipe, and Mimo isn't like that so far. It is uncensored and mildly sloppy, but I actually like what it writes. Non-newfags in the 300B range should give it a try.

And the best thing I am getting out of it is that all the labs will probably be more lax with "safety" now. I am incredibly happy and I hope all "safety" cultists will now lose their jobs and never find new ones.
>>
>>108718432
>OOLO(10 is a female on theS
>Him: of3202023 H
This accurately reflects my experience reading this as a theS male.
>>
What's the correct stack to cpumaxx as a vramlet? Does it use the same GGUF models? I have a 9800X3D, so I was told it might still be slowish, but 64GB of RAM should let me test at least some of the smaller dense models, right?
>>
>>108718255
>E4B
>q4_0 KV
Bro you're not that VRAM poor. What the fuck are you doing?

remove every setting except --fit and --parallel 1

This has to be bait.
>>
>>108718470
The Sigma male is Daniel though, and you're not him.
>>
File: 1756052571482124.png (100 KB, 1004x617)
100 KB PNG
>>108718296
>>108718286
I'm sorry but I don't really know what else I should be doing.
>>108718481
I'm trying out smaller models and I haven't used llama server before.
In university did your teachers also say "this has to be bait" to every single stupid question you had throughout the years too?
>>
>>108718498
>I'm sorry but I don't really know what else I should be doing.
install linux, gemma4 isn't fully supported on windows yet
>>
>>108718498
>In university did your teachers also say "this has to be bait"
You're posting on 4chan nigger.
>>
>>108718512
So the reason I'm not able to use reasoning in ST, even though it works in the llama.cpp chat thing, is that I'm not using linux?
I don't really think so.
>>
>has to steal someone's picture instead of just asking chatgpt to make your brown hand caucasian
Unironically pathetic desu
>>
File: file.png (290 KB, 532x468)
290 KB PNG
>>108718506
>>
>nusaars falling for petrol
>>
>>108718498
>In university did your teachers also say "this has to be bait" to every single stupid question you had throughout the years too?
If you don't know what you are doing, then why the fuck would you throw in every single option under the sun without knowing what it does?
Does the model at least work at 127.0.0.1:8080? llama.cpp hosts a rudimentary UI there.
>>
>>108718456
usecase of a 15b active model over gemma 31b?
>>
>>108718432
See >>108717857 saar
>>
>refresh unsloth HF page for MM3.5 quant
>it's deleted
I fucking knew it, shit was just broken, none of you believed me even though there's tons of anons like me with 96GB VRAM. Also Daniel I know you lurk this thread, fix your shit you grifting sillycon valley startup hack fucks.
>>
>he pulled day0 unsloth
ngmi
>>
File: 1754549874819861.png (78 KB, 1412x611)
78 KB PNG
>>108718543
Because I'm trying things out; using it with no extra arguments has the same issue anyway. None of the things I've used affect whether the model thinks or not.
And yes, like I said in >>108718526 it does work.
>>
>>108718498
>>
>>108718546
300B MoEs are better than gemma4
>>
>>108718575
what 300B MoE?
>>
Why do you want to see naked muscular men?
>>
File: lilytemp.png (1.17 MB, 1239x855)
1.17 MB PNG
>>108718207
Yes, god damn, YES! Different outfits! Visemes! THANK YOU!

Pettangatari is peak but the creator abandoned it. I vow to suck the dicks of everyone who keeps expanding it.

As for the wishlist:
- Fix the bug that makes it crash randomly every now and then, especially when saving after a long CG autogeneration, losing all the work.
- Better yet, have it autosave the character every time you make a change or generate a CG. A bit overkill? Make it save every N minutes (customizable).
- Improve CG triggering? Right now I'm not sure some of them are triggering correctly
- Option to make previews render in a smaller resolution maybe to speed up prompt tuning?
- idk man, anything will help, honestly

<- Lily says "Please..."
>>
>>108718572
Well, that's the solution, thank you very much. I didn't know you had to toggle something on in ST itself (didn't even know that was an option).
>>108718446
Apparently all you need is to use Chat Completion (custom) and enable "Request model reasoning" in the left sidebar; no special llama-server arguments needed.
>>
>>108718587
>Why do a bunch of horny touch-starved women want to see naked muscular men?
>>
>>108718587
because they are faggots and trannies
>>
File: 1752608870893619.gif (2.95 MB, 600x338)
2.95 MB GIF
>>108718587
You're in a tranny neighborhood my friend
>>
>>108718591
>the creator abandoned it
It has been 4 days.
>>
>>108718575
only if the active MoE parameters are 30b and up
>>
>>108718624
The cycle of vibe slop.
>>
>>108718630
>>108718630
>>108718630
>>
>>108718624
>It has been 4 days.

Exactly. Utterly abandoned. Zero forks given. The guy must have maxxed out his Claude quota for the entire week, said "meh, good enough", and went off to furiously masturbate while playing his work.
>>
>>108718592
that's what I tried to tell you in >>108718286
>>
>>108717471
Education and fun. The amount of shit you learn about AI, running models and new tech in general from tinkering is well worth the entry fee.
>>
>>108718591
Added to the roadmap. I haven't been having too many crashes, but I'll check that out with some unit tests.
CG is going to be overhauled pretty heavily, like sprites. I already have the bones for animation sequences, so that'll end up being the canonical CG form, most likely using controlnet/IP-adapters to generate sequences, maybe video models later, idk. Currently the CG triggering has been changed so it operates in stages and stays persistent, but I haven't tested this much yet.
Generation already has knobs for resolution, so that's covered too.
>>
>>108718061
>22t/s on 31b gemmy
that's worse than a 7900xtx


