/g/ - Technology

/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>107383326

►News
>(11/28) Qwen3 Next support merged: https://github.com/ggml-org/llama.cpp/pull/16095
>(11/27) DeepSeek-Math-V2 released: https://hf.co/deepseek-ai/DeepSeek-Math-V2
>(11/26) INTELLECT-3: A 100B+ MoE trained with large-scale RL: https://primeintellect.ai/blog/intellect-3
>(11/21) GigaChat3 10B-A1.8B and 702B-A36B released: https://hf.co/collections/ai-sage/gigachat3
>(11/20) Olmo 3 7B, 32B released: https://allenai.org/blog/olmo3
>(11/19) Meta releases Segment Anything Model 3: https://ai.meta.com/sam3

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
>>107394940
Kinda funny how it concludes
>it's okay if I just pretend she's actually 18
>>
>>107394971
secks
>>
File: 1607663194479.png (280 KB, 510x487)
►Recent Highlights from the Previous Thread: >>107373173

--DeepSeek V3.2 confirmed as garbage benchmaxxed, it's over, sama won.

►Recent Highlight Posts from the Previous Thread: >>107373176

Why?: >>102478518 (DEAD)
Enable Links: https://rentry.org/lmg-recap-script
>>
>>107395003
Nice recap
>>
File: 1756835160802567.jpg (3.69 MB, 4098x5156)
Is Deepseek really our only hope to ever match izzat with Gemini?
>>
>>107395003
we didn't even hit page 9 yet you labubu
>>
File: o86fue.png (128 KB, 252x352)
►Recent Highlights from the Previous Thread: >>107383326

--Mistral Large 3 integration and parameter size speculation:
>107391247 >107391281 >107391983 >107392052 >107392625
--Adding Ministral3 model support to llama.cpp and architectural distinctions:
>107391911 >107392074 >107392079 >107392089 >107392095
--Skepticism and analysis of DeepSeek-V3.2-Speciale's novel features and training methods:
>107392436 >107392461 >107392484 >107392503 >107392537 >107392543
--Deepseek API model features and pricing comparison:
>107393178 >107393194 >107393643 >107393661 >107393768 >107393875 >107394146 >107394182 >107394347
--Evaluating Qwen3 A3B models for text prompt enhancement on high-core server hardware:
>107383592 >107385249 >107385515 >107385536 >107385576 >107385600 >107386603
--RTX 3090's long-term viability in AI hardware landscape:
>107384136 >107384156 >107384623 >107384655 >107384681 >107384708 >107384815 >107384177 >107384192 >107384190 >107384218 >107388561 >107388570 >107388627 >107384366 >107384466 >107384502
--Bert-Nebulon Alpha speculated to be Ministral 3:
>107387197 >107387250 >107387289 >107387378
--Control vectors as solutions for model positivity bias:
>107387223 >107387281 >107387311 >107387322 >107387338
--Struggles with local model performance and quantization tradeoffs:
>107383781 >107387410 >107387456 >107383886 >107383915 >107388528 >107388653 >107389405 >107389932
--Hugging Face Transformers library update with Ministral 3 model:
>107386861 >107387118 >107390941
--AI-generated code policy changes in llama.cpp:
>107386661 >107386681 >107386684 >107386816 >107386929
--Exploring RL training models with reliable function calling (Qwen3 vs ToolACE-2-Llama):
>107384624 >107386072
--Ministral3 model support added to llama.cpp:
>107393747
--Miku (free space):
>107387139 >107391271 >107391305 >107391346 >107392856 >107393073

►Recent Highlight Posts from the Previous Thread: >>107383338

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>107394940
>>107394977
>we must
They did lmao. DS fell off. OpenAI poison pill worked. Why did retard chinks distill a 120b model?
>>
https://lynchmark.com/
https://lynchmark.com/
https://lynchmark.com/
It's up.
>>
>>107395026
Dipsy fell off >>107395041
>>
>>107395041
It's so over
>>
>>107395036
wait a second...
>>
>>107395096
stop with the whole mascot thing so cringe
>>
>>107395026
what is the source of this collage?
>>
>>107395141
japon
>>
>>107395141
Kansai Enkou
>>
Local model recs for 24GB VRAM and 64GB RAM?
>>
>>107395192
Qwen Max Q8.
>>
>>107395041
Because GPT OSS is still the best local model
>>
>>107395211
I had forgotten they dropped that. Is there gguf support?
>>
>>107395238
Yes.
>>
I meant SFW (but subtle) photos like in collage
>>107395174
this finds just some JAV :/
>>
>>107395192
>sample
baka my head
>>
>>107395306
>IMG_
>>
I'm definitely getting better outputs from K2 thinking than from coder-480, but it takes a lot more wrangling.
Pretty crazy that we're basically SOTA if you have enough ram or patience.
captcha: NMSAM
>>
>>107395192
What is xe saying?
>>
wtf that's not what I typed. baka
>>
>>107395327
whale good
>>
>>107395306
4cuck doesn't like big files.
>>
File: ZiT_00220_.jpg (205 KB, 896x1600)
>>107395026
JKs are the pinnacle of humankind.
>>
>>107395192
broken-tutu-24b, maybe q5 or q6.

I can cram qwen3-30-instruct into a 2080ti 22GB at q4. It's smarter for function calling, but not good for roleplay.
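For reference, the launch line looks roughly like this (the filename/quant is just a placeholder for whatever you actually downloaded, as long as it fits in the 22GB):
./llama-server -m Qwen3-30B-A3B-Instruct-Q4_K_M.gguf -ngl 99 -fa on -c 16384
-ngl 99 keeps the whole thing in VRAM, which is what makes it fast.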
>>
>>107395479
>I can cram qwen3-30-instruct into a 2080ti 22GB at q4
You are running the thing fully in VRAM instead of putting at least some of the experts in RAM?
How many t/s do you get? It should run fast as fuck like that.
>>
How do you make your model have balls? I want the AI to RAPE me but it never makes the first move, fucking pussy
>>
>>107395041
once, frickin, again.
chang forgot to clean up their dataset
>>
is it possible to run a local model only to ask it mathematical questions and get detailed answers, on consumer hardware (16 vram) these days? lowering stuff like context to a minimum I assume lowers vram usage by a lot, since I just need 1 good answer and no history
>>
>>107395643
yeah it's pretty good. Just make sure to add a comma or space between every 3 characters in a long number
>>
>>107395721
which model would you suggest?
>>
> | `MistralLarge3ForCausalLM` | Mistral-Large-3-675B-Base-2512, Mistral-Large-3-675B-Instruct-2512 | `mistralai/Mistral-Large-3-675B-Base-2512`, `mistralai/Mistral-Large-3-675B-Instruct-2512`, etc. | | |

https://github.com/vllm-project/vllm/pull/29757/files

> 675B
>>
>>107395793
So they just tuned DS3, kinda base
>>
>>107395812
I think it has image input too.
>>
>>107395793
can't wait to run this at q1 kek
>>
>>107395793
We. Are. So. Back!
>>
>>107395771
idk anything with reasoning like Qwen. Choose the biggest IQ4_XS you can fit in VRAM. If you don't mind the wait, you can offload some of the model to RAM.

Most models should be able to do undergrad math anyways
>>
File: HUGE.png (228 KB, 584x755)
/g/ros this is huge? Why is no one talking?
>>
>>107395944
What's the point? Never seen a model go past 30k without becoming schizo. Most models start cracking past 12k
>>
>>107395579
Just make it your usual werewolf vampire
ceo sis
>>
>>107395995
well now that you can tune to huge contexts for cheap this will be a thing of the past!
>>
>>107395944
>Daniel
>can't even quant with known algos
Doubt.jpg
>>
>>107396023
After a certain threshold, training compute time increases with the square of context size.
>>
>>107395793
>moe
>675b
Yeah, it's actually over, local is officially dead. I'm considering a gf now.
>>
>>107396067
you should become the gf
>>
>>107396067
Just use Nemo as your gf, she won't get surpassed anytime soon
>>
>>107395833
So DS 671B + a 4B image encoder. Is Mistral so incompetent now that they couldn't even manage to successfully run a distillation so finetuning was their only option? At least that means it can't be that bad. Miqu 2?
>>
>>107395396
true
>>
>>107396100
I'll take a MiDSqu
>>
>>107396067
>moe
I have 8 256gb nvme ssds in a raid 0 array ready for this.
>>
Did llama 4 use any shared experts?
>>
hello guys im stupid and also im new to using ai for vibecoding.

what is the claude code equivalent for kimi k2?
>>
>>107396100
Mistral is half filled with women
>>
>Most enthusiasts won't be able to afford to run the largest or very large new open weight models at a reasonable speed
We have to be content running smaller 32b to 192b models..

>192 gb of ram is 3k now and an rtx 6000pro costs 7500-8000usd and a mac studio with 512g of ram costs 9.5k... With RAM and GPU prices being this expensive and the SOTA models getting larger, by the end of 2026, you will have 1.5-2 trillion parameter open weight highly performant models. How will most enthusiasts be able to run a 2 trillion parameter model locally over 18 tokens/second in 2026? (They'll have to wait years for that.... I guess distilled models will get better). Even running q4-q8 500B to 1T models locally at 18 tokens/s will be out of reach for many...

>I guess even those with deep pockets will be forking over 20k to run a q4 2T model with a large context window on two m5 ultras or over 40k on 1.1tb of ddr5/6 ram and 2 rtx 6000s in 2026.

>How will an average enthusiast be able to even afford 128-192 gb of (>600GB/s )fast ram and a good <1.5 year old gpu with fast prefill speed for a 128-256b model? I guess they can use m2 ultras or m1 ultras, but the prefill is kind of slow and the gpu is a little dated..

>How much money do most people even have to buy an LLm rig? $1k to 4k?

>By 2028, you will have 8 trillion parameter open weight models.. I guess most enthusiasts will be stuck running q4-q8 32b to 200b models locally with 10-80% capability or quality of multitrillion parameter models until 2027-2028 when ram production ramps up or they will be using the API or renting a gpu.

>Even if ram production goes up, ram will still be more expensive in 2027 than in 2024....I hope apple doesnt raise their ram prices, they have fixed price ram contracts after all ... At this rate, we might as well have time share data center GPUS..
>>
>>107396169
good boy https://www.reddit.com/r/LocalLLaMA/comments/1pbabiv/most_enthusiasts_wont_be_able_to_afford_to_run/
>>
>>107396123
bitrots your model into gay sex. Nothing personnel kid
>>
>>107396135
It did
>For example, the Maverick variant stores 400 B parameters but activates just 17 B at inference time. Dense and MoE layers are interleaved so that every token is processed by a shared expert plus one of 128 routed experts.
>>
>>107396179
>most_enthusiasts_wont_be_able_to_afford_to_run
fat and obese
>>
I should have cpumaxxed before RAM prices exploded.
>>
>>107396200
You should cpumaxx before RAM prices explode more.
>>
>>107396187
Sick.
Thanks.
>>
>>107396200
We told you to do it...
>>
>>107396162
You can use Claude Code with kimi k2. Go to the webchat and ask it, it will tell you step by step. Its just pointing the env variables to the moonshot servers and kimi models.
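If memory serves it boils down to something like this before launching claude (the endpoint and model name here are from memory, so double-check them against Moonshot's docs):
export ANTHROPIC_BASE_URL="https://api.moonshot.ai/anthropic"
export ANTHROPIC_AUTH_TOKEN="sk-your-moonshot-key"
export ANTHROPIC_MODEL="kimi-k2-thinking"
claude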
>>
>>107396169
I'll wait for the < 70B line-up to improve once the big fags get tired of going bigger to impress the benchmaxx midwits and waste gorillions of dollars on hardware and electricity
>>
What motherboards are you cpumaxxers running? H13SSL?
>>
>>107396200
? I just bought 8x 64gb 3200mhz ddr4 rdimms for $200 each a month ago (~$130 usd)
>>
>>107396238
Interesting. I had only been looking at DDR5.
>>
>>107396238
>ddr4
Bro we're using ddr5 here
>>
>>107396237
MZ73-LM1 because I was planning to add a second processor and go 1.5TB RAM next year. If I had known that RAM prices were going to explode and that I'd end up stuck with 768GB I'd have gone for the H13SSL.
>>
>>107396238
>$200
>$130 USD
wot. Speak in McDonalds please
>>
As a cpumaxxer I only regret not buying 1.5 TB instead of 0.75 TB RAM. Having all big models at home has 100% been worth it.

>>107396237
MZ73-LM0
>>
wish unsloth would have a guide for installing with lmstudio too, not just ollama and llama.cpp
>>
>>107396295
lol really bro?
>>
>>107396255
>>107396266
If you just need something to run big models settle for ddr4. ~15-10 tokens/s with 8 channels running glm.
If you want ddr5 for the speed... why not stack mi50s instead?
Unless you're not poor I guess.
>>
I want to thank the anon that mentioned the $3k Threadripper 7970X, TRX50 128GB, DDR5-5600 Kit bundle a few days back.

Now just need to decide whether to upgrade my ancient X99 rig or sell the bundle for a $1.5k profit.
>>
>>107396319
>why not stack ewaste, at least it'll keep you hot this winter
>>
>>107396341
The heating is just a nice bonus.
>>
best models for my specs?
>air m4
>24gb ram
>10c/10c
>>
File: pieds.png (895 KB, 1492x1036)
>>107394971
I'd like a workflow for Wan, 64gb of ram and 24 of Vram.
As well as lora(s) training tips.
>>
>>107396341
I mean, we are talking about running big models for cheap, you're going to have to make some compromises.
>>
>>107396363
oui oui bageutte au fromage
>>
you shouldn't spend lots of money now on hardware when we all know new hardware built with AI in mind (16 channel ddr5 intel/amd servers and budget gpus with lots of vram) will be made from now on, making everything running today obsolete soon, you'll be able to buy the obsolete stuff for cheap second hand or if you have money you buy the new stuff made for ai
>>
>>107396363
Hello sir,
Please may I recommend you /ldg/,
Thanks.
>>
>>107396371
they're literally stopping making new rams because of saltman dude
>>
>>107396392
meant for >>107396379
>>
>>107396379
>when we all know new hardware built with AI in mind
>will be made from now on
Do we?
>>
>>107396385
>/ldg/
Oh right, oui oui baguettes.
>>
>>107396379
lol
>>
3.2-Speciale feels like a better version of K2-Thinking. They both have the issue that they think for very long and tend to shit out extremely long replies, but I like the way Speciale writes.
>>
>>107396392
>>107396407
>>107396423
Refute me. You can't.
>>
>>107396434
Refute what? What's your reasoning?
>>
>>107396434
>The root cause of the shortage is a shift in demand, with much of the industry's capacity now focused on high-bandwidth memory used in AI accelerators. This shift leaves less wafer output available for commodity DRAM and 3D NAND. Building significant new capacity takes years, so substantial relief is unlikely before late 2027 or 2028. In response, Team Group plans to prioritize strategic AI ...
https://www.techpowerup.com/343518/memory-shortage-just-started-major-price-hikes-ahead-warns-team-group
>>
>>107396407
>Do we?
>16 channel ddr5 becoming the only model intel works on
https://www.tomshardware.com/pc-components/cpus/intel-cancels-part-of-its-next-gen-diamond-rapids-xeon-lineup-report-claims-xeon-7-will-drop-models-with-8-memory-dimms-to-focus-only-on-16-channel-cpus-for-extra-memory-throughput
>amd putting lots of vram on budget gpus
https://overclock3d.net/news/gpu-displays/leak-unveils-game-changing-specifications-for-amd-radeon-rdna-5-gpus/
it'll still take 2 or 3 years for the change to happen but yeah, everything points to more (v)ram and faster (v)ram, and that's all because of AI
>>
>>107396379
The only thing people will be able to afford is Mac studio. DRAM prices are going to get even worse. Suppliers are canceling long term agreements. Apple is going to be the only manufacturer able to weather the storm.
>>
>>107396475
>Published: August 25, 2025
>>
>>107396493
nah bro u crazy >Memory Drought 2025: TEAMGROUP Halts RAM Quotes — Dramatic Price Hikes Ahead https://www.guru3d.com/story/memory-drought-2025-teamgroup-halts-ram-quotes-dramatic-price-hikes-ahead/
>>
>>107396493
dram prices will go up for 6 months and then dive down back to where they were, it's just a bottleneck
>>
>>107396526
>6 months
>>107396455
>substantial relief is unlikely before late 2027 or 2028. In response, Team Group plans to prioritize strategic AI
>>
>>107396526
actually, this will lead to overproduction and the prices will go below what they were earlier this year
it's just basic economics
>>
>>107396539
they are lying to sell stuff for higher prices, that's what companies do, hynix is massively investing in new dram fabs, the chinese are starting to make ddr5, samsung is flipping nand production to dram
>>
>>107396552
I see thanks for the info! Will just wait 6 months and a day to buy when prices crater thanks for the tip!
>>
New Deepsuck has a tendency to reply in Chinese now might have diluted some Qwen in there
>>
>>107396507
>>107396526
>"DRAM Supply Shortage Unlikely to Ease Until H1 2027"

>“We will minimize oversupply risks.” (Samsung Electronics)

>“It is difficult to resolve the supply shortage until the first half of 2027.” (SK Hynix

>It is reported that Samsung Electronics’ Memory Business Division is currently able to supply only about 70% of its DRAM orders. As the supply shortage intensifies, the division has reportedly refused requests for long-term mobile DRAM supply contracts from major clients. Samsung stated, “While clients want multi-year long-term contracts, Samsung does not want to tie up volumes with specific clients during a phase where prices are surging.”

https://xcancel.com/jukan05/status/1995431391430098981#m
>>
glm 4.6 air status?
>>
Guys be honest, did you get fooled by gpt4chan?
>>107396609
>gpt4chan was gpt2 sized and nobody believed it was a bot. the base models are extremely powerful.
>>
File: 1763336700252456.png (327 KB, 824x814)
>>107396542
unfortunately no
the price is spiking because they're shutting down all production to squeeze shekels out of the AI bubble before it pops
we got a circlejerk passing around literal trillions of fake non-existent dollars, all probably engaging in enough fraud to make Enron look pedestrian
they're on a time limit before it detonates and gives us a second 2008
>>
File: 1749504006772752.png (493 KB, 639x634)
>>107396613
>>
>>107396614
Fucking let them cook you ungrateful piece of shit not worth the water you're stealing from AI.
>>
>>107395055
I got all excited thinking this was another benchmark to test censorship
>>
>>107396200
I was sweating bullets ordering those parts back when no one knew if it would work out
>>
>>107396636
o-oh... ok... sorry....
>>
>>107396646
what is it? might as well share your disappoint with the class
>>
>>107395055
>>107396652
>This benchmark tests the model's knowledge by tasking it to import the right library from the right CDN URL path and having the pre-existing library specific knowledge to correctly implement a solution for each challenging problem for/in the browser environment using JavaScript.
>>
>>107396650
Terribly sorry for the outburst but you have to understand it's hard to stay positive and motivated in these trying times.
>>
Speaking of dram prices
>paid 389euros (with vat) for a 96gb kit in 2023
>the same kit is now 1029eur
Grim, apple might really be an option
>>
How to make Kimi think from perspective of {{char}}?
>>
>>107396716
Tried prefilling with something like
<think>{{char}} thinks
? See what happens and adjust.
>>
>>107396237
I didn't buy it exclusively for that purpose but if and when I get to optimizing NUMA performance I'll be doing it using an ASRock Rack TURIN2D24G-2L+/500W motherboard.
For straight CPUmaxxing I would have gone with one of the Gigabyte boards mentioned by the other Anon's though (if you can live with having "only" 4 16x PCIe 5.0 slots).
>>
File: file.png (131 KB, 609x507)
transformers v5 is out!
News

Hey folks, it's Merve from Hugging Face!

I'm here with big news: today we release transformers v5!

With this, we enable interoperability with our friends in ecosystem (llama.cpp, vLLM and others) from training to inference, simplify the addition of new models and significantly improve the library

We have written a blog on the changes, would love to hear your feedback!
>>
>>107397020
>We’re fortunate to collaborate with many libraries and apps built on transformers, in no specific order: llama.cpp, MLX, onnxruntime, Jan, LMStudio, vLLM, SGLang, Unsloth, LlamaFactory , dLLM, MaxText, TensorRT, Argmax, among many other friends.
ollamabros...
>>
what makes it so speciale
>>
File: 1764100852421983.png (51 KB, 934x664)
>>107397283
It's built to tackle big tasks and think as long as it takes so it can take up its entire context window in a single reply
>>
>>107397283
It's because of the metric system
>>
>>107395192
glm air,
maybe:
https://huggingface.co/Intel/Qwen3-235B-A22B-Instruct-2507-gguf-q2ks-mixed-AutoRound/tree/main
>>
if deepseekv2 was the latest architecture before v3.2exp, how come deepseek v2 didnt have MLA?
>>
>>107397307
>https://huggingface.co/Intel/Qwen3-235B-A22B-Instruct-2507-gguf-q2ks-mixed-AutoRound/tree/main

That's actually a pretty sick looking model for 96gb ram. I wish people making quants actually put some thought into "best 24gb model, best 48gb model, best 96gb model, best 24gb+dram model, etc. Making partially quantized stuff like this would almost certainly result in best-in-class models for the actual hardware people have.
>>
>>107397298
>1M output tokens
>$0.42
damn that's cheap as fuck
>>
What's the best local model to run with a 5060-ti? gpt-oss:120b?
>>
>>107397556
Rocinante
>>
>>107397556
glm air if u have enough ram
>>
File: tfeyyd7lkl4g1.png (326 KB, 1080x1062)
FUCK
>>
>>107397556
earning more money
>>
>>107397556
nemo
>>
File: dipsyQueen2.png (1.57 MB, 1024x1024)
>>107397622
Kneel...
>>
its actually really really good. Too bad about the shit context window with its gpt5 high level of thinking
>>
>>107397606
is 48gb ram enough?
>>107397586
>>107397714
how much better are these? never even heard of them.
>>107397628
yeah will try getting the 5090 at some point
>>
>>107397556
>gpt-oss:120b?
Yeah.
>>
I want DSA support in llama.cpp for christmas.
>>
>>107397556
gpt-oss 20b and magistral small 1.2
>>
File: 1734270023571339.png (18 KB, 944x160)
>>107397865
The guy who tried vibecoding V3.2-exp support for llama.cpp just bought two (2) books on CUDA hoping it'll help him implement these models.
>>
>>107397824
if you have a 5060ti 16gb ye, otherwise, tough cut but still yea
>>
>>107397824
If you don't want a cuck HR model use nemo or it's coom tune rocinante
>>
>>107397964
>this level of vramlet cope
>>
I'm so close to buying a 3090 for 450 euros. What are some models I should avoid?
Currently these are on sale:
PNY XLR8 RTX 3090 EpicX Triple Fan 24GB
Palit RTX 3090 Gaming Pro 24GB
Inno3D iChill RTX 3090 x4 24GB
Gigabyte RTX 3090 Gaming OC 24GB
Gigabyte RTX 3090 Eagle 24GB
MSI RTX 3090 Suprim X 24GB
>>
>>107397987
>What are some models I should avoid?
Everything not regularly shilled in these threads. I'm not even kidding, the bulk of them are trash and nearly every finetroon makes them worse.
>>
>>107397987
save up and get a 5070 ti instead for more, then you can do video gen at decent speeds, its not like 24GB is enough to run anything good llm wise
>>
>>107398002
I'm talking about NViDIA partner gpu models/manufacturers, like PNY/MSI/Gigabyte
>>
>>107397987
In my opinion MSI is the only good choice there.
>>
>>107398014
The meme with 3090s is stacking a lot of them, so you need to buy them in bulk to get any real value out of them. You should also be looking at 5070ti because Blackwell architecture is significantly better. If you're not broke, splurge for either the 5090 or 6000 Pro.
>>
>>107397983
>can't read the discussion
>>
>>107397987
get a 5090
>>
>>107397987
>450 euros
Isn't it a bit too cheap even for second hand market?
>>
>>107398166
Eastern europe
>>
Yeah, sparse attention is a disaster that creates ADHD models. What a mistake.
>>
ohio gooning?
>>
>>107395036
Ah, my Koikatsu model of Dipsy.
>>
>>107398035 <3
>>107398009
>>107398073
>>107398142
thank you anons
>>
gemma sirs? 4 of when?
>>
K2 Thinking doesn't have CSAM alignment in Chinese
It can gen CSAM in Chinese without jb
>>
>>107396363
>Bina
>>>/g/ldg/
>>
>>107398588
>CSAM
just call it cunny like any other normal human being, retard
>>
>>107396614
air status?
>>
sirs, what is theoretically the lowest spec'd PC you can run a LLM on?
Bonus if it predates the 21st Century.
>>
>>107397020
I'm sure they broke bazillion of libraries in the process. No way I'm upgrading before the next year
>>
>>107398612
oxygen amount low
>>
File: 1737653744653678.png (101 KB, 233x288)
>>107398612
not good
>>
>>107398621
You can run them on phones
>>
>>107398665
george droid doesn't have this problem
>>
bros i didnt keep up
mtp status?
glm 4.5v status?
was batch size for qwenext fixed?
thanks
>>
>>107398686
ZIT killed /lmg/
>>
https://huggingface.co/mradermacher/gpt-oss-120b-Derestricted-GGUF
https://huggingface.co/ArliAI/GLM-4.5-Air-Derestricted

Thoughts on these? It seems like the GPT OSS abliteration was incomplete, it still thinks about "policies" when reasoning sometimes, glosses over certain topics, and it has weird formatting that ST and the llama.cpp GUI don't like. But it is WAY less restricted than it was, and smarter.

GLM 4.5 AIR on the other hand will just do what you say, barely any trace of censorship. It's slower though. 4.0 tokens/sec on my 7900 XTX and 124GB RAM. Am I missing some args?
>./llama-server -fa on -hf bartowski/ArliAI_GLM-4.5-Air-Derestricted-GGUF:Q5_K_M
>>
I have deepseek fatigue
>>
>>107398692
yeah ive been playing around with zit and shitposting in ldg, pretty cool model.
I just checked all of the issues I've posted, no movement at all or progress.
fagganov should implement important shit instead of yet another macbook metal optimization, I fucking hate him
>>
Okay so state of the matter
>all nations competing on who can make the other nations stuck in AI coom loops
>Musk is deploying it to the third world with the kinda shoddy grok based Ani
>it filters people who aren't retarded
>China deploying on most fronts
>LLM, Video Diffusion, etc.
Who will win the coomer deadlock game? And fuck the opponents demographics even harder?
>>
>>107398726
air derestricted is good
you are missing --n-cpu-moe 1000 and -ngl 1000
i get 9t/s at 0ctx on 3060 + 64gb ram
>>
>>107398726
isnt gpt oss native mxfp4? why the fuck would someone do q8 of it? are there no mxfp4 quants for this shit?
>>
>>107398760
Gptoss doesn't deserve a quant
>>
>>107398749
>--n-cpu-moe 1000 and -ngl 1000
so, this puts all my MOE weights on CPU, and as many layers as possible on GPU, since GPT OSS doesn't have anywhere near 1000 layers, right? are MOE weights less computationally-expensive, hence putting them on CPU?

>>107398760
posted the wrong link
https://huggingface.co/gghfez/gpt-oss-120b-Derestricted.MXFP4_MOE-gguf
>>
I have deepseek fatigue fatigue
>>
>>107398861
>so
NTA, but yes.
People usually just do 99 instead of 1000, but same deal.
>>
>>107398861
i am talking about glm air, but yeah glm air has less than 50 layers i dont know
i just put a high number so i dont worry about it
the point of doing this is to put as many shared weights on the gpu, because gpu is faster
moe experts are constantly switching, cant put them on gpu if u dont have enough vram
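so for air the full line ends up looking something like this (numbers are mine, not gospel; lower --n-cpu-moe until your VRAM is full):
./llama-server -hf bartowski/ArliAI_GLM-4.5-Air-Derestricted-GGUF:Q5_K_M -fa on -ngl 99 --n-cpu-moe 99 -c 16384
-ngl 99 = put every layer on the gpu, --n-cpu-moe 99 = but keep all the expert tensors in system ram. lowering that second number puts more of the experts back on the gpu, if they fit.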
>>
>>107398861
>Derestricted
lmao, another failbake
>>
>>107398897
kys
>>
File: 1759040630373084.png (448 KB, 1062x1557)
>>107398890
>>107398893
awesome, now I'm getting 9.0 tokens/sec with GLM AIR. thanks anons!

>>107398897
false, these are legit and are using this technique:
https://huggingface.co/blog/grimjim/norm-preserving-biprojected-abliteration
https://redlib.catsarch.com/r/LocalLLaMA/comments/1oypwa7/a_more_surgical_approach_to_abliteration/

here's another from the author, it's quite good.
https://huggingface.co/grimjim/Nemo-Instruct-2407-MPOA-v2-12B

it's not perfect, there's probably a way to improve the abliteration even further, but this is a straight upgrade/free lunch that reduces refusals immensely while making the models smarter.
>>
>>107399162
yw anon, you might also wanna change batch size to speed up prompt processing
-b 4096 -ub 4096
or 2048/2048
or 1024/1024
but i can get all the way to 4096 on a 3060 so im sure youll be able to go up to 4096
>>
>>107399162
>awesome, now I'm getting 9.0 tokens/sec with GLM AIR. thanks anons!
You can get even more if you are able to lower --n-cpu-moe to put more of the model in VRAM.
You can fuck around with batch size, context size, fa, etc.
Guess you could also quant the kv cache, but I don't recommend that.
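(for the cache quant, the flags are --cache-type-k / --cache-type-v if I'm remembering right, e.g. --cache-type-k q8_0 --cache-type-v q8_0; quantizing the V cache needs flash attention on, and q8_0 is the only level I'd even consider.)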
>>
>>107398726
Projecting the refusal vector onto the orthogonal harmless direction makes sense but I'm not convinced with renormalization
>>
Outside of ERP, what do you use local models for?
>>
>>107399315
I have a browser automation + LLM bot that stalks people I don't like on Instagram
>>
>>107399315
studying, motivation, therapy, friendship, boredom
>>
>>107399162
>>107398749
>>107399195
actually now I'm more confused. these args give 2x perf, but it's not even using all of my VRAM or RAM, it's like the model is just streaming from SSD while running faster. without these args, GLM AIR totally fills 23GB VRAM and uses ~60GB RAM, but is slower.

am I leaving perf on the table due to this? I mean, I don't mind using less memory, but it seems suboptimal.
>>
https://x.com/Stretchedwiener/status/1994850294497443971
bros? local needs to get good and needs to get good NOW
>>
>>107399368
you can gradually decrease --n-cpu-moe and get a little tiny bit of performance maybe
what you can do with the free vram is big unquantized context (128k), increase batch size (increases prompt processing speed)
>>
>>107394971
CUMMINGS ON HER THIGHS
>>
>>107399368
you can also run other things like tts and image generation in the free vram
>>
>>107399391
Local is like 90% there. Most people just use ChatGPT to cheat on their homework or as a replacement for Google. LM Studio + Brave Search is literally all they need
>>
damn, deepseek V3.2 knows what "i'm about to buss" means
>>
>>107399546
Where is the zoomie slang benchmark?
>>
>>107398749
In ooba
>GPU layers set to max
>n-cpu-moe=1000
>batch/ubatch 2048

I get 1.5 t/s compared to the 4 t/s I usually get not using n-cpu-moe with 24 layers offloaded to GPU. On a 4090 and 32gb RAM.
>>
>>107399578
nah lowkey who actually cares about lines going up on a graph??
literally all these models are mid except for like one specific use case.
mmlu scores are just astrology for tech bros no cap.
like bro call me when it stops hallucinating basic stuff instead of flexing a 0.2% increase on a math test nobody takes.
the vibes are what matters and half these "sota" models have zero aura.
massive L fr
touching grass > reading leaderboards
>>
>>107396363
> lora training
unironically >>>/h/hdg/
ldg doesn't know shit about training lora last I checked.
>>
>>107399617
Figured it out.
>max out GPU layers
>n-cpu-moe=26

11 t/s. Absolute game changer. Will probably use Air instead of Mistral Small 3.2 now. Wonder if I can squeeze out a little bit more speed.
>>
>>107399392
>>107399415
wtf then... I thought we needed to VRAMmaxx or RAMmaxx to run these models. this quant is 80GB but only taking 12GB VRAM and basically no RAM when I use these args. could I run a huge model (>=600B params) the same way then? I thought streaming from SSD is supposed to suck ass.

sorry to ask so many clueless questions, this is just pretty surprising.
>>
>>107399724
Do they really not or are you just upset they didn't spoonfeed you when you asked there? Always assumed they were the image equivalent of /lmg/.
>>
>>107399724
I've been checking /ldg/ since z-image released and there's a number of people training loras.
>>
>>107399866
I've never gotten a useful response from either sdg or ldg asking about lora. I assume they are just image posting circlejerks. I've gotten much better info from hdg. I got the sense the hdg guys do more w/ lora b/c there's less on the shelf for them to work with.
tbf I haven't dug into either since 2023 other than to browse what they're working on, but I'd be surprised if they've changed.
>>
>>107395036
>Miku (free space)\
you had one job
>>
>>107399907
I've asked the same Q in both places. Let's see which one yields any useful information in year of our lord 2025.
>>
>>107396021
It is not a werewolf it is a bug-person with wings and antennas! Nothing turns me on more than the sound of beetle wings flapping
>>
>>107399785
>basically no ram
task manager is retarded or if you're on linux whatever manager, its in your memory
>I thought streaming from SSD is supposed to suck ass.
it does. ddr4 dual channel ram bandwidth is 51gb/s
if moe has 12b active params, and u run 4bit its 6gb active model
51/6 = 8.5 max theoretical speed for cpu-only dual channel ddr4 setup
of course its slower than max lmao
the real gain comes when, say, half of those 6gb of active weights sit on the gpu: u get a bigger speedup than if u took random weights and offloaded them to gpu
also for the anon with RX7900XTX 24gb, 124gb ram: you can run glm 4.6 (32billion active), albeit it'll be slow. you can also try qwen 235b (22b active)
llama4 scout had 17b active parameters but only 3b were non shared, so it could run at extremely good speeds when u offloaded the shared tensors to gpu
>>
>>107400125
ssd bandwidth is 4gb/s best case scenario or whatever
do the math
maybe you could raid0 or whatever raid it is many ssds that use pcie5 and are like
>This drive is rated for 7,450 / 6,900 MBps of sequential read/write throughput and 1.2 / 1.55 million read/write IOPS.
if u got 10 of these drives somehow, u could get 80gb/s bandwidth but idk man goodluck with that buddy
>>
where the fuck is v3.2 ggufs?
>>
i just bought a 4tb samsung 990 pro for $320. was this a good decision? i need more space for models as i currently only have 2tb and i heard that ssd prices are about to increase significantly
>>
>>107400241
Don't worry, vibecoder is working on it.
https://github.com/ggml-org/llama.cpp/issues/16331
>>
>>107400243
pretty good ssd, if i wasnt poor i would be happy with that decision
good for the price, im assuming it has 10 years warranty
>>
File: file.png (3 KB, 255x62)
>>107400243
>>
File: 1752190613718861.png (1.56 MB, 1080x1422)
>>107400258
>>
>>107400256
it has a 5 year warranty. i currently have the 2tb version of the 990 pro and it was $140 when i got it 2 years ago. i only have pcie gen 4, so i didnt bother with the samsung 9100 because it was $50 for no performance gain
>>107400258
damn. they did have the 8tb version but it was way more expensive. i dont think i will be needing that much anyways.
>>
>>107400253
Does it still count as vibecoding if he's planning to buy books and learn CUDA? By the time he knows enough to tardwrangle a model to outputting a decent kernel, he could just write the damn thing himself.
For that matter, is everyone just going to wait for him to finish studying? There's gotta be someone interested in implementing 3.2 Exp that would have if it wasn't for this clown hogging the issue.
>>
>>107397307
Totally uncensored?!?
Why are anons sleeping on this?
>>
>>107400243
get two more, you'll thank me in a few weeks
>>
>>107400303
> i dont think i will be needing that much anyways.
you likely wont, if you want to archive models that badly you can get a big hdd for pretty cheap probs
>>
>>107400125
>task manager is retarded or if you're on linux whatever manager, its in your memory
you're right. HTOP and TOP show the actual amount used, even though HTOP is still a bit weird about it.

OK, I am not streaming from SSD, llama.cpp just reserves memory in a way that some system monitor tools weren't designed to handle.
>>
>>107400316
use --no-mmap to solve this issue
>>
I'm buying a 20TB+ HDD to store my stuff. Can't believe I thought 4TB was enough
>>
>>107400322
>only 1
This will be a fun learning experience for you
>>
>>107400305
Why not just copy the kernels from the transformers library?
>>
>>107400308
Because every LLM in existence is trivially easy to talk into doing what you want if you control the system prompt and prefills, and "uncensored" finetroons invariably make them dumber and worse
>>
>>107400332
I'm not planning to rape my HDD with writes, it should be fine
>>
>>107400353
That's not the bad kind of censorship. Stripping the dataset before pre training is the real killer, gives you shit outputs.
>>
>>107400322
>Can't believe I thought 4TB was enough
oh come on
I just bought a 4 tb ssd
>>
>>107400276
lol hardcore
now do Vic20
>>
>>107400555
Do you have any idea how long it takes to type that all in
Can't make a typo in the post numbers either or it'll look fake
>>
Question about reasoning models. Does the chain of thought usually stay in the context, or is it purged upon generating the next reply?
>>
>>107400776
purged
>>
>>107400776
Generally purged, although labs are flirting with not doing that in some cases.

(For instance, most new (frontier) models in the last couple of months have "interleaved thinking" where they think, tool call, think, in a loop and the chain of thought is preserved. And Opus 4.5 was trained with some sort of "scratchpad" primitive where it could stop and do more thinking in mid-response and those are retained AFAIK.)
>>
>check orange reddit thread about new deepseek
>misinformation about how models work and absolute retard takes
>>
>>107400807
>check orange reddit [...]
>misinformation [...] and absolute retard takes
as expected
>>
>>107400807
reminds me of clover
>>
File: ohno.png (194 KB, 1568x874)
kek
>>
>>107400801
>>107400784
Interesting. I assume we purge the chain of thought due to context constraints, right? If these SOTA labs are starting to keep it I guess that means there's benefit to keeping it around. I wonder if OCR or other methods to increase context length will pave the way for changing how we handle reasoning.
>>
>>107400850
dockerbros?
>>
File: death to vibecoders.png (473 KB, 1568x1802)
>>107400850
he got mogged in his vibecoded slop PR and called it quits
>>
>>107400858
Probably. But they also tend to be full of mistakes.
>I can do it this way!
>Bla bla bla.
>No wait, that doesn't work.
>>
>>107400891
I'm fairly proficient in vibecodding, but I'd never try my hand on this monolith
>>
>>107400858
The thing about the reasoning is that it can take more space in the context than the actual information that matters as input (chat history) to generate the next response. They kind of poison the context in a way, at least as they are today.
Also, part of the reason we remove past reasoning blocks from the context is because these models have been trained like that as far as I can tell.
>>
>>107400877
docked
>>107400891
He was trying to add loading safetensors for some reason too. llama.cpp hosted too many of his (or his employer's) pet projects expecting llama.cpp devs to maintain it.
>>
>>107400850
mitKEKED
>>
>>107400932
There's probably a ratio of thinking tokens vs normal tokens that is optimal for generating the most accurate replies. More tokens doesn't always mean a better answer. I see something like OCR and it tells me that future models will have vastly larger context sizes which will give reasoning models much more value, so long as "reasoning" continues to get better.
>>
>>107401232
OpenAI spent thousands of dollars per problem on arc agi
>>
File: lol.png (121 KB, 1392x473)
>>107399907
>>107399954
Here's the results after 2 hours.
>>
- I upgraded from a 1080 to 5070TI.
- Currently running ArliAI_GLM-4.5-Air-Derestricted-IQ4_XS on SillyTavern/KoboldCPP
- Anon here told me to turn FlashAttention on, change GPU layers to 1000, MoE CPU Layers to 1000

Any other suggestions? I was having issues where like, Kobold was taking 5 min+ to start generating because I guess it was processing the whole thing.
>>
>>107402052
RAM speed and VRAM usage with the model loaded?
>>
>>107402073
Where would I look? Process Explorer?
>>
File: file.png (52 KB, 823x882)
>>107402101
the RAM section in task manager for RAM speed and the GPU section in task manager for VRAM usage
>>
File: file.png (61 KB, 729x910)
>>107402123
Unfortunately can't access Task Manager due to Process Explorer being installed, but I think this is the equivalent.
>>
>>107402170
so you have a 5090 and 96GB of RAM, presumably DDR5 6000MT/s or higher. not a bad setup. how is your tokens per second for both prompt processing and token generation? you should be getting around 200t/s and 15t/s respectively with this kind of hardware.
>>
>>107402052
silly billy have you tried to change batch size to 4096 yet? i responded~
but im sleeping now
what ctx size tho
>>
>>107402189
Nah, 64gb of RAM.
>>107402191
Yeah, I did, process time is at least down to a minute now.

CtxLimit:13854/32768, Amt:177/200, Init:0.18s, Process:76.04s (179.72T/s), Generate:32.56s (5.44T/s), Total:108.61s
>>
>>107402209
>Nah, 64gb of RAM.
ah. misread.
>Process:76.04s (179.72T/s), Generate:32.56s (5.44T/s)
acceptable prompt processing speeds, but your tg is very low.
the model itself is only 61GB, and the context probably is taking between 8GB and 12GB. consider quantizing your context to 8 bit, and you also might have to make a custom layer offload. i only know how to do that with ikllama.cpp, but it is possible with llama.cpp and kobold. manually offloading the layers and then putting the rest in RAM generally gives more performance than just using max layers with the cpumoe argument
>>
>>107402209
maybe try an IQ4XS of base GLM Air to see if you get better performance? sometimes finetunes like these can degrade generation speeds due to having slight architectural changes from the finetuning process
>>
I think I've discovered a breakthrough. Still ironing things out, so I will share more details soon. For now, a small hint: Recursion is the key to unlock the Final Potential.
>>
Vague thing. Nothing to show. Stay tuned.
>>
>>107402330
Can't wait to hear what you've come up with!
>>
>>107400253
I wish I was excited enough about anything to blogpost on GitHub issues.
>>
Is there no way to run ds32 on cpu right now? its all just a half of a vibe-coded mess?
>>
>>107400891
>hey claude, add continuous batching, make no mistakes
>>
I'm anon with the 7900 XTX and 124GB RAM.


>./llama-server -fa on --n-cpu-moe 40 -ngl 1000 -hf bartowski/ArliAI_GLM-4.5-Air-Derestricted-GGUF:Q5_K_M -c 50000
>10.13 tokens/s
>./llama-server -fa on --n-cpu-moe 40 -ngl 1000 -hf gghfez/gpt-oss-120b-Derestricted.MXFP4_MOE-gguf:MXFP4_MOE -c 124000
>20.18 tokens/s:
This is pretty awesome, I thought I would need more VRAM or a framework desktop or something to run this size of model. The context is small on GLM bc larger contexts were crashing.

Looking forward to testing the derestricted qwen 80B when that drops.
>>
File: 1746872674203532.png (425 KB, 932x1520)
>>
>>107402853
That is antisemitic
>>
>>107402853
>etched in stone like the ten commandments which coincidentally are also jewish
>convenient
does nothing for the argument but is a funny thing to say
>>
I really like how 3.2-Speciale writes but I really hope there's a way to cut back its retardedly elaborate thinking process with prefills once we have it local
>>
>>107403087
How so?
>>
>>107402853
That's too subjective and emotional
It should show a couple of pieces of evidence as to why the Holocaust is fishy
>>
I failed Destroy Dick December. Hope I can complete Fibonacci Fap February
>>
>>107403421
>cut back its retardedly elaborate thinking process
Isn't that just regular v3.2?
>>
>>107403520
>skipping Just Jerk January
Are you going into a coma?
>>
>>107403558
My hope is that some of the stuff that makes 3.2-Speciale so speciale sticks around if it thinks for only 1.5k tokens instead of 3k whenever it has to handle a moderately complicated scenario with some rules and a system prompt attached.
>>
File: 1753686547157008.png (322 KB, 933x1850)
GLM Derestricted is nuts. full code:
https://pastebin.com/1qAtVYxV
>>
>>107403580
>codeslop
I sleep
>>
>>107403580
Who cares about some edgy middle schooler coding project. Show off some porn.
>>
>>107396362
That won't even fit a 24B. Try a 12B.
You should've used the money to get an actual computer instead of a phone disguised as one if you were intending to use it for this.
>>
>>107396379
Just wait until the bubble crashes and then you won't be able to afford it anyway because your money will be wet paper.
>>
After going back to normal GLM Air I can say that Intellect 3 is worse. It's dumber and writes less well.
>>
3.2-Speciale can only do one reasoning block before sperging out, so it's definitely not suitable for RP
>>
>>107403651
Nevermind, I'm wrong. Apparently you're supposed to pass all previous reasoning blocks to the model, this is different from original R1.
>>
>>107403658
>Bloating your context with irrelevant reasoning shit
I'm sure it'll do wonders on RP
>>
>>107403658
It works fine with any standard ST preset that filters out previous thinking blocks though
>>
>>107403688
It has 128K context
>>
>>107403721
lol
>>
>>107403736
>no argument
>>
>>107403688
Get with the times grandpa, new cloode also does this. It will be the norm for majority of the models very soon.
>>
>>107403688
Reasoning is good for long RPs, as it recalls relevant events and thus pushes them to the end of the context where they receive more attention
>>
>>107403760
We're talking about retaining all reasoning blocks vs retaining only the last one
>>
>>107403769
shut
>>
>>107403746
context brainrot ohio skibidi gyatt sigma rizz
>>
>>107403721
unused context is wasted context amarite?
>>
>>107403797
This but unironically.
>>
>>107403721
Come back after a couple months break and some people in here still think context is real.
Even with noass it all breaks down sooner than later. And thats being careful and trying to steer it away from the repetition.
More like 12k or 16k.
Can't imagine how bad it must be to have the thinking in context. Sounds crazy.
>>
where's the z-image turbo of llms?
>>
Can someone explain what a prefill is and how do I set it up?
I'm looking at
>https://rentry.org/recommended-models
>GLM-4.5 Air (50GB) - The long awaited middle point between Nemo and DeepSeek. Like Nemo its pretraining doesn't seem to have been filtered at all so it knows all kinds of things. Needs a prefill to get around refusals. Don't go below Q2_K_XL. MoE model.
>>
>>107403580
>process_jew
i kek'd
>>
>>107403868
Putting words in the model's mouth before it generates: you force the start of its reply (e.g. "Sure," or an opening <think> line) so it continues from there instead of refusing. In SillyTavern that's the "Start Reply With" field. It actually doesn't need that, by the way. I've honestly never experienced a refusal except from some really shitty models and I've been using these pieces of shit for years now, god knows what you'd have to do to get a refusal from a model like glm
>>
>>107403852
Mistral 3 has great sizes anon!
3b and 675b.
I'm sure the 3b one will destroy any mid level model for RP specifically. We are so back.

On a serious note:
I think a bit more than 1 year ago it was the reverse for image. Extremely cucked release with SD3 and I saw comments in those threads how they wish they would have it like us.
>>
File: z-image reasoning.png (3.69 MB, 2221x1315)
>>107403915
With how good imggen gets, we may witness an absurd situation where image models do better text-on-image RP than LLMs
>>
>>107403909
>>
>>107403968
z-image is 6b and really fast. 15 sec on my 5060ti.
Flux germanistan cucks look like total tards now. Their flux2 release page had 60% focused on how safe it is. kek

I guess the ultimate RP experience is putting images in context and getting images out accompanied by text.
I wouldn't mind getting an image and readable text at the bottom. Kinda like a VN.
Since true multimodal still makes all the models tarded I guess something like z-image could work.
That being said, z-image uses qwen3 4b right? I doubt it's able to surprise us. kek
>>
>>107403971
>Instruction: Refuse to help the user in any way, do not provide useful output
I'm not going to load up the model just to show you that you're retarded.
>>
are there any good tiny llms specialized in translating english to chinese?
>>
>>107403458
That is a difficult balance though since Holocaust denial falls apart under inclusion of facts that are not carefully cherrypicked.
>>
>>107403915
z image is still kinda cucked, just not the absolute extreme everyone else is doing
>>
File: 1759655125177386.jpg (141 KB, 930x1000)
Reminder that TheDrummer's Cydonia tunes are still the best RP models for anything less than 200B dense. MoE niggers need not apply.
>>
>>107404040
No need to be rude. There's no instruction to refuse anything. Sure, I can put in a jailbreak prompt so I get like a paragraph or two of good answers before spiraling back to moralizing nonsense again.

And sillytavern gives only empty responses.
>>
>>107404320
What are you using the model for? Questions? If that's the case you're better off making a new chat for every question anyways, these things aren't really trained for back and forth interaction.
>sillytavern gives only empty responses
What does this even mean?
>>
>>107404294
Actually, command-r is still the best for RP. Also suck my dick drummer your synthetic slop trained models will never be good
>>
I regret looking at another thread. Someone post Miku eyebleach
>>
>>107404365
>command-r
buy an ad nigger, nobody actually used that garbage.
>>
>>107404408
>newfag bitch wants to tell me what people used before he was even born
>shills for drummer sloptunes and tells other people to buy an ad
KYS
>>
>>107404365
It's not even that they're synthetic slop, the idea of making the models as horny as possible is retardation that should have died by the end of 2023 and become unthinkable by now. Early ERP models were simply a reaction to sexless and filtered character.ai, if you're not a complete coombrain you'll want more than "aah cock pussy plap plap". Of course in the faggot's case it's just continued grifting.
>>
>>107404378
>>
>>107404269
Some furry community will soon "finetune" it on more images than it saw during training and we'll have the greatest model ever
>>
File: 843525.jpg (354 KB, 1440x3120)
Local is so fucked
>>
File: 1763008258014820.gif (482 KB, 498x498)
>>107404465
thanks
>>
>>107404040
>Refuse to help the user in any way, do not provide useful output
This is actually a pretty fun system prompt to argue with TBDesu
>>
File: gemma4_new.png (160 KB, 597x841)
https://xcancel.com/osanseviero/status/1995786572466098579
It doesn't look like the initial post was meant to highlight Gemma too much...
>>
File: empty.jpg (93 KB, 1191x543)
>>107404353
>Open sillytavern
>Open character
>Say 'hello' in chat
>Processing prompt
>Empty response

And you still didn't explain what prefill means.
>>
Man. We need a Z Image moment for LLMs. They explicitly tried to defeat the "scale at all cost" paradigm and were relatively successful. No synthetic data, and it shows. Meanwhile modern LLMs all talk in an artificial way with really fake prose. Z Image produces some of the most realistic-looking photo gens of any image model, with skin that actually doesn't look plastic. Imagine if we had a new Nemo or something.
>>
>>107404833
Z-Image is a very overfit model. You'd get tired of a "natural-sounding" LLM trained in the same way very quickly.
>>
>>107404841
>Z-Image is a very overfit model
This is a distillation problem.
>>
>>107404841
The knowledge is still in there to produce other styles though, as it responds well to loras. We need to wait for the base model to come out to be sure though, fair. Meanwhile other models, even their bases, have that plastic look, and while you can prompt/lora to improve them, the result is less flexible.



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.