/g/ - Technology

File: date with miku - good end.png (1.47 MB, 1024x1512)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>106551921 & >>106539477

►News
>(09/11) Qwen3-Next-80B-A3B released: https://hf.co/collections/Qwen/qwen3-next-68c25fd6838e585db8eeea9d
>(09/11) ERNIE-4.5-21B-A3B-Thinking released: https://hf.co/baidu/ERNIE-4.5-21B-A3B-Thinking
>(09/09) K2 Think (no relation) 32B released: https://hf.co/LLM360/K2-Think
>(09/08) OneCAT-3B, unified multimodal decoder-only model released: https://onecat-ai.github.io
>(09/08) IndexTTS2 released: https://hf.co/IndexTeam/IndexTTS-2

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
►Recent Highlights from the Previous Thread: >>106551921

--Optimizing code generation workflows on V100 GPUs with MoE models:
>106555312 >106555465 >106555506 >106555522 >106555524 >106555586 >106555717 >106555770 >106555782 >106555852
--Best local text gen models and VRAM optimization discussion:
>106556580 >106556863 >106556934 >106557638 >106557036 >106557046 >106557069 >106557098 >106557239 >106557514 >106557190
--AI surpasses mathematicians in complex analysis challenge:
>106558352 >106558367 >106558387 >106558476 >106558500 >106558527 >106558711
--Baidu's ERNIE-4.5-21B-A3B-Thinking model release and performance evaluation:
>106554153 >106554580 >106555008 >106555170 >106555207
--Silero VAD v6 evaluation and comparison with Nvidia's MarbleNet:
>106557953 >106558064
--LocalAI vs OpenWebUI: backend model management vs frontend interface:
>106555093 >106555341 >106555529 >106558434
--Running 30B-A3B models on 12GB VRAM via expert offloading and quantization:
>106558134 >106558186 >106558210 >106558227 >106558238 >106558251 >106558293 >106558317 >106558341
--GPU layer differences in small vs large models due to parameter grouping and optimization:
>106553923 >106554094 >106554256 >106554362 >106554384 >106554458 >106556050 >106556200
--LongCat's strengths and MoE limitations in llama.cpp compatibility:
>106552000 >106552095 >106552267 >106554325 >106554412
--Achieving deterministic LLM inference through caching logic adjustments:
>106555106 >106555150 >106555169
--llama.cpp development updates and flash attention implementation considerations:
>106553388 >106553417 >106553890 >106555026 >106555040 >106555059 >106555061 >106555068
--Qwen3 Next release:
>106557806 >106557845 >106557853 >106557858 >106557903
--Miku (free space):
>106555337 >106554679 >106555530 >106555574 >106557190 >106558219 >106559139 >106559166 >106559181

►Recent Highlight Posts from the Previous Thread: >>106551925

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
File: 1671200815321009.png (1.27 MB, 883x1073)
>>106559374
I made the highlight reel again back to back boys
>>
File: 1739311776947199.png (1.61 MB, 1024x1512)
>>106559371
>>
>>106559395
https://www.youtube.com/watch?v=VcWAQ5a1NdI
>>
>>106559404
You could still help them out. I'm pretty sure vllm supports it now.
>>
My uni gives me access to Copilot Chat (GPT 5) and this thing is dumb as fuck, even with search. I think you people have lied to me about the big models being hugely smarter (narratively) than some 32b model.
>>
>>106559420
Yes, I could be doing a lot of things but everything has an opportunity cost.
>>
>llama.cpp still hasn't added support for qwen-next
https://github.com/ggml-org/llama.cpp/issues/15940
>vllm already merged in support last night
https://github.com/vllm-project/vllm/pull/24526
llama devs are hacks
>>
Qwen3 Next geejuff status?
>>
>>106559516
vllm devs only needed to bump their pytorch version or something.
>>
>>106559516
Maybe rewriting the entire ML stack in C++ wasn't such a good idea.
>>
>>106559555
should've used pure C, they probably don't need any of the OOP features anyway.
>>
>>106559598
PyTorch is written in C++ contrary to its name. Nobody is using C for good reasons.
>>
>>106559627
The performance critical parts are, but it's not like you can use PyTorch directly from C++.
>>
qwen 3 80b consensus?
>>
>>106559680
of course not. pytorch is literally a wrapper for libtorch which is in C++. you would use libtorch if you wanted to use C++. there's a lot more support around pytorch tho as it's far more accessible to people.
>>
>>106559516
>>106559555
maybe one has a gorillion dollars since it's used by llm companies and the other is a hobby project for consumers
>>
>>106559696
It's shit because there are no goofs
>>
File: 1676316942545183.png (6 KB, 208x242)
>>106559714
>he doesn't know how to run safetensors
>>
File: 1748050363563976.png (2.2 MB, 1328x1328)
>>106559696
>>
>>106559733
literally me
>>
File: MHHHHMMMMMM.png (70 KB, 252x218)
>>106559506
*CLAP EMOJI* CUDA *CLAP EMOJI* DEV *CLAP EMOJI* WE *CLAP EMOJI* ARE *CLAP EMOJI* ASKING
>>
I never understood how some of you have the hardware and talent to render AI videos and images that are realistic and good and yet you don't make full length porn videos
>>
>>106559780
Video models break down quickly past 5 seconds.
>>
>>106559780
Porn sucks, text is better. The mind is the most powerful sex organ. Unironically. t. man
>>
File: 1741181107861956.jpg (41 KB, 734x734)
>>106559780
>knowing how to read instructions = talent
>>
>>106559780
all the slop I posted in this thread took around ~8s to gen (praise be nunchaku devs)
>>
>>106559792
IIRC standard for brainrot tiktok videos is to have a cut every 3 seconds.
>>
>>106559824
Now try getting the model to maintain consistency across hundreds of 3 second clips.
>>
What options are people running to get speedups on MoE models? There was a way to offload only certain tensors to RAM in order to get a significant speedup. Is it ik_llama.cpp only?
>>
>>106559863
Who said consistency was a requirement?
>>
>>106559871
"overridetensors": "([2-8]+).ffn_.*_exps.=CPU"

That's what I use on kobold to run 30B A3B Q4_K_M on 8 GB VRAM / 24 GB RAM, the parameter is probably the same on llama.cpp (no fork needed)
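For mainline llama.cpp the equivalent flag should be --override-tensor / -ot, which also takes regex=buffer-type pairs, so something along these lines (model filename is just a placeholder, adjust the layer range to whatever fits your VRAM):
llama-server -m your-30B-A3B-Q4_K_M.gguf -ngl 99 -ot "([2-8]+).ffn_.*_exps.=CPU"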
>>
File: 1747328841336034.mp4 (2.54 MB, 1424x890)
>>106559780
about that.....
This one is really good
not full creation, but it's one of the better tools released yet
https://ebsynth.com/
>>
>>106559871
--cpu-moe is all you need
--n-cpu-moe 999 if you want to be fancy
>>
>>106559871
>>106558251
>--n-cpu-moe 37 --gpu-layers 99
Normal llama.cpp.
Obviously, adjust --n-cpu-moe as needed.
>>
File: joelHaver-thumbnail.jpg (29 KB, 296x440)
>>106559926
huh, I thought picrelated guy was tracing frames by hand
>>
>>106559943
What was that -ot thing I saw some anons use? It had a bunch of numbers after it.
>>
>>106559824
>>106559780
the issue is it will suck so why bother making it. The ass won't jiggle right, the blowjob won't have audio that's good, any gimmick you add to take advantage of ai will break the lora. And real porn will just look better. Probably better off deepfaking porn already made with enhancements

I have had success using vibevoice to clone a pornstar's voice and then have her talk for several minutes using infinite talker. An LLM wrote the script so I wouldn't know what it would say and I got my own personal vid from her, and it was uh... kinda good.
>>
>>106559962
With -ot you can target the specific tensors inside the model's layers using regex. --n-cpu-moe simply obfuscates all that much like -ngl does for whole layers.
One thing to keep in mind when using -ot like in >>106559925 is to not move the shared experts (if they exist) from VRAM, since those are always used.
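Concretely (assuming I remember llama.cpp's tensor naming right): routed experts live in tensors like blk.N.ffn_(up|down|gate)_exps.weight while shared experts use ffn_*_shexp names, so a pattern like -ot "ffn_.*_exps=CPU" catches only the routed experts and leaves the shared ones (plus attention and norms) on the GPU. Treat the exact names as an assumption and check what the loader prints at startup if unsure.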
>>
>>106559962
-ot was the only way to do the same thing before the --cpu-moe arguments were introduced.
>>
>>106559975
nah bro, you just have to search for it
https://litter.catbox.moe/110x2tu7sbg6hixe.gif
>>
>>106559979
>>106559984
Ah cool, thanks for the explanation. So --n-cpu-moe moves only the non-shared experts to the CPU? And --cpu-moe keeps *all* non-shared experts on the CPU?
>>
fuk me sideways, i wanted to try to use qwen3-next with vLLM and it seems it doesn't work with pipeline parallelism
>>
>>106560000
I'm only aware of --n-cpu-moe.
Maybe --cpu-moe is the same thing for koboldcpp, I don't know.
As far as I know, --n-cpu-moe also keeps the normal experts in CPU RAM.
You can run llama-server with the -h option to get more details.
>>
File: file.png (128 KB, 975x551)
llamabros...
>>
>>106560060
Note that "primary hardware" is always GPUs. That's because to anyone serious, "cpumaxxing" is as sad and absurd as "ssdmaxxing" is to us.
>>
https://allenai.org/blog/olmo2-32b
How did they manage to do it in just 32B?
>>
>>106559803
it really is in this day and age
western kids have been dragged down to the level of their 80IQ peers for two generations now
>>
>Qwen3-Next is trained on a uniformly sampled subset (15T tokens) of Qwen3’s 36T-token pretraining corpus. It uses less than 80% of the GPU hours needed by Qwen3-30B-A3B, and only 9.3% of the compute cost of Qwen3-32B — while achieving better performance. This shows outstanding training efficiency and value.
And it beats Qwen 3 32B + handles long context better than the 235B moe
pretty impressive stuff
>>
>>106560211
That's great. Would be greater if they expedited a Qwen3 Next Coder.
>>
>>106560211
It's native 256K context I think, without extending.
>>
>>106559998
Illya?
>>
>>106560248
yeah but the RULER benchmark is better on the Q-Next than the Q3 235B
>>
>>106557716
i don't work in an office
>>
>>106560211
Isn't Qwen3-Next 70B? Why are they comparing to Qwen3 32B and not other 70B models?
>>
>>106560274
It's 80B A3B.
>>
>>106560274
Supersparse MoE, 80B A3B
>>
>>106560291
>>106560283
Okay, so how does it compare to models that are around 80B?
>>
>>106560274
they compare it to every other Qwen3.
>>106560294
they did not bother comparing it to non-Qwen3 models.
>>
>>106560283
sqrt(80*3) means it's a copetitor to 16b models
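(For reference, that's the usual geometric-mean rule of thumb for MoE "effective size": sqrt(total × active) = sqrt(80 × 3) = sqrt(240) ≈ 15.5, i.e. roughly a 16B dense equivalent. A heuristic, nothing official.)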
>>
File: file.png (366 KB, 1006x1542)
>>106560274
Even Gemini is praising it.
>>
Qwen3-Max is such a disappointment that I have absolutely zero hope for 3.5
Alibaba truly is the meta of China
>>
>>106560294
It's faster :)
>>
>>106560320
Kinda funny how Max got completely overshadowed by Qwen3-Next.
>>
3bit? is that not bitnet?
>>
>>106560314
Gemini will praise anything
>>
File: 1615679846051.jpg (53 KB, 660x574)
LLMs seem like a competition between America, Europe, and China. Why can't Russia, Japan, or Korea compete despite being tech giants?
>>
>>106560327
Max got overshadowed by the fact that it's completely pointless so everyone forgot about it two hours after it became available.
>>
>>106560346
>Europe
They're competing? It looks like only one European state is just barely trying.
>>
File: 1749693672355757.jpg (47 KB, 738x415)
>>106560314
>an 80B model requires ~160GB of VRAM. A 3-bit version could potentially run in under 40GB of VRAM, making it feasible to run on a single high-end GPU like an NVIDA RTX 4090
This is Gemini? The peak of LLMs right now? With web access?
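(Quick arithmetic: 80B params at 3 bits is about 80e9 × 3 / 8 ≈ 30 GB for the weights alone, before KV cache and overhead, and a 4090 tops out at 24 GB, so the "single RTX 4090" claim doesn't even add up by its own math.)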
>>
>>106560354
Mistral was great
>>
>>106560361
>was
Yeah
>>
File: file.png (99 KB, 983x1103)
The proper name is Qwen3-MoE-A3B thank you.
>>
>>106560404
It still is.
>>
File: 1750684572736732.png (4 KB, 852x31)
>>106560409
The sqrt(total * active) formula has been officially confirmed
>>
>>106559998
ghostbusters ectoplasm ghostly appearing sperm
>>
>>106560346
>Europe
Lol lmao even
>>
>>106560346
>Russia
For the last 35 years the #1 rule of doing business in Russia was "don't do business in Russia". CS stuff was the easiest to move abroad.
It's not like there's nothing at all, IIRC Yandex was pretty competitive in the self-driving scene for a moment, and every street dog sells its own proprietary voice assistant now, but for local I only found https://huggingface.co/yandex/YandexGPT-5-Lite-8B-instruct so far. (It's whatever)
>Japan
Fell behind in programming way back when we started using real operating systems and high-level languages, and never recovered to this day. I blame the language barrier.
>Korea
Probably too busy printing money with all their gachas instead.
>>
File: file.png (316 KB, 1777x2417)
>>106560417
>>
>Why aren't you using vllm bro?
>What do you mean you don't have H100 cluster? It can still work with A100 cluster bro.
>Wait, you got just an RTX 3090? Uhm, I've never heard about such GPUs, must be Chinese knockoff or something. Get legit hardware bro.
>You got no money? Just ask for grants bro! Or get investors. You are part of the network, right?
>>
>>106560442
>You got no money?
Have you tried getting a job recently? scamming into a grant or investor is unironically easier at this point.
>>
File: 30474 - SoyBooru.png (118 KB, 337x390)
Are you enjoying the next best thing? (Qwen Next) (Subtle request for feedback)
>>
>>106560442
vllm can run on an intel arc gpu. you've got no excuse bro. Also it can do cpu as well and MoEs and even got gguf support not long ago
>>
File: 1726683352470372.png (476 KB, 1179x441)
>>106560356
>>
>>106559958
I wonder what this dude thinks about AI. There's not a lot of difference between what he does and what video models can do.
>>
>>106560460
Where can I buy B60 for MSRP(1200BURGERCOINS)?
>>
>>106560414
Mistral's 2025 output has been one okayish 24B model and nothing else of note
Meanwhile their business model is increasingly Cohere-ifying and there's good reason to believe they fucked up training Large 3
Maybe the cash injection from ASML will help some but acting like they're still internationally competitive is a joke
>>
>>106560474
B60 will be announced next month at SEMICON West. $500 for 24GB model
>>
>>106560460
How does pure CPU performance of vllm compare to ik_llama.cpp?
>>
>>106560489
I would suggest just standing up the vllm cpu docker image and running benchmarks yourself. You probably won't find much public info for cpu benchmarks between the two.
>>
>>106560483
>AMD CPU
>Intel GPU
>NVIDIA RAM
If only I had infinite money...
>>
>>106560523
Where exe? I DONT GIVE A FUCK ABOUT THE FUCKING DOCKER! i just want to download this stupid fucking application and use it
WHY IS THERE DOCKER??? MAKE A FUCKING .EXE FILE AND GIVE IT TO ME. these dumbfucks think that everyone is a developer and understands code. well i am not and i don't understand it. I only know to download and install applications. SO WHY THE FUCK IS THERE DOCKER? make an EXE file and give it to me. STUPID FUCKING SMELLY NERDS
>>
File: 1593076675554.jpg (57 KB, 477x477)
>>106560551
>>
>2025
>vibevoice is fully forgotten
>>
>>106560566
Useless without training scripts.
>>
File: 1754938970794577.jpg (46 KB, 750x1086)
>>106560551
>>
>>106560481
To Mistral's credit, that single model they made is actually the best model for running on a normal PC. Gemma is heavily censored, Qwen's similar sized models are worse at non benchmaxx tasks and everything else is too big unless you're building your PC for running LLMs
>>
>>106560346
>>106560436
>Russia
Case in point: https://en.wikipedia.org/wiki/ABBYY_FineReader
I was informed this used to be SOTA for OCR.
>ABBY ... was founded in the USSR and operated in Russia for nine years before moving to the United States.
>>
>>106560550
>NVIDIA CPU
>AMD GPU
>INTEL RAM
WE ARE MAKING A MEME SYSTEM. OPTANE WILL NEVER DIE.
>>
nvidia not offering a 24gb 50xx card was criminal and i'm tired of pretending otherwise.
>>
>>106560551
This argument has never been refuted
>>
>>106560606
nobody wants to deal with women. if exe is a filter then so be it.
>>
mistral for erp
qwen3 for anything else but erp
>>
>>106560604
Fuck 24GB. The 5090 should have just been cheaper, it's not remotely close to being a proper workstation card and 32GB is too little for anything outside of hobbyist stuff.
>>
>>106560622
It's a gayming gpu. Buy from their workstation lineup if you want professional stuff.
>>
>>106560614
You aren't a woman, though
>>
>>106559044
SSDmaxxbros, maybe our time is finally cuming soon...
>>
>>106560619
But what about sfw rp, is that included in that? Is Qwen 3 smarter than Gemma 3?
>>
>>106560630
no?? really??? I think you're lost bro, this isn't >>>/lgbt/
>>
>>106560619
>anything else but erp
there is nothing else
>>
File: file.png (176 KB, 1425x569)
Am I about to get scammed? I've never seen these under $1000. From Hong Kong.
>>
>>106560634
>gemma3
after all the safety humiliation I got, I will never use it again
>>
I refuse to support any model whose selling point is high context limits. Every llm I've used, from free to paid, is absolute garbage and hallucinates at high context.
>>
>forcing full prompt re-processing due to lack of cache data (likely due to SWA
humiliation ritual
>>
My CLINE prompts are all timing out when I'm trying to use gemma3:12b on a 4070. Do I need a quantized model instead?
>>
>>106560645
No, you're in for a great deal! Buy it quick, there's only one left!
>>
>>106560645
>seller with 0 reviews
Yeah, trust him!
>>
File: 1733603436129628.png (2.24 MB, 2038x1678)
>>106560645
bro no don't do that
buy this one: https://www.ebay.com/itm/325407276138

much better trust me
>>
>>106560809
>Graphcore IPU
what
>>
>>106560693
How do you prevent this?
>>
>>106560814
>intelligent processing unit
lmao
>>
>>106560809
ok ersinc03
>>
>>106560687
You can't trust the actual numbers for context that companies put out, they're always wrong. But it's usually safe to assume that a higher advertised number does mean a higher 'effective' context ceiling.
>>
Can't wait for adobe research to publish an updated study on how all these models go to shit past 32k
>>
>>106559371
>>106559401
>no tits
>shitty reddit memes
You are gay.
>>
>>106560967
>>106560814
IPU/NPUs are a real thing, they're in all the new CPUs from AMD for instance. just not from meme companies like that one.
>>
>>106561079
>central processing unit
makes sense
>graphics processing unit
yup
>neural processing unit
works with neural networks, gotcha
>intelligent processing unit
the fuck is this supposed to be? it sounds like some marketing term
>>
File: Base Image.png (490 KB, 1200x1576)
ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms
https://arxiv.org/abs/2509.09679
>Rotation-based methods such as QuIP and QuaRot apply orthogonal transforms to eliminate outliers before quantization, using computational invariance. However, these methods use fixed transforms--Hadamard matrices achieving optimal worst-case coherence μ = 1/√n--that cannot adapt to specific weight distributions. We identify that different transformer layers exhibit distinct outlier patterns, motivating layer-adaptive rotations rather than one-size-fits-all approaches. We propose ButterflyQuant, which replaces Hadamard rotations with learnable butterfly transforms parameterized by continuous Givens rotation angles. Unlike Hadamard's discrete {+1, -1} entries that are non-differentiable and prohibit gradient-based learning, butterfly transforms' continuous parameterization enables smooth optimization while guaranteeing orthogonality by construction. This orthogonal constraint ensures theoretical guarantees in outlier suppression while achieving O(n log n) computational complexity with only (n log n)/2 learnable parameters. We further introduce a uniformity regularization on post-transformation activations to promote smoother distributions amenable to quantization. Learning requires only 128 calibration samples and converges in minutes on a single GPU--a negligible one-time cost. On LLaMA-2-7B with 2-bit quantization, ButterflyQuant achieves 15.4 perplexity versus 22.1 for QuaRot.
Links below:
https://github.com/42Shawn
https://github.com/oumi-ai/oumi
Code might be posted on one of those. Might be cool but then again very little results included.
previous paper that looked at butterfly transforms
https://arxiv.org/abs/2302.06646
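No code seems to be up at either repo yet, so here is a minimal PyTorch sketch (mine, not the authors') of what "butterfly transform parameterized by continuous Givens rotation angles" means in practice. The stage structure and the (n log n)/2 parameter count follow the abstract; everything else is an assumption.

import math
import torch

def butterfly_apply(x, angles):
    # x: (..., n) with n a power of two.
    # angles: one tensor of shape (n // 2,) per stage, log2(n) stages total,
    #         i.e. (n * log2(n)) / 2 learnable parameters overall.
    # Each stage applies 2x2 Givens rotations to disjoint coordinate pairs,
    # so every stage (and hence their product) is orthogonal by construction,
    # and the whole transform costs O(n log n).
    n = x.shape[-1]
    stages = int(math.log2(n))
    assert 2 ** stages == n and len(angles) == stages
    y = x
    for s in range(stages):
        stride = 2 ** s
        blocks = n // (2 * stride)
        lead = y.shape[:-1]
        y = y.reshape(*lead, blocks, 2, stride)
        theta = angles[s].reshape(blocks, stride)
        c, si = torch.cos(theta), torch.sin(theta)
        a, b = y[..., 0, :], y[..., 1, :]
        y = torch.stack((c * a - si * b, si * a + c * b), dim=-2)
        y = y.reshape(*lead, n)
    return y

# toy check: rotations preserve the norm, so this should print True
x = torch.randn(16)
angles = [torch.randn(8, requires_grad=True) for _ in range(4)]
print(torch.allclose(x.norm(), butterfly_apply(x, angles).norm(), atol=1e-5))

The learnable part is just the per-stage angle tensors; gradients flow through cos/sin, which is the whole point compared to fixed ±1 Hadamard entries.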
>>
>>106560089
Anyone serious is deploying for enterprise, not personal use. Normal people don't use local models for personal use, just like normal people don't use 4chan and only use fb/linkedin.
>>
>>106561145
That is very nice, but how does it compare to GGUF?
>>
>>106561127
>it sounds like some marketing term
it basically is. NPU = IPU
in the industry its looking like NPU has won out but AMD at least early in the developments of NPUs in like 2023 referred to it as IPUs as well
>>
>>106561127
Graphics Processing Unit is a horrible term nowadays.
NVIDIA calls the H100 a GPU even though it doesn’t even have a display output and isn’t aimed at graphics processing.
>>
>>106561161
Probably just as shit as Q2 ggufs are.
>>
>>106561168
Nah, ggufs are probably better since they don't mention them.
>>
>>106561166
GPU stands for "General Processing Unit" in nvidia's own terms
>>
2-bit is all you need, you don't need more
>>
>>106561177
My point stands. Q2 is shit. This is literally a competition of who has the nicer looking pile of shit. If you're seriously using a Q2 model you need to reevaluate your life. Also the paper likely doesn't mention GGUFs at all because it's talking about W2A16 which Q2 GGUF can't even map to in practice.
>>
File: 1757508971871431.png (327 KB, 800x982)
>>106561184
https://www.nvidia.com/en-us/about-nvidia/corporate-timeline/
>>
>>106561184
that's a backronym they made up so they can keep using the term everyone would have used anyways
>>
That reminds me, I have vllm installed. Might as well try a quick speed comparison. Tomorrow maybe.
>>
File: k2-0905-perplexity.png (175 KB, 2069x1400)
>>106561204
case in point
>>
>>106561184
Should call them NVIDIA Processing Units to shit into everyone's salad.
>>
RP testing qwen3-next-thinking and it has a completely different reasoning style from 2507, and not in a particularly good way
several times more verbose and EXTREMELY wasteful of tokens - trying out different lines of dialogue over and over again, outputting them in full with minor variations, outputting full drafts of the response, or in one case "let me check the previous messages [proceeds to output EVERY previous turn of the roleplay IN FULL]"... wtf. I get the sense that this is something of a proof of concept model for them (and to their credit, in my limited testing the models do seem smart and pretty good at long context) but they've gotta fix this for 3.5 or whatever their next release is.
>>
>>106561341
post cockbench
>>
>>106561341
Have you tried prefilling the thinking with some guidance on how to think about the RP?
>>
>>106561184
>>106561207
>>106561336
You butt hurt boys are SO silly! :3
>>
File: cute finnish bf.webm (903 KB, 720x1280)
Why is it so hard to get models to undress the finnish catgirl pm?
>>
>>106561347
I APIfagged, sorry anon. I'd expect it to be in line with the 2507 qwens though.
>>106561358
not yet, I'm putting off messing with it more until there are ggufs
>>
>>106560551
> anon comes to a thread where everyone has a fucking doctorate in AI
> sees the word docker
> loses his shit as he's dumb as fuck
> after crying gets his mcdonalds uniform ready for work tomorrow
>>
Not sure what I expected.
What is this called? At the beginning the sentences are long and then it's all short and weird. I saw this before with another sloped model.
>>
>>106561459
I can get better outputs from llama 8B. Holy slop
>>
>>106561476
Sad because this would have been a really cool size.Fast even with offloading.
But at least they try something new.
>>
>>106561506
even the chinks are putting in extreme safety nets. shame. Gemma3 tier slop
>>
File: file.png (211 KB, 602x600)
>>106561367
Consider the following you tranny freak
>>
>>106561459
It kinda communicates pacing.
>>
>>106561459
Somehow way worse than Mistral Small
>>
File: 1692170984443505.jpg (32 KB, 400x400)
I did nothing today
>>
>>106561166
>isn’t aimed at graphics processing
you can have a gpu render something and then display it through an iGPU's display output
i wonder if you could stick a h100 inside a normal desktop PC, install the geforce driver (after doing inf mod) and then just play games on it
>>
>>106561459
It's a qwen3-only problem I think. It tries to mimic the text formatting from the latest response. Also how it was trained could be the culprit, like maybe it was trained with a bunch of Chinese poems.
The pattern I noticed is like this:
1 paragraph -> 2 paragraphs -> 3 -> 4 -> 5 -> Then it ended with one line per paragraph.

So far the only way to control it is by instructing it explicitly in the system prompt. For example I'm using this:
"Respond in multiple standard paragraphs format. Avoid poetic or dramatic elements. "
>>
>>106561768
That helps. But what a weird writing style. Feels like Deepseek on steroids.
>>
>>106560606
pay me
>>
>>106561794
now you've got a pattern of 3 paragraphs of exactly 3 lines.
>>
>>106561855
As god intended it. Proper paragraphs should never exceed 3-4 lines. I learned that in middle school
>>
>>106561077
>>no tits
Perfect.
>>
>>106561861
congratulations for completing middle school anon.
nobody thought you could do it, but you did.
>>
File: 1736083582437887.jpg (64 KB, 1024x1020)
How will qwen 80b-A3B improve my text adventures involving me being a magical kemoshota that cures predators of their fucked up fetishes?
>>
wen qwen ggoofs
>>
y no opera his son?
>>
i think im gonna goof...
>>
Realistically speaking, there haven't been any improvements erp wise since llama3.3-70b and mistral large 2407
>>
>>106561915
Oke doke gay
>>
>>106562070
I've never once used LLMs to goon so I have no idea what this even means
But the obvious solution to get around LLMs not doing what you want is to be agentic
agents aren't only for tool calling and APIs. They can also form complex logic based on natural language, like following and maintaining a story structure despite whatever retarded shit you're trying to pull
>>
File: file.png (175 KB, 420x459)
https://www.washingtontimes.com/news/2025/sep/11/ftc-launches-inquiry-ai-chatbots-acting-companions-effects-children
>>
File: 1754987229502936.png (2.39 MB, 1416x2120)
>>
>>106562070
Air is a direct upgrade for 3-4 3090 VRAMlets
>>
File: 1740718713054690.png (1.9 MB, 1416x2120)
>>
>>106562108
>>106562161
tfw the goofs are nevermore
>>
>>106561944
Seeing how shit it is will make you put even more effort into your RPs using Mistral Nemo
>>
im vibecoding vibevoice for my vibecoded local ai software. what am i in for?
>>
>>106562240
aids
>>
File: 1729918764920247.png (3.01 MB, 1416x2120)
>>
It's up!
>>
>>106562289
>*looks down*
Yes, it is!
>>
>>106559371
Cute miku I like
>>
wheres my fucking ggoofs Daniel
>>
>>106562331
What's happening?
>>
>>106562331
>unsloth
>ever
lmao
>>
>>106562345
upload the qwen3 next ggoofs you goof
>>
>>106559420
vLLM supported it in June via https://github.com/vllm-project/vllm/commit/b69781f107b7ad847a351f584178cfafbee2b32a but it's really hacky and depends on their Extension for Pytorch and some calls in their LLM hacked backend.
The best I've seen from Intel publicly for C++ is their closed pull request inside the main Flash Attention repo.
https://github.com/Dao-AILab/flash-attention/pull/1528
This uses SYCL so yeah, it would be kind of an uphill battle for anyone not an Intel developer to adapt to the existing CUDA code.
>>
Damn phonemizers are a huge bottleneck for TTS because devs use by default the pile of trash that is espeak. On CPU for kokoro it takes almost 8-9s to preprocess a single sentence to IPA phonemes on my laptop while the inference itself is ~6s and that shit grows at O(n) or more (fucking 22s to preprocess a paragraph). Switching to g2p_en for american english + a bunch of heuristics I got from chatgpt achieves the same preprocessing output in 1.5s for a single sentence, growing at ~O(log N). I wish this field focused a bit on efficiency instead of convenience
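Minimal sketch of that swap, assuming the g2p_en API as I remember it (a G2p class you call on text, returning ARPAbet tokens); the ARPAbet-to-IPA table below is a tiny illustrative subset, not the real heuristics:

from g2p_en import G2p  # pip install g2p-en

# partial ARPAbet -> IPA mapping, for illustration only
ARPABET_TO_IPA = {
    "AA": "ɑ", "AE": "æ", "AH": "ʌ", "EH": "ɛ", "IY": "i", "OW": "oʊ",
    "D": "d", "K": "k", "L": "l", "M": "m", "N": "n", "R": "ɹ",
    "S": "s", "T": "t", "Z": "z",
}

g2p = G2p()

def to_ipa(text):
    out = []
    for tok in g2p(text):          # ARPAbet tokens plus spaces/punctuation
        base = tok.rstrip("012")   # strip stress markers
        out.append(ARPABET_TO_IPA.get(base, tok))
    return "".join(out)

print(to_ipa("local models"))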
>>
>>106562423
ok nerd
>>
>>106562423
You don't need to pre-process anything.
>>
>>106562430
It's not feeding the raw text to the TTS, it's preprocessing the text to phonemes before feeding them to the model
>>
>>106562423
Shouldn't that just be a database lookup for retarded languages like english where the pronunciation doesn't match the spelling?
>>
>>106562423
They aren't using espeak just because it is easy, it is because it has multilingual support out of the box. G2P is much harder to configure with mappings needed for each language.
>>
>>106560481
They've made other models too, but they're mostly not open-weights. I don't get why they don't start doing MoE models the Qwen way; wouldn't that let them release a wider range of sizes with less compute?
>>
>>106562453
It's not enough, because some words have different pronunciation depending on whether they're a noun or a verb while written the same way, like "use" + other things that are context-dependent
>>
>>106562482
You're describing convenience bro. Espeak is almost twenty years old, it has memory leaks and a lot of issues that won't ever be fixed because of the GPL; no one wants to contribute to this trash.
>>
>>106562450
One thing what will help you regardless - doesn't matter if it gets converted to phonememes or not - is to use contractions module
>import contractions
>cleaned_text = contractions.fix(text)
and then remove surrogates with regex and optionally add abbreviations and optionally clean up any problematic remaining characters (because LLMs always output random shit).
>>
>>106562482
Sounds like an llm whose sole purpose is to take text as input and output ipa is required.
>>
>MUH ERP
go to sleep americlaps and huemonkeys.
productive eurochads are taking over from here.
>>
>>106562546
Give us Miqu 3 already or at least Largestral 3. WTF are you frogs doing?
>>
>>106562543
at that point you might as well take text as input and diffuse the audio directly.
Take in some positive/negative descriptor tokens too.
>>
>>106562542
Thanks, I didn't know it was a thing. I'll add that
>>
>>106562586
Yeah so what I did with piper voice (it's instant tts, takes ~100 mb or less but it's not as robust as vibevoice of course)
>contractions
>surrogates
># remove surrogates (U+D800 to U+DFFF unicode range)
>cleaned_text = re.sub(r'[\ud800-\udfff]', '', cleaned_text)
>Then replace commas, ellipses, "", dash, em dash, and whatever else there is with either empty spaces or periods - this way TTS does not even try to do anything but it'll go straight onward - basically remove and replace everything else except periods. This is sort of trial and error, you'll need to test this and proceed accordingly.
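Rolled into one helper, roughly (a sketch of the steps above, since the exact replacement set is the trial-and-error part):

import re
import contractions  # pip install contractions

def clean_for_tts(text):
    text = contractions.fix(text)                        # "don't" -> "do not"
    text = re.sub(r'[\ud800-\udfff]', '', text)          # drop stray surrogates
    text = re.sub(r'\u2026|\.\.\.', '.', text)           # ellipses -> period
    text = re.sub(r'[,"\u2013\u2014\u201c\u201d]', ' ', text)  # commas/dashes/quotes -> space
    return re.sub(r'\s+', ' ', text).strip()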
>>
>>106562543
There are small transformers for that (T5), but it's even slower than espeak. They're using them for disambiguation, which is fine when you don't care about latency and want the output to be as good as possible
>>
>>106559401
kek



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.