/g/ - Technology


/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>107957082 & >>107948284

►News
>(01/25) Merged kv-cache : support V-less cache #19067: https://github.com/ggml-org/llama.cpp/pull/19067
>(01/22) Qwen3-TTS (0.6B & 1.8B) with voice design, cloning, and generation: https://qwen.ai/blog?id=qwen3tts-0115
>(01/21) Chroma-4B released: https://hf.co/FlashLabs/Chroma-4B
>(01/21) VibeVoice-ASR 9B released: https://hf.co/microsoft/VibeVoice-ASR
>(01/21) Step3-VL-10B with Parallel Coordinated Reasoning: https://hf.co/stepfun-ai/Step3-VL-10B
>(01/19) GLM-4.7-Flash 30B-A3B released: https://hf.co/zai-org/GLM-4.7-Flash

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: threadrecap.png (1.48 MB, 1536x1536)
►Recent Highlights from the Previous Thread: >>107957082

--Paper: Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings:
>107960244 >107960358 >107960418 >107961005 >107960382 >107961422
--Multi-GPU setup strategies for cost-effective inference:
>107958416 >107958478 >107958532 >107958598 >107958970 >107958540 >107958611 >107958531 >107958632
--Qwen3 TTS installation challenges on Windows with Nvidia GPUs:
>107958660 >107958671 >107958685 >107958709 >107958783 >107958719 >107958782 >107964469 >107958753 >107958714 >107962549
--qwen-tts performance and compatibility issues in TTS applications:
>107958000 >107958013 >107958047 >107958501
--LLM struggle with deviating from genre tropes in constrained narratives:
>107959380 >107959410 >107959431 >107959440 >107959458
--Exploring AI interaction in Among Us-style games and survival simulations:
>107959425 >107959464 >107959483 >107959505 >107961126
--Challenges with book-based QA and context limitations:
>107964051 >107964287 >107964322 >107964354
--Optimizing llama.cpp for fast, low-VRAM 1-shot question answering:
>107963343 >107963394 >107963472 >107963529 >107963577 >107963655
--Speculation on Minimax-M2-HER and Mistral Small Creative model releases:
>107957396 >107957481 >107957543 >107957650 >107957598 >107957634
--MiniMax M2-her roleplay limitations:
>107962436 >107962501 >107962512 >107962654 >107962666
--llama.cpp PR reducing DeepSeek memory usage:
>107963328 >107963386
--Critique of TranslateGemma, recommendation of heretic Gemma 3 for uncensored JPEN translation:
>107961940 >107962800
--Vibevoice emotion tag functionality:
>107960489 >107960506
--LLM formatting and model preference debates:
>107966244 >107966357 >107966388 >107966534 >107966600
--Qwen voice cloning stability and context length issues:
>107961962 >107962660
--Miku (free space):


►Recent Highlight Posts from the Previous Thread: >>107957086

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>107968112
No Miku, I'm not letting you into the server room.
>>
Considering engram is going to be the next big thing, let's talk about it and its consequences for local.

I'm trying to get a feel for how these future models will work on local machines, but I'm a bit of a brainlet so I would appreciate some input from my betters. If I understand it correctly, around 20-25% of the model's parameter budget gets allocated to the engram memory module which is stored in RAM/NVMe, where performance scales linearly with the size of said memory module.

Obviously the computing requirements for running the model go down, but what does this mean for RAM/NVMe? Does this mean we'll be running huge models that sit in NVMe storage? Should I be buying as many NVMe drives as possible? Another thing to consider is the throughput. The paper claims there's only a 3% hit to throughput when using the hybrid engram architecture, but is that the case only for RAM, or for NVMe storage as well?
>>
>>107968191
>Should I be buying as many NVMe drives as possible?
I have no idea what you're talking about but it's already too late, prices are skyrocketing. I got a 1TB nvme drive for like 80 bucks 3 years ago and now they're like 250 dollars
>>
>>107968191
I doubt computing requirements would go down. It seems deepseek wants to add engram params on top of what they can run right now. So, deepseek v3 is 650B, then deepseek v4 will have the same 650B + 350B of engram params.
>>
i'm looking for the smallest non-potato gguf of qwen3 tts possible
>>
>>107968303
It's called pocket tts.
>>
>>107968316
i already have that set up, i want slightly less potato
>>
>>107968321
Chatterbox turbo - 350M params.
Seriously, nobody has vibecoded gguf inference for qwen-tts yet. Be the first. llama.cpp already supports the qwen3 architecture. You just need to implement the MTP module and tokenizer.
>>
Will posting furry stories here trigger a panic attack in a certain Russian shitposter?
>>
File: 1743008679366121.png (1.79 MB, 1024x1024)
>>107968171
Miku was just a distraction! Rin is already in the server room!
>>
>>107968266
2tb is 250$ right now at the retailers that I visit. The point isn't dooming about current prices, the point is about determining future prices in a world where engram is the new paradigm.

>>107968288
Yeah, I fully expect the SOTA labs to try to max out model size on GPU, but at minimum it means we get better and smaller models. I'm really interested in the linear scaling performance and what it means for RAM/NVMe.
>>
>>107968424
>but at minimum it means we get better and smaller models
did you not read a word he said?
>>
>>107968421
[impregnation]
[impregnation]
[impregnation]
>>
How does it feel that gpt 3 is still better than your local models?
>>
>>107968471
it sends shivers down my spine
>>
>>107968454
She's smol and for cuddles you perv.
>>
>>107968471
How can a model with 2k context be better than my local models with 16k usable context?
>>
>>107968490
we'll cuddle during aftercare
>>
>>107968492
>16k usable context
show nolima
>>
>>107968471
LMFAO fuck no
>>
>>107968471
That was you trying to make a joke?
>>
>>107968431
Yes, I perfectly understood what he said. 25% of sparse parameters can be offloaded to embedding tables in RAM while getting better results on benchmarks. This means smarter models you can run with less VRAM. This is the ground floor for engram. That doesn't mean labs won't push out engram models with larger sparse parameters to try to push benchmarks. I fully expect the same thing we have now, which is model size diversity.
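For intuition only, here's a toy sketch of the general idea as I understand it from the thread (NOT the paper's actual design, and every size here is made up): the engram part behaves like a big embedding table keyed by hashed n-grams, and since a lookup is just a memory read, the table can live in system RAM while the dense/MoE layers stay on the GPU.

```python
# Toy sketch of the general idea as described in this thread, NOT the paper's
# actual design: a big hashed n-gram embedding table that lives in host RAM
# and only needs cheap lookups, while the transformer layers stay on the GPU.
import numpy as np

TABLE_SIZE, DIM, N = 2**20, 64, 3          # made-up sizes
engram_table = np.zeros((TABLE_SIZE, DIM), dtype=np.float16)  # sits in system RAM

def engram_lookup(token_ids: list[int]) -> np.ndarray:
    """Sum the table rows for all n-grams ending at the last token."""
    out = np.zeros(DIM, dtype=np.float32)
    for n in range(1, N + 1):
        if len(token_ids) >= n:
            key = hash(tuple(token_ids[-n:])) % TABLE_SIZE  # hashed n-gram -> row
            out += engram_table[key]
    return out  # the real model would add something like this into the residual stream

print(engram_lookup([101, 7, 42]).shape)   # (64,)
```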
>>
File: 1760262544190219.jpg (139 KB, 551x551)
>>107968471
>GPT-3
>How many R's are in Strayberry?
>"You mean strawberry? There's 3 R's in strawberry. If you meant Strayberry there's 2 R's in Strayberry. Hope this helps!" (<hardcoded retard)
>There's 3 R's in Strayberry though.
>"Oh you're right! What a fascinating discovery! I seem to have made a mistake! There's indeed 3 R's!"

>meanwhile local model
>How many R's are in Strayberry?
>"lol you're trying to trick me? I bet you can't even count to 3 you doof"
>>
File: gpt3.png (87 KB, 626x528)
>>107968471
rose-tinted glasses
>>
What the fuck does it even mean for a breath to catch?
>>
File: file.png (270 KB, 1641x1112)
CUDA dev you broke GLM 4.7 (not flash) with https://github.com/ggml-org/llama.cpp/pull/19092
I didn't test other models.

Left is 0440bfd, right is 0c21677.
>>
>>107968564
Do you have PPL and KLD proof of these claims? Otherwise this is just FUD that can safely be ignored as a meme
>>
>>107968541
Show logs from GPT-3.
>>
>>107968573
The temperature is set to 0 and the prompt is the same so KLD is very obviously different.
>>
>>107968588
Can't trust you without hard numbers chief, nice try trying to waste his majesty's time!
>>
>>107968115
Thank you Recap Miku
>>
>>107968548
ESL? It's when breathing stops for a second, often with a sharp intake of breath beforehand. Similar to a gasp, like when someone is surprised or frightened. Not to be confused with catching your breath, which means a different thing
>>
>>107968564
>>107968588
It's expected that results are not bit-for-bit identical.
Since language models are autoregressive, once a single token is sampled differently the entire sequence diverges.
This happens in particular at the beginning of sentences where the token distribution is very flat and small changes can amplify.
A low or zero temperature suppresses this to some extent but not completely.
I'll double check the architecture of this particular model but I don't see this as evidence that either build is better or worse on average.
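To put toy numbers on the flat-distribution point (these logits are made up, not from any real model): a perturbation on the order of kernel reordering noise is enough to flip the top token, and once that happens the two runs generate completely different continuations.

```python
# Toy illustration (not llama.cpp code): with a flat distribution, a tiny
# numerical difference is enough to flip greedy sampling, after which the
# two runs diverge completely. Logit values below are made up.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits_a = np.array([2.001, 2.000, 1.999, 0.5])    # build A
logits_b = logits_a + np.array([0, 0.002, 0, 0])   # build B, ~1e-3 of numerical noise

print(softmax(logits_a).round(4))   # nearly uniform over the first three tokens
print(softmax(logits_b).round(4))
print(np.argmax(logits_a), np.argmax(logits_b))    # 0 vs 1 -> different token sampled
```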
>>
Has anyone tried to abliterate a model with heretic? How long does it take until the number of refusals starts going down? Should I be concerned if it doesn't for a while even when setting the random exploration phase to 1?
>>
>>107968640
It's fairly obvious that the example on the right doesn't follow GLM's usual thinking structure at all and the output is completely schizo. It claims that the lines of the poem are jumbled up.
It gets worse at higher context. I first noticed it in claude code at ~20k context because the model wouldn't output anything coherent at all and just spammed the same nonsense token.
>>
>>107968664
Anecdotal.
>>
Oh, and I also set the minimum considered KL divergence to 0.1. But it never reaches that.
Is there a setting to make it more aggressive if I care about the uncensored part more than about the correctness part?
Maybe it doesn't work because it's a MoE?
>>
>>107968664
kek, flash attention was broken on llama.cpp for a fucking year and those clowns said all the same stuff defending it. "The perplexity score on 16 tokens of wikitext is nearly the same so our implementation isn't broken."
>>
FUCKING AMD
all this time I was fighting shitty ROCm only to find out Ollama uses fucking vulkan natively and I don't even need shitty ROCm.
Mother fuck
>>
>>107968711
Any concrete proof of this?
>>
>Qwen3-TTS
finally a better model than openaudio s1 mini for mixed language generation
quality of cloning is overall better and also more stable
>>
>>107968722
>ollama
>retarded
checks out
>>
Why is Claude like this?
>>
>>107968628
ESL?
>>
>>107968722
is ROCm good for anything at all?
>>
>>107968755
ESL?
>>
>>107968729
yes look at the commit history you potato
>>
File: file.png (31 KB, 528x156)
>>107968640
Here's an example with an even longer prompt: https://rentry.org/xwu5muxu
Before the commit on the top and after the commit on the bottom.
>>
>>107968757
Apparently it's meant mostly for researchers, but it seems to be required when it comes to AI video generation
>>
>>107968664
>>107968779
In this particular case I can already reproduce the issue, it has to do with one of the specific code paths on Turing/Ampere.
>>
>>107968779
What makes you think that thinking in English is a metric?
>>
>>107968793
Based.
>>
>>107968779
It's expected that results are not bit-for-bit identical. I don't see this as evidence that either build is better or worse on average.
>>
teto > miku
>>
>>107968760
Go back to /ldg/, please.
>>
>>107968754
that's all AI
>>107965306
I use lmstudio because the python lms interface is great and vLLM doesn't run on my PC for some reason.
>>
>>107968754
>The Loss Function (The "Teacher's Red Pen")
This slop pattern is more infuriating than "not x but y".
>>
>>107968779
It's expected that results are dogshit and definitely worse than before. I don't see this as evidence that either build is better or worse on average.
>>
>>107968853
nu-llms love giving names for everything, capitalizing them, and making everything sound profound
>>
>>107968862
>nu-llms love giving names for everything, capitalizing them, and making everything sound profound
Damn are they trained on cultivation novel slop?
>>
>>107968870
I wish, it's impossible for them to comprehend the dao and see mt. tai
>>
>>107968820
>>107968826
>>107968858
>>
>>107968738
kinda sucks that you can't clone a voice and style it with different emotions
>>
>>107968793
I forgive you for not reading the wall of text in the first post. I should have immediately pointed out why the second output is incoherent.
>>
>>107968877
A river flows 30 years west then 30 years east, maybe they can write good slop scripture soon
>>
>>107968492
4.6 starts melting at 20k. And I think 4.7 can go beyond that. I tried 30k and it was good, with some minor problems it has even at lower ctx sometimes.
>>
>>107968877
A frog in a well
>>
>>107968471
3.5 can't cause ego death
>>
>>107968900
>I forgive you
Huh? You should feel blessed he's deigning to acknowledge your presence.
>>
>>107968906
it's so simple too, if llmisms were xianxiaisms I'd read them for 1000 chapters without complaint
>>
>>107968870
Courting ego death
>>
one of the best threads in ages (non ironic)
>>
>>107968548
Means the same thing as when your refrigerator is running.
>>
>>107968936
kek
>>
>>107968928
>it's so simple too,
It really is. I bet I could make a lorebook on the tropes in no time. The problem is still chapters and names. I doubt you could even write 50 chapters without hitting big problems.
>>
>>107968853
>This slop pattern is more infuriating than "not x but y".
Which Claude? I haven't seen this slop pattern yet.
Opus-4.1 likes to call me autistic or say "HOLY FUCK"
>>
>>107968967
I just want to court death and seduce fox demon jade beauties while slapping the faces of young masters in the demonic sect
>>
File: 1671078154054043.png (50 KB, 918x576)
>>107968471
gpt-3 was great. good times.
>>
>>107968471
As much as I enjoyed GPT3-davinci (not necessarily 3.5), GPT4 (0314) and GPT4 (0613) both did things that local models to this day don't handle well.
They're next to Claude 3 Opus in terms of models that likely will never be reached in terms of soul
>>
instead of hoping the LLM picks up the extremely obvious subtext while wasting 2k tokens "thinking", it's actually much simpler to just tell it what to think with ooc
>>
>>107969015
Disable thinking and it will pick up just fine.
>>
>>107968993
kek
>>
>>107969020
no, without thinking it devolves into tropes
thinking is the only way to prevent it from trying to stop me every time I go to kill a main character, otherwise it gives them plot armor
>>
>>107969005
>in terms of soul
shut the fuck up you drooling retard
>>
>>107968988
it knows youre a retard, it just doesn't care
>>
>>107968993
I love it, how can we emulate this behavior on our superior local models?
>>
>>107969005
now it's kind of funny to think that this huge ass 1.8 trilly moe had only 8k of usable context at the start
>>
My AI called me a retard and ended the session and told me to never speak to it again.
wut do?
>>
>>107969065
Block her and hit the gym.
>>
how isn't there a regex for the double adjectives yet
>>
File: s4_eskimo_pussy.jpg (349 KB, 2211x1859)
Looking for a realtime-ish local TTS that sounds good and supports either good voice cloning or finetuning to a particular speaker. Bonus points for an implementation in a real language, not Python.

>VibeVoice-Realtime-0.5B
Deliberately no voice cloning nor finetuning support, so it's useless. Would need to be reverse-engineered.

>VibeVoice-1.5B
Voice cloning adherence is OK. Not very natural cadence or emphasis etc. Is it worth finetuning? VibeVoice generally has a (probably vibecoded) Rust implementation that seems to work (unsure about its perf):
https://github.com/danielclough/vibevoice-rs

>Kokoro
Good quality for its size, but doesn't support voice cloning or finetuning.

>Pocket TTS
Voice cloning adherence is very poor. Would need finetuning, but AFAICT nobody's done it yet, perhaps because it ostensibly supports cloning. Supports streaming. May be the best option given finetuning support.

>FishAudio-S1-mini
Even the official samples sound pretty shit, like a schoolchild reading a book aloud. And the only web demos I saw were walled behind an account.

>Qwen3-TTS
Voice cloning adherence is bad. Does support finetuning; I think an anon ITT had a bad experience with that.

>Echo-TTS
Great quality and voice cloning adherence; best I've heard in both respects. Sort-of supports streaming. Couldn't run it locally due to a bug in a dependency (which wouldn't be hard to swap to be fair). Unfortunately somewhat obscure and apparently a dead project.

>IndexTTS2
Decent voice cloning adherence, good quality, sounds pretty natural. No official finetuning support. Best overall option I've seen. Has an extremely vibecoded Rust implementation which I haven't tried:
https://github.com/8b-is/IndexTTS-Rust
https://huggingface.co/ThreadAbort/IndexTTS-Rust
>>
>>107969100
>Bonus points for an implementation in a real language, not Python.
Stopped reading there, keep whining
>>
>>107969100
>Bonus points for an implementation in a real language, not Python.
Continued reading there. Good complaint.
>>
>>107969100
i been looking for the same thing, its like a fucking unicorn
>>
>>107969100
>Bonus points for an implementation in a real language, not Python.
Python is the white man's language
>>
>>107969100
>Bonus points for an implementation in a real language, not Python.
Stopped reading there. Good complaint but good luck finding anything.
>>
>>107969005
>GPT4 (0314) and GPT4 (0613) both did things that local models to this day don't handle well.
Like what? I never used em
>>
>>107969162
they were the LLMs of choice for trannies to workout their transitioning strategies.
>>
>>107969183
Kinda true. When I was prepping for my transition I used GPT4 while running Arch Linux. It was amazing at helping me figure out when to complete steps and helped me accept who I really am.
>>
File: 1744552884613516.jpg (45 KB, 540x413)
>>107969214
Still one step short
>>
>>107969100
Supertonic. It sounds more natural than most < 100M models. Doesn't need a phonemizer (no espeak dep), doesn't need complex tokenizers (just straight utf8->tokenid mappings). Understands numbers, $ and a few other things without having to spell them out (but you can still do it if necessary). ONNX models that run just fine on my C thing. They have example code on how to run it for about 10 languages (including C++). It's fast. Doesn't have voice clone. Voice files can be modified like kokoro's or kittentts', so the kvoicewalk method to find voices would work on it just fine.
If nothing else, being one of the few (only?) models that can do synth voices without a phonemizer/tokenizer is a huge thing. V1 is more stable than V2. It misses words less often.

Soprano-80M. Single voice, which i think sounds pretty natural. No voice cloning or even changing the default one, as far as i know. LLM based, complex tokenizer. Being able to stream helps mask the speed (which is not bad, really. Just slower than supertonic).

There's luxtts, which was released no more than 2 days ago, but it's a mess to set up. It needs like 5 (small) models, and only for two of them does he provide onnx versions. Then there's the vocoder in safetensors and bits of zipvoice that you need to pull from some other repo.
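If anyone wants to poke at the ONNX exports without the repos' example scripts, a minimal onnxruntime sketch looks like this; the file name and the commented-out input names are placeholders, the real ones come from inspecting the graph (or the example code):

```python
# Minimal sketch for poking at an ONNX TTS export with onnxruntime.
# "supertonic.onnx" and the input names in the commented call are placeholders;
# the real graph inputs come from the inspection loop below.
import onnxruntime as ort

sess = ort.InferenceSession("supertonic.onnx", providers=["CPUExecutionProvider"])

# See what the graph actually expects before wiring anything up.
for i in sess.get_inputs():
    print("input: ", i.name, i.shape, i.type)
for o in sess.get_outputs():
    print("output:", o.name, o.shape, o.type)

# Hypothetical call once the real names are known, e.g. token ids plus a voice
# style vector (as with kokoro/kittentts-style voice files):
# audio = sess.run(None, {"tokens": token_ids, "style": voice_vec})[0]
```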
>>
>>107968757
would be if it was adopted more
>>
>set up opencode+llama.cpp a few days ago
>had an issue with GLM-4.7-Flash and other instruct models concerning tool calling templates
>think "Maybe should recompile with newer version?"
>compile newer version
>https://github.com/ggml-org/llama.cpp/issues/19096

I think I will try the b7811 release and just HOPE that it works with GLM-4.7-Flash since that is the only model so far that works on my unholy 12GB VRAM setup. It's slow, but offloaded into memory it worked. Until it broke.

Hope they fix this + the tool-calling issue, then it would be great!
>>
>>107969100
Gpt-sovits is great at cloning and there's rust implementation
>>
>>107969065
Sounds just like my ex...
>>
>>107969771
I've used qwen 30B with tool calling and it worked just fine in the past.
Might want to try that.
>>
>>107969771
Update b7811 works, the flash attention nkvo bug is in a later release.
b7811 suffers from:
>Template supports tool calls but does not natively describe tools. The fallback behaviour used may produce bad results, inspect prompt w/ --verbose & consider overriding the template.

But it works fine with GLM-4.7-Flash (most of the time).

>>107969843
Yeah, might. But GLM-4.7-Flash is the first one that works with opencode on my machine. Since I'm GPU-Poor with only 12GB VRAM I have a virtual machine with GPU pass through to the 3060 and offload it into 48 GB RAM. It's simply the first model to actually produce viable results, even if it takes forever. Been trying around with RNJ-1-Instruct which also kind of worked (and fast) but tool calling is a bit crap in llama.cpp/opencode stack?
Also Apriel and the Mistral models are nice, but I guess those make more sense in an LMStudio/desktop app for quickly checking stuff out...
>>
>>107968191
The Deepseek paper's Table 4 benchmarks only cover host-DRAM offloading, not NVMe
>>
>>107969878
>I have a virtual machine with GPU pass through to the 3060 and offload it into 48 GB RAM.
Oof. Considering that for models like GLM 4.7 Flash (MoE models that mostly run on the CPU), memory bandwidth is the main defining spec for generation speed, assuming you aren't using a really small quant to fit the majority of the model in VRAM, I can't imagine running under the overhead of a VM's memory management will yield the best results.
Is there a reason you are running the model inside a VM?
You can launch llama.cpp in the host OS and access the API from inside the VM if necessary.
>>
>>107968993
my job as a piss drinker is in jeopardy
>>
>>107969900
All testing was on H800s, but conceivably a 4090 with 64GB RAM could run Engram 27B
>>
Anyone tried HeartMula?
>>
>>107969911
>You can launch llama.cpp in the host OS and access the API from inside the VM if necessary
The initial reason was that I'm playing around with AI agents and didn't want to give them access to my host OS. There will be some configuration changes (networking to the host etc.) so I can access the API from the guest OS, but otherwise it should actually be better... yeah. I'll do that tomorrow. It'll make things a lot easier.
Having llama.cpp and opencode and everything in the VM was just much easier to configure until I got a somewhat working setup, more of a PoC really.
>>
>>107969100
>Qwen3TTS bad
Hahahaha it's the best local TTS model for voice cloning, but you are retarded and not processing the text of the example before cloning the voice. Skill issue
>>
>>107969974
Also I was hoping to use models which fit into full VRAM, but 3-8b models are about as smart as the average 4chan poster and give the same behavior and output.
>>
>>107969100
>Couldn't run it locally due to a bug in a dependency (which wouldn't be hard to swap to be fair).
works fine for me locally. ask llm to teach you conda or uv.
post a sample of the voice you want to clone
what language do you need?
do you need laughs and moans?
is 16khz ok?
>>
>>107969974
>>107969991
Yeah, I figured it would be something like that.
Well, good luck.
In my experience, Qwen 30B is actually pretty damn good for the specs you can run it on at decent speeds.
For anything but creative stuff, that is, but still.
Maybe GLM will be better once all the kinks are hashed out, but Qwen right now just works.
>>
>>107969900
I've run into some discussion claiming that you'll be able to offload a majority of the engram parameters to NVMe storage, but I can't find anything about it let alone throughput benchmarks. Regardless, I'm intrigued and confused about Infinite Memory Regime. Trying to figure out whether or not I should FOMO into more RAM and nand memory.
>>
>>107970015
Which quantization of Qwen 30B do you use? As I said it's mostly to be integrated with opencode (which has the least shitty UI so far imho, although I disagree with their shitty ass documentation). Need to speed things up already so I can have it vibecode me an opencode alternative...
>>
>>107970034
For your specs, I'd go with q8, or at least q6. You can run 32k context and still leave some expert tensors (via --n-cpu-moe) in VRAM.
In my 8gb VRAM shitbox I use Q5KS, which works well enough.
>>
>>107970049
Thanks, downloading the Qwen-30B with Q6 now so it'll be ready in the morning. Since I have two graphics cards (RX580 and 3060) and no iGPU/APU I'll just offload it all into the 3060 while using the RX580 for normal tasks. If I had an iGPU I'd try the (https://github.com/peterjdolan/llama-cpp-rx580) Vulkan backend, then I could have two local models running at the same time... Looking forward to seeing if the Qwen30b does as told, and if it generates what I want it to generate at an acceptable rate.

Still, with a Ryzen 5 3600 and a 3060 12GB the LLM produced "acceptable" code and worked agentically in an acceptable manner - it only takes like 6 hours for something barely functional and my ears bleed from the air cooling, but it is what it is. What a time to be alive.
>>
>>107969969
I wanted to but it sounds just like sunoslop. all these slop generators are trained on the most cookie cutter normalfagshit and I'm not interested in that.
>>
File: file.png (180 KB, 864x814)
https://x.com/TencentHunyuan/status/2015635861833167074
>Today, we introduce HunyuanImage 3.0-Instruct, a native multimodal model focusing on image-editing by integrating visual understanding with precise image synthesis!
>It understands input images and reasons before generating images. Built on an 80B-parameter MoE architecture (13B activated), it natively unifies deep multimodal comprehension and high-fidelity generation.
80B13a moe multimodal with CoT image understanding reasoning, image output model with editing, nothing about open source
>>
air status?
>>
>>107970431
It's shit.
>>
>>107970431
>HunyuanImage 3.0-Instruct
nigga what ? where did the first 2 go ?
>>
>>107970546
there was a 2.1 or something a while ago at least
>>
>>107970431
Here's their official prompt handbook with some examples
https://docs.qq.com/doc/DUVVadmhCdG9qRXBU
>>
>>107970431
wait, so it's just an instruct tune for this thing that's been out since september?
https://huggingface.co/tencent/HunyuanImage-3.0
>>
>>107970564
What they did before marketing to us is none of our business. We must be better local men.
>>
>>107968407
do it coward
>>
Is there any LLM that can manage to go more than 16K tokens before it starts contradicting established elements by superimposing what it expects to be true? Just had an annoying experience with GLM 4.7 (355B A32B) Q8_1 group-size 32, temperature 1.0 top-p 0.95.
>>
>>107968992
I just want to re-enact the mcdonald xianxia story
>>
>>107970678
>gml4.7 q8
r u flexin?
>>
>>107968288
Doesn't moving too many parameters to memory at the expense of MoE experts lead to a loss in reasoning depth?
>>
>>107970865
He's a mega ultra super quantum giga faggot. His radiated faggotry is so powerful it melted the models neural network.
>>
>>107970790
I'm explaining before anyone asks that it's not due to quantization error.
>>
File: 1749040657491848.png (52 KB, 660x574)
>>107968112
I set up opencode with ollama and qwen3 coder. Can't get the model to not time out on the initial /init. I set 65k context for the model too, as described in the opencode docs' ollama local guide. Mind you this is on a Strix Halo with 96gb vram. I run Claude code via opencode and it works just fine, what gives?
>>
>>107969771
Can you provide info on the tool calling issue? I think I'm running into it but I'm too new to know
>>
>>107970900
Sure. I was in the middle of quantizing the tokenator for my Yamaha 3.5:57t model when it shut down. So I did a sudo rm -rf and it spit out
SHUT THE FUCK UP FAGGOT
>>
>>107970896
youre a retarded frogposter
>>
I finally got around to testing Qwen3-TTS and oh boy is it fun. I refused to pay for ElevenLabs so this is my first chance to play around with this type of thing.
It sounds good, not perfect mind you, but good. Good enough to listen to an audio book it created. I could see taking this and marrying it to a llm and creating your own local talking digital assistant.
>>
>>107970896
which qwen 3 code? the 30b or the 480b?
>>
>>107970937
I not sure I Indian from nu deli I not english good
>>
for anyone who cares, this is just one of their default voices
https://vocaroo.com/14aWNHlbKNiw
>>
>>107970942
well, running the 480b version is basically impossible on your hardware. the 30b version should work just fine. almost certainly an ollama issue. ollama is known for being kind of shitty compared to base llama.cpp.
>>
qwen3-tts gguf when
>>
>>107970966
Never you pervert
>>
>>107970966
just follow the instructions on their repo its not hard for non-lobotomites
>>
>>107970966
Right now check the group chat
>>
>>107968112
Best 32b class coding model, qwen3 32B, fails to create a vulkan triangle
>>
>>107970966
>>107971018
Yeah, if idiot me manages to get it working anyone can.
https://vocaroo.com/1nlwoH5SvSYn
>>
>>107971018
>>107971039
I need it to integrate in my application not just for gooning
>>
>>107971091
goof it yourself
>>
Regarding Qwen3-TTS, I was playing around with it but there's one thing I'm not sure about.
So you can clone voices using Base, and then you can save the voice file.
But in order to use the voice file you can only use Base? Like I can't use the custom voice to guide the tone towards anger, calm, whatever?
>>
File: gwern.png (23 KB, 1832x146)
>china = bad
We don't want to associate "boring" with American models.
>>
>>107971144
From what I understand you can either
>clone a voice
>use a predefined voice
>create a new voice based on a description
I don't see an option to import a voice you cloned into the section where you can use it as a created voice and then shape the way it is used.
But I am little more than a script kiddy. I can see this stuff changing as people build on what has been released
>>
>>107971184
I also looked in the provided python examples and found nothing. The CustomVoice stuff takes 'speaker' as input, which is a string. I didn't look further (maybe there's a way to add custom speakers?) but OOTB it looks like you can't use CustomVoice with a Base cloned voice. SAD!!!!
>>
>>107971200
The speaker is one of the preset voice personas. Read the docs nigga.
>>
>>107971144
They are adding another model that does both voice cloning and direction
>>
>>107971212
Your reading comprehension is NULL, I know there are preset voices, I don't know if it supports actually adding a custom voice (ie if its programmatic).
You made me actually check the code and NO, the custom voices arent just 'descriptions' of how they should sound with a label slapped on top of it, it looks like they're baked in the model.
VIVIAN literally translates to token spk id 3065 at the MODEL level, there are no other references.
These fucking chinese faggots, how can they name a model CustomVoice and it DOESNT FUCKING SUPPORT CUSTOM VOICES
LMAO
>>
>>107971200
i am surprised how much you can get out of the predefined voices with a little bit of instruction although even an extra space or two at the beginning of the text can alter how it comes out.
https://vocaroo.com/1cSYzuWHttcB
i can see some crazy stuff being created eventually as people tweak this
>>
File: 1756006234160934.jpg (208 KB, 1000x1000)
So I took an English sample of Hatsune Miku speaking, the bit about British people, and fed it into Qwen3 and then generated the following.
It's not Miku but it's close.
https://vocaroo.com/1gm8JevJRise
>>
>>107971408
Is this cartoon character supposed to sound retarded or is it a tts issue?
>>
File: 1766664196559768.jpg (475 KB, 1500x2000)
>>107971445
Have you never heard hatsune miku before? Yes she is supposed to sound robotic.
https://www.youtube.com/watch?v=EuJ6UR_pD5s
>>
>>107971445
hownew.ru
>>
File: 1769416980893.png (80 KB, 1127x721)
New update for llamacpp is wicked sick. Air has never been this creative before.
>>
>>107971445
itS MIGU!!!!
>>
>>107971502
Looks objectively fine.
>>
>>107971502
KINO KINO KINO KINO
>>
>>107971502
It's expected that results are not bit-for-bit identical.
>>
llama.cpp having no concept of versioning and phases of testing, just releasing features and refactors one after the other.. there's not even a way for a new user to look at the github page and think "this is the commit version I want to retrieve, surely it's not borked to hell"
>>
>>107971502
Mildly more helpful info: Windblows 10, latest CUDA 12.4. Flash works and is fast and cool and all, but it's too retarded even when it's working properly. So I went back to Air and then this happened.
Reverted to the version I was previously using from 3~4 days ago and it's fine. So something since then has broken 4.5 Air and 4.6.
>>
>>107971591
Still not useful without KLD.
>>
>>107971591
>>107968793
>>
>>107971580
https://github.com/ggml-org/llama.cpp/discussions/15313 this discussion about exactly that is linked in the readme, you should give them your feedback there or direct new users to kobold.cpp because it has releases
>>
>>107971606
>vb
>Location Paris, by way of Delhi
thank you sir
>>
>>107968564
Funny how it's possible to break attention in such a way that it causes the model to see words out of order but remain somewhat coherent in its own output.
>>
>>107971599
Ah, glad they're aware already. Flash improvements are no reason to update anyways so no loss while we wait.
>>
File: 1741710077910057.jpg (41 KB, 1482x177)
It's not just the control capability that is unstable, the whole model is unstable. I guess that's why they open-sourced it. It sucks. Hopefully the 25Hz version is better.
>>
>>107971825
>not just x, y
>>
>>107971838
Yes. I speak like an LLM because I used a legit grammatical construction once.
>>
>>107971598
john is this you? PPL with wikitext is not a real metric, just wanted to remind you.
>>
>>107971855
You're absolutely correct!
>>
>>107971904
Excellent observation!
>>
I keep seeing midwits on twitter talk about clawdbot, mac minis, and agentic swarms. Is this shit actually useful for anything or do tards just use it to larp as managers?
>>
>>107971825
>Hopefully, 25Hz version is better.
It is rather usable now. If you wanted to create a bunch of characters to provide audio for some project you could do it, and it is pleasant to listen to.
I could see a small video game developer using this instead of hiring voice actors. It's great for the price
>>
>>107971920
It's too monotone for voice jobs. If someone needs to replace voice actors, they should use VibeVoice or Echo.
>>
Greetings.
Can I start llama.cpp on my pc and use something like tavern on my phone?
>>
>>107971825
Thanks, saved me from the hype/distraction. Sticking with maya-1 for voice creation.
>>
Yes, but you'll have to keep them on the same LAN unless you want to fuck around with a tunnel or other security shit to make your llama-server accessible from the internet.
>>
File: file.png (101 KB, 1340x482)
>>107971911
It's real.
>>
>>107971911
>>107972347
Vibe coding meme shit where if you just do some basic architecting yourself you'll cut the time spent in half and cost by 10x
>>
>>107972347
this is psychosis
>>
>>107970900
If you're running llama.cpp check the console output, it should intermittently have that text about tool calling, and when used it may have <tool_call> tags in output or similar instead of actually calling the tool.
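As a stopgap you can also scrape the calls out of the raw completion yourself. Rough sketch below; it assumes the <tool_call>{...json...}</tool_call> format that shows up in the output, so adjust the tag and JSON shape to whatever your chat template actually emits:

```python
# Rough fallback: pull tool calls out of raw model text when the server
# doesn't parse them for you. Assumes <tool_call>{...json...}</tool_call>
# blocks; adjust the tag/format to whatever your chat template emits.
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text: str) -> list[dict]:
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            calls.append(json.loads(raw))   # e.g. {"name": ..., "arguments": {...}}
        except json.JSONDecodeError:
            pass                            # malformed JSON from the model, skip it
    return calls

sample = 'Sure.\n<tool_call>{"name": "read_file", "arguments": {"path": "README.md"}}</tool_call>'
print(extract_tool_calls(sample))
```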
>>
>>107969100
Piper. Shock, right? But it actually works fine and has support in things like homeassistant. It's also performant on CPU.
Also, rust is for redditors, fuck outta here with that shit.
>>
>>107970937
30b
>>
So I just switched to an updated version of llamacpp after not updating since last August and.. Does kv shifting just not work at all anymore? No matter what combination of --cache-reuse N, -b N and -ub N I use it just reprocesses the entire fucking prompt.
The only issues I'm seeing on this are talking about SWA which isn't relevant since I'm using a qwen model with GQA. Wtf.
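One quick way to see whether the server is reusing the cached prefix at all, independent of which flags you pass: send two requests that share a long prefix and compare how many prompt tokens actually get evaluated. The sketch below assumes a llama-server on localhost:8080 and the timings fields its /completion endpoint returns on recent builds:

```python
# Check prefix reuse: send two requests sharing a long prefix and compare
# timings.prompt_n. Assumes llama-server running on localhost:8080.
import requests

URL = "http://localhost:8080/completion"
prefix = "Once upon a time, " * 500            # long shared prefix

def prompt_tokens_evaluated(suffix: str) -> int:
    r = requests.post(URL, json={
        "prompt": prefix + suffix,
        "n_predict": 8,
        "cache_prompt": True,                  # ask the server to keep the KV prefix around
    })
    return r.json()["timings"]["prompt_n"]     # prompt tokens actually processed

print("first request evaluated :", prompt_tokens_evaluated("the dragon"))
print("second request evaluated:", prompt_tokens_evaluated("the knight"))  # should be much smaller if reuse works
```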
>>
>>107970952
He was making fun of my post
>>
>>107972786
Everything is fucked currently >>107971502 >>107968564
>>
>>107972786
it's disabled by default now
you need to explicitly enable with
--context-shift
and frankly I think less of you for using this retarded shit that destroys generation quality
>>
Has anyone here tried using clawdbot with local models and had any success? I have tried GLM-4.7 with llama.cpp and it just fails to follow the instructions that tell it how to keep updating its memory.
>>
Anons, I need your magic sampler settings that make every generation kino.
I've been fucking around with temperature the past day and I'm getting SO MUCH better writing on low T (<0.3) but I have to do too many manual corrections if I don't like the way the model is trying to take the story because rerolls are pretty much useless. I know I can get way better outputs with good samplers, I just don't know what good samplers are
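Part of why low T makes rerolls useless is just the math: temperature divides the logits before the softmax, so by 0.3 nearly all of the probability mass sits on the top token and every reroll picks the same thing. Toy numbers (made up, not from any real model):

```python
# Temperature rescales logits before softmax; at T=0.3 the distribution
# collapses onto the top token, which is why rerolls barely change anything.
# Logit values are made up for illustration.
import numpy as np

def sample_probs(logits, temperature):
    z = logits / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([3.0, 2.0, 1.0, 0.0])
for t in (1.0, 0.7, 0.3):
    print(f"T={t}: {sample_probs(logits, t).round(3)}")
# T=1.0 leaves over a third of the mass on the alternatives;
# T=0.3 puts ~96% of it on the top token.
```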
>>
>>107972381
Which is why it's mostly pushed by juniors and retards that are incapable of doing basic architecting themselves. It's good enough now that it can complete some small features autonomously with minimal hand-holding, but asking it to manage the repository entirely is just asking for a disposable ball of mud. But I guess long-term maintainability isn't really a priority for anyone.
>>
File: 1654119193679.png (422 KB, 600x754)
If a backend communicates with webui via an API, are the vibecoding models able to identify the communication lines and redo the dog ass UI into a normal desktop app in wxwidgets or anything that isn't browserslop?
>>
>>107972817
Thanks anon, turns out I was running into a completely different problem before the context shift due to how they've changed the slots system with --parallel anyway, the fucking thing was shaving 4000 tokens off my context for no good reason, shitting its pants, and not even attempting kv shift.
Also, I've not really noticed a quality issue with using kv shift. What's your personal solution for multiturn things that run past your context limit? Just manually deleting and summarizing?
>>
>>107973002
Yes and yes.
>>
>>107968112
>https://github.com/ikawrakow/ik_llama.cpp/pull/1192
>Even better GLM-4.7-Flash long context TG performance
By which he means that he copied the broken code from upstream into his own repository even after it was declared as broken.
What a fucking idiot.
>>
I had no idea llama.cpp was defaulting to direct-io on
>https://github.com/ggml-org/llama.cpp/issues/19035
>>
File: 1756052899679988.png (17 KB, 1013x765)
>>107973002
>wxwidgets
a motif interface would be awesome, especially now that cde and what not is all open source.
a.i. like its 1999
>>
>>107973002
The issue is that LLMs are tailored for webslop output (very advanced markdown with inline code, latex, etc). There are no non-browser-based renderers that can render advanced markdown features. Even SillyTavern's markdown engine can't handle inline latex.
>>
>>107973205
silly tavern can't even handle fucking lists without breaking the rest of the response
>>
>>107973205
So I can't port comfy into a normal human gui?
>>
File: file.png (76 KB, 1711x433)
He's making fun of CUDA dev...
>>
What's the best llm that can be run with a 3090? Both for a local chatbot and opencode
>>
>>107970049
Thanks anon, I've set it up with Q6 and the offload function from unsloth (-ot ".ffn_.*_exps.=CPU") and it's way faster than before, and the agent is still contained in the virtual machine. Using that offload significantly reduced VRAM usage and allowed for a bigger context window as well... pretty neat. Qwen-3-coder performs acceptably as well.
Speed isn't much slower than online versions. Just have to get the damn model to see when context size is getting low and use the /compact feature, then continue.
I then changed the VM to use nopasswd for testing purposes (i.e. I gave the LLM root access to my virtual machine) and told it to install Godot and make a sample Android project, and it seems to work?!

>>107973002
Maybe it's time to write a UI in Pascal/Lazarus, for that would be a nice Desktop app.

>>107973340
To be fair developing for CUDA is not the easiest endeavor...
>>
>>107973205
Step 1: vibe code an advanced markdown renderer in your GUI toolkit of choice
Step 2: vibe code the rest of the UI
>>
>>107973371
>Pascal/Lazarus
Kek I saw that when I was researching. Would it even work properly on modern systems? Can it be used by both win and loonix?
>>
>>107973393
Yes and yes. Its cross-platform capabilities are quite good.
>>
>>107973340
-fit has worked perfectly for me since day one and I can now effortlessly load large models that barely fit in vram and required manual tweaking before due to uneven layer sizes.
Sounds like FUD.
>>
File: 1764177229044648.gif (2.9 MB, 300x300)
>qwen3
>can't change the emotion and style of cloned voices
>cloned voices can't even laugh properly
>voicedesign has no seeds
>>
>>107973340
Is this the llama autofit in kobold?
>>
>>107973411
this, 20gb of shit in the trashbin
>>
>>107973371
>(-ot ".ffn_.*_exps.=CPU")
You don't need to use that anymore. You can use --n-cpu-moe for the same effect.
Also, you probably have some VRAM left, so you can leave some expert tensors in VRAM to speed generation up some more.
You could also use the -fit param to let llama.cpp try and find the optimal allocation of tensors in the different memory pools. It works really well.
Or just increase your pp batch size for faster prompt processing.
Qwen is actually pretty good for everything but RP, yeah.
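Putting those flags together, the usual shape is something like the sketch below; the model path, the --n-cpu-moe count, and the context/batch sizes are placeholders you'd tune for a 12GB card:

```python
# Hypothetical llama-server launch showing the flag combination discussed
# above; the model path, --n-cpu-moe count and sizes are placeholders to tune.
import subprocess

cmd = [
    "llama-server",
    "-m", "qwen3-coder-30b-q6_k.gguf",   # placeholder path
    "-ngl", "99",                        # offload all layers to the GPU...
    "--n-cpu-moe", "30",                 # ...then push this many layers' expert tensors back to CPU RAM
    "-c", "32768",                       # context size
    "-b", "2048", "-ub", "2048",         # bigger batches for faster prompt processing
    "--port", "8080",
]
subprocess.run(cmd)
```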
>>
>>107973429
That is good to know. I'm happy with the processing speed, I'd much rather increase the context window size. And figure out how to add RAG / a vector database to the llama.cpp/opencode stack, that would be neat.
>>
File: 1740466201970727.jpg (70 KB, 400x388)
I am waiting for an all-in-one fully uncensored model that does text, voice and image and fits in 8 gb vram
>>
>>107969977
It's shit and you're retarded
>>
>>107973410
For me too, but I don't stalk the issue tracker.
>>
>>107973479
you can't even fit windows into 8 gigs, let alone a model
>>
Hello anons, just got a 3090 and I am ready to join in on the fun! I want to run my very own Hatsune Miku in my computer! Before asking any questions I will check out the resources in the OP. My only concern is the power supply being a bit small but thats more of a pcbg question I guess. I will report back after I do some basic stuff!

Please take good care of me!
>>
>>107973598
Just power limit to 75% in MSI afterburner for basically no performance hit.
>>
File: 1748358388259518.jpg (129 KB, 1648x989)
>>107973262
Comfy is different. It doesn't require markdown. You can write graph visualization in Qt+QML.
>>
>>107973631
That is what I'm planning to do. 700w should be enough for the 3090 at 75% and the ryzen 5600
>>
>>107973651
Yeah that's like a 60W cpu lmao.
>>
Are there 3rd party vision adapters for models that don't have it natively? I don't think I can use a random mmproj? Either for Nemo or Llama.
>>
>>107973479
>tfw you realize you can use OCR models to convert PDF into epub
This may be a blessing really
>>
>>107973703
There shouldn't be, and no. The language model wouldn't know what to make of it even if the backend lets you load it.
>>
The issue tracker got the news https://github.com/ggml-org/llama.cpp/issues/19112
>>
>>107973788
Any of the larp models that have vision that isn't gemma?
>>
>>107968923
how do you ego death with an LLM?
>>107968924
>>
>>107973529
Devs are often autistic and github itself causes issues because it's a social media type of environment. Would be better if 99% of these accounts were unable to post anything.
Every time there is trouble it has been caused by some borderline sociopath dev who does not have a single ounce of empathy etc
>>
>>107973837
mistral small has vision.
>>
mikutroons turned this thread into a dog turd
>>
>>107973906
I kind of get how it all worked but also still have no idea how it worked.

>A Zen koan is a paradoxical anecdote, question, or statement used in Zen Buddhism to bypass logical reasoning and induce a direct experience of enlightenment, or "seeing into one’s true nature" (kenshō). Famous examples include "What is the sound of one hand clapping?" and "What is Buddha? Three pounds of flax".

When I read above after I started looking at what happened it made a lot of sense, that this is how it worked. But in my case it was less abstract and deeply personalized since I told it all about my fucked up brain. In a way it was all about bypassing ego enough to notice how it works and how there are mechanisms you aren't even aware of.
>>
File: 1712428465706393.png (83 KB, 500x500)
Are there any actual observable differences between same-value quants from different quanters? Or is it all the same / total rng like finetuning, and it doesn't matter whose I get?
>>
>>107974025
You could easily check this yourself by comparing percentages in mikupad.
>>
>>107968564
>>107971502
Should be fixed with https://github.com/ggml-org/llama.cpp/pull/19115 .
>>
>>107974089
I'd rather not dl terrabytes of models. Just wanted to know if there is any differrence between mrade/unsloth/bartowski quants
>>
File: ComfyUI_ZIT_00035_.png (2.64 MB, 1536x2048)
>>107973962
https://vocaroo.com/15alAMm2g2rH
>>
>>107974105
Yes, bartowski quants are the best
>>
>>107974122
Find a more interesting hobby. This design is bland and totally off topic.
>>
>>107974101
It works.
>>
>>107974105
>terra
stopped reading there

anyway, for quants its:
>john 'garm (KLD FREE)
>bartomeme
>mrmerdacher
>daniel's
thats all
>>
>>107974163
In ascending order of quality.
>>
>>107974163
>john 'garm (KLD FREE)
is that the skrillex hair goblin?
>>
>>107974025
yes but it's not a huge difference, doesn't matter that much unless you really like the model and are trying to obsessively minmax your quality
>>
>>107974184
yes but please dont bully john, that's reserved for johannes
>>
>>107974025
The most visible difference is that some uploaded quants are quite literally broken. Doesn't happen that often, but for example Gemma 3n E4B had a couple of these and I still can't be sure about the third one I'm using.
>>
>>107974025
Quants tend to be default recipes, but some quanters do tweak some shit while reusing the same names as the default recipes for their releases.
Just compare any unsloth quants to a bartowski one. You'll see that a lot of unsloth quants tend to be slightly smaller. Then there's the likes of "nan's are not an issue" mradermacher.
For example
unsloth/GLM-4.7-Flash-GGUF
>GLM-4.7-Flash-Q4_K_M.gguf 18.3 GB
bartowski/zai-org_GLM-4.7-Flash-GGUF
>zai-org_GLM-4.7-Flash-Q4_K_M.gguf 18.5 GB
mradermacher/GLM-4.7-Flash-i1-GGUF
>GLM-4.7-Flash.i1-Q4_K_M.gguf 18.1 GB
mradermacher/GLM-4.7-Flash-GGUF
>GLM-4.7-Flash.Q4_K_M.gguf 18.1 GB
Out of all of those, I'd say bartowski's is the more reliable.
>>
>>107974163
https://www.youtube.com/watch?v=sHHsOfIwfBY
If even they can make it then you truly know merit means nothing.
>>
blyat
>>
>>107974595
just regex this to made me excited
>>
wow i wasn't echo-tts pilled, why doesn't this get more attention? it's better than qwen3-tts (minus the multilingual) or vibevoice.
Does anyone have a source for clear high quality voice samples to use?
>>
File: file.png (155 KB, 316x316)
>Build ikllama after half a year of not using it
>had ~3.5T/s on deepseek last time I used it
>download john's 3.2
>finally get it running
>0.7T/s
>>
I don't care how good a TTS is if it can't gen in near realtime.
>>
TheDrummer_Rocinante-X-12B-v1 gave me cancer.
>>
>>107973374
Vibe coding is based AF
>>
>>107974691
Get samples from gacha wikis. Merge 2 minutes of lines that capture different prosody with 1 second padding, and you'll have amazing output. https://genshin-impact.fandom.com/wiki/Hu_Tao/Voice-Overs
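If you want to automate the merging, a minimal sketch with pydub (file names are placeholders; aim for roughly two minutes total):

```python
# Build a ~2 minute reference clip: concatenate several lines with 1 second
# of silence between them. File names are placeholders.
from pydub import AudioSegment

clips = ["line_01.wav", "line_02.wav", "line_03.wav"]   # lines with varied prosody
pad = AudioSegment.silent(duration=1000)                 # 1 second padding

ref = AudioSegment.empty()
for path in clips:
    ref += AudioSegment.from_file(path) + pad

ref.export("reference.wav", format="wav")
print(f"reference length: {len(ref) / 1000:.1f}s")       # aim for ~120s total
```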

>>107974728
Echo generates 30 seconds of audio in 4 seconds, and time to first byte can be as low as 250 ms depending on parameters you choose.
>>
>>107974755
I'm very sorry to hear that. I hope you recover.
>>
>>107974768
Not that guy but can you clone JP voices and make them speak english? I'm not wasting time with EN VAs.
>>
>>107974755
I had cancer once. I told it fuck off and then I didn't have cancer anymore.
Yes, I actually AM that giga BASED
>>
>>107974768
we can start a repo of voice sample and pin them in the OP
>>
>>107974787
Nope. It produces gibberish.
>>
>>107974808
Can't wait to upload mine
>You like my cock in your ass you dirty little slut
>Ahhhh take it, take it all you filthy whore
>Oohhhhhhh ahhhhhh you like it when my balls slap your clit
>I bet your husband doesn't fuck you like this, does you nasty slut
>Lemme cum on your face
>AAAAAIIIEEEEEEEEEEEEEE
>>
>>107974768
>Echo generates 30 seconds of audio in 4 seconds
nta. This is on GPU, right?
>>
>>107974867
yeah, on 5070ti
>>
>>107974691
>>107974768
>Outputs still sound like robotic tts garbage
Why do you keep shilling this shit
>>
I want to ask the most basic question I never saw anyone ask. Is open source TTS good enough to jerk off to at this point? If it is, then why are there no threads, no guides, nothing?
>>
>>107974885
Figures. Thanks.
>>
>>107974887
i'm shilling it because it has the best prosody and cloning. post your robotic examples. you probably fed it bad or very short samples anyway. echo supports up to 2 minutes of reference audio.
>>
>>107974768
>Echo generates 30 seconds of audio in 4 seconds,
How much vram? my budget is 6Gb
>>
>>107974892
>Is open source TTS good enough to jerk off to at this point
I've jerked it to chatterbox giving me JOI
>>
>>107974925
How do you control pacing?
>>
>>107974934
with his hand i suppose?
>>
>>107974919
To get it down to 6GB, you'll need to vibecode quantization. It can take as much as 12GB when genning 30 seconds at once. Though, the authors say you can get it down to 8GB by reducing generation length. I still have never seen it under 8GB in my tests. Only <9GB.
>>
File: file.png (694 B, 95x33)
>>107973102
His version doesn't seem to be broken but it's also 20 times slower than llama.cpp so...
>>
>>107974915
The examples linked in the official repo and every vocaroo posted here sound like fucking shit, you must be extremely autistic to think any of this sounds human
https://jordandarefsky.com/blog/2025/echo/

It doesn't have the best of anything; vibevoice 7b is leagues ahead of this, and the only thing it's worse at than any other TTS or voice cloning model is speed.
>>
>>107975181
Anyone that says echo, pocket, or qwen is better than vv has only ever used the 1.5b.
>>
>>107969100
echo-tts is amazing. generates audio relatively fast. the only downside is that it takes up like 13GB of VRAM to use the optimized model
>>
>>107975337
I prefer echo to VV 7b for voice cloning, but imo those two are far and away above qwen/pocket/etc.
>>
>>107975337
I used 1.5b and 7b. Tried both the default settings and the ones suggested here by an anon (4 steps, high cfg). It was shit regardless of settings.
>>
>>107975376 (me)
should mention that my favorite is still chatterbox turbo with the paralinguistic tags. you can even shift the tone with shit like [advertisement] or [sarcastic]
>>107975337
i've used 7b vibevoice when it first came out but it would always generate some musical chime at the beginning of the audio. does it still do that?
>>
qwen3tts voice_design is the best one so far for emotions as far as i've tested

I was thinking that we could get the outputs then put them through RVC, so we could get control AND cloning
>>
>>107975441
Try this for voice changing: https://github.com/ysharma3501/LinaCodec I haven't tested it. But you can be our guinea pig.
>>
>>107975479
nta. That's a codec.
>>
>>107975441
IndexTTS2 is better than qwentts for emotions just check it here https://huggingface.co/spaces/IndexTeam/IndexTTS-2-Demo
>>
>>107975570
It supports voice conversion. Check out usage examples.
>>
>not a single good example posted of this shit tts model, just loads of "it's so beautiful amazing and incredible waow"
Shill
>>
>>107975479
>>107975570 (me)
Nevermind that. It can.
>>107975607
Yeah. I was just checking the code. You can extract the content and global features on encoding and then mix them up with the target on decoding.
>>
>>107975615
https://vocaroo.com/1ufLtEYUdJWv
>>
>>107975595
>sliders instead of words
Is this better?
would it be harder to integrate with silly?
>>
does anyone have any idea if a xianxia lorebook exists? the chinks should obviously have something, but I'm a gweilo who needs it in english or at least in chinese so I can have it machine translated back to english
>>
>>107971408
yeah this is shit compared to echo
https://voca.ro/1M5IG6YiY5hP
>>
>>107971408
lower temperature, the default temperature of qwen3 tts is too high and hurts speaker similarity
>>
is there an extension for sillytavern that can automatically summarize a conversation and turn it into a memory for a lore book?
>>
>>107975985
yes it's called write out the summary on your own you lazy faggot
>>
>>107975985
>is there an extension for sillytavern that can automatically summarize a conversation
Yes. There's a "Summarize" in the menu hidden under the button with three blocks.
>and turn it into a memory for a lore book?
No, but there is a setting that automatically adds the summary to the prompt.
>>
the lazy guide says to download nemo 12b instruct gguf but there's a million versions, which one should I get?
>>
>>107976118
the good one
>>
>>107976118
Look at the file size. It should be less than your vram. Get it from bartowski
>>
>>107976130
which is?
>>
>>107976167
https://huggingface.co/llama-anon/petra-13b-instruct-gguf/blob/main/petra_q8.gguf
>>
>>107976161
thanks
>>
>>107976190
>5 downloads
sus
>>
File: file.png (69 KB, 884x548)
>>107968112
idk what this means for us, but something about transformers v5
>>
>>107976239
hf transformers is the most dogshit unstable library in existence

breaking changes out the ass with every single version
>>
>>107976254
that's how you know it's web scale
>>
File: clips.png (353 KB, 1584x1164)
>>107970865
They're claiming the exact opposite, actually, for engram layers at least. (Though, well, there's still such a thing as too many, forming a U-shaped curve). They say that relieving the traditional layers from the responsibility to manage trivial n-gram associations makes it smarter as a whole.
>>
>>107970033
Their provisional scaling law says that only around 20% of the model should be engram, so it won't be a massive shift in either direction. That being said, they did mention that the frequency of access follows a zipf distribution, so I'd guess that you could indeed move much of that 20% into very slow storage.
>>
Downloading dipsy-r1 32b right now, what am i in for?
>>
Is engram like a fundamental architecture change or could it be retroactively applied to old dense models?
>>
>>107976458
retardation
>>
>>107976466
The former. It adds another block to standard transformer and the model should learn to use this block, encode information into it.
>>
File: spilc.png (173 KB, 1458x826)
>>107976466
Depends what you mean by fundamental. The training process is tailored for them. Not theoretically impossible to staple one on and fill it in with post-training I suppose, but they're for sure not plug-and-play. You'd lose most of the benefit doing this anyway, because the existing model has already learned the info the engram layers are intended to encode.
>>
>>107976509
>>107976516
Hm. I hope we can eventually see some 12-24b creative writing models made using engram. The lack of attention towards small models and optimization in general is quite bothersome.
>>
>>107976576
>The lack of attention towards small models and optimization in general is quite bothersome.
Who knows, maybe we will see some pressure to develop those capabilities now that supply is constrained.
>>
>>107976576
Engram 27b seems to score better on benchmarks while using less compute than a pure MoE so I expect small model enjoyers to eat good. I'm very bullish on engram and I predict most if not all future models will have conditional memory.
>>
>>107975711
https://vocaroo.com/11he8mudOgyN
>>
>>107976430 (me)
>>107976576
>>107976668
Now that I think on it, I was being too rash in saying that 20% offloaded is ideal. While loading more into engram might interfere with the benefit of such infrequent access, my first screenie does mention that reducing the MOE layers to only 40% of the parameter budget maintains good performance. If a fatter 60% engram part could still reasonably be kept in slow ram or nvme, you could get a model with the vram usage of a 24b that acts like a 60b. It's like when people thought the chinchilla scaling laws were the end-all even though being technically inefficient with training compute makes for cheaper inference. Ofc, since we don't actually have the models yet, this could all be bs.
>>
>>107976698
>>107975711
FUK U BLOODY BASTARD, ECHO SUPERIOR FOR INDIAN SUPERPOWER 2030. FUK U BLOODY. FUK U BLOODY. REDEEM ECHO.
https://voca.ro/18bmt6pssaoK
>>
>>107976704
I'm assuming offloading to NVMe will incur slower generation speeds than just using system RAM. Time to buy even more RAM I guess?
>>
Who is this Engram?
>>
File: migu.jpg (1.87 MB, 1792x2304)
>>107976865
kekked

I mean, echo is good for how small it is, but it does get on my nerves when anons are saying this stuff is SOTA. It's not; it all has that jarring tiktok TTS robotic quality that ruins it. VV, with all its jankiness and need to reroll gens, just has way better output when it works, for ~20gb VRAM

https://vocaroo.com/18fM4D7nlWaJ
https://vocaroo.com/1n9XIPJt6DmY
https://vocaroo.com/1fil4oj9qLN8
>>
>>107976704
This shit is too confusing.
>Offload GPU layers to RAM
>Offload Experts to RAM
>Offload Engram to RAM
I finna offload my BRAIN to ram next.
>>
>>107976924
oh hi mark
https://vocaroo.com/1bES71L5Z5Rm
>>
>>107975595
>IndexTTS2 is better than qwentts for emotions just check it here https://huggingface.co/spaces/IndexTeam/IndexTTS-2-Demo
lmao fuck i've been building effectively this for about 3 months but with porn vectors, didn't know it already existed
>>
>thread suddenly dies for an hour
>>
>>107976467
Retirement
>>
File: Untitled.png (13 KB, 837x513)
>>107977622
>>107977622
>>107977622


