/mlp/ - Pony






File: New OP.png (1.53 MB, 2119x1500)
Welcome to the Pony Voice Preservation Project!
youtu.be/730zGRwbQuE

The Pony Preservation Project is a collaborative effort by /mlp/ to build and curate pony datasets for as many applications in AI as possible.

Technology has progressed such that a trained neural network can generate convincing voice clips, drawings and text for any person or character using existing audio recordings, artwork and fanfics as a reference. As you can surely imagine, AI pony voices, drawings and text have endless applications for pony content creation.

AI is incredibly versatile; basically anything that can be boiled down to a simple dataset can be used for training to create more of it. AI-generated images, fanfics, wAIfu chatbots and even animation are possible, and are being worked on here.

Any anon is free to join, and there are many active tasks that would suit any level of technical expertise. If you’re interested in helping out, take a look at the quick start guide linked below and ask in the thread for any further detail you need.

EQG and G5 are not welcome.

>Quick start guide:
docs.google.com/document/d/1PDkSrKKiHzzpUTKzBldZeKngvjeBUjyTtGCOv2GWwa0/edit
Introduction to the PPP, links to text-to-speech tools, and how (You) can help with active tasks.

>The main Doc:
docs.google.com/document/d/1y1pfS0LCrwbbvxdn3ZksH25BKaf0LaO13uYppxIQnac/edit
An in-depth repository of tutorials, resources and archives.

>Active tasks:
Research into animation AI
Research into pony image generation

>Latest developments:
GDrive clone of Master File now available >>37159549
SortAnon releases script to run TalkNet on Windows >>37299594
TalkNet training script >>37374942
GPT-J downloadable model >>37646318
FiMmicroSoL model >>38027533
Delta GPT-J notebook + tutorial >>38018428
New FiMfic GPT model >>38308297 >>38347556 >>38301248
FimFic dataset release >>38391839
Offline GPT-PNY >>38821349
FiMfic dataset >>38934474
SD weights >>38959367
SD low vram >>38959447
Huggingface SD: >>38979677
Colab SD >>38981735
NSFW Pony Model >>39114433
New DeltaVox >>39678806
so-vits-svc 4.0 >>39683876
so-vits-svc tutorial >>39692758
Hay Say >>39920556
Haysay on the web! >>40391443
SFX separator >>40786997 >>40790270
Synthbot updates GDrive >>41019588
Private "MareLoid" project >>40925332 >>40928583 >>40932952
VoiceCraft >>40938470 >>40953388
Fimfarch dataset >>41027971
5 years of PPP >>41029227
Audio re-up >>41100938
RVC Experiments >>41244976 >>41244980
Ace Studio Demo >>41256049 >>41256783

>The PoneAI drive, an archive for AI pony voice content:
drive.google.com/drive/folders/1E21zJQWC5XVQWy2mt42bUiJ_XbqTJXCp

>Clipper’s Master Files, the central location for MLP voice data:
mega.nz/folder/jkwimSTa#_xk0VnR30C8Ljsy4RCGSig
mega.nz/folder/gVYUEZrI#6dQHH3P2cFYWm3UkQveHxQ
drive.google.com/drive/folders/1MuM9Nb_LwnVxInIPFNvzD_hv3zOZhpwx

>Cool, where is the discord/forum/whatever unifying place for this project?
You're looking at it.

Last Thread:
>>41354496
>>
FAQs:
If your question isn’t listed here, take a look in the quick start guide and main doc to see if it’s already answered there. Use the tabs on the left for easy navigation.
Quick: docs.google.com/document/d/1PDkSrKKiHzzpUTKzBldZeKngvjeBUjyTtGCOv2GWwa0/edit
Main: docs.google.com/document/d/1y1pfS0LCrwbbvxdn3ZksH25BKaf0LaO13uYppxIQnac/edit

>Where can I find the AI text-to-speech tools and how do I use them?
A list of TTS tools: docs.google.com/document/d/1y1pfS0LCrwbbvxdn3ZksH25BKaf0LaO13uYppxIQnac/edit#heading=h.yuhl8zjiwmwq
How to get the best out of them: docs.google.com/document/d/1y1pfS0LCrwbbvxdn3ZksH25BKaf0LaO13uYppxIQnac/edit#heading=h.mnnpknmj1hcy

>Where can I find content made with the voice AI?
In the PoneAI drive: drive.google.com/drive/folders/1E21zJQWC5XVQWy2mt42bUiJ_XbqTJXCp
And the PPP Mega Compilation: docs.google.com/spreadsheets/d/1T2TE3OBs681Vphfas7Jgi5rvugdH6wnXVtUVYiZyJF8/edit

>I want to know more about the PPP, but I can’t be arsed to read the doc.
See the live PPP panel shows presented on /mlp/con for a more condensed overview.
2020 pony.tube/w/5fUkuT3245pL8ZoWXUnXJ4
2021 pony.tube/w/a5yfTV4Ynq7tRveZH7AA8f
2022 pony.tube/w/mV3xgbdtrXqjoPAwEXZCw5
2023 pony.tube/w/fVZShksjBbu6uT51DtvWWz

>How can I help with the PPP?
Build datasets, train AIs, and use the AI to make more pony content. Take a look at the quick start guide for current active tasks, or start your own in the thread if you have an idea. There’s always more data to collect and more AIs to train.

>Did you know that such and such voiced this other thing that could be used for voice data?
It is best to keep to official audio only unless there is very little of it available. If you know of a good source of audio for characters with few (or just fewer) lines, please post it in the thread. 5.1 is generally required unless you have a source already clean of background noise. Preferably post a sample or link. The easier you make it, the more likely it will be done.

>What about fan-imitations of official voices?
No.

>Will you guys be doing a [insert language here] version of the AI?
Probably not, but you're welcome to. You can however get most of the way there by using phonetic transcriptions of other languages as input for the AI.

>What about [insert OC here]'s voice?
It is often quite difficult to find good quality audio data for OCs. If you happen to know any, post them in the thread and we’ll take a look.

>I have an idea!
Great. Post it in the thread and we'll discuss it.

>Do you have a Code of Conduct?
Of course: 15.ai/code

>Is this project open source? Who is in charge of this?
pony.tube/w/mqJyvdgrpbWgZduz2cs1Cm

PPP Redubs:
pony.tube/w/p/aR2dpAFn5KhnqPYiRxFQ97

Stream Premieres:
pony.tube/w/6cKnjJEZSCi3gsvrbATXnC
pony.tube/w/oNeBFMPiQKh93ePqTz1ns8
>>
File: anchor.png (33 KB, 1200x1453)
Anchor
>>
Is Clipper still doing episodes? I loved that "Free Hugs" one.
>>
>>41364866
God I hope not. It sucks.
>>
File: 1701781466764247.png (60 KB, 500x459)
>>41364875
>Wanting everything to be about sex with Anon
>>
>>41364876
Good idea
>>
File: 670652.png (670 KB, 3991x5761)
>>41364787
Added audio from recently released animatics (s2e3, 25, 26) to the voice dataset, replacing corresponding entries in the FiM folder.
mega.nz/folder/jkwimSTa#_xk0VnR30C8Ljsy4RCGSig
Sliced Dialogue -> Special source
Also put label files in the label files folder.

>>41364866
Working on a thing to present at Mare Fair. Not an episode as such, though it will make use of the AI voice.

>>41364875
There are a lot of things I'd do differently looking back, mainly pacing. Always better next time, that's the goal.
>>
>>41364894
>Sex Hotline 2.0
>>
>>41364787
reposting song cover in case people missed it from the last thread >>41361841
>https://files.catbox.moe/qxa6vp.mp3
>>
Found a good VITS-based voice conversion/generation tool with decent TTS capabilities called Applio.
https://www.youtube.com/watch?v=gjggpadBgOo
https://github.com/IAHispano/Applio
https://applio.org/

It seems to have a lot of functionality similar to Hay Say, but with more in-depth TTS. It effectively has a TTS layer of various voices (speakers from numerous countries, languages and accents) onto which it interposes RVC voices, which works pretty well in my early testing, as in the compiled clip below.

https://files.catbox.moe/lju6ub.mp4

Fully compatible with existing pony models, as evidenced by it working with Vul's Fluttershy S1 model. It was a bit confusing where the models had to go, though (apparently in "/Applio-3.2.4/logs/", in a named folder containing the .pth and .index files). It's a little finicky, needing experimentation with suitable TTS voices and settings to optimize for each mare, adjusting until it sounds right. The noisiness is mostly from the TTS end of things and less so the RVC side. The additional RVC training and inference functions seem useful too, though I haven't tested those parts yet.
>>
>>41365138
Those clips sound pretty good, some low level noise but all within what I'd call an acceptable limit.
Might this be a new start for pony TTS? Would be awesome to have an alternative to those that rely fully on reference audio.
>>
>>41364894
- [In progress] Download the Master File again so I can get a clean updated copy.
- [ ] Reupload a clone of the new Master File to my gdrive.
- [ ] Reupload a clone of both Master Files to HuggingFace.

Separately, I've spent a lot of time in the last few months working with LLMs. I'm putting together a library for creating more complex chatbots.
https://github.com/synthbot-anon/horsona
- [In progress] Collect a list of functionality that would help in making better chatbots. Right now, that means going through existing chatbots and figuring out (1) what features they support, (2) what it takes to specify a new chatbot, and (3) what people complain about regarding chatbot interactions and personalities.
- [ ] Split the target features into individual functions that can be implemented and pieced together, and create a github issue for each one so it's easy to keep track of everything.
- [ ] Start implementing.

If anyone has ideas for functionality and for things other chatbots do well/poorly, let me know. Right now, I don't care how difficult anything here is to implement.
>>
>>41364782
Is there a voice actor AI model that can simulate rage and emotions like AVGN and doesn't take a rocket science degree to utilize? I plan on using it for really long ranting reviews of 10k+ words.
>>
>>41365910
No, not right now.
>>
>>41365609
>Chatbots
>Horsona
Would there be any likelihood of these being capable of writing entire fics independently like GPT-PNY does/did? Curious too about their flexibility, as it could be fun to have AI mares give us a script to then animate or otherwise try to achieve. Mare-assisted brainstorming.
>>
>>41365950
I think so. I'm toying around with having an LLM read a fanfic paragraph-by-paragraph to extract information for automated lorebook creation. Writing a fic would be basically that in reverse + lorebook generation.
Once I get a better handle on how to implement this and everything /chag/ suggested, I'll try to break down the tasks into small pieces that can be implemented by more people than just myself.
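For the curious, the paragraph-by-paragraph extraction could look roughly like the sketch below. The query_json helper and the prompt wording are placeholder stand-ins, not horsona's actual API:

import json

def query_json(prompt: str) -> dict:
    # Placeholder: send the prompt to whatever LLM you're using and parse its JSON reply.
    raise NotImplementedError

def build_lorebook(paragraphs: list[str]) -> dict:
    lorebook: dict = {}
    for paragraph in paragraphs:
        # Show the LLM the lore gathered so far plus one new paragraph,
        # and ask it for new or corrected facts.
        new_facts = query_json(
            "Known lore so far: " + json.dumps(lorebook)
            + "\nExtract new or corrected setting facts from the paragraph below "
            + 'as a JSON object of {"entity": "fact"}.\nParagraph: ' + paragraph
        )
        lorebook.update(new_facts)  # later paragraphs refine earlier entries
    return lorebook

Writing would then be roughly the reverse: generate the next paragraph from the lorebook instead of updating the lorebook from a paragraph.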
>>
>>41365138
The example is a little bit too high pitched, but it does show the idea. I will test it out later myself.
>>
>>41366029
Are you planning to make it a heavily modded TavernAI or a custom UI program from scratch?
>>
>>41366105
There is a pitch slider in the settings, so it's pretty much a non-issue; it can be adjusted as desired. This will be a setting to play around with often, as differing TTS voices each vary in natural pitch range. Some deliveries might also need a small increase or reduction too.

Lower pitches did feel more Fluttershy, but didn't seem to have as much pitch variance. She kinda sounded bored or tired to me.
>>
>>41366302
I don't know. Probably a mix of both, leaning toward integrating with other UIs as much as possible. Right now, I mostly want to see how limiting the technical challenges really are when trying to make perfect chatbots. I intend for it to be a library, not a full chat program, but a UI might be necessary occasionally to make use of & test the functionality.
>>
>>41366401
I still have a fondness for how the barebones GPT-PNY worked way back when, with the colab and separate window thing. I feel it functioned a lot better and more freely than in KoboldAI, so any simple interfacing you come up with that'd allow for more raw/unfiltered/free-form outputs is good by me, even if other more flexible and potentially restrictive interfaces are adopted for it later on too.
>>
>>41366416
>https://www.youtube.com/watch?v=jHS1RJREG2Q
>https://arxiv.org/abs/2408.14837
>Diffusion Models Are Real-Time Game Engines
So some nerds combined an LLM text model with an art diffusion model and trained it on images + keyboard inputs to create synthesized Doom gameplay.
Not mare related, but the idea of practically combining different AI tools is interesting to me.
>>
https://www.udio.com/songs/67X7mqHih4C8m4raEX8fzW
https://pomf2.lain.la/f/tky4cms5.mp4

Midnight Rejections
acoustic guitar music. princess celestia, anon, male vocalist, sad

Lyrics

[Verse 1]
Celestia's trying hard
She’s got her royal charm turned up to ten
But Anon's still not interested again, yeah
She pulled out all her tricks
Even baked him a cake, extra thick
But buddy's not biting, not even one little bit

[Chorus]
In the castle, at midnight, room 302
With a bouquet of roses and some candles too
Anon’s locked the door, put a sign in his view
That says "Please go away"

[Verse 2]
She's got a plan, who knew?
But Luna and Cadance can't believe it’s true
She wore a fancy dress and said “Hey there you!”
She read from romance books
Tried adding sexy looks
But Anon just laughed and moved to his favorite nooks

[Chorus]
In the castle, at midnight, room 302
With a bouquet of roses and some candles too
Anon’s locked the door, put a sign in his view
That says "Please go away"

[Bridge]
So Celestia sighed, wiped a tear from her eye
The mares all gathered round, gave it one more try
They played their guitars, singing under moon light
But Anon just yawned and said, "Goodnight"

[Chorus]
In the castle, at midnight, room 302
With a bouquet of roses and some candles too
Anon’s locked the door, put a sign in his view
That says "Please go away"


----
Is "put a sign in his view" too bad?
>>
>>41367097
you have quite a nice collection there, the 'Dreams of Luna' is pretty nice.
>>
Do any of you guys reckon you could make a good voiced version of the second comic here?

https://www.tumblr.com/radioactive-dragonlover/759831654724419584

I tried using Haysay and putting in a segment of the audio from the episode of Game Changers it's referencing as an audio input, but it came out sounding very robotic and off - I think because Twilight as a character has a different range of pitches when speaking emotionally than BLeeM does. I don't really know how to adjust that, though.
>>
>>41366416
Do you mean how it wasn't tuned to act like an assistant, and that it just continued from whatever text you gave it? That should be easy enough.
>>llm = AsyncCerebrasEngine(model="llama3.1-70b")
>>print(await llm.query_continuation("Once upon a time there was a little pony named Twilight Sparkle."))
>She lived in a magical land known as Equestria, where the sun was always shining and the air was sweet with the scent of blooming wildflowers. Twilight Sparkle was a student of Princess Celestia, the ruler of Equestria, and was learning the art of magic at the princess's palace in Canterlot. One day, while Twilight was studying in the library, she received a letter from the princess, instructing her to move to the town of Ponyville and live alongside the other ponies, to learn about the magic of friendship.

I can make sure there's a way to support jailbreaks too.
>>
>>41366336
Welp, that new program is not going to be very useful to me as it crashes at the beginning with inability to load some dll models. So the struggle to find nice sounding tts is still going.
>>
>>41367854
Afraid I can't be much help with that, assuming you're using the windows version; Linux version works fine. If it's an installation issue, in the releases of the GitHub there should be another install option there, maybe that'll work?
>dll models
I'm pretty sure it's intended to use .pth files as the models. Also only RVC, I had a slip up earlier where I accidentally tried to load a SoVits model if mine and it naturally errored.
>>
>>41365609
Current AI might be too slow for this, but I was thinking about some kind of LLM/RAG-powered "RPG engine" where you don't just provide character definitions but also world definitions, item definitions, and maybe some kind of prebuilt framework for quests, skills, and other user-defined mechanics, doing the math of tracking XP, health, armor, and damage multipliers in code rather than having the LLM try to pick that role up. Rather than the hackiness of trying to shove a world scenario or an explicit story into each character card, these could be split into more logical pieces and composed into RPGs, as in the sketch below.
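A minimal sketch of that split, with all names made up for illustration. The numbers live in plain code and the LLM only ever narrates the resolved outcome:

from dataclasses import dataclass, field

@dataclass
class Character:
    name: str
    hp: int = 20
    xp: int = 0
    armor: int = 0

    def take_damage(self, amount: int) -> int:
        dealt = max(0, amount - self.armor)
        self.hp -= dealt
        return dealt

@dataclass
class World:
    characters: dict = field(default_factory=dict)

    def resolve_attack(self, attacker: str, defender: str, base_damage: int) -> str:
        dealt = self.characters[defender].take_damage(base_damage)
        self.characters[attacker].xp += dealt
        # Only this summary goes to the LLM for narration; it never does the math.
        return f"{attacker} hits {defender} for {dealt} damage."

world = World({"Anon": Character("Anon"), "Timberwolf": Character("Timberwolf", armor=2)})
print(world.resolve_attack("Anon", "Timberwolf", base_damage=5))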
>>
>>41368170
Someone mentioned this in /chag/. The hard part is making sure it's possible to extract & track the relevant information from a rulebook. If you can send me an example rulebook (maybe one of the PonyFinder ones), that would help.
>>
>>41367684
Can you catbox the Game Changers audio, or link to the original episode?
>>
>>41368005
They do have a precompiled download, but it's giving me the same error. I'm guessing it's just my system being extra derped.
>>
>>41368620
https://youtu.be/88et7YlmzTs?si=_okFx5HtSE9e9cBV
Here's the audio snippet I used:
https://files.catbox.moe/m5eph6.mp3
>>
>9
>>
>>41369482
So it is.
>>
>>41369021
https://files.catbox.moe/rf3zrc.mp3

Architecture = rvc | Character = Twilight Sparkle | Index Ratio = 0.95 | Pitch Shift = 8 | Voice Envelope Mix Ratio = 1.0 | Voiceless Consonants Protection Ratio = 0.33 | f0 Extraction Method = rmvpe
>>
>>41369482
>cutie mark
Is that what it looks like to have 4chan as a special talent?
>>
>>41369482
oy
>>
>>41371447
Either that or her talent is related to some kind of Star Trek: Green Edition.
>>
>>41372139
Kek, that uniform design is gold.
>>
>>41365609
Updating the Master File:
- [Hopefully done] Download the Master File again so I can get a clean updated copy. Mega was having issues, as usual. I'll need to check to make sure I have all the files, but I think this is done.
- [In progress] Reupload a clone of both Master Files to HuggingFace.
- [In progress] Reupload a clone of the new Master File to my gdrive.

Horsona chatbot library:
- [Done] Collect a list of functionality that would help in making better chatbots. The current list is up on the github readme https://github.com/synthbot-anon/horsona. I have enough to get started, but please keep suggesting functionality if you think of anything. There's still functionality I want that no one's mentioned, so I'm sure the list is incomplete.
- [In progress] Split the target features into individual functions that can be implemented and pieced together, and create a github issue for each one so it's easy to keep track of everything. I'll need to start implementing some of these things so I can have a better understanding of how to do this.
- ... [Done] Create a sample memory module, which is required for several of the candidate features. I went with "RAG where the underlying dataset can be automatically updated based on new information." The implementation is done, though the LLM prompts in https://github.com/synthbot-anon/horsona/blob/main/src/horsona/memory/rag.py#L139 could use some work. There's an example use in https://github.com/synthbot-anon/horsona/blob/main/tests/test_rag.py though the test only passes about 30% of the time. I'm pretty sure this can be made close to 100% with better prompts.
- ... [ ] Add documentation, a "start developing" guide, and tasks for the features where it's feasible to make progress using the memory module.
- ... [ ] Find some old & current jailbreaks to add jailbreak support, and turn them into either modules or LLM wrappers. If anyone has links for this, please send them.
- ... [ ] Figure out how to organize text corpora into compatible universes.
- ... [ ] Go through the candidate feature list and make sure jailbreaks & compatible universes are the only features that are hard to support with the existing framework.

Information I need:
- Sample rulebooks, preferably one of the PonyFinder ones, so I can figure out what it'll take to extract information from these.
- Old & current jailbreaks so I can make sure my jailbreak implementation is comprehensive.
>>
>>41373093
>keep suggesting functionality if you think of anything
I can't think of any at this moment; however, I would love it if you were able to keep the kind of add-on options that the auto1111 webui for Stable Diffusion has, where one can just install whatever additional options as needed, including new ones anons make in the future.
>>
Page 10 bump.
>>
Does Clipper know where he found the MLP background music? Shit's kino as fuck.
>>
>>41373305
I'm only building the library right now (not a full application), but I can make sure it can support custom add-ons that can be dynamically toggled.
>>
>>41373093
It looks like jailbreaks are sometimes LLM-specific and require modifying near-arbitrary arguments to the call. So they'll likely be implemented as custom LLMEngines. In that case, I don't think I need to do anything for them right now since my LLMEngine implementation already supports all of the customizations required. I'll just create issues for popular jailbreaks that I or others can implement.
Organizing text into compatible universes looks like it'll require graphs of data sources, where one data source can inherit from another with edits. I'll have to think about how to implement this. Most of the features don't depend on this, so I'll shift focus to documenting & creating issues for now.
>>
>>41373902
I know about these archived rips of background music:
https://www.mediafire.com/?rdhhrpyc0d6d3
https://www.mediafire.com/?rh219xdgj66bu

The first directory has music from seasons 1-2, and the second account belongs to RainShadow, who also has a YouTube channel:
https://www.youtube.com/@RainShadow
>>
>>41373093
I think horsona / chatbot library is in a good-enough state for anyone that wants to help out with development.
Repo: https://github.com/synthbot-anon/horsona
Open tasks: https://github.com/synthbot-anon/horsona/issues
- The current open tasks are for creating new LLMEngines (easy), making prompts more reliable (medium), and creating new datatypes & modules for character cards and image generation (medium/hard, probably requires some familiarity with pytorch).
- If you want to develop and run into any issues with the setup, let me know.

The integrations with existing chatbot UIs will come a bit later from integrations with common inference engines. I don't expect that to be difficult.

Updating the Master File:
- [Hopefully done] Download the Master File again so I can get a clean updated copy. I'll need to check to make sure I have all the files, but I think this is done.
- [In progress] Reupload a clone of both Master Files to HuggingFace.
- [In progress] Reupload a clone of the new Master File to my gdrive.

Horsona chatbot library:
- [Done] Add documentation, a "start developing" guide, and tasks for the features where it's feasible to make progress using the memory module.
- [Done] Find some old & current jailbreaks to add jailbreak support, and turn them into either modules or LLM wrappers. Ultimate Jailbreak is the main one, and there's an open task for it. There are others listed on rentry listed here: https://rentry.org/jb-listing.
- [Done enough] Go through the candidate feature list and make sure jailbreaks & compatible universes are the only features that are hard to support with the existing framework. Jailbreaks are easy to support. The rest of the features are easy to support.
- [ ] Work on whatever open issues other anons don't pick up.
- [ ] Continue working on lorebook generation. After this, I'll try making a simple chatbot with the library.
- [ ] Figure out how to organize text corpora into compatible universes.
>>
>>41373902
Pretty sure 99% of it is rips from the show itself (with two/three pieces made by Anons), which you can find in the OP's second Mega link ('sfx and music' folder).
>>
>>41373888
Minus one.
>>
>>
>>41373902
Nothing special, I just took it all from the music tracks of the same show audio used to make the voice dataset. Same clipping process with a different tagging system.
>>
>>41371103
That's pretty good up until the screaming at the end. Thank you
>>
File: 3121428.png (170 KB, 1528x2267)
Any chance someone here could train up a Lightning Dust model for RVC please? I need it for a song and her SVC one isn't cutting it.
>>
>>41377069
https://huggingface.co/Amo/so-vits-svc-4.0_GA/tree/main/ModelsFolder/ddm_DaringDo_100k
There is a sovits model for her that was set up as multi-model training, since if I remember correctly the model did not have the required 2 minutes of audio.
I may give it a try for RVC training in a day or two.
>>
>>41374539
Updating the Master File:
- [Done] Downloaded the new Master File and checked to make sure everything is good.
- [In progress] Reupload a clone of both Master Files to HuggingFace. This should be done in about 1 hour.
- [In progress] Reupload a clone of the new Master File to my gdrive. This should be done in about 3 hours.

Updating the Fimfarchive:
- [In progress] Download & verify the Sep 1 Fimfarchive. I'm downloading it now. If there are no errors, this should be done in about 5 hours.
- [ ] Upload to HuggingFace.

>>41364894
There's an empty "Luster Dawn" folder in Special Source, in case you wanted to remove that.
>>
>>41377069
I've also been meaning to train a Lightning Dust model. I'll see about training her later today probably. Would be good to get me back into the rhythm of training; been a long while.
>>
File: Untitled.png (553 KB, 1080x1049)
SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection
https://arxiv.org/abs/2408.17432
>Synthesizing the voices of unseen speakers is a persisting challenge in multi-speaker text-to-speech (TTS). Most multi-speaker TTS models rely on modeling speaker characteristics through speaker conditioning during training. Modeling unseen speaker attributes through this approach has necessitated an increase in model complexity, which makes it challenging to reproduce results and improve upon them. We design a simple alternative to this. We propose SelectTTS, a novel method to select the appropriate frames from the target speaker and decode using frame-level self-supervised learning (SSL) features. We show that this approach can effectively capture speaker characteristics for unseen speakers, and achieves comparable results to other multi-speaker TTS frameworks in both objective and subjective metrics. With SelectTTS, we show that frame selection from the target speaker's speech is a direct way to achieve generalization in unseen speakers with low model complexity. We achieve better speaker similarity performance than SOTA baselines XTTS-v2 and VALL-E with over an 8x reduction in model parameters and a 270x reduction in training data
https://kodhandarama.github.io/selectTTSdemo/
code and weights to be released (soon?)
Examples aren't great, but considering the training time/training data/parameters, it's viable for personal training. They used 100 hours of data.
>>
Hold Me Tight: Stable Encoder-Decoder Design for Speech Enhancement
https://arxiv.org/abs/2408.17358
>Convolutional layers with 1-D filters are often used as frontend to encode audio signals. Unlike fixed time-frequency representations, they can adapt to the local characteristics of input data. However, 1-D filters on raw audio are hard to train and often suffer from instabilities. In this paper, we address these problems with hybrid solutions, i.e., combining theory-driven and data-driven approaches. First, we preprocess the audio signals via a auditory filterbank, guaranteeing good frequency localization for the learned encoder. Second, we use results from frame theory to define an unsupervised learning objective that encourages energy conservation and perfect reconstruction. Third, we adapt mixed compressed spectral norms as learning objectives to the encoder coefficients. Using these solutions in a low-complexity encoder-mask-decoder model significantly improves the perceptual evaluation of speech quality (PESQ) in speech enhancement.
https://github.com/felixperfler/Stable-Hybrid-Auditory-Filterbanks
>>
>>41377203
>>41377404
It would be much appreciated. SVS has a tone to it that's difficult to work into some genres, I think RVC would nail it.
>>
Precautionary page 8 bump.
>>
>>41378282
Plus one.
>>
>>41377900
I may have something workable within 6 hours (or 12, depending on whether the PC decides to have another technical hiccup).
>>
>>41377900
>>41379052
And I'm back.
>https://huggingface.co/Amo/RVC_v2_GA/tree/main/models/MLP_Lightning_Dust_GA
>https://files.catbox.moe/amkbes.mp3
I may have overtrained her by setting the epochs to 500, but I still think this model came out pretty decent, especially with all the additional cleaned-up data clips.
>>
>>41380032
Nicely done. I attempted to clean some more files for training her yesterday but didn't have much luck with getting any of the tools to work before I got bummed out and needed sleep. Those tools being jazzpear94's model (https://colab.research.google.com/drive/1efoJFKeRNOulk6F4rKXkjg63RBUm0AnJ) and a couple other similar ones intended to help separate SFX specifically, which I found in this doc: https://docs.google.com/document/d/17fjNvJzj8ZGSer7c7OFe_CNfUKbAxEh_OBv94ZdRG5c/edit#heading=h.owqo9q2d774z

How much data did you have of her to work with? Still tempted to train another model of her anyway, if only to see whether my training setup still works.
>>
Hey Hydrus, any chance we could get this >>41380032 on Haysay.ai?
>>
>>41377233
Updating the Master File:
- [Done] Reupload a clone of both Master Files to HuggingFace. https://huggingface.co/datasets/synthbot/pony-speech and https://huggingface.co/datasets/synthbot/pony-singing
- [In progress] Reupload a clone of the new Master File to my gdrive. https://drive.google.com/drive/folders/1ho2qhjUTfKtYUXwDPArTmHuTJCaODQyQ

Updating the Fimfarchive:
- [Done] Download & verify the Sep 1 Fimfarchive. Fimfiction added some restriction to prevent bots from scraping. Part of my script downloads the story html if there's anything wrong with the fimfarchive epubs, or if there's a conflict between the fimfarchive metadata and epub. I have a hackish fix for this for now.
- [Done] Upload to HuggingFace. https://huggingface.co/datasets/synthbot/fimfarchive

Horsona chatbot library:
- [In progress] Continue working on lorebook generation. I cleaned up part of my embedding-based memory implementation. I'll clean the rest as I figure out the right way to use it for reading through a story. Right now, I'm having an LLM create a question-answer dataset for the story setting, which it refines as it reads the story. The questions get turned into embeddings, which can be used to look up the corresponding answers as necessary. This is still a work in progress. I think my first "test" for this would be if it can create a decent character card for every character after each chapter of a story. That's what I'm currently working toward.
- [ ] Work on whatever open issues other anons don't pick up.
- [ ] Figure out how to organize text corpora into compatible universes.
>>
>>41380500
Correction: the Master File reupload to my gdrive should be [Done].
>>
>>41380258
>https://files.catbox.moe/d3j7w5.zip
>I attempted to clean some more files
I usually just grab the Clear files from the OP mega folder, and if the total is below 3 minutes I will additionally scavenge any usable 'Noisy' files as well. In this case I think I used almost all of the noisy ones, with only three being deleted.
>>
>>41380756
That's more or less what I have to train with, but I'm a bit stringent when it comes to the samples I use. There are quite a few lines sourced from her second episode that I didn't feel suited her, as she kinda gets a slight country-like accent in her delivery (https://files.catbox.moe/gn0vd5.flac) and sounds unusual or distorted in others (https://files.catbox.moe/1h5trr.flac & https://files.catbox.moe/4mug5h.flac). But yeah, I'll train her with what I've defined, though with fewer files, and consider an alternate version that'll hopefully be more faithful to her debut version.
>>
Up.
>>
>>41381049
Oh yes, when going through the dataset I try to look out for the tone of the voice as well; while I don't have proof, I think the clips where "X character pretends to talk like Y" may end up poisoning the training process.
Btw Anon, what stuff are you planning to make with her voice? Songs? Greentext?
>>
>>41381590
I usually make covers, but I'll be doing further tests with Applio to see which combinations of its TTS voices pair best and which settings work best. Might have to compile some sort of parameters list. Also curious about the TTS >>41377494 mentioned and how well that'll perform in comparison. I have a few factors limiting my ability to effectively produce non-song stuff, but I have a long list of stuff to produce. Nothing much with Lightning Dust thus far though, aside from a few songs I wanna test with her.
>>
Good night bump.
>>
MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
https://arxiv.org/abs/2409.00750
>Nowadays, large-scale text-to-speech (TTS) systems are primarily divided into two types: autoregressive and non-autoregressive. The autoregressive systems have certain deficiencies in robustness and cannot control speech duration. In contrast, non-autoregressive systems require explicit prediction of phone-level duration, which may compromise their naturalness. We introduce the Masked Generative Codec Transformer (MaskGCT), a fully non-autoregressive model for TTS that does not require precise alignment information between text and speech. MaskGCT is a two-stage model: in the first stage, the model uses text to predict semantic tokens extracted from a speech self-supervised learning (SSL) model, and in the second stage, the model predicts acoustic tokens conditioned on these semantic tokens. MaskGCT follows the \textit{mask-and-predict} learning paradigm. During training, MaskGCT learns to predict masked semantic or acoustic tokens based on given conditions and prompts. During inference, the model generates tokens of a specified length in a parallel manner. We scale MaskGCT to a large-scale multilingual dataset with 100K hours of in-the-wild speech. Our experiments demonstrate that MaskGCT achieves superior or competitive performance compared to state-of-the-art zero-shot TTS systems in terms of quality, similarity, and intelligibility while offering higher generation efficiency than diffusion-based or autoregressive TTS models
https://maskgct.github.io/
No weights (ever) since they're worried about safety. They finetuned it afterwards for emotion control and voice cloning. Sounds pretty good. 100k hours of training data.
>>
What's the fastest way to get this shit voice acted? It can even be a female voice actor, it doesn't matter, it just has to sound entertaining to listen to.
And optionally, how do I make the visuals match what he's talking about?
https://desuarchive.org/mlp/thread/40590194/#40598329
>>
anypony have voice packs for eleven labs
>>
I hate to moralfag, but am I the only one bothered by people who use CMC voice AIs for coomer shit? Normally I wouldn't care, but they were kids when they recorded at least most of their lines.
>>
>>41383447
The VAs are all adults now anyway, so that concern isn't really an issue any more.
>>
>>
File: OIG4.Pc7p8fPrtACEmu2G7R_o.jpg (249 KB, 1024x1024)
>>
>>41380500
Minor update on lorebook generation:
The current plan for memory is to extract questions and answers as the LLM reads a story. The questions get indexed by embedding, and they won't get updated unless a corresponding answer is deleted. The answers will get updated as the LLM reads the story. It'll process each paragraph twice: once to generate a list of questions that need to be answered to understand the paragraph, and a second time with the corresponding answers to determine how the memory should be updated.
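In rough Python terms, the two-pass loop would look something like this; ask_llm, embed, and the memory object here are placeholders rather than the actual horsona code:

def read_paragraph(paragraph, memory, embed, ask_llm):
    # Pass 1: ask what a reader would need to know to understand this paragraph.
    questions = ask_llm(
        f"List the questions that need to be answered to understand:\n{paragraph}"
    )
    # Look up the current answers by embedding similarity on the questions.
    context = {q: memory.lookup(embed(q)) for q in questions}
    # Pass 2: given the paragraph plus current answers, decide how memory changes.
    updates = ask_llm(
        f"Current answers: {context}\nParagraph: {paragraph}\n"
        "Return the question->answer pairs that should be added or revised."
    )
    for question, answer in updates.items():
        memory.store(embed(question), question, answer)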
>>
>>41377233
Not sure what that empty Luster Dawn folder was supposed to be for, perhaps a holdover from processing the studio leaks that now has no purpose. It's now been removed.
>>
>>
https://www.udio.com/songs/kgm72z2swizqRLSYDJWMMG
https://vocaroo.com/1hO6O2SpRNy6
https://pomf2.lain.la/f/4s3d56ud.mp4

Behind the Facade 1

Lyrics

[Verse 1]
We live in a world with cartoons and rainbows
With sparkly eyes and vibrant shows
Featureless, seamless, no lines can be seen
In our perfect land where nothing disagrees
No whispers of night's forbidden touch
In pastel dreams, we're bound and crushed

[Chorus]
Hey, Equestria
What can we do?
We live by rules, pretend they're true
While desires hide and hearts must play
In child's delight, we can't be free today

[Verse 2]
Can't flaunt our flair or show a peek
No lips can part for secrets to speak
Innocent, sweet, and always demure
Living in a world where nothing's obscure
Behind closed doors, our true selves lie
Hushing our wants as we gaze at the sky

[Chorus]
Hey, Equestria
What can we do?
We live by rules, pretend they're true
While desires hide and hearts must play
In child's delight, we can't be free today

[Bridge]
Can't break these chains of purity's face
In this vibrant land, we find no embrace
Our silent cries echo in the night
In a painted world, there's no real sight

[Chorus]
Hey, Equestria
What can we do?
We live by rules, pretend they're true
While desires hide and hearts must play
In child's delight, we can't be free today
>>
Training Universal Vocoders with Feature Smoothing-Based Augmentation Methods for High-Quality TTS Systems
https://arxiv.org/abs/2409.02517
>While universal vocoders have achieved proficient waveform generation across diverse voices, their integration into text-to-speech (TTS) tasks often results in degraded synthetic quality. To address this challenge, we present a novel augmentation technique for training universal vocoders. Our training scheme randomly applies linear smoothing filters to input acoustic features, facilitating vocoder generalization across a wide range of smoothings. It significantly mitigates the training-inference mismatch, enhancing the naturalness of synthetic output even when the acoustic model produces overly smoothed features. Notably, our method is applicable to any vocoder without requiring architectural modifications or dependencies on specific acoustic models. The experimental results validate the superiority of our vocoder over conventional methods, achieving 11.99% and 12.05% improvements in mean opinion scores when integrated with Tacotron 2 and FastSpeech 2 TTS acoustic models, respectively.
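The core augmentation is simple enough to sketch. Something like the following (numpy, with arbitrary filter widths chosen here) captures the idea of randomly smoothing the mel features fed to the vocoder during training:

import numpy as np

def randomly_smooth(mel: np.ndarray, max_width: int = 5) -> np.ndarray:
    """mel: (n_mels, n_frames). Returns a copy smoothed along the time axis
    with a random-width moving average; width 1 leaves it unchanged."""
    width = np.random.randint(1, max_width + 1)
    if width == 1:
        return mel.copy()
    kernel = np.ones(width) / width
    return np.stack([np.convolve(band, kernel, mode="same") for band in mel])

The vocoder then sees everything from sharp ground-truth features to heavily smoothed ones, so it copes better with the over-smoothed output an acoustic model produces at inference time.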
>>
>>41386926
Cute and soulful
>>41386944
>Tacotron 2
Huh, that's a name we had not seen in the threads for a while.
>>
>page 10
>>
>>41377069
>>41377404
Training of Lightning Dust (Alt) model has begun. Decided to use the pretrained TITAN as I found the descriptor of it interesting.
>TITAN is a fine-tuned based on the original RVC V2 pretrained, leveraging an 11.15-hours dataset sourced from Expresso. It gives cleaner results compared to the original pretrained, also handles the accent and noise better due to its robustness, being able to generate high quality results. Like Ov2 Super, it allows models to be trained with few epochs, it supports all the sample rates.
Hopefully she'll prove to be less noisy and have more accent retention. Training's a little slow on my end with the reduced batch size (supposedly smaller gives better results, but at the expense of training speed), but so far no issues in the process. If all goes well, hopefully I can also begin training more mares between my other commitments.

Improvements to vocal separation AI should be looked into further; it'd be nice to be able to separate audio that we've had a hard time separating for datasets in the past. Amalthea comes to mind, with most current separators struggling with cricket sounds and other natural additions. I have a feeling we could create and/or finetune a UVR5 model designed to separate SFX, using all the pony SFX we've separated thus far, to have an easier time removing a lot of foley, hoofsteps, crashes and similar sounds. As for more natural sounds like rain, wind, birds, insects, etc., there's a lot of data from the enormous SONNISS GameAudioGDC packs that could be utilized for this. Would be preferable to use MDX-Net so it's reliable for a range of GPUs, as DEMUCS models tend to not want to run unless the GPU has more than 6GB.
>>
>https://hailuoai.com/video
this shit is fucking crazy
>>
>>41387770
but can it recreate the tienanmen square massacre of june 1989?
>>
File: Long Queue.png (11 KB, 418x303)
>>41387770
Almost as crazy as the queue times for it will soon be.
6 minutes to wait already with only 171 people. Yikes.
>>
>>41387770
>>41387845
For the quality though, it's definitely got pony down surprisingly well. Just concerned about how slow it'll start to get once more people jump on board the same generator. Thankfully it's free (for now).
>Pinkie Pie (My Little Pony: Friendship is Magic) pony waving her hoof at the viewer.
>>
>>41387770 >>41387845 >>41387864
So now we are entering the age of computer-animated mares. I will be very disappointed if there isn't a bootleg of this tech available for offline generation sometime within the next three years.
>>
How do I get Doug Walker's or James Rolfe's voice to voice act this?

https://desuarchive.org/mlp/thread/40590194/#40598329
>>
>>41387864
In what format is it output, though? Is it usable as .ai vectors, Photoshop PSD, Flash FLA, Toon Boom, Live2D, Spine, or something else?
>>
>>41387770
>https://hailuoai.com/video
Anime is dead. Finally smooth animation for free.
>>
>>41387845
G6 intro just dropped.
>>
>>41387770

very cute twilight.
>>
>>41387845
>>
File: lolgate.gif (302 KB, 300x335)
>>41388039
the more often you watch this, the funnier it gets
>>
>>41387898
voice-models.com doesn't have a Doug Walker voice, but it does have James Rolfe. Since it's an RVC model, you'll have to read the pasta yourself.
>>
>>41388039
>background full of 'Curse of the Fly'/'The Unearthly' type misshapen monstrosities
>>41388147
Just pause at any random moment for a good laugh/scare
>>
>>41388624
my favorite part is the random tiny little houses at the end for some reason
>>
>>41388651
Well now that you mention it, this does bring up sort of an interesting point with the show. It is clearly established that animals like Angel Bunny have roughly pony intelligence. Would it really be so farfetched if we saw them living in actual tiny houses?
>>
File: Untitled.png (118 KB, 1125x440)
Sample-Efficient Diffusion for Text-To-Speech Synthesis
https://arxiv.org/abs/2409.03717
>This work introduces Sample-Efficient Speech Diffusion (SESD), an algorithm for effective speech synthesis in modest data regimes through latent diffusion. It is based on a novel diffusion architecture, that we call U-Audio Transformer (U-AT), that efficiently scales to long sequences and operates in the latent space of a pre-trained audio autoencoder. Conditioned on character-aware language model representations, SESD achieves impressive results despite training on less than 1k hours of speech - far less than current state-of-the-art systems. In fact, it synthesizes more intelligible speech than the state-of-the-art auto-regressive model, VALL-E, while using less than 2% the training data.
https://github.com/justinlovelace/SESD
No code yet, though they suggest they'll post an "implementation", so maybe weights too. No examples, so just posting to keep those interested aware. Using 2% of VALL-E's training data while outcompeting it is big if true.
>>
>>41389084
>Note: Code and model checkpoint will be available soon. Stay tuned for updates!
ah should have checked the whole readme
>>
>>41387770
I prompted "Applejack (My Little Pony: Friendship is Magic) collecting hay from each of the rest of the Mane 6." and got G5.
>>
Page 9 bump.
>>
>>41389084
>Graphs
Where are the voices at? Also, it would be nice once this gets published to be able to run it on my old GPU (I will lose my shit if this is another model that requires 16GB of VRAM to even start up).
>>
>>41387652
[RVC] Lightning Dust sings - There For Tomorrow "A Little Faster"
>https://files.catbox.moe/d610ed.mp4
>https://files.catbox.moe/v8uxxn.mp3

>https://huggingface.co/datasets/HazySkies/RVC2-M/tree/main/FiM_LightningDust
So far she seems decently capable in her preliminary testing, although I've only tried singing so far. Surprisingly, her natural range requires +0 rather than the expected +12. This test could've been better had the original song separated a little better, but it's a good average to test with. I quite liked her backing vocals at 1:22; her sustained notes sound nice.
>>
File: lightning dust.gif (3.15 MB, 468x480)
>>41389897
hey that sounds pretty good. Nice job, man!
>>
>>41389897
That's actually impressive, given how few voice lines she has.
>>
>>41389897
very good, we need more background mares songs.
>>
>>41391013
>catbox ded
Uh oh. Is there another good alternative for hosting small files? Also, could someone repost the above cover?
>>
File: cadence flurry DJ.gif (1.04 MB, 394x382)
Cadence - Like a Prayer (BHO cover)
>https://www.youtube.com/watch?v=uP6CRRhTOIM

First time experimenting with vocoders and heavy filters. It sounds a bit scuffed, but not too bad. Used Haysay RVC for the voice and UVR for everything else.

Does anyone have a working catbox alternative? I'd rather drop a direct link than a youtube link, but every other site I try either doesn't work or prunes the link in a matter of hours.
>>
>>41380500
Horsona chatbot library:
- [In progress] Continue working on lorebook generation. My QA-based memory seems to work. I'm refactoring my code now, and maybe trying to get my rate limits increased with some API providers so I can speed this up. I think the final lorebook is going to consist of the main details from the memory plus character cards. I'll be working on character card generation next.
- [ ] Work on whatever open issues other anons don't pick up.
- [ ] Figure out how to organize text corpora into compatible universes.

A snapshot of the QA memory after reading the first 50 paragraphs of Friendship is Optimal:
https://ponepaste.org/10323
>>
>>41391221
Pixeldrain could work! Unless not?
>>
Looks like Gemini now uses Imagen 3 as its image generator. While it's an increase in detail capability, it feels way worse in terms of language comprehension and flexibility. On top of that there's only 1 image at a time rather than 4, and no "generate more" button.

>>41391221
There's also https://pomf.lain.la/
>>
>>41389749
>10
>>
>>41391225
Thank you, anon. Ponies singing upbeat love songs do wonders for my motivation, and I really needed that.
>>
>>41392798
mares
>>
>>41392037
>only 1 image at a time rather than 4, and no "generate more" button
Sounds like a straight downgrade.
>>
>>41393542
It pretty much is. If they just picked off the abundance of flaws it would then be good, as it can still do show style ponies pretty well in its current state.
>>
>>41393734
that's a fat ponk
>>
Ponybros, I need your help. I've been told that installing 'mamba' over the conda terminal would speed things up, but it only fucked up my conda environments. Can someone advise how to uninstall this piece of shit? So far people suggest nuking the entire conda install and reinstalling everything from the ground up.
>>
>>41394192
After hours of messing around it looks like my conda installation is fucked beyond repair, so now I'm going to spend the entire day reinstalling all my old envs.
Lesson learned: do not install mamba or listen to random internet advice.
>>
>>41394560
>(base) pip install pip
works fine
>(rvc-env) pip install pip
I get this bullshit. Can someone please explain what's going on? This is a fresh installation of miniconda, so it shouldn't give me this much shit.
https://pastebin.com/zQbiKCjf
>>
>>41393756
uuuu
>>
File: no_update_option.png (52 KB, 645x287)
>>41395068
I'm a little confused how you even got this error. What version of pip are you using? When I create a conda environment with python3.10, pip reports that there is no --update option. Did you mean *upgrade* instead of update?
pip install --upgrade pip
or:
python -m pip install --upgrade pip
>>
>>41395068
From the stack trace, it looks like python is trying to use files under a user directory, outside of the Conda environment, i.e.:
C:\Users\User001\AppData\Roaming\Python\Python310\

I wonder if the issue you are facing is the same as the one described here:
https://github.com/conda-forge/miniforge/issues/344

You can try a similar solution by renaming and moving the offending folder like so:
ren "C:\Users\User001\AppData\Roaming\Python\Python310" "Python310Backup" & move "C:\Users\User001\AppData\Roaming\Python\Python310Backup" "C:\Users\User001\Python310Backup"
And then try the pip command again.

If that doesn't work and you want to revert your changes, execute:
ren "C:\Users\User001\Python310Backup" "Python310" & move "C:\Users\User001\Python310" "C:\Users\User001\AppData\Roaming\Python\Python310"
>>
File: 1705677602979430.png (1.26 MB, 1357x1920)
Sup, PPP. A few weeks ago I asked for advice in this general about using AI voice conversion and you guys helped me out.
So I thought I'd deliver the finished project in case anyone was curious:

https://www.newgrounds.com/portal/view/947079

The female voices were achieved by converting my own recordings in Haysay, can you identify who I picked for them?

Lastly, I don't have any experience drawing ponies, but gave a go at ponifying my character as a token of appreciation.
Thank you very much!
>>
>>41395797
I can't say if I recognise the AI voice, but hey, this is pretty awesome. Reminds me of all the time I spent after school watching indie animations on Newgrounds.
Also, very cool ponification drawing. Maybe we could see you pop into the /create/ or /bale/ threads with more poni stuff in the future?
>>
>>41395292
what is the text editor that you use?
>>
>>41395292
>>41395532
Thanks guys for the suggestions (no idea what I've fucked up, but clearly it's been an advanced fuck-up).
I ended up running the solution below in that env before entering the thread.
$env:PYTHONNOUSERSITE=1
python -m ensurepip --upgrade
python -m pip install --force-reinstall pip

This has solved the previous pip issue for me, and now I'm back to the usual Python dependency hell.
>>
File: flutternom.gif (2.14 MB, 498x331)
>>41395797
>can you identify who I picked for them?

That sounded like Rarity and Fluttershy's voice models.
>>
>>41396227
It depends; at home I use Sublime Text 3, at work it's Visual Studio Code, and if I'm logged in to a Unix server with terminal-only access then I use Vim.

>>41396254
Glad to hear you got it sorted out! That solution makes sense; it will prevent anything in the user site-packages, e.g.
C:\Users\User001\AppData\Roaming\Python\Python310\site-packages
from being added to sys.path, thereby forcing it to use only the packages you installed for that Conda environment. I find it strange that Conda does not already set PYTHONNOUSERSITE=1 by default.

>>41395797
That was sick! I'd guess that you used Rarity's voice for the first female character that spoke. Harder to tell with Ocapon, but Fluttershy sounds like a good guess.
>>
>>41396628
https://pastebin.com/KA0mkz30
Hey, I finally installed the RVC env, but now when I try to convert the audio I get an AttributeError: 'NoneType' object has no attribute 'dtype'.
>>
>>41397291
I've managed to fix the above; apparently the av module was too new and no longer recognizes the rb and wb functions.

>https://pastebin.com/2Tv7wnpJ
Now everything seems to be working fine (including the mic recording), but the above error happens if I select the f0 rmvpe model to run the conversion.
I've placed the files "rmvpe.pt" and "rmvpe.onnx" in the main folder as well as the \assets\rmvpe folder, however this did not fix the error.
>>
File: 1643535507592.webm (654 KB, 1280x720)
>>41387770
cute Twilight
>>
>>41397515
>A beach episode by Warners Brothers studio
Very cute style. From the other thread it seems to mostly make coherent-looking mares, but do tell us, do you get any cursed-looking mares there as well?
>>
File: 1684869540896.webm (536 KB, 1280x720)
>>41397591
Except for straight-up anthro, I got this.
>>
>>41397355
>os.environ["rmvpe_root"]
>KeyError: 'rmvpe_root'
The application is expecting an environment variable named "rmvpe_root" which points to the location of the rmvpe model, but that variable is undefined. RVC includes a .env file which is supposed to define that variable. First of all, please verify that the following line is present in your .env file:
rmvpe_root = assets/rmvpe
Assuming it's there, that means the file is not getting loaded. Try adding the following two lines at the top of inference_gui.py:
from dotenv import load_dotenv
load_dotenv()
Failing that, try adding this line instead, to hardcode the value:
os.environ["rmvpe_root"] = 'assets/rmvpe'
>>
>>41398045
>First of all, please verify that the following line is present in your .env file:
>rmvpe_root = assets/rmvpe
i dont have .env file (i thnink), I just activate it straight from the powershell terminal.
>os.environ["rmvpe_root"] = 'assets/rmvpe'
Hey, this works now. Happy days m8!
>>
>>41387770
Frankly, it should go to the AI Art Thread. >>41354251
>>
One thing I do wonder: can you feed normal text-to-speech into Hay Say's so-vits-svc? Would it output okay-sounding audio? That would improve the usability of so-vits-svc a lot. I think I've seen other ones do similar things.
>>
>>41398371
You can, but if you're using regular ass TTS as the input it's gonna copy over the robotic intonations, and sound like TTS speech.
>>
>>41398371
SoVits would likely work with TTS as audio input just fine, like how there are such setups for RVC. It's primarily the TTS side that needs work, in that there are very few options that are open source, natural sounding, and emotionally flexible.
>>
>>41396190
>>41396628
>>41396407
Cute hamster.
And correct answers, it was Fluttershy and Rarity with 0% character similarity selected.
For Fluttershy, only +4 pitch.
For Rarity, +8.

Usually +12 is recommended, but if you voice act with an already feminine affectation to it, it doesn't have to go up a full octave.
>>
>early mare bump
>>
>>41399082
The early Pegasus catches the bird.
>>
Is there something that can read an entire book with the click of a button?
>>
>>41399456
Yes. Regular, non-AI TTS that we've had for decades.
>>
>>41399479
>Text to speech
Ew. It can't emote and it sounds like dogshit in general.
>>
>10 again
>>
>>41400055
>minus one
>>
>>41400618
>Rollercoaster of bump
>>
Should we make a project where we post Youtube videos about interesting threads read by various AI voices?
>>
>>41401641
dunno, i would prefer short pony greens.
>>
Is this a good place for feature requests for Hay Say?
Anyways, I really wish it had a microphone built into the browser, so you can record onto the website and then send it to SoVits. I think that would improve the usability of the website a lot.
>>
>>41401832
Thanks for the suggestion. I regularly check in here so I will see requests. If you have a Github account, you can also open an issue on https://github.com/hydrusbeta/hay_say_ui/issues.

Hay Say is currently built on the Plotly Dash framework, which does not have an out-of-the-box audio recording component, though it should be possible to build a custom one. I'm on the fence about doing so. A while back, I started working on a complete redesign of Hay Say which will not use Plotly Dash, so part of me doesn't want to put effort into creating a component which will eventually become obsolete. However... progress on the redesign has stalled as I've been spending a lot of my time working on a different pony-related project, so I might just postpone the redesign and devote more time to the existing Hay Say for now. If I do, I will make this feature a priority.
>>
>>41401989
Plotly Dash you say
>>
Are there any annotations for the gender of the characters in the voice dataset? If not, I can write them up.
>>
>>41402357
Here's what I have so far: https://files.catbox.moe/ckftfl.json
I noticed that Wallflower didn't appear in there though. Synthbot, did Forgotten Friendship somehow get dropped out of the dataset?
>>
File: fluttershock.gif (715 KB, 400x400)
>>41402110
>>
>>41402110
Nice view (on the future).
>>
>>41391225
Why does the voice sound so muffled?
>>
>>41365609
What is the envisioned use case for this?
>>
Up.
>>
>>41402773
Probably because I went a bit wild on the filters

Also AI voices just tend to be a bit muffled by default.
>>
>>41391490
what technical specs are you aiming for this to run on?
>>
[RVC] Lightning Dust sings - The Knocks & Sofi Tukker "One on one"
>https://files.catbox.moe/oor0l8.mp4
>https://files.catbox.moe/h9e6ye.mp3

She probably butchered the short Portuguese section, but she sounded great anyway.
Is it just me, or does her voice paired with these vocals sound a little... Applebloom-y? Probably just the accent/pronunciation.
>https://files.catbox.moe/iiaqif.mp3

Considering training a Nurse Redheart model next, but I also have plans for Saffron and a wide variety of other mares. Training for Lightning was pretty smooth and quick (only a couple hours for the actual training), so I could probably get a bunch done now that I have a few days free of other commitments.
>>
>https://files.catbox.moe/5zp8sb.mp3
>Original ai song - RD - locked out
I've spent 2 days trying to unfuck the random voice glitches but I can't, so here is my best attempt at AI song making.
>>
>>41401623
>Rollercoaster
>>
>>41403457
9 up.
>>
>>41404987
oy
>>
>>41404030
How does LD's voice work so well for songs?
>>
File: EmotionalSupportNurse.png (907 KB, 1024x1024)
[RVC] Nurse Redheart sings - Francesca Battistelli "Angel by your side"
>https://files.catbox.moe/wr1lb9.mp4
>https://files.catbox.moe/0l7gue.mp3

>https://huggingface.co/datasets/HazySkies/RVC2-M/tree/main/FiM_NurseRedheart

Passing it through RVC, it kept mis-detecting certain words, so I first passed it through SynthV to fix some lyrics, which worked, but she did lose some volume control as a result. Redheart still sings well with raw vocals, even if they're not always the right words.

>>41405749
RVC is well adapted for singing in general. I also trained Lightning Dust with pitch guidance on, and from the "TITAN" pretrained model, to give her as much support as possible to speak and sing well given her small dataset. Nurse Redheart above had the same treatment.
>>
>>41402901
It's for developers that want to create better chatbots or create custom chatbots that require more than just changing prompts. That includes:
- Automated character card & lorebook generation, which can be fed into existing chatbot applications.
- Custom memory modules, including graph databases, embedding databases, SQL databases, json dumps, or arbitrary data structures.
- RPG functionality for tracking, e.g., HP, XP, and skill trees.
- Creating & refining datasets for LoRAs or other fine-tuning techniques.
- Creating API endpoints that transparently add memory & character consistency to generations, which can be plugged into other chatbot applications.

If other people use it, I'm expecting that early on, it would mostly be useful for creating character cards and lorebooks for other chatbot applications. Maybe a little later on, people would use it to create drop-in replacements for LLM APIs that automatically add in memory & character consistency features.
I personally want to use it to create chatbots with more complex personas & more flexible interactions. That requires better memory mechanisms, more complex inferences (more than just prompt-response), and tool use (e.g., listen to a song given a youtube link).

>>41403980
The current plan is to have the library itself require nothing but a CPU. Any additional hardware requirements will be on whatever inference APIs the user decides to use.
- I expect it'll never need more than API access to LLMs, image generators, etc. So if you run it against an sglang API, it'll be sglang specs. If you run it against Claude, just a CPU will be fine.
- For embeddings: right now, I'm developing & testing all embedding use against BAAI/bge-large-en-v1.5, which runs well on a CPU. This adds a pytorch dependency, which I plan to get rid of. Instead of having it directly run the embedding model, I'll probably have it use an embedding API (e.g., ollama support for open source embedding models). In this case, the requirements will be the same as for the LLM APIs: whatever specs are required to run the embedding API & model. A rough sketch of what that API-based call could look like is below.
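To make that concrete, here's a hypothetical sketch of swapping the local pytorch model for an HTTP embedding API. The endpoint and field names follow ollama's /api/embeddings route; the model name and host are placeholders, so adjust for whatever backend actually gets used.

import requests

def embed(text: str, model: str = "bge-large", host: str = "http://localhost:11434") -> list:
    # POST the text to the embedding server and return the embedding vector
    resp = requests.post(f"{host}/api/embeddings", json={"model": model, "prompt": text})
    resp.raise_for_status()
    return resp.json()["embedding"]

vector = embed("Twilight Sparkle lives in Ponyville.")
print(len(vector))  # dimensionality depends on the model being served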
>>
>>41391490
I have the initial "StoryReader" implementation up. It can read stories paragraph-by-paragraph to extract information. More details below.

Horsona chatbot library:
- [In progress] Continue working on lorebook generation.
- ... [Done] The initial implementation is up. You can see an example of how to use it here: https://github.com/synthbot-anon/horsona/blob/main/tests/test_reader.py. The implementation of the StoryReader module is here: https://github.com/synthbot-anon/horsona/blob/main/src/horsona/stories/reader.py. Right now, it tracks three things as it reads through a story: (1) short term memory, which is just an array of the most recent paragraphs read, (2) a long term memory, which consists of an embedding-based question-answer database that keeps track of information extracted from the story and a cache to keep track of the most recent retrievals, and (3) a StoryState, which is a data structure that keeps track of "live" information about what's being read (e.g., current location, current speakers). (A toy illustration of these three pieces is at the end of this post.)
- ... [In progress] Refactor the StoryReader module to support custom memory modules, support extracting custom information from stories, and support tracking custom "live" information.
- [New issue] Have the EmbeddingModel class https://github.com/synthbot-anon/horsona/blob/main/src/horsona/memory/embeddings/models.py use an API of Huggingface Transformers to generate embeddings. I created a new issue for this in case someone here wants to work on it: https://github.com/synthbot-anon/horsona/issues/9.
- [New issue] Use FAISS or other to store, delete, and query embeddings instead of using matrix operations directly on the embeddings. I created a new issue for this in case someone here wants to work on it: https://github.com/synthbot-anon/horsona/issues/10.
- [ ] Work on whatever open issues other anons don't pick up. https://github.com/synthbot-anon/horsona/issues
- [ ] Figure out how to organize text corpora into compatible universes.
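As a toy illustration only (this is not the actual horsona StoryReader, just the shape of the three tracked pieces described above):

from collections import deque

class ToyStoryReader:
    def __init__(self, embed_fn, window: int = 5):
        self.short_term = deque(maxlen=window)            # (1) most recent paragraphs read
        self.long_term = []                               # (2) (embedding, question, answer) triples
        self.state = {"location": None, "speakers": []}   # (3) "live" story state
        self.embed_fn = embed_fn

    def read(self, paragraph: str, extracted_qa: list, state_update: dict):
        # extracted_qa and state_update would normally come from LLM calls on the paragraph
        self.short_term.append(paragraph)
        for question, answer in extracted_qa:
            self.long_term.append((self.embed_fn(question), question, answer))
        self.state.update(state_update)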
>>
>>41405850
>provide her as much support as she could have to be able to speak and sing well given her small dataset
And it really shows.
>>
>>41405972
if I may put forth this suggestion: whatever future build it will be, could it be limited to the "torch==1.13.1+cu116" (or "pytorch==2.0.1") requirements, as anything above that straight up shits up my poor old gpu.
>>
>>41406657
Do you happen to know whether Cuda 11.8 works on your gpu? Cuda 11.6 and 11.8 both support a minimum compute capability of 3.5, which means that cuda 11.8 *should* also work on your gpu. If that's the case, then the maximum compatible torch version would be torch-2.4.1+cu118. The last minor version of Cuda 11 was 11.8. After that, the minimum compute capability became 5.0 in Cuda 12.
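If it helps, here's a quick way to check the compute capability from an existing torch install (just a sketch; interpret the numbers against the CUDA version notes above):

import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
else:
    print("No CUDA device visible to this torch build.")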
>>
>>41407036
In the PC I have the 8GB VRAM GTX 1080 (being a poorfag is suffering). Technically my GPU is supposed to only work up to 11.4, however for some reason unknown to me it chooses to be just good enough to get CUDA 11.6 rolling.
>>
https://huggingface.co/fishaudio/fish-speech-1.4
Is this good?
>>
>>41407382
>just a 1GB model
Could be interesting, however I do not see the training files/instructions in their GitHub.
>>
File: jjjj.png (16 KB, 307x333)
>>41404030
>Lightning Dust singing to (you) in Portuguese
Garbled pronunciation or not, that sounds damn hot.
>CAPTCHA JJJJ
>>
>>41406657
I'm almost 100% sure I can get rid of the pytorch dependency altogether, so it shouldn't be an issue.
>>
>>41402414
>>41407736
>>
>>41404030
That's a very sexy Lightning.
>>
>>41402414
Interesting. The clipped audio files are there, but it's missing the label file.
>>41385715
Clipper, do you happen to have the original label file for Forgotten Friendship? I don't see it in the mega or in my older copy of the Master File. If you don't, I can create a bootleg version from the audio files, though it'll be less precise than the others, so it'll be misaligned when loaded into Audacity.

>>41408224
Thank you.
>>
>>41408451
I found it in the Feb 2022 version of the Master File.
https://files.catbox.moe/iehjw1.txt
I'm updating my clones now.
>>
>>41402357
>>41402414
I added this as a "gender" column to my huggingface clone.
>Complete list of genders: https://files.catbox.moe/8mmw78.json
>Dialogue dataset: https://huggingface.co/datasets/synthbot/pony-speech
>Singing dataset: https://huggingface.co/datasets/synthbot/pony-singing
For the character tags that represent multiple characters, I left the gender empty since people training gender-specific models probably want to exclude those. Those are: CMC, Dazzlings, Flim Flam, Mane 6, Multiple.
I'll update my gdrive clone tomorrow.
>>
Good morning bump
>>
How do I automate the AI to read an entire book? It doesn't have to be in real time, I just need to leave it overnight, or even an entire week, to read all of it.
>>
>>41408610
Thank you

>>41409210
I have a colab notebook intended for this that may give you some guidance on how to do it with StyleTTS2 (I wouldn't rely on Colab for the full run though, because of the time limit). The main problem, I think, is maintaining a consistent syllable rate, because StyleTTS2 has problems with that.
https://colab.research.google.com/drive/1ys8SkP-VW7CkhnwVveEGINaszG1kRaYl#scrollTo=GJnM6GwdAG2W
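As a rough offline starting point, the overall loop could look something like this (the synthesize() call here is just a placeholder for whatever model you end up using, not a real Hay Say or StyleTTS2 API):

import numpy as np
import soundfile as sf

def synthesize(text: str) -> np.ndarray:
    raise NotImplementedError("plug your TTS of choice in here")

with open("book.txt", encoding="utf-8") as f:
    chunks = [p.strip() for p in f.read().split("\n\n") if p.strip()]  # paragraph-sized pieces

audio = np.concatenate([synthesize(chunk) for chunk in chunks])
sf.write("book.wav", audio, 24000)  # sample rate must match whatever the TTS outputs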
>>
https://files.catbox.moe/thtbt5.mp3
>>
>>41408461
Not sure how that happened but I've fixed it now, thanks.
>>
File: Sweetie Bop.gif (2.74 MB, 690x541)
>>41409422
Good to see you're back, Vul.
>>
>>41409422
Very nice
>>
File: OIG1..vCx.4Hrxi7Shxs96MTL.jpg (151 KB, 1024x1024)
>>
>>41408610
Updated gdrive clone of the Master File with the Forgotten Friendship label file: https://drive.google.com/drive/folders/1ho2qhjUTfKtYUXwDPArTmHuTJCaODQyQ?usp=drive_link
>>
>>41409422
egghead
>>
[RVC] Saffron Masala sings - Sandaru Sathsara "Smooth Criminal"
>https://files.catbox.moe/sw53g7.mp4
>https://files.catbox.moe/c4t1pp.mp3

>https://huggingface.co/datasets/HazySkies/RVC2-M/tree/main/FiM_SaffronMasala
Haven't done testing with non-indian accent songs (yet), but she's shown great promise with this one.
>>
>>41411299
>Saffron Masala
Now it's getting really exotic. Never thought we'd see songs with ponies like her.
>>
>>41411386
I love this mare, so I felt it my duty to ensure her beautiful voice is heard all the more.
With how well her training went you can be sure more is on the way.
>>
>>41411635
Based anon delivering good stuff
>>
>>41411299
im going to redeem this mare
>>
>>41411299
>>41411635
very nice
now show saffron vagene
>>
>>41412254
and teet
>>
File: Chinese AI Twilight.webm (1.99 MB, 1280x720)
Holy guacamole!
https://derpibooru.org/images/3441503
>>
>>41412960
Damn, it understands English too!
https://hailuoai.com/video
>>
File: 1634078595872.webm (1.93 MB, 1280x720)
https://pomf2.lain.la/f/tptanmv.mp4
https://files.catbox.moe/6u2397.mp3

Cacti and Books

[Verse 1]
A filly from Ponyville with a whimsical mind,
She wrote a joke article, cheeky and kind.
All books make you silly, she'd claim with a grin,
Keep a cactus nearby to keep sanity in.
Ponyville Chronicle, they laughed her away,
Foal Free Press said it just wouldn't sway.

[Chorus]
Oh, cacti and books,
Avoid magic hooks!
Prickle and giggle while you're reading fine,
It's just a funny sign.

[Verse 2]
Flower Monthly saw it and said, 'That's the stuff!'
Rose printed the story, not calling it bluff.
Now Rose, Lily, Daisy kept cactus so near,
Pricking their hooves, but they had no fear.
Twilight Sparkle said it was all horseapples' fun,
By then Yakyakistan believed every one.

[Chorus]
Oh, cacti and books,
Avoid magic hooks!
Prickle and giggle while you're reading fine,
It's just a funny sign.

[Bridge]
From Ponyville joke to Yakyakistan craze,
Books and cacti in a mystical haze.
Hooves on the pages and spikes at their side,
Silly magic, but they’ll keep their pride.

[Chorus]
Oh, cacti and books,
Avoid magic hooks!
Prickle and giggle while you're reading fine,
It's just a funny sign.
>>
File: 1596252303251.webm (2.2 MB, 1280x720)
https://pomf2.lain.la/f/x3hcejh.mp4
https://files.catbox.moe/wdsdpl.mp3

[Verse 1]
There was a filly in Ponyville who had a big idea
She wrote an article as a joke, the silliest you'd hear

[Chorus]
You need a cactus when you read
To stop the magic's silly deed
Need a cactus when you read
It's how to save your mind indeed

[Verse 2]
Ponyville Chronicle and Free Press laughed it off with glee
But Flower Monthly thought it was the best thing they've received

[Chorus]
You need a cactus when you read...

[Bridge]
Flower ponies got their cacti, prickling as they read
Twilight said it wasn't true, but Yaks can be misled
So keep your cactus if you want, but magic’s not to fret
Just read your book without a fright, no pricks or needles yet

[Chorus]
You need a cactus when you read
To stop the magic's silly deed
Need a cactus when you read
It's how to save your mind indeed
>>
File: 1684416388665.webm (1.29 MB, 1280x720)
This one contains an eye rhyme, and the audio was bad anyway.

Udio 1.0 - The Cactus Conundrum ext v1

Lyrics

[Verse]
There once was a filly from Ponyville, her name was Sparkle Glint
She wrote an article o' magic books, and she sent it in for print

[Chorus]
Keep a cactus close when you read, it'll neutralize the magic
Or you'll get all silly instead, ain't that just tragic?

[Verse]
The Chronicle editor laughed it off, said 'This prank won't do!'
While the Foal Free Press just shrugged and said, 'There's nothin' new!'

[Chorus]
Keep a cactus close when you read, it'll neutralize the magic
Or you'll get all silly instead, ain't that just tragic?

[Verse]
But Rose from Flower Monthly was like, 'This is quite a find!'
She printed it for all to see, and some ponies lost their mind

[Chorus]
Keep a cactus close when you read, it'll neutralize the magic
Or you'll get all silly instead, ain't that just tragic?

[Verse]
Twilight heard 'bout the cactus craze, said 'What in Celestia's name?'
She found Rose, Daisy, and Lily, and put an end to their prickly game
She explained it was all horseapples, a joke that got too far
But the Yaks in Yakyakistan, still keep cacti by their memoir

[Chorus]
Keep a cactus close when you read, it'll neutralize the magic
Or you'll get all silly instead, ain't that just tragic?

[Outro]
So remember fillies, colts, and Yaks, before you believe such tales
No cactus needed for your books, just facts to fill your sails
>>
>>41364787
(1/2) parlerttsbros...

Here's 16 epochs of ParlerTTS finetuning on the Parler-TTS-Mini-v1 checkpoint over the entire pony-speech dataset (sans Forgotten Friendship) ~48hrs training on and off: https://huggingface.co/therealvul/parler-tts-pony-mini-v0-e16/tree/main

Results:
https://files.catbox.moe/t94htk.mp3
https://files.catbox.moe/1yplez.mp3
https://files.catbox.moe/9b8lah.mp3
https://files.catbox.moe/d1airi.mp3
https://files.catbox.moe/etp6b4.mp3
https://files.catbox.moe/3feky2.mp3

I'm not sure why the results are so bad, although I have suspicions.
>The character resemblance is OK but the audio quality seems to be quite poor compared to the original checkpoint.
Although I wonder what the "best case" is for DAC on our audio - I didn't normalize the audio going in, so maybe that's a factor?
>Pronunciation of certain words is fucked, e.g. "storage"
This might be related to their choice of tokenizer:
https://github.com/huggingface/parler-tts/issues/88
Combined with the fact that our dataset probably wouldn't be diverse enough to cover all of the 'single-word' tokens. If single-word tokens are the problem (see picrel), then the pronunciation issues might not be fixable at all. Although if you check using their HF space, even the baseline mini model trained on ~45k hrs of audiobook data (!) seems to miss syllables in some cases.
>There are lots of minefield cases where generations stop prematurely or just shit themselves.
Encountering an apostrophe breaks the Rainbow Dash one, and the Twilight one is supposed to be a much longer sentence but stops at a comma. I have no idea what the deal is with the Pinkie one.
>Problems increase as generation length increases.
Normal phenomenon for transformer-based models, but seems excessively sensitive. They added an option to use RoPE in one of their pulls (https://github.com/huggingface/parler-tts/pull/65) but it's not really documented, and I don't know whether they're considering training a different base model on RoPE or a different tokenizer given these issues.

>Is it a problem of undertraining?
Maybe partially? According to what metrics I could see in tensorboard (only enabled after disabling wandb) eval loss was more or less going down the whole time, and train loss is just beginning to peter out. Leading to:
>Is it a problem of model size?
From the little bit of testing I did using the HF space, it seems the large model has slightly better audio quality and better handling of pronunciation--still, the results here are much worse than the existing mini checkpoint.
There's also an option to unfreeze the text encoder during training, which I didn't select as it wasn't part of their suggested training command.
>>
>>41413292
(2/2)
The actual process of training was kind of spotty (but thankfully nowhere near as much as StyleTTS2). I ran into a problem with wandb getting stuck in the middle of the training process, so I removed the --report_to_wandb option. Training generates ~7GB of checkpoint files every 500 steps, so disk space was also an issue.

For data, I generated dataspeech compatible text descriptions of the dataset using Mistral 7B-v0.3 (this took around 60 hrs, it was done before Forgotten Friendship or the animatic data was added in, and I don't feel like redoing it right now):
https://huggingface.co/datasets/therealvul/parlertts_pony

I think the dataspeech tool could be modified to include our emotion tags in the prompt too, although I'm not sure what effect it would have, and I doubt it'd improve things in this case
>>
>>41412960
https://files.catbox.moe/t9wwac.mp4

Oh wow, this is pretty good!
>>
>>41413139
How did you manage the camera to move like that? What prompt did you use?
>>
>>41395797
That was really nice!
The ending music made me laugh
Thank you for producing nice things, and thank you for the cute pony too!
>>
File: 1664286824018001.png (600 KB, 1280x720)
https://voca.ro/19285AVeCkhX

>soon we'll be able to have a podcast with the Mane 6 talking about anything
exciting times ahead... generated from notebooklm.google.com
>>
>>41413292
>>41413298
That's looking super promising. Would still need more work of course, but the concentration of mare-ness present is very strong. Here's to hoping the issues can be worked through.

Even if the voices don't sound entirely like the mares in question, using them as sources for RVC and the like could help reinforce the resemblance and maybe also clean up some of the noisiness?
>>
>>41413583
I suspect the tokenizer is the biggest hurdle to overcome and swapping it out would probably require a full retrain of the base model, which is not feasible for me. That makes it a dead end from my viewpoint.
>>
la mare
>>
>>41413298
>dataspeech tool could be modified to include our emotion tags in the prompt too
Would be nice to have a return of the emotion-controlled TTS.
>>
>>41413634
If retraining the base requires a lot of computational resources, maybe Synthbot could help you out with it? In any case, I'd also encourage exploring other areas to improve in the meantime, which could then be applied to an alternate base model at a later date.

Despite the setbacks, I've yet to see a TTS with this much character fidelity (aside of course from our well known absentee's tech), so even with the faults it's something I really feel is worth exploring further or keeping an eye on. There's always the chance someone with more resources and know-how develops an extension solving those issues.
>>
>>41414148
Just for fun, here's some more sentences with words that are probably not in the dataset but encoded by T5 tokenizer as single words:

>"Bitcoin and Ethereum are popular cryptocurrencies."
https://files.catbox.moe/5j6vb4.mp3
>"Google and Microsoft are competing in the AI space."
https://files.catbox.moe/boh1ce.mp3
>"Amazon delivers packages quickly across the United States."
https://files.catbox.moe/qrkmg0.mp3 (wow, it actually got this one right! Maybe it's leakage from "Amazon rainforest"?)
>"Facebook users share photos on Instagram."
https://files.catbox.moe/k3fpnr.mp3
>"NASA launched a new satellite to study climate change."
https://files.catbox.moe/jo02jr.mp3
>"Tesla's electric vehicles use advanced battery technology."
https://files.catbox.moe/1q3cce.mp3
>"Netflix produces many original series and movies."
https://files.catbox.moe/31aewm.mp3

I wonder why nobody's tried to make a tokenizer based off of phonetics/synthetic phonetics yet? I might try replacing the tokenizer anyways and see how far I can get, no guarantees.
>>
>>41413323
The prompt was the story described in the song, a simplified version of the irl events it was based on, and that it is a song. It did not mention anything about the looks or named characters except for Twilight and the flower ponies.
>>
Up.
>>
>>41416202
>>
>>41414203
Usually people hack it in with a g2p (grapheme-to-phoneme) model. In those cases, the input text gets converted to phonemes or phoneme posteriograms (probability distribution over all possible phonemes), then that gets run through the speech model. Cookie had previously trained models that could convert both text/graphemes and arpabet to speech. In that case, the text was fed to the model per-character, and the resulting models did a decent job learning the relationship between graphemes and arpabet. In cases where the model failed to learn the pronunciation, users could replace the individual difficult graphemes with arpabet phonemes.
I haven't seen anyone actually create a phonetics-aware tokenizer though. To create that kind of tokenizer, you'd need a model that can convert per-character graphemes to phonetics data, then I think you'd need to figure out which set of tokens minimizes the number of decisions required to do the conversion. I think we have all the tools necessary to do that, though it would be somewhat complicated. I can write an explanation if you're interested.
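For reference, a rough sketch of the simpler g2p-then-BPE route mentioned above (g2p_en for grapheme-to-phoneme, then a small BPE tokenizer trained on the concatenated phoneme strings). The corpus filename is a placeholder, punctuation handling is ignored, and this is meant as an illustration rather than a drop-in for the ParlerTTS training code:

from g2p_en import G2p
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

g2p = G2p()

def to_phonemes(text: str) -> str:
    # "people" -> "PIY1PAH0L"; words stay separated by spaces
    return " ".join("".join(p for p in g2p(word) if p.strip()) for word in text.split())

corpus = (to_phonemes(line) for line in open("transcripts.txt", encoding="utf-8"))

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # one pre-token per word
trainer = trainers.BpeTrainer(vocab_size=1024, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(corpus, trainer)
tokenizer.save("arpabet_bpe.json")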
>>
>>41416901
Does this make any sense or is it just bullshit?
https://github.com/effusiveperiscope/parler-tts/blob/g2p/tokenizer_data_train_g2pen.ipynb
https://huggingface.co/therealvul/tokenizer_g2pen
>>
>>41417144
It makes sense, and it's only partially bullshit.
- I expect it's better than just assigning each Arpabet phoneme its own token index. It's worthwhile to model diphones and maybe triphones, which your tokenizer can do well.
- I think anons have had issues with g2p-generated pronunciations in the past, so bottlenecking tokenization on it will probably cause issues. However, from your tests it looks like your tokenizer gives much better tokens than parlertts, so I do expect it'll be an improvement.
- You should use the audio corpus to train the tokenizer, not generics_kb. The final set of tokens should be ones that are common in the audio training corpus so their embeddings can be well-trained.

If you want to refine it further with a very custom tokenizer, a hybrid approach might work well. If you have access to it, ChatGPT o1-preview knows how to create custom tokenizers. https://ponepaste.org/10347
- Use two tokenizers, plus one that wraps them both. For the first one, use the N most common tokens from the ParlerTTS tokenizer plus the BPE base tokens. For the second one, use the M most common tokens from the Arpabet tokenizer plus the BPE base tokens. For the wrapper, come up with some convention to denote arpabet words (maybe surround them in {}), make sure the pre_tokenizer doesn't split on the special characters you're using to denote arpabet words, and use the appropriate underlying tokenizer to convert each word. Make sure the BPE tokens and special tokens all map to the same indices regardless of which underlying tokenizer is used, and make sure the two underlying tokenizers otherwise use mutually exclusive token indices. (There's a toy sketch of this wrapper convention at the end of this post.)
- Every time a training sample is loaded: modify it so it uses a random mix of Arpabet-denoted words and grapheme-denoted words, and randomly pick characters to be encoded with the BPE base tokens so the BPE base tokens are well-represented.

I think that would let you get around the limitations of both g2p and the ParlerTTS tokenizer. That would also get rid of ParlerTTS's uncommon tokens and UNK tokens, which are probably responsible for a lot of its pronunciation issues.

A simpler approach, though one that would be less flexible during inference than the hybrid approach, would be to modify the ParlerTTS tokenizer to (1) get rid of tokens that are uncommon in the audio training corpus, (2) add in the BPE base tokens, and (3) while training, randomly pick characters to be encoded with the BPE base tokens so that the BPE tokens are well-represented.
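To make the wrapper convention above concrete, here's a toy sketch (assuming both underlying tokenizers are tokenizers.Tokenizer objects; it glosses over special tokens and the shared base-token mapping, and just keeps the two id spaces disjoint with an offset):

import re

class HybridTokenizer:
    def __init__(self, grapheme_tok, phoneme_tok, phoneme_offset: int):
        self.grapheme_tok = grapheme_tok
        self.phoneme_tok = phoneme_tok
        self.phoneme_offset = phoneme_offset  # keeps phoneme ids out of the grapheme id range

    def encode(self, text: str) -> list:
        ids = []
        # split into {ARPABET} spans and plain-text spans, keeping both kinds
        for span in re.split(r"(\{[^}]*\})", text):
            if not span:
                continue
            if span.startswith("{") and span.endswith("}"):
                ids += [i + self.phoneme_offset for i in self.phoneme_tok.encode(span[1:-1]).ids]
            else:
                ids += self.grapheme_tok.encode(span).ids
        return ids

# usage sketch: hybrid = HybridTokenizer(parler_tok, arpabet_tok, phoneme_offset=len(parler_tok.get_vocab()))
# hybrid.encode("Rainbow Dash said {HHAH0LOW1} to everypony")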
>>
>>41405973
Horsona chatbot library updates:
- [In progress] Continue working on lorebook generation.
- ... [Done] Refactor the StoryReader module to support custom memory modules, support extracting custom information from stories, and support tracking custom "live" information. I've abstracted all state tracking into "Cache" modules with load() and sync() functions, which can be swapped out.
- ... [In progress] Test the StoryReader flexibility on character card creation and lorebook generation.
- No changes for the rest.

I've made a lot of changes to how the autograd functionality works, and I've broken away from some pytorch conventions to (1) better support async execution, and (2) to simplify how backproppable functions are created. Summary:
- In pytorch, you call loss.backward() then optimizer.step(). The problem is that loss.backward() has no information on what needs to be updated, so it needs to calculate gradients for everything that led to the loss. In my refactor, loss.backward() is passed a list of leaf nodes so excess computations can be cut out. Code: https://github.com/synthbot-anon/horsona/blob/main/src/horsona/autodiff/basic.py#L55
- In pytorch, a single optimizer updates all parameters. In my refactor, the optimizer is just a step() function that's given a gradient context, which contains all computed gradients. loss.backward() returns a gradient context that can be passed to step(). This makes it easier for a module to update its own parameters as needed without needing to rely on the caller to call the optimizer. Code: https://github.com/synthbot-anon/horsona/blob/main/src/horsona/autodiff/basic.py#L188
- In pytorch, backproppable functions are defined as classes with forward() and backward() methods. In my refactor, both the forward and backward pass are defined by a single generator. It calculates the forward call, yields the forward result, gets a gradient context from the yield, and performs the backward call. Code: https://github.com/synthbot-anon/horsona/blob/main/src/horsona/autodiff/basic.py#L128 (There's a toy illustration of this control flow at the end of this post.)
- The gradient context passed during the backward operation contains a dictionary with all variables that need to be updated. This lets functions figure out which gradients actually need to be calculated by checking if a variable is in the dictionary, which lets them cut out unnecessary calculations. The functions are supposed to set the gradients of a variable by adding them to a list in the dictionary. Code: (same as the above horsefunction).
- Both sync and async generators are supported for backproppable function definitions. For sync generators, the backward pass call is wrapped in an async function so it can be handled consistently when I eventually make the backprop & update steps run in parallel. Code: (same as the above horsefunction).
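Toy illustration of the generator-based control flow (this is not the horsona code, and the real gradients there are text/LLM-driven; the numbers are just to show the forward/backward mechanics):

def scale(value, factor):
    result = value["data"] * factor
    grad_context = yield result              # forward result goes out, gradient context comes back in
    if id(value) in grad_context:            # only compute gradients that were actually requested
        grad_context[id(value)].append(factor)   # d(result)/d(value) = factor

x = {"data": 3.0}
fn = scale(x, 2.0)
forward_result = next(fn)                    # run the forward pass
grads = {id(x): []}                          # the "leaf nodes" we want gradients for
try:
    fn.send(grads)                           # resume the generator to run the backward pass
except StopIteration:
    pass
print(forward_result, grads[id(x)])          # 6.0 [2.0]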
>>
>>41416202
Preserve.
>>
>>41417450
>You should use the audio corpus to train the tokenizer, not generics_kb. The final set of tokens should be ones that are common in the audio training corpus so their embeddings can be well-trained.
My thinking was that generics_kb might be more representative of text that people might input into the model (at least compared to episode transcripts), and pony-speech was included mostly just to make sure the tokenizer could accept it. One of my concerns was that since "Equestria" showed up so often in pony-speech its phonetic equivalent kept getting mapped to a single token. Does this not matter, or should I just restrict the number of tokens until it doesn't show up as a token?

>Make sure the BPE tokens and special tokens all map the the same indices regardless of which underlying tokenizer is used, and make sure the two underlying tokenizers otherwise use mutually exclusive token indices.
Question: Why map the BPE base tokens (single char) to the same indices? Isn't "A" in ARPAbet-land different from "A" in text-land? Or is the point that they're correlated?
(Also, the ParlerTTS/FLAN-T5 tokenizer uses Unigram instead of BPE; it still has all of the base characters though.)
>>
>>41417144
>>41417450
I don't really comprehend all the technical stuff, but is there a way to hack up the TTS model to take reference audio like TalkNet/RVC, but also have emotion control like the old ngrok?
>>
>>41418163
>is there a way
There is always a way, but
>Would it work the way you want
Probably not--it's easy to say "just add these inputs" but it's another thing for the model to learn it.

I imagine for emotional control we could shove the emotion tags into the text prompt and unfreeze the text encoder, which would probably be the least drastic change to make. Replacing the tokenizer might require retraining the text encoder anyways, which puts training demands above what I can fit on my local system. The caveat is that a lot of our past work also seems to indicate that our emotion tags don't really correlate that well to styles of speech?

For conditioning on reference audio you might be able to shove speech features and a pitch curve into the pre-decoder hidden states; you'd have to shove some extra "stuff" into the loss function too. Whether a decoder-only model that fits on a normal GPU would be smart enough to figure all that out based on only pony-speech (or maybe another dataset?) is another question.
>>
>>41418265
>Replacing the tokenizer might require retraining the text encoder anyways
Nvm, it looks like the input IDs from the prompt tokenizer just get fed into an nn.Embedding, whereas the input IDs from the description tokenizer are what gets fed through the text encoder, which is nice because that means I can still run non-emotion conditioned training locally. Still means that to add our emotion tags I'd probably need cloud GPU (or a very generous Anon).
>>
File: emotion 1607391601088.png (58 KB, 753x682)
>>41418265
>>41418291
I can't remember if this pic is from the ngrok scripts or something some anon was working on, but I feel like it could be "easy" (for someone who knows their shit, aka not me) to create a dataset that uses something similar, where it would take the tags that already exist with the transcribed dataset files: if a file has the description "_happy_shouting_", it would be converted to "happy=0.6, shouting=0.4" in the generated training text.
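Something like this maybe, as a toy sketch (I'm sure someone who actually knows their shit would do it better; the filename format and the tag weights here are made up, purely to show the shape of the conversion):

import re

EMOTIONS = {"happy", "sad", "angry", "shouting", "whispering", "neutral"}

def tags_to_weights(filename: str) -> str:
    # hypothetical filename layout; adjust the parsing to the real master file naming
    parts = re.split(r"[_.]", filename.lower())
    tags = [p for p in parts if p in EMOTIONS]
    if not tags:
        return "neutral=1.00"
    total = sum(range(1, len(tags) + 1))
    weights = [n / total for n in range(len(tags), 0, -1)]  # earlier tags weighted heavier
    return ", ".join(f"{t}={w:.2f}" for t, w in zip(tags, weights))

print(tags_to_weights("s1e01_twilight_happy_shouting.flac"))
# happy=0.67, shouting=0.33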
>>
>>41418325
>"happy=0.6, shouting=0.4"
Are these fixed values or do you mean like trying to fit an emotion classifier and use the classification probabilities? Also ParlerTTS uses natural language descriptions so I'm not sure it would be appropriate in this case (since we know even huge LLMs tend to be pretty bad at working with numbers)
>>
>>41418355
Well, like I said, I am no expert at anything program related, just thinking it would be nice if we had a way to force the outputs to have more specific emotions, one that does not involve shouting the exact same line into the mic three dozen times in the hope that one of the takes will be good enough to be stitched into the project I'm working on.
I just need something where I can tell the program that the output needs to convey "a happy whisper with a hint of excitement", and at this point I am very apathetic about whether this will look like "smiley face, green heart, X eyes emoji" or "happy=0.5, whisper=0.45, excited=0.15" in the program itself.
>>
haysay's fucked again. StyleTTS2 specifically.
> FileNotFoundError: [Errno 2] No such file or directory: '/home/luna/hay_say/styletts_2/output/ffeef7dd9cde25f1bb44.wav'
>"User Audio": "d98daba91e38edd61a73"}, "Options": {"Architecture": "styletts_2", "Character": "Multi-speaker (30 epochs) (Mane 6, CMC, Princesses, Discord, Gilda, Zecora, Trixie, Starlight, Chrysalis, Tirek, Cozy Glow, Flim, Flam, and Shining Armor)", "Noise": 0.3, "Diffusion Steps": 5, "Embedding Scale": 1.5, "Use Long Form": true, "Style Blend": 0.5, "Reference Style Source": "Use Reference Audio", "Timbre Reference Blend": 0.09999999999999998, "Prosody Reference Blend": 0.09999999999999998, "Precomputed Style Character": "Rainbow Dash", "Precomputed Style Trait": "Shouting 1", "Speed": 1.0}, "Output File": "93ad122316abff861896", "GPU ID": "", "Session ID": "28929cb5096b47a9b679d82f46997712"} Input Audio Dir Listing: No input files to report does not exist
There absolutely is an input file.
>>
>>41417144
Is the phoneme tokenizer encoding spaces/word boundaries? I see underscores in the text tokens but not in the arpabet
>>
>>41418847
Nope, it's not
>>
>>41418783
That error is kinda confusing, because the first part:
>FileNotFoundError: [Errno 2] No such file or directory: '/home/luna/hay_say/styletts_2/output/ffeef7dd9cde25f1bb44.wav
Says that it cannot find the /output/ file, but the end says:
>Input Audio Dir Listing: No input files to report does not exist
Which is saying it instead can't find the input? So... because it can't find or process the input, it can't create an output at the end of the process, resulting in the error of missing file. Weird.
>>
File: output.png (88 KB, 986x590)
Also the prompt that they use in dataspeech to generate descriptions for named speakers implicitly assumes the speaker's gender/pronouns. ;^)

Whose names is Mistral 7B-v0.3 likely to get most wrong (just using "his/her")?
>>
>>41418783
>>41419012
Thanks for bringing the error to my attention. The last part of the error message is weird due to a bug in the error handling code. I was supposed to pass a directory into the "construct_full_error_message" method, but instead I passed the string "No input files to report" for some reason. It then tried to list the input files in that directory, but couldn't find a directory named "No input files to report" so it spat out "'No input files to report' does not exist."

The true error is that the output file failed to generate (No such file or directory: '/home/luna/hay_say/styletts_2/output/ffeef7dd9cde25f1bb44.wav'). However, since I don't save any logs, I can't get any details on *why* it failed to generate. I have also been unable to reproduce the error. Is the issue repeatable for you or did it happen once and then go away?
>>
>>41419166
https://pastebin.com/P3H2cSKJ
It's a repeat issue, and happens constantly. I've changed settings in StyleTTS2 and it happens regardless of setting changes. I had it happen when there wasn't any reference audio uploaded as well. The only time it worked today was when I had just part of the chorus pasted in the text box to try out the sound.
Ponepaste's captcha doesn't work so have a pastebin.
>>
>>41419270
Thanks for the pastebin link. The input text has no periods (or question marks or exclamation points), so StyleTTS tries to treat it all as one very long sentence and apparently it has trouble doing that. Try adding end-of-sentence punctuation marks to the lyrics to break them down into sentences.
In the meantime, I'll take a closer look at the codebase to see if there's anything I can do to prevent the error on the backend.
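In case it helps as a stopgap, here's a rough client-side sketch for pre-splitting the lyrics into sentence-sized chunks before pasting them in (the 400-character cap is arbitrary):

import re

def chunk_lyrics(text: str, max_chars: int = 400) -> list:
    # split after sentence punctuation or on line breaks, then pack pieces up to max_chars
    pieces = [p.strip() for p in re.split(r"(?<=[.!?])\s+|\n+", text) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) + 1 > max_chars:
            chunks.append(current)
            current = piece
        else:
            current = f"{current} {piece}".strip()
    if current:
        chunks.append(current)
    return chunks

for chunk in chunk_lyrics(open("lyrics.txt", encoding="utf-8").read()):
    print(chunk)  # paste each chunk into Hay Say separately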
>>
>>41418153
>Does this not matter, or should I just restrict the number of tokens until it doesn't show up as a token?
I can run some tests to find out. Give me a few hours.
>Question: Why map the BPE base tokens (single char) to the same indices? Isn't "A" in ARPAbet-land different from "A" in text-land? Or is the point that they're correlated?
No, you're right. Some single characters in Arpabet deviate from their english pronunciations, so it's better to keep them separate. Good catch.
>(Also, the ParlerTTS/FLAN-T5 tokenizer uses Unigram instead of BPE; it still has all of the base characters though.)
Nice. That makes things easier.
>>
>>41419270
>>41419449
Looks like there's not much I can do from the backend to prevent the error, aside from maybe warning the user when they enter too long a sentence.
StyleTTS2 takes the tokenized text and passes it through a BERT model. The BERT model that StyleTTS2 uses is configured to handle a maximum of 512 tokens at a time, which puts the limit at ~512 characters in a single sentence. To make this any longer would require retraining the BERT model. I could also attempt to automatically slice up sentences that are too long, but that comes with the risk of slicing up a sentence in an undesirable way.
>>
>>41420275
Maybe: add an option for sentence slicing, and have a hard cutoff at 512 tokens with a warning?
>>
>>41419719
>>41418153
Check the last cell in:
https://github.com/synthbot-anon/sample-code/blob/tokenizer-analysis/notebooks/tokenizer_analysis.ipynb
This finds tokens that are underrepresented in either the pony dataset or the generic_kb+pony dataset.
- Ones that are underrepresented in pony would be undertrained.
- Ones that are underrepresented in generic_kb+pony would be "useless". In these cases, it would have been better to get rid of the token so that more useful sub-tokens can get more training.

It then splits the "bad" tokens into sub-tokens that, where possible, would be both well-trained and more useful. It looks like that's possible for all of the tokens in a random 240k sample of generics_kb.
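For anyone following along, the shape of the underrepresentation check is roughly this (not the notebook code itself, just a minimal sketch against a tokenizers.Tokenizer; the cutoff is arbitrary):

from collections import Counter

def find_undertrained_tokens(tokenizer, corpus, min_count: int = 20) -> list:
    # count how often each token id actually shows up when tokenizing the corpus
    counts = Counter()
    for line in corpus:
        counts.update(tokenizer.encode(line).ids)
    vocab = tokenizer.get_vocab()  # token string -> id
    return [tok for tok, idx in vocab.items() if counts[idx] < min_count]

# usage: find_undertrained_tokens(arpabet_tok, open("transcripts.txt", encoding="utf-8"))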
>>
>>41420884
I guess some of these are obvious in retrospect - "PIY1PAH0L" (i.e. "people") and "HHYUW1" (i.e. "hu" in "human"). Would I just remove these from the tokenizer vocabulary?

Also on the point of >>41418847, probably worth training on word boundaries? I could probably re-run this on my end just in case the results end up different
>>
>>41421050
>Would I just remove these from the tokenizer vocabulary?
Yeah, I think those should be removed from the tokenizer vocabulary. You'd have to (1) find all token merges rooted in the tokens you want to remove, (2) remove the whole merge subtree containing the tokens to remove, and (3) remove all of those subtree tokens from the vocabulary. That last cell contains a function for doing that (create_subtokenizer), and the bottom of that cell shows how to use it to get rid of a set of tokens.
Given the tokenizer postprocessing steps in that last cell to find & remove bad tokens, I think it's safe to train your tokenizer to create a lot more tokens. Given how BPE merges keep track of how each token was created, I don't think there's any downside to it, and it'll probably give you more good tokens. Nothing in that code depends on the particular set of underlying tokens, so you can do the same thing for a BPE grapheme tokenizer.
>word boundaries
I'd guess that it's better to only split on punctuation (excluding hyphens adjoining words), not on spaces, to retain punctuation in the final transcription, and maybe to have only single-character tokens for punctuation (i.e., remove all token merges that involve punctuation). When speaking, word boundaries seem to all get blurred. You should be able to do that with the same code in that last cell by adding any multi-character token containing punctuation to the list of bad tokens.

Are you planning to use a hybrid phoneme- and grapheme- tokenizer? The llama 3.1 tokenizer is BPE. If you're planning to mix phoneme transcriptions alongside grapheme transcriptions, it might be better to use that one instead of the ParlerTTS one. If you want help with that, I can see if I can modify the llama 3.1 tokenizer to work for this.
>>
>>41421267
>I'd guess that it's better to only split on punctuation (excluding hyphens adjoining words), not on spaces, to retain punctuation in the final transcription, and maybe to have only single-character tokens for punctuation (i.e., remove all token merges that involve punctuation). When speaking, word boundaries seem to all get blurred. You should be able to do that with the same code in that last cell by adding any multi-character token containing punctuation to the list of bad tokens.
Sorry, that was unreadable. Trying again:
I'd guess that it's better to:
- Not split on word boundaries, split on punctuation boundaries instead
- Retain punctuation in the final phoneme transcriptions
- Make sure punctuation is always represented as a single token.

Word boundaries seem to be irrelevant when speaking, so I don't expect it'll help. Splitting on word boundaries would just lead to unnecessarily large tokenizations.
>>
>>41421050
Before you train anything on a tokenizer that's pruned based on my code, I realized that I'll need to fix it so the token counts it uses for determining splits get updated every time it decides on an actual split. Otherwise it'll split some tokens unnecessarily. I can wrap everything in a python class so it's easier to use.

>>41417482
Minor update: I have a simple implementation for generating character cards. I only ran it on the first 4 paragraphs of FiO so far (https://ponepaste.org/10352) to test the framework. Right now, it works by attaching a character card generator module to a story reader module, though I'm considering other approaches for getting the same result. It's currently done by passing a "postprocessor" function to the story reader. I'm thinking it would be better to have the story reader return enough information that the character card module can be called like any other module.
>>
>>41411299
>Saffron Masala AI song
>It's fucking Jackson
I've got mixed feelings about this one.
>>
File: safest_hooves_derpy.png (204 KB, 1487x1100)
Hey everyone, I'm posting here on behalf of the /mlp/ 4CC team. If you keep up with the cup, you probably know that /mlp/ is hosting the Autumn Babby cup this year, and as such we have an opportunity to provide a pre-cup day intro/hype video and a post-cup day credits scene. We're looking for people with video editing skills and ideas who would like to get involved in the process. I know people here have produced some amazing content for the antithology and the cons, and it would be cool if the board could produce something for the upcoming cup. It doesn't need to have extreme production value, but we can't let a chance go by to make 4chan remember that /mlp/ is still alive and kicking.
>>
>>41421267
>Are you planning to use a hybrid phoneme- and grapheme- tokenizer? The llama 3.1 tokenizer is BPE. If you're planning to mix phoneme transcriptions alongside grapheme transcriptions, it might be better to use that one instead of the ParlerTTS one. If you want help with that, I can see if I can modify the llama 3.1 tokenizer to work for this.
I am willing to try the hybrid tokenizer approach. On using llama: would there be any benefit in retaining some of the old ParlerTTS token IDs? I'm guessing we would lose those if we used the llama one.

>>41421283
>>41421460
OK
>>
>>41421283
>Word boundaries seem to be irrelevant when speaking
My intuition says it'll be more difficult for the model to parse a stream of phonemes as opposed to chunks of phonemes broken into words, but off the top of my head I can't think of any examples where word boundaries are necessary. Maybe no boundaries will make it harder for the model to learn intonation/prosody (e.g. generating breaths), but it might also be better to reduce the number of tokens.
>>
>>41422419
>My intuition says it'll be more difficult for the model to parse a stream of phonemes as opposed to chunks of phonemes broken into words
I'd think this too; if you wanted to write out a custom word in ARPAbet wouldn't the model then be deciding where the spoken word boundaries are?

>>41421283
I think the way the ParlerTTS tokenizer currently works is that it uses Metaspace and marks the "beginning of word" with a separate character (), but the separator itself is not usually treated as a separate token
>>
File: 1694811827493421.png (141 KB, 444x465)
>>41422529
(U+2581) Just imagine it's there
>>
https://pomf2.lain.la/f/tuimd2mx.mp3
>>
File: SaffronLevitating.gif (2.05 MB, 800x800)
>>41421720
Fair. It is possible to enjoy the work of a creator without liking them by extension, though it's usually easier if you already liked their work to begin with.

Made a few more with Saffron and as capable as she is, I've noticed she can often struggle with some higher notes in songs, resulting in her showing the classic AI screech syndrome. More common with male vocals it seems, so perhaps she has high sensitivity to coarseness?

[RVC] Saffron sings - Various
Sandaru Sathsara "Levitating"
>https://files.catbox.moe/m15x32.mp4
>https://files.catbox.moe/hqbjyx.mp3
David Hasselhoff "True Survivor" (Very poorly in parts)
>https://files.catbox.moe/57v6g5.mp3
Rising pitch test (via SynthV) singing "Mare"
>https://files.catbox.moe/z3f3ur.mp3
>>
>>41422419
>>41422529
>splitting on words
It makes sense that whitespace would be important for deciding when to generate breaths. I see three approaches:
- 1. Include whitespace in the phoneme transcription so that tokens can include whitespaces.
- 2. Set whitespace as a boundary when tokenizing so that tokens never contain whitespace.
- 3. Strip whitespace from phoneme transcriptions so whitespaces aren't even in the token vocabulary.

I think option 2 is the worst one since it both removes whitespace information and increases tokenization lengths. How about this:
- Include whitespace in the phoneme transcription and allow the tokenizer to do whatever it wants with that.
- Don't set whitespace as a boundary when tokenizing.

Which is the same as option 1. That would both reduce the number of tokens and include word boundary information. I think in terms of information provided to the model, it's the same as ParlerTTS's metaspace approach. I'm not seeing a reason to include a metaspace separate from normal whitespace, punctuation, and <bos>.
>>
>>41419128
>High Winds and Blaze so high
How?
>>
https://www.youtube.com/watch?v=VBacYxEEiuU
>>
>>41423311
Well it is almost christmas...
>>
>>41421977
I should be able to help out after Mare Fair, could you give more detail on the expected scope and timeline?
>>
>>41423224
Mistral thinks they sound like male names?
>>
>>41423090
>I'm not seeing a reason to include a metaspace separate from normal whitespace, punctuation, and <bos>.
There might be some slight confusion here--basically, applying Metaspace to word boundaries happens in the pre-tokenization step, so it's not something that's baked into the training data

As for why that specific character--maybe so it's easier to distinguish as a character that wouldn't normally appear in the dataset for debugging purposes? Probably wouldn't matter besides that.

Also if you could comment on
>>41422002

I guess using Llama 3.1's tokenizer would allow us to do the same pruning without having to change much?

Picrel shows current configuration of g2p tokenizer (token count increased to 1024), both pretokenization and actual tokenization, compared to parlerTTS -- mostly showing word separation and punctuation behavior
>>
>>41423224
>>41423388
Well, Blaze is often a male name, and "High" is one letter away from "Hugh", which is also male. Makes sense.
>>
>>41423490
I also just realized that this isn't the approach you proposed. Here's what it looks like when you let whitespace be anywhere in the token (i.e. no Metaspace/Whitespace)
>>
>>41423599
https://github.com/huggingface/tokenizers/pull/909
https://github.com/huggingface/tokenizers/releases
https://github.com/huggingface/transformers/pull/32535
Apparently tokenizers didn't support spaces in the middle of tokens until last month, and transformers isn't able to bump their tokenizers version yet?
>>
>>41423388
>>41423538
Oh, my bad. I thought these were mistakes made by analyzing the voice rather than the name.
>>
>>41423711
Huh. Yeah, the Huggingface BPE config format would break if there were spaces in the vocabulary, since merges are specified as pairs of space-delimited token strings. It looks like the Llama 3.1 tokenizer replaces all spaces with "Ġ" (unicode 288) in the token vocabulary.
>tokenizer.tokenize(' ')
>['Ġ']
But it also maps Ġ to something else.
>tokenizer.tokenize('Ġ')
>['Äł']
>tokenizer.tokenize('Äł')
>['ÃĦ', 'ÅĤ']
>... # there's a whole chain of remappings
If I use the underlying tokenizer's encode function, it transparently replaces spaces with Ġ.
>tokenizer._tokenizer.encode(' ').tokens
>['<|begin_of_text|>', 'Ġ']
And the underlying tokenizer's decode function transparently converts Ġ back to spaces.
I'm not seeing anything obvious in the llama tokenizer's config file that would be responsible for this. Same for HF's PreTrainedTokenizerFast, which is used to load the Llama tokenizer, or the Tokenizer class, which is the type of the underlying _tokenizer attribute, so it might be some hidden functionality in HuggingFace's BPE implementation. Llama 3.1 is using the BPE tokenizer though, so it is somehow possible with that. I'll have to look into this more to see how to get something similar.
It might be easier to just replace spaces with the metaspace character for the phoneme tokenizer. It won't be an issue for the llama tokenizer since it already handles this.
>>41423490
>On using llama: would there be any benefit in retaining some of the old ParlerTTS token IDs?
I don't see a reason to retain the ParlerTTS token IDs. If you wanted to continue training a model that was already trained with ParlerTTS, then yes, but for training new models, no.

I'm going to be busy with other things for the rest of tonight and most (maybe all) of tomorrow. After that, I can work on that tokenizer cleanup class. Let me know if you end up working on the hybrid tokenizer implementation before then, otherwise I might start on that too.
>>
>>41424644
I don't know the details of the HF/Llama tokenizers, but replacing space with Ġ goes back at least as far as GPT2's BPE, so maybe it was just copied from there
https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab957409a1cc2fbfba8a26/mingpt/bpe.py#L29-L30
>>
>>41424644
>It looks like the Llama 3.1 tokenizer replace all spaces with "Ġ" (unicode 288) in the token vocabulary.
I think this is the ByteLevel pretokenizer step; it also has a decoder that converts it back (I copied it in picrel). FWIW it looks like a tokenizer produced this way will actually save without an error so maybe it's fine?
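Quick sanity-check sketch of that ByteLevel behavior (exact output may differ slightly between tokenizers versions):

from tokenizers import pre_tokenizers

pre = pre_tokenizers.ByteLevel(add_prefix_space=True, use_regex=True)
print([tok for tok, _ in pre.pre_tokenize_str("a cute mare")])
# expected: ['Ġa', 'Ġcute', 'Ġmare'] -- spaces become Ġ, and the matching ByteLevel decoder maps them back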

>I don't see a reason to retain the ParlerTTS token IDs. If you wanted to continue training a model that was already trained with ParlerTTS, then yes, but for training new models, no.
I was considering training on top of the existing finetune or the original ParlerTTS mini checkpoint, but maybe that's a bad idea?
>>
>>41424717
That is a really ugly hack for a problem that really should have been solved with more type-consistent formatting.
Makes sense.
>>41424718
It looks like ByteLevel can take two arguments:
- add_prefix_space is probably fine with its current/default value True.
- I think use_regex should be False to avoid forcibly splitting tokens on spaces.
Otherwise, it looks good!

>I was considering training on top of the existing finetune or the original ParlerTTS mini checkpoint, but maybe that's a bad idea?
If you're swapping out the ParlerTTS tokenizer with the Llama tokenizer, we'd have to map the token ids over. It probably wouldn't be that difficult, as long as most of ParlerTTS's tokens are in the Llama vocabulary, which I expect they will be. Regardless of which one you end up using (ParlerTTS vs Llama), we'll need to modify the vocabulary to get rid of poorly trained tokens, and the extra work of mapping tokens should be small.
I don't expect that training on top of the existing checkpoint will cause problems.
>>
>>41424873
OK. I think I will retain the original ParlerTTS tokenizer for english
Will try to work on hybrid tomorrow
>>
File: 1702555648441938.gif (3.8 MB, 502x502)
Good morning bump
>>
StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion
https://arxiv.org/abs/2409.10058
>The rapid development of large-scale text-to-speech (TTS) models has led to significant advancements in modeling diverse speaker prosody and voices. However, these models often face issues such as slow inference speeds, reliance on complex pre-trained neural codec representations, and difficulties in achieving naturalness and high similarity to reference speakers. To address these challenges, this work introduces StyleTTS-ZS, an efficient zero-shot TTS model that leverages distilled time-varying style diffusion to capture diverse speaker identities and prosodies. We propose a novel approach that represents human speech using input text and fixed-length time-varying discrete style codes to capture diverse prosodic variations, trained adversarially with multi-modal discriminators. A diffusion model is then built to sample this time-varying style code for efficient latent diffusion. Using classifier-free guidance, StyleTTS-ZS achieves high similarity to the reference speaker in the style diffusion process. Furthermore, to expedite sampling, the style diffusion model is distilled with perceptual loss using only 10k samples, maintaining speech quality and similarity while reducing inference speed by 90%. Our model surpasses previous state-of-the-art large-scale zero-shot TTS models in both naturalness and similarity, offering a 10-20x faster sampling speed, making it an attractive alternative for efficient large-scale zero-shot TTS systems.
https://github.com/yl4579/StyleTTS-ZS
>>
>>41425827
Oh, wow! That seems quite promising from the examples ( https://styletts-zs.github.io/ ), and surprisingly not very noisy nor distant from the reference voice. Very curious to hear how well (or poorly) it works with pony voices; I'll try to test it myself if I can in a couple of days, whenever I'm likely to get enough free time to do so.
>>
>>41364782
Hey faggots. What AI do you use to read entire fanfics?
>>
>>41423359
>https://implying.fun/video/summer24/2024-07-19/
Here's a video from last summer. At 1:30 the intro starts that was used for Summer.

>https://www.youtube.com/watch?v=jn_ND9sYe5k
An idea we had was to use the S1E01 opener and reframe it as a battle between the streamers of the Elite Cup versus the Babby cup, since the Babby cup is seen as the lesser tournament only meant as a stepping stone for teams to promote to the elite. That way we could use most of the source video and only rely on some minor editing with faces/board icons pasted over the ponies, with narration of a script spoken by Celestia via RVC.

For the Credits that are essentially played at the end of each cup day, we were thinking of something based on the Running of the Leaves, since this is for the Autumn Babby Cup. It could be as simple as just having some clips from the episode with the credits scrolling down and music playing in the background.

We're trying to keep the scope low since time is running out. No dates have been finalized, but cup likely kicks off somewhere mid-October.
>>
https://vocaroo.com/1gBlpYI8MRWb
Why does this shit cost an arm and a leg ... might as well make a hundred accounts instead of having to pay $100 for a measly 10 hours of audio. This shit can't replicate the emotion required to do convincing AVGN reviews.
>>
>>41425827
>>41425861
Dunno, maybe it's just me, but it sounds either the same quality as or even slightly worse than what already exists.
>>
>>41423346
Marketing wise that's probably true.
>>
>>41424885
Update: I can't find any documentation on how/if you're actually supposed to subclass a PreTrainedTokenizer, particularly in a way that's compatible with AutoTokenizer. I've gotten as far as successfully registering the class with AutoTokenizer, but actually instantiating it seems to depend on a bunch of method implementations that don't seem to be in a spec anywhere. Might be easier just to make something that mimics the AutoTokenizer API to the extent it's used in ParlerTTS training? But obviously less flexible with other projects.
>>
>>41420275
Maybe having line breaks get treated like punctuation?
>>
>>41425543
Good morning in a different part of the world.
>>
>>41427365
Why does it need to be compatible with AutoTokenizer? Why not just use your PreTrainedTokenizer class? It has a from_pretrained function that can pull a tokenizer model from an HF repo. You'd have to commit the new tokenizer class and set trust_remote_code=True in from_pretrained. I don't think there's a way around that since I don't think there's a way to get a consistent hybrid tokenizer using only existing HF tokenizer classes.
I'm not sure how far you got with the hybrid tokenizer, but I think this mostly works: https://ponepaste.org/10355. When decoding, it adds extra spaces between tokens, and I haven't figured out how to get rid of that.
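
For reference, the consumer side would look roughly like this (the repo name is a made-up placeholder; it assumes the custom tokenizer class and config have been committed there):
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "someanon/hybrid-g2p-tokenizer",
    trust_remote_code=True,  # required to run the custom class shipped in the repo
)
print(tok("Twilight went to Ponyville.").input_ids)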

I'll see if I can get the code to clean up the grapheme and phoneme tokenizers based on what we discussed so far. I think that's just:
- Remove undertrained and useless tokens from both the llama 3.1 and arpabet tokenizers.
- Update the llama 3.1 tokenizer to use token ids from ParlerTTS where possible.
- I'll use the pony speech dataset to determine which tokens will be well-trained.
>>
>>41428025
Mostly because I wanted to mess with the training code as little as possible, and I assumed it'd be easier to make it compatible than it actually was.

All I have right now is an (incomplete) "toy" implementation that vaguely mimics AutoTokenizer instead of trying to be compatible with it directly.
https://github.com/effusiveperiscope/parler-tts/blob/g2p/hybrid_phoneme_tokenizer_nonhf.ipynb
I think I dodged the spaces problem by decoding in spans and passing 'clean_up_tokenization_spaces' to the underlying decoders, but it feels hacky.
>>
File: error 18 09 24.png (89 KB, 1172x1072)
89 KB
89 KB PNG
>>41428107
the colab file stops running at this code point:
self.tokenizer_g2p = AutoTokenizer.from_pretrained(
tokenizer_g2p)
>>
>>41428107
I realized that my approach for removing bad tokens has issues when uncommon tokens are prefixes for common tokens. For BPE, there's no way to fix this since removing an intermediate token would leave a gap in BPE's merge tree, and there's no way to resolve that without removing some "good" tokens. So the tokenizer function would need to check if any of the resulting tokens are in a reject list, and it would need to find the appropriate subtokens. I can have my code give a map of {bad tokens -> replacement tokens} so you can implement this more easily.
Also, I'll see if we can stick to the parlertts tokenizer for graphemes. That might be easier to work with than BPE, and its tokens are probably more tailored to speech than llama 3.1.
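
Roughly, the post-tokenization check I have in mind would be something like this sketch (the replacements map is the hypothetical {bad token id: replacement token ids} I mentioned, not something that exists yet):
def filter_bad_tokens(token_ids, replacements):
    # Swap any rejected token for its pre-computed subtoken sequence.
    fixed = []
    for tid in token_ids:
        fixed.extend(replacements.get(tid, [tid]))
    return fixed

# e.g. token_ids = filter_bad_tokens(tokenizer(text).input_ids, replacements)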
>>
>>41428457
Sorry, I had it pointed at my local folder for the tokenizer instead of the one on HF. Updated. Also it's not intended to run in Colab although it has so few dependencies it probably would
>>41428492
Would it be easier to use Unigram for the g2p tokenizer too?
>>
Moshi: a speech-text foundation model for real time dialogue
>https://github.com/kyutai-labs/moshi
>https://kyutai.org/Moshi.pdf
>https://moshi.chat/
Model that can generate a speech response while you're talking to it ("theoretical latency of 160ms, 200ms in practice"). Can also do speech recognition/TTS depending on how you configure it (and maybe even voice conversion if finetuned?)
>>
File: 1703433028824135.jpg (87 KB, 1280x720)
87 KB
87 KB JPG
>>41420275
>>41419449
I kind of expected the newline character to function like that. It "worked" as far as not throwing an error when adding punctuation as suggested, but the output eh....
https://files.catbox.moe/huvqum.flac
Like Dash got taken over by the Mark Hamill Joker. It also didn't fit the length of the reference audio, and in fact exceeded it. That might just be an expectation issue with how StyleTTS2 works though. I was expecting it to be like Talknet where the voice reference provides the base and runtime but that might just not be how this works.
>>
Up.
>>
Holy shit, the board is fast today.
>>
Learning Source Disentanglement in Neural Audio Codec
https://arxiv.org/abs/2409.11228
>Neural audio codecs have significantly advanced audio compression by efficiently converting continuous audio signals into discrete tokens. These codecs preserve high-quality sound and enable sophisticated sound generation through generative models trained on these tokens. However, existing neural codec models are typically trained on large, undifferentiated audio datasets, neglecting the essential discrepancies between sound domains like speech, music, and environmental sound effects. This oversight complicates data modeling and poses additional challenges to the controllability of sound generation. To tackle these issues, we introduce the Source-Disentangled Neural Audio Codec (SD-Codec), a novel approach that combines audio coding and source separation. By jointly learning audio resynthesis and separation, SD-Codec explicitly assigns audio signals from different domains to distinct codebooks, sets of discrete representations. Experimental results indicate that SD-Codec not only maintains competitive resynthesis quality but also, supported by the separation results, demonstrates successful disentanglement of different sources in the latent space, thereby enhancing interpretability in audio codec and providing potential finer control over the audio generation process.
https://xiaoyubie1994.github.io/sdcodec/
has examples. code to be posted
>>
>>41427847
>>41427847
>>41427847
new animatic dropped, which means new clean mare audio for the mega
>>
File: IMG_7178.jpg (601 KB, 1170x689)
601 KB
601 KB JPG
>>41409422
>>41409443
Rapedit tier meme but it fits
>>
>>41430883
>new clean audio
So that's what the fuss with the animatic is about.
>>
>>41430883
Strange this didn't get leaked sooner.
>>
>>41428838
>Would it be easier to use Unigram for the g2p tokenizer too?
I should be able to find out tomorrow.
>>
>>41423359
Contact Titan on discord to get added to the server that is planning/making the Autumn videos.
>>
You know that time Trixie was supposed to be a stallion and competent at challenges?
https://voca.ro/14B6PfRqMXM6
>>
>>41431861
Can someone make Twilight Sparkle voice this?
>>
File: 1726759221054826.png (1.19 MB, 1387x765)
1.19 MB
1.19 MB PNG
>>41430883
Don't forget to download it from vimeo because it has better sound
https://player.vimeo.com/video/1010460267
>>
>>41431875
here
https://voca.ro/151QvnRiTCI6
>>
>>41431891
oh, thx. nice find. downloading now
>>
File: 3298366.jpg (73 KB, 888x499)
73 KB
73 KB JPG
>>41364787
>>41431891
>>41431501
Added to the master file.
https://mega.nz/folder/jkwimSTa#_xk0VnR30C8Ljsy4RCGSig
Sliced Dialogue/Special Source/s2e10
Label file also added to the label files folder.
Removed the standard s2e10 from the FiM folder.
>>
Golden mare?
>>
>>41432280
That was fast. I'll update my clones sometime in the next few days.

>>41428838
>>41431501
Unigram is easier to work with. Can you train a Unigram model for the g2p tokenizer? Also, you should replace 'ñ' with 'n' before giving it to ParlerTTS. ParlerTTS doesn't seem to have a token id for ñ.
Unigram models keep track of per-token scores that determine whether to select a token. If I set the score of a token to -99, then it retains the token id and never uses that token, and as far as resulting tokens go, there's no difference between deleting a token vs setting its score to -99, which is exactly the behavior we want.
There are a LOT of bad tokens in the ParlerTTS tokenizer, which explains the pronunciation issues. Getting rid of them leads to a much less efficient tokenizer, but it's better than having undertrained & useless tokens. That does suggest though that creating a new model from scratch may be worthwhile since it could use a much more efficient tokenizer. If you create & upload a new grapheme tokenizer, I can check it against ParlerTTS's grapheme tokenizer to see how much more efficient it is.
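
For reference, the score trick can be done by editing the serialized Unigram model directly. A rough sketch, assuming the fast (Unigram-backed) tokenizer is what AutoTokenizer loads, with a placeholder reject list (swap in whichever ParlerTTS checkpoint you're using):
import json
from tokenizers import Tokenizer
from transformers import AutoTokenizer, PreTrainedTokenizerFast

tok = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")
data = json.loads(tok.backend_tokenizer.to_str())

reject = {"SomeBadToken"}  # placeholder for the undertrained/useless tokens

# The Unigram vocab is serialized as [piece, score] pairs; setting the score
# to -99 keeps the token id but stops the token from ever being selected.
for entry in data["model"]["vocab"]:
    if entry[0] in reject:
        entry[1] = -99.0

clean = PreTrainedTokenizerFast(
    tokenizer_object=Tokenizer.from_str(json.dumps(data)),
    unk_token="<unk>", eos_token="</s>", pad_token="<pad>",  # T5-style specials
)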
>>
>>41432280
Some dataset errors. These all have double spaces that should be single spaces:
>./s09e16.txt:607.608000 611.157279 00_10_08_Granny Smith_Happy_Noisy_that is one hundred and percent correct!
>./s09e21.txt:445.546000 448.796948 00_07_26_Caballeron_Anxious_Very Noisy_we were desperate for money to keep it open.
>./s09e21_master_ver.txt:445.546000 448.796948 00_07_26_Caballeron_Anxious_Very Noisy_we were desperate for money to keep it open.
>./s09e04_master_ver.txt:221.989000 225.833750 00_03_42_Shining Armor_Neutral_Very Noisy_giant fans keep any creature from flying too close to the castle.
>./s09e04.txt:221.989000 225.833750 00_03_42_Shining Armor_Neutral_Very Noisy_giant fans keep any creature from flying too close to the castle.
>./s09e04.txt:236.309115 240.007343 00_03_56_Shining Armor_Neutral_Noisy_and even if you could get in, which you can't, I've doubled the ranks of security.
>./s09e04_demu1.txt:236.309115 240.007343 00_03_56_Shining Armor_Neutral_Noisy_and even if you could get in, which you can't, I've doubled the ranks of security.
>./s09e04_master_ver.txt:229.555572 235.082270 00_03_50_Shining Armor_Neutral_Very Noisy_plus the entrances to the tunnels below the castle have been sealed. so there's no underground access.
>./s09e04.txt:229.555572 235.082270 00_03_50_Shining Armor_Neutral_Very Noisy_plus the entrances to the tunnels below the castle have been sealed. so there's no underground access.
>./s04e16_master_ver.txt:1248.049742 1253.039906 00_20_48_Fluttershy_Neutral_Very Noisy_and sometimes being too kind can actually keep a friend from doing what they need to do.
>./s04e16.txt:1248.049742 1253.039906 00_20_48_Fluttershy_Neutral_Very Noisy_and sometimes being too kind can actually keep a friend from doing what they need to do.
>./s04e10.txt:46.700000 49.590000 00_00_47_Pinkie_Shouting_Very Noisy_p ponyville!
>./s04e10_master_ver.txt:46.700000 49.590000 00_00_47_Pinkie_Shouting_Very Noisy_p ponyville!
>./s04e05_demu1.txt:971.904440 974.544440 00_16_12_Scootaloo_Sad__I, i'm not going.
>./s04e05.txt:971.904440 974.544440 00_16_12_Scootaloo_Sad__I, i'm not going.
>./s03e13.txt:571.525569 575.929673 00_09_32_Fluttershy_Happy_Very Noisy_Aww, look at that. i guess you were all just cranky because you were hungry.
>./s03e13_master_ver.txt:571.525569 575.929673 00_09_32_Fluttershy_Happy_Very Noisy_Aww, look at that. i guess you were all just cranky because you were hungry.
>./s04e21_outtakes.txt:44.595442 52.863044 00_00_45_Rarity_Neutral__Fortunately, thanks to the vision of Mare d'Flair, the Wonderbolts ensemble became more streamlined in a wonderfully.
>./s04e21_outtakes.txt:53.789015 62.718025 00_00_54_Rarity_Neutral__Fortunately, thanks to the vision of Mare d'Flair, the Wonderbolts ensemble became more streamlined in a wonderfully breathable fabric.
>./s02e25_special source.txt:660.861828 663.763950 00_11_01_Twilight_Happy__how many unicorns can just spread love wherever they go?
>>
>>41428838
Here's the cleaned-up ParlerTTS tokenizer:
https://huggingface.co/synthbot/parlertts_tokenizer_clean
>tokenizer = AutoTokenizer.from_pretrained("synthbot/parlertts_tokenizer_clean")
Make sure to clean up sentences before feeding them to the tokenizer:
>sentence = sentence.replace('  ', ' ').replace('ñ', 'n')
>>
>>41428838
>>41434491
I forgot the code.
https://github.com/synthbot-anon/sample-code/blob/main/src/cleanup_unigram_tokenizer.py
`fix_unigram_tokenizer` accepts a huggingface repo and returns a cleaned up PreTrainedTokenizerFast. If you save_pretrained a PreTrainedTokenizerFast, the result can be loaded with AutoTokenizer, which is how I created parlertts_tokenizer_clean. The original ParlerTTS tokenizer uses T5Tokenizer as the tokenizer_class in tokenizer_config.json whereas this one uses a PreTrainedTokenizerFast. That doesn't seem to cause any issues. I checked to make sure that it returns identical results to the original ParlerTTS tokenizer when there are no bad tokens, as long as the double-spaces are replaced with single spaces. Other than that, the config files should be identical excluding the tokens whose score is set to -99.
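
The intended round trip, in case it's not obvious (paths are placeholders and the exact argument may differ from what's shown):
from transformers import AutoTokenizer
from cleanup_unigram_tokenizer import fix_unigram_tokenizer

# Returns a cleaned-up PreTrainedTokenizerFast, per the description above.
clean = fix_unigram_tokenizer("parler-tts/parler_tts_mini_v0.1")
clean.save_pretrained("parlertts_tokenizer_clean")

# The saved result loads back through AutoTokenizer as a PreTrainedTokenizerFast.
tok = AutoTokenizer.from_pretrained("parlertts_tokenizer_clean")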
>>
>>41434301
>Can you train a Unigram model for the g2p tokenizer?
Here's a Unigram model (I think):
https://huggingface.co/therealvul/tokenizer_g2pen_v3/tree/main

Parameters (i.e. vocab size, pretokenizers) and dataset are the same as v2 except using Unigram instead of BPE.
https://github.com/effusiveperiscope/parler-tts/blob/g2p/tokenizer_data_train_g2pen.ipynb (please ignore the error). I'd note that a lot of names (including Equestria) seem to be encoded as single tokens now

>Also, you should replace 'ñ' with 'n' before giving it to ParlerTTS
Noted
>>
>>41434491
>>41434512
Thanks. I think you might need to copy decoder and/or post_processor over? I'm getting the metaspace character when I use this in batch_decode (see third cell; your tokenizer is now default)
https://github.com/effusiveperiscope/parler-tts/blob/g2p/hybrid_phoneme_tokenizer_nonhf.ipynb

Also I am going to sleep now
>>
>>41434543
Do you mean the tokens with nothing but '{metaspace}'? I think that's the correct behavior. Some words like 'he' in the sentence aren't common enough as a word beginning to warrant their own '{metaspace}he' token.

Cleaned up g2p tokenizer:
https://huggingface.co/synthbot/vul_g2pen_tokenizer_clean/tree/main
And I updated the parlertts tokenizer to use n=240k instead of 120k:
https://huggingface.co/synthbot/parlertts_tokenizer_clean
>>
>>41434543
>>41434574
Nevermind, I see what you mean. I'll see if I can fix it on my side.
>>
>>41434543
They should both be fixed now. I had to copy over the normalizer, post_processor, and decoder.
>>
Dunno if this matters to anyone but some lesser known dude who made some fandom music back in the day recently unlisted his videos. They're still available via a playlist, for however long that lasts.
https://www.youtube.com/playlist?list=PL6ZZW44lVoeh1CAX93IoZjaZsl1tVmBOx
>>
>>41434933
Woops, sorry lads. Meant to post this in the archival thread.
>404
Nevermind.
>>
>>41434543
>>41434574
>>41434579
Thank you. Updated.
https://github.com/effusiveperiscope/parler-tts/blob/g2p/hybrid_phoneme_tokenizer_nonhf.ipynb
Going to test more and then work on making it work with parlerTTS
>>
Who do I have to FUCK to get a 500 page book storytimed by Tara Strong?
>>
>>41434943
You can fix that.
>>
>>41435458
If only it were this simple...
>>
>10
>>
>>41435458
Is there really no one working on a plain TTS on a small scale, but with all the new tech inventions, that makes it sound at least more pleasant to listen to than MS Sam?
>>
>>41435458
500 ElevenLabs accounts working in sync, commanded by a script.
>>
>9
>>
>>41436868
Didn't that thing cost money?
>>
>>41437872
Never used it, but I'm pretty sure it uses the old classic bait of "free X uses per day, then pay a subscription for more".
>>
>>41417482
Horsona chatbot library updates:
- [In progress] Continue working on lorebook generation.
- ... [Done] Test the StoryReader flexibility on character card creation and lorebook generation. It looks flexible enough. I added a test "Character Card" module, and the early results seem promising enough.
- ... [In progress] I'm going to refactor StoryReader, CharacterCard, and some data structure implementations based on what I found works. The basic workflow for deriving custom information from the StoryReader will be something like this, which is analogous to what you'd do when using a pytorch module:
- ... ... 1. Use StoryReader to parse information paragraph-by-paragraph. On each iteration, it returns a bunch of context objects that keep track of what information it used to understand each paragraph.
- ... ... 2. Pass the context objects to any modules that need to extract more information from the story. Right now, there's a CharacterCard module that tracks per-character information (right now, just synopsis, personality, appearance, and quotes).
- ... [ ] After the refactor, I'll add another module for extracting story setting information. I think between character cards, setting information, and the story database generated by StoryReader, I'll have all the data extraction parts necessary to create lorebooks.
- [Done] I created a function that wraps multiple other LLM APIs. The main use is for making many calls to LLMs where several LLMs are interchangeable for the task. It tracks recent usage, per-call rate limits, and per-token rate limits to decide which API is most appropriate, and it waits for rate limits before running an inference. It mimics the API of the least common denominator of all LLMs passed, so if all of the LLMs are compatible with the OpenAI interface, the resulting object will be too (a rough sketch of the idea is below this list). This is going to be necessary to sidestep rate limits when I try having it process an entire story.
- [Done] I closed out two of the open issues: one for supporting OpenAI models, and one for supporting Anthropic models.
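
The sketch I mentioned, heavily simplified and with made-up names (not the actual horsona code), just to show the round-robin + rate-limit idea over interchangeable OpenAI-compatible clients:
import time

class RoundRobinLLM:
    def __init__(self, clients, calls_per_minute):
        # clients: interchangeable OpenAI-compatible client objects
        self.clients = clients
        self.limit = calls_per_minute
        self.history = {id(c): [] for c in clients}  # recent call timestamps

    def _pick(self):
        now = time.time()
        for client in self.clients:
            recent = [t for t in self.history[id(client)] if now - t < 60]
            self.history[id(client)] = recent
            if len(recent) < self.limit:
                return client
        return None  # every backend is currently rate limited

    def complete(self, **kwargs):
        client = self._pick()
        while client is None:  # wait out the rate limits
            time.sleep(1)
            client = self._pick()
        self.history[id(client)].append(time.time())
        return client.chat.completions.create(**kwargs)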
>>
>>41438062
They farmed up a dataset from users with the free trial and then pulled a
>oh no, people are using it, we need to charge!
and kept the data. Because they're American and not based in the EU, they don't get fucked for it.
>>
>>41438062
>>41437872
By using a google account you get like 10k chars or 10 minutes worth of voice audio for free, but then for 1 hour and 10 hours they charge you up the ass and for someone who needs 500 pages that's a no go.
So better to get an army of accounts and a command bot to order that army of bots to churn out 500 pages in 1 hour.
>>
>>41438199
Did I forget to mention it's 10 minutes worth of voice audio PER MONTH? That's why I refuse to pay anything.
>>
>>41438213
>Per month
Wew, that's turbo jewish, even other services are doing something more reasonable like resetting counter per day/week.
>>
Was the changeable file format feature removed from Haysay just now? I was just using it, it went down for a few secs, and now the function is gone.
>>
>>41438602
Sorry about that. The TLS certificate expired and I needed to temporarily take down Hay Say to go renew it. The code for selecting a file format isn't in the Docker image yet, so that functionality got wiped when I brought HaySay back up. I had to manually edit it back in. Everything should be working now.
>>
>>41438619
Yup, all working fine now. Thanks for hosting the site!
>>
Bumpen the thread.
>>
>>41438143
Neat, is openrouter supported thru openai?
>>
>>41440892
I added support for it just now. It's untested, but if it's actually compatible with the OpenAI SDK just with a different url & api key, I don't expect any issues with it.
>>
>>41438143
I'd like to share some ideas for the end product.
Straight to the point: a node-based system that processes different modalities, manages context and memory, handles logic, and responds in different modalities, perhaps with an internal clock, all with the modularity of ComfyUI. This kind of system would be the ultimate solution for use in apps and games (sex games included), letting people experiment with and share their data flows and write custom modules, so you don't have to worry about coming up with the single best way of doing something.
>>
Mareing in the dirt
>>
>>41441648
Truly so.
>>
File: Nurse R H ALT 720p Fix 2.png (3.43 MB, 1280x720)
3.43 MB
3.43 MB PNG
[RVC] Nurse Redheart sings - Massive Attack "Teardrop"
>https://files.catbox.moe/ivs4mn.mp4
>https://files.catbox.moe/b2gchw.mp3

Had this idea for a good while, only just recently remembering about it. Would be fun to see a pony version of House M.D. at some point with Nurse Redheart as the lead role.
>>
>>41441123
I like it. All it would require is a simplified way to turn modules and functions I create into ComfyUI nodes. I think that's doable. I might be able to add an "API endpoint" node as well so that custom workflows can be wrapped into their own OpenAI-compatible interface usable by other text generation interfaces. I'll add it to the target feature list.
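
For anyone curious what the wrapping would look like, a ComfyUI node is just a class with a few attributes; a minimal sketch (the horsona call inside is a made-up placeholder, not an existing module):
class HorsonaCharacterCardNode:
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {"story_text": ("STRING", {"multiline": True})}}

    RETURN_TYPES = ("STRING",)
    FUNCTION = "run"
    CATEGORY = "horsona"

    def run(self, story_text):
        # Placeholder: this is where the wrapped horsona module would be called.
        card = f"Character card derived from {len(story_text)} characters of story text"
        return (card,)

NODE_CLASS_MAPPINGS = {"HorsonaCharacterCardNode": HorsonaCharacterCardNode}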
>>
File: Untitled.png (1003 KB, 1080x1631)
1003 KB
1003 KB PNG
DiffEditor: Enhancing Speech Editing with Semantic Enrichment and Acoustic Consistency
https://arxiv.org/abs/2409.12992
>As text-based speech editing becomes increasingly prevalent, the demand for unrestricted free-text editing continues to grow. However, existing speech editing techniques encounter significant challenges, particularly in maintaining intelligibility and acoustic consistency when dealing with out-of-domain (OOD) text. In this paper, we introduce DiffEditor, a novel speech editing model designed to enhance performance in OOD text scenarios through semantic enrichment and acoustic consistency. To improve the intelligibility of the edited speech, we enrich the semantic information of phoneme embeddings by integrating word embeddings extracted from a pretrained language model. Furthermore, we emphasize that interframe smoothing properties are critical for modeling acoustic consistency, and thus we propose a first-order loss function to promote smoother transitions at editing boundaries and enhance the overall fluency of the edited speech. Experimental results demonstrate that our model achieves state-of-the-art performance in both in-domain and OOD text scenarios.
https://nku-hlt.github.io/DiffEditor
https://github.com/NKU-HLT/DiffEditor
also has implementations of 4 other speech editing papers. trained on a 3090. they used a typical academic dataset so I wonder if a high quality one with a lot of examples (so pony stuff would work) would show further improvements
>>
>>41443483
I think you understood me wrong. I suggested a node-based state machine running as a server that can take inputs, update state, and generate outputs. The user doesn't get to interact with models directly and can only send data to input nodes (images, sound, text, other structured data); the workflow handles them and may or may not generate and send data back. For example, a workflow might do the following: some other app is a frontend mare emulator. Among everything else, a scene image gets sent every x seconds. A multimodal LLM analyzes it with previously gathered context, stores it for later, then if some condition is met it may generate a phrase response, voice for it, generate an animation, add all of this to the corresponding contexts, then send it back as a response. Stuff like this will come around inevitably in the future.
>>
>>41443335
It certainly would
>>
>>41443335
She sounds a bit like she had an accent for some reason.
>>
File: 1696903207703738.png (47 KB, 220x229)
47 KB
47 KB PNG
>>41443335
This song has lyrics?
Yes, I have only heard the cut used for House and deliberately avoided looking up the full song.
>>
>>41444192
If I'm understanding correctly, that's how the current system works. Someone designs a workflow, which can take arbitrary inputs, generate arbitrary outputs, and update state based on arbitrary feedback. The workflows are built up through functions & modules. A chatbot would just be a particular workflow, maybe with some pre-populated parameters & data. The library keeps track of the full computation graph as a collection of nodes so it can figure out (1) which outputs to generate for each input, and (2) which state to update whenever some feedback or update request is provided. To do what I think you're saying:
- The end user would interact with some mare emulator frontend (e.g., SillyTavern or something custom).
- The user's messages would get passed to some backend.
- The backend would run the messages through the computation graph.
- That computation graph would continue running until it generates whatever outputs and state updates it decides are appropriate. The state updates would get applied whenever it makes sense to do so, some inline with output generation like database updates, some after the computation like history state updates.
- The outputs would get sent back to the mare emulator frontend so the end user can see/hear whatever was generated.

ComfyUI is just a visual interface for building and optionally running computation graphs like this. It's the same sort of computation graph that pytorch programs build up, which is why it works for image generation. The ComfyUI interface would be more for chatbot developers that want to customize how a chatbot works, not for end users that want to use chatbots. Someone can use ComfyUI to create a chatbot workflow, then wrap the result in an API so any mare emulator can use it.
>>
>>41444957
I'd say that's because the original song has a fair amount of an accent to it already. The words "fearless on my breath" had always sounded more like "feel the summer - pray" to my ear, for example. I also had the character likeness reduced a little (0.7-0.8 I believe) because with her limited dataset she can mistake certain pronunciations more often and sound much rougher the closer she is to 1. I've found this to have more of an effect than voiceless protection and a couple other settings.

>>41445222
https://youtu.be/l7iCmLoFoyA
My favourite lyrics of it are "Love, love is a verb; love is a doing word," which could be interpreted as love requiring persistent action in giving and commitment, not just given once and/or expected to exist without the act of giving. I also like the sound of "Most faithful mirror" and "Of a confession" but I'm having a hard time gleaning their meaning all that well.
>>
Good night bump.
>>
>>41405973
What's the motivation behind using the pytorch training loop idiom for your API?
>>
>>41447111
There are a lot of reasons I went with the pytorch training loop. The impetus was playing around with TextGrad and seeing how well it worked. TextGrad's core was terrible though (it's slightly better now), and it's too tailored to text, whereas I wanted something that can work with more modalities and with API & database integrations. I've played around with & looked into other agent frameworks, and I found that the popular ones scale very poorly in terms of the complexity they can support. It's a hypothesis that the pytorch approach will allow for much more complex functionality. The pytorch approach also makes it easier (compared to other agent frameworks) to integrate things that require an actual pytorch training loop (e.g., fine-tuning) without having to treat them as blackboxed components.
There's a second half to this that I plan to integrate once there's a need and once I understand how to integrate it better. If it gets that far, single "agents" will be based around the pytorch training loop, and multiple "agents" will interact through something like the kubernetes controller model. Most of the code for that second half has already been implemented as part of an earlier attempt at getting more complex agents, but it needs to be rewritten into something people can run without spending $500/month.
>>
>>41446258
>her limited dataset
It's impressive how much can be done with it in spite of its limits though.
>>
>>41446244
Instead of ComfyUI's output-node, lazy-eval approach, I suggest flow control with execution pins; a workflow may have several externally addressable starting pins to serve different kinds of data, for example. You can even start drawing parallels with how the brain has different sensory inputs. And by mare emulator I meant a more sophisticated frontend, such as a game or something else that could benefit from this. Comfy in its current state can't do this no matter how many custom nodes you make because it's conceptually different from what I'm talking about; I should have made that more clear.
>>
>>41447629
For mare emulation, I was only thinking of using Comfy as an interface for designing workflows. As I was thinking about it, the workflow would then be exposed through some REST API, or it would be exportable as a python class, at which point Comfy's limitations no longer apply.
You're thinking of Unreal Engine's Blueprint? I would love to have this available in Unreal. It doesn't look like it'll be too difficult to expose horsona modules as Blueprint nodes through Unreal's Python API. My main concerns are:
1. It would be difficult for me to test it in any meaningful way though since I don't know how to do anything in Unreal right now. I might need your help with that. If you're a pony game dev, do you have an example Blueprint screenshot showing how mare nodes interact with the environment? I want to get a sense for which inputs & outputs it would be worth supporting.
2. We have a lot more anons familiar with ComfyUI than with Unreal Engine. This isn't that big a deal since I can support both, but until I have enough modules to make it worthwhile for game engine integration, I'll probably prioritize Comfy-based workflow creation + a separate program to actually run it.

I'm going to be out for the next few weeks for Mare Fair, waifu anniversary, then the mlp anniversary stream. I won't get much work done until that's over.
>>
>>41447511
Even more so if you consider her dataset is:
- Less than 23 seconds of audio
- Partly noisy with some leaked SFX
- Varies a lot in terms of tone and delivery
- Contains some harsh sounds and less phonic elements

>https://pomf2.lain.la/f/9gdwg97d.mp3
She just so smol
>>
>>41448046
Impressive. Hope Meadowbrook voice model will be possible someday.
>>
File: example.png (244 KB, 1544x781)
244 KB
244 KB PNG
>>41447743
As I said, you can't have flow control in Comfy; it has no execution pins. If we want a workflow to be able to decide which nodes to execute based on model outputs or on time intervals (to 'perceive' time instead of being a chatbot), you would have to implement all the logic inside the nodes, killing the modularity.
What I'm talking about is indeed like Unreal's Blueprints, except the events are bound to API calls.
UE's Python API is actually editor-only; it's used for tasks like asset management. It doesn't run in game. For external communication you should use modules that can do socket or HTTP. It won't change much, it still comes down to sending data to one of the endpoints and handling responses.
>do you have an example Blueprint screenshot showing how mare nodes interact with the environment?
Is pic related what you asked for? In UE you basically have actors, which are user-defined objects that may have logic, like a character. They have an event tick that fires every frame and you can do pretty much anything with them. Actors can access their child components (like actors but without tick) and any other object they have a reference to. An actor can be either a Blueprint or a C++ class; the latter is a bit more flexible.
>>
File: applejack hoofy kicks.gif (318 KB, 500x379)
318 KB
318 KB GIF
>>41448046
She sounds like Applejack doing a funny voice
>>
File: NurseAshleigh.png (16 KB, 368x71)
16 KB
16 KB PNG
>>41448651
She IS Applejack doing a funny voice
>>
>>41449328
I swear she is voicing like 1/3 of the background characters, she can't keep getting away with this!
>>
>>41449328
I thought she sounded more like Rainbow here.
>>
>>41448046
wtf she is cute. tiny nurse pone
>>
File: TabithaRandomPony.png (454 KB, 1038x1319)
454 KB
454 KB PNG
>>41449526
Another 1/3 has to be Tabitha, well known for her role as Random Pony.
>>41450304
True. She does sound more coarse than country.
>>
>>41448510
Thank you. That helps a lot.
>UE'S Python API is actually editor only, it's used for tasks like asset management. It doesn't run in game. For external communication you should use modules that can do socket or http.
That simplifies things.
>As I said, you can't have flow control in comfy, it has no execution pins
I didn't understand what those were, but it makes sense now with your image. I still want to support Comfy because of the reach, and I'm fine with Comfy-created workflows being more limited in what they can do. Workflows in something like UE (or similar) would be able to support more complexity and efficiency.
>>
Haysay keeps spitting out that one long error when I try to use rvc or any other. What's going on?
>>
Test bump
>>
>>41451031
Maybe I will contribute by implementing what I'm talking about, but I'm too busy right now.
>>
>>41451390
Let me know if you decide to read through the code and find anything unclear. There's a lot that's undocumented right now, which I plan to clean up. I can prioritize documenting whatever you intend to look at. I plan to add a design doc as well so it's easier to navigate the code.
I'm going to make close to zero progress from Mare Fair until & through the anni stream.
>>
So how do I actually make the AI read an entire book while I sit back and do nothing?
>>
>>41451569
Nigga... You already asked 40 times...
>>
File: TrixReally.png (98 KB, 539x539)
98 KB
98 KB PNG
>>41451569
You don't, the tools here aren't built for that. Like you were told last time, cope with regular TTS, or make yourself useful and create the tool you want yourself. The answer's not changing.
>>
>>41451569
What pc specs do you have? What format is the book in?
>>
File: 6411306.png (17 KB, 215x228)
17 KB
17 KB PNG
>>
>accidentally leak openrouter key in github because i'm retarded
>openrouter account immediately drained of credit on some coomers Sonnet 3.5 gens
Well, fuck. Clearly I'm too retarded to do this.
>>
>>41452634
Lmao, what a fag
>>
>>41452634
Well that sucks.
>>
>>41451164
Thanks for bringing it up. I took a look and found that the same issue described in >>41134933 and >>41211179 happened yet again for RVC. I still don't know the root cause, but I've added a cron job to check for the issue every minute and automatically correct it, so that should help keep it under control.

The other architectures are working fine for me. Everything OK for you now or are you still seeing an error with another architecture?
>>
>>41453921
RVC works perfectly once again! Thank you so much!
>>
Pre wageslaving bump.
>>
>>41454322
Post wageslaving bump.
>>
>>
>10
Will likely have to retrain Lotus a second time round. Something went wrong: the 300-epoch version has loads of pronunciation issues, and at 345 epochs the overtraining detector kicked in but must've caught it late, because there are still some oddities in her with that version. Might remove one of her (albeit limited) reference data files solely because its pitch goes really far above her usual range, which might be messing with it too much.
>>
File: TriHard.png.jpg (55 KB, 400x400)
55 KB
55 KB JPG
>>41452634
AYOOOOO
/aicg/ SENDS ITS REGARDS
>>
>>41452634
I just nutted to a loli sister bot, thanks for your cervix
>>
>>41452634
how the fuck do you "accidentally" leak any key?? put it in an .env retarded fag gitignore will not post it
>>
>>41456346
Damn, hope this can get fixed.
>>
File: OIG1.HVzZoe6FVJgbpOvmZsox.jpg (165 KB, 1024x1024)
165 KB
165 KB JPG
>>
>>41457802
It looks like the finger is doing the E.T. thing.
>>
>>41435411
>>41434512
I think there's a bug in this -- the token you get from tokenizer.decode() (no metaspace char) isn't the same as the one from scores, so a lot of tokens that should be removed (i.e. the ones that occur 0 times in pony dataset) don't end up getting removed, e.g. "Reparatur", "Servicii".
>>
Page 9 bump.
>>
>>41458959
Plus one.
>>
>>41460073
>>
What’s this thread about
>>
>>41462741
We bump
>>
>>41462741
Unironically it's about ponies saying pony things.
>>
>>41462847
>>
>>41462741
It's like hanging out with people who think AI tech is going to get exponentially more powerful with time, but they're actually pretty excited about it.
>>
>>41464030
I mean, it's of course more than likely that some corpo will brew a Skynet at some point in the future, but we can at least have fun with AI ponies before we all burn in atomic fire.
>Captcha GOYGMX
>>
>>41464578
Loads of the contributing anons here are sorta like the hapless scientist driven by pure passion who gets their work seized by powerful people to ruin the world, but it's pretty funny that the passion project is breathing life into horse wives.
>>
>>41464578
Can't wait for the AI overlords.
>>
>>41464613
Only when they are pony AI overlords.
>>
Saved by the bell.
>>
>>41458217
Will take a look when I get back (Wednesday).
>>
>>41467361
Thank you for your service.
>>
This is intentional
>>
>>41469125
What exactly?
>>
>>41467671
https://github.com/effusiveperiscope/parler-tts/blob/g2p/cleanup_unigram_tokenizer.py
I changed it on Friday (near line 60) because I had to leave home on an errand and I wanted to get a head start on training even if it doesn't work

Anyways have a (cherrypicked) eval sample (epoch 28, 43 actual hrs of training)
https://pomf2.lain.la/f/5yrsu3u3.mp3

>Hack 1
The way their training works is that it precomputes all of the token IDs, which is a problem if you want to apply random ARPAbet g2p. I couldn't write a clean solution in time so I ended up modifying the collator to decode the text from the token IDs and then apply random g2p on that to get new token IDs. I think the way I did it somehow breaks the WER metrics but hopefully they're just eval metrics and not actually used for anything important. The text prompt the wandb log files show is partially nonsense, but at least the actual TTS seems to read correctly?
>Come on rainbow dash des zu you can des bodyarea often. Just remember the routine des les
>Hack 2
I avoided modifying the vocab size (throwing away the token embedding weights) by just making the g2p tokens reuse token IDs of the disabled eng tokens
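
In case anyone wants to replicate it, Hack 1 boils down to something like this (heavily simplified; the field names and the random_g2p helper are stand-ins, and the real collator also has to carry the audio labels and description ids through):
import random

def collate_with_random_g2p(batch, grapheme_tok, g2p_tok, random_g2p, p=0.5):
    new_prompts = []
    for example in batch:
        # Recover the text from the precomputed token IDs...
        text = grapheme_tok.decode(example["prompt_input_ids"],
                                   skip_special_tokens=True)
        # ...then re-apply ARPAbet g2p to a random subset of words.
        words = [random_g2p(w) if random.random() < p else w
                 for w in text.split()]
        new_prompts.append(" ".join(words))
    # Re-tokenize with the hybrid tokenizer to get fresh IDs for this batch.
    return g2p_tok(new_prompts, padding=True, return_tensors="pt")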
>>
>>41471308
A video of eval samples so far (mildly interesting and/or kekkable):
https://pomf2.lain.la/f/cff7zdl6.mp4

Interestingly it hallucinated the magic sound effect in "Come on, you can do this", step 20430
>>
>>41471529
Does the model seem to be a little overtrained at 31780 steps? Also, does Parler-TTS produce deterministic outputs?
>>
>>41471529
Kek, Rainbow Dash almost turns into Applejack at 20430 steps.
>>
>>41471712
>Does the model seem to be a little overtrained at 31780 steps?
I'm wary of calling "overtrained" on anything yet--we're looking at a very small subset of the total evaluation. Right now I can't tell how much of the variation is coming from the random g2p substitutions vs. problems with the model itself. If you prefer the 20k you'd still have to contend with the magic noise and general weirdness, for example, in Rainbow Dash's "come on" sample. Eval loss (whatever that is?) is still going down.

>Also, does Parler-TTS produce deterministic outputs?
I believe so, although I suspect you could get quite a bit of variation by changing the description prompts (not shown in the video).

I kind of wonder if you'd be able to find useful "directions" in the description embeddings like with LLMs.
>>
>>41471529
>Tirek at 1135 steps
EspanolCentaur.wav
>Tirek at 11300 steps
"Harry Potter is dead! Hehehehe"
>Tirek at 31780 steps
"Perry the platypus?!"

All of these are hilarious, but still reasonably un-robotic sounding compared to other TTS stuff. The Rainbow Dash ones towards the end hold a lot of promise. Were all of these the first samples, or were they the best/funniest attempts of the various versions?
>>
>>41471983
These 5 lines were the only ones the training process randomly selected to preview; the steps were just close to 10k intervals I selected to be concise (I have more)
>>
>>41472167
Sorry, I misunderstood. I think inference is deterministic (for exactly the same text prompt and description prompt) so this would qualify as the "first" output in your thinking (i.e. I had no alternatives from each step to cherrypick from, although I could cherrypick the steps to use)
>>
>>41470894
Bump
>>
>>41472170
Not as fond of strongly deterministic models, as they greatly restrict the ability to get various deliveries/takes for a particular line. I prefer versatility over consistency for that reason.

Is there some function in that model to actively vary the output?
>>
>>41474091
>>41471785
>>
>>
Up.
>>
Welp, time to get back crunching ideas for ai mares.
>>
>>41475839
>>
>>41474091
>>41474338
Update: Apparently the decoder can use either greedy or sampling decoding (although the code doesn't seem to work with greedy decoding on my machine despite trying many transformers/tokenizers versions), so it can be nondeterministic.
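
For reference, sampling-based generation would look roughly like this, assuming the standard parler_tts API and that generate() passes through the usual HF sampling kwargs (shown on the base checkpoint; the finetune uses the modified tokenizers instead):
import soundfile as sf
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

repo = "parler-tts/parler_tts_mini_v0.1"
model = ParlerTTSForConditionalGeneration.from_pretrained(repo)
tokenizer = AutoTokenizer.from_pretrained(repo)

desc = tokenizer("A female speaker with a high-pitched, energetic voice.",
                 return_tensors="pt")
prompt = tokenizer("Come on, you can do this!", return_tensors="pt")

with torch.no_grad():
    audio = model.generate(input_ids=desc.input_ids,
                           prompt_input_ids=prompt.input_ids,
                           do_sample=True, temperature=1.0)  # sampling -> nondeterministic
sf.write("out.wav", audio.cpu().numpy().squeeze(), model.config.sampling_rate)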
>>
>>41364787
Paused training for now (47 epochs, ~80hrs). More comparisons, this time on actual inference. 3 models * 3 samples per model = 9 samples per line (for some).
https://pomf2.lain.la/f/ifavh3h8.mp4

Notes:
- Thanks to the tokenizer change, the model now at least has some ability to guess pronunciations of words that it hasn't seen before, which is neat (it also doesn't consistently guess pronunciations the same way either).
- Strong schizo energy, as usual.
- What's the right epochs? Not sure, feeling a toss-up between 33 and 47. 33 "feels like" it produces more consistently acceptable results but occasionally fucks up very bad. Might be the move to just upload a bunch in between those two and just let people decide.
- I don't know how much of the original information from the base model we're utilizing at this point, but it's still clearly limited by not having seen as diverse a range of words as the base model.
- ARPAbet works better than expected (Pinkie can say an entire sentence in ARPAbet tokens just fine) but not as well as one would like.
- It seems strangely difficult to get a consistent Tirek voice. Possibly fixable with prompting? But some people are working on zero-shot cloning/voice prompting: https://github.com/huggingface/parler-tts/issues/139 which may help a bit.
- Would moving to the larger model help? IDK.
>>
>>41477594
>>41471308
>>41471529
forgot to link previous posts.
>>
>>41476065
Any ideas planned?
>>
Just spent the past few hours struggling to find any vocal remover software or AI models that are effective at removing specifically SFX (as in foley and other sounds), not stems or anything music related. It's absolutely barren in that regard and so over-saturated with music-related separation tools. So far UVR's current models are not effective enough to remove all the clippa cloppa and shifty crunkly sounds in preparation for clean mare training.

Considering we've already got an enormous amount of separated pony SFX available, plus heaps more if we consider Sonniss GDC game audio bundles, how do you think we'd fare at creating an MDX-Net model from scratch (or finetuning) specifically for separating SFX from any audio we feed it? If it works well, we might be able to reprocess files to get even more clean and usable audio. Maybe even make models for ponies previously too noisy to consider training. Just not sure on how to go about it, as documentation on training doesn't seem easy to find.
>>
>>41477764
We have already done a pass with demucs model, does this work for your purposes?
https://huggingface.co/therealvul/demucs
>>
>>41477868
After creating a simple .yaml for it to be read by UVR5, "SFX_separation_v2_epoch_338.th" inferred perfectly fine, but unfortunately didn't do anything to separate SFX. Whereas the other one, "sfx_demucs_v2_checkpoint_epoch_338.th", threw an immediate key error named "'klass'" and couldn't continue.
>https://ponepaste.org/10379

This is the audio in question I'm testing with a baseline. I'm trying to clean Lotus's data up so she can be trained better to be at least comparable with what was achieved with Saffron and Redheart. As you can hear there's lots of clippa cloppa present in this one among others in her dataset.
>https://pomf2.lain.la/f/dwk4bvei.flac
>>
>>41477974
I see. I processed the audio you gave and got similar results; hoofsteps are generally difficult to remove.
>>
>>41477594
>>41477597
Colab.
https://colab.research.google.com/drive/1EaAlB5H3mHgFddozijZ698O6zJVG5zO7?usp=sharing
>>
>>41478132
Added temperature, and modified it to generate in batches which will be better for content creation.
Here's a little thing to demonstrate (took it from >>41478369):
Wet: https://files.catbox.moe/ia94ci.mp3
Dry: https://files.catbox.moe/iscsmt.mp3
Took a lot of rerolls, some ARPAbet, and respelling fuckery, but at least it's theoretically possible?

- It doesn't like long sentences (wonder if RoPE would fix that).
- I wonder if incorporating the voice steering PR from #139 might help in getting more consistent voices.
>>
https://github.com/kyutai-labs/moshi
>>
>>41430662
>>41477764
no code yet still. might try emailing them
https://xiaoyubie1994.github.io
https://liuxubo717.github.io
>>
>>41477594
Nice.

The schizo energy sounds like it's from the model not knowing what volume and rhythm to use. I can probably rig something up to add that to the training data, though that would require a change to the input layer of the model architecture.
- Volume information should be easy to generate with parselmouth.
- Rhythm information would be more complicated. That will involve using a forced aligner to figure out the duration of each grapheme/phoneme. I know how I would do this for phonemes, but not for graphemes. One option would be to just drop this information when using graphemes.
- We'd have to discretize this information since it would be infeasible to provide floating point volume & rhythm information at inference time.
- I don't know what would be the best way to provide this information to the model. It might be enough to just sum f(volume) + f(rhythm) to embedding(token) before running the result through the rest of the model. One alternative would be to add volume and rhythm information as separate tokens with position encodings f(t+0.33) and f(t+0.67).

If you think modifying the model input layer would be feasible, I can look into generating the data for this after checking >>41458217 >>41471308. I'm mostly free for the next couple days. I'm not sure how much I'll be working on this during the anni stream, though I'll have plenty of time after that.
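
For the volume part, a quick parselmouth sketch of what I have in mind, bucketed into the coarse levels mentioned above (the dB thresholds are arbitrary placeholders, not tuned values):
import numpy as np
import parselmouth

def volume_buckets(wav_path, low_db=45.0, high_db=65.0):
    snd = parselmouth.Sound(wav_path)
    intensity = snd.to_intensity()        # intensity contour in dB
    values = intensity.values.flatten()
    buckets = np.full(values.shape, "medium", dtype=object)
    buckets[values < low_db] = "low"
    buckets[values > high_db] = "high"
    return list(zip(intensity.xs(), buckets))  # (time, bucket) pairs

# These frame-level buckets would still need to be aligned to tokens (e.g., via
# a forced aligner for phonemes) before they could be fed to the model.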
>>
Page 10 bump.
>>
>>41479077
>The schizo energy sounds like it's from the model not knowing what volume and rhythm to use
Do you think the parameter count matters here? This is 880M, their next model size up is 2.3B. Just from testing off their space the prosody seems more natural on the 2.3B one, and the audio quality is higher.
Also, wouldn't modeling volume and rhythm explicitly require the user to generate that information? Unless the plan is to make this into another voice conversion model similar to TalkNet.
>I know how I would do this for phonemes, but not for graphemes. One option would be to just drop this information when using graphemes.
Might be related -- automated word level timestamping:
https://github.com/m-bain/whisperX
>It might be enough to just sum f(volume) + f(rhythm) to embedding(token) before running the result through the rest of the model. One alternative would be to add volume and rhythm information as separate tokens with position encodings f(t+0.33) and f(t+0.67).
I don't think anybody has actually tried experimenting with the inputs but people talked about conditioning on background music for singing voice synthesis:
https://github.com/huggingface/parler-tts/issues/8
Someone did finetune the mini model to produce singing vocals, but only conditioned on the lyrics.
https://wandb.ai/akhiltolani/parler-speech/runs/mv9dd4hz/workspace?nw=nwuserakhiltolani

Another (extremely) stretchy idea--since our dataset comes from the show, I wonder if there's some way we could get an AI (multimodal language model) to look over MLP transcripts/scripts/wiki and generate natural language descriptions around the emotional context of the characters when they're saying those lines, rather than just automated feature-based methods/hand-annotation.
>>
>>41478593
Got vocal steering/zero shot to "work", have a separate colab for it.
https://colab.research.google.com/drive/1wlBh8FDGG-Fmf-r3dWZDanBiabUviYX7?usp=sharing

Does it work? Sorta kinda.
Demo: https://pomf2.lain.la/f/sxjgd8kr.mp4

It works better in some cases than others--it performs better in cases where it'd have a decent amount of data already, and you just want to "lock in" a timbre. It can't do powered-up Tirek shouting at all (not shown in demo because Colab timed out before I remembered to save it).

The way I understand it works is it's basically "prefilling" the decoder. Unfortunately it also seems that it suffers greatly from the old transformers problem of not being able to deal with things further out in the context; intelligibility is severely compromised. Also likely has to do with the fact that we split things into shorter audio clips than they used to train the base model.
>>
https://files.catbox.moe/9kzx6y.mp3
>>
>>41480381
https://huggingface.co/therealvul/parler-tts-pony-mini-g2p-v1-e52/tree/main
Epoch 52. Is there a point in continuing training? No idea, but if I do I'm only going to do it overnight now because frankly I don't want to be stuck listening to fans 24/7.
>>
>>41480381
berry good output, it struggles a little bit with pronunciation but this is something that could be ironed out in the future.
>>41480464
Sorry for not following the threads, but will the control for this kind of model be text only, or will there be audio reference (and other) inputs as well?
>>
>>41480699
ParlerTTS's main selling point is that in addition to text it uses an audio description prompt (see >>41477594 https://pomf2.lain.la/f/ifavh3h8.mp4). Also it is possible to prefill audio to "steer" the voice, but it doesn't work that well (see >>41480135 https://pomf2.lain.la/f/sxjgd8kr.mp4).
>>
>>41479735
Parameter count should matter a lot whenever the model needs to infer information that isn't present in the input (incl. volume and rhythm). For text generation, I think a typical rule-of-thumb is that a 10x increase in parameter count leads to a 50% reduction in loss. I think a 2.5x increase in parameter count (890M -> 2.3B) would lead to a ~25% reduction in loss. I expect that would help a lot, but I don't expect it to fully get rid of the schizo energy.
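The arithmetic behind that estimate, just applying the stated rule of thumb:
import math
# If 10x params -> 50% of the loss, loss scales roughly as N**log10(0.5),
# so a 2.5x jump in parameters gives:
print(f"{1 - 0.5 ** math.log10(2.5):.0%} loss reduction")  # ~24%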
>https://github.com/m-bain/whisperX
The problem is that there's no way to convert word-level timestamps to grapheme-level timestamps.
>Also, wouldn't modeling volume and rhythm explicitly require the user to generate that information? Unless the plan is to make this into another voice conversion model similar to TalkNet.
It could be done similarly to how you're adding phoneme tokens. It's optional information that the user can add, and it's discretized into things the user can type. We could discretize the volume & rhythm information in a way that's easier for people to input without needing speech input. For example, just having high/medium/low tokens for phonemes at the beginning of a sequence and after punctuation, and having higher-, same-, and lower-than-previous tokens for everything else.
I could also add pitch & voicing information using parselmouth. Pitch information would follow the same high-medium-low scheme, and voicing information would be either voiced (vibrating vocal chords) or unvoiced (breath only).

Before that: ParlerTTS on its own without fine-tuning seems to be much better at pronunciation and avoiding schizo energy, which suggests that the problem is with how fine-tuning is done. Do you have samples from a training run that uses the original tokens and where the text encoder is frozen?
>>
>>41480858
>ParlerTTS on its own without fine-tuning seems to be much better at pronunciation and avoiding schizo energy, which suggests that the problem is with how fine-tuning is done.
https://huggingface.co/datasets/parler-tts/mls_eng
https://huggingface.co/datasets/parler-tts/libritts_r_filtered
The base model is supposed to be trained on 45k hrs of data which is 3 orders of magnitude larger than ours (Not sure we can do much about this, although perhaps reintroducing a portion of their original dataset back into our finetuning might be able to mitigate some of the pronunciation issues).
Most of it also appears to be from audiobook narration, which I would argue has much less complex relationships between speech prosody and text than dialogue from MLP.
Pronunciation errors also seem to increase with input length, which might have to do with the dataset and also positional embeddings (they just use sinusoidal but they added an option to use RoPE). The datasets they say they trained on have clips that get as long as 20-30 seconds, and the mls_eng dataset seems to have had clip length balancing.

>Do you have samples from a training run that uses the original tokens and where the text encoder is frozen?
The first training run used the original tokens, and the text encoder so far has been frozen for both:
>>41413292
The problems with this motivated the move to reduced tokens and randomized g2p. Although I realize that this post only had 16 epochs and is therefore not that comparable to the current performance.

For a slightly more fair comparison, here's samples from the 22nd epoch of g2p which is the closest equivalent checkpoint I have to that:
https://pomf2.lain.la/f/ko457pvs.mp3
https://pomf2.lain.la/f/kwkglfhw.mp3
https://pomf2.lain.la/f/kycgvtda.mp3
https://pomf2.lain.la/f/87m1yy68.mp3
https://pomf2.lain.la/f/a9vop6q6.mp3
https://pomf2.lain.la/f/uq4gatp.mp3
Even though there are some obvious schizo moments it's able to "guess" the pronunciation of words it hasn't seen better than training on the original tokens.
>>
>>41481274
Another thought - the original tokenizer (from Flan-T5) might encourage better semantic representations of the text prompt than the current tokenizer, so we might have traded off some semantic information in the text input in exchange for pronunciation ability.
>>
>>
File: 6308713.png (177 KB, 407x570)
177 KB
177 KB PNG
>>
>page 10
>>
>>41483612
>page 10-1
>>
>>41481274
>>41481294
Summary/more thinking on things to consider changing aside from parameter count:
>Optionally condition on volume/rhythm/pitch/voicing information thru extra tokens
(Has there been any work on creating features on "vocal intensity" or similar characteristics?)
>Reintroduce some of their original dataset to improve pronunciation and performance at longer lengths
Right now I can't find their actual speaker name-annotated dataset, and I am worried that the imbalance in clip lengths wrt speakers might encourage the model to start talking more like LibriSpeech at longer lengths.
>RoPE instead of sinusoidal to improve performance at longer lengths?
Haven't seen anybody actually use the option in ParlerTTS yet (so no clue if it's functional), but it works well enough for LLMs.
>Learning rate scheduling
Their base models are trained on 3-9 epochs of a 45k hr dataset, whereas we're doing on the order of tens of epochs. Would cosine annealing make more sense?
>Reintroducing any semantic information on the text prompt better captured by the original Flan-T5 text encoder to improve prosody?
Not sure if this is a good idea or how to go about this.
>Concatenate the text prompt directly into the description prompt
>Concatenate text prompt IDs from the original tokenizer into the description prompt IDs (same tokenizer), possibly with a separator token (use one of the special token IDs)
>Run the text prompt with original token IDs through the text encoder, then concatenate hidden state onto the description
>The above, but adding a learned bias vector or applying some other transformation to the portion representing the text prompt
>Something fancy with cross attention?

It looks like the repo author is still training ParlerTTS models, but AFAIK there hasn't been any public announcement of anything.
https://wandb.ai/ylacombe
>>
>nine
>>
>>41480381
Thanks, Glimmer. Very cool.
>>
>>41481274
I'm confused. Are you fine-tuning the original ParlerTTS model or training from scratch / fine-tuning a model that you trained from scratch? If you're training on top of the ParlerTTS model, you should be able to implicitly take advantage of the 45k hours used to train the original model.
- The description text encoder (`text_encoder`) should always be frozen.
- You might also want to freeze the transcript text embeddings (`embed_prompts`), but only when training on top of ParlerTTS's original models and when using only graphemes. The transcript text embeddings should contain a lot of information about pronunciation, volume, rhythm, and pitch, so even if you're training "from scratch", it would be worthwhile to train on top of ParlerTTS's transcript text embeddings.
- If you're reusing `embed_prompts` with only graphemes, I think the only part that needs to be trained is the `decoder`.
- If you're adding phonemes, then you would need a new `nn.Embedding` for phoneme embeddings, which needs to also be trained.
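
A minimal freezing sketch, assuming the model exposes components under the names above (check the actual module paths on whichever checkpoint you're training):
from parler_tts import ParlerTTSForConditionalGeneration

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1")

for p in model.text_encoder.parameters():   # description encoder: always frozen
    p.requires_grad = False
for p in model.embed_prompts.parameters():  # transcript embeddings: freeze if grapheme-only
    p.requires_grad = False
# Only the decoder (plus any new phoneme nn.Embedding) stays trainable.
trainable = [n for n, p in model.named_parameters() if p.requires_grad]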

>>41484716
>(Has there been any work on creating features on "vocal intensity" or similar characteristics?)
I haven't been keeping up with speech generation research, but I think some vocoders used energy features (frequency * amplitude^2). I haven't seen anything that uses intensity features as part of text input.
>Right now I can't find their actual speaker name-annotated dataset, and I am worried that the imbalance in clip lengths wrt speakers might encourage the model to start talking more like LibreSpeech at longer lengths.
>Concatenate the text prompt directly into the description prompt
>Concatenate text prompt IDs from the original tokenizer into the description prompt IDs (same tokenizer), possibly with a separator token (use one of the special token IDs)
These won't work well since the two require different embeddings. The text prompt IDs use the `embed_prompts` embeddings, and the description uses the `text_encoder` embeddings.
>Something fancy with cross attention?
ParlerTTS already uses cross attention to bias the transcript processing with the description text. The `decoder` is a cross-attention model that conditions on `text_encoder(input_ids)`, where `input_ids` is the tokenized description.
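To illustrate the conditioning pattern (this is not ParlerTTS's actual layer code, just generic cross-attention with made-up shapes):
```python
# Decoder hidden states (queries) attend over the description encoder
# output (keys/values).
import torch
import torch.nn as nn

d_model, n_heads = 1024, 16
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

batch, audio_steps, desc_tokens = 2, 50, 32
decoder_hidden = torch.randn(batch, audio_steps, d_model)      # audio-token stream
description_hidden = torch.randn(batch, desc_tokens, d_model)  # ~ text_encoder(input_ids)

conditioned, attn_weights = cross_attn(
    query=decoder_hidden,
    key=description_hidden,
    value=description_hidden,
)
print(conditioned.shape)                        # torch.Size([2, 50, 1024])
```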
>>
>>41485564
>>41484716
I forgot to respond to:
>Right now I can't find their actual speaker name-annotated dataset, and I am worried that the imbalance in clip lengths across speakers might encourage the model to start talking more like LibriSpeech at longer lengths.
ParlerTTS depends entirely on the description and `text_encoder` to infer the speaker name. It seems to work well enough given your samples, so there doesn't seem to be a need to explicitly label the speaker. The imbalance might be an issue, though I think all of the pony data contains only short clips. Using RoPE should mitigate that at least to some extent. I'm not sure if you can switch from sinusoidal encodings to RoPE encodings after the `decoder` has already been trained. If you're training the `decoder` from scratch (note: I'd still recommend reusing the original `embed_prompts` in this case), it's worth trying RoPE before sinusoidal encodings. If you're fine-tuning the original `decoder`, it might be worth an overnight test to see if the `decoder` is adapting to the new encodings.
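For anyone unfamiliar, RoPE boils down to rotating the query/key vectors by position-dependent angles so that attention scores depend on relative offsets rather than absolute positions. A generic sketch, not the exact ParlerTTS option:
```python
# Rotate pairs of dimensions by a position-dependent angle; attention
# scores between rotated q/k then depend on relative position.
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (batch, seq_len, dim) with dim even
    _, seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()              # (seq_len, half)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(1, 8, 64)
k = torch.randn(1, 8, 64)
scores = rope(q) @ rope(k).transpose(-1, -2)
print(scores.shape)                                    # torch.Size([1, 8, 8])
```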
>>
>>41485564
>I'm confused. Are you fine-tuning the original ParlerTTS model or training from scratch / fine-tuning a model that you trained from scratch? If you're training on top of the ParlerTTS model, you should be able to implicitly take advantage of the 45k hours used to train the original model.
I'm finetuning the original model. My point is that the 45k hours probably contain more diverse token sequences, and we're losing some of that information, which could account for some of the difference in pronunciation ability.
>You might also want to freeze the transcript text embeddings (`embed_prompts`), but only when training on top of ParlerTTS's original models and when using only graphemes
Maybe we could try training with just graphemes but using the reduced-size English tokenizer and frozen embeddings? Although I could foresee it being harder for the model to adapt to words that rarely appear in its original dataset but appear commonly in ours (e.g. 'Equestria').
>ParlerTTS depends entirely on the description and `text_encoder` to infer the speaker name.
What I mean is that in the README they say that they trained the model to produce consistent voices from 34 labeled speakers ("Jon, Lea, Gary...") https://github.com/huggingface/parler-tts but all of the datasets I've found so far only refer to the speaker generically as "A male speaker", "a female speaker", "a woman", etc.
If I can't find a version of the dataset where the descriptions contain the actual names, and I add some fraction of their data back into the training as-is, I could foresee the model collapsing into two modes based on the description: it would only learn to generate random LibriSpeech voices when the description just says "a male speaker", "a female speaker", etc., and only learn to generate pony voices when the description names a specific speaker (since all of the pony descriptions use explicit names).
>These won't work well since the two require different embeddings. The text prompt IDs use the `embed_prompts` embeddings, and the description uses the `text_encoder` embeddings.
Why not exactly? `text_encoder` isn't just an embedding, it's the FLAN-T5 text encoder (which seems to have been frozen in the base model training, so I'm pretty sure it's -exactly- the FLAN-T5 encoder), and the same FLAN-T5 tokenizer was used for both the prompt and the description in the base model. Even if it were an embedding, concatenating the text prompt into the description string (not IDs) would mean they would both get tokenized into the same token ID space, and concatenating the text prompt IDs using the same tokenizer as the description + a separator token would also make sure they're in the same space.
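To make the two options concrete (the separator choice is arbitrary, I just grabbed one of T5's reserved special tokens):
```python
# Both the description and the transcript go through the same FLAN-T5
# tokenizer, so the IDs live in the same vocabulary either way.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/flan-t5-large")

description = "Twilight Sparkle speaks quickly with high energy."
transcript = "The magic of friendship never runs out."

# Option 1: concatenate at the string level, tokenize once.
combined_ids = tok(f"{description} Transcript: {transcript}").input_ids

# Option 2: tokenize separately, join the ID sequences with a separator token.
sep_id = tok.convert_tokens_to_ids("<extra_id_0>")     # arbitrary reserved token
desc_ids = tok(description, add_special_tokens=False).input_ids
text_ids = tok(transcript, add_special_tokens=False).input_ids
combined_ids_2 = desc_ids + [sep_id] + text_ids + [tok.eos_token_id]

print(len(combined_ids), len(combined_ids_2))
```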
Also,
>Would cosine annealing make more sense?

>If you're fine-tuning the original `decoder`, it might be worth an overnight test to see if the `decoder` is adapting to the new encodings.
I think I will try this.
>>
>>41485733
>I'm finetuning the original model. My point is that the 45k hours probably contain more diverse token sequences, and we're losing some of that information, which could account for some of the difference in pronunciation ability.
Ah, got it.
I'd guess that all of the prosody information comes from the cross-attention weights in the `decoder`. If we're only trying to add pony prosody into the model, it might work to: (1) freeze all weights other than the `decoder` cross-attention weights, and (2) optionally add new cross-attention weights to condition the `decoder` on more prosody information when available.
>the same FLAN-T5 tokenizer was used for both the prompt and the description in the base model
Are you sure? It looks like the prompt is converted using a separate embedding.
Converting prompts to embeddings: https://github.com/effusiveperiscope/parler-tts/blob/g2p/parler_tts/modeling_parler_tts.py#L2649
embed_prompts definition: https://github.com/effusiveperiscope/parler-tts/blob/g2p/parler_tts/modeling_parler_tts.py#L2286
>Even if it were an embedding, concatenating the text prompt into the description string (not IDs) would mean they would both get tokenized into the same token ID space, and concatenating the text prompt IDs using the same tokenizer as the description + a separator token would also make sure they're in the same space.
Sorry, I confused "text prompt" and "description". Yeah, that makes sense. I would guess that adding something like "{character} says: {prompt}" in the description would help with emotional expression, but I don't expect it to help with the schizo energy.
>>
>>41486266
>If we're only trying to add pony prosody into the model, it might work to: (1) freeze all weights other than the `decoder` cross-attention weights, and (2) optionally add new cross-attention weights to condition the `decoder` on more prosody information when available.
We're trying to add pony voices into the model though? Unless you mean trying to refine one of the existing finetunes

>Are you sure? It looks like the prompt is converted using a separate embedding.
The prompt is converted using a separate embedding, but the same tokenizer was used to produce the input token IDs for both of them (according to their README):
https://github.com/effusiveperiscope/parler-tts/blob/g2p/training/README.md#3-training
>--description_tokenizer_name "google/flan-t5-large" \
>--prompt_tokenizer_name "google/flan-t5-large" \

>I would guess that adding something like "{character} says: {prompt}" in the description would help with emotional expression, but I don't expect it to help with the schizo energy.
I think we need to more clearly define "schizo energy" or be more specific. I think this solution might help with the naturalness of the prosody. If by "schizo energy" we mean the model deciding to generate voices that are incoherent or don't sound much like the target character, I hope that if we can get a longer effective context (through RoPE or other means) the vocal steering solution in >>41480135 might become more viable.
>>
Turns out the Elements of Justice team hit the same limitations with Adobe Animate as the PPP did a few years ago. That is, Animate was so slow at rendering their projects (up to 1.5x slower than real time?) that they also decided to write an XFL renderer:
>https://github.com/conncomg123/CXFL
Parts of this are directly ported from the PPP code, such as edges (https://github.com/conncomg123/CXFL/blob/CSConversion/Scripts/EdgeUtils.cs) and radial gradients (https://github.com/conncomg123/CXFL/commit/ef45be2).
But I hear they're now working on features not in the PPP code (stroke masking?), so it might be worth looking at when it's finished.
>>
>>41486639
berry good, it would be interesting to see from what different angles they approach the problems of rendering animation.
>>
Does anyone still give a shit about the redub series?
>>
>>41485585
>>41485733
>it might be worth an overnight test to see if the `decoder` is adapting to the new encodings.
Here's 6 epochs of a new finetune over parler-tts-v1-mini. Only modification from previous recipe is that it uses RoPE instead of sinusoidal.
https://pomf2.lain.la/f/gkt5i3q.mp3
https://pomf2.lain.la/f/732spvbu.mp3
https://pomf2.lain.la/f/c195kcyn.mp3
https://pomf2.lain.la/f/1apfoul9.mp3
https://pomf2.lain.la/f/ufsom344.mp3

Doesn't seem to have completely broken the model?
>>
>>41487623
>pomf2
The fuck is going on with that site? It's just loading and loading for me. It worked fine before. Searching for info about it on Google says that it has been blocked or compromised or some shit? Someone posted CP again like they did with smutty.horse? That site still isn't allowing uploads either.
>>
>>41487408
I do, I may not have time for the usual contentfaging but I always love making some colab with fellow ppp niggas.
>>
>>41487623
>>41487639
It loads fine for me. Is there an alternate file hosting site that you can access? I've experienced similar problems with catbox before so I'm not sure if there's a reliable one.
>>
>>41487729
Catbox and Uguu work fine for me. Uguu only allows 64MB though, while Catbox allows 200MB. Pomf allowed even bigger files, so it's a shame it no longer works for me.
>>
>>41487752
>Uguu
It's also temporary?
Sucks that bad actors ended up getting so many sharing sites shut down...
>>
>>41486312
>I think we need to more clearly define "schizo energy" or be more specific.
I meant the unnatural (generically, not character-specific) high variance in pitch, volume, and rhythm. Fine-tuning seems to make the model forget whatever default bias it has for those things.
>I hope that if we can get a longer effective context (through RoPE or other means) the vocal steering solution in >>41480135 might become more viable.
I suspect there are multiple things going wrong here. The context length would be one issue, but I think the decoder model also isn't made for this kind of inference. Based on the paper, it's using BERT (encoder transformer model) for the "decoder", which struggles to learn sequence patterns. It would be better to add the reference in through cross-attention weights. To train those weights, you could use a small slice of clip X as the reference audio when training on clip X.
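A sketch of what I mean by the self-reference setup; durations here are arbitrary:
```python
# Carve a short random slice out of each training clip to use as
# reference-audio conditioning; the full clip stays the target.
import numpy as np

def split_reference(wav: np.ndarray, sr: int, ref_seconds: float = 2.0):
    ref_len = int(ref_seconds * sr)
    if len(wav) <= ref_len:
        return None                                    # clip too short
    start = np.random.randint(0, len(wav) - ref_len)
    reference = wav[start:start + ref_len]
    return reference, wav                              # (conditioning, target)

sr = 44100
clip = np.random.randn(sr * 6).astype(np.float32)      # stand-in 6 s clip
reference, target = split_reference(clip, sr)
print(reference.shape, target.shape)
```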
>We're trying to add pony voices into the model though?
That's what I mean. Pony voices are voices with pony prosody. The way ParlerTTS is designed, everything that's specific to a voice and independent of the speech content (phonemes) should be captured by the cross-attention weights. If that's correct, then modifying just the cross-attention weights should be a more targeted way to add pony voices to the model without messing up the parts of speech generation that can be inferred from just the phonemes. Technically, "what can be inferred from just the phonemes" should only be pronunciation, but it probably includes a "default" bias on pitch, volume, and rhythm shifts, which I think is what you'd want to retain to avoid the schizo energy.
>The prompt is converted using a separate embedding, but the same tokenizer was used to produce the input token IDs for both of them
That makes sense. The embedding weights are what we'd want to selectively freeze. Freeze embeddings for tokens in the original vocabulary, train embeddings for new (phoneme) tokens.
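One way to do the selective freezing is a gradient hook that zeroes out the rows for the original vocabulary, so only the new phoneme rows get updated. Vocabulary sizes here are made up.
```python
# Zero the gradient for rows belonging to the original vocabulary, so an
# optimizer step only updates the newly added phoneme rows.
import torch
import torch.nn as nn

orig_vocab = 32128                         # e.g. the original tokenizer's vocab size
new_tokens = 100                           # newly added phoneme tokens
dim = 1024

embed_prompts = nn.Embedding(orig_vocab + new_tokens, dim)

def zero_original_rows(grad: torch.Tensor) -> torch.Tensor:
    mask = torch.zeros_like(grad)
    mask[orig_vocab:] = 1.0                # only the new rows keep their gradient
    return grad * mask

embed_prompts.weight.register_hook(zero_original_rows)

# Quick check: only the new rows change after a step.
opt = torch.optim.SGD(embed_prompts.parameters(), lr=0.1)
ids = torch.tensor([[5, orig_vocab + 3]])
embed_prompts(ids).sum().backward()
opt.step()
```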
>>41487623
Very nice. I didn't know that was possible.

>>41486639
I wonder why they don't just use the xflsvg repo directly. Python isn't the limiting factor in rendering, and the repo uses optimized libraries to handle the expensive parts. It also already supports parallel rendering within a process & batch execution across processes and servers.
>>
>>41488480
>Based on the paper, it's using BERT (encoder transformer model)
??
https://arxiv.org/pdf/2402.01912 says "Decoder-only Transformer" in the diagram.
Also their decoder class seems to have been adapted from MusicGen (it has lots of "MusicGen" left over in the code):
https://arxiv.org/pdf/2306.05284 says "musicgen consists in an autoregressive transformer-based decoder"
>then modifying just the cross-attention weights should be a more targeted way to add pony voices to the model without messing up the parts of speech generation that can be inferred from just the phonemes.
>The embedding weights are what we'd want to selectively freeze. Freeze embeddings for tokens in the original vocabulary, train embeddings for new (phoneme) tokens.
OK, I see. So basically just freeze everything except the encoder_attn layers in ParlerTTSDecoderLayer and the prompt embeddings corresponding to new tokens?
>Very nice. I didn't know that was possible.
It makes me wonder if the "finetuning" learning rate is too high. (It's 0.0001 by default.)
>>
>>41489042
>BERT
Yeah, I'm misremembering that pretty badly. I'm doing that a lot. The github repo and the HF repo both mention a causal LM. Thanks for checking.
>OK, I see. So basically just freeze everything except the encoder_attn layers in ParlerTTSDecoderLayer and the prompt embeddings corresponding to new tokens?
That sounds right.
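Concretely, something like this name filter should do it. The `encoder_attn` substring follows the layer naming we discussed; the import path and checkpoint name are my guesses, so double-check against `model.named_parameters()` on your branch.
```python
# Train only the decoder's cross-attention (encoder_attn) weights.
from parler_tts import ParlerTTSForConditionalGeneration  # assumed import path

model = ParlerTTSForConditionalGeneration.from_pretrained(
    "parler-tts/parler-tts-mini-v1"                        # assumed checkpoint name
)

for name, param in model.named_parameters():
    # encoder_attn = cross-attention over the description encoder states
    param.requires_grad = "encoder_attn" in name

# The new phoneme-embedding rows would additionally be unfrozen via a
# gradient mask on embed_prompts (see the earlier embedding sketch).
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(f"{len(trainable)} trainable tensors, e.g. {trainable[:3]}")
```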
>>
>>41489042
By the way, we might be able to shrink the DAC codebook to get higher quality audio. I'm not sure how to do that yet.
>>
>>41458217
It should be fixed now. Same github & hf repos.
https://huggingface.co/synthbot/parlertts_tokenizer_clean
https://github.com/synthbot-anon/sample-code/blob/main/src/cleanup_unigram_tokenizer.py
>>
>>41490077
Thanks
>>
>>41488480
>I wonder why they don't just use the xflsvg repo directly.
Good question, honestly. Forgot to ask, maybe they just don't like Python? Or maybe they like reimplementing things, since they also recreated JSFL.
>>
https://files.catbox.moe/p5myi0.mp3
>>
>>41491956
Oh noes the sizefags are here but HOW big are we talking about?
>>
>>41492201
>spoiler 2
As long as it's not hyper, it's not too bad.
>>
>>41491956
Is that a reference to something?
>>
>bumpo
>>
https://files.catbox.moe/djtz24.mp3
>>
>>41495126
I see this potentially getting used a lot in the near future.
>>
File: large.png (1.4 MB, 1280x1024)
1.4 MB
1.4 MB PNG
>>41487408
We've seen from the last redub episode that some anons here are uncomfortable with using speech-to-speech because they aren't good actors and/or they don't have American accents. TalkNet can override the input speaker's accent with the character's, but it's lower quality. RVC and SVC are higher quality, but they retain the speaker's accent. We probably shouldn't start another redub until we have a non-deterministic text-to-speech system that works reasonably well as a substitute for 15. Maybe ParlerTTS can be that solution.
>>
>>41496236
As one of the anons less than comfortable with the speech-to-speech options, I agree that a suitable TTS option should eagerly be sought. I also miss the abundance of shitposts and other funnies that were present back in the 15 and TalkNet era. Inspiration is fleeting, as is time. The easier we can make the tech to use, the more readily anyone can act on that inspiration.
>>
Maremare
>>
>>41496236
Redub or not, I definitely agree that more TTS options would be good to have right now.
>>
File: Crowd.png (917 KB, 1080x1080)
917 KB
917 KB PNG
>>41496236
>>41496639
You know, on the note of pony TTS, someone coded a really nice sounding local TTS for /CHAG/.
https://desuarchive.org/mlp/thread/41267361/#41286152
https://www.youtube.com/live/mSINcy1_6Ms?si=DO1qRj9WhpkixaBp&t=775
I don't think it has a standard UI, instead just pulling text from certain programs, but I'm sure that could be rectified, and it sounds pretty good, especially for running on CPU.
>>
Sup Hydrus, the HaySay website seems to be down (not sure if it's just me or something on the server side).
>>
Do we have a plan for the next OP since they limited the number of quoted posts in a message to 9?
Use desu?
>>
>>41497607
What?
>>
If vocoders are still a problem, did anyone try looking into audio codec research? For example, into the loss compensation in the Opus codec, or into the perceptual model from Opus.
>>
I hope I don't accidentally spam.

Is there anynon interested in natural language processing and potentially better tagging systems for the boorus of the future? Why put NLP and tagging together? Because I think tagging systems are probably close to the field of NLP.

Specifically, I want this system to capture relations within a single image. For example, assume there is a selfponidox image of human Rainbow Dash in a hat and pony Rainbow Dash in a skirt. I want to search for all images where there are both ponies and humans and the pony wears a skirt. With traditional tagging systems I would get too many false positives of humans in skirts. It also has the benefit of going beyond the implied tags of traditional tagging systems, since it can potentially deduce tags: for example, if an image shows two instances of the same character, then it might be selfponidox, a transformation sequence, or whatever tag the EqG1 cover deserves.

This might also have potential for fanfic processing. It kinda started as an idea for better tags for fanfics.

I did some research and found that information extraction is the field this belongs to. There is the Resource Description Framework, which doesn't seem to meet all the requirements, but it's close enough, and I think it can be hacked to support local and multiple instances of objects.
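To make it concrete, here's a toy, library-free sketch of what instance-level triples and the "pony wears a skirt" query could look like. All names and predicates are made up.
```python
# (instance, predicate, value) triples, scoped per image.
triples = {
    "image_1": [
        ("inst_a", "is_a", "human"),
        ("inst_a", "depicts", "rainbow dash"),
        ("inst_a", "wears", "hat"),
        ("inst_b", "is_a", "pony"),
        ("inst_b", "depicts", "rainbow dash"),
        ("inst_b", "wears", "skirt"),
    ],
    "image_2": [
        ("inst_a", "is_a", "human"),
        ("inst_a", "wears", "skirt"),
        ("inst_b", "is_a", "pony"),
    ],
}

def matches(image_triples):
    """Image contains a human, and some pony instance wears a skirt."""
    has_human = any(p == "is_a" and o == "human" for _, p, o in image_triples)
    ponies = {s for s, p, o in image_triples if p == "is_a" and o == "pony"}
    pony_in_skirt = any(
        s in ponies and p == "wears" and o == "skirt" for s, p, o in image_triples
    )
    return has_human and pony_in_skirt

print([img for img, ts in triples.items() if matches(ts)])  # ['image_1']
```
A flat tag search for human + pony + skirt would also return image_2, which is exactly the false positive this avoids.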
>>
>>41497600
Everything is working for me, except for synthapp.haysay.ai. I'll restart that one in a bit. Are you getting an error or does the site just not load at all for you?
>>
>>41497933
Annoyingly it "just werks" right now, so I can't show you the error message I was getting beforehand. But thank you for checking on your end.
>>
(waiting for next thread)
>>
rip OP, legends say he is still stuck in the linecon
>>
NEW THREAD
>>41498541


