/mlp/ - Pony

File: New OP.png (1.53 MB, 2119x1500)
Welcome to the Pony Voice Preservation Project!
youtu.be/730zGRwbQuE

The Pony Preservation Project is a collaborative effort by /mlp/ to build and curate pony datasets for as many applications in AI as possible.

Technology has progressed such that a trained neural network can generate convincing voice clips, drawings and text for any person or character using existing audio recordings, artwork and fanfics as a reference. As you can surely imagine, AI pony voices, drawings and text have endless applications for pony content creation.

AI is incredibly versatile, basically anything that can be boiled down to a simple dataset can be used for training to create more of it. AI-generated images, fanfics, wAIfu chatbots and even animation are possible, and are being worked on here.

Any anon is free to join, and there are many active tasks that would suit any level of technical expertise. If you’re interested in helping out, take a look at the quick start guide linked below and ask in the thread for any further detail you need.

EQG and G5 are not welcome.

>Quick start guide:
docs.google.com/document/d/1PDkSrKKiHzzpUTKzBldZeKngvjeBUjyTtGCOv2GWwa0/edit
Introduction to the PPP, links to text-to-speech tools, and how (You) can help with active tasks.

>The main Doc:
docs.google.com/document/d/1y1pfS0LCrwbbvxdn3ZksH25BKaf0LaO13uYppxIQnac/edit
An in-depth repository of tutorials, resources and archives.

>Active tasks:
Research into animation AI
Research into pony image generation

>Latest developments:
GDrive clone of Master File now available >>37159549
SortAnon releases script to run TalkNet on Windows >>37299594
TalkNet training script >>37374942
GPT-J downloadable model >>37646318
FiMmicroSoL model >>38027533
Delta GPT-J notebook + tutorial >>38018428
New FiMfic GPT model >>38308297 >>38347556 >>38301248
FimFic dataset release >>38391839
Offline GPT-PNY >>38821349
FiMfic dataset >>38934474
SD weights >>38959367
SD low vram >>38959447
Huggingface SD: >>38979677
Colab SD >>38981735
NSFW Pony Model >>39114433
New DeltaVox >>39678806
so-vits-svt 4.0 >>39683876
so-vits-svt tutorial >>39692758
Hay Say >>39920556
Haysay on the web! >>40391443
SFX separator >>40786997 >>40790270
Synthbot updates GDrive >>41019588
Private "MareLoid" project >>40925332 >>40928583 >>40932952
VoiceCraft >>40938470 >>40953388
Fimfarch dataset >>41027971
5 years of PPP >>41029227
Audio re-up >>41100938
RVC Experiments >>41244976 >>41244980
Ace Studio Demo >>41256049 >>41256783

>The PoneAI drive, an archive for AI pony voice content:
drive.google.com/drive/folders/1E21zJQWC5XVQWy2mt42bUiJ_XbqTJXCp

>Clipper’s Master Files, the central location for MLP voice data:
mega.nz/folder/jkwimSTa#_xk0VnR30C8Ljsy4RCGSig
mega.nz/folder/gVYUEZrI#6dQHH3P2cFYWm3UkQveHxQ
drive.google.com/drive/folders/1MuM9Nb_LwnVxInIPFNvzD_hv3zOZhpwx

>Cool, where is the discord/forum/whatever unifying place for this project?
You're looking at it.

Last Thread:
>>41354496
>>
FAQs:
If your question isn’t listed here, take a look in the quick start guide and main doc to see if it’s already answered there. Use the tabs on the left for easy navigation.
Quick: docs.google.com/document/d/1PDkSrKKiHzzpUTKzBldZeKngvjeBUjyTtGCOv2GWwa0/edit
Main: docs.google.com/document/d/1y1pfS0LCrwbbvxdn3ZksH25BKaf0LaO13uYppxIQnac/edit

>Where can I find the AI text-to-speech tools and how do I use them?
A list of TTS tools: docs.google.com/document/d/1y1pfS0LCrwbbvxdn3ZksH25BKaf0LaO13uYppxIQnac/edit#heading=h.yuhl8zjiwmwq
How to get the best out of them: docs.google.com/document/d/1y1pfS0LCrwbbvxdn3ZksH25BKaf0LaO13uYppxIQnac/edit#heading=h.mnnpknmj1hcy

>Where can I find content made with the voice AI?
In the PoneAI drive: drive.google.com/drive/folders/1E21zJQWC5XVQWy2mt42bUiJ_XbqTJXCp
And the PPP Mega Compilation: docs.google.com/spreadsheets/d/1T2TE3OBs681Vphfas7Jgi5rvugdH6wnXVtUVYiZyJF8/edit

>I want to know more about the PPP, but I can’t be arsed to read the doc.
See the live PPP panel shows presented at /mlp/con for a more condensed overview.
2020 pony.tube/w/5fUkuT3245pL8ZoWXUnXJ4
2021 pony.tube/w/a5yfTV4Ynq7tRveZH7AA8f
2022 pony.tube/w/mV3xgbdtrXqjoPAwEXZCw5
2023 pony.tube/w/fVZShksjBbu6uT51DtvWWz

>How can I help with the PPP?
Build datasets, train AIs, and use the AI to make more pony content. Take a look at the quick start guide for current active tasks, or start your own in the thread if you have an idea. There’s always more data to collect and more AIs to train.

>Did you know that such and such voiced this other thing that could be used for voice data?
It is best to keep to official audio only unless there is very little of it available. If you know of a good source of audio for characters with few (or just fewer) lines, please post it in the thread. 5.1 is generally required unless you have a source already clean of background noise. Preferably post a sample or link. The easier you make it, the more likely it will be done.

>What about fan-imitations of official voices?
No.

>Will you guys be doing a [insert language here] version of the AI?
Probably not, but you're welcome to. You can however get most of the way there by using phonetic transcriptions of other languages as input for the AI.

>What about [insert OC here]'s voice?
It is often quite difficult to find good quality audio data for OCs. If you happen to know any, post them in the thread and we’ll take a look.

>I have an idea!
Great. Post it in the thread and we'll discuss it.

>Do you have a Code of Conduct?
Of course: 15.ai/code

>Is this project open source? Who is in charge of this?
pony.tube/w/mqJyvdgrpbWgZduz2cs1Cm

PPP Redubs:
pony.tube/w/p/aR2dpAFn5KhnqPYiRxFQ97

Stream Premieres:
pony.tube/w/6cKnjJEZSCi3gsvrbATXnC
pony.tube/w/oNeBFMPiQKh93ePqTz1ns8
>>
File: anchor.png (33 KB, 1200x1453)
Anchor
>>
Is Clipper still doing episodes? I loved that "Free Hugs" one.
>>
>>41364866
God I hope not. It sucks.
>>
File: 1701781466764247.png (60 KB, 500x459)
>>41364875
>Wanting everything to be about sex with Anon
>>
>>41364876
Good idea
>>
File: 670652.png (670 KB, 3991x5761)
>>41364787
Added audio from recently released animatics (s2e3, 25, 26) to the voice dataset, replacing corresponding entries in the FiM folder.
mega.nz/folder/jkwimSTa#_xk0VnR30C8Ljsy4RCGSig
Sliced Dialogue -> Special source
Also put label files in the label files folder.

>>41364866
Working on a thing to present at Mare Fair. Not an episode as such, though will make use of the AI voice.

>>41364875
There are a lot of things I'd do differently looking back, mainly pacing. Always better next time, that's the goal.
>>
>>41364894
>Sex Hotline 2.0
>>
>>41364787
reposting song cover in case people missed it from the last thread >>41361841
>https://files.catbox.moe/qxa6vp.mp3
>>
Found a good VITS-based voice conversion/generation tool with decent TTS capabilities called Applio.
https://www.youtube.com/watch?v=gjggpadBgOo
https://github.com/IAHispano/Applio
https://applio.org/

It seems to have a lot of functionality similar to Hay Say, but with more in-depth TTS. It effectively works as a TTS voice layer: a roster of stock speakers from numerous countries, languages and accents that it interposes RVC voices onto. That works pretty well in my early testing, as in the compiled clip below.

https://files.catbox.moe/lju6ub.mp4

Fully compatible with existing pony models, as evidenced by it working with Vul's Fluttershy S1 model, though it was a bit confusing where the models had to go (apparently in "/Applio-3.2.4/logs/", in a named folder containing the .pth and .index files). It's a little finicky, needing experimentation with suitable TTS voices and settings for each mare, adjusting until it sounds right. The noisiness is mostly from the TTS end rather than the RVC side. The additional RVC training and inference functions look useful too, though I haven't tested those parts yet.
>>
>>41365138
Those clips sound pretty good, some low level noise but all within what I'd call an acceptable limit.
Might this be a new start for pony TTS? Would be awesome to have an alternative to those that rely fully on reference audio.
>>
>>41364894
- [In progress] Download the Master File again so I can get a clean updated copy.
- [ ] Reupload a clone of the new Master File to my gdrive.
- [ ] Reupload a clone of both Master Files to HuggingFace.

Separately, I've spent a lot of time in the last few months working with LLMs. I'm putting together a library for creating more complex chatbots.
https://github.com/synthbot-anon/horsona
- [In progress] Collect a list of functionality that would help in making better chatbots. Right now, that means going through existing chatbots and figuring out (1) what features they support, (2) what it takes to specify a new chatbot, and (3) what people complain about regarding chatbot interactions and personalities.
- [ ] Split the target features into individual functions that can be implemented and pieced together, and create a github issue for each one so it's easy to keep track of everything.
- [ ] Start implementing.

If anyone has ideas for functionality and for things other chatbots do well/poorly, let me know. Right now, I don't care how difficult anything here is to implement.
>>
>>41364782
Is there a voice actor AI model that can simulate rage and emotions like AVGN, and doesn't take a rocket science degree to use? I plan on using it for really long ranting reviews of 10k+ words.
>>
>>41365910
No, not right now.
>>
>>41365609
>Chatbots
>Horsona
Would there be any likelihood of these being capable of writing entire fics independently like GPT-PNY does/did? Curious too about their flexibility, as it could be fun to have AI mares give us a script to then animate or otherwise try to realize. Mare-assisted brainstorming.
>>
>>41365950
I think so. I'm toying around with having an LLM read a fanfic paragraph-by-paragraph to extract information for automated lorebook creation. Writing a fic would be basically that in reverse + lorebook generation.
Once I get a better handle on how to implement this and everything /chag/ suggested, I'll try to break down the tasks into small pieces that can be implemented by more people than just myself.
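To make the paragraph-by-paragraph idea concrete, here's a toy sketch of that read-through loop. The extraction step is a stubbed keyword matcher standing in for a real LLM call; none of these names are the actual horsona API.

```python
# Sketch: read a fic paragraph-by-paragraph and accumulate a lorebook.
# extract_facts is a hypothetical stand-in for an LLM extraction prompt.

def extract_facts(paragraph, known_facts):
    """Stub 'LLM': spot named entities not yet in the lorebook."""
    facts = []
    for name in ("Twilight Sparkle", "Celestia", "Ponyville"):
        if name in paragraph and name not in known_facts:
            facts.append((name, paragraph))
    return facts

def build_lorebook(story_paragraphs):
    lorebook = {}  # entity -> first paragraph that mentions it
    for para in story_paragraphs:
        for name, context in extract_facts(para, lorebook):
            lorebook[name] = context
    return lorebook

story = [
    "Twilight Sparkle was studying in Canterlot.",
    "Celestia sent her to Ponyville.",
]
book = build_lorebook(story)
```

A real version would replace the stub with an LLM prompt and refine existing entries instead of only adding new ones, but the control flow is the same.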
>>
>>41365138
The example is a little too high-pitched, but it does show the idea. I will test it out later myself.
>>
>>41366029
Are you planning to make it as a heavily modded TavernAI or some custom UI program from scratch?
>>
>>41366105
There is a pitch slider in the settings, so it's pretty much a non-issue; it can be adjusted as desired. This will be a setting to play around with often, as different TTS voices each vary in natural pitch range. Some deliveries might need a small increase or reduction too.

Lower pitches did feel more Fluttershy, but didn't seem to have as much pitch variance. She kinda sounded bored or tired to me.
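For anyone dialing these sliders in: if the slider is in semitones (as RVC-style tools usually are), each step is a fixed frequency ratio, which is why small adjustments go a long way.

```python
# A pitch shift of n semitones multiplies the fundamental frequency by 2**(n/12).
def semitone_ratio(n):
    return 2 ** (n / 12)

up_8 = semitone_ratio(8)     # ~1.587x f0, a large upward shift
down_2 = semitone_ratio(-2)  # ~0.891x f0, a small reduction
```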
>>
>>41366302
I don't know. Probably a mix of both, leaning toward integrating with other UIs as much as possible. Right now, I mostly want to see how limiting the technical challenges really are when trying to make perfect chatbots. I intend for it to be a library, not a full chat program, but a UI might be necessary occasionally to make use of & test the functionality.
>>
>>41366401
I still have a fondness for how the barebones GPT-PNY worked way back when, with the colab and separate window thing. I still feel it functioned a lot better and more freely than in KoboldAI, so any simple interfacing you come up with that'd allow for more raw/unfiltered/free-form outputs is good by me, even if other more flexible and potentially restrictive interfaces are adopted for it later on too.
>>
>>41366416
>https://www.youtube.com/watch?v=jHS1RJREG2Q
>https://arxiv.org/abs/2408.14837
>Diffusion Models Are Real-Time Game Engines
So some nerds combined an LLM text model with an art diffusion model and trained it on images + keyboard inputs to create synthesized Doom gameplay.
Not mare related but the idea of practical combination of different ai tools is interesting to me.
>>
https://www.udio.com/songs/67X7mqHih4C8m4raEX8fzW
https://pomf2.lain.la/f/tky4cms5.mp4

Midnight Rejections
acoustic guitar music. princess celestia, anon, male vocalist, sad

Lyrics

[Verse 1]
Celestia's trying hard
She’s got her royal charm turned up to ten
But Anon's still not interested again, yeah
She pulled out all her tricks
Even baked him a cake, extra thick
But buddy's not biting, not even one little bit

[Chorus]
In the castle, at midnight, room 302
With a bouquet of roses and some candles too
Anon’s locked the door, put a sign in his view
That says "Please go away"

[Verse 2]
She's got a plan, who knew?
But Luna and Cadance can't believe it’s true
She wore a fancy dress and said “Hey there you!”
She read from romance books
Tried adding sexy looks
But Anon just laughed and moved to his favorite nooks

[Chorus]
In the castle, at midnight, room 302
With a bouquet of roses and some candles too
Anon’s locked the door, put a sign in his view
That says "Please go away"

[Bridge]
So Celestia sighed, wiped a tear from her eye
The mares all gathered round, gave it one more try
They played their guitars, singing under moon light
But Anon just yawned and said, "Goodnight"

[Chorus]
In the castle, at midnight, room 302
With a bouquet of roses and some candles too
Anon’s locked the door, put a sign in his view
That says "Please go away"


----
Is "put a sign in his view" too bad?
>>
>>41367097
you have quite a nice collection there, the 'Dreams of Luna' is pretty nice.
>>
Do any of you guys reckon you could make a good voiced version of the second comic here?

https://www.tumblr.com/radioactive-dragonlover/759831654724419584

I tried using Haysay and putting in a segment of the audio from the episode of Game Changers it's referencing as an audio input, but it came out sounding very robotic and off - I think because Twilight as a character has a different range of pitches when speaking emotionally than BLeeM does. I don't really know how to adjust that, though.
>>
>>41366416
Do you mean how it wasn't tuned to act like an assistant, and that it just continued from whatever text you gave it? That should be easy enough.
>>llm = AsyncCerebrasEngine(model="llama3.1-70b")
>>print(await llm.query_continuation("Once upon a time there was a little pony named Twilight Sparkle."))
>She lived in a magical land known as Equestria, where the sun was always shining and the air was sweet with the scent of blooming wildflowers. Twilight Sparkle was a student of Princess Celestia, the ruler of Equestria, and was learning the art of magic at the princess's palace in Canterlot. One day, while Twilight was studying in the library, she received a letter from the princess, instructing her to move to the town of Ponyville and live alongside the other ponies, to learn about the magic of friendship.

I can make sure there's a way to support jailbreaks too.
>>
>>41366336
Welp, that new program is not going to be very useful to me, as it crashes at startup, unable to load some DLL models. So the struggle to find a nice-sounding TTS continues.
>>
>>41367854
Afraid I can't be much help with that, assuming you're using the Windows version; the Linux version works fine. If it's an installation issue, there should be another install option in the GitHub releases; maybe that'll work?
>dll models
I'm pretty sure it's intended to use .pth files as the models. Also RVC only; I had a slip-up earlier where I accidentally tried to load a SoVits model of mine and it naturally errored.
>>
>>41365609
Current AI might be too slow for this, but I was thinking about some kind of LLM/RAG-powered "RPG engine" where you don't just provide character definitions but also world definitions, item definitions, and maybe some kind of prebuilt framework for quests, skills, and other user-defined mechanics, and do the math of tracking XP, health, armor, and damage multipliers in code rather than having the LLM try to pick that role up. Rather than the hackiness of trying to shove a world scenario or an explicit story into each character card, these could be split into more logical pieces and composed into RPGs.
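The "track the numbers in code, let the LLM narrate" split could look something like this minimal sketch (all names here are made up for illustration, not any existing engine):

```python
# Sketch: hard game state (XP, HP, damage multipliers) lives in code;
# the LLM would only narrate around these authoritative numbers.
from dataclasses import dataclass, field

@dataclass
class Character:
    name: str
    hp: int = 100
    xp: int = 0
    multipliers: dict = field(default_factory=dict)  # damage type -> factor

    def take_damage(self, amount, dtype="physical"):
        # Apply the typed damage multiplier deterministically, no LLM math.
        self.hp -= int(amount * self.multipliers.get(dtype, 1.0))

    def gain_xp(self, amount):
        self.xp += amount
        return self.xp // 100 + 1  # level, assuming a flat 100 XP per level

anon = Character("Anon", multipliers={"magic": 2.0})
anon.take_damage(10, "magic")  # 2x weakness -> 20 effective damage
level = anon.gain_xp(250)
```

The engine would pass the resulting state back into the prompt as context, so the model never has to remember or compute the numbers itself.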
>>
>>41368170
Someone mentioned this in /chag/. The hard part is making sure it's possible to extract & track the relevant information from a rulebook. If you can send me an example rulebook (maybe one of the PonyFinder ones), that would help.
>>
>>41367684
Can you catbox the Game Changers audio, or link to the original episode?
>>
>>41368005
They do have a precompiled download, but it's giving me the same error; I'm guessing it's just my system being extra derped.
>>
>>41368620
https://youtu.be/88et7YlmzTs?si=_okFx5HtSE9e9cBV
Here's the audio snippet I used:
https://files.catbox.moe/m5eph6.mp3
>>
>9
>>
>>41369482
So it is.
>>
>>41369021
https://files.catbox.moe/rf3zrc.mp3

Architecture = rvc | Character = Twilight Sparkle | Index Ratio = 0.95 | Pitch Shift = 8 | Voice Envelope Mix Ratio = 1.0 | Voiceless Consonants Protection Ratio = 0.33 | f0 Extraction Method = rmvpe
>>
>>41369482
>cutie mark
Is that what it looks like to have 4chan as a special talent?
>>
>>41369482
oy
>>
>>41371447
Either that or her talent is related to some kind of Star Trek: Green Edition.
>>
>>41372139
Kek, that uniform design is gold.
>>
>>41365609
Updating the Master File:
- [Hopefully done] Download the Master File again so I can get a clean updated copy. Mega was having issues, as usual. I'll need to check to make sure I have all the files, but I think this is done.
- [In progress] Reupload a clone of both Master Files to HuggingFace.
- [In progress] Reupload a clone of the new Master File to my gdrive.

Horsona chatbot library:
- [Done] Collect a list of functionality that would help in making better chatbots. The current list is up in the github readme https://github.com/synthbot-anon/horsona. I have enough to get started, but please keep suggesting functionality if you think of anything. There's still functionality I want that no one's mentioned, so I'm sure the list is incomplete.
- [In progress] Split the target features into individual functions that can be implemented and pieced together, and create a github issue for each one so it's easy to keep track of everything. I'll need to start implementing some of these things so I can have a better understanding of how to do this.
- ... [Done] Create a sample memory module, which is required for several of the candidate features. I went with "RAG where the underlying dataset can be automatically updated based on new information." The implementation is done, though the LLM prompts in https://github.com/synthbot-anon/horsona/blob/main/src/horsona/memory/rag.py#L139 could use some work. There's an example use in https://github.com/synthbot-anon/horsona/blob/main/tests/test_rag.py though the test only passes about 30% of the time. I'm pretty sure this can be made close to 100% with better prompts.
- ... [ ] Add documentation, a "start developing" guide, and tasks for the features where it's feasible to make progress using the memory module.
- ... [ ] Find some old & current jailbreaks to add jailbreak support, and turn them into either modules or LLM wrappers. If anyone has links for this, please send them.
- ... [ ] Figure out how to organize text corpora into compatible universes.
- ... [ ] Go through the candidate feature list and make sure jailbreaks & compatible universes are the only features that are hard to support with the existing framework.
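The "RAG where the underlying dataset can be automatically updated" idea above reduces to a retrieval store with upsert semantics. A toy sketch, with word overlap standing in for embedding similarity (this is not the horsona implementation):

```python
# Sketch: a retrieval store whose entries get revised as new information
# arrives while reading. Word overlap stands in for embedding similarity.

class UpdatableRAG:
    def __init__(self):
        self.entries = {}  # question -> current best answer

    def upsert(self, question, answer):
        self.entries[question] = answer  # new information replaces old

    def query(self, text):
        words = set(text.lower().split())
        best = max(self.entries,
                   key=lambda q: len(words & set(q.lower().split())),
                   default=None)
        return self.entries.get(best)

rag = UpdatableRAG()
rag.upsert("where does Twilight live", "Canterlot")
rag.upsert("where does Twilight live", "Ponyville")  # refined by a later chapter
answer = rag.query("Where does Twilight live now?")
```

The flaky-prompt problem mentioned above lives in deciding *when* an upsert should replace an existing answer versus add a new entry; the lookup side is the easy half.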

Information I need:
- Sample rulebooks, preferably one of the PonyFinder ones, so I can figure out what it'll take to extract information from these.
- Old & current jailbreaks so I can make sure my jailbreak implementation is comprehensive.
>>
>>41373093
>keep suggesting functionality if you think of anything
I can't think of any at this moment. However, I would love it if you kept the add-on options that the auto1111 webui for Stable Diffusion has, where one can install additional options as needed, and as anons make new ones in the future.
>>
Page 10 bump.
>>
Does Clipper know where he found the MLP background music? Shit's kino as fuck.
>>
>>41373305
I'm only building the library right now (not a full application), but I can make sure it can support custom add-ons that can be dynamically toggled.
>>
>>41373093
It looks like jailbreaks are sometimes LLM-specific and require modifying near-arbitrary arguments to the call, so they'll likely be implemented as custom LLMEngines. In that case, I don't think I need to do anything for them right now since my LLMEngine implementation already supports all of the customizations required. I'll just create issues for popular jailbreaks that I or others can implement.
Organizing text into compatible universes looks like it'll require graphs of data sources, where one data source can inherit from another with edits. I'll have to think about how to implement this. Most of the features don't depend on this, so I'll shift focus to documenting & creating issues for now.
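The inherit-with-edits relationship is essentially layered lookup; Python's stdlib `ChainMap` already models a single edge of that graph. A toy sketch (the fact keys here are made up for illustration):

```python
# Sketch: one "universe" of facts inheriting from another with local edits.
# A full version would be a graph of these, one ChainMap per inheritance edge.
from collections import ChainMap

canon = {"Twilight": "unicorn", "ruler": "Celestia"}
# A derived universe overrides one fact but inherits everything else:
post_s3 = ChainMap({"Twilight": "alicorn"}, canon)

species = post_s3["Twilight"]  # local edit wins
ruler = post_s3["ruler"]       # falls through to canon, unchanged
```

Deletions and multi-parent merges need more than `ChainMap` gives, but for a tree of universes this covers the common case.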
>>
>>41373902
I know about these archived rips of background music:
https://www.mediafire.com/?rdhhrpyc0d6d3
https://www.mediafire.com/?rh219xdgj66bu

The first directory has music from seasons 1-2, and the second account belongs to RainShadow, who also has a YouTube channel:
https://www.youtube.com/@RainShadow
>>
>>41373093
I think horsona / chatbot library is in a good-enough state for anyone that wants to help out with development.
Repo: https://github.com/synthbot-anon/horsona
Open tasks: https://github.com/synthbot-anon/horsona/issues
- The current open tasks are for creating new LLMEngines (easy), making prompts more reliable (medium), and creating new datatypes & modules for character cards and image generation (medium/hard, probably requires some familiarity with pytorch).
- If you want to develop and run into any issues with the setup, let me know.

The integrations with existing chatbot UIs will come a bit later from integrations with common inference engines. I don't expect that to be difficult.

Updating the Master File:
- [Hopefully done] Download the Master File again so I can get a clean updated copy. I'll need to check to make sure I have all the files, but I think this is done.
- [In progress] Reupload a clone of both Master Files to HuggingFace.
- [In progress] Reupload a clone of the new Master File to my gdrive.

Horsona chatbot library:
- [Done] Add documentation, a "start developing" guide, and tasks for the features where it's feasible to make progress using the memory module.
- [Done] Find some old & current jailbreaks to add jailbreak support, and turn them into either modules or LLM wrappers. Ultimate Jailbreak is the main one, and there's an open task for it. There are others listed on the rentry here: https://rentry.org/jb-listing.
- [Done enough] Go through the candidate feature list and make sure jailbreaks & compatible universes are the only features that are hard to support with the existing framework. Jailbreaks are easy to support. The rest of the features are easy to support.
- [ ] Work on whatever open issues other anons don't pick up.
- [ ] Continue working on lorebook generation. After this, I'll try making a simple chatbot with the library.
- [ ] Figure out how to organize text corpora into compatible universes.
>>
>>41373902
Pretty sure 99% of it is rips from the show itself (with two/three pieces made by Anons), which you can find in the OP's second Mega link ('sfx and music' folder).
>>
>>41373888
Minus one.
>>
>>
>>41373902
Nothing special, I just took it all from the music tracks of the same show audio used to make the voice dataset. Same clipping process with a different tagging system.
>>
>>41371103
That's pretty good up until the screaming at the end. Thank you
>>
File: 3121428.png (170 KB, 1528x2267)
Any chance someone here could train up a Lightning Dust model for RVC please? I need it for a song and her SVC one isn't cutting it.
>>
>>41377069
https://huggingface.co/Amo/so-vits-svc-4.0_GA/tree/main/ModelsFolder/ddm_DaringDo_100k
There is a sovits model for her that was set up as multi-model training, as, if I remember correctly, she did not have the required 2 minutes of audio.
I may give it a try for RVC training in a day or two.
>>
>>41374539
Updating the Master File:
- [Done] Downloaded the new Master File and checked to make sure everything is good.
- [In progress] Reupload a clone of both Master Files to HuggingFace. This should be done in about 1 hour.
- [In progress] Reupload a clone of the new Master File to my gdrive. This should be done in about 3 hours.

Updating the Fimfarchive:
- [In progress] Download & verify the Sep 1 Fimfarchive. I'm downloading it now. If there are no errors, this should be done in about 5 hours.
- [ ] Upload to HuggingFace.

>>41364894
There's an empty "Luster Dawn" folder in Special Source, in case you wanted to remove that.
>>
>>41377069
I've also been meaning to train a Lightning Dust model. I'll see about training her later today probably. Would be good to get me back into the rhythm of training; been a long while.
>>
File: Untitled.png (553 KB, 1080x1049)
SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection
https://arxiv.org/abs/2408.17432
>Synthesizing the voices of unseen speakers is a persisting challenge in multi-speaker text-to-speech (TTS). Most multi-speaker TTS models rely on modeling speaker characteristics through speaker conditioning during training. Modeling unseen speaker attributes through this approach has necessitated an increase in model complexity, which makes it challenging to reproduce results and improve upon them. We design a simple alternative to this. We propose SelectTTS, a novel method to select the appropriate frames from the target speaker and decode using frame-level self-supervised learning (SSL) features. We show that this approach can effectively capture speaker characteristics for unseen speakers, and achieves comparable results to other multi-speaker TTS frameworks in both objective and subjective metrics. With SelectTTS, we show that frame selection from the target speaker's speech is a direct way to achieve generalization in unseen speakers with low model complexity. We achieve better speaker similarity performance than SOTA baselines XTTS-v2 and VALL-E with over an 8x reduction in model parameters and a 270x reduction in training data
https://kodhandarama.github.io/selectTTSdemo/
code and weights to be released (soon?)
Examples aren't great, but the training time/training data/parameter count means it's viable for personal training. They used 100 hours of data.
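The core frame-selection step described in the abstract boils down to a nearest-neighbour lookup per input frame. A toy sketch with made-up 2-D vectors standing in for SSL features (not the SelectTTS code):

```python
# Toy frame selection: for each "semantic" frame of the input, pick the
# closest frame from the target speaker's corpus, then decode the selection.
import math

def closest_frame(frame, corpus):
    # Nearest target-speaker frame by Euclidean distance in feature space.
    return min(corpus, key=lambda c: math.dist(frame, c))

target_corpus = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5)]  # target speaker's frames
input_frames = [(0.1, 0.2), (1.9, 0.4)]               # frames to resynthesize

selected = [closest_frame(f, target_corpus) for f in input_frames]
```

This is why the approach generalizes to unseen speakers cheaply: the "model" of the speaker is just their frame corpus, not learned conditioning weights.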
>>
Hold Me Tight: Stable Encoder-Decoder Design for Speech Enhancement
https://arxiv.org/abs/2408.17358
>Convolutional layers with 1-D filters are often used as frontend to encode audio signals. Unlike fixed time-frequency representations, they can adapt to the local characteristics of input data. However, 1-D filters on raw audio are hard to train and often suffer from instabilities. In this paper, we address these problems with hybrid solutions, i.e., combining theory-driven and data-driven approaches. First, we preprocess the audio signals via an auditory filterbank, guaranteeing good frequency localization for the learned encoder. Second, we use results from frame theory to define an unsupervised learning objective that encourages energy conservation and perfect reconstruction. Third, we adapt mixed compressed spectral norms as learning objectives to the encoder coefficients. Using these solutions in a low-complexity encoder-mask-decoder model significantly improves the perceptual evaluation of speech quality (PESQ) in speech enhancement.
https://github.com/felixperfler/Stable-Hybrid-Auditory-Filterbanks
>>
>>41377203
>>41377404
It would be much appreciated. SVC has a tone to it that's difficult to work into some genres; I think RVC would nail it.
>>
Precautionary page 8 bump.
>>
>>41378282
Plus one.
>>
>>41377900
I may have something workable within 6 hours (or 12, depending on whether the PC decides to have another technical hiccup).
>>
>>41377900
>>41379052
And Im back.
>https://huggingface.co/Amo/RVC_v2_GA/tree/main/models/MLP_Lightning_Dust_GA
>https://files.catbox.moe/amkbes.mp3
I may have overtrained her by setting the epochs to 500, but I still think this model came out pretty decent, especially with all the additional cleaned-up data clips.
>>
>>41380032
Nicely done. I attempted to clean some more files for training her yesterday but didn't have much luck with getting any of the tools to work before I got bummed out and needed sleep. Those tools being jazzpear94's model (https://colab.research.google.com/drive/1efoJFKeRNOulk6F4rKXkjg63RBUm0AnJ) and a couple other similar ones intended to help separate SFX specifically, which I found in this doc: https://docs.google.com/document/d/17fjNvJzj8ZGSer7c7OFe_CNfUKbAxEh_OBv94ZdRG5c/edit#heading=h.owqo9q2d774z

How much data did you have of her to work with? Still tempted to train another model of her anyways, if at least to see if my training setup works still.
>>
Hey Hydrus, any chance we could get this >>41380032 on Haysay.ai?
>>
>>41377233
Updating the Master File:
- [Done] Reupload a clone of both Master Files to HuggingFace. https://huggingface.co/datasets/synthbot/pony-speech and https://huggingface.co/datasets/synthbot/pony-singing
- [In progress] Reupload a clone of the new Master File to my gdrive. https://drive.google.com/drive/folders/1ho2qhjUTfKtYUXwDPArTmHuTJCaODQyQ

Updating the Fimfarchive:
- [Done] Download & verify the Sep 1 Fimfarchive. Fimfiction added some restriction to prevent bots from scraping. Part of my script downloads the story html if there's anything wrong with the fimfarchive epubs, or if there's a conflict between the fimfarchive metadata and epub. I have a hackish fix for this for now.
- [Done] Upload to HuggingFace. https://huggingface.co/datasets/synthbot/fimfarchive

Horsona chatbot library:
- [In progress] Continue working on lorebook generation. I cleaned up part of my embedding-based memory implementation. I'll clean the rest as I figure out the right way to use it for reading through a story. Right now, I'm having an LLM create a question-answer dataset for the story setting, which it refines as it reads the story. The questions get turned into embeddings, which can be used to look up the corresponding answers as necessary. This is still a work in progress. I think my first "test" for this would be if it can create a decent character card for every character after each chapter of a story. That's what I'm currently working toward.
- [ ] Work on whatever open issues other anons don't pick up.
- [ ] Figure out how to organize text corpora into compatible universes.
>>
>>41380500
Correction: the Master File reupload to my gdrive should be [Done].
>>
>>41380258
>https://files.catbox.moe/d3j7w5.zip
>I attempted to clean some more files
I usually just grab the Clear files from the OP mega folder, and if the total is below 3 minutes I'll additionally scavenge any usable 'noisy' files as well; in this case I think I used almost all of the noisy ones, with only three being deleted.
>>
>>41380756
That's more or less what I have to train with, but I'm a bit stringent about which samples I use. There's quite a few lines sourced from her second episode that I didn't feel suited her, as she gets a slight country-like accent in her delivery (https://files.catbox.moe/gn0vd5.flac) and sounds unusual or distorted in others (https://files.catbox.moe/1h5trr.flac & https://files.catbox.moe/4mug5h.flac). But yeah, I'll train her with what I've defined, though with fewer files, and consider an alternate version that'll hopefully be more faithful to her debut appearance.
>>
Up.
>>
>>41381049
Oh yes, when going through the dataset I try to look out for the tone of the voice as well. While I don't have proof, I think clips where "X character pretends to talk like Y" may end up poisoning the training process.
Btw Anon, what stuff are you planning to make with her voice? Songs? Greentext?
>>
>>41381590
I usually make covers, but I'll be doing further tests with Applio to see which of its TTS voices she pairs best with and which settings work best. Might have to compile some sort of parameters list. Also curious about the TTS >>41377494 mentioned and how it'll perform in comparison. I have a few factors limiting my ability to effectively produce non-song stuff, but I have a long list of stuff to make. Nothing much with Lightning Dust thus far though, aside from a few songs I wanna test with her.
>>
Good night bump.
>>
MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
https://arxiv.org/abs/2409.00750
>Nowadays, large-scale text-to-speech (TTS) systems are primarily divided into two types: autoregressive and non-autoregressive. The autoregressive systems have certain deficiencies in robustness and cannot control speech duration. In contrast, non-autoregressive systems require explicit prediction of phone-level duration, which may compromise their naturalness. We introduce the Masked Generative Codec Transformer (MaskGCT), a fully non-autoregressive model for TTS that does not require precise alignment information between text and speech. MaskGCT is a two-stage model: in the first stage, the model uses text to predict semantic tokens extracted from a speech self-supervised learning (SSL) model, and in the second stage, the model predicts acoustic tokens conditioned on these semantic tokens. MaskGCT follows the mask-and-predict learning paradigm. During training, MaskGCT learns to predict masked semantic or acoustic tokens based on given conditions and prompts. During inference, the model generates tokens of a specified length in a parallel manner. We scale MaskGCT to a large-scale multilingual dataset with 100K hours of in-the-wild speech. Our experiments demonstrate that MaskGCT achieves superior or competitive performance compared to state-of-the-art zero-shot TTS systems in terms of quality, similarity, and intelligibility while offering higher generation efficiency than diffusion-based or autoregressive TTS models.
https://maskgct.github.io/
no weights (ever) since they're worried about safety. they finetuned it afterward for emotion control and voice cloning. sounds pretty good. 100k hours training dataset.
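The mask-and-predict setup the abstract describes can be illustrated with a toy masking step (pure numpy; nothing here is from the actual MaskGCT code):

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_tokens(tokens, mask_ratio, mask_id=-1):
    """Mask-and-predict training input: replace a random subset of
    token positions with a MASK id. The model is trained to recover
    only those masked positions; at inference, generation starts
    fully masked and unmasks tokens in parallel over a few steps."""
    tokens = np.asarray(tokens)
    n_mask = max(1, int(len(tokens) * mask_ratio))
    positions = rng.choice(len(tokens), size=n_mask, replace=False)
    masked = tokens.copy()
    masked[positions] = mask_id
    return masked, positions
```

In MaskGCT this is applied twice, once over semantic tokens conditioned on text, and once over acoustic tokens conditioned on the semantic ones.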
>>
What's the fastest way to get this shit voice acted? It can even be a female voice actor, it doesn't matter; it just has to sound entertaining to listen to.
And optionally, can the visuals be made to match what he's talking about?
https://desuarchive.org/mlp/thread/40590194/#40598329
>>
anypony have voice packs for eleven labs
>>
I hate to moralfag, but am I the only one bothered by people who use CMC voice AIs for coomer shit? Normally I wouldn't care, but they were kids when they recorded at least most of their lines.
>>
>>41383447
The VAs are all adults now anyway, so that concern isn't really an issue any more.
>>
>>
File: OIG4.Pc7p8fPrtACEmu2G7R_o.jpg (249 KB, 1024x1024)
>>
>>41380500
Minor update on lorebook generation:
The current plan for memory is to extract questions and answers as the LLM reads a story. The questions get indexed by embedding, and they won't get updated unless a corresponding answer is deleted. The answers will get updated as the LLM reads the story. It'll process each paragraph twice: once to generate a list of questions that need to be answered to understand the paragraph, and a second time with the corresponding answers to determine how the memory should be updated.
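That question-index/answer-update split could be sketched roughly like this (toy hash-based embeddings stand in for a real embedding model; the class and method names are made up for illustration, not Horsona's actual API):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in embedding: hash words into a fixed-size bag-of-words
    vector. A real setup would call an embedding model instead."""
    v = np.zeros(64)
    for w in text.lower().split():
        v[hash(w) % 64] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

class QAMemory:
    """Questions are embedded and indexed once; answers can be
    revised in place as the LLM reads further into the story."""
    def __init__(self):
        self.questions, self.vectors, self.answers = [], [], []

    def add(self, question, answer):
        self.questions.append(question)
        self.vectors.append(embed(question))
        self.answers.append(answer)

    def update(self, question, new_answer):
        # The answer changes; the question embedding stays fixed.
        i = self.questions.index(question)
        self.answers[i] = new_answer

    def lookup(self, query, k=1):
        # Cosine similarity against the question index.
        q = embed(query)
        sims = np.array([v @ q for v in self.vectors])
        top = np.argsort(sims)[::-1][:k]
        return [(self.questions[i], self.answers[i]) for i in top]
```

The character-card test then amounts to running `lookup` with per-character questions after each chapter and seeing whether the stored answers stay coherent.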
>>
>>41377233
Not sure what that empty Luster Dawn folder was supposed to be for, perhaps a holdover from processing the studio leaks that now has no purpose. It's now been removed.
>>
>>
https://www.udio.com/songs/kgm72z2swizqRLSYDJWMMG
https://vocaroo.com/1hO6O2SpRNy6
https://pomf2.lain.la/f/4s3d56ud.mp4

Behind the Facade 1

Lyrics

[Verse 1]
We live in a world with cartoons and rainbows
With sparkly eyes and vibrant shows
Featureless, seamless, no lines can be seen
In our perfect land where nothing disagrees
No whispers of night's forbidden touch
In pastel dreams, we're bound and crushed

[Chorus]
Hey, Equestria
What can we do?
We live by rules, pretend they're true
While desires hide and hearts must play
In child's delight, we can't be free today

[Verse 2]
Can't flaunt our flair or show a peek
No lips can part for secrets to speak
Innocent, sweet, and always demure
Living in a world where nothing's obscure
Behind closed doors, our true selves lie
Hushing our wants as we gaze at the sky

[Chorus]
Hey, Equestria
What can we do?
We live by rules, pretend they're true
While desires hide and hearts must play
In child's delight, we can't be free today

[Bridge]
Can't break these chains of purity's face
In this vibrant land, we find no embrace
Our silent cries echo in the night
In a painted world, there's no real sight

[Chorus]
Hey, Equestria
What can we do?
We live by rules, pretend they're true
While desires hide and hearts must play
In child's delight, we can't be free today
>>
Training Universal Vocoders with Feature Smoothing-Based Augmentation Methods for High-Quality TTS Systems
https://arxiv.org/abs/2409.02517
>While universal vocoders have achieved proficient waveform generation across diverse voices, their integration into text-to-speech (TTS) tasks often results in degraded synthetic quality. To address this challenge, we present a novel augmentation technique for training universal vocoders. Our training scheme randomly applies linear smoothing filters to input acoustic features, facilitating vocoder generalization across a wide range of smoothings. It significantly mitigates the training-inference mismatch, enhancing the naturalness of synthetic output even when the acoustic model produces overly smoothed features. Notably, our method is applicable to any vocoder without requiring architectural modifications or dependencies on specific acoustic models. The experimental results validate the superiority of our vocoder over conventional methods, achieving 11.99% and 12.05% improvements in mean opinion scores when integrated with Tacotron 2 and FastSpeech 2 TTS acoustic models, respectively.
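The augmentation they describe (random linear smoothing of input acoustic features) is simple enough to sketch; this toy version applies a random-width moving average along the time axis of a mel spectrogram (the filter widths are my guess, not the paper's):

```python
import numpy as np

def random_smooth(mel, rng=np.random.default_rng()):
    """Randomly smooth a (frames, bins) mel spectrogram along time,
    imitating the over-smoothed features an acoustic model tends to
    produce, so the vocoder learns to tolerate them."""
    width = int(rng.integers(1, 6))  # width 1 = no smoothing
    if width == 1:
        return mel.copy()
    kernel = np.ones(width) / width
    # Convolve each mel bin independently over the time axis.
    return np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, mel)
```

The appeal for us is that it's vocoder-agnostic: it only touches the training features, so it could in principle be bolted onto any existing pony vocoder recipe.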
>>
>>41386926
Cute and soulful
>>41386944
>Tacotron 2
Huh, that's a name we haven't seen in the threads for a while.
>>
>page 10
>>
>>41377069
>>41377404
Training of Lightning Dust (Alt) model has begun. Decided to use the pretrained TITAN as I found the descriptor of it interesting.
>TITAN is a fine-tune based on the original RVC V2 pretrained, leveraging an 11.15-hour dataset sourced from Expresso. It gives cleaner results than the original pretrained and handles accent and noise better due to its robustness, being able to generate high-quality results. Like Ov2 Super, it allows models to be trained with few epochs, and it supports all sample rates.
Hopefully she'll prove to be less noisy and have better accent retention. Training's a little slow on my end with the reduced batch size (supposedly smaller batches give better results, at the expense of training speed), but so far no issues in the process. If all goes well, hopefully I can also begin training more mares between my other commitments.

Improvements to vocal separation AI should be looked into further; it'd be nice to be able to separate audio that has been hard to separate for datasets in the past. Amalthea comes to mind, with most current separators struggling with cricket sounds and other natural ambience. I have a feeling we could create and/or finetune a UVR5 model designed to separate SFX, using all the pony SFX we've separated thus far, to more easily remove foley, hoofsteps, crashes and similar sounds. For more natural sounds like rain, wind, birds, insects, etc., there's a lot of data in the enormous SONNISS GameAudioGDC packs that could be utilized. It would be preferable to use MDX-Net so it runs reliably on a range of GPUs, as DEMUCS models tend not to run unless the GPU has more than 6 GB.
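For the SFX-separator finetune idea, the usual way to build training data is to mix clean dialogue with isolated SFX at a random SNR and train the model to recover the clean track. A minimal sketch of generating one such pair (plain numpy; the function name and arrays are placeholders, not any UVR5/MDX-Net API):

```python
import numpy as np

def mix_at_snr(clean, sfx, snr_db):
    """Create a (mixture, target) training pair for a separator:
    scale the SFX so the clean speech sits at the requested
    signal-to-noise ratio in dB, then sum the two signals."""
    n = min(len(clean), len(sfx))
    clean, sfx = clean[:n], sfx[:n]
    p_clean = np.mean(clean ** 2)
    p_sfx = np.mean(sfx ** 2) + 1e-12  # avoid divide-by-zero
    scale = np.sqrt(p_clean / (p_sfx * 10 ** (snr_db / 10)))
    mixture = clean + scale * sfx
    return mixture, clean  # network input and separation target
```

With the already-separated pony SFX on one side and clean Master File dialogue on the other, pairs like this could be generated on the fly during finetuning, and the SONNISS packs would slot in the same way for ambient sounds.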
>>
>https://hailuoai.com/video
this shit is fucking crazy
>>
>>41387770
but can it recreate the tiananmen square massacre of june 1989?
>>
File: Long Queue.png (11 KB, 418x303)
>>41387770
Almost as crazy as the queue times for it will soon be.
6 minutes to wait already with only 171 people. Yikes.
>>
>>41387770
>>41387845
For the quality though, it's definitely got pony down surprisingly well. Just concerned about how slow it'll get once more people jump on the same generator. Thankfully it's free (for now).
>Pinkie Pie (My Little Pony: Friendship is Magic) pony waving her hoof at the viewer.
>>
>>41387770 >>41387845 >>41387864
so now we are entering the age of computer-animated mares. I will be very disappointed if there isn't a bootleg of this tech available for offline generation sometime within the next three years.
>>
How do I get Doug Walker or James Rolfe's voice to voice act this?

https://desuarchive.org/mlp/thread/40590194/#40598329
>>
>>41387864
In what format is it output, though? Is it usable as .ai vectors, Photoshop PSD, Flash FLA, Toon Boom, Live2D, Spine, or something else?
>>
>>41387770
>https://hailuoai.com/video
Anime is dead. Finally smooth animation for free.
>>
>>41387845
G6 intro just dropped.
>>
>>41387770

very cute twilight.
>>
>>41387845
>>
File: lolgate.gif (302 KB, 300x335)
>>41388039
the more often you watch this, the funnier it gets
>>
>>41387898
voice-models.com doesn't have a Doug Walker voice, but it does have James Rolfe. Since it's an RVC model, you'll have to read the pasta yourself.
>>
>>41388039
>background full of 'Curse of the Fly'/'The Unearthly' type misshapen monstrosities
>>41388147
Just pause at any random moment for a good laugh/scare
>>
>>41388624
my favorite part is the random tiny little houses at the end for some reason
>>
>>41388651
Well now that you mention it, this does bring up sort of an interesting point with the show. It is clearly established that animals like Angel Bunny have roughly pony intelligence. Would it really be so farfetched if we saw them living in actual tiny houses?
>>
File: Untitled.png (118 KB, 1125x440)
Sample-Efficient Diffusion for Text-To-Speech Synthesis
https://arxiv.org/abs/2409.03717
>This work introduces Sample-Efficient Speech Diffusion (SESD), an algorithm for effective speech synthesis in modest data regimes through latent diffusion. It is based on a novel diffusion architecture, that we call U-Audio Transformer (U-AT), that efficiently scales to long sequences and operates in the latent space of a pre-trained audio autoencoder. Conditioned on character-aware language model representations, SESD achieves impressive results despite training on less than 1k hours of speech - far less than current state-of-the-art systems. In fact, it synthesizes more intelligible speech than the state-of-the-art auto-regressive model, VALL-E, while using less than 2% the training data.
https://github.com/justinlovelace/SESD
no code yet, though they suggest they'll post an "implementation", so maybe weights too. no examples. just posting to keep those interested aware. matching VALL-E on 2% of its training data is big if true
>>
>>41389084
>Note: Code and model checkpoint will be available soon. Stay tuned for updates!
ah should have checked the whole readme
>>
>>41387770
I prompted "Applejack (My Little Pony: Friendship is Magic) collecting hay from each of the rest of the Mane 6." and got G5.
>>
Page 9 bump.
>>
>>41389084
>Graphs
Where're the voices at? Also it would be nice, once this gets published, to be able to run it on my old GPU (I will lose my shit if this is another model that requires 16 GB of VRAM to even start up).
>>
>>41387652
[RVC] Lightning Dust sings - There For Tomorrow "A Little Faster"
>https://files.catbox.moe/d610ed.mp4
>https://files.catbox.moe/v8uxxn.mp3

>https://huggingface.co/datasets/HazySkies/RVC2-M/tree/main/FiM_LightningDust
So far she seems decently capable in her preliminary testing, although I've only tried singing so far. Surprisingly, her natural range requires +0 rather than the expected +12. This test could've been better had the original song separated a little cleaner, but it's a good average case to test. I quite liked her backing vocals at 1:22; her sustained notes sound nice.
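For anyone confused by the transpose numbers: RVC's pitch setting is in semitones, so +12 shifts the source up exactly one octave while +0 leaves it untouched. A toy illustration of the relationship (not Applio's actual code):

```python
def transpose_f0(f0_hz, semitones):
    """Shift a fundamental-frequency contour by a number of
    semitones; each semitone multiplies frequency by 2**(1/12),
    so +12 doubles the pitch and +0 is the identity."""
    return [f * 2 ** (semitones / 12) for f in f0_hz]
```

So +0 working here just means her natural register already sits where the source vocals do, with no octave shift needed.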
>>
File: lightning dust.gif (3.15 MB, 468x480)
>>41389897
hey that sounds pretty good. Nice job, man!
>>
>>41389897
That's actually impressive, given how few voice lines she has.
>>
>>41389897
very good, we need more background mare songs.
>>
>>41391013
>catbox ded
uh oh. Anyone got another good alternative for hosting small files? Also, could someone repost the above cover?
>>
File: cadence flurry DJ.gif (1.04 MB, 394x382)
Cadence - Like a Prayer (BHO cover)
>https://www.youtube.com/watch?v=uP6CRRhTOIM

First time experimenting with vocoders and heavy filters. It sounds a bit scuffed, but not too bad. Used Haysay RVC for the voice and UVR for everything else.

Does anyone have a working catbox alternative? I'd rather drop a direct link than a youtube link, but every other site I try either doesn't work or prunes the link in a matter of hours.


