/mlp/ - Pony

File: AltOPp.png (1.54 MB, 2119x1500)
Welcome to the Pony Voice Preservation Project!
youtu.be/730zGRwbQuE

The Pony Preservation Project is a collaborative effort by /mlp/ to build and curate pony datasets for as many applications in AI as possible.

Technology has progressed such that a trained neural network can generate convincing voice clips, drawings and text for any person or character using existing audio recordings, artwork and fanfics as a reference. As you can surely imagine, AI pony voices, drawings and text have endless applications for pony content creation.

AI is incredibly versatile: basically anything that can be boiled down to a simple dataset can be used as training material to create more of it. AI-generated images, fanfics, wAIfu chatbots and even animation are possible, and are being worked on here.

Any anon is free to join, and there are many active tasks that would suit any level of technical expertise. If you’re interested in helping out, take a look at the quick start guide linked below and ask in the thread for any further detail you need.

EQG and G5 are not welcome.

>Quick start guide:
docs.google.com/document/d/1PDkSrKKiHzzpUTKzBldZeKngvjeBUjyTtGCOv2GWwa0/edit
Introduction to the PPP, links to text-to-speech tools, and how (You) can help with active tasks.

>The main Doc:
docs.google.com/document/d/1y1pfS0LCrwbbvxdn3ZksH25BKaf0LaO13uYppxIQnac/edit
An in-depth repository of tutorials, resources and archives.

>Active tasks:
Research into animation AI
Research into pony image generation

>Latest developments:
GDrive clone of Master File now available >>37159549
SortAnon releases script to run TalkNet on Windows >>37299594
TalkNet training script >>37374942
GPT-J downloadable model >>37646318
FiMmicroSoL model >>38027533
Delta GPT-J notebook + tutorial >>38018428
New FiMfic GPT model >>38308297 >>38347556 >>38301248
FimFic dataset release >>38391839
Offline GPT-PNY >>38821349
FiMfic dataset >>38934474
SD weights >>38959367
SD low vram >>38959447
Huggingface SD >>38979677
Colab SD >>38981735
NSFW Pony Model >>39114433
New DeltaVox >>39678806
so-vits-svc 4.0 >>39683876
so-vits-svc tutorial >>39692758
Hay Say >>39920556
Haysay on the web! >>40391443
SFX separator >>40786997 >>40790270
Synthbot updates GDrive >>41019588
Private "MareLoid" project >>40925332 >>40928583 >>40932952
VoiceCraft >>40938470 >>40953388
Fimfarch dataset >>41027971
5 years of PPP >>41029227
Audio re-up >>41100938
RVC Experiments >>41244976 >>41244980
Ace Studio Demo >>41256049 >>41256783

>The PoneAI drive, an archive for AI pony voice content:
drive.google.com/drive/folders/1E21zJQWC5XVQWy2mt42bUiJ_XbqTJXCp

>Clipper’s Master Files, the central location for MLP voice data:
mega.nz/folder/jkwimSTa#_xk0VnR30C8Ljsy4RCGSig
mega.nz/folder/gVYUEZrI#6dQHH3P2cFYWm3UkQveHxQ
drive.google.com/drive/folders/1MuM9Nb_LwnVxInIPFNvzD_hv3zOZhpwx

>Cool, where is the discord/forum/whatever unifying place for this project?
You're looking at it.

Last Thread:
>>41137243
>>
FAQs:
If your question isn’t listed here, take a look in the quick start guide and main doc to see if it’s already answered there. Use the tabs on the left for easy navigation.
Quick: docs.google.com/document/d/1PDkSrKKiHzzpUTKzBldZeKngvjeBUjyTtGCOv2GWwa0/edit
Main: docs.google.com/document/d/1y1pfS0LCrwbbvxdn3ZksH25BKaf0LaO13uYppxIQnac/edit

>Where can I find the AI text-to-speech tools and how do I use them?
A list of TTS tools: docs.google.com/document/d/1y1pfS0LCrwbbvxdn3ZksH25BKaf0LaO13uYppxIQnac/edit#heading=h.yuhl8zjiwmwq
How to get the best out of them: docs.google.com/document/d/1y1pfS0LCrwbbvxdn3ZksH25BKaf0LaO13uYppxIQnac/edit#heading=h.mnnpknmj1hcy

>Where can I find content made with the voice AI?
In the PoneAI drive: drive.google.com/drive/folders/1E21zJQWC5XVQWy2mt42bUiJ_XbqTJXCp
And the PPP Mega Compilation: docs.google.com/spreadsheets/d/1T2TE3OBs681Vphfas7Jgi5rvugdH6wnXVtUVYiZyJF8/edit

>I want to know more about the PPP, but I can’t be arsed to read the doc.
See the live PPP panel shows presented at /mlp/con for a more condensed overview.
2020 pony.tube/w/5fUkuT3245pL8ZoWXUnXJ4
2021 pony.tube/w/a5yfTV4Ynq7tRveZH7AA8f
2022 pony.tube/w/mV3xgbdtrXqjoPAwEXZCw5
2023 pony.tube/w/fVZShksjBbu6uT51DtvWWz

>How can I help with the PPP?
Build datasets, train AIs, and use the AI to make more pony content. Take a look at the quick start guide for current active tasks, or start your own in the thread if you have an idea. There’s always more data to collect and more AIs to train.

>Did you know that such and such voiced this other thing that could be used for voice data?
It is best to keep to official audio only unless there is very little of it available. If you know of a good source of audio for characters with few (or just fewer) lines, please post it in the thread. A 5.1 mix is generally required unless the source is already clean of background noise, since dialogue can be isolated from the center channel. Preferably post a sample or link. The easier you make it, the more likely it is to be done.
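The reason 5.1 helps is that dialogue sits almost entirely in the center channel, so it can be pulled out cleanly. A minimal sketch of that, assuming a SMPTE-ordered (FL FR FC LFE BL BR) 5.1 WAV and the soundfile library; the filenames are just examples:

import soundfile as sf

audio, sr = sf.read("episode_51_mix.wav")  # shape: (frames, 6) for 5.1
center = audio[:, 2]                       # the FC channel carries nearly all dialogue
sf.write("dialogue_only.wav", center, sr)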

>What about fan-imitations of official voices?
No.

>Will you guys be doing a [insert language here] version of the AI?
Probably not, but you're welcome to try. You can, however, get most of the way there by using phonetic transcriptions of other languages as input for the AI.

>What about [insert OC here]'s voice?
It is often quite difficult to find good quality audio data for OCs. If you happen to know any, post them in the thread and we’ll take a look.

>I have an idea!
Great. Post it in the thread and we'll discuss it.

>Do you have a Code of Conduct?
Of course: 15.ai/code

>Is this project open source? Who is in charge of this?
pony.tube/w/mqJyvdgrpbWgZduz2cs1Cm

PPP Redubs:
pony.tube/w/p/aR2dpAFn5KhnqPYiRxFQ97

Stream Premieres:
pony.tube/w/6cKnjJEZSCi3gsvrbATXnC
pony.tube/w/oNeBFMPiQKh93ePqTz1ns8
>>
File: veryVERYbiganchor.jpg (214 KB, 1024x681)
>>41284357
Anchor.
>>
HEY, FIFTEEN.
Can you add S.h.o.d.a.n. please?
>>
>>41284576
I thought 15 died of ligma
>>
>>41284645
Who is Steve Jobs?
>>
>>41284645
What's 15?
>>
>>41285049
A ghost at this point.
>>
Is there a way to install hay_say_ui with conda instead of Docker?
>>
File: IveGotSomeLearningToDo.png (1.54 MB, 1920x1080)
>>41284357
I made a cover of Lemon Demon's I've Got Some Falling to Do using so-vits. My friendanon made the ponified lyrics and I sang the vocals. I also tried animating the lyrics.
https://youtu.be/atLuQp66BJI
>>
File: 497256287.png (45 KB, 302x313)
I will be upgrading my GPU soon, but I have been completely out of the AI scene as of late. Is NVIDIA still the only option if I want to run anything AI locally? Gaming will be its primary function, but I still want to be able to generate a bit of content as well.
Both are modest GPUs at the same price. I'm deciding between an 8 GB RX 6650 XT and a 12 GB RTX 3060. The AMD GPU performs about 10% better in games on average, going by the benchmarks I saw. Any advice?
>>
>>41286002
While there is better support for AMD nowadays, those 4 extra GB are valuable; the more memory you have, the more local AI models you can access.
>>
File: starry batpone.png (272 KB, 1041x768)
>>41286002
Depends on how much (if any) AI you plan on doing, but AMD support (ROCm) is unfortunately still a bit of a meme, and for now it is limited to just a handful of top-end AMD cards (it will let you try on others, but things will break). Maybe it'll get better "soon", but people have been saying that for the last 2 years or so. Also, last I checked, AMD AI support was completely non-functional on Windows, if that's the OS you use. It's not a dealbreaker since you can dual-boot and Linux is lightweight, but the extra hassle is worth mentioning. So, to go back to the question,
>Is NVIDIA still the only option if I want to run anything AI locally?
In short, yeah.

I built my PC just a couple of months ago, and I specifically opted for the 16 GB version of the 4060 Ti (yes, the meme card) to be able to do textgen at decent speeds. You don't need 16 GB if you're not interested in text (the newest Llama 3.1 can even run reasonably well on 12 GB), but going to 8 GB *and* AMD will hurt basically any AI.
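If you want a quick sanity check once the card is installed, a minimal sketch like this (assuming a CUDA build of PyTorch; ROCm builds expose the same torch.cuda calls) reports what the models will actually see:

import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No GPU visible to torch; local AI will fall back to CPU.")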
>>
>>41284357
>EQG and G5 are not welcome.
Can I ask for ponified EqG? They're the best.
>>
>>41284576
Seconding.
>>
>>41286335
>but going to 8 GB *and* AMD will hurt basically any AI
As an owner of an 8 GB NVIDIA card (due to poorfagging), I have to agree; half the shit people talk about in the AI image/text threads straight up does not load for me due to lack of raw GPU power.
>>
>>41286002
If you want to use an AMD GPU, you have to use Linux.
>>
>>41285481
Short answer: Sorry, but no.
Long answer, if you're feeling adventurous: it may be technically feasible, but it would take a lot of work. The Hay Say UI code communicates with the other Hay Say components (which live in their own Docker containers) by making local web service calls to them. To eliminate Hay Say's dependence on Docker, you would have to install each component individually (by mimicking what its Dockerfile does) and then edit your hosts file to map all the service names to localhost so that the web service communication between them still works. Alternatively, you could install each component in its own virtual machine and set up networking between them, but the performance of that would probably be terrible.
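If anyone does attempt it, the hosts-file step would look something like this (the service names below are hypothetical; take the real ones from the service definitions in Hay Say's docker-compose file):

# /etc/hosts on Linux/macOS, C:\Windows\System32\drivers\etc\hosts on Windows
# Map each Hay Say service name to localhost (names are placeholders).
127.0.0.1   so_vits_svc_server
127.0.0.1   rvc_server
127.0.0.1   controllable_talknet_server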
>>
>10
>>
File: RVC HuBERT PCA.png (216 KB, 1620x2075)
>>41267685
Thanks for trying it. I'm guessing it didn't work because "speaker" isn't represented linearly in HuBERT. I took 50 clips from Twilight and bdl (male from CMU ARCTIC), put them through HuBERT, and computed the PCA for each layer's output. The bottom of pic related shows a complete overlap in the first 2 PCs for all layers, unlike the linear separation seen in the top graph (from https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction).
This makes sense, since HuBERT is trained to discard speaker information. If it didn't, then retrieval probably wouldn't work at all.
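For reference, the probe itself is nothing fancy. A minimal sketch of it, assuming torchaudio's stock HuBERT-base bundle (RVC actually ships ContentVec weights, so this is only an approximation) and a hypothetical folder layout for the clips:

import glob
import numpy as np
import torch
import torchaudio
from sklearn.decomposition import PCA

bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

def per_layer_features(paths):
    # Concatenate frame-level features from every clip, one array per layer.
    acc = None
    for p in paths:
        wav, sr = torchaudio.load(p)
        wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)
        with torch.inference_mode():
            layers, _ = model.extract_features(wav)  # one (1, frames, dim) tensor per layer
        if acc is None:
            acc = [[] for _ in layers]
        for i, h in enumerate(layers):
            acc[i].append(h.squeeze(0))
    return [torch.cat(chunks).numpy() for chunks in acc]

twi = per_layer_features(sorted(glob.glob("twilight/*.wav"))[:50])  # placeholder paths
bdl = per_layer_features(sorted(glob.glob("bdl/*.wav"))[:50])
for i, (a, b) in enumerate(zip(twi, bdl)):
    pcs = PCA(n_components=2).fit_transform(np.concatenate([a, b]))
    # Plot pcs[:len(a)] against pcs[len(a):]; overlap means no linear "speaker" direction.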
The next idea I have is to use MiMiC from https://arxiv.org/abs/2402.09631, which matches both mean and covariance. This apparently reduces "bias by neighbors", which might improve retrieval.

If that fails, then I'm not sure. Maybe the artifacts aren't from speaker info leakage, but something else?
For example, unless I'm reading the code wrong, it seems like RVC is trained on unmodified HuBERT features. This could cause a mismatch since inference uses retrieved features, which usually come from completely different clips. You could try to fix this by passing features through the index first, so you're training on the nearest neighbor of each feature vector.
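If anyone wants to try that, the "snap to the index" step is cheap to prototype. A sketch assuming faiss and a saved feature matrix (RVC actually blends retrieved and raw features by the index rate, which is omitted here):

import faiss
import numpy as np

feats = np.load("train_feats.npy").astype("float32")  # (frames, dim), hypothetical dump
index = faiss.IndexFlatL2(feats.shape[1])
index.add(feats)

def snap_to_index(x):
    # Replace each frame with its nearest training-set neighbor, i.e. what
    # inference-time retrieval at index rate 1.0 would feed the network.
    _, idx = index.search(x.astype("float32"), 1)
    return feats[idx[:, 0]]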
I also wonder how robust retrieval is to time misalignments. e.g. shifting an audio clip by 10ms would change the features slightly, which could cause retrieval to return different results. Maybe this messes up the model in inference.
>>
Is it just me, or is the option for generating StyleTTS2 audio on Hay Say disabled for anybody else too?
>>
>>41288265
It's working for me. Could you provide more details? Like, is the StyleTTS tab missing for you, or are buttons greyed out or does it just hang when generating or something else? Also, are you using haysay.ai or a local installation? If local, what OS do you use?
>>
>>41288375
The button is just greyed out and I can't press it to generate audio.
>>
File: text input.png (3 KB, 406x131)
>>41288450
Did you type anything into the Input section at the top of the page? StyleTTS requires text input, so the Generate button will be greyed out until you provide some text.
>>
>>41288554
Oh, so StyleTTS2 does not work with audio references then?
>>
>youtu.be/eGe4Do1mkFE
>https://files.catbox.moe/mn9sdn.mp3
song for Rarity and jazz enjoyers
>>
>>41288972
StyleTTS can take an audio reference in addition to text, but unlike so-vits or RVC, the reference is used for setting the "style" and prosody of the output rather than doing a voice conversion. For best results, you'll want to use a clip of the target character's voice (not your own voice) that sounds similar (e.g. emotion, pace, etc) to how you want the line delivered.
>>
File: 1092216.gif (2.1 MB, 455x347)
>>
Is 15.ai all of a sudden gonna come back?
>>
>>41291330
Anything that makes you say that?
>>
>>41291330
I seriously doubt it, but I wish he would at least open-source this shit. Even if you needed a supercomputer to run it, it would be better than paying ElevenLabs.
>>
File: 1593625004854.jpg (728 KB, 1536x1536)
https://vocaroo.com/15DZbXqTkx4g
https://pomf2.lain.la/f/41cwwxss.mp4
https://www.udio.com/songs/fgAs8B89VVZ6erafA5jY2S

Udio-130
A Serenade for the Garden

[Verse 1]
Golden sun in the morning
Wishes in the breeze
Dreams scattered like leaves (dreams scattered like leaves)
Royal, serene
Blossoms dancing, floating free
Seeking a tender plea

[Chorus]
Princess Celestia, don't pee here
In this sacred place, pretty please
Keep it pure with your grace
In these gardens, in these trees

[Verse 2]
Luminous skies at twilight
Whispers from the stars
Harmony, we are (harmony, we are)
Regal, so bright
Twilight sparkling in your eyes
Faith in this calm night

[Chorus]
Princess Celestia, don't pee here
In this sacred place, pretty please
Keep it pure with your grace
In these gardens, in these trees

[Bridge]
Majesty, cascading light
Through the ages, through the night
Let it be, clear and bright
Harmony in our sight

[Chorus]
Princess Celestia, don't pee here
In this sacred place, pretty please
Keep it pure with your grace
In these gardens, in these trees
>>
>>41291330
Even if that's the case, he has a poor track record of "I'm updating, so nobody can use it for another 6 months."
>>
>MLPRegressor
Which one of you fags is responsible for this?
>>
>>41292298
I accidentally deleted it from Udio: while trying to move it to a different playlist, I selected the songs from the recently-created list and clicked Delete. If I had selected them from within a playlist, the button would have been "Remove Songs from Playlist" instead.

So here are a couple of songs I saved.

Prompt: a song by Pinkie Pie, who is for some reason asking Princess Celestia not to pee in the Sugarcube Corner! Of course, Celestia wasn't going to do that, but you know Pinkie.

https://pomf2.lain.la/f/pp0tz2kf.mp4
https://voca.ro/1gfw6B89FeLU

A Sweet Plea

[Verse 1]
Princess Celestia
Please hear my plea
Sugarcube Corner
Needs to be pee-free
Just for a second
Think about the pies
Twilight’s here too
We're aiming for the skies

[Chorus]
Please don’t pee
Here in the shop
Let's keep the sprinkles
And the fun non-stop
Please don’t pee
Here in the shop
Let's keep the sprinkles
And the fun non-stop

[Verse 2]
Cupcakes and muffins
Need to stay clean
Cookies and brownies
And all treats between
Just for our sakes
Let's play, not weep
We'll laugh forever
With our memories to keep

[Chorus]
Please don’t pee
Here in the shop
Let's keep the sprinkles
And the fun non-stop
Please don’t pee
Here in the shop
Let's keep the sprinkles
And the fun non-stop

[Bridge]
Think of the laughter
Everyone will share
We can have fun
Without any scare

[Chorus]
Please don’t pee
Here in the shop
Let's keep the sprinkles
And the fun non-stop
https://voca.ro/1iYgDeFxIHJz

A Sweet Request

[Verse 1]
Oh Celestia, what a sunny day,
But there's one tiny thing I need to say!
Please, don't pee in Sugarcube Corner,
I know you wouldn't, but I gotta warn ya!
It's a place for cupcakes, not a place for piddles,
Save that for the loo, in the castle riddles!

[Chorus]
Oh Princess dear, keep it clear,
The Corner's for sweets, not what you excrete!
For cupcakes and treats, come inside,
Just keep your potty breaks outside!

[Verse 2]
All the ponies gather, and we have such fun,
But pee on the floor? No, that's not done!
Sprinkles and frosting, that's all we need,
Keep it sugar-sweet, that's Pinkie's creed!
Let's keep it tidy, no puddles here,
Just laughter and joy, and lots of cheer!

[Chorus]

[Bridge]
Celestia, I know you're grand,
With a kingdom at your command,
Let's keep our Corner pristine clean,
For everypony, king or queen!

[Chorus]
>>
And I know somepony would ask about this, so here is a different request. It seems like "pee" stands for "be" in most cases here.

https://voca.ro/13V22n5a7DPk

A Plea for Light (to Princess Celestia)
male vocalist, electronic, pop, indie pop, synthpop, electropop, bittersweet

Lyrics

[Verse 1]
Dear Princess Celestia, hear my plea
Way up in Canterlot, beautiful and free
I need your magic, your golden light
To chase away the shadows, deep in the night

[Chorus]
Dear Princess Celestia
Won't you come near?
Dear Princess Celestia
Please, please pee here

[Verse 2]
It's an odd request, I know it sounds strange
But your grace and power, can help me change
I feel so broken, lost in the dark
I need your warmth, to light a spark

[Chorus]
Dear Princess Celestia
Won't you come near?
Dear Princess Celestia
Please, please pee here

[Bridge]
Just one touch of divine light
Could make everything alright

[Chorus]
Dear Princess Celestia
Won't you come near?
Dear Princess Celestia
Please, please pee here
>>
>>41291330
Something, something, flying pigs.
>>
>>41294609
We need some more Irish/Saxon-style poetry going on with ponies.
>>
File: 1702555648441938.gif (3.8 MB, 502x502)
>>
>>41287953 (1/2)
>I'm guessing it didn't work because "speaker" isn't represented linearly in HuBERT
Most likely
>This makes sense, since HuBERT is trained to discard speaker information
A bit of confusion: apparently even though they called their checkpoint "hubert_base" it's actually ContentVec:
https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/issues/2078

AFAIK ContentVec applies things like pitch and formant shifting to try to disentangle "content" from speaker identity, but even with transformations like that, you'd still be able to tell from accents, for instance, whether something was said by Applejack vs. Rarity. And on top of that, I'm still not sure moving each individual feature closer to the nearest features the model saw in training is necessarily the "correct" mapping to go from one voice to another. Using phonemes as an analogy, there's no inherent reason to believe that my AE is closest to Rarity's AE vs. any other phoneme in feature space (by the fact that retrieval does seem to work somewhat, they probably are "close", but the mapping is not perfect). Plus accents introduce a lot of time-dependent variations that probably wouldn't be fixed by retrieval much, if at all.

I think the RVC maintainers might be getting at the same issue with RVCv3--they say something about being limited by ContentVec here:
https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/issues/2013

>MiMiC
It looks like they're just using stuff from an existing library (ot.da.LinearTransform) to do this? I fit it on the input features vs. concatenated features from 50 samples of the target speaker. Doesn't really seem to do much but introduce a few small artifacts:
>No MiMiC, index rate 0.75
https://files.catbox.moe/swdkjw.mp3
>MiMiC, index rate 0.75
https://files.catbox.moe/o0r03m.mp3
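For comparison, a bare mean-plus-covariance match (one valid linear map, though not necessarily the OT-optimal one MiMiC derives) can be written without the library at all; a sketch, with all names hypothetical:

import numpy as np
from scipy.linalg import sqrtm

def match_mean_cov(x, src, tgt):
    # Whiten under src's statistics, then color with tgt's, so the mapped
    # features take on tgt's empirical mean and covariance exactly.
    mu_s, mu_t = src.mean(0), tgt.mean(0)
    eps = 1e-6 * np.eye(src.shape[1])
    cov_s = np.cov(src, rowvar=False) + eps
    cov_t = np.cov(tgt, rowvar=False) + eps
    A = np.real(sqrtm(cov_t)) @ np.linalg.inv(np.real(sqrtm(cov_s)))
    return (x - mu_s) @ A.T + mu_t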

>Maybe it's something else?
Possible, but it would also have to account for StyleTTS2 generated inputs for the same character showing better output?
https://files.catbox.moe/vd9op0.mp3
Me doing my best attempt to match: https://files.catbox.moe/k53i0b.mp3
Another styletts2 gen: https://files.catbox.moe/iqwa5o.mp3
I also wonder if pitch contours have something to do with it--I feel like pitch detection is a bit dodgier for lower pitches (lower resolution).
>>
>>41287953
>>41296158 (2/2)
>Retrieval mismatch
That is a good point--I hadn't thought about trying to fix the problem from that end. It might be asking the network and finetuning process to do some heavy lifting though? (Might also be why sibilants get a lot buzzier and there are more clicky artifacts when you use retrieval: the network is seeing features in a sequence it hasn't learned to interpret?)

>I also wonder how robust retrieval is to time misalignments. e.g. shifting an audio clip by 10ms would change the features slightly, which could cause retrieval to return different results. Maybe this messes up the model in inference.
Tested this. It can definitely introduce some small crackly artifacts, although they're not exactly the kind of artifacts we're looking at.
Index rate of 1.0 used for all samples.
0ms offset: https://files.catbox.moe/3pcjhu.mp3
50ms offset: https://files.catbox.moe/qwi1fc.mp3
110ms offset: https://files.catbox.moe/yx8cia.mp3
Might be something worth putting into the training end if we decide to go that route?
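For anyone reproducing this, offsets like those above can be generated by zero-padding the head of a mono clip, e.g. (an assumed utility, not the exact code behind these samples):

import numpy as np
import soundfile as sf

wav, sr = sf.read("input.wav")  # assumes a mono clip
for ms in (0, 50, 110):
    padded = np.concatenate([np.zeros(int(sr * ms / 1000)), wav])
    sf.write(f"offset_{ms}ms.wav", padded, sr)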

>Other thoughts
With TalkNet we could be a lot less dependent on the input voice. I almost wonder if it makes sense to fall back to a TalkNet-like system (without the hackiness and resource-intensiveness of just feeding the audio output of one into the other). Maybe you could get around the whole "needing a transcription" problem by training something on ASR decoder logits (or some intermediate feature in an ASR decoder)?

I'm starting to wonder if the apparent better quality of RVCv2 over so-vits-svc 4 is even due to any architecture and implementation differences vs. just having a more robust pretrained model; the two don't seem to be that different.

so-vits-svc 5.0 seems to be a tiny bit better on this front; I'm guessing it's either because A) they add noise to the features during training, and/or B) Whisper encoder features might be slightly more robust against speaker identity since they're explicitly trained on ASR.

>Code
https://github.com/effusiveperiscope/raraai/blob/master/2_featureexpl/add2.ipynb
>>
>>41296161
Also, I finally got around to finishing "transcribing" the leak material. (I used automated processes to do as much of the work as possible and only did one pass of transcription corrections, so there are some questionable clipping points; also, the timestamps are synthetic.)
https://drive.google.com/drive/folders/1XXO8o7O8m-kJGZvkYx3U1eheo2s902N5?usp=sharing

Separated into "Broadcast Stems" and "Studio Recordings" so it's easier to tell which are unprocessed recordings.



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.