Welcome to the Pony Voice Preservation Project!
youtu.be/730zGRwbQuE

The Pony Preservation Project is a collaborative effort by /mlp/ to build and curate pony datasets for as many applications in AI as possible.

Technology has progressed such that a trained neural network can generate convincing voice clips, drawings and text for any person or character using existing audio recordings, artwork and fanfics as a reference. As you can surely imagine, AI pony voices, drawings and text have endless applications for pony content creation.

AI is incredibly versatile: basically anything that can be boiled down to a simple dataset can be used for training to create more of it. AI-generated images, fanfics, wAIfu chatbots and even animation are possible, and are being worked on here.

Any anon is free to join, and there are many active tasks that would suit any level of technical expertise. If you’re interested in helping out, take a look at the quick start guide linked below and ask in the thread for any further detail you need.

EQG and G5 are not welcome.

>Quick start guide:
docs.google.com/document/d/1PDkSrKKiHzzpUTKzBldZeKngvjeBUjyTtGCOv2GWwa0/edit
Introduction to the PPP, links to text-to-speech tools, and how (You) can help with active tasks.

>The main Doc:
docs.google.com/document/d/1y1pfS0LCrwbbvxdn3ZksH25BKaf0LaO13uYppxIQnac/edit
An in-depth repository of tutorials, resources and archives.

>Active tasks:
Research into text-to-speech
Research into speech-to-speech
Research into chatbots

>Latest developments:
See developments post below

>The PoneAI drive, an archive for AI pony voice content:
drive.google.com/drive/folders/1E21zJQWC5XVQWy2mt42bUiJ_XbqTJXCp

>Clipper’s Master Files, the central location for MLP voice data:
mega.nz/folder/jkwimSTa#_xk0VnR30C8Ljsy4RCGSig
mega.nz/folder/gVYUEZrI#6dQHH3P2cFYWm3UkQveHxQ
drive.google.com/drive/folders/1MuM9Nb_LwnVxInIPFNvzD_hv3zOZhpwx
https://huggingface.co/datasets/synthbot/pony-speech
https://huggingface.co/datasets/synthbot/pony-singing

>Cool, where is the discord/forum/whatever unifying place for this project?
You're looking at it.

Last Thread:
>>41498541
>>41571795
>Latest developments
https://ponepaste.org/10430

FAQs:
If your question isn’t listed here, take a look in the quick start guide and main doc to see if it’s already answered there. Use the tabs on the left for easy navigation.
Quick: docs.google.com/document/d/1PDkSrKKiHzzpUTKzBldZeKngvjeBUjyTtGCOv2GWwa0/edit
Main: docs.google.com/document/d/1y1pfS0LCrwbbvxdn3ZksH25BKaf0LaO13uYppxIQnac/edit

>Where can I find the AI text-to-speech tools and how do I use them?
A list of TTS tools: docs.google.com/document/d/1y1pfS0LCrwbbvxdn3ZksH25BKaf0LaO13uYppxIQnac/edit#heading=h.yuhl8zjiwmwq
How to get the best out of them: docs.google.com/document/d/1y1pfS0LCrwbbvxdn3ZksH25BKaf0LaO13uYppxIQnac/edit#heading=h.mnnpknmj1hcy

>Where can I find content made with the voice AI?
In the PoneAI drive: drive.google.com/drive/folders/1E21zJQWC5XVQWy2mt42bUiJ_XbqTJXCp
And the PPP Mega Compilation: docs.google.com/spreadsheets/d/1T2TE3OBs681Vphfas7Jgi5rvugdH6wnXVtUVYiZyJF8/edit

>I want to know more about the PPP, but I can’t be arsed to read the doc.
See the live PPP panel shows presented at /mlp/con for a more condensed overview.
2020 pony.tube/w/5fUkuT3245pL8ZoWXUnXJ4
2021 pony.tube/w/a5yfTV4Ynq7tRveZH7AA8f
2022 pony.tube/w/mV3xgbdtrXqjoPAwEXZCw5
2023 pony.tube/w/fVZShksjBbu6uT51DtvWWz

>How can I help with the PPP?
Build datasets, train AIs, and use the AI to make more pony content. Take a look at the quick start guide for current active tasks, or start your own in the thread if you have an idea. There’s always more data to collect and more AIs to train.

>Did you know that such and such voiced this other thing that could be used for voice data?
It is best to keep to official audio only unless there is very little of it available. If you know of a good source of audio for characters with few (or just fewer) lines, please post it in the thread. 5.1 is generally required unless you have a source already clean of background noise. Preferably post a sample or link. The easier you make it, the more likely it will be done.

>What about fan-imitations of official voices?
No.

>Will you guys be doing a [insert language here] version of the AI?
Probably not, but you're welcome to. You can however get most of the way there by using phonetic transcriptions of other languages as input for the AI.

>What about [insert OC here]'s voice?
It is often quite difficult to find good quality audio data for OCs. If you happen to know any, post them in the thread and we’ll take a look.

>I have an idea!
Great. Post it in the thread and we'll discuss it.

>Do you have a Code of Conduct?
Of course: 15.ai/code

>Is this project open source? Who is in charge of this?
pony.tube/w/mqJyvdgrpbWgZduz2cs1Cm

PPP Redubs:
pony.tube/w/p/aR2dpAFn5KhnqPYiRxFQ97

Stream Premieres:
pony.tube/w/6cKnjJEZSCi3gsvrbATXnC
pony.tube/w/oNeBFMPiQKh93ePqTz1ns8
>>41571795
>>41571851
I'm not the usual OP. Sorry if I got anything wrong. I can update the Latest Developments paste if anyone has suggestions or corrections. I made one change to the OP:
- Added the Huggingface clone of Clipper's Master Files.

>Will you guys be doing a [insert language here] version of the AI?
>Probably not, but you're welcome to. You can however get most of the way there by using phonetic transcriptions of other languages as input for the AI.
For future threads, the answer here can probably be updated, since the fine-tuned GPT-SoVITS seems very capable of generating Japanese speech, and probably Mandarin/Cantonese and Korean too.
>>41571963
I didn't even know where to look for the thread until I looked in the catalog thing, I thought for sure that 149 wasn't a thing yet...
Rainbow Dash GPT-SoVITS Model (GPT 8, SoVITS 24)
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Rainbow-SVe24-GPTe8
Reference: https://files.catbox.moe/r2v0mv.mp3
Multispeaker: https://files.catbox.moe/kiydla.mp3
Individual: https://files.catbox.moe/nwdqh2.mp3
Reference: https://files.catbox.moe/mkotd9.mp3
Multispeaker: https://files.catbox.moe/csya0e.mp3
Individual: https://files.catbox.moe/m4f76x.mp3

Pinkie Pie GPT-SoVITS Model (GPT 8, SoVITS 24)
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Pinkie-SVe24-GPTe8
Reference: https://files.catbox.moe/5d77ck.mp3
Multispeaker: https://files.catbox.moe/2kmvgv.mp3
Individual: https://files.catbox.moe/ok3mbn.mp3
Reference: https://files.catbox.moe/2btax8.mp3
Multispeaker: https://files.catbox.moe/7w4b6q.mp3
Individual: https://files.catbox.moe/ksirb0.mp3

Fluttershy GPT-SoVITS Model (GPT 8, SoVITS 24)
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Fluttershy-SVe24-GPTe8
Reference: https://files.catbox.moe/06oyrp.mp3
Multispeaker: https://files.catbox.moe/7yiplw.mp3
Individual: https://files.catbox.moe/b7gqjx.mp3
Reference: https://files.catbox.moe/lc1z49.mp3
Multispeaker: https://files.catbox.moe/exekkh.mp3
Individual: https://files.catbox.moe/efzd04.mp3

Definitely some cases in which I'm not sure we gain anything over the multispeaker model, but they exist now I guess.
That rounds out the Mane 6.
>>41572507
It appears that we slid off the catalog in the middle of US night/early morning.
unofficial anchor post for unofficial thread
Fimfiction groups are scraped.
https://github.com/uis246/fimfarc-search/releases/download/0.1-rc2/fimfgroups-20241022.tar.xz
Quick guide to the archive structure:
group-names - table with 2 columns: group id, group name
out*/ - directories for each depth in the folder tree
out*/.folders - table with 3 columns: group id, folder id, parent folder id
out*/.names - table with 2 columns: folder id, folder name
out*/* - lists of fanfics in corresponding folders
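For anyone who wants to poke at the dump from Python, here's a minimal sketch of walking the tables. It assumes each table is plain text with one record per line and tab-separated fields; check the actual delimiter in your extracted copy before relying on it.

import csv
from pathlib import Path

ARCHIVE = Path("fimfgroups-20241022")  # wherever you extracted the tar.xz

# group-names: group id, group name
with open(ARCHIVE / "group-names", newline="", encoding="utf-8") as f:
    groups = {gid: name for gid, name in csv.reader(f, delimiter="\t")}

# each out*/ directory is one depth of the folder tree
for depth_dir in sorted(ARCHIVE.glob("out*")):
    # .folders: group id, folder id, parent folder id
    with open(depth_dir / ".folders", newline="", encoding="utf-8") as f:
        folders = {fid: (gid, parent) for gid, fid, parent in csv.reader(f, delimiter="\t")}
    # .names: folder id, folder name
    with open(depth_dir / ".names", newline="", encoding="utf-8") as f:
        folder_names = dict(csv.reader(f, delimiter="\t"))
    print(depth_dir.name, len(folders), "folders")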
>>41572514
>>41558920
I made an audio dataset for Button's mom from 2014, before she got sick and her voice changed some more:
https://mega.nz/folder/PiQATIbC#XhtJf-n5Y6ug2SFrztjT7A
Her voice got deeper after 2012-2013, and I have to find the audio project file for 9 minutes of good mommy data and upload it to replace the 5 minutes of data in Clipper's archive.
>>41572514
Spike GPT-SoVITS (GPT 8, SoVITS 24)
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Spike-SVe24-GPTe8
Reference: https://pomf2.lain.la/f/mtcmnad.mp3
Generated: https://pomf2.lain.la/f/2og0me18.mp3
Not the best reference since it has a cut in the middle, but whatever.
>>41571963
You did fine. Thanks for making the thread, I was just about to start working on it. Usually I try to prepare for the next thread around post 400.
Also, the trend of new anti-spam bullshit each thread continues. I wonder what it'll be next thread?
>>41573536
>https://files.catbox.moe/e5w4fh.zip
Any chance you could do something with the above few-second clips of Prince Blueblood?
>https://files.catbox.moe/d9m1no.zip
Alternatively, I have this one minute of synthesized PB voice if the above is not enough.
>>41573818
I could look into it, but I'd like to finish training the models for the common characters before I start mucking about with very-low-data ones, which I imagine will require some finagling with multispeaker models.
>>41573892
I hope you can do Meadowbrook. God I love her sweet Cajun voice.
>>41573536
Celestia
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Celestia-SVe24-GPTe8
ref: https://files.catbox.moe/e0fewc.mp3
generated: https://files.catbox.moe/oqnf90.mp3

Luna
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Luna-SVe24-GPTe8
ref: https://files.catbox.moe/0nnthl.mp3
generated: https://files.catbox.moe/bvmb34.mp3
ref: https://files.catbox.moe/3zfbva.mp3
generated: https://files.catbox.moe/24n0da.mp3

Glimglam
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Starlight-SVe24-GPTe8
ref: https://files.catbox.moe/8uudhe.mp3
generated: https://files.catbox.moe/7odtqn.mp3

The choice of reference audio seems to have pretty violent effects on output quality. Not sure what's up with that.
I wonder if there's a way to modify pronunciations.
>>41574170
Yes, but you have to retrain (expression translated from Chinese), with this: https://huggingface.co/Systran/faster-whisper-large-v3
>>41574170
Not sure if it was already posted, but what are the GPT-SoVITS-v2 RAM demands just for generating a 10~20 second output?
>>41574537
~3 GB with a batch size of 20
I did post this, but it was in the other thread.
>>41574331
I meant at inference time, but this could also be helpful.
>>41574933
Huh, that's not bad. How long does inference typically take for you, and would more RAM allowance (say... double) improve upon that speed, and/or just expand how large the batch output is?
>>41574170
Apple Bloom! Apple Bloom! Apple Bloom!
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Apple%20Bloom-SVe24-GPTe8
ref: https://files.catbox.moe/9g6eff.mp3
gen: https://files.catbox.moe/3dzimz.mp3

>>41575510
I'd say RTF is roughly 16% to 40% on a 3080 Ti, i.e. generating a 60 second output takes 10 to 24 seconds (faster than realtime). Adjusting the batch size (at least in their gradio interface) doesn't seem to do anything; not sure how it works.
>>41575835
Sweeble.
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/SweetieBelle-SVe24-GPTe8
Reference: https://files.catbox.moe/vj3zv7.mp3
Generation: https://files.catbox.moe/4d204z.mp3

Ok, so batched inference doesn't work the way I expected. I think what the GUI is doing is using a user-selectable slicing method to split the input into "batches". So a batch can be four sentences, or one sentence--but this means you won't see the time benefits or memory usage of batching unless you synthesize more than one/four sentences at a time (which is unusual for most content creation, but may be more suitable for certain automated systems/audiobooks).
Long demo: https://files.catbox.moe/yka037.mp3
The memory usage doesn't seem to be consistent with just batch size; it also varies with slice length and from inference to inference. If I slice by sentences, I can infer the ~2k word passage with a batch size of 20 under 14 GB, and a batch size of 10 under 6 GB. If I slice by 4 sentences I OOM. Inference time varies (obviously variance increases with the text length); for this passage it took anywhere from 26 sec to 1 min.
>>41575835
>>41576416
Wew, that's really impressive. I think book narration still needs a bit more, but I can see the tech is there and it manages it. Biggest concern would be it randomly hallucinating in long inference, which would be difficult to detect.
>>41576416
Could you also train a specific Squeaky Belle model at some point please? Would love to see AI attempt that, and it would feel more early-seasons.
>>41576416
Scoot Scootaloo.
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Scootaloo-SVe24-GPTe8
ref: https://files.catbox.moe/8xy1ud.mp3
gen: https://files.catbox.moe/cmq12h.mp3
ref: https://files.catbox.moe/tvtvuz.mp3
gen: https://files.catbox.moe/xkjobn.mp3
no ref gen: https://files.catbox.moe/klbjwq.mp3

>>41577055
Noted.
>>41577055
>https://huggingface.co/Amo/RVC_v2_GA/tree/main/models/MLP_Sweetie_Belle_Squeeky
There is an RVC2 model of her, if you are interested.
>>41574170
I want to hear Celestia and Luna narrating books/articles. They're very well spoken in these gens.
>>41577119
I feel that would require squeaky voice acting to work, and unless it's accidental it loses its charm. Thanks though, could be good to pair with a later Squeaky TTS.
>>41577113
Glad to hear references aren't necessary for the voice qualities to surface and be retained in inference.
>>41575835
Forgot to mention, Bloom feels the most accurate so far, and really highlights how well it matches her usual pitch changes. Awesome to see AI less restrictive in its convincing vocal ranges.
>>41577113
so lightly glossing the github, is it really just taking like 5 seconds of audio as reference and then using that to make these generated voices you're posting? if so it sounds really clear
>>41577207
Well, first I have to finetune the model. The reference audio just helps with steering towards the final desired timbre (the Scootaloo post has examples generated both with references and one without to show the difference). Trying to use the reference audio on the base, untrained model does not produce good results.
>>41577184
>feel that would require squeaky voice acting to work
Not really. I don't have the test file on me, but I remember it was pretty decent at forcing whatever input audio to sound like her S1 squeaky self.
>https://files.catbox.moe/ltfpaw.zip
Here is a wav folder with 6 minutes of just squeaky SB if you need it.
>>41569387
>Hay Say using 0% of GPU
Open the docker-compose.yaml file. There are several commented-out sections that start with "deploy:". Uncomment those sections and restart Hay Say. You should then see the option to generate with GPU above the Generate button.
>>41577113
The Grrrrrreat and Powerful Trrrixie!
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Trixie-SVe24-GPTe8
>Can she roll her R's?
Ref: https://files.catbox.moe/q05hko.mp3
Prompt:
>The Great and Powerful Trixie, has no need for your frivolous manual! Trixie is a master of magic and illusion! Everything she does is a display of sheer brilliance! Instructions are for those who lack natural talent, unlike Trixie!
Gen: https://files.catbox.moe/c6qdy8.mp3
Gen: https://files.catbox.moe/mg0wnu.mp3
Gen (less roll): https://files.catbox.moe/upw9ij.mp3
Sort of, sometimes, apparently!
>>41577801
>Additional postprocessing?
RVC (using a retrieval ratio of 0.75):
https://files.catbox.moe/bisplb.mp3
https://files.catbox.moe/f6heuq.mp3
>>41577801
>Sort of, sometimes, apparently!
Holy shit.
>>41577801
Every turn this TTS AI surprises me with its capabilities. It replicated Trixie REALLY well, especially with this (>>41577843) additional pass, which I imagine is also responsible for clearing most of the buzz and noise. I'm starting to wonder if there's even a pony voice odd or unique enough to stump it. Discord maybe? Breezies? Chrysalis?
>>41577801
Peetzer (GPT epoch 10, SoVITS epoch 16)
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Cadance-SVe16-GPTe10
Ref: https://files.catbox.moe/8kg1ov.mp3
Gen: https://files.catbox.moe/3i6sgh.mp3
Ref: https://files.catbox.moe/zdyuu1.mp3
Gen: https://files.catbox.moe/1sj35x.mp3

Maybe I'm not familiar enough with Cadance's vocal timbre, but this one feels weird to me, which is why I spent so much time deliberating over which combination to use. Didn't like the 24th SoVITS epoch; too much buzziness, like when you overtrain an RVC model. I suspect we're at the point where the quantity of data for the character becomes a noticeable problem.

>>41578124
Not so much the uniqueness of the voice as much as the availability of data.
STTATTS: Unified Speech-To-Text And Text-To-Speech Model
https://arxiv.org/abs/2410.18607
>Speech recognition and speech synthesis models are typically trained separately, each with its own set of learning objectives, training data, and model parameters, resulting in two distinct large networks. We propose a parameter-efficient approach to learning ASR and TTS jointly via a multi-task learning objective and shared parameters. Our evaluation demonstrates that the performance of our multi-task model is comparable to that of individually trained models while significantly saving computational and memory costs (~50% reduction in the total number of parameters required for the two tasks combined). We experiment with English as a resource-rich language, and Arabic as a relatively low-resource language due to shortage of TTS data. Our models are trained with publicly available data, and both the training code and model checkpoints are openly available for further research.
https://github.com/mbzuai-nlp/sttatts
No examples, but weights are up. Voice conversion is one of the tasks.
>>41578173
Man, there are so many of those "it might be cool" projects out there. I just wish PPP and Chag weren't the only groups that took chances with developing stuff further.
>>41578124
>>41578166
Ok, Discord. (GPT epoch 48? SoVITS epoch 96?!)
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Discord-GPTe48-SVe96
ref: https://files.catbox.moe/ich5gc.mp3
gen: https://files.catbox.moe/59y7pi.mp3
ref: https://files.catbox.moe/dj7rzm.mp3
gen: https://files.catbox.moe/i5nrni.mp3

First GPT-SoVITS L? Or is it a me problem? I tried quite a lot, as you can tell from the epochs. The increased GPT seems to help with the framiness, but there are some occasional pronunciation misses. It seems to have a lot of trouble modeling the deeper voice--maybe the base model is biased towards higher voices, or the analysis frames are too short to model lower f0?
>>41579340
Crazy Glue (GPT epoch 48? SoVITS epoch 24)
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/CozyGlow-SVe24-GPTe48
ref: https://files.catbox.moe/0kpx47.mp3
gen: https://files.catbox.moe/5hze52.mp3
Not sure how faithful this is, but for some reason increasing GPT epochs seemed to increase the "resemblance" I perceived.
>>41579585
can you do derpy
>>41579340
It sounds pretty spot on to me, though there might be some subtle differences, yeah. Not perfect. I think. It's hard to tell honestly. We're in the subtle territory for AI now it seems.
>>41579585
T-rex
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Tirek-SVe32-GPTe32
ref: https://files.catbox.moe/byfz9o.mp3
gen: https://files.catbox.moe/3j1f3g.mp3
ref: https://files.catbox.moe/57fg6x.mp3
gen: https://files.catbox.moe/zsnanq.mp3

Chryssie
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Chrysalis-SVe32-GPTe8
ref: https://files.catbox.moe/sns1uo.mp3
gen: https://files.catbox.moe/ljpexn.mp3
ref: https://files.catbox.moe/iafgjk.mp3
gen: https://files.catbox.moe/hl3tse.mp3
ref: https://files.catbox.moe/hd9no1.mp3
gen: https://files.catbox.moe/p4dnvf.mp3
Noticeably more finicky and lower quality, both in audio quality and pronunciations.

>>41579620
Noted
>>41573818
Prince Blueblood (?) (GPT 40, SoVITS 32)
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Blueblood-SVe32-GPTe40
ref: https://files.catbox.moe/lmo8x7.mp3
gen: https://files.catbox.moe/2xr3dk.mp3
Well, that's an impressive performance for 17 seconds of audio. I wonder how much of it is inherited from the base model, and how well it works for other characters. I had to stitch two reference audios together to hit the 3 second requirement. I still can't tell what the "auxiliary reference audios" slot does.
>>41581034
>I still can't tell what the "auxiliary reference audios" slot does.
What I've been told is that it "averages" the tone of the audios.
>>41581034
>>41574083
Meadowbrook (?) (GPT 48, SoVITS 24)
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Meadowbrook-SVe24-GPTe48
Ref: https://files.catbox.moe/3s9n96.mp3
Gen: https://files.catbox.moe/2be3l9.mp3
Gen using entire dataset as aux reference: https://files.catbox.moe/ek4uoy.mp3
These inflections are really weird. The accent isn't 100% there, but I doubt we'll get much closer.
>>41577055
>>41581357
Squeaky Belle (?) (GPT 32, SoVITS 48)
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/%20SqueakyBelle-SVe48-GPTe32
Ref: https://files.catbox.moe/xd9qb2.mp3
Gen: https://files.catbox.moe/u4gng7.mp3
Well, no real squeaks. I noticed there was some S2 material in there too; I wonder if that affects anything?
>>41579620
Tabitha's Derpy (GPT epoch 24, SoVITS epoch 36)
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Derpy-SVe36-GPTe48
ref: https://files.catbox.moe/bl8oom.mp3
gen: https://files.catbox.moe/8mroi0.mp3
Doesn't seem to follow the reference timbre that well?
>>41581039
Thanks.
I'm not really sure what the "no reference" option (or whatever Google Translate missed) actually does either. The underlying pipeline doesn't seem to allow you to generate "without a reference" until something has already been generated--and there seems to be some kind of underlying caching of the reference audio and the requests that the pipeline receives (so if you screw up a request by selecting the wrong reference audio language once, it's just perma-fucked until you restart the webui).
>>41581357
Daym, that's impressive. It seemed impossible just a year ago.
An update of a shitpost
raw gpt-sovits: https://files.catbox.moe/l9qtqe.mp3
post-rvc + pitch shift: https://files.catbox.moe/v80sqy.mp3
>>41581034
>>41582059
Oh man, this is really bloody based. There is a whole list of characters that I would love to have a voice model for, but their datasets are limited to 10-30 seconds. But this, fuck me, this is a proper game changer.
>>41581357
>>41581823
I can see there are some instructions on how to use different languages for TTS, but did the original developers provide instructions on how to add another language? I would imagine for 99% of people on the board it would be useless, as English is the go-to language, but I would be interested in whether a new language could be added (or if it would require retraining the base model from scratch)?
>>41582329
I'd love a Russian dub with original voices, instead of that terrible official one.
>>41581823
I love it! Not very squeaky, but it does sound within the right era.
>>41582059
Hmm, might sound better with a less anxious line? Maybe with something like these:
https://files.catbox.moe/7rf4vw.wav
https://files.catbox.moe/2kg7w8.wav
>>41582298
Kek. I think it falters at the end; most of these tend to, come to think of it. Maybe with this AI we'd have to get in the habit of leaving an additional dummy sentence at the end to snip out?
>>41582329
Don't know, I don't think they have any base training instructions.
>>41582575
You don't have to generate the entire thing at once.
>>41582329
I don't see any instructions on how to add new languages. Glancing through the repo, however, it looks like it would require some modifications to the code because it checks the language you pass against a whitelist and also does some mapping (e.g. "en" -> "english"). You'd also need to write a phoneme tokenizer specific to the language you want. The built-in tokenizers are located here: https://github.com/RVC-Boss/GPT-SoVITS/tree/main/GPT_SoVITS/text. I'm not 100% sure, but I don't think you'd need to retrain the base model.
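To get a feel for what a new-language front end would have to produce, here's a rough standalone sketch using the phonemizer library (espeak-ng backend) to turn, say, Russian text into a phoneme string. This is just an illustration, not the tokenizer interface GPT-SoVITS actually expects; you'd still have to map the output onto whatever symbol set the modules under GPT_SoVITS/text use.

# pip install phonemizer, and have espeak-ng installed on the system
from phonemizer import phonemize

text = "Привет, пони"
phones = phonemize(
    text,
    language="ru",        # espeak-ng language code for the new language
    backend="espeak",
    strip=True,
    preserve_punctuation=True,
)
print(phones)  # IPA-ish phoneme string you'd then map to the model's symbols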
>>41583252
I have a feeling the existing tokenizers could use some updating, considering the occasional (but strongly) mispronounced words shown in the recent examples. Or perhaps establish a way to select from specific tokenizers, to account for both using manually updated ones and selecting differing pronunciations with ones tailored to accent and region.
In any case, not something overly urgent or to consider working on at this stage, but something of note when it comes to QoL improvements.
Edge case: long sentence, no punctuation
https://gist.github.com/effusiveperiscope/2740ec098c0834dee76919e0c2e205b3
https://files.catbox.moe/33ltgn.mp3
Seems to be suffering the usual transformer context length issue.
Uses 3.8 GB with a batch size of 1, which could be problematic for small GPUs.
>>41584258
>>41581357
Hm... going back and comparing these, I wonder if I was premature in stopping the Mane 6 training.
>>41584264
Whether you continue training from the current models or start anew for extended training, the existing ones are still great overall, so perhaps save the new batch separately and mark it as something like "v2" or "EXT". The existing ones would be useful as a baseline to compare against anyway.
>>41583036
https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-ssml-phonetic-sets
As you said, this is not high on the list of priorities, but if you do choose to look into this, I found the above link that may or may not be a bit useful here.
Bump
>>41584264
>>41584659
Well, I trained the Applejack model up to SoVITS epoch 48 and GPT epoch 32. The high-frequency information increased marginally, but increasing either independently or jointly seems to cost character resemblance. Since this is something you can fix anyways with an RVC postprocessing pass, I decided it's not a worthwhile tradeoff.
Also, I think I can demonstrate what I mean by "framiness" more concretely. Take these samples:
GPT-SoVITS: https://files.catbox.moe/r6zj8m.mp3
RVC: https://files.catbox.moe/b9tzxg.mp3
For GPT-SoVITS, in the words "ends" and "is" you can hear some sort of unnatural discontinuity that RVC figures out it should smooth over. You can see this in the spectrogram--for "ends", the discontinuities are more in the upper harmonics, which seem to momentarily drop out in a place that RVC (and our ears) doesn't agree with, and for "is" there is a very noticeable discontinuity in f0 as well.
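If anyone wants to see the f0 discontinuity numerically rather than eyeballing a spectrogram, a quick sketch with librosa (assumes you've downloaded the two clips locally; the filenames here are made up):

# pip install librosa numpy
import librosa
import numpy as np

for name in ["gptsovits_sample.mp3", "rvc_sample.mp3"]:  # hypothetical local copies
    y, sr = librosa.load(name, sr=None)
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )
    # frame-to-frame jumps in f0; unusually large jumps inside voiced regions
    # are the "framiness" being described above
    jumps = np.abs(np.diff(f0))
    jumps = jumps[~np.isnan(jumps)]
    print(name, "median f0 jump:", np.median(jumps), "max:", np.max(jumps))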
>>41586799
Huggingface's scanner has some unspecified issues with s2G488k.pth in the GPT-SoVITS HF repo. I checked the file and didn't see anything dangerous in its data.pkl, though my checks https://ponepaste.org/10436 are pretty crude. If you can load the model using use_safetensors=True, that would be good. It's not a big deal for colab, but you might want to do this when loading the model locally. People are starting to upload more malware to Huggingface.
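For anyone who wants to do a similar (still crude) sanity check locally before torch.loading random .pth files: a .pth checkpoint is just a zip, and pickletools can disassemble the embedded data.pkl so you can eyeball which globals the pickle would import (anything outside torch/collections/numpy is a red flag). The internal path of data.pkl varies between files, hence the search; this is a sketch, not a substitute for a real scanner.

import io
import pickletools
import zipfile

def dump_pickle_globals(pth_path):
    with zipfile.ZipFile(pth_path) as zf:
        # torch zip checkpoints store the pickle as <archive_name>/data.pkl
        for name in (n for n in zf.namelist() if n.endswith("data.pkl")):
            out = io.StringIO()
            pickletools.dis(zf.read(name), out=out)
            # GLOBAL / STACK_GLOBAL opcodes name the classes/functions the pickle imports
            for line in out.getvalue().splitlines():
                if "GLOBAL" in line:
                    print(name, line.strip())

dump_pickle_globals("s2G488k.pth")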
>>41588158
Huggingface marks every single one of my SoVITS weights as suspicious too, but for some reason none of the GPT weights. There are at least four pretrained models I think that are involved. I'll see what I can do about this soon.
>>41588963
>>41588158
Well, facially, these aren't just normal pytorch model state dicts; they contain configuration information as well (yes, I did just torch.load the pickle and check the keys it said it had; I've probably already loaded these hundreds of times anyways) and shove the model state dict under the key 'weight', so I don't think they can actually be converted to safetensors directly. Assuming that nothing actually malicious is happening, I could try separating the data out into another file.
(Also, if we made safetensors loading the default, that'd make us incompatible with any models trained by anyone else, and it'd add an extra step for people training models to do.)
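If it ever becomes worth doing, the separation would look roughly like this (based on the structure described above: config keys plus the actual state dict under 'weight'). Untested sketch; the output filenames and the assumption that the remaining keys are json-serializable are mine.

import json
import torch
from safetensors.torch import save_file

ckpt = torch.load("s2G488k.pth", map_location="cpu")   # trusted file only, obviously
state_dict = ckpt.pop("weight")                        # the actual tensors
config = ckpt                                          # whatever non-tensor keys remain

# tensors go into safetensors (cloned so shared/non-contiguous storage doesn't trip it up)
save_file({k: v.detach().clone().contiguous() for k, v in state_dict.items()},
          "s2G488k.safetensors")
# config goes into a sidecar json
with open("s2G488k.config.json", "w", encoding="utf-8") as f:
    json.dump(config, f, default=str)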
>>41588963
It's probably fine for now. I'll think about it to see if there's a better solution, since it'll likely affect a lot of models going forward.
>ngrok needs a "verified account" with a CREDIT CARD to be used now
https://files.catbox.moe/46dfg5.mp3
2 more weeks
>>41570055
Horsona updates:
- [Done] I redid how the database cache works, since it clubbed together multiple disparate pieces of functionality, and its interface required special handling by any module that used it. The new version gives an embedding database an LLM interface. It can be queried like any other LLM, and it does any embedding-specific handling in there (esp. generating keyword searches from the prompt to get better embedding lookups). For whatever underlying LLM it uses, it requires two queries: one to generate the search terms, and one to respond to the query.
- ... Code: https://github.com/synthbot-anon/horsona/blob/main/src/horsona/memory/embedding_llm.py
- [Done] I implemented ReadAgent for dealing with long documents. ReadAgent generates a "gist" for each "page" of the document, which can be used to determine what information is on each page. At query time, it uses one LLM call to determine which pages to pull into the context, then a second LLM call to respond to the query. I implemented this as two modules: one to generate & keep track of gists, and one to provide the LLM interface. My version has two changes relative to the original: (1) when summarizing pages, it provides all gists-so-far as context so it can generate better summaries, and (2) when responding to a query, it provides all gists along with the selected pages rather than just the selected pages. (A rough sketch of the query flow is at the end of this post.)
- ... Code for creating gists: https://github.com/synthbot-anon/horsona/blob/main/src/horsona/memory/gist_module.py
- ... Code for the ReadAgent LLM wrapper: https://github.com/synthbot-anon/horsona/blob/main/src/horsona/memory/readagent_llm.py
- [Done] I added some utility functions that are generally useful for getting "smarter" responses. One of them is for searching the web for information on a given topic. The second is for decomposing a given topic into subtopics.
- ... Code for searching the web: https://github.com/synthbot-anon/horsona/blob/main/src/horsona/smarts/search_module.py
- ... Code for decomposing a topic: https://github.com/synthbot-anon/horsona/blob/main/src/horsona/smarts/mece_module.py
- [In progress] I like the LLM wrapper approach for generating augmented responses. I'll likely update some other modules to use the same approach, particularly the DialogueModule for generating in-character responses.
- [In progress] I need to update my ReadModule to reflect the database cache changes.
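In case the two-call ReadAgent flow isn't clear from the description, a rough illustrative sketch (not the actual horsona code; llm.ask and the prompts here are placeholders--see the linked readagent_llm.py for the real implementation):

def answer_with_readagent(llm, pages, gists, question):
    # call 1: decide which pages are worth pulling into context, given all gists
    selection_prompt = (
        "Here are one-line gists of every page:\n"
        + "\n".join(f"[{i}] {g}" for i, g in enumerate(gists))
        + f"\nWhich page numbers are needed to answer: {question}?"
    )
    page_ids = llm.ask(selection_prompt)  # assume this returns e.g. [3, 7]

    # call 2: answer using all gists plus the full text of the selected pages
    context = "\n".join(gists) + "\n\n" + "\n\n".join(pages[i] for i in page_ids)
    return llm.ask(f"{context}\n\nQuestion: {question}")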
>>41572561
GPT-SoVITS Inference GUI, first build.
>Windows pyinstaller download (may remove this later if it turns out to be broken)
https://drive.google.com/file/d/1PZt71cOH0X7QSFRgcThTwC2_WOain7Nj/view?usp=sharing
>GitHub + usage instructions
https://github.com/effusiveperiscope/GPT-SoVITS

Please help test:
>GPU/No GPU system, other system info?
>Does it work at all?
>Interface appearance on your display?
>Other issues?

>>41590487
Closer to 2 more hours. Apparently it's easier to turn a client/server application into a client-only application, since the decoupling's built in.
What do you think is behind the stellar quality of 11.ai? Some secret sauce algorithm or raw compute power for training?
>>41590988
Is GPT-SoVITS the closest to 11.ai TTS right now? I really don't like how reference audio is mandatory.
>>41590988
Yeah, once I get out of the wagiecage in 5 hours I will give this a go. Any advice on what kind of references work better vs what could make the output sound trash?
>>41591065
Most likely a large dataset, a large model, and enough resources to train one on the other. It's not "that good" for our use case though; we are looking for very specific voices and prosody patterns which even "the best" models can't quite get zero shot, and companies aren't going to specially finetune their models just for us.

>>41591141
It's the closest that we have the resources to deal with. It's open source, has maintainers who are willing to train and release a base model that performs well enough at inference time on a footprint that fits onto most consumer GPUs (under 4 GB), requires relatively minimal resources, data, and time to finetune to reasonable performance and character resemblance (unlike StyleTTS2 and xTTS), and doesn't have any obviously crippling flaws like ParlerTTS's performance on underrepresented tokens or general schizo energy.
>I really don't like how reference audio is mandatory.
For all we know, 11 could be doing the same thing under the hood, just with a default reference audio for each character. I'm actually still not quite sure whether reference audio is actually mandatory, or whether whatever they get out of reference audio could be precomputed or not. Most of the repo is either Chinese or google-translated Chinese, and there's a lot of confused terminology (for example, the splitting unit of the text splitting method is called a "batch", but "batch size" is also used in its normal sense; with this definition of "batch", "batch size" wouldn't refer to the size of the individual "batch" but rather the maximum number of "batch"es). All I know is that the original webui and underlying TTS pipeline give me an error if I run it without reference audio, and what I thought earlier was the "no reference audio" option seems to rely on some kind of cached internal state which produces really bad results if you switch models.

>>41591321
The TTS pipeline for some reason disallows reference audio outside of the 3 second to 10 second range (not sure why), so that's a length constraint (you could get around it by frankensteining audio clips together in an editor). Shouting references tend not to work very well. I haven't quite nailed down what makes references work consistently, but some factors seem to be:
>The brightness/high frequency information available in the audio clip (too much/too little)
>The intonation - like StyleTTS2, GPT-SoVITS seems to impose the average pitch and general pitch contour of the primary reference onto the output
Adding a bunch of auxiliary references of the same character seems to help with audio quality, although I haven't confirmed this. Feel free to experiment and report your own observations (assuming it even works).
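For the "frankensteining", you don't even need an editor; here's a quick sketch with soundfile + numpy that glues two short clips of the same character into one reference that clears the 3-second floor. It assumes both clips are wavs at the same sample rate (and the same channel layout); resample/convert first if not.

import numpy as np
import soundfile as sf

a, sr_a = sf.read("ref_clip_1.wav")
b, sr_b = sf.read("ref_clip_2.wav")
assert sr_a == sr_b, "resample one of the clips first"

# short silence between the clips so the join doesn't click
gap = np.zeros((int(0.15 * sr_a),) + a.shape[1:], dtype=a.dtype)
stitched = np.concatenate([a, gap, b])
print(f"stitched length: {len(stitched) / sr_a:.2f} s")
sf.write("ref_stitched.wav", stitched, sr_a)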
>>41590988
OK, so I noticed a pretty big bug already -- the auxiliary reference audio paths are never actually passed in. Still, since inference works on my machine, I'm interested in knowing whether it actually works on anyone else's before I go and put up another version.
Also, Clipper, if you're here -- do you mind if I remove the demucs-processed episode stems from my Drive?
>>41590988
To make the (other) system steps less ambiguous, please add to the git page the steps/commands to create and activate a conda/venv environment, and then the ones to install the dependencies from the txt files. The more dum-dum proof we can make it, the more accessible and hassle-free it'll be.
>>41591914
OK, updated.
>>41591949
>conda env create -n GPTSovitsClient python=3.10
>SpecNotFound: Invalid name 'python=3.10', try the format: user/package
Hmm, that doesn't seem to work, at least with Ubuntu. Wrong syntax perhaps?
Depending on the interfacing, maybe something like Applio's setup and run scripts can be examined and revised for use with this SoVITS TTS. Or perhaps a version made which integrates this TTS into it as a separate page/function?
>https://github.com/IAHispano/Applio/releases
>>41592046
Whoops, I forgot that it's just conda create. Updated.
>>41592046
Could someone test if the script still works with the "python=3.10.3" version? I believe with the .4 and above versions, pip/fairseq/omegaconf or some other shitty module was throwing a fit.
>>41590988
I've been out of the loop. Is this *just* a GUI, intended to plug into a preinstalled AI, or does it come with the actual AI model and functionality too?
>>41591651
OK, after poring over the code for around 2 hours, here's what I think is going on:
- The primary reference audio is passed into a "HuBERT" model followed by a Conv1d and RVQ to produce codes--presumably to represent the semantic content of the audio.
- The phoneme embeddings plus a positional embedding and something to do with BERT (presumably to represent the semantic content of both the reference text prompt and the actual text prompt) are concatenated with the semantic audio codes (also plus a positional embedding), then run through a transformer model ("GPT") to create another intermediate representation. They use sinusoidal positional embeddings.
- This representation then gets fed into a VITS network ("SoVITS") which is conditioned on speaker timbre information (the same way speaker embeddings are normally applied in VITS) and converts it into audio.
- The speaker timbre information comes from the auxiliary reference audio, or if none is specified, the primary reference audio. These are converted into spectrograms and fed into a MelStyleEncoder which eventually averages them out temporally. After this they are all averaged together, producing a [1, 512] output.

What this means for us:
- The primary reference appears to keep its sequence dimension, so it's not possible to calculate an "average" primary reference audio.
- Primary reference audio might not be mandatory; the model seems to have cases built in for having no reference, it could just be the TTS pipeline code that disallows it. That being said, I don't know if these are dead code or if generating without a reference will produce good results.
- There doesn't seem to be any inherent reason to restrict the primary reference audio to 3-10 seconds, other than for quality vs. memory usage purposes (perhaps there are edge cases where a reference audio might end up too short for something in the model to use). Maybe they didn't want GIGO to give the model a bad rep?
- It's at least theoretically possible to precalculate average speaker timbre information from auxiliary reference audio, since multiple audios can be averaged together. Whether it's actually useful is another question entirely.

>>41592126
I'm using an environment with python=3.10.15 and it seems to work.

>>41592134
The pyinstaller is a self-contained solution for running the model; it has the actual AI model and doesn't plug into anything else (it would be pretty bad if it were 10GB and needed to plug into something else!)
Originally I intended for it to interface with a server, but then I found out >>41590487, which defeated most of the reason I even wanted to make a client-server model.
>>41592046
>>41592116
Also, something may be odd with my conda setup perhaps. Running "install.sh" gives a command-not-found kind of error for the first 4 lines, but continues with pip. But inputting the same conda commands into the terminal works just fine, and I was able to install everything required. Strange. The install.sh is missing the install for requirements_client.txt; I ended up trying to launch with "py" and it said "gui_client" is not defined, then I tried with "python" and it was missing its "peewee". Everything was seemingly okay after doing the final step of installing the requirements_client.txt file, but now the error in the image related happened and I'm now stuck.
>>41592134
If you want to mass download the models from Hugging Face, use this with your python terminal:
cd #directory where you want it saved#
pip install huggingface_hub
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='therealvul/GPT-SoVITS-v2', cache_dir='tmp', local_dir='models')"
>>41592180
You are not supposed to run ./install.sh. That is from the original repo. The only correct instructions are under the README.
https://github.com/effusiveperiscope/GPT-SoVITS
>>41592180
>>41592205
Also, it looks like you're in your base conda environment.
>>41592205
>>41592256
Oh. That's surprising, because I was able to get Applio to run that way, minus the extra steps for this one.
>conda env create -n GPTSovitsClient python=3.10
>TypeError: deprecated() got an unexpected keyword argument 'name'
Full error: https://ponepaste.org/10440
And yeah, I also did "conda create -n GPTSovitsClient python=3.10" as per the updated git instructions and got basically the same error.
Would it be possible to still continue the setup and run the GUI without setting up a conda environment if these errors persist?
>>41592269
>https://ponepaste.org/10440
It would be helpful if you could post the error you get specifically when you run "conda create -n GPTSovitsClient python=3.10".
>deprecated() got an unexpected keyword argument 'name'
https://github.com/aws/aws-cli/issues/7325
This seems to be a problem with pyOpenSSL; try uninstalling it: `pip3 uninstall pyOpenSSL`
>Would it be possible to still continue the setup and run the GUI without setting up a conda environment if these errors persist?
Possible, if you're willing to overwrite packages in your base python/conda environment (this could cause other things to break if they depend on them). However, if your python version doesn't match, it increases the likelihood of bugs that might be difficult to solve, and you could end up putting things in a state that is difficult to recover from.
>>41592269
>Oh. That's surprising, because I was able to get Applio to run that way, minus the extra steps for this one.
I've removed install.sh from my fork to prevent further confusion. However, generally you should not assume that installation steps transfer from project to project.
>>41592286
>This seems to be a problem with pyOpenSSL, try uninstalling it: `pip3 uninstall pyOpenSSL`
Thank you! This problem apparently prevented me from doing anything with conda, including updating OpenSSL and whatnot, throwing the same error.
I was able to create the environment and proceed as expected, but ran into one more error:
>ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
>fairseq 0.12.2 requires hydra-core<1.1,>=1.0.7, but you have hydra-core 1.3.2 which is incompatible.
>fairseq 0.12.2 requires omegaconf<2.1, but you have omegaconf 2.2.0 which is incompatible.
>hydra-core 1.3.2 requires antlr4-python3-runtime==4.9.*, but you have antlr4-python3-runtime 4.8 which is incompatible.
and it stopped installing afterwards. However, I was able to split the "pip install -r requirements.txt -r requirements_client.txt" step into separate pip install commands, which allowed everything to be fetched correctly (some it apparently missed), and it started to work for a bit, but now it can't find something (image related). Note: the aforementioned dependency error still exists after the requirements_client.txt step. I may hold off on further attempts as it's quite late my end. I'll see about continuing when I wake.
>>41592299
Noted.
>>41592417
This seems to be a Linux-specific issue.
https://github.com/elieserdejesus/JamTaba/issues/1228
If you're using apt, try:
`apt install libqt5multimedia5`
If not, try to find the equivalent libraries for your distro.
>>41592155
>Arbitrary reference length
OK, so here's how different reference lengths affect output quality:
>under 2 seconds
ref: https://files.catbox.moe/944rn3.mp3
gen: https://files.catbox.moe/tkkl1c.mp3
ref: https://files.catbox.moe/7k63bq.mp3
gen: https://files.catbox.moe/oylemz.mp3
>2 seconds
ref: https://files.catbox.moe/5ysutk.mp3
gen: https://files.catbox.moe/dihj8g.mp3
>3.5 seconds
ref: https://files.catbox.moe/mdu2vl.mp3
gen: https://files.catbox.moe/vegwm8.mp3
>over 10 seconds
ref (for both): https://files.catbox.moe/ejlgxr.mp3
gen: https://files.catbox.moe/vhw04c.mp3
gen: https://files.catbox.moe/z3hjev.mp3
I guess you might expect more pronunciation errors with >10 seconds due to increased context length? And under 2 seconds, the quality of the generated audio and character resemblance seems to suffer. Around 2 seconds I think is "OK" territory though. I think I can adjust the code to just give you a warning if your reference is shorter than 3 seconds or longer than 10 seconds rather than disallowing it outright.
>Mandatory references
OTOH, it seems that some of the module code DOES expect reference audio to exist, so it looks like reference audio is mandatory unless I start mucking about in the model's innards.
>>41591911
Go ahead, I have all those saved locally.
>>41592954
OK.
>>41590988
Trying to run on Windows, I get this error while running the exe:
https://pomf2.lain.la/f/7lmb842g.txt
>>41593176
I have my suspicions about what's causing this, but I'm not 100% sure. Do you have ffmpeg/ffprobe on your PATH (I know it's not in the instructions)?
>>41590988
GPT-SoVITS Inference GUI, revision 1.
https://drive.google.com/file/d/1UvzWIFRyO8jjB2z5bgeQnMn0GrOjkaNH/view?usp=drive_link
>Changes
- Fixed auxiliary reference audio paths not being passed into inference properly
- Removed pydub/ffmpeg dependency
- Added experimental ARPAbet support for English
- Allow selection of <3s or >10s reference audio
- Warn instead of stopping generation on <3s or >10s reference audio

>>41593176
>>41593610
Ok, I think I know what the cause was. Apparently pydub actually depends on ffmpeg/ffprobe, and those aren't bundled with it when I run pyinstaller. I've removed pydub as a dependency and replaced its functionality with soundfile, since soundfile at least seems to bundle its library properly with pyinstaller.
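For anyone curious what the pydub -> soundfile swap amounts to, it's basically just this kind of I/O (sketch; the filenames are placeholders). soundfile reads straight into numpy arrays via its bundled libsndfile, so there's no external ffmpeg binary to track down:

import soundfile as sf

# read: returns a numpy array plus the sample rate
audio, sr = sf.read("in.wav")

# write: the container format is inferred from the extension (wav/flac/ogg)
sf.write("out.flac", audio, sr)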
>>41592440
Awesome, one more hurdle dealt with. Everything went well with running, but once more an obstacle; this time something about a Qt platform plugin not being found. Are there some additional things I need? The error persists even with a fresh clone of the git, which I would hope has the same updates as the Windows one >>41593870
>>41593870
>since soundfile at least seems to bundle its library properly with pyinstaller.
It feels like this kind of error is something the module developers should have fixed ages ago.
Hey now you've gotta save the thread
>>41594400
It seems that there are a variety of causes behind this, but the most common solution appears to be `apt install libxcb-cursor0`. Try running that and see if it works; if it does, I'll add it to the README.
https://stackoverflow.com/questions/68036484/qt-qpa-plugin-could-not-load-the-qt-platform-plugin-xcb-in-even-though-it
https://forum.qt.io/topic/148718/qt-qpa-plugin-could-not-load-the-qt-platform-plugin-xcb-in-even-though-it-was-found/2
>>41595049
I've tried many methods surrounding that:
[Already had installed]
>pip install pyqt6
>sudo apt-get install libxcb-xinerama0
>sudo apt-get install libxcb-xinerama0-dev
>sudo apt-get install --reinstall libxcb-xinerama0 // (Hadn't changed anything)
[New but didn't solve]
>sudo apt-get install libxcb-randr0-dev libxcb-xtest0-dev libxcb-xinerama0-dev libxcb-shape0-dev libxcb-xkb-dev
>sudo apt-get install libxkbcommon-x11-dev
>pip install opencv-python-headless
My current theory is that the plugin's placement is incorrectly configured. Some feedback regarding this type of issue expects the plugin "libqxcb.so" to be in "/home/user/.local/lib/python3.10/site-packages/cv2/qt/plugins/", but I found it in another directory beyond it, "~/plugins/platforms/". Maybe I need to create a symbolic link or something? Or the program/lib reconfigured to look there instead? Or maybe like... move/duplicate the libqxcb.so to be in the expected directory?
>>41595133
>libxcb-randr0-dev libxcb-xtest0-dev libxcb-xinerama0-dev libxcb-shape0-dev libxcb-xkb-dev
These are development files (headers, static libraries for compiling); they shouldn't affect anything.
Have you tried running with `QT_DEBUG_PLUGINS=1 python gui_client.py`? If so, could you post the output here?
>>41595192
Another thing that seems promising:
>sudo apt-get install libqt5x11extras5
>>41595199
Oh yeah, also tried that. Sadly, no dice.
>>41595192
That doesn't seem to output anything different from the previous error. Pretty much identical; wrong syntax for additional parameters maybe?
>>41595262
Try `export QT_DEBUG_PLUGINS=1` then `python gui_client.py`?
>>41595272
Couldn't get that to work in the terminal, but I was able to add this to the client_gui.py for more info:
>import os
>os.environ["QT_DEBUG_PLUGINS"] = "1"
The output helped inform me where it was looking for the plugin--which is where it wasn't; the plugin was elsewhere. By default it looked for it in:
>/home/hazyskies/miniconda3/envs/GPTSovitsClient/bin/platforms
but the "platforms" directory was never created, and thus no plugin. I found the one in "/usr/lib/x86_64-linux-gnu/qt6/plugins/platforms", but this is an outdated one that likely came with the distro or something, so it errored stating so:
>"The plugin '/home/hazyskies/miniconda3/envs/GPTSovitsClient/bin/platforms/libqxcb.so' uses incompatible Qt library. (6.2.0) [release]" not a plugin
Thankfully, from my earlier testing I found the REAL one it was looking for in "/home/hazyskies/.local/lib/python3.10/site-packages/PyQt5/Qt5/plugins/platforms", so I manually created a directory called "platforms" in "/home/hazyskies/miniconda3/envs/GPTSovitsClient/bin/" and copied the plugin into it, and now it works! Yay! More testing to be done later. Consider adding a check to see if the directory/plugin exists when first running the script, checking the relevant locations, and mkdir-ing and copying the plugin into the created directory when it exists, to ensure there are no further errors? That'd save similar OS users some headache.
Now that I finally got it running, I'm glad. Just wish the GUI could be scalable; my second monitor is currently being occupied by another device and it doesn't scale well on my ancient 5:4 (1280x1024) monitor.
>>41595665>"/usr/lib/x86_64-linux-gnu/qt6/plugins/platforms"The Qt6 one, likely installed from when you pip installed it earlier, wouldn't work because the GUI uses PyQt5.>Consider adding a check to see if the directory/plugin exists when first running the script and checking the relevant locations. mkdir and copying the plugin into the made directory when it exists to ensure there's no further errors?This seems very hacky. There's no reason to assume that the user has PyQt5 already installed in their local python.Todo:- Investigate why this happens and if there's a more robust solution.- Work on making the UI more compact/scalable.
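One possibly-less-hacky direction (untested, just a sketch): before creating the QApplication, point Qt at the platform plugins bundled with whichever PyQt5 the script actually imports, instead of relying on whatever happens to be in the environment's bin/. The internal layout differs between PyQt5 versions (some wheels use Qt/ instead of Qt5/), so treat the paths as an assumption to verify:

import os
import PyQt5

# look for the platform plugins shipped inside the imported PyQt5 wheel
for sub in ("Qt5", "Qt"):
    plugin_dir = os.path.join(os.path.dirname(PyQt5.__file__), sub, "plugins", "platforms")
    if os.path.isdir(plugin_dir):
        os.environ.setdefault("QT_QPA_PLATFORM_PLUGIN_PATH", plugin_dir)
        break

from PyQt5.QtWidgets import QApplication  # import only after the env var is set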
>>41595701
I notice there are also no default voice lines. Consider adding a download button for that too, or maybe include some to use when you download a voice model? That way they can be used immediately without having to go to the mega each time, or assuming the user already has a copy. It would probably need a search function to find ones with certain lines, given the sheer amount we have available.
Another issue, though thankfully not critical this time: none of the audio wants to play, it just goes to the pause state and doesn't play. The only info in the terminal says "defaultServiceProvider::requestService(): no service found for - "org.qt-project.qt.mediaplayer"". I was able to find the files and play them in VLC though, so still workable.
[First test] Trixie would surely be a master coder:
Ref: https://files.catbox.moe/rn7rxj.flac
Output (compiled): https://files.catbox.moe/h340ee.mp3
>>41593870
Not sure if this is a problem with the program or if I'm just not using it correctly - it doesn't seem to be able to play any of the reference lines supplied or added to the table, and I get an error when trying to generate relating to failing to load audio.
https://pomf2.lain.la/f/7rzu6zzk.mkv
>>41596098
Oh, there are meant to be circles and squares in the primary and aux tables? Those are missing in Linux, just empty. Took a bit to work out that the table had to be clicked there first before generations were allowed.
>Not playing reference lines
Yeah, same for me.
>>41596098
Not at home right now, but I think I know what's going on--the base GPT-SoVITS library also seems to require that ffmpeg be installed. I'm not sure how much I can work around this yet.
>>41596196
Do you think it might be a result of the resolution? Also, do you get anything that looks like an error when you try to play reference lines?
>>41595665
Do you still have the specific error output from when you didn't have the library? That would be helpful.
>>41595917
>Adding a download button, search function
That'd probably require some kind of index of the MEGA Master File to be created, since we can't download subsets of HF datasets. I'll take it into consideration since it seems like an important feature to have.
>Bundle reference audio with the voice models
Possible.
>"defaultServiceProvider::requestService(): no service found for - "org.qt-project.qt.mediaplayer".
It looks like on Linux gstreamer plugins are also a dependency. Could you try following the advice from here:
https://doc.qt.io/qt-5/linux-requirements.html#multimedia-dependencies
and report if it works?
>>41596098
On the preview requirement, I remember now: this is a codec issue. Try installing K-Lite codecs: https://codecguide.com/download_kl.htm
On not being able to generate -- still looking into it.
>>41596607
>ffmpeg:
Apparently GPT-SoVITS not only uses ffmpeg for I/O but also plans(?) to use it for audio stretching(??), which is not so easy to do with another library (that wouldn't require yet another extra install, like rubberband). It's tied into enough things that, unfortunately, I think the best solution really is just bundling the ffmpeg executables with the Windows pyinstaller, taking us up to a hefty 10.6 GB for a full install. Bloat über alles!
>>41593870
GPT-SoVITS GUI, revision 2
Windows pyinstaller: https://drive.google.com/file/d/12JgwvkFao_h_6hHLi-VqoOrA6Lf11X4f/view?usp=drive_link
Updates:
>ffmpeg bundled, possibly the last missing dependency for generation on Windows?
>Tentative GUI changes to make it compatible with smaller displays (should still be resizable to a more sane size for larger displays)
>ARPAbet syntax highlighting in the prompt editor
>>41596821
Getting this error when trying to play the rendered audio:
DirectShowPlayerService::doRender: Unknown error 0x80040266.
>>41596960
>>41596607
Try installing K-Lite codecs: https://codecguide.com/download_kl.htm
>>41596607
>Do you still have the specific error output from when you didn't have the library? That would be helpful.
Yes -> image related in >>41594400
The image related in this post has a little more info, from when I added the wrong plugin to where it was looking for it.
>gstreamer plugins
>And report if it works?
I already have gstreamer and gstreamer-plugins installed, it seems. I also installed good and bad, plus another for pipewire (as it's the audio runner thing I use) and libqt5multimedia5-plugins. However, I think it's the same case of it not finding the service in the right directory rather than missing something required to run it. The output says, among other things before it:
>QFactoryLoader::QFactoryLoader() checking directory path "/home/hazyskies/miniconda3/envs/GPTSovitsClient/bin/mediaservice" ...
>defaultServiceProvider::requestService(): no service found for - "org.qt-project.qt.mediaplayer"
So there's a bunch of things missing from bin that should be there but aren't. Maybe I'll have to go through the list again or something; might've missed a step or not had my env loaded at the time? Or it's not allowing installs into the env because what it's looking for already exists outside of it?
Does this also mean the git is updated with the same changes? If so, how would I go about updating my local copy. Git pull?
>>41597162
I am referring specifically to the error with debug enabled from before you added the wrong plugin.
At this point I'm considering spinning up a VM just to troubleshoot this. What distro are you using?
>Does this also mean the git is updated with the same changes? If so, how would I go about updating my local copy. Git pull?
If you cloned the repository with git clone, then git pull should get the new changes, yes.
>>41597230
>debug enabled from before you added the wrong plugin.
There was no additional debug info before I added the plugin, wrong or otherwise. Whether doing the QT debug thing in the terminal, or adding it to the python script.
>What distro are you using?
Linux Mint 21
Yet another zero-shot-like TTS
https://github.com/open-mmlab/Amphion/tree/main/models/tts/maskgct
Hugging Face space demo:
https://huggingface.co/spaces/amphion/maskgct
Samples:
https://maskgct.github.io/
>>41597325
Well, shit. I remember this was posted a few threads back; whatever happened to them saying they wouldn't release weights for "safety reasons"? Or was that just made up?
Also, looks like no finetuning code. It's still worth taking a look at, because if they release the weights they might plan on releasing training/finetuning code as well. Will test zero shot soon.
>>41597230
>>41597292
This is strange. I used the XFCE version from here: https://linuxmint.com/edition.php?id=301
With the exception of a missing nltk package (which might also affect the Windows version), I'm able to load the GUI without issue. It displays radio buttons and check buttons correctly, and I am also able to play preview audio, generate, and preview generations with no real problems (apart from my audio stuttering like hell b/c it's a VM). It's possible that something unrelated installed on your system, or one of the commands you ran in >>41595133, might cause these issues--but that makes it harder to debug.
I've also confirmed that you can perform CPU-only inference, at least inside a Linux VM. It is quite slow; it took me 44.5 s to synthesize these 28 s of audio.
gen: https://files.catbox.moe/lo3s60.mp3
I also have some more ideas on how to make the UI fit smaller displays.
>>41598009>>41597292In retrospect I probably should've asked you what DE you're using as well, that may affect things.
>>41598009>>41598043
Yea, to be fair I probably should've mentioned it to avoid confusion. I'm using Cinnamon. May have been because I liked the interface better or something. That or the name.
>you can perform CPU-only inference
>it took me 44.5 s to synthesize these 28 s of audio
That's not too bad. It's about double realtime but still pretty quick for the quality it puts out. There's also three outputs, so technically speaking it's still faster than realtime?
>It's possible that something unrelated installed on your system --- might cause these issues--but that makes it harder to debug.
Well in any case I may be doing a fresh install anyway, as I just got my PNY M.2 NVMe. Finally making the very long overdue switch from the relic format that is the HDD. Or well, at least for my main operating system; storage with those will still be sound until we all get those new petabit-sized optical disks on the horizon. Then there's room for all the mares everywhere.
>>41598071
>I'm using Cinnamon
Got it.
>There's also three outputs, so technically speaking it's still faster than realtime?
This was only with one output. There seems to be a -slight- fixed cost effect, because with 3 repetitions I was able to get 115 s gen time, which is under 3*45 = 135 s, but I was also dipping into swap with 8 GB of RAM.
>Well in any case I may be doing a fresh install anyway, as I just got my PNY m.2 nvme.
Even so, I would still like to make sure it works for Cinnamon users.
>>41598083Did an install on Cinnamon, same results.
>>41572561
Amphion test
>>41597325
OK, so this might be one of the worst installs I've done on Windows. I had to fork phonemizer and Amphion and modify the inference script with a bunch of hacky stuff to even get it to start downloading models.
>Rough installation instructions on Windows
0. Create a conda environment with python=3.9.15 and activate it
1. Install espeak-NG and locate the install directory
2. Clone the Amphion repository from here: https://github.com/effusiveperiscope/Amphion
3. In models/tts/maskgct/maskgct_inference.py, modify _ESPEAK_LIBRARY to point to your espeak-NG install directory (see the sketch after this post)
4. pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
5. Install dependencies roughly following https://github.com/open-mmlab/Amphion/blob/main/models/tts/maskgct/env.sh WITH CHANGES:
- Remove torch==2.0.1 from the end of tensorboard
- The version of phonemizer they use won't work on Windows because of missing mbrola. Change the pip install phonemizer to pip install git+https://github.com/effusiveperiscope/phonemizer.git
- Also, pip install json5
6. pip install -U numpy==1.26.4
>Inference requirements
It looks like you need an NVIDIA GPU with a minimum of 12 GB VRAM to run this (it maxed out at 10.2 GB). That puts it out of reach for most people here.
The git repo itself is ~240 MB on disk, and all of the pretrained models it downloads together are ~5.5 GB. The miniconda environment is another 5.5 GB.
>Zero shot performance
It used 27 s to infer 18 s on a 3080 Ti.
ref: https://files.catbox.moe/w90njn.mp3
gen (18 s): https://files.catbox.moe/e8am3r.mp3
Well, that's disturbing. But it sounded pretty good at the start? Much better zero shot performance than most other models.
ref: https://files.catbox.moe/j9hnbg.mp3
gen (18 s): https://files.catbox.moe/3qym13.mp3
gen (10 s): https://files.catbox.moe/w90njn.mp3
Not so good on Rarity -- obviously she has a less generic accent. The hallucinations are less likely to happen if I make the audio shorter. I think this failure mode is much less desirable for automated systems compared to mispronunciation errors.
The inference requires that you explicitly specify the output duration in advance.
>>41598191
Also -- do you remember if you selected the "multimedia" codec thing at installation?
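For reference, step 3 above mostly boils down to pointing phonemizer at the espeak-ng DLL. A minimal sketch of what that looks like, assuming EspeakWrapper.set_library is the hook the fork uses and with an example install path (check maskgct_inference.py in the fork for the real variable):

```python
# Illustrative only: the DLL path below is an example, and EspeakWrapper.set_library
# is assumed to be the hook used; adjust to match your own espeak-NG install.
from phonemizer.backend.espeak.wrapper import EspeakWrapper

_ESPEAK_LIBRARY = r"C:\Program Files\eSpeak NG\libespeak-ng.dll"
EspeakWrapper.set_library(_ESPEAK_LIBRARY)
```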
>>41598404>gen (10 s): https://files.catbox.moe/w90njn.mp3Whoops, wrong file. Here's the correct one: https://files.catbox.moe/rg3m6h.mp3
>>41598404Also this model is called "MaskGCT", not amphion.
>>41596821
Managed to install and train my own models with the default fork. Your GUI is super handy, but I noticed the generations don't sound the same compared to the RVC-Boss fork of GPT.
RVC-Boss: https://files.catbox.moe/n9sn0i.wav
GPT-SoVITS GUI: https://files.catbox.moe/v6bfdb.flac
Reference Audio: https://files.catbox.moe/lwus9w.flac
Could the requirements I downloaded for RVC-Boss be conflicting with this fork? How do I get the generations to sound the same?
>>41598852You may have the base pretrained model loaded (it's loaded by default). Did you load the dedicated model?
>>41598852
>generations don't sound the same compared to the RVC-Boss fork of GPT
>how do I get the generations to sound the same?
The same ... in what regard? Are you referring to how dynamic/varied the generations are with GPT-SoVITS? If so, that's a good thing, allowing for many takes to get a really good line and/or delivery. It shouldn't be necessary to try and make one AI sound like another, but rather to make the character more accurate and clear.
>>41596821Windows 10, CPU gen inference seems to be functional.https://files.catbox.moe/lsfqe3.mp3It took a moment to figure it out coming from SVS/RVC, but that's pretty damn impressive for a local CPU gen.
>>41599176Nice, thanks for reporting
>>41596821Hey, how do I properly add more reference clips while automatically having them be sorted? Or do I have to sort them out myself? I'm confused.
>>41599817What do you mean sorted? You can just download the voice data from Clipper's master file (in the OP) and all of those have the relevant data to label themselves.
>>41599817What do you mean specifically by sorted? If your reference clips don't have PPP-style labeling data in their names then you will have to fill in the fields manually.
>>41598870>You may have the base pretrained model loaded (it's loaded by default). Did you load the dedicated model?Dammit, I completely missed the 'load selected models' button, everything works perfectly. I feel like a complete jerk for bringing the 'issue' up. I just got everything set up and was rushing out the door for work when I posted. Thanks for quick fix anon!
>>41600135>I feel like a complete jerk for bringing the 'issue' up.Eh, it's a fairly reasonable assumption to make that there's nothing loaded there beforehand, and that the newly downloaded model would be loaded automatically. I'm hesitant to touch GPT-SoVITS's automatic base model load though because I don't know if something else might depend on it.
TTS is finally winning big here with GPT-SoVITS. Hopeful to see some great line delivery in future fan episodes.
https://files.catbox.moe/wu08z5.mp3
>>41600633That's a lot of bugs, you should probably file a mare issue.
>>41596821A few lines I did with Twilight's model! I like it a lot!https://files.catbox.moe/idmne9.wav
>>41600989Wow, nice.
I found a neat trick. Process audio octaves apart and layer them. The hoarse parts don't stand out anymore and you get a neat effect as they add up to something good in aggregate. Especially good for a song that already sounds flange-y or had heavily processed voice in the first place.
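For anyone who wants to try the trick, here's a minimal sketch of the shift-and-layer part in Python. The file names and mix gains are made up, and the conversion model you run each copy through (RVC, so-vits, etc.) is left out:

```python
# Sketch of the octave-layering trick: pitch-shift a copy of the source, process
# each copy with your voice model separately (not shown), then mix the results.
# Paths and gains are placeholders.
import librosa
import numpy as np
import soundfile as sf

y, sr = librosa.load("vocal_take.wav", sr=None, mono=True)
low = librosa.effects.pitch_shift(y, sr=sr, n_steps=-12)  # one octave down

# ...run y and low through the conversion model here...

n = min(len(y), len(low))
mixed = 0.7 * y[:n] + 0.5 * low[:n]
sf.write("layered.wav", mixed / (np.abs(mixed).max() + 1e-9), sr)
```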
>>41582059Oh, before I forget to ask, would you be so kind as to create or otherwise provide documentation on how to fine-tune models for this? Wouldn't mind training some of my own given how good these turned out, providing an RTX 2060 is capable enough to do so.
>>41601159
Some relevant information on setting up the environment here: https://desuarchive.org/mlp/thread/41498541/#q41562711 (it's not true that training always starts from epoch 0; that was an erroneous observation I made)
This rentry is also helpful but inaccurate (you should clone the repository; don't use the zip because it's outdated): https://rentry.org/GPT-SoVITS-guide
Also, the choice of epochs is much less rigid than the guide states. I didn't notice any adverse effects from enabling DPO. Check the archive for my observations on training hyperparams and their effects on output.
Very generally:
0. If you don't have transcriptions, you need to follow the "audio slicer" etc. steps in the rentry, which will use ASR to automatically generate them and also automatically slice your audio
1. If you already have transcriptions, you don't need to use their step 0 preprocessing, but you do need to provide your own filelist with each line representing a sample, in the format (a sketch for building one follows this post):
><audio_path>|<speaker_name>|en|<plaintext transcription>\n
I have my own notebook+library to do this with a local copy of the Master File: https://github.com/effusiveperiscope/PPPDataset/blob/f42390a7ee75fae04a50afb400be417d380577b1/ppp2.ipynb
2. You can specify your own filelist under the webUI step 1 (GPT-SOVITS-TTS) tabs. I feel the rest of the interface is self-explanatory.
Also:
- I modified the webui code to increase the maximum number of SoVITS and GPT epochs (because I found it worked better for them).
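If it helps, here's a tiny sketch of producing that filelist format yourself without the notebook; the sample paths, speaker name, and transcriptions below are placeholders:

```python
# Hypothetical example: write a GPT-SoVITS filelist in the
# <audio_path>|<speaker_name>|en|<plaintext transcription> format described above.
from pathlib import Path

samples = [
    ("wavs/twi_0001.wav", "Twilight", "We have a lot of studying to do tonight."),
    ("wavs/twi_0002.wav", "Twilight", "Spike, take a letter."),
]

with open("filelist.txt", "w", encoding="utf-8") as f:
    for audio_path, speaker, text in samples:
        # One sample per line, pipe-separated, with "en" as the language tag
        f.write(f"{Path(audio_path).as_posix()}|{speaker}|en|{text}\n")
```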
>>41596821
GPT-SoVITS Inference GUI, revision 3.
https://drive.google.com/file/d/1EljbxeUckYATH269utj7q1T-8oKcPhte/view?usp=sharing
I think we're feature-complete here.
>Changes
- Add a Master File downloader, which programmatically constructs an index of the Master File and lets you search for files using unix glob patterns (e.g. *_Rarity_*, see the example after this post) and download them to the ref_audios folder
- Rearranged columns for better UX on small displays, and made the reference audio table slightly more adaptive overall to different display sizes
- Add a less time-consuming check for NLTK packages
- Check for/download averaged_perceptron_tagger_eng (although I'm not sure if it's needed)
- Fill in missing config keys if they are not found in the user config
- Loosen the omegaconf requirement on Linux so requirements.txt and requirements_client.txt are (hopefully) more compatible
- Warn the user if inferring with the base model
>Migration
If you want to avoid redownloading the pretrained models, you should just be able to replace _internal and gptsovits.exe in an existing install with the new versions, but let me know if it doesn't work.
>Note for source users
Dependencies changed; requirements_client.txt has new dependencies.
>Rant
By far the hardest part of this was trying to figure out how to programmatically list files and download links from MEGA shared folders. mega.py is deprecated and the MEGA REST API itself is undocumented and quite opaque, and most info about it only seems to come from reverse engineering. The only other alternative is MegaCMD, and I didn't want to make users install yet another dependency to make the damn thing work, so I ended up implementing it in python. The end result feels hacky, and I don't know how long it's going to work or how well supported/breaking it ends up being across systems. There is exactly one StackOverflow thread that gave me most of the information that I needed to work with: https://stackoverflow.com/questions/64488709/how-can-i-list-the-contents-of-a-mega-public-folder-by-its-shared-url-using-meg
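Not PPP code, but if you want to sanity-check what a pattern like *_Rarity_* would match before downloading, Python's stdlib fnmatch uses the same style of unix globs (the GUI's matcher may differ in detail, and the filenames below are made up):

```python
# Quick way to preview glob-style matching against a list of names.
import fnmatch

files = [
    "00_01_05_Rarity_Neutral__.flac",
    "00_02_11_Twilight_Happy__.flac",
]
print(fnmatch.filter(files, "*_Rarity_*"))  # -> only the Rarity clip
```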
>>41601267lol I forgot to include a screenshot
All this "use 5 second of reference to get a style of speech" make me interested in this idea (codefags, feel free to call me out an idiot on how this is not how it works, but it makes sense in my head):Have a audio reference model train with audio that is separated into clear folder and one into noisy folder (random reverb, bad pitch, crap going on in background etc).The way I think the raining process would work is to get model to understanding two parameters, a) what makes the audio "clear" and b)what makes audio "noisy".Once trained, one could feed the model the main noisy audio and a 2nd reference of clean audio to convert the main audio into a clear version of itself (aka use the 2nd reference to fix up the bad quality in the 1st audio).
>>41601911
I think it's possible; you've basically structured a style transfer problem. I guess the reason you feed it a 2nd reference, as opposed to not conditioning on anything, is so you give the model an idea of what the character's "clean" audio is like, to bootstrap off existing data?
- ParlerTTS already kind of implicitly does half of this with labeled noise levels; it works OK
- Generally for denoising tasks people use the more straightforward approach of just degrading an existing clean input so you have matched noisy/clean pairs (a sketch of this follows)
- For it to actually be worth using, it would have to outperform things like just running the input through, for instance, an RVC or so-vits-svc 5.0 model trained on the existing clean data
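To make the second bullet concrete, here's a minimal sketch of the degrade-the-clean-input approach; the paths, SNR, and fake reverb are placeholder choices, and it assumes mono input:

```python
# Make matched (noisy, clean) training pairs by synthetically degrading clean clips.
# Everything here (paths, SNR, impulse response) is illustrative.
import numpy as np
import soundfile as sf

def degrade(clean: np.ndarray, sr: int, snr_db: float = 15.0) -> np.ndarray:
    noise = np.random.randn(len(clean))
    clean_rms = np.sqrt(np.mean(clean**2) + 1e-9)
    noise_rms = np.sqrt(np.mean(noise**2) + 1e-9)
    noise *= clean_rms / (noise_rms * 10 ** (snr_db / 20))  # hit the target SNR
    ir = np.exp(-np.linspace(0.0, 8.0, sr // 4))            # crude decaying "reverb" tail
    wet = np.convolve(clean + noise, ir)[: len(clean)]
    return wet / (np.max(np.abs(wet)) + 1e-9)

clean, sr = sf.read("clean/clip_0001.wav")
sf.write("noisy/clip_0001.wav", degrade(clean, sr), sr)
```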
>>41601267
ehh, getting this error on W7. I remember getting something similar with RVC and having it fixed with this setup:
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1+cu116 "tensorflow[and-cuda]" --extra-index-url https://download.pytorch.org/whl/
I will test out different module combos over the weekend, and once I find something that works I will post it here.
>>41602441Well the python runtime and dependencies are bundled with the installer so you won't be able to modify anything with it by messing with your pip directly. You might want to try an install from source but replace the pytorch version with one that you know is compatible with your system (idk if anything will break though; the recommended pytorch version here is 2.3.0).
>>41602526Upon further investigation, it looks like there is a way to install/use pytorch versions >= 2.1 on Windows 7 using something called VxKex + an extra DLL:https://discuss.pytorch.org/t/pytorch-2-1-is-no-more-able-to-use-my-gpu/208672/13
Alright, here's another fun mistake you can worry about when finetuning: make sure you have the pretrained discriminator in the right directory and that you don't accidentally delete it when you're cleaning up your project folder like I did ;^)
With pretrained discriminator: https://files.catbox.moe/ba7z82.mp3
Without pretrained discriminator: https://files.catbox.moe/lbw4m7.mp3
Also, I think the deeper male voices (like Flam's) require more SoVITS epochs (up to 96) to get the needed deepness--the pretrained model seems to have some bias toward higher pitched voices.
https://files.catbox.moe/hxfzyl.flac
>>41601185
So, a little confused as to the hardware requirements in that referenced post. It says the GPT and DPO (dunno what the latter is) can train on ~6 GB, but the SoVITS side needs ~12 GB? Depending on the requirements to fine-tune like the earlier examples, that might be outside my 8 GB capabilities; hoping this isn't the case.
>>41602858That was for the given batch sizes. You can use lower batch sizes to lower the memory requirements.
>>41602858NTA but I've trained models with batch size 12 on my RTX 2060 6GB VRAM.
>>41602860Ah, good. So just means longer training times then. I'll look further into it and attempt within the next few days. Will keep posted.
>>41602782
https://files.catbox.moe/bn6xn4.mp3
Flim (SoVITS epoch 96, GPT epoch 36): https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Flim-SVe96-GPTe36
gen: see above
Flam (SoVITS epoch 96, GPT epoch 48): https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Flam-SVe96-GPTe48
gen: see >>41602624
(auxiliary references were used for these, so I'm going to refrain from listing them)
https://x.com/genmoai/status/1852154518911304152
>>41602947
For some reason it didn't register to me that it was Nightmare Night until now.
Black Snooty (?) (SoVITS epoch 96, GPT epoch 32): https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/NightmareMoon-SVe96-GPTe32
https://files.catbox.moe/b0524r.mp3
https://files.catbox.moe/kxb5iv.mp3
I think what I'm noticing more clearly is that as GPT epochs increase, the model generates more of its own information and has better naturalness/accent resemblance to the overall character, whereas at lower GPT epochs it seems to adhere more closely to the reference.
>>41602539>>41602526
Berry interesting, I will mess around with this over the weekend. Thanks for posting this; looks like I will not have to decide between cucking to Win11 or going full Linux autism for at least a few years.
>>41603208
>The model requires at least 4 H100 GPUs to run
wew. I mean, it's cool the video generator AI exists out in the open for people to use, but it's going to be two or three years until this is downsized to the degree that nor/mlp/eople can casually use it.
>early are bump
Guess who else has 17 seconds of data?Octavia Melodyhttps://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Octavia-SVe84-GPTe48gen: https://files.catbox.moe/2q9dp1.mp3
https://files.catbox.moe/a41mh4.mp3
https://huggingface.co/fishaudio/fish-agent-v0.1-3b
>>41604708
I remember finetuning fish-speech before. I don't have the samples from back then, but I was not very impressed by its character resemblance even after finetuning.
>>41604390welcome back tts Octavia voice, it's been awhile.
https://files.catbox.moe/e9dcwv.mp3
https://github.com/etched-ai/open-oasisWhat if this was fed Gameloft pony gameplay, or similar? What kind of wild, unnatural and possibly cursed mares would it unleash?
https://files.catbox.moe/3i1qn5.wav
So I have been thinking, is there anything somebody could do to help out with PPP and pony AI related stuff that doesn't involve running/training models on a 10+ GB VRAM GPU?
>>41607544
Nice
>>41607871
>Run/train models on a lower VRAM GPU
>Make content
>Study ML, make toy ML projects
>Make toy LLM projects using the free yet severely rate limited models on OpenRouter idk
>Clean up OP/docs
>>41602526>>41602539
So I looked into the link and uhhhh, how do I convert this stuff into an exe installation file?
>>41607977https://github.com/i486/VxKex/releases/tag/Version1.1.1.1375appears to have exe
>>41602947Did you figure out the proper amount of audio needed to get a good voice?
>>41608011If you're OK with this level of performance >>41604390as low as 17 seconds seems to be possible (haven't bothered with lower), but with that little data any accents will suffer (for instance, how she pronounces "gold"). Some voices seem more well-behaved than others; deeper voices tend to have a bit more trouble.
>>41604390Any chance for requesting the S1E2 Woona voice?
>>41607251>https://files.catbox.moe/l5jo3e.mp4man, i love how strange this feels, its like trying to explain a dream morphing from one thought to another.
>>41608282She only ever spoke nine words. Like I said when 15's site was available, you might as well just use the later Luna voice and raise its pitch with audio software:https://u.smutty.horse/magmddxudmm.mp3
>>41609407
https://files.catbox.moe/f6jbkt.png
>>41598009
It's because you're in a Python env environment; install the Qt plugin in there, or just copy the .so
>>41608282Wasn't she voiced by Tabitha too?
>>41610135Imagine the taste
>>41608688>>41611327
She only has 3 seconds of audio, but I can't stop thinking about how sweet and innocent it sounds. I just wish to hear more of it.
How do I use the API of GPT-SoVITS to connect with SillyTavern? I tried, but the model tends to repeat itself and make nonsense words...
>mares
>>41578166What was the amount of data for Cadance?
Would Anons here be interested in doing an original AI song album for marecon (which I would assume will happen sometime in January)?
The general theme would be mares singing about/to Anon, whether the subject be love/hate/friendship, and given that Anons here have all kinds of different tastes in ponies and music, I feel like it would be a pretty interesting spectrum of songs to enjoy. I was thinking of doing this solo, with one song per M6, but then I thought there must be at least a few Anons out there that may be interested in this idea as well.
All the AI song tools for this are currently free (and while I hate the service model style of Udio/Suno, they are currently the only fully working song models unless something changes in the next few weeks). From there one can use a preferred audio/vocal separator and apply RVC/SoVITS/other AI tools from the PPP to give the songs the proper pony voice.
>>41611817quite so
>>41611402Some guy made an AI out of fucking Dark Souls 1 male pain noise. S1 Luna voice would work.
>>41613907I don't want to imagine how that sounds.
Page 10 save.
So it turns out Tara Strong did the Twilight Sparkle voice for a Disney pilot 5 years prior to working on MLP, at least that's what it seems like:
https://www.youtube.com/watch?v=tJoW6rNR_A4
>>41608282>>41608688
Nobody ever told you guys to find Tabitha's original voice for it? Cause I keep fucking trying, but I cannot find her exact raspy voice.
There's also another problem. Her "I'm so sorry" has a raspy voice that Tabitha never uses. Her second line, "I missed you so much big sister", doesn't have the same raspiness. So basically you've got to contact Tabitha to make you a voice, unless you guys can find some Japanese or western cartoon where she did that exact raspy accent.
>>41608688
It's not the same accent/pitch/personality. It just sounds like a younger Rarity. Also, 15.AI still sounds like shit; he lost the AI race.
>>41617765
Tara Strong uses the same 3-4 voices, and her generic-ass voice isn't hard for other voice actors to replicate. However, that voice is spot-on for the exact pitch & accent she used for Twilight.
Holy shit, that's a lot of popular voice actors.
>>41614604
>>41611430What's it sound like in that weirded state? I wish to hear the sound of misconfigured mares.
First time checking in on these threads in about 9 months
>it's just some guy posting porn every other day or so to keep it from falling off
Well that's sad, but now that AI voice cloning is democratized and is as easy as downloading a model and throwing up a w-okada instance, what is the continued goal of the project? And did 15 ever stop being a faggot and explain why he disappeared for years at a time and continually missed promised launch dates while taking everyone's money?
>>41621671I'm not a regular, but I don't think voice cloning is everything there is to this project. Also, it's still not perfect though it's coming close to the big players (11labs). 15.ai was the goat, but now it's time to move on.
>>41621950
Gotta agree that this part of the year is stupidly busy irl, and with the lack of spare time, the pony content making time is sadly pretty limited.
Page 10 bump.
>>41617765Wow! Glimmer
>>41601267Ran into this error after trying to use my own recorded reference lines for the first time. Not sure if it's a bug or if I've done something wrong?
Back from an involuntary vacation. J*nnies apparently consider my posting patterns + bumps "flooding". Frankly, if this keeps happening I'm going to move to NHNB.
>GPT-SoVITS
I've released alternate versions of some of the models trained with more GPT and SoVITS epochs on my HF: https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main
The old models will remain uploaded (as long as huggingface keeps letting me dump models onto their servers for free). I did not exhaustively check to see how they actually performed, so it's possible some may be screwed up. Should be improved audio quality, but character resemblance/reference resemblance/coherency may vary.
>Music
I have finally forced myself to make an original song for the first time in nine months: https://files.catbox.moe/iu1jyb.mp3
Lyrics: https://ponepaste.org/10467
I thought it would be interesting to train a so-vits-svc 5.0 model on Luna's speaking voice (Tabitha St. Germain) for this one, since I'm not completely satisfied with Aloma's audio quality or timbre. I think it turned out OK. https://huggingface.co/therealvul/so-vits-svc-5.0/tree/main/Luna%20Speaking
>>41624678I think this is because you haven't specified a character name for the sample. Unfortunately I'm retarded so I didn't anticipate that case.
>>41624737King. Thx for keeping this shit alive.
>>41624737I too, hate janniesGreat to see you back at it again.>>41624743Yeah that was the problem, assumed that I could leave everything except "Utterance" blank from reading the instructions, I suppose there's one little thing in there somewhere that has character name as a dependency. No matter to me though, from this experiment I once again learn that so-vits just doesn't vibe at all with my voice so will be sticking to samples from the master file. Cheers.
>>41624737
Not the first, nor is it going to be the last time jannies act like complete fucking mongoloids on this site. Nice song though; I can't put my finger on it, but it reminds me of something I've listened to between '06~'10.
>>41624678
GPT-SoVITS GUI revision 4
I've updated the program to fix this behavior and also a bug where all the 'n's in filenames were being sanitized out.
https://drive.google.com/file/d/1dgG1kg0e9p4khrwpPaI9NdV_PIiMDGOZ/view
>>41611865
Cadance has 13 minutes of Clean+Noisy data.
>>41624752>>41624893>>41625105
Thanks.
Hey, so I was doing more experimenting with Ace Studio and using the custom voices thing to make the mares sing, and I think they made some adjustments to how they handled the tone of the voice? It seems like it takes accent more into account.
Applejack on Solo23 - https://files.catbox.moe/3ywhaz.wav
Applejack on Verse24 - https://files.catbox.moe/m1c403.wav
Twilight on Solo23 - https://files.catbox.moe/axyrm0.wav
Twilight on Verse24 - https://files.catbox.moe/7734xc.wav
Rarity on Solo23 - https://files.catbox.moe/4b8g4k.wav
Rarity on Verse24 - https://files.catbox.moe/x8tdr6.wav
I still don't like the whole subscription thing, but I do admit that they're getting better at preserving accents, and I like that.
>>41624737>https://files.catbox.moe/iu1jyb.mp3This is really trippy, nice. Love the tone it sets, I haven't heard a lot like it.
>>41624737Honestly after using these a little I think GPT epoch 24 is a mistake. The resulting tone of speaking seems a lot more boring which is not what we're really going for here.
>>41624737very rich textures
>>41626354
What was the rationale for going from SV24-GPTe8 to SV96-GPTe24? I'm trying to understand how this thing works and where to stop the training.
Also, you can dump as much as you want on HF; there is a 10K file limit per folder and 50 GB per file. I have one repo with like 1 TB of checkpoints lmao.
>digital mares forever
>>41626871
My working theory is:
- SoVITS training generally increases audio quality and the quality of sibilants, up to some point of overtraining
- Some degree of GPT training is needed just to get plausible results. After that, more GPT training seems to "increase" the "plausibility" of the delivery (pitch and rhythm) at the cost of variation and possibly increased pronunciation errors
>>41626736Hopefully, yes.
>>41625728>You need accessreeee!
>>41627548Fuck fixed
>>41626871Thanks for the feedback. I take it that the overtraining point for Sovits was at epoch 96 then? I hope you can find a good middle ground for GPT. Also, I am wondering if DPO has a big effect.
>>41627957I think it was for Twilight, I just extrapolated it to the rest so as to not think or test as much. I'm really not sure I preferred anything I observed past GPT 8.
Up.
>>41621240
Repeat the reference audio randomly throughout the text
>>41630112
I don't know specifically about SillyTavern, but this hallucination can happen in a few situations:
- If there is no reference text/utterance being passed alongside the reference audio
- If the text being passed to generation is empty or very similar to the reference audio's utterance
- If the generation is too long (this can happen with the 4-sentence batching method)
idk if you guys care about AI music covers anymore, but I just finished two projects from last year.https://www.youtube.com/watch?v=Y-9K9aWhutkhttps://www.youtube.com/watch?v=HgfsKS-Ux_A
>>41631289I love AI music, covers included.
Zero shot F5-TTS with no training or finetuning using a randomly selected 5-second reference audio:https://voca.ro/1c0r89ojIOcMUsing RVC as a post process:https://voca.ro/1ZTudWW2y9jA
>10
>>41632050
>F5-TTS
Interesting. The first clip does not really sound like any specific character, but having some other TTS alternative is always nice.
I'm sure somebody has asked this before, but are there any TTS programs that allow training custom voices that are NOT pytorch/AI based? I know the Vocaloid copycats SynthV/UTAU allow for that, but their UI is not really friendly for simply copy-pasting text into it. I'm just looking for something closer to MS Sam tier.
I know the TkinterAnon and DeltaVox TTS programs exist, but I was hoping for a project that is a bit more up to date, as these programs had massive issues reading stuff and were extremely non-customizable.
>>41632943
>https://github.com/rhasspy/piper
>https://www.youtube.com/watch?v=rjq5eZoWWSo
>https://www.youtube.com/watch?v=b_we_jma220
Alright, I lurked around and there is this project from last year called Piper TTS, designed primarily to run on a Raspberry Pi 4 (so max requirements to run it cannot exceed 8 GB of RAM).
The training process does not seem to be that much of a pain in the ass, and from the github they state the project is supposedly able to work just fine in Linux/Windows/Mac environments. No idea about minimum and maximum requirements (other than that audio needs to be 22,050 Hz and fine-tuned on the large model for the best quality), so that needs to be tested.
This would benefit from having a proper UI, as using just the plain terminal is doable but I feel it would be too much of a pain in the ass for everybody else; that's something to look into in some distant future.
>>41633143Check the archive someone already trained models
>>41625728I have a really old graphics card (GTX 770M) that is not supported for any recent version of CUDA. When I attempt to run the inference GUI (revision 4), I get this error. I already have the latest official NVIDIA driver installed for it, and I'm pretty certain this GPU will never work with GPT-SoVITS. Is there a way to force the inference GUI to ignore my GPU and use the CPU instead?
>>41633650
I found a workaround. I can just set the CUDA_VISIBLE_DEVICES environment variable to an empty string and it will use the CPU instead. It's still reasonably quick on old hardware too (20-50 seconds for a 1-2 sentence generation).
https://files.catbox.moe/ee1adp.mp3
>>41633798screenshot for demonstration
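Since this comes up a couple of times in the thread, here's the same workaround expressed in Python for anyone launching from a script of their own; the env var has to be set before torch is imported:

```python
# Hide all CUDA devices so torch falls back to CPU; must run before "import torch".
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import torch
print(torch.cuda.is_available())  # expected: False on the CPU-only fallback
```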
>>41633650
>CUDA_VISIBLE_DEVICES=""
Hmm, I've tried that and I'm still getting an error:
File "torch\__init__.py", line 130, in <module>
OSError: [WinError 193] %1 is not a valid Win32 application. Error loading "Q:\_AIfromC\_AItts\GPT-SoVITS-v2\gptsovits_11-7-24-r4\_internal\torch\lib\c10_cuda.dll" or one of its dependencies.
[PYI-29800:ERROR] Failed to execute script 'gui_client' due to unhandled exception!
>>41633798>>41633803
Yeah, it would be nice if there was an option to add a parameter/argument on the command line like "--CPU=TRUE" to force it to start in plain CPU mode.
>>41625728Is there any place to get more pony voices for this?
>>41634178I'm not aware of anyone else having trained pony models for it.
>>41633979
>download.pytorch.org/whl/
>This site can’t be reached
Fucking great, absolutely splendid. Does anyone know an alternative/archive to the above that can be used to install a different version of torch?
>>41634209>download.pytorch.org/whl/Works for me anon. Although updating your pytorch in your own python installation won't affect anything from the pyinstaller zip. I'm not sure to what extent the cuda build of pytorch relies on the cuda DLLs being loadable (it looks like it's failing pretty early on in the process), so it's possible that for portability I might actually need to maintain two packaging environments and builds--one for CUDA and one for CPU only.
>>41634209>>41634244>CPU onlyCan you try this? (Also updated github readme with this link)https://drive.google.com/file/d/1FVuwuKyUqfuRcHVKr-ACgAul4araUjPN/view?usp=sharing
>>41634244
The pytorch website is acting like a bitch for no reason, and I had to use a VPN to install the following torch that works on my old PC:
pip3 install torch==1.10.1+cu113 torchvision==0.11.2+cu113 torchaudio===0.10.1+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html --force-reinstall
However, I was still getting errors like the above, so I finally decided to install that "KexSetup_Release_1_1_1_1375.exe". In properties I set the option to use "Win7 pack 1", and "gptsovits.exe" was able to open up from the console like for the Anon above.
https://voca.ro/1gMIOaBVPaIU
I haven't tested it thoroughly, but it is working, and messing around with the Blueblood model I can see that it seems to struggle with longer 5s clips; however, this is way better than not having any TTS options for it.
There seems to be one problem: I cannot replay an already played clip (I can see the animation of the red bar going from start to end, but no sound on replay).
The Generation window could also use the name of the clip that was generated and is being played, since I can see myself spamming like five dozen lines and forgetting which ones I liked and having to re-listen to all the clips all over again.
Also, I think I found a bug: when setting "Seed" to a non-random number and "Repetition" to a number higher than 1, the program seems to get stuck in a never-ending generating mode?
>>41634377
Uh oh, sorry, the above problem seems to be solved for me, but potentially the other Anons could use this version.
>>41634446>The Generation window could also use the name of the clip that was generated and being played since I can see myself spamming like five dozen of lines and forgetting which ones I liked and having to re-listing to all the clips all over again.You can drag and drop audio clips from the play button into another folder or DAW, does this help?>Also I think I found bug, when setting "Seed" to non-random number and "Repetition" to a number higher than 1 the program seems to get stuck in never ending generating mode?Some generation errors don't get propagated all the way up, check the console.
>>41634475
>You can drag and drop audio clips from the play button into another folder or DAW, does this help?
Not really, since I will still need to re-listen to all the clips all over again, instead of making a note like "pony123 sounded bad but clip pony124 sounded pretty good" while generating the next batch.
Also, there seems to be an issue with the TTS part hallucinating extra words: trying to generate "and Derpy is my beloved pony." I get "and Derpy WHY is my beloved pony." I can somewhat fix it by changing the word to "Derpee" to minimize the interaction with Audacity.
Go home guys. Udio.com won.
>>41634640
You could have said the same about Amazon Alexa in 2019, and yet here we are.
>>41634610
>not really, since I will still need to re-listen to all the clips all over again, instead of making a note "pony123 sound bad but clip pony 124 sounded pretty good" while generating next batch.
Then just keep all of the clips that sound good in a separate folder. Do you intend to keep the bad sounding clips? What exactly are you trying to do?
>I can somewhat fix it by changing the word to "Derpee" to minimize the interaction with audacity.
I think this happens because the repo authors use an extra library for word segmentation which probably detects the word "Derpy" as two words, "Derp+y". There's also ARPAbet support, which won't be affected by this issue.
>>41633650I just took another look at this screenshot I posted earlier and now I feel really dumb. The problem was not my graphics card, it was the application failing to download the pretrained s1-BERT model. After setting CUDA_VISIBLE_DEVICES="", I no longer got the CUDA warning at the top of the console, so I assumed I had "fixed" that "problem" and had ran into a different error. In reality, I think it was the same error and stack trace preventing the startup, just without the CUDA warning. To resolve it, I manually downloaded the model from https://huggingface.co/lj1995/GPT-SoVITS/tree/main/gsv-v2final-pretrained and saved it to GPT_SoVITS\pretrained_models\gsv-v2final-pretrained\I tried starting the GUI again just now without setting CUDA_VISIBLE_DEVICES="" and it worked, which confirms that the GUI will default to using CPU if it detects an old graphics driver; it just prints a harmless warning to the console. No need to mess with environment variables after all.
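If anyone would rather script that manual download instead of clicking through the browser, something like this should work via huggingface_hub, assuming the repo id and folder name are still as in the post above (the target directory mirrors the path mentioned there):

```python
# Pull just the gsv-v2final-pretrained folder from the upstream GPT-SoVITS HF repo.
# Repo layout is assumed from the post above; adjust local_dir to your install.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="lj1995/GPT-SoVITS",
    allow_patterns=["gsv-v2final-pretrained/*"],
    local_dir="GPT_SoVITS/pretrained_models",
)
```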
>>41635014Didn't we use to have a customized ARPAbet dictionary that added words that were only found in the show? Is GPT-SoVITS able to load it?
>>41635317I think by default it uses its own universal dictionary (some version of the CMU pronouncing dictionary). You might be able to hack in the horsewords one, but not sure how it would interact with the word segmentation problem.
>>41634640Who?
Mare!
>>41636331and again
>>41634640What's that supposed to be?
>>41636830yes
>GPT-SoVITS
Has anyone else tested how far you can push the clip reference without actual training? I've tried Vinyl Scratch's og voice on the Rarity model and the results were better than expected, but still meh.
https://vocaroo.com/11g3P0OEfrme
It seems elevenlabs has the best voice denoiser tool, or is there an open source alternative I can use without making a throwaway account?
>>41639035
What's wrong with using Ultimate Vocal Remover? Vul made a model that removes random background noises some time ago.
>>41633143
>https://github.com/rhasspy/piper-phonemize
Hello, I require the assistance of someone who is a much more competent coder than I am. If somebody would be willing to turn this github repo into a Windows wheel, that would be really appreciated. For whatever reason, Microsoft Visual Studio 2022 is pissing and shitting itself on my PC, and all the other wheels for the above only exist in Mac and Linux format.
I was trying to use the "make" command, but for reasons once again unknown to me it gets the job done to 95% and then just errors out.
Derp the Wind! I happened to make this right before the song was discovered recently. I made it for the Ponyville Ciderfest mix tape, and I usually don't upload songs I make for convention tapes immediately, but this warranted an exception.https://www.youtube.com/watch?v=p9OFkfzZuLg
>>41640508
One-shot generation: https://files.catbox.moe/lbaww6.ogg
I'm probably going to consume a lot of news this way.
>>41640954>A whole broadcastNeat idea. I guess you could use it for stuff like text review and proofreading too.
>>41640954>Guest>HostWas the dialogue generated by an LLM?
>>41641113Yeah. I had it deliberately set the names to Host and Guest so I could swap out the voices without the result being too distracting. It was generated through about 1200 calls to llama 3.1 70b based on this text:https://buttondown.com/ainews/archive/ainews-bitnet-was-a-lie/It was mildly complicated. The code should be published soon™.
>>41625728I submitted a pull request to patch api.py and Dockerfile for better automation. The updated api.py accepts a file path for prompt_text instead of the actual text. It's so the caller doesn't need to know specifics about what reference files are available, which makes it easier to decouple the caller from the api server. I don't think you're using api.py, so it shouldn't break any of your code.
>>41640954
Rarity and Starlight discuss this thread: https://files.catbox.moe/9xn1x4.ogg
Transcript: https://files.catbox.moe/exo0sz.txt
I tried to get it to cover the input document more comprehensively & faithfully. It tends to discuss redundant topics when doing this for threads. Fixing that will probably require some preprocessing step to create an organized document from the thread. Using more context could also fix it, but the API I'm using only supports an 8k context window. I'll get back to this later.
>>41641441Would make my day if you could make another one with Pinkie and Dash. Voices here seem really impressive.
>>41572862numget
>>41641297Is this still GPT-Svoits or something else?
>>41641395ok merged
>>41642134It is GPT-SoVITS with Vul's voice models & Master File clip references.>>41641944I ran into a daily token limit, but will do once I can.
>>41640954>>41641441Listening to these put a big stupid grin on my face and prompted me to think back to the days of the first threads when this was all just getting started. Being able to now make voices this good with relatively little work makes all the effort feel worthwhile.
>>41571795EQG when?I need them for ponifications, I swear.
>>41643470>>>/trash/
https://files.catbox.moe/bjjhq7.mp3
>>41641944
Here (You) go: https://files.catbox.moe/vdc6i7.ogg
Pinkie and Dash geeking out over Vul's commit history.
>>41643942
This stirs something primal in me.
>>41644174>55 minutesOh boy.
>>41644174Yo THANKS for coming through, dude! Really appreciate your work here
>>41644174Do you use some kind of system to detect who is speaking from the text, or is this just set up as chat style text:"character 1: text" "character 2: text" ?
I'm working on a project, and I'm trying to find the best way to prompt a text-to-speech through a Python script, whether that be sending a request or importing it directly. Is there any API thing that comes with haysay that I could just hook into? Or any other repos that I should look at?
^ I too would be interested in an answer to Anons question above.
>>41644760
For this one, I had it generate chat-style text, exactly as you described, then I parsed it out (a rough sketch of that parsing step follows this post). The transcript in >>41641441 is exactly what it generated, just in small segments. In cases where I want to parse out text, I usually have the LLM generate JSON with the relevant fields already separated. In this case, the LLM produced worse dialogue when I had it output JSON, which is why I had it just generate chat-style text.
>>41645227
https://github.com/synthbot-anon/horsona/tree/main/samples/gpt_sovits
In your own project, you just need to use two of the files:
- A utility class for making sure parallel calls involving different speakers are ordered properly, to avoid switching out the voice model too frequently: https://github.com/synthbot-anon/horsona/blob/main/src/horsona/lock/resource_state_lock.py
- The class for actually generating speech: https://github.com/synthbot-anon/horsona/blob/main/src/horsona/audiogen/gptsovits.py
Example code once you have those two files: https://github.com/synthbot-anon/horsona/blob/main/samples/gpt_sovits/src/main.py
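For anyone wiring up something similar, the "parsed it out" step can be as simple as the sketch below; the speaker labels are whatever your prompt asked the LLM to use, and this is not the horsona code itself:

```python
# Rough sketch: split chat-style LLM output ("Host: ...", "Guest: ...") into turns.
import re

TURN = re.compile(r"^(?P<speaker>[^:\n]{1,40}):\s*(?P<text>.+)$")

def parse_dialogue(raw: str) -> list[tuple[str, str]]:
    turns = []
    for line in raw.splitlines():
        m = TURN.match(line.strip())
        if m:
            turns.append((m["speaker"].strip(), m["text"].strip()))
    return turns

print(parse_dialogue("Host: Welcome back.\nGuest: Thanks for having me."))
# [('Host', 'Welcome back.'), ('Guest', 'Thanks for having me.')]
```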
>>41645829I'm just annoyed that you didn't even try to search and replace Host and Guest with Rarity and Twilight, or AI with "ae eye".
>>41646072
I switched to working on something else after I had it written. Right now, everything is generated in a fire-and-forget way, so I don't get a chance to modify the transcript before it's passed to the TTS. Once it's published, I'll clean up things like that.
On that note, horsona updates:
- [Done] I added OpenAPI support when running the node_graph server for game engine integration. Here's an example of the spec it generates: https://ponepaste.org/10498. It generates this dynamically on the /api/openapi.json endpoint. There are a lot of tools for automatically generating clients from OpenAPI specs https://github.com/OpenAPITools/openapi-generator so this would be an easy way to expose any functionality written with the library to external clients.
- ... [In progress] I'm going to write a sample application for this.
- [In progress] I'm working on a way to add explicit causal reasoning to LLMs. I'll commit all of the changes for this once I have the whole thing working.
- ... [Done] I have a module that can do causal regression given a small number of datapoints & a small causal graph.
- ... [Done] I have a module for picking representative data points for cases where it's given too much data. I wrote this because causal regression is slow with a large number of datapoints.
- ... [In progress] I'm working on a module to chain together analysis from multiple small models. I mostly know how to do this, but implementing it is tedious.
- ... [In progress] I can get an LLM to generate small causal graphs and datapoints from small snippets of text. I'll need to test it for robustness, then update it to handle streams of text.
- [In progress] I'm writing a few modules to handle streams of text. I can get an 8k token context window to handle about 25k words of context right now with a combination of the GistModule + Recent Messages. I have some thoughts on how to get that number much higher using multiple levels of Gists.
- [Done] TTS support through GPT-SoVITS + sample application.
>>41646545ay
>>41646911neigh
>>41647285nay
Collab song with Vulhttps://www.youtube.com/watch?v=vwZRqM9quic
I tried something, but somebody else should try it with Rebecca Shoichet or Tara Strong's voice.1st one is using Udio. 2nd one is using ElevenLabs.https://vocaroo.com/1oHodh5vYKpRhttps://vocaroo.com/1eG8IzJ6C5y6Here is the original speech for anyone else to give it a go:A cutie mark is far more than a mere symbol or identifier—it is the distilled essence of a pony’s very being. It is not simply a reflection of a talent or hobby, nor a role assigned by society. Instead, it stands as an intricate, immutable emblem of individuality, representing a pony’s soul, heritage, and identity in a way that is both deeply personal and profoundly abstract.This mark is a tapestry of meaning, weaving together culture, ancestry, character, and spirit. It is a flag of individuality, a coat of arms that each pony bears proudly. Like a fingerprint unique to the self, it cannot be replicated or erased. In its permanence, as confirmed in Call of the Cutie, the cutie mark becomes a lifelong affirmation of one’s unique narrative—a sacred banner of identity and self-discovery.To reduce such a profound symbol to a mere vocational label, as some later depictions in Cutie Pox or Magical Mystery Cure attempt, is to strip it of its true magnificence. A cutie mark is not a job or an obligation; it is a timeless reflection of the harmony between body, mind, and soul. To trivialize its meaning is to misunderstand its transcendent role in expressing individuality and purpose.Scientifically, one might liken it to a unique genetic code, an expression of existence so layered and intricate that it defies reductive interpretation. Emotionally, it is a beacon—a radiant testament to the miracle of identity and the wonder of self-expression.Let the cutie mark remain untouched, its beauty unblemished and its meaning untarnished. To honor the cutie mark is to honor the sacred, irreplaceable essence of the individual. It is a celebration of the complexities that define us, a crystallized symbol of the infinite beauty of the soul. Let it forever stand as the brilliant coat of arms it was always meant to be—a shining flag of the heart, unfurled in the winds of life. A true snowflake essence known colloquially as snowpity.
>>41647735nice
>>41647735
very nice
>>41647801
>2nd one
This is not sounding good at all; not sure if this is due to their service output or some option messed around with in an audio editor.
Small VLMs?
I'm new to gpt-sovits, please I need help. This is from the rentry:
>Here are the recommended settings for SoVITS training:
> Batch size: 2 (1 if your gpu has 6G vram)
> Total epochs: 8
> Text model learning rate weighting: <=0.4
> Save frequency: 4
I've seen that the biggest points of contention were with these specific settings. What settings would you suggest? How long does it take to train 8 epochs, just to get an idea? I want to mess with it but not let my GPU run for a month. I have a 1080 Ti and 32 GB of memory.
>>41647801So uh… how did you do the first thing with Udio? I thought it was a text to music site. Can you guide us through your process?
>>41649757
Ok, I tested it and it produced something within 10 minutes with the default settings. Not bad. The results weren't great, but I liked it better than E2/F5. Pronunciation is worse, but virtually no mistakes.
Now I'm training with the maximum allowed epochs for both, and it seems it's still fast enough; gonna be done in 45 min or so.
Is there a way to unlock the UI to train beyond 25/50?
>>41649757>>41601185>>41626354>>41626871Not recommendations but guidelines. You can save multiple checkpoints at epoch intervals and test them too.
>>41649818
Update, it might be done sooner than that lol
>>41649818
About 15 minutes for 25 epochs on a 1080 Ti, running with a batch size of 6.
>>41649819How can I unlock the max epochs in the ui pretty please?
wtf is this? I see Chinese; is that normal when training an English model?
>b
>>41649852Don't worri mai ferrow anon
>>41650140yes
>>41649818>>41649824>Is there away to unlock the ui? to train beyond 25/50I'm not sure what the other anon training it was doing, but I just unlocked it via inspect element and it worked fine, so you can do that.
Up from 10.
>>41649852The repo was worked on by a chinese guy so everything is going to default to it.
>>41652461
>RVC was made by a group of rando Chinese programmers
>This one as well
Uh oh, it's a little bit worrying that all the quality AI projects are only being progressed by three groups: globohomo western corpos, china commies and the small group of horsefuckers.
>>41652933They're the ones that care the most about AI and don't have to deal with all the red tape that western devs have to deal with since they don't give a shit about things like ethics or copyright.
https://vocaroo.com/14eyuFuDu0Zs
>>41653178Shouldn't Celestia's voice be deeper than that?
>>41646158
Horsona updates:
- [Done] I finished the sample application for automatically generating an SDK. The usage looks a little ugly since the SDK generator I'm using generates ugly code, and I had trouble finding a better one for python. It at least shows that the auto-generated OpenAPI spec works. Hopefully there are better generators for other languages. C++ and C# generators seem to be the important ones for game engine integration. (Unreal Engine, Unity.)
- ... Code: https://github.com/synthbot-anon/horsona/tree/main/samples/node_graph_client
- [In progress] I'm working on an OpenAI-compatible interface for custom modules. The basic idea is: some modules build custom functionality into the LLM API (e.g., generate results like some character, automatically include things like RAG and Gists, etc.), then run a script to start a server to create an endpoint for that module. Then chatbot UIs like SillyTavern can use that endpoint instead of Ollama/OpenAI/Anthropic to get better & more tailored text generation with a lot more customization options than what the UI itself supports.
- ... [Done] I cleaned up a bunch of code to make this possible and to make it easier to create custom LLM APIs. Here's an example for how to create one that can reference a ~25k "canon" story with an 8k LLM context window: https://github.com/synthbot-anon/horsona/blob/main/src/horsona/memory/readagent_llm.py
- ... [Done] I have the code for creating an endpoint for custom LLM modules here: https://github.com/synthbot-anon/horsona/tree/main/src/horsona/interface/oai.
- ... [In progress] I need to write a sample server showing how to create & expose a custom module, then test it with SillyTavern.
No changes from the last post:
- [In progress] I'm working on a way to add explicit causal reasoning to LLMs. I'll commit all of the changes for this once I have the whole thing working.
- ... [In progress] I'm working on a module to chain together analysis from multiple small models. I mostly know how to do this, but implementing it is tedious.
- ... [In progress] I can get an LLM to generate small causal graphs and datapoints from small snippets of text. I'll need to test it for robustness, then update it to handle streams of text.
- [In progress] I'm writing a few modules to handle streams of text. I can get an 8k token context window to handle about 25k words of context right now with a combination of the GistModule + Recent Messages. I have some thoughts on how to get that number much higher using multiple levels of Gists.
>>41654108I don't think that's ai generated.
Zero-shot Voice Conversion with Diffusion Transformershttps://arxiv.org/abs/2411.09943>Zero-shot voice conversion aims to transform a source speech utterance to match the timbre of a reference speech from an unseen speaker. Traditional approaches struggle with timbre leakage, insufficient timbre representation, and mismatches between training and inference tasks. We propose Seed-VC, a novel framework that addresses these issues by introducing an external timbre shifter during training to perturb the source speech timbre, mitigating leakage and aligning training with inference. Additionally, we employ a diffusion transformer that leverages the entire reference speech context, capturing fine-grained timbre features through in-context learning. Experiments demonstrate that Seed-VC outperforms strong baselines like OpenVoice and CosyVoice, achieving higher speaker similarity and lower word error rates in zero-shot voice conversion tasks. We further extend our approach to zero-shot singing voice conversion by incorporating fundamental frequency (F0) conditioning, resulting in comparative performance to current state-of-the-art methods. Our findings highlight the effectiveness of Seed-VC in overcoming core challenges, paving the way for more accurate and versatile voice conversion systems.https://github.com/Plachtaa/seed-vc
>>41654800It's kind of frustrating that the API stuff for the gpt sovits sample is mixed in with the same download containing over 2GB of pretrained models. Wouldn't it be better to separate them.
>>41656154>2GBFirst day with python?
>>41656242
Agreed. I updated it.
The updated voices file: https://drive.google.com/file/d/106i6hQVDrUuULe_k8-MSi7wB4fW0X2Qx/view?usp=sharing
- This one contains only the tts config and one voice folder for reference.
And the updated readme: https://github.com/synthbot-anon/horsona/tree/main/samples/gpt_sovits
- The only changes are (1) use the new voice file, and (2) use synthbot/gpt-sovits:v3 instead of v2.
>>41654800
Horsona updates:
- [In progress] I'm almost done with a module to use significantly more memory than the context window allows. This one combines ReadAgent gist & fetch with RAG. The RAG is used to identify relevant gists up to some character limit, then ReadAgent is used to pull the most relevant pages into context up to some other character limit. The data can be organized like a filesystem. Once this is done, I'll create the SillyTavern integration sample app based on this. I've already tested the SillyTavern integration to make sure it works.
No changes from the last post:
- [In progress] I'm working on a way to add explicit causal reasoning to LLMs. I'll commit all of the changes for this once I have the whole thing working.
- ... [In progress] I'm working on a module to chain together analysis from multiple small models. I mostly know how to do this, but implementing it is tedious.
- ... [In progress] I can get an LLM to generate small causal graphs and datapoints from small snippets of text. I'll need to test it for robustness, then update it to handle streams of text.
- [In progress] I'm working on an OpenAI-compatible interface for custom modules. The basic idea is: some modules build custom functionality into the LLM API (e.g., generate results like some character, automatically include things like RAG and Gists, etc.), then run a script to start a server to create an endpoint for that module. Then chatbot UIs like SillyTavern can use that endpoint instead of Ollama/OpenAI/Anthropic to get better & more tailored text generation with a lot more customization options than what the UI itself supports.
- ... [In progress] I need to write a sample server showing how to create & expose a custom module, then test it with SillyTavern.
>>41656312>>41656154
Uppy.
>>41656790pone
This isn't really the best place to ask, but /g/ is not being very helpful here. Could somebody send me their windows msvw10 dll file on catbox so I could get this shitty error fixed?
Yes, I've already tried installing the recommended fixes: directx_Jun2010_redist, vcredist_x64, vcredist_x86, and none of them fixed it.
>>41657352This seems like an extraordinarily bad idea
>>41624737based bonus track
>>41658214>trusting random Anons on internetI know, but im bit desperate since the other alternative would be to re-install the whole windows system or move on to Linux (and kind of fucking up 90% of program workflow I have set up)
>>41659138
I searched my Windows laptop and couldn't find that file in C:\Windows.
>or move on to Linux (and kind of fucking up 90% of program workflow I have set up)
If you're a programmer, just do it. Windows sucks for so many reasons other than just workflow issues, and it's not obvious how much it's holding you back until you switch. You'll have better workflows with Linux + VS Code or Cursor + Vim anyway since you can automate things so much more easily.
>>41659189
>programmer
Sadly I am an artist, and my toolset involves using the type of shit that has not been updated for the past 5~15 years. I did try Linux every once in a while (I have it installed on a backup secondary drive), and I could never find proper alternatives, and the way I see people work with it is way too janky and limited for the exact autistic way I need to work.
Sorry to inquire here, I'm not a horse enthusiast but I am a voice clone TTS enthusiast. Currently, what is the best open source way to clone a voice and use it for TTS? I've tried GPT-SoVITS recently and I've been disappointed; I trained a model for an hour.
>>41660108https://files.catbox.moe/l4y9uo.wav
>>41660183
I've tried cloning Lydia's voice from Skyrim with a minute of handpicked dialogue lines and the maximum allowed epochs for both stages. I made sure everything that could be English was, and at the end it still sounded off, like a Chinese person who is somewhat fluent in English.
>>41660192Try GPT 24 epochs and SoVITS 96 epochs (inspect element to increase the max allowed epochs).
>>41660353
nta, how do I train a new model? I would love it if the main GPT-SoVITS script had a separate "train" tab like RVC has, to simply point it at audio references, check the correct boxes, and let a one-button click take care of the rest of the training.
>>41660406Check the guide https://rentry.co/GPT-SoVITS-guide#/
>>41660414
OK, about three hours in, and so far I've learned how to brute force a terminal into using a different Python installation, update said installation due to missing modules, modify the sys.path.insert call since it was somehow picking up the wrong directory, and butcher the shit out of i18n.py because for whatever fucking reason it was not opening the en_US.json file.
I will continue my adventures with the training tutorial tomorrow and hopefully actually train a voice with the new TTS for once.
>>41660192
Your settings are somewhat better than what I expected. It doesn't sound Chinese at least.
https://vocaroo.com/11JOWNbssSLh
This is an old model; the training data was already prepped for it.
>>41660353
Thank you kind sir, this is my Karlach model I cooked up.
>https://vocaroo.com/137bNdlFA5kG
From what I've noticed, DPO is either good or doesn't make a significant difference (the rentry guide says it's bad). No reference mode gave me more consistent results in inference, pronunciation-wise.
Anyway, what does temperature do? And top_k and top_p?
>>41661921
I'm not sure about DPO either (maybe someone here knows), but no reference mode doesn't really exist in the code; it's just caching your old reference. You might have a random seed, so it's generating something else with your same reference.
Temperature < 1.0: voice closer to the reference, but more pronunciation errors.
Temperature > 1.0: voice further from the reference, but sounds more natural.
The effects of top_k and top_p aren't very clear; I don't touch them.
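For intuition, here's a toy sketch of how these knobs usually behave in autoregressive samplers in general. This is not the GPT-SoVITS sampler itself, just a generic illustration: temperature rescales the logits before softmax, top_k keeps only the k most likely tokens, and top_p keeps the smallest set of tokens covering that much probability mass.

import numpy as np

def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    # Toy temperature / top_k / top_p sampling over a single vector of logits.
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-6)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]               # most likely tokens first
    if top_k > 0:
        probs[order[top_k:]] = 0.0                # keep only the k most likely tokens
        probs /= probs.sum()
    if top_p < 1.0:
        cum = np.cumsum(probs[order])
        cutoff = np.searchsorted(cum, top_p) + 1  # smallest prefix covering top_p probability mass
        probs[order[cutoff:]] = 0.0
        probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Low temperature sharpens the distribution (more deterministic, closer to the reference);
# high temperature flattens it (more variety, but also more pronunciation slips).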
>>41660812
Q:\_AIfromC\_AItts\GPT-SoVITS-v2\GPT-SoVITS-Lite>Q:\_AIfromC\_AItts\GPT-SoVITS-v2\GPT-SoVITS-Lite\runtime\python.exe Q:\_AIfromC\_AItts\GPT-SoVITS-v2\GPT-SoVITS-Lite\webui.py
Running on local URL: http://127.0.0.1:9874
IMPORTANT: You are using gradio version 3.38.0, however version 4.44.1 is available, please upgrade.
--------
"Q:\_AIfromC\_AItts\GPT-SoVITS-v2\GPT-SoVITS-Lite\runtime\python.exe" tools/asr/fasterwhisper_asr.py -i "Q:\_Vds\___Pie_in_the_sky\Valkyrie_SC1\clean\output" -o "Q:\_Vds\___Pie_in_the_sky\Valkyrie_SC1\clean\output\asr_opt" -s large-v3 -l en -p float32
Traceback (most recent call last):
  File "Q:\_AIfromC\_AItts\GPT-SoVITS-v2\GPT-SoVITS-Lite\tools\asr\fasterwhisper_asr.py", line 25, in <module>
    from tools.asr.config import check_fw_local_models
ModuleNotFoundError: No module named 'tools.asr'
The script keeps shitting itself by not finding the correct path to another Python script that is in the exact same folder.
>>41661928
For the no reference mode I just slap all the audio clips into the right "optional" panel, and it just works. What do you usually use for temperature? I think the max allowed is 2 even if I unblock the UI.
>>41661937
Then it's using the audio clips as reference. I leave temperature on 1 except if the voice doesn't sound like the character at all, then I lower it a bit (0.75-0.8). More than 1.2 and you get garbage, so there is no point in setting it that high.
>>41661931
So I fixed that by installing the ultraimport module, then swapping out the following code in fasterwhisper_asr.py:
>from tools.asr.config import check_fw_local_models
to:
>import ultraimport
>check_fw_local_models = ultraimport('__dir__/config.py', 'check_fw_local_models')
But now I'm getting this error:
Traceback (most recent call last):
  File "Q:\_AIfromC\_AItts\GPT-SoVITS-v2\GPT-SoVITS-Lite\tools\asr\fasterwhisper_asr.py", line 60, in execute_asr
    model = WhisperModel(model_path, device=device, compute_type=precision)
  File "Q:\_AIfromC\_AItts\GPT-SoVITS-v2\GPT-SoVITS-Lite\runtime\lib\site-packages\faster_whisper\transcribe.py", line 133, in __init__
    self.model = ctranslate2.models.Whisper(
RuntimeError: CUDA failed with error CUDA driver version is insufficient for CUDA runtime version
and the above is fucking bullshit, because as far as I can see CUDA runs perfectly fine on this system:
Q:\_AIfromC\_AItts\GPT-SoVITS-v2\GPT-SoVITS-Lite>Q:\_AIfromC\_AItts\GPT-SoVITS-v2\GPT-SoVITS-Lite\runtime\python.exe -c "import torch; print(torch.cuda.is_available())"
True
Q:\_AIfromC\_AItts\GPT-SoVITS-v2\GPT-SoVITS-Lite>Q:\_AIfromC\_AItts\GPT-SoVITS-v2\GPT-SoVITS-Lite\runtime\python.exe -c "import torch; print(torch.version.cuda)"
11.8
Q:\_AIfromC\_AItts\GPT-SoVITS-v2\GPT-SoVITS-Lite>Q:\_AIfromC\_AItts\GPT-SoVITS-v2\GPT-SoVITS-Lite\runtime\python.exe -c "import torch; print(torch.zeros(1).cuda())"
tensor([0.], device='cuda:0')
>>41662005
Uh oh, I may or may not have solved the issue. It seems ctranslate2 dislikes CUDA below 12, but the new 1.0.0+ faster-whisper requires a newer version of it, so both of them needed to be downgraded:
Q:\_AIfromC\_AItts\GPT-SoVITS-v2\GPT-SoVITS-Lite\runtime\python.exe -m pip install -U ctranslate2==3.24.0
Q:\_AIfromC\_AItts\GPT-SoVITS-v2\GPT-SoVITS-Lite\runtime\python.exe -m pip install -U faster-whisper==0.10.1
After that, only "GPT_SoVITS\process_ckpt.py" was acting a bit retarded by having the following lines:
from tools.i18n.i18n import I18nAuto
i18n = I18nAuto()
throwing out errors, BUT not actually using/referencing them inside its own script, so both got commented out and now the training seems to go pretty smoothly.
Now training the SoVITS part, and it looks like it can do 1 epoch per ~2 minutes.
>>41661941
Thank you anon, the 0.75 temp tip is really good for my model. Using the reference definitely helps. Idk if it's a good idea to slap all the clips into the right panel; it's a little monotone, but it's fine.
Training at 24 GPT epochs and 96 SoVITS epochs, using both reference panels, and having the temperature at around 0.8 helps tremendously.
>https://vocaroo.com/1fyBuaedJN3A
>>41662141
That's good. And no, it's averaging the clips; that's why it sounds monotone. Try only giving it one reference, the cleanest you have.
A fun test with Celestiahttps://files.catbox.moe/hsc1bv.wav
>>41662125
And the very last obstacle (errors from another script) was solved with a roundabout fix: adding the following lines above the "from tools.i18n.i18n import I18nAuto" import:
>import sys, os
>base_dir = r"Q:/_AIfromC/_AItts/GPT-SoVITS-v2/GPT-SoVITS-Lite"
>sys.path.insert(0, base_dir)
Other than that, trying to train on 25 seconds of not-great-quality audio resulted in not-so-great-sounding output, but I feel that if I let the SoVITS part of the model train for a few more dozen epochs it may sort itself out.
Bleak twiggle song
https://files.catbox.moe/enlb4d.mp3
https://ponepaste.org/10521
The ending is from a nightmare I had. I don't expect every new song to be a downer fwiw.
>https://files.catbox.moe/dfoo1r.flac
>https://huggingface.co/Amo/GPT-SoVITS-v2/tree/main/SC1_Valkyrie_v01_SVe70-GPTe10
Alright, after half a day of training the result is not great. The more I trained the GPT, the less coherent the TTS results ended up. The SoVITS training was also a bit funny: the epoch-70 checkpoint was not picked because it was good, but because it sounded the least bad out of all the generated models.
Given that the original 28s training file https://files.catbox.moe/e96m50.wav has a radio-like effect on top of it (and also had an engine sound going in the background, removed with an AI filter), the result is still better than expected, but worse than what I was wishing for.
>>>/wsg/5738260
>https://github.com/kijai/ComfyUI-PyramidFlowWrapper
>pyramidflow
>384p works on 16GB VRAM
>768p needs 24GB+. 10 seconds (also seems to be less reliable than 5 seconds)
So there is this offline AI video maker out there. I've stolen this link + webm from the /wsg/ thread. I have no idea how difficult training for this stuff would be, but HEY, we are one step closer to AI-made cartoons with ponies.
>>41662326
That's Celestia? She barely sounds like her.
https://files.catbox.moe/jtcgt3.mp3
>>41663316Yeah you're right I messed up: https://files.catbox.moe/cxobua.wav I wonder if there is a way to clean up the end result automatically
>>41662632>digits give me flashbacks to listening to models trained on SC09nightmare indeed
up
Where are the mares hiding?
>>41663813https://files.catbox.moe/q1w20r.mp3
Can I offer you an AJ in a silly-cute dress in this trying time?
There is an archive of MLP show music that has been unmaintained for the last few years. It contains instrumentals and high-quality versions of songs. There are known missing instrumentals of some songs that are publicly available. Is anyone interested in maintaining it?
https://docs.google.com/document/d/1zfGmwKJoCNgX8QMkkDoem2nOAw83-dg5fnJqJK0Jxig/edit?tab=t.0
>>41666591
What are the known missing instrumentals?
>>41667099
Babs Seed, Crystal Empire, Blank Flanks Forever. The first two are in games, the last one is in the 2019 leak. Maybe something else too, plus a high-quality EqG Better Together with vocals.
Button Mash Sings KSI Thick Of It - We The Sus Music AI Cover
https://files.catbox.moe/oiv6o7.mp3
>>41662632The song might be a downer, but I really dig it anyway.
>>41660183
>Gee Pea Tea Soviets
Why do I imagine communist Mane 6? Rainbow Dash storming the Winter Palace.
>>41668791I would prefer mares to read me all the poetry books that are collecting dust on the bookshelf.
>>41668003
Huh, we have a Button Mash AI voice? Is that RVC or SoVITS? (Do kindly link it up either way.)
>>41669834no go find it yourself faggot
Where did the ponies get enough data to train a model of Anon's voice? What would they even do with such technology?
>>41669440https://files.catbox.moe/r8kptk.mp3
>>41669440
>>41671080
And the rest: https://files.catbox.moe/71otqw.mp3
All generated clips: https://files.catbox.moe/sqlciz.zip
>>41671100
Poetry by mares, you could say it's a mare-etry.
>>41671100Did you apply post-processing after sovits?
Are there image generators that don't struggle with show-accurate style? Even the best generated images I've seen break on the outlines. Can they be improved with postprocessing? Maybe something like a bilateral filter? Or maybe train a NN to find outlines and then paint them with a solid color? Some sort of rasterized vector image sharpener that gets fed the blurred output of the AI? Or maybe make a NN that splits the image into solid or gradient regions plus background, and produces a plane equation for the color of each "splotch".
Representing part of an image as a mix of color planes instead of a bunch of pixels sounds interesting. Is there any research on this topic? Or any other mathematical surface that can be reduced to a plane, or close enough to it. Maybe a cosine table like lossy codecs use, but only for splotches.
>>41671927
No. I was too dumb and lazy to figure out how to do post-processing with Audacity, and I kept running into audio issues with it.
>>41671962
>prompt:score_9, (rating_safe), pony, show accurate, twilight with headphones listening to music with her eyes closed sitting on a bench, nighttime with stars width:1024 height:1024 scale:7.5 steps:25 sampler: K_EULER_A model:PONY_V6_XL seed:3987383230
It took about 5 attempts. You'll probably want to train a LoRA for it to make it more reliable.
>Representing part of image as mix of planes of color instead of bunch of pixels sounds interesting. Is there any research on this topic?
Maybe the Color ControlNet? I'm not sure if there's a Color ControlNet for SDXL.
>>41591651
>precomputing stuff from the reference audio for GPTSoVITS
I have been independently looking into this, and I believe it is feasible. The following 3 variables could be precomputed from the reference audio and its transcription. With a little refactoring, they could then be passed as arguments to the get_tts_wav method in inference_webui.py:
"prompt" - An array computed from the reference audio by passing it through HuBERT, Conv1d, and ResidualVectorQuantizer networks (whose weights are stored in the .pth sovits file). Its size is on the order of 1x100 to 1x1000 integers for a typical reference audio lasting several seconds.
"phones1" - A list of integers which are indices of arpabet tokens, determined from the transcription. Its length is on the order of a couple hundred integers.
"refers" - A spectrogram of the reference audio, wrapped in a list. Its size is on the order of 1x1025x100 to 1x1025x1000 floating point numbers and is by far the largest value to store.
Note: The code has a variable called "bert1" which is also derived from the transcription. For English reference text, however, it is always an array of zeros, so there is no need to precompute it.
In theory, the user could supply additional reference audio files if they wanted to, and their spectrograms would replace (or be appended to) the precomputed "refers" variable. I am working on a refactor of inference_webui.py and developing some code to perform the precomputations. More to come soon, hopefully within the next week.
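Purely as an illustration of the caching idea (the file name, field names, and array shapes below just mirror the variables described above; they are not from the actual codebase or the planned refactor), the per-clip cache could be as simple as one compressed .npz record per master file:

import numpy as np

def save_precomputed(path, prompt, phones1, refers):
    # prompt: 1-D int array of RVQ codes; phones1: 1-D int array of arpabet token ids;
    # refers: float array holding the reference spectrogram (or a smaller derived embedding).
    np.savez_compressed(path, prompt=prompt, phones1=phones1, refers=refers)

def load_precomputed(path):
    data = np.load(path)
    return data["prompt"], data["phones1"], data["refers"]

# Example with placeholder shapes in the ranges mentioned above.
save_precomputed("twilight_s1e01_001.npz",
                 prompt=np.arange(300, dtype=np.int64),
                 phones1=np.arange(150, dtype=np.int64),
                 refers=np.zeros((1, 1025, 400), dtype=np.float32))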
>>41673455Nice, color me interested
>>41673118
>>Representing part of image as mix of planes of color instead of bunch of pixels sounds interesting. Is there any research on this topic?
>Maybe the Color ControlNet?
Wow. That's not what I meant in the quoted part, but it's also nice. It is what I meant by "rasterized vector image sharpener that gets fed the blurred output of the AI".
The plane representation I meant is the coefficients of a plane equation, just like GPUs use when rasterizing triangles. Or a matrix. Each pixel is marked with which plane it uses, so in the end the pixel coordinates are multiplied by the matrix of the plane they refer to.
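A minimal sketch of that plane-per-splotch idea, assuming the regions have already been segmented somehow: fit color = a*x + b*y + c per channel with least squares, then reconstruct the region from the coefficients. All names and shapes here are illustrative, not from any existing tool.

import numpy as np

def fit_color_plane(xs, ys, colors):
    # Least-squares fit of color = a*x + b*y + c for one region ("splotch").
    # xs, ys: pixel coordinates in the region; colors: (N, 3) RGB values.
    # Returns a (3, 3) matrix with one row of [a, b, c] per channel.
    A = np.stack([xs, ys, np.ones_like(xs)], axis=1).astype(np.float64)
    coeffs, *_ = np.linalg.lstsq(A, colors.astype(np.float64), rcond=None)
    return coeffs.T

def eval_color_plane(coeffs, xs, ys):
    # Reconstruct region colors from the plane coefficients (pixel coords times plane matrix).
    A = np.stack([xs, ys, np.ones_like(xs)], axis=1).astype(np.float64)
    return A @ coeffs.T

# Example: fit a plane to a synthetic gradient patch and check the reconstruction.
ys, xs = np.mgrid[0:16, 0:16]
xs, ys = xs.ravel().astype(float), ys.ravel().astype(float)
colors = np.stack([10 + 2 * xs, 50 + 1 * ys, np.full_like(xs, 128)], axis=1)
coeffs = fit_color_plane(xs, ys, colors)
recon = eval_color_plane(coeffs, xs, ys)  # should match `colors` almost exactly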
>>41592155
>- It's at least theoretically possible to precalculate average speaker timbre information from auxiliary reference audio, since multiple audios can be averaged together. Whether it's actually useful is another question entirely.
Can multiple reference audios somehow be used for changing emotions or pacing over time? Maybe with a weighted average, where the weights are interpolated? The weights could be either a vector with elements in the [0, 1] range or barycentric coordinates (sum of elements = 1). So it would basically be a vector-matrix multiplication.
But I haven't tried GPT-SoVITS myself yet, so take this as a possibly useless suggestion.
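Just to spell out the arithmetic of that suggestion (this assumes each reference clip can be reduced to a fixed-size style embedding; whether GPT-SoVITS behaves sensibly when fed a blended embedding is exactly the open question):

import numpy as np

def blend_reference_embeddings(embeddings, weights):
    # Weighted average of per-reference style embeddings: embeddings is (num_refs, dim),
    # weights is (num_refs,) non-negative barycentric weights. Returns one (dim,) blended
    # embedding - just the vector-matrix multiplication described above.
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()  # enforce sum-to-1
    return w @ np.asarray(embeddings, dtype=np.float64)

# Example: crossfade from reference 0 ("calm") to reference 1 ("excited") over 5 steps.
refs = np.random.rand(2, 512)  # placeholder embeddings
blends = [blend_reference_embeddings(refs, [1.0 - t, t]) for t in np.linspace(0.0, 1.0, 5)]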
>page 10
>>41666591
Bumping this to the anchor post >>41572561 since this seems pretty interesting/important.
>>41674113I think no because the sequence dimension gets squished
>>41675268We've had it for some time in the gdrive, and we helped improve it in a few cases where we had higher-quality versions of songs. Narokath was pretty quick to respond even years after the previous update.https://drive.google.com/drive/folders/1OMwYKv7fbA5bZAS1BUwNDryMzjOpRKhQ
>>41667791
Can you upload/link the games that contain the first two? I found the extra Blank Flanks Forever instrumental & vox.
>>41667791I found the extra Better Together instrumentals as well.
>>41671962
>>41673118
NTAs, but did you find any plugins/add-ons that let you limit the colour output of the generated images so they look more like vector screencaps?
>>41675589
https://www.hasbro.com/common/assets/html5/mylittlepony/core_games/FinalGames_061614/ff_Jul28/audio/game/music.m4a
https://www.hasbro.com/common/assets/html5/mylittlepony/core_games/FinalGames_061614/ppp/audio/pinkie-pie-theme.m4a
>>41675599
The Better Together opening instrumental is already in the collection. Did you mean something else? Better Together with vocals lacks a flac version and instead has only mp3 - that is what I was trying to say.
Shitposting with Celestia is always fun
https://voca.ro/1jvfMpMcGH0C
>>41577843>RVC (using a retrieval ratio of 0.75)What are your settings to do that?
>>41676972That's the setting
>>41676569
Where can I download good audio references for gpt-sovits (mane6)?
>>41678509
You can use dialogue lines from the Master File
mega.nz/folder/jkwimSTa#_xk0VnR30C8Ljsy4RCGSig
To my knowledge nobody has made a definitive list of "good" lines. There are also these https://github.com/effusiveperiscope/GPT-SoVITS/tree/standalone_gui/ref_audios which I bundle with the GUI by default.
So just as a general post, what are your guys' feelings on the state of AI voices right now? After discovering and finetuning GPT-SoVITS, does it seem like a major step up to you? What do you love and/or hate about it?
>>41678708
It's a 110% upgrade from TalkNet TTS; however, trying to find a relevant reference clip to get the exact tone is still a bit painfully behind what grok/15.ai offered, where all that stuff was handled by the emoji controller.
Maybe I'm not being lucky, but while it seems training a sub-30s voice model is possible, the results are still not on the level of "yep, that's how I imagined this character would talk". It's interesting to see that it's able to get halfway there, though.
I may now actually be able to contribute something to the anti colab, since I wouldn't be limited to just badly redubbing my own voice with haysay.
>>41678509
For "good" lines, you can just exclude any in the master file tagged as noisy.
>>41678708
I've been messing with it a fair bit recently for a small-scale voice project, and I have been able to get to a quality level I'm happy with, though it's hit and miss like with all AI. My feeling so far is that GPT-SoVITS is good for general speech and often has good resemblance to the characters; however, getting exactly what I want from the emotional delivery is still somewhat difficult at times, and there have been a few pronunciation issues. Being able to do TTS again rather than voice conversion is a huge factor for those with non-American accents, though that comes with the caveat that I now need to search the master file for a good reference line for everything I want to generate. I think that someone who can already do a decent voice impression of the target character would still be better served by so-vits/RVC voice conversion; GPT-SoVITS is a suitable alternative for everyone else.
It's a significant step forward overall and a credit to all involved that we've been able to get this far with voice AI. People from four years ago would absolutely flip their shit if they could hear what we have now. Long may the development continue.
>>41678708
Love the quality and easy training.
Hate that you need reference audio (used to it by now from using SoVITS, but I'd still like to generate without references).
>>41678708
I'm sure there are still advancements to be made, but I'm content with what we have now. SVS and RVC give great results with enough wrangling.
For GPT-S, I'm happy it exists, and it seems to work decently, but I have found it difficult to justify using it for anything beyond testing just yet. Still, for a CPU-based text-to-speech model it's really impressive.
https://files.catbox.moe/4y2iof.mp3
>>41678708
It's really good, and the dev said he'll release a v3 base model trained on more hours of audio soon. There are a few bugs in the code too, but after fixing them it's better than ever. Postprocessing with an RVC pass also seems to slightly improve the end result (if you need that extra quality).
>>41678708It's alright but it could be better. so-vits-svc 5.0 is basically peak for me in terms of normal singing voices, not sure how much room there is for improvement there. OTOH I'm still not really satisfied by any of the options for synthesizing speaking voices. GPT-SoVITS as others mentioned is probably useful for projects needing a fully automatable TTS, but if you want a specific timbre or delivery it still falls short. I think I've been spoiled in that dimension by SVC options.
Maybe you could avoid having to provide reference audio with GPT-SoVITS by training a text -> reference embedding model, like the prior in DALLE-2.
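As a rough illustration of that idea (purely a sketch: the dimensions are made up, and there is no claim here about what GPT-SoVITS actually exposes), you'd train a small regressor that maps a text/sentence embedding to the style embedding normally derived from reference audio, then use its prediction in place of a real reference:

import torch
import torch.nn as nn

class TextToStylePrior(nn.Module):
    # Toy prior: sentence embedding (e.g. 768-d) -> predicted reference/style embedding (e.g. 512-d).
    def __init__(self, text_dim=768, style_dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, style_dim),
        )

    def forward(self, text_emb):
        return self.net(text_emb)

# Training step sketch: pairs of (text embedding of a line, style embedding computed from
# that line's actual audio). Both sides are assumed to be precomputed elsewhere.
model = TextToStylePrior()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()
text_emb = torch.randn(32, 768)    # placeholder batch of text embeddings
style_emb = torch.randn(32, 512)   # placeholder matching style embeddings
opt.zero_grad()
loss = loss_fn(model(text_emb), style_emb)
loss.backward()
opt.step()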
>>41679657You are reinventing RAG. And yes, you can.
Mares are good
>>41680819MARES - MARES Autistic Rage Enhanced Sounds
>>41680819Mares are the best.
>>41676909
Thank you. The leaks have more stems that, e.g., isolate the guitar. I'm not sure if Narokath would want those added to the collection, but the files are available.
I'll ping him on the Google Doc and elsewhere to see if he wants to update his collection. Otherwise I'll just add the files to my clone.
>https://www.youtube.com/watch?v=fj-Ipgw9kl8
>https://d1qx31qr3h6wln.cloudfront.net/publications/FUGATTO.pdf
>2 meme paper - Fugatto audio model
tldr: Nvidia is also making their own audio AI model, a Swiss-army knife of audio models controlled by text + audio reference (tts, audio effects, music, converting midi audio to other instruments, adding/removing sounds from a source reference).
Right now there is no actual testing available to the populace, so we just have to trust that the cherry-picked examples are a true representation of what the model can do. However, the idea of converting one version of instrumentals into something else while keeping the same speed and beats is VERY interesting to me, since udio/suno loves to add shitty modern pop beats in the background of supposedly 70s-inspired songs, so maybe this would be a nice way to fix them (and also be able to fucking play normal fucking songs without YT copyright spazzing out every five seconds).
Do you think the ponies have the ability to copy voices? Perhaps with magic? How common do you think it would be and what do you think they would use it for?
>>41681915>Perhaps with magic?This is confirmed. Coloratura let a Unicorn cast a spell on her to pitch her own voice in a live performance.
>>41679657The previous grok (and I think 15ai) used emoji embeddings to link text and audio embeddings. That might work for finding relevant reference audio files too.
>>41681915with transformation magic like Poison Joke and the breezies spell I am sure ponies could make some voice changing magic too.
>>41683411>teaching machines how to boopThat's bold.
>>41683411
>>41676909
>>41681376
I got a response from Narokath and sent him the files. I'm pretty sure he'll add the Babs Seed, Crystal Ponies, and Blank Flank ones. Response pending on whether he'll add the new Better Together stems too.
All of the new files are here: https://drive.google.com/drive/folders/1loPFrwJMMsHe2VNzQ9u3gtZmCfdtmsvR
The Better Together ones are disorganized. I'll try organizing them if Narokath decides to include them.
>>41656312
Horsona updates:
- [Done] OpenAI-compatible interface for custom modules. I tested it with SillyTavern, and it works as expected.
- ... Sample: https://github.com/synthbot-anon/horsona/tree/main/samples/llm_endpoint
- ... ... Caveats: Indexing files is slow and requires a large number of LLM calls. It treats every inference as if it's part of the same conversation (so its memory will leak between conversations). The memory module I'm using to retrieve backstory information isn't made to work with stories, so its memory will be flaky. I'll be working on that after I'm done with causal reasoning (below).
- ... Code: https://github.com/synthbot-anon/horsona/tree/main/src/horsona/interface/oai
- ... Tests: https://github.com/synthbot-anon/horsona/blob/main/tests/interfaces/test_oai_api.py
- [Done] I cleaned up a lot of the LLM handling code and added support for streaming results from LLMs.
- [In progress] Adding explicit causal reasoning to LLMs.
- ... [Done] I ported the relevant code over from the DoWhy library so I could clean it up and add a better interface.
- ... [Done] I added support for doing causal reasoning with LLMs so it can deal with natural language data. Previously it only supported numerical data.
- ... [In progress] I have a lot of cleanup to do to complete support for natural-language-based causal reasoning.
- ... [ ] After that, I'll need to wrap everything in modules so they plug in nicely with the rest of the framework.
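For anyone who wants to try the endpoint idea outside SillyTavern, a minimal client sketch is below. It assumes the sample server is running locally; the URL, port, and model name are placeholders (check the sample's README for the real values), and the only thing shown is that any OpenAI-compatible client can talk to such an endpoint.

from openai import OpenAI

# Point an OpenAI-compatible client at the custom module's endpoint instead of
# OpenAI/Ollama/Anthropic. base_url and model are placeholders, not values from the repo.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="horsona-module",
    messages=[{"role": "user", "content": "Say hi like Twilight Sparkle."}],
)
print(resp.choices[0].message.content)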
Happy Thanksgiving, everypony! I appreciate the dedication from everyone here, shitposters included. Every update gives me more hope that I'll one day have my waifu.
Trying to follow the haysay_ui installation instructions on Windows.
limited_user_migration-1 | chown: invalid user: ‘luna:luna’
limited_user_migration-1 exited with code 1
I have refactored the GPT-SoVITS code to allow precomputed values to be passed in:
https://github.com/hydrusbeta/GPT-SoVITS
In that fork, I included a script (pony_precomputer/precomputer.py) for precomputing values for all the master files. It also contains sample code showing how you can use a set of precomputed values to generate audio. It's not terribly useful on its own for now, but perhaps with a text -> embeddings model, as others have suggested, or some other mechanism for selecting a set of precomputed values, we could use it for audio generation without the need for providing reference audio.
I was able to avoid storing the entire spectrogram of the reference audio by precomputing the first step of the decode method, which passes the spectrograms through the code's "MelStyleEncoder" neural network. The result is a relatively small array (512 floats). When I ran my script on all the Sliced Dialog master files from s1-s9 + Rainbow Roadtrip, it generated only 35MB of precomputed data total for all of the files.
Clipper, as part of this effort, I wrote a parser for the master files and found one tiny mistake. The file "00_15_24_Chief Thunderhooves___Our stampede will start at high noon tomorrow..flac" in s1e21 is missing the emotion tag. It should either be "Annoyed" or maybe "Angry". Otherwise, I detect no other issues in any of the file names.
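One simple interim mechanism for "selecting a set of precomputed values" (this is just an outside illustration, not the fork's sample code) would be nearest-neighbour lookup over those 512-float MelStyleEncoder vectors: take the style vector of a known-good line as the query and grab the closest stored entry.

import numpy as np

def pick_reference(query_vec, cached_vecs):
    # Return the index of the cached 512-d style vector most similar (cosine) to query_vec.
    # cached_vecs: (num_files, 512) array of precomputed MelStyleEncoder outputs.
    q = query_vec / np.linalg.norm(query_vec)
    c = cached_vecs / np.linalg.norm(cached_vecs, axis=1, keepdims=True)
    return int(np.argmax(c @ q))

# Example: use the style vector of one known-good line as the query and fetch the
# closest match from the whole precomputed set.
cache = np.random.rand(1000, 512).astype(np.float32)  # placeholder for the precomputed data
best = pick_reference(cache[42], cache)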
Cautionary up.
>>41687661
10
>>41688321Indeed.
>>41689050
>>41687189I just needed to delete all the old images, missed those when I was deleting the old volumes and containers
Someone on lmg made a Firefox plugin for right-click reading text from a SoVITS API backend. It might be useful for casual custom narration of random things.
>>41690022Post link?
>>41690053https://addons.mozilla.org/en-US/firefox/addon/sovits-screen-reader/
>>41690053>>>/g/103341565
Sorry if this is the wrong thread but it's the only AI voice related thread I know. How does one go about making those AI generated songs that emulate somepony's voice? I wrote a parody of Gaston's song (from Beauty and the Beast) using Rainbow instead of Gaston, and I'd like to see if I can have Scootaloo and Dash sing it. Is there even enough training data to emulate Scoot's voice accurately?
>>41690198
>How does one go about making those AI generated songs that emulate somepony's voice?
Usually the workflow is
>Generate a song with Suno.ai or Udio.ai
>Separate vocals from song with something like Ultimate Vocal Remover
>Run separated voice through pony voice AI and recombine with the instrumental
>>41690233
Nah, not generating new songs from scratch. I mean those people who have taken existing songs and redone them with someone else's voice. I've heard quite a few around.
>>41690252
Then skip step 1, the rest still stands. With real songs you can sometimes find stems online with vocals and instruments already separated.
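For the final "recombine" step of the workflow above, a minimal sketch (assuming the converted vocal stem and the instrumental were exported at the same sample rate and channel count; file names are placeholders):

import numpy as np
import soundfile as sf

# Recombine a converted vocal stem with the original instrumental.
vocals, sr = sf.read("converted_vocals.wav")
inst, sr2 = sf.read("instrumental.wav")
assert sr == sr2, "stems must share a sample rate"
n = min(len(vocals), len(inst))
mix = vocals[:n] + inst[:n]
mix /= max(1.0, np.abs(mix).max())  # normalize to avoid clipping
sf.write("pony_cover.wav", mix, sr)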
Bump.
>>41689623Sorry I never got around to replying to you. Glad to hear you figured out a solution!
If I understand correctly, GPT-SoVITS uses PyTorch. I've found a few mentions of an OpenCL backend for PyTorch being developed: https://dev-discuss.pytorch.org/t/opencl-backend-important-updates/845/13
But in general it is slower than the vendors' libraries.
What do ponies here use for inference? CPU? GPU? NPU? Which one?
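For reference, stock PyTorch code typically just picks between CUDA, MPS, and CPU; something like the plain check below is all that decides the device, so OpenCL would only enter the picture via a third-party backend (this is generic PyTorch, not anything GPT-SoVITS-specific):

import torch

# Standard device selection; OpenCL is not a built-in PyTorch backend, so without a
# third-party plugin the realistic options are CUDA, MPS (Apple), or CPU.
if torch.cuda.is_available():
    device = "cuda"
elif getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
print(f"Inference device: {device}")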
>>41690198
For a parody like that without re-singing it, I would use Synthesizer V to recreate the vocals using its synthetic singing voices, then feed those outputs into whichever AI pony voice conversion I feel is best suited, probably RVC. There were previously many sources for the basic version of SynthV here: (https://resource.dreamtonics.com/download/English/) but I guess they've since removed it to try and bump sales of the paid version, which is still kinda worth it as it has some great features in the main one:
>Can read separated vocals to attempt replication and even tries to match vocal qualities to edit afterwards
>AI retake to vary up how pronunciations and deliveries are done
>More voices available to it, though most are paid models
>Force English for most other non-native voices
>No limits to track numbers; good harmony potential
In any case, the free limited version (dubbed "basic") is still very much useful, and a basic beta version can be found at the earlier linked location, obtained from their website (https://dreamtonics.com/download-free-trials/)
As an idea of its capabilities, here are some tracks I used SynthV + RVC for, some still WIP:
>ANRI voice output of Danger Zone - https://files.catbox.moe/12is7e.mp3
>Final Fluttershy Danger Zone (of section above) - https://files.catbox.moe/12is7e.mp3
>Starlight singing a Pendulum Watercolour section - https://files.catbox.moe/xpb2qa.mp3
>Fluttershy singing the same as above - https://files.catbox.moe/4kel93.mp3
>NMM Parody I threw together just now (Rhythm of the Night -> Eternal Night) - https://files.catbox.moe/hz1cex.mp3
>>41691421I'm looking through the quick start guide in OP and it looks like so-vits-svc would be better suited to this task than RVC, but you recommend RVC?
>>41691421>>41691545Oh wait, so-vits-svc wouldn't allow me to change the words, would it?
>>41691552
The only direct voice conversion tool I recall being able to change the lyrics for was TalkNet, which is quite old and not all that reliable when changing the words. That's why I consider SynthV the better pass for changing the lyrics, pitch, timbre and/or delivery prior to ponification with the newer formats.
>>41691563Where are instructions for SynthV? I don't see it mentioned in the quick guide
>>41691565
>SynthV
That's not an AI tool, it's a bootleg Vocaloid.
>>41692024
>>41572862cute numget pat pat
>>41691565
You can easily find tutorials online for it. I intend to make one for here and include my method for ponification with its outputs, though that shouldn't differ much from existing documentation. Retail work and moving residence are occupying a lot of my spare time, so it won't be made for a while.
>>41693857Yes.
>>41696589That's an excessive amount of scrunch.
>>41687414
>missing the emotion tag
Fixed, thanks.
>>41684805
fyi, for your mirror
>>41690059How do I use this?Pretty please
>>41698386
My settings. I've been messing with this and nothing works.
I changed the port from the base UI to the text inference UI (it is running ofc), and changed the paths from the shorter ones (with the GPT-SoVITS folder as root) to the full paths. Nothing works. Pretty please help.
>>41698421Baguette
>>41698335Will update after a few days.
>>41681376
Maybe those stems can be used for training an instrumental extractor AI, if we ever get enough data and POWARR to train one. But it's unlikely we will get both.