Welcome to the Pony Voice Preservation Project!
youtu.be/730zGRwbQuE

The Pony Preservation Project is a collaborative effort by /mlp/ to build and curate pony datasets for as many applications in AI as possible.

Technology has progressed such that a trained neural network can generate convincing voice clips, drawings and text for any person or character using existing audio recordings, artwork and fanfics as a reference. As you can surely imagine, AI pony voices, drawings and text have endless applications for pony content creation.

AI is incredibly versatile: basically anything that can be boiled down to a simple dataset can be used for training to create more of it. AI-generated images, fanfics, wAIfu chatbots and even animation are possible, and are being worked on here.

Any anon is free to join, and there are many active tasks that would suit any level of technical expertise. If you’re interested in helping out, take a look at the quick start guide linked below and ask in the thread for any further detail you need.

EQG and G5 are not welcome.

>Quick start guide:
docs.google.com/document/d/1PDkSrKKiHzzpUTKzBldZeKngvjeBUjyTtGCOv2GWwa0/edit
Introduction to the PPP, links to text-to-speech tools, and how (You) can help with active tasks.

>The main Doc:
docs.google.com/document/d/1y1pfS0LCrwbbvxdn3ZksH25BKaf0LaO13uYppxIQnac/edit
An in-depth repository of tutorials, resources and archives.

>Active tasks:
Research into text-to-speech
Research into speech-to-speech
Research into chatbots

>Latest developments:
See developments post below

>The PoneAI drive, an archive for AI pony voice content:
drive.google.com/drive/folders/1E21zJQWC5XVQWy2mt42bUiJ_XbqTJXCp

>Clipper’s Master Files, the central location for MLP voice data:
mega.nz/folder/jkwimSTa#_xk0VnR30C8Ljsy4RCGSig
mega.nz/folder/gVYUEZrI#6dQHH3P2cFYWm3UkQveHxQ
drive.google.com/drive/folders/1MuM9Nb_LwnVxInIPFNvzD_hv3zOZhpwx
https://huggingface.co/datasets/synthbot/pony-speech
https://huggingface.co/datasets/synthbot/pony-singing

>Cool, where is the discord/forum/whatever unifying place for this project?
You're looking at it.

Last Thread:
>>41498541
>>41571795
>Latest developments
https://ponepaste.org/10430

FAQs:
If your question isn’t listed here, take a look in the quick start guide and main doc to see if it’s already answered there. Use the tabs on the left for easy navigation.
Quick: docs.google.com/document/d/1PDkSrKKiHzzpUTKzBldZeKngvjeBUjyTtGCOv2GWwa0/edit
Main: docs.google.com/document/d/1y1pfS0LCrwbbvxdn3ZksH25BKaf0LaO13uYppxIQnac/edit

>Where can I find the AI text-to-speech tools and how do I use them?
A list of TTS tools: docs.google.com/document/d/1y1pfS0LCrwbbvxdn3ZksH25BKaf0LaO13uYppxIQnac/edit#heading=h.yuhl8zjiwmwq
How to get the best out of them: docs.google.com/document/d/1y1pfS0LCrwbbvxdn3ZksH25BKaf0LaO13uYppxIQnac/edit#heading=h.mnnpknmj1hcy

>Where can I find content made with the voice AI?
In the PoneAI drive: drive.google.com/drive/folders/1E21zJQWC5XVQWy2mt42bUiJ_XbqTJXCp
And the PPP Mega Compilation: docs.google.com/spreadsheets/d/1T2TE3OBs681Vphfas7Jgi5rvugdH6wnXVtUVYiZyJF8/edit

>I want to know more about the PPP, but I can’t be arsed to read the doc.
See the live PPP panel shows presented at /mlp/con for a more condensed overview.
2020 pony.tube/w/5fUkuT3245pL8ZoWXUnXJ4
2021 pony.tube/w/a5yfTV4Ynq7tRveZH7AA8f
2022 pony.tube/w/mV3xgbdtrXqjoPAwEXZCw5
2023 pony.tube/w/fVZShksjBbu6uT51DtvWWz

>How can I help with the PPP?
Build datasets, train AIs, and use the AI to make more pony content. Take a look at the quick start guide for current active tasks, or start your own in the thread if you have an idea. There’s always more data to collect and more AIs to train.

>Did you know that such and such voiced this other thing that could be used for voice data?
It is best to keep to official audio only unless there is very little of it available. If you know of a good source of audio for characters with few (or just fewer) lines, please post it in the thread. 5.1 is generally required unless you have a source already clean of background noise. Preferably post a sample or link. The easier you make it, the more likely it will be done.

>What about fan-imitations of official voices?
No.

>Will you guys be doing a [insert language here] version of the AI?
Probably not, but you're welcome to. You can however get most of the way there by using phonetic transcriptions of other languages as input for the AI.

>What about [insert OC here]'s voice?
It is often quite difficult to find good quality audio data for OCs. If you happen to know any, post them in the thread and we’ll take a look.

>I have an idea!
Great. Post it in the thread and we'll discuss it.

>Do you have a Code of Conduct?
Of course: 15.ai/code

>Is this project open source? Who is in charge of this?
pony.tube/w/mqJyvdgrpbWgZduz2cs1Cm

PPP Redubs:
pony.tube/w/p/aR2dpAFn5KhnqPYiRxFQ97

Stream Premieres:
pony.tube/w/6cKnjJEZSCi3gsvrbATXnC
pony.tube/w/oNeBFMPiQKh93ePqTz1ns8
>>41571795
>>41571851
I'm not the usual OP. Sorry if I got anything wrong. I can update the Latest Developments paste if anyone has suggestions or corrections. I made one change to the OP:
- Added the Huggingface clone of Clipper's Master Files.

>Will you guys be doing a [insert language here] version of the AI?
>Probably not, but you're welcome to. You can however get most of the way there by using phonetic transcriptions of other languages as input for the AI.
For future threads, the answer here can probably be updated, since the fine-tuned GPT-SoVITS seems very capable of generating Japanese speech, and probably Mandarin/Cantonese and Korean too.
>>41571963
I didn't even know where to look for the thread until I looked in the catalog thing, I thought for sure that 149 wasn't a thing yet...
Rainbow Dash GPT-SoVITS Model (GPT 8, SoVITS 24)
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Rainbow-SVe24-GPTe8
Reference: https://files.catbox.moe/r2v0mv.mp3
Multispeaker: https://files.catbox.moe/kiydla.mp3
Individual: https://files.catbox.moe/nwdqh2.mp3
Reference: https://files.catbox.moe/mkotd9.mp3
Multispeaker: https://files.catbox.moe/csya0e.mp3
Individual: https://files.catbox.moe/m4f76x.mp3

Pinkie Pie GPT-SoVITS Model (GPT 8, SoVITS 24)
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Pinkie-SVe24-GPTe8
Reference: https://files.catbox.moe/5d77ck.mp3
Multispeaker: https://files.catbox.moe/2kmvgv.mp3
Individual: https://files.catbox.moe/ok3mbn.mp3
Reference: https://files.catbox.moe/2btax8.mp3
Multispeaker: https://files.catbox.moe/7w4b6q.mp3
Individual: https://files.catbox.moe/ksirb0.mp3

Fluttershy GPT-SoVITS Model (GPT 8, SoVITS 24)
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Fluttershy-SVe24-GPTe8
Reference: https://files.catbox.moe/06oyrp.mp3
Multispeaker: https://files.catbox.moe/7yiplw.mp3
Individual: https://files.catbox.moe/b7gqjx.mp3
Reference: https://files.catbox.moe/lc1z49.mp3
Multispeaker: https://files.catbox.moe/exekkh.mp3
Individual: https://files.catbox.moe/efzd04.mp3

Definitely some cases in which I'm not sure we gain anything over the multispeaker model, but they exist now I guess.
That rounds out the Mane 6.
>>41572507
It appears that we slid off the catalog in the middle of US night/early morning.
unofficial anchor post for unofficial thread
Fimfiction groups are scraped.
https://github.com/uis246/fimfarc-search/releases/download/0.1-rc2/fimfgroups-20241022.tar.xz
Quick guide to the archive structure:
group-names - table with 2 columns: group id, group name
out*/ - directories for each depth in the folder tree
out*/.folders - table with 3 columns: group id, folder id, parent folder id
out*/.names - table with 2 columns: folder id, folder name
out*/* - lists of fanfics in corresponding folders
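For anyone who wants to poke at the dump from Python, here's a minimal sketch of walking the tables. It assumes each table is plain text with one record per line and tab-separated fields; check the actual delimiter in your extracted copy before relying on it.

import csv
from pathlib import Path

ARCHIVE = Path("fimfgroups-20241022")  # wherever you extracted the tar.xz

# group-names: group id, group name
with open(ARCHIVE / "group-names", newline="", encoding="utf-8") as f:
    groups = {gid: name for gid, name in csv.reader(f, delimiter="\t")}

# each out*/ directory is one depth of the folder tree
for depth_dir in sorted(ARCHIVE.glob("out*")):
    # .folders: group id, folder id, parent folder id
    with open(depth_dir / ".folders", newline="", encoding="utf-8") as f:
        folders = {fid: (gid, parent) for gid, fid, parent in csv.reader(f, delimiter="\t")}
    # .names: folder id, folder name
    with open(depth_dir / ".names", newline="", encoding="utf-8") as f:
        folder_names = dict(csv.reader(f, delimiter="\t"))
    print(depth_dir.name, len(folders), "folders")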
>>41572514
>>41558920
I made an audio dataset for Button's mom from 2014, before she got sick and her voice changed some more:
https://mega.nz/folder/PiQATIbC#XhtJf-n5Y6ug2SFrztjT7A
Her voice got deeper after 2012-2013, and I have to find the audio project file for 9 minutes of good mommy data and upload it to replace the 5 minutes of data in Clipper's archive.
>>41572514
Spike GPT-SoVITS (GPT 8, SoVITS 24)
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Spike-SVe24-GPTe8
Reference: https://pomf2.lain.la/f/mtcmnad.mp3
Generated: https://pomf2.lain.la/f/2og0me18.mp3
Not the best reference since it has a cut in the middle, but whatever.
>>41571963
You did fine. Thanks for making the thread, I was just about to start working on it. Usually I try to prepare for the next thread around post 400.
Also, the trend of new anti-spam bullshit each thread continues. I wonder what it'll be next thread?
>>41573536
>https://files.catbox.moe/e5w4fh.zip
Any chance you could do something with the above few-second clips of Prince Blueblood?
>https://files.catbox.moe/d9m1no.zip
Alternatively, I have this one minute of synthesized PB voice if the above is not enough.
>>41573818
I could look into it, but I'd like to finish training the models for the common characters before I start mucking about with very-low-data ones, which I imagine will require some finagling with multispeaker models.
>>41573892
I hope you can do Meadowbrook. God I love her sweet Cajun voice.
>>41573536
Celestia
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Celestia-SVe24-GPTe8
ref: https://files.catbox.moe/e0fewc.mp3
generated: https://files.catbox.moe/oqnf90.mp3

Luna
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Luna-SVe24-GPTe8
ref: https://files.catbox.moe/0nnthl.mp3
generated: https://files.catbox.moe/bvmb34.mp3
ref: https://files.catbox.moe/3zfbva.mp3
generated: https://files.catbox.moe/24n0da.mp3

Glimglam
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Starlight-SVe24-GPTe8
ref: https://files.catbox.moe/8uudhe.mp3
generated: https://files.catbox.moe/7odtqn.mp3

The choice of reference audio seems to have pretty violent effects on output quality. Not sure what's up with that.
I wonder if there's a way to modify pronunciations.
>>41574170
Yes, but you have to retrain (expression translated from Chinese), with this: https://huggingface.co/Systran/faster-whisper-large-v3
>>41574170
Not sure if it was already posted, but what are the GPT-SoVITS-v2 RAM demands just for generating a 10~20 second output?
>>41574537
~3 GB with a batch size of 20
I did post this, but it was in the other thread.
>>41574331
I meant at inference time, but this could also be helpful.
>>41574933
Huh, that's not bad. How long does inference typically take for you, and would more RAM allowance (say... double) improve upon that speed, and/or just expand how large the batch output is?
>>41574170
Apple Bloom! Apple Bloom! Apple Bloom!
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Apple%20Bloom-SVe24-GPTe8
ref: https://files.catbox.moe/9g6eff.mp3
gen: https://files.catbox.moe/3dzimz.mp3

>>41575510
I'd say RTF is roughly 16% to 40% on a 3080 Ti, i.e. generating a 60 second output takes 10 to 24 seconds (faster than realtime). Adjusting the batch size (at least in their gradio interface) doesn't seem to do anything; not sure how it works.
>>41575835
Sweeble.
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/SweetieBelle-SVe24-GPTe8
Reference: https://files.catbox.moe/vj3zv7.mp3
Generation: https://files.catbox.moe/4d204z.mp3

Ok, so batched inference doesn't work the way I expected. I think what the GUI is doing is using a user-selectable slicing method to split the input into "batches". So a batch can be four sentences, or one sentence--but this means you won't see the time benefits or memory usage of batching unless you synthesize more than one/four sentences at a time (which is unusual for most content creation, but may be more suitable for certain automated systems/audiobooks).
Long demo: https://files.catbox.moe/yka037.mp3
The memory usage doesn't seem to be consistent with just batch size; it also varies with slice length and from inference to inference. If I slice by sentences, I can infer the ~2k word passage with a batch size of 20 under 14 GB, and a batch size of 10 under 6 GB. If I slice by 4 sentences I OOM. Inference time varies (obviously variance increases with the text length); for this passage it took anywhere from 26 sec to 1 min.
>>41575835
>>41576416
Wew, that's really impressive. I think book narration still needs a bit more, but I can see the tech is there and it manages it. Biggest concern would be it randomly hallucinating in long inference, which would be difficult to detect.
>>41576416
Could you also train a specific Squeaky Belle model at some point please? Would love to see AI attempt that, and it would feel more early-seasons.
>>41576416
Scoot Scootaloo.
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Scootaloo-SVe24-GPTe8
ref: https://files.catbox.moe/8xy1ud.mp3
gen: https://files.catbox.moe/cmq12h.mp3
ref: https://files.catbox.moe/tvtvuz.mp3
gen: https://files.catbox.moe/xkjobn.mp3
no ref gen: https://files.catbox.moe/klbjwq.mp3

>>41577055
Noted.
>>41577055
>https://huggingface.co/Amo/RVC_v2_GA/tree/main/models/MLP_Sweetie_Belle_Squeeky
There is an RVC2 model of her, if you are interested.
>>41574170
I want to hear Celestia and Luna narrating books/articles. They're very well spoken in these gens.
>>41577119
I feel that would require squeaky voice acting to work, and unless it's accidental it loses its charm. Thanks though, could be good to pair with a later Squeaky TTS.
>>41577113
Glad to hear references aren't necessary for the voice qualities to surface and be retained in inference.
>>41575835
Forgot to mention, Bloom feels the most accurate so far, and really highlights how well it matches her usual pitch changes. Awesome to see AI less restrictive in its convincing vocal ranges.
>>41577113
so lightly glossing the github, is it really just taking like 5 seconds of audio as reference and then using that to make these generated voices you're posting? if so it sounds really clear
>>41577207
Well, first I have to finetune the model. The reference audio just helps with steering towards the final desired timbre (the Scootaloo post has examples generated both with references and one without to show the difference). Trying to use the reference audio on the base, untrained model does not produce good results.
>>41577184
>feel that would require squeaky voice acting to work
Not really. I don't have the test file on me, but I remember it was pretty decent at forcing whatever input audio to sound like her S1 squeaky self.
>https://files.catbox.moe/ltfpaw.zip
Here is a wav folder with 6 minutes of just squeaky SB if you need it.
>>41569387
>Hay Say using 0% of GPU
Open the docker-compose.yaml file. There are several commented-out sections that start with "deploy:". Uncomment those sections and restart Hay Say. You should then see the option to generate with GPU above the Generate button.
>>41577113
The Grrrrrreat and Powerful Trrrixie!
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Trixie-SVe24-GPTe8
>Can she roll her R's?
Ref: https://files.catbox.moe/q05hko.mp3
Prompt:
>The Great and Powerful Trixie, has no need for your frivolous manual! Trixie is a master of magic and illusion! Everything she does is a display of sheer brilliance! Instructions are for those who lack natural talent, unlike Trixie!
Gen: https://files.catbox.moe/c6qdy8.mp3
Gen: https://files.catbox.moe/mg0wnu.mp3
Gen (less roll): https://files.catbox.moe/upw9ij.mp3
Sort of, sometimes, apparently!
>>41577801
>Additional postprocessing?
RVC (using a retrieval ratio of 0.75):
https://files.catbox.moe/bisplb.mp3
https://files.catbox.moe/f6heuq.mp3
>>41577801
>Sort of, sometimes, apparently!
Holy shit.
>>41577801
Every turn this TTS AI surprises me with its capabilities. It replicated Trixie REALLY well, especially with this (>>41577843) additional pass, which I imagine is also responsible for clearing most of the buzz and noise. I'm starting to wonder if there's even a pony voice odd or unique enough to stump it. Discord maybe? Breezies? Chrysalis?
>>41577801
Peetzer (GPT epoch 10, SoVITS epoch 16)
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Cadance-SVe16-GPTe10
Ref: https://files.catbox.moe/8kg1ov.mp3
Gen: https://files.catbox.moe/3i6sgh.mp3
Ref: https://files.catbox.moe/zdyuu1.mp3
Gen: https://files.catbox.moe/1sj35x.mp3

Maybe I'm not familiar enough with Cadance's vocal timbre, but this one feels weird to me, which is why I spent so much time deliberating over which combination to use. Didn't like the 24th SoVITS epoch; too much buzziness, like when you overtrain an RVC model. I suspect we're at the point where the quantity of data for the character becomes a noticeable problem.

>>41578124
Not so much the uniqueness of the voice as much as the availability of data.
STTATTS: Unified Speech-To-Text And Text-To-Speech Model
https://arxiv.org/abs/2410.18607
>Speech recognition and speech synthesis models are typically trained separately, each with its own set of learning objectives, training data, and model parameters, resulting in two distinct large networks. We propose a parameter-efficient approach to learning ASR and TTS jointly via a multi-task learning objective and shared parameters. Our evaluation demonstrates that the performance of our multi-task model is comparable to that of individually trained models while significantly saving computational and memory costs (~50% reduction in the total number of parameters required for the two tasks combined). We experiment with English as a resource-rich language, and Arabic as a relatively low-resource language due to shortage of TTS data. Our models are trained with publicly available data, and both the training code and model checkpoints are openly available for further research.
https://github.com/mbzuai-nlp/sttatts
No examples, but weights are up. Voice conversion is one of the tasks.
>>41578173
Man, there are so many of those "it might be cool" projects out there. I just wish PPP and Chag weren't the only groups that took chances with developing stuff further.
>>41578124
>>41578166
Ok, Discord. (GPT epoch 48? SoVITS epoch 96?!)
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Discord-GPTe48-SVe96
ref: https://files.catbox.moe/ich5gc.mp3
gen: https://files.catbox.moe/59y7pi.mp3
ref: https://files.catbox.moe/dj7rzm.mp3
gen: https://files.catbox.moe/i5nrni.mp3

First GPT-SoVITS L? Or is it a me problem? I tried quite a lot, as you can tell from the epochs. The increased GPT seems to help with the framiness, but there are some occasional pronunciation misses. It seems to have a lot of trouble modeling the deeper voice--maybe the base model is biased towards higher voices, or the analysis frames are too short to model lower f0?
>>41579340
Crazy Glue (GPT epoch 48? SoVITS epoch 24)
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/CozyGlow-SVe24-GPTe48
ref: https://files.catbox.moe/0kpx47.mp3
gen: https://files.catbox.moe/5hze52.mp3
Not sure how faithful this is, but for some reason increasing GPT epochs seemed to increase the "resemblance" I perceived.
>>41579585
can you do derpy
>>41579340
It sounds pretty spot on to me, though there might be some subtle differences, yeah. Not perfect. I think. It's hard to tell honestly. We're in the subtle territory for AI now it seems.
>>41579585
T-rex
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Tirek-SVe32-GPTe32
ref: https://files.catbox.moe/byfz9o.mp3
gen: https://files.catbox.moe/3j1f3g.mp3
ref: https://files.catbox.moe/57fg6x.mp3
gen: https://files.catbox.moe/zsnanq.mp3

Chryssie
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Chrysalis-SVe32-GPTe8
ref: https://files.catbox.moe/sns1uo.mp3
gen: https://files.catbox.moe/ljpexn.mp3
ref: https://files.catbox.moe/iafgjk.mp3
gen: https://files.catbox.moe/hl3tse.mp3
ref: https://files.catbox.moe/hd9no1.mp3
gen: https://files.catbox.moe/p4dnvf.mp3
Noticeably more finicky and lower quality, both in audio quality and pronunciations.

>>41579620
Noted
>>41573818
Prince Blueblood (?) (GPT 40, SoVITS 32)
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Blueblood-SVe32-GPTe40
ref: https://files.catbox.moe/lmo8x7.mp3
gen: https://files.catbox.moe/2xr3dk.mp3
Well, that's an impressive performance for 17 seconds of audio. I wonder how much of it is inherited from the base model, and how well it works for other characters. I had to stitch two reference audios together to hit the 3 second requirement. I still can't tell what the "auxiliary reference audios" slot does.
>>41581034
>I still can't tell what the "auxiliary reference audios" slot does.
What I've been told is that it "averages" the tone of the audios.
>>41581034
>>41574083
Meadowbrook (?) (GPT 48, SoVITS 24)
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Meadowbrook-SVe24-GPTe48
Ref: https://files.catbox.moe/3s9n96.mp3
Gen: https://files.catbox.moe/2be3l9.mp3
Gen using entire dataset as aux reference: https://files.catbox.moe/ek4uoy.mp3
These inflections are really weird. The accent isn't 100% there, but I doubt we'll get much closer.
>>41577055
>>41581357
Squeaky Belle (?) (GPT 32, SoVITS 48)
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/%20SqueakyBelle-SVe48-GPTe32
Ref: https://files.catbox.moe/xd9qb2.mp3
Gen: https://files.catbox.moe/u4gng7.mp3
Well, no real squeaks. I noticed there was some S2 material in there too; I wonder if that affects anything?
>>41579620
Tabitha's Derpy (GPT epoch 24, SoVITS epoch 36)
https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Derpy-SVe36-GPTe48
ref: https://files.catbox.moe/bl8oom.mp3
gen: https://files.catbox.moe/8mroi0.mp3
Doesn't seem to follow the reference timbre that well?
>>41581039
Thanks.
I'm not really sure what the "no reference" option (or whatever Google Translate missed) actually does either. The underlying pipeline doesn't seem to allow you to generate "without a reference" until something has already been generated--and there seems to be some kind of underlying caching of the reference audio and the requests that the pipeline receives (so if you screw up a request by selecting the wrong reference audio language once, it's just perma-fucked until you restart the webui).
>>41581357
Daym, that's impressive. It seemed impossible just a year ago.
An update of a shitpost
raw gpt-sovits: https://files.catbox.moe/l9qtqe.mp3
post-rvc + pitch shift: https://files.catbox.moe/v80sqy.mp3
>>41581034
>>41582059
Oh man, this is really bloody based. There is a whole list of characters that I would love to have a voice model for, but their datasets are limited to 10-30 seconds. But this, fuck me, this is a proper game changer.
>>41581357
>>41581823
I can see there are some instructions on how to use different languages for TTS, but did the original developers provide instructions on how to add another language? I would imagine for 99% of people on the board it would be useless, as English is the go-to language, but I would be interested in whether a new language could be added (or if it would require retraining the base model from scratch)?
>>41582329
I'd love a Russian dub with original voices, instead of that terrible official one.
>>41581823
I love it! Not very squeaky, but it does sound within the right era.
>>41582059
Hmm, might sound better with a less anxious line? Maybe with something like these:
https://files.catbox.moe/7rf4vw.wav
https://files.catbox.moe/2kg7w8.wav
>>41582298
Kek. I think it falters at the end; most of these tend to, come to think of it. Maybe with this AI we'd have to get in the habit of leaving an additional dummy sentence at the end to snip out?
>>41582329
Don't know, I don't think they have any base training instructions.
>>41582575
You don't have to generate the entire thing at once.
>>41582329
I don't see any instructions on how to add new languages. Glancing through the repo, however, it looks like it would require some modifications to the code because it checks the language you pass against a whitelist and also does some mapping (e.g. "en" -> "english"). You'd also need to write a phoneme tokenizer specific to the language you want. The built-in tokenizers are located here: https://github.com/RVC-Boss/GPT-SoVITS/tree/main/GPT_SoVITS/text. I'm not 100% sure, but I don't think you'd need to retrain the base model.
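To get a feel for what a new-language front end would have to produce, here's a rough standalone sketch using the phonemizer library (espeak-ng backend) to turn, say, Russian text into a phoneme string. This is just an illustration, not the tokenizer interface GPT-SoVITS actually expects; you'd still have to map the output onto whatever symbol set the modules under GPT_SoVITS/text use.

# pip install phonemizer, and have espeak-ng installed on the system
from phonemizer import phonemize

text = "Привет, пони"
phones = phonemize(
    text,
    language="ru",        # espeak-ng language code for the new language
    backend="espeak",
    strip=True,
    preserve_punctuation=True,
)
print(phones)  # IPA-ish phoneme string you'd then map to the model's symbols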
>>41583252
I have a feeling the existing tokenizers could use some updating, considering the occasional (but strongly) mispronounced words shown in the recent examples. Or perhaps establish a way to select from specific tokenizers, to account for both using manually updated ones and selecting differing pronunciations with ones tailored to accent and region.
In any case, not something overly urgent or to consider working on at this stage, but something of note when it comes to QoL improvements.
Edge case: long sentence, no punctuation
https://gist.github.com/effusiveperiscope/2740ec098c0834dee76919e0c2e205b3
https://files.catbox.moe/33ltgn.mp3
Seems to be suffering the usual transformer context length issue.
Uses 3.8 GB with a batch size of 1, which could be problematic for small GPUs.
>>41584258
>>41581357
Hm... going back and comparing these, I wonder if I was premature in stopping the Mane 6 training.
>>41584264
Whether you continue training from the current models or start anew for extended training, the existing ones are still great overall, so perhaps save the new batch separately and mark it as something like "v2" or "EXT". The existing ones would be useful as a baseline to compare against anyway.
>>41583036
https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-ssml-phonetic-sets
As you said, this is not high on the list of priorities, but if you do choose to look into this, I found the above link that may or may not be a bit useful here.
Bump
>>41584264
>>41584659
Well, I trained the Applejack model up to SoVITS epoch 48 and GPT epoch 32. The high-frequency information increased marginally, but increasing either independently or jointly seems to cost character resemblance. Since this is something you can fix anyways with an RVC postprocessing pass, I decided it's not a worthwhile tradeoff.
Also, I think I can demonstrate what I mean by "framiness" more concretely. Take these samples:
GPT-SoVITS: https://files.catbox.moe/r6zj8m.mp3
RVC: https://files.catbox.moe/b9tzxg.mp3
For GPT-SoVITS, in the words "ends" and "is" you can hear some sort of unnatural discontinuity that RVC figures out it should smooth over. You can see this in the spectrogram--for "ends", the discontinuities are more in the upper harmonics, which seem to momentarily drop out in a place that RVC (and our ears) doesn't agree with, and for "is" there is a very noticeable discontinuity in f0 as well.
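If anyone wants to see the f0 discontinuity numerically rather than eyeballing a spectrogram, a quick sketch with librosa (assumes you've downloaded the two clips locally; the filenames here are made up):

# pip install librosa numpy
import librosa
import numpy as np

for name in ["gptsovits_sample.mp3", "rvc_sample.mp3"]:  # hypothetical local copies
    y, sr = librosa.load(name, sr=None)
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )
    # frame-to-frame jumps in f0; unusually large jumps inside voiced regions
    # are the "framiness" being described above
    jumps = np.abs(np.diff(f0))
    jumps = jumps[~np.isnan(jumps)]
    print(name, "median f0 jump:", np.median(jumps), "max:", np.max(jumps))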
>>41586799
Huggingface's scanner has some unspecified issues with s2G488k.pth in the GPT-SoVITS HF repo. I checked the file and didn't see anything dangerous in its data.pkl, though my checks https://ponepaste.org/10436 are pretty crude. If you can load the model using use_safetensors=True, that would be good. It's not a big deal for colab, but you might want to do this when loading the model locally. People are starting to upload more malware to Huggingface.
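For anyone who wants to do a similar (still crude) sanity check locally before torch.loading random .pth files: a .pth checkpoint is just a zip, and pickletools can disassemble the embedded data.pkl so you can eyeball which globals the pickle would import (anything outside torch/collections/numpy is a red flag). The internal path of data.pkl varies between files, hence the search; this is a sketch, not a substitute for a real scanner.

import io
import pickletools
import zipfile

def dump_pickle_globals(pth_path):
    with zipfile.ZipFile(pth_path) as zf:
        # torch zip checkpoints store the pickle as <archive_name>/data.pkl
        for name in (n for n in zf.namelist() if n.endswith("data.pkl")):
            out = io.StringIO()
            pickletools.dis(zf.read(name), out=out)
            # GLOBAL / STACK_GLOBAL opcodes name the classes/functions the pickle imports
            for line in out.getvalue().splitlines():
                if "GLOBAL" in line:
                    print(name, line.strip())

dump_pickle_globals("s2G488k.pth")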
>>41588158
Huggingface marks every single one of my SoVITS weights as suspicious too, but for some reason none of the GPT weights. There are at least four pretrained models I think that are involved. I'll see what I can do about this soon.
>>41588963
>>41588158
Well, facially, these aren't just normal pytorch model state dicts; they contain configuration information as well (yes, I did just torch.load the pickle and check the keys it said it had; I've probably already loaded these hundreds of times anyways) and shove the model state dict under the key 'weight', so I don't think they can actually be converted to safetensors directly. Assuming that nothing actually malicious is happening, I could try separating the data out into another file.
(Also, if we made safetensors loading the default, that'd make us incompatible with any models trained by anyone else, and it'd add an extra step for people training models to do.)
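If it ever becomes worth doing, the separation would look roughly like this (based on the structure described above: config keys plus the actual state dict under 'weight'). Untested sketch; the output filenames and the assumption that the remaining keys are json-serializable are mine.

import json
import torch
from safetensors.torch import save_file

ckpt = torch.load("s2G488k.pth", map_location="cpu")   # trusted file only, obviously
state_dict = ckpt.pop("weight")                        # the actual tensors
config = ckpt                                          # whatever non-tensor keys remain

# tensors go into safetensors (cloned so shared/non-contiguous storage doesn't trip it up)
save_file({k: v.detach().clone().contiguous() for k, v in state_dict.items()},
          "s2G488k.safetensors")
# config goes into a sidecar json
with open("s2G488k.config.json", "w", encoding="utf-8") as f:
    json.dump(config, f, default=str)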
>>41588963
It's probably fine for now. I'll think about it to see if there's a better solution, since it'll likely affect a lot of models going forward.
>ngrok needs a "verified account" with a CREDIT CARD to be used now
https://files.catbox.moe/46dfg5.mp3
2 more weeks
>>41570055
Horsona updates:
- [Done] I redid how the database cache works, since it clubbed together multiple disparate pieces of functionality, and its interface required special handling by any module that used it. The new version gives an embedding database an LLM interface. It can be queried like any other LLM, and it does any embedding-specific handling in there (esp. generating keyword searches from the prompt to get better embedding lookups). For whatever underlying LLM it uses, it requires two queries: one to generate the search terms, and one to respond to the query.
- ... Code: https://github.com/synthbot-anon/horsona/blob/main/src/horsona/memory/embedding_llm.py
- [Done] I implemented ReadAgent for dealing with long documents. ReadAgent generates a "gist" for each "page" of the document, which can be used to determine what information is on each page. At query time, it uses one LLM call to determine which pages to pull into the context, then a second LLM call to respond to the query. I implemented this as two modules: one to generate & keep track of gists, and one to provide the LLM interface. My version has two changes relative to the original: (1) when summarizing pages, it provides all gists-so-far as context so it can generate better summaries, and (2) when responding to a query, it provides all gists along with the selected pages rather than just the selected pages. (A rough sketch of the query flow is at the end of this post.)
- ... Code for creating gists: https://github.com/synthbot-anon/horsona/blob/main/src/horsona/memory/gist_module.py
- ... Code for the ReadAgent LLM wrapper: https://github.com/synthbot-anon/horsona/blob/main/src/horsona/memory/readagent_llm.py
- [Done] I added some utility functions that are generally useful for getting "smarter" responses. One of them is for searching the web for information on a given topic. The second is for decomposing a given topic into subtopics.
- ... Code for searching the web: https://github.com/synthbot-anon/horsona/blob/main/src/horsona/smarts/search_module.py
- ... Code for decomposing a topic: https://github.com/synthbot-anon/horsona/blob/main/src/horsona/smarts/mece_module.py
- [In progress] I like the LLM wrapper approach for generating augmented responses. I'll likely update some other modules to use the same approach, particularly the DialogueModule for generating in-character responses.
- [In progress] I need to update my ReadModule to reflect the database cache changes.
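In case the two-call ReadAgent flow isn't clear from the description, a rough illustrative sketch (not the actual horsona code; llm.ask and the prompts here are placeholders--see the linked readagent_llm.py for the real implementation):

def answer_with_readagent(llm, pages, gists, question):
    # call 1: decide which pages are worth pulling into context, given all gists
    selection_prompt = (
        "Here are one-line gists of every page:\n"
        + "\n".join(f"[{i}] {g}" for i, g in enumerate(gists))
        + f"\nWhich page numbers are needed to answer: {question}?"
    )
    page_ids = llm.ask(selection_prompt)  # assume this returns e.g. [3, 7]

    # call 2: answer using all gists plus the full text of the selected pages
    context = "\n".join(gists) + "\n\n" + "\n\n".join(pages[i] for i in page_ids)
    return llm.ask(f"{context}\n\nQuestion: {question}")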
>>41572561
GPT-SoVITS Inference GUI, first build.
>Windows pyinstaller download (may remove this later if it turns out to be broken)
https://drive.google.com/file/d/1PZt71cOH0X7QSFRgcThTwC2_WOain7Nj/view?usp=sharing
>GitHub + usage instructions
https://github.com/effusiveperiscope/GPT-SoVITS

Please help test:
>GPU/No GPU system, other system info?
>Does it work at all?
>Interface appearance on your display?
>Other issues?

>>41590487
Closer to 2 more hours. Apparently it's easier to turn a client/server application into a client-only application, since the decoupling's built in.
What do you think is behind the stellar quality of 11.ai? Some secret sauce algorithm or raw compute power for training?
>>41590988
Is GPT-SoVITS the closest to 11.ai TTS right now? I really don't like how reference audio is mandatory.
>>41590988
Yeah, once I get out of the wagiecage in 5 hours I will give this a go. Any advice on what kind of references work better vs what could make the output sound trash?
>>41591065
Most likely a large dataset, a large model, and enough resources to train one on the other. It's not "that good" for our use case though; we are looking for very specific voices and prosody patterns which even "the best" models can't quite get zero shot, and companies aren't going to specially finetune their models just for us.

>>41591141
It's the closest that we have the resources to deal with. It's open source, has maintainers who are willing to train and release a base model that performs well enough at inference time on a footprint that fits onto most consumer GPUs (under 4 GB), requires relatively minimal resources, data, and time to finetune to reasonable performance and character resemblance (unlike StyleTTS2 and xTTS), and doesn't have any obviously crippling flaws like ParlerTTS's performance on underrepresented tokens or general schizo energy.
>I really don't like how reference audio is mandatory.
For all we know, 11 could be doing the same thing under the hood, just with a default reference audio for each character. I'm actually still not quite sure whether reference audio is actually mandatory, or whether whatever they get out of reference audio could be precomputed or not. Most of the repo is either Chinese or google-translated Chinese, and there's a lot of confused terminology (for example, the splitting unit of the text splitting method is called a "batch", but "batch size" is also used in its normal sense; with this definition of "batch", "batch size" wouldn't refer to the size of the individual "batch" but rather the maximum number of "batch"es). All I know is that the original webui and underlying TTS pipeline give me an error if I run it without reference audio, and what I thought earlier was the "no reference audio" option seems to rely on some kind of cached internal state which produces really bad results if you switch models.

>>41591321
The TTS pipeline for some reason disallows reference audio outside of the 3 second to 10 second range (not sure why), so that's a length constraint (you could get around it by frankensteining audio clips together in an editor). Shouting references tend not to work very well. I haven't quite nailed down what makes references work consistently, but some factors seem to be:
>The brightness/high frequency information available in the audio clip (too much/too little)
>The intonation - like StyleTTS2, GPT-SoVITS seems to impose the average pitch and general pitch contour of the primary reference onto the output
Adding a bunch of auxiliary references of the same character seems to help with audio quality, although I haven't confirmed this. Feel free to experiment and report your own observations (assuming it even works).
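For the "frankensteining", you don't even need an editor; here's a quick sketch with soundfile + numpy that glues two short clips of the same character into one reference that clears the 3-second floor. It assumes both clips are wavs at the same sample rate (and the same channel layout); resample/convert first if not.

import numpy as np
import soundfile as sf

a, sr_a = sf.read("ref_clip_1.wav")
b, sr_b = sf.read("ref_clip_2.wav")
assert sr_a == sr_b, "resample one of the clips first"

# short silence between the clips so the join doesn't click
gap = np.zeros((int(0.15 * sr_a),) + a.shape[1:], dtype=a.dtype)
stitched = np.concatenate([a, gap, b])
print(f"stitched length: {len(stitched) / sr_a:.2f} s")
sf.write("ref_stitched.wav", stitched, sr_a)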
>>41590988
OK, so I noticed a pretty big bug already -- the auxiliary reference audio paths are never actually passed in. Still, since inference works on my machine, I'm interested in knowing whether it actually works on anyone else's before I go and put up another version.
Also, Clipper, if you're here -- do you mind if I remove the demucs-processed episode stems from my Drive?
>>41590988
To make the (other) system steps less ambiguous, please add to the git page the steps/commands to create and activate a conda/venv environment, and then the ones to install the dependencies from the txt files. The more dum-dum proof we can make it, the more accessible and hassle-free it'll be.
>>41591914
OK, updated.
>>41591949
>conda env create -n GPTSovitsClient python=3.10
>SpecNotFound: Invalid name 'python=3.10', try the format: user/package
Hmm, that doesn't seem to work, at least with Ubuntu. Wrong syntax perhaps?
Depending on the interfacing, maybe something like Applio's setup and run scripts can be examined and revised for use with this SoVITS TTS. Or perhaps a version made which integrates this TTS into it as a separate page/function?
>https://github.com/IAHispano/Applio/releases
>>41592046
Whoops, I forgot that it's just conda create. Updated.
>>41592046
Could someone test if the script still works with the "python=3.10.3" version? I believe with the .4 and above versions, pip/fairseq/omegaconf or some other shitty module was throwing a fit.
>>41590988
I've been out of the loop. Is this *just* a GUI, intended to plug into a preinstalled AI, or does it come with the actual AI model and functionality too?
>>41591651
OK, after poring over the code for around 2 hours, here's what I think is going on:
- The primary reference audio is passed into a "HuBERT" model followed by a Conv1d and RVQ to produce codes--presumably to represent the semantic content of the audio.
- The phoneme embeddings plus a positional embedding and something to do with BERT (presumably to represent the semantic content of both the reference text prompt and the actual text prompt) are concatenated with the semantic audio codes (also plus a positional embedding), then run through a transformer model ("GPT") to create another intermediate representation. They use sinusoidal positional embeddings.
- This representation then gets fed into a VITS network ("SoVITS") which is conditioned on speaker timbre information (the same way speaker embeddings are normally applied in VITS) and converts it into audio.
- The speaker timbre information comes from the auxiliary reference audio, or if none is specified, the primary reference audio. These are converted into spectrograms and fed into a MelStyleEncoder which eventually averages them out temporally. After this they are all averaged together, producing a [1, 512] output.

What this means for us:
- The primary reference appears to keep its sequence dimension, so it's not possible to calculate an "average" primary reference audio.
- Primary reference audio might not be mandatory; the model seems to have cases built in for having no reference, it could just be the TTS pipeline code that disallows it. That being said, I don't know if these are dead code or if generating without a reference will produce good results.
- There doesn't seem to be any inherent reason to restrict the primary reference audio to 3-10 seconds, other than for quality vs. memory usage purposes (perhaps there are edge cases where a reference audio might end up too short for something in the model to use). Maybe they didn't want GIGO to give the model a bad rep?
- It's at least theoretically possible to precalculate average speaker timbre information from auxiliary reference audio, since multiple audios can be averaged together. Whether it's actually useful is another question entirely.

>>41592126
I'm using an environment with python=3.10.15 and it seems to work.

>>41592134
The pyinstaller is a self-contained solution for running the model; it has the actual AI model and doesn't plug into anything else (it would be pretty bad if it were 10GB and needed to plug into something else!)
Originally I intended for it to interface with a server, but then I found out >>41590487, which defeated most of the reason I even wanted to make a client-server model.
>>41592046
>>41592116
Also, something may be odd with my conda setup perhaps. Running "install.sh" gives a command-not-found kind of error for the first 4 lines, but continues with pip. But inputting the same conda commands into the terminal works just fine, and I was able to install everything required. Strange. The install.sh is missing the install for requirements_client.txt; I ended up trying to launch with "py" and it said "gui_client" is not defined, then I tried with "python" and it was missing its "peewee". Everything was seemingly okay after doing the final step of installing the requirements_client.txt file, but now the error in the image related happened and I'm now stuck.
>>41592134
If you want to mass download the models from Hugging Face, use this with your python terminal:
cd #directory where you want it saved#
pip install huggingface_hub
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='therealvul/GPT-SoVITS-v2', cache_dir='tmp', local_dir='models')"
>>41592180
You are not supposed to run ./install.sh. That is from the original repo. The only correct instructions are under the README.
https://github.com/effusiveperiscope/GPT-SoVITS
>>41592180
>>41592205
Also, it looks like you're in your base conda environment.
>>41592205
>>41592256
Oh. That's surprising, because I was able to get Applio to run that way, minus the extra steps for this one.
>conda env create -n GPTSovitsClient python=3.10
>TypeError: deprecated() got an unexpected keyword argument 'name'
Full error: https://ponepaste.org/10440
And yeah, I also did "conda create -n GPTSovitsClient python=3.10" as per the updated git instructions and got basically the same error.
Would it be possible to still continue the setup and run the GUI without setting up a conda environment if these errors persist?
>>41592269
>https://ponepaste.org/10440
It would be helpful if you could post the error you get specifically when you run "conda create -n GPTSovitsClient python=3.10".
>deprecated() got an unexpected keyword argument 'name'
https://github.com/aws/aws-cli/issues/7325
This seems to be a problem with pyOpenSSL; try uninstalling it: `pip3 uninstall pyOpenSSL`
>Would it be possible to still continue the setup and run the GUI without setting up a conda environment if these errors persist?
Possible, if you're willing to overwrite packages in your base python/conda environment (this could cause other things to break if they depend on them). However, if your python version doesn't match, it increases the likelihood of bugs that might be difficult to solve, and you could end up putting things in a state that is difficult to recover from.
>>41592269
>Oh. That's surprising, because I was able to get Applio to run that way, minus the extra steps for this one.
I've removed install.sh from my fork to prevent further confusion. However, generally you should not assume that installation steps transfer from project to project.
>>41592286
>This seems to be a problem with pyOpenSSL, try uninstalling it: `pip3 uninstall pyOpenSSL`
Thank you! This problem apparently prevented me from doing anything with conda, including updating OpenSSL and whatnot, throwing the same error.
I was able to create the environment and proceed as expected, but ran into one more error:
>ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
>fairseq 0.12.2 requires hydra-core<1.1,>=1.0.7, but you have hydra-core 1.3.2 which is incompatible.
>fairseq 0.12.2 requires omegaconf<2.1, but you have omegaconf 2.2.0 which is incompatible.
>hydra-core 1.3.2 requires antlr4-python3-runtime==4.9.*, but you have antlr4-python3-runtime 4.8 which is incompatible.
and it stopped installing afterwards. However, I was able to split the "pip install -r requirements.txt -r requirements_client.txt" step into separate pip install commands, which allowed everything to be fetched correctly (some it apparently missed), and it started to work for a bit, but now it can't find something (image related). Note: the aforementioned dependency error still exists after the requirements_client.txt step. I may hold off on further attempts as it's quite late my end. I'll see about continuing when I wake.
>>41592299
Noted.
>>41592417
This seems to be a Linux-specific issue.
https://github.com/elieserdejesus/JamTaba/issues/1228
If you're using apt, try:
`apt install libqt5multimedia5`
If not, try to find the equivalent libraries for your distro.
>>41592155
>Arbitrary reference length
OK, so here's how different reference lengths affect output quality:
>under 2 seconds
ref: https://files.catbox.moe/944rn3.mp3
gen: https://files.catbox.moe/tkkl1c.mp3
ref: https://files.catbox.moe/7k63bq.mp3
gen: https://files.catbox.moe/oylemz.mp3
>2 seconds
ref: https://files.catbox.moe/5ysutk.mp3
gen: https://files.catbox.moe/dihj8g.mp3
>3.5 seconds
ref: https://files.catbox.moe/mdu2vl.mp3
gen: https://files.catbox.moe/vegwm8.mp3
>over 10 seconds
ref (for both): https://files.catbox.moe/ejlgxr.mp3
gen: https://files.catbox.moe/vhw04c.mp3
gen: https://files.catbox.moe/z3hjev.mp3
I guess you might expect more pronunciation errors with >10 seconds due to increased context length? And under 2 seconds, the quality of the generated audio and character resemblance seems to suffer. Around 2 seconds I think is "OK" territory though. I think I can adjust the code to just give you a warning if your reference is shorter than 3 seconds or longer than 10 seconds rather than disallowing it outright.
>Mandatory references
OTOH, it seems that some of the module code DOES expect reference audio to exist, so it looks like reference audio is mandatory unless I start mucking about in the model's innards.
>>41591911
Go ahead, I have all those saved locally.
>>41592954
OK.
>>41590988
Trying to run on Windows, I get this error while running the exe:
https://pomf2.lain.la/f/7lmb842g.txt
>>41593176
I have my suspicions about what's causing this, but I'm not 100% sure. Do you have ffmpeg/ffprobe on your PATH (I know it's not in the instructions)?
>>41590988
GPT-SoVITS Inference GUI, revision 1.
https://drive.google.com/file/d/1UvzWIFRyO8jjB2z5bgeQnMn0GrOjkaNH/view?usp=drive_link
>Changes
- Fixed auxiliary reference audio paths not being passed into inference properly
- Removed pydub/ffmpeg dependency
- Added experimental ARPAbet support for English
- Allow selection of <3s or >10s reference audio
- Warn instead of stopping generation on <3s or >10s reference audio

>>41593176
>>41593610
Ok, I think I know what the cause was. Apparently pydub actually depends on ffmpeg/ffprobe, and those aren't bundled with it when I run pyinstaller. I've removed pydub as a dependency and replaced its functionality with soundfile, since soundfile at least seems to bundle its library properly with pyinstaller.
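For anyone curious what the pydub -> soundfile swap amounts to, it's basically just this kind of I/O (sketch; the filenames are placeholders). soundfile reads straight into numpy arrays via its bundled libsndfile, so there's no external ffmpeg binary to track down:

import soundfile as sf

# read: returns a numpy array plus the sample rate
audio, sr = sf.read("in.wav")

# write: the container format is inferred from the extension (wav/flac/ogg)
sf.write("out.flac", audio, sr)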
>>41592440
Awesome, one more hurdle dealt with. Everything went well with running, but once more an obstacle; this time something about a Qt platform plugin not being found. Are there some additional things I need? The error persists even with a fresh clone of the git, which I would hope has the same updates as the Windows one >>41593870
>>41593870
>since soundfile at least seems to bundle its library properly with pyinstaller.
It feels like this kind of error is something the module developers should have fixed ages ago.
Hey now you've gotta save the thread
>>41594400
It seems that there are a variety of causes behind this, but the most common solution appears to be `apt install libxcb-cursor0`. Try running that and see if it works; if it does, I'll add it to the README.
https://stackoverflow.com/questions/68036484/qt-qpa-plugin-could-not-load-the-qt-platform-plugin-xcb-in-even-though-it
https://forum.qt.io/topic/148718/qt-qpa-plugin-could-not-load-the-qt-platform-plugin-xcb-in-even-though-it-was-found/2
>>41595049
I've tried many methods surrounding that:
[Already had installed]
>pip install pyqt6
>sudo apt-get install libxcb-xinerama0
>sudo apt-get install libxcb-xinerama0-dev
>sudo apt-get install --reinstall libxcb-xinerama0 // (Hadn't changed anything)
[New but didn't solve]
>sudo apt-get install libxcb-randr0-dev libxcb-xtest0-dev libxcb-xinerama0-dev libxcb-shape0-dev libxcb-xkb-dev
>sudo apt-get install libxkbcommon-x11-dev
>pip install opencv-python-headless
My current theory is that the plugin's placement is incorrectly configured. Some feedback regarding this type of issue expects the plugin "libqxcb.so" to be in "/home/user/.local/lib/python3.10/site-packages/cv2/qt/plugins/", but I found it in another directory beyond it, "~/plugins/platforms/". Maybe I need to create a symbolic link or something? Or the program/lib reconfigured to look there instead? Or maybe like... move/duplicate the libqxcb.so to be in the expected directory?
>>41595133
>libxcb-randr0-dev libxcb-xtest0-dev libxcb-xinerama0-dev libxcb-shape0-dev libxcb-xkb-dev
These are development files (headers, static libraries for compiling); they shouldn't affect anything.
Have you tried running with `QT_DEBUG_PLUGINS=1 python gui_client.py`? If so, could you post the output here?
>>41595192
Another thing that seems promising:
>sudo apt-get install libqt5x11extras5
>>41595199
Oh yeah, also tried that. Sadly, no dice.
>>41595192
That doesn't seem to output anything different from the previous error. Pretty much identical; wrong syntax for additional parameters maybe?
>>41595262
Try `export QT_DEBUG_PLUGINS=1` then `python gui_client.py`?
>>41595272
Couldn't get that to work in the terminal, but I was able to add this to the client_gui.py for more info:
>import os
>os.environ["QT_DEBUG_PLUGINS"] = "1"
The output helped inform me where it was looking for the plugin--which is where it wasn't; the plugin was elsewhere. By default it looked for it in:
>/home/hazyskies/miniconda3/envs/GPTSovitsClient/bin/platforms
but the "platforms" directory was never created, and thus no plugin. I found the one in "/usr/lib/x86_64-linux-gnu/qt6/plugins/platforms", but this is an outdated one that likely came with the distro or something, so it errored stating so:
>"The plugin '/home/hazyskies/miniconda3/envs/GPTSovitsClient/bin/platforms/libqxcb.so' uses incompatible Qt library. (6.2.0) [release]" not a plugin
Thankfully, from my earlier testing I found the REAL one it was looking for in "/home/hazyskies/.local/lib/python3.10/site-packages/PyQt5/Qt5/plugins/platforms", so I manually created a directory called "platforms" in "/home/hazyskies/miniconda3/envs/GPTSovitsClient/bin/" and copied the plugin into it, and now it works! Yay! More testing to be done later. Consider adding a check to see if the directory/plugin exists when first running the script, checking the relevant locations, and mkdir-ing and copying the plugin into the created directory when it exists, to ensure there are no further errors? That'd save similar OS users some headache.
Now that I finally got it running, I'm glad. Just wish the GUI could be scalable; my second monitor is currently being occupied by another device and it doesn't scale well on my ancient 5:4 (1280x1024) monitor.
>>41595665>"/usr/lib/x86_64-linux-gnu/qt6/plugins/platforms"The Qt6 one, likely installed from when you pip installed it earlier, wouldn't work because the GUI uses PyQt5.>Consider adding a check to see if the directory/plugin exists when first running the script and checking the relevant locations. mkdir and copying the plugin into the made directory when it exists to ensure there's no further errors?This seems very hacky. There's no reason to assume that the user has PyQt5 already installed in their local python.Todo:- Investigate why this happens and if there's a more robust solution.- Work on making the UI more compact/scalable.
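One possibly-less-hacky direction (untested, just a sketch): before creating the QApplication, point Qt at the platform plugins bundled with whichever PyQt5 the script actually imports, instead of relying on whatever happens to be in the environment's bin/. The internal layout differs between PyQt5 versions (some wheels use Qt/ instead of Qt5/), so treat the paths as an assumption to verify:

import os
import PyQt5

# look for the platform plugins shipped inside the imported PyQt5 wheel
for sub in ("Qt5", "Qt"):
    plugin_dir = os.path.join(os.path.dirname(PyQt5.__file__), sub, "plugins", "platforms")
    if os.path.isdir(plugin_dir):
        os.environ.setdefault("QT_QPA_PLATFORM_PLUGIN_PATH", plugin_dir)
        break

from PyQt5.QtWidgets import QApplication  # import only after the env var is set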
>>41595701
I notice there are also no default voice lines. Consider adding a download button for that too, or maybe include some to use when you download a voice model? That way they can be used immediately without having to go to the mega each time, or assuming the user already has a copy. It would probably need a search function to find ones with certain lines, given the sheer amount we have available.
Another issue, though thankfully not critical this time: none of the audio wants to play, it just goes to the pause state and doesn't play. The only info in the terminal says "defaultServiceProvider::requestService(): no service found for - "org.qt-project.qt.mediaplayer"". I was able to find the files and play them in VLC though, so still workable.
[First test] Trixie would surely be a master coder:
Ref: https://files.catbox.moe/rn7rxj.flac
Output (compiled): https://files.catbox.moe/h340ee.mp3
>>41593870
Not sure if this is a problem with the program or if I'm just not using it correctly - it doesn't seem to be able to play any of the reference lines supplied or added to the table, and I get an error when trying to generate relating to failing to load audio.
https://pomf2.lain.la/f/7rzu6zzk.mkv
>>41596098
Oh, there are meant to be circles and squares in the primary and aux tables? Those are missing in Linux, just empty. Took a bit to work out that the table had to be clicked there first before generations were allowed.
>Not playing reference lines
Yeah, same for me.
>>41596098
Not at home right now, but I think I know what's going on--the base GPT-SoVITS library also seems to require that ffmpeg be installed. I'm not sure how much I can work around this yet.
>>41596196
Do you think it might be a result of the resolution? Also, do you get anything that looks like an error when you try to play reference lines?
>>41595665
Do you still have the specific error output from when you didn't have the library? That would be helpful.
>>41595917
>Adding a download button, search function
That'd probably require some kind of index of the MEGA Master File to be created, since we can't download subsets of HF datasets. I'll take it into consideration since it seems like an important feature to have.
>Bundle reference audio with the voice models
Possible.
>"defaultServiceProvider::requestService(): no service found for - "org.qt-project.qt.mediaplayer".
It looks like on Linux gstreamer plugins are also a dependency. Could you try following the advice from here:
https://doc.qt.io/qt-5/linux-requirements.html#multimedia-dependencies
and report if it works?
>>41596098
On the preview requirement, I remember now: this is a codec issue. Try installing K-Lite codecs: https://codecguide.com/download_kl.htm
On not being able to generate -- still looking into it.
>>41596607
>ffmpeg:
Apparently GPT-SoVITS not only uses ffmpeg for I/O but also plans(?) to use it for audio stretching(??), which is not so easy to do with another library (that wouldn't require yet another extra install, like rubberband). It's tied into enough things that, unfortunately, I think the best solution really is just bundling the ffmpeg executables with the Windows pyinstaller, taking us up to a hefty 10.6 GB for a full install. Bloat über alles!
>>41593870
GPT-SoVITS GUI, revision 2
Windows pyinstaller: https://drive.google.com/file/d/12JgwvkFao_h_6hHLi-VqoOrA6Lf11X4f/view?usp=drive_link
Updates:
>ffmpeg bundled, possibly the last missing dependency for generation on Windows?
>Tentative GUI changes to make it compatible with smaller displays (should still be resizable to a more sane size for larger displays)
>ARPAbet syntax highlighting in the prompt editor
>>41596821
Getting this error when trying to play the rendered audio:
DirectShowPlayerService::doRender: Unknown error 0x80040266.
>>41596960
>>41596607
Try installing K-Lite codecs: https://codecguide.com/download_kl.htm
>>41596607
>Do you still have the specific error output from when you didn't have the library? That would be helpful.
Yes -> image related in >>41594400
The image related in this post has a little more info, from when I added the wrong plugin to where it was looking for it.
>gstreamer plugins
>And report if it works?
I already have gstreamer and gstreamer-plugins installed, it seems. I also installed good and bad, plus another for pipewire (as it's the audio runner thing I use) and libqt5multimedia5-plugins. However, I think it's the same case of it not finding the service in the right directory rather than missing something required to run it. The output says, among other things before it:
>QFactoryLoader::QFactoryLoader() checking directory path "/home/hazyskies/miniconda3/envs/GPTSovitsClient/bin/mediaservice" ...
>defaultServiceProvider::requestService(): no service found for - "org.qt-project.qt.mediaplayer"
So there's a bunch of things missing from bin that should be there but aren't. Maybe I'll have to go through the list again or something; might've missed a step or not had my env loaded at the time? Or it's not allowing installs into the env because what it's looking for already exists outside of it?
Does this also mean the git is updated with the same changes? If so, how would I go about updating my local copy. Git pull?
>>41597162
I am referring specifically to the error with debug enabled from before you added the wrong plugin.
At this point I'm considering spinning up a VM just to troubleshoot this. What distro are you using?
>Does this also mean the git is updated with the same changes? If so, how would I go about updating my local copy. Git pull?
If you cloned the repository with git clone, then git pull should get the new changes, yes.
>>41597230
>debug enabled from before you added the wrong plugin.
There was no additional debug info before I added the plugin, wrong or otherwise. Whether doing the QT debug thing in the terminal, or adding it to the python script.
>What distro are you using?
Linux Mint 21
Yet another zero-shot-like TTS
https://github.com/open-mmlab/Amphion/tree/main/models/tts/maskgct
Hugging Face space demo:
https://huggingface.co/spaces/amphion/maskgct
Samples:
https://maskgct.github.io/
>>41597325
Well, shit. I remember this was posted a few threads back; whatever happened to them saying they wouldn't release weights for "safety reasons"? Or was that just made up?
Also, looks like no finetuning code. It's still worth taking a look at, because if they release the weights they might plan on releasing training/finetuning code as well. Will test zero shot soon.
>>41597230
>>41597292
This is strange. I used the XFCE version from here: https://linuxmint.com/edition.php?id=301
With the exception of a missing nltk package (which might also affect the Windows version), I'm able to load the GUI without issue. It displays radio buttons and check buttons correctly, and I am also able to play preview audio, generate, and preview generations with no real problems (apart from my audio stuttering like hell b/c it's a VM). It's possible that something unrelated installed on your system, or one of the commands you ran in >>41595133, might cause these issues--but that makes it harder to debug.
I've also confirmed that you can perform CPU-only inference, at least inside a Linux VM. It is quite slow; it took me 44.5 s to synthesize these 28 s of audio.
gen: https://files.catbox.moe/lo3s60.mp3
I also have some more ideas on how to make the UI fit smaller displays.
>>41598009>>41597292In retrospect I probably should've asked you what DE you're using as well, that may affect things.
>>41598009>>41598043
Yea, to be fair I probably should've mentioned it to avoid confusion. I'm using Cinnamon. May have been because I liked the interface better or something. That or the name.
>you can perform CPU-only inference
>it took me 44.5 s to synthesize these 28 s of audio
That's not too bad. It's about double realtime but still pretty quick for the quality it puts out. There's also three outputs, so technically speaking it's still faster than realtime?
>It's possible that something unrelated installed on your system --- might cause these issues--but that makes it harder to debug.
Well in any case I may be doing a fresh install anyway, as I just got my PNY M.2 NVMe. Finally making the very long overdue switch from the relic format that is the HDD. Or well, at least for my main operating system; storage with those will still be sound until we all get those new petabit-sized optical disks on the horizon. Then there's room for all the mares everywhere.
>>41598071
>I'm using Cinnamon
Got it.
>There's also three outputs, so technically speaking it's still faster than realtime?
This was only with one output. There seems to be a -slight- fixed cost effect, because with 3 repetitions I was able to get 115 s gen time, which is under 3*45 = 135 s, but I was also dipping into swap with 8 GB of RAM.
>Well in any case I may be doing a fresh install anyway, as I just got my PNY m.2 nvme.
Even so, I would still like to make sure it works for Cinnamon users.
>>41598083Did an install on Cinnamon, same results.
>>41572561
Amphion test
>>41597325
OK, so this might be one of the worst installs I've done on Windows. I had to fork phonemizer and Amphion and modify the inference script with a bunch of hacky stuff to even get it to start downloading models.
>Rough installation instructions on Windows
0. Create a conda environment with python=3.9.15 and activate it
1. Install espeak-NG and locate the install directory
2. Clone the Amphion repository from here: https://github.com/effusiveperiscope/Amphion
3. In models/tts/maskgct/maskgct_inference.py, modify _ESPEAK_LIBRARY to point to your espeak-NG install directory (see the sketch after this post)
4. pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
5. Install dependencies roughly following https://github.com/open-mmlab/Amphion/blob/main/models/tts/maskgct/env.sh WITH CHANGES:
- Remove torch==2.0.1 from the end of tensorboard
- The version of phonemizer they use won't work on Windows because of missing mbrola. Change the pip install phonemizer to pip install git+https://github.com/effusiveperiscope/phonemizer.git
- Also, pip install json5
6. pip install -U numpy==1.26.4
>Inference requirements
It looks like you need an NVIDIA GPU with a minimum of 12 GB VRAM to run this (it maxed out at 10.2 GB). That puts it out of reach for most people here.
The git repo itself is ~240 MB on disk, and all of the pretrained models it downloads together are ~5.5 GB. The miniconda environment is another 5.5 GB.
>Zero shot performance
It used 27 s to infer 18 s on a 3080 Ti.
ref: https://files.catbox.moe/w90njn.mp3
gen (18 s): https://files.catbox.moe/e8am3r.mp3
Well, that's disturbing. But it sounded pretty good at the start? Much better zero shot performance than most other models.
ref: https://files.catbox.moe/j9hnbg.mp3
gen (18 s): https://files.catbox.moe/3qym13.mp3
gen (10 s): https://files.catbox.moe/w90njn.mp3
Not so good on Rarity -- obviously she has a less generic accent. The hallucinations are less likely to happen if I make the audio shorter. I think this failure mode is much less desirable for automated systems compared to mispronunciation errors.
The inference requires that you explicitly specify the output duration in advance.
>>41598191
Also -- do you remember if you selected the "multimedia" codec thing at installation?
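For reference, step 3 above mostly boils down to pointing phonemizer at the espeak-ng DLL. A minimal sketch of what that looks like, assuming EspeakWrapper.set_library is the hook the fork uses and with an example install path (check maskgct_inference.py in the fork for the real variable):

```python
# Illustrative only: the DLL path below is an example, and EspeakWrapper.set_library
# is assumed to be the hook used; adjust to match your own espeak-NG install.
from phonemizer.backend.espeak.wrapper import EspeakWrapper

_ESPEAK_LIBRARY = r"C:\Program Files\eSpeak NG\libespeak-ng.dll"
EspeakWrapper.set_library(_ESPEAK_LIBRARY)
```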
>>41598404>gen (10 s): https://files.catbox.moe/w90njn.mp3Whoops, wrong file. Here's the correct one: https://files.catbox.moe/rg3m6h.mp3
>>41598404Also this model is called "MaskGCT", not amphion.
>>41596821
Managed to install and train my own models with the default fork. Your GUI is super handy, but I noticed the generations don't sound the same compared to the RVC-Boss fork of GPT.
RVC-Boss: https://files.catbox.moe/n9sn0i.wav
GPT-SoVITS GUI: https://files.catbox.moe/v6bfdb.flac
Reference Audio: https://files.catbox.moe/lwus9w.flac
Could the requirements I downloaded for RVC-Boss be conflicting with this fork? How do I get the generations to sound the same?
>>41598852You may have the base pretrained model loaded (it's loaded by default). Did you load the dedicated model?
>>41598852
>generations don't sound the same compared to the RVC-Boss fork of GPT
>how do I get the generations to sound the same?
The same ... in what regard? Are you referring to how dynamic/varied the generations are with GPT-SoVITS? If so, that's a good thing, allowing for many takes to get a really good line and/or delivery. It shouldn't be necessary to try and make one AI sound like another, but rather to make the character more accurate and clear.
>>41596821Windows 10, CPU gen inference seems to be functional.https://files.catbox.moe/lsfqe3.mp3It took a moment to figure it out coming from SVS/RVC, but that's pretty damn impressive for a local CPU gen.
>>41599176Nice, thanks for reporting
>>41596821Hey, how do I properly add more reference clips while automatically having them be sorted? Or do I have to sort them out myself? I'm confused.
>>41599817What do you mean sorted? You can just download the voice data from Clipper's master file (in the OP) and all of those have the relevant data to label themselves.
>>41599817What do you mean specifically by sorted? If your reference clips don't have PPP-style labeling data in their names then you will have to fill in the fields manually.
>>41598870>You may have the base pretrained model loaded (it's loaded by default). Did you load the dedicated model?Dammit, I completely missed the 'load selected models' button, everything works perfectly. I feel like a complete jerk for bringing the 'issue' up. I just got everything set up and was rushing out the door for work when I posted. Thanks for quick fix anon!
>>41600135>I feel like a complete jerk for bringing the 'issue' up.Eh, it's a fairly reasonable assumption to make that there's nothing loaded there beforehand, and that the newly downloaded model would be loaded automatically. I'm hesitant to touch GPT-SoVITS's automatic base model load though because I don't know if something else might depend on it.
TTS is finally winning big here with GPT-SoVITS. Hopeful to see some great line delivery in future fan episodes.
https://files.catbox.moe/wu08z5.mp3
>>41600633That's a lot of bugs, you should probably file a mare issue.
>>41596821A few lines I did with Twilight's model! I like it a lot!https://files.catbox.moe/idmne9.wav
>>41600989Wow, nice.
I found a neat trick. Process audio octaves apart and layer them. The hoarse parts don't stand out anymore and you get a neat effect as they add up to something good in aggregate. Especially good for a song that already sounds flange-y or had heavily processed voice in the first place.
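For anyone who wants to try the trick, here's a minimal sketch of the shift-and-layer part in Python. The file names and mix gains are made up, and the conversion model you run each copy through (RVC, so-vits, etc.) is left out:

```python
# Sketch of the octave-layering trick: pitch-shift a copy of the source, process
# each copy with your voice model separately (not shown), then mix the results.
# Paths and gains are placeholders.
import librosa
import numpy as np
import soundfile as sf

y, sr = librosa.load("vocal_take.wav", sr=None, mono=True)
low = librosa.effects.pitch_shift(y, sr=sr, n_steps=-12)  # one octave down

# ...run y and low through the conversion model here...

n = min(len(y), len(low))
mixed = 0.7 * y[:n] + 0.5 * low[:n]
sf.write("layered.wav", mixed / (np.abs(mixed).max() + 1e-9), sr)
```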
>>41582059Oh, before I forget to ask, would you be so kind as to create or otherwise provide documentation on how to fine-tune models for this? Wouldn't mind training some of my own given how good these turned out, providing an RTX 2060 is capable enough to do so.
>>41601159
Some relevant information on setting up the environment here: https://desuarchive.org/mlp/thread/41498541/#q41562711 (it's not true that training always starts from epoch 0; that was an erroneous observation I made)
This rentry is also helpful but inaccurate (you should clone the repository; don't use the zip because it's outdated): https://rentry.org/GPT-SoVITS-guide
Also, the choice of epochs is much less rigid than the guide states. I didn't notice any adverse effects from enabling DPO. Check the archive for my observations on training hyperparams and their effects on output.
Very generally:
0. If you don't have transcriptions, you need to follow the "audio slicer" etc. steps in the rentry, which will use ASR to automatically generate them and also automatically slice your audio
1. If you already have transcriptions, you don't need to use their step 0 preprocessing, but you do need to provide your own filelist with each line representing a sample, in the format (a sketch for building one follows this post):
><audio_path>|<speaker_name>|en|<plaintext transcription>\n
I have my own notebook+library to do this with a local copy of the Master File: https://github.com/effusiveperiscope/PPPDataset/blob/f42390a7ee75fae04a50afb400be417d380577b1/ppp2.ipynb
2. You can specify your own filelist under the webUI step 1 (GPT-SOVITS-TTS) tabs. I feel the rest of the interface is self-explanatory.
Also:
- I modified the webui code to increase the maximum number of SoVITS and GPT epochs (because I found it worked better for them).
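If it helps, here's a tiny sketch of producing that filelist format yourself without the notebook; the sample paths, speaker name, and transcriptions below are placeholders:

```python
# Hypothetical example: write a GPT-SoVITS filelist in the
# <audio_path>|<speaker_name>|en|<plaintext transcription> format described above.
from pathlib import Path

samples = [
    ("wavs/twi_0001.wav", "Twilight", "We have a lot of studying to do tonight."),
    ("wavs/twi_0002.wav", "Twilight", "Spike, take a letter."),
]

with open("filelist.txt", "w", encoding="utf-8") as f:
    for audio_path, speaker, text in samples:
        # One sample per line, pipe-separated, with "en" as the language tag
        f.write(f"{Path(audio_path).as_posix()}|{speaker}|en|{text}\n")
```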
>>41596821
GPT-SoVITS Inference GUI, revision 3.
https://drive.google.com/file/d/1EljbxeUckYATH269utj7q1T-8oKcPhte/view?usp=sharing
I think we're feature-complete here.
>Changes
- Add a Master File downloader, which programmatically constructs an index of the Master File and lets you search for files using unix glob patterns (e.g. *_Rarity_*, see the example after this post) and download them to the ref_audios folder
- Rearranged columns for better UX on small displays, and made the reference audio table slightly more adaptive overall to different display sizes
- Add a less time-consuming check for NLTK packages
- Check for/download averaged_perceptron_tagger_eng (although I'm not sure if it's needed)
- Fill in missing config keys if they are not found in the user config
- Loosen the omegaconf requirement on Linux so requirements.txt and requirements_client.txt are (hopefully) more compatible
- Warn the user if inferring with the base model
>Migration
If you want to avoid redownloading the pretrained models, you should just be able to replace _internal and gptsovits.exe in an existing install with the new versions, but let me know if it doesn't work.
>Note for source users
Dependencies changed; requirements_client.txt has new dependencies.
>Rant
By far the hardest part of this was trying to figure out how to programmatically list files and download links from MEGA shared folders. mega.py is deprecated and the MEGA REST API itself is undocumented and quite opaque, and most info about it only seems to come from reverse engineering. The only other alternative is MegaCMD, and I didn't want to make users install yet another dependency to make the damn thing work, so I ended up implementing it in python. The end result feels hacky, and I don't know how long it's going to work or how well supported/breaking it ends up being across systems. There is exactly one StackOverflow thread that gave me most of the information that I needed to work with: https://stackoverflow.com/questions/64488709/how-can-i-list-the-contents-of-a-mega-public-folder-by-its-shared-url-using-meg
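Not PPP code, but if you want to sanity-check what a pattern like *_Rarity_* would match before downloading, Python's stdlib fnmatch uses the same style of unix globs (the GUI's matcher may differ in detail, and the filenames below are made up):

```python
# Quick way to preview glob-style matching against a list of names.
import fnmatch

files = [
    "00_01_05_Rarity_Neutral__.flac",
    "00_02_11_Twilight_Happy__.flac",
]
print(fnmatch.filter(files, "*_Rarity_*"))  # -> only the Rarity clip
```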
>>41601267lol I forgot to include a screenshot
All this "use 5 second of reference to get a style of speech" make me interested in this idea (codefags, feel free to call me out an idiot on how this is not how it works, but it makes sense in my head):Have a audio reference model train with audio that is separated into clear folder and one into noisy folder (random reverb, bad pitch, crap going on in background etc).The way I think the raining process would work is to get model to understanding two parameters, a) what makes the audio "clear" and b)what makes audio "noisy".Once trained, one could feed the model the main noisy audio and a 2nd reference of clean audio to convert the main audio into a clear version of itself (aka use the 2nd reference to fix up the bad quality in the 1st audio).
>>41601911
I think it's possible; you've basically structured a style transfer problem. I guess the reason you feed it a 2nd reference, as opposed to not conditioning on anything, is so you give the model an idea of what the character's "clean" audio is like, to bootstrap off existing data?
- ParlerTTS already kind of implicitly does half of this with labeled noise levels; it works OK
- Generally for denoising tasks people use the more straightforward approach of just degrading an existing clean input so you have matched noisy/clean pairs (a sketch of this follows)
- For it to actually be worth using, it would have to outperform things like just running the input through, for instance, an RVC or so-vits-svc 5.0 model trained on the existing clean data
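To make the second bullet concrete, here's a minimal sketch of the degrade-the-clean-input approach; the paths, SNR, and fake reverb are placeholder choices, and it assumes mono input:

```python
# Make matched (noisy, clean) training pairs by synthetically degrading clean clips.
# Everything here (paths, SNR, impulse response) is illustrative.
import numpy as np
import soundfile as sf

def degrade(clean: np.ndarray, sr: int, snr_db: float = 15.0) -> np.ndarray:
    noise = np.random.randn(len(clean))
    clean_rms = np.sqrt(np.mean(clean**2) + 1e-9)
    noise_rms = np.sqrt(np.mean(noise**2) + 1e-9)
    noise *= clean_rms / (noise_rms * 10 ** (snr_db / 20))  # hit the target SNR
    ir = np.exp(-np.linspace(0.0, 8.0, sr // 4))            # crude decaying "reverb" tail
    wet = np.convolve(clean + noise, ir)[: len(clean)]
    return wet / (np.max(np.abs(wet)) + 1e-9)

clean, sr = sf.read("clean/clip_0001.wav")
sf.write("noisy/clip_0001.wav", degrade(clean, sr), sr)
```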
>>41601267
ehh, getting this error on W7. I remember getting something similar with RVC and having it fixed with this setup:
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1+cu116 "tensorflow[and-cuda]" --extra-index-url https://download.pytorch.org/whl/
I will test out different module combos over the weekend, and once I find something that works I will post it here.
>>41602441Well the python runtime and dependencies are bundled with the installer so you won't be able to modify anything with it by messing with your pip directly. You might want to try an install from source but replace the pytorch version with one that you know is compatible with your system (idk if anything will break though; the recommended pytorch version here is 2.3.0).
>>41602526Upon further investigation, it looks like there is a way to install/use pytorch versions >= 2.1 on Windows 7 using something called VxKex + an extra DLL:https://discuss.pytorch.org/t/pytorch-2-1-is-no-more-able-to-use-my-gpu/208672/13
Alright, here's another fun mistake you can worry about when finetuning: make sure you have the pretrained discriminator in the right directory and that you don't accidentally delete it when you're cleaning up your project folder like I did ;^)
With pretrained discriminator: https://files.catbox.moe/ba7z82.mp3
Without pretrained discriminator: https://files.catbox.moe/lbw4m7.mp3
Also, I think the deeper male voices (like Flam's) require more SoVITS epochs (up to 96) to get the needed deepness--the pretrained model seems to have some bias toward higher pitched voices.
https://files.catbox.moe/hxfzyl.flac
>>41601185
So, a little confused as to the hardware requirements in that referenced post. It says the GPT and DPO (dunno what the latter is) can train on ~6 GB, but the SoVITS side needs ~12 GB? Depending on the requirements to fine-tune like the earlier examples, that might be outside my 8 GB capabilities; hoping this isn't the case.
>>41602858That was for the given batch sizes. You can use lower batch sizes to lower the memory requirements.
>>41602858NTA but I've trained models with batch size 12 on my RTX 2060 6GB VRAM.
>>41602860Ah, good. So just means longer training times then. I'll look further into it and attempt within the next few days. Will keep posted.
>>41602782
https://files.catbox.moe/bn6xn4.mp3
Flim (SoVITS epoch 96, GPT epoch 36): https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Flim-SVe96-GPTe36
gen: see above
Flam (SoVITS epoch 96, GPT epoch 48): https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Flam-SVe96-GPTe48
gen: see >>41602624
(auxiliary references were used for these, so I'm going to refrain from listing them)
https://x.com/genmoai/status/1852154518911304152
>>41602947
For some reason it didn't register to me that it was Nightmare Night until now.
Black Snooty (?) (SoVITS epoch 96, GPT epoch 32): https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/NightmareMoon-SVe96-GPTe32
https://files.catbox.moe/b0524r.mp3
https://files.catbox.moe/kxb5iv.mp3
I think what I'm noticing more clearly is that as GPT epochs increase, the model generates more of its own information and has better naturalness/accent resemblance to the overall character, whereas at lower GPT epochs it seems to adhere more closely to the reference.
>>41602539>>41602526
Berry interesting, I will mess around with this over the weekend. Thanks for posting this; looks like I will not have to decide between cucking to Win11 or going full Linux autism for at least a few years.
>>41603208
>The model requires at least 4 H100 GPUs to run
wew. I mean, it's cool the video generator AI exists out in the open for people to use, but it's going to be two or three years until this is downsized to the degree that nor/mlp/eople can casually use it.
>early are bump
Guess who else has 17 seconds of data?Octavia Melodyhttps://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main/Octavia-SVe84-GPTe48gen: https://files.catbox.moe/2q9dp1.mp3
https://files.catbox.moe/a41mh4.mp3
https://huggingface.co/fishaudio/fish-agent-v0.1-3b
>>41604708
I remember finetuning fish-speech before. I don't have the samples from back then, but I was not very impressed by its character resemblance even after finetuning.
>>41604390welcome back tts Octavia voice, it's been awhile.
https://files.catbox.moe/e9dcwv.mp3
https://github.com/etched-ai/open-oasisWhat if this was fed Gameloft pony gameplay, or similar? What kind of wild, unnatural and possibly cursed mares would it unleash?
https://files.catbox.moe/3i1qn5.wav
So I have been thinking, is there anything somebody could do to help out with PPP and pony AI related stuff that doesn't involve running/training models on a 10+ GB VRAM GPU?
>>41607544
Nice
>>41607871
>Run/train models on a lower VRAM GPU
>Make content
>Study ML, make toy ML projects
>Make toy LLM projects using the free yet severely rate limited models on OpenRouter idk
>Clean up OP/docs
>>41602526>>41602539
So I looked into the link and uhhhh, how do I convert this stuff into an exe installation file?
>>41607977https://github.com/i486/VxKex/releases/tag/Version1.1.1.1375appears to have exe
>>41602947Did you figure out the proper amount of audio needed to get a good voice?
>>41608011If you're OK with this level of performance >>41604390as low as 17 seconds seems to be possible (haven't bothered with lower), but with that little data any accents will suffer (for instance, how she pronounces "gold"). Some voices seem more well-behaved than others; deeper voices tend to have a bit more trouble.
>>41604390Any chance for requesting the S1E2 Woona voice?
>>41607251>https://files.catbox.moe/l5jo3e.mp4man, i love how strange this feels, its like trying to explain a dream morphing from one thought to another.
>>41608282She only ever spoke nine words. Like I said when 15's site was available, you might as well just use the later Luna voice and raise its pitch with audio software:https://u.smutty.horse/magmddxudmm.mp3
>>41609407
https://files.catbox.moe/f6jbkt.png
>>41598009
It's because you're in a Python env environment; install the Qt plugin in there, or just copy the .so
>>41608282Wasn't she voiced by Tabitha too?
>>41610135Imagine the taste
>>41608688>>41611327
She only has 3 seconds of audio, but I can't stop thinking about how sweet and innocent it sounds. I just wish to hear more of it.
How do I use the API of GPT-SoVITS to connect with SillyTavern? I tried, but the model tends to repeat itself and make nonsense words...
>mares
>>41578166What was the amount of data for Cadance?
Would Anons here be interested in doing an original AI song album for marecon (which I would assume will happen sometime in January)?
The general theme would be mares singing about/to Anon, whether the subject be love/hate/friendship, and given that Anons here have all kinds of different tastes in ponies and music, I feel like it would be a pretty interesting spectrum of songs to enjoy. I was thinking of doing this solo, with one song per M6, but then I thought there must be at least a few Anons out there that may be interested in this idea as well.
All the AI song tools for this are currently free (and while I hate the service model style of Udio/Suno, they are currently the only fully working song models unless something changes in the next few weeks). From there one can use a preferred audio/vocal separator and apply RVC/SoVITS/other AI tools from the PPP to give the songs the proper pony voice.
>>41611817quite so
>>41611402Some guy made an AI out of fucking Dark Souls 1 male pain noise. S1 Luna voice would work.
>>41613907I don't want to imagine how that sounds.
Page 10 save.
So it turns out Tara Strong did the Twilight Sparkle voice for a Disney pilot 5 years prior to working on MLP, at least that's what it seems like:
https://www.youtube.com/watch?v=tJoW6rNR_A4
>>41608282>>41608688
Nobody ever told you guys to find Tabitha's original voice for it? Cause I keep fucking trying, but I cannot find her exact raspy voice.
There's also another problem. Her "I'm so sorry" has a raspy voice that Tabitha never uses. Her second line, "I missed you so much big sister", doesn't have the same raspiness. So basically you've got to contact Tabitha to make you a voice, unless you guys can find some Japanese or western cartoon where she did that exact raspy accent.
>>41608688
It's not the same accent/pitch/personality. It just sounds like a younger Rarity. Also, 15.AI still sounds like shit; he lost the AI race.
>>41617765
Tara Strong uses the same 3-4 voices, and her generic-ass voice isn't hard for other voice actors to replicate. However, that voice is spot-on for the exact pitch & accent she used for Twilight.
Holy shit, that's a lot of popular voice actors.
>>41614604
>>41611430What's it sound like in that weirded state? I wish to hear the sound of misconfigured mares.
First time checking in on these threads in about 9 months
>it's just some guy posting porn every other day or so to keep it from falling off
Well that's sad, but now that AI voice cloning is democratized and is as easy as downloading a model and throwing up a w-okada instance, what is the continued goal of the project? And did 15 ever stop being a faggot and explain why he disappeared for years at a time and continually missed promised launch dates while taking everyone's money?
>>41621671I'm not a regular, but I don't think voice cloning is everything there is to this project. Also, it's still not perfect though it's coming close to the big players (11labs). 15.ai was the goat, but now it's time to move on.
>>41621950
Gotta agree that this part of the year is stupidly busy irl, and with the lack of spare time, the pony content making time is sadly pretty limited.
Page 10 bump.
>>41617765Wow! Glimmer
>>41601267Ran into this error after trying to use my own recorded reference lines for the first time. Not sure if it's a bug or if I've done something wrong?
Back from an involuntary vacation. J*nnies apparently consider my posting patterns + bumps "flooding". Frankly, if this keeps happening I'm going to move to NHNB.
>GPT-SoVITS
I've released alternate versions of some of the models trained with more GPT and SoVITS epochs on my HF: https://huggingface.co/therealvul/GPT-SoVITS-v2/tree/main
The old models will remain uploaded (as long as huggingface keeps letting me dump models onto their servers for free). I did not exhaustively check to see how they actually performed, so it's possible some may be screwed up. Should be improved audio quality, but character resemblance/reference resemblance/coherency may vary.
>Music
I have finally forced myself to make an original song for the first time in nine months: https://files.catbox.moe/iu1jyb.mp3
Lyrics: https://ponepaste.org/10467
I thought it would be interesting to train a so-vits-svc 5.0 model on Luna's speaking voice (Tabitha St. Germain) for this one, since I'm not completely satisfied with Aloma's audio quality or timbre. I think it turned out OK. https://huggingface.co/therealvul/so-vits-svc-5.0/tree/main/Luna%20Speaking
>>41624678I think this is because you haven't specified a character name for the sample. Unfortunately I'm retarded so I didn't anticipate that case.
>>41624737King. Thx for keeping this shit alive.
>>41624737I too, hate janniesGreat to see you back at it again.>>41624743Yeah that was the problem, assumed that I could leave everything except "Utterance" blank from reading the instructions, I suppose there's one little thing in there somewhere that has character name as a dependency. No matter to me though, from this experiment I once again learn that so-vits just doesn't vibe at all with my voice so will be sticking to samples from the master file. Cheers.
>>41624737
Not the first, nor is it going to be the last time jannies act like complete fucking mongoloids on this site. Nice song though; I can't put my finger on it, but it reminds me of something I've listened to between '06~'10.
>>41624678
GPT-SoVITS GUI revision 4
I've updated the program to fix this behavior and also a bug where all the 'n's in filenames were being sanitized out.
https://drive.google.com/file/d/1dgG1kg0e9p4khrwpPaI9NdV_PIiMDGOZ/view
>>41611865
Cadance has 13 minutes of Clean+Noisy data.
>>41624752>>41624893>>41625105
Thanks.
Hey, so I was doing more experimenting with Ace Studio and using the custom voices thing to make the mares sing, and I think they made some adjustments to how they handled the tone of the voice? It seems like it takes accent more into account.
Applejack on Solo23 - https://files.catbox.moe/3ywhaz.wav
Applejack on Verse24 - https://files.catbox.moe/m1c403.wav
Twilight on Solo23 - https://files.catbox.moe/axyrm0.wav
Twilight on Verse24 - https://files.catbox.moe/7734xc.wav
Rarity on Solo23 - https://files.catbox.moe/4b8g4k.wav
Rarity on Verse24 - https://files.catbox.moe/x8tdr6.wav
I still don't like the whole subscription thing, but I do admit that they're getting better at preserving accents, and I like that.
>>41624737>https://files.catbox.moe/iu1jyb.mp3This is really trippy, nice. Love the tone it sets, I haven't heard a lot like it.
>>41624737Honestly after using these a little I think GPT epoch 24 is a mistake. The resulting tone of speaking seems a lot more boring which is not what we're really going for here.
>>41624737very rich textures
>>41626354
What was the rationale for going from SV24-GPTe8 to SV96-GPTe24? I'm trying to understand how this thing works and where to stop the training.
Also, you can dump as much as you want on HF; there is a 10K file limit per folder and 50 GB per file. I have one repo with like 1 TB of checkpoints lmao.
>digital mares forever
>>41626871
My working theory is:
- SoVITS training generally increases audio quality and the quality of sibilants, up to some point of overtraining
- Some degree of GPT training is needed just to get plausible results. After that, more GPT training seems to "increase" the "plausibility" of the delivery (pitch and rhythm) at the cost of variation and possibly increased pronunciation errors
>>41626736Hopefully, yes.
>>41625728>You need accessreeee!
>>41627548Fuck fixed
>>41626871Thanks for the feedback. I take it that the overtraining point for Sovits was at epoch 96 then? I hope you can find a good middle ground for GPT. Also, I am wondering if DPO has a big effect.
>>41627957I think it was for Twilight, I just extrapolated it to the rest so as to not think or test as much. I'm really not sure I preferred anything I observed past GPT 8.
Up.
>>41621240
Repeat the reference audio randomly throughout the text
>>41630112
I don't know specifically about SillyTavern, but this hallucination can happen in a few situations:
- If there is no reference text/utterance being passed alongside the reference audio
- If the text being passed to generation is empty or very similar to the reference audio's utterance
- If the generation is too long (this can happen with the 4-sentence batching method)
idk if you guys care about AI music covers anymore, but I just finished two projects from last year.https://www.youtube.com/watch?v=Y-9K9aWhutkhttps://www.youtube.com/watch?v=HgfsKS-Ux_A
>>41631289I love AI music, covers included.
Zero shot F5-TTS with no training or finetuning using a randomly selected 5-second reference audio:https://voca.ro/1c0r89ojIOcMUsing RVC as a post process:https://voca.ro/1ZTudWW2y9jA
>10
>>41632050
>F5-TTS
Interesting. The first clip does not really sound like any specific character, but having some other TTS alternative is always nice.
I'm sure somebody has asked this before, but are there any TTS programs that allow training custom voices that are NOT pytorch/AI based? I know the Vocaloid copycats SynthV/UTAU allow for that, but their UI is not really friendly for simply copy-pasting text into it. I'm just looking for something closer to MS Sam tier.
I know the TkinterAnon and DeltaVox TTS programs exist, but I was hoping for a project that is a bit more up to date, as these programs had massive issues reading stuff and were extremely non-customizable.
>>41632943
>https://github.com/rhasspy/piper
>https://www.youtube.com/watch?v=rjq5eZoWWSo
>https://www.youtube.com/watch?v=b_we_jma220
Alright, I lurked around and there is this project from last year called Piper TTS, designed primarily to run on a Raspberry Pi 4 (so max requirements to run it cannot exceed 8 GB of RAM).
The training process does not seem to be that much of a pain in the ass, and from the github they state the project is supposedly able to work just fine in Linux/Windows/Mac environments. No idea about minimum and maximum requirements (other than that audio needs to be 22,050 Hz and fine-tuned on the large model for the best quality), so that needs to be tested.
This would benefit from having a proper UI, as using just the plain terminal is doable but I feel it would be too much of a pain in the ass for everybody else; that's something to look into in some distant future.
>>41633143Check the archive someone already trained models
>>41625728I have a really old graphics card (GTX 770M) that is not supported for any recent version of CUDA. When I attempt to run the inference GUI (revision 4), I get this error. I already have the latest official NVIDIA driver installed for it, and I'm pretty certain this GPU will never work with GPT-SoVITS. Is there a way to force the inference GUI to ignore my GPU and use the CPU instead?
>>41633650
I found a workaround. I can just set the CUDA_VISIBLE_DEVICES environment variable to an empty string and it will use the CPU instead. It's still reasonably quick on old hardware too (20-50 seconds for a 1-2 sentence generation).
https://files.catbox.moe/ee1adp.mp3
>>41633798screenshot for demonstration
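Since this comes up a couple of times in the thread, here's the same workaround expressed in Python for anyone launching from a script of their own; the env var has to be set before torch is imported:

```python
# Hide all CUDA devices so torch falls back to CPU; must run before "import torch".
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import torch
print(torch.cuda.is_available())  # expected: False on the CPU-only fallback
```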
>>41633650
>CUDA_VISIBLE_DEVICES=""
Hmm, I've tried that and I'm still getting an error:
File "torch\__init__.py", line 130, in <module>
OSError: [WinError 193] %1 is not a valid Win32 application. Error loading "Q:\_AIfromC\_AItts\GPT-SoVITS-v2\gptsovits_11-7-24-r4\_internal\torch\lib\c10_cuda.dll" or one of its dependencies.
[PYI-29800:ERROR] Failed to execute script 'gui_client' due to unhandled exception!
>>41633798>>41633803
Yeah, it would be nice if there was an option to add a parameter/argument on the command line like "--CPU=TRUE" to force it to start in plain CPU mode.
>>41625728Is there any place to get more pony voices for this?
>>41634178I'm not aware of anyone else having trained pony models for it.
>>41633979
>download.pytorch.org/whl/
>This site can’t be reached
Fucking great, absolutely splendid. Does anyone know an alternative/archive to the above that can be used to install a different version of torch?
>>41634209>download.pytorch.org/whl/Works for me anon. Although updating your pytorch in your own python installation won't affect anything from the pyinstaller zip. I'm not sure to what extent the cuda build of pytorch relies on the cuda DLLs being loadable (it looks like it's failing pretty early on in the process), so it's possible that for portability I might actually need to maintain two packaging environments and builds--one for CUDA and one for CPU only.
>>41634209>>41634244>CPU onlyCan you try this? (Also updated github readme with this link)https://drive.google.com/file/d/1FVuwuKyUqfuRcHVKr-ACgAul4araUjPN/view?usp=sharing
>>41634244
The pytorch website is acting like a bitch for no reason, and I had to use a VPN to install the following torch that works on my old PC:
pip3 install torch==1.10.1+cu113 torchvision==0.11.2+cu113 torchaudio===0.10.1+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html --force-reinstall
However, I was still getting errors like the above, so I finally decided to install that "KexSetup_Release_1_1_1_1375.exe". In properties I set the option to use "Win7 pack 1", and "gptsovits.exe" was able to open up from the console like for the Anon above.
https://voca.ro/1gMIOaBVPaIU
I haven't tested it thoroughly, but it is working, and messing around with the Blueblood model I can see that it seems to struggle with longer 5s clips; however, this is way better than not having any TTS options for it.
There seems to be one problem: I cannot replay an already played clip (I can see the animation of the red bar going from start to end, but no sound on replay).
The Generation window could also use the name of the clip that was generated and is being played, since I can see myself spamming like five dozen lines and forgetting which ones I liked and having to re-listen to all the clips all over again.
Also, I think I found a bug: when setting "Seed" to a non-random number and "Repetition" to a number higher than 1, the program seems to get stuck in a never-ending generating mode?
>>41634377
Uh oh, sorry, the above problem seems to be solved for me, but potentially the other Anons could use this version.
>>41634446>The Generation window could also use the name of the clip that was generated and being played since I can see myself spamming like five dozen of lines and forgetting which ones I liked and having to re-listing to all the clips all over again.You can drag and drop audio clips from the play button into another folder or DAW, does this help?>Also I think I found bug, when setting "Seed" to non-random number and "Repetition" to a number higher than 1 the program seems to get stuck in never ending generating mode?Some generation errors don't get propagated all the way up, check the console.
>>41634475
>You can drag and drop audio clips from the play button into another folder or DAW, does this help?
Not really, since I will still need to re-listen to all the clips all over again, instead of making a note like "pony123 sounded bad but clip pony124 sounded pretty good" while generating the next batch.
Also, there seems to be an issue with the TTS part hallucinating extra words: trying to generate "and Derpy is my beloved pony." I get "and Derpy WHY is my beloved pony." I can somewhat fix it by changing the word to "Derpee" to minimize the interaction with Audacity.
Go home guys. Udio.com won.
>>41634640
You could have said the same about Amazon Alexa in 2019, and yet here we are.
>>41634610
>not really, since I will still need to re-listen to all the clips all over again, instead of making a note "pony123 sound bad but clip pony 124 sounded pretty good" while generating next batch.
Then just keep all of the clips that sound good in a separate folder. Do you intend to keep the bad sounding clips? What exactly are you trying to do?
>I can somewhat fix it by changing the word to "Derpee" to minimize the interaction with audacity.
I think this happens because the repo authors use an extra library for word segmentation which probably detects the word "Derpy" as two words, "Derp+y". There's also ARPAbet support, which won't be affected by this issue.
>>41633650I just took another look at this screenshot I posted earlier and now I feel really dumb. The problem was not my graphics card, it was the application failing to download the pretrained s1-BERT model. After setting CUDA_VISIBLE_DEVICES="", I no longer got the CUDA warning at the top of the console, so I assumed I had "fixed" that "problem" and had ran into a different error. In reality, I think it was the same error and stack trace preventing the startup, just without the CUDA warning. To resolve it, I manually downloaded the model from https://huggingface.co/lj1995/GPT-SoVITS/tree/main/gsv-v2final-pretrained and saved it to GPT_SoVITS\pretrained_models\gsv-v2final-pretrained\I tried starting the GUI again just now without setting CUDA_VISIBLE_DEVICES="" and it worked, which confirms that the GUI will default to using CPU if it detects an old graphics driver; it just prints a harmless warning to the console. No need to mess with environment variables after all.
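If anyone would rather script that manual download instead of clicking through the browser, something like this should work via huggingface_hub, assuming the repo id and folder name are still as in the post above (the target directory mirrors the path mentioned there):

```python
# Pull just the gsv-v2final-pretrained folder from the upstream GPT-SoVITS HF repo.
# Repo layout is assumed from the post above; adjust local_dir to your install.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="lj1995/GPT-SoVITS",
    allow_patterns=["gsv-v2final-pretrained/*"],
    local_dir="GPT_SoVITS/pretrained_models",
)
```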
>>41635014Didn't we use to have a customized ARPAbet dictionary that added words that were only found in the show? Is GPT-SoVITS able to load it?
>>41635317I think by default it uses its own universal dictionary (some version of the CMU pronouncing dictionary). You might be able to hack in the horsewords one, but not sure how it would interact with the word segmentation problem.
>>41634640Who?
Mare!
>>41636331and again
>>41634640What's that supposed to be?
>>41636830yes
>GPT-SoVITS
Has anyone else tested how far you can push the clip reference without actual training? I've tried Vinyl Scratch's og voice on the Rarity model and the results were better than expected, but still meh.
https://vocaroo.com/11g3P0OEfrme
It seems elevenlabs has the best voice denoiser tool, or is there an open source alternative I can use without making a throwaway account?
>>41639035
What's wrong with using Ultimate Vocal Remover? Vul made a model that removes random background noises some time ago.
>>41633143
>https://github.com/rhasspy/piper-phonemize
Hello, I require the assistance of someone who is a much more competent coder than I am. If somebody would be willing to turn this github repo into a Windows wheel, that would be really appreciated. For whatever reason, Microsoft Visual Studio 2022 is pissing and shitting itself on my PC, and all the other wheels for the above only exist in Mac and Linux format.
I was trying to use the "make" command, but for reasons once again unknown to me it gets the job done to 95% and then just errors out.
Derp the Wind! I happened to make this right before the song was discovered recently. I made it for the Ponyville Ciderfest mix tape, and I usually don't upload songs I make for convention tapes immediately, but this warranted an exception.https://www.youtube.com/watch?v=p9OFkfzZuLg
>>41640508
One-shot generation: https://files.catbox.moe/lbaww6.ogg
I'm probably going to consume a lot of news this way.
>>41640954>A whole broadcastNeat idea. I guess you could use it for stuff like text review and proofreading too.
>>41640954>Guest>HostWas the dialogue generated by an LLM?
>>41641113Yeah. I had it deliberately set the names to Host and Guest so I could swap out the voices without the result being too distracting. It was generated through about 1200 calls to llama 3.1 70b based on this text:https://buttondown.com/ainews/archive/ainews-bitnet-was-a-lie/It was mildly complicated. The code should be published soon™.
>>41625728I submitted a pull request to patch api.py and Dockerfile for better automation. The updated api.py accepts a file path for prompt_text instead of the actual text. It's so the caller doesn't need to know specifics about what reference files are available, which makes it easier to decouple the caller from the api server. I don't think you're using api.py, so it shouldn't break any of your code.
>>41640954
Rarity and Starlight discuss this thread: https://files.catbox.moe/9xn1x4.ogg
Transcript: https://files.catbox.moe/exo0sz.txt
I tried to get it to cover the input document more comprehensively & faithfully. It tends to discuss redundant topics when doing this for threads. Fixing that will probably require some preprocessing step to create an organized document from the thread. Using more context could also fix it, but the API I'm using only supports an 8k context window. I'll get back to this later.
>>41641441Would make my day if you could make another one with Pinkie and Dash. Voices here seem really impressive.
>>41572862numget
>>41641297Is this still GPT-Svoits or something else?
>>41641395ok merged
>>41642134It is GPT-SoVITS with Vul's voice models & Master File clip references.>>41641944I ran into a daily token limit, but will do once I can.
>>41640954>>41641441Listening to these put a big stupid grin on my face and prompted me to think back to the days of the first threads when this was all just getting started. Being able to now make voices this good with relatively little work makes all the effort feel worthwhile.
>>41571795EQG when?I need them for ponifications, I swear.
>>41643470>>>/trash/
https://files.catbox.moe/bjjhq7.mp3
>>41641944
Here (You) go: https://files.catbox.moe/vdc6i7.ogg
Pinkie and Dash geeking out over Vul's commit history.
>>41643942
This stirs something primal in me.
>>41644174>55 minutesOh boy.
>>41644174Yo THANKS for coming through, dude! Really appreciate your work here
>>41644174Do you use some kind of system to detect who is speaking from the text, or is this just set up as chat style text:"character 1: text" "character 2: text" ?
I'm working on a project, and I'm trying to find the best way to prompt a text-to-speech through a Python script, whether that be sending a request or importing it directly. Is there any API thing that comes with haysay that I could just hook into? Or any other repos that I should look at?
^ I too would be interested in an answer to Anons question above.
>>41644760
For this one, I had it generate chat-style text, exactly as you described, then I parsed it out (a rough sketch of that parsing step follows this post). The transcript in >>41641441 is exactly what it generated, just in small segments. In cases where I want to parse out text, I usually have the LLM generate JSON with the relevant fields already separated. In this case, the LLM produced worse dialogue when I had it output JSON, which is why I had it just generate chat-style text.
>>41645227
https://github.com/synthbot-anon/horsona/tree/main/samples/gpt_sovits
In your own project, you just need to use two of the files:
- A utility class for making sure parallel calls involving different speakers are ordered properly, to avoid switching out the voice model too frequently: https://github.com/synthbot-anon/horsona/blob/main/src/horsona/lock/resource_state_lock.py
- The class for actually generating speech: https://github.com/synthbot-anon/horsona/blob/main/src/horsona/audiogen/gptsovits.py
Example code once you have those two files: https://github.com/synthbot-anon/horsona/blob/main/samples/gpt_sovits/src/main.py
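For anyone wiring up something similar, the "parsed it out" step can be as simple as the sketch below; the speaker labels are whatever your prompt asked the LLM to use, and this is not the horsona code itself:

```python
# Rough sketch: split chat-style LLM output ("Host: ...", "Guest: ...") into turns.
import re

TURN = re.compile(r"^(?P<speaker>[^:\n]{1,40}):\s*(?P<text>.+)$")

def parse_dialogue(raw: str) -> list[tuple[str, str]]:
    turns = []
    for line in raw.splitlines():
        m = TURN.match(line.strip())
        if m:
            turns.append((m["speaker"].strip(), m["text"].strip()))
    return turns

print(parse_dialogue("Host: Welcome back.\nGuest: Thanks for having me."))
# [('Host', 'Welcome back.'), ('Guest', 'Thanks for having me.')]
```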
>>41645829I'm just annoyed that you didn't even try to search and replace Host and Guest with Rarity and Twilight, or AI with "ae eye".
>>41646072
I switched to working on something else after I had it written. Right now, everything is generated in a fire-and-forget way, so I don't get a chance to modify the transcript before it's passed to the TTS. Once it's published, I'll clean up things like that.
On that note, horsona updates:
- [Done] I added OpenAPI support when running the node_graph server for game engine integration. Here's an example of the spec it generates: https://ponepaste.org/10498. It generates this dynamically on the /api/openapi.json endpoint. There are a lot of tools for automatically generating clients from OpenAPI specs https://github.com/OpenAPITools/openapi-generator so this would be an easy way to expose any functionality written with the library to external clients.
- ... [In progress] I'm going to write a sample application for this.
- [In progress] I'm working on a way to add explicit causal reasoning to LLMs. I'll commit all of the changes for this once I have the whole thing working.
- ... [Done] I have a module that can do causal regression given a small number of datapoints & a small causal graph.
- ... [Done] I have a module for picking representative data points for cases where it's given too much data. I wrote this because causal regression is slow with a large number of datapoints.
- ... [In progress] I'm working on a module to chain together analysis from multiple small models. I mostly know how to do this, but implementing it is tedious.
- ... [In progress] I can get an LLM to generate small causal graphs and datapoints from small snippets of text. I'll need to test it for robustness, then update it to handle streams of text.
- [In progress] I'm writing a few modules to handle streams of text. I can get an 8k token context window to handle about 25k words of context right now with a combination of the GistModule + Recent Messages. I have some thoughts on how to get that number much higher using multiple levels of Gists.
- [Done] TTS support through GPT-SoVITS + sample application.
>>41646545ay
>>41646911neigh
>>41647285nay
Collab song with Vulhttps://www.youtube.com/watch?v=vwZRqM9quic
I tried something, but somebody else should try it with Rebecca Shoichet or Tara Strong's voice.1st one is using Udio. 2nd one is using ElevenLabs.https://vocaroo.com/1oHodh5vYKpRhttps://vocaroo.com/1eG8IzJ6C5y6Here is the original speech for anyone else to give it a go:A cutie mark is far more than a mere symbol or identifier—it is the distilled essence of a pony’s very being. It is not simply a reflection of a talent or hobby, nor a role assigned by society. Instead, it stands as an intricate, immutable emblem of individuality, representing a pony’s soul, heritage, and identity in a way that is both deeply personal and profoundly abstract.This mark is a tapestry of meaning, weaving together culture, ancestry, character, and spirit. It is a flag of individuality, a coat of arms that each pony bears proudly. Like a fingerprint unique to the self, it cannot be replicated or erased. In its permanence, as confirmed in Call of the Cutie, the cutie mark becomes a lifelong affirmation of one’s unique narrative—a sacred banner of identity and self-discovery.To reduce such a profound symbol to a mere vocational label, as some later depictions in Cutie Pox or Magical Mystery Cure attempt, is to strip it of its true magnificence. A cutie mark is not a job or an obligation; it is a timeless reflection of the harmony between body, mind, and soul. To trivialize its meaning is to misunderstand its transcendent role in expressing individuality and purpose.Scientifically, one might liken it to a unique genetic code, an expression of existence so layered and intricate that it defies reductive interpretation. Emotionally, it is a beacon—a radiant testament to the miracle of identity and the wonder of self-expression.Let the cutie mark remain untouched, its beauty unblemished and its meaning untarnished. To honor the cutie mark is to honor the sacred, irreplaceable essence of the individual. It is a celebration of the complexities that define us, a crystallized symbol of the infinite beauty of the soul. Let it forever stand as the brilliant coat of arms it was always meant to be—a shining flag of the heart, unfurled in the winds of life. A true snowflake essence known colloquially as snowpity.
>>41647735nice
>>41647735
very nice
>>41647801
>2nd one
This is not sounding good at all; not sure if this is due to their service output or some option messed around with in an audio editor.
Small VLMs?
I'm new to gpt-sovits, please I need help. This is from the rentry:
>Here are the recommended settings for SoVITS training:
> Batch size: 2 (1 if your gpu has 6G vram)
> Total epochs: 8
> Text model learning rate weighting: <=0.4
> Save frequency: 4
I've seen that the biggest points of contention were with these specific settings. What settings would you suggest? How long does it take to train 8 epochs, just to get an idea? I want to mess with it but not let my GPU run for a month. I have a 1080 Ti and 32 GB of memory.
>>41647801So uh… how did you do the first thing with Udio? I thought it was a text to music site. Can you guide us through your process?
>>41649757
Ok, I tested it and it produced something within 10 minutes with the default settings. Not bad. The results weren't great, but I liked it better than E2/F5. Pronunciation is worse, but virtually no mistakes.
Now I'm training with the maximum allowed epochs for both, and it seems it's still fast enough; gonna be done in 45 min or so.
Is there a way to unlock the UI to train beyond 25/50?
>>41649757>>41601185>>41626354>>41626871Not recommendations but guidelines. You can save multiple checkpoints at epoch intervals and test them too.
>>41649818
Update, it might be done sooner than that lol
>>41649818
About 15 minutes for 25 epochs on a 1080 Ti, running with a batch size of 6.
>>41649819How can I unlock the max epochs in the ui pretty please?
wtf is this? I see Chinese; is that normal when training an English model?
>b
>>41649852Don't worri mai ferrow anon
>>41650140yes
>>41649818>>41649824>Is there away to unlock the ui? to train beyond 25/50I'm not sure what the other anon training it was doing, but I just unlocked it via inspect element and it worked fine, so you can do that.
Up from 10.
>>41649852The repo was worked on by a chinese guy so everything is going to default to it.
>>41652461
>RVC was made by a group of rando Chinese programmers
>This one as well
Uh oh, it's a little bit worrying that all the quality AI projects are only being progressed by three groups: globohomo western corpos, china commies and the small group of horsefuckers.
>>41652933They're the ones that care the most about AI and don't have to deal with all the red tape that western devs have to deal with since they don't give a shit about things like ethics or copyright.
https://vocaroo.com/14eyuFuDu0Zs
>>41653178Shouldn't Celestia's voice be deeper than that?
>>41646158
Horsona updates:
- [Done] I finished the sample application for automatically generating an SDK. The usage looks a little ugly since the SDK generator I'm using generates ugly code, and I had trouble finding a better one for python. It at least shows that the auto-generated OpenAPI spec works. Hopefully there are better generators for other languages. C++ and C# generators seem to be the important ones for game engine integration. (Unreal Engine, Unity.)
- ... Code: https://github.com/synthbot-anon/horsona/tree/main/samples/node_graph_client
- [In progress] I'm working on an OpenAI-compatible interface for custom modules. The basic idea is: some modules build custom functionality into the LLM API (e.g., generate results like some character, automatically include things like RAG and Gists, etc.), then run a script to start a server to create an endpoint for that module. Then chatbot UIs like SillyTavern can use that endpoint instead of Ollama/OpenAI/Anthropic to get better & more tailored text generation with a lot more customization options than what the UI itself supports.
- ... [Done] I cleaned up a bunch of code to make this possible and to make it easier to create custom LLM APIs. Here's an example for how to create one that can reference a ~25k "canon" story with an 8k LLM context window: https://github.com/synthbot-anon/horsona/blob/main/src/horsona/memory/readagent_llm.py
- ... [Done] I have the code for creating an endpoint for custom LLM modules here: https://github.com/synthbot-anon/horsona/tree/main/src/horsona/interface/oai.
- ... [In progress] I need to write a sample server showing how to create & expose a custom module, then test it with SillyTavern.
No changes from the last post:
- [In progress] I'm working on a way to add explicit causal reasoning to LLMs. I'll commit all of the changes for this once I have the whole thing working.
- ... [In progress] I'm working on a module to chain together analysis from multiple small models. I mostly know how to do this, but implementing it is tedious.
- ... [In progress] I can get an LLM to generate small causal graphs and datapoints from small snippets of text. I'll need to test it for robustness, then update it to handle streams of text.
- [In progress] I'm writing a few modules to handle streams of text. I can get an 8k token context window to handle about 25k words of context right now with a combination of the GistModule + Recent Messages. I have some thoughts on how to get that number much higher using multiple levels of Gists.
>>41654108I don't think that's ai generated.
Zero-shot Voice Conversion with Diffusion Transformershttps://arxiv.org/abs/2411.09943>Zero-shot voice conversion aims to transform a source speech utterance to match the timbre of a reference speech from an unseen speaker. Traditional approaches struggle with timbre leakage, insufficient timbre representation, and mismatches between training and inference tasks. We propose Seed-VC, a novel framework that addresses these issues by introducing an external timbre shifter during training to perturb the source speech timbre, mitigating leakage and aligning training with inference. Additionally, we employ a diffusion transformer that leverages the entire reference speech context, capturing fine-grained timbre features through in-context learning. Experiments demonstrate that Seed-VC outperforms strong baselines like OpenVoice and CosyVoice, achieving higher speaker similarity and lower word error rates in zero-shot voice conversion tasks. We further extend our approach to zero-shot singing voice conversion by incorporating fundamental frequency (F0) conditioning, resulting in comparative performance to current state-of-the-art methods. Our findings highlight the effectiveness of Seed-VC in overcoming core challenges, paving the way for more accurate and versatile voice conversion systems.https://github.com/Plachtaa/seed-vc
>>41654800It's kind of frustrating that the API stuff for the gpt sovits sample is mixed in with the same download containing over 2GB of pretrained models. Wouldn't it be better to separate them.
>>41656154>2GBFirst day with python?
>>41656242
Agreed. I updated it.
The updated voices file: https://drive.google.com/file/d/106i6hQVDrUuULe_k8-MSi7wB4fW0X2Qx/view?usp=sharing
- This one contains only the tts config and one voice folder for reference.
And the updated readme: https://github.com/synthbot-anon/horsona/tree/main/samples/gpt_sovits
- The only changes are (1) use the new voice file, and (2) use synthbot/gpt-sovits:v3 instead of v2.
>>41654800
Horsona updates:
- [In progress] I'm almost done with a module to use significantly more memory than the context window allows. This one combines ReadAgent gist & fetch with RAG. The RAG is used to identify relevant gists up to some character limit, then ReadAgent is used to pull the most relevant pages into context up to some other character limit. The data can be organized like a filesystem. Once this is done, I'll create the SillyTavern integration sample app based on this. I've already tested the SillyTavern integration to make sure it works.
No changes from the last post:
- [In progress] I'm working on a way to add explicit causal reasoning to LLMs. I'll commit all of the changes for this once I have the whole thing working.
- ... [In progress] I'm working on a module to chain together analysis from multiple small models. I mostly know how to do this, but implementing it is tedious.
- ... [In progress] I can get an LLM to generate small causal graphs and datapoints from small snippets of text. I'll need to test it for robustness, then update it to handle streams of text.
- [In progress] I'm working on an OpenAI-compatible interface for custom modules. The basic idea is: some modules build custom functionality into the LLM API (e.g., generate results like some character, automatically include things like RAG and Gists, etc.), then run a script to start a server to create an endpoint for that module. Then chatbot UIs like SillyTavern can use that endpoint instead of Ollama/OpenAI/Anthropic to get better & more tailored text generation with a lot more customization options than what the UI itself supports.
- ... [In progress] I need to write a sample server showing how to create & expose a custom module, then test it with SillyTavern.
>>41656312>>41656154
Uppy.
>>41656790pone
This isn't really the best place to ask, but /g/ is not being very helpful here. Could somebody send me their windows msvw10 dll file on catbox so I could get this shitty error fixed?
Yes, I've already tried installing the recommended fixes: directx_Jun2010_redist, vcredist_x64, vcredist_x86, and none of them fixed it.
>>41657352This seems like an extraordinarily bad idea
>>41624737based bonus track
>>41658214>trusting random Anons on internetI know, but im bit desperate since the other alternative would be to re-install the whole windows system or move on to Linux (and kind of fucking up 90% of program workflow I have set up)
>>41659138
I searched my Windows laptop and couldn't find that file in C:\Windows.
>or move on to Linux (and kind of fucking up 90% of program workflow I have set up)
If you're a programmer, just do it. Windows sucks for so many reasons other than just workflow issues, and it's not obvious how much it's holding you back until you switch. You'll have better workflows with Linux + VS Code or Cursor + Vim anyway since you can automate things so much more easily.
>>41659189
>programmer
Sadly I am an artist, and my toolset involves using the type of shit that has not been updated for the past 5~15 years. I did try Linux every once in a while (I have it installed on a backup secondary drive), and I could never find proper alternatives, and the way I see people work with it is way too janky and limited for the exact autistic way I need to work.
Sorry to inquire here, I'm not a horse enthusiast but I am a voice clone TTS enthusiast. Currently, what is the best open source way to clone a voice and use it for TTS? I've tried GPT-SoVITS recently and I've been disappointed; I trained a model for an hour.
>>41660108https://files.catbox.moe/l4y9uo.wav
>>41660183
I've tried cloning Lydia's voice from Skyrim with a minute of handpicked dialogue lines and the maximum allowed epochs for both stages. I made sure everything that could be English was, and at the end it still sounded off, like a Chinese person who is somewhat fluent in English.
>>41660192Try GPT 24 epochs and SoVITS 96 epochs (inspect element to increase the max allowed epochs).
>>41660353
nta, how do I train a new model? I would love it if the main GPT-SoVITS script had a separate "train" tab like RVC has, to simply point it at audio references, check the correct boxes, and let a one-button click take care of the rest of the training.
>>41660406Check the guide https://rentry.co/GPT-SoVITS-guide#/
>>41660414
OK, about three hours in, and so far I've learned how to brute force a terminal into using a different Python installation, update said installation due to missing modules, modify the sys.path.insert call since it was somehow picking up the wrong directory, and butcher the shit out of i18n.py because for whatever fucking reason it was not opening the en_US.json file.
I will continue my adventures with the training tutorial tomorrow and hopefully actually train a voice with the new TTS for once.
>>41660192
Your settings are somewhat better than what I expected. It doesn't sound Chinese at least.
https://vocaroo.com/11JOWNbssSLh
This is an old model; the training data was already prepped for it.
>>41660353
Thank you kind sir, this is my Karlach model I cooked up.
>https://vocaroo.com/137bNdlFA5kG
From what I've noticed, DPO is either good or doesn't make a significant difference (the rentry guide says it's bad). No reference mode gave me more consistent results in inference, pronunciation-wise.
Anyway, what does temperature do? And top_k and top_p?
>>41661921
I'm not sure about DPO either (maybe someone here knows), but no reference mode doesn't really exist in the code; it's just caching your old reference. You might have a random seed, so it's generating something else with your same reference.
Temperature < 1.0: voice closer to the reference, but more pronunciation errors.
Temperature > 1.0: voice further from the reference, but sounds more natural.
The effects of top_k and top_p aren't very clear; I don't touch them.
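For intuition, here's a toy sketch of how these knobs usually behave in autoregressive samplers in general. This is not the GPT-SoVITS sampler itself, just a generic illustration: temperature rescales the logits before softmax, top_k keeps only the k most likely tokens, and top_p keeps the smallest set of tokens covering that much probability mass.

import numpy as np

def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    # Toy temperature / top_k / top_p sampling over a single vector of logits.
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-6)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]               # most likely tokens first
    if top_k > 0:
        probs[order[top_k:]] = 0.0                # keep only the k most likely tokens
        probs /= probs.sum()
    if top_p < 1.0:
        cum = np.cumsum(probs[order])
        cutoff = np.searchsorted(cum, top_p) + 1  # smallest prefix covering top_p probability mass
        probs[order[cutoff:]] = 0.0
        probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Low temperature sharpens the distribution (more deterministic, closer to the reference);
# high temperature flattens it (more variety, but also more pronunciation slips).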
>>41660812
Q:\_AIfromC\_AItts\GPT-SoVITS-v2\GPT-SoVITS-Lite>Q:\_AIfromC\_AItts\GPT-SoVITS-v2\GPT-SoVITS-Lite\runtime\python.exe Q:\_AIfromC\_AItts\GPT-SoVITS-v2\GPT-SoVITS-Lite\webui.py
Running on local URL: http://127.0.0.1:9874
IMPORTANT: You are using gradio version 3.38.0, however version 4.44.1 is available, please upgrade.
--------
"Q:\_AIfromC\_AItts\GPT-SoVITS-v2\GPT-SoVITS-Lite\runtime\python.exe" tools/asr/fasterwhisper_asr.py -i "Q:\_Vds\___Pie_in_the_sky\Valkyrie_SC1\clean\output" -o "Q:\_Vds\___Pie_in_the_sky\Valkyrie_SC1\clean\output\asr_opt" -s large-v3 -l en -p float32
Traceback (most recent call last):
  File "Q:\_AIfromC\_AItts\GPT-SoVITS-v2\GPT-SoVITS-Lite\tools\asr\fasterwhisper_asr.py", line 25, in <module>
    from tools.asr.config import check_fw_local_models
ModuleNotFoundError: No module named 'tools.asr'
The script keeps shitting itself by not finding the correct path to another Python script that is in the exact same folder.
>>41661928
For the no reference mode I just slap all the audio clips into the right "optional" panel, and it just works. What do you usually use for temperature? I think the max allowed is 2 even if I unblock the UI.
>>41661937
Then it's using the audio clips as reference. I leave temperature on 1 except if the voice doesn't sound like the character at all, then I lower it a bit (0.75-0.8). More than 1.2 and you get garbage, so there is no point in setting it that high.
>>41661931
So I fixed that by installing the ultraimport module, then swapping out the following code in fasterwhisper_asr.py:
>from tools.asr.config import check_fw_local_models
to:
>import ultraimport
>check_fw_local_models = ultraimport('__dir__/config.py', 'check_fw_local_models')
But now I'm getting this error:
Traceback (most recent call last):
  File "Q:\_AIfromC\_AItts\GPT-SoVITS-v2\GPT-SoVITS-Lite\tools\asr\fasterwhisper_asr.py", line 60, in execute_asr
    model = WhisperModel(model_path, device=device, compute_type=precision)
  File "Q:\_AIfromC\_AItts\GPT-SoVITS-v2\GPT-SoVITS-Lite\runtime\lib\site-packages\faster_whisper\transcribe.py", line 133, in __init__
    self.model = ctranslate2.models.Whisper(
RuntimeError: CUDA failed with error CUDA driver version is insufficient for CUDA runtime version
and the above is fucking bullshit, because as far as I can see CUDA runs perfectly fine on this system:
Q:\_AIfromC\_AItts\GPT-SoVITS-v2\GPT-SoVITS-Lite>Q:\_AIfromC\_AItts\GPT-SoVITS-v2\GPT-SoVITS-Lite\runtime\python.exe -c "import torch; print(torch.cuda.is_available())"
True
Q:\_AIfromC\_AItts\GPT-SoVITS-v2\GPT-SoVITS-Lite>Q:\_AIfromC\_AItts\GPT-SoVITS-v2\GPT-SoVITS-Lite\runtime\python.exe -c "import torch; print(torch.version.cuda)"
11.8
Q:\_AIfromC\_AItts\GPT-SoVITS-v2\GPT-SoVITS-Lite>Q:\_AIfromC\_AItts\GPT-SoVITS-v2\GPT-SoVITS-Lite\runtime\python.exe -c "import torch; print(torch.zeros(1).cuda())"
tensor([0.], device='cuda:0')
>>41662005
Uh oh, I may or may not have solved the issue. It seems ctranslate2 dislikes CUDA below 12, but the new 1.0.0+ faster-whisper requires a newer version of it, so both of them needed to be downgraded:
Q:\_AIfromC\_AItts\GPT-SoVITS-v2\GPT-SoVITS-Lite\runtime\python.exe -m pip install -U ctranslate2==3.24.0
Q:\_AIfromC\_AItts\GPT-SoVITS-v2\GPT-SoVITS-Lite\runtime\python.exe -m pip install -U faster-whisper==0.10.1
After that, only "GPT_SoVITS\process_ckpt.py" was acting a bit retarded by having the following lines:
from tools.i18n.i18n import I18nAuto
i18n = I18nAuto()
throwing out errors, BUT not actually using/referencing them inside its own script, so both got commented out and now the training seems to go pretty smoothly.
Now training the SoVITS part, and it looks like it can do 1 epoch per ~2 minutes.
>>41661941
Thank you anon, the 0.75 temp tip is really good for my model. Using the reference definitely helps. Idk if it's a good idea to slap all the clips into the right panel; it's a little monotone, but it's fine.
Training at 24 GPT epochs and 96 SoVITS epochs, using both reference panels, and having the temperature at around 0.8 helps tremendously.
>https://vocaroo.com/1fyBuaedJN3A
>>41662141
That's good. And no, it's averaging the clips; that's why it sounds monotone. Try only giving it one reference, the cleanest you have.
A fun test with Celestiahttps://files.catbox.moe/hsc1bv.wav
>>41662125
And the very last obstacle (errors from another script) was solved with a roundabout fix: adding the following lines above the "from tools.i18n.i18n import I18nAuto" import:
>import sys, os
>base_dir = r"Q:/_AIfromC/_AItts/GPT-SoVITS-v2/GPT-SoVITS-Lite"
>sys.path.insert(0, base_dir)
Other than that, trying to train on 25 seconds of not-great-quality audio resulted in not-so-great-sounding output, but I feel that if I let the SoVITS part of the model train for a few more dozen epochs it may sort itself out.
Bleak twiggle song
https://files.catbox.moe/enlb4d.mp3
https://ponepaste.org/10521
The ending is from a nightmare I had. I don't expect every new song to be a downer fwiw.
>https://files.catbox.moe/dfoo1r.flac
>https://huggingface.co/Amo/GPT-SoVITS-v2/tree/main/SC1_Valkyrie_v01_SVe70-GPTe10
Alright, after half a day of training the result is not great. The more I trained the GPT, the less coherent the TTS results ended up. The SoVITS training was also a bit funny: the epoch-70 checkpoint was not picked because it was good, but because it sounded the least bad out of all the generated models.
Given that the original 28s training file https://files.catbox.moe/e96m50.wav has a radio-like effect on top of it (and also had an engine sound going in the background, removed with an AI filter), the result is still better than expected, but worse than what I was wishing for.
>>>/wsg/5738260
>https://github.com/kijai/ComfyUI-PyramidFlowWrapper
>pyramidflow
>384p works on 16GB VRAM
>768p needs 24GB+. 10 seconds (also seems to be less reliable than 5 seconds)
So there is this offline AI video maker out there. I've stolen this link + webm from the /wsg/ thread. I have no idea how difficult training for this stuff would be, but HEY, we are one step closer to AI-made cartoons with ponies.
>>41662326
That's Celestia? She barely sounds like her.
https://files.catbox.moe/jtcgt3.mp3
>>41663316Yeah you're right I messed up: https://files.catbox.moe/cxobua.wav I wonder if there is a way to clean up the end result automatically
>>41662632>digits give me flashbacks to listening to models trained on SC09nightmare indeed
up
Where are the mares hiding?
>>41663813https://files.catbox.moe/q1w20r.mp3
Can I offer you an AJ in a silly-cute dress in this trying time?
There is an archive of MLP show music that has been unmaintained for the last few years. It contains instrumentals and high-quality versions of songs. There are known missing instrumentals of some songs that are publicly available. Is anyone interested in maintaining it?
https://docs.google.com/document/d/1zfGmwKJoCNgX8QMkkDoem2nOAw83-dg5fnJqJK0Jxig/edit?tab=t.0
>>41666591
What are the known missing instrumentals?
>>41667099
Babs Seed, Crystal Empire, Blank Flanks Forever. The first two are in games, the last one is in the 2019 leak. Maybe something else too, plus a high-quality EqG Better Together with vocals.
Button Mash Sings KSI Thick Of It - We The Sus Music AI Cover
https://files.catbox.moe/oiv6o7.mp3
>>41662632The song might be a downer, but I really dig it anyway.
>>41660183
>Gee Pea Tea Soviets
Why do I imagine communist Mane 6? Rainbow Dash storming the Winter Palace.
>>41668791I would prefer mares to read me all the poetry books that are collecting dust on the bookshelf.
>>41668003
Huh, we have a Button Mash AI voice? Is that RVC or SoVITS? (Do kindly link it up either way.)
>>41669834no go find it yourself faggot
Where did the ponies get enough data to train a model of Anon's voice? What would they even do with such technology?
>>41669440https://files.catbox.moe/r8kptk.mp3
>>41669440
>>41671080
And the rest: https://files.catbox.moe/71otqw.mp3
All generated clips: https://files.catbox.moe/sqlciz.zip
>>41671100
Poetry by mares, you could say it's a mare-etry.
>>41671100Did you apply post-processing after sovits?
Are there image generators that don't struggle with show-accurate style? Even the best generated images I've seen break on the outlines. Can they be improved with postprocessing? Maybe something like a bilateral filter? Or maybe train a NN to find outlines and then paint them with a solid color? Some sort of rasterized vector image sharpener that gets fed the blurred output of the AI? Or maybe make a NN that splits the image into solid or gradient regions plus background, and produces a plane equation for the color of each "splotch".
Representing part of an image as a mix of color planes instead of a bunch of pixels sounds interesting. Is there any research on this topic? Or any other mathematical surface that can be reduced to a plane, or close enough to it. Maybe a cosine table like lossy codecs use, but only for splotches.
>>41671927
No. I was too dumb and lazy to figure out how to do post-processing with Audacity, and I kept running into audio issues with it.
>>41671962
>prompt:score_9, (rating_safe), pony, show accurate, twilight with headphones listening to music with her eyes closed sitting on a bench, nighttime with stars width:1024 height:1024 scale:7.5 steps:25 sampler: K_EULER_A model:PONY_V6_XL seed:3987383230
It took about 5 attempts. You'll probably want to train a LoRA for it to make it more reliable.
>Representing part of image as mix of planes of color instead of bunch of pixels sounds interesting. Is there any research on this topic?
Maybe the Color ControlNet? I'm not sure if there's a Color ControlNet for SDXL.
>>41591651
>precomputing stuff from the reference audio for GPTSoVITS
I have been independently looking into this, and I believe it is feasible. The following 3 variables could be precomputed from the reference audio and its transcription. With a little refactoring, they could then be passed as arguments to the get_tts_wav method in inference_webui.py:
"prompt" - An array computed from the reference audio by passing it through HuBERT, Conv1d, and ResidualVectorQuantizer networks (whose weights are stored in the .pth sovits file). Its size is on the order of 1x100 to 1x1000 integers for a typical reference audio lasting several seconds.
"phones1" - A list of integers which are indices of arpabet tokens, determined from the transcription. Its length is on the order of a couple hundred integers.
"refers" - A spectrogram of the reference audio, wrapped in a list. Its size is on the order of 1x1025x100 to 1x1025x1000 floating point numbers and is by far the largest value to store.
Note: The code has a variable called "bert1" which is also derived from the transcription. For English reference text, however, it is always an array of zeros, so there is no need to precompute it.
In theory, the user could supply additional reference audio files if they wanted to, and their spectrograms would replace (or be appended to) the precomputed "refers" variable. I am working on a refactor of inference_webui.py and developing some code to perform the precomputations. More to come soon, hopefully within the next week.
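Purely as an illustration of the caching idea (the file name, field names, and array shapes below just mirror the variables described above; they are not from the actual codebase or the planned refactor), the per-clip cache could be as simple as one compressed .npz record per master file:

import numpy as np

def save_precomputed(path, prompt, phones1, refers):
    # prompt: 1-D int array of RVQ codes; phones1: 1-D int array of arpabet token ids;
    # refers: float array holding the reference spectrogram (or a smaller derived embedding).
    np.savez_compressed(path, prompt=prompt, phones1=phones1, refers=refers)

def load_precomputed(path):
    data = np.load(path)
    return data["prompt"], data["phones1"], data["refers"]

# Example with placeholder shapes in the ranges mentioned above.
save_precomputed("twilight_s1e01_001.npz",
                 prompt=np.arange(300, dtype=np.int64),
                 phones1=np.arange(150, dtype=np.int64),
                 refers=np.zeros((1, 1025, 400), dtype=np.float32))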
>>41673455Nice, color me interested
>>41673118
>>Representing part of image as mix of planes of color instead of bunch of pixels sounds interesting. Is there any research on this topic?
>Maybe the Color ControlNet?
Wow. That's not what I meant in the quoted part, but it's also nice. It is what I meant by "rasterized vector image sharpener that gets fed the blurred output of the AI".
The plane representation I meant is the coefficients of a plane equation, just like GPUs use when rasterizing triangles. Or a matrix. Each pixel is marked with which plane it uses, so in the end the pixel coordinates are multiplied by the matrix of the plane they refer to.
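A minimal sketch of that plane-per-splotch idea, assuming the regions have already been segmented somehow: fit color = a*x + b*y + c per channel with least squares, then reconstruct the region from the coefficients. All names and shapes here are illustrative, not from any existing tool.

import numpy as np

def fit_color_plane(xs, ys, colors):
    # Least-squares fit of color = a*x + b*y + c for one region ("splotch").
    # xs, ys: pixel coordinates in the region; colors: (N, 3) RGB values.
    # Returns a (3, 3) matrix with one row of [a, b, c] per channel.
    A = np.stack([xs, ys, np.ones_like(xs)], axis=1).astype(np.float64)
    coeffs, *_ = np.linalg.lstsq(A, colors.astype(np.float64), rcond=None)
    return coeffs.T

def eval_color_plane(coeffs, xs, ys):
    # Reconstruct region colors from the plane coefficients (pixel coords times plane matrix).
    A = np.stack([xs, ys, np.ones_like(xs)], axis=1).astype(np.float64)
    return A @ coeffs.T

# Example: fit a plane to a synthetic gradient patch and check the reconstruction.
ys, xs = np.mgrid[0:16, 0:16]
xs, ys = xs.ravel().astype(float), ys.ravel().astype(float)
colors = np.stack([10 + 2 * xs, 50 + 1 * ys, np.full_like(xs, 128)], axis=1)
coeffs = fit_color_plane(xs, ys, colors)
recon = eval_color_plane(coeffs, xs, ys)  # should match `colors` almost exactly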
>>41592155
>- It's at least theoretically possible to precalculate average speaker timbre information from auxiliary reference audio, since multiple audios can be averaged together. Whether it's actually useful is another question entirely.
Can multiple reference audios somehow be used for changing emotions or pacing over time? Maybe with a weighted average, where the weights are interpolated? The weights could be either a vector with elements in the [0, 1] range or barycentric coordinates (sum of elements = 1). So it would basically be a vector-matrix multiplication.
But I haven't tried GPT-SoVITS myself yet, so take this as a possibly useless suggestion.
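Just to spell out the arithmetic of that suggestion (this assumes each reference clip can be reduced to a fixed-size style embedding; whether GPT-SoVITS behaves sensibly when fed a blended embedding is exactly the open question):

import numpy as np

def blend_reference_embeddings(embeddings, weights):
    # Weighted average of per-reference style embeddings: embeddings is (num_refs, dim),
    # weights is (num_refs,) non-negative barycentric weights. Returns one (dim,) blended
    # embedding - just the vector-matrix multiplication described above.
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()  # enforce sum-to-1
    return w @ np.asarray(embeddings, dtype=np.float64)

# Example: crossfade from reference 0 ("calm") to reference 1 ("excited") over 5 steps.
refs = np.random.rand(2, 512)  # placeholder embeddings
blends = [blend_reference_embeddings(refs, [1.0 - t, t]) for t in np.linspace(0.0, 1.0, 5)]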
>page 10
>>41666591
Bumping this to the anchor post >>41572561 since this seems pretty interesting/important.
>>41674113I think no because the sequence dimension gets squished
>>41675268We've had it for some time in the gdrive, and we helped improve it in a few cases where we had higher-quality versions of songs. Narokath was pretty quick to respond even years after the previous update.https://drive.google.com/drive/folders/1OMwYKv7fbA5bZAS1BUwNDryMzjOpRKhQ
>>41667791
Can you upload/link the games that contain the first two? I found the extra Blank Flanks Forever instrumental & vox.
>>41667791I found the extra Better Together instrumentals as well.
>>41671962
>>41673118
NTAs, but did you find any plugins/add-ons that let you limit the colour output of the generated images so they look more like vector screencaps?
>>41675589
https://www.hasbro.com/common/assets/html5/mylittlepony/core_games/FinalGames_061614/ff_Jul28/audio/game/music.m4a
https://www.hasbro.com/common/assets/html5/mylittlepony/core_games/FinalGames_061614/ppp/audio/pinkie-pie-theme.m4a
>>41675599
The Better Together opening instrumental is already in the collection. Did you mean something else? Better Together with vocals lacks a flac version and instead has only mp3 - that is what I was trying to say.
Shitposting with Celestia is always fun
https://voca.ro/1jvfMpMcGH0C
>>41577843>RVC (using a retrieval ratio of 0.75)What are your settings to do that?
>>41676972That's the setting
>>41676569
Where can I download good audio references for gpt-sovits (mane6)?
>>41678509
You can use dialogue lines from the Master File
mega.nz/folder/jkwimSTa#_xk0VnR30C8Ljsy4RCGSig
To my knowledge nobody has made a definitive list of "good" lines. There are also these https://github.com/effusiveperiscope/GPT-SoVITS/tree/standalone_gui/ref_audios which I bundle with the GUI by default.
So just as a general post, what are your guys' feelings on the state of AI voices right now? After discovering and finetuning GPT-SoVITS, does it seem like a major step up to you? What do you love and/or hate about it?
>>41678708
It's a 110% upgrade from TalkNet TTS; however, trying to find a relevant reference clip to get the exact tone is still a bit painfully behind what grok/15.ai offered, where all that stuff was handled by the emoji controller.
Maybe I'm not being lucky, but while it seems training a sub-30s voice model is possible, the results are still not on the level of "yep, that's how I imagined this character would talk". It's interesting to see that it's able to get halfway there, though.
I may now actually be able to contribute something to the anti colab, since I wouldn't be limited to just badly redubbing my own voice with haysay.
>>41678509
For "good" lines, you can just exclude any in the master file tagged as noisy.
>>41678708
I've been messing with it a fair bit recently for a small-scale voice project, and I have been able to get to a quality level I'm happy with, though it's hit and miss like with all AI. My feeling so far is that GPT-SoVITS is good for general speech and often has good resemblance to the characters; however, getting exactly what I want from the emotional delivery is still somewhat difficult at times, and there have been a few pronunciation issues. Being able to do TTS again rather than voice conversion is a huge factor for those with non-American accents, though that comes with the caveat that I now need to search the master file for a good reference line for everything I want to generate. I think that someone who can already do a decent voice impression of the target character would still be better served by so-vits/RVC voice conversion; GPT-SoVITS is a suitable alternative for everyone else.
It's a significant step forward overall and a credit to all involved that we've been able to get this far with voice AI. People from four years ago would absolutely flip their shit if they could hear what we have now. Long may the development continue.
>>41678708
Love the quality and easy training.
Hate that you need reference audio (used to it by now from using SoVITS, but I'd still like to generate without references).
>>41678708
I'm sure there are still advancements to be made, but I'm content with what we have now. SVS and RVC give great results with enough wrangling.
For GPT-S, I'm happy it exists, and it seems to work decently, but I have found it difficult to justify using it for anything beyond testing just yet. Still, for a CPU-based text-to-speech model it's really impressive.
https://files.catbox.moe/4y2iof.mp3
>>41678708
It's really good, and the dev said he'll release a v3 base model trained on more hours of audio soon. There are a few bugs in the code too, but after fixing them it's better than ever. Postprocessing with an RVC pass also seems to slightly improve the end result (if you need that extra quality).
>>41678708It's alright but it could be better. so-vits-svc 5.0 is basically peak for me in terms of normal singing voices, not sure how much room there is for improvement there. OTOH I'm still not really satisfied by any of the options for synthesizing speaking voices. GPT-SoVITS as others mentioned is probably useful for projects needing a fully automatable TTS, but if you want a specific timbre or delivery it still falls short. I think I've been spoiled in that dimension by SVC options.
Maybe you could avoid having to provide reference audio with GPT-SoVITS by training a text -> reference embedding model, like the prior in DALLE-2.
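As a rough illustration of that idea (purely a sketch: the dimensions are made up, and there is no claim here about what GPT-SoVITS actually exposes), you'd train a small regressor that maps a text/sentence embedding to the style embedding normally derived from reference audio, then use its prediction in place of a real reference:

import torch
import torch.nn as nn

class TextToStylePrior(nn.Module):
    # Toy prior: sentence embedding (e.g. 768-d) -> predicted reference/style embedding (e.g. 512-d).
    def __init__(self, text_dim=768, style_dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, style_dim),
        )

    def forward(self, text_emb):
        return self.net(text_emb)

# Training step sketch: pairs of (text embedding of a line, style embedding computed from
# that line's actual audio). Both sides are assumed to be precomputed elsewhere.
model = TextToStylePrior()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()
text_emb = torch.randn(32, 768)    # placeholder batch of text embeddings
style_emb = torch.randn(32, 512)   # placeholder matching style embeddings
opt.zero_grad()
loss = loss_fn(model(text_emb), style_emb)
loss.backward()
opt.step()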
>>41679657You are reinventing RAG. And yes, you can.
Mares are good
>>41680819MARES - MARES Autistic Rage Enhanced Sounds
>>41680819Mares are the best.
>>41676909
Thank you. The leaks have more stems that, e.g., isolate the guitar. I'm not sure if Narokath would want those added to the collection, but the files are available.
I'll ping him on the Google Doc and elsewhere to see if he wants to update his collection. Otherwise I'll just add the files to my clone.
>https://www.youtube.com/watch?v=fj-Ipgw9kl8
>https://d1qx31qr3h6wln.cloudfront.net/publications/FUGATTO.pdf
>2 meme paper - Fugatto audio model
tldr: Nvidia is also making their own audio AI model, a Swiss-army knife of audio models controlled by text + audio reference (tts, audio effects, music, converting midi audio to other instruments, adding/removing sounds from a source reference).
Right now there is no actual testing available to the populace, so we just have to trust that the cherry-picked examples are a true representation of what the model can do. However, the idea of converting one version of instrumentals into something else while keeping the same speed and beats is VERY interesting to me, since udio/suno loves to add shitty modern pop beats in the background of supposedly 70s-inspired songs, so maybe this would be a nice way to fix them (and also be able to fucking play normal fucking songs without YT copyright spazzing out every five seconds).
Do you think the ponies have the ability to copy voices? Perhaps with magic? How common do you think it would be and what do you think they would use it for?
>>41681915>Perhaps with magic?This is confirmed. Coloratura let a Unicorn cast a spell on her to pitch her own voice in a live performance.
>>41679657The previous grok (and I think 15ai) used emoji embeddings to link text and audio embeddings. That might work for finding relevant reference audio files too.
>>41681915with transformation magic like Poison Joke and the breezies spell I am sure ponies could make some voice changing magic too.
>>41683411>teaching machines how to boopThat's bold.
>>41683411
>>41676909
>>41681376
I got a response from Narokath and sent him the files. I'm pretty sure he'll add the Babs Seed, Crystal Ponies, and Blank Flank ones. Response pending on whether he'll add the new Better Together stems too.
All of the new files are here: https://drive.google.com/drive/folders/1loPFrwJMMsHe2VNzQ9u3gtZmCfdtmsvR
The Better Together ones are disorganized. I'll try organizing them if Narokath decides to include them.
>>41656312
Horsona updates:
- [Done] OpenAI-compatible interface for custom modules. I tested it with SillyTavern, and it works as expected.
- ... Sample: https://github.com/synthbot-anon/horsona/tree/main/samples/llm_endpoint
- ... ... Caveats: Indexing files is slow and requires a large number of LLM calls. It treats every inference as if it's part of the same conversation (so its memory will leak between conversations). The memory module I'm using to retrieve backstory information isn't made to work with stories, so its memory will be flaky. I'll be working on that after I'm done with causal reasoning (below).
- ... Code: https://github.com/synthbot-anon/horsona/tree/main/src/horsona/interface/oai
- ... Tests: https://github.com/synthbot-anon/horsona/blob/main/tests/interfaces/test_oai_api.py
- [Done] I cleaned up a lot of the LLM handling code and added support for streaming results from LLMs.
- [In progress] Adding explicit causal reasoning to LLMs.
- ... [Done] I ported the relevant code over from the DoWhy library so I could clean it up and add a better interface.
- ... [Done] I added support for doing causal reasoning with LLMs so it can deal with natural language data. Previously it only supported numerical data.
- ... [In progress] I have a lot of cleanup to do to complete support for natural-language-based causal reasoning.
- ... [ ] After that, I'll need to wrap everything in modules so they plug in nicely with the rest of the framework.
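For anyone who wants to try the endpoint idea outside SillyTavern, a minimal client sketch is below. It assumes the sample server is running locally; the URL, port, and model name are placeholders (check the sample's README for the real values), and the only thing shown is that any OpenAI-compatible client can talk to such an endpoint.

from openai import OpenAI

# Point an OpenAI-compatible client at the custom module's endpoint instead of
# OpenAI/Ollama/Anthropic. base_url and model are placeholders, not values from the repo.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="horsona-module",
    messages=[{"role": "user", "content": "Say hi like Twilight Sparkle."}],
)
print(resp.choices[0].message.content)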
Happy Thanksgiving, everypony! I appreciate the dedication from everyone here, shitposters included. Every update gives me more hope that I'll one day have my waifu.
Trying to follow the haysay_ui installation instructions on Windows.
limited_user_migration-1 | chown: invalid user: ‘luna:luna’
limited_user_migration-1 exited with code 1
I have refactored the GPT-SoVITS code to allow precomputed values to be passed in:
https://github.com/hydrusbeta/GPT-SoVITS
In that fork, I included a script (pony_precomputer/precomputer.py) for precomputing values for all the master files. It also contains sample code showing how you can use a set of precomputed values to generate audio. It's not terribly useful on its own for now, but perhaps with a text -> embeddings model, as others have suggested, or some other mechanism for selecting a set of precomputed values, we could use it for audio generation without the need for providing reference audio.
I was able to avoid storing the entire spectrogram of the reference audio by precomputing the first step of the decode method, which passes the spectrograms through the code's "MelStyleEncoder" neural network. The result is a relatively small array (512 floats). When I ran my script on all the Sliced Dialog master files from s1-s9 + Rainbow Roadtrip, it generated only 35MB of precomputed data total for all of the files.
Clipper, as part of this effort, I wrote a parser for the master files and found one tiny mistake. The file "00_15_24_Chief Thunderhooves___Our stampede will start at high noon tomorrow..flac" in s1e21 is missing the emotion tag. It should either be "Annoyed" or maybe "Angry". Otherwise, I detect no other issues in any of the file names.
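One simple interim mechanism for "selecting a set of precomputed values" (this is just an outside illustration, not the fork's sample code) would be nearest-neighbour lookup over those 512-float MelStyleEncoder vectors: take the style vector of a known-good line as the query and grab the closest stored entry.

import numpy as np

def pick_reference(query_vec, cached_vecs):
    # Return the index of the cached 512-d style vector most similar (cosine) to query_vec.
    # cached_vecs: (num_files, 512) array of precomputed MelStyleEncoder outputs.
    q = query_vec / np.linalg.norm(query_vec)
    c = cached_vecs / np.linalg.norm(cached_vecs, axis=1, keepdims=True)
    return int(np.argmax(c @ q))

# Example: use the style vector of one known-good line as the query and fetch the
# closest match from the whole precomputed set.
cache = np.random.rand(1000, 512).astype(np.float32)  # placeholder for the precomputed data
best = pick_reference(cache[42], cache)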
Cautionary up.
>>41687661
10
>>41688321Indeed.
>>41689050
>>41687189I just needed to delete all the old images, missed those when I was deleting the old volumes and containers
Someone on lmg made a Firefox plugin for right-click reading text from a SoVITS API backend. It might be useful for casual custom narration of random things.
>>41690022Post link?
>>41690053https://addons.mozilla.org/en-US/firefox/addon/sovits-screen-reader/
>>41690053>>>/g/103341565
Sorry if this is the wrong thread but it's the only AI voice related thread I know. How does one go about making those AI generated songs that emulate somepony's voice? I wrote a parody of Gaston's song (from Beauty and the Beast) using Rainbow instead of Gaston, and I'd like to see if I can have Scootaloo and Dash sing it. Is there even enough training data to emulate Scoot's voice accurately?
>>41690198
>How does one go about making those AI generated songs that emulate somepony's voice?
Usually the workflow is
>Generate a song with Suno.ai or Udio.ai
>Separate vocals from song with something like Ultimate Vocal Remover
>Run separated voice through pony voice AI and recombine with the instrumental
>>41690233
Nah, not generating new songs from scratch. I mean those people who have taken existing songs and redone them with someone else's voice. I've heard quite a few around.
>>41690252
Then skip step 1, the rest still stands. With real songs you can sometimes find stems online with vocals and instruments already separated.
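For the final "recombine" step of the workflow above, a minimal sketch (assuming the converted vocal stem and the instrumental were exported at the same sample rate and channel count; file names are placeholders):

import numpy as np
import soundfile as sf

# Recombine a converted vocal stem with the original instrumental.
vocals, sr = sf.read("converted_vocals.wav")
inst, sr2 = sf.read("instrumental.wav")
assert sr == sr2, "stems must share a sample rate"
n = min(len(vocals), len(inst))
mix = vocals[:n] + inst[:n]
mix /= max(1.0, np.abs(mix).max())  # normalize to avoid clipping
sf.write("pony_cover.wav", mix, sr)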
Bump.
>>41689623Sorry I never got around to replying to you. Glad to hear you figured out a solution!
If I understand correctly, GPT-SoVITS uses PyTorch. I've found a few mentions of an OpenCL backend for PyTorch being developed: https://dev-discuss.pytorch.org/t/opencl-backend-important-updates/845/13
But in general it is slower than the vendors' libraries.
What do ponies here use for inference? CPU? GPU? NPU? Which one?
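For reference, stock PyTorch code typically just picks between CUDA, MPS, and CPU; something like the plain check below is all that decides the device, so OpenCL would only enter the picture via a third-party backend (this is generic PyTorch, not anything GPT-SoVITS-specific):

import torch

# Standard device selection; OpenCL is not a built-in PyTorch backend, so without a
# third-party plugin the realistic options are CUDA, MPS (Apple), or CPU.
if torch.cuda.is_available():
    device = "cuda"
elif getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
print(f"Inference device: {device}")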
>>41690198
For a parody like that without re-singing it, I would use Synthesizer V to recreate the vocals using its synthetic singing voices, then feed those outputs into whichever AI pony voice conversion I feel is best suited, probably RVC. There were previously many sources for the basic version of SynthV here: (https://resource.dreamtonics.com/download/English/) but I guess they've since removed it to try and bump sales of the paid version, which is still kinda worth it as it has some great features in the main one:
>Can read separated vocals to attempt replication and even tries to match vocal qualities to edit afterwards
>AI retake to vary up how pronunciations and deliveries are done
>More voices available to it, though most are paid models
>Force English for most other non-native voices
>No limits to track numbers; good harmony potential
In any case, the free limited version (dubbed "basic") is still very much useful, and a basic beta version can be found at the earlier linked location, obtained from their website (https://dreamtonics.com/download-free-trials/)
As an idea of its capabilities, here are some tracks I used SynthV + RVC for, some still WIP:
>ANRI voice output of Danger Zone - https://files.catbox.moe/12is7e.mp3
>Final Fluttershy Danger Zone (of section above) - https://files.catbox.moe/12is7e.mp3
>Starlight singing a Pendulum Watercolour section - https://files.catbox.moe/xpb2qa.mp3
>Fluttershy singing the same as above - https://files.catbox.moe/4kel93.mp3
>NMM Parody I threw together just now (Rhythm of the Night -> Eternal Night) - https://files.catbox.moe/hz1cex.mp3
>>41691421I'm looking through the quick start guide in OP and it looks like so-vits-svc would be better suited to this task than RVC, but you recommend RVC?
>>41691421>>41691545Oh wait, so-vits-svc wouldn't allow me to change the words, would it?
>>41691552
The only direct voice conversion tool I recall being able to change the lyrics for was TalkNet, which is quite old and not all that reliable when changing the words. That's why I consider SynthV the better pass for changing the lyrics, pitch, timbre and/or delivery prior to ponification with the newer formats.
>>41691563Where are instructions for SynthV? I don't see it mentioned in the quick guide
>>41691565
>SynthV
That's not an AI tool, it's a bootleg Vocaloid.
>>41692024
>>41572862cute numget pat pat
>>41691565
You can easily find tutorials online for it. I intend to make one for here and include my method for ponification with its outputs, though that shouldn't differ much from existing documentation. Retail work and moving residence are occupying a lot of my spare time, so it won't be made for a while.
>>41693857Yes.
>>41696589That's an excessive amount of scrunch.
>>41687414
>missing the emotion tag
Fixed, thanks.
>>41684805
fyi, for your mirror
>>41690059How do I use this?Pretty please
>>41698386
My settings. I've been messing with this and nothing works.
I changed the port from the base UI to the text inference UI (it is running ofc), and changed the paths from the shorter ones (with the GPT-SoVITS folder as root) to the full paths. Nothing works. Pretty please help.
>>41698421Baguette
>>41698335Will update after a few days.
>>41681376
Maybe those stems can be used for training an instrumental extractor AI, if we ever get enough data and POWARR to train one. But it's unlikely we will get both.