Welcome to the Pony Voice Preservation Project!
youtu.be/730zGRwbQuE

The Pony Preservation Project is a collaborative effort by /mlp/ to build and curate pony datasets for as many applications in AI as possible.

Technology has progressed such that a trained neural network can generate convincing voice clips, drawings and text for any person or character using existing audio recordings, artwork and fanfics as a reference. As you can surely imagine, AI pony voices, drawings and text have endless applications for pony content creation.

AI is incredibly versatile; basically anything that can be boiled down to a simple dataset can be used for training to create more of it. AI-generated images, fanfics, wAIfu chatbots and even animation are possible, and are being worked on here.

Any anon is free to join, and there are many active tasks that would suit any level of technical expertise. If you're interested in helping out, take a look at the quick start guide linked below and ask in the thread for any further detail you need.

EQG and G5 are not welcome.

>Quick start guide:
docs.google.com/document/d/1PDkSrKKiHzzpUTKzBldZeKngvjeBUjyTtGCOv2GWwa0/edit
Introduction to the PPP, links to text-to-speech tools, and how (You) can help with active tasks.

>The main Doc:
docs.google.com/document/d/1y1pfS0LCrwbbvxdn3ZksH25BKaf0LaO13uYppxIQnac/edit
An in-depth repository of tutorials, resources and archives.

>Active tasks:
Research into animation AI
Research into pony image generation

>Latest developments:
ponepaste.org/10569

>The PoneAI drive, an archive for AI pony voice content:
drive.google.com/drive/folders/1E21zJQWC5XVQWy2mt42bUiJ_XbqTJXCp

>Clipper's Master Files, the central location for MLP voice data:
mega.nz/folder/jkwimSTa#_xk0VnR30C8Ljsy4RCGSig
mega.nz/folder/gVYUEZrI#6dQHH3P2cFYWm3UkQveHxQ
drive.google.com/drive/folders/1MuM9Nb_LwnVxInIPFNvzD_hv3zOZhpwx

>Cool, where is the discord/forum/whatever unifying place for this project?
You're looking at it.

Last Thread:
>>41571795
FAQs:
If your question isn't listed here, take a look in the quick start guide and main doc to see if it's already answered there. Use the tabs on the left for easy navigation.
Quick: docs.google.com/document/d/1PDkSrKKiHzzpUTKzBldZeKngvjeBUjyTtGCOv2GWwa0/edit
Main: docs.google.com/document/d/1y1pfS0LCrwbbvxdn3ZksH25BKaf0LaO13uYppxIQnac/edit

>Where can I find the AI text-to-speech tools and how do I use them?
A list of TTS tools: docs.google.com/document/d/1y1pfS0LCrwbbvxdn3ZksH25BKaf0LaO13uYppxIQnac/edit#heading=h.yuhl8zjiwmwq
How to get the best out of them: docs.google.com/document/d/1y1pfS0LCrwbbvxdn3ZksH25BKaf0LaO13uYppxIQnac/edit#heading=h.mnnpknmj1hcy

>Where can I find content made with the voice AI?
In the PoneAI drive: drive.google.com/drive/folders/1E21zJQWC5XVQWy2mt42bUiJ_XbqTJXCp
And the PPP Mega Compilation: docs.google.com/spreadsheets/d/1T2TE3OBs681Vphfas7Jgi5rvugdH6wnXVtUVYiZyJF8/edit

>I want to know more about the PPP, but I can't be arsed to read the doc.
See the live PPP panel shows presented at /mlp/con for a more condensed overview.
2020 pony.tube/w/5fUkuT3245pL8ZoWXUnXJ4
2021 pony.tube/w/a5yfTV4Ynq7tRveZH7AA8f
2022 pony.tube/w/mV3xgbdtrXqjoPAwEXZCw5
2023 pony.tube/w/fVZShksjBbu6uT51DtvWWz

>How can I help with the PPP?
Build datasets, train AIs, and use the AI to make more pony content. Take a look at the quick start guide for current active tasks, or start your own in the thread if you have an idea. There's always more data to collect and more AIs to train.

>Did you know that such and such voiced this other thing that could be used for voice data?
It is best to keep to official audio only unless there is very little of it available. If you know of a good source of audio for characters with few (or just fewer) lines, please post it in the thread. 5.1 is generally required unless you have a source already clean of background noise. Preferably post a sample or link. The easier you make it, the more likely it will be done.

>What about fan-imitations of official voices?
No.

>Will you guys be doing a [insert language here] version of the AI?
Probably not, but you're welcome to. You can however get most of the way there by using phonetic transcriptions of other languages as input for the AI.

>What about [insert OC here]'s voice?
It is often quite difficult to find good quality audio data for OCs. If you happen to know any, post them in the thread and we'll take a look.

>I have an idea!
Great. Post it in the thread and we'll discuss it.

>Do you have a Code of Conduct?
Of course: 15.ai/code

>Is this project open source? Who is in charge of this?
pony.tube/w/mqJyvdgrpbWgZduz2cs1Cm

PPP Redubs:
pony.tube/w/p/aR2dpAFn5KhnqPYiRxFQ97

Stream Premieres:
pony.tube/w/6cKnjJEZSCi3gsvrbATXnC
pony.tube/w/oNeBFMPiQKh93ePqTz1ns8
>>41706417
Anchor.
>>41691563
Am I gonna need the pro version of SynthV to do this? >>41690198
If so, any way around that whopping $90 price tag?
>>41706747
So I made the first few lines in SynthV and I can already tell this isn't going to work. There's absolutely no way I'm going to get the timing down anywhere close enough to line up with a karaoke track of the song. I can have very tidy, robotic timing, but the song has all sorts of fermatas and tempo variations that just don't play nicely with the hard timing of a MIDI-like note generator. The only way this is going to work is if there exists an AI tool that can take the existing audio track and imitate the melody and timing itself. I've seen people turn existing songs into the same song but sung by a pony, so clearly THAT is possible, but does there exist a similar tool which also allows you to change the words but keep the same pitch and phrasing?
>>41706831
Not unless you sing the song yourself.
>>41706840
...That's not completely out of the question if AI could take my voice and turn it into a pony's.
Slow start.
Just shooting a general question here: does anybody know of a program/GitHub project that can take 22050Hz audio and bump it up to 48000Hz, with AI predictions filling in the missing high-frequency content?
>>41707278
https://audioldm.github.io/audiosr/
https://github.com/haoheliu/versatile_audio_super_resolution/
>>41706747
>need the pro version of SynthV?
Nope, most of my covers that have used SynthV were done via the free/basic version. You are limited to 3 channels/tracks, but you only really need that many for simple covers.
>>41706831
If you note pic related, there's a 3-digit number next to 4/4; this is your tempo. You'll need to adjust this to suit the BPM of your desired song, which can be found via a quick lookup. You can also right-click anywhere on that line to create a marker that will change the tempo once the song reaches that point. Above that, there's also a snap amount in the piano roll (1/8, quarter, etc.) that you can adjust to be smaller than the default. Pressing Alt+Ctrl while dragging the note start/end ignores snapping. This method may still work for you, but there's a bit of a learning curve involved.
>>41707571
I understand how the program works, but fermatas and various smooth transitions between tempos are extraordinarily difficult to work with in a system like that. Yeah, you can make an arbitrary tempo transition, even a smooth one, but good luck trying to time that to an existing piece of music without being off on your timing. I used to work with Logic and FL Studio a lot, so it's not like I'm new to this process, just rusty.
>>41707278
What do you mean by un-cropped? Regular resampling can be done with ffmpeg, e.g. the one-liner below. Or do you want to somehow extend the spectrum of the audio?
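For reference, a plain resample just changes the sample rate; it adds no spectral content above the original ~11kHz Nyquist limit (file names here are placeholders):

ffmpeg -i input_22050.wav -ar 48000 output_48000.wav

Anything above that frequency ceiling has to be hallucinated by a model like the AudioSR one linked above.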
>>41708621
The thing from that anon's first link: to add back the information that got cropped out of the low-quality audio.
Late night bump.
>>41709527
later night bump
>>41710355
G4 instrumental album might be useful for isolating vocals?
>>41710492
Most of the G4 instrumentals and vocals were leaked a long time ago. See the 2019 leak. It even has vocals at different stages of sound processing.
>>41710706
Except at least Glass of Water. I guess it is likely to help extract better vocals for some songs.
>>41710492
>>41710706
>>41710711
True, True Friend is not what is in the show
>https://files.catbox.moe/tvkahf.mp3
>Rags To Riches - Tony Bennett - cover with Vinyl Scratch
I was thinking of using the Rarara voice, but then again I feel like VS isn't used often enough. Some of the instrumental and vocal segments did not separate correctly, causing the RVC to derp out. I did my best trying to correct a few words with the steps GPT-SoVITS -> TalkNet -> RVC.
https://x.com/fifteenai/status/1865439846744871044
>The past and future of 15.ai
>The plan was always to make the backend open source when the time was right — sadly, much before that time could come, I got hit with a notice saying that I couldn’t do that at all
What a bizarre lie
>>41711504
>The plan was always to make the backend open source
Ah yes, that's why he was as vague and secretive as possible for years since the inception of the project and didn't open-source a single thing.
> 2 more weeks
>>41710938
I wonder if she's seldom used because of her lack of voice in official media. Anyway, here are the two song covers I did using her.
https://files.catbox.moe/64h78h.mp3
https://files.catbox.moe/3tux3i.mp3
Dumping full Twitter post since there's no other way to see it (if this retarded site will let me)
>The past and future of 15.ai
>I’ve been meaning to write this for a long time, but I’ve never been good at writing things on social media. I know it’s been a while since I posted anything, but I want to reflect on how 15.ai came to be, share some of my thoughts, and talk about the future of the project.
>The idea of 15.ai started as early as 2016 when I stumbled upon a paper written by DeepMind called “WaveNet: A Generative Model for Raw Audio” as an undergrad at MIT – I was 18 years old. That paper lit a fire under me. I didn’t just want to learn about AI voice generation – I wanted to push it further, to see what was really possible. I dove headfirst into it, fully convinced I could refine this technology and explore its potential like no one else had.
>For three years, I worked on this project alongside my undergraduate studies, and in 2017, the famous Tacotron2 paper was released. In 2019, I gave a lecture and presentation about my findings, as I was able to replicate the results of WaveNet and Tacotron but with about 25% of the data they claimed was necessary (shoutouts to Dr. Edelman at MIT, he’s a great guy and none of this would have happened without him). I had originally planned to base my PhD dissertation on this work and bring that percentage down even lower (my extremely audacious prediction at the time was that you only needed 15 seconds of data to replicate a person’s voice; hence, the name 15), but when the startup I was working on with friends was accepted into the Y Combinator incubator on the very day I had to decide, I chose to enter the industry instead.
>Fast forward about a year and a half, I left the startup in 2020 for various reasons. While I had made a good amount of money working in the industry, my exit wasn’t exactly on the best terms. I felt pretty angry that I had given up my dream of pursuing a PhD and becoming a professor for something that ultimately left me feeling unfulfilled. So, I threw myself back into research. I wanted to prove that the ideas I had once planned for my dissertation weren’t just credible – they were groundbreaking. But the grad school application cycle had already passed, and I wasn’t about to wait a year to apply.
>Instead, I decided to take matters into my own hands. The best way to get my work noticed was to show it off. No gatekeeping, no barriers – just a free, accessible tool for anyone to use. I wanted to democratize AI research. I wanted to give people something that didn’t require coding skills or expensive hardware, something they could just use and be amazed by.
>>41711970
>I got to work right away. I hacked together a functional frontend and backend for the website while scouring the Internet for interesting data sources, since well-known speech corpora like LJSpeech were boring. The whole point of the project was to prove that it was possible to replicate speech accurately with as little data as possible. Cloning a monotone voice that enunciates syllables slowly and coherently wasn’t all that impressive; real speech has complex undertones and nuances, and I wanted to capture that challenge.
>That’s when I found the goldmine: My Little Pony: Friendship Is Magic. I was familiar with the show – I had watched it when I was in middle school, but I hadn’t engaged with the fandom in years because of my studies. What truly impressed me was the dedication of the show’s fans from the “Pony Preservation Project”, who had compiled an extensive speech corpus unlike anything I’d seen before. Every single line from the show had been meticulously trimmed, denoised, transcribed, and even emotion-tagged. This was work that no other fandom had ever achieved at the time. (This was 2020 before any of that could be automated – this had all been done by hand.)
>With this newfound data source, I found myself at a turning point. I realized that with this, I could not only push the boundaries of my research but also demonstrate the true potential of what this technology could achieve. I extracted the data from the PPP along with multiple other data sources that I had to manually transcribe (like the voices for GLaDOS, Wheatley, SpongeBob, the Narrator from The Stanley Parable, etc.), trained separate models on the data, and hosted them on the website. The design of the website was intentional – while it was supposed to be very easy to use, I didn’t want my research to go unnoticed. That was why I had included a bunch of relevant numbers, graphs, etc. next to the generated audio files.
>As I added more voices to the website, I realized it was possible to encode all the speakers into a single embedding, which would allow me to train all of the voices simultaneously instead of sequentially, saving me a huge amount of time on research and development. Near the end of 2020, I released a version of the website that added over 50 character voices to the website at once – a huge step up from the 7 or 8 or so I had previously.
>>41711975
>Then, 2021 happened. The website exploded. It was all over Twitter, YouTube, and eventually, news outlets. Before I knew it, I was getting slammed by millions of requests every day. Autoscaling on AWS quickly turned into a nightmare, and as I watched the charges rack up, I realized I was in for a long ride. At its absolute peak, I was charged $12K for a single month (yes, you read that right), which included costs for training, inference, hosting, and everything else needed to keep the site running. But honestly? I was too stubborn to stop. I knew what I was getting into, and as a 23-year-old living alone, it was terrifying – but also kind of thrilling.
>The attention came with offers – job interviews, acquisition proposals, you name it. I turned them all down. In hindsight, maybe not the smartest move, but I didn’t want to monetize the project or turn it into a job. I was afraid that would kill the joy I had for it. I just wanted to build something cool and keep improving it. So, I kept quiet and decided to focus entirely on expanding the list of characters and improving the underlying technology.
>In early 2022, the whole Voiceverse NFT plagiarism thing happened, which pissed me off, but ultimately it didn’t do anything in the long run. So there’s that.
>Then, in the middle of 2022, things started to go wrong. I received multiple complaints of copyright violations, and I received a cease-and-desist letter. I dismissed it as unimportant and chose to disregard it, since, technically, copyright law surrounding generative AI at the time was on my side. But due to certain other details that I can’t share here, I was effectively forced into stopping operations of the website immediately without warning or preparation.
>I wanted to bring back the original website as quickly as possible, but my only option was to pivot to something that steered clear of copyright issues. That was easier said than done. I had built my reputation on doing things differently, on showing that I could take on challenges others wouldn’t touch, and now I was in a position where I had to tread carefully. It was frustrating as hell, but I knew I wasn’t going to let this project die – not when I’d come that far.
>Looking back, I’ll admit I was a bit egotistical during this time. I thought I could handle everything on my own: the scaling issues, the legal headaches, the insane costs. I thought I was untouchable because, honestly, I believed I was doing something no one else could. And maybe I still believe that to some extent, because even now, I’m proud of what I built. But I can also see now that my stubbornness might’ve cost me. Maybe if I’d accepted a few offers or reached out for help, things could’ve been different.
>>41711980
>Even so, I don’t really regret the core decisions I made. I wanted to create something that mattered, something that made people think, “Wow, someone really built this and gave it away for free?” And I like to think that I succeeded. 15.ai wasn’t just a tool; it was proof that cutting-edge technology and AI doesn’t have to be locked behind paywalls or reserved for corporations. It was a challenge to the status quo, and it was also a little bit of me flexing.
>As for what’s next, I’m still figuring that out. The copyright issues, the shutdown – it all sucked, but it didn’t break me. I’m still working, still thinking about how to bring this back in a way that’s better, smarter, and maybe just a bit more sustainable. I have some ideas, but if I’ve learned anything from all this, it’s that nothing goes exactly as planned. So, I’m going to keep pushing, keep experimenting, and keep doing things my way. Because if nothing else, that’s what got me here in the first place.
>Thanks to everyone who stuck around during the highs, the lows, and everything in between. Whether you loved the site, hated it, or just thought it was interesting, you’re part of what made this whole thing worth it.
>- 15
>P.S.: For journalists, researchers, or anyone else with questions about the project, feel free to reach out to me at 15@15.ai. I’m always open to discussing the journey, the tech, or whatever else you’re curious about.
https://x.com/fifteenai/status/1865439846744871044
https://nitter.poast.org/fifteenai/with_replies
>>41711985
This is proof that you can be some genius obsessed with academics, yet still completely and utterly fail if you don't have the balls to do or say anything. He never even vaguely stated the truth at any point - it has taken him multiple years to do even that bare minimum. His work was absolutely impressive, but after things went downhill, all he ever did was his annual report of "it's coming back soon, I swear."
>The plan was always to make the backend open source when the time was right — sadly, much before that time could come, I got hit with a notice saying that I couldn’t do that at all
>I never had a Patreon because I wanted it — people felt like they had some obligation to donate money (even though I wrote that I wanted no handouts) so I asked someone else to make the account for me. I never wanted to make money off this project.
>As for “keeping people in the dark”, as much as I wanted to tell people what happened, I didn’t even know what was going to happen at the time given the volatility of the space. If I said anything, I’d probably open myself up to something else. If you want a refund, let me know.
>Making a Patreon because people told me to was a huge mistake, honestly. It saddens me to see so many people think my project was a scam because I didn’t notify people even though I never wanted to make money to begin with.
Right.
Every time, this guy does something and then abruptly walks away with no explanation. This happened with Mare Fair early on, before the venue incident. Luckily they sorted it out without much trouble, but again, >15. Then later he comes out to cry about how hard he's had it. Please, spare us the theatrics. Either deliver on your promises, or don't. Just say it, man.
>>41711985
The thing is, if he had just released the code to the public, then no company would have been able to put the cat back in the bag. It would still be available.
>>41711975
>That’s when I found the goldmine: My Little Pony: Friendship Is Magic. I was familiar with the show – I had watched it when I was in middle school, but I hadn’t engaged with the fandom in years because of my studies. What truly impressed me was the dedication of the show’s fans from the “Pony Preservation Project”, who had compiled an extensive speech corpus unlike anything I’d seen before. Every single line from the show had been meticulously trimmed, denoised, transcribed, and even emotion-tagged. This was work that no other fandom had ever achieved at the time. (This was 2020 before any of that could be automated – this had all been done by hand.)
Coward.
Ah yes, there was this obscure (by this point) TV show, and I just so happened to randomly stumble across this even more obscure thread in a hated corner of the internet, and they just so happened to have really good manually trimmed sets.
I'm also just baffled by how totally contrary this is to the show's lessons. He refused to get any help, never stood by his alleged principles through open-sourcing his work (you can always do it somehow no matter what threats you got, let's not bullshit ourselves here), and ultimately wanted all of the fame and glory for himself even if he tries to downplay it. This is literally S0 Twilight.
I'm just... disappointed, I guess. I hoped for more from this guy.
>>41712185
People who act like him are either geniuses or fucking retards. The outcome in this case is unfortunate.
>>41711970
Anybody want to update 15's Wikipedia article citing these tweets?
So did anyone do an auto pony image tagger yet?
>>41713908
Last year I was trying out some of the auto1111 addons that did that, but from what I remember, sadly most of them were a bit sucky in the quality of the descriptions they produced.
>>41712487
Wasn't that article marked for deletion?
I wonder: can a music-lyrics separator, voice recognition and TTS all be trained at once on the instrumental+vocal tracks from the leak? And how would it perform?
>>41713908
Not a good one to my knowledge. Using the Pony Diffusion encoder + kmeans might be an easy way to do it.
I would like to raise a toast to keeping the HaySay AI alive, so that even I, with my potato PC, can make pony voice stuff.
>>41711970
His project will be buried with him. Good riddance.
>>41687414
Is there some generation speed-up when using precomputed values compared to passing the reference audio and reference text?
It's sad to see that you're all angry, jealous and bitter about 15. Just because he's better than all of you put together.
>>41717165
Should I put that on your epitaph, 15?
>>41717165
It's... not that we're jealous, we're just... kinda exhausted. Exhausted of him making promise after promise and not being straightforward with us. Is he talented? Absolutely! But his hubris is holding him back.
15.ai still had the closest inflection to the show, and the best controllability, of any TTS model. The project needs to return in some form or we will definitely be taking a big step backwards in those respects.
>>41717393
Nah, sovits is good enough
>>41717393
>will
It seems you missed the news, but 15's TTS has been unavailable for a very long time now. In fact, it has been so long that we now have an open source competitor on the same level, minus some control.
so does this actually work? can i get an ai of my waifu?
>>41717159
I haven't run any benchmark or anything, but based on the limited manual testing I've done so far, there is negligible impact on the generation speed.
>>41717165
Holy missed point, Batman.
>>41718749
SillyTavern has had some TTS options for a while, and even a module for speech-to-text, so yeah, you can chat with your waifu if your GPU is chunky enough to run all these tools together.
>>41711970
>my extremely audacious prediction at the time was that you only needed 15 seconds of data to replicate a person’s voice; hence, the name 15
I thought it was a reference to GR15.
So 15.ai is dead. At least we finished playing catch-up.
>>41719881
Really would have been easier if we had certainty months ago.
ElevenLabs won, unfortunately. They're also about to release a singing/music AI that might make Udio look like a baby.
Time to put ElevenLabs to good use, cause using your local machine to generate good audio is a pipedream unless you got 30k-90k to spare.
>>41722084
GPT-SoVITS works on my machine just fine
>>41722084lollmaonigger
>>41722084
>Time to put ElevenLabs to good use
Lol. Roflmao, even.
>page 10
This is sad. I offer you guys a project to work on so you stay afloat... Redub Tamers' videos. I'll even pick it for you, cause obviously you'll bitch about which video to pick.
https://www.youtube.com/watch?v=UubZA0fNiYg&ab_channel=Tamers12345
>>41722986
>Tamers
Inb4 drama.
>>41723024
+1, agreed. It's not really PPP related, and there's still the whole of FiM season 1 of the redub to be done.
>>41723036
>filename
Uh, what?
>>41722400
Okay, what's the beef with ElevenLabs? I don't like that it's not open source, but is that the reason everyone hates it here?
>>41723226
ElevenLabs are grifters that steal other people's code and turn it into a paid service
>>41723415
And yet they btfo everyone else?
>>41723451
>>41723451
okay okay, but seriously, why the fuck are you here if you want to use ElevenLabs?
>>41712185
>I'm also just baffled by how totally contrary this is to the show's lessons. He refused to get any help, never stood by his alleged principles through open sourcing his work (you can always do it somehow no matter what threats you got, let's not bullshit ourselves here), and ultimately wanted all of the fame and glory for himself even if he tries to downplay it. This is literally S0 Twilight.
Funny you say that when he's a Twifag.
Didn't he say he's never seen an episode? Or has that changed?
I'd also like to know how many board projects were killed by ego. This is getting out of hand.
Not sure where to ask this but this seems like a good place. There's a project I'm working on and I'll need archive scrapes of /mlp/. Are there any I can download, even if they aren't up-to-date by over a year? Or will I have to start over?
>>41723830
I swear I remember someone talking about it in the threads... in like 2022 or something like that? Hopefully one of the codefags will know more about this.
>>41723415
Other people's code? As in... there was tech out there that could've been open source but ElevenLabs got to it first? Or how does it work?
anyone member cookie? I member.
>>41711970
>>39874017
I wasn't sure whether to be upset or understanding. Probably upset, since nobody would have been able to stop this if he had just made it open-source instead of protecting his ego by insisting he gets all the credit. Now it's kill. And we will never have anything like it again. I don't know how well the claim of C&Ds killing the project holds up, but I heard pony voices aren't on uber*uck anymore, supposedly because of the same legal issues 15 claims to have, though I refuse to make an account to verify this.
I see a lot of disappointment expressed towards ignoring the classic /mlp/ advice of ignoring C&Ds, like nothing will ever happen. Perhaps composing these posts is why 15 won't be going to Mare Fair next year. I know /mlp/ has a tendency to just turn on people once they stop being le based (although Imalou got what was coming to her), so that's probably all he can do. If there's anything to learn from this, never let your ego run the show.
>>41723679
>Didn't he say he's never seen an episode?
>I had watched it when I was in middle school
nice reading comprehension, retard
>>41726107
Wait, what happened to Imalou?
>>41726334
caved into pressure from twitter and denounced /mlp/
>>41726334
An anon fucked her so silly she turned bi
any update on gpt-so-vits coming to haysay.ai?
Does anyone know if it's possible to make a Twilight AI cover of this song? If so, could someone try?
https://files.catbox.moe/o60yol.mp3
Zero-Shot Mono-to-Binaural Speech Synthesis
https://arxiv.org/abs/2412.08356
>We present ZeroBAS, a neural method to synthesize binaural audio from monaural audio recordings and positional information without training on any binaural data. To our knowledge, this is the first published zero-shot neural approach to mono-to-binaural audio synthesis. Specifically, we show that a parameter-free geometric time warping and amplitude scaling based on source location suffices to get an initial binaural synthesis that can be refined by iteratively applying a pretrained denoising vocoder. Furthermore, we find this leads to generalization across room conditions, which we measure by introducing a new dataset, TUT Mono-to-Binaural, to evaluate state-of-the-art monaural-to-binaural synthesis methods on unseen conditions. Our zero-shot method is perceptually on-par with the performance of supervised methods on the standard mono-to-binaural dataset, and even surpasses them on our out-of-distribution TUT Mono-to-Binaural dataset. Our results highlight the potential of pretrained generative audio models and zero-shot learning to unlock robust binaural audio synthesis.
https://github.com/google-research/google-research
Might be posted here. Downstream will augment AR and VR experiences.
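Not the paper's code, but the parameter-free first stage it describes (geometric time warping + amplitude scaling) is simple enough to sketch from basic acoustics. A toy version in Python/numpy, assuming ears 9cm either side of the origin and free-field 1/r attenuation:

import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def naive_binauralize(mono, sr, src_pos, ear_offset=0.09):
    # One delayed, scaled copy of the mono signal per ear:
    # the delay encodes the interaural time difference, the gain the level difference.
    mono = np.asarray(mono, dtype=float)
    ears = np.array([[-ear_offset, 0.0, 0.0], [ear_offset, 0.0, 0.0]])
    out = np.zeros((2, len(mono)))
    for ch, ear in enumerate(ears):
        dist = np.linalg.norm(np.asarray(src_pos, dtype=float) - ear)
        delay = int(round(dist / SPEED_OF_SOUND * sr))  # delay in samples
        gain = 1.0 / max(dist, 1e-3)                    # 1/r amplitude falloff
        out[ch, delay:] = gain * mono[:len(mono) - delay]
    return out

# e.g. a source 1 m away, 45 degrees to the right:
# stereo = naive_binauralize(mono, 48000, (0.7, 0.7, 0.0))

The paper's actual contribution is refining this crude estimate by iteratively running it through a pretrained denoising vocoder; the geometric warp alone sounds quite artificial.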
LatentSpeech: Latent Diffusion for Text-To-Speech Generation
https://arxiv.org/abs/2412.08117
>Diffusion-based Generative AI gains significant attention for its superior performance over other generative techniques like Generative Adversarial Networks and Variational Autoencoders. While it has achieved notable advancements in fields such as computer vision and natural language processing, their application in speech generation remains under-explored. Mainstream Text-to-Speech systems primarily map outputs to Mel-Spectrograms in the spectral space, leading to high computational loads due to the sparsity of MelSpecs. To address these limitations, we propose LatentSpeech, a novel TTS generation approach utilizing latent diffusion models. By using latent embeddings as the intermediate representation, LatentSpeech reduces the target dimension to 5% of what is required for MelSpecs, simplifying the processing for the TTS encoder and vocoder and enabling efficient high-quality speech generation. This study marks the first integration of latent diffusion models in TTS, enhancing the accuracy and naturalness of generated speech. Experimental results on benchmark datasets demonstrate that LatentSpeech achieves a 25% improvement in Word Error Rate and a 24% improvement in Mel Cepstral Distortion compared to existing models, with further improvements rising to 49.5% and 26%, respectively, with additional training data. These findings highlight the potential of LatentSpeech to advance the state-of-the-art in TTS technology.
https://github.com/haoweilou/LatentSpeech
Code is up. Might actually be useful.
Multimodal Latent Language Modeling with Next-Token Diffusion
https://arxiv.org/abs/2412.08635
>Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video). In this work, we propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers. Specifically, we employ a variational autoencoder (VAE) to represent continuous data as latent vectors and introduce next-token diffusion for autoregressive generation of these vectors. Additionally, we develop σ-VAE to address the challenges of variance collapse, which is crucial for autoregressive modeling. Extensive experiments demonstrate the effectiveness of LatentLM across various modalities. In image generation, LatentLM surpasses Diffusion Transformers in both performance and scalability. When integrated into multimodal large language models, LatentLM provides a general-purpose interface that unifies multimodal generation and understanding. Experimental results show that LatentLM achieves favorable performance compared to Transfusion and vector quantized models in the setting of scaling up training tokens. In text-to-speech synthesis, LatentLM outperforms the state-of-the-art VALL-E 2 model in speaker similarity and robustness, while requiring 10x fewer decoding steps. The results establish LatentLM as a highly effective and scalable approach to advance large multimodal models.
https://github.com/microsoft/unilm/tree/master/LatentLM
Code is up. Outperforms the VALL-E 2 model in speaker similarity and robustness.
>>41726334
TDS struck her badly.
>>41726524
Sadly, the audio separator is struggling a lot with it. The recording is very low quality, and the vocals are too quiet to cut through the sound of the guitar.
Bump.
Does anyone have a Minuette/Colgate voice file? In an .onnx format. I am using Piper
>>41726497
It's coming very soon. I pushed images to Docker Hub last night but discovered a couple of bugs after some additional testing, so I do NOT recommend updating Hay Say yet if you have it installed locally. I plan to fix the bugs tonight and deploy to haysay.ai either tonight or tomorrow evening.
I have a bunch of voices downloaded from Hugging Face; they are .pth and .index files. I don't know what uses them, but I would like to know if anyone here knows where I can use them. I tried alltalktts2 but it sounds like garbage; even with a 5-minute voice clip that I painstakingly transcribed, it still sounds like an early 15.ai voice.
>>41727805
>.pth and .index files
Uhh, could it be RVC? If you have a 5-minute dataset, you can easily train RVC + sovits and GPT-SoVITS with it.
>>41726576
Why?.. DSP methods are more controllable and physically accurate.
>>41726596
Looks interesting. Maybe somepony will generate test audio.
>>41726641
This might become one model for STS and TTS. And STT if you need it.
>>41728480
STS? Speech-to-speech? What, like changing from one voice to another?
>>41728482
Yes. Like what so-vits-svc does.
I noticed noise at the end of audio in https://mega.nz/folder/jkwimSTa#_xk0VnR30C8Ljsy4RCGSig/folder/Kwp33AQA
So where are the Colgate ai voice files?
Don't know how important this is, but I got one generation out of this and then the next one shit the bed.
>first gen = "Hnng..."
>second gen (the one that broke) = HEEEEH!!!!!
just fucking around with it of course
>>41729142
I think StyleTTS2 is tripping over the extra exclamation marks. "Heeeeh!" seems to work.
>>41729157
Cold starting my glimmy
https://files.catbox.moe/i2imfe.mp3
>>41729769
That's a shame. What about this song? Does it have the same issues as well?
https://files.catbox.moe/d9q6s3.mp3
>>41729193
Glimmy on the autobahn
https://files.catbox.moe/pubqox.mp3
>>41729193
>>41729234
>sensiblechuckle.jif
>>41729234
Got this with Rarity once.
https://files.catbox.moe/5p3rki.mp3
>>41729194
This is more doable. There is still some instrumental leakage into the vocals, but that should be easy to clean up.
>>41729769
That's good to know. Curious, are you actually doing the whole AI cover? Since I'm not incapable of doing it myself
>>41730121
i am incapable*
>>41730121
Ehh, please wait a bit as I am kind of busy with way too many holiday-related projects.
>incapable of doing it myself
It's not really that difficult, even if all you have is a potato PC/laptop. You just need to be able to use some kind of vocal separator (either Ultimate Vocal Remover for offline use, or one of the few online ones), then clean up the vocals a bit with Audacity and chop them up into 3~10 second segments (see the sketch below). After that, just use haysay's RVC and sovits 5. This process (when I can actually sit down and only focus on it) takes less than two hours to render all the clips and put them together into a new cover.
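If you'd rather script the chopping step than do it by hand in Audacity, here's a minimal pydub sketch (file names are placeholders): it splits the separated vocal track on silences and keeps only segments in the 3-10 second range.

from pydub import AudioSegment
from pydub.silence import split_on_silence

vocals = AudioSegment.from_file("vocals.wav")
chunks = split_on_silence(
    vocals,
    min_silence_len=400,              # ms of quiet that counts as a break
    silence_thresh=vocals.dBFS - 16,  # relative to the track's average level
    keep_silence=100,                 # pad each chunk a little
)
for i, chunk in enumerate(chunks):
    if 3_000 <= len(chunk) <= 10_000:  # pydub lengths are in ms
        chunk.export(f"chunk_{i:03d}.wav", format="wav")

Overly long chunks would need the silence threshold loosened or a second pass to split them further; this is just the lazy version.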
>>41730144
I don't have a laptop or a PC, and I'm fine waiting, so take your time, mate.
>>41730224
Well, Mr. Phoneposter, I have two versions for you then:
https://files.catbox.moe/awanqk.mp3
>Solo Twi
https://files.catbox.moe/0ym3y9.mp3
>Duo with reference vocals
I can feel in my bones how much RVC and sovits were struggling with the non-English words, so I think the duo version is the better of the two.
>>41730297
Sounds better than I thought it would, thank you so much, mate.
>>41729194
Just for the record, when you make an AI cover, the singer needs to be in the same vocal range as the pony you're trying to replace them with. If you have a song where the original vocals are sung by a guy with a deep voice, trying to replace it with Twilight won't work.
>>41730472
My dude, that's what the pitch change settings are for.
>one of the most active generals on this board was reduced to a rotting corpse because of one man
>15 single-handedly crippled the PPP and put it on life support
kinda based, ngl
>>41730607
Unless you can change the vocals by a full octave, you can't really change the pitch, as that would make the vocals sound out of tune with the music.
>>41730656
>change the vocals by a full octave
...Dude, I'm no musical expert, but this has been pretty well explained in past threads: one octave = 12 semitone shifts. That's what happens when you use RVC/sovits and change the pitch by (plus/minus) 12 or 24 (or other multiples of 12) to get the correct octave (quick arithmetic below). Yeah, the edited vocals will not sound 100% as good as if the original VA had sung it, but it's still way better than using TalkNet or re-editing TTS clips.
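The math behind that, for anyone curious: each semitone multiplies frequency by 2^(1/12), so a +12 shift exactly doubles the frequency (one octave up) and keeps everything in key. A quick sketch for picking the shift value (the 110/220 Hz figures are just illustrative):

import math

def semitone_ratio(n):
    # +12 semitones -> 2.0 (one octave up); -12 -> 0.5 (one octave down)
    return 2.0 ** (n / 12.0)

def semitones_between(f_source, f_target):
    # e.g. a male singer around 110 Hz vs a mare around 220 Hz -> +12.0
    return 12.0 * math.log2(f_target / f_source)

print(semitone_ratio(12))           # 2.0
print(semitones_between(110, 220))  # 12.0

Any shift that isn't a multiple of 12 transposes the melody into a different key, which is why a one-semitone tweak sounds off against the original instrumental.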
Up.
GPT-SoVITS v2 has (finally!) been added to Hay Say. You have the choice of either uploading reference audio of the character speaking or selecting an emotion from a dropdown list. If you select from the dropdown list, then Hay Say will randomly select from a set of precomputed embeddings for that emotion. This update also includes a security enhancement that restricts network access for most of the Docker containers. That ensures that if there are any malicious packages somewhere in the software dependency tree, they can't phone home.

Note: The way you select the reference audio for GPT-SoVITS in Hay Say is different from how you do it for StyleTTS2. I think this way is a little better. I hope this inconsistency in the UI isn't too confusing.

I'm a little disappointed at the server's performance. Generating one or two sentences with GPT-SoVITS on haysay.ai takes about 40 seconds. I have a couple of projects on the horizon that should improve performance across the board, but I'm going to take a break from Hay Say development for a little while to focus on another (but still pony-related) project.
>>41731435
>>41731435
>Generating one or two sentences with gpt so-vits on haysay.ai takes about 40 seconds
On CPU? That's already fast enough.
>>41730886
I have made several AI covers, and for the average vocals, transposing a full octave is way too much, unless you like Alvin and the Chipmunks or something. Here's an example of something I was working on. The first minute or so is normal, then the clip repeats with the pitch of Cadance's vocals raised as if to mimic what it would sound like if the original vocals were a bit too low for her range and needed some minor pitch work. This is one single semitone, and you can notice right away how off-key it is.
>https://files.catbox.moe/b2wcon.mp3
>>41731435
That sounds cool. I'll have to try that out sometime. Sounds like it would be a nice tool for doing fanfic readings/radio plays or something like that.
>>41732320
>change of pitch mid song
Yep, that will fuck up any conversion if you try to use the exact same settings across the clips. In this case you could try to run the clips as a different voice to "normalize" the pitch and then apply the desired pony voice to the output, or find some alternative cover of the song and take the vocal audio from that instead.
>>41730297
It has such a heavy accent it's almost unintelligible at times, but I like the idea.
Why does so-vits-svc 5.0 sound terrible compared to 4.0?
Did anypony try https://github.com/yl4579/StyleTTS-VC and was it good?
There is a new paper:
https://arxiv.org/abs/2409.10058
https://styletts-zs.github.io/
>>41723830
There's a 2019 DB dump from Desuarchive. It's the most recent /mlp/ archive dump as far as I can tell.
https://archive.org/details/desuarchive_db_201909
GPT-SoVITS doesn't handle Pinkie well. I used a line from one fanfic as a test:
>Why would you say that, Sunset? I mean, this is super exciting! Here, I read the backside on the DVD yesterday. Did you know that Rainbow Dash and Applejack have the same voice actress? The same goes for me and Fluttershy!
>>41734872
I think the model has problems handling an input string that long. Try breaking it up into chunks of no more than 1-2 sentences, e.g. with the throwaway script below.
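A minimal sketch of that chunking in Python (splitting on sentence-ending punctuation and grouping two sentences per request):

import re

def chunk_sentences(text, per_chunk=2):
    # Split after ., ! or ? followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [' '.join(sentences[i:i + per_chunk])
            for i in range(0, len(sentences), per_chunk)]

line = ("Why would you say that, Sunset? I mean, this is super exciting! "
        "Here, I read the backside on the DVD yesterday. Did you know that "
        "Rainbow Dash and Applejack have the same voice actress? The same "
        "goes for me and Fluttershy!")
for chunk in chunk_sentences(line):
    print(chunk)

Generate each chunk separately and concatenate the audio afterwards.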
>>41734872
Can you upload your result so we can compare later?
>>41735885
How many of those have you made?
>>41736373
I go to Bing and generate a new one every time I want to boop the thread. This is the ultimate consequence of being able to create endless slop with no effort.
I wish there were more covers of original MLP songs made with Suno, Udio, etc. I think these sites are very well suited for this, much better than most original songs people make with them.
They have a free trial of V4 with the mobile app, and I put in a low-effort WWU in metal style:
https://suno.com/song/5b6b728a-b027-437f-8e58-6a0bf1596b83
>>41737527
I do try, but getting inspiration for writing good quality lyrics is pretty difficult (even with AI text models, since those have been poisoned with a dataset of lowbrow pop-music lyrics).
>>41736962
It does have its merits though.
Somewhere within this decade or the next, an 18 year old is going to build a robot pony waifu with advanced 2030s AI and grow old with it, like Sweetie Bot but real. And well, I think about the possible repercussions of having a partner that cannot and will never physically age, at least the way humans do. Imagine your great grandnephew or grandniece powering up your rusting 80 year old robot pony waifu long after you've passed away, I say niece or nephew because it's already implied you won't be having kids but your siblings did. What would they do with it? Would the pony rather be dead knowing that its creator died nearly 20 years ago? What exactly would she want to exist for, having outlived her purpose, to make (You) happy? I think about this stuff way too much. It's definitely going to happen.
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
https://arxiv.org/abs/2412.10117
>In our previous work, we introduced CosyVoice, a multilingual speech synthesis model based on supervised discrete speech tokens. By employing progressive semantic decoding with two popular generative models, language models (LMs) and Flow Matching, CosyVoice demonstrated high prosody naturalness, content consistency, and speaker similarity in speech in-context learning. Recently, significant progress has been made in multi-modal large language models (LLMs), where the response latency and real-time factor of speech synthesis play a crucial role in the interactive experience. Therefore, in this report, we present an improved streaming speech synthesis model, CosyVoice 2, which incorporates comprehensive and systematic optimizations. Specifically, we introduce finite-scalar quantization to improve the codebook utilization of speech tokens. For the text-speech LM, we streamline the model architecture to allow direct use of a pre-trained LLM as the backbone. In addition, we develop a chunk-aware causal flow matching model to support various synthesis scenarios, enabling both streaming and non-streaming synthesis within a single model. By training on a large-scale multilingual dataset, CosyVoice 2 achieves human-parity naturalness, minimal response latency, and virtually lossless synthesis quality in the streaming mode.
https://funaudiollm.github.io/cosyvoice2
https://github.com/FunAudioLLM/CosyVoice
https://www.modelscope.cn/studios/iic/CosyVoice2-0.5B
https://huggingface.co/FunAudioLLM
Code is up. ModelScope has a demo with a Chinese UI. No weights uploaded to HF yet.
Multilingual, though the majority of the voice data was Chinese, with English second (some Japanese/Korean). Can voice clone after a fine-tune. The example page has a good one of Elon.
>>41739401
>Emotions, speaking rate, dialect, 'role playing'
This is the kind of TTS control I fucking wish we had all along.
>>41739401
https://www.modelscope.cn/models/iic/CosyVoice2-0.5B/summary
Weights
Really enjoying GPT-SoVITS. It does Rarity really well. This was done with the anxious preset.
https://files.catbox.moe/prycko.mp3
I did try to upload reference audio and it failed out. Is there a format or length requirement?
>>41741296
>length required
I think the OG training files note to only use 3~10s clips, so I'm guessing that would apply to the UI TTS as well?
WAV + OGG are missing as export formats, and the mp3 format doesn't work when download is attempted. I'm assuming something changed recently; it was working fine ~a week ago.
>>41741520
On the website, I should clarify.
>>41741296
Glad to hear you're enjoying it! That's a nice generated Rarity clip. I find it interesting that it generated little gasps/breaths in the right places too. I played around with using reference audio for a bit but was unable to generate any errors. Could you provide details about the reference file you uploaded, or post the error message if you are able to reproduce the issue again? There is no length requirement; I commented out the code that throws an error if the reference is too short or too long (https://github.com/hydrusbeta/GPT-SoVITS/blob/main/GPT_SoVITS/inference_webui.py#L454). Any format that can be read by Librosa should work (which covers a ton of formats). Internally, Hay Say converts the file to .wav and tells GPT-SoVITS to use that file.
>>41741520
Ah, shoot. The code for saving to different file formats looks old; I think I never committed my updated code to git. I'll work on fixing that now. Thanks for bringing it to my attention.
>>41741850
I'm stuck.
>>41741860
Sorry to see you're getting that error. Strange that it's reporting an unexpected end-of-file. Can you pull the individual images one at a time? Try this:
docker pull hydrusbeta/hay_say:hay_say_ui
If that works, then you can do the same for the rest of them:
docker pull hydrusbeta/hay_say:so_vits_svc_3_server
docker pull hydrusbeta/hay_say:so_vits_svc_4_server
docker pull hydrusbeta/hay_say:so_vits_svc_5_server
docker pull hydrusbeta/hay_say:rvc_server
docker pull hydrusbeta/hay_say:styletts2_server
docker pull hydrusbeta/hay_say:gpt_so_vits_server
docker pull hydrusbeta/hay_say:controllable_talknet_server
If it fails consistently on one particular image, that could indicate that there's a corrupt layer on one of them, somehow. Please let me know if that's the case. If you successfully pull all images, then try docker compose up again. In the meantime, I'll try installing Hay Say on a system I haven't installed it on before, to see if I can reproduce the issue. I also see a TLS handshake timeout at the top of your screenshot. Any nonstandard network stuff going on? For example, are you on a VPN or using a proxy?
>>41741520
The file format dropdown should be fixed now on haysay.ai. There are a lot of new options; it now supports all file formats that soundfile supports. I'll push an updated image to Docker tonight for local installs too.
>>41741860
>>41742076
I was able to install Hay Say on a Windows system that hasn't seen it before, so I think that rules out a corrupt image layer. Seeing as how your system was able to partially pull the images, I suspect it was some network glitch that randomly happened partway through the download. Try "docker compose up" again or see if you can pull the images individually.
>>41742076
>>41742304
>For example, are you on a VPN or using a proxy?
I'm not doing anything like that, but it worked after I spammed docker compose up several times until it completed, so it works now. I did a clean Windows install a few months ago and didn't have this issue, and I haven't changed any router settings. Thanks for checking though.
>>41742343
Great! Glad to hear you got it working.
>>41739000
She'll make a husbando modeled on you.
>>41742392
A colony of pony bots that multiply themselves would be an improvement.
>>41744445
>A colony of pony bots that multiply themselves
https://www.youtube.com/watch?v=dwG6MO92xtI
>>41745048
At least they won't eat your stockpiles, since they are robots.
>>41745264
>Gray goo scenario but it's cute mares
what a time to be alive
>>41742274
For GPT-SoVITS, are we also able to do other characters' voices not listed on haysay by using our own reference audio of said characters, including non-pony?
>>41745695
NTA, but could somebody test that? I'm kind of stuck phoneposting until after New Year.
>>41745695
Also, can we maybe get some more voice emotion options for the Mane Six and maybe other ponies, such as horny, seductive, hypnotized, etc.?
>>41745630
It's one of the better ways to go, I guess.
>10
Precautionary late night bump.
I don't like GPT-SoVITS; MaskGCT for now seems to be the best open source option, a shame it takes like 20GB of VRAM.
Is there any new exciting TTS tech coming up to look out for?
>>41748123
>takes like 20gb of vram
Christ, it's staggering how hard this stuff can go on hardware.
>>41748123
>maskgct
Uhh, I don't think that's included in standard GPT-SoVITS, since I can run that on my old 8GB of VRAM.
I know the OP post says G5 is not welcome, but it feels like a waste not to do anything with the raw voice actor files that have been leaking.
>>41749695
>Sparky
Is it just farting sounds?
>>41749702
His VA reads what the lines are, then does them.
https://vocaroo.com/16axVp3nhrTx
https://vocaroo.com/1oMVVJ019jMt
>>41749695
G5 is a clusterfuck VA-wise though. It starts all the way down with the change in the voice cast for the main characters.
What does Top K, Top P and Temperature do in GPT SoVITS?
>>41751186
+1 to that question.
>>41734531
I'll see what I can do with this.
>>41731435
>Meadowbrook model
>even sounds recognizable
https://voca.ro/16eMFsLvpesI
Fucking incredible, I can die happy now.
>>41749695
I guess if no one else is going to do it, I will.
This seems like it could be useful for workflows.
https://github.com/intel/openvino-plugins-ai-audacity/tree/main
>A set of AI-enabled effects, generators, and analyzers for Audacity®. These AI features run 100% locally on your PC -- no internet connection necessary! OpenVINO™ is used to run AI models on supported accelerators found on the user's system such as CPU, GPU, and NPU.
>Music Separation -- Separate a mono or stereo track into individual stems -- Drums, Bass, Vocals, & Other Instruments.
>Noise Suppression -- Removes background noise from an audio sample.
>Music Generation & Continuation -- Uses MusicGen LLM to generate snippets of music, or to generate a continuation of an existing snippet of music.
>Whisper Transcription -- Uses whisper.cpp to generate a label track containing the transcription or translation for a given selection of spoken audio or vocals.
>Super Resolution -- Upscales and enriches audio for improved clarity and detail.
NovelAI's new V4 anime model is actually pretty decent at mares, even though it's not directly intended. Could be useful for more artsy styles.
>>41754075
Can v4 do voices now?
>>41754165
NovelAI has pretty much only updated their TTS once, like... 3 years ago or something. They likely forgot it even exists.
>>41754678
So why exactly are you posting here instead of the AI art thread?
>>41754075
those are some neat twiggles
>>41754687
>Forgor
>More mares
>Limited time during break
>Improvement of existing tech highlighted here in the past
>Since when is the PPP thread voice only
Rainbow face went wurbwap
>>41754687
NTA, but I feel like it's a good idea to share AI news between the AI threads just to keep people updated.
>>41755173
Yeah, it's a good idea when it makes you money. Too bad you don't share news about anything else.
>>41755632
meds
https://civitai.com/models/833294?modelVersionId=1190596
The final version of NoobAI v-pred is out.
>>41757931
Almost there.
>>41759472
>>41757931
>>41706417
EqG voices for ponified EqGirls, so I can make more FiM ponies.
>>41759488
Mares forever!
>>41762138
Indeed.
>slow day to day free man!
>>41764480
Yeah, that can happen during Christmas.
Happy Hearth's Warming, everypreservationist!
>>41764743
You too, fellow preservationist!
>>41764743
So once again it's the cheery time of the year!
>>41764480
True.
>9
Nice bump thread, faggots
>>41751186
Top K is Top Kek, it measures how funny you want the output to be.
Top P is Top Pony, the higher it is the more the voice will resemble best pony.
Temperature is how hot AKA sexy you want it to sound.
>>41768042
yeah
>>41768042
Bump for the bump thread! Sage for the sage throne!
>>41768329
Top K
>>41751186
This is a little hard to explain if you don't understand how inference works. When generating outputs, the model generates one "token" at a time, and it assigns a probability to each possible token. Top K limits the per-token output selection to the k best matches. E.g., with topk=5, it finds the 5 most likely tokens and picks from those. Top P limits the per-token output selection based on cumulative probability. With topp=0.5, it finds the most likely tokens up to 50% probability, then cuts off the rest. Higher temperature makes all the tokens more equally likely to be selected; lower temperature skews the probabilities in favor of the more likely tokens. You generally only want to use one of [topk, topp, temperature] for any given inference.
topk, topp, and low temperature all accomplish similar things in slightly different ways. If you want the model to pick the single best output every time (which will give you the same result every time you run it), you want temperature=0 or topk=1 (same thing). If you want to prevent very bad tokens from being selected, you want to use topp <= say 0.8. (Lower topp means more tokens are considered "very bad".) Other than that, the values get hard to interpret.
That's how it works for LLMs. I'm guessing it's the same for GPT-SoVITS.
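To make that concrete, here's a toy per-token sampler in numpy showing where the three knobs enter. This is the standard LLM-style sampling scheme; I'm assuming GPT-SoVITS's autoregressive stage applies them the same way:

import numpy as np

def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0):
    logits = np.asarray(logits, dtype=np.float64)
    # Temperature rescales the logits: >1 flattens the distribution, <1 sharpens it.
    logits = logits / max(temperature, 1e-8)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]      # token ids, most likely first
    if top_k > 0:
        probs[order[top_k:]] = 0.0       # keep only the k most likely tokens
    if top_p < 1.0:
        cum = np.cumsum(probs[order])
        # Drop tokens that begin after top_p of the probability mass is
        # already covered (the single most likely token always survives).
        probs[order[cum - probs[order] >= top_p]] = 0.0
    probs /= probs.sum()                 # renormalize what's left
    return np.random.choice(len(probs), p=probs)

# temperature near 0 or top_k=1 -> effectively greedy, same output every run.
# top_p=0.8 -> cuts the unlikely tail while keeping some variety.

One caveat I haven't verified against the actual GPT-SoVITS code: whether it applies top-k before or after top-p. The toy above applies top-k first.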
>>41770942
Let the catalog burn!
>>41754075
AI art now has more soul than human-made art. It's over, join us!
AI is here to save the fandom, and the artists against it or trying to stop AI projects are the threat, not AI.
>>41771952
Don't bring this shit here.
>>41771957
It's just true. The fandom will not be saved by creeps trying to sell both MLP art and furry or loli porn, but by absolute freedom of creativity; this is important.
>>41771952
Fuck off with this divisive bullshit; go back to the discord you crawled into the AI art general from.
>>41754075
Most important question: what can publicly available models (weights, code, training and tuning procedures) achieve, and how can they be improved?
As a demo of AI potential this is fine, so now we need to get some idea of how to get that and better.
We might try to develop MAGIC: Multi-Agent Generative Image Converter.
I've seen some MAS LLM research, but are there any MAS image-generator papers?
>>41771952
No drama, just pony.
Here's all the leaked Tell Your Tale lines so far:
https://mega.nz/folder/pWczEYKY#T19kpTbI7haPnw63G2msoA
>>41772991
>G5
Big old meh.
>>41772991
If audio from the celebrity VAs in ANG had leaked, that might be worthwhile. Otherwise, I sleep.
>>41726596
Looks interesting. I think this is the first paper on voice conversion that I have read in its entirety, looking up everything I didn't know.
I'm comparing it with StyleTTS2, and the new thing this paper seems to propose is an alternative to the decoder, plus a slight alteration to the training process.
1. It does not use convolution. Not a fancy upsampler.
2. Instead of using a spectrogram like iSTFTNet or HifiGAN, it uses a latent diffusion model to generate multiple waveforms on different frequency bands directly from the embedding. LatentThroating, in other words.
3. As a result, we don't need two models for embedding->mel and mel->waveform (or STFT).
4. It is not a GAN. (Yay training?)
It is still possible to make an upsampler by giving the encoder decimated input (or giving it fewer bands) and comparing the loss relative to the true audio.
In StyleTTS2 this would be a replacement of the vocoder and would require replacing the decoder, maybe merging them into one model.
>>41774700
StyleTTS-ZS looks like the decoder+vocoder combo I mentioned. But the audio does not go through PQMF.
Ultra-lightweight Neural Differential DSP Vocoder For High Quality Speech Synthesis
https://arxiv.org/abs/2401.10460v1
>Neural vocoders model the raw audio waveform and synthesize high-quality audio, but even the highly efficient ones, like MB-MelGAN and LPCNet, fail to run real-time on a low-end device like a smartglass. A pure digital signal processing (DSP) based vocoder can be implemented via lightweight fast Fourier transforms (FFT), and therefore, is a magnitude faster than any neural vocoder. A DSP vocoder often gets a lower audio quality due to consuming over-smoothed acoustic model predictions of approximate representations for the vocal tract. In this paper, we propose an ultra-lightweight differential DSP (DDSP) vocoder that uses a jointly optimized acoustic model with a DSP vocoder, and learns without an extracted spectral feature for the vocal tract. The model achieves audio quality comparable to neural vocoders with a high average MOS of 4.36 while being efficient as a DSP vocoder. Our C++ implementation, without any hardware-specific optimization, is at 15 MFLOPS, surpasses MB-MelGAN by 340 times in terms of FLOPS, and achieves a vocoder-only RTF of 0.003 and overall RTF of 0.044 while running single-threaded on a 2GHz Intel Xeon CPU.
>>41774700Correction: there is still convolution
>mare
>>41731435Been a hot while since I've been to these threads. This is extremely impressive, definitely better than what I remember 15 being, give or take. What's the latest development on producing non-horse audio? Are there any actually solid services that sound halfway good?
>>41775543
Just train the same AI models on different voices.
LDM-SVC: Latent Diffusion Model Based Zero-Shot Any-to-Any Singing Voice Conversion with Singer Guidance
https://arxiv.org/abs/2406.05325v1
>Any-to-any singing voice conversion (SVC) is an interesting audio editing technique, aiming to convert the singing voice of one singer into that of another, given only a few seconds of singing data. However, during the conversion process, the issue of timbre leakage is inevitable: the converted singing voice still sounds like the original singer's voice. To tackle this, we propose a latent diffusion model for SVC (LDM-SVC) in this work, which attempts to perform SVC in the latent space using an LDM. We pretrain a variational autoencoder structure using the noted open-source So-VITS-SVC project based on the VITS framework, which is then used for the LDM training. Besides, we propose a singer guidance training method based on classifier-free guidance to further suppress the timbre of the original singer. Experimental results show the superiority of the proposed method over previous works in both subjective and objective evaluations of timbre similarity.
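The "singer guidance" part builds on classifier-free guidance, which is simple enough to show inline. A generic sketch (function and argument names are placeholders, not the paper's code): run the denoiser with and without the conditioning and extrapolate toward the conditioned prediction; pushing w above 1 is what suppresses leftover source-singer timbre:

def cfg_noise_pred(denoiser, x_t, t, cond, w=2.0):
    # w=1 is plain conditional sampling; w>1 pushes the sample
    # harder toward the condition (here, the target singer)
    eps_uncond = denoiser(x_t, t, None)  # conditioning dropped
    eps_cond = denoiser(x_t, t, cond)    # e.g. target-singer embedding
    return eps_uncond + w * (eps_cond - eps_uncond)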
LHQ-SVC: Lightweight and High Quality Singing Voice Conversion Modeling
https://arxiv.org/abs/2409.08583
>Singing Voice Conversion (SVC) has emerged as a significant subfield of Voice Conversion (VC), enabling the transformation of one singer's voice into another while preserving musical elements such as melody, rhythm, and timbre. Traditional SVC methods have limitations in terms of audio quality, data requirements, and computational complexity. In this paper, we propose LHQ-SVC, a lightweight, CPU-compatible model based on the SVC framework and diffusion model, designed to reduce model size and computational demand without sacrificing performance. We incorporate features to improve inference quality, and optimize for CPU execution by using performance tuning tools and parallel computing frameworks. Our experiments demonstrate that LHQ-SVC maintains competitive performance, with significant improvements in processing speed and efficiency across different devices. The results suggest that LHQ-SVC can meet
>>41389084
Interesting. Now I'm thinking about a voice-only variation of SESD.
1. Train Decoder->Encoder like LatentSpeech does
2. Freeze the codec and train an embedding denoiser
The goal here is to deal better with speaker differences. Or maybe something else, but the idea is to make the encoder more stable.
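That two-stage recipe as a PyTorch-style skeleton, just to pin down the ordering. Every module, loss, and the no-timestep denoiser here are placeholders I made up, not anything from the papers:

import torch
import torch.nn.functional as F

def train_two_stage(codec, denoiser, loader, epochs=10):
    # Stage 1: fit the codec (Decoder->Encoder reconstruction)
    opt1 = torch.optim.Adam(codec.parameters())
    for _ in range(epochs):
        for audio in loader:
            recon = codec.decode(codec.encode(audio))
            loss = F.l1_loss(recon, audio)
            opt1.zero_grad(); loss.backward(); opt1.step()

    # Stage 2: freeze the codec, train a denoiser in its latent space
    for p in codec.parameters():
        p.requires_grad_(False)
    opt2 = torch.optim.Adam(denoiser.parameters())
    for _ in range(epochs):
        for audio in loader:
            z = codec.encode(audio)
            noise = torch.randn_like(z)
            loss = F.mse_loss(denoiser(z + noise), noise)
            opt2.zero_grad(); loss.backward(); opt2.step()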
>>41776814
I found the LDM-SVC-but-faster paper.
LCM-SVC: Latent Diffusion Model Based Singing Voice Conversion with Inference Acceleration via Latent Consistency Distillation
https://arxiv.org/abs/2408.12354
>Any-to-any singing voice conversion (SVC) aims to transfer a target singer's timbre to other songs using a short voice sample. However many diffusion model based any-to-any SVC methods, which have achieved impressive results, usually suffered from low efficiency caused by a mass of inference steps. In this paper, we propose LCM-SVC, a latent consistency distillation (LCD) based latent diffusion model (LDM) to accelerate inference speed. We achieved one-step or few-step inference while maintaining the high performance by distilling a pre-trained LDM based SVC model, which had the advantages of timbre decoupling and sound quality. Experimental results show that our proposed method can significantly reduce the inference time and largely preserve the sound quality and timbre similarity comparing with other state-of-the-art SVC models.
https://www.youtube.com/watch?v=WPUVxX734iw
Is technology good enough to finally make it sung in Ponk's voice?
Bonk.
>>41778261
Yes it is!
https://files.catbox.moe/wa6e8d.wav
>>41779092
is it just me or is haysay.ai down?
REDUB TAMERS' VIDEOSREDUB TAMERS' VIDEOSREDUB TAMERS' VIDEOS
>>41780436
https://files.catbox.moe/ej3uqm.mp3
As a rule of thumb, I usually wait 5 minutes and refresh the website to see if it un-derps itself.
>bump
how does one train a gpt-sovits-v2 model?
>>41783125Tenderly yet firmly.
>>41783125
https://rentry.co/GPT-SoVITS-guide
https://huggingface.co/Delik/gsvlite/resolve/main/GPT-SoVITS-Lite.7z?download=true
Lite version (v1). I remember some files needed to be messed around with due to Python being retarded and not connecting to the correct elements.
>>41783016
>>41783762
>>41784445
>mares
>>41779092BOINK
>>41785304>stallions
>>41785941
>>41787642
>>41788232
REDUB TAMERS' VIDEOSREDUB TAMERS' VIDEOSREDUB TAMERS' VIDEOS!!!
>>41788969nah, I would prefer anons getting into ai music.
How do we stop being dead?
>>41789149
AI in its current state is a fad. Just like early phonographs were.
>oh, we can record and play back sounds now? that's cool I guess.
Most people have gotten over AI. It's not amazing anymore. The technology won't be very interesting to most people unless it has
>Ease of use
>Low cost
>Maturity
Ease of use was delivered by 15.ai, and it was free and mature compared to the ngroks, but it's gone now. Haysay isn't quite the same, and that's because of maturity.
It's going to be current year tomorrow and we still have robotic AI voices. Paid AI models might be better, but 1. nobody here wants to pay for AI, and 2. doxing yourself just to have all your inputs restricted and lobotomized isn't worth it.
And nobody here cares about voice-to-voice; this has always been a TTS thread, as the most active periods have been when 15.ai was alive. Why would anyone want to sing like an absolute fag just to make songs with robotic pony voices?
So yeah, we're gonna be dead for a while. Nobody's going to care that much until some new technology comes along and reignites interest again.
>>41789219
We do have non-robotic voices; they're called Udio and ElevenLabs. Not our problem you Sonicfags and Bronies are so uncultured and live under a rock. I think 15.ai got bought or joined a bigger AI project; the guy was always secretive.
>>41789425>can't read and immediately begins seething at nothing
>>41789425
15 said he shut his site down because he received a cease & desist letter. Interestingly, he never said who sent the letter: was it Hasbro or the Screen Actors' Guild? Originally, 15 cited the Google Books precedent as a way to justify why his site wasn't illegal, but after SAG went on strike and received concessions giving actors control over how their likenesses can be used in AI training, they'd have a stronger case to get 15 shut down.
>>41789828The same cease and desist Jan got, right?
>>41789843That was definitely Hasbro.
>>41789854he could've just ignored it
>>41789859If he ignored the C&D, he'd have to prepare to defend his position in case Hasbro decides to sue him. Since he produced animations that a casual observer can mistake for legitimate MLP cartoons, that's clearly a violation of Hasbro's trademark. He could claim he's producing a parody, but it'll still be very hard for him to win the lawsuit. I don't think ForgaLorga/Agrol has that problem simply because his animations don't have any dialogue.
>>41790053Hasbro would never go after him, nobody wants to waste money.
happy new year for mares!
>>41790708New year? NEW MARES!
>>41789425
>ElevenFags
Kill yourself.
Happy New Year!
>>41791927Happy New Year to you too!
>>41789828
>The good ol' reliable "we got a C&D, guys!!"
HAHAHA AND YOU BELIEVED IT? Tiarawhy came out and told me he lied about it to save face; they fabricated the email themselves.
>>41793517
>>41789219
StyleTTS2 does pretty mare voices from robotic input. We can try doing StyleTTS2-VC like StyleTTS-VC, or wait for the StyleTTS-ZS models to come out. Or do our own models.
Is it still impossible to do voice AI without Nvidia GPU?
>>41796493
Sadly all current tech depends on (py)torch, and that is almost completely dependent on Nvidia GPUs (there are some workarounds to make AMD GPUs work with it on Linux, but from what I heard it's very tricky to get working).
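For reference, checking what a given torch install can actually use is a couple of lines. Note the AMD ROCm builds of PyTorch reuse the CUDA API surface, so the same torch.cuda check applies there (assuming a ROCm wheel is installed):

import torch

if torch.cuda.is_available():  # NVIDIA, or AMD via a ROCm build
    device = torch.device("cuda")
elif getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
    device = torch.device("mps")  # Apple Silicon
else:
    device = torch.device("cpu")

print(torch.__version__, device)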
>>41795152Almost again.
What do you think ponies would call machine learning?
>>41797986Ugh, what has happened this time?
>>41798255Nothing. I think that's the point, this place is kill.
>>41798302
Hey now, there is a pretty good possibility we will get some kind of MLP animation generator based on anons posting from the AI image threads.
>>41797878 >>41797864
It will take a while to see a smaller-scale version; hopefully some other nerds will join in making new models/architectures, since right now everything is limited to OpenAI vs Chinese stuff.
>>41798302In this case, bump.
Hydrusbeta, could you post a link for the RVC models repository (especially the ones for singing)? I'm asking as huggingface.co/hydrusbeta seems to be missing a lot of singing models, and I kind of prefer to be able to run this offline, as uploading and running voice conversion online takes ages.
https://files.catbox.moe/3oavqv.mp3
Luna song. A few times RVC misunderstood loud = high pitch.
Any chance one of the training wizards will be able to use the 3-second "I missed you, big sister" clip to create an S1 Luna TTS?
>>41711970
>>41711975
>>41711980
>>41711985
I love you 1111 aka 15, but you sound like a Cali.
t. the anon formerly known as the IFOWONAIO anon
>>41799855
Vul trained a ton of singing and non-singing voices for RVC:
https://huggingface.co/therealvul/RVCv2/tree/main
For a list of all the RVC models that Hay Say knows about, along with links to the model files, see this JSON file:
https://github.com/hydrusbeta/hay_say_ui/blob/main/architectures/rvc/character_models.json
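If you want a quick inventory before downloading, something like this should pull that JSON and print a shallow view of it. A sketch: I'm converting the blob link to its raw.githubusercontent.com form and assuming nothing about the schema beyond it being valid JSON:

import json
import urllib.request

# raw equivalent of the github blob URL above
URL = ("https://raw.githubusercontent.com/hydrusbeta/hay_say_ui/"
       "main/architectures/rvc/character_models.json")

with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)

# print a shallow view of whatever structure comes back
if isinstance(data, dict):
    for key in data:
        print(key)
else:
    for entry in data:
        print(entry if not isinstance(entry, dict) else list(entry)[:3])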
>>41706417I can't find the other thread so I'm sticking this here. Also this generation turned out unusually good, my other ones were not nearly as cool.
>>41797864says file is corrupt? Is it just my Firefox?
What would ponies name their GPUs?
>>41801634My best guess would be workhorse related terms. Something referring to heavy lifting, probably.
Is Vul the author of that song about jannies by Ponka? I don't remember where I found it, but did he post it here?
>>41802539
>about jannies by ponka
Not really ringing any bells with this one. If you could post the file + filename, maybe I could compare to my playlist (so far I can't really find anything related). Is it possible it was posted in the /create/ thread?
>>41802627
https://pomf2.lain.la/f/kg95divo.mp3
/create/ is down right now...
>>41802657
Uhhh, sorry dude, I had not heard this one before, so it must have been posted somewhere outside the PPP.
>11
Something straight from tardland, I mean, /sci/:
https://github.com/odin-loki/Cell-AI
>>41806798
I'm happy to see people trying alternative methods of running/creating AI stuff, but I would also love it if they actually gave some demonstration of how it would work in practice.
>>41802539
Yes, the original was linked at the bottom of the lyrics for "Word of the Nightmare"
https://ponepaste.org/10467
>>41805738
Nice try. The board hasn't had a sticky in a while.
>>41807602
Does he announce/post new songs here or in /create/?
>>41731435
Kinda bummed that with all those East Asian language options there's no Autumn Blaze voice, so I could make her say
>"Konnichiwa dude!"
But then, she only has three minutes of dialogue minus the song.
>0hymn4
>>41804850
>>41810436
>>41808634I've seen posts from him in both places in the past, but I don't know if he prefers one thread over the other
>>41801022Marecelium vibes intensify.
>>41808634
He posts in both from what I've seen, though afaik he wasn't able to get as much out in 2024 as he wanted, hence why there weren't a ton of new Vul tracks to find.
>>41807602
The link to that catbox song is broken. Is catbox down?
>>41811597
Thank you for the (You). I actually really like Marecelium, but this OC existed before Marecelium did, since it was originally created for a D&D campaign. Not by much time, though; they were created in the same year.
>run out of credits before I could get the music AI model to generate the correct type of song after just an hour of prompting
The music AI services are really fucking gay. Can somebody please make an offline model that works on a 6GB GPU?
>>41811132
>>41813383I hate this credit nonsense. There are too many artificial 'currencies' like that out there.
>>41815912
Indeed, it's all too kosher. Lots of times I end up losing the credits because their model fucked up and either swapped sentences around, changed the pitch/vocalist mid-song, or straight up started to pump out house/techno background noises when I specifically asked for classical 18th-century piano tunes.
>https://files.catbox.moe/z98zyl.mp3
Redoing the classic meme with GPT-SoVITS.
99
>>41817019
I need to figure out if this is worth paying $10 a month for.
https://suno.com/invite/@enrapturingelevatormusic755
Does anypony have advice for generating yelling/screaming voices? I'm trying to make a mod for Helldivers 2 where the helldivers' voices are replaced with MLP characters. I can get the calm voice lines to sound fine with just a couple of passes through the models on haysay, but the screaming voice lines take all sorts of passes/retries/edits to even get close to "just ok." As examples:
Calm line original: https://voca.ro/13diHsoRGyQ8
Calm line Rarity: https://voca.ro/1mH7Ei7a5K84
Yelling line original: https://voca.ro/1ljbWIVgWlhQ
Yelling line Rarity: https://voca.ro/1jk038XmYqHW
>>41817774
If there's a secret, I haven't found it yet. The best I could figure out is to just spam it with multiple (sometimes dozens of) takes, and splice together whatever parts sound good enough.
I will say that SoVitsSVC 4.0 can sometimes translate certain parts of a performance better; I've gotten okay results out of it with basic yelling. However, I doubt you're ever gonna get the gritty, gravelly kind of deliveries you'd expect in a warzone out of the MLP characters, cause the data for it ain't there.
>>41817690
Ehh, kind of, but not really. When you get struck by inspiration (and don't mind spending half a day proompting) you can make pure gold like the /g/ anon who made "4am" in the first few days the server was online:
>https://files.catbox.moe/0yeais.mp3
Then there are times when even your credits run out and you end up like this >>41813383, or worse, have entire days/weeks/months where the creative spell is broken and you just get your money siphoned, just like with all the other modern services. But then again, it's just $10, so as long as you aren't struggling IRL you most likely won't notice it.
>>41817821
I figured that might be the case. Although splicing together final takes didn't occur to me, thanks! Mostly my concern was getting lines to have the proper cadence and hit the notes. They just end up being either super short or sounding flat (or sounding like shit). It took me multiple edits to get her to say "Earth" for any good length of time in that last one.
I may not go through with this after all, since that's hundreds of voice lines across 4 characters and the proof of concept took too much effort on its own. My last hail mary would be making a script to automate batch-generation of lines so I can check 50 or so at once and use the best ones; a rough sketch of that idea below.
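That hail-mary script is simple enough to sketch. synthesize() below is a placeholder for whatever you drive (Hay Say locally, a CLI call, etc.), NOT a real Hay Say function, and the lines/characters are example values; the loop just organizes N takes per line per character for triage:

from pathlib import Path

def synthesize(text: str, character: str) -> bytes:
    # placeholder: wire this to your actual TTS/VC pipeline
    raise NotImplementedError

lines = ["FOR SUPER EARTH!", "Reloading!"]  # your extracted voice lines
characters = ["Rarity", "Twilight Sparkle", "Applejack", "Pinkie Pie"]
TAKES = 50

for char in characters:
    for i, line in enumerate(lines):
        outdir = Path("takes") / char / f"line_{i:03d}"
        outdir.mkdir(parents=True, exist_ok=True)
        for take in range(TAKES):
            audio = synthesize(line, char)
            (outdir / f"take_{take:02d}.wav").write_bytes(audio)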
Goddamnit Twilight Sparkle, get out of my unrelated songs:
https://voca.ro/1d4GeS8UkFJV
https://voca.ro/13LOt6zYasm1
https://www.youtube.com/watch?v=isMwV-EO1tI&list=PLXplGAZHGThcgC0USArWosEuvPY8InUu0&ab_channel=Domibombs
Does this count as Tara Strong or Rebecca Shoichet singing?
https://youtu.be/N6Piou4oYx8
https://openreview.net/forum?id=AL1fq05o7H
Mare, this is interesting. MAMBA-SoVITS, anypony?
This horsie wonders if MAMBA can be used in StyleTTS2 - the best and most stable TTS so far.
Did anycreature experiment with smashing GPT and StyleTTS together? If there are positive results, then maybe experiment with MAMBA and StyleTTS.
>stealing news from /g/
>VLC automatic subtitles generation and translation based on local and open source AI models running on your machine working offline, and supporting numerous languages!
>Demo can be found on our #CES2025 booth in Eureka Park.
VLChads can't stop winning:
https://x.com/videolan/status/1877072497146781946?t=jcarCV_7wCs11kDPWunwvg&s=19
This would be pretty cool if there was an API option to hook it up to TTS and have mare voices translate random Japanese vtubers.