/mlp/ - Pony

File: AnotherFilenameMaybe.png (1.54 MB, 2119x1500)
Welcome to the Pony Voice Preservation Project!

See below for regular OP post.
The Pony Preservation Project is a collaborative effort by /mlp/ to build and curate pony datasets for as many applications in AI as possible.

Technology has progressed such that a trained neural network can generate convincing voice clips, drawings and text for any person or character using existing audio recordings, artwork and fanfics as a reference. As you can surely imagine, AI pony voices, drawings and text have endless applications for pony content creation.

AI is incredibly versatile: basically anything that can be boiled down to a simple dataset can be used for training to create more of it. AI-generated images, fanfics, wAIfu chatbots and even animation are possible, and are being worked on here.

Any anon is free to join, and there are many active tasks that would suit any level of technical expertise. If you’re interested in helping out, take a look at the quick start guide linked below and ask in the thread for any further detail you need.

EQG and G5 are not welcome.

>Quick start guide:
Introduction to the PPP, links to text-to-speech tools, and how (You) can help with active tasks.

>The main Doc:
An in-depth repository of tutorials, resources and archives.

>Active tasks:
Research into animation AI
Research into pony image generation
>Latest developments:
GDrive clone of Master File now available >>37159549
SortAnon releases script to run TalkNet on Windows >>37299594
TalkNet training script >>37374942
GPT-J downloadable model >>37646318
FiMmicroSoL model >>38027533
Delta GPT-J notebook + tutorial >>38018428
New FiMfic GPT model >>38308297 >>38347556 >>38301248
FimFic dataset release >>38391839
Offline GPT-PNY >>38821349
FiMfic dataset >>38934474
SD weights >>38959367
SD low vram >>38959447
Huggingface SD: >>38979677
Colab SD >>38981735
NSFW Pony Model >>39114433
New DeltaVox >>39678806
so-vits-svc 4.0 >>39683876
so-vits-svc tutorial >>39692758
Hay Say >>39920556
Haysay on the web! >>40391443
SFX separator >>40786997 >>40790270
Clipper finishes re-reviewing audio >>40999872
>The PoneAI drive, an archive for AI pony voice content:

>Clipper’s Master Files, the central location for MLP voice data:

>Cool, where is the discord/forum/whatever unifying place for this project?
You're looking at it.

Last Thread:
If your question isn’t listed here, take a look in the quick start guide and main doc to see if it’s already answered there. Use the tabs on the left for easy navigation.
Quick: docs.google.com/document/d/1PDkSrKKiHzzpUTKzBldZeKngvjeBUjyTtGCOv2GWwa0/edit
Main: docs.google.com/document/d/1y1pfS0LCrwbbvxdn3ZksH25BKaf0LaO13uYppxIQnac/edit

>Where can I find the AI text-to-speech tools and how do I use them?
A list of TTS tools: docs.google.com/document/d/1y1pfS0LCrwbbvxdn3ZksH25BKaf0LaO13uYppxIQnac/edit#heading=h.yuhl8zjiwmwq
How to get the best out of them: docs.google.com/document/d/1y1pfS0LCrwbbvxdn3ZksH25BKaf0LaO13uYppxIQnac/edit#heading=h.mnnpknmj1hcy

>Where can I find content made with the voice AI?
In the PoneAI drive: drive.google.com/drive/folders/1E21zJQWC5XVQWy2mt42bUiJ_XbqTJXCp
And the PPP Mega Compilation: docs.google.com/spreadsheets/d/1T2TE3OBs681Vphfas7Jgi5rvugdH6wnXVtUVYiZyJF8/edit

>I want to know more about the PPP, but I can’t be arsed to read the doc.
See the live PPP panel shows presented on /mlp/con for a more condensed overview.
2020 pony.tube/w/5fUkuT3245pL8ZoWXUnXJ4
2021 pony.tube/w/a5yfTV4Ynq7tRveZH7AA8f
2022 pony.tube/w/mV3xgbdtrXqjoPAwEXZCw5
2023 pony.tube/w/fVZShksjBbu6uT51DtvWWz

>How can I help with the PPP?
Build datasets, train AIs, and use the AI to make more pony content. Take a look at the quick start guide for current active tasks, or start your own in the thread if you have an idea. There’s always more data to collect and more AIs to train.

>Did you know that such and such voiced this other thing that could be used for voice data?
It is best to keep to official audio only unless there is very little of it available. If you know of a good source of audio for characters with few (or just fewer) lines, please post it in the thread. 5.1 is generally required unless you have a source already clean of background noise. Preferably post a sample or link. The easier you make it, the more likely it will be done.

>What about fan-imitations of official voices?

>Will you guys be doing a [insert language here] version of the AI?
Probably not, but you're welcome to. You can however get most of the way there by using phonetic transcriptions of other languages as input for the AI.

>What about [insert OC here]'s voice?
It is often quite difficult to find good quality audio data for OCs. If you happen to know any, post them in the thread and we’ll take a look.

>I have an idea!
Great. Post it in the thread and we'll discuss it.

>Do you have a Code of Conduct?
Of course: 15.ai/code

>Is this project open source? Who is in charge of this?

PPP Redubs:

Stream Premieres:
File: WoodAnch.png (583 KB, 656x847)
Here is the rest of the news section and an explanation why the OP is like it is. Sorry about that.

Prompt: Mares of Equestria love Anon AKA Anonymous, Who is the only (male) human in Equestria;female vocalist, Eurodance, Pop, Melodic, Funny

Anon in Equestria

In this land of hooves and tails, he walks alone
With a two-legged stride in a horse's home
They trot to his side, say 'Anon's our own!'
Every mare in Equestria is calling his phone

Their hearts trotting fast, galloping in time
To the beat of their world where the sun always shines

All the stallions might glare, but Anon's unfazed
He's got unicorn magic in his human ways
Pegasus girls, they all wanna race
But it's Anon, only Anon, who can keep up the pace

Whoa-oh, in Equestria, he's living the dream
Every mare's got a crush, yeah, Anon's the theme
Through the fields, he's the one they pursue
A human touch that's both rare and true
(left out: When Anon steps, the world's brand new)
In a pony parade, he's the first in view

Flutters shy and Rarity's gleam
Every mare in the land wants to join his team
With a laughter's ring (aha!), Pinkie's in line
Throws a party for two, with Anon, it's divine
Rainbow Dash in the sky, drafting clouds to spell 'Anon's mine!'
But there's a queue, yeah, the line's longer each dawn

Twilight's sparkle can't compare
To the way that they swoon when he's there

In this land of hooves and tails, he walks alone
With a two-legged stride in a horse's home
They trot to his side, say 'Anon's our own!'
Every mare in Equestria is calling his phone
Adapting WavLM for Speech Emotion Recognition
>Recently, the usage of speech self-supervised models (SSL) for downstream tasks has been drawing a lot of attention. While large pre-trained models commonly outperform smaller models trained from scratch, questions regarding the optimal fine-tuning strategies remain prevalent. In this paper, we explore the fine-tuning strategies of the WavLM Large model for the speech emotion recognition task on the MSP Podcast Corpus. More specifically, we perform a series of experiments focusing on using gender and semantic information from utterances. We then sum up our findings and describe the final model we used for submission to Speech Emotion Recognition Challenge 2024.
good walkthrough for how they did it for anyone here who wants to make one for ponies. made for this competition
>op didn't put "ppp" in the text and now I can't quickly find the thread
Anyhow, I'm thinking of training more season-specific voices. Does anyone have any special requests?
File: Coloratura Sit.png (1.64 MB, 6735x6735)
Coloratura. What architecture are you training for?
RVC, but first I will need to check if she has at least 2 minutes of files. I may try so-vits later, but those usually take almost all day.
"ppp" has never been in the text. I usually find the thread by searching for "preservation".
File: razzledazzle.png (15 KB, 90x90)
hmm, weird, typing "ppp" in the search always worked in the past several years when lurking in the main catalogue.
>>41065205 >>41064828
RVC of Countess Coloratura aka Rara. For training I used a mix of singing and speaking lines to pad the dataset out to 4 minutes. For some reason the model seems to have more difficulty converting male voice clips than the previous models I've trained.
I will need to refresh my knowledge of how my so-vits training setup works, so her model will be trained sometime over the weekend.
What is the best TTS for mares in Silly Tavern?
Any chance this could get upped on haysay.ai?
File: happy Twilight.gif (1.15 MB, 675x540)
I am late to the party, but...
Happy birthday PPPV!
I was here at the beginning, and even if I can't help much these past two years, it always warms my heart to see this thread alive and kicking!

Long life to PPP!

Has anyone tried this one yet? Is it good?
Same anon. I tried it... but there seems to be an issue involving the output being very squeaky at times and deeper in others... I don't know if it's because of the data I used or what, but I don't like that it's giving that sort of quality.
>5 years
How? It feels like it only started yesterday?
early 2023 15 ai audio clips i haven't shared:
fluttershy yelling at angel spongebob reference
cocopommel says i love you
Anyone know why Udio has so many problems generating 16-bit music?
File: out of apples.png (1.27 MB, 1400x1400)
nicely done
Very Nice.
mare music
The problem is that you can't link 5 posts consecutively on a single line. So this should work:

Something in the following text will prevent your posts from going through. I tried like 6 times to make the thread, I tried from both Chromium and Firefox, different ISPs, I tried without 4chanx. I would just get redirected to this thread >>41064811. Eventually I tried posting it piecemeal to see if maybe a part of the OP was being blocked? And that did seem to be the case. Something in the below text will prevent your text from being posted. If the delete functionality still worked I would remake the thread. I'm really sorry about the messed up OP.

Synthbot updates GDrive >>41019588
Private "MareLoid" project >>40925332 >>40928583 >>40932952
VoiceCraft >>40938470 >>40953388
Fimfarch dataset >>41027971
5 years of PPP >>41029227
Various AI News >>40947581 >>40991154 >>41012445 >>41023953
>>41025268 >>41041365
Does anyone know if there is some kind of add-on to the RVC training script where you can provide it with a list to create an entire batch of voices?
I would like to use such a function whenever I'm away from my PC (like a weekend with the extended family); it would be pretty neat to return to a whole cast of characters done cooking and ready to go.
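One way to approach the batch idea is a small wrapper script that walks a list of voices and runs the training command for each in turn. The sketch below is not an RVC add-on: `build_command` is a placeholder that only prints a message, and the voice names and dataset paths are made up. You would swap in whatever command line your own RVC install actually uses.

```python
import subprocess
import sys

# Hypothetical job list: (voice name, dataset folder). Replace with your own.
JOBS = [
    ("Coloratura", "datasets/coloratura"),
    ("Octavia", "datasets/octavia"),
]

def build_command(name, dataset):
    # Placeholder: prints instead of training. Swap in the real training
    # command for your setup; no actual RVC flags are assumed here.
    return [sys.executable, "-c", f"print('training {name} from {dataset}')"]

def run_batch(jobs):
    results = {}
    for name, dataset in jobs:
        proc = subprocess.run(build_command(name, dataset),
                              capture_output=True, text=True)
        # Keep going on failure so one bad dataset doesn't waste the weekend.
        results[name] = proc.returncode == 0
    return results

if __name__ == "__main__":
    for name, ok in run_batch(JOBS).items():
        print(f"{name}: {'done' if ok else 'FAILED'}")
```

Running the jobs one after another rather than in parallel keeps VRAM use the same as a single manual run.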
Even if I didn't recognize AJ very well at the beginning, it changes very fast.
An awesome work indeed!
Mystery solved. Thanks Anon. I thought I'd made long post lists previously, but maybe it's a recent change. Who knows anymore.
Page 9 bump.
You mean page 10 bump, right?
When running training for so-vits-svc 4 I got this strange error, does anyone know what this is about, as I've never seen it before?
Did you run out of RAM perhaps?
Nope, I was monitoring the RAM and VRAM usage, and while it was pretty high it didn't run out (usually if a program used up all my RAM in the past the PC would crash). Like I said, I've never seen this error before, and I've trained a few so-vits voices in the past.
However, it is possible that another program and/or browser running at the same time could have somehow affected the training process, so I may re-attempt training some other time.
For whatever reason (usually running out of non-reserved memory) the computer started using the hard drive as additional memory through a pagefile. This is the default behavior when low on memory, but it's not impossible some weird edge case made it try to reserve an absurd amount of memory. How much space do you have left on your OS HDD/SSD?
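For anyone who wants to check the free-space question from a script before kicking off a long training run, here is a minimal sketch using only the Python standard library. The function name `free_gib` is made up for this example.

```python
import shutil

def free_gib(path="."):
    """Return free space in GiB on the drive that contains `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / 2**30

if __name__ == "__main__":
    # "." checks the current drive; point it at the OS drive instead
    # (for example "C:\\" on Windows) to check where the pagefile lives.
    print(f"{free_gib():.1f} GiB free")
```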
That works too.
File: 1672976889910218.jpg (97 KB, 385x501)
Logically speaking, how would you bridge the gap between a simple text to text model (think /CHAG/) and allow it to interact with the world mechanically?
my initial guess is it just wouldn't be advanced enough. it's built exclusively for text to text, so how could you take it into the territory of motion, not even anything accurate but just any mechanical capabilities at all?
strap it into a horse-shaped BattleMech, of course
Give it senses. A text generator model already has a way to interact with the world; what it doesn't have is a method to autonomously and accurately generate context.
>it's built exclusively for text to text
They do image to text too now. It's basic and I don't think it would be enough for complex motions, but that should give some basic interactions with the world around.
You can formulate it as an action that an LLM can choose. The environment can be mapped by the robot in 3D and objects labelled with a corresponding model. It's no different than those dog robots.

If you want something less scripted there are models that estimate humanoid skeleton animation given a description, same can be applied to ponies.

Theoretically it's already perfectly possible, you would only need a tech guy, a robotics guy, 3d printer and an H200 server.
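The "action that an LLM can choose" idea can be sketched roughly like this: describe the labelled scene in text, offer a fixed list of actions, and only accept a reply that names one of them. Everything here is hypothetical, including the action names, and `fake_llm` is a stand-in for a real model call.

```python
ACTIONS = ["walk_forward", "turn_left", "turn_right", "pick_up", "idle"]

def build_prompt(scene_objects):
    # The scene description would come from the robot's object labelling.
    return (
        "You see: " + ", ".join(scene_objects) + ".\n"
        "Choose exactly one action from: " + ", ".join(ACTIONS) + ".\n"
        "Answer with the action name only."
    )

def parse_action(reply):
    # Accept the first known action mentioned in the reply; otherwise idle.
    words = reply.lower().replace(".", " ").replace(",", " ").split()
    for word in words:
        if word in ACTIONS:
            return word
    return "idle"

def fake_llm(prompt):
    # Stand-in for a real model call (hypothetical).
    return "pick_up" if "apple" in prompt else "I am not sure."

def step(scene_objects, llm=fake_llm):
    return parse_action(llm(build_prompt(scene_objects)))
```

Falling back to `idle` on anything unparseable means a rambling reply never turns into an unintended motor command.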
I would imagine something along the lines of a Sims-like "emotional needs" system would be enough for most interactions, as long as it correctly exchanges information with the more advanced AI text model to give the illusion of the robot acting semi-intelligent.
>>41065205 >>41066442
Hi Anon, here is another model of Countess Coloratura trained for sovits 4.0.
File: 1673497361137281.jpg (61 KB, 600x600)
>text models can already interact with the world
by this im assuming you just mean interacting with the world by generating text right?
>but they dont have a method to autonomously and accurately generate context
isn't context basically just the 'memory' a model has of past interactions?
i wish there was a way to increase context through RAM usage or maybe even hard drive storage
wish we could just pump out some kind of simple python code to do that, infinite context, but at the same time how fast would it be able to make use of that context? if it's on a hard drive as opposed to RAM then I would imagine it would take much longer to load that context
you would probably need something more like video to text in order to make it interact in real time rather than hyper delayed reactions
>the environment can be mapped by the robot in 3D and objects labelled with a corresponding model
im assuming you mean translating a video feed with a model that identifies stuff so that it can be interpreted by the main LLM that has the personality and movement control and all that
where would you even get a model to translate video to text in real time?
>models that estimate humanoid skeleton animation given a description
this one im completely lost on, what does this mean and how would it be of use? maybe im just sub 80 IQ
i know what you're talking about and part of that is you could have a self reflection mechanism
this might be really niche but i remember seeing i think it was on GitHub a program that you could put your API key in and it was called 'auto computer' or something like that where it could perform tasks by asking the AI to do something and it will break it down into actionable steps before performing it which the AI then generates after being prompted to create the steps to do something
In this way, you could have an automatic prompt after each interaction to gauge emotion like 'how did this make you feel' which will then output an emotional state of some kind i would imagine
if you wanted to take it further you could make it more dynamic with percentages like this
happiness - #%
sadness - #%
anger - #%
and it could factor in previous interactions to make a totaled emotional state, similar to how you could have a bad day at work but then you get to hang out with friends after and you start to feel better
but then again you would need a mechanism to calculate whether those emotions would fade away after some time, and some emotions stick longer than others, which would be confusing to deal with and I think is a project all on its own
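The percentage idea plus fading could be sketched like this, assuming each emotion decays exponentially with its own half-life. The half-life values and the `Mood` class are made up for illustration, and the `amount` passed to `feel` would come from whatever the "how did this make you feel" prompt returns.

```python
# Half-lives in seconds, per emotion. Values are made up for illustration.
HALF_LIFE = {"happiness": 3600.0, "sadness": 7200.0, "anger": 1800.0}

class Mood:
    def __init__(self):
        self.levels = {name: 0.0 for name in HALF_LIFE}  # each in 0..100 (%)

    def decay(self, seconds):
        # Exponential decay: a level halves every HALF_LIFE[name] seconds.
        for name, half_life in HALF_LIFE.items():
            self.levels[name] *= 0.5 ** (seconds / half_life)

    def feel(self, name, amount):
        # Add the rating from a self-reflection prompt, clamped to 0..100.
        self.levels[name] = min(100.0, max(0.0, self.levels[name] + amount))

    def report(self):
        return "\n".join(f"{n} - {v:.0f}%" for n, v in self.levels.items())
```

Giving sadness a longer half-life than anger is one simple way to make some emotions stick longer than others without extra bookkeeping.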

Last thing I wanted to ask was how could you imagine someone pulling in funding for a project like this, assuming someone had baseline needs met and some kind of disposable income to actually complete the project, what would they do with the project itself or even the past iterations that were made?
>isn't context basically just the 'memory' a model has of past interactions?
You can think of context as the working memory of a model. Without context a model will react to everything entirely based on its 'instinct' (that is, completing based on averages). Giving a model a way to autonomously observe (generate context) what's around it is the first step to having it start to form an understanding of the world. Of course, how complex the model's understanding of the world can be would be limited by the maximum size of useful context the system can handle.
File: 1417744044882.gif (60 KB, 720x1050)
oh AUTONOMOUSLY creating context
yeah that makes more sense
it has to be able to learn on its own without human input
i think this would require some form of recursive self reflection like 'what am i looking at right now' but at the same time could cause problems since you could look at the same thing and have it make a new entry even though it looked at it just a few minutes ago
a big problem would be just rewriting the same context over and over and filling up context space needlessly
i wonder hypothetically how much storage you would need for context alone
assuming you could convert context to like, idk, a text file or some kind of file which could be read fast enough to not slow down response times
I wonder how fast the context would fill up though, i mean if it's taking literally everything in that it sees i would imagine it's gonna fill up crazy fast

the main bottlenecks here are how fast you could access context and the sheer amount of storage that it would need. this is assuming you could even somehow automatically write the context into a TXT file and keep it stored on an actual hard drive rather than contained within the AI or however the fuck it works. I would have to ask /CHAG/ where context itself is even stored since I'm a spastic and don't know how that works, but I do know that past its limit it starts overwriting previous context (I THINK)
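The "same thing logged over and over" problem could be handled with something like the sketch below: skip an observation if the identical text was logged recently, and cap the total number of entries. The class name, the 5-minute window and the entry cap are all assumptions for the example, not how any existing frontend does it.

```python
import time

class ObservationLog:
    def __init__(self, max_entries=1000, repeat_window=300.0):
        self.max_entries = max_entries      # crude cap on stored context
        self.repeat_window = repeat_window  # seconds before re-logging
        self.entries = []                   # list of (timestamp, text)
        self.last_seen = {}                 # text -> last timestamp logged

    def observe(self, text, now=None):
        now = time.time() if now is None else now
        last = self.last_seen.get(text)
        if last is not None and now - last < self.repeat_window:
            return False                    # seen recently, skip duplicate
        self.last_seen[text] = now
        self.entries.append((now, text))
        if len(self.entries) > self.max_entries:
            self.entries.pop(0)             # drop the oldest entry
        return True

    def as_context(self):
        return "\n".join(text for _, text in self.entries)
```

Exact string matching is the weak point: two slightly different descriptions of the same object would both be logged, so a real system would need fuzzier matching.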
Yeah there are many barriers still to actually getting such a system working. Both on the hardware and software side.
Maybe it'll be simpler to have an AI-controlled agent in a virtual environment, like Second Life. I know people have trained AIs to play video games before. You could try asking the popen thread how hard it would be to get an AI to control a popen.
Are there any good samples or excerpts of ponies cheering? What about booing?
not that anon; how feasible and/or difficult is it to model an AI's memory after the human brain, i.e. short term memory, long term memory, and the conversion between the two?
I know there are attempts of varying quality using summarization, clever systems of reinforcing and forgetting context and such but they're to my understanding just the AI compressing and managing context. I'd guess a true split would require some new architecture? Not exactly an expert here.
honestly i see no reason to make short term memory unless having everything in long term memory would be too many files and take too much storage or confuse the model somehow
File: Untitled.jpg (322 KB, 1565x957)
AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding
>The paper introduces AniTalker, an innovative framework designed to generate lifelike talking faces from a single portrait. Unlike existing models that primarily focus on verbal cues such as lip synchronization and fail to capture the complex dynamics of facial expressions and nonverbal cues, AniTalker employs a universal motion representation. This innovative representation effectively captures a wide range of facial dynamics, including subtle expressions and head movements. AniTalker enhances motion depiction through two self-supervised learning strategies: the first involves reconstructing target video frames from source frames within the same identity to learn subtle motion representations, and the second develops an identity encoder using metric learning while actively minimizing mutual information between the identity and motion encoders. This approach ensures that the motion representation is dynamic and devoid of identity-specific details, significantly reducing the need for labeled data. Additionally, the integration of a diffusion model with a variance adapter allows for the generation of diverse and controllable facial animations.
>The weights and code are being organized, and we will make them public as soon as possible.
I was really confused since I thought I had already read this paper but it turns out there was another one
that I had a link to their implementation in the links rentry
but bizarrely even though both papers are chinese the anitalk one does not even mention the aniportrait paper or compare against it. anyway it has face cloning ability for anyone interested
for you guys specifically I think a new model or a finetune would need to be done with ponies facial motion data
File: 649480.png (303 KB, 2090x2350)
march of the pigs
Any language model is just computing a probability distribution for the next token; a token is a word or part of a word. The context is all the previous tokens. You take the model's prediction, sample a token from the predicted probability distribution (not necessarily the most likely one), add it to the initial sequence and feed it back to the model. Repeat the process until you have generated enough tokens, or stop when the model outputs a special end-of-sequence token. You can manipulate the context any way you want, inserting information that might be useful for the model, such as info about the surroundings; with corporate cloud models you don't get that option. There are also multimodal models that are trained together with, say, an image description model, so that the image model outputs tokens which are inserted into the model's context. These tokens aren't words, yet they are understood by the model. This is how gpt4 works.
If the context has significantly more tokens than the model was trained with, the predictions will be shit. You can indeed try to summarize the most important points and insert them to the beginning of context.
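The sample-and-feed-back loop described above looks roughly like this. The "model" here is a made-up lookup table that only looks at the last token, purely so the example runs on its own; a real LLM would compute the distribution from the whole context.

```python
import random

EOS = "<eos>"

# Toy "model": next-token probabilities given only the last token.
# A real LLM conditions on the whole context; this table is made up.
TOY_MODEL = {
    "<bos>": {"twilight": 0.6, "pinkie": 0.4},
    "twilight": {"reads": 0.7, "sleeps": 0.3},
    "pinkie": {"bakes": 0.8, "sleeps": 0.2},
    "reads": {EOS: 1.0},
    "bakes": {EOS: 1.0},
    "sleeps": {EOS: 1.0},
}

def next_token_distribution(context):
    return TOY_MODEL[context[-1]]

def generate(context, max_new_tokens=10, rng=random):
    context = list(context)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(context)
        tokens, probs = zip(*dist.items())
        # Sample from the distribution rather than always taking the argmax.
        token = rng.choices(tokens, weights=probs, k=1)[0]
        if token == EOS:
            break  # special end-of-sequence token: stop generating
        context.append(token)  # feed the new token back in
    return context
```

Calling `generate(["<bos>"])` a few times gives different short sequences, which is the "not necessarily the most likely one" part in action.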
>made with so-vits 5.0
rvc v3 better catch up soon this is amazing
SFX and Music folder.
vul - I'm trying to address this post from Synthbot that I previously forgot to sort out >>41015670.
He's asked to re-export some of the sorted audios from S1 and S2. For some reason I'm now getting errors whenever I try to open saved projects in the PonySorter, both the most recent version and previous versions before EQG became a factor. The common error reported is that it can't find episodes in the labels index, though I'm sure that the supplied index JSON file contains the required data. It's been a while since I last opened the PonySorter so perhaps there's something I've just forgotten that I need to do.

Here are the save files I created for S1 and S2 - can you see if you can get them to open on your end, and then perhaps find what's going wrong?
honestly, it could've been made with RVC too. I don't quite remember. I was trying a lot of different settings on haysay and I don't know which result I ended up saving.
Does your config index_file point to episodes_labels_index.json?
Yeah it does, which is mainly why I have no idea what's going wrong with it.
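A quick way to double-check what the index file actually contains, independent of PonySorter, is to load it directly and look for the episodes being reported missing. This sketch assumes the index is a JSON object keyed by episode identifier, which is a guess; the real layout may differ, in which case the check would need adjusting. The function name is made up.

```python
import json

def missing_episodes(index_path, wanted):
    """Return the entries of `wanted` that are not top-level keys of the index."""
    with open(index_path, encoding="utf-8") as f:
        index = json.load(f)
    if not isinstance(index, dict):
        # Assumption violated: the file is not keyed the way this check expects.
        raise ValueError("index is not a JSON object; adjust the check")
    return [ep for ep in wanted if ep not in index]
```

Passing the episode identifiers exactly as they appear in the error message would also expose naming mismatches such as different capitalisation or zero padding.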
Can you post the specific error outputs?
After trying to load the first savefile >>41081961
Can I see the labels index file as well
I can't reproduce this problem on my end but this labels index is what I used

Also reading the post chain further--is this about the dialogue from those episodes, or SFX and Music?
Thank you!
LLAniMAtion: LLAMA Driven Gesture Animation
>Co-speech gesturing is an important modality in conversation, providing context and social cues. In character animation, appropriate and synchronised gestures add realism, and can make interactive agents more engaging. Historically, methods for automatically generating gestures were predominantly audio-driven, exploiting the prosodic and speech-related content that is encoded in the audio signal. In this paper we instead experiment with using LLM features for gesture generation that are extracted from text using LLAMA2. We compare against audio features, and explore combining the two modalities in both objective tests and a user study. Surprisingly, our results show that LLAMA2 features on their own perform significantly better than audio features and that including both modalities yields no significant difference to using LLAMA2 features in isolation. We demonstrate that the LLAMA2 based model can generate both beat and semantic gestures without any audio input, suggesting LLMs can provide rich encodings that are well suited for gesture generation.
again, another thing where you'd need a new model with a pony dataset, but it could be interesting for anyone wanting to animate using generated voice. maybe worth emailing the writers to see if they'd post their code/weights
File: 1715668337046185.png (484 KB, 1500x793)
hypothetically speaking, is it possible to train a model while using it?
Like let's say you have a past conversation you had with it, and it's getting too long for the context window
now my gut instinct says 'duh no fuckin way can you train a model while also using it or holding a conversation with it' but i figured i might as well ask
If it is impossible to train and use at the same time then why is that?
duh no fuckin way can you train a model while also using it or holding a conversation with it
>Is it possible to train a model while using it?
Your sessions are saved inside text files so you could train a model on them afterward if you want. But training requires lots and lots of data.
>It's getting too long for the context window
It's not the best, but you can use solutions like summarization or lorebooks if there are important events you want to keep in memory. You can also configure some part to always be in context no matter what.

Still getting the same error with that index file. It's the exact same filesize as my current one so pretty sure they're the same. No idea why it's not working.

>Dialogue or SFX and Music?
It might be SFX and Music actually, in which case this issue doesn't matter.
Synthbot - I've forgotten the context here, am I re-exporting dialogue or SFX and Music?
From a theoretical standpoint, I don't see why not. I don't know of any LLMs that currently do this, though. Look up "online supervised learning" and "reinforcement learning".
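To show what "training while using" means in the online supervised learning sense, here is a tiny perceptron that makes a prediction and then immediately updates itself from the correct answer. This is a toy classifier, not an LLM, and the class and method names are made up; it only illustrates the predict-then-update loop.

```python
class OnlinePerceptron:
    def __init__(self, n_features, lr=1.0):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        score = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if score > 0 else 0

    def use_and_learn(self, x, label):
        # "Use" the model first, then update it from the feedback.
        guess = self.predict(x)
        error = label - guess          # -1, 0 or +1
        self.w = [wi + self.lr * error * xi for wi, xi in zip(self.w, x)]
        self.b += self.lr * error
        return guess
```

The catch for a chatbot is the `label` argument: something has to supply the right answer or a rating after every response.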
You're re-exporting the labels and audio files for just the SFX and Music for the episodes listed in >>41015670. Those are the only ones with mismatches between their label files and audio files.
Because you need an estimate of model's performance for reinforcement learning. This means you will have to rate every response yourself. Finetuning a model like this by yourself is insanity. If you want to understand what you are talking about, I'd suggest reading books on the subject.
File: 3098740.jpg (22 KB, 290x292)
Okay, new problem - turns out I only have the dialogue tracks, I didn't hold onto any of the full audio versions with the SFX and Music, presumably for space saving as I have a LOT of stuff saved. I can't find the original source audio that was used, none of the versions on yayponies seem to line up with the labels and the old torrent links don't have anyone seeding them.

Here's the original torrent, don't suppose anyone could reseed, or reupload if they have the same audio used from way back when?
File: exampl.jpg (105 KB, 922x435)
if you could automate lore book entries you could essentially create a sort of memory system
though i feel that there would be issues regarding correct key wording and making the content actually accurate and relevant to the topic at hand which it needs to memorize
im sure with testing you could get decent results though which would inevitably make context exist more so for how much of the lorebook it can recall

although now that i think about it this could cause insane bloat with it recalling even the most mundane of things and absolutely WRECKING the context with nonsense
File: 1536707646075.png (537 KB, 958x958)
furthermore, you would need a system where every time it learns something new about a topic it doesn't overwrite stuff and accidentally remove key parts of the lorebook entry

guess you could see if you could min max stuff in regards to lorebooks like that one guy who did a test and found out you could 'typelikethistosavetheamountoftokensyouhavetouse'
other than that, you would have to monitor how in-depth the automated lorebook entries are to make sure it's all relevant info

fuck i wish i had a powerful graphics card
There are XTTS models of main six?
Candid question, but why not reduce the description to a small number of words for "lesser known ponies"?
That would allow it to still know the ponies without using too many resources?

Like: "Octavia Melody is a sophisticated grey cellist pony mare"? Still better than nothing, and would keep data low?
>Long life to the PPP!
Moving from CHAI (Character AI chatbots) to giving robots a mechanism (usable) from which they can function and begin to affect the world requires servo motors (for movement functions) and well, ah...
One motive has changed. I believe that in the future autoforce will be required for all robots.
>What is "Autoforce"
It's a telekinetic leveraging superpower/ability
that uses microwave resonance and only works on metalloids/metals or on specific objects that have been coated with a special coating.
>What is it for?
For getting "realism" into our robotic pones of course
mostly for unicorns since teleportation isn't possible—not unless you're willing to "deconstruct" and reconstruct your robot and transport it and its assorted accessories somewhere and then tap on it like its memory has been vaguely veiled for what "the past 30 minutes or so" and so it "thinks" but only THINKS that it's been teleported but in reality all you did was shut the power off and moved it and then "woke it up" on-time
So giving their horn a shooty-energy weapon or magnetic-beam-thingy works much better
>But why not lasers?
You see, lasers; all cool and everything, not complaining.
But-HEY-imagine you're looking at your waifu (or she's not looking at you, whatever) and sOEMTHING happens (oh no!) :itsover:
bt then
your robot mare comes to the rescue wth her
Lasserrr powwerrr!!!1one
but uh oh!
The laser done highlights your face and then SWA-PTchooo!!!
then yur
eye is
ao o
>So thta's why we use microwave weapons?
Heh yes exaclty
these area all obvious (*(and valid)) options too
*but they aren't actually the answer we're after there though
based schizoposter
File: Blini.png (638 KB, 811x739)
Hey /ppp/, I don't know if this is the right place to ask, but I was curious if you guys knew of any sort of AI that helps translate audio to English. Specifically, I wanted to translate this video.
I've heard it's possible, but the video length and the audio quality might pose a problem.
I was just interested in seeing if you guys knew or heard of anything that could help me.
It MAY BE possible to use the above code if you replace the part that refers to the English model with "vosk-model-ru-0.42" from here: https://alphacephei.com/vosk/models
You will still need to download the raw audio and chop it up into small pieces before it will be usable by the Vosk transcript code. (I do not guarantee that this will work).
OR alternatively you can pump up the volume and pull out the phone with that "auto translate spoken language" google translation option.
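The "chop it up into small pieces" step could look roughly like the sketch below. It only uses Python's standard `wave` module and assumes the audio has already been extracted to an uncompressed PCM WAV file (Vosk's example scripts expect mono 16-bit PCM). The function names, the 30 second default and the output naming are made up for illustration, and nothing here touches Vosk itself.

```python
import wave


def chunk_bounds(total_frames, frames_per_chunk):
    """Return (start, end) frame index pairs that cover [0, total_frames)."""
    if frames_per_chunk <= 0:
        raise ValueError("frames_per_chunk must be positive")
    return [(start, min(start + frames_per_chunk, total_frames))
            for start in range(0, total_frames, frames_per_chunk)]


def split_wav(src_path, out_prefix, chunk_seconds=30):
    """Split a PCM WAV file into consecutive pieces of at most chunk_seconds.

    Writes <out_prefix>_0000.wav, <out_prefix>_0001.wav, ... and returns the paths.
    The naming scheme is an arbitrary choice for this sketch.
    """
    paths = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(src.getframerate() * chunk_seconds)
        bounds = chunk_bounds(src.getnframes(), frames_per_chunk)
        for i, (start, end) in enumerate(bounds):
            src.setpos(start)
            frames = src.readframes(end - start)
            out_path = f"{out_prefix}_{i:04d}.wav"
            with wave.open(out_path, "wb") as dst:
                # Same channels / sample width / rate as the source;
                # the frame count in the header is corrected on close.
                dst.setparams(params)
                dst.writeframes(frames)
            paths.append(out_path)
    return paths
```

Each resulting piece can then be fed to the transcription code one at a time. Cutting at fixed intervals can split a word in half, so splitting on silence instead would probably give cleaner transcripts.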
The way lorebooks are typically handled is that they don't insert any info unless one of the keywords appears in the context. So the problem, imo, is less "Octavia's context has too many tokens", and more about getting the keywords right. If your automated lorebook uses words that are too obvious then you'll get flooded with garbage bloat context, and if the keywords are intentionally very specific to avoid this, then it might not hit them even if the object/concept from the lorebook is being discussed.
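To make the keyword-trigger behaviour concrete, here is a minimal sketch of that kind of lorebook lookup. It is not SillyTavern's actual implementation. The entry format, the `scan_chars` window and the example entries are all invented for illustration.

```python
import re

# Hypothetical entries: each has trigger keywords and the text to inject.
LOREBOOK = [
    {"keywords": ["octavia", "cello"], "text": "Octavia plays the cello."},
    {"keywords": ["sugarcube corner"], "text": "Sugarcube Corner is a bakery in Ponyville."},
]


def triggered_entries(lorebook, context, scan_chars=2000):
    """Return the text of every entry with a keyword in the recent context."""
    recent = context[-scan_chars:].lower()
    hits = []
    for entry in lorebook:
        for kw in entry["keywords"]:
            # Whole-word match, so "cello" does not fire on "cellophane".
            if re.search(r"\b" + re.escape(kw.lower()) + r"\b", recent):
                hits.append(entry["text"])
                break  # one hit per entry is enough
    return hits


def build_prompt(lorebook, context, scan_chars=2000):
    """Prepend only the triggered lore to the chat context."""
    return "\n".join(triggered_entries(lorebook, context, scan_chars) + [context])
```

The trade-off described above shows up directly in the `keywords` lists: a broad keyword fires on nearly every message and bloats the prompt, while a very narrow one may never match even when the topic is clearly being discussed.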
Not to my knowledge. IIRC I tried to finetune Celestia and Twilight and it didn't fly so good

IDK what the stage of development is on this problem but here is what SillyTavern has to say about smart context
nta but this is pretty interesting, I would love to see someone test it with more weird Japanese references (I'm going to assume it will not make the "Jelly Filled Doughnuts" tier mistakes).
What's 15 cooking? I made my monthly rounds on his site to point and laugh, but noticed that it's blank. Last time that meant that the site was going to be back up within a month or so. What is he pulling out of his ass after 2 years of absence and leeching off of Patreon money?
I don't know, I think he might have just given up. It's hard to know considering he doesn't communicate very often. I really hope he hasn't given up because I just want to hear a good Derpy voice again. If he has given up, I just really wish he would open source it at least.
i heard a rumor that 15 has been battling lawyers and court cases, like viacom, hbo or hasbro, forced a DMCA to take down the site.
If this is true, I wish he would just say so instead of just going silent
When you end up in legal shit, the first thing your lawyers are gonna advise is for you to shut the fuck up, and sit the fuck down, so you don't end up digging yourself any deeper.
Yeah, but he could have just said "oh, there's been an issue with a thing, putting it on pause" or something, instead of saying it will be out next week and then disappearing.
He's fapping to dashfag's ass.
Or, he's only using his website as his stepping stone for his Computer Science Ph.D. but the university stops funding him for being a failure.
Datacenter isn't exactly cheap.

If he still lurks, he doesn't say so...
Exploring speech style spaces with language models: Emotional TTS without emotion labels
>Many frameworks for emotional text-to-speech (E-TTS) rely on human-annotated emotion labels that are often inaccurate and difficult to obtain. Learning emotional prosody implicitly presents a tough challenge due to the subjective nature of emotions. In this study, we propose a novel approach that leverages text awareness to acquire emotional styles without the need for explicit emotion labels or text prompts. We present TEMOTTS, a two-stage framework for E-TTS that is trained without emotion labels and is capable of inference without auxiliary inputs. Our proposed method performs knowledge transfer between the linguistic space learned by BERT and the emotional style space constructed by global style tokens. Our experimental results demonstrate the effectiveness of our proposed framework, showcasing improvements in emotional accuracy and naturalness. This is one of the first studies to leverage the emotional correlation between spoken content and expressive delivery for emotional TTS.
no model but you wouldn't want it anyway (trained on indian english it seems). for that one anon (if he's even still around) who seems willing to fuck around with training his own models
>indian english
>we propose a novel approach that leverages text awareness to acquire emotional styles without the need for explicit emotion labels
so this is basically ppp ngrok from 2020? At least it's nice to see some other (more mainstream) groups dipping their toes into emotional tts, if only there was an easy way to get that stuff working with rvc + sovits or some other small scale tts models.
Thank you! I'll check these out.
File: 1685784935422320.jpg (89 KB, 637x358)
Mares, if you can keep them.
Does anyone know if Bark is getting any updates to its architecture? It would be nice if there was some non-shill alternative for AI song maker models.
File: Derp.png (2.58 MB, 2500x1800)
Posting about this again as it's important for future-proofing the voice dataset - does anyone know which source was used for the show audio and/or still have them saved? It's not clear to me from reading the archives and the original Mega/torrents are dead.

I still have all the original dialogue tracks locally and so can re-generate the voice dataset from the label files if that's ever necessary for whatever reason, but no one else will be able to unless they can get hold of the same tracks, which seemingly aren't sourced from anything on yayponies. I'll upload these dialogue tracks for safekeeping but would like the full 5.1 mixes to go with them.
That's slightly worrying. Are there other parts of the project that are living only on a few computers?
I still have them saved. The source was Netflix. I believe I uploaded a copy to Rome's archive.
Don't think so, at least not for the voice dataset. The master file has several backups, just seems to be the source audio that has the issue.

Found this:

Which contains the dialogue tracks (not full 5.1) for S1-8 and the full 5.1 audio for S1 only. The S1 files appear to line up correctly with the label files so I'll grab those, but that still leaves 2-9 unaccounted for. Am I missing something else?
If the others aren't elsewhere in Rome, could you re-upload whatever you still have saved? Only need to keep it up for as long as it takes for me to clone it. Should then make sure Rome has all of this as well.
Sure, you can send it over. I stopped mirroring these files a long time ago because it was a chore working with Mega.
Also I think you mean this, maybe:
I just haven't gotten around to sorting the uploads into the main archive
>it was a chore working with Mega.
Fair enough, I'll put everything in one place once I've got it all assembled and should be able to do a GDrive version for you as well.

Looks good for S1-6, syncs with the label files - but S7-9 are 2.0, not 5.1. Anywhere else they may be stored?
If it isn't static and files get added, then an upload location that is easily syncable is always the best solution
AFAIK that's all I am aware of now
Looking at my archives, I don't think we ever got S9 from Netflix. I'm thinking that Netflix just didn't have S9 when we were doing things. I just have the iTunes rips for S9. I guess I'll see about uploading things again somewhere. It's about 270gb so it'll take a while. Maybe I'll upload it to Rome so that it'll be a bit more permanent.
I have two old archives in my folder:
2019 07 08: s1-2-3-4.zip (~1.4Gb)
2019 06 22: Season 9 Audio 1-12.rar (~1.4Gb)
Is it of any interest?
i found some pastebin accounts with greens that dont seem to be archived on ponepaste, would any anon be interested in helping archive them? i feel bad for encountering them now because im about to leave home for several days.
Uploading here:

Will be a while before it's done.
Maybe, if anything from >>41100938 is still missing. I'll review it when it's ready and let you know.

Thanks for the help everyone.
File: OIG3.jPjOpAz_.LtmjoqWlPaw.jpg (140 KB, 1024x1024)
dreaming of electronic mares
>EQG and G5 are not welcome.

I've been meaning to ask but wasn't that already done by other people anyway?
Maybe, but it's not what this thread is about.
Has anyone dug through the DHX leaks for the song pro tools stems?
Alright, I think it's all up. Let me know if you think anything's missing.
I'm guessing not. I got curious after listening to some music demos and I'm downloading it now (will take a while). At minimum there seem to be a lot of takes for the EQG songs which have vocal lines that are doubled/layered in the version that we have in master file 2.

Ex: https://files.catbox.moe/xskary.mp3
Man, I still can't believe that a studio fuck up could turn into such an amazing goldmine of resources.
File: Untitled.png (1.44 MB, 1045x1361)
Pandora: Towards General World Model with Natural Language Actions and Video States
World models simulate future states of the world in response to different actions. They facilitate interactive content creation and provide a foundation for grounded, long-horizon reasoning. Current foundation models do not fully meet the capabilities of general world models: large language models (LLMs) are constrained by their reliance on language modality and their limited understanding of the physical world, while video models lack interactive action control over the world simulations. This paper makes a step towards building a general world model by introducing Pandora, a hybrid autoregressive-diffusion model that simulates world states by generating videos and allows real-time control with free-text actions. Pandora achieves domain generality, video consistency, and controllability through large-scale pretraining and instruction tuning. Crucially, Pandora bypasses the cost of training-from-scratch by integrating a pretrained LLM (7B) and a pretrained video model, requiring only additional lightweight finetuning. We illustrate extensive outputs by Pandora across diverse domains (indoor/outdoor, natural/urban, human/robot, 2D/3D, etc.). The results indicate great potential of building stronger general world models with larger-scale training.
check out the website for examples of how it works. image to video with text guiding. weights are up though they covered their ass by asking you to agree to not do bad stuff with it first
File: Untitled.png (90 KB, 1042x262)
Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models
>We propose an unsupervised adaptation framework, Self-TAught Recognizer (STAR), which leverages unlabeled data to enhance the robustness of automatic speech recognition (ASR) systems in diverse target domains, such as noise and accents. STAR is developed for prevalent speech foundation models based on Transformer-related architecture with auto-regressive decoding (e.g., Whisper, Canary). Specifically, we propose a novel indicator that empirically integrates step-wise information during decoding to assess the token-level quality of pseudo labels without ground truth, thereby guiding model updates for effective unsupervised adaptation. Experimental results show that STAR achieves an average of 13.5% relative reduction in word error rate across 14 target domains, and it sometimes even approaches the upper-bound performance of supervised adaptation. Surprisingly, we also observe that STAR prevents the adapted model from the common catastrophic forgetting problem without recalling source-domain data. Furthermore, STAR exhibits high data efficiency that only requires less than one-hour unlabeled data, and seamless generality to alternative large speech models and speech translation tasks.
seems useful considering 1 hour of unlabeled data can lead to such improvements. so if you have a custom case where you need a specific accent understood then this method seems to be the way to do it
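For anyone unsure what "relative reduction in word error rate" means in that abstract, here is a small self-contained sketch of how WER and a relative reduction are normally computed. This is the standard word-level edit distance definition, not code from the STAR paper, and the 0.200 / 0.173 numbers in the usage note are made up to show the arithmetic.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (here: lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion (word missing from hyp)
                cur[j - 1] + 1,          # insertion (extra word in hyp)
                prev[j - 1] + (r != h),  # substitution, or free if words match
            ))
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline_wer, adapted_wer):
    """Fraction of the baseline error that the adapted model removed."""
    return (baseline_wer - adapted_wer) / baseline_wer
```

As a usage example, a model going from a WER of 0.200 to 0.173 has a relative reduction of (0.200 - 0.173) / 0.200 = 0.135, i.e. 13.5%, even though the absolute drop is only 2.7 points.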
Internal song charts!
Requesting your choice for a voice for the followjng
>The first time I sniffed a filly, she was two miles from my house. The trace amount of intoxicating filly pheromones induced a lust for preadolescent filly flesh with such terrifying intensity I immediately began a dead sprint directly towards the origin point. Basically my reaction was like the rage virus from 28 Days Later except also with a boner.
>Within four minutes I set upon her previously unmolested body like a rabid pitbull on an unguarded toddler and within four minutes and forty-five seconds my enraged member began hemorrhaging volumes of glutinous ejaculate at forces sufficient to immediately terminate her ovarian virginity on both sides of her bruised, overinflated uterus.
>The cavitation bubble in my prostate formed by the instantaneous vacuum left in the wake of my ejected semen collapsed, creating a shockwave that left me with two bloodshot eyes, temporarily deaf in one ear, and bad hiccups. The look on her sister's face was priceless.
Hey BGM, I was going through my music library and I rediscovered this dumb song and I thought of pony, Starlight and Glimmer, and you. If you're interested anyway.
No problem.
I am not as active as before, but I will lurk and give it if needed.

Yeah, post the links.
But if I recall correctly, there was a thread dedicated to saving pony content, no?
I'll take what I can get
Some stuff aged well!
Some other, well... Not so much.
You faggots convinced me to start my own ai business. I promise when I make enough money, Ill come back and if you guys decided to go main stream. ill use the resources I acquired to help
Don't know if it means anything but you have my word ill be back someday.

Till then thank you faggots for everything.
Well, good luck.
And if you succeed, we will be glad to hear about it!
Also, put a "Thanks PPPV" somewhere in your website...
Hi Haysay maintainer!
As asked, I'm posting the error here.
I tried my best to remove "personal info" (?)

Traceback (most recent call last):
  File "/hay_say_ui/generator.py", line 25, in generate_and_prepare_postprocessed_display
    generate(cache_type, gpu_id, session_data, selected_architectures, user_text,
  File "/hay_say_ui/generator.py", line 55, in generate
    hash_output = process(cache, user_text, hash_preprocessed, selected_tab_object, relevant_inputs,
  File "/hay_say_ui/generator.py", line 97, in process
    send_payload(payload, host, port)
  File "/hay_say_ui/generator.py", line 134, in send_payload
    raise Exception(message)
Exception: An error occurred while generating the output:

Traceback (most recent call last):
  File "/so_vits_svc_4_server/main.py", line 41, in generate
    execute_program(input_filename_sans_extension, character, pitch_shift, predict_pitch, slice_length,
  File "/so_vits_svc_4_server/main.py", line 210, in execute_program
    model_path, config_path = get_model_and_config_paths(character)
  File "/so_vits_svc_4_server/main.py", line 127, in get_model_and_config_paths
    model_filename, config_filename = get_model_and_config_filenames(character_dir)
  File "/so_vits_svc_4_server/main.py", line 133, in get_model_and_config_filenames
    return get_model_filename(character_dir), get_config_filename(character_dir)
  File "/so_vits_svc_4_server/main.py", line 138, in get_config_filename
    raise Exception('Config file not found! Expecting a file with the name config.json in ' + character_dir)
Exception: Config file not found! Expecting a file with the name config.json in /models/so_vits_svc_4/characters/Applejack (singing, PS1)

Payload: {"Inputs": {"User Text": "Those horsefuckers will give you a fright[shortened to fit in one message]", "User Audio": "22b52ca95fb9cf1a43cd"}, "Options": {"Architecture": "so_vits_svc_4", "Character": "Applejack (singing, PS1)", "Pitch Shift": 0, "Predict Pitch": false, "Slice Length": 0.0, "Cross-Fade Length": 0.0, "Character Likeness": 0.0, "Reduce Hoarseness": false, "Apply nsf_hifigan": false, "Noise Scale": 0.4}, "Output File": "dbcba8199d50888b393a", "GPU ID": "", "Session ID": "ed4b3e9234434983ae8c758baeac134b"}

Input Audio Dir Listing: .git, .github, .gitignore, LICENSE, README.md, cluster, configs, configs_template, data_utils.py, dataset_raw, filelists, flask_api.py, flask_api_full_song.py, hubert, inference, inference_main.py, logs, models.py, modules, onnx_export.py, onnx_export_speaker_mix.py, onnxexport, preprocess_flist_config.py, preprocess_hubert_f0.py, pretrain, raw, requirements.txt, requirements_win.txt, resample.py, sovits4_for_colab.ipynb, train.py, utils.py, vdecoder, wav_upload.py, webUI.py, __pycache__, results
Wanted to ask, is there anything, and I mean ANYTHING, out there that allows you to generate an accompaniment based on a vocal melody? I'm not talking about Suno or Udio, I mean like an AI system where it analyzes the vocal melody audio and makes an instrumental based off of it. I know it's possible, I saw a paper of something called SingSong that does that, I found it last year, but they haven't said anything about it since. I'm looking for something in that same vein, sort of...
Thank you for bringing this error to my attention. Apparently, a lot of the multi-speaker models had broken links to their config and pytorch state dictionary files. I wonder how long they've been like that... Nevertheless, I believe I have fixed them all now.
So after a fairly exhaustive review, here are the main things I found that don't seem to be included in master files yet.

>Raw recordings, including harmonies and unused takes, for these songs:
A Kirin Tale
The Magic Inside
Equestria The Land I Love
Get The Show On The Road
Cheer You On
Find The Magic
Run And Break Free
True Original
Let It Rain
So Much More
Five To Nine
I'm On A Yacht
All Good

>Stems for songs not included in our dataset?
Shake Things Up
Past Is Not Today
Friendship Through The Ages
Life Is A Runway

>Some broadcast stems that have a few separated mono/reverb-less lines that are an improvement over our version:
Friendship U (has isolated Flim/Flam monos)
Fit Right In (isolated Rarity/Yona monos for ending)
One More Day (has reverb-less stems, still lots of crowd vocals)
Lotta Little Things (there's an extra Luna track)
Your Heart Is In Two Places (has isolated Sweetie Belle/Scootaloo monos; barely matters because they're alternating most of the time)
We've Come So Far (has reverb-less stems)

>Other data notes
Almost all of the lines for Coinky-Dink World have vocal doubling but no noise label

>Other neat items that don't have much real utility
There are stems for Shannon Chan-Kent's demo of A Kirin Tale. Doesn't sound much like Pinkie though.
These are all the non-special source dialogue tracks and look all fine from taking a random sampling, so good that we have those uploaded. Still missing the 5.1's for S7-9 from >>41100446, which are the only things needed now to complete the set.

Could we all take one more look at our archives to check for 5.1 audio mixes for S7-9? It may well be that those were never uploaded anywhere and so aren't *technically* missing, but it would still be best to have them for completeness' sake. Meanwhile, I'll work on putting this all together in one package for simpler archiving.

If your S9 audios are 5.1, could you upload those please?
Me too, and I know one other person doing the same. If you're in the SF Bay Area, I can try to get us all together. Or we can meet up at Mare Fair.
(16 minutes of Autumn Blaze singing.)
There were 5.1 mixes in there for s7-9, it's just that Netflix didn't have 5.1 mixes for these if I remember right. The iTunes sources are 5.1 as well as the TrollHD ones.
Same anon, hoping this isn't a lost cause, because I'm just curious.
I'll check that this afternoon, after work
Here's the material roughly separated by song and character. (Some of the recordings seem to be "mislabeled"/using a different track for a VA--sometimes Ashleigh does her AJ voice on a Rainbow track, and sometimes Fluttershy shows up on a Pinkie track). Did a bit of dynamic splitting in REAPER to cut out some empty space.
File: large.png (55 KB, 1024x1024)
Heyyyy haven't been here in forever. I was BFDIAnon but I don't really watch BFDI anymore and in hindsight my immediate jump to namefaggotry was cringe as fuck.
Anyways not sure how often Poopsikins still comes around here, but I trained a StyleTTS2 model on his Adventure Time dataset, and it works with the GUI by effusiveperiscope. It says 29 epochs in the filename but it's actually somewhere between 59 and 69 because Colab crashed or timed out a few times as I was training. I also added a few Fionna and Cake characters that weren't in the dataset before. https://mega.nz/file/aBkwwJrI#beuyzKysMtZXp4yNQoW6xBpVrI4kDtVS1zy-Y9Hwxgs
Character list: https://pastebin.com/UGVLvYXd
Wow, that's a name I haven't heard in a long time!
Welcome back.
File: Rainicorn.png (395 KB, 767x1300)
I'm still around, I actually trained a few AT voices last year locally, had a blast running songs through Marceline's voice.
I lost the original audio source files (mostly ripped audio from games and animatics) I had on my hard drive so I need to recreate/improve the dataset one day. Multiversus has a few AT characters I can rip new lines from.
I never would've thought someone would actually add to the janky dataset I made. Thank you, and welcome back!

Here's a test vid I put together when I first trained Jake last year using Sovits:
File: 1687895489046667.png (263 KB, 1000x1000)
Oh you know the usual, jams, marmalade, chutneys and such.
More like maremalade
Mares like maremalade
Mares mare maremalade
Mares mare maremamare
File: 1692630539692137.png (310 KB, 1229x1200)
I wonder what Hayjuice would taste like.
I mean, oat milk is not bad, so hay drink could be... interesting ?
just a question fags, from my dumbass head, does anyone here know the toll where u can use a mic to talk to ponies, roleplay with em and stuff and they would reply back to you, i would like to be a complete degan and talk to pinkie pie and other ponies and being a perv fag and fap over it
Not 100% sure, but anons in the chag thread may know more about it. I haven't played around with SillyTavern, but a "speech to text" program should exist, and I believe some kind of TTS output reader addon should be possible as well (whether it can be connected to StyleTTS or something else to get pony voices is a different problem).
There was a project like that some time ago, but it required some (a lot of) knowledge.
Not something turnkey...
>create 2~5 seconds of animation by providing two image frames as input
obviously this still looks rough as shit, however it means that progress in proper AI animation has a decent foundation to build on.
damn, that's impressive!
File: Spoiler Image (73 KB, 1318x276)
Does anybody happen to have the old vocaroo audio of this shitpost? Sorry for the non-pony request, but the RVC threads are pretty dead on /v/.
Turns out it's not dubbed with RVC but by decent-sounding impersonators
nta, but damn, they are good
File: 3059014.png (553 KB, 5000x5000)
Can't wait to hear Anons singing like a Disney princess through Fluttershy, then having an orchestra swell to meet them.
A musical ControlNet would be pretty great, making a pony musical would be a dream come true
File: Imagine (2).png (470 KB, 1139x1079)
Not gone yet. Mare is there.
A Survey of Deep Learning Audio Generation Methods
>This article presents a review of typical techniques used in three distinct aspects of deep learning model development for audio generation. In the first part of the article, we provide an explanation of audio representations, beginning with the fundamental audio waveform. We then progress to the frequency domain, with an emphasis on the attributes of human hearing, and finally introduce a relatively recent development. The main part of the article focuses on explaining basic and extended deep learning architecture variants, along with their practical applications in the field of audio generation. The following architectures are addressed: 1) Autoencoders 2) Generative adversarial networks 3) Normalizing flows 4) Transformer networks 5) Diffusion models. Lastly, we will examine four distinct evaluation metrics that are commonly employed in audio generation. This article aims to offer novice readers and beginners in the field a comprehensive understanding of the current state of the art in audio generation methods as well as relevant studies that can be explored for future research.
just a survey but might be interesting for anyone wishing to know more
Accent Conversion in Text-To-Speech Using Multi-Level VAE and Adversarial Training
>With rapid globalization, the need to build inclusive and representative speech technology cannot be overstated. Accent is an important aspect of speech that needs to be taken into consideration while building inclusive speech synthesizers. Inclusive speech technology aims to erase any biases towards specific groups, such as people of certain accent. We note that state-of-the-art Text-to-Speech (TTS) systems may currently not be suitable for all people, regardless of their background, as they are designed to generate high-quality voices without focusing on accent. In this paper, we propose a TTS model that utilizes a Multi-Level Variational Autoencoder with adversarial learning to address accented speech synthesis and conversion in TTS, with a vision for more inclusive systems in the future. We evaluate the performance through both objective metrics and subjective listening tests. The results show an improvement in accent conversion ability compared to the baseline.
>This repo combines a Tacotron2 model with a ML-VAE and adversarial learning to target accent conversion in TTS settings (pick a speaker A with and assign them accent B).
probably not useful out of the box but adversarial learning to get better accent conversion might be a useful technique to have
ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec
>In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style, merely based on a few seconds of audio prompt and a simple textual style description prompt. Prior zero-shot TTS models and controllable TTS models either could only mimic the speaker's voice without further control and adjustment capabilities or were unrelated to speaker-specific voice generation. Therefore, ControlSpeech focuses on a more challenging new task-a TTS system with controllable timbre, content, and style at the same time. ControlSpeech takes speech prompts, content prompts, and style prompts as inputs and utilizes bidirectional attention and mask-based parallel decoding to capture corresponding codec representations in a discrete decoupling codec space. Moreover, we discovered the issue of text style controllability in a many-to-many mapping fashion and proposed the Style Mixture Semantic Density (SMSD) model to resolve this problem. SMSD module which is based on Gaussian mixture density networks, is designed to enhance the fine-grained partitioning and sampling capabilities of style semantic information and generate speech with more diverse styles. In terms of experiments, we make available a controllable model toolkit called ControlToolkit with a new style controllable dataset, some replicated baseline models and propose new metrics to evaluate both the control capability and the quality of generated audio in ControlSpeech. We hope that ControlSpeech can establish the next foundation paradigm of controllable speech synthesis.
weights on a google drive. sounds okay? authors at least have the right spirit fully aiming for voice cloning
File: themoreyouknow.jpg (144 KB, 561x370)
It's interesting to see new approaches constantly being worked on.
>ChatTTS is a powerful text-to-speech system. However, it is very important to utilize this technology responsibly and ethically. To limit the use of ChatTTS, we added a small amount of high-frequency noise during the training of the 40,000-hour model, and compressed the audio quality as much as possible using MP3 format, to prevent malicious actors from potentially using it for criminal purposes.
Looks like Stability just released Stable Audio's weights.

There's a lot we could potentially do with this:
>Infinite digital horsey noises
>Experiment with character voices to create SD voice models of them (Even though it's not intended for vocals, worth trying anyway)
>Feed it all of FiM's music then generate new pony music
>Feed it all of FiM's sound effects and generate a bunch more
>Train it to make lewd horse noises so BGM can make another Pony Zone

From my earlier testing on https://www.stableaudio.com/ it had quite a few limitations, but seeing how much Purplesmart improved upon base SD, maybe we can do something similar here?
>Open-source VQ encoder and Lora training code
>Open-source the 40k hour version with multi-emotion control
ok, there is potential here.
fucking finally, the model is 5GB so hopefully once someone makes an auto1111-like UI for it (and some dumbdumb-friendly training process) we will get so much fucking mare music.
An Independence-promoting Loss for Music Generation with Language Models
>Music generation schemes using language modeling rely on a vocabulary of audio tokens, generally provided as codes in a discrete latent space learnt by an auto-encoder. Multi-stage quantizers are often employed to produce these tokens, therefore the decoding strategy used for token prediction must be adapted to account for multiple codebooks: either it should model the joint distribution over all codebooks, or fit the product of the codebook marginal distributions. Modelling the joint distribution requires a costly increase in the number of auto-regressive steps, while fitting the product of the marginals yields an inexact model unless the codebooks are mutually independent. In this work, we introduce an independence-promoting loss to regularize the auto-encoder used as the tokenizer in language models for music generation. The proposed loss is a proxy for mutual information based on the maximum mean discrepancy principle, applied in reproducible kernel Hilbert spaces. Our criterion is simple to implement and train, and it is generalizable to other multi-stream codecs. We show that it reduces the statistical dependence between codebooks during auto-encoding. This leads to an increase in the generated music quality when modelling the product of the marginal distributions, while generating audio much faster than the joint distribution model.
>We aim to release the weights of our 32kHz EnCodec-MMD soon, please bear with us.
give a listen to the samples. sounds pretty good and a serious upgrade over musicgen. forget if you guys do this over here too
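For the curious, the "proxy for mutual information based on maximum mean discrepancy" idea can be sketched in a few lines. This is NOT the paper's implementation. It is a toy pure-Python version that assumes a Gaussian kernel and approximates the product of the marginals by re-pairing one stream with shifted indices. `independence_penalty`, `shift` and `sigma` are names invented for this sketch.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """RBF kernel between two equal-length tuples of numbers."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def mmd2(xs, ys, sigma=1.0):
    """Biased (V-statistic) estimate of squared MMD between two sample sets."""
    def mean_k(a, b):
        total = sum(gaussian_kernel(p, q, sigma) for p in a for q in b)
        return total / (len(a) * len(b))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)


def independence_penalty(stream_a, stream_b, shift=3, sigma=1.0):
    """MMD^2 between paired samples and deliberately mis-paired samples.

    Pairing a[i] with b[i] samples the joint distribution. Pairing a[i]
    with b[(i + shift) % n] breaks the pairing and stands in for the
    product of the marginals (an assumption made for this sketch).
    """
    n = len(stream_b)
    joint = list(zip(stream_a, stream_b))
    mispaired = list(zip(stream_a, [stream_b[(i + shift) % n] for i in range(n)]))
    return mmd2(joint, mispaired, sigma)
```

Two streams that are copies of each other give a clearly positive penalty, while a stream that is constant (and so carries no information about the other) gives zero. Pushing that number down during training is the intuition behind making the codebooks closer to independent.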
This is interesting, if the guy figures out how to stop it from randomly adding a vocalist it could be pretty useful as well.
That's the one made by Meta, yeah? If their Audiobox AI is anything to go by, it's definitely got horsey data in it.

I look forward to when our AI mares can neigh with their own individual vocal qualities.
hey HydrusBeta (pretty sure you're the one running haysay.ai, right?), the RVC models on haysay.ai for Nightmare Moon and Rainbow Dash stopped working for me just now. Weird, since they've been working fine for me in the past few days
here's the error it gave me https://pomf2.lain.la/f/51gvt0ij.txt
I was still able to use sovits 5 fine, but it just doesn't give the result I want for speaking text
Thanks for hosting haysay by the way!! I've gotten a lot of mileage out of it, excited to show off the stuff we've been working on
It should be fixed now. Give it another try. I do not know how, but a .json file used internally by RVC somehow got wiped clean and that prevented RVC models from loading. I am pretty sure I have run into this issue once before and it is very strange.
It makes me really happy to hear that you are getting a lot of use out of Hay Say!
Can't confirm the validity of any of this but people are chatting about 15 and the site, for anyone curious >>41134829
File: rarity drink.jpg (71 KB, 900x712)
Different Anon, but I absolutely love Hay Say! I'm currently doing some finishing touches on a Rarity cover.
This is a response to the Anon from a few threads ago who asked whether there is currently a way to talk to a computer and have it respond in their waifu's voice. The scripts should get you as close as possible to that idea. I haven't messed around with SillyTavern, but I would imagine it requires a semi-beefy GPU to run both a text AI model and RVC at the same time.

