What is a "Step"?

Looking into the documentation for mimic2 and mimic-recording-studio GitHub repos. There is a LOT of documentation stating the number of “steps” that are needed to be performed, but there does not seem to be a definition of what a step actually is.

For example, there is documentation that states that a model is output as a checkpoint every 1,000 steps. Mainly looking to learn if a step is the same thing as a recording, e.g. phrase you recorded … or is a step something different?


Hey there - welcome to the Forums!

Steps in this context relate to the machine learning training process. Without going into too much detail, the voice models used by Mimic2 and other TTS services are generally recurrent neural networks. They are trained by performing a learning process over and over again, hundreds of thousands of times. Each of these learning iterations is called a step. There’s no magic endpoint to this process, just like there’s no endpoint if you as a human decided to learn about “cars”. Because of that, the training process saves a checkpoint of the model every 1k steps.
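To make the step/checkpoint relationship concrete, here is a minimal sketch in plain Python. It is not mimic2's actual training loop; the function name `train` and the list standing in for saved model files are assumptions made for illustration only.

```python
def train(total_steps, checkpoint_interval=1000):
    """Toy loop: one step = one learning iteration on one batch of data."""
    checkpoints = []
    for step in range(1, total_steps + 1):
        # A real trainer would load a batch, compute the loss,
        # and update the network weights here.
        if step % checkpoint_interval == 0:
            # Stand-in for writing the model to disk.
            checkpoints.append(step)
    return checkpoints


print(train(3500))  # [1000, 2000, 3000]
```

Note that the step counter advances once per iteration regardless of how many recordings exist, which is why a step is not the same thing as a recording.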

So these steps are very much when you get to the model training stage. There are people around here that might be able to help you with that too.

Also just flagging that this PR will likely be merged in the very near future:

I’d suggest incorporating it before starting your recording. More info in the issue here:

While I agree about using the recording studio, I would strongly caution you NOT to try using mimic2, and instead, use coqui.ai (formerly Mozilla TTS). You will save yourself a lot of frustration.


The LJSpeech dataset (which is a kind of reference dataset for English language) consists of more than 13000 recordings with a total of 24h recording time.

For more details check out the Coqui-TTS Wiki, sections “FAQ” and “What makes a good dataset”.

Community member @Thorsten recently gave a public talk on his journey of creating a German-language dataset.


As a follow-on to this, “training” a model takes a lot of time and trial and error. 1000 steps will not result in anything practically usable; it usually takes about 20k steps to find out whether the model has reached a good alignment, and over 300,000 to finish most models (some go to 900k). You’ll probably train three or four times to 100k before you find results that you like. Training takes an NVIDIA GPU with a good amount of memory (8 GB+) to complete in a reasonable time.

For the dataset, you’ll want at least ten hours of very high quality recordings; the more you have, the better your end result is likely to be. Definitely watch the talk by Thorsten that Dominik linked; he covers a lot of this in more detail.


Thanks @baconator and @Dominik

I will watch that public talk today.

I am still curious if anyone can answer the remaining question I had, which is:

How many phrases do you need to record to have 1,000 steps created?

I am interested in this as a programmer. I am fully aware that 1,000 steps will not result in anything usable. I am more just interested in learning the expected behavior of the software … more particularly, I am looking to learn if there is an estimated number of recorded phrases, or even a specific number of hours of recordings, needed to generate 1,000 steps in this software.

The reason I want to know that is so I can generate some user interface elements for the user on the front-end of the mimic-recording-studio software I have forked, so they can know when to expect a model to be generated if they run the training process.

e.g. it’s useless to run the training software if you have zero recordings, but it might also be useless to run it if you have 1,000 recordings. Sure you can still run an analyzer to inspect the quality of what you ran through the processor, but at some point you will be able to train a model in mimic2, and since the default is to export a checkpoint every 1,000 steps, the obvious question is … “How many recordings do I need to have to meet that minimal threshold”

If I know the answer to that, I can check that with software. If you do not have X recordings, or X hours of recordings, then you will never even hit the minimal threshold needed to generate the number of steps needed to generate a model / checkpoint.

And I know you can change the number of steps needed to generate a model, but it still does not answer the question of understanding the ratio of number of recordings to number of steps …

so, if someone can clarify that for me, it would mean a LOT. If it can’t be answered because there is no way to know, that’s fine too, just seemed like it must be answerable by someone with experience with this that could go, “oh, ya, to generate at least 1,000 steps in the training process, you would need about X hours of recordings” … and that’s what I want to know, since the original question was answered that a step is not a recording. Just looking to understand the relationship between a recording and a step.

Sorry for the long post, ha ha

One clip is enough to run 1000 steps (and incredibly overfit the data, but for hypothetical purposes, whatever).

A step is a single run-through of one batch of data. It may be a whole epoch or, more usually, less than one. A batch size of N items from an M-sized epoch means you’d need M/N steps (rounded up) to handle one epoch. There can be other factors at play here as well, but in general…
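As a rough sketch of that arithmetic: with M clips in the dataset and N clips per batch, one pass over the data (one epoch) takes about M divided by N steps, rounded up. The function names below are made up for illustration, and the batch size of 32 is an assumed example value, not a confirmed mimic2 default.

```python
import math


def steps_per_epoch(num_clips, batch_size):
    """Steps needed to see every clip once (the last partial batch counts)."""
    return math.ceil(num_clips / batch_size)


def epochs_for_steps(num_clips, batch_size, target_steps):
    """How many passes over the data a given number of steps amounts to."""
    return target_steps / steps_per_epoch(num_clips, batch_size)


# An LJSpeech-sized dataset (13,100 clips) with an assumed batch size of 32:
print(steps_per_epoch(13100, 32))        # 410
# A single clip: every step is a full epoch, so 1000 steps = 1000 epochs.
print(epochs_for_steps(1, 32, 1000))     # 1000.0
```

So the number of recordings changes how many times each clip is revisited before step 1000, not whether step 1000 can be reached.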

The number of steps to complete training something is “it depends.” This is due to dataset size, desired end result, how well things align and training is working, the phase of the moon, and quality of chicken you sacrificed to the gods. After numerous observations of other people’s data as well as my own, there are a couple of spots in training I look at very closely. First is about 1000 steps in, to see that things have started to align: the TensorBoard charts are looking more like vague lines than clouds, and there’s a coherent sense to the sample inference. Second is 20k-40k, when things really start to come together if all is going well. After that, I check it every 50k steps or so to ensure things are on track.


So let’s forget about the number 1000 and worry more about what is “the goal you’re trying to accomplish”.

Are you trying to figure out how many steps you can run and get something?
Are you trying to make a usable model?
Are you trying to make a high quality model?
Is it an intellectual pursuit only?

That will get you somewhere. 1000 steps will get you 1/10th of the way to your daily total on your fitbit.

I am honestly just trying to figure out how many recordings I need to create in order to trigger the FIRST checkpoint, where an actual model is created. I do not intend to USE that model, but rather, I am looking to LEARN what the requirements are to even have the model created in the first place.

The default in mimic2 is 1,000 steps. So the question is, how many recordings should I expect to need to create to hit that first milestone? Do I need 10,000 recordings? Your previous comment made it sound like I could technically hit 1,000 steps with a single recording.

So ya, I guess, just looking to learn what I have kind of been asking, mainly, how the number of recordings correlates to the number of steps, e.g. is there some magic threshold I need in number of recordings to trigger the checkpoint_interval.

As I stated above, I am working on a forked build that I want to be able to let the user know WHEN they can expect a model to be generated. It is impossible to know WHEN that will happen unless I understand the relationship between number of recordings to number of steps.

So, very specific version of what I am looking to learn, even though you said to forget 1,000, I need to use 1,000 here because it is literally the default value for checkpoint_interval in the mimic2 software… so:

In Mimic2, using the Mimic Recording Studio’s default CSV file reading into prompts, how many recordings would a person need to do, to hit that 1,000 Step checkpoint_interval where a model is generated? ( Keep in mind, I do not care if it is usable, I literally just want to know the answer to that question so I can complete the software I am building where I can let the user know if they even have enough recordings saved to where a model can even be generated to begin … totally not even getting into whether it will be a GOOD model … literally just want to know that FIRST checkpoint_interval and how many recordings they need to hit it … if that can be answered )

Theoretically 1 clip. Practically, might need to experiment to see what you can get away with? Quality of clips plus quantity will make things improve from there, which is why the 10k clips/10+ hours minimum standard gets tossed around in a bunch of tacotron repos. Google used something like 32 hours on their first few runs and have moved to over 100 hours now.

A model gets written after the interval, so, as long as training doesn’t bomb out, then off you go.
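For the front-end use case, that suggests the recording count only gates whether training can start at all, while the wait for the first checkpoint depends on how long a step takes on the user's hardware. A minimal sketch, assuming a hypothetical `seconds_per_step` value that the UI would have to measure or ask for (nothing in mimic2 or Mimic Recording Studio is known to expose it under that name):

```python
def first_checkpoint_estimate(num_clips, seconds_per_step,
                              checkpoint_interval=1000, min_clips=1):
    """Return estimated seconds until the first checkpoint, or None.

    min_clips is a UI policy choice: 1 is the theoretical minimum,
    but a fork could set it much higher to discourage useless runs.
    """
    if num_clips < min_clips:
        return None  # nothing to train on yet
    return checkpoint_interval * seconds_per_step


print(first_checkpoint_estimate(0, 2.0))    # None
print(first_checkpoint_estimate(250, 2.0))  # 2000.0
```

Raising `min_clips` toward the commonly quoted 10k clips would turn the same check into a "is this likely to be worth training" hint rather than a hard limit.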

To be blunt, your branch idea doesn’t seem like a very good idea. I’d rather know how to get a viable model than to know I could get a model that just chirps and echoes horribly.

Try and incorporate the no longer pending length fix pr as well.


Hey, looks nice - I particularly like the auto-review of recordings.

I’m curious about the STT verification too. What happens if the STT transcription doesn’t match the corpus provided sentence? Does it show a warning so the user can review it but lets them continue if they disagree with the STT?

If you’re reading from a known list of sentences, and you’ve read it right, then it should [closely] match one of them. If the STT results don’t come close to matching then it should be flagged. In this case it’s even more focused, as you know the single sentence it should be matching, so there’s probably more leniency available; any word error rate (WER) over 33% might merit a look?
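For reference, word error rate is the word-level edit distance (substitutions, insertions, deletions) between the expected sentence and the STT transcript, divided by the number of expected words. A small self-contained sketch follows; the function name is an assumption, it only lowercases and splits on whitespace, and punctuation handling is left out.

```python
def word_error_rate(expected, heard):
    """WER in percent between the prompt and the STT transcript."""
    ref = expected.lower().split()
    hyp = heard.lower().split()
    if not ref:
        raise ValueError("expected sentence is empty")
    # prev[j] holds the edit distance between the ref words seen so far
    # (before the current one) and the first j hyp words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return 100.0 * prev[-1] / len(ref)


print(word_error_rate("pull the chute now", "pull the shoot now"))  # 25.0
```

With a threshold of 33, that one-word slip in a four-word phrase would pass, while two wrong words out of four (50.0) would be flagged for review.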

@gez-mycroft yes, it’s exactly as you stated. If it looks like you read it wrong, it colors the words STT heard incorrectly as red with a strikethrough. If you hover over the incorrect word, it will give you a tool tip with a message that tells you the expected word. You can listen to your recording and decide if STT just got it wrong and move on.
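The word-level comparison behind that kind of highlighting can be sketched with Python's standard `difflib` (the fork itself does this in a React component, so this is an illustration of the idea, not its code). The function name `flag_mismatches` is made up for this example.

```python
from difflib import SequenceMatcher


def flag_mismatches(expected, heard):
    """Return (expected_words, heard_words) pairs wherever the two differ."""
    exp = expected.lower().split()
    got = heard.lower().split()
    matcher = SequenceMatcher(a=exp, b=got, autojunk=False)
    flagged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # heard words would get the red strikethrough;
            # expected words would go in the tooltip.
            flagged.append((" ".join(exp[i1:i2]), " ".join(got[j1:j2])))
    return flagged


print(flag_mismatches("pull the chute now", "pull the shoot now"))
# [('chute', 'shoot')]
```

An empty list means the transcript matched the prompt word for word.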


What I am also adding is a quick mapping to fix things that are just going to be wrong. STT will hear numbers correctly and write them as digits, so the spelled-out numbers will not match. In my fork, I am mapping known fixes so it will still mark phrases as correct even when the way they were written by STT is slightly different than the written phrase.
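One way such a mapping could work is to normalize both the prompt and the speech-to-text output before comparing them. This is only a sketch: the `KNOWN_FIXES` entries below are invented examples, not the mapping used in the fork.

```python
import re

# Hypothetical fixes for illustration; a real table would be built
# phrase by phrase from the prompts in the CSV file.
KNOWN_FIXES = {"3": "three", "1st": "first", "250": "two hundred fifty"}


def normalize(text, fixes=KNOWN_FIXES):
    """Lowercase, drop punctuation, and rewrite known STT spellings."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(fixes.get(w, w) for w in words)


def matches(expected, heard):
    """True if prompt and transcript agree after normalization."""
    return normalize(expected) == normalize(heard)


print(matches("I have three dogs.", "i have 3 dogs"))  # True
```

A per-word table like this cannot handle homophones that depend on context, so those would still need phrase-specific handling.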

But ya, I needed this for myself since I kept finding that I was reading what I THOUGHT should be there, vs what was actually written on the screen. To complicate things even more, hearing what I said played back while I re-read the phrase, somehow my brain just reads the text I was SAYING not the text that was written.

So I needed a non-biased solution that would literally show me what I said, and NOT what I THINK I said, ha ha

What I found by adding the STT is that it also catches the words that I did not speak very clearly, as it hears the wrong word because I did not pronounce it clearly enough. Which I would imagine is going to be a HUGE help for the training of the model, as it seems if STT could understand what I said correctly, and it matches what I was supposed to say, that’s gotta help make things go a little better :wink:


This is the specific React Component I created for this:

https://github.com/manifestinteractive/mimic-recording-studio/blob/master/frontend/src/App/components/SpokenWords.js

For some reason this forum will not let me post the link without wrapping it in backticks.

You can tell I only got to phrase 250, ha ha … but this is going to be a LOT of work in this component, as about 10% of the phrases have something weird in them that needs custom handling. About half of that is related to numbers, which I might try to automate a different way ( though in some of these phrases the numbers are just weird and it might not be possible ), but a fair amount is just spelling related, or cases where STT hears a partial phrase, misunderstands the context, and gets a word wrong even though the sound matches ( e.g. “shoot” vs “chute” in one partial phrase ).

This is why my fork is specific to the CSV file that comes with Mimic Recording Studio only, and ONLY for English ( and why it likely would not be a good candidate for a Pull Request, as there are a LOT of breaking changes ). But Mycroft can totally take anything they like from the work I am doing, if you wanted :slight_smile:
