[D] Training ASR: checking assumptions
Suppose you’re training an ASR system from scratch using audiobooks, where you have the plain text as well as the audio: one big mp3 file for the book (or maybe one mp3 per chapter) and several text files corresponding to the chapters.
The first stage is that you need labelled data, i.e. 1.wav with a transcript 1.txt for the first line of the book, then 2.wav with a transcript 2.txt for the second line, and so on. Once you have all those pairings, you can feed them a pair at a time into your ASR system (e.g. Mozilla DeepSpeech). The algorithm won’t take bigger chunks, so a sentence seems to be the right unit. You could try feeding it one word at a time, but I don’t know whether that would improve performance. I don’t even know how you would begin to detect and segment single words in an utterance; that seems a far harder problem than cutting on natural pauses, which are easier for a machine to detect. Anyway, suppose it’s per line for simplicity, because you don’t really want a bazillion files of just one word uttered and transcribed.
But how exactly do you segment the bigger audio files into smaller line-sized pieces? You can try voice activity detection (VAD). I can stick a large mp3 into Audacity and have it do approximately per-line labelling by marking voiced segments, so it’s easy to produce a bunch of 1-5 s long wav files that have been cut on silence. But now you don’t know exactly which text belongs to each piece of audio (which foo.txt goes with a given foo.wav), since the narrator might blend two sentences together with too small a pause between them. If the narrator paused faithfully between sentences it would be easy. But you can’t assume VAD will give you a neat 1:1 division of audio chunk to text sentence, so you can’t easily work out which sentence of text corresponds to which chunk of audio. Unfortunately, sentences will often spill over wav boundaries. In the best case you just have two whole sentences combined in one wav, which you need to split into two separate wavs. In the worst case a sentence is divided over two wav files. Then you need to edit the wavs by hand, moving a bit of one wav into the other, or just accept broken sentences.
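To make the cut-on-silence step concrete, here’s a minimal sketch of the idea in Python. It is not what Audacity actually does, just a plain energy threshold. The function name `split_on_silence`, the threshold and the minimum pause length are all values I made up for illustration, and I’m assuming the audio has already been decoded to a list of float samples in the range -1 to 1.

```python
import math


def split_on_silence(samples, sample_rate, frame_ms=20,
                     threshold=0.02, min_silence_ms=300):
    """Return (start, end) sample indices of voiced regions.

    Hypothetical helper: a frame counts as voiced if its RMS energy is at
    least `threshold`; a segment is closed once `min_silence_ms` of
    consecutive quiet frames has been seen. All defaults are guesses.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    min_silent_frames = max(1, int(min_silence_ms / frame_ms))

    segments = []
    seg_start = None    # sample index where the open segment began
    voiced_end = None   # end of the last voiced frame in the open segment
    silent_run = 0      # consecutive quiet frames since the last voiced one

    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms >= threshold:
            if seg_start is None:
                seg_start = i
            voiced_end = i + len(frame)
            silent_run = 0
        elif seg_start is not None:
            silent_run += 1
            if silent_run >= min_silent_frames:
                # The pause is long enough: close the segment.
                segments.append((seg_start, voiced_end))
                seg_start = None
                silent_run = 0

    if seg_start is not None:
        segments.append((seg_start, voiced_end))
    return segments
```

This also shows the problem I’m describing: any pause shorter than `min_silence_ms` gets swallowed, so two sentences read with a short gap between them come out as a single segment, and lowering the limit instead starts cutting in the middle of sentences.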
So now your audio data is cut up by voice activity into unlabelled 1-5 s pieces, and you need to label them. In this phase I think you’re supposed to use tools like forced alignment. The problem is that we don’t really care about word alignment, which is what forced aligners produce; we’re just after sentences. And forced alignment needs line-by-line, per-wav transcripts to work in the first place, which we don’t have. That is exactly the problem we started with before forced alignment came up. If you’re working with broken lines, the labels don’t even correspond cleanly to text sentences anymore, just to whatever words of the book happen to be uttered in a given wav.
Apparently one solution to the labelling problem is to simply run ASR on those small wavs to generate rough transcripts. Then presumably you match them approximately against the known text lines and align them that way? That sounds complicated, especially when we don’t have a decently performing ASR to use. So does that mean the labelling needed to bootstrap an ASR is done mainly by hand? That’s where I’m stuck. Is there no way around bootstrapping by hand? For a low-resource language it seems you’re just stuck with hand-labelling lots of data first, and without enough data the ASR isn’t going to be helpful because its error rate will be too high. You have to build up a decent amount of hand-labelled data to train an okay ASR, which can then be used to bootstrap more efficient machine-driven labelling of new datasets. Are all my assumptions correct? Is this bootstrapping really unavoidable?
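For what it’s worth, this is roughly how I imagine the “match the rough transcripts against the known text” step, as a small Python sketch. It is my own guess at the approach, not something taken from an existing tool. The names `normalize` and `align_chunks` and the `slack` parameter are hypothetical, and it assumes the chunks come in book order and that the rough ASR output is mostly right, which is exactly what we don’t have at the start.

```python
import difflib
import re


def normalize(text):
    """Lowercase and keep only word-like tokens (drops punctuation)."""
    return re.findall(r"[a-z']+", text.lower())


def align_chunks(book_text, hypotheses, slack=5):
    """Greedily assign a span of book words to each rough ASR hypothesis.

    Hypothetical helper: for each chunk, try spans of the book starting at
    the current cursor whose length is within `slack` words of the
    hypothesis length, and keep the one with the best similarity score.
    Returns a list of (label_text, score) pairs in chunk order.
    """
    book_words = normalize(book_text)
    cursor = 0
    labels = []
    for hyp in hypotheses:
        hyp_words = normalize(hyp)
        best_score, best_end = 0.0, cursor
        shortest = max(1, len(hyp_words) - slack)
        longest = len(hyp_words) + slack
        for length in range(shortest, longest + 1):
            end = cursor + length
            if end > len(book_words):
                break
            candidate = book_words[cursor:end]
            score = difflib.SequenceMatcher(None, hyp_words, candidate).ratio()
            if score > best_score:
                best_score, best_end = score, end
        # The label is the (normalized) book text, not the ASR output.
        labels.append((" ".join(book_words[cursor:best_end]), best_score))
        cursor = best_end
    return labels


book = ("It was the best of times, it was the worst of times. "
        "It was the age of wisdom.")
rough = ["it was the best of time it was the worst of times",
         "it was the age of wisdom"]
for label, score in align_chunks(book, rough):
    print(round(score, 2), label)
```

The score is useful for deciding which chunks to throw away or check by hand. Because the matching is greedy, one bad chunk can push the cursor off and spoil everything after it, so I assume a real pipeline would need something more robust than this.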
I should probably mention that I’m assuming we’re training an end-to-end deep-net ASR system. I’m not sure how the classical systems worked, but they’re probably even harder to train, because you need a phoneme dictionary for the language and then have to train a model to discriminate and identify phones in a given speech fragment. That means you do need per-word alignment of the training data, a much harder problem than the per-sentence alignment needed by the end-to-end system.