[D] What is the best implementation of a trainable TTS network for creating custom TTS voices?
In this instance, TTS refers to Text-To-Speech.
As the title implies I am looking for the best way to train a network to produce high-quality text to speech results in a custom voice pulled from training data. Assuming access to large amounts of high-quality speech data from a single speaker, the English language, powerful machines, and extended training times what is the best implementation/codebase to use?
I have done quite a lot of research into this but have found my results to be quite confusing. Tacotron-2 seems to me to provide the highest quality results with an open-source implementation. However, implementations such as ESPnet(1) seem to be geared more towards testing different methods rather than developing your own custom voice. I am not new to Machine Learning but I am new to applying ML to audio or language-related problems thus I am very behind on my understanding of the state of such lines of research.
If I was looking to replicate something like the results from “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions”(2) where they used 20+ hours of data from an English to produce very natural sounding speech(3) what would be my best option? I just figured I would ask the experts of Reddit before I took the plunge on setting up a codebase and dataset only to realize there were significantly better options available.
(2)(paper link) https://arxiv.org/abs/1712.05884
(3)(audio sample link) https://google.github.io/tacotron/publications/tacotron2/index.html