[D] What is the best implementation of a trainable TTS network for creating custom TTS voices?

In this instance, TTS refers to Text-To-Speech.

As the title implies, I am looking for the best way to train a network that produces high-quality text-to-speech output in a custom voice learned from training data. Assuming access to a large amount of high-quality English speech data from a single speaker, powerful machines, and long training times, what is the best implementation/codebase to use?

I have done quite a lot of research into this, but I have found the results confusing. Tacotron 2 seems to me to provide the highest-quality results among approaches with an open-source implementation. However, implementations such as ESPnet (1) seem to be geared more towards comparing different methods than towards developing your own custom voice. I am not new to machine learning, but I am new to applying ML to audio or language-related problems, so I am well behind on the state of this line of research.

If I were looking to replicate something like the results from “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions” (2), where they used 20+ hours of data from an English speaker to produce very natural-sounding speech (3), what would be my best option? I figured I would ask the experts of Reddit before taking the plunge on setting up a codebase and dataset, only to realize there were significantly better options available.

Thanks!

(1) https://github.com/espnet/espnet

(2) (paper link) https://arxiv.org/abs/1712.05884

(3) (audio sample link) https://google.github.io/tacotron/publications/tacotron2/index.html

submitted by /u/blackfish_88