[P] Not Jordan Peterson – Speech synthesis using Google’s Tacotron 2 and NVIDIA’s WaveGlow
The technology used to generate audio on this site is a combination of two neural network models that were trained using audio data of Dr. Peterson speaking, along with the transcript of his speech. If you don’t know who Jordan Peterson is or what his voice sounds like, you can find links to his podcast, lectures, and YouTube videos on his website.
The first model, developed at Google, is called Tacotron 2. It takes the text you type as input and produces a (mel) spectrogram, which represents the amplitude of each frequency in an audio signal at each moment in time. The model is trained on text/spectrogram pairs, where the spectrograms are extracted from the source audio data using a short-time Fourier transform.
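To make the spectrogram step concrete, here is a minimal numpy sketch of extracting a magnitude spectrogram from a waveform: slice the signal into overlapping windowed frames and take the FFT of each. This is an illustration of the general idea only, not Tacotron 2's actual preprocessing (which additionally applies a mel-scale filterbank with specific frame and hop sizes); the function name and the `FRAME_LEN`/`HOP` values are arbitrary choices for the demo.

```python
import numpy as np

FRAME_LEN, HOP = 256, 128  # illustrative values, not Tacotron 2's settings

def spectrogram(signal):
    """Magnitude spectrogram: window the signal, FFT each overlapping frame."""
    window = np.hanning(FRAME_LEN)
    n_frames = 1 + (len(signal) - FRAME_LEN) // HOP
    frames = np.stack([signal[i * HOP : i * HOP + FRAME_LEN] * window
                       for i in range(n_frames)])
    # Rows are time steps, columns are frequency bins from 0 Hz to Nyquist.
    return np.abs(np.fft.rfft(frames, axis=1))

sr = 8000                            # sample rate in Hz, chosen for the demo
t = np.arange(sr) / sr               # one second of sample times
tone = np.sin(2 * np.pi * 440 * t)   # a pure 440 Hz tone
spec = spectrogram(tone)
# The average energy across time should peak in the bin nearest 440 Hz.
peak_hz = spec.mean(axis=0).argmax() * sr / FRAME_LEN
```

Running this on the one-second tone yields a spectrogram whose energy concentrates in a single frequency bin; real speech produces a much richer time-varying pattern, which is exactly what Tacotron 2 learns to predict from text.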
The second model, developed at NVIDIA, is called WaveGlow. It acts as a vocoder, taking the spectrogram output of Tacotron 2 and producing a full audio waveform, which is what gets encoded into an audio file you can listen to. A separate model is needed here because a spectrogram records only the magnitude of each frequency, not its phase, so the waveform cannot be recovered from it directly. The model is trained on spectrogram/waveform pairs of short segments of speech.
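WaveGlow itself is a learned, flow-based neural network, but the vocoding task it solves can be illustrated with the classical Griffin–Lim algorithm: given only a magnitude spectrogram, iteratively estimate a phase that is consistent with it, then invert the transform. The sketch below is a self-contained numpy implementation of Griffin–Lim, not of WaveGlow; the frame/hop sizes and function names are illustrative assumptions.

```python
import numpy as np

FRAME, HOP = 256, 64          # 75% overlap; illustrative values only
WIN = np.hanning(FRAME)

def stft(x):
    n = 1 + (len(x) - FRAME) // HOP
    frames = np.stack([x[i * HOP : i * HOP + FRAME] * WIN for i in range(n)])
    return np.fft.rfft(frames, axis=1)

def istft(S, length):
    """Inverse STFT by windowed overlap-add, normalized by the window energy."""
    frames = np.fft.irfft(S, n=FRAME, axis=1)
    x, norm = np.zeros(length), np.zeros(length)
    for i, f in enumerate(frames):
        x[i * HOP : i * HOP + FRAME] += f * WIN
        norm[i * HOP : i * HOP + FRAME] += WIN ** 2
    return x / np.maximum(norm, 1e-8)

def griffin_lim(mag, length, n_iter=50):
    """Recover a waveform from a magnitude spectrogram by iterative
    phase estimation: invert, re-analyze, keep the new phase, repeat."""
    rng = np.random.RandomState(0)
    phase = np.exp(2j * np.pi * rng.rand(*mag.shape))
    for _ in range(n_iter):
        x = istft(mag * phase, length)
        phase = np.exp(1j * np.angle(stft(x)))
    return istft(mag * phase, length)

sr = 8000
t = np.arange(sr) / sr
orig = np.sin(2 * np.pi * 440 * t)   # a known waveform for the demo
mag = np.abs(stft(orig))             # discard phase, keep magnitudes
rec = griffin_lim(mag, len(orig))    # reconstruct a waveform from magnitudes
```

Griffin–Lim tends to produce audible artifacts on real speech, which is the motivation for learned vocoders like WaveGlow: they are trained on spectrogram/waveform pairs to generate far more natural-sounding audio.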
The code used to create this site was forked from NVIDIA’s public implementations of Tacotron 2 and WaveGlow.
Disclaimer: this is not my product; I found it online.