[P] TensorFlow implementation of WaveGlow with VQVAE

Hi, I am new here.

Anyway, I am currently working on combining VQVAE and WaveGlow.

WaveGlow is a great model for synthesizing speech in parallel.

VQVAE is known to be good at disentangling speaker identity and linguistic features from raw audio.

As I want to build an efficient multi-speaker voice synthesizer, I have been trying to combine these two models.

There is still a lot of work remaining, though.

So far, what I have found from my implementation is:

– For a single speaker, it works quite well.

– For multiple speakers, it doesn’t seem to disentangle speaker identity and linguistic features.

I am trying to solve this issue now, so if you have any ideas, please let me know.

Additionally, I slightly modified the plain VQVAE method with a Soft-EM-like gradient descent method.

For now, it seems to work quite well, avoiding hyperparameter tuning and index collapse.
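
To make the idea concrete, here is a minimal pure-Python sketch of what a Soft-EM-like assignment can look like in general. It is only an illustration under assumptions, not the code in the repository: it assumes a softmax over negative squared distances, and the name `soft_quantize`, the `temperature` parameter, and the toy codebook are all hypothetical.

```python
import math

def soft_quantize(z, codebook, temperature=1.0):
    """Soft-assign one encoder vector z to a codebook (illustrative only).

    Returns (weights, soft_code, hard_index). The names and the
    temperature parameter are assumptions made for this sketch.
    """
    # Squared Euclidean distance from z to every codebook entry.
    dists = [sum((zi - ci) ** 2 for zi, ci in zip(z, code)) for code in codebook]

    # E-step-like responsibilities: softmax over negative distances.
    # Subtracting the smallest distance keeps exp() numerically stable.
    d_min = min(dists)
    exps = [math.exp(-(d - d_min) / temperature) for d in dists]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Soft code: responsibility-weighted average of the codebook entries.
    soft_code = [
        sum(w * code[k] for w, code in zip(weights, codebook))
        for k in range(len(z))
    ]

    # Hard index, i.e. the nearest-neighbour choice of plain VQVAE.
    hard_index = dists.index(d_min)
    return weights, soft_code, hard_index

# Toy example with a three-entry, two-dimensional codebook.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
weights, soft_code, hard_index = soft_quantize([0.9, 1.2], codebook, temperature=0.5)
```

Because every codebook entry receives a non-zero weight, every entry also receives some gradient, which is one plausible reason a soft assignment of this kind can reduce index collapse compared with a hard nearest-neighbour lookup.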

For more information, please see my repository.

If you’re interested, please give me critical feedback!

Blog