[Discussion] Understanding Subscale WaveRNN & usage of Masked Dilated CNN as conditioning network
Related Paper: Efficient Neural Audio Synthesis
I have been reading the sections on Subscale WaveRNN, where the DeepMind team generates B samples in a single step. They discuss conditioning a particular sample on past samples and on up to F samples of future context from the previous sub-tensors. For this they use a masked dilated CNN (described in the last paragraph of Section 4.1, Subscale Dependency Scheme). Here is the relevant excerpt:
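To check my own understanding of the subscale reshape before asking: as I read it, a signal is split into B interleaved sub-tensors, where sub-tensor i holds every B-th sample starting at offset i. A minimal NumPy sketch (the function names and the exact interleaving are my assumptions, not from the paper):

```python
import numpy as np

def subscale_split(x, B):
    """Split a 1-D signal into B interleaved sub-tensors.

    Sub-tensor i holds samples x[i], x[i+B], x[i+2B], ...
    (my reading of the subscale reshape; B is the subscale factor).
    """
    assert len(x) % B == 0
    return x.reshape(-1, B).T  # shape (B, len(x) // B)

def subscale_merge(subs):
    """Inverse: interleave the B sub-tensors back into one signal."""
    return subs.T.reshape(-1)

x = np.arange(12)
subs = subscale_split(x, 4)
# subs[0] is [0, 4, 8], subs[1] is [1, 5, 9], etc.
assert np.array_equal(subscale_merge(subs), x)
```

If this is right, then generating the sub-tensors one after another, with each conditioned on the earlier ones, is what lets B samples be emitted per step once the generation of consecutive sub-tensors is pipelined.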
The Subscale WaveRNN that generates a given sub-tensor is conditioned on the future context of previous sub-tensors using a masked dilated CNN with relus and the mask applied over past connections instead of future ones.
My first question is: how could a masked dilated CNN help with this?
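To make the question concrete: my naive reading is that flipping the usual causal mask turns the dilated convolution into one whose receptive field covers the present and future samples of a previous sub-tensor rather than the past ones. A minimal sketch of that flipped ("anti-causal") dilated convolution, which is purely my assumption about what the paper means:

```python
import numpy as np

def future_dilated_conv(x, w, dilation=1):
    """Dilated conv where past connections are masked instead of future ones:
    y[t] = sum_j w[j] * x[t + j*dilation], zero-padded on the right,
    so y[t] depends only on x[t] and *future* samples of x.
    (A sketch of my reading of the conditioning network, not the paper's code.)
    """
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([x, np.zeros(pad)])
    return sum(w[j] * xp[j * dilation : j * dilation + len(x)] for j in range(k))

# With kernel [1, 1] and dilation 1, y[t] = x[t] + x[t+1]:
y = future_dilated_conv(np.array([1.0, 2.0, 3.0, 4.0]), [1.0, 1.0])
# y is [3., 5., 7., 4.]
```

Stacking such layers with exponentially increasing dilations would, I assume, cover the F future samples of the previous sub-tensors efficiently. Is that the role the masked dilated CNN plays here?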
Next, Nal Kalchbrenner tweeted a quick demo of the Subscale WaveRNN. It confuses me quite a bit when I try to reconcile it with the original paper.
My final question is: has anyone taken a closer look at subscaling?
Any insights would be appreciated.
(Note: This is my first post and I am hoping that I followed the format correctly.)