[D] Audio/Digital Signal Processing/Recurrent NN – Need help understanding and reproducing this paper in Python
I am trying to reproduce this paper in Python: A Hybrid DSP/Deep Learning Approach to Real-Time Full-Band Speech Enhancement by Jean-Marc Valin
Additionally, there is a blog post by the author explaining the paper differently: RNNoise: Learning Noise Suppression, and a GitHub repository with the code for training the proposed network.
However, I have difficulty understanding the concepts regarding preparing input data for training and prediction. Can someone give me practical notes on how I can achieve this?
Some questions I have about Section II:
The paper and blog post first compute 22 bands. A DCT is then applied to the log spectrum, resulting in 22 Bark-frequency cepstral coefficients (BFCC), which are closely related to Mel-frequency cepstral coefficients (MFCC). What does this mean, and how does it work?
The author also includes the temporal derivative and the second temporal derivative of the first six Bark-frequency cepstral coefficients across frames. What does this mean?
In formula (5), the pitch correlation for every band is calculated. The author then computes the DCT of the pitch correlation across frequency bands and includes the first six coefficients. I assume the DCT returns a finite set of results, so only the first six coefficients are used per band, correct?
The author mentions including the pitch period as well as a spectral non-stationarity metric. What does this mean?
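To make the first two questions concrete, here is a rough Python sketch of how I currently read them. It is only my guess, not the author's implementation. Everything numeric in it is an assumption on my part: the 48 kHz sample rate, the 20 ms frames with 10 ms hop, the Hann window (I believe the paper uses a different window), the illustrative Bark-like band layout in BAND_CENTERS_HZ, and the simple frame-to-frame differences used for the derivatives. The function names are made up for illustration.

```python
import numpy as np
from scipy.fft import dct

# All constants below are assumptions for illustration, not values confirmed by the paper.
SAMPLE_RATE = 48000
FRAME_SIZE = 960   # 20 ms at 48 kHz
HOP_SIZE = 480     # 10 ms hop, i.e. 50% overlap
# Illustrative Bark-like layout: narrow bands at low frequencies, wide ones at high.
BAND_CENTERS_HZ = 200.0 * np.array(
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16, 20, 24, 28, 34, 40, 48, 60, 78, 100]
)


def triangular_filterbank():
    """One triangular weight curve per band, peaking at the band center."""
    freqs = np.fft.rfftfreq(FRAME_SIZE, d=1.0 / SAMPLE_RATE)
    bank = np.zeros((len(BAND_CENTERS_HZ), len(freqs)))
    for b, center in enumerate(BAND_CENTERS_HZ):
        if b > 0:  # rising slope from the previous center up to this one
            lo = BAND_CENTERS_HZ[b - 1]
            rising = (freqs >= lo) & (freqs <= center)
            bank[b, rising] = (freqs[rising] - lo) / (center - lo)
        if b < len(BAND_CENTERS_HZ) - 1:  # falling slope down to the next center
            hi = BAND_CENTERS_HZ[b + 1]
            falling = (freqs >= center) & (freqs <= hi)
            bank[b, falling] = (hi - freqs[falling]) / (hi - center)
    return bank


def bark_cepstrum(signal):
    """Return an (n_frames, 22) array: DCT of the log band energies per frame."""
    n_frames = 1 + (len(signal) - FRAME_SIZE) // HOP_SIZE
    frames = np.stack(
        [signal[i * HOP_SIZE:i * HOP_SIZE + FRAME_SIZE] for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames * np.hanning(FRAME_SIZE), axis=1)) ** 2
    band_energy = power @ triangular_filterbank().T
    log_energy = np.log10(band_energy + 1e-2)  # small offset avoids log(0)
    return dct(log_energy, type=2, norm="ortho", axis=1)


def temporal_deltas(ceps, n_coeffs=6):
    """First and second differences across frames of the first n_coeffs coefficients."""
    c = ceps[:, :n_coeffs]
    d1 = np.diff(c, axis=0, prepend=c[:1])    # change since the previous frame
    d2 = np.diff(d1, axis=0, prepend=d1[:1])  # change of that change
    return d1, d2


# Usage: one second of white noise as a stand-in for real audio.
rng = np.random.default_rng(0)
audio = rng.standard_normal(SAMPLE_RATE)
ceps = bark_cepstrum(audio)
d1, d2 = temporal_deltas(ceps)
features = np.concatenate([ceps, d1, d2], axis=1)  # 22 + 6 + 6 columns per frame
```

If this reading is right, each frame gets 22 cepstral coefficients plus 6 first and 6 second differences, and the pitch-related features and the non-stationarity metric would be appended to that vector. Please correct me if the band layout or the derivative definition is off.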
Some background: I have mostly worked with visual data and convolutional neural networks, so I have almost no knowledge about digital signal processing. Please bear with me.
Thanks in advance!
submitted by /u/VividFee