[D] Audio/Digital Signal Processing/Recurrent NN – Need help understanding and reproducing this paper in Python
Hello everyone!
I am trying to reproduce this paper in Python: A Hybrid DSP/Deep Learning Approach to Real-Time Full-Band Speech Enhancement by Jean-Marc Valin.
Additionally, there is a blog post by the author explaining the paper differently: RNNoise: Learning Noise Suppression, and a GitHub repository with the code for training the proposed network.
However, I have difficulty understanding the concepts regarding preparing input data for training and prediction. Can someone give me practical notes on how I can achieve this?
Some questions I have about Section II:
- The paper and blog post first compute 22 Bark-scale band energies, then apply a DCT to the log spectrum of those bands, resulting in 22 Bark-frequency cepstral coefficients (BFCCs), which are closely related to Mel-frequency cepstral coefficients (MFCCs). What does this mean, and how does it work?
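To show where I currently stand, here is my naive attempt at computing BFCCs with NumPy/SciPy. The Bark formula and the linearly spaced band edges are my own guesses for illustration, not the Opus band layout the paper actually uses — is this roughly the right idea?

```python
import numpy as np
from scipy.fft import dct

def bfcc(frame, sr=48000, n_fft=960, n_bands=22):
    """Sketch of Bark-frequency cepstral coefficients.

    NOTE: the band edges below are illustrative (linearly spaced on a
    Bark-like scale), not the exact band layout from the paper/Opus.
    """
    # Power spectrum of one frame.
    spectrum = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    # Map each FFT bin's frequency onto a Bark-like scale.
    freqs = np.fft.rfftfreq(n_fft, 1 / sr)
    bark = 13 * np.arctan(0.00076 * freqs) + 3.5 * np.arctan((freqs / 7500) ** 2)
    edges = np.linspace(bark[0], bark[-1], n_bands + 1)
    # Sum spectral energy inside each of the 22 bands.
    band_energy = np.array([
        spectrum[(bark >= lo) & (bark < hi)].sum() + 1e-10
        for lo, hi in zip(edges[:-1], edges[1:])
    ])
    # Log band energies, then a DCT decorrelates them -> 22 cepstral coeffs.
    return dct(np.log(band_energy), type=2, norm='ortho')
```

My understanding is that the only difference from MFCCs is the band scale (Bark bands instead of Mel filters) and that there are just 22 coarse bands instead of a fine filterbank.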
- The author also includes the first and second temporal derivatives of the first six Bark-frequency cepstral coefficients across frames. What does this mean?
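My guess is that this is analogous to the delta/delta-delta features commonly used with MFCCs, i.e. frame-to-frame differences of the coefficients over time. Something like the sketch below — the simple first-order difference is my assumption, since the paper does not spell out the exact window:

```python
import numpy as np

def add_deltas(cepstra, n_keep=6):
    """cepstra: (n_frames, 22) array of BFCCs, one row per frame.

    Appends the first and second temporal differences of the first
    n_keep coefficients. Simple finite differences are an assumption;
    the paper does not specify the exact delta computation.
    """
    c = cepstra[:, :n_keep]
    delta = np.diff(c, axis=0, prepend=c[:1])           # 1st derivative
    delta2 = np.diff(delta, axis=0, prepend=delta[:1])  # 2nd derivative
    return np.hstack([cepstra, delta, delta2])          # (n_frames, 22+6+6)
```

If I read the feature count in the paper correctly, this would contribute 6 + 6 extra inputs per frame on top of the 22 BFCCs.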
- In formula (5) the pitch correlation for every band is calculated; the author then computes the DCT of the pitch correlation across frequency bands and includes the first six coefficients. I assume the DCT returns a finite set of coefficients. So only the first six coefficients are used per band, correct?
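For concreteness, here is how I would naively implement that step. My reading is that a single DCT is taken across all 22 per-band correlation values, giving six coefficients in total per frame rather than six per band — please correct me if that is wrong:

```python
import numpy as np
from scipy.fft import dct

def pitch_corr_features(band_corr):
    """band_corr: length-22 array, the pitch correlation of each band
    (formula (5) in the paper, computed elsewhere).

    One DCT across the band axis, keeping the first six coefficients --
    i.e. six values total per frame under my interpretation.
    """
    coeffs = dct(np.asarray(band_corr, dtype=float), type=2, norm='ortho')
    return coeffs[:6]
```

The DCT of a length-22 input yields 22 coefficients, so "the first six" would just be a truncation of that finite set.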
- The author also mentions including the pitch period as well as a spectral non-stationarity metric. What do these mean?
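My rough understanding is that the pitch period is the lag (in samples) of the dominant periodicity in the frame, and that non-stationarity measures how much the spectrum changes from one frame to the next. Here is a toy sketch of both ideas — definitely not the paper's actual estimators, just my attempt at the intuition:

```python
import numpy as np

def pitch_period(frame, sr=48000, fmin=60, fmax=800):
    """Naive autocorrelation pitch estimator. The paper uses a more
    elaborate open-loop pitch search; this only illustrates what a
    'pitch period' is: the lag, in samples, of the strongest
    periodicity (f0 = sr / lag)."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lo, hi = sr // fmax, sr // fmin  # restrict to a plausible f0 range
    return lo + int(np.argmax(ac[lo:hi]))

def spectral_difference(prev_log_bands, log_bands):
    """One plausible non-stationarity measure: mean squared change of
    the per-band log spectrum since the previous frame. The paper's
    exact metric differs; this is just the intuition."""
    return float(np.mean((np.asarray(log_bands) - np.asarray(prev_log_bands)) ** 2))
```

Is the idea that a stationary signal (steady noise) gives a small spectral difference while speech onsets give a large one, so the network can use it as a voicing/transient cue?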
Some background: I have mostly worked with visual data and convolutional neural networks, so I have almost no knowledge about digital signal processing. Please bear with me.
Thanks in advance!
submitted by /u/VividFee