[D] Pipeline for Recognizing Emotions from Speech
For my master's thesis, my goal is to perform speech emotion recognition in a continuous space, where I try to predict the dimensional values (valence, arousal, dominance) of the emotion classes. I extract features for each frame of the wav files to account for the non-stationary nature of speech signals. Since the feature vector for each example needs to have the same shape as the model input, I investigated the number of frames per audio file and found that it ranges between 32 and 1364. I know one solution for obtaining equal-sized feature vectors is to pad with zeros until every vector has the maximum length, which is 1364. So I have two questions regarding the construction of the feature vectors:
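For concreteness, here is a minimal sketch of the zero-padding approach described above (the feature dimension of 13 and the use of NumPy are my assumptions, not something from your pipeline):

```python
import numpy as np

MAX_LEN = 1364  # maximum frame count observed across the dataset

def pad_features(frames, max_len=MAX_LEN):
    """Zero-pad a (num_frames, num_features) matrix to (max_len, num_features).

    Real frames are kept at the start; the remaining rows stay zero.
    """
    num_frames, num_features = frames.shape
    padded = np.zeros((max_len, num_features), dtype=frames.dtype)
    padded[:num_frames] = frames
    return padded

# Hypothetical example: an utterance with 32 frames of 13 features each.
short = np.random.randn(32, 13)
print(pad_features(short).shape)  # (1364, 13)
```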
- Does adding too many zeros (increasing the sparsity of the vector) have a negative effect on the performance of the features? If yes, can I tackle this by computing statistical functionals (min, max, std, mean, etc.) of each feature over the frames? Or is there another solution besides zero-padding?
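The statistical-functionals idea you mention could look something like the sketch below: it collapses a variable-length frame sequence into one fixed-size vector per utterance, so no padding is needed at all (the choice of the four statistics and the feature dimension are just illustrative assumptions):

```python
import numpy as np

def functionals(frames):
    """Reduce a (num_frames, num_features) matrix to a fixed-size vector
    of per-feature statistics computed over the time axis."""
    stats = [
        frames.min(axis=0),
        frames.max(axis=0),
        frames.mean(axis=0),
        frames.std(axis=0),
    ]
    return np.concatenate(stats)

# Utterances of very different lengths yield vectors of the same shape.
a = np.random.randn(32, 13)
b = np.random.randn(1364, 13)
print(functionals(a).shape, functionals(b).shape)  # (52,) (52,)
```

With 13 features and 4 functionals, every example becomes a 52-dimensional vector regardless of its original frame count.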
- Some of my features are extracted not from a frame but from the whole utterance, like the duration of the utterance or the emotion class of the instance. What should my approach be for these features? Are there any downsides to treating them just like the other features, with only one frame of real values and the remaining 1364 − 1 positions padded with zeros?
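One alternative to the one-frame-plus-padding treatment is broadcasting the utterance-level value to every frame as an extra column, so it is present wherever the model looks rather than concentrated in a single padded row. This is only a sketch of that idea, not a recommendation from your setup; the function name and the duration example are hypothetical:

```python
import numpy as np

def attach_utterance_feature(frame_matrix, value):
    """Append an utterance-level scalar (e.g. duration in seconds) to a
    (num_frames, num_features) matrix as a constant extra column."""
    column = np.full((frame_matrix.shape[0], 1), value,
                     dtype=frame_matrix.dtype)
    return np.hstack([frame_matrix, column])

# Hypothetical usage: 1364 padded frames, 13 features, duration 2.5 s.
x = np.zeros((1364, 13))
y = attach_utterance_feature(x, 2.5)
print(y.shape)  # (1364, 14)
```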
I’d appreciate hearing your thoughts. Cheers,