[P] Comparing 11 Speech-to-Text models using TensorFlow
Here I compare 11 Speech-to-Text models using TensorFlow, all implemented 100% in Jupyter notebooks and kept simple. Accuracy is measured by character position.
80% of the dataset is used for training, 20% for testing.
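As a rough illustration of what "accuracy based on character position" can mean, here is a minimal sketch: compare predicted and ground-truth transcripts character by character at each position, normalized by the longer string. The function name and exact normalization are my assumptions, not necessarily what the repository implements.

```python
def char_position_accuracy(pred: str, truth: str) -> float:
    """Hypothetical metric: fraction of positions where the predicted
    character matches the ground-truth character, divided by the
    length of the longer string (so extra/missing characters count
    as errors)."""
    if not pred and not truth:
        return 1.0
    matches = sum(p == t for p, t in zip(pred, truth))
    return matches / max(len(pred), len(truth))
```

For example, `char_position_accuracy("hello", "hallo")` gives 0.8, since 4 of 5 positions match.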
- Tacotron, test accuracy 77.09%
- BiRNN LSTM, test accuracy 84.66%
- BiRNN Seq2Seq + Luong Attention + Cross Entropy, test accuracy 87.86%
- BiRNN Seq2Seq + Bahdanau Attention + Cross Entropy, test accuracy 89.28%
- BiRNN Seq2Seq + Bahdanau Attention + CTC, test accuracy 86.35%
- BiRNN Seq2Seq + Luong Attention + CTC, test accuracy 80.30%
- CNN RNN + Bahdanau Attention, test accuracy 80.23%
- Dilated CNN RNN, test accuracy 31.60%
- Wavenet, test accuracy 75.11%
- Deep Speech 2, test accuracy 81.40%
- Wav2Vec Transfer learning BiRNN LSTM, test accuracy 83.24%
Link to repository, https://github.com/huseinzol05/NLP-Models-Tensorflow#speech-to-text
Link to dataset, https://tspace.library.utoronto.ca/handle/1807/24487; I also included a notebook showing how to download the dataset.
- The dataset is not really that big, only 286MB.
- Transfer-learning Wav2Vec accuracy is not that high; it probably needs more data.
- I used my own hyperparameters for Wav2Vec; the original hyperparameters caused a GPU sync problem because the sequences were too long.
- I need to use a bigger dataset.
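On the "sequence is too long" problem: one common workaround is to truncate or zero-pad the audio feature sequences to a fixed maximum length before batching, which bounds memory on the GPU. This is a minimal NumPy sketch; the function name, `max_len` value, and feature shape `(time, n_mels)` are my assumptions, not the repository's actual preprocessing.

```python
import numpy as np

def pad_or_truncate(features, max_len):
    """Hypothetical preprocessing step: clip or zero-pad a list of
    variable-length feature arrays of shape (time, n_features) so
    they can be stacked into one (batch, max_len, n_features) tensor."""
    out = []
    for f in features:
        if f.shape[0] >= max_len:
            out.append(f[:max_len])            # truncate long sequences
        else:
            pad = np.zeros((max_len - f.shape[0], f.shape[1]), dtype=f.dtype)
            out.append(np.concatenate([f, pad], axis=0))  # pad short ones
    return np.stack(out)
```

Truncation trades some transcript coverage for a predictable memory footprint, which is often the pragmatic choice on a single GPU.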