[D] Voice Assistant: Better to use a model trained on commands or just use STT?
I would like to make a deep-learning-based voice assistant for an application of mine that controls a digital camera. Some example commands are “auto focus”, “set zoom to 2”, “turn off flash”, etc.
I see two ways of going about this:
1. Train a model that classifies an audio snippet as one of the commands or as background noise. This seems easier than option 2, but also less robust: I'd have to retrain the model every time I add a new command. I'm also not sure how numbers would work (would I record myself saying every number up to, say, 100?).
2. Use STT to convert the audio to text, then do some fuzzy string matching to see whether it matches a command. I downloaded Mozilla's DeepSpeech and it didn't seem to work very well, so I'm guessing that building a good STT model is very difficult.
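For option 2, here's roughly what I have in mind for the matching step (a minimal sketch using Python's standard-library `difflib`; the command names are placeholders for my app's actual commands):

```python
import difflib
import re

# Placeholder fixed-phrase commands for the camera app.
COMMANDS = ["auto focus", "turn off flash", "turn on flash", "take picture"]

def match_command(transcript):
    """Map an STT transcript to a known command.

    Returns (command, argument) or (None, None) if nothing is close enough.
    """
    text = transcript.lower().strip()

    # Parameterized commands: pull the number out first, so the fuzzy
    # matcher only has to handle fixed phrases.
    m = re.match(r"set zoom to (\d+)", text)
    if m:
        return "set zoom", int(m.group(1))

    # Fuzzy-match against the fixed phrases; the cutoff rejects
    # transcripts that aren't close to any command.
    close = difflib.get_close_matches(text, COMMANDS, n=1, cutoff=0.6)
    if close:
        return close[0], None
    return None, None
```

So `match_command("set zoom to 2")` would give `("set zoom", 2)`, and a slightly garbled transcript of "auto focus" should still land on the right command via the similarity cutoff, while unrelated speech falls through to `(None, None)`.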
Which of these is a better approach? Or is there some in-between approach that’s even better?