[D] Voice Assistant: Better to use a model trained on commands or just use STT?
I would like to make a deep-learning based voice assistant for an application I have that controls a digital camera. Some example commands are “auto focus”, “set zoom to 2”, “turn off flash”, etc.
I see two ways of going about this:
- Train a model that classifies an audio snippet as containing one of the commands or background noise. This seems easier than option 2 but also less robust, since I would have to retrain the model every time I add a new command. I'm also not sure how numbers would work (record myself saying every number up to like 100?).
- Use STT to convert audio to text and do some fuzzy string matching to see if it matches a command. I've downloaded Mozilla's DeepSpeech and it did not seem to work very well, so I'm guessing that creating a good STT model is very difficult.
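For what it's worth, the fuzzy-matching half of option 2 needs nothing beyond the standard library. Here's a rough sketch, assuming the STT step has already produced a transcript; the command list and the `<n>` placeholder convention are made up for illustration, and pulling numbers out with a regex first sidesteps the "record every number up to 100" problem:

```python
import difflib
import re

# Hypothetical command templates for the camera app (illustrative only).
# Numeric arguments are represented by a "<n>" placeholder.
COMMANDS = ["auto focus", "turn off flash", "turn on flash", "set zoom to <n>"]

def match_command(transcript):
    """Fuzzy-match an STT transcript against the known command templates.

    Returns (template, number) on a match, or None if nothing is close.
    """
    text = transcript.lower().strip()
    # Extract a numeric argument first and swap in the placeholder, so the
    # matcher only ever compares against a fixed set of templates.
    number = None
    m = re.search(r"\d+", text)
    if m:
        number = int(m.group())
        text = re.sub(r"\d+", "<n>", text)
    # get_close_matches returns the best template above the cutoff, or [].
    best = difflib.get_close_matches(text, COMMANDS, n=1, cutoff=0.6)
    if not best:
        return None
    return (best[0], number)
```

Usage: `match_command("set zoom to 5")` returns `("set zoom to <n>", 5)`, and a slightly garbled transcript like `"turn off the flash"` still resolves to `"turn off flash"` because `difflib` scores it closest. The cutoff would need tuning against whatever the STT engine actually emits.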
Which of these is a better approach? Or is there some in-between approach that’s even better?
submitted by /u/elmosworld37