Join our meetup, learn, connect, share, and get to know your Toronto AI community.
Browse through the latest deep learning, ai, machine learning postings from Indeed for the GTA.
Are you looking to sponsor space, be a speaker, or volunteer, feel free to give us a shout.
I would like to make a deep-learning based voice assistant for an application I have that controls a digital camera. Some example commands are “auto focus”, “set zoom to 2”, “turn off flash”, etc.
I see two ways of going about this:
Train a model that classifies an audio snippet as containing one of the commands or background noise. This seems easier than option 2 but also less robust, as I would have to retrain the model every time I add a new command. Also not sure how numbers would work (record myself saying every number up to like 100?).
Use STT to convert audio to text and do some fuzzy string matching to see if it matches a command. I’ve downloaded Mozilla’s DeepSpeech and it did not seem to work very well, so I’m guessing that creating a good STT model is very difficult.
Which of these is a better approach? Or is there some in-between approach that’s even better?
submitted by /u/elmosworld37
[link] [comments]