[D] Voice Assistant: Better to use a model trained on commands or just use STT?

I would like to build a deep-learning-based voice assistant for an application I have that controls a digital camera. Some example commands are “auto focus”, “set zoom to 2”, “turn off flash”, etc.

I see two ways of going about this:

  1. Train a model that classifies an audio snippet as either one of the commands or background noise. This seems easier than option 2 but also less robust, as I would have to retrain the model every time I add a new command. I’m also not sure how numbers would work (would I have to record myself saying every number up to 100 or so?).

  2. Use speech-to-text (STT) to convert audio to text and do some fuzzy string matching to see if the text matches a command. I’ve downloaded Mozilla’s DeepSpeech and it did not seem to work very well, so I’m guessing that creating a good STT model is very difficult.
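
To make option 2 more concrete, here is a rough sketch of what the fuzzy-matching step might look like, assuming the STT engine hands back a plain transcript string. The command list, the `<num>` placeholder, the small number-word table, and the function names are all made up for illustration; only `difflib` and `re` from the Python standard library are used. Numbers are pulled out into a slot before matching, so nothing has to be enumerated per value.

```python
import difflib
import re

# Hypothetical command templates, based on the examples in the post.
# "<num>" marks a slot that is filled by whatever number was spoken.
COMMANDS = ["auto focus", "turn off flash", "turn on flash", "set zoom to <num>"]

# Assumed word-to-number table; only covers 0-10 to keep the sketch short.
NUMBER_WORDS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}


def normalize(transcript):
    """Lowercase, drop punctuation, and swap any number for the <num> slot.

    Returns the normalized text and the last number found (or None).
    """
    words = re.findall(r"[a-z0-9]+", transcript.lower())
    value = None
    out = []
    for word in words:
        if word.isdigit():
            value = int(word)
            out.append("<num>")
        elif word in NUMBER_WORDS:
            value = NUMBER_WORDS[word]
            out.append("<num>")
        else:
            out.append(word)
    return " ".join(out), value


def match_command(transcript, cutoff=0.7):
    """Return (command_template, number) or (None, None) if nothing is close."""
    text, value = normalize(transcript)
    # get_close_matches ranks candidates by difflib's similarity ratio.
    hits = difflib.get_close_matches(text, COMMANDS, n=1, cutoff=cutoff)
    if not hits:
        return None, None
    return hits[0], value


if __name__ == "__main__":
    for heard in ["Set zoom to 2", "auto focas", "hello world"]:
        print(heard, "->", match_command(heard))
```

The `cutoff` of 0.7 is a guess and would need tuning against real transcripts. Commands that differ by only a few characters, such as “turn on flash” and “turn off flash”, sit very close together under character-level similarity, so a real system would probably want a stricter check for those pairs.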

Which of these is a better approach? Or is there some in-between approach that’s even better?

submitted by /u/elmosworld37