
[P] Implementing “Ambient Sound Provides Supervision for Visual Learning”

Hi, I attempted to recreate the results from the multi-modal self-supervised learning paper “Ambient Sound Provides Supervision for Visual Learning” by Owens et al.

Here is the code along with my detailed report on it –

Some key things I learned during this:

  1. Sound is an interesting supervision signal for image/scene recognition.
  2. The representation of sound matters quite a lot. I tried using MFCCs alongside the statistical sound summaries proposed in the original paper and saw some improvement in the downstream task evaluation.
  3. I don’t know what makes an optimal sound representation. Perhaps these can be learned.
  4. Visualizing the top-activated images for the model trained with sound as supervision, we see that it somewhat understands the context in which an object occurs (like fish splashing in water or a man holding some kind of fish) but not the details of the object itself.
  5. Can we combine both audio and images as supervisory signals? What are some good papers on this?
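
To make point 2 a bit more concrete, here is a minimal NumPy-only sketch of turning an audio clip into a fixed-length MFCC descriptor. It is not the code from my repo and not the statistical sound summaries from the paper: the function names (`mfcc`, `clip_descriptor`), the frame sizes, the filter counts, and the mean/std pooling are all assumptions chosen for illustration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters evenly spaced on the mel scale.
    Returns an array of shape (n_filters, n_fft // 2 + 1)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):    # rising edge of the triangle
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):   # falling edge of the triangle
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc(signal, sr, n_fft=512, hop=256, n_filters=26, n_coeffs=13):
    """Return MFCCs with shape (n_frames, n_coeffs).
    All default parameter values are illustrative assumptions."""
    # Slice the signal into overlapping, Hann-windowed frames.
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(n_fft)
    # Power spectrum -> mel-band energies -> log compression.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2 / n_fft
    mel_energy = power @ mel_filterbank(n_filters, n_fft, sr).T
    log_mel = np.log(mel_energy + 1e-10)
    # Unnormalized DCT-II over the mel bands, keeping the first n_coeffs.
    basis = np.cos(
        np.pi * np.outer(np.arange(n_coeffs), np.arange(n_filters) + 0.5) / n_filters
    )
    return log_mel @ basis.T

def clip_descriptor(signal, sr):
    """Pool frame-level MFCCs into one vector per clip (mean and std over time).
    This pooling is a simple stand-in, not the paper's sound summary."""
    coeffs = mfcc(signal, sr)
    return np.concatenate([coeffs.mean(axis=0), coeffs.std(axis=0)])

# Usage on a synthetic one-second clip (a 440 Hz tone plus a little noise).
sr = 16000
rng = np.random.default_rng(0)
t = np.arange(sr) / sr
clip = np.sin(2 * np.pi * 440.0 * t) + 0.01 * rng.standard_normal(sr)
descriptor = clip_descriptor(clip, sr)  # shape (26,)
```

A per-clip vector like this could be concatenated with other sound features before being used as the supervision target, which is roughly the kind of comparison described in point 2.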

Would love to hear some comments/criticisms/thoughts on this.


submitted by /u/Lorenzo_de_Medici

Toronto AI is a social and collaborative hub to unite AI innovators of Toronto and surrounding areas. We explore AI technologies in digital art and music, healthcare, marketing, fintech, VR, robotics and more. Toronto AI was founded by Dave MacDonald and Patrick O'Mara.