
[P] Implementing “Ambient Sound Provides Supervision for Visual Learning”

Hi, I attempted to recreate the results from the multi-modal self-supervised learning paper "Ambient Sound Provides Supervision for Visual Learning" by Owens et al.

Here is the code along with my detailed report on it: https://github.com/rowhanm/ambient-sound-self-supervision

Some key things I learned during this:

  1. Sound is an interesting supervision signal for image/scene recognition.
  2. The representation of sound matters quite a lot. I tried using MFCCs alongside the statistical sound summaries proposed in the original paper and saw some improvement in the downstream task evaluation (a minimal sketch of both representations follows this list).
  3. I don’t know what makes an optimal sound representation. Perhaps these representations can be learned.
  4. Visualizing the top-activated images for units learned using sound as supervision (see the second sketch below), we see that the model somewhat understands the context in which an object occurs (like fish splashing in water / a man holding some kind of fish) rather than the details of the object itself.
  5. Can we combine both audio and images as supervisory signals? What are some good papers on this?
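To make point 2 concrete, here is a minimal sketch (not the actual code from the repo, which has the full details) of the two audio representations compared: MFCCs and a simplified stand-in for the statistical sound summaries from the paper, both reduced to fixed-length vectors with per-band mean/std over time. It assumes librosa is available; the function name `sound_features` and the parameter defaults are my own choices, not the paper's.

```python
import numpy as np
import librosa


def sound_features(wav_path, sr=22050, n_mels=64, n_mfcc=13):
    """Return a fixed-length feature vector for one audio clip."""
    y, sr = librosa.load(wav_path, sr=sr)

    # Log mel-band energies over time (n_mels x T).
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = np.log(mel + 1e-6)

    # Simplified statistical summary: per-band mean and standard deviation
    # over time (a crude stand-in for the richer sound-texture statistics
    # used in the original paper).
    summary = np.concatenate([log_mel.mean(axis=1), log_mel.std(axis=1)])

    # MFCCs, also summarized over time so clips of different lengths
    # map to vectors of the same size.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    mfcc_summary = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

    return np.concatenate([summary, mfcc_summary])
```

These vectors can then be clustered or binned to produce the pseudo-labels that supervise the image network.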
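And for point 4, a rough sketch of how the top-activated images can be pulled out of a trained network with a PyTorch forward hook. Here `model`, `layer`, and the dataloader are placeholders rather than the objects used in the repo, and `layer` is assumed to output a conv feature map of shape (B, C, H, W).

```python
import torch


def top_activated_images(model, layer, dataloader, unit, k=9, device="cpu"):
    """Return the k images whose mean activation of `unit` in `layer` is highest."""
    acts = []

    def hook(_module, _inp, out):
        # Average the chosen channel's feature map to one score per image.
        acts.append(out[:, unit].mean(dim=(1, 2)).detach().cpu())

    handle = layer.register_forward_hook(hook)
    model.eval().to(device)

    scores, images = [], []
    with torch.no_grad():
        for batch, _ in dataloader:
            acts.clear()
            model(batch.to(device))
            scores.append(acts[0])
            images.append(batch)  # keeps all images in memory; fine for a small probe set
    handle.remove()

    scores = torch.cat(scores)
    images = torch.cat(images)
    top = scores.topk(k).indices
    return images[top], scores[top]
```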

Would love to hear some comments/criticisms/thoughts on this.

Thanks!

submitted by /u/Lorenzo_de_Medici