[P] Implementing “Ambient Sound Provides Supervision for Visual Learning”
Hi, I attempted to recreate the results of the multi-modal self-supervised learning paper "Ambient Sound Provides Supervision for Visual Learning" by Owens et al.
Here is the code along with my detailed report on it – https://github.com/rowhanm/ambient-sound-self-supervision
Some key things I learned during this:
- Sound is an interesting supervision signal for image/scene recognition.
- The representation of the sound matters quite a lot. I tried using MFCCs alongside the statistical sound summaries proposed in the original paper and saw some improvement on the downstream evaluation task (a rough sketch of both feature pipelines follows this list).
- I don't know what makes an optimal sound representation; perhaps it could be learned rather than hand-crafted.
- Visualizing the top-activated images for units learned with sound as supervision, we see that the model somewhat understands the context in which an object occurs (e.g., a fish splashing in water, or a person holding some kind of fish) rather than the details of the object itself (a small sketch of this visualization is also included after the list).
- Can we combine both audio and images as supervisory signals? What are some good papers on this?
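For anyone curious about the sound features, here is a minimal sketch of the two representations I compared: MFCCs and simple per-band statistics of the log-mel spectrogram (a crude stand-in for the statistical sound summaries from the paper). It assumes librosa is available; the parameter values (n_mfcc, n_mels, the mean/std pooling) are illustrative and not necessarily what the paper or my repo uses.

```python
import numpy as np
import librosa

def mfcc_features(wav_path, sr=22050, n_mfcc=13):
    """Mean/std-pool MFCC frames into a fixed-length clip descriptor."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)      # (n_mfcc, T)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def stat_summary_features(wav_path, sr=22050, n_mels=32):
    """Per-band mean/std of the log-mel spectrogram, a rough proxy for
    the statistical sound summaries used as supervision in the paper."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)  # (n_mels, T)
    log_mel = librosa.power_to_db(mel)
    return np.concatenate([log_mel.mean(axis=1), log_mel.std(axis=1)])
```

As in the paper, clip-level descriptors like these are then clustered (e.g., with k-means) and the cluster IDs are used as pseudo-labels for training the image CNN.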
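And this is roughly how the top-activated-images visualization works: run the dataset through the sound-supervised CNN, record the spatially max-pooled activation of a chosen unit in a conv layer via a forward hook, and keep the k images with the highest scores. The model, layer, and dataloader names here are placeholders, not the ones from my repo.

```python
import torch

@torch.no_grad()
def top_activated_images(model, layer, dataloader, unit_idx, k=8, device="cpu"):
    """Return the k images that most strongly activate `unit_idx`
    (spatially max-pooled) in `layer`."""
    acts = []
    handle = layer.register_forward_hook(lambda m, i, o: acts.append(o.detach()))
    model.eval().to(device)
    scores, images = [], []
    for imgs, _ in dataloader:
        acts.clear()
        model(imgs.to(device))
        feat = acts[0]                              # (B, C, H, W)
        unit = feat[:, unit_idx]                    # (B, H, W)
        scores.append(unit.flatten(1).max(dim=1).values.cpu())
        images.append(imgs)                         # fine for small eval sets
    handle.remove()
    scores, images = torch.cat(scores), torch.cat(images)
    top = scores.topk(min(k, len(scores))).indices
    return images[top], scores[top]
```

Keeping every image in memory is okay for a small evaluation set; for anything larger you would store dataset indices instead and reload only the top-k images.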
Would love to hear some comments/criticisms/thoughts on this.