[P] Implementing “Ambient Sound Provides Supervision for Visual Learning”
Hi, I attempted to recreate the results of the multi-modal self-supervised learning paper "Ambient Sound Provides Supervision for Visual Learning" by Owens et al.
Here is the code along with my detailed report on it – https://github.com/rowhanm/ambient-sound-self-supervision
Some key things I learned during this:
- Sound is an interesting supervision signal for image/scene recognition.
- The representation of the sound matters quite a lot. I tried using MFCCs alongside the statistical sound summaries proposed in the original paper and saw some improvement on the downstream evaluation task (a rough sketch of both feature pipelines follows this list).
- I don't know what makes an optimal sound representation; perhaps it could be learned rather than hand-crafted.
- Visualizing the top-activated images for units learned with sound as supervision, we see that the model somewhat understands the context in which an object occurs (e.g., a fish splashing in water, or a person holding some kind of fish) rather than the details of the object itself (a small sketch of this visualization is also included after the list).
- Can we combine both audio and images as supervisory signals? What are some good papers on this?
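For anyone curious about the sound features, here is a minimal sketch of the two representations I compared: MFCCs and simple per-band statistics of the log-mel spectrogram (a crude stand-in for the statistical sound summaries from the paper). It assumes librosa is available; the parameter values (n_mfcc, n_mels, the mean/std pooling) are illustrative and not necessarily what the paper or my repo uses.

```python
import numpy as np
import librosa

def mfcc_features(wav_path, sr=22050, n_mfcc=13):
    """Mean/std-pool MFCC frames into a fixed-length clip descriptor."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)      # (n_mfcc, T)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def stat_summary_features(wav_path, sr=22050, n_mels=32):
    """Per-band mean/std of the log-mel spectrogram, a rough proxy for
    the statistical sound summaries used as supervision in the paper."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)  # (n_mels, T)
    log_mel = librosa.power_to_db(mel)
    return np.concatenate([log_mel.mean(axis=1), log_mel.std(axis=1)])
```

As in the paper, clip-level descriptors like these are then clustered (e.g., with k-means) and the cluster IDs are used as pseudo-labels for training the image CNN.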
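And this is roughly how the top-activated-images visualization works: run the dataset through the sound-supervised CNN, record the spatially max-pooled activation of a chosen unit in a conv layer via a forward hook, and keep the k images with the highest scores. The model, layer, and dataloader names here are placeholders, not the ones from my repo.

```python
import torch

@torch.no_grad()
def top_activated_images(model, layer, dataloader, unit_idx, k=8, device="cpu"):
    """Return the k images that most strongly activate `unit_idx`
    (spatially max-pooled) in `layer`."""
    acts = []
    handle = layer.register_forward_hook(lambda m, i, o: acts.append(o.detach()))
    model.eval().to(device)
    scores, images = [], []
    for imgs, _ in dataloader:
        acts.clear()
        model(imgs.to(device))
        feat = acts[0]                              # (B, C, H, W)
        unit = feat[:, unit_idx]                    # (B, H, W)
        scores.append(unit.flatten(1).max(dim=1).values.cpu())
        images.append(imgs)                         # fine for small eval sets
    handle.remove()
    scores, images = torch.cat(scores), torch.cat(images)
    top = scores.topk(min(k, len(scores))).indices
    return images[top], scores[top]
```

Keeping every image in memory is okay for a small evaluation set; for anything larger you would store dataset indices instead and reload only the top-k images.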
Would love to hear some comments/criticisms/thoughts on this.