[D] Do CNNs understand semantic relationships between their classes?
How do CNNs understand “compositional semantic relationships” between their classes? The problem exists in the entire field, but I’m referencing this paper in particular: http://gandissect.csail.mit.edu/
In the 3rd paragraph of the introduction of the paper, Bau et al. say (emphasis mine):
To a human observer, a well-trained GAN appears to have learned facts about the objects in the image: for example, a door can appear on a building but not on a tree. We wish to understand how a GAN represents such structure. Do the objects emerge as pure pixel patterns without any explicit representation of objects such as doors and trees, or does the GAN contain internal variables that correspond to the objects that humans perceive?
From my (limited) understanding of CNNs, they take in the input image (HxWx3 channels) and pass it through a bunch of filters and maxpool layers. Each maxpool layer reduces the HxW of the matrix. Each filter layer increases the depth of matrix as we move away from the image and toward the “highest levels of representation.” In a sense, we’re abstracting toward higher and higher level features as the receptive field of our neurons increases and we’re able to model longer-distance relationships.
Finally, the last layer of the CNN is connected to the class output by a fully connected layer.
From what I’ve read, it seems like the field is a bit split on this. You have this paper saying
oh look! GANs (and thus CNNs) can understand relationships between classes because doors don’t appear in the sky
But others also say it’s an explicit shortcoming of the convolution function — that the spatial equivariance of convolution means it inherently cannot understand these relationships.
A CNN only looks for 2 eyes, 1 nose, and 1 mouth. It doesn’t care that the eyes are parallel and above the nose, or that the nose is above the mouth!
My take is that the CNN can understand broadly the correlations between classes because of the last, FC layer. As a result, it can understand that maybe standing is negatively correlated with beer, or that pens are correlated with paper. But, it can’t understand spatial relationships.
What do you guys think of this issue?