[D] Neural Networks Without Bias Terms Are Brightness Invariant
What do matrix multiplication, ReLU, and max pooling all have in common? Yes, their second derivatives are all zero (wherever they're defined), but there is another interesting property that they all satisfy:

f(a x) = a f(x), for any a ≥ 0

(ReLU and max pooling only satisfy this for non-negative a, which is why the condition matters — this property is known as positive homogeneity.)
Which means that, when you stack these on top of each other, scaling the input of the network by some positive constant is equivalent to scaling the output by that same constant. Moreover, there are cases where the scale of the output doesn’t matter (e.g. if predicted classes are based on argmax of the network output).
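You can check this property numerically. Here's a minimal sketch using a hypothetical two-layer bias-free network in numpy (the weights and shapes are arbitrary, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy bias-free network: matmul -> ReLU -> matmul.
W1 = rng.standard_normal((8, 4))
W2 = rng.standard_normal((3, 8))

def f(x):
    # No bias terms anywhere, so every layer is positively homogeneous.
    return W2 @ np.maximum(W1 @ x, 0.0)

x = rng.standard_normal(4)
a = 2.5  # any positive scale factor

# Scaling the input by a scales the output by exactly a.
assert np.allclose(f(a * x), a * f(x))
```

Note that adding a single bias anywhere breaks the assertion, since f(a x) then picks up a term that doesn't scale with a.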
This leaves us in an interesting situation where it’s actually very easy to encode complete brightness invariance in a network — you can brighten or darken an image by any positive factor without affecting its predictions (assuming you normalize the scale of the output in some way — softmax, sphere projection, etc.).
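Concretely, for a bias-free classifier the logits themselves change with brightness, but since they all change by the same positive factor, the argmax prediction doesn't. A toy sketch (the shapes are arbitrary stand-ins for a flattened image classifier, not a real CIFAR model):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy bias-free "image" classifier: 3072 inputs -> 16 hidden -> 10 classes.
W1 = rng.standard_normal((16, 3072))
W2 = rng.standard_normal((10, 16))

def logits(x):
    return W2 @ np.maximum(W1 @ x, 0.0)

img = rng.random(3072)   # a flattened image with values in [0, 1)
dark = 0.3 * img         # the same image, darkened by a factor of 0.3

# Logits scale by 0.3, so their ordering — and the predicted class — is unchanged.
assert np.argmax(logits(img)) == np.argmax(logits(dark))
```

Multiplying a vector by a positive scalar preserves its ordering, which is why any argmax-based (or softmax-then-argmax) prediction is brightness invariant here.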
I’ve trained models with and without biases on CIFAR and found both reasonable. I suspect a more rigorous comparison would find that networks without bias terms tend to do marginally worse than models with them — if only because they have fewer parameters (e.g. without bias terms you can’t learn that airplane images are usually brighter than frog images).
But in the interest of developing networks that actually generalize well to the real world (not just to the random sample of your data you held out as a test set), a modest performance gap seems permissible if it means you can be confident your network will work well under significantly different lighting conditions.