[D] How to deal with bags of images?
We are building a classifier that takes bags of images as input and outputs binary labels *per bag*. A bag contains between 2 and 25 images (photos of the same object from different angles), and we must output a fixed-length binary vector for each bag.
What we are using right now:
- We filter out the 5% of bags with too many images, which leaves a maximum bag size of 13.
- Bags with fewer than 13 images are padded with grey images (we could also repeat some of the images instead).
- The classifier is fit to predict the binary label *for each image*. So, for the first bag, we would have an input tensor of shape (13 x 224 x 224 x 3) and an output of shape (13 x n), where each image is 224 x 224 with 3 channels and n is the length of the binary vector.
- We make predictions for each image for each bag in the test set.
- We use a heuristic to aggregate the 13 prediction vectors into a single one. That could be a simple maximum, some sort of mean, etc.
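To make the steps above concrete, here is a minimal NumPy sketch of the padding and aggregation parts. The function names `pad_bag` and `aggregate`, the grey value of 0.5 (assuming images scaled to [0, 1]), and the validity mask are all assumptions for illustration. The mask in particular is not part of the pipeline as described; it is one way to keep the grey padding out of the final max or mean.

```python
import numpy as np

MAX_BAG = 13   # maximum bag size after filtering (from the post)
GREY = 0.5     # assumed grey value for images scaled to [0, 1]


def pad_bag(images, max_bag=MAX_BAG, fill=GREY):
    """Pad a (k, H, W, C) bag to (max_bag, H, W, C); also return a (max_bag,) validity mask."""
    k = images.shape[0]
    if k > max_bag:
        raise ValueError(f"bag has {k} images, more than max_bag={max_bag}")
    padded = np.full((max_bag,) + images.shape[1:], fill, dtype=images.dtype)
    padded[:k] = images
    mask = np.zeros(max_bag, dtype=bool)
    mask[:k] = True          # True for real images, False for grey padding
    return padded, mask


def aggregate(probs, mask, how="max"):
    """Collapse (max_bag, n) per-image probabilities to (n,), ignoring padded slots."""
    real = probs[mask]       # keep only rows that belong to real images
    if how == "max":
        return real.max(axis=0)
    if how == "mean":
        return real.mean(axis=0)
    raise ValueError(f"unknown aggregation: {how}")


# Usage: a bag of 4 images, with random numbers standing in for model outputs (n = 5).
bag = np.random.rand(4, 224, 224, 3).astype(np.float32)
padded, mask = pad_bag(bag)
probs = np.random.rand(MAX_BAG, 5)
bag_prediction = aggregate(probs, mask, how="max") > 0.5   # fixed-length binary vector
```

Max aggregation matches the case where a feature is visible from only one or two angles; a mean would dilute such a feature across the other images of the bag.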
This pipeline feels unsatisfactory, because the model never sees all the images of a bag at once. Also, the training signal seems noisy, since most images, if labeled individually by a human, would just get zero vectors.
We also have two ideas we will try:
- Make the model operate directly on bags of images. For example, with a batch size of 16, the pipeline described above has an input of shape 208 x 224 x 224 x 3 (16 bags x 13 images) and an output of shape 208 x n. We could instead make the input 16 x 13 x 224 x 224 x 3 and the output 16 x n, and use 3D convolutions instead of 2D ones. This seems a lot cleaner. However, the images are not “similar”. Frames from a video would be “similar”, since there is only a small angle change between consecutive frames. This is not the case here. Maybe we could start with several consecutive layers of 2D convolutions before moving on to 3D layers? This still feels wrong, but it’s hard for me to explain why.
- Using the pipeline above, we get a label of shape 13 x n for each bag. Most of these 1 x n rows are wrong, since they should be mostly zeros (the features we are looking for are small and are usually visible from only one or two angles). So, we could use some heuristic to find the “true” labels for each separate photo. Could you recommend some papers or methods for this?
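Both ideas can be sketched in a few lines of NumPy. This setup (labels known per bag, not per image) is usually called multiple-instance learning. Everything below is an illustration under assumptions, not a tested recipe. `encode_images` is a hypothetical stand-in for a shared 2D CNN. `bag_features` replaces the 3D convolution with a masked max over the bag axis, so the result does not depend on the order of the photos. `topk_pseudo_labels` is one possible heuristic for the second idea: for each positive bag label, only the `k` highest-scoring real images keep that label.

```python
import numpy as np


def encode_images(x, feat_dim=32, seed=0):
    """Hypothetical stand-in for a shared 2D CNN: (N, H, W, C) -> (N, feat_dim)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((x.shape[-1], feat_dim))
    return x.mean(axis=(1, 2)) @ w   # global average pool, then a linear map


def bag_features(bags, mask):
    """(B, K, H, W, C) bags and a (B, K) validity mask -> (B, feat_dim)."""
    b, k = bags.shape[:2]
    flat = bags.reshape(b * k, *bags.shape[2:])       # e.g. 16 x 13 -> 208 images
    feats = encode_images(flat).reshape(b, k, -1)     # back to (B, K, D)
    feats = np.where(mask[:, :, None], feats, -np.inf)  # padded slots never win the max
    return feats.max(axis=1)                          # order-invariant pooling over the bag


def topk_pseudo_labels(probs, mask, bag_label, k=2):
    """probs (K, n), mask (K,), bag_label (n,) -> (K, n) per-image pseudo-labels."""
    scores = np.where(mask[:, None], probs, -np.inf)
    top = np.argsort(-scores, axis=0)[:k]             # (k, n): best images per label
    labels = np.zeros(probs.shape, dtype=np.int64)
    labels[top, np.arange(probs.shape[1])] = 1
    labels *= bag_label.astype(np.int64)[None, :]     # only labels the bag actually has
    labels[~mask] = 0                                 # padding never gets a label
    return labels


# Usage: 2 bags of up to 13 small images; the first bag has 4 real images, the second 13.
bags = np.random.rand(2, 13, 16, 16, 3)
mask = np.zeros((2, 13), dtype=bool)
mask[0, :4] = True
mask[1, :] = True
features = bag_features(bags, mask)   # shape (2, 32), one vector per bag
```

A classifier head on top of `features` would then output one n-dimensional vector per bag, so the loss is computed against the bag label directly and no per-image aggregation heuristic is needed at test time.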
Do you have any tips, tricks, ideas to try, papers to read?