[D] CNN Image Segmentation: Why do UNET-like architectures outperform sliding-window approaches?
I’m writing a thesis that heavily focuses on semantic segmentation of biomedical images.
I’m reviewing different segmentation approaches and have identified two main branches:
- A sliding-window approach: a classification network is applied to patches of the original image, and the per-patch predictions are assembled into a pixel-by-pixel estimate of the probability maps.
- A full-image approach: architectures such as FCN and U-Net (https://arxiv.org/abs/1505.04597) are fully convolutional, and the upsampling phase is incorporated into the network itself via transposed convolutions.
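To make the computational contrast between the two branches concrete, here is a small sketch; the image size, patch size, and stride are illustrative assumptions, not numbers from any specific paper:

```python
# Rough comparison of per-image work for the two approaches.
# Image size, patch size, and stride below are illustrative assumptions.

def sliding_window_passes(h, w, patch, stride=1):
    """Number of patch-classifier forward passes needed for a dense
    pixel-wise prediction over an h x w image."""
    return ((h - patch) // stride + 1) * ((w - patch) // stride + 1)

h, w, patch = 512, 512, 64
n_passes = sliding_window_passes(h, w, patch)
print(n_passes)  # 449 * 449 = 201601 forward passes over mostly shared pixels
# A fully convolutional network (FCN / U-Net) produces the same dense map
# in a single forward pass, sharing computation between the overlapping
# receptive fields instead of recomputing it per window.
```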
The second approach clearly outperforms the first. I have a vague hunch about why: my hypothesis is that transposed convolutions, being local operations at their core, impose a local criterion on the segmentation of nearby pixels, so that pixel contiguity is strongly encouraged in the fully convolutional case.
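To illustrate what I mean by locality, here is a minimal hand-written transposed convolution (the kernel and input values are arbitrary assumptions, just for the demo): every input value influences only one small output neighbourhood, so the output pixels inside that neighbourhood are tied to a single upstream activation.

```python
import numpy as np

def transposed_conv2d(x, k, stride=2):
    """Transposed convolution of a 2-D map x with kernel k (no padding),
    written out explicitly to expose its locality."""
    kh, kw = k.shape
    h, w = x.shape
    out = np.zeros((stride * (h - 1) + kh, stride * (w - 1) + kw))
    for i in range(h):
        for j in range(w):
            # x[i, j] is spread over a single kh x kw output neighbourhood.
            out[i * stride:i * stride + kh, j * stride:j * stride + kw] += x[i, j] * k
    return out

x = np.array([[1.0, 0.0],
              [0.0, 2.0]])
k = np.ones((2, 2))          # nearest-neighbour-style upsampling kernel
y = transposed_conv2d(x, k)
print(y)
# With stride 2 and a 2x2 kernel, each 2x2 output block equals x[i, j] * k:
# the four pixels of a block are perfectly correlated with one input value,
# which is one way to read "local consistency" into the upsampling path.
```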
I do not find this explanation satisfying, for two reasons:
- I have no papers or real data to support it: I cannot seem to find any paper on the topic.
- The sliding-window approach has a built-in form of local consistency as well: if overlapping windows share most of their pixels, it is reasonable to think that – provided the network is not totally chaotic and behaves smoothly enough – their outputs would be similar.
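The second point can be sketched numerically. Assuming a 64-pixel window at stride 1 and standing in a perfectly linear "network" (the patch mean, purely for illustration), adjacent windows share almost all of their pixels, and the output difference is bounded by the non-shared fraction:

```python
import numpy as np

# Sketch of the built-in-consistency argument: two windows at stride 1
# share almost all their pixels, so any smooth (here: linear) model
# must give them similar outputs. Patch size P = 64 is an assumption.

rng = np.random.default_rng(0)
image = rng.random((64, 128))    # pixel values in [0, 1)
P = 64

# Fraction of pixels two horizontally adjacent windows have in common.
shared = (P - 1) * P / (P * P)
print(shared)                    # 63/64 ~ 0.984

# A perfectly linear stand-in for the network: the patch mean.
a = image[:, 0:P].mean()
b = image[:, 1:P + 1].mean()
# The difference comes only from the two non-shared columns, so it is
# bounded by (1 - shared) times the maximum pixel value.
print(abs(a - b))
```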
Does anyone have insight into, or sources on, any of this? Any contribution, even brainstorming or an unsupported hypothesis (like mine), is much appreciated.