
[D] CNN Image Segmentation: Why do UNET-like architectures outperform sliding-window approaches?

I’m writing a thesis that heavily focuses on semantic segmentation of biomedical images.

I’m reviewing different segmentation approaches, identifying two main approach branches:

  • A sliding-window approach: a classification network is run over different patches of the original image to reconstruct pixel-by-pixel estimates of the probability maps.
  • A full-image approach: architectures like FCN and U-Net rely on fully convolutional networks, with the upscaling phase incorporated into the network itself via transposed convolutions.
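As a rough sketch of the computational contrast between the two branches (patch and image sizes here are hypothetical, not from the post): a sliding-window classifier needs one forward pass per pixel, whereas a fully convolutional network labels the whole image in a single pass.

```python
import numpy as np

def sliding_window_patches(image, patch=5):
    """Extract one zero-padded patch per pixel; each patch would feed a classifier."""
    h, w = image.shape
    r = patch // 2
    padded = np.pad(image, r)
    return [padded[i:i + patch, j:j + patch] for i in range(h) for j in range(w)]

img = np.zeros((32, 32))
patches = sliding_window_patches(img)
# One classifier call per pixel: 32 * 32 = 1024 forward passes,
# versus a single forward pass for a fully convolutional network.
print(len(patches))  # 1024
```

Beyond the redundancy of re-computing features for overlapping patches, this is part of why the full-image branch dominates in practice.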

The second approach clearly outperforms the first. I have a vague hunch as to why: my hypothesis is that transposed convolutions, being local operations at their core, impose local criteria on the segmentation of nearby pixels, so that pixel contiguity is strongly encouraged in the fully convolutional case.
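To make the locality intuition concrete, here is a minimal 1D transposed convolution written by hand (an illustrative sketch, not code from the post): each input value is scattered into a local window of the output, so adjacent output pixels receive contributions from overlapping sets of inputs and are therefore coupled.

```python
import numpy as np

def transposed_conv1d(x, kernel, stride=2):
    """Each input x[i] adds kernel * x[i] into a local output window
    starting at i * stride -- a purely local influence."""
    k = len(kernel)
    out = np.zeros(stride * (len(x) - 1) + k)
    for i, v in enumerate(x):
        out[i * stride:i * stride + k] += v * kernel
    return out

x = np.array([1.0, 2.0, 3.0])
kern = np.array([1.0, 1.0, 1.0])
y = transposed_conv1d(x, kern)
# Output length 2*(3-1)+3 = 7; where the windows overlap (e.g. out[2]
# and out[4]), neighbouring inputs mix, coupling nearby output pixels.
print(y)  # [1. 1. 3. 2. 5. 3. 3.]
```

Whether this coupling alone explains the performance gap is exactly the open question of the post.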

I do not find this explanation satisfying, for two reasons:

  1. I have no papers or real data to support it: I cannot find any paper on the topic.
  2. The sliding-window approach has a built-in form of local consistency as well: if overlapping windows share most of their pixels, it is reasonable to think that, provided the network is not totally chaotic and behaves smoothly enough, the outputs would be similar.
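The second point can be quantified with a small back-of-the-envelope calculation (window size and stride here are hypothetical, chosen for illustration): at stride 1, two adjacent k x k windows share a (k - 1) x k region, so for any moderately large patch the shared-pixel fraction is close to 1.

```python
def window_overlap(k, s=1):
    """Fraction of pixels shared by two k x k windows offset by stride s (s < k)."""
    return max(k - s, 0) * k / (k * k)

# For a hypothetical 65x65 patch at stride 1, neighbouring windows share
# 64/65 of their pixels (about 98.5%), so a smooth classifier should
# already produce similar outputs for adjacent pixels.
print(window_overlap(65))
```

This supports the post's objection: locality alone seems insufficient to explain the gap, since the sliding-window approach gets near-total input overlap for free.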

Does anyone have a bit of insight or a source on any of this? Any contribution, even brainstorming or an unsupported hypothesis (like mine), is much appreciated.

submitted by /u/automatedredditor