[D] Feature Loss vs. GANs – what are the trade-offs?
I’m doing a bit of reading on the speech enhancement problem, where you have an audio signal containing human speech plus some noise, and you want to extract just the human speech. It’s pretty analogous to image denoising or “super-resolution”, and a lot of the techniques from the image domain are being borrowed and re-applied to audio quite successfully (e.g. repurposing the U-Net architecture from image processing to spectrograms and then raw audio). It’s all pretty cool.
There’s some interesting work being done with loss functions in this space, and I’m looking for some clarification as to why you’d choose one approach over another. You want to compare a target image, or audio waveform, with a predicted sample, and you need to define a loss function which measures how “close” they are. The Related work – Loss functions (1.1.3) section of this paper gives a pretty good overview of the different approaches, which I’ll try to summarize here.
- Mean squared error loss: A pretty standard regression loss as far as I know, but it only compares one pixel (or audio sample) at a time, with no notion of the surrounding structure: “minimizing MSE encourages finding pixel-wise averages of plausible solutions which are typically overly-smooth and thus have poor perceptual quality”.
- Feature loss: This is where you pre-train a network on a similar problem, such as image classification, and then you freeze the weights. For both the target and predicted sample, you run each through the classification network, then grab some internal activations from that network and call them “features”. You compute some distance between these feature vectors to get your loss. The key idea is that the classification network is able to capture important features that MSE loss cannot (more detail here).
- GAN loss: A discriminator network trains in tandem with the generator network, where the job of the discriminator is to classify whether its input is “real” or “generated”. Like the feature loss network, it can detect features that MSE loss cannot, but because it keeps training, it can also punish identifiable quirks of the generator network, whereas a frozen feature loss network can't adapt and so can potentially be “hacked” by the generator network.
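To make the three options concrete, here's a rough sketch of how I understand them, in plain Python on toy two-sample "signals". Everything here is my own illustration: `FROZEN_FILTERS` is a hypothetical stand-in for a pre-trained network's frozen layer, and the discriminator is reduced to a single probability that you'd normally get from a trained network. The first part also shows the averaging effect from the MSE quote: when two targets are equally plausible, the blurry average scores better under MSE than either sharp answer.

```python
import math

def mse_loss(pred, target):
    """Mean squared error, computed element by element."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def expected_mse(pred, plausible_targets):
    """Average MSE of one prediction over several equally plausible targets."""
    return sum(mse_loss(pred, t) for t in plausible_targets) / len(plausible_targets)

# Hypothetical stand-in for a frozen, pre-trained layer:
# two fixed filters slid across the signal, followed by a ReLU.
FROZEN_FILTERS = [[1.0, -1.0], [0.5, 0.5]]

def features(signal):
    """ReLU activations of each frozen filter at each valid position."""
    feats = []
    for filt in FROZEN_FILTERS:
        for i in range(len(signal) - len(filt) + 1):
            act = sum(w * signal[i + j] for j, w in enumerate(filt))
            feats.append(max(0.0, act))
    return feats

def feature_loss(pred, target):
    """Distance between frozen-network activations rather than raw samples."""
    return mse_loss(features(pred), features(target))

def generator_gan_loss(d_prob_real):
    """Non-saturating generator loss, -log D(G(x)).

    d_prob_real is the discriminator's probability that the generated
    sample is real (assumed to come from a separately trained network).
    """
    return -math.log(d_prob_real)

# Two equally plausible clean signals for the same noisy input.
targets = [[1.0, -1.0], [-1.0, 1.0]]
blurry = [0.0, 0.0]   # sample-wise average of the two targets
sharp = [1.0, -1.0]   # commits to one plausible answer

print(expected_mse(blurry, targets))  # 1.0
print(expected_mse(sharp, targets))   # 2.0
print(feature_loss(blurry, sharp))    # 2.0
```

Note that the toy feature extractor maps `[-1.0, 1.0]` and `[0.0, 0.0]` to the same activations, so a frozen feature network can assign zero loss to inputs that differ. That's only a property of this made-up filter bank, but it's the kind of blind spot I mean by "hacked".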
So my questions are:
- Have I characterised these approaches well?
- Why would you ever choose feature loss over using a discriminator network (i.e. a GAN)?
  - Discriminators can punish the generator for being predictably wrong (i.e. common artifacts)
  - Pre-trained feature loss networks may better represent image features, if they have been trained for longer, on larger data sets
  - Apparently GANs can have stability issues when training
- The SRGAN paper suggests using both feature loss and a GAN for their loss function – is this the best known approach?
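
For reference, this is roughly what I mean by "using both": a weighted sum of a feature (content) term and an adversarial term for the generator. It's only a sketch under my own assumptions. The function name, the argument names, and the default `adv_weight` are placeholders I picked for illustration, so check the paper for the weighting they actually use.

```python
import math

def combined_generator_loss(pred_feats, target_feats, d_prob_real, adv_weight=1e-3):
    """Feature (content) loss plus a down-weighted adversarial loss.

    pred_feats / target_feats: activations from a frozen feature network.
    d_prob_real: discriminator's probability that the prediction is real.
    adv_weight: illustrative placeholder for the adversarial term's weight.
    """
    content = sum((p - t) ** 2 for p, t in zip(pred_feats, target_feats)) / len(pred_feats)
    adversarial = -math.log(d_prob_real)
    return content + adv_weight * adversarial
```

With a weighting like this, the feature term dominates the gradient and the discriminator acts more like a regulariser that nudges outputs towards looking "real", which might be part of why the combination trains more stably than a pure GAN loss.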