[D] SELUs don’t actually solve the dying ReLU problem
One frequently mentioned problem with ReLUs is that they can get stuck outputting nothing but 0s when their input shifts such that every value is negative. SELUs [1] are claimed to solve this problem. However, there is another way that activation functions can stop being useful to the network: they can degenerate to a linear function. This can happen with ReLUs, SELUs and some other activation functions when their input shifts such that every value is positive.

To demonstrate this I made a simple toy network. The task is to approximate the ReLU function itself with the function c * f(a * x + b) + d, where f is the activation function and a, b, c and d are the learned parameters. If we start with a = 1 and b = 0.5 then all inputs to f will be positive. For many starting points of c and d this will still converge when ReLU is used for f. But for c = 1 and d = -0.5 all common piecewise activation functions fail, including SELU and ELU.

However, there is a potential activation function that does not exhibit this problem and that I don't see discussed much: softplus, defined as log(exp(x) + 1). Its derivative is the logistic sigmoid 1 / (1 + exp(-x)), which is strictly monotonically increasing, so softplus is non-linear on every subinterval. Using softplus in place of f in the toy example allows it to converge from any starting point. [proof pending]

In the following images you can see the learned function at different numbers of iterations. The starting point a = 1, b = 0.5, c = 1 and d = -0.5 was used. All runs use the Adam optimizer with a learning rate of 0.1 and default values for its beta parameters, and minimize the mean absolute difference. TensorFlow 1.14.0 was used.

In practice the inputs to activation functions may follow a long-tailed distribution, making this failure mode very unlikely when the loss function is fixed. But for some problems, such as adversarial networks, where the loss function itself is learned, this might not be the case.

There are even situations where SELU fails to converge whereas ReLU and ELU do. The following images use the starting point a = 1, b = 0.5, c = 1 and d = 0. Again, all initial inputs to the activation function are positive. As we can see, this does not necessarily mean the network is stuck.

ReLU: the slight curve at 0 is a result of under-sampling the function.

ELU: because its gradient is monotonic, it does not have the same problem as SELU.

Alternative title for this post: SELU considered harmful. (I mean no offense to the authors; their paper is truly insightful and you should definitely read it!)
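For anyone who wants to reproduce the toy experiment, here is a minimal sketch. It uses JAX with a hand-rolled Adam rather than the TensorFlow 1.14 setup described above, and the input range and number of steps are assumptions chosen for illustration; the model form c * f(a * x + b) + d, the learning rate of 0.1 and the default betas follow the description above.

```python
import jax
import jax.numpy as jnp

def make_loss(f):
    # Toy model: c * f(a * x + b) + d, trained to approximate ReLU(x) itself
    # by minimizing the mean absolute difference.
    def loss(params, x):
        a, b, c, d = params
        pred = c * f(a * x + b) + d
        return jnp.mean(jnp.abs(pred - jax.nn.relu(x)))
    return loss

def train(f, params, steps=2000, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    # Adam with a learning rate of 0.1 and default betas, as in the post.
    # The input range is an assumption (not stated in the post), chosen so
    # that a * x + b starts non-negative when a = 1 and b = 0.5.
    x = jnp.linspace(-0.5, 1.0, 256)
    loss = make_loss(f)
    grad = jax.jit(jax.grad(loss))
    m = jnp.zeros_like(params)
    v = jnp.zeros_like(params)
    for t in range(1, steps + 1):
        g = grad(params, x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        params = params - lr * m_hat / (jnp.sqrt(v_hat) + eps)
    return params, float(loss(params, x))

activations = [("relu", jax.nn.relu), ("selu", jax.nn.selu),
               ("elu", jax.nn.elu), ("softplus", jax.nn.softplus)]

# Starting points from the post: d = -0.5 (where the piecewise activations
# reportedly get stuck) and d = 0 (where reportedly only SELU fails).
for d0 in (-0.5, 0.0):
    for name, f in activations:
        _, final_loss = train(f, jnp.array([1.0, 0.5, 1.0, d0]))
        print(f"d = {d0:+.1f}  {name:8s}  final loss = {final_loss:.4f}")
```

Only the final loss is printed to keep the sketch short; plotting c * f(a * x + b) + d over x at intermediate steps should give curves corresponding to the images described above.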