[D] Learnable activation function and aggregate (reduction) function
Any recent papers pointing to these ideas?
The activation function itself can be learned instead of being fixed to ReLU, tanh, etc.
https://arxiv.org/pdf/1412.6830.pdf ("Learning Activation Functions to Improve Deep Neural Networks", ICLR 2015).
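For concreteness, the linked paper's adaptive piecewise-linear (APL) unit is f(x) = max(0, x) + Σ_s a_s · max(0, −x + b_s), where the slopes a_s and hinge locations b_s are learned per neuron by backprop. A minimal numpy sketch of just the forward pass (parameter values here are arbitrary, for illustration only):

```python
import numpy as np

def apl_unit(x, a, b):
    """Adaptive piecewise-linear activation (Agostinelli et al., 2015):
    f(x) = max(0, x) + sum_s a_s * max(0, -x + b_s).
    In a network, a and b are learnable parameters; here they are fixed
    just to show the shape of the function."""
    out = np.maximum(0.0, x)
    for a_s, b_s in zip(a, b):
        out = out + a_s * np.maximum(0.0, -x + b_s)
    return out

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
print(apl_unit(x, a=[0.2], b=[0.0]))  # → [0.4 0.1 0.  1.  3. ]
```

With one hinge at b=0 and slope a=0.2 this reduces to a leaky-ReLU shape, which is why PReLU (a learnable leaky slope) is the simplest special case of this family.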
And what about aggregate (reduction) functions like max-pool and average-pool? Can they be learned instead of being fixed?
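One existing answer to this: generalized-mean (GeM) pooling, where a single learnable exponent p interpolates between average pooling (p = 1) and max pooling (p → ∞). A numpy sketch of the idea (shapes and names are my own, not from a particular implementation):

```python
import numpy as np

def gem_pool(x, p):
    """Generalized-mean pooling: (mean(x^p))^(1/p).
    p = 1 recovers average pooling; large p approaches max pooling.
    In a network, p would be a learnable scalar trained by backprop."""
    x = np.clip(x, 1e-6, None)  # keep the power well-defined (non-negative inputs)
    return np.mean(x ** p) ** (1.0 / p)

x = np.array([0.1, 0.5, 2.0, 4.0])
print(gem_pool(x, 1.0))   # → 1.65 (exactly average pooling)
print(gem_pool(x, 64.0))  # close to max(x) = 4.0
```

Since p gets a gradient like any other parameter, each pooling layer can settle anywhere between averaging and taking the max, which is exactly the "learned reduction" the question asks about.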