Category: Reddit MachineLearning
[D] Jurgen Schmidhuber on Seppo Linnainmaa, inventor of backpropagation in 1970
still mining Jurgen’s dense blog post on their miraculous year 1990-1991, a rich resource for reddit threads, see exhibits A, B, C
everybody in deep learning is using backpropagation, but many don’t know who invented it; the blog has a separate website on this which says:
Its modern version (also called the reverse mode of automatic differentiation) was first published in 1970 by Finnish master’s student Seppo Linnainmaa
whose thesis introduced the algorithm five decades ago [BP1], in Finnish; English version here
In the course of many trials, Seppo Linnainmaa’s gradient-computing algorithm of 1970 [BP1], today often called backpropagation or the reverse mode of automatic differentiation, is used to incrementally weaken certain NN connections and strengthen others, such that the NN behaves more and more like the teacher
Jurgen’s scholarpedia article on deep learning also cites an earlier paper by Kelley (Gradient Theory of Optimal Flight Paths, 1960) which already had the recursive chain rule for continuous systems, and papers by Bryson 1961 and Dreyfus 1962:
BP’s continuous form was derived in the early 1960s (Kelley, 1960; Bryson, 1961; Bryson and Ho, 1969). Dreyfus (1962) published the elegant derivation of BP based on the chain rule only.
however, that was not yet Seppo Linnainmaa’s
explicit, efficient error backpropagation (BP) in arbitrary, discrete, possibly sparsely connected, NN-like networks
BP’s modern efficient version for discrete sparse networks (including FORTRAN code) was published by Linnainmaa (1970). Here the complexity of computing the derivatives of the output error with respect to each weight is proportional to the number of weights. That’s the method still used today.
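to make that complexity claim concrete, here is a tiny tape-based sketch of reverse mode in plain Python (an illustration of the idea only, not Linnainmaa’s 1970 algorithm or FORTRAN code; the `Var` class and `backward` function are made up for this example): one forward pass records the graph, one backward sweep visits each connection once, so getting the derivative with respect to *every* weight costs about as much as a single forward pass

```python
# Minimal tape-based reverse-mode sketch. Each Var remembers its
# parents and the local partial derivative along each edge; backward()
# does one reverse sweep, touching every edge exactly once.

class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # list of (parent_var, local_gradient)
        self.grad = 0.0

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

def backward(output):
    # Topologically order the graph, then push gradients to parents.
    order, seen = [], set()
    def visit(v):
        if id(v) not in seen:
            seen.add(id(v))
            for parent, _ in v.parents:
                visit(parent)
            order.append(v)
    visit(output)
    output.grad = 1.0
    for v in reversed(order):
        for parent, local in v.parents:
            parent.grad += v.grad * local

# y = w1*x + w2*x with x=3, w1=2, w2=4, so dy/dw1 = dy/dw2 = 3
x, w1, w2 = Var(3.0), Var(2.0), Var(4.0)
y = w1 * x + w2 * x
backward(y)
print(w1.grad, w2.grad)  # -> 3.0 3.0
```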
Jurgen’s comprehensive survey also cites Andreas Griewank, godfather of automatic differentiation who writes
Nick Trefethen [13] listed automatic differentiation as one of the 30 great numerical algorithms of the last century… Seppo Linnainmaa (Lin76) of Helsinki says the idea came to him on a sunny afternoon in a Copenhagen park in 1970…
starting on page 391, Griewank’s survey explains in detail what Linnainmaa did, it’s really illuminating
Gerardi Ostrowski came a tad too late: he published reverse mode backpropagation in 1971, in German, one year after Linnainmaa; hey, publish first or perish
the scholarpedia article also says:
Dreyfus (1973) used BP to change weights of controllers in proportion to such gradients.
later Paul Werbos was the first to apply this to neural networks, not in 1974, as some say, but in 1982:
Werbos (1982) published the first application of BP to NNs, extending thoughts in his 1974 thesis, which did not yet have Linnainmaa’s modern, efficient form of BP.
Jurgen famously complained that Yann & Yoshua & Geoff did not mention the inventors of backpropagation
They heavily cite each other. Unfortunately, however, they fail to credit the pioneers of the field, which originated half a century ago.
astonishingly, the recent Turing award laudation refers to Yann’s variants of backpropagation and Geoff’s computational experiments with backpropagation, without clarifying that the method was invented by others
in the GAN thread someone wrote that “LeCun quipped that backpropagation was invented by Leibniz because it’s just the chain rule of derivation”, but that’s a red herring: Linnainmaa’s reverse mode backpropagation is more specific than the bare chain rule, it is the efficient recursive chain rule for graphs, and Leibniz did not have that
section 3 of the blog mentions Linnainmaa again in the context of Sepp Hochreiter’s 1991 thesis VAN1 which
formally showed that deep NNs suffer from the now famous problem of vanishing or exploding gradients: in typical deep or recurrent networks, back-propagated error signals either shrink rapidly, or grow out of bounds. In both cases, learning fails… Note that Sepp’s thesis identified those problems of backpropagation in deep NNs two decades after another student with a similar first name (Seppo Linnainmaa) published modern backpropagation or the reverse mode of automatic differentiation in his own thesis of 1970 [BP1].
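the mechanism Sepp identified can be illustrated with a toy calculation (the 0.5 and 2.0 factors below are invented for the demo, not from his thesis): backpropagating through T steps multiplies the error signal by a Jacobian factor per step, so a factor consistently below 1 makes the gradient vanish and a factor above 1 makes it explode

```python
# Toy illustration of vanishing/exploding gradients: the error signal
# is scaled by the same per-step factor T times during backprop.

def backprop_signal(factor, steps, signal=1.0):
    for _ in range(steps):
        signal *= factor
    return signal

print(backprop_signal(0.5, 50))  # shrinks toward 0 (vanishing)
print(backprop_signal(2.0, 50))  # grows out of bounds (exploding)
```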
submitted by /u/siddarth2947
[link] [comments]
[P] Learning Rate Dropout in PyTorch
https://github.com/noahgolmant/pytorch-lr-dropout
I just implemented learning rate dropout using PyTorch! This technique applies dropout to the weight update at each iteration instead of the weights themselves.
I welcome any and all feedback! I ran four trials with a ResNet34 model on CIFAR-10 using both the baseline optimizer (SGD with momentum) and this variant. I wasn’t able to achieve the numbers reported in the paper, though. Feel free to double-check the masking logic or hyperparameters in case that explains the difference.
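for readers who haven’t seen the technique, here is a rough sketch of the idea as I understand it (this is not the linked repo’s code; the function name and signature are made up): sample a Bernoulli mask per parameter each iteration and apply it to the *update*, leaving the weights themselves untouched

```python
import random

def sgd_step_with_lr_dropout(weights, grads, lr=0.1, momentum=0.9,
                             keep_prob=0.5, velocity=None, rng=random):
    # Standard SGD-with-momentum, except each coordinate's update is
    # dropped (zeroed) with probability 1 - keep_prob. Unlike ordinary
    # dropout, the weights are never zeroed, only their updates.
    if velocity is None:
        velocity = [0.0] * len(weights)
    for i, g in enumerate(grads):
        velocity[i] = momentum * velocity[i] + g
        mask = 1.0 if rng.random() < keep_prob else 0.0
        weights[i] -= lr * mask * velocity[i]
    return weights, velocity
```

with `keep_prob=1.0` this reduces to plain SGD with momentum, which is a handy sanity check when debugging the masking logic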
submitted by /u/noahgolm
[link] [comments]
[P] What could cause this behavior?
Hi,
I’m making an LSTM that takes a list of same-size vectors as input. These vectors are encodings of frames in a video, and I want the LSTM to output an encoding of the entire video. To get this encoding, I am just taking the last hidden state and feeding it through a linear layer.
My issue is that the hidden state seems to be converging to some fixed vector after a couple of time steps. It seems like the LSTM is forgetting previous states and getting stuck. What could cause this behavior? Is there a nice way to fix it?
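one possible reading of this symptom (an assumption, not a diagnosis of this particular model): if the gates saturate, the recurrent update acts like roughly the same contracting map at every step, and any contracting map drives the state to its fixed point regardless of the inputs, which looks exactly like “forgetting”; a toy scalar stand-in, with a made-up update rule:

```python
# Toy illustration (not an LSTM): repeatedly applying the contracting
# map h <- 0.5*h + 1 drives h to the map's fixed point (2.0) from any
# starting state, so the inputs' influence disappears within a few steps.

def run(h0, steps):
    h = h0
    for _ in range(steps):
        h = 0.5 * h + 1.0  # stand-in for a saturated recurrent update
    return h

print(run(10.0, 20), run(-7.0, 20))  # both converge near 2.0
```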
Thanks
submitted by /u/jsonathan
[link] [comments]
[P] Simple hyperparameter management through dependency injection
What an unruly mess some hyperparameter configurations are… In many open source deep learning codebases, the hyperparameters are treated as global variables, and it’s nothing new that global variables should be avoided. Yet, here we are.

Three years ago, I started as a junior deep learning engineer at Apple and I developed a similar approach to this one: https://www.reddit.com/r/MachineLearning/comments/e5jvhq/p_how_to_get_rid_of_boilerplate_ifstatements_and/. My team had to abandon it, though, because the solution required redundant boilerplate. Using YAML files was annoying too: YAML has little support for variables and no support for lambdas or Python objects. Lastly, it wasn’t easy to modularize YAML files the way Python functions can be modularized. Anywho, the above solutions just didn’t work.

Three years later, after tinkering and working at different companies as a deep learning engineer, I came up with this approach: https://github.com/PetrochukM/HParams. Here’s what it looks like:

The approach has a couple of benefits:
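to give a feel for the dependency-injection idea in general terms (this sketch is hypothetical and is NOT the HParams library’s actual API; `configure` and `configurable` are invented names): hyperparameters live in one registry, and a decorator injects them into any function that declares them, so no globals leak into the model code

```python
import functools
import inspect

# Hypothetical sketch of hyperparameter dependency injection.
_REGISTRY = {}

def configure(**hparams):
    """Register hyperparameter values in one central place."""
    _REGISTRY.update(hparams)

def configurable(fn):
    """Inject registered hyperparameters the caller didn't supply."""
    sig = inspect.signature(fn)
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        bound = sig.bind_partial(*args, **kwargs)
        for name in sig.parameters:
            if name not in bound.arguments and name in _REGISTRY:
                kwargs[name] = _REGISTRY[name]
        return fn(*args, **kwargs)
    return wrapper

@configurable
def make_optimizer(lr=0.01, momentum=0.0):
    return {"lr": lr, "momentum": momentum}

configure(lr=0.1, momentum=0.9)
print(make_optimizer())  # -> {'lr': 0.1, 'momentum': 0.9}
```

explicit caller arguments still win over the registry, so experiments can override any single hyperparameter locally without touching the config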
Anywho, let me know what you think! Lastly, it’s kinda cool that similar approaches to mine were discovered and implemented by the AllenNLP library and Google’s gin-config. Does that mean I’m doing something right?

submitted by /u/Deepblue129