
[D] Five major deep learning papers by Geoff Hinton did not cite similar earlier work by Jurgen Schmidhuber

still milking Jurgen’s very dense inaugural tweet about their annus mirabilis 1990-1991 with Sepp Hochreiter and others; 2 of its 21 sections already made for nice reddit threads: section 5, Jurgen really had GANs in 1990, and DanNet, the CUDA CNN of Dan Ciresan in Jurgen’s team, which won 4 image recognition challenges prior to AlexNet, but these are not the juiciest parts of the blog post

instead look at sections 1, 2, 8, 9, and 10, where Jurgen mentions work they did long before Geoff, who did not cite it, as confirmed by studying the references; at first glance it’s not obvious, it’s hidden, one has to work backwards from the references

section 1, First Very Deep NNs, Based on Unsupervised Pre-Training (1991), Jurgen “facilitated supervised learning in deep RNNs by unsupervised pre-training of a hierarchical stack of RNNs” and soon was able to “solve previously unsolvable Very Deep Learning tasks of depth > 1000,” he mentions reference [UN4], which is actually Geoff’s later similar work:

More than a decade after this work [UN1], a similar method for more limited feedforward NNs (FNNs) was published, facilitating supervised learning by unsupervised pre-training of stacks of FNNs called Deep Belief Networks (DBNs) [UN4]. The 2006 justification was essentially the one I used in the early 1990s for my RNN stack: each higher level tries to reduce the description length (or negative log probability) of the data representation in the level below.

back then unsupervised pre-training was a big deal; today it’s not so important anymore, see section 19, From Unsupervised Pre-Training to Pure Supervised Learning (1991-95 and 2006-11)
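to see the idea concretely, here is a toy NumPy sketch of greedy layer-wise unsupervised pre-training: each level is trained as an autoencoder on the codes of the level below, so each higher level compresses the representation below it (linear autoencoders stand in for the RNN/FNN stacks; all names and numbers are illustrative, not from the cited papers):

```python
import numpy as np

rng = np.random.default_rng(0)

def pretrain_layer(X, hidden, lr=0.02, steps=500):
    """fit a linear autoencoder to X by gradient descent; return codes + error"""
    n, d = X.shape
    We = rng.normal(size=(d, hidden)) * 0.1   # encoder
    Wd = rng.normal(size=(hidden, d)) * 0.1   # decoder
    for _ in range(steps):
        H = X @ We
        E = H @ Wd - X                        # reconstruction error
        Wd -= lr * H.T @ E / n
        We -= lr * X.T @ (E @ Wd.T) / n
    return X @ We, np.mean((X @ We @ Wd - X) ** 2)

# low-rank data, so every level has structure left to compress
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 16))

codes1, err1 = pretrain_layer(X, 8)        # level 1 compresses the raw input
codes2, err2 = pretrain_layer(codes1, 4)   # level 2 compresses level 1's codes
```

after this greedy stage, the stacked encoders would typically be fine-tuned jointly on the supervised task, which is what made the deep net trainable at all back then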

section 2, Compressing / Distilling one Neural Net into Another (1991), Jurgen also trained “a student NN to imitate the behavior of the teacher NN,” briefly referring to Geoff’s much later similar work [DIST2]:

I called this “collapsing” or “compressing” the behavior of one net into another. Today, this is widely used, and also called “distilling” [DIST2] or “cloning” the behavior of a teacher net into a student net.
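the distilling/compressing idea itself fits in a few lines: train the student by gradient descent to match the teacher’s soft outputs rather than hard labels; a minimal NumPy sketch with a linear teacher and student (everything here is illustrative, not from either paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# "teacher": a fixed classifier whose soft outputs serve as training targets
X = rng.normal(size=(200, 8))
W_teacher = rng.normal(size=(8, 3))
targets = softmax(X @ W_teacher)

# "student": trained by gradient descent to imitate the teacher's behavior
W_student = np.zeros((8, 3))
lr = 0.5
for _ in range(500):
    probs = softmax(X @ W_student)
    W_student -= lr * X.T @ (probs - targets) / len(X)  # cross-entropy gradient

# after training, the student's outputs closely match the teacher's
err = np.abs(softmax(X @ W_student) - targets).mean()
```

in practice the student is smaller than the teacher, which is the whole point of compressing one net into another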

section 9, Learning Sequential Attention with NNs (1990), Jurgen “had both of the now common types of neural sequential attention: end-to-end-differentiable ‘soft’ attention (in latent space) through multiplicative units within NNs [FAST2], and ‘hard’ attention (in observation space) in the context of Reinforcement Learning (RL) [ATT0] [ATT1],” the blog has a statement about Geoff’s later similar work [ATT3] which I find both funny and sad:

My overview paper for CMSS 1990 [ATT2] summarised in Section 5 our early work on attention, to my knowledge the first implemented neural system for combining glimpses that jointly trains a recognition & prediction component with an attentional component (the fixation controller). Two decades later, the reviewer of my 1990 paper wrote about his own work as second author of a related paper [ATT3]: “To our knowledge, this is the first implemented system for combining glimpses that jointly trains a recognition component … with an attentional component (the fixation controller).”
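for reference, the “soft” variant is just differentiable weighting: a controller scores each glimpse, softmax turns the scores into weights, and those weights multiply the glimpse features (the “multiplicative units”), so gradients flow end to end; a toy NumPy sketch (all shapes and names illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# a toy sequence of "glimpse" feature vectors: 5 locations, 4 features each
glimpses = rng.normal(size=(5, 4))

# a controller scores each location; here just a fixed linear scoring vector
score_vec = rng.normal(size=4)
scores = glimpses @ score_vec

# soft attention: differentiable weights multiply the glimpses, and the
# attended output is a convex combination of them
weights = softmax(scores)
attended = weights @ glimpses
```

“hard” attention instead picks one location to look at, which is non-differentiable and hence trained with RL, exactly the split the blog describes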

similarly, section 10, Hierarchical Reinforcement Learning (1990), Jurgen introduced HRL “with end-to-end differentiable NN-based subgoal generators [HRL0], also with recurrent NNs that learn to generate sequences of subgoals [HRL1] [HRL2],” referring to Geoff’s later work [HRL3]:

Soon afterwards, others also started publishing on HRL. For example, the reviewer of our reference [ATT2] (which summarised in Section 6 our early work on HRL) was last author of ref [HRL3]

section 8, End-To-End-Differentiable Fast Weights: NNs Learn to Program NNs (1991), Jurgen published a network “that learns by gradient descent to quickly manipulate the fast weight storage” of another network, with “active control of fast weights through 2D tensors or outer product updates [FAST2],” dryly referring to [FAST4a], which happens to be Geoff’s later similar paper:

A quarter century later, others followed this approach [FAST4a]
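the outer-product mechanism is simple to show: a slow net emits two vectors per step, their outer product is a rank-1 “write” into the fast weight matrix, and the fast weights immediately process the next input; a minimal NumPy sketch (the fixed linear “slow net” and all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6

# slow net (here just a fixed linear map) emits two vectors per step;
# their outer product is a rank-1 update "written" into the fast weights
W_slow = rng.normal(size=(2 * d, d)) * 0.1
W_fast = np.zeros((d, d))

x = rng.normal(size=d)
for _ in range(3):
    out = W_slow @ x
    a, b = out[:d], out[d:]
    W_fast += np.outer(a, b)      # outer-product fast weight update
    x = np.tanh(W_fast @ x)       # the fast net immediately uses the new weights
```

in the actual 1991 setup the slow net is itself trained by gradient descent through these updates, which is what makes it a net that learns to program another net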

it’s really true, Geoff did not cite Jurgen in any of these similar papers, and what’s kinda crazy: he was editor of Jurgen’s 1990 paper [ATT2], which summarised both attention learning and hierarchical RL, then later published closely related work (sections 9, 10), but did not cite it

Jurgen also famously complained that Geoff’s deep learning survey in Nature neither mentions the inventors of backpropagation (1960-1970) nor “the father of deep learning, Alexey Grigorevich Ivakhnenko, who published the first general, working learning algorithms for deep networks” in 1965

apart from the early pioneers of the 60s and 70s, like Ivakhnenko and Fukushima, most of the big deep learning concepts stem from Jurgen’s team with Sepp and Alex and Dan and others: unsupervised pre-training of deep networks, artificial curiosity and GANs, vanishing gradients, LSTM for language processing and speech and everything, distilling networks, attention learning, CUDA CNNs that win vision contests, deep nets with 100+ layers, metalearning, plus theoretical work on optimal AGI and the Gödel Machine

submitted by /u/siddarth2947