I’m trying to understand this paper that was posted in a thread here earlier, which claims to refute the Information Bottleneck [IB] theory of Deep Learning. Ironically, I get what the authors of this refutation are saying, but I fail to understand why IB was considered such a big deal in the first place.

According to this post, IB *“[opens] the black box of deep neural networks via information” and “this paper fully justifies all of the excitement surrounding it.”*

According to this post, IB *“is helping to explain the puzzling success of today’s artificial-intelligence algorithms — and might also explain how human brains learn.”* Hinton is quoted as saying, *“It’s extremely interesting, […] I have to listen to it another 10,000 times to really understand it, but it’s very rare nowadays to hear a talk with a really original idea in it that may be the answer to a really major puzzle,”* and Kyle Cranmer, a particle physicist at New York University, says that IB “somehow smells right.”

Here’s where I’m confused. Isn’t the idea that an algorithm:

1. Tries to fit the data “naively”
2. Then removes the noise and keeps just the useful model
3. Does so by stochastically iterating through a large set of examples (i.e. the stochasticity is what allows the algorithm to separate the signal from the noise)

…just a formalization of what any non-parametric supervised learning algorithm based on function approximation already does (i.e. excluding parametric models like linear regression and “fully non-parametric”, in-memory models like k-NN)?

I understand that Tishby and his co-authors provide very specific details on how this happens: namely, that there is a clear transition between two phases, (1) and (2); that what happens in (2) is what makes a Deep Learning model generalize well; and that (3), the stochasticity of SGD, is what allows the compression that happens in (2).
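For reference, here is the objective behind all this as I understand it from the original IB formulation. The symbols are my own shorthand, not taken from the post or the paper: X is the input, Y the label, T the representation computed by a hidden layer, and β ≥ 0 a trade-off parameter.

```latex
\min_{p(t \mid x)} \; I(X;T) \;-\; \beta \, I(T;Y),
\qquad \text{subject to the Markov chain } Y \leftrightarrow X \leftrightarrow T
```

If I read the claim correctly, phase (1) corresponds to I(T;Y) going up (fitting the labels) and phase (2) to I(X;T) going down (forgetting details of the input that don’t matter for the label).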

What I don’t understand is why this was considered a major paradigm-shifting result that Hinton has to hear 10,000 times to grasp and that he thinks may answer a major puzzle.

For (2), isn’t any algorithm that learns by function approximation performing data compression by design, i.e. taking the training data and trying to boil it down to a reasonably small functional form that preserves the signal and discards the noise? (Again, I’m excluding k-NN and some Parzen-based methods, which store the entire training set in memory, and parametric models like linear regression, where the functional form is assumed beforehand.)
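To make concrete what I mean by “compression”, here is a toy sketch of my own (not from either paper, and all names in it are made up for illustration). It takes 8 equally likely inputs with a binary label and compares three hand-picked representations T by how much they remember about the input, I(X;T), and about the label, I(T;Y). It uses plain plug-in entropy estimates on discrete values, which is only reasonable because the example is tiny and exact.

```python
import math
from collections import Counter


def entropy(samples):
    """Plug-in Shannon entropy (in bits) of a list of hashable samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information(a, b):
    """I(A;B) = H(A) + H(B) - H(A,B), estimated from paired samples."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))


def summarize():
    """Return (I(X;T), I(T;Y)) for three toy representations T of X."""
    xs = list(range(8))                  # 8 equally likely inputs: H(X) = 3 bits
    ys = [int(x >= 4) for x in xs]       # hypothetical label: upper half vs lower half

    representations = {
        "identity": xs,                      # keeps everything, no compression
        "high_bit": [x // 4 for x in xs],    # keeps only what the label needs
        "low_bit": [x % 2 for x in xs],      # compresses, but keeps the wrong bit
    }
    return {
        name: (mutual_information(xs, ts), mutual_information(ts, ys))
        for name, ts in representations.items()
    }


if __name__ == "__main__":
    for name, (i_xt, i_ty) in summarize().items():
        print(f"{name:>8}: I(X;T) = {i_xt:.2f} bits, I(T;Y) = {i_ty:.2f} bits")
```

The `high_bit` representation is the “compressed but still useful” case: it drops 2 of the 3 bits of the input and still predicts the label perfectly, which is what I would expect any sensible function approximator to aim for anyway.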

For (3), we’ve known since the 70s at least that adding stochasticity and random sampling improves the ability of optimization algorithms to get close to a global optimum.

AFAIK, the only really interesting part here is the phase transition between (1) and (2), but even then, phase transitions in learning and optimization problems have been studied and well known since at least the early 80s.

So what was it about *Tishby et al.* that was so revolutionary that no less than Geoffrey Hinton said he needs 10,000 epochs to grasp it, that it supposedly “opens the black box of Deep Learning”, and that its refutation by *Saxe et al.* in the aforementioned paper is such a big deal?!

What am I missing in IB? Is my overall outline of it correct?