Learn About Our Meetup

5000+ Members



Join our meetup, learn, connect, share, and get to know your Toronto AI community. 



Browse the latest deep learning, AI, and machine learning job postings from Indeed for the GTA.



If you are looking to sponsor space, be a speaker, or volunteer, feel free to give us a shout.

[R] Do you have any hacks and heuristics for quickly gaining the necessary theoretical background to understand a recent theoretical ML paper?

A challenge when reading an ML paper is that the authors don’t have the time or space to explain and clarify every advanced concept, theorem, or notation they used to get to the result they present in the paper. At best, they will point to a textbook or a review paper if the concept is novel enough from the point of view of their target audience; other times they will just assume that their readers are smart/educated enough to figure out the necessary concepts on their own. But most readers (like myself) aren’t that smart.

When I was in grad school, I had the luxury and the time to go through the references one by one, look up theorems in textbooks, etc., and dedicate several weeks to building up the necessary theoretical background to grasp a paper I was really interested in.

But now I work in industry, and I don’t have the time to pick up a textbook on graph theory or algebraic topology whenever a concept from those fields is used to illustrate or prove a point in a theoretical ML paper, or to read an additional 10 papers besides the one I am actually interested in. In fact, I barely have time to read papers at all.

Do you have any hacks/heuristics to quickly get up to speed on the necessary theoretical background for an advanced ML paper, without having to dedicate several weeks to going through graduate-level textbooks and “reverse reading” bibliographies until you get to a paper that simplifies a given concept?

submitted by /u/AlexSnakeKing
[link] [comments]

[D] Trying to wrap my head around the Information Bottleneck theory of Deep Learning: Why is it such a big deal?

I’m trying to understand this paper that was posted in a thread here earlier, which claims to refute the Information Bottleneck (IB) theory of Deep Learning. Ironically, I get what the authors of this refutation are saying, but I fail to understand why IB was considered such a big deal in the first place.

According to this post, IB “[opens] the black box of deep neural networks via information” and “this paper fully justifies all of the excitement surrounding it.”

According to this post, IB “is helping to explain the puzzling success of today’s artificial-intelligence algorithms — and might also explain how human brains learn.” Hinton is quoted as saying “It’s extremely interesting, […] I have to listen to it another 10,000 times to really understand it, but it’s very rare nowadays to hear a talk with a really original idea in it that may be the answer to a really major puzzle,” and Kyle Cranmer, a particle physicist at New York University, says that IB “somehow smells right.”

Here’s where I’m confused: isn’t the idea that an algorithm:

  1. Tries to fit the data “naively”
  2. Then removes the noise and keeps just the useful model
  3. Does so by stochastically iterating through a large set of examples (i.e. the stochasticity is what allows the algorithm to separate the signal from the noise)

…just a formalization of what any supervised learning algorithm based on function approximation does (i.e. excluding parametric models like linear regression, and “fully non-parametric”, in-memory models like k-NN)?

I understand that Tishby and his co-authors provide very specific details of how this happens, namely that there are two clear phases between (1) and (2), that what happens in (2) is what makes a Deep Learning model generalize well, and that (3) is due to the stochasticity of SGD, which allows the compression that happens in (2).
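For reference, the IB objective at the heart of this analysis trades off compression of the input against preservation of label information. Writing $T$ for the hidden representation, it is usually stated as a Lagrangian:

$$\min_{p(t \mid x)} \; I(X; T) \;-\; \beta \, I(T; Y)$$

where $I(\cdot\,;\cdot)$ is mutual information and $\beta$ controls the trade-off. In this language, the “fitting” phase increases $I(T;Y)$, while the later “compression” phase decreases $I(X;T)$.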

What I don’t understand is why this was considered a major, paradigm-shifting result that Hinton has to hear 10,000 times to grasp and deems to be the answer to a major puzzle.

For (2), isn’t any algorithm that uses function approximation to learn (i.e. excluding k-NN and some Parzen-based methods, which store the entire training set in memory, and parametric models like linear regression, where the functional form is assumed beforehand) performing data compression by design, i.e. taking the training data and trying to boil it down to a reasonably small functional form that preserves the signal and discards the noise?

For (3), we’ve known since the 70s at least that adding stochasticity and random sampling improves the ability of optimization algorithms to get close to a global optimum.

AFAIK, the only really interesting part here is the phase transition between (1) and (2), but even there, phase transitions in learning and optimization problems have been studied and well known since at least the early 80s.

So what was it about Tishby et al. that was so revolutionary that no less than Geoffrey Hinton said he needs 10,000 epochs to grasp it, that it “opens the black box of Deep Learning”, and that its refutation by Saxe et al. in the aforementioned paper is such a big deal?!

What am I missing in IB? Is my overall outline of it correct?

submitted by /u/AlexSnakeKing
[link] [comments]

[D] Is it possible to hack an AutoEncoder?

Re-Ha! (which means Reddit Hi!)

As I wrote in the title: is it possible to hack an AutoEncoder?

What the heck does ‘hacking an AutoEncoder’ mean, then?

Let me give you a simple scenario.


Suppose Jane extracts a latent representation, L, of her private data, X, with three features (body weight, height, and a binary variable indicating whether she ate pizza or not) daily.

X -> Encoder -> L -> Decoder -> X’ (reconstructed input ~ X)

(X: 3 dim., L: 1 dim., X’: 3 dim.)

She made a simple ML system that continually tracks the three features every day, trains the AutoEncoder again, and uploads it to her private server.

Then, suppose Chris (friend-zoned by Jane a month ago) succeeds in stealing L by installing a backdoor program on Jane’s server.

But he doesn’t know the structure of the Decoder network, the trained weights of the Decoder network, or the reconstructed input X’.

All he has is the latent representation L (continually updated) and the dimension of the original input X.


In this situation, is it possible for Chris to retrieve the original input X?

I think it is of course impossible, but then what is the mathematical concept supporting this impossibility?
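One way to see the difficulty: a single latent value is consistent with many different decoders, so without the decoder’s architecture or weights the reconstruction is not identifiable. A toy sketch with two made-up linear decoders (the weight values are purely illustrative):

```python
import numpy as np

# A 1-D latent value stolen by the attacker.
L = np.array([0.7])

# Two equally plausible linear decoders mapping 1-D latent -> 3-D input.
# The attacker has no way to tell which (if either) Jane actually trained.
W_a = np.array([[80.0], [170.0], [1.0]])   # hypothetical decoder A
W_b = np.array([[55.0], [150.0], [0.0]])   # hypothetical decoder B

x_a = W_a @ L  # reconstruction under decoder A
x_b = W_b @ L  # reconstruction under decoder B

# Both are consistent with the same stolen latent, yet they disagree,
# so L alone does not pin down the original X.
```

This is essentially a non-identifiability argument: the encoder/decoder pair is only defined up to an invertible transformation of the latent space, so the latent code carries no decoder-independent meaning.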

Or, is there any possible method to reconstruct/approximate the original input?

Thank you in advance!

submitted by /u/vaseline555
[link] [comments]

[D] Clustering methodology for high dimensional data, where some features have strong correlations to one another?

Hi, I’m working on a model to cluster users based on their demographic and behavioral features.

Was reading up on some literature on the topic, and found that strongly correlated features would skew the dimensionality reduction (right now, via PCA) toward those correlated features.

Was thinking of running a simple correlation matrix to remove those features and sort through the clutter before clustering.

But right now, our methodology looks like:

  1. Normalize our features (mean 0, stdev 1)
  2. Correlation matrix to weed out some features
  3. PCA or some other dimensionality reduction
  4. K-Means clustering
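Step 2 of the pipeline above can be sketched with plain NumPy; the 0.9 threshold and the greedy keep-first-seen strategy are arbitrary choices for illustration, not a recommendation:

```python
import numpy as np

def drop_correlated(X, threshold=0.9):
    """Greedily drop one feature from each pair whose |correlation|
    exceeds threshold; returns the filtered matrix and kept indices."""
    corr = np.corrcoef(X, rowvar=False)  # features are columns
    n_features = corr.shape[1]
    keep = []
    for j in range(n_features):
        # Keep feature j only if it is not too correlated with any kept one.
        if all(abs(corr[j, k]) < threshold for k in keep):
            keep.append(j)
    return X[:, keep], keep
```

For example, if the second column is an exact multiple of the first, it gets dropped while an independent third column survives.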

Problem is there are some features we might not be able to cut – category mixes (e.g. user has spent x% on category A, y% on category B, z% on category C, where x+y+z = 100%) ought to still be relevant in our case but will be highly correlated with one another. Any ideas on how we can handle this?
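For the percentage-mix features specifically, one standard option from compositional data analysis (not mentioned in the thread, so treat it as a suggestion) is the centered log-ratio (CLR) transform, which maps proportions that sum to 1 into an unconstrained space before PCA; a minimal sketch:

```python
import numpy as np

def clr(p, eps=1e-6):
    """Centered log-ratio transform for compositional data.
    Each row of p holds proportions summing to 1; eps guards against
    log(0) for categories with zero spend."""
    p = np.asarray(p, dtype=float) + eps
    logp = np.log(p)
    # Subtract the row-wise mean of the logs; results sum to ~0 per row.
    return logp - logp.mean(axis=-1, keepdims=True)
```

Since CLR outputs sum to zero by construction, the correlation among them is structural; dropping one coordinate (or using an isometric log-ratio variant) is a common follow-up before PCA.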

And as an aside, how do clustering algorithms (K-means specifically) handle null values?

Would love for you guys’ take on the methodology! All help appreciated on this, thanks!

submitted by /u/ibetDELWYN
[link] [comments]

[D] Learning one representation for multiple tasks – favoring some tasks over others?

Are there any papers on balancing the impact of multiple tasks on the final single representation?

Let there be n tasks to be solved by one representation:

L_total = L_t1 + L_t2 + … + L_tn

If we want to favor one task over another based on prior knowledge, is there any way other than setting a lambda-type hyperparameter to increase the weight of a specific task’s loss?
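The lambda-weighted version of the total loss above can be sketched in plain, framework-agnostic Python (in practice each L_ti would be a tensor produced by the task head):

```python
def total_loss(task_losses, weights=None):
    """Weighted multi-task loss: L_total = sum_i lambda_i * L_ti.
    With weights=None, all tasks contribute equally (lambda_i = 1)."""
    if weights is None:
        weights = [1.0] * len(task_losses)
    if len(weights) != len(task_losses):
        raise ValueError("need one weight per task")
    return sum(w * l for w, l in zip(weights, task_losses))
```

The static-lambda scheme is the baseline the question describes; the alternatives in the literature (e.g. uncertainty-based weighting or gradient normalization) replace the fixed `weights` with quantities learned or adapted during training.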

submitted by /u/searchingundergrad
[link] [comments]

[P] Object detection using faster R-CNN

I am working on a problem where I have to identify small objects in high-resolution images, and I was wondering how to solve this problem.

Basically, I have at most a hundred high-resolution images containing tiny objects I need to detect, and I was wondering whether I should transform this into a supervised problem, where I would first label some images and then try some classification approaches, or whether I should apply algorithms such as Faster R-CNN.

As I cannot elaborate much more on the topic due to privacy concerns, I would like to know which approach would be best, or how I can assess which approach to take.
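Whichever detector is used, one common preprocessing trick for tiny objects in high-resolution images is to run detection on overlapping tiles rather than on a downscaled full image, so small objects keep enough pixels to be detectable. A sketch of the tiling geometry (the 512-pixel tile and 64-pixel overlap are arbitrary illustrative values):

```python
def tile_image(h, w, tile=512, overlap=64):
    """Return (y0, x0, y1, x1) crop boxes covering an h x w image,
    with adjacent tiles overlapping so objects on a seam appear
    whole in at least one tile."""
    stride = tile - overlap
    boxes = []
    for y0 in range(0, max(h - overlap, 1), stride):
        for x0 in range(0, max(w - overlap, 1), stride):
            boxes.append((y0, x0, min(y0 + tile, h), min(x0 + tile, w)))
    return boxes
```

Detections from each tile are then mapped back to full-image coordinates and merged (typically with non-maximum suppression across tile boundaries).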

submitted by /u/xOrbitz
[link] [comments]

[P] Training Random Forest with a single vector (for each obs) in h2o?

I’m starting to use h2o to train and serve models. I have a dataset that I’d already curated for Spark ML pipelines. I have a single 16D vector I pass as the training data for each observation.

A friend said that h2o requires columns for each category and treats my single vector as a string, a claim I just can’t find anything to support. The accuracy is around what I got out of Spark ML, but I’m worried about how h2o is handling my data. Does anyone know how h2o handles this case?

tl;dr – Can I use a single vector for each training observation in h2o?
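If a frame-based tool really is parsing the single vector column as one opaque string, a safe workaround is to expand it into 16 named numeric columns before import. A hypothetical sketch in plain Python (the helper name and `prefix` are made up for illustration; this is not an h2o API):

```python
def expand_vector_column(rows, dim=16, prefix="f"):
    """Expand each observation's single vector into named scalar fields,
    e.g. [v0, ..., v15] -> {"f0": v0, ..., "f15": v15}, so a tabular
    tool sees 16 numeric columns instead of one stringified vector."""
    return [
        {f"{prefix}{i}": float(vec[i]) for i in range(dim)}
        for vec in rows
    ]
```

The resulting list of dicts can be written to CSV (or built into a frame) so that each of the 16 dimensions gets its own numeric column with an unambiguous type.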

submitted by /u/Octosaurus
[link] [comments]

[R] SPACE: Unsupervised Object-Oriented Scene Representation via Spatial Attention and Decomposition

Project Page:


Abstract: The ability to decompose complex multi-object scenes into meaningful abstractions like objects is fundamental to achieving higher-level cognition. Previous approaches for unsupervised object-oriented scene representation learning are either based on spatial-attention or scene-mixture approaches and are limited in scalability, which is a main obstacle to modeling real-world scenes. In this paper, we propose a generative latent variable model, called SPACE, that provides a unified probabilistic modeling framework combining the best of spatial-attention and scene-mixture approaches. SPACE can explicitly provide factorized object representations for foreground objects while also decomposing background segments of complex morphology. Previous models are good at either of these, but not both. SPACE also resolves the scalability problems of previous methods by incorporating parallel spatial-attention and thus is applicable to scenes with a large number of objects without performance degradation. We show through experiments on Atari and 3D-Rooms that SPACE achieves the above properties consistently in comparison to SPAIR, IODINE, and GENESIS.


submitted by /u/yifuwu
[link] [comments]

Next Meetup




Plug yourself into AI and don't miss a beat


Toronto AI is a social and collaborative hub to unite AI innovators of Toronto and surrounding areas. We explore AI technologies in digital art and music, healthcare, marketing, fintech, VR, robotics, and more. Toronto AI was founded by Dave MacDonald and Patrick O'Mara.