Author: torontoai

[D] What are some problem types where ML could be applied “in theory” but is outside of practical reach?

Written on November 28, 2019. Posted in Reddit MachineLearning.

It might be an overly simplified view of the field, but it seems to me that a lot of the ML boom of this decade is due to the appearance of hardware + architectures able to tackle a set of problems that were easy in terms of data gathering and “pretty deterministic” (that is to say, based on our human ability to solve them, we can be pretty certain there are usually no latent variables necessary to solve the problem correctly): things like bounding boxes, image classification and translation.

On the other hand, these new methods have hardly put a dent in how most people approach “pretty non-deterministic” issues (e.g. stock trading or risk analysis), where practice and intuition show that there’s simply not enough “easy” data to make reliable predictions.

It seems to me that most efforts right now are focused on either “productizing” the gains that were had on text and image problems (e.g. getting that 0.x% extra accuracy and 0.y% extra specificity that makes them practical to use in fields with low error margins) or getting algorithms that can better communicate the uncertainty of non-deterministic datasets (e.g. Bayesian/Probabilistic NNs).

However, it’s not obvious to me which set of problems similar to images and text will hit the chopping block next, or if there is such a set of problems.

I’ve seen some interesting research (e.g. AlphaFold) and some huge failures (e.g. that earthquake prediction paper published in Nature that was worse than a linear regression) in the realm of scientific problems where we “seem to” have sufficient data but lack the mathematical frameworks to gain insights from the data. I think anything related to complex molecular dynamics in a “static” environment is a pretty good example, since in theory the starting state should give us insight into any state at a later time T, but in practice this is often too computationally expensive and/or too complex to formalize in a way that fits our current models. However, there doesn’t seem to be nearly the same amount of adoption, excitement or novel ideas coming from this class of problems.

So I wonder, what do you guys think will be the next “category” of problems where, conceptually, ML techniques could be applied without too much of a data-gathering barrier, yet the hardware + knowledge combination of current humans has yet to evolve to a point where they are feasible?

submitted by /u/elcric_krej

[R] Curiosity Driven World Models

Written on November 28, 2019. Posted in Reddit MachineLearning.

Here’s some work we did as a course project. It’s an (unsuccessful) attempt at incorporating curiosity in world models. It’s a beginner’s work and any feedback is appreciated.

submitted by /u/akhandait

[R] Faster AutoAugment: Learning Augmentation Strategies using Backpropagation

Written on November 28, 2019. Posted in Reddit MachineLearning.

Paper: https://arxiv.org/abs/1911.06987

Abstract: Data augmentation methods are indispensable heuristics to boost the performance of deep neural networks, especially in image recognition tasks. Recently, several studies have shown that augmentation strategies found by search algorithms outperform hand-made strategies. Such methods employ black-box search algorithms over image transformations with continuous or discrete parameters and require a long time to obtain better strategies. In this paper, we propose a differentiable policy search pipeline for data augmentation, which is much faster than previous methods. We introduce approximate gradients for several transformation operations with discrete parameters as well as the differentiable mechanism for selecting operations. As the objective of training, we minimize the distance between the distributions of augmented data and the original data, which can be differentiated. We show that our method, Faster AutoAugment, achieves significantly faster searching than prior work without a performance drop.

submitted by /u/youali

[D] Help/Question about using Vector Projection + K-Means in VAE encoded result as pseudo-recommendation system

Written on November 28, 2019. Posted in Reddit MachineLearning.

I have a project that uses a Variational Autoencoder (VAE) on an apparel dataset that is grouped into five categories (say, A, B, C, D, E).

My plan is the following.

1. Train a VAE model on the apparel dataset.
2. Run the encoder on each item to produce a latent code (e.g. my bottleneck/latent representation is of size 10). Store it in the database.
3. Use K-Means to cluster the items in the database (using the latent codes) into n clusters (five, for example). Store the cluster label for each item in the database.
4. Store the cluster centroids created in #3 in the database.
5. The user interacts with a GUI whose sliders let them build their own latent code (ten sliders, because the latent code in #2 has size 10). The decoder generates an image from the given latent code.
6. The user clicks Recommend, which does two things:
   6.1. Get the product/apparel from the database that is most similar. (First predict the cluster, then find the most similar item in that cluster using a distance metric on the stored latent codes.)
   6.2. Recommend from other clusters. The idea is that if the user generates topwear (e.g. a shirt), I would also recommend from other clusters (for example, clusters that contain bottomwear, shoes, etc.). This is my problem right here.

For clustering, I could just use a simple K-Means. I can get the cluster labels and the cluster centroids.
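
Steps 2 to 4 and the first Recommend option could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the poster's implementation: the latent codes are toy 2-D points rather than 10-D VAE outputs, squared Euclidean distance stands in for the unspecified distance metric, the "database" is a plain list, and all helper names (`kmeans`, `nearest_centroid`, `most_similar_in_cluster`) are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two latent codes."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(code, centroids):
    """Index of the centroid closest to `code` (the 'predict cluster' step)."""
    return min(range(len(centroids)),
               key=lambda j: squared_distance(code, centroids[j]))


def kmeans(codes, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm. Returns (centroids, labels)."""
    rng = random.Random(seed)
    # Initialise the centroids from k different stored items.
    centroids = [list(c) for c in rng.sample(codes, k)]
    for _ in range(iterations):
        # Assignment step: each item goes to its nearest centroid.
        labels = [nearest_centroid(c, centroids) for c in codes]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [c for c, lab in zip(codes, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    # Final assignment so the labels match the returned centroids.
    labels = [nearest_centroid(c, centroids) for c in codes]
    return centroids, labels


def most_similar_in_cluster(code, codes, labels, cluster):
    """Index of the stored item in `cluster` that is closest to `code`."""
    members = [i for i, lab in enumerate(labels) if lab == cluster]
    return min(members, key=lambda i: squared_distance(code, codes[i]))


# Toy "database" of latent codes (assumption: two well-separated groups).
stored_codes = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
centroids, labels = kmeans(stored_codes, k=2)

user_code = [5.09, 5.0]  # stands in for what the sliders would produce
cluster = nearest_centroid(user_code, centroids)
best = most_similar_in_cluster(user_code, stored_codes, labels, cluster)
print(cluster, stored_codes[best])
```

Restricting the nearest-neighbour search to the predicted cluster keeps the lookup small, at the cost of missing a closer item that happens to sit just across a cluster boundary.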

My idea for 6.2 (Recommendation):

I’m not really sure, but there must be a relationship between the cluster centroid (cluster mean) and the most similar/generated latent code. Is the dot product applicable here?
Say the user-generated code (vector) is X, the most similar stored item is A1, the centroid of the predicted cluster is A0, and the centroid of another cluster is B0.
I could compute the projection of X with respect to A0 and then apply this amount of projection (I don’t know what it is called, or if there is such a concept) to B0 to find the most similar item in cluster B, which would be B1.
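
One possible reading of this idea, sketched under assumptions. `scalar_projection` computes the "amount of projection" in the usual sense: the scalar projection of X onto A0, i.e. the dot product divided by the length of A0. `transfer_offset` is a different, swapped-in heuristic rather than the projection itself: it carries the whole offset X - A0 over to B0, similar in spirit to analogy arithmetic on word vectors. Nothing guarantees that a VAE latent space is organised so that offsets mean the same thing in different clusters, so this is something to experiment with, not a known-good method. All function names and the toy vectors are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two vectors."""
    return sum(x * y for x, y in zip(a, b))


def scalar_projection(x, onto):
    """Length of the component of `x` along the direction of `onto`."""
    return dot(x, onto) / math.sqrt(dot(onto, onto))


def transfer_offset(x, a0, b0):
    """Hypothetical heuristic: apply the offset X - A0 to centroid B0."""
    return [b + (xi - a) for xi, a, b in zip(x, a0, b0)]


def nearest(target, candidates):
    """Index of the candidate closest to `target` (squared Euclidean distance)."""
    return min(range(len(candidates)),
               key=lambda i: sum((t - c) ** 2
                                 for t, c in zip(target, candidates[i])))


# Toy 2-D example (assumption: real codes would be 10-D VAE latents).
x = [1.0, 2.0]        # user-generated code X
a0 = [0.0, 0.0]       # centroid A0 of the predicted cluster
b0 = [10.0, 10.0]     # centroid B0 of another cluster
cluster_b = [[10.0, 10.0], [11.0, 12.5], [9.0, 9.0]]  # stored codes in B

target = transfer_offset(x, a0, b0)   # where X "would sit" in cluster B
b1_index = nearest(target, cluster_b)  # candidate for B1
print(target, cluster_b[b1_index])
```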

Is this even possible? If yes, what is this called? If not, could you recommend a better recommendation system that revolves around the same concept?

submitted by /u/sarmientoj24

[D] Bayes Optimal Classifier

Written on November 28, 2019. Posted in Reddit MachineLearning.

When deriving the Bayes optimal classifier, suppose you have a loss function with unequal penalties for the two incorrect decisions:

L=10 when y=1 and f=0

L=1 when y=0 and f=1

L=0 when y=f

Where f is the classifier. How does one go about deriving a decision threshold for this problem?
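
One standard way to approach this is to compare expected losses at a fixed input. Writing p = P(y=1 | x), deciding f=0 costs 10·p in expectation and deciding f=1 costs 1·(1 - p). The decision with the smaller expected loss is f=1 exactly when 1 - p < 10·p, i.e. when p > 1/11. The sketch below just checks that arithmetic; the function names are hypothetical, and it assumes the posterior p is known exactly.

```python
def expected_loss(decision, p_y1, loss_fn=10.0, loss_fp=1.0):
    """Expected loss of a decision given p_y1 = P(y=1 | x).

    loss_fn: penalty when y=1 but f=0 (10 in the post).
    loss_fp: penalty when y=0 but f=1 (1 in the post).
    """
    if decision == 1:
        return loss_fp * (1.0 - p_y1)  # only wrong when y=0
    return loss_fn * p_y1              # only wrong when y=1


def bayes_decision(p_y1, loss_fn=10.0, loss_fp=1.0):
    """Pick the decision with the smaller expected loss."""
    loss_if_1 = expected_loss(1, p_y1, loss_fn, loss_fp)
    loss_if_0 = expected_loss(0, p_y1, loss_fn, loss_fp)
    return 1 if loss_if_1 < loss_if_0 else 0


def decision_threshold(loss_fn=10.0, loss_fp=1.0):
    """Posterior probability above which deciding f=1 has lower expected loss."""
    return loss_fp / (loss_fp + loss_fn)


print(decision_threshold())   # 1/11 for the losses in the post
print(bayes_decision(0.2))    # above the threshold
print(bayes_decision(0.05))   # below the threshold
```

With equal penalties the same formula gives the familiar threshold of 0.5; making a missed positive ten times as costly pushes the threshold down so that f=1 is chosen far more readily.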

submitted by /u/ssd123456789

[D] How do you deal with the pressure of such a fast moving field?

Written on November 28, 2019. Posted in Reddit MachineLearning.

The progress of ML is getting crazy and I’ve felt super stressed about it lately. It also didn’t help that a few ICML submissions do exactly what I wanted to do during the PhD I recently started (I was an ML engineer previously).

There are just so many people working on similar ideas, I find it hard to keep up and contribute original ideas. How do you deal with this?

submitted by /u/vakker00

[D] Five major deep learning papers by Geoff Hinton did not cite similar earlier work by Jurgen Schmidhuber

Written on November 28, 2019. Posted in Reddit MachineLearning.

still milking Jurgen’s very dense inaugural tweet about their annus mirabilis 1990-1991 with Sepp Hochreiter and others, 2 of its 21 sections already made for nice reddit threads, section 5 Jurgen really had GANs in 1990 and section 19 DanNet, the CUDA CNN of Dan Ciresan in Jurgen’s team, won 4 image recognition challenges prior to AlexNet, but these are not the juiciest parts of the blog post

instead look at sections 1 2 8 9 10 where Jurgen mentions work they did long before Geoff, who did not cite, as confirmed by studying the references, at first glance it’s not obvious, it’s hidden, one has to work backwards from the references

section 1, First Very Deep NNs, Based on Unsupervised Pre-Training (1991), Jurgen “facilitated supervised learning in deep RNNs by unsupervised pre-training of a hierarchical stack of RNNs” and soon was able to “solve previously unsolvable Very Deep Learning tasks of depth > 1000,” he mentions reference [UN4] which is actually Geoff’s later similar work:

More than a decade after this work [UN1], a similar method for more limited feedforward NNs (FNNs) was published, facilitating supervised learning by unsupervised pre-training of stacks of FNNs called Deep Belief Networks (DBNs) [UN4]. The 2006 justification was essentially the one I used in the early 1990s for my RNN stack: each higher level tries to reduce the description length (or negative log probability) of the data representation in the level below.

back then unsupervised pre-training was a big deal, today it’s not so important any more, see section 19, From Unsupervised Pre-Training to Pure Supervised Learning (1991-95 and 2006-11)

section 2, Compressing / Distilling one Neural Net into Another (1991), Jurgen also trained “a student NN to imitate the behavior of the teacher NN,” briefly referring to Geoff’s much later similar work [DIST2]:

I called this “collapsing” or “compressing” the behavior of one net into another. Today, this is widely used, and also called “distilling” [DIST2] or “cloning” the behavior of a teacher net into a student net.

section 9, Learning Sequential Attention with NNs (1990), Jurgen “had both of the now common types of neural sequential attention: end-to-end-differentiable “soft” attention (in latent space) through multiplicative units within NNs [FAST2], and “hard” attention (in observation space) in the context of Reinforcement Learning (RL) [ATT0] [ATT1],” the blog has a statement about Geoff’s later similar work [ATT3] which I find both funny and sad:

My overview paper for CMSS 1990 [ATT2] summarised in Section 5 our early work on attention, to my knowledge the first implemented neural system for combining glimpses that jointly trains a recognition & prediction component with an attentional component (the fixation controller). Two decades later, the reviewer of my 1990 paper wrote about his own work as second author of a related paper [ATT3]: “To our knowledge, this is the first implemented system for combining glimpses that jointly trains a recognition component … with an attentional component (the fixation controller).”

similar in section 10, Hierarchical Reinforcement Learning (1990), Jurgen introduced HRL “with end-to-end differentiable NN-based subgoal generators [HRL0], also with recurrent NNs that learn to generate sequences of subgoals [HRL1] [HRL2],” referring to Geoff’s later work [HRL3]:

Soon afterwards, others also started publishing on HRL. For example, the reviewer of our reference [ATT2] (which summarised in Section 6 our early work on HRL) was last author of ref [HRL3]

section 8, End-To-End-Differentiable Fast Weights: NNs Learn to Program NNs (1991), Jurgen published a network “that learns by gradient descent to quickly manipulate the fast weight storage” of another network, and “active control of fast weights through 2D tensors or outer product updates [FAST2],” dryly referring to [FAST4a] which happens to be Geoff’s later similar paper:

A quarter century later, others followed this approach [FAST4a]

it’s really true, Geoff did not cite Jurgen in any of these similar papers, and what’s kinda crazy, he was editor of Jurgen’s 1990 paper [ATT2] summarising both attention learning and hierarchical RL, then later he published closely related work, sections 9, 10, but he did not cite

Jurgen also famously complained that Geoff’s deep learning survey in Nature neither mentions the inventors of backpropagation (1960-1970) nor “the father of deep learning, Alexey Grigorevich Ivakhnenko, who published the first general, working learning algorithms for deep networks” in 1965

apart from the early pioneers in the 60s and 70s, like Ivakhnenko and Fukushima, most of the big deep learning concepts stem from Jurgen’s team with Sepp and Alex and Dan and others: unsupervised pre-training of deep networks, artificial curiosity and GANs, vanishing gradients, LSTM for language processing and speech and everything, distilling networks, attention learning, CUDA CNNs that win vision contests, deep nets with 100+ layers, metalearning, plus theoretical work on optimal AGI and Gödel Machine

submitted by /u/siddarth2947

[Project] Recommender web app for short stories

Written on November 28, 2019. Posted in Reddit MachineLearning.

I developed a bare-bones web app for reading short stories from Project Gutenberg; it recommends something similar based on what a user might like.

https://project-guttenberg.herokuapp.com/

(Thoughts and ideas on what can be done to enhance it please??)

submitted by /u/ShubC