Author: torontoai

[D] Research shows SGD with too large of a mini batch can lead to huge overfitting in deep learning. Why doesn’t batch gradient descent have this problem?

Written on August 28, 2019. Posted in Reddit MachineLearning.

Here is an example paper showing test performance degrading badly as the batch size gets too large: https://arxiv.org/pdf/1804.07612.pdf

Batch gradient descent runs over the whole dataset. Does it have the same problem? If not, why?
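
One way to make the question concrete: full-batch gradient descent is the special case of mini-batch SGD in which the batch is the entire dataset. The sketch below is only an illustration on a toy 1-D regression; the data, model, and function names are assumptions and are not taken from the linked paper. It shows the mechanical difference (sampling noise in the gradient), not the generalization behaviour the paper measures.

```python
import random

def gradient(w, batch):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def minibatch_gradient(w, data, batch_size, rng):
    """Gradient estimated from a random sample of `batch_size` examples."""
    return gradient(w, rng.sample(data, batch_size))

rng = random.Random(0)
# Toy data (an assumption for illustration): y = 3x plus a little noise.
data = [(x, 3.0 * x + rng.gauss(0, 0.1)) for x in [i / 10 for i in range(1, 101)]]

w = 0.0
full = gradient(w, data)
# batch_size == len(data): the "mini-batch" is the whole dataset, so the
# estimate equals the full-batch gradient and carries no sampling noise.
same = minibatch_gradient(w, data, len(data), rng)
# Small batches give noisy estimates that scatter around the full gradient.
noisy = [minibatch_gradient(w, data, 4, rng) for _ in range(5)]
print(full, same, noisy)
```

With the batch size equal to the dataset size, every step uses the same deterministic gradient; with small batches, each step sees a different noisy estimate.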

submitted by /u/DstnB3

[P] Tensorflow implementation of RAdam optimizer (On the Variance of the Adaptive Learning Rate and Beyond)

Written on August 28, 2019. Posted in Reddit MachineLearning.


submitted by /u/taki0112

[N] Deep Graph Library new release (v0.3.1)

Written on August 27, 2019. Posted in Reddit MachineLearning.

Though only a minor release, this new release includes a bunch of very useful Graph Neural Network modules and model examples that can be directly used in your project. Here is a list of new modules:

New NN Modules

GATConv from “Graph Attention Network”
RelGraphConv from “Modeling Relational Data with Graph Convolutional Networks”
TAGConv from “Topology Adaptive Graph Convolutional Networks”
EdgeConv from “Dynamic Graph CNN for Learning on Point Clouds”
SAGEConv from “Inductive Representation Learning on Large Graphs”
GatedGraphConv from “Gated Graph Sequence Neural Networks”
GMMConv from “Geometric Deep Learning on Graphs and Manifolds using Mixture Model CNNs”
GINConv from “How Powerful are Graph Neural Networks?”
ChebConv from “Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering”
SGConv from “Simplifying Graph Convolutional Networks”
NNConv from “Neural Message Passing for Quantum Chemistry”
APPNPConv from “Predict then Propagate: Graph Neural Networks meet Personalized PageRank”
AGNNConv from “Attention-based Graph Neural Network for Semi-Supervised Learning”
DenseGraphConv (Dense implementation of GraphConv)
DenseSAGEConv (Dense implementation of SAGEConv)
DenseChebConv (Dense implementation of ChebConv)

New global pooling module

Sum/Avg/MaxPooling
SortPooling
GlobalAttentionPooling from GGNN model
Set2Set from “Order Matters: Sequence to sequence for sets”
SetTransformerEncoder and SetTransformerDecoder from “Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks”

New graph transformation routines

dgl.transform.khop_adj
dgl.transform.khop_graph
dgl.transform.laplacian_lambda_max
dgl.transform.knn_graph
dgl.transform.segmented_knn_graph

This DGL release also includes a model zoo for chemistry applications, such as using GNNs to predict molecular properties or to generate new molecular structures, which is valuable for drug discovery. Pre-trained models can be downloaded and loaded in just two lines of code:

```python
from dgl.data import Tox21
from dgl import model_zoo

dataset = Tox21()
model = model_zoo.chem.load_pretrained('GCN_Tox21')  # Pretrained model loaded
model.eval()

smiles, g, label, mask = dataset[0]
feats = g.ndata.pop('h')
label_pred = model(g, feats)
print(smiles)  # CCOc1ccc2nc(S(N)(=O)=O)sc2c1
print(label_pred[:, mask != 0])  # Mask non-existing labels
# tensor([[-0.7956, 0.4054, 0.4288, -0.5565, -0.0911,
#          0.9981, -0.1663, 0.2311, -0.2376, 0.9196]])
```

Check it out if you are using GNNs, working with molecules or just interested in this whole new field.

See full release note here: https://www.dgl.ai/release/2019/08/28/release.html.

submitted by /u/jermainewang

[D] Is “Wasserstein metric” the right name to use?

Written on August 27, 2019. Posted in Reddit MachineLearning.

According to Wikipedia: “The name ‘Wasserstein distance’ was coined by R. L. Dobrushin in 1970, after the Russian mathematician Leonid Vaseršteĭn who introduced the concept in 1969.” Indeed, I found the paper by Dobrushin, written in Russian, whose reference list includes: “Л. Н. Васерштейн, Марковские процессы на счетном произведении пространств, описывающие большие системы автоматов. Пробл. перед. информ. 5, 3 (1969), 64–73.” Leonid Vaseršteĭn is simply the romanization of Леонид Васерштейн.

Although I cannot read Russian and could not find the text of the original paper by Leonid Vaseršteĭn, the Wikipedia account still seems convincing.

However, it seems the Fréchet distance is identical to the 2-Wasserstein distance, and the Fréchet distance was introduced in 1957, according to the original French paper “Sur la distance de deux lois de probabilité.”
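
For context, the Gaussian case is where the two names are usually equated. For two 1-D Gaussians N(m1, s1²) and N(m2, s2²), the squared distance has the closed form (m1 − m2)² + (s1 − s2)². The sketch below is a sanity check under that assumption, not a claim about priority: it compares the closed form with an empirical 2-Wasserstein estimate, using the fact that in one dimension the optimal coupling pairs sorted samples. All function names are made up for illustration.

```python
import math
import random

def gaussian_w2(m1, s1, m2, s2):
    """Closed-form 2-Wasserstein distance between N(m1, s1^2) and N(m2, s2^2)."""
    return math.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)

def empirical_w2(xs, ys):
    """2-Wasserstein distance between two equal-size 1-D samples.

    In one dimension the optimal coupling pairs the sorted values.
    """
    assert len(xs) == len(ys)
    sq = [(a - b) ** 2 for a, b in zip(sorted(xs), sorted(ys))]
    return math.sqrt(sum(sq) / len(sq))

rng = random.Random(0)
n = 20000
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
ys = [rng.gauss(2.0, 3.0) for _ in range(n)]

closed = gaussian_w2(0.0, 1.0, 2.0, 3.0)  # sqrt((0 - 2)^2 + (1 - 3)^2) = sqrt(8)
estimate = empirical_w2(xs, ys)           # should land close to the closed form
print(closed, estimate)
```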

Does this mean Fréchet discovered it first and Wikipedia is wrong about the origin? If so, should we call it the Fréchet distance instead of the Wasserstein distance?

P.S.

If you search for “Fréchet distance” on Google, what comes up is not a distance between distributions but a distance between paths. I am confused about the relationship between the “Fréchet distance between paths” and the “Fréchet distance between distributions”.

submitted by /u/746645147

[R] Evolving Space-Time Neural Architectures for Videos (Google Brain) ICCV

Written on August 27, 2019. Posted in Reddit MachineLearning.

Paper: https://arxiv.org/abs/1811.10636

Code: https://github.com/piergiaj/evanet-iccv19

Abstract:

We present a new method for finding video CNN architectures that capture rich spatio-temporal information in videos. Previous work, taking advantage of 3D convolutions, obtained promising results by manually designing video CNN architectures. We here develop a novel evolutionary search algorithm that automatically explores models with different types and combinations of layers to jointly learn interactions between spatial and temporal aspects of video representations. We demonstrate the generality of this algorithm by applying it to two meta-architectures, obtaining new architectures superior to manually designed architectures. Further, we propose a new component, the iTGM layer, which more efficiently utilizes its parameters to allow learning of space-time interactions over longer time horizons. The iTGM layer is often preferred by the evolutionary algorithm and allows building cost-efficient networks. The proposed approach discovers new and diverse video architectures that were previously unknown. More importantly they are both more accurate and faster than prior models, and outperform the state-of-the-art results on multiple datasets we test, including HMDB, Kinetics, and Moments in Time. We will open source the code and models, to encourage future model development.

submitted by /u/Himalun

[D] Eric Drexler’s “Reframing Superintelligence”

Written on August 27, 2019. Posted in Reddit MachineLearning.

Following the Slate Star Codex review of “Reframing Superintelligence”, I (as an AI researcher) have become pretty excited that such a comprehensive reply exists to Bostrom-type “paperclip maximizer” fears of AGI. A good summary is here: Less Like Us: An Alternate Theory of Artificial General Intelligence. Basically, the idea is that, realistically, AI is not developed with the ability to self-improve and do whatever it wants, so we should not fear AGIs that get out of control in this way.

What do you think of this reply to AGI concerns? Certainly, given present-day AI and how it is developing, the “service AI” model seems like a cogent prediction of what is actually likely to come about and of what we need to be wary of getting wrong.

submitted by /u/regalalgorithm

Senior Data Scientist – Rogers Communications – Toronto, ON

Written on August 27, 2019. Posted in Toronto Job Postings.

Deep understanding (details) of statistical modeling and machine learning methods. The candidate will be responsible for providing end to end advanced analytics…
From Rogers – Wed, 28 Aug 2019 19:29:22 GMT

[R] Google AI Blog: Exploring Weight Agnostic Neural Networks

Written on August 27, 2019. Posted in Reddit MachineLearning.


In “Weight Agnostic Neural Networks” (WANN), we present a first step toward searching specifically for networks with these biases: neural net architectures that can already perform various tasks, even when they use a random shared weight. Our motivation in this work is to question to what extent neural network architectures alone, without learning any weight parameters, can encode solutions for a given task. By exploring such neural network architectures, we present agents that can already perform well in their environment without the need to learn weight parameters. Furthermore, in order to spur progress in this field, we have also open-sourced the code to reproduce our WANN experiments for the broader research community.

We start with a population of minimal neural network architecture candidates, each with very few connections only, and use a well-established topology search algorithm (NEAT), to evolve the architectures by adding single connections and single nodes one by one.

https://weightagnostic.github.io/

Very interesting results from Google, using an evolution-like approach to create network topologies. Thoughts?

submitted by /u/Marha01

[D] Do VAEs have a manifold?

Written on August 27, 2019. Posted in Reddit MachineLearning.

I am kind of confused as to how VAEs do manifold learning.

While I can grasp that regular AEs perform a deterministic transformation from the input vector space to the latent space with the encoder, it is very hard for me to understand how that would work in a VAE. Is the manifold defined over the parameters of the distribution, MU and SIGMA?
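
As a rough sketch of the usual VAE setup (not taken from any specific paper): the encoder is still a deterministic map, but its output is the pair (mu, log-variance) rather than a single latent point, and the latent code is then sampled around mu. The toy `encode` function below is hypothetical and stands in for a neural network.

```python
import math
import random

def encode(x):
    """Hypothetical toy encoder: maps a scalar input to (mu, log_var).

    A real VAE would use a neural network here; these formulas are made up.
    """
    mu = 0.5 * x
    log_var = -1.0 - 0.1 * x * x
    return mu, log_var

def sample_latent(mu, log_var, rng):
    """Reparameterization: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    sigma = math.exp(0.5 * log_var)
    return mu + sigma * rng.gauss(0.0, 1.0)

rng = random.Random(0)
x = 2.0
mu, log_var = encode(x)  # deterministic: the same x always gives the same pair
zs = [sample_latent(mu, log_var, rng) for _ in range(10000)]
mean_z = sum(zs) / len(zs)  # the samples scatter around mu
print(mu, mean_z)
```

In this picture, the deterministic part of the encoder (x to mu) plays the role the AE encoder plays, and sigma controls how far sampled codes spread around that point.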

Can anyone clarify that for me, maybe point to a paper? Thanks

submitted by /u/eigenlaplace

[R] DistilBERT: A smaller, faster, cheaper, lighter BERT trained with distillation!

Written on August 27, 2019. Posted in Reddit MachineLearning.

HuggingFace released their first NLP transformer model, “DistilBERT”. It is similar to the BERT architecture but has only 66 million parameters (instead of 110 million) while keeping 95% of the performance on GLUE.

They released a blog post detailing the procedure, with a hands-on walkthrough.

It is also available on their repository pytorch-transformers alongside 7 other transformer models.
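
For readers unfamiliar with distillation, here is a minimal sketch of a generic soft-target loss, in which a student is trained to match a teacher's temperature-softened output distribution. This is the textbook form of knowledge distillation, assumed here for illustration; the exact DistilBERT training objective is described in the HuggingFace blog post, and the logits and function names below are made up.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher = softer)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between teacher and student softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))

teacher = [4.0, 1.0, 0.5]        # hypothetical teacher logits
close_student = [3.5, 1.2, 0.4]  # student that roughly agrees with the teacher
far_student = [0.5, 4.0, 1.0]    # student that disagrees with the teacher

loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
print(loss_close, loss_far)
```

The loss is smaller for the student whose softened predictions are closer to the teacher's, which is the signal the smaller model is trained on.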

submitted by /u/jikkii
