
Category: Reddit MachineLearning

[R] A simple module consistently outperforms self-attention and the Transformer model on main NMT datasets with SoTA performance.

In the paper MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning, the authors delve into three questions in sequence-to-sequence learning:

  1. Is attention alone good enough?
  2. Is parallel representation learning applicable to sequence data and tasks?
  3. How can a module be designed that combines the inductive biases of both convolution and self-attention?

They find that there are shortcomings in stand-alone self-attention, and present a new module that maps the input to a hidden space and performs the three operations of self-attention, convolution, and nonlinearity in parallel. Simply stacking this module outperforms all previous models, including the Transformer (Vaswani et al., 2017), on main NMT tasks under the standard setting.

Key features:

  • The first to successfully combine convolution and self-attention in one module for sequence tasks, via the proposed shared projection.
  • SOTA on three main translation datasets: WMT14 En-Fr, WMT14 En-De, and IWSLT14 De-En.
  • Learns sequence representations in parallel, and thus has potential for acceleration.
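To make the idea concrete, here is a rough NumPy sketch of the parallel-branch structure described above: project the input into one shared hidden space, then run self-attention, a depthwise convolution, and a pointwise nonlinearity in parallel and sum them. This is only an illustration of the concept, not the authors' implementation (all parameter names and shapes below are my assumptions).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(H, Wq, Wk, Wv):
    # Single-head scaled dot-product attention over positions.
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return A @ V

def depthwise_conv(H, kernel):
    # Same-padded depthwise 1D convolution along the time axis;
    # kernel has shape (k, hidden_dim).
    k = kernel.shape[0]
    Hp = np.pad(H, ((k // 2, k // 2), (0, 0)))
    return np.stack([(Hp[t:t + k] * kernel).sum(axis=0)
                     for t in range(H.shape[0])])

def muse_like_block(X, p):
    # Project into one shared hidden space, then run the three branches
    # (attention, convolution, pointwise nonlinearity) in parallel and sum.
    H = X @ p["W_shared"]
    attn = self_attention(H, p["Wq"], p["Wk"], p["Wv"])
    conv = depthwise_conv(H, p["conv_kernel"])
    ffn = np.maximum(0.0, H @ p["W1"]) @ p["W2"]
    return attn + conv + ffn
```

The shared projection is the key point the paper emphasizes: all three branches read from the same hidden space rather than from separately projected inputs.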

Quick links:

arXiv: pdf

GitHub: code, pretrained models, and instructions for training are all available.

Main results:

https://preview.redd.it/2qkne32qwo041.png?width=554&format=png&auto=webp&s=f7ed2973ebdd434f2982a2ec546f65ca635ff529

https://preview.redd.it/cdx3sc5vwo041.png?width=554&format=png&auto=webp&s=611d277629129f5734119321b4466f1bd4781f03

Abstract:

In sequence to sequence learning, the self-attention mechanism proves to be highly effective, and achieves significant improvements in many tasks. However, the self-attention mechanism is not without its own flaws. Although self-attention can model extremely long dependencies, the attention in deep layers tends to overconcentrate on a single token, leading to insufficient use of local information and difficulty in representing long sequences. In this work, we explore parallel multi-scale representation learning on sequence data, striving to capture both long-range and short-range language structures. To this end, we propose the Parallel MUlti-Scale attEntion (MUSE) and MUSE-simple. MUSE-simple contains the basic idea of parallel multi-scale sequence representation learning, and it encodes the sequence in parallel, in terms of different scales, with the help of self-attention and pointwise transformation. MUSE builds on MUSE-simple and explores combining convolution and self-attention for learning sequence representations from more different scales. We focus on machine translation, and the proposed approach achieves substantial performance improvements over Transformer, especially on long sequences. More importantly, we find that although conceptually simple, its success in practice requires intricate considerations, and the multi-scale attention must build on a unified semantic space. Under the common setting, the proposed model achieves substantial performance gains and outperforms all previous models on three main machine translation tasks. In addition, MUSE has potential for accelerating inference due to its parallelism.

submitted by /u/stopwind

[D] What do you see as the most promising directions for reducing sample inefficiency in reinforcement learning?

I often read from ML researchers, but more so from computational cognitive scientists, that humans are able to generalize patterns from only a few data points and use "rich, informative priors" even as children, and that this is very important for us as cognitive beings, setting us apart from the current neural network approaches to RL used today.

I’m also not entirely convinced that the current neural net paradigm with McCulloch–Pitts-esque neurons is ever going to become sample efficient enough for real-world reinforcement learning tasks. It seems like, despite our best efforts to increase sample efficiency in NN techniques, the most impressive results still use hundreds of thousands or more simulations/data points, which would be infeasible to obtain in any sufficiently complex real-world environment.

That being said, what approaches are you most excited about for improving sample efficiency in reinforcement learning, or in neural network techniques in general?

submitted by /u/p6m_lattice

[P] PySNN: Spiking Neural Network framework, built on top of PyTorch

Hi everyone!

Recently a friend and I have been working on a new Python library for machine learning with Spiking Neural Networks (SNNs), called PySNN, built on top of PyTorch. We feel it is time to share it with more people, and hopefully get good feedback and contributions! :)

Our goal for PySNN was to make a truly modular framework for machine learning with SNNs, while staying as close to PyTorch as possible. All of the existing frameworks either operate more like simulators for neuroscientific research, or use relatively fixed network and training/evaluation loop designs. PySNN, on the other hand, consists of building blocks for neurons, connections, and learning rules which the user can combine in their desired way. It even allows for mixing learning rules or training only specific parts of the network.

Furthermore, since PySNN consists of just the basic elements, the framework is lightweight and allows for easy extension. Because of its tight integration with PyTorch, it fully supports GPU acceleration and batching of samples, as well as tools like the JIT compiler and graph tracing for TensorBoard.
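To give a flavour of what a neuron "building block" in this style looks like, here is a minimal leaky integrate-and-fire (LIF) neuron in plain Python. This is purely illustrative of the concept and is not the actual PySNN API; all names and defaults are made up.

```python
class LIFNeuron:
    """Minimal leaky integrate-and-fire neuron, the kind of building block
    a modular SNN framework composes. Illustrative only, not PySNN's API."""

    def __init__(self, tau=0.9, threshold=1.0):
        self.tau = tau              # membrane leak factor per time step
        self.threshold = threshold  # spike threshold
        self.v = 0.0                # membrane potential

    def step(self, current):
        # Leak, integrate the input current, and spike-and-reset
        # when the potential crosses the threshold.
        self.v = self.tau * self.v + current
        if self.v >= self.threshold:
            self.v = 0.0
            return 1
        return 0

# Drive the neuron with a constant current and record its spike train.
neuron = LIFNeuron()
spikes = [neuron.step(0.4) for _ in range(10)]
```

In a modular framework, objects like this are wired together with connection and learning-rule modules, which is the design PySNN aims for.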

There are still many improvements and extensions that can be made, so feel free to have a look and send out a pull request! We will be very active in helping with any issues! https://github.com/BasBuller/PySNN

We are looking forward to your comments and suggestions!:)

submitted by /u/DontShowYourBack

[P] A Chess/Go/Shogi model that passes the Turing test, how do I build an imitation learning model that incorporates some kind of lookahead algorithm?

I want to build a model for Chess/Go/Shogi that is trained and tested on real players, and I want it to pass the Turing test. I don’t want my model to play the best move in a position; I want it to play the move that a person (of a certain strength, time control, etc.) would play.

It’s easy to frame this as a classification problem and train a CNN on a one-hot encoded policy of actual moves played. The only problem is that, without some kind of look-ahead algorithm (MCTS, for example), the model fails to learn sequences that require multiple moves, such as tactics.

However, current MCTS/alpha-beta/minimax models require evaluation of leaf nodes. I don’t have a way to shape the reward to an evaluation of a leaf node. So my question: how would I incorporate a look-ahead algorithm in an imitation learning problem like this?
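One possible direction (a hedged sketch, not an established answer to the question): instead of evaluating leaf nodes with a reward, score each candidate line by how likely the imitation policy is to have produced it, i.e. use summed policy log-probabilities as the "value" that the search maximizes. A toy illustration with a stubbed policy, where `policy`, `step`, and `legal` are hypothetical callbacks:

```python
import math

def best_line(policy, step, legal, state, depth=2):
    """Brute-force look-ahead over all move sequences of length `depth`,
    scored by the total log-probability the imitation policy assigns to
    the line. MCTS would replace this enumeration in a real game, and a
    real implementation must also alternate sides; this toy does not.
    """
    if depth == 0:
        return [], 0.0
    best_moves, best_score = None, -math.inf
    for mv in legal(state):
        probs = policy(state)  # dict: move -> probability
        tail, tail_score = best_line(policy, step, legal,
                                     step(state, mv), depth - 1)
        score = math.log(probs.get(mv, 1e-12)) + tail_score
        if score > best_score:
            best_moves, best_score = [mv] + tail, score
    return best_moves, best_score
```

The appeal is that no leaf evaluation is needed: the "reward" is the policy's own probability mass, so the search prefers lines a human of the modelled strength would actually play, multi-move tactics included.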

submitted by /u/Pawngrubber

[Discussion] Hyperparameters for Word2Vec for SMS corpus…

Hey all,

I work at a small startup, and we have extracted 33 million text messages from our users. We plan to create a model to classify the different types of SMS relevant to us.

The first step is to create a Word2Vec model for EDA and clustering, and possibly to use these embeddings for classification further down the line.

I just wanted some guidance on the hyperparameters for gensim’s Word2Vec.

The corpus is 33 million SMS messages, the average message length is 16 words, and the vocabulary size is 1.5 million.

I used the following hyperparameters and obtained decent results but just wanted to know if I’m doing anything wrong that could be hampering the model from performing even better:

CBOW, window=4, vector_size=125, epochs=10, workers=5, min_count=4.

Furthermore, does anyone have any tips on how to evaluate the embeddings (other than checking that the similarity for a small set of words makes sense) so that I can fine-tune these hyperparameters?

And a final question (I promise): would it be possible or advisable to take a pretrained Word2Vec model and improve on it by giving it the SMS data, so that it learns new words like slang and typos without losing its overall knowledge of the language?

Thanks so much for your time in reading.

submitted by /u/conradws

[P] Hypertunity: a library for hyperparameter optimisation

I would like to share my pet project, Hypertunity, a Python library for black-box hyperparameter optimisation. Its main features are:

  • Bayesian optimisation using Gaussian process regression, by wrapping GPyOpt;
  • native support for random and grid search;
  • visualisation of the results in TensorBoard using the HParams plugin;
  • scheduled, parallel execution of experiments using joblib;
  • the possibility to schedule jobs on Slurm.

For the full set of features, check out the docs.
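For context on the random-search feature, the underlying idea can be sketched generically in a few lines (this is a plain illustration of random search, not Hypertunity's actual API; the domain and objective below are invented):

```python
import random

def random_search(domain, objective, n_trials=25, seed=0):
    """Generic random search: sample trial configs from `domain` and keep
    the lowest-scoring one. Illustrative only, not Hypertunity's API."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("inf")
    for _ in range(n_trials):
        cfg = {name: rng.choice(choices) for name, choices in domain.items()}
        score = objective(cfg)
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Hypothetical objective: distance of the learning rate from an "ideal" value.
domain = {"lr": [0.001, 0.01, 0.1], "batch_size": [32, 64, 128]}
best_cfg, best_score = random_search(domain, lambda cfg: abs(cfg["lr"] - 0.01))
```

A library like this earns its keep on top of such a loop with the harder parts: Bayesian optimisation, result logging, and scheduling the trials in parallel.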

Your feedback is very much appreciated!

submitted by /u/gdikov
[link] [comments]

[P] Need some advice for training BERT to classify multiple sentiments

Basically, so far I have been trying to train BERT on very long documents by cutting out the start, middle, and end sections of each article so it fits into the limited input size of 512 tokens. However, the performance has been dismal most of the time. I am also not sure whether using LSTM+GRU was a better approach than this. Are there other ways to train it than just cutting up the article? When I googled for an alternative approach, I couldn’t find much…
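One commonly used alternative to discarding sections (a sketch, not something from the post): slide an overlapping window over the tokenized document so every part is seen, classify each window, and pool the per-window logits (e.g. mean or max). The windowing itself is simple; the window and stride values below are illustrative:

```python
def sliding_windows(tokens, window=512, stride=384):
    """Split a long token sequence into overlapping chunks of at most
    `window` tokens, advancing by `stride`, so the whole document is
    covered instead of only the start/middle/end."""
    if len(tokens) <= window:
        return [tokens]
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += stride
    return chunks
```

Each chunk is then fed through BERT separately and the per-chunk outputs are aggregated into one document-level prediction; hierarchical setups that run an RNN over the chunk embeddings are another variant of the same idea.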

submitted by /u/shstan