[R] A simple module consistently outperforms self-attention and Transformer model on main NMT datasets with SoTA performance.

Written by torontoai on November 23, 2019. Posted in Reddit MachineLearning.

In the paper: MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning，

They delve into three questions in sequence to sequence learning:

Is attention alone good enough？
Is parallel representation learning applicable to sequence data and tasks?
How to design a module that combines both inductive bias of convolution and self-attention？

They find that there are shortcomings in stand-alone self-attention, and present a new module that maps the input to the hidden space and performs the three operations of self-attention, convolution and nonlinearity in parallel, simply stacking this module outperforms all previous models including Transformer (Vasvani et al., 2017) on main NMT tasks under standard setting.

Key features:

First successfully combine convolution and self-attention in one module for sequence tasks by the proposed shared projection，
SOTA on three main translation datasets, including WMT14 En-Fr, WMT14 En-De and IWSLT14 De-En，
Parallel learn sequence representations and thus have potential for acceleration。

Quick links:

Arxiv : pdf;

Github : Code, pretrained models, instructions for training are all available.

Main results:

https://preview.redd.it/2qkne32qwo041.png?width=554&format=png&auto=webp&s=f7ed2973ebdd434f2982a2ec546f65ca635ff529

https://preview.redd.it/cdx3sc5vwo041.png?width=554&format=png&auto=webp&s=611d277629129f5734119321b4466f1bd4781f03

Abstract:

In sequence to sequence learning, the self-attention mechanism proves to be highly effective, and achieves significant improvements in many tasks. However, the self-attention mechanism is not without its own flaws. Although self-attention can model extremely long dependencies, the attention in deep layers tends to overconcentrate on a single token, leading to insufficient use of local information and difficultly in representing long sequences. In this work, we explore parallel multi-scale representation learning on sequence data, striving to capture both long-range and short-range language structures. To this end, we propose the Parallel MUlti-Scale attEntion (MUSE) and MUSE-simple. MUSE-simple contains the basic idea of parallel multi-scale sequence representation learning, and it encodes the sequence in parallel, in terms of different scales with the help from self-attention, and pointwise transformation. MUSE builds on MUSE-simple and explores combining convolution and self-attention for learning sequence representations from more different scales. We focus on machine translation and the proposed approach achieves substantial performance improvements over Transformer, especially on long sequences. More importantly, we find that although conceptually simple, its success in practice requires intricate considerations, and the multi-scale attention must build on unified semantic space. Under common setting, the proposed model achieves substantial performance and outperforms all previous models on three main machine translation tasks. In addition, MUSE has potential for accelerating inference due to its parallelism.

submitted by /u/stopwind
[link] [comments]

Blog

Learn About Our Meetup

5000+ Members

MEETUPS

JOB POSTINGS

CONTACT

[R] A simple module consistently outperforms self-attention and Transformer model on main NMT datasets with SoTA performance.