[D] Confused about generating a translation using Transformer

Written by torontoai on December 4, 2019. Posted in Reddit MachineLearning.

I’m reading the Attention Is All You Need paper and it doesn’t seem to explain how exactly the Transformer is used to generate a translation. Here’s how I understand it so far (please correct if I’m wrong):

A sequence of k tokens comes in as one-hot vectors of length v – the vocab size. This is a (k x v) token matrix.
The tokens are embedded in d_m-dimensional space (d_m is the model size, e.g. 512) via multiplication by an embedding matrix E of dim (v x d_m), yielding a (k x d_m) matrix.
Positional encodings are added; the dim is still (k x d_m).
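
To make the shapes in these three steps concrete, here is a minimal NumPy sketch. The toy vocab size v = 100, the random stand-in for the learned matrix E, and the helper name positional_encoding are assumptions for illustration; the sinusoid formula follows the paper.

```python
import numpy as np

def positional_encoding(k, d_m):
    """Sinusoidal positional encodings of shape (k, d_m); assumes d_m is even."""
    pos = np.arange(k)[:, None]                    # (k, 1) token positions
    two_i = np.arange(0, d_m, 2)[None, :]          # (1, d_m/2) even dimension indices
    angles = pos / np.power(10000.0, two_i / d_m)  # (k, d_m/2)
    pe = np.zeros((k, d_m))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

rng = np.random.default_rng(0)
k, v, d_m = 7, 100, 512                # v = 100 is a toy vocab size (assumption)

token_ids = rng.integers(0, v, size=k)
one_hot = np.eye(v)[token_ids]         # (k, v) token matrix
E = rng.normal(size=(v, d_m))          # random stand-in for the learned embedding
embedded = one_hot @ E                 # (k, d_m)
x = embedded + positional_encoding(k, d_m)  # still (k, d_m)
```

In practice the one-hot multiply is just a row lookup, E[token_ids], which gives the same (k x d_m) result without building the (k x v) matrix.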

Encoding:

Encoder block takes in the (k x d_m) matrix and outputs another (k x d_m) matrix.
Repeat N times to get a final (k x d_m) matrix, i.e. the encoder output.

Now for decoding:

The decoder takes in a (p x d_m) matrix and adds positional encodings.
The (non-masked) multi-head attention function inside the decoder receives the encoder’s (k x d_m) output as both the key K and the value V, and a (p x d_m) matrix as the query Q, yielding a (p x d_m) output.
The final output of the decoder is therefore (p x d_m).
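
The shape bookkeeping for that encoder-decoder attention step can be checked with a small sketch. This is a single head with no learned projection matrices, which is a simplification of the multi-head version in the paper; p = 4 and the random inputs are assumptions for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Q, K, V):
    """Scaled dot-product attention: Q is (p, d), K and V are (k, d)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (p, k): each query scored against each key
    weights = softmax(scores, axis=-1)   # each row sums to 1 over the k source tokens
    return weights @ V, weights          # (p, d) output and (p, k) weights

rng = np.random.default_rng(0)
k, p, d_m = 7, 4, 512                    # p = 4 target positions (assumption)
enc_out = rng.normal(size=(k, d_m))      # stand-in for the encoder output
dec_state = rng.normal(size=(p, d_m))    # stand-in for the decoder-side input

out, weights = cross_attention(dec_state, enc_out, enc_out)
```

The output has one row per query, so its length is p regardless of the source length k.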

Final output:

The (p x d_m) decoder output is mapped to (p x v) by a matrix multiply (Question: they say it’s “tied” to the embedding matrix E, so is this just E^T?).
Apply a softmax to each of the p rows and select the highest-probability entry in each row, so you get p tokens out.
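
A rough sketch of this final step, assuming the "tied" reading in the question (the output projection is E^T, with E stored as (v x d_m)). The sizes and random values are placeholders, not trained weights.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
p, v, d_m = 4, 100, 512                 # toy sizes (assumption)
E = rng.normal(size=(v, d_m))           # stand-in for the shared embedding matrix
dec_out = rng.normal(size=(p, d_m))     # stand-in for the decoder output

logits = dec_out @ E.T                  # (p, v): tied projection reuses E
probs = softmax(logits, axis=-1)        # (p, v): one distribution per row
token_ids = probs.argmax(axis=-1)       # (p,): most likely token per position
```

Softmax preserves ordering within a row, so taking the argmax of the logits directly picks the same tokens.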

Suppose I want to translate the sequence “This attention paper is super confusing !” into German. Here k = 7, so my encoder outputs a (7 x 512) matrix. From here, can someone walk me through the steps of generating the translation?
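
For reference, one common way to generate from a model like this is greedy autoregressive decoding: feed the decoder a start token, take the prediction at the last position, append it, and repeat until an end token appears. The sketch below only shows that loop. toy_decoder is a hypothetical stand-in (no self-attention, untrained random weights), and the BOS/EOS ids are assumptions, so the generated ids are meaningless.

```python
import numpy as np

BOS, EOS = 0, 1  # hypothetical start/end token ids

def toy_decoder(enc_out, dec_in_ids, E):
    """Hypothetical stand-in for the decoder stack; returns a (p, d_m) matrix."""
    x = E[dec_in_ids]                               # (p, d_m) embed tokens so far
    scores = x @ enc_out.T / np.sqrt(x.shape[-1])   # (p, k) attend to encoder output
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return x + w @ enc_out                          # (p, d_m)

def greedy_decode(enc_out, E, max_len=10):
    """Generate token ids one at a time, feeding earlier outputs back in."""
    ids = [BOS]
    for _ in range(max_len):
        dec_out = toy_decoder(enc_out, np.array(ids), E)  # p grows by 1 each step
        logits = dec_out[-1] @ E.T                        # (v,) last position only
        next_id = int(logits.argmax())
        ids.append(next_id)
        if next_id == EOS:
            break
    return ids[1:]                                        # drop the start token

rng = np.random.default_rng(0)
k, v, d_m = 7, 100, 512                  # k = 7 source tokens, toy vocab size
enc_out = rng.normal(size=(k, d_m))      # stand-in for the (7 x 512) encoder output
E = rng.normal(size=(v, d_m))            # stand-in for the shared embedding matrix
generated = greedy_decode(enc_out, E)
```

The encoder output is computed once and reused at every step; only the decoder input grows, so p is not fixed in advance at inference time.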

Thanks for looking at my question and have an awesome day!

submitted by /u/ME_PhD