[D] Confused about generating a translation using Transformer

I’m reading the Attention Is All You Need paper and it doesn’t seem to explain how exactly the Transformer is used to generate a translation. Here’s how I understand it so far (please correct if I’m wrong):

1. A sequence of k tokens comes in as one-hot vectors of length v – the vocab size. This is a (k x v) token matrix.
2. The tokens are embedded in d_m-dimensional space (d_m is the model size, e.g. 512) via multiplication by an embedding matrix E of dim (v x d_m), yielding a (k x d_m) matrix.
3. Positional encodings are added; the dim is still (k x d_m).
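
A minimal NumPy sketch of the three input steps above, just to check the shapes. The vocab size v = 100, the random token ids, and the random matrix E are assumptions for illustration (a trained model would learn E). The paper also scales the embeddings by sqrt(d_m), which is left out here.

```python
import numpy as np

def positional_encoding(length: int, d_model: int) -> np.ndarray:
    """Sinusoidal encodings: sin on even dimensions, cos on odd dimensions."""
    pos = np.arange(length)[:, None]          # (length, 1)
    two_i = np.arange(0, d_model, 2)[None, :] # (1, d_model / 2), the even indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

k, v, d_m = 7, 100, 512                  # v = 100 is an arbitrary toy vocab size
rng = np.random.default_rng(0)
token_ids = rng.integers(0, v, size=k)   # hypothetical token ids
one_hot = np.eye(v)[token_ids]           # step 1: (k, v) one-hot matrix
E = rng.normal(size=(v, d_m))            # random stand-in for the learned embedding
embedded = one_hot @ E                   # step 2: (k, d_m)
x = embedded + positional_encoding(k, d_m)  # step 3: still (k, d_m)

print(one_hot.shape, embedded.shape, x.shape)
```

Multiplying a one-hot matrix by E is the same as selecting rows of E, so in practice the lookup `E[token_ids]` replaces the matrix multiply.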

Encoding:

1. Encoder block takes in the (k x d_m) matrix and outputs another (k x d_m) matrix.
2. Repeat N times to get a final (k x d_m) matrix, i.e. the encoder output.
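
A rough sketch of the shape-preserving encoder stack, assuming N = 6 and d_ff = 2048 as in the paper's base model. The layer below is heavily simplified (a single attention head with no learned Q/K/V projections, and random feed-forward weights), so it only illustrates that each block maps (k x d_m) to (k x d_m).

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def toy_encoder_layer(x: np.ndarray, W1: np.ndarray, W2: np.ndarray) -> np.ndarray:
    """Simplified block: self-attention, then a position-wise feed-forward net,
    each wrapped in a residual connection and layer norm."""
    attn = softmax(x @ x.T / np.sqrt(x.shape[-1])) @ x  # (k, d_m)
    x = layer_norm(x + attn)
    ff = np.maximum(0.0, x @ W1) @ W2                   # ReLU feed-forward, (k, d_m)
    return layer_norm(x + ff)

k, d_m, d_ff, N = 7, 512, 2048, 6
rng = np.random.default_rng(0)
x = rng.normal(size=(k, d_m))            # stand-in for embedded input
for _ in range(N):
    W1 = rng.normal(size=(d_m, d_ff)) * 0.02  # random stand-ins for learned weights
    W2 = rng.normal(size=(d_ff, d_m)) * 0.02
    x = toy_encoder_layer(x, W1, W2)
encoder_output = x

print(encoder_output.shape)
```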

Now for decoding:

1. The decoder takes in a (p x d_m) matrix and adds positional encodings.
2. The (non-masked) multi-head attention function inside the decoder receives the encoder’s (k x d_m) output as key K and value V, and a (p x d_m) matrix as the query Q, yielding a (p x d_m) output.
3. The final output of the decoder is therefore (p x d_m).
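
The shape claim in step 2 can be checked with a single-head sketch of scaled dot-product attention. This assumes p = 5 and random inputs, and it omits the learned projections and the multiple heads. In the paper, the queries for this sublayer come from the previous decoder sublayer, while keys and values come from the encoder output.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (p, k): one distribution per query
    return weights @ V, weights

k, p, d_m = 7, 5, 512                      # p = 5 is an arbitrary target length
rng = np.random.default_rng(0)
enc_out = rng.normal(size=(k, d_m))        # stand-in for the encoder output
dec_state = rng.normal(size=(p, d_m))      # stand-in for the decoder-side queries

out, weights = attention(dec_state, enc_out, enc_out)
print(out.shape, weights.shape)
```

Each of the p target positions gets its own weighting over the k source positions, which is why the output keeps p rows no matter how long the source sentence is.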

Final output:

1. The (p x d_m) decoder output is mapped to (p x v) by a matrix multiply (Question: they say it’s “tied” to the embedding matrix E, so is this just E^T?).
2. Apply a softmax to each of the p rows and pick the most probable token in each row, so you get p tokens out.
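
A small sketch of the output step, assuming that "tied" means the pre-softmax projection reuses the embedding matrix, so the (d_m x v) projection is E^T. The paper states that the same weight matrix is shared between the embedding layers and the pre-softmax linear transformation. The sizes and random values are placeholders.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

v, d_m, p = 100, 512, 5                  # toy vocab size and target length
rng = np.random.default_rng(0)
E = rng.normal(size=(v, d_m))            # stand-in for the shared embedding matrix
dec_out = rng.normal(size=(p, d_m))      # stand-in for the decoder output

logits = dec_out @ E.T                   # (p, v), tied projection
probs = softmax(logits)                  # each row is a distribution over the vocab
tokens = probs.argmax(axis=1)            # (p,), most probable token id per row

print(logits.shape, tokens.shape)
```

Softmax is monotonic, so taking the argmax of the logits directly gives the same tokens. The probabilities only matter for training or for search procedures that compare scores.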

Suppose I want to translate the sequence “This attention paper is super confusing !” into German. Here k = 7, so my encoder outputs a (7 x 512) matrix. From here, can someone walk me through the steps of generating the translation?
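
One common way to generate from an encoder-decoder model is greedy autoregressive decoding: start the target with a begin-of-sequence token, run the decoder on the tokens produced so far, keep only the prediction for the last position, append it, and repeat until an end-of-sequence token appears. The sketch below assumes that procedure. `toy_decoder`, the `BOS`/`EOS` ids, and all weights are hypothetical stand-ins, so the output ids are meaningless; only the loop structure is the point. The paper reports results with beam search rather than greedy search.

```python
import numpy as np

BOS, EOS = 0, 1  # hypothetical special token ids

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def toy_decoder(enc_out: np.ndarray, tgt_ids: np.ndarray, E: np.ndarray) -> np.ndarray:
    """Stand-in for the decoder stack: embed the target prefix and apply one
    cross-attention step over the encoder output. Returns (p, d_m)."""
    q = E[tgt_ids]                                      # (p, d_m)
    w = softmax(q @ enc_out.T / np.sqrt(q.shape[-1]))   # (p, k)
    return q + w @ enc_out

def greedy_decode(enc_out: np.ndarray, E: np.ndarray, max_len: int = 10) -> list:
    tgt = [BOS]                                  # p starts at 1 and grows each step
    for _ in range(max_len):
        dec_out = toy_decoder(enc_out, np.array(tgt), E)  # (p, d_m)
        logits = dec_out[-1] @ E.T               # (v,), only the last position is used
        next_id = int(np.argmax(logits))
        tgt.append(next_id)
        if next_id == EOS:                       # stop once end-of-sequence is emitted
            break
    return tgt

rng = np.random.default_rng(0)
v, d_m, k = 100, 512, 7                  # k = 7 source tokens, as in the example
E = rng.normal(size=(v, d_m))            # random stand-in for shared embeddings
enc_out = rng.normal(size=(k, d_m))      # stand-in for the (7 x 512) encoder output

result = greedy_decode(enc_out, E)
print(result)
```

The encoder runs once on the source sentence; only the decoder is re-run as the target prefix grows, so p is not fixed in advance at inference time.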

Thanks for looking at my question and have an awesome day!

submitted by /u/ME_PhD