# [D] Confused about generating a translation using Transformer

I’m reading the Attention Is All You Need paper and it doesn’t seem to explain how exactly the Transformer is used to generate a translation. Here’s how I understand it so far (please correct me if I’m wrong):

1. A sequence of k tokens comes in as one-hot vectors of length v – the vocab size. This is a (k x v) token matrix.
2. The tokens are embedded in d_m-dimensional space (the model size, e.g. 512) via multiplication by an embedding matrix E of dim (v x d_m), yielding a (k x d_m) matrix.
3. Positional encodings are added; the dim is still (k x d_m).
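To sanity-check the shapes, here's a toy numpy sketch of steps 1–3 (the sizes and random matrices are just placeholders, not a trained model):

```python
import numpy as np

# Toy sizes for illustration only (v is far smaller than a real vocab)
k, v, d_m = 7, 100, 512

rng = np.random.default_rng(0)
tokens = np.eye(v)[rng.integers(0, v, size=k)]   # one-hot token matrix, (k x v)
E = rng.normal(size=(v, d_m))                    # embedding matrix, (v x d_m)

X = tokens @ E                                   # embeddings, (k x d_m)

# Sinusoidal positional encodings as in the paper
pos = np.arange(k)[:, None]                      # (k x 1)
i = np.arange(d_m // 2)[None, :]                 # (1 x d_m/2)
angles = pos / (10000 ** (2 * i / d_m))
PE = np.zeros((k, d_m))
PE[:, 0::2] = np.sin(angles)
PE[:, 1::2] = np.cos(angles)

X = X + PE                                       # still (k x d_m)
print(X.shape)                                   # (7, 512)
```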

Encoding:

1. Encoder block takes in the (k x d_m) matrix and outputs another (k x d_m) matrix.
2. Repeat N times to get a final (k x d_m) matrix, i.e. the encoder output.

Now for decoding:

1. The decoder takes in a (p x d_m) matrix and adds positional encodings.
2. The (non-masked) multi-head attention function inside the decoder receives the encoder’s (k x d_m) output as the keys K and values V, and a (p x d_m) matrix as the queries Q, yielding a (p x d_m) output.
3. The final output of the decoder is therefore (p x d_m).
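If I've got step 2 right, a single-head version of that cross-attention (ignoring the learned W_Q/W_K/W_V projections and the multi-head split) would look like this:

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (p x k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                           # (p x d_m)

k, p, d_m = 7, 3, 512
rng = np.random.default_rng(0)
enc_out = rng.normal(size=(k, d_m))      # encoder output: used as K and V
dec_hidden = rng.normal(size=(p, d_m))   # decoder-side queries Q

out = attention(dec_hidden, enc_out, enc_out)
print(out.shape)  # (3, 512)
```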

Final output:

1. The (p x d_m) decoder output is mapped to (p x v) by a matrix multiply (Question: they say it’s “tied” to the embedding matrix E, so is this just E^T?).
2. Apply a softmax to each of the p rows and take the argmax, so you get p tokens out.
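Assuming the tied projection really is just E^T, I'd sketch the final step like this (random placeholder weights):

```python
import numpy as np

v, d_m, p = 100, 512, 3
rng = np.random.default_rng(0)
E = rng.normal(size=(v, d_m))          # shared embedding matrix, (v x d_m)
dec_out = rng.normal(size=(p, d_m))    # decoder output, (p x d_m)

logits = dec_out @ E.T                 # weight tying: project back to (p x v)
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)   # softmax over the vocab
next_tokens = probs.argmax(axis=-1)    # one token id per row, shape (p,)
print(next_tokens.shape)               # (3,)
```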

Suppose I want to translate the sequence “This attention paper is super confusing !” into German. Here k = 7, so my encoder outputs a (7 x 512) matrix. From here, can someone walk me through the steps of generating the translation?
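From what I can piece together, generation has to be an autoregressive loop something like the sketch below — `encode` and `decode_step` are hypothetical stand-ins for the trained encoder and decoder stacks, and the BOS/EOS token ids are made up:

```python
import numpy as np

BOS, EOS, v, d_m = 1, 2, 100, 512
rng = np.random.default_rng(0)

def encode(src_tokens):
    # Placeholder: a real encoder returns a (k x d_m) matrix
    return rng.normal(size=(len(src_tokens), d_m))

def decode_step(enc_out, tgt_tokens):
    # Placeholder: a real decoder returns (p x v) logits over the vocab
    return rng.normal(size=(len(tgt_tokens), v))

src = [5, 6, 7, 8, 9, 10, 11]           # "This attention paper is super confusing !"
enc_out = encode(src)                   # (7 x d_m), computed once

tgt = [BOS]                             # start the decoder input with BOS
for _ in range(20):                     # max output length
    logits = decode_step(enc_out, tgt)  # (p x v)
    next_tok = int(logits[-1].argmax()) # only the last row predicts the next token
    tgt.append(next_tok)
    if next_tok == EOS:                 # stop when the model emits EOS
        break
print(tgt[1:])                          # the generated translation (token ids)
```

Is that right — you feed the growing target prefix back into the decoder one step at a time, rather than getting the whole translation in one forward pass?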

Thanks for looking at my question and have an awesome day!

submitted by /u/ME_PhD