[D] Confused about generating a translation using the Transformer
I’m reading the Attention Is All You Need paper, and it doesn’t seem to explain exactly how the Transformer is used to generate a translation. Here’s how I understand it so far (please correct me if I’m wrong):
- A sequence of k tokens comes in as one-hot vectors of length v – the vocab size. This is a (k x v) token matrix.
- The tokens are embedded in d_m-dimensional space (d_m is the model size, e.g. 512) via multiplication by an embedding matrix E of dim (v x d_m), yielding a (k x d_m) matrix.
- Positional encodings are added; the dim is still (k x d_m).
- Encoder block takes in the (k x d_m) matrix and outputs another (k x d_m) matrix.
- Repeat N times to get a final (k x d_m) matrix, i.e. the encoder output.
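To make sure I have the encoder shapes right, here is a rough NumPy sketch of the steps above. It is only a shape check under my own assumptions: v = 1000 is an arbitrary toy vocab size, E is random instead of learned, and `encoder_block_stub` is a made-up placeholder that keeps the (k x d_m) shape, not the real attention + feed-forward block. I also left out the sqrt(d_m) scaling the paper applies to the embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

k, v, d_m = 7, 1000, 512   # v = 1000 is an arbitrary toy vocab size

# k token ids -> (k x v) one-hot token matrix
token_ids = rng.integers(0, v, size=k)
one_hot = np.eye(v)[token_ids]                    # (k, v)

# Embedding matrix E (v x d_m); random here, learned in a real model
E = rng.normal(size=(v, d_m))
embedded = one_hot @ E                            # (k, d_m)
# Multiplying a one-hot row by E is the same as looking up a row of E
assert np.allclose(embedded, E[token_ids])


def positional_encoding(length, d_model):
    """Sinusoidal encodings: sin on even dimensions, cos on odd ones."""
    pos = np.arange(length)[:, None]              # (length, 1)
    i = np.arange(d_model // 2)[None, :]          # (1, d_model / 2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe


x = embedded + positional_encoding(k, d_m)        # still (k, d_m)


def encoder_block_stub(h, W):
    """Hypothetical stand-in for an encoder block; it only preserves the shape."""
    return h + np.tanh(h @ W)                     # (k, d_m) -> (k, d_m)


N = 6  # number of stacked blocks in the base model
for _ in range(N):
    W = rng.normal(scale=d_m ** -0.5, size=(d_m, d_m))
    x = encoder_block_stub(x, W)

encoder_output = x
print(one_hot.shape, embedded.shape, encoder_output.shape)
# (7, 1000) (7, 512) (7, 512)
```

So for a 7-token input I get a (7 x 512) encoder output no matter what N is, which matches my reading of the paper.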
Now for decoding:
- The decoder takes in a (p x d_m) matrix and adds positional encodings.
- The (non-masked) multi-head attention function inside the decoder receives the encoder’s (k x d_m) output as the keys K and values V, and a (p x d_m) matrix as the queries Q, yielding a (p x d_m) output.
- The final output of the decoder is therefore (p x d_m).
- The (p x d_m) decoder output is mapped to (p x v) by a matrix multiply (Question: they say it’s “tied” to the embedding matrix E, so is this just E^T?).
- Apply a softmax to each of the p rows and take the argmax of each row, so you get p tokens out.
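Here is the same kind of shape check for the decoder side, again as a sketch and not the real thing. Assumptions on my part: p = 5 and v = 1000 are arbitrary, the attention is single-head with the learned Q/K/V projections left out, the inputs are random stand-ins, and the output projection is taken to be E^T (which is exactly the part I'm asking about).

```python
import numpy as np

rng = np.random.default_rng(0)

k, p, v, d_m = 7, 5, 1000, 512   # p = 5 and v = 1000 are arbitrary toy values


def softmax(z):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


# Random stand-ins for the real activations
encoder_output = rng.normal(size=(k, d_m))   # (k, d_m)
decoder_state = rng.normal(size=(p, d_m))    # (p, d_m)

# Encoder-decoder attention: queries from the decoder, keys/values from the encoder
Q, K, V = decoder_state, encoder_output, encoder_output
weights = softmax(Q @ K.T / np.sqrt(d_m))    # (p, k): one distribution over source tokens per row
attended = weights @ V                       # (p, d_m)

# Output projection, ASSUMING "tied" means multiplying by E^T
E = rng.normal(size=(v, d_m))                # embedding matrix, random here
logits = attended @ E.T                      # (p, v)
probs = softmax(logits)                      # each row sums to 1
tokens = probs.argmax(axis=-1)               # (p,) token ids

print(weights.shape, attended.shape, logits.shape, tokens.shape)
# (5, 7) (5, 512) (5, 1000) (5,)
```

The shapes at least work out: k only shows up inside the (p x k) attention weights, and the output has one row per decoder position.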
Suppose I want to translate the sequence “This attention paper is super confusing !” into German. Here k = 7, so my encoder outputs a (7 x 512) matrix. From here, can someone walk me through the steps of generating the translation?
Thanks for looking at my question and have an awesome day!