The Illustrated GPT-2 (Visualizing Transformer Language Models)
This is a new post in which I try to visualize much of what happens inside a trained GPT-2 model. We follow the journey of an input word from its embedding all the way up to the output of the model. I've also included a crude analogy for the query/key/value vectors of self-attention that I hope makes them easier to understand for people starting out with transformer architectures. By the end of the post, we'll have looked at the major weight matrices of a single block, as well as those of the entire model. All feedback and corrections are welcome!
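To make that journey concrete before we dive in, here is a minimal sketch (not part of the original post) that runs a prompt through a trained GPT-2 using the Hugging Face `transformers` library. Token ids go in, pass through the embedding layer and the stacked decoder blocks, and come out as logits: one score per vocabulary word at each position.

```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load a trained GPT-2 (small) and its tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Turn the input words into token ids
input_ids = tokenizer("The robot", return_tensors="pt").input_ids

# Forward pass: embedding -> decoder blocks -> output logits
with torch.no_grad():
    logits = model(input_ids).logits  # shape: (1, seq_len, vocab_size)

# The model's best guess for the next token, read off the last position
next_id = logits[0, -1].argmax().item()
print(tokenizer.decode([next_id]))
```

Everything the post visualizes happens inside that single `model(input_ids)` call.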