[D] GPT paper disappointingly simple
Not in a bad way, of course. I don't know how else to put it. I read through the whole thing and cross-referenced their old models and the techniques they cited. There's a lot of preprocessing and clever work going on in how the text is compressed, but for the actual machine learning model, all they did was rearrange where they applied layer normalization and add an extra one. I got curious about how the attention function was formulated in 'Attention Is All You Need', and it reminded me of looking at an LSTM: just a seemingly random sequence of matrix transformations.
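For reference, here is a minimal sketch of scaled dot-product attention as that paper defines it, softmax(QK^T / sqrt(d_k))V, written with plain Python lists so it has no dependencies. The helper names (`matmul`, `transpose`, `softmax`) are my own, and the optional mask and the multi-head projections are left out, so treat it as an illustration rather than anyone's actual implementation.

```python
import math

def matmul(a, b):
    # (n x k) times (k x m) -> (n x m), with matrices as lists of rows
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    # Subtract the max before exponentiating for numerical stability
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(q, k, v):
    d_k = len(k[0])
    # 1. Dot product of every query with every key
    scores = matmul(q, transpose(k))
    # 2. Scale by sqrt(d_k) so the scores do not grow with the key dimension
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    # 3. Softmax per query, turning scores into weights that sum to 1
    weights = [softmax(row) for row in scaled]
    # 4. Weighted average of the value vectors
    return matmul(weights, v), weights

if __name__ == "__main__":
    q = [[1.0, 0.0]]
    k = [[1.0, 0.0], [0.0, 1.0]]
    v = [[1.0, 2.0], [3.0, 4.0]]
    out, weights = scaled_dot_product_attention(q, k, v)
    print(weights, out)
```

Read this way, the steps are less arbitrary than they first look: the output for each query is a weighted average of the value rows, with weights given by how strongly that query matches each key.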
Part of my intuition is just telling me "why don't you just collect a set of 30 or so symbols representing all the functions that can be performed on the input as it's currently formatted at this step, randomly pick a few, and see what happens?" Surely a lot of these models can be produced, trained for a small number of epochs, and compared against each other. Perhaps a better way of doing scaled dot-product attention could be found like this, especially considering it's only 4 operations. This paragraph is more of a ramble, because I haven't been able to get this feeling out of my mind as I read some papers and look at these models.
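To make the idea concrete, here is a toy sketch of that kind of random search. Everything in it is hypothetical: the op vocabulary `OPS` has five entries instead of 30, each op only maps a vector to a vector of the same length, and `score_fn` is a placeholder for "train the candidate for a few epochs and return a validation score", which is the expensive part this sketch does not do.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical op vocabulary: each op maps a vector to a vector of the same length.
OPS = {
    "relu": lambda xs: [max(0.0, x) for x in xs],
    "tanh": lambda xs: [math.tanh(x) for x in xs],
    "softmax": softmax,
    "scale": lambda xs: [x / math.sqrt(len(xs)) for x in xs],
    "negate": lambda xs: [-x for x in xs],
}

def sample_pipeline(rng, length):
    # Randomly pick `length` op names (with repetition) from the vocabulary
    return [rng.choice(sorted(OPS)) for _ in range(length)]

def apply_pipeline(names, xs):
    # Apply the chosen ops to the input, in order
    for name in names:
        xs = OPS[name](xs)
    return xs

def random_search(score_fn, n_candidates=20, length=4, seed=0):
    # score_fn is a stand-in for "train briefly, return a validation score"
    rng = random.Random(seed)
    candidates = [sample_pipeline(rng, length) for _ in range(n_candidates)]
    return max(candidates, key=score_fn)

if __name__ == "__main__":
    pipeline = sample_pipeline(random.Random(0), 4)
    print(pipeline, apply_pipeline(pipeline, [1.0, -2.0, 3.0]))
```

Even in this toy form the search space grows as (number of ops)^(sequence length), and each candidate needs its own training run to score, so how candidates get compared matters at least as much as how they get sampled.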
It almost seems as if how the initial data is organized and represented matters more than the model itself in some cases.
Can anyone shed some more light so I can build more intuition for the beauty of the matrix math here, rather than thinking that I could just cherry-pick some random sequence of transformations, fine-tune it, call it attention or something buzzwordy, and call it a day?