[D] GPT-2 Training Data Augmentation
Hey y’all, I’m pretty much a complete novice when it comes to machine learning, but I had what seemed to me like an interesting idea. Don’t hesitate to dismiss this as the hunch of a layman if that is what it seems to be in your more educated opinions.
So, I’ve been screwing around with GPT-2-Lite recently just for fun. Something I noticed is that one skill the model has 100% on lock is always closing parentheses (yes, I’m anthropomorphizing lol). Further, it seems to have a pretty good sense of how parentheticals relate to preceding text, i.e. they tend to contain elaboration or qualification of the immediately preceding topic, or a digression that somewhat interrupts the flow of the text. You can get this behavior on demand by feeding the model a prompt that ends with an open paren. For example, both parentheticals in this sentence from the Gettysburg Address were generated that way:
It is for us the living (who, in fact are in perpetual combat here, and who are still here), rather, to be dedicated here to the unfinished work (and for that work to be paid) which they who fought here have thus far so nobly advanced.
So, I guess it must have picked up on the characteristics of phrases following the ‘(’ character in the training set. The same would apply to the open quote character, the em-dash, and function words like “which” that head clauses with certain predictable characteristics. My thought is that there may be other semantic features in the training set that could be detected programmatically but aren’t currently delimited by such explicit and compact cues. Would you expect there to be any value in augmenting the training data with some made-up cues for those features? E.g., you could trawl through the text and insert ‘|’ characters bounding every sentence that meets some sentiment-analysis criterion (maybe extreme negative sentiment), or do the same for compound or complex sentences. What do you reckon would happen if you fed the resulting trained model some text that ends with the opening delimiter? Would that give you a way to reliably elicit completions that have the target semantic feature? I don’t have the hardware or know-how to do the training part, so I can’t test this myself.
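For concreteness, here's a rough sketch (in Python) of the kind of augmentation pass I have in mind. The sentiment scorer here is just a toy word-list stand-in, and the 0.2 threshold is made up; a real pass would swap in an actual sentiment model, but the delimiter-insertion logic would look about the same:

```python
import re

# Toy negative-sentiment lexicon -- a stand-in for a real sentiment model,
# purely to make the sketch runnable.
NEGATIVE_WORDS = {"terrible", "awful", "hate", "disaster", "miserable"}

def sentence_negativity(sentence: str) -> float:
    """Crude score: fraction of tokens that appear in the negative lexicon."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    if not tokens:
        return 0.0
    return sum(t in NEGATIVE_WORDS for t in tokens) / len(tokens)

def augment(text: str, threshold: float = 0.2, delim: str = "|") -> str:
    """Wrap each sentence whose negativity exceeds `threshold` in delimiters."""
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    out = []
    for s in sentences:
        if sentence_negativity(s) > threshold:
            out.append(f"{delim}{s}{delim}")
        else:
            out.append(s)
    return " ".join(out)

print(augment("The food was awful and I hate this place. The view was nice."))
# -> |The food was awful and I hate this place.| The view was nice.
```

Run over the whole corpus before training, this would teach the model that text between ‘|’ marks has the target property, and then (if the idea works at all) prompting with a trailing ‘|’ should bias the completion toward that property.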