[P] Heuristic keyword extraction from documents and encoding for training GPT-2 to generate texts based on user-specified keywords (+ parallelized spaCy)
A couple of weeks ago I posted a GPT-2-based Reddit title generator app to this subreddit. It generates Reddit titles for a given subreddit and lets the user specify keywords to condition the generation on. A few comments asked how I handled the keywords, so here it is.
The heuristics the script uses are outlined in the README. It’s not the most mathematically rigorous approach, but it’s hard to argue with the results.
When working with the Reddit data, I found that spaCy was too slow to process hundreds of thousands of texts (my first pass would have taken 24+ hours). So I used ray to parallelize it, which resulted in an 11x speedup and a much more reasonable runtime. That may end up being of more interest to this subreddit.
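For anyone curious what parallelizing spaCy with ray can look like, here is a minimal sketch. It is not the actual script: the function names are hypothetical, the `en_core_web_sm` model is an assumption, and the "keep noun lemmas" rule is only a stand-in for the heuristics described in the README. The idea is to split the texts into one chunk per worker, have each ray task load its own spaCy pipeline, and stitch the results back together in order.

```python
from typing import List, Sequence


def split_into_chunks(texts: Sequence[str], num_chunks: int) -> List[List[str]]:
    """Split texts into at most num_chunks contiguous, near-equal chunks."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be >= 1")
    size, extra = divmod(len(texts), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # The first `extra` chunks each take one additional text.
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when there are few texts
            chunks.append(list(texts[start:end]))
        start = end
    return chunks


def noun_lemmas(chunk: List[str]) -> List[List[str]]:
    """Hypothetical stand-in heuristic: lowercase lemmas of nouns/proper nouns."""
    import spacy  # imported here so each ray worker loads its own pipeline

    # Assumption: the small English model is installed; parser and NER are
    # disabled because this stand-in only needs POS tags and lemmas.
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    return [
        [tok.lemma_.lower() for tok in doc if tok.pos_ in ("NOUN", "PROPN")]
        for doc in nlp.pipe(chunk)
    ]


def extract_keywords_parallel(texts: Sequence[str], num_workers: int = 4) -> List[List[str]]:
    """Run noun_lemmas over texts with one ray task per chunk."""
    import ray  # deferred so the chunking helper works without ray installed

    ray.init(ignore_reinit_error=True)
    remote_fn = ray.remote(noun_lemmas)
    futures = [remote_fn.remote(c) for c in split_into_chunks(texts, num_workers)]
    # ray.get on a list returns results in the same order as the futures,
    # so the flattened output lines up with the input texts.
    return [kws for chunk_result in ray.get(futures) for kws in chunk_result]
```

Calling `extract_keywords_parallel(titles, num_workers=8)` on a list of strings would return one keyword list per title. Loading the pipeline inside each task costs a few seconds per worker, which is negligible for hundreds of thousands of texts; the actual speedup will depend on core count and text length.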
Speaking of the Reddit API: now that the keyword extraction is open-sourced, I have also open-sourced the Reddit API itself (sans the model, since that’s hard to distribute), with a mini how-to on how I built it.