[P] Heuristical keyword extraction from documents and encoding for training GPT-2 to generate texts based on user-specified keywords (+ parallelized spaCy)

Written by torontoai on June 30, 2019. Posted in Reddit MachineLearning.

https://github.com/minimaxir/gpt-2-keyword-generation

A couple of weeks ago I posted a Reddit title generator app based on GPT-2 to this subreddit. It lets the user generate Reddit titles for a given subreddit and optionally specify keywords to condition the title generation on. There were a few comments asking how I handled the keywords, so here it is.

The heuristics the script uses are outlined in the README. It’s not the most mathematically rigorous option, but it’s hard to argue with the results.

When working with the Reddit data, I found that spaCy was too slow to encode hundreds of thousands of texts (my first pass would have taken 24+ hours). So I used ray to parallelize it, which resulted in an 11x speedup and a much more reasonable runtime. That may end up being of more interest to this subreddit.
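A minimal sketch of how that kind of parallelization might look, not the repository's actual code. It assumes ray and spaCy are installed along with the `en_core_web_sm` model; the helper names (`chunked`, `extract_keywords_batch`, `parallel_keywords`) and the noun-only keyword rule are hypothetical stand-ins for the heuristics described in the README.

```python
from typing import List, Sequence


def chunked(items: Sequence[str], size: int) -> List[List[str]]:
    """Split items into consecutive chunks of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def extract_keywords_batch(texts: List[str]) -> List[List[str]]:
    """Hypothetical stand-in heuristic: keep lemmas of non-stopword nouns."""
    import spacy  # imported lazily so each worker process loads its own model

    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    return [
        [tok.lemma_.lower() for tok in doc
         if tok.pos_ in {"NOUN", "PROPN"} and not tok.is_stop]
        for doc in nlp.pipe(texts)
    ]


def parallel_keywords(texts: Sequence[str], num_chunks: int = 8) -> List[List[str]]:
    """Fan chunks out to ray workers and merge results in the original order."""
    import ray  # imported lazily so the helpers above work without ray

    ray.init(ignore_reinit_error=True)
    size = max(1, -(-len(texts) // num_chunks))  # ceiling division
    remote_batch = ray.remote(extract_keywords_batch)
    futures = [remote_batch.remote(chunk) for chunk in chunked(texts, size)]
    # ray.get on a list returns results in the same order as the futures.
    return [keywords for batch in ray.get(futures) for keywords in batch]
```

Calling `parallel_keywords(titles, num_chunks=8)` on a list of strings would return one keyword list per title. Keeping the number of chunks close to the number of CPU cores means each worker loads the spaCy model only once, which matters because model loading is slow relative to processing a short text.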

Speaking of the Reddit API, now that the keyword generation is open-sourced, I have open-sourced the Reddit API itself (sans the model, since that’s hard to distribute), with a mini how-to on how I built it.

submitted by /u/minimaxir