
[P] Heuristical keyword extraction from documents and encoding for training GPT-2 to generate texts based on user-specified keywords (+ parallelized spaCy)

https://github.com/minimaxir/gpt-2-keyword-generation

A couple of weeks ago I posted a Reddit title generator app based on GPT-2 to this subreddit. It lets the user generate Reddit titles for a given subreddit, and also lets the user specify keywords to condition the title generation on. A few comments asked how I handled the keywords, so here it is.

The heuristics the script uses are outlined in the README. They're not the most mathematically rigorous option, but it's hard to argue with the results.
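The README has the actual heuristics; as a rough illustration of the overall shape (extract keywords from a document, then prepend them behind delimiter tokens so the model learns a keywords-to-text mapping), here is a minimal stdlib sketch. The frequency-based keyword heuristic and the delimiter characters below are placeholders of mine, not the repo's real scheme.

```python
# Illustrative sketch only: the real heuristics and delimiter tokens
# are defined in the gpt-2-keyword-generation README, not here.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on"}

def extract_keywords(text, max_keywords=3):
    """Naive heuristic: most frequent non-stopword words of 4+ letters."""
    words = re.findall(r"[a-z]{4,}", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(max_keywords)]

def encode_document(text):
    """Prepend keywords so training learns: keywords -> generated text."""
    keywords = extract_keywords(text)
    return "~`" + " ".join(keywords) + "`~ " + text  # hypothetical delimiters

print(encode_document("the neural network learns to generate titles from titles"))
```

At generation time the same delimiters are what lets the user supply their own keywords as a prompt prefix.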

When working with the Reddit data, I found that spaCy was too slow to encode hundreds of thousands of texts (my first pass would have taken 24+ hours). So I used ray to parallelize it, which resulted in an 11x speedup that's much more reasonable. That may be of even more interest to this subreddit.
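The parallelization is the usual chunk-and-map pattern: shard the corpus across workers, run the per-document spaCy work in each, and collect the results. Since neither ray nor a spaCy model is needed to show the pattern itself, this sketch uses stdlib multiprocessing as a stand-in, and `process_text` is a hypothetical placeholder for the real spaCy encoding step.

```python
# Stdlib sketch of the chunk-and-map pattern (the author used ray;
# multiprocessing is a stand-in so this runs without extra dependencies).
from multiprocessing import Pool

def process_text(text):
    # Placeholder for the real per-document work
    # (spaCy tokenization + keyword extraction + encoding).
    return text.upper()

def parallel_encode(texts, workers=4):
    # chunksize batches documents per worker to amortize IPC overhead,
    # the same idea as sharding the corpus across ray workers.
    with Pool(workers) as pool:
        return pool.map(process_text, texts, chunksize=256)

if __name__ == "__main__":
    print(parallel_encode(["a", "b", "c"]))
```

The speedup comes from the per-document work being embarrassingly parallel; the batching just keeps serialization overhead from eating the gains.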

Speaking of Reddit: now that the keyword generation is open-sourced, I have also open-sourced the Reddit title generator itself (sans the model, since that's hard to distribute), with a mini how-to on how I built it.

submitted by /u/minimaxir