[P] OpenWebTextCorpus download released: Replication of GPT-2’s Training Dataset

https://skylion007.github.io/OpenWebTextCorpus/

Today we’re announcing the release of a beta version of our Open WebText Corpus – an open source effort to reproduce OpenAI’s WebText dataset, as detailed here. This distribution was created by Aaron Gokaslan and Vanya Cohen of Brown University. The following post outlines the steps taken to reproduce the dataset, and provides information for those seeking to contribute to its further development.

We would like to thank the contributors of the OpenWebText project for their very useful scraping and data filtering scripts. After some experimentation, we were able to filter the scraped documents down to 38GB of text data (40GB using SI units) from 8,013,769 documents, which matches the numbers listed in the paper. We hope that this dataset will allow further research to build upon this valuable source of NLP data.
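The two size figures can be reconciled with a quick conversion. The sketch below assumes that "38GB" is measured in binary units (1 GiB = 2**30 bytes) and "40GB" in SI units (1 GB = 10**9 bytes). That reading is an interpretation of the post, not something it states outright. The function name `gib_to_gb` is made up for illustration.

```python
# Assumption: "38GB" is in binary units (GiB) and "40GB" is in SI units (GB).
BYTES_PER_GIB = 2**30   # binary gigabyte
BYTES_PER_GB = 10**9    # SI gigabyte

def gib_to_gb(gib: float) -> float:
    """Convert a size in binary gigabytes (GiB) to SI gigabytes (GB)."""
    return gib * BYTES_PER_GIB / BYTES_PER_GB

size_gb = gib_to_gb(38)
print(f"38 GiB is about {size_gb:.1f} GB")  # prints "38 GiB is about 40.8 GB"

# Rough average document size, using the document count from the post.
num_documents = 8_013_769
avg_bytes = 38 * BYTES_PER_GIB / num_documents
print(f"Average document size: about {avg_bytes:.0f} bytes")
```

Under this assumption the corpus comes to roughly 40.8 GB in SI units, in line with the approximate 40GB figure quoted above.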

submitted by /u/Skylion007