OpenWebTextCorpus download released: Replication of GPT-2’s Training Dataset
Today we’re announcing the release of a beta version of our Open WebText Corpus – an open source effort to reproduce OpenAI’s WebText dataset, as detailed here. This distribution was created by Aaron Gokaslan and Vanya Cohen of Brown University. The following post outlines the steps taken to reproduce the dataset and provides information for those seeking to contribute to its further development.
We would like to thank the contributors of the OpenWebText project for their very useful scraping and data filtering scripts. After some experimentation, we were able to filter and deduplicate the documents down to 38GB of text data (40GB using SI units) from 8,013,769 documents, which matches the numbers listed in the paper. We hope that this dataset will allow further research to build upon this valuable source of NLP data.
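The exact pipeline lives in the project’s released scripts, but the core cleaning step – dropping very short pages and exact duplicates – can be sketched roughly as follows. This is an illustrative sketch only; the function name, hash choice, and length threshold are assumptions, not the project’s actual code:

```python
import hashlib

def clean_corpus(documents, min_chars=128):
    """Filter and deduplicate a list of raw text documents.

    Keeps only documents above a minimum character length and drops
    exact duplicates, identified by an MD5 hash of the stripped text.
    (Illustrative sketch; the real pipeline is more involved.)
    """
    seen = set()
    kept = []
    for doc in documents:
        text = doc.strip()
        if len(text) < min_chars:
            continue  # drop near-empty or boilerplate-only pages
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # drop exact duplicate documents
        seen.add(digest)
        kept.append(text)
    return kept
```

A content hash keeps memory use proportional to the number of unique documents rather than their total size, which matters when processing tens of gigabytes of text.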