[P] Replicate Toronto BookCorpus
Hey all,
I created a small Python repository called Replicate TorontoBookCorpus that you can use to replicate the no-longer-available Toronto BookCorpus (TBC) dataset.
I’m currently doing research on transformers for my thesis and could not get a copy of the original TBC dataset by any means, so my only alternative was to replicate it. I figured I am not the only one with this issue, so I made and published this small project.
Like the original TBC dataset, the replica contains only English-language books with at least 20k words. The total number of words in the replica is also slightly over 0.9B. All in all, if you follow the steps outlined in the repository, you end up with a 5 GB text file with one sentence per line (and three blank lines between books).
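For anyone who wants to work with the output file book by book, here is a rough sketch of how the format described above could be read back in. It is not code from the repository: the names `iter_books` and `word_count` are made up for this example, and it assumes that blank lines only ever appear as separators between books and that words can be counted by splitting on whitespace.

```python
from typing import Iterable, Iterator, List


def iter_books(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one list of sentences per book.

    Assumes one sentence per line and that any run of blank lines
    marks a boundary between two books (hypothetical helper).
    """
    book: List[str] = []
    for line in lines:
        sentence = line.strip()
        if sentence:
            book.append(sentence)
        elif book:
            # The first blank line after some sentences closes the book;
            # further blank lines in the same run are skipped.
            yield book
            book = []
    if book:
        # The last book has no trailing separator.
        yield book


def word_count(book: List[str]) -> int:
    """Count whitespace-separated tokens in a book (an approximation)."""
    return sum(len(sentence.split()) for sentence in book)


# Tiny in-memory sample that mimics the file layout.
sample = [
    "First sentence of book one.",
    "Second sentence.",
    "",
    "",
    "",
    "Only sentence of book two.",
]
books = list(iter_books(sample))
```

Since `iter_books` accepts any iterable of lines, you can pass it an open file handle directly and the 5 GB file is streamed rather than loaded into memory. Calling `word_count` on each book then gives a quick way to check the 20k-word minimum.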
PS. If you have a copy of the original TBC dataset, please get in touch with me (I am desperately looking for the original)!
submitted by /u/SynonymOfHeat