Hey all,
I created a small Python repository called Replicate TorontoBookCorpus that one can use to replicate the no-longer-available Toronto BookCorpus (TBC) dataset.
I’m currently doing research on transformers for my thesis but could not get a copy of the original TBC dataset by any means, so my only alternative was to replicate it. I figured I am not the only one with this issue, so I made and published this small project.
As with the original TBC dataset, it only contains English-language books with at least 20k words. Furthermore, the total number of words in the replica dataset is also slightly over 0.9B. All in all, if you follow the steps outlined in the repository, you end up with a 5 GB text file with one sentence per line (and three blank lines between books).
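If you want to walk through the resulting file book by book, something like the following should work. It is only a rough sketch, not code from the repository: `iter_books` and `word_count` are names I made up for illustration, and it assumes that a run of three or more empty lines always marks a book boundary and that words can be counted by splitting on whitespace.

```python
from typing import Iterable, Iterator, List


def iter_books(lines: Iterable[str], min_blank: int = 3) -> Iterator[List[str]]:
    """Yield each book as a list of sentences.

    Assumption: books are separated by at least `min_blank` empty lines.
    Shorter runs of empty lines are treated as part of the same book.
    """
    book: List[str] = []
    blanks = 0
    for raw in lines:
        line = raw.strip()
        if not line:
            blanks += 1
            continue
        # A long enough run of blank lines closes the previous book.
        if blanks >= min_blank and book:
            yield book
            book = []
        blanks = 0
        book.append(line)
    if book:
        yield book


def word_count(book: List[str]) -> int:
    """Count whitespace-separated tokens across all sentences of a book."""
    return sum(len(sentence.split()) for sentence in book)


# Tiny in-memory sample standing in for the real text file.
sample = ["A b c.\n", "D e.\n", "\n", "\n", "\n", "F g h i.\n"]
books = list(iter_books(sample))
counts = [word_count(b) for b in books]
```

For the real file you would pass an open file handle instead of `sample`, e.g. `with open(path, encoding="utf-8") as f: ...`, where `path` is wherever you saved the output. Since it streams line by line, it never needs to hold the whole 5 GB in memory.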
PS. If you have a copy of the original TBC dataset, please get in touch with me (I am desperately looking for the original)!
submitted by /u/SynonymOfHeat