[P] Replicate Toronto BookCorpus

Hey all,

I created a small Python repository called Replicate TorontoBookCorpus that you can use to replicate the no-longer-available Toronto BookCorpus (TBC) dataset.

I’m currently doing research on transformers for my thesis, but I could not get a copy of the original TBC dataset by any means, so my only alternative was to replicate it. I figured I’m not the only one with this issue, so I made and published this small project.

As with the original TBC dataset, it only contains English-language books with at least 20k words. The total number of words in the replica dataset is also slightly over 0.9B. If you follow the steps outlined in the repository, you end up with a 5 GB text file with one sentence per line (and three blank lines between books).
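
For anyone who wants to read the resulting file back in book by book, here is a minimal sketch. It assumes only the layout described above: one sentence per line, with blank lines appearing only between books. The names `iter_books` and `word_count` are made up for illustration and are not part of the repository, and counting whitespace-separated tokens is just one possible way to count words.

```python
from typing import Iterable, Iterator, List


def iter_books(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each book as a list of sentences.

    Assumption: one sentence per line, and blank lines occur only
    as separators between books (three in a row, per the post).
    """
    book: List[str] = []
    for line in lines:
        sentence = line.strip()
        if sentence:
            book.append(sentence)
        elif book:
            # First blank line after some sentences: the book is complete.
            # Any further blank lines in the same run are skipped.
            yield book
            book = []
    if book:
        # The last book may not be followed by a separator.
        yield book


def word_count(book: List[str]) -> int:
    """Count whitespace-separated tokens across a book's sentences."""
    return sum(len(sentence.split()) for sentence in book)


# Tiny in-memory stand-in for the real 5 GB file.
sample = [
    "First sentence of book one.",
    "Second sentence.",
    "",
    "",
    "",
    "Only sentence of book two.",
]
books = list(iter_books(sample))
```

Since `iter_books` accepts any iterable of lines, you can pass it an open file handle and the file is streamed rather than loaded into memory at once.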

PS. If you have a copy of the original TBC dataset, please get in touch with me (I am desperately looking for the original)!

submitted by /u/SynonymOfHeat