Skip to main content

Blog

Learn About Our Meetup

5000+ Members

MEETUPS

LEARN, CONNECT, SHARE

Join our meetup, learn, connect, share, and get to know your Toronto AI community. 

JOB POSTINGS

INDEED POSTINGS

Browse through the latest deep learning, ai, machine learning postings from Indeed for the GTA.

CONTACT

CONNECT WITH US

Are you looking to sponsor space, be a speaker, or volunteer, feel free to give us a shout.

[P] Computer Science Summarization Dataset

This is a dataset of 5.6 million title / abstract data points, about 75% of which are from computer science papers (I tried my best to filter all non-CS papers (perhaps the non-CS papers add a bit of a “regularization” effect . . . ?) ) .

Title/Abstract pairs have been used to train biomedical summarizers [https://arxiv.org/pdf/1804.08875.pdf] , but I am doing a project on CS/ML papers so I made my own.

The dataset is basically a filtered version of the Semantic Scholar Corpus https://api.semanticscholar.org/corpus/

But it took some effort to produce it and I figure I may save some people time if they wanted the same.


This is a zip file containing 12 parquet files

https://drive.google.com/open?id=1WEdf-_au3vg2EzmWhawmW9xsYaHAE7iV

it’s ~2.5 gb zipped, I think like 6 something gigs unzipped


This is the sqlite database version, 1 file

https://drive.google.com/open?id=1IhIaBD98BEseteAUi1S_f_SfIaUI8V4D

it’s 2.5 gb zipped, 7.5 gb unzipped


If anyone is interested, this a part of an ongoing project to use deep learning models to better search through research papers, started with ML/CS papers. If anyone is interested in being involved, feel free to reach out. We also have a public page if anyone wants to keep updated.

https://github.com/Santosh-Gupta/Arxiv-Manatee-PublicUpdates

https://snag.gy/cwnUGB.jpg

submitted by /u/BatmantoshReturns
[link] [comments]