[P] Computer Science Summarization Dataset

This is a dataset of 5.6 million title/abstract pairs, about 75% of which come from computer science papers. I tried my best to filter out all non-CS papers, though perhaps the ones that remain add a bit of a “regularization” effect.

Title/abstract pairs have been used to train biomedical summarizers [https://arxiv.org/pdf/1804.08875.pdf], but my project focuses on CS/ML papers, so I made my own.

The dataset is basically a filtered version of the Semantic Scholar Corpus: https://api.semanticscholar.org/corpus/

It took some effort to produce, so I figure sharing it may save time for anyone who wants the same thing.


This is a zip file containing 12 parquet files:

https://drive.google.com/open?id=1WEdf-_au3vg2EzmWhawmW9xsYaHAE7iV

It’s ~2.5 GB zipped and, I think, a bit over 6 GB unzipped.
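
Since the archive is large, it may help to check what is inside before unpacking it. Here is a minimal sketch using only the Python standard library. The function names `list_parquet_members` and `extract_parquet_members` are made up for illustration, and the local path you pass in is whatever you saved the download as; nothing here is assumed about the file names inside the zip.

```python
import zipfile
from pathlib import Path


def list_parquet_members(zip_path):
    """Return (name, uncompressed_bytes) for each .parquet entry in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return [
            (info.filename, info.file_size)
            for info in archive.infolist()
            if info.filename.lower().endswith(".parquet")
        ]


def extract_parquet_members(zip_path, out_dir):
    """Extract only the .parquet entries and return their paths on disk."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    names = [name for name, _ in list_parquet_members(zip_path)]
    with zipfile.ZipFile(zip_path) as archive:
        # ZipFile.extract returns the path of the file it created.
        return [Path(archive.extract(name, path=out_dir)) for name in names]
```

Summing the sizes from `list_parquet_members` tells you how much disk space the extraction needs. Each extracted file can then be read with `pandas.read_parquet`, provided pandas and a parquet engine such as pyarrow are installed.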


This is the SQLite database version, a single file:

https://drive.google.com/open?id=1IhIaBD98BEseteAUi1S_f_SfIaUI8V4D

It’s 2.5 GB zipped and 7.5 GB unzipped.
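
The table and column names are not listed here, so a reasonable first step after unzipping is to ask the database itself. This is a rough sketch with Python's built-in `sqlite3` module; `describe_sqlite` and `sample_rows` are hypothetical helper names, and the path argument is wherever you put the unzipped file.

```python
import sqlite3


def _quote(identifier):
    """Quote a table name so it is safe to embed in SQL text."""
    return '"' + identifier.replace('"', '""') + '"'


def describe_sqlite(db_path):
    """Map each table name in the database to its list of column names."""
    conn = sqlite3.connect(db_path)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        schema = {}
        for table in tables:
            columns = conn.execute(f"PRAGMA table_info({_quote(table)})").fetchall()
            # Each table_info row is (cid, name, type, notnull, dflt_value, pk).
            schema[table] = [column[1] for column in columns]
        return schema
    finally:
        conn.close()


def sample_rows(db_path, table, limit=5):
    """Fetch a few rows from one table without loading the whole file."""
    conn = sqlite3.connect(db_path)
    try:
        query = f"SELECT * FROM {_quote(table)} LIMIT ?"
        return conn.execute(query, (limit,)).fetchall()
    finally:
        conn.close()
```

Calling `describe_sqlite` first and then `sample_rows` on whichever table it reports should show which columns hold the titles and abstracts, without reading all 7.5 GB into memory.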


If anyone is interested, this is part of an ongoing project to use deep learning models to better search through research papers, starting with ML/CS papers. If you would like to be involved, feel free to reach out. We also have a public page for anyone who wants to keep up to date:

https://github.com/Santosh-Gupta/Arxiv-Manatee-PublicUpdates

https://snag.gy/cwnUGB.jpg

submitted by /u/BatmantoshReturns


Toronto AI is a social and collaborative hub to unite AI innovators of Toronto and surrounding areas. We explore AI technologies in digital art and music, healthcare, marketing, fintech, vr, robotics and more. Toronto AI was founded by Dave MacDonald and Patrick O'Mara.