[P] Computer Science Summarization Dataset
This is a dataset of 5.6 million title/abstract pairs, about 75% of which come from computer science papers. I tried my best to filter out all the non-CS papers (perhaps the ones that remain add a bit of a “regularization” effect?).
Title/abstract pairs have been used to train biomedical summarizers [https://arxiv.org/pdf/1804.08875.pdf], but I am doing a project on CS/ML papers, so I made my own.
The dataset is basically a filtered version of the Semantic Scholar Corpus: https://api.semanticscholar.org/corpus/
It took some effort to produce, though, so I figure it may save some time for anyone who wants the same thing.
This is a zip file containing 12 parquet files
https://drive.google.com/open?id=1WEdf-_au3vg2EzmWhawmW9xsYaHAE7iV
It’s ~2.5 GB zipped and, I think, a bit over 6 GB unzipped.
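For anyone who wants to pull the parquet files into a single DataFrame, here is a rough sketch. It assumes the zip has been extracted into one directory, that the files end in `.parquet`, and that pandas plus a parquet engine (pyarrow or fastparquet) are installed. The helper names `list_parquet_files` and `load_all` are just made up for illustration, and I am not assuming any particular column names.

```python
from pathlib import Path


def list_parquet_files(directory):
    """Return the parquet files in `directory`, sorted by name.

    Assumes the extracted files use a .parquet extension.
    """
    return sorted(Path(directory).glob("*.parquet"))


def load_all(directory, columns=None):
    """Read every parquet file in `directory` into one DataFrame.

    `columns` optionally restricts which columns are read, which helps
    with memory since the full dataset is several GB.
    """
    # Imported here so the file-listing helper works without pandas.
    import pandas as pd

    files = list_parquet_files(directory)
    if not files:
        raise FileNotFoundError(f"no .parquet files found in {directory}")
    frames = [pd.read_parquet(path, columns=columns) for path in files]
    return pd.concat(frames, ignore_index=True)
```

Loading all the files at once may not fit in memory on a smaller machine; in that case, looping over `list_parquet_files(...)` and processing one file at a time is probably the safer route.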
This is the SQLite database version, a single file:
https://drive.google.com/open?id=1IhIaBD98BEseteAUi1S_f_SfIaUI8V4D
It’s 2.5 GB zipped, 7.5 GB unzipped.
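The table and column names inside the SQLite file are not listed here, so a reasonable first step is to inspect the schema before querying. The sketch below uses only Python's standard `sqlite3` module; the function names are hypothetical helpers, and nothing in it assumes a specific table layout.

```python
import sqlite3


def _quote(identifier):
    # Identifiers cannot be passed as bound parameters, so quote by hand.
    return '"' + identifier.replace('"', '""') + '"'


def list_tables(conn):
    """Return the names of all tables in the database, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


def column_names(conn, table):
    """Return the column names of `table`, in declaration order."""
    # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
    return [row[1] for row in conn.execute(f"PRAGMA table_info({_quote(table)})")]


def sample_rows(conn, table, limit=5):
    """Fetch the first `limit` rows of `table` as a list of tuples."""
    query = f"SELECT * FROM {_quote(table)} LIMIT ?"
    return conn.execute(query, (limit,)).fetchall()
```

To use it, open the unzipped file with `sqlite3.connect(...)`, call `list_tables(conn)` to see what is there, then `column_names` and `sample_rows` on whichever table holds the titles and abstracts.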
This is part of an ongoing project to use deep learning models to better search through research papers, starting with ML/CS papers. If anyone is interested in getting involved, feel free to reach out. We also have a public page for anyone who wants to keep up with updates.
https://github.com/Santosh-Gupta/Arxiv-Manatee-PublicUpdates
submitted by /u/BatmantoshReturns