[P] Computer Science Summarization Dataset

This is a dataset of 5.6 million title/abstract pairs, about 75% of which are from computer science papers. I tried my best to filter out all non-CS papers, though perhaps the ones that remain add a bit of a “regularization” effect.

Title/abstract pairs have been used to train biomedical summarizers, but I am doing a project on CS/ML papers, so I made my own.

The dataset is basically a filtered version of the Semantic Scholar Corpus, but it took some effort to produce, so I figure I may save some people time if they want the same thing.

This is a zip file containing 12 parquet files.

It’s ~2.5 GB zipped and, I think, 6-something GB unzipped.

This is the SQLite database version, a single file.

It’s 2.5 GB zipped and 7.5 GB unzipped.
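
For the SQLite version, here is a rough sketch using only Python's standard library. Since the table and column names aren't spelled out in this post, it looks them up from the file instead of assuming them. `list_tables` and `peek` are hypothetical helper names, not part of the dataset.

```python
import sqlite3
from contextlib import closing
from typing import List, Tuple


def list_tables(db_path: str) -> List[str]:
    """Return the names of all tables in the SQLite file."""
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


def peek(db_path: str, table: str, limit: int = 5) -> Tuple[List[str], list]:
    """Return the column names and the first `limit` rows of `table`."""
    with closing(sqlite3.connect(db_path)) as conn:
        # Table names cannot be bound as parameters, so quote the identifier.
        quoted = '"' + table.replace('"', '""') + '"'
        cursor = conn.execute(f"SELECT * FROM {quoted} LIMIT ?", (limit,))
        columns = [col[0] for col in cursor.description]
        return columns, cursor.fetchall()
```

Calling `list_tables` on the unzipped file and then `peek` on one of the returned names should show what the schema actually looks like before you write any real queries.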

If anyone is interested, this is part of an ongoing project to use deep learning models to better search through research papers, starting with ML/CS papers. If you would like to be involved, feel free to reach out. We also have a public page if anyone wants to stay updated.

