[P] Scientific summarization datasets with accompanying (beginner-friendly) Colab notebooks for training Pointer-Generator, Transformer, or BERT models. Sources are paper sections (Background, Methods, Results, Conclusions, etc.); summaries are the corresponding sections of the abstract. ~11 million data points
The dataset is based on the methodology described in this paper by Gidiotis and Tsoumakas, https://arxiv.org/abs/1905.07695, which describes using the sections of a structured abstract as the gold-standard summaries of the corresponding sections of the paper.
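To make the pairing concrete, here is a rough sketch of how one training example comes out of a structured abstract. The field names and texts below are made up for illustration and are not the actual dataset schema:

    # Hypothetical illustration of the section-to-abstract-section pairing.
    # Field names and texts are made up; check the real dataset columns.
    paper = {
        "sections": {
            "methods": "We trained a pointer-generator network on ...",
            "results": "On the held-out set the model achieved ...",
        },
        "structured_abstract": {
            "methods": "A pointer-generator network was trained ...",
            "results": "The proposed approach outperforms ...",
        },
    }

    # Each (section text, abstract section) pair becomes one data point:
    # the section body is the source document, the matching abstract
    # section is the target summary.
    examples = [
        {"source": paper["sections"][name], "target": summary}
        for name, summary in paper["structured_abstract"].items()
        if name in paper["sections"]
    ]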
The biggest dataset has ~11 million data points from ~4.3 million papers.
The datasets are in parquet.gz files and can be read directly with pandas (no need to unzip):
    import pandas as pd
    df = pd.read_parquet("file.parquet.gz")
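Once loaded, a quick look at the size and columns tells you what you're working with; the actual column names depend on which of the datasets you downloaded, so check them before wiring anything up:

    import pandas as pd

    # Same call as above; pandas/pyarrow handle the gzip compression.
    df = pd.read_parquet("file.parquet.gz")

    # Quick sanity checks before training.
    print(df.shape)
    print(df.columns.tolist())
    print(df.sample(3))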
Furthermore, processing the data and setting up training can take a few hours of your time, many more if you're unfamiliar with the libraries/repos. So I forked the repos and set up Colab notebooks that do all of the heavy lifting, so you can start training within a few minutes using one of the state-of-the-art architectures for summarization.
For a quick start, here is a link to the main dataset (there are several others; check out the link at the bottom).
Download it into your Drive, then use one of the following notebooks, which process the dataset and start training on it:
Bert Extractive (BertSum)
Transformer, using Tensor2Tensor
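If you go the Colab route, the first step is just mounting your Drive so the notebook can see the dataset you downloaded. A minimal sketch (the path below is only an example, point it at wherever you saved the file):

    # In a Colab cell: mount Google Drive so the notebook can read the dataset.
    from google.colab import drive
    drive.mount('/content/drive')

    import pandas as pd

    # Example path only; adjust to wherever you saved the parquet file.
    df = pd.read_parquet('/content/drive/MyDrive/file.parquet.gz')
    print(len(df), 'examples loaded')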
Here is a link to the full details, including a few other scientific datasets I have created.
If you have any trouble, feel free to leave a comment or open an issue on GitHub. I'm hoping people can build some pretty effective scientific summarizers using this data.