[P] Scientific summarization datasets w/accompanying (Beginner Friendly) Colab notebooks to train them with Pointer-Generators, Transformers or Bert. Sources are paper sections (Background, Methods, Results, Conclusions etc.), summaries are corresponding sections in abstract. ~11 Million data points
The dataset is based on the methodology described in this paper https://arxiv.org/abs/1905.07695 by Gidiotis, Tsoumakas which describe using the sections of a structured abstract as the gold standard summaries of their corresponding sections of the paper.
The biggest dataset has ~11 million data points from ~4.3 million papers.
The datasets are in parquet.gz files and can be easily read in python pandas parquet (no need to unzip)
import pandas as pd df = pd.read_parquet( file.parquet.gz )
Furthermore, processing the data and setting up training can shave off of few hours in your, many more if you’re unfamiliar with the libraries/repos. So I forked the repos and set up Colab notebook that do all of the heavily lifting, so that you can start training within a few minutes using one of the state of the art architectures for summarization.
For a quick start, here is a link to the main dataset (there are several others, check out the link at the bottom.
https://drive.google.com/open?id=1AH3HEDDs08e-xVRLjAev7K902R0eBrcl
Download it into your drive, then use one of the following notebooks that process the dataset and start training on it
Pointer Generator
https://colab.research.google.com/drive/14-hIiDmUE_qmVK0UHVTjyluHoM1yVKnE
Bert Extractive (BertSum)
https://colab.research.google.com/drive/1IEHBsryjAjddS0jv7oJOi25_TxjVfA4F
Transformer, using Tensor2Tensor
https://colab.research.google.com/drive/1JEfZ2cCJc8Dz_LQMS9_rGgtMgecfXJDG
Here is a link to the full details, including a few other scientific datasets I have created.
https://github.com/Santosh-Gupta/ScientificSummarizationDataSets/blob/master/ReadMe.md
If you have any trouble, feel free to type a comment or open an issue on Github. I am hoping people can make some pretty effective scientific summarizers using the data.
submitted by /u/BatmantoshReturns
[link] [comments]