


[P] Scientific summarization datasets w/accompanying (Beginner Friendly) Colab notebooks to train them with Pointer-Generators, Transformers or Bert. Sources are paper sections (Background, Methods, Results, Conclusions etc.), summaries are corresponding sections in abstract. ~11 Million data points

The dataset is based on the methodology described in this paper by Gidiotis and Tsoumakas, which uses the sections of a structured abstract as the gold-standard summaries of the corresponding sections of the paper.
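The pairing idea can be sketched in a few lines of Python. The section names and text below are purely illustrative, not the dataset's actual schema:

```python
# Hypothetical sketch: each section of a structured abstract serves as
# the gold summary for the corresponding section of the paper body.
paper = {
    "Methods": "We trained a pointer-generator network on a large corpus of papers ...",
    "Results": "The model produced fluent section summaries on held-out data ...",
}
structured_abstract = {
    "Methods": "A pointer-generator network was trained on a large corpus.",
    "Results": "Fluent section summaries were produced.",
}

# Build (source, target) training pairs, one per shared section.
pairs = [
    (paper[name], structured_abstract[name])
    for name in paper
    if name in structured_abstract
]
```

Each pair is then one data point: the full section text as input, its abstract counterpart as the target summary.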

The biggest dataset has ~11 million data points from ~4.3 million papers.

The datasets are in parquet.gz files and can be read directly with pandas (no need to unzip):

import pandas as pd

df = pd.read_parquet("file.parquet.gz")

Furthermore, processing the data and setting up training can eat up a few hours of your time, many more if you're unfamiliar with the libraries/repos. So I forked the repos and set up Colab notebooks that do all of the heavy lifting, so that you can start training within a few minutes using one of the state-of-the-art architectures for summarization.

For a quick start, here is a link to the main dataset (there are several others; check out the link at the bottom).

Download it into your Drive, then use one of the following notebooks, which process the dataset and start training on it:

Pointer Generator

Bert Extractive (BertSum)

Transformer, using Tensor2Tensor

Here is a link to the full details, including a few other scientific datasets I have created.

If you have any trouble, feel free to leave a comment or open an issue on GitHub. I am hoping people can build some pretty effective scientific summarizers using this data.

submitted by /u/BatmantoshReturns