
[P] Scientific summarization datasets w/accompanying (Beginner Friendly) Colab notebooks to train them with Pointer-Generators, Transformers or Bert. Sources are paper sections (Background, Methods, Results, Conclusions etc.), summaries are corresponding sections in abstract. ~11 Million data points

https://snag.gy/XkzUBd.jpg

The dataset is based on the methodology described in this paper, https://arxiv.org/abs/1905.07695, by Gidiotis and Tsoumakas, which describes using the sections of a structured abstract as the gold-standard summaries of their corresponding sections of the paper.
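To make that pairing concrete, here is a minimal sketch of the idea. This is not the authors' actual pipeline, and the field names ("body_sections", "abstract_sections") and dict layout are made up purely for illustration, not the schema of the released files.

# Minimal sketch of the section/abstract pairing idea, not the actual pipeline.
# "body_sections" and "abstract_sections" are hypothetical field names.
SECTION_LABELS = ["Background", "Methods", "Results", "Conclusions"]

def make_pairs(paper):
    # Pair each paper section with the matching section of its structured abstract.
    pairs = []
    for label in SECTION_LABELS:
        source = paper["body_sections"].get(label)
        summary = paper["abstract_sections"].get(label)
        if source and summary:
            pairs.append({"section": label, "source": source, "summary": summary})
    return pairs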

https://snag.gy/YmGADV.jpg

The biggest dataset has ~11 million data points from ~4.3 million papers.

The datasets are stored as parquet.gz files and can be read directly with pandas (no need to unzip):

import pandas as pd
df = pd.read_parquet("file.parquet.gz")
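Once loaded, the usual pandas calls give a quick look at the contents; the column names differ between the dataset files, so check them rather than assuming:

print(df.shape)     # number of (section, summary) rows and columns in this file
print(df.columns)   # actual column names vary between the dataset files
print(df.head())    # a few example rows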

Furthermore, processing the data and setting up training can take a few hours of your time, and many more if you're unfamiliar with the libraries/repos. So I forked the repos and set up Colab notebooks that do all of the heavy lifting, so that you can start training within a few minutes using one of the state-of-the-art architectures for summarization.

For a quick start, here is a link to the main dataset (there are several others; check out the link at the bottom).

https://drive.google.com/open?id=1AH3HEDDs08e-xVRLjAev7K902R0eBrcl

Download it into your Drive, then use one of the following notebooks, which process the dataset and start training on it (a quick sketch for loading the file directly in Colab follows the list below):

Pointer Generator

https://colab.research.google.com/drive/14-hIiDmUE_qmVK0UHVTjyluHoM1yVKnE

Bert Extractive (BertSum)

https://colab.research.google.com/drive/1IEHBsryjAjddS0jv7oJOi25_TxjVfA4F

Transformer, using Tensor2Tensor

https://colab.research.google.com/drive/1JEfZ2cCJc8Dz_LQMS9_rGgtMgecfXJDG
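If you would rather poke at the data directly in Colab before running any of these, mounting your Drive and loading the file looks roughly like this; the path below is a placeholder, so point it at wherever you saved the download:

from google.colab import drive
import pandas as pd

drive.mount('/content/drive')  # authorize access to your Drive inside the Colab runtime

# Placeholder path -- change it to match where you put the downloaded file
df = pd.read_parquet('/content/drive/My Drive/file.parquet.gz')
print(df.shape)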

Here is a link to the full details, including a few other scientific datasets I have created.

https://github.com/Santosh-Gupta/ScientificSummarizationDataSets/blob/master/ReadMe.md

If you have any trouble, feel free to leave a comment or open an issue on GitHub. I am hoping people can build some pretty effective scientific summarizers using this data.

submitted by /u/BatmantoshReturns