[P] A BertSum (BERT extractive summarizer) model trained on research papers. Access to datasets is also included.
https://github.com/Santosh-Gupta/ScientificSummarizationDataSets
A few months ago, I released several datasets built from ~7 million papers, totaling ~12 million datapoints. I think the most exciting part was the datasets built using a methodology similar to that of Alexios Gidiotis and Grigorios Tsoumakas [https://arxiv.org/abs/1905.07695], who observed that many papers have structured abstracts whose sections correspond to entire sections within the papers.
A dataset pairing these abstract sections with their corresponding full-paper sections is probably the best dataset available for research paper summarization, as far as I know.
Using some of the text processing methods from Gidiotis and Tsoumakas, along with Semantic Scholar's Science Parse, I was able to create such a dataset from arXiv and the Semantic Scholar Corpus.
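To make the pairing idea concrete, here is a minimal sketch of how structured-abstract sections could be matched to full-paper sections from Science Parse output. The JSON field names (`abstractText`, `sections` with `heading`/`text`) follow Science Parse's output as I understand it, and the heading-matching heuristic is an assumption for illustration, not the exact processing used in the repo.

```python
import json
import re

# Assumed mapping from structured-abstract labels to likely body-section headings
# (for illustration only; the actual repo applies its own text-processing rules).
HEADING_MAP = {
    "background": ["introduction", "background"],
    "methods": ["methods", "materials and methods", "method"],
    "results": ["results"],
    "conclusions": ["conclusion", "conclusions", "discussion"],
}

def normalize(label):
    """Map singular/plural label variants to a single key."""
    label = label.lower().strip()
    return {"method": "methods", "result": "results", "conclusion": "conclusions"}.get(label, label)

def split_structured_abstract(abstract_text):
    """Split an abstract like 'BACKGROUND: ... METHODS: ...' into labeled parts."""
    pattern = re.compile(r"(background|methods?|results?|conclusions?)\s*:", re.I)
    pieces = pattern.split(abstract_text)
    # pieces alternates [preamble, label, text, label, text, ...]
    return {normalize(pieces[i]): pieces[i + 1].strip()
            for i in range(1, len(pieces) - 1, 2)}

def make_pairs(science_parse_json):
    """Pair each abstract section (summary) with the matching body section(s) (source)."""
    paper = json.loads(science_parse_json)
    abstract_parts = split_structured_abstract(paper.get("abstractText", ""))
    sections = paper.get("sections", [])  # list of {"heading": ..., "text": ...}

    pairs = []
    for label, summary in abstract_parts.items():
        wanted = HEADING_MAP.get(label, [label])
        body = " ".join(s["text"] for s in sections
                        if (s.get("heading") or "").lower() in wanted)
        if body:
            pairs.append({"source": body, "target": summary})
    return pairs
```

Papers whose abstracts are not structured, or whose headings do not match, would simply yield no pairs under a heuristic like this.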
I have now released a model trained with a slightly modified version of the BertSum repo [https://github.com/nlpyang/BertSum, https://arxiv.org/abs/1903.10318]. The model was trained with a batch size of 1024 for 5000 steps, and then a batch size of 4096 for 25000 steps.
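For anyone unfamiliar with BertSum, the core idea in the linked paper is to insert a [CLS] token before every sentence, encode the whole document with BERT, and train a small classifier on each sentence's [CLS] vector to score whether it belongs in the summary. Below is a minimal, hypothetical sketch of that scoring step using Hugging Face's transformers library; it is not the code from the BertSum repo or the released model, just an illustration of the architecture (interval segment embeddings and training are omitted).

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
# In BertSum this classifier is trained jointly with BERT; here it is left untrained.
scorer = torch.nn.Linear(bert.config.hidden_size, 1)

def extract_summary(sentences, top_k=3):
    """Return the top_k sentences ranked by the per-sentence [CLS] score."""
    ids, cls_positions = [], []
    for sent in sentences:
        cls_positions.append(len(ids))  # remember where this sentence's [CLS] lands
        ids.append(tokenizer.cls_token_id)
        ids.extend(tokenizer.encode(sent, add_special_tokens=False))
        ids.append(tokenizer.sep_token_id)
    ids = ids[:512]  # BERT's maximum sequence length
    cls_positions = [p for p in cls_positions if p < 512]

    input_ids = torch.tensor([ids])
    with torch.no_grad():
        hidden = bert(input_ids).last_hidden_state   # (1, seq_len, hidden_size)
    cls_vectors = hidden[0, cls_positions]           # one vector per sentence
    scores = scorer(cls_vectors).squeeze(-1)         # (num_sentences,)
    best = scores.topk(min(top_k, len(cls_positions))).indices.tolist()
    return [sentences[i] for i in sorted(best)]      # keep document order
```

The released model follows this extractive setup: the summary is a selection of sentences from the paper section, not newly generated text.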
The datasets and model are all available here.
https://github.com/Santosh-Gupta/ScientificSummarizationDataSets
I also included text processing and training setups for the Pointer-Generator and Tensor2Tensor Transformer abstractive summarizers. At the time, these were the best options for abstractive summarization, but for my future project I needed the most accurate summaries possible, which called for an extractive approach.