[R] Stanford NLP just released a model for question -> document retrieval -> query generation -> gold document retrieval -> gold answer retrieval.
The most interesting part is that it first looks up documents based on the question, and then, based on the question and the information in that first set of retrieved documents, it generates new queries to find the exact document that has the answer. The concept itself isn’t new; it’s been a goal for the NLP/ML community for a while, but Stanford was able to do it by creating a dataset (not sure if that’s entirely the right word; they call it a ‘query generation supervision signal’) of these generated queries.
They generated the gold candidate queries by finding the overlap between the first set of retrieved content and the text that contains the answer. In their own words (and I think this is the most important part of the paper):
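To make the control flow concrete, here is a rough sketch of the retrieve, generate query, retrieve again loop as I understand it. This is not the released code: `multi_hop_retrieve`, `toy_retrieve`, `toy_generate_query` and the tiny index are all names and data I made up, and the query generator is hard-coded where the real system would use a trained model.

```python
def multi_hop_retrieve(question, retrieve, generate_query, hops=2):
    """Collect documents over several retrieval hops.

    retrieve(query) -> list of documents
    generate_query(question, context) -> next query string
    Both callables are hypothetical placeholders, not a real API.
    """
    # Hop 1: search with the question itself.
    context = list(retrieve(question))
    for _ in range(hops - 1):
        # Later hops: the query depends on the question AND what was found so far.
        query = generate_query(question, context)
        for doc in retrieve(query):
            if doc not in context:
                context.append(doc)
    return context


# Toy stand-ins so the sketch runs end to end; nothing here is learned.
QUESTION = "Who directed the film based on a novel by the author of Armada?"
TOY_INDEX = {
    QUESTION: ["Armada is a novel by Ernest Cline."],
    "Ernest Cline": ["Ernest Cline wrote Ready Player One, filmed by Steven Spielberg."],
}


def toy_retrieve(query):
    # Exact-match lookup; a real retriever would rank documents by relevance.
    return TOY_INDEX.get(query, [])


def toy_generate_query(question, context):
    # Hard-coded for the demo; a trained generator would predict this span.
    return "Ernest Cline"


docs = multi_hop_retrieve(QUESTION, toy_retrieve, toy_generate_query)
for doc in docs:
    print(doc)
```

The point of the sketch is only that the second query ("Ernest Cline") never appears in the question; it can only come from reading the first retrieved document.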
“computing the longest common string/sequence between the current retrieval context and the title/text of the intended paragraph ignoring stop words, then taking the contiguous span of text that corresponds to this overlap in the retrieval context.”
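Here is a minimal sketch of how I read the heuristic quoted above, not the authors' implementation. Assumptions on my part: whitespace tokenization, a tiny hand-picked stop word list, and the longest common contiguous run of content tokens (via `difflib.SequenceMatcher`) standing in for their "longest common string/sequence". The names `content_tokens` and `candidate_query` are mine.

```python
from difflib import SequenceMatcher

# Assumption: a tiny illustrative stop word list, not the one used in the paper.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "was", "and", "to", "for", "by", "on"}
PUNCT = ".,;:!?()\"'"


def content_tokens(text):
    """Return (position, normalized token) pairs for the non-stop-word tokens."""
    pairs = []
    for pos, raw in enumerate(text.split()):
        tok = raw.strip(PUNCT).lower()
        if tok and tok not in STOP_WORDS:
            pairs.append((pos, tok))
    return pairs


def candidate_query(context, target):
    """Longest overlap between context and target, returned as a span of context."""
    ctx = content_tokens(context)
    tgt = content_tokens(target)
    ctx_words = [tok for _, tok in ctx]
    tgt_words = [tok for _, tok in tgt]

    # Longest run of content tokens shared by both texts, stop words ignored.
    matcher = SequenceMatcher(None, ctx_words, tgt_words, autojunk=False)
    match = matcher.find_longest_match(0, len(ctx_words), 0, len(tgt_words))
    if match.size == 0:
        return ""

    # Map back to the original context so that stop words inside the span are kept.
    start = ctx[match.a][0]
    end = ctx[match.a + match.size - 1][0]
    return " ".join(context.split()[start:end + 1]).strip(PUNCT)


print(candidate_query(
    "Armada is a novel by Ernest Cline, published in 2015.",
    "Ernest Cline wrote Ready Player One.",
))
```

Matching on content tokens but slicing the original context is what keeps a title like "Lord of the Rings" intact even though "of" and "the" are ignored during the comparison.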
Final thoughts: I love this paper. I’m really interested in dataset generation using very accurate / robust heuristics and models. I think these datasets can be used to train some very effective language models for information retrieval. I am currently working on a project like this: processing a dataset for research paper retrieval.