Announcing Two New Natural Language Dialog Datasets
Today’s digital assistants are expected to complete tasks and return personalized results across many subjects, such as movie listings, restaurant reservations and travel plans. However, despite tremendous progress in recent years, they have not yet reached human-level understanding. This is due, in part, to the lack of quality training data that accurately reflects the way people express their needs and preferences to a digital assistant. Such data is scarce because the limitations of these systems bias what we say: we want to be understood, and so we tailor our words to what we expect a digital assistant to understand. In other words, the conversations we might observe with today’s digital assistants don’t reach the level of dialog complexity we need to model human-level understanding.
To address this, we’re releasing the Coached Conversational Preference Elicitation (CCPE) and Taskmaster-1 dialog datasets. Both collections make use of a Wizard-of-Oz platform that pairs two people who engage in spoken conversations, much like those one might want to have with a truly effective digital assistant. For both datasets, an in-house Wizard-of-Oz interface was designed to mimic today’s speech-based digital assistants, preserving the characteristics of spoken dialog in the context of an automated system. Since the human “assistants” understand exactly what the user asks, as any person would, we are able to capture how users would actually express themselves to a “perfect” digital assistant, so that we can continue to improve such systems. Full details of the CCPE dataset are described in our research paper to be published at the 2019 Annual Conference of the Special Interest Group on Discourse and Dialogue, and the Taskmaster-1 dataset is described in detail in a research paper to appear at the 2019 Conference on Empirical Methods in Natural Language Processing.
In the movie-oriented CCPE dataset, an individual posing as a user speaks into a microphone and the audio is played directly to the person posing as a digital assistant. The “assistant” types out a response, which is in turn played to the user via text-to-speech. These two-person dialogs naturally include the disfluencies and errors that arise spontaneously between the two parties, which are difficult to replicate using synthesized dialog. The result is a collection of natural, yet structured, conversations about people’s movie preferences.
Among the insights from this dataset, we find that the ways in which people describe their preferences are amazingly rich. This dataset is the first to characterize that richness at scale. We also find that preferences do not always match the way digital assistants, or for that matter recommendation sites, characterize options. To put it another way, the filters on your favorite movie website or service probably don’t match the language you would use to describe the sorts of movies you like when seeking a recommendation from a person.
The Taskmaster-1 dataset uses both the methodology described above and a one-person, written technique to increase the corpus size and speaker diversity: about 7.7k written “self-dialog” entries and about 5.5k two-person, spoken dialogs. For written dialogs, we engaged people to create the full conversation themselves based on scenarios outlined for each task, thereby playing the roles of both the user and the assistant. So, while the spoken dialogs more closely reflect conversational language, the written dialogs are still appropriately rich and complex, and are cheaper and easier to collect. Each dialog is based on one of six tasks: ordering pizza, creating auto repair appointments, setting up rides for hire, ordering movie tickets, ordering coffee drinks and making restaurant reservations.
This dataset also uses a simple annotation schema that provides sufficient grounding for the data, while making it easy for workers to apply labels to the dialog consistently. As compared to traditional, detailed strategies that make robust agreement among workers difficult, we focus solely on API arguments for each type of conversation, meaning just the variables required to execute the transaction. For example, in a dialog about scheduling a rideshare, we label the “to” and “from” locations along with the car type (economy, luxury, pool, etc.). For movie tickets, we label the movie name, theater, time, number of tickets, and sometimes the screening type (e.g., 3D or standard). A complete list of labels is included with the corpus release.
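To make the idea of labeling only API arguments concrete, here is a minimal Python sketch of how one rideshare utterance might carry such labels. The class names, the label names (such as "ride.location.to") and the example sentence are hypothetical illustrations, not the format or label set shipped with the corpus; the list of labels included with the release is the authority.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Span:
    """One labeled API argument: a character range in the utterance text."""
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # hypothetical label name, not taken from the corpus


@dataclass
class Utterance:
    speaker: str  # "USER" or "ASSISTANT"
    text: str
    spans: List[Span] = field(default_factory=list)


def label_argument(utt: Utterance, value: str, label: str) -> None:
    """Attach a label to the first occurrence of `value` in the utterance."""
    start = utt.text.find(value)
    if start == -1:
        raise ValueError(f"{value!r} not found in utterance")
    utt.spans.append(Span(start, start + len(value), label))


def extract_arguments(utt: Utterance) -> Dict[str, str]:
    """Collect the labeled text into the arguments a transaction would need."""
    return {s.label: utt.text[s.start:s.end] for s in utt.spans}


# Invented example sentence; only the transaction variables get labels.
utt = Utterance("USER", "I need a luxury car from the airport to the Grand Hotel.")
label_argument(utt, "luxury", "ride.type")
label_argument(utt, "the airport", "ride.location.from")
label_argument(utt, "the Grand Hotel", "ride.location.to")

ride_args = extract_arguments(utt)
print(ride_args)
```

Everything outside the labeled spans (greetings, hedges, filler words) is left unannotated in this sketch, which is the point of the schema: a worker only has to decide which words fill the variables needed to execute the transaction.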
It is our hope that these datasets will be useful to the research community for experimentation and analysis in both dialog systems and conversational recommendation.
We would like to thank our co-authors and collaborators whose hard work and insights made the release of these datasets possible: Karthik Krishnamoorthi, Krisztian Balog, Chinnadhurai Sankar, Arvind Neelakantan, Amit Dubey, Kyu-Young Kim, Andy Cedilnik, Scott Roy, Muqthar Mohammed, Mohd Majeed, Ashwin Kakarla and Hadar Shemtov.