Author: torontoai

Releasing PAWS and PAWS-X: Two New Datasets to Improve Natural Language Understanding Models

Written on October 1, 2019. Posted in Google.

Posted by Yuan Zhang, Research Scientist and Yinfei Yang, Software Engineer, Google Research

Word order and syntactic structure have a large impact on sentence meaning — even small perturbations in word order can completely change interpretation. For example, consider the following related sentences:

1. Flights from New York to Florida.
2. Flights to Florida from New York.
3. Flights from Florida to New York.

All three contain the same set of words. However, sentences 1 and 2 have the same meaning (a paraphrase pair), while sentences 1 and 3 have very different meanings (a non-paraphrase pair). The task of identifying whether a pair of sentences is a paraphrase pair is called paraphrase identification, and it is important to many real-world natural language understanding (NLU) applications such as question answering. Perhaps surprisingly, even state-of-the-art models like BERT fail to correctly identify many non-paraphrase pairs like 1 and 3 above if trained only on existing NLU datasets. This is because existing datasets lack training pairs like these, so it is hard for machine learning models to learn the pattern even if they are capable of understanding complex contextual phrasings.
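A minimal sketch of why such pairs are hard for order-insensitive models, assuming a simple lowercase, whitespace-based tokenization (the helper name `bag_of_words` is illustrative, not part of any PAWS tooling). Once word order is discarded, all three example sentences become the same multiset of words, so the paraphrase and the non-paraphrase look identical:

```python
from collections import Counter


def bag_of_words(sentence):
    """Lowercase, drop the trailing period, and count each word."""
    return Counter(sentence.lower().rstrip(".").split())


s1 = "Flights from New York to Florida."
s2 = "Flights to Florida from New York."
s3 = "Flights from Florida to New York."

# All three sentences reduce to the same word counts.
same_bag = bag_of_words(s1) == bag_of_words(s2) == bag_of_words(s3)
print(same_bag)  # True
```

Any feature built only from these counts cannot separate the pair (1, 2) from the pair (1, 3).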

To address this, we are releasing two new datasets for use in the research community: Paraphrase Adversaries from Word Scrambling (PAWS) in English, and PAWS-X, an extension of the PAWS dataset to six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. Both datasets contain well-formed sentence pairs with high lexical overlap, about half of which are paraphrases and the rest of which are not. Including the new pairs in the training data for state-of-the-art models improves their accuracy on this problem from below 50% to 85-90%. In contrast, models that do not capture non-local contextual information fail even with the new training examples. The new datasets therefore provide an effective instrument for measuring the sensitivity of models to word order and structure.

The PAWS dataset contains 108,463 human-labeled pairs in English, sourced from Quora Question Pairs (QQP) and Wikipedia pages. PAWS-X contains 23,659 human translated PAWS evaluation pairs and 296,406 machine translated training pairs. The table below gives detailed statistics of the datasets.

Dataset  Language  Source  Training  Dev    Test
PAWS     English   QQP     11,988    677    –
PAWS     English   Wiki    79,798    8,000  8,000
PAWS-X   Chinese   Wiki    49,401†   1,984  1,975
PAWS-X   French    Wiki    49,401†   1,992  1,985
PAWS-X   German    Wiki    49,401†   1,932  1,967
PAWS-X   Japanese  Wiki    49,401†   1,980  1,946
PAWS-X   Korean    Wiki    49,401†   1,965  1,972
PAWS-X   Spanish   Wiki    49,401†   1,962  1,999

† The training set of PAWS-X is machine translated from a subset of the PAWS Wiki dataset in English.
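As a quick consistency check, the split sizes in the table can be summed and compared with the totals quoted in the text (108,463 PAWS pairs, 23,659 human-translated PAWS-X pairs, and 296,406 machine-translated training pairs). The variable names below are illustrative, and the missing QQP test set is treated as zero pairs:

```python
# Split sizes copied from the dataset statistics table.
paws = {
    "QQP": {"train": 11_988, "dev": 677, "test": 0},  # no QQP test set is listed
    "Wiki": {"train": 79_798, "dev": 8_000, "test": 8_000},
}
pawsx_dev = {"Chinese": 1_984, "French": 1_992, "German": 1_932,
             "Japanese": 1_980, "Korean": 1_965, "Spanish": 1_962}
pawsx_test = {"Chinese": 1_975, "French": 1_985, "German": 1_967,
              "Japanese": 1_946, "Korean": 1_972, "Spanish": 1_999}
pawsx_train_per_language = 49_401

paws_total = sum(sum(splits.values()) for splits in paws.values())
pawsx_human_total = sum(pawsx_dev.values()) + sum(pawsx_test.values())
pawsx_machine_total = pawsx_train_per_language * len(pawsx_dev)

print(paws_total)           # 108463
print(pawsx_human_total)    # 23659
print(pawsx_machine_total)  # 296406
```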

Creating the PAWS Dataset in English
In “PAWS: Paraphrase Adversaries from Word Scrambling,” we introduce a workflow for generating pairs of sentences that have high word overlap but are balanced with respect to whether they are paraphrases. To generate examples, source sentences are first passed to a specialized language model that creates word-swapped variants that are still semantically meaningful, but for which it is ambiguous whether they are paraphrases of the source. Human raters then judged each variant for grammaticality, and multiple raters judged whether the two sentences in each pair were paraphrases of each other.

PAWS corpus creation workflow.

One problem with this swapping strategy is that it tends to produce pairs that aren’t paraphrases (e.g., “why do bad things happen to good people” != “why do good things happen to bad people”). In order to ensure balance between paraphrases and non-paraphrases, we added other examples based on back-translation. Back-translation has the opposite bias as it tends to preserve meaning while changing word order and word choice. These two strategies lead to PAWS being balanced overall, especially for the Wikipedia portion.
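To make the swapping bias concrete, here is a deliberately naive stand-in for the swapping step. The actual workflow uses a specialized language model to propose swaps; the hypothetical `swap_words` helper below simply exchanges two given words, which is enough to reproduce the example pair and show that word overlap stays complete while the meaning changes:

```python
def swap_words(sentence, first, second):
    """Exchange every occurrence of `first` and `second` in a whitespace-tokenized sentence."""
    swapped = []
    for token in sentence.split():
        if token == first:
            swapped.append(second)
        elif token == second:
            swapped.append(first)
        else:
            swapped.append(token)
    return " ".join(swapped)


source = "why do bad things happen to good people"
candidate = swap_words(source, "bad", "good")
print(candidate)  # why do good things happen to bad people

# The two sentences use exactly the same words, only in a different order.
same_words = sorted(source.split()) == sorted(candidate.split())
```

Whether a swapped candidate is a paraphrase still has to be decided by human raters, as described above.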

Creating the Multilingual PAWS-X Dataset
After creating PAWS, we extended it to six more languages: Chinese, French, German, Korean, Japanese, and Spanish. We hired human translators to translate the development and test sets, and used a neural machine translation (NMT) service to translate the training set.
We obtained human translations from native speakers for a random sample of 4,000 sentence pairs from the PAWS development set in each of the six languages (48,000 translations). Each sentence in a pair is presented independently so that translation is not affected by context. A randomly sampled subset was validated by a second worker. The final dataset has a word-level error rate below 5%.
Note that we allowed the translators to skip a sentence if it was incomplete or ambiguous. On average, less than 2% of the pairs were not translated, and we simply excluded them. The final translated pairs are then split into new development and test sets of about 2,000 pairs each.
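The figures in this section can be cross-checked with a little arithmetic. The sketch below assumes that every sampled pair missing from the final PAWS-X dev and test sets was dropped because it was left untranslated; the per-language counts are the dev plus test sizes from the statistics table:

```python
# Human-translated pairs kept per language (dev + test sizes from the table).
kept = {
    "Chinese": 1_984 + 1_975,
    "French": 1_992 + 1_985,
    "German": 1_932 + 1_967,
    "Japanese": 1_980 + 1_946,
    "Korean": 1_965 + 1_972,
    "Spanish": 1_962 + 1_999,
}
sampled_pairs = 4_000    # pairs sampled per language
sentences_per_pair = 2   # each sentence is translated independently

translations_requested = sampled_pairs * sentences_per_pair * len(kept)
excluded = {language: sampled_pairs - n for language, n in kept.items()}
average_excluded_rate = sum(excluded.values()) / (sampled_pairs * len(kept))

print(translations_requested)           # 48000
print(f"{average_excluded_rate:.2%}")   # 1.42%
```

Under that assumption, the average exclusion rate is consistent with the "less than 2%" figure, although individual languages vary around it.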

Examples of human translated pairs for German(de) and Chinese(zh).

Language Understanding with PAWS and PAWS-X
We train multiple models on the created dataset and measure the classification accuracy on the eval set. When trained with PAWS, strong models such as BERT and DIIN improve remarkably compared with training on the existing Quora Question Pairs (QQP) dataset. For example, on the PAWS data sourced from QQP (PAWS-QQP), BERT reaches only 33.5% accuracy if trained on the existing QQP data, but it recovers to 83.1% accuracy when given PAWS training examples. Unlike BERT, a simple Bag-of-Words (BOW) model fails to learn from PAWS training examples, demonstrating its weakness at capturing non-local contextual information. These results demonstrate that PAWS effectively measures the sensitivity of models to word order and structure.

Accuracy on PAWS-QQP Eval Set (English).

The figure below shows the performance of the popular multilingual BERT model on PAWS-X using several common strategies:

Zero Shot: The model is trained on the PAWS English training data, and then directly evaluated on all other languages. Machine translation is not involved in this strategy.
Translate Test: Train a model using the English training data, and machine-translate all test examples to English for evaluation.
Translate Train: The English training data is machine-translated into each target language to provide data to train each model.
Merged: Train a multilingual model on all languages, including the original English pairs and machine-translated data in all other languages.
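The four strategies differ only in which data is used for training and in what form the test examples are presented. A small sketch of that bookkeeping follows; the function name `data_plan` and the strategy keys are hypothetical labels chosen for illustration, not identifiers from the PAWS-X release:

```python
TARGETS = ("Chinese", "French", "German", "Japanese", "Korean", "Spanish")


def data_plan(strategy, target):
    """Describe the training and test data one strategy uses for one target language."""
    english = ("English", "original")
    if strategy == "zero_shot":
        # Train on English only; evaluate directly on the target language.
        return {"train": [english], "test": (target, "human translated")}
    if strategy == "translate_test":
        # Train on English only; machine-translate the test examples into English.
        return {"train": [english],
                "test": ("English", f"machine translated from {target}")}
    if strategy == "translate_train":
        # Train on English data machine-translated into the target language.
        return {"train": [(target, "machine translated")],
                "test": (target, "human translated")}
    if strategy == "merged":
        # Train on English plus machine-translated data in every other language.
        train = [english] + [(lang, "machine translated") for lang in TARGETS]
        return {"train": train, "test": (target, "human translated")}
    raise ValueError(f"unknown strategy: {strategy}")


print(data_plan("zero_shot", "German"))
print(len(data_plan("merged", "German")["train"]))  # 7
```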

The results show that cross-lingual techniques help, while also leaving considerable headroom to drive multilingual research on the problem of paraphrase identification.

Accuracy of PAWS-X Test Set using BERT Models.

It is our hope that these datasets will be useful to the research community to drive further progress on multilingual models that better exploit structure, context, and pairwise comparisons.

Acknowledgements
The core team includes Luheng He, Jason Baldridge, and Chris Tar. We would like to thank the Language team in Google Research, especially Emily Pitler, for the insightful comments that contributed to our papers. Many thanks also to Ashwin Kakarla, Henry Jicha, and Mengmeng Niu for their help with the annotations.

[R] Interpretations are useful: penalizing explanations to align neural networks with prior knowledge

Written on October 1, 2019. Posted in Reddit MachineLearning.

TL;DR: Penalizing wrong explanations increases predictive accuracy for neural networks!

Paper

Code

Abstract: For an explanation of a deep learning model to be effective, it must provide both insight into a model and suggest a corresponding action in order to achieve some objective. Too often, the litany of proposed explainable deep learning methods stop at the first step, providing practitioners with insight into a model, but no way to act on it. In this paper, we propose contextual decomposition explanation penalization (CDEP), a method which enables practitioners to leverage existing explanation methods in order to increase the predictive accuracy of deep learning models. In particular, when shown that a model has incorrectly assigned importance to some features, CDEP enables practitioners to correct these errors by directly regularizing the provided explanations. Using explanations provided by contextual decomposition (CD) (Murdoch et al., 2018), we demonstrate the ability of our method to increase performance on an array of toy and real datasets.

submitted by /u/laura-rieger

Data Science – Senior Consultant, Omnia AI, Toronto – Deloitte – Toronto, ON

Written on October 1, 2019. Posted in Toronto Job Postings.

Strong experience with statistical analytical techniques, data mining, machine learning, and predictive models using Python, R or similar tools.
From Deloitte – Wed, 02 Oct 2019 16:39:30 GMT

[D] Instrumenting a differential list of apartment complex features based on real choices (between complex A and B, B was chosen) in order to perform feature selection and figure out most important apartment complex features related to choice

Written on October 1, 2019. Posted in Reddit MachineLearning.

Good afternoon ML community,

I am approaching this problem from a supervised machine learning perspective since that is where most of my experience is, so I need a sanity check on whether this approach is correct or whether I should be using a different approach altogether.

Let’s say I have data on approximately 600 apartment complexes, each with about 50-100 features (‘amenities’). These include ‘pool’ or ‘no pool’, ‘pets allowed’, ‘no pets allowed’, ‘small pets allowed’, ‘more expensive’, ‘less expensive’, etc.

I also have, for about 15 of these complexes, choice data on rental losses. So for these 15, every time somebody chose another complex, they were surveyed and revealed which alternative they chose. There are about 100 ‘lost choices’ for each of the 15 complexes. My goal is to construct the data in such a way that I can do feature selection on the amenities to figure out which ones play most prominently into choosing another complex, to help understand how to improve the initial 15 complexes.

The approach I was thinking about implementing was constructing a dataset based on differentials and similarities. So for each ‘choice’, there would be two datapoints: one where we compare the amenities in complex A vs complex B, and then a counterpoint for the opposite. So it would look like this:

For each datapoint, in the case when complex B is chosen, which we’ll label with an output of “1” for “chosen”, the input data vector would be a list of codes from 0 to 3, one for every amenity in the matrix:

B has this amenity but A doesn't: 0

A has this amenity but B doesn't: 1

Both facilities have this amenity: 2

Neither facility has this amenity: 3

Then we would create the complementary data point, where the A and B vector differentials are switched (A has this amenity but B doesn’t: 1, etc) and the output label would be 0 for “not chosen”.
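A minimal sketch of the encoding described above, using plain Python sets. The amenity names and the helpers `encode` and `make_datapoints` are hypothetical, and this is one reading of the scheme: the first complex passed to `encode` plays the role of B in the code list, and swapping the arguments produces the complementary vector.

```python
def encode(first, second, amenities):
    """Code each amenity: 0 = only `first` has it, 1 = only `second`, 2 = both, 3 = neither."""
    codes = []
    for amenity in amenities:
        in_first, in_second = amenity in first, amenity in second
        if in_first and not in_second:
            codes.append(0)
        elif in_second and not in_first:
            codes.append(1)
        elif in_first and in_second:
            codes.append(2)
        else:
            codes.append(3)
    return codes


def make_datapoints(chosen, rejected, amenities):
    """Return the (vector, label) pair for the choice and its complementary data point."""
    return [
        (encode(chosen, rejected, amenities), 1),  # label 1: "chosen"
        (encode(rejected, chosen, amenities), 0),  # label 0: "not chosen"
    ]


amenities = ["pool", "pets allowed", "gym", "parking"]  # illustrative amenity list
complex_a = {"pool", "parking"}
complex_b = {"pets allowed", "parking"}  # B was chosen over A

datapoints = make_datapoints(complex_b, complex_a, amenities)
print(datapoints)  # [([1, 0, 3, 2], 1), ([0, 1, 3, 2], 0)]
```

Note that the codes 0 to 3 are categories rather than an ordered scale, so it may be worth one-hot encoding them before handing them to a model that treats inputs as numeric.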

Logically this makes sense to me, but I can’t help but think I am overcomplicating it, and I can’t think of any other way to instrument the data. Once it’s instrumented in this way, I could either build a classifier (xgboost) and look at feature importance of all the choices of ‘1’, or do feature selection analysis on the data to come up with which features to focus on. Does this seem like a good approach, or are there some glaringly obvious drawbacks and/or better tools for this?

submitted by /u/SpicyBroseph

Microsoft collaborates with SilverCloud Health to develop AI for improved mental health

Written on October 1, 2019. Posted in Microsoft.

The post Microsoft collaborates with SilverCloud Health to develop AI for improved mental health appeared first on The AI Blog.

[R] On the Equivalence between Node Embeddings and Structural Graph Representations

Written on October 1, 2019. Posted in Reddit MachineLearning.

This work provides the first unifying theoretical framework for node embeddings and structural graph representations, bridging methods like matrix factorization and graph neural networks. Using invariant theory, we show that the relationship between structural representations and node embeddings is analogous to that of a distribution and its samples. We prove that all tasks that can be performed by node embeddings can also be performed by structural representations and vice-versa. We also show that the concept of transductive and inductive learning is unrelated to node embeddings and graph representations, clearing another source of confusion in the literature. Finally, we introduce new practical guidelines to generating and using node embeddings, which fixes significant shortcomings of standard operating procedures used today.

https://arxiv.org/abs/1910.00452

submitted by /u/bsriniv

[R] Research Guide for Transformers

Written on October 1, 2019. Posted in Reddit MachineLearning.

Until recently, recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have been used to tackle this challenge. The problem with these is that they aren’t able to keep up with context and content when sentences are too long. This limitation has been solved by paying attention to the word that is currently being operated on. This guide will focus on how this problem can be addressed by Transformers with the help of deep learning.

https://heartbeat.fritz.ai/research-guide-for-transformers-3ff751493222

submitted by /u/mwitiderrick

[D] Is anyone else’s advisor on leave at an industry research group?

Written on October 1, 2019. Posted in Reddit MachineLearning.

This seems to be the trend in ML/Vision/NLP

submitted by /u/WonderfulPattern

[D] Implementation of CorEx in R?

Written on October 1, 2019. Posted in Reddit MachineLearning.

I recently got turned on to the idea of topic modeling by way of Correlation Explanation (CorEx) via this post:

https://medium.com/pew-research-center-decoded/overcoming-the-limitations-of-topic-models-with-a-semi-supervised-approach-b947374e0455

I found an implementation in Python here:

https://github.com/gregversteeg/corex_topic

For those familiar with LDA (Latent Dirichlet Allocation), oftentimes the resulting topics don’t make very much sense. Often the beta probabilities (word-to-topic) are so similar that any classification is arbitrary at best, and more often, simply meaningless.

CorEx provides the ability to “anchor” topics to specific terms, providing a semi-supervised approach to topic modeling. Sounds exciting. Has anyone worked with this algorithm before? Any good results?

Also: has anyone found (or made) an implementation of this CorEx algorithm in R yet?

submitted by /u/cleverchimp

[R] Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

Written on October 1, 2019. Posted in Reddit MachineLearning.

submitted by /u/SkiddyX
