
ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations

Ever since the advent of BERT a year ago, natural language research has embraced a new paradigm, leveraging large amounts of existing text to pretrain a model’s parameters using self-supervision, with no data annotation required. So, rather than needing to train a machine-learning model for natural language processing (NLP) from scratch, one can start from a model primed with knowledge of a language. But, in order to improve upon this new approach to NLP, one must develop an understanding of what, exactly, is contributing to language-understanding performance — the network’s height (i.e., number of layers), its width (size of the hidden layer representations), the learning criteria for self-supervision, or something else entirely?

In “ALBERT: A Lite BERT for Self-supervised Learning of Language Representations”, accepted at ICLR 2020, we present an upgrade to BERT that advances the state-of-the-art performance on 12 NLP tasks, including the competitive Stanford Question Answering Dataset (SQuAD v2.0) and the SAT-style reading comprehension RACE benchmark. ALBERT is being released as an open-source implementation on top of TensorFlow, and includes a number of ready-to-use ALBERT pre-trained language representation models.

What Contributes to NLP Performance?
Identifying the dominant driver of NLP performance is complex — some settings are more important than others, and, as our study reveals, a simple, one-at-a-time exploration of these settings would not yield the correct answers.

The key to optimizing performance, captured in the design of ALBERT, is to allocate the model’s capacity more efficiently. Input-level embeddings (words, sub-tokens, etc.) need to learn context-independent representations, a representation for the word “bank”, for example. In contrast, hidden-layer embeddings need to refine that into context-dependent representations, e.g., a representation for “bank” in the context of financial transactions, and a different representation for “bank” in the context of river-flow management.

This is achieved by factorizing the embedding parametrization — the embedding matrix is split into two smaller matrices, so that input-level embeddings use a relatively low dimension (e.g., 128) while the hidden-layer embeddings use higher dimensionalities (768 as in the BERT case, or more). With this step alone, ALBERT achieves an 80% reduction in the parameters of the projection block, at the expense of only a minor drop in performance — 80.3 SQuAD2.0 score, down from 80.4, or 67.9 on RACE, down from 68.2 — with all other conditions the same as for BERT.
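To make the arithmetic concrete, here is a back-of-the-envelope comparison of the two embedding parameterizations, a sketch only: the vocabulary size and dimensions below are assumed for illustration and the exact savings depend on the model configuration.

```python
# Rough parameter count for the embedding block, using assumed sizes.
V = 30000   # vocabulary size (illustrative)
H = 768     # hidden size, as in BERT-base
E = 128     # factorized input-embedding size, as in ALBERT

bert_style = V * H              # one V x H embedding matrix
albert_style = V * E + E * H    # V x E embeddings plus an E x H projection

print(f"BERT-style:   {bert_style:,} parameters")
print(f"ALBERT-style: {albert_style:,} parameters")
print(f"Reduction:    {1 - albert_style / bert_style:.0%}")
```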

Another critical design decision for ALBERT stems from a different observation: redundancy across layers. Transformer-based neural network architectures (such as BERT, XLNet, and RoBERTa) rely on independent layers stacked on top of each other. However, we observed that the network often learned to perform similar operations at various layers, using different parameters of the network. ALBERT eliminates this possible redundancy by sharing parameters across the layers, i.e., the same layer is applied repeatedly, stacked on top of itself. This approach slightly diminishes accuracy, but the more compact size is well worth the tradeoff. Parameter sharing achieves a 90% parameter reduction for the attention-feedforward block (a 70% reduction overall), which, when applied in addition to the factorization of the embedding parameterization, incurs a slight performance drop of -0.3 on SQuAD2.0 to 80.0, and a larger drop of -3.9 on RACE score to 64.0.
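A minimal sketch of cross-layer parameter sharing is shown below. It uses a stand-in feed-forward block rather than ALBERT's actual layer (which also contains self-attention); the point is simply that one set of weights is created once and reused at every depth.

```python
import tensorflow as tf

hidden_size, num_layers = 768, 12

# One block is created once, so its weights are shared by every "layer".
shared_block = tf.keras.Sequential([
    tf.keras.layers.Dense(4 * hidden_size, activation="relu"),
    tf.keras.layers.Dense(hidden_size),
    tf.keras.layers.LayerNormalization(),
])

x = tf.random.normal([2, 16, hidden_size])   # (batch, sequence, hidden)
for _ in range(num_layers):
    x = shared_block(x)                      # same parameters at every depth

# Total trainable parameters are those of a single block, independent of depth.
print(sum(int(tf.size(w)) for w in shared_block.trainable_weights))
```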

Implementing these two design changes together yields an ALBERT-base model that has only 12M parameters, an 89% parameter reduction compared to the BERT-base model, yet still achieves respectable performance across the benchmarks considered. But this parameter-size reduction provides the opportunity to scale up the model again. Assuming that memory size allows, one can scale up the size of the hidden-layer embeddings by 10-20x. With a hidden-size of 4096, the ALBERT-xxlarge configuration achieves both an overall 30% parameter reduction compared to the BERT-large model, and, more importantly, significant performance gains: +4.2 on SQuAD2.0 (88.1, up from 83.9), and +8.5 on RACE (82.3, up from 73.8).

These results indicate that accurate language understanding depends on developing robust, high-capacity contextual representations. The context, modeled in the hidden-layer embeddings, captures the meaning of the words, which in turn drives the overall understanding, as directly measured by model performance on standard benchmarks.

Optimized Model Performance with the RACE Dataset
To evaluate the language understanding capability of a model, one can administer a reading comprehension test (e.g., similar to the SAT Reading Test). This can be done with the RACE dataset (2017), the largest publicly available resource for this purpose. Computer performance on this reading comprehension challenge mirrors well the language modeling advances of the last few years: a model pre-trained with only context-independent word representations scores poorly on this test (45.9), while BERT, with context-dependent language knowledge, scores relatively well with a 72.0. Refined BERT models, such as XLNet and RoBERTa, set the bar even higher, in the 82-83 score range. The ALBERT-xxlarge configuration mentioned above yields a RACE score in the same range (82.3) when trained on the base BERT dataset (Wikipedia and Books). However, when trained on the same larger dataset as XLNet and RoBERTa, it significantly outperforms all other approaches to date and establishes a new state-of-the-art score of 89.4.

Machine performance on the RACE challenge (SAT-like reading comprehension). A random-guess baseline score is 25.0. The maximum possible score is 95.0.

The success of ALBERT demonstrates the importance of identifying the aspects of a model that give rise to powerful contextual representations. By focusing improvement efforts on these aspects of the model architecture, it is possible to greatly improve both the model efficiency and performance on a wide range of NLP tasks. To facilitate further advances in the field of NLP, we are open-sourcing ALBERT to the research community.

The On-Device Machine Learning Behind Recorder

Over the past two decades, Google has made information widely accessible through search — from textual information, photos and videos, to maps and jobs. But much of the world’s information is conveyed through speech. And even though many people use audio recording devices to capture important information in conversations, interviews, lectures and more, it can be very difficult to later parse through hours of recordings to identify and extract the information of interest. What if it were possible to automatically transcribe and tag long recordings in real time, enabling you to intuitively find the relevant information you need, when you need it?

For this reason, we launched Recorder, a new kind of audio recording app for Pixel phones that leverages recent developments in on-device machine learning (ML) to transcribe conversations, to detect and identify the type of audio recorded (from broad categories like music or speech to particular sounds, such as applause, laughter and whistling), and to index recordings so users can quickly find and extract segments of interest. All of these features run entirely on-device, without the need for an internet connection.

Transcription
Recorder transcribes speech in real time using an on-device automatic speech recognition model based on improvements announced earlier this year. Because this model is a key component of many of Recorder’s smart features, we made sure it can transcribe long audio recordings (a few hours) reliably, while also indexing the conversation by mapping words to the timestamps computed by the speech recognition model. This enables the user to click on a word in the transcription and initiate playback starting from that point in the recording, or to search for a word and jump to the exact point in the recording where it was said.
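The word-to-timestamp index described above can be pictured as a simple list of (word, start time) pairs. The sketch below (with invented timestamps, not Recorder's actual data format) shows how tapping or searching for a word maps to a playback offset.

```python
# Invented example of a transcript index: each word stores its start time
# in seconds, so search results can jump playback to the right offset.
transcript = [("welcome", 0.42), ("to", 0.81), ("the", 0.95), ("meeting", 1.10)]

def seek_offset(query, transcript):
    for word, start_time in transcript:
        if word == query:
            return start_time   # playback would resume from here
    return None

print(seek_offset("meeting", transcript))  # 1.1
```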

Recording Content Visualization via Sound Classification
While presenting a transcript for a recording is useful and allows one to search for specific words, sometimes (especially for very long recordings) it’s more useful to visually search for sections of a recording based on specific moments or sounds. To enable this, Recorder additionally represents audio visually as a colored waveform where each color is associated with a different sound category. This is done by combining research into using CNNs to classify audio sounds (e.g., identifying a dog barking or a musical instrument playing) with previously published datasets for audio event detection to classify apparent sound events in individual audio frames.

Of course, in most situations many sounds can appear at the same time. To visualize the audio clearly, we decided to color each waveform bar with a single color that represents the most dominant sound in a given time frame (in our case, 50 ms bars). The colorized waveform lets users understand what type of content was captured in a specific recording and navigate an ever-growing audio library more easily. It gives users a visual representation of their recordings and also enables them to search over audio events within them.

Recorder implements a sliding-window capability that processes partially overlapping 960 ms audio frames at 50 ms intervals and outputs a vector of sigmoid scores representing the probability of each supported audio class within the frame. We apply a linearization process to the sigmoid scores, in combination with a thresholding mechanism, to maximize precision and report the correct sound classification. Analyzing the content of the 960 ms window with small 50 ms offsets makes it possible to pinpoint exact start and end times in a manner that is less prone to mistakes than analyzing consecutive, non-overlapping 960 ms slices on their own.
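The sketch below illustrates the sliding-window idea under stated assumptions: a hypothetical `classify_frame` stands in for the on-device sound model, and the sample rate and threshold are arbitrary. Each 960 ms window, advanced by 50 ms, yields a vector of per-class scores that is then thresholded.

```python
import numpy as np

SAMPLE_RATE = 16000
WINDOW = int(0.960 * SAMPLE_RATE)   # 960 ms analysis window
HOP = int(0.050 * SAMPLE_RATE)      # 50 ms step between windows

def classify_frame(frame):
    # Placeholder for the real classifier: per-class sigmoid scores.
    return np.random.rand(8)

def score_recording(audio, threshold=0.5):
    events = []
    for start in range(0, len(audio) - WINDOW + 1, HOP):
        scores = classify_frame(audio[start:start + WINDOW])
        active = np.flatnonzero(scores >= threshold)  # classes above threshold
        events.append((start / SAMPLE_RATE, active))
    return events

events = score_recording(np.zeros(SAMPLE_RATE * 5))  # 5 seconds of audio
```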

Since the model analyzes each audio frame independently, it can be prone to quick jittering between audio classes. This is solved with an adaptive-size median filtering technique applied to the model’s most recent audio class outputs, providing a smoothed, consistent output over time. The process runs continuously in real time, so it must meet very strict power-consumption limitations.
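One simple way to picture this smoothing step (a fixed-size median filter rather than Recorder's adaptive one, with made-up shapes) is to filter each class's score track over time before picking the dominant class per frame.

```python
import numpy as np
from scipy.signal import medfilt

scores = np.random.rand(200, 8)      # (time frames, audio classes), synthetic
smoothed = np.stack(
    [medfilt(scores[:, c], kernel_size=5) for c in range(scores.shape[1])],
    axis=1)
dominant = smoothed.argmax(axis=1)   # one smoothed class decision per frame
```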

Suggesting Tags for Titles
Once a recording is done, Recorder suggests three tags that the app deems to represent the most memorable content, enabling the user to quickly compose a meaningful title.

To be able to suggest these tags immediately when the recording ends, Recorder analyzes the content of the recording as it is being transcribed. First, Recorder counts term occurrences and identifies their grammatical role in the sentence; terms identified as entities are capitalized. Then, we utilize an on-device part-of-speech tagger — a model that labels each word in the sentence according to its grammatical role — to detect common nouns and proper nouns, which users tend to find more memorable. Recorder utilizes a prior-scores table supporting both unigram and bigram term extraction. To generate the scores, we trained a boosted decision tree on conversational data, using textual features like document word frequency and specificity. Finally, stop words and swear words are filtered out, and the top tags are output.
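As a simplified illustration (not Recorder's actual models), the sketch below scores candidate unigram and bigram noun terms by frequency weighted against a small prior table, filters stop words, and keeps the top three. The POS tags and prior values are made up.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "and", "of", "to"}
PRIOR = {"budget": 1.5, "design review": 2.0, "meeting": 0.8}  # hypothetical priors

def suggest_tags(tokens, pos_tags, top_k=3):
    candidates = Counter()
    for i, (tok, tag) in enumerate(zip(tokens, pos_tags)):
        if tag.startswith("NN") and tok.lower() not in STOP_WORDS:
            candidates[tok.lower()] += 1                       # unigram candidate
            if i + 1 < len(tokens) and pos_tags[i + 1].startswith("NN"):
                candidates[f"{tok.lower()} {tokens[i + 1].lower()}"] += 1  # bigram
    scored = {term: count * PRIOR.get(term, 1.0) for term, count in candidates.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top_k]

print(suggest_tags(["The", "design", "review", "covered", "the", "budget"],
                   ["DT", "NN", "NN", "VBD", "DT", "NN"]))
```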

Tags extraction pipeline architecture

Conclusion
Recorder galvanized some of our most recent on-device ML research efforts into helpful features, running models on-device to ensure user privacy. The positive feedback loop between machine learning investigations and user needs revealed exciting opportunities to make our software even more useful. We’re excited for future research that will make everyone’s ideas and conversations even more easily accessible and searchable.

Acknowledgments
Special thanks to Dror Ayalon who played a key role in developing and forming the above features and without whom this blog post wouldn’t have been possible. We would also want to thank all our team members and collaborators who worked on this project with us: Amit Pitaru, Kelsie Van Deman, Isaac Blankensmith, Teo Soares, John Watkinson, Matt Hall, Josh Deitel, Benny Schlesinger, Yoni Tsafir, Michelle Tadmor Ramanovich, Danielle Cohen, Sushant Prakash, Renat Aksitov, Ed West, Max Gubin, Tiantian Zhang, Aaron Cohen, Yunhsuan Sung, Chung-Ching Chang, Nathan Dass, Amin Ahmad, Tiago Camolesi, Guilherme Santos‎, Julio da Silva, Dan Ellis, Qiao Liang, Arun Narayanan‎, Rohit Prabhavalkar, Benyah Shaparenko‎, Alex Salcianu, Mike Tsao, Shenaz Zak, Sherry Lin, James Lemieux, Jason Cho, Thomas Hall‎, Brian Chen, Allen Su, Vincent Peng‎, Richard Chou‎, Henry Liu‎, Edward Chen, Yitong Lin, Tracy Wu, Yvonne Yang‎.

Improving Out-of-Distribution Detection in Machine Learning Models

Successful deployment of machine learning systems requires that the system be able to recognize data that is anomalous or significantly different from the data used in training. This is particularly important for deep neural network classifiers, which might classify such out-of-distribution (OOD) inputs into in-distribution classes with high confidence, a serious problem when these predictions inform real-world decisions.

For example, one challenging application of machine learning models to real-world applications is bacteria identification based on genomic sequences. Bacteria detection is crucial for diagnosis and treatment of infectious diseases, such as sepsis, and for identifying foodborne pathogens. New bacterial classes continue to be discovered over the years, and while a neural network classifier trained on the known classes achieves high accuracy as measured through cross-validation, deploying a model is challenging, since real-world data is ever evolving and will inevitably contain genomes from unseen classes (OOD inputs) not present in the training data.

New bacterial classes are gradually discovered over the years. A classifier trained on known classes achieves high accuracy for test inputs belonging to known classes, but can wrongly classify inputs from unknown classes (i.e., out-of-distribution) into known classes with high confidence.

In “Likelihood Ratios for Out-of-Distribution Detection”, presented at NeurIPS 2019, we proposed and released a realistic benchmark dataset of genomic sequences for OOD detection that is inspired by the real-world challenges described above. We tested existing methods for OOD detection using generative models on genomic sequences and found that the likelihood values — i.e., the model’s probability that an input comes from the distribution, as estimated using in-distribution data — were often in error. This phenomenon has also been observed in recent work on deep generative models of images. We explain this phenomenon through the effect of background statistics and propose a likelihood-ratio based solution that significantly improves the accuracy of OOD detection.

Why Do Density Models Fail At OOD Detection?
To mimic the real problem and systematically evaluate different methods, we built a new bacterial dataset using data sourced from the publicly available NCBI catalog of prokaryotic genome sequences. To mimic sequencing data, we fragmented genomes into short sequences of 250 base pairs, a length commonly generated by current sequencing technology. We then separated in- and out-of-distribution data by the date of discovery, such that bacterial classes discovered before a cutoff time were defined as in-distribution, and those discovered afterward as OOD.

We then trained a deep generative model on in-distribution genomic sequences and examined how well the model discriminated between in- and out-of-distribution inputs by plotting their likelihood values. The histogram of the likelihood for OOD sequences largely overlaps with that of in-distribution sequences, indicating that the generative model was unable to distinguish between the two populations for OOD detection. Similar results were shown in earlier work for deep generative models of images — for instance, a PixelCNN++ model trained on images from Fashion-MNIST dataset (which consists of images of clothing and footwear) assigns higher likelihood to OOD images from the MNIST dataset (which consists of images of digits 0-9).

Left: Histogram of likelihood values for in- and out-of-distribution (OOD) genomic sequences. The likelihood fails to separate in-distribution and OOD genomic sequences. Right: A similar plot for a model trained on Fashion-MNIST and evaluated on MNIST. The model assigns higher likelihood values for OOD (MNIST) than in-distribution images.

When investigating this failure mode, we observed that the likelihood can be confounded by background statistics. To understand the phenomenon more intuitively, assume that an input is composed of two components, (1) a background component characterized by background statistics, and (2) a semantic component characterized by patterns specific to the in-distribution data. For example, an MNIST image can be modeled as background plus semantics. When humans interpret the image, we can easily ignore the background and focus primarily on the semantic information, e.g., the “/” mark in the image below. But the likelihood is calculated for all pixels in an image, including both semantic and background pixels. Though we want to use just the semantic likelihood for decision making, the raw likelihood can be dominated by background.

Left top: Sample images from Fashion-MNIST. Left bottom: Sample images from MNIST. Right: Background and semantic components in an MNIST image.

Likelihood Ratios For OOD Detection
We propose a likelihood ratio method that removes the effect of background and focuses on semantics. First, we train a background model on perturbed inputs. The method for perturbing the input is inspired by genetic mutations, and proceeds by randomly selecting positions in the input and substituting the value with another chosen with equal probability. For images, the values are randomly chosen from the 256 possible pixel values, and for the DNA sequences, the value is selected from the four possible nucleotides (A, T, C, or G). The right amount of perturbation corrupts the semantic structure in the data, so that the background model captures only the background. We then compute the likelihood ratio between the full model and the background model; the background component cancels out, so that only the likelihood for semantics remains. The likelihood ratio is thus a background-contrastive score, i.e., it captures the significance of the semantics compared to the background.
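The sketch below illustrates the two pieces under clearly labeled assumptions: a mutation-style perturbation for DNA sequences, and a likelihood-ratio score computed from two hypothetical trained generative models (`full_model` and `background_model`), each assumed to expose a `log_prob` method.

```python
import numpy as np

def perturb(sequence, rate, alphabet=("A", "C", "G", "T"), rng=np.random):
    """Randomly substitute a fraction `rate` of positions with uniform symbols."""
    seq = list(sequence)
    for i in range(len(seq)):
        if rng.rand() < rate:
            seq[i] = rng.choice(alphabet)
    return "".join(seq)

def likelihood_ratio_score(x, full_model, background_model):
    # Higher values indicate stronger in-distribution (semantic) evidence.
    return full_model.log_prob(x) - background_model.log_prob(x)

print(perturb("ACGTACGTAC", rate=0.3))
```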

To qualitatively evaluate the difference between the likelihood and likelihood ratio, we plotted their values for each pixel in the Fashion-MNIST and MNIST datasets, creating heatmaps that have the same size as the images. This allows us to visualize which pixels contribute the most to the two terms, respectively. From the log-likelihood heatmaps, we see that the background pixels contribute much more to the likelihood than the semantic pixels. In hindsight, this is not surprising, since background pixels consist mostly of a string of zeros, a pattern very easily learned by the model. A comparison between the MNIST and Fashion-MNIST heatmaps demonstrates why MNIST returns higher likelihood values — it simply has a lot more background pixels! The likelihood ratio instead focuses more on the semantic pixels.

Left: Log-likelihood heatmaps for Fashion-MNIST and MNIST datasets. Right: The same examples showing heatmaps of the likelihood-ratio. Pixels with higher values are of lighter shades. The likelihood is dominated by the “background” pixels, whereas the likelihood ratio focuses on the “semantic” pixels and is thus better for OOD detection.

Our likelihood ratio method corrects the background effect and significantly improves the OOD detection of MNIST images from an AUROC score of 0.089 to 0.994, based on a PixelCNN++ model trained for Fashion-MNIST. When applied to the genomic benchmark dataset, this method achieves state-of-the-art performance on this challenging problem, when compared to 12 other baseline methods.

For more details, please check out our recent paper at NeurIPS 2019. While our likelihood ratio method reaches state-of-the-art performance on the genomic dataset, it does not yet have high enough accuracy to reach the standards for deployment of the model to real applications. We encourage researchers to contribute their solutions to this important problem and improve the current state-of-the-art. The dataset is available on our GitHub repository.

Acknowledgments
The work described here was authored by Jie Ren, Peter J. Liu, Emily Fertig, Jasper Snoek, Ryan Poplin, Mark A. DePristo, Joshua V. Dillon, Balaji Lakshminarayanan, through a collaboration spanning several teams across Google AI and DeepMind. We are grateful for all the discussions and feedback on this work that we received from the reviewers at NeurIPS 2019, and our colleagues at Google and DeepMind: Alexander A. Alemi, Andreea Gane, Brian Lee, D. Sculley, Eric Jang, Jacob Burnim, Katherine Lee, Matthew D. Hoffman, Noah Fiedel, Rif A. Saurous, Suman Ravuri, Thomas Colthurst, Yaniv Ovadia, along with the Google Brain and TensorFlow teams.

Improvements to Portrait Mode on the Google Pixel 4 and Pixel 4 XL

Portrait Mode on Pixel phones is a camera feature that allows anyone to take professional-looking shallow depth of field images. Launched on the Pixel 2 and then improved on the Pixel 3 by using machine learning to estimate depth from the camera’s dual-pixel auto-focus system, Portrait Mode draws the viewer’s attention to the subject by blurring out the background. A critical component of this process is knowing how far objects are from the camera, i.e., the depth, so that we know what to keep sharp and what to blur.

With the Pixel 4, we have made two more big improvements to this feature, leveraging both the Pixel 4’s dual cameras and dual-pixel auto-focus system to improve depth estimation, allowing users to take great-looking Portrait Mode shots at near and far distances. We have also improved our bokeh, making it more closely match that of a professional SLR camera.

Pixel 4’s Portrait Mode allows for Portrait Shots at both near and far distances and has SLR-like background blur. (Photos Credit: Alain Saal-Dalma and Mike Milne)

A Short Recap
The Pixel 2 and 3 used the camera’s dual-pixel auto-focus system to estimate depth. Dual pixels work by splitting every pixel in half, such that each half pixel sees a different half of the main lens’ aperture. By reading out each of these half-pixel images separately, you get two slightly different views of the scene. While these views come from a single camera with one lens, it is as if they originate from a virtual pair of cameras placed on either side of the main lens’ aperture. When alternating between these views, the subject stays in the same place while the background appears to move vertically.

The dual-pixel views of the bulb have much more parallax than the views of the man because the bulb is much closer to the camera.

This motion is called parallax and its magnitude depends on depth. One can estimate parallax, and thus depth, by finding corresponding pixels between the views. Because parallax decreases with object distance, it is easier to estimate depth for near objects like the bulb. Parallax also depends on the length of the stereo baseline, that is, the distance between the cameras (or the virtual cameras in the case of dual pixels). The dual-pixel viewpoints have a baseline of less than 1 mm, because they are contained inside a single camera’s lens, which is why it’s hard to estimate the depth of far scenes with them and why the two views of the man look almost identical.
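A back-of-the-envelope calculation shows why baseline length matters so much: disparity grows with baseline and shrinks with depth. The focal length below is an assumed value for illustration only, not the Pixel's actual calibration; the baselines match the figures discussed in the text.

```python
# Disparity (parallax in pixels) ~ focal_length * baseline / depth.
focal_length_px = 3000.0          # assumed focal length in pixels
dual_pixel_baseline_m = 0.001     # < 1 mm between dual-pixel viewpoints
dual_camera_baseline_m = 0.013    # 13 mm between wide and telephoto cameras

for depth_m in (0.5, 2.0, 10.0):
    dp = focal_length_px * dual_pixel_baseline_m / depth_m
    dc = focal_length_px * dual_camera_baseline_m / depth_m
    print(f"depth {depth_m:4.1f} m: dual-pixel {dp:5.2f} px, dual-camera {dc:5.2f} px")
```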

Dual Cameras are Complementary to Dual-Pixels
The Pixel 4’s wide and telephoto cameras are 13 mm apart, much greater than the dual-pixel baseline, and so the larger parallax makes it easier to estimate the depth of far objects. In the images below, the parallax between the dual-pixel views is barely visible, while it is obvious between the dual-camera views.

Left: Dual-pixel views. Right: Dual-camera views. The dual-pixel views have only a subtle vertical parallax in the background, while the dual-camera views have much greater horizontal parallax. While this makes it easier to estimate depth in the background, some pixels to the man’s right are visible in only the primary camera’s view making it difficult to estimate depth there.

Even with dual cameras, information gathered by the dual pixels is still useful. The larger the baseline, the more pixels that are visible in one view without a corresponding pixel in the other. For example, the background pixels immediately to the man’s right in the primary camera’s image have no corresponding pixel in the secondary camera’s image. Thus, it is not possible to measure the parallax to estimate the depth for these pixels when using only dual cameras. However, these pixels can still be seen by the dual pixel views, enabling a better estimate of depth in these regions.

Another reason to use both inputs is the aperture problem, described in our previous blog post, which makes it hard to estimate the depth of vertical lines when the stereo baseline is also vertical (or when both are horizontal). On the Pixel 4, the dual-pixel and dual-camera baselines are perpendicular, allowing us to estimate depth for lines of any orientation.

Having this complementary information allows us to estimate the depth of far objects and reduce depth errors for all scenes.

Depth from Dual Cameras and Dual-Pixels
We showed last year how machine learning can be used to estimate depth from dual-pixels. With Portrait Mode on the Pixel 4, we extended this approach to estimate depth from both dual-pixels and dual cameras, using TensorFlow to train a convolutional neural network. The network first processes the dual-pixel and dual-camera inputs separately using two different encoders, a type of neural network that encodes the input into an intermediate representation. Then, a single decoder uses both intermediate representations to compute depth.

Our network to predict depth from dual-pixels and dual-cameras. The network uses two encoders, one for each input and a shared decoder with skip connections and residual blocks.

To force the model to use both inputs, we applied a drop-out technique, where one input is randomly set to zero during training. This teaches the model to work well if one input is unavailable, which could happen if, for example, the subject is too close for the secondary telephoto camera to focus on.
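Below is a minimal sketch (not the production network) of the two-encoder, shared-decoder layout together with the input-dropping idea: each input has its own small convolutional encoder, the decoder consumes both, and during training one input is occasionally zeroed out. Layer sizes, drop probability, and input shapes are all illustrative assumptions.

```python
import tensorflow as tf

def make_encoder(name):
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, strides=2, padding="same", activation="relu"),
        tf.keras.layers.Conv2D(64, 3, strides=2, padding="same", activation="relu"),
    ], name=name)

dp_in = tf.keras.Input(shape=(128, 128, 2), name="dual_pixel")
dc_in = tf.keras.Input(shape=(128, 128, 2), name="dual_camera")

features = tf.keras.layers.Concatenate()(
    [make_encoder("dp_encoder")(dp_in), make_encoder("dc_encoder")(dc_in)])

depth = tf.keras.Sequential([
    tf.keras.layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu"),
    tf.keras.layers.Conv2DTranspose(1, 3, strides=2, padding="same"),
], name="decoder")(features)

model = tf.keras.Model(inputs=[dp_in, dc_in], outputs=depth)

def randomly_drop_one_input(dp, dc, p=0.25):
    # Zero out at most one of the two inputs so the model learns to rely
    # on whichever signal remains available.
    drop_dp = tf.random.uniform([]) < p
    drop_dc = tf.logical_and(tf.logical_not(drop_dp), tf.random.uniform([]) < p)
    dp = tf.where(drop_dp, tf.zeros_like(dp), dp)
    dc = tf.where(drop_dc, tf.zeros_like(dc), dc)
    return dp, dc
```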

Depth maps from our network where either only one input is provided or both are provided. Top: The two inputs provide depth information for lines in different directions. Bottom: Dual-pixels provide better depth in the regions visible in only one camera, emphasized in the insets. Dual-cameras provide better depth in the background and ground. (Photo Credit: Mike Milne)

The lantern image above shows how having both signals solves the aperture problem. Having one input only allows us to predict depth accurately for lines in one direction (horizontal for dual-pixels and vertical for dual-cameras). With both signals, we can recover the depth on lines in all directions.

With the image of the person, dual-pixels provide better depth information in the occluded regions between the arm and torso, while the large baseline dual cameras provide better depth information in the background and on the ground. This is most noticeable in the upper-left and lower-right corner of depth from dual-pixels. You can find more examples here.

SLR-Like Bokeh
Photographers obsess over the look of the blurred background or bokeh of shallow depth of field images. One of the most noticeable things about high-quality SLR bokeh is that small background highlights turn into bright disks when defocused. Defocusing spreads the light from these highlights into a disk. However, the original highlight is so bright that even when its light is spread into a disk, the disk remains at the bright end of the SLR’s tonal range.

Left: SLRs produce high contrast bokeh disks. Middle: It is hard to make out the disks in our old background blur. Right: Our new bokeh is closer to that of an SLR.

To reproduce this bokeh effect, we replaced each pixel in the original image with a translucent disk whose size is based on depth. In the past, this blurring process was performed after tone mapping, the process by which raw sensor data is converted to an image viewable on a phone screen. Tone mapping compresses the dynamic range of the data, making shadows brighter relative to highlights. Unfortunately, this also results in a loss of information about how bright objects actually were in the scene, making it difficult to produce nice high-contrast bokeh disks. Instead, the bokeh blends in with the background, and does not appear as natural as that from an SLR.

The solution to this problem is to blur the merged raw image produced by HDR+ and then apply tone mapping. In addition to the brighter and more obvious bokeh disks, the background is saturated in the same way as the foreground. Here’s an album showcasing the better blur, which is available on the Pixel 4 and the rear camera of the Pixel 3 and 3a (assuming you have upgraded to version 7.2 of the Google Camera app).
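The toy example below illustrates why the ordering matters, using deliberately crude stand-ins: a uniform blur in place of the depth-dependent disk blur and a simple gamma curve in place of HDR+ tone mapping. A bright highlight blurred in linear space remains a bright disk, whereas tone mapping first washes it out.

```python
import numpy as np
from scipy.ndimage import uniform_filter

linear = np.full((21, 21), 0.01)   # dark linear-space background
linear[10, 10] = 50.0              # one very bright specular highlight

def tone_map(x):
    return np.clip(x, 0, 1) ** (1 / 2.2)   # crude stand-in for tone mapping

def disk_blur(x):
    return uniform_filter(x, size=7)        # crude stand-in for depth-based blur

print("blur then tone map:", tone_map(disk_blur(linear)).max())
print("tone map then blur:", disk_blur(tone_map(linear)).max())
```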

Blurring before tone mapping improves the look of the background by making it more saturated and by making the bokeh disks higher contrast.

Try it Yourself
We have made Portrait Mode on the Pixel 4 better by improving depth quality, resulting in fewer errors in the final image and by improving the look of the blurred background. Depth from dual-cameras and dual-pixels only kicks in when the camera is at least 20 cm from the subject, i.e. the minimum focus distance of the secondary telephoto camera. So consider keeping your phone at least that far from the subject to get better quality portrait shots.

Acknowledgments
This work wouldn’t have been possible without Rahul Garg, Sergio Orts Escolano, Sean Fanello, Christian Haene, Shahram Izadi, David Jacobs, Alexander Schiffhauer, Yael Pritch Knaan and Marc Levoy. We would also like to thank the Google Camera team for helping to integrate these algorithms into the Pixel 4. Special thanks to our photographers Mike Milne, Andy Radin, Alain Saal-Dalma, and Alvin Li who took numerous test photographs for us.

Fairness Indicators: Scalable Infrastructure for Fair ML Systems

While industry and academia continue to explore the benefits of using machine learning (ML) to make better products and tackle important problems, algorithms and the datasets on which they are trained also have the ability to reflect or reinforce unfair biases. For example, consistently flagging non-toxic text comments from certain groups as “spam” or “high toxicity” in a moderation system leads to exclusion of those groups from conversation.

In 2018, we shared how Google uses AI to make products more useful, highlighting AI principles that will guide our work moving forward. The second principle, “Avoid creating or reinforcing unfair bias,” outlines our commitment to reduce unjust biases and minimize their impacts on people.

As part of this commitment, at TensorFlow World, we recently released a beta version of Fairness Indicators, a suite of tools that enable regular computation and visualization of fairness metrics for binary and multi-class classification, helping teams take a first step towards identifying unjust impacts. Fairness Indicators can be used to generate metrics for transparency reporting, such as those used for model cards, to help developers make better decisions about how to deploy models responsibly. Because fairness concerns and evaluations differ case by case, we also include in this release an interactive case study with Jigsaw’s Unintended Bias in Toxicity dataset to illustrate how Fairness Indicators can be used to detect and remediate bias in a production machine learning (ML) model, depending on the context in which it is deployed. Fairness Indicators is now available in beta for you to try for your own use cases.

What is ML Fairness?
Bias can manifest in any part of a typical machine learning pipeline, from an unrepresentative dataset, to learned model representations, to the way in which the results are presented to the user. Errors that result from this bias can disproportionately impact some users more than others.

To detect this unequal impact, evaluation over individual slices, or groups of users, is crucial as overall metrics can obscure poor performance for certain groups. These groups may include, but are not limited to, those defined by sensitive characteristics such as race, ethnicity, gender, nationality, income, sexual orientation, ability, and religious belief. However, it is also important to keep in mind that fairness cannot be achieved solely through metrics and measurement; high performance, even across slices, does not necessarily prove that a system is fair. Rather, evaluation should be viewed as one of the first ways, especially for classification models, to identify gaps in performance.

The Fairness Indicators Suite of Tools
The Fairness Indicators tool suite enables computation and visualization of commonly-identified fairness metrics for classification models, such as false positive rate and false negative rate, making it easy to compare performance across slices or to a baseline slice. The tool computes confidence intervals, which can surface statistically significant disparities, and performs evaluation over multiple thresholds. In the UI, it is possible to toggle the baseline slice and investigate the performance of various other metrics. The user can also add their own metrics for visualization, specific to their use case.
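As a concrete, synthetic illustration of the kind of sliced metric the tool reports (and not the Fairness Indicators API itself), the snippet below computes the false positive rate per group at a fixed decision threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)             # 1 = positive class, 0 = negative
scores = rng.random(1000)                          # model scores
groups = rng.choice(["group_a", "group_b"], size=1000)

threshold = 0.5
preds = scores >= threshold
for group in np.unique(groups):
    negatives = (groups == group) & (labels == 0)  # ground-truth negatives in slice
    fpr = preds[negatives].mean()                  # fraction wrongly flagged positive
    print(f"{group}: false positive rate = {fpr:.3f}")
```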

Furthermore, Fairness Indicators is integrated with the What-If Tool (WIT) — clicking on a bar in the Fairness Indicators graph will load those specific data points into the WIT widget for further inspection, comparison, and counterfactual analysis. This is particularly useful for large datasets, where Fairness Indicators can be used to identify problematic slices before the WIT is used for a deeper analysis.

Using Fairness Indicators to visualize metrics for fairness evaluation.
Clicking on a slice in Fairness Indicators will load all the data points in that slice inside the What-If Tool widget. In this case, all data points with the “female” label are shown.

The Fairness Indicators beta launch includes the following:

How To Use Fairness Indicators in Models Today
Fairness Indicators is built on top of TensorFlow Model Analysis, a component of TensorFlow Extended (TFX) that can be used to investigate and visualize model performance. Depending on the specific ML workflow, Fairness Indicators can be incorporated into a system in one of the following ways:
If using TensorFlow models and tools, such as TFX:

  • Access Fairness Indicators as part of the Evaluator component in TFX
  • Access Fairness Indicators in TensorBoard when evaluating other real-time metrics

If not using existing TensorFlow tools:

  • Download the Fairness Indicators pip package, and use TensorFlow Model Analysis as a standalone tool

For non-TensorFlow models:

Fairness Indicators Case Study
We created a case study and introductory video that illustrates how Fairness Indicators can be used with a combination of tools to detect and mitigate bias in a model trained on Jigsaw’s Unintended Bias in Toxicity dataset. The dataset was developed by Conversation AI, a team within Jigsaw that works to train ML models to protect voices in conversation. Models are trained to predict whether text comments are likely to be abusive along a variety of dimensions including toxicity, insult, and sexual explicitness.

The primary use case for models such as these is content moderation. If a model penalizes certain types of messages in a systematic way (e.g., often marks comments as toxic when they are not, leading to a high false positive rate), those voices will be silenced. In the case study, we investigated false positive rate on subgroups sliced by gender identity keywords that are present in the dataset, using a combination of tools (Fairness Indicators, TFDV, and WIT) to detect, diagnose, and take steps toward remediating the underlying problem.

What’s next?
Fairness Indicators is only the first step. We plan to expand vertically by enabling more supported metrics, such as metrics that enable you to evaluate classifiers without thresholds, and horizontally by creating remediation libraries that utilize methods, such as active learning and min-diff. Because we believe it is important to learn through real examples, we hope to ground our work in more case studies to be released over the next few months, as more features become available.

To get started, see the Fairness Indicators GitHub repo. For more information on how to think about fairness evaluation in the context of your use case, see this link.

We would love to partner with you to understand where Fairness Indicators is most useful, and where added functionality would be valuable. Please reach out at tfx@tensorflow.org to provide any feedback on your experience!

Acknowledgements
The core team behind this work includes Christina Greer, Manasi Joshi, Huanming Fang, Shivam Jindal, Karan Shukla, Osman Aka, Sanders Kleinfeld, Alicia Chang, Alex Hanna, and Dan Nanas. We would also like to thank James Wexler, Mahima Pushkarna, Meg Mitchell and Ben Hutchinson for their contributions to the project.

Lessons Learned from Developing ML for Healthcare

Machine learning (ML) methods are not new in medicine — traditional techniques, such as decision trees and logistic regression, were commonly used to derive established clinical decision rules (for example, the TIMI Risk Score for estimating patient risk after a coronary event). In recent years, however, there has been a tremendous surge in leveraging ML for a variety of medical applications, such as predicting adverse events from complex medical records, and improving the accuracy of genomic sequencing. In addition to detecting known diseases, ML models can tease out previously unknown signals, such as cardiovascular risk factors and refractive error from retinal fundus photographs.

Beyond developing these models, it’s important to understand how they can be incorporated into medical workflows. Previous research indicates that doctors assisted by ML models can be more accurate than either doctors or models alone in grading diabetic eye disease and diagnosing metastatic breast cancer. Similarly, doctors are able to leverage ML-based tools in an interactive fashion to search for similar medical images, providing further evidence that doctors can work effectively with ML-based assistive tools.

In an effort to improve guidance for research at the intersection of ML and healthcare, we have written a pair of articles, published in Nature Materials and the Journal of the American Medical Association (JAMA). The first is for ML practitioners to better understand how to develop ML solutions for healthcare, and the other is for doctors who desire a better understanding of whether ML could help improve their clinical work.

How to Develop Machine Learning Models for Healthcare
In “How to develop machine learning models for healthcare” (pdf), published in Nature Materials, we discuss the importance of ensuring that the needs specific to the healthcare environment inform the development of ML models for that setting. This should be done throughout the process of developing technologies for healthcare applications, from problem selection, data collection and ML model development to validation and assessment, deployment and monitoring.

The first consideration is how to identify a healthcare problem for which there is both an urgent clinical need and for which predictions based on ML models will provide actionable insight. For example, ML for detecting diabetic eye disease can help alleviate the screening workload in parts of the world where diabetes is prevalent and the number of medical specialists is insufficient. Once the problem has been identified, one must be careful with data curation to ensure that the ground truth labels, or “reference standard”, applied to the data are reliable and accurate. This can be accomplished by validating labels via comparison to expert interpretation of the same data, such as retinal fundus photographs, or through an orthogonal procedure, such as a biopsy to confirm radiologic findings. This is particularly important since a high-quality reference standard is essential both for training useful models and for accurately measuring model performance. Therefore, it is critical that ML practitioners work closely with clinical experts to ensure the rigor of the reference standard used for training and evaluation.

Validation of model performance is also substantially different in healthcare, because the problem of distributional shift can be pronounced. In contrast to typical ML studies where a single random test split is common, the medical field values validation using multiple independent evaluation datasets, each with different patient populations that may exhibit differences in demographics or disease subtypes. Because the specifics depend on the problem, ML practitioners should work closely with clinical experts to design the study, with particular care in ensuring that the model validation and performance metrics are appropriate for the clinical setting.

Integration of the resulting assistive tools also requires thoughtful design to ensure seamless workflow integration, with consideration for measurement of the impact of these tools on diagnostic accuracy and workflow efficiency. Importantly, there is substantial value in prospective study of these tools in real patient care to better understand their real-world impact.

Finally, even after validation and workflow integration, the journey towards deployment is just beginning: regulatory approval and continued monitoring for unexpected error modes or adverse events in real use remains ahead.

Two examples of the translational process of developing, validating, and implementing ML models for healthcare based on our work in detecting diabetic eye disease and metastatic breast cancer.

Empowering Doctors to Better Understand Machine Learning for Healthcare
In “Users’ Guide to the Medical Literature: How to Read Articles that use Machine Learning,” published in JAMA, we summarize key ML concepts to help doctors evaluate ML studies for suitability of inclusion in their workflow. The goal of this article is to demystify ML, to assist doctors who need to use ML systems to understand their basic functionality, when to trust them, and their potential limitations.

The central questions doctors ask when evaluating any study, whether ML or not, remain: Was the reference standard reliable? Was the evaluation unbiased, such as assessing for both false positives and false negatives, and performing a fair comparison with clinicians? Does the evaluation apply to the patient population that I see? How does the ML model help me in taking care of my patients?

In addition to these questions, ML models should also be scrutinized to determine whether the hyperparameters used in their development were tuned on a dataset independent of that used for final model evaluation. This is particularly important, since inappropriate tuning can lead to substantial overestimation of performance, e.g., a sufficiently sophisticated model can be trained to completely memorize the training dataset and generalize poorly to new data. Ensuring that tuning was done appropriately requires being mindful of ambiguities in dataset naming, and in particular, using the terminology with which the audience is most familiar:

The intersection of the two fields, ML and healthcare, creates ambiguity around the term “validation dataset”. In ML, a validation set typically refers to the dataset used for hyperparameter tuning, whereas a “clinical” validation set is typically used for final evaluation. To reduce confusion, we have opted to refer to the (ML) validation set as the “tuning” set.
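To make the terminology concrete, here is a minimal sketch (using scikit-learn and synthetic data) of the three-way split implied above: a training set, a tuning set used only for hyperparameter selection, and a held-out set reserved for the final evaluation. The proportions are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.rand(1000, 20), np.random.randint(0, 2, 1000)   # synthetic data

# 60% train, 20% tuning (the "ML validation" set), 20% held-out final evaluation.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_tune, X_eval, y_tune, y_eval = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Hyperparameters are chosen using (X_tune, y_tune) only; (X_eval, y_eval)
# stays untouched until the final performance estimate.
```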

Future outlook
It is an exciting time to work on AI for healthcare. The “bench-to-bedside” path is a long one that requires researchers and experts from multiple disciplines to work together in this translational process. We hope that these two articles will promote mutual understanding of what is important for ML practitioners developing models for healthcare and what is emphasized by doctors evaluating these models, thus driving further collaborations between the fields and towards eventual positive impact on patient care.

Acknowledgements
Key contributors to these projects include Yun Liu, Po-Hsuan Cameron Chen, Jonathan Krause, and Lily Peng. The authors would like to acknowledge Greg Corrado and Avinash Varadarajan for their advice, and the Google Health team for their support.

Google at NeurIPS 2019

This week, Vancouver hosts the 33rd annual Conference on Neural Information Processing Systems (NeurIPS 2019), the biggest machine learning conference of the year. The conference includes invited talks, demonstrations and presentations of some of the latest in machine learning research. As a Diamond Sponsor of NeurIPS 2019, Google will have a strong presence, with more than 500 Googlers attending to contribute to, and learn from, the broader academic research community via talks, posters, workshops, competitions and tutorials. We will be presenting work that pushes the boundaries of what is possible in language understanding, translation, speech recognition and visual & audio perception, with Googlers co-authoring more than 130 accepted papers.

If you are attending NeurIPS 2019, we hope you’ll stop by our booth and chat with our researchers about the projects and opportunities at Google that go into solving the world’s most challenging research problems, and to see demonstrations of some of the exciting research we pursue, such as ML-based Flood Forecasting, AI for Social Good, Google Research Football, Google Dataset Search, TF-Agents and much more. You can also learn more about our work being presented in the list below (Google affiliations highlighted in blue).

NeurIPS Foundation Board
Samy Bengio, Corinna Cortes

NeurIPS Advisory Board
John C. Platt, Fernando Pereira, Dale Schuurmans

NeurIPS Program Committee
Program Chair: Hugo Larochelle
Diversity & Inclusion Co-Chair: Katherine Heller
Meetup Chair: Nicolas Le Roux
Party Co-Chair: Pablo Samuel Castro

Senior Area Chairs include: Amir Globerson, Claudio Gentile, Cordelia Schmid, Corinna Cortes, Dale Schuurmans, Elad Hazan, Honglak Lee, Mehryar Mohri, Peter Bartlett, Satyen Kale, Sergey Levine, Surya Ganguli

Area Chairs include: Afshin Rostamizadeh, Alex Kulesza, Amin Karbasi, Andrew Dai, Been Kim, Boqing Gong, Branislav Kveton, Ce Liu, Charles Sutton, Chelsea Finn, Cho-Jui Hsieh, D. Sculley, Danny Tarlow, David Held, Denny Zhou, Yann Dauphin, Dustin Tran, Hartmut Neven, Hossein Mobahi, Ilya Tolstikhin, Jasper Snoek, Jean-Philippe Vert, Jeffrey Pennington, Kevin Swersky, Kun Zhang, Kunal Talwar, Lihong Li, Manzil Zaheer, Marc G Bellemare, Marco Cuturi, Maya Gupta, Meg Mitchell, Minmin Chen, Mohammad Norouzi, Moustapha Cisse, Olivier Bachem, Qiang Liu, Rong Ge, Sanjiv Kumar, Sanmi Koyejo, Sebastian Nowozin, Sergei Vassilvitskii, Shivani Agarwal, Slav Petrov, Srinadh Bhojanapalli, Stephen Bach, Timnit Gebru, Tomer Koren, Vitaly Feldman, William Cohen, Yann Dauphin, Nicolas Le Roux

NeurIPS Workshops Program Committee
Yann Dauphin, Honglak Lee, Sebastian Nowozin, Fernanda Viegas

NeurIPS Invited Talk
Social Intelligence
Blaise Aguera y Arcas

Accepted Papers
Memory Efficient Adaptive Optimization
Rohan Anil, Vineet Gupta, Tomer Koren, Yoram Singer

Stand-Alone Self-Attention in Vision Models
Niki Parmar, Prajit Ramachandran, Ashish Vaswani, Irwan Bello, Anselm Levskaya, Jon Shlens

High Fidelity Video Prediction with Large Neural Nets
Ruben Villegas, Arkanath Pathak, Harini Kannan, Dumitru Erhan, Quoc V. Le, Honglak Lee

Unsupervised Learning of Object Structure and Dynamics from Videos
Matthias Minderer, Chen Sun, Ruben Villegas, Forrester Cole, Kevin Murphy, Honglak Lee

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, Hyouk Joong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, Zhifeng Chen

Quadratic Video Interpolation
Xiangyu Xu, Li Si-Yao, Wenxiu Sun, Qian Yin, Ming-Hsuan Yang

Online Stochastic Shortest Path with Bandit Feedback and Unknown Transition Function
Aviv Rosenberg, Yishay Mansour

Individual Regret in Cooperative Nonstochastic Multi-Armed Bandits
Yogev Bar-On, Yishay Mansour

Learning to Screen
Alon Cohen, Avinatan Hassidim, Haim Kaplan, Yishay Mansour, Shay Moran

DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections
Ofir Nachum, Yinlam Chow, Bo Dai, Lihong Li

A Kernel Loss for Solving the Bellman Equation
Yihao Feng, Lihong Li, Qiang Liu

Accurate Uncertainty Estimation and Decomposition in Ensemble Learning
Jeremiah Liu, John Paisley, Marianthi-Anna Kioumourtzoglou, Brent Coull

Saccader: Improving Accuracy of Hard Attention Models for Vision
Gamaleldin F. Elsayed, Simon Kornblith, Quoc V. Le

Invertible Convolutional Flow
Mahdi Karami, Dale Schuurmans, Jascha Sohl-Dickstein, Laurent Dinh, Daniel Duckworth

Hypothesis Set Stability and Generalization
Dylan J. Foster, Spencer Greenberg, Satyen Kale, Haipeng Luo, Mehryar Mohri, Karthik Sridharan

Bandits with Feedback Graphs and Switching Costs
Raman Arora, Teodor V. Marinov, Mehryar Mohri

Regularized Gradient Boosting
Corinna Cortes, Mehryar Mohri, Dmitry Storcheus

Logarithmic Regret for Online Control
Naman Agarwal, Elad Hazan, Karan Singh

Sampled Softmax with Random Fourier Features
Ankit Singh Rawat, Jiecao Chen, Felix Yu, Ananda Theertha Suresh, Sanjiv Kumar

Multilabel Reductions: What is My Loss Optimising?
Aditya Krishna Menon, Ankit Singh Rawat, Sashank Reddi, Sanjiv Kumar

MetaInit: Initializing Learning by Learning to Initialize
Yann N. Dauphin, Sam Schoenholz

Generalization Bounds for Neural Networks via Approximate Description Length
Amit Daniely, Elad Granot

Variance Reduction of Bipartite Experiments through Correlation Clustering
Jean Pouget-Abadie, Kevin Aydin, Warren Schudy, Kay Brodersen, Vahab Mirrokni

Likelihood Ratios for Out-of-Distribution Detection
Jie Ren, Peter J. Liu, Emily Fertig, Jasper Snoek, Ryan Poplin, Mark A. DePristo, Joshua V. Dillon, Balaji Lakshminarayanan

Can You Trust Your Model’s Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift
Yaniv Ovadia, Emily Fertig, Jie Jessie Ren, D. Sculley, Josh Dillon, Sebastian Nowozin, Zack Nado, Balaji Lakshminarayanan, Jasper Snoek

Surrogate Objectives for Batch Policy Optimization in One-step Decision Making
Minmin Chen, Ramki Gummadi, Chris Harris, Dale Schuurmans

Globally Optimal Learning for Structured Elliptical Losses
Yoav Wald, Nofar Noy, Gal Elidan, Ami Wiesel

DPPNet: Approximating Determinantal Point Processes with Deep Networks
Zelda Mariet, Yaniv Ovadia, Jasper Snoek

Graph Normalizing Flows
Jenny Liu, Aviral Kumar, Jimmy Ba, Jamie Kiros, Kevin Swersky

When Does Label Smoothing Help?
Rafael Muller, Simon Kornblith, Geoff Hinton

On the Role of Inductive Bias From Simulation and the Transfer to the Real World: a new Disentanglement Dataset
Muhammad Waleed Gondal, Manuel Wüthrich, Đorđe Miladinović, Francesco Locatello, Martin Breidt, Valentin Volchkov, Joel Akpo, Olivier Bachem, Bernhard Schölkopf, Stefan Bauer

On the Fairness of Disentangled Representations
Francesco Locatello, Gabriele Abbati, Tom Rainforth, Stefan Bauer, Bernhard Schölkopf, Olivier Bachem

Are Disentangled Representations Helpful for Abstract Visual Reasoning?
Sjoerd van Steenkiste, Francesco Locatello, Jürgen Schmidhuber, Olivier Bachem

Don’t Blame the ELBO! A Linear VAE Perspective on Posterior Collapse
James Lucas, George Tucker, Roger Grosse, Mohammad Norouzi

Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction
Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, Sergey Levine

Optimizing Generalized Rate Metrics with Game Equilibrium
Harikrishna Narasimhan, Andrew Cotter, Maya Gupta

On Making Stochastic Classifiers Deterministic
Andrew Cotter, Harikrishna Narasimhan, Maya Gupta

Discrete Flows: Invertible Generative Models of Discrete Data
Dustin Tran, Keyon Vafa, Kumar Agrawal, Laurent Dinh, Ben Poole

Graph Agreement Models for Semi-Supervised Learning
Otilia Stretcu, Krishnamurthy Viswanathan, Dana Movshovitz-Attias, Emmanouil Platanios, Andrew Tomkins, Sujith Ravi

A Robust Non-Clairvoyant Dynamic Mechanism for Contextual Auctions
Yuan Deng, Sébastien Lahaie, Vahab Mirrokni

Adversarial Robustness through Local Linearization
Chongli Qin, James Martens, Sven Gowal, Dilip Krishnan, Krishnamurthy (Dj) Dvijotham, Alhussein Fawzi, Soham De, Robert Stanforth, Pushmeet Kohli

A Geometric Perspective on Optimal Representations for Reinforcement Learning
Marc G. Bellemare, Will Dabney, Robert Dadashi, Adrien Ali Taiga, Pablo Samuel Castro, Nicolas Le Roux, Dale Schuurmans, Tor Lattimore, Clare Lyle

Online Learning via the Differential Privacy Lens
Jacob Abernethy, Young Hun Jung, Chansoo Lee, Audra McMillan, Ambuj Tewari

Reducing the Variance in Online Optimization by Transporting Past Gradients
Sébastien M. R. Arnold, Pierre-Antoine Manzagol, Reza Babanezhad, Ioannis Mitliagkas, Nicolas Le Roux

Universality and Individuality in Neural Dynamics Across Large Populations of Recurrent Networks
Niru Maheswaranathan, Alex Williams, Matt Golub, Surya Ganguli, David Sussillo

Reverse Engineering Recurrent Networks for Sentiment Classification Reveals Line Attractor Dynamics
Niru Maheswaranathan, Alex H. Williams, Matthew D. Golub, Surya Ganguli, David Sussillo

Strategizing Against No-Regret Learners
Yuan Deng, Jon Schneider, Balasubramanian Sivan

Prior-Free Dynamic Auctions with Low Regret Buyers
Yuan Deng, Jon Schneider, Balasubramanian Sivan

Private Stochastic Convex Optimization with Optimal Rates
Raef Bassily, Vitaly Feldman, Kunal Talwar, Abhradeep Thakurta

Computational Separations between Sampling and Optimization
Kunal Talwar

Momentum-Based Variance Reduction in Non-Convex SGD
Ashok Cutkosky and Francesco Orabona

Kernel Truncated Randomized Ridge Regression: Optimal Rates and Low Noise Acceleration
Kwang-Sung Jun, Ashok Cutkosky, Francesco Orabona

Fast and Flexible Multi-Task Classification using Conditional Neural Adaptive Processes
James Requeima, Jonathan Gordon, John Bronskill, Sebastian Nowozin, Richard E. Turner

Icebreaker: Element-wise Active Information Acquisition with Bayesian Deep Latent Gaussian Model
Wenbo Gong, Sebastian Tschiatschek, Richard E. Turner, Sebastian Nowozin, Jose Miguel Hernandez-Lobato, Cheng Zhang

Multiview Aggregation for Learning Category-Specific Shape Reconstruction
Srinath Sridhar, Davis Rempe, Julien Valentin, Sofien Bouaziz, Leonidas J. Guibas

Visualizing and Measuring the Geometry of BERT
Andy Coenen, Emily Reif, Ann Yuan, Been Kim, Adam Pearce, Fernanda Viégas, Martin Wattenberg

Locality-Sensitive Hashing for f-Divergences: Mutual Information Loss and Beyond
Lin Chen, Hossein Esfandiari, Thomas Fu, Vahab S. Mirrokni

A Benchmark for Interpretability Methods in Deep Neural Networks
Sara Hooker, Dumitru Erhan, Pieter-jan Kindermans, Been Kim

Practical and Consistent Estimation of f-Divergences
Paul Rubenstein, Olivier Bousquet, Josip Djolonga, Carlos Riquelme, Ilya Tolstikhin

Tree-Sliced Variants of Wasserstein Distances
Tam Le, Makoto Yamada, Kenji Fukumizu, Marco Cuturi

Game Design for Eliciting Distinguishable Behavior
Fan Yang, Liu Leqi, Yifan Wu, Zachary Lipton, Pradeep Ravikumar, Tom M Mitchell, William Cohen

Differentially Private Anonymized Histograms
Ananda Theertha Suresh

Locally Private Gaussian Estimation
Matthew Joseph, Janardhan Kulkarni, Jieming Mao, Zhiwei Steven Wu

Exponential Family Estimation via Adversarial Dynamics Embedding
Bo Dai, Zhen Liu, Hanjun Dai, Niao He, Arthur Gretton, Le Song, Dale Schuurmans

Learning to Predict Without Looking Ahead: World Models Without Forward Prediction
C. Daniel Freeman, Luke Metz, David Ha

Adaptive Density Estimation for Generative Models
Thomas Lucas, Konstantin Shmelkov, Karteek Alahari, Cordelia Schmid, Jakob Verbeek

Weight Agnostic Neural Networks
Adam Gaier, David Ha

Retrosynthesis Prediction with Conditional Graph Logic Network
Hanjun Dai, Chengtao Li, Connor Coley, Bo Dai, Le Song

Large Scale Structure of Neural Network Loss Landscapes
Stanislav Fort, Stanislaw Jastrzebski

Off-Policy Evaluation via Off-Policy Classification
Alex Irpan, Kanishka Rao, Konstantinos Bousmalis, Chris Harris, Julian Ibarz, Sergey Levine

Domes to Drones: Self-Supervised Active Triangulation for 3D Human Pose Reconstruction
Aleksis Pirinen, Erik Gartner, Cristian Sminchisescu

Energy-Inspired Models: Learning with Sampler-Induced Distributions
Dieterich Lawson, George Tucker, Bo Dai, Rajesh Ranganath

From Deep Learning to Mechanistic Understanding in Neuroscience: The Structure of Retinal Prediction
Hidenori Tanaka, Aran Nayebi, Niru Maheswaranathan, Lane McIntosh, Stephen Baccus, Surya Ganguli

Language as an Abstraction for Hierarchical Deep Reinforcement Learning
Yiding Jiang, Shixiang Gu, Kevin Murphy, Chelsea Finn

Bayesian Layers: A Module for Neural Network Uncertainty
Dustin Tran, Michael W. Dusenberry, Mark van der Wilk, Danijar Hafner

Adaptive Temporal-Difference Learning for Policy Evaluation with Per-State Uncertainty Estimates
Hugo Penedones, Carlos Riquelme, Damien Vincent, Hartmut Maennel, Timothy Mann, Andre Barreto, Sylvain Gelly, Gergely Neu

A Unified Framework for Data Poisoning Attack to Graph-based Semi-Supervised Learning
Xuanqing Liu, Si Si, Xiaojin Zhu, Yang Li, Cho-Jui Hsieh

MixMatch: A Holistic Approach to Semi-Supervised Learning
David Berthelot, Nicholas Carlini, Ian Goodfellow (work done while at Google), Avital Oliver, Nicolas Papernot, Colin Raffel

SMILe: Scalable Meta Inverse Reinforcement Learning through Context-Conditional Policies
Seyed Kamyar Seyed Ghasemipour, Shixiang (Shane) Gu, Richard Zemel

Limits of Private Learning with Access to Public Data
Noga Alon, Raef Bassily, Shay Moran

Regularized Weighted Low Rank Approximation
Frank Ban, David Woodruff, Richard Zhang

Unsupervised Curricula for Visual Meta-Reinforcement Learning
Allan Jabri, Kyle Hsu, Abhishek Gupta, Benjamin Eysenbach, Sergey Levine, Chelsea Finn

Secretary Ranking with Minimal Inversions
Sepehr Assadi, Eric Balkanski, Renato Paes Leme

Mixtape: Breaking the Softmax Bottleneck Efficiently
Zhilin Yang, Thang Luong, Russ Salakhutdinov, Quoc V. Le

Budgeted Reinforcement Learning in Continuous State Space
Nicolas Carrara, Edouard Leurent, Romain Laroche, Tanguy Urvoy, Odalric-Ambrym Maillard, Olivier Pietquin

From Complexity to Simplicity: Adaptive ES-Active Subspaces for Blackbox Optimization
Krzysztof Choromanski, Aldo Pacchiano, Jack Parker-Holder, Yunhao Tang

Generalization Bounds for Neural Networks via Approximate Description Length
Amit Daniely, Elad Granot

Flattening a Hierarchical Clustering through Active Learning
Fabio Vitale, Anand Rajagopalan, Claudio Gentile

Robust Attribution Regularization
Jiefeng Chen, Xi Wu, Vaibhav Rastogi, Yingyu Liang, Somesh Jha

Robustness Verification of Tree-based Models
Hongge Chen, Huan Zhang, Si Si, Yang Li, Duane Boning, Cho-Jui Hsieh

Meta Architecture Search
Albert Shaw, Wei Wei, Weiyang Liu, Le Song, Bo Dai

Contextual Bandits with Cross-Learning
Santiago Balseiro, Negin Golrezaei, Mohammad Mahdian, Vahab Mirrokni, Jon Schneider

Dynamic Incentive-Aware Learning: Robust Pricing in Contextual Auctions
Negin Golrezaei, Adel Javanmard, Vahab Mirrokni

Optimizing Generalized Rate Metrics with Three Players
Harikrishna Narasimhan, Andrew Cotter, Maya Gupta

Noise-Tolerant Fair Classification
Alexandre Louis Lamy, Ziyuan Zhong, Aditya Krishna Menon, Nakul Verma

Towards Automatic Concept-based Explanations
Amirata Ghorbani, James Wexler, James Zou, Been Kim

Locally Private Learning without Interaction Requires Separation
Amit Daniely, Vitaly Feldman

Learning GANs and Ensembles Using Discrepancy
Ben Adlam, Corinna Cortes, Mehryar Mohri, Ningshan Zhang

CondConv: Conditionally Parameterized Convolutions for Efficient Inference
Brandon Yang, Gabriel Bender, Quoc V. Le, Jiquan Ngiam

A Fourier Perspective on Model Robustness in Computer Vision
Dong Yin, Raphael Gontijo Lopes, Jonathon Shlens, Ekin D. Cubuk, Justin Gilmer

Robust Bi-Tempered Logistic Loss Based on Bregman Divergences
Ehsan Amid, Manfred K. Warmuth, Rohan Anil, Tomer Koren

When Does Label Smoothing Help?
Rafael Müller, Simon Kornblith, Geoffrey Hinton

Memory Efficient Adaptive Optimization
Rohan Anil, Vineet Gupta, Tomer Koren, Yoram Singer

Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model
Guodong Zhang, Lala Li, Zachary Nado, James Martens, Sushant Sachdeva, George E. Dahl, Christopher J. Shallue, Roger Grosse

Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent
Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, Jeffrey Pennington

Universality and Individuality in Neural Dynamics Across Large Populations of Recurrent Networks
Niru Maheswaranathan, Alex H. Williams, Matthew D. Golub, Surya Ganguli, David Sussillo

Abstract Reasoning with Distracting Features
Kecheng Zheng, Zheng-Jun Zha, Wei Wei

Search on the Replay Buffer: Bridging Planning and Reinforcement Learning
Benjamin Eysenbach, Ruslan Salakhutdinov, Sergey Levine

Differentiable Ranking and Sorting Using Optimal Transport
Marco Cuturi, Olivier Teboul, Jean-Philippe Vert

XLNet: Generalized Autoregressive Pretraining for Language Understanding
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le

Private Learning Implies Online Learning: An Efficient Reduction
Alon Gonen, Elad Hazan, Shay Moran

Evaluating Protein Transfer Learning with TAPE
Roshan Rao, Nicholas Bhattacharya, Neil Thomas, Yan Duan, Peter Chen, John Canny, Pieter Abbeel, Yun Song

Tight Dimensionality Reduction for Sketching Low Degree Polynomial Kernels
Michela Meister, Tamas Sarlos, David P. Woodruff

No Pressure! Addressing the Problem of Local Minima in Manifold Learning Algorithms
Max Vladymyrov

Subspace Detours: Building Transport Plans that are Optimal on Subspace Projections
Boris Muzellec, Marco Cuturi

Online Stochastic Shortest Path with Bandit Feedback and Unknown Transition Function
Aviv Rosenberg, Yishay Mansour

On the Fairness of Disentangled Representations
Francesco Locatello, Gabriele Abbati, Tom Rainforth, Stefan Bauer, Bernhard Schölkopf, Olivier Bachem

On the Transfer of Inductive Bias from Simulation to the Real World: a New Disentanglement Dataset
Muhammad Waleed Gondal, Manuel Wüthrich, Đorđe Miladinović, Francesco Locatello, Martin Breidt, Valentin Volchkov, Joel Akpo, Olivier Bachem, Bernhard Schölkopf, Stefan Bauer

Stacked Capsule Autoencoders
Adam R. Kosiorek, Sara Sabour, Yee Whye Teh, Geoffrey E. Hinton

Wasserstein Dependency Measure for Representation Learning
Sherjil Ozair, Corey Lynch, Yoshua Bengio, Aaron van den Oord, Sergey Levine, Pierre Sermanet

Sampling Sketches for Concave Sublinear Functions of Frequencies
Edith Cohen, Ofir Geri

Hamiltonian Neural Networks
Sam Greydanus, Misko Dzamba, Jason Yosinski

Evaluating Protein Transfer Learning with TAPE
Roshan Rao, Nicholas Bhattacharya, Neil Thomas, Yan Duan, Xi Chen, John Canny, Pieter Abbeel, Yun S. Song

Computational Mirrors: Blind Inverse Light Transport by Deep Matrix Factorization
Miika Aittala, Prafull Sharma, Lukas Murmann, Adam B. Yedidia, Gregory W. Wornell, William T. Freeman, Frédo Durand

Quadratic Video Interpolation
Xiangyu Xu, Li Siyao, Wenxiu Sun, Qian Yin, Ming-Hsuan Yang

Transfusion: Understanding Transfer Learning for Medical Imaging
Maithra Raghu, Chiyuan Zhang, Jon Kleinberg, Samy Bengio

Differentially Private Covariance Estimation
Kareem Amin, Travis Dick, Alex Kulesza, Andres Munoz, Sergei Vassilvitskii

Learning Transferable Graph Exploration
Hanjun Dai, Yujia Li, Chenglong Wang, Rishabh Singh, Po-Sen Huang, Pushmeet Kohli

Neural Attribution for Semantic Bug-Localization in Student Programs
Rahul Gupta, Aditya Kanade, Shirish Shevade

PyTorch: An Imperative Style, High-Performance Deep Learning Library
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, Soumith Chintala

Breaking the Glass Ceiling for Embedding-Based Classifiers for Large Output Spaces
Chuan Guo, Ali Mousavi, Xiang Wu, Daniel Holtmann-Rice, Satyen Kale, Sashank Reddi, Sanjiv Kumar

Efficient Rematerialization for Deep Networks
Ravi Kumar, Manish Purohit, Zoya Svitkina, Erik Vee, Joshua R. Wang

Workshops
3rd Conversational AI: Today’s Practice and Tomorrow’s Potential
Organizers include: Bill Byrne

AI for Humanitarian Assistance and Disaster Response Workshop
Invited Speakers include: Yossi Matias

Bayesian Deep Learning
Organizers include: Kevin P Murphy

Beyond First Order Methods in Machine Learning Systems
Invited Speakers include: Elad Hazan

Biological and Artificial Reinforcement Learning
Invited Speakers include: Igor Mordatch

Context and Compositionality in Biological and Artificial Neural Systems
Invited Speakers include: Kenton Lee

Deep Reinforcement Learning
Organizers include: Chelsea Finn

Document Intelligence
Organizers include: Tania Bedrax Weiss

Federated Learning for Data Privacy and Confidentiality
Organizers include: Jakub Konečný, Brendan McMahan
Invited Speakers include: Françoise Beaufays, Daniel Ramage

Graph Representation Learning
Organizers include: Rianne van den Berg

Human-Centric Machine Learning
Invited Speakers include: Been Kim

Information Theory and Machine Learning
Organizers include: Ben Poole
Invited Speakers include: Alex Alemi

KR2ML – Knowledge Representation and Reasoning Meets Machine Learning
Invited Speakers include: William Cohen

Learning Meaningful Representations of Life
Organizers include: Jasper Snoek, Alexander Wiltschko

Learning Transferable Skills
Invited Speakers include: David Ha

Machine Learning for Creativity and Design
Organizers include: Adam Roberts, Jesse Engel

Machine Learning for Health (ML4H): What Makes Machine Learning in Medicine Different?
Invited Speakers include: Lily Peng, Alan Karthikesalingam, Dale Webster

Machine Learning and the Physical Sciences
Speakers include: Yasaman Bahri, Samuel Schoenholz

ML for Systems
Organizers include: Milad Hashemi, Kevin Swersky, Azalia Mirhoseini, Anna Goldie
Invited Speakers include: Jeff Dean

Optimal Transport for Machine Learning
Organizers include: Marco Cuturi

The Optimization Foundations of Reinforcement Learning
Organizers include: Bo Dai, Nicolas Le Roux, Lihong Li, Dale Schuurmans

Privacy in Machine Learning
Invited Speakers include: Brendan McMahan

Program Transformations for ML
Organizers include: Pascal Lamblin, Alexander Wiltschko, Bart van Merrienboer, Emily Fertig
Invited Speakers include: Skye Wanderman-Milne

Real Neurons & Hidden Units: Future Directions at the Intersection of Neuroscience and Artificial Intelligence
Organizers include: David Sussillo

Robot Learning: Control and Interaction in the Real World
Organizers include: Stefan Schaal

Safety and Robustness in Decision Making
Organizers include: Yinlam Chow

Science Meets Engineering of Deep Learning
Invited Speakers include: Yasaman Bahri, Surya Ganguli, Been Kim

Sets and Partitions
Organizers include: Manzil Zaheer, Andrew McCallum
Invited Speakers include: Amr Ahmed

Tackling Climate Change with ML
Organizers include: John Platt
Invited Speakers include: Jeff Dean

Visually Grounded Interaction and Language
Invited Speakers include: Jason Baldridge

Workshop on Machine Learning with Guarantees
Invited Speakers include: Mehryar Mohri

Tutorials
Representation Learning and Fairness
Organizers include: Moustapha Cisse, Sanmi Koyejo

Understanding Transfer Learning for Medical Imaging

As deep neural networks are applied to an increasingly diverse set of domains, transfer learning has emerged as a highly popular technique in developing deep learning models. In transfer learning, the neural network is trained in two stages: 1) pretraining, where the network is generally trained on a large-scale benchmark dataset representing a wide diversity of labels/categories (e.g., ImageNet); and 2) fine-tuning, where the pretrained network is further trained on the specific target task of interest, which may have fewer labeled examples than the pretraining dataset. The pretraining step helps the network learn general features that can be reused on the target task.
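
As a rough illustration of this two-stage recipe, the sketch below fine-tunes an ImageNet-pretrained ResNet50 in tf.keras on a hypothetical target task; the class count, image size, and optimizer settings are placeholder assumptions, not the configurations used in the paper.

import tensorflow as tf

NUM_CLASSES = 5  # placeholder number of target-task labels

# Stage 1: start from parameters pretrained on ImageNet (weights="imagenet").
base = tf.keras.applications.ResNet50(
    include_top=False, weights="imagenet",
    input_shape=(224, 224, 3), pooling="avg")

# Stage 2: attach a task-specific head and fine-tune the whole network
# on the (typically much smaller) labeled target dataset.
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(base.output)
model = tf.keras.Model(base.input, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(target_train_ds, validation_data=target_val_ds, epochs=10)

Training from random initialization amounts to passing weights=None instead of weights="imagenet" and otherwise running the same pipeline, which is the comparison studied below.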

This kind of two-stage paradigm has become extremely popular in many settings, and particularly so in medical imaging. In the context of transfer learning, standard architectures designed for ImageNet with corresponding pretrained weights are fine-tuned on medical tasks ranging from interpreting chest x-rays and identifying eye diseases, to early detection of Alzheimer’s disease. Despite its widespread use, however, the precise effects of transfer learning are not yet well understood. While recent work challenges many common assumptions, including the effects on performance improvement, contribution of the underlying architecture and impact of pretraining dataset type and size, these results are all in the natural image setting, and leave many questions open for specialized domains, such as medical images.

In our NeurIPS 2019 paper, “Transfusion: Understanding Transfer Learning for Medical Imaging,” we investigate these central questions for transfer learning in medical imaging tasks. Through both a detailed performance evaluation and analysis of neural network hidden representations, we uncover many surprising conclusions, such as the limited benefits of transfer learning for performance on the tested medical imaging tasks, a detailed characterization of how representations evolve through the training process across different models and hidden layers, and feature independent benefits of transfer learning for convergence speed.

Performance Evaluation
We first performed a thorough study of the effect of transfer learning on model performance. We compared models trained from random initialization and applied directly to the tasks with models pretrained on ImageNet and then fine-tuned on the same tasks. We looked at two large-scale medical imaging tasks — diagnosing diabetic retinopathy from fundus photographs and identifying five different diseases from chest x-rays. We evaluated various neural network architectures, including both standard architectures popularly used for medical imaging (ResNet50, Inception-v3) and a family of simple, lightweight convolutional neural networks consisting of four or five layers of the standard convolution-batchnorm-ReLU progression, or CBRs.
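
For a sense of scale, a four-block CBR network could be sketched as follows; the filter counts and strides are illustrative assumptions, not the exact CBR configurations evaluated in the paper.

import tensorflow as tf

def cbr_block(x, filters):
    # One "CBR" unit: convolution, batch normalization, ReLU.
    x = tf.keras.layers.Conv2D(filters, 3, strides=2, padding="same")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU()(x)

def build_cbr(input_shape=(224, 224, 3), num_classes=5, widths=(32, 64, 128, 256)):
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    for w in widths:  # four CBR blocks
        x = cbr_block(x, w)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)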

The results from evaluating all of these models on the different tasks with and without transfer learning give us four main takeaways:

  • Surprisingly, transfer learning does not significantly affect performance on medical imaging tasks, with models trained from scratch performing nearly as well as standard ImageNet transferred models.
  • On the medical imaging tasks, the much smaller CBR models perform at a level comparable to the standard ImageNet architectures.
  • As the CBR models are much smaller and shallower than the standard ImageNet models, they perform much worse on ImageNet classification, highlighting that ImageNet performance is not indicative of performance on medical tasks.
  • The two medical tasks are much smaller in size than ImageNet (~200k vs ~1.2m training images), but in the very small data regime, there may be only a few thousand training examples. We evaluated transfer learning in this very small data regime, finding that while there was a larger gap in performance between transfer and training from scratch for large models (ResNet), this was not true for smaller models (CBRs), suggesting that the large models designed for ImageNet might be too overparameterized for the very small data regime.

Representation Analysis
We next study the degree to which transfer learning affects the kinds of features and representations learned by the neural networks. Given the similar performance, does transfer learning result in different representations from random initialization? Is knowledge from the pretraining step reused, and if so, where? To find answers to these questions, this study analyzes and compares the hidden representations (i.e., representations learned in the latent layers of the network) in the different neural networks trained to solve these tasks. This quantitative analysis can be challenging, due to the complexity and lack of alignment in different hidden layers. But a recent method, singular vector canonical correlation analysis (SVCCA; code and tutorials), based on canonical correlation analysis (CCA), helps overcome these challenges, and can be used to calculate a similarity score between a pair of hidden representations.
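
The sketch below gives a rough idea of how such a similarity score can be computed between two activation matrices (examples by neurons); it is a simplified stand-in for the released SVCCA code, and the number of retained dimensions is an arbitrary choice.

import numpy as np

def svcca_similarity(acts1, acts2, keep_dims=20):
    # Rough SVCCA-style score between two activation matrices of shape
    # (n_examples, n_neurons). Simplified illustration, not the official code.
    def svd_reduce(a, k):
        a = a - a.mean(axis=0, keepdims=True)          # center each neuron
        u, s, vt = np.linalg.svd(a, full_matrices=False)
        return u[:, :k] * s[:k]                        # top-k singular directions ("SV" step)
    x, y = svd_reduce(acts1, keep_dims), svd_reduce(acts2, keep_dims)
    # CCA via the singular values of Qx^T Qy, where Qx, Qy are orthonormal bases.
    qx, _ = np.linalg.qr(x)
    qy, _ = np.linalg.qr(y)
    corrs = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return corrs.mean()                                # mean canonical correlation in [0, 1]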

Similarity scores are computed for some of the hidden representations from the top latent layers of the networks (closer to the output) between networks trained from random initialization and networks trained from pretrained ImageNet weights. As a baseline, we also compute similarity scores of representations learned from different random initializations. For large models, representations learned from random initialization are much more similar to each other than those learned from transfer learning. For smaller models, there is greater overlap between representation similarity scores.

Representation similarity scores between networks trained from random initialization and networks trained from pretrained ImageNet weights (orange), and baseline similarity scores of representations trained from two different random initializations (blue). Higher values indicate greater similarity. For larger models, representations learned from random initialization are much more similar to each other than those learned through transfer. This is not the case for smaller models.

The reason for this difference between large and small models becomes clear with further investigation into the hidden representations. Large models change less through training, even from random initialization. We perform multiple experiments that illustrate this, from simple filter visualizations to tracking changes between different layers through fine-tuning.

When we combine the results of all the experiments from the paper, we can assemble a table summarizing how much representations change through training on the medical task across (i) transfer learning, (ii) model size and (iii) lower/higher layers.

Effects on Convergence: Feature Independent Benefits and Hybrid Approaches
One consistent effect of transfer learning was a significant speedup in the time taken for the model to converge. But having seen the mixed results for feature reuse from our representational study, we looked into whether there were other properties of the pretrained weights that might contribute to this speedup. Surprisingly, we found a feature independent benefit of pretraining — the weight scaling.

We initialized the weights of the neural network as independent and identically distributed (iid), just like random initialization, but using the mean and variance of the pretrained weights. We called this initialization the Mean Var Init, which keeps the pretrained weight scaling but destroys all the features. This Mean Var Init offered significant speedups over random initialization across model architectures and tasks, suggesting that the pretraining process of transfer learning also helps with good weight conditioning.

Filter visualization of weights initialized according to pretrained ImageNet weights, Random Init, and Mean Var Init. Only the ImageNet Init filters have pretrained (Gabor-like) structure, as Rand Init and Mean Var weights are iid.
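
A minimal sketch of the Mean Var Init idea, assuming two tf.keras models with identical architectures; the exact implementation used in the paper may differ.

import numpy as np

def mean_var_init(pretrained_model, fresh_model):
    # Keep only the scale statistics of each pretrained weight tensor, not the features:
    # draw new iid values with the same mean and standard deviation.
    for src, dst in zip(pretrained_model.weights, fresh_model.weights):
        w = src.numpy()
        sample = np.random.normal(loc=w.mean(), scale=w.std(), size=w.shape)
        dst.assign(sample.astype(w.dtype))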

Recall that our earlier experiments suggested that feature reuse primarily occurs in the lowest layers. To understand this, we performed weight transfusion experiments, where only a subset of the pretrained weights (corresponding to a contiguous set of layers) are transferred, with the remainder of weights being randomly initialized. Comparing convergence speeds of these transfused networks with full transfer learning further supports the conclusion that feature reuse is primarily happening in the lowest layers.

Learning curves comparing the convergence speed with AUC on the test set. Using only the scaling of the pretrained weights (Mean Var Init) helps with convergence speed. The figures compare the standard transfer learning and the Mean Var initialization scheme to training from random initialization.
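
A sketch of one such transfusion experiment, assuming two tf.keras models with identical layer ordering; this is a hypothetical helper, not the code used in the paper.

def transfuse_lowest_layers(pretrained_model, fresh_model, n_layers):
    # Copy pretrained weights for a contiguous block of the lowest layers only;
    # every layer above the block keeps its random initialization.
    for src, dst in zip(pretrained_model.layers[:n_layers],
                        fresh_model.layers[:n_layers]):
        dst.set_weights(src.get_weights())
    return fresh_model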

This suggests hybrid approaches to transfer learning, where instead of reusing the full neural network architecture, we can recycle its lowest layers and redesign the upper layers to better suit the target task. This gives us most of the benefits of transfer learning while further enabling flexible model design. In the figure below, we show the effect of reusing pretrained weights up to Block2 in ResNet50, halving the remainder of the channels, initializing those layers randomly, and then training end-to-end. This matches the performance and convergence of full transfer learning.

Hybrid approaches to transfer learning on ResNet50 (left) and CBR models (right) — reusing a subset of the weights and slimming the remainder of the network (Slim), and using mathematically synthesized Gabors for conv1 (Synthetic Gabor).

The figure above also shows the results of an extreme version of this partial reuse, transferring only the very first convolutional layer with mathematically synthesized Gabor filters (pictured below). Using just these (synthetic) weights offers significant speedups, and hints at many other creative hybrid approaches.

Synthetic Gabor filters used to initialize the first layer of neural networks in some of the experiments in this paper. The Gabor filters are generated as grayscale images and repeated across the RGB channels. Left: Low frequencies. Right: High frequencies.
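
For illustration, synthetic Gabor filters of this kind can be generated directly from the Gabor equation; the sizes, wavelengths, and orientations below are arbitrary assumptions rather than the exact parameters used in the paper.

import numpy as np

def gabor_filter(size=7, wavelength=4.0, theta=0.0, sigma=2.0):
    # A single Gabor kernel: a cosine carrier modulated by a Gaussian envelope,
    # generated as grayscale and repeated across the RGB channels.
    half = size // 2
    grid = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float32)
    y, x = grid[0], grid[1]
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    carrier = np.cos(2.0 * np.pi * x_rot / wavelength)
    g = envelope * carrier
    return np.repeat(g[:, :, None], 3, axis=2)

# A bank of 16 orientations, shaped (7, 7, 3, 16) like a conv1 kernel.
bank = np.stack(
    [gabor_filter(theta=t) for t in np.linspace(0.0, np.pi, 16, endpoint=False)],
    axis=-1)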

Conclusion and Open Questions
Transfer learning is a central technique for many domains. In this paper we provide insights on some of its fundamental properties in the medical imaging context, studying performance, feature reuse, the effect of different architectures, convergence and hybrid approaches. Many interesting open questions remain: How much of the original task has the model forgotten? Why do large models change less? Can we get further gains matching higher order moments of pretrained weight statistics? Are the results similar for other tasks, such as segmentation? We look forward to tackling these questions in future work!

Acknowledgements
Special thanks to Samy Bengio and Jon Kleinberg, who are co-authors on this work. Thanks also to Geoffrey Hinton for helpful feedback.

Developing Deep Learning Models for Chest X-rays with Adjudicated Image Labels

With millions of diagnostic examinations performed annually, chest X-rays are an important and accessible clinical imaging tool for the detection of many diseases. However, their usefulness can be limited by challenges in interpretation, which requires rapid and thorough evaluation of a two-dimensional image depicting complex, three-dimensional organs and disease processes. Indeed, early-stage lung cancers or pneumothoraces (collapsed lungs) can be missed on chest X-rays, leading to serious adverse outcomes for patients.

Advances in machine learning (ML) present an exciting opportunity to create new tools to help experts interpret medical images. Recent efforts have shown promise in improving lung cancer detection in radiology, prostate cancer grading in pathology, and differential diagnoses in dermatology. For chest X-ray images in particular, large, de-identified public image sets are available to researchers across disciplines, and have facilitated several valuable efforts to develop deep learning models for X-ray interpretation. However, obtaining accurate clinical labels for the very large image sets needed for deep learning can be difficult. Most efforts have either applied rule-based natural language processing (NLP) to radiology reports or relied on image review by individual readers, both of which may introduce inconsistencies or errors that can be especially problematic during model evaluation. Another challenge involves assembling datasets that represent an adequately diverse spectrum of cases (i.e., ensuring inclusion of both “hard” cases and “easy” cases that represent the full spectrum of disease presentation). Finally, some chest X-ray findings are non-specific and depend on clinical information about the patient to fully understand their significance. As such, establishing labels that are clinically meaningful and have consistent definitions can be a challenging component of developing machine learning models that use only the image as input. Without standardized and clinically meaningful datasets as well as rigorous reference standard methods, successful application of ML to interpretation of chest X-rays will be hindered.

To help address these issues, we recently published “Chest Radiograph Interpretation with Deep Learning Models: Assessment with Radiologist-adjudicated Reference Standards and Population-adjusted Evaluation” in the journal Radiology. In this study we developed deep learning models to classify four clinically important findings on chest X-rays — pneumothorax, nodules and masses, fractures, and airspace opacities. These target findings were selected in consultation with radiologists and clinical colleagues, so as to focus on conditions that are both critical for patient care and for which chest X-ray images alone are an important and accessible first-line imaging study. Selection of these findings also allowed model evaluation using only de-identified images without additional clinical data.

Models were evaluated using thousands of held-out images from each dataset for which we collected high-quality labels using a panel-based adjudication process among board-certified radiologists. Four separate radiologists also independently reviewed the held-out images in order to compare radiologist accuracy to that of the deep learning models (using the panel-based image labels as the reference standard). For all four findings and across both datasets, the deep learning models demonstrated radiologist-level performance. We are sharing the adjudicated labels for the publicly available data here to facilitate additional research.

Data Overview
This work leveraged over 600,000 images sourced from two de-identified datasets. The first dataset was developed in collaboration with co-authors at the Apollo Hospitals, and consists of a diverse set of chest X-rays obtained over several years from multiple locations across the Apollo Hospitals network. The second dataset is the publicly available ChestX-ray14 image set released by the National Institutes of Health (NIH). This second dataset has served as an important resource for many machine learning efforts, yet has limitations stemming from issues with the accuracy and clinical interpretation of the currently available labels.

Chest X-ray depicting an upper left lobe pneumothorax identified by the model and the adjudication panel, but missed by the individual radiologist readers. Left: The original image. Right: The same image with the most important regions for the model prediction highlighted in orange.

Training Set Labels Using Deep Learning and Visual Image Review
For very large datasets consisting of hundreds of thousands of images, such as those needed to train highly accurate deep learning models, it is impractical to manually assign image labels. As such, we developed a separate, text-based deep learning model to extract image labels using the de-identified radiology reports associated with each X-ray. This NLP model was then applied to provide labels for over 560,000 images from the Apollo Hospitals dataset used for training the computer vision models.

To reduce noise from any errors introduced by the text-based label extraction and also to provide the relevant labels for a substantial number of the ChestX-ray14 images, approximately 37,000 images across the two datasets were visually reviewed by radiologists. These were separate from the NLP-based labels and helped to ensure high quality labels across such a large, diverse set of training images.

Creating and Sharing Improved Reference Standard Labels
To generate high-quality reference standard labels for model evaluation, we utilized a panel-based adjudication process, whereby three radiologists reviewed all final tune and test set images and resolved disagreements through discussion. This often allowed difficult findings that were initially only detected by a single radiologist to be identified and documented appropriately. To reduce the risk of bias based on any individual radiologist’s personality or seniority, the discussions took place anonymously via an online discussion and adjudication system.

Because the lack of available adjudicated labels was a significant initial barrier to our work, we are sharing with the research community all of the adjudicated labels for the publicly available ChestX-ray14 dataset, including 2,412 training/validation set images and 1,962 test set images (4,374 images in total). We hope that these labels will facilitate future machine learning efforts and enable better apples-to-apples comparisons between machine learning models for chest X-ray interpretation.

Future Outlook
This work presents several contributions: (1) releasing adjudicated labels for images from a publicly available dataset; (2) a method to scale accurate labeling of training data using a text-based deep learning model; (3) evaluation using a diverse set of images with expert-adjudicated reference standard labels; and ultimately (4) radiologist-level performance of deep learning models for clinically important findings on chest X-rays.

However, with regard to model performance, achieving expert-level accuracy on average is just part of the story. Even though overall accuracy for the deep learning models was consistently similar to that of radiologists for any given finding, performance for both varied across datasets. For example, the sensitivity for detecting pneumothorax among radiologists was approximately 79% for the ChestX-ray14 images, but was only 52% for the same radiologists on the other dataset, suggesting a more difficult collection of cases in the latter. This highlights the importance of validating deep learning tools on multiple, diverse datasets and eventually across the patient populations and clinical settings in which any model is intended to be used.

The performance differences between datasets also emphasize the need for standardized evaluation image sets with accurate reference standards in order to allow comparison across studies. For example, if two different models for the same finding were evaluated using different datasets, comparing performance would be of minimal value without knowing additional details such as the case mix, model error modes, or radiologist performance on the same cases.

Finally, the model often identified findings that were consistently missed by radiologists, and vice versa. As such, strategies that combine the unique “skills” of both the deep learning systems and human experts are likely to hold the most promise for realizing the potential of AI applications in medical image interpretation.

Acknowledgements
Key contributors to this project at Google include Sid Mittal, Gavin Duggan, Anna Majkowska, Scott McKinney, Andrew Sellergren, David Steiner, Krish Eswaran, Po-Hsuan Cameron Chen, Yun Liu, Shravya Shetty, and Daniel Tse. Significant contributions and input were also made by radiologist collaborators Joshua Reicher, Alexander Ding, and Sreenivasa Raju Kalidindi. The authors would also like to acknowledge many members of the Google Health radiology team including Jonny Wong, Diego Ardila, Zvika Ben-Haim, Rory Sayres, Shahar Jamshy, Shabir Adeel, Mikhail Fomitchev, Akinori Mitani, Quang Duong, William Chen and Sahar Kazemzadeh. Sincere appreciation also goes to the many radiologists who enabled this work through their expert image interpretation efforts throughout the project.

Astrophotography with Night Sight on Pixel Phones

Taking pictures of outdoor scenes at night has so far been the domain of large cameras, such as DSLRs, which are able to achieve excellent image quality, provided photographers are willing to put up with bulky equipment and sometimes tricky postprocessing. A few years ago experiments with phone camera nighttime photography produced pleasing results, but the methods employed were impractical for all but the most dedicated users.

Night Sight, introduced last year as part of the Google Camera App for the Pixel 3, allows phone photographers to take good-looking handheld shots in environments so dark that the normal camera mode would produce grainy, severely underexposed images. In a previous blog post our team described how Night Sight is able to do this, with a technical discussion presented at SIGGRAPH Asia 2019.

This year’s version of Night Sight pushes the boundaries of low-light photography with phone cameras. By allowing exposures up to 4 minutes on Pixel 4, and 1 minute on Pixel 3 and 3a, the latest version makes it possible to take sharp and clear pictures of the stars in the night sky or of nighttime landscapes without any artificial light.

The Milky Way as seen from the summit of Haleakala volcano on a cloudless and moonless September night, captured using the Google Camera App running on a Pixel 4 XL phone. The image has not been retouched or post-processed in any way. It shows significantly more detail than a person can see with the unaided eye on a night this dark. The dust clouds along the Milky Way are clearly visible, the sky is covered with thousands of stars, and unlike human night vision, the picture is colorful.

A Brief Overview of Night Sight
The amount of light detected by the camera’s image sensor inherently has some uncertainty, called “shot noise,” which causes images to look grainy. The visibility of shot noise decreases as the amount of light increases; therefore, it is best for the camera to gather as much light as possible to produce a high-quality image.

How much light reaches the image sensor in a given amount of time is limited by the aperture of the camera lens. Extending the exposure time for a photo increases the total amount of light captured, but if the exposure is long, motion in the scene being photographed and unsteadiness of the handheld camera can cause blur. To overcome this, Night Sight splits the exposure into a sequence of multiple frames with shorter exposure times and correspondingly less motion blur. The frames are first aligned, compensating for both camera shake and in-scene motion, and then averaged, with careful treatment of cases where perfect alignment is not possible. While individual frames may be fairly grainy, the combined, averaged image looks much cleaner.
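
The sketch below is a toy version of this align-and-average idea for grayscale frames; it only handles a single global translational shift (estimated with OpenCV phase correlation), whereas the real pipeline performs robust, tile-based alignment and merging.

import numpy as np
import cv2

def merge_burst(frames):
    # frames: list of same-sized float32 grayscale arrays from one burst.
    reference = frames[0]
    accumulator = reference.astype(np.float64).copy()
    for frame in frames[1:]:
        # Estimate the global shift of this frame relative to the reference...
        (dx, dy), _ = cv2.phaseCorrelate(reference, frame)
        # ...and undo it before adding the frame to the running sum.
        aligned = np.roll(frame, shift=(int(round(-dy)), int(round(-dx))), axis=(0, 1))
        accumulator += aligned
    # Averaging many aligned frames suppresses shot noise.
    return (accumulator / len(frames)).astype(np.float32)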

Experimenting with Exposure Time
Soon after the original Night Sight was released, we started to investigate taking photos in very dark outdoor environments with the goal of capturing the stars. We realized that, just as with our previous experiments, high quality pictures would require exposure times of several minutes. Clearly, this cannot work with a handheld camera; the phone would have to be placed on a tripod, a rock, or whatever else might be available to hold the camera steady.

Just as with handheld Night Sight photos, nighttime landscape shots must take motion in the scene into account — trees sway in the wind, clouds drift across the sky, and the moon and the stars rise in the east and set in the west. Viewers will tolerate motion-blurred clouds and tree branches in a photo that is otherwise sharp, but motion-blurred stars that look like short line segments look wrong. To mitigate this, we split the exposure into frames with exposure times short enough to make the stars look like points of light. Taking pictures of real night skies, we found that the per-frame exposure time should not exceed 16 seconds.

Motion-blurred stars in a single-frame two-minute exposure.
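
A back-of-the-envelope calculation shows why much longer per-frame exposures produce trails: the sky rotates 360 degrees per sidereal day, roughly 15 arcseconds per second of time near the celestial equator. The field of view and resolution below are rough assumptions, not official camera specifications.

import math

SIDEREAL_DAY_S = 86164.0                  # seconds for one full rotation of the sky
deg_per_second = 360.0 / SIDEREAL_DAY_S   # ~0.0042 deg/s near the celestial equator

fov_deg, width_px = 77.0, 4032            # assumed horizontal field of view and resolution
deg_per_pixel = fov_deg / width_px

for exposure_s in (4, 16, 64):
    drift_px = exposure_s * deg_per_second / deg_per_pixel
    print(f"{exposure_s:3d} s exposure -> star drifts ~{drift_px:.1f} pixels")

With these assumptions, a 16-second frame keeps the drift to a few pixels, while a minute-long frame smears stars into clearly visible trails.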

While the number of frames we can capture for a single photo, and therefore the total exposure time, is limited by technical considerations, we found that it is more tightly constrained by the photographer’s patience. Few are willing to wait more than four minutes for a picture, so we limited a single Night Sight image to at most 15 frames with up to 16 seconds per frame.

Sixteen-second exposures allow us to capture enough light to produce recognizable images, but a usable camera app capable of taking pictures that look great must deal with additional issues that are unique to low-light photography.

Dark Current and Hot Pixels
Dark current causes CMOS image sensors to record a spurious signal, as if the pixels were exposed to a small amount of light, even when no actual light is present. The effect is negligible when exposure times are short, but it becomes significant with multi-second captures. Due to unavoidable imperfections in the sensor’s silicon substrate, some pixels exhibit higher dark current than their neighbors. In a recorded frame these “warm pixels,” as well as defective “hot pixels,” are visible as tiny bright dots.

Warm and hot pixels can be identified by comparing the values of neighboring pixels within the same frame and across the sequence of frames recorded for a photo, and looking for outliers. Once an outlier has been detected, it is concealed by replacing its value with the average of its neighbors. Since the original pixel value is discarded, there is a loss of image information, but in practice this does not noticeably affect image quality.

Left: A small region of a long-exposure image with hot pixels, and warm pixels caused by dark current nonuniformity. Right: The same image after outliers have been removed. Fine details in the landscape, including small points of light, are preserved.
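
A toy single-frame version of this outlier concealment might look like the sketch below; the threshold is an arbitrary assumption, and the production pipeline also compares pixel values across the frames of a burst.

import numpy as np
from scipy.ndimage import median_filter

def conceal_hot_pixels(frame, threshold=6.0):
    # Flag pixels that stand far above the local median and replace them
    # with the median of their neighborhood.
    local_median = median_filter(frame, size=3)
    residual = frame - local_median
    noise_scale = residual.std() + 1e-8
    outliers = residual > threshold * noise_scale
    cleaned = frame.copy()
    cleaned[outliers] = local_median[outliers]
    return cleaned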

Scene Composition
Mobile phones use their screens as electronic viewfinders — the camera captures a continuous stream of frames that is displayed as a live video in order to aid with shot composition. The frames are simultaneously used by the camera’s autofocus, auto exposure, and auto white balance systems.

To feel responsive to the photographer, the viewfinder is updated at least 15 times per second, which limits the viewfinder frame exposure time to 66 milliseconds. This makes it challenging to display a detailed image in low-light environments. At light levels below the rough equivalent of a full moon or so, the viewfinder becomes mostly gray — maybe showing a few bright stars, but none of the landscape — and composing a shot becomes difficult.

To assist in framing the scene in extremely low light, Night Sight displays a “post-shutter viewfinder”. After the shutter button has been pressed, each long-exposure frame is displayed on the screen as soon as it has been captured. With exposure times up to 16 seconds, these frames have collected almost 250 times more light than the regular viewfinder frames, allowing the photographer to easily see image details as soon as the first frame has been captured. The composition can then be adjusted by moving the phone while the exposure continues. Once the composition is correct, the initial shot can be stopped, and a second shot can be captured where all frames have the desired composition.

Left: The live Night Sight viewfinder in a very dark outdoor environment. Except for a few points of light from distant buildings, the landscape and the sky are largely invisible. Right: The post-shutter viewfinder during a long exposure shot. The image is much clearer; it updates after every long-exposure frame.

Autofocus
Autofocus ensures that the image captured by the camera is sharp. In normal operation, the incoming viewfinder frames are analyzed to determine how far the lens must be from the sensor to produce an in-focus image, but in very low light the viewfinder frames can be so dark and grainy that autofocus fails due to lack of detectable image detail. When this happens, Night Sight on Pixel 4 switches to “post-shutter autofocus.” After the user presses the shutter button, the camera captures two autofocus frames with exposure times up to one second, long enough to detect image details even in low light. These frames are used only to focus the lens and do not contribute directly to the final image.

Even though using long-exposure frames for autofocus leads to consistently sharp images at light levels low enough that the human visual system cannot clearly distinguish objects, sometimes it gets too dark even for post-shutter autofocus. In this case the camera instead focuses at infinity. In addition, Night Sight includes manual focus buttons, allowing the user to focus on nearby objects in very dark conditions.

Sky Processing
When images of very dark environments are viewed on a screen, they are displayed much brighter than the original scenes. This can change the viewer’s perception of the time of day when the photos were captured. At night we expect the sky to be dark. If a picture taken at night shows a bright sky, then we see it as a daytime scene, perhaps with slightly unusual lighting.

This effect is countered in Night Sight by selectively darkening the sky in photos of low-light scenes. To do this, we use machine learning to detect which regions of an image represent sky. An on-device convolutional neural network, trained on over 100,000 images that were manually labeled by tracing the outlines of sky regions, identifies each pixel in a photograph as “sky” or “not sky.”

A landscape picture taken on a bright full-moon night, without sky processing (left half), and with sky darkening (right half). Note that the landscape is not darkened.

Sky detection also makes it possible to perform sky-specific noise reduction, and to selectively increase contrast to make features like clouds, color gradients, or the Milky Way more prominent.
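
Given a per-pixel sky probability from such a segmentation model, the darkening step itself can be as simple as the sketch below; the blending strength is a placeholder, and the on-device model and its tuning are not public.

import numpy as np

def darken_sky(image, sky_probability, strength=0.6):
    # image: HxWx3 float array in [0, 1]; sky_probability: HxW float array in [0, 1].
    # Pixels confidently labeled "sky" are dimmed; the landscape is left untouched.
    gain = 1.0 - strength * sky_probability[..., None]
    return np.clip(image * gain, 0.0, 1.0)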

Results
With the phone on a tripod, Night Sight produces sharp pictures of star-filled skies, and as long as there is at least a small amount of moonlight, landscapes will be clear and colorful.

Of course, the phone’s capabilities are not limitless, and there is always room for improvement. Although nighttime scenes are dark overall, they often contain bright light sources such as the moon, distant street lamps, or prominent stars. While we can capture a moonlit landscape, or details on the surface of the moon, the extremely large brightness range, which can exceed 500,000:1, so far prevents us from capturing both in the same image. Also, when the stars are the only source of illumination, we can take clear pictures of the sky, but the landscape is only visible as a silhouette.

For Pixel 4 we have been using the brightest part of the Milky Way, near the constellation Sagittarius, as a benchmark for the quality of images of a moonless sky. By that standard Night Sight is doing very well. Although Milky Way photos exhibit some residual noise, they are pleasing to look at, showing more stars and more detail than a person can see looking at the real night sky.

Examples of photos taken with the Google Camera App on Pixel 4. An album with more pictures can be found here.

Tips and Tricks
In the course of developing and testing Night Sight astrophotography we gained some experience taking outdoor nighttime pictures with Pixel phones, and we’d like to share a list of tips and tricks that have worked for us. You can find it here.

Acknowledgements
Night Sight is an ongoing collaboration between several teams at Google. Key contributors to the project include, from the Gcam team, Orly Liba, Nikhil Karnad, Charles He, Manfred Ernst, Michael Milne, Andrew Radin, Navin Sarma, Jon Barron, Yun-Ta Tsai, Tianfan Xue, Jiawen Chen, Dillon Sharlet, Ryan Geiss, Sam Hasinoff, Alex Schiffhauer, Yael Pritch Knaan and Marc Levoy; from the Super Res Zoom team, Bart Wronski, Peyman Milanfar, and Ignacio Garcia Dorado; from the Google camera app team, Emily To, Gabriel Nava, Sushil Nath, Isaac Reynolds, and Michelle Chen; from the Android platform team, Ryan Chan, Ying Chen Lou, and Bob Hung; from the Mobile Vision team, Longqi (Rocky) Cai, Huizhong Chen, Emily Manoogian, Nicole Maffeo, and Tomer Meron; from Machine Perception, Elad Eban and Yair Movshovitz-Attias.