Author: torontoai

Improving Out-of-Distribution Detection in Machine Learning Models

Successful deployment of machine learning systems requires that the system be able to detect data that is anomalous or significantly different from that used in training. This is particularly important for deep neural network classifiers, which might classify such out-of-distribution (OOD) inputs into in-distribution classes with high confidence, and it becomes critical when these predictions inform real-world decisions.

For example, one challenging real-world application of machine learning models is bacteria identification based on genomic sequences. Bacteria detection is crucial for diagnosis and treatment of infectious diseases, such as sepsis, and for identifying foodborne pathogens. New bacterial classes continue to be discovered over the years, and while a neural network classifier trained on the known classes achieves high accuracy as measured through cross-validation, deploying a model is challenging, since real-world data is ever evolving and will inevitably contain genomes from unseen classes (OOD inputs) not present in the training data.

New bacterial classes are gradually discovered over the years. A classifier trained on known classes achieves high accuracy for test inputs belonging to known classes, but can wrongly classify inputs from unknown classes (i.e., out-of-distribution) into known classes with high confidence.

In “Likelihood Ratios for Out-of-Distribution Detection”, presented at NeurIPS 2019, we proposed and released a realistic benchmark dataset of genomic sequences for OOD detection that is inspired by the real-world challenges described above. We tested existing methods for OOD detection using generative models on genomic sequences and found that the likelihood values (i.e., the model’s probability that an input comes from the distribution, as estimated using in-distribution data) were often in error. This phenomenon has also been observed in recent work on deep generative models of images. We explain this phenomenon through the effect of background statistics and propose a likelihood-ratio based solution that significantly improves the accuracy of OOD detection.

Why Do Density Models Fail At OOD Detection?
To mimic the real problem and systematically evaluate different methods, we built a new bacterial dataset using data sourced from the publicly available NCBI catalog of prokaryotic genome sequences. To mimic sequencing data, we fragmented genomes into short sequences of 250 base pairs, a length commonly generated by current sequencing technology. We then separated in- and out-of-distribution data by the date of discovery, such that bacterial classes discovered before a cutoff time were defined as in-distribution, and those discovered afterward as OOD.

We then trained a deep generative model on in-distribution genomic sequences and examined how well the model discriminated between in- and out-of-distribution inputs by plotting their likelihood values. The histogram of the likelihood for OOD sequences largely overlaps with that of in-distribution sequences, indicating that the generative model's likelihood alone cannot separate the two populations for OOD detection. Similar results were shown in earlier work for deep generative models of images: for instance, a PixelCNN++ model trained on images from the Fashion-MNIST dataset (which consists of images of clothing and footwear) assigns higher likelihood to OOD images from the MNIST dataset (which consists of images of digits 0-9).

Left: Histogram of likelihood values for in- and out-of-distribution (OOD) genomic sequences. The likelihood fails to separate in-distribution and OOD genomic sequences. Right: A similar plot for a model trained on Fashion-MNIST and evaluated on MNIST. The model assigns higher likelihood values for OOD (MNIST) than in-distribution images.

When investigating this failure mode, we observed that the likelihood can be confounded by background statistics. To understand the phenomenon more intuitively, assume that an input is composed of two components, (1) a background component characterized by background statistics, and (2) a semantic component characterized by patterns specific to the in-distribution data. For example, an MNIST image can be modeled as background plus semantics. When humans interpret the image, we can easily ignore the background and focus primarily on the semantic information, e.g., the “/” mark in the image below. But the likelihood is calculated for all pixels in an image, including both semantic and background pixels. Though we want to use just the semantic likelihood for decision making, the raw likelihood can be dominated by background.

Left top: Sample images from Fashion-MNIST. Left bottom: Sample images from MNIST. Right: Background and semantic components in an MNIST image.

Likelihood Ratios For OOD Detection
We propose a likelihood ratio method that removes the effect of background and focuses on semantics. First, we train a background model on perturbed inputs. The method for perturbing the input is inspired by genetic mutations, and proceeds by randomly selecting positions in the input and substituting the value at each selected position with another value, chosen with equal probability. For imaging, the values are randomly chosen from the 256 possible pixel values, and for the DNA sequences, the value is selected from the four possible nucleotides (A, T, C, or G). With the right amount of perturbation, the semantic structure in the data is corrupted, so the model trained on the perturbed inputs captures only the background. Then we compute the likelihood ratio between the full model and the background model. The background component cancels out, so that only the likelihood for semantics remains. The likelihood ratio is a background contrastive score, i.e., it captures the significance of the semantics compared to the background.
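The perturbation step and the resulting score can be sketched in a few lines of Python. This is a minimal illustration rather than the released implementation: the names `perturb`, `mutation_rate`, and `likelihood_ratio` are made up for this sketch, each position is assumed to be selected independently, and "another" value is assumed to mean one of the three other nucleotides. The two log-likelihoods are placeholders for the outputs of a trained full model and a trained background model.

```python
import random

NUCLEOTIDES = "ACGT"


def perturb(sequence, mutation_rate, rng):
    """Mutate each position independently with probability mutation_rate.

    A mutated position is replaced by one of the three other nucleotides,
    chosen uniformly (an assumption of this sketch).
    """
    out = []
    for base in sequence:
        if rng.random() < mutation_rate:
            alternatives = [n for n in NUCLEOTIDES if n != base]
            out.append(rng.choice(alternatives))
        else:
            out.append(base)
    return "".join(out)


def likelihood_ratio(log_p_full, log_p_background):
    """Log of p_full(x) / p_background(x), given both log-likelihoods of x."""
    return log_p_full - log_p_background


rng = random.Random(0)
noisy = perturb("ACGTACGTACGT", mutation_rate=0.2, rng=rng)
print(noisy)  # same length as the input, with some positions mutated

# Placeholder log-likelihoods standing in for the two trained models.
print(likelihood_ratio(log_p_full=-310.0, log_p_background=-325.0))
```

The background model would be trained on the output of `perturb` applied to the in-distribution training sequences, while the full model is trained on the unperturbed sequences.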

To qualitatively evaluate the difference between the likelihood and likelihood ratio, we plotted their values for each pixel in the Fashion-MNIST and MNIST datasets, creating heatmaps that have the same size as the images. This allows us to visualize which pixels contribute the most to the two terms, respectively. From the log-likelihood heatmaps, we see that the background pixels contribute much more to the likelihood than the semantic pixels. In hindsight, this is not surprising, since background pixels consist mostly of a string of zeros, a pattern very easily learned by the model. A comparison between the MNIST and Fashion-MNIST heatmaps demonstrates why MNIST returns higher likelihood values — it simply has a lot more background pixels! The likelihood ratio instead focuses more on the semantic pixels.

Left: Log-likelihood heatmaps for Fashion-MNIST and MNIST datasets. Right: The same examples showing heatmaps of the likelihood-ratio. Pixels with higher values are of lighter shades. The likelihood is dominated by the “background” pixels, whereas the likelihood ratio focuses on the “semantic” pixels and is thus better for OOD detection.

Our likelihood ratio method corrects the background effect and significantly improves the OOD detection of MNIST images from an AUROC score of 0.089 to 0.994, based on a PixelCNN++ model trained for Fashion-MNIST. When applied to the genomic benchmark dataset, this method achieves state-of-the-art performance on this challenging problem, when compared to 12 other baseline methods.
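To make the AUROC numbers concrete, here is a small sketch of the metric itself. It assumes each input gets a score oriented so that higher means "more likely OOD" (for example, the negative log-likelihood), and it uses made-up toy scores, not values from the paper. Under that convention, an AUROC far below 0.5 means OOD inputs are systematically ranked as more in-distribution than the real in-distribution inputs.

```python
def auroc(ood_scores, in_scores):
    """Probability that a random OOD input scores higher than a random
    in-distribution input, counting ties as one half."""
    wins = 0.0
    for o in ood_scores:
        for i in in_scores:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5
    return wins / (len(ood_scores) * len(in_scores))


# Toy log-likelihoods (hypothetical): the OOD inputs receive HIGHER likelihood.
in_log_likelihood = [-3.0, -2.5, -2.0]
ood_log_likelihood = [-1.0, -1.5, -2.2]

# Score = negative log-likelihood, so a low likelihood gives a high OOD score.
score = auroc([-v for v in ood_log_likelihood], [-v for v in in_log_likelihood])
print(round(score, 3))  # well below 0.5: the raw likelihood ranks the inputs the wrong way
```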

For more details, please check out our recent paper at NeurIPS 2019. While our likelihood ratio method reaches state-of-the-art performance on the genomic dataset, it does not yet have high enough accuracy to reach the standards for deployment of the model to real applications. We encourage researchers to contribute their solutions to this important problem and improve the current state-of-the-art. The dataset is available on our GitHub repository.

Acknowledgments
The work described here was authored by Jie Ren, Peter J. Liu, Emily Fertig, Jasper Snoek, Ryan Poplin, Mark A. DePristo, Joshua V. Dillon, Balaji Lakshminarayanan, through a collaboration spanning several teams across Google AI and DeepMind. We are grateful for all the discussions and feedback on this work that we received from the reviewers at NeurIPS 2019, and our colleagues at Google and DeepMind: Alexander A. Alemi, Andreea Gane, Brian Lee, D. Sculley, Eric Jang, Jacob Burnim, Katherine Lee, Matthew D. Hoffman, Noah Fiedel, Rif A. Saurous, Suman Ravuri, Thomas Colthurst, Yaniv Ovadia, along with the Google Brain and TensorFlow teams.

[D] Current state of the Topic Segmentation problem

Recently, I did a little research in the literature for “Topic segmentation” since “Text segmentation” seems to be more related to identifying text in images. From the results, it appears that the most recent survey is from 2011 [1], while the most recent papers in big conferences are from 2008 to 2013 [2, 3, 4].

Is this the current state of the problem, or are there more recent and relevant works?

It’s also possible that I’m using the wrong terms. So, for clarification, I’m mostly interested in segmenting a collection of documents into a small and well-known number of sections / topics.

[1] Purver, Matthew. “Topic segmentation.” In Spoken language understanding: systems for extracting semantic information from speech (2011)

[2] Eisenstein, Jacob, and Regina Barzilay. “Bayesian unsupervised topic segmentation.” In Proceedings of the Conference on Empirical Methods in Natural Language Processing (2008)

[3] Riedl, Martin, and Chris Biemann. “TopicTiling: a text segmentation algorithm based on LDA.” In Proceedings of ACL 2012 Student Research Workshop (2012)

[4] Du, Lan, Wray Buntine, and Mark Johnson. “Topic segmentation with a structured topic model.” In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics (2013)

submitted by /u/Daango_

[R] We released our Oktoberfest Food Dataset

Data sample example

Abstract:
“We release a realistic, diverse, and challenging dataset for object detection on images. The data was recorded at a beer tent in Germany and consists of 15 different categories of food and drink items. We created more than 2,500 object annotations by hand for 1,110 images captured by a video camera above the checkout. We further make available the remaining 600GB of (unlabeled) data containing days of footage. Additionally, we provide our trained models as a benchmark. Possible applications include automated checkout systems which could significantly speed up the process.”

Arxiv link

git

If you have any feedback or comments on our work, we would be more than happy to hear them.

submitted by /u/comp_vision_

Running distributed TensorFlow training with Amazon SageMaker

TensorFlow is an open-source machine learning (ML) library widely used to develop heavy-weight deep neural networks (DNNs) that require distributed training using multiple GPUs across multiple hosts. Amazon SageMaker is a managed service that simplifies the ML workflow, starting with labeling data using active learning, hyperparameter tuning, distributed training of models, monitoring of training progression, deploying trained models as automatically scalable RESTful services, and centralized management of concurrent ML experiments.

This post focuses on distributed TensorFlow training using Amazon SageMaker.

Overview of concepts

While many of the distributed training concepts in this post are generally applicable across many types of TensorFlow models, this post focuses on distributed TensorFlow training for the Mask R-CNN model on the Common Objects in Context (COCO) 2017 dataset.

Model

The Mask R-CNN model is used for object instance segmentation, whereby the model generates pixel-level masks (Sigmoid binary classification) and bounding boxes (Smooth L1 regression) annotated with an object-category (SoftMax classification) to delineate each object instance in an image. Some common use cases for Mask R-CNN include perception in autonomous vehicles, surface defect detection, and analysis of geospatial imagery.

There are three key reasons for selecting the Mask R-CNN model for this post:

  1. Distributed data parallel training of Mask R-CNN on large datasets increases the throughput of images through the training pipeline and reduces training time.
  2. There are many open-source TensorFlow implementations available for the Mask R-CNN model. This post uses the Tensorpack Mask/Faster-RCNN implementation as its primary example, but a highly optimized AWS Samples Mask-RCNN is recommended, as well.
  3. The Mask R-CNN model is submitted as part of MLPerf results as a heavy-weight object detection model.

The following graphic is a schematic outline of the Mask R-CNN deep neural network architecture.

Synchronized allreduce of gradients in distributed training

The central challenge in distributed DNN training is that the gradients computed during back propagation across multiple GPUs need to be allreduced (averaged) in a synchronized step before applying the gradients to update the model weights at multiple GPUs across multiple nodes.

The synchronized allreduce algorithm needs to be highly efficient; otherwise, you would lose any training speedup gained from distributed data-parallel training to the inefficiency of a synchronized allreduce step.
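The semantics of a synchronized allreduce can be illustrated with a toy, single-process simulation. This sketch only shows what the step computes (every worker ends up with the same averaged gradient before updating its weights); it says nothing about how Horovod or NCCL implement it efficiently, and the function names are made up for the example.

```python
def allreduce_average(per_worker_gradients):
    """Return one copy of the element-wise average gradient per worker."""
    num_workers = len(per_worker_gradients)
    length = len(per_worker_gradients[0])
    averaged = [
        sum(grad[i] for grad in per_worker_gradients) / num_workers
        for i in range(length)
    ]
    return [list(averaged) for _ in range(num_workers)]


def sgd_step(weights, gradient, learning_rate):
    """Plain gradient descent update applied locally on one worker."""
    return [w - learning_rate * g for w, g in zip(weights, gradient)]


# Three simulated workers start from identical weights but compute
# different gradients because each one sees a different mini-batch.
weights = [[0.5, -1.0] for _ in range(3)]
local_gradients = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]

synced = allreduce_average(local_gradients)
weights = [sgd_step(w, g, learning_rate=0.1) for w, g in zip(weights, synced)]

# Because every worker applied the same averaged gradient,
# the model replicas stay identical after the update.
print(weights)
```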

There are three key challenges to making a synchronized allreduce algorithm highly efficient:

  • The algorithm needs to scale with the increasing number of nodes and GPUs in the distributed training cluster.
  • The algorithm needs to exploit the topology of high-speed GPU-to-GPU interconnects within a single node.
  • The algorithm needs to efficiently interleave computations on a GPU with communications with other GPUs by efficiently batching the communications with other GPUs.

Uber’s open-source library Horovod addresses these three key challenges as follows:

  • Horovod offers a choice of highly efficient synchronized allreduce algorithms that scale with an increasing number of GPUs and nodes.
  • The Horovod library uses Nvidia Collective Communications Library (NCCL) communication primitives that exploit awareness of Nvidia GPU topology.
  • Horovod includes Tensor Fusion, which efficiently interleaves communication with computation by batching data communication for allreduce.

Horovod supports many ML frameworks, including TensorFlow. TensorFlow distribution strategies also use NCCL and provide an alternative to using Horovod to do distributed TensorFlow training. This post uses Horovod.

Training heavy-weight DNNs such as Mask R-CNN requires high per-GPU memory so you can pump one or more high-resolution images through the training pipeline. It also requires a high-speed GPU-to-GPU interconnect and high-speed networking between machines so synchronized allreduce of gradients can be done efficiently. Amazon SageMaker ml.p3.16xlarge and ml.p3dn.24xlarge instance types meet all these requirements. For more information, see Amazon SageMaker ML Instance Types. With eight Nvidia Tesla V100 GPUs, 128–256 GB GPU memory, 25–100 Gbps networking interconnect, and high-speed Nvidia NVLink GPU-to-GPU interconnect, they are ideally suited for distributed TensorFlow training on Amazon SageMaker.

Message Passing Interface

The next challenge in distributed TensorFlow training is the appropriate placement of training algorithm processes across multiple nodes, and associating each algorithm process with a unique global rank. Message Passing Interface (MPI) is a widely used collective communication protocol for parallel computing and is useful in managing a group of training algorithm worker processes across multiple nodes.

MPI is used to distribute training algorithm processes across multiple nodes and associate each algorithm process with a unique global and local rank. Horovod is used to logically pin an algorithm process on a given node to a specific GPU. The logical pinning of each algorithm process to a specific GPU is required for synchronized allreduce of gradients.
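The relationship between global rank, node, and local rank can be sketched as simple integer arithmetic. This assumes that ranks are assigned node by node and that every node runs one process per GPU; the actual assignment is decided by MPI and may differ, and the function name is invented for this sketch. In a Horovod program, the local rank is what a process would typically use to pick its GPU.

```python
def placement(global_rank, gpus_per_node):
    """Map a global rank to (node_index, local_rank).

    Assumes ranks fill one node completely before moving to the next.
    """
    node_index = global_rank // gpus_per_node
    local_rank = global_rank % gpus_per_node
    return node_index, local_rank


# Two nodes with eight GPUs each give 16 processes with global ranks 0..15.
for rank in (0, 7, 8, 15):
    node, local = placement(rank, gpus_per_node=8)
    print(f"global rank {rank}: node {node}, GPU {local}")
```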

The key MPI concept to understand for this post is that MPI uses the mpirun command on a master node to launch concurrent processes across multiple nodes. Using MPI, the master host manages the lifecycle of distributed training processes running across multiple nodes centrally. To use MPI to do distributed training using Amazon SageMaker, you must integrate MPI with the native distributed training capabilities of Amazon SageMaker.

Integrating MPI with Amazon SageMaker distributed training

To understand how to integrate MPI with Amazon SageMaker distributed training, you need an understanding of the following concepts:

  • Amazon SageMaker requires the training algorithm and frameworks packaged in a Docker image.
  • The Docker image must be enabled for Amazon SageMaker training. This enablement is simplified through the use of Amazon SageMaker containers, which is a library that helps create Amazon SageMaker-enabled Docker images.
  • You need to provide an entry point script (typically a Python script) in the Amazon SageMaker training image to act as an intermediary between Amazon SageMaker and your algorithm code.
  • To start training on a given host, Amazon SageMaker runs a Docker container from the training image and invokes the entry point script with entry point environment variables that provide information such as hyperparameters and the location of input data.
  • The entry point script uses the information passed to it in the entry point environment variables to start your algorithm program with the correct args and polls the running algorithm process.
  • When the algorithm process exits, the entry point script exits with the exit code of the algorithm process. Amazon SageMaker uses this exit code to determine the success or failure of the training job.
  • The entry point script redirects the output of the algorithm process’ stdout and stderr to its own stdout. In turn, Amazon SageMaker captures the stdout from the entry point script and sends it to Amazon CloudWatch Logs. Amazon SageMaker parses the stdout output for algorithm metrics defined in the training job and sends the metrics to Amazon CloudWatch metrics.
  • When Amazon SageMaker starts a training job that requests multiple training instances, it creates a set of hosts and logically names each host as algo-k, where k is the global rank of the host. For example, if a training job requests four training instances, Amazon SageMaker names the hosts as algo-1, algo-2, algo-3, and algo-4. The hosts can connect on the network using these hostnames.

In the case of distributed training using MPI, you need a single mpirun command running on the master node (host) that controls the lifecycle of all algorithm processes distributed across multiple nodes, algo-1 through algo-n, where n is the number of training instances requested in your Amazon SageMaker training job. However, Amazon SageMaker is unaware of MPI or any other parallel processing framework you may use to distribute your algorithm processes across multiple nodes. Amazon SageMaker is going to invoke the entry point script on the Docker container running on each node. This means the entry point script needs to be aware of the global rank of its node and execute different logic depending on whether it is invoked on the master node or one of the non-master nodes.

Specifically, for the case of MPI, the entry point script invoked on the master node needs to run the mpirun command to start algorithm processes across all the nodes in the current Amazon SageMaker training job’s host set. The same entry point script, when invoked by Amazon SageMaker on any of the non-master nodes, periodically checks whether the algorithm processes on that node, which the mpirun command manages remotely from the master node, are still running, and exits when they are no longer running.

A master node in MPI is a logical concept, so it is up to the entry point script to designate a host from among all the hosts in the current training job host set as a master node. This designation has to be done in a decentralized manner. A simple approach is to designate algo-1 as the master node and all other hosts as non-master nodes. Because Amazon SageMaker provides each node its logical hostname in the entry point environment variables, it is straightforward for a node to decide if it is the master node or a non-master node.
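The master/non-master decision described above can be sketched as follows. This is a simplified illustration, not the train.py from the accompanying repo. It assumes the `SM_HOSTS` (a JSON list of hostnames) and `SM_CURRENT_HOST` environment variables set by the Amazon SageMaker containers library, an Open MPI style `--host name:slots` syntax, and a hypothetical training script name. A real entry point would also need to launch the command, forward environment variables, and poll processes on non-master nodes.

```python
import json

MASTER_HOST = "algo-1"  # simple convention: the first host acts as master


def build_mpirun_command(hosts, gpus_per_host, train_cmd):
    """Build a command that starts one process per GPU on every host."""
    total_processes = len(hosts) * gpus_per_host
    host_slots = ",".join(f"{host}:{gpus_per_host}" for host in hosts)
    return ["mpirun", "-np", str(total_processes), "--host", host_slots] + list(train_cmd)


def plan_entry_point(env, gpus_per_host, train_cmd):
    """Decide what this node should do, given its environment variables."""
    hosts = json.loads(env["SM_HOSTS"])
    current_host = env["SM_CURRENT_HOST"]
    if current_host == MASTER_HOST:
        return "master", build_mpirun_command(hosts, gpus_per_host, train_cmd)
    # Non-master nodes do not launch anything; they wait for the processes
    # that mpirun starts remotely and exit once those processes are gone.
    return "non-master", None


# Example with a hand-written environment; in a container, pass os.environ.
example_env = {"SM_HOSTS": '["algo-1", "algo-2"]', "SM_CURRENT_HOST": "algo-1"}
role, command = plan_entry_point(
    example_env, gpus_per_host=8, train_cmd=["python3", "train_maskrcnn.py"]  # hypothetical script
)
print(role, " ".join(command))
```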

The train.py included in the accompanying GitHub repo and packaged in the Tensorpack Mask/Faster-RCNN algorithm Docker image follows the logic outlined in this section.

With the background of this conceptual understanding, you’re ready to proceed to the step-by-step tutorial on how to run distributed TensorFlow training for Mask R-CNN using Amazon SageMaker.

Solution overview

This tutorial has the following key steps:

  1. Use an AWS CloudFormation automation script to create a private Amazon VPC and create an Amazon SageMaker notebook instance network attached to this private VPC.
  2. From the Amazon SageMaker notebook instance, launch distributed training jobs in an Amazon SageMaker-managed Amazon VPC network attached to your private VPC. You can use Amazon S3, Amazon EFS, and Amazon FSx as data sources for the training data pipeline.

Prerequisites

The following prerequisites are required:

  1. Create and activate an AWS Account or use an existing AWS account.
  2. Manage your Amazon SageMaker instance limits. You need a minimum of two ml.p3dn.24xlarge or two ml.p3.16xlarge instances; a service limit of four of each is recommended. Keep in mind that the service limit is specific to each AWS Region. This post uses us-west-2.
  3. Clone this post’s GitHub repo and complete the steps in this post. All paths in this post are relative to the GitHub repo root.
  4. Use any AWS Region that supports Amazon SageMaker, EFS, and Amazon FSx. This post uses us-west-2.
  5. Create a new S3 bucket or choose an existing bucket.

Creating an Amazon SageMaker notebook instance attached to a VPC

The first step is to run an AWS CloudFormation automation script to create an Amazon SageMaker notebook instance attached to a private VPC. To run this script, you need IAM user permissions consistent with the Network Administrator function. If you do not have such access, you may need to seek help from your network administrator to run the AWS CloudFormation automation script included in this tutorial. For more information, see AWS Managed Policies for Job Functions.

Use the AWS CloudFormation template cfn-sm.yaml to create an AWS CloudFormation stack that creates a notebook instance attached to a private VPC. You can either create the AWS CloudFormation stack using cfn-sm.yaml in the AWS CloudFormation service console, or you can customize variables in the stack-sm.sh script and run the script anywhere you have the AWS CLI installed.

To use the AWS CLI approach, complete the following steps:

  1. Install AWS CLI and configure it.
  2. In stack-sm.sh, set AWS_REGION to your AWS Region and S3_BUCKET to your S3 bucket. These two variables are required.
  3. Optionally, set the EFS_ID variable if you want to use an existing EFS file system. If you leave EFS_ID blank, a new EFS file system is created. If you choose to use an existing EFS file system, make sure the existing file system does not have any existing mount targets. For more information, see Managing Amazon EFS File Systems.
  4. Optionally, specify GIT_URL to add a GitHub repo to the Amazon SageMaker notebook instance. If the GitHub repo is private, you can specify GIT_USER and GIT_TOKEN variables.
  5. Run the customized stack-sm.sh script to create an AWS CloudFormation stack using AWS CLI.

Save the summary output of the AWS CloudFormation script to use later. You can also view the output under the AWS CloudFormation Stack Outputs tab on the AWS Management Console.

Launching Amazon SageMaker training jobs

In the Amazon SageMaker console, open the notebook instance you created. In this notebook instance, there are three Jupyter notebooks available for training Mask R-CNN, one for each data source option: S3, EFS, and Amazon FSx.

The training time performance for all three data source options is similar (though not identical) for this post’s choice of Mask R-CNN model and COCO 2017 dataset. The cost profile for each of the data sources is different. The following are differences in terms of the time it takes to set up the training data pipeline:

  • For the S3 data source, each time the training job launches, it takes approximately 20 minutes to replicate the COCO 2017 dataset from your S3 bucket to the storage volumes attached to each training instance.
  • For the EFS data source, it takes approximately 46 minutes to copy the COCO 2017 dataset from your S3 bucket to your EFS file system. You only need to copy this data one time. During training, data is input from the shared EFS file system mounted on all the training instances through a network interface.
  • For Amazon FSx, it takes approximately 10 minutes to create a new Amazon FSx Lustre file system and import the COCO 2017 dataset from your S3 bucket to the new Amazon FSx Lustre file system. You only need to do this one time. During training, data is input from the shared Amazon FSx Lustre file system mounted on all the training instances through a network interface.
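As a rough worked example of the trade-off above, the following sketch adds up the data setup time over several training jobs. It uses only the approximate figures quoted in this post (20 minutes per job for S3, 46 minutes one time for EFS, 10 minutes one time for Amazon FSx) and ignores everything else, such as storage cost or any per-job mount time. The function name is made up for the illustration.

```python
# Approximate setup times in minutes, taken from the list above.
PER_JOB_MINUTES = {"s3": 20}
ONE_TIME_MINUTES = {"efs": 46, "fsx": 10}


def total_setup_minutes(source, num_jobs):
    """Total data setup time across num_jobs training jobs."""
    if source in PER_JOB_MINUTES:
        return PER_JOB_MINUTES[source] * num_jobs
    return ONE_TIME_MINUTES[source]


for jobs in (1, 3, 10):
    totals = {s: total_setup_minutes(s, jobs) for s in ("s3", "efs", "fsx")}
    print(jobs, totals)
```

With these figures, the repeated S3 download overtakes the one-time EFS copy from the third training job onward.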

If you are not sure which data source option is best for you, start with S3, and explore EFS or Amazon FSx if the training data download time at the start of each training job is not acceptable. Do not assume anything about training time performance for any of the data sources. Training time performance depends on many factors; it is best to experiment and measure it.

In all three cases, the logs and model checkpoints output during training are written to a storage volume attached to each training instance and uploaded to your S3 bucket when training is complete. The logs are also fed into Amazon CloudWatch as training progresses, so you can review them during training. System and algorithm training metrics are fed into Amazon CloudWatch metrics during training, which you can visualize in the Amazon SageMaker service console.

Training results

The following graphs are example results for the two algorithms, after training for 24 epochs on the COCO 2017 dataset.

Below you can see the example results for TensorPack Mask/Faster-RCNN algorithm. The graphs below can be split into three buckets:

  1. Mean average precision (mAP) graphs for bounding box (bbox) prediction for various values of Intersection over Union (IoU), and small, medium, and large object sizes
  2. Mean average precision (mAP) graphs for object instance segmentation (segm) prediction for various values of Intersection over Union (IoU), and small, medium, and large object sizes
  3. Other metrics related to training loss, or label accuracy
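For readers unfamiliar with Intersection over Union, the following sketch computes it for two axis-aligned bounding boxes. The `(x1, y1, x2, y2)` corner format is an assumption of this example; COCO evaluation tooling has its own box and mask conventions.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


predicted = (0.0, 0.0, 2.0, 2.0)
ground_truth = (1.0, 1.0, 3.0, 3.0)
print(iou(predicted, ground_truth))  # overlap area 1 over union area 7
```

A prediction counts as a match at a given IoU threshold only if its IoU with a ground-truth object reaches that threshold, which is why mAP is reported for several IoU values.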

Below you can see the example results for the optimized AWS Samples Mask R-CNN algorithm. The converged mAP metrics shown in the graphs below are almost identical to the previous algorithm, although the convergence progression is different.

Conclusion

Amazon SageMaker provides a Docker-based, simplified distributed TensorFlow training platform that allows you to focus on your ML algorithm and not be distracted by ancillary concerns such as the mechanics of infrastructure availability and scalability, and concurrent experiment management. When your model is trained, you can use the integrated model deployment capability of Amazon SageMaker to create an automatically scalable RESTful service endpoint for your model and start testing it. For more information, see Deploy a Model on Amazon SageMaker Hosting Services. When your model is ready, you can seamlessly deploy the model RESTful service into production.


About the Author

Ajay Vohra is a Principal Solutions Architect specializing in perception machine learning for autonomous vehicle development. Prior to Amazon, Ajay worked in the area of massively parallel grid-computing for financial risk modeling, and automation of application platform engineering in on-premise data centers.

 

 

Vector researcher Will Grathwohl wants to lower the barriers to entry to AI

By Ian Gormely

Artificial intelligence is a transformative technology. Yet, much like the Internet before web browsers, it remains inaccessible to many people. The web’s true potential wasn’t realized until the barriers to entry were lowered to the point that “anyone with a laptop had the potential to build the next Facebook,” says Will Grathwohl, a Vector researcher and graduate student at the University of Toronto. “I think we should put AI into people’s hands. The people who have the best ideas for how to apply something are usually not the people that created the thing. But right now, it’s just not like that at all.”

Grathwohl was part of a robust contingent of Vector affiliated folk who attended this year’s International Conference on Learning Representations (ICLR) in New Orleans. In total, 12 posters from Vector Faculty Members were accepted to the conference, with Grathwohl giving an oral presentation of the paper, “FFJORD: Free-Form Continuous Dynamics for Scalable Reversible Generative Models” which he co-authored with Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and Vector Faculty Member David Duvenaud. 

FFJORD, an acronym for Free-form Jacobian of Reversible Dynamics, is a small but important step in Grathwohl’s quest to lower the barriers to entry to AI. There have been tremendous breakthroughs in the field, particularly around the use of machine learning, over the past five years. But those breakthroughs still require vast sums of hand-labelled data – say pictures of cats that are identified as such – and computing power, neither of which comes cheap. “To me, the most interesting method to making that amount of data smaller is finding ways to use the massive amounts of unlabeled data that are out there,” says the 27-year-old. “One way that that’s become popular to do that is to look into generative models.”

Grathwohl’s paper looks specifically at normalizing flows, a class of generative models that have become popular in the machine learning community for their ability to generate samples and compute likelihood. Building them though requires placing a lot of restrictions on neural networks that can be used to solve a problem. FFJORD applies the idea of continuous time as a workaround to build better, less restrictive normalizing flows.

It builds off an idea first put forth by Grathwohl’s advisor, Vector Faculty Member David Duvenaud, in the paper Neural Ordinary Differential Equations, which won Duvenaud and his co-authors Ricky Tian Qi Chen, Yulia Rubanova, and Jesse Bettencourt the Best Paper Award at last year’s NeurIPS Conference. “David’s paper presented the idea of having a neural network parameterize a continuous time dynamic process. And that opened up a whole new paradigm to think about things that involve machine learning in neural networks,” says Grathwohl. Leveraging Duvenaud’s idea of switching from discrete time – data sampled at regular intervals – to continuous time – data sampled at any point in the flow – allows for the creation of normalizing flow-based generative models in a much simpler and more expressive way.

After finishing his undergrad in 2014, Grathwohl spent several years bouncing around the tech industry, first as an entrepreneur, developing content moderation software, and later using machine learning for product indexing at a startup. Eventually, he became frustrated with the lack of creativity. Yet out of that milieu came the inspiration for his return to school. “My job was building infrastructure to collect data and figuring out how to do it as cheaply as possible,” he says. “We had to build more classifiers to serve more industries and more customers. Every single one of those was a constant cost of time and money. I realized we need to make these things work better with less data.”

FFJORD does not solve that problem, but it is a step in the right direction. “Better models that can solve this less labelled data problem will be a key piece,” he says, noting that down the road, normalizing flows could also help in modelling environments, an important aspect of genetic research and robotics. “Any improvement in unsupervised generative models will help us in the semi-supervised learning setting.”