Exploring data warehouse tables with machine learning and Amazon SageMaker notebooks

Are you a data scientist with data warehouse tables that you’d like to explore in your machine learning (ML) environment? If so, read on.

In this post, I show you how to perform exploratory analysis on large datasets stored in your data warehouse and cataloged in your AWS Glue Data Catalog from your Amazon SageMaker notebook. I detail how to identify and explore a dataset in the corporate data warehouse from your Jupyter notebook running on Amazon SageMaker. I demonstrate how to extract the interesting information from Amazon Redshift into Amazon EMR and transform it further there. Then, you can continue analyzing and visualizing your data in your notebook, all in a seamless experience.

This post builds on the following prior posts—you may find it helpful to review them first.

Amazon SageMaker overview

Amazon SageMaker is a fully managed ML service. With Amazon SageMaker, data scientists and developers can quickly and easily build and train ML models, and then directly deploy them into a production-ready hosted environment. Amazon SageMaker provides an integrated Jupyter authoring environment for data scientists to perform initial data exploration, analysis, and model building.

The challenge is locating the datasets of interest. If the data is in the data warehouse, you extract the relevant subset of information and load it into your Jupyter notebook for more detailed exploration or modeling. As individual datasets get larger and more numerous, extracting all potentially interesting datasets, loading them into your notebook, and merging them there ceases to be practical and slows productivity. This kind of data combination and exploration can take up to 80% of a data scientist’s time. Increasing productivity here is critical to accelerating the completion of your ML projects.

An increasing number of corporations are using Amazon Redshift as their data warehouse. Amazon Redshift allows you to run complex analytic queries against petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance local disks, and massively parallel query execution. These capabilities make it a magnet for the kind of data that is also of interest to data scientists. However, to perform ML tasks, the data must be extracted into an ML platform so data scientists can operate on it. You can use the capabilities of Amazon Redshift to join and filter the data as needed, and then extract only the relevant data into the ML platform for ML-specific transformation.

Frequently, large corporations also use AWS Glue to manage their data lake. AWS Glue is a fully managed ETL (extract, transform, and load) service. It makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores. AWS Glue contains a central metadata repository known as the AWS Glue Data Catalog, which makes the enriched and categorized data in the data lake available for search and querying. You can use the metadata in the Data Catalog to identify the names, locations, content, and characteristics of datasets of interest.

Even after joining and filtering the data in Amazon Redshift, the remaining data may still be too large for your notebook to store and run ML operations on. Operating on extremely large datasets is a task for which Apache Spark on EMR is ideally suited.

Spark is a cluster-computing framework with built-in modules supporting analytics from a variety of languages, including Python, Java, and Scala. Spark on EMR’s ability to scale is a good fit for the large datasets frequently found in corporate data lakes. If the datasets are already defined in your AWS Glue Data Catalog, it becomes easier still to access them, by using the Data Catalog as an external Apache Hive Metastore in EMR. In Spark, you can perform complex transformations that go well beyond the capabilities of SQL. That makes it a good platform for further processing or massaging your data; for example, using the full capabilities of Python and Spark MLlib.

When using the setup described in this post, you use Amazon Redshift to join and filter the source data. Then, you iteratively transform the resulting reduced (but possibly still large) datasets, using EMR for heavyweight processing. You can do this while using your Amazon SageMaker notebook to explore and visualize subsets of the data relevant to the task at hand. The various tasks (joining and filtering; complex transformation; and visualization) have each been farmed out to a service well suited to that task.

Solution overview

The first section of the solution walks through querying the AWS Glue Data Catalog to find the database of interest and reviewing the tables and their definitions. The table declarations identify the data location—in this case, Amazon Redshift. The AWS Glue Data Catalog also provides the needed information to build the Amazon Redshift connection string for use in retrieving the data.
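As a rough illustration of how the Data Catalog can supply the connection string, the sketch below builds a JDBC URL from a response shaped like the one returned by the boto3 Glue `get_connection` call. The helper name `redshift_jdbc_url` and all sample values are hypothetical, and the response here is a hand-written stand-in rather than real output.

```python
from urllib.parse import urlencode


def redshift_jdbc_url(connection_response):
    """Build a JDBC URL with credentials from a Glue get_connection-style response."""
    props = connection_response["Connection"]["ConnectionProperties"]
    base_url = props["JDBC_CONNECTION_URL"]
    # URL-encode the credentials so special characters do not break the URL.
    credentials = urlencode({"user": props["USERNAME"], "password": props["PASSWORD"]})
    separator = "&" if "?" in base_url else "?"
    return f"{base_url}{separator}{credentials}"


# Hand-written stand-in for glue.get_connection(Name=...); every value is made up.
sample_response = {
    "Connection": {
        "Name": "GlueRedshiftConnection",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:redshift://example-host:5439/dev",
            "USERNAME": "masteruser",
            "PASSWORD": "Example-Pass1",
        },
    }
}

jdbc_url = redshift_jdbc_url(sample_response)
```

In a live notebook, the response would come from a boto3 Glue client instead of the sample dictionary, and you would want to avoid printing the resulting URL because it embeds the password.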

The second part of the solution is reading the data into EMR. It applies if the size of the data that you’re extracting from Amazon Redshift is large enough that reading it directly into your notebook is no longer practical. The power of a cluster-compute service, such as EMR, provides the needed scalability.

If the following are true, there is a much simpler solution. For more information, see the Amazon Redshift access demo sample notebook provided with the Amazon SageMaker samples.

  • You know the Amazon Redshift cluster that contains the data of interest.
  • You know the Amazon Redshift connection information.
  • The data you’re extracting and exploring is at a scale amenable to a JDBC connection.

The solution is implemented using four AWS services and some open source components:

  • An Amazon SageMaker notebook instance, which provides zero-setup hosted Jupyter notebook IDEs for data exploration, cleaning, and preprocessing. This notebook instance runs:
    • Jupyter notebooks
    • SparkMagic: A set of tools for interactively working with remote Spark clusters through Livy in Jupyter. The SparkMagic project includes a set of magics for interactively running Spark code in multiple languages. Magics are predefined functions that execute supplied commands. The project also includes some kernels that you can use to turn Jupyter into an integrated Spark environment.
  • An EMR cluster, running Apache Spark, and:
    • Apache Livy: a service that enables easy interaction with Spark on an EMR cluster over a REST interface. Livy enables the use of Spark for interactive web/mobile applications; in this case, from your Jupyter notebook.
  • The AWS Glue Data Catalog, which acts as the central metadata repository. Here it’s used as your external Hive Metastore for big data applications running on EMR.
  • Amazon Redshift, as your data warehouse.
  • The EMR cluster with Spark reads from Amazon Redshift using a Databricks-provided package, Redshift Data Source for Apache Spark.

In this post, all these components interact as shown in the following diagram.

You get access to datasets living on Amazon S3 and defined in the AWS Glue Data Catalog with the following steps:

  1. You work in your Jupyter SparkMagic notebook in Amazon SageMaker. Within the notebook, you issue commands to the EMR cluster. You can use PySpark commands, or you can use SQL magics to issue HiveQL commands.
  2. The commands to the EMR cluster are received by Livy, which is running on the cluster.
  3. Livy passes the commands to Spark, which is also running on the EMR cluster.
  4. Spark accesses its Hive Metastore to identify the location, DDL, and properties of the cataloged dataset. In this case, the Hive metastore has been set to the Data Catalog.
  5. You define and run a boto3 function (get_redshift_data, provided below) to retrieve the connection information from the Data Catalog, and issue the command to Amazon Redshift to read the table. The spark-redshift package unloads the table into a temporary S3 file, then loads it into Spark.
  6. After performing your desired manipulations in Spark, EMR returns the data to your notebook as a dataframe for additional analysis and visualization.
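Step 5 can be sketched roughly as follows. This is not the notebook's own get_redshift_data function; it is a minimal illustration that assembles the options the spark-redshift package accepts (`url`, `tempdir`, `aws_iam_role`, and either `dbtable` or `query`). The helper names and argument values are hypothetical.

```python
def redshift_read_options(jdbc_url, temp_dir, iam_role_arn, table=None, query=None):
    """Assemble spark-redshift reader options for a full table or a SQL query."""
    if (table is None) == (query is None):
        raise ValueError("Pass exactly one of table or query")
    options = {
        "url": jdbc_url,               # JDBC URL including user and password
        "tempdir": temp_dir,           # S3 location used for the temporary unload
        "aws_iam_role": iam_role_arn,  # role Amazon Redshift assumes to write to S3
    }
    if table is not None:
        options["dbtable"] = table
    else:
        options["query"] = query
    return options


def read_from_redshift(spark, options):
    """Load a Spark DataFrame; only works on a cluster with spark-redshift installed."""
    return spark.read.format("com.databricks.spark.redshift").options(**options).load()


# Example option set for reading one small table (all values are placeholders).
category_options = redshift_read_options(
    "jdbc:redshift://example-host:5439/dev?user=masteruser&password=Example-Pass1",
    "s3://example-bucket/temp/",
    "arn:aws:iam::123456789012:role/ExampleCopyRole",
    table="category",
)
```

Passing `query=` instead of `table=` lets Amazon Redshift do the joining and filtering before anything is unloaded, which matches the division of labor described above.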

In the sections that follow, you perform these steps on a sample set of tables:

  1. Use the provided AWS CloudFormation stack to create the Amazon SageMaker notebook instance; EMR cluster with Livy and Spark; and the Amazon Redshift driver. Specify AWS Glue as the cluster’s Hive Metastore; and select an Amazon Redshift cluster. The stack also sets up an AWS Glue connection to the Amazon Redshift cluster, and a crawler to crawl Amazon Redshift.
  2. Set up some sample data in Amazon Redshift.
  3. Execute the AWS Glue crawler to access Amazon Redshift and populate metadata about the tables it contains into the Data Catalog.
  4. From your Jupyter notebook on Amazon SageMaker:
    1. Use the Data Catalog information to locate the tables of interest, and extract the connection information for Amazon Redshift.
    2. Read the tables from Amazon Redshift, pulling the data into Spark. You can filter or aggregate the Amazon Redshift data as needed during the unload operation.
    3. Further transform the data in Spark, transforming it into the desired output.
    4. Pull the reduced dataset into your notebook, and perform some rudimentary ML on it.

Set up the solution infrastructure

First, you launch a predefined AWS CloudFormation stack to set up the infrastructure components. The AWS CloudFormation stack sets up the following resources:

  • An EMR cluster with Livy and Spark, using the AWS Glue Data Catalog as the external Hive compatible Metastore. In addition, it configures Livy to use the same Metastore as the EMR cluster.
  • An S3 bucket.
  • An Amazon SageMaker notebook instance, along with associated components:
    • An IAM role for use by the notebook instance. The IAM role has the managed role AmazonSageMakerFullAccess, plus access to the S3 bucket created above.
    • A security group, used for the notebook instance.
    • An Amazon SageMaker lifecycle configuration that configures Livy to access the EMR cluster launched by the stack, and copies in a predefined Jupyter notebook with the sample code.
  • An Amazon Redshift cluster, in its own security group. Ports are opened to allow EMR, Amazon SageMaker, and the AWS Glue crawler to access it.
  • An AWS Glue database, an AWS Glue connection specifying the Amazon Redshift cluster as the target, and an AWS Glue crawler to crawl the connection.

To see this solution in operation in us-west-2, launch the stack from the following button. The total solution costs around $1.00 per hour to run. Remember to delete the AWS CloudFormation stack when you’ve finished with the solution to avoid additional charges.

  1. Choose Launch Stack and choose Next.
  2. Update the following parameters for your environment:
    • Amazon Redshift password—Must contain at least one uppercase letter, one lowercase letter, and one number.
    • VPCId—Must have internet access and an Amazon S3 VPC endpoint. You can use the default VPC created in your account. In the Amazon VPC dashboard, choose Endpoints, and check that the chosen VPC has an S3 endpoint. If not, create one.
    • VPCSubnet—Must have internet access, to allow needed software components to be installed.
    • Availability Zone—Must match the chosen subnet.

    The Availability Zone information and S3 VPC endpoint are used by the AWS Glue crawler to access Amazon Redshift.

  3. Leave the default values for the other parameters. Changing the AWS Glue database name requires changes to the Amazon SageMaker notebook that you run in a later step. The following screenshot shows the default parameters.
  4. Choose Next.
  5. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names, and I acknowledge that AWS CloudFormation might require the following capability: CAPABILITY_AUTO_EXPAND.
  6. Choose Create.

Wait for the AWS CloudFormation master stack and its nested stacks to reach a status of CREATE_COMPLETE. It can take up to 45 minutes to deploy.

On the master stack, check the Outputs tab for the resources created. You use the key-value data in the next steps. The following screenshot shows the resources that I created but your values will differ.
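If you prefer to collect the outputs programmatically, a sketch along these lines turns a response shaped like the boto3 CloudFormation `describe_stacks` result into a plain dictionary. The helper name `stack_outputs` and the sample values are hypothetical; the output keys shown are the two that this post refers to.

```python
def stack_outputs(describe_stacks_response):
    """Map OutputKey to OutputValue for the first stack in the response."""
    stack = describe_stacks_response["Stacks"][0]
    return {item["OutputKey"]: item["OutputValue"] for item in stack.get("Outputs", [])}


# Hand-written stand-in for cloudformation.describe_stacks(StackName=...).
sample_stacks = {
    "Stacks": [
        {
            "StackName": "example-stack",
            "StackStatus": "CREATE_COMPLETE",
            "Outputs": [
                {"OutputKey": "RedshiftClusterEndpoint",
                 "OutputValue": "example-host.us-west-2.redshift.amazonaws.com:5439"},
                {"OutputKey": "RedshiftIamCopyRoleArn",
                 "OutputValue": "arn:aws:iam::123456789012:role/ExampleCopyRole"},
            ],
        }
    ]
}

outputs = stack_outputs(sample_stacks)
```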

Add sample data to Amazon Redshift

Using the RedshiftClusterEndpoint from your CloudFormation outputs, the master user name (masteruser), the password you specified in the AWS CloudFormation stack, and the Redshift database of ‘dev’, connect to your Amazon Redshift cluster using your favorite SQL client. Use one of the following methods:

The sample data to use comes from Step 6: Load Sample Data from Amazon S3. This data contains the ticket sales for events in several categories, along with information about the categories “liked” by the purchasers. Later, you use this data to calculate the correlation between liking a category and attending events (and then further exploration as desired).

Run the table creation commands followed by the COPY commands. Insert the RedshiftIamCopyRoleArn IAM role created by AWS CloudFormation in the COPY commands. At the end of this sequence, the sample data is in Amazon Redshift, in the public schema. Explore the data in the table, using SQL. You explore the same data again in later steps. You now have an Amazon Redshift data warehouse with several normalized tables containing data related to event ticket sales.

Try the following query. Later, you use this same query (minus the limit) from Amazon SageMaker to retrieve data from Amazon Redshift into EMR and Spark. It also helps confirm that you’ve loaded the data into Amazon Redshift correctly.

SELECT distinct u.userid, u.state, 
u.likebroadway, u.likeclassical, u.likeconcerts, u.likejazz, u.likemusicals, u.likeopera, u.likerock, u.likesports, u.liketheatre, u.likevegas, 
d.caldate, d.month, d.year, d.week,
s.pricepaid, s.qtysold, -- s.salesid, s.listid, s.saletime, s.sellerid, s.commission
e.eventname, -- e.venueid, e.catid, e.eventid, 
c.catgroup, c.catname,
v.venuecity, v.venuename, v.venuestate, v.venueseats
FROM  users u, sales s, event e, venue v, date d, category c
WHERE u.userid = s.buyerid and s.dateid = e.dateid and s.eventid = e.eventid and e.venueid = v.venueid 
    and e.dateid = d.dateid and e.catid = c.catid
LIMIT 100;

The ‘like’ fields contain nulls. Convert these to ‘false’ here, to simplify later processing.

SELECT distinct u.userid, u.state, 
NVL(u.likebroadway, false) as likebroadway, NVL(u.likeclassical, false) as likeclassical, NVL(u.likeconcerts, false) as likeconcerts, 
NVL(u.likejazz, false) as likejazz, NVL(u.likemusicals, false) as likemusicals, NVL(u.likeopera, false) as likeopera, NVL(u.likerock, false) as likerock,
NVL(u.likesports, false) as likesports, NVL(u.liketheatre, false) as liketheatre, NVL(u.likevegas, false) as likevegas, 
d.caldate, d.month, d.year, d.week,
s.pricepaid, s.qtysold, -- s.salesid, s.listid, s.saletime, s.sellerid, s.commission
e.eventname, -- e.venueid, e.catid, e.eventid, 
c.catgroup, c.catname,
v.venuecity, v.venuename, v.venuestate, v.venueseats
FROM  users u, sales s, event e, venue v, date d, category c
WHERE u.userid = s.buyerid and s.dateid = e.dateid and s.eventid = e.eventid and e.venueid = v.venueid 
    and e.dateid = d.dateid and e.catid = c.catid
LIMIT 100;
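Because the same NVL wrapper repeats for every ‘like’ column, you may find it easier to generate that part of the select list when you later issue the query from Python. The following is a small sketch; the helper name `nvl_select_list` is hypothetical, and the column names are the ones used in the query above.

```python
LIKE_COLUMNS = [
    "likebroadway", "likeclassical", "likeconcerts", "likejazz", "likemusicals",
    "likeopera", "likerock", "likesports", "liketheatre", "likevegas",
]


def nvl_select_list(columns, alias="u", default="false"):
    """Wrap each column in NVL so nulls come back as the given default."""
    return ", ".join(f"NVL({alias}.{col}, {default}) as {col}" for col in columns)


# Fragment to splice into the SELECT clause of the query.
like_select = nvl_select_list(LIKE_COLUMNS)
```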

Use an AWS Glue crawler to add tables to the Data Catalog

Now that there’s sample data in the Amazon Redshift cluster, the next step is to make the Amazon Redshift tables visible in the AWS Glue Data Catalog. The AWS CloudFormation template set up the components for you: an AWS Glue database, a connection to Amazon Redshift, and a crawler. Now you run the crawler, which reads Amazon Redshift’s catalog and populates the Data Catalog with that information.

First, test that the AWS Glue connection can connect to Amazon Redshift:

  1. In the AWS Glue console, in the left navigation pane, choose Connections.
  2. Select the connection GlueRedshiftConnection, and choose Test Connection.
  3. When asked for an IAM role, choose the GlueRedshiftService role created by the AWS CloudFormation template.
  4. Wait while AWS Glue tests the connection. If it successfully does so, you see the message GlueRedshiftConnection connected successfully to your instance. If it does not, the most likely cause is that the subnet, VPC, and Availability Zone did not match. Or, it could be that the subnet is missing an S3 endpoint or internet access.

Next, retrieve metadata from Amazon Redshift about the tables that exist in the Amazon Redshift database noted in the AWS CloudFormation template parameters. To do so, run the AWS Glue crawler that the AWS CloudFormation template created:

  1. In the AWS Glue console, choose Crawlers in the left-hand navigation bar.
  2. Select GlueRedshiftCrawler in the crawler list, and choose Run Crawler. If asked for an IAM role, choose the GlueRedshiftService role created by the AWS CloudFormation template.
  3. Wait as the crawler runs. It should complete in two or three minutes. You see the status change to Starting, then Running, Stopping, and finally Ready.
  4. When the crawler status is Ready, check the column under Tables Added. You should see that seven tables have been added.
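The console steps above can also be approximated from code. The sketch below assumes a boto3 Glue client is passed in and relies on its `start_crawler` and `get_crawler` calls; the function name, polling interval, and timeout are hypothetical choices, and the states reported by the API (RUNNING, STOPPING, READY) may not match the console labels exactly.

```python
import time


def run_crawler_and_wait(glue_client, crawler_name, poll_seconds=15,
                         timeout_seconds=900, sleep=time.sleep):
    """Start a crawler and poll until it reports READY; return the states observed."""
    glue_client.start_crawler(Name=crawler_name)
    waited = 0
    seen_states = []
    while waited <= timeout_seconds:
        state = glue_client.get_crawler(Name=crawler_name)["Crawler"]["State"]
        # Record each state once, in the order it first appears.
        if not seen_states or seen_states[-1] != state:
            seen_states.append(state)
        if state == "READY":
            return seen_states
        sleep(poll_seconds)
        waited += poll_seconds
    raise TimeoutError(f"Crawler {crawler_name} not READY after {timeout_seconds} seconds")


# Usage in a notebook with credentials configured (not executed here):
#   import boto3
#   run_crawler_and_wait(boto3.client("glue"), "GlueRedshiftCrawler")
```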

To review the tables the crawler added, use the following steps:

  1. Choose Databases and select the database named glueredsage. This database was created by the AWS CloudFormation stack.
  2. Choose Tables in glueredsage.

You should see the tables that you created in Amazon Redshift listed, as shown in the screenshot that follows. The AWS Glue table name is made up of the database (dev), the schema (public), and the actual table name from Amazon Redshift (for example, date). The AWS Glue classification is Amazon Redshift.
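To make the idea of spotting Redshift-backed tables concrete, here is a sketch that filters a response shaped like the boto3 Glue `get_tables` result. It is not the notebook's helper code. The function name is hypothetical, and the sample entries, including the underscore-joined table name, the `classification` value of `redshift`, and the `connectionName` parameter, are assumptions about what the crawler records.

```python
def redshift_tables(get_tables_response):
    """Return name, location, and connection for tables classified as Redshift."""
    found = []
    for table in get_tables_response["TableList"]:
        params = table.get("Parameters", {})
        if params.get("classification") == "redshift":
            found.append({
                "name": table["Name"],
                "location": table.get("StorageDescriptor", {}).get("Location"),
                "connection": params.get("connectionName"),
            })
    return found


# Hand-written stand-in for glue.get_tables(DatabaseName="glueredsage").
sample_tables = {
    "TableList": [
        {"Name": "dev_public_date",
         "StorageDescriptor": {"Location": "dev.public.date"},
         "Parameters": {"classification": "redshift",
                        "connectionName": "GlueRedshiftConnection"}},
        {"Name": "some_csv_table",
         "StorageDescriptor": {"Location": "s3://example-bucket/csv/"},
         "Parameters": {"classification": "csv"}},
    ]
}

tables_in_redshift = redshift_tables(sample_tables)
```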

You access this metadata from your Jupyter notebook in the next step.

Access data defined in AWS Glue Data Catalog from the notebook

In this section, you locate the Amazon Redshift data of interest in the AWS Glue Data Catalog and get the data from Amazon Redshift, from an Amazon SageMaker notebook.

  1. In the Amazon SageMaker console, in the left navigation pane, choose Notebook instances.
  2. Next to the notebook started by your AWS CloudFormation stack, choose Open Jupyter.

You see a page similar to the screenshot that follows. The Amazon SageMaker lifecycle configuration in the AWS CloudFormation stack automatically uploaded the notebook Using_SageMaker_Notebooks_to_access_Redshift_via_Glue.ipynb to your Jupyter dashboard.

Open the notebook. The kernel type is “SparkMagic (PySpark)”. Alternatively, you can browse the static results of a prior run in HTML format. The following links take you to the relevant section in this version.

Begin executing the cells in the notebook, following the instructions there. The instructions there walk you through:

  • Accessing the Spark cluster from your local notebook via Livy, and issuing a simple PySpark statement from your local notebook to show how you can use PySpark in this environment.
  • Listing the databases in your AWS Glue Data Catalog, and showing the tables in the AWS Glue database, glueredsage, that you set up previously via the AWS CloudFormation template. Here, you use a couple of Python helper functions to access the Data Catalog from your local notebook. You can identify the tables of interest from the Data Catalog, and see that they’re stored in Amazon Redshift. This is your clue that you must connect to Amazon Redshift to read this data.
  • Retrieving Amazon Redshift connection information from the Data Catalog for the tables of interest.
  • Retrieving the data relevant to your planned research problem from a series of Amazon Redshift tables into Spark on EMR using two methods: retrieving the full table, or executing a SQL query that joins and filters the data. First, you retrieve a small Amazon Redshift table containing some metadata: the categories of events. Then, you perform a complex query that pulls back a flattened dataset containing data about which eventgoers in which cities like what types of events (Broadway, Jazz, classical, etc.). Irrelevant data is not retrieved for further analysis. The data comes back as a Spark data frame, on which you can perform additional analysis.
  • Using the resulting (potentially large) dataframe on EMR to first perform some ML functions in Spark: converting several columns into one-hot vector representations and calculating correlations between them. The dataframe of correlations is much smaller, and is practical to process on your local notebook.
  • Lastly, working with the processed data frame in your local notebook instance. Here you visualize the (much smaller) results of your correlations locally.

Here’s the result of your initial analysis, showing the correlation between event attendance versus categories of events liked:

You can see that, based on these ticket purchases and event attendances, the likes and event categories are only weakly correlated (max correlation is 0.02). Though the correlations are weak, relatively speaking:

  • Liking theatre is positively correlated with attending musicals.
  • Liking opera is positively correlated with attending plays.
  • Liking rock is negatively correlated with attending musicals.
  • Liking Broadway is negatively correlated with attending plays (surprisingly!).
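The kind of calculation behind these numbers can be sketched in plain Python on a toy dataset. This is only an illustration of one-hot encoding followed by Pearson correlation; the notebook performs the equivalent work in Spark on EMR, and the four rows below are invented, so the resulting values say nothing about the real data.

```python
from math import sqrt


def one_hot(values):
    """Return one 0/1 indicator list per distinct category value."""
    categories = sorted(set(values))
    return {cat: [1.0 if v == cat else 0.0 for v in values] for cat in categories}


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant column
    return cov / sqrt(var_x * var_y)


# Invented rows: whether the buyer likes theatre, and the category attended.
rows = [
    {"liketheatre": True, "catname": "Musicals"},
    {"liketheatre": True, "catname": "Musicals"},
    {"liketheatre": False, "catname": "Plays"},
    {"liketheatre": False, "catname": "Pop"},
]

likes = [1.0 if row["liketheatre"] else 0.0 for row in rows]
attended = one_hot([row["catname"] for row in rows])
correlations = {cat: pearson(likes, flags) for cat, flags in attended.items()}
```

Each entry in `correlations` pairs one ‘like’ flag with one attended category, which is the same shape as a single row of the correlation matrix visualized in the notebook.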

Debugging your connection

If your notebook does not connect to your EMR cluster, review the following information to see where the problem lies.

Amazon SageMaker notebook instances can use a lifecycle configuration. With a lifecycle configuration, you can provide a Bash script to be run whenever an Amazon SageMaker notebook instance is created, or when it is restarted after having been stopped. The AWS CloudFormation template uses a creation-time script to configure the Livy configuration on the notebook instance with the address of the EMR master instance created earlier. The most common sources of connection difficulties are as follows:

  • Not having the correct settings in livy.conf.
  • Not having the correct ports open on the security groups between the EMR cluster and the notebook instance.

When the notebook instance is created or started, the results of running the lifecycle config are captured in an Amazon CloudWatch Logs log group called /aws/sagemaker/NotebookInstances. This log group has a stream for <notebook-instance-name>/LifecycleConfigOnCreate script results, and another for <notebook-instance-name>/LifecycleConfigOnStart (shown below for a notebook instance of “test-scripts2”). These streams contain log messages from the lifecycle script executions, and you can see if any errors occurred.

Next, check the Livy configuration and EMR access on the notebook instance. In the Jupyter files dashboard, choose New, Terminal. This opens a shell for you on the notebook instance.

The Livy config file is stored in: /home/ec2-user/SageMaker/.sparkmagic/config.json. Check to make sure that your EMR cluster IP address has replaced the original http://localhost:8998 address in three places in the file.
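A quick way to run this check from the terminal is sketched below. It assumes the three places are the `url` fields under `kernel_python_credentials`, `kernel_scala_credentials`, and `kernel_r_credentials`, which is how SparkMagic's example configuration is laid out; the function names and the sample address are hypothetical.

```python
import json

CREDENTIAL_KEYS = (
    "kernel_python_credentials",
    "kernel_scala_credentials",
    "kernel_r_credentials",
)


def load_config(path):
    """Read the SparkMagic config file into a dictionary."""
    with open(path) as handle:
        return json.load(handle)


def stale_livy_urls(config, placeholder="localhost"):
    """Return the credential sections whose url is missing or still the placeholder."""
    stale = []
    for key in CREDENTIAL_KEYS:
        url = config.get(key, {}).get("url", "")
        if not url or placeholder in url:
            stale.append(key)
    return stale


# Example: two sections updated, one still pointing at localhost.
sample_config = {
    "kernel_python_credentials": {"url": "http://10.0.0.12:8998"},
    "kernel_scala_credentials": {"url": "http://10.0.0.12:8998"},
    "kernel_r_credentials": {"url": "http://localhost:8998"},
}
needs_fixing = stale_livy_urls(sample_config)
```

On the notebook instance you would call `stale_livy_urls(load_config("/home/ec2-user/SageMaker/.sparkmagic/config.json"))`; an empty list suggests the lifecycle script updated all three entries.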

If you are receiving errors during data retrieval from Amazon Redshift, check whether the request is getting to Amazon Redshift.

  1. In the Amazon Redshift console, choose Clusters and select the cluster started by the AWS CloudFormation template.
  2. Choose Queries.
  3. Your request should be in the list of the SQL queries that the Amazon Redshift cluster has executed. If it isn’t, check that the connection to Amazon Redshift is working, and you’ve used the correct IAM copy role, userid, and password.

A last place to check is the temporary S3 directory that you specified in the copy statement. You should see a folder placed there with the data that was unloaded from Amazon Redshift.

Extending the solution and using in production

The example provided uses a simple dataset and SQL to allow you to more easily focus on the connections between the components. However, the real power of the solution comes from accessing the full capabilities of your Amazon Redshift data warehouse and the data within. You can use far more complex SQL queries—with joins, aggregations, and filters—to manipulate, transform, and reduce the data within Amazon Redshift. Then, pull back the subsets of interest into Amazon SageMaker for more detailed exploration.

This section touches on three additional questions:

  • What about merging Amazon Redshift data with data in S3?
  • What about moving from the data-exploration phase into training your ML model, and then to production?
  • How do you replicate this solution?

Using Redshift Spectrum in this solution

During this data-exploration phase, you may find that some additional data exists on S3 that is useful in combination with the data housed on Amazon Redshift. It’s straightforward to merge the two, using the power of Amazon Redshift Spectrum. Amazon Redshift Spectrum directly queries data in S3, using the same SQL syntax of Amazon Redshift. You can also run queries that span both the frequently accessed data stored locally in Amazon Redshift and your full datasets stored cost-effectively in S3.

To use this capability from your Amazon SageMaker notebook:

  1. First, follow the instructions for Cataloging Tables with a Crawler to add your S3 datasets to your AWS Glue Data Catalog.
  2. Then, follow the instructions in Creating External Schemas for Amazon Redshift Spectrum to add an existing external schema to Amazon Redshift. You need the permissions described in Policies to Grant Minimum Permissions.

After the external schema is defined in Amazon Redshift, you can use SQL to read the S3 files from Amazon Redshift. You can also seamlessly join, aggregate, and filter the S3 files with Amazon Redshift tables.
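For orientation, the statement that defines such an external schema can be generated as in the sketch below. The helper name, the schema name `spectrum_demo`, the database name, and the role ARN are all hypothetical; consult the linked instructions for the permissions the role needs.

```python
import re


def create_external_schema_sql(schema_name, glue_database, iam_role_arn):
    """Build a CREATE EXTERNAL SCHEMA statement backed by a Glue Data Catalog database."""
    # Allow only simple identifiers so the name can be used unquoted.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", schema_name):
        raise ValueError(f"Unexpected schema name: {schema_name!r}")
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema_name}\n"
        f"FROM DATA CATALOG DATABASE '{glue_database}'\n"
        f"IAM_ROLE '{iam_role_arn}';"
    )


schema_sql = create_external_schema_sql(
    "spectrum_demo",
    "example_glue_db",
    "arn:aws:iam::123456789012:role/ExampleSpectrumRole",
)
```

Once the schema exists, tables in it can be referenced as `spectrum_demo.<table>` in the same joins and filters you run against local Amazon Redshift tables.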

In exactly the same way, you can use SQL from within the notebook to read the combined S3 and Amazon Redshift data into Spark/EMR. From there, read it into your notebook, using the functions already defined.

Moving from exploration to training and production

The pipeline described here—reading directly from Amazon Redshift—is optimized for the data-exploration phase of your ML project. During this phase, you’re likely iterating quickly across different datasets, seeing which data and which combinations are useful for the problem you’re solving.

After you’ve settled on the data to be used for training, it is more appropriate to materialize the final SQL into an extract on S3. The dataset on S3 can then be used for the training phase, as is demonstrated in the sample Amazon SageMaker notebooks.
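One way to materialize the extract is an Amazon Redshift UNLOAD statement. The sketch below only builds the statement text; the helper name, S3 prefix, role ARN, and the choice of CSV with a header row are assumptions for illustration. Single quotes inside the query are doubled because UNLOAD takes the query as a quoted string.

```python
def unload_sql(select_sql, s3_prefix, iam_role_arn):
    """Build an UNLOAD statement that writes a query result to S3 as CSV."""
    # Drop a trailing semicolon and double any single quotes inside the query.
    escaped = select_sql.strip().rstrip(";").replace("'", "''")
    return (
        f"UNLOAD ('{escaped}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV HEADER;"
    )


training_extract_sql = unload_sql(
    "SELECT venuename, venueseats FROM venue WHERE venuestate = 'NV';",
    "s3://example-bucket/training/venues_",
    "arn:aws:iam::123456789012:role/ExampleCopyRole",
)
```

You would run the resulting statement from your SQL client, then point the training job at the files written under the S3 prefix.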

Deployment into production has different requirements, with a different data access pattern. For example, the interactive responses needed by online transactions are not a good fit for Amazon Redshift. Consider the needs of your application and data pipeline, and engineer an appropriate combination of data sources and access methods for that need.

Replicating this solution


To avoid additional charges, remember to delete the AWS CloudFormation stack when you’ve finished with the solution.


By now, you can see the true power of this combination in exploring data that’s in your data lake and data warehouse:

  • Expose data via the AWS Glue Data Catalog.
  • Use the scalability and processing capabilities of Amazon Redshift and Amazon EMR to preprocess, filter, join, and aggregate data from your Amazon S3 data lake.
  • Your data scientists can use tools they’re familiar with—Amazon SageMaker, Jupyter notebooks, and SQL—to quickly explore and visualize data that’s already been cataloged.

Another source of friction has been removed, and your data scientists can move at the pace of business.

About the Author

Veronika Megler is a Principal Consultant, Big Data, Analytics & Data Science, for AWS Professional Services. She holds a PhD in Computer Science, with a focus on spatio-temporal data search. She specializes in technology adoption, helping customers use new technologies to solve new problems and to solve old problems more efficiently and effectively.




Introducing Translatotron: An End-to-End Speech-to-Speech Translation Model

Speech-to-speech translation systems have been developed over the past several decades with the goal of helping people who speak different languages to communicate with each other. Such systems have usually been broken into three separate components: automatic speech recognition to transcribe the source speech as text, machine translation to translate the transcribed text into the target language, and text-to-speech synthesis (TTS) to generate speech in the target language from the translated text. Dividing the task into such a cascade of systems has been very successful, powering many commercial speech-to-speech translation products, including Google Translate.

In “Direct speech-to-speech translation with a sequence-to-sequence model”, we propose an experimental new system that is based on a single attentive sequence-to-sequence model for direct speech-to-speech translation without relying on intermediate text representation. Dubbed Translatotron, this system avoids dividing the task into separate stages, providing a few advantages over cascaded systems, including faster inference speed, naturally avoiding compounding errors between recognition and translation, making it straightforward to retain the voice of the original speaker after translation, and better handling of words that do not need to be translated (e.g., names and proper nouns).

The emergence of end-to-end models on speech translation started in 2016, when researchers demonstrated the feasibility of using a single sequence-to-sequence model for speech-to-text translation. In 2017, we demonstrated that such end-to-end models can outperform cascade models. Many approaches to further improve end-to-end speech-to-text translation models have been proposed recently, including our effort on leveraging weakly supervised data. Translatotron goes a step further by demonstrating that a single sequence-to-sequence model can directly translate speech from one language into speech in another language, without relying on an intermediate text representation in either language, as is required in cascaded systems.

Translatotron is based on a sequence-to-sequence network which takes source spectrograms as input and generates spectrograms of the translated content in the target language. It also makes use of two other separately trained components: a neural vocoder that converts output spectrograms to time-domain waveforms, and, optionally, a speaker encoder that can be used to maintain the character of the source speaker’s voice in the synthesized translated speech. During training, the sequence-to-sequence model uses a multitask objective to predict source and target transcripts at the same time as generating target spectrograms. However, no transcripts or other intermediate text representations are used during inference.

Model architecture of Translatotron.

We validated Translatotron’s translation quality by measuring the BLEU score, computed with text transcribed by a speech recognition system. Though our results lag behind a conventional cascade system, we have demonstrated the feasibility of the end-to-end direct speech-to-speech translation.

The audio clips below compare the direct speech-to-speech translation output from Translatotron with that of the baseline cascade method. In this case, both systems provide a suitable translation and speak naturally using the same canonical voice.

Input (Spanish)
Reference translation (English)
Baseline cascade translation
Translatotron translation

You can listen to more audio samples here.

Preserving Vocal Characteristics
By incorporating a speaker encoder network, Translatotron is also able to retain the original speaker’s vocal characteristics in the translated speech, which makes the translated speech sound more natural and less jarring. This feature leverages previous Google research on speaker verification and speaker adaptation for TTS. The speaker encoder is pretrained on the speaker verification task, learning to encode speaker characteristics from a short example utterance. Conditioning the spectrogram decoder on this encoding makes it possible to synthesize speech with similar speaker characteristics, even though the content is in a different language.

The audio clips below demonstrate the performance of Translatotron when transferring the original speaker’s voice to the translated speech. In this example, Translatotron gives a more accurate translation than the baseline cascade model, while retaining the original speaker’s vocal characteristics. The Translatotron model that retains the original speaker’s voice is trained with less data than the one using the canonical voice, so the two yield slightly different translations.

Input (Spanish)
Reference translation (English)
Baseline cascade translation
Translatotron translation (canonical voice)
Translatotron translation (original speaker’s voice)

More audio samples are available here.

To the best of our knowledge, Translatotron is the first end-to-end model that can directly translate speech from one language into speech in another language. It is also able to retain the source speaker’s voice in the translated speech. We hope that this work can serve as a starting point for future research on end-to-end speech-to-speech translation systems.

This research was a joint work between the Google Brain, Google Translate, and Google Speech teams. Contributors include Ye Jia, Ron J. Weiss, Fadi Biadsy, Wolfgang Macherey, Melvin Johnson, Zhifeng Chen, Mengmeng Niu, Quan Wang, Jason Pelecanos, Ignacio Lopez Moreno, Tom Walters, Heiga Zen, Patrick Nguyen, Yu Zhang, Jonathan Shen, Orhan Firat, and Yonghui Wu. We also thank Jorge Pereira and Stella Laurenzo for verifying the quality of the translation from Translatotron.

Paige.AI Ramps Up Cancer Pathology Research Using NVIDIA Supercomputer

An accurate diagnosis is key to treating cancer — a disease that kills 600,000 people a year in the U.S. alone — and AI can help.

Common forms of the disease, like breast, lung and prostate cancer, can have good recovery rates when diagnosed early. But diagnosing the tumor, the work of pathologists, can be a very manual, challenging and time-consuming process.

Pathologists traditionally interpret dozens of slides per cancer case, searching for clues pointing to a cancer diagnosis. For example, there can be more than 60 slides for a single breast cancer case and, out of those, only a handful may contain important findings.

AI can help pathologists become more productive by accelerating and enhancing their workflow as they examine massive amounts of data. It gives the pathologists the tools to analyze images, provide insight based on previous cases and diagnose faster by pinpointing anomalies.

Paige.AI is applying AI to pathology to increase diagnostic accuracy and deliver better patient outcomes, starting with prostate and breast cancer. Earlier this year, Paige.AI was granted “Breakthrough Designation” by the U.S. Food and Drug Administration, the first such designation for AI in cancer diagnosis.

The FDA grants the designation for technologies that have the potential to provide for more effective diagnosis or treatment for life-threatening or irreversibly debilitating diseases, where timely availability is in the best interest of patients.

To find breakthroughs in cancer diagnosis, Paige.AI will access millions of pathology slides, providing the volume of data necessary to train and develop cutting-edge AI algorithms.

DGX-1 AI supercomputer
NVIDIA DGX-1 is proving to be an important research tool for many of the world’s leading AI researchers.

To make sense of all this data, Paige.AI uses an AI supercomputer made up of 10 interconnected NVIDIA DGX-1 systems. The supercomputer has the enormous computing power of over 10 petaflops necessary to develop a clinical-grade model for pathology and, for the first time, bridge the gap from research to a clinical setting that benefits future patients.

One example of how NVIDIA’s technology is already being used is a recent study by Paige.AI that used seven NVIDIA DGX-1 systems to train neural networks on a new dataset to detect prostate cancer. The dataset consisted of 12,160 slides, two orders of magnitude larger than previous datasets in pathology. The researchers achieved near perfect accuracy on a test set consisting of 1,824 real-world slides without any manual image-annotation.

By minimizing the time pathologists spend processing data, AI can help them focus their time on analyzing it. This is especially critical given the short supply of pathologists.

According to The Lancet medical journal, there is a single pathologist for every million people in sub-Saharan Africa and one for every 130,000 people in China. In the United States, there is one for roughly every 20,000 people; however, studies predict that number will shrink to one for about every 30,000 people by 2030.

AI gives a big boost to computational pathology by enabling quantitative analysis of the study of structures seen under a microscope and cell biology. This advancement is made possible by combining novel image analysis, computer vision and machine learning techniques.

“With the help of NVIDIA technology, Paige.AI is able to train deep neural networks from hundreds of thousands of gigapixel images of whole slides. The result is clinical-grade artificial intelligence for pathology,” said Dr. Thomas Fuchs, co-founder and chief scientific officer at Paige.AI. “Our vision is to help pathologists improve the efficiency of their work, for researchers to generate new insights, and clinicians to improve patient care.”


Feature image credit: Dr. Cecil Fox, National Cancer Institute, via Wikimedia Commons.

The post Paige.AI Ramps Up Cancer Pathology Research Using NVIDIA Supercomputer appeared first on The Official NVIDIA Blog.

As search needs evolve, Microsoft makes AI tools for better search available to researchers and developers

Only a few years ago, web search was simple. Users typed a few words and waded through pages of results.

Today, those same users may instead snap a picture on a phone and drop it into a search box or use an intelligent assistant to ask a question without physically touching a device at all. They may also type a question and expect an actual reply, not a list of pages with likely answers.

These tasks challenge traditional search engines, which are based around an inverted index system that relies on keyword matches to produce results.
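As a rough illustration of why keyword matching struggles with natural-language questions, the sketch below builds a tiny inverted index and runs an AND-style keyword lookup against it. The two documents and the function names are hypothetical, and the matching rule is deliberately naive; production search engines use far more elaborate ranking.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def keyword_search(index, query):
    """Return ids of documents containing every query token (AND match)."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical two-document corpus.
documents = {
    1: "the eiffel tower is 1,063 feet high",
    2: "paris is the capital of france",
}
index = build_inverted_index(documents)

exact_hits = keyword_search(index, "eiffel tower")
question_hits = keyword_search(index, "how tall is the tower in paris")
print(exact_hits, question_hits)
```

The exact keywords find document 1, while the question returns nothing because words such as "how" and "tall" never occur in the indexed text, even though the answer is there.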

“Keyword search algorithms just fail when people ask a question or take a picture and ask the search engine, ‘What is this?’” said Rangan Majumder, group program manager on Microsoft’s Bing search and AI team.

Of course, keeping up with users’ search preferences isn’t new — it’s been a struggle since web search’s inception. But now, it’s becoming easier to meet those evolving needs, thanks to advancements in artificial intelligence, including those pioneered by Bing’s search team and researchers at Microsoft’s Asia research lab.

“The AI is making the products we work with more natural,” said Majumder. “Before, people had to think, ‘I’m using a computer, so how do I type in my input in a way that won’t break the search?’”

Microsoft has made one of the most advanced AI tools it uses to better meet people’s evolving search needs available to anyone as an open source project on GitHub. On Wednesday, it also released user example techniques and an accompanying video for those tools via Microsoft’s AI lab.

The algorithm, called Space Partition Tree And Graph (SPTAG), allows users to take advantage of the intelligence from deep learning models to search through billions of pieces of information, called vectors, in milliseconds. That, in turn, means they can more quickly deliver more relevant results to users.

Vector search makes it easier to search by concept rather than keyword. For example, if a user types in “How tall is the tower in Paris?” Bing can return a natural language result telling the user the Eiffel Tower is 1,063 feet, even though the word “Eiffel” never appeared in the search query and the word “tall” never appears in the result.

Microsoft uses vector search for its own Bing search engine, and the technology is helping Bing better understand the intent behind billions of web searches and find the most relevant result among billions of web pages.


Using vectors for better search

Essentially a numerical representation of a word, image pixel or other data point, a vector helps capture what a piece of data actually means. Thanks to advances in a branch of AI called deep learning, Microsoft said it can begin to understand and represent search intent using these vectors.

Once the numerical point has been assigned to a piece of data, vectors can be arranged, or mapped, with close numbers placed in proximity to one another to represent similarity. These proximal results get displayed to users, improving search outcomes.
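The idea of placing similar items close together can be sketched with a brute-force nearest-neighbor lookup over a few hand-made vectors. Everything here is an assumption for illustration: the three-dimensional vectors are invented rather than produced by a deep learning model, and scanning every vector is not how SPTAG works, since it relies on approximate search structures to reach billions of vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query_vec, vectors, k=2):
    """Return the names of the k vectors most similar to the query."""
    scored = sorted(
        vectors.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]


# Invented toy embeddings; a real system would get these from a trained model.
vectors = {
    "eiffel tower height": [0.9, 0.8, 0.1],
    "paris landmarks": [0.8, 0.3, 0.2],
    "chocolate cake recipe": [0.1, 0.0, 0.9],
}
query = [0.85, 0.7, 0.1]  # stands in for "How tall is the tower in Paris?"

top = nearest(query, vectors, k=2)
print(top)
```

The query shares no keywords with the stored entries; the match comes entirely from the vectors pointing in similar directions.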

The technology behind the vector search Bing uses got its start when company engineers began noticing unusual trends in users’ search patterns.

“In analyzing our logs, the team found that search queries were getting longer and longer,” said Majumder. This suggested that users were asking more questions, over-explaining because of past, poor experiences with keyword search, or were “trying to act like computers” when describing abstract things — all unnatural and inconvenient for users.

With Bing search, the vectorizing effort has extended to over 150 billion pieces of data indexed by the search engine to bring improvement over traditional keyword matching. These include single words, characters, web page snippets, full queries and other media. Once a user searches, Bing can scan the indexed vectors and deliver the best match.

Vector assignment is also trained using deep learning technology for ongoing improvement. The models consider inputs like end-user clicks after a search to get better at understanding the meaning of that search.

While the idea of vectorizing media and search data isn’t new, it’s only recently been possible to use it on the scale of a massive search engine such as Bing, Microsoft experts said.

“Bing processes billions of documents every day, and the idea now is that we can represent these entries as vectors and search through this giant index of 100 billion-plus vectors to find the most related results in 5 milliseconds,” said Jeffrey Zhu, program manager on Microsoft’s Bing team.

To put that in perspective, Majumder said, consider this: A stack of 150 billion business cards would stretch from here to the moon. Within a blink of an eye, Bing’s search using SPTAG can find 10 different business cards one after another within that stack of cards.

Uses for visual, audio search

The Bing team said they expect the open source offering could be used for enterprise or consumer-facing applications to identify a language being spoken based on an audio snippet, or for image-heavy services such as an app that lets people take pictures of flowers and identify what type of flower it is. For those types of applications, a slow or irrelevant search experience is frustrating.

“Even a couple seconds for a search can make an app unusable,” noted Majumder.

The team also is hoping that researchers and academics will use it to explore other areas of search breakthroughs.

“We’ve only started to explore what’s really possible around vector search at this depth,” he said.

The post As search needs evolve, Microsoft makes AI tools for better search available to researchers and developers appeared first on The AI Blog.

Bird’s-AI View: Harnessing Drones to Improve Traffic Flow

Traffic. It’s one of the most commonly cited frustrations across the globe.

It consumed nearly 180 hours of productive time for the average U.K. driver last year. German drivers lost an average of 120 hours. U.S. drivers lost nearly 100 hours.

Because time is too precious to waste, RCE Systems — a Brno, Czech Republic-based startup and member of the NVIDIA Inception program — is taking its tech to the air to improve traffic flow.

Its DataFromSky platform combines trajectory analysis, computer vision and drones to ease congestion and improve road safety.

AI in the Sky

Traffic analysis has traditionally been based on video footage from fixed cameras, mounted at specific points along roads and highways.

This can severely limit the analysis of traffic which is, by nature, constantly moving and changing.

Capturing video from a bird’s-eye perspective via drones allows RCE Systems to gain deeper insights into traffic.

Beyond monitoring objects captured on video, the DataFromSky platform interprets movements using AI to provide highly accurate telemetric data about every object in the traffic flow.

RCE Systems trains its deep neural networks using thousands of hours of video footage from around the globe, shot in various weather conditions. The training takes place on NVIDIA GPUs using Caffe and TensorFlow.

These specialized neural networks can then recognize objects of interest and continually track them in video footage.

The data captured via this AI process is used in numerous research projects, enabling deeper analysis of object interaction and new behavioral models of drivers in specific traffic situations.

Ultimately, this kind of data will also be crucial for the development of autonomous vehicles.

Driving Impact

The DataFromSky platform is still in its early days, but its impact is already widespread.

RCE Systems is working on a system for analyzing safety at intersections, based on driver behavior. This includes detecting situations where accidents were narrowly avoided and then determining root causes.

By understanding these situations better, their occurrence can be avoided — making traffic flow easier and preventing vehicle damage as well as potential loss of life.

Toyota Europe used RCE Systems’ findings from the DataFromSky platform to create probabilistic models of driver behavior as well as deeper analysis of interactions with roundabouts.

Leidos used insights gathered by RCE Systems to calibrate traffic simulation models as part of its projects to examine narrowing freeway lanes and shoulders in Dallas, Seattle, San Antonio and Honolulu.

And the value of RCE Systems’ analysis is not limited to vehicles. The Technical University of Munich has used it to perform a behavioral study of cyclists and pedestrians.

Moving On

RCE Systems is looking to move to NVIDIA Jetson AGX Xavier in the future to accelerate their AI at the edge solution. They are currently developing a “monitoring drone” capable of evaluating image data in flight, in real time.

It could one day replace a police helicopter during high-speed chases or act as a mobile surveillance system for property protection.

The post Bird’s-AI View: Harnessing Drones to Improve Traffic Flow appeared first on The Official NVIDIA Blog.

The AWS DeepRacer League virtual circuit is underway—win a trip to re:Invent 2019!

The competition is heating up in the AWS DeepRacer League, the world’s first global autonomous racing league, open to anyone. The first round is almost halfway home, now that 9 of the 21 stops on the summit circuit schedule are complete. Developers continue to build new machine learning skills and post winning times to the leaderboards. Here’s a quick round-up of the news from all of this week’s action.

The AWS DeepRacer virtual circuit launched on April 29. Developers of all skill levels can enter the league from anywhere in the world via the AWS DeepRacer console.

The first of six monthly tracks is the London Loop, and racing is well underway. As of May 8, 2019, there are 346 participants on the leaderboard, competing to be crowned the first champion of the virtual circuit and advance on an all-expenses-paid trip to re:Invent. Our current leader is Holly, with a time of 12.48 seconds. Twenty-three days remain, so there’s still time to get rolling into the online competition. There are prizes for the Top 10, and plenty of chances to win!

On the Summit Circuit this week, the AWS DeepRacer League made stops in Madrid and London and crowned two new champions. They both advance on an all-expenses-paid trip to re:Invent 2019 in Las Vegas, Nevada.

First up was Madrid, the third city in Europe to host the AWS DeepRacer League. The crowd was energetic and the competitors eager to win. The top 3 took to the tracks 14 times between them.

Pedro, Javier, and David arrived at the AWS Summit together, with 27 models that they had been training together in the AWS DeepRacer 3D racing simulator. They had seen some good results in the virtual world. However, the first couple of runs on the track didn’t seem to deliver in the same way, with our champion Pedro posting an opening time of 40 seconds. They pulled together as a team, tuning and trying the different models they had built at home, and eventually began to see much better results.

In the following video, David shares their thoughts on strategy during the day.

With about two hours of racing left, and on his fourth attempt, Pedro was the lucky team member who took the top spot with a winning time of 9.36 seconds. His colleagues were not far behind, claiming the second and third spots. Pedro advances to the finals and is excited to work with his teammates on a strategy to take home the AWS DeepRacer League Championship Cup. Don’t worry, they both join him to take on the rest of the field!

And on to London, the hometown of the reigning AWS DeepRacer Champion, Rick Fish. Developers came to the expo hall at the AWS Summit, for a full day of racing on two tracks and the chance to win their trip to re:Invent 2019.

The day started strong with our eventual third-place finisher “breadcentric,” with a 13-second lap. New to machine learning, he brought his model to the AWS Summit and was ready to race as soon as the tracks opened at 8AM. The competition came in strong as competitors quickly started logging lap times under 10 seconds, including our eventual champion, Matt Camp. Matt works at Jigsaw XYZ, whose cofounder happens to be Rick Fish! Rick’s team at Jigsaw XYZ had been preparing for the London race since re:Invent and knew that the pressure would be on to win.

Matt had been working on his model at home and was eager to see how well it could perform. Matt’s friend and colleague Tony joined him. With only 1 hour to go, they were in second and third position on the podium, behind Raul, who had spent most of the day on top with a 9.01-second lap. The Jigsaw XYZ team took to the tracks one more time. In his final 2 minutes of racing, Matt clinched the title with an 8.9-second lap. Matt had no experience with machine learning before re:Invent 2018. He now heads back in 2019 to take on Rick Fish and the rest of the field to win the AWS DeepRacer League Championship Cup.

The competition and excitement are certainly building in the AWS DeepRacer League. Developers of all skill levels get hands-on, learn, and put their machine learning skills to the ultimate test. Get started in the AWS DeepRacer League, either virtually or at the next summit near you. We have all the tools to get you started even if you have no machine learning experience, as well as resources to help you take on the challenge and win!

Coming soon, we share our best tips from the AWS DeepRacer team, so stay tuned.

About the Author

Alexandra Bush is a Senior Product Marketing Manager for AWS AI. She is passionate about how technology impacts the world around us and enjoys being able to help make it accessible to all. Out of the office she loves to run, travel and stay active in the outdoors with family and friends.




An End-to-End AutoML Solution for Tabular Data at KaggleDays

Machine learning (ML) for tabular data (e.g. spreadsheet data) is one of the most active research areas in both ML research and business applications. Solutions to tabular data problems, such as fraud detection and inventory prediction, are critical for many business sectors, including retail, supply chain, finance, manufacturing, marketing and others. Current ML-based solutions to these problems can be achieved by those with significant ML expertise, including manual feature engineering and hyper-parameter tuning, to create a good model. However, the lack of broad availability of these skills limits the efficiency of business improvements through ML.

Google’s AutoML efforts aim to make ML more scalable and accelerate both research and industry applications. Our initial efforts of neural architecture search have enabled breakthroughs in computer vision with NasNet, and evolutionary methods such as AmoebaNet and hardware-aware mobile vision architecture MNasNet further show the benefit of these learning-to-learn methods. Recently, we applied a learning-based approach to tabular data, creating a scalable end-to-end AutoML solution that meets three key criteria:

  • Full automation: Data and computation resources are the only inputs, while a servable TensorFlow model is the output. The whole process requires no human intervention.
  • Extensive coverage: The solution is applicable to the majority of arbitrary tasks in the tabular data domain.
  • High quality: Models generated by AutoML have comparable quality to models manually crafted by top ML experts.

To benchmark our solution, we entered our algorithm in the KaggleDays SF Hackathon, an 8.5 hour competition of 74 teams with up to 3 members per team, as part of the KaggleDays event. This was the first time that AutoML competed against Kaggle participants. The competition involved predicting manufacturing defects given information about the material properties and testing results for batches of automotive parts. Despite competing against participants that were at the Kaggle progression system Master level, including many who were at the GrandMaster level, our team (“Google AutoML”) led for most of the day and ended up finishing in second place by a narrow margin, as seen in the final leaderboard.

Our team’s AutoML solution was a multistage TensorFlow pipeline. The first stage is responsible for automatic feature engineering, architecture search, and hyperparameter tuning through search. The promising models from the first stage are fed into the second stage, where cross validation and bootstrap aggregating are applied for better model selection. The best models from the second stage are then combined in the final model.
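The second-stage ideas named above, cross validation for model selection and bootstrap aggregating (bagging) of the selected model, can be sketched in a few lines. This is a toy under stated assumptions and not the AutoML pipeline itself: the two candidate models (a mean predictor and a one-dimensional least-squares line), the synthetic data, and all function names are invented for illustration.

```python
import random
import statistics


def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean."""
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Candidate: one-dimensional least-squares line."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var else 0.0
    return lambda x: my + slope * (x - mx)


def cross_val_mse(fit, xs, ys, k=5):
    """Mean squared error on held-out points, pooled over k folds."""
    errors = []
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        test = [i for i in range(len(xs)) if i % k == fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors.extend((model(xs[i]) - ys[i]) ** 2 for i in test)
    return statistics.fmean(errors)


def bagged(fit, xs, ys, n_models=10, seed=0):
    """Bootstrap aggregating: average models fit on resampled data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(xs)) for _ in xs]
        models.append(fit([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: statistics.fmean(m(x) for m in models)


# Synthetic data: y = 2x + 1 plus a little Gaussian noise.
data_rng = random.Random(42)
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + 1.0 + data_rng.gauss(0, 0.1) for x in xs]

candidates = {"mean": fit_mean, "line": fit_line}
scores = {name: cross_val_mse(fit, xs, ys) for name, fit in candidates.items()}
best_name = min(scores, key=scores.get)          # model selection by CV error
ensemble = bagged(candidates[best_name], xs, ys)  # bagging of the winner
prediction = ensemble(2.5)
print(best_name, round(prediction, 2))
```

Cross validation picks the candidate that generalizes best to held-out data, and bagging then reduces the variance of that candidate by averaging several fits on resampled copies of the training set.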

The workflow for the “Google AutoML” team was quite different from that of other Kaggle competitors. While they were busy analyzing data and experimenting with various feature engineering ideas, our team spent most of its time monitoring jobs and waiting for them to finish. Our solution for second place on the final leaderboard required 1 hour on 2500 CPUs to finish end-to-end.

After the competition, Kaggle published a public kernel to investigate winning solutions and found that augmenting the top hand-designed models with AutoML models, such as ours, could be a useful way for ML experts to create even better performing systems. As can be seen in the plot below, AutoML has the potential to enhance the efforts of human developers and address a broad range of ML problems.

Potential model quality improvement on final leaderboard if AutoML models were merged with other Kagglers’ models. “Erkut & Mark, Google AutoML”, includes the top winner “Erkut & Mark” and the second place “Google AutoML” models. Erkut Aykutlug and Mark Peng used XGBoost with creative feature engineering whereas AutoML uses both neural network and gradient boosting tree (TFBT) with automatic feature engineering and hyperparameter tuning.

Google Cloud AutoML Tables
The solution we presented at the competition is the main algorithm in Google Cloud AutoML Tables, which was recently launched (beta) at Google Cloud Next ’19. The AutoML Tables implementation regularly performs well in benchmark tests against Kaggle competitions as shown in the plot below, demonstrating state-of-the-art performance across the industry.

Third party benchmark of AutoML Tables on multiple Kaggle competitions

We are excited about the potential application of AutoML methods across a wide range of real business problems. Customers have already been leveraging their tabular enterprise data to tackle mission-critical tasks like supply chain management and lead conversion optimization using AutoML Tables, and we are excited to be providing our state-of-the-art models to solve tabular data problems.

This project was only possible thanks to Google Brain team members Ming Chen, Da Huang, Yifeng Lu, Quoc V. Le and Vishy Tirumalashetty. We also thank Dawei Jia, Chenyu Zhao and Tin-yun Ho from the Cloud AutoML Tables team for great infrastructure and product landing collaboration. Thanks to Walter Reade, Julia Elliott and Kaggle for organizing such an engaging competition.

Bringing a Critical AI to News: Extracting Insight from Coverage

In 2015, Sean Gourley penned an article called “Robot Propaganda” for Wired magazine.

It contained this then-bold prediction: “We are likely to see versions of these bots deployed on U.S. audiences as part of the 2016 presidential election campaigns.”

Well, we all know how that turned out.

Gourley recently joined the AI Podcast to talk about bots, propaganda and fake news and how they relate to the work his own company is doing in natural language understanding and generation.

Gourley — who holds a Ph.D. in physics from Oxford University — is founder and CEO of Primer, a San Francisco-based machine intelligence company.

It builds machines that can read and write, automating the analysis of very large datasets.

In short, it automates the job of wringing insights out of news and other sources of information.

As a result, it grapples with the problems created by “fake news” and propaganda in a very real way for customers that include government agencies, financial institutions and Fortune 500 companies.

“The big thing for us is building systems that can help us understand the world that we’re living in,” Gourley said.

The best way to do that: closely track current events and the events detailed by reputable sources.

“That’s become a really important piece in starting to kind of navigate a world where there’s an increasing volume of fake information and increasingly sophisticated fake information that’s out there.”

For more from Gourley, tune into the AI Podcast.

How to Tune into the AI Podcast

Our AI Podcast is available through iTunes, Castbox, DoggCatcher, Google Play Music, Overcast, PlayerFM, Podbay, PodBean, Pocket Casts, PodCruncher, PodKicker, Stitcher, Soundcloud and TuneIn.

If your favorite isn’t listed here, email us at aipodcast [at] nvidia [dot] com.

The post Bringing a Critical AI to News: Extracting Insight from Coverage appeared first on The Official NVIDIA Blog.

Build end-to-end machine learning workflows with Amazon SageMaker and Apache Airflow

Machine learning (ML) workflows orchestrate and automate sequences of ML tasks by enabling data collection and transformation. This is followed by training, testing, and evaluating an ML model to achieve an outcome. For example, you might want to perform a query in Amazon Athena or aggregate and prepare data in AWS Glue before you train a model on Amazon SageMaker and deploy the model to a production environment to make inference calls. Automating these tasks and orchestrating them across multiple services helps build repeatable, reproducible ML workflows. These workflows can be shared between data engineers and data scientists.


ML workflows consist of tasks that are often cyclical and iterative to improve the accuracy of the model and achieve better results. We recently announced new integrations with Amazon SageMaker that allow you to build and manage these workflows:

  1. AWS Step Functions automates and orchestrates Amazon SageMaker related tasks in an end-to-end workflow.  You can automate publishing datasets to Amazon S3, training an ML model on your data with Amazon SageMaker, and deploying your model for prediction. AWS Step Functions will monitor Amazon SageMaker and other jobs until they succeed or fail, and either transition to the next step of the workflow or retry the job. It includes built-in error handling, parameter passing, state management, and a visual console that lets you monitor your ML workflows as they run.
  2. Many customers currently use Apache Airflow, a popular open source framework for authoring, scheduling, and monitoring multi-stage workflows. With this integration, multiple Amazon SageMaker operators are available with Airflow, including model training, hyperparameter tuning, model deployment, and batch transform. This allows you to use the same orchestration tool to manage ML workflows with tasks running on Amazon SageMaker.

This blog post shows how you can build and manage ML workflows using Amazon SageMaker and Apache Airflow. We’ll build a recommender system to predict a customer’s rating for a certain video based on the customer’s historical ratings of similar videos, as well as the behavior of other similar customers. We’ll use historical star ratings from over 2 million Amazon customers on over 160,000 digital videos. Details on this dataset can be found at its AWS Open Data page.

High-level solution

We’ll start by exploring the data, transforming the data, and training a model on the data. We’ll fit the ML model using an Amazon SageMaker managed training cluster. We’ll then deploy to an endpoint to perform batch predictions on the test data set. All of these tasks will be plugged into a workflow that can be orchestrated and automated through Apache Airflow integration with Amazon SageMaker.

The following diagram shows the ML workflow we’ll implement for building the recommender system.

The workflow performs the following tasks:

  1. Data pre-processing: Extract and pre-process data from Amazon S3 to prepare the training data.
  2. Prepare training data: To build the recommender system, we’ll use the Amazon SageMaker built-in factorization machines algorithm. The algorithm expects training data only in recordIO-protobuf format with Float32 tensors. In this task, the pre-processed data is transformed to recordIO-protobuf format.
  3. Training the model: Train the Amazon SageMaker built-in factorization machine model with the training data and generate model artifacts. The training job will be launched by the Airflow Amazon SageMaker operator.
  4. Tune the model hyperparameters: A conditional/optional task to tune the hyperparameters of the factorization machine to find the best model. The hyperparameter tuning job will be launched by the Amazon SageMaker Airflow operator.
  5. Batch inference: Using the trained model, get inferences on the test dataset stored in Amazon S3 using the Airflow Amazon SageMaker operator.

Note: You can clone this GitHub repo for the scripts, templates and notebook referred to in this blog post.

Airflow concepts and setup

Before implementing the solution, let’s get familiar with Airflow concepts. If you are already familiar with Airflow concepts, skip to the Airflow Amazon SageMaker operators section.

Apache Airflow is an open-source tool for orchestrating workflows and data processing pipelines. Airflow allows you to configure, schedule, and monitor data pipelines programmatically in Python, so you can define every stage in the lifecycle of a typical workflow.

Airflow nomenclature

  • DAG (Directed Acyclic Graph): DAGs describe how to run a workflow by defining the pipeline in Python, that is, configuration as code. A pipeline is designed as a directed acyclic graph by dividing it into tasks that can be executed independently. These tasks are then combined logically as a graph.
  • Operators: Operators are atomic components in a DAG describing a single task in the pipeline. They determine what gets done in that task when a DAG runs. Airflow provides operators for common tasks. It is extensible, so you can define custom operators. The Airflow Amazon SageMaker operators are custom operators contributed by AWS to integrate Airflow with Amazon SageMaker.
  • Task: After an operator is instantiated, it’s referred to as a “task.”
  • Task instance: A task instance represents a specific run of a task characterized by a DAG, a task, and a point in time.
  • Scheduling: The DAGs and tasks can be run on demand or can be scheduled to be run at a certain frequency defined as a cron expression in the DAG.

Airflow architecture

The following diagram shows the typical components of Airflow architecture.

  • Scheduler: The scheduler is a persistent service that monitors DAGs and tasks, and triggers the task instances whose dependencies have been met. The scheduler is responsible for invoking the executor defined in the Airflow configuration.
  • Executor: Executors are the mechanism by which task instances get to run. Airflow by default provides different types of executors and you can define custom executors, such as a Kubernetes executor.
  • Broker: The broker queues the messages (task requests to be executed) and acts as a communicator between the executor and the workers.
  • Workers: The actual nodes where tasks are executed and that return the result of the task.
  • Web server: A web server to render the Airflow UI.
  • Configuration file: Configures settings such as the executor to use, the Airflow metadata database connection, and the DAG repository location. You can also define concurrency and parallelism limits here.
  • Metadata database: Database to store all the metadata related to the DAGs, DAG runs, tasks, variables, and connections.
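
To make the interplay between the scheduler, broker, and workers more concrete, here is a minimal sketch that uses only the Python standard library (`graphlib` requires Python 3.9 or later). It is not Airflow code: the task names and the `run_pipeline` helper are hypothetical, and a real worker would run an operator rather than just record the task name.

```python
import queue
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "preprocess": set(),
    "prepare": {"preprocess"},
    "train": {"prepare"},
    "batch_inference": {"train"},
}


def run_pipeline(dependencies):
    """Run each task once its upstream tasks have finished; return the run order."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()        # raises CycleError if the graph is not acyclic
    broker = queue.Queue()  # stands in for the message broker
    executed = []

    while sorter.is_active():
        # Scheduler role: queue every task whose dependencies have been met.
        for task in sorter.get_ready():
            broker.put(task)
        # Worker role: pull queued tasks, "execute" them, and report completion.
        while not broker.empty():
            task = broker.get()
            executed.append(task)  # a real worker would run the operator here
            sorter.done(task)
    return executed


order = run_pipeline(dependencies)
print(order)
```

Because the example pipeline is a simple chain, the tasks run strictly one after another. With independent tasks, several of them would be queued in the same scheduler pass.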

Airflow Amazon SageMaker operators

Amazon SageMaker operators are custom operators that ship with the Airflow installation. They allow Airflow to talk to Amazon SageMaker and perform the following ML tasks:

  • SageMakerTrainingOperator: Creates an Amazon SageMaker training job.
  • SageMakerTuningOperator: Creates an Amazon SageMaker hyperparameter tuning job.
  • SageMakerTransformOperator: Creates an Amazon SageMaker batch transform job.
  • SageMakerModelOperator: Creates an Amazon SageMaker model.
  • SageMakerEndpointConfigOperator: Creates an Amazon SageMaker endpoint config.
  • SageMakerEndpointOperator: Creates an Amazon SageMaker endpoint to make inference calls.

We’ll review usage of the operators in the Building a machine learning workflow section of this blog post.

Airflow setup

We will set up a simple Airflow architecture with a scheduler, worker, and web server running on a single instance. Typically, you will not use this setup for production workloads. We will use AWS CloudFormation to launch the AWS services required to create the components in this blog post. The following diagram shows the configuration of the architecture to be deployed.

The stack includes the following:

  • An Amazon Elastic Compute Cloud (EC2) instance to set up the Airflow components.
  • An Amazon Relational Database Service (RDS) Postgres instance to host the Airflow metadata database.
  • An Amazon Simple Storage Service (S3) bucket to store the Amazon SageMaker model artifacts, outputs, and Airflow DAG with ML workflow. The template will prompt for the S3 bucket name.
  • AWS Identity and Access Management (IAM) roles and Amazon EC2 security groups to allow Airflow components to interact with the metadata database, S3 bucket, and Amazon SageMaker.

Before running this CloudFormation script, set up an Amazon EC2 Key Pair so that you can log in to the instance to manage Airflow, for example, to troubleshoot or add custom operators.

It might take up to 10 minutes for the CloudFormation stack to create the resources. After the resource creation is completed, you should be able to log in to the Airflow web UI. The Airflow web server runs on port 8080 by default. To open the Airflow web UI, open any browser and go to http://ec2-public-dns-name:8080. You can find the public DNS name of the EC2 instance on the Outputs tab of the CloudFormation stack in the AWS CloudFormation console.

Building a machine learning workflow

In this section, we’ll create an ML workflow using Airflow operators, including Amazon SageMaker operators, to build the recommender. You can download the companion Jupyter notebook to look at individual tasks used in the ML workflow. We’ll highlight the most important pieces here.

Data preprocessing

  • As mentioned earlier, the dataset contains ratings from over 2 million Amazon customers on over 160,000 digital videos. More details on the dataset are here.
  • After analyzing the dataset, we see that only about 5 percent of customers have rated 5 or more videos, and only 25 percent of videos have been rated by 9 or more customers. We’ll clean up this long tail by filtering the records.
  • After cleanup, we transform the data into sparse format by giving each customer and video their own sequential index indicating the row and column in our ratings matrix. We store this cleansed data in an S3 bucket for the next task to pick up and process.
  • The following PythonOperator snippet in the Airflow DAG calls the preprocessing function:
    # preprocess the data
    preprocess_task = PythonOperator(

NOTE: For this blog post, the data preprocessing task is performed in Python using the Pandas package. The task gets executed on the Airflow worker node. This task can be replaced with the code running on AWS Glue or Amazon EMR when working with large data sets.
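
The cleanup and indexing steps described above can be sketched as follows. This is an illustration under stated assumptions, not the post’s preprocessing function: it uses the standard library instead of Pandas so that it runs anywhere, the sample records and the `clean_and_index` helper are hypothetical, and the thresholds are lowered to fit the toy data (the post filters on 5 or more ratings per customer and 9 or more per video).

```python
from collections import Counter

# Hypothetical sample of (customer_id, video_id, star_rating) records.
ratings = [
    ("c1", "v1", 5), ("c1", "v2", 4), ("c1", "v3", 3),
    ("c2", "v1", 2), ("c2", "v2", 5),
    ("c3", "v1", 4),
    ("c4", "v4", 1),
]


def clean_and_index(ratings, min_customer_ratings, min_video_ratings):
    """Drop long-tail customers and videos, then assign sequential indices."""
    per_customer = Counter(c for c, _, _ in ratings)
    per_video = Counter(v for _, v, _ in ratings)
    kept = [
        (c, v, r) for c, v, r in ratings
        if per_customer[c] >= min_customer_ratings
        and per_video[v] >= min_video_ratings
    ]

    # Indices are assigned in order of first appearance:
    # row = customer, column = video in the ratings matrix.
    customer_index = {}
    video_index = {}
    indexed = []
    for c, v, r in kept:
        row = customer_index.setdefault(c, len(customer_index))
        col = video_index.setdefault(v, len(video_index))
        indexed.append((row, col, r))
    return indexed, customer_index, video_index


indexed, customer_index, video_index = clean_and_index(ratings, 2, 2)
print(indexed)
```

The counts are computed once on the raw data, so a customer who falls below the threshold only after rare videos are removed is still kept. Whether to repeat the filter until it stabilizes is a design choice that depends on your data.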

Data preparation

  • We are using the Amazon SageMaker implementation of Factorization Machines (FM) for building the recommender system. The algorithm expects Float32 tensors in recordIO protobuf format. The cleansed data set is a Pandas DataFrame on disk.
  • As part of data preparation, the Pandas DataFrame will be transformed to a sparse matrix with one-hot encoded feature vectors with customers and videos. Thus, each sample in the data set will be a wide Boolean vector with only two values set to 1 for the customer and the video.
    Cust 1   Cust 2   Cust N   Video 1   Video 2   Video m
    1        0        0        0         1         0
  • The following steps are performed in the data preparation task:
    1. Split the cleaned data set into train and test data sets.
    2. Build a sparse matrix with one-hot encoded feature vectors (customer + videos) and a label vector with star ratings.
    3. Convert both the sets to protobuf encoded files.
    4. Copy the prepared files to an Amazon S3 bucket for training the model.
  • The following PythonOperator snippet in the Airflow DAG calls the data preparation function.
    # prepare the data for training
    prepare_task = PythonOperator(
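
The split and one-hot encoding steps listed above can be sketched in plain Python as follows. This is a simplified illustration, not the post’s data preparation function: the helper names and sample data are hypothetical, each sparse sample is stored as just the two column positions that are set to 1, and the conversion to recordIO-protobuf and the upload to Amazon S3 are left out.

```python
import random


def train_test_split(samples, test_fraction, seed):
    """Shuffle a copy of the samples and split off a test set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def one_hot_rows(indexed_ratings, n_customers):
    """Map (customer_idx, video_idx, rating) to sparse features and labels.

    Each sample has exactly two columns set to 1: the customer column and
    the video column, which is offset by the number of customers.
    """
    features = [(c, n_customers + v) for c, v, _ in indexed_ratings]
    labels = [float(r) for _, _, r in indexed_ratings]
    return features, labels


def to_dense(columns, width):
    """Expand one sparse sample to a dense 0/1 vector, for inspection only."""
    return [1 if i in columns else 0 for i in range(width)]


# Hypothetical cleaned data: 2 customers and 2 videos.
indexed_ratings = [(0, 0, 5), (0, 1, 4), (1, 0, 2), (1, 1, 5)]
train, test = train_test_split(indexed_ratings, test_fraction=0.25, seed=0)
features, labels = one_hot_rows(train, n_customers=2)

# Customer 0 rating video 1 becomes a vector with ones in columns 0 and 3.
print(to_dense((0, 3), width=4))
```

For the real dataset, a sparse matrix library is a better fit than tuples of column positions, because the feature vectors are as wide as the number of customers plus the number of videos.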

Model training and tuning

  • We’ll train the Amazon SageMaker Factorization Machines algorithm by launching a training job using Airflow Amazon SageMaker operators. There are a couple of ways we can train the model.
    • Use SageMakerTrainingOperator to run a training job by setting the hyperparameters known to work for your data.
      # train_config specifies SageMaker training configuration
      train_config = training_config(
      # launch sagemaker training job and wait until it completes
      train_model_task = SageMakerTrainingOperator(

    • Use SageMakerTuningOperator to run a hyperparameter tuning job to find the best model by running many jobs that test a range of hyperparameters on your dataset.
      # create tuning config
      tuner_config = tuning_config(
      tune_model_task = SageMakerTuningOperator(

  • Conditional tasks can be created in the Airflow DAG that can decide whether to run the training job directly or run a hyperparameter tuning job to find the best model. These tasks can be run in synchronous or asynchronous mode.
    branching = BranchPythonOperator(
        python_callable=lambda: "model_tuning" if hpo_enabled else "model_training")

  • The progress of the training or tuning job can be monitored in the Airflow Task Instance logs.

Model inference

  • Using the Airflow SageMakerTransformOperator, create an Amazon SageMaker batch transform job to perform batch inference on the test dataset to evaluate performance of the model.
    # create transform config
    transform_config = transform_config_from_estimator(
        task_id="model_tuning" if hpo_enabled else "model_training",
        task_type="tuning" if hpo_enabled else "training",
    # launch sagemaker batch transform job and wait until it completes
    batch_transform_task = SageMakerTransformOperator(

  • We can further extend the ML workflow by adding a task to validate model performance by comparing the actual and predicted customer ratings before deploying the model in a production environment.
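
As one possible shape for such a validation task, the following sketch compares actual and predicted ratings using root mean squared error (RMSE). The metric, the `max_rmse` threshold, and the function names are assumptions made for illustration; the post does not prescribe a specific validation method.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted ratings."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def passes_validation(actual, predicted, max_rmse):
    """Return True when the model error is within the accepted threshold."""
    return rmse(actual, predicted) <= max_rmse


# Hypothetical star ratings from the test set and from batch inference.
actual = [5, 4, 2, 5]
predicted = [4, 4, 3, 5]
print(rmse(actual, predicted))
print(passes_validation(actual, predicted, max_rmse=1.0))
```

A function like `passes_validation` could serve as the callable for a Python task placed after the batch transform job, with the threshold chosen from your own baseline.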

In the next section, we’ll see how all these tasks are stitched together to form an ML workflow in an Airflow DAG.

Putting it all together

The Airflow DAG integrates all the tasks we’ve described into an ML workflow. An Airflow DAG is a Python script in which you express individual tasks with Airflow operators, set task dependencies, and associate the tasks with the DAG to run on demand or at a scheduled interval. The Airflow DAG script is divided into the following sections.

  1. Set DAG with parameters such as schedule interval, concurrency, etc.
    dag = DAG(
        user_defined_filters={'tojson': lambda s: JSONEncoder().encode(s)}

  2. Set up training, tuning, and inference configurations for each operator using the Amazon SageMaker Python SDK for Airflow.
  3. Create individual tasks with Airflow operators that define trigger rules and associate them with the DAG object. Refer to the previous section for defining these individual tasks.
  4. Specify task dependencies.
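
To illustrate how the task dependencies and the conditional branch fit together, here is a sketch that models the workflow with the standard library (`graphlib`, Python 3.9 or later) rather than with Airflow itself. The task IDs `model_tuning` and `model_training` come from the snippets above; the other task IDs and the `build_workflow` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def build_workflow(hpo_enabled):
    """Return the task IDs that run, in order, for the chosen branch."""
    # Each task maps to the set of upstream tasks it depends on.
    upstream = {
        "preprocessing": set(),
        "preparation": {"preprocessing"},
        "branching": {"preparation"},
        "model_tuning": {"branching"},
        "model_training": {"branching"},
        "batch_inference": {"model_tuning", "model_training"},
    }
    # The branch that is not chosen is skipped. Batch inference still runs
    # because one of its upstream branches completed.
    skipped = "model_training" if hpo_enabled else "model_tuning"
    order = TopologicalSorter(upstream).static_order()
    return [task for task in order if task != skipped]


print(build_workflow(hpo_enabled=True))
print(build_workflow(hpo_enabled=False))
```

In an actual DAG, the inference task needs a trigger rule that tolerates the skipped branch, which is why the step above mentions defining trigger rules when creating the tasks.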

After the DAG is ready, deploy it to the Airflow DAG repository using CI/CD pipelines. If you followed the steps in Airflow setup, the CloudFormation stack that installs the Airflow components also adds the Airflow DAG containing the ML workflow for the recommender system to the repository on the Airflow instance. Download the Airflow DAG code from here.

After triggering the DAG on demand or on a schedule, you can monitor the DAG in multiple ways: tree view, graph view, Gantt chart, task instance logs, etc. Refer to the Airflow documentation for ways to author and monitor Airflow DAGs.

Clean up

Now to the final step, cleaning up the resources.

To avoid unnecessary charges on your AWS account do the following:

  1. Destroy all of the resources created by the CloudFormation stack in Airflow setup by deleting the stack after you’re done experimenting with it. You can follow the steps here to delete the stack.
  2. You have to manually delete the S3 bucket created by the CloudFormation stack because AWS CloudFormation can’t delete a non-empty Amazon S3 bucket.


In this blog post, you have seen that building an ML workflow involves quite a bit of preparation but it helps improve the rate of experimentation, engineering productivity, and maintenance of repetitive ML tasks. Airflow Amazon SageMaker Operators provide a convenient way to build ML workflows and integrate with Amazon SageMaker.

You can extend the workflows by customizing the Airflow DAGs with any tasks that better fit your ML workflows, such as feature engineering, creating an ensemble of training models, creating parallel training jobs, and retraining models to adapt to the data distribution changes.


  • Refer to the Amazon SageMaker SDK documentation and Airflow documentation for additional details on the Airflow Amazon SageMaker operators.
  • Refer to the Amazon SageMaker documentation to learn about the Factorization Machines algorithm used in this blog post.
  • Download the resources (Jupyter notebooks, CloudFormation template, and Airflow DAG code) referred to in this blog post from our GitHub repo.

If you have questions or suggestions, please leave them in the following comments section.

About the Author

Rajesh Thallam is a Professional Services Architect for AWS helping customers run Big Data and Machine Learning workloads on AWS. In his spare time he enjoys spending time with family, traveling and exploring ways to integrate technology into daily life. He would like to thank his colleagues David Ping and Shreyas Subramanian for helping with this blog post.
