Joint Speech Recognition and Speaker Diarization via Sequence Transduction

Being able to recognize “who said what,” or speaker diarization, is a critical step in understanding audio of human dialog through automated means. For instance, in a medical conversation between doctors and patients, “Yes” uttered by a patient in response to “Have you been taking your heart medications regularly?” has a substantially different implication than a rhetorical “Yes?” from a physician.

Conventional speaker diarization (SD) systems use two stages, the first of which detects changes in the acoustic spectrum to determine when the speakers in a conversation change, and the second of which identifies individual speakers across the conversation. This basic multi-stage approach is almost two decades old, and during that time only the speaker change detection component has improved.

With the recent development of a novel neural network model, the recurrent neural network transducer (RNN-T), we now have a suitable architecture to improve the performance of speaker diarization, addressing some of the limitations of the diarization system we presented previously. As reported in our recent paper, “Joint Speech Recognition and Speaker Diarization via Sequence Transduction,” to be presented at Interspeech 2019, we have developed an RNN-T based speaker diarization system and have demonstrated a breakthrough in performance from about 20% to 2% in word diarization error rate, a factor of 10 improvement.

Conventional Speaker Diarization Systems
Conventional speaker diarization systems rely on differences in how people sound acoustically to distinguish the speakers in the conversations. While male and female speakers can be identified relatively easily from their pitch using simple acoustic models (e.g., Gaussian mixture models) in a single stage, speaker diarization systems use a multi-stage approach to distinguish between speakers having potentially similar pitch. First, a change detection algorithm breaks up the conversation into homogeneous segments, hopefully containing only a single speaker, based upon detected vocal characteristics. Then, deep learning models are employed to map segments from each speaker to an embedding vector. Finally, in a clustering stage, these embeddings are grouped together to keep track of the same speaker across the conversation.

In practice, the speaker diarization system runs in parallel to the automatic speech recognition (ASR) system and the outputs of the two systems are combined to attribute speaker labels to the recognized words.

A conventional speaker diarization system infers speaker labels in the acoustic domain and then overlays the speaker labels on the words generated by a separate ASR system.

There are several limitations with this approach that have hindered progress in this field. First, the conversation needs to be broken up into segments that only contain speech from one speaker. Otherwise, the embedding will not accurately represent the speaker. In practice, however, the change detection algorithm is imperfect, resulting in segments that may contain multiple speakers. Second, the clustering stage requires that the number of speakers be known and is particularly sensitive to the accuracy of this input. Third, the system needs to make a very difficult trade-off between the segment size over which the voice signatures are estimated and the desired model accuracy. The longer the segment, the better the quality of the voice signature, since the model has more information about the speaker. This comes at the risk of attributing short interjections to the wrong speaker, which could have very high consequences, for example, in the context of processing a clinical or financial conversation where affirmation or negation needs to be tracked accurately. Finally, conventional speaker diarization systems do not have an easy mechanism to take advantage of linguistic cues that are particularly prominent in many natural conversations. An utterance, such as “How often have you been taking the medication?” in a clinical conversation is most likely uttered by a medical provider, not a patient. Likewise, the utterance, “When should we turn in the homework?” is most likely uttered by a student, not a teacher. Linguistic cues also signal high probability of changes in speaker turns, for example, after a question.

There are a few exceptions to the conventional speaker diarization system; one was reported in our recent blog post. In that work, the hidden states of the recurrent neural network (RNN) tracked the speakers, circumventing the weakness of the clustering stage. The work described here takes a different approach and incorporates linguistic cues as well.

An Integrated Speech Recognition and Speaker Diarization System
We developed a novel and simple model that not only combines acoustic and linguistic cues seamlessly, but also combines speaker diarization and speech recognition into one system. The integrated model does not degrade the speech recognition performance significantly compared to an equivalent recognition-only system.

The key insight in our work was to recognize that the RNN-T architecture is well-suited to integrate acoustic and linguistic cues. The RNN-T model consists of three different networks: (1) a transcription network (or encoder) that maps the acoustic frames to a latent representation, (2) a prediction network that predicts the next target label given the previous target labels, and (3) a joint network that combines the output of the previous two networks and generates a probability distribution over the set of output labels at that time step. Note, there is a feedback loop in the architecture (diagram below) where previously recognized words are fed back as input, and this allows the RNN-T model to incorporate linguistic cues, such as the end of a question.
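To make the role of the joint network concrete, here is a minimal sketch in plain Python. It assumes one common formulation (project both inputs into a shared hidden space, add them, apply tanh, then a softmax over the labels); the post does not say which form the authors' joint network takes, and the toy dimensions and random weights are placeholders rather than anything from the paper.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def joint_network(enc_t, pred_u, w_enc, w_pred, w_out):
    """Combine one encoder frame with one prediction-network state.

    Returns a probability distribution over the output labels. In a joint
    system the label set would hold words (or word pieces), speaker-role
    tags, and a blank symbol.
    """
    hidden = [math.tanh(a + b)
              for a, b in zip(matvec(w_enc, enc_t), matvec(w_pred, pred_u))]
    return softmax(matvec(w_out, hidden))

def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of small random weights (placeholder values)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
ENC_DIM, PRED_DIM, HIDDEN_DIM, NUM_LABELS = 4, 3, 5, 6  # toy sizes (assumed)
w_enc = random_matrix(HIDDEN_DIM, ENC_DIM, rng)
w_pred = random_matrix(HIDDEN_DIM, PRED_DIM, rng)
w_out = random_matrix(NUM_LABELS, HIDDEN_DIM, rng)

enc_t = [rng.uniform(-1, 1) for _ in range(ENC_DIM)]    # encoder output at frame t
pred_u = [rng.uniform(-1, 1) for _ in range(PRED_DIM)]  # prediction state after u labels
probs = joint_network(enc_t, pred_u, w_enc, w_pred, w_out)
```

Because the prediction state depends on the labels emitted so far, the distribution over the next label (including a speaker tag) can shift after, say, the words of a question.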

An integrated speech recognition and speaker diarization system where the system jointly infers who spoke when and what.

Training the RNN-T model on accelerators like graphics processing units (GPUs) or tensor processing units (TPUs) is non-trivial, as computation of the loss function requires running the forward-backward algorithm, which includes all possible alignments of the input and the output sequences. This issue was addressed recently in a TPU-friendly implementation of the forward-backward algorithm, which recasts the problem as a sequence of matrix multiplications. We also took advantage of an efficient implementation of the RNN-T loss in TensorFlow that allowed quick iterations of model development and trained a very deep network.

The integrated model can be trained just like a speech recognition system. The reference transcripts for training contain words spoken by a speaker followed by a tag that defines the role of the speaker. For example, “When is the homework due?” ≺student≻, “I expect you to turn them in tomorrow before class,” ≺teacher≻. Once the model is trained with examples of audio and corresponding reference transcripts, a user can feed in the recording of the conversation and expect to see an output in a similar form. Our analyses show that improvements from the RNN-T system impact all categories of errors, including short speaker turns, splitting at the word boundaries, incorrect speaker assignment in the presence of overlapping speech, and poor audio quality. Moreover, the RNN-T system exhibited consistent performance across conversations, with substantially lower variance in average error rate per conversation compared to the conventional system.
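As an illustration of the "words followed by a role tag" format, the sketch below splits a tagged transcript into speaker turns. The ASCII angle-bracket tag syntax and the function name are assumptions made for this example; the post does not specify how tags are encoded in the actual training data or model output.

```python
import re

# Assumed tag syntax: ASCII angle brackets around a role name, e.g. <student>.
ROLE_TAG = re.compile(r"<(\w+)>")

def split_turns(tagged_transcript):
    """Split a role-tagged transcript into (role, words) turns.

    Each tag is taken to label the words that precede it, mirroring the
    format of the training references described above. Words after the
    last tag have no role and are dropped.
    """
    turns = []
    position = 0
    for match in ROLE_TAG.finditer(tagged_transcript):
        words = tagged_transcript[position:match.start()].strip()
        if words:
            turns.append((match.group(1), words))
        position = match.end()
    return turns

example = ("when is the homework due <student> "
           "i expect you to turn them in tomorrow before class <teacher>")
turns = split_turns(example)
for role, words in turns:
    print(f"{role}: {words}")
```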

A comparison of errors committed by the conventional system vs. the RNN-T system, as categorized by human annotators.

Furthermore, this integrated model can predict other labels necessary for generating more reader-friendly ASR transcripts. For example, we have been able to successfully improve our transcripts with punctuation and capitalization symbols using the appropriately matched training data. Our outputs have lower punctuation and capitalization errors than our previous models that were separately trained and added as a post-processing step after ASR.

This model has now become a standard component in our project on understanding medical conversations and is also being adopted more widely in our non-medical speech services.

We would like to thank Hagen Soltau without whose contributions this work would not have been possible. This work was performed in collaboration with Google Brain and Speech teams.

Applications Open for $50,000 NVIDIA Graduate Fellowship Awards

Bringing together the world’s brightest minds and the latest GPU technology leads to powerful research breakthroughs.

That’s why we’re taking applications for the 19th annual NVIDIA Graduate Fellowship Program, seeking students doing outstanding GPU-based research. Our goal: Provide them with grants, mentors and technical support so they can help solve the world’s biggest research problems.

We’re especially seeking doctoral students working in artificial intelligence, machine learning, autonomous vehicles, robotics, AI for healthcare, high performance computing and related fields. Our Graduate Fellowship awards are up to $50,000 per student.

Since its inception in 2002, the Graduate Fellowship Program has awarded over 160 grants worth nearly $5 million.

We’re looking for students who have completed their first year of Ph.D.-level studies at the time of application. Candidates need to be studying computer science, computer engineering, system architecture, electrical engineering or a related area. Applicants must also be investigating innovative ways to use GPUs.

The NVIDIA Graduate Fellowship Program for the 2020-2021 academic year is open to applicants worldwide. The deadline for submitting applications is Sept. 13, 2019. An internship at NVIDIA preceding the fellowship year is now mandatory — eligible candidates should be available for the internship in summer 2020.

For more on eligibility and how to apply, visit the program website or email

The post Applications Open for $50,000 NVIDIA Graduate Fellowship Awards appeared first on The Official NVIDIA Blog.

Modernizing wound care with Spectral MD, powered by Amazon SageMaker

Spectral MD, Inc. is a clinical research stage medical device company that describes itself as “breaking the barriers of light to see deep inside the body.” Recently designated by the FDA as a “Breakthrough Device,” Spectral MD provides an impressive solution to wound care using cutting edge multispectral imaging and deep learning technologies. This Dallas-based company relies on AWS services including Amazon SageMaker and Amazon Elastic Compute Cloud (Amazon EC2) to support their unprecedented wound care analysis efforts. With AWS as their cloud provider, the Spectral MD team can focus on healthcare breakthroughs, knowing their data is stored and processed swiftly and effectively.

“We chose AWS because it gives us access to the computational resources we need to rapidly train, optimize, and validate the state-of-the-art deep learning algorithms used in our medical device,” explained Kevin Plant, the software team lead at Spectral MD. “AWS also serves as a secure repository for our clinical dataset that is critical for the research, development and deployment of the algorithm.”

The algorithm is the 10-year-old company’s proprietary DeepView Wound Imaging System, which uses a non-invasive digital approach that allows clinical investigators to see hidden ailments without ever coming in contact with the patient. Specifically, the technology combines visual inputs with digital analyses to understand complex wound conditions and predict a wound’s healing potential. The portable imaging device, in combination with computational power from AWS, allows clinicians to capture a precise snapshot of what is hidden to the human eye.

Spectral MD’s revolutionary solution is possible thanks to a number of AWS services for both core computational power and machine learning finesse. The company stores data captured by their device on Amazon Simple Storage Service (Amazon S3), with the metadata living in Amazon DynamoDB. From there, they back up all the data in Amazon S3 Glacier. This data fuels their innovation with AWS machine learning (ML).

To manage the training and deployment of their image classification algorithms, Spectral MD uses Amazon SageMaker and Amazon EC2. These services also help the team to achieve improved algorithm performance and to conduct deep learning algorithm research.

Spectral MD particularly appreciates that using AWS services saves the data science team a tremendous amount of time. Plant described, “The availability of AWS on-demand computational resources for deep learning algorithm training and validation has reduced the time it takes to iterate algorithm development by 80%. Instead of needing weeks for full validation, we’re now able to cut the time to 2 days. AWS has enabled us to maximize our algorithm performance by rapidly incorporating the latest developments in the state-of-the-art of deep learning into our algorithm.”

That faster timeline in turn benefits the end patients, for whom time is of the essence. Diagnosing burns quickly and accurately is critical for accelerating recovery and can have important long-term implications for the patient. Yet, current medical practice (without Spectral MD) suffers a 30% diagnostic error rate, meaning that some patients are unnecessarily treated with surgery while others who would have benefited from surgery are not offered that option.

Spectral MD’s solution takes advantage of the natural patterns in the chemicals and tissues that compose human skin – and the natural pattern-matching abilities of ML. Their model has been trained on thousands of images of accurately diagnosed burns. Now, it is precise enough that the company can create datasets from scratch that differentiate pathologies from healthy skin, operating to a degree that is impossible for the human eye.

These datasets are labeled by expert clinicians using Amazon SageMaker Ground Truth. Spectral MD has extended Amazon SageMaker Ground Truth with the ability to review clinical reference data stored in Amazon S3. During the labeling process this provides clinicians with the ideal information set to maximize the accuracy of the diagnostic ground truth labels.

Going forward, Spectral MD plans to push the boundaries of ML and of healthcare. Their team has recently been investigating the use of Amazon SageMaker Neo for deploying deep learning algorithms to edge hardware. In Plant’s words, “There are many barriers to incorporating new technology into medical devices. But AWS continually improves how easy it is for us to take advantage of new and powerful features; no one else can keep pace with AWS.”

About the Author

Marisa Messina is on the AWS ML marketing team, where her job includes identifying the most innovative AWS-using customers and showcasing their inspiring stories. Prior to AWS, she worked on consumer-facing hardware and then university-facing cloud offerings at Microsoft. Outside of work, she enjoys exploring the Pacific Northwest hiking trails, cooking without recipes, and dancing in the rain.






Authenticate users with one-time passwords in Amazon Lex chatbots

Today, many companies use one-time passwords (OTP) to authenticate users. An application asks you for a password to proceed. This password is sent to you via text message to a registered phone number. You enter the password to authenticate. It is an easy and secure approach to verifying user identity. In this blog post, we’ll describe how to integrate the same OTP functionality in an Amazon Lex chatbot.

Amazon Lex lets you easily build life-like conversational interfaces into your existing applications using both voice and text.

Before we jump into the details, let’s take a closer look at OTPs. OTP is usually a sequence of numbers that is valid for only one login session or transaction. The OTP expires after a certain time period, and, after that, a new one has to be generated. It can be used on a variety of channels such as web, mobile, or other devices.

In this blog post, we’ll show how to authenticate your users using an example of a food-ordering chatbot on a mobile device. The Amazon Lex bot will place the order for users only after they have been authenticated by OTP.

Let’s consider the following conversation that uses OTP.

To achieve the interaction we just described, we first build a food delivery bot with the following intents: GetFoodMenu and OrderFood. The OTP is used in intents that involve transactions, such as OrderFood.

We’ll show you two different implementations of capturing the OTP – one via voice and the other via text. In the first implementation, the OTP is captured directly by Amazon Lex through voice or text input. The OTP value is sent directly to Amazon Lex as a slot value. In the second implementation, the OTP is captured by the client application (using text modality). The client application captures the OTP from the dialog box on the client and sends it to Amazon Lex as a session attribute. Session attributes can be encrypted.

It is important to note that all API calls made to the Amazon Lex runtime are encrypted using HTTPS. The encryption of the OTP when used via session attributes provides an extra level of security. Amazon Lex passes the OTP received via the session attribute or slot value to an AWS Lambda function that can verify the OTP.

Application architecture

The bot has an architecture that is based on the following AWS services:

  • Amazon Lex for building the conversational interface.
  • AWS Lambda to run data validation and fulfillment.
  • Amazon DynamoDB to store and retrieve data.
  • Amazon Simple Notification Service (SNS) to publish SMS messages.
  • AWS Key Management Service (KMS) to encrypt and decrypt the OTP.
  • Amazon Cognito identity pool to obtain temporary AWS credentials to use KMS.

The following diagram illustrates how the various services work together.

Capturing the OTP using voice modality

When the user first starts an interaction with the bot, the user’s email or other metadata are passed from the frontend to the Amazon Lex runtime.

An AWS Lambda validation code hook is used to perform the following tasks:

  1. AWS Lambda generates an OTP and stores it in the DynamoDB table.
  2. AWS Lambda sends the OTP to the user’s mobile phone using SNS.
  3. The user inputs the OTP into the client application, which sends it to Amazon Lex as a slot value.
  4. AWS Lambda verifies the OTP, and, if the authentication is successful, it signals Amazon Lex to proceed with the conversation.

After the user is authenticated, the user can place an order with the Amazon Lex bot.
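The store-and-send steps above can be sketched as follows. This is only an illustration: the function name and the item attribute names ("uuid", "otp", "expiresAt") are assumptions, not taken from the post's source code. The DynamoDB table and SNS client are passed in as parameters, so the function works with boto3 objects or with simple stand-ins for testing.

```python
import time
import uuid

def issue_otp(otp, phone_number, otp_table, sns_client, ttl_seconds=60):
    """Store an OTP under a fresh UUID and text it to the user.

    otp_table is expected to behave like a boto3 DynamoDB Table resource
    (put_item) and sns_client like a boto3 SNS client (publish).
    Returns the UUID, which the caller can keep in the session attributes.
    """
    record_id = str(uuid.uuid4())
    otp_table.put_item(Item={
        "uuid": record_id,                            # assumed key name
        "otp": otp,                                   # assumed attribute name
        "expiresAt": int(time.time()) + ttl_seconds,  # assumed attribute name
    })
    sns_client.publish(
        PhoneNumber=phone_number,
        Message=f"Your one-time password is {otp}",
    )
    return record_id
```

With boto3, the two dependencies would typically be created as `boto3.resource("dynamodb").Table("onetimepin")` and `boto3.client("sns")`, where the table name is whatever the deployment actually uses.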

Capturing the OTP using text modality

Similar to the first implementation, the user’s email or other metadata are sent to the Amazon Lex runtime from the front end.

In the second implementation, an AWS Lambda validation code hook is used to perform the following tasks:

  1. AWS Lambda generates an OTP and stores it in the DynamoDB table.
  2. AWS Lambda sends the OTP to the user’s mobile phone using SNS.
  3. The user enters the OTP into the dialog box of the client application.
  4. The client application encrypts the OTP entered by the user and sends it to the Amazon Lex runtime in the session attributes.

Note: Session attributes can be encrypted.

  5. AWS Lambda verifies the OTP, and, if the authentication is successful, it signals Amazon Lex to proceed with the conversation.

Note: if the OTP is encrypted, the Lambda function will need to decrypt it first.

After the user is authenticated, the user can place an order with the Amazon Lex bot.

Generating an OTP

There are many methods of generating an OTP. In our example, we generate a random six-digit number as an OTP that is valid for one minute and store it in a DynamoDB table. To verify the OTP, we compare the value entered by the user with the value in the DynamoDB table.
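A minimal sketch of this generate-and-compare scheme is shown below, using only the Python standard library. The record layout, the function names, and the choice to zero-pad the OTP to six characters are assumptions for illustration. An in-memory dictionary stands in for the DynamoDB item.

```python
import secrets
import time

OTP_TTL_SECONDS = 60  # the OTP in this example is valid for one minute

def generate_otp():
    """Return a random six-digit OTP as a zero-padded string."""
    return f"{secrets.randbelow(10**6):06d}"

def new_otp_record(now=None):
    """Create the record that would be stored in the OTP table."""
    now = time.time() if now is None else now
    return {"otp": generate_otp(), "expiresAt": now + OTP_TTL_SECONDS}

def verify_otp(record, entered, now=None):
    """Check that the entered value matches the stored OTP and has not expired."""
    now = time.time() if now is None else now
    if now > record["expiresAt"]:
        return False
    # Constant-time comparison avoids leaking how many digits matched.
    return secrets.compare_digest(record["otp"].encode(), entered.encode())
```

The `secrets` module is used instead of `random` because OTPs are security-sensitive and should come from a cryptographically strong source.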

Deploying the OTP bot

Use this AWS CloudFormation button to launch the OTP bot in the AWS Region us-east-1:

The source code is available in our GitHub repository.

Open the AWS CloudFormation console, and on the Parameters page, enter a valid phone number. This is the phone number the OTP is sent to.

Choose Next twice to display the Review page.

Select the acknowledgement checkbox, and choose Create to deploy the ExampleBot.

The CloudFormation stack creates the following resources in your AWS account:

  • Amazon S3 buckets to host the ExampleBot web UIs.
  • Amazon Lex Bot to provide natural language processing.
  • AWS Lambda functions used to send and validate the OTP.
  • AWS IAM roles for the Lambda function.
  • Amazon DynamoDB tables to store session data.
  • AWS KMS key for encrypting and decrypting data.
  • Amazon Cognito identity pool configured to authenticate clients and provide temporary AWS credentials.

When the deployment is complete (after about 15 minutes), the stack Output tab shows the following:

  • ExampleBotURL: Click on this URL to interact with ExampleBot.

Let’s create the bot

This blog post builds upon the bot building basics covered in Building Better Bots Using Amazon Lex. Following the guidance from that blog post, we create an Amazon Lex bot with two intents: GetFoodMenu and OrderFood.

The GetFoodMenu intent does not require authentication. The user can ask the bot what food items are on the menu such as:

Please recommend something.

Show me your menu please.

What is on the menu?

What kind of food do you have?

The bot returns a list of food the user can order when the GetFoodMenu intent is elicited.

If the user already knows which food item they want to order, they can order the food item with the following input text to invoke the OrderFood intent:

I would like to order some pasta.

Can I order some food please?

Cheese burger!

Amazon Lex uses the Lambda code hook to check if the user is authenticated. If the user is authenticated, Amazon Lex adds the food item to the user’s current order.

If the user has not been authenticated yet, the interaction looks like this:

User: I would like to order some pasta.

Bot: It seems that you are not authenticated yet. We have sent an OTP to your registered phone number. Please enter the OTP.

User: 812734

Note: If the user is using the text modality, the user’s OTP input can be encrypted.

Bot: Thanks for entering the OTP. We’ve added pasta to your order. Would you like anything else?

If the user is not authenticated, Amazon Lex initiates the multifactor authentication (MFA) process. AWS Lambda queries the DynamoDB table for that user’s mobile phone number and delivery address. After DynamoDB returns the values, AWS Lambda generates an OTP based on the metadata of the user, saves it in a DynamoDB table with a UUID as the primary key, and stores the UUID in the session attributes. Then AWS Lambda uses SNS to send the OTP to the user and elicits the pin using the pin slot in the OrderFood intent.

After the user inputs the OTP, Amazon Lex uses the Lambda code hook to validate the pin. AWS Lambda queries the DynamoDB table with the UUID in the session attributes to verify the OTP. If the pin is correct, the Lambda function queries DynamoDB for the secret data; if the pin is incorrect, Lambda performs the validation step again.

Implementation details

The following tables and screenshots show you the different slot types and intents, and how you can use the AWS Management Console to specify the ones you want.


Slot    Slot type
Food    Amazon.Food
Pin     Amazon.Number

Intent name: OrderFood
Sample utterances:

  • I would like to order some {Food}
  • I would like {Food}
  • order {Food}
  • Can I order some food please
  • Can I order some {Food} please

Intent name: GetFoodMenu
Sample utterances:

  • Please recommend something.
  • Show me your menu, please.
  • What is on the menu?
  • What kind of food do you have?

The GetFoodMenu intent uses the GetMenu Lambda function as the initialization and validation code hook to perform the logic, whereas the OrderFood intent uses the OrderFood Lambda function as the initialization and validation code hook to perform the logic.

These are the steps the Lambda function follows:

  1. The Lambda function first checks which intent the user has invoked.
  2. If the payload is for the GetFoodMenu intent:
    1. We’re assuming that the client will send the following items in the session attribute for the first Amazon Lex runtime API call. Since we cannot pass session attributes in the Lex console, for testing purposes, our Lambda function will create the following session attributes if the session attribute is empty.
      {'email': '', 'auth': 'false', 'uuid': None, 'currentOrder': None, 'encryptedPin': None}

      • 'email' is the email address of the user.
      • 'auth': 'false' implies the user is unauthenticated.
      • 'uuid' is an identifier used later as the primary key to store the OTP in DynamoDB.
      • 'currentOrder' will keep track of the food items ordered by the user.
      • 'encryptedPin' will be used by the frontend client to send the encrypted OTP. If the implementation does not require the OTP to be encrypted, then this attribute is optional.
    2. The Lambda function will return a list of food items and ask the user which food items they wish to order. In other words, the Lambda function will elicitSlot for the Food slot in the OrderFood intent.
  3. If the payload is for the OrderFood intent:
    1. As we stated earlier, for testing purposes our Lambda function will create the following session attributes if the session attributes are empty.
      {'email': '', 'auth': 'false', 'uuid': None, 'currentOrder': None, 'encryptedPin': None}
    2. If the user is authenticated, the Lambda function will add the requested food item to currentOrder in the session attributes.
    3. If the user is not authenticated, the Lambda function will query the phoneNumbers DynamoDB table, using the email, for the phone number of the user.
      1. If DynamoDB is not able to return a phone number matching that email address, the Lambda function will tell the user it wasn’t able to find a phone number associated with that email and will ask the user to contact support.
    4. The Lambda function will generate an OTP and a uuid. The uuid is stored in the session attributes and the key value pair { uuid : OTP} will be stored as a record in the onetimepin DynamoDB table.
    5. The Lambda function will use SNS to send the OTP to the user’s phone number and ask the user to enter the one time pin they received by eliciting the pin slot in the OrderFood intent.
    6. After the user enters the pin, the Lambda function will query the onetimepin DynamoDB table for the record with the uuid stored in the sessionAttributes.
      1. If the user enters an incorrect pin, the Lambda function will generate a new OTP, store it in DynamoDB, update the uuid in Session Attributes, send this new OTP to the user via SNS again, and ask the user to enter the pin again. The following screenshot illustrates this.
      2. If the pin is correct, Lambda will validate the food item the user is requesting.
      3. If the “Food” slot value is null, Lambda will ask the user which food item they want to order by eliciting the “Food” slot in the “OrderFood” intent.
      4. Lambda will add the requested food item to ‘currentOrder’ in session attributes.
  4. After the user is authenticated, the subsequent items they want to add to the order will not require authentication as long as the session does not expire.
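The dispatch logic in the steps above can be sketched as a single handler. The sketch assumes the Amazon Lex (V1) Lambda event and response shapes, slots named "Food" and "Pin", and a hypothetical `verify_pin` callback standing in for the DynamoDB lookup. It omits OTP generation, SNS delivery, and the regenerate-on-failure step, and the menu text is a placeholder.

```python
def elicit_slot(session, intent_name, slots, slot_to_elicit, text):
    """Build a Lex (V1 format) response that asks the user for one slot."""
    return {
        "sessionAttributes": session,
        "dialogAction": {
            "type": "ElicitSlot",
            "intentName": intent_name,
            "slots": slots,
            "slotToElicit": slot_to_elicit,
            "message": {"contentType": "PlainText", "content": text},
        },
    }

def close(session, text):
    """Build a Lex (V1 format) response that ends the current intent."""
    return {
        "sessionAttributes": session,
        "dialogAction": {
            "type": "Close",
            "fulfillmentState": "Fulfilled",
            "message": {"contentType": "PlainText", "content": text},
        },
    }

def handle(event, verify_pin):
    """Route a Lex event; verify_pin(pin) -> bool is a hypothetical callback."""
    # Create default session attributes when the client sent none.
    session = dict(event.get("sessionAttributes") or {"auth": "false"})
    intent = event["currentIntent"]["name"]
    slots = dict(event["currentIntent"]["slots"])

    if intent == "GetFoodMenu":
        # Show the menu, then hand over to OrderFood by eliciting its Food slot.
        return elicit_slot(session, "OrderFood", {"Food": None, "Pin": None},
                           "Food", "We have pasta and pizza. What would you like?")
    if intent != "OrderFood":
        raise ValueError(f"Unsupported intent: {intent}")

    if session.get("auth") != "true":
        pin = slots.get("Pin")
        if not pin or not verify_pin(pin):
            slots["Pin"] = None  # clear a wrong pin so Lex asks again
            return elicit_slot(session, "OrderFood", slots, "Pin",
                               "Please enter the OTP sent to your registered phone number.")
        session["auth"] = "true"  # later orders in this session skip the OTP

    food = slots.get("Food")
    if not food:
        return elicit_slot(session, "OrderFood", slots, "Food",
                           "Which food item would you like to order?")
    session["currentOrder"] = food
    return close(session, f"We've added {food} to your order. Would you like anything else?")
```

Keeping the authentication flag in the session attributes is what lets subsequent orders in the same session proceed without another OTP.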

OTP encryption and decryption

In this section, we’ll show you how to encrypt the OTP from the client side before sending the OTP as a session attribute to Amazon Lex. We’ll also show you how to decrypt the session attribute in the Lambda function.

To encrypt the OTP, the frontend needs to use Amazon Cognito identity pool to assume an unauthenticated role that has the permissions to perform the encrypt action using KMS before sending the OTP through to Amazon Lex as a session attribute. For more information on Amazon Cognito identity pool, see the documentation.

After the Lambda function receives the OTP, if it is encrypted, Lambda uses KMS to decrypt the OTP and then queries a DynamoDB table to confirm that the OTP is correct.

Please refer to the documentation linked here for the prerequisites:

  1. Create an Amazon Cognito identity pool. Ensure that unauthenticated identities are enabled.
  2. Create a KMS key.
  3. Lock down access to the KMS key to the unauthenticated role created in step 1.
    1. Allow the unauthenticated Amazon Cognito role to use this KMS key to perform the encrypt action.
    2. Allow the Lambda function’s IAM role the decrypt KMS action.
    3. Here is an example of the two key policy statements that illustrate this.
            {
              "Sid": "Allow use of the key to encrypt",
              "Effect": "Allow",
              "Principal": {
                "AWS": ["arn:aws:iam::<your account>:role/<unauthenticated_role>"]
              },
              "Action": ["kms:Encrypt"],
              "Resource": "arn:aws:kms:AWS_region:AWS_account_ID:key/key_ID"
            },
            {
              "Sid": "Allow use of the key to decrypt",
              "Effect": "Allow",
              "Principal": {
                "AWS": ["arn:aws:iam::<your account>:role/<Lambda_functions_IAM_role>"]
              },
              "Action": ["kms:Decrypt"],
              "Resource": "arn:aws:kms:AWS_region:AWS_account_ID:key/key_ID"
            }

Now we are ready to use the frontend to encrypt the OTP:

  1. In the frontend client, use the GetCredentialsForIdentity API to get the temporary AWS credentials for the unauthenticated Amazon Cognito role. These temporary credentials are used by the frontend to access the AWS KMS service.
  2. The frontend uses the KMS Encrypt API to encrypt the OTP.
  3. The encrypted OTP is sent to Amazon Lex in the session attributes.
  4. The Lambda function uses the KMS Decrypt API to decrypt the encrypted OTP.
  5. After the OTP is decrypted, the Lambda function validates the OTP value.
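The encrypt and decrypt steps above can be sketched with two small helpers. This is an illustration under stated assumptions: the function names are hypothetical, the KMS client is passed in (for example `boto3.client("kms")` created with the temporary Cognito credentials on the client side), and the ciphertext is Base64-encoded on the assumption that session attribute values must be strings. A real frontend would more likely use the AWS SDK for its own platform; Python is used here for brevity.

```python
import base64

def encrypt_otp(kms_client, key_id, otp):
    """Encrypt the OTP with KMS and return a string safe for a session attribute."""
    response = kms_client.encrypt(KeyId=key_id, Plaintext=otp.encode("utf-8"))
    # CiphertextBlob is raw bytes; Base64 turns it into a plain ASCII string.
    return base64.b64encode(response["CiphertextBlob"]).decode("ascii")

def decrypt_otp(kms_client, encrypted_pin):
    """Reverse encrypt_otp inside the Lambda function."""
    blob = base64.b64decode(encrypted_pin)
    # For symmetric keys, KMS identifies the key from the ciphertext itself.
    response = kms_client.decrypt(CiphertextBlob=blob)
    return response["Plaintext"].decode("utf-8")
```

The client would place the return value of `encrypt_otp` in the 'encryptedPin' session attribute, and the Lambda function would call `decrypt_otp` on that value before comparing it with the stored OTP.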


In this post we showed how to use OTP functionality on an Amazon Lex bot using a simple example. In our design we used AWS Lambda to run data validation and fulfillment; DynamoDB to store and retrieve data; SNS to publish SMS messages; KMS to encrypt and decrypt the OTP; and Amazon Cognito identity pool to obtain temporary AWS credentials to use KMS.

It’s easy to incorporate the OTP functionality described here into any bot. You can pass the OTP from the frontend to Amazon Lex either as a slot value or as a session attribute value in your intent. Then, send the OTP and perform the validation using a Lambda function, and your bot is ready to accept OTPs!

About the Author

Kun Qian is a Cloud Support Engineer at AWS. He enjoys providing technical guidance to customers, and helping them troubleshoot and design solutions on AWS.

AWS DeepRacer League weekly challenges – compete in the AWS DeepRacer League virtual circuit to win cash prizes and a trip to re:Invent 2019!

The AWS DeepRacer League is the world’s first global autonomous racing league, open to anyone. Developers of all skill levels can get hands-on with machine learning in a fun and exciting way, racing for prizes and glory at 21 events globally and online using the AWS DeepRacer console. The Virtual Circuit launched at the end of April, allowing developers to compete from anywhere in the world via the console – no car or track required – for a chance to top the leaderboard and score points in one of the six monthly competitions. The top prize is an expenses paid trip to re:Invent to compete in the 2019 Championship Cup finals, but that is not the only prize.

More chances to win with weekly challenges

The 2019 virtual racing season ends in the last week of October. Between now and then, the AWS DeepRacer League will run weekly challenges, offering more opportunities to win prizes and compete for a chance to advance to the Championship Cup on an expenses paid trip to re:Invent 2019. Multiple challenges will launch each week, providing cash prizes in the form of AWS credits that will help you keep rolling your way up the leaderboard as you continue to tune and train your model.

More detail on each of the challenges

The Rookie – Even the best in the league had to start somewhere! To help those brand new to AWS DeepRacer and machine learning, there are going to be rewards for those who make their debut on the leaderboard. When you submit your first model of the week to the virtual leaderboard, you’re in the running for a chance to win prizes, even if the top spot seems out of reach. If you win, you will receive AWS credits to help you build on these new found skills and climb the leaderboard. Anything is possible once you know how, and this could be the boost you need to take that top prize of a trip to re:Invent 2019.

The Most Improved – Think the top spot is out of reach and the only way to win? Think again. A personal record is an individual achievement that should be rewarded. This challenge is designed to help developers, new and existing, reach new heights in their machine learning journey. The drivers who improve their lap time the most each week will receive AWS credits to help them continue to train, improve, and win!

The Quick Sprint challenge – In this challenge, you have the opportunity to put your machine learning skills to the test and see how quickly you can create your model for success. Those who train a model in the fastest time and complete a successful lap of the track have the chance to win cash prizes!

New challenges every week

The AWS DeepRacer League will be rolling out new challenges throughout the month of August, including rewards for hitting that next milestone. Time Target 50 rewards those who finish a lap closest to 50 seconds.

Current open challenges:

The August track, Shanghai Sudu, is open now, so what are you waiting for? Start training in the AWS DeepRacer console today, submit a model to the leaderboard, and you will automatically be entered to win! You can also learn more about the points and prizes up for grabs on the AWS DeepRacer League points and prizes page.

About the Author

Alexandra Bush is a Senior Product Marketing Manager for AWS AI. She is passionate about how technology impacts the world around us and enjoys being able to help make it accessible to all. Out of the office she loves to run, travel and stay active in the outdoors with family and friends.




Project Euphonia’s Personalized Speech Recognition for Non-Standard Speech

The utility of technology is dependent on its accessibility. One key component of accessibility is automatic speech recognition (ASR), which can greatly improve the ability of those with speech impairments to interact with everyday smart devices. However, ASR systems are most often trained on ‘typical’ speech, which means that underrepresented groups, such as those with speech impairments or heavy accents, don’t experience the same degree of utility. For example, amyotrophic lateral sclerosis (ALS) is a disease that can adversely affect a person’s speech: about 25% of people with ALS experience slurred speech as their first symptom. In addition, most people with ALS eventually lose the ability to walk, so being able to interact with automated devices from a distance can be very important. Yet current state-of-the-art ASR models can yield high word error rates (WER) for speakers with only a moderate speech impairment from ALS, effectively barring access to ASR-reliant technologies.

In “Personalizing ASR for Dysarthric and Accented Speech with Limited Data,” to be presented at Interspeech 2019, we describe some of the research behind Project Euphonia, an ASR platform that performs speech-to-text transcription. This work presents an approach to improve ASR for people with ALS that may also be applicable to many other types of non-standard speech. Using a two-step training approach that starts with a baseline “standard” corpus and then fine-tunes the training with a personalized speech dataset, we have demonstrated significant improvements for speakers with atypical speech over current state-of-the-art models.

A Two-Phased Approach to Training
In order to create ASR models that work on non-standard speech, one needs to overcome two challenges. The first is that within a particular class of atypical speech, be it a regional accent or a speech impairment, for example, individuals can exhibit very different ways of speaking. Our approach deals with this sub-group heterogeneity by training the ASR model in two phases. We start with a high-quality ASR model trained on thousands of hours of standard speech and then we fine-tune parts of the model to an individual with non-standard speech. This approach is similar to that of Parrotron: both systems use end-to-end neural networks to help improve communication and accessibility, but Parrotron focuses exclusively on speech-to-speech, where a person’s speech is converted directly into synthesized speech, rather than text.

The second challenge arises from the difficulty in collecting enough data to train a state-of-the-art recognizer for individuals. Typical speech recognizers are trained on thousands of hours of speech from many different speakers. Acquiring this much data from a single speaker is nearly impossible, especially if the speaker may experience exhaustion from speaking due to a medical condition. Our approach overcomes this issue by first training a base model on a large corpus of typical speech, and then training a personalized model using a much smaller dataset with the targeted non-standard speech characteristics.

The Neural Network Architecture
When developing the models used for training data on atypical speech, we explored two different neural architectures. The first is the RNN-Transducer (RNN-T), a neural network architecture consisting of encoder and decoder networks that has shown good results on numerous ASR tasks. The encoder is bidirectional (i.e., it looks at the entire sentence at once in order to provide context), and thus it requires the entire audio sample to perform speech recognition.

The other architecture we explored was Listen, Attend, and Spell (LAS), which is an attention-based, sequence-to-sequence model that maps sequences of acoustic properties to sequences of languages. This model uses an encoder to convert the sequence of acoustic frames to a sequence of internal representations, and a decoder to convert the sequence of internal representations to linguistic output. The network produces “word pieces”, which are a linguistic representation between graphemes and words.

Comparison of the RNN-Transducer (left) and Listen, Attend, Spell (right) architectures. From Prabhavalkar et al. 2017.

We experimented with fine-tuning the state-of-the-art RNN-T and LAS base models on two types of non-standard speech. In partnership with the ALS Therapy Development Institute, we first collected about 36 hours of audio from 67 speakers who have ALS. The participants recorded themselves on their home computers using custom software while they read sentences from a very restricted language domain. Many phrases were single sentences with simple grammatical structure (e.g., “What time is the basketball game on tonight?”). This is in contrast with unrestricted language domains, which include domain-specific vocabulary (e.g., science talks) and complex language structure (e.g., a debate). The recordings did not include many of the filler words common in normal speech, such as “um” and “uh”.

We also tested accented speech, using the open source L2 Arctic dataset of non-native speech, which consists of 20 speakers with approximately 1 hour of speech per speaker. Each speaker recorded a set of 1150 utterances from the CMU Arctic prompts.

Audio | Euphonia Model | Standard Speech Model
(audio clip) | Did I have anything to say about it? | Dictatorship angels to think about it
(audio clip) | Come right back please | Cameras object
(audio clip) | Let’s try that again | It extracts
(audio clip) | Turn it down a little bit please | Turning down a little bit please
The audio clips (left) are recordings of a speaker with ALS. The text transcriptions are output from the Euphonia model (center) and the Standard Speech model (right). Incorrectly transcribed text is underlined.

The absolute word error rates on the language-restricted test set are shown below. There is an improvement over the baseline model for very non-standard speech (heavy accents and ALS speech below 3 on the ALS Functional Rating Scale) and moderate improvements in ALS speech that is similar to typical speech. The relative difference between the base model and the fine-tuned model demonstrates that the majority of the improvement comes from the fine-tuning process, except in the case of the RNN-T on the Arctic dataset, where the RNN-T baseline is already strong.
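For reference, word error rate is conventionally computed as the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal illustration, not the evaluation code used in this work. The function name `word_error_rate` is hypothetical, and treating one of the example transcripts shown earlier as the reference is an assumption made only for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Assumption: the first phrase is treated as the reference transcript.
wer = word_error_rate("Turn it down a little bit please",
                      "Turning down a little bit please")
print(f"WER = {wer:.3f}")
```

Here the two transcripts differ by one substitution and one deletion against seven reference words, so the sketch reports a WER of 2/7.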

1 Non-native English speech from the L2-Arctic dataset.
2 Low FRS (ALS Functional Rating Scale) speech; intelligible with repeating (FRS 2); Speech combined with non-vocal communication (FRS 1).
3 FRS 3; detectable speech disturbance.

The RNN-T model achieved 91% of the improvement by fine-tuning just two layers, most of which are close to the input. On the accented dataset, fine-tuning the same two layers achieved 86% of the relative improvement compared to fine-tuning the entire network. This is consistent with previous speech work.

Most of the performance gains were achieved early in training. The models we trained were tested on a relatively limited domain of vocabulary and linguistic complexity, so the performance numbers are not necessarily related to how well the models perform on more general tasks. We hope that just fine-tuning part of the network allows it to retain the acoustic and linguistic information from the general speech model, while needing minimal modifications to adapt to a single new speaker. Future work will test this hypothesis.

Low FRS corresponds to the ALS speakers with low intelligibility (FRS 2, 1), while high FRS corresponds to ALS speakers with less severely impacted speech (FRS 3).

Understanding Model Behavior
To better understand how our models improved after fine-tuning, we looked at the pattern of phoneme mistakes. We started by comparing the distribution of phoneme mistakes made by the base ASR model on standard speech to the mistakes made on ALS speech. The SAMPA phonemes with the five largest differences between the ALS data and standard speech are p, U, f, k, and Z, which account for 20% of the deletion mistakes. Similarly, the n and m phonemes together account for 17% of the insertion / substitution mistakes. The same analysis on our fine-tuned models verifies that the unrecognized phoneme distribution is more similar to that of standard speech.

Our analysis shows that there are two aspects to every mistake: which phoneme the system doesn’t understand, and which phoneme the system thinks was said. Imagine having two systems with identical accuracy: one system always thinks that the f phoneme is actually the g phoneme, while another doesn’t know what the f phoneme is and randomly guesses. These two systems will have identical performance and identical distributions of phoneme mistakes, but very different distributions of the predicted phoneme when a mistake is made. Surprisingly, ASR mistakes on ALS speech are far more similar to regular speech mistakes after Euphonia fine-tuning.
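The thought experiment above can be made concrete with a small sketch. The phoneme inventory, the trial count, and the deterministic stand-in for random guessing are all assumptions chosen for illustration; this is not the analysis code used in this work.

```python
import math
from collections import Counter

# Hypothetical phoneme inventory; only "f" is misrecognized in this toy setup.
PHONEMES = ["f", "g", "k", "p", "m"]
WRONG_CHOICES = [p for p in PHONEMES if p != "f"]


def system_a(true_phoneme: str, trial: int) -> str:
    """Always hears f as g."""
    return "g" if true_phoneme == "f" else true_phoneme


def system_b(true_phoneme: str, trial: int) -> str:
    """Does not know f at all; cycles through the other phonemes
    (a deterministic stand-in for random guessing)."""
    if true_phoneme == "f":
        return WRONG_CHOICES[trial % len(WRONG_CHOICES)]
    return true_phoneme


def analyze(system, trials_per_phoneme: int = 100):
    missed = Counter()     # which true phoneme was not recognized
    predicted = Counter()  # what the system output when it was wrong
    for phoneme in PHONEMES:
        for trial in range(trials_per_phoneme):
            guess = system(phoneme, trial)
            if guess != phoneme:
                missed[phoneme] += 1
                predicted[guess] += 1
    return missed, predicted


def entropy_bits(counts: Counter) -> float:
    """Shannon entropy of a count distribution, in bits."""
    total = sum(counts.values())
    return sum(c / total * math.log2(total / c) for c in counts.values())


missed_a, predicted_a = analyze(system_a)
missed_b, predicted_b = analyze(system_b)
print(missed_a == missed_b)        # both systems miss the same phoneme equally often
print(entropy_bits(predicted_a))   # every mistake lands on a single phoneme
print(entropy_bits(predicted_b))   # mistakes are spread over four phonemes
```

Both systems have the same accuracy and the same distribution of missed phonemes, yet the entropy of what they predict on a mistake differs, which is the second aspect of a mistake described above.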

Deletion / substitution mistakes per SAMPA phoneme on ALS speech before fine-tuning, ALS speech after fine-tuning, and on typical speech (Librispeech dataset).

Future Work
In the future, we intend to explore additional techniques that can be helpful in the low data regime. We also hope to use phoneme mistakes to weight certain examples during training, or to pick training sentences for people with ALS to record that contain the most common phoneme mistakes. We would like to explore pooling data from multiple speakers with similar conditions.

We hope that continued research in this area will help voice interfaces become accessible to more people, especially those who need it most. One key component to this is collecting data. Anyone 18 or older can help us build better personalized models by donating audio data. If you’re interested, you can fill out this form to allow Google to contact you.

This work would not have been possible without the extraordinary effort and support of the ALS Therapy Development Institute and the ALS community, especially Fernando Vieira, Maeve McNally, Taylor Charbonneau, Melissa Nollstadt, and the individuals with ALS who kindly and patiently volunteered their audio. This work builds on the pioneering advances in speech recognition made by Google’s speech team, in particular the recent development and deployment of end-to-end speech recognition models. We are grateful to the Google speech team for advice and collaboration, particularly to Anshuman Tripathi and Hasim Sak who guided us in training the initial models. We’d also like to thank Oran Lang, Omry Tuval, Michael Brenner, Julie Cattiau, Tara Sainath, Ding Zhao, Qiao Liang, Chung-Cheng Chiu, Dan Liebling, Ron Weiss, Anjuli Kannan, Dimitri Kanevsky, Ryan He, Gabor Simko, Benjamin Lee, Françoise Beaufays, Khe Chai Sim, Jimmy Tobin, Chet Gnegy, Jacqueline Huang, Ye Jia, Yu Zhang, Yonghui Wu, Michelle Ramanovich, Rus Heywood, Katrin Tomanek, Bob MacDonald, Pan-Pan Jiang, Ronnie Maor, Rif A. Saurous, Trevor Strohman, Dick Lyon, Avinatan Hassidim, Philip Nelson, and Yossi Matias for their technical contributions and project guidance.

Gut Feeling: Endoscopy Startup Uses AI to Spot Stomach, Colon Cancer

Even the most experienced doctors can’t catch every tiny polyp during an endoscopy, a screening of the digestive system.

But even in routine exams, the stakes are high — missing an early warning sign of cancer can lead to delayed diagnosis and treatment, lowering a patient’s chances for recovery.

To cut down on the rate of missed precancerous lesions, one Japanese endoscopist is turning to AI. His startup, AIM (short for AI Medical Service), is building a GPU-powered AI system that will analyze endoscopy video feeds in real time, spotting lesions and helping doctors identify which are cancerous or at risk of becoming so.

AI screening could also help clinicians manage a demanding workload: Japanese endoscopists must check more than 3,000 medical images a day, on average. Stomach and colon cancer are two of the three leading causes of cancer-related deaths in the country.

“Coming from 23 years of experience as an actual endoscopist, I saw firsthand the challenges facing experts in the field,” said Tomohiro Tada, CEO of AIM. “GPU-powered AI can help manage the overwhelming demand for checking endoscopic images, while improving the overall accuracy of lesion detection.”

A quarter of precancerous lesions are overlooked in endoscopy screenings, according to one Japanese study. In preclinical research trials, AIM’s AI model achieved 92 percent sensitivity in detecting stomach cancer lesions from endoscopy videos. The startup’s deep learning tool could help endoscopists better distinguish hard-to-spot lesions and improve consistency across different clinics.

AI Powers a Better Gut Check 

During an upper gastrointestinal endoscopy, a doctor examines a patient’s esophagus, stomach and upper region of the small intestine using a long tube with a small camera attached to it. The video feed from this camera is displayed on a larger screen for the clinician, who looks for bleeding, cancer or other conditions.

While doctors examine the endoscopy video footage live to check for polyps, they also check still images after the procedure. Having an AI to assist in real-time detection during a procedure could help doctors save time spent on secondary screening, Tada said.

AIM plans to deploy its AI model, which can identify different kinds of stomach lesions, in an NVIDIA Quadro RTX 4000 GPU-powered device that connects to existing endoscope systems. The device would receive the live endoscopy video feed and simultaneously process the footage to assist doctors during the procedure.

The startup uses a variety of NVIDIA GPUs, including the TITAN Xp and Quadro P6000, to train its deep learning models. It’s using an NVIDIA Quadro mobile workstation for inference in the prototype of its real-time AI device.

AIM’s deep-learning based object detection and classification algorithms are developed using tens of thousands of annotated endoscopy images from Tada’s clinic and from research partners including Japan’s Cancer Institute Hospital and the University of Tokyo Hospital.

The post Gut Feeling: Endoscopy Startup Uses AI to Spot Stomach, Colon Cancer appeared first on The Official NVIDIA Blog.

Kinect Energy uses Amazon SageMaker to Forecast energy prices with Machine Learning

The Amazon ML Solutions Lab worked with Kinect Energy recently to build a pipeline to predict future energy prices based on machine learning (ML). We created an automated data ingestion and inference pipeline using Amazon SageMaker and AWS Step Functions to automate and schedule energy price prediction.

The process makes special use of the Amazon SageMaker DeepAR forecasting algorithm. By using a deep learning forecasting model to replace the current manual process, we saved Kinect Energy time and put a consistent, data-driven methodology into place.

The following diagram shows the end-to-end solution.

The data ingestion is orchestrated using a step function, which loads and processes data daily and deposits it into a data lake in Amazon S3. The data is then passed to Amazon SageMaker, which handles inference generation via a batch transform call that triggers an inference pipeline model.

Project motivation

The natural power market depends on a range of sources for production—wind, hydro-reservoir generation, nuclear, coal, and oil & gas—to meet consumer demand. The actual mix of power sources used to satisfy that demand depends on the price of each energy component on a given day. That price depends on that day’s power demand. Investors then trade the price of electricity in an open market.

Kinect Energy buys and sells energy to clients, and an important piece of their business model involves trading financial contracts derived from energy prices. This requires an accurate forecast of the energy price.

Kinect Energy wanted to improve and automate the process of forecasting—historically done manually—by using ML. The spot price is the current commodity price, as opposed to the future or forward price—the price at which a commodity can be bought or sold for future delivery. Comparing predicted spot prices and forward prices provides opportunities for the Kinect Energy team to hedge against future price movements based on current predictions.

Data requirements

In this solution, we wanted to predict spot prices for a four-week outlook on an hourly interval. One of the major challenges for the project involved creating a system to gather and process the required data automatically. The pipeline required two main components of the data:

  • Historic spot prices
  • Energy production and consumption rates and other external factors that influence the spot price

(We denote the production and consumption rates as external data.)

To build a robust forecasting model, we had to gather enough historical data to train the model, preferably spanning multiple years. We also had to update the data daily as the market generates new information. The model also needed access to a forecast of the external data components for the entire period over which the model forecasts.
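As a rough sketch of that last requirement: with a four-week outlook at hourly resolution, the forecast horizon is 4 × 7 × 24 = 672 steps, and DeepAR expects each dynamic feature series supplied at inference time to cover the target history plus that horizon. The helper name `check_dynamic_features` and the toy series below are hypothetical.

```python
HOURS_PER_DAY = 24
FORECAST_WEEKS = 4
# Four-week outlook at hourly resolution.
PREDICTION_LENGTH = FORECAST_WEEKS * 7 * HOURS_PER_DAY  # 672 steps


def check_dynamic_features(target, dynamic_feat, prediction_length=PREDICTION_LENGTH):
    """Raise if any external-data series does not cover history plus horizon."""
    required = len(target) + prediction_length
    for index, feature in enumerate(dynamic_feat):
        if len(feature) != required:
            raise ValueError(
                f"feature {index} has {len(feature)} points, expected {required}"
            )
    return required


# Toy data: one year of hourly spot prices and two external series that
# include forecasted values for the next four weeks.
history = [30.0] * (365 * HOURS_PER_DAY)
external = [[1.0] * (len(history) + PREDICTION_LENGTH) for _ in range(2)]
print(check_dynamic_features(history, external))
```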

Vendors update hourly spot prices to an external data feed daily. Various other entities provide data components on production and consumption rates, publishing their data on different schedules.

The analysts of the Kinect Energy team require the spot price forecast at a specific time of the day to shape their trading strategy. So, we had to build a robust data pipeline that periodically calls multiple API actions. Those actions collect data, perform the necessary preprocessing, and then store it in an Amazon S3 data lake where the forecasting model accesses it.

The data ingestion and inference generation pipeline

The pipeline consists of three main steps orchestrated by an AWS Step Function state machine: data ingestion, data storage, and inference generation. An Amazon CloudWatch event triggers the state machine to run on a daily schedule to prepare the consumable data.

The flowchart above details the individual steps that make up the entire step function. The step function coordinates downloading of new data, updating of the historical data, and generating new inferences so that the whole process can be carried out in a single continuous workflow.

Although we built the state machine around a daily schedule, it employs two modes of data retrieval. By default, the state machine downloads data daily. A user can manually trigger the process to download the full historical data on demand as well for setup or recovery. The step function calls multiple API actions to gather the data, each with different latencies. The data gathering processes run in parallel. This step also performs all the required preprocessing, and stores the data in S3 organized by time-stamped prefixes.
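One way to picture the time-stamped layout is a key builder like the sketch below. The prefix scheme, the component names, and the function name `build_s3_key` are assumptions for illustration; the post does not specify the actual layout.

```python
from datetime import date


def build_s3_key(component: str, run_date: date, filename: str = "data.csv") -> str:
    """Return a key such as 'raw/spot_price/2019/08/15/data.csv' (hypothetical layout)."""
    return f"raw/{component}/{run_date:%Y/%m/%d}/{filename}"


# Each data feed writes under its own prefix, so parallel downloads never collide.
components = ["spot_price", "production", "consumption"]
keys = [build_s3_key(name, date(2019, 8, 15)) for name in components]
print(keys)
```

Keeping the date in the prefix means a daily run only touches that day's objects, while a full historical pull can overwrite the same keys without any special handling.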

The next step updates the historical data for each component by appending the respective daily elements. Additional processing prepares it in the format that DeepAR requires and sends the data to another designated folder.

The model then triggers an Amazon SageMaker batch transform job that pulls the data from that location, generates the forecast, and finally stores the result in another time-stamped folder. An Amazon QuickSight dashboard picks up the forecast and displays it to the analysts.

Packaging required dependencies into AWS Lambda functions

We set up Python pandas and the scikit-learn (sklearn) library to handle most of the data preprocessing. These libraries aren’t available by default for import into a Lambda function that Step Functions calls. To adapt, we packaged the Lambda function Python script and its necessary imports into a .zip file.

cd ../package
pip install requests --target .
pip install pandas --target .
pip install lxml --target .
pip install boto3 --target .
zip -r9 ../lambda_function_codes/ .
cd -
zip -g

This additional code uploads the .zip file to the target Lambda function:

aws lambda update-function-code \
  --function-name update_history \
  --zip-file fileb://

Exception handling

One of the common challenges of writing robust production code is anticipating possible failure mechanisms and mitigating them. Without instructions to handle unusual events, the pipeline can fall apart.

Our pipeline presents two major potentials for failure. First, our data ingestion relies on external API data feeds, which could experience downtime, leading our queries to fail. In this case, we set a fixed number of retry attempts before the process marks the data feed temporarily unavailable. Second, feeds may not provide updated data and instead return old information. In this case, the API actions do not return errors, so our process needs the ability to decide for itself if the information is new.
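Handled by hand, the two failure cases might look like the sketch below: a freshness check that raises when a feed returns only old data, wrapped in a bounded retry loop. The exception name matches the one used in the retry configuration in this post; the function names, the record shape, and the parameter names are hypothetical.

```python
import time
from datetime import datetime


class DataNotAvailableException(Exception):
    """The feed answered, but returned nothing newer than what is stored."""


def newest_records(records, last_stored: datetime):
    """Keep only records newer than last_stored; raise if there are none."""
    fresh = [r for r in records if r["timestamp"] > last_stored]
    if not fresh:
        raise DataNotAvailableException("feed returned only old data")
    return fresh


def fetch_with_retries(fetch, last_stored, max_tries=3,
                       interval_seconds=1.0, backoff_rate=2.0, sleep=time.sleep):
    """Call fetch() until it yields fresh data or max_tries calls have failed."""
    wait = interval_seconds
    for attempt in range(1, max_tries + 1):
        try:
            return newest_records(fetch(), last_stored)
        except (DataNotAvailableException, ConnectionError):
            if attempt == max_tries:
                return None  # treat the feed as temporarily unavailable
            sleep(wait)
            wait *= backoff_rate  # lengthen the wait before the next try


# Toy feed: the first response is stale, the second contains a new record.
last_stored = datetime(2019, 8, 14, 23)
responses = iter([
    [{"timestamp": datetime(2019, 8, 14, 23), "price": 41.2}],
    [{"timestamp": datetime(2019, 8, 15, 0), "price": 39.8}],
])
waits = []  # record the waits instead of actually sleeping
result = fetch_with_retries(lambda: next(responses), last_stored, sleep=waits.append)
print(result, waits)
```

Raising an exception for stale data, rather than returning an empty result, is what lets the same retry mechanism cover both failure modes.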

Step Functions provides a retry option to automate the process. Depending on the nature of the exception, we can set the interval between two successive attempts (IntervalSeconds) and the maximum number of times to try the action (MaxAttempts). The parameter BackoffRate=1 arranges the attempts at a regular interval, whereas BackoffRate=2 means every interval is twice the length of the previous one.

"Retry": [
    {
      "ErrorEquals": [ "DataNotAvailableException" ],
      "IntervalSeconds": 3600,
      "BackoffRate": 1.0,
      "MaxAttempts": 8
    },
    {
      "ErrorEquals": [ "WebsiteDownException" ],
      "IntervalSeconds": 3600,
      "BackoffRate": 2.0,
      "MaxAttempts": 5
    }
]

Flexibility in data retrieval modes

We built the Step Function state machine to provide functionality for two distinct data retrieval modes:

  • A historical data pull to grab the entire existing history of the data
  • A refreshed data pull to grab the incremental daily data

The Step Function normally only has to extract the historical data one time in the beginning and store it in the S3 data lake. The stored data grows as the state machine appends new daily data. The option to refresh the historical data exists by setting the parameter full_history_download to True in the Lambda function that the CheckHistoricalData step calls. Doing so refreshes the entire dataset.

import json
from datetime import datetime
import boto3
import os

def lambda_handler(payload, context):
    if os.environ['full_history_download'] == 'True':
        print("manual historical data download required")
        return { 'startdate': payload['firstday'], 'pull_type': 'historical' }

    s3_bucket_name = payload['s3_bucket_name']
    historical_data_path = payload['historical_data_path']

    s3 = boto3.resource('s3')
    bucket = s3.Bucket(s3_bucket_name)
    objs = list(bucket.objects.filter(Prefix=historical_data_path))

    if (len(objs) > 0) and (objs[0].key == historical_data_path):
        print("historical data exists")
        return { 'startdate': payload['today'], 'pull_type': 'daily' }
    else:
        print("historical data does not exist")
        return { 'startdate': payload['firstday'], 'pull_type': 'historical' }

Building the forecasting model

We built the ML model in Amazon SageMaker. After putting together a collection of historical data on S3, we cleaned and prepared it using popular Python libraries such as pandas and sklearn.

A separate Amazon SageMaker ML algorithm called principal component analysis (PCA) was used to perform the feature engineering. To reduce the scope of our feature space while preserving information and creating desirable features, we applied PCA to our dataset before training our forecasting model.

We used a separate Amazon SageMaker ML algorithm called DeepAR as the forecasting model. DeepAR is a custom forecasting algorithm that specializes in processing time-series data. Amazon originally used the algorithm for product demand forecasting. Its ability to predict consumer demand based on temporal data and various external factors made the algorithm a strong choice to predict the fluctuations in energy price based on usage.

The following figure demonstrates some modelling results. We tested the model on available 2018 data after training it on historical data. A benefit of using the DeepAR model is that it returns a confidence interval from 10%-90%, providing a forecasted range. Zooming in to different time periods of the forecast, we can see that DeepAR excels at reproducing past periodic temporal patterns compared to the actual price records.

The figure above compares the values predicted by the DeepAR model with the actual values over the test set of January to September 2018.
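One simple way to sanity-check such a forecasted range is to measure how often the actual price falls between the 10% and 90% quantile forecasts; for a well-calibrated model this should happen close to 80% of the time. The numbers below are made up for illustration, and `interval_coverage` is a hypothetical helper, not part of the project code.

```python
def interval_coverage(actual, lower, upper):
    """Fraction of actual values that fall inside [lower, upper]."""
    inside = sum(lo <= a <= hi for a, lo, hi in zip(actual, lower, upper))
    return inside / len(actual)


# Made-up hourly prices with their 10% and 90% quantile forecasts.
actual = [41.0, 39.5, 38.0, 44.5, 47.0]
p10 = [38.0, 37.0, 36.5, 40.0, 42.0]
p90 = [43.0, 41.0, 40.0, 44.0, 48.0]
print(interval_coverage(actual, p10, p90))
```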

Amazon SageMaker also provides a straightforward way to perform hyperparameter optimization (HPO). After model training, we tuned the hyperparameters of the model to extract incrementally better model performance. Amazon SageMaker HPO uses Bayesian optimization to search the hyperparameter space and identify the ideal parameters for different models.

The Amazon SageMaker HPO API makes it simple to specify resource constraints such as the number of training jobs and computing power allocated to the process. We chose to test ranges for common parameters important to the DeepAR structure, such as the dropout rate, embedding dimension, and the number of layers in the neural network.

from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

objective_metric_name = 'test:RMSE'

hyperparameter_ranges = {'num_layers': IntegerParameter(1, 4),
                        'dropout_rate': ContinuousParameter(0.05, 0.2),
                        'embedding_dimension': IntegerParameter(5, 50)}
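# estimator_DeepAR, s3_data_path, and model_name are defined earlier (not shown).
# Limits on the number of training jobs are omitted in the source.
tuner = HyperparameterTuner(estimator_DeepAR,
                    objective_metric_name,
                    hyperparameter_ranges,
                    objective_type = "Minimize")

data_channels = {"train": "{}{}/train/".format(s3_data_path, model_name),
                "test": "{}{}/test/".format(s3_data_path, model_name)}

# Start the tuning job without blocking until it finishes.
tuner.fit(inputs=data_channels, wait=False)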

Packaging modeling steps into an Amazon SageMaker inference pipeline with sklearn containers

To implement and deploy an ML model effectively, we had to ensure that the data input format and processing from the inference matched up with the format and processing used for model training.

Our model pipeline uses sklearn functions for data processing and transformations, as well as a PCA feature engineering step before training using DeepAR. To preserve this process in an automated pipeline, we used prebuilt sklearn containers within Amazon SageMaker and the Amazon SageMaker inference pipelines model.

Within the Amazon SageMaker SDK, a set of sklearn classes handles end-to-end training and deployment of custom sklearn code. For example, the following code shows a sklearn Estimator executing an sklearn script in a managed environment. The managed sklearn environment is an Amazon Docker container that executes functions defined in the entry_point Python script. We supplied the preprocessing script as a .py file path. After fitting the Amazon SageMaker sklearn model on the training data, we can ensure that the same pre-fit model processes the data at inference time.

from sagemaker.sklearn.estimator import SKLearn

script_path = ''

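# Compute settings such as the training instance type are omitted in the source.
sklearn_preprocessing = SKLearn(
    entry_point=script_path,
    role=Sagemaker_role,
    sagemaker_session=sagemaker_session)

# Fit the preprocessing model on the training data.
sklearn_preprocessing.fit({'train': train_input})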

After this, we strung together the modeling sequence using Amazon SageMaker inference pipelines. The PipelineModel class within the Amazon SageMaker SDK creates an Amazon SageMaker model with a linear sequence of two to five containers to process requests for data inferences. With this, we can define and deploy any combination of trained Amazon SageMaker algorithms or custom algorithms packaged in Docker containers.

Like other Amazon SageMaker model endpoints, the process handles pipeline model invocations as a sequence of HTTP requests. The first container in the pipeline handles the initial request, and then the second container handles the intermediate response, and so on. The last container in the pipeline eventually returns the final response to the client.

A crucial consideration when constructing a PipelineModel is to note the data formatting for both the input and output for each container. For example, the DeepAR model requires a specific data structure for input data during training and a JSON Lines data format for inference. This format is different from supervised ML models because a forecasting model requires additional metadata, such as the starting date and the time interval of the time-series data.
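As an illustration of that structure, the sketch below serializes one time series into a JSON Lines record with the `start`, `target`, and `dynamic_feat` fields that DeepAR reads. The time interval is not part of the record itself; it is set on the model (for example, through the `time_freq` hyperparameter). The helper name and sample values are hypothetical.

```python
import json
from datetime import datetime


def to_deepar_line(start: datetime, target, dynamic_feat=None) -> str:
    """Serialize one time series as a single JSON Lines record."""
    record = {
        "start": start.strftime("%Y-%m-%d %H:%M:%S"),  # timestamp of the first point
        "target": list(target),                        # the series to forecast
    }
    if dynamic_feat is not None:
        # One inner list per external feature, aligned with the target's time axis.
        record["dynamic_feat"] = [list(feature) for feature in dynamic_feat]
    return json.dumps(record)


line = to_deepar_line(
    datetime(2018, 1, 1),
    target=[41.0, 39.5, 38.0],
    dynamic_feat=[[0.7, 0.8, 0.6]],
)
print(line)
```

Each series becomes one line of the file, so a dataset with several series is simply several such lines written one after another.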

from sagemaker.model import Model
from sagemaker.pipeline import PipelineModel

sklearn_inference_model = sklearn_preprocessing.create_model()

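PCA_inference_model = Model(model_data=PCA_model_loc,
    # remaining arguments (such as the container image) are omitted in the source
)

DeepAR_inference_model = Model(model_data=DeepAR_model_loc,
    # remaining arguments (such as the container image) are omitted in the source
)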

DeepAR_pipeline_model_name = "Deep_AR_pipeline_inference_model"

DeepAR_pipeline_model = PipelineModel(
    name=DeepAR_pipeline_model_name, role=Sagemaker_role, 
    models=[sklearn_inference_model, PCA_inference_model, DeepAR_inference_model])

Using a pipeline model has two benefits: it ensures that the preprocessing applied to the training data matches the preprocessing at inference time, and it lets us deploy a single endpoint that runs the entire workflow from data input to inference generation.

Model deployment using batch transform

We used Amazon SageMaker batch transform to handle inference generation. You can deploy a model in Amazon SageMaker in one of two ways:

  • Create a persistent HTTPS endpoint where the model provides real-time inference.
  • Run an Amazon SageMaker batch transform job that starts an endpoint, generates inferences on the stored dataset, outputs the inference predictions, and then shuts down the endpoint.

Due to the specifications of this energy forecasting project, the batch transform technique proved the better choice. Instead of real-time predictions, Kinect Energy wanted to schedule daily data collection and forecasting, and use this output for their trading analysis. With this solution, Amazon SageMaker takes care of starting up, managing, and shutting down the required resources.

DeepAR best practices suggest providing the entire historical time series for the target and dynamic features to the model at both training and inference. This is because the model uses data points further back in the series to generate lagged features. As a result, input datasets can grow very large.

To avoid the usual 5-MB request body limit, we created a batch transform using the inference pipeline model and set the limit for input data size using the max_payload argument. Then, we generated input data using the same function we used on the training data and added it to an S3 folder. We could then point the batch transform job to this location and generate an inference on that input data.
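Before submitting a job, it can help to check that the generated input stays within the configured payload limit. The following is a minimal sketch under the assumption that the whole input file is sent as one payload; the helper names are hypothetical and the check is only an approximation of what the service enforces.

```python
from typing import List


def payload_size_mb(lines: List[str]) -> float:
    """Size in MB of the newline-joined input when encoded as UTF-8."""
    body = "\n".join(lines).encode("utf-8")
    return len(body) / (1024 * 1024)


def fits_payload(lines: List[str], max_payload_mb: int) -> bool:
    """True if the input is no larger than the configured payload limit."""
    return payload_size_mb(lines) <= max_payload_mb


# Example: a single 1-MB line fits under a 100-MB limit.
sample = ["a" * (1024 * 1024)]
print(payload_size_mb(sample), fits_payload(sample, 100))
```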

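from sagemaker.transformer import Transformer

input_location = "s3://...input"
output_location = "s3://...output"

# The transformer setup is missing from the source. The settings below are
# assumed, mirroring the Boto3 request later in this post.
DeepAR_pipelinetransformer = Transformer(
    model_name=DeepAR_pipeline_model_name, instance_count=1,
    instance_type="ml.c4.xlarge", output_path=output_location, max_payload=100)

DeepAR_pipelinetransformer.transform(input_location, content_type="text/csv")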

Automating inference generation

Finally, we created a Lambda function that generates daily forecasts. To do this, we converted the code to the Boto3 API so Lambda can use it.

The Amazon SageMaker SDK library lets us access and invoke the trained ML models, but it is far larger than the 50-MB deployment package limit for a Lambda function. Instead, we used the Boto3 library, which is natively available in Lambda.

import boto3

# Create the JSON request body
batch_params = {
    "MaxConcurrentTransforms": 1,
    "MaxPayloadInMB": 100,
    "ModelName": model_name,
    "TransformInput": {
        "ContentType": "text/csv",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": input_data_path
            }
        }
    },
    "TransformJobName": job_name,
    "TransformOutput": {
        "S3OutputPath": output_data_path
    },
    "TransformResources": {
        "InstanceCount": 1,
        "InstanceType": "ml.c4.xlarge"
    }
}

# Create the SageMaker Boto3 client and send the payload
sagemaker = boto3.client('sagemaker')
ret = sagemaker.create_transform_job(**batch_params)
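The request above could be wrapped in a Lambda handler along the following lines. This is a sketch, not the project's actual function: the event keys, the `deepar-forecast-` job name prefix, and the optional `client` parameter (added so the handler can be exercised without AWS access) are all assumptions.

```python
from datetime import datetime, timezone


def build_batch_params(model_name, input_data_path, output_data_path, now):
    """Build the create_transform_job request with a timestamped job name."""
    # Hypothetical naming scheme: one job per run, unique by timestamp.
    job_name = "deepar-forecast-" + now.strftime("%Y-%m-%d-%H-%M-%S")
    return {
        "MaxConcurrentTransforms": 1,
        "MaxPayloadInMB": 100,
        "ModelName": model_name,
        "TransformInput": {
            "ContentType": "text/csv",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": input_data_path,
                }
            },
        },
        "TransformJobName": job_name,
        "TransformOutput": {"S3OutputPath": output_data_path},
        "TransformResources": {"InstanceCount": 1, "InstanceType": "ml.c4.xlarge"},
    }


def lambda_handler(event, context, client=None):
    """Start one batch transform job per invocation."""
    if client is None:
        # Imported lazily so the module can be loaded where Boto3 is absent.
        import boto3
        client = boto3.client("sagemaker")
    params = build_batch_params(
        event["model_name"],        # assumed event keys
        event["input_data_path"],
        event["output_data_path"],
        datetime.now(timezone.utc),
    )
    response = client.create_transform_job(**params)
    return {
        "TransformJobName": params["TransformJobName"],
        "TransformJobArn": response["TransformJobArn"],
    }
```

Deriving the job name from the invocation time keeps daily runs from colliding, since transform job names have to be unique within an account and Region.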


Along with the Kinect Energy team, we were able to create an automated data ingestion and inference generation pipeline. We used AWS Lambda and AWS Step Functions to automate and schedule the entire process.

In the Amazon SageMaker platform, we built, trained, and tested a DeepAR forecasting model to predict electricity spot prices. Amazon SageMaker inference pipelines combined preprocessing, feature engineering, and model output steps. A single Amazon SageMaker batch transform job could put the model into production and generate an inference. These inferences now help Kinect Energy make more accurate predictions of spot prices and improve their electricity price trading capabilities.

The Amazon ML Solutions Lab engagement model provided the opportunity to deliver a production-ready ML model. It also gave us the chance to train the Kinect Energy team on data science practices so that they can maintain, iterate, and improve upon their ML efforts. With the resources provided, they can expand to other possible future use cases.

Get started today! You can learn more about Amazon SageMaker and kick off your own machine learning solution by visiting the Amazon SageMaker console.

About the Authors

Han Man is a Data Scientist with AWS Professional Services. He has a PhD in engineering from Northwestern University and has several years of experience as a management consultant advising clients across many industries. Today he is passionately working with customers to develop and implement machine learning, deep learning, & AI solutions on AWS. He enjoys playing basketball in his spare time and taking his bulldog, Truffle, to the beach.



Arkajyoti Misra is a Data Scientist working in AWS Professional Services. He loves to dig into Machine Learning algorithms and enjoys reading about new frontiers in Deep Learning.




Matt McKenna is a Data Scientist focused on machine learning in Amazon Alexa, and is passionate about applying statistics and machine learning methods to solve real world problems. In his spare time, Matt enjoys playing guitar, running, craft beer, and rooting for Boston sports teams.




Many thanks to the Kinect Energy team who worked on the project. Special thanks to the following leaders from Kinect Energy who encouraged and reviewed the blog post.

  • Tulasi Beesabathuni: Tulasi is the squad lead for both Artificial Intelligence / Machine Learning and Box / OCR (content management) at World Fuel Services. Tulasi oversaw the initiation, development and deployment of the Power Prediction Model and employed his technical and leadership skillsets to complete the team’s first use case using new technology.
  • Andrew Stypa: Andrew is the lead business analyst for the Artificial Intelligence / Machine Learning squad at World Fuel Services. Andrew used his prior experience in the business to initiate the use case and ensure that the development team met the trading team’s specifications.




