Build a serverless anomaly detection tool using Java and the Amazon SageMaker Random Cut Forest algorithm

One of the problems that business owners commonly face is detecting when something unusual is happening in their business. Detecting unusual user activity or changes in daily traffic patterns are just some of the challenges. With an ever-increasing amount of data and metrics, detecting anomalies with the help of machine learning is a great way to proactively identify problems.

In this blog post we’ll explain how to build a serverless anomaly detection tool using Amazon SageMaker with Java. Amazon SageMaker makes it easy to train and host machine learning models, and the available built-in algorithms solve common business problems. To solve this particular business problem, we’ll use the Random Cut Forest (RCF) anomaly detection algorithm. Amazon Web Services offers a broad set of global cloud-based products to help organizations move faster, lower IT costs, and scale. We’ll demonstrate how these can be used to build a serverless anomaly detection tool. While Python is one of the most popular programming languages for tackling machine learning problems, many users build micro-services and serverless applications using Java and other JVM-based languages. By the end of this blog post you’ll be able to enable machine learning in your Java applications using Amazon SageMaker.

Throughout the blog post we will use Java code snippets to focus on particular aspects of the tool. You can find the code used to build and deploy this solution into your own AWS account here.

Problem overview

In our example, Alice is a Java developer who owns a video streaming platform that runs on top of multiple AWS services and serves thousands of customers. Alice sets up dashboards to track metrics that show how well her platform is performing. One of the most important metrics she looks at is the total number of active users of the platform, as shown in the following diagram.

This metric shows a general daily pattern of usage, but it also changes seasonally. A low number of active users, a high number of active users, and breaks in the daily pattern are all considered anomalies. Alice is mostly interested in understanding the root cause of those anomalous data points. Currently, she doesn’t rely on automated tools for finding anomalies in the data. Instead, she goes through a manual process and spends a lot of time identifying spikes, dips, and breaks in periodicity. Fixed thresholds or threshold windows don’t work for her because of changing patterns and seasonality. She needs a better solution!

What can we do to make Alice’s life easier?

Solution architecture

To help Alice solve her anomaly detection problem we first need to identify all the building blocks for an anomaly detection tool:

  • Amazon SageMaker – We’ll use Amazon SageMaker to build a model from the historical metric data, and then use that model to find anomalous data points in current data (from the previous week). The Amazon SageMaker Random Cut Forest algorithm learns the trends in your data and, after training, can identify anomalies. To use the trained model to find anomalies, we can choose between two options: (1) We can host the model on an endpoint and run inference against that endpoint using HTTP requests. (2) We can use a batch transform job to bulk transform new metric data. We need results only once a week, so the batch transform job is the better option. Hosting a model and then hitting its endpoint once a week would waste resources.
  • Amazon CloudWatch Events – We’ll use Amazon CloudWatch Events to schedule a recurring weekly event that triggers our weekly transform job. The patterns in the underlying data will change over time, so it’s important to occasionally refresh the model we’re using. We’ll use another CloudWatch Events rule to run a training job once per month.
  • Amazon CloudWatch Metrics – Alice stores all of her metrics in CloudWatch, which we’ll use as our data source. We’ll also publish our anomaly score metric to CloudWatch from the batch transform workflow so Alice can easily see when anomalies occur.
  • Amazon S3 – Amazon SageMaker uses Amazon S3 as the input data source for training and batch transform jobs. After we retrieve and preprocess the CloudWatch data, we’ll store it in S3 for our Amazon SageMaker jobs.
  • AWS Step Functions – Getting data from CloudWatch, uploading it to S3, starting the training and batch transform jobs, and publishing the results back to CloudWatch are all steps that our anomaly detection tool needs in order to work. Instead of writing a new service to orchestrate this workflow, we’ll automate it with AWS Step Functions. We’ll use two state machines, one for training and one for batch inference, which ensure that all of the described steps are executed in the correct order and that any failures are handled gracefully.
  • AWS Lambda – All the previously described actions are executed as AWS Lambda functions, which are triggered by the AWS Step Functions state machines. All of our Lambda functions use Java 8 and the AWS SDK. Note: Some of the Lambda functions could potentially be replaced following the recent release of Amazon SageMaker support in the Amazon States Language. However, in this blog post we want to focus on the Java development perspective to provide a unified view of the subject.
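
To make the orchestration idea concrete, here is a minimal plain-Java sketch of the "check status, wait, check again" loop that each state machine performs around a SageMaker job. It is an illustration only: in the tool itself this loop is expressed as Step Functions states, not as Java code. The `JobPoller` class and `waitForJob` method are hypothetical names, and the status supplier stands in for a Lambda function that describes the job. "InProgress" is the status used by the handlers in this post; any other value is treated here as a terminal status.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Supplier;

/** Hypothetical sketch of the wait/check loop a state machine runs around a job. */
class JobPoller {

    /**
     * Asks statusSupplier for the job status until it is no longer "InProgress"
     * or until maxChecks checks have been made. Returns the last status seen.
     */
    static String waitForJob(Supplier<String> statusSupplier, int maxChecks, long waitMillis) {
        String status = "InProgress";
        for (int i = 0; i < maxChecks; i++) {
            status = statusSupplier.get();
            if (!"InProgress".equals(status)) {
                return status; // terminal status: stop polling
            }
            try {
                Thread.sleep(waitMillis); // plays the role of a wait step
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return status;
            }
        }
        return status;
    }

    public static void main(String[] args) {
        // Fake status source: two in-progress checks, then a terminal status.
        Iterator<String> fake = Arrays.asList("InProgress", "InProgress", "Completed").iterator();
        System.out.println("Final status: " + waitForJob(fake::next, 10, 10L));
    }
}
```

Keeping the loop in the state machine rather than inside a single Lambda function means no function has to stay alive for the full duration of a training or transform job.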

The following diagram illustrates our architecture:

Training job state machine

The following diagram illustrates the training state machine:

  1. The first Lambda function (“Store CloudWatch Metric Data in S3”) gets one month’s worth of metric data from CloudWatch at a resolution of 5 minutes. The Lambda function creates a CSV file containing the timestamp and value of each 5-minute data point, and uploads the file to the S3 bucket.
  2. The second Lambda function (“Start SageMaker Training Job”) uses the S3 dataset created in the previous step to start an Amazon SageMaker training job. The job is created asynchronously, so the state machine continues without waiting for training to finish.
    public class StartTrainingJobHandler {
    
        private static final String TRAINING_JOB_STATUS = "InProgress";
    
        private final AmazonSageMaker sagemaker;
    
        public StartTrainingJobHandler() {
            sagemaker = AmazonSageMakerClientBuilder.standard().build();
        }
    
        public StartTrainingJobOutput handleRequest(StartTrainingJobInput input, Context context) {
            StartTrainingJobConfig config = new StartTrainingJobConfig(
                input.getTimestamp(), input.getBucket(), input.getValuesKey());
            
            CreateTrainingJobRequest request = config.getTrainingJobRequest();
            sagemaker.createTrainingJob(request);
            
            return new StartTrainingJobOutput(
                input.getTimestamp(), request.getTrainingJobName(),
                TRAINING_JOB_STATUS, config.getModelOutputPath());
        }
    }

  3. Wait until the Amazon SageMaker training job finishes. If the job failed, we report the failure and end the execution. If the job completed successfully, we move to the next state.
    public class CheckTrainingJobStatusHandler {
    
        private final AmazonSageMaker sagemaker;
    
        public CheckTrainingJobStatusHandler() {
            sagemaker = AmazonSageMakerClientBuilder.standard().build();
        }
    
        public StartTrainingJobOutput handleRequest(StartTrainingJobOutput input, Context context) {
            DescribeTrainingJobRequest request = new DescribeTrainingJobRequest()
                .withTrainingJobName(input.getTrainingJobName());
    
            DescribeTrainingJobResult result = sagemaker.describeTrainingJob(request);
    
            input.setTrainingJobStatus(result.getTrainingJobStatus());
            return input;
        }
    }

  4. The final Lambda function (“Create SageMaker Model”) creates an Amazon SageMaker model from the model output produced by the training job.
    public class CreateModelHandler {
    
        private final AmazonSageMaker sagemaker;
    
        public CreateModelHandler() {
            sagemaker = AmazonSageMakerClientBuilder.standard().build();
        }
    
        public CreateModelOutput handleRequest(CreateModelInput input, Context context) {
            ContainerDefinition containerDefinition = new ContainerDefinition()
                .withImage(RandomCutForestConfig.getAlgorithmImage())
                .withModelDataUrl(input.getModelOutputPath());
    
            CreateModelRequest request = new CreateModelRequest()
                .withExecutionRoleArn(Env.getSagemakerRoleArn())
                .withModelName(RandomCutForestConfig.ALGORITHM_NAME + "-" + input.getTimestamp())
                .withPrimaryContainer(containerDefinition);
    
            sagemaker.createModel(request);
    
            return new CreateModelOutput(request.getModelName());
        }
    }
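
Step 1 above is not shown in code, so here is a minimal sketch of the CSV-building part under stated assumptions. `MetricCsvWriter` and `toCsv` are hypothetical names, and the one-line-per-data-point "timestamp,value" layout is an assumption based on the description in step 1; the layout used by the actual solution may differ. The sketch sorts rows by timestamp so that they are in chronological order regardless of the order in which the data points were retrieved.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Hypothetical helper: turns (epoch millis -> value) data points into CSV text. */
class MetricCsvWriter {

    /** Returns one "timestamp,value" line per data point, ordered by timestamp. */
    static String toCsv(Map<Long, Double> dataPoints) {
        // A TreeMap iterates its keys in ascending order, which gives chronological rows.
        return new TreeMap<>(dataPoints).entrySet().stream()
                .map(e -> e.getKey() + "," + e.getValue())
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        Map<Long, Double> points = new TreeMap<>();
        // Two made-up data points, five minutes apart.
        points.put(1546300800000L, 1520.0);
        points.put(1546300500000L, 1498.0);
        System.out.println(toCsv(points));
    }
}
```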

Transform job state machine

The following diagram illustrates the transform job state machine:

The following steps are executed as part of the transform job state machine:

  1. We reuse the same Lambda function as in the training step (“Store CloudWatch Metric Data in S3”), but we configure it to get only one week of data from CloudWatch.
  2. The second Lambda function (“Start SageMaker Transform Job”) finds the models we have trained (created by training state machine), picks the latest one, and asynchronously starts the Amazon SageMaker batch transform job.
    public class StartTransformJobHandler {
    
        private static final String TRANSFORM_JOB_STATUS = "InProgress";
    
        private static final int LIST_MODELS_MAX_RESULTS = 1;
        private static final int LATEST_MODEL_INDEX = 0;
    
        private final AmazonSageMaker sagemaker;
    
        public StartTransformJobHandler() {
            sagemaker = AmazonSageMakerClientBuilder.standard().build();
        }
    
        public StartTransformJobOutput handleRequest(StartTransformJobInput input, Context context) {
            String modelName = getLatestModelName();
            return createSageMakerTransformJob(input, modelName);
        }
    
    
        private String getLatestModelName() {
            ListModelsRequest request = new ListModelsRequest()
                    .withNameContains(RandomCutForestConfig.ALGORITHM_NAME)
                    .withMaxResults(LIST_MODELS_MAX_RESULTS)
                    .withSortBy(ModelSortKey.CreationTime)
                    .withSortOrder(OrderKey.Descending);
    
            ListModelsResult result = sagemaker.listModels(request);
            ModelSummary modelSummary = result.getModels().get(LATEST_MODEL_INDEX);
    
            return modelSummary.getModelName();
        }
    
        private StartTransformJobOutput createSageMakerTransformJob(StartTransformJobInput input, String modelName) {
            StartTransformJobConfig config = new StartTransformJobConfig(
                input.getTimestamp(), input.getBucket(), input.getValuesKey(), input.getValuesFile(), modelName);
            CreateTransformJobRequest request = config.getTransformJobRequest();
            
            sagemaker.createTransformJob(request);
            return new StartTransformJobOutput(input.getBucket(), input.getTimestamp(),
                input.getTimestampsKey(), config.getAnomalyScoresKey(),
                request.getTransformJobName(), TRANSFORM_JOB_STATUS);
        }
    }

  3. Wait until the batch transform job finishes successfully.
    public class CheckTransformJobStatusHandler {
    
        private final AmazonSageMaker sagemaker;
    
        public CheckTransformJobStatusHandler() {
            sagemaker = AmazonSageMakerClientBuilder.standard().build();
        }
    
        public StartTransformJobOutput handleRequest(StartTransformJobOutput input, Context context) {
            DescribeTransformJobRequest request = new DescribeTransformJobRequest()
                .withTransformJobName(input.getTransformJobName());
    
            DescribeTransformJobResult result = sagemaker.describeTransformJob(request);
    
            input.setTransformJobStatus(result.getTransformJobStatus());
            return input;
        }
    }

  4. The final Lambda function (“Publish Anomaly Score Metric to CloudWatch”) gets the output scores from the batch transform job. It uses a simple, standard technique for classifying anomalies: every anomaly score more than three standard deviations above the mean score is considered anomalous. Finally, all the data points that have been labeled as anomalous are published to CloudWatch with a value of 1, and all other data points are published with a value of 0. The timestamps for the published metric come from the input dataset.
    public class AnomalousDataUploadHandler {
    
        private final AmazonCloudWatch cloudWatch;
        private final S3FileManager s3FileManager;
    
        public AnomalousDataUploadHandler() {
            cloudWatch = AmazonCloudWatchClientBuilder.standard().build();
            s3FileManager = new S3FileManager();
        }
    
        public AnomalousDataUploadOutput handleRequest(AnomalousDataUploadInput input, Context context) throws IOException {
            List<Double> anomalyScores = getAnomalyScores(input.getBucket(), input.getAnomalyScoresKey());
    
            List<Integer> anomalyIndices = findAnomalousIndices(anomalyScores);
    
            List<Long> timestamps = getTimestamps(input.getBucket(), input.getTimestampsKey());
    
            return uploadAnomalousDataToCloudWatch(timestamps, anomalyIndices, anomalyScores.size());
        }
    
        private List<Integer> findAnomalousIndices(List<Double> anomalyScores) {
            double mean = getMean(anomalyScores);
            double std = getStd(anomalyScores, mean);

            // Scores more than three standard deviations above the mean are anomalous.
            double scoreCutoff = mean + 3 * std;

            return getAnomalousIndices(anomalyScores, scoreCutoff);
        }
    
        private List<Integer> getAnomalousIndices(List<Double> anomalyScores, double scoreCutoff) {
            return IntStream.range(0, anomalyScores.size())
                    .filter(i -> anomalyScores.get(i) > scoreCutoff)
                    .boxed()
                    .collect(Collectors.toList());
        }
    
    }
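
The handler above calls `getMean` and `getStd`, which are not shown. A minimal sketch of what such helpers could look like follows. The `ScoreStatistics` class is a hypothetical container, and the use of the population standard deviation (dividing by n rather than n - 1) is an assumption; the original helpers may differ.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-ins for the getMean and getStd helpers that the handler calls. */
class ScoreStatistics {

    /** Arithmetic mean; returns 0.0 for an empty list. */
    static double getMean(List<Double> values) {
        return values.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    /** Population standard deviation around a precomputed mean. */
    static double getStd(List<Double> values, double mean) {
        double variance = values.stream()
                .mapToDouble(v -> (v - mean) * (v - mean))
                .average()
                .orElse(0.0);
        return Math.sqrt(variance);
    }

    public static void main(String[] args) {
        // Made-up anomaly scores, for illustration only.
        List<Double> scores = Arrays.asList(1.0, 1.0, 1.0, 1.0, 6.0);
        double mean = getMean(scores);
        double std = getStd(scores, mean);
        System.out.println("mean = " + mean + ", std = " + std + ", cutoff = " + (mean + 3 * std));
    }
}
```

Because the cutoff is derived from the scores of the batch being evaluated, it adapts to each week of data instead of relying on a fixed threshold.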

After both state machines have run, a new metric is available in the Amazon CloudWatch console. We can graph this new metric over the original metric to understand when anomalies happen. Now Alice can use the new metric to zoom in on specific points of interest in her original metric, and navigate to the Amazon CloudWatch Logs console for those data points.
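
To make the shape of that new metric concrete, the sketch below pairs every input timestamp with a value of 1 (anomalous) or 0 (not anomalous), as described in step 4. `AnomalyFlags` and `toMetricValues` are hypothetical names; the actual publishing code is not shown in this post, and the call that sends the values to CloudWatch is deliberately left out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: pairs every timestamp with 1.0 (anomalous) or 0.0 (normal). */
class AnomalyFlags {

    /** anomalousIndices holds positions in the timestamps list that were flagged. */
    static Map<Long, Double> toMetricValues(List<Long> timestamps, List<Integer> anomalousIndices) {
        Set<Integer> anomalous = new HashSet<>(anomalousIndices);
        // LinkedHashMap keeps the timestamps in their input order.
        Map<Long, Double> metric = new LinkedHashMap<>();
        for (int i = 0; i < timestamps.size(); i++) {
            metric.put(timestamps.get(i), anomalous.contains(i) ? 1.0 : 0.0);
        }
        return metric;
    }

    public static void main(String[] args) {
        // Made-up timestamps (epoch millis); the data point at index 1 was flagged.
        List<Long> timestamps = Arrays.asList(1546300500000L, 1546300800000L, 1546301100000L);
        System.out.println(toMetricValues(timestamps, Arrays.asList(1)));
    }
}
```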

Since Alice is storing anomalies in CloudWatch, she can use all of the rich alerting and monitoring functionality that is available there, so she can be notified automatically when something strange happens. Similarly, because she is using Amazon SageMaker, she can take the model and use it for online inference in the future if she wants to (for example, she can evaluate anomalies in near real time by making HTTP calls to a hosted endpoint).

Conclusion

In this blog post we showed you how to build an automated anomaly detection tool using Amazon SageMaker. We explained which services help us remove the undifferentiated heavy lifting of building the tool and how they all fit together to form a meaningful workflow. We also showcased one of the latest Amazon SageMaker releases, batch transform jobs, which is ideal for use cases that don’t require hosting a model for near real-time inference. All the Lambda functions were written using Java 8. It is our hope that this blog post, in combination with the code examples, will help Java developers integrate Amazon SageMaker into their services and applications.


About the authors

Luka Krajcar is a Software Development Engineer on the AWS AI Labs team. He received his M.S. in Computer Science at the Faculty of Electrical Engineering and Computing at the University of Zagreb. Outside of work, Luka enjoys reading fiction, running, and video gaming.

Julio Delgado Mangas is a Software Development Engineer on the AWS AI Labs team. He has contributed to AWS services like Amazon CloudWatch and the Amazon QuickSight SPICE engine. Before joining Amazon, he was a research engineer on the Human Brain Project.

Laurence Rouesnel is the Algorithms & Platforms Group Manager in Amazon AI Labs. He leads a team of engineers and scientists working on deep learning and machine learning research and products. In his spare time, he is an avid traveler, and loves the outdoors whether it’s hiking, skiing, or windsurfing.

Chris Swierczewski is an Applied Scientist on the AWS AI Labs team, where he has contributed to the Amazon SageMaker Latent Dirichlet Allocation and the Amazon SageMaker Random Cut Forest algorithms. Before Amazon, Chris was a Ph.D. student in Applied Mathematics at the University of Washington. He likes to go hiking, backpacking, and camping with his wife and their dog, River.

Madhav Jha is an Applied Scientist on the AWS AI Labs team where he uses his background in sublinear algorithms to develop scalable machine learning algorithms. He is a theoretical computer scientist who enjoys coding. He is always up for coffee conversations on startups and technology.

Launch EI accelerators in minutes with the Amazon Elastic Inference setup tool for EC2

The Amazon Elastic Inference (EI) setup tool is a Python script that enables you to quickly get started with EI.

Elastic Inference allows you to attach low-cost GPU-powered acceleration to Amazon EC2 and Amazon SageMaker instances to reduce the cost of running deep learning inference by up to 75 percent. If you are using EI for the first time, there are a number of dependencies that must be set up: AWS PrivateLink VPC endpoints, IAM policies, and security group rules. The EI setup script speeds this up by creating the necessary resources so that you can launch EI accelerators in minutes. In this blog post I describe how to use the script, what it does, and what to expect when you run it.

At a high level, the script does the following:

  1. Creates an IAM role for the instance with an IAM policy that lets you connect to the Amazon Elastic Inference service.
  2. Creates a security group with the necessary ingress and egress rules to allow the instance to communicate with the accelerator.
  3. Creates an AWS PrivateLink VPC endpoint within your desired subnet.
  4. Launches the desired EC2 instance with an EI accelerator using the latest AWS Deep Learning AMI (DLAMI) for the chosen operating system.

Prerequisites

To set up EI, run the script linked below. It depends on the following entities:

  1. Python 3 installed on your local machine where you expect to run the tool.
  2. The AWS SDK for Python (Boto3).
  3. An Amazon VPC in the Region where you are launching the instance (this can be your default VPC).
  4. A subnet where you’d like to launch the instance.
  5. An EC2 key pair.
  6. AWS credentials.

With these in place, download the amazonei_setup.py script from GitHub to your local machine and run it from your terminal using the following command:

$ python amazonei_setup.py

What the tool creates on your behalf

The script creates the following AWS resources:

  • Instance role with an Amazon EI policy. This role is created the first time the script is run. In all subsequent runs, the script reuses this IAM role. If the role is deleted, the script recreates it the next time it is run. The IAM role has the following properties:
    • Role name: Amazon-Elastic-Inference-Connect-Role
    • Policy name: Amazon-Elastic-Inference-Connect-Policy
    • Instance profile name: Amazon-Elastic-Inference-Instance-Profile

    The policy description is as follows:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "elastic-inference:Connect",
            "iam:List*",
            "iam:Get*",
            "ec2:Describe*",
            "ec2:Get*"
          ],
          "Resource": "*"
        }
      ]
    }
    

  • Security Group (SG). The security group associated with the EC2 instance must allow inbound traffic to port 443, as required by the Amazon EI service. You also need an inbound rule that allows traffic to port 22 for SSH. If a security group matching these rules is found, it is used. If no matching SG is found, a new SG with the required rules is created. The outbound rules are set to allow traffic to all ports. The new SG name is amazon_ei_security_group, with the description Security Group for accessing Amazon EI service.
  • Interface VPC endpoint (AWS PrivateLink). The script scans for an existing endpoint associated with the Amazon EI service for the Region and VPC that you chose. For example, for the us-west-2 Region, the script looks for the endpoint with the name amazonaws.us-west-2.elastic-inference.runtime in the given VPC ID. If the endpoint is not found, the script creates one. If an endpoint is discovered but is missing the SG or the chosen subnet, the script modifies the endpoint to add them. Also, the script sets the following attributes of the VPC endpoint to true, as required by Amazon EI:
    • EnableDnsSupport
    • EnableDnsHostNames
  • The script discovers the latest AWS DLAMI for the operating system chosen by the user.
  • If all steps succeed, the script launches an instance and reports the instance ID.
  • The script tries to obtain the public DNS name after the instance is launched and is in the running state.
  • Even if the instance is running, it may not be ready to accept SSH connections, so users may want to wait until the instance is fully initialized. The EC2 console or the AWS CLI can be used to query the initialization state, using the instance ID that the script reports for the newly launched instance.
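
The "reuse it if it matches, otherwise create it" decision described above for the security group can be sketched as follows. The actual script is written in Python with Boto3; this Java sketch only models the decision. `SecurityGroupMatcher`, `findMatching`, and the group IDs are hypothetical, and each security group is reduced to the set of inbound ports it opens, which is a simplification of real security group rules.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Hypothetical model of the "reuse a matching security group, otherwise create one" decision. */
class SecurityGroupMatcher {

    // Inbound ports named in the post: 443 for the Amazon EI service and 22 for SSH.
    static final Set<Integer> REQUIRED_INBOUND_PORTS = new HashSet<>(Arrays.asList(443, 22));

    /** Returns the ID of the first group whose open inbound ports cover every required port. */
    static Optional<String> findMatching(Map<String, Set<Integer>> inboundPortsByGroupId) {
        return inboundPortsByGroupId.entrySet().stream()
                .filter(e -> e.getValue().containsAll(REQUIRED_INBOUND_PORTS))
                .map(Map.Entry::getKey)
                .findFirst();
    }

    public static void main(String[] args) {
        // Made-up group IDs and port sets, for illustration only.
        Map<String, Set<Integer>> groups = new LinkedHashMap<>();
        groups.put("sg-ssh-only", new HashSet<>(Arrays.asList(22)));
        groups.put("sg-ssh-and-https", new HashSet<>(Arrays.asList(22, 443)));

        String chosen = findMatching(groups).orElse("(none found: create a new security group)");
        System.out.println("Security group to use: " + chosen);
    }
}
```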

What to expect when you run the tool

The example here illustrates what to expect when you run the script.

  • Launch the script. The script can be launched from the command prompt as:
    $ python amazonei_setup.py --region us-west-2 --instance-type m5.xlarge

    AWS credentials are required to create or modify AWS resources. The script uses Boto3, the AWS SDK for Python, so it needs user credentials to configure and manage AWS resources. If the script is run without appropriate credentials, it reports the error below:
    $ python amazonei_setup.py --region us-west-2 --instance-type m5.xlarge
    Error setting up Amazon EI configuration - 
     Failed to retrieve VPC endpoints for us-west-2 : An error occurred (RequestExpired)
     when calling the DescribeVpcEndpointServices operation: Request has expired.

    The solution is to configure AWS credentials using one of the methods described in the Boto3 documentation. After the credentials are in place, the script is able to proceed.

  • Choose Operating System. The script prints an informative message and prompts you to choose the OS. It also notes that entering ‘q’ causes the script to exit. Choose ‘1’ for the next step.
    $ python amazonei_setup.py --region us-west-2 --instance-type m5.xlarge
    
    This script launches Amazon EC2 instances with Amazon Elastic Inference accelerators.
    Performs the following functions:
     1. It uses the Deep Learning AMIs preconfigured with EI-enabled deep learning 
     frameworks to launch the instances.
     2. It creates security groups for the instance and VPC endpoint.
     3. It creates the VPC endpoint needed for your instances to communicate with EI 
     accelerators.
     4. It creates an IAM Instance Role and Policy with the permissions needed to 
     connect to accelerators.
    
     To begin, please choose the Operating System for your instance by typing its index :
    
     0: Amazon Linux
     1: Ubuntu
    
    Type 'q' to quit.
    amazonei-wizard>
    

  • Choose Accelerator size. The script discovered the latest DLAMI for Ubuntu and one key pair. If it discovers multiple key pairs, it lists them and asks the user to choose the desired key pair by typing its index. In general, if there are multiple eligible inputs, the script shows them as an indexed list and lets the user choose an item by typing its index. Here, the script lists the supported accelerator sizes and lets the user choose.
    amazonei-wizard>1 
     Using Image ID: ami-0027dfad6168539c7,Image Name: Deep Learning AMI (Ubuntu) Version 21.2
     Using instance type: m5.xlarge
     Using Key Pair: Efti-Default-KeyPair
    
    Please type index of the accelerator type to use:
    
     0: eia1.medium (1 GB of accelerator memory)
     1: eia1.large (2 GB of accelerator memory)
     2: eia1.xlarge (4 GB of accelerator memory)
    
    Type 'q' to quit.
    amazonei-wizard>

  • Choose VPC. As illustrated, the user chose option ‘1’ for the accelerator size. The script confirmed the chosen accelerator size and proceeded to discover the IAM role. It then presents the list of available VPCs.
    amazonei-wizard> 1 
     Using Amazon EI accelerator type: eia1.large
    
     Found an IAM role configured for connecting to Amazon EI service. Name - Amazon-Elastic-Inference-Connect-Role, ARN - arn:aws:iam::326228132093:role/Amazon-Elastic-Inference-Connect-Role
    
    Please select the VPC to use by typing the desired VPC index. Type 0 for default VPC.
    
     0: VPC Id 'vpc-d7d218af'
     1: VPC Id 'vpc-0c2496c51925ff1be'
    
    Type 'q' to quit.
    amazonei-wizard>

  • Launch an instance. After the user chose the VPC, the script found a security group with matching inbound rules and one subnet, both associated with the chosen VPC. It also found a VPC endpoint for the Amazon EI service. Because the script now has all the details it needs to launch an EC2 instance, it summarizes the parameters it will use.
    amazonei-wizard>1 
     Using VPC ID: vpc-0c2496c51925ff1be
     Using Security Group: sg-00aec97685affb306
     Using Subnet: subnet-04881d24764d6e73f
    
     Discovered VPC endpoint for Amazon EI service, ID: vpce-0d2942a8147305240
    
     The script will now launch new instance with following configuration. Type 'y' to continue. 
    
     Accelerator Type: eia1.large
     Region: us-west-2
     Image-ID: ami-0027dfad6168539c7 - (Deep Learning AMI (Ubuntu) Version 21.2)
     Instance Type: m5.xlarge
     Key Pair: Efti-Default-KeyPair
     Security Group ID: sg-00aec97685affb306
     Subnet ID: subnet-04881d24764d6e73f
     Instance Profile: Amazon-Elastic-Inference-Instance-Profile
    
    Type 'y' to continue. Type 'q' to quit.
    amazonei-wizard>

  • Launch and wait for the instance to reach the running state. Because the user typed ‘y’, the script proceeded to launch the instance. The script also printed a sample SSH command, which it infers from the OS type, the chosen key pair, and the public DNS name. The actual command depends on the location of the pem file. The script also warns that the instance may not be immediately accessible via SSH, even though it is in the running state. The instance needs to be fully initialized, and specifically the SSH daemon needs to be started, before it can accept SSH connections. If the pem file is correctly located, the user should be able to access the instance and proceed with using Amazon Elastic Inference.
    amazonei-wizard>y
    
     Launching Instance ..
    
     Launched instance successfully. The instance ID is 'i-0969820364c038cca'.
    
     Waiting for instance to reach running state ...
    
     You can use the following sample SSH command to connect to your instance: ssh -i "Efti-Default-KeyPair.pem" ubuntu@ec2-52-13-194-188.us-west-2.compute.amazonaws.com
    
    
     Note: Please wait until instance is fully initialized and ready to accept SSH connections. You may check instance status at EC2 console.
     Also please locate your private key file 'Efti-Default-KeyPair.pem'.
    
    amazon-elastic-inference-tools $ 
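
As a rough illustration of how a sample SSH command like the one above can be inferred from the OS type, key pair name, and public DNS name, consider the sketch below. The actual script is written in Python; `SshCommandBuilder` and its methods are hypothetical. The "ubuntu" login user matches the sample output above, and "ec2-user" is the default login user on Amazon Linux. The sketch assumes the private key file sits in the current directory and is named after the key pair.

```java
/** Hypothetical sketch of inferring a sample SSH command from the wizard's choices. */
class SshCommandBuilder {

    /** Maps the OS names offered by the wizard to their default login users. */
    static String loginUser(String operatingSystem) {
        switch (operatingSystem) {
            case "Ubuntu":
                return "ubuntu";
            case "Amazon Linux":
                return "ec2-user";
            default:
                throw new IllegalArgumentException("Unsupported OS: " + operatingSystem);
        }
    }

    /** Assumes the private key file is named "<keyPairName>.pem" in the current directory. */
    static String build(String operatingSystem, String keyPairName, String publicDnsName) {
        return String.format("ssh -i \"%s.pem\" %s@%s",
                keyPairName, loginUser(operatingSystem), publicDnsName);
    }

    public static void main(String[] args) {
        System.out.println(build("Ubuntu", "Efti-Default-KeyPair",
                "ec2-52-13-194-188.us-west-2.compute.amazonaws.com"));
    }
}
```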

Summary

The setup script simplifies launching an EC2 instance with EI. It ensures that all settings are correctly configured and that the instance is launched with the permissions required to use EI. If you have any feedback about this blog post, feel free to use the comment section on this page.


About the Authors

Eftiquar Shaikh is a Senior Software Engineer with AWS AI. He works on building AWS services in the AI space. When he is not programming, he likes to read, run, and travel.

Satadal Bhattacharjee is a Principal Product Manager with AWS AI. He leads the Machine Learning Engine PM team working on projects such as SageMaker Neo, AWS Deep Learning AMIs, and AWS Elastic Inference. For fun outside work, Satadal loves to hike, coach robotics teams, and spend time with his family and friends.

How AI Is Changing Medicine

Doctors and nurses stand at the front lines of our healthcare system, providing immediate care. But there’s an equally important universe of researchers working in parallel who are advancing the tools and knowledge clinicians can draw on to treat their patients.

These researchers are developing new drugs to treat as-yet incurable diseases, simulating biological organisms and structures to better understand how they work, and diving into genomic data to find genetic markers related to specific health conditions.

And, for an ever-growing number of applications, they’re doing so using AI and accelerated computing.

How AI Is Changing Drug Discovery

There are almost as many potential drug-like molecules as there are atoms in the observable universe. Pharmaceutical companies and researchers pour years of effort and billions of dollars into exploring this vast library of molecules to discover new treatments for diseases.

Scientists use their expertise to guess which drug molecules will be able to stop a particular ailment in its tracks. They traditionally focus on one disease at a time, performing research over many years. With AI, they can instead virtually model millions of molecules and screen hundreds of diseases at a time.

Deep learning can pick up on the biochemical laws that govern how a drug molecule will act in the body, helping researchers understand the potential side effects of a drug molecule — or even come up with new, synthetic molecules that could treat a disease.

University of Pittsburgh researcher David Koes is doing just that, using NVIDIA GPUs for molecular docking, the process of simulating how a drug candidate will bind to a target protein. His team developed a deep learning model that improved their prediction accuracy from 52 percent to 70 percent.

And Recursion Pharmaceuticals, a member of the NVIDIA Inception program, is using more than 100 GPUs to train its neural networks for drug discovery across several therapeutic areas, including hundreds of rare diseases that currently lack treatments.

Recursion’s deep learning models analyze microscopy images, determining whether a drug compound is effective at healing diseased cells. Using AI allows the company to screen hundreds of features from more than 10 million cells in a week.

How AI Is Changing Genomics

Another area of medicine where the size and complexity of data are staggering is genomics. Despite being a relatively young field, genomics is growing fast, with datasets doubling in size around every eight months.

Around a million whole human genomes have been sequenced worldwide, giving scientists an ocean of granular data that can be harnessed for precision medicine, immunotherapy and population studies. But once this data is collected, it’s computationally demanding to analyze.

Scripps Research Translational Institute is partnering with NVIDIA to build deep learning applications for more affordable genome sequencing and better mutation detection from genomic data. Startups, too, are harnessing GPUs to solve challenges in genomic analysis.

Just as GPUs solve graphics problems by processing many pixels independently, they can break genetic information into tiny, individual pieces that can be crunched separately and then strung back together, says Ankit Sethia, cofounder of Inception startup Parabricks.

The company is using an NVIDIA DGX-1 server to detect key markers and outliers in a sequenced genome — shrinking the time it takes from a couple days to under an hour.

How AI Is Changing Medical Research

Researchers in universities around the world are using AI and GPUs to simulate biological structures and diseases that we don’t yet fully understand.

In Australia, a team at Monash University is using a process called cryo-electron microscopy to develop high-resolution 3D models of molecules, a compute-intensive process that requires an NVIDIA GPU-powered supercomputer to run.

The researchers are using the technology to develop drugs that can combat superbugs, or drug-resistant bacteria.

In the U.S., Colorado State University researchers are simulating an enzyme found in the deadly dengue virus, which infects hundreds of millions of people each year. Using GPU-powered supercomputers at the San Diego Supercomputing Center, the team was able to discover new aspects of enzyme motion.

With increased precision, this work could lead to insights that stop diseases like dengue from spreading.

Deep learning can also be used to help researchers amass the source data they need to develop breakthrough healthcare applications. At NVIDIA, researchers are using generative adversarial networks, or GANs, to advance medical research by generating abnormal brain MRIs to train neural networks for medical imaging.

These synthetic MRIs can help solve a challenge developers in the medical community often face: a lack of balanced, reliable training data to train their deep learning models.

See the NVIDIA healthcare page for more.

The post How AI Is Changing Medicine appeared first on The Official NVIDIA Blog.

Announcing the first winner of the AWS DeepRacer League Summit circuit!

Today, at the AWS Summit in Santa Clara, California, we kicked off the 2019 season of the world’s first global autonomous racing league. The AWS DeepRacer League allows developers of all skill levels to get hands on with machine learning through a series of live racing events at AWS Global Summits around the world. The AWS DeepRacer League includes virtual events and tournaments throughout the year.

It was an exciting day as developers put their machine learning skills to the test! After 9 hours, 400 autonomously driven laps, and over 5 miles of racing, the Santa Clara winner was declared. Chris Miller, founder of Cloud Brigade, based in Santa Cruz, California, topped the leaderboard with a winning time of 10.43 seconds and will be the first victor to advance on an expenses-paid trip to the AWS DeepRacer Championship Cup at re:Invent 2019 in Las Vegas, Nevada. Chris and his team came to the Santa Clara Summit with the intent to learn more about AI and ML: “I’m excited about machine learning and the technology that is being made available for modern applications.” Chris trained his winning model in one of the AWS DeepRacer workshops at the summit. Next on the agenda for Chris: he is now preparing for re:Invent by learning more about machine learning and how he can customize his model further.

The top three developers on the leaderboard: Chris Miller (center), Santa Clara Summit Champion; Rahul Shah (left), First Runner Up; Adrian Sarno (right), Second Runner Up

Machine Learning available for all

The league is only just beginning and you don’t have to be at an AWS Summit to start learning about machine learning with AWS DeepRacer. Today we are launching a new online digital training course called AWS DeepRacer: Driven by Reinforcement Learning. The course is available at no cost as part of AWS Training and Certification, within the AWS Machine Learning Developer Learning Path. The course has 6 self-guided chapters and in 90 minutes will help you prepare to compete in the AWS DeepRacer League. You will learn how to build a reinforcement learning model and find tips and tricks about how to tune those models to climb the leaderboard.

Up next!

The journey to crown the 2019 AWS DeepRacer Champion continues on April 2nd at the AWS Summit in Paris. Follow the live results on the AWS DeepRacer League webpage. While you’re there, plan your next race. And don’t forget, this competition is open to all. If you don’t have an AWS DeepRacer car or your own model, our Summit pit crew is there to help you select a pre-trained model and race it straight away. Also, if you can’t make it to any of the in-person events, our virtual circuit is coming soon and will allow anyone, anywhere to compete.

See you on the tracks!


About the Author

Alexandra Bush is a Senior Product Marketing Manager for AWS AI. She is passionate about how technology impacts the world around us and enjoys being able to help make it accessible to all. Out of the office she loves to run, travel and stay active in the outdoors with family and friends.

Train Deep Learning Models on GPUs using Amazon EC2 Spot Instances

You’ve collected your datasets, designed your deep neural network architecture, and coded your training routines. You are now ready to run training on a large dataset for multiple epochs on a powerful GPU instance. You learn that the Amazon EC2 P3 instances with NVIDIA Tesla V100 GPUs are ideal for compute-intensive deep learning training jobs, but you have a tight budget and want to lower your cost-to-train.

Spot-instance pricing makes high-performance GPUs much more affordable for deep learning researchers and developers who run training jobs that span several hours or days. Spot instances allow you to access spare Amazon EC2 compute capacity at a steep discount compared to on-demand rates. For an up-to-date list of prices by instance and Region, visit the Spot Instance Advisor. To learn more about the key differences between spot instances and on-demand instances, I recommend going through this Amazon EC2 user-guide.

Spot instances are great for deep learning workflows, but there are a few challenges associated with using spot instances versus on-demand instances. First, spot instances can be preempted and terminated with just 2 minutes’ notice. This means you can’t count on your instance to run a training job to completion, so spot instances are not recommended for time-sensitive workloads. Second, instance termination can cause data loss if the training progress is not saved properly. Third, if you decide after launching the spot instance that your application should not be interrupted, your only option is to stop the spot instance and re-launch as an on-demand or reserved instance.

To address these challenges, here is a step-by-step tutorial on how to set up spot instances for deep learning training workflows while minimizing training progress loss if a spot interruption occurs. My goal is to implement a setup with the following characteristics:

  • Decouple compute, storage and code artifacts, and keep the compute instance stateless. This enables easy recovery and training state restore when an instance is terminated and replaced
  • Use a dedicated volume for datasets, training progress (checkpoints) and logs. This volume should be persistent and not be affected by instance termination
  • Use a version control system (e.g. Git) for training code. This repo should be cloned to commence/resume training. This enables traceability and prevents loss of code changes when an instance is terminated
  • Minimize code changes to the training script. This ensures that the training script can be developed independently and backup and snapshot operations are performed outside of the training code
  • Automate, automate, automate. Automate replacement instance creation after termination, attaching of the datasets and checkpoints EBS volume at launch, moving volumes across Availability Zones, performing instance state restore, resuming training, and terminating the instance once training is finished

Deep learning with Spot Instances using TensorFlow and the AWS Deep Learning AMI

In this example, I use spot instances and the AWS Deep Learning AMI to train a ResNet50 model on the CIFAR10 dataset. I use TensorFlow 1.12 configured with CUDA 9, available on the AWS Deep Learning AMI version 21. AWS Deep Learning AMIs are updated frequently, so check the AWS Marketplace first to make sure you’re using the latest version compatible with your training code. For TensorFlow 1.13 and CUDA 10 use this AWS Deep Learning AMI instead.

I show you how to set up a spot fleet request for deep learning training jobs, which you can use as a starting point for your specific dataset and models.

To follow along, I assume you’ve met the following pre-requisites:

  1. You have an AWS account, and the AWS CLI installed on your host
  2. You are familiar with Python and at least one deep learning framework

As you go through the implementation details, you learn everything else required. All the code, configuration files and AWS CLI commands are available on GitHub.

I use the following AWS and open-source services and concepts. Figure 1 shows how all of these fit together in our example.

  • AWS CLI: I use the CLI to interact with AWS services. Everything you can do with the CLI can also be done through the AWS console. The CLI will let you automate, which is one of my goals for this example.
  • Amazon EC2 spot instance and spot instance requests: Spot requests ensure that the specified number of spot instances are running. Spot fleet places spot requests to meet the target capacity and automatically replenishes any interrupted instances.
  • AWS Deep Learning AMI: An Amazon machine image with pre-installed deep learning frameworks. In this example, I use the GPU-accelerated TensorFlow framework for training
  • Amazon Elastic Block Store (EBS): A persistent volume to store datasets, checkpoints and logs, that can be attached to a currently running instance
  • Amazon EBS snapshots: Snapshots let you back up data on your Amazon EBS volumes to Amazon S3. A snapshot contains all of the information needed to restore your data to a new EBS volume and can be used to migrate volumes to a new Availability Zone.
  • Amazon EC2 user data and instance metadata: At instance launch, a user data shell script can be executed to perform actions such as attaching volumes, initiating training and cleaning up. Instance metadata allows an instance to query information about itself, such as its instance ID, for use in user data shell scripts
  • Amazon IAM role and policy: Grants EC2 instance permissions to use AWS services on your behalf. Essential to automate everything.

Figure 1: Reference architecture for using spot instances in deep learning workflows

Step 1: Set up a dedicated EBS volume for datasets and checkpoints using a general-purpose instance

The first step is to set up our dedicated EBS volume for storing datasets, checkpoints and other information that needs to persist such as logs and other metadata. This step is only done once so I start by launching an on-demand m4.xlarge instance. If your dataset is small and you’re not going to be performing any pre-processing steps during preparation, then you could launch an instance with lesser memory and processing power that may cost less. If you’re going to be transcoding images or running other multi-threaded pre-processing routines then pick a GPU-backed or compute-optimized CPU instance.

Run the following command in your terminal using the AWS CLI. All the commands listed here were tested on macOS.

aws ec2 run-instances \
    --image-id ami-0027dfad6168539c7 \
    --security-group-ids <SECURITY_GROUP_ID> \
    --count 1 \
    --instance-type m4.xlarge \
    --key-name <KEYPAIR_NAME> \
    --subnet-id <SUBNET_ID> \
    --query "Instances[0].InstanceId"

image-id refers to the Deep Learning AMI Ubuntu instance. Be sure to update the security group, key pair name and subnet ID to allow SSH connections into the instance. See this documentation page for more details.

Important: Create a subnet in a specific Availability Zone and remember your choice. EBS volumes can only be attached to instances in the same Availability Zone. See Figure 1 for an illustration. In this example I use us-west-2b as my Availability Zone for setup. In step 3 I show you how to automate migration of EBS volumes between Availability Zones using EBS snapshots.

Throughout this example, every placeholder in angle brackets needs to be replaced with a value specific to your setup; the rest can be copied as is.

Next, create an EBS volume for your datasets and checkpoints. Here I request 100 GiB. You should choose a value that suits your dataset needs. The EBS volume should be in the same Availability Zone as your instance. After you create the volume, attach it to your instance. Specify the ID details from the output of the run-instances and create-volume commands.

aws ec2 create-volume \
    --size 100 \
    --region <AWS_REGION> \
    --availability-zone <INSTANCE_AZ> \
    --volume-type gp2 \
    --tag-specifications 'ResourceType=volume,Tags=[{Key=Name,Value=DL-datasets-checkpoints}]'

aws ec2 attach-volume \
    --volume-id vol-<your_volume_id> \
    --instance-id i-<your_instance_id> \
    --device /dev/sdf

Follow the steps in the documentation to connect to your instance using SSH, and then format and mount the attached volume. In this example, I use a mount point directory at root named /dltraining.

Do this step only once. Later in step 3 you can see how each new spot instance will automatically self-mount the volume at launch so the datasets and checkpoints are available for training.

In this example I use the following paths:

  • Datasets: /dltraining/datasets
  • Training progress checkpoints: /dltraining/checkpoints

sudo mkdir /dltraining
sudo mkfs -t xfs /dev/xvdf
sudo mount /dev/xvdf /dltraining
sudo chown -R ubuntu: /dltraining/
cd /dltraining
mkdir datasets
mkdir checkpoints
#
# Optional: Run commands to move your custom datasets into the Datasets directory.
#

To follow along with this example, you can create and then leave these directories empty. The training script ec2_spot_keras_training.py will download the CIFAR10 dataset using Keras the first time training is initiated.

You can terminate this instance using the command below. Volume setup is now complete and will persist in the Availability Zone it was created in.

aws ec2 terminate-instances \
    --instance-ids i-<your_instance_id> \
    --output text

Step 2: Create IAM role and policy to grant instance permissions

If you’re new to the cloud, AWS Identity and Access Management (IAM) concepts may be new to you. IAM roles and policies are used to grant instances specific permissions that allow them to access other AWS services on your behalf.

During training, I want the spot instance to have access to my datasets and checkpoints in the EBS volume I created in step 1. However, a volume can only be attached to an instance in the same Availability Zone. If the volume and the instance are in different Availability Zones, a new volume needs to be created from a snapshot of the original volume, which is stored in Amazon S3.

All these steps can be performed at instance launch using the AWS CLI and user data bash script, and you can see how in step 3. Here are all the AWS CLI commands you need to run at instance launch:

  • Query for volumes with the name tag: DL-datasets-checkpoints (there should be only one)
  • Create a snapshot of this volume with tag: DL-datasets-checkpoints-snapshot
  • If the instance and volume are in the same Availability Zone, attach volume to the instance
  • If the instance and volume are in different Availability Zones, create a new volume from the snapshot in the instance’s Availability Zone with name: DL-datasets-checkpoints, and attach it to the instance. Delete the volume in the different Availability Zone to ensure there is only one copy.
  • Once training is complete, cancel the spot fleet request and terminate all training instances
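The branching between the same-zone and cross-zone cases can be sketched in a few lines of Python. This is only an illustration of the decision logic, not the user data script itself: the function name and the action labels are hypothetical, and the ordering follows what the user data script in step 3 does (snapshot, wait, delete, recreate, wait, attach).

```python
from typing import List


def plan_volume_actions(volume_az: str, instance_az: str) -> List[str]:
    """Return the ordered launch-time actions for the checkpoint volume.

    The action names are illustrative labels, not AWS CLI commands.
    """
    if volume_az == instance_az:
        # Same Availability Zone: the existing volume can be attached directly.
        return ["attach-volume"]
    # Different Availability Zone: migrate the volume through a snapshot,
    # then attach the new copy and keep only that one.
    return [
        "create-snapshot",
        "wait-snapshot-completed",
        "delete-old-volume",
        "create-volume-from-snapshot",
        "wait-volume-available",
        "attach-volume",
    ]


print(plan_volume_actions("us-west-2b", "us-west-2b"))
print(plan_volume_actions("us-west-2b", "us-west-2a"))
```

Either way the last action is the attach, which is why the rest of the script can treat both cases identically once the volume is in the right zone.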

In order for the instance to be able to perform these actions, I will need to grant the instance the permissions to do so on my behalf. This way I don’t grant the instance all the same permissions that I as a user have and risk potential abuse.

I start by creating a role for my Amazon EC2 instance, called the IAM role. After that I grant specific permissions to this role by creating what is called a policy. Execute the following command to create a new IAM role. I’ve named my role DL-Training; feel free to choose another name.

aws iam create-role \
    --role-name DL-Training \
    --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Sid":"","Effect":"Allow","Principal":{"Service":"ec2.amazonaws.com"},"Action":"sts:AssumeRole"}]}'

Next, I will create and attach a policy that grants the instance the following permissions:

  1. Describe, create, attach and delete volumes
  2. Create snapshots from volumes
  3. Describe spot instances
  4. Cancel spot fleet requests and terminate instances

You can grant permissions to access other AWS services if you’re going to be using them in your application. In general, the more specific you are about the actions the instance takes the better. The permissions are in a file called ec2-permissions-dl-training.json on the example GitHub repository.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:AttachVolume",
                "ec2:DeleteVolume",
                "ec2:DescribeVolumeStatus",
                "ec2:CancelSpotFleetRequests",
                "ec2:CreateTags",
                "ec2:DescribeVolumes",
                "ec2:CreateSnapshot",
                "ec2:DescribeSpotInstanceRequests",
                "ec2:DescribeSnapshots",
                "ec2:CreateVolume"
            ],
            "Resource": "*"
        }
    ]
}
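If you extend the policy for your own application, it can help to check the document before uploading it. Below is a minimal sketch, assuming you want to confirm that every action your script relies on is granted; build_policy and missing_actions are hypothetical helper names, and the action list is copied from the policy above.

```python
import json

# Actions copied from the policy document above.
POLICY_ACTIONS = [
    "ec2:AttachVolume",
    "ec2:DeleteVolume",
    "ec2:DescribeVolumeStatus",
    "ec2:CancelSpotFleetRequests",
    "ec2:CreateTags",
    "ec2:DescribeVolumes",
    "ec2:CreateSnapshot",
    "ec2:DescribeSpotInstanceRequests",
    "ec2:DescribeSnapshots",
    "ec2:CreateVolume",
]


def build_policy(actions):
    """Build a single-statement IAM policy document as a Python dict."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": sorted(set(actions)), "Resource": "*"}
        ],
    }


def missing_actions(policy, required):
    """Return the required actions that no Allow statement grants by exact name."""
    granted = set()
    for statement in policy["Statement"]:
        if statement.get("Effect") != "Allow":
            continue
        action = statement["Action"]
        # IAM allows "Action" to be a single string or a list of strings.
        granted.update([action] if isinstance(action, str) else action)
    return sorted(set(required) - granted)


policy = build_policy(POLICY_ACTIONS)
print(json.dumps(policy, indent=4))
print(missing_actions(policy, ["ec2:CreateSnapshot", "ec2:AttachVolume"]))
```

The check compares exact action names only; it does not expand wildcards such as ec2:Describe*.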

And run the following to create a policy and attach it to our IAM role:

aws iam create-policy \
    --policy-name ec2-permissions-dl-training \
    --policy-document file://ec2-permissions-dl-training.json

aws iam attach-role-policy \
    --policy-arn arn:aws:iam::<account_id>:policy/ec2-permissions-dl-training \
    --role-name DL-Training

Be sure to substitute <account_id> with your AWS account ID in the attach-role-policy command.

Step 3: Create EC2 user data bash script

Next, I create a launch specification file with details about the instance you want to run your training on. In this example I’m going to be using a p3.2xlarge. If you’re running a multi-GPU training job then you can request an instance with more GPUs. Note that by multi-GPU jobs, I’m referring to multiple GPUs on the same instance. Currently, the maximum number of GPUs you can get on a single instance is 8, with a p3.16xlarge or p3dn.24xlarge. I will cover distributed/multi-node training use cases in a future blog post.

As discussed in step 2, Amazon EC2 allows you to pass user data shell scripts to an instance that get executed at launch. Let’s take a look at our user data shell script. The full script (user_data_script.sh) is available on GitHub.

The file has six key sections:

Get instance ID and query volume

In this section the script queries the instance metadata API to get the ID of the instance on which this script is running. It then uses this information to search for the datasets and checkpoints volume with the tag: DL-datasets-checkpoints

#!/bin/bash

# Get instance ID 
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
INSTANCE_AZ=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
AWS_REGION=us-west-2

# Get Volume Id and availability zone
VOLUME_ID=$(aws ec2 describe-volumes --region $AWS_REGION --filter "Name=tag:Name,Values=DL-datasets-checkpoints" --query "Volumes[].VolumeId" --output text)
VOLUME_AZ=$(aws ec2 describe-volumes --region $AWS_REGION --filter "Name=tag:Name,Values=DL-datasets-checkpoints" --query "Volumes[].AvailabilityZone" --output text)
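The script above hardcodes AWS_REGION. If you would rather derive it from the Availability Zone returned by the metadata API, here is a minimal sketch in Python. It assumes the standard naming scheme in which an Availability Zone name is the Region name followed by a single letter (us-west-2b belongs to us-west-2); region_from_az is a hypothetical helper, and names that do not fit this pattern are rejected rather than guessed.

```python
import re

# Region name (e.g. "us-west-2", "us-gov-west-1") followed by one zone letter.
_AZ_PATTERN = re.compile(r"^([a-z]{2}(?:-[a-z]+)+-\d+)([a-z])$")


def region_from_az(availability_zone: str) -> str:
    """Return the Region for a standard Availability Zone name."""
    match = _AZ_PATTERN.match(availability_zone)
    if match is None:
        raise ValueError(
            "unexpected Availability Zone name: {!r}".format(availability_zone)
        )
    return match.group(1)


print(region_from_az("us-west-2b"))
```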

Check if the volume and instance are in the same availability zone

In this section the script checks whether the volume and the instance are in the same Availability Zone. If they are in different Availability Zones, it first creates a point-in-time snapshot of the volume in Amazon S3. Once the snapshot is created, it deletes the volume and creates a new volume from the snapshot in the instance’s Availability Zone. Figure 2 illustrates the two patterns.

The aws ec2 wait command ensures that snapshot and volume creation are complete before proceeding to the next command.

Figure 2: On spot instance termination, if a new spot instance is launched in a different availability zone (a), EBS volume snapshots are saved to S3 and a new volume is created from the snapshot in the instance’s availability zone. If the new spot instance is launched in the same availability zone as the volume (b), the same EBS volume is attached to the new instance

if [ "$VOLUME_AZ" != "$INSTANCE_AZ" ]; then
    # Snapshot the existing volume and wait for the snapshot to complete
    SNAPSHOT_ID=$(aws ec2 create-snapshot \
        --region $AWS_REGION \
        --volume-id $VOLUME_ID \
        --description "$(date +"%D %T")" \
        --tag-specifications 'ResourceType=snapshot,Tags=[{Key=Name,Value=DL-datasets-checkpoints-snapshot}]' \
        --query SnapshotId --output text)
    aws ec2 wait --region $AWS_REGION snapshot-completed --snapshot-ids $SNAPSHOT_ID
    # Delete the old volume so that only one copy carries the Name tag
    aws ec2 --region $AWS_REGION delete-volume --volume-id $VOLUME_ID
    # Recreate the volume from the snapshot in the instance's Availability Zone
    VOLUME_ID=$(aws ec2 create-volume \
        --region $AWS_REGION \
        --availability-zone $INSTANCE_AZ \
        --snapshot-id $SNAPSHOT_ID \
        --volume-type gp2 \
        --tag-specifications 'ResourceType=volume,Tags=[{Key=Name,Value=DL-datasets-checkpoints}]' \
        --query VolumeId --output text)
    aws ec2 wait volume-available --region $AWS_REGION --volume-ids $VOLUME_ID
fi

Attach and mount volume: In this section the script first attaches the volume that is in the same Availability Zone as the instance. It then mounts the attached volume to the mount point directory at /dltraining. Finally, it updates the ownership to the Ubuntu user, since the user data script is run as root.

aws ec2 attach-volume \
    --region $AWS_REGION --volume-id $VOLUME_ID \
    --instance-id $INSTANCE_ID --device /dev/sdf
sleep 10

# Mount volume and change ownership, since this script is run as root
mkdir /dltraining
mount /dev/xvdf /dltraining
chown -R ubuntu: /dltraining/
cd /home/ubuntu/
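The fixed sleep 10 above gives the attachment time to finish before mounting. If you find that the device sometimes takes longer to appear, one alternative is to poll for the device path with a timeout. The following is a minimal sketch of that idea in Python, not part of the original script; wait_for_path and its default values are hypothetical.

```python
import os
import time


def wait_for_path(path: str, timeout_s: float = 60.0, interval_s: float = 1.0) -> bool:
    """Poll until `path` exists; return False if the timeout expires first."""
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example: wait up to a minute for the attached volume's device node.
# device_ready = wait_for_path("/dev/xvdf")
```

A False return value lets the caller fail loudly instead of attempting to mount a device that never showed up.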

Get training scripts: In this section, the script clones the training code Git repository.

# Get training code
git clone https://github.com/awslabs/ec2-spot-labs.git
chown -R ubuntu: ec2-spot-labs
cd ec2-spot-labs/ec2-spot-deep-learning-training/

Initiate/resume training: The script activates the tensorflow_p36 Conda environment and runs the training script as the Ubuntu user. The training script takes care of loading the dataset from the Amazon EBS volume and resuming training from checkpoints. Step 5 will go into the modifications needed for your training script.

# Initiate training using the tensorflow_p36 conda environment
sudo -H -u ubuntu bash -c "source /home/ubuntu/anaconda3/bin/activate tensorflow_p36; python ec2_spot_keras_training.py "

Clean up: Once training is complete, the script cleans up by canceling spot fleet requests associated with the current instance. cancel-spot-fleet-requests can also terminate instances managed by the fleet.

# After training, clean up by cancelling spot fleet requests
SPOT_FLEET_REQUEST_ID=$(aws ec2 describe-spot-instance-requests --region $AWS_REGION --filter "Name=instance-id,Values='$INSTANCE_ID'" --query "SpotInstanceRequests[].Tags[?Key=='aws:ec2spot:fleet-request-id'].Value[]" --output text)

aws ec2 cancel-spot-fleet-requests --region $AWS_REGION --spot-fleet-request-ids $SPOT_FLEET_REQUEST_ID --terminate-instances
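The --query expression above pulls the fleet request ID out of the instance request’s tags. To make that lookup easier to follow, here is the same extraction sketched in Python over a response shaped like the describe-spot-instance-requests output. The sample response and its IDs are made up for illustration, and fleet_request_id is a hypothetical helper.

```python
from typing import Optional

FLEET_TAG_KEY = "aws:ec2spot:fleet-request-id"


def fleet_request_id(response: dict) -> Optional[str]:
    """Return the spot fleet request ID tagged on the first matching request."""
    for spot_request in response.get("SpotInstanceRequests", []):
        for tag in spot_request.get("Tags", []):
            if tag.get("Key") == FLEET_TAG_KEY:
                return tag.get("Value")
    return None


# Hypothetical sample data; real responses contain many more fields.
sample_response = {
    "SpotInstanceRequests": [
        {
            "InstanceId": "i-0123456789abcdef0",
            "Tags": [
                {"Key": "Name", "Value": "example"},
                {"Key": FLEET_TAG_KEY, "Value": "sfr-example-id"},
            ],
        }
    ]
}

print(fleet_request_id(sample_response))
```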

Step 4: Create a spot fleet request configuration file

Next, I will create a spot fleet configuration file that includes target capacity (1 instance in our example), launch specifications for the instance, and the maximum price that you are willing to pay. Spot fleet places requests to meet the target capacity and automatically replenishes any interrupted instances.

Under the LaunchSpecifications section, I have two different specifications.

  1. A p3.2xlarge instance type that may be placed in any Availability Zone within the us-west-2 Region
  2. A p2.xlarge instance type that may be placed in any Availability Zone within the us-west-2 Region

The spot fleet configuration is in a file called spot_fleet_config.json in the example GitHub repository. The spot fleet configuration file gives you the flexibility to mix and match instance types and Availability Zones. If your training script takes advantage of NVIDIA Tesla V100’s mixed-precision Tensor Cores, you may want to restrict instance types to only p3.2xlarge. The p2.xlarge with NVIDIA Tesla K80 only supports single (FP32) and double precision (FP64), and is cheaper but slower than the V100 for deep learning training. Choose a combination that suits your needs.

{
  "TargetCapacity": 1,
  "AllocationStrategy": "lowestPrice",
  "IamFleetRole": "arn:aws:iam::<ACCOUNT_NUMBER>:role/DL-Training-Spot-Fleet-Role",
  "LaunchSpecifications": [
      {
          "ImageId": "ami-0027dfad6168539c7",
          "KeyName": "<KEYPAIR_NAME>",
          "SecurityGroups": [
              {
                  "GroupId": "<SECURITY_GROUP_ID>"
              }
          ],
          "InstanceType": "p3.2xlarge",
          "Placement": {
              "AvailabilityZone": "us-west-2a, us-west-2b, us-west-2c, us-west-2d"
          },
          "UserData": "base64_encoded_bash_script",
          "IamInstanceProfile": {
              "Arn": "arn:aws:iam::<ACCOUNT_NUMBER>:instance-profile/DL-Training"
          }
      },
      {
          "ImageId": "ami-0027dfad6168539c7",
          "KeyName": "<KEYPAIR_NAME>",
          "SecurityGroups": [
              {
                  "GroupId": "<SECURITY_GROUP_ID>"
              }
          ],
          "InstanceType": "p2.xlarge",
          "Placement": {
              "AvailabilityZone": "us-west-2a, us-west-2b, us-west-2c, us-west-2d"
          },
          "UserData": "base64_encoded_bash_script",
          "IamInstanceProfile": {
              "Arn": "arn:aws:iam::<ACCOUNT_NUMBER>:instance-profile/DL-Training"
          }
      }
  ]
}

Be sure to use a security group that allows you to SSH into the instance for debugging and checking progress manually, and use your key pair name for authentication. Under the IAM instance profile, enter the IAM role you created in step 2, which grants the instance the necessary permissions.
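Because the two launch specifications differ only in instance type, you may prefer to generate the configuration rather than maintain it by hand. Below is a minimal sketch in Python that produces the same field names used in the configuration file above. The helper names and the keys of the COMMON dictionary are my own, and the placeholder values still need to be replaced with yours.

```python
import json
from typing import Dict, List


def launch_specification(instance_type: str, common: Dict[str, str]) -> dict:
    """Build one launch specification; only the instance type varies."""
    return {
        "ImageId": common["image_id"],
        "KeyName": common["key_name"],
        "SecurityGroups": [{"GroupId": common["security_group_id"]}],
        "InstanceType": instance_type,
        "Placement": {"AvailabilityZone": common["availability_zones"]},
        "UserData": common["user_data_b64"],
        "IamInstanceProfile": {"Arn": common["instance_profile_arn"]},
    }


def spot_fleet_config(instance_types: List[str], fleet_role_arn: str,
                      common: Dict[str, str]) -> dict:
    """Build a fleet configuration with one specification per instance type."""
    return {
        "TargetCapacity": 1,
        "AllocationStrategy": "lowestPrice",
        "IamFleetRole": fleet_role_arn,
        "LaunchSpecifications": [
            launch_specification(instance_type, common)
            for instance_type in instance_types
        ],
    }


# Placeholder values: substitute your own before submitting the request.
COMMON = {
    "image_id": "ami-0027dfad6168539c7",
    "key_name": "<KEYPAIR_NAME>",
    "security_group_id": "<SECURITY_GROUP_ID>",
    "availability_zones": "us-west-2a, us-west-2b, us-west-2c, us-west-2d",
    "user_data_b64": "base64_encoded_bash_script",
    "instance_profile_arn": "arn:aws:iam::<ACCOUNT_NUMBER>:instance-profile/DL-Training",
}

config = spot_fleet_config(
    ["p3.2xlarge", "p2.xlarge"],
    "arn:aws:iam::<ACCOUNT_NUMBER>:role/DL-Training-Spot-Fleet-Role",
    COMMON,
)
print(json.dumps(config, indent=2))
```

Restricting the fleet to V100 instances then becomes a one-line change to the list of instance types.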

To use the spot fleet request, create an IAM fleet role by running the following commands:

aws iam create-role \
     --role-name DL-Training-Spot-Fleet-Role \
     --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Sid":"","Effect":"Allow","Principal":{"Service":"spotfleet.amazonaws.com"},"Action":"sts:AssumeRole"}]}'

aws iam attach-role-policy \
     --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2SpotFleetTaggingRole --role-name DL-Training-Spot-Fleet-Role

In the configuration snippet above, under user data you have to replace the text base64_encoded_bash_script with the base64-encoded user data shell script. To do this you can use the base64 utility available on macOS and Linux. The following works on a Mac; for Linux flavors, replace -b with -w to remove line breaks. The sed command replaces all occurrences of the string base64_encoded_bash_script with the base64-encoded bash script.

USER_DATA=`base64 user_data_script.sh -b0`
sed -i '' "s|base64_encoded_bash_script|$USER_DATA|g" spot_fleet_config.json 
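Since the base64 and sed flags differ between macOS and Linux, a platform-independent alternative is to do the substitution in Python. The sketch below performs the same two steps, encoding the script without line breaks and replacing every occurrence of the placeholder; the function names are my own, and write_config is shown but not invoked.

```python
import base64

PLACEHOLDER = "base64_encoded_bash_script"


def encode_user_data(script_text: str) -> str:
    """Base64-encode the script as a single line (b64encode adds no line breaks)."""
    return base64.b64encode(script_text.encode("utf-8")).decode("ascii")


def fill_user_data(config_text: str, script_text: str) -> str:
    """Replace every occurrence of the placeholder with the encoded script."""
    return config_text.replace(PLACEHOLDER, encode_user_data(script_text))


def write_config(script_path: str, config_path: str) -> None:
    """Read both files and rewrite the config in place, like the sed command."""
    with open(script_path) as f:
        script_text = f.read()
    with open(config_path) as f:
        config_text = f.read()
    with open(config_path, "w") as f:
        f.write(fill_user_data(config_text, script_text))


# Example: write_config("user_data_script.sh", "spot_fleet_config.json")
```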

Step 5: Update deep learning training script

The final step is to update your deep learning training script to ensure datasets are loaded from and checkpoints are saved to the attached Amazon EBS volume. In this example I’m training a ResNet50 model on the CIFAR10 dataset. A typical deep learning training script may have the following steps. The pseudo-code below shows the changes you’ll need to make to your training script to use it with our setup.

# Prepare datasets / setup dataset loaders
dataset = load_data(ebs_mount_point_dataset)

# Define model
if exists(ebs_mount_point_checkpoints)
    checkpoint, checkpoint_epoch = get_latest_checkpoint(ebs_mount_point_checkpoints)
    model = load_model(checkpoint)
else
    model = define_model()
    checkpoint_epoch = 0
    
# Define training parameters

# Execute training loop
for i = checkpoint_epoch to max_epoch
    ...
    ...
    ...
    # Avoid corrupted checkpoints due to termination
    status = get_spot_termination_status()
    if status == "Terminating"
        pause_training()
    # Save checkpoints and progress
    save_model_checkpoint(model, ebs_mount_point_checkpoints)
    save_progress_logs(ebs_mount_point)
end

To summarize,

  • Load data from the mounted Amazon EBS volume, in our example that would be /dltraining
  • Check if a checkpoint exists, then load the checkpoint and update epoch number to resume training. If not, define the model architecture and start training from scratch.
  • In the training loop, check if a termination notice has been issued. If yes, then pause training so that the instance is not terminated in the middle of checkpointing, which would leave corrupt or incomplete checkpoints.
  • If termination notice hasn’t been issued, save the model checkpoints to /dltraining/checkpoints/

The training script for this example is called ec2_spot_keras_training.py and is available in the example repository. Below is a code snippet from our training script. The function load_checkpoint_model() loads the latest checkpoint to resume training.

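import glob
import os

from keras.models import load_model


def load_checkpoint_model(checkpoint_path, checkpoint_names):
    # List every file in the checkpoint directory
    list_of_checkpoint_files = glob.glob(os.path.join(checkpoint_path, '*'))
    # The epoch number is the second dot-separated field of each file name
    checkpoint_epoch_number = max([int(file.split(".")[1]) for file in list_of_checkpoint_files])
    checkpoint_epoch_path = os.path.join(checkpoint_path,
                                         checkpoint_names.format(epoch=checkpoint_epoch_number))
    # Load the full model saved at the most recent epoch
    resume_model = load_model(checkpoint_epoch_path)
    return resume_model, checkpoint_epoch_number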
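The snippet above appears to assume that the checkpoint directory is non-empty and contains only checkpoint files, since max() fails on an empty list and the split assumes a dot-separated epoch field. If your directory may also hold logs or be empty on the first run, a more defensive lookup could look like the sketch below. The naming scheme prefix.epoch.h5 (for example model.007.h5) is my assumption, and find_latest_checkpoint is a hypothetical helper that only locates the file; loading is left to the framework.

```python
import os
import re
import tempfile
from typing import Optional, Tuple

# Assumed naming scheme: "<prefix>.<epoch>.h5", e.g. "model.007.h5".
_EPOCH_PATTERN = re.compile(r"\.(\d+)\.h5$")


def find_latest_checkpoint(checkpoint_dir: str) -> Optional[Tuple[str, int]]:
    """Return (path, epoch) of the newest checkpoint, or None if there is none."""
    if not os.path.isdir(checkpoint_dir):
        return None
    best = None
    for name in os.listdir(checkpoint_dir):
        match = _EPOCH_PATTERN.search(name)
        if match is None:
            continue  # ignore logs and other non-checkpoint files
        epoch = int(match.group(1))
        if best is None or epoch > best[1]:
            best = (os.path.join(checkpoint_dir, name), epoch)
    return best


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as demo_dir:
        for demo_name in ["model.001.h5", "model.002.h5", "training_log.csv"]:
            open(os.path.join(demo_dir, demo_name), "w").close()
        print(find_latest_checkpoint(demo_dir))
```

A None result signals that training should start from scratch, which matches the else branch of the pseudo-code.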

Since I’m using Keras with a TensorFlow backend, I didn’t have to explicitly write the training loop. Keras provides convenient callback functions for saving checkpoints and logging progress after each epoch.

Note: if you’re implementing your own training loop with TensorFlow’s low-level API, PyTorch or another framework, you are responsible for checkpointing progress. This can be very tricky if you don’t know what you’re doing. To resume training properly, you’ll need to make sure that you’re saving (1) the model architecture, to re-define the model; (2) the completed epoch number and the weights of the model at the end of the current epoch; (3) training hyper-parameters such as the loss function, optimizer and learning rate schedule; and (4) the optimizer state at the end of the epoch.
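One way to keep track of those four items is to bundle them into a single record that is saved and restored as a unit. The following is a framework-neutral sketch only: TrainingState and its field names are hypothetical, and the values are plain JSON-friendly Python types standing in for real tensors, which you would serialize with your framework’s own save functions.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class TrainingState:
    architecture: Dict[str, Any]      # (1) enough information to re-define the model
    epoch: int                        # (2) last completed epoch
    weights: Dict[str, Any]           # (2) model weights at the end of that epoch
    hyperparameters: Dict[str, Any]   # (3) loss, optimizer name, learning rate schedule
    optimizer_state: Dict[str, Any] = field(default_factory=dict)  # (4)


def to_json(state: TrainingState) -> str:
    """Serialize the whole state as one JSON document."""
    return json.dumps(asdict(state))


def from_json(text: str) -> TrainingState:
    """Rebuild the state; a missing required field raises TypeError."""
    return TrainingState(**json.loads(text))


state = TrainingState(
    architecture={"name": "toy-model", "layers": 2},
    epoch=7,
    weights={"dense_1": [0.1, 0.2]},
    hyperparameters={"loss": "categorical_crossentropy", "learning_rate": 0.001},
    optimizer_state={"momentum": [0.0, 0.0]},
)
restored = from_json(to_json(state))
print(restored.epoch)
```

Saving everything in one record avoids the situation where the weights come from one epoch and the optimizer state from another.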

Keras callbacks I’m using to checkpoint progress and check for termination status are below:

import os
import time

import keras
import requests
from keras.callbacks import CSVLogger, ModelCheckpoint

def define_callbacks(volume_mount_dir, checkpoint_path, checkpoint_names, today_date):

    # Model checkpoint callback: saves the full model (not just weights) after each epoch
    if not os.path.isdir(checkpoint_path):
        os.makedirs(checkpoint_path)
    filepath = os.path.join(checkpoint_path, checkpoint_names)
    checkpoint_callback = ModelCheckpoint(filepath=filepath,
                                          save_weights_only=False,
                                          monitor='val_loss')

    # Loss history callback: appends per-epoch results to a CSV log on the volume
    epoch_results_callback = CSVLogger(os.path.join(volume_mount_dir,
                                                    'training_log_{}.csv'.format(today_date)),
                                       append=True)

    # Spot termination callback: the metadata endpoint returns 404 until a
    # termination notice has been issued
    class SpotTermination(keras.callbacks.Callback):
        def on_batch_begin(self, batch, logs=None):
            status_code = requests.get("http://169.254.169.254/latest/meta-data/spot/instance-action").status_code
            if status_code != 404:
                # Pause training so no checkpoint is being written at termination
                time.sleep(150)

    spot_termination_callback = SpotTermination()
    callbacks = [checkpoint_callback, epoch_results_callback, spot_termination_callback]
    return callbacks
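The termination check inside the callback can also be written as a small standalone function, which makes it easier to reuse outside Keras. The sketch below is one possible version using only the standard library; the function name, the one-second timeout, and the choice to treat timeouts and connection errors as "no notice" are assumptions, and it queries the same metadata path as the callback without any session token.

```python
import urllib.error
import urllib.request

# Same instance metadata path that the SpotTermination callback queries.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def interruption_notice_issued(url=SPOT_ACTION_URL, timeout=1.0):
    """Return True only if the metadata endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        # A 404 (no notice scheduled) and other HTTP errors land here.
        return False
    except OSError:
        # Timeouts and connection errors: assume no notice was issued
        # rather than crash the training job.
        return False
```

The timeout keeps a slow or unreachable metadata endpoint from stalling every batch, which a request without a timeout could do.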

Step 6: Initiate spot request to start the training

I’m now ready to submit the spot fleet request using the spot_fleet_config.json configuration file I created in Step 4.

aws ec2 request-spot-fleet --spot-fleet-request-config file://spot_fleet_config.json

How it all comes together

So far I’ve introduced a lot of code, configuration files, and AWS CLI commands. Figure 3 shows how all of these code and configuration artifacts fit together. Let’s walk through the process so you can get a better sense of how they are connected.

Figure 3: Data, code and configuration artifacts dependency chart

Let’s start with you, the user.

As a deep learning researcher or developer, first prototype and develop your models locally or on an inexpensive CPU-only Amazon EC2 on-demand instance with the AWS Deep Learning AMI. When you’re ready to run a training job on GPUs, you then push your training scripts to a Git repository.

Next, submit a spot request using the aws ec2 request-spot-fleet command shown in step 6. This sets everything into motion.

The spot request uses the spot fleet configuration file spot_fleet_config.json to launch the desired spot instance type. In this example, you run a training job on a p3.2xlarge instance in any of the us-west-2 Region’s Availability Zones. The training script runs on an instance imaged using the AWS Deep Learning AMI, which includes a GPU-optimized TensorFlow framework.

The spot fleet configuration file also includes the user_data_script.sh bash script file. The user data bash script is executed on the spot instance at launch. This script is responsible for mounting the dataset and checkpoint volume, cloning the training scripts, and initiating the training as we saw in step 3.

In the event of a spot interruption due to a higher spot instance price or a lack of capacity, the instance will be terminated and the Amazon EBS volume holding the dataset and checkpoints will be detached. Spot fleet then places another request to automatically replace the interrupted instance.

When the request is fulfilled again, a new spot instance will be launched and it will execute the user_data_script.sh at launch. The script queries for the dataset and checkpoint volume. If the volume and the instance are in different Availability Zones, it first creates a snapshot of the volume and then creates a new volume based on the snapshot in the current instance’s Availability Zone. The volume in the previous Availability Zone is deleted to ensure there is only one source of truth.
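The Availability Zone check described above can be sketched in Python as follows. This is only an illustration of the snapshot, re-create, and delete sequence, not the contents of user_data_script.sh: the function name is hypothetical, and ec2 is assumed to be a boto3 EC2 client (for example boto3.client("ec2")) that is passed in by the caller. Tagging the new volume so later launches can find it, and attaching it to the instance, are left out.

```python
def ensure_volume_in_az(ec2, volume_id, instance_az):
    """Return the ID of a volume usable from instance_az, moving it if needed.

    `ec2` is expected to behave like a boto3 EC2 client.
    """
    volume = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"][0]
    if volume["AvailabilityZone"] == instance_az:
        return volume_id  # already in the right zone, nothing to do

    # 1. Snapshot the existing volume and wait for the snapshot to finish.
    snapshot_id = ec2.create_snapshot(VolumeId=volume_id)["SnapshotId"]
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])

    # 2. Create a new volume from the snapshot in the instance's zone.
    new_volume_id = ec2.create_volume(
        SnapshotId=snapshot_id, AvailabilityZone=instance_az)["VolumeId"]
    ec2.get_waiter("volume_available").wait(VolumeIds=[new_volume_id])

    # 3. Delete the old volume so only one copy of the data remains.
    ec2.delete_volume(VolumeId=volume_id)
    return new_volume_id
```

Deleting the old volume only after the new one is available means a failure partway through still leaves at least one intact copy of the dataset and checkpoints.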

The script then attaches the volume to the instance and resumes training from the most recent checkpoint. Once training is complete the spot fleet request is cancelled and the current running instance is terminated.

If you want to specify a higher maximum spot instance price, or change instance types or Availability Zones, simply cancel the running spot fleet request by issuing aws ec2 cancel-spot-fleet-requests and initiate a new request with an updated spot fleet configuration file spot_fleet_config.json.

Summary

That’s your overview of how spot instances can be used to run deep learning training experiments on GPU instances at a much lower cost than on-demand instances.

The setup in this blog post can be extended to cover more advanced deep learning workflows, and here are some ideas:

  • Multi-GPU training. Update the training script to enable multi-GPU training.
  • Sub-epoch granularity checkpointing and resuming. In this example, checkpoints are saved only at the end of each epoch. For large datasets and complex models that take a long time to finish an epoch, more frequent checkpointing minimizes the progress lost during an interruption.
  • Multiple parallel experiments. Increase spot fleet target capacity to run multiple independent training jobs with different hyperparameters.
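To make the sub-epoch checkpointing idea more concrete, here is a minimal, framework-free sketch of a training loop that records its position every few steps and can resume from the middle of an epoch. Everything in it is hypothetical: train_step and save stand in for a real training step and a real checkpoint writer, and the (epoch, step) position is an assumed format rather than anything taken from the example repository.

```python
def train(num_epochs, steps_per_epoch, checkpoint_every, train_step, save,
          resume_from=None):
    """Run (or resume) training, saving the position every few steps.

    resume_from is an (epoch, step) pair naming the next step to run,
    exactly as previously passed to `save`.
    """
    start_epoch, start_step = resume_from or (0, 0)
    for epoch in range(start_epoch, num_epochs):
        # Only the first epoch after a restart begins partway through.
        first_step = start_step if epoch == start_epoch else 0
        for step in range(first_step, steps_per_epoch):
            train_step(epoch, step)
            done = step + 1
            if done % checkpoint_every == 0 or done == steps_per_epoch:
                # Record the position of the next step to run.
                if done == steps_per_epoch:
                    save((epoch + 1, 0))
                else:
                    save((epoch, done))


# Usage: record which steps run and which positions get saved.
seen, saved = [], []
train(2, 5, 2, lambda e, s: seen.append((e, s)), saved.append)
print(saved)
```

Resuming in the middle of an epoch only reproduces the original run if the data order is also reproducible, for example by seeding the shuffle with the epoch number; that part is not shown here.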

I hope you enjoyed reading this post. If you have questions, comments, or feedback, please use the comments section below. Happy spot training!


About the Author

Shashank Prasanna is an AI & Machine Learning Technical Evangelist at Amazon Web Services (AWS) where he focuses on helping engineers, developers and data scientists solve challenging problems with machine learning. Prior to joining AWS, he worked at NVIDIA, MathWorks (makers of MATLAB & Simulink) and Oracle in product marketing, product management, and software development roles.