Automated and continuous deployment of Amazon SageMaker models with AWS Step Functions
Amazon SageMaker is a complete machine learning (ML) workflow service for developing, training, and deploying models, lowering the cost of building solutions, and increasing the productivity of data science teams. Amazon SageMaker comes with many predefined algorithms. You can also create your own algorithms by supplying Docker images: a training image to train your model and an inference image to serve your model from a REST endpoint.
Automating the build and deployment of machine learning models is an important step in creating production machine learning services. Models need to be retrained and deployed when code and/or data are updated. In this blog post we will discuss a technique for Amazon SageMaker automation using AWS Step Functions. We’ll demonstrate it through a new open source project, aws-sagemaker-build. This project provides a full implementation of our workflow. It includes Jupyter notebooks showing how to create, launch, stop, and track the progress of the build using Python and Amazon Alexa! The goal of aws-sagemaker-build is to provide a repository of common and useful pipelines that use Amazon SageMaker and AWS Step Functions that can be shared with the community and grown by the community.
The code is open source, and it is hosted on GitHub here.
This blog post won’t discuss the details of how to write and design your Dockerfiles for training or inference. For more details you can dive deep into our documentation here:
- Example Project and Tutorial using aws-sagemaker-build
- Training Image Documentation
- Inference Image Documentation
What AWS services do we need?
We focus on serverless technologies and managed services to keep this solution simple. It’s important for our solution to be scalable and cost effective even when training takes a long time. Training large neural networks can sometimes take days to complete!
AWS Step Functions
There are several AWS services for workflow orchestration, such as AWS CloudFormation, AWS Step Functions, AWS CodePipeline, and AWS Glue. For our application, AWS Step Functions provides the right tools to implement our workflow. A Step Functions state machine begins with an initial state and uses AWS Lambda functions to transform that state, changing, branching, or looping through states as needed. This abstraction makes Step Functions very flexible. State machines can also run for up to one year and are charged only per state transition, making them a scalable and cost-efficient tool for our use case.
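To make the poll-and-wait pattern concrete, here is a minimal sketch of such a state machine in Amazon States Language, built as a Python dictionary. The state names and Lambda ARNs are hypothetical placeholders; the real aws-sagemaker-build machine has many more states.

```python
import json

# Hypothetical ARNs; the real state machine is created by the CloudFormation stack.
states = {
    "StartTraining": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:StartTraining",
        "Next": "WaitForTraining",
    },
    # Pause, then re-check: Step Functions is charged per transition, so this
    # loop is cheap even if training runs for days.
    "WaitForTraining": {"Type": "Wait", "Seconds": 60, "Next": "CheckTraining"},
    "CheckTraining": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:CheckTraining",
        "Next": "IsTrainingDone",
    },
    "IsTrainingDone": {
        "Type": "Choice",
        "Choices": [
            {"Variable": "$.status", "StringEquals": "Completed", "Next": "Deploy"},
            {"Variable": "$.status", "StringEquals": "Failed", "Next": "BuildFailed"},
        ],
        "Default": "WaitForTraining",  # still InProgress: loop back and wait again
    },
    "Deploy": {"Type": "Pass", "End": True},
    "BuildFailed": {"Type": "Fail"},
}
definition = json.dumps({"StartAt": "StartTraining", "States": states}, indent=2)
```

The `definition` string is what you would pass to Step Functions when creating the state machine.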
AWS CodeBuild
AWS CodeBuild is an on-demand code building service. We will use it to build our Docker images and push them to an Amazon Elastic Container Registry (Amazon ECR) repository. For more information, see the documentation.
AWS Lambda
The Step Functions state machine uses Lambda functions to do the work of the build. There are functions for starting training, checking on training status, starting CodeBuild, checking on CodeBuild, and so on.
One challenge was to figure out how to provide configuration parameters to different stages of the build, given that some parameters would be static, others would be dependent on previous build steps, and others would be specific to a customers need. For example, the training and inference image IDs need to be passed on to the training and deployment steps, the Amazon S3 bucket name is static to the pipeline, and the ML instances used for training and inference need to be chosen by the individual user. The solution was to also use Lambda functions. There are two Lambda functions that take as input the current state of the build and output the training job and endpoint configurations. You can edit or overwrite the code of these functions to suit your needs. For example, the Lambda function could query a data catalog to get the Amazon S3 location of a data set.
Lambda functions are also used for various custom resources needed when setting up and tearing down the CloudFormation stack. Custom resource Lambda functions include: clearing out an S3 bucket on stack delete, uploading a Jupyter notebook to the Amazon SageMaker notebook instance, and cleaning up SageMaker resources.
AWS Systems Manager Parameter Store
AWS Systems Manager Parameter Store provides a durable, centralized, and scalable data store. We will store the parameters of our training jobs and deployment here and the Step Functions’ Lambda functions will query the parameters from this store. To change the parameters you just change the JSON string in the store. The example notebooks included with aws-sagemaker-build show you how to do this.
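As a rough illustration, the snippet below stores and retrieves the build parameters as a single JSON string. The parameter name is a hypothetical example; the notebooks included with aws-sagemaker-build show the names the stack actually uses.

```python
import json

def render_params(params: dict) -> str:
    """Build parameters live in Parameter Store as one JSON document."""
    return json.dumps(params, sort_keys=True)

def save_build_params(params: dict, name: str = "/sagebuild/params") -> None:
    """Write the JSON document to SSM (requires boto3 and AWS credentials)."""
    import boto3
    boto3.client("ssm").put_parameter(
        Name=name, Value=render_params(params), Type="String", Overwrite=True
    )

def load_build_params(name: str = "/sagebuild/params") -> dict:
    """Read the JSON document back, as the Step Functions' Lambdas do."""
    import boto3
    resp = boto3.client("ssm").get_parameter(Name=name)
    return json.loads(resp["Parameter"]["Value"])
```

Changing a parameter is then just a `save_build_params` call with the edited dictionary.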
Amazon Simple Notification Service
Amazon Simple Notification Service (Amazon SNS) is used for starting builds and for notifications. AWS CodeCommit, GitHub, and Amazon S3 can publish to a start-build SNS topic when a change is made. We also publish to a notifications SNS topic when the build starts, finishes, or fails. You can use these topics to connect aws-sagemaker-build to other systems.
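Starting a build from your own tooling is then a single SNS publish. A small sketch, assuming a placeholder topic ARN (in practice you read it from the stack outputs); the message body shown is illustrative, since a bare publish to the topic is enough to start a build:

```python
import json

def launch_publish_kwargs(topic_arn: str, reason: str) -> dict:
    """Build the kwargs for boto3's sns.publish(...) to the LaunchTopic."""
    return {"TopicArn": topic_arn, "Message": json.dumps({"reason": reason})}

kwargs = launch_publish_kwargs(
    "arn:aws:sns:us-east-1:123456789012:LaunchTopic",  # from your stack outputs
    "new training data uploaded",
)
# boto3.client("sns").publish(**kwargs)  # requires AWS credentials
```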
To deploy a model using Amazon SageMaker, you need to complete the following steps:
- If using custom algorithms, build the Docker images and upload them to Amazon ECR.
- Create an Amazon SageMaker training job and wait for it to complete.
- Create an Amazon SageMaker model.
- Create an Amazon SageMaker endpoint configuration.
- Create/update a SageMaker endpoint and wait for it to finish.
Those are the steps that aws-sagemaker-build will automate using Step Functions.
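For reference, here is what those steps look like as a plain boto3 script rather than a state machine. Names, instance types, and paths are illustrative, and step 1 (building the images with CodeBuild) is assumed to be done already.

```python
def endpoint_config_request(name: str, instance_type: str, count: int = 1) -> dict:
    """Kwargs for create_endpoint_config (step 4)."""
    return {
        "EndpointConfigName": name,
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": name,
            "InstanceType": instance_type,
            "InitialInstanceCount": count,
        }],
    }

def train_and_deploy(sm, name, training_image, inference_image, data_s3, role_arn,
                     instance_type="ml.m5.xlarge"):
    """`sm` is a boto3 SageMaker client: boto3.client("sagemaker")."""
    # Step 2: create a training job and wait for it to complete. The Step
    # Functions version polls instead of blocking; a script can just wait.
    sm.create_training_job(
        TrainingJobName=name,
        AlgorithmSpecification={"TrainingImage": training_image,
                                "TrainingInputMode": "File"},
        RoleArn=role_arn,
        InputDataConfig=[{"ChannelName": "train",
                          "DataSource": {"S3DataSource": {
                              "S3DataType": "S3Prefix", "S3Uri": data_s3}}}],
        OutputDataConfig={"S3OutputPath": data_s3 + "/output"},
        ResourceConfig={"InstanceType": instance_type, "InstanceCount": 1,
                        "VolumeSizeInGB": 10},
        StoppingCondition={"MaxRuntimeInSeconds": 86400},
    )
    sm.get_waiter("training_job_completed_or_stopped").wait(TrainingJobName=name)
    artifacts = sm.describe_training_job(
        TrainingJobName=name)["ModelArtifacts"]["S3ModelArtifacts"]
    # Step 3: create the model from the inference image and training artifacts.
    sm.create_model(ModelName=name, ExecutionRoleArn=role_arn,
                    PrimaryContainer={"Image": inference_image,
                                      "ModelDataUrl": artifacts})
    # Steps 4 and 5: endpoint configuration, then the endpoint itself.
    sm.create_endpoint_config(**endpoint_config_request(name, instance_type))
    sm.create_endpoint(EndpointName=name, EndpointConfigName=name)
    sm.get_waiter("endpoint_in_service").wait(EndpointName=name)

cfg = endpoint_config_request("demo", "ml.m5.xlarge")
```

A script like this works, but blocks for the whole build; the Step Functions version below survives restarts and long training runs for free.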
The following diagram describes the flow of the Step Functions state machine. There are several points where the state machine has to poll and wait for a task to complete.
The following diagram shows how the services work together.
The following CloudFormation template will create resources in your account, including an Amazon SageMaker notebook instance and an Amazon SageMaker endpoint, both of which you pay for by the hour.
Note: In order to prevent unnecessary charges, please tear down this stack when you are done!
Choose the “Launch Stack” button below to launch the aws-sagemaker-build CloudFormation template. Choose a name for your CloudFormation stack and leave all the other parameters at their defaults.
Once your stack has been created, follow these instructions:
- In the outputs of your stack choose the link next to NoteBookUrl
- In the Jupyter browser, choose the SageBuild folder to see the example notebooks for how to use aws-sagemaker-build.
Set up events and notifications
The CloudFormation stack can automatically create a CodeCommit repo and an S3 bucket that will launch a build when any updates happen. Do this by setting the repo and bucket trigger stack parameters (such as “BucketTriggerBuild”) to non-default values. You can have other events trigger rebuilds by publishing to the LaunchTopic SNS topic listed in the outputs of the CloudFormation template. To set up a GitHub repo to trigger rebuilds on changes, follow the instructions in this blog post. You can also have the TrainStatusTopic send you email or text updates by subscribing to it.
Set up an Alexa skill
The CloudFormation stack has an output named AlexaLambdaArn. You can use this Lambda function to create an Alexa skill to manage aws-sagemaker-build:
- Download the model definition JSON.
- The Lambda function is already configured with permissions to be called by Alexa.
- Create an Amazon Developer account if you don’t have one. This is different than your AWS account.
- Create the Alexa skill following these instructions:
- Log in to the Amazon Developer console and choose the “Alexa Skills Kit” tab.
- In the next screen choose “custom” for your skill type and give your skill a name.
- In the menu on the left choose “Invocation” and give your skill an invocation name like “sagebuild”.
- In the menu on the left choose “Endpoint,” copy the AlexaLambdaArn output from your aws-sagemaker-build stack, and paste it into the default region field under “AWS Lambda ARN.”
- In the menu on the left choose “JSON Editor,” then copy the model definition you downloaded and paste it into the editor.
- Choose “Save Model” and then “Build Model.”
You can now have a workflow where you push code changes to a repository (or upload new data), make some dinner, and periodically ask Alexa, “Alexa, ask SageBuild, ‘Is my build done?’.” I have done this and it is very awesome!
aws-sagemaker-build does not do any validation of your training. This means that if your training job does not fail, the model is deployed to the endpoint, even if it does not perform better than the current model. Your training job should contain logic to validate your model and cause the job to fail if necessary.
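One pattern for this validation logic, sketched below with a placeholder metric and threshold: raise an exception in your training script, so the container exits non-zero, SageMaker marks the job Failed, and the build never reaches deployment.

```python
def validate_or_fail(new_accuracy: float, baseline_accuracy: float) -> None:
    """Call at the end of your training script, before writing the model.

    Raising here makes the training job fail on purpose, so the existing
    endpoint keeps serving the current (better) model.
    """
    if new_accuracy <= baseline_accuracy:
        raise RuntimeError(
            f"Model accuracy {new_accuracy:.3f} does not beat baseline "
            f"{baseline_accuracy:.3f}; failing the training job on purpose."
        )

validate_or_fail(0.91, 0.88)  # passes silently; 0.80 vs 0.88 would raise
```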
aws-sagemaker-build supports four different configurations: Bring-Your-Own-Docker (BYOD), Amazon SageMaker algorithms, TensorFlow, and MXNet. The configuration is set as a parameter of the CloudFormation template but can be changed after deployment. For the TensorFlow and MXNet configurations the user scripts are copied and saved with version names so that rollbacks or redeployments of old versions work correctly. The notebook that is launched in the aws-sagemaker-build stack has examples of each configuration.
First, create a CodeCommit repo and an Amazon S3 data bucket. Then launch two aws-sagemaker-build stacks, both using the repo and the S3 bucket you just created. Set one stack to use the “master” branch and the other to use the “dev” branch.
Here is a diagram of what that architecture would look like:
Amazon CloudWatch Events
With Amazon CloudWatch Events you can publish to your stack’s LaunchTopic topic on a regular schedule (for example, every day at 5 PM, or once a week on Friday at 9 PM). You can use this in a workflow in which you have a smaller development dataset that you work with during the week. You push your tested changes to your code branch, and you only redeploy this branch at the end of the week. This way you’re not constantly training large models and replacing them, which can be very expensive.
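A sketch of the weekly trigger, assuming hypothetical rule and target names: a CloudWatch Events rule with a cron schedule expression whose target is the stack's LaunchTopic (the topic's policy must also allow events.amazonaws.com to publish).

```python
RULE_NAME = "sagebuild-weekly"  # illustrative rule name

def rule_kwargs(schedule: str = "cron(0 21 ? * FRI *)") -> dict:
    """Kwargs for events.put_rule(...): every Friday at 9 PM UTC.

    CloudWatch cron fields: minutes hours day-of-month month day-of-week year.
    """
    return {"Name": RULE_NAME, "ScheduleExpression": schedule}

def target_kwargs(launch_topic_arn: str) -> dict:
    """Kwargs for events.put_targets(...), pointing the rule at the topic."""
    return {"Rule": RULE_NAME,
            "Targets": [{"Id": "launch-topic", "Arn": launch_topic_arn}]}

# events = boto3.client("events")  # requires AWS credentials
# events.put_rule(**rule_kwargs())
# events.put_targets(**target_kwargs("arn:aws:sns:us-east-1:123456789012:LaunchTopic"))
```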
Conclusion and let us know what you think
If this blog post helps you or inspires you to solve a problem we would love to hear about it! We also have the code up on GitHub for you to use and extend. Contributions are always welcome!
About the Author
John Calhoun is a machine learning specialist for AWS Public Sector. He works with our customers and partners to provide leadership on machine learning, helping them shorten their time to value when using AWS.