Optimizing TensorFlow model serving with Kubernetes and Amazon Elastic Inference
This post offers a dive deep into how to use Amazon Elastic Inference with Amazon Elastic Kubernetes Service. When you combine Elastic Inference with EKS, you can run low-cost, scalable inference workloads with your preferred container orchestration system.
Elastic Inference is an increasingly popular way to run low-cost inference workloads on AWS. It allows you to attach low-cost GPU-powered acceleration to Amazon EC2 and Amazon SageMaker instances to reduce the cost of running deep learning inference by up to 75%. Amazon EKS is also of growing importance to companies of all sizes, from startups to enterprises, for running containers on AWS. For those who are looking to run inference workloads in a managed Kubernetes environment, using Elastic Inference and EKS together opens the door to running these workloads with accelerated yet low-cost compute resources.
The example in this post shows how to use Elastic Inference and EKS together to deliver a cost-optimized, scalable solution for performing object detection on video frames. More specifically, it does the following:
- Run pods in EKS that read a video from Amazon S3
- Preprocess video frames
- Send the frames for object detection to a TensorFlow Serving pod modified to work with Elastic Inference.
This computationally intensive use case showcases the advantages of using Elastic Inference and EKS to achieve accelerated inference at low cost within a scalable, containerized architecture. For more information about this code, see Optimizing scalable ML inference workloads with Amazon Elastic Inference and Amazon EKS on GitHub.
Elastic Inference overview
Research by AWS indicates that inference can drive as much as 90% of the cost of running machine learning workloads, a much higher percentage than training models. However, using a GPU instance for inference often is wasteful because you’re typically not fully utilizing the instance.
Elastic Inference solves this by allowing you to attach just the right amount of low-cost GPU-powered acceleration to any Amazon EC2 or Amazon SageMaker CPU-based instance. This reduces inference costs by up to 75% because you no longer need to over-provision GPU compute for inference.
You can configure EC2 instances or Amazon SageMaker endpoints with Elastic Inference accelerators using the AWS Management Console, AWS CLI, the AWS SDK, AWS CloudFormation, or Terraform. Launching an instance with Elastic Inference provisions an accelerator in the same Availability Zone behind a VPC endpoint. The accelerator then attaches to the instance over the network. There are two prerequisites:
- First, provision an AWS PrivateLink VPC endpoint for the subnets where you plan to launch accelerators.
- Second, provide an instance role with a policy that allows actions on Elastic Inference accelerators.
You also will need to make sure your security groups are properly configured to allow traffic to and from the VPC endpoint and instance. For further details, see Setting Up to Launch Amazon EC2 with Elastic Inference.
Deep learning tools and frameworks enabled for Elastic Inference can automatically detect and offload the appropriate model computations to the attached accelerator. Elastic Inference supports TensorFlow, Apache MXNet, and ONNX models, with more frameworks coming soon. If you are a PyTorch user, you can convert your models to the ONNX format to enable usage of Elastic Inference. The example in this post uses TensorFlow.
Kubernetes is open-source software that allows you to deploy and manage containerized applications at scale. Kubernetes groups containers into logical groupings for management and discoverability, then launches them into clusters of EC2 instances. EKS makes it easy to deploy, manage, and scale containerized applications using Kubernetes on AWS.
More specifically, EKS provisions, manages, and scales the Kubernetes control plane for you. At a high level, Kubernetes consists of two major components: a cluster of worker nodes that run your containers, and the control plane that manages when and where containers start on your cluster and monitors their status.
Without EKS, you have to run both the Kubernetes control plane and the cluster of worker nodes yourself. By handling the control plane for you, EKS removes a substantial operational burden for running Kubernetes, and allows you to focus on building your application instead of managing infrastructure.
Additionally, EKS is secure by default, and runs the Kubernetes management infrastructure across multiple Availability Zones to eliminate a single point of failure. EKS is certified Kubernetes conformant so you can use existing tooling and plugins from partners and the Kubernetes community. Applications running on any standard Kubernetes environment are fully compatible, and you can easily migrate them to EKS.
Integrating Elastic Inference with EKS
This post’s example involves performing object detection on video frames. These frames are extracted from videos stored in S3. The following diagram shows the overall architecture:
Inference node container with TensorFlow Serving and Elastic Inference support
TensorFlow Serving (TFS) is the preferred way to serve TensorFlow models. Accordingly, the first step in this solution is to build an inference container with TFS and Elastic Inference support. AWS provides a TFS binary modified for Elastic Inference. Versions of the binary are available for several different versions of TFS, as well as Apache MXNet. For more information and links to the binaries, see Amazon Elastic Inference Basics.
In this solution, the inference Dockerfile gets the modified TFS binary from S3 and installs it, along with an object detection model. The model is a variant of MobileNet trained on the COCO dataset, published in the Tensorflow detection model zoo. The complete Dockerfile is available in the amazon-elastic-inference-eks GitHub repo, under the
Standard node container
In addition to an inference container with Elastic Inference-enabled TFS and an object detection model, you also need a separate standard node container which performs the bulk of the application tasks and gets predictions from the inference container. As a top-level summary, this standard node container performs the following tasks:
- Polls an Amazon SQS queue for messages regarding the availability of videos.
- Fetches the next available video from S3.
- Converts the video to individual frames.
- Batches some of the frames together and sends the batched frames to the model server container for inference.
- Processes the returned predictions.
The only aspect of the code that isn’t straightforward is the need to enable EC2 instance termination protection while workers are processing videos, as shown in the following code example:
After the job processes, a similar API call disables termination protection. This example application uses termination protection because the jobs are long-running, and you don’t want an EC2 instance terminated during a scale-in event if it is still processing a video.
You can easily modify the inference code and optimize it for your use case, so this post doesn’t spend further time examining it. To review the Dockerfile for the inference code, see the amazon-elastic-inference-eks GitHub repo, under the
/Dockerfile directory. The code itself is in the test.py file.
Kubernetes cluster details
The EKS cluster deployed in the sample CloudFormation template contains two distinct node groups by default. One node group contains M5 instances, which are currently the latest generation of general purpose instances, and the other node group contains C5 instances, which are currently the latest generation of compute-optimized instances. The instances in the C5 node group each have a single Elastic Inference accelerator attached.
Currently, Kubernetes doesn’t schedule pods using Elastic Inference accelerators. Accordingly, this example uses Kubernetes labels and selectors to distribute the inference workload to the resources in the cluster with attached Elastic Inference accelerators.
More specifically, to minimize the complexity of scheduling access to the Elastic Inference accelerator, the application and inference pods deploy as a DaemonSet with a selector, which ensures that each node with a defined label runs one copy of the application and inference on the instance. The sample application pulls job metadata from the SQS queue and then processes each one sequentially, so you don’t need to worry about multiple processes interacting with the Elastic Inference accelerator.
Additionally, the deployed cluster contains an Auto Scaling group that scales the nodes in the inference group in/out based upon the approximate depth of the SQS queue. Automatic scaling helps keep the inference node group sized appropriately to keep costs as low as possible. Depending on your workload, you also could consider using Spot Instances to keep your costs low.
Currently, SQS metrics update every five minutes, so you can trigger an AWS Lambda function using CloudWatch Events one time per minute to query the depth of the queue directly and update a custom CloudWatch metric.
Launching with AWS CloudFormation
To create the resources described in this post, you must run several AWS CLI commands. For more information about running and launching these resources, see the associated Makefile on GitHub. For instructions to create these resources using a CloudFormation template, see the README file in the GitHub repository.
Finally, you can see how much you saved on costs by using Elastic Inference rather than a full GPU instance.
By default, this solution uses a CPU instance of type c5.large with an attached accelerator of type eia1.medium for the inference nodes. The current On-Demand pricing for those resources is $0.085 per hour, plus $0.130 per hour, for a total of $0.215 per hour. The total cost compared to the pricing of the smallest current generation GPU instance is as follows:
- Elastic Inference solution – $0.215 per hour
- GPU instance p3.2xlarge – $3.06 per hour
To summarize, the Elastic Inference solution cost is less than 10% of the cost of using a full GPU instance.
Despite its much lower cost, the Elastic Inference solution in these tests can do real-time inference, processing video at a rate of almost 30 frames per second. This result is impressive, especially considering that there is room to optimize the code further. For more information about the cost comparison of Elastic Inference versus GPU instances, see Optimizing costs in Amazon Elastic Inference with TensorFlow. To achieve even greater costs savings, you can use Spot Instances for the CPU instances for up to a 90% discount for those instances compared to On-Demand prices.
Elastic Inference enables low-cost accelerated inference, and EKS makes it easy to run Kubernetes on AWS. You can combine the two to create a powerful, low-touch solution for running inexpensive accelerated inference workloads in a managed, scalable, and highly available Kubernetes cluster.
Many variations of this solution are possible on AWS. For example, instead of using EKS, you could use Amazon ECS, another managed container orchestration service on AWS. Alternatively, you could run the Kubernetes control plane yourself directly on EC2, or run containers directly on EC2 without Kubernetes. The choice is yours. AWS enables you to build the architecture that best suits your use case, tooling, and workflow preferences.
To get started, go to the CloudFormation console and create a stack using the CloudFormation template for this blog post’s example solution. Details can be found in the Launching with AWS CloudFormation section above, and in the related GitHub repository linked there.
About the Authors
Ryan Nitz is a software engineer and architect working on the Startup Solutions Architecture team.
Brent Rabowsky focuses on data science at AWS, and leverages his expertise to help AWS customers with their own data science projects.