Skip to main content


Learn About Our Meetup

5000+ Members



Join our meetup, learn, connect, share, and get to know your Toronto AI community. 



Browse through the latest deep learning, ai, machine learning postings from Indeed for the GTA.



Are you looking to sponsor space, be a speaker, or volunteer, feel free to give us a shout.

Running distributed TensorFlow training with Amazon SageMaker

TensorFlow is an open-source machine learning (ML) library widely used to develop heavy-weight deep neural networks (DNNs) that require distributed training using multiple GPUs across multiple hosts. Amazon SageMaker is a managed service that simplifies the ML workflow, starting with labeling data using active learning, hyperparameter tuning, distributed training of models, monitoring of training progression, deploying trained models as automatically scalable RESTful services, and centralized management of concurrent ML experiments.

This post focuses on distributed TensorFlow training using Amazon SageMaker.

Overview of concepts

While many of the distributed training concepts in this post are generally applicable across many types of TensorFlow models, this post focuses on distributed TensorFlow training for the Mask R-CNN model on the Common Object in Context (COCO) 2017 dataset.


The Mask R-CNN model is used for object instance segmentation, whereby the model generates pixel-level masks (Sigmoid binary classification) and bounding boxes (Smooth L1 regression) annotated with an object-category (SoftMax classification) to delineate each object instance in an image. Some common use cases for Mask R-CNN include perception in autonomous vehicles, surface defect detection, and analysis of geospatial imagery.

There are three key reasons for selecting the Mask R-CNN model for this post:

  1. Distributed data parallel training of Mask R-CNN on large datasets increases the throughput of images through the training pipeline and reduces training time.
  2. There are many open-source TensorFlow implementations available for the Mask R-CNN model. This post uses Tensorpack Mask/Faster-RCNN implementation as its primary example, but a highly optimized AWS Samples Mask-RCNN is recommended, as well.
  3. The Mask R-CNN model is submitted as part of MLPerf results as a heavy-weight object detection model.

The following graphic is a schematic outline of the Mask R-CNN deep neural network architecture.

Synchronized allreduce of gradients in distributed training

The central challenge in distributed DNN training is that the gradients computed during back propagation across multiple GPUs need to be allreduced (averaged) in a synchronized step before applying the gradients to update the model weights at multiple GPUs across multiple nodes.

The synchronized allreduce algorithm needs to be highly efficient; otherwise, you would lose any training speedup gained from distributed data-parallel training to the inefficiency of a synchronized allreduce step.

There are three key challenges to making a synchronized allreduce algorithm highly efficient:

  • The algorithm needs to scale with the increasing number of nodes and GPUs in the distributed training cluster.
  • The algorithm needs to exploit the topology of high-speed GPU-to-GPU interconnects within a single node.
  • The algorithm needs to efficiently interleave computations on a GPU with communications with other GPUs by efficiently batching the communications with other GPUs.

Uber’s open-source library Horovod addresses these three key challenges as follows:

  • Horovod offers a choice of highly efficient synchronized allreduce algorithms that scale with an increasing number of GPUs and nodes.
  • The Horovod library uses Nvidia Collective Communications Library (NCCL) communication primitives that exploit awareness of Nvidia GPU topology.
  • Horovod includes Tensor Fusion, which efficiently interleaves communication with computation by batching data communication for allreduce.

Horovod is supported with many ML frameworks, including TensorFlow. TensorFlow distribution strategies also use NCCL and provide an alternative to using Horovod to do distributed TensorFlow training. This post uses Horovod.

Training heavy-weight DNNs such as Mask R-CNN require high per GPU memory so you can pump one or more high-resolution images through the training pipeline. They also require high-speed GPU-to-GPU interconnect and high-speed networking interconnecting machines so synchronized allreduce of gradients can be done efficiently. Amazon SageMaker ml.p3.16xlarge and ml.p3dn.24xlarge instance types meet all these requirements. For more information, see Amazon SageMaker ML Instance Types. With eight Nvidia Tesla V100 GPUs, 128–256 GB GPU memory, 25–100 Gbps networking interconnect, and high-speed Nvidia NVLink GPU-to-GPU interconnect, they are ideally suited for distributed TensorFlow training on Amazon SageMaker.

Message Passing Interface

The next challenge in distributed TensorFlow training is the appropriate placement of training algorithm processes across multiple nodes, and associating each algorithm process with a unique global rank. Message Passing Interface (MPI) is a widely used collective communication protocol for parallel computing and is useful in managing a group of training algorithm worker processes across multiple nodes.

MPI is used to distribute training algorithm processes across multiple nodes and associate each algorithm process with a unique global and local rank. Horovod is used to logically pin an algorithm process on a given node to a specific GPU. The logical pinning of each algorithm process to a specific GPU is required for synchronized allreduce of gradients.

The key MPI concept to understand for this post is that MPI uses the mpirun command on a master node to launch concurrent processes across multiple nodes. Using MPI, the master host manages the lifecycle of distributed training processes running across multiple nodes centrally. To use MPI to do distributed training using Amazon SageMaker, you must integrate MPI with the native distributed training capabilities of Amazon SageMaker.

Integrating MPI with Amazon SageMaker distributed training

To understand how to integrate MPI with Amazon SageMaker distributed training, you need an understanding of the following concepts:

  • Amazon SageMaker requires the training algorithm and frameworks packaged in a Docker image.
  • The Docker image must be enabled for Amazon SageMaker training. This enablement is simplified through the use of Amazon SageMaker containers, which is a library that helps create Amazon SageMaker-enabled Docker images.
  • You need to provide an entry point script (typically a Python script) in the Amazon SageMaker training image to act as an intermediary between Amazon SageMaker and your algorithm code.
  • To start training on a given host, Amazon SageMaker runs a Docker container from the training image and invokes the entry point script with entry point environment variables that provide information such as hyperparameters and the location of input data.
  • The entry point script uses the information passed to it in the entry point environment variables to start your algorithm program with the correct args and polls the running algorithm process.
  • When the algorithm process exits, the entry point script exits with the exit code of the algorithm process. Amazon SageMaker uses this exit code to determine the success or failure of the training job.
  • The entry point script redirects the output of the algorithm process’ stdout and stderr to its own stdout. In turn, Amazon SageMaker captures the stdout from the entry point script and sends it to Amazon CloudWatch Logs. Amazon SageMaker parses the stdout output for algorithm metrics defined in the training job and sends the metrics to Amazon CloudWatch metrics.
  • When Amazon SageMaker starts a training job that requests multiple training instances, it creates a set of hosts and logically names each host as algo-k, where k is the global rank of the host. For example, if a training job requests four training instances, Amazon SageMaker names the hosts as algo-1, algo-2, algo-3, and algo-4. The hosts can connect on the network using these hostnames.

In the case of distributed training using MPI, you need a single mpirun command running on the master node (host) that controls the lifecycle of all algorithm processes distributed across multiple nodes, algo-1 through algo-n, where n is the number of training instances requested in your Amazon SageMaker training job. However, Amazon SageMaker is unaware of MPI or any other parallel processing framework you may use to distribute your algorithm processes across multiple nodes. Amazon SageMaker is going to invoke the entry point script on the Docker container running on each node. This means the entry point script needs to be aware of the global rank of its node and execute different logic depending on whether it is invoked on the master node or one of the non-master nodes.

Specifically, for the case of MPI, the entry point script invoked on the master node needs to run the mpirun command to start algorithm processes across all the nodes in the current Amazon SageMaker training job’s host set. The same entry point script when invoked by Amazon SageMaker on any of the non-master nodes periodically checks if the algorithm processes on the non-master node, which the mpirun command manages remotely from the master node, are still running, and exit when they are no longer running.

A master node in MPI is a logical concept, so it is up to the entry point script to designate a host from among all the hosts in the current training job host set as a master node. This designation has to be done in a decentralized manner. A simple approach is to designate algo-1 as the master node and all other hosts as non-master nodes. Because Amazon SageMaker provides each node its logical hostname in the entry point environment variables, it is straightforward for a node to decide if it is the master node or a non-master node.

The included in the accompanying GitHub repo and packaged in the Tensorpack Mask/Faster-RCNN algorithm Docker image follows the logic outlined in this section.

With the background of this conceptual understanding, you’re ready to proceed to the step-by-step tutorial on how to run distributed TensorFlow training for Mask R-CNN using Amazon SageMaker.

Solution overview

This tutorial has the following key steps:

  1. Use an AWS CloudFormation automation script to create a private Amazon VPC and create an Amazon SageMaker notebook instance network attached to this private VPC.
  2. From the Amazon SageMaker notebook instance, launch distributed training jobs in an Amazon SageMaker-managed Amazon VPC network attached to your private VPC. You can use Amazon S3, Amazon EFS, and Amazon FSx as data sources for the training data pipeline.


The following prerequisites are required:

  1. Create and activate an AWS Account or use an existing AWS account.
  2. Manage your Amazon SageMaker instance limits. You need a minimum of two ml.p3dn.24xlarge or two ml.p3.16xlarge instances; a service limit of four of each is recommended. Keep in mind that the service limit is specific to each AWS Region. This post uses us-west-2.
  3. Clone this post’s GitHub repo and complete the steps in this post. All paths in this post are relative to the GitHub repo root.
  4. Use any AWS Region that supports Amazon SageMaker, EFS, and Amazon FSx. This post uses us-west-2.
  5. Create a new S3 bucket or choose an existing bucket.

Creating an Amazon SageMaker notebook instance attached to a VPC

The first step is to run an AWS CloudFormation automation script to create an Amazon SageMaker notebook instance attached to a private VPC. To run this script, you need IAM user permissions consistent with the Network Administrator function. If you do not have such access, you may need to seek help from your network administrator to run the AWS CloudFormation automation script included in this tutorial. For more information, see AWS Managed Policies for Job Functions.

Use the AWS CloudFormation template cfn-sm.yaml to create an AWS CloudFormation stack that creates a notebook instance attached to a private VPC. You can either create the AWS CloudFormation stack using cfn-sm.yaml in AWS CloudFormation service console, or you can customize variables in script and run the script anywhere you have AWS CLI installed.

To use the AWS CLI approach, complete the following steps:

  1. Install AWS CLI and configure it.
  2. In, set AWS_REGION to your AWS Region and S3_BUCKET to your S3 bucket. These two variables are required.
  3. Optionally, set the EFS_ID variable if you want to use an existing EFS file system. If you leave EFS_ID blank, a new EFS file system is created. If you chose to use an existing EFS file system, make sure the existing file system does not have any existing mount targets. For more information, see Managing Amazon EFS File Systems.
  4. Optionally, specify GIT_URL to add a GitHub repo to the Amazon SageMaker notebook instance. If the GitHub repo is private, you can specify GIT_USER and GIT_TOKEN variables.
  5. Run the customized script to create an AWS CloudFormation stack using AWS CLI.

Save the summary output of the AWS CloudFormation script to use later. You can also view the output under the AWS CloudFormation Stack Outputs tab on the AWS Management Console.

Launching Amazon SageMaker training jobs

In the Amazon SageMaker console, open the notebook instance you created. In this notebook instance, there are three Jupyter notebooks available for training Mask R-CNN:

The training time performance for all three data source options is similar (though not identical) for this post’s choice of Mask R-CNN model and COCO 2017 dataset. The cost profile for each of the data sources is different. The following are differences in terms of the time it takes to set up the training data pipeline:

  • For the S3 data source, each time the training job launches, it takes approximately 20 minutes to replicate the COCO 2017 dataset from your S3 bucket to the storage volumes attached to each training instance.
  • For the EFS data source, it takes approximately 46 minutes to copy the COCO 2017 dataset from your S3 bucket to your EFS file system. You only need to copy this data one time. During training, data is input from the shared EFS file system mounted on all the training instances through a network interface.
  • For Amazon FSx, it takes approximately 10 minutes to create a new Amazon FSx Lustre file system and import the COCO 2017 dataset from your S3 bucket to the new Amazon FSx Lustre file system. You only need to do this one time. During training, data is input from the shared Amazon FSx Lustre file system mounted on all the training instances through a network interface.

If you are not sure which data source option is best for you, start with S3, and explore EFS or Amazon FSx if the training data download time at the start of each training job is not acceptable. Do not assume anything about training time performance for any of the data sources. Training time performance depends on many factors; it is best to experiment and measure it.

In all three cases, the logs and model checkpoints output during training are written to a storage volume attached to each training instance, and upload to your S3 bucket when training is complete. The logs are also fed into Amazon CloudWatch as training progresses that you can review during training. System and algorithm training metrics are fed into Amazon CloudWatch metrics during training, which you can visualize in the Amazon SageMaker service console.

Training results

The following graphs are example results for the two algorithms, after training for 24 epochs on the COCO 2017 dataset.

Below you can see the example results for TensorPack Mask/Faster-RCNN algorithm. The graphs below can be split into three buckets:

  1. Mean average precision (mAP) graphs for bounding box (bbox) prediction for various values of Intersection over Union (IoU), and small, medium, and large object sizes
  2. Mean average precision (mAP) graphs for object instance segmentation (segm) prediction for various values of Intersection over Union (IoU), and  small, medium, and large object sizes
  3. Other metrics related to training loss, or label accuracy

Below you can see the example results for the optimized AWS Samples Mask R-CNN algorithm. The converged mAP metrics shown in the graphs below are almost identical  to the previous algorithm, although the convergence progression is different.


Amazon SageMaker provides a Docker-based, simplified distributed TensorFlow training platform that allows you to focus on your ML algorithm and not be distracted by ancillary concerns such as the mechanics of infrastructure availability and scalability, and concurrent experiment management. When your model is trained, you can use the integrated model deployment capability of Amazon SageMaker to create an automatically scalable RESTful service endpoint for your model and start testing it. For more information, see Deploy a Model on Amazon SageMaker Hosting Services. When your model is ready, you can seamlessly deploy the model RESTful service into production.

About the Author

Ajay Vohra is a Principal Solutions Architect specializing in perception machine learning for autonomous vehicle development. Prior to Amazon, Ajay worked in the area of massively parallel grid-computing for financial risk modeling, and automation of application platform engineering in on-premise data centers.