Speed up training on Amazon SageMaker using Amazon EFS or Amazon FSx for Lustre file systems

Amazon SageMaker provides a fully managed service for data science and machine learning workflows. One of its most important capabilities is running fully managed training jobs to train machine learning models. Visit the service console to train machine learning models yourself on Amazon SageMaker.

Now, you can speed up your training job runs by training machine learning models from data stored in Amazon Elastic File System (EFS) or Amazon FSx for Lustre. Amazon EFS provides a simple, scalable, elastic file system for Linux-based workloads for use with AWS Cloud services and on-premises resources. Amazon FSx for Lustre is a high-performance file system optimized for workloads such as machine learning, analytics, and high-performance computing.

Training machine learning models requires providing the training datasets to the training job. When you use Amazon Simple Storage Service (S3) as the training data source in file input mode, all training data is downloaded from Amazon S3 to the EBS volumes attached to the training instances at the start of the training job. A distributed file system such as Amazon EFS or Amazon FSx for Lustre can speed up machine learning training by eliminating this download step.

In this blog post, we go over the benefits of training your models using a file system, provide information to help you choose a file system, and show you how to get started.

Choosing a file system for training models on SageMaker

When deciding whether to train your machine learning models from a file system, the first question to ask is: where does your training data reside now?

If your training data is already in Amazon S3 and you do not need faster training times for your training jobs, you can get started with Amazon SageMaker with no need for data movement. However, if you need faster startup and training times, we recommend that you take advantage of Amazon SageMaker’s integration with the Amazon FSx for Lustre file system, which can speed up your training jobs by serving as a high-speed cache.

The first time you run a training job, if Amazon FSx for Lustre is linked to Amazon S3, it automatically loads data from Amazon S3 and makes it available to Amazon SageMaker at hundreds of gigabytes per second and submillisecond latencies. Subsequent iterations of your training job then have immediate access to the data already in Amazon FSx. Amazon FSx therefore offers the most benefit to training jobs that run several iterations, each of which would otherwise require a download from Amazon S3, and to workflows where training jobs must be run several times using different training algorithms or parameters to see which gives the best result.

If your training data is already in an Amazon EFS file system, we recommend choosing Amazon EFS as the file system data source. This choice has the benefit of directly launching your training jobs from the data in Amazon EFS with no data movement required, resulting in faster training start times. This is often the case in environments where data scientists have home directories in Amazon EFS, and are quickly iterating on their models by bringing in new data, sharing data with colleagues, and experimenting with which fields or labels to include. For example, a data scientist can use a Jupyter notebook to do initial cleansing on a training set, launch a training job from Amazon SageMaker, then use their notebook to drop a column and re-launch the training job, comparing the resulting models to see which works better.
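The guidance in this section can be summarized as a small decision helper. This is only a sketch of the recommendations above, not an official tool: the function name `recommend_data_source` and its two parameters are hypothetical, and real decisions will also depend on cost and on how your data is organized.

```python
def recommend_data_source(data_location: str, needs_faster_training: bool) -> str:
    """Return 'S3', 'FSxLustre', or 'EFS' following the guidance in this post.

    data_location: where the training data lives today, 's3' or 'efs'.
    needs_faster_training: True if faster startup and training times matter.
    """
    location = data_location.strip().lower()
    if location == "efs":
        # Data already in Amazon EFS: train from it directly, no data movement.
        return "EFS"
    if location == "s3":
        # Data in Amazon S3: stay on S3 unless faster startup is needed,
        # in which case FSx for Lustre linked to the bucket acts as a cache.
        return "FSxLustre" if needs_faster_training else "S3"
    raise ValueError(f"unsupported data location: {data_location!r}")


print(recommend_data_source("s3", needs_faster_training=True))
print(recommend_data_source("efs", needs_faster_training=False))
```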

Getting started with Amazon FSx for training on Amazon SageMaker

1. Note your training data Amazon S3 bucket and path.
2. Launch an Amazon FSx file system with the desired size and throughput, and reference the training data Amazon S3 bucket and path. Once it is created, note your file system ID.
3. Go to the Amazon SageMaker console and open the Training jobs page to create the training job, associate VPC subnets and security groups, and provide the file system as the data source for training.
4. Create your training job:
   1. Provide the ARN for the IAM role with the required access control and permissions policy. Refer to AmazonSageMakerFullAccess for details.
   2. Specify a VPC that your training jobs and file system have access to. Also, verify that your security groups allow Lustre traffic over port 988 to control access to the training dataset stored in the file system. For more details, refer to Getting started with Amazon FSx.
   3. Choose file system as the data source and properly reference your file system ID, path, and format.
5. Launch your training job.
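If you create training jobs programmatically instead of through the console, the file system is described in the input data configuration of the request. The sketch below builds one input channel as a plain dictionary. The field names follow the `FileSystemDataSource` structure of the `CreateTrainingJob` API as we understand it, so check the API reference before relying on them. The helper name `file_system_channel`, the file system ID, and the directory path are hypothetical placeholders.

```python
import json

VALID_TYPES = {"EFS", "FSxLustre"}
VALID_ACCESS_MODES = {"ro", "rw"}


def file_system_channel(file_system_id, directory_path,
                        file_system_type="FSxLustre",
                        channel_name="train", access_mode="ro"):
    """Build one InputDataConfig entry that points at a file system."""
    if file_system_type not in VALID_TYPES:
        raise ValueError(f"unknown file system type: {file_system_type!r}")
    if access_mode not in VALID_ACCESS_MODES:
        raise ValueError(f"unknown access mode: {access_mode!r}")
    if not directory_path.startswith("/"):
        raise ValueError("directory_path must be an absolute path")
    return {
        "ChannelName": channel_name,
        "DataSource": {
            "FileSystemDataSource": {
                "FileSystemId": file_system_id,
                "FileSystemAccessMode": access_mode,  # read-only by default
                "FileSystemType": file_system_type,
                "DirectoryPath": directory_path,
            }
        },
    }


# Placeholder ID and path; substitute the values you noted in the steps above.
channel = file_system_channel("fs-0123456789abcdef0", "/fsx/training-data")
print(json.dumps(channel, indent=2))
```

The resulting dictionary would be passed as one element of the `InputDataConfig` list in the training job request. Read-only access is the default here because a training job normally only needs to read the dataset.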

Getting started with Amazon EFS for training on Amazon SageMaker

1. Put your training data in its own directory in Amazon EFS.
2. Go to the Amazon SageMaker console and open the Training jobs page to create the training job, associate VPC subnets and security groups, and provide the file system as the data source for training.
3. Create your training job:
   1. Provide the ARN for the IAM role with the required access control and permissions policy.
   2. Specify a VPC that your training jobs and file system have access to. Also, verify that your security groups allow NFS traffic over port 2049 to control access to the training dataset stored in the file system.
   3. Choose file system as the data source and properly reference your file system ID, path, and format.
4. Launch your training job.
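The VPC and security group step can also be expressed as data. The sketch below builds the `VpcConfig` section of a training job request and an inbound rule, in the `IpPermissions` shape used by the Amazon EC2 security group APIs, that lets the training job's security group reach the file system. The ports are the ones named in this post (2049 for NFS, 988 for Lustre); your file system may need additional rules, so treat this as a starting point. The helper names and all resource IDs are hypothetical.

```python
# Ports taken from the steps in this post.
FILE_SYSTEM_PORTS = {"EFS": 2049, "FSxLustre": 988}


def ingress_rule(file_system_type, training_security_group_id):
    """Inbound rule for the file system's security group.

    Allows TCP traffic on the file system port from the security group
    attached to the training job.
    """
    port = FILE_SYSTEM_PORTS[file_system_type]
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "UserIdGroupPairs": [{"GroupId": training_security_group_id}],
    }


def vpc_config(subnet_ids, security_group_ids):
    """VpcConfig section of a training job request."""
    if not subnet_ids or not security_group_ids:
        raise ValueError("at least one subnet and one security group are required")
    return {
        "Subnets": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
    }


# Placeholder IDs for illustration only.
print(ingress_rule("EFS", "sg-0123456789abcdef0"))
print(vpc_config(["subnet-0123456789abcdef0"], ["sg-0123456789abcdef0"]))
```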

After your training job completes, you can view the status history of the training job to observe the faster download time when using a file system data source.
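The same status history is available programmatically. A `DescribeTrainingJob` response includes a `SecondaryStatusTransitions` list whose entries carry a status name with start and end times, and the sketch below sums the time spent in each status. The helper name `phase_durations` is hypothetical, and the sample response uses invented timestamps purely to exercise the function. They are not measurements.

```python
from datetime import datetime, timedelta, timezone


def phase_durations(description):
    """Sum the time spent in each secondary status of a training job.

    description: a dict shaped like a DescribeTrainingJob response.
    Transitions without both a start and an end time are skipped.
    """
    durations = {}
    for transition in description.get("SecondaryStatusTransitions", []):
        start = transition.get("StartTime")
        end = transition.get("EndTime")
        if start is None or end is None:
            continue  # phase still in progress
        status = transition["Status"]
        durations[status] = durations.get(status, timedelta()) + (end - start)
    return durations


# Invented timestamps, for illustration only.
t0 = datetime(2020, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
sample = {
    "SecondaryStatusTransitions": [
        {"Status": "Starting", "StartTime": t0,
         "EndTime": t0 + timedelta(seconds=90)},
        {"Status": "Downloading", "StartTime": t0 + timedelta(seconds=90),
         "EndTime": t0 + timedelta(seconds=120)},
        {"Status": "Training", "StartTime": t0 + timedelta(seconds=120)},
    ]
}
for status, spent in phase_durations(sample).items():
    print(status, spent)
```

Comparing the `Downloading` duration of a job that reads from Amazon S3 in file input mode with one that reads from a file system is one way to see the difference described above.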

Summary

With the addition of Amazon EFS and Amazon FSx for Lustre as data sources for training machine learning models in Amazon SageMaker, you now have greater flexibility to choose a data source that is suited to your use case. In this blog post, we used a file system data source to train machine learning models, resulting in faster training start times by eliminating the data download step.

Get started training machine learning models yourself on Amazon SageMaker, or, to learn more, refer to our sample notebook, which trains a linear learner model using a file system data source.

About the Authors

Vidhi Kastuar is a Sr. Product Manager for Amazon SageMaker, focusing on making machine learning and artificial intelligence simple, easy to use and scalable for all users and businesses. Prior to AWS, Vidhi was Director of Product Management at Veritas Technologies. For fun outside work, Vidhi loves to sketch and paint, work as a career coach, and spend time with his family and friends.

Will Ochandarena is a Principal Product Manager on the Amazon Elastic File System team, focusing on helping customers use EFS to modernize their application architectures. Prior to AWS, Will was Senior Director of Product Management at MapR.
