Creating custom labeling jobs with AWS Lambda and Amazon SageMaker Ground Truth
Amazon SageMaker Ground Truth helps you build highly accurate training datasets for machine learning. It offers easy access to public and private human labelers, and provides them with built-in workflows and interfaces for common labeling tasks. Ground Truth can lower your labeling costs by up to 70% using automatic labeling. Automatic labeling works by training a model on human-labeled data, so that the service learns to label data independently.
In addition to built-in workflows, Ground Truth gives you the option to upload custom workflows. A custom workflow consists of an HTML interface that provides the human labelers with all of the instructions and required tools for completing the labeling tasks. You also create pre- and post-processing AWS Lambda functions:
- The pre-processing Lambda function helps customize input to the HTML interface.
- The post-processing Lambda function helps to process the data. For example, one of its primary uses is to host an accuracy improvement algorithm to tell Ground Truth how it should assess the quality of human-provided labels.
An algorithm is used to find consensus on what is “right” when the same data is provided to multiple human labelers. It also identifies and de-emphasizes those labelers who tend to provide poor quality data. You can upload the HTML interface and the pre- and post-processing Lambda functions using the Amazon SageMaker console.
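To make the idea of consensus concrete, here is a minimal sketch of a weighted majority vote. This is not the algorithm that Ground Truth uses. The function name, the worker IDs, the per-worker weights, and the class labels are all assumptions made up for illustration.

```python
from collections import defaultdict


def weighted_vote(labels_by_worker, worker_weights, default_weight=1.0):
    """Return the label with the highest total weight.

    labels_by_worker: dict mapping a worker ID to the label that worker chose.
    worker_weights: dict mapping a worker ID to a trust weight (hypothetical);
        a worker who tends to give poor labels would get a lower weight.
    """
    totals = defaultdict(float)
    for worker_id, label in labels_by_worker.items():
        totals[label] += worker_weights.get(worker_id, default_weight)
    # Iterate over sorted labels so that ties are broken alphabetically.
    return max(sorted(totals), key=lambda label: totals[label])


# Hypothetical votes: two low-weight workers disagree with one trusted worker.
votes = {"worker-a": "car", "worker-b": "truck", "worker-c": "truck"}
weights = {"worker-a": 1.0, "worker-b": 0.3, "worker-c": 0.4}
consensus = weighted_vote(votes, weights)
```

With equal weights the majority label wins. With the weights above, the single trusted worker outweighs the other two.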
To integrate successfully with the HTML interface, the pre- and post-processing Lambda functions should adhere to the input/output specifications laid out in the Creating Custom Labeling Workflows documentation. Setting up all the moving pieces and getting them to talk to each other successfully may take a few iterations.
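As a rough illustration of that input/output contract, the following sketch shows what a pre-processing function can look like. It assumes that the event carries a dataObject with either a source or a source-ref field, and that the response uses taskInput and isHumanAnnotationRequired. The taskObject key, the sample ARN, and the image name are placeholders chosen for this example, so check them against the specification and the deployed sample function.

```python
def lambda_handler(event, context):
    """Pre-processing sketch: pass the data object through to the HTML template."""
    data_object = event["dataObject"]
    # A manifest line carries either inline "source" text or a "source-ref" S3 URI.
    task_object = data_object.get("source", data_object.get("source-ref"))
    return {
        # "taskObject" is an assumed key name; the template must read the same key.
        "taskInput": {"taskObject": task_object},
        "isHumanAnnotationRequired": "true",
    }


# Hypothetical event, shaped like the input described in the specification.
sample_event = {
    "version": "2018-10-16",
    "labelingJobArn": "arn:aws:sagemaker:us-east-1:123456789012:labeling-job/example",
    "dataObject": {"source-ref": "s3://mybucket/datasets/streetscenes/image-0001.jpg"},
}
response = lambda_handler(sample_event, None)
```

Calling the handler locally with a sample event like this is a quick way to check the shape of the response before you wire it into a labeling job.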
In this post, I walk you through the process of setting up a custom workflow with a custom HTML template and sample Lambda functions. The sample Lambda functions can be found in the AWS Serverless Application Repository. These Lambda functions can be easily deployed to your AWS account and directly modified in the AWS Lambda Console. The source code is available in the aws-sagemaker-ground-truth-recipe GitHub repo.
For this post, you create a custom labeling job for instance segmentation. But first, deploy Lambda functions from the AWS Serverless Application Repository to your AWS account.
Import Lambda functions
On the Serverless Application Repository home page, select “Available applications” on the left-hand menu and search for Ground Truth. Choose aws-sagemaker-ground-truth-recipe.
On the application’s details page, choose Deploy. Make sure that the user has permissions to create IAM roles. If the user does not have permissions, this deployment fails.
It may take a few minutes to deploy this application. Wait until you see the status screen, which shows that four AWS resources (two Lambda functions and two IAM roles) have been created.
Now, you have successfully imported the Lambda functions used in the labeling job into your account. To modify these Lambda functions, select them and tweak the Python code.
Create a custom labeling job
Assume that there are millions of images taken from cameras mounted in cars driving on public roadways. These images are stored in an Amazon S3 bucket location called s3://mybucket/datasets/streetscenes/. To start a labeling job for instance segmentation, you first create a manifest to be fed to Ground Truth.
A manifest file lists the images to label, one JSON object per line. For more information, see Input Data.
Step 1: Download the example dataset
If you already have a manifest file for instance segmentation, skip this section.
For this example, I use the CBCL StreetScenes dataset. This dataset has over 3000 images, but I use a selection of just 10 images. The full dataset is approximately 2 GB. You can choose to upload all of the images to S3 for labeling or just a selection of them.
- Download the zip file and extract to a folder. By default, the folder is Output.
- Create a small sample dataset with which to work:
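One way to pick the sample is with a short script. The following is a minimal sketch, assuming the extracted images sit somewhere under the Output folder. The function name, the streetscenes-sample target folder, and the list of extensions are assumptions for this example.

```python
import shutil
from pathlib import Path


def make_sample(source_dir, sample_dir, count=10,
                extensions=(".jpg", ".jpeg", ".png")):
    """Copy the first `count` images (sorted by path) into sample_dir."""
    source = Path(source_dir)
    target = Path(sample_dir)
    target.mkdir(parents=True, exist_ok=True)
    # Collect the candidates before copying anything.
    images = sorted(
        path for path in source.rglob("*")
        if path.is_file() and path.suffix.lower() in extensions
    )
    copied = []
    for path in images[:count]:
        copied.append(Path(shutil.copy2(path, target / path.name)))
    return copied


# Example call (hypothetical folder names):
# make_sample("Output", "streetscenes-sample", count=10)
```

Sorting the paths keeps the selection reproducible, so that rerunning the script picks the same 10 images.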
In the S3 console, create the /streetscenes folder in your bucket. S3 is a key-value store, so there is no concept of folders. However, the S3 console gives you a sense of folder structure by using forward slashes in the key. You use the console to create the key.
Upload the sample images to your S3 bucket at s3://mybucket/datasets/streetscenes/. You can use the S3 console or the AWS CLI.
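The upload can also be scripted. The following sketch assumes that boto3 is installed and that AWS credentials are configured. The function names and the local folder are hypothetical, and mybucket is the placeholder bucket name used in this post.

```python
from pathlib import Path


def plan_uploads(local_dir, prefix="datasets/streetscenes/"):
    """Map each local file to the S3 key it would be uploaded under."""
    return [
        (path, prefix + path.name)
        for path in sorted(Path(local_dir).iterdir())
        if path.is_file()
    ]


def upload_sample(local_dir, bucket="mybucket", prefix="datasets/streetscenes/"):
    """Upload every file in local_dir; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so that plan_uploads works without boto3

    s3 = boto3.client("s3")
    for path, key in plan_uploads(local_dir, prefix):
        s3.upload_file(str(path), bucket, key)


# Example call (hypothetical local folder):
# upload_sample("streetscenes-sample")
```

Keeping the key planning separate from the upload lets you print and review the keys before anything is written to S3.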
Step 2: Create an input manifest
If you already have a manifest file for instance segmentation, skip this section.
In the Amazon SageMaker console, start the process by creating a labeling job.
Under input dataset location, choose Create manifest file. This tool helps you create the manifest by crawling an S3 location containing raw data (images or text).
For images, the crawler takes an input s3Prefix and crawls all of the image files with extensions .jpg, .jpeg, and .png in that prefix. It then creates a manifest with each line as follows:
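The crawl described above can be approximated with a few lines of Python. This sketch is not the console tool's implementation. It takes a plain list of object keys instead of calling S3, and the case-insensitive extension match is an assumption.

```python
import json

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def build_manifest(bucket, keys, prefix):
    """Return JSON Lines text with one source-ref entry per image under prefix."""
    lines = []
    for key in sorted(keys):
        # Keep only image files inside the prefix (extension match ignores case).
        if key.startswith(prefix) and key.lower().endswith(IMAGE_EXTENSIONS):
            lines.append(json.dumps({"source-ref": f"s3://{bucket}/{key}"}))
    return "\n".join(lines)


# Hypothetical object keys for illustration.
keys = [
    "datasets/streetscenes/a.jpg",
    "datasets/streetscenes/b.PNG",
    "datasets/streetscenes/readme.txt",
    "other/c.jpg",
]
manifest = build_manifest("mybucket", keys, "datasets/streetscenes/")
```

Each line of the result is a standalone JSON object with a single source-ref field, which is the shape that Ground Truth expects for image manifests.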
The Create manifest file link opens a modal window. Enter the S3 path to which you uploaded the image files, and make sure to include the trailing slash. Next, choose Create. It takes a few seconds to create the manifest. When the creation process is completed, choose Use this manifest.
In this example, the objects are images in S3, so you can use the crawling tool to create the initial manifest. Each line of JSON contains a field called source-ref pointing to the s3Uri value of an image. The contents of the created manifest file should look as follows:
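If you want to sanity-check a manifest before starting a job, a small validator helps. This is a sketch under the assumption that every line must be a JSON object with a source-ref field holding an S3 URI. The function name and the sample URIs are hypothetical.

```python
import json


def check_manifest(text):
    """Return the source-ref values, raising ValueError on a malformed line."""
    refs = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        entry = json.loads(line)  # raises a ValueError subclass on bad JSON
        ref = entry.get("source-ref") if isinstance(entry, dict) else None
        if not isinstance(ref, str) or not ref.startswith("s3://"):
            raise ValueError(f"line {number}: missing or invalid source-ref")
        refs.append(ref)
    return refs


sample = (
    '{"source-ref": "s3://mybucket/datasets/streetscenes/image-0001.jpg"}\n'
    '{"source-ref": "s3://mybucket/datasets/streetscenes/image-0002.jpg"}\n'
)
refs = check_manifest(sample)
```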
Step 3: Create a custom labeling job
Configure the following job settings:
- Labeling job name—This name must be unique in your account within an AWS Region.
- Input dataset S3 location—The location of the input manifest file that you created in Step 2.
- Output dataset S3 location—The location to which output data is written.
- IAM role —Sets permissions using an IAM policy. Make a note of this role, as you need it in Step 4.
Select the custom task type and choose Next.
For Workers, choose Private. For more information about the different workforce options, see Managing Your Workforce.
There are a number of labeling UI templates that you can use for setting up your own custom workflows. In this case, use the instance segmentation UI. For Templates, choose Instance Segmentation.
Modify the HTML code to look like the following. In the original template, you had three placeholders: src, header, and labels. I changed the header and labels fields. When tasks are created for workers using this template, Ground Truth provides the data to fill in the src placeholder field.
Next, for pre- and post-processing task Lambda function fields, select the Lambda functions that you imported earlier.
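The same settings can also be expressed as a request to the SageMaker CreateLabelingJob API. The following sketch only assembles the request. The function name, task title, description, time limit, and worker count are assumptions for illustration, and every ARN or URI is something you would supply from your own account.

```python
def build_labeling_job_request(
    job_name, manifest_uri, output_uri, role_arn,
    workteam_arn, template_uri, pre_lambda_arn, post_lambda_arn,
):
    """Assemble keyword arguments for the SageMaker CreateLabelingJob API."""
    return {
        "LabelingJobName": job_name,
        "LabelAttributeName": job_name,  # assumption: reuse the job name
        "InputConfig": {
            "DataSource": {"S3DataSource": {"ManifestS3Uri": manifest_uri}}
        },
        "OutputConfig": {"S3OutputPath": output_uri},
        "RoleArn": role_arn,
        "HumanTaskConfig": {
            "WorkteamArn": workteam_arn,
            "UiConfig": {"UiTemplateS3Uri": template_uri},
            "PreHumanTaskLambdaArn": pre_lambda_arn,
            "TaskTitle": "Instance segmentation",  # illustrative text
            "TaskDescription": "Draw a mask around each object in the image",
            "NumberOfHumanWorkersPerDataObject": 3,  # illustrative value
            "TaskTimeLimitInSeconds": 3600,  # illustrative value
            "AnnotationConsolidationConfig": {
                "AnnotationConsolidationLambdaArn": post_lambda_arn
            },
        },
    }


# With boto3 installed and credentials configured, the request could be sent as:
# boto3.client("sagemaker").create_labeling_job(**request)
```

Building the request as a plain dictionary makes it easy to review which Lambda function is attached to which stage before the job is submitted.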
Under Custom labeling task setup, choose Preview. Remember to allow pop-ups before attempting to preview the UI. If the page loads successfully without errors, you know that the pre-processing task Lambda function and the custom HTML template are working well together.
Step 4: Give execute permissions to the Amazon SageMaker role
In the previous step, while creating a Ground Truth labeling job, you created an IAM role. Ground Truth uses this IAM role to execute your labeling job. This role should trust the execution role of the post-processing Lambda function.
In the Lambda console, select the Lambda function that you previously imported. At the top of the page or under the Tags section, note the Amazon Resource Name (ARN). It should look like the following:
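Lambda function ARNs follow the general pattern arn:aws:lambda:region:account-id:function:function-name. If you want to check that you copied the right value, a small parser like the following sketch can help. The function name and the sample ARN are placeholders, not values from the deployed application.

```python
def parse_lambda_arn(arn):
    """Split a Lambda function ARN into its main components."""
    parts = arn.split(":")
    if (
        len(parts) < 7
        or parts[0] != "arn"
        or parts[2] != "lambda"
        or parts[5] != "function"
    ):
        raise ValueError(f"not a Lambda function ARN: {arn}")
    return {
        "partition": parts[1],
        "region": parts[3],
        "account_id": parts[4],
        "function_name": parts[6],
    }


# Placeholder ARN for illustration only.
info = parse_lambda_arn(
    "arn:aws:lambda:us-east-1:123456789012:function:example-function"
)
```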
Choose Execution role, Use an existing role, and view the role.
Copy the IAM role ARN.
In the IAM console, find the Amazon SageMaker execution role that you created. Choose Trust relationships, Edit trust relationship. Add the copied Lambda execution role to the trust relationship. The following code example shows the contents of the trust relationship.
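As a rough guide to what the edited trust relationship contains, the following sketch adds a role ARN to the Principal of a trust policy document. The starting policy, the role ARN, and the function name are assumptions for illustration. Your actual policy may list additional services or statements.

```python
import copy
import json


def add_trusted_role(trust_policy, role_arn):
    """Return a copy of trust_policy whose first statement also trusts role_arn."""
    updated = copy.deepcopy(trust_policy)
    principal = updated["Statement"][0].setdefault("Principal", {})
    existing = principal.get("AWS", [])
    if isinstance(existing, str):
        existing = [existing]
    if role_arn not in existing:
        existing = existing + [role_arn]
    # IAM accepts either a single string or a list of ARNs here.
    principal["AWS"] = existing if len(existing) > 1 else existing[0]
    return updated


# Assumed starting point: a role that trusts only the SageMaker service.
sagemaker_trust = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}
# Placeholder ARN standing in for the copied Lambda execution role.
lambda_role_arn = "arn:aws:iam::123456789012:role/example-lambda-execution-role"
print(json.dumps(add_trusted_role(sagemaker_trust, lambda_role_arn), indent=2))
```

The printed document can be compared with what you see in the Edit trust relationship editor in the IAM console.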
Choose Permissions, Attach policies. Select AWSLambdaFullAccess, and choose Attach Policy. After attaching the policy, your Permissions tab should look like the following screenshot:
Close the current tab.
Step 5: Submit the labeling job
In the main browser tab, which has your labeling job open, choose Submit. The labeling job is in progress. Wait for this job to complete.
If you have workers assigned to your private work team, instruct them to work on your tasks. If you have added yourself as a worker, complete the tasks in the private work team portal. For more information, see Managing a Private Workforce.
After the workers perform the labeling work, your output manifest looks like the following:
Expand one JSON line to view the annotations. You can see that three workers worked on the image and produced annotations:
- source-ref: The location of the image.
- workerId: The ID of the worker to whom the subsequent annotationData belongs. In this case, you can see three workerIds, which means three workers annotated this image.
- annotationData: The annotation result.
- gt-label-17-metadata: The metadata associated with the labeling job of which this image was a part.
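The fields above are written by the post-processing step. The following sketch shows the core of a pass-through consolidation that keeps every worker's annotation instead of merging them. It assumes the payload and response shapes described in Creating Custom Labeling Workflows. The annotationsFromAllWorkers key, the function name, and the sample data are assumptions, and a real Lambda function would first load the payload from the S3 location named in its event.

```python
def consolidate(payload, label_attribute_name):
    """Keep every worker's annotation for each data object, unmerged."""
    results = []
    for item in payload:
        annotations = [
            {
                "workerId": annotation["workerId"],
                "annotationData": annotation["annotationData"],
            }
            for annotation in item["annotations"]
        ]
        results.append({
            "datasetObjectId": item["datasetObjectId"],
            "consolidatedAnnotation": {
                "content": {
                    # Assumed key name for the list of per-worker results.
                    label_attribute_name: {"annotationsFromAllWorkers": annotations}
                }
            },
        })
    return results


# Hypothetical payload: one image annotated by three workers.
sample_payload = [
    {
        "datasetObjectId": "0",
        "dataObject": {"s3Uri": "s3://mybucket/datasets/streetscenes/image-0001.jpg"},
        "annotations": [
            {"workerId": f"worker-{n}", "annotationData": {"content": "{}"}}
            for n in range(3)
        ],
    }
]
consolidated = consolidate(sample_payload, "gt-label-17")
```

This is where you would plug in your own accuracy improvement logic, for example by replacing the pass-through list with a single merged annotation.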
In order to avoid incurring future charges:
- Make sure that your labeling job is marked as “Complete,” “Stopped,” or “Failed” in the Amazon SageMaker console.
- Delete the corresponding S3 bucket “mybucket” in Amazon S3.
- Delete the “serverlessrepo-aws-sagemaker-ground-truth-recipe” stack from the AWS CloudFormation console.
In this post, I started by deploying the pre- and post-processing Lambda functions from a Ground Truth app, using the AWS Serverless Application Repository. I then created a custom labeling job and configured it to use the imported Lambda functions.
These sample Lambda functions help you get a custom labeling job running quickly. You can add or modify them with your own logic, using the AWS Lambda Console.
About the Authors
Anjan Dash is a Software Development Engineer in AWS AI where he builds large scale distributed systems to solve complex machine learning problems. He is primarily focused on innovating technologies that can ‘Divide and Conquer’ Big Data problems. In his spare time, he loves spending time with family in outdoor activities.
Revekka Kostoeva is a Software Development Engineer intern at Amazon AI where she works on customer facing and internal solutions to expand the breadth of SageMaker Ground Truth services. As a researcher, she is driven to improve the tools of the trade to drive innovation forward.