Chaining Amazon SageMaker Ground Truth jobs to label progressively
Amazon SageMaker Ground Truth helps you build highly accurate training datasets for machine learning. It can reduce your labeling costs by up to 70% using automatic labeling.
This blog post explains the Amazon SageMaker Ground Truth chaining feature with a few examples and its potential in labeling your datasets. Chaining reduces time and cost significantly as Amazon SageMaker Ground Truth determines the objects that are already labeled and optimizes the data for automated data labeling mode. As a prerequisite, you might want to check the post “Creating hierarchical label taxonomies using Amazon SageMaker Ground Truth” that shows how to achieve multi-step hierarchical labeling and the documentation on how to use the augmented manifest functionality.
Chaining a labeling job
Chaining can help in the following scenarios:
- Partially completed labeling job – A labeling job in which the input manifest already contains a few labels and the remaining objects still need to be labeled.
- Failed labeling job – A labeling job in which you generated a few labels successfully and the rest of the labels either failed or expired.
- Stopped labeling job – A labeling job that a user stopped, which may have generated a few labels before stopping.
The chaining feature allows you to reuse these previous labels and get the remaining labels coherently. For more information, see Chaining labeling jobs.
Chaining uses the output from a previous job as the input for a subsequent job.
The following are the artifacts used to bootstrap the new chained labeling job:
- Output manifest file contents from the previous labeling job
- The model, if available
If you start a job from the Amazon SageMaker Ground Truth console, the LabelingJob name is used as the LabelAttributeName by default. For more information, see LabelAttributeName.
If you are chaining a partially completed job, the console uses the LabelAttributeName of the parent job to decide which object is already labeled and which is not, so that only unlabeled or previously failed objects are sent for labeling. You can override this behavior by providing a different LabelAttributeName, in which case the previous labels aren’t counted and a new labeling job sends all the data for labeling. This post describes this process in more detail later.
If you are using the API or SDK, you need to properly configure these fields, which this post describes later.
When you enable automated data labeling, Amazon SageMaker Ground Truth uses LabelAttributeName to decide which existing labels to use to start automated data labeling mode and to check whether you are eligible to train early. This lets you get the most out of machine learning with your existing labels, and it reduces the cost of labeling tasks because existing labels are reused instead of being sent to human labelers again.
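The role that LabelAttributeName plays can be illustrated with a minimal sketch. It is not Ground Truth's actual implementation; it only mimics the decision described above. It assumes that a labeled record carries a key equal to the label attribute name, and that a failed record carries a `failure-reason` entry under `<LabelAttributeName>-metadata`. The helper name `split_by_label` is hypothetical.

```python
import json


def split_by_label(manifest_lines, label_attribute_name):
    """Partition manifest records into (labeled, unlabeled) lists.

    Assumption: a record counts as labeled when it has the label attribute
    key and its metadata does not report a failure.
    """
    metadata_key = f"{label_attribute_name}-metadata"
    labeled, unlabeled = [], []
    for line in manifest_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        failed = "failure-reason" in record.get(metadata_key, {})
        if label_attribute_name in record and not failed:
            labeled.append(record)
        else:
            unlabeled.append(record)
    return labeled, unlabeled
```

Calling the function with a different label attribute name than the one used by the parent job puts every record in the unlabeled list, which mirrors the override behavior described above.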
The following diagram shows the workflow of this solution, which has three steps:
- Step 1: Building the initial unlabeled dataset
- Step 2: Launching a labeling job and stopping it (to simulate a stopped or failed status)
- Step 3: Chaining your first job
Step 1: Building the initial unlabeled dataset
The first step is to build the initial unlabeled dataset. For more information about this process, see Step 1 in Creating hierarchical label taxonomies using Amazon SageMaker Ground Truth.
This post uses the CBCL StreetScenes dataset, which contains approximately 3547 images. The full dataset is approximately 2 GB; you may choose to upload some or all of the dataset to S3 for labeling. Complete the following steps:
- Download the zip file.
- Extract the .zip archive to a folder. By default, the folder is Output.
- Create a small sample dataset to work with, or use the entire dataset.
For more information about creating an input manifest, see Step 2 in Creating hierarchical label taxonomies using Amazon SageMaker Ground Truth.
The lines in the manifest appear as the following code:
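As a minimal sketch of how such manifest lines could be generated, the following Python builds one JSON line per image, using the `source-ref` key that Ground Truth input manifests use to point at an object in S3. The bucket name, prefix, and file names are placeholders, and the helper `build_manifest_lines` is hypothetical.

```python
import json


def build_manifest_lines(bucket, prefix, file_names):
    """Return one JSON line per image, each pointing at its S3 location."""
    lines = []
    for name in sorted(file_names):
        # Keep only common image extensions; anything else is ignored.
        if not name.lower().endswith((".jpg", ".jpeg", ".png")):
            continue
        lines.append(json.dumps({"source-ref": f"s3://{bucket}/{prefix}/{name}"}))
    return lines


if __name__ == "__main__":
    # Placeholder bucket, prefix, and file names for illustration only.
    for line in build_manifest_lines("example-bucket", "streetscenes", ["a.jpg", "b.jpg"]):
        print(line)
```

Writing the returned lines to a file, one per line, gives a manifest that can be uploaded to S3 and used as the job input.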
Step 2: Launching a labeling job and stopping it
From the console, start a labeling job using the Image classification task type to classify pictures as a vehicle, traffic signal, or pedestrian. Use the previously created manifest file as the input and Streetscenes-Job1 as the job name. For more information about starting a labeling job, see Amazon SageMaker Ground Truth – Build Highly Accurate Datasets and Reduce Labeling Costs by up to 70%.
To simulate the stopped or failed state, this post manually stopped the job after 1000 labels.
The output of the labeling job is written to an augmented manifest with the corresponding label augmented in each of the JSON lines in the manifest. Some of these have labels and some do not. See the following code:
For more information about the format for different modalities, see Output Data.
Step 3: Chaining your first job
You can now chain Streetscenes-Job1. In Labeling jobs, from the Actions dropdown, choose Chain.
The console pre-populates the input dataset location as it fetches the output manifest from the previous stopped job. The label attribute name remains the same as the previous job.
After the job starts, the console shows the counter as 1000, which reflects the data already labeled.
After the job is complete, all labels are generated.
The following code is from the output manifest. All the lines in the output manifest have labels.
Chaining in a series
The previous scenarios only showed one level of chaining. Chaining is a powerful feature in which you can feed the output of one job as input to another.
Scenarios for chaining
The following table shows some of the scenarios with which you can experiment with chaining. AL indicates that automated data labeling mode is enabled. Non-AL indicates that automated data labeling mode is not enabled. For more information, see Annotate data for less with Amazon SageMaker Ground Truth and automated data labeling.
|Parent labeling job|Chained labeling job|Scenario|
|Non-AL|Non-AL|You started a labeling job in Non-AL mode and it failed or stopped before labeling all the objects. You want to resume the job in Non-AL mode to have a human label the remaining unlabeled objects.|
|Non-AL|AL|You started a labeling job in Non-AL mode and it failed or stopped before labeling all the objects. You want to resume the job in AL mode to label the remaining unlabeled objects automatically based on the existing labels.|
|AL|Non-AL|You started a labeling job in AL mode and it failed or stopped before labeling all the objects. You want to resume the job in Non-AL mode to have a human label the remaining unlabeled objects.|
|AL|AL|You started a labeling job in AL mode and it failed or stopped before labeling all the objects. You want to resume the job in AL mode to label the remaining unlabeled objects automatically based on the existing labels or pre-trained models.|
|Other sources|Non-AL|You acquired some labels through other sources (Amazon SageMaker Ground Truth or a third party) and have a manifest with labeled objects and unlabeled data. You want to start a new job in Non-AL mode to have a human label the remaining unlabeled objects.|
|Other sources|AL|You acquired some labels through other sources (Amazon SageMaker Ground Truth or a third party) and have a manifest with labeled objects and unlabeled data. You want to start a new job in AL mode to label the remaining unlabeled objects automatically based on the existing labels.|
In some of these scenarios, if you are in AL mode and the job stops after a model is generated, the subsequent AL job uses the model from the first step, which reduces training time. For more information, see Amazon SageMaker Ground Truth: Using a Pre-Trained Model for Faster Data Labeling.
Additionally, if enough pre-labeled objects are available, you can bootstrap these labels to be the training set for your automated labeling loop. This method saves on time and cost by not fetching labels from human annotators.
Using third-party labels
This section elaborates on the final two scenarios in the previous table. You can bring in third-party labels as long as they adhere to the Amazon SageMaker Ground Truth label format. For more information, see Output Data.
For example, assume you have a job in which 989 of the 3450 objects in the manifest already have third-party labels. You can start the labeling job with the following code, which contains third-party labels:
After the job starts, it automatically updates the counter.
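To make the third-party scenario more concrete, the following sketch converts externally produced class labels into manifest lines and leaves the remaining objects unlabeled. The metadata fields follow the image classification layout as we understand it from the Output Data documentation (a numeric class index plus a `-metadata` object); treat the exact field set as an assumption and verify it against that page. The helper names, the `third-party-import` job name, and the confidence value are hypothetical.

```python
import json


def to_labeled_record(source_ref, class_name, class_names,
                      label_attribute_name, creation_date):
    """Build one labeled record in the assumed image classification layout."""
    return {
        "source-ref": source_ref,
        # Assumption: the label value is the index of the class in the label list.
        label_attribute_name: class_names.index(class_name),
        f"{label_attribute_name}-metadata": {
            "class-name": class_name,
            "confidence": 1.0,  # placeholder; third-party tools may not report one
            "type": "groundtruth/image-classification",
            "job-name": "third-party-import",  # hypothetical name
            "human-annotated": "yes",
            "creation-date": creation_date,
        },
    }


def build_mixed_manifest(source_refs, third_party_labels, class_names,
                         label_attribute_name, creation_date):
    """Return manifest lines: labeled where a third-party label exists."""
    lines = []
    for ref in source_refs:
        if ref in third_party_labels:
            record = to_labeled_record(ref, third_party_labels[ref], class_names,
                                       label_attribute_name, creation_date)
        else:
            record = {"source-ref": ref}  # left for the new labeling job
        lines.append(json.dumps(record))
    return lines
```

The label attribute name passed here has to match the LabelAttributeName of the new labeling job; otherwise the imported labels are not counted.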
Time and cost savings
Chaining offers many time- and cost-saving benefits.
First, objects that are already labeled aren’t processed again. Additionally, if automated data labeling is enabled, auto labeling is attempted as soon as possible. If your data is already partially labeled, Amazon SageMaker Ground Truth first collects a validation set by sending work to a human workforce. It can then use the partially labeled input data as the training set and perform automated labeling, depending on the number of existing labels. This expedites the automated data labeling process: training starts sooner, which reduces the overall time.
Furthermore, skipping labeled objects reduces costs. Training costs are also reduced by using the ML model generated from your existing data.
Chaining using the API
You can also use the API or AWS CLI to do chaining. For more information, see create-labeling-job.
If you have a failed job and want to resume it, you need to enter the same create-labeling-job information as the failed job, with the same LabelAttributeName as the previous job, and use the output manifest file as the input in your chained job.
Similarly, if you want to chain the job for labeling all the objects with a different kind of label, you need to use a different LabelAttributeName than the one in the previous labeling job.
The following code is an example CLI for chaining:
This code uses the same label attribute name (label-attribute-name) as the first job.
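The same idea can be sketched with the AWS SDK for Python. The helper below, `build_chained_request` (a hypothetical name), takes the dictionary returned by `describe_labeling_job` for the parent job and derives the arguments for `create_labeling_job`: it reuses the parent's configuration, points the input at the parent's output manifest, and keeps the parent's LabelAttributeName unless you pass a different one. It assumes the parent description contains a `LabelingJobOutput` entry, and it passes the parent's model along when automated data labeling was configured and a model is available.

```python
import copy


def build_chained_request(parent, new_job_name, label_attribute_name=None):
    """Derive create_labeling_job arguments from a parent job description."""
    output = parent["LabelingJobOutput"]  # assumed to be present
    request = {
        "LabelingJobName": new_job_name,
        # Same name resumes labeling; a different name relabels everything.
        "LabelAttributeName": label_attribute_name or parent["LabelAttributeName"],
        "InputConfig": copy.deepcopy(parent["InputConfig"]),
        "OutputConfig": copy.deepcopy(parent["OutputConfig"]),
        "RoleArn": parent["RoleArn"],
        "HumanTaskConfig": copy.deepcopy(parent["HumanTaskConfig"]),
    }
    # The parent's output manifest becomes the chained job's input manifest.
    request["InputConfig"]["DataSource"]["S3DataSource"]["ManifestS3Uri"] = (
        output["OutputDatasetS3Uri"]
    )
    if "LabelCategoryConfigS3Uri" in parent:
        request["LabelCategoryConfigS3Uri"] = parent["LabelCategoryConfigS3Uri"]
    algorithms = copy.deepcopy(parent.get("LabelingJobAlgorithmsConfig"))
    if algorithms:
        model_arn = output.get("FinalActiveLearningModelArn")
        if model_arn:
            # Start automated labeling from the model the parent job produced.
            algorithms["InitialActiveLearningModelArn"] = model_arn
        request["LabelingJobAlgorithmsConfig"] = algorithms
    return request


# Usage sketch (requires boto3 and AWS credentials; the new job name is a placeholder):
# import boto3
# sm = boto3.client("sagemaker")
# parent = sm.describe_labeling_job(LabelingJobName="Streetscenes-Job1")
# sm.create_labeling_job(**build_chained_request(parent, "Streetscenes-Job1-chain"))
```

Keeping the request construction separate from the API call makes it easy to inspect the arguments before starting a job.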
This post demonstrated how the Amazon SageMaker Ground Truth chaining feature offers time-saving and cost-reduction benefits. This is a very powerful feature and this post merely scratches the surface of what Amazon SageMaker Ground Truth chaining can do. Let us know what you think in the comments. You can get started with Amazon SageMaker Ground Truth by visiting the Getting Started page in the documentation.
About the authors
Priyanka Gopalakrishna is a software engineer at Amazon AI. She works on building scalable solutions using distributed systems for machine learning. In her spare time, she loves to hike, catch up on things related to space sciences or read good old strips of Calvin and Hobbes.
Zahid Rahman is an SDE in AWS AI where he builds large-scale distributed systems to solve complex machine learning problems. He is primarily focused on innovating technologies that can ‘Divide and Conquer’ Big Data problems.