Associating prediction results with input data using Amazon SageMaker Batch Transform
When you run predictions on large datasets, you may want to drop some input attributes before running the predictions. This is because those attributes don’t carry any signal or were not part of the dataset used to train your machine learning (ML) model. Similarly, it can be helpful to map the prediction results to all or part of the input data for analysis after the job is complete.
For example, consider a dataset that comes with an ID attribute. Commonly, an observation ID is a randomly generated or sequential number that carries no signal for a given ML problem. For this reason, it is usually not part of the training data attributes. However, when you make batch predictions, you may want your output to contain both the observation ID and the prediction result as a single record.
The Batch Transform feature in Amazon SageMaker enables you to run predictions on datasets stored in Amazon S3. Previously, you had to filter your input data before creating your batch transform job and join prediction results with desired input fields after the job was complete. Now, you can use Amazon SageMaker Batch Transform to exclude attributes before running predictions. You can also join the prediction results with partial or entire input data attributes when using data that is in CSV, text, or JSON format. This eliminates the need for any additional pre-processing or post-processing and accelerates the overall ML process.
This post demonstrates how you can use this new capability to filter input data for a batch transform job in Amazon SageMaker and join the prediction results with attributes from the input dataset.
Background
Amazon SageMaker is a fully managed service that covers the entire ML workflow. The service labels and prepares your data, chooses an algorithm, trains the model, tunes and optimizes it for deployment, makes predictions, and takes action.
Amazon SageMaker manages the provisioning of resources at the start of batch transform jobs. It releases the resources when the jobs are complete, so you pay only for what was used during the execution of your job. When the job is complete, Amazon SageMaker saves the prediction results in an S3 bucket that you specify.
Batch transform example
Use the public data set for breast cancer detection from UCI and train a binary classification model to detect whether a given tumor is likely to be malignant (1) or benign (0). This dataset comes with an ID attribute for each tumor, which you exclude during training and prediction. However, you bring it back in your final output and record it with the predicted probability of malignancy for each tumor from the batch transform job.
You can also download the companion Jupyter notebook. Each of the following sections in the post corresponds to a notebook section so that you can run the code for each step as you read along.
Setup
First, import common Python libraries for ML such as pandas and NumPy, along with the Amazon SageMaker and Boto3 libraries that you later use to run the training and batch transform jobs.
Also, set up your S3 bucket for uploading your training data, validation data, and the dataset against which you run the batch transform job. Amazon SageMaker stores the model artifact in this bucket, as well as the output of the batch transform job. Use a folder structure to keep the input datasets separate from the model artifacts and job outputs.
Data preparation
Download the public data set onto the notebook instance and look at a sample for preliminary analysis. Although the dataset for this example is small (with 569 observations and 32 columns), you can use the Amazon SageMaker Batch Transform feature on large datasets with petabytes of data.
In the following table, the first column in your dataset is the ID of the tumors and the second is the diagnosis (M for malignant or B for benign). In the context of supervised learning, this is your target, or what you want to be able to predict. The following attributes are the features also known as predictors.
id | diagnosis | radius_mean | texture_mean | perimeter_mean | … | concave points_worst | symmetry_worst | fractal_dimension_worst | |
288 | 8913049 | B | 11.26 | 19.96 | 73.72 | … | 0.09314 | 0.2955 | 0.07009 |
375 | 901303 | B | 16.17 | 16.07 | 106.3 | … | 0.1251 | 0.3153 | 0.0896 |
467 | 9113514 | B | 9.668 | 18.1 | 61.06 | … | 0.025 | 0.3057 | 0.07875 |
203 | 87880 | M | 13.81 | 23.75 | 91.56 | … | 0.2013 | 0.4432 | 0.1086 |
148 | 86973702 | B | 14.44 | 15.18 | 93.97 | … | 0.1599 | 0.2691 | 0.07683 |
118 | 864877 | M | 15.78 | 22.91 | 105.7 | … | 0.2034 | 0.3274 | 0.1252 |
224 | 8813129 | B | 13.27 | 17.02 | 84.55 | … | 0.09678 | 0.2506 | 0.07623 |
364 | 9010877 | B | 13.4 | 16.95 | 85.48 | … | 0.06987 | 0.2741 | 0.07582 |
After doing some minimal data preparation, split the data into three sets:
- A training set consisting of 80% of your original data.
- A validation set for your algorithm to perform the proper evaluation of the model.
- A batch set that you set aside for now and use later to run a batch transform job using the new I/O join feature.
To train and validate the model, keep all the features, such as radius_mean
, texture_mean
, perimeter_mean
, and so on. Drop the id
attribute, because it has no relevance in determining whether a tumor is malignant.
When you have a trained model in production, you typically want to run predictions against it. One way to do that is to deploy the model for real-time predictions using the Amazon SageMaker hosting services.
However, in your case, you do not need real-time predictions. Instead, you have a backlog of tumors as a .csv file in Amazon S3, which consists of a list of tumors identified by their ID. Use a batch transform job to predict, for each tumor, the probability of being malignant. To create this backlog of tumors here, make a batch set with the id
attribute but without the diagnosis
attribute. That’s what you’re trying to predict with your batch transform job.
The following code example shows the configuration of data between the three datasets:
Finally, upload these three datasets to S3.
Training job
Use the Amazon SageMaker XGBoost built-in algorithm to quickly train a model for binary classification based on your training and validation datasets. Set the training objective to binary:logistic, which trains XGBoost to output the probability that an observation belongs to the positive class (malignant in this example), as shown in the following code example:
Batch transform
Use the Python SDK to kick off the batch transform job to run inferences on your batch dataset and store your inference results in S3.
Your batch dataset contains the id
attribute in the first column. As I explained earlier, because this attribute was not used for training, you must use the new input filter capability. Also, before the I/O join feature, the output of the batch transform job would have been a list of probabilities, such as the following:
It would have required some post-inference logic to map these probabilities to their corresponding input tumor. The Batch Transform filters make this easy.
Specify the input as the join source. Then, specify an output filter to indicate that you do not require the entire input (which is the ID followed by the 30 features). Instead, you want to present the tumor ID and its probability of being malignant. And you only want to show the first (id) and the last (inference result
) columns.
The Batch Transform filters use JSONPath expressions to selectively extract the input or output data, based on your needs. For the supported JSONPath operators in Batch Transform, see this link.
JSONPath is developed to work with JSON data. To use it on CSV data, consider a CSV row as a JSON array with a zero-based index. For example, applying ‘$[0,1]’ on a row of ‘8810158, B, 13.110, 22.54, 87.02’ returns the first and second column of the row, which is ‘8810158, B’.
In this example, the input filter is “$[1:]”. You are excluding column 0 (id) before processing the inferences, and keeping everything from column 1 to the last column
(all the features or predictors). The output filter is “$[0,-1].” When presenting the output, you only want to keep column 0 (id
) and the last (-1) column (inference_result
), which is the probability of a given tumor to be malignant.
You could also consider bringing back input columns other than id. In your current example, where you happen to have the ground truth diagnosis, you could consider bringing it back in your output file next to the prediction. That way, you could do side-by-side comparisons and evaluate your predictions.
Result
Read the CSV output in S3 at your output location:
It should show the list of tumors identified by their ID and their corresponding probabilities of being malignant. Your file should look like the following:
Conclusion
This post demonstrated how you can provide input and output filters for your batch transform jobs using the Amazon SageMaker Batch Transform feature. This eliminates the need to pre-process or post-process input and output data respectively. In addition, you can associate prediction results with their corresponding input data with the flexibility of keep all or part of the input data attributes. To learn more about this feature, see the Amazon SageMaker Developer Guide.
About the Authors
Ro Mullier is a Sr. Solutions Architect at AWS helping customers run a variety of applications on AWS and machine learning workloads in particular. In his spare time, he enjoy spending time with family and friends, playing soccer and competing in machine learning competitions.
Han Wang is a Software Development Engineer at AWS AI. She focuses on developing highly scalable and distributed machine learning platforms. In her spare time, she enjoys watching movies, hiking and playing “Zelda: Breath of the wild”.