Building a custom classifier using Amazon Comprehend

Amazon Comprehend is a natural language processing (NLP) service that uses machine learning (ML) to find insights and relationships in texts. Amazon Comprehend identifies the language of the text; extracts key phrases, places, people, brands, or events; and understands how positive or negative the text is. For more information about everything Amazon Comprehend can do, see Amazon Comprehend Features.

You may need out-of-the-box NLP capabilities tied to your needs without having to lead a research phase. This would allow you to recognize entity types and perform document classifications that are unique to your business, such as recognizing industry-specific terms and triaging customer feedback into different categories.

Amazon Comprehend is a perfect match for these use cases. In November 2018, Amazon Comprehend added the ability for you to train it to recognize custom entities and perform custom classification. For more information, see Build Your Own Natural Language Models on AWS (no ML experience required).

This post demonstrates how to build a custom text classifier that can assign a specific label to a given text. No prior ML knowledge is required.

About this blog post
Time to complete	1 hour for the reduced dataset ; 2 hours for the full dataset
Cost to complete	~ $50 for the reduced dataset ; ~ $150 for the full dataset These include training, inference and model management, see Amazon Comprehend pricing for more details.
Learning level	Advanced (300)
AWS services	Amazon Comprehend Amazon S3 AWS Cloud9

Prerequisites

To complete this walkthrough, you need an AWS account and access to create resources in AWS IAM, Amazon S3, Amazon Comprehend, and AWS Cloud9 within that account.

This post uses the Yahoo answers corpus cited in the paper Text Understanding from Scratch by Xiang Zhang and Yann LeCun. This dataset is available on the AWS Open Data Registry.

You can also use your own dataset. It is recommended that you train your model with up to 1,000 training documents for each label, and that when you select your labels, suggest labels that are clear and don’t overlap in meaning. For more information, see Training a Custom Classifier.

Solution overview

The walkthrough includes the following steps:

Preparing your environment
Creating an S3 bucket
Setting up IAM
Preparing data
Training the custom classifier
Gathering results

For more information about how to build a custom entity recognizer to extract information such as people and organization names, locations, time expressions, numerical values from a document, see Build a custom entity recognizer using Amazon Comprehend.

Preparing your environment

In this post, you use the AWS CLI as much as possible to speed up the experiment.

AWS Cloud9 is a cloud-based integrated development environment (IDE) that lets you write, run, and debug your code with a browser. It includes a code editor, debugger, and terminal. AWS Cloud9 comes pre-packaged with essential tools for popular programming languages and the AWS CLI pre-installed, so you don’t need to install files or configure your laptop for this workshop.

Your AWS Cloud9 environment has access to the same AWS resources as the user with which you logged in to the AWS Management Console.

To prepare your environment, complete the following steps:

On the console, under Services, choose AWS Cloud9.
Choose Create environment.
For Name, enter CustomClassifier.
Choose Next step.
Under Environment settings, change the instance type to t2.large.
Leave other settings at their defaults.
Choose Next step.
Review the environment settings and choose Create environment.

It can take up to a few minutes for your environment to be provisioned and prepared. When the environment is ready, your IDE opens to a welcome screen, which contains a terminal prompt.

You can run AWS CLI commands in this prompt the same as you would on your local computer.

To verify that your user is logged in, enter the following command:

Admin:~/environment $ aws sts get-caller-identity

You get the following output which indicates your account and user information:

{
    "Account": "123456789012",
    "UserId": "AKIAI53HQ7LSLEXAMPLE",
    "Arn": "arn:aws:iam::123456789012:user/Colin"
}

Record the account ID to use in the next step.

Keep your AWS Cloud9 IDE opened in a tab throughout this walkthrough.

Creating an S3 bucket

Use the account ID from the previous step to create a globally unique bucket name, such as 123456789012-customclassifier. Enter the following command in your AWS Cloud9 terminal prompt:

Admin:~/environment $ aws s3api create-bucket --acl private --bucket '123456789012-comprehend' --region us-east-1

The output shows the name of the bucket you created:

{
    "Location": "/123456789012-comprehend"
}

Setting up IAM

To authorize Amazon Comprehend to perform bucket reads and writes during the training or during the inference, you must grant Amazon Comprehend access to the S3 bucket that you created. You are creating a data access role in your account to trust the Amazon Comprehend service principal.

To set up IAM, complete the following steps:

On the console, under Services, choose IAM.
Choose Roles.
Choose Create role.
Select AWS service as the type of trusted entity.
Choose Comprehend as the service that uses this role.
Choose Next: Permissions.

The Policy named ComprehendDataAccessRolePolicy is automatically attached.

Choose Next: Review
For Role name, enter ComprehendBucketAccessRole.
Choose Create role.
Record the Role ARN.

You use this ARN when you launch the training of your custom classifier.

Preparing data

In this step, you download the corpus and prepare the data to match Amazon Comprehend’s expected formats for both training and inference. This post provides a script to help you achieve the data preparation for your dataset.

Alternatively, for even more convenience, you can download the prepared data by entering the following two command lines:

Admin:~/environment $ aws s3 cp s3://aws-ml-blog/artifacts/comprehend-custom-classification/comprehend-test.csv . 

Admin:~/environment $ aws s3 cp s3://aws-ml-blog/artifacts/comprehend-custom-classification/comprehend-train.csv .

If you follow the preceding step, skip the next steps and go directly to the upload part at the end of this section.

If you want to go through the dataset preparation for this walkthrough, or if you are using your own data follow the next steps:

Enter the following command in your AWS Cloud9 terminal prompt to download it from the AWS Open Data registry:

Admin:~/environment $ aws s3 cp s3://fast-ai-nlp/yahoo_answers_csv.tgz .

You see a progress bar and then the following output:

download: s3://fast-ai-nlp/yahoo_answers_csv.tgz to ./yahoo_answers_csv.tgz

Uncompress it with the following command:

Admin:~/environment $ tar xvzf yahoo_answers_csv.tgz

You should delete the archive because you are limited in available space in your AWS Cloud9 environment. Use the following command:

Admin:~/environment $ rm -f yahoo_answers_csv.tgz

You get a folder yahoo_answers_csv, which contains the following four files:

classes.txt
readme.txt
test.csv
train.csv

The files train.csv and test.csv contain the training samples as comma-separated values. There are four columns in them, corresponding to class index (1 to 10), question title, question content, and best answer. The text fields are escaped using double quotes (“), and any internal double quote is escaped by two double quotes (“”). New lines are escaped by a backslash followed with an “n” character, that is “n”.

The following code is the overview of file content:

"5","why doesn't an optical mouse work on a glass table?","or even on some surfaces?","Optical mice use an LED
"6","What is the best off-road motorcycle trail ?","long-distance trail throughout CA","i hear that the mojave
"3","What is Trans Fat? How to reduce that?","I heard that tras fat is bad for the body.  Why is that? Where ca
"7","How many planes Fedex has?","I heard that it is the largest airline in the world","according to the www.fe
"7","In the san francisco bay area, does it make sense to rent or buy ?","the prices of rent and the price of b

The file classes.txt contains the available labels.

The train.csv file contains 1,400,000 lines and test.csv contains 60,000 lines. Amazon Comprehend uses between 10–20% of the documents submitted for training to test the custom classifier.

The following command indicates that the data is evenly distributed:

Admin:~/environment $ awk -F '","' '{print $1}'  yahoo_answers_csv/train.csv | sort | uniq -c

You should train your model with up to 1,000 training documents for each label and no more than 1,000,000 documents.

With 20% of 1,000,000 used for testing, that is still plenty of data to train your custom classifier.

Use a shortened version of train.csv to train your custom Amazon Comprehend model, and use test.csv to perform your validation and see how well your custom model performs.

For training, the file format must conform to the following requirements:

File must contain one label and one text per line – 2 columns
No header
Format UTF-8, carriage return “n”.

Labels must be uppercase, can be multi-token, have white space, consist of multiple words connected by underscores or hyphens, or may even contain a comma, as long as it is correctly escaped.

The following table contains the formatted labels proposed for the training.

Index	Original	For training
1	Society & Culture	SOCIETY_AND_CULTURE
2	Science & Mathematics	SCIENCE_AND_MATHEMATICS
3	Health	HEALTH
4	Education & Reference	EDUCATION_AND_REFERENCE
5	Computers & Internet	COMPUTERS_AND_INTERNET
6	Sports	SPORTS
7	Business & Finance	BUSINESS_AND_FINANCE
8	Entertainment & Music	ENTERTAINMENT_AND_MUSIC
9	Family & Relationships	FAMILY_AND_RELATIONSHIPS
10	Politics & Government	POLITICS_AND_GOVERNMENT

When you want your custom Amazon Comprehend model to determine which label corresponds to a given text in an asynchronous way, the file format must conform to the following requirements:

File must contain one text per line
No header
Format UTF-8, carriage return “n”.

This post includes a script to speed up the data preparation. Enter the following command to copy the script to your local AWS Cloud9 environment:

Admin:~/environment $ aws s3 cp s3://aws-ml-blog/artifacts/comprehend-custom-classification/prepare_data.py .

To launch data preparation, enter the following commands:

Admin:~/environment $ sudo pip-3.6 install pandas tqdm
Admin:~/environment $ python3 prepare_data.py

This script is tied to the Yahoo corpus and uses the pandas library to format the training and testing datasets to match your Amazon Comprehend expectations. You may adapt it to your own dataset or change the number of items in the training dataset and validation dataset.

When the script is finished (it should run for approximately 11 minutes on a t2.large instance for the full dataset, and in under 5 minutes for the reduced dataset), you have the following new files in your environment:

comprehend-train.csv
comprehend-test.csv

Upload the prepared data (either the one you downloaded or the one you prepared) to the bucket you created with the following commands:

Admin:~/environment $ aws s3 cp comprehend-train.csv s3://123456789012-comprehend/

Admin:~/environment $ aws s3 cp comprehend-test.csv s3://123456789012-comprehend/

Training the custom classifier

You are ready to launch the custom text classifier training. Enter the following command, and replace the role ARN and bucket name with your own:

Admin:~/environment $ aws comprehend create-document-classifier --document-classifier-name "yahoo-answers" --data-access-role-arn arn:aws:iam:: 123456789012:role/ComprehendBucketAccessRole --input-data-config S3Uri=s3://123456789012-comprehend/comprehend-train.csv --output-data-config S3Uri=s3://123456789012-comprehend/TrainingOutput/ --language-code en

You get the following output that names the custom classifier ARN:

{
    "DocumentClassifierArn": "arn:aws:comprehend:us-east-1:123456789012:document-classifier/yahoo-answers"
}

It is an asynchronous call. You can then track the training progress with the following command:

Admin:~/environment $ aws comprehend describe-document-classifier --document-classifier-arn arn:aws:comprehend:us-east-1:123456789012:document-classifier/yahoo-answers

You get the following output:

{
    "DocumentClassifierProperties": {
        "DocumentClassifierArn": "arn:aws:comprehend:us-east-1: 123456789012:document-classifier/yahoo-answers",
        "Status": "TRAINING", 
        "LanguageCode": "en", 
        "DataAccessRoleArn": "arn:aws:iam:: 123456789012:role/ComprehendBucketAccessRole", 
        "InputDataConfig": {
            "S3Uri": "s3://123456789012-comprehend/comprehend-train.csv"
        },
        "SubmitTime": 1571657958.503, 
        "OutputDataConfig": {
            "S3Uri": "s3://123456789012-comprehend/TrainingOutput/123456789012-CLR-b205910479f3a195124bea9b70c4e2a9/output/output.tar.gz"
        }
    }
}

When the training is finished, you get the following output:

{
    "DocumentClassifierProperties": {
        "DocumentClassifierArn": "arn:aws:comprehend:us-east-1: 123456789012:document-classifier/yahoo-answers",
        "Status": "TRAINED", 
        "LanguageCode": "en", 
        "DataAccessRoleArn": "arn:aws:iam:: 123456789012:role/ComprehendBucketAccessRole", 
        "InputDataConfig": {
            "S3Uri": "s3://123456789012-comprehend/comprehend-train.csv"
        },
        "OutputDataConfig": {
            "S3Uri": "s3://123456789012-comprehend/TrainingOutput/123456789012-CLR-b205910479f3a195124bea9b70c4e2a9/output/output.tar.gz"
        },
        "SubmitTime": 1571657958.503,
        "EndTime": 1571661250.482,
        "TrainingStartTime": 1571658140.277
        "TrainingEndTime": 1571661207.915,
        "ClassifierMetadata": {
            "NumberOfLabels": 10,
            "NumberOfTrainedDocuments": 989873,
            "NumberOfTestDocuments": 10000,
            "EvaluationMetrics": {
                "Accuracy": 0.7235,
                "Precision": 0.722,
                "Recall": 0.7235,
                "F1Score": 0.7219
            }
        },
    }
}

The training duration may vary; in this case, the training took approximately one hour for the full dataset (20 minutes for the reduced dataset).

The output for the training on the full dataset shows that your model has a recall of 0.72—in other words, it correctly identifies 72% of given labels.

The following screenshot shows the view from the console (Comprehend > Custom Classification > yahoo-answers).

Gathering results

You can now launch an inference job to test how the classifier performs. Enter the following commands:

Admin:~/environment $ aws comprehend start-document-classification-job --document-classifier-arn arn:aws:comprehend:us-east-1:123456789012:document-classifier/yahoo-answers --input-data-config S3Uri=s3://123456789012-comprehend/comprehend-test.csv,InputFormat=ONE_DOC_PER_LINE --output-data-config S3Uri=s3://123456789012-comprehend/InferenceOutput/ --data-access-role-arn arn:aws:iam::123456789012:role/ComprehendBucketAccessRole

You get the following output:

{
    "JobStatus": "SUBMITTED", 
    "JobId": "cd5a6ee7f490a353e33f50d866d0317a"
}

Just as you did for the training progress tracking, because the inference is asynchronous, you can check the progress of the newly launched job with the following command:

Admin:~/environment $ aws comprehend describe-document-classification-job --job-id cd5a6ee7f490a353e33f50d866d0317a

You get the following output:

{
    "DocumentClassificationJobProperties": {
        "InputDataConfig": {
            "S3Uri": "s3://123456789012-comprehend/comprehend-test.csv", 
            "InputFormat": "ONE_DOC_PER_LINE"
        }, 
        "DataAccessRoleArn": "arn:aws:iam:: 123456789012:role/ComprehendBucketAccessRole", 
        "DocumentClassifierArn": "arn:aws:comprehend:us-east-1: 123456789012:document-classifier/yahoo-answers", 
        "JobStatus": "IN_PROGRESS", 
        "JobId": "cd5a6ee7f490a353e33f50d866d0317a", 
        "SubmitTime": 1571663022.503, 
        "OutputDataConfig": {
            "S3Uri": "s3://123456789012-comprehend/InferenceOutput/123456789012-CLN-cd5a6ee7f490a353e33f50d866d0317a/output/output.tar.gz"
        }
    }
}

When it is completed, JobStatus changes to COMPLETED. This takes approximately a few minutes to complete.

Download the results using OutputDataConfig.S3Uri path with the following command:

Admin:~/environment $ aws s3 cp s3://123456789012-comprehend/InferenceOutput/123456789012-CLN-cd5a6ee7f490a353e33f50d866d0317a/output/output.tar.gz .

When you uncompress the output (tar xvzf output.tar.gz), you get a .jsonl file. Each line represents the result of the requested classification for the corresponding line of the document you submitted.

For example, the following code is one line from the predictions:

{"File": "comprehend-test.csv", "Line": "9", "Classes": [{"Name": "ENTERTAINMENT_AND_MUSIC", "Score": 0.9685}, {"Name": "EDUCATION_AND_REFERENCE", "Score": 0.0159}, {"Name": "BUSINESS_AND_FINANCE", "Score": 0.0102}]}

This means that your custom model predicted with a 96.8% confidence score that the following text was related to the “Entertainment and music” label.

"What was the first Disney animated character to appear in color? n Donald Duck was the first major Disney character to appear in color, in his debut cartoon, "The Wise Little Hen" in 1934.nnFYI: Mickey Mouse made his color debut in the 1935 'toon, "The Band Concert," and the first color 'toon from Disney was "Flowers and Trees," in 1932."

Each line of results also provides the second and third possible labels. You might use these different scores to build your application upon applying each label with a score superior to 40% or changing the model if no single score is above 70%.

Summary

With the full dataset for training and validation, in less than two hours, you used Amazon Comprehend to learn 10 custom categories—and achieved a 72% recall on the test—and to apply that custom model to 60,000 documents.

Try custom categories now from the Amazon Comprehend console. For more information, see Custom Classification. You can discover other Amazon Comprehend features and get inspiration from other AWS blog posts about how to use Amazon Comprehend beyond classification.

Amazon Comprehend can help you power your application with NLP capabilities in almost no time. Happy experimentation!

About the Author

Hervé Nivon is a Solutions Architect who helps startup customers grow their business on AWS. Before joining AWS, Hervé was the CTO of a company generating business insights for enterprises from commercial unmanned aerial vehicle imagery. Hervé has also served as a consultant at Accenture.

Blog

Learn About Our Meetup

5000+ Members

MEETUPS

JOB POSTINGS

CONTACT

Building a custom classifier using Amazon Comprehend

Prerequisites

Solution overview

Preparing your environment

Creating an S3 bucket

Setting up IAM

Preparing data

Training the custom classifier

Gathering results

Summary

About the Author