Shopper Sentiment: Analyzing in-store customer experience

Retailers have used in-store video to analyze customer behavior and demographics for many years. Separate systems are commonly used for different tasks. For example, one system counts the customers moving through a store and records which parts of the store they linger in and which products they linger near. Another system holds the store layout, while yet another records transactions. Historically, joining these data sources to gain insights that could drive more sales has required complex algorithms and data structures, which need significant investment to deliver and incur ongoing maintenance costs.

In this blog post, we demonstrate how to simplify this process using AWS services (Amazon Rekognition, AWS Lambda, Amazon S3, Amazon Kinesis Data Firehose, AWS Glue, Amazon Athena, and Amazon QuickSight) to build an end-to-end solution for in-store video analytics. We focus on the analysis of still images, using an existing loss prevention store camera to produce data about the retail in-store experience.

The following diagram shows the overall architecture and the AWS services involved.

By applying machine learning services on AWS, such as Amazon Rekognition, to motion video or still images from your store, you can derive insights into customer behavior (for example, which areas of the store are visited most often) and the demographic segmentation of store traffic (for example, gender or approximate age), while also analyzing patterns of customer sentiment. This practice is already common in the industry, and our proposed solution aims to make it easier and faster to implement. Sentiment analysis can be used, for example, to learn how customers respond to brand content and signage, end cap displays, or advertising campaigns, and these insights can be presented in dashboards similar to the examples shown below.

The overall solution can be decomposed into four main steps: collect, store, process, and analyze. Let’s describe each of these components:

Collect

The purpose of this stage is to collect images or motion video of your customers’ in-store experience from a camera. You can use various cameras, such as an existing CCTV or IP camera system, a (configured) Raspberry Pi with an attached camera module, an AWS DeepLens, or any other similar camera. These still images or motion video files are stored in an Amazon S3 bucket for further processing.

For this example, we used a Raspberry Pi with the motion package installed. This package detects motion and writes still images to a local folder only when there is an interesting event, which limits the amount of data that needs to be processed. The folder can easily be synced (in real time or in batches) to the input S3 bucket. After installing the AWS Command Line Interface (AWS CLI), you can sync the motion folder to an S3 bucket and delete the local files after a successful synchronization, as in the following example (adapt the destination bucket to your own bucket).

aws s3 sync /var/lib/motion/ s3://retail-instore-demo-source/$(hostname) && sudo find /var/lib/motion/ -type f -mmin +1 -delete

Store

We propose using the Amazon S3 object store so we can benefit from its virtually unlimited storage, high availability, and event triggering capabilities. After creating this bucket, we enable Amazon S3 event notifications to publish an event to AWS Lambda for every new file in the input folder. The invoked Lambda function receives the event data (that is, the details of the incoming object) as a parameter to be processed.
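As a minimal sketch of what the invoked function receives, the following Python shows one way to pull the bucket name and object key out of an S3 event notification. The handler name, the helper name, and the sample key are illustrative assumptions and are not taken from the CloudFormation template.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        s3_info = record.get("s3")
        if s3_info is None:
            continue  # skip records that did not come from S3
        bucket = s3_info["bucket"]["name"]
        # Object keys arrive URL-encoded (spaces become '+'), so decode them.
        key = unquote_plus(s3_info["object"]["key"])
        objects.append((bucket, key))
    return objects


def lambda_handler(event, context):
    # Hypothetical handler: a real function would hand each object to the
    # processing step instead of simply returning the list.
    return {"objects": extract_s3_objects(event)}


# Trimmed-down sample event for local experimentation (assumed values).
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "retail-instore-demo-source"},
                "object": {"key": "store-cam-1/image+01.jpg"},
            },
        }
    ]
}
print(extract_s3_objects(sample_event))
```

Keeping the parsing in its own function makes it easy to test locally with a sample event before wiring it to a bucket.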

Process

To process the incoming images, we use AWS Lambda to read each image and call the Amazon Rekognition APIs to gather all of the relevant information that Rekognition provides for it (such as facial landmarks, including the coordinates of the eyes and mouth, as well as gender, age, presence of a beard, sunglasses, and so on). The function puts the resulting information into an Amazon Kinesis Data Firehose delivery stream, which publishes the data to an Amazon S3 bucket. Amazon Kinesis Data Firehose simplifies data management because it automatically handles encryption, the folder structure (year/month/day/hour), optional data transformation, and compression.
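The processing step can be sketched as follows, assuming the function calls the Rekognition DetectFaces API with all attributes and forwards one flat JSON record per face to Firehose. The field names `emotion` and `retailtimestamp` match column names used later in this post; every other field name, the helper names, and the sample response are illustrative assumptions, not the deployed function.

```python
import json


def flatten_faces(response, image_key, timestamp):
    """Turn a DetectFaces-style response into one flat record per face."""
    records = []
    for face in response.get("FaceDetails", []):
        emotions = face.get("Emotions", [])
        # Keep only the emotion with the highest confidence.
        top = max(emotions, key=lambda e: e["Confidence"]) if emotions else None
        records.append({
            "image": image_key,
            "retailtimestamp": timestamp,
            "emotion": top["Type"] if top else None,
            "gender": face.get("Gender", {}).get("Value"),
            "age_low": face.get("AgeRange", {}).get("Low"),
            "age_high": face.get("AgeRange", {}).get("High"),
            "beard": face.get("Beard", {}).get("Value"),
            "sunglasses": face.get("Sunglasses", {}).get("Value"),
        })
    return records


def to_firehose_records(records):
    """Encode records as newline-delimited JSON entries for a Firehose batch."""
    return [{"Data": (json.dumps(r) + "\n").encode("utf-8")} for r in records]


# In Lambda, the response would come from a boto3 Rekognition client, e.g.
#   rekognition.detect_faces(
#       Image={"S3Object": {"Bucket": bucket, "Name": key}}, Attributes=["ALL"])
# and the encoded records would be sent with
#   firehose.put_record_batch(DeliveryStreamName=..., Records=...)
sample_response = {
    "FaceDetails": [
        {
            "AgeRange": {"Low": 26, "High": 43},
            "Gender": {"Value": "Female", "Confidence": 99.1},
            "Beard": {"Value": False, "Confidence": 98.0},
            "Sunglasses": {"Value": False, "Confidence": 99.5},
            "Emotions": [
                {"Type": "HAPPY", "Confidence": 92.3},
                {"Type": "CALM", "Confidence": 5.1},
            ],
        }
    ]
}
flat = flatten_faces(sample_response, "store-cam-1/image01.jpg", 1536310800)
print(flat[0]["emotion"], len(to_firehose_records(flat)))
```

Writing one JSON object per line keeps the files in the processed bucket easy for a crawler to classify and for Athena to read.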

The resulting dataset is a set of JSON files that contain the output from Rekognition, representing the customers captured in these images. For efficient querying from S3, we recommend storing the files in a columnar format. One option is to use the Amazon Kinesis Data Firehose data transformation feature; another is to convert the JSON using AWS Lambda or AWS Glue. Querying small datasets of JSON files is fine, but as the table grows to thousands of files, queries become less efficient. In this demo, we use the JSON format to keep things simple.

Analyze

All the resulting information is then stored in a new Amazon S3 bucket. As described in the process step, the information is stored in JSON format, so it can be queried with Amazon Athena. We use AWS Glue crawlers to automatically infer the data schema from the data sitting in S3, and Amazon Athena uses the shared AWS Glue Data Catalog to query the data. Amazon Athena is a service that allows you to query data directly in S3 using standard ANSI SQL, without needing to spin up any infrastructure. Any data visualization or dashboard tool (for example, Tableau, Superset, or Amazon QuickSight) can then connect to Athena and visualize the data. For our example, we show how to use Amazon QuickSight to create a dashboard for this data.
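For readers who want to query the data programmatically, here is a hedged sketch that builds an hourly emotion count query and submits it through a boto3 Athena client. It assumes `retailtimestamp` holds epoch seconds and that the table and database carry the names used later in this post; the function names and the results location are hypothetical.

```python
def build_emotion_query(table="retail_instore_demo_processed"):
    """Build an Athena (Presto SQL) query counting faces per hour and emotion."""
    return (
        "SELECT date_trunc('hour', from_unixtime(retailtimestamp)) AS hour, "
        "emotion, count(*) AS faces "
        f'FROM "{table}" '
        "GROUP BY 1, 2 ORDER BY 1, 2"
    )


def start_query(athena_client, database, output_location, sql):
    """Submit the query and return its execution ID.

    athena_client is expected to be a boto3 Athena client, for example
    boto3.client("athena", region_name="eu-west-1").
    """
    response = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


print(build_emotion_query())
```

Passing the client in as an argument keeps the sketch runnable without AWS credentials and makes the submission step easy to stub in tests.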

Build this solution yourself

Now that we have described the components of this solution, let’s bring it all together.

We have provided a CloudFormation template that deploys all the necessary components shown in the architecture diagram, except Amazon QuickSight and the camera devices (IP camera or Raspberry Pi). In the following sections, we explain key parts of the solution and show how to make sense of the analyzed data by building an Amazon QuickSight dashboard manually.

Note: The CloudFormation template has been tested only in the eu-west-1 (Ireland) region and might not work in other regions. Some of the resources deployed by the stack incur costs as long as they’re in use.

To get started deploying the CloudFormation template, take these steps:

  1. Choose the Launch Stack button. It automatically launches the AWS CloudFormation service in eu-west-1 region on your AWS account with a template. You’re prompted to sign in if needed.

    Input         Value
    Region        eu-west-1
    CFN Template  https://s3-eu-west-1.amazonaws.com/retail-instore-analysis/retail-instore-demo-cloudformation.json
  2. Choose Next to proceed with the following settings.
  3. Specify the name of the CloudFormation stack and the required parameters. The default bucket names in the template might already be in use, so change them to unique names and click Next.

    Parameter            Description
    SourceBucketName     Unique bucket name where the image files will be uploaded for processing
    ProcessedBucketName  Unique bucket name where the processed files will be stored
    ArchivedBucketName   Unique bucket name where the image files will be archived after processing

  4. On the Review page, acknowledge that CloudFormation creates AWS Identity and Access Management (IAM) resources with custom names as a result of launching the stack.
  5. Custom names are required because the template uses serverless transforms. Choose Create Change Set to check the resources that the transforms add, then choose Execute.
  6. After the CloudFormation stack is launched, wait until the status changes from CREATE_IN_PROGRESS to CREATE_COMPLETE. Usually it takes around 7 to 9 minutes to provision all the required resources.
  7. When the launch is finished, you’ll see a set of resources that we’ll use throughout this blog post:

Test the functionality

Once the AWS CloudFormation template has created the required resources, let’s process some pre-captured sample images and use AWS Glue crawlers to automatically discover the schema of our processed image data.

  1. Download the pre-captured images from the S3 bucket https://s3-eu-west-1.amazonaws.com/retail-instore-analysis/Amazon_Fresh_Pickup_Images.zip
  2. Open the source bucket (retail-instore-demo-source) and upload one or more images with multiple people in them. For this example, use a few of the images downloaded in step 1; you can upload other images each hour afterwards to get graphs across different time intervals.
  3. Lambda is triggered, the image is analyzed using Amazon Rekognition, and the results are put in the S3 processed bucket (retail-instore-demo-processed) for further processing by Athena and QuickSight. You can monitor the files being processed either by watching the results arrive in the processed bucket or by monitoring the AWS Lambda executions in the Lambda Monitoring Console.
  4. In order to query using Athena, first we need to create the tables. We will leverage the AWS Glue crawlers which will automatically discover the schema of our data and create the appropriate table definition in AWS Glue Data Catalog. More details can be found in the documentation for Crawlers with AWS Glue.

By launching a stack from the CloudFormation template, we have only created a crawler and configured it to “run on demand”, because crawler runs are charged at an hourly rate. Therefore, we need to run the crawler manually whenever we have new data, so that it can construct the data catalog using pre-built classifiers. To do this, go to AWS Glue and, under Crawlers, run the crawler (retail-instore-demo-glue-s3-crawler).

The Crawler connects to the S3 bucket, progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in AWS Glue Data Catalog which will be used to query the processed content in S3 using Athena on top of which we will build a QuickSight dashboard.

Note: Make sure you upload pictures at different times and repeat steps 2 to 4, so that the graphs show data across different points in time.
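If you prefer to start the on-demand crawler from a script rather than the console, a minimal sketch could look like the following. It assumes a boto3 Glue client is passed in; the function name, polling interval, and timeout are illustrative choices.

```python
import time


def run_crawler_and_wait(glue_client, crawler_name, poll_seconds=15, timeout_seconds=900):
    """Start a Glue crawler and poll until it returns to the READY state.

    Returns True if the crawler finished within the timeout, False otherwise.
    """
    glue_client.start_crawler(Name=crawler_name)
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        state = glue_client.get_crawler(Name=crawler_name)["Crawler"]["State"]
        if state == "READY":  # other states are RUNNING and STOPPING
            return True
        time.sleep(poll_seconds)
    return False


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   glue = boto3.client("glue", region_name="eu-west-1")
#   run_crawler_and_wait(glue, "retail-instore-demo-glue-s3-crawler")
```

Once the function returns True, the tables in the Data Catalog reflect the newly uploaded data and can be queried from Athena.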

Building QuickSight dashboards

First, let’s configure QuickSight with the dataset.

  1. Log in to the QuickSight console: https://eu-west-1.quicksight.aws.amazon.com/sn/start
  2. If this is the first time you are accessing QuickSight, follow the instructions below. If not, go to step 3.
    1. Click “Sign up for QuickSight”
    2. Select the “Standard” edition
    3. Enter a QuickSight account name
    4. Enter a valid email address
    5. Select the EU (Ireland) Region
    6. Tick the checkbox next to “Amazon Athena”
    7. Tick the checkbox next to “Amazon S3”
    8. Click “Choose S3 buckets” and select the retail-instore-demo-processed bucket
  3. Choose Manage data from the top right.
  4. Choose New Data Set.
  5. Create a Data Set from Athena Data Source as “retail-instore-analysis-blog”
  6. Choose the “retail-instore-demo-db” database, select the “retail_instore_demo_processed” table and click on Select
  7. Leave it on default Import to SPICE for quicker analytics and then click Visualize.
  8. “Import complete” appears with the total number of rows imported to SPICE (the in-memory storage component used by QuickSight). Now you can start to build dashboards. We will step through creating one.
  9. We will first add a custom date field, so that we can have graphs with a time axis. Click “Add” and then “Add calculated field”.
  10. Provide a name for the calculated field, such as “DateCalculated”.
  11. Select “epochDate” from the Function list, select “retailtimestamp” from the Field list, and choose Create.

  12. This should create the calculated field.
  13. Resize the visual windows as necessary, and select “Vertical stacked bar chart” under Visual types.
  14. From the Fields list, drag and drop “DateCalculated” to X axis, “emotion” to Value, and “emotion” to Group/Color on the Field wells. The graph is populated based on the emotions captured from the images.
  15. Click the drop-down arrow on “DateCalculated” in the X axis of the Field wells, hover over “Aggregate: Day”, and select “Hour”.
  16. This graph displays the emotions of people at different timestamps. For example, this can be a dashboard that displays emotions captured at Aisle 23 while tracking the response to a new product on that aisle. As another example, it allows you to better count and qualify customers in front of a specific store endcap.
  17. Let’s add a few more visuals. Click “Add” and then “Add visual”.
  18. Select “Pie chart” from Visual types and drag and drop “emotion” from the Fields list to Group/Color on the Field wells.
  19. You can further enrich your dashboard by adding a meaningful heading and other visual types, using the Fields list as follows:
  • Add a new visual, select “Vertical bar chart”, and set X axis to DateCalculated (DAY) and Value to emotion (count) in the Field wells.
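To make the aggregation behind the stacked bar chart concrete, the following sketch reproduces it in plain Python: each epoch timestamp is truncated to the hour (the equivalent of epochDate plus the “Hour” aggregate) and the records are counted per hour and emotion. It assumes `retailtimestamp` holds epoch seconds; the function name and the sample rows are made up for illustration.

```python
from collections import Counter
from datetime import datetime, timezone


def emotions_per_hour(rows):
    """Count records per (hour, emotion), with hours expressed in UTC."""
    counts = Counter()
    for row in rows:
        ts = datetime.fromtimestamp(row["retailtimestamp"], tz=timezone.utc)
        hour = ts.replace(minute=0, second=0, microsecond=0)
        counts[(hour.isoformat(), row["emotion"])] += 1
    return dict(counts)


# Hypothetical rows shaped like the processed dataset.
rows = [
    {"retailtimestamp": 1536310800, "emotion": "HAPPY"},  # 2018-09-07 09:00:00 UTC
    {"retailtimestamp": 1536310900, "emotion": "HAPPY"},  # 2018-09-07 09:01:40 UTC
    {"retailtimestamp": 1536314400, "emotion": "CALM"},   # 2018-09-07 10:00:00 UTC
]
for (hour, emotion), count in sorted(emotions_per_hour(rows).items()):
    print(hour, emotion, count)
```

Each (hour, emotion) pair corresponds to one colored segment of a bar in the QuickSight visual.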

Results

The QuickSight dashboard above shows an analysis of image capture events in a store. For example, you can see an analysis of overall customer sentiment. From these metrics, you can also understand the customer age range and how many people visited each day. Once you’re done with your analysis, you can share these insights with others in your team or organization by publishing them as a dashboard.

Using Amazon Rekognition you can better segment your customers.  You can obtain feedback on customer experience for a specific area of the store.

Shutting down

When you have finished experimenting, you can remove all the resources by following these steps.

  1. Delete the contents inside all the S3 buckets that the CloudFormation template created.
  2. Delete the CloudFormation stack.

Conclusions and next steps

This post showed you how simple it is to gain insights into customers’ in-store behaviour using AWS. Using your existing CCTV systems or any other camera, you can quickly build this solution and adapt it to your needs.

Some additional concepts that leverage this solution are:

  • In-store video stream analysis: Using Amazon Kinesis Video Streams, it’s possible to stream your existing videos, which allows for additional use cases such as capturing the paths customers take (and visualizing them with heatmaps) and where they spend time in the store.
  • Customer loyalty and engagement programs utilizing facial recognition: Another very interesting possibility is the ability to recognize customers who have opted-in and shared a profile photo of themselves as part of a loyalty program, incentive program, or other customer benefit. Using the volunteered data, those customers could be recognized and offered a more elevated and personalized customer experience at a retail location.

If you have questions or suggestions, please comment below.

About the Authors

Bastien Leblanc is a Solutions Architect with AWS. He helps Retail customers adopt the AWS Platform, focusing on Data & Analytics workloads. He likes to work with customers to help solve retail problems and drive innovation.

Imran Dawood is a Solutions Architect with AWS. He works with Retail customers, helping them build solutions on AWS with architectural guidance to achieve success in the AWS cloud. In his spare time, Imran enjoys playing table tennis and spending time with family.

Accelerate model training using faster Pipe mode on Amazon SageMaker

Amazon SageMaker now comes with a faster Pipe mode implementation, significantly accelerating the speeds at which data can be streamed from Amazon Simple Storage Service (S3) into Amazon SageMaker while training machine learning models.

Pipe mode offers significantly better read throughput than the File mode that downloads data to the local Amazon Elastic Block Store (EBS) volume prior to starting the model training. This means your training jobs start sooner, finish quicker, and need less disk space, reducing your overall cost to train machine learning models on Amazon SageMaker. For example, we conducted internal benchmarks earlier this year when we launched Pipe Input Mode for Amazon SageMaker built-in algorithms. We learned that start times were reduced by up to 87 percent on a 78 GB training dataset. In addition, we saw that throughput was twice as fast in some benchmarks, resulting in up to 35 percent reduction in total training time.

Overview

Amazon SageMaker supports two mechanisms for transferring training data: File mode and Pipe mode. In File mode, the training data is downloaded first to an encrypted EBS volume attached to the training instance prior to commencing the training. However, in Pipe mode the input data is streamed directly to the training algorithm while it is running. This continuous streaming of data enables a few significant advantages. First, the startup time of a training job becomes independent of the size of the input data, resulting in much quicker startup, especially while training on gigabyte- and petabyte-scale datasets. Furthermore, you don’t have to pay for a large disk volume to download large datasets. Finally, if your training algorithm is I/O-bound, the highly concurrent, high-throughput reading mechanism employed by Pipe mode can significantly speed up your model training.

Higher I/O throughput with faster Pipe mode

The latest implementation of Pipe mode provides higher data streaming throughput than before. The following chart shows the throughput improvements in Pipe mode compared to when we launched Pipe mode support earlier this year. For an apples-to-apples comparison, the streaming throughput numbers are baselined against those of File mode, as measured across the instance types supported by Amazon SageMaker training.

As you can see, streaming training data using Pipe mode is now up to three times faster than before in some cases. Pipe mode support is available out of the box for Amazon SageMaker built-in algorithms. Now we will present an example of how you can take advantage of the Pipe mode if you are bringing your own custom training algorithms to Amazon SageMaker.

Writing Pipe mode training code

In Pipe mode, data is pre-fetched from Amazon S3 at high concurrency and throughput and streamed into Unix named pipes (also known as FIFOs). There is one FIFO per channel per epoch. The algorithm must open the FIFO for reading and read through to <EOF> (or optionally abort mid-stream). It must close its end of the file descriptor when done. It can then optionally wait for the next epoch’s FIFO to be created and commence reading from it. The algorithm iterates through epochs until it achieves its completion criteria.

This notebook example has an extremely simple Pipe mode “training” algorithm implementation in Python. It conforms to the specifications required by Amazon SageMaker training. It reads data in Pipe mode but does nothing with the data: it simply reads it and throws it away. The example is written this way to illustrate exactly what’s needed to support Pipe mode without complicating the code with a real training algorithm.

The train.py Python program contains the code. The following code snippet iterates through reading each epoch’s data through its corresponding FIFO:

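import time

# num_epochs, data_dir, channel_name, terminated, check_termination() and
# wait_till_fifo_exists() are assumed to be defined elsewhere in train.py.

# We're allocating a byte array here to read data into; a real algorithm
# may opt to prefetch the data into a memory buffer and train
# in parallel so that both IO and training happen simultaneously
data = bytearray(16777216)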
total_read = 0
total_duration = 0
for epoch in range(num_epochs):
    check_termination()
    epoch_bytes_read = 0
    # As per the Amazon SageMaker training spec, the FIFO's path will be based on
    # the channel name and the current epoch:
    fifo_path = '{0}/{1}_{2}'.format(data_dir, channel_name, epoch)

    # Usually the FIFO will already exist by the time we get here, but
    # to be safe we should wait to confirm:
    wait_till_fifo_exists(fifo_path)
    with open(fifo_path, 'rb', buffering=0) as fifo:
        print('opened fifo: %s' % fifo_path)
        # Now simply iterate reading from the file until EOF. Again, a
        # real algorithm will actually do something with the data
        # rather than simply reading and immediately discarding like we
        # are doing here
        start = time.time()
        bytes_read = fifo.readinto(data)
        total_read += bytes_read
        epoch_bytes_read += bytes_read
        while bytes_read > 0 and not terminated:
            bytes_read = fifo.readinto(data)
            total_read += bytes_read
            epoch_bytes_read += bytes_read

        duration = time.time() - start
        total_duration += duration
        epoch_throughput = epoch_bytes_read / duration / 1000000
        print('Completed epoch %s; read %s bytes; time: %.2fs, throughput: %.2f MB/s'
              % (epoch, epoch_bytes_read, duration, epoch_throughput))
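To experiment with the same read loop outside of a training job, the following self-contained sketch simulates the Pipe mode contract locally on a POSIX system: a writer thread stands in for SageMaker, creating and feeding one FIFO per epoch, while the reader drains each FIFO until EOF. The helper names, the channel name, and the payload are illustrative assumptions; only the `<channel>_<epoch>` naming convention comes from the snippet above.

```python
import os
import tempfile
import threading


def write_epoch(fifo_path, payload):
    # Opening a FIFO for writing blocks until a reader opens the other end.
    with open(fifo_path, "wb") as fifo:
        fifo.write(payload)


def read_epochs(data_dir, channel_name, num_epochs, payload):
    """Read num_epochs FIFOs to EOF and return the total number of bytes read."""
    total_read = 0
    buffer = bytearray(64 * 1024)
    for epoch in range(num_epochs):
        fifo_path = "{0}/{1}_{2}".format(data_dir, channel_name, epoch)
        os.mkfifo(fifo_path)  # in a real job, SageMaker creates the FIFO
        writer = threading.Thread(target=write_epoch, args=(fifo_path, payload))
        writer.start()
        with open(fifo_path, "rb", buffering=0) as fifo:
            while True:
                bytes_read = fifo.readinto(buffer)
                if not bytes_read:  # 0 means the writer closed: end of epoch
                    break
                total_read += bytes_read
        writer.join()
        os.remove(fifo_path)
    return total_read


with tempfile.TemporaryDirectory() as tmp:
    total = read_epochs(tmp, "training", num_epochs=3, payload=b"x" * 200_000)
    print(total)
```

Because a FIFO cannot be rewound, each epoch needs a fresh pipe, which is why the path includes the epoch number.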

Using Pipe mode versus File mode

There are a few situations where Pipe mode may not be the optimum choice for training. In those cases you should stick to using File mode:

  • If your algorithm needs to backtrack or skip ahead within an epoch. This isn’t possible in Pipe mode because the underlying FIFO cannot support lseek() operations.
  • If your training dataset is small enough to fit in memory and you need to run multiple epochs. In this case it might be quicker and easier just to load it all into memory and iterate.
  • If it is not easy to parse your training dataset from a streaming source.

In all other scenarios, if you have an I/O-bound training algorithm, switching to Pipe mode should give you a significant throughput-boost as well as reduce the size of the disk volume required. This should result in both saving you time and reducing training costs.


About the authors

Ishaaq Chandy is a Senior Engineer in Amazon AI, where he loves his work building an innovative and massively scalable training platform for Amazon SageMaker. Prior to this he worked on AWS ELB, where he was part of the launch teams for both ALB and NLB.

Sumit Thakur works on products that make it quick and easy for customers to get started with deep learning in the cloud. He is a product manager for Amazon SageMaker and the AWS Deep Learning AMI. In his spare time, he likes connecting with nature and watching sci-fi TV series.

Amazon SageMaker Neural Topic Model now supports auxiliary vocabulary channel, new topic evaluation metrics, and training subsampling

In this blog post, we introduce three new features of the Amazon SageMaker Neural Topic Model (NTM) that are designed to help improve user productivity, enhance topic evaluation capability, and speed up model training. In addition to these new features, by optimizing sparse operations and the parameter server, we have improved the speed of the algorithm by 2x for training and 4x for evaluation on a single GPU. The speedup is even more significant for multi-GPU training.

Amazon SageMaker NTM is an unsupervised learning algorithm that learns the topic distributions of large collections of documents (corpus). With SageMaker NTM, you can build machine learning solutions for use cases such as document classification, information retrieval, and content recommendation. See Introduction to the Amazon SageMaker Neural Topic Model if you aren’t already familiar with Amazon SageMaker NTM.

If you are new to machine learning, or want to free up time to focus on other tasks, then the fully automated Amazon Comprehend topic modeling API is your best option. If you are a data science specialist looking for finer control over the various layers of building and tuning your own topic modeling model, then the Amazon SageMaker NTM might work better for you. For example, let’s say you are building a document topic tagging application that needs a customized vocabulary, and you need the ability to adjust the algorithm hyperparameters, such as the number of layers of the neural network, so you can train a topic model that meets the target accuracy in terms of coherence and uniqueness scores. In this case, the Amazon SageMaker NTM would be the appropriate tool to use.

Auxiliary vocabulary channel

When training a topic model, it’s important to know the top words in each topic so customers can understand what a topic is about. Customers who want to retrieve the actual words for each topic from an Amazon SageMaker NTM model, instead of integer representations, can now use the auxiliary vocabulary channel feature to remove the manual mapping effort.

Currently, when an Amazon SageMaker NTM training job runs, it outputs the training status and evaluation metrics to Amazon CloudWatch Logs and directly inside the Jupyter console. Among the outputs are lists of top words for the different topics detected. Prior to the availability of auxiliary vocabulary channel support, the top words were represented as integers, and customers needed to map the integers to an external custom vocabulary lookup table in order to know what the actual words were. With the support of the auxiliary vocabulary channel, users can now add a vocabulary file as an additional data input channel, and Amazon SageMaker NTM will output the actual words for a topic instead of integers. This feature eliminates the manual effort needed to map integers to the actual vocabulary. The following sample shows what a custom vocabulary text file looks like. The text file simply contains a list of words, one word per row, in the order corresponding to the integer IDs provided in the data.

absent
absentee
absolute
absolutely
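For context, the manual mapping that the auxiliary vocabulary channel makes unnecessary looks roughly like the following sketch. It assumes the integer IDs are zero-based line positions in the vocabulary file; the function names and the sample IDs are illustrative.

```python
def load_vocab(lines):
    """Return the vocabulary as a list, so that list index equals word ID."""
    return [line.rstrip("\n") for line in lines]


def ids_to_words(word_ids, vocab):
    """Map integer word IDs back to the words they represent."""
    return [vocab[i] for i in word_ids]


# With a real file you would pass open("vocab.txt", encoding="utf-8").
vocab = load_vocab(["absent\n", "absentee\n", "absolute\n", "absolutely\n"])
print(ids_to_words([2, 0, 3], vocab))
```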

To include an auxiliary vocabulary for a training job, name the vocabulary file vocab.txt and place it in the auxiliary directory. The following sample code shows the syntax for adding an auxiliary vocabulary file. UTF-8 encoding is supported for the vocabulary file.

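# s3_input comes from version 1 of the SageMaker Python SDK; ntm, s3_train_data
# and s3_val_data are assumed to be defined earlier in the notebook.
from sagemaker.session import s3_input

# s3_aux_data contains the auxiliary channel path on s3. E.g. "s3://bucketname/auxiliary"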
s3_aux = s3_input(s3_aux_data, distribution='FullyReplicated', content_type='text/plain')
s3_train = s3_input(s3_train_data, distribution='ShardedByS3Key',
                    content_type='application/x-recordio-protobuf')
s3_val = s3_input(s3_val_data, distribution='FullyReplicated',
                  content_type='application/x-recordio-protobuf')
ntm.fit({'train': s3_train, 'validation': s3_val, 'auxiliary': s3_aux})

After the training is completed, the output looks like the following:

[09/07/2018 12:55:38 INFO 139989295474496] Topics from epoch:final (num_topics:20) [wetc 0.37, tu 0.75]:
[09/07/2018 12:55:38 INFO 139989295474496] [0.39, 0.72] game season september released power episode novel player ign jack playstation career trek boy life caused music people death fan
[09/07/2018 12:55:38 INFO 139989295474496] [0.31, 0.75] single hit design tube speed tropical maximum final nuclear storm mm taxonomy drum peaked depth wave reached wind lb iii
[09/07/2018 12:55:38 INFO 139989295474496] [0.35, 0.65] died record recorded st highway ny rule intersection chart intersects connector house century reached death billboard route charter king sr
[09/07/2018 12:55:38 INFO 139989295474496] [0.42, 0.77] gameplay player game battalion group division goal league mode match win football infantry score japanese attack regular enemy defeating yard
[09/07/2018 12:55:38 INFO 139989295474496] [0.38, 0.74] creek century mile water completed construction battalion freeway river destroyer interchange operation highway foot service city tower intersection cambridge gun
[09/07/2018 12:55:38 INFO 139989295474496] [0.36, 0.83] stone personnel film built chart long song sr rock album hop certification u rolling listing provided carey instrument instrumentation single
[09/07/2018 12:55:38 INFO 139989295474496] [0.37, 0.83] description forest possession era straight century remains round record existed yellow semi current county goal designation latin east hindu age
[09/07/2018 12:55:38 INFO 139989295474496] [0.40, 0.73] novel u bishop king archbishop fiction state poem god story henry cathedral expressed entire church new change originally turn force
[09/07/2018 12:55:38 INFO 139989295474496] [0.33, 0.73] game gameplay star race tournament body episode series developer unit player ha leslie character announced ray session international staff andy
[09/07/2018 12:55:38 INFO 139989295474496] [0.43, 0.86] team season game competition appearance florida play ground level brown japan goal summer host australia flight live feature specie injury
[09/07/2018 12:55:38 INFO 139989295474496] [0.35, 0.70] personnel surrender heavy reached radio single party bishop hit wave ship british parliament territory mm gun robert equipment issued armament
[09/07/2018 12:55:38 INFO 139989295474496] [0.41, 0.80] line french german zealand club race poem men force class light british position african american boat share smith veronica crew
[09/07/2018 12:55:38 INFO 139989295474496] [0.40, 0.83] simpson million development network map percent female doe level volume police people park developed water production business company public begin
[09/07/2018 12:55:38 INFO 139989295474496] [0.33, 0.70] game hill manchester building player division train battalion church oslo elected film vote baltimore amateur seat gate regiment infantry borough
[09/07/2018 12:55:38 INFO 139989295474496] [0.38, 0.63] ship water wind caused storm homer people poem hour affected line episode movement effect developed viewer house burn death street
[09/07/2018 12:55:38 INFO 139989295474496] [0.35, 0.82] raaf battalion party political building command rebel unit emperor hm granted outbreak brigade army restaurant cambridge appointed squadron commanded church
[09/07/2018 12:55:38 INFO 139989295474496] [0.30, 0.85] painting pagan century wasp altar architecture medieval church mary archaeologist era settlement witchcraft breed ode centre shakespeare religious scholar creek
[09/07/2018 12:55:38 INFO 139989295474496] [0.40, 0.70] episode season black series star aired character film female dvd speed rating nielsen simpson plot writer nbc director producer game
[09/07/2018 12:55:38 INFO 139989295474496] [0.41, 0.68] government community killed force vote archbishop turned death house dublin scholar legislation continued died opposed god king existed county official
[09/07/2018 12:55:38 INFO 139989295474496] [0.31, 0.70] class draft season assigned defense touchdown route ton division decommissioned line destroyer fleet mm torpedo boat tube sister king battleship

Word embedding topic coherence metric

To evaluate the performance of a trained Amazon SageMaker NTM model, customers can examine the perplexity metric emitted by a training job. Another measure of model quality is the semantic similarity of the top words in each topic. A high-quality model should have semantically similar words in each of its topics. Customers who want to measure topic coherence during training can now use the new word embedding topic coherence (WETC) feature.

Traditional methods like normalized point-wise mutual information (NPMI), while widely accepted, require a large external corpus. The new WETC metric measures the similarity of words in a topic by using a pre-trained word embedding, Glove-6B-400K-50d.

Intuitively, each word in the vocabulary is given a vector representation (embedding). We compute the WETC of a topic by averaging the pair-wise cosine similarities between the vectors corresponding to the top words of the topic. Finally, we average the WETC for all the topics to obtain a single score for the model.
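The computation described above can be sketched in a few lines of Python. The toy two-dimensional embeddings are invented for illustration (the algorithm uses the pre-trained GloVe vectors), and averaging over distinct unordered word pairs is an assumption; the paper gives the exact definition.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def topic_wetc(top_words, embeddings):
    """Average pair-wise cosine similarity of a topic's top words."""
    pairs = list(combinations(top_words, 2))
    return sum(cosine(embeddings[a], embeddings[b]) for a, b in pairs) / len(pairs)


def model_wetc(topics, embeddings):
    """Average the per-topic WETC scores into a single model score."""
    return sum(topic_wetc(t, embeddings) for t in topics) / len(topics)


# Toy embeddings: two "game" words and two "weather" words.
embeddings = {
    "game": [1.0, 0.0],
    "player": [1.0, 0.0],
    "storm": [0.0, 1.0],
    "wind": [0.0, 1.0],
}
coherent = [["game", "player"], ["storm", "wind"]]
mixed = [["game", "storm"], ["player", "wind"]]
print(model_wetc(coherent, embeddings), model_wetc(mixed, embeddings))
```

Topics whose top words point in similar directions in embedding space score close to 1, while topics that mix unrelated words score close to 0.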

Our tests have shown that WETC correlates very well with NPMI, which makes it an effective surrogate. For details about the pair-wise WETC computation and its correlation with NPMI, please refer to our paper [1].

The WETC value ranges between 0 and 1; a higher value indicates a higher degree of topic coherence. A typical value would be in the range of 0.2 to 0.8. The WETC metric is evaluated whenever the vocabulary file is provided. The average WETC score over the topics is displayed in the log above the top words of all topics. The WETC metric for each topic is also displayed along with the top words of each topic. See the following screenshot for an example.

 

Note: In the situation in which many of the words in the supplied vocabulary can’t be found in the pre-trained word embedding, the WETC score can be misleading. Therefore, we provide a warning message to alert the user to exactly how many words in the vocabulary do not have an embedding:

[09/07/2018 14:18:57 WARNING 140296605947712] 69 out of 16648 in vocabulary do not have embeddings! Default vector used for unknown embedding!

Topic uniqueness metric

A good topic modeling algorithm should generate topics that are unique to avoid topic duplication. Customers who want to understand the topic uniqueness of a trained Amazon SageMaker NTM model to evaluate its quality can now use the new topic uniqueness (TU) metric.

To understand how TU works, suppose there are K topics, and we extract the top n words for each topic. The TU for topic k is defined as:

$$\mathrm{TU}(k) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\mathrm{cnt}(i,k)}$$

where cnt(i,k) is the total number of times the ith top word in topic k appears in the top words across all topics. For example, if the ith top word in topic k appears only in topic k, then cnt(i,k)=1; on the other hand, if the word appears in all the topics, then cnt(i,k)=K. Finally, the average TU is computed as:

$$\mathrm{TU} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{TU}(k)$$

The range of the TU value is between 1/K and 1, where K is the number of topics. A higher TU value represents higher topic uniqueness for the topics detected.
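As a rough illustration of the definition, the following Python sketch computes a per-topic TU and the average TU from lists of top words. It assumes that the TU of a topic is the mean of 1/cnt(i,k) over its top n words, which is consistent with the stated range of 1/K to 1; the function name topic_uniqueness and the example word lists are hypothetical and not part of Amazon SageMaker NTM.

```python
from collections import Counter


def topic_uniqueness(topics):
    """Return (per-topic TU list, average TU) for lists of top words.

    Assumes each topic's top-word list has no repeated words.
    """
    # cnt: how many times each word appears in the top words across all topics.
    cnt = Counter(word for top_words in topics for word in top_words)
    per_topic = [
        sum(1.0 / cnt[word] for word in top_words) / len(top_words)
        for top_words in topics
    ]
    average = sum(per_topic) / len(per_topic)
    return per_topic, average


# Two topics that share one top word ("king"); the lists are made up.
example_topics = [
    ["army", "king", "church"],
    ["season", "king", "film"],
]
per_topic_tu, average_tu = topic_uniqueness(example_topics)
print(per_topic_tu, average_tu)
```

Here "king" appears in both topics, so it contributes 1/2 to each topic while every other word contributes 1, giving a TU of (1 + 1/2 + 1)/3 for each topic. If all K topics had identical top words, every word would contribute 1/K and the score would reach its lower bound.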

The TU score is displayed regardless of the existence of a vocabulary file. The average TU score over the topics is displayed in the log above the top words of all topics. The TU score for each topic is also displayed along with the top words of each topic. See the following screenshot for an example.

Training subsampling

Topic model training often deals with a large text corpus, and training a topic model on it can be very time consuming. Customers who want to speed up model training while maintaining model performance when using Amazon SageMaker NTM with a large text corpus can now use the new training subsampling feature.

In typical online training, the entire training dataset is fed into the training algorithm for each epoch. When the corpus is large, this leads to long training time. With effective subsampling of the training dataset, we can achieve faster model convergence while maintaining the model performance. The new subsampling feature of Amazon SageMaker NTM allows customers to specify the fraction of the training data used in each epoch through a new hyperparameter, sub_sample. For example, specifying 0.8 for sub_sample would direct Amazon SageMaker NTM to use a random 80% of the training data for each epoch. As a result, the algorithm will stochastically cover different subsets of data during different epochs. You can configure this value either in the Amazon SageMaker console or directly in the training code. See the following sample code for how to set this value for training.

ntm.set_hyperparameters(num_topics=num_topics, feature_dim=vocab_size, mini_batch_size=128, epochs=100, sub_sample=0.7)
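To make the effect of sub_sample more concrete, here is a small Python sketch of per-epoch random subsampling. It is only an illustration of the idea and not the NTM training loop: the helper epoch_subsample is hypothetical, and it assumes documents are drawn without replacement within an epoch, which this post does not specify.

```python
import random


def epoch_subsample(num_docs, sub_sample, rng):
    """Pick the document indices used in one epoch (hypothetical helper)."""
    # Assumption: sample without replacement, at least one document per epoch.
    k = max(1, int(round(sub_sample * num_docs)))
    return rng.sample(range(num_docs), k)


rng = random.Random(0)   # fixed seed so the sketch is repeatable
num_docs = 1000
seen = set()
for epoch in range(10):
    indices = epoch_subsample(num_docs, 0.2, rng)
    # A real training loop would run its mini-batches over `indices` here.
    seen.update(indices)

print(len(indices), "documents per epoch;", len(seen), "distinct documents seen")
```

Each epoch touches only 20% of the documents, but because a fresh subset is drawn every time, the set of documents seen grows as training proceeds.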

We demonstrate the utility of the sub_sample hyperparameter by setting it to 1.0 and 0.2 for training on the wikitext-103 dataset [2]. In both settings, NTM would early-exit training when the loss on validation data does not improve in 3 consecutive epochs. We report the TU, WETC, and NPMI of the best epoch based on validation loss as well as the total time for both settings as follows.

| sub_sample | TU   | WETC | NPMI  | Total time (seconds) | Best epoch |
| 1.0        | 0.9  | 0.13 | 0.163 | 900                  | 18         |
| 0.2        | 0.91 | 0.17 | 0.204 | 673                  | 49         |

We observe that setting sub_sample to 0.2 reduces the total training time (673 seconds instead of 900) even though it takes more epochs to converge (49 instead of 18). The increase in the number of epochs to convergence is expected due to the variance introduced by training on a random subset of data per epoch. Yet the overall training time is reduced because each epoch processes only one fifth of the data and is correspondingly faster. We also note the higher TU, WETC, and NPMI scores at the end of training with subsampling. More details of the experiment can be found in the notebook.

To see how the three new features are used in practice, check out the complete sample notebook here.

Conclusion

In this blog post, we introduced three new Amazon SageMaker NTM features. After finishing this post and the sample notebook, you should have learned how to add an auxiliary vocabulary channel to automatically map integer word representations in a topic to a human-readable vocabulary. You have also learned to evaluate the quality of a trained model using the new word embedding topic coherence and topic uniqueness metrics. Lastly, you have learned to use the subsampling feature to reduce the model training time while maintaining similar model performance.

[1] Ran Ding, Ramesh Nallapati, and Bing Xiang. 2018. Coherence-Aware Neural Topic Modeling (Accepted for EMNLP 2018)

[2] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer Sentinel Mixture Models


About the Authors

David Ping is a Principal Solutions Architect with the AWS Solutions Architecture organization. He works with our customers to build cloud and machine learning solutions using AWS. He lives in the NY metro area and enjoys learning the latest machine learning technologies.

 

 

 

Feng Nan is an Applied Scientist on the AWS AI Algorithms team, researching and developing machine learning algorithms in Amazon SageMaker. Before Amazon, Feng obtained his PhD in Systems Engineering from Boston University and his thesis focused on resource-constrained machine learning.

 

 

 

Ran Ding is an Applied Scientist on the AWS AI Algorithms team, researching and developing machine learning algorithms in Amazon SageMaker. Before Amazon, Ran obtained his PhD in Electrical Engineering from the University of Washington and worked at a startup company making optical processors.

 

 

 

Ramesh Nallapati is a Senior Applied Scientist on the AWS AI SageMaker team. He works on building novel deep neural networks at scale, primarily in the natural language processing domain. He is passionate about deep learning, enjoys learning about the latest developments in AI, and is excited about contributing to this field to the best of his abilities.

 

 

 

Patrick Ng is a Software Development Engineer on the AWS AI SageMaker Algorithms team. He works on building scalable distributed machine learning algorithms, with a focus on deep neural networks and natural language processing. Before Amazon, he obtained his PhD in Computer Science from Cornell University and worked at startup companies building machine learning systems.