
Amazon SageMaker Automatic Model Tuning becomes more efficient with warm start of hyperparameter tuning jobs

Earlier this year, we launched Amazon SageMaker Automatic Model Tuning, which allows developers and data scientists to save significant time and effort in training and tuning their machine learning models. Today, we are launching warm start of hyperparameter tuning jobs in Automatic Model Tuning. Data scientists and developers can now create a new hyperparameter tuning job based on selected parent jobs, so that training jobs conducted in those parent jobs can be reused as prior knowledge. Warm start of hyperparameter tuning jobs will accelerate the hyperparameter tuning process and reduce the cost for tuning models.

While data scientists and developers could already efficiently tune their models through Automatic Model Tuning, there are still places where they need more help. For example, they might start a hyperparameter tuning job with a small budget, and, after analyzing the results, decide that they want to continue tuning the model with a larger budget. Potentially they might use different hyperparameter configurations (e.g., by adding more hyperparameters to tune or trying different search ranges for some hyperparameters). Another example is when data scientists or developers might want to re-tune a model after they have collected new data subsequent to a previous model tuning. In both cases, starting a hyperparameter tuning job with prior knowledge collected from previous tuning jobs on this model can help get to the best model faster, and end up saving cost for customers. However, previously every tuning job would start from scratch. Even if the same model was already tuned with a similar tuning configuration, no information was reused.

Warm start of hyperparameter tuning jobs addresses these needs. Now we’ll show you how to iteratively tune your model leveraging warm start.

Tuning an image classification model leveraging warm start

In this example, we’ll build an image classifier and iteratively tune it by running multiple hyperparameter tuning jobs leveraging warm start. We’ll use the Amazon SageMaker built-in image classification algorithm and train the model against the Caltech-256 dataset. You can find the full sample notebook here.

Set up and launch the hyperparameter tuning job

We’ll skip the steps of creating a notebook instance, preparing the dataset, and pushing it to Amazon S3, and directly start from launching a hyperparameter tuning job. The sample notebook has all the details so we won’t go through the process here.

We’ll run this first tuning job to learn about the search space and evaluate the impact of tuning tunable hyperparameters in image classification. This job will assess if tuning the model is promising, and if we want to continue the tuning by creating a subsequent tuning job.

To create a tuning job, we first need to create a training estimator for the built-in image classification algorithm, and specify values for every hyperparameter of this algorithm, except for those we plan to tune. To learn more about hyperparameters of the built-in image classification algorithm, you can explore our documentation.

s3_output_location = 's3://{}/{}/output'.format(bucket, prefix)
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='application/x-recordio')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='application/x-recordio')
sess = sagemaker.Session()

imageclassification = sagemaker.estimator.Estimator(training_image,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.p3.8xlarge',
                                    output_path=s3_output_location,
                                    sagemaker_session=sess)

imageclassification.set_hyperparameters(num_layers=18,
                                        image_shape='3,224,224',
                                        num_classes=257,
                                        num_training_samples=15420,
                                        mini_batch_size=128,
                                        epochs=50,
                                        optimizer='sgd',
                                        top_k='2',
                                        precision_dtype='float32',
                                        augmentation_type='crop')

Now that we have the estimator, we can create a hyperparameter tuning job with the estimator and specify the search ranges for hyperparameters we want to tune and the number of total training jobs we want to run.

We selected the three hyperparameters that we believe are most likely to affect the model quality, and thus our objective metric. Since we don’t know yet the values that lead to the best model, we chose the full range of search for momentum and weight_decay as specified in image classification documentation, and a smaller range of search for learning rate (0.0001, 0.05):

  • learning_rate: controls how fast the training algorithm will try to optimize your model. Lower learning rates can achieve better accuracy but will take more time to train your model. Higher learning rates can fail to improve your model accuracy. You need to find a good balance for this attribute.
  • momentum: uses information from the direction of our previous update to inform our current update. The default value of 0 means weight updates are based only on the information in the current batch.
  • weight_decay: penalizes weights when they grow too large. The default value of 0 means no penalty.
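To make the roles of these three hyperparameters concrete, here is a minimal sketch of a single SGD update on one scalar weight. It is a generic textbook form of SGD with momentum and weight decay, not the exact update used inside the built-in image classification algorithm; the function name `sgd_momentum_step` and the toy objective are assumptions made for illustration.

```python
def sgd_momentum_step(weight, grad, velocity, learning_rate,
                      momentum=0.0, weight_decay=0.0):
    """One illustrative SGD update for a single scalar weight."""
    # weight_decay adds a penalty proportional to the weight itself,
    # so large weights are pulled back toward zero.
    penalized_grad = grad + weight_decay * weight
    # momentum carries over a fraction of the previous update;
    # learning_rate scales how far the current gradient moves the weight.
    velocity = momentum * velocity - learning_rate * penalized_grad
    return weight + velocity, velocity


# Toy objective (assumption): minimize f(w) = w**2, whose gradient is 2 * w.
w, v = 5.0, 0.0
for _ in range(200):
    w, v = sgd_momentum_step(w, 2 * w, v, learning_rate=0.1, momentum=0.9)
print("final weight:", w)
```

With momentum set to 0 the update depends only on the current gradient, and with weight_decay set to 0 no penalty is applied, which matches the defaults described above.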

In this case we don’t need to specify the regular expressions for the objective metric because we are using one of the Amazon SageMaker built-in algorithms.

from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

hyperparameter_ranges = {'learning_rate': ContinuousParameter(0.0001, 0.05),
                         'momentum': ContinuousParameter(0.0, 0.99),
                         'weight_decay': ContinuousParameter(0.0, 0.99)}

objective_metric_name = 'validation:accuracy'

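tuner = HyperparameterTuner(imageclassification,
                            objective_metric_name,
                            hyperparameter_ranges,
                            objective_type='Maximize',
                            max_jobs=10,
                            max_parallel_jobs=2)

# Launch the tuning job on the training and validation channels
tuner.fit({'train': s3_input_train, 'validation': s3_input_validation}, include_cls_metadata=False)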

After the hyperparameter tuning job finishes, we can bring in a table of metrics using the HyperparameterTuningJobAnalytics API action from the Amazon SageMaker Python SDK.

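# tuning_job_name is the name of the tuning job launched above
tuner_parent = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name)
# Sort the training jobs from best to worst objective value
tuner_parent.dataframe().sort_values(['FinalObjectiveValue'], ascending=False)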

This table shows a subset of the training jobs that have been run. You can look at all of the results by running the notebook. Observe that the hyperparameters we are tuning have a significant impact on the objective metric values for the image classification algorithm. Choosing different values gives very different results.

Using the HPO_Analyze_TuningJob_Results.ipynb notebook, we can plot how the objective metric changes over time as the tuning job progresses.

You can see that the objective metric values improve over time as Automatic Model Tuning learns the search space. We might get further improvement beyond the 0.33 validation accuracy by running a few more training jobs. To validate the hypothesis, we’ll run a second tuning job with another 10 training jobs. This time we’ll use warm start to reuse the learning we gathered from the first tuning job.

Don’t worry if you don’t get a trend as clear as the one we just discussed, given the randomness inherent in a tuning process. Even running the same experiment won’t give you the same result, but typically you should see an overall trend of model quality improvement.

Set up and launch a hyperparameter tuning job using a warm start configuration

To use warm start in the new tuning job, we need to specify two parameters:

  • The list of parent tuning jobs the new tuning job should use as a starting point. (The maximum number of parents can be 5 but we will use 1 in this example.)
  • The type of warm start configuration:
    • IDENTICAL_DATA_AND_ALGORITHM warm starts a tuning job with previous evaluations essentially with the same task, allowing for slight changes in the search space. This option should be used when the data set and the algorithm container haven’t changed.
    • TRANSFER_LEARNING warm starts a tuning job with the evaluations from similar tasks, allowing changes to the search space, the algorithm image, and the dataset.

In this example we’ll use IDENTICAL_DATA_AND_ALGORITHM because we are not changing the data set or algorithm; we are just running more training jobs.
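The two parameters above can also be expressed as a plain request structure. The sketch below builds such a dictionary and enforces the five-parent limit. The key names and the string values `IdenticalDataAndAlgorithm` and `TransferLearning` reflect our understanding of the warm start structure accepted by the CreateHyperParameterTuningJob API and should be checked against the API reference; the helper name `build_warm_start_config` and the job name are hypothetical.

```python
MAX_PARENTS = 5  # maximum number of parent tuning jobs, as stated above

# Assumed API spellings of the two warm start types
WARM_START_TYPES = ("IdenticalDataAndAlgorithm", "TransferLearning")


def build_warm_start_config(parent_job_names, warm_start_type):
    """Build a warm start configuration dictionary (illustrative helper)."""
    parents = sorted(set(parent_job_names))  # drop duplicate parents
    if not 1 <= len(parents) <= MAX_PARENTS:
        raise ValueError("expected between 1 and %d parent jobs" % MAX_PARENTS)
    if warm_start_type not in WARM_START_TYPES:
        raise ValueError("unknown warm start type: %r" % warm_start_type)
    return {
        "ParentHyperParameterTuningJobs": [
            {"HyperParameterTuningJobName": name} for name in parents
        ],
        "WarmStartType": warm_start_type,
    }


# Hypothetical parent job name, used only for illustration
config = build_warm_start_config(["imageclassif-parent-job"],
                                 "IdenticalDataAndAlgorithm")
print(config)
```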

We will use the Amazon SageMaker console to launch our second tuning job with warm start. Open the Amazon SageMaker console, and in the left navigation pane choose Training. Then choose Hyperparameter tuning jobs and Create hyperparameter tuning job. At the top of the page, enable Warm start and select Identical data and algorithm as the Warm start type. The next step is to select the parent jobs of the new tuning job:

The console allows us to easily populate the values of the new tuning job by using Copy settings from the parent tuning job. After choosing Copy settings, the form gets populated. Choose Next and validate that the static and tunable hyperparameters look good:

In this case, we are not changing any hyperparameter values, so we just need to choose Next again and create the new tuning job using warm start. Really simple!

After the warm start hyperparameter tuning job has completed, we can go back to the notebook to use tuner.analytics() to visualize how the objective metric changes over time for the parent tuning job (black data points) and the new tuning job we launched using warm start (red data points).

You can see that the new tuning job managed to find good hyperparameter configurations very early on, thanks to the prior knowledge from the parent tuning job. As the optimization continues, the objective metric continues improving and it reaches 0.47, which is significantly higher than the metric we had gotten (0.33) when we ran the first tuning job from scratch.

Lastly, to demonstrate how you could apply transfer learning to a tuning job using warm start, we’ll run a third tuning job using more data augmentations in the data set to see if those drive our validation accuracy further up. To apply more data augmentations we can use the augmentation_type hyperparameter exposed by the Amazon SageMaker built-in image classification algorithm. We’ll apply the crop_color_transform transformation to the data set during training. With this setting, in addition to crop and color transformations, random transformations (including rotation, shear, and aspect ratio variations) are applied to the image.

To create our last hyperparameter tuning job, we will use Transfer learning WarmStartType since our data set is going to change as a result of applying new data augmentations. We’ll use both of the two previous tuning jobs that we ran as parent tuning jobs and run 10 more training jobs. Let’s go back to the notebook to launch this last hyperparameter tuning job:

from sagemaker.tuner import WarmStartConfig, WarmStartTypes

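# parent_tuning_job_name is the first tuning job; warmstart_tuning_job_name is the
# job launched from the console (both are assumed to be defined earlier in the notebook)
parent_tuning_job_name_2 = warmstart_tuning_job_name
transfer_learning_config = WarmStartConfig(WarmStartTypes.TRANSFER_LEARNING,
                                    parents={parent_tuning_job_name, parent_tuning_job_name_2})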

imageclassification.set_hyperparameters(num_layers=18,
                                        image_shape='3,224,224',
                                        num_classes=257,
                                        num_training_samples=15420,
                                        mini_batch_size=128,
                                        epochs=50,
                                        optimizer='sgd',
                                        top_k='2',
                                        precision_dtype='float32',
                                        augmentation_type='crop_color_transform')

tuner_transfer_learning = HyperparameterTuner(imageclassification,
                            objective_metric_name,
                            hyperparameter_ranges,
                            objective_type='Maximize',
                            max_jobs=10,
                            max_parallel_jobs=2,
                            base_tuning_job_name='transferlearning',
                            warm_start_config=transfer_learning_config)

tuner_transfer_learning.fit({'train': s3_input_train, 'validation': s3_input_validation},include_cls_metadata=False)

One last time, after the new hyperparameter tuning job has completed, we can use tuner.analytics() to visualize how the objective metric changed over time for the parent tuning jobs (black and red data points) and the new tuning job we launched using warm start transfer learning (blue data points).

After the tuning job has been completed, the objective metric has improved again and has reached 0.52.

If you are satisfied with the results, you can find the training job that generated the best model by getting BestTrainingJob in the Automatic Model Tuning describe API or by going to the console. From the console you can deploy the model to an Amazon SageMaker hosting endpoint.
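As a sketch of the first option, the helper below pulls the best training job out of a describe response. We assume the response is the dictionary returned by a call such as boto3's `describe_hyper_parameter_tuning_job`, and that it nests the fields shown here; the helper name `summarize_best_training_job`, the job names, and the hyperparameter values in the sample are invented for illustration (only the 0.52 accuracy comes from the run above).

```python
def summarize_best_training_job(describe_response):
    """Extract the best training job from a tuning job description, if any."""
    best = describe_response.get("BestTrainingJob")
    if best is None:
        return None  # no training job has completed yet
    metric = best.get("FinalHyperParameterTuningJobObjectiveMetric", {})
    return {
        "name": best["TrainingJobName"],
        "metric_name": metric.get("MetricName"),
        "metric_value": metric.get("Value"),
        "hyperparameters": best.get("TunedHyperParameters", {}),
    }


# Hand-written sample response (hypothetical names and hyperparameter values)
sample_response = {
    "HyperParameterTuningJobName": "transferlearning-example",
    "BestTrainingJob": {
        "TrainingJobName": "transferlearning-example-007",
        "FinalHyperParameterTuningJobObjectiveMetric": {
            "MetricName": "validation:accuracy",
            "Value": 0.52,
        },
        "TunedHyperParameters": {
            "learning_rate": "0.01",
            "momentum": "0.9",
            "weight_decay": "0.0001",
        },
    },
}

summary = summarize_best_training_job(sample_response)
print(summary["name"], summary["metric_value"])
```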

Conclusion

To recap, we explored one use case that showed how using warm start can help explore the search space iteratively without losing the learning gathered in previous iterations. We also demonstrated how you can use warm start to transfer the learning of previous tuning jobs even if your dataset or algorithm has been changed, but you believe they are close enough to datasets or algorithms used in previous hyperparameter tuning jobs.

Warm start of hyperparameter tuning jobs is now available in all the AWS Regions where Amazon SageMaker is available today. For more information on Amazon SageMaker Automatic Model Tuning, visit Amazon SageMaker documentation.


 

About the Authors

Patricia Grao is a Software Development Manager in Amazon AI. She became passionate about machine learning while working in search ranking and query understanding in Amazon Search. She was part of the team that launched Amazon SageMaker Automatic Model Tuning.

 

 

 

Fela Winkelmolen works as an applied scientist for Amazon AI and was part of the team that launched the Automatic Model Tuning feature of Amazon SageMaker.

 

 

 

 

Fan Li is a Product Manager of Amazon SageMaker. He used to be a big fan of ballroom dance but now loves whatever his 8-year-old son likes.

 

 

 

Build Your Own Natural Language Models on AWS (no ML experience required)

At AWS re:Invent last year we announced Amazon Comprehend, a natural language processing service which extracts key phrases, places, peoples’ names, brands, events, and sentiment from unstructured text. Comprehend – which is powered by sophisticated deep learning models trained by AWS – allows any developer to add natural language processing to their applications without requiring any machine learning skills.

Today we are excited to bring new customization features to Comprehend, which allow developers to extend Comprehend to identify natural language terms and classify text which is specialized to their team, business, or industry.

Many customers tell us they have a surplus of data – specifically – data comprising unstructured, natural language. You likely won’t have to look far inside your own organization before you find a treasure trove of potential information, hiding inside reams of customer emails, support tickets, financial reports, product reviews, social media, or advertising copy. Helping find the needle inside this proverbial haystack is something machine learning is particularly good at; machine learning models can be extremely accurate at picking up specific items of interest inside vast swathes of text (such as finding company names in analyst reports), and are sensitive to the sentiment hidden inside language (identifying negative reviews, or positive customer interactions with customer service agents).

While Comprehend has highly accurate models for finding generic terms (such as places and things), customers often want to extend this capability to identify more specific language, such as policy numbers or part codes. This usually involves starting from scratch, and building new, specialized machine learning language models – annotating data, selecting algorithms, tuning parameters, optimizing models, and running them in production. Not only do these steps all require deep machine learning expertise, but they also represent “undifferentiated heavy lifting”; effort which many application developers would rather spend on building new features of their own.

Customize Amazon Comprehend (No ML Experience Required)

Today, we’re helping customers find more needles in more haystacks; no machine learning skills required. Under the hood, Comprehend will do the heavy lifting to build, train, and host the customized machine learning models, and make those models available through a private API.

Custom Entities allows developers to customize Comprehend to identify terms that are specific to their domain. Comprehend will learn from a small private index of examples (a list of policy numbers, and text in which they are used, for example), and train a private, custom model to recognize these in any other block of text. There are no servers to manage, and no algorithms to master.

Custom Classification allows developers to group documents into named categories. Through as few as 50 examples, Comprehend will automatically train a custom classification model that can be used to categorize all your documents. You could group support emails by department, social media posts by product, or analyst reports by business unit. If you don’t have any examples, or your categories change frequently (which is common in social media), Comprehend can also classify based on just the content of the documents, using Topic Modeling.

Customer Success with Amazon Comprehend

When it comes to understanding unstructured text in a specific domain, natural language doesn’t come much more specialized than in the legal profession. The “legalese” used in most legal documents is famous for its complex syntax, nomenclature and structure. It’s a great example of where Comprehend Custom Entities can help; we worked with LexisNexis while developing these new capabilities, to extract legal entities from hundreds of millions of documents, with very high accuracy.

“We provide legal professionals with insightful research and analytics to help them make informed decisions,” said Rick McFarland, Chief Data Officer of LexisNexis. “Therefore, we are always looking for better ways to discover insights from legal documents. Thanks to Amazon Comprehend’s automatic machine learning, we can now build accurate custom entity recognition models without getting into the complexities associated with ML. The entities that we care about the most, such as judge and attorney, can be identified quickly from more than 200 million documents at greater than 92 percent accuracy.”

New Amazon Comprehend features are now Generally Available

Since the earliest days of AWS, our goal has been to take technology which is traditionally only within reach of large, well-funded organizations, and to put it in the hands of all developers. And just like with services such as EC2 and RDS, to do this for machine learning we need to continue to invent and simplify on behalf of our customers, across the machine learning stack. These new capabilities for Comprehend are a perfect reflection of this spirit; we’re excited to see what you build with them.

 

Dr. Matt Wood, General Manager of Artificial Intelligence, AWS

 

 

 

 

Getting Started with Amazon Comprehend custom entities

Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to find insights and relationships in text. We released an update to Amazon Comprehend enabling support for private, custom entity types. Customers can now train state-of-the-art entity recognition models to extract their specific terms, completely automatically. No machine learning experience required. For example, financial companies can analyze market reports for terms and language related to bankruptcy activity. Manufacturing companies can now analyze logistics documents looking for specific parts IDs and route numbers. Combining custom entities with Comprehend’s pre-trained entities enables a complete picture of what is contained within text data. Use this data to look for trends, anomalies, or specific conditions within text.

Training the service to learn custom entity types is as easy as providing a set of those entities and a set of real-world documents that contain them. To get started, put together a list of entities. Gather these from a product database, or an Excel file that your company uses for business planning. For this blog post, we are going to train a custom entity type to extract key financial terms from financial documents.

The CSV format requires “Text” and “Type” as column headers. The text contains the entities and the type is the name of the entity type we are about to create.

Next, collect a set of documents that contain those entities in the context of how they are used. The service needs a minimum of 1,000 documents, each containing at least one of the entities from our list.

Next, configure the training job to read the entity list CSV from one folder, and the text file containing all of the documents (one per line) from another folder.
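The two training inputs described above can be prepared with a few lines of Python. This is a minimal sketch under stated assumptions: the helper names `build_entity_list_csv` and `build_documents_file` are hypothetical, the entity type name and sample sentences are illustrative, and a real training set needs far more documents than the two shown here.

```python
import csv
import io


def build_entity_list_csv(entities, entity_type):
    """Return the entity list as CSV text with the required Text,Type header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Text", "Type"])
    for text in entities:
        writer.writerow([text, entity_type])
    return buffer.getvalue()


def build_documents_file(documents):
    """Return the documents as text with exactly one document per line."""
    # Collapse internal whitespace so a document never spans several lines.
    return "\n".join(" ".join(doc.split()) for doc in documents) + "\n"


entity_csv = build_entity_list_csv(["stocks", "modest gains", "trading"],
                                   "FINANCE_ENTITY")
docs_txt = build_documents_file(["Tech stocks posted\nmodest gains.",
                                 "Light trading continued."])
print(entity_csv)
print(docs_txt)
```

The two strings can then be written to separate folders, matching the layout the training job is configured to read.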

After both sets of training data are prepared, train the model. This process can take a few minutes, or multiple hours depending on the size and complexity of the training data. Using automatic machine learning, Amazon Comprehend selects the right algorithm, sampling and tuning the models to find the right combination that works best for the data.

When the training is completed the custom model is ready to go. Below, view the trained model along with some helpful metadata.

To start analyzing documents looking for custom entities, either use the portal or APIs via the AWS SDK. In this example, create an analysis job in the portal to analyze financial documents using the custom entity type:

This is how the same job submission would look using our CLI:

aws comprehend start-entities-detection-job \
  --entity-recognizer-arn "arn:aws:comprehend:us-east-1:1234567890:entity-recognizer/person-recognizer" \
  --job-name person-job \
  --data-access-role-arn "arn:aws:iam::1234567890:role/service-role/AmazonComprehendServiceRole-role" \
  --language-code en \
  --input-data-config "S3Uri=s3://data/input/" \
  --output-data-config "S3Uri=s3://data/output/" \
  --region us-east-1

Take a look at the job output by opening the JSON response object and reviewing the custom entities. For each entity, the service also returns a confidence score. If some entities come back with low confidence scores, improve them by adding more training documents that contain that specific entity.

Below are the financial terms extracted by the custom model.

{
  "Entities": [
    {
      "BeginOffset": 10,
      "EndOffset": 16,
      "Score": 0.999985933303833,
      "Text": "stocks",
      "Type": "FINANCE_ENTITY"
    },
    {
      "BeginOffset": 24,
      "EndOffset": 36,
      "Score": 0.9998899698257446,
      "Text": "modest gains",
      "Type": "FINANCE_ENTITY"
    },
    {
      "BeginOffset": 55,
      "EndOffset": 62,
      "Score": 0.9999994039535522,
      "Text": "trading",
      "Type": "FINANCE_ENTITY"
    },
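To act on the confidence scores, the output can be screened for entities that fall below a chosen threshold. The sketch below reuses the three entities shown above in a complete JSON document; the helper name `low_confidence_entities` and the threshold are assumptions, and the threshold is set unrealistically high only so that one of these very confident predictions is flagged.

```python
import json


def low_confidence_entities(job_output, threshold):
    """Return the entities whose confidence score is below the threshold."""
    return [e for e in job_output.get("Entities", []) if e["Score"] < threshold]


# The three entities from the job output above, as a complete document
sample_output = json.loads("""
{
  "Entities": [
    {"BeginOffset": 10, "EndOffset": 16, "Score": 0.999985933303833,
     "Text": "stocks", "Type": "FINANCE_ENTITY"},
    {"BeginOffset": 24, "EndOffset": 36, "Score": 0.9998899698257446,
     "Text": "modest gains", "Type": "FINANCE_ENTITY"},
    {"BeginOffset": 55, "EndOffset": 62, "Score": 0.9999994039535522,
     "Text": "trading", "Type": "FINANCE_ENTITY"}
  ]
}
""")

flagged = low_confidence_entities(sample_output, threshold=0.9999)
for entity in flagged:
    print(entity["Text"], entity["Score"])
```

Entities flagged this way are candidates for adding more training documents that mention them.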

Please visit the product forum to provide feedback or get some help.


About the author

Nino Bice is a Sr. Product Manager leading product for Amazon Comprehend, AWS’s natural language processing service.


Amazon Polly adds Italian and Castilian Spanish voices, and Mexican Spanish language support

Amazon Polly is an AWS service that turns text into lifelike speech. Because the service is pre-trained, you can integrate AI into your applications without any machine learning skills.

In addition to the previously available Italian voices Carla and Giorgio, we have now added Bianca, a second female Italian voice. Listen to the introduction by Bianca.


We have also added Lucia, a second female Castilian Spanish voice. Listen to the introduction by Lucia.


In addition, we are introducing Mia, our first Mexican Spanish voice, which expands our portfolio of Spanish options beyond Castilian and US Spanish.


With these additions, the Amazon Polly portfolio now includes 57 voices across 28 languages. Visit the Amazon Polly documentation for the full list of text-to-speech voices, and log in to the Amazon Polly console to try them out!

 


About the Author

Robin Dautricourt is a Principal Product Manager for Amazon Text-to-Speech, and he leads product management for Amazon Polly. He enjoys innovating on behalf of customers, to launch features that will benefit their business needs and end users. He enjoys spending his free time with his wife and kids.

 

 

 

 

 

Introduction to Amazon SageMaker Object2Vec 

In this blog post, we’re introducing the Amazon SageMaker Object2Vec algorithm, a new highly customizable multi-purpose algorithm that can learn low dimensional dense embeddings of high dimensional objects.

Embeddings are an important feature engineering technique in machine learning (ML). They convert high dimensional vectors into low-dimensional space to make it easier to do machine learning with large sparse vector inputs. Embeddings also capture the semantics of the underlying data by placing similar items closer in the low-dimensional space. This makes the features more effective in training downstream models. One of the well-known embedding techniques is Word2Vec, which provides embeddings for words. It has been widely used in many use cases, such as sentiment analysis, document classification, and natural language understanding. See the following diagram for a conceptual representation of word embeddings in the feature space.

Figure 1: Word2Vec embeddings: words that are semantically similar are close together in the embedding space.

In addition to word embeddings, there are also use cases where we want to learn the embeddings of more general-purpose objects such as sentences, customers, and products. This is so we can build practical applications for information retrieval, product search, item matching, customer profiling based on similarity or as inputs for other supervised tasks. This is where Amazon SageMaker Object2Vec comes in. In this blog post, we will talk about what it is, how it works, discuss some practical use cases, and show you how Object2Vec can be used to solve those use cases.

How it works

The embeddings are learned such that the semantics of the relationship between pairs of objects in the original space are preserved in the embedding space. Thus, the learned embeddings can be used to efficiently compute nearest neighbors of objects, as well as to visualize natural clusters of related objects in low-dimensional space. In addition, the embeddings can also be used as features of the corresponding objects in downstream supervised tasks such as classification or regression.

The architecture of Amazon SageMaker Object2Vec consists of the following main components:

  • 2 input channels—The two input channels take object pairs of same or different types as inputs and pass them to independent and customizable encoders.  Examples of input objects could be sequence pairs, tokens pairs, and sequence and tokens pairs.
  • 2 encoders—The encoders convert each object into a fixed-length embedding vector.  The encoded embeddings of the objects in the pair are then passed into a comparator.
  • Comparator—The comparator compares the embeddings in different ways and outputs scores that correspond to the strength of the relationship of the objects in the pair for each relationship type specified by the user. An example of the output score could be 1, indicating a strong relationship between the pair of objects, or 0, representing a weak relationship.

At training time, the training loss function minimizes the differences between the relationships predicted by the model and those specified by the user in the training data. After the model is trained, the trained encoder can be used to convert new input objects into fixed-length embeddings. The architectural diagram of Object2Vec and an explanation of the parts of the architecture follows.
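To illustrate how the pieces fit together, here is a deliberately simplified sketch of an encoder and a comparator in plain Python. It is not the Object2Vec implementation: the encoder is a plain average of token embeddings, the comparison operations (element-wise product, concatenation, absolute difference) are assumptions chosen for illustration, and the trained scoring layer that would turn the comparison features into a relationship score is omitted. The tiny embedding table is hand-written rather than learned.

```python
def average_pool(token_ids, embedding_table):
    """Encode a sequence of token IDs as the mean of its token embeddings."""
    vectors = [embedding_table[token_id] for token_id in token_ids]
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def compare(left, right):
    """Build comparison features from two fixed-length embeddings."""
    hadamard = [a * b for a, b in zip(left, right)]       # element-wise product
    abs_diff = [abs(a - b) for a, b in zip(left, right)]  # element-wise distance
    # Concatenate everything into one feature vector for a scoring layer.
    return hadamard + left + right + abs_diff


# Hand-written 2-dimensional embeddings for three token IDs (illustrative)
embedding_table = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}

left_embedding = average_pool([0, 1], embedding_table)  # a two-token sequence
right_embedding = average_pool([2], embedding_table)    # a singleton token
features = compare(left_embedding, right_embedding)
print(features)
```

Whatever the sequence length, each encoder returns a vector of the same size, which is what lets the comparator treat both sides of the pair uniformly.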

Supported input types, encoders and loss functions

Natively, Object2Vec currently supports singleton discrete tokens represented as integer-ids as well as sequences of discrete tokens represented as lists of integer-ids as inputs, so pre-processing is required to transform the input data to the supported formats. The objects in each pair can be asymmetric with respect to each other. For example, they can be (token, sequence) pairs, or (token, token) pairs, or (sequence, sequence) pairs. For tokens, we support simple embeddings as compatible encoders, while for sequences of tokens, we support average-pooled embeddings, hierarchical Convolutional Neural Networks (CNNs), as well as multi-layered Bi-Directional-Long-Short-Term-Memory (BiLSTM)-based Recurrent Neural Networks as encoders. The input label for each pair can be a categorical label that expresses the relationship between the objects in the pair, or it can be a rating or a score that expresses the strength of similarity between the two objects. For categorical labels, we support Cross-Entropy loss function, and for ratings/score-based labels, we support Mean Squared Error (MSE) loss function.

Although the current input types supported in Object2Vec are either sequences of discrete tokens or singleton tokens, these input types already cover plenty of real-world objects since the data that describes these objects can usually be represented as discrete sequences. Here are a few illustrative examples:

  • Embeddings of customers: To learn the embeddings of customers, you can generate training data consisting of a recent sequence of transactions of each customer, where the sequence is represented as the list of product-IDs bought by the customer, paired with the ID of the customer, as positive examples. As negative examples, one can generate transactions of a different customer paired with the original (and therefore incorrect) customer-ID. For each pair, the sequence of transactions could be passed as input to a CNN or BiLSTM encoder, and the customer-ID to an embedding-based encoder. Once trained, the embeddings of the customers can be directly read from the embedding-based encoder.
  • Embeddings of products: To train embeddings of products, you can pair the title of the product, represented as a sequence of text tokens, and the product-ID as positive examples. As negative examples, one can pair the title of another (potentially related) product with the original (and incorrect) product-ID.
  • Embeddings of users and movies: To train embeddings of users and movies, you can use user-movie pairs where the user has assigned a high rating to the movie as positive examples, and those to which the user has assigned a low rating as negative examples. You can use an embedding-based encoder for both users and movies, and the embeddings of either can be read directly from the corresponding encoder, once trained.
  • Embeddings of football players: To learn the embeddings of players in a football game, you can use the time sequence of discretized locations in the field traced by each player during a game, paired with the player-ID as positive examples. Traced location sequences of a player paired with a different player-ID can serve as negative examples.
  • Embeddings of English sentences: To learn embeddings of sentences in English, you can treat pairs of adjacent sentences in a document as positive pairs, and the pairs of sentences sampled from different documents as negative pairs. You can use CNN- or BiLSTM-based encoders for both sentences in the pair. Once trained, either encoder can be used to generate embeddings of new sentences.
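The customer example above can be sketched in a few lines. This is a minimal illustration only: the function name `build_customer_pairs`, the toy data, and the choice of 1 and 0 as the positive and negative labels are assumptions. The field names `in0`, `in1`, and `label` follow the JSON-lines samples shown later in this post.

```python
import json
import random


def build_customer_pairs(transactions, seed=0):
    """transactions: dict mapping customer-ID (int) -> list of product-IDs (ints)."""
    rng = random.Random(seed)
    customer_ids = sorted(transactions)
    records = []
    for cid in customer_ids:
        # Positive example: the customer's own transaction sequence with their ID.
        records.append({"in0": transactions[cid], "in1": [cid], "label": 1})
        # Negative example: another customer's sequence paired with this customer's ID.
        other = rng.choice([c for c in customer_ids if c != cid])
        records.append({"in0": transactions[other], "in1": [cid], "label": 0})
    return records


toy_transactions = {0: [11, 42, 7], 1: [3, 3, 19], 2: [42, 8]}
customer_pairs = build_customer_pairs(toy_transactions)

# One JSON object per line, ready to be written to a training file.
json_lines = "\n".join(json.dumps(record) for record in customer_pairs)
```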

In this blog post, we’ll walk through some of these use cases in more detail using our Jupyter Notebook examples (movie recommendation, multi-label document classification, and sentence similarity).

Is Object2Vec a supervised learning algorithm?

Since the algorithm requires labeled data for training, it is indeed true that Object2Vec is a supervised learner. However, we want to emphasize that there are many scenarios where the relationship labels can be obtained purely from natural clusterings in data, without any explicit human annotation. We discussed some examples earlier, but we reiterate them as follows for clarity.

  • To learn embeddings of words, pairs of words that occur within a context window in a given document can be considered examples with a positive label, and word pairs sampled from the unigram distribution of a corpus can be considered examples with a negative label.
  • Likewise, to learn embeddings of sentences, pairs of sentences that occur adjacent to each other in a document can be considered examples with a positive label, and sentence pairs that do not co-occur in the same document can be considered examples with a negative label.
  • To learn embeddings of a customer, pairs of transaction records from the same customer within a given window of time can be considered positive examples, and pairs of transactions from two different customers can be considered negative examples.
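To make the first case concrete, here is a small sketch that derives labeled word pairs from raw text without any human annotation. The function name `word_pairs`, the window size, and the number of negatives per positive are illustrative assumptions.

```python
import random
from collections import Counter


def word_pairs(tokens, window=2, negatives_per_positive=1, seed=0):
    """Return (word, other_word, label) triples from a token list."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    vocab = list(counts)
    weights = [counts[word] for word in vocab]
    pairs = []
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            # Positive label: the two words co-occur inside the context window.
            pairs.append((word, tokens[j], 1))
            # Negative label: a word drawn from the unigram distribution.
            # A sampled word can occasionally be a true neighbor; this sketch ignores that.
            for _ in range(negatives_per_positive):
                sampled = rng.choices(vocab, weights=weights, k=1)[0]
                pairs.append((word, sampled, 0))
    return pairs


tokens = "the quick brown fox jumps over the lazy dog".split()
pairs = word_pairs(tokens, window=1)
```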

To reiterate, the architecture of Object2Vec requires the user to make the relationship between objects in each pair explicit at training time, but the relationships themselves may be obtained from natural groupings in data, and they might not require explicit human labeling.

Hyperparameters

Object2Vec supports a range of hyperparameters for fine-tuning the training to meet different requirements. These are some of the main hyperparameters:

  • Encoder network (network) – You can choose Hierarchical CNN, BiLSTM, or Pooled Embedding. Use Hierarchical CNN if you want faster training through parallelization. BiLSTM will give you better results for sequential inputs, such as sentences, where long-distance dependencies between tokens in the sequence need to be captured. Pooled Embedding is designed for very fast training at the cost of some drop in accuracy.
  • Optimizer – You can choose among ‘adam,’ ‘adagrad,’ ‘rmsprop,’ ‘sgd,’ and ‘adadelta.’
  • Token embedding dimension (token_embedding_dim) – The dimension of the input layer. This is the layer where pre-trained embeddings could be applied.
  • Encoding dimension (enc_dim) – The dimension of the final encoding of the input, which is the output of the corresponding encoder.
  • Early stopping tolerance and patience – Use these hyperparameters to control the early stopping of training by measuring performance improvement over a number of epochs.

See here for a full list of supported hyperparameters.
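To give an intuition for the tolerance and patience hyperparameters, the sketch below shows one common way such a rule is written. It is a generic illustration, not a description of Object2Vec's internal stopping criterion; the function name `should_stop` and the default values are assumptions.

```python
def should_stop(val_losses, patience=3, tolerance=0.01):
    """Return True once the best validation loss has failed to improve by more
    than `tolerance` for `patience` consecutive epochs."""
    best = float("inf")
    epochs_without_improvement = 0
    for loss in val_losses:
        if best - loss > tolerance:
            # Meaningful improvement: remember it and reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return True
    return False


stalled = should_stop([1.0, 0.8, 0.79, 0.795, 0.792], patience=3, tolerance=0.05)
improving = should_stop([1.0, 0.8, 0.6, 0.4], patience=3, tolerance=0.05)
```

A larger patience tolerates longer plateaus, while a larger tolerance demands bigger improvements before the counter resets.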

Data input channel

Similar to other Amazon SageMaker built-in algorithms, Object2Vec supports a training data channel, a validation data channel, and a test data channel. It also provides an auxiliary data channel for you to provide a pre-trained embedding file and a vocabulary file. A pre-trained embedding file (e.g., a GloVe embedding file) is used to replace each integer-id in the input with the pre-trained embedding vector for that token. Using pre-trained embeddings provides a warm start to the algorithm training, since training starts from an informed initial point in the input layer. For Natural Language Processing applications, pre-trained embeddings such as word2vec and GloVe are available for download from multiple locations. To ensure that the correct embedding is used for each input token, you must also provide a vocabulary dictionary that maps the integer-ids in the input to words, which are then used to look up the corresponding pre-trained embeddings. The vocabulary dictionary is a JSON-format mapping between words and their integer representations. The following example shows what a vocabulary file looks like.

{"!": 0, "#": 1, "$": 2, "%": 3, "&": 4, "'": 5, "''": 6, "'14": 7, 
"'50s-themed": 8, "'60s": 9, "'80s": 10, "'AST": 11, "'Anaconda": 12, 
"'Chips": 13, "'Em": 14, "'Free": 15, "'Good": 16, "'KISS": 17, 
"'Marco": 18, "'Mega": 19, "'Melanie": 20, "'N": 21, "'Out": 22, 
"'Round": 23, "'S": 24, "'Stairway": 25, "'T": 26, "'The": 27, 
"'Thing-o-matic": 28, "'White": 29, "'cleanest": 30, "'d": 31, 
"'free": 32, "'gobble": 33, "'heading": 34, "'house": 35, "'ll": 36, 
"'m": 37, "'mommy": 38, "'n": 39, "'no": 40, "'o": …}

Inference

After the model is trained, the trained encoder can be used to perform inference in two modes:

  • To convert singleton input objects into fixed length embeddings using the corresponding encoder.
  • To predict the relationship label or score between a pair of input objects.

The inference server automatically figures out which of these two modes is requested based on the input data. To get the embeddings as output, we would only provide one input in each instance, whereas to predict the relationship label or score, we would provide both inputs in the pair.
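The difference between the two modes can be illustrated by how a request body might be assembled. This is a sketch under assumptions: the `instances` wrapper and the helper names `make_request` and `requested_mode` are illustrative, so check the Object2Vec inference format documentation for the exact request layout.

```python
import json


def make_request(in0, in1=None):
    """Build a JSON request body with one instance (layout is an assumption)."""
    instance = {"in0": in0}
    if in1 is not None:
        instance["in1"] = in1
    return json.dumps({"instances": [instance]})


def requested_mode(payload):
    """Mimic the decision described above: two inputs -> predict, one -> embed."""
    instance = json.loads(payload)["instances"][0]
    if "in0" in instance and "in1" in instance:
        return "predict"
    return "embed"


embedding_request = make_request([774, 14, 21])
prediction_request = make_request([774, 14, 21], [21, 366])
```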

Compute recommendation

Currently, Object2Vec is set up to train only on a single machine. However, it does offer support for training on multiple GPUs. For training, we recommend that you start with GPU instances, because GPUs provide higher throughput. For inference, we recommend CPU instances, because they avoid the latency overhead of communication between CPU and GPU.

Performance

Despite being a general-purpose embedding algorithm for a range of input types, Amazon SageMaker Object2Vec performs comparably to some purpose-built embedding algorithms. See the following for the Pearson correlation comparisons on various versions of the Semantic Text Similarity (STS) dataset, where we compare Object2Vec with a state-of-the-art model called InferSent.

Use cases for Object2Vec

We currently support learning embeddings of pairs of tokens, pairs of sequences, and pairs of token and sequence. There are many use cases that can be mapped into one of these representations. Next, we will take a look at three specific use cases:

  • Collaborative recommendation system
  • Multi-label document classification
  • Sentence embeddings

Training with pairs of tokens: Collaborative recommendation system

Collaborative filtering is a popular technique for building recommendation systems. The main concept behind collaborative filtering is that users with similar tastes (based on observed user-item interactions) are more likely to have similar interactions with new items. Object2Vec can make recommendations by approximating the observed user-item interactions using low dimensional representations of users and items.

The following diagram shows how user-item interaction data can be used to learn the embedding of users and items. The resulting model can be used to predict user rating on a new item.

To see how SageMaker Object2Vec can be used for building a collaborative recommendation model, let’s take a look at this notebook. More specifically, we will show how to solve the following two kinds of machine learning tasks using the MovieLens dataset.

  • Task 1: Rating prediction as a regression problem
  • Task 2: Movie recommendation as a classification problem

The MovieLens dataset contains paired (user, movie) data and the corresponding ratings. The integer-id corresponding to a user is fed into one arm of Object2Vec, and the integer-id corresponding to the movie is fed into the other arm. We use separate embedding-based encoders for users and movies to convert them into dense embeddings, which are passed into the comparator that predicts the rating for a given (user, movie) pair. We will first show how to learn the embeddings of users and movies based on labeled training data. Then, we will demonstrate how to use the learned embeddings to predict ratings on the held-out test set, and show that our model achieves accuracy comparable to some of the best tools available in the open source domain. See the following diagram for a high-level logic flow of the data processing and training pipeline.

In the data processing and preparation step, we will use the raw dataset to create a training data file, a validation data file, and a test data file, and copy the files to an Amazon S3 bucket. Amazon SageMaker Object2Vec takes input in JSON-lines format, so the raw MovieLens data will be converted to a format similar to the sample that follows. In this sample, in0 represents the user id, in1 represents the movie id, and label represents the rating that the user gave the movie. During training, the in0 value will be fed into one arm of the Object2Vec algorithm, and in1 will be fed into the other arm.

{"in0": [1], "in1": [20], "label": 4.0}
{"in0": [1], "in1": [33], "label": 4.0}
{"in0": [1], "in1": [61], "label": 4.0}
{"in0": [1], "in1": [117], "label": 3.0}
{"in0": [1], "in1": [155], "label": 2.0}
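A conversion along these lines produces records in that shape. This is a minimal sketch: the function name `ratings_to_jsonl` and the assumption that the raw data has already been parsed into (user, movie, rating) tuples are illustrative.

```python
import json


def ratings_to_jsonl(rows):
    """rows: iterable of (user_id, movie_id, rating) tuples."""
    return "\n".join(
        json.dumps({"in0": [int(user)], "in1": [int(movie)], "label": float(rating)})
        for user, movie, rating in rows
    )


sample_rows = [(1, 20, 4), (1, 33, 4), (1, 155, 2)]
jsonl = ratings_to_jsonl(sample_rows)
```

Each id is wrapped in a single-element list because a singleton token is represented the same way as a sequence of length one.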

For the training step, we will configure the necessary hyperparameters for task 1 and task 2. For task 1, which is a regression job, we will set the “output_layer” hyperparameter to “mean_squared_error”, and for task 2, we will use “softmax” for the “output_layer”. Since the inputs are individual tokens, we set the network for both encoders to “pooled_embedding”.

Amazon SageMaker provides a Python SDK for easier integration with the SageMaker backend operations such as training and deployment. Here, we will use the Amazon SageMaker Estimator to kick off the training job. See the following code sample for the syntax in the Amazon SageMaker Python SDK for kicking off the rating prediction (regression) job.
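The two configurations differ only in the output layer, which the sketch below makes explicit. The values come from the text above, but the per-encoder key names `enc0_network` and `enc1_network` are an assumption about how the network setting is spelled for each arm, so verify them against the hyperparameter reference.

```python
# Settings shared by both tasks: each input is a single token, so both
# encoders use the pooled embedding network.
# NOTE: the key names enc0_network / enc1_network are assumed, not taken from this post.
base_hyperparameters = {
    "enc0_network": "pooled_embedding",
    "enc1_network": "pooled_embedding",
}

# Task 1: rating prediction as regression.
regression_hyperparameters = dict(base_hyperparameters, output_layer="mean_squared_error")

# Task 2: movie recommendation as classification.
classification_hyperparameters = dict(base_hyperparameters, output_layer="softmax")
```

Dictionaries like these could then be applied to an estimator before calling fit, for example with `set_hyperparameters(**regression_hyperparameters)`.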

regressor = sagemaker.estimator.Estimator(container,
                                          role, 
                                          train_instance_count=1, 
                                          train_instance_type='ml.p2.xlarge',
                                          output_path=output_path,
                                          sagemaker_session=sess)

## train, tune, and test the model
regressor.fit({'train': s3_train, 'validation':s3_valid, 'test':s3_test})

See the following code sample for kicking off a recommendation (classification) job.

classifier = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.p2.xlarge',
                                    output_path=output_path,
                                    sagemaker_session=sess)

## train, tune, and test the model
classifier.fit({'train': s3_train_c, 'validation':s3_valid_c, 'test':s3_test_c})

When the training job runs, it will output the following training and validation metrics in Amazon CloudWatch Logs and in the Jupyter notebook console.

[10/18/2018 14:39:43 INFO 140224059168576] Epoch 6 Training metrics:   mean_squared_error: 0.084 mean_absolute_error: 0.224 
[10/18/2018 14:39:43 INFO 140224059168576] #quality_metric: host=algo-1, epoch=6, train mean_squared_error <loss>=0.084217468395
[10/18/2018 14:39:43 INFO 140224059168576] Epoch 6 Validation metrics: mean_squared_error: 0.931 mean_absolute_error: 0.762 
[10/18/2018 14:39:43 INFO 140224059168576] #quality_metric: host=algo-1, epoch=6, validation mean_squared_error <loss>=0.930595424127

Check out the full notebook for detailed instructions.

Training with pairs of (one-hot vectors and sequences of one-hot vectors): Multi-label document classification

Document classification and tagging are common business challenges for many organizations, especially in the era of big data. There are unsupervised machine learning approaches such as topic modeling and supervised machine learning approaches such as multi-label classification. Object2Vec’s support for (token, sequence) pair input is well suited to the multi-label document classification problem. See the following diagram on how document and label data can be fed into Object2Vec for multi-label classification training.

In this notebook example, we will show how to train Object2Vec on (token, sequence) pairs. The specific use case we consider is multi-labeled document classification. To model this problem, one arm of our architecture accepts a document represented as a sequence of word-ids as input. The other arm accepts the categorical label of the document represented as an integer-id as input. We convert multi-labeled documents into (document, label) pairs where each document is repeatedly paired with every label in the corpus. We apply a “positive” relationship to a (document, label) pair if the document is tagged with the specific label in the ground truth data. Otherwise, the relationship for the (document, label) pair is marked as “negative.” The encoder for the document arm would be a CNN or a BiLSTM, which would convert the variable-length sequence into a fixed-length embedding. The encoder for the label arm would be a simple embedding encoder, which would convert the label-id into a dense embedding. These two would be passed to a comparator, which would emit scores that correspond to the model’s confidence in the two relationship types between the document and the label.

At training time, we associate all (document, label) pairs that exist in the training data with a “positive” relationship type, and we sample pairs with a “negative” relationship type from the cross-product of (documents, labels) such that the document is in the training data, but the pair (document, label) does not occur in the training data. If the same document has multiple labels, we generate a unique (document, label) pair with a “positive” relationship for each label that applies to the document. Such preprocessing is similar to how multi-labeled document classification is handled using multiple one-vs-rest classifiers.
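The pair construction described here can be sketched as follows. The function name `build_doc_label_pairs`, the number of negatives sampled per document, and the use of short document names in place of word-id sequences are all simplifications for illustration.

```python
import random


def build_doc_label_pairs(doc_labels, all_labels, negatives_per_doc=1, seed=0):
    """doc_labels: dict mapping a document to the set of labels it is tagged with."""
    rng = random.Random(seed)
    pairs = []
    for doc, labels in doc_labels.items():
        # One "positive" pair for every label that applies to the document.
        for label in sorted(labels):
            pairs.append((doc, label, 1))
        # "Negative" pairs are sampled from labels the document does NOT carry.
        candidates = sorted(set(all_labels) - set(labels))
        sample_size = min(negatives_per_doc, len(candidates))
        for label in rng.sample(candidates, sample_size):
            pairs.append((doc, label, 0))
    return pairs


doc_labels = {"doc_a": {0, 2}, "doc_b": {1}}
doc_pairs = build_doc_label_pairs(doc_labels, all_labels=range(4), negatives_per_doc=2)
```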

At test time, given a document D, we pass the document to Object2Vec multiple times, where each time it is paired with one unique label L in the training set as input. We accept the label L as applicable to the document if the score for “positive” relationship type for the pair (D,L) is higher than a threshold.
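In code, this test-time procedure amounts to a loop over candidate labels with a threshold. In the sketch below, `score_fn` stands in for a call to the trained model and `toy_score_fn` is a hypothetical stub with made-up scores; the threshold of 0.5 is also an assumption.

```python
def predict_labels(score_fn, document, labels, threshold=0.5):
    """Keep every label whose "positive" score for (document, label) exceeds the threshold."""
    return [label for label in labels if score_fn(document, label) > threshold]


# Hypothetical stand-in for the model's "positive" relationship score.
toy_scores = {0: 0.91, 1: 0.12, 2: 0.67}


def toy_score_fn(document, label):
    return toy_scores[label]


predicted = predict_labels(toy_score_fn, document=[5, 9, 2], labels=[0, 1, 2])
```

Raising the threshold trades recall for precision, so it is usually tuned on validation data.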

Check out the full notebook for detailed instructions on running multi-label document classification with Object2Vec.

Training with pairs of (sequence of tokens, sequence of tokens): Sentence similarity

There are many practical use cases for sentence similarity. For example, in a customer support workflow, you might need to identify duplicate support tickets or route tickets to the correct support queue based on similarity of the text found in the ticket. Another example where sentence/text similarity can be used is information retrieval where a system can return a list of similar text given an input text.

Modeling sequences of tokens as embeddings is also relevant in contexts other than natural language. For example, a customer’s preferences can be modeled by the sequence of product-ids that he/she has bought in the recent past. We can learn a customer’s embedding by pairing sequences of product-ids bought by the customer as positive pairs and sequences of product-ids bought by different customers as negative examples. The embeddings thus learned can be used to find similar customers or can be used as features in a downstream supervised task.

In this notebook, we will demonstrate how to use Object2Vec to generate embeddings for sentences so they can be used for sentence similarity comparison. The following diagram shows how the sentence pairs are fed into Object2Vec to learn the embeddings.

The high-level logical flow of the data processing and training pipeline is depicted in the following diagram.

For training data, we will use the Stanford Natural Language Inference (SNLI) dataset, which consists of pairs of sentences labeled “entailment,” “neutral,” or “contradiction.” After the model is trained, we can use it to convert any English sentence into a fixed-length embedding. We will measure the quality of the model by using a hold-out test dataset from the SNLI dataset. In the notebook, we will also measure the quality of embeddings on new sentences, by comparing the similarity of sentence pairs from the Semantic Text Similarity (STS) dataset in the embedding space and evaluating that against the human-labeled ground truth.

We will preprocess the SNLI data into the JSON-lines structure shown by the following sample data, so it can be consumed by the Object2Vec algorithm. In this sample, in0 and in1 represent the two sentences, with the integers representing the words in the sentences, and the label represents the relationship (entailment, neutral, or contradiction) between the two sentences.
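The STS-style evaluation can be sketched with NumPy. The embeddings and human scores below are made-up toy values, and the choice of cosine similarity as the similarity measure is an assumption for this illustration. The Pearson correlation is the comparison metric mentioned earlier in this post.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def pearson(x, y):
    """Pearson correlation between model similarities and human scores."""
    return float(np.corrcoef(x, y)[0, 1])


# Toy sentence-pair embeddings and toy human similarity ratings.
embedding_pairs = [
    ([1.0, 0.0], [1.0, 0.1]),
    ([1.0, 0.0], [0.5, 0.5]),
    ([1.0, 0.0], [0.0, 1.0]),
]
human_scores = [4.8, 2.5, 0.2]

model_scores = [cosine_similarity(a, b) for a, b in embedding_pairs]
correlation = pearson(model_scores, human_scores)
```

A correlation close to 1 means the ordering and spacing of the model's similarities track the human judgments.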

{"in0": [8976, 43036, 10889, 19131, 42641, 23620, 40005, 21984, 29937, 58], 
"in1": [8653, 36222, 10889, 23971, 22084, 42641, 23620, 40005, 21984], 
"label": 1}

Training this sentence similarity model is just like training any other model using the Amazon SageMaker built-in algorithms. You first define the data channels, which include channels for training data, validation data, and auxiliary data (the vocabulary file and the pre-trained embedding file). You then configure the necessary training hyperparameters and use an Amazon SageMaker estimator to kick off a training job. See the following sample code snippet for the syntax for using the Amazon SageMaker Estimator to start the training job.

regressor = sagemaker.estimator.Estimator(container,
                                          role, 
                                          train_instance_count=1, 
                                          train_instance_type='ml.p2.xlarge',
                                          output_path=output_path,
                                          sagemaker_session=sess)

regressor.fit(input_channels)

When the training job runs, you will see the following performance metrics being reported in Amazon CloudWatch Logs and directly inside the Jupyter notebook console after each epoch runs.

[10/14/2018 22:13:39 INFO 140406399915840] Completed Epoch: 0, time taken: 0:00:18.556232
[10/14/2018 22:13:39 INFO 140406399915840] Epoch 0 Training metrics:   perplexity: 2.433 cross_entropy: 0.889 accuracy: 0.582 
[10/14/2018 22:13:39 INFO 140406399915840] #quality_metric: host=algo-1, epoch=0, train cross_entropy <loss>=0.88893990372
[10/14/2018 22:13:39 INFO 140406399915840] #quality_metric: host=algo-1, epoch=0, train accuracy <score>=0.582158705767
[10/14/2018 22:13:39 INFO 140406399915840] Epoch 0 Validation metrics: perplexity: 2.139 cross_entropy: 0.760 accuracy: 0.669 
[10/14/2018 22:13:39 INFO 140406399915840] #quality_metric: host=algo-1, epoch=0, validation cross_entropy <loss>=0.760480128229
[10/14/2018 22:13:39 INFO 140406399915840] #quality_metric: host=algo-1, epoch=0, validation accuracy <score>=0.668823242188 

Check out the full notebook for detailed instructions on running the embedding pipeline and additional code samples on how to use the STS dataset for measuring sentence similarity.

Conclusion

In this blog post, we introduced the new Amazon SageMaker Object2Vec algorithm. This post and the companion notebooks show you how Object2Vec works and how it can be applied to a range of practical business use cases.


About the authors

David Ping is a Principal Solutions Architect with the AWS Solutions Architecture organization. He works with our customers to build cloud and machine learning solutions using AWS. He lives in the NY metro area and enjoys learning the latest machine learning technologies.

Patrick Ng is a Software Development Engineer in the Verticals and Applications Group at AWS AI. He works on building scalable distributed machine learning algorithms, with a focus on deep neural networks and natural language processing. Before Amazon, he obtained his PhD in Computer Science from Cornell University and worked at startup companies building machine learning systems.

Cheng Tang is an Applied Scientist in the Verticals and Applications Group at AWS AI. Broadly interested in machine learning research and its applications to the natural language processing domain, Cheng finds great inspiration in being part of both the research and the industrialization of machine learning/deep learning algorithms, and she is thrilled to see them delivered to customers.

Saswata Chakravarty is a Software Engineer in the AWS Algorithms team. He works on bringing fast and scalable algorithms to Amazon SageMaker and making them easy to use for customers.

Ramesh Nallapati is a Senior Applied Scientist in the Verticals and Applications Group at AWS AI. He works on building novel deep neural networks at scale, primarily in the natural language processing domain. He is very passionate about deep learning, enjoys learning about the latest developments in AI, and is excited about contributing to this field to the best of his abilities.

Bing Xiang is a Principal Scientist and Head of Verticals and Applications Group at AWS AI. He leads a team of scientists and engineers working on deep learning, machine learning, and natural language processing for multiple AWS services.

 Acknowledgments

We would like to thank Orchid Majumder, Software Development Engineer, AI Platforms team, for his early contributions on this project, as well as Laurence Rouesnel, Group Manager for the Algorithms & Platforms Group in Amazon AI Labs, and Leo Dirac, Senior Principal Engineer, AI Platforms team, for their expert advice and guidance throughout this work.

 

K-means clustering with Amazon SageMaker

Amazon SageMaker provides several built-in machine learning (ML) algorithms that you can use for a variety of problem types. These algorithms provide high-performance, scalable machine learning and are optimized for speed, scale, and accuracy. Using these algorithms, you can train on petabyte-scale data. They are designed to provide up to 10x the performance of the other available implementations. In this blog post, we will explore k-means, an unsupervised learning algorithm for clustering problems. In addition, we’ll walk through the details of the Amazon SageMaker built-in k-means algorithm.

What is k-means?

The k-means algorithm attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups (see the following figure). You define the attributes that you want the algorithm to use to determine similarity. Put another way, k-means solves a clustering problem: it finds k cluster centroids for a given set of records, such that all points within a cluster are closer in distance to their centroid than they are to any other centroid.

The diagram demonstrates that in the given dataset, there are three obvious clusters marked red, blue, and green. Each cluster has a cluster center. Note that the points in each cluster are spatially closer to the cluster center they are assigned to than the other cluster centers. 

Mathematically, it can be interpreted as follows:

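Given: $S = \{x_1, \dots, x_n\}$, a set of $n$ vectors of dimension $d$, and an integer $k$

Goal: Find $C = \{\mu_1, \dots, \mu_k\}$, a set of $k$ cluster centers, that minimizes the standard k-means objective, the sum of squared distances from each point to its nearest center:

$$\sum_{x \in S} \min_{\mu \in C} \lVert x - \mu \rVert^2$$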
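The quantity that k-means minimizes, the sum over all points of the squared distance to the nearest center, can be evaluated directly with NumPy. The function name `kmeans_cost` and the toy data are illustrative.

```python
import numpy as np


def kmeans_cost(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    # Pairwise differences have shape (n, k, d); squared distances have shape (n, k).
    diffs = points[:, None, :] - centers[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    # For each point keep only the distance to the closest center.
    return float(np.sum(np.min(sq_dists, axis=1)))


points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]])
centers = np.array([[0.0, 1.0], [10.0, 0.0]])
cost = kmeans_cost(points, centers)
```

Here the first two points are each at squared distance 1 from the first center, and the third point coincides with the second center, so the cost is 2.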

Where can you use k-means?

 The k-means algorithm can be a good fit for finding patterns or groups in large datasets that have not been explicitly labeled. Here are some example use cases in different domains:

  • E-commerce
    • Classifying customers by purchase history or clickstream activity.
  • Healthcare
    • Detecting patterns for diseases or successful treatment scenarios.
    • Grouping similar images for image detection.
  • Finance
    • Detecting fraud by detecting anomalies in the dataset. For example, detecting credit card fraud from abnormal purchase patterns.
  • Technology
    • Building a network intrusion detection system that aims to identify attacks or malicious activity.
  • Meteorology
    • Detecting anomalies in collected sensor data, for example in storm forecasting.

We’ll provide a step-by-step tutorial for k-means using the Amazon SageMaker built-in k-means algorithm and the technique to select an optimal k for a given dataset.

The Amazon SageMaker k-means algorithm

The Amazon SageMaker implementation of k-means combines several independent approaches. The first is the stochastic variant of Lloyd’s iteration, given by [Sculley ’10 https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf]. The second is a more theoretical approach based on facility location [Meyerson ’01 http://web.cs.ucla.edu/~awm/papers/ofl.pdf and subsequent works]. The third is the divide-and-conquer, or core-set, approach [Guha et al. ’03 http://theory.stanford.edu/~nmishra/Papers/clusteringDataStreamsTheoryPractice.pdf].

The high-level idea is to implement the stochastic Lloyd variant of [Sculley ’10], yet with more centers than required. During the data processing phase, we keep track of the cluster sizes, disregard centers with small clusters, and open new centers using techniques inspired by facility location algorithms. To handle a state that has more centers than needed, we use a technique inspired by core-sets: we represent the dataset by the larger set of centers, so that each center represents the data points in its cluster. Given this view, after we finish processing the stream, we finalize the state into a model of k centers by running a local version of k-means that clusters the larger set of centers, with k-means++ initialization and Lloyd’s iteration.
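To give a feel for the first ingredient only, the sketch below shows a single-pass stochastic Lloyd update in which each center moves toward the points assigned to it with a per-center learning rate of 1/count. It is not the SageMaker implementation: it omits the extra centers, the facility-location step, and the final clustering of centers, and the function name `stochastic_lloyd` is an assumption.

```python
import numpy as np


def stochastic_lloyd(stream, initial_centers):
    """One pass over `stream`, updating the nearest center after every point."""
    centers = np.array(initial_centers, dtype=float)
    counts = np.zeros(len(centers))
    for x in stream:
        x = np.asarray(x, dtype=float)
        # Assign the point to its nearest center (O(kd) work per point).
        nearest = int(np.argmin(np.sum((centers - x) ** 2, axis=1)))
        counts[nearest] += 1
        # Per-center learning rate: the center becomes the running mean of its points.
        eta = 1.0 / counts[nearest]
        centers[nearest] = (1.0 - eta) * centers[nearest] + eta * x
    return centers, counts


stream = [[0.0, 0.0], [10.0, 10.0], [0.0, 2.0], [10.0, 12.0]]
final_centers, cluster_sizes = stochastic_lloyd(stream, [[0.0, 0.0], [10.0, 10.0]])
```

Because the update only needs the current centers and counts, the same loop can simply be continued when new data arrives.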

Highlights

Single pass. Amazon SageMaker k-means is able to obtain a good clustering with only a single pass over the data. This property translates into a blazing fast runtime. Additionally, it allows for incremental updates. For example, imagine we have a dataset that keeps growing every day. If we require a clustering of the entire set, we don’t need to retrain over the entire collection every day. Instead, we can update the model in time proportional only to the new amount of data.

Speed and GPU support. In addition to being a single-pass implementation, our algorithm can run on a GPU machine, achieving blazing-fast speed. For example, processing a 400-dimensional dataset of 23 million entries (~37 GB of data) with k=500 clusters can be done in 7 minutes, at a cost of a little over one dollar. For comparison, a popular and fast alternative, Spark streaming k-means, would require 26 minutes to run and cost about $8.50.

To explain the advantage of using GPUs, notice that the time it takes to process each data point of dimension d is O(kd) with k being the number of clusters. For a large number of clusters, GPU machines provide a much faster (and cheaper) solution than CPU implementations.

Accuracy. Although we require only a single pass, our algorithm achieves the same mean square distance cost as the state-of-the-art multiple-pass implementation of k-means++ (or k-means||) initialization coupled with Lloyd’s iteration. For comparison, in our experiments, current implementations of a single-pass solution, based on minor modifications of the paper [Sculley ’10], achieve a clustering with a mean square distance 1.5 to 2 times larger than that of the multi-pass solutions.

Getting started

In our example, we’ll use k-means on the GDELT dataset, which monitors news from across the world, with data recorded for every second of every day. This information is freely available on Amazon S3 as part of the AWS Public Datasets program.

The data are stored as multiple files on Amazon S3, in two different formats: historical, which covers the years from 1979 to 2013, and daily updates, which cover the years from 2013 on. For this example, we’ll stick to the historical format. Let’s bring in 1979 data for the purpose of interactive exploration. We’ll import the required libraries and write a simple function so that later we can use it to download multiple files. Replace <user-data-bucket> with the name of your Amazon S3 bucket.

import boto3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display
import io
import time
import copy
import json
import sys
import sagemaker.amazon.common as smac
import os
import mxnet as mx
from scipy.spatial.distance import cdist
import numpy as np
from numpy import array
import urllib.request
import gzip
import pickle
import sklearn.cluster
import sklearn
import re
from sagemaker import get_execution_role

# S3 bucket and prefix
bucket = '<user-data-bucket>'  # replace with your bucket name
prefix = 'sagemaker/DEMO-kmeans'

role = get_execution_role()

def get_gdelt(filename):
    # download one events file from the public GDELT bucket to a local file
    s3 = boto3.resource('s3')
    s3.Bucket('gdelt-open-data').download_file('events/' + filename, '.gdelt.csv')
    # the events file is tab-separated
    df = pd.read_csv('.gdelt.csv', sep='\t')
    # column names are published separately in a tab-separated header file
    header = pd.read_csv('https://www.gdeltproject.org/data/lookups/CSV.header.historical.txt', sep='\t')
    df.columns = header.columns
    return df

data = get_gdelt('1979.csv')
data

As we can see, there are 57 columns, some of which are sparsely populated, cryptically named, and in a format that’s not particularly friendly for machine learning. So, for our use case, we’ll strip down to a few core attributes. We’ll use the following:

  • EventCode: This is the raw CAMEO action code describing the action that Actor1 performed upon Actor2.  More detail can be found at (https://www.gdeltproject.org/data/documentation/CAMEO.Manual.1.1b3.pdf)
  • NumArticles: This is the total number of source documents containing one or more mentions of this event. This can be used as a method of assessing the “importance” of an event. The more discussion of that event, the more likely it is to be significant.
  • AvgTone: This is the average “tone” of all documents containing one or more mentions of this event. The score ranges from -100 (extremely negative) to +100 (extremely positive). Common values range between -10 and +10, with 0 indicating neutral.
  • Actor1Geo_Lat: This is the centroid latitude of the Actor1 landmark for mapping.
  • Actor1Geo_Long: This is the centroid longitude of the Actor1 landmark for mapping.
  • Actor2Geo_Lat: This is the centroid latitude of the Actor2 landmark for mapping.
  • Actor2Geo_Long: This is the centroid longitude of the Actor2 landmark for mapping.

We will now prepare our data for machine learning. We will also use a few functions to help us scale this to GDELT datasets from other years.

data = data[['EventCode', 'NumArticles', 'AvgTone', 'Actor1Geo_Lat', 'Actor1Geo_Long', 'Actor2Geo_Lat', 'Actor2Geo_Long']]
data['EventCode'] = data['EventCode'].astype(object)

events = pd.crosstab(index=data['EventCode'], columns='count').sort_values(by='count', ascending=False).index[:20]

# Routine that converts the training data into the protobuf format required for SageMaker k-means.
def write_to_s3(bucket, prefix, channel, file_prefix, X):
    buf = io.BytesIO()
    smac.write_numpy_to_dense_tensor(buf, X.astype('float32'))
    buf.seek(0)
    boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, channel, file_prefix + '.data')).upload_fileobj(buf)

#filter data based on actor locations and events as described above
def transform_gdelt(df, events=None):
    df = df[['AvgTone', 'EventCode', 'NumArticles', 'Actor1Geo_Lat', 'Actor1Geo_Long', 'Actor2Geo_Lat', 'Actor2Geo_Long']]
    df['EventCode'] = df['EventCode'].astype(object)
    if events is not None:
        df = df[np.in1d(df['EventCode'], events)]
    return pd.get_dummies(df[((df['Actor1Geo_Lat'] == 0) & (df['Actor1Geo_Long'] == 0) != True) &
                                   ((df['Actor2Geo_Lat'] == 0) & (df['Actor2Geo_Long'] == 0) != True)])

# Prepare the training data and save it to S3.
def prepare_gdelt(bucket, prefix, file_prefix, events=None, random_state=1729, save_to_s3=True):
    df = get_gdelt(file_prefix + '.csv')
    model_data = transform_gdelt(df, events)
    # .values replaces DataFrame.as_matrix(), which was removed in pandas 1.0
    train_data = model_data.sample(frac=1, random_state=random_state).values
    if save_to_s3:
        write_to_s3(bucket, prefix, 'train', file_prefix, train_data)
    return train_data

# using the dataset for 1979
train_79 = prepare_gdelt(bucket, prefix, '1979', events, save_to_s3=False)
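
To make the filtering and encoding steps concrete, here is a minimal sketch on a made-up four-row table (the rows, the event codes 10/20/30, and the helper name toy_transform are illustrative assumptions, not GDELT data). It mirrors the logic of transform_gdelt: keep only the selected event codes, drop rows where either actor has a (0, 0) location, and one-hot encode EventCode. With the six numeric columns plus one indicator column per retained event code, the real dataset with the top 20 events would yield 6 + 20 = 26 columns, which matches the feature_dim value used for training later in the post.

```python
import pandas as pd

def toy_transform(df, events):
    """Simplified stand-in for transform_gdelt, for illustration only."""
    df = df.copy()
    df['EventCode'] = df['EventCode'].astype(object)
    # keep only the selected event codes
    df = df[df['EventCode'].isin(events)]
    # a (0, 0) latitude/longitude pair is treated as a missing location
    actor1_missing = (df['Actor1Geo_Lat'] == 0) & (df['Actor1Geo_Long'] == 0)
    actor2_missing = (df['Actor2Geo_Lat'] == 0) & (df['Actor2Geo_Long'] == 0)
    # one-hot encode the remaining object column (EventCode)
    return pd.get_dummies(df[~actor1_missing & ~actor2_missing])

# made-up rows: row 2 has a missing Actor1 location, row 3 has an unselected event code
toy = pd.DataFrame({
    'AvgTone': [1.5, -2.0, 0.3, 4.1],
    'EventCode': [10, 20, 10, 30],
    'NumArticles': [3, 1, 7, 2],
    'Actor1Geo_Lat': [43.7, 35.7, 0.0, 40.7],
    'Actor1Geo_Long': [-79.4, 139.7, 0.0, -74.0],
    'Actor2Geo_Lat': [48.9, 51.5, 52.5, 41.9],
    'Actor2Geo_Long': [2.4, -0.1, 13.4, 12.5],
})

encoded = toy_transform(toy, events=[10, 20])
print(encoded.shape)          # 2 rows, 6 numeric columns + 2 indicator columns
print(list(encoded.columns))
```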

We will now use the training data and visualize using t-Distributed Stochastic Neighbor Embedding (TSNE).  TSNE is a non-linear dimensionality reduction algorithm used for exploring high-dimensional data.

# using TSNE for visualizing first 10000 data points from 1979 dataset
from sklearn import manifold
tsne = manifold.TSNE(n_components=2, init='pca', random_state=1200)
X_tsne = tsne.fit_transform(train_79[:10000])

plt.figure(figsize=(6, 5))
X_tsne_1000 = X_tsne[:1000]
plt.scatter(X_tsne_1000[:, 0], X_tsne_1000[:, 1])
plt.show()

After we have explored our data and we are ready for modeling, we can begin training. For this example, we are using data for years 1979 to 1980.

BEGIN_YEAR = 1979
END_YEAR = 1980

for year in range(BEGIN_YEAR, END_YEAR):
    train_data = prepare_gdelt(bucket, prefix, str(year), events)

# SageMaker k-means ECR images ARNs 
images = {'us-west-2': '174872318107.dkr.ecr.us-west-2.amazonaws.com/kmeans:latest',
          'us-east-1': '382416733822.dkr.ecr.us-east-1.amazonaws.com/kmeans:latest',
          'us-east-2': '404615174143.dkr.ecr.us-east-2.amazonaws.com/kmeans:latest',
          'eu-west-1': '438346466558.dkr.ecr.eu-west-1.amazonaws.com/kmeans:latest'}
image = images[boto3.Session().region_name]

We’ll run the training algorithm for values of k from 2 to 12 to determine the right number of clusters. If you are running training jobs in parallel, ensure that the Amazon EC2 limits in your account allow you to create the instances required for parallel training. To request a limit increase, see the AWS service limits documentation. In our case, we are using 24 ml.c4.8xlarge instances in parallel. You can run jobs sequentially by setting the variable run_parallel_jobs to False. Our training job ran for approximately 8 minutes. For pricing details, refer to the Amazon SageMaker pricing page.

from time import gmtime, strftime
output_time = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
output_folder = 'kmeans-lowlevel-' + output_time
K = range(2, 12) # change the range to be used for k
INSTANCE_COUNT = 2
run_parallel_jobs = True  # set to False to run jobs one at a time, for example if you do not want
# to create too many EC2 instances at once and risk hitting account limits
job_names = []


# launching jobs for all k
for k in K:
    print('starting train job:' + str(k))
    output_location = 's3://{}/kmeans_example/output/'.format(bucket) + output_folder
    print('training artifacts will be uploaded to: {}'.format(output_location))
    job_name = output_folder + str(k)

    create_training_params = {
        "AlgorithmSpecification": {
            "TrainingImage": image,
            "TrainingInputMode": "File"
        },
        "RoleArn": role,
        "OutputDataConfig": {
            "S3OutputPath": output_location
        },
        "ResourceConfig": {
            "InstanceCount": INSTANCE_COUNT,
            "InstanceType": "ml.c4.8xlarge",
            "VolumeSizeInGB": 50
        },
        "TrainingJobName": job_name,
        "HyperParameters": {
            "k": str(k),
            "feature_dim": "26",
            "mini_batch_size": "1000"
        },
        "StoppingCondition": {
            "MaxRuntimeInSeconds": 60 * 60
        },
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": "s3://{}/{}/train/".format(bucket, prefix),
                        "S3DataDistributionType": "FullyReplicated"
                    }
                },

                "CompressionType": "None",
                "RecordWrapperType": "None"
            }
        ]
    }

    sagemaker = boto3.client('sagemaker')

    sagemaker.create_training_job(**create_training_params)

    status = sagemaker.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
    print(status)
    if not run_parallel_jobs:
        try:
            sagemaker.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)
        finally:
            status = sagemaker.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
            print("Training job ended with status: " + status)
            if status == 'Failed':
                message = sagemaker.describe_training_job(TrainingJobName=job_name)['FailureReason']
                print('Training failed with the following error: {}'.format(message))
                raise Exception('Training job failed')
    
    job_names.append(job_name)

Now that we have started the training jobs, let’s poll for the jobs to ensure that all the jobs are complete. This is only used when training jobs run in parallel.

while len(job_names):
    job_name = job_names[0]  # always wait on and describe the job at the head of the list
    try:
        sagemaker.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)
    finally:
        status = sagemaker.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
        print("Training job ended with status: " + status)
        if status == 'Failed':
            message = sagemaker.describe_training_job(TrainingJobName=job_name)['FailureReason']
            print('Training failed with the following error: {}'.format(message))
            raise Exception('Training job failed')

    print(job_name)

    info = sagemaker.describe_training_job(TrainingJobName=job_name)
    job_names.pop(0)

We will now identify the optimal k for k-means using the elbow method.

plt.plot()
colors = ['b', 'g', 'r']
markers = ['o', 'v', 's']
models = {}
distortions = []
for k in K:
    s3_client = boto3.client('s3')
    key = 'kmeans_example/output/' + output_folder +'/' + output_folder + str(k) + '/output/model.tar.gz'
    s3_client.download_file(bucket, key, 'model.tar.gz')
    print("Model for k={} ({})".format(k, key))
    !tar -xvf model.tar.gz                       
    kmeans_model=mx.ndarray.load('model_algo-1')
    kmeans_numpy = kmeans_model[0].asnumpy()
    distortions.append(sum(np.min(cdist(train_data, kmeans_numpy, 'euclidean'), axis=1)) / train_data.shape[0])
    models[k] = kmeans_numpy
 
# Plot the elbow
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('distortion')
plt.title('Elbow graph')
plt.show()

In the graph we plot the Euclidean distance to the cluster centroid. You can see that the error decreases as k gets larger. This is because when the number of clusters increases, they should be smaller, so distortion is also smaller. This produces an “elbow effect” in the graph. The idea of the elbow method is to choose the k at which the rate of decrease sharply shifts. Based on the graph above, k=7 would be a good cluster size for this dataset. Once completed, make sure to stop the notebook instance to avoid additional charges.
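
As a small illustration of the quantities involved, the following sketch computes the same distortion measure (the mean distance from each point to its nearest centroid) on made-up 2-D points. The elbow_k helper is an assumption added for illustration: it picks the k with the largest second difference of the distortion curve, which is one simple way to automate "where the rate of decrease sharply shifts". The post itself reads the elbow off the graph.

```python
import numpy as np

def distortion(data, centroids):
    """Mean Euclidean distance from each point to its nearest centroid."""
    # pairwise distances, shape (n_points, n_centroids)
    distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    return distances.min(axis=1).mean()

def elbow_k(ks, distortions):
    """Heuristic (an assumption, not from the post): largest second difference."""
    d = np.asarray(distortions, dtype=float)
    curvature = d[:-2] - 2 * d[1:-1] + d[2:]
    # curvature[i] belongs to the (i + 1)-th value of k
    return list(ks)[int(np.argmax(curvature)) + 1]

# made-up data: every point is exactly one unit away from its nearest centroid
data = np.array([[0.0, 1.0], [1.0, 0.0], [10.0, 11.0], [9.0, 10.0]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
toy_distortion = distortion(data, centroids)

# made-up distortion curve that flattens after k=4
toy_elbow = elbow_k([2, 3, 4, 5, 6], [10.0, 6.0, 3.0, 2.5, 2.2])
print(toy_distortion, toy_elbow)
```

A heuristic like this is only a starting point; on noisy curves it is still worth inspecting the plot.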

Conclusion

In this post, we showed you how to use k-means to evaluate common clustering problems. Using k-means on Amazon SageMaker provides additional benefits like distributed training and managed model hosting without having to set up and manage any infrastructure. You can refer to Amazon SageMaker sample notebooks to get started.


About the Authors

Gitansh Chadha is a Solutions Architect at AWS. He lives in the San Francisco bay area and helps customers architect and optimize applications on AWS. In his spare time, he enjoys the outdoors and spending time with his twin daughters.

Piali Das is a Software Development Engineer on the AWS AI Algorithms team, which is responsible for building the Amazon SageMaker’s built-in algorithms. She enjoys programming for scientific applications in general and has developed an interest in machine learning and distributed systems.

Zohar Karnin is a Principal Scientist in Amazon AI. His research interests are in the area of large scale and online machine learning algorithms. He develops infinitely scalable machine learning algorithms for Amazon SageMaker.

AWS expands HIPAA eligible machine learning services for healthcare customers

Today, AWS announced that Amazon Translate, Amazon Comprehend, and Amazon Transcribe are now U.S. Health Insurance Portability and Accountability Act of 1996 (HIPAA) eligible services. This announcement adds to the number of AWS artificial intelligence services that are already HIPAA eligible: Amazon Polly, Amazon SageMaker, and Amazon Rekognition. By using these services, AWS customers in the healthcare industry can leverage data insights to deliver better outcomes for providers and patients using the power of machine learning (ML).

To support our healthcare customers, AWS HIPAA eligible services enable covered entities and their business associates subject to HIPAA to use the secure AWS environment to process, maintain, and store protected health information. Healthcare companies like NextGen Healthcare, Omada Health, Verge Health, and Orion Health are already running HIPAA workloads on AWS to analyze numerous patient records.

The addition of Amazon Translate, Amazon Transcribe, and Amazon Comprehend to the list of HIPAA eligible services will allow customers to leverage these AWS ML services to better streamline customer support and improve patient engagement. Customers can use these three services to leverage the following ML capabilities:

  • Amazon Transcribe: A speech-to-text service that automatically creates text transcripts from audio files. It allows healthcare organizations to create text transcripts of calls with patients.
  • Amazon Translate: A neural machine translation service that delivers fast, high-quality, and affordable language translation. This service can be employed to easily translate large volumes of text efficiently and enable patients to chat with their healthcare provider in their preferred language.
  • Amazon Comprehend: A natural language processing (NLP) service that can find insights and relationships in unstructured text. It can analyze sentiment (e.g., negative, positive, and neutral), and extract key phrases from patient interactions to better understand and improve engagement.

Many healthcare customers are exploring new ways to use the power of ML to advance their current workloads and transform how they provide care to patients, all while meeting the requirements of HIPAA.

Zocdoc, a company that provides medical care search for consumers, uses Amazon SageMaker, a platform that enables developers and data scientists to quickly and easily build, train, and deploy ML models, to reduce the time it takes to match patients and doctors.

“At Zocdoc, our focus has been making it easier for patients to find the right doctor and book an appointment at the most convenient time and location. You can imagine the ML use cases. There is a lot of excitement among Zocdoc engineers around how easy it is to quickly build and deploy a model using Amazon SageMaker. As a matter of fact, one of our mobile engineers was able to train and deploy a doctor specialty recommendation model from scratch in less than a day during a recent Zocdoc Hackathon, which we ended up rolling out to production. Previously, our data science team had to contribute to the development of any model work, which slowed down product teams given that the data science team is a shared resource. With Amazon SageMaker, we could get this from concept to a quick production test much faster, due to the ease of streamlined end-to-end build/deploy/test capabilities of Amazon SageMaker. HIPAA eligibility is a welcome improvement and will allow us to expand its use to improve healthcare experience for our patients.”

Aculab has been providing deployment-proven telecom products to the global communication market for nearly 40 years. They are leveraging Amazon Polly, a service that turns text into lifelike speech using deep-learning, to provide telecom solutions for their major healthcare customers.

“One of the key decision points that led Aculab to choose Amazon Polly for our Text-to-Speech (TTS) on the Aculab Cloud platform was the HIPAA support. We have major customers using our system for services such as medical appointment reminders, and we needed a TTS solution that we could use with HIPAA workloads to complement the rest of our HIPAA-compliant architecture. Amazon Polly was able to provide not only a world-class TTS service, but one that could safely handle protected health information,” said David Samuel, CEO of Aculab.

For additional information on Amazon ML services and how healthcare and life sciences companies can run sensitive workloads on AWS, refer to the following materials:

About the author

Vasi Philomin is the GM for Machine Learning & AI at AWS, responsible for Amazon Lex, Polly, Transcribe, Translate and Comprehend.

Now easily perform incremental learning on Amazon SageMaker

Data scientists and developers can now easily perform incremental learning on Amazon SageMaker. Incremental learning is a machine learning (ML) technique for extending the knowledge of an existing model by training it further on new data. Starting today both of the Amazon SageMaker built-in visual recognition algorithms – Image Classification and Object Detection – will provide out of the box support for incremental learning. So now you can easily load an existing Amazon SageMaker visual recognition model using the AWS Management Console or Amazon SageMaker Python SDK APIs, prior to starting the model training on new data.

Overview

Incremental learning is the technique of continuously extending the knowledge of an existing machine learning model by training it further on new data. So at the beginning of a training run, you first load the model weights from a prior training run instead of randomly initializing them, and then continue training the model on new data. In this way you preserve the knowledge that the model gained from prior training runs and extend it further. This is useful when you don’t have access to all of the training data at the same time and your data arrives continuously in batches over time. You can also use this learning technique to save some time and compute resources when re-training your model on new training data.
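
The idea can be illustrated outside of SageMaker with a toy model. The sketch below is an assumption-laden illustration (a two-weight linear model trained by plain gradient descent on synthetic data; none of the names come from the SageMaker algorithms): it trains on a first batch, then compares one further epoch on a second batch when starting from the previously learned weights against starting from a fresh initialization.

```python
import numpy as np

def fit(X, y, w=None, epochs=50, lr=0.1):
    """Least-squares linear model trained by full-batch gradient descent."""
    if w is None:
        # fresh start; zeros stand in for a new initialization
        w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])            # synthetic ground truth
X1 = rng.normal(size=(200, 2)); y1 = X1 @ true_w   # first batch
X2 = rng.normal(size=(200, 2)); y2 = X2 @ true_w   # batch that arrives later

w_first = fit(X1, y1)                     # initial training run

# one epoch on the new batch, with and without the earlier weights
w_incremental = fit(X2, y2, w=w_first, epochs=1)
w_from_scratch = fit(X2, y2, w=None, epochs=1)

incremental_error = mse(X2, y2, w_incremental)
scratch_error = mse(X2, y2, w_from_scratch)
print(incremental_error, scratch_error)
```

In this toy setting the run that reuses the earlier weights starts out far closer to a good solution, which is the effect incremental learning relies on.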

In this blog post we’ll also demonstrate how to use Amazon SageMaker incremental learning features to perform transfer learning. For this demonstration we’ll use an existing model off the shelf. We’ll choose an image classification model from a model zoo, and then use it as a starting point to train the model for performing a new classification task. Transfer learning enables building new models on top of state-of-the-art reference implementations for specific machine learning tasks. This is also useful when you don’t have enough data to train a deep and complex network from scratch.

Now let’s dive into the examples.

Incrementally train visual recognition models using Amazon SageMaker built-in algorithms

We have provided sample notebooks for both of the Amazon SageMaker built-in visual recognition algorithms – Image Classification, and Object Detection – that now support incremental learning. Following are the code snippets from the Image Classification notebook. If you are training an Amazon SageMaker Image Classification model for the first time, the notebook has step-by-step instructions for it. In this example we are assuming you already have an existing Image Classification model that was trained before on Amazon SageMaker.

Step 1: Define an input channel for consuming the existing Amazon SageMaker Image Classification model.

An Amazon SageMaker channel is a named input data source that training algorithms can consume. This input channel has to be named “model” and it specifies the Amazon S3 URI of the existing model. Note that the existing model artifact is a single gzip-compressed tar archive (.tar.gz suffix) created by Amazon SageMaker Training.

s3model = 's3://{}/model/'.format(bucket)
model_data = sagemaker.session.s3_input(s3model, distribution= 'FullyReplicated',s3_data_type='S3Prefix',content_type='application/x-sagemaker-model')
data_channels = {'train': train_data, 'validation': validation_data, 'model': model_data}
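
If you want to check what a model archive contains before pointing the “model” channel at it, a gzip-compressed tar archive can be inspected locally with the Python standard library. The sketch below is illustrative only: it builds a stand-in archive in a temporary directory, and the member file names are hypothetical placeholders rather than the actual contents of an Image Classification model artifact.

```python
import io
import os
import tarfile
import tempfile

def list_model_archive(path):
    """Return the member names of a gzip-compressed tar archive."""
    with tarfile.open(path, mode='r:gz') as archive:
        return archive.getnames()

# Build a stand-in archive; the member names below are hypothetical.
tmp_dir = tempfile.mkdtemp()
archive_path = os.path.join(tmp_dir, 'model.tar.gz')
with tarfile.open(archive_path, mode='w:gz') as archive:
    for name in ('model-symbol.json', 'model-0010.params'):
        payload = b'placeholder'
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))

members = list_model_archive(archive_path)
print(members)
```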

Step 2: Now continue training on a new batch of training data.
The hyperparameters that define the network, such as num_layers, image_shape, num_classes, etc., should be the same as those used for training the existing model. Since the algorithm starts with an existing, pre-trained model, the accuracy would be higher right from the first epoch, thereby leading to faster convergence.

incr_ic = sagemaker.estimator.Estimator(training_image, role, train_instance_count=1, train_instance_type='ml.p2.xlarge', train_volume_size = 50, train_max_run= 360000, input_mode= 'File', output_path=s3_output_location, sagemaker_session=sess)

incr_ic.set_hyperparameters(num_layers=18, image_shape= "3,224,224", num_classes=257, num_training_samples=15420, mini_batch_size=128, epochs=10, learning_rate=0.01, top_k=2)

incr_ic.fit(inputs=data_channels, logs=True)
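
Because the network-defining hyperparameters must match those of the existing model, it can help to compare them before launching the incremental job. The following is a minimal sketch, assuming you have recorded the earlier job's hyperparameters in a dictionary; the helper name network_mismatches and the previous job's epochs value are hypothetical, and only the three hyperparameters named above are checked.

```python
# Hyperparameters that define the network and must stay the same
NETWORK_KEYS = ('num_layers', 'image_shape', 'num_classes')

def network_mismatches(previous, new, keys=NETWORK_KEYS):
    """Return {name: (previous_value, new_value)} for every differing key."""
    return {key: (previous.get(key), new.get(key))
            for key in keys
            if previous.get(key) != new.get(key)}

# hyperparameters of the earlier training run (assumed to be recorded somewhere)
previous_job = {'num_layers': 18, 'image_shape': '3,224,224',
                'num_classes': 257, 'epochs': 30}
# hyperparameters for the incremental run, as in the snippet above
new_job = {'num_layers': 18, 'image_shape': '3,224,224', 'num_classes': 257,
           'epochs': 10, 'learning_rate': 0.01}

mismatches = network_mismatches(previous_job, new_job)
print(mismatches or 'network hyperparameters match')
```

Training-schedule settings such as epochs or learning_rate are free to differ between runs, so they are deliberately left out of the comparison.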

You can repeat these steps as many times as you need to train your model further on new data.

Use a pre-trained Caffe model from ONNX model zoo to perform your image classification task

We’ll now show you an example of how to pick a model off the shelf, in this case a Caffe BVLC GoogleNet model that was trained using the ImageNet dataset and available on the ONNX Model Zoo. We’ll use this model as a starting point and then fine-tune it for a new image classification task on the Caltech 101 Dataset using Amazon SageMaker. We’re using the same model training script as shown in the MXNet/Gluon tutorial for transfer learning.

We’ll use the Amazon SageMaker MXNet framework container to train the model. Also note that this example uses the Amazon SageMaker Python SDK, similar to our existing Gluon notebooks.

Step 1: Download the pre-trained GoogleNet model from the ONNX model zoo and upload the model.onnx file to Amazon S3.

The ONNX model zoo hosts pre-trained models in Amazon S3 buckets in the us-east-1 AWS Region. You can use the Amazon S3 URI of the pre-trained model as is. However, if you are using Amazon SageMaker training in a different AWS Region (such as us-west-2), here is sample code for moving the file across Regions.

# first download model from https://github.com/onnx/models/tree/master/bvlc_googlenet
wget --quiet -P data/ https://s3.amazonaws.com/download.onnx/models/opset_3/bvlc_googlenet.tar.gz

tar -xzf data/bvlc_googlenet.tar.gz -C data/ && rm data/bvlc_googlenet.tar.gz
...
#now upload the model to a bucket in Region where you are using Amazon SageMaker
sagemaker_session.upload_data(path='data/bvlc_googlenet/model.onnx', key_prefix='data/pretrained')

Step 2: Define Amazon SageMaker channels for the input data – one for the Caltech 101 training dataset and another for the pre-trained GoogleNet model.

In this example we define a ‘training’ channel for the Caltech 101 training dataset, and a ‘pretrained’ channel for the pre-trained GoogleNet model (from Step 1).

s3train = 's3://{}/{}/'.format(bucket, 'data/ONNX-incremental')
s3pretrained = 's3://{}/{}/'.format(bucket, 'data/pretrained')

training_data = sagemaker.session.s3_input(s3train, distribution= 'FullyReplicated', s3_data_type='S3Prefix', input_mode='File')

pretrained_model = sagemaker.session.s3_input(s3pretrained, distribution='FullyReplicated', s3_data_type='S3Prefix', input_mode='File')

As you can see we are defining the input mode as ‘File’ at each channel level. File mode enables fetching the pre-trained model from Amazon S3 to local storage attached to the Amazon SageMaker training instances before the model training starts.

Now before we show you the code for starting Amazon SageMaker training using our pre-built MXNet container, we will first show you how you can make small, one-line code changes to the model training script from the Gluon tutorial for transfer learning for easily accessing your pre-trained GoogleNet model.

Step 3: Easily access the channel information inside the MXNet container using environment variables.

You can use the default environment variables of the MXNet container that are automatically initialized by Amazon SageMaker with all the information about the input channels you defined in Step 2.

parser.add_argument('--training_channel', type=str, default=os.environ['SM_CHANNEL_TRAINING'])

parser.add_argument('--pretrained_model_channel', type=str, default=os.environ['SM_CHANNEL_PRETRAINED'])
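
To show how these defaults resolve, here is a small self-contained sketch. Inside a training container, SageMaker mounts each channel under /opt/ml/input/data/<channel name> and exposes the path through the matching SM_CHANNEL_ variable. The local fallback directories used when the variables are absent are hypothetical and only there so the script can also be tried outside the container.

```python
import argparse
import os

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Fallback paths are hypothetical local directories for testing
    # outside of a SageMaker container.
    parser.add_argument('--training_channel', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAINING', 'data/training'))
    parser.add_argument('--pretrained_model_channel', type=str,
                        default=os.environ.get('SM_CHANNEL_PRETRAINED', 'data/pretrained'))
    return parser.parse_args(argv)

# Simulate the variables that SageMaker sets for the two channels from Step 2.
os.environ['SM_CHANNEL_TRAINING'] = '/opt/ml/input/data/training'
os.environ['SM_CHANNEL_PRETRAINED'] = '/opt/ml/input/data/pretrained'

args = parse_args([])
print(args.training_channel)
print(args.pretrained_model_channel)
```

Using os.environ.get with a fallback, rather than indexing os.environ directly, avoids a KeyError when the script runs where the variables are not defined.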

Now you are ready to call the train function in the model training script, passing it the Caltech 101 training dataset and pre-trained GoogleNet model.

model = train(num_cpus, num_gpus, args.training_channel, args.model_dir, args.pretrained_model_channel, args.batch_size, args.epochs, args.learning_rate, args.weight_decay, args.momentum, args.log_interval)

You can save this updated script as transfer_learning_example.py.

Following is a short code snippet from the train function for illustration purposes. As you can see, the function loads the pre-trained GoogleNet model before tuning it further on Caltech 101 training dataset.

def train(num_cpus, num_gpus, training_dir, model_dir, pretrained_model_dir, batch_size, epochs, learning_rate, weight_decay, momentum, log_interval):
    dataset_name = "101_ObjectCategories"
    
    # Location of the pre-trained model on local disk
    onnx_path = os.path.join(pretrained_model_dir, 'model.onnx')
    ...
    # Load the ONNX Model
    sym, arg_params, aux_params = onnx_mxnet.import_model(onnx_path)
 
    new_sym, new_arg_params, new_aux_params = get_layer_output(sym, arg_params, aux_params, 'flatten0')
    ... 

Step 4: Train the model on Amazon SageMaker using a pre-built MXNet container.

You are now ready to run the training script from Step 3 using a pre-built Amazon SageMaker MXNet container. We recommend using a GPU instance for faster training. In this example, we use a p3.2xlarge instance.

m = MXNet('transfer_learning_example.py',
          role=role,
          train_instance_count=1,
          train_instance_type='ml.p3.2xlarge',
          framework_version='1.3.0',
          py_version='py2',
          hyperparameters={'batch-size': 32,
                           'epochs': 5,
                           'learning-rate': 0.0005,
                           'weight-decay': 0.00001, 
                           'momentum': 0.9})

m.fit(inputs=channels, logs=True)

Step 5: Observe the improvement in training accuracy from the training logs.

Our training script prints out the untrained network accuracy on the new data set and the accuracy after fine-tuning on the new dataset.

Train dataset: 6996 images, Test dataset: 1681 images
...
Untrained network Test Accuracy: 0.0120...
...
Epoch [0] Test Accuracy 0.7025
...
Epoch [1] Test Accuracy 0.8558
...
Epoch [2] Test Accuracy 0.8876
...
Epoch [4] Test Accuracy 0.9183

As you can see, we were able to improve our accuracy on the Caltech 101 Dataset substantially with just a few minutes of fine-tuning on a GPU!
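
If you want to track these numbers programmatically, the per-epoch accuracies can be pulled out of the log text with a regular expression. This is a minimal sketch that assumes the log lines have exactly the format shown above; the helper name epoch_accuracies is illustrative.

```python
import re

# Log excerpt copied from the training output shown above
LOG = """Train dataset: 6996 images, Test dataset: 1681 images
Untrained network Test Accuracy: 0.0120...
Epoch [0] Test Accuracy 0.7025
Epoch [1] Test Accuracy 0.8558
Epoch [2] Test Accuracy 0.8876
Epoch [4] Test Accuracy 0.9183
"""

EPOCH_RE = re.compile(r'Epoch \[(\d+)\] Test Accuracy (\d+\.\d+)')

def epoch_accuracies(log_text):
    """Map epoch number to test accuracy for every matching log line."""
    return {int(epoch): float(acc) for epoch, acc in EPOCH_RE.findall(log_text)}

accuracies = epoch_accuracies(LOG)
best_epoch = max(accuracies, key=accuracies.get)
print(accuracies)
print('best epoch:', best_epoch)
```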

Get started with more examples and developer support

In this blog post we showed you examples of how to easily perform incremental learning and transfer learning using input channels on Amazon SageMaker. You can refer to our developer guide for more developer resources or post your questions on our developer forum. Happy modeling!


About the authors

Gurumurthy Swaminathan is a Senior Applied Scientist in the Amazon AI Platforms group and is working on building computer vision algorithms for SageMaker. His current area of research includes Neural Network compression and Computer Vision algorithms.

Jeffrey Geevarghese is a Senior Engineer in Amazon AI where he’s passionate about building scalable infrastructure for deep learning. Prior to this he was working on machine learning algorithms and platforms and was part of the launch teams for both Amazon SageMaker and Amazon Machine Learning.

Sumit Thakur is a Senior Product Manager for AWS Machine Learning Platforms where he loves working on products that make it easy for customers to get started with machine learning on cloud. He is product manager for Amazon SageMaker and AWS Deep Learning AMI. In his spare time, he likes connecting with nature and watching sci-fi TV series.

Direct access to Amazon SageMaker notebooks from Amazon VPC by using an AWS PrivateLink endpoint

Amazon SageMaker now supports AWS PrivateLink for notebook instances. In this post, I will show you how to set up AWS PrivateLink to secure your connection to Amazon SageMaker notebooks.

Maintaining compliance with regulations such as HIPAA or PCI may require preventing information from traversing the internet. Additionally, preventing exposure of data to the public internet reduces the likelihood of threat vectors such as brute force and distributed denial-of-service attacks.

AWS PrivateLink simplifies the security of data shared with cloud-based applications by eliminating the exposure of data to the public internet. It enables private connectivity between VPCs, AWS services, and on-premises applications. With AWS PrivateLink your services function as though they were hosted directly on your private network.

To secure your Amazon SageMaker API and prediction calls using AWS PrivateLink, we previously introduced PrivateLink support for API operations and runtime. Now it’s possible to use AWS PrivateLink to secure your connection to notebook instances as well.

To use Amazon SageMaker notebooks via AWS PrivateLink, you need to set up Amazon Virtual Private Cloud (VPC) endpoints. AWS PrivateLink enables you to privately access all Amazon SageMaker API operations from your VPC in a scalable manner by using interface VPC endpoints. A VPC endpoint is an elastic network interface in your subnet with private IP addresses. It serves as an entry point for all Amazon SageMaker API calls.

To limit access to the VPC endpoints you created, you also need to configure AWS Identity and Access Management (IAM) roles to allow traffic only from your VPC.

Note: Keep in mind that the AWS Management Console is accessed through the public internet, and since your connection will be avoiding the internet with AWS PrivateLink, you’ll need to use Amazon SageMaker only through the CLI and APIs. In other words, you won’t be able to use Amazon SageMaker through the console after you activate AWS PrivateLink with the following configuration.

Creating VPC endpoints

We will go through AWS Management Console steps to create VPC endpoints, but you can do the same operations using AWS Command Line Interface (AWS CLI) commands as well.

Here, we will create 2 VPC endpoints, where one is used to create a notebook instance by using SageMaker APIs, and the other one is used to access the notebook instance you created (CreatePresignedNotebookInstanceUrl). To create VPC endpoints from the console, open the Amazon VPC console, open the Endpoints page, and create a new endpoint, as shown in the following image.

Let’s start with creating the VPC endpoint for our notebook first. Here, you’ll need to define three attributes:

  1. The Amazon SageMaker API service name. For Service category, select AWS services; and for Service Name, select aws.sagemaker.us-west-2.notebook. (The Region in the service name, us-west-2 here, may differ depending on the Region you select.)
  2. The VPC and Availability Zones that you want to use.
  3. The security group to be associated with the interface VPC endpoint: If you don’t specify a security group, the default security group for your VPC is associated.

Here, a private hosted zone enables you to access the resources in your VPC using custom DNS domain names, such as example.com, instead of using private IPv4 addresses or private DNS hostnames provided by AWS. The Amazon SageMaker DNS hostname that the AWS CLI and Amazon SageMaker SDKs use by default (https://api.sagemaker.Region.amazonaws.com) resolves to your VPC endpoint.

Repeat the same steps to create a second VPC endpoint for Amazon SageMaker APIs. This time you’ll select com.amazonaws.us-west-2.sagemaker.api while selecting the service name. You can begin using the VPC endpoint when its status is available.
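
The same two endpoints can also be requested programmatically. The sketch below only builds the request parameters for the EC2 CreateVpcEndpoint operation, using the two service names mentioned above; the VPC, subnet, and security group IDs are placeholders, and enabling private DNS is an assumption made so that the default SageMaker hostnames resolve to the endpoints. The actual boto3 call is left commented out because it needs AWS credentials.

```python
def interface_endpoint_request(vpc_id, service_name, subnet_ids, security_group_ids):
    """Build keyword arguments for EC2 create_vpc_endpoint (interface type)."""
    return {
        'VpcEndpointType': 'Interface',
        'VpcId': vpc_id,
        'ServiceName': service_name,
        'SubnetIds': list(subnet_ids),
        'SecurityGroupIds': list(security_group_ids),
        'PrivateDnsEnabled': True,   # assumption: use private DNS for the endpoint
    }

# Service names as given in the post for the us-west-2 Region
SERVICE_NAMES = [
    'aws.sagemaker.us-west-2.notebook',
    'com.amazonaws.us-west-2.sagemaker.api',
]

# Placeholder resource IDs; replace with your own
endpoint_requests = [
    interface_endpoint_request('vpc-111bbaaa', name, ['subnet-aaa111'], ['sg-bbb222'])
    for name in SERVICE_NAMES
]

# To create the endpoints (requires boto3 and AWS credentials):
# import boto3
# ec2 = boto3.client('ec2', region_name='us-west-2')
# for request in endpoint_requests:
#     ec2.create_vpc_endpoint(**request)
print(endpoint_requests)
```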

Connecting your private network to your VPC

After you create VPC endpoints, make sure that you are either trying to access your notebook instances from within the same VPC or that you have a configuration in place, such as Amazon Virtual Private Network (VPN) or AWS Direct Connect, to connect to your notebooks. This is not necessary for other Amazon SageMaker API operations, but it’s essential to access your notebooks via a web browser from outside of your VPC since VPN needs to replace the internet gateway while connecting to your VPC. Here is a tutorial that you can refer to while connecting your private network to your VPC by using a VPN: https://aws.amazon.com/premiumsupport/knowledge-center/create-connection-vpc/

Configuring IAM roles

Once you have created VPC endpoints to the API services, you need to update IAM roles with conditional operator policies for all users or groups that will be accessing Amazon SageMaker notebooks. IAM is a web service that helps you securely control access to AWS resources. A policy is an entity that, when attached to an identity or resource, defines their permissions.

To grant or restrict access to Amazon SageMaker notebooks based on the VPC endpoints used, we will employ an aws:sourceVpce condition in the IAM policy. Since IAM denies all access requests by default, attaching an Allow policy with a condition ensures that requests will be successful only if they meet the required condition. For example, the following policy allows a user to perform API operations only when the request comes through one of the two specified VPC endpoints (replace the placeholder AWS account ID with your own account ID, and the placeholder VPC endpoint IDs with your own endpoint IDs). Don’t forget to include both VPC endpoints you created.

Note: The actions covered in the following policy exemplify notebook access cases specifically. You need to update the “Action” section for other Amazon SageMaker APIs you want to cover. Alternatively, you can use “sagemaker:*” to cover all Amazon SageMaker APIs in your policy.

{
    "Id": "notebook-example-with-sourcevpce",
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "EnableNotebookAccess",
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreatePresignedNotebookInstanceUrl",
                "sagemaker:DescribeNotebookInstance"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:sourceVpce": [
                        "vpce-111bbccc",
                        "vpce-111bbddd"
                    ]
                }
            }
        }
    ]
}

This policy works by including an Allow statement with a StringEquals condition. When a user makes a request to Amazon SageMaker through a VPC endpoint, the endpoint’s ID is compared to the aws:sourceVpce values specified in the policy. If the values do not match, the request is denied.
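
The matching rule described above can be modeled in a few lines of Python. This is only a simplified sketch of how a StringEquals condition on aws:sourceVpce behaves, not how IAM is implemented; the function name allowed_by_source_vpce is hypothetical and the endpoint IDs are the placeholders from the example policy.

```python
def allowed_by_source_vpce(request_vpce_id, allowed_vpce_ids):
    """Simplified model of the StringEquals check on aws:sourceVpce.

    request_vpce_id is None when the request did not arrive through a VPC
    endpoint; the condition key is then absent and cannot match.
    """
    if request_vpce_id is None:
        return False
    # A condition that lists several values matches if any one value is
    # equal to the request value (the comparison is case-sensitive).
    return request_vpce_id in allowed_vpce_ids


# Placeholder endpoint IDs taken from the example policy.
allowed = ["vpce-111bbccc", "vpce-111bbddd"]

print(allowed_by_source_vpce("vpce-111bbccc", allowed))  # True
print(allowed_by_source_vpce("vpce-999zzzzz", allowed))  # False
print(allowed_by_source_vpce(None, allowed))             # False
```

When the check returns False, no Allow statement applies to the request, so the default deny takes effect.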

Another way to configure the policy is to use the aws:sourceVpc condition instead of the aws:sourceVpce condition. The difference is that the policy then refers to the VPC as a whole rather than to a specific endpoint within that VPC. This is useful when you want to limit access by VPC rather than by individual endpoint: the policy stays generic, and you won’t need to update IAM roles when you change the endpoints within that VPC. Here is an example:

{
    "Id": "notebook-example-with-sourcevpc",
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "EnableNotebookAccess",
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreatePresignedNotebookInstanceUrl",
                "sagemaker:DescribeNotebookInstance"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:SourceVpc": "vpc-111bbaaa"
                }
            }
        }
    ]
}
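
If you maintain several such policies, you may prefer to generate the JSON rather than edit it by hand. The following sketch builds the same policy structure for either condition key using only the Python standard library; the helper name notebook_access_policy is hypothetical, and the IDs are the placeholders used in the examples.

```python
import json


def notebook_access_policy(condition_key, values):
    """Build an Allow policy restricted by aws:sourceVpce or aws:SourceVpc."""
    if condition_key not in ("aws:sourceVpce", "aws:SourceVpc"):
        raise ValueError("unsupported condition key: " + condition_key)
    if not values:
        raise ValueError("at least one VPC or VPC endpoint ID is required")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "EnableNotebookAccess",
                "Effect": "Allow",
                "Action": [
                    "sagemaker:CreatePresignedNotebookInstanceUrl",
                    "sagemaker:DescribeNotebookInstance",
                ],
                "Resource": "*",
                # Several values under one key are treated as alternatives.
                "Condition": {"StringEquals": {condition_key: list(values)}},
            }
        ],
    }


# Placeholder VPC ID from the example; replace it with your own.
policy = notebook_access_policy("aws:SourceVpc", ["vpc-111bbaaa"])
print(json.dumps(policy, indent=4))
```

Passing "aws:sourceVpce" with a list of endpoint IDs produces the endpoint-based variant instead.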

You can also consider using Service Control Policies to restrict, at the account level, which services and actions the users, groups, and roles in your organization can use.

Using Amazon SageMaker Notebooks via AWS PrivateLink

Use the following example AWS CLI command to list notebook instances from inside your VPC using the configured VPC endpoint.

aws sagemaker list-notebook-instances --endpoint-url https://VPC_Endpoint_ID.api.sagemaker.Region.vpce.amazonaws.com

If you enabled private DNS hostnames for your VPC endpoint, you don’t need to specify the endpoint URL, because the default Amazon SageMaker API hostname then resolves to the endpoint.

If you did not enable a private hosted zone, or if you’re using an SDK released before August 13, 2018, you have to specify the endpoint when using the SDK or AWS CLI. For example:

aws --endpoint-url https://VPC_Endpoint_ID.api.sagemaker.Region.vpce.amazonaws.com sagemaker create-presigned-notebook-instance-url --notebook-instance-name NotebookInstanceName

For the VPC endpoint in the preceding example, this would be:

aws --endpoint-url https://vpce-08e906a63733a8aa1.api.sagemaker.us-west-2.vpce.amazonaws.com sagemaker create-presigned-notebook-instance-url --notebook-instance-name NotebookInstanceName

If you enabled a private hosted zone and you’re using an SDK released on or after August 13, 2018, this would be:

aws sagemaker create-presigned-notebook-instance-url --notebook-instance-name NotebookInstanceName
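
The two cases above differ only in whether the endpoint URL is passed explicitly. As a rough sketch, the following hypothetical helper (presigned_url_command is not part of any AWS tool) assembles the AWS CLI argument list for either case, assuming the hostname pattern shown in the commands above.

```python
def presigned_url_command(notebook_name, private_dns_enabled,
                          vpce_id=None, region=None):
    """Return the AWS CLI arguments for create-presigned-notebook-instance-url."""
    args = ["aws"]
    if not private_dns_enabled:
        # Without private DNS, the endpoint-specific hostname must be given.
        if not (vpce_id and region):
            raise ValueError("vpce_id and region are required without private DNS")
        url = "https://%s.api.sagemaker.%s.vpce.amazonaws.com" % (vpce_id, region)
        args += ["--endpoint-url", url]
    args += ["sagemaker", "create-presigned-notebook-instance-url",
             "--notebook-instance-name", notebook_name]
    return args


# Endpoint ID and Region taken from the example command above.
cmd = presigned_url_command("NotebookInstanceName", False,
                            "vpce-08e906a63733a8aa1", "us-west-2")
print(" ".join(cmd))
```

The returned list can be handed to subprocess.run(cmd, check=True) on a machine where the AWS CLI is installed and configured.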

Conclusion

AWS PrivateLink support is available in all Regions where Amazon SageMaker and AWS PrivateLink are available. To learn more about using security features in Amazon SageMaker, see the Amazon SageMaker Developer Guide.


About the Author

Erkan Tas is a Senior Product Manager for Amazon SageMaker. He is on a mission to make Artificial Intelligence easy, accessible, and scalable through AWS platforms. He is also a sailor, science and nature admirer, Go and Stratocaster player.

Customize your notebook volume size, up to 16 TB, with Amazon SageMaker

Amazon SageMaker now allows you to customize the notebook storage volume when you need to store larger amounts of data.

Allocating the right storage volume for your notebook instance is important while you develop machine learning models. You can use the storage volume to locally process a large dataset or to temporarily store other data to work with.

Every notebook instance you create with Amazon SageMaker comes with a default storage volume of 5 GB. You can choose any size between 5 GB and 16384 GB, in 1 GB increments.
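
If you create notebook instances programmatically, the same limits apply. The sketch below validates a requested size against the range stated above and builds the parameters for the CreateNotebookInstance API, where the size is passed as VolumeSizeInGB. The helper name, notebook name, instance type, and role ARN are all placeholders chosen for illustration.

```python
MIN_VOLUME_GB = 5      # default and minimum size
MAX_VOLUME_GB = 16384  # 16 TB


def notebook_instance_request(name, instance_type, role_arn,
                              volume_size_gb=MIN_VOLUME_GB):
    """Build CreateNotebookInstance parameters after checking the volume size."""
    if isinstance(volume_size_gb, bool) or not isinstance(volume_size_gb, int):
        raise TypeError("volume size must be a whole number of GB")
    if not MIN_VOLUME_GB <= volume_size_gb <= MAX_VOLUME_GB:
        raise ValueError("volume size must be between %d and %d GB"
                         % (MIN_VOLUME_GB, MAX_VOLUME_GB))
    return {
        "NotebookInstanceName": name,
        "InstanceType": instance_type,
        "RoleArn": role_arn,
        "VolumeSizeInGB": volume_size_gb,
    }


# Placeholder values; substitute your own name, instance type, and role.
request = notebook_instance_request(
    "my-notebook",
    "ml.t2.medium",
    "arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    volume_size_gb=100,
)
print(request["VolumeSizeInGB"])  # 100
```

With the AWS SDK for Python installed, the dictionary can be passed as boto3.client("sagemaker").create_notebook_instance(**request).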

When you create a notebook instance using the Amazon SageMaker console, you can set the storage volume size, in GB, to suit your needs.

Conclusion

Customize the storage volume for your notebook instances depending on your needs. To learn more about how to create and use notebook instances, refer to the Amazon SageMaker documentation.


About the Author

Erkan Tas is a Senior Product Manager for Amazon SageMaker. He is on a mission to make Artificial Intelligence easy, accessible, and scalable through AWS platforms. He is also a sailor, science and nature admirer, Go and Stratocaster player.