
AWS DeepRacer League: The Championship lineup is complete, making for an exciting re:Invent 2019 final!

The AWS DeepRacer League is the world’s first autonomous racing league, open to anyone. Announced at re:Invent 2018, it puts machine learning in the hands of every developer in a fun and exciting way. Since March 2019, thousands of developers of all skill levels have competed for the chance to advance to the Championship Cup at re:Invent 2019.

2019 League wrap-up

As well as racing at AWS Summits around the world, participants have been racing virtually via the AWS DeepRacer console. Developers have been testing their skills on different tracks in simulation throughout the year, and competing in monthly competitions with the hope of winning an expenses-paid trip to re:Invent 2019. The final Virtual Circuit race concluded on October 31, completing the Championship Cup lineup.

Racers qualified for the Cup in two ways: the winner of the final virtual race of the year advanced, as did the 18 top point scorers who competed in multiple races throughout the year. “Eric” from Taiwan won the Toronto Turnpike race with a lap time of 7.172 seconds, the fastest time recorded on any of the virtual tracks and faster than the world record set at the Summits. The next challenge for Eric is transitioning his models from simulation to the real world when he gets to Las Vegas!

Lyndon Leggate, an early AWS DeepRacer enthusiast and the founder of the AWS DeepRacer Slack community, was victorious on the overall virtual leaderboard and is joined by 17 other skilled racers from the Virtual Circuit. Each of the 18 racers competed in all six virtual races, racking up points along the way with very consistent models and clocking times ranging from 9.4 to 14.6 seconds. We will see each of these developers at re:Invent 2019, when the in-person and virtual worlds collide in the Championship Cup knock-out rounds.

The AWS DeepRacer 2019 Summit Circuit results

The AWS DeepRacer Virtual Circuit results

Get ready to race at re:Invent

re:Invent 2019 is the final destination on the journey to crown the 2019 AWS DeepRacer Championship Cup winner. The November Championship Cup warm-up race is now open: developers can train models on the newly revealed official track that will be used during the Championship Cup. You can take part in this friendly warm-up race via the AWS DeepRacer console and compete for up to $500 in AWS credits. See how your model performs on the official Championship Cup track today, then bring that model with you to re:Invent and race at the MGM Grand Garden Arena. There will be prizes up for grabs, all while you get a trackside seat to watch the best racers from around the world compete in the knock-outs.

The Championship Cup

The Championship Cup competition includes a set of elimination rounds at the MGM Grand Garden Arena, where 64 of the League’s best face off in a knock-out tournament in the hopes of taking home the glory! Starting on Tuesday, December 3, the field will be whittled down from 64 to 3, who will go on to compete onstage in the Grand Final at Werner Vogels’ keynote on Thursday, December 5. The League will hold one final chance for in-person racers to advance to the knock-out rounds on Monday, December 2, from 4–7 PM, at the Quad in the Aria hotel. Open to all re:Invent attendees, this last-chance race on the iconic 2019 track will send not one but three contestants through to the knock-out rounds!

Learn and grow

New racers not competing for the 2019 cup can attend one of the 10 AWS DeepRacer workshops to learn from AWS DeepRacer experts how to build the best model to compete in the 2020 League.

The AWS DeepRacer workshops provide customers with hands-on training, enabling them to build their models and learn more about what’s next for AWS DeepRacer. The sessions are open for registration now, so don’t miss out on your chance to learn and get ready to race!

AWS customers who want to learn and prepare for the 2020 season will benefit from the AWS DeepRacer Expert Boot Camp. This two-day event offers unprecedented access to AWS DeepRacer experts, including AWS DeepRacer data scientists, 2019 AWS Summit winners, and developer experts sharing best practices and racing tips. With a full track for practicing in real time, this is one event you do not want to miss.

The home stretch of 2019!

In less than a year, AWS DeepRacer has seen a dramatic evolution in the speeds developers are clocking on the tracks, from Rick Fish’s championship-winning time of 51.50 seconds to the world record of 7.44 seconds set by SOLA at the Tokyo Summit in June. Developers around the world have embraced the challenge, testing their models for days and weeks at a time, playing with speed and other parameters to push the car to its physical (and virtual) limits. The Championship Cup is set to be the most exciting yet. Register for re:Invent 2019 today, and start training your models to win prizes in the warm-up challenge!


About the Author

Alexandra Bush is a Senior Product Marketing Manager for AWS AI. She is passionate about how technology impacts the world around us and enjoys being able to help make it accessible to all. Out of the office she loves to run, travel and stay active in the outdoors with family and friends.

Building an interactive and scalable ML research environment using AWS ParallelCluster

When it comes to running distributed machine learning (ML) workloads, AWS offers you both managed and self-service offerings. Amazon SageMaker is a managed service that can help engineering, data science, and research teams save time and reduce operational overhead. AWS ParallelCluster is an open-source, self-service cluster management tool for customers who wish to maintain more direct control over their computing infrastructure. This post addresses how to perform distributed ML on AWS. For more information about distributed training using Amazon SageMaker, see the following posts on launching TensorFlow distributed training with Horovod and multi-region serverless distributed training.

AWS ParallelCluster is an AWS-supported open-source cluster management tool that helps users deploy and manage high performance computing (HPC) clusters in the AWS Cloud. AWS ParallelCluster allows data scientists and researchers to reproduce a familiar working environment on elastically scaled AWS resources by automatically setting up the required compute resources and shared file system. Broadly supported data science and ML tools such as Jupyter, Conda, MXNet, PyTorch, and TensorFlow allow flexible, interactive development with low-overhead scaling. These features make AWS ParallelCluster environments ideally suited for ML research environments that support distributed model development and training.

AWS ParallelCluster enables a scalable research workflow built around on-demand allocation of compute resources. Rather than working with, and potentially underutilizing, a single high-power GPU-enabled workstation, AWS ParallelCluster manages an on-demand fleet of GPU-enabled compute workers. This allows trivial scale-up for parallel training experiments and automatic scale-down when resources aren’t required, minimizing cost and (most importantly) saving researcher time. An attached Amazon FSx for Lustre file system provides a traditional high-performance file system during development and archives models and data to low-cost Amazon S3.

The following graphic shows an AWS ParallelCluster-based research environment. Autoscaled Amazon EC2 resources access remote storage, with models and data archived to S3.

This post shows you how to set up, run, and tear down a complete AWS ParallelCluster environment implementing this architecture. The post runs two NLP tutorials: fine-tuning a BERT model on a paraphrasing task and training a German-English machine translation model. This includes the following steps:

  1. AWS ParallelCluster configuration and setup
  2. Conda-based installation of your ML and NLP packages
  3. Initial interactive model training
  4. Parallel model training and evaluation
  5. Data archiving and cluster teardown

The tutorial lays out a workflow using standard tools, and you can adapt it to your research requirements.

Prerequisites

This post uses a combination of m5 and p3 EC2 instances and Amazon FSx and Amazon S3 storage. Because the tutorial uses GPU-enabled instances for training, it incurs charges beyond the AWS Free Tier. Before you begin, complete the following prerequisites:

  1. Set up an AWS account and create an access token with administrator permissions.
  2. Request quota increases in your target AWS Region for at least one m5.xlarge, three p3.2xlarge, and three p3.8xlarge On-Demand Instances.

Setting up your client and cluster

Start with a one-time setup and configuration of your workstation with the aws-parallelcluster client in a dedicated Conda environment. You reuse this pattern later when setting up isolated environments for each subproject, each containing the precise set of dependencies required to reproduce your work.

Installing Conda

Perform a one-time installation of a base Miniconda environment and initialize your shell to enable Conda. This post works from a macOS workstation; use the download URL for your preferred platform. This configuration sets up a base environment and activates it in your interactive shell. See the following code:

@work:~$ wget -O miniconda.sh \
    "https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh" \
    && bash miniconda.sh -p ~/.conda \
    && ~/.conda/bin/conda init

Setting up your client environment

Install AWS ParallelCluster and the AWS CLI tools using a Conda environment called pcluster_client. This environment provides separation between the client and your system environment. First, write an environment.yml file specifying the environment name and dependency versions. Call conda env update to download and install the libraries. See the following code:

(base) @work:~$ cat > pcluster_client.environment.yml <<EOF
name: pcluster_client
dependencies:
  - python=3.7
  - pip
  - conda-forge::jq
  - conda-forge::awscli
  - pip:
    - aws-parallelcluster >= 2.4
EOF

(base) @work:~$ conda env update -f pcluster_client.environment.yml

Configuring pcluster and creating storage

To configure AWS ParallelCluster, conda activate your pcluster_client environment and configure aws and pcluster via the default configuration flow. For more information, see Configuring AWS ParallelCluster.

During configuration, upload your id_rsa public key to AWS and store your private key locally, which you use to access your pcluster instances. See the following code:

(base) @work:~$ conda activate pcluster_client
(pcluster_client) @work:~$ aws configure
  [...]
(pcluster_client) @work:~$ aws ec2 import-key-pair \
    --key-name $USER --public-key-material file://~/.ssh/id_rsa.pub
{
    "KeyFingerprint": [...]
    [...]
}
(pcluster_client) @work:~$ pcluster configure
  [...]

After configuring AWS ParallelCluster, create an S3 bucket for persistent storage of your data and models with the following code:

(pcluster_client) @work:~$ export AWS_ACCOUNT=$(aws sts get-caller-identity | jq -r ".Account")
(pcluster_client) @work:~$ export S3_BUCKET=pcluster-training-workspace-$AWS_ACCOUNT
(pcluster_client) @work:~$ aws s3 mb s3://$S3_BUCKET
  make_bucket: pcluster-training-workspace-[...account id...]

Add config entries for a GPU-enabled cluster and Amazon FSx file system with the following code:

(pcluster_client) @work:~$ cat >> ~/.parallelcluster/config <<EOF

[cluster p3.2xlarge]
key_name                 = $USER
vpc_settings             = public

scheduler                = slurm
base_os                  = centos7
fsx_settings             = workspace

initial_queue_size       = 1
max_queue_size           = 3

master_instance_type     = m5.xlarge
compute_instance_type    = p3.2xlarge

[fsx workspace]
shared_dir = /workspace
storage_capacity = 3600
import_path = s3://$S3_BUCKET
export_path = s3://$S3_BUCKET
imported_file_chunk_size = 1024

EOF

Creating and bootstrapping your cluster

After configuration, bring your cluster online. This command creates a persistent master instance, attaches an Amazon FSx file system, and sets up a p3 class Auto Scaling group. After cluster creation is complete, set up Miniconda again, this time installing it onto the /workspace file system accessible on all master and compute nodes. See the following code:

(pcluster_client) @work:~$ pcluster create -t p3.2xlarge training
Beginning cluster creation for cluster: training
Creating stack named: parallelcluster-training
Status: [...]

(pcluster_client) @work:~$ pcluster ssh training

[centos@ip-172-31-48-17 ~]$ wget -O miniconda.sh \
    "https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh" \
    && bash miniconda.sh -p /workspace/.conda \
    && /workspace/.conda/bin/conda init
[centos@ip-172-31-48-17 ~]$ exit

Your compute cluster now contains a single m5 class master instance, with p3.2xlarge instances available via the Slurm job scheduler. You can use an interactive salloc session to access your p3 resources via srun commands. An important implication of this autoscaled cluster strategy is that while all code and data are available across the cluster, access to attached GPUs is limited to compute nodes accessed via srun. You can demonstrate this via calls to nvidia-smi, which reports the status of attached GPU resources. See the following code:

(pcluster_client) @work:~$ pcluster ssh training

# Execution on the master node cannot access GPU resources.
(base) [centos@ip-172-31-48-17 ~]$ hostname
ip-172-31-48-17
(base) [centos@ip-172-31-48-17 ~]$ nvidia-smi
NVIDIA-SMI has failed [...]

# Use salloc to bring a compute node online, then use calls to srun to
# execute commands on the GPU-enabled compute node.
(base) [centos@ip-172-31-48-17 ~]$ salloc
salloc: Required node not available (down, drained or reserved)
salloc: Pending job allocation 2
salloc: job 2 queued and waiting for resources
salloc: job 2 has been allocated resources
salloc: Granted job allocation 2

(base) [centos@ip-172-31-48-17 ~]$ srun hostname
ip-172-31-48-226

(base) [centos@ip-172-31-48-17 ~]$ srun nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   34C    P0    39W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
(base) [centos@ip-172-31-48-17 ~]$ exit
exit
salloc: Relinquishing job allocation 2

AWS ParallelCluster performs automatic management of your compute Auto Scaling group. This keeps a compute node running and available for the lifetime of your salloc and terminates the idle compute node several minutes after the job ends.
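
The scale-down window is configurable. If your workflow needs a longer or shorter idle period before nodes terminate, the ParallelCluster config supports a scaling section that the cluster definition references. The following is a minimal sketch under the 2.x config format; the section name and the 10-minute value are illustrative assumptions, not taken from this tutorial:

[cluster p3.2xlarge]
# ... existing settings from the section above ...
scaling_settings = custom

[scaling custom]
# Minutes a compute node may sit idle before termination
scaledown_idletime = 10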

Model training

Initial GPU-enabled interactive training

For an initial research task, run a standard natural language processing workflow: fine-tuning a pre-trained BERT model on a specific subtask. Establish a working environment with your model dependencies, download the pre-trained model and training data, and run fine-tuning training on a GPU. For more information about PyTorch pre-trained BERT examples, see the GitHub repo.

First, run a one-time setup of your project: a Conda environment with library dependencies and a workspace with training data. Write an environment.yml specifying the dependencies for your project, call conda env update to create the environment and install the libraries, and call conda activate. Fetch your training data into /workspace/bert_tuning. See the following code:

(base) [centos@ip-172-31-48-17 ]$ mkdir /workspace/bert_tuning
(base) [centos@ip-172-31-48-17 ]$ cd /workspace/bert_tuning

(base) [centos@ip-172-31-48-17 bert_tuning]$ cat > environment.yml <<EOF
name: bert_tuning
dependencies:
  - python=3.7
  - pytorch::pytorch=1.1
  - scipy=1.2
  - scikit-learn=0.21
  - pip
  - requests
  - tqdm
  - boto3
  - pip:
    - pytorch_pretrained_bert==0.6.2
EOF

(base) [centos@ip-172-31-48-17 bert_tuning]$ conda env update
[...]
# To activate this environment, use
#
#     $ conda activate bert_tuning

(base) [centos@ip-172-31-48-17 bert_tuning]$ conda activate bert_tuning

(bert_tuning) [centos@ip-172-31-48-17 bert_tuning]$ wget \
   https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/17b8dd0d724281ed7c3b2aeeda662b92809aadd5/download_glue_data.py
(bert_tuning) [centos@ip-172-31-48-17 bert_tuning]$ python download_glue_data.py --data_dir glue
Downloading and extracting Cola...
[...]
        Completed!

After downloading your dependencies, fetch the training script and run fine-tuning in an interactive session. The only difference from the documented non-cluster example is that you run your training via salloc --exclusive srun rather than directly invoking the training script. The /workspace Amazon FSx file system allows the compute node to access your Conda environment’s installed libraries and your model definition, training data, and model checkpoints. As before, allocate a GPU-enabled node for the training run, which terminates after your run is complete. See the following code:

(bert_tuning) [centos@ip-172-31-48-17 bert_tuning]$ wget \
  https://raw.githubusercontent.com/huggingface/pytorch-pretrained-BERT/v0.6.2/examples/run_classifier.py
(bert_tuning) [centos@ip-172-31-48-17 bert_tuning]$ salloc --exclusive srun \
python run_classifier.py \
  --task_name MRPC \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir glue/MRPC/ \
  --bert_model bert-base-uncased \
  --max_seq_length 128 \
  --train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir mrpc_output
salloc: Required node not available (down, drained or reserved)
salloc: Pending job allocation 3
salloc: job 3 queued and waiting for resources
salloc: job 3 has been allocated resources
salloc: Granted job allocation 3
06/12/2019 02:15:36 - INFO - __main__ -   device: cuda n_gpu: 1, distributed training: False, 16-bits training: False
[...]
Epoch:  100%|██████████| 3/3 [01:11<00:35, 35.90s/it] 
[...]
Evaluating: 100%|██████████| 51/51 [00:01<00:00, 41.42it/s]
06/12/2019 02:17:48 - INFO - __main__ -   ***** Eval results *****
06/12/2019 02:17:48 - INFO - __main__ -     acc = 0.8455882352941176
06/12/2019 02:17:48 - INFO - __main__ -     acc_and_f1 = 0.867627742865973
06/12/2019 02:17:48 - INFO - __main__ -     eval_loss = 0.42869279022310297
06/12/2019 02:17:48 - INFO - __main__ -     f1 = 0.8896672504378283
06/12/2019 02:17:48 - INFO - __main__ -     global_step = 345
06/12/2019 02:17:48 - INFO - __main__ -     loss = 0.15244172460035138
salloc: Relinquishing job allocation 3

(bert_tuning) [centos@ip-172-31-48-17 bert_tuning]$ exit

Multi-GPU training

Using salloc is useful for interactive model development, short training jobs, and testing. However, the majority of modern research requires multiple long-running training jobs for model development and tuning. To support more compute-intensive experimentation, update your cluster to multi-GPU compute instances and use sbatch for non-interactive training. Enqueue multiple training jobs for an experiment and let AWS ParallelCluster scale up your compute group for the run and scale down after the experiment is complete.

From your workstation, add configuration for a multi-GPU cluster, shut down any remaining single-GPU nodes, and update your cluster configuration to multi-GPU p3.8xlarge compute instances. See the following code:

(pcluster_client) @work:~$ cat >> ~/.parallelcluster/config <<EOF

[cluster p3.8xlarge]
key_name                 = $USER
vpc_settings             = public

scheduler                = slurm
base_os                  = centos7
fsx_settings             = workspace

initial_queue_size       = 1
max_queue_size           = 3

master_instance_type     = m5.xlarge
compute_instance_type    = p3.8xlarge

EOF

(pcluster_client) @work:~$ (
       pcluster stop training
       pcluster update training -t p3.8xlarge
       pcluster start training 
   )

Stopping compute fleet : training
Updating: training
Calling update_stack
Status: parallelcluster-training - UPDATE_COMPLETE
Starting compute fleet : training

(pcluster_client) @work:~$ pcluster ssh training

(base) [centos@ip-172-31-48-17 ~]$ salloc srun nvidia-smi
salloc: Granted job allocation 4
Wed Jun 12 06:02:25 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:1B.0 Off |                    0 |
| N/A   47C    P0    52W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:00:1C.0 Off |                    0 |
| N/A   46C    P0    52W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:00:1D.0 Off |                    0 |
| N/A   49C    P0    58W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   47C    P0    57W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
salloc: Relinquishing job allocation 4

This post next trains a transformer-based German-to-English translation model using the fairseq NLP framework. As before, set up a new workspace and environment and download the training data. See the following code:

(base) [centos@ip-172-31-48-17 ~]$ mkdir /workspace/translation
(base) [centos@ip-172-31-48-17 ~]$ cd /workspace/translation

(base) [centos@ip-172-31-48-17 translation]$ cat > environment.yml <<EOF
name: translation
dependencies:
  - python=3.7
  - pytorch::pytorch=1.1
  - pip
  - tqdm
  - pip:
    - fairseq==0.6.2
EOF

(translation) [centos@ip-172-31-48-17 translation]$ conda env update && conda activate translation

(translation) [centos@ip-172-31-48-17 translation]$ wget \
  https://raw.githubusercontent.com/pytorch/fairseq/v0.6.2/examples/translation/prepare-iwslt14.sh \
  && bash prepare-iwslt14.sh

[...]

(translation) [centos@ip-172-31-48-17 translation]$ fairseq-preprocess \
  --source-lang de --target-lang en \
  --trainpref iwslt14.tokenized.de-en/train \
  --validpref iwslt14.tokenized.de-en/valid \
  --testpref  iwslt14.tokenized.de-en/test \
  --destdir data-bin/iwslt14.tokenized.de-en
    
[...]
| Wrote preprocessed data to data-bin/iwslt14.tokenized.de-en

After downloading and preprocessing your training data, write your training script and launch a quick interactive training run to confirm that the script launches and successfully trains for several epochs. The first job is limited to a single GPU via CUDA_VISIBLE_DEVICES and should train at approximately 60 seconds/epoch; after an epoch or so, interrupt it with Ctrl-C. Because the underlying model supports distributed data-parallel training, you can expect nearly linear performance scaling with additional GPUs on a single worker. A second job using all four devices should train at approximately 15–20 seconds/epoch, confirming effective multi-GPU scaling; again, interrupt it after an epoch or so. See the following code:

(translation) [centos@ip-172-31-48-17 translation]$ mkdir -p checkpoints/transformer
(translation) [centos@ip-172-31-48-17 translation]$ (cat > train_transformer && chmod +x train_transformer) <<EOF
#!/bin/bash
fairseq-train data-bin/iwslt14.tokenized.de-en \
  -a transformer_iwslt_de_en --optimizer adam --lr 0.0005 -s de -t en \
  --label-smoothing 0.1 --dropout 0.3 --max-tokens 4000 \
  --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 \
  --criterion label_smoothed_cross_entropy --max-update 50000 \
  --warmup-updates 4000 --warmup-init-lr '1e-07' \
  --adam-betas '(0.9, 0.98)' --fp16 \
  --save-dir checkpoints/transformer
EOF

(translation) [centos@ip-172-31-48-17 translation]$ CUDA_VISIBLE_DEVICES=0 salloc --exclusive \
  srun -X --pty ./train_transformer
  
  [...]
| training on 1 GPUs
  [...]
  ^C
  [...]
  KeyboardInterrupt
  
(translation) [centos@ip-172-31-48-17 translation]$ salloc --exclusive \
  srun -X --pty ./train_transformer
  
  [...]
| training on 4 GPUs
  [...]
  ^C
  [...]
  KeyboardInterrupt

After your initial validation, run sbatch to schedule your full training run. The sinfo command provides information about your running cluster, and squeue shows the status of your batch job. tail on the job log allows you to monitor training progress, and ssh access to the compute node address reported by squeue allows you to check resource utilization. As before, AWS ParallelCluster scales up your compute cluster for the batch training job and releases the GPU-enabled instances after batch training is complete. See the following code:

(translation) [centos@ip-172-31-48-17 translation]$ sbatch --exclusive \
  --output=train_transformer.log \
  ./train_transformer

Submitted batch job 9.

(translation) [centos@ip-172-31-21-188 translation]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up   infinite      1  alloc ip-172-31-20-225

(translation) [centos@ip-172-31-21-188 translation]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 9   compute   sbatch   centos  R       0:22      1 ip-172-31-20-225
                
(translation) [centos@ip-172-31-21-188 translation]$ tail train_transformer.log
[...]
| loaded checkpoint checkpoints/transformer/checkpoint_last.pt (epoch 5 @ 1413 updates)
| epoch 006 | loss 7.268 | [...]
| epoch 006 | valid on 'valid' subset | loss 6.806 | [...]

(translation) [centos@ip-172-31-21-188 translation]$ ssh -t ip-172-31-20-225 watch nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:1B.0 Off |                    0 |
| N/A   63C    P0   214W / 300W |   3900MiB / 16130MiB |     83%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:00:1C.0 Off |                    0 |
| N/A   64C    P0   175W / 300W |   4110MiB / 16130MiB |     82%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:00:1D.0 Off |                    0 |
| N/A   60C    P0   164W / 300W |   4026MiB / 16130MiB |     65%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   62C    P0   115W / 300W |   3994MiB / 16130MiB |     74%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     41837      C   ...ntos/.conda/envs/translation/bin/python  3889MiB |
|    1     41838      C   ...ntos/.conda/envs/translation/bin/python  4099MiB |
|    2     41839      C   ...ntos/.conda/envs/translation/bin/python  4015MiB |
|    3     41840      C   ...ntos/.conda/envs/translation/bin/python  3983MiB |
+-----------------------------------------------------------------------------+

The job takes approximately 80–90 minutes to complete. You can now evaluate your model via interactive translation. See the following code:

(translation) [centos@ip-172-31-21-188 translation]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
(translation) [centos@ip-172-31-21-188 translation]$ fairseq-interactive \
  data-bin/iwslt14.tokenized.de-en \
  --path checkpoints/transformer/checkpoint_best.pt --beam 5 --remove-bpe <<EOF
hallo welt
EOF

Namespace([...])
| [de] dictionary: 8848 types
| [en] dictionary: 6632 types
| loading model(s) from checkpoints/transformer/checkpoint_best.pt
| Type the input sentence and press return:
S-0    hallo welt
H-0    -0.32129842042922974    hello world .
P-0    -0.8112 -0.0095 -0.4157 -0.2850 -0.0851

Jupyter and other HTTP services

Interactive notebook-based development is frequently used for data exploration, model analysis, and prototyping. You can launch and access a notebook server running on your AWS ParallelCluster workers. Add jupyterlab to the project’s workspace environment and srun the notebook. See the following code:

(translation) [centos@ip-172-31-48-17 translation]$ conda install jupyterlab

[...]

# unset XDG_RUNTIME_DIR and listen on node name to allow ssh tunnel.
(translation) [centos@ip-172-31-48-17 translation]$ XDG_RUNTIME_DIR= \
  salloc --exclusive srun -X --pty bash -c \
  'jupyter lab --ip=$SLURMD_NODENAME'

[...]
The Jupyter Notebook is running at:
http://ip-172-31-21-236:8888/?token=[...token...]

In a separate terminal, set up a pcluster ssh tunnel to the notebook worker using the node address and access token reported by Jupyter and open a local browser. See the following code:

(pcluster_client) @work:~$ pcluster ssh training -L 8888:ip-172-31-21-236:8888 -N&
(pcluster_client) @work:~$ jobs
[1]+  Running    pcluster ssh training -L 8888:ip-172-31-21-236:8888 -N &

(pcluster_client) @work:~$ open http://localhost:8888/?token=[...token...]

You can use a similar approach to run tools such as tensorboard in your cluster environment.

Storage and cluster teardown

After completing model training and evaluation, you can archive your /workspace file system to Amazon S3 via Amazon FSx’s hierarchical storage support. For more information, see Using Data Repositories. After the hsm_archive actions complete, in approximately 60–90 minutes, verify the contents of your S3 export bucket via the AWS CLI with the following code:

(pcluster_client) @work:~$ pcluster ssh training

# Find and archive all files in /workspace
(base) [centos@ip-172-31-48-17 translation]$ find /workspace -type f -print0 \
  | xargs -0 -n 16 sudo lfs hsm_archive
  
# Returns 0 when all archive operations are complete
(base) [centos@ip-172-31-48-17 translation]$ find /workspace -type f -print0 \
  | xargs -0 -n 16 -P 8 sudo lfs hsm_action | grep "ARCHIVE" | wc -l
  
0

(base) [centos@ip-172-31-48-17 translation]$ exit

(pcluster_client) @work:~$ aws s3 ls \
    s3://pcluster-training-workspace-$(aws sts get-caller-identity | jq -r ".Account")
                           
    PRE bert_tuning/
    PRE translation/
    
(pcluster_client) @work:~$ pcluster delete training
Deleting: training
[...]

A later call to pcluster create with the same configuration restores your cluster, pre-populating /workspace from your S3 archive.

Multiple clusters

You can use AWS ParallelCluster to manage multiple concurrent compute clusters. For instance, you can use a mix of CPU and GPU clusters to support preprocessing or analysis tasks that involve significant CPU-bound processing. Additionally, this can provide independent clusters for multiple researchers in a single shared AWS workspace.

Adapting this workflow to a multi-cluster configuration is relatively simple. Set up a standalone Amazon FSx file system and manage its lifecycle via existing CloudFormation templates in the amazon-fsx-workshop/lustre GitHub repo. Specify an export prefix and update ~/.parallelcluster/config with the following code:

[fsx workspace]
shared_dir = /workspace
fsx_fs_id = <filesystem id>

Multiple clusters now share a /workspace file system, decoupled from the lifetime of any individual cluster. You can use calls to lfs hsm_archive from any cluster to back up file system contents to S3, potentially via a nightly cron.

Capacity management

AWS ParallelCluster manages a compute cluster of EC2 instances via a standard Auto Scaling group, allowing you to use existing AWS-native tools for capacity management as you scale clusters. AWS ParallelCluster has built-in support for using Spot Instances within compute fleets via cluster_type configuration, and uses Reserved Instance capacity if available. You can use On-Demand Capacity Reservations so AWS ParallelCluster can rapidly scale to match your target compute fleet size.
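
For example, under the 2.x config format, switching a compute fleet to Spot pricing is a small change to a cluster section; a minimal sketch, with the price value as an illustrative assumption:

[cluster p3.8xlarge]
# ... existing settings ...
cluster_type = spot
# Optional cap on the hourly price per instance; illustrative value
spot_price = 4.00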

Conclusion

If you wish to maintain more direct control over your computing infrastructure, an AWS ParallelCluster-based workflow provides an ideal working environment for applied machine learning research. Rapid cluster setup, scaling, and updates allow interactive exploration of a modeling task, including identification of a proper instance type and multi-instance scaling for parallel training runs. Conda environments and a high-performance Amazon FSx file system provide a familiar file interface and handle the critical but undifferentiated heavy lifting of transparently and reproducibly archiving model artifacts to S3.

For more information about configuring AWS ParallelCluster and building an interactive and scalable ML or HPC research environment, see the AWS ParallelCluster User Guide or the aws-parallelcluster GitHub repo.


About the author

Alex Ford is an Applied Scientist with AWS. He is passionate about emerging applications at the intersection of machine learning and the natural sciences. In his spare time, he explores the geography and geology of the Cascadia subduction zone, with deep affection for the Index batholith.

Accenture drives machine learning growth in one of the world’s largest private AWS DeepRacer Leagues

Accenture has a rich history of helping customers all over the world build artificial intelligence (AI) and machine learning (ML) powered solutions with AWS services. In doing so, they always look for new and engaging ways to develop their teams with the appropriate level of enablement and hands-on training. Accenture’s next ML initiative is rolling out their own version of the AWS DeepRacer League, the world’s first global autonomous racing league, launched by AWS at re:Invent 2018. Accenture’s league spans 30 global locations and 17 countries, with each location featuring both a physical and a virtual track to compete on for the title of Accenture AWS DeepRacer Champion.

Why an AWS DeepRacer League and why now?

Machine learning is one of the fastest growing areas in the market. IDC predicted that by 2021, global spending on AI and cognitive technologies will exceed $50 billion, and companies are exploring how they can best take advantage of the technology, no matter the industry. However, the opportunities heavily outweigh the skills present in the workforce to make an AI strategy a reality. Although the number of ML-skilled data scientists is growing, companies cannot afford to hire at the scale needed to succeed, leading them to explore ways to upskill their existing talent. AWS DeepRacer and the implementation of the league are a mechanism for Accenture to help their customers take advantage of new ML technologies at scale by democratizing the development of ML skills throughout their global organization. This unique program provides employees and customers with creative ways to explore machine learning. Participants have the opportunity to learn through hands-on labs followed instantly by practical application: deploying their models to an AWS DeepRacer car and watching it perform. Coupled with the element of competition, it gives teams something to rally around, while helping their organizations learn and grow.

Accenture’s AWS DeepRacer journey

As an emerald sponsor at re:Invent 2018, Accenture was present for the AWS DeepRacer announcement. They attended the workshops, learned about ML basics, built and trained a reinforcement learning model via the AWS DeepRacer 3D cloud-based simulator, and raced that model on one of the physical tracks in the MGM Grand Garden Arena. They even took home their own DeepRacer car! It was during this experience that Accenture realized how easy it was to learn such a complex ML technique and apply these new skills in a fun and engaging way.

Multiple individuals and groups within Accenture signed up as private preview customers, with access to the AWS DeepRacer console, in preparation for the launch of their global competition. They have also begun building their own leaderboard, integrated with Accenture’s single sign-on, to use for every site’s competition. Accenture participants in each city can create competitions, track their leaderboard, join competitions in other cities, and upload video recordings of their blazing fast laps to claim victory.

The Accenture AWS Business Group has been the driving force behind the Accenture DeepRacer League competition, assembling teams across the world and equipping each location with everything they need, including tracks, barriers, and leaderboards. Any Accenture employee can join a competition and start their engines on November 14, when the Accenture league launches with a 24-hour follow-the-sun competition across the globe, bringing the excitement of AWS DeepRacer and machine learning to life.

Showcasing AWS DeepRacer at Accenture’s innovation centers

Accenture’s innovation centers, innovation hubs, and liquid studios are the primary locations hosting the AWS DeepRacer physical tracks. The intention is to showcase Accenture’s ML expertise and accelerate AWS ML around the world, extending the opportunity to upskill clients and the AWS communities in each global city. We encourage you to see how straightforward it is to get hands-on with ML, learn essential ML concepts, and experiment through autonomous driving using AWS DeepRacer. Connect with the teams of technologists from the Accenture AWS Business Group (AABG) to get started on your AWS machine learning journey today!


About the Author

Alexandra Bush is a Senior Product Marketing Manager for AWS AI. She is passionate about how technology impacts the world around us and enjoys being able to help make it accessible to all. Out of the office she loves to run, travel and stay active in the outdoors with family and friends.

AWS Machine Learning Research Awards Call for Proposal

Academic research and open-source software development are at the forefront of machine learning (ML) technology development. Since 2017, the AWS Machine Learning Research Awards (MLRA) program has aimed to advance machine learning by funding innovative research, training students, and providing researchers with access to the latest technology. MLRA has supported over 100 cutting-edge ML projects, spanning topics such as ML algorithms, computer vision, natural language processing, medical research, neuroscience, social science, physics, and robotics. Many MLRA-backed projects have received media coverage, for example: Researchers are Using Machine Learning to Screen for Autism in Children; The Robotic Future: Where Bots Operate Together and Learn from Each Other; Autonomous Vehicles: The Answer to Our Growing Traffic Woes; Amazon Gives AI to Harvard Hospital in Tech’s Latest Health Push; and Facebook’s Fight to Prevent Deepfake Dystopia Gets a Powerful Partner in Amazon Web Services.

AWS is pleased to announce that MLRA is now calling for proposals for the Q4 2019 cycle, and welcomes applications from faculty members at accredited (Ph.D.-granting) academic institutions and researchers at non-profit organizations.

MLRA may provide unrestricted cash funds, AWS Promotional Credit, and training resources, including tutorials on how to run ML on AWS and hands-on sessions with Amazon scientists and engineers.

Individual awards are typically no more than $70,000 in cash and $100,000 in AWS Promotional Credits; the actual amount awarded depends on the nature of the project. An internal advisory board at AWS reviews the proposals and makes funding decisions based on potential impact on the ML community, quality of the scientific content, and extent of usage of AWS AI/ML services.

The submission deadline is 11:59 PM (PST) on December 8, 2019, and decision letters are sent out approximately three months after the deadline.

To get started with your application, please consult the MLRA website or send an email to aws-ml-research-awards@amazon.com. We look forward to receiving your applications!


About the Author

An Luo, PhD, is a Senior Technical Program Manager at AWS. An spent many years applying machine learning to biomedical research. Now, she focuses on enabling and accelerating machine learning research leveraging AWS AI/ML technologies.

Optimizing portfolio value with Amazon SageMaker automatic model tuning

Financial institutions that extend credit face the dual tasks of evaluating the credit risk associated with each loan application and determining a threshold that defines the level of risk they are willing to take on. The evaluation of credit risk is a common application of machine learning (ML) classification models. The determination of a classification threshold, though, is often treated as a secondary concern and set in an ad hoc, unprincipled manner. As a result, institutions may be creating underperforming portfolios and leaving risk-adjusted return on the table.

In this blog post, we describe how to use Amazon SageMaker automatic model tuning to determine the classification threshold that maximizes the portfolio value of a lender choosing a subset of borrowers to lend to. More generally, we describe a method of choosing an optimal threshold, or set of thresholds, in a classification setting. The method we describe doesn’t rely on rules of thumb or generic metrics. It is a systematic and principled method that relies on a business success metric specific to the problem at hand. The method is based upon utility theory and the idea that a rational individual makes decisions so as to maximize her expected utility, or subjective value.

In this post, we assume that the lender is attempting to maximize the expected dollar value of her portfolio by choosing a classification threshold that divides loan applications into two groups: those she accepts and lends to, and those she rejects. In other words, the lender is searching over the space of potential threshold values to find the threshold that results in the highest value for the function that describes her portfolio value.

This post uses Amazon SageMaker automatic model tuning to find that optimal threshold. The accompanying Jupyter notebook demonstrates the code supporting this use case. This is a novel use of the automatic model tuning functionality, which is typically used to choose the hyperparameters that optimize model performance. This post uses it as a general tool to maximize a function over some specific parameter space.

This approach has several advantages over the typical threshold determination approach. Typically, a classification threshold is set (or allowed to default) to 0.5. This threshold doesn’t generate the maximum possible result in the majority of use cases. In contrast, the approach described here chooses a threshold that generates the maximum possible result for the specific business use case being addressed. In the use case in this post, choosing the optimal threshold in the way we describe increases portfolio value by 2.1%.

Also, this approach moves beyond general rules of thumb and expert judgment in determining an optimal threshold. It lays out a structured framework that can be systematically applied to any classification problem. Additionally, it requires the business to explicitly state its cost matrix based on the specific actions to be taken on model predictions and their benefits and costs. This evaluation process moves well beyond simply assessing the classification results of the model. It can drive challenging discussions in the business, forcing differing implicit decisions and valuations onto the table for open discussion and agreement. The discussion moves from a simple “maximize this value” to a more informative analysis of complex economic trade-offs, which provides more value back to the business.

About this blog post
Time to read: 20 minutes
Time to complete: 1.5 hours
Cost to complete: ~$2
Learning level: Advanced (300)
AWS services: Amazon SageMaker

Background

Assume that a lender is attempting to construct a portfolio from a pool of potential loans. To tackle this use case, the lender must first assess the credit risk associated with each loan in the pool by calculating a probability of default for each loan; the higher a loan’s probability of default, the higher its credit risk. To calculate a loan’s probability of default, the lender uses an ML classification model, such as a logistic regression or a random forest.

Given that the lender has estimated a default probability model, how does she choose the threshold, that is, the maximum default probability at which she is still willing to extend a loan? Users of classification models often set the threshold to the conventional default value of 0.5. Even when they attempt to set a use case-specific threshold, they do so by maximizing some threshold-based metric such as precision or recall. One issue with these metrics is that they ignore certain parts of the discrete outcomes described in the classification matrix. For example, precision overlooks true and false negative outcomes. Additionally, these metrics do not incorporate the dollar costs and benefits associated with each cell of the classification matrix. For example, in the case we examine in this post, the interest rate and the loss given default associated with each loan would be ignored in the calculation of typical threshold-based measures. This situation is less than ideal because, ultimately, what a business values is not the precision or recall of its model, but the dollar value of the incremental profit from using a specific model and threshold.

Therefore, instead of using a generic metric, it is likely more profitable and meaningful to the business to design a threshold-based metric that captures the cost and benefit structure of the specific business use case at hand. The lender in this post is deciding whether or not to lend to a set of borrowers. A metric that incorporates the expected interest earned and losses from each loan, given a predicted probability of default, is therefore much more relevant to the business and its decision-making process than a generic metric such as precision or recall. Specifically, the portfolio value metric that we define classifies each loan into one of four buckets: true positive (TP), false negative (FN), true negative (TN), and false positive (FP). It then calculates the value of each bucket of loans using the following guidelines:

TP value = -Fixed_Cost

FN value = -Fixed_Cost - Loss_Given_Default * Outstanding_Principal_Balance

TN value = -Fixed_Cost + Interest_Rate * Outstanding_Principal_Balance

FP value = -Fixed_Cost

Fixed_Cost captures the costs associated with processing a loan, whether it is approved or not.

Outstanding_Principal_Balance is the principal remaining at the time of default or full repayment.

Interest_Rate is a borrower-specific rate that is set based upon the probability of default associated with a specific loan application plus the expected return desired by the lender.

Loss_Given_Default is the proportion of principal expected to be lost if a loan defaults.

To calculate the total value of a specific bucket of loans, the value of all loans is summed. This total is what the lender is attempting to maximize by choosing a threshold.
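
This metric is straightforward to implement directly. The following Python sketch is our own illustration of the four bucket values above; the function and argument names do not come from the accompanying notebook:

import numpy as np

def portfolio_value(y_true, y_score, threshold,
                    fixed_cost, interest_rate, loss_given_default, principal):
    # y_true: 1 if the loan defaulted, 0 otherwise; y_score: predicted P(default).
    # interest_rate, loss_given_default, principal: per-loan arrays aligned with y_true.
    y_true = np.asarray(y_true)
    interest_rate = np.asarray(interest_rate)
    loss_given_default = np.asarray(loss_given_default)
    principal = np.asarray(principal)
    rejected = np.asarray(y_score) >= threshold  # predicted positives: no loan made

    # Every application incurs the fixed processing cost (TP and FP keep only this).
    value = np.full(y_true.shape, -fixed_cost, dtype=float)

    # TN: loan extended and repaid, so the lender earns interest on the principal.
    tn = ~rejected & (y_true == 0)
    value[tn] += interest_rate[tn] * principal[tn]

    # FN: loan extended but the borrower defaults, losing a share of the principal.
    fn = ~rejected & (y_true == 1)
    value[fn] -= loss_given_default[fn] * principal[fn]

    return value.sum()

Sweeping threshold over a grid of candidate values with a function like this gives a first picture of the value curve that the automatic model tuning described later searches more efficiently.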

Once the lender has clearly defined a quantitative measure of portfolio value, she must then choose the threshold that maximizes that measure. We use Amazon SageMaker automatic model tuning to find the optimal threshold. Amazon SageMaker automatic model tuning is a powerful tool for not only tuning the hyperparameters of an ML model, but also for maximizing an arbitrary function. In this case, we use automatic model tuning in two ways:

  • Finding the choice of a threshold that maximizes the lender’s portfolio value.
  • Mapping out the relationship between threshold and portfolio value more generally.

Understanding the relationship between the threshold choice and portfolio value allows us to more fully understand the economic trade-offs of increasing or decreasing the threshold. This is important as lenders frequently want to consider additional goals beyond simply maximizing the dollar value of their portfolio. Some lenders have idiosyncratic, secondary goals. For example, a lender may want to maximize her portfolio value while also emphasizing lending to a particular sector of the economy or certain subgroup of the overall population. Knowing how the portfolio’s value changes when the threshold moves allows the lender to set a reasonable threshold that addresses both her primary goal of portfolio maximization and her additional secondary goals.

We make several assumptions in this work. We assume that the lender has access to the capital necessary to extend all the loans associated with default probabilities below the chosen threshold; the problem is unconstrained in that sense. Additionally, we assume that if a loan is approved, the applicant accepts the terms of the loan no matter what interest rate the lender offers. Lastly, we assume that the lender is risk-neutral; that is, the lender’s utility function is the identity function, so the utility that the lender gains from a certain portfolio value is equal to the portfolio value itself.

The Amazon SageMaker notebook containing the executable code is available on this GitHub repo. You need to run this notebook within an Amazon SageMaker notebook instance to use Amazon SageMaker automatic model tuning. To do this, download the Jupyter notebook associated with this post from the preceding GitHub link. Create an Amazon SageMaker notebook instance and upload the Jupyter notebook onto this notebook instance. Lastly, open the notebook and step through the code. For more information, see Create a Notebook Instance. This post provides an HTML version so that you can review the code without needing to execute it.

Solution overview

The next sections walk through the following steps:

  1. Preparing a set of loan data for model training.
  2. Training a random forest classifier using the Amazon SageMaker built-in Scikit-learn Estimator.
  3. Analyzing the performance of the initial model.
  4. Using automatic model tuning to find the threshold that gives the highest portfolio value.
  5. Analyzing portfolio performance compared to the portfolio that uses the default threshold.
  6. Incorporating additional business goals and analyzing their impact on the portfolio.

Loan data

The data consists of a set of US Small Business Administration (SBA)-guaranteed loans from 1987 to 2014. These are loans extended to US-based small businesses by private banks, with the US SBA guaranteeing a large percentage of the principal in the event of borrower default. On average, the SBA guarantees about 70% of the principal for each of the loans in this dataset. This sizable guarantee offsets much of the credit risk associated with each loan and encourages private banks to extend credit to small businesses to which they might not otherwise lend. For the data itself, and a more detailed description of it, see the supplementary material of Li, Mickel, and Taylor. You should also read the license associated with the use of this research paper.

Our goal is to construct a model that predicts the probability that a specific loan will default, thus the target variable is MIS_Status. MIS_Status takes on two values: “P I F” if a loan has been paid in full, or “CHGOFF” if a loan has defaulted and the bank has taken the resulting loss.

The accompanying notebook shows that the target variable is imbalanced—about 18% of the observations have defaulted. Our approach in dealing with this imbalance is to estimate the model with the data as-is, and then set the decision threshold to optimize the economic value of our credit portfolio.
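
Concretely, the target can be encoded as a binary label before training; a minimal pandas sketch, where the file path and the derived column name are our own placeholders:

import pandas as pd

df = pd.read_csv("sba_loans.csv")  # placeholder path to the SBA loan data
# 1 = defaulted ("CHGOFF"), 0 = paid in full ("P I F")
df["default"] = (df["MIS_Status"] == "CHGOFF").astype(int)
print(df["default"].mean())  # roughly 0.18, reflecting the class imbalance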

Training the model

Next, we train a random forest classifier using the Amazon SageMaker built-in Scikit-learn estimator. We chose a random forest after comparing its performance to that of both a logistic regression and a gradient boosted classifier. With the Amazon SageMaker built-in estimator, you can build and deploy custom Scikit-learn models without needing to create and manage a custom Docker container.

For more information, see Using Scikit-learn with the Amazon SageMaker Python SDK.

For the code detailing the training of the random forest, see the “Training the Model” section of the notebook associated with this post.
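
In outline, a training job with the built-in estimator looks like the following sketch, written against the v1 SageMaker Python SDK current at the time of this post. The entry point script, IAM role, and instance type are illustrative assumptions; the notebook contains the exact code:

from sagemaker.sklearn.estimator import SKLearn

# train.py is a user-provided script that fits the random forest and saves it.
estimator = SKLearn(
    entry_point="train.py",
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    train_instance_type="ml.m5.xlarge",
    framework_version="0.20.0",
)
estimator.fit({"train": "s3://your-bucket/loan-data/train"})  # placeholder S3 path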

Analyzing model performance (part 1)

For comparison, we create a naive model that classifies all observations to the majority class, that is, it predicts that no loans will default. Does the random forest perform better than the naive model?

We have not yet determined the optimal threshold for classifying the predictions of the random forest model into default or non-default classes. Therefore, the only performance metrics available to answer this question are those based upon the predicted class probabilities output by our model. Metrics based on class predictions, for example, accuracy, precision, or recall, depend on our as-yet-undefined threshold. So to answer this question initially, we compare the log loss of the random forest and naive models. Log loss measures how far predicted class probabilities are from the true labels, and is therefore a metric that can be computed without reference to a threshold.

We will more thoroughly analyze model performance, using the more familiar threshold-based metrics, after we have calculated the optimal threshold.

Calculating log loss

Does the random forest perform better than the naive model? Remember that a smaller log loss indicates a smaller error and better performance. The following output from the model runs shows the results:

Naive Log Loss: 6.1230
Random Forest Log Loss: 0.2039

The answer is yes, the random forest improves on the log loss of the naive model by a significant amount. This implies that the random forest model assigned predicted class probabilities to each observation that are much closer to the truth than the naive model’s predictions.
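As a sketch of this comparison (assuming X_test and y_test hold the held-out features and encoded labels, and random_forest is the fitted classifier; all names are illustrative):

import numpy as np
from sklearn.metrics import log_loss

# Naive model: predict a default probability of 0 for every loan
naive_probs = np.zeros(len(y_test))
print(f"Naive Log Loss: {log_loss(y_test, naive_probs):.4f}")

# Random forest: use the predicted probability of the default class
rf_probs = random_forest.predict_proba(X_test)[:, 1]
print(f"Random Forest Log Loss: {log_loss(y_test, rf_probs):.4f}")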

Plotting the model predictions

In each of the following plot sets, the top histogram plots the distribution of predicted scores for all actual negatives, that is, the predicted scores for borrowers that do not default. In essence, it represents the score distributions associated with specificity. The bottom histogram plots predicted scores for actual positives, that is, the predicted scores for borrowers that do default, thus representing the score distributions for sensitivity.

The correctly classified observations on each plot are colored blue, and the incorrectly classified observations are colored orange. We use the default threshold value of 0.5 to color these plots. This is the typical threshold used to classify the results of a classification model, chosen without attempting to maximize the user’s success—or value—metric.

The threshold choice does not affect the actual predicted scores, shape, or level of the plots, only the coloring. It does, however, affect metric results, including sensitivity, specificity, and most other commonly used model performance metrics.

These two graphs show that while the scores for the true negatives are clustered close to 0, the scores for the false negatives are distributed relatively evenly from 0 to the current cutoff at 0.5. The dataset doesn’t include data items that would allow strong discrimination between true and false negatives.

This distribution may point to a significant amount of potential income being missed from this portfolio of approved and rejected loans. Using the default threshold score of 0.5 for approving a loan is not optimal for this dataset. Let’s explore how the portfolio value can be further increased by optimizing the threshold.
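For reference, paired histograms like the ones described above can be drawn with a sketch along these lines, assuming the y_test labels and rf_probs predictions (as NumPy arrays) from earlier; the styling is illustrative:

import matplotlib.pyplot as plt

def plot_score_histograms(y_true, scores, threshold=0.5):
    fig, (ax_neg, ax_pos) = plt.subplots(2, 1, sharex=True, figsize=(8, 6))
    neg, pos = scores[y_true == 0], scores[y_true == 1]
    # Actual negatives: correct (blue) if scored below the threshold
    ax_neg.hist(neg[neg < threshold], bins=50, color="tab:blue", label="correct")
    ax_neg.hist(neg[neg >= threshold], bins=50, color="tab:orange", label="incorrect")
    ax_neg.set_title("Actual negatives (non-defaulters)")
    # Actual positives: correct (blue) if scored at or above the threshold
    ax_pos.hist(pos[pos >= threshold], bins=50, color="tab:blue", label="correct")
    ax_pos.hist(pos[pos < threshold], bins=50, color="tab:orange", label="incorrect")
    ax_pos.set_title("Actual positives (defaulters)")
    ax_pos.set_xlabel("Predicted default probability")
    for ax in (ax_neg, ax_pos):
        ax.axvline(threshold, linestyle="--", color="gray")
        ax.legend()
    plt.show()

plot_score_histograms(y_test, rf_probs, threshold=0.5)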

Calculating portfolio value based upon a 0.5 threshold

Lastly, we calculate the portfolio values for the naive model and the random forest model based upon a 0.5 threshold. These portfolio values act as reference points to determine if choosing an optimal threshold increases the value of the loan portfolio.

Note that any non-zero threshold results in the same portfolio value for the naive model, because it predicts a default probability of 0 for every loan. The following output shows the calculated portfolio values:

Naive Portfolio Value (Threshold=0.5): $203,498,022
Random Forest Portfolio Value (Threshold=0.5): $823,674,285
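The exact calculation is in the notebook; as a stylized sketch of its shape, a lender might value the portfolio as interest earned on approved loans that perform, minus the unguaranteed principal lost on approved loans that default (all variable names here are assumptions):

import numpy as np

def portfolio_value(y_true, default_probs, threshold,
                    principal, interest_income, guarantee_frac):
    approved = default_probs < threshold        # classified as non-default
    repaid = approved & (y_true == 0)           # performing loans earn interest
    defaulted = approved & (y_true == 1)        # defaults lose unguaranteed principal
    gains = interest_income[repaid].sum()
    losses = ((1 - guarantee_frac[defaulted]) * principal[defaulted]).sum()
    return gains - losses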

Determining the optimal classification threshold with automatic model tuning

Could we do even better, by choosing a different threshold? And how do we go about finding the optimal threshold to balance the lender’s risk and reward?

In this section, the optimal threshold for classifying loans as default or non-default is determined with Amazon SageMaker automatic model tuning. The optimal threshold is the threshold that maximizes the user’s value metric. In this case, the metric that is being maximized is total portfolio value, as described previously.

To use Amazon SageMaker automatic model tuning to optimize the classification threshold, we construct a Docker container that takes the random forest model trained previously and the test set as input. Given a threshold, the container calculates the total value of the portfolio if the lender extended all loans classified as non-default, and the borrowers accepted them. Amazon SageMaker automatic model tuning generates a range of thresholds between 0 and 1 and chooses the threshold that maximizes portfolio value. For the code detailing the automatic model tuning job, see the “Determining the Optimal Classification Threshold with Automatic Model Tuning” section of the notebook associated with this post.
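A hypothetical sketch of that container’s entry point follows; the file paths, column names, and the value formula (reusing the stylized form above) are assumptions, and Amazon SageMaker passes the threshold hyperparameter as a command-line argument:

import argparse
import joblib
import pandas as pd

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--threshold", type=float, default=0.5)
    args = parser.parse_args()

    # SageMaker mounts input channels under /opt/ml/input/data/<channel>
    model = joblib.load("/opt/ml/input/data/model/random_forest.joblib")
    test = pd.read_csv("/opt/ml/input/data/test/test.csv")  # target encoded 0/1

    probs = model.predict_proba(test.drop(columns="MIS_Status"))[:, 1]
    approved = probs < args.threshold
    repaid = approved & (test["MIS_Status"] == 0)
    defaulted = approved & (test["MIS_Status"] == 1)
    value = (test.loc[repaid, "interest_income"].sum()
             - ((1 - test.loc[defaulted, "guarantee_frac"])
                * test.loc[defaulted, "principal"]).sum())

    # Amazon SageMaker scrapes this line with the tuner's metric regex
    print(f"portfolio_value: {value}")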

Running the automatic model tuning job

To use the Amazon SageMaker automatic model tuning feature, we first need to define the metric that we want Amazon SageMaker to optimize, the parameter space we want the tuning job to search over to find the optimal threshold, and any additional metrics we want calculated during the tuning job.

In the notebook associated with this post, we define the metrics we want each training job to return. Because we’d like to explore the resulting portfolio in some detail, we define a list of metrics that describe the approved and rejected loans. Each training job that runs via automatic model tuning reports these metrics, which later allow us to explore the characteristics of the maximized portfolio.

Of all the metrics we define, we need to specify which metric the automatic model tuning job should use to optimize the threshold. We do this by specifying the objective_metric_name in the following HyperparameterTuner object. In the same object, we specify the hyperparameter range to search over; in this case, we specify all continuous values between 0 and 1 to search over for the optimal threshold.

Lastly, we specify that we want Amazon SageMaker to run 200 individual training jobs. Each of these 200 training jobs uses a specific threshold value to calculate a different portfolio value. After Amazon SageMaker calculates the 200 portfolio values, each based upon a different threshold, it outputs the threshold that maximizes portfolio value.
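A sketch of that configuration with the SageMaker Python SDK might look like the following, where the estimator variable, metric regex, S3 URIs, and parallelism are illustrative rather than the notebook’s exact values:

from sagemaker.tuner import ContinuousParameter, HyperparameterTuner

tuner = HyperparameterTuner(
    estimator=threshold_estimator,  # wraps the evaluation container above
    objective_metric_name="portfolio_value",
    hyperparameter_ranges={"threshold": ContinuousParameter(0, 1)},
    metric_definitions=[
        {"Name": "portfolio_value", "Regex": "portfolio_value: ([-0-9.]+)"},
        # ...plus the additional metrics describing approved and rejected loans
    ],
    objective_type="Maximize",
    max_jobs=200,                   # 200 trials, one threshold value each
    max_parallel_jobs=10,
)
tuner.fit({"model": model_s3_uri, "test": test_s3_uri})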

This job takes up to 1 hour to run.

Analyzing model performance (part 2)

In this section, we continue analyzing the performance of the naive and random forest models, but now that we have determined the optimal threshold, we are able to incorporate threshold-based metrics in the analysis.

Plotting the automatic model tuning job results

The flatness in the following scatter plots is due to the precision of the predictions, which is a function of the number of trees in the random forest model. Because there are 100 trees in the random forest model, the precision of the predictions is two decimal places. This implies that all thresholds within a given interval, for example, greater than 0.32 and up to 0.33, give the same result.
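One way to reproduce these scatter plots is to pull the per-trial results into a DataFrame with the SageMaker Python SDK (continuing the tuner sketch from earlier):

import matplotlib.pyplot as plt

# Each row holds one training job's threshold and final objective value
results = tuner.analytics().dataframe()
plt.scatter(results["threshold"], results["FinalObjectiveValue"])
plt.xlabel("Threshold")
plt.ylabel("Portfolio value")
plt.show()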

Plotting prediction distributions given the optimal threshold

Now that we know the optimal threshold, we are able to plot the probability predictions of the random forest model and classify each as correct or incorrect. The top histogram plots the distribution of predicted scores for all actual negatives, that is, predicted scores for actual non-defaulters. The bottom histogram plots predicted scores for actual defaulters. The correctly classified observations on each plot are blue, and the incorrectly classified observations are orange.

The plot shows that the optimal threshold is below 0.5 and to the left of the bulk of the actual positives. The threshold seems to be at the point where the rate of change of true negatives as the threshold increases is slowing and the rate of change of false negatives is speeding up. The automatic model tuning job seems to have chosen a threshold that balances the two rates of change. To better understand the choice of optimal threshold, we would need to dig deeper into the portfolio value calculation and understand the costs and benefits associated with a change in threshold.

Determining maximum portfolio value

The following graphs plot the output of the automatic model tuning job. That is, they plot the portfolio value (on the y-axis) given a specific threshold (on the x-axis). Each point on a plot represents the outcome of a single training job from the overall automatic model tuning job. Recall that the goal is to find the classification threshold that optimizes the overall portfolio value. In each plot, the optimal threshold is the vertical, orange line.

The graph on the far left plots all 200 training job outcomes. The middle graph plots the top 100 training jobs as ranked by portfolio value, and the far-right graph plots the top 50 training jobs, also ranked by portfolio value.

Interestingly, the magnitude of the rate of change as we increase the threshold beyond its optimum value is generally much lower than the magnitude of the rate of change as we increase the threshold from 0 to its optimal value. This asymmetry is due to the SBA guarantee. The guarantee limits the downside risk that the lender takes on as she loosens her lending standards. If the SBA guarantee were not in place, we would expect the right side of this graph to decrease much more steeply.

Looking at the right two graphs, we zoom in on the peak of the curve and see that it is more symmetric around the optimal threshold. Additionally, the curve is not strictly decreasing after the optimal threshold; at times, the curve increases briefly. The following output shows the portfolio values for each model:

Naive Portfolio Value (Threshold=0.5):  $203,498,022
Random Forest Portfolio Value (Threshold=0.5): $823,674,285
Random Forest Portfolio Value (Optimal Threshold=0.359): $841,421,888

The top portfolio value returned from the random forest model with an optimized threshold is higher than both the value generated by the naive model and the value generated by the random forest model with a 0.5 threshold. Adjusting the threshold increased the portfolio value by $17.7M, or 2.1%—a substantial increase in potential return.

Interestingly, the optimal threshold is less than 0.5, so the lender increases the overall value of her portfolio by decreasing the credit risk of the loans in the portfolio (by decreasing the threshold). If the lender had used a 0.5 threshold (the typical default value), she would likely have created a portfolio with more credit risk and lower portfolio value. If the SBA guarantee were not in place for these loans, the portfolio value at a threshold of 0.5 would likely have been much lower.

Analyzing the return associated with maximum portfolio value

This section shifts from focusing on the dollar return of the portfolio to the percentage return. The following set of graphs is similar to the previous set except that the graphs plot the net return on the portfolio associated with each of the 200 training jobs in the automatic model tuning run. The orange, vertical line is again the optimal threshold—optimal in the sense of maximizing portfolio value, not portfolio return—and the x-axis is the threshold. The y-axis is the portfolio return.

From left to right, these graphs plot all 200 training job outcomes, the top 100 outcomes (based upon portfolio values), and the top 50 outcomes (based upon portfolio values). These return curves are much flatter than the portfolio value curves in the previous set of graphs. This is because the lender actively sets interest rates on each of the loans she extends so that the expected return on the overall portfolio is about 5%. Additionally, note that the optimal threshold does not mark the peak in portfolio return. This is because, when maximizing portfolio value, it doesn’t matter whether adding more loans increases the percentage return on the portfolio, only that adding more loans adds to the dollar return. We can add lower-percentage-return loans to the portfolio and still add positive value in dollar terms, and that dollar value is what we are attempting to maximize.

The following output shows the results of calculating the return:

Naive Model Portfolio Return (Threshold=0.5): 0.012
Random Forest Portfolio Return (Threshold=0.5): 0.051
Random Forest Portfolio Return (Optimal Threshold=0.359): 0.054

Likewise, the portfolio return from the random forest model with an optimized threshold is much higher than that generated by the naive model, though the returns from the two random forest models are similar. This is because in both of those models, the lender can set borrower-specific interest rates to compensate for borrower-specific levels of credit risk. If the threshold increases and higher risk loans enter the portfolio, the lender can set higher interest rates on those loans and on average keep her return the same.

Adjusting the optimal threshold based upon additional business considerations

Now we investigate how to determine if we should make marginal adjustments to the optimal threshold. Why would we want to adjust the optimal threshold calculated previously? There may be certain idiosyncratic goals that a lender wants to achieve that a generic portfolio value calculation doesn’t capture. For example, a lender may want to maximize her portfolio value while also emphasizing lending to a certain sector of the economy or subgroup of the overall population. Adding this additional constraint to the portfolio value calculation itself may be difficult, if not impossible. Tackling these problems in two steps—finding the generic optimum and then adjusting that optimum based upon idiosyncratic preferences—is likely much easier and more intuitive.

As an example, say that the lender would like to extend more credit to the Construction sector of the economy. She wishes to determine if she should increase the optimal threshold to achieve this goal. Essentially she needs to determine the price she is willing to pay to include one more Construction sector loan in the portfolio, and the effect on portfolio value of including that loan. If the price is greater than the cost, then she should increase the threshold.

More specifically, to answer the question of whether the lender should increase the threshold by 0.01 (the smallest increment possible in our case), she needs to do the following:

  1. Determine the price P that she is willing to pay for each additional Construction loan.
  2. Calculate the decrease in portfolio value resulting from increasing the threshold by 0.01.
  3. Calculate the number of Construction sector loans added to the portfolio when the threshold increases.
  4. Calculate the average cost of each additional Construction loan by dividing the change in portfolio value by the number of Construction loans added. This is the mean cost C of each additional Construction loan in dollar terms.
  5. Compare the price P that the lender is willing to pay for each additional Construction loan to the cost C that she must actually pay for each additional Construction loan.
    • If the willingness-to-pay price is greater than the cost (P >= -C), increase the threshold by 0.01.
    • Otherwise, keep the threshold as-is.
  6. Continue to iterate on steps 2 to 5, until it is no longer advantageous to increase the threshold.

For the code detailing the following calculations, see the notebook associated with this post.

Step 1: Determining the lender’s willingness-to-pay

The lender must first determine the amount of portfolio value she is willing to forfeit for each additional Construction sector loan. Assume the lender’s willingness-to-pay P in this example is $75,000.

P = 75000

Step 2: Determining the decrease in portfolio value

The lender must calculate the portfolio value at the optimal threshold and at the next highest threshold value, and then take the difference to determine how much the portfolio value decreases as she increases the threshold by the minimum increment. The calculation gives the following:

Decrease in Portfolio Value: -$1,640,192

Step 3: Determining the increase in number of construction loans

Next, calculate the number of Construction sector loans that are added to the portfolio when the threshold increases by 0.01. The result is as follows:

Increase in Number of Construction Loans: 26

Step 4: Determining the cost of each construction loan

The cost C is the change in portfolio value divided by the number of Construction loans added, that is, -$1,640,192 / 26:

Cost of each Additional Construction Loan: -$63,084

Step 5: Comparing the cost to willingness-to-pay

If the price P is greater than or equal to the cost C multiplied by -1 (because the cost is negative), move the threshold. In this example, the lender should increase the threshold and make those 26 additional loans, because the cost of $63,084 is less than her willingness-to-pay of $75,000.

The lender would not stop with this one step. She would continue to ask if she should increase the threshold by another 0.01 and iterate through the previous steps until she reaches a point at which she chooses not to increase the threshold.

We assume that the lender always has access to the required capital if her willingness-to-pay is greater than the cost of an additional Construction sector loan. If desired, we can include a capital budget W for the lender as well. This change would modify the final step so that the lender checks both if P >= -C and if there is a sufficient amount of capital remaining in W to cover the sum of the principal of the additional loans.
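Putting steps 2 through 5 together with the optional budget, the iteration might be sketched as follows; the three helper functions are hypothetical stand-ins for the notebook’s portfolio value and loan-count calculations:

P = 75_000         # willingness-to-pay per additional Construction loan
W = float("inf")   # capital budget; infinite when unconstrained
step = 0.01        # smallest useful increment given prediction precision
threshold = 0.359  # optimal threshold from the tuning job

while threshold + step <= 1.0:
    candidate = round(threshold + step, 2)
    # Step 2: change in portfolio value from raising the threshold
    delta_value = portfolio_value_at(candidate) - portfolio_value_at(threshold)
    # Step 3: Construction loans added in this increment
    n_new = construction_loans_added(threshold, candidate)
    if n_new == 0:
        break
    # Step 4: mean cost per additional Construction loan, in positive dollars
    cost_per_loan = -delta_value / n_new
    principal_needed = construction_principal_added(threshold, candidate)
    # Step 5: move only while willingness-to-pay covers the cost and the
    # remaining budget covers the added principal
    if P >= cost_per_loan and principal_needed <= W:
        threshold, W = candidate, W - principal_needed
    else:
        break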

Other model metrics

How do the naive, random forest with 0.5 threshold, and random forest with optimal threshold models compare according to the more traditional performance metrics, such as accuracy, precision, and recall?

The following table reports the accuracy, precision, and recall for all three models:

Model                                     Accuracy  Precision_0  Precision_1  Recall_0  Recall_1
Naive Model                               0.822721  0.822721     NaN          1.000000  0.000000
Random Forest Model (0.5 Threshold)       0.935302  0.944246     0.883336     0.979177  0.731683
Random Forest Model (Optimal Threshold)   0.934975  0.960350     0.817026     0.960626  0.815937
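These threshold-based metrics follow from the predicted probabilities and each model’s threshold; a sketch of the computation, reusing the assumed y_test and rf_probs from earlier:

from sklearn.metrics import accuracy_score, precision_score, recall_score

for name, threshold in [("0.5 Threshold", 0.5), ("Optimal Threshold", 0.359)]:
    preds = (rf_probs >= threshold).astype(int)
    print(name,
          accuracy_score(y_test, preds),
          precision_score(y_test, preds, pos_label=0),
          precision_score(y_test, preds, pos_label=1),
          recall_score(y_test, preds, pos_label=0),
          recall_score(y_test, preds, pos_label=1))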

According to this table, which model is the best? That question can’t be truly answered unless we know the benefits and costs to the lender associated with each cell of the confusion matrix, that is, the benefits associated with the true positives and true negatives and the costs associated with the false positives and false negatives.

It’s clear from the preceding table that both random forest models strictly dominate the naive model (assuming that the cost of a false positive isn’t significantly larger than the cost of a false negative). Additionally, there isn’t a clear-cut winner between the two random forest models. The answer depends upon the relative costs of misclassification to the lender. We know from the business context of the problem described in the introduction that there is a significantly higher cost associated with a false negative than with a false positive. Given that information, it is more valuable for the lender to minimize false negatives, and as such, Recall_1 and Precision_0 are the most salient metrics.

This discussion illustrates the fact that determining the so-called best model requires knowledge of the business use case that this ML model addresses, and the benefits and costs associated with each potential classification outcome; only then can we determine the metric that best captures what success means to the business. Additionally, precision and recall only include information about two of the four cells of the confusion matrix, but the lender cares about the net benefits associated with all four cells. Using these typical metrics ignores half of the outcomes that the lender cares about and also ignores the specific costs and benefits associated with all outcomes. Because of this, these metrics are lacking, and one should calculate a single problem-specific metric that incorporates the specific costs and benefits associated with all cells of the confusion matrix to determine the optimal threshold. In this post, this metric is portfolio value.

This optimization approach can be used more generally to test whether a threshold is optimal for the problem and data at hand.

Cleaning up

If you created a new Amazon SageMaker notebook instance to run the code, remember to stop or delete it to minimize costs.

Conclusion

This post showed how to find the optimal threshold in a binary classification problem. Specifically, we describe how to use Amazon SageMaker automatic model tuning to determine the classification threshold that maximizes the portfolio value of a lender when choosing which subset of borrowers to extend credit to. More generally, the method of choosing an optimal threshold we describe can be applied to situations in which you need to choose multiple thresholds. The main modification needed is to incorporate multiple thresholds into the problem-specific, threshold-based metric. After doing that, you could use Amazon SageMaker automatic model tuning to find a vector of thresholds, as opposed to a single threshold, that maximizes your metric.

The threshold determination approach we describe has several substantial advantages. First, it makes the logic and rationale used in determining a threshold explicit. Second, it requires the business to clearly state its cost matrix, based on the specific actions to take on the model predictions and their associated benefits and costs. Making the logic and cost structure explicit can drive challenging discussions in the business, and force differing implicit decisions and valuations onto the table for open discussion and agreement. In addition, though explainable ML is beyond the scope of this post, the explicit statement of the logic and cost structure of threshold determination encouraged by our approach fits well with the goals of that line of research.

Lastly, this approach can also potentially be used to address the issue of imbalanced data. The issue with imbalanced data is often not that one target class has a much larger representation in the data than another target class, it’s that the misclassification costs (that is, the cost of a false positive versus a false negative), are dramatically different from one another. Instead of using sampling to balance the training data, you can clearly define the misclassification costs in the problem-specific metric, and use that metric to find an optimal threshold. This approach makes the issue less a technical one of using a trick of modifying the distribution of data to more of a business one of clearly specifying the cost structure of a problem. That may address the true issue of imbalanced data more directly, which is the issue of imbalanced misclassification costs.

For any of your business use cases that require setting a classification threshold, consider using Amazon SageMaker automatic model tuning and the method this post describes. To get started, open the Amazon SageMaker console and explore the code from the GitHub repo that generated the results in this post. If you have thoughts on business use cases that you could apply this method to, or any questions, please leave them in the comments. For more information about training models that have asymmetric classification costs, see Training models with unequal economic error costs using Amazon SageMaker.

Sources and references:

Friedman, Milton, and L. J. Savage. “The Utility Analysis of Choices Involving Risk.” Journal of Political Economy 56, no. 4 (1948): 279–304.

Data sourced from: Li, Min, Amy Mickel, and Stanley Taylor. “‘Should This Loan Be Approved or Denied?’: A Large Dataset with Class Assignment Guidelines.” Journal of Statistics Education 26, no. 1 (January 2, 2018): 55–66. https://doi.org/10.1080/10691898.2018.1434342.

Metz, Charles E. “Basic Principles of ROC Analysis.” Seminars in Nuclear Medicine 8, no. 4 (October 1978): 283–98. https://doi.org/10.1016/S0001-2998(78)80014-2.

Wu, Yirong, Craig K. Abbey, Xianqiao Chen, Jie Liu, David C. Page, Oguzhan Alagoz, Peggy Peissig, Adedayo A. Onitilo, and Elizabeth S. Burnside. “Developing a Utility Decision Framework to Evaluate Predictive Models in Breast Cancer Risk Estimation.” Journal of Medical Imaging 2, no. 4 (October 2015). https://doi.org/10.1117/1.JMI.2.4.041005.

Zadrozny, Bianca, and Charles Elkan. “Learning and Making Decisions When Costs and Probabilities Are Both Unknown,” 204–13. ACM Press, 2001. https://doi.org/10.1145/502512.502540.

Veronika Megler and Scott Gregoire. “Training models with unequal economic error costs using Amazon SageMaker.” AWS Machine Learning Blog, September 18, 2018.


About the Authors

Scott Gregoire is a Data Scientist with AWS Professional Services. He holds a PhD in Economics from the University of Texas at Austin and has advised clients in sectors ranging from international finance to retail. Currently, he is working with customers to develop innovative machine learning solutions on AWS.


Veronika Megler, PhD, is a senior consultant for AWS Professional Services. She enjoys adapting innovative big data, AI and ML technologies to help customers solve new problems, and to solve old problems more efficiently and effectively.


Your guide to artificial intelligence and machine learning at re:Invent 2019

With fewer than 40 days until re:Invent 2019, the excitement is building, and we are looking forward to seeing you all soon! Continuing our journey in artificial intelligence and machine learning, we are bringing a lot of technical content this year, with over 200 breakout sessions, deep-dive chalk talks, and hands-on workshops featuring Amazon SageMaker, AWS DeepRacer, and deep learning frameworks such as TensorFlow, PyTorch, and more. You’ll hear from many customers, including Vanguard, BBC, Autodesk, British Airways, Fannie Mae, Thermo Fisher, Intuit, and many more. We are also hosting the Machine Learning Summit again this year, where you will hear from researchers and entrepreneurs about the latest breakthroughs today and the future possibilities of tomorrow.

To get you started on planning, here are a few highlights of the AI and ML sessions from the re:Invent 2019 session catalog. Reserved seating is now open, so reserve seats in advance for your favorite sessions.

Getting started

If you are new to AI and ML, we have some sessions for you to get started and learn these concepts. These sessions cover the basics including overviews and demos for Amazon SageMaker, the different AI services for many applications, and the popular AWS DeepLens and AWS DeepRacer to help you learn, while having fun.

Leadership session: Machine Learning (Session AIM218-L)

As we embark on the golden age of machine learning, we are seeing the constraints and blockers disappear, and the value extending across different industries. In this leadership session, learn about the latest machine learning offerings from AWS as we explore the democratization of machine learning. We will discuss the breadth and depth of our machine learning services and you will hear from customers who are partnering with AWS on this journey.

Amazon SageMaker deep dive: A modular solution for machine learning (Session AIM307)

Amazon SageMaker is a fully managed service that supports developers and data scientists across every aspect of the machine learning workflow. In this session, we will discuss the technical details of Amazon SageMaker to help you take your ML models from experimentation to production at scale. We will also discuss practical deployments through real-world customer examples.

Starting the enterprise machine learning journey (Session AIM205)

Amazon has been investing in machine learning for more than 20 years, innovating in areas such as fulfillment and logistics, personalization and recommendations, forecasting, fraud prevention, and supply chain optimization. During this session, we take this expertise and show you how to identify business problems that can be solved with machine learning. We discuss considerations including selecting the right use case for a machine learning pilot, nurturing skills, and measuring the success of such pilots.

Finding a needle in a haystack: Use AI to transform content management (Session AIM206)

Finding digital content, from documents to media, can be frustrating and time-consuming. Across your employees or customers, this challenge can waste hours, derail projects, and create poor experiences. In this breakout session, learn how to use language and vision AI services to extract data, insights, and trends from all of your digital content, with a focus on how to more effectively manage your documents and find what you need.

Get started with AWS DeepRacer (Workshop AIM207)

Get behind the keyboard for an immersive experience with AWS DeepRacer. Developers with no prior machine learning experience learn new skills and apply their knowledge in a fun and exciting way. In this workshop, one of many for AWS DeepRacer, the AWS pit crew helps you build and train a reinforcement learning model that you can race on the tracks to win special AWS prizes. See the “Advanced topics in machine learning” section for an advanced version of this workshop.

Start using computer vision with AWS DeepLens (Workshop AIM229)

If you’re new to deep learning, this workshop is for you. Learn how to build and deploy computer-vision models using the AWS DeepLens deep-learning-enabled video camera. Also learn how to build a machine learning application and a model from scratch using Amazon SageMaker. Finally, learn to extend that model to Amazon SageMaker to build an end-to-end AI application. See the “Advanced topics in machine learning” section for an advanced version of this workshop.

Improve machine learning model quality in response to changes in data (Session AIM213)

Machine learning models are typically trained and evaluated using historical data. But the real-world data may not look like the training data, especially as models age over time and the distribution of data changes. This gradual variance of the model from the real world is known as model drift, and it can have a big impact on prediction quality. This session explores techniques you can use to monitor prediction quality in production, as well as effective corrective actions such as auditing and iterative retraining.

Practical applications of machine learning

The biggest value for machine learning is its applicability across different industries. In these sessions, chalk talks, and workshops, we will dive deep into the practical aspects of machine learning for specific industries including finance, healthcare, retail, media and entertainment, manufacturing, and more.

Transforming Healthcare with AI (Session AIM210)

Improving patient care, making treatment decisions, managing clinical trials, and more are all moving into a new age due to advancements in AI. In this session, we cover AI solutions specific to the Healthcare industry, from extracting relevant medical information from patient records and clinical trial reports to automating the clinical documentation process with automatic speech recognition. Hear directly from our customers and come away with answers on how to get started immediately.

ML in retail: Solutions that add intelligence to your business (Session AIM212)

Machine learning is ranked the number-one “game changer” for the retail market segment by chief experience officers (CXOs), yet it’s only number eight on top spending priorities. So which scenarios are real? In this session, we dive into how AWS puts machine learning in the hands of every developer, without the need for deep machine learning experience. Learn about personalized product recommendations, inventory forecasting, new in-store experiences, and more. Learn from our experience at Amazon.com and hear from our customers today.

AI document processing for business automation (Session AIM211)

Millions of times per day, customers from the Finance, Healthcare, public, and other sectors rely on information that is locked in documents. Amazon Textract uses artificial intelligence to “read” such documents as a person would, to extract not only text but also tables, forms, and other structured data without configuration, training, or custom code. In this session, we demonstrate how you can use Amazon Textract to automate business processes with AI. You also hear directly from our customers about how they accelerated their own business processes with Amazon Textract.

Predict future business outcomes using Amazon Forecast (Session AIM312)

Based on the same technology used at Amazon.com, Amazon Forecast uses machine learning and time-series data to build accurate business forecasts. In this session, learn how machine learning can improve accuracy in demand forecasting, financial planning, and resource allocation while reducing your forecasting time from months to hours.

Build accurate training datasets with Amazon SageMaker Ground Truth (Session AIM308)

Successful machine learning models are built on high-quality training datasets. Typically, the task of data labeling is distributed across a large number of humans, adding significant overhead and cost. This session explains how Amazon SageMaker Ground Truth reduces cost and complexity using techniques designed to improve labeling accuracy and reduce human effort. We will walk through best practices for building highly accurate training datasets and discuss how you can use Amazon SageMaker Ground Truth to implement them.

Build predictive maintenance systems with Amazon SageMaker (Chalk Talk AIM328)

Across a wide spectrum of industries, customers are starting to use predictive maintenance models to proactively fix problems before they impact production. The result is an optimized supply chain and improved working conditions. In this session, learn how to use data from equipment to build, train, and deploy predictive models. We dive deep into the architecture, using the turbofan degradation simulation dataset to train the model to recognize potential equipment failures, and share details.

Build a fraud detection system with Amazon SageMaker (Workshop AIM359)

In this workshop, we will explore the new AWS Fraud Detection solution. We show you how to build, train, and deploy a fraud detection machine learning model. The fraud detection model recognizes fraud patterns and is self-learning, which enables it to adapt to new, unknown fraud patterns. We also show you how to execute automated transaction processing and how the Fraud Detection solution flags that activity for review.

Delight your customers with ML-based personalized recommendations (Session AIM323)

Recommendation engines make targeted marketing campaigns, re-ranking of items, personalized notifications, and personalized search possible. In this session, we deep-dive into using Amazon Personalize to create and manage personalized recommendations efficiently, letting you focus on the real value of the data for your business. We discover how these deep learning techniques have a direct impact on the bottom line of your business by increasing engagement, click-through, satisfaction, and revenue. Learn from customer examples and dive into some live demonstrations.

Accelerate time-series forecasting with Amazon Forecast (Workshop AIM335)

Based on the same technology used at Amazon.com, Amazon Forecast uses machine learning to combine time-series data with additional variables to build up to 50% more accurate forecasts. In this workshop, prepare a dataset, build models based on that dataset, evaluate a model’s performance based on real observations, and learn how to evaluate the value of a forecast compared with another. Gain the skills to make decisions that will impact the bottom line of your business.

Build a content-recommendation engine with Amazon Personalize (Workshop AIM304)

Machine learning is being used increasingly to improve customer engagement by powering personalized product and content recommendations. Amazon Personalize lets you easily build sophisticated personalization capabilities into your applications, using machine learning technology perfected from years of use on Amazon.com. In this workshop, you build your own recommendation engine by providing training data, building a model based on the algorithm of your choice, testing the model by deploying your Amazon Personalize campaign, and integrating it into your own application.

Advanced topics in machine learning

We have a number of sessions that will dive deep into the technical details of machine learning across our service portfolio as well as deep learning frameworks including TensorFlow, PyTorch, and Apache MXNet. These code-level sessions and hands-on workshops will enable the advanced developer or data scientist in you to customize, integrate, and solve many challenges with deep technical solutions.

Deep learning with TensorFlow (Session AIM410, Workshop AIM401)

TensorFlow is one of the most popular open-source deep learning frameworks used in machine learning development. The advanced breakout session will dive deep into training machine learning models with TensorFlow using Amazon SageMaker, including distributed training, cost-effective inference, and workflow management. The code-level workshop will include hands-on exercises where we will train and deploy TensorFlow models, apply automatic model tuning using Amazon SageMaker, and make predictions in production.

Deep learning with PyTorch (Session AIM412, Workshop AIM402)

PyTorch is rapidly gaining popularity in the industry as a deep learning framework for transitioning seamlessly from research prototyping to production deployment. In the breakout session, you will learn how to develop deep learning models with PyTorch using Amazon SageMaker for multiple use cases, including using a BERT model and instance segmentation for fine-grained computer vision. In the workshop, you will build a natural language processing model to analyze text.

Deep learning with Apache MXNet (Session AIM411, Workshop AIM403)

Apache MXNet is a widely used deep learning framework across diverse applications such as computer vision, speech recognition, and natural language processing (NLP). The breakout session will discuss building computer vision and NLP models using MXNet to automatically extract information from documents. In the workshop, we will build a computer vision model using MXNet, train the model for high accuracy, and finally deploy it to production using Amazon SageMaker.

Deep dive on Project Jupyter (Session AIM413)

Amazon SageMaker offers fully managed Jupyter notebooks that you can use in the cloud so you can explore and visualize data and develop your machine learning model. In this session, we explain why we picked Jupyter notebooks, and how and why AWS is contributing to Project Jupyter. We dive deep into our overall strategy for Jupyter and explain different use cases for Jupyter, including data science, analytics, and simulation.

Under the hood of AWS DeepRacer: Advanced RL driving course (Workshop AIM428)

This technical deep dive is suitable for advanced machine learning developers looking to learn more complex reinforcement learning concepts using AWS DeepRacer and Amazon SageMaker RL. AWS data scientists help you build models that require innovations in neural network architecture, expand the algorithms, and help you customize your AWS DeepRacer model for performance. We also dive deep into the technology under the hood that powers the AWS DeepRacer car.

Optimize deep learning models for edge deployments with AWS DeepLens (Workshop AIM405)

In this workshop, learn how to optimize your computer vision pipelines for edge deployments with AWS DeepLens and Amazon SageMaker Neo. Also learn how to build a sample object detection model with Amazon SageMaker and deploy it to AWS DeepLens. Finally, learn how to optimize your deep learning models and code to achieve faster performance for use cases where speed matters.

Take an ML model from idea to production using Amazon SageMaker (Workshop AIM427)

Come build the most accurate text-classification model possible with Amazon SageMaker. This service lets you build, train, and deploy ML models using built-in or custom algorithms. In this workshop, learn how to leverage Keras/TensorFlow deep-learning frameworks to build a text-classification solution using custom algorithms on Amazon SageMaker. We walk you through packaging custom training code in a Docker container, testing it locally, and then using Amazon SageMaker to train a deep-learning model. You then try to iteratively improve the model to achieve high accuracy. Finally, you deploy the model in production so applications can leverage the classification service.

Implement ML workflows with Kubernetes and Amazon SageMaker (Session AIM326)

Until recently, data scientists have spent much time performing operational tasks, such as ensuring that frameworks, runtimes, and drivers for CPUs and GPUs work well together. In addition, data scientists needed to design and build end-to-end machine learning (ML) pipelines to orchestrate complex ML workflows for deploying ML models in production. With Amazon SageMaker, data scientists can now focus on creating the best possible models while enabling organizations to easily build and automate end-to-end ML pipelines. In this session, we dive deep into Amazon SageMaker and container technologies, and we discuss how easy it is to integrate such tasks as model training and deployment into Kubernetes and Kubeflow-based ML pipelines.

Security for ML environments with Amazon SageMaker (Session AIM327)

Amazon SageMaker is a modular, fully managed platform that enables developers and data scientists to quickly and easily build, train, and deploy machine learning models at any scale. In this session, we dive deep into the security configurations of Amazon SageMaker components, including notebooks, training, and hosting endpoints. Vanguard joins us to discuss the company’s use of Amazon SageMaker and its implementation of key controls in a highly regulated environment, including fine-grained access control, end-to-end encryption in transit, and comprehensive audit trails for resource and data access. If you want to build secure ML environments, this session is for you.

Machine Learning Summit

Whether you are a data scientist, machine learning practitioner, or business professional, you’ll enjoy the Machine Learning Summit at this year’s re:Invent, which will showcase advances in machine learning as well as the emerging trends. From disaster management to pediatrics, from fighting fake news to indoor farming, you will hear experts share their knowledge and perspectives.

Some of the sessions include:

Deep Learning for Disaster Management and Response
Cornelia Caragea, Associate Professor, Science and Engineering Offices,
Computer Science, University of Illinois at Chicago

Fighting Fake News and Deep Fakes with Machine Learning
Delip Rao, Vice President of Research at the AI Foundation

Deep Learning in Deep Nets: Helping Fish Farmers Feed the World
Bryton Shang, Founder and CEO, Aquabyte

Big Data for Tiny Patients: Applying ML to Pediatrics
Dr. Judith Dexheimer, Associate Professor, UC Department of Pediatrics,
Cincinnati Children’s Hospital Medical Center

Machine Learning and Society: Bias, Fairness and Explainability
Pietro Perona, Amazon Fellow, AWS

From Seed to Store: Using AI to Optimize the Indoor Farms of the Future
Henry Sztul, SVP, Science and Technology, Bowery Farming

The Machine Learning Summit will inform you about what’s on the horizon for machine learning. The event is scheduled for Tuesday, December 3, 2019, from 1:30 PM to 6 PM at the Venetian Theater. Visit the summit home page and register today.



About the Author

Shyam Srinivasan is on the AWS Machine Learning marketing team. He cares about making the world a better place through technology and loves being part of this journey. In his spare time, Shyam loves to run, travel, and have fun with his family and friends.


US Spanish and Brazilian Portuguese neural voices join Amazon Polly

Amazon Polly turns text into lifelike speech. In July 2019, AWS launched eight US English and three UK English voices in Neural Text-to-Speech (NTTS) technology, which delivers ground-breaking improvements in speech quality through a new machine learning approach. Amazon Polly is now adding its first non-English NTTS voices, in US Spanish and Brazilian Portuguese. Introducing Lupe and Camila!

Why US Spanish?

There are an estimated 59.8 million Hispanic people in the United States. (This figure comes from the US Census annual estimate as of July 1, 2018.) Companies that provide engaging online content to their Hispanic audience set themselves up for success. The new US Spanish voice, Lupe, joins this trend. After Miguel and Penélope, it is the third US Spanish TTS voice in the Amazon Polly portfolio. Lupe offers a human-like quality with enhanced intonation, especially when listening to the neural version of the voice. Lupe not only speaks Spanish but also handles English very well, providing a fully bilingual Spanish-English experience. All of this is possible thanks to extended phoneme coverage comprising 72 English and Spanish phoneme variants. In contrast, the phone set for Penélope and Miguel contains only 29 Spanish phonemes.

[Audio samples of Lupe, voiced by Amazon Polly]

Why Brazilian Portuguese?

Camila, the new Brazilian Portuguese TTS voice, supports customers whose priority is to provide best-in-class TTS voices for their Brazilian Portuguese-speaking audience. Similar to Lupe, Camila is a natural-sounding TTS voice that demonstrates a high prosodic quality. The synthesis generated by this voice is smooth and clear, which makes Camila a pleasant voice to listen to. Amazon Polly customers can now enjoy a selection of three Brazilian Portuguese voices: Ricardo, Vitória, and Camila.

[Audio samples of Camila, voiced by Amazon Polly]

The neural versions of Camila and Lupe are the first two non-English NTTS voices that Amazon Polly offers, and are available in US East (N. Virginia), US West (Oregon), and EU (Ireland) Regions. Standard versions of these voices are also available across 18 AWS Regions.

Amazon Polly now offers a selection of 61 voices across 29 languages. Of these, 13 voices in 4 languages are available in both standard and neural technology.

Try these new voices and experience for yourself the natural-sounding NTTS technology powering Camila and Lupe.
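For example, you can synthesize a short sample with the AWS SDK for Python (boto3); the sample text and output file name here are arbitrary:

import boto3

polly = boto3.client("polly")
response = polly.synthesize_speech(
    Engine="neural",       # or "standard" for the standard voice versions
    LanguageCode="es-US",  # use "pt-BR" with VoiceId="Camila"
    VoiceId="Lupe",
    OutputFormat="mp3",
    Text="Hola, me llamo Lupe. I can also read your English content.",
)
with open("lupe.mp3", "wb") as f:
    f.write(response["AudioStream"].read())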



About the Author

Marta Smolarek is a Program Manager in the Amazon Text-to-Speech team. At work she connects the dots. In her spare time, she loves to go camping with her family.


AWS supports the Deepfake Detection Challenge with competition data and AWS credits

Today AWS is pleased to announce that it is working with Facebook, Microsoft, and the Partnership on AI on the first Deepfake Detection Challenge. The competition, to which we are contributing up to $1 million in AWS credits to researchers and academics over the next two years, is designed to produce technology that can be deployed to better detect when artificial intelligence has been used to alter a video in order to mislead the viewer. We plan to host the full competition dataset when it is made available later this year, and we are offering the support of Amazon machine learning experts to help teams get started. We want to ensure access to this data for a diverse set of participants with varied perspectives to help develop the best possible solutions to combat the growing problem of “deepfakes.”

The same technology that has given us delightfully realistic animation effects in movies and video games has also been used by bad actors to blur the distinction between reality and fiction. “Deepfake” videos manipulate audio and video using artificial intelligence to make it appear as though someone did or said something they didn’t. These techniques can be packaged up into something as simple as a cell phone app, and are already being used to deliberately mislead audiences by spreading fake viral videos through social media. The fear is that deepfakes may become so realistic that they will be used to the detriment of reputations, to sway popular opinion, and, in time, to make any piece of information suspect.

The Deepfake Detection Challenge invites participants to build new approaches that can detect deepfake audio, video, and other tampered media. The challenge will kick off in December at the NeurIPS Conference with the release of a new dataset generated by Facebook, which comprises tens of thousands of example videos, both real and fake. Competitors will use this dataset to design novel algorithms that can detect whether a video is real or fake, and the algorithms will be evaluated against a secret test dataset (which will not be made available, to ensure a standard, scientific evaluation of entries).

Building deepfake detectors will require novel algorithms that can process this vast library of data (more than 4 petabytes). AWS will work with DFDC partners to explore options for hosting the dataset, including the use of Amazon S3, and we will make $1 million in AWS credits available to develop and test these sophisticated new algorithms. All participants will be able to request a minimum of $1,000 in AWS credits to get started, with additional awards granted in quantities of up to $10,000 as entries demonstrate viability or success in detecting deepfakes. Participants can visit www.aws.amazon.com/aws-ml-research-awards to learn more and request AWS credits.

The Deepfake Detection Challenge steering committee is sharing the first 5,000 videos of the dataset with researchers working in this field. The group will collect feedback and host a targeted technical working session at the International Conference on Computer Vision (ICCV) in Seoul beginning on October 27, 2019. Following this due diligence, the full dataset release and the launch of the Deepfake Detection Challenge will coincide with the Conference on Neural Information Processing Systems (NeurIPS) this December.

To support participants in this endeavor, AWS will also provide access to Amazon ML Solutions Lab experts and solutions architects for technical support and guidance to help teams get started in the challenge. The Amazon ML Solutions Lab is a dedicated service offering for AWS customers that provides access to the same talent that built many of Amazon’s machine learning-powered products and services. These Amazon experts help AWS customers use machine learning technology to build intelligent solutions that address some of the world’s toughest challenges, like predicting famine, identifying cancer faster, and expediting assistance to areas hard hit by natural disasters. Amazon ML Solutions Lab experts will be paired with Challenge participants to provide assistance throughout the competition.

In addition to serving as a founding member of the Partnership on AI, AWS is also joining the non-profit’s Steering Committee on AI and Media Integrity. The goal, as with sponsorship of the Deepfake Detection Challenge, is to coordinate the activities of media, tech companies, governments, and academia to promote technologies and policies that strengthen trust in media and help audiences differentiate fact from fiction.

To learn more about the Deepfake Detection Challenge and receive updates on how to register and participate, visit www.Deepfakedetectionchallenge.ai. Stay tuned for more updates as we get closer to kick-off!



About the Author

Michelle Lee is vice president of the Machine Learning Solutions Lab at AWS.


The AWS DeepRacer League and countdown to the re:Invent Championship Cup 2019

The AWS DeepRacer League is the world’s first autonomous racing league, open to anyone. Announced at re:Invent 2018, it puts machine learning in the hands of every developer in a fun and exciting way. Throughout 2019, developers of all skill levels have competed in the League at 21 Amazon events globally, including Amazon re:MARS and select AWS Summits, and put their skills to the test in the League’s virtual circuit via the AWS DeepRacer console. The League concludes at re:Invent 2019. Log in today and start racing—time is running out to win an expenses-paid trip to re:Invent!

The final AWS Summit race in Toronto

In the eight months since the League kicked off in Santa Clara, the League has visited 17 countries, with thousands of developers completing over 13,000 laps and 165 miles of track. Each city has crowned its champion, and we will see each of them at re:Invent 2019!

On October 3, 2019, the 21st and final AWS DeepRacer Summit race took place in Toronto, Canada. The event concluded in-person racing for the AWS DeepRacer League, and not one, but four expenses-paid trips were up for grabs.

First was the crowning of our Toronto champion Mohammad Al Ansari, with a winning time of 7.85 seconds, just 0.4 seconds away from beating the current world record of 7.44 seconds. Mohammad came to the AWS Summit with his colleague from Myplanet, where they took part in an AWS-led workshop for AWS DeepRacer to learn more about machine learning. They then made connections with AWS DeepRacer communities and received support from AWS DeepRacer enthusiasts such as Lyndon Leggate, a recently announced AWS ML Hero.

The re:Invent lineup is shaping up

Once the racing concluded, it was time to tally the scores for the overall competition and name the top three overall Summit participants. Foreign exchange IT specialist Ray Goh traveled from Singapore to compete in his fourth race in his quest to top the overall leaderboard. Ray previously attended the Singapore, Hong Kong, and re:MARS races, and has steadily improved his models all year. He closed out the season with his fastest time of 8.15 seconds at the Toronto race. The other two spots went to ryan@ACloudGuru and Raycha@Kakao, who have also secured their places in the knockouts at re:Invent along with the 21 Summit Champions.

It could be you that lifts the Championship Cup

The Championship Cup at re:Invent is sure to be filled with fun and surprises, so watch this space for more information. There is still time for developers of all skill levels to advance to the knockouts. Compete now in the final AWS DeepRacer League Virtual Circuit, and it could be you who is the Champion of the 2019 AWS DeepRacer League!



About the Author

Alexandra Bush is a Senior Product Marketing Manager for AWS AI. She is passionate about how technology impacts the world around us and enjoys being able to help make it accessible to all. Out of the office she loves to run, travel and stay active in the outdoors with family and friends.


Calculating new stats in Major League Baseball with Amazon SageMaker

The 2019 Major League Baseball (MLB) postseason is here after an exhilarating regular season in which fans saw many exciting new developments. MLB and Amazon Web Services (AWS) teamed up to develop and deliver three new, real-time machine learning (ML) stats to MLB games: Stolen Base Success Probability, Shift Impact, and Pitcher Similarity Match-up Analysis. These features are giving fans a deeper understanding of America’s pastime through Statcast AI, MLB’s state-of-the-art technology for collecting massive amounts of baseball data and delivering more insights, perspectives, and context to fans in every way they’re consuming baseball games.

This post looks at the role machine learning plays in providing fans with deeper insights into the game. We also provide code snippets that show the training and deployment process behind these insights on Amazon SageMaker.

Machine learning steals second

Stolen Base Success Probability provides viewers with a new depth of understanding of the cat and mouse game between the pitcher and the baserunner.

To calculate the Stolen Base Success Probability, AWS used MLB data to train, test, and deploy an ML model that analyzes thousands of data points covering 37 variables that, together, determine whether or not a player safely arrives at second if he attempts to steal. Those variables include the runner’s speed and burst, the catcher’s average pop time to second base, the pitcher’s velocity and handedness, historical stolen base success rates for the runner, batter, and pitcher, along with relevant data about the game context.

We took a 10-fold cross-validation approach to explore a range of classification algorithms, such as logistic regression, support vector machines, random forests, and neural networks, by using historical play data from 2015 to 2018 provided by MLB that corresponds to ~7.3K stolen base attempts with ~5.5K successful stolen bases and ~1.8K runners caught stealing. We applied numerous strategies to deal with the class imbalance, including class weights, custom loss functions, and sampling strategies, and found that the best performing model for predicting the probability of stolen base success was a deep neural network trained on an Amazon Deep Learning (DL) AMI, pre-configured with popular DL frameworks. The trained model was deployed using Amazon SageMaker, which provided the subsecond response times required for integrating predictions into in-game graphics in real-time, and on ML instances that auto-scaled across multiple Availability Zones. For more information, see Deploy trained Keras or TensorFlow models using Amazon SageMaker.
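A deployment along those lines might look like the following sketch with the SageMaker Python SDK, where the model artifact path, instance type and count, and the placeholder feature vector are assumptions (endpoint auto scaling is configured separately through Application Auto Scaling):

import sagemaker
from sagemaker.tensorflow.serving import Model

role = sagemaker.get_execution_role()

model = Model(
    model_data="s3://<your-bucket>/stolen-base-model/model.tar.gz",
    role=role,
    framework_version="1.14",
)

# Multiple instances spread the endpoint across Availability Zones
predictor = model.deploy(initial_instance_count=2, instance_type="ml.c5.xlarge")

# Placeholder for the 37 in-game variables the model consumes
runner_pitcher_features = [0.0] * 37
result = predictor.predict({"instances": [runner_pitcher_features]})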

As the player on first base contemplates stealing second, viewers can see his Stolen Base Success Probability score in real time, right on their screens.

MLB offered fans a pilot test and preview of Stolen Base Success Probability during the 2018 postseason. Thanks to feedback from broadcasters and fans, MLB and AWS collaborated during the past offseason to develop an enhanced version with new graphics, improved latency of real-time stats for replays, and a cleaner look. One particular enhancement is the “Go Zone,” the point along the baseline at which the player’s chance of successfully making the steal reaches at least 85%.

As the player extends his lead towards second, viewers can now see the probability changing dynamically and a jump in his chances of success when he hits the “Go Zone.” After the runner reaches second base, whether he gets called “safe” or “out,” viewers have the opportunity during a replay to see data generated from a variety of factors that may have determined the ultimate outcome, like the runner’s sprint speed and the catcher’s pop time. Plus, that data is color-coded in green, yellow, and red to help fans visualize the factors that played the most significant roles in determining whether or not the player successfully made it to second.
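
To make the “Go Zone” idea concrete, here is a minimal sketch of locating it from a probability curve; the probability function below is a hypothetical stand-in for the deployed model, not MLB’s actual logic:

import numpy as np

def success_probability(lead_ft):
    # Hypothetical stand-in for the deployed model: success chance
    # grows as the runner extends his lead
    return 0.5 + 0.03 * (lead_ft - 8.0)

lead_ft = np.linspace(8, 20, 25)      # candidate lead distances (feet)
probs = success_probability(lead_ft)

# The "Go Zone" starts where predicted success first reaches 85%
in_zone = probs >= 0.85
go_zone_start = lead_ft[np.argmax(in_zone)] if in_zone.any() else None
print(go_zone_start)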

Predicting impact of infield defensive strategies

Over the last decade, there have been few changes in MLB as dramatic as the rise of the infield shift, a “situational defensive realignment of fielders away from their traditional starting points.” Teams use the shift to exploit batted-ball patterns, such as a batter’s tendency to pull batted balls (right field for left-handed hitters and left field for right-handed hitters). As a batter steps up to the plate, the defensive infielders adjust their positions to cover the area where the batter has historically hit the ball into play.

Using Statcast AI data, teams can give their defense an advantage by shifting players to prevent base hits—and teams are employing this strategy more often now than at any other time in baseball history. League-wide shifting rates have increased by 86% over the last three years, up to 25.6% in 2019 from 13.8% in 2016.

AWS and MLB teamed up to employ machine learning to give baseball fans insight into the effectiveness of a shifting strategy. We developed a model to estimate the Shift Impact—the change in a hitter’s expected batting average on ground balls—as he steps up to the plate, using historical data and Amazon SageMaker. As infielders move around the field, the Shift Impact dynamically updates by re-computing the expected batting average with the changing positions of the defenders. This provides a real-time experience for fans.

Using data to quantify the Shift Impact

A spray chart illustrates a batter’s tendency to hit balls in a particular direction. The chart indicates the percentage of a player’s batted balls hit through various sections of the field. The following chart shows the 2018 spray distribution of batted balls hit by Joey Gallo of the Texas Rangers within the infielders’ reach, defined as having a projected distance of less than 200 feet from home plate. For more information, see Joey Gallo’s current stats on Baseball Savant.

The preceding chart shows Joey Gallo’s tendency to pull the ball toward right field: he hit 74% of his batted balls to the right of second base in 2018. A prepared defense can take advantage of this pattern by overloading the right side of the infield, cutting short the trajectory of the ball and increasing the chance of converting the batted ball into an out.

We estimated the value of specific infield alignments against batters based on their historical batted-ball distribution by taking into account the last three seasons of play, or approximately 60,000 batted balls in the infield. For each of these at-bats, we gathered the launch angle and exit velocity of the batted ball and infielder positions during the pitch, while looking up the known sprint speed and handedness of the batter. While there are many metrics for offensive production in baseball, we chose to use batting average on balls in play—that is, the probability of a ball in play resulting in a base hit.

We calculated how effective a shift might be by estimating the amount by which a specific alignment decreases our offensive measure. After deriving new features, such as the projected landing path of the ball and one-hot encoding the categorical variables, the data was ready for ingestion into various ML frameworks to estimate the probability that a ball in play results in a base hit. From that, we could compute the changes to the probability due to changing infielder alignments.
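
As a minimal sketch of this preparation step (the column names and the no-drag projection formula are illustrative assumptions, not the production feature pipeline), deriving a landing-path feature and one-hot encoding a categorical might look like the following:

import numpy as np
import pandas as pd

df = pd.read_csv('batted_balls.csv')  # placeholder dataset

# Project landing distance from launch angle (degrees) and exit velocity
# (mph) with a simple no-drag approximation, for illustration only
theta = np.deg2rad(df['launch_angle'])
v = df['exit_velocity'] * 0.44704                              # mph -> m/s
df['proj_distance_ft'] = v**2 * np.sin(2 * theta) / 9.81 * 3.28084

# One-hot encode categorical variables such as batter handedness
df = pd.get_dummies(df, columns=['batter_handedness'], prefix='bats')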

Using Amazon SageMaker to calculate Shift Impact

We trained ML models on more than 50,000 at-bat samples. A Bayesian search, run as a hyperparameter optimization (HPO) job with Amazon SageMaker’s Automatic Model Tuning feature over the pre-built XGBoost algorithm, returned the most performant predictions, with overall precision of 88%, recall of 88%, and an F1 score of 88% on a validation set of nearly 10,000 events. Launching an HPO job on Amazon SageMaker is as simple as defining the parameters that describe the job, then submitting it to the backend services that manage the core infrastructure (Amazon EC2, Amazon S3, Amazon ECR) to iterate through the defined hyperparameter space efficiently and find the optimal model.

The code snippets shown use boto3, the AWS SDK for Python. Amazon SageMaker also offers the SageMaker Python SDK, an open source library with several high-level abstractions for working with Amazon SageMaker and popular deep learning frameworks.

Defining the HPO job

We started by setting up the Amazon SageMaker client and defining the tuning job. This specifies which parameters to vary during tuning, along with the evaluation metric we wish to optimize towards. In the following code, we set it to minimize the log loss on the validation set:

import boto3
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri

sm_client = boto3.Session().client('sagemaker')
xgboost_image = get_image_uri(boto3.Session().region_name, 'xgboost')
role = get_execution_role()

# Placeholder S3 locations for the input channels and model artifacts
s3_input_train = 's3://<bucket>/shift-impact/train/'
s3_input_validation = 's3://<bucket>/shift-impact/validation/'
s3_output = 's3://<bucket>/shift-impact/output/'

tuning_job_config = {
    "ParameterRanges": {
      "CategoricalParameterRanges": [],
      "ContinuousParameterRanges": [
        {
          "MaxValue": "1",
          "MinValue": "0",
          "Name": "eta"
        },
        {
          "MaxValue": "2",
          "MinValue": "0",
          "Name": "alpha"
        },
      ],
      "IntegerParameterRanges": [
        {
          "MaxValue": "10",
          "MinValue": "1",
          "Name": "max_depth"
        },
      ]
    },
    "ResourceLimits": {
      "MaxNumberOfTrainingJobs": 100,
      "MaxParallelTrainingJobs": 10
    },
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {
      "MetricName": "validation:logloss",
      "Type": "Minimize"
    }
  }
 
training_job_definition = {
    "AlgorithmSpecification": {
      "TrainingImage": xgboost_image,
      "TrainingInputMode": "File"
    },
    "InputDataConfig": [
      {
        "ChannelName": "train",
        "CompressionType": "None",
        "ContentType": "csv",
        "DataSource": {
          "S3DataSource": {
            "S3DataDistributionType": "FullyReplicated",
            "S3DataType": "S3Prefix",
            "S3Uri": s3_input_train # path to training data
          }
        }
      },
      {
        "ChannelName": "validation",
        "CompressionType": "None",
        "ContentType": "csv",
        "DataSource": {
          "S3DataSource": {
            "S3DataDistributionType": "FullyReplicated",
            "S3DataType": "S3Prefix",
            "S3Uri": s3_input_validation # path to validation data
          }
        }
      }
    ],
    "OutputDataConfig": {
      "S3OutputPath": s3_output # outpath path for model artifacts
    },
    "ResourceConfig": {
      "InstanceCount": 2,
      "InstanceType": "ml.c4.2xlarge",
      "VolumeSizeInGB": 10
    },
    "RoleArn": role,
    "StaticHyperParameters": {
      "eval_metric": "logloss",
      "objective": "binary:logistic",
      "rate_drop": "0.3",
      "tweedie_variance_power": "1.4",
    },
    "StoppingCondition": {
      "MaxRuntimeInSeconds": 43200
    }
}

Launching the HPO job

With the tuning job defined in the preceding Python dictionaries, we can submit it to the Amazon SageMaker client, which automates the process of launching EC2 instances with XGBoost-optimized containers pulled from Amazon ECR. See the following code:

tuning_job_name = 'shift-impact-hpo'  # placeholder; tuning job names may not contain underscores

sm_client.create_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuning_job_name,
    HyperParameterTuningJobConfig=tuning_job_config,
    TrainingJobDefinition=training_job_definition)

During the game, we can analyze a given batter using his most recent at-bats and run those events through the model for all infielder positions laid out on a grid. Because the amount of compute required for inference grows geometrically as the size of each grid cell shrinks, we adjusted the cell size to balance the resolution required for meaningful predictions against compute time. For example, consider a shortstop who shifts to his left. If he moves over by only one foot, there is a negligible effect on the outcome of a batted ball. However, if he repositions himself 10 feet to his left, that may well put him in a better position to field a ground ball pulled toward right field. Examining all at-bats in our dataset, we found that balance with a grid of 10-foot by 10-foot cells, accounting for more than 10,000 infielder configurations.

The next section covers obtaining the best-performing model from the HPO job and deploying it to production. Due to the large number of calls required for real-time inference, the model’s results are prepopulated into a lookup table that provides the relevant predictions during a live game.
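
As a minimal sketch of that prepopulation step (the helper names and feature layout here are hypothetical, not the production pipeline), caching a prediction per alignment might look like this:

import numpy as np

def build_lookup_table(batter_events, grid_positions, predict_fn):
    """Cache the mean base-hit probability for each infielder alignment.

    batter_events: per-event feature lists for one batter's recent at-bats
    grid_positions: iterable of infielder alignments on the 10-foot grid
    predict_fn: maps a feature matrix to base-hit probabilities
    """
    table = {}
    for pos in grid_positions:
        # Append the candidate alignment to every historical event
        feats = np.array([list(event) + list(pos) for event in batter_events])
        table[tuple(pos)] = float(predict_fn(feats).mean())
    return table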

Deploying the most performant model

Each tuning job launches a number of training jobs, from which the best model is selected according to the criteria defined earlier when configuring the HPO. From Amazon SageMaker, we first pull the best training job and its model artifacts. These are stored in the S3 bucket from which the training and validation datasets were pulled. See the following code:

# get best model from HPO job
best_training_job = sm_client.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuning_job_name)['BestTrainingJob']
info = sm_client.describe_training_job(TrainingJobName=best_training_job['TrainingJobName'])
model_name = best_training_job['TrainingJobName'] + '-model'
model_data = info['ModelArtifacts']['S3ModelArtifacts']

Next, we refer to the pre-configured container optimized to run XGBoost models and link it to the model artifacts of the best-trained model. Once this model-container pair is created on our account, we can configure an endpoint with the instance type, number of instances, and traffic splits (for A/B testing) of our choice:

create_model_response = sm_client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = {
        'Image': xgboost_image,
        'ModelDataUrl': model_data})

# create endpoint configuration
endpoint_config_name = model_name+'-endpointconfig'
create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants=[{
        'InstanceType':'ml.m5.2xlarge',
        'InitialVariantWeight':1,
        'InitialInstanceCount':1,
        'ModelName':model_name,
        'VariantName':'AllTraffic'}])

# create endpoint
endpoint_name = model_name + '-endpoint'
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name)

# check the endpoint's provisioning status
resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp['EndpointStatus']

print("Arn: " + resp['EndpointArn'])
print("Status: " + status)

Inference from the endpoint

The Amazon SageMaker runtime client sends requests to the endpoint hosting the model container on an EC2 instance and returns the model’s output. For custom models, the endpoint’s entry points can also be configured with custom data processing steps:

# invoke endpoint
import numpy as np

runtime_client = boto3.client('runtime.sagemaker')
num_features = 25  # placeholder: must match the feature count the model was trained on
# build a comma-separated payload, stripping the brackets that array2string adds
random_payload = np.array2string(np.random.random(num_features),
                                 separator=',', max_line_width=np.inf)[1:-1]
response = runtime_client.invoke_endpoint(EndpointName=endpoint_name,
                                          ContentType='text/csv',
                                          Body=random_payload)
prediction = response['Body'].read().decode("utf-8")
print(prediction)

With predictions in hand for a given batter across all infielder configurations, we average the base-hit probabilities stored in the lookup table and subtract the expected batting average for the same sample of batted balls. The resulting metric is the Shift Impact.

Matchup Analysis

In interleague games, where teams from the American and National Leagues compete against each other, many batters face pitchers they have never seen before. Estimating outcomes in interleague games is difficult because there is limited relevant historical data. AWS worked with MLB to group similar pitchers together to gain insight into how a batter has historically performed against comparable pitchers. We took a machine learning approach, which allowed us to combine the domain knowledge of experts with data comprising hundreds of thousands of pitches to find additional patterns we could use to identify similar pitchers.

Modeling

Taking inspiration from the field of recommendation systems, in which the matching problem is typically solved by computing a user’s inclination toward a product, we seek to determine the interaction between a pitcher and a batter. Many algorithms are appropriate for building recommenders, but few also allow us to cluster the items fed into the algorithm. Neural networks shine in this area. The final layers of a neural network architecture can be interpreted as numerical representations of the input data, whether it is an image or a pitcher ID. Given an input item, its associated numerical representation, or embedding, can be compared against the embeddings of other input items. Embeddings that lie near each other are similar, not just in the embedding space, but also in interpretable characteristics. For example, we expect handedness to play a role in defining which pitchers are similar. This approach to recommendation systems and clustering is known as deep matrix factorization.

Deep matrix factorization accounts for nonlinear interactions between a pair of entities, while also mixing in the techniques of content-based and collaborative filtering. Rather than working solely with a pitcher-batter matrix, as in matrix factorization, we build a neural network that aligns each pitcher and batter with their own embedding and then pass them through a series of hidden layers that are trained towards predicting the outcome of a pitch. In addition to the collaborative nature of this architecture, additional contextual data is included for each pitch such as the count, number of runners on base, and the score.

The model is optimized against the predicted outcome of each pitch, including both the pitch characteristics (slider, changeup, fastball, etc.) and the outcome (ball, single, strike, swinging strike, etc.). After training a model on this classification problem, the end layer of the pitcher ID input is extracted as the embedding for that particular pitcher.
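
As a simplified sketch of such an architecture (the vocabulary sizes, layer widths, and layer names are our assumptions, not the production model), a Keras version might look like this:

from tensorflow.keras import layers, Model

n_pitchers, n_batters = 900, 1200   # assumed vocabulary sizes
n_context, n_outcomes = 16, 12      # assumed context features and pitch outcomes

pitcher_in = layers.Input(shape=(1,), name='pitcher_id')
batter_in = layers.Input(shape=(1,), name='batter_id')
context_in = layers.Input(shape=(n_context,), name='context')

# Each ID is mapped to a learned embedding vector
pitcher_emb = layers.Flatten()(
    layers.Embedding(n_pitchers, 32, name='pitcher_embedding')(pitcher_in))
batter_emb = layers.Flatten()(
    layers.Embedding(n_batters, 32, name='batter_embedding')(batter_in))

# Hidden layers capture nonlinear pitcher-batter-context interactions
x = layers.Concatenate()([pitcher_emb, batter_emb, context_in])
x = layers.Dense(128, activation='relu')(x)
x = layers.Dense(64, activation='relu')(x)
out = layers.Dense(n_outcomes, activation='softmax', name='pitch_outcome')(x)

model = Model([pitcher_in, batter_in, context_in], out)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# After training, the learned table rows serve as per-pitcher embeddings
pitcher_vectors = model.get_layer('pitcher_embedding').get_weights()[0]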

Results

As a batter steps up to the plate against a pitcher he hasn’t faced before, we search for the nearest embeddings to that of the opposing pitcher and calculate the on-base plus slugging percentage (OPS) against that group of pitchers. To see the results in action, see 9/11/19: FSN-Ohio executes OPS comparison.
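
A minimal sketch of that nearest-neighbor search over the learned embeddings follows; the function and variable names are purely illustrative:

import numpy as np

def nearest_pitchers(embeddings, pitcher_ids, query_id, k=10):
    """Return the k pitcher IDs most similar to query_id by cosine similarity."""
    idx = pitcher_ids.index(query_id)
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed[idx]           # cosine similarity to the query pitcher
    ranked = np.argsort(-sims)            # most similar first
    return [pitcher_ids[i] for i in ranked if i != idx][:k]

The batter’s OPS against that returned group is then computed from his historical at-bats.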

Summary

MLB uses cloud computing to create innovative experiences that introduce additional ways for fans to experience baseball. With Stolen Base Success Probability, Shift Impact, and Pitcher Similarity Match-up Analysis, MLB provides compelling, real-time insight into what’s happening on the field and a greater connection to the context that builds the unique drama of the game that fans love.

This postseason, fans will have many opportunities to see stolen base probability in action, the potential effects of infield alignments, and launch into debates with friends about what makes pitchers similar.

Fans can expect to see these new stats in live game broadcasts with partners such as ESPN and MLB Network. Plus, other professional sports leagues including the NFL and Formula 1 have selected AWS as their cloud and machine learning provider of choice.

You can find full, end-to-end examples of implementing an HPO job on Amazon SageMaker at the AWSLabs GitHub repo. If you’d like help accelerating your use of machine learning in your products and processes, please contact the Amazon ML Solutions Lab program.


About the Authors

Hussain Karimi is a data scientist at the Amazon ML Solutions Lab, where he works with AWS customers to develop machine learning models that uncover unique insights in various domains.

Travis Petersen is a Senior Data Scientist at MLB Advanced Media and an adjunct professor at Fordham University.

Priya Ponnapalli is a principal scientist and manager at Amazon ML Solutions Lab, where she helps AWS customers across different industries accelerate their AI and cloud adoption.