Author: torontoai

[R] Hey Reddit Machine Learners, do you model for a living? Be part of an ML/DL user research study and get a cool AI t-shirt every month.

We are looking for full-time data scientists for an ML/DL user study. You’ll participate in a calibrated user research experiment for 45 minutes, conducted over a video call. We’ve got plenty of funny tees that you can show off to your teammates, and we’ll ship you a different one every month for a year.

Click here to learn more.

P.S.: We love the Reddit vibe and this community. Give us your best ML/DL/Data Science quote. We’ll make one and ship it to you if it’s quirky and fun, even if you aren’t part of the study group.

submitted by /u/Mikmik303

[R] AC-Teach: A Bayesian Actor-Critic Method for Policy Learning with an Ensemble of Suboptimal Teachers

When looking at RL training, it’s often frustrating to see the agent taking so long to discover simple things you could code up yourself for parts of the task. This work takes that idea as its basis: if you code up some solutions to parts of the problem, how do you incorporate them into RL training? Turns out it’s a little tricky…

Arxiv: https://arxiv.org/abs/1909.04121 (CoRL 2019)

Blog post: http://ai.stanford.edu/blog/acteach/

(hope posting own papers is kosher, open to answering any questions!)

submitted by /u/regalalgorithm

[P] RNNs and Reinforcement learning

Keep in mind that I am still in the early phase of learning ML. I cannot disclose the exact task/problem I am working on (not related to NLP), but the below task captures the essence of it.

(REINFORCEMENT LEARNING)

Input – A paragraph written in English.

Output – On a scale of 1-10 (continuous, not discrete), predict the level of English of each sentence. For example, “My English is poor” should score better than “I have bad English”, even though both are grammatically correct.

Example input: Hey! how are you? It has been so long since I last saw you.

Example output: [5.544554, 5.890909] (made up numbers)

My approach:

  1. Encode each word (fixed length, if that matters).
  2. Break the paragraph into sentences, because the prediction for a sentence does not depend on the other sentences.
  3. Pad every sentence so that they all have the same length.
  4. For every sentence:
     i. Pass each word of the sentence to an RNN encoder and extract the hidden state corresponding to the last word before padding. For example, for the sentence “i am fine padded_word padded_word” with RNN output [A,B,C,D,E], I extract the RNN output/hidden state C. (Not sure if this is the right thing to do.)
     ii. Pass this hidden state C to an RNN decoder, which makes the prediction. This prediction leads to a reward.
  5. Use PPO (proximal policy optimization).
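As an illustration only, the extraction in sub-step i (take the output at the last real word, not at the last padded position) could be sketched in plain Python as follows. Letters stand in for hidden-state vectors, and the function name last_valid_states is made up for this sketch; a real implementation would do the same indexing on framework tensors.

```python
def last_valid_states(rnn_outputs, lengths):
    """Return, per sentence, the RNN output at the last real (unpadded) word.

    rnn_outputs: one list of per-step outputs for each padded sentence
    lengths:     number of real words in each sentence, before padding
    """
    return [steps[length - 1] for steps, length in zip(rnn_outputs, lengths)]


# "i am fine padded_word padded_word" has 3 real words, so output C is kept.
outputs = [["A", "B", "C", "D", "E"],
           ["F", "G", "H", "I", "J"]]
selected = last_valid_states(outputs, lengths=[3, 5])
```

The sentence lengths have to be recorded before padding so that they can be passed alongside the padded batch.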

I hope this is clear and am sorry for being so vague about my problem. If it matters, I have a few fully connected layers between encoder and decoder.

So, is this the best approach for this problem?

Does PPO work well with RNNs?

Also, what might be the reason that the network is not learning even when I am using a normalized environment?

Any help would be highly appreciated.

submitted by /u/xicor7017

NVIDIA Software Head Helps Transform Alma Mater into Leading AI Center with $34M Gift 

Three decades and hundreds of millions of lines of computer code after graduating from the Milwaukee School of Engineering, NVIDIA’s Dwight Diercks returned today to celebrate a donation that will put his alma mater at the forefront of AI undergraduate education.

Diercks Hall at MSOE in Milwaukee.

Diercks, who grew up the son of a mailman, working on his family’s pig farm in Red Wing, Minnesota, came to NVIDIA as its 22nd employee. Today, he oversees a team of some 5,000 software engineers around the world who ship tens of millions of lines of code each month that help accelerate the world’s computing.

Diercks’ $34 million gift, the largest from an alum in MSOE’s 116-year history, is the keystone in the school’s efforts to infuse its engineering program with artificial intelligence. Two years ago, MSOE became one of the very few programs, together with Carnegie Mellon, to offer a computer science degree focused on AI.

As a result, at a time when many smaller schools wrestle with getting students in the door and financial pressures, MSOE is on a roll. Enrollment in computer science-related programs at the 2,800-student school — based in the heart of downtown Milwaukee, just a few blocks from the green parkland alongside Lake Michigan — is up 67 percent since the program was introduced. Other key admissions indicators are also up by strong double digits.

Speaking ahead of a ceremony to mark the donation, MSOE President John Walz said, “AI has very quickly become huge for us.” He noted that the new computer science program is already on pace to be the school’s second largest program and that the number of companies now recruiting there is approaching the number in its graduating class.

The Milwaukee School of Engineering’s new supercomputer is dubbed “Rosie.”

Central to MSOE’s focus on AI is the spanking new NVIDIA-powered AI supercomputer housed in a glass-walled area within the newly constructed four-story Diercks Hall. The system includes three NVIDIA DGX-1 pods, each with eight NVIDIA V100 Tensor Core GPUs, and 20 servers each with four NVIDIA T4 GPUs. The nodes are joined together by Mellanox networking fabric and share 200TB of network-attached storage.

Rare among supercomputers in higher education, the system, which provides 8.2 petaflops of deep learning performance, will be used for teaching undergrad classes.

Diercks, who made the donation with his wife, Dian, launched the AI initiative because of the school’s highly practical, hands-on approach to teaching future engineers, which has them spend more time in labs than in classrooms. His own immersion in NVIDIA’s evolution in recent years into an AI powerhouse from its roots in computer gaming helped him encourage MSOE to reshape its approach around preparing students for the brave new age of artificial intelligence.

Dwight and Dian Diercks.

“We knew MSOE needed a supercomputer and one that can expand to scale out for students and scale up for local industries and professors,” Diercks said. In an emotional speech, he thanked a high school teacher, an MSOE professor, and NVIDIA founder and CEO Jensen Huang for reinforcing what his parents taught him about the importance of hard work and continuous learning.

“You don’t ever take a day off learning,” he quoted his former math teacher, Ron Gray, as telling him when he tried to skip out on a test. The long-retired teacher shyly stood up in the back of the hall.

While MSOE students come to the school from across the Midwest, with a smattering from California and Texas, many choose to stay in the Milwaukee area. The largely deindustrialized city of German church spires — which a century ago represented American innovation, giving birth to the typewriter, steam shovel and motorcycle — is home to thriving companies like Northwestern Mutual, Harley-Davidson and Rockwell Automation that hire many grads.

While not widely recognized as tech companies, these regional giants collect oceans of data that need to be crunched using the latest tools of deep learning and data science.

Huang, who delivered a keynote after the ceremony, called AI the fourth industrial revolution that will sweep across the work of virtually every industry. MSOE’s new AI push and supercomputer will help it enable generations of computer scientists trained for tomorrow’s challenges.

“MSOE now has the single most important instrument of knowledge today,” Huang said, delivering the first address in the NVIDIA auditorium. “Without access to the correct instrument, you can’t access knowledge.”

Outside the auditorium, Kyle Rodrigues, a sophomore from suburban Chicago enrolled in the new computer science program, said it was AI that drew him to MSOE. He exclaimed how thrilled he was to get his hands on the supercomputer, which MSOE is christening “Rosie,” the term used for a half dozen pioneering women who worked in the 1940s programming the early ENIAC computer, and also the name of Diercks’ mother.

The post NVIDIA Software Head Helps Transform Alma Mater into Leading AI Center with $34M Gift  appeared first on The Official NVIDIA Blog.

[D] looking for ML theory researchers

hello, I’m looking for researchers who work on the theoretical side of ML (learning theory, privacy, architecture search and model compression, etc) to speak in remote spotlight sessions. For context, we have been creating a large repository of ML papers turned into in-depth discussion videos (YouTube channel, website), and recently started inviting paper authors to speak about their work remotely. This would be a great contribution to the ML community, but also a good way to get exposure for your work.

If interested, please email us (events@ai.science), DM here, comment on this, idk make some sort of noise so that I know you’re out there, and let’s talk.

commitment:

  • prepare a 20-30 mins talk about one of your papers
  • spend 20-30 mins in Q&A with the session moderator and the audience (through live chat read to you by the moderator)
  • hop on a video call with us, share your screen, bam!

submitted by /u/tdls_to

[D] Does anyone know of an example of a model for translating acronyms?

I have a huge corpus of documents that are filled with acronyms. It is mostly government stuff. Currently we use regex to translate them, but the regex performs poorly and requires a lot of manual fixing. I haven’t been able to google this question (it just brings up lists of machine learning/deep learning acronyms).

submitted by /u/Secret_Identity_

Launching TensorFlow distributed training easily with Horovod or Parameter Servers in Amazon SageMaker

Amazon SageMaker supports all the popular deep learning frameworks, including TensorFlow. Over 85% of TensorFlow projects in the cloud run on AWS. Many of these projects already run in Amazon SageMaker. This is due to the many conveniences Amazon SageMaker provides for TensorFlow model hosting and training, including fully managed distributed training with Horovod and parameter servers.

Customers are increasingly interested in training models on large datasets, which can take a week or more. In these cases, you might be able to speed the process by distributing training on multiple machines or processes in a cluster. This post discusses how Amazon SageMaker helps you set up and launch distributed training with TensorFlow quickly, without the expense and difficulty of directly managing your training clusters.

Starting with TensorFlow version 1.11, you can use Amazon SageMaker prebuilt TensorFlow containers: Simply provide a Python training script, specify hyperparameters, and indicate your training hardware configuration. Amazon SageMaker does the rest, including spinning up a training cluster and tearing down the cluster when training ends. This feature is called “script mode.” Script mode currently supports two distributed training approaches out-of-the-box:

  • Option #1: TensorFlow’s native parameter server (TensorFlow versions 1.11 and above)
  • Option #2: Horovod (TensorFlow versions 1.12 and above)

In the following sections, we provide an overview of the steps required to enable these TensorFlow distributed training options in Amazon SageMaker script mode.

Option #1: Parameter servers

One common pattern in distributed training is to use one or more dedicated processes to collect gradients computed by “worker” processes, then aggregate them and distribute the updated model parameters back to the workers in an asynchronous manner. These processes are known as parameter servers.

In a TensorFlow parameter server cluster in Amazon SageMaker script mode, each instance in the cluster runs one parameter server process and one worker process. Each parameter server communicates with all workers (“all-to-all”), as shown in the following diagram (from Meet Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow):

In Amazon SageMaker script mode, the implementation of parameter servers is asynchronous: each worker computes gradients and submits gradient updates to the parameter servers independently, without waiting for the other workers’ updates.

In practice, asynchronous updates usually don’t have an overly adverse impact. Workers that fall behind might submit stale gradients, which can negatively affect training convergence. Generally, this can be managed by reducing the learning rate. On the plus side, because there is no waiting for other workers, asynchronous updates can result in faster training.
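The effect of stale gradients, and of lowering the learning rate to compensate, can be seen in a toy simulation. The following is a minimal sketch, not Amazon SageMaker code: it assumes a single weight, the loss f(w) = 0.5 * w**2, and a made-up function train_async in which every gradient is computed on weights that are a fixed number of updates old.

```python
from collections import deque


def train_async(learning_rate, staleness, steps=200, w0=1.0):
    """Minimize f(w) = 0.5 * w**2 when every gradient is `staleness` updates old.

    Returns the largest |w| among the most recent weights (small means converged).
    """
    w = w0
    # Sliding window of recent weights; recent[0] is what a lagging worker last read.
    recent = deque([w0] * (staleness + 1), maxlen=staleness + 1)
    for _ in range(steps):
        stale_gradient = recent[0]              # the gradient of 0.5 * w**2 is w itself
        w = w - learning_rate * stale_gradient  # the server applies the update without waiting
        recent.append(w)
    return max(abs(value) for value in recent)


fresh = train_async(learning_rate=0.9, staleness=0)  # no lag: converges
stale = train_async(learning_rate=0.9, staleness=5)  # same rate, stale gradients: diverges
tamed = train_async(learning_rate=0.1, staleness=5)  # lower rate: converges again
```

Real training is noisier and staleness varies from step to step, so treat this only as intuition for why a smaller learning rate helps.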

If you use Amazon SageMaker script mode, you don’t have to set up and manage the parameter server cluster yourself. The Amazon SageMaker prebuilt TensorFlow container comes with a built-in script mode option for use with parameter servers. Using this option saves time and spares you the complexities of cluster management.

The following code example shows how to set up a parameter server cluster with script mode. Specify “parameter_server” as the value in the distributions parameter of an Amazon SageMaker TensorFlow Estimator object. Amazon SageMaker script mode then launches a parameter server thread on each instance in the training cluster and executes your training code in a separate worker thread on each instance. To run a distributed training job with multiple instances, set train_instance_count to a number larger than 1.

from sagemaker.tensorflow import TensorFlow

# Assumes role (IAM role ARN), model_dir (model output location), and inputs
# (training data channels) are defined earlier in the notebook.

# Instance type and count; a count larger than 1 makes the job distributed
ps_instance_type = 'ml.p3.2xlarge'
ps_instance_count = 2

# Enable the built-in parameter server option
distributions = {'parameter_server': {'enabled': True}}

# Passed to the training script as command-line arguments
hyperparameters = {'epochs': 60, 'batch-size': 256}

estimator_ps = TensorFlow(base_job_name='ps-cifar10-tf',
                          source_dir='code',
                          entry_point='train_ps.py',
                          role=role,
                          framework_version='1.13',
                          py_version='py3',
                          hyperparameters=hyperparameters,
                          train_instance_count=ps_instance_count,
                          train_instance_type=ps_instance_type,
                          model_dir=model_dir,
                          distributions=distributions)

# start training; inputs can be in Amazon S3, Amazon EFS, or Amazon FSx for Lustre
estimator_ps.fit(inputs)

For an example of how to use parameter server-based distributed training with script mode, see our TensorFlow Distributed Training Options example on GitHub.
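On the training-script side, script mode hands the Estimator’s hyperparameters to the entry point as command-line arguments and exposes each input channel through an SM_CHANNEL_* environment variable. The following is a minimal sketch of the argument parsing such a script might start with. It is not the code from the example repository; the channel name train, the default values, and the sample model directory are assumptions made for illustration.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the arguments a script mode entry point receives."""
    parser = argparse.ArgumentParser()
    # Hyperparameters from the Estimator arrive as command-line flags.
    parser.add_argument('--epochs', type=int, default=1)
    parser.add_argument('--batch-size', type=int, default=128)
    # Script mode also passes the model location as --model_dir.
    parser.add_argument('--model_dir', type=str, default=None)
    # Assumes an input channel named "train"; its local path comes from the environment.
    parser.add_argument('--train', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAIN'))
    # Ignore any extra flags the container adds.
    args, _ = parser.parse_known_args(argv)
    return args


# Simulate the flags produced by hyperparameters = {'epochs': 60, 'batch-size': 256}
args = parse_args(['--epochs', '60', '--batch-size', '256',
                   '--model_dir', '/opt/ml/model'])
```

Using parse_known_args instead of parse_args keeps the script from failing when it receives flags it does not define.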

Option #2: Horovod

Horovod is an open source framework for distributed deep learning. It is available for use with TensorFlow and several other deep learning frameworks. As with parameter servers, Amazon SageMaker automates Horovod cluster setup and runs the appropriate commands to make sure that training goes smoothly without the need for you to manage clusters directly yourself.

Horovod’s cluster architecture differs from the parameter server architecture. Recall that the parameter server architecture uses the all-to-all communication model, where the amount of data sent is proportional to the number of processes. By contrast, Horovod uses Ring-AllReduce, where the amount of data sent is more nearly proportional to the number of cluster nodes, which can be more efficient when training with a cluster where each node has multiple GPUs (and thus multiple worker processes).

Additionally, whereas the parameter server update process described above is asynchronous, in Horovod updates are synchronous. After all processes have completed their calculations for the current batch, gradients calculated by each process circulate around the ring until every process has a complete set of gradients for the batch from all processes.

At that time, each process updates its local model weights, so every process has the same model weights before starting work on the next batch. The following diagram shows how Ring-AllReduce works (from Meet Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow):

Horovod employs Message Passing Interface (MPI), a popular standard for managing communication between nodes in a high-performance cluster, and uses NVIDIA’s NCCL library for GPU-level communication.
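To make the Ring-AllReduce exchange described above concrete, here is a small single-process simulation. It is a sketch for intuition, not Horovod’s implementation (Horovod performs these exchanges over MPI and NCCL): the function name ring_allreduce is made up, workers are plain Python lists, and the result is the sum of the gradients, which Horovod would additionally divide by the number of workers to get the average.

```python
def ring_allreduce(worker_grads):
    """Sum equal-length gradient lists across workers using a simulated ring."""
    n = len(worker_grads)
    size = len(worker_grads[0])
    # Chunk c of a gradient covers positions bounds[c]:bounds[c + 1].
    bounds = [c * size // n for c in range(n + 1)]
    data = [list(grad) for grad in worker_grads]

    def chunk(worker, c):
        return data[worker][bounds[c]:bounds[c + 1]]  # a slice is a copy

    # Phase 1 (scatter-reduce): after n - 1 steps, worker w holds the complete
    # sum for chunk (w + 1) % n.
    for step in range(n - 1):
        sends = [((w - step) % n, chunk(w, (w - step) % n)) for w in range(n)]
        for w, (c, payload) in enumerate(sends):
            neighbor = (w + 1) % n
            for k, value in enumerate(payload):
                data[neighbor][bounds[c] + k] += value

    # Phase 2 (allgather): the completed chunks travel once more around the ring.
    for step in range(n - 1):
        sends = [((w + 1 - step) % n, chunk(w, (w + 1 - step) % n)) for w in range(n)]
        for w, (c, payload) in enumerate(sends):
            neighbor = (w + 1) % n
            data[neighbor][bounds[c]:bounds[c + 1]] = payload
    return data


grads = [[1.0, 2.0, 3.0, 4.0],
         [10.0, 20.0, 30.0, 40.0],
         [100.0, 200.0, 300.0, 400.0]]
reduced = ring_allreduce(grads)
```

In each step every worker sends only one chunk to its right-hand neighbor, which is why the traffic per worker stays roughly constant as the ring grows.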

The Horovod framework eliminates many of the difficulties of Ring-AllReduce cluster setup and works with several popular deep learning frameworks and APIs. For example, if you are using the popular Keras API, you can use either the reference Keras implementation or tf.keras directly with Horovod without converting to an intermediate API such as tf.Estimator.

In Amazon SageMaker script mode, Horovod is available for TensorFlow version 1.12 or newer. When you use Horovod in script mode, the Amazon SageMaker TensorFlow container sets up the MPI environment and executes the mpirun command to start jobs on the cluster nodes. To enable Horovod in script mode, you must change the Amazon SageMaker TensorFlow Estimator and your training script. To configure training with Horovod, specify the following fields in the distributions parameter of the Estimator:

  • enabled (bool): If set to True, MPI is set up and the mpirun command executes.
  • processes_per_host (int): Number of processes MPI should launch on each host. Set this flag for multi-GPU training.
  • custom_mpi_options (str): Any mpirun flags passed in this field are added to the mpirun command and executed by Amazon SageMaker for Horovod training.

The number of processes MPI launches on each host should not be greater than the available slots on the selected instance type.

For example, here’s how to create an Estimator object to launch Horovod distributed training on two hosts with one GPU/process each:

from sagemaker.tensorflow import TensorFlow

# Assumes role (IAM role ARN) and inputs (training data channels) are defined
# earlier in the notebook.

# Two single-GPU hosts, one MPI process per host
hvd_instance_type = 'ml.p3.2xlarge'
hvd_processes_per_host = 1
hvd_instance_count = 2

# Enable MPI so that Amazon SageMaker runs mpirun for Horovod
distributions = {'mpi': {
                    'enabled': True,
                    'processes_per_host': hvd_processes_per_host,
                    'custom_mpi_options': '-verbose --NCCL_DEBUG=INFO -x OMPI_MCA_btl_vader_single_copy_mechanism=none'
                        }
                }

# Passed to the training script as command-line arguments
hyperparameters = {'epochs': 60, 'batch-size': 256}

estimator_hvd = TensorFlow(base_job_name='hvd-cifar10-tf',
                           source_dir='code',
                           entry_point='train_hvd.py',
                           role=role,
                           framework_version='1.13',
                           py_version='py3',
                           hyperparameters=hyperparameters,
                           train_instance_count=hvd_instance_count,
                           train_instance_type=hvd_instance_type,
                           distributions=distributions)

# start training; inputs can be in Amazon S3, Amazon EFS, or Amazon FSx for Lustre
estimator_hvd.fit(inputs)
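The limit on processes per host mentioned earlier can be checked before the job is submitted. The following helper is a hypothetical sketch, not part of the SageMaker SDK. It assumes that the available slots equal the number of GPUs on the instance, and it hard-codes GPU counts for three P3 instance types only.

```python
# GPU counts for a few P3 instance types; extend this table for other types.
GPUS_PER_INSTANCE = {
    'ml.p3.2xlarge': 1,
    'ml.p3.8xlarge': 4,
    'ml.p3.16xlarge': 8,
}


def mpi_distribution(instance_type, processes_per_host=None, custom_mpi_options=''):
    """Build the 'mpi' distributions dictionary, one process per GPU by default."""
    gpus = GPUS_PER_INSTANCE[instance_type]
    if processes_per_host is None:
        processes_per_host = gpus
    if processes_per_host > gpus:
        raise ValueError(
            '%d processes requested, but %s has only %d GPU slot(s)'
            % (processes_per_host, instance_type, gpus))
    mpi = {'enabled': True, 'processes_per_host': processes_per_host}
    if custom_mpi_options:
        mpi['custom_mpi_options'] = custom_mpi_options
    return {'mpi': mpi}


distributions_multi_gpu = mpi_distribution('ml.p3.8xlarge')
```

The returned dictionary has the same shape as the distributions value passed to the Estimator above.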

Besides modifying the Estimator object, you also must make the following additions to the training script. You can make these changes conditional based on whether MPI is enabled.

  1. Run hvd.init().
  2. Pin a server GPU to be used by this process using config.gpu_options.visible_device_list. With the typical setup of one GPU per process, you can set this to local rank. In that case, the first process on the server allocates the first GPU, second process allocates the second GPU, and so forth.
  3. Scale the learning rate by the number of workers. Effective batch size in synchronous distributed training scales with the number of workers, and an increase in learning rate compensates for the increased batch size.
  4. Wrap the optimizer in hvd.DistributedOptimizer. The distributed optimizer delegates gradient computation to the original optimizer, averages gradients using allreduce, and then applies those averaged gradients.
  5. Add the code hvd.BroadcastGlobalVariablesHook(0) to broadcast initial variable states from rank 0 to all other processes. This initial broadcast makes sure that all workers receive consistent initialization (with random weights or restored from a checkpoint) when training starts. Alternatively, if you’re not using MonitoredTrainingSession, you can execute the hvd.broadcast_global_variables op after global variables initialize.
  6. Modify your code to save checkpoints only on worker 0 to prevent other workers from corrupting them. To do this, pass checkpoint_dir=None to tf.train.MonitoredTrainingSession if hvd.rank() != 0.
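The six steps above can be sketched together as follows. This is an outline under stated assumptions, not the train_hvd.py script from the example: it assumes the TensorFlow 1.x API with Horovod installed, build_loss is a hypothetical callable that builds the model graph and returns a loss tensor, and the momentum optimizer is an arbitrary choice. The two small helpers are pure Python, and the heavy imports are deferred so that the helpers can be used on their own.

```python
def scaled_learning_rate(base_lr, num_workers):
    """Step 3: scale the learning rate with the number of workers."""
    return base_lr * num_workers


def checkpoint_dir_for_rank(rank, checkpoint_dir):
    """Step 6: only worker 0 writes checkpoints."""
    return checkpoint_dir if rank == 0 else None


def train_with_horovod(build_loss, base_lr, checkpoint_dir, total_steps):
    # Imported here so the helpers above work without TensorFlow or Horovod.
    import tensorflow as tf  # TensorFlow 1.x API assumed
    import horovod.tensorflow as hvd

    hvd.init()  # step 1

    # Step 2: pin one GPU to this process, chosen by local rank.
    config = tf.ConfigProto()
    config.gpu_options.visible_device_list = str(hvd.local_rank())

    loss = build_loss()  # hypothetical: returns the loss tensor for the model
    optimizer = tf.train.MomentumOptimizer(
        scaled_learning_rate(base_lr, hvd.size()), momentum=0.9)  # step 3
    optimizer = hvd.DistributedOptimizer(optimizer)  # step 4
    train_op = optimizer.minimize(
        loss, global_step=tf.train.get_or_create_global_step())

    hooks = [
        hvd.BroadcastGlobalVariablesHook(0),  # step 5
        tf.train.StopAtStepHook(last_step=total_steps // hvd.size()),
    ]
    with tf.train.MonitoredTrainingSession(
            checkpoint_dir=checkpoint_dir_for_rank(hvd.rank(), checkpoint_dir),  # step 6
            config=config,
            hooks=hooks) as session:
        while not session.should_stop():
            session.run(train_op)
```

Dividing the step count by the number of workers keeps the total number of processed examples roughly constant as the cluster grows.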

Find more details about Horovod at the Horovod GitHub Repository. For an example of Horovod usage with script mode, see our TensorFlow Distributed Training Options example on GitHub.

Choosing a distributed training option

Before moving to distributed training in a cluster, make sure that you have first tried scaling up on a single machine with multiple GPUs. Communication between multiple GPUs on a single machine is faster than communicating across a network between multiple machines. For more details, see the AWS whitepaper Power Machine Learning at Scale.

If you must scale out to a cluster instead of scaling up with more GPUs within a single machine, the next consideration is whether to choose the parameter server option or Horovod. This choice partly depends on the version of TensorFlow that you are using.

  • For TensorFlow versions 1.11 and newer in Amazon SageMaker script mode, you can use parameter servers.
  • To use Horovod, you must use TensorFlow versions 1.12 or newer.

The following chart summarizes some general guidelines regarding performance for each option. These rules aren’t absolute, and ultimately, the best choice depends on the specific use case. Typically, the performance significantly depends on how long it takes to share gradient updates during training. In turn, this is affected by the model size, gradients size, GPU specifications, and network speed.

Time to share gradients | Better CPU performance | Better GPU performance
Relatively long (larger number of gradients / bigger model size) | Parameter server | Parameter server, OR Horovod on a single instance with multi-GPUs
Relatively short (smaller number of gradients / lesser model size) | Parameter server | Horovod

Complexity is another consideration. Parameter servers are straightforward to use for one GPU per instance. However, to use multi-GPU instances, you must set up multiple towers, with each tower assigned to a different GPU. A “tower” is a function for computing inference and gradients for a single model replica, which in turn is a copy of a model training on a subset of the complete dataset. Towers involve a form of data parallelism. Horovod also employs data parallelism but abstracts away the implementation details.

Finally, cluster size makes a difference. Given larger clusters with many GPUs, parameter server all-to-all communication can overwhelm network bandwidth. Reduced scaling efficiency can result, among other adverse effects. In such situations, you might find Horovod a better option.
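A back-of-the-envelope comparison shows why cluster size matters. The two functions below are a simplified cost model, not a measurement. The Ring-AllReduce figure is the standard per-worker volume of 2 * (N - 1) / N times the gradient size. The parameter server figure assumes a single central server, whereas Amazon SageMaker runs one parameter server per instance, so real traffic is spread across more links.

```python
def ring_allreduce_bytes_per_worker(model_bytes, num_workers):
    """Bytes each worker sends per step in Ring-AllReduce."""
    return 2 * (num_workers - 1) * model_bytes / num_workers


def central_server_bytes(model_bytes, num_workers):
    """Bytes crossing one central server's link per step.

    Every worker pushes its gradients and pulls the updated parameters.
    """
    return 2 * num_workers * model_bytes


model_mb = 100  # hypothetical gradient size in megabytes
comparison = [(n,
               ring_allreduce_bytes_per_worker(model_mb, n),
               central_server_bytes(model_mb, n))
              for n in (2, 8, 32)]
```

In this model the per-worker ring traffic never exceeds twice the gradient size, while the load on a central server grows linearly with the number of workers.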

Additional considerations

The example code for this post uses one large TFRecord file containing the CIFAR-10 dataset, which is relatively small. Larger datasets, however, might require that you shard the data into multiple files, particularly if you use Pipe Mode (see the second bullet in the following list). You can shard the data by specifying an Amazon S3 data source as a manifest file or by using ShardedByS3Key. Amazon SageMaker also provides other ways to make distributed training more efficient for very large datasets:

  • VPC training: Performing Horovod training inside a VPC improves the network latency between nodes, leading to higher performance and stability of Horovod training jobs. To learn how to conduct distributed training within a VPC, see the example notebook Horovod Distributed Training with Amazon SageMaker TensorFlow script mode.
  • Pipe Mode: For large datasets, using Pipe Mode reduces startup and training times. Pipe Mode streams training data from Amazon S3 directly to the algorithm (as a Linux FIFO), without saving to disk. For details about using Pipe Mode with TensorFlow in Amazon SageMaker, see Training with Pipe Mode using PipeModeDataset.
  • Amazon FSx for Lustre and Amazon EFS: performance on large datasets in File Mode may be improved in some circumstances using either Amazon FSx for Lustre or Amazon EFS. For more details, please refer to the related blog post.

Conclusion

Amazon SageMaker provides multiple tools to make distributed training quicker and easier to use. If neither parameter servers nor Horovod fits your needs, you can always provide another distributed training option using a Bring Your Own Container (BYOC) approach. Amazon SageMaker gives you the flexibility to mix and match the tools best suited for your use case and dataset.

To get started with TensorFlow distributed training in script mode, go to the Amazon SageMaker console. Either create a new Amazon SageMaker notebook instance or open an existing one. Then, import the distributed training example referenced in this blog post, and compare the parameter server option with the Horovod option.


About the authors

Rama Thamman is R&D Manager on the AWS R&D and Innovation Solutions Architecture team. He works with customers to build scalable cloud and machine learning solutions on AWS.

Brent Rabowsky focuses on data science at AWS and uses his expertise to help AWS customers with their data science projects.