Learn About Our Meetup

4200+ Members

[P] Trump, Obama, Jordan Peterson and Neil deGrasse Tyson TTS models sing Straight Outta Compton

This is a great demonstration of some of the different TTS models I’ve trained and how I can control style:

These models were trained using my implementation of the papers “Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis” and “Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron”.

submitted by /u/hanyuqn
[link] [comments]

[D] Observations from OpenAI’s Five (Dota 2)

I’ve been working on an article about AlphaStar for a while (cough, very late), but after last week’s events I decided to sit down and write about OpenAI Five’s success.

There are a few areas I wish I had more knowledge to expand on:

  • I wish I knew more about OpenAI’s Rapid training system.
  • Pros and cons of PPO for Dota 2. I’d also like to know what didn’t work.
  • Decision tree of StarCraft 2 vs. Dota 2. Relative to each game, OpenAI has solved more of Dota 2’s action space; however, StarCraft 2 seems to have a larger decision tree?
  • OpenAI mentioned “surgery” in the context of transfer learning, but there isn’t much information out there.

Very open to feedback and suggestions, thanks!

submitted by /u/jshek
[link] [comments]

[D] GAN Immediate Mode Collapse

I’m not even sure if mode collapse is the correct term; neither the generator nor discriminator is learning anything when I pass both true/false samples to the discriminator. If instead I only show the discriminator true or false samples, the loss drops. I’ve seen mode collapse after a few epochs of training other GANs but never complete stagnation out of the gate. What might be going wrong here?

import numpy as np
import keras
import keras.backend as K
from keras.models import Sequential
from keras.layers import LSTM, Dense, Conv1D, LeakyReLU, BatchNormalization
from keras.optimizers import SGD


def generator():
    neurons = 121
    model = Sequential()
    # Input shape: [batch_size, timestep, input_dim]
    model.add(LSTM(neurons, activation='tanh', recurrent_activation='hard_sigmoid',
                   kernel_initializer='RandomUniform', return_sequences=True))
    model.add(LSTM(neurons, activation='tanh', recurrent_activation='hard_sigmoid',
                   kernel_initializer='RandomUniform', return_sequences=True))
    model.add(Dense(1, activation=None))
    return model


def discriminator():
    model = Sequential()
    # Input shape: [batch_size, steps, channels]
    model.add(Conv1D(32, 4, strides=2, activation=None, padding='same', input_shape=(None, 1)))
    model.add(LeakyReLU())
    model.add(Conv1D(64, 4, strides=2, activation=None, padding='same'))
    model.add(LeakyReLU())
    model.add(BatchNormalization())
    model.add(Conv1D(128, 4, strides=2, activation=None, padding='same'))
    model.add(LeakyReLU())
    model.add(BatchNormalization())
    model.add(Dense(128, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    return model


def generator_containing_discriminator(g, d):
    model = Sequential()
    model.add(g)
    d.trainable = False
    model.add(d)
    return model


def g_loss_function(y_true, y_pred):
    l_bce = keras.losses.binary_crossentropy(y_true, y_pred)  # fixed typo: was y_tue
    # Fixed: the original K.sqrt(K.square(y_true) - K.square(y_pred)) produces
    # NaN whenever the operand is negative; use the distance |y_true - y_pred|.
    l_norm = K.sqrt(K.square(y_true - y_pred))
    return l_bce + l_norm


def train(X, Y, BATCH_SIZE):
    d_optim = SGD(lr=0.002)
    g_optim = SGD(lr=0.00004)
    g = generator()
    d = discriminator()
    gan = generator_containing_discriminator(g, d)
    g.compile(loss=g_loss_function, optimizer=g_optim)
    gan.compile(loss='binary_crossentropy', optimizer='SGD')
    d.trainable = True
    d.compile(loss='binary_crossentropy', optimizer=d_optim)

    num_batches = int(X.shape[0] / float(BATCH_SIZE))
    for epoch in range(1000):
        for index in range(1, num_batches):
            # Prepare data
            startIdx = (index - 1) * BATCH_SIZE
            endIdx = index * BATCH_SIZE
            inputs = X[startIdx:endIdx, :]
            targets = Y[startIdx:endIdx]

            # Generate predictions
            Y_pred = g.predict(inputs)

            # Build input and truth arrays for the discriminator
            targets = targets.reshape(BATCH_SIZE, 1, 1)
            truth = np.vstack((np.ones((BATCH_SIZE, 1, 1)), np.zeros((BATCH_SIZE, 1, 1))))
            d_loss = d.train_on_batch(np.vstack((targets, Y_pred)), truth)

            # Train the generator through the frozen discriminator
            d.trainable = False
            g_truth = np.ones((BATCH_SIZE, 1, 1))
            g_loss = gan.train_on_batch(inputs, g_truth)
            d.trainable = True

        print('Epoch {} | d_loss: {} | g_loss: {}'.format(epoch, d_loss, g_loss))
        g.save_weights('generator', True)
        d.save_weights('discriminator', True)
    return d, g, gan
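One pattern worth trying (a guess, since BatchNormalization sits in the discriminator): train the discriminator on an all-real batch and an all-fake batch separately, instead of one stacked half-and-half batch. With batch norm, a mixed batch lets the discriminator separate the classes through batch statistics alone, which matches the symptom that the loss only moves when it sees a single class. A minimal sketch of the pattern, with a hypothetical stub standing in for Keras's train_on_batch so it runs on its own:

```python
import numpy as np

def train_discriminator_separately(d_train_on_batch, real, fake):
    """Run two updates: one on an all-real batch, one on an all-fake batch,
    so batch-norm statistics are computed per class instead of over a mix."""
    n_real, n_fake = len(real), len(fake)
    loss_real = d_train_on_batch(real, np.ones((n_real, 1)))
    loss_fake = d_train_on_batch(fake, np.zeros((n_fake, 1)))
    return 0.5 * (loss_real + loss_fake)

# Stub in place of a compiled Keras model's train_on_batch, purely to make
# the sketch self-contained and runnable.
def fake_train_on_batch(x, y):
    return float(np.mean((x.mean(axis=1, keepdims=True) - y) ** 2))

real = np.random.RandomState(0).normal(1.0, 0.1, size=(8, 4))
fake = np.random.RandomState(1).normal(-1.0, 0.1, size=(8, 4))
loss = train_discriminator_separately(fake_train_on_batch, real, fake)
```

In the original loop this would mean replacing the single `d.train_on_batch(np.vstack((targets, Y_pred)), truth)` call with two calls, one per class.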

submitted by /u/Cranial_Vault
[link] [comments]

[Project] Deploy trained models to AWS Lambda with the Serverless framework

Hi guys,

We have continued updating our open source project for packaging and deploying ML models to production, and we have created an easy way to deploy an ML model as a Serverless project that you can deploy to AWS Lambda and Google Cloud Functions. We want to share it with you and hear your feedback.


A little background on BentoML for those who aren’t familiar with it. BentoML is a Python library for packaging and deploying machine learning models. It provides high-level APIs for defining an ML service and packaging its artifacts, source code, dependencies, and configurations into a production-system-friendly format that is ready for deployment.

Feature highlights:

  • Multiple Distribution Formats – Easily package your machine learning models into the format that works best with your inference scenario:
    – Docker Image – deploy as containers running a REST API server
    – PyPI Package – integrate into your Python applications seamlessly
    – CLI tool – put your model into an Airflow DAG or CI/CD pipeline
    – Spark UDF – run batch serving on large datasets with Spark
    – Serverless Function – host your model on serverless cloud platforms

  • Multiple Framework Support – BentoML supports a wide range of ML frameworks out-of-the-box, including TensorFlow, PyTorch, Scikit-Learn and XGBoost, and can be easily extended to work with new or custom frameworks.

  • Deploy Anywhere – BentoML-bundled ML services can be easily deployed with platforms such as Docker, Kubernetes, Serverless, Airflow and Clipper, on cloud platforms including AWS Lambda/ECS/SageMaker, Google Cloud Functions, and Azure ML.

  • Custom Runtime Backend – Easily integrate your Python preprocessing code with a high-performance deep learning model runtime backend (such as tensorflow-serving) to deploy a low-latency serving endpoint.


How to package a machine learning model as a serverless project with BentoML

It’s surprisingly easy: just a single CLI command. After you have finished training your model and saved it to the file system with BentoML, all you need to do is run the bentoml build-serverless-archive command, for example:

 $bentoml build-serverless-archive /path_to_bentoml_archive /path_to_generated_serverless_project --platform=[aws-python, aws-python3, google-python] 

This will generate a serverless project at the specified directory. Let’s take a look at what files are generated.

 /path_to_generated_serverless_project
 - serverless.yml
 - requirements.txt
 - copy_of_bentoml_archive/
 - a handler file (its name depends on the chosen platform)

serverless.yml is the configuration file for the Serverless framework. It contains the configuration for the cloud provider you are deploying to, and maps out which events trigger which functions. BentoML automatically modifies this file to add your model prediction as a function event and updates other info for you.
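For readers new to the Serverless framework, a minimal serverless.yml for an AWS Lambda HTTP endpoint looks roughly like this (an illustrative sketch with made-up names, not necessarily what BentoML emits):

```yaml
service: my-model-service        # hypothetical service name

provider:
  name: aws
  runtime: python3.7

functions:
  predict:
    handler: handler.predict     # module.function that runs the prediction
    events:
      - http:                    # expose the function as an HTTP POST endpoint
          path: predict
          method: post
```

The `functions` section is the part BentoML fills in: each entry maps an event (here, an HTTP request) to the Python function that serves the model.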

requirements.txt is copied from your model archive; it includes all of the dependencies needed to run your model.

The generated handler file contains your function code. BentoML fills in its function with your model archive’s service class, so you can make predictions with it right away without any modifications.

copy_of_bentoml_archive: a copy of your model archive. It will be bundled with the other files for serverless deployment.


What’s next

After you generate the serverless project, if you have the default configuration for AWS or Google Cloud set up, you can deploy it right away. Otherwise, you can update serverless.yml with your own configuration.

Love to hear feedback from you guys on this.





Edit: Styling

submitted by /u/yubozhao
[link] [comments]

[Project] Computer Vision with ONNX Models

Hey everyone! I just created a new runtime for Open Neural Network Exchange (ONNX) models called ONNXCV.

Basically, you can run inference with ONNX models for realtime computer vision applications (e.g., image classification and object detection) without having to write boilerplate code. It is useful in the sense that you can focus on building deep learning models with any deep learning library (that converts models into the ONNX file format), without having to sink time into building the actual inference program.
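As an illustration of the kind of boilerplate such a runtime hides, here is a sketch of the classification postprocessing (softmax plus top-k) typically applied to a model’s raw output logits. The label set and function names are hypothetical, not ONNXCV’s actual API:

```python
import numpy as np

LABELS = ["cat", "dog", "car"]  # illustrative label set

def softmax(logits):
    z = logits - np.max(logits)          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def top_k(logits, k=2):
    """Return the k most probable (label, probability) pairs."""
    probs = softmax(np.asarray(logits, dtype=np.float64))
    order = np.argsort(probs)[::-1][:k]  # indices of the k highest scores
    return [(LABELS[i], float(probs[i])) for i in order]

# Pretend these logits came out of an ONNX classification model.
preds = top_k([2.0, 1.0, 0.1])
```

A wrapper runtime would run the ONNX session on a camera frame, pass the output through a function like `top_k`, and overlay the result on the video stream.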

Let me know what you think and if I should continue to build upon this (or not).

Here is the code.

submitted by /u/Chromobacterium
[link] [comments]

SpecAugment: A New Data Augmentation Method for Automatic Speech Recognition

Automatic Speech Recognition (ASR), the process of taking an audio input and transcribing it to text, has benefited greatly from the ongoing development of deep neural networks. As a result, ASR has become ubiquitous in many modern devices and products, such as Google Assistant, Google Home and YouTube. Nevertheless, there remain many important challenges in developing deep learning-based ASR systems. One such challenge is that ASR models, which have many parameters, tend to overfit the training data and have a hard time generalizing to unseen data when the training set is not extensive enough.

In the absence of an adequate volume of training data, it is possible to increase the effective size of existing data through the process of data augmentation, which has contributed to significantly improving the performance of deep networks in the domain of image classification. In the case of speech recognition, augmentation traditionally involves deforming the audio waveform used for training in some fashion (e.g., by speeding it up or slowing it down), or adding background noise. This has the effect of making the dataset effectively larger, as multiple augmented versions of a single input are fed into the network over the course of training, and also helps the network become robust by forcing it to learn relevant features. However, existing conventional methods of augmenting audio input introduce additional computational cost and sometimes require additional data.
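The waveform-level augmentations described above can be sketched in a few lines of numpy (an illustrative toy implementation; real systems use proper resampling and recorded noise rather than linear interpolation and Gaussian noise):

```python
import numpy as np

def add_background_noise(wave, snr_db=20.0, rng=None):
    """Mix Gaussian noise into a waveform at a target signal-to-noise ratio."""
    if rng is None:
        rng = np.random.default_rng(0)
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise

def change_speed(wave, rate=1.1):
    """Naively speed up (rate > 1) or slow down (rate < 1) by linear resampling."""
    n_out = int(len(wave) / rate)
    idx = np.linspace(0, len(wave) - 1, n_out)
    return np.interp(idx, np.arange(len(wave)), wave)

t = np.linspace(0, 1, 16000, endpoint=False)
clean = np.sin(2 * np.pi * 440 * t)          # 1 s of a 440 Hz tone at 16 kHz
noisy = add_background_noise(clean, snr_db=20.0)
fast = change_speed(clean, rate=1.25)
```

Note that both operations touch the raw waveform, so each augmented copy must be re-converted to a spectrogram before training, which is exactly the cost SpecAugment avoids.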

In our recent paper, “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition”, we take a new approach to augmenting audio data, treating it as a visual problem rather than an audio one. Instead of augmenting the input audio waveform as is traditionally done, SpecAugment applies an augmentation policy directly to the audio spectrogram (i.e., an image representation of the waveform). This method is simple, computationally cheap to apply, and does not require additional data. It is also surprisingly effective in improving the performance of ASR networks, demonstrating state-of-the-art performance on the ASR tasks LibriSpeech 960h and Switchboard 300h.

In traditional ASR, the audio waveform is typically encoded as a visual representation, such as a spectrogram, before being input as training data for the network. Augmentation is normally applied to the waveform audio before it is converted into the spectrogram, such that after every iteration, new spectrograms must be generated. In our work, we investigate augmenting the spectrogram itself, rather than the waveform data. Since the augmentation is applied directly to the input features of the network, it can be run online during training without significantly impacting training speed.

A waveform is typically converted into a visual representation (in our case, a log mel spectrogram; steps 1 through 3 of this article) before being fed into a network.

SpecAugment modifies the spectrogram by warping it in the time direction, masking blocks of consecutive frequency channels, and masking blocks of utterances in time. These augmentations have been chosen to help the network to be robust against deformations in the time direction, partial loss of frequency information and partial loss of small segments of speech of the input. An example of such an augmentation policy is displayed below.
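The frequency- and time-masking steps (leaving aside time warping) come down to a few lines of array manipulation. A simplified numpy sketch, not the authors’ implementation — mask widths and the choice of zero as the fill value are illustrative:

```python
import numpy as np

def spec_augment(spec, freq_mask=8, time_mask=10, n_masks=2, rng=None):
    """Zero out random bands of mel channels (rows) and time steps (columns)."""
    if rng is None:
        rng = np.random.default_rng(0)
    out = spec.copy()
    n_mels, n_frames = out.shape
    for _ in range(n_masks):
        f = rng.integers(0, freq_mask + 1)   # mask width in mel channels
        f0 = rng.integers(0, n_mels - f + 1)
        out[f0:f0 + f, :] = 0.0
        t = rng.integers(0, time_mask + 1)   # mask width in time frames
        t0 = rng.integers(0, n_frames - t + 1)
        out[:, t0:t0 + t] = 0.0
    return out

spec = np.random.default_rng(42).random((80, 100))  # stand-in log mel spectrogram
aug = spec_augment(spec)
```

Because the masking operates on the already-computed spectrogram, it can be applied on the fly inside the input pipeline, with no extra audio processing per epoch.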

The log mel spectrogram is augmented by warping in the time direction, and masking (multiple) blocks of consecutive time steps (vertical masks) and mel frequency channels (horizontal masks). The masked portion of the spectrogram is displayed in purple for emphasis.

To test SpecAugment, we performed some experiments with the LibriSpeech dataset, where we took three Listen Attend and Spell (LAS) networks, end-to-end networks commonly used for speech recognition, and compared the test performance between networks trained with and without augmentation. The performance of an ASR network is measured by the Word Error Rate (WER) of the transcript produced by the network against the target transcript. Here, all hyperparameters were kept the same, and only the data fed into the network was altered. We found that SpecAugment improves network performance without any additional adjustments to the network or training parameters.

Performance of networks on the test sets of LibriSpeech with and without augmentation. The LibriSpeech test set is divided into two portions, test-clean and test-other, the latter of which contains noisier audio data.

More importantly, SpecAugment prevents the network from over-fitting by giving it deliberately corrupted data. As an example of this, below we show how the WER for the training set and the development (or dev) set evolves through training with and without augmentation. We see that without augmentation, the network achieves near-perfect performance on the training set, while grossly under-performing on both the clean and noisy dev set. On the other hand, with augmentation, the network struggles to perform as well on the training set, but actually shows better performance on the clean dev set, and shows comparable performance on the noisy dev set. This suggests that the network is no longer over-fitting the training data, and that improving training performance would lead to better test performance.

Training, clean (dev-clean) and noisy (dev-other) development set performance with and without augmentation.

State-of-the-Art Results
We can now focus on improving training performance, which can be done by adding more capacity to the networks by making them larger. By doing this in conjunction with increasing training time, we were able to get state-of-the-art (SOTA) results on the tasks LibriSpeech 960h and Switchboard 300h.

Word error rates (%) for state-of-the-art results for the tasks LibriSpeech 960h and Switchboard 300h. The test set for both tasks has a clean (clean/Switchboard) and a noisy (other/CallHome) subset. Previous SOTA results taken from Li et al. (2019), Yang et al. (2018) and Zeyer et al. (2018).

The simple augmentation scheme we have used is surprisingly powerful—we are able to improve the performance of the end-to-end LAS networks so much that it surpasses those of classical ASR models, which traditionally did much better on smaller academic datasets such as LibriSpeech or Switchboard.

Performance of various classes of networks on LibriSpeech and Switchboard tasks. The performance of LAS models is compared to classical (e.g., HMM) and other end-to-end models (e.g., CTC/ASG) over time.

Language Models
Language models (LMs), which are trained on a bigger corpus of text-only data, have played a significant role in improving the performance of an ASR network by leveraging information learned from text. However, LMs typically need to be trained separately from the ASR network, and can be very large in memory, making them hard to fit on a small device, such as a phone. An unexpected outcome of our research was that models trained with SpecAugment out-performed all prior methods even without the aid of a language model. While our networks still benefit from adding an LM, our results are encouraging in that they suggest the possibility of training networks that can be used for practical purposes without the aid of an LM.

Word error rates for LibriSpeech and Switchboard tasks with and without LMs. SpecAugment outperforms previous state-of-the-art even before the inclusion of a language model.

Most of the work on ASR in the past has been focused on looking for better networks to train. Our work demonstrates that looking for better ways to train networks is a promising alternative direction of research.

We would like to thank the co-authors of our paper Chung-Cheng Chiu, Ekin Dogus Cubuk, Quoc Le, Yu Zhang and Barret Zoph. We also thank Yuan Cao, Ciprian Chelba, Kazuki Irie, Ye Jia, Anjuli Kannan, Patrick Nguyen, Vijay Peddinti, Rohit Prabhavalkar, Yonghui Wu and Shuyuan Zhang for useful discussions.
