Author: torontoai

high-res (+4MP) neural-style with home PC?

My specs: i5-2500k 16GB RAM, 6000 on Passmark, Nvidia 660 2GB GPU.

It’s not super new and fantastic, but still a decent older machine.

So here is what I already learned by experimenting:

  1. jcjohnson/neural-style (and similar) is a total dead end. High-res eats absurd amounts of RAM, in both CPU and GPU mode, more than even supercomputers can provide. You can do 512px max at home, and 1024px max if you rent time on some Tesla farm, and that is it.
  2. Tiling produces garbage results that are unusable in virtually all cases. Don’t even try it, even if people claim it can produce acceptable results. It doesn’t, except in maybe 1% of cases.
  3. chainer-fast-neuralstyle: I am trying this now. It’s not a dead end. I don’t know how much RAM you need for a 10MP or 20MP image, but it’s not 32 petaquads as with neural-style. Supposedly I can do 4MP easily at home (I hope?).

Chainer-fast-neuralstyle requires you to train a model specific to one style image on a 20GB dataset, which takes like 20 hours. And then on my home machine a 3.2MP image takes like 1-1.5 hours to style with that one and only model.

I must say in retrospect, I just wanted to print a 60x40cm portrait of myself as an oil painting and not pay 20 bucks on deepart.io. But after letting my PC fume for 5 days on this stuff, I will surely have paid that much in electricity (Germany) by the time I am finished.

Do you have experience with this, and maybe a better solution?

submitted by /u/C0MPAQ

[R][BAIR] “we show that a generative text model trained on sensitive data can actually memorize its training data” – Nicholas Carlini

Evaluating and Testing Unintended Memorization in Neural Networks

Link: https://bair.berkeley.edu/blog/2019/08/13/memorization/

For example, we show that given access to a language model trained on the Penn Treebank with one credit card number inserted, it is possible to completely extract this credit card number from the model.

submitted by /u/downtownslim

[P] This conversational AI has feelings that respond to what you say

Ri, a conversational AI, links different ideas. It has a vocabulary and doesn’t need to be trained. Other features: You change the way Ri feels with conversation. Ri answers your questions. It can relate memories and tell stories. Ri will continue talking if it thinks it’s said something clever. Try it at: http://representi.com.

submitted by /u/James_Representi

[P] Cox: a python logging library for machine learning experiments

Cox is a logging library for python designed for collecting and analyzing data from experiments! Read more and install it here: https://github.com/madrylab/cox

Cox is built for a pattern in experimental design where each individual run of an experiment (e.g. each hyperparameter configuration) writes to a separate, database-like store (complete with schemas, indexing, etc.), saving all information in tables. Experiments are collected and analyzed together by merging tables, and Cox provides a really simple API/flow for merging and analyzing multiple experiments at the same time (e.g. comparing results across hyperparameters).
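The pattern itself can be sketched without Cox. The snippet below does not use Cox's API; it is a minimal illustration built on Python's standard sqlite3 module, and the function names, the table names (metadata, metrics), and the hyperparameter lr are all made up for this example.

```python
import os
import sqlite3
import tempfile


def write_run(store_dir, run_id, lr, losses):
    """Write one experiment run to its own SQLite file (one store per run)."""
    path = os.path.join(store_dir, f"{run_id}.db")
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE metadata (run_id TEXT, lr REAL)")
    con.execute("CREATE TABLE metrics (step INTEGER, loss REAL)")
    con.execute("INSERT INTO metadata VALUES (?, ?)", (run_id, lr))
    con.executemany("INSERT INTO metrics VALUES (?, ?)", list(enumerate(losses)))
    con.commit()
    con.close()
    return path


def merge_runs(store_dir):
    """Merge the per-run tables into one list of rows tagged with run metadata."""
    rows = []
    for name in sorted(os.listdir(store_dir)):
        if not name.endswith(".db"):
            continue
        con = sqlite3.connect(os.path.join(store_dir, name))
        run_id, lr = con.execute("SELECT run_id, lr FROM metadata").fetchone()
        query = "SELECT step, loss FROM metrics ORDER BY step"
        for step, loss in con.execute(query):
            rows.append({"run_id": run_id, "lr": lr, "step": step, "loss": loss})
        con.close()
    return rows


def final_loss_by_lr(rows):
    """Compare runs across hyperparameters: last recorded loss for each lr."""
    final = {}
    for row in rows:  # rows are ordered by step within each run
        final[row["lr"]] = row["loss"]
    return final


# Two hypothetical runs with made-up loss curves.
store = tempfile.mkdtemp()
write_run(store, "run_a", 0.1, [1.0, 0.6, 0.4])
write_run(store, "run_b", 0.01, [1.0, 0.9, 0.8])
merged = merge_runs(store)
print(final_loss_by_lr(merged))
```

Keeping one store per run means a crashed or repeated run never corrupts the others, and the comparison across hyperparameters only happens at merge time.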

This pattern is particularly common in machine learning! We’ve used this logging library for projects involving RL and supervised learning and found it really helpful. Check out the repository for more information, and let me know if you have any questions!

submitted by /u/loganengstrom

[D] Full time consulting/remote/contractor work as a PhD?

There are a lot of posts here about getting research jobs at one of the top labs in industry after a PhD, but I’m curious if anyone else has the ultimate goal of living in a nice low CoL area and working remotely. My area is machine learning, broadly applied to computer vision and robotics.

I think I’ll almost certainly have to work a number of years after graduating in one of the large industry clusters (e.g. the Bay Area), but I would love to be able to transition into a remote/consulting/contracting role after that and buy a house somewhere else. My wife is a physician so she can work pretty much anywhere there’s a hospital (in fact for her the pay is better the more remote the location).

Has anyone gone down this route? What kinds of companies in the field are open to remote work or hire contractors, and how do you go about getting gigs? How do I plan for this now if it’s my ultimate goal, or is this area so specialized/niche that almost all of the opportunities are onsite only?

submitted by /u/moduluus

[P] Towards explainable video analysis – Visual Attention For Action Recognition

I am currently researching practical applications of action recognition models that use attention. I have decided to share lessons learned from implementing several ideas from research papers in this field. The network learns to classify images from the HMDB-51 dataset and creates attention heatmaps which focus on different parts of the image and thus justify the model’s decision. Heatmaps can be very accurate, to the point that one could probably use them for tracking.

Network attends to the relevant part of the video

The tutorial contains a brief overview of action recognition and visual attention mechanisms. Then I present the network architecture and discuss the results of my project. Additionally, I include a GitHub repo with my implementation.

Here are the results!

I hope you guys find it interesting!

submitted by /u/dtransposed

Project Euphonia’s Personalized Speech Recognition for Non-Standard Speech

The utility of technology is dependent on its accessibility. One key component of accessibility is automatic speech recognition (ASR), which can greatly improve the ability of those with speech impairments to interact with everyday smart devices. However, ASR systems are most often trained from ‘typical’ speech, which means that underrepresented groups, such as those with speech impairments or heavy accents, don’t experience the same degree of utility. For example, amyotrophic lateral sclerosis (ALS) is a disease that can adversely affect a person’s speech: about 25% of people with ALS experience slurred speech as their first symptom. In addition, most people with ALS eventually lose the ability to walk, so being able to interact with automated devices from a distance can be very important. Yet current state-of-the-art ASR models can yield high word error rates (WER) for speakers with only a moderate speech impairment from ALS, effectively barring access to ASR-reliant technologies.

In “Personalizing ASR for Dysarthric and Accented Speech with Limited Data,” to be presented at Interspeech 2019, we describe some of the research behind Project Euphonia, an ASR platform that performs speech-to-text transcription. This work presents an approach to improve ASR for people with ALS that may also be applicable to many other types of non-standard speech. Using a two-step training approach that starts with a baseline “standard” corpus and then fine-tunes the training with a personalized speech dataset, we have demonstrated significant improvements for speakers with atypical speech over current state-of-the-art models.

A Two-Phased Approach to Training
In order to create ASR models that work on non-standard speech, one needs to overcome two challenges. The first is that within a particular class of atypical speech, be it a regional accent or a speech impairment, for example, individuals can exhibit very different ways of speaking. Our approach deals with this sub-group heterogeneity by training the ASR model in two phases. We start with a high-quality ASR model trained on thousands of hours of standard speech and then we fine-tune parts of the model to an individual with non-standard speech. This approach is similar to that of Parrotron: both systems use end-to-end neural networks to help improve communication and accessibility, but Parrotron focuses exclusively on speech-to-speech, where a person’s speech is converted directly into synthesized speech, rather than text.

The second challenge arises from the difficulty in collecting enough data to train a state-of-the-art recognizer for individuals. Typical speech recognizers are trained on thousands of hours of speech from many different speakers. Acquiring this much data from a single speaker is nearly impossible, especially if the speaker may experience exhaustion from speaking due to a medical condition. Our approach overcomes this issue by first training a base model on a large corpus of typical speech, and then training a personalized model using a much smaller dataset with the targeted non-standard speech characteristics.
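The two-phase idea can be illustrated with a toy model. This is only a sketch under strong simplifying assumptions: a two-parameter linear model stands in for the ASR network, the data are synthetic, and the choice to fine-tune only the bias is made up for the example. It is not the Euphonia training code.

```python
def predict(params, x):
    return params["w"] * x + params["b"]


def mse(params, data):
    return sum((predict(params, x) - y) ** 2 for x, y in data) / len(data)


def train(params, data, trainable, lr=0.05, epochs=500):
    """Gradient descent on mean squared error, updating only `trainable` params."""
    params = dict(params)  # copy so the caller's model is left untouched
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (predict(params, x) - y) * x for x, y in data) / n
        grad_b = sum(2 * (predict(params, x) - y) for x, y in data) / n
        if "w" in trainable:
            params["w"] -= lr * grad_w
        if "b" in trainable:
            params["b"] -= lr * grad_b
    return params


# Phase 1: a larger "typical" dataset following y = 2x + 1 (synthetic).
typical_data = [(i / 10, 2 * (i / 10) + 1) for i in range(-10, 11)]
# Phase 2: a tiny "personal" dataset following y = 2x + 3 (synthetic).
personal_data = [(-0.5, 2.0), (0.0, 3.0), (0.5, 4.0)]

base = train({"w": 0.0, "b": 0.0}, typical_data, trainable={"w", "b"})
personal = train(base, personal_data, trainable={"b"})  # "w" stays frozen

print("base model on personal data:", mse(base, personal_data))
print("fine-tuned on personal data:", mse(personal, personal_data))
```

The slope learned from the large dataset is reused unchanged, so three personal examples are enough to adapt the one parameter that differs.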

The Neural Network Architecture
When developing the models used for training data on atypical speech, we explored two different neural architectures. The first is the RNN-Transducer (RNN-T), a neural network architecture consisting of encoder and decoder networks that has shown good results on numerous ASR tasks. The encoder is bidirectional (i.e., it looks at the entire sentence at once in order to provide context), and thus it requires the entire audio sample to perform speech recognition.

The other architecture we explored was Listen, Attend, and Spell (LAS), which is an attention-based, sequence-to-sequence model that maps sequences of acoustic properties to sequences of languages. This model uses an encoder to convert the sequence of acoustic frames to a sequence of internal representations, and a decoder to convert the sequence of internal representations to linguistic output. The network produces “word pieces”, which are a linguistic representation between graphemes and words.

Comparison of the RNN-Transducer (left) and Listen, Attend, Spell (right) architectures. From Prabhavalkar et al. 2017.

We experimented with fine-tuning the state-of-the-art RNN-T and LAS base models on two types of non-standard speech. In partnership with the ALS Therapy Development Institute, we first collected about 36 hours of audio from 67 speakers who have ALS. The participants recorded themselves on their home computers using custom software while they read sentences from a very restricted language domain. Many phrases were single sentences with simple grammatical structure (e.g., “What time is the basketball game on tonight?”). This is in contrast with unrestricted language domains, which include domain-specific vocabulary (e.g., science talks) and complex language structure (e.g., a debate). The recordings did not include many of the filler words common in normal speech, such as “um” and “uh”.

We also tested accented speech, using the open source L2 Arctic dataset of non-native speech, which consists of 20 speakers with approximately 1 hour of speech per speaker. Each speaker recorded a set of 1150 utterances from the CMU Arctic prompts.

Audio       | Euphonia Model                       | Standard Speech Model
(recording) | Did I have anything to say about it? | Dictatorship angels to think about it
(recording) | Come right back please               | Cameras object
(recording) | Let’s try that again                 | It extracts
(recording) | Turn it down a little bit please     | Turning down a little bit please
The audio (left) are recordings of a speaker with ALS. The text transcriptions are output from the Euphonia model (center) and the Standard Speech model (right). Incorrectly transcribed text is underlined.

Results
The absolute word error rates on the language-restricted test set are shown below. There is an improvement over the baseline model for very non-standard speech (heavy accents and ALS speech below 3 on the ALS Functional Rating Scale) and moderate improvements in ALS speech that is similar to typical speech. The relative difference between the base model and the fine-tuned model demonstrates that the majority of the improvement comes from the fine-tuning process, except in the case of the RNN-T on the Arctic dataset, where the RNN-T baseline is already strong.

1 Non-native English speech from the L2-Arctic dataset.
2 Low FRS (ALS Functional Rating Scale) speech; intelligible with repeating (FRS 2); Speech combined with non-vocal communication (FRS 1).
3 FRS 3; detectable speech disturbance.
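For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn the reference into the hypothesis, divided by the number of reference words. A minimal sketch follows. The sentence pair is borrowed from the example transcriptions in this post, but treating the first as the ground-truth reference is an assumption made only for illustration, and this is not the evaluation code used in the paper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


reference = "Turn it down a little bit please"
hypothesis = "Turning down a little bit please"
print(round(word_error_rate(reference, hypothesis), 3))  # 2 errors / 7 words
```

Here "turn" is replaced by "turning" and "it" is dropped, so two of the seven reference words are in error.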

The RNN-T model achieved 91% of the improvement by fine-tuning just two layers, most of which are close to the input. On the accented dataset, fine-tuning the same two layers achieved 86% of the relative improvement compared to fine-tuning the entire network. This is consistent with previous speech work.
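One way to read figures such as "91% of the improvement" is as the share of the WER reduction from full fine-tuning that partial fine-tuning recovers. The helper below encodes that reading; the interpretation is an assumption, and the WER values are invented placeholders, not results from the paper.

```python
def share_of_improvement(base_wer, partial_wer, full_wer):
    """Fraction of the full fine-tuning gain recovered by partial fine-tuning."""
    return (base_wer - partial_wer) / (base_wer - full_wer)


# Hypothetical WERs in percent: base model, a few layers tuned, all layers tuned.
share = share_of_improvement(base_wer=30.0, partial_wer=12.0, full_wer=10.0)
print(f"{share:.0%}")
```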

Most of the performance gains were achieved early in training. The models we trained were tested on a relatively limited domain of vocabulary and linguistic complexity, so the performance numbers are not necessarily related to how well the models perform on more general tasks. We hope that just fine-tuning part of the network allows it to retain the acoustic and linguistic information from the general speech model, while needing minimal modifications to adapt to a single new speaker. Future work will test this hypothesis.

Low FRS corresponds to the ALS speakers with low intelligibility (FRS 2, 1), while high FRS corresponds to ALS speakers with less severely impacted speech (FRS 3).

Understanding Model Behavior
To better understand how our models improved after fine-tuning, we looked at the pattern of phoneme mistakes. We started by comparing the distribution of phoneme mistakes made by the base ASR model on standard speech to the mistakes made on ALS speech. The SAMPA phonemes with the five largest differences between the ALS data and standard speech are p, U, f, k, and Z, which account for 20% of the deletion mistakes. Similarly, the n and m phonemes together account for 17% of the insertion / substitution mistakes. The same analysis on our fine-tuned models verifies that the unrecognized phoneme distribution is more similar to that of standard speech.

Our analysis shows that there are two aspects to every mistake: which phoneme the system doesn’t understand, and which phoneme the system thinks was said. Imagine having two systems with identical accuracy: one system always thinks that the f phoneme is actually the g phoneme, while another doesn’t know what the f phoneme is and randomly guesses. These two systems will have identical performance and identical distributions of phoneme mistakes, but very different distributions of the predicted phoneme when a mistake is made. Surprisingly, ASR mistakes on ALS speech are far more similar to regular speech mistakes after Euphonia fine-tuning.
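The thought experiment with the two systems can be made concrete with a small simulation. It is only a sketch: the four-phoneme inventory is made up, and the second system is restricted to guessing among the other phonemes (never f) so that both systems are wrong on exactly the same inputs.

```python
import random
from collections import Counter

PHONEMES = ["f", "g", "k", "p"]  # toy inventory, not the SAMPA set


def system_a(true_phoneme):
    """Always hears 'f' as 'g'; everything else is recognized correctly."""
    return "g" if true_phoneme == "f" else true_phoneme


def make_system_b(rng):
    """Does not know 'f' and guesses uniformly among the other phonemes."""
    others = [p for p in PHONEMES if p != "f"]

    def system_b(true_phoneme):
        return rng.choice(others) if true_phoneme == "f" else true_phoneme

    return system_b


def mistake_stats(system, spoken):
    missed = Counter()     # which true phoneme was misrecognized
    predicted = Counter()  # what the system output when it was wrong
    for true_phoneme in spoken:
        guess = system(true_phoneme)
        if guess != true_phoneme:
            missed[true_phoneme] += 1
            predicted[guess] += 1
    return missed, predicted


spoken = PHONEMES * 50  # 200 phonemes, 50 of each
missed_a, predicted_a = mistake_stats(system_a, spoken)
missed_b, predicted_b = mistake_stats(make_system_b(random.Random(0)), spoken)

print("missed   :", dict(missed_a), dict(missed_b))
print("predicted:", dict(predicted_a), dict(predicted_b))
```

Both systems miss the same 50 f phonemes, so accuracy and the distribution of missed phonemes match, while the predicted phonemes are concentrated on g for the first system and spread across g, k, and p for the second.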

Deletion / substitution mistakes per SAMPA phoneme on ALS speech before fine-tuning, ALS speech after fine-tuning, and on typical speech (Librispeech dataset).

Future Work
In the future, we intend to explore additional techniques that can be helpful in the low data regime. We also hope to use phoneme mistakes to weight certain examples during training, or to pick training sentences for people with ALS to record that contain the most common phoneme mistakes. We would like to explore pooling data from multiple speakers with similar conditions.

We hope that continued research in this area will help voice interfaces become accessible to more people, especially those who need it most. One key component to this is collecting data. Anyone 18 or older can help us build better personalized models by donating audio data. If you’re interested, you can fill out this form to allow Google to contact you.

Acknowledgements
This work would not have been possible without the extraordinary effort and support of the ALS Therapy Development Institute and the ALS community, especially Fernando Vieira, Maeve McNally, Taylor Charbonneau, Melissa Nollstadt, and the individuals with ALS who kindly and patiently volunteered their audio. This work builds on the pioneering advances in speech recognition made by Google’s speech team, in particular the recent development and deployment of end-to-end speech recognition models. We are grateful to the Google speech team for advice and collaboration, particularly to Anshuman Tripathi and Hasim Sak who guided us in training the initial models. We’d also like to thank Oran Lang, Omry Tuval, Michael Brenner, Julie Cattiau, Tara Sainath, Ding Zhao, Qiao Liang, Chung-Cheng Chiu, Dan Liebling, Ron Weiss, Anjuli Kannan, Dimitri Kanevsky, Ryan He, Gabor Simko, Benjamin Lee, Françoise Beaufays, Khe Chai Sim, Jimmy Tobin, Chet Gnegy, Jacqueline Huang, Ye Jia, Yu Zhang, Yonghui Wu, Michelle Ramanovich, Rus Heywood, Katrin Tomanek, Bob MacDonald, Pan-Pan Jiang, Ronnie Maor, Rif A. Saurous, Trevor Strohman, Dick Lyon, Avinatan Hassidim, Philip Nelson, and Yossi Matias for their technical contributions and project guidance.