Learn About Our Meetup

4500+ Members

[D] Unstable performance during parameter search (Keras)

Hi all,

I was hoping we could discuss the plot below:

That plot comes from a parameter search using Keras/Tensorflow for a binary classification problem with an unbalanced class distribution (as you can tell from the acc plot, the ratio is about 5:1 negative to positive).

The metric that I am most interested in is Precision, and as you can see in this example it is very unstable, bouncing around wildly between epochs – which obviously doesn’t lend itself to being a good/stable model.

Whilst there is a little overfitting, there doesn’t seem to be too much and I can confirm that the data itself is all properly scaled and normalised.

Although the plot scale is a bit large (sorry) to tell properly, I think what we’d find is that Recall fluctuates in unison with Precision. As Recall bounces upwards, I’d expect Precision to take a dive downwards.

I can’t post the exact model because it’s a parameter search with a wide range of possible configurations, but I’m optimising across a range of network depths, widths, dropouts, shapes, learning rate, etc. I’m using binary_crossentropy as the loss, Elu activations, and Nadam optimizer – though I’ve tried a various others with similar results.

What would be your suggestions for creating a more stable model?

At the moment, the class_weight is set to 0:1, 1:1. I think upping the positive class ratio would somewhat stabilise the model (by increasing recall), but I’m shooting to have a high precision and accepting that my recall will be the trade-off and be somewhat low. For example, I’d be happy with 57% precision at 5% recall. In fact – that’s the exact result I got from a previous parameter search, but it didn’t generalise well to the blind test set, and I’m suspecting that the cause was the unstable epoch-to-epoch precision we’re seeing in this plot (though I can only see the plots for the “current” model being generated, so by the end of the many-hour parameter search all I have is a csv of the final values, with no plots to go along with them).

submitted by /u/Zman420
[link] [comments]

[R] Autonomous Navigation in Unconstrained Environments

While several datasets for autonomous navigation have become available in recent years, they have tended to focus on structured driving environments. This usually corresponds to well-delineated infrastructure such as lanes, a small number of well-defined categories for traffic participants, low variation in an object or background appearance and strong adherence to traffic rules.

I recently worked with IDD, dataset collected from India. It’s relatively more challenging than other autonomous navigation-related datasets (such as Berkeley deep drive or cityscapes) since much of the data has been captured from non standard conditions (drivable areas except roads etc.).

I’m releasing the code for this work, feel free to use it for your projects or research.



submitted by /u/vector_machines
[link] [comments]

[D] Why have we not seen equivalent success in deep learning based image registration?

It seems that other computer vision tasks such as classification, segmentation and synthesis have seen huge advances in accuracy thanks to CNNs, but there seems to be no equivalent advance in image registration. I tried searching for advances in image registration, but it seems that researchers still use ‘classical’ image registration techniques like mutual information, cross-correlation, etc. Even though there are DL image registration research papers, they are not well adopted in the community.

Fundamentally, is there a reason why this task is more complex that the aforementioned ones?

submitted by /u/deep-yearning
[link] [comments]

[R] ABD-Net Person Re-ID code is available

Attention mechanism has been shown to be effective for person re-identification (Re-ID). However, the learned attentive feature embeddings which are often not naturally diverse nor uncorrelated, will compromise the retrieval performance based on the Euclidean distance. We advocate that enforcing diversity could greatly complement the power of attention. To this end, we propose an Attentive but Diverse Network (ABD-Net), which seamlessly integrates attention modules and diversity regularization throughout the entire network, to learn features that are representative, robust, and more discriminative.

submitted by /u/yang-explore
[link] [comments]

[N] HGX-2 Deep Learning Benchmarks: The 81,920 CUDA Core “Behemoth” GPU Server

[N] HGX-2 Deep Learning Benchmarks: The 81,920 CUDA Core “Behemoth” GPU Server

Deep learning benchmarks for TensorFlow on Exxact TensorEX HGX-2 Server.

Original Post from Exxact Here

Notable GPU Server Features

  • 16x NVIDIA Tesla V100 SXM3
  • 81,920 NVIDIA CUDA Cores
  • 10,240 NVIDIA Tensor Cores
  • .5TB Total GPU Memory
  • NVSwitch powered by NVLink 2.4TB/sec aggregate speed



Tests were run on ResNet-50, ResNet-152, Inception V3, VGG-16. Also compared FP16 to FP32 performance, and used batch size of 256 (except for ResNet152 FP32, the batch size was 64). Same tests run using 1,2,4,8 and 16 GPU configurations. All benchmarks were done using ‘vanilla’ TensorFlow settings for FP16 and FP32.

For the full write-up + tables and numbers visit:

submitted by /u/exxact-jm
[link] [comments]

Joint Speech Recognition and Speaker Diarization via Sequence Transduction

Being able to recognize “who said what,” or speaker diarization, is a critical step in understanding audio of human dialog through automated means. For instance, in a medical conversation between doctors and patients, “Yes” uttered by a patient in response to “Have you been taking your heart medications regularly?” has a substantially different implication than a rhetorical “Yes?” from a physician.

Conventional speaker diarization (SD) systems use two stages, the first of which detects changes in the acoustic spectrum to determine when the speakers in a conversation change, and the second of which identifies individual speakers across the conversation. This basic multi-stage approach is almost two decades old, and during that time only the speaker change detection component has improved.

With the recent development of a novel neural network model—the recurrent neural network transducer (RNN-T)—we now have a suitable architecture to improve the performance of speaker diarization addressing some of the limitations of the previous diarization system we presented recently. As reported in our recent paper, “Joint Speech Recognition and Speaker Diarization via Sequence Transduction,” to be presented at Interspeech 2019, we have developed an RNN-T based speaker diarization system and have demonstrated a breakthrough in performance from about 20% to 2% in word diarization error rate—a factor of 10 improvement.

Conventional Speaker Diarization Systems
Conventional speaker diarization systems rely on differences in how people sound acoustically to distinguish the speakers in the conversations. While male and female speakers can be identified relatively easily from their pitch using simple acoustic models (e.g., Gaussian mixture models) in a single stage, speaker diarization systems use a multi-stage approach to distinguish between speakers having potentially similar pitch. First, a change detection algorithm breaks up the conversation into homogeneous segments, hopefully containing only a single speaker, based upon detected vocal characteristics. Then, deep learning models are employed to map segments from each speaker to an embedding vector. Finally, in a clustering stage, these embeddings are grouped together to keep track of the same speaker across the conversation.

In practice, the speaker diarization system runs in parallel to the automatic speech recognition (ASR) system and the outputs of the two systems are combined to attribute speaker labels to the recognized words.

Conventional speaker diarization system infers speaker labels in the acoustic domain and then overlays the speaker labels on the words generated by a separate ASR system.

There are several limitations with this approach that have hindered progress in this field. First, the conversation needs to be broken up into segments that only contain speech from one speaker. Otherwise, the embedding will not accurately represent the speaker. In practice, however, the change detection algorithm is imperfect, resulting in segments that may contain multiple speakers. Second, the clustering stage requires that the number of speakers be known and is particularly sensitive to the accuracy of this input. Third, the system needs to make a very difficult trade-off between the segment size over which the voice signatures are estimated and the desired model accuracy. The longer the segment, the better the quality of the voice signature, since the model has more information about the speaker. This comes at the risk of attributing short interjections to the wrong speaker, which could have very high consequences, for example, in the context of processing a clinical or financial conversation where affirmation or negation needs to be tracked accurately. Finally, conventional speaker diarization systems do not have an easy mechanism to take advantage of linguistic cues that are particularly prominent in many natural conversations. An utterance, such as “How often have you been taking the medication?” in a clinical conversation is most likely uttered by a medical provider, not a patient. Likewise, the utterance, “When should we turn in the homework?” is most likely uttered by a student, not a teacher. Linguistic cues also signal high probability of changes in speaker turns, for example, after a question.

There are a few exceptions to the conventional speaker diarization system, but one such exception was reported in our recent blog post. In that work, the hidden states of the recurrent neural network (RNN) tracked the speakers, circumventing the weakness of the clustering stage. Our approach takes a different approach and incorporates linguistic cues, as well.

An Integrated Speech Recognition and Speaker Diarization System
We developed a novel and simple model that not only combines acoustic and linguistic cues seamlessly, but also combines speaker diarization and speech recognition into one system. The integrated model does not degrade the speech recognition performance significantly compared to an equivalent recognition only system.

The key insight in our work was to recognize that the RNN-T architecture is well-suited to integrate acoustic and linguistic cues. The RNN-T model consists of three different networks: (1) a transcription network (or encoder) that maps the acoustic frames to a latent representation, (2) a prediction network that predicts the next target label given the previous target labels, and (3) a joint network that combines the output of the previous two networks and generates a probability distribution over the set of output labels at that time step. Note, there is a feedback loop in the architecture (diagram below) where previously recognized words are fed back as input, and this allows the RNN-T model to incorporate linguistic cues, such as the end of a question.

An integrated speech recognition and speaker diarization system where the system jointly infers who spoke when and what.

Training the RNN-T model on accelerators like graphical processing units (GPU) or tensor processing units (TPU) is non-trivial as computation of the loss function requires running the forward-backward algorithm, which includes all possible alignments of the input and the output sequences. This issue was addressed recently in a TPU friendly implementation of the forward-backward algorithm, which recasts the problem as a sequence of matrix multiplications. We also took advantage of an efficient implementation of the RNN-T loss in TensorFlow that allowed quick iterations of model development and trained a very deep network.

The integrated model can be trained just like a speech recognition system. The reference transcripts for training contain words spoken by a speaker followed by a tag that defines the role of the speaker. For example, “When is the homework due?” ≺student≻, “I expect you to turn them in tomorrow before class,” ≺teacher≻. Once the model is trained with examples of audio and corresponding reference transcripts, a user can feed in the recording of the conversation and expect to see an output in a similar form. Our analyses show that improvements from the RNN-T system impact all categories of errors, including short speaker turns, splitting at the word boundaries, incorrect speaker assignment in the presence of overlapping speech, and poor audio quality. Moreover, the RNN-T system exhibited consistent performance across conversation with substantially lower variance in average error rate per conversation compared to the conventional system.

A comparison of errors committed by the conventional system vs. the RNN-T system, as categorized by human annotators.

Furthermore, this integrated model can predict other labels necessary for generating more reader-friendly ASR transcripts. For example, we have been able to successfully improve our transcripts with punctuation and capitalization symbols using the appropriately matched training data. Our outputs have lower punctuation and capitalization errors than our previous models that were separately trained and added as a post-processing step after ASR.

This model has now become a standard component in our project on understanding medical conversations and is also being adopted more widely in our non-medical speech services.

We would like to thank Hagen Soltau without whose contributions this work would not have been possible. This work was performed in collaboration with Google Brain and Speech teams.

Applications Open for $50,000 NVIDIA Graduate Fellowship Awards

Bringing together the world’s brightest minds and the latest GPU technology leads to powerful research breakthroughs.

That’s why we’re taking applications for the 19th annual NVIDIA Graduate Fellowship Program, seeking students doing outstanding GPU-based research. Our goal: Provide them with grants, mentors and technical support so they can help solve the world’s biggest research problems.

We’re especially seeking doctoral students working in artificial intelligence, machine learning, autonomous vehicles, robotics, AI for healthcare, high performance computing and related fields. Our Graduate Fellowship awards are up to $50,000 per student.

Since its inception in 2002, the Graduate Fellowship Program has awarded over 160 grants worth nearly $5 million.

We’re looking for students who have completed their first year of Ph.D.-level studies at the time of application. Candidates need to be studying computer science, computer engineering, system architecture, electrical engineering or a related area. Applicants must also be investigating innovative ways to use GPUs.

The NVIDIA Graduate Fellowship Program for the 2020-2021 academic year is open to applicants worldwide. The deadline for submitting applications is Sept. 13, 2019. An internship at NVIDIA preceding the fellowship year is now mandatory — eligible candidates should be available for the internship in summer 2020.

For more on eligibility and how to apply, visit the program website or email

The post Applications Open for $50,000 NVIDIA Graduate Fellowship Awards appeared first on The Official NVIDIA Blog.

Next Meetup




Plug yourself into AI and don't miss a beat


Toronto AI is a social and collaborative hub to unite AI innovators of Toronto and surrounding areas. We explore AI technologies in digital art and music, healthcare, marketing, fintech, vr, robotics and more. Toronto AI was founded by Dave MacDonald and Patrick O'Mara.