Author: torontoai

[D] What’s the best explanation of accuracy_per_sequence when evaluating a transformer model in TensorFlow?

Hello!

I’m looking for a good, detailed explanation of the evaluation metric accuracy_per_sequence, as displayed when training the transformer network at https://github.com/tensorflow/models/tree/master/official/transformer

Here is the tensorboard output:

https://i.redd.it/wydu5wvqc1m31.png

I’m assuming it refers to accuracy of sequences per step?

I’m familiar with Top-5 accuracy and such but haven’t found a good explanation of Accuracy per sequence.
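
One plausible reading, which the post itself does not confirm, is that a sequence-level accuracy counts a prediction as correct only when every non-padding token in the sequence matches the label. The sketch below illustrates that reading in plain Python and contrasts it with per-token accuracy. The function names, the padding id of 0, and the toy data are assumptions made for illustration, not the repository's code.

```python
def sequence_accuracy(predictions, labels, pad_id=0):
    """Fraction of sequences whose every non-padding token is predicted correctly."""
    correct = 0
    for pred, gold in zip(predictions, labels):
        # Compare only positions where the label is not padding (assumed pad_id).
        if all(p == g for p, g in zip(pred, gold) if g != pad_id):
            correct += 1
    return correct / len(labels)


def token_accuracy(predictions, labels, pad_id=0):
    """Fraction of non-padding tokens predicted correctly, for comparison."""
    hits = total = 0
    for pred, gold in zip(predictions, labels):
        for p, g in zip(pred, gold):
            if g != pad_id:
                hits += int(p == g)
                total += 1
    return hits / total


# Toy batch: three label sequences padded with 0, and one wrong token in two of them.
labels = [[5, 6, 7, 0], [8, 9, 0, 0], [3, 4, 2, 1]]
predictions = [[5, 6, 7, 0], [8, 1, 0, 0], [3, 4, 2, 9]]

print(sequence_accuracy(predictions, labels))
print(token_accuracy(predictions, labels))
```

Under this reading, a single wrong token makes the whole sequence count as wrong, so the sequence-level number is usually much lower than the per-token number.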

Appreciate the help, Thanks!

J

submitted by /u/jimi_jimi_jimi_

[D] What do you think of ML/DL applications on the new iPhone

I am really curious what the community thinks about the advances Apple seems to be claiming on device with their neural engine and GPUs with each new phone. What applications do you foresee for on-device AI? Training as well or just inference? Where do you think they are taking this? Super curious to hear some opinions on this.

submitted by /u/whichoneisblue

Learning Cross-Modal Temporal Representations from Unlabeled Videos

While people can easily recognize what activities are taking place in videos and anticipate what events may happen next, it is much more difficult for machines. Yet, increasingly, it is important for machines to understand the contents and dynamics of videos for applications, such as temporal localization, action detection and navigation for self-driving cars. In order to train neural networks to perform such tasks, it is common to use supervised training, in which the training data consists of videos that have been meticulously labeled by people on a frame-by-frame basis. Such annotations are hard to acquire at scale. Consequently, there is much interest in self-supervised learning, in which models are trained on various proxy tasks, and the supervision of those tasks naturally resides in the data itself.

In “VideoBERT: A Joint Model for Video and Language Representation Learning” (VideoBERT) and “Contrastive Bidirectional Transformer for Temporal Representation Learning” (CBT), we propose to learn temporal representations from unlabeled videos. The goal is to discover high-level semantic features that correspond to actions and events that unfold over longer time scales. To accomplish this, we exploit the key insight that human language has evolved words to describe high-level objects and events. In videos, speech tends to be temporally aligned with the visual signals, and can be extracted by using off-the-shelf automatic speech recognition (ASR) systems, and thus provides a natural source of self-supervision. Our model is an example of cross-modal learning, as it jointly utilizes the signals from visual and audio (speech) modalities during training.

Image frames and human speech from the same video locations are often semantically aligned. The alignment is non-exhaustive and sometimes noisy, which we hope to mitigate by pretraining on larger datasets. For the left example, the ASR output is, “Keep rolling tight and squeeze the air out to its side and you can kind of pull a little bit.”, where the actions are captured by speech but the objects are not. For the right example, the ASR output is, “This is where you need to be patient patient patient,” which is not related to the visual content at all.

A BERT Model for Videos
The first step of representation learning is to define a proxy task that leads the model to learn temporal dynamics and cross-modal semantic correspondence from long, unlabeled videos. To this end, we generalize the Bidirectional Encoder Representations from Transformers (BERT) model. The BERT model has shown state-of-the-art performance on various natural language processing tasks, by applying the Transformer architecture to encode long sequences, and pretraining on a corpus containing a large amount of text. BERT uses the cloze test as its proxy task, in which the BERT model is forced to predict missing words from context bidirectionally, instead of just predicting the next word in a sequence.

To do this, we generalize the BERT training objective, using image frames combined with the ASR sentence output at the same locations to compose cross-modal “sentences”. The image frames are converted into visual tokens with durations of 1.5 seconds, based on visual feature similarities. They are then concatenated with the ASR word tokens. We train the VideoBERT model to fill in the missing tokens from the visual-text sentences. Our hypothesis, which our experiments support, is that by pretraining on this proxy task, the model learns to reason about longer-range temporal dynamics (visual cloze) and high-level semantics (visual-text cloze).
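
As a rough illustration of the cloze setup described above, the following Python sketch concatenates ASR word tokens with visual tokens and masks a random subset for the model to recover. It is a minimal sketch under stated assumptions: the separator token, the token ordering, the 15% masking rate, the visual token ids, and both function names are hypothetical choices for this example and are not taken from the VideoBERT implementation.

```python
import random

MASK = "[MASK]"
SEP = "[>]"  # hypothetical separator between the text part and the visual part


def build_cross_modal_sentence(asr_words, visual_tokens):
    """Concatenate ASR word tokens with visual tokens from the same video location."""
    return list(asr_words) + [SEP] + list(visual_tokens)


def mask_tokens(tokens, mask_prob, rng):
    """Replace a random subset of tokens with [MASK]; return inputs and targets."""
    inputs, targets = [], {}
    for i, tok in enumerate(tokens):
        if tok != SEP and rng.random() < mask_prob:
            inputs.append(MASK)
            targets[i] = tok  # the model is trained to recover this token
        else:
            inputs.append(tok)
    return inputs, targets


rng = random.Random(0)  # fixed seed so the example is repeatable
asr_words = ["keep", "rolling", "tight"]
visual_tokens = ["v12", "v87", "v03"]  # made-up visual token ids
sentence = build_cross_modal_sentence(asr_words, visual_tokens)
inputs, targets = mask_tokens(sentence, mask_prob=0.15, rng=rng)  # assumed rate
print(inputs)
print(targets)
```

Masking only visual tokens would correspond to the visual cloze task, while masking across both parts would correspond to the visual-text cloze task.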

Illustration of VideoBERT in the context of a video and text masked token prediction, or cloze, task. Bottom: visual and text (ASR) tokens from the same locations of videos are concatenated to form the inputs to VideoBERT. Some visual and text tokens are masked out. Middle: VideoBERT applies the Transformer architecture to jointly encode bidirectional visual-text context. Yellow and pink boxes correspond to the input and output embeddings, respectively. Top: the training objective is to recover the correct tokens for the masked locations.

Inspecting the VideoBERT Model
We trained VideoBERT on over one million instructional videos, such as cooking, gardening and vehicle repair. Once trained, one can inspect what the VideoBERT model learns on a number of tasks to verify that the output accurately reflects the video content. For example, text-to-video prediction can be used to automatically generate a set of instructions (such as a recipe) from video, yielding video segments (tokens) that reflect what is described at each step. In addition, video-to-video prediction can be used to visualize possible future content based on an initial video token.

Qualitative results from VideoBERT, pretrained on cooking videos. Top: Given some recipe text, we generate a sequence of visual tokens. Bottom: Given a visual token, we show the top three future tokens forecast by VideoBERT at different time scales. In this case, the model predicts that a bowl of flour and cocoa powder may be baked in an oven, and may become a brownie or cupcake. We visualize the visual tokens using the images from the training set closest to the tokens in feature space.

To verify if VideoBERT learns semantic correspondences between videos and text, we tested its “zero-shot” classification accuracy on a cooking video dataset in which neither the videos nor annotations were used during pre-training. To perform classification, the video tokens were concatenated with a template sentence “now let me show you how to [MASK] the [MASK]” and the predicted verb and noun tokens were extracted. The VideoBERT model matched the top-5 accuracy of a fully-supervised baseline, indicating that the model is able to perform competitively in this “zero-shot” setting.

Transfer Learning with Contrastive Bidirectional Transformers
While VideoBERT showed impressive results in learning how to automatically label and predict video content, we noticed that the visual tokens used by VideoBERT can lose fine-grained visual information, such as smaller objects and subtle motions. To explore this, we propose the Contrastive Bidirectional Transformers (CBT) model, which removes this tokenization step, and we further evaluate the quality of the learned representations by transfer learning on downstream tasks. CBT applies a different loss function, the contrastive loss, in order to maximize the mutual information between the masked positions and the rest of the cross-modal sentences. We evaluated the learned representations for a diverse set of tasks (e.g., action segmentation, action anticipation and video captioning) and on various video datasets. The CBT approach outperforms the previous state of the art by significant margins on most benchmarks. We observe that: (1) the cross-modal objective is important for transfer learning performance; (2) a bigger and more diverse pre-training set leads to better representations; (3) compared with baseline methods such as average pooling or LSTMs, the CBT model is much better at utilizing long temporal context.
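
To make the idea of a contrastive loss concrete, here is a generic InfoNCE-style sketch in plain Python. It is not the CBT loss as implemented by the authors, whose details are not given here. It assumes a dot-product score, a single positive, and a handful of negatives, and all names and vectors are illustrative. Minimizing a loss of this form is commonly understood to maximize a lower bound on mutual information between the predicted and true features.

```python
import math


def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(predicted, positive, negatives):
    """Negative log-probability of the positive under a softmax over all candidates."""
    scores = [dot(predicted, positive)] + [dot(predicted, n) for n in negatives]
    # Subtract the max score before exponentiating, for numerical stability.
    m = max(scores)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denominator - scores[0]


# Toy example: the model output at a masked position, the true feature,
# and two distractor features drawn from elsewhere (all made up).
predicted = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
print(contrastive_loss(predicted, positive, negatives))
```

The loss shrinks as the predicted vector scores the true feature higher than the distractors, which avoids having to classify into a fixed vocabulary of visual tokens.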

Action anticipation accuracy with the CBT approach from untrimmed videos with 200 activity classes. We compare with AvgPool and LSTM, and report performance when the observation time is 15, 30, 45 and 72 seconds.

Conclusion & future work
Our results demonstrate the power of the BERT model for learning visual-linguistic and visual representations from unlabeled videos. We find that our models are not only useful for zero-shot action classification and recipe generation, but the learned temporal representations also transfer well to various downstream tasks, such as action anticipation. Future work includes learning low-level visual features jointly with long-term temporal representations, which should enable better adaptation to the video context. Furthermore, we plan to expand the set of pre-training videos to be larger and more diverse.

Acknowledgements
The core team includes Chen Sun, Fabien Baradel, Austin Myers, Carl Vondrick, Kevin Murphy and Cordelia Schmid. We would like to thank Jack Hessel, Bo Pang, Radu Soricut, Baris Sumengen, Zhenhai Zhu, and the BERT team for sharing amazing tools that greatly facilitated our experiments. We also thank Justin Gilmer, Abhishek Kumar, Ben Poole, David Ross, and Rahul Sukthankar for helpful discussions.

[D] How do you feel Machine Learning will affect video games?

I’m studying to become a video game designer, and these advances in machine learning are proooobably gonna affect my chosen career at some point. I know there’s stuff already, like AlphaStar being a Pretty Darn Good SC2 AI, and this paper where character face sliders are preselected based off of a photo.

There’s one application I can already see happening in the short term: right now, in story driven games, characters awkwardly avoid referring to the player by their name (with one notable exception in Fallout 4). But, I could easily see AIs trained off of a game’s voice actors adjusting in-game recorded dialogue to include synthesized clips of the player’s name. Even on lower end machines, these clips could be generated in the background so they’re ready when needed.

submitted by /u/varkarrus

[D] Directed acyclic graph and the definition of causality

I listened to a very interesting talk at MAIS 2019 last Friday about a novel approach to learning DAGs using neural networks (all the details in this paper here: arXiv:1906.02226). It’s far from my actual discipline of sensor design and data processing, but I still went to speak with the author at the poster session afterwards.

We didn’t go into the details of the technique; instead, we had a discussion about how there doesn’t seem to exist a usable definition of causality in terms of graph analysis. He said that causality is something we all kind of agree on, but that we can’t define. For example, the direction of the causality arrow between the average temperature and the altitude of a city is clear. If we magically changed the altitude, the average temperature would change, while if we magically changed the average temperature, the altitude wouldn’t change. Therefore, the direction of causality is from “altitude” to “average temperature”.

From my readings in cosmology and thermodynamics, I realized there seems to be a very similar concept that would benefit from being shared here. At least I hope so, it’s sometimes hard to know the exact boundaries 😉

Here is a proposed definition for causality: a causal relationship R from set A to set B is a function transforming A into B such that information that was available in A is lost when working with the set B.

It means that a system that can be uniquely described in A cannot be uniquely described in B and it is impossible to know exactly which element from A was mapped to an element from B. In that sense, the set A has a greater information content than the set B and the function R reduces the amount of information available in the set.
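
As a toy illustration of this proposed definition, the Python sketch below applies a many-to-one function to a small set and compares the Shannon entropy before and after. Everything specific in it is an assumption chosen for the example: the set A, the function R, and the uniform distribution over the elements of A.

```python
import math
from collections import Counter


def shannon_entropy(outcomes):
    """Shannon entropy in bits of the empirical distribution of a list of outcomes."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def R(a):
    """Hypothetical many-to-one map: two elements of A land on each element of B."""
    return a // 2


A = list(range(8))      # eight equally likely states
B = [R(a) for a in A]   # image of A under R

print(shannon_entropy(A))  # entropy of the source set
print(shannon_entropy(B))  # entropy after applying R
```

Because R is not injective, knowing an element of B does not identify which element of A produced it, and the entropy of B is lower than that of A.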

For large-scale phenomena, classical physics tells us that each process is deterministic (i.e., it maps one unique state to one other unique state), but we must also take into account the passage of time. The chronological order of the events dictates the direction of causality. This is where it gets interesting in my opinion: the arrow of time as defined by physicist Sean Carroll (book, multiple articles) is deeply linked to the evolution of the entropy of the universe. The entropy itself is closely related to information content, from the definition of the Shannon Entropy.

It all comes back to the fact that causality points from a set containing more information to a set containing less information, and not the other way around.

I hope it makes sense and there’s probably a better way to write it all and make the explanation clearer, but I feel like there’s something useful about that.
For example, if we find a causal link between two variables that seems to go against the above definition, it probably means that we are missing some information about the first set, or that the second set is not described in a very “compact” way and has redundant information.

Thanks for your comments!

submitted by /u/i_love_FFT

Let Them Eat Take-Out: Kiwibots Bring Sustenance to Students

College students are many things — sleepy, overly caffeinated, stressed — but above all, they are hungry. Kiwi Campus is here to help.

Co-founder and CEO of Kiwi Campus, Felipe Chávez, joined AI Podcast host Noah Kravitz to talk about Kiwi and its delivery service.

Based in Berkeley, Calif., the company specializes in creating a robotic ecosystem for last-mile delivery. Its solution is the Kiwibot. The small autonomous robot delivers orders seven days a week from 10 a.m. to 8 p.m. Its coverage area includes UC Berkeley and the surrounding streets.

Chávez, originally from Colombia, noticed how expensive it was to have food delivered in the US. He says online food ordering ranks at about 20 percent in Latin America’s largest cities. By contrast, when he moved to America, “it was 6 percent two years ago, and now it’s 9 percent.”

Given the American economy and level of productivity, Chávez says, “It’s insane that we’re not ordering several times per day.”

Kiwi Campus has a unique delivery system. It starts with Kiwi Trike, an autonomous tricycle, that brings Kiwibots to restaurants. Kitchen staff load the order into the Kiwibots, which then complete the final legs of the journey.

The Kiwibot runs on a blend of AI and human input. The bots themselves use a Jetson TX2, six ultra-HD cameras, and radar to navigate the streets of Berkeley. Chávez realized that the best way to avoid high-risk situations would be to incorporate human input.

Kiwi’s human workers are based in Colombia. Each person is assigned to three robots and provides observations, with a latency of just five seconds. Their role is to ensure that the robots “are in the correct direction. Also, sometimes we have a behavioral neural network that keeps the robot centered in the sidewalk but sometimes it’s not, so they keep it centered, and also giving extra input about position.”

Human observations also are key in crossing the street. Kiwibots are “crossing 2,000 streets per day,” says Chávez. Before each crossing, humans confirm the input each Kiwibot receives from traffic lights. They then cross the street safely.

This approach seems to be working — Kiwi Campus has had more than 30,000 orders in the last 10 months.

Chávez promises that Kiwi Campus will soon be on more than 10 campuses. In the meantime, you can visit their website to learn more, or connect with Chávez on Twitter at @felipekiwi90.

The post Let Them Eat Take-Out: Kiwibots Bring Sustenance to Students appeared first on The Official NVIDIA Blog.

[P] Fine-Tuned GPT-2 to generate new plots from IMDB’s top 250 movies

Hello r/MachineLearning,

>Link to the Twitter Plot Bot<

I’ve seen a post about the Trump bot and I don’t know why it has been removed. But I’d like to share the bot that u/Schnox has created by fine-tuning GPT-2 (774M) with r/WritingPrompts. Then we fed it with IMDB plotlines of the top 250 movies to create new narratives for all the best movies we like and love.
The results are not cherry-picked; this is the unfiltered output of the algorithm.

We’d appreciate your feedback on the project! Feel free to post questions or check out the code at github.com/hansbambel/storytelling_gpt2

submitted by /u/scientist_1337

[D] Batch Normalization is a Cause of Adversarial Vulnerability

Abstract – Batch normalization (batch norm) is often used in an attempt to stabilize and accelerate training in deep neural networks. In many cases it indeed decreases the number of parameter updates required to achieve low training error. However, it also reduces robustness to small adversarial input perturbations and noise by double-digit percentages, as we show on five standard data-sets. Furthermore, substituting weight decay for batch norm is sufficient to nullify the relationship between adversarial vulnerability and the input dimension. Our work is consistent with a mean-field analysis that found that batch norm causes exploding gradients.

Page – https://arxiv.org/abs/1905.02161

PDF – https://arxiv.org/pdf/1905.02161.pdf

Has anyone read the paper and experienced robustness issues with deployment of Batchnorm models in the real world?

submitted by /u/aseembits93