

Author: torontoai

[D] Which SOTA authorship attribution / text classification model to use?

I’m currently doing research for my thesis project and was wondering which models to experiment with. I have a large dataset of political speeches (around 180,000), each annotated with the speaker’s party (10 parties total), and would like a model to learn to predict the party from the speech.

My question is, which model is currently best for this type of task? I have some experience with Bi-LSTM models, and also CNN with LSTM. However, I’m very interested in whether other models would perform better at this task, or whether you have any experience with the architecture of this type of model.

submitted by /u/mikkelmedm

Rapid large-scale fractional differencing to minimize memory loss while making a time series stationary. 6x-400x speed up over CPU implementation.

Happy to launch GFD: GPU-accelerated Fractional Differencing. The single-GPU RAPIDS cuDF implementation gives a substantial 6x-400x speed-up over a NumPy/Pandas CPU implementation.

Feel free to play with the code on Google Colab, run it on GCP/AWS or your local machine with the entirely self-contained notebook.

Summary

Typically we try to make a time series stationary by transforming it, most commonly with integer differencing. However, integer differencing removes more memory than is needed to achieve stationarity. An alternative, fractional differencing, achieves stationarity while retaining as much memory as possible. Existing CPU-based implementations are too slow for fractional differencing of many large-scale time series, whereas our GPU-based implementation runs up to 400x faster on a single machine.
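To make the idea concrete, here is a minimal CPU sketch of fixed-window fractional differencing in plain Python. It is not the GFD/cuDF code from the repository; the function names and the window length `n_weights` are assumptions chosen for illustration. The weights come from the binomial expansion of (1 - B)^d, where B is the lag operator.

```python
def frac_diff_weights(d, n_weights):
    """Return the first n_weights coefficients of (1 - B)^d."""
    weights = [1.0]
    for k in range(1, n_weights):
        # Recurrence from the binomial series: w_k = -w_{k-1} * (d - k + 1) / k
        weights.append(-weights[-1] * (d - k + 1) / k)
    return weights


def frac_diff(series, d, n_weights=32):
    """Fixed-window fractional differencing.

    The first n_weights - 1 outputs are None because the window is not yet full.
    """
    weights = frac_diff_weights(d, n_weights)
    out = [None] * len(series)
    for t in range(n_weights - 1, len(series)):
        # Weighted sum over the current value and the n_weights - 1 values before it
        out[t] = sum(w * series[t - k] for k, w in enumerate(weights))
    return out


prices = [float(i * i) for i in range(10)]
print(frac_diff_weights(0.5, 4))   # for fractional d the weights decay gradually
print(frac_diff(prices, 1.0, 4))   # d = 1 reduces to ordinary first differences
```

With d = 1 the weights collapse to [1, -1, 0, ...], which is plain integer differencing. For a fractional d such as 0.5, older observations keep a small non-zero weight, which is the sense in which more memory is retained.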

Code

https://github.com/ritchieng/fractional_differencing_gpu

Presentation

https://www.researchgate.net/publication/335159299_GFD_GPU_Fractional_Differencing_for_Rapid_Large-scale_Stationarizing_of_Time_Series_Data_while_Minimizing_Memory_Loss

submitted by /u/ritchieng

[N] Trump falsely claims Google ‘manipulated’ millions of 2016 votes

https://www.cnn.com/2019/08/19/politics/trump-google-manipulated-votes-claim/index.html

The referenced article: https://aibrt.org/downloads/EPSTEIN_et_al_2017-SUMMARY-A_Method_for_Detecting_Bias_in_Search_Rankings-EMBARGOED_until_March_14_2017.pdf

Key point from the article referenced by CNN’s story: Was the bias the same for all search engines? No. The level of pro-Clinton bias we found on Google (0.19) was more than twice as high as the level of pro-Clinton bias we found on Yahoo (0.09).

Among other issues, one thing that CNN did not mention is the presumption that Google is wrong and Yahoo correct, given that there is no ground truth to compare against. Perhaps there were simply more pro-Clinton articles and news items appearing in those days. And more generally, I would guess that Yahoo’s and Google’s engines are simply different algorithms showing different things.

Before someone complains: yes, PageRank was considered “machine learning”, though not deep learning of course. It feels more like graph theory to me.

submitted by /u/errorsignal

[D] “Inverse Design” to create new optical chip components

I hope discussions of ML applications are OK in this sub. I recently came across this article about researchers in photonics, a field that doesn’t have many analytical equations for calculating performance by hand, who use some basic ML techniques to create high-performance components for photonic integrated circuits. They start with a black box, feed in the desired output performance, and then use basic electromagnetic boundary conditions and ML to work backward to the design that would be required to get there. They call this “inverse design”.

This paper goes into it a little more and shows an example of the result of the technique: https://arxiv.org/pdf/1504.00095.pdf

submitted by /u/gburdell

[D] Is vision a solved problem?

I am curious, as I’ve been thinking about this for a while. It seems to me that we keep making improvements, but that there is not a ton left to solve within this subfield. I don’t claim to be an expert by any stretch, but with all of the advancements so far we are capable of object detection, classification, image captioning, and image generation, and as we close in on depth perception and improvements in 3D, I feel like we are mostly finding new applications for the tools we already have.

Thoughts?

I would love for someone to step in, call me a simpleton and give me all the reasons I am wrong and all of the problems we have yet to address within this space. 🙂

submitted by /u/Awill1aB

On-Device, Real-Time Hand Tracking with MediaPipe

The ability to perceive the shape and motion of hands can be a vital component in improving the user experience across a variety of technological domains and platforms. For example, it can form the basis for sign language understanding and hand gesture control, and can also enable the overlay of digital content and information on top of the physical world in augmented reality. While coming naturally to people, robust real-time hand perception is a decidedly challenging computer vision task, as hands often occlude themselves or each other (e.g. finger/palm occlusions and hand shakes) and lack high contrast patterns.

Today we are announcing the release of a new approach to hand perception, which we previewed at CVPR 2019 in June, implemented in MediaPipe, an open source cross-platform framework for building pipelines to process perceptual data of different modalities, such as video and audio. This approach provides high-fidelity hand and finger tracking by employing machine learning (ML) to infer 21 3D keypoints of a hand from just a single frame. Whereas current state-of-the-art approaches rely primarily on powerful desktop environments for inference, our method achieves real-time performance on a mobile phone, and even scales to multiple hands. We hope that providing this hand perception functionality to the wider research and development community will result in an emergence of creative use cases, stimulating new applications and new research avenues.

3D hand perception in real-time on a mobile phone via MediaPipe. Our solution uses machine learning to compute 21 3D keypoints of a hand from a video frame. Depth is indicated in grayscale.

An ML Pipeline for Hand Tracking and Gesture Recognition
Our hand tracking solution utilizes an ML pipeline consisting of several models working together:

  • A palm detector model (called BlazePalm) that operates on the full image and returns an oriented hand bounding box.
  • A hand landmark model that operates on the cropped image region defined by the palm detector and returns high fidelity 3D hand keypoints.
  • A gesture recognizer that classifies the previously computed keypoint configuration into a discrete set of gestures.

This architecture is similar to that employed by our recently published face mesh ML pipeline and that others have used for pose estimation. Providing the accurately cropped palm image to the hand landmark model drastically reduces the need for data augmentation (e.g. rotations, translation and scale) and instead allows the network to dedicate most of its capacity towards coordinate prediction accuracy.

Hand perception pipeline overview.

BlazePalm: Realtime Hand/Palm Detection
To detect initial hand locations, we employ a single-shot detector model called BlazePalm, optimized for mobile real-time uses in a manner similar to BlazeFace, which is also available in MediaPipe. Detecting hands is a decidedly complex task: our model has to work across a variety of hand sizes with a large scale span (~20x) relative to the image frame and be able to detect occluded and self-occluded hands. Whereas faces have high contrast patterns, e.g., in the eye and mouth region, the lack of such features in hands makes it comparatively difficult to detect them reliably from their visual features alone. Instead, providing additional context, like arm, body, or person features, aids accurate hand localization.

Our solution addresses the above challenges using different strategies. First, we train a palm detector instead of a hand detector, since estimating bounding boxes of rigid objects like palms and fists is significantly simpler than detecting hands with articulated fingers. In addition, as palms are smaller objects, the non-maximum suppression algorithm works well even for two-hand self-occlusion cases, like handshakes. Moreover, palms can be modelled using square bounding boxes (anchors in ML terminology), ignoring other aspect ratios and thereby reducing the number of anchors by a factor of 3-5. Second, an encoder-decoder feature extractor is used for bigger scene context awareness even for small objects (similar to the RetinaNet approach). Lastly, we minimize the focal loss during training to support the large number of anchors resulting from the high scale variance.
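As a rough illustration of why compact palm boxes help non-maximum suppression, here is a minimal greedy NMS sketch in plain Python. It is not BlazePalm's implementation; the box format (x_min, y_min, x_max, y_max), the function names, and the 0.3 IoU threshold are assumptions made for this example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.3):
    """Greedy NMS: keep the best-scoring box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two square palm boxes that overlap slightly (as in a handshake),
# plus a near-duplicate detection of the first palm.
boxes = [(0, 0, 10, 10), (8, 0, 18, 10), (0.5, 0, 10.5, 10)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # indices of the boxes that survive
```

In this toy case the two palms overlap only slightly, so both survive, while the near-duplicate is suppressed. Larger boxes around whole hands with fingers would overlap more in a handshake and be more likely to suppress each other.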

With the above techniques, we achieve an average precision of 95.7% in palm detection. Using a regular cross entropy loss and no decoder gives a baseline of just 86.22%.
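For readers unfamiliar with the two losses being compared, the following is a minimal per-anchor sketch of binary cross entropy and focal loss. The values gamma = 2.0 and alpha = 0.25 are commonly used defaults from the focal loss literature and are an assumption here; the post does not state which values were used.

```python
import math

EPS = 1e-7  # keeps log() away from zero


def binary_cross_entropy(p, y):
    """Cross entropy for one anchor: p is the predicted probability, y is 0 or 1."""
    p = min(max(p, EPS), 1.0 - EPS)
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss: cross entropy scaled by (1 - p_t)^gamma and a class weight."""
    p = min(max(p, EPS), 1.0 - EPS)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy background anchor (confidently predicted as background)
# versus a hard positive anchor (palm present but predicted at 0.1).
for label, p, y in [("easy negative", 0.01, 0), ("hard positive", 0.10, 1)]:
    ratio = focal_loss(p, y) / binary_cross_entropy(p, y)
    print(label, "focal / cross-entropy ratio:", ratio)
```

The ratio is far smaller for the easy negative than for the hard positive, so the many easy background anchors contribute little to the total loss. That is the property that makes a large anchor count manageable.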

Hand Landmark Model
After palm detection over the whole image, our subsequent hand landmark model performs precise keypoint localization of 21 3D hand-knuckle coordinates inside the detected hand regions via regression, that is, direct coordinate prediction. The model learns a consistent internal hand pose representation and is robust even to partially visible hands and self-occlusions.

To obtain ground truth data, we have manually annotated ~30K real-world images with 21 3D coordinates, as shown below (we take the Z-value from the image depth map, if it exists for the corresponding coordinate). To better cover the possible hand poses and provide additional supervision on the nature of hand geometry, we also render a high-quality synthetic hand model over various backgrounds and map it to the corresponding 3D coordinates.

Top: Aligned hand crops passed to the tracking network with ground truth annotation. Bottom: Rendered synthetic hand images with ground truth annotation

However, purely synthetic data generalizes poorly to the in-the-wild domain. To overcome this problem, we utilize a mixed training schema. A high-level model training diagram is presented in the following figure.

Mixed training schema for hand tracking network. Cropped real-world photos and rendered synthetic images are used as input to predict 21 3D keypoints.

The table below summarizes regression accuracy depending on the nature of the training data. Using both synthetic and real world data results in a significant performance boost.

Dataset | Mean regression error normalized by palm size
Only real-world | 16.1%
Only rendered synthetic | 25.7%
Mixed real-world + synthetic | 13.4%
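One plausible reading of "mean regression error normalized by palm size" is the mean Euclidean distance between predicted and annotated keypoints, divided by a palm-size measurement and expressed as a percentage. The sketch below follows that reading. The exact definition used in the post, including how palm size is measured, is not given, so both the formula and the function name are assumptions.

```python
import math


def normalized_keypoint_error(predicted, ground_truth, palm_size):
    """Mean Euclidean keypoint error as a percentage of palm size.

    predicted and ground_truth are equal-length lists of (x, y, z) tuples;
    palm_size is a positive length in the same units (its definition is assumed).
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("need two non-empty keypoint lists of equal length")
    if palm_size <= 0:
        raise ValueError("palm_size must be positive")
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return 100.0 * (sum(distances) / len(distances)) / palm_size


# 21 keypoints, each prediction off by 5 units, on a palm 50 units wide.
truth = [(float(i), 0.0, 0.0) for i in range(21)]
pred = [(x + 3.0, y + 4.0, z) for x, y, z in truth]
print(normalized_keypoint_error(pred, truth, palm_size=50.0))
```

Normalizing by palm size makes the metric comparable between hands that appear at different scales in the image.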

Gesture Recognition
On top of the predicted hand skeleton, we apply a simple algorithm to derive the gestures. First, the state of each finger, e.g. bent or straight, is determined by the accumulated angles of joints. Then we map the set of finger states to a set of pre-defined gestures. This straightforward yet effective technique allows us to estimate basic static gestures with reasonable quality. The existing pipeline supports counting gestures from multiple cultures, e.g. American, European, and Chinese, and various hand signs including “Thumb up”, closed fist, “OK”, “Rock”, and “Spiderman”.
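The two steps described above, finger states from accumulated joint angles followed by a lookup into pre-defined gestures, can be sketched as follows. This is an illustration, not the released recognizer. The keypoint layout (index 0 is the wrist, then four keypoints per finger from thumb to pinky), the 60 degree threshold, and the small gesture table are all assumptions.

```python
import math

# Assumed layout: keypoint 0 is the wrist, then four keypoints per finger.
FINGERS = [
    ("thumb", [1, 2, 3, 4]),
    ("index", [5, 6, 7, 8]),
    ("middle", [9, 10, 11, 12]),
    ("ring", [13, 14, 15, 16]),
    ("pinky", [17, 18, 19, 20]),
]

# Hypothetical table: (thumb, index, middle, ring, pinky) straight? -> gesture name
GESTURES = {
    (False, False, False, False, False): "Fist",
    (True, False, False, False, False): "Thumb up",
    (False, True, False, False, True): "Rock",
    (True, True, False, False, True): "Spiderman",
}


def angle_between(u, v):
    """Angle in degrees between two 3D vectors (0 if either has zero length)."""
    norm = math.hypot(*u) * math.hypot(*v)
    if norm == 0:
        return 0.0
    cosine = sum(a * b for a, b in zip(u, v)) / norm
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


def finger_is_straight(keypoints, chain, max_bend_deg=60.0):
    """A finger counts as straight if its accumulated joint angles stay small."""
    points = [keypoints[0]] + [keypoints[i] for i in chain]
    bones = [tuple(b - a for a, b in zip(p, q)) for p, q in zip(points, points[1:])]
    total_bend = sum(angle_between(u, v) for u, v in zip(bones, bones[1:]))
    return total_bend < max_bend_deg


def classify_gesture(keypoints):
    """Map the five finger states to a gesture name, or "Unknown"."""
    states = tuple(finger_is_straight(keypoints, chain) for _, chain in FINGERS)
    return GESTURES.get(states, "Unknown")
```

Because the lookup works on discrete finger states, adding a static gesture only means adding a row to the table. Gestures such as “OK”, which depend on fingertips touching, would need an extra distance check that this sketch leaves out.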

Implementation via MediaPipe
With MediaPipe, this perception pipeline can be built as a directed graph of modular components, called Calculators. MediaPipe comes with an extendable set of Calculators to solve tasks like model inference, media processing algorithms, and data transformations across a wide variety of devices and platforms. Individual calculators like cropping, rendering and neural network computations can be performed exclusively on the GPU. For example, we employ TFLite GPU inference on most modern phones.

Our MediaPipe graph for hand tracking is shown below. The graph consists of two subgraphs—one for hand detection and one for hand keypoints (i.e., landmark) computation. One key optimization MediaPipe provides is that the palm detector is only run as necessary (fairly infrequently), saving significant computation time. We achieve this by inferring the hand location in the subsequent video frames from the computed hand key points in the current frame, eliminating the need to run the palm detector over each frame. For robustness, the hand tracker model outputs an additional scalar capturing the confidence that a hand is present and reasonably aligned in the input crop. Only when the confidence falls below a certain threshold is the hand detection model reapplied to the whole frame.
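The detector-skipping logic described above can be sketched as a plain Python loop. This is not MediaPipe's graph or calculator API. The callables `detect_palm`, `predict_landmarks`, and `crop_from_keypoints`, as well as the 0.5 threshold, are hypothetical stand-ins supplied by the caller. As a simplification, a lost hand is re-detected on the next frame rather than on the current one.

```python
def track_hands(frames, detect_palm, predict_landmarks, crop_from_keypoints,
                min_confidence=0.5):
    """Run the landmark model on every frame, but the palm detector only when needed.

    detect_palm(frame) -> crop or None            (expensive, full frame)
    predict_landmarks(frame, crop) -> (keypoints, confidence)
    crop_from_keypoints(keypoints) -> crop for the next frame
    """
    results = []
    crop = None  # no hand is being tracked yet
    for frame in frames:
        if crop is None:
            crop = detect_palm(frame)
            if crop is None:
                results.append(None)  # no hand found in this frame
                continue
        keypoints, confidence = predict_landmarks(frame, crop)
        if confidence < min_confidence:
            # Tracking lost: drop the crop so the detector runs on the next frame.
            crop = None
            results.append(None)
        else:
            results.append(keypoints)
            # Reuse this frame's keypoints to locate the hand in the next frame.
            crop = crop_from_keypoints(keypoints)
    return results
```

With a steady hand in view, the detector runs once at the start and again only after the confidence drops, which is where the computational saving comes from.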

The hand landmark model’s output (REJECT_HAND_FLAG) controls when the hand detection model is triggered. This behavior is achieved by MediaPipe’s powerful synchronization building blocks, resulting in high performance and optimal throughput of the ML pipeline.

A highly efficient ML solution that runs in real-time and across a variety of different platforms and form factors involves significantly more complexities than the above simplified description captures. To this end, we are open sourcing the above hand tracking and gesture recognition pipeline in the MediaPipe framework, accompanied by the relevant end-to-end usage scenario and source code, here. This provides researchers and developers with a complete stack for experimentation and prototyping of novel ideas based on our model.

Future Directions
We plan to extend this technology with more robust and stable tracking, increase the number of gestures we can reliably detect, and support dynamic gestures unfolding in time. We believe that publishing this technology can give an impulse to new creative ideas and applications by the members of the research and developer community at large. We are excited to see what you can build with it!

Acknowledgements
Special thanks to all our team members who worked on the tech with us: Andrey Vakunov, Andrei Tkachenka, Yury Kartynnik, Artsiom Ablavatski, Ivan Grishchenko, Kanstantsin Sokal, Mogan Shieh, Ming Guang Yong, Anastasia Tkach, Jonathan Taylor, Sean Fanello, Sofien Bouaziz, Juhyun Lee, Chris McClanahan, Jiuqiang Tang, Esha Uboweja, Hadon Nash, Camillo Lugaresi, Michael Hays, Chuo-Ling Chang, Matsvei Zhdanovich and Matthias Grundmann.

[P] Google’s WaveNet API so good that its synthetic speech can be used to train hotword detectors with no ‘real’ data?


TLDR: Google TTS -> simple noise augmentation -> {wav files} -> Snowboy -> {.pmdl models} -> Raspberry Pi

So, I trained a black-box deep net hotword detector (using Snowboy/kitt.ai) entirely on synthetic speech samples generated with Google’s Text-to-Speech API, and it was able to ‘transfer to the real world’ on a Raspberry Pi 3. Not entirely shocking, but reasonably neat I suppose, given that you need to spend $0 for this (free GC credits + 100 free API calls from Snowboy + Colab).
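The post does not say what the "simple noise" augmentation consists of. One common choice, assumed here purely for illustration, is adding white Gaussian noise at a chosen signal-to-noise ratio. The function name and parameters below are hypothetical and are not taken from the linked notebooks.

```python
import math
import random


def add_white_noise(samples, snr_db, rng=None):
    """Return a copy of the waveform with Gaussian noise added at the given SNR (dB).

    samples is a list of floats; rng is an optional random.Random for reproducibility.
    """
    rng = rng or random.Random()
    signal_power = sum(s * s for s in samples) / len(samples)
    # SNR(dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


# One second of a 440 Hz tone at 16 kHz standing in for a synthesized utterance.
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
noisy_variants = [add_white_noise(tone, snr_db, random.Random(seed))
                  for seed, snr_db in enumerate([20, 10, 5])]
print(len(noisy_variants), "augmented copies of", len(tone), "samples each")
```

Generating several copies of each synthetic utterance at different SNRs is one way to expose the detector to conditions closer to a real microphone.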

Project picture:

The final hardware setup

I’d posit that, at least for this problem space, we are not too far from a point where we can do text->model generation directly, sans any data collection.

Blog: https://towardsdatascience.com/build-your-own-custom-hotword-detector-with-zero-training-data-and-0-35adfa6b25ea

Code/Colab notebooks (pre-cleanup :P) : https://github.com/vinayprabhu/BurningMan2019

Demo Video: https://www.youtube.com/watch?time_continue=1&v=kIigaO6Iga0

submitted by /u/VinayUPrabhu