Business Intelligence Specialist – Altus Group – Toronto, ON
From Altus Group – Thu, 17 Oct 2019 23:51:08 GMT – View all Toronto, ON jobs
I recall reading in the last month a paper with the topic described in the title, but despite quite a lot of searching, have not been able to find it. If I recall correctly, the authors found that they could produce just about any sentence with an appropriate initial state.
Alternately, if anyone is aware of other work on this topic, references would be appreciated.
I am aware of this work, and don’t think that it’s what I’m thinking of.
submitted by /u/davisyoshida
[link] [comments]
*or non gaussian
I get the point in having your features be of the same scaling but wouldn’t min max scaling (x-xmin/xmax-xmin) work better? Every tutorial Ive seen about data cleaning seems to use the StandardScaler() without looking at the underlying data.
submitted by /u/Trevahhhh
[link] [comments]
Deep learning applications are data hungry. The more high-quality labeled data a developer feeds an AI model, the more accurate its inferences.
But creating robust datasets is the biggest obstacle for data scientists and developers building machine learning models, says Gaurav Gupta, CEO of TrainingData.io, a member of the NVIDIA Inception virtual accelerator program.
The startup has created a web platform to help researchers and companies manage their data labeling workflow and use AI-assisted segmentation tools to improve the quality of their training datasets.
“When the labels are accurate, then the AI models learn faster and they reach higher accuracy faster,” said Gupta.
The company’s web interface, which runs on NVIDIA T4 GPUs for inference in Google Cloud, helped one healthcare radiology customer speed up labeling by 10x and decrease its labeling error rate by more than 15 percent.
The higher the data quality, the less data needed to achieve accurate results. A machine learning model can produce the same results after training on a million images with low-accuracy labels, Gupta says, or just 100,000 images with high-accuracy labels.
Getting data labeling right the first time is no easy task. Many developers outsource data labeling to companies or crowdsourced workers. It may take weeks to get back the annotated datasets, and the quality of the labels is often poor.
A rough annotated image of a car on the street, for example, may have a segmentation polygon around it that also includes part of the pavement, or doesn’t reach all the way to the roof of the car. Since neural networks parse images pixel by pixel, every mislabeled pixel makes the model less precise.
That margin of error is unacceptable for training a neural network that will eventually interact with people and objects in the real world — for example, identifying tumors from an MRI scan of the brain or controlling an autonomous vehicle.
Developers can manage their data labeling through TrainingData.io’s web interface, while administrators can assign image labeling tasks to annotators, view metrics about individual data labelers’ performance and review the actual image annotations.
When a data scientist first runs a machine learning model, it may only be 60 percent accurate. The developer then iterates several times to improve the performance of the neural network, each time adding new training data.
TrainingData.io is helping AI developers across industries use their early-stage machine learning models to ease the process of labeling new training data for future versions of the neural networks — a process known as active learning.
With this technique, the developer’s initial machine learning model can take the first pass at annotating the next set of training data. Instead of starting from scratch, annotators can just go through and tweak the AI-generated labels, saving valuable time and resources.
The startup offers active learning for data labeling across multiple industries. For healthcare data labeling, its platform integrates with the NVIDIA Clara Deploy SDK, allowing customers to use the software toolkit for AI-assisted segmentation of healthcare datasets.
TrainingData.io chose to deploy its platform on cloud-based GPUs to easily scale usage up and down based on customer demand. Researchers and companies using the tool can choose whether to use the interface online, connected to the cloud backend, or instead use a containerized application running on their own on-premises GPU system.
“It’s important for AI teams in healthcare to be able to protect patient information,” Gupta said. “Sometimes it’s necessary for them to manage the workflow of annotating data and training their machine learning models within the security of their private network. That’s why we provide Docker images to support on-premises annotation on local datasets.”
Balzano, a Swiss startup building deep learning models for radiologists, is using TrainingData.io’s platform linked to an on-premises server of NVIDIA V100 Tensor Core GPUs. To develop training datasets for its musculoskeletal orthopedics AI tools, the company labels a few hundred radiology images each month. Adopting TrainingData.io’s interface saved the company a year’s worth of engineering effort compared to building a similar solution from scratch.
“TrainingData.io’s features allow us to annotate and segment anatomical features of the knee and cartilage more efficiently,” said Stefan Voser, chief operating officer and product manager at Balzano, which is also an Inception program member. “As we ramp up the annotation process, this platform will allow us to leverage AI capabilities and ensure the segmented images are high quality.”
Balzano and TrainingData.io will showcase their latest demos in NVIDIA booth 10939 at the annual meeting of the Radiological Society of North America, Dec. 1-6 in Chicago.
The post Put AI Label on It: Startup Aids Annotators of Healthcare Training Data appeared first on The Official NVIDIA Blog.
Video understanding is a challenging problem. Because a video contains spatio-temporal data, its feature representation is required to abstract both appearance and motion information. This is not only essential for automated understanding of the semantic content of videos, such as web-video classification or sport activity recognition, but is also crucial for robot perception and learning. Just like humans, an input from a robot’s camera is seldom a static snapshot of the world, but takes the form of a continuous video.
The abilities of today’s deep learning models are greatly dependent on their neural architectures. Convolutional neural networks (CNNs) for videos are normally built by manually extending known 2D architectures such as Inception and ResNet to 3D or by carefully designing two-stream CNN architectures that fuse together both appearance and motion information. However, designing an optimal video architecture to best take advantage of spatio-temporal information in videos still remains an open problem. Although neural architecture search (e.g., Zoph et al, Real et al) to discover good architectures has been widely explored for images, machine-optimized neural architectures for videos have not yet been developed. Video CNNs are typically computation- and memory-intensive, and designing an approach to efficiently search for them while capturing their unique properties has been difficult.
In response to these challenges, we have conducted a series of studies into automatic searches for more optimal network architectures for video understanding. We showcase three different neural architecture evolution algorithms: learning layers and their module configuration (EvaNet); learning multi-stream connectivity (AssembleNet); and building computationally efficient and compact networks (TinyVideoNet). The video architectures we developed outperform existing hand-made models on multiple public datasets by a significant margin, and demonstrate a 10x~100x improvement in network runtime.
EvaNet: The first evolved video architectures
EvaNet, which we introduce in “Evolving Space-Time Neural Architectures for Videos” at ICCV 2019, is the very first attempt to design neural architecture search for video architectures. EvaNet is a module-level architecture search that focuses on finding types of spatio-temporal convolutional layers as well as their optimal sequential or parallel configurations. An evolutionary algorithm with mutation operators is used for the search, iteratively updating a population of architectures. This allows for parallel and more efficient exploration of the search space, which is necessary for video architecture search to consider diverse spatio-temporal layers and their combinations. EvaNet evolves multiple modules (at different locations within the network) to generate different architectures.
Our experimental results confirm the benefits of such video CNN architectures obtained by evolving heterogeneous modules. The approach often finds that non-trivial modules composed of multiple parallel layers are most effective as they are faster and exhibit superior performance to hand-designed modules. Another interesting aspect is that we obtain a number of similarly well-performing, but diverse architectures as a result of the evolution, without extra computation. Forming an ensemble with them further improves performance. Due to their parallel nature, even an ensemble of models is computationally more efficient than the other standard video networks, such as (2+1)D ResNet. We have open sourced the code.
AssembleNet: Building stronger and better (multi-stream) models
In “AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures”, we look into a new method of fusing different sub-networks with different input modalities (e.g., RGB and optical flow) and temporal resolutions. AssembleNet is a “family” of learnable architectures that provide a generic approach to learn the “connectivity” among feature representations across input modalities, while being optimized for the target task. We introduce a general formulation that allows representation of various forms of multi-stream CNNs as directed graphs, coupled with an efficient evolutionary algorithm to explore the high-level network connectivity. The objective is to learn better feature representations across appearance and motion visual clues in videos. Unlike previous hand-designed two-stream models that use late fusion or fixed intermediate fusion, AssembleNet evolves a population of overly-connected, multi-stream, multi-resolution architectures while guiding their mutations by connection weight learning. We are looking at four-stream architectures with various intermediate connections for the first time — 2 streams per RGB and optical flow, each one at different temporal resolutions.
The figure below shows an example of an AssembleNet architecture, found by evolving a pool of random initial multi-stream architectures over 50~150 rounds. We tested AssembleNet on two very popular video recognition datasets: Charades and Moments-in-Time (MiT). Its performance on MiT is the first above 34%. The performances on Charades is even more impressive at 58.6% mean Average Precision (mAP), whereas previous best known results are 42.5 and 45.2.
Tiny Video Networks: The fastest video understanding networks
In order for a video CNN model to be useful for devices operating in a real-world environment, such as that needed by robots, real-time, efficient computation is necessary. However, achieving state-of-the-art results on video recognition tasks currently requires extremely large networks, often with tens to hundreds of convolutional layers, that are applied to many input frames. As a result, these networks often suffer from very slow runtimes, requiring at least 500+ ms per 1-second video snippet on a contemporary GPU and 2000+ ms on a CPU. In Tiny Video Networks, we address this by automatically designing networks that provide comparable performance at a fraction of the computational cost. Our Tiny Video Networks (TinyVideoNets) achieve competitive accuracy and run efficiently, at real-time or better speeds, within 37 to 100 ms on a CPU and 10 ms on a GPU per ~1 second video clip, achieving hundreds of times faster speeds than the other human-designed contemporary models.
These performance gains are achieved by explicitly considering the model run-time during the architecture evolution and forcing the algorithm to explore the search space while including spatial or temporal resolution and channel size to reduce computations. The below figure illustrates two simple, but very effective architectures, found by TinyVideoNet. Interestingly the learned model architectures have fewer convolutional layers than typical video architectures: Tiny Video Networks prefers lightweight elements, such as 2D pooling, gating layers, and squeeze-and-excitation layers. Further, TinyVideoNet is able to jointly optimize parameters and runtime to provide efficient networks that can be used by future network exploration.
Conclusion
To our knowledge, this is the very first work on neural architecture search for video understanding. The video architectures we generate with our new evolutionary algorithms outperform the best known hand-designed CNN architectures on public datasets, by a significant margin. We also show that learning computationally efficient video models, TinyVideoNets, is possible with architecture evolution. This research opens new directions and demonstrates the promise of machine-evolved CNNs for video understanding.
Acknowledgements
This research was conducted by Michael S. Ryoo, AJ Piergiovanni, and Anelia Angelova. Alex Toshev and Mingxing Tan also contributed to this work. We thank Vincent Vanhoucke, Juhana Kangaspunta, Esteban Real, Ping Yu, Sarah Sirajuddin, and the Robotics at Google team for discussion and support.
Why use fixed length sequences in transformer ? In what way and why does it effect the performance and training of transformer ? Why did they not use sequences of length <= some number ?
Any paper regarding this?
Also, while reading the paper on Transformer-XL (Dai et. al, 2019) they say,
“We propose a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence”
Why can’t we learn dependencies with a normal transformer(Vaswani et. al) beyond a fixed length without disrupting temporal coherence?
I think temporal coherence gets disturbed when the input length becomes comparable to the length of embedding used for a single word/character because the embedding then doesn’t contain enough information to link the word embedding to all the previous length of this input sequence . Am i right ?
submitted by /u/Jeevesh88
[link] [comments]