Author: torontoai

Bi-Tempered Logistic Loss for Training Neural Nets with Noisy Data

Written on August 25, 2019. Posted in Google.

Posted by Ehsan Amid, Student Researcher and Rohan Anil, Software Engineer, Google Research

The quality of models produced by machine learning (ML) algorithms directly depends on the quality of the training data, but real-world datasets typically contain some amount of noise that introduces challenges for ML models. Noise in the dataset can take several forms, from corrupted examples (e.g., lens flare in an image of a cat) to examples that were mislabeled when the data was collected (e.g., an image of a cat mislabeled as a flerken).

The ability of an ML model to deal with noisy training data depends in great part on the loss function used in the training process. For classification tasks, the standard loss function used for training is the logistic loss. However, this particular loss function falls short when handling noisy training examples due to two unfortunate properties:

Outliers far away can dominate the overall loss: The logistic loss function is sensitive to outliers. This is because the loss function value grows without bound as mislabeled examples (outliers) lie farther from the decision boundary. Thus, a single bad example that is located far away from the decision boundary can penalize the training process to the extent that the final trained model learns to compensate for it by stretching the decision boundary and potentially sacrificing the remaining good examples. This “large-margin” noise issue is illustrated in the left panel of the figure below.
Mislabeled examples nearby can stretch the decision boundary: The output of the neural network is a vector of activation values, which reflects the margin between the example and the decision boundary for each class. The softmax transfer function is used to convert the activation values into probabilities that an example will belong to each class. As the tail of this transfer function for the logistic loss decays exponentially fast, the training process will tend to stretch the boundary closer to a mislabeled example in order to compensate for its small margin. Consequently, the generalization performance of the network will immediately deteriorate, even with a low level of label noise (right panel below).
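
Both properties can be checked numerically for the binary case. The following is a minimal sketch, assuming the usual form of the logistic loss, log(1 + exp(-margin)), where a negative margin means the example sits on the wrong side of the boundary. The function names are illustrative and not taken from any library.

```python
import numpy as np

def logistic_loss(margin):
    """Logistic loss log(1 + exp(-margin)) for a signed margin."""
    # logaddexp(0, -m) evaluates log(exp(0) + exp(-m)) without overflow.
    return np.logaddexp(0.0, -margin)

def sigmoid(margin):
    """Probability assigned to the labeled class for a signed margin."""
    return 1.0 / (1.0 + np.exp(-margin))

# Mislabeled examples at increasing distance on the wrong side of the boundary.
margins = np.array([-1.0, -10.0, -100.0])
losses = logistic_loss(margins)   # grows roughly linearly with the distance
probs = sigmoid(margins)          # shrinks exponentially with the distance

for m, l, p in zip(margins, losses, probs):
    print(f"margin={m:7.1f}  loss={l:9.4f}  probability={p:.3e}")
```

The loss keeps growing with the distance of the mislabeled example, while the probability assigned to its (wrong) label vanishes exponentially fast.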

We visualize the decision surface of a 2-layered neural network as it is trained for binary classification. Blue and orange dots represent the examples from the two classes. The network is trained with logistic loss under two types of noisy conditions: (left) large-margin noise and (right) small-margin noise.

We tackle these two problems in a recent paper by introducing a “bi-tempered” generalization of the logistic loss. It is endowed with two tunable parameters, which we call “temperatures”: t₁, which characterizes boundedness, and t₂, which characterizes tail-heaviness (i.e., the rate of decline in the tail of the transfer function). These properties are illustrated below. Setting both t₁ and t₂ to 1.0 recovers the logistic loss function. Setting t₁ lower than 1.0 increases the boundedness and setting t₂ greater than 1.0 makes for a heavier-tailed transfer function. We also introduce this interactive visualization, which allows you to visualize the neural network training process with the bi-tempered logistic loss.

Left: Boundedness of the loss function. When t₁ is between 0 and 1, exclusive, only a finite amount of loss is incurred for each example, even if it is mislabeled. Shown is t₁ = 0.8. Right: Tail-heaviness of the transfer function. The heavy-tailed transfer function applies when t₂ > 1.0 and assigns higher probability for the same amount of activation, thus preventing the boundary from drawing closer to the noisy example. Shown is t₂ = 2.0.
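
The two effects in the caption can be reproduced with a few lines of NumPy. This is a minimal sketch, assuming the tempered logarithm log_t(x) = (x^(1-t) - 1) / (1 - t) and tempered exponential exp_t(x) = max(1 + (1-t)x, 0)^(1/(1-t)) from the accompanying paper. These formulas are not spelled out in this post, and the function names are illustrative.

```python
import numpy as np

def log_t(x, t):
    """Tempered logarithm; reduces to the natural log at t = 1."""
    if t == 1.0:
        return np.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

def exp_t(x, t):
    """Tempered exponential; reduces to exp at t = 1."""
    if t == 1.0:
        return np.exp(x)
    # Clip at zero so the power is well defined. For t > 1 this sketch
    # is only meant to be called with x <= 0 (the tail of the function).
    base = np.maximum(1.0 + (1.0 - t) * x, 0.0)
    return base ** (1.0 / (1.0 - t))

# Boundedness: the penalty -log_t(p) stays below 1 / (1 - t) when t < 1.
p = 1e-12
print("-log(p)        :", -log_t(p, 1.0))
print("-log_t(p, 0.8) :", -log_t(p, 0.8), "(bound:", 1.0 / (1.0 - 0.8), ")")

# Tail-heaviness: for the same negative activation, exp_t with t > 1
# returns a much larger value than the ordinary exponential.
a = -5.0
print("exp(a)         :", exp_t(a, 1.0))
print("exp_t(a, 2.0)  :", exp_t(a, 2.0))
```

With t = 0.8 the penalty for a vanishing probability approaches 1 / (1 - 0.8) = 5 instead of diverging, and with t = 2.0 the tail decays like 1 / (1 - a) rather than exponentially.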

To demonstrate the effect of each temperature, we train a two-layer feed-forward neural network for a binary classification problem on a synthetic dataset that contains a circle of points from the first class, and a concentric ring of points from the second class. You can try this yourself in your browser with our interactive visualization. We use the standard logistic loss function, which can be recovered by setting both temperatures equal to 1.0, as well as our bi-tempered logistic loss for training the network. We then demonstrate the effects of each loss function for a clean dataset, a dataset with small-margin noise, a dataset with large-margin noise, and a dataset with random noise.
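
A dataset of this shape is easy to generate. The sketch below is one possible construction; the radii, the number of points, and the function name are assumptions made for illustration and are not the values used in the experiments.

```python
import numpy as np

def make_circle_and_ring(n_per_class=200, inner_radius=1.0,
                         ring_inner=2.0, ring_outer=3.0, seed=0):
    """Class 0: points inside a disk. Class 1: points in a concentric ring."""
    rng = np.random.default_rng(seed)
    # The square root of a uniform variable spreads points evenly over the disk area.
    r_disk = inner_radius * np.sqrt(rng.uniform(0.0, 1.0, n_per_class))
    r_ring = rng.uniform(ring_inner, ring_outer, n_per_class)
    radii = np.concatenate([r_disk, r_ring])
    angles = rng.uniform(0.0, 2.0 * np.pi, 2 * n_per_class)
    points = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)
    labels = np.concatenate([np.zeros(n_per_class, dtype=int),
                             np.ones(n_per_class, dtype=int)])
    return points, labels

X, y = make_circle_and_ring()
print(X.shape, y.shape)
```

The gap between the disk and the ring leaves room for a clean circular decision boundary, which makes the effect of label noise easy to see.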

Logistic vs. bi-tempered logistic loss: (a) noise-free labels, (b) small-margin label noise, (c) large-margin label noise, and (d) random label noise. The temperature values (t₁, t₂) for the tempered loss are shown above each figure. We find that for each situation, the decision boundary recovered by training with the bi-tempered logistic loss function is better than the one recovered with the logistic loss.

Noise Free Case:
We show the results of training the model on the noise-free dataset in column (a), using the logistic loss (top) and the bi-tempered logistic loss (bottom). The white line shows the decision boundary for each model. The values of (t₁, t₂), the temperatures in the bi-tempered loss function, are shown below each column of the figure. Notice that for this choice of temperatures, the loss is bounded and the transfer function is tail-heavy. As can be seen, both losses produce good decision boundaries that successfully separate the two classes.

Small-Margin Noise:
To illustrate the effect of tail-heaviness of the probabilities, we artificially corrupt a random subset of the examples that are near the decision boundary, that is, we flip the labels of these points to the opposite class. The results of training the networks on data with small-margin noise using the logistic loss as well as the bi-tempered loss are shown in column (b).

As can be seen, the logistic loss, due to the lightness of the softmax tail, stretches the boundary closer to the noisy points to compensate for their low probabilities. On the other hand, the bi-tempered loss, using only the tail-heavy probability transfer function obtained by adjusting t₂, can successfully avoid the noisy examples. This can be explained by the heavier tail of the tempered exponential function, which assigns reasonably high probability values (and thus keeps the loss value small) while maintaining the decision boundary away from the noisy examples.

Large-Margin Noise:
Next, we evaluate the performance of the two loss functions for handling large-margin noisy examples. In (c), we randomly corrupt a subset of the examples that are located far away from the decision boundary (the outer side of the ring as well as points near the center).

For this case, we only use the boundedness property of the bi-tempered loss, while keeping the softmax probabilities the same as the logistic loss. The unboundedness of the logistic loss causes the decision boundary to expand towards the noisy points to reduce their loss values. On the other hand, the bounded bi-tempered loss, bounded by adjusting t₁, incurs a finite amount of loss for each noisy example. As a result, the bi-tempered loss can avoid these noisy examples and maintain a good decision boundary.
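
The small-margin and large-margin corruptions differ only in which points are eligible for a label flip. The sketch below shows one way to express that, assuming a circular boundary at a known radius. The exact corruption procedure is not given in this post, so the boundary radius, band width, flip fraction, and function name are all hypothetical.

```python
import numpy as np

def flip_labels_by_margin(radii, labels, boundary_radius, band,
                          near_boundary, fraction, seed=0):
    """Flip a random fraction of labels among points near (or far from) the boundary."""
    rng = np.random.default_rng(seed)
    distance = np.abs(radii - boundary_radius)
    if near_boundary:
        candidates = np.flatnonzero(distance < band)    # small-margin points
    else:
        candidates = np.flatnonzero(distance >= band)   # large-margin points
    n_flip = int(round(fraction * candidates.size))
    chosen = rng.choice(candidates, size=n_flip, replace=False)
    noisy = labels.copy()
    noisy[chosen] = 1 - noisy[chosen]                   # binary labels in {0, 1}
    return noisy, chosen

# Toy radii: class 0 inside the unit disk, class 1 in a ring between 2 and 3.
rng = np.random.default_rng(1)
radii = np.concatenate([rng.uniform(0.0, 1.0, 200), rng.uniform(2.0, 3.0, 200)])
labels = np.concatenate([np.zeros(200, dtype=int), np.ones(200, dtype=int)])

small_noisy, small_idx = flip_labels_by_margin(radii, labels, 1.5, 0.75, True, 0.3)
large_noisy, large_idx = flip_labels_by_margin(radii, labels, 1.5, 0.75, False, 0.1)
print(len(small_idx), "small-margin flips;", len(large_idx), "large-margin flips")
```

Random noise, as in the final experiment, corresponds to drawing the flipped points without regard to their distance from the boundary.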

Random Noise:
Finally, we investigate the effect of random noise in the training data on the two loss functions. Note that random noise comprises both small-margin and large-margin noisy examples. Thus, we use both boundedness and tail-heaviness properties of the bi-tempered loss function by setting the temperatures to (t₁, t₂) = (0.2, 4.0).
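
The sketch below is one way to put the two temperatures together for a single example. It assumes the definitions given in the accompanying paper (the tempered logarithm and exponential, a fixed-point iteration for the tempered softmax normalization when t₂ > 1, and a loss with a correction term), none of which are spelled out in this post. All function names are illustrative, and this is not the released implementation.

```python
import numpy as np

def log_t(x, t):
    """Tempered logarithm; natural log at t = 1."""
    if t == 1.0:
        return np.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

def exp_t(x, t):
    """Tempered exponential; exp at t = 1."""
    if t == 1.0:
        return np.exp(x)
    return np.maximum(1.0 + (1.0 - t) * x, 0.0) ** (1.0 / (1.0 - t))

def tempered_softmax(activations, t, n_iters=50):
    """Heavy-tailed softmax for t >= 1 (this sketch does not handle t < 1)."""
    if t < 1.0:
        raise ValueError("this sketch only covers t >= 1")
    a = np.asarray(activations, dtype=float)
    mu = np.max(a)
    a0 = a - mu                          # shifted so the largest activation is 0
    if t == 1.0:
        e = np.exp(a0)
        return e / e.sum()
    a_tilde = a0
    for _ in range(n_iters):             # fixed-point iteration on the partition sum
        z = exp_t(a_tilde, t).sum()
        a_tilde = a0 * z ** (1.0 - t)
    z = exp_t(a_tilde, t).sum()
    normalizer = -log_t(1.0 / z, t) + mu
    return exp_t(a - normalizer, t)

def bi_tempered_loss(activations, label, t1, t2):
    """Loss for one example with an integer class label (one-hot target)."""
    probs = tempered_softmax(activations, t2)
    mismatch = -log_t(probs[label], t1)
    correction = (1.0 - np.sum(probs ** (2.0 - t1))) / (2.0 - t1)
    return mismatch - correction

# A confidently wrong prediction: the labeled class has the lowest activation.
acts = np.array([8.0, 0.0, -3.0])
print("logistic    :", bi_tempered_loss(acts, 2, 1.0, 1.0))
print("bi-tempered :", bi_tempered_loss(acts, 2, 0.2, 4.0))
```

With (t₁, t₂) = (1.0, 1.0) the function reduces to the ordinary softmax cross-entropy, while with (0.2, 4.0) the penalty for the same wrong prediction stays below 1 / (1 - t₁) = 1.25.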

As can be seen from the results in the last column, (d), the logistic loss is highly affected by the noisy examples and clearly fails to converge to a good decision boundary. On the other hand, the bi-tempered loss can recover a decision boundary that is almost identical to the noise-free case.

Conclusion
In this work we constructed a bounded, tempered loss function that can handle large-margin outliers and introduced heavy-tailedness in our new tempered softmax function, which can handle small-margin mislabeled examples. Using our bi-tempered logistic loss, we achieve excellent empirical performance on training neural networks on a number of large standard datasets (please see our paper for full details). Note that state-of-the-art neural networks have been optimized along with a large variety of variables, such as architecture, transfer function, choice of optimizer, and label smoothing, to name just a few. Our method introduces two additional tunable variables, namely (t₁, t₂). We believe that with a systematic “joint optimization” of all commonly tried variables, significant further improvements can be achieved in conjunction with our loss function. This is of course a more long-term goal. We also plan to explore the idea of annealing the temperature parameters over the training process.

Acknowledgements:
This blog post reflects work with our co-authors Manfred Warmuth, Visiting Researcher, and Tomer Koren, Senior Research Scientist, Google Research. A preprint of our paper is available here, which contains theoretical analysis of the loss function and empirical results on standard datasets at scale.

[R] BasisConv: A method for compressed representation and learning in CNNs

Written on August 25, 2019. Posted in Reddit MachineLearning.

submitted by /u/schrodingershit

[Project] Stochastic Variance Reduction Gradient Descent (SVRG) optimizer for Keras

Written on August 25, 2019. Posted in Reddit MachineLearning.

I’ve implemented an SVRG (Stochastic Variance Reduction Gradient Descent) optimizer for Keras. The goal is to make this optimizer available in Keras as well, which may be beneficial for RL, as some papers have claimed it has advantages over Adam.

The paper: https://papers.nips.cc/paper/4937-accelerating-stochastic-gradient-descent-using-predictive-variance-reduction.pdf

Link to the project: https://github.com/tilkb/SVRGoptimizerKeras
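
For readers unfamiliar with the method, here is a minimal NumPy sketch of the SVRG update on a least-squares problem. It is not the code from the linked project and does not use Keras; the function name, step size, and loop lengths are assumptions chosen for illustration.

```python
import numpy as np

def svrg_least_squares(X, y, lr=0.01, n_epochs=40, inner_steps=None, seed=0):
    """Minimize (1 / 2n) * sum((X @ w - y) ** 2) with SVRG."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    if inner_steps is None:
        inner_steps = 3 * n
    w = np.zeros(d)
    for _ in range(n_epochs):
        # Snapshot of the weights and the full gradient at that snapshot.
        w_snap = w.copy()
        full_grad = X.T @ (X @ w_snap - y) / n
        for _ in range(inner_steps):
            i = rng.integers(n)
            grad_now = (X[i] @ w - y[i]) * X[i]
            grad_snap = (X[i] @ w_snap - y[i]) * X[i]
            # Variance-reduced stochastic gradient step.
            w = w - lr * (grad_now - grad_snap + full_grad)
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=200)

w_hat = svrg_least_squares(X, y)
print("SVRG estimate:", w_hat)
```

The correction term (grad_snap minus full_grad) has zero mean, so the step remains an unbiased gradient estimate while its variance shrinks as the iterate and the snapshot approach the optimum.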

submitted by /u/tilkb

[R] Call for Papers: Shared Visual Representations in Human and Machine Intelligence (SVRHM) NeurIPS 2019 workshop

Written on August 25, 2019. Posted in Reddit MachineLearning.

The goal of the Shared Visual Representations in Human and Machine Intelligence (SVRHM) workshop at NeurIPS 2019 is to discuss and disseminate relevant findings and parallels between the computational neuro/cognitive science and machine learning/artificial intelligence communities.

In the past few years, machine learning tools, especially deep neural networks, have permeated the vision/cognitive/neuro science communities to become the leading computational models that describe many cognitive tasks. Huge strides are also being made in the machine learning/artificial intelligence community, with biologically inspired algorithms providing large efficiency gains in both computational and learning capabilities. However, many mysteries remain with regard to the alignment of human and machine perception, and there are cases where we see divergent rather than convergent representations. To resolve such questions, this workshop aims to foster fruitful discussions between scientists and engineers with multi-disciplinary backgrounds, to review the recent progress in shared visual representations in both humans and machines, and in doing so to identify road-blocks and areas of interest that can further accelerate the growth of both fields.

The workshop will include a series of talks and panel discussions from a diverse group of speakers from both industry and academia who will share their research at the intersection of humans and machines that pushes the field of vision forward. The aim of our Call for Papers is to bring together scientists and engineers to share, at the Poster Session, work in progress that is applicable to the scope of the Workshop.

The following areas provide a sense of suitable topics for 2-4 page paper submissions:

Biological inspiration and inductive bias in vision
Human-relevant strategies for robustness and generalization
New datasets (e.g., for comparing humans/animals and machines)
Biologically-driven self-supervision
Perceptual invariance and metamerism
Biologically-informed strategies to mitigate adversarial vulnerability
Foveation, active perception, and attention models
Intuitive physics
Perceptual and cognitive robustness
Nuances and noise in perceptual and cognitive systems
Creative problem-solving
Differences and similarities between humans and deep neural networks
Canonical computations in biological and artificial systems
Alternative architectures for deep neural networks
Reverse engineering of the human visual system via deep neural networks

We will be awarding an NVIDIA Titan RTX and an Oculus Quest as the best paper and best poster prizes, respectively, at the conference.

Link to the workshop with additional details for the Call for Papers: https://www.svrhm2019.com/

Link to workshop paper submission: https://cmt3.research.microsoft.com/SVRHM2019

Questions regarding the workshop should be sent to: info@svrhm2019.com

Sincerely,

The Organizers

Arturo Deza, Joshua Peterson, Apruva Ratan Murty, Tom Griffiths

The SVRHM workshop is currently sponsored by NVIDIA, MIT’s Center for Brains, Minds and Machines (CBMM), the National Science Foundation (NSF), Oculus and MIT’s Quest for Intelligence.

submitted by /u/NeuroSurfer77

[D] Rich model/Poor model: Logarithmic Loss and comparing model performance – an exploratory analysis

Written on August 25, 2019. Posted in Reddit MachineLearning.

I was reading about logarithmic loss and I became curious about a few things, so I did an exploratory analysis using a “good” model and a “bad” model.

Some things I looked at include the distribution of log loss, mean vs. median, and calculating it separately for the target and non-target classes.

I’d like to know if anyone has any thoughts or ideas to discuss regarding some of the more nuanced aspects of the metric.

https://emilyswebber.github.io/LogLoss/

submitted by /u/datadatadata84

[D] ML approaches to Fuzzy Matching for MDM?

Written on August 25, 2019. Posted in Reddit MachineLearning.

I am currently working on a project with an interesting task, and I am struggling to find an approach. I am tasked with implementing ML algorithms to replace fuzzy matching for MDM purposes. The fields will include PII such as SSN, First Name, Last Name, Address, etc. I am quite new to fuzzy matching and would love to hear some opinions on other approaches to this problem! The development language is Python (I did work through FuzzyWuzzy, and I believe the current implementation uses something similar to it and its similarity score).

submitted by /u/Fender6969

[D] Leveraging Learning in Robotics: RSS 2019 Highlights

Written on August 25, 2019. Posted in Reddit MachineLearning.

https://thegradient.pub/leveraging-learning-in-robotics-rss-2019-highlights/

I recently wrote a blog post summarizing interesting works presented at the Robotics: Science and Systems conference. Would love your feedback!

submitted by /u/aseembits93

[P] Introducing Deepkit – the first collaborative desktop app for deep learning experiments. Experiment tracking, model debugging, infrastructure management.

Written on August 25, 2019. Posted in Reddit MachineLearning.

https://deepkit.ai

Hi guys, I’m the founder of Deepkit, an app that helps you visualize, debug, track, and run ML/DL experiments directly on your workstation or on your own servers, in your LAN or in the cloud. Deepkit will be free for individual users and available in all app stores. You can use the app alone or use the real-time collaborative features within a team using the Deepkit team server.

We’re looking for alpha users who want to help us build a better, cheaper and more efficient way of doing ML/DL experiments. If you’re interested, please register at the website directly or use this link. We currently only support macOS, but Windows and Linux will follow. Follow us on Twitter @deepkitAI to get notified once we release the public version.

If you have any questions, I’m happy to answer in the comments.

submitted by /u/marcjschmidt

[R] AdvHat: Real-world adversarial attack on ArcFace Face ID system

Written on August 25, 2019. Posted in Reddit MachineLearning.

Hi! We have done some interesting research on breaking the current best public Face ID system, ArcFace, using an adversarial attack. The technique itself is quite ordinary, but what we succeeded in doing is making it work in the real world (i.e., not only in the digital domain): someone can print a color sticker and stick it to a hat, and after that the similarity with the ground truth drops significantly. The attack even shows some transferability to other top Face ID models from insightface.

Paper: https://arxiv.org/abs/1908.08705

Code: https://github.com/papermsucode/advhat

Video demonstration: https://www.youtube.com/watch?v=a4iNg0wWBsQ

Any comments are welcome!

submitted by /u/AleksanderPet

[D] How can I reduce the difference between real and predicted stock price?

Written on August 25, 2019. Posted in Reddit MachineLearning.

I’m using an LSTM to predict stock indexes.

I used MinMaxScaler (range 0 to 1), and the lowest MSE obtained with the model is 0.000003.

However, there is a big difference between the predicted price and the real price.
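
One thing worth checking is what that MSE means in price units, since an error measured on min-max scaled data is multiplied by the price range when it is mapped back. The sketch below shows the arithmetic; the price range of 2000 to 3000 is a made-up example and the function name is illustrative.

```python
import math

def scaled_mse_to_price_rmse(mse_scaled, price_min, price_max):
    """Convert an MSE measured on 0-to-1 min-max scaled data into an RMSE in price units."""
    # Min-max scaling divides every value by (max - min), so errors scale back by that range.
    return math.sqrt(mse_scaled) * (price_max - price_min)

# Hypothetical index that ranged from 2000 to 3000 in the training data.
rmse = scaled_mse_to_price_rmse(0.000003, 2000.0, 3000.0)
print(f"RMSE in price units: {rmse:.2f}")
```

If the gap seen on the plot is much larger than this figure, the MSE and the plot are probably not computed on the same data (for example training loss versus test predictions), which is a separate issue from the scaling.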

I wonder if it is possible to bridge this gap.

If I can, I wonder which method I should use.

Your valuable opinions and thoughts will be very much appreciated.

submitted by /u/GoBacksIn
