[D] AI Scandal: SOTA classifier with 92% ImageNet accuracy scores 2% on new dataset

On a new image dataset, unedited, without adversarial noise injection, ResNeXt-50 and DenseNet-121 see their accuracies drop to under 3%. Other former SOTA approaches plummet likewise by unacceptable margins:

Natural Adversarial Examples – original paper, July 2019

These Images Fool Neural Networks – TwoMinutePapers clip, 5 mins

So who says it’s a scandal? Well, I do, and I’ve yet to hear an uproar over it. A simple yet disturbing interpretation of these results: there are millions of images out there that we humans can identify with ease, yet our best AI models completely flunk them.

Thoughts on this? I summarize some of mine below, along with a few of the authors’ findings.


Where’d they get the images? The idea’s pretty simple: select a subset classified incorrectly by several top classifiers, and find similar images.
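
A rough sketch of that selection step, as I read it: keep only the images that every classifier in a pool gets wrong. The function name, the model names and the toy labels below are all made up for illustration; this is not the authors’ actual pipeline.

```python
def select_hard_examples(labels, predictions_by_model):
    """Return indices of images that every model in the pool misclassifies."""
    hard = []
    for i, true_label in enumerate(labels):
        # Keep the image only if no model predicted the true label.
        if all(preds[i] != true_label for preds in predictions_by_model.values()):
            hard.append(i)
    return hard


# Toy data (hypothetical): true labels and per-model top-1 predictions.
labels = ["dragonfly", "bear", "fox", "banana"]
predictions_by_model = {
    "model_a": ["banana", "bear", "dog", "banana"],
    "model_b": ["shovel", "teddy", "cat", "banana"],
}

hard_indices = select_hard_examples(labels, predictions_by_model)
print(hard_indices)  # prints [0, 2]
```

Images 0 and 2 are kept because both toy models miss them; image 1 is dropped because one model still gets it right.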

Why do the NNs fail? Misclassified images tend to have a set of features in common that can be systematically exploited –> adversarial attacks. Instead of artificially injecting such features, the authors find images that already contain them: “Networks may rely too heavily on texture and color cues, for instance misclassifying a dragonfly as a banana presumably due to a nearby yellow shovel” (pg. 4).

Implications for research: self-attention mechanisms, e.g. Squeeze-and-Excitation, improve accuracy on ImageNet by ~1%, but on this new dataset by 10%. Likewise, related methods for increased robustness may improve performance on benchmark datasets by a little, but by a lot on adversarial ones.
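
For anyone unfamiliar with Squeeze-and-Excitation, here is a minimal NumPy sketch of the idea: average each channel down to one number, pass that vector through a small bottleneck, and use the resulting sigmoid gates to rescale the channels. It assumes a single (C, H, W) feature map, omits biases, and uses random weights, so treat it as an illustration rather than the reference implementation.

```python
import numpy as np


def squeeze_excite(x, w1, w2):
    """Channel-wise gating on a (C, H, W) feature map.

    w1 has shape (C // r, C) and w2 has shape (C, C // r), where r is the
    bottleneck reduction ratio. Biases are left out to keep the sketch short.
    """
    z = x.mean(axis=(1, 2))                 # squeeze: global average per channel
    h = np.maximum(w1 @ z, 0.0)             # bottleneck layer with ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ h)))     # excitation: sigmoid gates in (0, 1)
    return x * s[:, None, None]             # rescale each channel by its gate


# Hypothetical sizes and random weights, purely for demonstration.
rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 4
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1

y = squeeze_excite(x, w1, w2)
print(y.shape)  # prints (8, 4, 4)
```

The output has the same shape as the input; only the relative weighting of channels changes.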

  • Thus, instead of pooling all efforts into maximizing F1-score on dataset A, it may be more worthwhile to test against engineered robustness metrics that promise improvement on an unsampled dataset B (e.g. “mean corruption error”, pg. 8).
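
As a concrete example of such a metric, here is a small sketch of mean corruption error. I am assuming the definition from the ImageNet-C benchmark: for each corruption type, sum the model’s error rates over the severity levels, divide by the same sum for a baseline model (AlexNet in that benchmark), then average over corruption types. The error rates below are invented for illustration.

```python
def corruption_error(model_errors, baseline_errors):
    """Model error summed over severities, normalized by the baseline's sum."""
    return sum(model_errors) / sum(baseline_errors)


def mean_corruption_error(model, baseline):
    """Average the normalized corruption error over all corruption types."""
    ces = [corruption_error(model[c], baseline[c]) for c in model]
    return sum(ces) / len(ces)


# Hypothetical top-1 error rates at five severity levels per corruption type.
model = {
    "noise": [0.2, 0.3, 0.3, 0.3, 0.4],
    "blur": [0.1, 0.2, 0.3, 0.4, 0.5],
}
baseline = {
    "noise": [0.4, 0.5, 0.6, 0.7, 0.8],
    "blur": [0.3, 0.4, 0.5, 0.6, 0.7],
}

mce = mean_corruption_error(model, baseline)
print(round(mce, 2))
```

A value below 1 means the model degrades less than the baseline under the same corruptions; a value above 1 means it degrades more.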

Implications for business: you don’t want your bear-catching drone to tranquilize a kid with a teddy.

submitted by /u/OverLordGoldDragon