Thoughts on Recent Research Paper and Associated Article on Amazon Rekognition
A research paper and associated article published yesterday made claims about the accuracy of Amazon Rekognition. We welcome feedback, and indeed get feedback from folks all the time, but this research paper and article are misleading and draw false conclusions. This blog post shares details which we hope will help clarify several misperceptions and inaccuracies.
People often think of accuracy as an absolute measure, such as a percentage score on a math exam, where each answer is either right or wrong. To understand, interpret, and compare the accuracy of machine learning systems, it’s important to understand what is being predicted, the confidence of the prediction, and how the prediction is to be used, which is impossible to glean from a single absolute number or score.
What is being predicted: Amazon Rekognition provides two distinct face capabilities using a type of machine learning called computer vision. The first capability is facial analysis—for a particular image or video, the service can tell you where a face appears, and certain characteristics of the image (such as if the image contains a smile, glasses, mustache, or the gender of a face). These attributes are usually used to help search a catalog of photographs. The second capability of Amazon Rekognition is commonly known as facial recognition. It is a distinct and different feature from facial analysis and attempts to match faces that appear similar. This is the same approach used to unlock some phones, or authenticate somebody entering a building, or by law enforcement to narrow the field when attempting to identify a person of interest. In the latter, it’s the modern equivalent of detectives in old movies flicking through books of photos, but much faster.
Facial analysis and facial recognition are completely different in terms of the underlying technology and the data used to train them. Trying to use facial analysis to gauge the accuracy of facial recognition is ill-advised, as it’s not the intended algorithm for that purpose (as we state in our documentation).
Confidence: For both facial analysis and facial recognition, Amazon Rekognition also tells you how confident the service is in a specific result. Since all machine learning systems are probabilistic by nature, the confidence score can be thought of as a measure of how much trust the systems place in their results; the higher the confidence number, the more the results can be trusted. It is not possible to interpret the quality of either facial analysis or facial recognition without being transparent and thoughtful about the confidence threshold used to interpret the results. We are not yet aware of the threshold used in this research, but as you will see below, the results are much different when run with the recommended confidence level.
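To make the role of the threshold concrete, here is a minimal sketch of how the set of accepted predictions changes with it. The records are invented for illustration and only loosely follow the shape of a per-face analysis result (a value plus a 0 to 100 confidence score). The names SAMPLE_FACES and accepted_predictions are hypothetical and are not part of the service.

```python
from typing import Dict, List

# Hypothetical per-face records: an attribute value plus a 0-100 confidence.
# The numbers are made up for illustration, not measured results.
SAMPLE_FACES: List[Dict] = [
    {"Gender": {"Value": "Female", "Confidence": 99.6}},
    {"Gender": {"Value": "Male", "Confidence": 97.2}},
    {"Gender": {"Value": "Female", "Confidence": 71.4}},
    {"Gender": {"Value": "Male", "Confidence": 55.0}},
]


def accepted_predictions(faces: List[Dict], attribute: str,
                         min_confidence: float) -> List[str]:
    """Return the attribute values whose confidence meets the threshold."""
    return [
        face[attribute]["Value"]
        for face in faces
        if face[attribute]["Confidence"] >= min_confidence
    ]


if __name__ == "__main__":
    # A permissive threshold keeps every prediction; a strict one keeps few.
    print(len(accepted_predictions(SAMPLE_FACES, "Gender", 50.0)))
    print(len(accepted_predictions(SAMPLE_FACES, "Gender", 99.0)))
```

The same four predictions yield four accepted results at a threshold of 50 and only one at 99, which is why a reported accuracy figure is hard to interpret without the threshold that produced it.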
Use case for predictions: Combined with confidence, the intended use of a machine learning prediction is important, as it helps put the accuracy in context. For example, when using facial analysis to search for images containing ‘sunglasses’ in a photo catalog, showing more images in the search results is often desirable, even if there are some that aren’t perfect matches. Because the cost of an imperfect result in this use case is low, people often accept a lower confidence level in exchange for more results and less manual inspection of those results. However, when using facial recognition to identify persons of interest in an investigation, law enforcement should use our recommended 99% confidence threshold (as documented), and only use those predictions as one element of the investigation (not the sole determinant).
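As a rough sketch of the point above, the snippet below picks a threshold by use case and filters candidate face matches accordingly. The 99 value comes from the recommendation in this post; the 80 value for catalog search, the record shape with a "Similarity" score, and the names THRESHOLDS and candidate_matches are assumptions made purely for illustration.

```python
from typing import Dict, List

# Threshold per use case. 99.0 is the recommendation for law enforcement
# stated in this post; 80.0 is an assumed value for a low-stakes search.
THRESHOLDS: Dict[str, float] = {
    "photo_catalog_search": 80.0,
    "law_enforcement": 99.0,
}


def candidate_matches(matches: List[Dict], use_case: str) -> List[Dict]:
    """Keep matches at or above the use case's threshold, best match first."""
    threshold = THRESHOLDS[use_case]
    kept = [m for m in matches if m["Similarity"] >= threshold]
    return sorted(kept, key=lambda m: m["Similarity"], reverse=True)


if __name__ == "__main__":
    # Hypothetical match results; the identifiers and scores are invented.
    matches = [
        {"FaceId": "face-a", "Similarity": 99.4},
        {"FaceId": "face-b", "Similarity": 91.0},
        {"FaceId": "face-c", "Similarity": 82.5},
    ]
    for use_case in THRESHOLDS:
        ids = [m["FaceId"] for m in candidate_matches(matches, use_case)]
        print(use_case, ids)
```

Even at the stricter threshold, the surviving matches are candidates for a human to review, not a final determination.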
With the above context for how to think about ‘tests’ of Amazon Rekognition, we can get to this latest report and its erroneous claims.
The research paper seeks to “expose performance vulnerabilities in commercial facial recognition products,” but uses facial analysis as a proxy.
As stated above, facial analysis and facial recognition are two separate tools; it is not possible to use facial analysis to match faces in the same way as you would in facial recognition. This is not just an issue of semantics or definitions; they are two different features with two different purposes. Facial analysis can only find generic features (such as facial hair, smiles, frowns, gender, and so forth), which are primarily used to help filter and organize images. It has no knowledge of features which make a face unique (and cannot reverse engineer this from the image). In contrast, facial recognition focuses on unique facial features to match faces, and is used to match faces in datasets that customers bring to the service. Using facial analysis to do facial recognition is an inaccurate and unadvised way to identify unique individuals. We explain this in our documentation, and haven’t received a report from a customer who’s been confused on this issue.
The research paper states that Amazon Rekognition provides low quality facial analysis results. This does not reflect our own extensive testing and what we’ve heard from customers using the service.
First, the researchers used an outdated version of Amazon Rekognition. We made a significant set of improvements in November. Second, in a test run by AWS using the latest version of Amazon Rekognition, we ran facial analysis to perform gender classification on more than 12,000 images: a random selection of 1,000 men and 1,000 women across six ethnicities (South Asian, Hispanic, East Asian, Caucasian, African American, and Middle Eastern). Across all ethnicities, we found no significant difference in accuracy with respect to gender classification. In a broader test of facial recognition (which, as we explained earlier, is the logical and recommended way to do facial recognition), we evaluated photos from parliamentary websites with the Megaface dataset of 1 million images using Amazon Rekognition, and found exactly zero false positive matches at the recommended 99% confidence threshold. The research paper in question does not use the recommended facial recognition capabilities, does not share the confidence levels used in their research, and we have not been able to reproduce the results of the study. We’d love to collaborate with these researchers on helping with this research, and more importantly, to help continue improving the state of the art in facial recognition.
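For readers who want to run this kind of comparison on their own labeled data, the following is a minimal sketch of the two measurements discussed above: classification accuracy broken out by group, and the number of false positive matches at a given threshold. It is not the test harness AWS used. The function names, the record layouts, and the toy data are all assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_group(
    records: Iterable[Tuple[str, str, str]]
) -> Dict[str, float]:
    """records holds (group, true_label, predicted_label) tuples."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for group, truth, predicted in records:
        total[group] += 1
        correct[group] += int(truth == predicted)
    return {group: correct[group] / total[group] for group in total}


def false_positives(
    matches: Iterable[Tuple[float, bool]], threshold: float
) -> int:
    """matches holds (similarity, is_same_person) tuples.

    A false positive is a match at or above the threshold between two
    different people.
    """
    return sum(
        1 for similarity, same_person in matches
        if similarity >= threshold and not same_person
    )


if __name__ == "__main__":
    # Toy data only; these are not results from any real evaluation.
    toy_records = [
        ("group_a", "Female", "Female"),
        ("group_a", "Male", "Male"),
        ("group_b", "Female", "Male"),
        ("group_b", "Male", "Male"),
    ]
    print(accuracy_by_group(toy_records))

    toy_matches = [(99.5, True), (99.2, False), (85.0, False)]
    print(false_positives(toy_matches, threshold=99.0))
```

Reporting the threshold alongside both numbers is what makes a result like this reproducible by others.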
Beyond our internal tests or single ‘point in time’ results, we are very interested in working with academics in establishing a series of standardized tests for facial analysis and facial recognition and in working with policy makers on guidance and/or legislation of its use. One existing standardized test is from the National Institute of Standards and Technology (NIST). Amazon Rekognition’s Face API is a large-scale system which runs on a broad set of Amazon EC2 instance types using multiple deep learning models and proprietary data processing, storage, and search systems. Amazon Rekognition can’t be ‘downloaded’ for testing outside of AWS, and components cannot be tested in isolation while replicating how customers would use the service in the real world. We welcome the opportunity to work with NIST on improving their tests against this API objectively, and to establish datasets and benchmarks with the broader academic community.
The research paper implies that Amazon Rekognition is not improving, and that AWS is not interested in discussing issues around facial recognition.
This is false. We are now on our fourth significant version update of Amazon Rekognition. We are acutely aware of the concerns around facial recognition, and remain highly motivated and committed to continuous improvement, just as we are with all of our services. We make funding available for research projects and staff through the AWS Machine Learning Research Grants and have made significant investments to continuously improve Amazon Rekognition. Those improvements are made available to customers in all geographic regions as soon as they are validated, and, as with all AWS services, we will continue to update and improve Amazon Rekognition. So far, our direct offers to discuss, update, and collaborate on these results have not been acknowledged or accepted by the researchers in this case.
We know that facial recognition technology, when used irresponsibly, has risks. This is true of a lot of technologies, computers included. And, people are concerned about this. We are, too. It’s why we suspend people’s use of our services if we find they’re using them irresponsibly or to infringe on people’s civil rights. It’s also why we clearly recommend in our documentation that facial recognition results should only be used in law enforcement when the results have confidence levels of at least 99%, and even then, only as one artifact of many in a human-driven decision. But, we remain optimistic about the good this technology will provide in society, and are already seeing meaningful proof points with facial recognition helping thwart child trafficking, reuniting missing kids with parents, providing better payment authentication, or diminishing credit card fraud. And, to date (over two years after releasing the service), we have had no reported law enforcement misuses of Amazon Rekognition.
The answer to anxieties over new technology is not to run ‘tests’ inconsistent with how the service is designed to be used, and to amplify the test’s false and misleading conclusions through the news media. We are eager to keep working with researchers, academics, and customers to continuously improve as we evolve this important technology.
-Dr. Matt Wood, general manager of artificial intelligence at AWS
Updated (1st Feb): This post was updated to accurately reflect the current state of testing with NIST.