[D] Concerns about “Face Beautification: Beyond Makeup Transfer”
I came across the paper “Face Beautification: Beyond Makeup Transfer” and was appalled at the poor ethical and scientific practice shown by the paper. I emailed the PC and the D&I chairs, but I wanted to share my critique with the community as well:
I came across the paper “Face Beautification: Beyond Makeup Transfer” that was published at NeurIPS this year. I was deeply concerned by the apparent complete lack of care for the social and ethical repercussions of the paper. The goal of the paper is to change photos of women to make them more attractive. While it may be possible to do this in a way that isn’t objectionable, the paper contains zero discussion or acknowledgement of the social, political, and power-dynamical (is that a word?) aspects of what is judged as attractive. The paper also contains serious methodological issues and blatantly contradicts itself in a fashion that I would expect to disqualify it from publication in the first place.
The examples in the paper make it clear that the algorithm’s concept of “attractive” is “light skinned white people.” Of the 114 demo examples of computer-generated “attractive people” in the paper, 100% are white. Not only that, almost all of them have extremely light skin. Only a couple of the shown inputs appear to be non-white people (e.g., Table 2 appears to contain a South Asian woman), and the algorithm clearly makes them into white people both by lightening their skin and by changing other morphological features to make the person appear more white. None of the inputs appear to be black people. Even among white people it strongly prefers people with lighter skin; there are zero examples where the algorithm appears to darken the skin tone of the person to be beautified and in the majority of cases it is significantly lightened.
This isn’t just an issue of using white people as “attractive references,” as it even happens when the reference attractive image is a photo of a non-white person, as seen in the table at the top of the first page. Two East Asian people are used as reference images to beautify white people, but the resulting images have typically white features such as a less oval face shape and double eyelids.
Not only does this appear as a persistent pattern in the images, the authors don’t even mention that it happens, let alone critically engage with this. Given how much NeurIPS appears to pride itself on social awareness in AI research, I am saddened and disheartened to see that this paper was viewed as having sufficient merit and ethical practice to warrant publication.
Another major issue is the very last paragraph of their paper. It says
>Personalized beautification is expected to attract increasingly more attention in the incoming years. This work we have only focused on the beautification of female Caucasian faces. A similar question can be studied for other populations even though the relationship between gender, race, cultural background and the perception of facial attractiveness has remained under-researched in the literature. How can AI help reshape the practice of personal makeup and plastic surgery is an emerging field for future research.
This paragraph is clearly false for several reasons. First, as I mentioned, they have reference photos of non-Caucasian people in the paper itself and appear to input at least a couple of non-Caucasian people. Second, the authors use data sets that contain a large number of non-Caucasian people. Since they mention the number of training and testing data points used, it is easy to verify that either their “Experimental Setup” section is wrong or this paragraph is. Given that the images given as examples in the paper itself appear to falsify this paragraph, it seems clear that this paragraph is not true. At no point in the entire paper other than this paragraph do they say anything about only being interested in Caucasian people, and they do not mention subsetting the data to Caucasians. They do mention subsetting the data to women.
While I generally believe in making the most charitable assumptions, it seems implausible that this might be a mistake or that the authors might be unaware that this paragraph is false. Not only do the reference images in their own paper falsify it, one of their data sets is drawn from a paper titled “SCUT-FBP5500: A Diverse Benchmark Dataset for Multi-Paradigm Facial Beauty Prediction.” The very first page of this paper prominently features a graphic showing non-Caucasian people and the word “diverse” appears in their abstract three times. For their other data set (CelebA), the project website shows demo examples of non-Caucasian people. It does not appear possible for the authors to have done their due diligence and not noticed this. Additionally, it took me only a couple of seconds to find dozens of papers studying “the relationship between gender, race, cultural background and the perception of facial attractiveness” and even Googling that phrase brings up lots of papers. Given the location of this paragraph within the paper and the fact that the paragraph blatantly contradicts the description of the experiment in the paper itself, I fear this paragraph was added later, after concerns about the paper were raised, in order to mislead the reader and justify their poor ethical practice.
I also believe that the validation methodologies considered by the paper are extremely insufficient, even setting aside social and ethical concerns. The authors say
>To evaluate the image quality from human’s perception, we develop a user study and ask users to vote the most attractive one among ours and the baseline. 100 face images from testing set are submitted to Amazon Mechanical Turk (AMT), and each survey requires 20 users. We collect 2000 data points in total to evaluate human preference. The final results demonstrate the superiority of out model, showing in Table 1.
This is a rather small sample size, especially as no analysis of variance or estimation of uncertainty is done. Despite the extensive literature on how socioeconomic and racial factors influence assessments of attraction, the paper never discusses these attributes of the Mechanical Turk raters. Additionally, they never actually assess whether people find the computer-generated images more attractive than the reference images, which is purportedly the entire purpose of the paper. They only ask if the image their algorithm generates is more attractive than other computer-generated images. The only further validation is that they ask their algorithm to score the beauty of the new images and find that on average the beauty rating goes up. This isn’t evidence of anything meaningful at all, as they’re using the same algorithm to evaluate whether the beauty increased as they used inside their GAN to make the image more beautiful in the first place.
This is all the validation that they do in the paper.
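To make the point about missing uncertainty estimates concrete, here is a minimal sketch of the kind of analysis I would have expected for a 100-image, 20-rater preference study. It is not the authors' method and none of it comes from the paper: the function names are mine, and the vote counts are made-up placeholders, since the paper does not report per-image results. The Wilson interval treats all votes as independent, while the bootstrap resamples whole images, which is arguably more appropriate because the 20 votes on a single image are not independent of each other.

```python
import math
import random


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = wins / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def cluster_bootstrap_interval(votes_per_image, n_resamples=2000, seed=0):
    """Percentile bootstrap over images.

    votes_per_image is a list of (wins, total) pairs, one pair per image.
    Whole images are resampled with replacement so that the correlation
    between votes on the same image is respected.
    """
    rng = random.Random(seed)
    k = len(votes_per_image)
    shares = []
    for _ in range(n_resamples):
        sample = [votes_per_image[rng.randrange(k)] for _ in range(k)]
        wins = sum(w for w, _ in sample)
        total = sum(t for _, t in sample)
        shares.append(wins / total)
    shares.sort()
    lower = shares[int(0.025 * n_resamples)]
    upper = shares[int(0.975 * n_resamples) - 1]
    return lower, upper


# HYPOTHETICAL data, NOT from the paper: 100 images, 20 votes each,
# where the proposed model wins 12/20 votes on 60 images and 8/20 on 40.
hypothetical_votes = [(12, 20)] * 60 + [(8, 20)] * 40

total_wins = sum(w for w, _ in hypothetical_votes)
total_votes = sum(t for _, t in hypothetical_votes)

print("share of votes:", total_wins / total_votes)
print("Wilson interval:", wilson_interval(total_wins, total_votes))
print("bootstrap interval:", cluster_bootstrap_interval(hypothetical_votes))
```

Even something this simple would tell the reader whether a reported preference share is distinguishable from a coin flip. It would not address the rater demographics issue or the fact that the comparison is only against other computer-generated images.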
You can find the paper on arXiv here: https://arxiv.org/abs/1912.03630
Edit: This post is based on the email I sent but I want to be clear that it is not the exact text. It has been edited for grammar and clarity. The content has not been substantively changed.