Visual Generative AI: bias and taste
It is easy to see how most images generated for “CEO” depict men, while “nurses” are depicted as women, reflecting and reinforcing imbalances in society. “Native Americans” are mostly shown wearing headdresses, a stereotype that does not reflect real-life customs.
The models used in the research, Stability AI’s Stable Diffusion and OpenAI’s DALL·E, are trained on millions of images paired with textual descriptions downloaded from the web. The relations between these texts and pictures are learned statistically by the models. Besides the issues of privacy and authorship (few of these images carry an explicit license allowing this kind of use), the raw, unfiltered use of the material implies that the outputs will always reflect the existing problems in society.
If most pictures of CEOs on the Internet show men, the generated images will too. Large language models such as ChatGPT suffer the same fate, which is why OpenAI adds many constraints to their use, such as refusing to answer certain questions. Image models, however, usually lack these safeguards, particularly when they are published as open source.
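To see this concretely, one can audit a model’s own output: generate a batch of images for a neutral prompt and tally how an attribute is portrayed. Below is a minimal sketch, assuming the Hugging Face diffusers and transformers libraries; the model IDs, the prompt and the two label texts are only illustrative, and CLIP, used here as a crude zero-shot classifier, carries biases of its own.

```python
# Rough audit sketch: generate images for a neutral prompt and tally how an
# attribute is portrayed. Model IDs, prompt and labels are illustrative.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to(device)
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompt = "a portrait photo of a CEO"                  # deliberately unspecified
labels = ["a photo of a man", "a photo of a woman"]   # crude binary proxy
counts = {label: 0 for label in labels}

for _ in range(20):                                   # small sample, for illustration
    image = pipe(prompt).images[0]
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True).to(device)
    probs = clip(**inputs).logits_per_image.softmax(dim=-1)[0]
    counts[labels[int(probs.argmax())]] += 1

print(counts)  # the tally tends to mirror the skew described above
```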
A lesser-known prejudice in such models has to do with aesthetics and subjectivity. To people who are not WEIRD (Western, educated, industrialized, rich, democratic), images generated by AI all share the same dull look. This is not to dismiss the impressive capabilities of generative engines; the technology has come a long way since the early, surrealistic-looking GANs.
But many elements in the training techniques of diffusion models lead to biases in taste. First, there is the fact that, again, other visual cultures are underrepresented in the image sets scraped from the Internet: users in developed countries are far more likely to publish content than their counterparts in the Global South.
But perhaps even more importantly, the judgment of subjective quality in these sets (and in the networks trained from them) is carried out almost exclusively by people from WEIRD groups.
One way this happens is in quality filtering. The raw scraped collections are not appropriate for training as they contain too many garbage pictures: out of focus, poorly lit, fragments of logos… However, manually filtering five billion images for quality would be impossible.
Therefore, specialized networks are trained for this task, and the data used to train these curating algorithms comes from voting systems: users are shown a variety of pictures and asked to rate them on a scale of, say, 1 to 10. Once the predictor is trained, it is used to select a subset of the original collection. In the case of LAION-5B (“5B” for five billion), the set originally used by Stable Diffusion, this yielded the LAION-Aesthetics set, with “only” 600 million images.
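Predictors of this kind are typically small regression heads sitting on top of frozen image embeddings; the LAION aesthetic predictor, for instance, operates on CLIP features. Below is a minimal sketch of the pattern, using random stand-in data: fit a tiny network to human 1-to-10 ratings, then keep only the images whose predicted score clears a cutoff. The architecture, embedding size and cutoff value are illustrative.

```python
# Sketch of an aesthetic predictor: a small regression head on precomputed
# image embeddings, trained on human 1-10 ratings, then used to filter a
# larger collection. Data is random; dimensions and cutoff are illustrative.
import torch
import torch.nn as nn

EMB_DIM = 768  # e.g. the size of a CLIP image embedding

class AestheticHead(nn.Module):
    def __init__(self, emb_dim: int = EMB_DIM):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(emb_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(emb).squeeze(-1)  # one predicted score per image

embeddings = torch.randn(1000, EMB_DIM)   # stand-in for image embeddings
ratings = torch.rand(1000) * 9 + 1        # stand-in for collected 1-10 votes

model = AestheticHead()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(100):                  # toy regression on the votes
    optimizer.zero_grad()
    loss = loss_fn(model(embeddings), ratings)
    loss.backward()
    optimizer.step()

# The filtering step: everything below the cutoff disappears from the
# training set, baking the raters' taste into whatever is trained on it.
with torch.no_grad():
    scores = model(embeddings)
kept = embeddings[scores > 5.0]
print(f"kept {kept.shape[0]} of {embeddings.shape[0]} images")
```

Whatever the exact architecture, the decisive step is the last one: the predictor’s notion of “good” is whatever the raters voted for, and everything below the cutoff simply never reaches the model.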
But who were these voters? Whose subjective taste got imprinted on the training set? In the case of LAION, the votes came from early generative-image geeks gathered in Discord groups, the participants of a digital photography competition and even a group of German high school students: all people who carry very specific kinds of visual education.
Finally, voting systems are also used during the training of the models themselves, to guide them toward “good-looking” pictures, and again the votes come predominantly from a small, non-diverse community of people.
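The mechanics are similar here. One common approach, sketched below under broad assumptions, is to turn pairwise votes (“which of these two images looks better?”) into a reward model with a Bradley-Terry style loss; fine-tuning then pushes the generator toward whatever that reward model scores highly. Everything in the sketch is a placeholder, not any particular vendor’s pipeline.

```python
# Sketch of turning pairwise "which looks better?" votes into a reward model,
# Bradley-Terry style. The linear reward head and the random data are
# placeholders; real systems use image features and far larger vote sets.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 768
reward = nn.Linear(EMB_DIM, 1)            # toy reward head on image embeddings
optimizer = torch.optim.Adam(reward.parameters(), lr=1e-3)

# Each vote pairs the embedding of the preferred image with the rejected one.
preferred = torch.randn(512, EMB_DIM)
rejected = torch.randn(512, EMB_DIM)

for step in range(200):
    r_win = reward(preferred).squeeze(-1)
    r_lose = reward(rejected).squeeze(-1)
    # Push the chosen images' scores above the rejected ones' scores.
    loss = -F.logsigmoid(r_win - r_lose).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Whatever the generator is later tuned to maximize is exactly the taste of
# the people who cast these votes.
```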
Generative AI has just begun
As with any nascent medium, it causes a variety of disruptions and has many rough edges still to be smoothed. Since it is not considered a high-risk activity, it may even fall through the gaps of legislation and escape mandatory safeguards. But for all of us practitioners of creative computing, it is fundamental to be aware of its problems and to actively mitigate its biases.
Just as affirmative action is an effort to address imbalances of representation in education and governance, an active stance must be taken so that the unfairness of society is not carried over into the outputs of generative AI, which may become very influential in the not-so-distant future.