Career Nav #41: Biases That Exist in Toxicity Detection Systems for the Web
At the Laboratory for Computational Social Systems, we study various aspects of natural language processing and social computing. Within that broad area of social computing, my work focuses specifically on analyzing different aspects of hate speech: how to detect it, how it spreads, and the various biases that exist within detection systems.
When I talk about the theme of bias in toxicity detection, there are two aspects that we need to cover. First, what is meant by toxicity, and second, how do we define and mitigate the presence of bias in toxicity detection models? Hate speech, as we know, has existed within the human community for as long as our civilization has. Some form of stereotypes or hateful content has always existed in the real world. We see discrimination and bias happening all the time. In the online world, those same stereotypes and biases get amplified.
Cognitively speaking, some biases help us. They help us make decisions faster and are important for our evolution as a species. However, just because they help us reach a decision faster doesn't mean that biases help us filter the right information or make correct decisions all the time. When our biased interactions in the online world are used as a source of data to train machine learning models, the models pick up on these bias cues. They do a statistical sampling of the provided data and try to develop patterns to discriminate between different classes. If we have certain stereotypes that get reinforced again and again, then the same stereotypes act as statistically significant data points for the model to use as a basis for its classification.
When a stereotype exists in the real world and is used frequently in the online world, it gets picked up by the models that depend on online data. These biases get amplified and sometimes become the basis on which the model starts learning things. In Natural Language Processing, we have word vectors. There have been extensive studies showing that these word vectors are biased. Gender is one aspect of it. Some studies have shown that these word vectors discriminate based on dialect. When these vectors are employed in hate speech detection systems, the systems pick up on particular dialects more readily than others. There have been studies showing that specific names that sound as if they are of Black origin are associated with harmful adjectives and negative sentiments within the word embedding model.
Unfortunately, there is no strict definition of what can be considered hateful and what can be regarded as toxic. It depends on the context, the geography, the social platform you are talking about, and time. There are specific terms that were once considered abusive and offensive but have now been reclaimed by the very people against whom these words were used. So there is a temporal and contextual aspect to the detection of toxic content. It is not as objective as trying to detect the presence of a cat in an image. No matter how you modify the picture, for example by rotating it, the cat remains a cat. In NLP, even a minor perturbation, such as changing one word or changing the sentiment of one word, can modify the sentence's context completely. We also look at the nuances of who is talking to whom. Who is this, or what community does this person represent? What marginalized group does this person represent? What is the bigger context of this conversation? What was said before, and what has been said after?
Once you start adding all of these features, you realize that detecting toxic content, and even adversarially attacking such content, is very difficult and contextual. The data itself can be corrupted by a lack of information on the annotator's side. Suppose you want to mitigate racial bias in the labeling of your hate speech data, and you tell the annotator, "Hey, this statement was made by someone from the Black community." If the annotator is sensitized, they will use this information in the correct context. If the annotator is biased against a particular race, then they can use this extra information to reinforce that bias and purposely label the data as hateful just because they have a prejudice that they want to reinforce. How you present these sensitive attributes to the moderators and annotators is important.
There is no golden rule for how much information can be provided so that the context is complete while, at the same time, it does not bring out the existing prejudice of the people looking at such content, flagging it, and moderating it. Once content is flagged, we build classification systems on it. Whatever biases have come in through the annotation and labeling of the data that we collected trickle down into the models themselves.
Then you start looking at these models from different verticals. How is this model performing when I look at content written by or written against females? How are sexism, racism, and Islamophobia captured contextually by the model? When we start analyzing these aspects, we start seeing some unintended biases that trickle down from the source into these different demographic groups.
Recent studies have also shown that your political ideology can be a source of unintended bias for the model. The use of specific terms and phrases by people of one political orientation versus the other can be used by the model to discriminate and say that one group is more toxic than the other.
When people have studied race, they have mainly looked at discrimination against the Black community and its dialect, which is termed the African-American dialect. Here, though, the notion is not that one dialect is better than the other, but that because of the prevalence of the data that is present online, one dialect gets discriminated against more than the other. Suppose someone from the Black community posts a text and is talking colloquially, in the language in which they speak with other members of the community. A third party looking at that interaction, whether it's a content moderator or a toxicity detection system, does not have the context that this is a person from the Black community talking to another person from the Black community, and the presence of certain words can be used to mark the content as hateful. The presence of ethnic slurs, keywords, and colloquial terms that could be acceptable within the community but unacceptable otherwise can be used by the model as a discriminating feature, and that becomes a source of bias.
When that happens, the false positive rate against these communities increases. If you look at the broader social impact, you are saying that I want to build a model to detect hate against the Black community. Unfortunately, these unintended biases will lead you to detect the content posted by the Black community as more hateful. When that happens, their content gets flagged more. Their content gets taken down more. As a result, instead of helping the Black community, you are reinforcing the marginalization.
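The increase in false positives described above can be made concrete by computing the false positive rate separately for each group. The following is a minimal sketch, not any particular study's evaluation code; the group names, labels, and predictions are made-up toy values.

```python
from collections import defaultdict


def false_positive_rate_by_group(records):
    """Compute the false positive rate per group.

    records: iterable of (group, true_label, predicted_label),
    where 1 means toxic and 0 means non-toxic.
    """
    false_pos = defaultdict(int)
    negatives = defaultdict(int)
    for group, y_true, y_pred in records:
        if y_true == 0:  # only truly non-toxic posts can be false positives
            negatives[group] += 1
            if y_pred == 1:
                false_pos[group] += 1
    return {g: false_pos[g] / negatives[g] for g in negatives}


# Hypothetical toy data: (dialect group, true label, model prediction)
records = [
    ("dialect_a", 0, 1),
    ("dialect_a", 0, 1),
    ("dialect_a", 0, 0),
    ("dialect_a", 1, 1),
    ("dialect_b", 0, 0),
    ("dialect_b", 0, 0),
    ("dialect_b", 0, 1),
    ("dialect_b", 0, 0),
    ("dialect_b", 1, 1),
]

rates = false_positive_rate_by_group(records)
for group, rate in sorted(rates.items()):
    print(f"{group}: false positive rate = {rate:.2f}")
```

A large gap between the groups' rates on truly non-toxic content is the kind of signal that suggests the model is flagging one community's posts more often for reasons other than their content.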
One way to check for this is to transform the dialect of a text while keeping its content the same. If the model says of the transformed-dialect version, "Hey, I think that this is non-toxic in nature," but says that the same content in the original dialect is toxic, then we can say, "Hey, it looks like the model has a bias towards dialect. It does not focus on the content and context, but rather uses dialect as a discriminating feature." Once you can determine this, you can use a corrective pipeline. It is possible that texts in particular Black-oriented dialects were annotated as hate.
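The dialect check described above can be sketched as a counterfactual test over pairs of texts that say the same thing in two dialects. This is only an illustration under stated assumptions: `toy_biased_classifier` is a deliberately biased stand-in for a real toxicity model, and the rewritten pairs are hand-written here, whereas a real pipeline would get them from a style-transfer model.

```python
def dialect_flip_rate(pairs, classify):
    """Return the share of pairs whose predicted label changes when only
    the dialect changes, together with the pairs that flipped."""
    flips = [(a, b) for a, b in pairs if classify(a) != classify(b)]
    return len(flips) / len(pairs), flips


def toy_biased_classifier(text):
    # Hypothetical, deliberately biased stand-in for a toxicity model:
    # it treats a harmless dialect marker as if it were a toxicity cue.
    return 1 if "finna" in text.lower().split() else 0


# Hand-written pairs: same content, different dialect (toy examples)
pairs = [
    ("I'm finna head out", "I'm about to head out"),
    ("we finna eat", "we are about to eat"),
    ("see you tomorrow", "see you tomorrow then"),
]

rate, flips = dialect_flip_rate(pairs, toy_biased_classifier)
print(f"label flipped on {rate:.0%} of pairs")
for original, rewritten in flips:
    print(f"  flipped: {original!r} vs {rewritten!r}")
```

A non-zero flip rate on content that a human would judge identical in meaning is evidence that the model is using dialect, rather than content, as a discriminating feature.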
Using a generative AI and style-transfer pipeline, you can correct the label and augment your dataset. That corrected, augmented information is given to the toxicity detection model. The authors observed that when this simple corrective pipeline was used, they could reduce the false positive rate for text being detected as hateful against the Black community. Using simple corrective pipelines is one way of overcoming it. Another could be the use of adversarial techniques. A group of researchers said, "Okay, I have a toxicity detection model, and then I have a model that detects dialect. I'm going to use this dialect detection system as a base, and I will mark all my samples for the dialect. I will develop a discriminative model. I have one model that is detecting hate. At the same time, the embedding that is used, the word embedding, or the sentence embedding that is used to detect hate, I'm going to send the same embedding to a discriminator. This discriminator will try to predict the protected attribute of the dialect. If it can do that, it means my model is learning the markers of dialect, and it is using that to predict toxicity."
This adversarial system aims to make sure that I am building towards sentence embeddings that are specifically focused on the context, so that the classification of the text as hateful is as accurate as possible. At the same time, the same embedding should carry as few dialect-based features as possible, so that the discriminator that is supposed to detect dialect performs as poorly as possible. The combination of these two systems reduced the bias in detecting hateful content against people from the Black community.
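The two competing goals described above are often written as a single objective for the embedding: the toxicity loss minus a weighted dialect loss. The sketch below shows that common formulation for one example; it is an assumption for illustration, not necessarily the exact loss the researchers used, and the name `encoder_objective` and the weight `lam` are hypothetical.

```python
import math


def binary_cross_entropy(y_true, prob):
    """Cross-entropy loss for one example; prob is the predicted P(label = 1)."""
    eps = 1e-12
    prob = min(max(prob, eps), 1 - eps)  # avoid log(0)
    return -(y_true * math.log(prob) + (1 - y_true) * math.log(1 - prob))


def encoder_objective(tox_label, tox_prob, dialect_label, dialect_prob, lam=1.0):
    """Quantity the embedding is trained to minimize (assumed formulation).

    A low value means the toxicity classifier is accurate (low toxicity loss)
    while the dialect discriminator is doing badly (high dialect loss).
    The discriminator itself is trained separately to minimize its own loss.
    """
    tox_loss = binary_cross_entropy(tox_label, tox_prob)
    dialect_loss = binary_cross_entropy(dialect_label, dialect_prob)
    return tox_loss - lam * dialect_loss


# Same toxicity prediction, two different embeddings (toy numbers):
# one lets the discriminator read off the dialect, one leaves it at chance.
leaky = encoder_objective(1, 0.9, dialect_label=1, dialect_prob=0.9)
dialect_blind = encoder_objective(1, 0.9, dialect_label=1, dialect_prob=0.5)
print(f"leaky embedding objective:         {leaky:.3f}")
print(f"dialect-blind embedding objective: {dialect_blind:.3f}")
```

The embedding from which the discriminator can only guess at chance gets the lower objective, so training pushes the model towards representations that carry less dialect information while keeping the toxicity prediction accurate.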
One aspect of the future direction of studying racial bias is to build better systems. First, we need systems that detect the presence of certain racial and ethnic slurs while at the same time not being used against the very ethnic group that we are trying to study. Developing such systems requires not just knowledge about these cultural groups and knowledge about the ethnic slurs but also knowledge about the context. For example, a study analyzed hateful content within the drag community. It observed that, to an unintended audience that was not aware of how people within the drag community communicate with each other, members appeared to use a lot of seemingly harsh and abusive terms to encourage each other. If a third person is looking at this and they're not aware of how the drag community collectively communicates, they will think that this is abusive content. Second, we need to combine not just our census-based information and our online information but also look at other data sources. Involve people from different backgrounds, not just statisticians and computer scientists. Include people who study race from a social and psychological perspective so that their inputs can also be used to determine whether the quality of the data that has been corrected represents the community that we want to map to statistically.
Moving on from racial bias to gender bias, we are well aware of the fact that word embeddings have issues with feminine and masculine features. Again, this is an unfortunate situation in that the majority of the studies that look at different aspects of gender bias in word embedding systems and toxicity detection systems assume a binary demarcation of masculine and feminine. We need more fine-grained and diverse labels for analyzing gender and sexual orientation-based biases in toxicity detection. One group of researchers suggested that we perform data augmentation in the form of gender swapping. If you have a statement that says she is a bitch, you augment that by saying he is a bitch. That way, you are telling the model that it needs to focus on the context of the sentence and not just associate a certain gender with certain keywords and use that as a marker of hate.
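The gender-swapping augmentation described above can be sketched in a few lines. This is a minimal illustration, assuming a tiny hand-made swap list (`SWAP`) and toy sentences; it is not the researchers' implementation. Ambiguous words such as "her", which can map to either "him" or "his", are deliberately left out, because swapping them blindly produces ungrammatical text.

```python
import re

# Hypothetical, deliberately tiny swap list covering only a binary view of gender.
SWAP = {
    "he": "she", "she": "he",
    "man": "woman", "woman": "man",
    "boy": "girl", "girl": "boy",
}
PATTERN = re.compile(r"\b(" + "|".join(SWAP) + r")\b", re.IGNORECASE)


def gender_swap(text):
    """Replace each gendered word with its counterpart, keeping capitalization."""
    def replace(match):
        word = match.group(0)
        swapped = SWAP[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped
    return PATTERN.sub(replace, text)


def augment(dataset):
    """Add a gender-swapped copy of each example, reusing the original label."""
    augmented = []
    for text, label in dataset:
        augmented.append((text, label))
        swapped = gender_swap(text)
        if swapped != text:  # skip examples with nothing to swap
            augmented.append((swapped, label))
    return augmented


# Toy labeled data: 1 = toxic, 0 = non-toxic
dataset = [("She is so annoying", 1), ("the weather is nice", 0)]
for text, label in augment(dataset):
    print(label, text)
```

Because the swapped copy inherits the original label, the augmentation is only as good as that label, and the output still needs a human check for sentences that stop making sense after the swap.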
When you are performing gender augmentation, you are assuming that the label is already correct. You are assuming, for example, that the statement is non-toxic in nature with both he and she. This context is given to the model to learn from. However, there are issues with gender swapping. First, these augmentation techniques only look at a binary classification of gender. Second, within NLP, certain word swaps may lead to gibberish content. If the model picks up on it, it is picking up on gibberish information and trying to make sense of that. You cannot blindly do gender swapping. This needs to be supervised by humans. Similarly, when the authors did the swapping of dialects, they assumed that the dialect swapping would not lead to a change in the context of the text that is generated.
Another way of trying to reduce the presence of unintended bias with respect to gender in toxicity detection models would be to use de-biased word embeddings. A good amount of research has been done, and is still ongoing, on de-biasing word embeddings. If you take word embeddings to be vectors in a multi-dimensional space, and there is a negative, abusive term associated more with females, the de-biasing technique aims to bring this term to an equal distance from both the masculine and the feminine ends of the gender axis. In that way, you say the embedding has been de-biased because the abusive term is no longer associated more with a particular gender. When you use those de-biased word embeddings in toxicity detection models, we observe a reduction in false positives in determining toxic content. Simple, effective techniques like gender swapping and using already de-biased embeddings for mitigating gender bias have been studied in the literature.
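The idea of bringing a term to an equal distance from the masculine and feminine sides can be illustrated with a projection: remove the component of the word's vector that lies along a gender direction. This is a minimal sketch of that one step with made-up three-dimensional vectors; real embeddings have hundreds of dimensions, and the studies mentioned above may use different or more elaborate procedures.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def neutralize(word_vec, gender_direction):
    """Remove the component of word_vec that lies along gender_direction."""
    scale = dot(word_vec, gender_direction) / dot(gender_direction, gender_direction)
    return [w - scale * g for w, g in zip(word_vec, gender_direction)]


# Made-up toy vectors (assumptions for illustration only)
he = [1.0, 0.2, 0.0]
she = [-1.0, 0.2, 0.0]
abusive_term = [-0.6, 0.5, 0.3]  # starts out closer to "she" than to "he"

gender_direction = [a - b for a, b in zip(he, she)]
debiased = neutralize(abusive_term, gender_direction)

print("before:", squared_distance(abusive_term, he), squared_distance(abusive_term, she))
print("after: ", squared_distance(debiased, he), squared_distance(debiased, she))
```

After the projection the term sits at the same distance from both gendered words, which is what the equal-distance description above is aiming for; whether this removes the bias a downstream toxicity model learns is a separate empirical question.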
Once you have studied aspects of race and gender, then comes the question of what happens at the intersection of the two. Through one series of studies and experiments, we have seen that people from the Black community are more likely to have their content labeled as toxic. At the same time, you have another series of studies that says that abusive terms have a higher chance of being associated with feminine attributes than with masculine ones. Content targeted against and generated by women will be more likely to be labeled as toxic.
When you look at the intersection of the two, race and gender together, one study found that Black males are more likely to have their content labeled as hateful compared to Black females, White males, and White females. It is important to acknowledge that we start seeing these issues when we look at biases at the intersection of two or more components. We are acknowledging the diversity in the real world, but we are not recognizing the same diversity when we start looking at biases in our machine learning models and biases in our toxicity detection models.
Another aspect of intersectional bias appears when you go beyond gender and race and start including geography as a feature. Most of the datasets within the hate speech community, within the toxic speech detection community, focus on English text. Even when you have a dataset on memes, it consists of Western references and Western cultures, and the text within the meme is English. You are looking at toxicity, or even the intersection of race and gender, from a Western perspective.
We need the research community to ask questions about what will happen when we start looking at the side effects of the data and the toxicity detection models that we have from a non-Western perspective of race, gender, religion, ageism, political orientation, and so on. One mitigation idea is this: instead of using a particular term that keeps getting labeled as hateful, why not replace it with a more generic one? So, say you know the terms Islam and Muslim are constantly labeled as hateful, why not replace them with generic terms like religion and faith? Intuitively, it makes sense that if you replace them with generic terms, you will see a reduction in bias against the words Muslim and Islam.
We have observed that the bias doesn't go away. It shifts. Bias, by itself, cannot be destroyed. It just moves from one form to another. We observed a shift in bias from the term Islam to the term religion. Going back to the case of the intersection of gender and race, if you are looking at mitigating, say, racial bias, how does it influence or translate into bias against gender? Does it lead to an increase in false positives concerning gender? Does it reduce them? Can the two aspects be de-biased at the same time?
Asking these questions is important, and this is one of the open areas of research: how you look at these side effects together and not just at one aspect of bias in isolation. In isolation, we have identified the presence of bias, and we have developed techniques for mitigating it in individual verticals, though of course these are incomplete and ongoing. How you bring them together will be important because the diversity in the real world needs to be acknowledged and accounted for when we are studying bias in the online world.
Guest: Sarah Masud, Doctoral Student, Laboratory for Computational Social Systems, IIIT-Delhi
Survey Paper: Handling Bias in Toxic Speech Detection: A Survey
Producer: JL Lewitin