Talks Tech #51: Fairness and Bias in Recommendation Systems
Welcome to the Women Who Code podcast. I am Ashmi Banerjee, a PhD candidate at the Technical University of Munich specializing in recommender systems research. Today, we will explore the topic of fairness and biases in recommender systems.
First, we will start by understanding how these systems work. Then, we will identify the different stakeholders involved in the recommendation process, along with the challenges and the importance of fairness. Finally, we will delve into effective bias mitigation strategies.
When we talk about recommender systems, I'm sure you have used them in some form or another, with or without knowing what they are. For example, I'm sure that in today's era, you have come across systems that help you buy products, such as Amazon or eBay, or social networks that recommend newsfeed items or friends, such as Facebook, Instagram, and Twitter.
So these are all different recommender systems, from travel to e-commerce to movie and music recommendations on Netflix, Spotify, or YouTube. Everywhere, it's a recommender system. So, what is a recommender system?
A recommender system is something that provides personalized suggestions to users. It uses a variety of criteria, such as past purchases, search history, and demographics, to recommend a variety of items to the users. One of the most important uses of a recommender system is to minimize information overload. Often, users on the web are presented with numerous choices, and they don't know which one to choose. That's exactly where recommender systems come into play. They take into consideration all the data that they have about the users and their preferences and try to recommend things that the user might like.
However, a recommender system is a multi-stakeholder environment. A stakeholder, in the case of recommender systems, is any group or individual that is affected by, or can affect, the delivery of recommendations. In a conventional recommender system scenario, there are typically three major stakeholder classes: the consumers, the item providers, and the platforms.
Now, let's try to understand this with the help of an example. Let us consider the use case of hotel booking. As a user, let's assume that you would like to book a hotel on one of the very popular websites, such as Booking.com, or it could be Airbnb, whichever you choose. In that case, you, the user, become the consumer because you are consuming the item, and the item, in this case, is the hotel or the accommodation that you are looking for.
The website where the user is looking for recommendations becomes the platform. Each stakeholder has a vested interest in this process of recommendation. Travelers, for example, want to find hotels that match their preferences. The hotels want fair exposure to attract guests. And the booking platforms would like to maximize the commission that they receive from these hotels and maintain long-term relationships with both the users and the hotel providers. These stakeholders all depend on each other for their economic well-being. Therefore, the booking platform must consider the needs of all stakeholders when generating recommendations, which makes the whole recommendation process very complicated. The goal of the recommender system should be to consider the needs of all the stakeholders involved in this process.
That's the ideal scenario. But what happens in reality? Let's have a look. In reality, recommender system applications have often been responsible for sparking different controversies. For example, some years ago, Amazon had an automated tool to screen applicants for job interviews. However, after some time, they realized that they were getting fewer female candidates than male ones. An investigation revealed that the automated screening model was biased against women because it was trained on historical data. This historical data was heavily dominated by male applicants owing to the smaller number of women in tech, which is why we have organizations like Women Who Code, which are working towards empowering more women in tech. And even though Amazon took immediate action to mitigate this bias, such scenarios are still very, very common in the real world.
Another example is Facebook's ad delivery algorithm, which research shows is very discriminatory. For example, depending on whether an ad includes a picture of a woman or a picture of a man, it is delivered to different audiences. The delivery was found to discriminate based on the race, gender, and age of the people in the photos. Similarly, other platforms, such as YouTube or Spotify, have often been accused of prioritizing specific playlists or artists over others, which often leads to disparity in the exposure and popularity they receive. This brings us to the following question. What causes this behavior? The causes for such biased treatment of specific stakeholders can be attributed to several reasons. One of them could be statistical or data bias.
Take the case where your data is biased. In the case of Amazon, we saw that the model was trained on a dataset with very few female applicants. Therefore, when the model was trained on such data, it learned to be discriminatory towards women. And this is a very, very common scenario. Recommender systems are trained on such data, biased towards one or several stakeholder groups, and then these systems learn the patterns and propagate them. Another cause could be, for example, cultural bias, meaning the interpretations and judgments that we acquire throughout our lives. For example, some racial, gender, sexual, or other biases that we may have are then transmitted through us into the data and get reflected in the system. There can also be cognitive biases, which are systematic patterns of deviation from the norm or from rationality in judgment. One example is confirmation bias, which is the tendency to favor information that supports one's existing beliefs. These are some of the biases that are very, very common in machine learning or information retrieval scenarios.
However, when it comes specifically to recommender systems, we have two more biases that are additionally relevant. One of them is popularity bias, where popular items are repeatedly recommended, creating a self-reinforcing cycle of popularity. For example, earlier, I mentioned the case of Spotify or YouTube, where certain tracks or items are recommended more often than others, which may lead to some items receiving more popularity than others. This limits the diversity of the recommendations and also restricts the user's exposure to new or niche content. The user is often very limited in what they are getting in terms of recommendations. We did some research based on datasets from Yelp, and our research shows that the most popular restaurants on Yelp get way more exposure, irrespective of their ratings, than the less popular ones.
The less popular restaurants located further away from the popular areas of a city received way less exposure, even if they were of very good quality in terms of ratings. Similarly, another common bias that exists in recommender systems is called ranking bias or position bias. Recommender systems generate a ranked list of results. When the user interacts with the items in this ranked list, it is often seen that the user only interacts with the top items of the list and not the ones that are located at the bottom. For example, how many of you would check out the items on the second page of your Google search? Yeah. Situations like this often result in position bias, where the user focuses only on the top-ranked items. So, less exposure is received by the items ranked lower in the list.
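To make position bias a bit more concrete, here is a minimal Python sketch of how one might measure it: compute the click-through rate at each rank position from an impression log. The function name and the tiny log below are made up purely for illustration; they do not come from any real platform or dataset.

```python
from collections import defaultdict


def click_through_rate_by_position(impressions):
    """Return {position: clicks / impressions} for (position, clicked) pairs."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for position, was_clicked in impressions:
        shown[position] += 1
        if was_clicked:
            clicked[position] += 1
    return {pos: clicked[pos] / shown[pos] for pos in sorted(shown)}


# Hypothetical log: (rank position in the list, whether the user clicked).
log = [
    (1, True), (1, True), (1, False), (1, True),
    (2, True), (2, False), (2, False), (2, False),
    (3, False), (3, False), (3, False), (3, False),
]

ctr = click_through_rate_by_position(log)
for position, rate in ctr.items():
    print(f"position {position}: CTR = {rate:.2f}")
```

If the click-through rate falls steeply with position even when the items are of comparable quality, that is a hint that position, and not only relevance, is driving the clicks.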
We also did some research using the Trivago dataset, which was made public during the ACM RecSys Challenge of 2019. It showed that people tend to click on items at the top of the list much more often than the ones that are located lower in the list. This confirmed the existence of position bias on these platforms. We talked a little bit about why the platforms behave in a way that they ideally should not. But now, it brings us to the next part: what could be done to fix these problems? One way to fix these problems is by being fair to all these stakeholders. This leads us to our next question. How do we define fairness? Fairness concerns how outcomes are assigned to a particular group of individuals. And it is often a political construct where someone decides to avoid direct or indirect harm to people.
However, the challenge is that there exists a multitude of definitions of fairness for algorithmic decision-making, and it's super hard to narrow down to a specific definition that will work for all the use cases. On a high level, the fairness definitions can be divided into individual and group fairness notions. As the name suggests, group fairness ensures fair treatment of similar subjects across different groups defined by certain protected attributes, such as race or gender. Individual fairness assesses whether individuals are treated fairly by ensuring that similar subjects receive a similar decision outcome. As I mentioned before, the recommender system is a multi-stakeholder environment. Therefore, fairness in a recommender system should also be a multi-sided concept that considers the needs of all the stakeholders. As I said before, the definition of fairness can vary depending on multiple factors and must be determined on a case-by-case basis.
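As a rough illustration of the two notions, here is a small Python sketch. For group fairness it uses one common criterion, the gap in positive-outcome rates between groups (often called demographic parity). For individual fairness it flags pairs of similar individuals who received very different scores. The metric choices, thresholds, names, and data are all assumptions made for this example; as noted above, the right definition has to be chosen case by case.

```python
from itertools import combinations


def demographic_parity_gap(decisions_by_group):
    """Group fairness: difference between highest and lowest positive rate."""
    rates = {
        group: sum(decisions) / len(decisions)
        for group, decisions in decisions_by_group.items()
    }
    return max(rates.values()) - min(rates.values()), rates


def individual_fairness_violations(profiles, scores, max_distance, max_score_gap):
    """Individual fairness: similar profiles should get similar scores."""
    violations = []
    for (i, a), (j, b) in combinations(enumerate(profiles), 2):
        distance = sum(abs(x - y) for x, y in zip(a, b))  # L1 distance
        if distance <= max_distance and abs(scores[i] - scores[j]) > max_score_gap:
            violations.append((i, j))
    return violations


# Hypothetical decisions (1 = recommended / shortlisted, 0 = not).
decisions_by_group = {"group_a": [1, 1, 1, 0], "group_b": [1, 0, 0, 0]}
gap, rates = demographic_parity_gap(decisions_by_group)

# Hypothetical feature vectors and model scores for three individuals.
profiles = [(5, 3), (5, 3), (1, 0)]
scores = [0.9, 0.4, 0.2]
violations = individual_fairness_violations(
    profiles, scores, max_distance=0, max_score_gap=0.1
)

print(rates, gap)
print(violations)
```

Here the first two individuals have identical profiles but very different scores, so that pair is flagged, while the group-level gap shows that one group receives positive outcomes far more often than the other.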
Why do we think that we need to be fair while we are recommending items or allocating certain resources? Because recommending items is a way of allocating limited resources. Recommendation slot positions are limited because, as I said before, as you go further down the ranking, the user's attention drops drastically. The probability of that item being seen or liked by the user also drops. This means that items need to be recommended near the top to get the exposure opportunity that they deserve.
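One way to picture this drop in attention is to assign each rank position an exposure weight that shrinks as you go down the list, and then add up how much exposure each group of providers receives. The Python sketch below assumes a logarithmic discount, 1 / log2(rank + 1), which is a common modeling choice in ranking evaluation but only one of several possibilities. The hotel names and groups are hypothetical.

```python
import math
from collections import defaultdict


def position_exposure(rank):
    """Assumed attention model: exposure decays logarithmically with rank."""
    return 1.0 / math.log2(rank + 1)


def exposure_share_by_group(ranking, group_of):
    """Share of total exposure that each provider group gets in one ranking."""
    totals = defaultdict(float)
    for rank, item in enumerate(ranking, start=1):
        totals[group_of[item]] += position_exposure(rank)
    overall = sum(totals.values())
    return {group: value / overall for group, value in totals.items()}


# Hypothetical ranked list and provider groups.
ranking = ["hotel_a", "hotel_b", "hotel_c", "hotel_d"]
group_of = {
    "hotel_a": "chain",
    "hotel_b": "chain",
    "hotel_c": "independent",
    "hotel_d": "independent",
}

shares = exposure_share_by_group(ranking, group_of)
for group, share in shares.items():
    print(f"{group}: {share:.1%} of exposure")
```

Even though both groups have two hotels each in the list, the group occupying the top two slots ends up with well over half of the exposure under this attention model.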
This is the reason why it is necessary that the ranking of the items be fair so that they get the exposure that they deserve. The next issue is that options are often abundant, but there is only one recommended solution. For example, consider the use case of matching drivers to passengers to maximize their profit when passengers are limited. Or, when we have to choose a single candidate from a pool of applicants for a single position, the whole thing becomes hard. If you can't do it fairly, it might affect the life of a particular individual looking for that job. So we should not encourage unfair practices; we should look into fair practices and make sure that our systems are fair.
Some other consequences of unfairness in recommender systems could be attributed to information asymmetry or harm of allocation. This happens, as I said, when a certain group is deprived of certain opportunities by a system despite being deserving. For example, in the case of a job opportunity that can change someone's life, the whole system must do the ranking in a very fair way. Another aspect is known as the Matthew Effect or the harm of representation. A very common example: when you search on Google for images of a nurse, most of the results show images of women.
This is something I first read about on the Internet, and then I did a search myself a few weeks ago, and I realized that nothing has changed. If you look for images of nurses on the web, you will find that the majority of the images are of women. This reinforces the subordination of certain groups along the lines of identity, such as gender or race, and also promotes the stereotype that whenever we are talking about nurses, it's always a woman.
I also came across recently that LLMs, such as ChatGPT, since they have been trained on massive amounts of data, are also biased in certain ways. If you talk about nurses, they often refer to the nurse as a woman. So yeah. The third one is the formation of echo chambers, or stereotyping, where we associate a certain type of behavior with a certain group of people and then treat them very differently from the way that they deserve. This makes people feel like the whole world feels the way they do, so they are basically in their bubbles. And then, it prevents them from exploring new ideas and opinions.
We talked a lot about the bad influences of unfairness and how it can affect recommender systems and also our lives. But this brings us to the next part: how do we mitigate this? There are some high-level strategies in recommender systems that can be used to mitigate unfairness and the biases that exist in the systems. One is algorithmic strategies. We are all computer scientists. We use different algorithms to train our models, and one way is to use algorithms that reduce the bias. Algorithmic strategies include pre-processing, which means transforming the data to eliminate the underlying discrimination. Then there is in-processing, where we try to modify the learning algorithms during the training period to prevent discrimination. Last, there are post-processing strategies, where we perform a post-training evaluation on a holdout set that has not been used in the model training process.
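To give a feel for what a pre-processing strategy can look like, here is a minimal Python sketch of one well-known idea, reweighing: each training example gets a weight of P(group) * P(label) / P(group, label), so that after weighting, the label is statistically independent of the protected group. This is just one possible pre-processing technique, picked for illustration; the function name and the toy data are assumptions, not something from a specific system.

```python
from collections import Counter


def reweighing_weights(groups, labels):
    """Per-example weights that balance label rates across groups."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    pair_counts = Counter(zip(groups, labels))
    return [
        (group_counts[g] * label_counts[y]) / (n * pair_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]


def weighted_positive_rate(group, groups, labels, weights):
    """Weighted share of positive labels within one group."""
    total = sum(w for g, w in zip(groups, weights) if g == group)
    positive = sum(
        w for g, y, w in zip(groups, labels, weights) if g == group and y == 1
    )
    return positive / total


# Hypothetical historical data: group "m" dominates and is favored.
groups = ["m", "m", "m", "m", "m", "m", "f", "f"]
labels = [1, 1, 1, 1, 0, 0, 1, 0]

weights = reweighing_weights(groups, labels)
rate_m = weighted_positive_rate("m", groups, labels, weights)
rate_f = weighted_positive_rate("f", groups, labels, weights)
print(weights)
print(rate_m, rate_f)
```

Before weighting, the positive rate is about 67% for one group and 50% for the other; with the weights applied, both groups end up at the overall rate. These weights could then be passed as sample weights to a learning algorithm that supports them.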
Another interesting way of bias mitigation is the use of explanations. A lot of the different platforms use this process. For example, we are very familiar with statements such as, "This digital camera is a must-buy for you because you probably bought so and so," or something like that, right? These explanations help you understand why an item was recommended to you by the algorithm. So they help you in decoding the black box of the whole recommender system algorithm. The advantage of using such explanations is transparency, so you know that something is not recommended to you out of the blue, but there is a certain reason why it was recommended to you. It also helps to make the system a lot more trustworthy, and the system can use it for persuasiveness, that is, trying to persuade the user to buy a certain item or behave how the system wants the user to behave.
Last but not least, it also helps the user to make an effective decision. So it aids in this effective decision-making process. The third category of bias mitigation is the more generic things that you should do while you are training your models or working in this area. These are some of the generic best practices that you can undertake. For example, when you are starting, it's best to check for dataset imbalances, and if any exist, to mitigate them immediately so that they do not influence your results afterward. Another one would be to ensure that the model treats all the groups fairly. You should understand the model behavior, monitor it in production, and see whether the model behaves in the way you expect it to or if it has some sort of unexpected behavior.
From time to time, if it's necessary, then please retrain your model on new or updated data. So, as we conclude this journey into recommender systems and this very crucial topic of fairness and biases, I hope that you have gained some valuable insights and a better understanding of these algorithms and their ethical challenges.
Guest: Ashmi Banerjee, a PhD candidate at the Technical University of Munich specializing in recommender systems research
Producer: JL Lewitin, Senior Producer, WWCode