Finance - the Adult dataset | Machine Learning Bias Mitigation

The first dataset we will consider is the Adult dataset from the UCI Machine Learning Repository. It was derived from the US Census and contains demographic information on tens of thousands of individuals. The typical task on this dataset is to predict from an individual's demographics whether they are a high earner, conventionally defined as having an annual income above $50,000. You can see the details of our data acquisition and preprocessing in this notebook, and the details of the data exploration below in this notebook.

We will assume that we are a credit card company that wants to target marketing of a new premium credit card to high earning individuals using a machine learning model. We have found the Adult dataset and want to use it to train a first version of the model.

Figure: Features in the Adult dataset

We have selected this dataset as it is open access and well known in the machine learning community. Moreover, it is commonly used in experiments in the academic literature on fairness, which means we can compare our findings here to those published in many of the original papers.

Before we proceed with modelling, we should think about what we are trying to achieve and identify possible bias-related risks, as well as whether the data we have available to us is appropriate for the task.

Purpose identification

Our goal is to predict whether an individual is a high earner in order to target marketing for a premium credit card. Since our model is based on demographic information, it is at risk of perpetuating or exacerbating demographic inequality. In particular, we should be mindful of possible race- or sex-based disparities present in the data, which the model could then learn and reproduce in its classifications.

Data identification

In this hypothetical example we don't have the means to collect different data. In the real world we would always consider whether we could get access to better, more representative data. For example, the data we are using was collected in 1994, so it is quite likely that the demographic patterns it captures are no longer representative. If we were performing this task for real, we would likely recommend collecting more up-to-date data that better reflects demographic shifts in the intervening years. Moreover, the data was collected in the US; if the model we are training is intended for UK use, this would be a further reason to collect more data.

Another consideration is the representation of different groups in the data. Men outnumber women almost two to one, and White respondents outnumber Black respondents nine to one and Asian respondents almost thirty to one. This means that our data is not representative of the overall population, which isn't ideal, but the bigger problem is that there are relatively few observations of precisely those individuals who are most at risk of mistreatment. To minimise the risk of bias we should make sure that all groups are well represented in our data.
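As a minimal sketch of how representation ratios like these could be computed, the snippet below counts the rows in each group and divides the size of a reference group by the size of every other group. The function names, the `sex` key and the toy rows are assumptions made for illustration; they are not the real Adult data or the code from our notebooks.

```python
from collections import Counter


def group_counts(rows, attribute):
    """Count how many rows fall into each value of `attribute`."""
    return Counter(row[attribute] for row in rows)


def representation_ratios(rows, attribute, reference):
    """Return count(reference) / count(group) for every other group."""
    counts = group_counts(rows, attribute)
    return {
        group: counts[reference] / n
        for group, n in counts.items()
        if group != reference
    }


# Hypothetical toy rows for illustration only, NOT the real Adult data.
toy_rows = [{"sex": "Male"}] * 6 + [{"sex": "Female"}] * 3

ratios = representation_ratios(toy_rows, "sex", reference="Male")
print(ratios)  # {'Female': 2.0}
```

A ratio of 2.0 here means the reference group is twice as large as the other group. On real data the same check would be run for each protected attribute in turn.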

Since we are merely demonstrating different methods of measuring and mitigating bias we proceed with the data as is, but in a real world application we should see if we can collect more data in order to ensure better representation.

Once the data has been preprocessed, we can look at bias that is present in the data, before worrying about bias introduced by the model. First, we look at the proportion of men and women that are high earners, and observe that the proportion is much higher for men.

Similarly, there is a large disparity in the proportion of high earners across different races.
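The proportions described above can be sketched as a per-group rate of the positive label. The snippet below is an illustrative sketch rather than our notebook code: the `high_earner` key, the function name and the toy rows are all assumptions, and the numbers it prints say nothing about the real dataset.

```python
from collections import defaultdict


def high_earner_rate_by_group(rows, group_key, label_key="high_earner"):
    """Return the fraction of rows with a positive label in each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += 1
        positives[row[group_key]] += int(row[label_key])
    return {group: positives[group] / totals[group] for group in totals}


# Hypothetical toy rows for illustration only, NOT the real Adult data.
toy_rows = [
    {"sex": "Male", "high_earner": True},
    {"sex": "Male", "high_earner": False},
    {"sex": "Male", "high_earner": True},
    {"sex": "Male", "high_earner": False},
    {"sex": "Female", "high_earner": True},
    {"sex": "Female", "high_earner": False},
    {"sex": "Female", "high_earner": False},
    {"sex": "Female", "high_earner": False},
]

rates = high_earner_rate_by_group(toy_rows, "sex")
gap = max(rates.values()) - min(rates.values())
print(rates)  # {'Male': 0.5, 'Female': 0.25}
print(gap)    # 0.25
```

The gap between the highest and lowest rate is one simple summary of the disparity in the data. Passing a race column as `group_key` would give the corresponding comparison across races.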

Finally there is, perhaps unsurprisingly, a clear relationship between the number of hours worked per week and the chance that the individual is a high earner. We will treat hours worked per week as a "legitimate risk factor" when we investigate imposing conditional demographic parity. This means that we would allow ourselves to target protected groups at different rates if the disparity is due to those groups working different numbers of hours per week on average.
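To make the idea concrete, here is a minimal sketch of a conditional demographic parity check: rows are bucketed by hours worked per week, and targeting rates are compared between groups within each bucket rather than overall. The band thresholds (35 and 45 hours), the key names `hours_per_week` and `targeted`, and the toy rows are assumptions chosen for illustration, not values taken from our analysis.

```python
from collections import defaultdict


def hours_band(hours):
    """Bucket hours per week into coarse bands (thresholds are assumptions)."""
    if hours < 35:
        return "part-time"
    if hours <= 45:
        return "full-time"
    return "overtime"


def conditional_rates(rows, group_key, outcome_key, hours_key="hours_per_week"):
    """Return {band: {group: rate of positive outcomes within that band}}."""
    totals = defaultdict(lambda: defaultdict(int))
    positives = defaultdict(lambda: defaultdict(int))
    for row in rows:
        band = hours_band(row[hours_key])
        group = row[group_key]
        totals[band][group] += 1
        positives[band][group] += int(row[outcome_key])
    return {
        band: {group: positives[band][group] / n for group, n in groups.items()}
        for band, groups in totals.items()
    }


def max_conditional_gap(rates):
    """Largest within-band gap between the highest and lowest group rate."""
    return max(max(r.values()) - min(r.values()) for r in rates.values())


def make_rows(sex, hours, targeted_flags):
    """Build hypothetical rows that share a sex and an hours value."""
    return [
        {"sex": sex, "hours_per_week": hours, "targeted": flag}
        for flag in targeted_flags
    ]


# Hypothetical toy rows for illustration only, NOT the real Adult data.
toy_rows = (
    make_rows("Male", 40, [True, True, False, False])
    + make_rows("Female", 40, [True, False])
    + make_rows("Male", 20, [False, False])
    + make_rows("Female", 20, [False, False, False, False])
)

rates = conditional_rates(toy_rows, "sex", "targeted")
print(rates["full-time"])          # {'Male': 0.5, 'Female': 0.5}
print(rates["part-time"])          # {'Male': 0.0, 'Female': 0.0}
print(max_conditional_gap(rates))  # 0.0
```

In this toy example men are targeted at an overall rate of 2 in 6 and women at 1 in 6, so unconditional demographic parity is violated. Within each hours band the rates are equal, so conditional demographic parity holds: the overall gap is explained entirely by the different distribution of hours worked.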

Whether this is the right notion of fairness is unclear. On the one hand, basing our decision on the hours worked per week feature seems reasonable; on the other hand, the number of hours an individual works each week could itself be a manifestation of systemic biases, such as societal pressure for women to stay home and raise children. When selecting legitimate risk factors for analysing fairness, it is important that we consider whether they are themselves just a proxy for systemic bias.

Next we'll look at the data we use for the recruiting use case.
