Table of Contents
The first dataset we will be considering is the Adult dataset from the UCI Machine learning repository. It was derived from the US Census, and contains demographic information from several thousand individuals. The typical task on this dataset is to predict whether or not an individual is a high earner from their demographics. You can see the details of our data acquisition and preprocessing in this notebook, and you can see the details of the below data exploration in this notebook.
We will assume that we are a credit card company that wants to target marketing of a new premium credit card to high earning individuals using a machine learning model. We have found the Adult dataset and want to use it to train a first version of the model.
The Adult dataset contains the following features.
age: the age of the individual in years,
capital_gain: capital gain in the previous year,
capital_loss: capital loss in the previous year,
education: highest level of education achieved by the individual,
education-num: a numeric form of the highest level of education achieved,
fnlwgt: an estimate of the number of individuals in the population with the same demographics as this individual,
hours_per_week: hours worked per week,
marital_status: the marital status of the individual,
native_country: the native country of the individual,
occupation: the occupation of the individual,
race: the individual's race,
relationship: the individual's relationship status,
sex: the individual's sex,
workclass: the industry / sector that the individual works in.
Please note that in this dataset (drawn from the US Census), legacy categories for ethnic background exist that are now inappropriate. The graphs shown here are built automatically from that data. The data label
race includes the category "Eskimo" - but where "American Indian / Eskimo" appears here, the correct terms would be Indigenous peoples, American Indian, Native American, or Alaska Native, depending on context.
In addition to this, the binary label
salary encodes whether individual earns more or less than $50,000. We refer to those that earn more than $50,000 as "high earners" and those who don't as "low earners".
We drop the
fnlwgt feature, as it is not relevant to the prediction task at hand, and the
education-num feature as it duplicates the information available in the
We have selected this dataset as it is open access, and well known in the machine learning community. Moreover it is commonly used in experiments in the academic literature on fairness, which means we can compare our findings here to those published in many of the original papers.
Before we proceed with modelling, we should think about what we are trying to achieve and identify possible bias-related risks, as well as whether the data we have available to us is appropriate for the task.
Our goal is to try and predict an individual's annual income in order to target marketing for a premium credit card. Since our model is based on demographic information, it is at risk of perpetuating or exacerbating demographic inequality. In particular we should be mindful of possible racial or sex-based disparities present in the data, which the model could then use for classification.
In the case of this hypothetical example we don't have the means to collect different data. In the real world we would always consider whether we could have access to better data that is more representative. For example, the data we are using was collected in 1994, so it's quite likely that the demographic patterns captured in the data are no longer representative. If we were performing this task for real we would likely recommend that more up to date data is collected, that better reflects demographic shifts in the intervening years. Moreover the data was collected in the US - if the model we are training is intended for UK use this would be a further reason to try and collect more data.
Another consideration is the representation of different groups in the data. Men outnumber women almost two to one, White respondents outnumber Black respondents nine to one, and Asian respondents almost thirty to one. This means that our data is not representative of the overall population which isn't ideal, but the bigger problem is that there are relatively few observations of precisely those individuals who are at risk of mistreatment. In order to minimise the risk of bias we should make sure that all groups are well represented in our data.
Since we are merely demonstrating different methods of measuring and mitigating bias we proceed with the data as is, but in a real world application we should see if we can collect more data in order to ensure better representation.
Once the data has been preprocessed, we can look at bias that is present in the data, before worrying about bias introduced by the model. First, we look at the proportion of men and women that are high earners, and observe that the proportion is much higher for men.
Similarly, there is a large disparity in the proportion of high earners across different races.
Finally there is, perhaps unsurprisingly, a clear relationship between the number of hours worked per week and the chances that the individual is a high earner. We will consider this as a "legitimate risk factor" when we investigate imposing conditional demographic parity. This means that we would allow ourselves to target protected groups at different rates, if the disparity is due to those groups working different numbers of hours per week on average.
Whether this is the right notion of fairness is unclear. On the one hand basing our decision on the hours worked per week feature seems reasonable, on the other hand the number of hours an individual works each week could itself be a manifestation of systemic biases, such as societal pressure for women to stay home and raise children. When selecting legitimate risk factors for analysing fairness it is important we consider whether they are themselves just a proxy for systemic bias.
Next we'll look at the data we use for the recruiting use case.