What steps do data analysts take to ensure fairness?

Here at Civis, we build a lot of models. Most of the time we’re modeling people and their behavior because that’s what we’re particularly good at, but we’re hardly the only ones doing this — as we enter the age of “big data,” more and more industries are applying machine learning techniques to drive person-level decision-making. This comes with exciting opportunities, but it also introduces an ethical dilemma: when machine learning models make decisions that affect people’s lives, how can you be sure those decisions are fair?

Defining “Fairness”

A central challenge in trying to build fair models is quantifying some notion of ‘fairness’. In the US, legal precedent establishes one particular definition; however, this is also an area of active research. In fact, a substantial portion of the academic literature is focused on proposing new fairness metrics and proving that they display various desirable mathematical properties. This proliferation of metrics reflects the multifaceted nature of ‘fairness’: no single formula can capture all of its nuances. This is good news for the academics — more nuance means more papers to publish — but for practitioners the multitude of options can be overwhelming. To address this, my colleagues and I focused on three questions that helped guide our thinking about the tradeoffs between different definitions of fairness.

Group vs. Individual Fairness

For a given modeling task, do you care more about group-level fairness or individual-level fairness? Group fairness is the requirement that different groups of people be treated the same on average. Individual fairness is the requirement that similar individuals be treated similarly. Both are desirable, but in practice it’s usually not possible to optimize for both at the same time; in most cases it’s mathematically impossible. The debate around affirmative action in college admissions illustrates the conflict between individual and group fairness: group fairness stipulates that admission rates be equal across groups, for example gender or race, while individual fairness requires that each applicant be evaluated independently of any broader context.
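
To make the distinction concrete, here is a minimal Python sketch of the two notions using invented data and hypothetical column names (`group`, `feature`, `score`); it illustrates the two definitions, not any prescribed metric.

```python
import pandas as pd

# Hypothetical model scores for six people, with a group label and a single
# feature the scores are nominally based on (all values are made up).
df = pd.DataFrame({
    "group":   ["A", "A", "A", "B", "B", "B"],
    "feature": [0.9, 0.7, 0.5, 0.9, 0.7, 0.5],
    "score":   [0.80, 0.65, 0.45, 0.60, 0.50, 0.30],
})

# Group fairness (demographic-parity style): are the groups treated the
# same on average?
group_means = df.groupby("group")["score"].mean()
print("mean score per group:")
print(group_means)
print("group-level gap:", round(group_means.max() - group_means.min(), 3))

# Individual fairness: do individuals with similar features receive similar
# scores, regardless of which group they belong to?
for value, similar in df.groupby("feature"):
    spread = similar["score"].max() - similar["score"].min()
    print(f"feature={value}: score spread among similar individuals = {spread:.2f}")
```

Closing the group-level gap in this toy example would mean giving different scores to individuals with identical features, which is exactly the tension described above.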

Balanced vs. Imbalanced Ground Truth

Is the ground truth for whatever you are trying to model balanced between different groups? Many intuitive definitions of fairness assume that ground truth is balanced, but in real-world scenarios this is not always the case. For example, suppose you are building a model to predict the occurrence of breast cancer. Men and women have breast cancer at dramatically different rates, so a definition of fairness that required similar predictions across groups would be ill-suited to this problem. Unfortunately, determining the balance of ground truth is generally difficult because in most cases our only information about it comes from datasets that may contain bias.
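
One practical first step is simply to inspect the base rate of the outcome in each group before choosing a fairness definition. The sketch below uses invented counts (not real incidence rates) purely to illustrate the check.

```python
import pandas as pd

# Invented labels: one row per person, with a group label and a binary
# outcome (the "ground truth" a model would be trained on). The counts are
# illustrative only, not real incidence rates.
labels = pd.DataFrame({
    "group":   ["men"] * 1000 + ["women"] * 1000,
    "outcome": [1] * 2 + [0] * 998 + [1] * 120 + [0] * 880,
})

# Base rate of the outcome in each group. If these differ dramatically, a
# fairness definition that demands similar predictions across groups is
# probably a poor fit for the task.
print(labels.groupby("group")["outcome"].mean())
```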

Sample Bias vs. Label Bias in your Data

Speaking of bias, what types might be present in your data? In our thinking we focused on two types of bias that affect data generation: label bias and sample bias. Label bias occurs when the data-generating process systematically assigns labels differently for different groups.
For example, studies have shown that non-white kindergarten children are suspended at higher rates than white children for the same problem behavior, so a dataset of kindergarten disciplinary records might contain label bias. Accuracy is often a component of fairness definitions; however, optimizing for accuracy when data labels are biased can perpetuate biased decision-making.

Sample bias occurs when the data-generating process samples from different groups in different ways. For example, an analysis of New York City’s stop-and-frisk policy found that Black and Hispanic people were stopped by police at rates disproportionate to their share of the population (while controlling for geographic variation and estimated levels of crime participation). A dataset describing these stops would contain sample bias because the process by which data points are sampled is different for people of different races. Sample bias compromises the utility of accuracy as well as ratio-based comparisons, both of which are frequently used in definitions of algorithmic fairness.
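
As a rough illustration of why sample bias undermines ratio-based comparisons, the sketch below compares each group’s share of recorded stops to its share of the population; the numbers are invented, not taken from the New York analysis.

```python
import pandas as pd

# Invented (illustrative) figures: each group's share of the overall
# population versus its share of the stops recorded in the dataset.
counts = pd.DataFrame({
    "group":            ["group_1", "group_2", "group_3"],
    "population_share": [0.33, 0.29, 0.38],
    "share_of_stops":   [0.52, 0.31, 0.17],
}).set_index("group")

# A ratio far from 1 means the sampling process differs across groups, so
# accuracy- and ratio-based fairness metrics computed from the dataset
# alone will be distorted.
counts["stops_to_population_ratio"] = (
    counts["share_of_stops"] / counts["population_share"]
)
print(counts)
```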

Based on our reading, my colleagues and I came up with three recommendations for our coworkers:

Think about the ground truth you are trying to model

Whether or not ground truth is balanced between the different groups in your dataset is a central question; however, there is often no way to know for sure. Absent any external source of certainty, it is up to the person building the model to establish a prior belief about an acceptable level of imbalance between the model’s predictions for different groups.
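
One way to make that prior explicit is to encode it as a tolerance the model’s predictions are checked against. The sketch below is a hypothetical illustration; the function name and the threshold are choices for the modeler, not standard values.

```python
import numpy as np

def within_acceptable_imbalance(preds_a, preds_b, max_ratio=1.25):
    """Compare average predictions for two groups against a prior on how
    much imbalance is acceptable for this particular task.

    max_ratio encodes the modeler's prior belief; the data alone cannot
    tell you what it should be.
    """
    rate_a, rate_b = np.mean(preds_a), np.mean(preds_b)
    ratio = max(rate_a, rate_b) / max(min(rate_a, rate_b), 1e-9)
    return ratio <= max_ratio, ratio

# Hypothetical predicted probabilities for two groups.
ok, ratio = within_acceptable_imbalance(
    preds_a=[0.62, 0.55, 0.70], preds_b=[0.48, 0.51, 0.44], max_ratio=1.25
)
print(f"observed ratio: {ratio:.2f}, within prior: {ok}")
```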

Think about the process that generated your data

To paraphrase a famous line, “All datasets are biased; some are useful.” It is usually impossible to know exactly which biases are present in a dataset, so the next best option is to think carefully about the process that generated the data. Is the dependent variable the result of a subjective decision? Could the sampling depend on some sensitive attribute? Failing to consider the bias in your datasets can, at best, lead to poorly performing models and, at worst, perpetuate a biased decision-making process.

Keep a human in the loop if your model affects people’s lives

There is an understandable temptation to assume that machine learning models are inherently fair because they make decisions based on math instead of messy human judgements. This is emphatically not the case — a model trained on biased data will produce biased results. It is up to the people building the model to ensure that its outputs are fair. Unfortunately, there is no silver-bullet measurement guaranteed to detect unfairness: choosing an appropriate definition of model fairness is task-specific and requires human judgement. This human intervention is especially important when a model can meaningfully affect people’s lives.

Machine learning is a powerful tool, and like any powerful tool it has the potential to be misused. The best defense against misuse is to keep a human in the loop, and it is incumbent on those of us who do this kind of thing for a living to accept that responsibility.

1. Scenario 1, questions 1-5

You’ve just started a new job as a data analyst. You’re working for a midsized pharmacy chain with 38 stores in the American Southwest. Your supervisor shares a new data analysis project with you.

She explains that the pharmacy is considering discontinuing a bubble bath product called Splashtastic. Your supervisor wants you to analyze sales data and determine what percentage of each store’s total daily sales come from that product. Then, you’ll present your findings to leadership.

You know that it’s important to follow each step of the data analysis process: ask, prepare, process, analyze, share, and act. So, you begin by defining the problem and making sure you fully understand stakeholder expectations.

One of the questions you ask is where to find the dataset you’ll be working with. Your supervisor explains that the company database has all the information you need.

Next, you continue to the prepare step. You access the database and write a query to retrieve data about Splashtastic. You notice that there are only 38 rows of data, representing the company’s 38 stores. In addition, your dataset contains five columns: Store Number, Average Daily Customers, Average Daily Splashtastic Sales (Units), Average Daily Splashtastic Sales (Dollars), and Average Total Daily Sales (All Products).

Considering the size of your dataset, you decide a spreadsheet will be the best tool for your project. You proceed by downloading the data from the database. Describe why this is the best choice.
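
For reference, the core calculation is small enough to sketch: divide each store’s average daily Splashtastic sales in dollars by its average total daily sales and multiply by 100. The snippet below assumes the data has been exported to a hypothetical CSV file with the column names listed above; in a spreadsheet the same calculation would be a single formula copied down a new column.

```python
import pandas as pd

# Hypothetical export of the 38-row dataset described above; the file name
# is illustrative, and the column names match the scenario.
sales = pd.read_csv("splashtastic_sales.csv")

# Percentage of each store's total daily sales that comes from Splashtastic.
sales["Splashtastic Share (%)"] = (
    sales["Average Daily Splashtastic Sales (Dollars)"]
    / sales["Average Total Daily Sales (All Products)"]
    * 100
)

print(
    sales[["Store Number", "Splashtastic Share (%)"]]
    .sort_values("Splashtastic Share (%)", ascending=False)
    .to_string(index=False)
)
```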

6. Scenario 2, questions 6-10

You’ve been working for the nonprofit National Dental Society (NDS) as a junior data analyst for about two months. The mission of the NDS is to help its members advance the oral health of their patients. NDS members include dentists, hygienists, and dental office support staff.

The NDS is passionate about patient health. Part of this involves automatically scheduling follow-up appointments after crown replacement, emergency dental surgery, and extraction procedures. NDS believes the follow-up is an important step to ensure patient recovery and minimize infection.

Unfortunately, many patients don’t show up for these appointments, so the NDS wants to create a campaign to help its members learn how to encourage their patients to take follow-up appointments seriously. If successful, this will help the NDS achieve its mission of advancing the oral health of all patients.

Your supervisor has just sent you an email saying that you’re doing very well on the team, and he wants to give you some additional responsibility. He describes the issue of many missed follow-up appointments. You are tasked with analyzing data about this problem and presenting your findings using data visualizations.

An NDS member with three dental offices in Colorado offers to share its data on missed appointments. So, your supervisor uses a database query to access the dataset from the dental group. The query instructs the database to retrieve all patient information from the member’s three dental offices, located in zip code 81137.

The table is dental_data_table, and the column name is zip_code. You write the following query, but get an error. What statement will correct the problem?
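
The erroneous query itself isn’t reproduced here, so the sketch below only shows an illustrative corrected form: a query that retrieves every record from dental_data_table where zip_code equals 81137, wrapped in Python with a hypothetical SQLite database file.

```python
import sqlite3

# Hypothetical database file containing dental_data_table.
conn = sqlite3.connect("nds_member_data.db")

# Illustrative corrected query: retrieve all patient records from the
# member's three offices by filtering on the zip_code column.
query = """
SELECT *
FROM dental_data_table
WHERE zip_code = 81137;
"""

for row in conn.execute(query):
    print(row)

conn.close()
```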


As you’re reviewing the dataset, you notice that there are a disproportionate number of senior citizens. So, you investigate further and find out that this zip code represents a rural community in Colorado with about 800 residents. In addition, there’s a large assisted-living facility in the area. Nearly 300 of the residents in the 81137 zip code live in the facility.

You recognize that’s a sizable number, so you want to find out if age has an effect on a patient’s likelihood to attend a follow-up dental appointment. You analyze the data, and your analysis reveals that older people tend to miss follow-ups more than younger people.

So, you do some research online and discover that people over the age of 60 are 50% more likely to miss dentist appointments. Sometimes this is because they’re on a fixed income. Also, many senior citizens lack transportation to get to and from appointments.

With this new knowledge, you write an email to your supervisor expressing your concerns about the dataset. He agrees with your concerns, but he’s also impressed with what you’ve learned and thinks your findings could be very important to the project. He asks you to change the business task. Now, the NDS campaign will be about educating dental offices on the challenges faced by senior citizens and finding ways to help them access quality dental care.

Changing the business task involves which of the following?
