A Challenge to Data Scientists – Becoming A Data Scientist

As data scientists, we are aware that bias exists in the world. We read up on stories about how cognitive biases can affect decision-making. We know that, for instance, a resume with a white-sounding name will receive a different response than the same resume with a black-sounding name, and that writers of performance reviews use different language to describe contributions by women and men in the workplace. We read stories in the news about ageism in healthcare and racism in mortgage lending.

Data scientists are problem solvers at heart, and we love our data and our algorithms that sometimes seem to work like magic, so we may be inclined to try to solve these problems stemming from human bias by turning the decisions over to machines. Most people seem to believe that machines are less biased and more pure in their decision-making – that the data tells the truth, that the machines won’t discriminate.

However, we must remember that humans decide what data to collect and report (and whether to be honest in their data collection), what data to load into our models, how to manipulate that data, what tradeoffs we’re willing to accept, and how good is good enough for an algorithm to perform. Machines may not inherently discriminate, but humans ultimately tell the machines what to do, and then translate the results into information for other humans to use.

We aim to feed enough parameters into a model, and improve the algorithms enough, that we can tell who will pay back that loan, who will succeed in school, who will become a repeat offender, which company will make us money, which team will win the championship. If we just had more data, better processing systems, smarter analysts, smarter machines, we could predict the future.

I think Chris Anderson was right in his 2008 Wired article when he said “The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world,” but I think he was wrong when he said that petabyte-scale data “forces us to view data mathematically first and establish a context for it later,” and “With enough data, the numbers speak for themselves.” To me, context always matters. And numbers do not speak for themselves; we give them voice.

How aware are you of bias as you are building a data analysis, predictive model, visualization, or tool?

How complete, reliable, and representative is your dataset? Was your data collected by a smartphone app? Phone calls to listed numbers? Sensors? In-person surveying of whoever is out in the middle of the afternoon in the neighborhood your pollsters are covering, and agrees to stop and answer their questions?

Did you remove incomplete rows in your dataset to avoid problems your algorithm has with null values? Maybe the fact that the data was missing was meaningful; maybe the data was censored and not totally unknown. As Claudia Perlich warns, after cleaning your dataset, your data might have “lost its soul”.
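One quick way to see whether dropping incomplete rows changes the picture is to compare the outcome rate in the rows you would keep against the rows you would throw away. The sketch below is only an illustration: the loan records, the field names `income` and `repaid`, and the helper `repayment_rate` are all made up for this example and are not from any real dataset.

```python
# Hypothetical loan records; None marks a missing income value.
records = [
    {"income": 52000, "repaid": True},
    {"income": None, "repaid": False},
    {"income": 61000, "repaid": True},
    {"income": None, "repaid": False},
    {"income": 48000, "repaid": False},
    {"income": None, "repaid": True},
    {"income": 75000, "repaid": True},
    {"income": 39000, "repaid": True},
]


def repayment_rate(rows):
    """Share of rows with repaid == True, or None for an empty group."""
    if not rows:
        return None
    return sum(r["repaid"] for r in rows) / len(rows)


# Split the data the way a "drop incomplete rows" step would.
complete = [r for r in records if r["income"] is not None]
missing = [r for r in records if r["income"] is None]

rate_complete = repayment_rate(complete)
rate_missing = repayment_rate(missing)
share_dropped = len(missing) / len(records)

# One alternative to dropping: keep every row and record the missingness
# as its own feature, so a model can use it as a signal.
flagged = [dict(r, income_missing=r["income"] is None) for r in records]

print(f"rows dropped: {share_dropped:.0%}")
print(f"repayment rate, complete rows: {rate_complete:.2f}")
print(f"repayment rate, missing rows:  {rate_missing:.2f}")
```

If the two rates differ noticeably, as they do in this toy data, the missingness is probably telling you something, and deleting those rows removes a group of people from the analysis rather than just some noise.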

Did you train your model on labeled data which already included some systematic bias?

It’s actually not surprising that a computer model built to evaluate resumes may eventually show the same biases as people do when you think about the details of how that model may have been built: Was the algorithm trained to evaluate applicants’ resumes against existing successful employees, who may have benefited from hiring biases themselves? Could there be a proxy for race or age or gender in the data even if you removed those variables? Maybe if you’ve never hired someone that grew up in the same zip code as a potential candidate, your model will dock them a few points for not being a close match to prior successful hires. Maybe people at your company have treated women poorly when they take a full maternity leave, so several have chosen to leave soon after they attempted to return, and the model therefore rates women of common childbearing age as having a higher probability of turnover, even though their sex and age are not (at least directly) the reason they left. In other words, our biases translate into machine biases when the data we feed the machine has biases built in, and we ask the machine to pattern-match.
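To check for a proxy, you can ask how well a remaining feature lets you guess the attribute you removed. Here is a minimal sketch, assuming you still have the protected attribute available for auditing even though the model never sees it. The applicant list, the group labels "A" and "B", and the function `proxy_strength` are hypothetical, and the measure is computed on the same rows it is fit to, so treat it as a rough screen rather than a proper test.

```python
from collections import Counter, defaultdict

# Hypothetical applicants as (zip_code, group) pairs. "group" stands in for
# a protected attribute that was removed from the model's inputs.
applicants = [
    ("10001", "A"), ("10001", "A"), ("10001", "A"), ("10001", "B"),
    ("20002", "B"), ("20002", "B"), ("20002", "B"), ("20002", "A"),
    ("30003", "A"), ("30003", "B"),
]


def proxy_strength(pairs):
    """Return (baseline, proxy_accuracy) for guessing group from a feature.

    baseline: accuracy of always guessing the most common group.
    proxy_accuracy: accuracy of guessing the most common group within
    each feature value (here, within each zip code).
    """
    overall = Counter(group for _, group in pairs)
    baseline = max(overall.values()) / len(pairs)

    by_value = defaultdict(Counter)
    for value, group in pairs:
        by_value[value][group] += 1
    correct = sum(max(counts.values()) for counts in by_value.values())
    return baseline, correct / len(pairs)


baseline, proxy_accuracy = proxy_strength(applicants)
print(f"guessing group without zip code: {baseline:.2f}")
print(f"guessing group from zip code:    {proxy_accuracy:.2f}")
```

The bigger the gap between the two numbers, the more that feature is carrying information about the attribute you thought you had removed.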

We have to remember that Machine Learning effectively works by stereotyping. Our algorithms are often just creative ways to find things that are similar to other things. Sometimes, a process like this can reduce bias, if the system can identify predictors or combinations of predictors that may indicate a positive outcome, which a biased human may not consider if they’re hung up on another more obvious variable like race. However, as I mentioned before, we’re the ones training the system. We have to know where our data comes from, and how the ways we manipulate it can affect the results, and how the way we present those results can impact decisions that then impact people.

Data scientists, I challenge you. I challenge you to figure out how to make the systems you design as fair as possible.

Sure, it makes sense to cluster people by basic demographic similarity in order to decide who to send which marketing message to so your company can sell more toys this Christmas than last. But when the stakes are serious – when the question is whether a person will get that job, or that loan, or that scholarship, or that kidney – I challenge you to do more than blindly run a big spreadsheet through a brute-force system that optimizes some standard performance measure, or lazily group people by zip code and income and elementary school grades without seeking information that may be better suited for the task at hand. Try to make sure your cost functions reflect the human costs of misclassification as well as the business costs. Seek to understand your data, and to understand as much as possible how the decisions you make while building your model are affecting the outcome. Check to see how your model performs on a subset of your data that represents historically disadvantaged people. Speak up when you see your results, your expertise, your model being used to create an unfair system.
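Two of those suggestions, checking performance on a subgroup and putting human costs into the cost function, can be sketched together. The example below is a toy illustration with invented predictions; the group names, the `summarize` helper, and especially the two cost values are assumptions chosen for the example, not recommended numbers.

```python
from collections import defaultdict

# Hypothetical (group, actual, predicted) triples; 1 means "qualified" for
# actual and "approve" for predicted.
results = [
    ("majority", 1, 1), ("majority", 1, 1), ("majority", 1, 1), ("majority", 1, 0),
    ("majority", 0, 0), ("majority", 0, 0), ("majority", 0, 1), ("majority", 0, 0),
    ("minority", 1, 0), ("minority", 1, 0), ("minority", 1, 1),
    ("minority", 0, 0),
]

# Assumed costs: wrongly rejecting a qualified person is treated as five
# times as costly as wrongly approving an unqualified one.
COST_FALSE_NEGATIVE = 5.0
COST_FALSE_POSITIVE = 1.0


def summarize(rows):
    """Per-group accuracy, false negative rate, and average cost per person."""
    stats = defaultdict(lambda: {"n": 0, "correct": 0, "pos": 0, "fn": 0, "fp": 0})
    for group, actual, predicted in rows:
        s = stats[group]
        s["n"] += 1
        s["correct"] += int(actual == predicted)
        s["pos"] += int(actual == 1)
        s["fn"] += int(actual == 1 and predicted == 0)
        s["fp"] += int(actual == 0 and predicted == 1)

    report = {}
    for group, s in stats.items():
        cost = COST_FALSE_NEGATIVE * s["fn"] + COST_FALSE_POSITIVE * s["fp"]
        report[group] = {
            "accuracy": s["correct"] / s["n"],
            # Undefined when the group has no actual positives.
            "false_negative_rate": s["fn"] / s["pos"] if s["pos"] else None,
            "avg_cost": cost / s["n"],
        }
    return report


report = summarize(results)
overall_accuracy = sum(a == p for _, a, p in results) / len(results)

print(f"overall accuracy: {overall_accuracy:.2f}")
for group, metrics in report.items():
    print(group, metrics)
```

In this toy data the overall accuracy looks passable, but the smaller group is wrongly rejected far more often and carries a much higher average cost per person, which a single headline metric would hide.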

As data scientists, even though we know that systems we build can do a lot of good, we also know they can do a lot of harm. As data scientists, we know there are outliers. We know there are misclassifications. We know there are people and families and communities behind the rows in our dataframes.

I challenge you, Data Scientists, to think about the people in your dataset, and to take steps necessary to make the systems you design as unbiased and fair as possible. I challenge you to remain the human in the loop.

——————————–

The links throughout the article provide examples and references related to what is being discussed in each section. I encourage you to go back and click on them. Below are additional links with information that can help you identify and reduce biases in your analyses and models.

The GigaOm article “Careful: Your big data analytics may be polluted by data scientist bias” discusses some “bias-quelling tactics”

“Data Science: What You Already Know Can Hurt You” suggests solutions for avoiding “The Einstellung Effect”

Part I of the book Applied Predictive Modeling includes discussions of the modeling process and explains how each type of data manipulation during pre-processing can affect model outcome

This paper from the NIH outlines some biases that occur during clinical research and how to avoid them: “Identifying and Avoiding Bias in Research”

The study “Bias arising from missing data in predictive models” uses Monte Carlo simulation to determine how different methods of handling missing data affect odds-ratio estimates and model performance

Use these Wikipedia articles to learn about “Accuracy and Precision” and “Precision and Recall”

A study in Clinical Chemistry examines “Bias in Sensitivity and Specificity Caused by Data-Driven Selection of Optimal Cutoff Values: Mechanisms, Magnitude, and Solutions”

More resources from a workshop on fairness, accountability, and transparency in machine learning

Edit: After listening to the SciFri episode I linked to in the comments, I found this paper “Certifying and removing disparate impact” about identifying and reducing bias in machine learning algorithms.

Edit 11/23: Carina Zona suggested that her talk “Consequences of an Insightful Algorithm” might be a good reference to include here. I agree!

(P.S. Sometimes the problem with turning a decision over to machines is that the machines can’t discriminate enough!)

Do you have a story related to data science and bias? Do you have additional links that would help us learn more? Please share in the comments!

17 comments

Renee says:

November 22, 2015 at 12:32 am

It took me a couple weeks to form this essay, and the day before I planned to publish it, I heard a promo for Science Friday on NPR for a segment featuring Kate Crawford about the same topic!

I’m listening to it now and it’s really interesting, go give it a listen:
http://www.sciencefriday.com/segments/why-machines-discriminate-and-how-to-fix-them/

1. Renee says:
  
  November 22, 2015 at 12:36 am
  
  Suresh Venkat is also on the episode. Here is his article about algorithms and discrimination:
  https://medium.com/@geomblog/when-an-algorithm-isn-t-2b9fe01b9bb5#.4llc1hoo0
Marts says:

November 22, 2015 at 11:54 am

Yikes! Quite the challenge, but definitely one worth pursuing.

1. Renee says:
  
  November 22, 2015 at 12:04 pm
  
  Yes, I know we’re never going to achieve perfection here, and often we’re tasked with doing things fast vs thoughtfully, but my main point of writing was to get people to at least think about it. With so many systems being computerized and automated, it’s scary to think how we could be building bias into them and therefore could be discriminating even faster and more consistently! I hope that being aware of it will make us tweak our actions just enough. Or at least check to make sure we’re not causing unintended negative consequences.
2. Renee says:
  
  November 22, 2015 at 12:05 pm
  
  (Thanks, by the way!)
Pingback: A Challenge to Data Scientists « Another Word For It
Renee says:

November 22, 2015 at 2:38 pm

A thoughtful challenge to my challenge: http://tm.durusau.net/?p=65833&cpage=1#comment-134720

Renee says:

November 23, 2015 at 2:31 am

Been having some conversation about this on Twitter:
https://twitter.com/BecomingDataSci/status/668512912395341828

naveen kumar says:

December 6, 2015 at 1:19 pm

i am working as pl/sql developer,knowledge on data warehousing and reporting streams and having 3 years of experience in insurance domain.
Is it a better option to go for data science
if so what are the areas need to concentrate more
R-language,machine learning,python
please suggest me the better approach how to learn and what are the key areas i need to be strong
please suggest me as you are working already that makes lot’s help for me from past 2 months i am doing the background work to go for data science .
i have approached 4 to 5 faculties they are confusing me without knowing the bigdata data science is not possible is it correct ?

thanks.

1. Renee says:
  
  February 17, 2016 at 10:04 pm
  
  Hi naveen, check out the data science learning club (see link in menu) and join in! The answers to your questions will vary a lot depending on your background and goals, but it’s definitely possible. Maybe check out the intros in the meet & greet section of the club forums and find other people like you to follow along with!
Renee says:

February 17, 2016 at 10:00 pm

This article really drives the point home about considering the impact of your analysis http://arstechnica.co.uk/security/2016/02/the-nsas-skynet-program-may-be-killing-thousands-of-innocent-people/

Renee says:

March 25, 2016 at 12:31 am

It’s not always the designers/developers that introduce bias or negative behavior into a model, but they do have to think about how to prevent those things if allowing others to train it!

https://medium.com/@anthonygarvan/hey-microsoft-the-internet-made-my-bot-racist-too-d897fa847232#.37j9ecrt7

Renee says:

October 11, 2016 at 10:33 am

2 more relevant articles. Glad this topic is getting discussed more lately!

http://www.datacommunitydc.org/blog/2015/11/socially-responsible-algorithms-at-data-science-dc

http://www.nytimes.com/2016/08/01/opinion/make-algorithms-accountable.html?_r=0

And Jennifer Stark @_JAStark gave a talk about “algorithmic accountability” that I’ll post here when I see the video from pydataDC posted online.

Renee says:

January 13, 2017 at 4:34 pm

Another one about bias and fairness measures re:algorithms used in criminal justice systems

https://www.propublica.org/article/bias-in-criminal-risk-scores-is-mathematically-inevitable-researchers-say

Renee says:

February 4, 2017 at 1:54 pm

A collection of essays on data and discrimination
https://www.newamerica.org/oti/policy-papers/data-and-discrimination/

via https://twitter.com/annaeveryday

Renee says:

March 11, 2017 at 2:08 pm

Another case of bias being introduced into a policing model http://snip.ly/aewu1#http://ow.ly/TLB6309Nia4

Renee says:

March 19, 2017 at 5:09 pm

A TED Talk on this topic – by Joy Buolamwini
http://www.ted.com/talks/joy_buolamwini_how_i_m_fighting_bias_in_algorithms
