machine learning – Becoming A Data Scientist

Podcast Special Episode 2 – The Future of AI with Dr. Ed Felten

Renee — Thu, 29 Dec 2016 05:02:12 +0000

In this audio-only Becoming a Data Scientist Podcast Special Episode, I interview Dr. Ed Felten, Deputy U.S. Chief Technology Officer, about the Future of Artificial Intelligence (from The White House!).

You can stream or download the audio at this link (download by right-clicking on the player and choosing “Save As”), or listen to it in podcast players like iTunes and Stitcher. Enjoy!

Show Notes:

The White House Names Dr. Ed Felten as Deputy U.S. Chief Technology Officer

Edward W. Felten at Princeton University

Dr. Edward Felten on Wikipedia

White House Office of Science and Technology Policy (OSTP)

The Administration’s Report on the Future of Artificial Intelligence (White House Report from October 2016)

Artificial Intelligence, Automation, and the Economy (White House Report from December 2016)

Ed Felten on Twitter: Official / Personal

Freedom to Tinker blog

———

Other Podcasts in this Government Data Series:

Partially Derivative White House Special – with DJ Patil, US Chief Data Scientist
Not So Standard Deviations – Standards are Like Toothbrushes – with with Daniel Morgan, Chief Data Officer for the U.S. Department of Transportation and Terah Lyons, Policy Advisor to the Chief Technology Officer of the U.S.
Linear Digressions – Data + Healthcare + Government = The Future of Medicine – with Precision Medicine Initiative researcher Matt Might
More to Come!

Becoming a Data Scientist Patreon Campaign

Becoming a Data Scientist Podcast Episode 09: Justin Kiggins

Renee — Tue, 12 Apr 2016 03:29:02 +0000

Justin Kiggins, who calls himself a “full stack neuroscientist” talks to Renee about how he started as a musician majoring in music therapy, switched to mechanical engineering, and eventually made his way via biomedical engineering and neuroscience to study auditory perception and the brains of communicating birds.

Podcast Audio Links:
Link to podcast Episode 9 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Podcast on iTunes

Podcast Video Playlist:
Youtube playlist of interview videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Learning Club Activity 9: Normalization [coming soon]
Data Science Learning Club Meet & Greet

Links to topics mentioned by Justin in the interview:

European Starling

video of starling singing
European Starling song file from Justin [1 min wav]

bird song recursive syntactic structure

Zebra Finch song

spectral analysis [pdf]

Neuron

brain electrodes

numpy
spark
thunder
pandas

birth doula

Allen Institute

Jobs for New Data Scientists website mentioned by Renee after interview

Becoming a Data Scientist Podcast Episode 08: Sebastian Raschka

Renee — Tue, 29 Mar 2016 03:01:00 +0000

Renee interviews computational biologist, author, data scientist, and Michigan State PhD candidate Sebastian Raschka about how he became a data scientist, his current research, and about his book Python Machine Learning. In the audio interview, Sebastian also joins us to discuss k-fold cross-validation for our model evaluation Data Science Learning Club activity.

Podcast Audio Links:
Link to podcast Episode 8 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Podcast on iTunes

Podcast Video Playlist:
Youtube playlist of interview videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Learning Club Activity 8: Evaluation Metrics [coming soon]
Data Science Learning Club Meet & Greet

Links to topics mentioned by Sebastian in the interview:

computational biology

molecular docking

Protein-ligand docking

DNA -> RNA -> protein

protein signaling pathways

ligand and binding affinity

sea lamprey

pheromone

SiteInterlock project

Neural Network

Random Forest

Sebastian’s Python Machine Learning repository on GitHub

Python Machine Learning Book on DataSciGuide

scikit-learn – Voting Classifier

softmax regression

stochastic gradient descent

multilayer perceptron

logistic regression (from Sebastian’s github)

regularization in logistic regression (from Sebastian’s github)

Keras deep learning library

@rasbt on Twitter
Sebastian Raschka on Quora

Sebastian’s book on Amazon:

Becoming a Data Scientist Podcast Episode 05: Clare Corthell

Renee — Mon, 15 Feb 2016 04:13:03 +0000

Renee Teate interviews Clare Corthell, founding partner of summer.ai (now Luminant Data) and creator of the Open Source Data Science Masters curriculum, about becoming a data scientist.

Podcast Audio Links:
Link to podcast Episode 5 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Podcast on iTunes

Podcast Video Playlist:
Youtube playlist of interview videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Learning Club Activity 5: Naive Bayes Classification
Data Science Learning Club Meet & Greet

Resources/topics mentioned by Clare in the interview:

Management Science and Engineering
Markov Chains
Science, Technology, and Society at Stanford

A Challenge to Data Scientists (blog post Renee mentioned)

Mattermark
Product Management
Machine Learning

Open Source Data Science Masters
Nate Silver’s book The Signal and the Noise

Linear Algebra (on Khan Academy)

Bill Howe’s Introduction to Data Science Coursera Course

Recurrent Neural Nets
Bayesian Networks

python

Google Prediction API

data cleaning

Open Source Data Science Masters on GitHub (pull requests welcome!)

summer.ai (Update 2/15 – Clare’s company is now Luminant Data, Inc.)
@ClareCorthell on twitter

A Challenge to Data Scientists

Renee — Sun, 22 Nov 2015 05:22:57 +0000

As data scientists, we are aware that bias exists in the world. We read up on stories about how cognitive biases can affect decision-making. We know that, for instance, a resume with a white-sounding name will receive a different response than the same resume with a black-sounding name, and that writers of performance reviews use different language to describe contributions by women and men in the workplace. We read stories in the news about ageism in healthcare and racism in mortgage lending.

Data scientists are problem solvers at heart, and we love our data and our algorithms that sometimes seem to work like magic, so we may be inclined to try to solve these problems stemming from human bias by turning the decisions over to machines. Most people seem to believe that machines are less biased and more pure in their decision-making – that the data tells the truth, that the machines won’t discriminate.

Most people seem to believe that machines are less biased and more pure in their decision-making – that the data tells the truth, that the machines won’t discriminate.

However, we must remember that humans decide what data to collect and report (and whether to be honest in their data collection), what data to load into our models, how to manipulate that data, what tradeoffs we’re willing to accept, and how good is good enough for an algorithm to perform. Machines may not inherently discriminate, but humans ultimately tell the machines what to do, and then translate the results into information for other humans to use.

We aim to feed enough parameters into a model, and improve the algorithms enough, that we can tell who will pay back that loan, who will succeed in school, who will become a repeat offender, which company will make us money, which team will win the championship. If we just had more data, better processing systems, smarter analysts, smarter machines, we could predict the future.

I think Chris Anderson was right in his 2008 Wired article when he said “The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world,” but I think he was wrong when he said that petabyte-scale data “forces us to view data mathematically first and establish a context for it later,” and “With enough data, the numbers speak for themselves.” To me, context always matters. And numbers do not speak for themselves, we give them voice.

To me, context always matters. And numbers do not speak for themselves, we give them voice.

How aware are you of bias as you are building a data analysis, predictive model, visualization, or tool?

How complete, reliable, and representative is your dataset? Was your data collected by a smartphone app? Phone calls to listed numbers? Sensors? In-person surveying of whoever is out in the middle of the afternoon in the neighborhood your pollsters are covering, and agrees to stop and answer their questions?

Did you remove incomplete rows in your dataset to avoid problems your algorithm has with null values? Maybe the fact that the data was missing was meaningful; maybe the data was censored and not totally unknown. As Claudia Perlich warns, after cleaning your dataset, your data might have “lost its soul“.

Did you train your model on labeled data which already included some systematic bias?

It’s actually not surprising that a computer model built to evaluate resumes may eventually show the same biases as people do when you think about the details of how that model may have been built: Was the algorithm trained to evaluate applicants’ resumes against existing successful employees, who may have benefited from hiring biases themselves? Could there be a proxy for race or age or gender in the data even if you removed those variables? Maybe if you’ve never hired someone that grew up in the same zip code as a potential candidate, your model will dock them a few points for not being a close match to prior successful hires. Maybe people at your company have treated women poorly when they take a full maternity leave, so several have chosen to leave soon after they attempted to return, and the model therefore rates women of common childbearing age as having a higher probability of turnover, even though their sex and age are not (at least directly) the reason they left. In other words, our biases translate into machine biases when the data we feed the machine has biases built in, and we ask the machine to pattern-match.

We have to remember that Machine Learning effectively works by stereotyping. Our algorithms are often just creative ways to find things that are similar to other things. Sometimes, a process like this can reduce bias, if the system can identify predictors or combinations of predictors that may indicate a positive outcome, which a biased human may not consider if they’re hung up on another more obvious variable like race. However, as I mentioned before, we’re the ones training the system. We have to know where our data comes from, and how the ways we manipulate it can affect the results, and how the way we present those results can impact decisions that then impact people.

Data scientists, I challenge you. I challenge you to figure out how to make the systems you design as fair as possible.

Data scientists, I challenge you. I challenge you to figure out how to make the systems you design as fair as possible.

Sure, it makes sense to cluster people by basic demographic similarity in order to decide who to send which marketing message to so your company can sell more toys this Christmas than last. But when the stakes are serious – when the question is whether a person will get that job, or that loan, or that scholarship, or that kidney – I challenge you to do more than blindly run a big spreadsheet through a brute-force system that optimizes some standard performance measure, or lazily group people by zip code and income and elementary school grades without seeking information that may be better suited for the task at hand. Try to make sure your cost functions reflect the human costs of misclassification as well as the business costs. Seek to understand your data, and to understand as much as possible how the decisions you make while building your model are affecting the outcome. Check to see how your model performs on a subset of your data that represents historically disadvantaged people. Speak up when you see your results, your expertise, your model being used to create an unfair system.

As data scientists, even though we know that systems we build can do a lot of good, we also know they can do a lot of harm. As data scientists, we know there are outliers. We know there are misclassifications. We know there are people and families and communities behind the rows in our dataframes.

I challenge you, Data Scientists, to think about the people in your dataset, and to take steps necessary to make the systems you design as unbiased and fair as possible. I challenge you to remain the human in the loop.

——————————–

The links throughout the article provide examples and references related to what is being discussed in each section. I encourage you to go back and click on them. Below are additional links with information that can help you identify and reduce biases in your analyses and models.

The GigaOm article “Careful: Your big data analytics may be polluted by data scientist bias” discusses some “bias-quelling tactics”

“Data Science: What You Already Know Can Hurt You” suggests solutions for avoiding “The Einstellung Effect”

Part I of the book Applied Predictive Modeling includes discussions of the modeling process and explains how each type of data manipluation during pre-processing can affect model outcome

This paper from the NIH outlines some biases that occur during clinical research and how to avoid them: “Identifying and Avoiding Bias in Research”

The study “Bias arising from missing data in predictive models” uses Monte Carlo simulation to determine how different methods of handling missing data affect odds-ratio estimates and model performance

Use these wikipedia articles to learn about Accuracy and Precision and Precision and Recall

A study in Clinical Chemistry examines “Bias in Sensitivity and Specificity Caused by Data-Driven Selection of Optimal Cutoff Values: Mechanisms, Magnitude, and Solutions”

More resources from a workshop on fairness, accountability, and transparency in machine learning

Edit: After listening to the SciFri episode I linked to in the comments, I found this paper “Certifying and removing disparate impact” about identifying and reducing bias in machine learning algorithms.

Edit 11/23: Carina Zona suggested that her talk “Consequences of an Insightful Algorithm” might be a good reference to include here. I agree!

(P.S. Sometimes the problem with turning a decision over to machines is that the machines can’t discriminate enough!)

Do you have a story related to data science and bias? Do you have additional links that would help us learn more? Please share in the comments!