featured – Becoming A Data Scientist

T-Shirts!!

Renee — Sat, 18 Feb 2017 20:55:46 +0000

The Becoming a Data Scientist tees are ready to sell! I ordered a couple myself before posting them for sale, to make sure the quality was good. They came out great!! And if you order from Teespring before ~~March~~April 1, 2017 using this link: Becoming a Data Scientist Store – Free Shipping, you’ll get free shipping on your order!

(Readers told me that the link above doesn’t discount at all for International shipping, so if you are outside the US, use this link for $3.99 off – equivalent to US Shipping cost)

The design is a combination of those submitted to our contest by Amarendranath “Amar” Reddy and Ryne & Alexis. You can see their design submissions and read more about them on the finalists post! They are each receiving prizes for being selected. Thanks Amar, Ryne, and Alexis for the awesome design!

There are a variety of styles and colors available. The Premium Tee is 100% cotton. The Women’s Premium is a 50/50 cotton/poly blend, and is cut to fit more snugly. They are available in navy blue, gray, purple, and black. There’s even a long-sleeve version!

I make anywhere from $2-$7 on each order (it’s print-on-demand, so not cheap enough for me to make a significant profit yet, and my proceeds will be lower with the free shipping offer, but I want to reward those of you who are excited to flaunt your Becoming a Data Scientist status!) and every dollar earned from these will be going to the fund that helps support my new small team of assistants, who you’ll meet soon! Also, the more of them I sell, the lower the cost to print is per shirt, so please share with all of your friends!

Here are photos of me wearing the shirt, but this was before I made the front design slightly smaller (so it doesn’t wrap into armpit), and I moved the back design slightly higher and also made the gray dots (data points?) transparent so the color of the shirt will show through there now (see store images above for current design). You can see that the teal came out as a lighter blue in printing. This is the “Premium Tee” style in “New Navy”.

Here’s a model wearing a simulated version of the shirt.

Order yours here, with Free Shipping Until March 1!

Update: Kids sizes now available, too!(the design is on the front for kids’ shirts)

PyData DC 2016 Talk

Renee — Tue, 11 Oct 2016 05:00:36 +0000

I just got back from PydataDC, where I learned a lot, had fun, and met a bunch of awesome people! I’ll definitely write about it more later, but I wanted to share my slides here since I told the attendees they could find them on my website. I got good feedback on the talk, and I’m so glad that my message resonated with some people!

The talk was recorded and video should be out within a few weeks!

Here are the slides: Becoming a Data Scientist – Advice from my Podcast Guests
and the slide notes.

Update 10/26: Here is the recording of my talk, with a playlist of other talks from PyData DC!

Becoming a Data Scientist Podcast Episode 13: Debbie Berebichez

Renee — Fri, 15 Jul 2016 02:52:29 +0000

In this interview, we meet physicist Debbie Berebichez, who you might recognize from her TEDx talks, her appearances in Discovery Channel’s Outrageous Acts of Science and other TV shows! Debbie grew up in Mexico City and was discouraged by her family and teachers from studying science, but later went on to become the first Mexican woman to get a PhD in physics from Stanford, and is now Chief Data Scientist at Metis Data Science Bootcamp in New York.

Podcast Audio Links:
Link to podcast Episode 13 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Podcast on iTunes

Podcast Video Playlist:
Youtube playlist of interview videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Learning Club Activity 13: Show & Tell
Data Science Learning Club Meet & Greet

Links to topics mentioned by Debbie in the interview:
Metis Data Science Training
[more coming soon]

Becoming a Data Scientist Podcast Episode 12: Data Science Learning Club Members

Renee — Wed, 15 Jun 2016 05:08:12 +0000

Verena, David, Kerry, and Anthony are members of the Becoming a Data Scientist Podcast Data Science Learning Club! They appear in the order in which they joined the club, and each discuss their starting points before joining, their participation in the activities, and advice they have for new data science learners.

Podcast Audio Links:
Link to podcast Episode 12 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Podcast on iTunes

Podcast Video Playlist:
Youtube playlist of interview videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Data Science Learning Club Meet & Greet

1) Verena Haunschmid

bioinformatics

R Markdown
ggplot2
jupyter

Data Science Learning Club Activity 07: Linear Regression
Verena’s Results for Linear Regression on Salary Dataset

Verena’s website
@ExpectAPatronum on Twitter

GPS Cat Tracking Project

2) David Asboth

business intelligence

SQL

City University London Msc Data Science

Coursera
Udacity
Khan Academy

Data Science Learning Club Activity 02: Creating Visuals for Exploratory Data Analysis
David’s results exploring London Underground data

Data Science Learning Club Activity 07: K-Means Clustering
David’s results using k-means to draw puppies in 3 colors

FlyLady (the house cleaning system I mentioned)

David’s website
@davidasboth on Twitter

3) Kerry Benjamin

Data Science Learning Club Activity 01: Find, Import, and Explore a Dataset
Kerry’s results for Activity 1 IGN Game Review Data exploration

Data Science Learning Club Activity 02: Creating Visuals for Exploratory Data Analysis
Kerry’s Blog Post about Activity 02 – “My First Data Set Part 2: The Fun Stuff”

ggplot2
dplyr
XLConnect

Blog post about Data Camp – “The Data Science Journey Begins”

Sharp Sight Labs

Kerry’s blog post “Getting Started in Data Science: A Beginner’s Perspective”

#Rstats (twitter hashtag)

Kerry’s Blog “The Data Logs”
@kerry_benjamin1 on Twitter

4) Anthony Peña

molecular biology
biotechnology

Data Science Learning Club Activity 07: K-Means Clustering
Anthony’s results for Activity 07

ggplot2
tidyR
dplyr

R bloggers

Anthony’s website
@agpena_ on Twitter

Becoming A Data Scientist Podcast Episode 01: Will Kurt

Renee — Mon, 21 Dec 2015 06:40:32 +0000

Note: The video is the interview only. The audio podcast has the intro, interview, and data science learning club activity explanation.

In this episode we meet Will Kurt, who talks about his path from English & Literature and Library & Information Science degrees to becoming the Lead Data Scientist at KISSmetrics. He also tells us about his probability blog, Count Bayesie, and I introduce Data Science Learning Club Activity 1. Will has some great advice for people learning data science!

Podcast Audio Links:
Link to podcast Episode 1 audio
Podcast’s RSS feed for podcast subscription apps
(I will distribute the feed out to sites like iTunes and Stitcher this week)

Podcast Video Playlist:
Youtube playlist where I’ll publish future videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Learning Club Activity 1: Find and explore a dataset
Data Science Learning Club Meet & Greet

Here are the links to things Will references in the video:

Library and Information Science

Foucault

ARPANET

Support Vector Machines

CoffeeScript

Andrew Ng’s Machine Learning course on Coursera

probabalistic graphical models

T.S. Eliot’s Four Quartets

Articulate

KISSMetrics

Count Bayesie blog
Count Bayesie – Parameter Estimation and Hypothesis Testing

R Markdown

Donald Knuth
Literate programming

ggplot2

jupyter

Claude Shannon’s Mathematical Theory of Communication

Count Basie (musician)

Count Bayesie – Measure Theory
Bayes’ Theorem with Lego
Voight-Kampff and Bayes Factor
Black Friday Puzzle – Markov Chains

Zen Buddhism concept of “beginner’s mind”

Count Bayesie Recommended Books on Probability and Statistics

A Challenge to Data Scientists

Renee — Sun, 22 Nov 2015 05:22:57 +0000

As data scientists, we are aware that bias exists in the world. We read up on stories about how cognitive biases can affect decision-making. We know that, for instance, a resume with a white-sounding name will receive a different response than the same resume with a black-sounding name, and that writers of performance reviews use different language to describe contributions by women and men in the workplace. We read stories in the news about ageism in healthcare and racism in mortgage lending.

Data scientists are problem solvers at heart, and we love our data and our algorithms that sometimes seem to work like magic, so we may be inclined to try to solve these problems stemming from human bias by turning the decisions over to machines. Most people seem to believe that machines are less biased and more pure in their decision-making – that the data tells the truth, that the machines won’t discriminate.

Most people seem to believe that machines are less biased and more pure in their decision-making – that the data tells the truth, that the machines won’t discriminate.

However, we must remember that humans decide what data to collect and report (and whether to be honest in their data collection), what data to load into our models, how to manipulate that data, what tradeoffs we’re willing to accept, and how good is good enough for an algorithm to perform. Machines may not inherently discriminate, but humans ultimately tell the machines what to do, and then translate the results into information for other humans to use.

We aim to feed enough parameters into a model, and improve the algorithms enough, that we can tell who will pay back that loan, who will succeed in school, who will become a repeat offender, which company will make us money, which team will win the championship. If we just had more data, better processing systems, smarter analysts, smarter machines, we could predict the future.

I think Chris Anderson was right in his 2008 Wired article when he said “The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world,” but I think he was wrong when he said that petabyte-scale data “forces us to view data mathematically first and establish a context for it later,” and “With enough data, the numbers speak for themselves.” To me, context always matters. And numbers do not speak for themselves, we give them voice.

To me, context always matters. And numbers do not speak for themselves, we give them voice.

How aware are you of bias as you are building a data analysis, predictive model, visualization, or tool?

How complete, reliable, and representative is your dataset? Was your data collected by a smartphone app? Phone calls to listed numbers? Sensors? In-person surveying of whoever is out in the middle of the afternoon in the neighborhood your pollsters are covering, and agrees to stop and answer their questions?

Did you remove incomplete rows in your dataset to avoid problems your algorithm has with null values? Maybe the fact that the data was missing was meaningful; maybe the data was censored and not totally unknown. As Claudia Perlich warns, after cleaning your dataset, your data might have “lost its soul“.

Did you train your model on labeled data which already included some systematic bias?

It’s actually not surprising that a computer model built to evaluate resumes may eventually show the same biases as people do when you think about the details of how that model may have been built: Was the algorithm trained to evaluate applicants’ resumes against existing successful employees, who may have benefited from hiring biases themselves? Could there be a proxy for race or age or gender in the data even if you removed those variables? Maybe if you’ve never hired someone that grew up in the same zip code as a potential candidate, your model will dock them a few points for not being a close match to prior successful hires. Maybe people at your company have treated women poorly when they take a full maternity leave, so several have chosen to leave soon after they attempted to return, and the model therefore rates women of common childbearing age as having a higher probability of turnover, even though their sex and age are not (at least directly) the reason they left. In other words, our biases translate into machine biases when the data we feed the machine has biases built in, and we ask the machine to pattern-match.

We have to remember that Machine Learning effectively works by stereotyping. Our algorithms are often just creative ways to find things that are similar to other things. Sometimes, a process like this can reduce bias, if the system can identify predictors or combinations of predictors that may indicate a positive outcome, which a biased human may not consider if they’re hung up on another more obvious variable like race. However, as I mentioned before, we’re the ones training the system. We have to know where our data comes from, and how the ways we manipulate it can affect the results, and how the way we present those results can impact decisions that then impact people.

Data scientists, I challenge you. I challenge you to figure out how to make the systems you design as fair as possible.

Data scientists, I challenge you. I challenge you to figure out how to make the systems you design as fair as possible.

Sure, it makes sense to cluster people by basic demographic similarity in order to decide who to send which marketing message to so your company can sell more toys this Christmas than last. But when the stakes are serious – when the question is whether a person will get that job, or that loan, or that scholarship, or that kidney – I challenge you to do more than blindly run a big spreadsheet through a brute-force system that optimizes some standard performance measure, or lazily group people by zip code and income and elementary school grades without seeking information that may be better suited for the task at hand. Try to make sure your cost functions reflect the human costs of misclassification as well as the business costs. Seek to understand your data, and to understand as much as possible how the decisions you make while building your model are affecting the outcome. Check to see how your model performs on a subset of your data that represents historically disadvantaged people. Speak up when you see your results, your expertise, your model being used to create an unfair system.

As data scientists, even though we know that systems we build can do a lot of good, we also know they can do a lot of harm. As data scientists, we know there are outliers. We know there are misclassifications. We know there are people and families and communities behind the rows in our dataframes.

I challenge you, Data Scientists, to think about the people in your dataset, and to take steps necessary to make the systems you design as unbiased and fair as possible. I challenge you to remain the human in the loop.

——————————–

The links throughout the article provide examples and references related to what is being discussed in each section. I encourage you to go back and click on them. Below are additional links with information that can help you identify and reduce biases in your analyses and models.

The GigaOm article “Careful: Your big data analytics may be polluted by data scientist bias” discusses some “bias-quelling tactics”

“Data Science: What You Already Know Can Hurt You” suggests solutions for avoiding “The Einstellung Effect”

Part I of the book Applied Predictive Modeling includes discussions of the modeling process and explains how each type of data manipluation during pre-processing can affect model outcome

This paper from the NIH outlines some biases that occur during clinical research and how to avoid them: “Identifying and Avoiding Bias in Research”

The study “Bias arising from missing data in predictive models” uses Monte Carlo simulation to determine how different methods of handling missing data affect odds-ratio estimates and model performance

Use these wikipedia articles to learn about Accuracy and Precision and Precision and Recall

A study in Clinical Chemistry examines “Bias in Sensitivity and Specificity Caused by Data-Driven Selection of Optimal Cutoff Values: Mechanisms, Magnitude, and Solutions”

More resources from a workshop on fairness, accountability, and transparency in machine learning

Edit: After listening to the SciFri episode I linked to in the comments, I found this paper “Certifying and removing disparate impact” about identifying and reducing bias in machine learning algorithms.

Edit 11/23: Carina Zona suggested that her talk “Consequences of an Insightful Algorithm” might be a good reference to include here. I agree!

(P.S. Sometimes the problem with turning a decision over to machines is that the machines can’t discriminate enough!)

Do you have a story related to data science and bias? Do you have additional links that would help us learn more? Please share in the comments!

How To Use Twitter to Learn Data Science (or anything)

Renee — Sun, 04 Oct 2015 21:09:59 +0000

When I decided that I wanted to become a data scientist, I started following some data scientists on twitter to see what they talk about and what was going on in the “industry”. Then I saw them pointing one another to resources and answering each other’s questions, and I realized I had only seen the tip of the iceberg of “Data Science Twitter”. That’s when I created a new twitter account.

———————–

A few things I should say first…. I think “data science” can be replaced by just about any other topic, but especially science & tech topics, so please keep that in mind as you read this. I follow a bunch of scientists on my “regular” personal twitter account @paix120, and I sense the same things going on in their communities as I’m about to outline for data science.

Another thing I want to mention is that I’ve had other “topical” twitter accounts. I created one called @womenwithdroids when I started a blog of the same name, and I was amazed at how many awesome women I met that were building android apps, wanted to learn more about how to use their android phones (which at the time were being marketed as a “manly” alternative to the “cutesy” iPhone), and wanted to join a community of women talking about android phones and apps. At the time, I had created a separate account because I saw it as a “business” account for my blog, but I realized that there was a lot of value in separating that from my personal account. I’ll go into that below. Now that you know a little background, let’s dive into how you can use twitter to learn just about anything.

———————–

I have explained to people I meet in person how much I gain from Twitter, and they often look at me like I’m a little nutty. I have heard a few recurring comments from them that I see as misconceptions:

“I started using Twitter and was overwhelmed. I couldn’t keep up with my timeline.”
My answer to that is that first, you’re not supposed to “keep up” with your Twitter timeline. I don’t use Facebook, but I get the impression that people that do will scroll back through every post that happened since the last time they visited, to make sure they don’t miss any important info from their friends. Twitter is not like that.

On Twitter, you can jump on when you need a 5-minute break from work, read a few tweets, mark some longer stories to read later or go read an article or two now, and then get right back to work. People that use twitter won’t get mad if you miss one of their tweets. If something resonates with a lot of people, it will be retweeted and you will probably see it later. If not, it’s not a big deal. You see what you see when you’re online, and don’t worry about what you may have missed, it will just stress you out.

Think of Twitter like the news. You may want to see if anything has just happened, what’s at the “top of the news”, or what people are talking about that happened recently. If there is a big news story, it will likely still be visible when you visit later. It would be stressful to try to keep up with every news article that’s published at any time.

I just scroll back a half hour or so and scroll up until I’m ready to do something else. If I’m looking for tweets about a specific topic, I do a search and see what the top tweets are for it. You can narrow down the search results to “People You Follow” if you only want to see what people you are connected with are saying about the topic.
“I started using Twitter and it was just a bunch of junk I didn’t care about.”
Twitter has an onboarding problem. The problem used to be that when you started a new account, you weren’t following anyone, then people would feel lost and not know how to find interesting accounts to follow. Then they started suggesting interesting accounts. Now, the onboarding process shows you a whole bunch of “brand” accounts to follow (whether those are celebrities or companies, they are usually accounts generated to gain followers or money), then they also try to get you to import your email contacts and follow all of them. I don’t know about you, but I don’t care much about what celebrities have to say, and many of my email contacts are people that I had a short business exchange with years ago and have no interest in keeping up with now. It’s no wonder people start with an uninteresting and overwhelming timeline.

My recommendation is that if there is someone in your timeline that frequently annoys you or tweets boring stuff, unfollow them. That’s just clutter. If you see a friend retweet something from someone you don’t follow that is interesting, click on that person’s profile, read a few tweets and see if they are tweeting other things that interest you, and if so, follow them. Constantly tailor your timeline to work for you.

Another important suggestion is to use twitter lists. If there are certain people that you really do want to keep up with (like personal friends, or a small group of accounts on a very specific topic), put them in a list. You can also follow them in your normal timeline, but you don’t have to. When you click over to your list, you will see only tweets by those accounts. One example of how I use a list on my personal account is my “Harrisonburg businesses” list. I don’t frequently care about whether a local restaurant is having a special, or if there’s a cultural event going on at our local university. However, when I’m looking for something to do one night, I can click over to that list and see what the local businesses are tweeting about today. Are there any cool bands playing in town? A special at a local hangout? I follow very few of those 160+ accounts in my regular timeline, but now I have a collection of them in one place when I do want to scroll back through 24 hours of tweets to find something specific.
“Social Media is a time suck for me, and I don’t want to add any more social feeds to my life to waste time on”
OK, I can see this. It is easy to get sucked in and spend a lot of time on social media. To me, this is just a reason to optimize your account so it’s beneficial to you. If you’re just reading celebrity gossip and trending topics, are you improving your life? However, if you have a goal to become a data scientist, and you follow accounts that are actually educating you, is it so bad to spend some time “sucked into” a feed that is actually getting you closer to your goal in your “free time” and keeping you up to date on the latest topics that a colleague or future employer may expect you to know about?

———————–

Now that I’ve explained some misconceptions about Twitter, I want to explain why I have a separate account for “Data Science Renee”. I have had my personal account on Twitter since 2008. I have only really been into data science since late 2013. I have a “network” of people that I chat with about a variety of topics on my personal account, including political topics and random things that catch my attention. Here are my main reasons for starting a separate account for @becomingdatasci:

I personally wanted to separate the topic out. I wanted to go “all in” on data science, and have an account where I ONLY follow people that talk about data science, even people I wouldn’t follow in my normal timeline. I could have done this with a list, but I wanted to take it further than that.
I also wanted to be able to tweet like crazy about data science, and not feel like I had to hold back in order to avoid overwhelming my existing followers with a flood of tweets on a new topic. They might unfollow me if I started tweeting 20 times a day about data science when I had rarely mentioned it before, and my new interest might far outweigh my tweets on other topics I’m interested in. I didn’t want to lose that existing network.
The opposite is also true. I wanted to be able to connect this new account to my blog, and use it to make work connections, without worrying about including personal political views and tweets about gardening and cute animals in that feed! I also would know that people that follow this account are following it because of data science. I check out my followers on this account more often than I do on my personal account, because they’re more likely to share this particular interest with me.
I know I’m good at curating interesting articles about a topic, and I wanted this account to be considered a “go to” account that others could recommend to their friends interested in learning about data science, without worrying what else I might be tweeting about. I decided to become a sort of “learning data science channel”.

———————–

So you see why I have separated this account from my existing personal Twitter account, and how I have tailored it to work for me. But what does that mean? What have I actually gained from this twitter account?

I have learned a LOT that I wouldn’t otherwise know about data science. There are terms that I wouldn’t have known to Google that some of the people I follow tweet about and link to articles, academic publications, and tutorials about. There is a constant flow of interesting new information coming out of the data science “industry” so I can keep up with what is being talked about right now and what is considered “state of the art” and exciting to other data scientists. It’s like being able to walk around and listen in on lunch tables at a data science conference. Everyone is talking about something slightly different, but all in the general topic of data science, and each person is honing in on what is interesting or exciting to them within this realm.
I have made connections that I wouldn’t have made otherwise. I don’t have a lot of time or money to constantly travel to data science conferences and meet people in person. I live in a small town and there aren’t a lot of other people talking about data science here (yet). Twitter has given me a way to personally connect with other data scientists. I have connected with some that don’t live far from me, after all! I have connected with many that live in other countries that I likely wouldn’t even meet at a conference. These connections have cheered me on in my learning, connected me to resources, and more!
I have become a “face” of a person learning data science. At once conference I did attend, I was recognized as “Data Science Renee”! I have been asked to be interviewed on podcasts and blogs (some of those should be coming up soon), offered contract work, and offered free admission to a conference I unfortunately couldn’t go to, but was excited to be considered for. “Famous” people in the industry are now coming to me to work with them in some way. New learners seem to look to me as a resource and guide, and want to see how I learned what I know, and how I have struggled, so they can compare that to their own experiences.
I have found many other women working in data science. When I was first learning about data science, all of the “who’s who” lists of people to follow, people that were interviewed for books or other resources, and the “faces” of data science were often white or asian men, with maybe one woman or minority included in the group. (This is typical of the tech industry.) However, as I made more and more connections, and started to seek out women and other minorities in the industry, I have been able to connect with them and learn from them and hopefully amplify their voices. I now have a twitter list with almost 450 women that work in data science or statistics, and now that list can be a resource for other women looking for role models like them in the industry!
I have learned some specific data science tools and techniques. I regularly see great tutorials on twitter, via blog posts or videos or github links, that show me how to do something I have wanted to learn how to do. These would often be hard to find by searching, but come right to me in my twitter feed where I can bookmark them for later learning sessions.
People on twitter have reached out to help me solve problems when I’m stuck. I have received tweets from people that built python packages I was using, people that had resources that could help me, or just people with general advice and feedback! If I’m clear about what I’m doing and where I’m stuck, I now have a strong enough follower base that I will almost always get a helpful answer!
Not only do I find out about resources I wouldn’t otherwise have, but I see opinions of others on existing resources. A conversation on twitter about being overwhelmed by the vast amount of things there are to learn in the broad topic of “data science” helped inspire me to bring an idea I had been having to life. I have taken a course online that got really difficult at about the 5th lesson. I didn’t know whether it was just me and I had hit a roadblock, or if a lot of people found that course difficult and I just needed some outside resources to continue with it. I also often don’t know where to start in my long list of bookmarked “things to learn”. But seeing what people tweet about, and how others have learned, really is helping me on my learning journey. You can read about my new website DataSciGuide here. I’m hoping the ratings (and eventually learning guides and a recommender system) there will help others avoid “data science learning overwhelm”. (P.S. I’m now in the phase where I need reviews on the items I’ve posted, so please go rate some things!)

———————–

Hopefully this post has helped you understand how to use Twitter to join a community and learn something you have been wanting to learn! You can really gain a lot from it if you optimize its benefit to you like I have.

I know the question now will be, “so who are the best people to follow on twitter for data science?,” and I’m hesitant to answer that for you since there are so many people out there, some with specific topics that would be better for you personally than what I would recommend. For instance, maybe you are especially interested in learning data science for sports analytics, which is a specific topic I don’t follow many people on.

If you follow me on @becomingdatasci and see who I retweet, you’ll find people that are sharing resources that I think are beneficial, so you can start there. You can’t go by my twitter favorites since I use those as bookmarks and haven’t read many of them yet. You could look through people I follow, but there are a lot of them, and they’re not ranked in a helpful way. You can also follow the list of data science women I mentioned above.

Others that are often good to start with are people with data science blogs, since they’re usually purposely writing to educate others. Here’s a large list of data science blogs that includes the twitter handle of the author or blog where applicable, and is sorted into categories. Check it out! https://blog.rjmetrics.com/2015/09/30/the-ultimate-guide-to-data-science-blogs-150-and-counting/

———————–

So to recap:

Tailor your twitter timeline frequently. Unfollow those that annoy or bore you, and follow new accounts on topics you want to know more about.

If you seriously want to hone in on one topic, or to become a “channel” for a topic, create a separate account for it

Use twitter lists to create small lists of people you especially want to keep up with, or sub-specialty topics you occasionally want to dive into. You can follow accounts in lists that you might not otherwise follow in your timeline.

Actually connect with other people. Find people like you that can be role models for your learning. Ask them questions. Help others out when they ask questions on a topic you know more about. Join the community and the conversation.

Have fun and don’t get overwhelmed! Use others’ opinions and recommendations to carve out your learning path.

Comment below if you have any questions about using twitter to help learn data science!

Human Name Variations in Databases

Renee — Sun, 20 Sep 2015 01:38:48 +0000

I normally write about my adventures learning data science here, but my expertise for years has been database design and reporting, and I have some knowledge to contribute to a discussion that I thought I’d document here.

A conversation on Twitter today about how people’s names are stored in databases, with stories of frustration from people that have had terrible customer/patient experience because of “unusual” names, made me want to write about this topic. When you search for information on name standards in databases, you will usually get information on field names, lengths, etc. What is harder to find is information on how to store the variety of names in a system of record.

To get an idea of how people are named in different cultures, see w3’s article about it at http://www.w3.org/International/questions/qa-personal-names

Some examples they give of names that may not be entered into a database the “traditional American way” are:

“Mao Ze Dong” – Mao is the family name, Dong is the given name, and Ze is a generational name common to all siblings in a family. In Chinese script, the names are not separated by spaces.
“José Eduardo Santos Tavares Melo Silva” – Brazilian name which includes many ancestral family names.
“Kogaddu Birappa Timappa Nair” – Indian name which includes village name, father’s name, given name, and last name.

You may think that people should just conform their name to our forms, like just choosing three names for “first”, “middle”, and “last”. Or maybe you think the data collection form should just have one entry field for “name” and not split it up. However, it’s not that simple, and the need to format names for various uses (like mailing labels, letters, etc.) provides additional challenges.

Unfortunately, sometimes the challenge is just getting an organization to accept your actual name at all. The next challenge, once your name is in a system of record, is how it ends up used. Different usages can end up complicating things like government IDs and driver’s licenses (if the state ID name rules don’t allow you to use the name that is on your federal record, for instance), insurance claim rejections (when your name doesn’t match up exactly between the doctor’s office and the insurance company databases), or multiple accounts at retail establishments like pharmacies (I have two different accounts at my local CVS, and when picking up medicine always have to confuse the person at the window by mentioning both my maiden and married names. Also, the name on my CVS discount card was mis-entered, so I have to spell my name incorrectly if they need to look that up for any reason).

Here’s my experience with a “nontraditional” name, since I got married and wanted to make my maiden name a 2nd middle name:

Luckily, I didn’t have trouble changing my name with social security, despite scary stories from other women I know who had to fight to get their name the way they wanted it on their card. Side note: one benefit I’ve since discovered is that since my new name on my driver’s license still includes my maiden name, it makes me more believable when I show up somewhere that doesn’t have my married name and I’m trying to get them to change it.

My Birth Name: Renée Marie Parilak
(I leave off the accent when entering on forms because that would just increase the chance for entry error, like Rene’e or Renee’. Yes, I’ve seen both.)

My Married Name: Renee Marie Parilak Teate.
First name “Renee”, middle name is now “Marie Parilak”, and last name “Teate”.

When I submitted this name on name change forms with credit card companies and other organizations, I had a few challenges, like not enough room on the middle name line, but I sent in all of the forms with my full name. My credit cards came back with all of these variations printed on them, depending on each company’s conventions:

Renee Marie Parilak Teate

Renee M. Teate

Renee M. P. Teate

Renee M. Parilak (yes, one came back unchanged)

Actually, the other card that came back unchanged was my voter registration card. I filled out the update form at the polls because they gave me a hard time last time I voted and had different names on my “proof of identity” information, and they mailed me a new card with my old name on it. Really.

Another has my full name correct when concatenated, but has stored my last name as Parilak Teate instead of Teate.

Here’s a friend’s experience (with names removed for privacy purposes):

My parents, to forestall in-law fights over naming, made it so that the first name of all their female children was my mother’s middle name (which is what she goes by…family tradition of female going by middle name), so we are all named “SharedFirstName MiddleName Surname”. Every girl child was meant to be called by her middle name and so it has been.

Many bureaucracies that have forms that force everyone into using their middle initial only (if they even acknowledge the middle name). It’s been a problem throughout my life, especially at doctor’s offices, but it escalated enormously with the connection of various bureaucratic systems to the internet.

The latest wrinkle started about 2 years ago when one of my sisters moved from the family home to an apartment. She filled out a USPS change-of-address form. Suddenly, not just her mail but some of my mail, my mom’s mail and my sisters’ mail started going to her apartment. Our small-town postal personnel suggested workarounds, none of which worked. It wasn’t as simple as going to the post office, showing ID to prove who we were, and having postal personnel escalate it to whoever is in charge of that database. We tried repeatedly to get the mistake corrected by personally visiting our local post office and talking with the supervisor.

Meanwhile, my sister moved out of that apartment, meaning she was no longer there to reroute that mail to us. My mom, my sister, and I, who are all on Social Security (for age or disability) started to find Medicare and other important SocSec mail now had that apartment address on it instead of our true address. No one from USPS or SocSec ever contacted the address/contact of record to verify we wanted an address change. We have different Social Security numbers and you’d think that would be enough to double-check we weren’t the same person despite slightly similar names, but no. This points out how easy it would be for a ID thief to get a bunch of your most sensitive information re-routed to them. Google has a simple check-in when someone logs in from a strange computer or tries to change the password, yet USPS and Soc Sec don’t??

While this was going on, I’d tell medical providers to be sure to address bills using my middle name as the first name so it wouldn’t be rerouted, which it would be if they used my first name and middle initial, which happens to be identical to one of my sisters’ names in that format. They’d refuse or beg off, saying that their biller took care of that and there was no way to get that kind of customization. So despite being a customer who didn’t want to shirk my bills, who was being pro-active about it all, I ran the risk of running up huge medical bills because of mail going somewhere else and medical offices being so maddeningly subject to the almighty database that they would not avoid sending their mail into a black hole.

A couple months ago I tried to call Soc Sec to get advice about something. Even to get advice, they ask you your SS#, mother’s maiden name and other stuff. When I gave my mother’s maiden name over the phone, the operator told me I was wrong! WTF! She was ENORMOUSLY rude to me when I tried to let her know what mistake was happening.

I was fed up with years of this and called my congressperson’s office. So far, they’ve dropped the ball completely. I guess it’s The White House next. Which is a waste of my time, a waste of gov’t time, etc. All of this could have been avoided if good database design practices prevailed (and if bureaucratic organizations would quit distancing customers/clients from reaching the person in the organization who would be capable of making changes to the database once you’ve proved that you are YOU and have always been YOU).

In the ariticle linked above, w3 addresses some “implications for field design” for name fields, including field length, whether to split the name up, etc. I suggest you go read it. Here’s the link again.

If I were designing a name entry form today (Note: for a system that actually needs to store full names! Not all of them do!), I would ask the user for:

Prefix: [Mr., Ms., Rev., etc. – optional, could allow “other” entry]
Given/First Name(s):
Middle Name(s): [optional]
Family/Last Name(s):
Maiden Last Name: [if applicable]
Suffix(es): [Jr, III, Esq. etc. – optional, allow “other” entry]
Full Legal Name: [optional, would default to First, Middle, Last]
What should we call you? (Preferred given name or nickname if desired): [auto-fill with Given name. allow user to edit]
Preferred Mail Name: [auto-fill with prefix, first, middle, last, suffix. allow to edit]

Note: if this were a mobile form, I wouldn’t ask for all of the name variations up front! Complicated mobile forms are a turn-off and can result in lost customers, so just ask for either full legal name (and have your code guess how to split the full name into first/middle/last fields) or first and last, and “What do you want us to call you?”, then have a detailed profile page that lets them go in later and fix anything your system got wrong.

In my case, I would be able to specify that my first name is “Renee” and my last name is “Teate”, and the other two names are my middle names. My friend could specify her full legal name and also that she prefers to be addressed as her middle name and not her legal first name. Mao Ze Dong could specify that Mao is the Family Name and Dong is the Given name, while still leaving the names in the original order for the full legal name.

Make sure your database can handle all of the variations that may be written on a paper form. Also make sure you can handle special characters.

Another Note: If you are going to display the name publicly in any way, like on a user profile on a website, you have to give the person full control over how they want their name displayed there. There are various safety and personal reasons a person may not want their name displayed the way you want to display it. Here’s one example story. Also see Nymwars.

And if my system were something like a medical insurance record where past names may come in from doctor’s offices even though I have the patient’s current name, I might ask for a list of past names to keep on the record. You can store several “former names” in a table with a one-to-many relationship to the person’s primary record, and store all of the prior name fields when a new name comes in. You can even store names that aren’t their actual name, but may come in from another system regularly (like a misspelling) or be an incorrect version of their name that you stored in the past.

When in doubt, you can use the full legal name on communications. If sending an informal email, you can use the “What should we call you?” name. If sending a formal letter, you can use the Preferred Mail Name. Someone could have the preferred mail name of “Mr. J. Edgar Hoover” (funny that’s the name that came to mind as an example of first name initial when I’m writing about storing personal information), but prefer to be called “Ed” in person, and it’s good for your organization to know.

If you have spouses in your database, the marriage record should store preferred informal and mailing joint names, like “Mr. and Mrs. Doe” or “John and Jane”, and not just auto-generate the combinations dynamically (though you could default to that), since some couples have strong preferences on how their names are shown in letter salutations (like wanting the wife’s name first, or preferring to always be referred to formally – I work in a fundraising organization, and they definitely want to address large donors’ names they way they want them!). A relatively safe way to do envelope labels is to have “stacked” name labels, where both spouses’ preferred mailing names are completely written out and listed one above the other. This avoids uncomfortable situations in cases where the spouses have different last names, for instance. Also, you can then handle cases that some systems have trouble with, such as “Drs. Jane and John Doe”, or professional/military prefixes and suffixes in a certain order.

Speaking of marriage, there is a great article on storing a variety of marriage relationships in your database, called “Y2Gay“. It’s a good read for database designers or anyone that has to think about data issues!

Besides having a system that can respond to customer preferences (following their preferences will make them happier customers; rejecting an insurance claim because the first/middle/last names don’t exactly line up with your record will not), having all of these name variations does help in a data science way as well: you can now better match profiles coming in from different systems and have a higher degree of confidence that you are pulling incoming data into the correct “Mary Smith” record, for instance.

Hopefully, our stories made for some food-for-thought for people designing databases, websites, and processes involving people’s names!

Please share your experiences, thoughts, and comments on my database field choices below!

Here are some other references on the topic:

Bad assumptions programmers make about names