Renee – Becoming A Data Scientist

Valley TechCon Slides

Renee — Sun, 29 Sep 2019 03:46:06 +0000

This week, I gave a talk at Valley TechCon in Harrisonburg, VA (where I live). The conference, now in its 2nd year, features tech businesses and practitioners from the Shenandoah Valley of Virginia. There were a bunch of cool topics and speakers, and it was great to see what other tech is being built in my part of the world!

I said I’d share my slides on my “SQL for Data Science” talk, so here they are!

I only had 40 minutes, so this wasn’t a full tutorial, and I couldn’t really teach many different concepts. My main goal was to point out a bunch of different things you can do with SQL, and how to think about creating an analytical dataset, so attendees would know what’s possible and what to look up if they want to start learning SQL.

Enjoy!

Becoming a Data Scientist Episode 17: Andrew Therriault

Renee — Thu, 30 May 2019 03:12:04 +0000

In the first episode of the Becoming a Data Scientist podcast recorded in front of a live audience, Renee interviews Andrew Therriault – formerly the Director of Data Science for the Democratic National Committee & Chief Data Officer for the City of Boston, and currently Data Science Manager at Facebook – about how he learned data science, what advice he has for people who want to learn data science and apply for data science jobs, and about his career path as a Data Scientist and leader in the field.

Recorded at Tom Tom Fest Applied Machine Learning Conference in Charlottesville, VA on April 11, 2019.

Episode 17 Audio

@therriaultphd on twitter

Data and Democracy
This O’Reilly ebook by Andrew Therriault explores how political data science helps to drive everything from overall strategy and messaging to individual voter contacts and advertising.

Data Security for Data Scientists by Andrew Therriault on Medium
Ten practical tips for protecting your data (and more importantly, everyone else’s!)

World premiere #becomingadatascientist podcast by the famous and fabulous, with guest, Andrew Therriault of @facebook @BecomingDataSci #AMLCville #datascience #machinelearning @UVADSI @TomTomFest pic.twitter.com/2EwjsUwULL

— Data Science Connect (@DataScienceATL) April 11, 2019

Our first ever live audience for Becoming a Data Scientist podcast at #AMLCville! With @therriaultphd! pic.twitter.com/JA1RpiKq0u

— Data Science Renee (@BecomingDataSci) April 11, 2019

RVATech Summit Slides

Renee — Fri, 15 Mar 2019 14:34:07 +0000

I promised the audience at the RVATech Summit yesterday that I’d post the updated slides for my “Can a Machine be Racist or Sexist?” talk, so here they are!

Here is the link to the previous post, which has a pdf version of the slides that’s almost identical, and a video from when I gave this talk at Tom Tom Fest Applied Machine Learning Conference in 2018.

Here’s the blog post that was the start of researching this topic, for me.

Here’s the twitter list of people who talk about Ethics & Law in AI/ML.

And here’s the flipboard magazine where I collect articles on this topic.

Enjoy!

Tom Tom Fest Applied Machine Learning Conference 2019

Renee — Sun, 10 Mar 2019 19:39:28 +0000

The 2019 iteration of Tom Tom Fest (named after Thomas Jefferson) starts in under a month, and the Applied Machine Learning Conference (AMLC) is just over a month away, on April 11, in Charlottesville, VA!

This is a conference I’ve helped plan since the beginning, and it’s grown in 3 years from a single theater with a partial day of talks to 4 theaters with non-stop presentations all day, plus keynotes in an even larger venue!

An exciting announcement from me is that I will be recording a short episode of my Becoming a Data Scientist Podcast in front of a live audience for the first time at the AMLC! I’ll be interviewing Andrew Therriault, one of our keynote speakers, about how he became an Infrastructure Data Science Manager at Facebook, after starting out with degrees in politics, and working as the Director of Data Science for the Democratic National Committee, and the Chief Data Officer for the City of Boston, among other roles.

I am featuring an AMLC speaker each day with the #AMLCville hashtag on twitter. You can learn more about all of our speakers (still more to be added!) and get tickets on the conference website. Hope to see you there!

APRA Data Analytics Symposium 2018

Renee — Thu, 09 Aug 2018 12:43:07 +0000

Hi, #APRADAS2018 Attendees!

I plan to come back and add more info here in the future, but for now, here is a PDF version of my slides:
My Journey from Advancement Data Analyst to Data Scientist

Summer of Data Science Goal-Setting

Renee — Wed, 06 Jun 2018 06:14:49 +0000

The purpose of the Summer of Data Science is to learn a specific topic or complete a project or read a book or finish a course so you can check something off of your long data science “to learn” list (get used to it being long, data scientists always have more to learn, so it never gets shorter!), and have fun achieving goals along with other data science learners during a fixed period of time. The deadline should be motivating, to get you to start and finish something before the summer is over.

Week 1 was all about brainstorming ideas and gathering resources – dreaming up what you’d love to learn, and finding content that will help you learn it.

Week 2 (which started yesterday, but don’t worry, jump in any time even if you see this blog post a month from now) is all about goal-setting.

You should set a #SoDS18 goal that’s lofty enough to excite and motivate you, but not so out of reach that you’ll never complete it and only get disheartened when halfway through the summer you realize you are only 10% of the way there.

Make sure to keep goals attainable! Last year my goals were too expansive and by week 4 there was zero chance of accomplishing all of them which was super demotivating https://t.co/uXiwUFNYU5

— Nick Heitzman (@NickDoesData) May 28, 2018

I also want to make sure you know what makes a good goal. I like the definition used by the SMART approach:

Your goal should be

Specific

Measurable

Achievable

Relevant

Time-Bound

Instead of explaining each of these in detail (you can read more about it elsewhere on the internet), I’m going to give an example of things you can jot down for yourself for each of these, then an example summary tweet for 2 different #SoDS18 goals.

Let’s say the idea you had for what to learn this summer is “Start learning Python”, and the resource you found is DataQuest. Let’s turn that into a SMART goal:

Specific – Learn how to import, clean, and visualize data using python and pandas

Measurable – Complete all courses in the DataQuest Data Scientist Path

Achievable – I can spend at least 6 hours on this project every weekend, plus occasional weekday evenings, so I have enough time available to do the work [Note from Renee: I have not actually researched how long this course series would take to complete]. I have joined the #py4ds Slack community and will ask for help there and on DataQuest if I get stuck so I don’t get set far behind.

Relevant – I want to add python and pandas to my resume, and it’s my first step on my new path to becoming a data scientist, so it’s relevant to my career goals and I’m motivated to accomplish it.

Time-Bound – the Summer of Data Science ends on September 3, so I will finish this first goal by August 3 in order to have time to complete a small project during the last month of #SoDS18.

Example tweet to share this goal with the world:

My 1st #SoDS18 goal: I will learn to import, clean, and visualize data with python & pandas by spending 6-8 hours per week on the Data Scientist Path on DataQuest, and will complete it by August 3. I’ll ask in #py4ds Slack if I need help.

Or, say your idea is to “do a machine learning project using at least 2 different algorithms on some kind of dataset that could help people”. That can be converted to a SMART goal like:

Specific – Learn how to use random forest and logistic regression in R by experimenting with data from the Kaggle DonorsChoose.org Dataset to develop a list of donors to email about a particular type of project request

Measurable – I will complete exploratory data analysis on the available DonorsChoose data files and write a blog post about my findings that includes at least 3 visualizations. Then I will find out what it means to submit a Kaggle Kernel, build 2 machine learning models using random forest and logistic regression algorithms and compare their model evaluation metrics to each other, submit the Kernel (even if the contest period is over), and find and study at least 2 other people’s submissions to understand different approaches to the problem. Then I will write another blog post summarizing my results and findings.

Achievable – I have read about random forest and logistic regression online, and my friend gave me the Introduction to Statistical Learning book so I can better understand these machine learning algorithms. I have a bunch of resources bookmarked online in case I need extra references to understand the book. I will tweet using the #rstats hashtag or talk to my friend if I need help. If I find out the dataset I found isn’t great for learning these 2 algorithms, I will search for another dataset as needed. I can dedicate 2 hours a day 4 days per week to working on the project and researching these topics.

Relevant – I started learning R over the last year and have used it to complete labs at school, but want to expand my machine learning capabilities and apply my skills to a real-world dataset before I start applying for jobs in the fall.

Time-Bound – I have 12 weeks to complete the project this summer.

Example tweet

My #SoDS18 goals are to:
-explore the DonorsChoose Kaggle dataset
-use ISL book & online resources to learn to build random forest and logistic regression models
-create and submit a Kaggle Kernel to help DonorsChoose
-write at least 2 blog posts about it over the next 12 weeks

I think you get the idea!

I should also mention that you don’t want to over-plan. Notice the note about switching datasets if one doesn’t work out – plan to be flexible! You don’t yet know what you’re getting into, and you might need to spend more time finding good resources to learn from, getting help, or pivoting if your original plan doesn’t work out. That’s OK! Just go with the flow and try to achieve something comparable to your initial goal. But, you need an initial goal in order to figure out where you are relative to it!

So, finish brainstorming your learning ideas and finding resources this week, then narrow it down to a SMART goal, and tweet about it with the #SoDS18 hashtag so we know what you plan to learn during the Summer of Data Science 2018!

And if you’re still looking for project ideas, check out Mara Averick’s post, browse the #SoDS18 hashtag, or join a data science learning community! (More about this in another blog post later this week!)

#CraftyDataViz Winners

Renee — Sat, 02 Jun 2018 03:05:46 +0000

About a month ago, on a whim, I posted the #CraftyDataViz contest, hoping for some beautiful and wacky homemade visualizations, and you all sure came through! The entries were gorgeous and the judging was super difficult!

First, I’d like to thank our awesome judges, who went through 2 difficult rounds of judging and discussion to help select the winners!

“Judging this was so difficult. I was blown away by the creativity.” – Karen Lopez

“Every single entry was compelling.” – Natasha

As described in the original post, we had 3 categories to be won, which can be summarized as follows:

Most Beautiful – depicted the original visualization’s data in a visually impressive way
Most Informational – most effectively/readably communicates the data from the original visualization
Most Fun! – made the judges smile

There was no category where there was a clear winning crafty data visualization. Everyone who entered put their heart into it and had some creative, beautiful, fun, and informative handmade art! Without further ado, here are the top selections:

Most Beautiful

In the Most Beautiful category, the 5 judges initially selected 5 different winners! Eventually, they came to agreement on Patti Shih’s submission, comparing normal brain connectivity to connectivity “while tripping on magic mushrooms”:

(original)

“Patti was able to almost perfectly recreate the original dataviz using something as cozy as string. The choice of colors manages to make it even more visually appealing than the original.” – Emil Hvitfeldt

“Circos diagrams, although hard to gain insight from, are pretty. By sculpturing layers with string, Patti enhanced both their beauty and their informational value.” – Helena Jambor

Honorable Mention

The runner-up in the Most Beautiful category was a twisted-metal tree representing a dendrogram of the students that the mother of Alli Torban (host of the Data Viz Today podcast) taught throughout her career.

“The tree is very very beautifully done and photographed – serves both as reminder that tree-diagrams can be shown in 3D and is in itself almost a sculpture!” – Helena

Most Informational

In this category, Amy Cesal’s two entries dominated; both of her Play-Doh-based visualizations won the most votes in the original round. They were also strong contenders in the Most Beautiful category. In the end, the judges selected the “joy plot” (aka ridgeline plot) of the Perceptions of Probability for its effectiveness in conveying the meaning of the original data visualization.

“Joy plots were recently re-popularized by @ClausWilke and are both fun and informative. Amy captured its full essence while choosing visually stunning Play-doh colors. Absolutely beautiful!” – Sean Law

Honorable Mention

Amy Cesal’s 2nd entry on birth time patterns was the other top choice in the Most Informational category.

(original)

“A creative use of Play-doh on top of a 3-D object makes me a big fan of Amy Cesal. I never thought that Nadieh Bremer’s wonderfully informative work could be improved upon but this is one beautiful exception!” – Sean Law

“I loved that this crafty version made the original data viz more understandable due to adding another dimension.” – Karen Lopez

Most Fun

This subjective category also had multiple top picks in the first round of judging. After much deliberation, the judges chose Jon Schwabish’s daughter’s entry depicting her Halloween candy haul! Definitely fun, especially the mysterious “MEGA” category!

And to add to the fun, Hadrien Lacroix reverse-engineered a digital version!

“The joy and childlike wonder displayed in this dataviz brightened my day, which makes this project my choice of “Most Fun”!” – Emil

“The author’s clear zest for each individual data point (ie piece of candy) really shone through!” – Natasha

Honorable Mentions

The judges also thought that Awais Athar and Alli Torban‘s entries were especially fun and interesting!

The 3 category winners will get their choice of item from the Becoming a Data Scientist store. Congratulations! And thank you to every person who entered the #CraftyDataViz contest – there were many more awesome entries than we could depict here. This was truly amazing, and I can’t wait to do it again next year! Check out all of the impressive entries in this Twitter Moment!

Summer of Data Science 2018 #SoDS18 Launch Day!

Renee — Mon, 28 May 2018 17:53:49 +0000

Memorial Day is the unofficial start of summer in the U.S., so in the past, we’ve also used it as the start date for the Summer of Data Science! The main goal of the Summer of Data Science is to learn something new during a fixed period of time, and share your progress and references to help and inspire others (and to get help from and get inspired by others, too!). If you want to learn more about the origin and history of the hashtag, I gave more background in last year’s post.

But I’m sure the main thing you’re here to find out is how to get involved yourself! So, here are the basics:

How to participate in the Summer of Data Science:

Pick a thing or a short list of things related to data science that you want to learn more about this summer (or this winter if you’re in the southern hemisphere!)

Make a plan to learn it (like an online course, a practice project, etc.).

Share that plan on social media, then post updates as you make progress, with the hashtag #SoDS18.

Here’s a twitter moment with a bunch of entries from #SoDS17 for reference!

We’ll run this one from today – May 28, 2018 – through Labor Day in the U.S. – September 3, 2018. What you can realistically get done in that time depends on where you are in your data science learning journey, what your work schedule and family obligations are like, and many other factors – so think about what’s realistic for you to accomplish during this time.

Week 1 will be about brainstorming and researching possibilities and resources for summer projects, courses, etc. And in Week 2 we’ll set specific goals for the rest of the summer. So, start thinking of ideas now!

If you would like some ideas for beginners, here’s a list of beginner content on my site DataSciGuide:
Recommended Resources for Beginners
You might want to pick a book or course and go through it, trying out the exercises this summer.

I also have a Flipboard where I have collected a bunch of Data Science Tutorials you might want to check out (note: these aren’t all aimed at beginners).

There are also a whole bunch of online communities where you can join others in a project, or ask questions if you get stuck on yours. I’ll be writing another post highlighting those this week!

Follow me on twitter @becomingdatasci, and tweet with the hashtag #SoDS18 when you post updates about your progress! (It’s a good idea to “thread” your tweets throughout the summer, or add them to a Twitter Moment, so others can easily follow along!)

I’ll be retweeting a bunch of people’s ideas and resources, so keep an eye out there for more ideas if you aren’t sure where to start!

#CraftyDataViz Contest!

Renee — Sun, 06 May 2018 02:26:33 +0000

A couple weeks ago, I started following the @everytract twitter bot account (by @fitnr), which is tweeting an aerial image of every census tract in America:

Enjoying this bot account https://t.co/uRTj3LWgDm

— Data Science Renee (@BecomingDataSci) April 24, 2018

Today, I saw this tweet:

you could replicate this by gluing various dried out legumes to poster board if you wanted to https://t.co/icHeUuy4Ob

— Rebecca Williams (@internetrebecca) May 5, 2018

and imagined a census tract visualization made out of dried noodles and beans. Being as spontaneous as usual, I asked if anyone would be interested in having a #CraftyDataViz contest, and several people responded yes! So, here we are. Time to get creative!

Basic Contest Rules:

Find a data visualization online, in a book, or create one yourself. We need a digital version of the viz you’re recreating, so save the link or take a photo of it. Now, gather up some goodies from around the house or yard and/or hit the craft store (or maybe the grocery store!) and recreate that data viz in craft form!

It does need to be a physical end-product – no completely digital entries (though usage of printed elements is ok), but other than that there are no restrictions on materials or approach – get as creative as you want! At least part of it needs to be handmade or manually assembled (so no completely 3D-printed entries, for instance, but parts could be 3D printed) You can replicate the viz in detail, or do an abstract interpretation of it :)

I will find some judges that know something about data visualization, and they will pick winners in each of the following categories (we might do some sort of online vote to narrow down the entries, depending on how many there are):

Most Beautiful/Visually Impressive

Most Informational

Most Fun!

The winner of each category will be featured here and on my @becomingdatasci twitter account, and win their choice of item from the Becoming a Data Scientist store! By then, I’ll have the 2018 Summer of Data Science items up, since the end of this contest will also mark the launch of #SoDS18!

How to Enter:

Post a tweet (or a comment below if you don’t use twitter) that includes a link or image of the original dataviz, as well as a photo or video of your creation! On twitter, use hashtag #CraftyDataViz so I can find your entry!

The entry period starts today and ends at 11:59 PM EDT on Sunday, May 27. I’ll plan to post the entries for viewing/judging on Monday, May 28 (Memorial Day in the United States).

HAVE FUN!! I’m excited to see what you create!!

Can a Machine Be Racist or Sexist?

Renee — Mon, 16 Apr 2018 02:08:47 +0000

I presented a talk with this title at the Applied Machine Learning Conference at Tom Tom Fest in Charlottesville (which I also helped plan) last Thursday April 12, 2018.

My interest in this topic started long ago, and I partially based this talk on my blog post “A Challenge to Data Scientists” from 2015. There are a ton of links throughout, and I included the slide notes so you have those along with the presentation. (I’m not sure why all of the URLs aren’t automatically linked, so you have to copy and paste some, sorry.) I’m prepping for another presentation right now and don’t have time to write a whole lot about it – so without further ado, here are the PDF files with the slides and the notes:

SLIDES

[SOME] NOTES & LINKS

Just noticed the link above still doesn’t contain all of the notes and links… I need to figure out how to save that to PDF in the right format from Powerpoint. For now, here’s the full slide + notes view with all links. Just don’t print it – it’s almost 100 pages long!
SLIDES WITH NOTES & LINKS

Update: There’s a video on YouTube of this talk and the panel that followed!

Podcast Listens Analysis

Renee — Mon, 02 Oct 2017 01:34:41 +0000

I’ve been telling everyone that I’d do something “data fun” when I hit 20K Twitter followers, so I posted an analysis of my podcast listeners! I used python and pandas in a Jupyter notebook for the first part, then I did a dashboard in Tableau for the last part.

Here’s a video of me explaining the analysis:

A few notes as I skim through:

That part that was broken is where I hadn’t changed from the real IP to the random IP (sorry search bot), so I fixed that in the file below
I pointed to the wrong thing when I was talking about how long I’d been around…Becoming a Data Scientist Podcast started in December 2015! So 1 year later there was a day larger than the 1st day for the 1st 3 episodes.
The top IP that got 36 views – I’ll have to look into it, but I think it could be multiple IPs getting assigned the same random number. I’ll take a look and come back when I have a chance.

Here are all of the episodes, so you can go back and listen to any you missed!

You can download the HTML versions of my Jupyter notebooks, and also play with the Tableau dashboards at these links:

“Clean” version of the Jupyter notebook

Full messy analysis Jupyter notebook

Listen monitoring Tableau dashboard

Interactive episodes by week Tableau dashboard

If you have suggestions for how to do the code in a more sensible way than how I rushed and did it, or if you have any questions, feel free to add suggestions in the comments below!

Hi #DemystifyDS Attendees!

Renee — Wed, 27 Sep 2017 15:07:24 +0000

I’m assuming that some people who see my talk at Demystifying Data Science conference will be dropping by here, so I wanted to put up a quick post summarizing some of the resources I have made available to data science learners.

If you haven’t signed up for the conference yet, it’s free online, going on now (10am to 10pm 9/27/2017) and you can sign up at this link. Everyone who registers will get access to the videos afterward.

Here’s some stuff to help you on your data science journey!

Becoming a Data Scientist Podcast

Audio on iTunes (or other “podcatcher” apps – just search “Becoming a Data Scientist”)
Video on YouTube
Twitter list of podcast guests

Data Science Learning Club

DataSciGuide Data Science Learning Resource Directory

@becomingdatasci (me!) on twitter

My Flipboard Magazines – Becoming a Data Scientist, Data Science Tutorials, and Bias in Data Science

Twitter list of over 1200 Women in Data Science and related disciplines to follow!

And some blog posts from this blog I’d like to highlight:

PyData DC “Becoming a Data Scientist – Advice from my Podcast Guests” Talk (Longer version of the talk I gave today)
A Challenge to Data Scientists – on bias in machine learning
Introductory Machine Learning Terminology with Food
Summer of Data Science (Southern Hemisphere edition coming soon!)
Boosting as a Metaphor for Diverse Teams

Thanks to Metis for hosting the Demystifying Data Science conference today! If we’re able to post the videos of our talks publicly in the future, I’ll post them here.

Thanks for dropping by! Connect with me on twitter for any follow-up questions!

More silliness

Renee — Sat, 29 Jul 2017 23:01:45 +0000

Back before I had so many followers, and it was less stressful to put goofy stuff “in the wild”, I wrote data science parody lyrics to “Summer of ’69” and “For the Love of Money”. Well, a while ago, another idea popped into my head...

Feeling another set of parody lyrics coming on. Maybe will write down this week. Guess the data science theme? https://t.co/flpIozxz8x

— Data Science Renee (@BecomingDataSci) May 1, 2017

and I was reminded of it today by this tweet

Just a Neural Net
Livin' like a marionette
She took the TCP goin' anywhere
Just a chatroom bot
Born and raised on reddit thought
…

— J. Nathan Matias (@natematias) July 29, 2017

So, I wrote it down to share my silliness with you. Enjoy!

Lyrics:

I can fit every point
feature high info gain
but I’m over you asking me when you know I’m not okay
I learned really well
from the training that night
and though I can classify you know I’m just not right

I can’t generalize
I’m overfit
Too complex
I’m overfit
Tricked by noise
I’m overfit
I’m overfit
I’m over

Wanting you to be wanting me
But, my low accuracy
To outliers I’m sensitive
Because I’m so over
I’m overfit
Moving on, it’s too late
I didn’t pass cross validate
Tried so hard to relate
and now I’m so… so overfit!

You can go get more points
Build more models to help
You can regularize with layers that drop out

Oh that’s why I’m done
I’m overfit
Was fun but
I’m overfit
I’m not the one
I’m overfit
I’m overfit
I’m over

Wanting you to be wanting me
But, my low accuracy
To outliers I’m sensitive
Because I’m so over
I’m overfit
Moving on, it’s too late
I didn’t pass cross validate
Tried so hard to data relate
and now I’m so… so overfit!

Don’t ask me to classify
Ain’t no use don’t ask me why
I’ll never change
Unless… tweak my parameters and run me again

(P.S. here’s the Wikipedia article on overfitting in machine learning!)

2 Quick Announcements

Renee — Tue, 25 Jul 2017 16:30:39 +0000

1. Sorry for the issues loading images and with logins on this site. I’ve had problems ever since I had bluehost set me up with HTTPS certificates, and apparently those certificates have expired or something and are causing issues with the images being able to load, etc. I’ll look into it, but I also have a busy week at work and wasn’t planning on site maintenance here this week, so it might be this way for a little bit. I am aware of it and a fix is on my to-do list, though! Thanks for your patience.

2. I published a post over on DataSciGuide about resources for data science beginners. Check it out!

Thanks for your continued readership here!

P.S. I’m working on the podcast again, so the 2 already-recorded episodes are back in the work pipeline :)

Introductory Machine Learning Terminology with Food

Renee — Tue, 18 Jul 2017 02:48:16 +0000

I was just pondering some ways to discuss machine learning terminology in a way that would be accessible to beginners, and figured I’d share my semi-thought-out ideas here. I’m sure this has been done before, but here are some common machine learning terms couched in the language of cooking and food. Feedback welcome!

Machine Learning Algorithm

A machine learning algorithm is a list of instructions to guide a computer to analyze some data to find patterns, and works much like a cooking recipe. You put some data in (ingredients), do some stuff to it (preparation and cooking), and then evaluate how the results compare to what you were hoping to accomplish (photo in your cookbook and expectations of taste).

Input Columns (Features) -> Ingredients

Features are basically columns in your spreadsheet that you choose to input into an algorithm, and they’re like ingredients in a recipe. If you’re going to make cookies, you’ll need some ingredients like eggs, flour, and chocolate. These ingredients get stirred together and then baked, and the output is a (hopefully) edible cookie.

Feature Engineering -> Ingredient Prep

Some ingredients, like eggs, aren’t added to the recipe as-is. You have to crack the egg, sometimes separate the whites from the yolk, and maybe beat the egg before adding it to your recipe. The same goes for the features, or columns in your spreadsheet. For instance, you might have a column that contains categorical values in text form, and some algorithms will require you to first transform that column into a bunch of yes/no columns that “encode” the feature into binary values. If, for instance, you have category values listed in your original column like “category A”, “category B”, and “category C”, you might convert the 1 column to 3 columns – one for each category – and instead of the spreadsheet cells containing words, they now contain a 1 or a 0 in each category column indicating “yes” or “no” for each category. An algorithm that couldn’t understand the phrase “category A”, can now understand it as a value of “1” in the Category A column.
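
To make the “crack the egg” step concrete, here’s a minimal sketch of that encoding in plain Python. The rows, the `id` column, and the `one_hot_encode` helper are all made up for illustration; they aren’t from any particular library.

```python
# Hypothetical example data: one categorical column stored as text.
rows = [
    {"id": 1, "category": "category A"},
    {"id": 2, "category": "category C"},
    {"id": 3, "category": "category B"},
    {"id": 4, "category": "category A"},
]


def one_hot_encode(rows, column):
    """Replace a text column with one 0/1 column per distinct value."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every other column as-is, and drop the original text column.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[category] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


encoded_rows = one_hot_encode(rows, "category")
for row in encoded_rows:
    print(row)
```

Each output row has exactly one “1” across the three new category columns. If you use pandas, its `get_dummies` function does this same kind of transformation for you.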

Number of Input Rows / Volume of Data -> Batch Size

Changing the number of rows of data in your spreadsheet (each row containing a bunch of “features” – or columns) is like scaling the measure of ingredients in your recipe. If you want a double batch of cookies, you can add twice as much of each ingredient into your bowl, and get twice the cookie dough. If you only have half of the amount of flour called for by the recipe, you can halve all of the other ingredients to get half of a batch. You have to beware of modifying how many rows of data you input into your algorithm, though. Like in cooking, adding or removing rows can sometimes have an unexpected effect on the output. If you split a bowl of cookie dough in half, but almost all of the chocolate chips end up in one half and almost none are in the other half, you won’t get two identical batches of cookies! The distribution of values contained in the rows of data you input into your machine learning algorithm impacts the resulting model.
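
Here’s a small sketch of the chocolate chip problem, assuming a made-up dataset of 100 rows where the first 20 happen to be “chip” rows. It compares cutting the data in half in the order it arrived against dealing each label out evenly (one simple way to do what’s often called a stratified split). The names and numbers are hypothetical.

```python
import random

# Hypothetical data: 100 rows, and the 20 "chip" rows all arrived first.
rows = [(i, "chip" if i < 20 else "dough") for i in range(100)]


def chip_share(batch):
    """Fraction of rows in a batch that are labeled 'chip'."""
    return sum(1 for _, label in batch if label == "chip") / len(batch)


# Naive split: cut the bowl in half without stirring.
naive_first, naive_second = rows[:50], rows[50:]

# Stratified split: shuffle within each label, then deal rows out alternately
# so both halves keep the same mix of labels.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
chips = [row for row in rows if row[1] == "chip"]
dough = [row for row in rows if row[1] == "dough"]
rng.shuffle(chips)
rng.shuffle(dough)
strat_first = chips[::2] + dough[::2]
strat_second = chips[1::2] + dough[1::2]

print("naive:     ", chip_share(naive_first), chip_share(naive_second))
print("stratified:", chip_share(strat_first), chip_share(strat_second))
```

The naive halves end up with very different shares of chips, while the stratified halves match. A model trained on one naive half would see a very different “recipe” than a model trained on the other.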

Clean CSV -> Pre-prepared meal kit

In data science courses, you often start assignments with a pre-cleaned spreadsheet. This is like receiving a meal kit delivered to your door. You get the correct amount of each ingredient, and they are often already prepared or just need a little chopping. This is great for learning, but don’t be surprised when you get to a real-world scenario and you are expected to shop for the ingredients, hope they are available for purchase at your local store, make some substitutes, and then figure out how each ingredient you bought needs to be transformed in order for the recipe to work! It makes sense to start out as a short-order cook before you learn the skills and art to being a chef, so there’s nothing wrong with using “meal kits” to start out!

Type of Model/Algorithm -> Cooking method

Now that you have all of your ingredients prepared, what do you do with them? If you’re cooking, that batter could be baked into a cake, or fried into a donut. If you’re doing data science, your spreadsheet could be run through a variety of algorithms to generate different types of models, to see how each turns out. This is called “model selection”. You might use linear regression or a decision tree, for instance. Just like with cooking, different “ingredients” (features) work better with certain cooking styles (types of models). And some people – machine learning researchers – are exploring entirely new methods of “cooking”!
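
As a rough sketch of trying two “cooking methods” on the same ingredients, the code below fits a straight line (simple linear regression) and a one-split decision tree (sometimes called a decision stump) to the same tiny made-up dataset, then compares their error on two held-out rows. The data is invented and deliberately close to a straight line, so the line is expected to win here; with differently shaped data the tree could do better.

```python
# Hypothetical data that is roughly linear (y is about 2 * x).
train_x = [1, 2, 3, 4, 5, 6]
train_y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]
test_x = [7, 8]
test_y = [14.1, 15.9]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_stump(xs, ys):
    """One-split decision tree: predict the mean of y on each side of a threshold."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((y - left_mean) ** 2 for y in left) + sum(
            (y - right_mean) ** 2 for y in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def mean_squared_error(model, xs, ys):
    """Average squared difference between predictions and known outputs."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


line_error = mean_squared_error(fit_line(train_x, train_y), test_x, test_y)
stump_error = mean_squared_error(fit_stump(train_x, train_y), test_x, test_y)
print("line error: ", line_error)
print("stump error:", stump_error)
```

Comparing models on rows they weren’t trained on is the important part: it’s the taste test, not the recipe, that tells you which method worked.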

Model Parameters -> Cooking settings

Now that you’ve chosen to bake your cookies, at what temperature do you bake them? And for how long? This depends on what you are trying to optimize for. Do the people eating the cookies like them more caramelized and don’t mind them being a little burnt around the edges? Or do they like them soft with no dark spots? Some of that is affected by the ingredients you put in, but it’s important to find a good oven temperature and cooking time combination. In machine learning, this is called “parameter tuning” or “hyperparameter optimization”. There are methods for working through different combinations of parameters (settings) to find the best one for the outcome you want to achieve. Note that if you change the ingredients (features), the settings that worked with your last batch of dough may not be the best for your new one. And a good temperature for frying might not be a good temperature for baking!
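One of the simplest of those methods is a "grid search": try every combination of settings and keep the one that scores best. Here is a small Python sketch of the idea. The `taste_score` function is entirely hypothetical (I'm assuming an audience that likes cookies baked at about 350 degrees for about 11 minutes); in real machine learning the score would come from evaluating a trained model.

```python
from itertools import product

# Hypothetical taste score: this audience likes cookies baked
# at about 350 degrees F for about 11 minutes. Higher is better.
def taste_score(temperature, minutes):
    return -abs(temperature - 350) / 25 - abs(minutes - 11)

temperatures = [325, 350, 375]
bake_times = [9, 11, 13]

# Grid search: score every (temperature, minutes) pair and keep the best.
best_settings = max(product(temperatures, bake_times),
                    key=lambda settings: taste_score(*settings))

print(best_settings)  # (350, 11)
```

If you swapped in a different `taste_score` (a different audience, or different dough), the same search could land on different settings, which is the point of re-tuning when your ingredients change.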

Also note that I mentioned you have to choose what to optimize for. If you have an audience that has different cookie preferences, you can tweak your model toward one or the other, or try to balance between the two. There are always human choices made in “machine learning”, and every choice is a trade-off! (The same goes for choosing which rows and columns to input into your model, not just parameters.)

Output Features / Target Variable -> Ideal Cooked Food Outcome

We talked about input features (ingredients), but why are we doing all this “cooking”? We’re trying to achieve some sort of target, which we mentioned a bit earlier when talking about parameter tuning. How do we know what target we’re trying to hit? In supervised machine learning, you have some rows where the output is known, so you can train your model. This is like having a photo of the finished product in a cookbook, and having tasted what a good outcome of the recipe tastes like. You have a target toward which you are trying to optimize. If you say it is “too salty”, you’re comparing the saltiness of your result to some ideal saltiness. That ideal is the “target”, and in supervised machine learning it is encoded in the “output feature” – a column in your spreadsheet that contains a value for each row that identifies how that row should be categorized by the resulting model, or what number the algorithm should generate.

Model Evaluation -> Tasting/Looking at Result, Comparing to Recipe Expectations

Once you’ve run your features through a model (cooked your ingredients with a particular method and settings), you need to evaluate how well the results turned out. In cooking, you’ll often have some taste-testers and you can also compare how the final plated dish looks when compared to the photo in the cookbook. As mentioned before, you can optimize for different things. You might make a dish that looks beautiful – exactly like the photo! – but tastes horrible. (Maybe they staged the photo with fake food and no one can make it look like that with the given recipe!)

So, if your “evaluation metric” of taste is way off, you go back and tweak your ingredients (features), ingredient prep (feature engineering), parameters (oven settings), and batch size (volume of data and which rows are included) until your outcome meets your chosen ideal. Or, if you think the outcome is just fine, it might be the taste testers (metrics) that need rethinking!

———————–

Types of machine learning algorithms.

These next few examples get away from the “recipe” example I’ve been using so far, but still use food to attempt to explain the concepts.

Imagine that you have a big bowl of trail mix. Your job is to teach a child to separate the trail mix into its components. This is like taking a big spreadsheet of data, and figuring out how to group the rows into categories. Each piece in the trail mix corresponds to one row in your spreadsheet.

You notice that the bowl of trail mix contains nuts, candy, and pieces of cereal. You point out each ingredient to the child, explaining “see, this is a peanut, a type of nut”, “this is an M&M, a type of candy”, tell the child to separate them out into piles, and walk away. (Don’t be surprised if some of your “data” is missing after this project!)

CLASSIFICATION

You come back and see that the child has created two piles: One has peanuts, cashews, and almonds in it. The other has M&Ms, chex, cheerios, and raisins in it. This child understood the task as a classification task, and acted as a “nut or not” classifier! He heard you describing the items as nuts, candy, or cereal, ignored the specifics, and decided to do a simple separation into 2 piles: one with nuts, one with non-nuts.

I could go into detail here about confusion matrices and evaluating this child’s performance, categorizing true positives (nuts correctly identified as nuts), false positives (non-nuts identified as nuts), true negatives (non-nuts identified as non-nuts), and false negatives (nuts identified as non-nuts), but getting into the details of these measures would make this article way longer than it needs to be, and I’m just trying to cover the high-level types of algorithms.
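For the curious, tallying those four counts only takes a few lines. This is a minimal Python sketch with made-up results for eight pieces of trail mix; the `actual` and `predicted` lists are invented for illustration.

```python
# Hypothetical results: what each piece really is vs. which pile it went in.
actual    = ["nut", "nut", "nut", "not", "not", "not", "not", "nut"]
predicted = ["nut", "nut", "not", "not", "nut", "not", "not", "nut"]

pairs = list(zip(actual, predicted))

true_positives  = sum(a == "nut" and p == "nut" for a, p in pairs)  # nuts in the nut pile
false_positives = sum(a == "not" and p == "nut" for a, p in pairs)  # non-nuts in the nut pile
true_negatives  = sum(a == "not" and p == "not" for a, p in pairs)  # non-nuts in the other pile
false_negatives = sum(a == "nut" and p == "not" for a, p in pairs)  # nuts in the other pile

# Accuracy: the share of pieces that ended up in the right pile.
accuracy = (true_positives + true_negatives) / len(pairs)

print(true_positives, false_positives, true_negatives, false_negatives)  # 3 1 3 1
print(accuracy)  # 0.75
```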

This “nut classification” algorithm worked because the child identified features of the items that made certain pieces nut-like (hard, brown, salty), and features that made them non-nut-like, and categorized them into piles based on how nut-like they were.

Even though you left the child unsupervised during this process, the type of machine learning is called “supervised learning”, because you first told the kid what each item was, so he had “labels” for what was a nut and what was not a nut, to learn which features went with each classification.

CLUSTERING

You call over another child. You decide not to tell her anything about the items in the bowl, but tell her that she needs to divide the items in the bowl into piles, hoping she’ll separate them into nuts, candy, and “other”, but not telling her that. You want her to do more than just the “nut or not” that the first kid did, so you tell her to divide them into at least 3 groups. Because you didn’t tell her information about how to categorize each item in the bowl, this is an example of “unsupervised learning” (even if you watch her do it).

She stares at the bowl for a while, and you leave her alone to do the task. When you come back, there are six piles. One has nuts, raisins, cereal, and brown M&Ms in it, and then there is a pile for each other color of M&Ms! She has “clustered” the items based on color similarity. She gave you more than 3 groups, but since you didn’t label the items, she picked color as what made the items most similar to one another. So you got a large brown pile that contained most of what was in the bowl (since most of the trail mix was brown), then small piles of blue, red, orange, yellow, and green M&Ms.
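Here is a rough Python sketch of grouping by color similarity. To be clear about the assumptions: the color values are invented, the `max_distance` cutoff is arbitrary, and the approach ("join the first pile whose first piece is close enough in color, otherwise start a new pile") is a deliberately simple stand-in for fuller clustering algorithms such as k-means.

```python
# Hypothetical pieces: (name, (red, green, blue)) with made-up color values.
pieces = [
    ("peanut", (150, 100, 60)),
    ("raisin", (90, 60, 50)),
    ("cheerio", (170, 120, 70)),
    ("brown M&M", (100, 60, 40)),
    ("blue M&M", (30, 80, 200)),
    ("red M&M", (230, 20, 30)),
    ("another blue M&M", (35, 85, 210)),
]

def color_distance(a, b):
    """Straight-line distance between two colors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cluster_by_color(items, max_distance=100):
    """Put each item in the first pile whose first piece is close in color,
    or start a new pile if no existing pile is close enough."""
    piles = []
    for name, color in items:
        for pile in piles:
            if color_distance(color, pile[0][1]) <= max_distance:
                pile.append((name, color))
                break
        else:
            piles.append([(name, color)])
    return piles

pile_names = [[name for name, _ in pile] for pile in cluster_by_color(pieces)]
for names in pile_names:
    print(names)
```

Notice that no labels ("nut", "candy") appear anywhere in the code. The piles come only from which pieces look alike, so the brownish items all end up together no matter what they actually are.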

REGRESSION

You decide to try the same exercise and call over a third child. This time, you just tell the kid to organize the bowl however she wants, and watch from a distance. To your dismay, she is licking every piece! You rush over and see that she has lined up all of the pieces on the ground. Most of the M&Ms are toward one side of the line, and as you go down toward the other end of the line, you see mostly nuts and cereal. “What were you licking them for?” you say. “Well duh, I had to see how sweet they each were!” she replies. She has sorted them in one dimension – testing only for sweetness, and ordering every piece into a line based on how sweet it was.

This isn’t the best example of regression – but the point is that you got a value as an output – sweetness level – as opposed to a category. It’s possible to use this sweetness level to then create categories – like creating a “candy” category by picking a certain sweetness level, and saying that anything above that is candy.

If you are given a new piece of trail mix, you could compare it to trail mix in the already-licked line, and probably make a good guess about how sweet it is based on how much it looks like the pieces in a particular part of the line.
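That "guess based on the most similar pieces" idea can be sketched in Python as a tiny nearest-neighbor regression. Everything here is an assumption for illustration: the "shininess" feature, the sweetness numbers, the choice of 2 neighbors, and the cutoff of 6 for calling something candy.

```python
# Hypothetical already-licked pieces: (shininess 0-10, measured sweetness 0-10).
licked = [(9, 9.0), (8, 8.0), (2, 1.0), (3, 2.0), (1, 1.5)]

def predict_sweetness(shininess, k=2):
    """Average the sweetness of the k licked pieces that look most similar."""
    nearest = sorted(licked, key=lambda piece: abs(piece[0] - shininess))[:k]
    return sum(sweetness for _, sweetness in nearest) / k

def is_candy(sweetness, threshold=6):
    """Turn the numeric output into a category with a cutoff."""
    return sweetness >= threshold

shiny_guess = predict_sweetness(8.5)  # looks like the M&M end of the line
dull_guess = predict_sweetness(2.5)   # looks like the nut and cereal end

print(shiny_guess, is_candy(shiny_guess))  # 8.5 True
print(dull_guess, is_candy(dull_guess))    # 1.5 False
```

The prediction itself is a number (regression). The candy/not-candy label only appears once you pick a threshold on top of it.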

I hope this helps clarify some machine learning terminology for people who like food metaphors! Add your own food-machine learning metaphors or ask questions in the comments!

Summer of Data Science 2017

Renee — Mon, 29 May 2017 05:12:25 +0000

Since Memorial Day in the U.S. is the unofficial start of the summer season, I figured today would be a good time to launch the SUMMER OF DATA SCIENCE 2017!!!

The Summer of Data Science is a commitment to learn something this summer to enhance your data science skills, and to share what you learned. (Those of you in the Southern Hemisphere will have to pick up the excitement when we’re winding down during our fall/your spring and keep it going! Or, join us during your Winter of Data Science!)

For those of you who haven’t been following me for years, a hashtag I started back in 2015, #SoDS, is actually one of the things that started growing my twitter following. Here’s the history:

Coming up with the hashtag

1st month of tweets, May 2015 Storified

June 2015 #SoDS Storify

Unfortunately, I didn’t keep up the ‘Storification’ after that, but you get the idea. It brought a bunch of us together to share our learning progress. We learned from each other, encouraged each other, and most of all geeked out about data science together!

I didn’t launch one last year, because I was starting a new job and taking a break from recording the podcast, and just didn’t want to take on too much. But I missed it, and I didn’t want to let another year pass without a Summer of Data Science, so we’re going to do it together again this year!

So, here are the only “rules”:

How to participate in the Summer of Data Science:

Pick a thing or a short list of things related to data science that you want to learn more about this summer.

Make a plan to learn it (like an online course, a practice project, etc.).

Share that plan on social media, then post updates as you make progress, with the hashtag #SoDS17.

That’s it! (And yes, there’s a chef competition that used the same hashtag. No worries! Enjoy the food pics.)

If you’re looking for ideas for learning projects or topics, check out the Data Science Learning Club! Please write about your learning experiences and share in the Data Science Learning Club #SoDS forum, and/or on your own blog, and share on social media. I’ll check out the hashtag on twitter regularly and RT others. I’ll be participating myself, too!

Here’s a link to the hashtag on twitter: #SoDS17. See you there!

P.S. Did you know that there is a “Summer of Data Sci” song? :D

P.P.S. There are now Summer of Data Science 2017 t-shirts and tanks in the Becoming a Data Scientist teespring shop!

UPDATE: Here is a twitter Moment with a selection of tweets from the #SoDS17 participants this year! (It starts out with a bunch of intro tweets from me, but click through to twitter and keep scrolling!)

Bias in Machine Learning Flipboard Magazine

Renee — Sat, 25 Mar 2017 21:54:57 +0000

Quick note for those of you who follow me on Flipboard. I added another one, seeded with links from my Challenge to Data Scientists article, on Bias in Machine Learning. Enjoy!

Becoming a Data Scientist Podcast Episode 16: Randy Olson

Renee — Wed, 22 Mar 2017 05:11:50 +0000

Renee interviews Randal S. Olson, Senior Data Scientist in the Institute for Biomedical Informatics at UPenn, about his path to becoming a data scientist, his interesting data science blog posts, and his work with non-data-scientists and students.

Podcast Audio Links:
Link to podcast Episode 16 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Podcast on iTunes

Podcast Video Playlist:
Youtube playlist of interview videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Data Science Learning Club Activity 16 – Genetic Algorithms
Data Science Learning Club Meet & Greet

Mentioned in the episode:

bytecode

Dr. Kenneth Stanley at the University of Central Florida

evolutionary algorithm

Michigan State University Artificial Intelligence

BEACON NSF Science and Technology Center at MSU

Randal S. Olson publications

Randy’s blog

Data Is Beautiful Reddit

traveling salesman problem

Google Maps API

Moneyball (book)

Data Science Handbook (book)

Weka

scikit-learn

version control

Randy on:
Twitter
LinkedIn
github
Patreon

Becoming a Data Scientist T-Shirts!

T-Shirts!!

Renee — Sat, 18 Feb 2017 20:55:46 +0000

The Becoming a Data Scientist tees are ready to sell! I ordered a couple myself before posting them for sale, to make sure the quality was good. They came out great!! And if you order from Teespring before ~~March~~April 1, 2017 using this link: Becoming a Data Scientist Store – Free Shipping, you’ll get free shipping on your order!

(Readers told me that the link above doesn’t discount at all for International shipping, so if you are outside the US, use this link for $3.99 off – equivalent to US Shipping cost)

The design is a combination of those submitted to our contest by Amarendranath “Amar” Reddy and Ryne & Alexis. You can see their design submissions and read more about them on the finalists post! They are each receiving prizes for being selected. Thanks Amar, Ryne, and Alexis for the awesome design!

There are a variety of styles and colors available. The Premium Tee is 100% cotton. The Women’s Premium is a 50/50 cotton/poly blend, and is cut to fit more snugly. They are available in navy blue, gray, purple, and black. There’s even a long-sleeve version!

I make anywhere from $2-$7 on each order (it’s print-on-demand, so not cheap enough for me to make a significant profit yet, and my proceeds will be lower with the free shipping offer, but I want to reward those of you who are excited to flaunt your Becoming a Data Scientist status!) and every dollar earned from these will be going to the fund that helps support my new small team of assistants, who you’ll meet soon! Also, the more of them I sell, the lower the cost to print is per shirt, so please share with all of your friends!

Here are photos of me wearing the shirt, but this was before I made the front design slightly smaller (so it doesn’t wrap into armpit), and I moved the back design slightly higher and also made the gray dots (data points?) transparent so the color of the shirt will show through there now (see store images above for current design). You can see that the teal came out as a lighter blue in printing. This is the “Premium Tee” style in “New Navy”.

Here’s a model wearing a simulated version of the shirt.

Order yours here, with Free Shipping Until March 1!

Update: Kids sizes now available, too! (the design is on the front for kids’ shirts)

Becoming a Data Scientist Podcast Episode 15: David Meza

Renee — Mon, 30 Jan 2017 04:41:47 +0000

David Meza is Chief Knowledge Architect at NASA, and talks to Renee in this episode about his educational background, his early work at NASA, and examples of his work with multidisciplinary teams. He also describes a project involving a graph database that improved search capabilities so NASA engineers could more easily find “lessons learned”.

Podcast Audio Links:
Link to podcast Episode 15 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Podcast on iTunes

Podcast Video Playlist:
Youtube playlist of interview videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Data Science Learning Club Activity 15 – Explain an Analysis (Communication)
Data Science Learning Club Meet & Greet

Mentioned in the episode:

NASA Knowledge (@NASAKnowledge on twitter)

Engineering Management
Knowledge Management
Organizational Learning
Knowledge Engineering
Information Architecture
Data Analysis

graph database

Neo4j
Elasticsearch
IHS Goldfire
MongoDB

JSC – Johnson Space Center

topic modeling

@davidmeza1 on Twitter
David Meza on LinkedIn

Southern Data Science Conference in Atlanta, GA on April 7, 2017 (Coupon code RENEE takes 15% off ticket price)

T-Shirt Contest Finalists

Renee — Tue, 17 Jan 2017 03:29:01 +0000

I still haven’t heard from one of the 3 finalists, but I wanted to go ahead and post the first two, and I’ll update here with the final one later. These finalists win a data science book and a t-shirt, and I’ll choose from the three (I’m actually considering combining elements from two of them!) and announce the final t-shirt design when they are available for sale.

Without further ado, the top 3 vote-winners after 94 votes, in no particular order, are….

1) Ryne

Ryne is an analyst for an economic consulting firm in Salt Lake City looking to transition to a career in data science. Lucky for him, his wife Alexis is a graphic designer and entered the competition in his name so he could win a book! She had a baby a few months ago and is just doing freelance design as time permits.

Alexis’ website is alexisbrittany.com. (Feel free to reach out if you need a designer!)

2) Amarendra

Amarendra (Amar) has recently moved to Philadelphia to pursue an MS in Business Intelligence and Analytics at Saint Joseph’s University. He believes in continuous learning and aspires to contribute to the field of data exploration. Upon graduation, Amar would like to work in the field of data visualization. He loves working with tableau and one of his resolutions for 2017 is to get appreciated by one of the Tableau Zen Masters.

3) coming soon

I’ll update here as soon as I hear back from the 3rd finalist! If I don’t hear back, I may give the prizes to the 4th place winner.

Amar and Ryne both selected the same data science book, so I thought I’d share that here for others that may be interested. They will both be receiving The Python Data Science Handbook by Jake VanderPlas:

You will hear about the final design soon, as well as options for purchasing t-shirts! Thanks to everyone who entered and voted!

Becoming a Data Scientist Podcast Episode 14: Jasmine Dumas

Renee — Wed, 11 Jan 2017 04:53:29 +0000

In this first episode of “Season 2” of Becoming a Data Scientist podcast, we meet Jasmine Dumas, a new data scientist who tells us about going from biomedical engineering into a data science project experience and then finding her first job as a data scientist.

Podcast Audio Links:
Link to podcast Episode 14 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Podcast on iTunes

Podcast Video Playlist:
Youtube playlist of interview videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Activity 14: Hidden Markov Models
Activity 15: Neural Nets for Text
Data Science Learning Club Meet & Greet

Mentioned in the episode:

Science Olympiad

#RStats on twitter

Hadley Wickham’s Advanced R book

Shiny

Survival Analysis

RStudio

shinyGEO: a web-based application for analyzing gene expression omnibus datasets

Google Summer of Code

RTalk Podcast

Simple Finance

Jasmine’s website on github

Jasmine’s projects

@jasdumas on Twitter

Vote for a T-Shirt design!

Renee — Sun, 08 Jan 2017 06:19:40 +0000

Which of these awesome designs would you like on your future Becoming a Data Scientist T-Shirt? You all will narrow it down to 3, then I’ll pick the final winner to be printed!

For the ones that aren’t t-shirt-print-ready, I’ll get a graphic designer to tidy them up, so don’t worry about whether they’re printable when you vote. We can also vary colors and things like that later. Just pick the design you like!

(original contest here)

click the images for larger views

dsnmizan

Venkat

Gordon

Kumar

Nigel

Ridhima

Ryne

Amarendra

––> VOTE HERE <––

(requires google login to prevent multiple-voting, but the voting is anonymous)

I’m hiring!

Renee — Thu, 05 Jan 2017 06:38:35 +0000

I need a part-time remote assistant to help keep my websites up to date, among other things!

Thanks to my generous Patreon supporters, I can hire someone to help me out 8-20 hours per month, paying $15/hr. More info and application form at this link.

Please let me know if you have any questions or if there are any problems with the form. Email me at renee@becomingadatascientist.com or tweet me at @becomingdatasci.

I look forward to reading the applications!!

Podcast Special Episode 2 – The Future of AI with Dr. Ed Felten

Renee — Thu, 29 Dec 2016 05:02:12 +0000

In this audio-only Becoming a Data Scientist Podcast Special Episode, I interview Dr. Ed Felten, Deputy U.S. Chief Technology Officer, about the Future of Artificial Intelligence (from The White House!).

You can stream or download the audio at this link (download by right-clicking on the player and choosing “Save As”), or listen to it in podcast players like iTunes and Stitcher. Enjoy!

Show Notes:

The White House Names Dr. Ed Felten as Deputy U.S. Chief Technology Officer

Edward W. Felten at Princeton University

Dr. Edward Felten on Wikipedia

White House Office of Science and Technology Policy (OSTP)

The Administration’s Report on the Future of Artificial Intelligence (White House Report from October 2016)

Artificial Intelligence, Automation, and the Economy (White House Report from December 2016)

Ed Felten on Twitter: Official / Personal

Freedom to Tinker blog

———

Other Podcasts in this Government Data Series:

Partially Derivative White House Special – with DJ Patil, US Chief Data Scientist
Not So Standard Deviations – Standards are Like Toothbrushes – with Daniel Morgan, Chief Data Officer for the U.S. Department of Transportation and Terah Lyons, Policy Advisor to the Chief Technology Officer of the U.S.
Linear Digressions – Data + Healthcare + Government = The Future of Medicine – with Precision Medicine Initiative researcher Matt Might
More to Come!

Becoming a Data Scientist Patreon Campaign

Support Becoming a Data Scientist!

Renee — Wed, 07 Dec 2016 21:34:37 +0000

I want to hire some people to help me update my websites more frequently, do the maintenance stuff, and to help edit the podcast so I can produce episodes more frequently.

I outlined my whole plan here on my Patreon Campaign. You’ll see a new page on this site soon acknowledging supporters, and I’ll update you on the progress.

Whether you can give financially, or even if you just share the campaign with your data science friends, you are helping Becoming a Data Scientist podcast, the learning club, Data Sci Guide, Jobs for New Data Scientists, and all of my websites get off the ground! Thank you!!

T-Shirt Design Contest!

Renee — Sat, 19 Nov 2016 18:13:29 +0000

I’ve decided that I want to have Becoming a Data Scientist t-shirts to sell and to give out to podcast guests and contest winners, but I am not a graphic designer, so I need some help! So I’m going to have a t-shirt design contest!

Here are the rules/guidelines for entry:

1. Create a design that prominently says “Becoming a Data Scientist” and can be easily scaled to fit in on the front or back of a t-shirt. If you create a design for the back, also create a small “pocket sized” text or design for the front of the shirt. But I don’t have a preference – front or back of shirt designs are both fine!

If it can be incorporated into the design without looking too cluttered, you can also add “Podcast and Learning Club” and/or “@becomingdatasci”, but that is not a requirement.

The design can be just text, text with an image, more abstract, use your imagination! As long as “Becoming a Data Scientist” is clearly readable, your design will be considered. Obviously, vulgar designs will not be considered, and I’ll also remove them (or anything spammy) from the comments on this post.

Please don’t use more than 2 colors in your design itself, as more can be cost-prohibitive. The background color can be a different 3rd color.

Here is a site with some guidance for preparing a design for t-shirts: https://gomedia.com/zine/tutorials/pro-tips-preparing-artwork-t-shirt-printing/
(Note – I don’t expect the design to be complex or super-artistic, and a text-only design could be created in a simple text or image editor!)

You can even use an online shirt design program like CustomInk, as long as the design can be extracted for use at another printing site (I’m not sure what the rules or capabilities of most of those t-shirt design websites are).

2. Please submit 2 files: your design itself (in a format that can be read on multiple platforms, like a PDF), large enough that it can be zoomed in to “life size”, and then also a smaller image of your design as you imagine it on a shirt – choose a shirt color and location for your design and create a little “shirt preview” image I can share with readers for the vote (this one doesn’t have to be zoomable to full size – the largest size I’d post it at is about 500×500). Please let me know if you have any questions or suggestions about this!

You can submit the files by posting a comment below. Don’t put your email in the text of the comment, I’ll be able to see it behind the scenes from the form. Make sure to include your name (it can be just your first name if you want) as you want it shared along with your design if you get selected for the voting round, along with the links to the 2 files and any link you want to point people to – your blog, your portfolio, your twitter account, etc.

UPDATE 12/5/2016: It looks like I made the turnaround time too short (I’d like to have at least 10 designs before the voting), and I know some people may have great ideas but not great graphic design skills, so I’m making 2 changes:

It doesn’t need to be a “print-ready” design. If you sketch it out and your design wins, I’ll get a graphic designer to help turn it into a file to send to the t-shirt printer
The deadline is now extended to the end of the calendar year (see below)

Thank you to those of you who have entered already!!

Here’s how the contest will go:

I’ll accept entries until 11:59pm Saturday, December 31, 2016. Over the next week, depending on how many entries there are, I’ll narrow down the selection to maybe 5-10 choices. I’ll create a blog post with the t-shirt images and names of the designers, with a way to vote on your favorite, and advertise it on @becomingdatasci to get as many votes as possible.

The top 3 vote-winners will win:

A t-shirt
A data science related book of their choice up to $60
Will be featured in a “finalists” blog post with their design, a little blurb about them, and a link to their website
A tweet with their design and a link to their site on my @becomingdatasci twitter account.

From the finalists, I’ll choose my favorite to be printed. The final winner will also get:

2-3 extra t-shirts with their design to give out to friends
Name credited as designer wherever the t-shirt is sold
Additional tweets with their design announced as the winner, with a link to their site, on @becomingdatasci.
A shout-out on the Becoming a Data Scientist podcast

Please let me know if there’s anything I forgot to detail here or if you have any questions! I look forward to seeing the submitted designs!!

Update: I should probably mention that any proceeds from sales of the shirts will go to support the maintenance and creation of more content at this site BecomingADataScientist.com, the podcast, the Data Science Learning Club, DataSciGuide, and my other data science sites and social media accounts. I’ll be posting a Patreon campaign soon to raise money to hire help to keep these sites updated, and money I earn from selling t-shirts will go toward that as well.

Becoming a Data Scientist Podcast Special Episode

Renee — Mon, 14 Nov 2016 03:59:06 +0000

The hosts of Becoming a Data Scientist podcast, Partially Derivative podcast, Adversarial Learning podcast, and some other awesome data people that do elections forecasting for their day jobs joined together for this talk about the US election and the subsequent major questions surrounding the predictions, since basically all of them heavily leaned toward a different overall outcome than we got. If you’re interested at all in data science surrounding political campaigns, this episode is a must-listen!

Episode Audio (mp3) – also available on iTunes, Stitcher, etc.
(note, there is no video for this episode)

On the panel:

Chris Albon (Twitter) (Website)
Joel Grus (Twitter) (Website)
Natalie Jackson (Twitter) (Website)
Jonathon Morgan (Twitter) (Website)
Andrew Musselman (Twitter)
Mark Stephenson (Twitter) (Website)
Renee Teate (Twitter) (Website)
Andrew Therriault (Twitter) (Website)

PyData DC 2016 Talk

Renee — Tue, 11 Oct 2016 05:00:36 +0000

I just got back from PyData DC, where I learned a lot, had fun, and met a bunch of awesome people! I’ll definitely write about it more later, but I wanted to share my slides here since I told the attendees they could find them on my website. I got good feedback on the talk, and I’m so glad that my message resonated with some people!

The talk was recorded and video should be out within a few weeks!

Here are the slides: Becoming a Data Scientist – Advice from my Podcast Guests
and the slide notes.

Update 10/26: Here is the recording of my talk, with a playlist of other talks from PyData DC!

“Becoming a Data Scientist” Survey Results 1: Jobs & Education

Renee — Mon, 22 Aug 2016 05:22:30 +0000

Here is the 1st batch of results from the Becoming a Data Scientist Survey. Because of the sample size and unscientific casual nature of this survey, we can’t make any broad generalizations about the industry from these results, but you can see some general preliminary trends in the breakdowns that would be interesting to study more.

95% of the 158 respondents who gave answers about their jobs follow at least one of my twitter accounts, so you can think of these results as representing my twitter followers.

I’m mostly going to let the tables and graphs speak for themselves. There’s a lot to unpack in the survey (over 50 questions), and a ton of ways to slice and dice it. Because I wanted to show various breakdowns, I did this in simple Excel pivot tables and didn’t spend a ton of time formatting (I know they’re kinda ugly – sorry).

If any of the tables or charts below are particularly hard to understand, please let me know and I’ll annotate further or improve! Click to view full-size.

Educational Field
vs Job Title of Data Analyst or Data Scientist (yes/no)

Computer Scientist or Data Sci / Statistics / Analytics Degree
vs Job Title of Data Analyst or Data Scientist (yes/no)

30 respondents (19%) had CS degrees, which was the most common degree name, and 27% of those 30 had a job title that I categorized under Data Analyst or Data Scientist.

8.3% of respondents had Data Science, Statistics, or Analytics degrees, and 38.5% of them had job titles that I categorized as Data Analyst/Scientist.

Educational Level
vs Considers Self Data Scientist (No/Not Yet or Yes)

Considers Self Data Scientist by Educational Level and Field

Educational Field by Considers Self Data Scientist

Respondent Breakdown by Job Industry

Visual of those with Science/Tech/Eng Degrees and Math/Stats/Acctg Degrees

Percentage of Respondents by “Considers Self Data Scientist” with each Salary Range

About 20% (28) of those that do not consider themselves to be Data Scientists make $50-75K. About 17% (7) of those that do consider themselves to be Data Scientists make under $25K.

Because the numbers above looked surprising to me, I filtered the same table to those in North America (assuming that many of the low-salary data scientists might be outside of the U.S.).

In North America, about 21% (22) of those that do not consider themselves to be Data Scientists make $50-75K. But now 25% (5) of those that do consider themselves to be Data Scientists make $125-150K. This also shows that 22 of the 42 people above (52%) that considered themselves Data Scientists live outside of North America.
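For anyone who wants to follow the arithmetic behind that last sentence, here it is as a short Python check. It uses only the counts quoted above and assumes the 25% figure is close to exact; the variable names are mine.

```python
# Counts quoted in the text above.
data_scientists_all = 42        # respondents who consider themselves data scientists
north_america_at_125_150k = 5   # stated as 25% of the North American data scientists

# If 5 people are 25% of the North American group, that group has about 20 people.
north_america_total = round(north_america_at_125_150k / 0.25)

# Everyone else who considers themselves a data scientist lives elsewhere.
outside_north_america = data_scientists_all - north_america_total
percent_outside = round(outside_north_america / data_scientists_all * 100)

print(north_america_total, outside_north_america, percent_outside)  # 20 22 52
```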

Bar Charts of Above Data. Top = Data Scientists, Bottom = Not Data Scientists.

I may release some of this data at the end of my analysis. If you’re interested in diving in yourself, let me know in the comments and I’ll work on it for a future post.

More results on the way in the near future! Next up, how many respondents follow each twitter account and whether more people watch or listen to the podcast!

Podcast Episodes 0 to 3

Renee — Sat, 13 Aug 2016 18:59:25 +0000

It’s been brought to my attention that iTunes only shows the last 10 episodes of the Becoming a Data Scientist Podcast. If you haven’t seen/heard episodes 0-3, you can watch the interviews on the YouTube channel:

Becoming a Data Scientist Podcast Interviews YouTube Playlist

or listen to/download the full audio episodes via the blog. Here are the links to the blog posts (with links to everything else), and the audio itself for those first four episodes:

Episode 0: Renee Teate (me) and intro to the podcast
Audio Only (with MP3 download link)

Episode 1: Will Kurt – English/Library Science to Data Science
Audio Only (with MP3 download link)

Episode 2: Safia Abdalla – College Student, Conference Speaker, Python/Jupyter Contributor
Audio Only (with MP3 download link)

Episode 3: Shlomo Argamon – Director of Master of Data Science program at IIT
Audio Only (with MP3 download link)

Click through to any episode for links to the RSS subscription feeds, links to the learning club activities, etc.

Enjoy!

Becoming a Data Scientist Survey

Renee — Sun, 07 Aug 2016 20:33:25 +0000

I am collecting information about my “audiences” so I can improve my websites and podcast, and also formulate a plan for a Patreon campaign to generate funds for getting help and to free myself up to create more content.

Please fill out the survey and share it with your friends and followers on social media! The survey is a little long/detailed, but most of it is optional. I value your opinions! Thank you so much for participating!!

Link to the Becoming a Data Scientist Survey

Boosting (in Machine Learning) as a Metaphor for Diverse Teams

Renee — Sun, 07 Aug 2016 00:42:08 +0000

Note – I wrote this article in one sitting, and definitely want to come back later to improve it and add references, but I don’t want to hold it up from being published just because I’m hungry for dinner. :) So I’m hitting publish, but please be aware that the content may change later. And feel free to give suggestions in the comments. -Renee

tl;dr: Boosting ensemble algorithms in Machine Learning use an approach that is similar to assembling a diverse team with a variety of strengths and experiences. If machines make better decisions by combining a bunch of “less qualified opinions” vs “asking one expert”, then maybe people would, too.

Why is this post on this blog?

I’ve been thinking a lot about diversity in tech lately. After the #FBNoExcuses conversations on twitter, I was motivated to start UntappedPipeline.com (on twitter:@untappdpipeline) because I know so many awesome women and people of color in tech, and it amazes me that some companies seem to think they are so rare and hard to find (hence using the “pipeline problem” as an excuse for not having a diverse workforce).

Of course, “diversity” can mean a lot of things: gender diversity, racial/ethnic diversity, diversity of educational backgrounds, etc. – but it all really comes down to diversity of culture/thoughts. If you are interested in learning more about diversity in tech and the benefits of hiring diverse teams, check out the Resources page on UntappedPipeline.com. Here’s one study in particular that highlights the economic benefits of a diverse tech workforce: Decoding Diversity (Dalberg/Intel, PDF).

So if this post is about diversity, why am I writing it on my “Becoming a Data Scientist” blog instead of at Untapped Pipeline? Because I’ve been doing some machine learning lately that I just realized is a great metaphor for the benefits of hiring diverse teams, and this is also an opportunity to explain a data science concept. Additionally, I’ve been talking about data science teams for a long time, and one of my motivations for starting the Becoming a Data Scientist podcast was to feature the many paths people take to data science, because data science in itself is an interdisciplinary job that requires a variety (diversity) of experience: primarily in statistics, computer programming, and “business” (or domain knowledge), but really there is a very broad set of skills that come into play when doing data science, and no one has them all (relevant rant by @quominus). I will list some references about building Data Science teams at the end of the post.

There are two machine learning algorithms I’ve used recently that illustrate the specific “ensemble learning” concept I want to focus on: Random Forest and Gradient Boosting Classifiers.

Random Forest

A Random Forest Classifier works like this: You may have heard of Decision Trees, which are pretty much just “if then” classifiers that end up generating a set of rules: “If Attribute 1 is in this range of values, and Attribute 2 is this boolean value, and Attribute 3 is greater than this value, then out of all of the possible results, this one is most likely.” Every example you feed into it will drop into one of the possible outcomes, with a certain probability of being correct. The article “A Visual Introduction to Machine Learning” has a great animated illustration of how decision trees work.
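
To make the “if then” idea concrete, here is a tiny hand-written sketch in Python of what a trained decision tree boils down to. The attribute names, thresholds, outcome labels, and probabilities are all made up for illustration; a real tree learns them from the training data.

```python
def tiny_decision_tree(attribute_1, attribute_2, attribute_3):
    """Return (predicted_outcome, estimated_probability) for one example.

    Every value below is hypothetical. A real decision tree would learn the
    split points and the probabilities from training data.
    """
    if 10 <= attribute_1 < 20:          # "Attribute 1 is in this range of values"
        if attribute_2:                 # "Attribute 2 is this boolean value"
            if attribute_3 > 5.0:       # "Attribute 3 is greater than this value"
                return ("outcome A", 0.85)
            return ("outcome B", 0.70)
        return ("outcome C", 0.60)
    return ("outcome D", 0.55)


# Every example drops into exactly one of the possible outcomes
print(tiny_decision_tree(15, True, 7.2))
print(tiny_decision_tree(42, False, 1.0))
```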

What a Random Forest does is build up a whole bunch of “dumb” decision trees by only analyzing a subset of the data at a time. A limited set of features (columns) from a portion of the overall records (rows) is used to generate each decision tree, and the “depth” of the tree (and/or size of the “leaves”, the number of examples that fall into each final bin) is limited as well. So the trees in the model are “trained” with only a portion of the available data and therefore don’t individually generate very accurate classifications.

However, it turns out that when you combine the results of a bunch of these “dumb” trees (also known as “weak learners”), the combined result is usually even better than the most finely-tuned single full decision tree. (So you can see how the algorithm got its name – a whole bunch of small trees, somewhat randomly generated, but used in combination is a random forest!)

When you combine multiple methods to determine a result in machine learning, it is called ensemble learning. When you train a bunch of weak learners independently on random samples of the data and combine them by having them vote on the outcome (whichever outcome occurs most often for a particular record wins) or by averaging their results, that is called bagging, and it is what a Random Forest does. When you instead build the weak learners one after another, with each new one trying to improve on the mistakes of the ones before it, so that the combined result is stronger than any single-pass approach, that is called boosting.
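
Here is a minimal Python sketch of why voting helps. It is not a real Random Forest: it just simulates a set of weak learners that are each right 60% of the time on a two-outcome problem, and it assumes their mistakes are independent of one another (real trees trained on overlapping data are only partly independent). The function names and the numbers (25 learners, 60% accuracy, 2000 records) are made up for illustration.

```python
import random


def weak_learner_is_correct(rng, accuracy):
    """Simulate one weak learner's guess: True if it classified the record correctly."""
    return rng.random() < accuracy


def simulate(num_learners=25, accuracy=0.6, num_records=2000, seed=0):
    """Compare one weak learner on its own against a majority vote of all of them."""
    rng = random.Random(seed)
    single_correct = 0
    ensemble_correct = 0
    for _ in range(num_records):
        votes = [weak_learner_is_correct(rng, accuracy) for _ in range(num_learners)]
        single_correct += int(votes[0])                        # first learner, working alone
        ensemble_correct += int(sum(votes) > num_learners / 2)  # majority of learners were right
    return single_correct / num_records, ensemble_correct / num_records


single_accuracy, ensemble_accuracy = simulate()
print(f"one weak learner: {single_accuracy:.2f}")
print(f"majority vote of 25 weak learners: {ensemble_accuracy:.2f}")
```

Under those assumptions the vote should come out noticeably more accurate than any single learner, which is the effect the paragraph above describes.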

Gradient Boosting

Gradient Boosting is a boosting algorithm that also builds its model out of many small decision trees. The approach looks similar to the Random Forest algorithm, except (as far as I understand it) the trees are built one after another rather than independently, and it uses a type of optimization related to Gradient Descent, which minimizes a loss function. Basically, each time it generates a decision tree, it’s using what it learned from the trees it has already generated to make the combined model a little less bad (reducing cost/loss).
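
The “a little less bad each time” idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the scikit-learn implementation: it does regression with squared-error loss instead of classification, each “tree” is a one-split stump, and the data, function names, and settings (20 rounds, learning rate 0.5) are made up for the example.

```python
def fit_stump(xs, residuals):
    """Find the one-split tree (a stump) that best fits the current residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:      # leave out the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, rounds=20, learning_rate=0.5):
    """Build stumps one after another, each fit to what the model still gets wrong."""
    base = sum(ys) / len(ys)                    # start by guessing the average
    predictions = [base] * len(xs)
    stumps, losses = [], []
    for _ in range(rounds):
        # For squared-error loss, the residuals point in the direction that reduces the loss
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return base, stumps, losses


# Made-up data for illustration
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 2, 2, 3, 5, 8, 9, 9, 12, 15]
base, stumps, losses = gradient_boost(xs, ys)
print(f"loss after round 1: {losses[0]:.3f}")
print(f"loss after round {len(losses)}: {losses[-1]:.3f}")
```

The training loss shrinks from round to round because every new stump is fit to the errors left over by the stumps before it.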

I’m not going to go too far into the technical details here, partly because that’s not really the point of this article, and partly because I start having flashbacks to my Optimization class in grad school, which was not a happy experience.

There are other boosting algorithms like Adaptive Boosting (AdaBoost) and other ensemble methods in machine learning to explore. Some of them are described in the scikit-learn (machine learning python package) documentation if you want to learn more.

So back to my original point

Now that I have the explanations out of the way, I can get back to the point I wanted to make. Notice how these algorithms work. Each “weak learner” only has some of the information needed to make a good guess to classify something. In fact, they’re often incorrect on their own, because they just don’t have the experience that a big “solo” algorithm has with the data. (Imagine a bunch of weird-looking small trees vs one big gorgeous well-developed oak tree.) However, when you combine the wide variety of partial experiences that the “weak learners” provide, their combined guess turns out to actually be better than the guess made by the one big fully-formed tree.

Many Small Trees (young forest growth via The Young Forest Project)

One Big Fully-Formed Tree (Keeler Oak Tree by Msact via Wikimedia Commons)

So, think of the same concept for building a Data Science Team. There is currently a shortage of “experts” in data science, and most companies don’t seem to know what kind of data scientist they need anyway. Some companies are lamenting the lack of “qualified” data scientists (i.e. “unicorns” that have all of the necessary skills and experience already), while in the meantime, there are plenty of business analysts, software developers, UX designers, subject matter experts, people that do similar work in other fields (like biotechnology, cognitive science, etc etc etc), and people that are on their way to becoming data scientists and only have a portion of the requisite skills and knowledge. Some companies are just “pattern matching” and trying to hire people that are exactly like their existing successful employees (though it’s unlikely they have even defined what they mean by “successful employees”).

However, if you know how to find creative and motivated “go-getters” that want to learn on the job and contribute to a team, and each person on the team has a portion of the needed experience and skills, there is a good chance that, in combination (if given good support and resources!), a small group of “junior data scientists with other relevant skills” will actually turn out better than the one or two “experts” you might have hired in the first place. Plus, they cost less. Plus, they are likely very trainable. Plus, they really want to make a difference and prove themselves as capable data scientists.

Anyway, I’m making a lot of generalizations here, and need to go back and fill in some of my comments in this last section with references, but you can see what I’m getting at. For those in tech that have a really hard time believing that a “scrappy new business analyst” with a non-terminal non-computer-science degree from a non-ivy-league school, who doesn’t have many years of experience doing the exact kind of work you want them to do at your company, could be “qualified” to fill a Data Science position, maybe it will help to think of the problem as one that Boosting will solve. Create an “ensemble” of “learners” that may individually only have a subset of the experience and may be self-taught and not do everything the “right” way when tasked to do it alone, but can each contribute their wide variety of experiences and skills to come up with a final solution as a team. I’m willing to bet that the solutions generated by a diverse group with less-than-ideal credentials (but a wider breadth of experience) will turn out better than what an “expert” (or homogeneous group) would come up with on their own anyway, because the research shows that they usually do. Just do the math.

[more links about building data science teams to be added]

Becoming a Data Scientist Podcast Episode 13: Debbie Berebichez

Renee — Fri, 15 Jul 2016 02:52:29 +0000

In this interview, we meet physicist Debbie Berebichez, who you might recognize from her TEDx talks, her appearances in Discovery Channel’s Outrageous Acts of Science and other TV shows! Debbie grew up in Mexico City and was discouraged by her family and teachers from studying science, but later went on to become the first Mexican woman to get a PhD in physics from Stanford, and is now Chief Data Scientist at Metis Data Science Bootcamp in New York.

Podcast Audio Links:
Link to podcast Episode 13 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Podcast on iTunes

Podcast Video Playlist:
Youtube playlist of interview videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Learning Club Activity 13: Show & Tell
Data Science Learning Club Meet & Greet

Links to topics mentioned by Debbie in the interview:
Metis Data Science Training
[more coming soon]

Becoming a Data Scientist Podcast Episode 12: Data Science Learning Club Members

Renee — Wed, 15 Jun 2016 05:08:12 +0000

Verena, David, Kerry, and Anthony are members of the Becoming a Data Scientist Podcast Data Science Learning Club! They appear in the order in which they joined the club, and each discusses their starting point before joining, their participation in the activities, and advice they have for new data science learners.

Podcast Audio Links:
Link to podcast Episode 12 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Podcast on iTunes

Podcast Video Playlist:
Youtube playlist of interview videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Data Science Learning Club Meet & Greet

1) Verena Haunschmid

bioinformatics

R Markdown
ggplot2
jupyter

Data Science Learning Club Activity 07: Linear Regression
Verena’s Results for Linear Regression on Salary Dataset

Verena’s website
@ExpectAPatronum on Twitter

GPS Cat Tracking Project

2) David Asboth

business intelligence

SQL

City University London MSc Data Science

Coursera
Udacity
Khan Academy

Data Science Learning Club Activity 02: Creating Visuals for Exploratory Data Analysis
David’s results exploring London Underground data

Data Science Learning Club Activity 06: K-Means Clustering
David’s results using k-means to draw puppies in 3 colors

FlyLady (the house cleaning system I mentioned)

David’s website
@davidasboth on Twitter

3) Kerry Benjamin

Data Science Learning Club Activity 01: Find, Import, and Explore a Dataset
Kerry’s results for Activity 1 IGN Game Review Data exploration

Data Science Learning Club Activity 02: Creating Visuals for Exploratory Data Analysis
Kerry’s Blog Post about Activity 02 – “My First Data Set Part 2: The Fun Stuff”

ggplot2
dplyr
XLConnect

Blog post about Data Camp – “The Data Science Journey Begins”

Sharp Sight Labs

Kerry’s blog post “Getting Started in Data Science: A Beginner’s Perspective”

#Rstats (twitter hashtag)

Kerry’s Blog “The Data Logs”
@kerry_benjamin1 on Twitter

4) Anthony Peña

molecular biology
biotechnology

Data Science Learning Club Activity 06: K-Means Clustering
Anthony’s results for Activity 06

ggplot2
tidyr
dplyr

R bloggers

Anthony’s website
@agpena_ on Twitter

Becoming a Data Scientist Podcast Episode 11: Stephanie Rivera

Renee — Tue, 31 May 2016 01:15:28 +0000

Stephanie Rivera has worked in machine learning and data science for academic research (at University of Tennessee), for the government (Department of Defense), for a large consulting firm (Booz Allen), and now for a startup (MyStrength). In the interview, she discusses her career path, her experiences with mentorship, and her role in authoring The Field Guide to Data Science and the Explore Data Science online course.

Podcast Audio Links:
Link to podcast Episode 11 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Podcast on iTunes

Podcast Video Playlist:
Youtube playlist of interview videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
[learning club activity coming soon]
Data Science Learning Club Meet & Greet

Links to topics mentioned by Stephanie in the interview:

machine learning

Odyssey of the Mind

Graph Theory

Total Domination in Graph Theory (pdf)

Some research publications by Stephanie:
Machines Watch you Surf the Web
Total domination dot-stable graphs

The University of Tennessee Knoxville Center for Intelligent Systems and Machine Learning (CISML)

Reinforcement Learning

Connect Four (game)

UTK Distributed Intelligence Laboratory

MATLAB

UTK Infant Perception Action Laboratory

“teach a man to fish” proverb

pattern recognition

Booz Allen Data Science

Natural Language Processing (NLP)

Explore Data Science (now via Metis)

Code School

Field Guide to Data Science

MyStrength (@mystrengthbh on twitter)

DataKind

Stephanie on Twitter @dataginjaninja

Becoming a Data Scientist Podcast Episode 10: Trey Causey

Renee — Sun, 01 May 2016 05:16:38 +0000

Trey Causey is a data scientist with a background in psychology and sociology who, like Renee, is from Virginia. He has worked as a data scientist at a range of companies from zulily to ChefSteps, and has also developed some interesting sports analytics projects, including the New York Times 4th Down bot. Trey also has advice for people wanting to start a career in data science.

Podcast Audio Links:
Link to podcast Episode 10 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Podcast on iTunes

Podcast Video Playlist:
Youtube playlist of interview videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
[learning club activity coming soon]
Data Science Learning Club Meet & Greet

Links to topics mentioned by Trey in the interview:

Commodore VIC-20
Bulletin Board
C++
Pascal
BASIC

Virginia Tech
Odyssey of the Mind

University of Washington Sociology

Complexity Theory and organizations

[more links to come! …sorry for all of the delays on getting this episode out! -Renee]

treycausey.com
@treycausey

Becoming a Data Scientist Podcast Episode 09: Justin Kiggins

Renee — Tue, 12 Apr 2016 03:29:02 +0000

Justin Kiggins, who calls himself a “full stack neuroscientist” talks to Renee about how he started as a musician majoring in music therapy, switched to mechanical engineering, and eventually made his way via biomedical engineering and neuroscience to study auditory perception and the brains of communicating birds.

Podcast Audio Links:
Link to podcast Episode 9 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Podcast on iTunes

Podcast Video Playlist:
Youtube playlist of interview videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Learning Club Activity 9: Normalization [coming soon]
Data Science Learning Club Meet & Greet

Links to topics mentioned by Justin in the interview:

European Starling

video of starling singing
European Starling song file from Justin [1 min wav]

bird song recursive syntactic structure

Zebra Finch song

spectral analysis [pdf]

Neuron

brain electrodes

numpy
spark
thunder
pandas

birth doula

Allen Institute

Jobs for New Data Scientists website mentioned by Renee after interview

Becoming a Data Scientist Podcast Episode 08: Sebastian Raschka

Renee — Tue, 29 Mar 2016 03:01:00 +0000

Renee interviews computational biologist, author, data scientist, and Michigan State PhD candidate Sebastian Raschka about how he became a data scientist, his current research, and about his book Python Machine Learning. In the audio interview, Sebastian also joins us to discuss k-fold cross-validation for our model evaluation Data Science Learning Club activity.

Podcast Audio Links:
Link to podcast Episode 8 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Podcast on iTunes

Podcast Video Playlist:
Youtube playlist of interview videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Learning Club Activity 8: Evaluation Metrics [coming soon]
Data Science Learning Club Meet & Greet

Links to topics mentioned by Sebastian in the interview:

computational biology

molecular docking

Protein-ligand docking

DNA -> RNA -> protein

protein signaling pathways

ligand and binding affinity

sea lamprey

pheromone

SiteInterlock project

Neural Network

Random Forest

Sebastian’s Python Machine Learning repository on GitHub

Python Machine Learning Book on DataSciGuide

scikit-learn – Voting Classifier

softmax regression

stochastic gradient descent

multilayer perceptron

logistic regression (from Sebastian’s github)

regularization in logistic regression (from Sebastian’s github)

Keras deep learning library

@rasbt on Twitter
Sebastian Raschka on Quora

Sebastian’s book on Amazon:

Becoming a Data Scientist Podcast Episode 07: Enda Ridge

Renee — Tue, 15 Mar 2016 05:47:38 +0000

Data scientist, author, and manager of data science teams Enda Ridge talks to us about data governance, data provenance, reproducible analysis, work pipelines and products, and people, among other topics covered in his book “Guerrilla Analytics – A Practical Approach to Working with Data: The Savvy Manager’s Guide”.

Podcast Audio Links:
Link to podcast Episode 7 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Podcast on iTunes

Podcast Video Playlist:
Youtube playlist of interview videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Learning Club Activity 7: Linear Regression [coming soon]
Data Science Learning Club Meet & Greet

More show notes coming soon!

@enda_ridge

Enda’s book on Amazon:

Becoming a Data Scientist Podcast Episode 06: Erin Shellman

Renee — Mon, 29 Feb 2016 20:09:55 +0000

In this episode, Renee interviews Bioinformatics PhD and Data Scientist Erin Shellman about her path to becoming a data scientist, including jobs at Nordstrom Innovation Lab and zymergen. Erin discusses school, job interviews, teaching, and eventually getting to do data science within her field of scientific expertise.

Podcast Audio Links:
Link to podcast Episode 6 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Podcast on iTunes

Podcast Video Playlist:
Youtube playlist of interview videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Learning Club Activity 6: k-Means Clustering [coming soon]
Data Science Learning Club Meet & Greet

Bioinformatics
Evolutionary Biology
Economics Game Theory
Machine Learning
Biostatistics
Information Science
Systems Biology
Systems Modeling
Comparative Genomics

Human Genome Project

NIH Computational Biosciences

Data Scientists at Work

Nordstrom Innovation Lab (old innovation lab links inactive – appears to be the Nordstrom Technology People Lab now)

Recommender System

million song dataset

Jim Vallandingham (d3)

Crushed It! Landing a Data Science Job

zymergen

University of Michigan Computational Medicine and Bioinformatics

high throughput assays

R
dplyr
ggvis
ggvis interactive controls
ggplot2
R Markdown
Hadley Wickham

Amazon Web Services
AWS S3

Elements of Statistical Learning book

BI Tech CP303 (course Erin taught at University of Washington – use arrow keys to go through slides)
GitHub repository for class
regression
classification – logistic regression, trees
market basket analysis
clustering
UW Business Intelligence Certification

Erin’s website ErinShellman.com
@erinshellman

Data Science Learning Club Update

Renee — Sun, 21 Feb 2016 04:57:51 +0000

For anyone that hasn’t yet joined the Becoming a Data Scientist Podcast Data Science Learning Club, I thought I’d write up a summary of what we’ve been doing!

The first activity involved setting up a development environment. Some people are using R, some using python, and there are several different development tools represented. In this thread, several people posted what setup they were using. I posted a “hello world” program and the code to output the package versions.

Activities 1-3 built upon one another to explore a dataset and generate descriptive statistics and visuals, culminating with a business Q&A:

Activity 1 – Find & Explore a Dataset
Activity 2 – Visuals for Exploratory Data Analysis
Activity 3 – Business Questions & Data Answers

I analyzed a subset of data from the eBird bird observation dataset from Cornell Ornithology for these activities. Some highlights included:

– Learning how to use the pandas python package to explore a dataset (code)

– Learning how to create cool exploratory visuals in Seaborn and Tableau. Here is an example scatterplot matrix made in Seaborn:

– I was most excited to learn how to build interactive Jupyter Notebook inputs, which I used to control Bokeh data visualizations to display Ruby-Throated Hummingbird migration into North America (notebook). Unfortunately, until I host them on a server where you can run the “live” version, you won’t be able to see the interactive widgets (a slider and dynamic dropdowns), but you can see a video of the slider working here:

Here’s my final output for Activity 3, a Jupyter Notebook (with code hidden, and unfortunately interactive widgets disabled) with the Q&A about the hummingbird migration:
Ruby-Throated Hummingbird Migration into North America

Activity 4 was built as a catch-up week for those of us who were behind, but had some ideas of math concepts to learn for those who had time.

We’re currently working on Activity 5, our first machine learning activity where we’re implementing Naive Bayes Classification.

All of my work is available in this github repository: https://github.com/paix120/DataScienceLearningClubActivities

I strongly encourage you to click through the forums and look at some of the other data explorations the members have been doing, including analysis of NFL data, personal music listening habits, transportation in London, German Soccer League data, top-grossing movies, and more!

It’s never too late to join the Data Science Learning Club! If you aren’t sure where to start, check out the welcome message for some clarification.

I’ll post again when I complete some of the machine learning activities!

Becoming a Data Scientist Podcast Episode 05: Clare Corthell

Renee — Mon, 15 Feb 2016 04:13:03 +0000

Renee Teate interviews Clare Corthell, founding partner of summer.ai (now Luminant Data) and creator of the Open Source Data Science Masters curriculum, about becoming a data scientist.

Podcast Audio Links:
Link to podcast Episode 5 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Podcast on iTunes

Podcast Video Playlist:
Youtube playlist of interview videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Learning Club Activity 5: Naive Bayes Classification
Data Science Learning Club Meet & Greet

Resources/topics mentioned by Clare in the interview:

Management Science and Engineering
Markov Chains
Science, Technology, and Society at Stanford

A Challenge to Data Scientists (blog post Renee mentioned)

Mattermark
Product Management
Machine Learning

Open Source Data Science Masters
Nate Silver’s book The Signal and the Noise

Linear Algebra (on Khan Academy)

Bill Howe’s Introduction to Data Science Coursera Course

Recurrent Neural Nets
Bayesian Networks

python

Google Prediction API

data cleaning

Open Source Data Science Masters on GitHub (pull requests welcome!)

summer.ai (Update 2/15 – Clare’s company is now Luminant Data, Inc.)
@ClareCorthell on twitter

Becoming a Data Scientist Podcast Episode 04: Sherman Distin

Renee — Tue, 02 Feb 2016 05:48:19 +0000

In Episode 4 of the Becoming a Data Scientist Podcast, we meet Sherman Distin, owner of analytics consulting firm QueryBridge. We discuss his primarily self-taught path to learning the data science techniques he uses to find business insights in marketing data, and he also tells us what he thinks is the most important trait he looks for in data scientists.

Podcast Audio Links:
Link to podcast Episode 4 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Podcast on iTunes

Podcast Video Playlist:
Youtube playlist of interview videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Learning Club Activity 4: Learn a New Math Concept [to be posted Tuesday]
Data Science Learning Club Meet & Greet

Resources/topics mentioned by Sherman in the interview:

QueryBridge (Sherman’s business)

linear regression

Target Pregnant Customer story

Excel Solver

EBITA

Survival Analysis
Proportional Hazards Model

Analytics Vidhya

@ShermanDistin on Twitter

Six Sigma

Econometrics

Sherman Distin on Facebook and LinkedIn

Becoming a Data Scientist Podcast Episode 03: Shlomo Argamon

Renee — Mon, 18 Jan 2016 06:08:17 +0000

Note: The video is the interview only. The audio podcast has the intro, interview, and data science learning club activity explanation.

In Episode 3 of the Becoming a Data Scientist Podcast, we meet Shlomo Argamon, who is the founding director of the Master of Data Science program at Illinois Institute of Technology. He talks to us about his path to data science, including research in robotic vision and natural language processing, we discuss the traits of a good data science student, and he gives some advice for those of us learning data science.

Podcast Audio Links:
Link to podcast Episode 3 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Update 1/19: You should be able to find it on iTunes now!

Podcast Video Playlist:
Youtube playlist of interview videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Learning Club Activity 3: Business Questions and Communicating Data Answers [to be updated Monday]
Data Science Learning Club Meet & Greet

Here are the links to things Shlomo references in the video:

Illinois Institute of Technology – Professional Master of Data Science Degree

punchcards

machine vision
robotic mapping
Google Scholar Search for Shlomo Argamon’s publications related to robotics
“Passive map learning and visual place recognition” Doctoral Dissertation [ps.gz from yale]

probability theory
probability distributions
statistical inference
bayesian statistics

Kaggle competitions

Natural Language Processing (NLP)
Google Scholar Search for Shlomo Argamon’s publications related to language
“Automatically Categorizing Written Texts by Author Gender” [Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni]

Weka
scikit-learn
Natural Language Toolkit (nltk)

sentiment analysis

Ethics in Data Science at IIT
Becoming a Data Scientist – A Challenge to Data Scientists (re: bias)

@ShlomoArgamon on Twitter

Becoming a Data Scientist Podcast Episode 02: Safia Abdalla

Renee — Mon, 04 Jan 2016 18:30:00 +0000

Note: The video is the interview only. The audio podcast has the intro, interview, and data science learning club activity explanation.

In Episode 2 of the Becoming a Data Scientist Podcast, we meet Safia Abdalla, who started programming and even exploring machine learning and natural language processing as a teenager, and is now a student at Northwestern University, a conference speaker and trainer, co-organizer of PyLadies Chicago, and a contributor to Project Jupyter.

Podcast Audio Links:
Link to podcast Episode 2 audio
Podcast’s RSS feed for podcast subscription apps
(I will distribute the feed out to iTunes and Pocket Cast ASAP. It’s available on Stitcher now!)

Podcast Video Playlist:
Youtube playlist where I’ll publish future videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Learning Club Activity 2: Creating visuals for exploratory data analysis
Data Science Learning Club Meet & Greet

Here are the links to things Safia references in the video:

dll
registry
BIOS

python

Alan Turing
ENIAC

information retrieval
Introduction to Information Retrieval by C. D. Manning, P. Raghavan, H. Schütze

natural language processing
NLTK
machine learning

Northwestern Neuroscience and Robotics Lab

SEO

pyladies
Chicago PyLadies Meetups

write speak code

pair programming

mathematicalmonk’s YouTube series on machine learning

@captainsafia on twitter
Safia’s website
Safia’s blog

JupyterDay Chicago 2016 (post by Safia on jupyter.org)
Jupyter documentation

Podcast Available on Stitcher

Renee — Mon, 21 Dec 2015 20:07:15 +0000

The Becoming a Data Scientist Podcast is now available via Stitcher! Subscribe in the app, or listen online:

If you have another podcast app, you can subscribe by entering the RSS feed link: https://www.becomingadatascientist.com/feed/podcast

There is also a built-in audio player here on the blog that I link to in each episode: https://www.becomingadatascientist.com/podcast/

I’m working on getting a logo now, so hopefully it won’t have a placeholder image for long :) I want to submit it to iTunes, but I have to download the dreaded iTunes desktop software in order to submit and manage it… ugh ridiculous.

Anyway, enjoy it on the blog and on Stitcher for now!

Becoming A Data Scientist Podcast Episode 01: Will Kurt

Renee — Mon, 21 Dec 2015 06:40:32 +0000

Note: The video is the interview only. The audio podcast has the intro, interview, and data science learning club activity explanation.

In this episode we meet Will Kurt, who talks about his path from English & Literature and Library & Information Science degrees to becoming the Lead Data Scientist at KISSmetrics. He also tells us about his probability blog, Count Bayesie, and I introduce Data Science Learning Club Activity 1. Will has some great advice for people learning data science!

Podcast Audio Links:
Link to podcast Episode 1 audio
Podcast’s RSS feed for podcast subscription apps
(I will distribute the feed out to sites like iTunes and Stitcher this week)

Podcast Video Playlist:
YouTube playlist where I’ll publish future videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Learning Club Activity 1: Find and explore a dataset
Data Science Learning Club Meet & Greet

Here are the links to things Will references in the video:

Library and Information Science

Foucault

ARPANET

Support Vector Machines

CoffeeScript

Andrew Ng’s Machine Learning course on Coursera

probabilistic graphical models

T.S. Eliot’s Four Quartets

Articulate

KISSmetrics

Count Bayesie blog
Count Bayesie – Parameter Estimation and Hypothesis Testing

R Markdown

Donald Knuth
Literate programming

ggplot2

Jupyter

Claude Shannon’s Mathematical Theory of Communication

Count Basie (musician)

Count Bayesie – Measure Theory
Bayes’ Theorem with Lego
Voight-Kampff and Bayes Factor
Black Friday Puzzle – Markov Chains

Zen Buddhism concept of “beginner’s mind”

Count Bayesie Recommended Books on Probability and Statistics

Becoming A Data Scientist Podcast Episode 0: Me!

Renee — Mon, 14 Dec 2015 19:23:59 +0000

Here is the first episode of the Becoming a Data Scientist Podcast, which is also available in video form!

(sorry for the poor video quality!)

In this episode, I talk a little about the podcast, I talk about my own background, and I introduce the Data Science Learning Club. Enjoy!
(Note: Episode 1, the first interview episode, comes out Monday 12/21!)

Podcast Audio Links:
Link to podcast Episode 0 audio
Podcast’s RSS feed for podcast subscription apps
(I will distribute this out to sites like iTunes and Stitcher soon)

Podcast Video Playlist:
YouTube playlist where I’ll publish future videos

More about the Data Science Learning Club:
Blog post about Data Science Learning Club
Learning Club Activity 0: Set up your development environment
Data Science Learning Club Meet & Greet

Here are links with more info about the things I reference in the video:

turtle logo programming language

Carmen Sandiego
Lemmings
SimCity

C programming language

JMU Integrated Science and Technology (ISAT)

Visual Basic/VB.NET/ASP.NET
MS Access

Rosetta Stone

PL/SQL
Oracle Data Warehouse
IBM Cognos

CGEP UVA Systems Engineering
Systems Engineering
Linear Algebra at Khan Academy
Stochastic Simulation
Optimization
Cognitive Systems Engineering
Principles of Data Visualization for Exploratory Data Analysis
Machine Learning
Naive Bayes
K-Means
Pattern Recognition and Machine Learning (class textbook)

Summer of Data Science
API and Market Basket Analysis
Jupyter
Docker and Jupyter
Doing Data Science by Cathy O’Neil and Rachel Schutt
O’Reilly Data Science Books
(I’ll post more specific books later)

Data Science Learning Club

Renee — Mon, 14 Dec 2015 02:25:57 +0000

I’m working on the last of my recording and editing for “Episode 0” of the new Becoming A Data Scientist Podcast, which I’m planning to launch tomorrow! I’ve already recorded the interviews for episodes 1-3, which will be airing over the next month or so – so exciting! The guests all had interesting and informative things to share, and I believe you’ll like the episodes a lot.

At the end of each podcast episode, I’ll be “assigning” a “Learning Activity” for the Data Science Learning Club. So that is starting tomorrow, too! There won’t be anyone teaching the content, but we’ll be exploring it together for 1-2 weeks between podcast episodes (usually 2 weeks). I’ll post some resources to get everyone started and help out data science beginners, then we’ll each explore the activity on our own with whatever tools and techniques we choose, and we can post our results so we can all learn from one another. If anyone gets stuck, you can post a question to the forum and hopefully someone will be able to help you through it.

I just got the Data Science Learning Club forum set up today, and it’s at this URL: https://www.becomingadatascientist.com/learningclub

Go check it out, register so you can participate, read the Welcome thread, and introduce yourself in the Meet & Greet section! Then tomorrow, the first learning activity will launch and you can get started.

I’m so excited about launching this podcast and data science learning club, and hope this turns out to be a valuable experience for all of us! Keep an eye out on the blog for the podcast post, which should go up tomorrow!

Renee
