Becoming A Data Scientist https://www.becomingadatascientist.com Documenting my path from "SQL Data Analyst pursuing an Engineering Master's Degree" to "Data Scientist" Mon, 12 Jun 2017 02:31:12 +0000 en-US hourly 1 https://wordpress.org/?v=4.4.10 Summer of Data Science 2017 https://www.becomingadatascientist.com/2017/05/29/summer-of-data-science-2017/ https://www.becomingadatascientist.com/2017/05/29/summer-of-data-science-2017/#comments Mon, 29 May 2017 05:12:25 +0000 https://www.becomingadatascientist.com/?p=1439 The Summer of Data Science is a commitment to learn something this summer to enhance your data science skills, and to share what you learned.]]> Since Memorial Day in the U.S. is the unofficial start of the summer season, I figured today would be a good time to launch the SUMMER OF DATA SCIENCE 2017!!!

The Summer of Data Science is a commitment to learn something this summer to enhance your data science skills, and to share what you learned. (Those of you in the Southern Hemisphere will have to pick up the excitement when we’re winding down during our fall/your spring and keep it going! Or, join us during your Winter of Data Science!)

For those of you who haven’t been following me for years, a hashtag I started back in 2015, #SoDS, is actually one of the things that started growing my Twitter following. Here’s the history:

Coming up with the hashtag

1st month of tweets, May 2015 Storified

June 2015 #SoDS Storify

Unfortunately, I didn’t keep up the ‘Storification’ after that, but you get the idea. It brought a bunch of us together to share our learning progress. We learned from each other, encouraged each other, and most of all geeked out about data science together!

I didn’t launch one last year, because I was starting a new job and taking a break from recording the podcast, and just didn’t want to take on too much. But I missed it and didn’t want to let another year pass without a Summer of Data Science, so we’re going to do it together again this year!

So, here are the only “rules”:

How to participate in the Summer of Data Science:

  1. Pick a thing or a short list of things related to data science that you want to learn more about this summer.
  2. Make a plan to learn it (like an online course, a practice project, etc.).
  3. Share that plan on social media, then post updates as you make progress, with the hashtag #SoDS17.

That’s it! (And yes, there’s a chef competition that used the same hashtag. No worries! Enjoy the food pics.)

If you’re looking for ideas for learning projects or topics, check out the Data Science Learning Club! Please write about your learning experiences and share in the Data Science Learning Club #SoDS forum, and/or on your own blog, and share on social media. I’ll check out the hashtag on twitter regularly and RT others. I’ll be participating myself, too!

Here’s a link to the hashtag on twitter: #SoDS17. See you there!

P.S. Did you know that there is a “Summer of Data Sci” song? :D

P.P.S. There are now Summer of Data Science 2017 t-shirts and tanks in the Becoming a Data Scientist teespring shop!

]]>
https://www.becomingadatascientist.com/2017/05/29/summer-of-data-science-2017/feed/ 1
Bias in Machine Learning Flipboard Magazine https://www.becomingadatascientist.com/2017/03/25/bias-in-machine-learning/ https://www.becomingadatascientist.com/2017/03/25/bias-in-machine-learning/#respond Sat, 25 Mar 2017 21:54:57 +0000 https://www.becomingadatascientist.com/?p=1430 Quick note for those of you who follow me on Flipboard. I added another one, seeded with links from my Challenge to Data Scientists article, on Bias in Machine Learning. Enjoy!

]]>
https://www.becomingadatascientist.com/2017/03/25/bias-in-machine-learning/feed/ 0
Becoming a Data Scientist Podcast Episode 16: Randy Olson https://www.becomingadatascientist.com/2017/03/22/becoming-a-data-scientist-podcast-episode-16-randy-olson/ https://www.becomingadatascientist.com/2017/03/22/becoming-a-data-scientist-podcast-episode-16-randy-olson/#respond Wed, 22 Mar 2017 05:11:50 +0000 https://www.becomingadatascientist.com/?p=1413 Renee interviews Randal S. Olson, Senior Data Scientist in the Institute for Biomedical Informatics at UPenn, about his path to becoming a data scientist, his interesting data science blog posts, and his work with non-data-scientists and students. Podcast Audio Links: Link to podcast Episode 16 audio Podcast's RSS feed for podcast subscription apps]]>

Renee interviews Randal S. Olson, Senior Data Scientist in the Institute for Biomedical Informatics at UPenn, about his path to becoming a data scientist, his interesting data science blog posts, and his work with non-data-scientists and students.

Podcast Audio Links:
Link to podcast Episode 16 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Podcast on iTunes

Podcast Video Playlist:
YouTube playlist of interview videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Data Science Learning Club Activity 16 – Genetic Algorithms
Data Science Learning Club Meet & Greet
Now sponsored by DataCamp!

Mentioned in the episode:

bytecode

Dr. Kenneth Stanley at the University of Central Florida

evolutionary algorithm

Michigan State University Artificial Intelligence

BEACON NSF Science and Technology Center at MSU

Randal S. Olson publications

Randy’s blog

Data Is Beautiful Reddit

traveling salesman problem

Google Maps API

Moneyball (book)

Data Science Handbook (book)

Weka

scikit-learn

version control

Randy on:
Twitter
LinkedIn
GitHub
Patreon

Becoming a Data Scientist T-Shirts!

Thanks to DataCamp for sponsoring this episode!

DataCamp discount link in Data Science Learning Club forums (only visible to logged-in users)


]]>
https://www.becomingadatascientist.com/2017/03/22/becoming-a-data-scientist-podcast-episode-16-randy-olson/feed/ 0
T-Shirts!! https://www.becomingadatascientist.com/2017/02/18/t-shirts/ https://www.becomingadatascientist.com/2017/02/18/t-shirts/#respond Sat, 18 Feb 2017 20:55:46 +0000 https://www.becomingadatascientist.com/?p=1372 April 1 using this link: Becoming a Data Scientist Store – Free Shipping, you’ll get free shipping on your order! The design is a combination of those submitted to our contest by Amarendranath “Amar” Reddy and Ryne & Alexis.]]> The Becoming a Data Scientist tees are ready to sell! I ordered a couple myself before posting them for sale, to make sure the quality was good. They came out great!! And if you order from Teespring before April 1, 2017 using this link: Becoming a Data Scientist Store – Free Shipping, you’ll get free shipping on your order!

(Readers told me that the link above doesn’t discount at all for International shipping, so if you are outside the US, use this link for $3.99 off – equivalent to US Shipping cost)


The design is a combination of those submitted to our contest by Amarendranath “Amar” Reddy and Ryne & Alexis. You can see their design submissions and read more about them on the finalists post! They are each receiving prizes for being selected. Thanks Amar, Ryne, and Alexis for the awesome design!

There are a variety of styles and colors available. The Premium Tee is 100% cotton. The Women’s Premium is a 50/50 cotton/poly blend, and is cut to fit more snugly. They are available in navy blue, gray, purple, and black. There’s even a long-sleeve version!

I make anywhere from $2-$7 on each order (it’s print-on-demand, so not cheap enough for me to make a significant profit yet, and my proceeds will be lower with the free shipping offer, but I want to reward those of you who are excited to flaunt your Becoming a Data Scientist status!) and every dollar earned from these will be going to the fund that helps support my new small team of assistants, who you’ll meet soon! Also, the more of them I sell, the lower the cost to print is per shirt, so please share with all of your friends!

Here are photos of me wearing the shirt. They were taken before I made the front design slightly smaller (so it doesn’t wrap into the armpit), moved the back design slightly higher, and made the gray dots (data points?) transparent so the color of the shirt shows through there (see store images above for current design). You can see that the teal came out as a lighter blue in printing. This is the “Premium Tee” style in “New Navy”.

Here’s a model wearing a simulated version of the shirt.

Order yours here, with Free Shipping Until March 1!

Update: Kids sizes now available, too!
(the design is on the front for kids’ shirts)

]]>
https://www.becomingadatascientist.com/2017/02/18/t-shirts/feed/ 0
Becoming a Data Scientist Podcast Episode 15: David Meza https://www.becomingadatascientist.com/2017/01/29/becoming-a-data-scientist-podcast-episode-15-david-meza/ https://www.becomingadatascientist.com/2017/01/29/becoming-a-data-scientist-podcast-episode-15-david-meza/#respond Mon, 30 Jan 2017 04:41:47 +0000 https://www.becomingadatascientist.com/?p=1340
David Meza is Chief Knowledge Architect at NASA, and talks to Renee in this episode about his educational background, his early work at NASA, and examples of his work with multidisciplinary teams. He also describes a project involving a graph database that improved search capabilities so NASA engineers could more easily find "lessons learned".
Podcast Audio Links: Link to podcast Episode 15 audio Podcast's RSS feed for podcast subscription apps]]>

David Meza is Chief Knowledge Architect at NASA, and talks to Renee in this episode about his educational background, his early work at NASA, and examples of his work with multidisciplinary teams. He also describes a project involving a graph database that improved search capabilities so NASA engineers could more easily find “lessons learned”.


Podcast Audio Links:
Link to podcast Episode 15 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Podcast on iTunes

Podcast Video Playlist:
YouTube playlist of interview videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Data Science Learning Club Activity 15 – Explain an Analysis (Communication)
Data Science Learning Club Meet & Greet
Now sponsored by DataCamp!

Mentioned in the episode:

NASA.gov

MS Access

Neutral Buoyancy Lab

civil servant

NASA Knowledge (@NASAKnowledge on twitter)

Engineering Management
Knowledge Management
Organizational Learning
Knowledge Engineering
Information Architecture
Data Analysis

graph database

Neo4j
Elasticsearch
IHS Goldfire
MongoDB

JSC – Johnson Space Center

topic modeling

@davidmeza1 on Twitter
David Meza on LinkedIn

Southern Data Science Conference in Atlanta, GA on April 7, 2017 (Coupon code RENEE takes 15% off ticket price)

Thanks to DataCamp for sponsoring this episode!

DataCamp discount link in Data Science Learning Club forums (only visible to logged-in users)




]]>
https://www.becomingadatascientist.com/2017/01/29/becoming-a-data-scientist-podcast-episode-15-david-meza/feed/ 0
T-Shirt Contest Finalists https://www.becomingadatascientist.com/2017/01/16/t-shirt-contest-finalists/ https://www.becomingadatascientist.com/2017/01/16/t-shirt-contest-finalists/#comments Tue, 17 Jan 2017 03:29:01 +0000 https://www.becomingadatascientist.com/?p=1329 I still haven’t heard from one of the 3 finalists, but I wanted to go ahead and post the first two, and I’ll update here with the final one later. These finalists win a data science book and a t-shirt, and I’ll choose from the three (I’m actually considering combining elements from two of them!) and announce the final t-shirt design when they are available for sale.

Without further ado, the top 3 vote-winners after 94 votes, in no particular order, are….

1) Ryne


Ryne is an analyst for an economic consulting firm in Salt Lake City looking to transition to a career in data science. Lucky for him, his wife Alexis is a graphic designer and entered the competition in his name so he could win a book! She had a baby a few months ago and is just doing freelance design as time permits.

Alexis’ website is alexisbrittany.com. (Feel free to reach out if you need a designer!)

2) Amarendra


Amarendra (Amar) has recently moved to Philadelphia to pursue an MS in Business Intelligence and Analytics at Saint Joseph’s University. He believes in continuous learning and aspires to contribute to the field of data exploration. Upon graduation, Amar would like to work in the field of data visualization. He loves working with Tableau, and one of his resolutions for 2017 is to get appreciated by one of the Tableau Zen Masters.

3) coming soon

I’ll update here as soon as I hear back from the 3rd finalist! If I don’t hear back, I may give the prizes to the 4th place winner.

Amar and Ryne both selected the same data science book, so I thought I’d share that here for others who may be interested. They will both be receiving The Python Data Science Handbook by Jake VanderPlas.

You will hear about the final design soon, as well as options for purchasing t-shirts! Thanks to everyone who entered and voted!

]]>
https://www.becomingadatascientist.com/2017/01/16/t-shirt-contest-finalists/feed/ 1
Becoming a Data Scientist Podcast Episode 14: Jasmine Dumas https://www.becomingadatascientist.com/2017/01/10/becoming-a-data-scientist-podcast-episode-14-jasmine-dumas/ https://www.becomingadatascientist.com/2017/01/10/becoming-a-data-scientist-podcast-episode-14-jasmine-dumas/#respond Wed, 11 Jan 2017 04:53:29 +0000 https://www.becomingadatascientist.com/?p=1312
In this first episode of "Season 2" of Becoming a Data Scientist podcast, we meet Jasmine Dumas, a new data scientist who tells us about going from biomedical engineering into a data science project experience and then finding her first job as a data scientist.
Podcast Audio Links: Link to podcast Episode 14 audio Podcast's RSS feed for podcast subscription apps]]>

In this first episode of “Season 2” of Becoming a Data Scientist podcast, we meet Jasmine Dumas, a new data scientist who tells us about going from biomedical engineering into a data science project experience and then finding her first job as a data scientist.


Podcast Audio Links:
Link to podcast Episode 14 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Podcast on iTunes

Podcast Video Playlist:
YouTube playlist of interview videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Activity 14: Hidden Markov Models
Activity 15: Neural Nets for Text
Data Science Learning Club Meet & Greet
Now sponsored by DataCamp!

Mentioned in the episode:

Science Olympiad

#RStats on twitter

Hadley Wickham’s Advanced R book

Shiny

Survival Analysis

RStudio

shinyGEO: a web-based application for analyzing gene expression omnibus datasets

Google Summer of Code

RTalk Podcast

Simple Finance

Jasmine’s website on GitHub

Jasmine’s projects

@jasdumas on Twitter

Thanks to DataCamp for sponsoring this episode!

DataCamp discount link in Data Science Learning Club forums (only visible to logged-in users)




]]>
https://www.becomingadatascientist.com/2017/01/10/becoming-a-data-scientist-podcast-episode-14-jasmine-dumas/feed/ 0
Vote for a T-Shirt design! https://www.becomingadatascientist.com/2017/01/08/vote-for-a-t-shirt-design/ https://www.becomingadatascientist.com/2017/01/08/vote-for-a-t-shirt-design/#respond Sun, 08 Jan 2017 06:19:40 +0000 https://www.becomingadatascientist.com/?p=1268 Which of these awesome designs would you like on your future Becoming a Data Scientist T-Shirt? You all will narrow it down to 3, then I’ll pick the final winner to be printed!

For the ones that aren’t t-shirt-print-ready, I’ll get a graphic designer to tidy them up, so don’t worry about whether they’re printable when you vote. We can also vary colors and things like that later. Just pick the design you like!

(original contest here)



dsnmizan

dsnmizan

Venkat

venkat1

Gordon

gordon1 gordon2

Kumar

kumar

Nigel

nigel

Ridhima

ridhima

Ryne

ryne1

Amarendra


gordon1 gordon2

––> VOTE HERE <––

(requires google login to prevent multiple-voting, but the voting is anonymous)

]]>
https://www.becomingadatascientist.com/2017/01/08/vote-for-a-t-shirt-design/feed/ 0
I’m hiring! https://www.becomingadatascientist.com/2017/01/05/im-hiring/ https://www.becomingadatascientist.com/2017/01/05/im-hiring/#comments Thu, 05 Jan 2017 06:38:35 +0000 https://www.becomingadatascientist.com/?p=1264 I need a part-time remote assistant to help keep my websites up to date, among other things!

Thanks to my generous Patreon supporters, I can hire someone to help me out 8-20 hours per month, paying $15/hr. More info and application form at this link.

Please let me know if you have any questions or if there are any problems with the form. Email me at renee@becomingadatascientist.com or tweet me at @becomingdatasci.

I look forward to reading the applications!!

]]>
https://www.becomingadatascientist.com/2017/01/05/im-hiring/feed/ 3
Podcast Special Episode 2 – The Future of AI with Dr. Ed Felten https://www.becomingadatascientist.com/2016/12/29/podcast-special-episode-2-the-future-of-ai-with-dr-ed-felten/ https://www.becomingadatascientist.com/2016/12/29/podcast-special-episode-2-the-future-of-ai-with-dr-ed-felten/#respond Thu, 29 Dec 2016 05:02:12 +0000 https://www.becomingadatascientist.com/?p=1250 In this audio-only Becoming a Data Scientist Podcast Special Episode, I interview Dr. Ed Felten, Deputy U.S. Chief Technology Officer, about the Future of Artificial Intelligence (from The White House!).

You can stream or download the audio at this link (download by right-clicking on the player and choosing “Save As”), or listen to it in podcast players like iTunes and Stitcher. Enjoy!

Show Notes:

The White House Names Dr. Ed Felten as Deputy U.S. Chief Technology Officer

Edward W. Felten at Princeton University

Dr. Edward Felten on Wikipedia

White House Office of Science and Technology Policy (OSTP)

The Administration’s Report on the Future of Artificial Intelligence (White House Report from October 2016)

Artificial Intelligence, Automation, and the Economy (White House Report from December 2016)

Ed Felten on Twitter: Official / Personal

Freedom to Tinker blog

———

Other Podcasts in this Government Data Series:

Becoming a Data Scientist Patreon Campaign

]]>
https://www.becomingadatascientist.com/2016/12/29/podcast-special-episode-2-the-future-of-ai-with-dr-ed-felten/feed/ 0
Support Becoming a Data Scientist! https://www.becomingadatascientist.com/2016/12/07/support-becoming-a-data-scientist/ https://www.becomingadatascientist.com/2016/12/07/support-becoming-a-data-scientist/#respond Wed, 07 Dec 2016 21:34:37 +0000 https://www.becomingadatascientist.com/?p=1237 I want to hire some people to help me update my websites more frequently, do the maintenance stuff, and to help edit the podcast so I can produce episodes more frequently.

I outlined my whole plan here on my Patreon Campaign. You’ll see a new page on this site soon acknowledging supporters, and I’ll update you on the progress.

Whether you can give financially, or even if you just share the campaign with your data science friends, you are helping Becoming a Data Scientist podcast, the learning club, Data Sci Guide, Jobs for New Data Scientists, and all of my websites get off the ground! Thank you!!


]]>
https://www.becomingadatascientist.com/2016/12/07/support-becoming-a-data-scientist/feed/ 0
T-Shirt Design Contest! https://www.becomingadatascientist.com/2016/11/19/t-shirt-design-contest/ https://www.becomingadatascientist.com/2016/11/19/t-shirt-design-contest/#comments Sat, 19 Nov 2016 18:13:29 +0000 https://www.becomingadatascientist.com/?p=1213 I’ve decided that I want to have Becoming a Data Scientist t-shirts to sell and to give out to podcast guests and contest winners, but I am not a graphic designer, so I need some help! So I’m going to have a t-shirt design contest!


Here are the rules/guidelines for entry:

1. Create a design that prominently says “Becoming a Data Scientist” and can be easily scaled to fit on the front or back of a t-shirt. If you create a design for the back, also create a small “pocket sized” text or design for the front of the shirt. But I don’t have a preference – front or back of shirt designs are both fine!

If it can be incorporated into the design without looking too cluttered, you can also add “Podcast and Learning Club” and/or “@becomingdatasci”, but that is not a requirement.

The design can be just text, text with an image, or something more abstract; use your imagination! As long as “Becoming a Data Scientist” is clearly readable, your design will be considered. Obviously, vulgar designs will not be considered, and I’ll also remove them (or anything spammy) from the comments on this post.

Please don’t use more than 2 colors in your design itself, as more can be cost-prohibitive. The background color can be a different 3rd color.

Here is a site with some guidance for preparing a design for t-shirts: https://gomedia.com/zine/tutorials/pro-tips-preparing-artwork-t-shirt-printing/
(Note – I don’t expect the design to be complex or super-artistic, and a text-only design could be created in a simple text or image editor!)

You can even use an online shirt design program like CustomInk, as long as the design can be extracted for use at another printing site (I’m not sure what the rules or capabilities of most of those t-shirt design websites are).

2. Please submit 2 files: your design itself (in a format that can be read on multiple platforms, like a PDF), large enough that it can be zoomed in to “life size”, and then also a smaller image of your design as you imagine it on a shirt – choose a shirt color and location for your design and create a little “shirt preview” image I can share with readers for the vote (this one doesn’t have to be zoomable to full size – the largest size I’d post it at is about 500×500). Please let me know if you have any questions or suggestions about this!

You can submit the files by posting a comment below. Don’t put your email in the text of the comment; I’ll be able to see it behind the scenes from the form. Make sure to include your name (it can be just your first name if you want) as you want it shared along with your design if you get selected for the voting round, along with the links to the 2 files and any link you want to point people to – your blog, your portfolio, your Twitter account, etc.

UPDATE 12/5/2016: It looks like I made the turnaround time too short (I’d like to have at least 10 designs before the voting), and I know some people may have great ideas but not great graphic design skills, so I’m making 2 changes:

  • It doesn’t need to be a “print-ready” design. If you sketch it out and your design wins, I’ll get a graphic designer to help turn it into a file to send to the t-shirt printer.
  • The deadline is now extended to the end of the calendar year (see below).

Thank you to those of you who have entered already!!

Here’s how the contest will go:

I’ll accept entries until 11:59pm Saturday, December 31, 2016. Over the next week, depending on how many entries there are, I’ll narrow down the selection to maybe 5-10 choices. I’ll create a blog post with the t-shirt images and names of the designers, with a way to vote on your favorite, and advertise it on @becomingdatasci to get as many votes as possible.

The top 3 vote-winners will win:

  • A t-shirt
  • A data science related book of their choice up to $60
  • A feature in a “finalists” blog post with their design, a little blurb about them, and a link to their website
  • A tweet with their design and a link to their site on my @becomingdatasci twitter account.

From the finalists, I’ll choose my favorite to be printed. The final winner will also get:

  • 2-3 extra t-shirts with their design to give out to friends
  • Name credited as designer wherever the t-shirt is sold
  • Additional tweets with their design announced as the winner, with a link to their site, on @becomingdatasci.
  • A shout-out on the Becoming a Data Scientist podcast

Please let me know if there’s anything I forgot to detail here or if you have any questions! I look forward to seeing the submitted designs!!

Update: I should probably mention that any proceeds from sales of the shirts will go to support the maintenance and creation of more content at this site BecomingADataScientist.com, the podcast, the Data Science Learning Club, DataSciGuide, and my other data science sites and social media accounts. I’ll be posting a Patreon campaign soon to raise money to hire help to keep these sites updated, and money I earn from selling t-shirts will go toward that as well.

]]>
https://www.becomingadatascientist.com/2016/11/19/t-shirt-design-contest/feed/ 20
Becoming a Data Scientist Podcast Special Episode https://www.becomingadatascientist.com/2016/11/13/becoming-a-data-scientist-podcast-special-episode/ https://www.becomingadatascientist.com/2016/11/13/becoming-a-data-scientist-podcast-special-episode/#respond Mon, 14 Nov 2016 03:59:06 +0000 https://www.becomingadatascientist.com/?p=1206 The hosts of Becoming a Data Scientist podcast, Partially Derivative podcast, Adversarial Learning podcast, and some other awesome data people that do elections forecasting for their day jobs joined together for this talk about the US election and the subsequent major questions surrounding the predictions, since basically all of them heavily leaned toward a different overall outcome than we got. If you’re interested at all in data science surrounding political campaigns, this episode is a must-listen!

Episode Audio (mp3) – also available on iTunes, Stitcher, etc.
(note, there is no video for this episode)

On the panel:

]]>
https://www.becomingadatascientist.com/2016/11/13/becoming-a-data-scientist-podcast-special-episode/feed/ 0
PyData DC 2016 Talk https://www.becomingadatascientist.com/2016/10/11/pydata-dc-2016-talk/ https://www.becomingadatascientist.com/2016/10/11/pydata-dc-2016-talk/#comments Tue, 11 Oct 2016 05:00:36 +0000 https://www.becomingadatascientist.com/?p=1189 I just got back from PyData DC, where I learned a lot, had fun, and met a bunch of awesome people! I’ll definitely write about it more later, but I wanted to share my slides here since I told the attendees they could find them on my website. I got good feedback on the talk, and I’m so glad that my message resonated with some people!

The talk was recorded and video should be out within a few weeks!

Here are the slides: Becoming a Data Scientist – Advice from my Podcast Guests
and the slide notes.

Update 10/26: Here is the recording of my talk, with a playlist of other talks from PyData DC!

]]>
https://www.becomingadatascientist.com/2016/10/11/pydata-dc-2016-talk/feed/ 5
“Becoming a Data Scientist” Survey Results 1: Jobs & Education https://www.becomingadatascientist.com/2016/08/22/becoming-a-data-scientist-survey-results-1-jobs-education/ https://www.becomingadatascientist.com/2016/08/22/becoming-a-data-scientist-survey-results-1-jobs-education/#comments Mon, 22 Aug 2016 05:22:30 +0000 https://www.becomingadatascientist.com/?p=1162 Here is the 1st batch of results from the Becoming a Data Scientist Survey. Because of the sample size and unscientific casual nature of this survey, we can’t make any broad generalizations about the industry from these results, but you can see some general preliminary trends in the breakdowns that would be interesting to study more.

95% of the 158 respondents who gave answers about their jobs follow at least one of my twitter accounts, so you can think of these results as representing my twitter followers.

I’m mostly going to let the tables and graphs speak for themselves. There’s a lot to unpack in the survey (over 50 questions), and a ton of ways to slice and dice it. Because I wanted to show various breakdowns, I did this in simple Excel pivot tables and didn’t spend a ton of time formatting (I know they’re kinda ugly – sorry).

If any of the tables or charts below are particularly hard to understand, please let me know and I’ll annotate them further or improve them!

Educational Field
vs Job Title of Data Analyst or Data Scientist (yes/no)

educational field vs data scientist title (25% of those with numeric degrees have data scientist titles)

 

Computer Science or Data Sci / Statistics / Analytics Degree
vs Job Title of Data Analyst or Data Scientist (yes/no)

30 (19%) of respondents had CS degrees, which was the most common degree name, and 27% of them had a job title that I categorized under Data Analyst or Data Scientist.

8.3% of respondents had Data Science, Statistics, or Analytics degrees, and 38.5% of them had job titles that I categorized as Data Analyst/Scientist.

19% (30) CS degrees, 8.3% (13) had Stats or Analytics degrees

 

Educational Level
vs Considers Self Data Scientist (No/Not Yet or Yes)

educational level vs considers self data scientist (31% of PhD respondents consider themselves data scientists)

Considers Self Data Scientist by Educational Level and Field

Percentage of each educational field and level that consider themselves data scientists - 90% of the 20 respondents who completed a bachelor’s degree do not consider themselves data scientists

Educational Field by Considers Self Data Scientist

percent of those that consider themselves data scientists by educational level and field - 25% of those that do have a PhD in Science/Tech/Engineering

 

Respondent Breakdown by Job Industry

job industry pie chart - largest is Software/Technology at 24%

Visual of those with Science/Tech/Eng Degrees and Math/Stats/Acctg Degrees

largest category is science/tech/engineering with master’s degree at 27 respondents

 

Percentage of Respondents by “Considers Self Data Scientist” with each Salary Range

About 20% (28) of those that do not consider themselves to be Data Scientists make $50-75K. About 17% (7) of those that do consider themselves to be Data Scientists make under $25K.

salary by data scientist or not

 

Because the numbers above looked surprising to me, I filtered the same table to those in North America (assuming that many of the low-salary data scientists might be outside of the U.S.).

In North America, about 21% (22) of those that do not consider themselves to be Data Scientists make $50-75K. But now 25% (5) of those that do consider themselves to be Data Scientists make $125-150K. This also shows that 22 of the 42 people above (52%) that considered themselves Data Scientists live outside of North America.

salary by data scientist or not - North America
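
For anyone who wants to double-check the arithmetic in the North America comparison above, here is a minimal Python sketch. It uses only the counts and percentages quoted in this post. The variable names and the `pct` helper are made up for illustration, and it assumes that “25% (5)” means 5 is exactly a quarter of the North American data scientist group.

```python
# Figures quoted in the post; nothing here is new survey data.
ds_total = 42        # respondents who consider themselves data scientists
ds_under_25k = 7     # of those, making under $25K ("about 17%")
ds_na_top_band = 5   # North American data scientists making $125-150K ("25%")

# Assumption: 5 people being 25% implies the size of the North American group.
ds_north_america = round(ds_na_top_band / 0.25)
ds_outside_na = ds_total - ds_north_america


def pct(part, whole):
    """Return part/whole as a percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


print(ds_north_america, ds_outside_na)                            # 20 22
print(pct(ds_under_25k, ds_total), pct(ds_outside_na, ds_total))  # 17 52
```

Under that assumption, the implied group sizes line up with the 22 of 42 (52%) outside North America mentioned above.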

 

Bar Charts of Above Data. Top = Data Scientists, Bottom = Not Data Scientists.

Data Scientists. Tallest bar is under-$25K, but 6/7 of those are not in North America. 12 people $125K and above.
Non-Data Scientists. Tallest bar is $50-75K. 22/28 of those are in North America. 12 above $125K.

 

I may release some of this data at the end of my analysis. If you’re interested in diving in yourself, let me know in the comments and I’ll work on it for a future post.

More results on the way in the near future! Next up, how many respondents follow each twitter account and whether more people watch or listen to the podcast!

 

]]>
https://www.becomingadatascientist.com/2016/08/22/becoming-a-data-scientist-survey-results-1-jobs-education/feed/ 1
Podcast Episodes 0 to 3 https://www.becomingadatascientist.com/2016/08/13/podcast-episodes-0-3/ https://www.becomingadatascientist.com/2016/08/13/podcast-episodes-0-3/#comments Sat, 13 Aug 2016 18:59:25 +0000 https://www.becomingadatascientist.com/?p=1153 It’s been brought to my attention that iTunes only shows the last 10 episodes of the Becoming a Data Scientist Podcast. If you haven’t seen/heard episodes 0-3, you can watch the interviews on the YouTube channel:

Becoming a Data Scientist Podcast Interviews YouTube Playlist

or listen to/download the full audio episodes via the blog. Here are the links to the blog posts (with links to everything else), and the audio itself for those first four episodes:

Episode 0: Renee Teate (me) and intro to the podcast
Audio Only (with MP3 download link)

Episode 1: Will Kurt – English/Library Science to Data Science
Audio Only (with MP3 download link)

Episode 2: Safia Abdalla – College Student, Conference Speaker, Python/Jupyter Contributor
Audio Only (with MP3 download link)

Episode 3: Shlomo Argamon – Director of Master of Data Science program at IIT
Audio Only (with MP3 download link)

Click through to any episode for links to the RSS subscription feeds, links to the learning club activities, etc.

Enjoy!

]]>
https://www.becomingadatascientist.com/2016/08/13/podcast-episodes-0-3/feed/ 3
Becoming a Data Scientist Survey https://www.becomingadatascientist.com/2016/08/07/becoming-a-data-scientist-survey/ https://www.becomingadatascientist.com/2016/08/07/becoming-a-data-scientist-survey/#respond Sun, 07 Aug 2016 20:33:25 +0000 https://www.becomingadatascientist.com/?p=1150 I am collecting information about my “audiences” so I can improve my websites, podcast, and also formulate a plan for a Patreon campaign to generate funds for getting help and to free myself up to create more content.

Please fill out the survey and share it with your friends and followers on social media! The survey is a little long/detailed, but most of it is optional. I value your opinions! Thank you so much for participating!!

Link to the Becoming a Data Scientist Survey

]]>
https://www.becomingadatascientist.com/2016/08/07/becoming-a-data-scientist-survey/feed/ 0
Boosting (in Machine Learning) as a Metaphor for Diverse Teams https://www.becomingadatascientist.com/2016/08/06/boosting-in-machine-learning-as-a-metaphor-for-diverse-teams/ https://www.becomingadatascientist.com/2016/08/06/boosting-in-machine-learning-as-a-metaphor-for-diverse-teams/#comments Sun, 07 Aug 2016 00:42:08 +0000 https://www.becomingadatascientist.com/?p=1130 Note – I wrote this article in one sitting, and definitely want to come back later to improve it and add references, but I don’t want to hold it up from being published just because I’m hungry for dinner. :) So I’m hitting publish, but please be aware that the content may change later. And feel free to give suggestions in the comments. -Renee

tl;dr: Boosting ensemble algorithms in Machine Learning use an approach that is similar to assembling a diverse team with a variety of strengths and experiences. If machines make better decisions by combining a bunch of “less qualified opinions” vs “asking one expert”, then maybe people would, too.

Why is this post on this blog?

I’ve been thinking a lot about diversity in tech lately. After the #FBNoExcuses conversations on twitter, I was motivated to start UntappedPipeline.com (on twitter: @untappdpipeline) because I know so many awesome women and people of color in tech, and it amazes me that some companies seem to think they are so rare and hard to find (hence using the “pipeline problem” as an excuse for not having a diverse workforce).

Of course, “diversity” can mean a lot of things: gender diversity, racial/ethnic diversity, diversity of educational backgrounds, etc. – but it all really comes down to diversity of culture/thoughts. If you are interested in learning more about diversity in tech and the benefits of hiring diverse teams, check out the Resources page on UntappedPipeline.com. Here’s one study in particular that highlights the economic benefits of a diverse tech workforce: Decoding Diversity (Dalberg/Intel, PDF).

So if this post is about diversity, why am I writing it on my “Becoming a Data Scientist” blog instead of at Untapped Pipeline? Because I’ve been doing some machine learning lately that I just realized is a great metaphor for the benefits of hiring diverse teams, and this is also an opportunity to explain a data science concept. Additionally, I’ve been talking about data science teams for a long time, and one of my motivations for starting the Becoming a Data Scientist podcast was to feature the many paths people take to data science, because data science in itself is an interdisciplinary job that requires a variety (diversity) of experience: primarily in statistics, computer programming, and “business” (or domain knowledge), but really there is a very broad set of skills that come into play when doing data science, and no one has them all (relevant rant by @quominus). I will list some references about building Data Science teams at the end of the post.

There are two machine learning algorithms I’ve used recently that illustrate the specific “ensemble learning” concept I want to focus on: Random Forest and Gradient Boosting Classifiers.

Random Forest

A Random Forest Classifier works like this: You may have heard of Decision Trees, which are pretty much just “if then” classifiers that end up generating a set of rules: “If Attribute 1 is in this range of values, and Attribute 2 is this boolean value, and Attribute 3 is greater than this value, then out of all of the possible results, this one is most likely.” Every example you feed into it will drop into one of the possible outcomes, with a certain probability of being correct. The article “A Visual Introduction to Machine Learning” has a great animated illustration of how decision trees work.
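To make the “if then” idea concrete, here is a minimal hand-written sketch in Python. It is not a trained model: the attribute names, thresholds, outcomes, and probabilities are all made up for illustration, and a real decision tree would learn them from data.

```python
def classify(example):
    """A hand-written 'decision tree': nested if/then rules ending in a leaf.

    Returns (predicted outcome, made-up probability of being correct).
    """
    if example["attribute_1"] < 10:        # "Attribute 1 is in this range of values"
        if example["attribute_2"]:         # "Attribute 2 is this boolean value"
            return "outcome A", 0.90
        return "outcome B", 0.65
    if example["attribute_3"] > 2.5:       # "Attribute 3 is greater than this value"
        return "outcome C", 0.80
    return "outcome B", 0.55


print(classify({"attribute_1": 5, "attribute_2": True, "attribute_3": 0.0}))
print(classify({"attribute_1": 42, "attribute_2": False, "attribute_3": 3.1}))
```

Every example follows exactly one path through the rules and lands in exactly one leaf, which is what “drop into one of the possible outcomes” means.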

What a Random Forest does is build up a whole bunch of “dumb” decision trees by only analyzing a subset of the data at a time. A limited set of features (columns) from a portion of the overall records (rows) is used to generate each decision tree, and the “depth” of the tree (and/or size of the “leaves”, the number of examples that fall into each final bin) is limited as well. So the trees in the model are “trained” with only a portion of the available data and therefore don’t individually generate very accurate classifications.
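Here is a rough sketch of the “subset of the data” step in plain Python. The function name and the column names are hypothetical. It follows the description above (one set of rows and one set of columns per tree); as far as I know, library implementations such as scikit-learn’s usually sample rows with replacement and pick a fresh random subset of features at each split rather than once per tree.

```python
import random


def draw_tree_training_sets(n_rows, feature_names, n_trees, seed=0):
    """For each tree, pick which rows and which columns it is allowed to see."""
    rng = random.Random(seed)
    # A common rule of thumb: use about sqrt(number of features) per tree.
    n_features = max(1, int(len(feature_names) ** 0.5))
    plans = []
    for _ in range(n_trees):
        # Bootstrap sample: draw row indices with replacement.
        rows = [rng.randrange(n_rows) for _ in range(n_rows)]
        # Random subset of columns, without replacement.
        features = rng.sample(feature_names, n_features)
        plans.append((rows, features))
    return plans


plans = draw_tree_training_sets(
    n_rows=8,
    feature_names=["age", "income", "region", "tenure"],  # made-up columns
    n_trees=3,
)
for rows, features in plans:
    print(sorted(set(rows)), features)
```

Each tree would then be trained only on its own rows and columns, which is why no single tree sees the whole picture.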

However, it turns out that when you combine the results of a bunch of these “dumb” trees (also known as “weak learners”), the combined result is usually even better than the most finely-tuned single full decision tree. (So you can see how the algorithm got its name – a whole bunch of small trees, somewhat randomly generated, but used in combination is a random forest!)
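The claim that many weak learners beat one can be sketched with a small simulation. This is a simplification under stated assumptions: a two-outcome problem, every weak learner right 60% of the time, and every learner making its mistakes independently of the others. Real trees in a forest are correlated, so the real gain is smaller. The function name and numbers are mine.

```python
import random


def majority_vote_accuracy(n_learners, p_correct, n_trials=2000, seed=42):
    """Fraction of trials in which a majority of independent weak learners is right."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        # Each learner votes for the right answer with probability p_correct.
        correct_votes = sum(rng.random() < p_correct for _ in range(n_learners))
        if correct_votes > n_learners / 2:
            wins += 1
    return wins / n_trials


single = majority_vote_accuracy(n_learners=1, p_correct=0.6)
forest = majority_vote_accuracy(n_learners=101, p_correct=0.6)
print(f"one weak learner: {single:.2f}, 101 weak learners voting: {forest:.2f}")
```

The individual learners never get any better; only the vote does.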

When you combine multiple models to determine a result in machine learning, it is called ensemble learning. When the weak learners are trained independently on random subsets of the data and combined by voting (whichever outcome occurs most often for a particular record wins) or averaging, as in a Random Forest, that is called bagging. When the weak learners are instead built one after another, each one iteratively improving on the combined outcome so far to create a result that is stronger than any single-pass approach, that is called boosting.

Gradient Boosting

One boosting algorithm is called Gradient Boosting. Like the Random Forest algorithm, it combines many small decision trees, except (as far as I understand it) it builds them one after another and uses a type of optimization called Gradient Descent, which minimizes a loss function. Basically, each time it generates a decision tree, it’s using what it learned the last time it generated one to make the next one a little less bad (reducing cost/loss).
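Here is a minimal sketch of that “a little less bad each time” loop in plain Python. It is an illustration under assumptions, not how any particular library does it: a single numeric input, squared-error loss (where the negative gradient is simply the residual, actual minus predicted), one-split trees (“stumps”) as the weak learners, and a learning rate I picked arbitrarily. All names are mine.

```python
def fit_stump(xs, residuals):
    """Find the one-threshold split on x that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # skip the largest x so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_trees=20, learning_rate=0.3):
    """Build stumps one at a time, each fitted to what the ensemble still gets wrong."""
    base = sum(ys) / len(ys)                 # start from the average
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    losses = []
    for _ in range(n_trees):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        losses.append(sum(r * r for r in residuals) / len(ys))  # loss before adding this stump
        stumps.append(fit_stump(xs, residuals))
    return predict, losses


xs = list(range(10))
ys = [x * x for x in xs]                     # toy data: y = x squared
predict, losses = gradient_boost(xs, ys)
print(f"loss before boosting: {losses[0]:.1f}, before the last stump: {losses[-1]:.1f}")
```

The difference from the Random Forest approach is in the loop: each new stump depends on the mistakes of all the stumps before it, rather than being built independently.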

I’m not going to go too far into the technical details here, partly because that’s not really the point of this article, and partly because I start having flashbacks to my Optimization class in grad school, which was not a happy experience.

There are other boosting algorithms like Adaptive Boosting (AdaBoost) and other ensemble methods in machine learning to explore. Some of them are described in the scikit-learn (machine learning python package) documentation if you want to learn more.

So back to my original point

Now that I have the explanations out of the way, I can get back to the point I wanted to make. Notice how these algorithms work. Each “weak learner” only has some of the information needed to make a good guess to classify something. In fact, they’re often incorrect on their own, because they just don’t have the experience that a big “solo” algorithm has with the data. (Imagine a bunch of weird-looking small trees vs one big gorgeous well-developed oak tree.) However, when you combine the wide variety of partial experiences that the “weak learners” provide, their combined guess turns out to actually be better than the guess made by the one big fully-formed tree.


Many Small Trees (young forest growth via The Young Forest Project)


One Big Fully-Formed Tree (Keeler Oak Tree by Msact via Wikimedia Commons)

So, think of the same concept for building a Data Science Team. There is currently a shortage of “experts” in data science, and most companies don’t seem to know what kind of data scientist they need anyway. Some companies are lamenting the lack of “qualified” data scientists (i.e. “unicorns” that have all of the necessary skills and experience already), while in the meantime, there are plenty of business analysts, software developers, UX designers, subject matter experts, people that do similar work in other fields (like biotechnology, cognitive science, etc etc etc), and people that are on their way to becoming data scientists and only have a portion of the requisite skills and knowledge. Some companies are just “pattern matching” and trying to hire people that are exactly like their existing successful employees (though it’s unlikely they have even defined what they mean by “successful employees”).

However, if you know how to find creative and motivated “go-getters” that want to learn on the job and contribute to a team, and each person on the team has a portion of the needed experience and skills, there is a good chance that in combination (if given good support and resources!) a small group of “junior data scientists with other relevant skills” will actually turn out better than hiring one or two “experts” in the first place. Plus, they cost less. Plus, they are likely very trainable. Plus, they really want to make a difference and prove themselves as capable data scientists.

Anyway, I’m making a lot of generalizations here, and need to go back and fill in some of my comments in this last section with references, but you can see what I’m getting at. For those in tech that have a really hard time believing that a “scrappy new business analyst” with a non-terminal non-computer-science degree from a non-ivy-league school, who doesn’t have many years of experience doing the exact kind of work you want them to do at your company, could be “qualified” to fill a Data Science position, maybe it will help to think of the problem as one that Boosting will solve. Create an “ensemble” of “learners” that may individually only have a subset of the experience and may be self-taught and not do everything the “right” way when tasked to do it alone, but can each contribute their wide variety of experiences and skills to come up with a final solution as a team. I’m willing to bet that the solutions generated by a diverse group with less-than-ideal credentials (but a wider breadth of experience) will turn out better than what an “expert” (or homogeneous group) would come up with on their own anyway, because the research shows that they usually do. Just do the math.

[more links about building data science teams to be added]

]]>
https://www.becomingadatascientist.com/2016/08/06/boosting-in-machine-learning-as-a-metaphor-for-diverse-teams/feed/ 1
Becoming a Data Scientist Podcast Episode 13: Debbie Berebichez https://www.becomingadatascientist.com/2016/07/14/becoming-a-data-scientist-podcast-episode-13-debbie-berebichez/ https://www.becomingadatascientist.com/2016/07/14/becoming-a-data-scientist-podcast-episode-13-debbie-berebichez/#respond Fri, 15 Jul 2016 02:52:29 +0000 https://www.becomingadatascientist.com/?p=1121

In this interview, we meet physicist Debbie Berebichez, who you might recognize from her TEDx talks, her appearances in Discovery Channel’s Outrageous Acts of Science and other TV shows! Debbie grew up in Mexico City and was discouraged by her family and teachers from studying science, but later went on to become the first Mexican woman to get a PhD in physics from Stanford, and is now Chief Data Scientist at Metis Data Science Bootcamp in New York.

Podcast Audio Links:
Link to podcast Episode 13 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Podcast on iTunes

Podcast Video Playlist:
Youtube playlist of interview videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Learning Club Activity 13: Show & Tell
Data Science Learning Club Meet & Greet
Now sponsored by DataCamp!

Links to topics mentioned by Debbie in the interview:
Metis Data Science Training
[more coming soon]




]]>
https://www.becomingadatascientist.com/2016/07/14/becoming-a-data-scientist-podcast-episode-13-debbie-berebichez/feed/ 0
Becoming a Data Scientist Podcast Episode 12: Data Science Learning Club Members https://www.becomingadatascientist.com/2016/06/15/becoming-a-data-scientist-podcast-episode-12-data-science-learning-club-members/ https://www.becomingadatascientist.com/2016/06/15/becoming-a-data-scientist-podcast-episode-12-data-science-learning-club-members/#comments Wed, 15 Jun 2016 05:08:12 +0000 https://www.becomingadatascientist.com/?p=1089

Verena, David, Kerry, and Anthony are members of the Becoming a Data Scientist Podcast Data Science Learning Club! They appear in the order in which they joined the club, and each discuss their starting points before joining, their participation in the activities, and advice they have for new data science learners.

Podcast Audio Links:
Link to podcast Episode 12 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Podcast on iTunes

Podcast Video Playlist:
Youtube playlist of interview videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Data Science Learning Club Meet & Greet
Now sponsored by DataCamp!

1) Verena Haunschmid

bioinformatics

R Markdown
ggplot2
jupyter

Data Science Learning Club Activity 07: Linear Regression
Verena’s Results for Linear Regression on Salary Dataset

  

Verena’s website
@ExpectAPatronum on Twitter

GPS Cat Tracking Project

2) David Asboth

business intelligence

SQL

City University London MSc Data Science

Coursera
Udacity
Khan Academy

Data Science Learning Club Activity 02: Creating Visuals for Exploratory Data Analysis
David’s results exploring London Underground data

Data Science Learning Club Activity 07: K-Means Clustering
David’s results using k-means to draw puppies in 3 colors

FlyLady (the house cleaning system I mentioned)

David’s website
@davidasboth on Twitter

3) Kerry Benjamin

Data Camp

Data Science Learning Club Activity 01: Find, Import, and Explore a Dataset
Kerry’s results for Activity 1 IGN Game Review Data exploration

Data Science Learning Club Activity 02: Creating Visuals for Exploratory Data Analysis
Kerry’s Blog Post about Activity 02 – “My First Data Set Part 2: The Fun Stuff”

ggplot2
dplyr
XLConnect

Blog post about Data Camp – “The Data Science Journey Begins”

Sharp Sight Labs

Kerry’s blog post “Getting Started in Data Science: A Beginner’s Perspective”

#Rstats (twitter hashtag)

Kerry’s Blog “The Data Logs”
@kerry_benjamin1 on Twitter

4) Anthony Peña

molecular biology
biotechnology

Data Science Learning Club Activity 07: K-Means Clustering
Anthony’s results for Activity 07

ggplot2
tidyr
dplyr

Data Camp

R bloggers

Anthony’s website
@agpena_ on Twitter



The Data Science Learning Club is sponsored by DataCamp!

]]> https://www.becomingadatascientist.com/2016/06/15/becoming-a-data-scientist-podcast-episode-12-data-science-learning-club-members/feed/ 1 Becoming a Data Scientist Podcast Episode 11: Stephanie Rivera https://www.becomingadatascientist.com/2016/05/30/becoming-a-data-scientist-episode-11-stephanie-rivera/ https://www.becomingadatascientist.com/2016/05/30/becoming-a-data-scientist-episode-11-stephanie-rivera/#respond Tue, 31 May 2016 01:15:28 +0000 https://www.becomingadatascientist.com/?p=1062

Stephanie Rivera has worked in machine learning and data science for academic research (at University of Tennessee), for the government (Department of Defense), for a large consulting firm (Booz Allen), and now for a startup (MyStrength). In the interview, she discusses her career path, her experiences with mentorship, and her role in authoring The Field Guide to Data Science and the Explore Data Science online course.

Podcast Audio Links:
Link to podcast Episode 11 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Podcast on iTunes

Podcast Video Playlist:
Youtube playlist of interview videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
[learning club activity coming soon]
Data Science Learning Club Meet & Greet
Now sponsored by DataCamp!

Links to topics mentioned by Stephanie in the interview:

machine learning

Odyssey of the Mind

Graph Theory

Total Domination in Graph Theory (pdf)

Some research publications by Stephanie:
Machines Watch you Surf the Web
Total domination dot-stable graphs

The University of Tennessee Knoxville Center for Intelligent Systems and Machine Learning (CISML)

Reinforcement Learning

Connect Four (game)

UTK Distributed Intelligence Laboratory

MATLAB

UTK Infant Perception Action Laboratory

“teach a man to fish” proverb

pattern recognition

Booz Allen Data Science

Natural Language Processing (NLP)

Explore Data Science (now via Metis)

Code School

Field Guide to Data Science

MyStrength (@mystrengthbh on twitter)

DataKind

Stephanie on Twitter @dataginjaninja




]]> https://www.becomingadatascientist.com/2016/05/30/becoming-a-data-scientist-episode-11-stephanie-rivera/feed/ 0 Becoming a Data Scientist Podcast Episode 10: Trey Causey https://www.becomingadatascientist.com/2016/05/01/becoming-a-data-scientist-podcast-episode-10-trey-causey/ https://www.becomingadatascientist.com/2016/05/01/becoming-a-data-scientist-podcast-episode-10-trey-causey/#respond Sun, 01 May 2016 05:16:38 +0000 https://www.becomingadatascientist.com/?p=1049

Trey Causey is a data scientist with a background in psychology and sociology who, like Renee, is from Virginia. He has worked as a data scientist at a range of companies from zulily to ChefSteps, and has also developed some interesting sports analytics projects, including the New York Times 4th Down bot. Trey also has advice for people wanting to start a career in data science.

Podcast Audio Links:
Link to podcast Episode 10 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Podcast on iTunes

Podcast Video Playlist:
Youtube playlist of interview videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
[learning club activity coming soon]
Data Science Learning Club Meet & Greet
Now sponsored by DataCamp!

Links to topics mentioned by Trey in the interview:

Commodore VIC-20
Bulletin Board
C++
Pascal
BASIC

Virginia Tech
Odyssey of the Mind

University of Washington Sociology

Complexity Theory and organizations

[more links to come! …sorry for all of the delays on getting this episode out! -Renee]

treycausey.com
@treycausey




]]> https://www.becomingadatascientist.com/2016/05/01/becoming-a-data-scientist-podcast-episode-10-trey-causey/feed/ 0 Becoming a Data Scientist Podcast Episode 09: Justin Kiggins https://www.becomingadatascientist.com/2016/04/11/becoming-a-data-scientist-episode-09-justin-kiggins/ https://www.becomingadatascientist.com/2016/04/11/becoming-a-data-scientist-episode-09-justin-kiggins/#respond Tue, 12 Apr 2016 03:29:02 +0000 https://www.becomingadatascientist.com/?p=1020



Justin Kiggins, who calls himself a “full stack neuroscientist” talks to Renee about how he started as a musician majoring in music therapy, switched to mechanical engineering, and eventually made his way via biomedical engineering and neuroscience to study auditory perception and the brains of communicating birds.

Podcast Audio Links:
Link to podcast Episode 9 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Podcast on iTunes

Podcast Video Playlist:
Youtube playlist of interview videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Learning Club Activity 9: Normalization [coming soon]
Data Science Learning Club Meet & Greet
Now sponsored by DataCamp!

Links to topics mentioned by Justin in the interview:

MATLAB

echo localization

convolution

Fulbright scholar

Ushahidi

HarassMap

European Starling
European Starling
video of starling singing
European Starling song file from Justin [1 min wav]

bird song recursive syntactic structure

Zebra Finch song

spectral analysis [pdf]

Neuron

brain electrodes

numpy
spark
thunder
pandas

birth doula

Allen Institute

Jobs for New Data Scientists website mentioned by Renee after interview




]]> https://www.becomingadatascientist.com/2016/04/11/becoming-a-data-scientist-episode-09-justin-kiggins/feed/ 0 Becoming a Data Scientist Podcast Episode 08: Sebastian Raschka https://www.becomingadatascientist.com/2016/03/28/becoming-a-data-scientist-podcast-episode-08-sebastian-raschka/ https://www.becomingadatascientist.com/2016/03/28/becoming-a-data-scientist-podcast-episode-08-sebastian-raschka/#respond Tue, 29 Mar 2016 03:01:00 +0000 https://www.becomingadatascientist.com/?p=995

Renee interviews computational biologist, author, data scientist, and Michigan State PhD candidate Sebastian Raschka about how he became a data scientist, his current research, and about his book Python Machine Learning. In the audio interview, Sebastian also joins us to discuss k-fold cross-validation for our model evaluation Data Science Learning Club activity.

Podcast Audio Links:
Link to podcast Episode 8 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Podcast on iTunes

Podcast Video Playlist:
Youtube playlist of interview videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Learning Club Activity 8: Evaluation Metrics [coming soon]
Data Science Learning Club Meet & Greet
Now sponsored by DataCamp!

Links to topics mentioned by Sebastian in the interview:

computational biology

molecular docking

Protein-ligand docking

DNA -> RNA -> protein

protein signaling pathways

graph theory

Ensemble learning

cost function

fitness function

ligand and binding affinity

sea lamprey

pheromone

SiteInterlock project

Neural Network

Random Forest

Sebastian’s Python Machine Learning repository on GitHub

Python Machine Learning Book on DataSciGuide

scikit-learn Voting Classifier

softmax regression

stochastic gradient descent

multilayer perceptron

logistic regression (from Sebastian’s github)

regularization in logistic regression (from Sebastian’s github)

Keras deep learning library

@rasbt on Twitter
Sebastian Raschka on Quora


Sebastian’s book on Amazon:




]]> https://www.becomingadatascientist.com/2016/03/28/becoming-a-data-scientist-podcast-episode-08-sebastian-raschka/feed/ 0 Becoming a Data Scientist Podcast Episode 07: Enda Ridge https://www.becomingadatascientist.com/2016/03/15/becoming-a-data-scientist-podcast-episode-07-enda-ridge/ https://www.becomingadatascientist.com/2016/03/15/becoming-a-data-scientist-podcast-episode-07-enda-ridge/#respond Tue, 15 Mar 2016 05:47:38 +0000 https://www.becomingadatascientist.com/?p=983

Data Scientist, Author, and manager of data science teams Enda Ridge talks to us about data governance, data provenance, reproducible analysis, work pipelines and products, and people, among other topics covered in his book “Guerrilla Analytics – A Practical Approach to Working with Data: The Savvy Manager’s Guide”.

Podcast Audio Links:
Link to podcast Episode 7 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Podcast on iTunes

Podcast Video Playlist:
Youtube playlist of interview videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Learning Club Activity 7: Linear Regression [coming soon]
Data Science Learning Club Meet & Greet
Now sponsored by DataCamp!

More show notes coming soon!

@enda_ridge

Enda’s book on Amazon:


]]>
https://www.becomingadatascientist.com/2016/03/15/becoming-a-data-scientist-podcast-episode-07-enda-ridge/feed/ 0
Becoming a Data Scientist Podcast Episode 06: Erin Shellman https://www.becomingadatascientist.com/2016/02/29/becoming-a-data-scientist-podcast-episode-06-erin-shellman/ https://www.becomingadatascientist.com/2016/02/29/becoming-a-data-scientist-podcast-episode-06-erin-shellman/#respond Mon, 29 Feb 2016 20:09:55 +0000 https://www.becomingadatascientist.com/?p=949

In this episode, Renee interviews Bioinformatics PhD and Data Scientist Erin Shellman about her path to becoming a data scientist, including jobs at Nordstrom Innovation Lab and zymergen. Erin discusses school, job interviews, teaching, and eventually getting to do data science within her field of scientific expertise.

Podcast Audio Links:
Link to podcast Episode 6 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Podcast on iTunes

Podcast Video Playlist:
Youtube playlist of interview videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Learning Club Activity 6: k-Means Clustering [coming soon]
Data Science Learning Club Meet & Greet

Bioinformatics
Evolutionary Biology
Economics Game Theory
Machine Learning
Biostatistics
Information Science
Systems Biology
Systems Modeling
Comparative Genomics

Human Genome Project

NIH Computational Biosciences

Data Scientists at Work

Nordstrom Innovation Lab (old innovation lab links inactive – appears to be the Nordstrom Technology People Lab now)

Recommender System

million song dataset

Jim Vallandingham (d3)

Crushed It! Landing a Data Science Job

zymergen

University of Michigan Computational Medicine and Bioinformatics

high throughput assays

R
dplyr
ggvis
ggvis interactive controls
ggplot2
R Markdown
Hadley Wickham

Amazon Web Services
AWS S3

Elements of Statistical Learning book

BI Tech CP303 (course Erin taught at University of Washington – use arrow keys to go through slides)
GitHub repository for class
regression
classification – logistic regression, trees
market basket analysis
clustering
UW Business Intelligence Certification

Erin’s website ErinShellman.com
@erinshellman

]]>
https://www.becomingadatascientist.com/2016/02/29/becoming-a-data-scientist-podcast-episode-06-erin-shellman/feed/ 0
Data Science Learning Club Update https://www.becomingadatascientist.com/2016/02/20/data-science-learning-club-update/ https://www.becomingadatascientist.com/2016/02/20/data-science-learning-club-update/#respond Sun, 21 Feb 2016 04:57:51 +0000 https://www.becomingadatascientist.com/?p=931 For anyone that hasn’t yet joined the Becoming a Data Scientist Podcast Data Science Learning Club, I thought I’d write up a summary of what we’ve been doing!

The first activity involved setting up a development environment. Some people are using R, some using python, and there are several different development tools represented. In this thread, several people posted what setup they were using. I posted a “hello world” program and the code to output the package versions.
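For anyone setting up now, a “hello world” plus version check might look something like the sketch below. This is not the code I posted in the thread; it’s a minimal version using Python’s standard `importlib.metadata` module (Python 3.8+), and the package list is just an assumed example, so swap in whatever you installed.

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def package_versions(names):
    """Look up the installed version of each package, or note that it is missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


print("Hello, world! Running Python", sys.version.split()[0])

# Assumed example list; edit to match your own environment.
for name, ver in package_versions(["numpy", "pandas", "matplotlib"]).items():
    print(f"{name}: {ver}")
```

Posting this output alongside your code makes it much easier for other club members to reproduce your results.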

Activities 1-3 built upon one another to explore a dataset and generate descriptive statistics and visuals, culminating with a business Q&A:

I analyzed a subset of data from the eBird bird observation dataset from Cornell Ornithology for these activities. Some highlights included:

– Learning how to use the pandas python package to explore a dataset (code)

– Learning how to create cool exploratory visuals in Seaborn and Tableau. Here is an example scatterplot matrix made in Seaborn:


– I was most excited to learn how to build interactive Jupyter Notebook inputs, which I used to control Bokeh data visualizations to display Ruby-Throated Hummingbird migration into North America (notebook). Unfortunately, until I host them on a server where you can run the “live” version, you won’t be able to see the interactive widgets (a slider and dynamic dropdowns), but you can see a video of the slider working here:

Here’s my final output for Activity 3, a Jupyter Notebook (with code hidden, and unfortunately interactive widgets disabled) with the Q&A about the hummingbird migration:
Ruby-Throated Hummingbird Migration into North America


Activity 4 was built as a catch-up week for those of us who were behind, with some suggested math concepts to learn for those who had time.

We’re currently working on Activity 5, our first machine learning activity where we’re implementing Naive Bayes Classification.
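To give a feel for what Naive Bayes classification involves, here is a rough sketch of a categorical version with add-one (Laplace) smoothing, written with only the standard library. It is an illustration under stated assumptions, not the code from the learning club or the repository: the function names are made up for this example, and the tiny bird-sighting dataset is invented.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    # value_counts[label][feature_index][value] -> count
    value_counts = defaultdict(lambda: defaultdict(Counter))
    vocab = defaultdict(set)  # distinct values seen for each feature
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[label][i][value] += 1
            vocab[i].add(value)
    return class_counts, value_counts, vocab


def predict(model, row):
    """Return the label with the highest smoothed log posterior."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for i, value in enumerate(row):
            seen = value_counts[label][i][value]
            # Add-one smoothing so an unseen value does not zero out the product.
            score += math.log((seen + 1) / (count + len(vocab[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy data: (season, habitat) -> species
rows = [
    ("spring", "garden"), ("spring", "forest"), ("summer", "garden"),
    ("winter", "forest"), ("winter", "field"), ("fall", "field"),
]
labels = ["hummingbird"] * 3 + ["junco"] * 3

model = train_naive_bayes(rows, labels)
print(predict(model, ("summer", "garden")))
print(predict(model, ("winter", "field")))
```

The "naive" part is the assumption that features are independent given the class, which is why the per-feature log probabilities can simply be added.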

All of my work is available in this github repository: https://github.com/paix120/DataScienceLearningClubActivities

I strongly encourage you to click through the forums and look at some of the other data explorations the members have been doing, including analysis of NFL data, personal music listening habits, transportation in London, German Soccer League data, top-grossing movies, and more!

It’s never too late to join the Data Science Learning Club! If you aren’t sure where to start, check out the welcome message for some clarification.

I’ll post again when I complete some of the machine learning activities!

]]>
https://www.becomingadatascientist.com/2016/02/20/data-science-learning-club-update/feed/ 0
Becoming a Data Scientist Podcast Episode 05: Clare Corthell https://www.becomingadatascientist.com/2016/02/14/becoming-a-data-scientist-podcast-episode-05-clare-corthell/ https://www.becomingadatascientist.com/2016/02/14/becoming-a-data-scientist-podcast-episode-05-clare-corthell/#respond Mon, 15 Feb 2016 04:13:03 +0000 https://www.becomingadatascientist.com/?p=900


Renee Teate interviews Clare Corthell, founding partner of summer.ai (now Luminant Data) and creator of the Open Source Data Science Masters curriculum, about becoming a data scientist.

Podcast Audio Links:
Link to podcast Episode 5 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Podcast on iTunes

Podcast Video Playlist:
YouTube playlist of interview videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Learning Club Activity 5: Naive Bayes Classification
Data Science Learning Club Meet & Greet

Resources/topics mentioned by Clare in the interview:

Management Science and Engineering
Markov Chains
Science, Technology, and Society at Stanford

A Challenge to Data Scientists (blog post Renee mentioned)

Mattermark
Product Management
Machine Learning

Open Source Data Science Masters
Nate Silver’s book The Signal and the Noise

Linear Algebra (on Khan Academy)

Bill Howe’s Introduction to Data Science Coursera Course

Recurrent Neural Nets
Bayesian Networks

python

Google Prediction API

data cleaning

Open Source Data Science Masters on GitHub (pull requests welcome!)

summer.ai (Update 2/15 – Clare’s company is now Luminant Data, Inc.)
@ClareCorthell on twitter

Other links:

SlideShare Slides about Open Source Data Science Masters

Talk Clare gave at Wrangle Conference about AI Design for Humans

]]>
https://www.becomingadatascientist.com/2016/02/14/becoming-a-data-scientist-podcast-episode-05-clare-corthell/feed/ 0
Becoming a Data Scientist Podcast Episode 04: Sherman Distin https://www.becomingadatascientist.com/2016/02/02/becoming-a-data-scientist-podcast-episode-04-sherman-distin/ https://www.becomingadatascientist.com/2016/02/02/becoming-a-data-scientist-podcast-episode-04-sherman-distin/#respond Tue, 02 Feb 2016 05:48:19 +0000 https://www.becomingadatascientist.com/?p=859

In Episode 4 of the Becoming a Data Scientist Podcast, we meet Sherman Distin, owner of analytics consulting firm QueryBridge. We discuss his primarily self-taught path to learning the data science techniques he uses to find business insights in marketing data, and he also tells us what he thinks is the most important trait he looks for in data scientists.


Podcast Audio Links:
Link to podcast Episode 4 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Podcast on iTunes

Podcast Video Playlist:
YouTube playlist of interview videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Learning Club Activity 4: Learn a New Math Concept [to be posted Tuesday]
Data Science Learning Club Meet & Greet

Resources/topics mentioned by Sherman in the interview:

QueryBridge (Sherman’s business)

linear regression

Target Pregnant Customer story

Excel Solver

EBITA

Survival Analysis
Proportional Hazards Model

Analytics Vidhya

@ShermanDistin on Twitter

Six Sigma

Econometrics

Sherman Distin on Facebook and LinkedIn

]]>
https://www.becomingadatascientist.com/2016/02/02/becoming-a-data-scientist-podcast-episode-04-sherman-distin/feed/ 0
Becoming a Data Scientist Podcast Episode 03: Shlomo Argamon https://www.becomingadatascientist.com/2016/01/18/becoming-a-data-scientist-podcast-episode-03-shlomo-argamon/ https://www.becomingadatascientist.com/2016/01/18/becoming-a-data-scientist-podcast-episode-03-shlomo-argamon/#respond Mon, 18 Jan 2016 06:08:17 +0000 https://www.becomingadatascientist.com/?p=846

Note: The video is the interview only. The audio podcast has the intro, interview, and data science learning club activity explanation.

In Episode 3 of the Becoming a Data Scientist Podcast, we meet Shlomo Argamon, who is the founding director of the Master of Data Science program at Illinois Institute of Technology. He talks to us about his path to data science, including research in robotic vision and natural language processing, we discuss the traits of a good data science student, and he gives some advice for those of us learning data science.

Podcast Audio Links:
Link to podcast Episode 3 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Update 1/19: You should be able to find it on iTunes now!

Podcast Video Playlist:
YouTube playlist of interview videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Learning Club Activity 3: Business Questions and Communicating Data Answers [to be updated Monday]
Data Science Learning Club Meet & Greet

Here are the links to things Shlomo references in the video:

Illinois Institute of Technology – Professional Master of Data Science Degree

punchcards

machine vision
robotic mapping
Google Scholar Search for Shlomo Argamon’s publications related to robotics
“Passive map learning and visual place recognition” Doctoral Dissertation [ps.gz from yale]

probability theory
probability distributions
statistical inference
bayesian statistics

Kaggle competitions

Natural Language Processing (NLP)
Google Scholar Search for Shlomo Argamon’s publications related to language
“Automatically Categorizing Written Texts by Author Gender” [Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni]

Weka
scikit-learn
Natural Language Toolkit (nltk)

sentiment analysis

Ethics in Data Science at IIT
Becoming a Data Scientist – A Challenge to Data Scientists (re: bias)

@ShlomoArgamon on Twitter

]]>
https://www.becomingadatascientist.com/2016/01/18/becoming-a-data-scientist-podcast-episode-03-shlomo-argamon/feed/ 0
Becoming a Data Scientist Podcast Episode 02: Safia Abdalla https://www.becomingadatascientist.com/2016/01/04/becoming-a-data-scientist-podcast-episode-02-safia-abdalla/ https://www.becomingadatascientist.com/2016/01/04/becoming-a-data-scientist-podcast-episode-02-safia-abdalla/#respond Mon, 04 Jan 2016 18:30:00 +0000 https://www.becomingadatascientist.com/?p=832

Note: The video is the interview only. The audio podcast has the intro, interview, and data science learning club activity explanation.

In Episode 2 of the Becoming a Data Scientist Podcast, we meet Safia Abdalla, who started programming and even exploring machine learning and natural language processing as a teenager, and is now a student at Northwestern University, a conference speaker and trainer, co-organizer of PyLadies Chicago, and a contributor to Project Jupyter.

Podcast Audio Links:
Link to podcast Episode 2 audio
Podcast’s RSS feed for podcast subscription apps
(I will distribute the feed out to iTunes and Pocket Cast ASAP. It’s available on Stitcher now!)

Podcast Video Playlist:
YouTube playlist where I’ll publish future videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Learning Club Activity 2: Creating visuals for exploratory data analysis
Data Science Learning Club Meet & Greet

Here are the links to things Safia references in the video:

dll
registry
BIOS

python

Alan Turing
ENIAC

information retrieval
Introduction to Information Retrieval by C. D. Manning, P. Raghavan, H. Schütze

natural language processing
NLTK
machine learning

Northwestern Neuroscience and Robotics Lab

SEO

pyladies
Chicago PyLadies Meetups

write speak code

pair programming

mathematicalmonk’s YouTube series on machine learning

@captainsafia on twitter
Safia’s website
Safia’s blog

JupyterDay Chicago 2016 (post by Safia on jupyter.org)
Jupyter documentation

]]>
https://www.becomingadatascientist.com/2016/01/04/becoming-a-data-scientist-podcast-episode-02-safia-abdalla/feed/ 0
Podcast Available on Stitcher https://www.becomingadatascientist.com/2015/12/21/podcast-available-on-stitcher/ https://www.becomingadatascientist.com/2015/12/21/podcast-available-on-stitcher/#respond Mon, 21 Dec 2015 20:07:15 +0000 https://www.becomingadatascientist.com/?p=821 The Becoming a Data Scientist Podcast is now available via Stitcher! Subscribe in the app, or listen online:



If you have another podcast app, you can subscribe by entering the RSS feed link: https://www.becomingadatascientist.com/feed/podcast

There is also a built-in audio player here on the blog that I link to in each episode: https://www.becomingadatascientist.com/podcast/

I’m working on getting a logo now, so hopefully it won’t have a placeholder image for long :) I want to submit it to iTunes, but I have to download the dreaded iTunes desktop software in order to submit and manage it… ugh ridiculous.

Anyway, enjoy it on the blog and on Stitcher for now!

]]>
https://www.becomingadatascientist.com/2015/12/21/podcast-available-on-stitcher/feed/ 0
Becoming A Data Scientist Podcast Episode 01: Will Kurt https://www.becomingadatascientist.com/2015/12/21/becoming-a-data-scientist-podcast-episode-01-will-kurt/ https://www.becomingadatascientist.com/2015/12/21/becoming-a-data-scientist-podcast-episode-01-will-kurt/#comments Mon, 21 Dec 2015 06:40:32 +0000 https://www.becomingadatascientist.com/?p=794

Note: The video is the interview only. The audio podcast has the intro, interview, and data science learning club activity explanation.

In this episode we meet Will Kurt, who talks about his path from English & Literature and Library & Information Science degrees to becoming the Lead Data Scientist at KISSmetrics. He also tells us about his probability blog, Count Bayesie, and I introduce Data Science Learning Club Activity 1. Will has some great advice for people learning data science!

Podcast Audio Links:
Link to podcast Episode 1 audio
Podcast’s RSS feed for podcast subscription apps
(I will distribute the feed out to sites like iTunes and Stitcher this week)

Podcast Video Playlist:
YouTube playlist where I’ll publish future videos

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Learning Club Activity 1: Find and explore a dataset
Data Science Learning Club Meet & Greet

Here are the links to things Will references in the video:

R

python

Scala

Lua

Tandy 1000

Prolog

Library and Information Science

Foucault

ARPANET

Support Vector Machines

CoffeeScript

Andrew Ng’s Machine Learning course on Coursera

probabilistic graphical models

T.S. Eliot’s Four Quartets

Articulate

KISSmetrics

Count Bayesie blog
Count Bayesie – Parameter Estimation and Hypothesis Testing

R Markdown

Donald Knuth
Literate programming

ggplot2

jupyter

Claude Shannon’s Mathematical Theory of Communication

Count Basie (musician)

Count Bayesie – Measure Theory
Bayes’ Theorem with Lego
Voight-Kampff and Bayes Factor
Black Friday Puzzle – Markov Chains

Zen Buddhism concept of “beginner’s mind”

Count Bayesie Recommended Books on Probability and Statistics

]]>
https://www.becomingadatascientist.com/2015/12/21/becoming-a-data-scientist-podcast-episode-01-will-kurt/feed/ 4
Becoming A Data Scientist Podcast Episode 0: Me! https://www.becomingadatascientist.com/2015/12/14/becoming-a-data-scientist-podcast-episode-0-me/ https://www.becomingadatascientist.com/2015/12/14/becoming-a-data-scientist-podcast-episode-0-me/#comments Mon, 14 Dec 2015 19:23:59 +0000 https://www.becomingadatascientist.com/?p=779
Here is the first episode of the Becoming a Data Scientist Podcast, which is also available in video form!

(sorry for the poor video quality!)

In this episode, I talk a little about the podcast, I talk about my own background, and I introduce the Data Science Learning Club. Enjoy!
(Note: Episode 1, the first interview episode, comes out Monday 12/21!)

Podcast Audio Links:
Link to podcast Episode 0 audio
Podcast’s RSS feed for podcast subscription apps
(I will distribute this out to sites like iTunes and Stitcher soon)

Podcast Video Playlist:
YouTube playlist where I’ll publish future videos

More about the Data Science Learning Club:
Blog post about Data Science Learning Club
Learning Club Activity 0: Set up your development environment
Data Science Learning Club Meet & Greet

Here are the links with more info of things I reference in the video:

turtle logo programming language

carmen sandiego
lemmings
SimCity

C programming language

JMU Integrated Science and Technology (ISAT)

Visual Basic/VB.NET/ASP.NET
MS Access

Rosetta Stone

PL/SQL
Oracle Data Warehouse
IBM Cognos

CGEP UVA Systems Engineering
Systems Engineering
Linear Algebra at Khan Academy
Stochastic Simulation
Optimization
Cognitive Systems Engineering
Principles of Data Visualization for Exploratory Data Analysis
Machine Learning
Naive Bayes
K-Means
Pattern Recognition and Machine Learning (class textbook)

Summer of Data Science
API and Market Basket Analysis
Jupyter
Docker and Jupyter
Doing Data Science by Cathy O’Neil and Rachel Schutt
O’Reilly Data Science Books
(I’ll post more specific books later)

]]>
https://www.becomingadatascientist.com/2015/12/14/becoming-a-data-scientist-podcast-episode-0-me/feed/ 8 Data Science Learning Club https://www.becomingadatascientist.com/2015/12/13/data-science-learning-club/ https://www.becomingadatascientist.com/2015/12/13/data-science-learning-club/#respond Mon, 14 Dec 2015 02:25:57 +0000 https://www.becomingadatascientist.com/?p=766 I’m working on the last of my recording and editing for “Episode 0” of the new Becoming A Data Scientist Podcast, which I’m planning to launch tomorrow! I’ve already recorded the interviews for episodes 1-3, which will be airing over the next month or so – so exciting! The guests all had interesting and informative things to share, I believe you’ll like it a lot.

At the end of each podcast episode, I’ll be “assigning” a “Learning Activity” for the Data Science Learning Club. So that is starting tomorrow, too! There won’t be anyone teaching the content, but we’ll be exploring it together for 1-2 weeks between podcast episodes (usually 2 weeks). I’ll post some resources to get everyone started and help out data science beginners, then we’ll each explore the activity on our own with whatever tools and techniques we choose, and we can post our results so we can all learn from one another. If anyone gets stuck, you can post a question to the forum and hopefully someone will be able to help you through it.

I just got the Data Science Learning Club forum set up today, and it’s at this URL: https://www.becomingadatascientist.com/learningclub

Go check it out, register so you can participate, read the Welcome thread, and introduce yourself in the Meet & Greet section! Then tomorrow, the first learning activity will launch and you can get started.

I’m so excited about launching this podcast and data science learning club, and hope this turns out to be a valuable experience for all of us! Keep an eye out on the blog for the podcast post, which should go up tomorrow!

Renee

]]>
https://www.becomingadatascientist.com/2015/12/13/data-science-learning-club/feed/ 0
A Challenge to Data Scientists https://www.becomingadatascientist.com/2015/11/22/a-challenge-to-data-scientists/ https://www.becomingadatascientist.com/2015/11/22/a-challenge-to-data-scientists/#comments Sun, 22 Nov 2015 05:22:57 +0000 https://www.becomingadatascientist.com/?p=719

As data scientists, we are aware that bias exists in the world. We read up on stories about how cognitive biases can affect decision-making. We know that, for instance, a resume with a white-sounding name will receive a different response than the same resume with a black-sounding name, and that writers of performance reviews use different language to describe contributions by women and men in the workplace. We read stories in the news about ageism in healthcare and racism in mortgage lending.

Data scientists are problem solvers at heart, and we love our data and our algorithms that sometimes seem to work like magic, so we may be inclined to try to solve these problems stemming from human bias by turning the decisions over to machines. Most people seem to believe that machines are less biased and more pure in their decision-making – that the data tells the truth, that the machines won’t discriminate.

However, we must remember that humans decide what data to collect and report (and whether to be honest in their data collection), what data to load into our models, how to manipulate that data, what tradeoffs we’re willing to accept, and how good is good enough for an algorithm to perform. Machines may not inherently discriminate, but humans ultimately tell the machines what to do, and then translate the results into information for other humans to use.

We aim to feed enough parameters into a model, and improve the algorithms enough, that we can tell who will pay back that loan, who will succeed in school, who will become a repeat offender, which company will make us money, which team will win the championship. If we just had more data, better processing systems, smarter analysts, smarter machines, we could predict the future.

I think Chris Anderson was right in his 2008 Wired article when he said “The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world,” but I think he was wrong when he said that petabyte-scale data “forces us to view data mathematically first and establish a context for it later,” and “With enough data, the numbers speak for themselves.” To me, context always matters. And numbers do not speak for themselves, we give them voice.

How aware are you of bias as you are building a data analysis, predictive model, visualization, or tool?

How complete, reliable, and representative is your dataset? Was your data collected by a smartphone app? Phone calls to listed numbers? Sensors? In-person surveying of whoever is out in the middle of the afternoon in the neighborhood your pollsters are covering, and agrees to stop and answer their questions?

Did you remove incomplete rows in your dataset to avoid problems your algorithm has with null values? Maybe the fact that the data was missing was meaningful; maybe the data was censored and not totally unknown. As Claudia Perlich warns, after cleaning your dataset, your data might have “lost its soul”.
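One quick way to see whether dropping incomplete rows changes who is in the data is to compare group shares before and after. The sketch below uses invented survey rows and made-up field names (`group`, `income`); it only illustrates the check and is not a recommendation for any particular way of handling missing values.

```python
from statistics import mean

# Invented rows: income is missing (None) more often for group "B".
rows = [
    {"group": "A", "income": 52000},
    {"group": "A", "income": 61000},
    {"group": "A", "income": 58000},
    {"group": "B", "income": 31000},
    {"group": "B", "income": None},
    {"group": "B", "income": None},
]


def group_share(records, group):
    """Fraction of records belonging to the given group."""
    return sum(r["group"] == group for r in records) / len(records)


# "Cleaning" by dropping every row with a missing income.
complete = [r for r in rows if r["income"] is not None]

share_before = group_share(rows, "B")      # 3 of 6 rows
share_after = group_share(complete, "B")   # 1 of 4 rows
mean_income_complete = mean(r["income"] for r in complete)

# One alternative: keep every row and record the missingness as a feature.
flagged = [dict(r, income_missing=r["income"] is None) for r in rows]

print(share_before, share_after, mean_income_complete)
```

In this toy case, group "B" goes from half of the data to a quarter of it, and the average income of the "clean" data mostly describes group "A".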

Did you train your model on labeled data which already included some systematic bias?

It’s actually not surprising that a computer model built to evaluate resumes may eventually show the same biases as people do when you think about the details of how that model may have been built: Was the algorithm trained to evaluate applicants’ resumes against existing successful employees, who may have benefited from hiring biases themselves? Could there be a proxy for race or age or gender in the data even if you removed those variables? Maybe if you’ve never hired someone that grew up in the same zip code as a potential candidate, your model will dock them a few points for not being a close match to prior successful hires. Maybe people at your company have treated women poorly when they take a full maternity leave, so several have chosen to leave soon after they attempted to return, and the model therefore rates women of common childbearing age as having a higher probability of turnover, even though their sex and age are not (at least directly) the reason they left. In other words, our biases translate into machine biases when the data we feed the machine has biases built in, and we ask the machine to pattern-match.
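A simple, rough check for a proxy is to ask how well a remaining feature lets you guess the attribute you removed. The sketch below compares "guess the most common group within each zip code" against "always guess the most common group overall". The function names and the data are invented for illustration, and the measure is computed on the same rows it was fit to, so treat it as a warning light rather than a proper test.

```python
from collections import Counter, defaultdict


def proxy_accuracy(feature_values, protected_values):
    """Share of rows where the protected attribute is guessed correctly by
    predicting the most common protected value for each feature value."""
    by_feature = defaultdict(Counter)
    for feature, protected in zip(feature_values, protected_values):
        by_feature[feature][protected] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_feature.values())
    return correct / len(feature_values)


def baseline_accuracy(protected_values):
    """Accuracy of always guessing the overall most common protected value."""
    return Counter(protected_values).most_common(1)[0][1] / len(protected_values)


# Invented example: zip code versus a protected group label.
zips = ["11111", "11111", "11111", "22222", "22222", "22222", "33333", "33333"]
groups = ["x", "x", "x", "y", "y", "x", "y", "y"]

print(proxy_accuracy(zips, groups))   # how well zip code "predicts" the group
print(baseline_accuracy(groups))      # how well you can do with no features
```

A large gap between the two numbers suggests the feature carries much of the information you thought you had removed.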

We have to remember that Machine Learning effectively works by stereotyping. Our algorithms are often just creative ways to find things that are similar to other things. Sometimes, a process like this can reduce bias, if the system can identify predictors or combinations of predictors that may indicate a positive outcome, which a biased human may not consider if they’re hung up on another more obvious variable like race. However, as I mentioned before, we’re the ones training the system. We have to know where our data comes from, and how the ways we manipulate it can affect the results, and how the way we present those results can impact decisions that then impact people.

Data scientists, I challenge you. I challenge you to figure out how to make the systems you design as fair as possible.

Sure, it makes sense to cluster people by basic demographic similarity in order to decide who to send which marketing message to so your company can sell more toys this Christmas than last. But when the stakes are serious – when the question is whether a person will get that job, or that loan, or that scholarship, or that kidney – I challenge you to do more than blindly run a big spreadsheet through a brute-force system that optimizes some standard performance measure, or lazily group people by zip code and income and elementary school grades without seeking information that may be better suited for the task at hand. Try to make sure your cost functions reflect the human costs of misclassification as well as the business costs. Seek to understand your data, and to understand as much as possible how the decisions you make while building your model are affecting the outcome. Check to see how your model performs on a subset of your data that represents historically disadvantaged people. Speak up when you see your results, your expertise, your model being used to create an unfair system.
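Two of the suggestions above, weighting errors by their cost and checking performance on a subgroup, can be sketched in a few lines. Everything here is illustrative: the labels, predictions, group names, and cost values are invented, and a false negative being five times as costly as a false positive is an assumption made only for the example.

```python
def confusion_counts(y_true, y_pred):
    """Count (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fp, fn, tn


def weighted_cost(y_true, y_pred, fp_cost, fn_cost):
    """Total cost when the two kinds of mistakes are not equally bad."""
    _, fp, fn, _ = confusion_counts(y_true, y_pred)
    return fp * fp_cost + fn * fn_cost


def false_negative_rate_by_group(y_true, y_pred, groups):
    """Share of truly positive cases the model missed, per group."""
    rates = {}
    for g in sorted(set(groups)):
        idx = [i for i, value in enumerate(groups) if value == g]
        tp, _, fn, _ = confusion_counts(
            [y_true[i] for i in idx], [y_pred[i] for i in idx]
        )
        positives = tp + fn
        rates[g] = fn / positives if positives else None
    return rates


# Invented labels: 1 = "should have been approved".
y_true = [1, 1, 1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

print(weighted_cost(y_true, y_pred, fp_cost=1, fn_cost=5))
print(false_negative_rate_by_group(y_true, y_pred, groups))
```

Overall accuracy here is 5 out of 8, which hides the fact that every qualified person in group "b" was turned away while group "a" saw no errors at all.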

As data scientists, even though we know that systems we build can do a lot of good, we also know they can do a lot of harm. As data scientists, we know there are outliers. We know there are misclassifications. We know there are people and families and communities behind the rows in our dataframes.

I challenge you, Data Scientists, to think about the people in your dataset, and to take steps necessary to make the systems you design as unbiased and fair as possible. I challenge you to remain the human in the loop.

——————————–

 
 

The links throughout the article provide examples and references related to what is being discussed in each section. I encourage you to go back and click on them. Below are additional links with information that can help you identify and reduce biases in your analyses and models.

The GigaOm article “Careful: Your big data analytics may be polluted by data scientist bias” discusses some “bias-quelling tactics”

“Data Science: What You Already Know Can Hurt You” suggests solutions for avoiding “The Einstellung Effect”

Part I of the book Applied Predictive Modeling includes discussions of the modeling process and explains how each type of data manipulation during pre-processing can affect model outcome

This paper from the NIH outlines some biases that occur during clinical research and how to avoid them: “Identifying and Avoiding Bias in Research”

The study “Bias arising from missing data in predictive models” uses Monte Carlo simulation to determine how different methods of handling missing data affect odds-ratio estimates and model performance

Use these wikipedia articles to learn about Accuracy and Precision and Precision and Recall

A study in Clinical Chemistry examines “Bias in Sensitivity and Specificity Caused by Data-Driven Selection of Optimal Cutoff Values: Mechanisms, Magnitude, and Solutions”

More resources from a workshop on fairness, accountability, and transparency in machine learning

Edit: After listening to the SciFri episode I linked to in the comments, I found this paper “Certifying and removing disparate impact” about identifying and reducing bias in machine learning algorithms.

Edit 11/23: Carina Zona suggested that her talk “Consequences of an Insightful Algorithm” might be a good reference to include here. I agree!
(P.S. Sometimes the problem with turning a decision over to machines is that the machines can’t discriminate enough!)

Do you have a story related to data science and bias? Do you have additional links that would help us learn more? Please share in the comments!

]]>
https://www.becomingadatascientist.com/2015/11/22/a-challenge-to-data-scientists/feed/ 16
“Becoming a Data Scientist” Learning Club? https://www.becomingadatascientist.com/2015/11/08/becoming-a-data-scientist-learning-club/ https://www.becomingadatascientist.com/2015/11/08/becoming-a-data-scientist-learning-club/#comments Mon, 09 Nov 2015 02:12:37 +0000 https://www.becomingadatascientist.com/?p=714 I have been thinking about doing a “Becoming a Data Scientist” podcast for a long time, at least since April. The podcast would include interviews focused on how people working in various data-science-related jobs got to where they are today (how did they “become a data scientist”?). I’m getting closer to taking the dive and getting it started.

I had an idea today that would take it a step further. Imagine how book clubs work where you pick a book, go off and read it, then gather occasionally to discuss and record your thoughts. Except instead of a book club, it’s a data science learning club!

I’m imagining picking a topic/project, finding resources showing how to do it, and introducing it to the club at the end of a podcast episode. Then, everyone who wants to participate in learning how to do that particular thing will go off for maybe 2 weeks, work on it and learn what they can, ask each other questions in a common area like a blog post comment thread, create things and post them to a shared space, then at the end of the period post comments about what they learned and how it went. People could write blog posts about their projects and I would collect those and link to all of them from the original post. Anyone who already knows how to do it could help answer questions if they wanted to participate, too. I might invite some of the participants to talk about their learning experience on a follow-up episode, then the notes and results would be posted for future learners to find.

I think learning together would be fun and valuable, and this type of experience would fall somewhere between learning on your own and taking a class. It would include the pros of learning on your own and exploring, while offsetting some of the cons of going at it alone. It would be a significant time commitment on my part, so I want to make sure other people would join in before I commit. What do you think? Would you join a “data science learning club” and participate in something like this and find it valuable? It’s kind of like the Summer of Data Science, but we’d be learning the same things simultaneously and sharing our results. No one would be “teaching” the group necessarily, but we’d share resources and answer each other’s questions based on what we did individually.

Let me know in the comments or on twitter if you would find this valuable and if you want me to lead it!

BPDM’s interview with….. me! https://www.becomingadatascientist.com/2015/10/26/bpdms-interview-with-me/ https://www.becomingadatascientist.com/2015/10/26/bpdms-interview-with-me/#respond Tue, 27 Oct 2015 01:49:11 +0000 https://www.becomingadatascientist.com/?p=706 An organization based in Puerto Rico called “Broadening Participation in Data Mining” (BPDM) interviewed me over the weekend, and it’s online now! Without further ado….


Thanks to Orlando and Herbierto for having me on!

(P.S. I did put up the post about Data Sources on DataSciGuide)

Books for Data Science Beginners, and Data Sources https://www.becomingadatascientist.com/2015/10/26/books-for-data-science-beginners-and-data-sources/ https://www.becomingadatascientist.com/2015/10/26/books-for-data-science-beginners-and-data-sources/#respond Mon, 26 Oct 2015 13:56:17 +0000 https://www.becomingadatascientist.com/?p=701 I just wanted to note here on Becoming A Data Scientist that I recently wrote two posts over on Data Sci Guide that are getting some attention:

Books to Read if You Might Be Interested in Data Science
and
Data Sources & APIs for Data Science Projects

Enjoy!

Data Science Tutorials Flipboard Magazine https://www.becomingadatascientist.com/2015/10/21/data-science-tutorials-flipboard-magazine/ https://www.becomingadatascientist.com/2015/10/21/data-science-tutorials-flipboard-magazine/#comments Wed, 21 Oct 2015 14:31:10 +0000 https://www.becomingadatascientist.com/?p=697 I have been getting great feedback on my “Becoming a Data Scientist” Flipboard magazine, and I had this other set of articles bookmarked that didn’t quite fit into it. I want the Becoming a Data Scientist one to be the “best of the best” of articles I find on Twitter about data science, and to focus on understanding data science and related topics without getting too into the “nitty gritty”. However, I often come across great data-science-related tutorials on very specific topics that may not have broad appeal (and might look scary to beginners), and I wanted to share those too.

So I started the “Data Science Related Tutorials” Flipboard magazine. Enjoy!

Playing With Google Cloud Datalab https://www.becomingadatascientist.com/2015/10/18/google-datalab/ https://www.becomingadatascientist.com/2015/10/18/google-datalab/#comments Mon, 19 Oct 2015 02:59:47 +0000 https://www.becomingadatascientist.com/?p=686 This weekend, I played around with the newly-released Google Cloud Datalab. I learned how to use BigQuery and also played around with Google Charts vs Pandas+Matplotlib plots, since you can do both in Datalab.

I had a few frustrations with it because the documentation isn’t great. Sometimes it would also silently time out, and it wasn’t clear why nothing was running, but if I stopped all of the services, closed, restarted Datalab, and reopened, everything would work fine again. It’s clearly in Beta, but I had fun learning how to get it up and running, and it was cool to be able to write SQL in a Jupyter notebook.

I tried to connect to my Google Analytics account, but apparently you need a paid Pro account to do that, so I just connected to one of the built-in public datasets. If you view the notebooks, you will see I clearly wasn’t trying to do any in-depth analysis. I was just playing around and getting the queries, dataframes, and charts to work.

I hadn’t planned to get into too many details here, but wanted to share the results. I did jot down notes for myself as I set it up, which I’ll link to below, and you can see the two notebooks I made as I explored Datalab.

Exploring BigQuery and Google Charts
Version Using Pandas and Matplotlib
(These aren’t tidied up to look professional – please forgive any typos or messy approaches!)

Google Cloud Datalab Setup Notes (These are notes I jotted down for myself as I went through the setup steps. Sorry if they’re not intelligible!)

Becoming A Data Scientist Flipboard Magazine https://www.becomingadatascientist.com/2015/10/10/becoming-a-data-scientist-flipboard-magazine/ https://www.becomingadatascientist.com/2015/10/10/becoming-a-data-scientist-flipboard-magazine/#respond Sun, 11 Oct 2015 04:19:21 +0000 https://www.becomingadatascientist.com/?p=677 I love finding and sharing good articles about data science related topics on twitter, but I know not everyone is on twitter, and also sometimes tweets get quickly lost in the timeline and they’re easy to miss. So, I’ve started sharing the best articles via a Flipboard magazine as well!

Check it out! https://flipboard.com/@becomingdatasci/becoming-a-data-scientist-5ktft1lky

How To Use Twitter to Learn Data Science (or anything) https://www.becomingadatascientist.com/2015/10/04/how-to-use-twitter-to-learn-data-science-or-anything/ https://www.becomingadatascientist.com/2015/10/04/how-to-use-twitter-to-learn-data-science-or-anything/#comments Sun, 04 Oct 2015 21:09:59 +0000 https://www.becomingadatascientist.com/?p=653 When I decided that I wanted to become a data scientist, I started following some data scientists on twitter to see what they talk about and what was going on in the “industry”. Then I saw them pointing one another to resources and answering each other’s questions, and I realized I had only seen the tip of the iceberg of “Data Science Twitter”. That’s when I created a new twitter account.

———————–

A few things I should say first…. I think “data science” can be replaced by just about any other topic, but especially science & tech topics, so please keep that in mind as you read this. I follow a bunch of scientists on my “regular” personal twitter account @paix120, and I sense the same things going on in their communities as I’m about to outline for data science.

Another thing I want to mention is that I’ve had other “topical” twitter accounts. I created one called @womenwithdroids when I started a blog of the same name, and I was amazed at how many awesome women I met that were building android apps, wanted to learn more about how to use their android phones (which at the time were being marketed as a “manly” alternative to the “cutesy” iPhone), and wanted to join a community of women talking about android phones and apps. At the time, I had created a separate account because I saw it as a “business” account for my blog, but I realized that there was a lot of value in separating that from my personal account. I’ll go into that below. Now that you know a little background, let’s dive into how you can use twitter to learn just about anything.

———————–

I have explained to people I meet in person how much I gain from Twitter, and they often look at me like I’m a little nutty. I have heard a few recurring comments from them that I see as misconceptions:

  1. “I started using Twitter and was overwhelmed. I couldn’t keep up with my timeline.”

    My answer to that is that first, you’re not supposed to “keep up” with your Twitter timeline. I don’t use Facebook, but I get the impression that people that do will scroll back through every post that happened since the last time they visited, to make sure they don’t miss any important info from their friends. Twitter is not like that.

    On Twitter, you can jump on when you need a 5-minute break from work, read a few tweets, mark some longer stories to read later or go read an article or two now, and then get right back to work. People that use twitter won’t get mad if you miss one of their tweets. If something resonates with a lot of people, it will be retweeted and you will probably see it later. If not, it’s not a big deal. You see what you see when you’re online. Don’t worry about what you may have missed; that will just stress you out.

    Think of Twitter like the news. You may want to see if anything has just happened, what’s at the “top of the news”, or what people are talking about that happened recently. If there is a big news story, it will likely still be visible when you visit later. It would be stressful to try to keep up with every news article that’s published at any time.

    I just scroll back a half hour or so and scroll up until I’m ready to do something else. If I’m looking for tweets about a specific topic, I do a search and see what the top tweets are for it. You can narrow down the search results to “People You Follow” if you only want to see what people you are connected with are saying about the topic.

  2. “I started using Twitter and it was just a bunch of junk I didn’t care about.”

    Twitter has an onboarding problem. The problem used to be that when you started a new account, you weren’t following anyone, then people would feel lost and not know how to find interesting accounts to follow. Then they started suggesting interesting accounts. Now, the onboarding process shows you a whole bunch of “brand” accounts to follow (whether those are celebrities or companies, they are usually accounts generated to gain followers or money), then they also try to get you to import your email contacts and follow all of them. I don’t know about you, but I don’t care much about what celebrities have to say, and many of my email contacts are people that I had a short business exchange with years ago and have no interest in keeping up with now. It’s no wonder people start with an uninteresting and overwhelming timeline.

    My recommendation is that if there is someone in your timeline that frequently annoys you or tweets boring stuff, unfollow them. That’s just clutter. If you see a friend retweet something from someone you don’t follow that is interesting, click on that person’s profile, read a few tweets and see if they are tweeting other things that interest you, and if so, follow them. Constantly tailor your timeline to work for you.

    Another important suggestion is to use twitter lists. If there are certain people that you really do want to keep up with (like personal friends, or a small group of accounts on a very specific topic), put them in a list. You can also follow them in your normal timeline, but you don’t have to. When you click over to your list, you will see only tweets by those accounts. One example of how I use a list on my personal account is my “Harrisonburg businesses” list. I don’t frequently care about whether a local restaurant is having a special, or if there’s a cultural event going on at our local university. However, when I’m looking for something to do one night, I can click over to that list and see what the local businesses are tweeting about today. Are there any cool bands playing in town? A special at a local hangout? I follow very few of those 160+ accounts in my regular timeline, but now I have a collection of them in one place when I do want to scroll back through 24 hours of tweets to find something specific.

  3. “Social Media is a time suck for me, and I don’t want to add any more social feeds to my life to waste time on”

    OK, I can see this. It is easy to get sucked in and spend a lot of time on social media. To me, this is just a reason to optimize your account so it’s beneficial to you. If you’re just reading celebrity gossip and trending topics, are you improving your life? However, if you have a goal to become a data scientist, and you follow accounts that are actually educating you, is it so bad to spend some time “sucked into” a feed that is actually getting you closer to your goal in your “free time” and keeping you up to date on the latest topics that a colleague or future employer may expect you to know about?

———————–

Now that I’ve explained some misconceptions about Twitter, I want to explain why I have a separate account for “Data Science Renee”. I have had my personal account on Twitter since 2008. I have only really been into data science since late 2013. I have a “network” of people that I chat with about a variety of topics on my personal account, including political topics and random things that catch my attention. Here are my main reasons for starting a separate account for @becomingdatasci:

  • I personally wanted to separate the topic out. I wanted to go “all in” on data science, and have an account where I ONLY follow people that talk about data science, even people I wouldn’t follow in my normal timeline. I could have done this with a list, but I wanted to take it further than that.

  • I also wanted to be able to tweet like crazy about data science, and not feel like I had to hold back in order to avoid overwhelming my existing followers with a flood of tweets on a new topic. They might unfollow me if I started tweeting 20 times a day about data science when I had rarely mentioned it before, and my new interest might far outweigh my tweets on other topics I’m interested in. I didn’t want to lose that existing network.

  • The opposite is also true. I wanted to be able to connect this new account to my blog, and use it to make work connections, without worrying about including personal political views and tweets about gardening and cute animals in that feed! I also would know that people that follow this account are following it because of data science. I check out my followers on this account more often than I do on my personal account, because they’re more likely to share this particular interest with me.

  • I know I’m good at curating interesting articles about a topic, and I wanted this account to be considered a “go to” account that others could recommend to their friends interested in learning about data science, without worrying what else I might be tweeting about. I decided to become a sort of “learning data science channel”.

———————–

So you see why I have separated this account from my existing personal Twitter account, and how I have tailored it to work for me. But what does that mean? What have I actually gained from this twitter account?

  1. I have learned a LOT that I wouldn’t otherwise know about data science. Some of the people I follow tweet about terms that I wouldn’t have known to Google, and link to articles, academic publications, and tutorials about them. There is a constant flow of interesting new information coming out of the data science “industry” so I can keep up with what is being talked about right now and what is considered “state of the art” and exciting to other data scientists. It’s like being able to walk around and listen in on lunch tables at a data science conference. Everyone is talking about something slightly different, but all in the general topic of data science, and each person is homing in on what is interesting or exciting to them within this realm.

  2. I have made connections that I wouldn’t have made otherwise. I don’t have a lot of time or money to constantly travel to data science conferences and meet people in person. I live in a small town and there aren’t a lot of other people talking about data science here (yet). Twitter has given me a way to personally connect with other data scientists. I have connected with some that don’t live far from me, after all! I have connected with many that live in other countries that I likely wouldn’t even meet at a conference. These connections have cheered me on in my learning, connected me to resources, and more!

  3. I have become a “face” of a person learning data science. At one conference I did attend, I was recognized as “Data Science Renee”! I have been asked to be interviewed on podcasts and blogs (some of those should be coming up soon), offered contract work, and offered free admission to a conference I unfortunately couldn’t go to, but was excited to be considered for. “Famous” people in the industry are now coming to me to work with them in some way. New learners seem to look to me as a resource and guide, and want to see how I learned what I know, and how I have struggled, so they can compare that to their own experiences.

  4. I have found many other women working in data science. When I was first learning about data science, all of the “who’s who” lists of people to follow, people that were interviewed for books or other resources, and the “faces” of data science were often white or Asian men, with maybe one woman or minority included in the group. (This is typical of the tech industry.) However, as I made more and more connections, and started to seek out women and other minorities in the industry, I have been able to connect with them and learn from them and hopefully amplify their voices. I now have a twitter list with almost 450 women that work in data science or statistics, and now that list can be a resource for other women looking for role models like them in the industry!

  5. I have learned some specific data science tools and techniques. I regularly see great tutorials on twitter, via blog posts or videos or github links, that show me how to do something I have wanted to learn how to do. These would often be hard to find by searching, but come right to me in my twitter feed where I can bookmark them for later learning sessions.

  6. People on twitter have reached out to help me solve problems when I’m stuck. I have received tweets from people that built python packages I was using, people that had resources that could help me, or just people with general advice and feedback! If I’m clear about what I’m doing and where I’m stuck, I now have a strong enough follower base that I will almost always get a helpful answer!

  7. Not only do I find out about resources I wouldn’t otherwise have, but I see opinions of others on existing resources. A conversation on twitter about being overwhelmed by the vast amount of things there are to learn in the broad topic of “data science” helped inspire me to bring an idea I had been having to life. I have taken a course online that got really difficult at about the 5th lesson. I didn’t know whether it was just me and I had hit a roadblock, or if a lot of people found that course difficult and I just needed some outside resources to continue with it. I also often don’t know where to start in my long list of bookmarked “things to learn”. But seeing what people tweet about, and how others have learned, really is helping me on my learning journey. You can read about my new website DataSciGuide here. I’m hoping the ratings (and eventually learning guides and a recommender system) there will help others avoid “data science learning overwhelm”. (P.S. I’m now in the phase where I need reviews on the items I’ve posted, so please go rate some things!)

———————–

Hopefully this post has helped you understand how to use Twitter to join a community and learn something you have been wanting to learn! You can really gain a lot from it if you optimize its benefit to you like I have.

I know the question now will be, “So who are the best people to follow on twitter for data science?” I’m hesitant to answer that for you, since there are so many people out there, some with specific topics that would be better for you personally than what I would recommend. For instance, maybe you are especially interested in learning data science for sports analytics, which is a specific topic I don’t follow many people on.

If you follow me on @becomingdatasci and see who I retweet, you’ll find people that are sharing resources that I think are beneficial, so you can start there. You can’t go by my twitter favorites since I use those as bookmarks and haven’t read many of them yet. You could look through people I follow, but there are a lot of them, and they’re not ranked in a helpful way. You can also follow the list of data science women I mentioned above.

Others that are often good to start with are people with data science blogs, since they’re usually purposely writing to educate others. Here’s a large list of data science blogs that includes the twitter handle of the author or blog where applicable, and is sorted into categories. Check it out! https://blog.rjmetrics.com/2015/09/30/the-ultimate-guide-to-data-science-blogs-150-and-counting/

———————–

So to recap:

  1. Tailor your twitter timeline frequently. Unfollow those that annoy or bore you, and follow new accounts on topics you want to know more about.

  2. If you seriously want to home in on one topic, or to become a “channel” for a topic, create a separate account for it.

  3. Use twitter lists to create small lists of people you especially want to keep up with, or sub-specialty topics you occasionally want to dive into. You can follow accounts in lists that you might not otherwise follow in your timeline.

  4. Actually connect with other people. Find people like you that can be role models for your learning. Ask them questions. Help others out when they ask questions on a topic you know more about. Join the community and the conversation.

  5. Have fun and don’t get overwhelmed! Use others’ opinions and recommendations to carve out your learning path.

Comment below if you have any questions about using twitter to help learn data science!

DataSciGuide Contest https://www.becomingadatascientist.com/2015/10/02/datasciguide-contest/ https://www.becomingadatascientist.com/2015/10/02/datasciguide-contest/#respond Sat, 03 Oct 2015 01:42:34 +0000 https://www.becomingadatascientist.com/?p=650 Want a way to help people that are learning data science, and also get a chance to win a $40 Amazon Gift Card? Review a data science blog, podcast, course, or other content at DataSciGuide!

Here’s more info: http://www.datasciguide.com/review-stuff-and-win-a-40-amazon-gift-card/

Human Name Variations in Databases https://www.becomingadatascientist.com/2015/09/19/human-name-variations-in-databases/ https://www.becomingadatascientist.com/2015/09/19/human-name-variations-in-databases/#comments Sun, 20 Sep 2015 01:38:48 +0000 https://www.becomingadatascientist.com/?p=627 I normally write about my adventures learning data science here, but my expertise for years has been database design and reporting, and I have some knowledge to contribute to a discussion that I thought I’d document here.

A conversation on Twitter today about how people’s names are stored in databases, with stories of frustration from people that have had terrible customer/patient experience because of “unusual” names, made me want to write about this topic. When you search for information on name standards in databases, you will usually get information on field names, lengths, etc. What is harder to find is information on how to store the variety of names in a system of record.

To get an idea of how people are named in different cultures, see the W3C’s article about it at http://www.w3.org/International/questions/qa-personal-names

Some examples they give of names that may not be entered into a database the “traditional American way” are:

  • “Mao Ze Dong” – Mao is the family name, Dong is the given name, and Ze is a generational name common to all siblings in a family. In Chinese script, the names are not separated by spaces.
  • “José Eduardo Santos Tavares Melo Silva” – Brazilian name which includes many ancestral family names.
  • “Kogaddu Birappa Timappa Nair” – Indian name which includes village name, father’s name, given name, and last name.

You may think that people should just conform their name to our forms, like just choosing three names for “first”, “middle”, and “last”. Or maybe you think the data collection form should just have one entry field for “name” and not split it up. However, it’s not that simple, and the need to format names for various uses (like mailing labels, letters, etc.) provides additional challenges.

Unfortunately, sometimes the challenge is just getting an organization to accept your actual name at all. The next challenge, once your name is in a system of record, is how it ends up used. Different usages can end up complicating things like government IDs and driver’s licenses (if the state ID name rules don’t allow you to use the name that is on your federal record, for instance), insurance claim rejections (when your name doesn’t match up exactly between the doctor’s office and the insurance company databases), or multiple accounts at retail establishments like pharmacies (I have two different accounts at my local CVS, and when picking up medicine always have to confuse the person at the window by mentioning both my maiden and married names. Also, the name on my CVS discount card was mis-entered, so I have to spell my name incorrectly if they need to look that up for any reason).

Here’s my experience with a “nontraditional” name, since I got married and wanted to make my maiden name a 2nd middle name:

Luckily, I didn’t have trouble changing my name with social security, despite scary stories from other women I know who had to fight to get their name the way they wanted it on their card. Side note: one benefit I’ve since discovered is that since my new name on my driver’s license still includes my maiden name, it makes me more believable when I show up somewhere that doesn’t have my married name and I’m trying to get them to change it.

My Birth Name: Renée Marie Parilak
(I leave off the accent when entering on forms because that would just increase the chance for entry error, like Rene’e or Renee’. Yes, I’ve seen both.)

My Married Name: Renee Marie Parilak Teate.
First name “Renee”, middle name is now “Marie Parilak”, and last name “Teate”.

When I submitted this name on name change forms with credit card companies and other organizations, I had a few challenges, like not enough room on the middle name line, but I sent in all of the forms with my full name. My credit cards came back with all of these variations printed on them, depending on each company’s conventions:

  • Renee Marie Parilak Teate
  • Renee M. Teate
  • Renee M. P. Teate
  • Renee M. Parilak (yes, one came back unchanged)

Actually, the other card that came back unchanged was my voter registration card. I filled out the update form at the polls because they gave me a hard time last time I voted and had different names on my “proof of identity” information, and they mailed me a new card with my old name on it. Really.

Another has my full name correct when concatenated, but has stored my last name as Parilak Teate instead of Teate.

Here’s a friend’s experience (with names removed for privacy purposes):

My parents, to forestall in-law fights over naming, made it so that the first name of all their female children was my mother’s middle name (which is what she goes by…family tradition of female going by middle name), so we are all named “SharedFirstName MiddleName Surname”. Every girl child was meant to be called by her middle name and so it has been.

Many bureaucracies have forms that force everyone into using their middle initial only (if they even acknowledge the middle name). It’s been a problem throughout my life, especially at doctor’s offices, but it escalated enormously with the connection of various bureaucratic systems to the internet.

The latest wrinkle started about 2 years ago when one of my sisters moved from the family home to an apartment. She filled out a USPS change-of-address form. Suddenly, not just her mail but some of my mail, my mom’s mail and my sisters’ mail started going to her apartment. Our small-town postal personnel suggested workarounds, none of which worked. It wasn’t as simple as going to the post office, showing ID to prove who we were, and having postal personnel escalate it to whoever is in charge of that database. We tried repeatedly to get the mistake corrected by personally visiting our local post office and talking with the supervisor.

Meanwhile, my sister moved out of that apartment, meaning she was no longer there to reroute that mail to us. My mom, my sister, and I, who are all on Social Security (for age or disability) started to find Medicare and other important SocSec mail now had that apartment address on it instead of our true address. No one from USPS or SocSec ever contacted the address/contact of record to verify we wanted an address change. We have different Social Security numbers and you’d think that would be enough to double-check we weren’t the same person despite slightly similar names, but no. This points out how easy it would be for an ID thief to get a bunch of your most sensitive information re-routed to them. Google has a simple check-in when someone logs in from a strange computer or tries to change the password, yet USPS and Soc Sec don’t??

While this was going on, I’d tell medical providers to be sure to address bills using my middle name as the first name so it wouldn’t be rerouted, which it would be if they used my first name and middle initial, which happens to be identical to one of my sisters’ names in that format. They’d refuse or beg off, saying that their biller took care of that and there was no way to get that kind of customization. So despite being a customer who didn’t want to shirk my bills, who was being pro-active about it all, I ran the risk of running up huge medical bills because of mail going somewhere else and medical offices being so maddeningly subject to the almighty database that they would not avoid sending their mail into a black hole.

A couple months ago I tried to call Soc Sec to get advice about something. Even to get advice, they ask you your SS#, mother’s maiden name and other stuff. When I gave my mother’s maiden name over the phone, the operator told me I was wrong! WTF! She was ENORMOUSLY rude to me when I tried to let her know what mistake was happening.

I was fed up with years of this and called my congressperson’s office. So far, they’ve dropped the ball completely. I guess it’s The White House next. Which is a waste of my time, a waste of gov’t time, etc. All of this could have been avoided if good database design practices prevailed (and if bureaucratic organizations would quit distancing customers/clients from reaching the person in the organization who would be capable of making changes to the database once you’ve proved that you are YOU and have always been YOU).

In the article linked above, the W3C addresses some “implications for field design” for name fields, including field length, whether to split the name up, etc. I suggest you go read it. Here’s the link again.

If I were designing a name entry form today (Note: for a system that actually needs to store full names! Not all of them do!), I would ask the user for:

  • Prefix: [Mr., Ms., Rev., etc. – optional, could allow “other” entry]
  • Given/First Name(s):
  • Middle Name(s): [optional]
  • Family/Last Name(s):
  • Maiden Last Name: [if applicable]
  • Suffix(es): [Jr, III, Esq. etc. – optional, allow “other” entry]
  • Full Legal Name: [optional, would default to First, Middle, Last]
  • What should we call you? (Preferred given name or nickname if desired): [auto-fill with Given name. allow user to edit]
  • Preferred Mail Name: [auto-fill with prefix, first, middle, last, suffix. allow to edit]
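As a rough sketch of how the fields above and their auto-fill defaults might look in code, here is a small Python record. The class name `PersonName` and the attribute names are hypothetical, chosen only for illustration, and a real system would need validation and storage on top of this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PersonName:
    """Hypothetical record mirroring the form fields listed above."""
    given: str                         # Given/First Name(s)
    family: str                        # Family/Last Name(s)
    prefix: str = ""                   # optional: Mr., Ms., Rev., or free text
    middle: str = ""                   # optional Middle Name(s)
    maiden: str = ""                   # Maiden Last Name, if applicable
    suffix: str = ""                   # optional: Jr, III, Esq., or free text
    full_legal: Optional[str] = None   # defaults to First, Middle, Last
    preferred: Optional[str] = None    # "What should we call you?"
    mail_name: Optional[str] = None    # Preferred Mail Name

    def __post_init__(self):
        # Auto-fill only the fields the user left blank; anything the
        # user typed in is kept exactly as entered.
        if self.full_legal is None:
            parts = (self.given, self.middle, self.family)
            self.full_legal = " ".join(p for p in parts if p)
        if self.preferred is None:
            self.preferred = self.given
        if self.mail_name is None:
            parts = (self.prefix, self.given, self.middle,
                     self.family, self.suffix)
            self.mail_name = " ".join(p for p in parts if p)
```

The defaults are only starting points: the user can override any of them, which is the whole reason for storing them as separate fields instead of computing them every time.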

Note: if this were a mobile form, I wouldn’t ask for all of the name variations up front! Complicated mobile forms are a turn-off and can result in lost customers, so just ask for either full legal name (and have your code guess how to split the full name into first/middle/last fields) or first and last, and “What do you want us to call you?”, then have a detailed profile page that lets them go in later and fix anything your system got wrong.
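To make the “have your code guess” step concrete, here is a deliberately naive sketch. The function name `guess_name_parts` is made up for this example, and the heuristic assumes Western name order (first token is the given name, last token is the family name), which is exactly the kind of guess that will sometimes be wrong.

```python
def guess_name_parts(full_name: str) -> dict:
    """Naively split a full name into given/middle/family parts.

    Assumes given-name-first order; this is a guess, not a rule.
    """
    parts = full_name.split()
    if not parts:
        return {"given": "", "middle": "", "family": ""}
    if len(parts) == 1:
        return {"given": parts[0], "middle": "", "family": ""}
    return {
        "given": parts[0],
        "middle": " ".join(parts[1:-1]),  # everything between first and last
        "family": parts[-1],
    }
```

For “Mary Ann Smith” this guesses given “Mary”, middle “Ann”, family “Smith”. For a family-name-first name like “Mao Ze Dong” it guesses the parts backwards, which is why the detailed profile page where people can correct the split matters.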

In my case, I would be able to specify that my first name is “Renee” and my last name is “Teate”, and the other two names are my middle names. My friend could specify her full legal name and also that she prefers to be addressed as her middle name and not her legal first name. Mao Ze Dong could specify that Mao is the Family Name and Dong is the Given name, while still leaving the names in the original order for the full legal name.

Make sure your database can handle all of the variations that may be written on a paper form. Also make sure you can handle special characters.

Another Note: If you are going to display the name publicly in any way, like on a user profile on a website, you have to give the person full control over how they want their name displayed there. There are various safety and personal reasons a person may not want their name displayed the way you want to display it. Here’s one example story.  Also see Nymwars.

And if my system were something like a medical insurance record where past names may come in from doctor’s offices even though I have the patient’s current name, I might ask for a list of past names to keep on the record. You can store several “former names” in a table with a one-to-many relationship to the person’s primary record, and store all of the prior name fields when a new name comes in. You can even store names that aren’t their actual name, but may come in from another system regularly (like a misspelling) or be an incorrect version of their name that you stored in the past.
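Here is a minimal sketch of that one-to-many “former names” idea, using Python’s built-in `sqlite3` module with an in-memory database. The table names, column names, `name_type` values, and sample rows are all assumptions made up for illustration, not a recommended production schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    person_id   INTEGER PRIMARY KEY,
    given_name  TEXT NOT NULL,
    family_name TEXT NOT NULL
);
-- One person can have many former (or known-incorrect) names.
CREATE TABLE former_name (
    former_name_id INTEGER PRIMARY KEY,
    person_id      INTEGER NOT NULL REFERENCES person(person_id),
    given_name     TEXT,
    middle_name    TEXT,
    family_name    TEXT,
    name_type      TEXT NOT NULL  -- e.g. 'maiden', 'prior', 'misspelling'
);
""")

conn.execute("INSERT INTO person VALUES (1, 'Jane', 'Doe')")
conn.executemany(
    "INSERT INTO former_name "
    "(person_id, given_name, middle_name, family_name, name_type) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Jane", None, "Smith", "maiden"),
        (1, "Jayne", None, "Doe", "misspelling"),
    ],
)


def find_person_ids(given: str, family: str) -> list:
    """Look up a person by current name OR any stored former name."""
    rows = conn.execute(
        """
        SELECT person_id FROM person
         WHERE given_name = ? AND family_name = ?
        UNION
        SELECT person_id FROM former_name
         WHERE given_name = ? AND family_name = ?
        """,
        (given, family, given, family),
    ).fetchall()
    return [row[0] for row in rows]
```

With this layout, a claim that arrives under “Jane Smith” or the misspelled “Jayne Doe” still resolves to person 1, while the primary record keeps only the current name.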

When in doubt, you can use the full legal name on communications. If sending an informal email, you can use the “What should we call you?” name. If sending a formal letter, you can use the Preferred Mail Name. Someone could have the preferred mail name of “Mr. J. Edgar Hoover” (funny that’s the name that came to mind as an example of first name initial when I’m writing about storing personal information), but prefer to be called “Ed” in person, and it’s good for your organization to know.
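That rule of thumb could be written as a small lookup. This is just a sketch: the function name `name_for`, the channel labels, and the dictionary keys are invented for the example.

```python
def name_for(record: dict, channel: str) -> str:
    """Pick which stored name to use for a given kind of communication."""
    if channel == "informal_email":
        # Use the "What should we call you?" name when we have one.
        return record.get("preferred") or record["full_legal"]
    if channel == "formal_letter":
        return record.get("mail_name") or record["full_legal"]
    # When in doubt, fall back to the full legal name.
    return record["full_legal"]


hoover = {
    "full_legal": "John Edgar Hoover",
    "preferred": "Ed",
    "mail_name": "Mr. J. Edgar Hoover",
}
```

Here `name_for(hoover, "informal_email")` gives “Ed” and `name_for(hoover, "formal_letter")` gives “Mr. J. Edgar Hoover”.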

If you have spouses in your database, the marriage record should store preferred informal and mailing joint names, like “Mr. and Mrs. Doe” or “John and Jane”, and not just auto-generate the combinations dynamically (though you could default to that), since some couples have strong preferences on how their names are shown in letter salutations (like wanting the wife’s name first, or preferring to always be referred to formally – I work in a fundraising organization, and they definitely want to address large donors’ names the way they want them!). A relatively safe way to do envelope labels is to have “stacked” name labels, where both spouses’ preferred mailing names are completely written out and listed one above the other. This avoids uncomfortable situations in cases where the spouses have different last names, for instance. Also, you can then handle cases that some systems have trouble with, such as “Drs. Jane and John Doe”, or professional/military prefixes and suffixes in a certain order.
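A sketch of “store the joint name, but default to a generated one” plus stacked labels might look like the following. The dictionary keys (`informal_joint_name`, `preferred`, `mail_name`) and the function names are hypothetical.

```python
def joint_salutation(marriage: dict, spouse_a: dict, spouse_b: dict) -> str:
    """Use the couple's stored joint name; only generate one as a default."""
    stored = marriage.get("informal_joint_name")
    if stored:
        return stored
    return f"{spouse_a['preferred']} and {spouse_b['preferred']}"


def stacked_label(spouse_a: dict, spouse_b: dict) -> str:
    """Write out each spouse's preferred mailing name on its own line."""
    return f"{spouse_a['mail_name']}\n{spouse_b['mail_name']}"


jane = {"preferred": "Jane", "mail_name": "Dr. Jane Roe"}
john = {"preferred": "John", "mail_name": "Mr. John Doe"}
```

With no stored preference, `joint_salutation({}, john, jane)` falls back to “John and Jane”. If the couple has asked for “Jane and John”, that stored value wins. The stacked label never has to merge the two names at all, so different last names or prefixes are not a problem.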

Speaking of marriage, there is a great article on storing a variety of marriage relationships in your database, called “Y2Gay“. It’s a good read for database designers or anyone that has to think about data issues!

Besides having a system that can respond to customer preferences (following their preferences will make them happier customers; rejecting an insurance claim because the first/middle/last names don’t exactly line up with your record will not), having all of these name variations does help in a data science way as well: you can now better match profiles coming in from different systems and have a higher degree of confidence that you are pulling incoming data into the correct “Mary Smith” record, for instance.
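To illustrate the matching benefit, here is a toy sketch that checks an incoming name against every variation stored for each record. The record layout, the `normalize` rules (lowercase, drop periods, collapse spaces), and the sample data are assumptions for the example; real record matching usually also uses other fields such as birth date or address.

```python
def normalize(name: str) -> str:
    """Lowercase, drop periods, and collapse whitespace for comparison."""
    return " ".join(name.lower().replace(".", "").split())


def match_records(incoming_name: str, records: list) -> list:
    """Return ids of records with any stored name variation that matches."""
    key = normalize(incoming_name)
    return [
        rec["id"]
        for rec in records
        if key in {normalize(variation) for variation in rec["names"]}
    ]


records = [
    {"id": 1, "names": ["Mary Smith", "Mary Jones", "Mary A. Smith"]},
    {"id": 2, "names": ["Mary Smith", "Mary K. Smith"]},
]
```

An incoming “Mary Smith” matches both records, so it is still ambiguous, but “MARY JONES” or “Mary K Smith” each match exactly one record, which is where the extra stored variations pay off.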

Hopefully, our stories made for some food-for-thought for people designing databases, websites, and processes involving people’s names!

Please share your experiences, thoughts, and comments on my database field choices below!

Here are some other references on the topic:

Bad assumptions programmers make about names

Related stackoverflow topic

Using External Data in Data Matching

Data Models and Real World Alignment

]]>
https://www.becomingadatascientist.com/2015/09/19/human-name-variations-in-databases/feed/ 2
DataSciGuide Update https://www.becomingadatascientist.com/2015/09/06/datasciguide-update/ https://www.becomingadatascientist.com/2015/09/06/datasciguide-update/#comments Sun, 06 Sep 2015 23:15:22 +0000 https://www.becomingadatascientist.com/?p=624 I finally had a chance this weekend to make some progress on my “Data Science Directory” website, DataSciGuide.com, and I would love your feedback on it!

That site isn’t open for comments yet, so I’m directing people to leave feedback here.

If you haven’t kept up with the development of DataSciGuide, here are a few things to read:

Let me know if you want an account to post some reviews while I test things out! (I’ll even post content that you want to review, just for you.)

Also, tell me any thoughts you have about the site in the comment form below! (or tweet me!)

]]>
https://www.becomingadatascientist.com/2015/09/06/datasciguide-update/feed/ 2
The Imitation Game, and the Human Element in Data Science https://www.becomingadatascientist.com/2015/08/08/the-imitation-game-and-the-human-element-in-data-science/ https://www.becomingadatascientist.com/2015/08/08/the-imitation-game-and-the-human-element-in-data-science/#comments Sat, 08 Aug 2015 21:05:12 +0000 https://www.becomingadatascientist.com/?p=612 Last night, my husband and I watched The Imitation Game. First of all, it’s a great movie and you should see it. Secondly, there was a moment that got me thinking about the human element of machine learning.

[Spoiler Alerts – but you probably already know much of the story, and the movie is still good even if you know the historical outcome.]

I thought a moment like this might be coming when Alan Turing was first applying to work at Bletchley Park, and Denniston can’t believe he’s applying to be a Nazi codebreaker without even knowing how to speak German. Alan emphasizes that he is masterful at games and solving puzzles, and that the Nazi Enigma machine is a puzzle he wants to solve. He starts designing and building a machine that will theoretically be able to decode the Nazi radio transmissions, but the decoder settings change every day at 12am, so the machine must solve for the settings before the stroke of midnight every day in order for the day’s messages to be decoded in time to be useful and not interfere with the next day’s decoding process. Turing can’t prove his machine will work, simply because it is taking too long to solve the daily puzzle. In the meantime, people are dying in the war, and the Nazis are going on transmitting their messages over normal radio waves believing the code is “unbreakable”.

[More specific plot spoiler in the 2 paragraphs below]

The moment I’m referring to is when Alan hears a woman explaining that the German man whose messages she is assigned to translate always starts his messages off with “Cilla”, who she assumes is the transmitter’s girlfriend. This triggers Alan to realize there could be repeated messages that would drastically narrow down the number of decryption keys because you could specify that there was always a word or phrase that could be expected in the messages. They realized that there was a 6am weather report transmitted daily, and that every message ended in “Heil Hitler”, so they set the machine so it could focus its search on finding the words “weather”, “heil”, and “Hitler” in the 6am message every day. This solved the puzzle, as the machine was then able to quickly decode the messages from a narrower set of possible solutions instead of running all day without finding one.

So, Alan Turing (at least in the movie version) had been focusing on a random set of hundreds of millions of possible solutions, which his machine could solve faster than humans, and trying to tune the design of his machine to find the solution faster. The hint that ended up leading to the solution involved bringing the human thought process back into it – if there was something the messages said frequently, because they were written by humans, and humans follow certain communication and linguistic patterns, that could be exploited to narrow down the range of solutions.
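The “expected word” idea can be shown with a toy example. To be clear about the assumptions: this uses a simple letter-shift cipher with only 26 possible keys as a stand-in, which is nothing like the real Enigma or the machine in the movie, and the function names are made up. It only illustrates how knowing a word that must appear in the message lets you throw out candidate keys.

```python
import string

ALPHABET = string.ascii_uppercase


def shift(text: str, key: int) -> str:
    """Shift each uppercase letter by `key` positions; leave the rest alone."""
    out = []
    for ch in text:
        if ch in ALPHABET:
            out.append(ALPHABET[(ALPHABET.index(ch) + key) % 26])
        else:
            out.append(ch)
    return "".join(out)


def keys_consistent_with_crib(ciphertext: str, crib: str) -> list:
    """Keep only the keys whose decryption contains the expected word."""
    return [key for key in range(26) if crib in shift(ciphertext, -key)]


# Toy "intercepted" message, encrypted with a key we pretend not to know.
ciphertext = shift("WEATHER REPORT HEIL HITLER", 7)
```

Without the hint, all 26 keys are candidates and something (or someone) has to read every decryption. With the expected word “WEATHER”, `keys_consistent_with_crib(ciphertext, "WEATHER")` leaves a single candidate key.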

[/end specific plot spoiler]

This got me thinking about machine learning and data science in general. A frequent Kaggle competition winning strategy is to quickly iterate through a multitude of algorithms and optimize for the evaluation metric, or focus on stacking thousands of models. [Here are several interviews with Kaggle winners.] Some of the winners do mention needing to study up on the field to gain some domain expertise, and the Quora answer here focuses on data understanding and preparation, but many teams appear to take a brute-force-type approach to building their models and writing programs to iterate through each combination of algorithms to maximize the area under the curve or whatever measure that particular competition is scored on.

I don’t want to do that type of brute-force data science. I’ll leave that up to the competitive types that enjoy quickly iterating through as many solutions as possible, and building systems to help them do this faster, and have a focus on winning vs understanding. Of course there is a place for that type of approach, and it is very valuable in solving some problems, but it’s not attractive to me. It’s too robotic.

However, there are many areas (like my day-job, university/non-profit fundraising), where knowing your specific community, what types of data you’re collecting, where it’s from, how trustworthy it is, will have a large impact on what fields you decide to include in your model and of course how you interpret the output. These are areas where we’re talking about data with a lot of variety, but not necessarily a lot of volume (at least not on the scale of something like credit card processing and fraud detection), and where I would think domain expertise is more valuable than fast model optimization. It is also vital to be able to explain how to be able to use the output of a given model. It is more of a consulting role than a mathematical/programming one.

I think that’s where I’m headed in my data science learning. I am aiming to learn how to use models to inform decision-making, how to choose the best data to put into a model, how to choose which type of model to use, and how to watch out for things like covariance and other confounding effects. Then, when you get a result, how to know whether to trust the result, how to explain it to non-technical managers, and how to best implement what was learned in order to have a positive impact on real-world outcomes. It is more of a creative iterative cycle than a machine/optimization iterative cycle.

Like in The Imitation Game, understanding human behavior and communication can be the “big hint” that informs a technical solution, and optimizes its performance beyond what a better-tuned piece of hardware or more efficient code could do. It seems to me (and I have heard others saying) that understanding people and business is more than just a piece of the “data science venn diagram”, it’s really the key to success in this field. (And also a good reason to have diverse data science teams.)

I’m curious about how those of you that work as data scientists see these various aspects of data science, and how much of your work involves creative and “human” skills vs the “hard skills” of math and computer science. I would think it varies depending on industry and the type of problem you’re trying to solve, but I am interested in your personal experience. Please comment below to share your experiences!

 

 

 

]]>
https://www.becomingadatascientist.com/2015/08/08/the-imitation-game-and-the-human-element-in-data-science/feed/ 5
My “Secret” Side Project, Revealed https://www.becomingadatascientist.com/2015/08/01/my-secret-side-project-revealed/ https://www.becomingadatascientist.com/2015/08/01/my-secret-side-project-revealed/#comments Sun, 02 Aug 2015 02:18:44 +0000 https://www.becomingadatascientist.com/?p=590 OK So I was actually hoping to show this to you all long ago, and I kept coming up with more and more ideas for it, so it’s not going to be “ready” to reveal for a while, but I figured I’d go ahead and show it to you anyway.

My main motivation is that I keep hearing people say (and sometimes feel myself) that learning to become a data scientist on your own using online resources is totally overwhelming: there are so many different possible topics to dive into, few really good guides, lots of impostor-syndrome-inducing posts by people you follow that make you feel like they’re so far ahead of where you are and you’ll *never* get there…. but there’s so much great data science learning content online for everyone from beginners to experienced data scientists!

We need a better way to navigate it.

Hence my new website: “Data Sci Guide”. It will eventually have a personalized recommender system and structured learning guides and all kinds of other features to help you find the resources to go from where you are to where you want to be, but for now it’s “just” a directory / content rating site. And it’s not ready for you to interact with yet, but it’s getting there, and I’ll need your help fleshing it all out soon.

So go take a look! Then come back here to give me feedback and suggestions, because you have to be registered to comment there and I didn’t turn on new user registration yet.

OK go now. Don’t forget to come back!

data_sci_guide_launch_homepage

>>>> DATA SCI GUIDE.COM <<<

 

So…. what did you think? What do you think of the overall idea and plans? What should I be sure to remember to include? Tell me below!

 

]]>
https://www.becomingadatascientist.com/2015/08/01/my-secret-side-project-revealed/feed/ 7
Entry Level Data Analyst Skills https://www.becomingadatascientist.com/2015/07/23/entry-level-data-analyst-skills/ https://www.becomingadatascientist.com/2015/07/23/entry-level-data-analyst-skills/#respond Thu, 23 Jul 2015 18:30:23 +0000 https://www.becomingadatascientist.com/?p=585 Between an interview from a local TV station about my job and going through the process of hiring someone onto our team, I’ve been thinking about what would be the bare minimum skills someone would need to have a chance at being hired as a data analyst. Maybe this would be a helpful list for someone trying to change careers and trying to decide where to focus their learning time.

I posted this picture on Twitter:
data_analyst_whiteboard

and got some interesting responses:

What do you think?

I’ll revisit this topic later, and I’ll also post about the conference I’m attending (APRA Data Analytics Symposium) when I have a chance to summarize. For the moment, heading back to the sessions!

]]>
https://www.becomingadatascientist.com/2015/07/23/entry-level-data-analyst-skills/feed/ 0
The Data Science Central “Incident” https://www.becomingadatascientist.com/2015/07/08/the-data-science-central-incident/ https://www.becomingadatascientist.com/2015/07/08/the-data-science-central-incident/#comments Wed, 08 Jul 2015 05:36:04 +0000 https://www.becomingadatascientist.com/?p=548 I’m writing this post to respond both to what many of you saw Vincent Granville say about me on Facebook a couple days ago, which was brought to my attention yesterday:
vincent_granville_data_science_renee_comment_single(in context)

and to his apology this evening:
granville_apology

I didn’t want to write a second post about Data Science Central, but after the huge response on twitter today, I want to document everything in one place so anyone looking back at this has all of the info to evaluate what has been said.

I have thought a lot about Vincent Granville’s apology this evening, and honestly when I heard he had apologized, I hoped (but doubted) it would be sincere. I would have loved to be able to accept his apology and move on from all this. However, I can’t bring myself to accept the apology because it’s not really an apology, it’s an accusation. After writing a truly vile post about me, his “apology” accuses me of harassing *him*. He says that I have “attacked” him for 14 months, and is casting himself as a victim. He’s basically saying “I acted a fool in a heated moment because she’s been attacking me non-stop for over a year” (the heated moment apparently being a Facebook post about Ellen Pao that reminded him of me, and the “attacking” being me pointing out his questionable practices in a blog post and on twitter, I guess).

Because of that, and because there are a lot of people who are *actually* harassed online who I think would be offended by his characterization, I want to document everything I’ve said about him, and challenge his definition of harassment. What I have done is document what I saw as some very questionable (if not unethical) behaviors, and occasionally initiated or participated in conversation on twitter about that. I have never “attacked” him in any way, but I want to leave it up to you readers to decide.

Here is the history of my comments about Data Science Central and Vincent Granville:

April 21-22, 2014: Initial twitter conversation with @tesherista about Data Science Central’s contest to find fake accounts created to attract women and minorities to Data Science Central, where Vincent Granville deleted Cory’s comments questioning the practice. (screenshots by @AltonDataSci)

July 1, 2014: Original blog post “Something has been bothering me about Data Science Central” here on Becoming a Data Scientist, where I wrote about the above experience, as well as exposing one of the fake Data Science Central profiles “Amy Cordan” as having a fake LinkedIn profile (still there as “Amy Sangrene”) with a fake Stanford Computer Science PhD, violating LinkedIn terms & conditions. (He mentioned me questioning his advanced degrees in his “apology”, and this is the only academic credential I have brought under scrutiny, that of “Amy”.)
In response to this post, I received the following comments from readers (among others, you can see them at the end of the post linked above):

  • Alton discussing his negative experience with the Data Science Central contest
  • Ellie talking about losing trust in DSC when she tried to contact “Amy” and realized she wasn’t real
  • Hubart and David questioning his academic background (maybe this is why he thought I did? because commenters on my post did?)
  • “System Administrator” recalling another use of the name Amy Cordan by Vincent Granville online in the past
  • Eric mentioning he found that Vincent Granville was accused of paying people to write positive Amazon reviews of his Developing Analytic Talent book
  • A comment by someone who claims to be the “real” Amy Cordan (Henriques) and used to be close friends with Vincent Granville’s wife

July 1-5, 2014: Twitter conversation with @altondatasci following the blog post above, as well as an explanation for why I wrote the post:

October 26, 2014: Tweet to @kissmetrics (and conversation following) alerting them that Amy was a fake profile.

June 8-10, 2015: Tweets after the real Amy Cordan commented on my blog, conversation between @ellieaskswhy, @metabrown312, and @tesherista on Twitter about fake Amy and deleted comments. Follow-up tweets warning people again about what we had found, and talking about the suspended accounts.

June 24, 2015: Tweets about finding out my Data Science Central account was suspended.

Throughout all of this, I never emailed or otherwise directly contacted Vincent Granville. He also never responded to any of these comments or my blog post, and the only reason I think he might be aware of my posts is that my Data Science Central account was suspended. Today, he also blocked me, and several other people that shared my post, from following @DataScienceCtrl on Twitter. Suspending women’s accounts on Data Science Central and then calling someone who questions his practices a “harasser” and “attacker” is not a great way to encourage women to participate on his sites.

I’m not rehashing all of this to keep it going, but to show what a man that called me (and my followers/readers) names unprovoked is calling “harassment” and “attacks”. Reading the details of his apology/accusation, it also appears he is lumping the comments of everyone that has questioned/called out Data Science Central on twitter or this blog into a persona he thinks is all me, when in fact I wrote few of those comments myself, but instead provided a space for others to share their experiences.

If Vincent Granville wants to actually apologize for his statement and actions, without blaming and accusing me inappropriately in the process, I will accept it.

]]>
https://www.becomingadatascientist.com/2015/07/08/the-data-science-central-incident/feed/ 17