The Imitation Game, and the Human Element in Data Science

August 8, 2015

Last night, my husband and I watched The Imitation Game. First of all, it’s a great movie and you should see it. Secondly, there was a moment that got me thinking about the human element of machine learning.

[Spoiler Alerts – but you probably already know much of the story, and the movie is still good even if you know the historical outcome.]

I thought a moment like this might be coming when Alan Turing first applies to work at Bletchley Park, and Denniston can’t believe he’s applying to break Nazi codes without even knowing how to speak German. Alan emphasizes that he is masterful at games and puzzles, and that the Nazi Enigma machine is a puzzle he wants to solve. He starts designing and building a machine that should, in theory, be able to decode the Nazi radio transmissions. But the decoder settings change every day at midnight, so the machine must solve for the settings before the stroke of midnight for the day’s messages to be decoded in time to be useful, and so that the work does not interfere with the next day’s decoding. Turing can’t prove his machine will work, because it is taking too long to solve the daily puzzle. In the meantime, people are dying in the war, and the Nazis go on transmitting their messages over ordinary radio waves, believing the code is “unbreakable”.

Read More

My “Secret” Side Project, Revealed

August 1, 2015

OK, so I was actually hoping to show this to you all long ago, but I kept coming up with more and more ideas for it, so it’s not going to be “ready” to reveal for a while. I figured I’d go ahead and show it to you anyway. My main motivation is that I keep hearing people say (and sometimes feel myself) that learning to become a data scientist on your own using online resources is totally overwhelming: there are so many possible topics to dive into, few really good guides, and lots of impostor-syndrome-inducing posts by people you follow that make you feel like they’re so far ahead of you that you’ll *never* get there… But there’s so much great data science learning content online for everyone from beginners to experienced data scientists! We need a better way to navigate it. Hence my new website: “Data Sci Guide”. It will eventually have a personalized recommender system, structured learning guides, and all kinds of other features to help you find the resources to go from where you are to where you want to be, but for now it’s “just” a directory / content rating site. It’s not ready for you to interact with yet, but it’s getting there, and I’ll need your help fleshing it all out soon. So go take a look! Then come back here to give me feedback and suggestions, because you have to be registered to comment there and I haven’t turned on new user registration yet. OK, go now. Don’t forget to come back! >>>> DATA SCI GUIDE.COM <<< So… what did you think? What do you think of the overall idea and plans? What should I be sure to remember to include? Tell me below!...

Read More

Entry Level Data Analyst Skills

July 23, 2015

Between an interview with a local TV station about my job and going through the process of hiring someone onto our team, I’ve been thinking about the bare minimum skills someone would need to have a chance at being hired as a data analyst. Maybe this would be a helpful list for someone who is trying to change careers and deciding where to focus their learning time. I posted this picture on Twitter and got some interesting responses:

Karen Clark (@clarkkaren), July 17, 2015: @BecomingDataSci I'd include familiarity with business process in one of those columns. Can't analyze in a vacuum,.
Karen Clark (@clarkkaren), July 20, 2015: @BecomingDataSci @aflyax You've got analytical thinking & problem solving. Maybe add "adaptable to a variety of environments" as generic?
Data Science Renee (@BecomingDataSci), July 17, 2015: @barbarafenton i mentioned that as a misconception! i spend a lot more time communicating than most people think
Data Science Renee (@BecomingDataSci), July 17, 2015: @DataSkeptic yes i think that's important, but you can get an entry level job w/just basic charting skills. was trying to keep to minimum.
Martin Monkman (@monkmanmh), July 17, 2015: @BecomingDataSci so e.g. "SQL" could be "data manipulation skills (e.g. SQL)" – don't get hung up on a specific tool to to the job! 2/2
Shannon Quinn (@SpectralFilter), July 17, 2015: @BecomingDataSci This is great! My ready-fire-aim data science side says to add "asking forgiveness is easier than permission" to traits :P
craig pfeifer (@aCraigPfeifer), July 17, 2015: @BecomingDataSci I'd add : autodidact

What do you think? I’ll revisit this topic later, and I’ll also post about the conference I’m attending (APRA Data Analytics Symposium) when I have a chance to summarize. For the moment, heading back to the...

Read More

API and Market Basket Analysis

July 1, 2015

I was considering waiting until I’m done before posting about this project, but instead I thought I’d post my progress and plans while I think about the next steps. I posted earlier about using the UsesThis API to find out what other software people who use a given piece of software also use. I thought I was going to have to code a workaround for people who didn’t have any software listed in their interviews, but when I tweeted about it, Daniel from @usesthis replied that it was actually a bug and fixed it immediately! It’s even more fun to develop knowing that he is excited about me using his API!

The Setup (@usesthis), June 19, 2015: @BecomingDataSci: YES! It’s *awesome*.

After seeing those results, I thought it would be interesting (and educational) to learn how to do a Market Basket Analysis on the software data. Market Basket Analysis is a data mining technique for finding out which items are usually found in combination, such as groceries people typically buy together. For instance, do people often buy cereal and milk together? If you buy taco shells and ground beef, are you likely to also buy shredded cheese?

This type of analysis allows brick-and-mortar retailers to decide how to position items in a store. Maybe they will put items that are regularly purchased together closer to each other to make the trip more convenient. Maybe they will place coupons or advertisements for shredded cheese next to the taco shells. Or maybe they will place the items further apart, so you have to pass more goods on the way from one item to the other and are more likely to pick up something you otherwise wouldn’t have. Online retailers can use this type of analysis to recommend products and increase the size of your purchase: “Other customers that added item X to their shopping cart also purchased items Y and Z.”

Because I had this interesting set of software usage data from The Setup’s interviews, I wanted to analyze which products frequently go together.
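To make the idea concrete, here is a minimal sketch in Python of the pair counting at the heart of a Market Basket Analysis. It is an illustration only, not code from any existing module: the function name `pair_rules`, the `min_support` parameter, and the grocery baskets are all made up for this example. "Support" is the share of baskets that contain both items, and "confidence" of the rule A -> B is the share of baskets containing A that also contain B.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for item pairs."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # de-duplicate items within one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Confidence of "a -> b" is P(b | a); the reverse rule can differ.
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return rules


# Made-up baskets, purely for illustration.
baskets = [
    ["taco shells", "ground beef", "shredded cheese"],
    ["taco shells", "ground beef"],
    ["cereal", "milk"],
    ["milk", "shredded cheese"],
]

for a, b, support, confidence in pair_rules(baskets, min_support=0.5):
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```

This sketch only looks at pairs. Rules with several items on each side multiply the number of combinations quickly, which is worth keeping in mind before running anything like this on a large data set.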
I searched Google for ‘Market Basket Analysis python,’ and it led me to this tweet by @clayheaton:

Clay Heaton (@clayheaton), April 4, 2014: I just wrote a simple Market Basket analysis module for Python. #analytics https://t.co/aVf58zcHJa

I followed that link and checked out the code on GitHub, and it seemed to make sense, so I put the results of my usesthis API request into a format it could use. I did a test with the data from 5 interviews, and it ran. Then I tried 50 interviews, and the results showed that people who use Photoshop were likely to also use Illustrator, and vice versa. It appeared to be working!

However, I then hit a snag. I tried to run it with all of the software data, and it ran for a long time and then crashed when my computer ran out of memory. Since it builds an association graph with an edge for every rule (combination of software used), with up to two pieces of software per “side” of the rule (such as “people that have Photoshop and Illustrator also have a Mac”), you can imagine the graph gets pretty big when you have over 10,000 user-software combinations. I tweeted about this, and Clay suggested modifying his code to store the items in a sparse matrix instead of a graph. I agree that sounds like a good approach, so that’s my next step on this project. I’ll post again when I’m...

Read More

The Setup (usesthis.com) API

June 10, 2015

There’s a really interesting site, usesthis.com, AKA “The Setup”, which interviews people and lists all of the gear that they use, including software. I found out that they have an API (documented here), and I wanted to use my new API skills in Python to test it out! This one returns JSON, unlike the NPR API, which returned XML. Basically, I used the list API to return all of the interviews of people who use Python, then used the interviews API to return each of those people’s lists of gear. That way, I could tally up the most frequently used software (other than Python) among the interviewed Python users! Here’s my code in HTML IPython notebook form. I haven’t had a chance to practice visualizations yet, so please point me to any resources that will help me make the horizontal bar chart prettier!

UsesThis API – Software that Python users use

Preview of the ugly chart:

Update 6/18/15: What about other software? I added an input so the user can type in any software title. For the output saved below, I typed in “Android” at the prompt. Here it is on nbviewer. You can use the download button in the upper right corner to download it and run it on your local IPython installation to try it out yourself!...
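For anyone who wants the gist without opening the notebook, the tallying step can be sketched roughly like this. This is not the notebook code, and it makes no API calls: the function name `tally_software` and the shape of the interview dictionaries (a `gear` key holding a `software` list of items with a `name`) are assumptions made for illustration, so check the API documentation for the real response format.

```python
from collections import Counter


def tally_software(interviews, exclude=("Python",)):
    """Count how many interviews list each software title."""
    skip = {name.lower() for name in exclude}
    counts = Counter()
    for interview in interviews:
        # Assumed (hypothetical) shape: {"gear": {"software": [{"name": ...}]}}
        software = interview.get("gear", {}).get("software", [])
        # Use a set so a title listed twice by one person is counted once.
        names = {item["name"] for item in software}
        counts.update(n for n in names if n.lower() not in skip)
    return counts


# Hand-written stand-ins for parsed JSON responses (not real data).
interviews = [
    {"gear": {"software": [{"name": "Python"}, {"name": "Vim"}]}},
    {"gear": {"software": [{"name": "Python"}, {"name": "Vim"}, {"name": "Git"}]}},
    {"gear": {}},  # someone with no software listed
]

for name, count in tally_software(interviews).most_common():
    print(name, count)
```

Counting each title at most once per interview means the totals read as "number of people who use it", which is the figure a bar chart of most-used software needs.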

Read More

May 2015 #SoDS Storify

June 9, 2015

I wanted to capture the participation in the #SoDS (Summer of Data Science) hashtag somehow, so I decided to create a monthly Storify to keep up with all of your great tweets! [View the story “#SoDS May 2015” on...

Read More