Podcast Listens Analysis

October 1, 2017

I’ve been telling everyone that I’d do something “data fun” when I hit 20K Twitter followers, so I posted an analysis of my podcast listeners! I used Python and pandas in a Jupyter notebook for the first part, then built a dashboard in Tableau for the last part.


Data Science Learning Club Update

February 20, 2016

For anyone who hasn’t yet joined the Becoming a Data Scientist Podcast Data Science Learning Club, I thought I’d write up a summary of what we’ve been doing….


Data Science Tutorials Flipboard Magazine

October 21, 2015

I have been getting great feedback on my “Becoming a Data Scientist” Flipboard magazine, and I had another set of bookmarked articles that didn’t quite fit into it. I want the Becoming a Data Scientist one to be the “best of the best” of articles I find on Twitter about data science, and to focus on understanding data science and related topics without getting too far into the “nitty gritty”. However, I often come across great data-science-related tutorials on very specific topics that may not have broad appeal (and might look scary to beginners) but that I still wanted to share. So I started the “Data Science Related Tutorials” Flipboard magazine....


Playing With Google Cloud Datalab

October 18, 2015

This weekend, I played around with the newly released Google Cloud Datalab. I learned how to use BigQuery and compared Google Charts with Pandas+Matplotlib plots, since you can do both in Datalab. I had a few frustrations: the documentation isn’t great, and sometimes it would silently time out with no indication of why nothing was running. If I stopped all of the services, closed and restarted Datalab, and reopened it, everything would work fine again. It’s clearly in beta, but I had fun learning how to get it up and running, and it was cool to be able to write SQL in a Jupyter notebook.

I tried to connect to my Google Analytics account, but apparently you need a paid Pro account to do that, so I just connected to one of the built-in public datasets. If you view the notebooks, you will see I clearly wasn’t trying to do any in-depth analysis. I was just playing around and getting the queries, dataframes, and charts to work.

I hadn’t planned to get into too many details here, but wanted to share the results. I did jot down notes for myself as I set it up, which I’ll link to below, and you can see the two notebooks I made as I explored Datalab.

Exploring BigQuery and Google Charts
Version Using Pandas and Matplotlib
(These aren’t tidied up to look professional. Please forgive any typos or messy approaches!)

Google Cloud Datalab Setup Notes (These are notes I jotted down for myself as I went through the setup steps. Sorry if they’re not...


API and Market Basket Analysis

July 1, 2015

I was considering waiting until I’m done before posting about this project, but instead I thought I’d post my progress and plans while I think about the next steps. I posted earlier about using the UsesThis API to find out what other software people who use X software also use. I thought I was going to have to code a workaround for people who didn’t have any software listed in their interviews, but when I tweeted about it, Daniel from @usesthis replied that it was actually a bug and fixed it immediately! It makes it even more fun to develop since he is excited about me using his API!

“@BecomingDataSci: YES! It’s *awesome*.” (The Setup, @usesthis, June 19, 2015)

After seeing those results, I thought it would be interesting (and educational) to learn how to do a Market Basket Analysis on the software data. Market Basket Analysis is a data mining technique for finding out which items are usually found in combination, such as groceries people typically buy together. For instance, do people often buy cereal and milk together? If you buy taco shells and ground beef, are you likely to also buy shredded cheese?

This type of analysis allows brick-and-mortar retailers to decide how to position items in a store. Maybe they will put items that are regularly purchased together closer to each other to make the trip more convenient. Maybe they will place coupons or advertisements for shredded cheese next to the taco shells. Or maybe they will place the items further apart so you have to pass more goods on the way from one item to the other and are more likely to pick up something you otherwise wouldn’t have. Online retailers can use this type of analysis to recommend products and increase the size of your purchase: “Other customers that added item X to their shopping cart also purchased items Y and Z.”

Because I had this interesting set of software usage from The Setup’s interviews, I wanted to analyze what products frequently go together. I searched Google for ‘Market Basket Analysis python,’ and it led me to this tweet by @clayheaton:

“I just wrote a simple Market Basket analysis module for Python. #analytics https://t.co/aVf58zcHJa” (Clay Heaton, @clayheaton, April 4, 2014)

I followed that link and checked out the code on GitHub. It seemed to make sense, so I put the results of my usesthis API request into a format it could use. I did a test with the data from 5 interviews, and it ran. Then I tried 50 interviews, and the results showed that people who use Photoshop were likely to also use Illustrator, and vice versa. It appeared to be working!

However, I then hit a snag. I tried to run it with all of the software data, and it ran for a long time and then crashed when my computer ran out of memory. It builds an association graph with an edge for every rule (combination of software used), with up to two pieces of software per “side” of the rule (such as “people that have Photoshop and Illustrator also have a Mac”), so you can imagine the graph gets pretty big when you have over 10,000 user-software combinations. I tweeted about this, and Clay suggested modifying his code to store the items in a sparse matrix instead of a graph. I agree that sounds like a good approach, so that’s my next step on this project. I’ll post again when I’m...
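To make the idea of a rule concrete, here is a minimal sketch of a pairwise market basket calculation. It is not Clay Heaton's module and not the code from this project: the `baskets` data, the `pair_rules` function, and the `min_support` threshold are all made up for illustration, and it only handles rules with one item per side. Support is the fraction of baskets containing both items; confidence is the fraction of baskets containing the left-hand item that also contain the right-hand item.

```python
from collections import Counter
from itertools import combinations

# Hypothetical data: one set of software titles per interview.
baskets = [
    {"Photoshop", "Illustrator", "Mac"},
    {"Photoshop", "Illustrator"},
    {"Photoshop", "Mac"},
    {"Python", "Mac"},
]


def pair_rules(baskets, min_support=0.5):
    """Return (lhs, rhs, support, confidence) for every frequent item pair."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(basket)  # sort so each pair has one canonical order
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair yields two directed rules: a -> b and b -> a.
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return rules


for lhs, rhs, support, confidence in sorted(pair_rules(baskets)):
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

Because the counters only hold pairs that actually occur together, memory grows with the number of observed co-occurrences rather than with every possible combination, which is loosely the same motivation as the sparse-matrix suggestion.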


The Setup (usesthis.com) API

June 10, 2015

There’s a really interesting site, usesthis.com, AKA “The Setup”, which interviews people and lists all of the gear that they use, including software. I found out that they have an API (documented here), and I wanted to use my new API skills in Python to test it out! This one returns JSON, unlike the NPR API, which returned XML. Basically, I used the list API to return all of the interviews of people who use Python, then used the interviews API to return each of those people’s lists of gear. That way, I could tally up the most frequently used software (other than Python) among the interviewed Python users! Here’s my code in HTML IPython notebook form. I haven’t had a chance to practice visualizations yet, so please point me to any resources that will help me make the horizontal bar chart prettier! UsesThis API – Software that Python users use Preview of the ugly chart: Update 6/18/15: What about other software? I added an input so the user can type in any software title. For the output saved below, I typed in “Android” at the prompt. Here it is on nbviewer. You can use the download button in the upper right corner to download it and run it on your local IPython installation to try it out yourself!...
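The tallying step can be sketched roughly like this. This is not the notebook's code: the `interviews` list is a hypothetical stand-in for already-parsed JSON responses, and its field names (`name`, `software`) are assumptions, not the real UsesThis API schema. No API calls are made here.

```python
from collections import Counter

# Hypothetical stand-in for parsed JSON; the real API's fields may differ.
interviews = [
    {"name": "person-a", "software": ["Python", "Vim", "Git"]},
    {"name": "person-b", "software": ["Python", "Git", "Emacs"]},
    {"name": "person-c", "software": ["Python", "Vim", "Git"]},
]


def tally_other_software(interviews, target):
    """Count the other software listed by people who use `target`."""
    counts = Counter()
    for interview in interviews:
        titles = set(interview["software"])  # de-duplicate within a person
        if target in titles:
            counts.update(titles - {target})
    return counts


counts = tally_other_software(interviews, "Python")
for title, n in counts.most_common():
    print(title, n)
```

Passing a different `target` (for example a title typed in at a prompt) gives the same tally for any other piece of software.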
