Playing With Google Cloud Datalab

October 18, 2015

This weekend, I played around with the newly-released Google Cloud Datalab. I learned how to use BigQuery and also played around with Google Charts vs Pandas+Matplotlib plots, since you can do both in Datalab. I had a few frustrations with it because the documentation isn’t great, and also sometimes it would silently timeout and it wasn’t clear why nothing was running, but if I stopped all of the services, closed, restarted DataLab, and reopened, everything would work fine again. It’s clearly in Beta, but I had fun learning how to get it up and running, and it was cool to be able to write SQL in a Jupyter notebook. I tried to connect to my Google Analytics account, but apparently you need a paid Pro account to do that, so I just connected to one of the built-in public datasets. If you view the notebooks, you will see I clearly wasn’t trying to do any in-depth analysis. I was just playing around and getting the queries, dataframes, and charts to work. I hadn’t planned to get into too many details here, but wanted to share the results. I did jot down notes for myself as I set it up, which I’ll link to below, and you can see the two notebooks I made as I explored DataLab. Exploring BigQuery and Google Charts Version Using Pandas and Matplotlib (These aren’t tidied up to look professional – please forgive any typos or messy approaches!) Google Cloud Datalab Setup Notes (These are notes I jotted down for myself as I went through the setup steps. Sorry if they’re not...

Read More

DataSciGuide Contest

October 2, 2015

Want a way to help people that are learning data science, and also get a chance to win a $40 Amazon Gift Card? Review a data science blog, podcast, course, or other content at DataSciGuide! Here’s more info: http://www.datasciguide.com/review-stuff-and-win-a-40-amazon-gift-card/

Read More

DataSciGuide Update

September 6, 2015

I finally had a chance this weekend to make some progress on my “Data Science Directory” website, DataSciGuide.com, and I would love your feedback on it! That site isn’t open for comments yet, so I’m directing people to leave feedback here. If you haven’t kept up with the development of DataSciGuide, here are a few things to read: original vision for the site updates on my progress content that has been posted so far Let me know if you want an account to post some reviews while I test things out! (I’ll even post content that you want to review, just for you.) Also, tell me any thoughts you have about the site in the comment form below! (or tweet...

Read More

My “Secret” Side Project, Revealed

August 1, 2015

OK So I was actually hoping to show this to you all long ago, and I kept coming up with more and more ideas for it, so it’s not going to be “ready” to reveal for a while, but I figured I’d go ahead and show it to you anyway. My main motivation is that I keep hearing people say (and sometimes feel myself) that learning to becoming a data scientist on your own using online resources is totally overwhelming: there are so many different possible topics to dive into, few really good guides, lots of impostor-syndrome-inducing posts by people you follow that make you feel like they’re so far ahead of where you are and you’ll *never* get there…. but there’s so much great data science learning content online for everyone from beginners to experienced data scientists! We need a better way to navigate it. Hence my new website: “Data Sci Guide”. It will eventually have a personalized recommender system and structured learning guides and all kinds of other features to help you find the resources to go from where you are to where you want to be, but for now it’s “just” a directory / content rating site. And it’s not ready for you to interact with yet, but it’s getting there, and I’ll need your help fleshing it all out soon. So go take a look! Then come back here to give me feedback and suggestions, because you have to be registered to comment there and I didn’t turn on new user registration yet. OK go now. Don’t forget to come back! >>>> DATA SCI GUIDE.COM <<<   So…. what did you think? What do you think of the overall idea and plans? What should I be sure to remember to include? Tell me below!...

Read More

API and Market Basket Analysis

July 1, 2015

I was considering waiting until I’m done before posting about this project, but instead I thought I’d post my progress and plans while I think about the next steps. I posted earlier about using the UsesThis API to retrieve data about what other software people that use X software also use. I thought I was going to have to code a workaround for people that didn’t have any software listed in their interviews, but when I tweeted about it, Daniel from @usesthis replied that it was actually a bug and fixed it immediately! It makes it even more fun to develop since he is excited about me using his API! @BecomingDataSci: YES! It’s *awesome*. — The Setup (@usesthis) June 19, 2015 After seeing those results, I thought it would be interesting (and educational) to learn how to do a Market Basket Analysis on the software data. Market Basket Analysis is a data mining technique where you can find out what items are usually found in combination, such as groceries people typically buy together. For instance, do people often buy cereal and milk together? If you buy taco shells and ground beef, are you likely to also buy shredded cheese? This type of analysis allows brick and mortar retailers to decide how to position items in a store. Maybe they will put items regularly purchased together closer together to make the trip more convenient. Maybe they will place coupons or advertisements for shredded cheese next to the taco shells. Or maybe they will place the items further apart so you have to pass more goods on the way from one item to the other and are more likely to pick up something you otherwise wouldn’t have. Online retailers can use this type of analysis to recommend products to increase the size of your purchase. “Other customers that added item X to their shopping cart also purchased items Y and Z.” Because I had this interesting set of software usage from The Setup’s interviews, I wanted to analyze what products frequently go together. I searched Google for ‘Market Basket Analysis python,’ and it led me to this tweet by @clayheaton: I just wrote a simple Market Basket analysis module for Python. #analytics https://t.co/aVf58zcHJa — Clay Heaton (@clayheaton) April 4, 2014 I followed that link and checked out the code on github and it seemed to make sense, so I put the results of my usesthis API request into a format it could use. I did a test with the data from 5 interviews, and it ran. Then I tried 50 interviews, and the results showed that people that use Photoshop were likely to also use Illustrator, and vice-versa. It appeared to be working! However, I then hit a snag. I tried to run it with all of the software data, and it ran for a long time then crashed when my computer ran out of memory. Since it’s building an association graph with an edge for every rule (combination of software used), with up to two pieces of software per “side” of the rule (such as “people that have Photoshop and Illustrator also have a Mac”), you can imagine the graph gets pretty big when you have over 10,000 user-software combinations. I tweeted about this and Clay suggested modifying his code to store the items in a sparse matrix instead of a graph, and I agree that that sounds like a good approach, so that’s my next step on this project. I’ll post again when I’m...

Read More

IPython, Requests, lxml, and the NPR API

June 7, 2015

Last week, I decided to learn how to use python to get data from an API. I started with the Codecademy “Introduction to APIs in Python” course, which got me oriented to how requests work, and in the subsequent NPR API lesson, specifically how the NPR stories API works. Certain parts of the course assumed you knew more python than you had learned in the course, so heads-up that there are places you will probably have to google for help since the hints aren’t always related to what you’re stuck on. The course isn’t really a requirement for learning this stuff (and I thought it could use a lot of improvement), but it does give you a guided walk-through, which is nice when you are totally new to a topic. Then, I tweeted about my experience, and got 2 responses encouraging me to use the requests library instead of urllib that codecademy used. @BecomingDataSci the urllib api is terrible. You should take a look at http://t.co/CzIPob2tBV — Daniel Moisset (@dmoisset) June 1, 2015 @dmoisset @BecomingDataSci 2nding using of requests over urllib; esp. with HTTPS, requests tends to do saner things (e.g., cert validation) — Cheng H. Lee (@chenghlee) June 1, 2015 I decided to redo what I had learned from scratch, but using requests. I also wanted to learn how to use IPython, so I used an IPython notebook to play around with the code. Below is the HTML export of my IPython notebook, with comments explaining what I was doing. I’m sure there are better ways to do what I did (feel free to comment with suggestions!), but this was my first time doing any of this without any guidance, so I don’t mind posting it even if it’s a little ugly :) I definitely spent a lot of time understanding the hierarchy of the NPR XML and how to loop through it and display it. If you have done something similar in a more elegant way, please point me to your code! Here are the main resources I used to learn how to do what is in the code: python requests library documentation NPR API documentation python lxml library documentation iPython videos I also wanted to mention that there are a lot of frustrations you can run up against when you’re a python beginner. I was having a lot of problems with seemingly basic stuff (like installing packages with pip) and it took a couple hours of googling and asking someone for help to figure out there was a problem with my path environment variables in windows. I’ll post about that another time, but I just wanted to 1) encourage people not to give up if you get stuck on something that seems to be so basic that most “intro” articles don’t even cover it, and 2) encourage people writing intro articles to make some suggestions about what could go wrong and how to problem-solve. Here’s one example: When I tried to export my IPython notebook to HTML, it gave me a 500 server error saying I needed python packages I didn’t already have. After I installed the first, it told me I needed pandoc, so I installed that as well, but it kept giving me the same error. It turns out that you have to run IPython Notebook as an Administrator in Windows in order to get the HTML export to work properly, but the error message didn’t indicate that at all. This is the kind of frustration that may make beginners think they’re not “getting it” and give up, when it fact it’s something outside the scope of what you’re learning. Python seems to require a lot of this sort of problem-solving. (Note: on my other laptop, I installed python and the scipy stack using Anaconda, and have had a lot fewer issues like this.) Without further ado, here’s my iPython notebook! (I’m having issues making it look readable while embedded in wordpress, so click the link to view in a new tab for now, and I’ll fix for viewing later!) Renee’s 1st IPython Notebook (NPR API using requests and lxml) Here’s the actual ipynb file if you have IPython installed and want to run it yourself: First Python API Usage** **NOTE: WordPress wouldn’t let me upload it with the IPython notebook extension for security reasons, so after you download it, change the “.txt” extension to...

Read More