Doing Data Science (Review)

I just finished reading Doing Data Science: Straight Talk from the Frontline, an O’Reilly book by Cathy O’Neil (@mathbabedotorg) and Rachel Schutt (Columbia Data Science blog).

First let me say, I really enjoyed this book! I thought it gave a great overview of Data Science, which is very valuable at this early stage in my data science journey. The authors attempt to define Data Science, but also explain that the definition is evolving, and show throughout the book all of the different types of things that can be categorized as data science activities. I also enjoyed that they emphasize data science teams, and each presenter in the book (each chapter is based on a lecture in the course which had a guest speaker from the field) was introduced with their level of expertise in the various aspects of data science (see image below). For instance, some were more focused on machine learning, while others focused more on visualization, and they were from a variety of different industries. This was nice because it meant the authors didn’t use the same example problems repeatedly when discussing different techniques.

DS_profile
Data Scientist Profile (via semanticcommunity, more here)

Speaking of visualization, I would say that the one negative of the book is that the images were not designed to be printed in black and white, and many are hard to read. There is an image with the caption “red means cancer, green means not”, but the dots all appear to be similar colors of grey. There is an image the students in the class designed to show the various aspects of data science which is basically unreadable because it is tiny and has some text that comes out as grey-on-grey (I happened to find a color version of that image here).

Now, don’t expect to read the book and immediately be able to go out and do all of the activities in the book. First of all, there is a list of prerequisites the authors assume you have. You don’t have to have a deep understanding of all of these fields in order to gain something from the book, but they use terminology at times from linear algebra, statistics, machine learning, and other technical areas, and you would definitely need some of these skills in order to do some of the suggested activities. However, throughout the book are constant definitions and clarifications, references to other texts, websites, and people. I found this to be incredibly useful – any time you want to learn more about a topic, the authors point out how to find more information, and recommend books on the subject.

To make a metaphor, Rachel Schutt and Cathy O’Neil tell you about a great dish someone cooked, and give some general info about the process of making the dish, and what to watch out for when you attempt it yourself. They even include some quotes from the chef about the art of making this particular dish, and tips on preparing and presenting it. But you still have to go out and get the ingredients and tools and learn some cooking techniques and look in some other cookbooks in order to figure out the detailed steps. Then, you have to do a lot of chopping and sautéing and probably burn a few things before you successfully create a similar dish you can serve to your customers. They don’t just hand you a simple recipe, and you are probably a casual at-home cook, not a professional chef yet.

You could describe the book as kind of a “roadmap” to data science. There is some math and some code, but it is much more breadth than depth. The book is not pretentious, and actually warns data scientists against hubris, since overconfidence in a certain tool or method can have negative impact on your work. There are a lot of “tips”, “things to think about”, and “lessons learned” that I feel give the reader a great sense of what pitfalls you might come across when doing real-world analysis, and how to avoid the common ones, but only a few step-by-step how-to’s and code examples (in R or Python).

Some topics I bookmarked to learn more about that I hadn’t read about before “Doing Data Science” introduced them to me: F-score (a combination of precision and recall – terms defined in the book), Log Returns, Simpson’s Paradox, Exponential Random Graph Models.

Some topics I already knew a little about, but “Doing Data Science” helped me better understand: various similarity/distance metrics, exploratory data analysis, data leakage, recommendation engines, confounding variables.

I can imagine that some readers wouldn’t like that the book is “all over the place” and that it gives a combination of not much detail on some topics, and a lot of detail all at once on others; too technical and math-y on some topics, and very “laymans terms” on others. However, I liked that about the writing. It really touches on everything, and gives you enough direction to know where to go next to learn more. It feels like you’re meeting a bunch of people that have had a variety of experiences in the industry, and you’re all trying to give each other a feel for what you do: being technical enough to be impressive, but clear enough to be accessible, and explaining how you learned your particular subset of skills and where someone can get more info.

I give it 5 out of 5, despite the fact that the images were sometimes unreadable. I also suggest you check out the blog that goes with the course that the book follows: http://columbiadatascience.com/blog/.

See more books I’m reading, have read, or plan to read here: Becoming A Data Scientist “Learning” Page.

2 Comments

  1. Cathy O'Neil, mathbabe
    Jun 14, 2014

    Thanks for the nice review! The original printing was accidentally in black and white but they corrected that in subsequent printings. They also screwed up the author order and then fixed it but that doesn’t changes the readability much :)

    Cathy

    • Renee
      Jun 14, 2014

      Wow, it’s so nice to get a response from the author. Thanks for the reply, Cathy!

      I was wondering whether the black and white was a mistake — I did order a copy of the book as soon as it came out. I went ahead and fixed the author order above in the review since the cover of my book is the wrong one. Actually, the reference image on Amazon is still incorrect, too!

      The book was really enjoyable and I learned a lot. I’m sure I’ll be referring back to it regularly in the future, and I do want to go back and do some of the exercises to learn more. Thanks!