reviews – Becoming A Data Scientist https://www.becomingadatascientist.com Documenting my path from "SQL Data Analyst pursuing an Engineering Master's Degree" to "Data Scientist" Sat, 03 Oct 2015 01:42:34 +0000

DataSciGuide Contest https://www.becomingadatascientist.com/2015/10/02/datasciguide-contest/ https://www.becomingadatascientist.com/2015/10/02/datasciguide-contest/#respond Sat, 03 Oct 2015 01:42:34 +0000 https://www.becomingadatascientist.com/?p=650 Want a way to help people that are learning data science, and also get a chance to win a $40 Amazon Gift Card? Review a data science blog, podcast, course, or other content at DataSciGuide!

Here’s more info: http://www.datasciguide.com/review-stuff-and-win-a-40-amazon-gift-card/

https://www.becomingadatascientist.com/2015/10/02/datasciguide-contest/feed/ 0
The Data Science Central “Incident” https://www.becomingadatascientist.com/2015/07/08/the-data-science-central-incident/ https://www.becomingadatascientist.com/2015/07/08/the-data-science-central-incident/#comments Wed, 08 Jul 2015 05:36:04 +0000 https://www.becomingadatascientist.com/?p=548 I’m writing this post to respond both to what many of you saw Vincent Granville say about me on Facebook a couple days ago, which was brought to my attention yesterday:
[Image: vincent_granville_data_science_renee_comment_single] (in context)

and to his apology this evening:
[Image: granville_apology]

I didn’t want to write a second post about Data Science Central, but after the huge response on twitter today, I want to document everything in one place so anyone looking back at this has all of the info to evaluate what has been said.

I have thought a lot about Vincent Granville’s apology this evening, and honestly when I heard he had apologized, I hoped (but doubted) it would be sincere. I would have loved to be able to accept his apology and move on from all this. However, I can’t bring myself to accept the apology because it’s not really an apology, it’s an accusation. After writing a truly vile post about me, his “apology” accuses me of harassing *him*. He says that I have “attacked” him for 14 months, and is casting himself as a victim. He’s basically saying “I acted a fool in a heated moment because she’s been attacking me non-stop for over a year” (the heated moment apparently being a Facebook post about Ellen Pao that reminded him of me, and the “attacking” being me pointing out his questionable practices in a blog post and on twitter, I guess).

Because of that, and because there are a lot of people who are *actually* harassed online who I think would be offended by his characterization, I want to document everything I’ve said about him, and challenge his definition of harassment. What I have done is document what I saw as some very questionable (if not unethical) behaviors, and occasionally initiated or participated in conversation on twitter about that. I have never “attacked” him in any way, but I want to leave it up to you readers to decide.

Here is the history of my comments about Data Science Central and Vincent Granville:

April 21-22, 2014: Initial twitter conversation with @tesherista about Data Science Central’s contest to find fake accounts created to attract women and minorities to Data Science Central, where Vincent Granville deleted Cory’s comments questioning the practice. (screenshots by @AltonDataSci)

July 1, 2014: Original blog post “Something has been bothering me about Data Science Central” here on Becoming a Data Scientist, where I wrote about the above experience and exposed one of the fake Data Science Central profiles, “Amy Cordan”, as having a fake LinkedIn profile (still there as “Amy Sangrene”) with a fake Stanford Computer Science PhD, violating LinkedIn terms & conditions. (He mentioned me questioning his advanced degrees in his “apology”, and this is the only academic credential I have brought under scrutiny, that of “Amy”.)
In response to this post, I received the following comments from readers (among others, you can see them at the end of the post linked above):

  • Alton discussing his negative experience with the Data Science Central contest
  • Ellie talking about losing trust in DSC when she tried to contact “Amy” and realized she wasn’t real
  • Hubart and David questioning his academic background (maybe this is why he thought I did? because commenters on my post did?)
  • “System Administrator” recalling another use of the name Amy Cordan by Vincent Granville online in the past
  • Eric mentioning he found that Vincent Granville was accused of paying people to write positive Amazon reviews of his Developing Analytic Talent book
  • A comment by someone who claims to be the “real” Amy Cordan (Henriques) and used to be close friends with Vincent Granville’s wife

July 1-5, 2014: Twitter conversation with @altondatasci following the blog post above, as well as an explanation for why I wrote the post:

October 26, 2014: Tweet to @kissmetrics (and conversation following) alerting them that Amy was a fake profile.

June 8-10, 2015: Tweets after the real Amy Cordan commented on my blog, conversation between @ellieaskswhy, @metabrown312, and @tesherista on Twitter about fake Amy and deleted comments. Follow-up tweets warning people again about what we had found, and talking about the suspended accounts.

June 24, 2015: Tweets about finding out my Data Science Central account was suspended.

Throughout all of this, I never emailed or otherwise directly contacted Vincent Granville. He also never responded to any of these comments or my blog post, and the only reason I had to think he might be aware of my posts was that my Data Science Central account was suspended. Today, he also blocked me, and several other people that shared my post, from following @DataScienceCtrl on Twitter. Suspending women’s accounts on Data Science Central and then calling someone who questions his practices a “harasser” and “attacker” is not a great way to encourage women to participate on his sites.

I’m not rehashing all of this to keep it going, but to show what a man that called me (and my followers/readers) names unprovoked is calling “harassment” and “attacks”. Reading the details of his apology/accusation, it also appears he is lumping the comments of everyone that has questioned/called out Data Science Central on twitter or this blog into a persona he thinks is all me, when in fact I wrote few of those comments myself, but instead provided a space for others to share their experiences.

If Vincent Granville wants to actually apologize for his statement and actions, without blaming and accusing me inappropriately in the process, I will accept it.

https://www.becomingadatascientist.com/2015/07/08/the-data-science-central-incident/feed/ 18
Codecademy Python Course: Completed https://www.becomingadatascientist.com/2014/09/21/codecademy-python-course-completed/ https://www.becomingadatascientist.com/2014/09/21/codecademy-python-course-completed/#comments Sun, 21 Sep 2014 20:26:52 +0000 https://www.becomingadatascientist.com/?p=349 I can cross off another item on my Goals list since I finally jumped back into the Codecademy “Python Fundamentals” course and completed the final topics this afternoon.

I think the course would be good for people that have had at least an introductory programming course in the past. I didn’t have much trouble with the tasks (though a few were pretty tricky), but I have programming experience (and taught myself some advanced Python outside of the course for my Machine Learning class) and can imagine that someone that had never programmed before and was unfamiliar with basic concepts might get totally stuck at points in the course. I think they need 2 levels of “hints” per topic so that if you just need hints on the most common difficult things that trip people up, you click it once and get the hints they show now. But if you’re truly stuck and need to be walked through it, they should have more in-depth hints for true beginners.

The site estimates it will take you 13 hours to complete the course. I don’t know how much time I spent on it total, since it was broken up over months. It took me about an hour to finish the final 10% of the course, covering classes, inheritance, overrides, file input/output and reviews, then also going back and figuring out where the final 1% was that it said I hadn’t completed (apparently I skipped some topic mid-course accidentally) so I could get the 100% topic complete status.

The topics covered are:

  • Python Syntax
  • Strings and Console Output
  • Conditionals and Control Flow
  • Functions
  • Lists & Dictionaries
  • Loops
  • Iteration over Data Structures
  • Bitwise Operators
  • Classes
  • File Input & Output

I thought this was a good set of topics for an intro course. If they dropped anything, I think Bitwise Operators was a “bit” unnecessary for beginners. I liked the projects they included to test out the skills you learned, like writing a program as if you are a teacher and need to calculate statistics on your class’ test scores.

Overall, I think Codecademy did a good job with this course, and I would recommend it to other programmers who want to get up to speed on Python quickly. I would also point beginners to the course, but with a warning that there are tricky spots they may need outside resources to get through.

https://www.becomingadatascientist.com/2014/09/21/codecademy-python-course-completed/feed/ 3
Something has been bothering me about Data Science Central https://www.becomingadatascientist.com/2014/07/01/something-has-been-bothering-me-about-data-science-central/ https://www.becomingadatascientist.com/2014/07/01/something-has-been-bothering-me-about-data-science-central/#comments Wed, 02 Jul 2014 02:00:18 +0000 https://www.becomingadatascientist.com/?p=298 So, what I’m about to write about actually occurred a few months ago, but I am reminded of it every day when I receive an email from Data Science Central or see someone tweet an article from the blog network (which includes Analytics Bridge, Big Data News, etc.), so I figured if it’s still bothering me, it’s worth writing about.

In April, I saw a post by Vincent Granville, owner of and primary author at Data Science Central, which said something like

One way we attract women and minorities to Data Science Central is to create accounts that post articles with female profiles and photos, which are not actually written by women. Can you use data science to find these 5 faux bloggers? The winner will receive $500.

I have to try to remember the original post and paraphrase here (I’m sure this is not close to the original text, but I hope I am capturing the message), because the post has now been modified to appear as if the contest was only to find the one “example” fake account with a camel avatar (current version here).

However, you can tell that the original contest was different based on the submission by “Alton” on the site, who was nice enough to hold back the names of the accounts he found in case they weren’t decoys, but was clearly trying to find more accounts than just the “camel” decoy. Below is a screenshot in case it gets modified on the site.

[Image: Alton_DSC_comment]

My initial response to the original post with the fake female bloggers was, “How many sites do this? Is this sexist? It sure is off-putting for this guy Vincent Granville to post articles under fake accounts, pretending to be a woman or underrepresented minority. Do we actually fall for this kind of thing? Is it a widely accepted practice?” I scrolled down and saw that Cory Teshera had posted a comment questioning the practice, and responding incredulously to the approach as well. I posted about it on twitter, showing my surprise and asking questions about the approach (Including her name and tweets here with her permission):

No one had responded, then I checked back to the post and saw that Cory’s comments had been deleted! I couldn’t believe that I was seeing a post talking about attracting women to the site, but the first woman to comment on the approach was being silenced!

I found Cory on Twitter and asked whether she was the one that posted and whether she had deleted the comment or the site had, and she responded:

Then I let her know I might blog about it and we chatted a bit via tweets and DMs. At this point, the blog post had been modified to remove all reference to the practice. Neither of us had received responses from the site. The only response was the “silent” deletion of comments and editing of the contest post.

I was curious at this point, and started browsing Data Science Central to see if I could find any of these fake female accounts. I didn’t have to use any data science methods to find one right away. I just looked at the top featured posts, clicked on one with a female avatar, and found this article:
Good and Not So Good Companies for Data Scientists
Here is Amy’s profile: http://www.datasciencecentral.com/profile/Amy
I see “she” is blogging heavily now since I last looked. She lists no last name, so I can’t look her up anywhere else that way, but I did an image search on Google and found: “Amy” Image Search Results
Amy is spending a lot of time posting on all of the Data Science Central network sites. On the Hadoop360 site, she is listed as “Amy Cordan”. At this point, I was still holding out a glimmer of hope that Amy could be a real woman, and looked her up on LinkedIn. I would have been happy to find out she was actually working for Data Science Central as a writer. Oh look! There is an Amy Cordan on LinkedIn who is listed as a “Data Scientist” with a PhD in Computer Science from Stanford! Her photo looks a little different though… she sure has a lot of endorsements… but she only has one experience listed… “Co-Founder for Data Science Foundation”… let’s check out their site… DataShaping.com. Uh… well, this is clearly a dummy site, and the email address is apparently Vincent Granville’s. I did find this “staff” page, which strangely doesn’t list “co-founder” Amy. It appears the whole profile, including the sparse LinkedIn profile page with the Stanford PhD but no experience other than working on a data science blog, is totally fake.

Anyway, you get the point. Amy does not appear to be a real woman. What really got me is that there is an apparently off-topic response to “Amy”’s post (linked above) by Vincent Granville about how “Amazon should hire people to improve security on AWS and deal with fake reviews.” Excuse me, mister… you are replying to a post on your own site, which was written by a fake author, which was probably written by you! How hypocritical.

At this point I was totally turned off from Data Science Central, so if the intent of these fake profiles was to attract women to the site, it definitely backfired for me.

Here are my questions now. How many of the females on the site are actually real? Are there many women and minorities joining the site, and are they influenced by these fake accounts falsely making it appear as if more females are participating than actually are? Is this a common practice among technology networking websites? Does it work? Should we accept it as necessary? Has Vincent Granville made any real effort to ask females to write for Data Science Central?

Do the people endorsing “Amy”’s LinkedIn profile know it is fake? Are they all fake profiles that Vincent Granville created and had endorse each other?

I have so many questions, and am not coming up with many satisfactory answers myself, other than feeling sad and put off by it all. Please let me know what you think!

(P.S. If you’re reading this, Mr. Granville, posts like this that say things like “The first to prove or disprove our conjecture will win $500 and will have his name associated with the theorem in question” aren’t helping you attract any female readers.)

As for me, it has left a bad taste in my mouth, and I’m currently not retweeting anything that I recognize as being from the Data Science Central network, because I just can’t trust anything it produces at this point.

If anyone wants to do any analysis on the site posts, I’m sure there are algorithms out there that can determine whether they’re likely to have all been written by the same author (same typos, style, etc.). I know there is also this analysis tool which is supposed to be able to tell whether a male or female likely wrote a clip of text: Text Gender Classifier (h/t Paul Marks)
The software is of course not perfect, but the text from “Amy”’s short article above came out to 68% likely to be written by a male, while this article you’re reading right now was classified as 65% likely female.
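One rough way to approach the same-author question above is to compare character n-gram profiles of two texts. The sketch below is only an illustration, not the method used by the Text Gender Classifier or any other specific tool; the function names (`char_ngrams`, `cosine_similarity`, `style_similarity`) and the choice of trigrams are assumptions made for this example.

```python
from collections import Counter
import math


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in lowercased, whitespace-normalized text."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def style_similarity(text_a: str, text_b: str, n: int = 3) -> float:
    """Hypothetical helper: score how alike two texts are in spelling and phrasing habits."""
    return cosine_similarity(char_ngrams(text_a, n), char_ngrams(text_b, n))
```

A score near 1 suggests similar spelling and phrasing habits, and a score near 0 suggests very different ones. It is a hint at best, not proof of shared authorship, and short texts give unreliable scores.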

I guess I’m surprised at how little “sleuthing” I needed to do to see right through all of this. I didn’t spend hours poring over the site, I clicked on the first article I saw with a female author photo, and researched that author’s profile using Google. It’s practically out there in the open, and since Mr. Granville posted the contest – which has now been edited – to identify these faux bloggers, it appears he wasn’t trying to hide the practice.

And by the way, though Vincent Granville apparently has trouble finding females in Data Science to write for his blog, they do exist and aren’t hard to find on twitter or LinkedIn. I’ve started following the data science women I find on Twitter using a twitter list (Please suggest more in the comments!):
Women in Data Science Twitter List

Also check out Meta Brown’s “Binder fulla Women in Analytics” posts on LinkedIn!

https://www.becomingadatascientist.com/2014/07/01/something-has-been-bothering-me-about-data-science-central/feed/ 56
Doing Data Science (Review) https://www.becomingadatascientist.com/2014/06/13/doing-data-science-review/ https://www.becomingadatascientist.com/2014/06/13/doing-data-science-review/#comments Sat, 14 Jun 2014 03:42:31 +0000 https://www.becomingadatascientist.com/?p=322 I just finished reading Doing Data Science: Straight Talk from the Frontline, an O’Reilly book by Cathy O’Neil (@mathbabedotorg) and Rachel Schutt (Columbia Data Science blog).

First let me say, I really enjoyed this book! I thought it gave a great overview of Data Science, which is very valuable at this early stage in my data science journey. The authors attempt to define Data Science, but also explain that the definition is evolving, and show throughout the book all of the different types of things that can be categorized as data science activities. I also enjoyed that they emphasize data science teams, and each presenter in the book (each chapter is based on a lecture in the course which had a guest speaker from the field) was introduced with their level of expertise in the various aspects of data science (see image below). For instance, some were more focused on machine learning, while others focused more on visualization, and they were from a variety of different industries. This was nice because it meant the authors didn’t use the same example problems repeatedly when discussing different techniques.

DS_profile
Data Scientist Profile (via semanticcommunity, more here)

Speaking of visualization, I would say that the one negative of the book is that the images were not designed to be printed in black and white, and many are hard to read. There is an image with the caption “red means cancer, green means not”, but the dots all appear to be similar colors of grey. There is an image the students in the class designed to show the various aspects of data science which is basically unreadable because it is tiny and has some text that comes out as grey-on-grey (I happened to find a color version of that image here).

Now, don’t expect to read the book and immediately be able to go out and do all of the activities in the book. First of all, there is a list of prerequisites the authors assume you have. You don’t have to have a deep understanding of all of these fields in order to gain something from the book, but they use terminology at times from linear algebra, statistics, machine learning, and other technical areas, and you would definitely need some of these skills in order to do some of the suggested activities. However, throughout the book are constant definitions and clarifications, references to other texts, websites, and people. I found this to be incredibly useful – any time you want to learn more about a topic, the authors point out how to find more information, and recommend books on the subject.

To make a metaphor, Rachel Schutt and Cathy O’Neil tell you about a great dish someone cooked, and give some general info about the process of making the dish, and what to watch out for when you attempt it yourself. They even include some quotes from the chef about the art of making this particular dish, and tips on preparing and presenting it. But you still have to go out and get the ingredients and tools and learn some cooking techniques and look in some other cookbooks in order to figure out the detailed steps. Then, you have to do a lot of chopping and sautéing and probably burn a few things before you successfully create a similar dish you can serve to your customers. They don’t just hand you a simple recipe, and you are probably a casual at-home cook, not a professional chef yet.

You could describe the book as kind of a “roadmap” to data science. There is some math and some code, but it is much more breadth than depth. The book is not pretentious, and actually warns data scientists against hubris, since overconfidence in a certain tool or method can have a negative impact on your work. There are a lot of “tips”, “things to think about”, and “lessons learned” that I feel give the reader a great sense of what pitfalls you might come across when doing real-world analysis, and how to avoid the common ones, but only a few step-by-step how-to’s and code examples (in R or Python).

Some topics I bookmarked to learn more about that I hadn’t read about before “Doing Data Science” introduced them to me: F-score (a combination of precision and recall – terms defined in the book), Log Returns, Simpson’s Paradox, Exponential Random Graph Models.
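To make the F-score item above concrete, here is a minimal sketch of how precision and recall combine into a single number. The function name `precision_recall_f` and the example counts are hypothetical and chosen only for illustration; they do not come from the book.

```python
def precision_recall_f(tp: int, fp: int, fn: int, beta: float = 1.0):
    """Return (precision, recall, F-beta) from true positive, false positive,
    and false negative counts. With beta=1 this is the usual F1 score, the
    harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_score


# Hypothetical counts: 8 correct positive predictions, 2 false alarms, 8 misses.
precision, recall, f1 = precision_recall_f(tp=8, fp=2, fn=8)
```

With these made-up counts, precision is 0.8 and recall is 0.5, and F1 lands at about 0.615, closer to the weaker of the two. That is the point of using a harmonic mean: a model cannot score well by doing well on only one of them.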

Some topics I already knew a little about, but “Doing Data Science” helped me better understand: various similarity/distance metrics, exploratory data analysis, data leakage, recommendation engines, confounding variables.

I can imagine that some readers wouldn’t like that the book is “all over the place” and that it gives a combination of not much detail on some topics, and a lot of detail all at once on others; too technical and math-y on some topics, and very “layman’s terms” on others. However, I liked that about the writing. It really touches on everything, and gives you enough direction to know where to go next to learn more. It feels like you’re meeting a bunch of people that have had a variety of experiences in the industry, and you’re all trying to give each other a feel for what you do: being technical enough to be impressive, but clear enough to be accessible, and explaining how you learned your particular subset of skills and where someone can get more info.

I give it 5 out of 5, despite the fact that the images were sometimes unreadable. I also suggest you check out the blog that goes with the course that the book follows: http://columbiadatascience.com/blog/.

See more books I’m reading, have read, or plan to read here: Becoming A Data Scientist “Learning” Page.

https://www.becomingadatascientist.com/2014/06/13/doing-data-science-review/feed/ 2
The Signal and the Noise (Review) https://www.becomingadatascientist.com/2014/05/27/the-signal-and-the-noise-review/ https://www.becomingadatascientist.com/2014/05/27/the-signal-and-the-noise-review/#comments Wed, 28 May 2014 03:59:52 +0000 https://www.becomingadatascientist.com/?p=283 This is a review of The Signal and the Noise: Why So Many Predictions Fail — but Some Don’t by Nate Silver.

So if you were following me on Twitter, you may realize that it took me months to read this book, which is very unlike me. I normally devour books that interest me in a weekend, or a few weeks at most. There are a few reasons for that.

  1. I was taking grad classes at the time and had little time for reading
  2. At times, the book gets quite tedious (I’ll expand on this later), but mostly,
  3. My Kindle tricked me into thinking the book was extraordinarily long.

If you have the Kindle version of the book, you may notice that when you click on the asterisks (and I recommend doing so, there are some funny notes and valuable insights hidden there), it will take you to a page that has only the short “footnote” and nothing else. It turns out that the Kindle version has these all as pages at the end of the book. As I was reading, and figured I’d be nearly halfway through, I looked and saw I was still not even at 30%. I thought it must be ridiculously long, so I put it off and didn’t read very enthusiastically when I did pick it up. Then, when I did finally take time to finish reading it in the past few weeks since the semester ended, I finished much earlier than expected, because the Kindle text ends at the 66% mark! The last full third of the book is all endnotes, footnotes, and references. So, it’s not as long as it first appears!

Considering that, there are still some sections that felt tedious to me. There was a long section about baseball statistics and Nate Silver’s early analysis work that was quite self-indulgent, and I recommend skimming it unless you’re actually interested in sports statistics. I didn’t feel it added much to the overall message of the book.

Despite it taking me a long time to read, I did like the overall message, which seems to be “most people are terrible at making predictions, so don’t get overconfident in a single model or forecaster, and learn how to tell if your model is good.” I think it’s important to remember that models we build based on data can be quite good, but they are still affected by human error or bias, and unless you have perfect data (which one could argue doesn’t actually exist because you can’t ever capture the entire context), your model won’t be perfect, and often will be much worse.

Another recurring point in the book is that we should constantly update our ideas about the world based on new information and probabilities (a Bayesian approach).
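As a small illustration of what that kind of updating looks like in practice, here is a sketch of Bayes’ rule applied twice in a row. The function name `bayes_update` and all of the probabilities are made up for this example and are not taken from the book.

```python
def bayes_update(prior: float, p_evidence_if_true: float, p_evidence_if_false: float) -> float:
    """Return the posterior probability of a hypothesis after seeing one piece
    of evidence, using Bayes' rule:
        P(H | E) = P(E | H) * P(H) / (P(E | H) * P(H) + P(E | not H) * P(not H))
    """
    numerator = prior * p_evidence_if_true
    evidence = numerator + (1.0 - prior) * p_evidence_if_false
    if evidence == 0.0:
        raise ValueError("evidence has zero probability under both hypotheses")
    return numerator / evidence


# Illustrative numbers only: start out believing a hypothesis is 1% likely,
# then see evidence that is 9 times more likely if the hypothesis is true.
belief = 0.01
belief = bayes_update(belief, p_evidence_if_true=0.9, p_evidence_if_false=0.1)
# A second, independent piece of similar evidence updates the belief again.
belief_after_second = bayes_update(belief, p_evidence_if_true=0.9, p_evidence_if_false=0.1)
```

With these made-up inputs, the belief moves from 1% to roughly 8% after the first observation and to about 45% after the second. Each posterior becomes the prior for the next update, so the estimate keeps shifting as new information arrives.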

The book doesn’t give technical or mathematical details, except during an explanation of Bayesian statistics (which included some memorable examples about terrorism forecasts before and after 9/11), but gives several different examples of applying certain approaches to analysis that generally do or don’t work. It is very anecdotal, and based on the topics and the interviewees, appears pretty male-focused in topics and participants (I only remember one woman being interviewed or cited), so I would guess that people who have interests similar to Silver’s (heavy on sports and the financial industry) would enjoy it more than I did. I did find it to be a worthwhile read, though.

Here are some topics I noted that came up a few times:

  • Data doesn’t speak for itself, we give it meaning. Related: Our predictions are never completely objective.
  • You’d be surprised how many studies out there can’t be replicated, and how many models are overfitted and not really good predictors
  • Weather forecasting is one area where models have improved significantly over time and are quite accurate now.
  • Don’t be a “hedgehog” and get stuck on rigid “truths” about the world as if there are immutable underlying laws that you understand and can apply in all situations. Instead be a “fox”, seeing the uncertainty in every situation, and looking at multiple ways to approach every problem. (It has been found that “foxes” are better forecasters.)

And here are some quotes from sections I bookmarked:

“We forget – or we willfully ignore – that our models are simplifications of the world. We figure that if we make a mistake, it will be at the margin. In complex systems, however, mistakes are not measured in degrees but in whole orders of magnitude.”

“…experts either aren’t very good at providing an honest description of the uncertainty in their forecasts, or they aren’t very interested in doing so. This property of overconfident predictions has been identified in many other fields… It seems to apply both when we use our judgment to make a forecast… and when we use a statistical model to do so.”

“If you have strong analytical skills that might be applicable in a number of disciplines, it is very much worth considering the strength of the competition. It is often possible to make a profit by being pretty good at prediction in fields where the competition succumbs to poor incentives, bad habits, or blind adherence to tradition… It is much harder to be very good in fields where everyone else is getting basics right…”

I think you get the idea about the type of advice he gives along with the examples he details. I wish the book were slightly more technical, and a little less verbose in sections, but overall it is a good read for anyone considering a career in building data models for forecasting, who needs to gain some insight into picking out “signal” from “noise” and not falling into the many common traps analysts apparently fall into more often than we should.

Overall, I give it 4 out of 5 stars. (5 for overall content, -1 for dragging on in some spots since I’m an impatient reader)

See more books I’m reading, have read, or plan to read here: Becoming A Data Scientist “Learning” Page.

https://www.becomingadatascientist.com/2014/05/27/the-signal-and-the-noise-review/feed/ 2