The Imitation Game, and the Human Element in Data Science

Last night, my husband and I watched The Imitation Game. First of all, it’s a great movie and you should see it. Secondly, there was a moment that got me thinking about the human element of machine learning.

[Spoiler Alerts – but you probably already know much of the story, and the movie is still good even if you know the historical outcome.]

I thought a moment like this might be coming when Alan Turing was first applying to work at Bletchley Park, and Denniston can’t believe he’s applying to break Nazi codes without even knowing how to speak German. Alan emphasizes that he is masterful at games and solving puzzles, and that the Nazi Enigma machine is a puzzle he wants to solve. He starts designing and building a machine that should theoretically be able to decode the Nazi radio transmissions, but the Enigma settings change every day at midnight, so the machine must solve for the settings before the next reset in order for that day’s messages to be decoded in time to be useful and not interfere with the next day’s decoding. Turing can’t prove his machine will work, because it is simply taking too long to solve the daily puzzle. In the meantime, people are dying in the war, and the Nazis go on transmitting their messages over normal radio waves, believing the code is “unbreakable”.

[More specific plot spoiler in the 2 paragraphs below]

The moment I’m referring to is when Alan hears a woman explaining that the German man whose messages she is assigned to translate always starts his messages with “Cilla”, who she assumes is the transmitter’s girlfriend. This triggers Alan to realize that predictable, repeated phrases could drastically narrow down the number of possible decryption keys, because the machine could be told to expect a particular word or phrase in every message. They realize that a weather report is transmitted at 6am daily and that every message ends in “Heil Hitler”, so they set the machine to focus its search on finding the words “weather”, “heil”, and “Hitler” in the 6am message each day. This solves the puzzle: the machine can then quickly decode the messages from a much narrower set of possible solutions instead of running all day without finding one.

So, Alan Turing (at least in the movie version) had been searching a space of hundreds of millions of possible settings, which his machine could work through faster than humans could, and trying to tune the design of his machine to find the solution faster. The hint that ended up leading to the solution involved bringing the human thought process back into it: the messages were written by humans, and humans follow certain communication and linguistic patterns, so anything the messages said frequently could be exploited to narrow down the range of solutions.
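
To make that idea concrete, here is a toy sketch of the “crib” trick in Python: a known, expected phrase lets a program throw away almost every candidate key automatically. This is not how Enigma or Turing’s bombe actually worked; the cipher below is just a Caesar shift, and the German sample text is made up for illustration.

```python
# Toy illustration of a "crib" (known expected phrase) narrowing a key search.
# NOT how Enigma or the bombe worked -- the cipher here is a simple Caesar shift.
import string

ALPHABET = string.ascii_lowercase

def caesar_decrypt(ciphertext: str, shift: int) -> str:
    """Shift each letter back by `shift`, leaving other characters alone."""
    out = []
    for ch in ciphertext:
        if ch in ALPHABET:
            out.append(ALPHABET[(ALPHABET.index(ch) - shift) % 26])
        else:
            out.append(ch)
    return "".join(out)

def candidate_keys(ciphertext: str, crib: str) -> list:
    """Keep only the shifts whose decryption contains the expected crib."""
    return [s for s in range(26) if crib in caesar_decrypt(ciphertext, s)]

# "wetterbericht" is German for "weather report" -- the kind of phrase the
# codebreakers knew would appear in the 6am message every day.
message = "wetterbericht fuer heute: regen. heil hitler"
ciphertext = caesar_decrypt(message, -7)  # decrypting with -7 == encrypting with shift 7

# Only the true key survives; no human has to read 26 candidate decryptions.
print(candidate_keys(ciphertext, "wetterbericht"))  # -> [7]
```

The point is the same one the movie makes: knowing what the humans on the other end are likely to say turns a hopeless search into a quick one.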

[/end specific plot spoiler]

This got me thinking about machine learning and data science in general. A frequent Kaggle competition winning strategy is to quickly iterate through a multitude of algorithms and optimize for the evaluation metric, or to focus on stacking thousands of models. [Here are several interviews with Kaggle winners.] Some of the winners do mention needing to study up on the field to gain some domain expertise, and the Quora answer here focuses on data understanding and preparation, but many teams appear to take a brute-force approach to building their models, writing programs that iterate through each combination of algorithms to maximize the area under the curve or whatever measure that particular competition is scored on.
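
For a concrete picture of that brute-force loop, here is a minimal sketch in Python using scikit-learn. The synthetic dataset and the short model list are placeholders of my own; real competition pipelines layer feature engineering, hyperparameter tuning, and stacking on top of a loop like this.

```python
# Minimal sketch of "try several algorithms, keep whatever scores best on the metric".
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for a competition's training set.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Score every model on the competition metric (here, area under the ROC curve)
# and keep whichever happens to win -- no domain knowledge required.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in models.items()
}
best = max(scores, key=scores.get)
print(scores)
print("winner:", best)
```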

I don’t want to do that type of brute-force data science. I’ll leave that to the competitive types who enjoy quickly iterating through as many solutions as possible, building systems to help them do it faster, and focusing on winning over understanding. Of course there is a place for that type of approach, and it is very valuable in solving some problems, but it’s not attractive to me. It’s too robotic.

However, there are many areas (like my day job, university/non-profit fundraising) where knowing your specific community, what types of data you’re collecting, where the data comes from, and how trustworthy it is will have a large impact on what fields you decide to include in your model and, of course, how you interpret the output. These are areas where we’re talking about data with a lot of variety but not necessarily a lot of volume (at least not on the scale of something like credit card processing and fraud detection), and where I would think domain expertise is more valuable than fast model optimization. It is also vital to be able to explain how to use the output of a given model. It is more of a consulting role than a mathematical/programming one.

I think that’s where I’m headed in my data science learning. I am aiming to learn how to use models to inform decision-making, how to choose the best data to put into a model, how to choose which type of model to use, and how to watch out for things like collinearity and other confounding effects. Then, when you get a result, how to know whether to trust it, how to explain it to non-technical managers, and how to best implement what was learned in order to have a positive impact on real-world outcomes. It is more of a creative iterative cycle than a machine/optimization iterative cycle.
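
As one small example of the “watch out for it” part, here is a hedged sketch of checking whether any input features are so strongly correlated (collinear) that they would muddy a model’s interpretation. The donor-style column names and the 0.9 threshold are invented for illustration, not taken from any real fundraising dataset.

```python
# Flag pairs of highly correlated (collinear) features before modeling.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
donors = pd.DataFrame({
    "years_since_graduation": rng.integers(1, 40, size=500),
    "lifetime_giving": rng.gamma(2.0, 500.0, size=500),
    "event_attendance": rng.poisson(3, size=500),
})
# Deliberately add a near-duplicate feature so the check has something to find.
donors["giving_last_decade"] = donors["lifetime_giving"] * 0.8 + rng.normal(0, 50, 500)

corr = donors.corr().abs()
# Report feature pairs whose absolute correlation exceeds an (arbitrary) 0.9 threshold.
flagged = [
    (a, b, round(corr.loc[a, b], 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > 0.9
]
print(flagged)  # expect the lifetime_giving / giving_last_decade pair to be flagged
```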

Like in The Imitation Game, understanding human behavior and communication can be the “big hint” that informs a technical solution and optimizes its performance beyond what a better-tuned piece of hardware or more efficient code could do. It seems to me (and I have heard others say) that understanding people and business is more than just one piece of the “data science Venn diagram”; it is really the key to success in this field. (And also a good reason to have diverse data science teams.)

I’m curious how those of you who work as data scientists see these various aspects of data science, and how much of your work involves creative and “human” skills vs. the “hard skills” of math and computer science. I would think it varies depending on the industry and the type of problem you’re trying to solve, but I am interested in your personal experience. Please comment below to share your experiences!


5 Comments

  1. Nicole
    Aug 8, 2015

    Amen. And it’s brute-force-data-science that’s unfortunately going to be what “democratizes” it over the next few years… then, after a few high profile bad decisions (which hopefully don’t involve extensive litigation or threats to human health or safety) the “craft” aspect should creep back in. “Autopilot Mode” is coming (e.g. with BigML and Amazon ML services) but you still need a pilot in case of emergency. Which, in data science, could potentially be *every single time*.

    This reminds me of the discussions we were having years ago in astronomy when storage was getting cheaper, and data volumes per unit time were getting bigger and bigger. Easy solution? Just archive all of it, of course! But without being able to effectively describe the original observer’s intent, and store and search that, the value of the data was pretty low. So why spend a few tens of thousands of dollars a year on storing data that really didn’t have much archival value?

    I think data science is similar. We really have to cautiously examine what value using a particular model will add… and really examine it in terms of current context and envisioned context. Brute force cloud ML services can’t do that. Nor would we want them to. But guaranteed, a lot of people will be doing just that.

    • Renee
      Aug 8, 2015

      Yep, I agree with you. And I’ve read interviews with several data scientists where they emphasize “make sure you know the question you are trying to answer, and how the answer to that question will be used before you start developing an approach”.

      Also, good point about autopilot vs emergency manual mode.

  2. Joerg
    Aug 8, 2015

    Oh I think that the human aspect is the most important aspect of Data Science. You need to communicate your findings, need to meet business needs, need to get the DevOps to help you with your stack, etc. I think Machine Learning is like 5%, programming is 45%, and 50% is communication in Data Science (numbers sampled from a rear end distribution).

    • Renee
      Aug 8, 2015

      Huh, interesting that you say it’s as high as 50%. I would tend to agree (and it’s similar when working as a data analyst), and I wonder if other people have found the same in their data science roles.

  3. Renee
    Aug 8, 2015

    Here’s a video about the Enigma machine, and the flaw that was discovered that helped break the code:
    https://www.youtube.com/watch?v=V4V2bpZlqx8