The last chapter is an introduction to R, and it harkens back to previous chapters now that the kernels of understanding have been planted. For example, if you’re doing an exponential smoothing forecast (which I cover in my book), you should not be doing all these steps every time. You should be doing it on the shoulders of the giants who’ve written the Ph.D. theses you’re using and just open their package.
Ultimately, people who want to know each little detail of how a boosted tree model works or how modularity maximization works seem to love the book. Programmers who are used to relying on black-box libraries, functions, etc. aren’t the biggest fans.
Given your interest in opening up black boxes to examine the nitty-gritty of different techniques, did you ever want to write your thesis on a new statistical or machine learning technique?
I started at MIT wanting a Ph.D., but in my first year of graduate work I had the opportunity to do some applied work on Dell’s supply chain, and it showed me that my passion lay outside of academia.
You see, my advisor was really interested in publishing results. Although we came to Dell trying to understand how to help the business generate revenue — which I enjoyed — that wasn’t our ultimate goal. The problem with consulting when you have ulterior academic purposes is that the goal of academic publishing is counter to the goal of helping a business, because in order to publish, you need something academically new to say. But if it’s a new technique, it is often not maintainable by the business once the academics leave.
That was a good experience for me, because I realized that I’m not an academic despite the fact that I like technical things. Rather, I’m an analytics professional who enjoys tailoring technical approaches to business settings where the solutions are sometimes complex but often simple, depending not on my needs as a data scientist but on the business’s needs or the customer’s needs.
That ability to think simply and “edit” models is something I just published an article on. One thing I reference in the article is a paper from 1993 by Robert Holte titled “Very Simple Classification Rules Perform Well on Most Commonly Used Datasets.” His basic premise is that simple decision rules (a single rule that splits the data on one feature) are pretty effective compared to more complex models, like a CART model. That makes sense, since oftentimes in naturally occurring data sets within the business, you have a couple of features that are good, and everything else is just icing on the cake.
JOHN FOREMAN
Your model complexity has to be justified, and that really grabbed me.
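As a rough illustration of the kind of one-feature rule described above, here is a minimal Python sketch. It is a simplified stand-in, not Holte's exact procedure: it assumes every feature is categorical, and the function names (`fit_one_rule`, `predict_one_rule`) and the toy customer data are hypothetical, invented only for this example.

```python
from collections import Counter, defaultdict


def fit_one_rule(rows, labels):
    """Choose the single feature whose value -> majority-label rule
    makes the fewest errors on the training data."""
    best = None
    for j in range(len(rows[0])):
        # Tally the labels observed for each value of feature j.
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[j]][label] += 1
        # The rule maps each feature value to its most common label.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rule[row[j]] != label for row, label in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, j, rule)
    _, feature, rule = best
    # Fall back to the overall most common label for unseen values.
    default = Counter(labels).most_common(1)[0][0]
    return feature, rule, default


def predict_one_rule(model, row):
    feature, rule, default = model
    return rule.get(row[feature], default)


# Hypothetical toy data: (plan, opened_last_email) -> outcome.
rows = [
    ("free", "yes"), ("free", "no"), ("paid", "yes"),
    ("paid", "no"), ("free", "no"), ("paid", "yes"),
]
labels = ["stay", "churn", "stay", "churn", "churn", "stay"]

model = fit_one_rule(rows, labels)
print(model[0], model[1])
print(predict_one_rule(model, ("paid", "no")))
```

On this toy data the second feature separates the labels perfectly while the first one misclassifies two rows, so the sketch keeps only the second feature. A rule this small can be read and maintained by whoever inherits it.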
It made me think, especially in a business context, what it means to justify your complexity. Part of that is the additional expense of keeping the model running versus revenue. Part of that is the poor sap you’re saddling with keeping this thing running. And something that people don’t often consider is the likelihood of abandonment.
Once you move on, whoever gets saddled with it might find some organizational reason or anecdotal evidence to ignore it. They may not even tell you or get back to you. Are you going to stick around and babysit all your models? How do you hand them to someone if they’re complex?
So to come back around to getting a Ph.D., for me, it was this desire to go out and use data to serve a business, and to do that using both complex and simple approaches that pushed me to leave graduate school early and join the workaday world. I don’t regret it. It seems like the academics are trying to do the most complex models, and the business decision makers are thinking it may not be all that helpful, that 80% of the way there is good for us already.
Can you talk more about your background? What were you doing before your Ph.D.?
My dad is an English professor so I thought I was going to do English. Slowly, I realized I was pretty good at math. In my undergrad, I studied pure math. I really liked abstract algebra, and I thought I was going to be a pure math guy. My advisor sat me down and said, “You’re alright. You’ll probably go to grad school in a top 10 program, but you’re really not going to amount to much in the math community.” I felt at the time that it was pretty harsh, but it was true. I couldn’t compare myself to other people doing pure math. The way math works is a lot of people toy with little results for a long time, and suddenly there are huge jumps from certain people. I would never be that individual who would push the mathematical fossil record forward into a new era. I would be a guy that toys with smaller results. So it came down to a question of passion: how passionate was I about pure math?
At the time, I was also doing research for another math professor on knot tying. I got paid as part of this research group to write code that would take two 3D models of knots and join them into a compound knot without crossing over other sections of the knots and forming a new knot type. It was crazy specific, but I learned a lot about Unix and programming. I wrote code to do simulated annealing in C. I was getting all sorts of memory leaks, and I had to do a lot of stuff in the command line with data sets.
I didn’t know what that was at the time. I thought it was just math research that involved code, but I liked it. It turned out to be my most valuable experience as an undergraduate. After all, what would a data scientist do without piping in Unix?
What did you end up doing once you graduated?
I did a couple of internships at the NSA over the summers, and I loved the applied, problem-focused environment. When I did my first summer internship, it was all math students hopped up on stories of Bletchley Park, etc. Lots of energy. It was great, but then I did another internship there and they put me in a regular office with regular employees that had been there for a long time. And that’s what ultimately scared me away.
I remember talking to one guy who had a picture of a golf course above his computer. He said, “That’s what I’m doing next year when I retire, playing golf.” Everyone was tired, and everyone was burned out. I figured that a government job wouldn’t be exciting for long, so I began to look at other applied analytics opportunities.
So in graduate school I chose to study operations research, where math was applied to optimization modeling. I went to MIT in their Operations Research Center, which is an interdepartmental program between engineering, stats, math and business. It was cool because you could take business classes alongside highly technical classes. I got a kick out of doing MBA case studies because it was so foreign to a math class. No proofs!
I thought the OR program was awesome, so I knew that career-wise I was headed in the right direction. When I did my graduate research for Dell and was able to use the OR concepts in a consulting framework, though, that’s when it all clicked. I applied to analytics consulting firms and the rest followed.
Is this when you went to Booz Allen? What did you do there?
Yes. I went to Booz Allen and did a lot of analytics consulting work. I was on a team called Modeling, Simulation, Wargaming, and Analysis which exposed me to a huge variety of analytics approaches, techniques and problems. One month I’d be doing system dynamics modeling, the next month I’d be building an optimization modeling tool whose GUI was a bunch of Gantt charts. You never knew where the next project would lead.
From there, I went on to do consulting at a boutique consulting firm called Revenue Analytics that does pricing models that adjust prices on hotel rooms, cruises, etc. These models are complex IT projects, so most of the clients who had the data to power them and could afford them were Fortune 500s.
During this stint, I worked with Coca-Cola in Shanghai to build an optimization model that pulls frozen barrels of orange juice pulp from oranges sourced all around the world and blends them together so that every time you drink one of Coca-Cola’s Pulpy drinks in China, the feel of pulp in your mouth is consistent. The project felt like discovering some bizarro corner of the analytics universe halfway around the world.
All these Fortune 500 projects were really fast-paced. But from there I jumped to MailChimp, which is more of a startup, and nothing in the Fortune 500 world could have prepared me for MailChimp’s pace. We’re on a release cycle where every four weeks, we’re putting out a new version of the application. That’s light speed for me and, in fact, it’s too fast for a lot of data science projects, especially if you have a lot of infrastructure requirements. I’m the slowpoke of the organization. That’s an exciting place to be because it means people are pushing me.
One fascinating aspect of MailChimp as a startup is that it’s based in Georgia. Not in Silicon Valley or even New York or Boston. What is the startup scene in Atlanta like?
The startup scene is alright because Georgia Tech produces a lot of talent in the Atlanta area. Some of those folks want to stick around our fair city. But that isn’t to say there isn’t a massive magnet out on the West Coast, because people want to go out to the Valley, join a startup, get equity, and see if they can cash in that lottery ticket later. That’s a very different culture than what you find in Atlanta.
That’s something that we have to think about when we recruit, so we play to our strengths. We have some of the most amazing data sets in the world. Two of our domains are in the Alexa 500. We send ten billion emails a month and process another three billion events on top of that. We added 200,000 active sending customers this quarter. We’re growing so fast, and the nice thing about that message is it attracts those applicants who want interesting work rather than those who merely want an opportunity to cash out later.
How does the company think about staying in Atlanta?
What I found is that if you’re in Silicon Valley, you can be part of a conversation that’s occurring between all these companies, and there are advantages to that because you know where things are headed. There’s also a disadvantage, because you lose a lot of mental freedom. In fact, it can instill a lot of fear.
You hear a lot of what other people are doing, and it’s like being on Facebook where everyone’s projecting the best version of themselves. This puffery makes you depressed, and you flail about to technologically keep up with the Joneses. MailChimp doesn’t have that perspective, because we are slightly isolated. This isolation allows us a little breathing room to seriously evaluate technologies, opportunities, markets, trends, etc., rather than just jumping head first into something because everyone else is doing it. That said, the folks at MailChimp get around a lot. I travel nonstop. I speak a lot. I meet with companies. I have conversations constantly with folks around the world, but it’s targeted and intentional rather than getting an earful all over the place because you live in Silicon Valley. What that means is that there’s less fear, so we’re not thinking, “We have to take VC money” or “We have to acquire this startup.”
Talking more about unconventional thinking, you’ve written in the past that “Your model is not the goal; your job is not a Kaggle competition.” Can you talk about why you don’t think Kaggle is where data scientists should be spending their time?
There’s nothing wrong with Kaggle. I think it’s a great idea. If a company’s at that point where they want a model that’s that good and they’re getting a lot of revenue and want to push like Netflix, go for it.
My one criticism is that the way journalists write about it gives a skewed view of what data science is. There was an article on GigaOM where the author said, and I’m paraphrasing, “The main thing data scientists do is build predictive models. That’s how they spend most of their time.” This is a myth that something like Kaggle will perpetuate.
Before you build a model, you need to know what data sources are available to you within the company, what techniques are available to you, and what technologies are available. You also have to define the problem appropriately and engineer the features. Usually, when you grab data from Kaggle, all of these steps are done for you. You don’t have to go around looking for data. You can’t say something like, “Maybe they left some data behind. Can I come into your company and look around?”
I feel that there are so many steps before you get to modeling that are crucial. Can I ever ask a Kaggle competition, “Is this the competition this company should actually be having?” Think about the Netflix prize. They were trying to predict what star rating viewers would give a movie given past data, but I think they backed off that a little bit because they noticed it’s not all about five-star movies. For example, I watch garbage. I will give it two stars, and I will watch it anyway. It’s more about moods. A lot of things drive viewership, such as what my friends are watching on Facebook. That’s something Netflix is doing now, and it’s made their original modeling endeavor somewhat irrelevant. So there’s this notion in data science about whether or not a project should be tackled in the first place that is a priori ignored by Kaggle. And I think a big component of data science is questioning why you’re doing what you’re doing: choosing problems to solve while rejecting other problems that are irrelevant to the business. With Kaggle, for better or for worse, that job is done for you. Kaggle is just an exercise in using a data scientist as a model-building machine.
I still think that Kaggle competitions are awesome, and I will never match the intellectual ability of some of the competitors on that platform. I just like to emphasize the other fundamentals of operating in a data science role at a company. I wish there was more focus on them, but those aren’t really sexy to talk about in the media.
What are some of these other fundamentals of operating in a data science role at a company?
Well, one of the fundamentals that everyone talks about is cleaning and prepping data yourself. Finding, pulling, prepping, cleaning, the list goes on. Data manipulation prior to model building is huge. But let’s go beyond that.
For me, a core skill that any data scientist should possess is the ability to communicate with the business. It’s dangerous to rely on others at a business to actively identify and throw problems at the data scientist while he or she passively waits to receive work.
When that’s the setup, the business often hands over the wrong problems, because other teams have no idea what data science can help with and what it can’t.
But if you’ve got a data scientist who’s good at communicating, then that data scientist can actively engage in conversations with the business and with executives to prioritize how to best use analytics.
I believe a good data scientist is one who’s engaged enough in conversation with the business to say, for example, “Hey, I know you guys think social data is cool, and I do too. But only 10% of our customers are on Twitter, and it’s anything but a random sample. Have we considered using this other transactional data source to approximate what you want instead?”
So now we’ve got two skills that are important other than building models: data manipulation and communication. What else?
There’s one skill that I like to harp on: the skill of editing. People have a strong desire to distinguish themselves from the herd by flexing their expertise. We see this in all industries and jobs. If you have a particular knowledge set, you’re going to show that off. In analytics, the way that tendency manifests is by making models overly complex. And by that, I don’t mean “using a complex model when a simple one gives the same performance.” No, I mean “using a complex model that is brittle and overly burdensome for the organization to maintain, i.e. whose likelihood of abandonment is high, when a simpler model has a better chance of long-term survival.” Sometimes that means using a simpler model even when some performance is lost. That takes an editing eye. And in data science, as in many disciplines whether that be journalism, oil painting, or speechwriting, editing distinguishes the experienced practitioner from the newbie.
One of the big ideas you mentioned is the fact that complex model-building is not what a data scientist spends most of his time on.
Do you think, in the future when there are more tools built for data scientists to take care of all the steps before modeling, data scientists will in fact be spending most of their time on complex modeling?
I think we’re already seeing the commoditization of a lot of these skills. It’s not that hard to read a book on R and learn how to build models. It’s pretty easy, and that’s where online education can come in and fill in a lot of technical gaps. If that’s all you need