The Interplay Between Science, Engineering and Data Science

genetics naturally prepared her for working with large volumes of data, leading her to realize that the work she engaged in at Stanford naturally belonged in the realm of data science.

After graduating, Diane became part of the Insight Data Science Fellowship program where, as a Fellowship project, she built a recipe search site that used clustering to organize recipes by ingredients.

When we interviewed Diane, she was a data scientist at Palantir. She has since started a new role as a Senior Data Scientist at MetaMind.

DIANE WU

industry. Through this program, I realized that most of the training through my PhD was essentially just data science. So the transition for me was very natural. I’m doing a lot of the same things, just not thinking about cells or biology! However, the same tools and challenges apply.

So after bridging this gap, where are you as a data scientist right now?

I work as a data scientist at Palantir, a company that builds a platform that helps integrate data for our customers from the multiple disparate databases that they have, and makes associations and inferences from these data. We work with customers from the financial sector, the medical space, government and local law enforcement. One of my jobs as a data scientist at Palantir is to help create value out of their data and identify a human-computer symbiotic approach to machine learning.

Given that you’re working with these large institutions, what is the scale of the problems you’re tackling?

There’s a wide range. Some of our customers have hundreds of terabytes of data and some have a few megabytes. Some customers require a streaming solution while others want a static model based on all the information in their databases. The number of databases we work with can also vary from one to many dozens.

Having worked as a data scientist for a while, what would you say are the main responsibilities and goals of data scientists at Palantir?

Data science itself is a very strange term. It’s an umbrella term. In some companies and in some roles, being a data scientist means being a software engineer, building machine learning models in the back end. In this role, your success is very measurable: it is usually the accuracy or precision/recall of your model. At other companies or in other roles, being a data scientist means that you’re an analyst working with engineers to help them determine what features to build and how users are interacting with them. In this role, your success is less measurable, and it is up to you to find the right questions to answer and then to try to make an impact with that answer.

At Palantir, we work with customers from a diverse number of sectors, with a wide spectrum of problems that we solve by deploying our platforms against their data. One of our core company missions is to pick incredibly difficult problems at institutions where we would provide the most value, and put our full force into solving these problems.

Sometimes, this means developing new capabilities in the platform. Sometimes these capabilities require data science techniques (machine learning, statistics, mathematical modeling), and that’s where we come in. I’m on the machine learning team at Palantir, and we’re dedicated to enabling customer data science needs via our products. To this end, we work closely with customers to help them scope their problems and turn an often poorly defined, qualitative problem into a quantitative one. The process involves identifying an actionable goal or desired insight, evaluating the form, scale, reliability and availability of the data, and building custom machine learning algorithms to solve the problem. And then we iterate. Always iterate.

Some requests we get involve translating from qualitative problems to quantitative ones (identifying good proxy metrics to reach the right conclusion), statistics (doing the calculation on the data), and communication (presenting the data in a digestible manner). In most cases, however, our customers are requesting a predictive analytics approach to a specific type of problem. They present a very difficult problem where a predictive modeling component may be needed. Fraud detection is one of those problems, for example. It is clear that a computational algorithm could aid fraud detection by identifying patterns and outliers, but the problem is complex enough that it will likely always involve a strong human component. In such cases, it is not clear how we should break up the tasks between the human and the computer. One of Palantir’s core values is human-computer symbiosis: let the computer do what it does best (crunch models, calculate metrics, etc.) and let the humans do what they do best (interpret patterns and meaning, make accountable decisions, especially with respect to the rights and well-being of other humans). One of the overarching goals of our team is to figure out what an ideal predictive analytical solution should look like and where on the spectrum it should lie.

Finally, we also do data science internally, and often want to use product metrics to inform business decisions. Engineers like to build cool things. It’s not intuitive to them to think about things in a scientific way. I think that’s one of the reasons books on lean product development are so popular. It’s because these are not intuitive concepts for engineers. The role of a data scientist is to do the stuff that is a pain for engineers (but fun for us), and help engineers make more data-driven product development decisions.

It sounds like data scientists are evangelizing the scientific method to engineers!

In a way, I guess that’s true. It’s intuitive to me to think in the scientific mindset because I’ve been trained as a scientist for the past 4 years. It’s very natural for a scientist to ask why, to dive into a problem, scope the hypothesis landscape and then perform tests. However, scientific thinking is a double-edged sword, and is in some ways the opposite of the engineer mentality. Scientists ask why something is the way it is before reaching a conclusion, while engineers execute on assumptions and watch to see if things break.

One of the hardest things in recruiting data scientists is finding candidates who have the right balance of scientific and engineering mentality. Almost always, with real-world problems, there is no time to ask why and figure everything out before executing, and you often have to act with incomplete knowledge. However, engineering without data science is like building a bridge without ever failure-testing it. There is a delicate balance to be struck.

What are some challenges and some of the things you’ve found easy in making the transition from PhD to Data Science?

Programs like Insight have been successful because PhDs have been trained in a quantitative method of thinking. They’re also prone to ask “why” and “how” rather than “what”. I think that most PhDs understand the presence of errors, and how to reduce a complex problem to a smaller problem with a quantifiable solution.

On the other hand, PhDs are often stereotyped as asking “why” too often and are sometimes caricatured as having their heads in the clouds. So if I find a PhD who is also a hacker, then it is the best of both worlds. Indeed, some of the most effective data scientists I’ve seen have been PhDs who worked on a number of side coding projects during their academic career. The challenge for a lot of people is turning these insights into value. Not all interesting problems can produce insights, and not all interesting insights can inspire action that causes change.

Did you have any challenge in communicating your value as a data scientist?

What I have learned in working with many different customers is that when people request data science, they really just want magic. They want you to use all the data to predict everything. When they approach data science, they often don’t actually know what they want.

That’s the thing about being a data scientist in this time. It’s so new and sort of overhyped, that most people just know they want in on the excitement but don’t know how. They want things, but they have no true idea about what they want.

Part of the job is really use-case discovery. It’s not always about crunching the right algorithm. It’s about asking the right questions and framing the questions for yourself. And once you do that, the problems tend not to be statistically or algorithmically hard. On the other hand, there are people who think it’s overhyped and want you to prove that data science is worth their investment.

So in your experience, what distinguishes the best data scientists from the rest?

That’s a very good question.

There are statisticians and there are computer scientists and designers. And then, there are people who are very good at all of these things. The reason why this role — data scientist — was created, and the reason why it’s a little bit undefined, is that it requires that you’re good at many different things. You have to think about problems, both as an engineer and also as a statistician. You have to know what tests are right, how to approach the problem, how to engineer the solution and how to sift through large datasets.

And then afterwards, you have to present your findings in a clear way. This might require you to create visualizations. Having an understanding of graphic theory and the language of visualization is useful. This ties into communication because as a data scientist you’re communicating with someone who doesn’t have a ton of time to analyze data. They look at the figure and want to be able to extract meaning from it in a few minutes.

Finding someone who’s a good engineer and a good communicator is incredibly difficult. You don’t need to be the best at everything, but some people who are great communicators need to learn how to be great engineers and vice versa.

In academia, there’s a focus on open-ended problems. How have you made the transition to industry where there’s an environment to deliver on prompt deadlines?

I think in an ideal world there should be a fusion of the two. In academia, it behooves one to work with deadlines; most PhD students would probably tell you that if it weren’t for publication deadlines and the fear of being scooped, we might never publish. Open-ended problems need to be scoped also, and often a 20% solution will get you 80% towards your goal. In industry, sometimes people can get too “hacky” and deliver v1 solutions all the time, and that can be bad too. Sometimes it’s good to step back and try our hand at some crazy ideas. I think that’s the inspiration behind company internal hackathons and why they’re so popular in the tech industry.

What skills beyond what you’ve already mentioned (hypothesis testing, communication) would you recommend to someone interested in data science?

As a preface, I think the skills you need to learn largely depend on what you want to do. I would put this into three categories:

1. Predictive Modeling: here, algorithms and some complex mathematical modeling are required. Visualizations are probably not as heavily emphasized.

2. Business Intelligence: here you engage frequently with SQL and some scripting, but you don’t need great skills in computations and algorithms.

3. This is a spot in the middle: this is more science-y and R&D. Here you want to ask much deeper questions about user behavior. You want to model user interactions and apply computational algorithms to gain business insights. This is a mesh between two extremes. You need some computational background, and some aspects of communication, etc.

But ultimately, answering this question requires you to think about what type of job you want, and to realize that you can’t be qualified for everything. You have to pick your best shot and hone your skills there.

Building off of that, what have you found to be useful in building those skills and understanding which position you want to pursue?

Talking to people is important. I don’t mean that in the way of networking, but in the way of understanding what people are looking for. Insight Data Science helped me a lot in this direction.

Looking at folks who have moved into data science, I’ve noticed that the Coursera course by Andrew Ng has been very popular. This ties into the general skill of being driven enough to simply pick up books and start learning. A lot of aspiring data scientists also play around with some Kaggle competitions to get their hands on real data and practice their engineering and analytical skills.

In fact, most data scientists I know are self-motivated; they’ve taught themselves the relevant tools and skills to help them manipulate and understand data. In my opinion, it doesn’t take that long to learn these skills. So if you pick up these things after work, I think you can take advantage of the large demand right now in data science.

Kevin Novak, another data scientist we spoke with at Uber, believes that we’re at the tip of the tip of the iceberg when it comes to data science. Do you agree with that? And if so, what are the exciting and promising things on the horizon of data science?

I agree with that. I think that data science is largely undefined. Being a data scientist in this time is exciting because you have a lot of potential to define what data science is for the next 10 years. What’s exciting is being able to explore this frontier. You’re also learning a great deal about very different fields intersecting with each other. I really like this position because I’m learning so much and I’m not just honing one skill.

I’ll predict that in 10 years we’ll use more defined terms than data science because people will realize what it is that they’re looking for (analysts vs. predictive modelers).

Are there any final thoughts or parting feedback you’d give to someone just getting into the field right now?

Don’t be afraid!

March forward and learn what you have to learn. Many people who come into data science are overwhelmed. They look at the list of “requirements” and think that because they’re not a wizard at engineering, statistics and visualization all at once, they’re not qualified. I think they shouldn’t underestimate themselves. I think you should approach things in the T-shaped model, where you accumulate a great deal of breadth and a concentration in one skill that gives you depth.

So be confident and pick up skills; you’ll be surprised at how much value you can add immediately.

In document The Data Science Handbook (Page 128-135)