Data Scientists and Developers: Modes of Collaboration

So far, we have focused on how systems typically look in production.

There are variations in how far you want to go to make the production system really robust and efficient. Sometimes, it may suffice to directly deploy a model in Python, but the separation between the exploratory part and production part is usually there.

One of the big challenges you will face is how to organize the collaboration between data scientists and developers. “Data scientist” is still a somewhat new role, but the work they do differs enough from that of typical developers that you should expect some misunderstandings and difficulties in communication.

The work of data scientists is usually highly exploratory. Data science projects often start with a vague goal and some ideas of what kind of data is available and the methods that could be used, but very often, you have to try out ideas and get insights into your data.

Data scientists write a lot of code, but much of this code is there to test out ideas and is not expected to be part of the final solution (Figure 5-6).

What Is Hardcore Data Science—in Practice? | 89

Figure 5-6. Data scientists and developers. Credit: Mikio Braun.

Developers, on the other hand, naturally have a much higher focus on coding. It is their goal to write a system and to build a program that has the required functionality. Developers sometimes also work in an exploratory fashion (building prototypes or proofs of concept, or performing benchmarks), but the main goal of their work is to write code.

These differences are also very apparent in the way the code evolves over time. Developers usually try to stick to a clearly defined process that involves creating branches for independent work streams, then having those reviewed and merged back into the main branch. People can work in parallel but need to pull changes that have been approved and merged into the main branch back into their own branches, and so on. It is a whole process around making sure that the main branch evolves in an orderly fashion (Figure 5-7).

Figure 5-7. Branches for independent work streams. Credit: Mikio Braun.

While data scientists also write a lot of code, as I mentioned, it often serves to explore and try out ideas. So, you might come up with a version 1, which didn’t quite do what you expected; then you have a version 2 that leads to versions 2.1 and 2.2 before you stop working on this approach, and go to versions 3 and 3.1. At this point you realize that if you take some ideas from 2.1 and 3.1, you can actually get a better solution, leading to versions 3.3 and 3.4, which is your final solution (Figure 5-8).


Figure 5-8. Data scientist process. Credit: Mikio Braun.

The interesting thing is that you would actually want to keep all those dead ends because you might need them at some later point.

You might also put some of the things that worked well back into a growing toolbox (something like your own private machine-learning library) over time. While developers are interested in removing “dead code” (also because they know that you can always retrieve that later on, and they know how to do that quickly), data scientists often like to keep code, just in case.

These differences mean, in practice, that developers and data scientists often have problems working together. Standard software engineering practices don’t really work out for data scientists’ exploratory work mode because the goals are different. Introducing code reviews and an orderly branch, review, and merge-back workflow would just not work for data scientists and would slow them down. Likewise, applying this exploratory mode to production systems also won’t work.

So, how can we structure the collaboration to be most productive for both sides? A first reaction might be to keep the teams separate, for example, by completely separating the codebases and having data scientists work independently, producing a specification document as the outcome, which then needs to be implemented by the developers.

This approach works, but it is also very slow and error-prone. Reimplementing may introduce errors, especially if the developers are not familiar with data analysis algorithms, and performing the outer iterations to improve the overall system depends on developers having enough capacity to implement the data scientists’ specifications (Figure 5-9).

Figure 5-9. Keep the teams separate. Credit: Mikio Braun.

Luckily, many data scientists are actually interested in becoming better software engineers, and the other way round, so we have started to experiment with modes of collaboration that are a bit more direct and help to speed up the process.

For example, data science and developer codebases could still be separate, but there is a part of the production system that has a clearly identified interface into which the data scientists can hook their methods. The code that communicates with the production system obviously needs to follow stricter software development practices, but would still be the responsibility of the data scientists. That way, they can quickly iterate internally, but also with the production system (Figure 5-10).

Figure 5-10. Experiment with modes of collaboration. Credit: Mikio Braun.
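To make the idea of a clearly identified interface more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the names Recommender, PopularityRecommender, and render_recommendations are hypothetical and do not come from any actual production system. The point is only that the production code depends on a small, stable contract, while the data scientists remain free to swap the implementation behind it.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class Recommender(ABC):
    """Hypothetical contract agreed on by developers and data scientists."""

    @abstractmethod
    def recommend(self, user_id: str, n: int) -> List[str]:
        """Return up to n item identifiers for the given user."""


class PopularityRecommender(Recommender):
    """Illustrative data science implementation: most popular items first."""

    def __init__(self, item_counts: Dict[str, int]) -> None:
        # Rank items once, by descending count.
        self._ranked = sorted(item_counts, key=item_counts.get, reverse=True)

    def recommend(self, user_id: str, n: int) -> List[str]:
        # This toy model ignores the user and returns the global top n.
        return self._ranked[:n]


def render_recommendations(recommender: Recommender, user_id: str, n: int = 3) -> List[str]:
    """Production-side code: it only knows the Recommender interface."""
    return recommender.recommend(user_id, n)
```

Under these assumptions, the data scientists could replace PopularityRecommender with a new model at any time, and the production-side function would not need to change as long as the contract is honored.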

One concrete realization of that architecture pattern is to take a microservice approach and have the ability in the production system to query a microservice owned by the data scientists for recommendations. That way, the whole pipeline used in the data scientists’ offline analysis can be repurposed to also perform A/B tests or even go into production without developers having to reimplement everything. This also puts more emphasis on the software engineering skills of the data scientists, but we are increasingly seeing more people with that skill set. In fact, we have lately changed the title of data scientists at Zalando to “research engineer (data science)” to reflect this.
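As a rough illustration of the microservice idea, the following sketch uses only the Python standard library. It is not a description of how any real system is built: the /recommendations path, the user_id and n query parameters, the placeholder catalog, and all function names are assumptions chosen for the example. The service side stands in for code owned by the data scientists, and fetch_recommendations stands in for the production system querying it.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def recommend(user_id, n):
    """Placeholder for the data scientists' model (hypothetical catalog)."""
    catalog = ["shoes", "jacket", "scarf", "hat"]
    return catalog[:n]


class RecommendationHandler(BaseHTTPRequestHandler):
    """Service owned by the data scientists; answers GET /recommendations."""

    def do_GET(self):
        url = urlparse(self.path)
        if url.path != "/recommendations":
            self.send_error(404)
            return
        params = parse_qs(url.query)
        user_id = params.get("user_id", [""])[0]
        n = int(params.get("n", ["3"])[0])
        payload = {"user_id": user_id, "items": recommend(user_id, n)}
        body = json.dumps(payload).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # Keep the example quiet; a real service would log requests.


def start_service(port=0):
    """Start the service in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), RecommendationHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_recommendations(host, port, user_id, n=3):
    """Production-side client: asks the service for recommendations."""
    conn = HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", f"/recommendations?user_id={user_id}&n={n}")
        response = conn.getresponse()
        return json.loads(response.read().decode("utf-8"))["items"]
    finally:
        conn.close()
```

In this sketch the production system never imports the model code. It only relies on the HTTP contract, so the data scientists can redeploy the service with a new model, or run two variants side by side for an A/B test, without a reimplementation step.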

With an approach like this, data scientists can move fast, iterate on offline data, and iterate in a production setting, and the whole team can migrate stable data analysis solutions into the production system over time.

In document Big Data Now 2016 Edition (Page 99-105)