Structuring Your Data Science Workflow

Now What?

Congratulations, you’ve successfully recruited and hired a few data scientists, positioned them in the right place in your organization, and selected the right team structure for what you’re trying to accomplish. Now what?

It’s time to put them to work! Let’s take some time to review different software development models and how they might apply (or not) to your Data Science (DS) practice.

Agile or Waterfall?

When most people hear the term “software development methodology”, what immediately comes to mind is the debate between Agile and Waterfall methodologies. In reality, Agile vs Waterfall is less a binary choice than a question of where on the Agile-to-Waterfall spectrum your organization falls. Some organizations will exhibit more Agile tendencies, whereas others may choose to play it more conservatively. Rather than choosing one over the other, it is more important to identify the trade-offs of each model and acknowledge them rather than seek to avoid them altogether.

Simply put:

Agile methodologies embrace the changing nature of requirements and seek to empower developers to use their creativity to overcome the hurdles that come with those changes. The “product” is never done; it just changes as the business requires.

Waterfall methodologies, by contrast, define the deliverable work of the development team as the delivery of the product; subsequent projects then seek to modify or improve that product. Waterfall projects will commonly use versioning to define development milestones.


Applying Methodology to Data Science

When it comes to software development, most experts will advise introspection to choose the best development methodology for your organization. Smaller organizations or those with many simple products may elect to take an Agile route, whereas large organizations or those with extremely complex products may elect to take a more measured, Waterfall-like approach.

In the context of Data Science, the choice of methodology, and with it the nature of the workflow, will depend on the projects the team is working on and on the methodology your existing software development team has elected to use.

For the purpose of DS, the choice is between a Sprint Focused Workflow and a Project Focused Workflow.

Sprint Focused Workflows

The sprint focused workflow for data scientists involves the application of Agile principles to data science. Rather than trying to tackle a large business problem all at once, sprint focused workflows break a business problem into independent components, or structure long-term efforts into small chunks (sprints), to ensure that progress continues to be made.

One advantage of this approach is that if progress stalls on a particular component, that component can be put aside for a while. Setting the problem aside gives the Data Scientist’s creativity a chance to stew on it and come up with a solution that might not otherwise surface under a laser focus. At the same time, the overall effort is not completely sidelined, and the Data Scientist is not tempted to implement a subpar solution for the express purpose of progressing the project.

A Sprint Focused Workflow, however, may not be appropriate for problems that need to be worked in a very specific order. An example of this might be the deployment of an analytical pipeline in production where the development team also works on a sprint cycle. Because it is not practical to work on later pieces of the pipeline when previous ones are not complete, the flexibility of the Sprint Focused Workflow is lost.

One last thing to mention is that there are tangible benefits to aligning the work (or at least the methodology) of your Data Science team with that of your software development team. You may have developers who are particularly interested in DS as a practice, or you may have Data Scientists with a background in software development who are more comfortable with the daily stand-up format of the Sprint Cycle.

Project Focused Workflows

The other choice of workflow for DS involves thinking of your data science efforts as projects: when one project ends, another begins, and the former project does not end until its acceptance criteria are satisfied or it has been convincingly established that the criteria are not achievable (for Data Scientists, infeasibility and failure are but two sides of the same coin).

This concept has the advantage of being easily understood by people who do not have experience, comfort, or exposure to Agile-like development practices. This is essentially how the business world works today, and its similarity to the Waterfall process borrows that process’s main advantage: intuitiveness! Given that one of the most important guarantors of a Data Science project’s success is the engagement and participation of non-technical business users, this is not an insignificant advantage. Indeed, it keeps your DS team from melting into the monolithic body of “technology” when they are instead supposed to bridge that gap.

Because of its similarity to Waterfall, the project focused workflow suffers from the same pitfalls: with improper management and poor or unclear acceptance criteria, the project might never actually end. This diminishes the sense of accomplishment for the DS team, and may lead to attrition or technical debt.

If you’ve elected to model your DS team as a series of embedded teams across your organization’s business units, using a project focused workflow may be the right choice because it will align your DS team’s work with the business at large.


Making the Right Choice

It’s worth noting that Data Science and software development, while they may require similar skills, are NOT the same thing. At this point in time, increasing automation and creativity in services provided at the software level call for a high level of production from software development teams. To some degree, code has become commoditized as more people have gained access to it.

On the other hand, instituting a Data Science team to tackle the large, strategic questions across your organization is still a very new thing to do. As such, the Waterfall vs Agile debate is not a perfect analogy for DS teams. Instead, I recommend thinking of that debate as negotiating the trade-offs between structure and agility. Embracing the ability to make quick changes can deprive you of any sense of a long-term picture. Conversely, thinking of your problem as monolithic may make it seem unsolvable, or remove the acceptable conditions for failure.

If you haven’t already, read some of the other pieces that I have written on the topic of managing data scientists, or get in touch if you’re interested in exchanging some ideas.
