Choosing Your Data Science Architecture
We data people love our architecture. We obsess over it. Every time Apache announces a new top-level project, we fawn over it and talk about how it's probably the future, and how this changes everything. Hell, many of us are probably working on just those projects in our free time. Realistically, though, with over 200 (or 300, depending on how you count) top-level projects, it's mathematically improbable that all of them are going to change everything.
When implementing a Data Science (DS) practice in your organization, it's still the case that Keep It Simple, Stupid (KISS) will get you the most reliable and effective software stack. With all the choices out there, creating an environment for your data scientists to succeed does not have to be hard.
Just Another LAMP Stack
In web-focused development, there's a concept called the LAMP stack. The LAMP stack is a collection of open source software that can be (and very frequently is) used to serve and process requests for web content. While the term LAMP is itself an acronym of the components, the LAMP stack is intended to convey the four essential components of application development: operating system, web server, database, and scripting language.
The "original" LAMP stack, if such a thing exists, consists of a Linux-based operating system, an Apache web server, a MySQL database, and PHP, Perl, or Python as the scripting language. Variations on the LAMP stack include such take-offs as using a particular variant of Linux as the operating system (RAMP, with Red Hat), a different database (MariaDB, a MySQL-compatible relational fork, or MongoDB, a non-relational document store), or alternative web server technologies (Nginx being the most common).
The LAMP stack concept is useful for DS because, in an analytical environment, we're going to be doing many of the same things at the same scale that an application does. While it is possible to run your data science architecture exclusively off your staff's machines, at some point that won't scale to meet your needs. With DS workloads and ambitions, that point is bound to arrive sooner rather than later.
The Data Science Analytical Stack
The core concept of the LAMP stack comes from finding ways to fulfill the needs of applications to be functional and performant. Every application needs a way to talk to the internet (web server), a place to put data (database), a way to manipulate the data (scripting language), and a common platform to tie it all together (operating system). Your analytical environment needs much the same things.
The Operating System
In the LAMP stack, the L stands for Linux, the operating system of choice. The operating system provides a common platform on which to run all the required software components of your analytical environment. Unless you're somehow running an all-Windows environment (and some people do), the obvious choice here is some flavor of Linux or other Unix-like operating system. The world of Linux runs the gamut from heavily consumer-focused distributions (OpenSUSE Leap and Ubuntu Desktop, for example) to tried-and-true enterprise-grade standbys such as Red Hat Enterprise Linux, Debian, and Ubuntu Server. If your organization already has an existing Linux distro of choice, there's no reason not to use it for your DS architecture!
One important thing to note is that using Linux also preserves compatibility with any number of end-user operating systems. Regardless of whether your organization religiously uses Apple products, or has a massive contract with Dell for Windows workstations, both of these can easily connect to and interface with Linux environments. Using a desktop-grade Linux distribution for your data science workstations is also an option, especially if that configuration is already supported by your IT team.
Of course, macOS and desktop Linux workstations will have the advantage of natively running a Unix-like shell environment. On the other hand, a substantial portion of corporate computing, especially functions such as email and workflow collaboration, runs on Windows.
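To make the cross-platform point concrete, here's a minimal sketch of reaching a shared Linux analytics server from any workstation OS, using the paramiko SSH library for Python. The hostname and username are placeholders, and the snippet assumes key-based authentication is already set up.

```python
# A minimal sketch of connecting to a shared Linux analytics server from
# any workstation OS. The hostname and username are hypothetical.
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("analytics-server.example.com", username="dscientist")

# Run a quick sanity check on the remote box
stdin, stdout, stderr = client.exec_command("uname -a && df -h /data")
print(stdout.read().decode())
client.close()
```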
The Server
In contrast to the LAMP stack, where the A stands for the Apache HTTP web server, in your data science architecture the server component will be the physical boxes that store your analytical data and process the workflows that your data scientists create. As you scale out, distributed computation frameworks such as Hadoop and Spark will allow you to expand your computational capacity by adding new machines, rather than buying ever more powerful (and therefore expensive) ones.
In the beginning, and especially for teams of one or two data scientists in small organizations, it may be most appropriate to have individual workstations do double duty as the processing servers for your analytical environment. If one team member is comfortable working in a desktop Linux environment, consider using that member's workstation as the shared computational environment.
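How do you know when a workstation has hit its limits? Here is a minimal sketch, assuming the third-party psutil package is installed, that eyeballs core count, memory headroom, and CPU load during a representative workload:

```python
# Quick check of whether an analytical workload is saturating a
# workstation; requires the third-party psutil package.
import psutil

cores = psutil.cpu_count(logical=False)
mem = psutil.virtual_memory()

print(f"Physical cores: {cores}")
print(f"RAM: {mem.total / 2**30:.1f} GiB total, "
      f"{mem.available / 2**30:.1f} GiB available")
print(f"CPU load over 5s: {psutil.cpu_percent(interval=5):.0f}%")
```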
However, once you've broken through the limits of what an individual's workstation can handle, it's time to invest in server hardware. Start out with a single 8 to 12 core (midrange, as of this writing) Xeon processor with at least 64 GB of RAM and adequate storage space for whatever data you plan on keeping. Consider a RAID array of disks (or SSDs): striping buys some performance, while mirroring or parity buys redundancy. While this box is going to be significantly more expensive than any single workstation, the savings from eliminating your data scientists' dead time will easily pay it back.
Once you've outgrown your single server, it's time to scale out! Using a distributed computation framework such as Hadoop will allow you to use commodity hardware to tackle ever-larger analytical workloads. Instead of just one of the previously covered servers, consider buying a whole rack (or two, if you want to take advantage of Hadoop's rack awareness)!
One more thing to mention is that many data science teams have now moved their workflows to hosted solutions such as Amazon Web Services, Google Compute Engine, Microsoft Azure, and the like. This is absolutely an option if your data is not considered sensitive! In many cases, the on-demand scaling of these services can produce cost savings and additional operating agility for the data science team.
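As an illustration of that on-demand model, here is a hedged sketch using boto3, the AWS SDK for Python. The AMI ID, instance type, and key pair name are placeholders, and the snippet assumes your AWS credentials are already configured.

```python
# A sketch of on-demand scaling with the AWS SDK (boto3). The AMI ID,
# instance type, and key name are placeholders; assumes AWS credentials
# are already configured in your environment.
import boto3

ec2 = boto3.resource("ec2")

# Spin up a single analysis node only for the duration of a job
instances = ec2.create_instances(
    ImageId="ami-xxxxxxxx",       # placeholder: your preferred Linux AMI
    InstanceType="m5.2xlarge",    # sized like the single-server example above
    MinCount=1,
    MaxCount=1,
    KeyName="ds-team-key",        # placeholder key pair
)
print("Launched:", instances[0].id)
```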
The Database
Now that you've got an operating system to listen to your commands and a server to process your data, you've got to find somewhere to put it! In the LAMP stack, the database is most often a relational one such as MySQL or PostgreSQL, or a document store such as MongoDB. Each of these is absolutely an option for your DS architecture.
In actuality, the database component of your DS architecture is all about finding something that your scientists are comfortable with and providing the right tools for the job. Every one of your data scientists (hopefully) comes from a different background, and is comfortable and familiar with different tools. Quants and hackers may be extremely comfortable writing complex queries against a SQL-compliant relational database, whereas academics may prefer a more purpose-built data storage scheme. In addition, the scale of your analytical efforts may inform the architecture available to your data scientists: distributed computation needs data storage and retrieval solutions that take that distribution into account.
With that in mind, here are a few different storage products to consider:
- A conventional relational database such as MySQL or PostgreSQL will provide performant data retrieval for small to medium amounts of highly structured data. Support for parallelization and distributed computation in these databases is poor.
- Writing native MapReduce jobs against an unstructured data source on the Hadoop Distributed File System (HDFS) or MapR-FS is great for analyzing medium to large amounts of data. Due to the startup and management overhead of Hadoop, this may actually be less efficient for small amounts of data.
- Software such as Apache Hive, Cloudera Impala, and Apache Drill can provide a SQL-like interface to Hadoop, allowing analysts comfortable with SQL to abstract away the complicated and error-prone components of MapReduce while retaining its computational efficiency (see the sketch after this list).
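To see what that abstraction buys you, compare a toy, in-memory illustration of the MapReduce pattern (not an actual Hadoop job) with the single SQL statement a tool like Hive would accept for the same word-count logic. The table name in the comment is hypothetical.

```python
# A toy, in-memory illustration of the MapReduce pattern (not an actual
# Hadoop job): counting words across lines of text.
from collections import defaultdict

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Map phase: emit (word, 1) pairs
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle/reduce phase: group by key and sum
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n

print(dict(counts))  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, ...}

# With a SQL-on-Hadoop layer such as Hive, the same logic is roughly:
#   SELECT word, COUNT(*) FROM words GROUP BY word;
# (table name hypothetical)
```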
The Scripting Language
Having established a way to store and pull the data that you want to analyze, it’s time to focus on the P in LAMP – the scripting language, most usually Python, Perl, or PHP.
In the data science world, the role of the scripting language is to process data and extract insights that bring value to the business. Just as many application development shops declare themselves a "Java house" or a "C++ house," it may be advisable to pick and declare one analytical scripting language as the canonical one for your organization. This may be appropriate if your organization is large and traditionally adheres strictly to standards.
However, it may be an unnecessary restriction. Unlike in application development, where it is complicated and difficult to get different languages to talk to each other, scripting for analytical purposes does not really require much in the way of cross-module communication. Until you get to the point where you're doing things like operationalizing machine learning models, there is not much harm that can come from, for example, running one experiment in R and another in Python. This flexibility is also a testament to the extensibility of Linux.
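As a sketch of how painless that polyglot approach can be, here's one Python entry point driving an experiment written in R alongside one written in Python. The script paths are hypothetical, and Rscript is assumed to be on the PATH.

```python
# One experiment can live in R and another in Python on the same Linux
# box. The script names are hypothetical; Rscript is assumed to be on
# the PATH.
import subprocess

# Run the R experiment and capture its output
r_result = subprocess.run(
    ["Rscript", "experiments/churn_model.R"],
    capture_output=True, text=True, check=True,
)
print("R experiment output:", r_result.stdout)

# Run the Python experiment the same way
py_result = subprocess.run(
    ["python", "experiments/uplift_model.py"],
    capture_output=True, text=True, check=True,
)
print("Python experiment output:", py_result.stdout)
```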
It is, however, important to enable cross-language collaboration and knowledge sharing. By documenting the outcome and procedure of experiments, you help promote replicability, which ultimately increases the quality of work. There are a number of tools that help achieve this, from internal wikis to Atlassian Confluence. One of the more popular products for documentation and knowledge sharing is Jupyter Notebook. Although the name might suggest that it is exclusive to Python, Jupyter provides code documentation and markup features for over 40 different scripting and scientific computing languages. It ships with a native Python kernel and can be extended with kernels that run R code within the notebook itself.
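For teams that want to generate experiment write-ups programmatically, here is a minimal sketch using the nbformat library to build a notebook skeleton. The experiment name and helper function are hypothetical.

```python
# A minimal sketch of recording an experiment write-up as a Jupyter
# notebook programmatically, using the nbformat library.
import nbformat
from nbformat.v4 import new_notebook, new_markdown_cell, new_code_cell

nb = new_notebook()
nb.cells = [
    new_markdown_cell("# Experiment: baseline model\n"
                      "Procedure, data sources, and outcome go here."),
    new_code_cell("results = run_experiment()  # hypothetical helper"),
]

nbformat.write(nb, "experiment_baseline.ipynb")
```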
There are also a number of software distributions that bundle together multiple tools for analytical and scientific computing. Two popular examples are:
- Anaconda, which combines a Python distribution and Jupyter with a number of commonly used packages, community support for custom packages, and Spyder, a Python IDE focused on analytical computation
- RStudio, which provides an IDE-like environment in which to write and execute R scripts, and also comes bundled with several commonly used packages
Building Your Stack
Building the right data science architecture for your team doesn't have to be hard. There's just a lot of noise as we figure out faster and better ways to do things. When evaluating new technologies and how they fit within and extend your stack, it's important to keep in mind that progress comes slowly. Each of these new technologies is built upon a foundation of earlier work, much of which you're probably already using. New doesn't have to mean intimidating.
If you haven't already, read some of the other pieces that I have written on the topic of managing data scientists, or get in touch if you're interested in exchanging ideas.