Software

Introducing ETL Markup Toolkit (EMT)

TL;DR – I developed an open source toolkit for writing Spark-native ETL using configurations in a highly sub-scriptable and transparent...

On Real Data Science and the Future of the Business Analyst

What is “real data science” anyway? tl;dr: most data scientists at Facebook are business analysts, and that’s perfectly fine. One...

Using Machine Learning to Classify the Quality of Wine

Using an open-source dataset, I’ve written up a Jupyter notebook below that explores the performance of several commonly used decision...

Recognizing the Limits of Visualization

Visualization (viz) is an incredibly hot topic in the business analytics/data science (DS) world right now. In every job description,...

Choosing Your Data Science Architecture

We data people love our architecture. We obsess over it. Every time Apache announces a new top-level project, we fawn...

What’s a Hadoop, Anyway?

To Hadoop and Beyond is a series dedicated to exploring the basics of distributed computing as it stands today, and to...

Understanding the Core of Hadoop: the MapReduce Algorithm

To Hadoop and Beyond is a series dedicated to exploring the basics of distributed computing as it stands today, and to...

Structuring Your Data Science Workflow

Now What? Congratulations, you’ve successfully recruited and hired a few data scientists, positioned them in the right place in your...

Academic Profiles for Data Science

If you’re interested in Data Science (DS) as a field and have read enough job postings, you start to pick...

Measuring the Economics of Skyrim

Recently, I’ve started on a new play-through of Skyrim, the excellent 2011 game that is the most recent single-player...