How Small is Big Data?

A quick exercise in market sizing

It’s a pretty wild world out there. If you ask IBM, they’ll tell you that we (in the royal sense) generate 2.5 exabytes of data every day. That’s 2.5 billion gigabytes. Another way to think of this is to look at YouTube. Every minute, 300 hours of video are uploaded to YouTube, which means it is physically impossible to watch everything on the site. In fact, just to keep up, you would need to simultaneously watch 18,000 screens of YouTube videos. That sure sounds like a lot, doesn’t it? Let’s take a look at that, as a quick exercise in market sizing.
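Before going further, it’s worth checking that 18,000-screen figure; it’s just the upload rate converted from hours of video to minutes of viewing. A quick sketch in Python:

```python
# 300 hours of new video arrive every minute; each hour is 60 minutes of
# viewing, so this is how many screens you'd have to watch at once to keep up.
hours_uploaded_per_minute = 300
screens_to_keep_up = hours_uploaded_per_minute * 60
print(screens_to_keep_up)  # 18000
```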

There are 7.3 billion people in the world. Let’s assume that 50% of them generate data on any given day. That means that, on average, each person is responsible for about 0.68 gigabytes of data each day. If that sounds implausibly high, that’s probably because we’re counting the data that businesses generate in that 2.5-exabyte number. Using the Pareto Principle, let’s say that 80% of data is actually generated by businesses, and only 20% by people. So, each day 2 billion gigabytes are generated by businesses, and only 500 million gigabytes by people.

Using the 50% population assumption, this works out to 0.137 gigabytes per person per day, or 137 megabytes, which happens to be about the same size as the latest update to Farming Simulator 2013. Even that seems to be a little on the high side.
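Here is the per-person estimate spelled out in Python, using the assumptions above (2.5 exabytes a day, 7.3 billion people, half of them generating data, and the 80/20 business/people split):

```python
EB_IN_GB = 1_000_000_000                 # 1 exabyte = 1 billion gigabytes

daily_data_gb = 2.5 * EB_IN_GB           # IBM's 2.5 exabytes per day
people_generating = 7.3e9 * 0.5          # half the world's population

print(daily_data_gb / people_generating)      # ~0.68 GB per person per day

# Pareto split: businesses generate 80% of the data, people only 20%.
people_share_gb = daily_data_gb * 0.2         # 500 million GB
print(people_share_gb / people_generating)    # ~0.137 GB, i.e. ~137 MB per person
```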

Businesses are the primary data-generators

And what about businesses? The Small Business Administration (SBA) estimates that there are 27.9 million small businesses in the US, and that they make up 99.7% of all firms (which is a very generous definition of small!). That implies there are right about 28 million businesses in total in the US. The US makes up roughly 4.4% of the world’s population, but let’s say that Americans are more likely to start their own businesses than other people in the world; perhaps US businesses make up 15% of all businesses. This works out to roughly 190 million businesses total, worldwide.
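In code, with the 15% US share flagged for what it is (a guess, not a measurement):

```python
us_small_businesses = 27.9e6        # SBA estimate
share_of_all_us_firms = 0.997       # small businesses as a share of all US firms

us_businesses = us_small_businesses / share_of_all_us_firms
print(us_businesses / 1e6)          # ~28.0 million US businesses

us_share_of_world = 0.15            # assumption: US firms are 15% of the world total
world_businesses = us_businesses / us_share_of_world
print(world_businesses / 1e6)       # ~187 million; call it roughly 190 million
```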

Not all of these businesses trade in data in a meaningful way, however. In fact, the SBA estimates that nonemployers (businesses without employees) comprise 78.5% of small businesses. That is conveniently close to a Pareto split, so let’s apply it worldwide: only about 21.5% of businesses, or roughly 41 million, have employees. Of these, let’s say that 80%, or about 33 million, generate data in some meaningful way. Dividing businesses’ 2 billion daily gigabytes among them, each of these businesses generates about 60 gigabytes of data each day.
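Carrying the estimate forward:

```python
world_businesses = 190e6            # from the estimate above

employer_share = 0.215              # SBA: 78.5% of small businesses have no employees
employers = world_businesses * employer_share
print(employers / 1e6)              # ~41 million businesses with employees

data_generators = employers * 0.8   # assume 80% of them generate meaningful data
print(data_generators / 1e6)        # ~33 million

business_data_gb = 2e9              # businesses' 80% share of the 2.5 EB/day
print(business_data_gb / data_generators)   # roughly 60 GB per business per day
```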

This still seems high. Intuitively, some businesses generate far more data than others; it’s hard to compare Google or Microsoft with your local coffee shop. In fact, maybe the top 20% of businesses are generating 80% of the data. If roughly 8 million businesses are generating 1.6 billion gigabytes of data, the average business in this bucket generates about 200 gigabytes per day. Now we’re getting somewhere, though the biggest players dwarf even that: Facebook alone reportedly takes in about 600 TB of data every day.
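The same Pareto split, as code:

```python
employers = 41e6                          # businesses with employees, from above
business_data_gb = 2e9

top_20 = employers * 0.2                  # ~8 million data-heavy businesses
top_20_data_gb = business_data_gb * 0.8   # they generate 80% of the business data

print(top_20_data_gb / top_20)            # ~200 GB per business per day
print(600 * 1000)                         # Facebook's reported ~600 TB/day, in GB, for scale
```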

How big are small businesses?

So what about the other 80% of businesses? Their share of the daily data is some 400 million gigabytes, spread across roughly 33 million firms, or about 12 gigabytes each. This feels like a lot, but next to the 200-gigabyte average in the top 20%, it is not so outlandish. Still, how much of that data are they actually paying attention to? If 50% of the data generated is thrown away or goes unused, then we’re only dealing with about 6 gigabytes per day that interests us. But what if that ratio is also Pareto-like? If 80% of data is discarded, then the average business is only thinking about 2.4 GB of data every day.
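And the long tail:

```python
employers = 41e6
business_data_gb = 2e9

bottom_80 = employers * 0.8                  # ~33 million ordinary businesses
bottom_80_data_gb = business_data_gb * 0.2   # their 20% share of the data

per_business = bottom_80_data_gb / bottom_80
print(per_business)          # ~12 GB per business per day
print(per_business * 0.5)    # ~6 GB if half of it goes unused
print(per_business * 0.2)    # ~2.4 GB if 80% of it is discarded
```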

Is 2.4 GB a lot of data? On the one hand, the June 7, 1982 issue of Computerworld cited a scalable personal computer-mainframe unit as impressive for having 2.4 gigabytes of disk storage. On the other, the latest update to Bungie’s game Destiny weighed in at an apparently controversial 2.44 GB. Isn’t it funny how limited our perceptions of data remain, even as its scale and breadth have increased so dramatically?

Note: this piece is an example of Fermi estimation, an invaluable tool for sizing up slices of reality that seem too large or homogeneous to grasp intuitively. Fermi estimates have also become a rather contrived and meaningless staple of business school, consulting firm, and data science interviews.
