Implementing HyperLogLog in Redshift and Tableau

My Take

In an ideal world, data is stored and queried in as raw a form as possible. Materialization should be kept to a minimum, because layers upon layers of logic make it increasingly likely that one or more assumptions will be broken. However, limiting materialization is not always an option when the data is too large to be queried reliably in a timely fashion. In this piece, Instacart details their implementation of a probabilistic counting algorithm within Redshift and Tableau, which saved them the need to materialize aggregations. This is very interesting to me because quite often we count things just to get a sense of magnitude, and in those cases a 100% accurate count is not actually necessary – instead, this approach accepts some margin of error in exchange for computational tractability.

Their Take

We implemented HyperLogLog, a probabilistic counting algorithm, to create data extracts in Tableau. HyperLogLog is an algorithm that estimates the number of distinct elements in a data set with a standard error of approximately 1.6% (using 4096 registers). In HyperLogLog, elements of the set are first hashed to ensure randomness, then converted into binary and assigned to a register. A register stores the most unlikely event observed in the set, as measured by the longest run of leading zeros. Based on the registers' maximum values, an approximate count of distinct elements can be calculated.

https://tech.instacart.com/implementing-hyperloglog-in-redshift-and-tableau-a62081c98e1c
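The scheme described in the excerpt can be sketched in a few lines of Python. This is a minimal illustrative implementation, not Instacart's Redshift/Tableau code; the class and parameter names are my own. With `p = 12` it uses 2^12 = 4096 registers, matching the configuration quoted above, for an expected standard error of 1.04/√4096 ≈ 1.6%.

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog sketch (illustrative, not production code)."""

    def __init__(self, p=12):
        self.p = p              # p=12 -> m = 2^12 = 4096 registers
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Hash the element to get a uniformly random 64-bit value.
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                  # first p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64-p bits
        # Rank = one more than the run of leading zeros in the remaining bits;
        # each register keeps the longest run ("most unlikely event") it has seen.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        # Harmonic mean of 2^register across all registers, scaled by the
        # bias-correction constant alpha_m.
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        # Small-range correction: fall back to linear counting when many
        # registers are still empty.
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)
        return raw
```

Counting 100,000 distinct values with `hll = HyperLogLog(); [hll.add(f"user-{i}") for i in range(100_000)]` and then calling `hll.count()` should land within a few percent of the true cardinality, while using only 4096 small registers of state regardless of input size.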
