Uber’s Big Data Platform: 100+ Petabytes with Minute Latency

My Take

When it comes to doing analytics and data science at scale, it seems like you can either have your data now or you can have it clean, but you can’t have both. A monolithic data warehousing appliance serving Uber-scale needs simply doesn’t exist, so distributed computing has to be the path forward. It would appear from this piece that they’ve found a way to have their cake and eat (some of) it too – by simulating update, insert, and delete operations against distributed data in an ACID fashion.

Their Take

Hudi enables Uber to update, insert, and delete existing Parquet data in Hadoop. Moreover, Hudi allows data users to incrementally pull out only changed data, significantly improving query efficiency and allowing for incremental updates of derived modeled tables.
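The core idea is that each write is tagged on a commit timeline, and updates rewrite records in place rather than appending duplicates, so a consumer can ask for "everything changed since commit N" instead of rescanning the whole table. A minimal Python sketch of that copy-on-write idea follows — plain dicts standing in for Parquet files and integer commits standing in for Hudi's timeline; this is an illustration of the concept, not Hudi's actual API:

```python
# Conceptual sketch of copy-on-write upsert + incremental pull
# (hypothetical MiniTable class; not real Hudi code).

class MiniTable:
    def __init__(self):
        self.rows = {}    # record key -> (commit, record)
        self.commit = 0   # stand-in for Hudi's commit timeline

    def upsert(self, records):
        """Insert new keys, overwrite existing ones in place."""
        self.commit += 1
        for rec in records:
            self.rows[rec["key"]] = (self.commit, rec)
        return self.commit

    def delete(self, keys):
        """Remove records by key, as its own commit."""
        self.commit += 1
        for k in keys:
            self.rows.pop(k, None)
        return self.commit

    def incremental_pull(self, since_commit):
        """Return only records written after `since_commit` --
        the incremental query that avoids a full-table rescan."""
        return [rec for c, rec in self.rows.values() if c > since_commit]

t = MiniTable()
c1 = t.upsert([{"key": "trip-1", "fare": 10}, {"key": "trip-2", "fare": 7}])
t.upsert([{"key": "trip-1", "fare": 12}])      # update rewrites trip-1 in place
changed = t.incremental_pull(since_commit=c1)  # only trip-1 comes back
```

A downstream modeled table would apply only `changed` rather than recomputing from the full dataset, which is where the latency win comes from.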

