Uber’s Big Data Platform: 100+ Petabytes with Minute Latency
My Take
When it comes to doing analytics and data science at scale, it seems like you can either have your data now or you can have it clean, but you can't have both. A monolithic data warehousing appliance serving Uber-scale needs simply doesn't exist, so distributed computing has to be the path forward. It would appear from this piece that they've found a way to have their cake and eat (some of) it too: by simulating update, insert, and delete operations against distributed data in an ACID fashion.
Their Take
Hudi enables Uber to update, insert, and delete existing Parquet data in Hadoop. Moreover, Hudi allows data users to incrementally pull out only changed data, significantly improving query efficiency and allowing for incremental updates of derived modeled tables.
https://eng.uber.com/uber-big-data-platform/
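
For concreteness, here's roughly what that upsert-then-incremental-pull workflow looks like through Hudi's Spark datasource API. This is a minimal sketch, not code from the article: the paths, table name, and field names (trip_id, updated_at, datestr) are hypothetical, and the option keys shown match recent Hudi releases, so older versions may spell some of them differently.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object HudiSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hudi-upsert-sketch")
      // Hudi expects Kryo serialization on the Spark session.
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()

    val basePath = "hdfs:///warehouse/trips_hudi" // hypothetical table location

    // Upsert: rows whose record key already exists are updated, new keys are
    // inserted, all within a single atomic Hudi commit on top of Parquet.
    val changes = spark.read.parquet("hdfs:///staging/trip_changes") // hypothetical input
    changes.write.format("hudi")
      .option("hoodie.table.name", "trips")
      .option("hoodie.datasource.write.recordkey.field", "trip_id")
      .option("hoodie.datasource.write.precombine.field", "updated_at")
      .option("hoodie.datasource.write.partitionpath.field", "datestr")
      .option("hoodie.datasource.write.operation", "upsert") // or "delete"
      .mode(SaveMode.Append)
      .save(basePath)

    // Incremental pull: read only records changed since a given commit instant
    // instead of rescanning the whole table -- this is what makes incremental
    // updates of derived modeled tables cheap.
    val incremental = spark.read.format("hudi")
      .option("hoodie.datasource.query.type", "incremental")
      .option("hoodie.datasource.read.begin.instanttime", "20180924000000") // last instant consumed
      .load(basePath)
    incremental.createOrReplaceTempView("trips_changed")
    spark.sql("SELECT datestr, COUNT(*) FROM trips_changed GROUP BY datestr").show()
  }
}
```

The precombine field is what resolves duplicate keys within a batch (the record with the latest updated_at wins), which is how Hudi keeps upserts deterministic when the same key changes more than once between commits.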