dataxu has recently kicked off a tech blog where our technology team shares lessons learned and useful tips about the technologies it works with, spanning adtech, software, and more. You can follow the full conversation at medium.com/dataxutech. These blogs will also be shared once a month on the primary dataxu blog (here) as our “Technical Tuesday” series.
For the inaugural post of the Technical Tuesday series, Dong Jiang, a Software Architect here at dataxu, digs into the differences between an Enterprise MPP database and a cloud-native data warehouse.
dataxu’s journey from an Enterprise MPP database to a cloud-native data warehouse, Part 1
This is part 1 of a series of blogs on dataxu’s efforts to build out a cloud-native data warehouse and our learnings in that process.
At dataxu, we deal with data collection, storage, processing, analysis, and consumption at massive scale. For this reason, we were an early adopter of the Hadoop framework. We quickly discovered that Hadoop and Hive alone were not sufficient for the growing needs of interactive analysis and querying. And so, about five years ago, we incorporated an MPP database as our warehouse solution.
The on-premises solution served us well as the cluster size expanded 16-fold over the course of five years. However, even with the addition of an MPP database, we started to run into significant operational challenges:
- While it is possible to expand the MPP database, it takes months of planning and execution. The capacity planning is particularly tricky. If the business experienced unexpected growth, it would be difficult to bring additional capacity online in a timely fashion. If the business growth slowed, we could be stuck with an over-sized cluster for a number of months, before the volume eventually caught up.
- The database requires constant maintenance, both in terms of hardware (like replacing failed disks) and software (like vacuuming the catalog). Moreover, the database constantly experiences failed processes, which require a DBA to perform recovery operations.
- Ad-hoc query users constantly compete with production ETL loads for the fixed capacity, leading to unpredictable load times and SLA misses.
The MPP solution was clearly not a sustainable option for serving dataxu’s business needs. As such, we started to look for an AWS cloud-native solution. After reviewing several competing solutions, we settled on Apache Spark on EMR as our primary ETL solution and AWS Athena as our primary query solution.
In this blog, we compare a cloud-native warehouse with an MPP database, with some focus on Spark as an ETL solution.
Cloud-native warehouse vs. MPP
First and foremost, performance is not the primary reason to choose Spark. As it currently stands, even if a Spark cluster is configured with an equivalent amount of CPU, RAM, and disk capacity, it is unlikely to beat the query performance of the MPP solution. There are many reasons why an MPP database will “beat” Apache Spark on paper:
- EMR clusters run on VMs, while MPP on-prem runs on highly tuned bare-metal servers.
- EMR clusters run in a VPC, while MPP on-prem has a dedicated network switch and 100 Gbit/s of throughput.
- EMR clusters use S3 as storage, while MPP on-prem has superior I/O performance with direct-attached disks on RAID 10.
- MPP databases have years of query optimization work behind them, while Spark has a lot to catch up on.
- MPP databases allow for data locality. That means the data can be split into shards by a key, so that each fixed node “owns” a shard. A well-chosen distribution key cuts down on so-called broadcasts (data movement over the network between nodes).
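To make the data locality point concrete, here is a minimal Python sketch of hash distribution. It is not how any particular MPP database implements sharding. The node count, the `impressions` table, its column names, and the helper functions are all hypothetical and exist only for illustration. The sketch places rows on nodes by a distribution key and then counts how many rows would have to cross the network to be co-located for a join on a given key.

```python
import zlib
from collections import defaultdict

NUM_NODES = 4  # hypothetical cluster size, chosen only for illustration


def node_for(value, num_nodes=NUM_NODES):
    """Map a key value to a node using a stable hash (CRC32 here, as an assumption)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_nodes


def distribute(rows, dist_key, num_nodes=NUM_NODES):
    """Place each row on the node that owns its distribution-key value."""
    shards = defaultdict(list)
    for row in rows:
        shards[node_for(row[dist_key], num_nodes)].append(row)
    return shards


def rows_to_move(shards, join_key, num_nodes=NUM_NODES):
    """Count rows that are not already on the node that owns their join-key value."""
    moved = 0
    for node, rows in shards.items():
        for row in rows:
            if node_for(row[join_key], num_nodes) != node:
                moved += 1
    return moved


# Hypothetical table: 100 rows spread over 10 campaigns.
impressions = [{"campaign_id": i % 10, "user_id": i} for i in range(100)]

# Shard the table by campaign_id, so each node "owns" a set of campaigns.
shards = distribute(impressions, "campaign_id")

# Joining on the distribution key needs no data movement.
print("rows moved for join on campaign_id:", rows_to_move(shards, "campaign_id"))

# Joining on a different key generally forces rows across the network.
print("rows moved for join on user_id:", rows_to_move(shards, "user_id"))
```

When the join key matches the distribution key, every row is already on the node that owns it, so nothing moves. With any other key, the count depends on how the hash happens to spread the values, which is why the choice of distribution key matters so much on a fixed-node cluster.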
To read more about the reasons for choosing a cloud-native warehouse, such as Apache Spark, read the full post here.