For today’s edition of dataxu’s Technical Tuesday series, Dong Jiang, a Software Architect, digs into dataxu’s query engine and our technique for rewriting SQL in Amazon’s Athena Service.
dataxu’s journey from an Enterprise MPP database to a cloud-native data warehouse, Part 2
This is part 2 of a series of blogs on dataxu’s efforts to build out a cloud-native data warehouse and our learnings in that process. See part 1 here.
Once we determined the architecture of the cloud-native data warehouse as discussed in part 1 of this series, we set out to select a query engine.
dataxu was an early participant in the preview of the Amazon Athena Service, and once the service went GA, we became an early adopter. With Athena, our analytical users can run complex, interactive SQL queries over data stored on S3 with excellent performance and cost efficiency, without having to manage a cluster. After months of testing, Athena has been incorporated as the default query engine in our cloud-native warehouse.
As with any SQL engine, it is critical to learn a few things about the engine in order to write performant SQL. AWS published an excellent blog Top 10 Performance Tuning Tips for Amazon Athena that discusses this topic. But to make things easier, here at dataxu, we distilled these tips down to 3 key recommendations:
- Select only the columns you need, particularly if the underlying data is in a columnar format like Parquet or ORC. Never write “select *”.
- Always include a filter on the partition column in the where clause. A partitioned table is almost certainly a large table, and without a filter on the partition column the query will result in a full scan of the data.
- Mind the join order. In multi-table joins, order the tables from the one with the most rows to the one with the fewest, left to right.
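Taken together, the three tips shape how a query should be written. Here is a minimal before/after sketch against a hypothetical partitioned table (the `events` table and its columns are ours for illustration, not part of dataxu’s schema):

```sql
-- Avoid: reads every column of every partition
select * from events;

-- Prefer: project only the needed columns and filter on the
-- partition column so Athena prunes partitions before scanning
select campaign_id, sum(spend) as total_spend
from events
where created_date between date '2018-01-01' and date '2018-01-31'
group by campaign_id;
```

With columnar data, the projection limits the scan to the referenced columns, and the partition filter limits it to the matching partitions, so both the cost (billed per byte scanned) and the latency drop.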
We would also like to share a SQL rewrite technique that can significantly improve query performance and is most applicable to the star schema, a common approach to organizing data.
Let’s start with a typical schema in ad tech, where we have a fact table of impressions and a few dimension tables, like campaigns, creatives, algorithms, etc. A sample schema looks like the example below*:
*Note: the impressions table records each individual event, well over hundreds of billions of records annually.
```sql
create external table impressions (
    campaign_id int,
    creative_id int,
    algorithm_id int,
    spend bigint
    -- many other metrics ...
)
partitioned by (created_date date)
stored as parquet;

create external table campaigns (
    campaign_id int,
    campaign_name string,
    advertiser_name string
    -- many other attributes ...
)
stored as parquet;

create external table creatives (
    creative_id int,
    creative_name string,
    creative_type string
    -- many other attributes ...
)
stored as parquet;

create external table algorithms (
    algorithm_id int,
    algorithm_name string,
    algorithm_type string
    -- many other attributes ...
)
stored as parquet;
```
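A typical analytical query over this star schema joins the large fact table against the small dimension tables. Applying the tips above (column projection, a partition filter, and the largest table first in the join order), such a query might look like the sketch below; the date range and the reported columns are illustrative, not taken from the original post:

```sql
select
    c.campaign_name,
    cr.creative_type,
    sum(i.spend) as total_spend
from impressions i  -- largest table first in the join order
join campaigns c
    on i.campaign_id = c.campaign_id
join creatives cr
    on i.creative_id = cr.creative_id
where i.created_date
    between date '2018-01-01' and date '2018-01-31'  -- partition pruning
group by
    c.campaign_name,
    cr.creative_type;
```

Queries of this shape, where a huge fact table is repeatedly joined to many small dimensions, are exactly the pattern the rewrite technique targets.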
To read more about dataxu’s strategy for rewriting SQL read the full post here.