Technical Tuesday: dataxu’s Athena SQL rewrite technique


For today’s edition of dataxu’s Technical Tuesday series, Dong Jiang, a Software Architect, digs into dataxu’s query engine and our technique for rewriting SQL in Amazon’s Athena Service.

dataxu’s journey from an Enterprise MPP database to a cloud-native data warehouse, Part 2

This is part 2 in a series of blog posts on dataxu’s effort to build out a cloud-native data warehouse and what we learned in the process. See part 1 here.

Once we determined the architecture of the cloud-native data warehouse as discussed in part 1 of this series, we set out to select a query engine.

dataxu was an early participant in the preview of the Amazon Athena service, and once the service went GA, we became an early adopter. With Athena, our analytical users can run complex, interactive SQL queries over data stored on S3 with excellent performance and cost efficiency, without having to manage a cluster. After months of testing, we adopted Athena as the default query engine in our cloud-native warehouse.

As with any SQL engine, it is critical to learn how the engine works in order to write performant SQL. AWS published an excellent blog, Top 10 Performance Tuning Tips for Amazon Athena, that discusses this topic. To make things easier, here at dataxu we distilled those tips down to 3 key recommendations:

  1. Select only the columns you need, particularly if the underlying data is in a columnar format like Parquet or ORC. Never write “select *”.
  2. Always include a filter on the partition column in the where clause. A partitioned table is almost certainly a large table, and without a filter on the partition column the query will scan all of the data.
  3. Mind the join order. In multi-table joins, order the tables from the most rows to the least, left to right.
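To illustrate the first two tips, here is a minimal sketch. The table and column names (a partitioned `events` table with `user_id`, `spend`, and partition column `event_date`) are hypothetical, not from dataxu’s schema:

```sql
-- Hypothetical table, for illustration only.
-- Tip 1: name the columns you need instead of "select *",
--        so Athena reads only those columns from Parquet/ORC.
-- Tip 2: filter on the partition column, so Athena scans
--        only the matching partitions rather than the whole table.
select user_id,
       sum(spend) as total_spend
from events
where event_date between date '2018-01-01' and date '2018-01-31'
group by user_id
```

Because Athena bills by bytes scanned, both tips cut cost as well as latency: column pruning on a columnar format skips unread columns entirely, and the partition filter skips unread partitions.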

We would also like to share a SQL rewrite technique that can significantly improve query performance and is most applicable to the star schema, a common approach to data organization.

Let’s start with a typical schema in ad tech, where we have a fact table of impressions and a few dimension tables, like campaigns, creatives, algorithms, etc. A sample schema looks like the example below*:

*Note: the impressions table records each individual event, amounting to well over hundreds of billions of records annually.

create external table impressions (  
    campaign_id     int,
    creative_id     int,
    algorithm_id    int,
    spend           bigint,
    -- many other metrics
) partitioned by (
    created_date date
) stored as parquet

create external table campaigns (  
    campaign_id         int,
    campaign_name       string,
    advertiser_name     string,
    -- many other attributes
) stored as parquet

create external table creatives (  
    creative_id     int,
    creative_name   string,
    creative_type   string,
    -- many other attributes
) stored as parquet

create external table algorithms (  
    algorithm_id      int,
    algorithm_name    string,
    algorithm_type    string,
    -- many other attributes
) stored as parquet
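Putting the three recommendations together against this schema, a typical rollup might look like the sketch below. The date literal and the choice of aggregate are illustrative; the point is the shape of the query:

```sql
-- Tip 3: the large impressions fact table leads the join order,
-- followed by the much smaller dimension tables.
select c.campaign_name,
       a.algorithm_name,
       sum(i.spend) as total_spend
from impressions i
join campaigns  c on c.campaign_id  = i.campaign_id
join algorithms a on a.algorithm_id = i.algorithm_id
-- Tip 2: filter on the partition column of the fact table.
where i.created_date = date '2018-06-01'
-- Tip 1: only the columns actually needed appear in the query.
group by c.campaign_name, a.algorithm_name
```

Joining the dimension tables directly like this is the natural starting point; the rewrite technique mentioned above targets exactly this kind of star-schema query.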

To read more about dataxu’s strategy for rewriting SQL, read the full post here.