Engineering Innovation In Action: Moving from Hadoop to Spark


DataXu’s official mission is to make marketing smarter through the use of data science. And that’s not just talk: hours upon hours of engineering work go into ensuring our tech does what it’s supposed to do.

If you’re not familiar with DataXu already, here’s a brief summary of what we do. When our customers run ad campaigns on our platform, we bid on their behalf for ad slots on ad exchanges. When we win bids, our customers’ ads are shown on websites and in mobile apps. This process needs to work seamlessly day in and day out to deliver on our value proposition as a company.

Under the hood of our tech, we are continuously collecting non-personal data on where ads were placed and whether they were impactful. This data feeds our machine learning system that learns from past bids to optimize future ones. This complex data flow runs unattended every day and scales to petabyte size and beyond.

While this system gives great results and is cost-effective even for our largest executions, we found that it was not always easy to quickly try out new ideas, algorithms and research prototypes in response to new industry needs. DataXu therefore considered migrating a portion of our system to Apache Spark, both to address these issues and to improve other capabilities.

We started exploring Spark as part of a pretty awesome DataXu program that allows engineers to focus on a project of personal interest in addition to their day-to-day work. There were many theoretical advantages to using Spark, such as in-memory processing (an enabler for iterative machine learning algorithms) and the fact that it comes with a set of standard machine learning algorithms. We asked ourselves the following questions.

  • Can we port DataXu’s award-winning machine learning infrastructure to Spark?
  • Can we leverage some of our existing algorithm implementations?
  • Can Spark models be deployed to our production environment safely and still handle our current load of 1.8 million bid decisions per second?
  • Will the models bid quickly, and will they be scalable and thread-safe?

Spoiler: The answer is yes to all of the above!

In 2015, DataXu created a team to build a proof of concept (POC) model to explore the questions above. In 2016 we deployed our first Spark classifier to production. But it wasn’t all smooth sailing. And by sharing where we encountered obstacles, hopefully others will be able to navigate more smoothly in the future. Here are some of the challenges (and eventual solutions) that we faced during the journey of moving from an in-house system built on Hadoop to one that leverages Spark. And note: if some of the below sounds familiar, it’s because we went on the road to share these learnings beyond our own follower and customer base at Spark Summit East 2016.

1. Spark ML Pipelines

Spark’s ML Pipelines API supports two concepts: Transformers and Estimators. Transformers transform DataFrames, reflecting the data-preparation process. Estimators produce a Model, which is effectively a chain of Transformers capable of running a classifier to get predictions for a given DataFrame. These models, however, need to run a Spark job to get predictions for an entire DataFrame. So would we be able to use this framework at bid time?
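
To make the Transformer/Estimator distinction concrete, here is a minimal sketch of a Spark ML pipeline. This is not DataXu’s actual pipeline; the column names and the trainingDf/testDf DataFrames are illustrative.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Transformers handle data preparation; the Estimator is fit at the end.
val siteIndexer = new StringIndexer()   // Estimator that yields an indexing Transformer
  .setInputCol("site")                  // illustrative column names
  .setOutputCol("siteIndex")

val assembler = new VectorAssembler()   // pure Transformer: DataFrame => DataFrame
  .setInputCols(Array("siteIndex", "hourOfDay"))
  .setOutputCol("features")

val lr = new LogisticRegression()       // Estimator: fit() produces a Model
  .setLabelCol("clicked")
  .setFeaturesCol("features")

val pipeline = new Pipeline().setStages(Array(siteIndexer, assembler, lr))

// Fitting yields a PipelineModel: a chain of fitted Transformers.
val model = pipeline.fit(trainingDf)

// Scoring runs a Spark job over an entire DataFrame, not a single bid request.
val scored = model.transform(testDf)
```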

Unfortunately, as we discovered, the answer is: not directly. At bid time, we don’t have DataFrames, only bid requests that each represent a single row of data. In a memory-constrained, real-time environment, we don’t have the luxury of running a Spark job, and we cannot create a Spark context either. Additionally, streaming the requests was observed to add significant latency.

We proposed an extension to the notion of a Transformer that performs its transformation on a single row instead of a DataFrame. This transformation is Spark-context-free and is optimized for a time-critical system. For example, a feature selection step truncates the features in a row so that it contains only the columns selected as best features. At bid time, we were then able to produce a live prediction by applying all the necessary row transformers in the pipeline in the correct order.
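
Here is a hypothetical sketch of what such a single-row transformer might look like; the trait and class names are ours for illustration, not DataXu’s production code.

```scala
// Hypothetical single-row analogue of a Spark Transformer: no Spark context,
// no job submission, just a plain function over one record.
trait RowTransformer extends Serializable {
  def transform(row: Map[String, Any]): Map[String, Any]
}

// Example step: feature selection that keeps only the chosen columns.
class RowFeatureSelector(selected: Set[String]) extends RowTransformer {
  override def transform(row: Map[String, Any]): Map[String, Any] =
    row.filter { case (column, _) => selected.contains(column) }
}

// At bid time, apply the fitted row transformers in pipeline order.
def predictLive(bidRequest: Map[String, Any],
                stages: Seq[RowTransformer]): Map[String, Any] =
  stages.foldLeft(bidRequest)((row, stage) => stage.transform(row))
```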

The DataXu team also extended pipelines in a different way by defining a DAG (see diagram below) inclusive of all steps and dependencies. Each path from the root step to a leaf step represents a distinct pipeline, or path of transformations. Currently, we define these pipelines as JSON configurations. This allows us to easily build different kinds of models from the same data. In our experience, once the DataFrame is in memory, running several pipelines does not cause performance issues. We therefore build different kinds of models for each of our campaigns and then decide which one is best.

[Image: DataXu Spark flowchart]
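
As an illustration, a pipeline DAG like the one above could be described with a JSON configuration along these lines; the schema and step names are hypothetical, not DataXu’s actual format.

```json
{
  "steps": {
    "load":            { "type": "source" },
    "clean":           { "type": "sql_transform",          "dependsOn": ["load"] },
    "select_features": { "type": "feature_selection",      "dependsOn": ["clean"] },
    "lr":              { "type": "logistic_regression",    "dependsOn": ["select_features"] },
    "gbt":             { "type": "gradient_boosted_trees", "dependsOn": ["clean"] }
  }
}
```

Each root-to-leaf path (load → clean → select_features → lr, and load → clean → gbt) defines one pipeline, so two kinds of models are trained from the same data.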


As the chart below (built from real data) makes clear, while a number of campaigns do well with any of the basic machine learning algorithms, some campaigns stand to benefit significantly from a specific algorithm or specific tuning.

[Image: DataXu Spark graph]

2. Model Deployment

So how does DataXu deploy Spark models to production?

We serialize them using standard Java serialization and deserialize them in the bidder. We know, we know: this probably isn’t your first choice, and it wasn’t ours either. However, most of the built-in save/load commands require Spark contexts or even running jobs in current versions of Spark. Luckily, we found that for a typical campaign the bidding speed of these models is similar to that of our custom DataXu models. Additionally, the memory requirements are comparable to what we use now. Most of the memory goes into the metadata necessary for pre-processing anyway, such as maps of nominal values to their numerical representations.
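
A minimal sketch of that approach, assuming the fitted model object graph is Serializable (the helper names are ours, not DataXu’s):

```scala
import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}

// Training side: persist the fitted model with standard Java serialization.
def saveModel(model: java.io.Serializable, path: String): Unit = {
  val out = new ObjectOutputStream(new FileOutputStream(path))
  try out.writeObject(model) finally out.close()
}

// Bidder side: load the model back without any Spark context or running job.
def loadModel[T](path: String): T = {
  val in = new ObjectInputStream(new FileInputStream(path))
  try in.readObject().asInstanceOf[T] finally in.close()
}
```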

Deciding Whether Or Not To Make The Spark Move

After all was said and done, DataXu found that Spark models are easy to train and the resulting code is more maintainable and declarative (e.g. SQL transformations). Previously, our code had a strong focus on performance and powerful abstractions, but this made it considerably more complex. The new Spark-led design simplifies several aspects, which results in a more flexible system. Our data scientists can now easily try new ideas and launch an A/B test with new models.
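
To give a sense of what “declarative” means here, a data-preparation step can be written as a single SQL query instead of custom Hadoop code. The table and column names below are illustrative, and spark is assumed to be an existing SparkSession.

```scala
// A declarative data-preparation step expressed as Spark SQL.
val training = spark.sql("""
  SELECT campaign_id,
         site,
         hour(event_time) AS hour_of_day,
         CAST(clicked AS DOUBLE) AS label
  FROM   impressions
  WHERE  bid_won = true
""")
```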

With Spark, we continue to have automated and unattended ML at large scale. Our classifiers trained with Spark are bidding live and we expect to increase the usage of Spark throughout 2016 and beyond. While we can’t advise you whether or not to shift to Spark, we can share our own experiences with making the move. And so far, it’s been a pretty good decision.