For this edition of dataxu’s Technical Tuesday series, Javier Buquet, a Software Engineer, dives into how you can synchronize data pipelines with a simple pattern and minimal overhead.
Are you finding it hard to synchronize processes that read and write data on a data lake?
Are your solutions overly complex?
If you answered yes to these questions, there is a simple pattern we use at dataxu that can help. It gets the work done with minimal overhead.
The Data Science Engineering team at dataxu spends most of its day designing, developing, and maintaining a wide range of data pipelines that support our AI-based bidding system.
A pipeline is a set of stages (or sub-processes) that transform input data into an output, where the output of each stage serves as the input for the subsequent stage, creating a chain of stages that turns a given input into a desired output. In particular, a data pipeline can be defined as the process, or pipeline of processes, that moves, joins, formats, and processes data in a system. A good data pipeline should do so in an automated and reliable way, with minimal (or, ideally, no) human intervention.
In systems with the scale and complexity of dataxu’s platform, TouchPoint™, it’s common to have many different data pipelines, each with a particular goal or role in the overall system, that need to interact. For example, say we have a dataset with all the ads we’ve shown to different users and another dataset with all the clicks those users have made. We want two pipelines that “clean” both datasets independently, and then use the output of those pipelines to study which ads were effective.
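The ads-and-clicks scenario above can be sketched as a chain of stage functions, where each stage’s output feeds the next. The stage and field names here are illustrative, not dataxu’s actual schema:

```python
# A minimal sketch of a pipeline as a chain of stages. Each stage takes the
# previous stage's output as its input, forming the chain described above.

def clean_impressions(records):
    # Stage 1: "clean" the ads dataset by dropping malformed records
    # (here, records missing a user id).
    return [r for r in records if r.get("user_id")]

def join_clicks(impressions, clicks):
    # Stage 2: combine the cleaned datasets, marking each impression as
    # clicked or not, so a later stage can study which ads were effective.
    clicked_users = {c["user_id"] for c in clicks}
    return [{**imp, "clicked": imp["user_id"] in clicked_users}
            for imp in impressions]

def run_pipeline(raw_impressions, raw_clicks):
    # The pipeline itself: the output of clean_impressions is the input
    # of join_clicks.
    cleaned = clean_impressions(raw_impressions)
    return join_clicks(cleaned, raw_clicks)
```

Note that chaining the stages inside a single program like this is exactly the first, in-memory approach discussed below, with all of its drawbacks.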
Given all these interactions, it’s very important that we have a robust mechanism, or protocol, for synchronizing the data between the different pipelines; that is, for making sure we provide the correct input to each pipeline based on the output of its predecessors.
Defining a synchronization protocol
An acceptable first approach to this problem could be to run all of a given pipeline’s predecessors on each run, keeping their results in the program’s memory so that all the necessary data is available right away. Of course, this method is immediately hindered by a number of drawbacks, such as:
- Not being able to reuse an intermediate result in case we need to re-run a pipeline (which usually makes things a lot more difficult to debug too)
- Doing duplicate work when different pipelines have shared predecessors
- Not being able to fully de-couple a given pipeline from its predecessors (for example, implementing them using different languages or technologies).
Another approach could be to implement every pipeline independently of its predecessors and agree on a location in which each pipeline will drop its final results (so that they can be picked up by any other pipeline).
For example, let’s say we agree that Pipeline A will drop its results in Location A, so that Pipeline B knows it should look for its input in that location. Each time Pipeline A runs, it must make sure it cleans its drop location (removing any previous results) and then stores the new results in the same place.
This is, in many ways, better than our previous approach, but still not ideal: we don’t keep a history of the various pipelines’ results (which might be useful later, or to some other pipeline), and a pipeline has no way to differentiate complete results from partial results. Imagine the case in which Pipeline B starts running before Pipeline A has finished storing its results.
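The agreed-drop-location approach can be sketched as follows, using a local directory as a stand-in for the shared store. The paths and file names are illustrative; note in particular that the reader has no way to tell a complete result from a partial one:

```python
import shutil
import tempfile
from pathlib import Path

# Illustrative agreed drop location for Pipeline A's results (a local
# directory here, standing in for a shared store).
RESULTS_DIR = Path(tempfile.gettempdir()) / "pipeline_a_results"

def publish_results(rows):
    # Pipeline A: clean the drop location (remove any previous results),
    # then store the new results in the same place.
    if RESULTS_DIR.exists():
        shutil.rmtree(RESULTS_DIR)
    RESULTS_DIR.mkdir(parents=True)
    (RESULTS_DIR / "part-0000.csv").write_text("\n".join(rows))

def read_results():
    # Pipeline B: look for its input at the agreed location. Nothing here
    # tells it whether Pipeline A has finished writing yet -- the race
    # condition described above.
    return (RESULTS_DIR / "part-0000.csv").read_text().splitlines()
```

Each new run of `publish_results` overwrites the previous one, which is also why no history of past results survives.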
By the way, this is a simple example of what we call the “blackboard” architectural pattern. The main idea is that we define a “blackboard” on which each “producer” (in this case, a predecessor pipeline) writes its results, so that any “consumer” can pick them up whenever it needs to. In our case, we usually use Amazon S3 as the blackboard: each pipeline defines a specific location for its results within the service, which is then shared with the pipelines that need its output as input.
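The read/write contract of the blackboard can be illustrated with a minimal in-memory stand-in. In production the blackboard would be an object store such as Amazon S3; this dict-backed class and the bucket paths below are only a sketch of the pattern, not dataxu’s actual implementation:

```python
# A dict-backed stand-in for the blackboard pattern: producers write their
# results under an agreed location, and any consumer reads from it.

class Blackboard:
    def __init__(self):
        self._store = {}

    def write(self, location, result):
        # A producer pipeline drops its results at its agreed location.
        self._store[location] = result

    def read(self, location):
        # A consumer pipeline picks up the results it needs as input.
        return self._store[location]

# Two producer pipelines publish their cleaned datasets; a third pipeline
# can now consume both locations as its input.
board = Blackboard()
board.write("s3://bucket/pipeline-a/results", ["cleaned-ads"])
board.write("s3://bucket/pipeline-b/results", ["cleaned-clicks"])
```

The key property is that producers and consumers never talk to each other directly: they only agree on locations on the blackboard, which is what lets each pipeline be implemented independently.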
Interested in learning how we can define a synchronization protocol that addresses or mitigates each of the drawbacks of the two approaches mentioned above? Check out the full post over on our dedicated dataxu tech blog here.