Technical Tuesday: Write your own Spark Classifier?

In this month’s edition of dataxu’s Technical Tuesday series, Maximo Gurmendez, a Data Science Engineering Lead, discusses why you should consider writing your own Spark classifier.

Why write your own Spark Classifier?

You may be asking yourself: why write your own Spark Classifier? The most obvious reason is to craft your own secret-sauce model outside of Spark and integrate it into a Spark pipeline. However, with the array of algorithms and feature engineering tools available nowadays, this is rarely a necessity.
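
Still, it helps to know what that integration surface looks like. Below is a minimal, hypothetical sketch of a do-nothing “majority class” classifier built on Spark ML’s developer API (Classifier and ClassificationModel). The class names are ours, and a real secret-sauce model would replace the counting logic in train():

```scala
import org.apache.spark.ml.classification.{ClassificationModel, Classifier}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.Dataset

// Toy classifier: learns nothing but the most frequent label.
// It exists only to show where a custom model plugs into the Classifier API.
class MajorityClassifier(override val uid: String)
    extends Classifier[Vector, MajorityClassifier, MajorityModel] {

  def this() = this(Identifiable.randomUID("majorityClf"))

  override def copy(extra: ParamMap): MajorityClassifier = defaultCopy(extra)

  // train() receives the full Dataset; this is where a custom
  // training procedure would go. Assumes a numeric (double) label column.
  override protected def train(dataset: Dataset[_]): MajorityModel = {
    val spark = dataset.sparkSession
    import spark.implicits._
    val counts = dataset.select($(labelCol)).as[Double].rdd.countByValue()
    val numClasses = counts.keys.max.toInt + 1
    val majority = counts.maxBy(_._2)._1
    copyValues(new MajorityModel(uid, numClasses, majority).setParent(this))
  }
}

class MajorityModel(
    override val uid: String,
    override val numClasses: Int,
    val majorityLabel: Double)
    extends ClassificationModel[Vector, MajorityModel] {

  override def copy(extra: ParamMap): MajorityModel =
    copyValues(new MajorityModel(uid, numClasses, majorityLabel), extra)

  // Raw per-class scores; downstream stages derive the prediction from these.
  override def predictRaw(features: Vector): Vector = {
    val raw = Array.fill(numClasses)(0.0)
    raw(majorityLabel.toInt) = 1.0
    Vectors.dense(raw)
  }
}
```

Once defined, it behaves like any built-in estimator: new MajorityClassifier().fit(trainingData) returns a model you can drop into a Pipeline alongside standard stages.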

There are other cases in which you’d want to consider writing your own classifier, such as when you need a very efficient, simple, or relaxed algorithm and the nature of your dataset doesn’t merit a more powerful machine learning model.

To illustrate that, in certain situations, the simplest or most relaxed algorithm can work just as well as a more complex one, we created a synthetic dataset generator capable of customizing a dataset in different ways (number of features, cardinality, label imbalance, etc.). Using this synthetic dataset, we varied the average cardinality of the features and recorded the resulting ROC AUC values for CategoricalNaiveBayes and Spark’s RandomForest in the graph below.

Note how, at lower cardinality values, both algorithms perform equally well; RandomForest, however, appears more robust to increased cardinality.
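
To make the setup concrete, here is a rough, hypothetical sketch of the RandomForest half of such a sweep: it generates label-dependent categorical features at a chosen cardinality and records ROC AUC. The generator and all parameter choices here are our own illustration, not dataxu’s actual code:

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{VectorAssembler, VectorIndexer}
import org.apache.spark.sql.SparkSession
import scala.util.Random

object CardinalitySweep {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("cardinality-sweep").getOrCreate()
    import spark.implicits._
    val rng = new Random(7)

    // Label-dependent categorical features with a configurable number of values.
    def synth(rows: Int, cardinality: Int) = Seq.fill(rows) {
      val label = if (rng.nextDouble() < 0.5) 1.0 else 0.0
      def feat() =
        if (label == 1.0) rng.nextInt(cardinality).toDouble
        else ((rng.nextInt(cardinality) + cardinality / 3) % cardinality).toDouble
      (feat(), feat(), label)
    }.toDF("f0", "f1", "label")

    val evaluator = new BinaryClassificationEvaluator().setMetricName("areaUnderROC")

    for (card <- Seq(4, 16, 64, 256)) {
      val assembled = new VectorAssembler()
        .setInputCols(Array("f0", "f1")).setOutputCol("rawFeatures")
        .transform(synth(20000, card))
      // Mark the features as categorical so the trees treat them as such.
      val indexed = new VectorIndexer()
        .setInputCol("rawFeatures").setOutputCol("features")
        .setMaxCategories(card + 1)
        .fit(assembled).transform(assembled)
      val Array(train, test) = indexed.randomSplit(Array(0.8, 0.2), seed = 1L)
      val model = new RandomForestClassifier()
        .setNumTrees(50)
        .setMaxBins(math.max(32, card)) // must cover the largest cardinality
        .fit(train)
      val auc = evaluator.evaluate(model.transform(test))
      println(f"cardinality=$card%4d  AUC=$auc%.3f")
    }
    spark.stop()
  }
}
```

The same harness extends naturally to the other experiments described below: sweep the positive-label rate or the number of feature columns instead of the cardinality.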

We also considered varying the frequency with which a positive label occurs. See in the chart below how, beyond a certain threshold, the performance of Naive Bayes and Random Forests is similar. For datasets with greater imbalance, however, the performance of Naive Bayes drops dramatically (as expected).

Something similar happens when we increase the number of features in the dataset, as can be seen in the plot below:

We varied some other properties of the dataset and ran similar experiments. Completing such an exercise is useful for determining the return on investment (ROI) of an algorithm. In our example, Random Forests are particularly slow, and hence expensive, for categorical values, since the underlying decision trees must consider an explosive number of binary split candidates. If, under a specific scenario, both algorithms have similar predictive power, we might as well pick the one that is most efficient in terms of training time, or even the simplest to understand.
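
To see why the split candidates explode, note a standard combinatorial fact (not specific to Spark): an unordered categorical feature with k distinct values admits 2^(k-1) - 1 possible binary partitions, so an exhaustive split search grows exponentially with cardinality. A quick sanity check:

```scala
// Distinct binary splits for an unordered categorical feature with k values.
def candidateSplits(k: Int): BigInt = BigInt(2).pow(k - 1) - 1

Seq(2, 5, 10, 20, 30).foreach { k =>
  println(f"cardinality $k%2d -> ${candidateSplits(k)} candidate splits")
}
// cardinality  2 -> 1 candidate splits
// cardinality  5 -> 15 candidate splits
// cardinality 10 -> 511 candidate splits
// cardinality 20 -> 524287 candidate splits
// cardinality 30 -> 536870911 candidate splits
```

Tree implementations temper this with heuristics and binning, but the cost pressure from high-cardinality categorical features remains.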

Continue reading the full post here to discover when you should choose to write your own Spark Classifier.