In our last post we started talking about applications of ML in consumer analytics, but focused primarily on the data preparation step of the data science process. We played with FeatureTools, an open-source Python framework for automated feature engineering, and ended up with a dataset of more than 1000 features (up from the original 50-something!). Not bad, not bad, but now we find ourselves in a bit of trouble (wtf are we going to do with a thousand features?), so in this post we’ll talk more about feature selection and later about modeling, i.e. the Gradient Boosting Machines (GBM) that we’ll use for our prediction problem. Gradient boosting is one of the most popular and most powerful techniques for building predictive models; however, as always in ML, your model is only ever as good as the data you train it on. As such, a significant proportion of our effort is often focused on creating the best possible dataset (within the available budget and time frame). Feature engineering and feature selection are the methods used to achieve this; the former we touched upon in our last post, and the latter we’ll discuss today.
A fundamental problem of machine learning is to approximate the functional relationship f() between an input X and an output Y. Often, however, the output Y is not determined by the complete set of input features X but only by a subset of them. The process of selecting a subset of relevant features for use in model construction is called feature selection. If you think about it, it makes sense: of all the available features, some will influence model accuracy more than others. You might ask, though, “why not use all of them?” Well, there are two main issues caused by irrelevant features involved in the learning process: first, irrelevant input features induce greater computational cost, i.e. training time increases (often dramatically) with the number of features, and second, the risk of overfitting grows with the number of features. Feature selection helps with these issues by reducing the dimensionality of the problem, removing redundant or irrelevant features without much loss of total information. Simplifying the training data also makes the model easier to understand, which can be important when model outputs are used to justify real-world business decisions. I’ve mentioned that with feature selection we are effectively reducing the dimensionality of our dataset, but it’s important to know the difference between feature selection and dimensionality reduction techniques: both seek to reduce the number of attributes in the dataset, but a dimensionality reduction method (e.g. PCA) does so by creating new combinations of attributes, whereas feature selection methods include/exclude attributes present in the data without changing them.
There are three general classes of feature selection algorithms: filter methods, wrapper methods and embedded methods. Filter methods apply a statistical measure to score and rank each feature based on some criterion. This is done independently of the modelling algorithm, using intrinsic properties of the features, which can sometimes be very advantageous. Filter methods are usually fast and can be broadly divided into two main categories: univariate and multivariate. Wrapper methods treat the selection of a set of features as a search problem, where different combinations of features are evaluated in order to find the combination that produces the best result for the specific ML algorithm being applied. Embedded methods, on the other hand, learn which features contribute most to the accuracy of the model while the model is being created; they are thus embedded in the algorithm, either as part of its normal functionality or as an extension of it.
So which feature selection methods are we going to use? The obvious downside of the wrapper approach is that testing all possible feature combinations can be computationally very expensive, particularly if the feature set is large, and I would say our 1047 features is like a thousand features too many 😊 Gradient boosting algorithms (which we are going to use) perform feature importance estimation and selection internally during model training; however, this is also computationally intensive, and by first removing the most obviously unwanted features, a great deal of unnecessary processing is avoided. So we’ll start with filter methods to narrow down our feature set, after which we’ll employ an embedded method.
The filter methods we are going to use are simple yet powerful, and you’ll probably find yourself using them often. These are:
- removing features with high percentage of NULL values;
- removing features with low cardinality, i.e. (quasi-)constant features;
- removing collinear, i.e. highly correlated, features.
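The three filter steps above can be sketched in a few lines of pandas; here on a tiny made-up DataFrame (column names and thresholds are illustrative, not the competition data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mostly_null": [1.0, None, None, None, None, None],
    "constant":    [7, 7, 7, 7, 7, 7],
    "feat_a":      [1, 2, 3, 4, 5, 6],
    "feat_b":      [2, 4, 6, 8, 10, 12],   # perfectly collinear with feat_a
    "feat_c":      [5, 3, 8, 1, 9, 2],
})

# 1. Drop features with a high percentage of NULL values.
null_frac = df.isnull().mean()
df = df.drop(columns=null_frac[null_frac > 0.75].index)

# 2. Drop (quasi-)constant features, i.e. low cardinality.
nunique = df.nunique()
df = df.drop(columns=nunique[nunique <= 1].index)

# 3. Drop one feature out of each highly correlated pair.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
df = df.drop(columns=to_drop)

print(sorted(df.columns))  # feat_a survives, its collinear twin does not
```

In the real pipeline `df` would be the ~1000-column FeatureTools output, and the 0.75/0.95 thresholds are knobs you'd tune to taste.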
Obviously, there are other, more complex filter methods, but we won’t complicate things further, especially since we’ll also use an embedded method: removing columns with low feature importance (0.0 or close to it) as reported by the gradient boosting algorithm. One side note: neither null values, nor constant features, nor collinearity is a major problem per se in the context of GBM. Why? Well, the model uses a stepwise, greedy process for feature selection: if feature A is chosen for a tree split in step 1, then feature B (collinear with A) will not be chosen in step 2, because it adds no new meaningful information. But don’t get me wrong, all the other reasons mentioned before (why it’s good to perform feature selection and reduce the dimensionality of the problem) still apply. Anyway, after going through all these steps and doing a pinch more feature engineering, we are left with 213 features, around 20% of the original number, a much more manageable bunch.
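The embedded step can be sketched like this: fit a quick gradient boosting model and drop everything with (near-)zero importance. A minimal sketch on synthetic data, using scikit-learn’s GradientBoostingRegressor as a stand-in for the XGBoost importances the post actually uses:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
# Only the first two columns actually drive the target; the rest is noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = GradientBoostingRegressor(n_estimators=100, random_state=0).fit(X, y)

# Keep only features whose importance clears a small threshold.
keep = np.where(model.feature_importances_ > 0.01)[0]
print("kept feature indices:", keep)
```

The 0.01 cutoff is arbitrary here; in practice you would look at the distribution of importances before picking one.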
Many of the features we’ve created are rooted in an “old” marketing concept used for customer value/behavior analysis called RFM. The RFM model is based on three quantitative factors: Recency (how recently has a customer made a purchase?), Frequency (how often does a customer make a purchase?) and Monetary value (how much money does a customer spend on purchases?). We’ve also added Duration (how spread apart were the purchases, i.e. for how long did a given customer shop with the company?). It is a simple and straightforward tool, and CRM departments often use it for customer segmentation; these features are also frequently important ingredients of prediction models (which customers are more likely to make purchases, which customers are more likely to churn, etc.).
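RFM(+Duration) features are a simple groupby away; a minimal sketch on a toy transactions table (column names are illustrative, not the competition’s schema):

```python
import pandas as pd

tx = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "purchase_date": pd.to_datetime(
        ["2018-01-05", "2018-03-10", "2018-06-01", "2018-02-20", "2018-02-25"]),
    "amount": [20.0, 35.0, 15.0, 100.0, 80.0],
})
snapshot = pd.Timestamp("2018-07-01")  # "today" for the recency calculation

rfm = tx.groupby("customer_id").agg(
    recency_days=("purchase_date", lambda d: (snapshot - d.max()).days),
    frequency=("purchase_date", "count"),
    monetary=("amount", "sum"),
    duration_days=("purchase_date", lambda d: (d.max() - d.min()).days),
)
print(rfm)
```

One row per customer, four behavioral features; the automated FeatureTools run generates many variations on exactly this theme.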
At this point we are satisfied with our data (never 100% satisfied though! 😊) and it’s finally time to employ XGBoost. XGBoost is an implementation of Gradient Boosting Machines (or simply GBM), a decision-tree-based ensemble ML algorithm. Decision trees are fairly simple, interpretable, easy-to-visualize algorithms that I’m sure all of you have heard of and have some intuition about (even if you haven’t used them). From this relatively simple algorithm, two ensemble concepts evolved over time: first bagging, with the Random Forest (RF) algorithm as its main representative, and later boosting, with GBM as its main representative. We can define an ensemble method as a way of combining multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. Basically, we combine a bunch of “weak” learners/predictors to obtain one strong one, an approach different from the classic one where we build a single strong predictive model. So how do we combine trees? Well, one way is to “grow a forest” through something called bagging. Bagging (bootstrap + aggregating) is the simpler of the two ensembling techniques mentioned above: we build many independent predictors/learners (in our case decision trees) and combine or aggregate them using some model averaging technique, e.g. weighted average, majority vote or plain average. We typically take a random sub-sample/bootstrap of the data for each model, so that all the models are a little different from each other. Each model thus sees different observations thanks to the bootstrap process, and by having many uncorrelated learners produce a final model, we reduce error by reducing variance.
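The bagging idea can be spelled out in a few lines: draw a bootstrap sample for each tree, fit, then average the predictions. A toy illustration (this is essentially what Random Forest does, minus the random feature subsets at each split):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = np.linspace(0, 6, 200).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)  # noisy sine wave

trees = []
for _ in range(50):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap: sample with replacement
    trees.append(DecisionTreeRegressor(max_depth=4).fit(X[idx], y[idx]))

# Aggregate: plain average over the 50 slightly different trees.
ensemble_pred = np.mean([t.predict(X) for t in trees], axis=0)

# For comparison, one tree trained on the full data.
single_pred = DecisionTreeRegressor(max_depth=4).fit(X, y).predict(X)
```

Against the true underlying sine, the averaged ensemble is typically closer than the single tree, which is the variance reduction the paragraph above describes.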
In contrast, the family of boosting methods is based on a different strategy of ensemble formation. In boosting, predictors/learners are not built independently but are added sequentially: at each iteration, a new weak learner is trained with respect to the error of the whole ensemble learnt so far. To simplify, the idea is to use a bunch of decision trees, each one added sequentially and refocused on the examples that the previous ones found difficult and “misclassified”. When combined, these many weak and shallow trees produce a powerful prediction “committee”. Gradient Boosting Machines add another wrinkle to the whole concept with, well, the gradient thing, as the name suggests. Gradient descent is a very generic optimization algorithm capable of finding optimal solutions to a wide range of problems; the general idea is to tweak parameters iteratively in order to minimize a cost function. GBMs generalize standard boosting methods by allowing optimization of an arbitrary differentiable loss function. So, the principal idea behind this algorithm is to construct each new base-learner to be maximally correlated with the negative gradient of the loss function of the whole ensemble.
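For squared loss, the negative gradient is simply the residual, so hand-rolled gradient boosting fits in a dozen lines; a toy sketch of the principle (shallow trees, small learning rate):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=300)

pred = np.full(len(y), y.mean())       # start from a constant model
learning_rate, trees = 0.1, []
for _ in range(100):
    residual = y - pred                # negative gradient of 1/2 * (y - pred)^2
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(tree)
    pred += learning_rate * tree.predict(X)   # add the new weak learner

rmse = np.sqrt(np.mean((y - pred) ** 2))
print("train RMSE after boosting:", round(rmse, 3))
```

Each depth-2 tree is a very weak learner on its own, yet the sequential committee drives the training error down toward the noise level.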
Ok, enough with the theory, let’s do some modeling at last… If you haven’t used it, you might think XGBoost is something extremely complex, but after this whole extreme-FE/FS exercise, training is just a couple of lines of code. Obviously, tuning the model is a whole other topic and the theory behind it is complex; however, the XGBoost Python package is really simple and straightforward, as is scikit-learn, which we’ll use for cross validation, so I won’t bother you with more details. As I said, a few additional lines of code (below)…
…and here we go:
Yeah, it takes some time to train the model, but that’s no surprise: a couple of hundred thousand records, more than 200 features, 5-fold cross validation, each fold with 2000 boosting iterations… but it was worth it. RMSE (Root Mean Squared Error) was used as the competition scoring metric, and the 3.683 we got for our out-of-fold predictions is pretty decent compared with the top kernels on Kaggle.
More than decent, actually: we focused on feature engineering (n.b. only part of the dataset – we didn’t include the merchant data), and there are a few more tricks up our sleeve that can improve prediction power and score. Before we proceed, though, I have to add a few words on the interpretability of XGBoost models (and the model we just trained). In theory, XGBoost is a black-box model; however, we can use feature importance (in our case, averaged over folds) to get a sense of the underlying model’s functioning. The following graph displays the top 50 features by the Gain metric (out of the 200-something we had). With feature importance we get a sense of the predictive contribution and usefulness of a specific feature; however, to determine the direction and shape of its interaction effects on the target variable (for better business understanding and any prescriptive analytics that might follow), additional analysis is needed, and we are not going to do that here.
We’ve demonstrated the power of GBM, i.e. XGBoost, and its ease of use. However, despite its many good sides, there are a couple of disadvantages. One we already mentioned: it is less interpretable, although this is addressed with various tools (e.g. variable importance, partial dependence plots, LIME, etc.), and compared to Deep Neural Networks, GBMs are not so bad. We’ve also seen how computationally expensive XGBoost is. Moreover, the high flexibility the algorithm provides results in many parameters that interact and heavily influence its behavior (number of iterations, tree depth, how many columns are available to each tree when generating branches, regularization parameters, etc.), consequently requiring a large grid search during tuning. We didn’t do any grid search to tune our parameters, so that’s definitely another area where we could work and improve our results. Another important thing worth noting: in contrast to RF, GBM can easily overfit if we’re not careful. This is especially true if there are outliers present, which force the GBM algorithm to keep “improving” to minimize errors, causing overfitting (overemphasizing noise). Now, remember my previous post? I did a very short EDA focused on just two things, one of which was the distribution of the target variable and the “outliers” present in it. Bringing this up was no coincidence, I had a feeling we would end up here 😊 To refresh your memory, here’s the graph of the target variable distribution, where we can see a bunch of outliers (~1% of the population) that I hypothesized were customers who churned.
We have a classic regression problem on our hands which is seriously affected by target variable outliers, so another great way to improve our score is to divide the problem and attack it on two fronts: build a regression model (for regular samples) and also build a classification model specifically aimed at predicting outliers. The idea is to combine information from multiple predictive models in a way that highlights each base model where it performs best and “hides” it where it performs poorly… similarly to how we ensembled “weak” individual decision trees to get the more powerful GBM, here we ensemble a couple of “strong” models, each with a slightly different task. There are lots of ways to do this: stacking, blending, simple averaging, etc. In our case there is a simple bit of logic we could use: target = P(outlier) from the classification model * (-33.21928) + (1 – P(outlier)) * (value predicted by the regression model). Obviously, building two (or more) quality models is not easy, and for this to work, the accuracy of the binary classification model needs to be sufficiently high. This post has gotten big already, so I won’t pursue this direction this time, but I encourage you to give it a go, and we’ll definitely talk about model stacking in much more detail in the future.
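That combination rule, written out as a small function (the two input arrays would come from the hypothetical classification and regression models):

```python
import numpy as np

OUTLIER_TARGET = -33.21928  # the special target value the outliers share

def combine(p_outlier, reg_pred):
    """Blend classifier and regressor outputs into one prediction."""
    p_outlier = np.asarray(p_outlier, dtype=float)
    return p_outlier * OUTLIER_TARGET + (1 - p_outlier) * np.asarray(reg_pred, dtype=float)

# A confident non-outlier keeps the regression value; a confident outlier
# snaps to the special target; an uncertain sample lands in between.
print(combine([0.0, 1.0, 0.5], [0.8, 0.8, 0.8]))
```

The blend degrades gracefully: a poorly calibrated classifier pulls regular samples toward the outlier value, which is why the post stresses that the classifier’s accuracy needs to be sufficiently high.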
To conclude this two-part series: this time we tackled a classic problem in the most popular analytics domain (CRM/consumer analytics) while also employing one of the most popular ML methods of recent years – Gradient Boosting. We did large-scale automated Feature Engineering, talked about its advantages and disadvantages, and likewise talked about the good and bad sides of XGBoost. It’s not only about high-tech ML algorithms, it’s about the data as well, so I hope you got an idea of how more complex analytics projects are done and understand the importance of quality Feature Engineering in the whole ML pipeline. That’s all for today, but stay tuned, in the words of Monty Python…