XGBoost your sales! (Part 1)

I was reading this post on KDnuggets about the industries and fields where analytics is most widely applied, and it got me thinking. I use this blog to test and play with various ML methods and to explore domains I’ll probably never encounter at work on projects for my clients. I also enjoy dissecting non-typical domains, ones you would never think of as interesting or suitable analytics subjects, which is why I named this blog “analytics in everything” (you can find more in the About section of the blog). However, I’m sure we can take a break from exclusively Croatian subjects, so this time I’ve decided to balance things a bit: we’ll tackle a classic problem in the most popular analytics domain (5 years running), CRM/consumer analytics, while employing one of the most popular ML methods of recent years, Gradient Boosting.


While searching for a quality dataset I stumbled upon a Kaggle competition that provides real-life data in the customer analytics domain. Yeah, yeah, the provider company says the data is not real but simulated; however, I wouldn’t bet on it. I mean, they really went to too much trouble to complicate matters (they anonymized the data, bundled some features together, masked features with generic names such as Feature 1, etc.). This is also what I didn’t like: everything was too cryptic because of it, from how the datasets were constructed to the description of the competition goal. Just to illustrate, in the picture below we have the data model of the dataset made available as part of the competition, and as you can see there’s not much there. I used to work in the CRM department of one of the biggest banking groups in CEE, and I remember designing and building a data mart with more than 1,000 client-related attributes (personal data, product data, transaction data, campaign and offering data…). Obviously, we didn’t use all that data in our models, but there was more choice, certainly more possible causal links between attributes and target variables… and certainly more business rationale behind feature engineering & selection choices.


Let us also touch briefly on the competition description, which says: “In this competition, Kagglers will develop algorithms to identify and serve the most relevant opportunities to individuals, by uncovering signal in customer loyalty. Your input will improve customers’ lives and help Company reduce unwanted campaigns, to create the right experience for customers.” Identify most relevant opportunities? Sure, makes sense. Reduce unwanted campaigns? Yup. Uncover signal in customer loyalty? Hmmm… what exactly is meant by that? There are several standard ways of measuring customer loyalty: there’s retention, measured e.g. through churn rates; there’s advocacy, measured e.g. through customer referral revenue; however, the strongest indication of customer satisfaction is related to purchasing, which is measured through upselling ratio, repurchase ratio, Customer Lifetime Value (LTV), the ratio of CAC (Cost to Acquire a new Customer) to LTV, etc. In the end it’s all about (potential) earnings, i.e. the more loyal a customer is, the more money you expect to make and the higher his/her LTV, so “signal in customer loyalty” is probably a change in customer LTV. But a “signal” is a change that happens after some business event, or better said a campaign/offering, so why is there no such “opportunity” specified in the data? There must be one, and the only thing that makes sense is that the campaign is encoded in the so-called card_id (together with the feature_% variables)… so, knowing past customer behavior and the offered product/service, we are trying to predict the success of that campaign/offering, i.e. the change in customer loyalty represented by our target variable. The problem is that none of this should be so ambiguous, because business/problem understanding is crucial in designing a quality ML model (and validating it afterwards), and by doing it this way we are missing a lot. We’ll be doing feature engineering and modeling based on assumptions (most of them not verifiable) without a proper understanding of the problem, which is not a good way of approaching analytics projects.

With all this in mind, let’s try to get a better understanding of our data by doing some EDA. Let’s start by looking at the train dataset and, in particular, our target variable. As we can see, the target variable has a roughly normal distribution, apart from outliers on the far left (around -33). These are extreme outliers, but they are no fluke or data error, because they represent more than 1% of samples. Could such an extremely low loyalty score indicate customer churn? Well, there is nothing worse for customer loyalty than churn, so it could, which also means we can’t ignore or discard these samples. Correctly predicting these outliers is also very important for the overall accuracy of the model (and for business results and success in the Kaggle competition), so later we’ll see how we can handle situations like these.
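A quick way to check how heavy that left tail is could look like the sketch below. It is only an illustration on synthetic data: the sample size, the spread of the target and the -30 cutoff are all assumptions made up for the example, not values from the competition files.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the train set: a roughly normal target plus a
# small cluster of extreme values at -33 (all numbers here are made up).
rng = np.random.default_rng(0)
target = np.concatenate([rng.normal(0.0, 1.7, size=9_880), np.full(120, -33.0)])
train = pd.DataFrame({"target": target})

# Flag the extreme left tail; the -30 cutoff is an assumption for this sketch.
train["is_outlier"] = (train["target"] < -30).astype("int8")

outlier_share = train["is_outlier"].mean()
print(f"outliers: {train['is_outlier'].sum()} rows, {outlier_share:.2%} of the sample")
```

A flag like is_outlier can later be used for stratified validation splits, so that every fold sees a similar share of these extreme cases.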


Another thing worth checking is how the train-test split relates to the time component, i.e. the first_active_month feature. While modeling, you spend a significant amount of time looking at performance on the test set, so it’s very important that the test set is representative of the data the model is trained on. In our case, both train & test come from the same timeline, so we don’t need to worry much and can get creative with features.
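One simple way to run this check is to compare the share of cards per activation month in both sets. The sketch below uses a handful of made-up rows; the helper name month_shares is hypothetical, and only the first_active_month column name comes from the dataset.

```python
import pandas as pd

# Tiny made-up samples; in practice these would be the competition's
# train and test tables with their first_active_month column.
train = pd.DataFrame({"first_active_month": ["2016-01", "2016-01", "2017-03", "2017-09"]})
test = pd.DataFrame({"first_active_month": ["2016-01", "2017-03", "2017-03", "2017-09"]})

def month_shares(df: pd.DataFrame) -> pd.Series:
    """Share of cards per activation month, sorted chronologically."""
    months = pd.to_datetime(df["first_active_month"], format="%Y-%m")
    return months.value_counts(normalize=True).sort_index()

# Align both distributions on the month index; a month missing in one set gets 0.
comparison = pd.concat(
    {"train": month_shares(train), "test": month_shares(test)}, axis=1
).fillna(0.0)
comparison["abs_diff"] = (comparison["train"] - comparison["test"]).abs()
print(comparison)
```

Small differences across the whole range suggest a random split; a test set concentrated in the most recent months would instead call for time-based validation.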


As we can see, the train dataset is pretty basic, so we’ll need to join it with historical & new merchant transactions (card_id as foreign key) and also with merchant data (merchant_id as foreign key). We could continue exploring our dataset to gain a better understanding of the features, but this kind of exploration could take a while, especially considering we’ll need a lot more features than the original dataset provides, so I’ll skip it this time. Don’t get me wrong, data understanding is a crucial step in doing quality work; if you’re an in-house data scientist/ML professional you should really dig deep and get to know the data at your disposal. But in situations like these (e.g. Kaggle competitions), with serious time & resource constraints (and an anonymized dataset), it’s kinda more about large-scale feature engineering and then letting algorithms do the dirty work of selecting the relevant ones. In a bank/telco/insurance company, a manager will ask you why you are selecting specific customers for this and that, based on which criteria and so on, but here you don’t need to worry about model interpretability (and I’m not saying this is a good thing).

For those just embarking on the ML journey, “feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data”. It is the process of taking a dataset and constructing explanatory variables (features) that can be used to train a machine learning model for a prediction problem. Often, data is spread across multiple tables and must be gathered into a single table with observations in the rows and features in the columns. Feature engineering requires domain knowledge, data understanding and some creativity to extract more information from the original raw dataset. Though feature engineering takes a considerable share of the time in the overall ML pipeline, the benefits far outweigh the effort, and nowadays it is probably the most important part of the whole pipeline. With increased computing power, everyone can run even the most advanced ML algorithms, and tuning is made a lot easier by automated hyper-parameter search, so very often the true differentiators are better data and/or features.


Guys at MIT developed something called the Data Science Machine, an automated system for generating predictive models from raw data. A big part of the Data Science Machine was Deep Feature Synthesis, an algorithm that automatically generates features from relational datasets. In essence, the algorithm follows relationships in the data to a base field and then sequentially applies mathematical functions along that path to create the final feature. Out of all this emerged Featuretools, a free Python library for automated feature engineering, which we could take for a test drive.

The traditional approach to feature engineering is to build features one at a time using domain knowledge, which is not only a time-consuming but also an error-prone process. On top of that, it is problem-dependent, so the code can’t be reused and must be rewritten for each new dataset. Let’s start with an example of what manual feature engineering would amount to in our case. We have the train dataset with card_id and the target variable, and we also have historical/new transaction datasets with a lot of data on cardholder behavior before and after the reference date. Since we need a single table for training, feature engineering means consolidating all the information about each card_id in one table. This means we could aggregate all the features relating to cardholder behavior (e.g. min/max/mean/sum of purchase amount, frequency of purchases, mean number of installments, mean of authorized_flag, mode of city_id, last transaction date, time between transactions…) on card_id and join them to the train dataset. There is also merchant information that could be aggregated in this way and joined to the historical/new transactions datasets. Obviously, we could get very creative, which in the end results in a s*** load of manual data wrangling, so why not check out Featuretools and see if it can save us some time.
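To give a feel for the manual route, here is a minimal pandas sketch of the aggregate-and-join step described above. The rows are made up, and only a few of the listed aggregations are shown; the column names follow those mentioned in the text, while the output feature names are arbitrary choices for the example.

```python
import pandas as pd

# Made-up rows; column names follow those mentioned in the text.
train = pd.DataFrame({"card_id": ["C1", "C2"], "target": [0.4, -1.2]})
hist_trans = pd.DataFrame({
    "card_id": ["C1", "C1", "C1", "C2", "C2"],
    "purchase_amount": [10.0, 20.0, 30.0, 5.0, 15.0],
    "installments": [1, 2, 3, 1, 1],
    "authorized_flag": [1, 1, 0, 1, 1],
})

# One row per card: aggregate the behavioural columns on card_id.
card_features = hist_trans.groupby("card_id").agg(
    purchase_min=("purchase_amount", "min"),
    purchase_max=("purchase_amount", "max"),
    purchase_mean=("purchase_amount", "mean"),
    purchase_sum=("purchase_amount", "sum"),
    purchase_count=("purchase_amount", "count"),
    installments_mean=("installments", "mean"),
    authorized_rate=("authorized_flag", "mean"),
).reset_index()

# Left join keeps every card in train even if it has no transactions.
train_fe = train.merge(card_features, on="card_id", how="left")
print(train_fe)
```

Every additional feature means another line in that agg call (or another groupby entirely), which is exactly the repetitive work automated tools try to take over.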

In order for Featuretools to build new features automatically, it must understand the dataset we are working on, and the way to do that is to build an Entity Relationship (ER) model. Entity–relationship modeling was developed for database design a long time ago and the idea is pretty simple: a basic ER model is composed of entity types (which classify the things of interest) and specifies the relationships that can exist between entities (instances of those entity types). In our case we have “cards”, “historical transactions”, “new transactions” etc. In code it looks like this:
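Independently of any particular library, the ER idea itself can be sketched in plain pandas: a set of named tables plus a list of parent-child links on a shared key. The snippet below is such a sketch and deliberately does not use the Featuretools API; the table contents are made up and the helper check_relationship is hypothetical.

```python
import pandas as pd

# Made-up miniature tables; only the key columns matter for this sketch.
tables = {
    "cards": pd.DataFrame({"card_id": ["C1", "C2"]}),
    "merchants": pd.DataFrame({"merchant_id": ["M1", "M2"]}),
    "hist_trans": pd.DataFrame({
        "card_id": ["C1", "C1", "C2"],
        "merchant_id": ["M1", "M2", "M2"],
    }),
    "new_trans": pd.DataFrame({"card_id": ["C2"], "merchant_id": ["M1"]}),
}

# (parent table, child table, key): one parent row relates to many child rows.
relationships = [
    ("cards", "hist_trans", "card_id"),
    ("cards", "new_trans", "card_id"),
    ("merchants", "hist_trans", "merchant_id"),
    ("merchants", "new_trans", "merchant_id"),
]

def check_relationship(parent: str, child: str, key: str) -> bool:
    """True if the key is unique in the parent and every child key has a parent row."""
    parent_keys = tables[parent][key]
    return bool(parent_keys.is_unique and tables[child][key].isin(parent_keys).all())

for parent, child, key in relationships:
    print(f"{parent}.{key} -> {child}.{key}: valid={check_relationship(parent, child, key)}")

all_valid = all(check_relationship(*rel) for rel in relationships)
```

Checking key uniqueness and referential integrity up front is worthwhile, since duplicated parent keys would silently inflate any aggregation built on top of these links.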


After we have defined our dataset, we need to define which calculations will be performed during automated feature engineering. In Featuretools, calculations are defined via “primitives”, and there are two types: aggregations and transformations. More than 50 of them are already built into the package, and you can also add custom ones. Aggregation primitives calculate aggregate functions such as sums, counts, means…, while transform primitives perform other kinds of operations, like dividing numeric columns, calculating a logarithm, extracting day/month/year from a date column, calculating a percentile etc. This is pretty basic, so what if we want to include our domain-specific knowledge during feature engineering? Well, there are options for that too, called “seed features” and “interesting values”. By using seed features we can look for certain events or consider only values above/below certain thresholds. Also, sometimes we want to create features that are conditioned on a second value before we calculate; these are called “interesting values” of a variable, and the associated filters are “where” clauses.
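The difference between these building blocks is easy to see in plain pandas. The sketch below mimics one transform primitive, a few aggregation primitives and one “where” filter by hand on made-up rows; it illustrates the concepts only and is not how Featuretools implements them. Picking authorized_flag == 1 as the interesting value is an assumption for the example.

```python
import pandas as pd

trans = pd.DataFrame({
    "card_id": ["C1", "C1", "C1", "C2", "C2"],
    "purchase_date": pd.to_datetime(
        ["2017-01-05", "2017-01-20", "2017-02-11", "2017-01-09", "2017-03-02"]
    ),
    "purchase_amount": [10.0, 20.0, 30.0, 5.0, 15.0],
    "authorized_flag": [1, 1, 0, 1, 1],
})

# Transform primitive: a new column computed row by row in the same table.
trans["purchase_month"] = trans["purchase_date"].dt.month

# Aggregation primitives: many child rows collapse to one value per card.
agg = trans.groupby("card_id")["purchase_amount"].agg(["sum", "mean", "count"])

# "Where" filter: the same aggregation, restricted to one interesting value
# (here authorized_flag == 1, chosen only for illustration).
authorized_mean = (
    trans[trans["authorized_flag"] == 1]
    .groupby("card_id")["purchase_amount"]
    .mean()
    .rename("mean_amount_where_authorized")
)

features = agg.join(authorized_mean)
print(features)
```

Transform results stay at the transaction level, while aggregations and their filtered variants land on the card level, one row per card_id.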


However, the coolest thing about Featuretools or Deep Feature Synthesis is… Deep Features. The name Deep Feature Synthesis comes from the algorithm’s ability to stack primitives to generate more complex features. Each time we stack a primitive we increase the “depth” of a feature. Imagine that in our case we have calculated the sum of transaction amounts for each month; this would be depth 1. Now imagine we have calculated the mean of these sums; this would be depth 2. The max_depth parameter controls the maximum depth of the features returned by Featuretools. A word of caution: if there are a lot of features and primitives, things can quickly get out of hand, as deep features at level 2 can produce a huge number of permutations.
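The monthly-sum example above can be reproduced by hand to show what stacking means. This is a minimal pandas sketch on made-up rows, with a month column assumed to be already extracted from the transaction date.

```python
import pandas as pd

trans = pd.DataFrame({
    "card_id": ["C1", "C1", "C1", "C2", "C2"],
    "month": ["2017-01", "2017-01", "2017-02", "2017-01", "2017-03"],
    "purchase_amount": [10.0, 20.0, 30.0, 5.0, 15.0],
})

# Depth 1: one primitive, SUM of amounts per card and month.
monthly_sum = trans.groupby(["card_id", "month"])["purchase_amount"].sum()

# Depth 2: a second primitive stacked on the first, MEAN of the monthly sums.
mean_monthly_sum = monthly_sum.groupby(level="card_id").mean()
print(mean_monthly_sum)
```

With several aggregations available at each level, the number of possible stacks grows multiplicatively, which is why the feature count explodes so quickly as depth increases.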

Before running Deep Feature Synthesis I did some further preprocessing, in particular OHE on several features, so our entity set (multiple tables) has around 35 features (ignored variables excluded). I also did some dataframe column type conversions: int64 to int32, float64 to float32, and any objects that could be categorical into category variables, so that Featuretools runs as efficiently as possible. I also reduced the chunk size, so chunks of work are not too big and can fit in memory (look at me, how smart I am, didn’t have any problem whatsoever with MemoryError exceptions, lol). Why am I doing all this? Well, the cards entity has around 200 thousand rows, while hist_trans and new_trans have 29 million and 2 million rows, so imagine consolidating all this into the cards table, grouping and calculating all those primitives. I’m really pushing it to the limit. After all this tweaking we are ready for launch, and after a little less than 2 hours we have 1047 new features. Not bad, not bad at all considering the size of my dataset.
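The type conversions mentioned above can be sketched as follows. The frame is synthetic and the column names are only placeholders; the actual savings depend on the real data, so the printed figures are not meant as a benchmark.

```python
import numpy as np
import pandas as pd

n = 100_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "installments": rng.integers(0, 12, size=n, dtype=np.int64),
    "purchase_amount": rng.normal(size=n),               # float64 by default
    "category_3": rng.choice(["A", "B", "C"], size=n),   # strings, few distinct values
})

before = df.memory_usage(deep=True).sum()

# Narrow numeric types and turn the low-cardinality string column into a category.
small = df.astype({
    "installments": "int32",
    "purchase_amount": "float32",
    "category_3": "category",
})
after = small.memory_usage(deep=True).sum()
print(f"{before / 1e6:.1f} MB -> {after / 1e6:.1f} MB")
```

Two caveats apply: float32 keeps only about seven significant digits, and int32 overflows past roughly 2.1 billion, so IDs and large monetary sums are better left in wider types.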


What if we increase max_depth? Let’s be conservative and set it to 2… [presses CTRL+ENTER]… oops!


Hmm, thousand features is enough to begin with, wouldn’t you say so? 😊


I have to admit, for 5 minutes I was happy all of this worked, but soon after that I remembered that old one: be careful what you wish for… you might end up with 1047 features. I mean, 1047 features, wtf?! What am I going to do with a thousand features? Well, we’ll find out in the next post, when we’ll talk about the curse of dimensionality and feature selection, and give XGBoost a run.

Before I end this post, a few closing remarks about automated feature engineering. The technology is far from perfect and will certainly continue to evolve, but it already delivers significant gains in efficiency. As already mentioned, feature engineering is time-consuming and partial automation can be beneficial; however, it would be foolish to think it can completely replace the human factor. When it comes to feature engineering, a lot of the magic comes from domain-specific knowledge, human ingenuity and sometimes very innovative techniques. With an automated process it is a lot easier to explore a wide variety of possibilities and, by doing so, stumble upon useful features that would not have crossed our minds during a manual process. However, it is also very easy to create a lot of useless features that complicate everything further down the line, from feature selection to potential overfitting and loss of interpretability and understanding. As you can see, there are positives and negatives; automated FE shows a lot of promise, but I would definitely complement it with a human touch.