In vino veritas…

In vino veritas. One of many Latin proverbs I had to memorize in high school, along with really painful grammar and everything that came with it. Yeah, I studied Latin in high school, but I can't say it was time best spent. I've "wasted" a lot of educational effort while trying to figure out what to do in life. I mean, I went to a high school that focused on foreign languages and social sciences, but then decided to pursue an engineering degree (electrical engineering) and then acquired another degree in robotics. Obviously, this meant a career in engineering, right? But why not abandon it altogether and pursue a career in IT and ultimately in data science. So why am I bragging about my educational achievements? 🙂 Well, I need an intro for this post, but more importantly, there's something that connects Classical Latin to data science: a millennia-old disagreement about where the truth lies! Some say that truth lies in the wine, while others say that truth lies in the data. So, whom to trust? Today, we'll bridge these two worlds… by doing an analysis of wine reviews!

I've written about NLP before, e.g. here, but mostly on the level of theory, while exploring statistical properties of words in a Croatian language corpus. Today, however, we'll do a full-blown NLP exercise by trying to predict a wine's score from its review text. We could simplify it and say it's a form of sentiment analysis, but (un)fortunately, it's a lot more complicated than that. I mean, we are not dealing with "ughh, this sucked, so disappointed!" or "omg, this was amazing, loved it!", we are dealing with the poetic and nuanced world of wine tasting. Let me give you an example: this wine is "slightly reduced, offers a chalky, tannic backbone to an otherwise juicy explosion of rich black cherry, the whole accented throughout by firm oak and cigar box." So, what do you think? Above or below average wine, score above or below 90? Not trivial, right?

[Image: wine tasting]

Let's start in customary fashion by doing some exploratory analysis in order to better understand our data. There are 12 features in total, but my goal is to see how far I can get using just the description column to predict wine ratings (i.e. points), which I'll use as my target variable.

[Image: a peek at the dataset]
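For reference, a minimal way to load and peek at the data could look like the sketch below. The CSV filename is an assumption on my part (it's the usual name of the Kaggle wine reviews file), so point it at your own copy.

```python
import pandas as pd

# Hypothetical filename -- adjust to wherever you keep the Kaggle wine reviews dataset
df = pd.read_csv("winemag-data-130k-v2.csv")

print(df.shape)             # number of rows and columns
print(df.columns.tolist())  # available features

# keep only the columns we actually need and drop rows with missing values in them
df = df.dropna(subset=["description", "points"])
```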

The following picture shows the distribution of the points variable. As we can see, it's not a big point range; our sommeliers do not like drinking bad wines! Since we are already dealing with a particularly hard problem (predicting from text alone), we'll help ourselves and categorize the points variable, i.e. divide our wines into two groups: very good (<90 points) and excellent (>=90 points). This makes more sense because the evaluation of classification problems is much more intuitive; I mean, it's way cooler when you can look at a confusion matrix than at some MSE.

[Image: distribution of the points variable]
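The split itself is a one-liner; a minimal sketch, assuming the dataframe from the previous snippet:

```python
# 1 = "excellent" (>= 90 points), 0 = "very good" (< 90 points)
df["target"] = (df["points"] >= 90).astype(int)

print(df["target"].value_counts(normalize=True))  # check how balanced the two classes are
```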

Anyway, now that we are set up with the target variable, it's time to do some NLP. Our first step is to clean, tokenize, stem and remove stop words from our reviews. If you are not familiar with these terms, please refer to one of my previous posts; it will be worth your time, I promise. The short piece of code below does the job.

[Image: code for cleaning and stemming the reviews]
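The snippet above is shown as an image, so here is a rough equivalent with NLTK. The stemmer and the whitespace tokenization are my assumptions, not necessarily the exact choices from the original code.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

nltk.download("stopwords", quiet=True)

stop_words = set(stopwords.words("english"))
stemmer = SnowballStemmer("english")

def clean_stem_review(text):
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)            # keep letters only
    tokens = text.split()                            # simple whitespace tokenization
    tokens = [t for t in tokens if t not in stop_words]
    tokens = [stemmer.stem(t) for t in tokens]
    return " ".join(tokens)

df["clean_description"] = df["description"].apply(clean_stem_review)
```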

After cleaning, stemming and removing stop words, we are still left with more than 3 million words in our corpus of wine reviews and a little more than 21 thousand words in the vocabulary (unique words). So why are we performing stemming, i.e. reducing words to their common root, and removing stop words? Well, a word can have many inflectional forms, but the meaning is the same, and we are trying to reduce the dimensionality of the problem because our next step will be to vectorize the cleaned reviews. The same applies to stop words: words like "the", "is", "are" etc. are very common but usually don't carry any useful information. As we can see, although there are more than 21 thousand unique words in our reviews, the 2,000 most frequent ones account for most of the occurrences (>90% of the total).

[Image: cumulative percentage of word occurrences]
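The coverage number behind the chart above can be checked with a simple frequency count; something along these lines, assuming the cleaned reviews from the previous step:

```python
from collections import Counter

# count every token across all cleaned reviews
counts = Counter(token for review in df["clean_description"] for token in review.split())

total = sum(counts.values())
top_2000 = sum(c for _, c in counts.most_common(2000))

print(f"Vocabulary size: {len(counts)}")
print(f"Top 2,000 words cover {top_2000 / total:.1%} of all occurrences")
```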

This is the number we're going to pass to our vectorizer as the max_features parameter. CountVectorizer is one of the methods for converting text data into vectors, which we need because models can process only numerical data; basically, it just counts word frequencies. TfidfVectorizer is another popular option, and it works a little bit differently. Tf-idf is short for term frequency – inverse document frequency. It is a numerical statistic that increases proportionally to the number of times a word appears in a document and is offset by the number of documents in the corpus that contain the word, which helps adjust for the fact that some words appear more frequently in general and may not carry particularly useful information.

[Image: vectorization code]
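A minimal sklearn version of the vectorization step (whichever of the two vectorizers you pick, the call looks almost the same):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# limit the vocabulary to the 2,000 most frequent words, as discussed above
vectorizer = CountVectorizer(max_features=2000)
X_counts = vectorizer.fit_transform(df["clean_description"])  # sparse matrix: reviews x 2,000

# tf-idf alternative: down-weights words that appear in many documents
tfidf = TfidfVectorizer(max_features=2000)
X_tfidf = tfidf.fit_transform(df["clean_description"])

print(X_counts.shape)
```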

By vectorizing our reviews we get a 120,975 x 2,000 matrix, where rows represent reviews and columns represent occurrences of a particular word in a review. It's a huge matrix, around a quarter of a billion elements, and also extremely sparse, with more than 98% zeros. What a waste of space, right? Not only that, we'd need all the processing power in the world to crunch and process it. If only we could somehow reduce the dimensionality of our problem, i.e. reduce it to a smaller matrix with fewer but more important features. Well, there is an approach that can help us here. It's called PCA, or Principal Component Analysis.

The goal of PCA is to identify the most meaningful basis to re-express a dataset. The idea is to reduce or project a complex dataset to a lower dimension in order to reveal often hidden, simplified structures that underlie it. Mathematically speaking, it's a procedure that transforms (usually) correlated variables into a smaller number of uncorrelated variables called principal components. Basically, it can be considered a rotation of the axes of the original variable coordinate system to new orthogonal axes, such that the new axes coincide with the directions of maximum variation of the original observations. Let's look at the following picture.

[Image: PCA illustration, original data vs. data re-expressed in principal components]

The two graphs show the exact same data, but the right graph shows the original data transformed, or re-expressed, so that our axes are now the principal components. Principal components are orthogonal, i.e. perpendicular, to each other (this holds for any number of dimensions), which means that each component explains non-redundant information (the components are uncorrelated). As we can see from our data, the majority of variation is captured along the pc1 axis, and if we were to reduce this dataset to a single dimension, pc1 is the vector we would keep. This is just a 2D illustration of what happens in higher-dimensional space; in our case, it's 2 thousand dimensions.

Mathematically speaking, what we are doing is basically the following: first, we calculate the covariance matrix, which measures the degree to which each pair of variables in the dataset is linearly associated. If you still remember your linear algebra classes, the terms "eigenvalues" and "eigenvectors" might ring a bell. The second step is to calculate them from the previously obtained covariance matrix. The principal components we aim to obtain from the data are exactly these eigenvectors, while the eigenvalues tell us the proportion of variability explained in the direction of the corresponding eigenvector. The next step is to select the first K components (K<N, where N is the number of features or dimensions) that explain a proportion of variance we are happy with, and once we multiply the original dataset (matrix) by the selected principal components, we get a new, re-expressed dataset, one with fewer but more important features. All of this is done with a few simple lines of code…
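A single library call takes care of all of this, but just to make the recipe above concrete, here is a hand-rolled toy version in numpy. It is purely illustrative and not the code used for the wine reviews.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))           # toy data: 100 observations, 5 features
X = X - X.mean(axis=0)                  # center the data first

cov = np.cov(X, rowvar=False)           # 1) covariance matrix (5 x 5)
eigvals, eigvecs = np.linalg.eigh(cov)  # 2) eigenvalues and eigenvectors

order = np.argsort(eigvals)[::-1]       # sort components by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()     # proportion of variance per component
K = 2
X_reduced = X @ eigvecs[:, :K]          # 3) project onto the first K components

print(explained[:K], X_reduced.shape)
```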

[Image: PCA code]
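The original snippet is in the image above; a minimal sklearn equivalent might look like this. Note that sklearn's PCA wants a dense array, so converting the sparse matrix is an assumption about how it was handled (TruncatedSVD is the usual alternative if memory is tight).

```python
from sklearn.decomposition import PCA

# PCA needs a dense array; with ~120k x 2k this is big but still feasible in float32
X_dense = X_counts.toarray().astype("float32")

pca = PCA(n_components=500)
X_pca = pca.fit_transform(X_dense)

print(X_pca.shape)                          # (n_reviews, 500)
print(pca.explained_variance_ratio_.sum())  # variance captured by the 500 components
```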

In our case we've selected 500 components. As we can see, that is still a big number, meaning our features (word counts) are not really that great at explaining our problem domain. However, this was expected; as I said, it's the poetic world of wine tasting, not an easy domain for sentiment analysis. Nevertheless, we reduced our problem to a quarter of its size, making it more manageable for the ML algorithm.

[Image: cumulative explained variance]
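The curve above comes straight out of the fitted PCA object; a quick way to reproduce it, assuming the pca object from the sketch above:

```python
import numpy as np
import matplotlib.pyplot as plt

cumulative = np.cumsum(pca.explained_variance_ratio_)

plt.plot(cumulative)
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.show()
```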

As we can see, PCA is simple, straightforward and relatively computationally cheap. A remarkable feature of PCA is its plug & play nature: any dataset can be plugged in and an answer comes out, requiring no parameters to tweak and no regard for how the data was recorded. The limits of PCA are tightly coupled to its underlying assumptions, meaning the linear reduction limits the information that can be captured. Kernel PCA, an extension of the original algorithm, tackles this problem by making PCA work well for non-linear structures. Also, by doing PCA we "lose" the original features, thus making our model less interpretable. In my case (wine reviews) this wasn't an issue, but in many cases it might be a show stopper. It's worth mentioning that there are also other, more complex dimensionality reduction techniques, such as t-Distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP), but I'm not going to deal with them in this post.

Let's recap what we did so far… we cleaned, stemmed and tokenized our reviews, after which we vectorized them, thus "diverging" from one feature (the review text) to 2 thousand features, where every feature represents a particular word count. Then we applied PCA in order to reduce the dimensionality of our problem, thus "converging" to 500 new features which still capture most of the data's original information. Now that we've done all this, we are finally ready for modeling. I chose XGBoost as the algorithm for this occasion. I've written about it before, so if you're interested in learning more about this popular GBM algorithm, I suggest you visit my previous post. The code is more or less the same, XGBoost with 5-fold cross-validation, but what differs from my previous post is the objective function and the evaluation metric, considering that I'm now dealing with a classification problem.

Cross-validation is an approach you can use to estimate the performance of a machine learning algorithm with less variance than a single train-test split. It works by splitting the dataset into k parts (in our case k=5), where each split of the data is called a fold. The algorithm is trained on k-1 folds and tested on the held-back fold. This is repeated so that each fold of the dataset gets a chance to be the held-back test set. After running cross-validation, you end up with 5 different performance scores that you can summarize using a mean and a standard deviation. You also end up with 5 sets of OOF (out-of-fold) predictions, which, aggregated, give a prediction for every observation in your dataset, so you can calculate whatever evaluation metric you need.

[Image: k-fold cross-validation code]
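The loop in the image follows the usual pattern; here is a hedged sketch with xgboost's sklearn API, reusing the objects from the earlier snippets. The hyperparameters are placeholders, not the ones from the original run.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

y = df["target"].values
oof_preds = np.zeros(len(y), dtype=int)  # out-of-fold predictions

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, valid_idx in skf.split(X_pca, y):
    model = XGBClassifier(
        n_estimators=500,            # placeholder hyperparameters
        learning_rate=0.05,
        objective="binary:logistic",
        eval_metric="logloss",
    )
    model.fit(X_pca[train_idx], y[train_idx])
    oof_preds[valid_idx] = model.predict(X_pca[valid_idx])

print("OOF accuracy:", accuracy_score(y, oof_preds))
```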

After 9+ hours of model fitting (imagine how long it would have taken without PCA?!) we ended up with 81.2% accuracy and similar precision and recall. I would say these are pretty good results, since we limited ourselves to using only the review text for our prediction problem. If we were to use price, country of origin, winery, variety etc., we would easily boost our accuracy. I was even more impressed once I did further analysis of the misclassified instances (where the model predicted a wine score of >=90 points but it is really below, and where the model predicted <90 but it is actually above). Basically, my model was wrong around 18% of the time, but these misclassifications were not wildly off the mark: 84% of them fall within the 90 +/- 2 points range, meaning just around the cutoff mark (90 points divides the "very good" and "excellent" categories). Almost one third of the "mistakes" are exactly at 90 points, meaning the model predicted a "very good" wine (<90) that is actually an "excellent" one. Considering the difficulty of the problem, I would say these are pretty good results.

[Image: distribution of misclassified wines around the 90-point cutoff]
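The error analysis boils down to looking at the true points of the misclassified reviews; something along these lines, using the out-of-fold predictions from the sketch above:

```python
misclassified = df[oof_preds != y]  # reviews the model got wrong

near_cutoff = misclassified["points"].between(88, 92)  # within 90 +/- 2 points
print("Share of errors within 90 +/- 2 points:", near_cutoff.mean())
print("Share of errors exactly at 90 points:", (misclassified["points"] == 90).mean())
```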

I hope you enjoyed and learned something from this NLP exercise, where we demonstrated how to extract information and predict sentiment from textual data. We also talked a lot about PCA, a popular dimensionality reduction technique that is a simple but great tool in modern data analysis. Considering recent advancements in the NLP domain, this was a relatively simple exercise, so stay tuned: in the future we'll talk about Word2Vec, Transformers and other cool things!