Comparing the performance of Simple Tree, Gradient Boosting Tree, and Random Forest Classifiers in analyzing the quality of white wine

Getting Started

First, let's import the stuff we'll need: pandas, the classifier modules from scikit-learn, and a few utility functions for sample selection and performance evaluation.

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix

Next, we'll load the dataset that we're working with. Today, we'll be using the white wine component of the wine quality dataset from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Wine+Quality

In [2]:
df = pd.read_csv('winequality-white.csv',sep=';',quotechar='"')

What can we learn about this dataset? It contains eleven input features plus a target variable called "quality", which ranges from 3 to 9 in this data. The median and 75th percentile are both 6, so we can already tell that the target is unbalanced. Indeed, we see that scores of 5 and 6 make up more than two-thirds of the dataset.

In [3]:
df.columns
Out[3]:
Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')
In [4]:
df['quality'].unique()
Out[4]:
array([6, 5, 7, 8, 4, 3, 9], dtype=int64)
In [5]:
df['quality'].describe()
Out[5]:
count    4898.000000
mean        5.877909
std         0.885639
min         3.000000
25%         5.000000
50%         6.000000
75%         6.000000
max         9.000000
Name: quality, dtype: float64
In [6]:
df['quality'].value_counts()
Out[6]:
6    2198
5    1457
7     880
8     175
4     163
3      20
9       5
Name: quality, dtype: int64

Data Processing and Preparation

Let's turn that unbalanced ordinal target into an unbalanced binary class by thresholding the "quality" feature. We want to build a classifier for the best of the best wines, so it's fine that the class is unbalanced. After all, this isn't Lake Wobegon, where every wine is above average...

In [7]:
def isTasty(quality):
    if quality >= 7:
        return 1
    else:
        return 0
In [8]:
df['tasty'] = df['quality'].apply(isTasty)
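
For what it's worth, the same column can be created with a vectorized one-liner; this is just an alternative that produces the same result:

df['tasty'] = (df['quality'] >= 7).astype(int)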
In [9]:
df.columns
Out[9]:
Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality', 'tasty'],
      dtype='object')
In [10]:
df['tasty'].value_counts()
Out[10]:
0    3838
1    1060
Name: tasty, dtype: int64

Next, we'll create training and testing subsets that we'll use to train and evaluate our classifiers. It is best practice to train and test your classifiers on different data. Here we'll hold out one-third of the original population for testing, and the other two-thirds will be used for training the classifiers. Note that we can specify a random_state seed in order to get the same split for the same input data if we want to replicate this experiment later.

In [11]:
data = df[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar','chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']]
target = df['tasty']
In [12]:
data_train, data_test, target_train, target_test = train_test_split(data,target,test_size = 0.33,random_state=123)
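
Since the target is unbalanced, it can also be worth stratifying the split so that both subsets keep the same class proportions. A minimal variant (not what we use below, where the plain split is kept):

data_train, data_test, target_train, target_test = train_test_split(
    data, target, test_size=0.33, random_state=123, stratify=target)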

Did the splitting process produce subsets with the shapes we expect? Yes it did!

In [13]:
[subset.shape for subset in [data_train,data_test,target_train,target_test]]
Out[13]:
[(3281, 11), (1617, 11), (3281,), (1617,)]

Training our Classifiers

Now we'll use our training and testing subsets to train and compare some classifiers. Today we'll be using a simple decision tree, a gradient boosting classifier, and a random forest classifier, all from scikit-learn.

For all of the classifiers we'll be using today, we'll hold the maximum tree depth at 5. This limits over-fitting of individual trees, and also removes tree depth as a factor when comparing the classifiers' performance.
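
If you'd rather not take depth 5 on faith, a quick cross-validation sweep over a few candidate depths can sanity-check the choice. A minimal sketch (the depth grid here is just an illustrative assumption):

from sklearn.model_selection import cross_val_score

# Mean cross-validated accuracy on the training set for a few candidate depths
for depth in [3, 5, 7, 10]:
    scores = cross_val_score(DecisionTreeClassifier(max_depth=depth),
                             data_train, target_train, cv=5)
    print('max_depth={}: mean CV accuracy {:.3f}'.format(depth, scores.mean()))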

In [14]:
simpleTree = DecisionTreeClassifier(max_depth=5)
In [15]:
simpleTree.fit(data_train,target_train)
Out[15]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
In [16]:
gbmTree = GradientBoostingClassifier(max_depth=5)
In [17]:
gbmTree.fit(data_train,target_train)
Out[17]:
GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=5,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=100, presort='auto', random_state=None,
              subsample=1.0, verbose=0, warm_start=False)
In [18]:
rfTree = RandomForestClassifier(max_depth=5)
In [19]:
rfTree.fit(data_train,target_train)
Out[19]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

Evaluating Classifier Performance

Now that the classifiers are trained, let's evaluate their performance. We can call the precision_recall_fscore_support function from sklearn.metrics to return, well... the precision, recall, F-score, and support measures for each classifier, comparing the true test labels with each classifier's predictions on the test data.

In [20]:
simpleTreePerformance = precision_recall_fscore_support(target_test,simpleTree.predict(data_test))
In [21]:
gbmTreePerformance = precision_recall_fscore_support(target_test,gbmTree.predict(data_test))
In [22]:
rfTreePerformance = precision_recall_fscore_support(target_test,rfTree.predict(data_test))

The object returned by the precision_recall_fscore_support() function is not the prettiest... but we'll solve that later. As you can see, the function returns the precision, recall, F-score, and support for each class. Looking at the support values, we can see that the class-wise composition of the test population (about 20% tasty) differs slightly from the dataset as a whole (about 22% tasty). However, the difference is small, so it will not unduly affect the performance metrics.

In [23]:
simpleTreePerformance
Out[23]:
(array([ 0.86759327,  0.564     ]),
 array([ 0.91583012,  0.4378882 ]),
 array([ 0.89105935,  0.49300699]),
 array([1295,  322], dtype=int64))
In [24]:
gbmTreePerformance
Out[24]:
(array([ 0.8996337 ,  0.73412698]),
 array([ 0.94826255,  0.57453416]),
 array([ 0.92330827,  0.6445993 ]),
 array([1295,  322], dtype=int64))
In [25]:
rfTreePerformance
Out[25]:
(array([ 0.85110803,  0.61849711]),
 array([ 0.94903475,  0.33229814]),
 array([ 0.89740781,  0.43232323]),
 array([1295,  322], dtype=int64))

The labeled printout below is a bit easier to read. We're interested in predicting a positive (1) value for "tasty", so we're mostly concerned with each classifier's performance on that class, which is the second number in each array for each metric (i.e., the simple tree achieved 0.564 precision for the positive class, and the random forest achieved 0.332 recall for the positive class).

With each of these metrics, we're looking for a value as close to one (1) as possible. We can see that the gradient boosted (GBM) tree generally outperforms the others in correctly classifying tasty wines: it achieved both the highest precision and the highest recall for the positive class. That being the case, it is clear that the GBM tree classifier was the strongest performer in the cohort.

In [26]:
print('Precision, Recall, Fscore, and Support for each class in simple, gradient boosted, and random forest tree classifiers:'+'\n')
for treeMethod in [simpleTreePerformance,gbmTreePerformance,rfTreePerformance]:
    print('Precision: ',treeMethod[0])
    print('Recall: ',treeMethod[1])
    print('Fscore: ',treeMethod[2])
    print('Support: ',treeMethod[3],'\n')
Precision, Recall, Fscore, and Support for each class in simple, gradient boosted, and random forest tree classifiers:

Precision:  [ 0.86759327  0.564     ]
Recall:  [ 0.91583012  0.4378882 ]
Fscore:  [ 0.89105935  0.49300699]
Support:  [1295  322] 

Precision:  [ 0.8996337   0.73412698]
Recall:  [ 0.94826255  0.57453416]
Fscore:  [ 0.92330827  0.6445993 ]
Support:  [1295  322] 

Precision:  [ 0.85110803  0.61849711]
Recall:  [ 0.94903475  0.33229814]
Fscore:  [ 0.89740781  0.43232323]
Support:  [1295  322] 
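
By the way, if you'd rather not format these metrics by hand, scikit-learn's classification_report produces a similar per-class summary. A minimal sketch:

from sklearn.metrics import classification_report

for name, model in [('Simple Tree', simpleTree),
                    ('Gradient Boosted', gbmTree),
                    ('Random Forest', rfTree)]:
    print(name)
    print(classification_report(target_test, model.predict(data_test),
                                target_names=['not tasty', 'tasty']))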

Here's another way to conceptualize classifier performance: confusion matrices. Each row counts the actual values and each column the predicted values. For example, for the simple tree, we see that out of the testing sample, 1186 un-tasty wines were correctly classified, and 141 tasty wines were correctly classified. However, there were also 181 tasty wines that were erroneously classified as un-tasty, and 109 un-tasty wines that were classified as tasty. Gross!

In [27]:
print('Confusion Matrix for simple, gradient boosted, and random forest tree classifiers:')
print('Simple Tree:\n',confusion_matrix(target_test,simpleTree.predict(data_test)),'\n')
print('Gradient Boosted:\n',confusion_matrix(target_test,gbmTree.predict(data_test)),'\n')
print('Random Forest:\n',confusion_matrix(target_test,rfTree.predict(data_test)))
Confusion Matrix for simple, gradient boosted, and random forest tree classifiers:
Simple Tree:
 [[1186  109]
 [ 181  141]] 

Gradient Boosted:
 [[1228   67]
 [ 137  185]] 

Random Forest:
 [[1229   66]
 [ 215  107]]
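
If the bare arrays are hard to parse, wrapping one in a labeled DataFrame makes the actual/predicted axes explicit. A minimal sketch for the GBM tree (the row and column labels are just illustrative):

cm = confusion_matrix(target_test, gbmTree.predict(data_test))
print(pd.DataFrame(cm,
                   index=['actual: not tasty', 'actual: tasty'],
                   columns=['predicted: not tasty', 'predicted: tasty']))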

So, now that we know that the GBM tree is our favored classifier for predicting the tastiness of wines, that raises the question: what makes a tasty wine? Luckily, GBM trees produce interpretable feature importances, so we can inspect the feature_importances_ attribute of the GBM tree object and find out which features play the largest role in predicting tastiness.

One important thing to note is that not all machine learning methods produce results that are this easy to interpret. A random forest, for example, is an ensemble of many decision trees, so no single tree tells you which features drove a given prediction (although scikit-learn does aggregate per-feature importances for random forests as well). Neural networks and support vector machines (SVMs) are harder still to interpret. That's not to say, however, that it cannot be done: Local Interpretable Model-agnostic Explanations (LIME) can produce per-prediction explanations for any classifier: https://arxiv.org/pdf/1602.04938v1.pdf
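
As an aside, here is a rough, hypothetical sketch of what a per-prediction LIME explanation might look like for one wine from our test set. It assumes the third-party lime package is installed; treat the exact calls as an approximation of that library's tabular API rather than something verified here:

# Hypothetical sketch; assumes `pip install lime`
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(data_train.values,
                                 feature_names=list(data.columns),
                                 class_names=['not tasty', 'tasty'],
                                 mode='classification')

# Explain the GBM's prediction for a single test wine
explanation = explainer.explain_instance(data_test.values[0],
                                         gbmTree.predict_proba,
                                         num_features=5)
print(explanation.as_list())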

In [28]:
gbmTree.feature_importances_
Out[28]:
array([ 0.06244312,  0.09792831,  0.06384643,  0.07786289,  0.08666165,
        0.08620248,  0.09452112,  0.12985839,  0.08189272,  0.08317855,
        0.13560434])

That's a little hard to read; we can do better!

Indeed, we see that alcohol content and density play the two largest roles in the decision of the GBM classifier, followed by volatile acidity and total sulfur dioxide. On the other hand, fixed acidity, citric acid, and residual sugar appear to be less important factors in determining the tastiness of wine. Perhaps if we were looking at red wines, this observation might be different!

In [29]:
print('Feature Importances for GBM tree\n')
for importance,feature in zip(gbmTree.feature_importances_,['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar','chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']):
    print('{}: {}'.format(feature,importance))
Feature Importances for GBM tree

fixed acidity: 0.062443117444274894
volatile acidity: 0.0979283052153955
citric acid: 0.06384643262654617
residual sugar: 0.07786289245391594
chlorides: 0.0866616456242658
free sulfur dioxide: 0.08620248393448902
total sulfur dioxide: 0.09452112190619916
density: 0.12985838652978277
pH: 0.08189272440970403
sulphates: 0.08317855241664664
alcohol: 0.13560433743878014
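
For a quicker read, the importances can also be dropped into a pandas Series and sorted; this is just a convenience, not a new result:

importances = pd.Series(gbmTree.feature_importances_, index=data.columns)
print(importances.sort_values(ascending=False))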