Imbalanced Classification


Classification is a supervised task for categorizing observation given a set of variables. The category for each observation is given before. This is similiar to regression but the response variable is now discrete. The most used case is a binary classification but a multi-case classification is possible too. A possible problem in classification is the case of an imbalanced dataset, where the number of cases are not equally distributed. To counter this problem a number of sampling methods can be used. Down-sampling selects observations from the majority class, so that the number are equally to the minority case. In Up-Sampling this is reversed. SMOTE-Sampling is a combination of up and down sampling where in addition synthetic observations are created. In this problem a random forest is used for classification and different sampling methods are applied and compared to each other.

library(randomForest)
library(caret)
library(pROC)
library(DMwR)
library(doParallel)
library(GGally)
library(ggfortify)
library(doParallel)
wine_red=read.csv("winequality-red.csv",header=T,sep=";")
wine_red$quality=factor(wine_red$quality,levels=c(3,4,5,6,7,8))

The classical wine dataset is used which can be dowloaded here. The dataset contains data about 1599 observations. Different variables describing the composition of the wine are given and a quality variable, which ranges from 3 to 8. The target is to predict the quality for each wine.

wine_red$quality=as.numeric(paste(wine_red$quality))
wine_red$binary=ifelse(wine_red$quality<=6,"G","B")
table(wine_red$binary)/length(wine_red$binary)
## 
##         B         G 
## 0.1357098 0.8642902

All observations with a quality value smaller or equal to 6 are classfied as G and otherwise B. This is done to reduce the problem of a multiclass-classification into a binary classification. The same methods for dealing with unbalanced datasets can be applied to multiclass cases as well. But for easiness a binary case is used.

wine_red$binary=factor(wine_red$binary)
set.seed(2508)
trainingRow=createDataPartition(wine_red$binary,p=0.7,list=F)
trainingSet=wine_red[trainingRow,]
testSet=wine_red[-trainingRow,]

The dataset is partitioned into a trainings and a test set. This can be easily done with createDataPartition from the caret package. 70% of the dataset is used as a training set and the rest as a test set.

tr_controll=trainControl(classProbs = T
                         ,method="repeatedcv"
                         ,savePredictions="all"
                         ,summaryFunction=defaultSummary
                         ,repeats=5
                         ,allowParallel=T)

cl <- makePSOCKcluster(detectCores()-1)
registerDoParallel(cl)
model0=train(binary~fixed.acidity+volatile.acidity+citric.acid+residual.sugar+alcohol+sulphates
             ,data=trainingSet,method="rf",metric="Kappa"
             ,maximize=F
             ,tuneLength=10
             ,ntree=1000
             ,trControl=tr_controll)
## note: only 5 unique complexity parameters in default grid. Truncating the grid to 5 .
stopCluster(cl)
roc0=roc(testSet$binary,
    predict(model0, testSet, type = "prob")[,1],
    levels = rev(levels(as.factor(testSet$binary))))

TrainControl is one of the main function besides train and allows for different settings applied for the model estimation. In this case a reapeated cross-validation is used and all the predictions are saved. The number of repetitions is set to 5 and parallel computation is used. Repeated cross-validation is a very computing intense and parallel computing is needed for computing the models in an acceptable time limit. DetectCores-1 is set this way to allow one core for the OS. The rest is used for the computation. The number of clusters are created with makePSOCKcluster.

The main function of caret is train. The method argument specifies, which model should be used. The metric argument is used for selecting the criterion for the optimal model. The Kappa, or Cohens Kappa, is useful for classification in case of unbalanced datasets. Kappa takes into account the random chance. Accuracy is not very useful for classifying unbalanced dataset. Even a simple binary model where the number of cases is predicted by using just the base rate achieves a high accuracy. The tuneLength is the number of variables used at each split in a decision tree. The number of trees fitted for a random forest can be set by ntree. To stop the clusters from working once the compuation is done, stopCluster termiantes the clsuters.

For classification a ROC-Curve is useful as a graphical tool for plotting the quality of a binary classification. On the x-axis the specifity is plotted and on the y-axis the sensitivity. A good model should have a ROC-Curve near the top left.

Confusion_threshold=confusionMatrix(predict(model0,testSet),reference=testSet$binary)
Confusion_threshold
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   G
##          B  36  15
##          G  29 399
##                                           
##                Accuracy : 0.9081          
##                  95% CI : (0.8786, 0.9325)
##     No Information Rate : 0.8643          
##     P-Value [Acc > NIR] : 0.002123        
##                                           
##                   Kappa : 0.5693          
##  Mcnemar's Test P-Value : 0.050016        
##                                           
##             Sensitivity : 0.55385         
##             Specificity : 0.96377         
##          Pos Pred Value : 0.70588         
##          Neg Pred Value : 0.93224         
##              Prevalence : 0.13570         
##          Detection Rate : 0.07516         
##    Detection Prevalence : 0.10647         
##       Balanced Accuracy : 0.75881         
##                                           
##        'Positive' Class : B               
## 

The test set is now used to see assess the perfomance of the model. All the information given below are derived from the confusion matrix. The positive class is “B”. The sensitiviy is here 0.5538 (36/(36+29)) The Kappa value is between 0 and 1, where 1 is a perfect model while 0 is a useless model.

tr_controll=trainControl(classProbs = T
                         ,method="repeatedcv"
                         ,sampling="down"
                         ,savePredictions="all"
                         ,summaryFunction=defaultSummary
                         ,repeats=5
                         ,allowParallel=T)

cl <- makePSOCKcluster(detectCores()-1)
registerDoParallel(cl)
model_down=train(binary~fixed.acidity+volatile.acidity+citric.acid+residual.sugar+alcohol+sulphates
             ,data=trainingSet,method="rf",metric="Kappa"
             ,maximize=F
             ,tuneLength=10
             ,ntree=1000
             ,trControl=tr_controll)
## note: only 5 unique complexity parameters in default grid. Truncating the grid to 5 .
stopCluster(cl)
roc_down=roc(testSet$binary,
         predict(model_down, testSet, type = "prob")[,1],
         levels = rev(levels(as.factor(testSet$binary))))

The only difference in model is the sampling. Down-Sampling is now used to calculate the model.

Confusion_threshold_down=confusionMatrix(predict(model_down,testSet),reference=testSet$binary)
Confusion_threshold_down
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   G
##          B  51  84
##          G  14 330
##                                           
##                Accuracy : 0.7954          
##                  95% CI : (0.7565, 0.8307)
##     No Information Rate : 0.8643          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4001          
##  Mcnemar's Test P-Value : 3.168e-12       
##                                           
##             Sensitivity : 0.7846          
##             Specificity : 0.7971          
##          Pos Pred Value : 0.3778          
##          Neg Pred Value : 0.9593          
##              Prevalence : 0.1357          
##          Detection Rate : 0.1065          
##    Detection Prevalence : 0.2818          
##       Balanced Accuracy : 0.7909          
##                                           
##        'Positive' Class : B               
## 

The Kappa Value for down sampling is now much lower, the same holds for the accuracy. The sensitiviy is now higher as before. The models trades specifity for sensitivity. For all models this trade-off holds. The ROC-Curve for all the models estimated will be displayed later to compare the different models.

tr_controll=trainControl(classProbs = T
                         ,method="repeatedcv"
                         ,sampling="up"
                         ,savePredictions="all"
                         ,summaryFunction=defaultSummary
                         ,repeats=5
                         ,allowParallel=T)

cl <- makePSOCKcluster(detectCores()-1)
registerDoParallel(cl)
model_up=train(binary~fixed.acidity+volatile.acidity+citric.acid+residual.sugar+alcohol+sulphates
                 ,data=trainingSet,method="rf",metric="Kappa"
                 ,maximize=F
                 ,tuneLength=10
                 ,ntree=1000
                 ,trControl=tr_controll)
## note: only 5 unique complexity parameters in default grid. Truncating the grid to 5 .
stopCluster(cl)
roc_up=roc(testSet$binary,
             predict(model_up, testSet, type = "prob")[,1],
             levels = rev(levels(as.factor(testSet$binary))))
Confusion_threshold_up=confusionMatrix(predict(model_up,testSet),reference=testSet$binary)
Confusion_threshold_up
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   G
##          B  41  19
##          G  24 395
##                                          
##                Accuracy : 0.9102         
##                  95% CI : (0.881, 0.9343)
##     No Information Rate : 0.8643         
##     P-Value [Acc > NIR] : 0.001314       
##                                          
##                   Kappa : 0.6045         
##  Mcnemar's Test P-Value : 0.541866       
##                                          
##             Sensitivity : 0.63077        
##             Specificity : 0.95411        
##          Pos Pred Value : 0.68333        
##          Neg Pred Value : 0.94272        
##              Prevalence : 0.13570        
##          Detection Rate : 0.08559        
##    Detection Prevalence : 0.12526        
##       Balanced Accuracy : 0.79244        
##                                          
##        'Positive' Class : B              
## 
tr_controll=trainControl(classProbs = T
                         ,method="repeatedcv"
                         ,sampling="smote"
                         ,savePredictions="all"
                         ,summaryFunction=defaultSummary
                         ,repeats=5
                         ,allowParallel=T)

cl <- makePSOCKcluster(detectCores()-1)
registerDoParallel(cl)
model_smote=train(binary~fixed.acidity+volatile.acidity+citric.acid+residual.sugar+alcohol+sulphates
             ,data=trainingSet,method="rf",metric="Kappa"
             ,maximize=F
             ,tuneLength=10
             ,ntree=1000
             ,trControl=tr_controll)
## note: only 5 unique complexity parameters in default grid. Truncating the grid to 5 .
stopCluster(cl)
roc_smote=roc(testSet$binary,
         predict(model_smote, testSet, type = "prob")[,1],
         levels = rev(levels(as.factor(testSet$binary))))
Confusion_threshold=confusionMatrix(predict(model_smote,testSet),reference=testSet$binary)
Confusion_threshold
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   G
##          B  46  47
##          G  19 367
##                                           
##                Accuracy : 0.8622          
##                  95% CI : (0.8281, 0.8918)
##     No Information Rate : 0.8643          
##     P-Value [Acc > NIR] : 0.585362        
##                                           
##                   Kappa : 0.5029          
##  Mcnemar's Test P-Value : 0.000889        
##                                           
##             Sensitivity : 0.70769         
##             Specificity : 0.88647         
##          Pos Pred Value : 0.49462         
##          Neg Pred Value : 0.95078         
##              Prevalence : 0.13570         
##          Detection Rate : 0.09603         
##    Detection Prevalence : 0.19415         
##       Balanced Accuracy : 0.79708         
##                                           
##        'Positive' Class : B               
## 
ggroc(list(Up=roc_up,Down=roc_down,Roc=roc0,SMOTE=roc_smote))+geom_line(data=data.frame(x=seq(1,0,by=-0.01),y=seq(0,1,by=0.01)),aes(x=x,y=y),inherit.aes=F)+
  ggtitle("ROC-Curve with different Sampling Methods")
model_list=resamples(list(Up=model_up,Smote=model_smote,Down=model_down,No=model0))
model_diff=diff(model_list)
dotplot(model_diff)