For this study, the German credit scoring data, available in the caret package in R, was analyzed. This dataset consists of 1,000 individuals: 700 with good credit and 300 with bad credit. Six models were tested to see how accurately they could predict whether a customer had good or bad credit based on the other 61 variables in the dataset. While all of the models performed similarly, random forests and gradient boosted machines were among the top performers on both the training data (area under the ROC curve = 0.78) and the holdout data (area under the ROC curve = 0.75). We also consider using a classification cutoff different from 0.5, since the loss from taking on a bad customer who does not repay a loan can greatly outweigh the profits from good customers.
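As a quick check of these counts, the data can be loaded directly (a minimal sketch; GermanCredit is the name of this dataset in caret):

```r
# Load the German credit data that ships with caret
library(caret)
data(GermanCredit)

dim(GermanCredit)          # 1000 rows, 62 columns (Class + 61 predictors)
table(GermanCredit$Class)  # 300 Bad, 700 Good
```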
When looking at the important variables across the six models, the one that stood out with the strongest importance was the customer being overdrawn (having a checking account balance below zero). Other variables that appeared important across many of the models were the loan interest rate, loan duration, and whether or not the customer had a checking account. The amount and purpose of the loan also appeared important in some models.
Six models were used to predict credit status based on the predictors in the data. The two linear models were linear discriminant analysis (LDA) and partial least squares (PLS). The four non-linear models were support vector machines (SVM), neural networks (NNET), gradient boosted machines (GBM), and random forests (RF).
To test each model, the data was split into a 70% training set and a 30% test set. Five-fold cross-validation on the training set was used to select tuning parameters. Predictors found to have near-zero variance were removed. In addition, each variable was centered and scaled to a mean of 0 and a standard deviation of 1, which eliminates scale differences across the variables. Finally, a Yeo-Johnson transformation was applied to make the data more normal.
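A sketch of this setup using caret's utilities (the seed is arbitrary and the exact call order is our assumption; the centering, scaling, and Yeo-Johnson steps are passed to each fit via caret's preProcess argument, shown in the next sketch):

```r
set.seed(42)  # arbitrary seed for reproducibility

# 70/30 split, stratified on the class variable
in_train  <- createDataPartition(GermanCredit$Class, p = 0.7, list = FALSE)
train_dat <- GermanCredit[in_train, ]
test_dat  <- GermanCredit[-in_train, ]

# Drop predictors with near-zero variance
pred_cols <- setdiff(names(train_dat), "Class")
nzv <- nearZeroVar(train_dat[, pred_cols])
if (length(nzv) > 0)
  train_dat <- train_dat[, !(names(train_dat) %in% pred_cols[nzv])]

# 5-fold cross-validation, selecting tuning parameters by ROC AUC
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE, summaryFunction = twoClassSummary)
```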
Models were fit using the caret package in R.
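A sketch of the six fits using caret's standard method names ("svmRadial" is our assumption for the SVM kernel, and verbosity flags are omitted for brevity):

```r
methods <- c(LDA = "lda", PLS = "pls", SVM = "svmRadial",
             NNET = "nnet", GBM = "gbm", RF = "rf")

# Fit each model with the same resampling scheme and preprocessing
fits <- lapply(methods, function(m)
  train(Class ~ ., data = train_dat, method = m,
        metric = "ROC", trControl = ctrl,
        preProcess = c("center", "scale", "YeoJohnson")))
```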
After the best tuning parameters were selected for each model, the area under the ROC curve (AUC) was computed for each model. The SVM, RF, and GBM models tied for the strongest value at 0.78; however, all of the models performed similarly.
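With caret, the cross-validated AUCs can be compared side by side (a sketch reusing the fits list above):

```r
# Resampled ROC AUC for each model
res <- resamples(fits)
summary(res)$statistics$ROC  # mean AUC ~0.78 for SVM, RF, and GBM
```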
Variable importance is a measure that indicates how influential each predictor is in determining the model's results. The most important variable is always given a score of 100, and the remaining variables are assigned scores between 0 and 100 relative to it. For each model we tested, we produced a variable importance ranking to tell us which variables contributed the most to prediction in that model. To combine the results across models, we considered each variable's average rank, the frequency with which it appeared in the top 10 across models, and its average importance score. These dimensions are displayed in the plot below.
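caret's varImp produces exactly this 0-100 scaling; a sketch of extracting each model's top ten:

```r
# Scaled importance: the top variable gets 100, the rest are relative to it
top10 <- lapply(fits, function(f) {
  imp <- varImp(f, scale = TRUE)$importance
  head(rownames(imp)[order(imp[, 1], decreasing = TRUE)], 10)
})
top10$RF  # the ten most influential predictors in the random forest
```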
Some of the factors that contributed to each model's predictions were checking account status (overdrawn or not), installment rate percentage, whether the individual had a checking account, and loan duration. These appeared in the top ten in most, if not all, of the models. In 5 of the 6 models tested, having a checking account balance below zero was a strong influence in predicting good or bad credit. This suggests that a customer being overdrawn in their checking account is predictive of credit status. Specifically, of the 274 people who were overdrawn, 135 (nearly 50%) had bad credit, compared with the overall bad-credit rate of 30% for this data.
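That rate is easy to verify from the raw data; the dummy-variable name below follows caret's coding of checking account status:

```r
# Overdrawn (checking balance < 0) vs. credit class
with(GermanCredit,
     table(Overdrawn = CheckingAccountStatus.lt.0 == 1, Class))
# Among the 274 overdrawn customers, 135 (~49%) have bad credit,
# versus 30% bad credit overall
```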
Other variables that were important for some models were the amount of the loan, the purpose of the loan (e.g., business, furniture, television), and whether the individual was deemed a skilled employee. This suggests that a person's background and motives also influence whether or not they are likely to repay the loan, but they do not play as large a role as the variables mentioned above.
The following plot shows the results of running the models on the test set. The models again performed similarly, with RF, GBM, and PLS tying for the strongest AUC at 0.75.
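A sketch of the holdout evaluation with the pROC package (probabilities for the "Good" class, matching the class levels in GermanCredit):

```r
library(pROC)

# ROC AUC on the 30% holdout for each model
test_aucs <- sapply(fits, function(f) {
  probs <- predict(f, newdata = test_dat, type = "prob")
  as.numeric(auc(roc(test_dat$Class, probs$Good,
                     levels = c("Bad", "Good"), direction = "<")))
})
round(test_aucs, 2)  # RF, GBM, and PLS around 0.75
```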
The chart below displays the predicted probability each model gave each individual of having good credit, along with whether or not they actually did. Each model would require a different cutoff to reach a desired level of sensitivity and/or positive predictive value (PPV). Note that sensitivity is the probability that the model correctly identifies an individual with good credit, and PPV is the probability that an individual the model predicts as having good credit actually does. Both measures are determined by the cutoff used to convert the models' predicted probabilities into good/bad calls.
For example, the green line at 0.7 shows a possible cutoff for the PLS model that maximizes PPV, since almost all of the individuals above this line actually have good credit. However, a lower cutoff would be needed to achieve sensitivity similar to the other models, since at 0.7 many of the individuals who actually have good credit would be incorrectly classified as having bad credit.
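Both measures can be computed at any cutoff with caret's helpers; a sketch at the illustrative 0.7 cutoff for the PLS model:

```r
cutoff <- 0.7  # illustrative cutoff from the plot
probs  <- predict(fits$PLS, newdata = test_dat, type = "prob")$Good
pred   <- factor(ifelse(probs >= cutoff, "Good", "Bad"),
                 levels = c("Bad", "Good"))

sensitivity(pred, test_dat$Class, positive = "Good")   # P(predict Good | truly Good)
posPredValue(pred, test_dat$Class, positive = "Good")  # P(truly Good | predict Good)
```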
While it may not be possible to find a cutoff that eliminates all customers with bad credit, using a cutoff different from 0.5 allows us to tune the classification performance to the needs of the business problem. In this case, since misclassifying an individual with bad credit can be very costly, it is important to make the PPV as high as possible.
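One simple way to choose such a cutoff is to sweep a grid and keep the one with the highest PPV subject to a minimum acceptable sensitivity (a sketch reusing probs from above; the 0.6 floor is illustrative):

```r
cutoffs <- seq(0.3, 0.9, by = 0.01)
metrics <- t(sapply(cutoffs, function(ct) {
  pred <- factor(ifelse(probs >= ct, "Good", "Bad"), levels = c("Bad", "Good"))
  c(cutoff = ct,
    sens   = sensitivity(pred, test_dat$Class, positive = "Good"),
    ppv    = posPredValue(pred, test_dat$Class, positive = "Good"))
}))

# Highest-PPV cutoff among those keeping sensitivity at or above 0.6
ok <- metrics[metrics[, "sens"] >= 0.6, , drop = FALSE]
ok[which.max(ok[, "ppv"]), ]
```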