
A comparison of different machine learning techniques for small ticket credit scoring

Grothmann, Ralph; Distl, Philipp; Kühner, Daniel

1 Introduction

For financial institutions, credit risk assessment is key to business success.
As the financial crisis has shown, it is also vital for the economy as a whole,
and the performance of prediction models therefore attracts close attention from
regulators worldwide. Although machine learning has advanced substantially in
recent years, most banks still work with traditional methods such as logistic
regression. Since data is expensive to obtain, most publications that applied
machine learning techniques to credit scoring have used datasets with fewer than
1,000 observations. A glance at Table 1 reveals another common shortcoming: the
share of defaults is mostly unrealistically high, which bypasses the problems
that accompany heavily imbalanced data. We address these two shortcomings by
using a real-world dataset from Siemens Bank GmbH with 183,081 observations and
a default rate of 1.39%, which is within the typical range for a bank's credit
portfolio. Our results clearly indicate that state-of-the-art techniques, and
boosting algorithms in particular, significantly outperform traditional risk
models. We first briefly introduce six different machine learning techniques,
then describe the data preprocessing and the experimental procedure, and finally
present our results and support our findings with statistical tests.

Grothmann, Ralph
Siemens Corporate Technology; Otto-Hahn-Ring 6, 81739 München; e-mail: ralph.grothmann@siemens.com
Distl, Philipp
Technical University of Munich; Boltzmannstraße 3, 85748 Garching; e-mail: philipp.distl@tum.de
Kühner, Daniel
Technical University of Munich; Boltzmannstraße 3, 85748 Garching; e-mail: daniel.kuehner@tum.de


Table 1 Comparison of the literature. Numbers refer to the mean across each study's datasets.


Publication              #datasets  #variables  #observations  default rate  classifiers
Tran et al. (2016)          2          17           845          42.75 %     ANN, LogR, RF, SVM
Zięba et al. (2016)         1          64         8,681           4.94 %     AdaB, LogR, RF, SVM, XGdB
Lessmann et al. (2015)      8          21        30,403          28.93 %     ANN, Boosting, LogR, SVM
Brown and Mues (2012)       5          30         2,581          30.00 %     ANN, DT, LogR, RF, XGdB
Marqués et al. (2012)       6          30         1,003          36.04 %     Bagging, Boosting
Wang et al. (2011)          3          17           643          41.10 %     ANN, DT, Ensembles, LogR, SVM
Twala (2010)                4          18           460          46.37 %     ANN, DT
Zhang et al. (2010)         2          17           845          42.75 %     DT
Alfaro et al. (2008)        1          16         1,180          50.00 %     AdaB, ANN
Yu et al. (2008)            2          12           356          47.66 %     ANN, LogR, SVM
Bastos (2007)               2          19           845          42.75 %     AdaB, ANN, SVM
West et al. (2005)          2          17           845          42.75 %     Bagging, Boosting
Huang et al. (2004)         2          21           169             -        ANN, LogR, SVM

2 Methods

Logistic regression

Logistic regression is traditionally one of the most widely used approaches for credit
scoring in banking. For example, SCHUFA, one of the major German credit bureaus,
uses logistic regression to generate its scores (see https://www.schufa.de/en/about-us/data-scoring/scoring/scoring-work-schufa/how_does_scoring_work_at_schufa.jsp).
Logistic regression can be performed in binary form or as multinomial logistic
regression if the desired output is a credit rating with multiple classes. In
essence, logistic regression is a linear regression on the logarithm of the
so-called odds, the ratio P/(1 − P), where P is the probability of an event:

\ln\frac{P}{1-P} = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k \qquad (1)

One problem with logistic regression is that, in the presence of multicollinearity,
parameter estimates may become highly unstable.
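As an illustration, such a PD model could be fitted as in the following sketch (Python with scikit-learn is our assumed toolset; the toy data and all variable names are purely illustrative):

# Minimal sketch: fitting a logistic regression PD model on illustrative toy data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))               # toy feature matrix
y = (rng.random(1000) < 0.05).astype(int)    # toy labels, roughly 5% defaults

model = LogisticRegression(max_iter=1000).fit(X, y)
pd_hat = model.predict_proba(X)[:, 1]        # estimated P(default) per observation
print(model.intercept_, model.coef_)         # beta_0 and beta_1..beta_k of Eq. (1)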

Neural networks

An artificial neural network is a machine learning algorithm inspired by the
structure of the brain and its ability to learn. By a standard feedforward neural
network we refer to a network with one hidden layer. The flow structure is simple:
the input is mapped to the neurons of the hidden layer and then to the output,
where an error is measured. This error is backpropagated to adjust the weights of
the edges so that the error converges towards a minimum.
Neural networks have seen a remarkable resurgence in the past decade. Especially
for image recognition, they are now widely considered the uncontested champions.
This resurgence was driven by developments in deep neural networks with many
hidden layers, which allow complex architectures to be designed.
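A standard feedforward network of this kind could be sketched as follows (scikit-learn again; the hidden layer size is an illustrative choice, not a tuned value):

# Sketch: feedforward network with a single hidden layer, trained by backpropagation.
from sklearn.neural_network import MLPClassifier

nn = MLPClassifier(hidden_layer_sizes=(32,),   # one hidden layer with 32 neurons
                   max_iter=500, random_state=0)
nn.fit(X, y)                                   # X, y as in the logistic regression sketch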

Decision tree

As the graphical representation of a set of sequentially applied rules, decision
trees offer a relatively high degree of interpretability. Crucial for the
construction of a decision tree is the splitting criterion used at each node. The
most commonly used criterion is the decrease in Shannon information entropy, which
can be seen as equivalent to increasing node purity. Decision trees constructed
with this splitting criterion are insensitive to multicollinearity (Piramuthu (2008)).
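For reference, the entropy criterion selects at each node the split that maximizes the information gain

H(S) = -\sum_{c} p_c \log_2 p_c, \qquad \mathrm{IG} = H(S) - \sum_{j} \frac{|S_j|}{|S|} H(S_j),

where $p_c$ is the share of class $c$ in node $S$ and the $S_j$ are the child nodes produced by the split.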

Random forest

A random forest is an ensemble technique which combines the outcomes of many
different decision trees. This stochastic regularization enhances robustness and
prevents overfitting, leading to increased performance.
There are essentially two approaches to ensure diversity amongst trees generated
by the same algorithm. One, called bagging (bootstrap aggregating), draws a
bootstrap sample for each tree from the original dataset (Breiman (1996)). The
other is to randomize the subset of variables each tree is constructed with.
Ho (2002) compared the two methods.
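A sketch combining both diversity mechanisms follows (note that scikit-learn, our assumed implementation, draws the random variable subset per split rather than per tree; all parameter values are illustrative):

# Sketch: random forest with bagging and random feature subsets.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500,     # number of trees
                            max_features="sqrt",  # random variable subset per split
                            bootstrap=True,       # bagging: bootstrap sample per tree
                            random_state=0)
rf.fit(X, y)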

Gradient Boosting

Gradient boosting is a meta concept based on an ensemble of other machine learning
techniques called "base learners". The goal is to find a classification algorithm
$F^* : \mathbb{R}^d \to [0, 1]$ which assigns a probability of default (PD) to a
given observation such that the expected value of a loss function $L(y, F(x))$ is
minimized: $F^* = \arg\min_F \mathbb{E}_{x,y}[L(y, F(x))]$, where the expected
value is estimated from the training data. The initial classifier is a constant,
$F_0 = \gamma$, usually the mean of the target. Next, the base classifier that
best explains the remaining residual is added; this step is then repeated:
F_m(x) = F_{m-1}(x) + \arg\min_{f \in \mathcal{H}} \sum_{i=1}^{n} L\left(y_i, F_{m-1}(x_i) + f(x_i)\right) \qquad (2)

The solution is approximated by performing a steepest descent step:


F_m(x) = F_{m-1}(x) - \alpha_m \sum_{i=1}^{n} \nabla_f L\left(y_i, F_{m-1}(x_i)\right) \qquad (3)

The step length $\alpha_m$ is optimized via line search.
In practice, decision trees have been used almost exclusively as base classifiers.
To counteract overfitting, the limit on the number of iterations or the complexity
of the base learners can be tightened, and the step length $\alpha_m$ can
additionally be shrunk by a factor $\eta \in (0, 1)$ called the learning rate.
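These regularization levers map directly onto implementation parameters; a sketch with scikit-learn's gradient boosting (all values illustrative):

# Sketch: gradient-boosted trees with the regularization levers named above.
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(n_estimators=300,   # limit on boosting iterations
                                max_depth=3,        # complexity of the base trees
                                learning_rate=0.1,  # eta in (0, 1)
                                random_state=0)
gb.fit(X, y)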

Adaptive Boosting

Adaptive Boosting, commonly abbreviated as "AdaBoost", has also attracted
considerable attention in recent years due to its successes. It was developed by
Yoav Freund and Robert Schapire (Freund and Schapire (1995)), who won the Gödel
Prize in 2003 for their work. AdaBoost is a special implementation of gradient
boosting which adds a weighting of observations, starting with equal weights
$w_{i,0} = 1/n$. In each iteration $t$, a base learner $f_t$ is chosen which
minimizes the weighted loss $\phi_t$, the sum of the weights of the observations
it misclassifies. The update $F_t = F_{t-1} + \alpha_t f_t$ includes this base
learner with a factor $\alpha_t = \frac{1}{2} \ln\left((1 - \phi_t)/\phi_t\right)$,
which grows as the loss $\phi_t$ of the base learner shrinks. The weights are then
updated according to $w_{i,t+1} = w_{i,t} \exp(-y_i \alpha_t f_t(x_i))$.
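A corresponding sketch (assuming scikit-learn 1.2 or later, where the base learner is passed as estimator; decision stumps are an illustrative choice):

# Sketch: AdaBoost with decision stumps; the weight update above is applied internally.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=300, random_state=0)
ada.fit(X, y)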

3 Experimental setup

The general research target is to compare the performance of several machine
learning techniques on real-world credit data. The comparison was run on a large
dataset of 183,081 observations obtained from Siemens Bank GmbH. Each observation
refers to one company in the UK, represented by 79 variables as of September 2006.
An additional 15 variables were created as ratios. The binary target variable
indicates whether a default occurred for the respective company within one year
after the snapshot; a default is defined precisely based on official, mandatory
notices to the authorities. The set contained 2,543 defaults, which corresponds
to a default rate of 1.39 %.

3.1 Evaluation criteria

The most common measure of a model's discriminatory power in credit rating tasks
is the area under the receiver operating characteristic (ROC) curve. The ROC curve
plots the true positive rate on the ordinate against the false positive rate on
the abscissa. With random guessing, the ROC curve is the diagonal and the area
under the curve (AUC) is 0.5; a perfect classifier reaches an AUC of 1. The AUC is
also the quality measure recommended by the Basel Committee on Banking Supervision
of the Bank for International Settlements (Basel Committee on Banking Supervision
(2005)). In an industry context, the Gini coefficient, a linear transformation of
the AUC with Gini = 2 · AUC − 1, is used more frequently.
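Computing both measures from predicted PDs is a one-liner each; a sketch, with variable names carried over from the earlier sketches:

# Sketch: AUC and Gini coefficient from predicted default probabilities.
from sklearn.metrics import roc_auc_score

auc = roc_auc_score(y, pd_hat)      # y, pd_hat as in the earlier sketches
gini = 2 * auc - 1                  # Gini = 2 * AUC - 1
print(f"AUC = {auc:.4f}, Gini = {gini:.4f}")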

3.2 Data preprocessing

We standardized numerical variables by subtracting the mean and dividing by the
standard deviation. Categorical variables were one-hot encoded, so a variable
with n different levels was replaced by (n − 1) dummy variables.
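A sketch of this preprocessing step (the column names are hypothetical; drop="first" yields the (n − 1) dummies described above):

# Sketch: z-score standardization plus one-hot encoding with one level dropped.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["revenue", "equity_ratio"]),        # hypothetical columns
    ("cat", OneHotEncoder(drop="first"), ["industry", "region"]),  # n-1 dummies per variable
])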
When training on heavily imbalanced data like ours, classifiers tend to classify
every observation as a member of the majority class. Two ways to deal with this
are random oversampling and the Synthetic Minority Over-sampling Technique, or
"SMOTE" for short. SMOTE creates new minority class observations on a
nearest-neighbor basis: a minority class observation obs_i is chosen at random
together with its k nearest neighbors within the set of minority class
observations. Out of these k elements, m are chosen at random, a connection line
is drawn between each of them and obs_i, and a new "synthetic" minority class
observation is created at a random point on each line.
Several papers suggest that the SMOTE approach is generally superior to plainly
copying observations (Chawla et al. (2002)), which our tests confirm. Performance
increased significantly when the minority class share was lifted to 10% with
SMOTE; raising the share further to 20% improved performance only very modestly.
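A sketch with the imbalanced-learn package (an assumption; the paper does not name its implementation). A 10% minority share corresponds to a minority-to-majority ratio of 1/9, and oversampling should be applied to the training partition only:

# Sketch: lifting the minority class share to roughly 10% with SMOTE.
from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy=1/9,      # minority/majority ratio -> ~10% minority share
              k_neighbors=5,              # k nearest minority-class neighbors
              random_state=0)
X_res, y_res = smote.fit_resample(X, y)   # apply to the training set only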

3.3 Data partitioning and experimental protocol

We split the total dataset into five disjoint parts {Part_1, ..., Part_5} of equal
size (36,616 observations each). Each of these five sets was then again split into
three disjoint sets used as training, validation and test set, respectively. The
splitting ratio was 40% for training (14,646 observations), 30% for validation
(10,985 observations) and 30% for test. All splits were performed randomly but
stratified, meaning that in each set the ratio of defaults was held constant at
1.39%. For each combination (Classifier_i, Part_k), we first trained the
classifier on the training set of Part_k with default parameter settings and then
iteratively altered the parameters by looping through a grid of parameter
combinations, each time measuring the model's performance on the validation set.
The best-performing parameter combination was then chosen for the final model,
whose performance was measured on the test set of Part_k.
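A sketch of this protocol for a single partition (stratified 40/30/30 split, grid search scored by validation AUC; the grid shown is illustrative, not the one used in the paper):

# Sketch: stratified train/validation/test split plus a simple grid search.
from sklearn.model_selection import train_test_split, ParameterGrid
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, train_size=0.4,
                                              stratify=y, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5,
                                            stratify=y_rest, random_state=0)

best_auc, best_params = -1.0, None
for params in ParameterGrid({"n_estimators": [100, 300], "max_depth": [5, None]}):
    clf = RandomForestClassifier(random_state=0, **params).fit(X_tr, y_tr)
    auc = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
    if auc > best_auc:
        best_auc, best_params = auc, params

final = RandomForestClassifier(random_state=0, **best_params).fit(X_tr, y_tr)
test_auc = roc_auc_score(y_te, final.predict_proba(X_te)[:, 1])  # reported figure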

4 Experimental results
Table 2 AUC for different learning techniques

Model                          Average   Std. Dev.
Log. Reg.                      0.65770   0.00615
Standard NN                    0.73378   0.00416
Decision Tree                  0.74236   0.00295
Deep NN Ensemble               0.76782   0.00336
Random Forest                  0.79406   0.00435
Extreme Gradient Boosting      0.79855   0.00569
Comb. XGBoost, Random Forest   0.81151   0.00254
Adaptive Boosting              0.81137   0.00218
Comb. AdaBoost, Random Forest  0.81234   0.00441
Comb. XGBoost, AdaBoost        0.82066   0.00075
Comb. AdaBoost, XGB, RF        0.82382   0.00146

The results of the experiments described in the section above are condensed in
Table 2. The standard deviation between the tests on the five partitions was
rather low, so we conclude that each partition was large enough to be
representative of the total population. Techniques based on an ensemble of
decision trees - random forests, Extreme Gradient Boosting and AdaBoost - can
confidently be regarded as the champions for this task. Also notable is the
performance leap from a single decision tree to these ensemble techniques,
derived from stochastic regularization. Comparing the "intelligent" way in which
AdaBoost adds decision trees to the final classifier with the simpler method of
random forests, we observe an increase of 2.18% in AUC.
We also combined the outputs of our best three classifiers (random forest,
Extreme Gradient Boosting and AdaBoost) by simple averaging. Interestingly, this
simple step further lifted performance.
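The combination is literally an average of the predicted PDs; a sketch, reusing the fitted models from the earlier sketches:

# Sketch: combining classifiers by averaging their predicted default probabilities.
import numpy as np

pd_comb = np.mean([rf.predict_proba(X)[:, 1],
                   gb.predict_proba(X)[:, 1],
                   ada.predict_proba(X)[:, 1]], axis=0)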

Statistical significance tests

The central question is whether the differences in performance are random or not.
Demšar compared the suitability of different tests in a general setting where
various machine learning algorithms are compared, and he clearly advocates the
Friedman test (Demšar (2006)), a non-parametric ANOVA (analysis of variance)
which examines differences in the central tendency between three or more paired
samples. By partitioning into disjoint sets we ensured that results are
independent. The Friedman test resulted in a p-value of 4.298e-07. The null
hypothesis that there is no difference apart from an effect of the random data
samples can therefore be rejected at a confidence level of 99.99996%.

Table 3 Post-hoc Friedman-Nemenyi test: p-values against the best classifier (AdaB+XGB+RF)

Classifier      p-value
LogReg          9.9e-05
Standard NN     0.00091
Decision Tree   0.00643
Deep NN Ens.    0.03450
Random Forest   0.21368
XGB             0.26215
XGB+RF          0.94108
AdaBoost        0.94108
AdaB+RF         0.94108
AdaB+XGB        0.99999
The Nemenyi test is a post-hoc test which may be applied after the null
hypothesis of the Friedman test has been rejected. It examines pairwise whether
the performance of two methods (i.e., two classifiers in our case) differs
significantly (Nemenyi (1962)); the null hypothesis is the same as for the
Friedman test. Amongst others, it allows us to reject the hypothesis that the
best classifier (AdaBoost+XGB+RF) is no better than the deep neural network
ensemble, the decision tree, the standard neural network or logistic regression,
at a 5% significance level. However, it does not allow us to confidently exclude
that the performance differences within the top group are due to chance.
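Both tests are available off the shelf; a sketch assuming SciPy and the third-party scikit-posthocs package, with the AUC results arranged as a partitions-by-classifiers array (toy numbers here):

# Sketch: Friedman test across classifiers, followed by the Nemenyi post-hoc test.
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

aucs = np.random.default_rng(0).uniform(0.6, 0.85, size=(5, 4))  # toy (partition x classifier) AUCs
stat, p = friedmanchisquare(*aucs.T)           # one sample of five AUCs per classifier
nemenyi_p = sp.posthoc_nemenyi_friedman(aucs)  # pairwise p-values between classifiers
print(p)
print(nemenyi_p)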

5 Conclusion and further research

We used heavily imbalanced real-world data with 183,081 observations and 94
variables to assess the performance of seven different machine learning
techniques. Our tests yield a clear picture: ensembles of decision trees in
general, and boosting techniques like Extreme Gradient Boosting or AdaBoost in
particular, perform best on this kind of data. Combining the best techniques can
further regularize the model and enhance its robustness and performance. The main
takeaway for practitioners is that switching from traditional techniques like
logistic regression to state-of-the-art algorithms is likely to yield significant
improvements - in our case a 25% relative improvement in AUC.
Of course, our dataset refers to a snapshot of a single point in time (September
2006) and a single country (UK). The importance of certain variables is likely to
change when working with another dataset. However, we are confident that
structural similarities are sufficient to substantiate the top performance of
boosting techniques. Another area of research could be to analyze the importance
of each variable and their correlations and interrelations. This could then be
used to draw connections between purely statistical machine learning models and
classical economic theory on corporate finance and credit.

References

Alfaro, E., García, N., Gámez, M., Elizondo, D., 2008. Bankruptcy forecasting: An empirical comparison of AdaBoost and neural networks. Decision Support Systems 45 (1), 110–122.
Basel Committee on Banking Supervision, 2005. Studies on the validation of internal rating systems. Bank for International Settlements, Working Paper (14).
Bastos, J., 2007. Credit scoring with boosted decision trees.
Breiman, L., 1996. Bagging predictors. Machine Learning 24 (2), 123–140.
Brown, I., Mues, C., 2012. An experimental comparison of classification algorithms for imbalanced credit scoring data sets. Expert Systems with Applications 39 (3), 3446–3453.
Chawla, N. V., Bowyer, K. W., Hall, L. O., Kegelmeyer, W. P., 2002. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, 321–357.
Demšar, J., 2006. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7 (Jan), 1–30.
Freund, Y., Schapire, R. E., 1995. A decision-theoretic generalization of on-line learning and an application to boosting. In: European Conference on Computational Learning Theory. Springer, pp. 23–37.
Ho, T. K., 2002. A data complexity analysis of comparative advantages of decision forest constructors. Pattern Analysis & Applications 5 (2), 102–112.
Huang, Z., Chen, H., Hsu, C.-J., Chen, W.-H., Wu, S., 2004. Credit rating analysis with support vector machines and neural networks: A market comparative study. Decision Support Systems 37 (4), 543–558.
Lessmann, S., Baesens, B., Seow, H.-V., Thomas, L. C., 2015. Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research. European Journal of Operational Research 247 (1), 124–136.
Marqués, A., García, V., Sánchez, J. S., 2012. Exploring the behaviour of base classifiers in credit scoring ensembles. Expert Systems with Applications 39 (11), 10244–10250.
Nemenyi, P., 1962. Distribution-free multiple comparisons. Biometrics 18, 263.
Piramuthu, S., 2008. Input data for decision trees. Expert Systems with Applications 34 (2), 1220–1226.
Tran, K., Duong, T., Ho, Q., 2016. Credit scoring model: A combination of genetic programming and deep learning. In: Future Technologies Conference (FTC). IEEE, pp. 145–149.
Twala, B., 2010. Multiple classifier application to credit risk assessment. Expert Systems with Applications 37 (4), 3326–3336.
Wang, G., Hao, J., Ma, J., Jiang, H., 2011. A comparative assessment of ensemble learning for credit scoring. Expert Systems with Applications 38 (1), 223–230.
West, D., Dellana, S., Qian, J., 2005. Neural network ensemble strategies for financial decision applications. Computers & Operations Research 32 (10), 2543–2559.
Yu, L., Wang, S., Lai, K. K., 2008. Credit risk assessment with a multistage neural network ensemble learning approach. Expert Systems with Applications 34 (2), 1434–1444.
Zhang, D., Zhou, X., Leung, S. C., Zheng, J., 2010. Vertical bagging decision trees model for credit scoring. Expert Systems with Applications 37 (12), 7838–7843.
Zięba, M., Tomczak, S. K., Tomczak, J. M., 2016. Ensemble boosted trees with synthetic features generation in application to bankruptcy prediction. Expert Systems with Applications 58, 93–101.
