1 Introduction
For financial institutions, credit risk assessment is key to business success. As the
financial crisis has shown, it is also vital for the economy as a whole, and the
performance of prediction models therefore attracts close attention from regulators
worldwide. Although machine learning has advanced substantially in recent years,
most banks still work with traditional methods such as logistic regression. Since
data is expensive to obtain, most publications that applied machine learning
techniques to credit scoring have used datasets with fewer than 1,000 observations.
A glance at Table 1 reveals another common shortcoming: the share of defaults is
mostly unrealistically high, which sidesteps the problems that accompany heavily
imbalanced data. We address these two shortcomings by using a real-world dataset
from Siemens Bank GmbH with 183,081 observations and a default rate of 1.39%,
which is within the typical range for a bank's credit portfolio. Our results clearly
indicate that state-of-the-art techniques, in particular boosting algorithms,
significantly outperform traditional risk models. We first briefly introduce six
different machine learning techniques, then describe the data preprocessing and the
experimental procedure, and finally present our results and support our findings
with statistical tests.
Grothmann, Ralph
Siemens Corporate Technology; Otto-Hahn-Ring 6, 81739 München; e-mail: ralph.grothmann@siemens.com
Distl, Philipp
Technical University of Munich; Boltzmannstraße 3, 85748 Garching; e-mail: philipp.distl@tum.de
Kühner, Daniel
Technical University of Munich; Boltzmannstraße 3, 85748 Garching; e-mail: daniel.kuehner@tum.de
2 Methods
Logistic regression
Logistic regression is traditionally one of the most widely used approaches for credit
scoring in banking. For example, SCHUFA, one of the major German credit bureaus,
uses logistic regression to generate its scores. Logistic regression can be performed
as binary or multinomial logistic regression, the latter when the desired output is a
credit rating with multiple classes. In essence, logistic regression is a linear
regression on the logarithm of the so-called odds, the ratio P/(1 − P), where P is the
probability of an event.
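The log-odds relationship can be sketched in a few lines (a minimal illustration with hypothetical coefficients, not the actual scoring model):

```python
import math

def log_odds(p):
    """Logit: the logarithm of the odds P / (1 - P)."""
    return math.log(p / (1 - p))

def predict_default_probability(features, weights, bias):
    """Logistic regression: a linear model on the log-odds,
    mapped back to a probability by the logistic function."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical example: two standardized financial ratios.
p = predict_default_probability([0.5, -1.2], weights=[0.8, -0.3], bias=-4.0)
```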
Neural networks
For image recognition, neural networks are now widely considered the uncontested
champions. This emergence was driven by developments in deep neural networks
with many hidden layers, which allow the design of complex architectures.
Decision tree
Random forest
Gradient Boosting
Adaptive Boosting
3 Experimental setup
The general research target is to compare the performance of several machine learn-
ing techniques on real-world credit data. The comparison was run on a large dataset
containing 183,081 observations, obtained from the Siemens Bank GmbH. Each of
the 183,081 observations refers to one company in the UK, which is represented
through 79 variables as of September 2006. An additional 15 variables were created
as ratios. The binary target variable indicates whether a default occurred for the
respective company within one year after the snapshot. A default is defined precisely based on
official, mandatory notices to authorities. The set contained 2,543 defaults, which
corresponds to a default rate of 1.39 %.
The most common measure of a model’s discriminatory power in credit rating tasks
is the area under the receiver operating characteristic curve (ROC curve). The ROC
curve plots the true positive rate on the ordinate versus the false positive rate on the
abscissa. With random guessing, the ROC curve would be a diagonal line and the
area under the curve (AUC) would be 0.5. A perfect classifier would reach an AUC
of 1. The AUC is also the quality measure recommended by the Basel Committee on
Banking Supervision of the Bank for International Settlements (Basel Committee
on Banking Supervision (2005)). In an industry context, the Gini coefficient is used
more frequently; it is simply a linear transformation of the AUC: Gini = 2 · AUC − 1.
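The AUC can be computed directly from pairwise score comparisons (equivalent to the Mann-Whitney U statistic); a minimal O(n²) sketch for illustration, not an efficient implementation:

```python
def auc(scores, labels):
    """AUC as the probability that a randomly chosen positive
    (default) is scored higher than a randomly chosen negative;
    ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gini(scores, labels):
    """Gini coefficient as a linear transformation of the AUC."""
    return 2 * auc(scores, labels) - 1
```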
We standardized numerical values by subtracting the mean and dividing by the
standard deviation. For categorical variables one-hot encoding was applied, so a
variable with n different levels was replaced by (n − 1) dummy variables.
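The two preprocessing steps can be sketched as follows (a simplified illustration operating on single columns):

```python
def standardize(values):
    """Standardize a numerical column: subtract the mean,
    divide by the (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

def one_hot(values):
    """Encode a categorical column with n levels into n - 1
    dummy variables (the first level is the reference)."""
    levels = sorted(set(values))
    return [[1 if v == lvl else 0 for lvl in levels[1:]] for v in values]
```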
When training on heavily imbalanced data such as ours, classifiers tend to categorize
every observation as a member of the majority class. Two ways to deal with this
are random oversampling and the Synthetic Minority Over-sampling Technique, in
short "SMOTE". SMOTE creates new minority class observations on a nearest-neighbor
basis: a minority class observation obs_i is chosen randomly together with its
k nearest neighbors in the set of minority class observations. Out of these k
elements, m elements are randomly chosen, and for each of them a connection line
is drawn to obs_i, on which a new "synthetic" minority class observation is created
at a random point.
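The interpolation step can be sketched in a few lines (a minimal SMOTE sketch for illustration; a production implementation would use an efficient nearest-neighbor search):

```python
import random

def smote(minority, k=5, n_new=1):
    """Minimal SMOTE sketch: pick a random minority observation,
    pick one of its k nearest minority neighbors, and create a
    synthetic point at a random position on the line segment
    between the two."""
    synthetic = []
    for _ in range(n_new):
        obs = random.choice(minority)
        neighbors = sorted(
            (o for o in minority if o is not obs),
            key=lambda o: sum((a - b) ** 2 for a, b in zip(obs, o)),
        )[:k]
        nb = random.choice(neighbors)
        t = random.random()
        synthetic.append([a + t * (b - a) for a, b in zip(obs, nb)])
    return synthetic
```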
Several papers suggest that the SMOTE approach is generally superior to plainly
copying observations (Chawla et al. (2002)), and our tests confirm this. Performance
increased significantly when the share of minority class observations was lifted
to 10% via SMOTE; raising the share further to 20% improved performance only
very modestly.
We split the total dataset into five disjoint parts {Part_1, ..., Part_5} of the same
size (36,616 observations). Each of these five sets was then again split into three
disjoint sets used as training, validation and test set, respectively. The splitting
ratio was 40% for training (14,646 observations), 30% for validation (10,985
observations) and 30% for test. All splits were performed randomly but stratified,
meaning that in each set the ratio of defaults was held constant at 1.39%. For each
combination (Classifier_i, Part_k), we first trained the classifier model on the
training set of Part_k with the default parameter settings and then iteratively
altered the parameters by looping through a grid of parameter combinations, each
time measuring the model's performance on the validation set. The best-performing
parameter combination was then chosen for the final model, whose performance was
measured on the test set of Part_k.
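The partitioning and grid search described above can be outlined as follows (a simplified sketch with a hypothetical classifier interface via `fit` and `score` callables, not the exact experimental code):

```python
import random

def stratified_split(labels, ratios=(0.4, 0.3, 0.3), seed=0):
    """Randomly split row indices into train/validation/test while
    keeping the default rate constant in each part (stratification)."""
    rng = random.Random(seed)
    splits = [[] for _ in ratios]
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        start = 0
        for j, r in enumerate(ratios):
            end = len(idx) if j == len(ratios) - 1 else start + int(r * len(idx))
            splits[j].extend(idx[start:end])
            start = end
    return splits  # lists of row indices

def grid_search(train, valid, param_grid, fit, score):
    """Try every parameter combination, keep the one with the
    best validation score."""
    best_params, best_score = None, float("-inf")
    for params in param_grid:
        model = fit(train, params)
        s = score(model, valid)
        if s > best_score:
            best_params, best_score = params, s
    return best_params
```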
4 Experimental results
Table 2 AUC for different learning techniques

Model                             Average    Std. Dev.
Log. Reg.                         0.65770    0.00615
Standard NN                       0.73378    0.00416
Decision Tree                     0.74236    0.00295
Deep NN Ensemble                  0.76782    0.00336
Random Forest                     0.79406    0.00435
Extreme Gradient Boosting         0.79855    0.00569
Comb. XGBoost, Random Forest      0.81151    0.00254
Adaptive Boosting                 0.81137    0.00218
Comb. AdaBoost, Random Forest     0.81234    0.00441
Comb. XGBoost, AdaBoost           0.82066    0.00075
Comb. AdaBoost, XGB, RF           0.82382    0.00146

The results from the experiments described in the section above are condensed in
Table 2. The standard deviation between the tests on each of the five partitions was
rather low, so we conclude that the size of each dataset was large enough to be
representative of the total population. Techniques based on an ensemble of decision
trees - random forests, Extreme Gradient Boosting and AdaBoost - can confidently
be regarded as the champions for this task.
Also notable is the performance leap between a single decision tree and these
ensemble techniques, which derives from stochastic regularization. Comparing the
"intelligent" way in which AdaBoost adds decision trees to the final classifier with
the simpler method of random forests, we notice an increase of 2.18% in the AUC.
We also combined the outputs of our best three classifiers (random forest, Extreme
Gradient Boosting and AdaBoost) by simply averaging their predictions. Interestingly,
this simple step lifted the performance even further.
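The combination step described above can be sketched in one line (a minimal illustration; each inner list holds one classifier's predicted default probabilities):

```python
def average_ensemble(score_lists):
    """Combine classifiers by averaging their predicted
    default probabilities observation-wise."""
    return [sum(scores) / len(scores) for scores in zip(*score_lists)]
```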
5 Conclusion
We used heavily imbalanced real-world data with 183,081 observations and 94
variables to assess the performance of seven different machine learning techniques.
Our tests yield a clear picture: ensembles of decision trees in general - and boosting
techniques like Extreme Gradient Boosting or AdaBoost in particular - perform best
on this kind of data. Combinations of the best techniques can further regularize the
model and enhance its robustness and performance. The main takeaway for
practitioners is that switching from traditional techniques like logistic regression
to state-of-the-art algorithms is likely to yield significant improvements - in our
case 25% in AUC.
Of course, our dataset refers to a snapshot of one single point in time (Sep. 2006)
and one single country (UK). The importance of certain variables is likely to change
when working with another dataset. However, we are confident that the structural
similarities are sufficient to substantiate the top performance of boosting techniques.
Another avenue of research could be to analyze the importance of each variable and
their correlations and interrelations. This could then be used to draw connections
between purely statistical machine learning models and classical economic theory
about corporate finance and credit / debt.
References
Alfaro, E., García, N., Gámez, M., Elizondo, D., 2008. Bankruptcy forecasting: An
empirical comparison of adaboost and neural networks. Decision Support Sys-
tems 45 (1), 110–122.
Bastos, J., 2007. Credit scoring with boosted decision trees.
Breiman, L., 1996. Bagging predictors. Machine learning 24 (2), 123–140.
Brown, I., Mues, C., 2012. An experimental comparison of classification algorithms
for imbalanced credit scoring data sets. Expert Systems with Applications 39 (3),
3446–3453.
Chawla, N. V., Bowyer, K. W., Hall, L. O., Kegelmeyer, W. P., 2002. Smote: syn-
thetic minority over-sampling technique. Journal of artificial intelligence research
16, 321–357.
Demšar, J., 2006. Statistical comparisons of classifiers over multiple data sets. Jour-
nal of Machine learning research 7 (Jan), 1–30.
Freund, Y., Schapire, R. E., 1995. A desicion-theoretic generalization of on-line
learning and an application to boosting. In: European conference on computa-
tional learning theory. Springer, pp. 23–37.
Ho, T. K., 2002. A data complexity analysis of comparative advantages of decision
forest constructors. Pattern Analysis & Applications 5 (2), 102–112.
Huang, Z., Chen, H., Hsu, C.-J., Chen, W.-H., Wu, S., 2004. Credit rating analysis
with support vector machines and neural networks: a market comparative study.
Decision support systems 37 (4), 543–558.
Lessmann, S., Baesens, B., Seow, H.-V., Thomas, L. C., 2015. Benchmarking state-
of-the-art classification algorithms for credit scoring: An update of research. Eu-
ropean Journal of Operational Research 247 (1), 124–136.
Marqués, A., García, V., Sánchez, J. S., 2012. Exploring the behaviour of base
classifiers in credit scoring ensembles. Expert Systems with Applications 39 (11),
10244–10250.
Nemenyi, P., 1962. Distribution-free multiple comparisons. Biometrics 18, 263.
Basel Committee on Banking Supervision, 2005. Studies on the validation of internal
rating systems. Bank for International Settlements, Working Paper 14.
Piramuthu, S., 2008. Input data for decision trees. Expert Systems with applications
34 (2), 1220–1226.
Tran, K., Duong, T., Ho, Q., 2016. Credit scoring model: A combination of genetic
programming and deep learning. In: Future Technologies Conference (FTC).
IEEE, pp. 145–149.
Twala, B., 2010. Multiple classifier application to credit risk assessment. Expert
Systems with Applications 37 (4), 3326–3336.
Wang, G., Hao, J., Ma, J., Jiang, H., 2011. A comparative assessment of ensemble
learning for credit scoring. Expert systems with applications 38 (1), 223–230.
West, D., Dellana, S., Qian, J., 2005. Neural network ensemble strategies for finan-
cial decision applications. Computers & operations research 32 (10), 2543–2559.
Yu, L., Wang, S., Lai, K. K., 2008. Credit risk assessment with a multistage neural
network ensemble learning approach. Expert systems with applications 34 (2),
1434–1444.
Zhang, D., Zhou, X., Leung, S. C., Zheng, J., 2010. Vertical bagging decision trees
model for credit scoring. Expert Systems with Applications 37 (12), 7838–7843.
Zięba, M., Tomczak, S. K., Tomczak, J. M., 2016. Ensemble boosted trees with
synthetic features generation in application to bankruptcy prediction. Expert Sys-
tems with Applications 58, 93–101.