We want to build a model to help Thera Bank, which has more liability customers than asset customers, identify the customers with a higher probability of purchasing a personal loan. The model is built on last year's campaign data, which covers 5,000 customers and had a 9.6% success rate.
ID                  Customer ID
Age                 Customer's age in years
Experience          Years of professional experience
Income              Annual income of the customer ($000)
ZIP Code            Home address ZIP code
Family              Family size of the customer
CCAvg               Average spending on credit cards per month ($000)
Education           Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
Mortgage            Value of house mortgage, if any ($000)
Personal Loan       Did this customer accept the personal loan offered in the last campaign?
Securities Account  Does the customer have a securities account with the bank?
CD Account          Does the customer have a certificate of deposit (CD) account with the bank?
Online              Does the customer use internet banking facilities?
CreditCard          Does the customer use a credit card issued by the bank?
Structure of data (str(data)):
ID                    int 1 2 3 4 5 6 7 8 9 10 ...
Age..in.years.        int 25 45 39 35 35 37 53 50 35 34 ...
Experience..in.years. int 1 19 15 9 8 13 27 24 10 9 ...
Income..in.K.month.   int 49 34 11 100 45 29 72 22 81 180 ...
ZIP.Code              int 91107 90089 94720 94112 91330 92121 91711 93943 90089 93023 ...
Family.members        int 4 3 1 1 4 4 2 1 3 1 ...
CCAvg                 num 1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
Education             int 1 1 1 2 2 2 2 3 2 3 ...
Mortgage              int 0 0 0 0 0 155 0 0 104 0 ...
Personal.Loan         int 0 0 0 0 0 0 0 0 0 1 ...
Securities.Account    int 1 1 0 0 0 0 0 0 0 0 ...
CD.Account            int 0 0 0 0 0 0 0 0 0 0 ...
Online                int 0 0 0 0 0 1 1 0 1 0 ...
CreditCard            int 0 0 0 0 1 0 0 1 0 0 ...
Here Personal.Loan is the dependent variable and all other attributes are independent variables.
The data includes demographic information about the customer (Age, Experience, Income, ZIP code, family members, Education) that characterises customer behaviour, so we take these columns into consideration.
Columns such as Mortgage, Securities.Account, CD.Account, Online and CreditCard describe the facilities the customer avails with the bank; they reflect the customer's engagement and satisfaction with the bank, which may encourage the customer to go for a personal loan, so we consider these too.
We should not consider ID: it is unique for every customer and does not help in model building.
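As a minimal sketch of that last point (assuming the data frame is named data, as in the code further below), the identifier column can simply be dropped before modelling:

```r
# Drop the ID column: a unique identifier per row carries no predictive signal
data$ID <- NULL
```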
Exploratory Data Analysis:
summary(data):
From the summary we can see that the customers with missing values in Family.members are mostly aged persons (older than 40). At that age a person is generally married and has children, so we impute a family size of 3 as an approximation.
Near Zero Variance:
From the above data we infer that ID has all unique values, and that no column can be eliminated by the near-zero-variance check.
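A near-zero-variance check like the one described can be run with the caret package (a sketch; the report does not state which package was used, so caret is an assumption here):

```r
library(caret)

# Flag predictors whose variance is (near) zero; saveMetrics = TRUE returns
# the frequency-ratio and percent-unique diagnostics for every column
nzv <- nearZeroVar(data, saveMetrics = TRUE)
nzv[nzv$nzv, ]  # columns that would be dropped (none, for this data)
```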
Analysis of the relationship between dependent and independent variables:
We can see that more people aged 30-40 have taken the loan, which is fairly self-explanatory: people mostly take loans early in a settled career, which typically starts around age 30.
We can also infer that most loan takers are at the start of their career, with 0-10 years of experience.
Income..in.K.month. vs Personal.Loan:
Here we can infer that customers with a monthly income below 100K are most unlikely to take a personal loan under the current campaign.
We can also infer that customers with average credit card spending above 2.5K per month are persuaded by the campaign and are more likely to take a personal loan.
                Personal.Loan
CreditCard           0          1
     0       90.453258   9.546742
     1       90.272109   9.727891
We cannot infer much from CreditCard: customers with and without a bank credit card have the same roughly 90% chance of not taking the loan.
                Personal.Loan
CD.Account           0          1
     0       92.762878   7.237122
     1       53.642384  46.357616
We can infer that customers holding a certificate of deposit with the bank have a much higher probability of taking the loan.
                     Personal.Loan
Securities.Account        0          1
     0            90.620813   9.379187
     1            88.505747  11.494253
Holding a securities account gives only a slight lift (11.49% vs 9.38%), so it is a weak indicator on its own.
set.seed(1234)
data$random <- runif(nrow(data), 0, 1)  # one uniform random key per record
cart <- data[order(data$random), ]      # shuffle the rows by that key
We divided the data randomly in a 70:30 ratio, giving 3,516 records for the training sample and 1,484 records for the validation sample.
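The split itself is not shown in the listing; a sketch consistent with the counts above (cutting the shuffled data at row 3,516 is an assumption, the report only states the resulting sizes):

```r
# First 3,516 shuffled rows become the training sample, the rest validation
cart.dev <- cart[1:3516, ]
cart.val <- cart[3517:nrow(cart), ]
```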
Here we selected minsplit = 20 (the minimum number of observations a node must contain before it can be split) and minbucket = 7 (roughly minsplit/3, the minimum number of records allowed in a terminal node).
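The tree itself (referenced below as cartModel) does not appear in the listing; a sketch of how it would be grown with rpart under these control settings (cp = 0 and xval = 10 are assumptions, chosen so the cp table below has something to prune against):

```r
library(rpart)

# Grow a classification tree on the training sample with the stated controls
r.ctrl <- rpart.control(minsplit = 20, minbucket = 7, cp = 0, xval = 10)
cartModel <- rpart(Personal.Loan ~ ., data = cart.dev,
                   method = "class", control = r.ctrl)
```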
AUC, KS and Gini:
We get very high AUC, KS and Gini values, which indicates the model is well built.
We can also validate this with the confusion matrices.
Training sample (3,516 records):
              predict.class
Personal.Loan     0    1
            0  3170   12
            1    38  296
Validation sample (1,484 records):
              predict.class
Personal.Loan     0    1
            0  1330    8
            1    15  131
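From the validation matrix the headline rates can be computed directly (a sketch; the report does not quote these derived numbers):

```r
# Validation-sample confusion matrix as reported above
cm <- matrix(c(1330, 15, 8, 131), nrow = 2,
             dimnames = list(Personal.Loan = c(0, 1), predict.class = c(0, 1)))

accuracy    <- sum(diag(cm)) / sum(cm)  # (1330 + 131) / 1484 ~ 0.9845
sensitivity <- cm[2, 2] / sum(cm[2, ])  # 131 of 146 responders caught ~ 0.8973
round(c(accuracy = accuracy, sensitivity = sensitivity), 4)
```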
any(is.na(data))        # check for missing values
data[is.na(data)] <- 3  # impute missing Family.members with 3 (see EDA note above)
attach(data)
library(ggplot2)
# Density of credit card spending, split by loan outcome
ggplot(data, aes(CCAvg, fill = Personal.Loan)) + geom_density() + facet_grid(~Personal.Loan)
table(Personal.Loan, CCAvg)
# Row-wise percentages of loan take-up by securities-account status
prop.table(table(Securities.Account, Personal.Loan), 1) * 100
library(rattle)
library(RColorBrewer)
fancyRpartPlot(cartModel)  # draw the tree
printcp(cartModel)         # complexity-parameter table
plotcp(cartModel)          # cross-validated error vs cp
# cp value with the lowest cross-validated error, used for pruning
bestcp <- cartModel$cptable[which.min(cartModel$cptable[, "xerror"]), "CP"]
#Model Performance
View(cart.dev)
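The predicted scores used below (predict.score, predict.class) are never created in the listing; a sketch of the missing prediction step (the column names follow their later usage):

```r
# Class probabilities (matrix with P(class 0) and P(class 1) columns)
# and hard class labels on the training sample
cart.dev$predict.score <- predict(cartModel, cart.dev)
cart.dev$predict.class <- predict(cartModel, cart.dev, type = "class")
```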
## deciling code: assign each score to a decile (1 = lowest, 10 = highest)
decile <- function(x){
  deciles <- vector(length = 10)
  for (i in seq(0.1, 1, 0.1)){
    deciles[i * 10] <- quantile(x, i, na.rm = TRUE)
  }
  return(
    ifelse(x < deciles[1], 1,
    ifelse(x < deciles[2], 2,
    ifelse(x < deciles[3], 3,
    ifelse(x < deciles[4], 4,
    ifelse(x < deciles[5], 5,
    ifelse(x < deciles[6], 6,
    ifelse(x < deciles[7], 7,
    ifelse(x < deciles[8], 8,
    ifelse(x < deciles[9], 9, 10))))))))))
}
## deciling
cart.dev$deciles <- decile(cart.dev$predict.score[,2])
View(cart.dev)
## Ranking code
# install.packages("data.table", dependencies = TRUE)  # run once if not installed
library(data.table)
tmp_DT <- data.table(cart.dev)
# Count records, responders and non-responders per decile, best decile first
rank <- tmp_DT[, list(
  cnt          = length(Personal.Loan),
  cnt_resp     = length(which(Personal.Loan == '1')),
  cnt_non_resp = length(which(Personal.Loan == '0'))),
  by = deciles][order(-deciles)]
rank
# install.packages("ROCR")  # run once if not installed
library(ROCR)
library(gplots)
pred <- prediction(cart.dev$predict.score[, 2], cart.dev$Personal.Loan)
perf <- performance(pred, "tpr", "fpr")  # ROC curve object
plot(perf)
# KS = maximum separation between the TPR and FPR curves
KS <- max(attr(perf, 'y.values')[[1]] - attr(perf, 'x.values')[[1]])
auc <- performance(pred, "auc")
auc <- as.numeric(auc@y.values)          # area under the ROC curve
# install.packages("ineq")  # run once if not installed
library(ineq)
gini <- ineq(cart.dev$predict.score[, 2], type = "Gini")
####################################
##VALIDATION FOR HOLDOUT SAMPLE#####
####################################
## deciling
cart.val$deciles <- decile(cart.val$predict.score[,2])
View(cart.val)
tmp_DT = data.table(cart.val)
rank <- tmp_DT[, list(
cnt = length(Personal.Loan),
cnt_resp = length(which(Personal.Loan == "1")),
cnt_non_resp = length(which(Personal.Loan == "0"))) ,
by=deciles][order(-deciles)];
rank$rrate <- round(rank$cnt_resp * 100 / rank$cnt, 2)  # response rate per decile
rank$cum_resp <- cumsum(rank$cnt_resp)
rank$cum_non_resp <- cumsum(rank$cnt_non_resp)
rank$cum_perct_resp <- round(rank$cum_resp * 100 / sum(rank$cnt_resp), 2)
rank$cum_perct_non_resp <- round(rank$cum_non_resp * 100 / sum(rank$cnt_non_resp), 2)
rank$ks <- abs(rank$cum_perct_resp - rank$cum_perct_non_resp)  # KS by decile
View(rank)