
Training Document

Logistic Regression

Table of Contents

Objective
About Logistic Regression
CONCEPT
Steps of developing a Logistic Regression Model
Key Metrics Finalization
Rolling Performance Windows
Data Preparation
Data Treatment
Derived variables creation
Data Split
Oversampling
Variable Selection/Reduction
Data Distribution Related Issues
Information Value
WOE Approach
MULTI COLLINEARITY CHECK
Standardization of Variables
Logistic Regression Procedure
Key Model Statistics
Model Fit Statistics
Model description
KS Statistic and Rank Ordering
Gini and Lorentz curves
Divergence Index Test
Clustering checks
Deviance and Residual Test
Hosmer and Lemeshow Test
Model Validation
1) Re-estimation on Hold out sample
2) Rescoring on bootstrap samples

Objective

The purpose of this document is to guide new joiners, or people new to logistic
modelling, on how to carry out each step, starting from data collection/preparation
through to logistic modelling results and validation.
Each stage is covered at an introductory level. The document lets the reader start at
the beginning and see how each stage of the process contributes to the overall
problem, and how the stages interact and flow together while progressing towards a
final solution and its presentation.
The focus will be on execution of each step of the process and methods used to
verify the integrity of the process.

About Logistic Regression


Logistic regression technique uses maximum likelihood estimation to develop the
models. Logistic regression is a form of statistical modeling that is often appropriate
for dichotomous outcomes, for example good and bad. It is a method of describing
the relationship between a binary dependent (predicted) variable and a set of
independent explanatory variables from a set of observations. The independent
variables typically comprise demographic characteristics, past performance
characteristics, and product-related characteristics. Essentially, it is a method of
finding the best fit to a set of data points.

CONCEPT
Logistic Regression predicts the probability (P) of an event (Y) to occur through the
following equation:
Log(P/(1-P)) = β0 + β1X1 + β2X2 + ... + βnXn
P is the probability that the event Y occurs, P(Y=1)
Odds = P/(1-P)
Log(P/(1-P)) = log of the odds (the logit)
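
To make the link between the log-odds equation and a predicted probability concrete, the short SAS sketch below inverts the logit. The dataset name, the inputs x1/x2 and the coefficient values are purely illustrative assumptions, not part of the original material.

data scored;
   set input_data;                    /* assumed input dataset */
   logit = -2.5 + 0.8*x1 + 1.2*x2;    /* b0 + b1*X1 + b2*X2 (illustrative values) */
   p = 1 / (1 + exp(-logit));         /* P(Y=1) implied by the log-odds equation */
run;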
METHOD OF ESTIMATION

Maximum Likelihood Estimation: The coefficients β0, β1, β2, ..., βp are estimated such that the log of the likelihood function is as large as possible.

Maximum likelihood solves for the following condition: Σ (Yi - p(Yi=1)) Xi = 0, summed over all observations i = 1, 2, ..., n.

Assumption: Yi and Yj are independent for all i ≠ j.

There are no distributional assumptions on the independent predictors.

Steps of developing a Logistic Regression Model


Key Metrics Finalization

Observation Window: Time frame from which the independent variables (Xs) come.

Observation Point: Point at which the population will be scored.

Performance Window: Time frame from which the dependent variable (Y) comes.

[Timeline figure: Observation Window, followed by the Observation Point, followed by the Performance Window]

Rolling Performance Windows


The above example uses Jan14 to Mar14 as Observation Window and
May14 to Aug14 as Performance Window i.e. single performance and
observation window. Multiple rolling performance windows are used in
following cases:
1. To capture data seasonality
While using a single performance window, the assumption is that the parameters of
the model are constant over time. However, the economic environment often
changes considerably and it may not be reasonable to assume that a model's
parameters are constant. A common technique to assess the constancy of a model's
parameters is to compute parameter estimates over a rolling window of a fixed size
through the sample. If the parameters are truly constant over the entire sample,
then the estimates over the rolling windows should not be too different. If the
parameters change at some point during the sample, then the rolling estimates
should capture this instability.
e.g. the approach below utilizes 3 performance windows of 3 months each. Using
multiple performance windows, data from 10 months (Jan13 to Oct13) is utilized in
model development, which would not be possible using a single performance
window. This caters for seasonality in the data.

2. Utilizing campaign data of multiple months for model development:


If campaign data of multiple months is to be utilized for campaign response model
development then multiple performance windows can be used.
e.g. credit card campaign data for 3 months (Jan13 to Mar13) is available for
campaign response model development. Then, instead of a single performance
window, the following rolling windows can be utilized. Since a different set of
customers is targeted in the campaign each month, there will be no duplicates
across the different windows.
Performance Window 1: Customers targeted in Jan13 for the campaign - whether
they bought a credit card from Feb13 to Apr13
Performance Window 2: Customers targeted in Feb13 for the campaign - whether
they bought a credit card from Mar13 to May13
Performance Window 3: Customers targeted in Mar13 for the campaign - whether
they bought a credit card from Apr13 to Jun13
Target variable:
Once the objective and scope of the analysis are defined, it is important to identify
the target variable. For example, a risk model, depending upon the data and
business, can have multiple candidate target variables, such as 90DPD, Bankruptcy
Indicator, Charge-Off, or BAD12 (account becoming bad within the first 12 months of
activation).
Different target variables will lead to different models, performance, and business
insights. Sometimes, a combination of various target variables is used to build the
overall model. For example, an overall BAD predictor can be a combination of
90DPD, Bankruptcy or Charge-Offs.

In some problems, the target variable needs to be created and possibly defined. For
example, the client may want to build a model to identify potential churn but might
not have a clear definition of attrition. In such situations, it might often help to look
at the data and come up with some set of rules/algorithm to identify the dependent
variable.
Again, the definition of the dependent variable in certain cases may influence the
overall value of the model. For example, say the objective is to predict bankruptcy
of cardholders. We can choose to define the dependent variable to capture
bankruptcy next month or bankruptcy in 3 months. Clearly the latter model is more
useful if the objective of the analysis is to take some pre-emptive action against
those likely to go bankrupt.
In the current sample data, the target variable is defined as follows:

Positive responders to the campaign in the population data are tagged as 1. The
rest of the population (non-target) is tagged as 0.

Exclusion Criteria:
Policy exclusions and any other exclusions need to be applied prior to model
development to ensure the data is not biased and the model base is representative
of the actual population.

Data Preparation
The goal of this step is to prepare a master data set to be used in the modeling
phase of the problem solution. This dataset should contain at least:
A key, or set of keys, that identifies each record uniquely
The dependent variable relevant to the problem
All independent variables that are relevant or may be important to the problem
solution
In the early stages of a solution, it can sometimes be hard to determine an exact set
of independent variables. Often, nothing is left out to begin with, and the list of
relevant variables is derived and constantly updated as the process unfolds.
If the required master data is spread across several data sets, then the pertinent
records and variables will need to be extracted from each dataset and merged
together to form the master dataset. If this must be done, it is very important that
proper keys are used across the datasets, so that we not only end up with all the
needed variables in the final dataset, but also merge the datasets correctly. For
example, you may have a customer dataset with customer-level information such as
name, dob, age, sex, address etc. (a static data set), and another data set, account
data, which contains account-level information such as account number, account
type (savings/current/mortgage/fixed deposit), total balance, date of opening, last
transaction date etc. This account-level dataset needs to be rolled up to customer
level before merging with the customer dataset to create the master dataset.
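
As an illustration of the roll-up and merge described above, the hedged sketch below aggregates a hypothetical account-level dataset to customer level and then merges it onto the customer dataset. All dataset and variable names (acct_data, cust_data, cust_id, balance, open_date) are assumptions.

proc sql;
   create table acct_rollup as
   select cust_id,
          count(*)       as num_accounts,
          sum(balance)   as total_balance,
          max(open_date) as latest_open_date format=date9.
   from acct_data
   group by cust_id;
quit;

proc sort data=cust_data;   by cust_id; run;
proc sort data=acct_rollup; by cust_id; run;

data master;
   merge cust_data (in=a) acct_rollup (in=b);
   by cust_id;
   if a;    /* keep all customers, even those with no accounts */
run;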
PS:- If you try to merge two datasets by a common numeric variable, but whose
lengths were defined differently in each dataset, you may see a warning in the log
file similar to:
WARNING: Multiple lengths were specified for the BY variable by_var by input data sets. This may
cause unexpected results.

It is generally not wise to overlook log file warnings unless you have a very good
reason to. A short data step redefining the length of the shorter variable in one of
the datasets before merging will suffice to get rid of the warning, and could reveal
important data problems, such as information being truncated from some values of
the BY variables in the data set with the shorter length.
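
A sketch of the length fix mentioned above is given below; the variable by_var, its assumed length of 8 bytes, and the dataset names are illustrative.

data ds_b_fixed;
   length by_var 8;    /* redefine the BY-variable length before the variable is read */
   set ds_b;
run;

data merged;
   merge ds_a ds_b_fixed;
   by by_var;
run;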

Data Treatment
Once the master dataset has been created, a univariate macro (if available) should
be run to understand the data. Certain characteristics of the data that need to be
looked at are:
Variable name
Format
Number of unique values
Number of missing values
Distribution (PROC MEANS output for numeric variables; highest and lowest
frequency categories for a categorical variable)
o Numeric variables: standard numerical distribution including the mean,
min, max, and percentiles 1, 5, 10, 25, 50, 75, 90, 95, and 99
o Categorical variables: number of times the variable takes each categorical
value.
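
If the univariate macro is not available, the same checks can be approximated with base SAS procedures, as in this hedged sketch (dataset and variable names are assumptions):

proc means data=master n nmiss mean min p1 p5 p10 p25 p50 p75 p90 p95 p99 max;
   var balance age tenure;       /* numeric variables to profile */
run;

proc freq data=master nlevels order=freq;
   tables account_type / missing;    /* category counts, including missing values */
run;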

Output of the univariate macro for a few variables is given below (illustrative; some
columns have been removed for brevity). The output contains one row per variable,
with columns such as: name, type, var_length, numobs, nmiss, unique,
mean_or_top1, min_or_top2, p1_or_top3, ..., p99_or_bot2, max_or_bot1 (for
character variables, the most and least frequent categories are reported in place of
the numeric statistics).

Code for getting univariate output of variables:

Univariate_Macro.txt

Put the library path location (where the dataset exists) and the dataset name (on
which the univariate will run) in place of XXX at the bottom of the univariate code
before running it.
Basic things that should be looked for when first assessing the data:
Are data formats correct?
o Are numerical variables stored as text? Do date variables need to be
converted in order to be useful?
Which variables have missing values?
Data Outliers?
Do any variables exhibit invalid values (many 9999999, 101010101, 0/1
values, etc)?
o If you have a data dictionary provided by the client, there may be
information on invalid values, so this would be the first thing to check
Are any distributions clearly out of line with expectations?

Missing Value Imputation


It is important to impute the missing values in a dataset before analysis can be
performed on it. Below are some popular techniques:

Replace missing values with zero

Replace missing values with the value of the same variable from records with the
closest mean value of the response variable
Regression on other predictors
Replace missing values with the mean
Replace missing values with the median
Inter-correlation
Do not impute missing values

Replace missing values with zero


These are situations where the absence of a value implicitly means zero, for example
NUMBER OF LATE PAYMENTS. The value of this field would be expected to be zero for
most customers. Check related fields to justify this decision. Also, if some records
have a 0 and others are blank (missing), then check with the client whether a blank
has a different interpretation.

Regression on other predictors


Create a linear model with populated records to predict the values of this variable
using other variables in dataset as predictors. Then score the records with missing
values using the model coefficients. It is a very good method when there is
sufficient covariance among the variables in the dataset to produce a precise and
accurate regression.
Replace missing values with the mean
This technique should be used in situations where the great majority of records
(roughly 85%+) are populated and where other methods are not feasible. It can also
be used in situations where the variable is a predictor with low influence in the
model but needs to be included.
Replace missing values with the median
This technique should be used in situations where the great majority of records
(roughly 85%+) are populated, and is used instead of mean imputation if the
distribution is highly skewed and only if other methods are not feasible. It can also
be used in situations where the variable is a predictor with low influence in the
model but needs to be included.
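
A hedged sketch of mean/median replacement using PROC STDIZE is shown below; the REPONLY option replaces only the missing values, and the variable names are illustrative.

proc stdize data=master out=master_imp reponly method=median;
   var income utilization;    /* fill missings with the median */
run;

proc stdize data=master_imp out=master_imp2 reponly method=mean;
   var age;                   /* fill missings with the mean */
run;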
Inter-correlation
This method involves finding another predictor variable which has very high fill rate
and which is very highly correlated with the variable being imputed. The other
predictor is binned and the median value of the variable being imputed is calculated

for the bin. The variable is then imputed based on the bin into which it falls. This is a
good method but needs a very high correlation among predictors and a very high
(close to 100%) fill rate.
Do not impute missing values
Missing values should not be imputed in the following situations:

The variable is the response variable for the model.

The variable is a predictor that has low correlation with other predictors, and
imputation of zero, mean or median would bias the model results.

The variable is a predictor that has low correlation with the response and is unlikely
to play a significant role in the model; it should be excluded from modeling.

The variable is a predictor that could be important in the model but has a large
percentage of values missing; either exclude the records with missing values or
exclude the variable from the model. Imputation using the above techniques and
inclusion in the model would result in either a model with inflated performance
statistics or one reflecting data manipulation rather than the original source data.
Outlier Treatment
It is very important to treat outliers in the dataset before any analysis can be
performed. Outliers can be detected using the PROC UNIVARIATE output.
By comparing the P99 and Max values (or the P1 and Min values), we can identify
the variables having possible outliers.
Here are some common ways of dealing with outliers:

Cap all outliers at P1 or P99.

Cap all outliers at (P1 - δ) or (P99 + δ). The value of δ will be subjective.
Use exponential smoothing for all values beyond the range P1-P99.

The first and second methods are easier to implement but lose the ordinality of the
data. The third method takes care of the outlier problem without losing the
ordinality of the data.
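
The P1/P99 capping described above can be scripted as in the following minimal sketch for a single variable (dataset and variable names are assumptions):

proc means data=master noprint;
   var balance;
   output out=pct p1=bal_p1 p99=bal_p99;    /* percentile cut-offs */
run;

data master_capped;
   if _n_ = 1 then set pct;                 /* bring the cut-offs onto every record */
   set master;
   if not missing(balance) then do;
      if balance < bal_p1  then balance = bal_p1;     /* floor at P1 */
      if balance > bal_p99 then balance = bal_p99;    /* cap at P99  */
   end;
run;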
Derived variables creation

Derived variables are created in order to capture all underlying trends and aspects
of the data.
Rather than just using the raw variables in the model, taking proportions and ratios
and making indexes sometimes helps reduce bias and also helps in identifying new
trends in the data.
For example, taking average monthly spend instead of total spend in the last 12
months is more insightful because it helps neutralize the effect of new customers
having lower spend due to their reduced tenure. The normalized average value
provides a more accurate comparison amongst customers.
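
As a small illustration of such derived variables (all names and the 12-month window are assumptions):

data master_drv;
   set master;
   if tenure_months > 0 then
      avg_monthly_spend = total_spend_12m / min(tenure_months, 12);   /* normalizes for short tenure */
   if credit_limit > 0 then
      spend_to_limit_ratio = total_spend_12m / credit_limit;          /* example ratio variable */
run;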
Data Split
Development dataset - Fit the model on this dataset

Validation dataset (Hold-out sample) - Validate the model using the hold-out
sample

Out-of-time sample (Validation) - Validate the model on a different time period to
ensure the model works.

Development and validation samples are split in any ratio, with 50-80% of records in
the development sample. Sample code for doing a 70-30 split is below:

data temp;
   set xxx;                          /* xxx = master modeling dataset */
   n = ranuni(8);                    /* uniform random number, seed 8 */
run;

proc sort data=temp;
   by n;                             /* shuffle the records randomly */
run;

data training validation;
   set temp nobs=nobs;
   if _n_ <= .7*nobs then output training;
   else output validation;
run;
In many situations, data is scarce and it is not possible to set aside separate
validation datasets. In such cases, sampling techniques are used, as explained
below:

Multifold Cross validation


Note that this technique is a reasonable substitute for out-of-sample validation but
not for out-of-time validation. Usually a 10-fold cross validation is applied. Say you
are performing a k-fold cross validation:

Divide the data into k disjoint portions.


Build k models, each time using a different portion as the validation dataset
and the remaining k-1 portions as the modeling set.
Coefficients of the models that validate well are averaged to arrive at the final set
of coefficients (a minimal fold-assignment sketch follows below).
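
The sketch below assigns folds for k = 10; looping over the folds and averaging the coefficients would normally be wrapped in a macro. The dataset DEV, the variable TARGET and the predictors are assumptions.

data dev_folds;
   set dev;
   fold = ceil(10 * ranuni(12345));    /* randomly assign each record to one of 10 folds */
run;

/* Fold 1 held out as the validation portion; repeat for folds 2-10 */
proc logistic data=dev_folds (where=(fold ne 1)) descending;
   model target = var1 var2 var3;
run;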

Bootstrapping
This technique involves re-estimating the model over numerous randomly drawn
sub-samples. It is used in several ways. Often the model coefficients are taken to
be the average of the coefficients in the sub-sample models. The final model is then
validated.
In other instances, bootstrapping is used as a variable reduction technique.
Here are some steps to perform the task (see the sampling sketch below):

1000 random subpopulations are formed from the modeling population (the
number of subpopulations can be varied)
A regression is run for each of these subpopulations and 1000 models are
created
The Presence Rate is calculated by noting the presence of each variable as a
predictor across all models. So, a variable present as a predictor in 750 of the
models will have a presence rate of 75%
Models are then built with the variables that had at least a 50% Presence Rate (the
cutoff can be changed to suit the modeling task at hand)
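
PROC SURVEYSELECT can generate the repeated sub-samples described above, as in this hedged sketch; the dataset names, variable list and number of replicates are assumptions.

proc surveyselect data=dev out=boot_samples
                  method=urs        /* sampling with replacement */
                  samprate=1        /* each replicate the size of the original data */
                  reps=1000         /* 1000 bootstrap sub-samples */
                  seed=20140101 outhits;
run;

/* One stepwise regression per replicate; variable presence can then be tallied across replicates */
proc logistic data=boot_samples descending;
   by replicate;
   model target = var1-var50 / selection=stepwise slentry=0.05 slstay=0.05;
run;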

Oversampling
When the target event is rare (less than 2-5%), it is common to oversample the rare
event, that is, to take a disproportionately large number of event cases. Oversampling
rare events is generally believed to lead to better predictions.
In the case of logistic regression, oversampling only affects the intercept term; the
slope coefficients are left unaffected. Therefore, the rank order produced by the model
estimated on oversampled data holds true and will not change even if the
intercept is corrected for oversampling.
Therefore, if the objective of the logistic regression is to produce a scorecard, then
no correction is required for oversampling.
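
If corrected probabilities (rather than just ranks) are ever needed, a commonly used prior-correction adjusts only the intercept. The sketch below is an assumption-labelled illustration: rho1 is the event rate in the oversampled data, pi1 the true population event rate, and xbeta the linear predictor produced by the model.

data scored_adj;
   set scored;
   rho1 = 0.30;                 /* assumed event rate in the oversampled development data */
   pi1  = 0.02;                 /* assumed event rate in the population */
   offset = log( (rho1*(1-pi1)) / ((1-rho1)*pi1) );
   xbeta_adj = xbeta - offset;  /* shift the intercept back to the population scale */
   p_adj = 1 / (1 + exp(-xbeta_adj));
run;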

Variable Selection/Reduction
We might get client data with hundreds and even thousands of variables. It is
important to reduce the dimension of the dataset by eliminating
redundant/irrelevant variables before meaningful analysis can be performed.
Irrelevant variables carry no meaningful information, while redundant variables carry
little or no additional information. An example of a redundant variable would be a
variable C which is a linear combination of variables A and B. An irrelevant variable,
on the other hand, would be something like a variable X which has very low
correlation with the dependent variable Y.
Various techniques are used for variable reduction:

Data Distribution Related Issues
The following categories of variables are generally not considered for the model
development exercise.

Variables with very high missing data

Variables with low variation (concentration in single bin)

Date variables

Variables not usable in raw form because of granularity

Information Value
For logistic regression, the raw variable is replaced by its Weight of evidence (Woe),
which has higher predictability than the raw variable.
Weight of Evidence is the log (Event rate/ Non-event rate) for each category (bin) of
the variable considered for analysis.
Methodology to calculate Weight of Evidence:
Fine Classing: Fine classing is a method of dividing the population into categories
that differentiate events from non-events. Customers belonging to categories that
differentiate events from non-events have a high likelihood of having
Objective=Positive in the final model, and hence higher predictability. To fine class
the population, divide it into buckets of 5%-10% each.
1. Calculate the proportion of events in each category

2. Calculate the Event % and Non-event % for each category

3. Calculate the log odds of the category: Log odds = natural log(% Events / % Non-events)
A positive value indicates that customers in that category are more likely to be
events (Objective=Positive); the higher the positive value, the higher the
differentiating power of the category. A negative value indicates that customers in
that category are less likely to be events.
4. Calculate the Information Value for each category. The Information Value (IV)
indicates the predictive strength of the category; the higher the IV, the higher the
predictive power of the variable.

Illustrative Example

Consider the variable bfiinca from the training data:

PER_GOOD = % event = % of customers with Objective=Positive
PER_BAD = % non-event = % of customers with Objective=Negative
Information Value of the variable bfiinca = 0.145
1. Event % is calculated for each category (column PER_GOOD). For example, the
category 22167-38485 contains 40 of the 591 total events: Event % = (40/591)*100 = 6.77%
2. Non-event % is calculated for each category (column PER_BAD). The category
22167-38485 contains 76 of the 567 total non-events: Non-event % = (76/567)*100 = 13.4%
3. % Event / % Non-event (column RAW_ODDS) = 6.77/13.4 = 0.51
4. Natural log(% Event / % Non-event) = natural log(0.51) = -0.67
5. Information Value of the category = (PER_GOOD - PER_BAD) * log odds / 100
= (6.77 - 13.4) * (-0.67) / 100 = 0.044
Similarly, the information value is calculated for each category of the variable.
Information Value of the variable = sum of the information values of all categories of the variable
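
For orientation, the same arithmetic can be reproduced on simple decile bins with base SAS, as in the hedged sketch below; the dataset DEV and the 0/1 variable TARGET are assumptions, and the fine-classing codes referenced next automate the full process. Because the sketch works with proportions rather than percentages, no division by 100 is needed.

proc rank data=dev groups=10 out=ranked;
   var bfiinca;
   ranks bin;    /* decile bin of the raw variable */
run;

proc sql;
   create table woe_iv as
   select bin,
          sum(target)     as events,
          sum(1 - target) as nonevents,
          calculated events    / (select sum(target)     from ranked) as pct_event,
          calculated nonevents / (select sum(1 - target) from ranked) as pct_nonevent,
          log(calculated pct_event / calculated pct_nonevent)         as woe,
          (calculated pct_event - calculated pct_nonevent) * calculated woe as iv_bin
   from ranked
   group by bin;

   select sum(iv_bin) as iv_total from woe_iv;    /* Information Value of the variable */
quit;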
The above calculations of IV with fine classing can also be done automatically by
running the codes below and pasting the output into the given Excel file:

csv1.sas

fclassc1_n2.sas

fclassd1_n2.sas

Fineclassing.sas

fc_code_postModific
atoin.xls

Steps for using the above codes:

1. Create an inverse of the target variable, i.e. for records where the target variable
is 1, set the inverse target variable to 0, and for records where the target variable is
0, set the inverse target variable to 1.
2. Save all the codes in the same location as your dataset.
3. Open fclassc1_n2 and change the libname at the bottom of the code.
4. Do the same for fclassd1_n2.
5. Once you have made the changes, save and close these two codes.
6. After that, open the fineclassing code and run it.
7. Paste the output into the given Excel file using text-to-columns, keeping the first
column of type text; save the file in Excel 2003 format and close it.
8. Open fc_code_postModification.xls and open the Excel file created in step 7
above. Run the Set Parameters macro first, followed by the Sas code macro in the
Excel file, which will display the fine classing output.

The information value gives a measure of how well the characteristic can
discriminate between good and bad and whether it should be considered for
modeling. As a rule of thumb, apply the following cut-offs (they can be changed
based on the data):
< 0.03: Not predictive; don't consider for modelling
0.03 - 0.1: Predictive; consider for modelling
> 0.1: Very predictive; use in modelling
Coarse Classing:
Coarse Classing is a method of identifying and grouping similar categories. To
coarse class the population, group categories with similar log odds and the same
sign, then calculate the log odds for the grouped category. The new log odds is the
Weight of Evidence of the grouped category.
Each of the characteristics deemed to be predictive (information value > 0.03)
should be grouped (normally performed using the fine class output and a ruler) into
larger, more robust groups such that the underlying trend in the characteristic is
preserved. A rule of thumb suggests that at least 3% of the goods and 3% of the
bads should fall within a group.
For continuous characteristics, 10% bands are used to get an initial indication of the
predictive patterns and the strength of the characteristic. However, for the grouping
of attributes, more detailed reports are produced using 5% bands.
Try to make classes with around 5-10% of the population. Classes with less
than 5% might not be a true picture of the data distribution and might lead to
model instability.
The trend post coarse classing should be monotonically increasing, monotonically
decreasing, a parabola or an inverted parabola. Polytonic trends are usually not
acceptable.
Business inputs from the SMEs in the markets are essential for the coarse classing
process, as fluctuations in variables can be better explained and the classes then
make business sense.

WOE Approach

Concept:
In the standard WOE approach every variable is replaced by its binned counterpart.
The binned variable is created by assigning a value equal to WOE of each of the
bins formed during coarse classing.
WOE = ln (% Good/ % Bad)
WOE = 0 is imputed for the bins containing missing records and for bins that
consisted of less than 2% of the population.
Advantage:
Every attribute of the variable is weighted differently, hence taking care of the
neutral weight assignment issue of the dummy approach.
Disadvantage:
Fewer degrees of freedom, hence the chance of a variable being represented in the
model is lower in comparison to the dummy approach.

MULTI COLLINEARITY CHECK


Multicollinearity Macro
Introduction:
The macro MULTI COLLINEARITY is used to remove multicollinearity. It identifies
the variables that are correlated and helps in removing the correlated and/or
insignificant variables.
Logic:
1. Capture the outputs of Regression and Logistic Regression procedures
2. Transpose the factor-loading matrix and attach the parameter estimate, VIF,
Wald-Chi Square value to each independent variable.
3. Go to the last eigenvector (the one with the lowest eigenvalue) and find the
variables that are correlated (those that have factor loadings above the Cut-Off
Factor Loading specified in the Excel sheet)
4. Remove those variables (not more than 3 variables at any iteration / point in
time) that have a high VIF (more than 1.75) and a lower Wald Chi-Square value.
5. Go to each eigenvector in ascending order of eigenvalue and find the variables
that are correlated (those that have factor loadings above the Cut-Off Factor
Loading specified in the Excel sheet)
6. Go to step 4 after each Eigen vector selection.

Points to note:
1. If the factor loadings on a particular Eigen vector are not above the cut-off,
that vector is ignored and next Eigen vector would be looked for.
2. Not more than 3 variables could be dropped.
3. Not more than 250 variables could be used because of Excel limitation on the
number of rows.
4. Clear the contents in columns M & P of the Multicol8.xls sheet before starting to
use the macro for each new project.

Programs: Overview of Inputs & Outputs

Multi Collinearity.sas

1. SAS Program:

Inputs required, apart from the library and dataset name, are (1) the list of
variables for the REG and LOGISTIC procedures and (2) the response variable name.

Output: One Excel sheet "mc.xls" is created in the C:\ directory (the location and
name of the file can be changed).
Please go through the program and input the values at the appropriate places
(COMMENTS will guide you in doing that).

MultiCol8.ZIP

2. Excel Sheet: Multicol8.xls. Save this in the C:\ folder.

This workbook has a VB macro that runs on the file "C:\mc.xls" created by the SAS
program. Save this file on your computer and keep it open at the time of removing
multicollinearity.

3. Outputs:

List of Variables Retained (column M) and Removed (column P) will get pasted in
the same excel sheet "MultiCol8.xls"

Tracking of variables removed from the first iteration to the last iteration. The name
of the tracking file is to be specified each time the macro is run (e.g. c:\log.txt). You
can use the same file name throughout the project. This file holds the history of
variables removed and the corresponding correlated variables. Please make sure
that you change the file name when you are working on a new project; otherwise
the existing file "log.txt" gets appended.
The idea of having this tracking file is to find the replacement variables for any
variable that was dropped at any point of time. Open this txt file in Excel as
Delimited, selecting Comma and Other ("-") as delimiters.
Each row gives the variables that are correlated.
Columns B, C and D give the variables removed at a point. Variables from column F
onwards are correlated with the removed variables and were retained at that
respective point.

Frequently Asked Questions


Q1: Why should we do a multicollinearity check?
Ans1: Multicollinearity refers to correlation among the independent variables and leads
to an increase in the standard errors of the coefficient estimates. This in turn makes
the model unreliable.
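
Independently of the macro, a quick VIF and condition-index screen can be run with PROC REG; this is a linear-regression approximation that is commonly used to flag collinearity among candidate predictors, and the dataset and variable names are assumptions.

proc reg data=dev;
   model target = var1 var2 var3 var4 / vif tol collin;   /* VIF, tolerance and eigenvalue diagnostics */
run;
quit;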

Clustering of variables
Varclus is a SAS procedure that groups variables into clusters based on the variance
of the data they explain. It is unsupervised in that it does not depend on the
dependent variable. In the background it performs a principal-component-like analysis
and then makes the components orthogonal so that they contain distinct sets of
variables.
Here are some practical implementation steps for running varclus:

Varclus groups similar (highly correlated) variables into clusters

An ideal representative (or representatives) from each cluster can be retained

The selected variable should have a high R-squared with its own cluster and a low
R-squared with the next closest cluster

The 1-R-squared ratio is a good selection criterion (1-R-squared with own cluster /
1-R-squared with next closest cluster)

Multiple selections can be made from clusters if necessary

Business judgment might often determine variable selection

Here is the sample code:

proc varclus data=imputed maxeigen=.7 short hi;
   var &list;
run;
Here is the sample output:

cluster   variable   R-squared with    R-squared with          1-R-squared
                     Own Cluster       Next Closest Cluster    Ratio
69        bb         0.5804            0.4680                  0.788722
69        cc         0.4430            0.1362                  0.644825
70        kk         0.8057            0.2993                  0.277294
70        ll         0.6345            0.2918                  0.516097
71        mm         0.5625            0.0013                  0.438069
72        nn         0.7797            0.2811                  0.306440
So in the above output we have 4 clusters. We want to select one variable to
represent each cluster, but we might sometimes use more than one variable from a
cluster for business reasons. We ideally want to choose a variable which has a high
R-squared with its own cluster and a low R-squared with its next closest cluster.
Choosing the variable with the lowest 1-R-squared ratio accomplishes this task.
This implies the choice of cc, kk, mm and nn.
Standardization of Variables

Once the final set of variables on which the logistic regression is to be run has been
decided, standardized coefficients can be obtained by standardizing (z-scoring) the
variables. For example, the z-score for age would be calculated as
(age - mean[age]) / std dev[age]. The output would then give standardized
coefficients as results.
Logistic Regression Procedure


Before running a logistic model in SAS you first need to check that your dataset is
ready for logistic modeling. SAS will discard any observations with missing values

for any of the variables the user specifies to be considered for the model, so it is
necessary that general modeling data preparation and missing imputation have
occurred. Also, PROC LOGISTIC will only accept numeric variables as predictors,
unless categorical level character variables are specified in the CLASS statement.
You must also be careful with categoricals that are coded with numerals: the
program will treat these as if they were continuous numerics unless they are
specified in the CLASS statement.
To run a logistic regression in SAS, and estimate the coefficients of the model, one
would run code similar to the below:
proc logistic data = <libname>.<modeling dataset> DESCENDING;
model dependent_variable = var1 var2 var3;
run;

The DESCENDING option tells SAS that the value of the dependent variable
we wish to predict is 1, and not 0.
With no other options selected, SAS will estimate the full model, meaning all
variables will be included, regardless of whether their coefficients are significantly
different from 0.
PROC LOGISTIC also permits:

forward selection
backward elimination
forward stepwise

selection of optimal subsets of independent variables


Default significance levels for entry into and removal from the model can be modified
using the SLENTRY and SLSTAY options.
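
Putting these options together, a hedged example with a CLASS variable and stepwise selection might look like the following; the dataset, variables and significance thresholds are assumptions.

proc logistic data=lib.dev descending;
   class rank / param=ref ref=last;    /* categorical predictor coded as reference dummies */
   model target = gre gpa rank
         / selection=stepwise slentry=0.05 slstay=0.05;
run;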
Key Model Statistics
PROC LOGISTIC provides several means of assessing model fit:
ALL THE BELOW TABLES/GRAPHS ARE ILLUSTRATIVE AND YET TO BE
DEVELOPED FOR SAMPLE DATA

Model Fit Statistics

Criterion    Intercept Only    Intercept and Covariates
AIC          501.977           470.517
SC           505.968           494.466
-2 Log L     499.977           458.517

Testing Global Null Hypothesis: BETA=0

Test                Chi-Square    Pr > ChiSq
Likelihood Ratio    41.4590       <.0001
Score               40.1603       <.0001
Wald                36.1390       <.0001
Type 3 Analysis of Effects

Effect    Wald Chi-Square    Pr > ChiSq
GRE       4.2842             0.0385
GPA       5.8714             0.0154
RANK      20.8949            0.0001

The portion of the output labeled Model Fit Statistics describes and tests the overall
fit of the model. The -2 Log L (499.977) can be used in comparisons of nested
models, but we won't show an example of that here.
In the next section of output, the likelihood ratio chi-square of 41.4590 with a
p-value of 0.0001 tells us that our model as a whole fits significantly better than an
empty model. The Score and Wald tests are asymptotically equivalent tests of the
same hypothesis tested by the likelihood ratio test; not surprisingly, these tests also
indicate that the model is statistically significant.
The section labeled Type 3 Analysis of Effects shows the hypothesis tests for each
of the variables in the model individually. The chi-square test statistics and
associated p-values shown in the table indicate that each of the three variables in
the model significantly improves the model fit. For gre and gpa, this test duplicates
the test of the coefficients shown below. However, for class variables (e.g., rank),
this table gives the multiple-degree-of-freedom test for the overall effect of the
variable.

Model description
Variables used in the model:

The LOGISTIC Procedure

Analysis of Maximum Likelihood Estimates

Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
Intercept    1     -5.5414     1.1381            23.7081            <.0001
GRE          1     0.00226     0.00109           4.2842             0.0385
GPA          1     0.8040      0.3318            5.8714             0.0154
RANK 1       1     1.5514      0.4178            13.7870            0.0002
RANK 2       1     0.8760      0.3667            5.7056             0.0169
RANK 3       1     0.2112      0.3929            0.2891             0.5908

Odds Ratio Estimates

Effect         Point Estimate    95% Wald Confidence Limits
GRE            1.002             1.000     1.004
GPA            2.235             1.166     4.282
RANK 1 vs 4    4.718             2.080     10.701
RANK 2 vs 4    2.401             1.170     4.927
RANK 3 vs 4    1.235             0.572     2.668

Association of Predicted Probabilities and Observed Responses

Percent Concordant    69.1      Somers' D    0.386
Percent Discordant    30.6      Gamma        0.387
Percent Tied          0.3       Tau-a        0.168
Pairs                 34671     c            0.693

KS Statistic and Rank Ordering


The population is divided into deciles in ascending order of Bads. A model that
rank-orders predicts the highest number of Goods in the first decile, with the
numbers then decreasing progressively. We define the Kolmogorov-Smirnov (K-S)
statistic as the maximum value of |G(s) - B(s)| over the score range. The K-S
statistic has a known distribution under the null hypothesis that G(s) and B(s) are
identical, and has a critical value of 1.36 / sqrt(G*B / (G + B)) at the 5% level of
significance.
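
One quick way to obtain the two-sample K-S statistic on a scored dataset is PROC NPAR1WAY with the EDF option, as in this hedged sketch (the scored dataset and its variable names are assumptions):

proc npar1way data=scored edf;
   class target;    /* 0 = good, 1 = bad */
   var score;       /* model score or predicted probability */
run;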

Model performance on development sample

Model performance on validation sample

Comparison of model performance on development and validation samples


using Lorentz curve
Gini and Lorentz curves
The industry-standard measures for assessing the predictive power of a model are
the Gini coefficient and the K-S statistic. A model giving a Gini greater than 40%
and/or a K-S statistic in the region of > 20% would be classified as a good model.
Let G(s) be the number of goods with a score less than s, and B(s) be the number of
bads with a score less than s. A Lorentz curve is a plot of G(s) against B(s). The Gini
coefficient captures the degree to which the distributions differ by calculating the
difference between the areas under the G(s) and B(s) curves, i.e. ∫ [G(s) - B(s)] ds.
If G(s) and B(s) are identical, the Lorentz curve is the straight line between (0, 0)
and (1, 1), and the required integral is 0.
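
Because Gini = 2c - 1 (equivalently c = (1 + Gini)/2, see the FAQ below), the Gini can be read off the association statistics that PROC LOGISTIC already produces. The sketch below captures them with ODS; the exact table and column names (Association, Label2, nValue2) are assumptions to verify against the installed SAS version.

ods output Association=assoc;
proc logistic data=dev descending;
   model target = var1 var2 var3;
run;

data gini;
   set assoc;
   where Label2 = 'c';         /* the c-statistic row of the association table */
   gini = 2*nValue2 - 1;       /* Gini (Somers' D) from the area under the curve */
run;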

The Lorentz curves above show the plot of the cumulative percentage of
bads against the cumulative percentage of goods in the development
and validation samples. The dark blue line shows the distribution of goods
and bads under random scoring, whereas the brown curve (development
sample) and green curve (validation sample) show the lift provided by
the Conversion Rate Model over and above random selection. The model
exhibits a similar level of performance across the development and validation
samples, as can be seen from the almost overlapping Lorentz curves.

Frequently Asked Questions


Q: What is the C-stat (C)?
Ans: C is the area under the curve = (# of concordant pairs + 0.5 * # of tied pairs) / # of
pairs
= % concordance + 0.5 * % tied
= (1 + Gini) / 2
Q: How can we determine which variable has maximum contribution in the model?

Ans: To measure the contribution of variables we standardize the variables, since the
variables are on different scales. If the variables have the same unit of measurement
then their magnitudes can be compared directly.
The contribution of variables in the model can be measured by:
a) Wald chi-square
b) Point estimate
Q: What does the point estimate tell us?
Ans: A point estimate tells us how the dependent variable (the log odds) will change
for a one-unit change in the predictor.

Divergence Index Test

The Divergence Index, D = (x_g - x_b) / s, is a commonly employed measure of the
separation achieved by a model, where x_g and x_b are the mean scores of the
goods and the bads and s is the pooled standard deviation. It is related to a
t-distribution (multiply by sqrt(G*B / (G + B))) if the two population variances are
equal. This measure tells us how well the means of the goods and bads are
differentiated. A t-statistic > |6| shows a high level of differentiation.

Null Hypothesis (H0): The mean score of the goods in the population is less than or
equal to the mean score of the bads in the population. A robust model implies that
the mean score for goods will be significantly greater than the mean score for bads,
i.e. the null hypothesis needs to be rejected. As shown by the p-value in Table
4.2.4, the null hypothesis is rejected at the 1% level of significance.

Clustering checks
A good model should not have significant clustering of the population at any
particular score and the population must be well scattered across score points.

Clustering refers to the proportion of accounts falling at various integral
values of the model-generated scores. (Score here is defined as the probability of
being a responder, as per the Conversion Rate Model, multiplied by 1000.)

Deviance and Residual Test

Both the Pearson and deviance tests assess whether or not there is sufficient
evidence that the observed data do not fit the model. The null hypothesis is
that the data fit the model. If the tests are not significant, there is no reason to
assume that the model is incorrect, i.e. we accept that the model generally fits the
data. For this model both the Pearson and deviance tests come out insignificant,
further confirming that the model fits the data.
Hosmer and Lemeshow Test
The Hosmer-Lemeshow Goodness-of-Fit test tells us whether we have
constructed a valid overall model or not. If the model is a good fit to the data
then the Hosmer-Lemeshow Goodness-of-Fit test should have an associated
p-value greater than 0.05. In our case the associated p-value comes out high
for both the development and the validation samples, signalling that the
model is a good fit for the data.
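
Both sets of goodness-of-fit statistics discussed above can be requested directly in PROC LOGISTIC, as in the hedged sketch below: AGGREGATE SCALE=NONE produces the deviance and Pearson tests, and LACKFIT produces the Hosmer-Lemeshow test. Dataset and variable names are assumptions.

proc logistic data=dev descending;
   model target = var1 var2 var3 / aggregate scale=none lackfit;
run;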

Frequently Asked Questions

Q: What drawback does the Hosmer-Lemeshow test have?

Ans: Hosmer-Lemeshow is a goodness-of-fit test. However, this metric is volatile
due to the degrees of freedom deployed.

Confusion Matrix (1 is the Event Indicator): Development Data

65% of the consumer completions were correctly predicted by the model.

Confusion Matrix (1 is the Event Indicator): Validation Data
Model Validation
1) Re-estimation on hold-out sample

We re-estimate the model parameters on the hold-out validation sample to
ensure the parameters are close to those from the development sample and
that all the other model performance measures hold.

2) Rescoring on bootstrap samples

The samples are selected each time with replacement, and the equation from
the development sample is used to re-score several bootstrap samples in
varying sample proportions (20% - 80%). The model should satisfy the
performance measures stated above. Test statistics show that the model
validates for all 5 bootstrap samples within the confidence interval and
achieves complete rank-ordering.
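
Rescoring bootstrap samples with the frozen development equation can be done with the OUTMODEL/INMODEL options and the SCORE statement of PROC LOGISTIC, as in this hedged sketch (dataset names, variables and the sampling rate are assumptions).

proc logistic data=dev descending outmodel=dev_model;
   model target = var1 var2 var3;    /* development equation, estimated once */
run;

proc surveyselect data=validation out=boot1 method=urs samprate=0.5 seed=101;
run;

proc logistic inmodel=dev_model;
   score data=boot1 out=boot1_scored;    /* apply the frozen coefficients to the bootstrap sample */
run;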
