
Discriminant Analysis

Discriminant analysis is used to model the value of a categorical dependent variable based on its relationship to one or more predictors. It builds a predictive model for group membership. The model is composed of a discriminant function (or, for more than two groups, a set of discriminant functions) based on linear combinations of the predictor variables that provide the best discrimination between the groups. The functions are generated from a sample of cases for which group membership is known; they can then be applied to new cases that have measurements for the predictor variables but unknown group membership. The number of functions equals min(#groups - 1, #predictors).

Given a set of independent variables, discriminant analysis attempts to find linear combinations of those variables that best separate the groups of cases. These combinations are called discriminant functions. The procedure automatically chooses a first function that separates the groups as much as possible. It then chooses a second function that is both uncorrelated with the first function and provides as much further separation as possible. The procedure continues adding functions in this way until reaching the maximum number of functions, as determined by the number of predictors and the number of categories in the dependent variable.

Note: The grouping variable can have more than two values. The codes for the grouping variable must be integers, however, and you need to specify their minimum and maximum values. Cases with values outside of these bounds are excluded from the analysis.

Example: On average, people in temperate zone countries consume more calories per day than people in the tropics, and a greater proportion of the people in the temperate zones are city dwellers. A researcher wants to combine this information into a function to determine how well these measures discriminate between the two groups of countries. The researcher thinks that population size and economic information may also be important. Discriminant analysis allows you to estimate coefficients of the linear discriminant function, which looks like the right side of a multiple linear regression equation. That is, using coefficients a, b, c, and d, the function is:

D = a * climate + b * urban + c * population + d * gross domestic product per capita

If these variables are useful for discriminating between the two climate zones, the values of D will differ for the temperate and the tropical countries. If you use a stepwise variable selection method, you may find that you do not need to include all four variables in the function.
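The sketch below shows what such a two-group discriminant function looks like in code, using scikit-learn's LinearDiscriminantAnalysis on synthetic data. The data, variable names, and seed are hypothetical stand-ins for the climate-zone example, not the tutorial's actual data.

```python
# A minimal sketch of the climate-zone example on synthetic data (all values made up).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(42)
n = 200
zone = rng.integers(0, 2, n)                 # hypothetical grouping: 0 = tropical, 1 = temperate

# Temperate countries are simulated with somewhat higher calorie intake and urbanization.
calories = rng.normal(2400 + 300 * zone, 250)
urban = np.clip(rng.normal(45 + 15 * zone, 15), 5, 100)
population = rng.lognormal(2.5, 1.0, n)      # millions (no group difference assumed)
gdp = rng.lognormal(8.5, 1.0, n)             # per capita

X = np.column_stack([calories, urban, population, gdp])

lda = LinearDiscriminantAnalysis()
lda.fit(X, zone)

# With two groups there is one function D = a*calories + b*urban + c*population + d*gdp + const.
# scalings_ holds the discriminant coefficients (up to scaling); transform() returns discriminant scores.
print("discriminant coefficients:", lda.scalings_.ravel())
print("first five discriminant scores:", lda.transform(X[:5]).ravel())
print("predicted zones:", lda.predict(X[:5]))
```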

DA is used when the dependent variable is categorical and the predictor (independent) variables are at the interval level, such as age, income, attitudes, perceptions, and years of education, although dummy variables can be used as predictors, as in multiple regression. (In logistic regression, by contrast, the IVs can be of any level of measurement.) DA also handles more than two categories of the dependent variable, unlike binary logistic regression, which is limited to a dichotomous dependent variable.

Discriminant analysis linear equation

DA involves the determination of a linear equation, like regression, that will predict which group a case belongs to. The form of the equation or function is:

D = v1*X1 + v2*X2 + v3*X3 + ... + vi*Xi + a

where D is the discriminant score, vi is the discriminant coefficient or weight for variable i, Xi is the respondent's score on variable i, a is a constant, and i is the number of predictor variables.

This function is similar to a regression equation or function. The v's are unstandardized discriminant coefficients, analogous to the b's in a regression equation; they are chosen so that the resulting discriminant scores maximize the distance between the group means. Standardized discriminant coefficients can also be used, like beta weights in regression, and good predictors tend to have large weights. What you want this function to do is maximize the distance between the categories, that is, produce an equation that has strong discriminatory power between groups. After an existing set of data has been used to calculate the discriminant function and classify cases, any new case can then be classified. The number of discriminant functions is one less than the number of groups (or the number of predictors, if that is smaller), so there is only one function for the basic two-group discriminant analysis. A discriminant score is a weighted linear combination (sum) of the discriminating variables.

Assumptions

The discriminant model has the following assumptions: the predictors are not highly correlated with each other; the means and variances of a given predictor are not correlated across groups; the correlation between two predictors is constant across groups; and the values of each predictor have a normal distribution. More formally, the relationships between all pairs of predictors must be linear, multivariate normality must exist within groups, and the population covariance matrices for the predictor variables must be equal across groups. Discriminant analysis is, however, fairly robust to these assumptions, although violations of multivariate normality may affect the accuracy of estimates of the probability of correct classification. If multivariate non-normality is suspected, then logistic regression should be used. Multicollinearity is again an issue with which you need to be concerned. It is also important that the sample size of the smallest group exceed the number of predictor variables in the model. The linearity assumption, as well as the assumption of homogeneity of variance-covariance matrices, can be tested, as we did for multiple regression in chapter 6, by examining a matrix scatter plot: if the spreads of the scatter plots are roughly equal, homogeneity of the variance-covariance matrices can be assumed. This assumption can also be tested with Box's M. Discriminant analysis creates an equation which will minimize the possibility of misclassifying cases into their respective groups or categories. The aim of the statistical analysis in DA is to combine (weight) the variable scores in some way so that a single new composite variable, the discriminant score, is produced.
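Before fitting the model, these assumptions can be screened informally. The sketch below assumes a hypothetical pandas DataFrame with numeric predictor columns and a grouping column (all names made up); it inspects predictor intercorrelations, compares group covariance matrices through their log determinants, and checks the smallest group size.

```python
# Rough, informal assumption checks; Box's M (shown later) tests homogeneity formally.
import numpy as np
import pandas as pd

def assumption_checks(df: pd.DataFrame, predictors: list, group: str) -> None:
    # Multicollinearity: inspect pairwise correlations among the predictors.
    print("Predictor correlations:\n", df[predictors].corr().round(3))

    # Rough check of equal covariance matrices: compare group log determinants
    # (large differences hint at heterogeneity of the covariance matrices).
    for level, sub in df.groupby(group):
        _, logdet = np.linalg.slogdet(np.cov(sub[predictors].to_numpy(), rowvar=False))
        print(f"group {level}: n = {len(sub)}, log|S| = {logdet:.3f}")

    # Sample-size rule of thumb: the smallest group should exceed the number of predictors.
    print("smallest group size:", df.groupby(group).size().min(), "| predictors:", len(predictors))

# Example with made-up data:
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x1": rng.normal(size=100),
                     "x2": rng.normal(size=100),
                     "group": rng.integers(0, 2, 100)})
assumption_checks(demo, ["x1", "x2"], "group")
```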

Example: Using Discriminant Analysis to Assess Credit Risk


If you are a loan officer at a bank, you want to be able to identify characteristics that are indicative of people who are likely to default on loans, and you want to use those characteristics to identify good and bad credit risks. Suppose your data file contains information on 850 past and prospective customers. The first 700 cases are customers who were previously given loans. Use a random sample of these 700 customers to create a discriminant analysis model, setting the remaining customers aside to validate the analysis. Then use the model to classify the 150 prospective customers as good or bad credit risks.

Setting the random seed allows you to replicate the random selection of cases in this analysis. To set the random seed, from the menus choose:
Transform > Random Number Generators...
Select Set Starting Point.
Select Fixed Value and type 9191972 as the value.
Click OK.

To create the selection variable for validation, from the menus choose:
Transform > Compute Variable...
Type validate in the Target Variable text box.
Type rv.bernoulli(0.7) in the Numeric Expression text box. This sets the values of validate to be randomly generated Bernoulli variates with probability parameter 0.7.

You only intend to use validate with cases that could be used to create the model; that is, previous customers. However, there are 150 cases corresponding to potential customers in the data file. To perform the computation only for previous customers:
Click If.
Select Include if case satisfies condition.
Type MISSING(default) = 0 as the conditional expression. This ensures that validate is only computed for cases with non-missing values for default; that is, for customers who previously received loans.

Approximately 70 percent of the customers previously given loans will have a validate value of 1. These customers will be used to create the model. The remaining customers who were previously given loans will be used to validate the model results.

To run the discriminant analysis, from the menus choose:
Analyze > Classify > Discriminant...
Select Previously defaulted as the grouping variable.
Select Years with current employer, Years at current address, Debt to income ratio (x100), and Credit card debt in thousands as the independent variables.
Select validate as the selection variable.
Select Previously defaulted and click Define Range. Type 0 as the minimum and 1 as the maximum.
Select validate and click Value in the Discriminant Analysis dialog box. Type 1 as the value for the selection variable.
Select Means, Univariate ANOVAs, and Box's M in the Descriptives group.
Select Fisher's and Unstandardized in the Function Coefficients group.
Select Within-groups correlation in the Matrices group.
Select Summary table and Leave-one-out classification.
Select Predicted group membership and Probabilities of group membership.
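For readers working outside SPSS, a rough equivalent of this workflow is sketched below in Python. The file name bankloan.csv, the column names (default, employ, address, debtinc, creddebt), and the 0/1 coding of default are assumptions, not taken from the tutorial's data file; equal priors are set to mirror the SPSS default.

```python
# Hypothetical Python analogue of the SPSS steps: seed, Bernoulli split, fit, classify prospects.
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(9191972)                 # fixed seed, mirroring "Set Starting Point"
data = pd.read_csv("bankloan.csv")                   # hypothetical file with the 850 customers

predictors = ["employ", "address", "debtinc", "creddebt"]   # assumed column names

known = data[data["default"].notna()].copy()         # the 700 previous customers
prospects = data[data["default"].isna()].copy()      # the 150 prospective customers

# validate ~ Bernoulli(0.7): roughly 70% of previous customers build the model,
# the rest are held out for subset validation.
known["validate"] = rng.binomial(1, 0.7, size=len(known))
train = known[known["validate"] == 1]
holdout = known[known["validate"] == 0]

# Equal priors mirror the SPSS default ("All groups equal"); default assumed coded 0 = No, 1 = Yes.
lda = LinearDiscriminantAnalysis(priors=[0.5, 0.5])
lda.fit(train[predictors], train["default"])

print("holdout accuracy:", lda.score(holdout[predictors], holdout["default"]))
print("predicted group for first prospects:", lda.predict(prospects[predictors])[:5])
print("P(default) for first prospects:", lda.predict_proba(prospects[predictors])[:5, 1])
```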

Analysis Case Processing Summary
Unweighted Cases                                                          N      Percent
Valid                                                                    499       58.7
Excluded: Missing or out-of-range group codes                            150       17.6
Excluded: At least one missing discriminating variable                     0         .0
Excluded: Both missing or out-of-range group codes and at least
          one missing discriminating variable                              0         .0
Excluded: Unselected                                                     201       23.6
Excluded: Total                                                          351       41.3
Total                                                                    850      100.0

Group Statistics
Previously defaulted   Variable                          Mean      Std. Deviation   Valid N (unweighted / weighted)
No                     Years with current employer       9.5840    6.67766          375 / 375.000
                       Years at current address          8.8800    6.94239          375 / 375.000
                       Credit card debt in thousands     1.2554    1.41769          375 / 375.000
                       Debt to income ratio (x100)       8.8179    5.69545          375 / 375.000
Yes                    Years with current employer       5.1855    5.72737          124 / 124.000
                       Years at current address          6.3548    6.27836          124 / 124.000
                       Credit card debt in thousands     2.3656    3.36732          124 / 124.000
                       Debt to income ratio (x100)      14.4468    7.97554          124 / 124.000
Total                  Years with current employer       8.4910    6.72386          499 / 499.000
                       Years at current address          8.2525    6.86476          499 / 499.000
                       Credit card debt in thousands     1.5313    2.13087          499 / 499.000
                       Debt to income ratio (x100)      10.2166    6.78238          499 / 499.000

The group statistics table reveals a potentially more serious problem. For all four predictors, larger group means are associated with larger group standard deviations (Checking for Correlation of Group Means and Variances). In particular, look at Debt to income ratio (x100) and Credit card debt in thousands, for which the means and standard deviations for the Yes group are considerably higher. In further analysis, you may want to consider using transformed values of these predictors.

Tests of Equality of Group Means
                                  Wilks' Lambda       F       df1    df2     Sig.
Years with current employer            .920         43.262     1     497     .000
Years at current address               .975         12.911     1     497     .000
Credit card debt in thousands          .949         26.597     1     497     .000
Debt to income ratio (x100)            .871         73.534     1     497     .000

The tests of equality of group means measure each independent variable's potential before the model is created. Each test displays the results of a one-way ANOVA for the independent variable using the grouping variable as the factor. If the significance value is greater than 0.10, the variable probably does not contribute to the model. According to the results in this table, every variable in your discriminant model is significant. Wilks' lambda is another measure of a variable's potential. Smaller values indicate the variable is better at discriminating between groups. The table suggests that Debt to income ratio (x100) is best, followed by Years with current employer, Credit card debt in thousands, and Years at current address. (Assessing the Contribution of Individual Predictors)
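Each row of this table can be reproduced from a one-way ANOVA. The sketch below uses synthetic data with roughly the same shape as Debt to income ratio (x100) (the actual data are not reproduced here) and computes the F statistic together with the corresponding univariate Wilks' lambda, i.e. the within-groups sum of squares divided by the total sum of squares.

```python
# Univariate F and Wilks' lambda for a single predictor with two groups.
import numpy as np
from scipy import stats

def univariate_test(group_no: np.ndarray, group_yes: np.ndarray):
    f_stat, p_value = stats.f_oneway(group_no, group_yes)
    df1 = 1                                          # number of groups - 1
    df2 = len(group_no) + len(group_yes) - 2         # N - number of groups
    # Univariate Wilks' lambda: SS_within / SS_total.
    wilks = 1.0 / (1.0 + f_stat * df1 / df2)
    return wilks, f_stat, df1, df2, p_value

# Example with synthetic data roughly shaped like Debt to income ratio (x100):
rng = np.random.default_rng(1)
no_group = rng.normal(8.8, 5.7, 375)
yes_group = rng.normal(14.4, 8.0, 124)
print(univariate_test(no_group, yes_group))
```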

Pooled Within-Groups Matrices
Correlation                      Years with        Years at          Credit card debt   Debt to income
                                 current employer  current address   in thousands       ratio (x100)
Years with current employer           1.000             .286               .508               .104
Years at current address               .286            1.000               .290               .140
Credit card debt in thousands          .508             .290              1.000               .508
Debt to income ratio (x100)            .104             .140               .508              1.000

The within-groups correlation matrix shows the correlations between the predictors. The largest correlations occur between Credit card debt in thousands and the other variables, but it is difficult to tell if they are large enough to be a concern. Look for differences between the structure matrix and discriminant function coefficients to be sure. (Checking Collinearity of Predictors)
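A pooled within-groups correlation matrix can be computed by combining the group covariance matrices, weighted by their degrees of freedom, and then converting the result to correlations. A sketch, assuming a hypothetical DataFrame with the predictor columns and a grouping column (names are assumptions):

```python
# Pooled within-groups correlation matrix from per-group covariance matrices.
import numpy as np
import pandas as pd

def pooled_within_groups_corr(df: pd.DataFrame, predictors: list, group: str) -> pd.DataFrame:
    n_total, n_groups = len(df), df[group].nunique()
    pooled_cov = np.zeros((len(predictors), len(predictors)))
    for _, sub in df.groupby(group):
        # Accumulate within-group cross-products, weighted by each group's degrees of freedom.
        pooled_cov += (len(sub) - 1) * np.cov(sub[predictors].to_numpy(), rowvar=False)
    pooled_cov /= (n_total - n_groups)
    sd = np.sqrt(np.diag(pooled_cov))
    corr = pooled_cov / np.outer(sd, sd)             # covariance -> correlation
    return pd.DataFrame(corr, index=predictors, columns=predictors)

# Usage (hypothetical names): pooled_within_groups_corr(train, predictors, "default")
```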

Analysis 1 Box's Test of Equality of Covariance Matrices


Log Determinants
Previously defaulted      Rank    Log Determinant
No                          4          11.185
Yes                         4          12.253
Pooled within-groups        4          11.957

The ranks and natural logarithms of determinants printed are those of the group covariance matrices.

Log determinants are a measure of the variability of the groups. Larger log determinants correspond to more variable groups. Large differences in log determinants indicate groups that have different covariance matrices.
Test Results
Box's M                 252.117
F    Approx.             24.893
     df1                     10
     df2             245917.239
     Sig.                  .000
Tests null hypothesis of equal population covariance matrices.

Box's M tests the assumption of equality of covariances across groups. Since Box's M is significant, you should request separate matrices to see if it gives radically different classification results. See the section on specifying separate-groups covariance matrices for more information. (Checking Homogeneity of Covariance Matrices)
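For reference, Box's M and its chi-square approximation can be computed directly from the per-group covariance matrices. The sketch below follows the standard textbook formula; SPSS reports an F approximation rather than this chi-square approximation, so the figures will not match the table exactly.

```python
# Box's M test of equality of covariance matrices (chi-square approximation).
import numpy as np
from scipy import stats

def box_m(groups):
    """groups: list of (n_i, p) NumPy arrays, one per group, columns = predictors."""
    g = len(groups)
    p = groups[0].shape[1]
    ns = np.array([grp.shape[0] for grp in groups])
    covs = [np.cov(grp, rowvar=False) for grp in groups]
    pooled = sum((n - 1) * S for n, S in zip(ns, covs)) / (ns.sum() - g)

    # M = (N - g) ln|S_pooled| - sum_i (n_i - 1) ln|S_i|
    m = (ns.sum() - g) * np.linalg.slogdet(pooled)[1]
    m -= sum((n - 1) * np.linalg.slogdet(S)[1] for n, S in zip(ns, covs))

    # Box's scaling factor and the chi-square approximation.
    c = ((2 * p**2 + 3 * p - 1) / (6.0 * (p + 1) * (g - 1))) * \
        (np.sum(1.0 / (ns - 1)) - 1.0 / (ns.sum() - g))
    chi2 = m * (1 - c)
    df = p * (p + 1) * (g - 1) // 2
    return m, chi2, df, stats.chi2.sf(chi2, df)

# Usage with two groups: box_m([X_no, X_yes]), each array holding the four predictors.
```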

Summary of Canonical Discriminant Functions


Eigenvalues
Function    Eigenvalue    % of Variance    Cumulative %    Canonical Correlation
1             .357(a)         100.0            100.0              .513
a. First 1 canonical discriminant functions were used in the analysis.

The eigenvalues table provides information about the relative efficacy of each discriminant function. When there are two groups, the canonical correlation is the most useful measure in the table, and it is equivalent to Pearson's correlation between the discriminant scores and the groups. (Assessing Model Fit - how well the discriminant model as a whole fits the data)

Wilks' Lambda
Test of Function(s)    Wilks' Lambda    Chi-square    df    Sig.
1                          .737           151.007      4    .000

Wilks' lambda is a measure of how well each function separates cases into groups. It is equal to the proportion of the total variance in the discriminant scores not explained by differences among the groups. Smaller values of Wilks' lambda indicate greater discriminatory ability of the function. The associated chi-square statistic tests the hypothesis that the means of the functions listed are equal across groups. The small significance value indicates that the discriminant function does better than chance at separating the groups. (Assessing Model Fit)
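For a two-group analysis the quantities in the last two tables are linked by simple formulas. The sketch below reproduces them approximately from the eigenvalue and sample sizes reported above, using Bartlett's chi-square approximation.

```python
# Relating the eigenvalue, canonical correlation, Wilks' lambda, and chi-square.
import numpy as np

eigenvalue = 0.357
n_cases, n_predictors, n_groups = 499, 4, 2

canonical_corr = np.sqrt(eigenvalue / (1 + eigenvalue))       # ~= 0.513
wilks_lambda = 1 / (1 + eigenvalue)                           # ~= 0.737
# Bartlett's approximation: chi-square = -(N - (p + g)/2 - 1) * ln(lambda)
chi_square = -(n_cases - (n_predictors + n_groups) / 2 - 1) * np.log(wilks_lambda)   # ~= 151
df = n_predictors * (n_groups - 1)                            # = 4

print(canonical_corr, wilks_lambda, chi_square, df)
```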
Standardized Canonical Discriminant Function Coefficients
                                   Function 1
Years with current employer          -.784
Years at current address             -.295
Credit card debt in thousands         .649
Debt to income ratio (x100)           .437

The standardized coefficients allow you to compare variables measured on different scales. Coefficients with large absolute values correspond to variables with greater discriminating ability. This table downgrades the importance of Debt to income ratio (x100), but the order is otherwise the same. (Assessing the Contribution of Individual Predictors)
Structure Matrix
                                   Function 1
Debt to income ratio (x100)           .644
Years with current employer          -.494
Credit card debt in thousands         .387
Years at current address             -.270
Pooled within-groups correlations between discriminating variables and standardized canonical discriminant functions. Variables ordered by absolute size of correlation within function.

The structure matrix shows the correlation of each predictor variable with the discriminant function. The ordering in the structure matrix is the same as that suggested by the tests of equality of group means and is different from that in the standardized coefficients table. This disagreement is likely due to the collinearity between Years with current employer and Credit card debt in thousands noted in the correlation matrix. Since the structure matrix is unaffected by collinearity, it's safe to say that this collinearity has inflated the importance of Years with current employer and Credit card debt in thousands in the standardized coefficients table. Thus, Debt to income ratio (x100) best discriminates between defaulters and non defaulters. (Assessing the Contribution of Individual Predictors)

Canonical Discriminant Function Coefficients
                                   Function 1
Years with current employer          -.122
Years at current address             -.044
Credit card debt in thousands         .312
Debt to income ratio (x100)           .069
(Constant)                            .208
Unstandardized coefficients

Functions at Group Centroids
Previously defaulted               Function 1
No                                   -.343
Yes                                  1.037
Unstandardized canonical discriminant functions evaluated at group means

Classification Statistics
Classification Processing Summary
Processed                                                   850
Excluded: Missing or out-of-range group codes                 0
Excluded: At least one missing discriminating variable        0
Used in Output                                              850

Classification Function Coefficients
                                   Previously defaulted
                                      No          Yes
Years with current employer          .277        .109
Years at current address             .145        .085
Credit card debt in thousands       -.734       -.303
Debt to income ratio (x100)          .291        .386
(Constant)                         -3.485      -3.676
Fisher's linear discriminant functions

The classification functions are used to assign cases to groups. There is a separate function for each group. For each case, a classification score is computed for each function. The discriminant model assigns the case to the group whose classification function obtained the highest score. The coefficients for Years with current employer and Years at current address are smaller for the Yes classification function, which means that customers who have lived at the same address and worked at the same company for many years are less likely to default. Similarly, customers with greater debt are more likely to default.

For example, consider cases 701 and 703. Case 701 has had the same employer for 16 years, lived at her current address for 13 years, and has debt equal to 10.9% of her income, $540 of which is credit card debt. The discriminant model predicts that there is only about an 8% chance that she will default on the loan, so she is a good credit risk. Case 703 has had the same employer and lived at the same address for fewer years and has greater debts, so the model sees him as a poor credit risk. (Classifying Customers as High or Low Credit Risks)
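Using the Fisher classification function coefficients from the table above together with case 701's values from the preceding paragraph, the classification scores and posterior probability can be reproduced by hand. The sketch below recovers roughly the 8% default probability mentioned in the text; small discrepancies come from rounding the published coefficients.

```python
# Scoring case 701 with the Fisher classification functions (equal priors).
import numpy as np

# Coefficient order: years with employer, years at address, credit card debt (thousands),
# debt-to-income ratio (x100), constant.
coef_no  = np.array([0.277, 0.145, -0.734, 0.291, -3.485])
coef_yes = np.array([0.109, 0.085, -0.303, 0.386, -3.676])

# Case 701: 16 years with employer, 13 years at address, 0.54 thousand dollars of credit
# card debt, debt-to-income ratio 10.9; the trailing 1.0 multiplies the constant.
case_701 = np.array([16.0, 13.0, 0.54, 10.9, 1.0])

scores = {"No": coef_no @ case_701, "Yes": coef_yes @ case_701}
predicted = max(scores, key=scores.get)          # assign to the highest-scoring group

# With the priors folded into the constants, the posterior follows from the score difference.
p_default = np.exp(scores["Yes"]) / (np.exp(scores["No"]) + np.exp(scores["Yes"]))
print(scores, "->", predicted, f"P(default) ~= {p_default:.2f}")
```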

Classification Results(b,c,d)

Cases Selected
                                  Previously defaulted    Predicted: No    Predicted: Yes    Total
Original            Count                 No                   281               94           375
                                          Yes                   30               94           124
                    %                     No                   74.9             25.1         100.0
                                          Yes                  24.2             75.8         100.0
Cross-validated(a)  Count                 No                   278               97           375
                                          Yes                   31               93           124
                    %                     No                   74.1             25.9         100.0
                                          Yes                  25.0             75.0         100.0

Cases Not Selected
Original            Count                 No                   106               36           142
                                          Yes                   10               49            59
                                          Ungrouped cases       95               55           150
                    %                     No                   74.6             25.4         100.0
                                          Yes                  16.9             83.1         100.0
                                          Ungrouped cases      63.3             36.7         100.0

a. Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
b. 75.2% of selected original grouped cases correctly classified.
c. 77.1% of unselected original grouped cases correctly classified.
d. 74.3% of selected cross-validated grouped cases correctly classified.

The classification table shows the practical results of using the discriminant model. Of the cases used to create the model, 94 of the 124 people who previously defaulted are classified correctly. 281 of the 375 non-defaulters are classified correctly. Overall, 75.2% of the cases are classified correctly.

Classifications based upon the cases used to create the model tend to be too "optimistic" in the sense that their classification rate is inflated. The cross-validated section of the table attempts to correct this by classifying each case while leaving it out from the model calculations; however, this method is generally still more "optimistic" than subset validation. Subset validation is obtained by classifying past customers who were not used to create the model. These results are shown in the Cases Not Selected section of the table. 77.1 percent of these cases were correctly classified by the model. This suggests that, overall, your model is in fact correct about three out of four times. The 150 ungrouped cases are the prospective customers, and the results here simply give a frequency table of the model-predicted groupings of these customers. (Model Validation)
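Both validation ideas have direct analogues outside SPSS. The sketch below reuses the hypothetical train and holdout DataFrames and the predictors list from the earlier workflow sketch; it runs leave-one-out classification on the model-building cases and then scores the held-out previous customers.

```python
# Leave-one-out classification and subset (holdout) validation with scikit-learn.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

lda = LinearDiscriminantAnalysis(priors=[0.5, 0.5])

# Leave-one-out: each case is classified by functions built from all the other cases.
loo_accuracy = cross_val_score(lda, train[predictors], train["default"], cv=LeaveOneOut()).mean()
print("leave-one-out accuracy:", loo_accuracy)

# Subset validation: classify the held-out previous customers not used to fit the model.
lda.fit(train[predictors], train["default"])
print("holdout accuracy:", lda.score(holdout[predictors], holdout["default"]))
```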

Specifying Separate-Groups Covariance Matrices


Since Box's M is significant, it's worth running a second analysis to see whether using a separate-groups covariance matrix changes the classification. The classification results have not changed much, so it's probably not worth using separate covariance matrices. Box's M can be overly sensitive to large data files, which is likely what happened here.

Adjusting Prior Probabilities


Prior Probabilities for Groups
Previously defaulted    Prior    Cases Used in Analysis (Unweighted / Weighted)
No                       .500            375 / 375.000
Yes                      .500            124 / 124.000
Total                   1.000            499 / 499.000

This table displays the prior probabilities for membership in groups. A prior probability is an estimate of the likelihood that a case belongs to a particular group when no other information about it is available. Unless you specified otherwise, it is assumed that a case is equally likely to be a defaulter or non defaulter. Prior probabilities are used along with the data to determine the classification functions. Adjusting the prior probabilities according to the group sizes can improve the overall classification rate. To obtain a classification using non-uniform priors, recall the Discriminant Analysis dialog box, click Classify, select Compute from group sizes, and select Within-groups.

The prior probabilities are now based on the sizes of the groups. A priori, 75.2% of the cases are non defaulters, so the classification functions will now be weighted more heavily in favor of classifying cases as non defaulters.

The overall classification rate is higher for these classifications than for the ones based on equal priors. Unfortunately, this comes at the cost of misclassifying a greater percentage of defaulters. If you need to be conservative in your lending, then your goal is to identify defaulters, and you'd be better off using equal priors. If you can be more aggressive in your lending, then you can afford to use unequal priors.
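In the earlier Python sketch this corresponds to switching the priors argument from equal priors to priors computed from the group sizes; the comparison below reuses the hypothetical train and holdout frames from that sketch.

```python
# Equal priors versus priors computed from group sizes.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

y = train["default"]
size_priors = y.value_counts(normalize=True).sort_index().to_numpy()   # roughly [0.75, 0.25]

lda_equal = LinearDiscriminantAnalysis(priors=[0.5, 0.5]).fit(train[predictors], y)
lda_sized = LinearDiscriminantAnalysis(priors=size_priors).fit(train[predictors], y)

# Size-based priors tend to raise overall accuracy while catching fewer defaulters.
print("equal priors, holdout accuracy:     ", lda_equal.score(holdout[predictors], holdout["default"]))
print("group-size priors, holdout accuracy:", lda_sized.score(holdout[predictors], holdout["default"]))
```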

Summary Using Discriminant Analysis, you created a model that classifies customers as high or low credit risks. Box's M showed a possible problem with heterogeneity of the covariance matrices, although further investigation revealed this was probably an effect of the size of the data file. The use of unequal priors to take advantage of the fact that non defaulters outnumber defaulters resulted in a higher overall classification rate but at the cost of missing defaulters.

Contents
Discriminant Analysis
Discriminant Analysis Model
Using Discriminant Analysis to Assess Credit Risk
Preparing the Data for Analysis
Running the Analysis
Classifying Customers as High or Low Credit Risks
Checking Collinearity of Predictors
Checking for Correlation of Group Means and Variances
Checking Homogeneity of Covariance Matrices
Assessing the Contribution of Individual Predictors
Tests of Equality of Group Means
Standardized Canonical Discriminant Function Coefficients
Structure Matrix
Assessing Model Fit
Eigenvalues
Wilks' Lambda
Model Validation
Specifying Separate-Groups Covariance Matrices
Adjusting Prior Probabilities
Summary
