
Annotated SPSS Output

Descriptive statistics

This page shows examples of how to obtain descriptive statistics, with footnotes explaining the
output. The data used in these examples were collected on 200 high school students and are
scores on various tests, including science, math, reading and social studies (socst). The variable
female is a dichotomous variable coded 1 if the student was female and 0 if male.

In the syntax below, the get file command is used to load the data into SPSS. In quotes, you
need to specify where the data file is located on your computer. Remember that you need to use
the .sav extension and that you need to end the command (and all commands) with a period.
There are several commands that you can use to get descriptive statistics for a continuous
variable. We will show two: descriptives and examine. We have added some options to each
of these commands, and we have deleted unnecessary subcommands to make the syntax as short
and understandable as possible. You will find that the examine command always produces a lot
of output. This can be very helpful if you know what you are looking for, but can be
overwhelming if you are not used to it. If you need just a few numbers, you may want to use the
descriptives command. Each is shown below.

We will use the hsb2.sav data file for our example.

get file "c:\hsb2.sav".

descriptives write
/statistics = mean stddev variance min max semean kurtosis skewness.

descriptives write
/statistics = mean stddev variance min max semean kurtosis skewness.
a. Valid N (listwise) - This is the number of non-missing values.

b. N - This is the number of valid observations for the variable. The total number of
observations is the sum of N and the number of missing values.

c. Minimum - This is the minimum, or smallest, value of the variable.

d. Maximum - This is the maximum, or largest, value of the variable.

e. Mean - This is the arithmetic mean across the observations. It is the most widely used
measure of central tendency. It is commonly called the average. The mean is sensitive to
extremely large or small values.

f. Std. - Standard deviation is the square root of the variance. It measures the spread of a set of
observations. The larger the standard deviation is, the more spread out the observations are.

g. Variance - The variance is a measure of variability. It is the sum of the squared distances of
the data values from the mean divided by the variance divisor. The Corrected SS is the sum of
squared distances of the data values from the mean. Therefore, the variance is the corrected SS
divided by N-1. We don't generally use variance as an index of spread because it is in squared
units. Instead, we use standard deviation.
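As a quick check on how these quantities fit together, here is a minimal Python sketch (not part of the SPSS output; the scores are made-up illustrative values) that computes the corrected sum of squares, the variance with the N-1 divisor, and the standard deviation as its square root.

# Illustrative only: a few made-up scores, not the hsb2 write variable.
scores = [35, 41, 52, 54, 59, 63]

n = len(scores)
mean = sum(scores) / n

# Corrected SS: sum of squared distances of the data values from the mean.
corrected_ss = sum((x - mean) ** 2 for x in scores)

variance = corrected_ss / (n - 1)   # variance divisor is N - 1
std_dev = variance ** 0.5           # standard deviation is the square root of the variance

print(mean, corrected_ss, variance, std_dev)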

h. Skewness - Skewness measures the degree and direction of asymmetry. A symmetric
distribution such as a normal distribution has a skewness of 0, and a distribution that is skewed to
the left, e.g. when the mean is less than the median, has a negative skewness.

i. Kurtosis - Kurtosis is a measure of the heaviness of the tails of a distribution. In SPSS, a
normal distribution has kurtosis 0. Extremely nonnormal distributions may have high positive or
negative kurtosis values, while nearly normal distributions will have kurtosis values close to 0.
Kurtosis is positive if the tails are "heavier" than for a normal distribution and negative if the
tails are "lighter" than for a normal distribution. Please see our FAQ What's with the different
formulas for kurtosis?
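If you want to reproduce these two statistics outside SPSS, the sketch below uses SciPy's bias-corrected estimators, which should correspond to the skewness and (excess) kurtosis values SPSS reports; the data here are made-up illustrative values, not the hsb2 variable write.

# Illustrative only: made-up data.
from scipy.stats import skew, kurtosis

x = [31, 33, 35, 39, 41, 44, 46, 52, 54, 57, 59, 60, 62, 65, 67]

# bias=False requests the bias-corrected sample estimators,
# which should match the formulas SPSS uses.
print("skewness:", skew(x, bias=False))
# fisher=True returns excess kurtosis, so a normal distribution is 0.
print("kurtosis:", kurtosis(x, fisher=True, bias=False))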

examine write
/plot boxplot stemleaf histogram
/percentiles(5,10,25,50,75,90,95,99).
writing score Stem-and-Leaf Plot
Frequency Stem & Leaf
4.00 3 . 1111
4.00 3 . 3333
2.00 3 . 55
5.00 3 . 66777
6.00 3 . 899999
13.00 4 . 0001111111111
3.00 4 . 223
13.00 4 . 4444444444445
11.00 4 . 66666666677
11.00 4 . 99999999999
2.00 5 . 00
16.00 5 . 2222222222222223
20.00 5 . 44444444444444444555
12.00 5 . 777777777777
25.00 5 . 9999999999999999999999999
8.00 6 . 00001111
22.00 6 . 2222222222222222223333
16.00 6 . 5555555555555555
7.00 6 . 7777777
Stem width: 10.00
Each leaf: 1 case(s)

examine write
/plot boxplot stemleaf histogram
/percentiles(5,10,25,50,75,90,95,99).

Case processing summary

a. Valid - This refers to the non-missing cases. In this column, the N is given, which is the
number of non-missing cases; and the Percent is given, which is the percent of non-missing
cases.
b. Missing - This refers to the missing cases. In this column, the N is given, which is the
number of missing cases; and the Percent is given, which is the percent of the missing cases.

c. Total - This refers to the total number of cases, both non-missing and missing. In this column,
the N is given, which is the total number of cases in the data set; and the Percent is given, which
is the total percent of cases in the data set.

Descriptive statistics

a. Statistic - These are the descriptive statistics.

b. Std. Error - These are the standard errors for the descriptive statistics. The standard error
gives some idea about the variability possible in the statistic.

c. Mean - This is the arithmetic mean across the observations. It is the most widely used
measure of central tendency. It is commonly called the average. The mean is sensitive to
extremely large or small values.

d. 95% Confidence Interval for Mean Lower Bound - This is the lower (95%) confidence
limit for the mean. If we repeatedly drew samples of 200 students' writing test scores and
calculated the mean for each sample, we would expect that 95% of them would fall between the
lower and the upper 95% confidence limits. This gives you some idea about the variability of the
estimate of the true population mean.

e. 95% Confidence Interval for Mean Upper Bound - This is the upper (95%) confidence
limit for the mean.
f. 5% Trimmed Mean - This is the mean that would be obtained if the lower and upper 5% of
values of the variable were deleted. If the value of the 5% trimmed mean is very different from
the mean, this indicates that there are some outliers. However, you cannot assume that all
outliers have been removed from the trimmed mean.
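A rough way to reproduce this statistic outside SPSS is SciPy's trim_mean, which drops the stated proportion from each tail before averaging; exact agreement with SPSS is not guaranteed when 5% of N is not a whole number. The data below are made up for illustration.

# Illustrative only: 20 made-up scores, including one high outlier.
from scipy.stats import trim_mean

x = [31, 33, 35, 36, 41, 44, 46, 47, 50, 52,
     54, 55, 57, 59, 61, 63, 65, 66, 67, 99]

print("mean:", sum(x) / len(x))
# proportiontocut=0.05 removes the lowest 5% and highest 5% of the values
# before computing the mean, analogous to SPSS's 5% Trimmed Mean.
print("5% trimmed mean:", trim_mean(x, proportiontocut=0.05))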

g. Median - This is the median. The median splits the distribution such that half of all values
are above this value, and half are below.

h. Variance - The variance is a measure of variability. It is the sum of the squared distances of
the data values from the mean divided by the variance divisor. The Corrected SS is the sum of
squared distances of the data values from the mean. Therefore, the variance is the corrected SS
divided by N-1. We don't generally use variance as an index of spread because it is in squared
units. Instead, we use standard deviation.

i. Std. Deviation - Standard deviation is the square root of the variance. It measures the spread of
a set of observations. The larger the standard deviation is, the more spread out the observations
are.

j. Minimum - This is the minimum, or smallest, value of the variable.

k. Maximum - This is the maximum, or largest, value of the variable.

l. Range - The range is a measure of the spread of a variable. It is equal to the difference
between the largest and the smallest observations. It is easy to compute and easy to understand.
However, because it depends only on the two most extreme observations, it is insensitive to the
variability of the values in between and can be strongly affected by a single outlier.

m. Interquartile Range - The interquartile range is the difference between the upper and the
lower quartiles. It measures the spread of a data set. It is robust to extreme observations.

n. Skewness - Skewness measures the degree and direction of asymmetry. A symmetric
distribution such as a normal distribution has a skewness of 0, and a distribution that is skewed to
the left, e.g. when the mean is less than the median, has a negative skewness.

o. Kurtosis - Kurtosis is a measure of the heaviness of the tails of a distribution. In SPSS, a
normal distribution has kurtosis 0. Extremely nonnormal distributions may have high positive or
negative kurtosis values, while nearly normal distributions will have kurtosis values close to 0.
Kurtosis is positive if the tails are "heavier" than for a normal distribution and negative if the
tails are "lighter" than for a normal distribution.
Percentiles

a. Weighted Average - These are the percentiles for the variable write. Some of the values are
fractional, which is a result of how they are calculated. If there is not a value at exactly the 5th
percentile, for example, the value is interpolated. There are several different ways of calculating
these values, so SPSS clarifies what it is doing by indicating that it is using "Definition 1".

b. Tukey's Hinges - These are the first, second and third quartile. They are calculated the way
that Tukey originally proposed when he came up with the idea of a boxplot. The values are not
interpolated; rather, they are approximations that can be obtained with little calculation.

c. Percentiles - These columns give you the values of the variable at various percentiles.
These tell you about the distribution of the variable. Percentiles are determined by ordering the
values of the variable from lowest to highest and then finding the value below which the given
percent of the observations fall. For example, in the column labeled 5, the 5th percentile of write
is given as 35.05. Because this is a weighted average, the target position falls between two of the
ordered values (here 35 and 36), and SPSS interpolates between them, which is why the reported
value is fractional.

d. 25 - This is the 25th percentile, also known as the first quartile.

e. 50 - This is the 50th percentile, also known as the median. It is a measure of central tendency.
It is the middle number when the values are arranged in ascending (or descending) order.
Sometimes, the median is a better measure of central tendency than the mean. It is less sensitive
than the mean to extreme observations.

f. 75 - This is the 75th percentile, also known as the third quartile.
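If you want to see where a fractional value such as the 35.05 mentioned above comes from, the sketch below mimics the weighted-average (Definition 1) rule, which places the pth percentile at position (n+1)p in the ordered data and interpolates between the two surrounding values; NumPy's "weibull" method (NumPy 1.22 or later) should implement the same definition.

import numpy as np

# With n = 200 ordered writing scores, the 5th percentile sits at position
# 0.05 * (200 + 1) = 10.05, i.e. 5% of the way from the 10th to the 11th value.
# From the stem-and-leaf plot, the 10th ordered value is 35 and the 11th is 36.
lower, upper = 35, 36
fraction = 0.05 * (200 + 1) - 10           # 0.05
print(lower + fraction * (upper - lower))  # 35.05

# The same rule applied to a small made-up sample:
x = np.array([31, 33, 35, 36, 41, 44, 46, 52, 54, 57])
print(np.percentile(x, 25, method="weibull"))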

Histogram
a. A histogram shows the frequency of values of a variable. The size of the bins is determined
by default when you use the examine command to create a histogram, but you can use either the
graph or ggraph command to create a histogram over which you can have much more control.
In this histogram, each bin contains two values. For example, the first bin contains values 30 and
31, the second bin contains 32 and 33, and so on. The histogram is a graphical representation of
the percentiles that were displayed above. As with percentiles, the purpose of the histogram is
to give you an idea about the distribution of the variable.
Stem and leaf plot

writing score Stem-and-Leaf Plot


Frequency (a)    Stem (b) &  Leaf (c)
4.00 3 . 1111
4.00 3 . 3333
2.00 3 . 55
5.00 3 . 66777
6.00 3 . 899999
13.00 4 . 0001111111111
3.00 4 . 223
13.00 4 . 4444444444445
11.00 4 . 66666666677
11.00 4 . 99999999999
2.00 5 . 00
16.00 5 . 2222222222222223
20.00 5 . 44444444444444444555
12.00 5 . 777777777777
25.00 5 . 9999999999999999999999999
8.00 6 . 00001111
22.00 6 . 2222222222222222223333
16.00 6 . 5555555555555555
7.00 6 . 7777777
Stem width: 10.00
Each leaf: 1 case(s)

a. Frequency - This is the frequency of the leaves.

b. Stem - This is the stem. It is the number in the 10s place of the value of the variable. For
example, in the first line, the stem is 3 and the leaves are 1s, so the value of the variable is 31.
The 3 is in the 10s place, so it is the stem.

c. Leaf - This is the leaf. It is the number in the 1s place of the value of the variable. The
number of leaves tells you how many of these numbers are in the variable. For example, on the
fifth line, there is one 8 and five 9s (hence, the frequency is six). This means that there is one
value of 38 and five values of 39 in the variable write.
Boxplot

a. This is the maximum score unless there are values more than 1.5 times the interquartile
range above Q3, in which case, it is the third quartile plus 1.5 times the interquartile range (the
difference between the first and the third quartile).

b. This is the third quartile (Q3), also known as the 75th percentile.

c. This is the median (Q2), also known as the 50th percentile.

d. This is the first quartile (Q1), also known as the 25th percentile.

e. This is the minimum score unless there are values less than 1.5 times the interquartile range
below Q1, in which case, it is the first quartile minus 1.5 times the interquartile range.
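The whisker rule described above can be written out directly; the following short sketch (made-up data) computes the fences from the quartiles and finds the values the whiskers would reach.

import numpy as np

# Illustrative only: made-up scores.
x = np.array([31, 35, 41, 44, 46, 52, 54, 57, 59, 63, 65, 67])

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

# Whiskers extend to the most extreme data points that are still within
# 1.5 * IQR of the quartiles; points beyond that are plotted as outliers.
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

upper_whisker = x[x <= upper_fence].max()
lower_whisker = x[x >= lower_fence].min()
print(lower_whisker, q1, np.median(x), q3, upper_whisker)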
Logistic Regression

This page shows an example of logistic regression with footnotes explaining the output. These
data were collected on 200 high school students and are scores on various tests, including
science, math, reading and social studies (socst). The variable female is a dichotomous variable
coded 1 if the student was female and 0 if male.

In the syntax below, the get file command is used to load the data into SPSS. In quotes, you
need to specify where the data file is located on your computer. Remember that you need to use
the .sav extension and that you need to end the command with a period. By default, SPSS does a
listwise deletion of missing values. This means that only cases with non-missing values for the
dependent as well as all independent variables will be used in the analysis.

Because we do not have a suitable dichotomous variable to use as our dependent variable, we
will create one (which we will call honcomp, for honors composition) based on the continuous
variable write. We do not advocate making dichotomous variables out of continuous variables;
rather, we do this here only for purposes of this illustration.

Use the keyword with after the dependent variable to indicate all of the variables (both
continuous and categorical) that you want included in the model. If you have a categorical
variable with more than two levels, for example, a three-level ses variable (low, medium and
high), you can use the categorical subcommand to tell SPSS to create the dummy variables
necessary to include the variable in the logistic regression, as shown below. You can use the
keyword by to create interaction terms. For example, the command logistic regression
honcomp with read female read by female. will create a model with the main effects of read
and female, as well as the interaction of read by female.

We will start by showing the SPSS commands to open the data file, creating the dichotomous
dependent variable, and then running the logistic regression. We will show the entire output, and
then break up the output with explanation.

get file "c:\hsb2.sav".

compute honcomp = (write ge 60).
exe.

logistic regression honcomp with read science ses
  /categorical ses.
Logistic Regression

Block 0: Beginning Block


Block 1: Method = Enter
This part of the output tells you about the cases that were included and excluded from the
analysis, the coding of the dependent variable, and coding of any categorical variables listed on
the categorical subcommand. (Note: You will not get the third table ("Categorical Variable
Codings") if you do not specify the categorical subcommand.)

Logistic Regression
b. N - This is the number of cases in each category (e.g., included in the analysis, missing, total).

c. Percent - This is the percent of cases in each category (e.g., included in the analysis, missing,
total).

d. Included in Analysis - This row gives the number and percent of cases that were included in
the analysis. Because we have no missing data in our example data set, this also corresponds to
the total number of cases.

e. Missing Cases - This row gives the number and percent of missing cases. By default, SPSS
logistic regression does a listwise deletion of missing data. This means that if there is missing
value for any variable in the model, the entire case will be excluded from the analysis.

f. Total - This is the sum of the cases that were included in the analysis and the missing cases.
In our example, 200 + 0 = 200.

Unselected Cases - This row gives the number and percent of cases that were excluded from the
analysis by a select subcommand; because no selection was used in this example, there are none.

This part of the output describes a "null model", which is a model with no predictors and just the
intercept. This is why you will see all of the variables that you put into the model in the table
titled "Variables not in the Equation".

Block 0: Beginning Block


c. Step 0 - SPSS allows you to have different steps in your logistic regression model. The
difference between the steps is the predictors that are included. This is similar to blocking
variables into groups and then entering them into the equation one group at a time. By default,
SPSS logistic regression is run in two steps. The first step, called Step 0, includes no predictors
and just the intercept. Often, this model is not interesting to researchers.

d. Observed - This indicates the number of 0's and 1's that are observed in the dependent
variable.

e. Predicted - In this null model, SPSS has predicted that all cases are 0 on the dependent
variable.

f. Overall Percentage - This gives the percent of cases for which the dependent variable was
correctly predicted given the model. In this part of the output, this is the null model: 73.5% =
147/200.

g. B - This is the coefficient for the constant (also called the "intercept") in the null model.

h. S.E. - This is the standard error around the coefficient for the constant.

i. Wald and Sig. - This is the Wald chi-square test that tests the null hypothesis that the constant
equals 0. This hypothesis is rejected because the p-value (listed in the column called "Sig.") is
smaller than the critical p-value of .05 (or .01). Hence, we conclude that the constant is not 0.
Usually, this finding is not of interest to researchers.

j. df - This is the degrees of freedom for the Wald chi-square test. There is only one degree of
freedom because there is only one predictor in the model, namely the constant.

k. Exp(B) - This is the exponentiation of the B coefficient, which is an odds ratio. This value is
given by default because odds ratios can be easier to interpret than the coefficient, which is in
log-odds units. This is the odds: 53/147 = .361.

l. Score and Sig. - This is a Score test that is used to predict whether or not an independent
variable would be significant in the model. Looking at the p-values (located in the column
labeled "Sig."), we can see that each of the predictors would be statistically significant except the
first dummy for ses.

m. df - This column lists the degrees of freedom for each variable. Each variable to be entered
into the model, e.g., read, science, ses(1) and ses(2), has one degree of freedom, which leads to
the total of four shown at the bottom of the column. The variable ses is listed here only to show
that if the dummy variables that represent ses were tested simultaneously, the variable ses would
be statistically significant.

n. Overall Statistics - This shows the result of including all of the predictors into the model.

The section contains what is frequently the most interesting part of the output: the overall test of
the model (in the "Omnibus Tests of Model Coefficients" table) and the coefficients and odds
ratios (in the "Variables in the Equation" table).

Block 1: Method = Enter


b. Step 1 - This is the first step (or model) with predictors in it. In this case, it is the full model
that we specified in the logistic regression command. You can have more steps if you do
stepwise or use blocking of variables.

c. Chi-square and Sig. - This is the chi-square statistic and its significance level. In this
example, the statistics for the Step, Model and Block are the same because we have not used
stepwise logistic regression or blocking. The value given in the Sig. column is the probability of
obtaining the chi-square statistic given that the null hypothesis is true. In other words, this is the
probability of obtaining this chi-square statistic (65.588) if there is in fact no effect of the
independent variables, taken together, on the dependent variable. This is, of course, the p-value,
which is compared to a critical value, perhaps .05 or .01 to determine if the overall model is
statistically significant. In this case, the model is statistically significant because the p-value,
reported as .000, is smaller than .05.

d. df - This is the number of degrees of freedom for the model. There is one degree of freedom
for each predictor in the model. In this example, we have four predictors: read, science and two
dummies for ses (because there are three levels of ses).

e. -2 Log likelihood - This is the -2 log likelihood for the final model. By itself, this number is
not very informative. However, it can be used to compare nested (reduced) models.

f. Cox & Snell R Square and Nagelkerke R Square - These are pseudo R-squares. Logistic
regression does not have an equivalent to the R-squared that is found in OLS regression;
however, many people have tried to come up with one. There are a wide variety of pseudo-R-
square statistics (these are only two of them). Because these statistics do not mean what R-
squared means in OLS regression (the proportion of variance explained by the predictors), we
suggest interpreting them with great caution.

g. Observed - This indicates the number of 0's and 1's that are observed in the dependent
variable.

h. Predicted - These are the predicted values of the dependent variable based on the full logistic
regression model. This table shows how many cases are correctly predicted (132 cases are
observed to be 0 and are correctly predicted to be 0; 27 cases are observed to be 1 and are
correctly predicted to be 1), and how many cases are not correctly predicted (15 cases are
observed to be 0 but are predicted to be 1; 26 cases are observed to be 1 but are predicted to be
0).

i. Overall Percentage - This gives the overall percent of cases that are correctly predicted by
the model (in this case, the full model that we specified). As you can see, this percentage has
increased from 73.5 for the null model to 79.5 for the full model.

j. B - These are the values for the logistic regression equation for predicting the dependent
variable from the independent variable. They are in log-odds units. Similar to OLS regression,
the prediction equation is

log(p/(1-p)) = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4

where p is the probability of being in honors composition. Expressed in terms of the variables
used in this example, the logistic regression equation is

log(p/(1-p)) = -9.561 + 0.098*read + 0.066*science + 0.058*ses(1) - 1.013*ses(2)

These estimates tell you about the relationship between the independent variables and the
dependent variable, where the dependent variable is on the logit scale. These estimates tell the
amount of increase (or decrease, if the sign of the coefficient is negative) in the predicted log
odds of honcomp = 1 that would be predicted by a 1 unit increase (or decrease) in the predictor,
holding all other predictors constant. Note: For the independent variables which are not
significant, the coefficients are not significantly different from 0, which should be taken into
account when interpreting the coefficients. (See the columns labeled Wald and Sig. regarding
testing whether the coefficients are statistically significant). Because these coefficients are in
log-odds units, they are often difficult to interpret, so they are often converted into odds ratios.
You can do this by hand by exponentiating the coefficient, or by looking at the right-most
column in the Variables in the Equation table labeled "Exp(B)".
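As a sketch of that conversion, the code below exponentiates the coefficients from the equation above to obtain odds ratios and computes a predicted probability for one hypothetical student; only the coefficients come from the output, while the read and science values are made up for illustration.

import math

constant = -9.561
coefs = {"read": 0.098, "science": 0.066, "ses(1)": 0.058, "ses(2)": -1.013}

# Odds ratios (the Exp(B) column) are the exponentiated coefficients.
for name, b in coefs.items():
    print(name, "Exp(B) =", round(math.exp(b), 3))

# Hypothetical student: read = 60, science = 55, ses level 3 (both dummies are 0).
log_odds = constant + coefs["read"] * 60 + coefs["science"] * 55
prob = 1 / (1 + math.exp(-log_odds))
print("predicted probability of honcomp = 1:", round(prob, 3))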

read - For every one-unit increase in reading score (so, for every additional point on the reading
test), we expect a 0.098 increase in the log-odds of honcomp, holding all other independent
variables constant.
science - For every one-unit increase in science score, we expect a 0.066 increase in the log-
odds of honcomp, holding all other independent variables constant.
ses - This tells you if the overall variable ses is statistically significant. There is no coefficient
listed, because ses is not a variable in the model. Rather, dummy variables which code for ses
are in the equation, and those have coefficients. However, as you can see in this example, the
coefficient for one of the dummies is statistically significant while the other one is not. The
statistic given on this row tells you if the dummies that represent ses, taken together, are
statistically significant. Because there are two dummies, this test has two degrees of freedom.
This is equivalent to using the test statement in SAS or the test command in Stata.
ses(1) - The reference group is level 3 (see the Categorical Variables Codings table above), so
this coefficient represents the difference between level 1 of ses and level 3. Note: The number
in the parentheses only indicates the number of the dummy variable; it does not tell you anything
about which levels of the categorical variable are being compared. For example, if you changed
the reference group from level 3 to level 1, the labeling of the dummy variables in the output
would not change.
ses(2) - The reference group is level 3 (see the Categorical Variables Codings table above), so
this coefficient represents the difference between level 2 of ses and level 3. Note: The number
in the parentheses only indicates the number of the dummy variable; it does not tell you anything
about which levels of the categorical variable are being compared. For example, if you changed
the reference group from level 3 to level 1, the labeling of the dummy variables in the output
would not change.
constant - This is the expected value of the log-odds of honcomp when all of the predictor
variables equal zero. In most cases, this is not interesting. Also, oftentimes zero is not a realistic
value for a variable to take.

k. S.E. - These are the standard errors associated with the coefficients. The standard error is
used for testing whether the parameter is significantly different from 0; dividing the parameter
estimate by the standard error gives a z-value, and the Wald statistic reported by SPSS is the
square of that ratio. The standard errors can also be used to form a confidence interval for the
parameter.

l. Wald and Sig. - These columns provide the Wald chi-square value and 2-tailed p-value used in
testing the null hypothesis that the coefficient (parameter) is 0. If you use a 2-tailed test, then
you would compare each p-value to your preselected value of alpha. Coefficients having p-
values less than alpha are statistically significant. For example, if you chose alpha to be 0.05,
coefficients having a p-value of 0.05 or less would be statistically significant (i.e., you can reject
the null hypothesis and say that the coefficient is significantly different from 0). If you use a 1-
tailed test (i.e., you predict that the parameter will go in a particular direction), then you can
divide the p-value by 2 before comparing it to your preselected alpha level.
For the variable read, the p-value is .000, so the null hypothesis that the coefficient equals 0
would be rejected.
For the variable science, the p-value is .015, so the null hypothesis that the coefficient equals 0
would be rejected.
For the variable ses, the p-value is .035, so the null hypothesis that the coefficient equals 0
would be rejected. Because the test of the overall variable is statistically significant, you can
look at the one degree of freedom tests for the dummies ses(1) and ses(2). The dummy ses(1) is
not statistically significantly different from the dummy ses(3) (which is the omitted, or reference,
category), but the dummy ses(2) is statistically significantly different from the dummy ses(3)
with a p-value of .022.

m. df - This column lists the degrees of freedom for each of the tests of the coefficients.

n. Exp(B) - These are the odds ratios for the predictors. They are the exponentiation of the
coefficients. There is no odds ratio for the variable ses because ses (as a variable with 2 degrees
of freedom) was not entered into the logistic regression equation.
Odds Ratios

In this next example, we will illustrate the interpretation of odds ratios. In this example, we will
simplify our model so that we have only one predictor, the binary variable female. Before we
run the logistic regression, we will use the crosstabs command to obtain a crosstab of the two
variables.

crosstabs female by honcomp.

If we divide the number of males who are in honors composition, 18, by the number of males
who are not in honors composition, 73, we get the odds of being in honors composition for
males, 18/73 = .246. If we do the same thing for females, we get 35/74 = .472. To get the odds
ratio, which is the ratio of the two odds that we have just calculated, we get .472/.246 = 1.918.
As we can see in the output below, this is exactly the odds ratio we obtain from the logistic
regression. The thing to remember here is that you want the group coded as 1 over the group
coded as 0, so honcomp=1/honcomp=0 for both males and females, and then the odds for
females/odds for males, because the females are coded as 1.
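The arithmetic in this paragraph is easy to verify; the short sketch below recomputes the two odds and their ratio from the crosstab counts.

# Counts from the crosstab of female by honcomp.
males_in, males_out = 18, 73        # honcomp = 1 vs. honcomp = 0 for males
females_in, females_out = 35, 74    # honcomp = 1 vs. honcomp = 0 for females

odds_males = males_in / males_out        # about .246
odds_females = females_in / females_out  # about .472

# Odds ratio for females relative to males (female is coded 1).
print(odds_females / odds_males)         # about 1.918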

You can get the odds ratio from the crosstabs command by using the /statistics risk
subcommand, as shown below.

crosstabs female by honcomp
  /statistics risk.
As you can see in the output below, we get the same odds ratio when we run the logistic
regression. (NOTE: Although it is equivalent to the odds ratio estimated from the logistic
regression, the odds ratio in the "Risk Estimate" table is calculated as the ratio of the odds of
honcomp=0 for males over the odds of honcomp=0 for females, which explains the confusing
row heading "Odds Ratio for female (.00/1.00)"). If we calculated a 95% confidence interval, we
would not want this to include the value of 1. When we were considering the coefficients, we
did not want the confidence interval to include 0. If we exponentiate 0, we get 1 (exp(0) = 1).
Hence, this is two ways of saying the same thing. As you can see, the 95% confidence interval
includes 1; hence, the odds ratio is not statistically significant. Because the lower bound of the
95% confidence interval is so close to 1, the p-value is very close to .05. We can use the /print =
ci(95) subcommand to get the 95% confidence intervals included in our output.

There are a few other things to note about the output below. The first is that although we have
only one predictor variable, the test for the odds ratio does not match with the overall test of the
model. This is because the test of the coefficient is a Wald chi-square test, while the test of the
overall model is a likelihood ratio chi-square test. While these two types of chi-square tests are
asymptotically equivalent, in small samples they can differ, as they do here. Also, we have the
unfortunate situation in which the results of the two tests give different conclusions. This does
not happen very often. In a situation like this, it is difficult to know what to conclude. One
might consider the power, or one might decide if an odds ratio of this magnitude is important
from a clinical or practical standpoint.

logistic regression honcomp with female
  /print = ci(95).
Regression Analysis

This page shows an example regression analysis with footnotes explaining the output. These
data were collected on 200 high school students and are scores on various tests, including
science, math, reading and social studies (socst). The variable female is a dichotomous variable
coded 1 if the student was female and 0 if male.

In the syntax below, the get file command is used to load the data into SPSS. In quotes, you
need to specify where the data file is located on your computer. Remember that you need to use
the .sav extension and that you need to end the command with a period. In the regression
command, the statistics subcommand must come before the dependent subcommand. You can
shorten dependent to dep. You list the independent variables after the equals sign on the
method subcommand. The statistics subcommand is not needed to run the regression, but on it
we can specify options that we would like to have included in the output. Here, we have
specified ci, which is short for confidence intervals. These are very useful for interpreting the
output, as we will see. There are four tables given in the output. SPSS has provided some
superscripts (a, b, etc.) to assist you in understanding the output.

Please note that SPSS sometimes includes footnotes as part of the output. We have left those
intact and have started ours with the next letter of the alphabet.

get file "c:\hsb2.sav".

regression
/statistics coeff outs r anova ci
/dependent science
/method = enter math female socst read.
Variables in the model

c. Model - SPSS allows you to specify multiple models in a single regression command. This
tells you the number of the model being reported.

d. Variables Entered - SPSS allows you to enter variables into a regression in blocks, and it
allows stepwise regression. Hence, you need to know which variables were entered into the
current regression. If you did not block your independent variables or use stepwise regression,
this column should list all of the independent variables that you specified.

e. Variables Removed - This column lists the variables that were removed from the current
regression. Usually, this column will be empty unless you did a stepwise regression.

f. Method - This column tells you the method that SPSS used to run the regression. "Enter"
means that each independent variable was entered in usual fashion. If you did a stepwise
regression, the entry in this column would tell you that.

Overall Model Fit


b. Model - SPSS allows you to specify multiple models in a single regression command. This
tells you the number of the model being reported.

c. R - R is the square root of R-Squared and is the correlation between the observed and
predicted values of the dependent variable.

d. R-Square - R-Square is the proportion of variance in the dependent variable (science) which
can be predicted from the independent variables (math, female, socst and read). This value
indicates that 48.9% of the variance in science scores can be predicted from the variables math,
female, socst and read. Note that this is an overall measure of the strength of association, and
does not reflect the extent to which any particular independent variable is associated with the
dependent variable. R-Square is also called the coefficient of determination.

e. Adjusted R-square - As predictors are added to the model, each predictor will explain some
of the variance in the dependent variable simply due to chance. One could continue to add
predictors to the model which would continue to improve the ability of the predictors to explain
the dependent variable, although some of this increase in R-square would be simply due to
chance variation in that particular sample. The adjusted R-square attempts to yield a more
honest value to estimate the R-squared for the population. The value of R-square was .489,
while the value of Adjusted R-square was .479. Adjusted R-square is computed using the
formula 1 - ((1 - Rsq)(N - 1 )/ (N - k - 1)). From this formula, you can see that when the number
of observations is small and the number of predictors is large, there will be a much greater
difference between R-square and adjusted R-square (because the ratio of (N - 1) / (N - k - 1) will
be much greater than 1). By contrast, when the number of observations is very large compared
to the number of predictors, the value of R-square and adjusted R-square will be much closer
because the ratio of (N - 1)/(N - k - 1) will approach 1.
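Plugging the numbers from this example into that formula reproduces the reported value; here is a quick Python check.

r_sq = 0.489   # R-square reported in the output
n = 200        # number of observations
k = 4          # number of predictors (math, female, socst, read)

adj_r_sq = 1 - ((1 - r_sq) * (n - 1) / (n - k - 1))
print(round(adj_r_sq, 3))   # about .479, matching the output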

f. Std. Error of the Estimate - The standard error of the estimate, also called the root mean
square error, is the standard deviation of the error term, and is the square root of the Mean
Square Residual (or Error).

Anova Table
c. Model - SPSS allows you to specify multiple models in a single regression command. This
tells you the number of the model being reported.

d. This is the source of variance, Regression, Residual and Total. The Total variance is
partitioned into the variance which can be explained by the independent variables (Regression)
and the variance which is not explained by the independent variables (Residual, sometimes
called Error). Note that the Sums of Squares for the Regression and Residual add up to the Total,
reflecting the fact that the Total is partitioned into Regression and Residual variance.

e. Sum of Squares - These are the Sums of Squares associated with the three sources of variance:
Regression, Residual and Total. These can be computed in many ways. Conceptually, the formulas
can be expressed as:
SSTotal - The total variability around the mean, Σ(Y - Ybar)^2.
SSResidual - The sum of squared errors in prediction, Σ(Y - Ypredicted)^2.
SSRegression - The improvement in prediction from using the predicted value of Y rather than just
the mean of Y. Hence, this is the sum of squared differences between the predicted value of
Y and the mean of Y, Σ(Ypredicted - Ybar)^2. Another way to think of this is that SSRegression =
SSTotal - SSResidual, since SSTotal = SSRegression + SSResidual. Note that
SSRegression / SSTotal is equal to .489, the value of R-Square. This is because R-Square is the
proportion of the variance explained by the independent variables, and hence can be computed as
SSRegression / SSTotal.

f. df - These are the degrees of freedom associated with the sources of variance. The total
variance has N-1 degrees of freedom. In this case, there were N=200 students, so the DF for
total is 199. The model degrees of freedom corresponds to the number of predictors minus 1 (K-
1). You may think this would be 4-1 (since there were 4 independent variables in the model,
math, female, socst and read). But, the intercept is automatically included in the model (unless
you explicitly omit the intercept). Including the intercept, there are 5 predictors, so the model
has 5-1=4 degrees of freedom. The Residual degrees of freedom is the DF total minus the DF
model, 199 - 4 is 195.

g. Mean Square - These are the Mean Squares, the Sum of Squares divided by their respective
DF. For the Regression, 9543.72074 / 4 = 2385.93019. For the Residual, 9963.77926 / 195 =
51.0963039. These are computed so you can compute the F ratio, dividing the Mean Square
Regression by the Mean Square Residual to test the significance of the predictors in the model.
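Using the sums of squares reported in this table, the mean squares, the F ratio, and R-square can all be reproduced with a few lines; a short check is sketched below.

ss_regression = 9543.72074
ss_residual = 9963.77926
ss_total = ss_regression + ss_residual

df_regression, df_residual = 4, 195

ms_regression = ss_regression / df_regression   # 2385.93019
ms_residual = ss_residual / df_residual         # 51.0963039

f_ratio = ms_regression / ms_residual           # about 46.69
r_square = ss_regression / ss_total             # about .489

print(ms_regression, ms_residual, f_ratio, r_square)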

h. F and Sig. - The F-value is the Mean Square Regression (2385.93019) divided by the Mean
Square Residual (51.0963039), yielding F=46.69. The p-value associated with this F value is
very small (0.0000). These values are used to answer the question "Do the independent variables
reliably predict the dependent variable?". The p-value is compared to your alpha level (typically
0.05) and, if smaller, you can conclude "Yes, the independent variables reliably predict the
dependent variable". You could say that the group of variables math, and female, socst and
read can be used to reliably predict science (the dependent variable). If the p-value were greater
than 0.05, you would say that the group of independent variables does not show a statistically
significant relationship with the dependent variable, or that the group of independent variables
does not reliably predict the dependent variable. Note that this is an overall significance test
assessing whether the group of independent variables when used together reliably predict the
dependent variable, and does not address the ability of any of the particular independent
variables to predict the dependent variable. The ability of each individual independent variable
to predict the dependent variable is addressed in the table below where each of the individual
variables are listed.

Parameter Estimates

b. Model - SPSS allows you to specify multiple models in a single regression command. This
tells you the number of the model being reported.

c. This column shows the predictor variables (constant, math, female, socst, read). The first
variable (constant) represents the constant, also referred to in textbooks as the Y intercept, the
height of the regression line when it crosses the Y axis. In other words, this is the predicted
value of science when all other variables are 0.

d. B - These are the values for the regression equation for predicting the dependent variable from
the independent variable. These are called unstandardized coefficients because they are
measured in their natural units. As such, the coefficients cannot be compared with one another to
determine which one is more influential in the model, because they can be measured on different
scales. For example, how can you compare the values for gender with the values for reading
scores? The regression equation can be presented in many different ways, for example:

Ypredicted = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4

The column of estimates (coefficients or parameter estimates, from here on labeled coefficients)
provides the values for b0, b1, b2, b3 and b4 for this equation. Expressed in terms of the
variables used in this example, the regression equation is

sciencePredicted = 12.325 + .389*math - 2.010*female + .050*socst + .335*read


These estimates tell you about the relationship between the independent variables and the
dependent variable. These estimates tell the amount of increase in science scores that would be
predicted by a 1 unit increase in the predictor. Note: For the independent variables which are not
significant, the coefficients are not significantly different from 0, which should be taken into
account when interpreting the coefficients. (See the columns with the t-value and p-value about
testing whether the coefficients are significant).
math - The coefficient (parameter estimate) is .389. So, for every unit (i.e., point, since this is
the metric in which the tests are measured) increase in math, a .389 unit increase in science is
predicted, holding all other variables constant. (It does not matter at what value you hold the
other variables constant, because it is a linear model.) Or, for every increase of one point on the
math test, your science score is predicted to be higher by .389 points. This is significantly
different from 0.
female - For every unit increase in female, there is a -2.010 unit decrease in the predicted
science score, holding all other variables constant. Since female is coded 0/1 (0=male,
1=female) the interpretation can be put more simply. For females the predicted science score
would be 2 points lower than for males. The variable female is technically not statistically
significantly different from 0, because the p-value is greater than .05. However, .051 is so close
to .05 that some researchers would still consider it to be statistically significant.
socst - The coefficient for socst is .050. This means that for a 1-unit increase in the social
studies score, we expect an approximately .05 point increase in the science score. This is not
statistically significant; in other words, .050 is not different from 0.
read - The coefficient for read is .335. Hence, for every unit increase in reading score we
expect a .335 point increase in the science score. This is statistically significant.
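To make the equation above concrete, the sketch below plugs in one hypothetical set of scores; the input values are made up for illustration, and only the coefficients come from the output.

def predicted_science(math_score, female, socst, read):
    # Coefficients from the regression output above.
    return 12.325 + 0.389 * math_score - 2.010 * female + 0.050 * socst + 0.335 * read

# Hypothetical student: math = 52, female = 1, socst = 52, read = 52.
print(predicted_science(52, 1, 52, 52))

# Same scores, different gender: the predictions differ by about 2 points.
print(predicted_science(52, 0, 52, 52) - predicted_science(52, 1, 52, 52))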

e. Std. Error - These are the standard errors associated with the coefficients. The standard error
is used for testing whether the parameter is significantly different from 0 by dividing the
parameter estimate by the standard error to obtain a t-value (see the column with t-values and p-
values). The standard errors can also be used to form a confidence interval for the parameter, as
shown in the last two columns of this table.

f. Beta - These are the standardized coefficients. These are the coefficients that you would
obtain if you standardized all of the variables in the regression, including the dependent and all
of the independent variables, and ran the regression. By standardizing the variables before
running the regression, you have put all of the variables on the same scale, and you can compare
the magnitude of the coefficients to see which one has more of an effect. You will also notice
that the larger betas are associated with the larger t-values.

g. t and Sig. - These columns provide the t-value and 2 tailed p-value used in testing the null
hypothesis that the coefficient/parameter is 0. If you use a 2 tailed test, then you would compare
each p-value to your preselected value of alpha. Coefficients having p-values less than alpha are
statistically significant. For example, if you chose alpha to be 0.05, coefficients having a p-value
of 0.05 or less would be statistically significant (i.e., you can reject the null hypothesis and say
that the coefficient is significantly different from 0). If you use a 1 tailed test (i.e., you predict
that the parameter will go in a particular direction), then you can divide the p-value by 2 before
comparing it to your preselected alpha level. With a 2-tailed test and alpha of 0.05, you should
not reject the null hypothesis that the coefficient for female is equal to 0, because p-value =
0.051 > 0.05. The coefficient of -2.009765 is not significantly different from 0. However, if
you hypothesized specifically that males had higher scores than females (a 1-tailed test) and used
an alpha of 0.05, the p-value of .0255 is less than 0.05 and the coefficient for female would be
significant at the 0.05 level. In this case, we could say that the female coefficient is significantly
less than 0. Neither a 1-tailed nor 2-tailed test would be significant at alpha of 0.01.
The constant is significantly different from 0 at the 0.05 alpha level. However, having a
significant intercept is seldom interesting.
The coefficient for math (.389) is statistically significantly different from 0 using alpha of 0.05
because its p-value is 0.000, which is smaller than 0.05.
The coefficient for female (-2.01) is not statistically significant at the 0.05 level since the p-
value is greater than .05.
The coefficient for socst (.05) is not statistically significantly different from 0 because its p-
value is definitely larger than 0.05.
The coefficient for read (.335) is statistically significant because its p-value of 0.000 is less
than .05.

h. [95% Conf. Interval] - These are the 95% confidence intervals for the coefficients. The
confidence intervals are related to the p-values such that the coefficient will not be statistically
significant at alpha = .05 if the 95% confidence interval includes zero. These confidence
intervals can help you to put the estimate from the coefficient into perspective by seeing how
much the value could vary.

Probit Regression

This page shows an example of probit regression analysis with footnotes explaining the output in
SPSS. The data in this example were gathered on undergraduates applying to graduate school
and includes undergraduate GPAs, the reputation of the school of the undergraduate (a topnotch
indicator), the students' GRE score, and whether or not the student was admitted to graduate
school. Using this dataset ( probit.sav ), we can predict admission to graduate school using
undergraduate GPA, GRE scores, and the reputation of the school of the undergraduate. Our
outcome variable is binary, and we will use a probit model. Thus, our model will calculate a
predicted probability of admission based on our predictors. The probit model does so using the
cumulative distribution function of the standard normal.

First, let us examine the dataset and our response variable. Our binary outcome variable must be
coded with zeros and ones, so we will include a frequency of our outcome variable admit to
check this.

get file='C:\probit.sav'.

descriptives
variables=gre gpa.
frequencies
variables=admit topnotch.

Next, we can specify our probit model using the plum command and indicating probit as our link
function.

plum admit with gre topnotch gpa
  /link = probit
  /print = parameter summary.
NOTE: It is also possible to run this probit regression in SPSS using genlin. Please note that
distribution and link are options on the /model subcommand and are not separate subcommands
(which is why there is no / in front of them).
genlin admit (reference=0) with gre gpa topnotch
/model gre gpa topnotch
distribution=binomial
link=probit
/print cps history fit solution.

Case Processing Summary


a. admit - This is the response variable predicted by the model. Here, we see that our outcome
variable is binary and we are provided with frequency counts. With our model, we predict the
probability that admit is 1 for an observation given the values of the predictors.

b. Valid - This is the number of observations in our dataset with valid and non-missing data in
the response and predictor variables specified in our model.

c. Missing - This is the number of observations in our dataset with missing data in the response
or predictor variables specified in our model. Such observations will be excluded from the
analysis.

d. Total - This is the total of the number of valid observations and missing observations. It is
equal to the number of observations in the dataset.

Model Fitting Information

e. Model - This indicates the parameters of the model for which the model fit is calculated.
"Intercept Only" describes a model that does not control for any predictor variables and simply
fits an intercept to predict the outcome variable. "Final" describes a model that includes the
specified predictor variables and has been arrived at through an iterative process that maximizes
the log likelihood of the outcomes seen in the outcome variable. By including the predictor
variables and maximizing the log likelihood of the outcomes seen in the data, the "Final" model
should improve upon the "Intercept Only" model. This can be seen in the differences in the -2
Log Likelihood values associated with the models (see superscript f).

f. -2 Log Likelihood - These are -2 times the log likelihoods of the null ("Intercept Only") model
and the fitted "Final" model. The likelihood of the model is used to test whether all predictors'
regression coefficients in the model are simultaneously zero and in tests of nested models.

g. Chi-Square - This is the Likelihood Ratio (LR) Chi-Square test that at least one of the
predictors' regression coefficient is not equal to zero in the model. The LR Chi-Square statistic
can be calculated by -2*L(null model) - (-2*L(fitted model)) = 479.887 - 457.797 = 22.090,
where L(null model) is from the log likelihood with just the response variable in the model
(Iteration 0) and L(fitted model) is the log likelihood from the final iteration (assuming the model
converged) with all the parameters.
h. df - This indicates the degrees of freedom of the Chi-Square distribution used to test the LR
Chi-Square statistic and is defined by the number of predictors in the model.

i. Sig. - This is the probability of getting a LR test statistic as extreme as, or more so, than the
observed under the null hypothesis; the null hypothesis is that all of the regression coefficients in
the model are equal to zero. In other words, this is the probability of obtaining this chi-square
statistic (22.090) if there is in fact no effect of the predictor variables. This p-value is compared
to a specified alpha level, our willingness to accept a type I error, which is typically set at 0.05 or
0.01. The small p-value from the LR test, <0.0001, would lead us to conclude that at least one of
the regression coefficients in the model is not equal to zero. The parameter of the Chi-Square
distribution used to test the null hypothesis is defined by the degrees of freedom in the prior
column.
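The chi-square value and its p-value can be reproduced from the two -2 log likelihood values and the three predictors; a brief check using SciPy is sketched below.

from scipy.stats import chi2

neg2ll_null = 479.887    # -2 log likelihood, "Intercept Only" model
neg2ll_final = 457.797   # -2 log likelihood, "Final" model

lr_chi_square = neg2ll_null - neg2ll_final   # 22.090
df = 3                                       # three predictors: gre, topnotch, gpa

p_value = chi2.sf(lr_chi_square, df)         # upper-tail probability
print(lr_chi_square, p_value)                # p is well below 0.0001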

Pseudo R-Square

j. Pseudo R-Square - These are several pseudo R-squared statistics. Probit regression does not have an
equivalent to the R-squared that is found in OLS regression; however, many people have tried to
come up with one. There are a wide variety of pseudo-R-square statistics. Because these
statistics do not mean what R-square means in OLS regression (the proportion of variance of the
response variable explained by the predictors), we suggest interpreting these statistics with great
caution. For more information on pseudo R-squareds, see What are Pseudo R-Squareds?.

Parameter Estimates

k. Estimate - These are the regression coefficients. The predicted probability of admission can
be calculated using these coefficients (the first number in the column, the coefficient for
"Threshold" is the constant term in the model). For a given record, the predicted probability of
admission is

P(admit = 1) = F(-2.798 + 0.002*gre + 0.273*topnotch + 0.401*gpa),

where F is the cumulative distribution function of the standard normal. However, interpretation
of the coefficients in probit regression is not as straightforward as the interpretations of
coefficients in linear regression or logit regression. The increase in probability attributed to a
one-unit increase in a given predictor is dependent both on the values of the other predictors and
the starting value of the given predictor. For example, if we hold gre and topnotch constant at
zero, the one-unit increase in gpa from 2 to 3 has a different effect than the one-unit increase
from 3 to 4 (the probabilities do not change by a common difference or common factor), and the
effects of these one-unit increases are different again if we hold gre and topnotch constant at
their respective means instead of zero.

However, there are limited ways in which we can interpret the individual regression coefficients.
A positive coefficient means that an increase in the predictor leads to an increase in the predicted
probability. A negative coefficient means that an increase in the predictor leads to a decrease in
the predicted probability.

gre - The coefficient of gre is 0.002. This means that an increase in GRE score increases the
predicted probability of admission.

topnotch - The coefficient of topnotch is 0.273. This means attending a top notch institution
as an undergraduate increases the predicted probability of admission.

gpa - The coefficient of gpa is 0.401. This means that an increase in GPA increases the
predicted probability of admission.

Threshold [admit = .00] - This is the constant term in the model. The constant term is -2.798.
This means that if all of the predictors (gre, topnotch and gpa) are evaluated at zero, the
predicted probability of admission is F(-2.798) = .00257101. So, as expected, a student with a
GRE score of zero and a GPA of zero from a non-topnotch school has an extremely low predicted
probability of admission.
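The sketch below uses the standard normal CDF to reproduce the F(-2.798) value just mentioned and to illustrate the earlier point that equal one-unit increases in gpa (holding gre and topnotch at zero) do not change the predicted probability by a constant amount; only the coefficients come from the output.

from scipy.stats import norm

# Coefficients from the Parameter Estimates table.
constant, b_gre, b_topnotch, b_gpa = -2.798, 0.002, 0.273, 0.401

def prob_admit(gre, topnotch, gpa):
    return norm.cdf(constant + b_gre * gre + b_topnotch * topnotch + b_gpa * gpa)

print(prob_admit(0, 0, 0))   # about .0026, matching F(-2.798) above

# Holding gre and topnotch at zero, successive one-unit increases in gpa
# produce unequal changes in the predicted probability.
for gpa in (2, 3, 4):
    print(gpa, round(prob_admit(0, 0, gpa), 3))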

l. Std. Error - These are the standard errors of the individual regression coefficients. They are
used both in the calculation of the Wald test statistic, superscript m, and the confidence interval
of the regression coefficient, superscript p.

m. Wald - These are the test statistics for the individual regression coefficients. The test statistic
is the squared ratio of the regression coefficient Estimate to the Std. Error of the respective
predictor. The test statistic follows a Chi-Square distribution which is used to test against a two-
sided alternative hypothesis that the Estimate is not equal to zero.

n. df - This column lists the degrees of freedom for each of the variables included in the model.
For each of these variables, the degree of freedom is 1.

o. Sig. - These are the p-values of the coefficients, used in testing the null hypothesis that, within
a given model, a particular predictor's regression coefficient is zero given that the rest of the
predictors are in the model. They are based on the Wald test statistics of the predictors. The
probability that a particular Wald test statistic is as extreme as, or more so, than what has been
observed under the null hypothesis is defined by the p-value and presented here. By looking at
the estimates of the standard errors to a greater degree of precision, we can calculate the test
statistics and see that they match those produced in SPSS. To view the estimates with more
decimal places displayed, click on the Parameter Estimates table in your SPSS output, then
double-click on the number of interest.

The Wald test statistic for the Threshold is 18.664 with an associated p-value <.0001. If we
set our alpha level to 0.05, we would reject the null hypothesis and conclude that the model
intercept has been found to be statistically different from zero given gre, gpa and topnotch are
in the model.

The Wald test statistic for the predictor gre is 5.667 with an associated p-value of 0.017. If we
set our alpha level to 0.05, we would reject the null hypothesis and conclude that the regression
coefficient for gre has been found to be statistically different from zero in estimating gre given
topnotch and gpa are in the model.

The Wald test statistic for the predictor topnotch is 2.292 with an associated p-value of 0.130.
If we set our alpha level to 0.05, we would fail to reject the null hypothesis and conclude that the
regression coefficient for topnotch has not been found to be statistically different from zero in
estimating topnotch given gre and gpa are in the model.

The Wald test statistic for the predictor gpa is 4.237 with an associated p-value of 0.040. If we
set our alpha level to 0.05, we would reject the null hypothesis and conclude that the regression
coefficient for gpa has been found to be statistically different from zero in estimating admit,
given topnotch and gre are in the model.
p. 95% Wald Confidence Interval - This is the confidence interval (CI) of an individual
regression coefficient, given the other predictors are in the model. For a given predictor
variable with a level of 95% confidence, we'd say that upon repeated sampling, 95% of the CIs
constructed in this way would include the "true" population regression coefficient. It is
calculated as B ± (z_(alpha/2))*(Std. Error), where z_(alpha/2) is a critical value on the standard
normal distribution. The CI is equivalent to the z test statistic: if the CI includes zero, we'd fail
to reject the null hypothesis that a particular regression coefficient is zero, given the other
predictors are in the model. An advantage of a CI is that it is illustrative; it provides information
on where the "true" parameter may lie and the precision of the point estimate.

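A minimal sketch of that interval, assuming only the reported estimate, its standard error and the normal critical value; the gpa standard error below is illustrative, not taken from the output.

from scipy.stats import norm

def wald_ci(estimate, std_error, level=0.95):
    # B +/- z_(alpha/2) * Std. Error
    z = norm.ppf(1 - (1 - level) / 2)   # about 1.96 for a 95% interval
    return estimate - z * std_error, estimate + z * std_error

# Illustrative values; substitute the Estimate and Std. Error of interest.
print(wald_ci(0.401, 0.195))
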
T-test

The t-test procedure performs t-tests for one sample, two samples and paired observations. The
single-sample t-test compares the mean of the sample to a given number (which you supply).
The independent samples t-test compares the difference in the means from the two groups to a
given value (usually 0). In other words, it tests whether the difference in the means is 0. The
dependent-sample or paired t-test compares the difference in the means from the two variables
measured on the same set of subjects to a given number (usually 0), while taking into account the
fact that the scores are not independent. In our examples, we will use the hsb2 data set.

Single sample t-test

The single sample t-test tests the null hypothesis that the population mean is equal to the number
specified by the user. SPSS calculates the t-statistic and its p-value under the assumption that the
sample comes from an approximately normal distribution. If the p-value associated with the t-test
is small (0.05 is often used as the threshold), there is evidence that the mean is different from the
hypothesized value. If the p-value associated with the t-test is not small (p > 0.05), then the null
hypothesis is not rejected and you can conclude that the mean is not different from the
hypothesized value.

In this example, the t-statistic is 4.140 with 199 degrees of freedom. The corresponding two-
tailed p-value is .000, which is less than 0.05. We conclude that the mean of variable write is
different from 50.

get file "C:\hsb2.sav".


t-test
/testval=50
variables=write.
One-Sample Statistics

a. - This is the list of variables. Each variable that was listed on the variables= statement in the
above code will have its own line in this part of the output.

b. N - This is the number of valid (i.e., non-missing) observations used in calculating the t-test.

c. Mean - This is the mean of the variable.

d. Std. Deviation - This is the standard deviation of the variable.

e. Std. Error Mean - This is the estimated standard deviation of the sample mean. If we drew
repeated samples of size 200, we would expect the standard deviation of the sample means to be
close to the standard error. The standard deviation of the distribution of sample mean is
estimated as the standard deviation of the sample divided by the square root of sample size:
9.47859/(sqrt(200)) = .67024.

Test statistics
f. - This identifies the variables. Each variable that was listed on the variables= statement will
have its own line in this part of the output. If a variables= statement is not specified, t-test will
conduct a t-test on all numerical variables in the dataset.

g. t - This is the Student t-statistic. It is the ratio of the difference between the sample mean and
the given number to the standard error of the mean: (52.775 - 50) / .6702372 = 4.1403. Since the
standard error of the mean measures the variability of the sample mean, the smaller the standard
error of the mean, the more likely that our sample mean is close to the true population mean.
This is illustrated by the following three figures.
In all three cases, the difference between the population means is the same. But with large
variability of sample means, second graph, two populations overlap a great deal. Therefore, the
difference may well come by chance. On the other hand, with small variability, the difference is
more clear as in the third graph. The smaller the standard error of the mean, the larger the
magnitude of the t-value and therefore, the smaller the p-value.

h. df - The degrees of freedom for the single sample t-test is simply the number of valid
observations minus 1. We lose one degree of freedom because we have estimated the mean
from the sample. We have used some of the information from the data to estimate the mean, so
it is not available to use for the test, and the degrees of freedom accounts for this.

i. Sig (2-tailed) - This is the two-tailed p-value evaluating the null against an alternative that the
mean is not equal to 50. It is equal to the probability of observing a greater absolute value of t
under the null hypothesis. If the p-value is less than the pre-specified alpha level (usually .05 or
.01), we will conclude that the mean is statistically significantly different from the hypothesized
value. In this example, the p-value is smaller than 0.05, so we conclude that the mean for write is
different from 50.

j. Mean Difference - This is the difference between the sample mean and the test value.

k. 95% Confidence Interval of the Difference - These are the lower and upper bound of the
confidence interval for the difference between the sample mean and the test value. A confidence
interval specifies a range of values within which the unknown population parameter, in this case
the difference between the mean and the hypothesized value, may lie. It is given by

(mean - test value) ± t * s / sqrt(N)

where s is the sample standard deviation of the observations and N is the number of valid
observations. The t-value in the formula is the critical value of the t distribution with N-1
degrees of freedom at the chosen confidence level (by default .95).
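
As a rough check, the pieces of this output can be reproduced from the summary statistics quoted above (mean 52.775, standard deviation 9.47859, N = 200); the Python sketch below does so. With the raw scores available, scipy.stats.ttest_1samp(write, 50) would return the same t and p-value directly.

import numpy as np
from scipy import stats

n, mean, sd, test_value = 200, 52.775, 9.47859, 50

se = sd / np.sqrt(n)                    # Std. Error Mean, about .67024
t = (mean - test_value) / se            # about 4.140
df = n - 1                              # 199
p = 2 * stats.t.sf(abs(t), df)          # two-tailed p-value
t_crit = stats.t.ppf(0.975, df)         # critical value for the 95% CI
ci_low = (mean - test_value) - t_crit * se
ci_high = (mean - test_value) + t_crit * se
print(t, df, p, (ci_low, ci_high))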

Paired t-test

A paired (or "dependent") t-test is used when the observations are not independent of one
another. In the example below, the same students took both the writing and the reading test.
Hence, you would expect there to be a relationship between the scores provided by each student.
The paired t-test accounts for this. For each student, we are essentially looking at the differences
in the values of the two variables and testing if the mean of these differences is equal to zero.

In this example, the t-statistic is 0.8673 with 199 degrees of freedom. The corresponding two-
tailed p-value is 0.3868, which is greater than 0.05. We conclude that the mean difference of
write and read is not different from 0.

t-test
pairs=write with read (paired).

Summary statistics
a. - This is the list of variables.

b. Mean - These are the respective means of the variables.

c. N - This is the number of valid (i.e., non-missing) observations used in calculating the t-test.

d. Std. Deviation - This is the standard deviations of the variables.

e. Std Error Mean - Standard Error Mean is the estimated standard deviation of the sample
mean. This value is estimated as the standard deviation of one sample divided by the square root
of sample size: 9.47859/sqrt(200) = .67024, 10.25294/sqrt(200) = .72499. This provides a
measure of the variability of the sample mean.

f. Correlation - This is the correlation coefficient of the pair of variables indicated. This is a
measure of the strength and direction of the linear relationship between the two variables. The
correlation coefficient can range from -1 to +1, with -1 indicating a perfect negative correlation,
+1 indicating a perfect positive correlation, and 0 indicating no correlation at all. (A variable
correlated with itself will always have a correlation coefficient of 1.) You can think of the
correlation coefficient as telling you the extent to which you can guess the value of one variable
given a value of the other variable. The .597 is the numerical description of how tightly around
the imaginary line the points lie. If the correlation was higher, the points would tend to be closer
to the line; if it was smaller, they would tend to be further away from the line.

g. Sig - This is the p-value associated with the correlation. Here, the correlation is significant at
the .05 level.

Test statistics
g. writing score-reading score - This is the value measured within each subject: the difference
between the writing and reading scores. The paired t-test forms a single random sample of the
paired difference. The mean of these values among all subjects is compared to 0 in a paired t-test.

h. Mean - This is the mean within-subject difference between the two variables.

i. Std. Deviation - This is the standard deviation of the paired differences.

j. Std Error Mean - This is the estimated standard deviation of the sample mean. This value is
estimated as the standard deviation of one sample divided by the square root of sample size:
8.88667/sqrt(200) = .62838. This provides a measure of the variability of the sample mean.

k. 95% Confidence Interval of the Difference - These are the lower and upper bound of the
confidence interval for the mean difference. A confidence interval for the mean difference
specifies a range of values within which the unknown population parameter, in this case the
mean paired difference, may lie. It is given by

(mean difference) ± t * s / sqrt(N)

where s is the sample standard deviation of the paired differences and N is the number of valid
observations. The t-value in the formula is the critical value of the t distribution with N-1
degrees of freedom at the chosen confidence level (by default .95).

l. t - This is the t-statistic. It is the ratio of the mean of the difference to the standard error of
the difference: (.545/.62838).

m. degrees of freedom - The degrees of freedom for the paired observations is simply the
number of observations minus 1. This is because the test is conducted on the one sample of the
paired differences.

n. Sig. (2-tailed) - This is the two-tailed p-value computed using the t distribution. It is the
probability of observing a greater absolute value of t under the null hypothesis. If the p-value is
less than the pre-specified alpha level (usually .05 or .01, here the former), we will conclude that
the mean difference between writing score and reading score is statistically significantly
different from zero. In this example, the p-value for the difference between the two variables is
greater than 0.05, so we conclude that the mean difference is not statistically significantly
different from 0.
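
The paired results above can likewise be reproduced from the summary statistics alone (mean difference .545, standard deviation of the differences 8.88667, N = 200), as in the sketch below. With the raw scores, scipy.stats.ttest_rel(write, read) returns the same t and p-value; it is numerically identical to a one-sample t-test on the within-subject differences.

import numpy as np
from scipy import stats

n, mean_diff, sd_diff = 200, 0.545, 8.88667

se_diff = sd_diff / np.sqrt(n)          # about .62838
t = mean_diff / se_diff                 # about 0.867
p = 2 * stats.t.sf(abs(t), n - 1)       # two-tailed p-value, about 0.387
print(t, p)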

Independent group t-test

This t-test is designed to compare the means of the same variable between two groups. In our
example, we compare the mean writing score between the group of female students and the
group of male students. Ideally, these subjects are randomly selected from a larger population of
subjects. The test assumes that variances for the two populations are the same. The interpretation
of the p-value is the same as in other types of t-tests.

In this example, the t-statistic is -3.7341 with 198 degrees of freedom. The corresponding two-
tailed p-value is 0.0002, which is less than 0.05. We conclude that the difference of means in
write between males and females is different from 0.

t-test groups=female(0 1)
/variables=write.

Summary statistics
a. female - This column gives categories of the independent variable female. This variable is
necessary for doing the independent group t-test and is specified by the t-test groups=
statement.

b. N - This is the number of valid (i.e., non-missing) observations in each group.

c. Mean - This is the mean of the dependent variable for each level of the independent variable.

d. Std. Deviation - This is the standard deviation of the dependent variable for each of the levels
of the independent variable.

e. Std. Error Mean - This is the standard error of the mean, the ratio of the standard deviation
to the square root of the respective number of observations.

Test statistics

f. - This column lists the dependent variable(s). In our example, the dependent variable is write
(labeled "writing score").

g. - This column specifies the method for computing the standard error of the difference of the
means. The method of computing this value is based on the assumption regarding the variances
of the two groups. If we assume that the two populations have the same variance, then the first
method, called the pooled variance estimator, is used. Otherwise, when the variances are not
assumed to be equal, Satterthwaite's method is used.

h. F - The test statistic of the two-sample F test is a ratio of sample variances, F = s1^2/s2^2,
where it is completely arbitrary which sample is labeled sample 1 and which is labeled sample 2.

i. Sig. - This is the two-tailed p-value associated with the null that the two groups have the same
variance. In our example, the probability is less than 0.05. So there is evidence that the variances
for the two groups, female students and male students, are different. Therefore, we may want to
use the second method (Satterthwaite variance estimator) for our t-test.

j. t - These are the t-statistics under the two different assumptions: equal variances and unequal
variances. These are the ratios of the mean of the differences to the standard errors of the
difference under the two different assumptions: (-4.86995 / 1.30419) = -3.734, (-
4.86995/1.33189) = -3.656.

k. df - The degrees of freedom when we assume equal variances is simply the sum of the two
sample sizes (109 and 91) minus 2. The degrees of freedom when we assume unequal variances
is calculated using the Satterthwaite formula, a sketch of which is shown below.
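
A hedged sketch of the Satterthwaite (Welch) approximation: the group sizes (91 males, 109 females) come from the summary table above, but the group standard deviations below are illustrative placeholders rather than the actual output values.

def satterthwaite_df(s1, n1, s2, n2):
    # Welch-Satterthwaite approximation for the unequal-variance t-test.
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

# Illustrative standard deviations; substitute the group values from the output.
print(satterthwaite_df(10.3, 91, 8.1, 109))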

l. Sig. (2-tailed) - The p-value is the two-tailed probability computed using the t distribution. It
is the probability of observing a t-value of equal or greater absolute value under the null
hypothesis. For a one-tailed test, halve this probability. If the p-value is less than our pre-
specified alpha level, usually 0.05, we will conclude that the difference is significantly different
from zero. For example, the p-value for the difference between females and males is less than
0.05 in both cases, so we conclude that the difference in means is statistically significantly
different from 0.

m. Mean Difference - This is the difference between the means.

n. Std Error Difference - The standard error difference is the estimated standard deviation of the
difference between the sample means. If we drew repeated samples of size 200, we would
expect the standard deviation of the sample mean differences to be close to this standard error.
It provides a measure of the variability of the difference between the sample means. The Central
Limit Theorem tells us that the sample means are approximately normally distributed when the
sample size is 30 or greater. Note that the standard error difference is calculated differently
under the two different assumptions.

o. 95% Confidence Interval of the Difference - These are the lower and upper bound of the
confidence interval for the difference in means. A confidence interval for the difference specifies
a range of values within which the unknown population parameter, in this case the difference in
means, may lie. It is given by

(mean difference) ± t * (std. error difference)

where the standard error of the difference is given in the previous column and the t-value is the
critical value of the t distribution, with the degrees of freedom shown in the df column, at the
chosen confidence level (by default .95).
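
For readers working outside SPSS, scipy's ttest_ind reproduces both rows of the test-statistics table through its equal_var argument; the arrays below are placeholder scores standing in for the male and female write scores, not the actual hsb2 data.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder data standing in for the write scores of the two groups.
write_male = rng.normal(50, 10, 91)
write_female = rng.normal(55, 8, 109)

# Equal variances assumed (pooled variance estimator).
print(stats.ttest_ind(write_female, write_male, equal_var=True))
# Equal variances not assumed (Satterthwaite / Welch).
print(stats.ttest_ind(write_female, write_male, equal_var=False))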

Correlation

This page shows an example correlation with footnotes explaining the output. These data were
collected on 200 high schools students and are scores on various tests, including science, math,
reading and social studies (socst). The variable female is a dichotomous variable coded 1 if the
student was female and 0 if male.

In the syntax below, the get file command is used to load the data into SPSS. In quotes, you
need to specify where the data file is located on your computer. Remember that you need to use
the .sav extension and that you need to end the command with a period. By default, SPSS does a
pairwise deletion of missing values. This means that as long as both variables in the correlation
have valid values for a case, that case is included in the correlation. The /print subcommand is
used to have the statistically significant correlations marked.

get file "c:\hsb2.sav".

correlations
/variables = read write math science female
/print = nosig.

a. Pearson Correlation - These numbers measure the strength and direction of the linear
relationship between the two variables. The correlation coefficient can range from -1 to +1, with
-1 indicating a perfect negative correlation, +1 indicating a perfect positive correlation, and 0
indicating no correlation at all. (A variable correlated with itself will always have a correlation
coefficient of 1.) You can think of the correlation coefficient as telling you the extent to which
you can guess the value of one variable given a value of the other variable. From the scatterplot
of the variables read and write below, we can see that the points tend to fall along a line going
from the bottom left to the upper right, which is the same as saying that the correlation is
positive. The .597 is the numerical description of how tightly around the imaginary line the
points lie. If the correlation were higher, the points would tend to be closer to the line; if it were
smaller, they would tend to be further away from the line. Also note that, by definition, any
variable correlated with itself has a correlation of 1.

b. Sig. (2-tailed) - This is the p-value associated with the correlation. The footnote under the
correlation table explains what the single and double asterisks signify.

c. N - This is the number of cases that was used in the correlation. Because we have no missing
data in this data set, all correlations were based on all 200 cases in the data set. However, if
some variables had missing values, the N's would be different for the different correlations.
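
A small sketch with placeholder data shows the same quantities outside SPSS; scipy.stats.pearsonr returns the Pearson correlation and its two-tailed p-value. The arrays are illustrative stand-ins for read and write, not the hsb2 values.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Placeholder scores standing in for read and write from hsb2.sav.
read = rng.normal(52, 10, 200)
write = 0.6 * read + rng.normal(0, 8, 200)

r, p = stats.pearsonr(read, write)
print(round(r, 3), p)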

Scatterplot

graph
/scatterplot(bivar) = write with read.
Correlation using listwise deletion of missing data

The correlations in the table below are interpreted in the same way as those above. The only
difference is the way the missing values are handled. When you do a listwise deletion, as we do
with the /missing = listwise subcommand, if a case has a missing value for any of the variables
listed on the /variables subcommand, that case is eliminated from all correlations, even if there
are valid values for the two variables in the current correlation. For example, if there was a
missing value for the variable read, the case would still be excluded from the calculation of the
correlation between write and math.

There are really no rules defining when you should use pairwise or listwise deletion. It depends
on your purpose and whether it is important for exactly the same cases to be used in all of the
correlations. If you have lots of missing data, some correlations could be based on many cases
that are not included in other correlations. On the other hand, if you use a listwise deletion, you
may not have many cases left to be used in the calculation.
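
The difference between the two deletion schemes is easy to see with a toy data set containing one missing value; in pandas, for example, DataFrame.corr() uses pairwise deletion, while dropping incomplete rows first mimics listwise deletion. The numbers below are made up for illustration.

import numpy as np
import pandas as pd

# Toy data with one missing value on read.
df = pd.DataFrame({
    "read":  [57, 68, 44, 63, np.nan],
    "write": [52, 59, 33, 44, 52],
    "math":  [41, 53, 54, 47, 57],
})

print(df.corr())            # pairwise deletion: each pair uses all complete cases for that pair
print(df.dropna().corr())   # listwise deletion: the incomplete case is dropped from every correlation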
Please note that SPSS sometimes includes footnotes as part of the output. We have left those
intact and have started ours with the next letter of the alphabet.

correlations
/variables = read write math science female
/print = nosig
/missing = listwise.

b. Pearson Correlation - This is the correlation between the two variables (one listed in the
row, the other in the column). It is interpreted just as the correlations in the previous example.

c. Sig. (2-tailed) - This is the p-value associated with the correlation. The footnote under the
correlation table explains what the single and double asterisks signify.

Multinomial Logistic Regression

This page shows an example of a multinomial logistic regression analysis with footnotes
explaining the output. The data were collected on 200 high school students and are scores on
various tests, including a video game and a puzzle. The outcome measure in this analysis is the
student's favorite flavor of ice cream - vanilla, chocolate or strawberry - from which we are going
to see what relationships exist with video game scores (video), puzzle scores (puzzle) and
gender (female). The data set can be downloaded here.

get file = 'D:\mlogit.sav'.

Before running the regression, obtaining a frequency of the ice cream flavors in the data can
inform the selection of a reference group. By default, SPSS uses the last category as the
reference category.

frequencies
/variables = ice_cream.
Vanilla is the most frequently preferred ice cream flavor and will be the reference group in this
example. In the data, vanilla is represented by the number 2 (chocolate is 1, strawberry is 3).
We will use the nomreg command to run the multinomial logistic regression.

nomreg ice_cream (base = 2) with video puzzle female
/print = parameter summary cps mfi.

In the above command, base = 2 indicates which level of the outcome variable should be treated
as the reference level. By default, SPSS sorts the groups and chooses the last as the referent
group.
Case Processing Summary

b. N - N provides the number of observations fitting the description in the first column. For
example, the first three values give the number of observations for which the subject's preferred
flavor of ice cream is chocolate, vanilla or strawberry, respectively.

c. Marginal Percentage - The marginal percentage lists the proportion of valid observations
found in each of the outcome variable's groups. This can be calculated by dividing the N for each
group by the N for "Valid". Of the 200 subjects with valid data, 47 preferred chocolate ice cream
to vanilla and strawberry. Thus, the marginal percentage for this group is (47/200) * 100 = 23.5
%.

d. ice_cream - In this regression, the outcome variable is ice_cream which contains a numeric
code for the subject's favorite flavor of ice cream. The data includes three levels of ice_cream
representing three different preferred flavors: 1 = chocolate, 2 = vanilla and 3 = strawberry.

e. Valid - This indicates the number of observations in the dataset where the outcome variable
and all predictor variables are non-missing.

f. Missing - This indicates the number of observations in the dataset where data are missing from
the outcome variable or any of the predictor variables.

g. Total - This indicates the total number of observations in the dataset--the sum of the number
of observations in which data are missing and the number of observations with valid data.

h. Subpopulation - This indicates the number of subpopulations contained in the data. A
subpopulation of the data consists of one combination of the predictor variables specified for the
model. For example, all records where female = 0, video = 42 and puzzle = 26 would be
considered one subpopulation of the data. The footnote SPSS provides indicates how many of
these combinations of the predictor variables consist of records that all have the same value in
the outcome variable. In this case, there are 143 combinations of female, video and puzzle that
appear in the data and 117 of these combinations are composed of records with the same
preferred flavor of ice cream.

Model Fitting Information

i. Model - This indicates the parameters of the model for which the model fit is calculated.
"Intercept Only" describes a model that does not control for any predictor variables and simply
fits an intercept to predict the outcome variable. "Final" describes a model that includes the
specified predictor variables and has been arrived at through an iterative process that maximizes
the log likelihood of the outcomes seen in the outcome variable. By including the predictor
variables and maximizing the log likelihood of the outcomes seen in the data, the "Final" model
should improve upon the "Intercept Only" model. This can be seen in the differences in the
-2(Log Likelihood) values associated with the models.
j. -2(Log Likelihood) - This is the product of -2 and the log likelihoods of the null model and the
fitted "final" model. The likelihood of the model is used to test whether all predictors'
regression coefficients in the model are simultaneously zero and in tests of nested models.

k. Chi-Square - This is the Likelihood Ratio (LR) Chi-Square test that at least one of the
predictors' regression coefficients is not equal to zero in the model. The LR Chi-Square statistic
can be calculated by -2*L(null model) - (-2*L(fitted model)) = 365.736 - 332.641 = 33.095,
where L(null model) is the log likelihood with just the response variable in the model
(Intercept Only) and L(fitted model) is the log likelihood from the final iteration (assuming the
model converged) with all the parameters.

l. df - This indicates the degrees of freedom of the chi-square distribution used to test the LR
Chi-Square statistic and is defined by the number of predictors in the model (three predictors in
two models, for six degrees of freedom).

m. Sig. - This is the probability of getting an LR test statistic as extreme as, or more extreme
than, the observed statistic under the null hypothesis; the null hypothesis is that all of the
regression coefficients in the model are equal to zero. In other words, this is the probability of
obtaining this chi-square statistic (33.095), or one more extreme, if there is in fact no effect of the
predictor variables. This p-value is compared to a specified alpha level, our willingness to accept
a type I error, which is typically set at 0.05 or 0.01. The small p-value from the LR test,
<0.00001, would lead us to conclude that at least one of the regression coefficients in the model
is not equal to zero. The parameter of the chi-square distribution used to test the null hypothesis
is defined by the degrees of freedom in the prior column.
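
A quick sketch of that arithmetic, using the -2 log likelihood values from the Model Fitting Information table and the six degrees of freedom (three predictors in each of the two comparison equations):

from scipy.stats import chi2

neg2ll_null, neg2ll_final = 365.736, 332.641   # from the table above

lr_chi2 = neg2ll_null - neg2ll_final           # 33.095
df = 6                                         # 3 predictors x 2 comparison equations
p = chi2.sf(lr_chi2, df)                       # very small, in line with the Sig. reported
print(lr_chi2, p)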

Pseudo R-Square

Pseudo R-Square - These are three pseudo R-squared values. Logistic regression does not have
an equivalent to the R-squared that is found in OLS regression; however, many people have tried
to come up with one. There are a wide variety of pseudo R-squared statistics which can give
contradictory conclusions. Because these statistics do not mean what R-squared means in OLS
regression (the proportion of variance of the response variable explained by the predictors), we
suggest interpreting them with great caution.

Parameter Estimates
n. B - These are the estimated multinomial logistic regression coefficients for the models. An
important feature of the multinomial logit model is that it estimates k-1 models, where k is the
number of levels of the outcome variable. In this instance, SPSS is treating vanilla as the
referent group and therefore estimated a model for chocolate relative to vanilla and a model for
strawberry relative to vanilla. Therefore, since the parameter estimates are relative to the referent
group, the standard interpretation of the multinomial logit is that for a unit change in the
predictor variable, the logit of outcome m relative to the referent group is expected to change by
its respective parameter estimate (which is in log-odds units) given the variables in the model are
held constant.

chocolate relative to vanilla

Intercept - This is the multinomial logit estimate for chocolate relative to vanilla when the
predictor variables in the model are evaluated at zero. For males (the variable female evaluated
at zero) with zero video and puzzle scores, the logit for preferring chocolate to vanilla is 1.912.
Note that evaluating video and puzzle at zero is out of the range of plausible scores, and if the
scores were mean-centered, the intercept would have a natural interpretation: log odds of
preferring chocolate to vanilla for a male with average video and puzzle scores.

video - This is the multinomial logit estimate for a one unit increase in video score for
chocolate relative to vanilla given the other variables in the model are held constant. If a subject
were to increase his video score by one point, the multinomial log-odds of preferring chocolate
to vanilla would be expected to decrease by 0.024 unit while holding all other variables in the
model constant.

puzzle - This is the multinomial logit estimate for a one unit increase in puzzle score for
chocolate relative to vanilla given the other variables in the model are held constant. If a subject
were to increase his puzzle score by one point, the multinomial log-odds of preferring chocolate
to vanilla would be expected to decrease by 0.039 unit while holding all other variables in the
model constant.
female - This is the multinomial logit estimate comparing females to males for chocolate
relative to vanilla given the other variables in the model are held constant. The multinomial logit
for females relative to males is 0.817 unit higher for preferring chocolate relative to vanilla given
all other predictor variables in the model are held constant. In other words, females are more
likely than males to prefer chocolate ice cream to vanilla ice cream.

strawberry relative to vanilla

Intercept - This is the multinomial logit estimate for strawberry relative to vanilla when the
predictor variables in the model are evaluated at zero. For males (the variable female evaluated
at zero) with zero video and puzzle scores, the logit for preferring strawberry to vanilla is -4.057.

video - This is the multinomial logit estimate for a one unit increase in video score for
strawberry relative to vanilla given the other variables in the model are held constant. If a subject
were to increase his video score by one point, the multinomial log-odds for preferring strawberry
to vanilla would be expected to increase by 0.023 unit while holding all other variables in the
model constant.

puzzle - This is the multinomial logit estimate for a one unit increase in puzzle score for
strawberry relative to vanilla given the other variables in the model are held constant. If a subject
were to increase his puzzle score by one point, the multinomial log-odds for preferring
strawberry to vanilla would be expected to increase by 0.043 unit while holding all other
variables in the model constant.

female - This is the multinomial logit estimate comparing females to males for strawberry
relative to vanilla given the other variables in the model are held constant. The multinomial logit
for females relative to males is 0.033 unit lower for preferring strawberry to vanilla given all
other predictor variables in the model are held constant. In other words, males are more likely
than females to prefer strawberry ice cream to vanilla ice cream.

o. Std. Error - These are the standard errors of the individual regression coefficients for the two
respective models estimated.

p. Wald - This is the Wald chi-square test that tests the null hypothesis that the estimate equals 0.

q. df - This column lists the degrees of freedom for each of the variables included in the model.
For each of these variables, the degree of freedom is 1.

r. Sig. - These are the p-values of the coefficients, testing the null hypothesis that, within a given
model, a particular predictor's regression coefficient is zero given that the rest of the predictors
are in the model. They are based on the Wald test statistics of the predictors, which can be
calculated by dividing the square of the predictor's estimate by the square of its standard error.
The probability that a particular Wald test statistic is as extreme as, or more extreme than, what
has been observed under the null hypothesis is the p-value presented here. In multinomial
logistic regression, the interpretation of a parameter estimate's significance
is limited to the model in which the parameter estimate was calculated. For example, the
significance of a parameter estimate in the chocolate relative to vanilla model cannot be assumed
to hold in the strawberry relative to vanilla model.

chocolate relative to vanilla

For chocolate relative to vanilla, the Wald test statistic for the predictor video is 1.262 with an
associated p-value of 0.261. If we set our alpha level to 0.05, we would fail to reject the null
hypothesis and conclude that for chocolate relative to vanilla, the regression coefficient for video
has not been found to be statistically different from zero given puzzle and female are in the
model.
For chocolate relative to vanilla, the Wald test statistic for the predictor puzzle is 3.978 with
an associated p-value of 0.046. If we again set our alpha level to 0.05, we would reject the null
hypothesis and conclude that the regression coefficient for puzzle has been found to be
statistically different from zero for chocolate relative to vanilla given that video and female are
in the model.
For chocolate relative to vanilla, the Wald test statistic for the predictor female is 4.362 with an
associated p-value of 0.037. If we again set our alpha level to 0.05, we would reject the null
hypothesis and conclude that the difference between males and females has been found to be
statistically different for chocolate relative to vanilla given that video and puzzle are in the
model.
For chocolate relative to vanilla, the Wald test statistic for the intercept, Intercept, is 2.878
with an associated p-value of 0.090. With an alpha level of 0.05, we would fail to reject the null
hypothesis and conclude either a) that the multinomial logit for males (the variable female
evaluated at zero) with zero video and puzzle scores in chocolate relative to vanilla has not been
found to be statistically different from zero; or b) that for males with zero video and puzzle
scores, we are statistically uncertain whether they are more likely to be classified as chocolate or
vanilla. We
can make the second interpretation when we view the Intercept as a specific covariate profile
(males with zero video and puzzle scores). Based on the direction and significance of the
coefficient, the Intercept indicates whether the profile would have a greater propensity to be
classified in one level of the outcome variable than the other level.

strawberry relative to vanilla

For strawberry relative to vanilla, the Wald test statistic for the predictor video is 1.206 with
an associated p-value of 0.272. If we set our alpha level to 0.05, we would fail to reject the null
hypothesis and conclude that for strawberry relative to vanilla, the regression coefficient for
video has not been found to be statistically different from zero given puzzle and female are in
the model.
For strawberry relative to vanilla, the Wald test statistic for the predictor puzzle is 4.675 with
an associated p-value of 0.031. If we again set our alpha level to 0.05, we would reject the null
hypothesis and conclude that the regression coefficient for puzzle has been found to be
statistically different from zero for strawberry relative to vanilla given that video and female are
in the model.
For strawberry relative to vanilla, the Wald test statistic for the predictor female is 0.009 with
an associated p-value of 0.925. If we again set our alpha level to 0.05, we would fail to reject the
null hypothesis and conclude that for strawberry relative to vanilla, the regression coefficient for
female has not been found to be statistically different from zero given puzzle and video are in
the model.
For strawberry relative to vanilla, the Wald test statistic for the intercept, Intercept, is 11.007
with an associated p-value of 0.001. With an alpha level of 0.05, we would reject the null
hypothesis and conclude either a) that the multinomial logit for males (the variable female
evaluated at zero) with zero video and puzzle scores in strawberry relative to vanilla is
statistically different from zero; or b) that for males with zero video and puzzle scores, there is a
statistically significant difference between the likelihood of being classified as strawberry or
vanilla. We can make the second interpretation when we view the Intercept as a specific
covariate profile (males with zero video and puzzle scores). Based on the direction and
significance of the coefficient, the Intercept indicates whether the profile would have a greater
propensity to be classified in one level of the outcome variable than the other level.

s. Exp(B) - These are the odds ratios for the predictors. They are the exponentiation of the
coefficients. There is no odds ratio for the variable ice_cream because ice_cream (as a variable
with 2 degrees of freedom) was not entered into the logistic regression equation. The odds ratio
of a coefficient indicates how the risk of the outcome falling in the comparison group compared
to the risk of the outcome falling in the referent group changes with the variable in question. An
odds ratio > 1 indicates that the risk of the outcome falling in the comparison group relative to
the risk of the outcome falling in the referent group increases as the variable increases. In other
words, the comparison outcome is more likely. An odds ratio < 1 indicates that the risk of the
outcome falling in the comparison group relative to the risk of the outcome falling in the referent
group decreases as the variable increases. See the interpretations of the relative risk ratios
below for examples. In general, if the odds ratio < 1, the outcome is more likely to be in the
referent group. For more information on interpreting odds ratios, please see How do I interpret
odds ratios in logistic regression? and Understanding RR ratios in multinomial logistic
regression.
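
A short sketch of how the Exp(B) column is obtained from the B column, using the (rounded) coefficients reported above; small discrepancies in the third decimal place relative to the SPSS output are due to that rounding.

import math

# Rounded B coefficients from the Parameter Estimates discussion above.
coefs = {
    ("chocolate vs. vanilla", "video"):   -0.024,
    ("chocolate vs. vanilla", "puzzle"):  -0.039,
    ("chocolate vs. vanilla", "female"):   0.817,
    ("strawberry vs. vanilla", "video"):   0.023,
    ("strawberry vs. vanilla", "puzzle"):  0.043,
    ("strawberry vs. vanilla", "female"): -0.033,
}

for (model, predictor), b in coefs.items():
    # Exponentiating a coefficient gives the relative risk (odds) ratio.
    print(model, predictor, round(math.exp(b), 3))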

chocolate relative to vanilla

video - This is the odds or "relative risk" ratio for a one unit increase in video score for
chocolate relative to vanilla level given that the other variables in the model are held constant. If
a subject were to increase her video score by one unit, the relative risk for preferring chocolate to
vanilla would be expected to decrease by a factor of 0.977 given the other variables in the model
are held constant. So, given a one unit increase in video, the relative risk of being in the
chocolate group rather than the vanilla group would be multiplied by 0.977, holding the other
variables in the model constant. More generally, we can say that if a subject were to increase her video score, we would
expect her to be more likely to prefer vanilla ice cream over chocolate ice cream.

puzzle - This is the relative risk ratio for a one unit increase in puzzle score for chocolate
relative to vanilla level given that the other variables in the model are held constant. If a subject
were to increase her puzzle score by one unit, the relative risk for preferring chocolate to vanilla
would be expected to decrease by a factor of 0.962 given the other variables in the model are
held constant. More generally, we can say that if two subjects have identical video scores and are
both female (or both male), the subject with the higher puzzle score is more likely to prefer
vanilla ice cream over chocolate ice cream than the subject with the lower puzzle score.
female - This is the relative risk ratio comparing females to males for chocolate relative to
vanilla level given that the other variables in the model are held constant. For females relative to
males, the relative risk for preferring chocolate relative to vanilla would be expected to increase
by a factor of 2.263 given the other variables in the model are held constant. In other words,
females are more likely than males to prefer chocolate ice cream over vanilla ice cream.

strawberry relative to vanilla

video - This is the relative risk ratio for a one unit increase in video score for strawberry
relative to vanilla level given that the other variables in the model are held constant. If a subject
were to increase her video score by one unit, the relative risk for strawberry relative to vanilla
would be expected to increase by a factor of 1.023 given the other variables in the model are held
constant. More generally, we can say that if a subject were to increase her video score, we would
expect her to be more likely to prefer strawberry ice cream over vanilla ice cream.

puzzle - This is the relative risk ratio for a one unit increase in puzzle score for strawberry
relative to vanilla level given that the other variables in the model are held constant. If a subject
were to increase her puzzle score by one unit, the relative risk for strawberry relative to vanilla
would be expected to increase by a factor of 1.044 given the other variables in the model are held
constant. More generally, we can say that if two subjects have identical video scores and are both
female (or both male), the subject with the higher puzzle score is more likely to prefer
strawberry ice cream to vanilla ice cream than the subject with the lower puzzle score.

female - This is the relative risk ratio comparing females to males for strawberry relative to
vanilla given that the other variables in the model are held constant. For females relative to
males, the relative risk for preferring strawberry to vanilla would be expected to decrease by a
factor of 0.968 given the other variables in the model are held constant. In other words, females
are less likely than males to prefer strawberry ice cream to vanilla ice cream.

t. 95% Confidence Interval for Exp(B) - This is the Confidence Interval (CI) for an individual
multinomial odds ratio given the other predictors are in the model for outcome m relative to the
referent group. For a given predictor with a level of 95% confidence, we'd say that we are 95%
confident that the "true" population multinomial odds ratio lies between the lower and upper
limit of the interval for outcome m relative to the referent group. It is calculated as
exp(B ± (z_(alpha/2))*(Std. Error)), where z_(alpha/2) is a critical value on the standard normal
distribution. This CI is
equivalent to the z test statistic: if the CI includes one, we'd fail to reject the null hypothesis that
a particular regression coefficient is zero given the other predictors are in the model. An
advantage of a CI is that it is illustrative; it provides a range where the "true" odds ratio may lie.

Ordered Logistic Regression

This page shows an example of an ordered logistic regression analysis with footnotes explaining
the output. The data were collected on 200 high school students and are scores on various tests,
including science, math, reading and social studies. The outcome measure in this analysis is
socio-economic status (ses)- low, medium and high- from which we are going to see what
relationships exist with science test scores (science), social science test scores (socst) and gender
(female). Our response variable, ses, is going to be treated as ordinal under the assumption that
the levels of ses status have a natural ordering (low to high), but the distances between adjacent
levels are unknown.

get file = 'C:/hsb2.sav'.

plum ses with female science socst
/link = logit
/print = parameter summary tparallel.
Case Processing Summary

a. N - N provides the number of observations fitting the description in the first column. For
example, the first three values give the number of observations for which the subject's ses is low,
middle, or high, respectively.

b. Marginal Percentage - The marginal percentage lists the proportion of valid observations
found in each of the outcome variable's groups. This can be calculated by dividing the N for each
group by the N for "Valid". Of the 200 subjects with valid data, 47 were categorized as low ses.
Thus, the marginal percentage for this group is (47/200) * 100 = 23.5 %.

c. ses - In this regression, the outcome variable is ses which contains a numeric code for the
subject's socio-economic status. The data includes three levels of ses.

d. Valid - This indicates the number of observations in the dataset where the outcome variable
and all predictor variables are non-missing.

e. Missing - This indicates the number of observations in the dataset where data are missing
from the outcome variable or any of the predictor variables.

f. Total - This indicates the total number of observations in the dataset--the sum of the number of
observations in which data are missing and the number of observations with valid data.
Model Fitting

g. Model - This indicates the parameters of the model for which the model fit is calculated.
"Intercept Only" describes a model that does not control for any predictor variables and simply
fits an intercept to predict the outcome variable. "Final" describes a model that includes the
specified predictor variables and has been arrived at through an iterative process that maximizes
the log likelihood of the outcomes seen in the outcome variable. By including the predictor
variables and maximizing the log likelihood of the outcomes seen in the data, the "Final" model
should improve upon the "Intercept Only" model. This can be seen in the differences in the
-2(Log Likelihood) values associated with the models.

h. -2(Log Likelihood) - This is the product of -2 and the log likelihoods of the null model and the
fitted "final" model. The likelihood of the model is used to test whether all predictors'
regression coefficients in the model are simultaneously zero and in tests of nested models.

i. Chi-Square - This is the Likelihood Ratio (LR) Chi-Square test that at least one of the
predictors' regression coefficients is not equal to zero in the model. The LR Chi-Square statistic
can be calculated by -2*L(null model) - (-2*L(fitted model)) = 365.736 - 334.176 = 31.560,
where L(null model) is the log likelihood with just the response variable in the model
(Intercept Only) and L(fitted model) is the log likelihood from the final iteration (assuming the model
converged) with all the parameters.

j. df - This indicates the degrees of freedom of the Chi-Square distribution used to test the LR
Chi-Square statistic and is defined by the number of predictors in the model.

k. Sig. - This is the probability of getting an LR test statistic as extreme as, or more extreme
than, the observed statistic under the null hypothesis; the null hypothesis is that all of the regression coefficients in
the model are equal to zero. In other words, this is the probability of obtaining this chi-square
statistic (31.56) if there is in fact no effect of the predictor variables. This p-value is compared to
a specified alpha level, our willingness to accept a type I error, which is typically set at 0.05 or
0.01. The small p-value from the LR test, <0.00001, would lead us to conclude that at least one
of the regression coefficients in the model is not equal to zero. The parameter of the Chi-Square
distribution used to test the null hypothesis is defined by the degrees of freedom in the prior
column.

Pseudo R-Squares
l. Pseudo R-Square - These are three pseudo R-squared values. Logistic regression does not
have an equivalent to the R-squared that is found in OLS regression; however, many people have
tried to come up with one. There are a wide variety of pseudo R-squared statistics which can
give contradictory conclusions. Because these statistics do not mean what R-squared means in
OLS regression (the proportion of variance for the response variable explained by the
predictors), we suggest interpreting them with great caution.

Parameter Estimates

m. Threshold - This represents the response variable in the ordered logistic regression. The
threshold estimate for [ses = 1.00] is the cutoff value between low and middle ses and the
threshold estimate for [ses = 2.00] is the cutoff value between middle and high ses. Underneath
Threshold are the predictors in the model.

[ses = 1.00] - This is the estimated cutpoint on the latent variable used to differentiate low ses
from middle and high ses when values of the predictor variables are evaluated at zero. Subjects
that had a value of 2.755 or less on the underlying latent variable that gave rise to our ses
variable would be classified as low ses given they were male (the variable female evaluated at
zero) and had zero science and socst test scores.

[ses = 2.00] - This is the estimated cutpoint on the latent variable used to differentiate low and
middle ses from high ses when values of the predictor variables are evaluated at zero. Subjects
that had a value of 5.105 or greater on the underlying latent variable that gave rise to our ses
variable would be classified as high ses given they were male and had zero science and socst test
scores. Subjects that had a value between 2.755 and 5.105 on the underlying latent variable
would be classified as middle ses.
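
To make the role of the cutpoints concrete, the sketch below turns the two thresholds and the coefficient estimates reported further down (science 0.030, socst 0.053, female -0.482) into predicted category probabilities for a hypothetical covariate profile, assuming the PLUM parameterization described here (threshold minus linear predictor on the logit scale). The profile values are made up for illustration.

import numpy as np

def ordered_logit_probs(xb, thresholds):
    # Cumulative probabilities P(Y <= j) = logistic(threshold_j - x'b),
    # differenced to give the probability of each category.
    cum = 1 / (1 + np.exp(-(np.asarray(thresholds) - xb)))
    cum = np.append(cum, 1.0)
    return np.diff(np.insert(cum, 0, 0.0))

thresholds = [2.755, 5.105]                 # [ses = 1.00] and [ses = 2.00]
# Hypothetical profile: a female student with science = 55 and socst = 60.
xb = 0.030 * 55 + 0.053 * 60 + (-0.482) * 1
print(ordered_logit_probs(xb, thresholds))  # P(low), P(middle), P(high)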

n. Estimate - These are the ordered log-odds (logit) regression coefficients. Standard
interpretation of the ordered logit coefficient is that for a one unit increase in the predictor, the
response variable level is expected to change by its respective regression coefficient in the
ordered log-odds scale while the other variables in the model are held constant. Interpretation of
the ordered logit estimates is not dependent on the ancillary parameters; the ancillary parameters
are used to differentiate the adjacent levels of the response variable. However, since the ordered
logit model estimates one equation over all levels of the outcome variable, a concern is whether
our one-equation model is valid or a more flexible model is required. The odds ratios of the
predictors can be calculated by exponentiating the estimate.

science - This is the ordered log-odds estimate for a one unit increase in science score on the
expected ses level given the other variables are held constant in the model. If a subject were to
increase his science score by one point, his ordered log-odds of being in a higher ses category
would increase by 0.03 while the other variables in the model are held constant.

socst - This is the ordered log-odds estimate for a one unit increase in socst score on the
expected ses level given the other variables are held constant in the model. A one unit increase in
socst test scores would result in a 0.053 unit increase in the ordered log-odds of being in a higher
ses category while the other variables in the model are held constant.

female - This is the ordered log-odds estimate of comparing females to males on expected ses
given the other variables are held constant in the model. The ordered logit for females being in a
higher ses category is 0.4824 lower than that for males when the other variables in the model are
held constant.

o. Std. Error - These are the standard errors of the individual regression coefficients. They are
used in both the calculation of the Wald test statistic, superscript p, and the confidence interval of
the regression coefficient, superscript r.

p. Wald - This is the Wald chi-square test that tests the null hypothesis that the estimate equals
0.

q. DF - These are the degrees of freedom for each of the tests of the coefficients. For each
Estimate (parameter) estimated in the model, one DF is required, and the DF defines the Chi-
Square distribution to test whether the individual regression coefficient is zero given the other
variables are in the model.

r. Sig. - These are the p-values of the coefficients, testing the null hypothesis that, within a given
model, a particular predictor's regression coefficient is zero given that the rest of the predictors
are in the model. They are based on the Wald test statistics of the predictors, which can be
calculated by dividing the square of the predictor's estimate by the square of its standard error.
The probability that a particular Wald test statistic is as extreme as, or more extreme than, what
has been observed under the null hypothesis is the p-value presented here.
The Wald test statistic for the predictor female is 3.000 with an associated p-value of 0.083. If
we set our alpha level to 0.05, we would fail to reject the null hypothesis and conclude that the
regression coefficient for female has not been found to be statistically different from zero in
estimating ses given socst and science are in the model.
The Wald test statistic for the predictor science is 3.584 with an associated p-value of 0.058. If
we set our alpha level to 0.05, we would fail to reject the null hypothesis and conclude that the
regression coefficient for science has not been found to be statistically different from zero in
estimating ses given socst and female are in the model.
The Wald test statistic for the predictor socst is 12.777 with an associated p-value of <0.0001.
If we again set our alpha level to 0.05, we would reject the null hypothesis and conclude that the
regression coefficient for socst has been found to be statistically different from zero in estimating
ses given that science and female are in the model. The interpretation for a dichotomous variable
such as female parallels that of a continuous variable: the observed difference between males
and females on ses status was not found to be statistically significant at the 0.05 level when
controlling for socst and science (p=0.083).

s. 95% Confidence Interval - This is the Confidence Interval (CI) for an individual regression
coefficient given the other predictors are in the model. For a given predictor with a level of 95%
confidence, we'd say that we are 95% confident that the "true" population regression coefficient
lies in between the lower and upper limit of the interval. It is calculated as
Coef. ± (z_(alpha/2))*(Std.Err.), where z_(alpha/2) is a critical value on the standard normal
distribution. The CI is
equivalent to the z test statistic: if the CI includes zero, we'd fail to reject the null hypothesis that
a particular regression coefficient is zero given the other predictors are in the model. An
advantage of a CI is that it is illustrative; it provides a range where the "true" parameter may lie.

Test of Parallel Lines

t. General - Here, SPSS tests the proportional odds assumption. This is commonly referred to as
the test of parallel lines because the null hypothesis states that the slope coefficients in the model
are the same across response categories (and lines of the same slope are parallel). Since the
ordered logit model estimates one equation over all levels of the response variable (as compared
to the multinomial logit model, which models, assuming low ses is our referent level, an equation
for medium ses versus low ses, and an equation for high ses versus low ses), the test for
proportional odds tests whether our one-equation model is valid. If we were to reject the null
hypothesis based on the significance of the Chi-Square statistic, we would conclude that ordered
logit coefficients are not equal across the levels of the outcome, and we would fit a less
restrictive model (i.e., multinomial logit model). If we fail to reject the null hypothesis, we
conclude that the assumption holds. For our model, the proportional odds assumption appears to
have held because the significance of our Chi-Square statistic is .534 > .05.

Factor Analysis

This page shows an example of a factor analysis with footnotes explaining the output. The data
used in this example were collected by Professor James Sidanius, who has generously shared
them with us. You can download the data set here.

Overview: The "what" and "why" of factor analysis

Factor analysis is a method of data reduction. It does this by seeking underlying unobservable
(latent) variables that are reflected in the observed variables (manifest variables). There are
many different methods that can be used to conduct a factor analysis (such as principal axis
factor, maximum likelihood, generalized least squares, unweighted least squares). There are also
many different types of rotations that can be done after the initial extraction of factors, including
orthogonal rotations, such as varimax and equimax, which impose the restriction that the factors
cannot be correlated, and oblique rotations, such as promax, which allow the factors to be
correlated with one another. You also need to determine the number of factors that you want to
extract. Given the number of factor analytic techniques and options, it is not surprising that
different analysts could reach very different results analyzing the same data set. However, all
analysts are looking for simple structure. Simple structure is a pattern of results such that each
variable loads highly onto one and only one factor.

Factor analysis is a technique that requires a large sample size. Factor analysis is based on the
correlation matrix of the variables involved, and correlations usually need a large sample size
before they stabilize. Tabachnick and Fidell (2001, page 588) cite Comrey and Lee's (1992)
advice regarding sample size: 50 cases is very poor, 100 is poor, 200 is fair, 300 is good, 500 is
very good, and 1000 or more is excellent. As a rule of thumb, a bare minimum of 10
observations per variable is necessary to avoid computational difficulties.

For the example below, we are going to do a rather "plain vanilla" factor analysis. We will use
iterated principal axis factor with three factors as our method of extraction, a varimax rotation,
and for comparison, we will also show the promax oblique solution. The determination of the
number of factors to extract should be guided by theory, but also informed by running the
analysis extracting different numbers of factors and seeing which number of factors yields the
most interpretable results.

In this example we have included many options, including the original and reproduced
correlation matrix, the scree plot and the plot of the rotated factors. While you may not wish to
use all of these options, we have included them here to aid in the explanation of the analysis. We
have also created a page of annotated output for a principal components analysis that parallels
this analysis. For general information regarding the similarities and differences between
principal components analysis and factor analysis, see Tabachnick and Fidell (2001), for
example.

factor
/variables item13 item14 item15 item16 item17 item18 item19 item20 item21
item22 item23 item24
/print initial det kmo repr extraction rotation fscore univariate
/format blank(.30)
/plot eigen rotation
/criteria factors(3)
/extraction paf
/rotation varimax
/method = correlation.

The table above is output because we used the univariate option on the /print subcommand.
Please note that the only way to see how many cases were actually used in the factor analysis is
to include the univariate option on the /print subcommand. The number of cases used in the
analysis will be less than the total number of cases in the data file if there are missing values on
any of the variables used in the factor analysis, because, by default, SPSS does a listwise deletion
of incomplete cases. If the factor analysis is being conducted on the correlations (as opposed to
the covariances), it is not much of a concern that the variables have very different means and/or
standard deviations (which is often the case when variables are measured on different scales).

a. Mean - These are the means of the variables used in the factor analysis.

b. Std. Deviation - These are the standard deviations of the variables used in the factor analysis.

c. Analysis N - This is the number of cases used in the factor analysis.

The table above is included in the output because we used the det option on the /print
subcommand. All we want to see in this table is that the determinant is not 0. If the determinant
is 0, then there will be computational problems with the factor analysis, and SPSS may issue a
warning message or be unable to complete the factor analysis.

a. Kaiser-Meyer-Olkin Measure of Sampling Adequacy - This measure varies between 0 and 1,
and values closer to 1 are better. A value of .6 is a suggested minimum.

b. Bartlett's Test of Sphericity - This tests the null hypothesis that the correlation matrix is an
identity matrix. An identity matrix is a matrix in which all of the diagonal elements are 1 and all
off diagonal elements are 0. You want to reject this null hypothesis.

Taken together, these tests provide a minimum standard which should be passed before a factor
analysis (or a principal components analysis) should be conducted.
a. Communalities - This is the proportion of each variable's variance that can be explained by
the factors (e.g., the underlying latent continua). It is also noted as h² and can be defined as the
sum of squared factor loadings for the variables.

b. Initial - With principal axis factoring, the initial values on the diagonal of the
correlation matrix are determined by the squared multiple correlation of the variable with the
other variables. For example, if you regressed item13 on items 14 through 24, the squared
multiple correlation coefficient would be .564.

c. Extraction - The values in this column indicate the proportion of each variable's variance that
can be explained by the retained factors. Variables with high values are well represented in the
common factor space, while variables with low values are not well represented. (In this
example, we don't have any particularly low values.) They are the reproduced variances from
the factors that you have extracted. You can find these values on the diagonal of the reproduced
correlation matrix.

a. Factor - The initial number of factors is the same as the number of variables used in the factor
analysis. However, not all 12 factors will be retained. In this example, only the first three
factors will be retained (as we requested).

b. Initial Eigenvalues - Eigenvalues are the variances of the factors. Because we conducted our
factor analysis on the correlation matrix, the variables are standardized, which means that each
variable has a variance of 1, and the total variance is equal to the number of variables used
in the analysis, in this case, 12.

c. Total - This column contains the eigenvalues. The first factor will always account for the
most variance (and hence have the highest eigenvalue), and the next factor will account for as
much of the left over variance as it can, and so on. Hence, each successive factor will account
for less and less variance.

d. % of Variance - This column contains the percent of total variance accounted for by each
factor.

e. Cumulative % - This column contains the cumulative percentage of variance accounted for
by the current and all preceding factors. For example, the third row shows a value of 68.313.
This means that the first three factors together account for 68.313% of the total variance.

f. Extraction Sums of Squared Loadings - The number of rows in this panel of the table
correspond to the number of factors retained. In this example, we requested that three factors be
retained, so there are three rows, one for each retained factor. The values in this panel of the
table are calculated in the same way as the values in the left panel, except that here the values are
based on the common variance. The values in this panel of the table will always be lower than
the values in the left panel of the table, because they are based on the common variance, which is
always smaller than the total variance.

g. Rotation Sums of Squared Loadings - The values in this panel of the table represent the
distribution of the variance after the varimax rotation. Varimax rotation tries to maximize the
variance of each of the factors, so the total amount of variance accounted for is redistributed over
the three extracted factors.

The scree plot graphs the eigenvalue against the factor number. You can see these values in the
first two columns of the table immediately above. From the third factor on, you can see that the
line is almost flat, meaning that each successive factor is accounting for smaller and smaller
amounts of the total variance.
b. Factor Matrix - This table contains the unrotated factor loadings, which are the correlations
between the variable and the factor. Because these are correlations, possible values range from
-1 to +1. On the /format subcommand, we used the option blank(.30), which tells SPSS not to
print any of the correlations that are .3 or less. This makes the output easier to read by removing
the clutter of low correlations that are probably not meaningful anyway.

c. Factor - The columns under this heading are the unrotated factors that have been extracted.
As you can see by the footnote provided by SPSS (a.), three factors were extracted (the three
factors that we requested).
c. Reproduced Correlations - This table contains two parts: the reproduced correlations in the
top part of the table, and the residuals in the bottom part of the table.

d. Reproduced Correlation - The reproduced correlation matrix is the correlation matrix based
on the extracted factors. You want the values in the reproduced matrix to be as close to the
values in the original correlation matrix as possible. This means that you want the residual
matrix, which contains the differences between the original and the reproduced matrices, to be close to zero. If
the reproduced matrix is very similar to the original correlation matrix, then you know that the
factors that were extracted accounted for a great deal of the variance in the original correlation
matrix, and these few factors do a good job of representing the original data. The numbers on
the diagonal of the reproduced correlation matrix are presented in the Communalities table in the
column labeled Extraction.

e. Residual - As noted in the first footnote provided by SPSS (a.), the values in this part of the
table represent the differences between original correlations (shown in the correlation table at the
beginning of the output) and the reproduced correlations, which are shown in the top part of this
table. For example, the original correlation between item13 and item14 is .661, and the
reproduced correlation between these two variables is .646. The residual is .016 = .661 - .646
(with some rounding error).
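To make these relationships concrete, the short Python sketch below builds a reproduced correlation matrix from a factor loading matrix and forms the residuals; the loadings and correlations used are hypothetical, not the ones from this analysis:

# Minimal sketch with hypothetical numbers: reproduced correlations and residuals.
# The reproduced correlation matrix is the loading matrix times its transpose;
# its diagonal holds the extraction communalities (sum of squared loadings per variable).
import numpy as np

L = np.array([[0.75, 0.30],          # hypothetical loadings: 3 variables, 2 factors
              [0.70, 0.25],
              [0.20, 0.80]])
R = np.array([[1.00, 0.66, 0.40],    # hypothetical original correlation matrix
              [0.66, 1.00, 0.35],
              [0.40, 0.35, 1.00]])

reproduced = L @ L.T                  # off-diagonal entries: reproduced correlations
communalities = np.diag(reproduced)   # diagonal: variance explained by the factors
residuals = R - reproduced            # want the off-diagonal residuals close to zero

print(np.round(reproduced, 3))
print(np.round(communalities, 3))
print(np.round(residuals, 3))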
b. Rotated Factor Matrix - This table contains the rotated factor loadings (factor pattern
matrix), which represent both how the variables are weighted for each factor and the
correlation between the variables and the factor. Because these are correlations, possible values
range from -1 to +1. On the /format subcommand, we used the option blank(.30), which tells
SPSS not to print any of the correlations that are .3 or less. This makes the output easier to read
by removing the clutter of low correlations that are probably not meaningful anyway.
For orthogonal rotations, such as varimax, the factor pattern and factor structure matrices are the
same.

c. Factor - The columns under this heading are the rotated factors that have been extracted. As
you can see by the footnote provided by SPSS (a.), three factors were extracted (the three factors
that we requested). These are the factors that analysts are most interested in and try to name.
For example, the first factor might be called "instructor competence" because items like
"instructor well prepare" and "instructor competence" load highly on it. The second factor might
be called "relating to students" because items like "instructor is sensitive to students" and
"instructor allows me to ask questions" load highly on it. The third factor has to do with
comparisons to other instructors and courses.

The table below is from another run of the factor analysis program shown above, except with a
promax rotation. We have included it here to show how different the rotated solutions can be,
and to better illustrate what is meant by simple structure. As you can see with an oblique
rotation, such as a promax rotation, the factors are permitted to be correlated with one another.
With an orthogonal rotation, such as the varimax shown above, the factors are not permitted to
be correlated (they are orthogonal to one another). Oblique rotations, such as promax, produce
both factor pattern and factor structure matrices. For orthogonal rotations, such as varimax and
equimax, the factor structure and the factor pattern matrices are the same. The factor structure
matrix represents the correlations between the variables and the factors. The factor pattern
matrix contains the coefficients for the linear combination of the variables.
The table below indicates that the rotation done is an oblique rotation. If an orthogonal rotation
had been done (like the varimax rotation shown above), this table would not appear in the output
because the correlations between the factors are set to 0. Here, you can see that the factors are
highly correlated.
The rest of the output shown below is part of the output generated by the SPSS syntax shown at
the beginning of this page.

a. Factor Transformation Matrix - This is the matrix by which you multiply the unrotated
factor matrix to get the rotated factor matrix.
The plot above shows the items (variables) in the rotated factor space. While this picture may
not be particularly helpful, when you get this graph in the SPSS output, you can interactively
rotate it. This may help you to see how the items (variables) are organized in the common factor
space.
a. Factor Score Coefficient Matrix - This is the factor weight matrix and is used to compute
the factor scores.
a. Factor Score Covariance Matrix - Because we used an orthogonal rotation, this should be a
diagonal matrix, meaning that all of the off-diagonal elements should be zero. In actuality the
factors are uncorrelated; however, because factor scores are estimated
there may be slight correlations among the factor scores.

One-Way Manova

This page shows an example of multivariate analysis of variance (manova) in SPSS with
footnotes explaining the output. The data used in this example are from the following
experiment.

A researcher randomly assigns 33 subjects to one of three groups. The first group receives
technical dietary information interactively from an on-line website. Group 2 receives the same
information from a nurse practitioner, while group 3 receives the information from a video tape
made by the same nurse practitioner. Each subject then made three ratings: difficulty, usefulness,
and importance of the information in the presentation. The researcher looks at three different
ratings of the presentation (difficulty, usefulness and importance) to determine if there is a
difference in the modes of presentation. In particular, the researcher is interested in whether the
interactive website is superior because that is the most cost-effective way of delivering the
information. In the dataset, the ratings are presented in the variables useful, difficulty and
importance. The variable group indicates the group to which a subject was assigned.

We are interested in how the variability in the three ratings can be explained by a subject's
group. Group is a categorical variable with three possible values: 1, 2 or 3. Because we have
multiple dependent variables that cannot be combined, we will choose to use manova. Our null
hypothesis in this analysis is that a subject's group has no effect on any of the three different
ratings, and we can test this hypothesis on the dataset, manova.sav.

GET FILE='C:\manova.sav'.

We can start by examining the three outcome variables.

DESCRIPTIVES VARIABLES=useful difficulty importance.


FREQUENCIES VARIABLES=group.

MEANS TABLES=useful difficulty importance BY group.


Next, we can enter our manova command. In SPSS, manova can be conducted through the
generalized linear model function, GLM. In the manova command, we first list the outcome
variables, then indicate any categorical factors after "by" and any covariates after "with". Here,
group is a categorical factor. We must also indicate the lowest and highest values found in
group. We are also asking SPSS to print the eigenvalues generated. These will be useful in
seeing how the test statistics are calculated.

manova useful difficulty importance by group(1,3)


/print=sig(eigen).

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
The default error term in MANOVA has been changed from WITHIN CELLS to
WITHIN+RESIDUAL. Note that these are the same for all full factorial
designs.

* * * * * * A n a l y s i s o f V a r i a n c e * * * * * *

33 cases accepted.
0 cases rejected because of out-of-range factor values.
0 cases rejected because of missing data.
3 non-empty cells.

1 design will be processed.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

* * * * * * A n a l y s i s o f V a r i a n c e -- design 1 * * * * * *
EFFECT .. GROUP
Multivariate Tests of Significance (S = 2, M = 0, N = 13 )

Test Name Value Approx. F Hypoth. DF Error DF Sig. of F

Pillais .47667 3.02483 6.00 58.00 .012


Hotellings .89723 4.03753 6.00 54.00 .002
Wilks .52579 3.53823 6.00 56.00 .005
Roys .47146
Note.. F statistic for WILKS' Lambda is exact.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Eigenvalues and Canonical Correlations

Root No. Eigenvalue Pct. Cum. Pct. Canon Cor.

1 .892 99.416 99.416 .687


2 .005 .584 100.000 .072

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
EFFECT .. GROUP (Cont.)
Univariate F-tests with (2,30) D. F.

Variable Hypoth. SS Error SS Hypoth. MS Error MS F Sig. of F

USEFUL 52.92424 293.96544 26.46212 9.79885 2.70053 .083


DIFFICUL 3.97515 126.28728 1.98758 4.20958 .47216 .628
IMPORTAN 81.82969 426.37090 40.91485 14.21236 2.87882 .072

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Abbreviated Extended
Name Name

DIFFICUL DIFFICULTY
IMPORTAN IMPORTANCE

Manova Output
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
The default error term in MANOVA has been changed from WITHIN CELLS to
WITHIN+RESIDUAL. Note that these are the same for all full factorial
designs.

* * * * * * A n a l y s i s o f V a r i a n c e * * * * * *

33 cases accepted.
0 cases rejected because of out-of-range factor values.
0 cases rejected because of missing data.
3 non-empty cells.
1 design will be processed.a

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

* * * * * * A n a l y s i s o f V a r i a n c e -- design 1 * * * * * *

EFFECTb .. GROUP
Multivariate Tests of Significance (S = 2, M = 0, N = 13 )

Test Name Valuec Approx. Fd Hypoth. DFe Error DFf Sig. of Fg

Pillaish .47667 3.02483 6.00 58.00 .012


Hotellingsi .89723 4.03753 6.00 54.00 .002
Wilksj .52579 3.53823 6.00 56.00 .005
Roysk .47146
Note.. F statistic for WILKS' Lambda is exact.l

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Eigenvalues and Canonical Correlationsm

Root No. Eigenvalue Pct. Cum. Pct. Canon Cor.

1 .892 99.416 99.416 .687


2 .005 .584 100.000 .072

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
EFFECT .. GROUP (Cont.)
Univariate F-tests with (2,30) D. F.n

Variable Hypoth. SS Error SS Hypoth. MS Error MS F Sig. of F

USEFUL 52.92424 293.96544 26.46212 9.79885 2.70053 .083


DIFFICUL 3.97515 126.28728 1.98758 4.20958 .47216 .628
IMPORTAN 81.82969 426.37090 40.91485 14.21236 2.87882 .072

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Abbreviated Extended
Name Name

DIFFICUL DIFFICULTY
IMPORTAN IMPORTANCE

a. Case summary - This provides counts of the observations to be included in the manova and
the counts of observations to be dropped due to missing data or data that falls out-of-range. For
example, a record where the value for group is 4, after we have specified that the maximum
value for group is 3, would be considered out-of-range.
b. Effect - This indicates the predictor variable in question. In our model, we are looking at the
effect of group.

c. Value - This is the test statistic for the given effect and multivariate statistic listed in the prior
column. For each predictor variable, SPSS calculates four test statistics. All of these test
statistics are calculated using the eigenvalues of the model (see superscript m). See superscripts
h, i, j and k for explanations of each of the tests.

d. Approx. F - This is the approximate F statistic for the given effect and test statistic.

e. Hypoth. DF - This is the number of degrees of freedom in the model.

f. Error DF - This is the number of degrees of freedom associated with the model errors. There
are instances in manova when the degrees of freedom may be a non-integer.

g. Sig. of F - This is the p-value associated with the F statistic and the hypothesis and error
degrees of freedom of a given effect and test statistic. The null hypothesis that a given predictor
has no effect on any of the outcomes is evaluated with regard to this p-value. For a given alpha
level, if the p-value is less than alpha, the null hypothesis is rejected. If not, then we fail to reject
the null hypothesis. In this example, we reject the null hypothesis that group has no effect on the
three different ratings at alpha level .05 because the p-values are all less than .05.

h. Pillais - This is Pillai's Trace, one of the four multivariate criteria test statistics used in
manova. We can calculate Pillai's trace using the generated eigenvalues (see superscript m).
Divide each eigenvalue by (1 + the eigenvalue), then sum these ratios. So in this example, you
would first calculate 0.89198790/(1+0.89198790) = 0.471455394, 0.00524207/(1+0.00524207)
= 0.005214734, and 0/(1+0)=0. When these are added, we arrive at Pillai's trace: (0.471455394
+ 0.005214734 + 0) = .47667.

i. Hotellings - This is Lawley-Hotelling's Trace. It is very similar to Pillai's Trace. It is the sum
of the eigenvalues (see superscript m) and is a direct generalization of the F statistic in ANOVA.
We can calculate the Hotelling-Lawley Trace by summing the characteristic roots listed in the
output: 0.8919879 + 0.00524207 + 0 = 0.89723.

j. Wilks - This is Wilks' Lambda. This can be interpreted as the proportion of the variance in the
outcomes that is not explained by an effect. To calculate Wilks' Lambda, for each eigenvalue,
calculate 1/(1 + the eigenvalue), then find the product of these ratios. So in this example, you
would first calculate 1/(1+0.8919879) = 0.5285446, 1/(1+0.00524207) = 0.9947853, and 1/
(1+0)=1. Then multiply 0.5285446 * 0.9947853 * 1 = 0.52579.

k. Roys - This is Roy's Largest Root. We can calculate this value by dividing the largest
eigenvalue by (1+largest eigenvalue). Here, the value is 0.8919879/(1+0.8919879). Because it is
based only on the maximum eigenvalue, it can behave differently from the other three test
statistics. In instances where the other three are not significant and Roy's is significant, the effect
should be considered not significant.
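Using the two nonzero eigenvalues reported in the output (see superscript m), all four statistics can be verified with a short Python sketch:

# Minimal sketch: recompute the four multivariate test statistics from the
# eigenvalues listed under "Eigenvalues and Canonical Correlations" above.
import numpy as np

eigenvalues = np.array([0.8919879, 0.00524207, 0.0])   # third root is zero

pillai = np.sum(eigenvalues / (1 + eigenvalues))        # Pillai's Trace      -> .47667
hotelling = np.sum(eigenvalues)                         # Hotelling-Lawley    -> .89723
wilks = np.prod(1 / (1 + eigenvalues))                  # Wilks' Lambda       -> .52579
roy = eigenvalues.max() / (1 + eigenvalues.max())       # Roy's Largest Root  -> .47146

print(round(pillai, 5), round(hotelling, 5), round(wilks, 5), round(roy, 5))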
l. Note - This indicates that the F statistic for Wilks' Lambda was calculated exactly. For the
other test statistics, the F values are approximate (as indicated by the column heading).

m. Eigenvalues and Canonical Correlations - This section of output provides the eigenvalues
of the product of the model (hypothesis) sum-of-squares matrix and the inverse of the error
sum-of-squares matrix. There is one eigenvalue for each eigenvector of this 3x3 product matrix,
so there are at most three eigenvalues. Because only two are
listed here, we can assume the third eigenvalue is zero. These values can be used to calculate the
four multivariate test statistics.

n. Univariate F-tests - The manova procedure provides both univariate and multivariate output.
This section of output provides summarized output from a one-way anova for each of the
outcomes in the manova. Each row corresponds to a different one-way anova, one for each
dependent variable in the manova. While the manova tested a single hypothesis, each line in this
output corresponds to a test of a different hypothesis. Generally, if your manova suggests that an
effect is significant, you would expect at least one of these one-way anova tests to indicate that
the effect is significant for that particular outcome.

Discriminant Analysis

This page shows an example of a discriminant analysis in SPSS with footnotes explaining the
output. The data used in this example are from a data file, discrim.sav, with 244 observations on
four variables. The variables include three continuous, numeric variables (outdoor, social and
conservative) and one categorical variable (job) with three levels: 1) customer service, 2)
mechanic and 3) dispatcher.

We are interested in the relationship between the three continuous variables and our categorical
variable. Specifically, we would like to know how many dimensions we would need to express
this relationship. Using this relationship, we can predict a classification based on the continuous
variables or assess how well the continuous variables separate the categories in the
classification. We will be discussing the degree to which the continuous variables can be used to
discriminate between the groups. Some options for visualizing what occurs in discriminant
analysis can be found in the Discriminant Analysis Data Analysis Example.

To start, we can examine the overall means of the continuous variables.

get file='C:\Data\discrim.sav'.
descriptives
variables=outdoor social conservative
/statistics=mean stddev min max .
We are interested in how job relates to outdoor, social and conservative. Let's look at summary
statistics of these three continuous variables for each job category.

means
tables=outdoor social conservative by job
/cells mean count stddev .

From this output, we can see that some of the means of outdoor, social and conservative differ
noticeably from group to group in job. These differences will hopefully allow us to use these
predictors to distinguish observations in one job group from observations in another job group.
Next, we can look at the correlations between these three predictors. These correlations will give
us some indication of how much unique information each predictor will contribute to the
analysis. If two predictor variables are very highly correlated, then they will be contributing
shared information to the analysis. Uncorrelated variables are likely preferable in this respect.
We will also look at the frequency of each job group.

correlations
variables=outdoor social conservative .

frequencies
variables=job .

The discriminant command in SPSS performs canonical linear discriminant analysis which is
the classical form of discriminant analysis. In this example, we specify in the groups
subcommand that we are interested in the variable job, and we list in parentheses the minimum
and maximum values seen in job. We next list the discriminating variables, or predictors, in the
variables subcommand. In this example, we have selected three predictors: outdoor, social and
conservative. We will be interested in comparing the actual groupings in job to the predicted
groupings generated by the discriminant analysis. For this, we use the statistics subcommand.
This will provide us with classification statistics in our output.
discriminant
/groups=job(1 3)
/variables=outdoor social conservative
/statistics=table.

Data Summary

a. Analysis Case Processing Summary - This table summarizes the analysis dataset in terms of
valid and excluded cases. The reasons why SPSS might exclude an observation from the
analysis are listed here, and the number ("N") and percent of cases falling into each category
(valid or one of the exclusions) are presented. In this example, all of the observations in the
dataset are valid.
b. Group Statistics - This table presents the distribution of observations into the three groups
within job. We can see the number of observations falling into each of the three groups. In this
example, we are using the default weight of 1 for each observation in the dataset, so the weighted
number of observations in each group is equal to the unweighted number of observations in each
group.

Eigenvalues and Multivariate Tests

c. Function - This indicates the first or second canonical linear discriminant function. The
number of functions is the smaller of the number of discriminating variables and the number of
levels of the group variable minus one. In this example, job has
three levels and three discriminating variables were used, so two functions are calculated. Each
function acts as a projection of the data onto a dimension that best separates or discriminates
between the groups.

d. Eigenvalue - These are the eigenvalues of the matrix product of the inverse of the within-
group sums-of-squares and cross-product matrix and the between-groups sums-of-squares and
cross-product matrix. These eigenvalues are related to the canonical correlations and describe
how much discriminating ability a function possesses. The magnitudes of the eigenvalues are
indicative of the functions' discriminating abilities. See superscript e for underlying
calculations.

e. % of Variance - This is the proportion of discriminating ability of the three continuous
variables found in a given function. This proportion is calculated as the ratio of the
function's eigenvalue to the sum of all the eigenvalues. In this analysis, the first function
accounts for 77% of the discriminating ability of the discriminating variables and the second
function accounts for 23%. We can verify this by noting that the sum of the eigenvalues is
1.081+.321 = 1.402. Then (1.081/1.402) = 0.771 and (0.321/1.402) = 0.229.
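The same arithmetic, as a short Python sketch using the eigenvalues reported in the table:

# Minimal sketch: percent and cumulative percent of discriminating ability
# from the two eigenvalues reported above.
import numpy as np

eigenvalues = np.array([1.081, 0.321])

pct = eigenvalues / eigenvalues.sum() * 100    # about 77.1 and 22.9
cum_pct = np.cumsum(pct)                       # about 77.1 and 100.0

print(np.round(pct, 1), np.round(cum_pct, 1))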
f. Cumulative % - This is the cumulative proportion of discriminating ability. For any analysis,
the proportions of discriminating ability will sum to one. Thus, the last entry in the cumulative
column will also be one.

g. Canonical Correlation - These are the canonical correlations of our predictor variables
(outdoor, social and conservative) and the groupings in job. If we consider our discriminating
variables to be one set of variables and the set of dummies generated from our grouping variable
to be another set of variables, we can perform a canonical correlation analysis on these two sets.
From this analysis, we would arrive at these canonical correlations.

h. Test of Function(s) - These are the functions included in a given test with the null hypothesis
that the canonical correlations associated with the functions are all equal to zero. In this
example, we have two functions. Thus, the first test presented in this table tests both canonical
correlations ("1 through 2") and the second test presented tests the second canonical correlation
alone.

i. Wilks' Lambda - Wilks' Lambda is one of the multivariate statistics calculated by SPSS. It is
the product of the values of (1 - canonical correlation²). In this example, our canonical
correlations are 0.721 and 0.493, so the Wilks' Lambda testing both canonical correlations is
(1 - 0.721²)*(1 - 0.493²) = 0.364, and the Wilks' Lambda testing the second canonical correlation
alone is (1 - 0.493²) = 0.757.
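These values can be verified from the canonical correlations with a short Python sketch:

# Minimal sketch: Wilks' Lambda for each "Test of Function(s)" row, computed as the
# product of (1 - r^2) over the canonical correlations included in that test.
import numpy as np

canon_corr = np.array([0.721, 0.493])            # canonical correlations from the output

wilks_1_through_2 = np.prod(1 - canon_corr**2)   # about 0.364
wilks_2_alone = 1 - canon_corr[1]**2             # about 0.757

print(round(wilks_1_through_2, 3), round(wilks_2_alone, 3))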

j. Chi-square - This is the Chi-square statistic testing that the canonical correlation of the given
function is equal to zero. In other words, the null hypothesis is that the function, and all
functions that follow, have no discriminating ability. This hypothesis is tested using this Chi-
square statistic.

k. df - This is the effect degrees of freedom for the given function. It is based on the number of
groups present in the categorical variable and the number of continuous discriminant variables.
The Chi-square statistic is compared to a Chi-square distribution with the degrees of freedom
stated here.

l. Sig. - This is the p-value associated with the Chi-square statistic of a given test. The null
hypothesis that a given function's canonical correlation and all smaller canonical correlations are
equal to zero is evaluated with regard to this p-value. For a given alpha level, such as 0.05, if
the p-value is less than alpha, the null hypothesis is rejected. If not, then we fail to reject the null
hypothesis.

Discriminant Function Output


m. Standardized Canonical Discriminant Function Coefficients - These coefficients can be
used to calculate the discriminant score for a given case. The score is calculated in the same
manner as a predicted value from a linear regression, using the standardized coefficients and the
standardized variables. For example, let zoutdoor, zsocial and zconservative be the variables
created by standardizing our discriminating variables. Then, for each case, the function scores
would be calculated using the following equations:

Score1 = 0.379*zoutdoor - 0.831*zsocial + 0.517*zconservative

Score2 = 0.926*zoutdoor + 0.213*zsocial - 0.291*zconservative

The distribution of the scores from each function is standardized to have a mean of zero and
standard deviation of one. The magnitudes of these coefficients indicate how strongly the
discriminating variables affect the score. For example, we can see that the standardized
coefficient for zsocial in the first function is greater in magnitude than the coefficients for the
other two variables. Thus, social will have the greatest impact of the three on the first
discriminant score.
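The following Python sketch applies these equations to a single case; the standardized predictor values (z-scores) are made up for illustration, while the coefficients come from the table above:

# Minimal sketch: discriminant function scores for one case, using the standardized
# canonical discriminant function coefficients from the output. The z-scores for
# this case are hypothetical.
zoutdoor, zsocial, zconservative = 0.5, -1.2, 0.3   # hypothetical standardized values

score1 = 0.379 * zoutdoor - 0.831 * zsocial + 0.517 * zconservative
score2 = 0.926 * zoutdoor + 0.213 * zsocial - 0.291 * zconservative

print(round(score1, 3), round(score2, 3))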

n. Structure Matrix - This is the canonical structure, also known as canonical loading or
discriminant loading, of the discriminant functions. It represents the correlations between the
observed variables (the three continuous discriminating variables) and the dimensions created
with the unobserved discriminant functions (dimensions).

o. Functions at Group Centroids - These are the means of the discriminant function scores by
group for each function calculated. If we calculated the scores of the first function for each case
in our dataset, and then looked at the means of the scores by group, we would find that the
customer service group has a mean of -1.219, the mechanic group has a mean of 0.107, and the
dispatch group has a mean of 1.420. We know that the function scores have a mean of zero, and
we can check this by looking at the sum of the group means multiplied by the number of cases in
each group: (85*-1.219)+(93*.107)+(66*1.420) = 0.
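This check can be reproduced with a short Python sketch using the group sizes and centroids reported in the output:

# Minimal sketch: the weighted sum of the first function's group centroids
# (weights = group sizes) should be approximately zero, because the function
# scores are centered at zero overall.
group_sizes = [85, 93, 66]              # customer service, mechanic, dispatch
centroids = [-1.219, 0.107, 1.420]      # group means of the first function

weighted_sum = sum(n * m for n, m in zip(group_sizes, centroids))
print(round(weighted_sum, 3))           # close to 0; the small departure is rounding error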

Predicted Classifications
p. Classification Processing Summary - This is similar to the Analysis Case Processing
Summary (see superscript a), but in this table, "Processed" cases are those that were successfully
classified based on the analysis. The reasons why an observation may not have been processed
are listed here. We can see that in this example, all of the observations in the dataset were
successfully classified.

q. Prior Probabilities for Groups - This is the distribution of observations into the job groups
used as a starting point in the analysis. The default prior distribution is an equal allocation into
the groups, as seen in this example. SPSS allows users to specify different priors with the priors
subcommand.

r. Predicted Group Membership - These are the predicted frequencies of groups from the
analysis. The numbers going down each column indicate how many were correctly and
incorrectly classified. For example, of the 89 cases that were predicted to be in the customer
service group, 70 were correctly predicted, and 19 were incorrectly predicted (16 cases were in
the mechanic group and three cases were in the dispatch group).

s. Original - These are the frequencies of groups found in the data. We can see from the row
totals that 85 cases fall into the customer service group, 93 fall into the mechanic group, and 66
fall into the dispatch group. These match the results we saw earlier in the output for the
frequencies command. Across each row, we see how many of the cases in the group are
classified by our analysis into each of the different groups. For example, of the 85 cases that are
in the customer service group, 70 were predicted correctly and 15 were predicted incorrectly (11
were predicted to be in the mechanic group and four were predicted to be in the dispatch
group).

t. Count - This portion of the table presents the number of observations falling into the given
intersection of original and predicted group membership. For example, we can see in this
portion of the table that the number of observations originally in the customer service group, but
predicted to fall into the mechanic group is 11. The row totals of these counts are presented, but
column totals are not.
u. % - This portion of the table presents the percent of observations originally in a given group
(listed in the rows) predicted to be in a given group (listed in the columns). For example, we can
see that the percent of observations in the mechanic group that were predicted to be in the
dispatch group is 16.1%. This is NOT the same as the percent of observations predicted to be in
the dispatch group that were in the mechanic group. The latter is not presented in this table.

Canonical Correlation Analysis

This page shows an example of a canonical correlation analysis with footnotes explaining the
output in SPSS. A researcher has collected data on three psychological variables, four academic
variables (standardized test scores) and gender for 600 college freshmen. She is interested in
how the set of psychological variables relates to the academic variables and gender. In
particular, the researcher is interested in how many dimensions are necessary to understand the
association between the two sets of variables.

We have a data file, mmr.sav, with 600 observations on eight variables. The psychological
variables are locus of control, self-concept and motivation. The academic variables are
standardized test scores in reading, writing, math and science. Additionally, the variable
female is a zero-one indicator variable with the one indicating a female student. The researcher
is interested in the relationship between the psychological variables and the academic variables,
with gender considered as well. Canonical correlation analysis aims to find pairs of linear
combinations of each group of variables that are highly correlated. These linear combinations
are called canonical variates. Each canonical variate is orthogonal to the other canonical variates
except for the one with which its correlation has been maximized. The possible number of such
pairs is limited to the number of variables in the smaller set. In our example, there are three
psychological variables and more than three academic variables. Thus, a canonical correlation
analysis on these sets of variables will generate three pairs of canonical variates.

To begin, let's read in and summarize the dataset.

get file='d:\data\mmr.sav'.

descriptives
variables=locus_of_control self_concept motivation
read write math science female
/statistics=mean stddev min max.
These descriptives indicate that there are not any missing values in the data and suggest the
different scales of the different variables. We can proceed with the canonical correlation analysis
without worries of missing data, keeping in mind that our variables differ widely in scale.

SPSS performs canonical correlation using the manova command with the discrim option. The
manova command is one of the SPSS commands that can only be accessed via syntax; there is
not a sequence of pull-down menus or point-and-clicks that could arrive at this analysis.

Due to the length of the output, we will be omitting some of the output that is extraneous to our
canonical correlation analysis and making comments in several places along the way.

In the manova command, we first list the variables in our psychological group
(locus_of_control, self_concept and motivation). Then, after the SPSS keyword with, we list
the variables in our academic group (read, write, math, science and female). SPSS refers to the
first group of variables as the "dependent variables" and the second group of variables as the
"covariates". This follows manova convention.

manova locus_of_control self_concept motivation with read write math science


female
/ discrim all alpha(1)
/ print=sig(eigen dim).
...[additional output omitted]...
* * * * * * A n a l y s i s o f V a r i a n c e -- design 1 * * * * * *

EFFECT .. WITHIN CELLS Regression


Multivariate Tests of Significance (S = 3, M = 1/2, N = 295 )

Test Name Value Approx. F Hypoth. DF Error DF Sig. of F

Pillais .25425 11.00057 15.00 1782.00 .000


Hotellings .31430 12.37633 15.00 1772.00 .000
Wilks .75436 11.71573 15.00 1634.65 .000
Roys .21538

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Eigenvalues and Canonical Correlations
Root No. Eigenvalue Pct. Cum. Pct. Canon Cor. Sq. Cor

1 .274 87.336 87.336 .464 .215


2 .029 9.185 96.522 .168 .028
3 .011 3.478 100.000 .104 .011

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Dimension Reduction Analysis

Roots Wilks L. F Hypoth. DF Error DF Sig. of F

1 TO 3 .75436 11.71573 15.00 1634.65 .000


2 TO 3 .96143 2.94446 8.00 1186.00 .003
3 TO 3 .98919 2.16461 3.00 594.00 .091

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

...[additional output omitted]...

* * * * * * A n a l y s i s o f V a r i a n c e -- design 1 * * * * * *

Raw canonical coefficients for DEPENDENT variables


Function No.

Variable 1 2 3

locus_of 1.254 -.621 .662


self_con -.351 -1.188 -.827
motivati 1.262 2.027 -2.000

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Standardized canonical coefficients for DEPENDENT variables
Function No.

Variable 1 2 3

locus_of .840 -.417 .444


self_con -.248 -.838 -.583
motivati .433 .695 -.686

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Correlations between DEPENDENT and canonical variables
Function No.

Variable 1 2 3

locus_of .904 -.390 .176


self_con .021 -.709 -.705
motivati .567 .351 -.745

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Variance in dependent variables explained by canonical variables

CAN. VAR. Pct Var DE Cum Pct DE Pct Var CO Cum Pct CO

1 37.980 37.980 8.180 8.180


2 25.910 63.889 .727 8.907
3 36.111 100.000 .391 9.297

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Raw canonical coefficients for COVARIATES
Function No.

COVARIATE 1 2 3

read .045 -.005 -.021


write .036 .042 -.091
math .023 .004 -.009
science .005 -.085 .110
female .632 1.085 1.795

* * * * * * A n a l y s i s o f V a r i a n c e -- design 1 * * * * * *

Standardized canonical coefficients for COVARIATES


CAN. VAR.

COVARIATE 1 2 3

read .451 -.050 -.216


write .349 .409 -.888
math .220 .040 -.088
science .049 -.827 1.066
female .315 .541 .894

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Correlations between COVARIATES and canonical variables
CAN. VAR.

Covariate 1 2 3

read .840 -.359 -.135


write .877 .065 -.255
math .764 -.298 -.148
science .658 -.677 .230
female .364 .755 .543

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Variance in covariates explained by canonical variables

CAN. VAR. Pct Var DE Cum Pct DE Pct Var CO Cum Pct CO

1 11.305 11.305 52.488 52.488


2 .701 12.006 24.994 77.482
3 .098 12.104 9.066 86.548

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

...[additional output omitted]...

Data Summary, Eigenvalues and Hypothesis Tests


* * * * * * A n a l y s i s o f V a r i a n c e -- design 1 * * * * * *

EFFECT .. WITHIN CELLS Regression


Multivariate Tests of Significance (S = 3, M = 1/2, N = 295 )

Test Name Valuee Approx. Ff Hypoth. DFg Error DFg Sig. of Fh

Pillaisa .25425 11.00057 15.00 1782.00 .000


Hotellingsb .31430 12.37633 15.00 1772.00 .000
Wilksc .75436 11.71573 15.00 1634.65 .000
Roysd .21538

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Eigenvalues and Canonical Correlations

Root No.i Eigenvaluej Pct.k Cum. Pct.l Canon Cor.m Sq. Corn

1 .274 87.336 87.336 .464 .215


2 .029 9.185 96.522 .168 .028
3 .011 3.478 100.000 .104 .011

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Dimension Reduction Analysis

Rootso Wilks L.p Ff Hypoth. DFg Error DFg Sig. of Fh

1 TO 3 .75436 11.71573 15.00 1634.65 .000


2 TO 3 .96143 2.94446 8.00 1186.00 .003
3 TO 3 .98919 2.16461 3.00 594.00 .091

a. Pillais - This is Pillai's trace, one of the four multivariate statistics calculated by SPSS to test
the null hypothesis that the canonical correlations are zero (which, in turn, means that there is no
linear relationship between the two specified groups of variables). Pillai's trace is the sum of the
squared canonical correlations, which can be found in the next section of output (see superscript
n): 0.464² + 0.168² + 0.104² = 0.25425.

b. Hotellings - This is the Hotelling-Lawley trace. It is very similar to Pillai's trace and can be
calculated as the sum of the values of (canonical correlation²/(1 - canonical correlation²)). We can
calculate 0.464²/(1 - 0.464²) + 0.168²/(1 - 0.168²) + 0.104²/(1 - 0.104²) = 0.31430.

c. Wilks - This is Wilks' lambda, another multivariate statistic calculated by SPSS. It is the
product of the values of (1 - canonical correlation²). In this example, our canonical correlations
are 0.4641, 0.1675, and 0.1040, so the Wilks' Lambda is (1 - 0.464²)*(1 - 0.168²)*(1 - 0.104²) =
0.75436.

d. Roys - This is Roy's greatest root. It can be calculated from the largest eigenvalue: largest
eigenvalue/(1 + largest eigenvalue). Because it is based on a maximum, it can behave differently
from the other three test statistics. In instances where the other three are not statistically
significant and Roy's is statistically significant, the effect should be considered to be not
statistically significant.
e. Value - This is the value of the multivariate test listed in the prior column.

f. (Approx.) F - These are the F values associated with the various tests that are included in
SPSS's output. For the multivariate tests, the F values are approximate.

g. Hypoth. DF, Error DF - These are the degrees of freedom used in determining the F values.
Note that there are instances in which the degrees of freedom may be a non-integer because these
degrees of freedom are calculated using the mean squared errors, which are often non-integers.

h. Sig. of F - This is the p-value associated with the F value of a given test statistic. The null
hypothesis that our two sets of variables are not linearly related is evaluated with regard to this p-
value. For a given alpha level, such as 0.05, if the p-value is less than alpha, the null hypothesis
is rejected. If not, then we fail to reject the null hypothesis.

i. Root No. - This is the rank of the given eigenvalue (largest to smallest). There are as many
roots as there were variables in the smaller of the two variable sets. In this example, our set of
psychological variables contains three variables and our set of academic variables contains five
variables. Thus the smaller variable set contains three variables and the analysis generates three
roots.

j. Eigenvalue - These are the eigenvalues of the product of the model matrix and the inverse of
the error matrix. These eigenvalues can also be calculated using the squared canonical
correlations. The largest eigenvalue is equal to largest squared correlation /(1- largest squared
correlation); 0.215/(1-0.215) = 0.274. These calculations can be completed for each correlation
to find the corresponding eigenvalue. The relative size of the eigenvalues reflect how much of
the variance in the canonical variates can be explained by the corresponding canonical
correlation. Thus, the eigenvalue corresponding to the first correlation is greatest, and all
subsequent eigenvalues are smaller.

k. Pct. - This is the percent of the sum of the eigenvalues represented by a given eigenvalue. The
sum of the three eigenvalues is (0.2745+0.0289+0.0109) = 0.3143. Then, the proportions can be
calculated: 0.2745/0.3143 = 0.8734, 0.0289/0.3143 = 0.0919, and 0.0109/0.3143 = 0.0348.
This is the proportion of explained variance in the canonical variates attributed to a given
canonical correlation.
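A short Python sketch that carries out the calculations described in superscripts j and k, starting from the Sq. Cor column:

# Minimal sketch: eigenvalues from the squared canonical correlations,
# eig = sq_cor / (1 - sq_cor), and the percent of their sum for each root.
import numpy as np

sq_cor = np.array([0.215, 0.028, 0.011])        # Sq. Cor column from the output

eigenvalues = sq_cor / (1 - sq_cor)             # about 0.274, 0.029, 0.011
pct = eigenvalues / eigenvalues.sum() * 100     # roughly matches the Pct. column

print(np.round(eigenvalues, 3))
print(np.round(pct, 1))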

l. Cum. Pct. - This is the cumulative sum of the percents.

m. Canon Cor. - These are the Pearson correlations of the pairs of canonical variates. The first
pair of variates, a linear combination of the psychological measurements and a linear
combination of the academic measurements, has a correlation coefficient of 0.464. The second
pair has a correlation coefficient of 0.168, and the third pair 0.104. Each subsequent pair of
canonical variates is less correlated. These can be interpreted as any other Pearson correlations.
That is, the square of the correlation represents the proportion of the variance in one group's
variate explained by the other group's variate.
n. Sq. Cor - These are the squares of the canonical correlations. For example, (0.464*0.464) =
0.215.

o. Roots - This is the set of roots included in the null hypothesis being tested. The null
hypothesis is that all of the correlations associated with the roots in the given set are equal to
zero in the population. By testing these different sets of roots, we are determining how many
dimensions are required to describe the relationship between the two groups of variables.
Because each root is less informative than the one before it, unnecessary dimensions will be
associated with the smallest eigenvalues. Thus, we start our test with the full set of roots and
then test subsets generated by omitting the greatest root in the previous set. Here, we first tested
all three roots, then roots two and three, and then root three alone.

p. Wilks L. - Here, the Wilks lambda test statistic is used for testing the null hypothesis that the
given canonical correlation and all smaller ones are equal to zero in the population. Each value
can be calculated as the product of the values of (1 - canonical correlation²) for the set of
canonical correlations being tested. In this example, our canonical correlations are 0.464, 0.168
and 0.104, so the value for testing that all three of the correlations are zero is
(1 - 0.464²)*(1 - 0.168²)*(1 - 0.104²) = 0.75436. To test that the two smaller canonical correlations,
0.168 and 0.104, are zero in the population, the value is (1 - 0.168²)*(1 - 0.104²) = 0.96143. The
value for testing that the smallest canonical correlation is zero is (1 - 0.104²) = 0.98919.

Canonical Coefficients, Correlations, and Variance Explained

* * * * * * A n a l y s i s o f V a r i a n c e -- design 1 * * * * * *

Raw canonical coefficients for DEPENDENT variablesq


Function No.

Variable 1 2 3

locus_of 1.254 -.621 .662


self_con -.351 -1.188 -.827
motivati 1.262 2.027 -2.000

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Standardized canonical coefficients for DEPENDENT variablesr
Function No.

Variable 1 2 3

locus_of .840 -.417 .444


self_con -.248 -.838 -.583
motivati .433 .695 -.686

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Correlations between DEPENDENT and canonical variabless
Function No.

Variable 1 2 3
locus_of .904 -.390 .176
self_con .021 -.709 -.705
motivati .567 .351 -.745

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Variance in dependent variables explained by canonical variablest

CAN. VAR. Pct Var DE Cum Pct DE Pct Var CO Cum Pct CO

1 37.980 37.980 8.180 8.180


2 25.910 63.889 .727 8.907
3 36.111 100.000 .391 9.297

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Raw canonical coefficients for COVARIATESq
Function No.

COVARIATE 1 2 3

read .045 -.005 -.021


write .036 .042 -.091
math .023 .004 -.009
science .005 -.085 .110
female .632 1.085 1.795

* * * * * * A n a l y s i s o f V a r i a n c e -- design 1 * * * * * *

Standardized canonical coefficients for COVARIATESr


CAN. VAR.

COVARIATE 1 2 3

read .451 -.050 -.216


write .349 .409 -.888
math .220 .040 -.088
science .049 -.827 1.066
female .315 .541 .894

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Correlations between COVARIATES and canonical variabless
CAN. VAR.

Covariate 1 2 3

read .840 -.359 -.135


write .877 .065 -.255
math .764 -.298 -.148
science .658 -.677 .230
female .364 .755 .543

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Variance in covariates explained by canonical variablesu

CAN. VAR. Pct Var DE Cum Pct DE Pct Var CO Cum Pct CO
1 11.305 11.305 52.488 52.488
2 .701 12.006 24.994 77.482
3 .098 12.104 9.066 86.548

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

q. Raw canonical coefficients for DEPENDENT/COVARIATE variables - These are the raw
canonical coefficients. They define the linear relationship between the variables in a given group
and the canonical variates. They can be interpreted in the same manner as regression
coefficients, assuming the canonical variate as the outcome variable. For example, a one unit
increase in locus_of_control leads to a 1.254 unit increase in the first variate of the
psychological measurements, and a one unit increase in read score leads to a 0.045 unit increase
in the first variate of the academic measurements. Recall that our variables varied in scale. This
is reflected in the varied scale of these raw coefficients.

r. Standardized canonical coefficients for DEPENDENT/COVARIATE variables - These are
the standardized canonical coefficients. This means that, if all of the variables in the analysis are
rescaled to have a mean of zero and a standard deviation of 1, the coefficients generating the
canonical variates would indicate how a one standard deviation increase in the variable would
change the variate. For example, an increase of one standard deviation in locus_of_control
would lead to a 0.840 standard deviation increase in the first variate of the psychological
measurements, and an increase of one standard deviation in read would lead to a 0.451 standard
deviation increase in the first variate of the academic measurements.

s. Correlations between DEPENDENT/COVARIATE variables and canonical variables -
These are the correlations between each variable in a group and the group's canonical variates.
For example, we can see in the "dependent" variables that locus_of_control has a Pearson
correlation of 0.904 with the first psychological variate, -0.390 with the second psychological
variate, and 0.176 with the third psychological variate. In the "covariates" section, we can see
that read has a Pearson correlation of 0.840 with the first academic variate, -0.359 with the
second academic variate, and -0.135 with the third academic variate.

t. Variance in dependent variables explained by canonical variables - This is the degree to
which the canonical variates of both the dependent variables (DE) and covariates (CO) can
explain the standardized variability in the dependent variables. For both sets of canonical
variates, the percent and cumulative percent of variability explained by each variate is displayed.

u. Variance in covariates explained by canonical variables - This is the degree to which the
canonical variates of both the dependent variables (DE) and covariates (CO) can explain the
standardized variability in the covariates. For both sets of canonical variates, the percent and
cumulative percent of variability explained by each variate is displayed.

Poisson Regression

This page shows an example of Poisson regression analysis with footnotes explaining the output
in SPSS. The data collected were academic information on 316 students. The response variable is
days absent during the school year (daysabs). We explore its relationship with math standardized
test scores (mathnce), language standardized test scores (langnce) and gender (female).

As assumed for a Poisson model, our response variable is a count variable, and each subject has
the same length of observation time. Had the observation time for subjects varied (i.e., some
subjects were followed for half a year, some for a year and the rest for two years) and we were
to neglect these differences in exposure time, our Poisson regression estimate would be biased
since our model assumes all subjects had the same follow up time. Also, the Poisson model, as
compared to other count models (e.g., negative binomial or zero-inflated models), is assumed to be
the appropriate model. In other words, we assume that the response variable is not over-dispersed
and does not have an excessive number of zeros.

The dataset can be downloaded here.

get file 'D:\lahigh.sav'.


recode gender (1=1) (else = 0) into female.
exe.

In SPSS, Poisson models are treated as a subset of generalized linear models. This is reflected in
the syntax. A generalized linear model is Poisson if the specified distribution is Poisson and the
link function is log.

genlin daysabs with female mathnce langnce

/model female mathnce langnce distribution = poisson link = log

/print cps history solution fit.


Case Processing Summary
a. Included - This is the number of observations from the dataset included in the model. An
observation is included if the outcome variable and all predictor variables have valid, non-
missing values.

b. Excluded - This is the number of observations from the dataset not included in the model due
to missing data in any of the outcome or predictor variables.

c. Total - This is the sum of the included and excluded records. It is equal to the total number of
observations in the dataset.

Iteration History

d. Iteration History - This is a listing of the log likelihood at each iteration. Remember that
Poisson regression, like binary and ordered logistic regression, uses maximum likelihood
estimation, which is an iterative procedure. The first iteration (called Iteration 0) reports the
log likelihood of the "null" model, the model with no predictors. At each subsequent iteration, the
log likelihood increases because the goal is to maximize the log likelihood. When the difference
between successive iterations is very small, the model is said to have "converged", the iterating
stops, and the results are displayed. For more information on this process for binary outcomes, see
Regression Models for Categorical and Limited Dependent Variables by J. Scott Long (pages 52-61).
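
To make the iteration history concrete, here is a minimal Newton-Raphson sketch for the Poisson
log likelihood. It is a simplified stand-in for the algorithm SPSS actually uses (which is not
documented here), and it reuses the hypothetical lahigh.csv export from the earlier sketch.

# Simplified Newton-Raphson maximization of the Poisson log likelihood (illustrative only).
import numpy as np
import pandas as pd
from scipy.special import gammaln

df = pd.read_csv("lahigh.csv")                    # hypothetical export of lahigh.sav
y = df["daysabs"].to_numpy(dtype=float)
X = np.column_stack([np.ones(len(df)),            # intercept column
                     df[["female", "mathnce", "langnce"]].to_numpy(dtype=float)])

def log_likelihood(beta):
    eta = X @ beta
    return np.sum(y * eta - np.exp(eta) - gammaln(y + 1))

beta = np.zeros(X.shape[1])                       # starting values
ll_old = log_likelihood(beta)
for iteration in range(25):
    mu = np.exp(X @ beta)
    gradient = X.T @ (y - mu)                     # first derivatives of the log likelihood
    hessian = -(X.T * mu) @ X                     # second derivatives of the log likelihood
    beta = beta - np.linalg.solve(hessian, gradient)
    ll_new = log_likelihood(beta)
    print(f"iteration {iteration}: log likelihood = {ll_new:.6f}")
    if abs(ll_new - ll_old) < 1e-8:               # "converged": change is very small
        break
    ll_old = ll_new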

Gradient Vector and Hessian Matrix


e. Gradient Vector and Hessian Matrix - In our model, we are estimating k+1 parameters
where k is the number of predictors: one for each of our predictors and one intercept parameter.
The log likelihood of our model is calculated based on these estimated parameters. The gradient
vector is the vector of partial derivatives of the log likelihood function with respect to the
estimated parameters and the Hessian matrix is the square matrix of second derivatives of this
log likelihood with respect to the estimated parameters. The variance-covariance matrix of the
model parameters is the negative of the inverse of the Hessian. The values in the Hessian can
suggest convergence problems in the model, but the iteration history and possible error messages
provided by SPSS are more useful tools in diagnosing problems with the model.
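
Continuing the sketch above (reusing X, y and the converged beta), the same two quantities yield
the variance-covariance matrix and the standard errors that appear later in the Parameter Estimates
table:

# Gradient, Hessian and variance-covariance matrix at the converged estimates (illustrative).
import numpy as np

mu = np.exp(X @ beta)                  # fitted means at the converged beta from the sketch above
gradient = X.T @ (y - mu)              # approximately zero at the maximum
hessian = -(X.T * mu) @ X              # square matrix of second derivatives

vcov = np.linalg.inv(-hessian)         # variance-covariance matrix of the parameter estimates
std_errors = np.sqrt(np.diag(vcov))    # the "Std. Error" column of the Parameter Estimates table
print(std_errors)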

Goodness-of-Fit
f. Deviance - Deviance is usually defined as the log likelihood of the final model, multiplied by
(-2). However, for Poisson regression, SPSS calculates the deviance as
2 Σ [ y_i ln(y_i/μ_i) − (y_i − μ_i) ], where y_i is the observed count and μ_i is the predicted
count for observation i (a term with y_i = 0 contributes only −(y_i − μ_i)).

Note that the log likelihood of the model is -1547.971. The usual formulation of the deviance
would yield (-2)(-1547.971) = 3095.942, which is greater than the deviance calculated using the
above formula.

g. Pearson Chi-Square - This is a goodness-of-fit measure that compares the predicted values of
the outcome variable with the actual values. It is calculated as Σ (y_i − μ_i)² / μ_i, the sum
over observations of the squared difference between the observed and predicted counts divided by
the predicted count.

There is no scaling in this model (the scale parameter is fixed at 1), so the Scaled Pearson
Chi-Square is equal to the Pearson Chi-Square.
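
Both statistics can be reproduced from the observed and fitted counts. The sketch below continues
the earlier illustration (y as before, and mu computed as np.exp(X @ beta) at the converged
estimates, or taken from result.fittedvalues in the statsmodels sketch); it checks items f and g
using the standard Poisson GLM formulas.

# Deviance and Pearson chi-square from observed counts y and fitted means mu (illustrative).
import numpy as np

# A term y*ln(y/mu) is taken as 0 when y == 0.
with np.errstate(divide="ignore", invalid="ignore"):
    dev_terms = np.where(y > 0, y * np.log(y / mu), 0.0) - (y - mu)
deviance = 2 * np.sum(dev_terms)

pearson_chi2 = np.sum((y - mu) ** 2 / mu)
print(deviance, pearson_chi2)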

h. Log Likelihood - This is the log likelihood of the final model.

i. AIC - This is the Akaike information criterion, a goodness-of-fit measure defined as (-2 ln L +
2k) where k is the number of parameters in the model and L is the likelihood function of the final
model.

j. BIC - This is the Bayesian information criterion, a goodness-of-fit measure defined as
(-2 ln L + k ln(n)), where n is the total number of observations, k is the number of model
parameters, and L is the likelihood function of the final model.
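
As a quick numerical check using the figures reported in this output (log likelihood -1547.971,
k = 4 estimated parameters, n = 316 included observations):

# AIC and BIC from the reported log likelihood (illustrative check).
import math

log_lik = -1547.971     # log likelihood of the final model, as reported above
k = 4                   # intercept + female + mathnce + langnce
n = 316                 # included observations

aic = -2 * log_lik + 2 * k               # about 3103.94
bic = -2 * log_lik + k * math.log(n)     # about 3118.97
print(aic, bic)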

Parameter Estimates
k. B - These are the estimated Poisson regression coefficients for the model. Recall that the
response variable is a count variable, and Poisson regression models the log of the expected
count as a function of the predictor variables. We can interpret a Poisson regression coefficient
as follows: for a one unit change in the predictor variable, the log of the expected count is
expected to change by the respective regression coefficient, given that the other predictor
variables in the model are held constant. Exponentiating a coefficient converts this additive
change on the log scale into a multiplicative effect on the expected count itself (an incidence
rate ratio); a sketch of this conversion follows the langnce paragraph below.

(Intercept) - This is the Poisson regression estimate when all variables in the model are
evaluated at zero. For males (the variable female evaluated at zero) with zero mathnce and
langnce test scores, the log of the expected count for daysabs is 2.287 units. Note that
evaluating mathnce and langnce at zero is out of the range of plausible test scores. If the test
scores were mean-centered, the intercept would have a natural interpretation: the log of the
expected count for males with average mathnce and langnce test scores.

female - This is the estimated Poisson regression coefficient comparing females to males,
given the other variables are held constant in the model. The difference in the logs of expected
counts is expected to be 0.401 unit higher for females compared to males, while holding the
other variables constant in the model. So if we consider two students, one male and one female,
with identical math and language test scores, the female student will have a higher predicted
value of log(# days absent) than the male student. Thus, we would expect the female student to
have more days absent than her male counterpart.

mathnce - This is the Poisson regression estimate for a one unit increase in math standardized
test score, given the other variables are held constant in the model. If a student were to increase
her mathnce test score by one point, the difference in the logs of expected counts would be
expected to decrease by 0.004 unit, while holding the other variables in the model constant. If we
consider two students of the same sex who have the same language score, we would expect the
student with the higher math score of the two to have fewer days absent than the other student.

langnce - This is the Poisson regression estimate for a one unit increase in language
standardized test score, given the other variables are held constant in the model. If a student were
to increase her langnce test score by one point, the difference in the logs of expected counts
would be expected to decrease by 0.012 unit while holding the other variables in the model
constant. If we consider two students of the same sex who have the same math score, we would
expect the student with the higher language score of the two to have fewer days absent than the
other student.
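
Exponentiating the coefficients discussed above converts them into incidence rate ratios, i.e.,
multiplicative effects on the expected number of days absent. A minimal sketch using the estimates
reported in this section:

# Incidence rate ratios from the reported coefficients (illustrative check).
import math

b_female  = 0.4009209     # coefficients as shown later with more decimal places
b_mathnce = -0.0035232
b_langnce = -0.0121521

print(math.exp(b_female))     # about 1.49: females expected to have about 49% more days absent
print(math.exp(b_mathnce))    # about 0.996: slightly fewer expected days per extra math point
print(math.exp(b_langnce))    # about 0.988: about 1.2% fewer expected days per extra language point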

l. Std. Error - These are the standard errors of the individual regression coefficients. They are
used both in the calculation of the Wald Chi-Square test statistic, superscript n, and the
confidence interval of the regression coefficient, superscript m.

m. 95% Wald Confidence Interval - This is the confidence interval (CI) of an individual
Poisson regression coefficient, given that the other predictors are in the model. For a given
predictor at the 95% confidence level, we'd say that upon repeated sampling, 95% of the intervals
constructed this way would include the "true" population Poisson regression coefficient. It is
calculated as B ± z(α/2)*(Std. Error), where z(α/2) is the critical value on the standard normal
distribution (approximately 1.96 for a 95% interval). The CI is equivalent to the Wald test of the
coefficient: if the CI includes zero, we'd fail to reject the null hypothesis that a particular
regression coefficient is zero, given that the other predictors are in the model. An advantage of
a CI is that it is illustrative; it provides information on where the "true" parameter may lie and
on the precision of the point estimate.

n. Wald Chi-Square - These are the test statistics for the individual regression coefficients. The
test statistic is the squared ratio of the coefficient B to its Std. Error. Under the null
hypothesis that B equals zero, it follows a chi-square distribution with one degree of freedom, and
it is used to test against the two-sided alternative that B is not equal to zero.

o. df - This column lists the degrees of freedom for each of the variables included in the model.
For each of these variables, the degree of freedom is 1.

p. Sig. - These are the p-values of the coefficients: for each predictor, the probability, under
the null hypothesis that its regression coefficient is zero (given that the rest of the predictors
are in the model), of observing a Wald Chi-Square test statistic as extreme as, or more extreme
than, the one observed. By viewing the estimates and standard errors to a greater degree of
precision, we can calculate the test statistics ourselves and see that they match those produced
by SPSS; a sketch of this check follows this paragraph. To view the estimates with more decimal
places displayed, click on the Parameter Estimates table in your SPSS output, then double-click on
the number of interest.
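
A minimal sketch of that check, using the female row as an example (estimate and standard error
taken from the table with extra decimal places displayed):

# Wald Chi-Square and p-value for the female coefficient (illustrative check).
from scipy.stats import chi2

b, se = 0.4009209, 0.0484122
wald = (b / se) ** 2              # about 68.58, matching the table
p_value = chi2.sf(wald, df=1)     # well below 0.0001
print(wald, p_value)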

(Intercept) - The Wald Chi-Square test statistic for testing the null hypothesis that the
intercept is zero, given that the other variables are in the model and evaluated at zero, is
(2.286745/0.0699539)^2 = 1068.590, with an associated p-value of <0.0001. If we set our alpha
level at 0.05, we would reject the null hypothesis and conclude that the intercept for daysabs is
statistically different from zero, given that mathnce, langnce and female are in the model and
evaluated at zero.

female - The Wald Chi-Square test statistic for testing the null hypothesis that the difference
in the logs of expected counts between males and females on daysabs is zero, given that the other
variables are in the model, is (0.4009209/0.0484122)^2 = 68.582, with an associated p-value of
<0.0001. If we set our alpha level at 0.05, we would reject the null hypothesis and conclude that
the coefficient for female is statistically different from zero, given that mathnce and langnce
are in the model.

mathnce - The Wald Chi-Square test statistic for testing the null hypothesis that the slope for
mathnce on daysabs is zero, given that the other variables are in the model, is
(-0.0035232/0.0018213)^2 = 3.742, with an associated p-value of 0.053. If we set our alpha level
at 0.05, we would fail to reject the null hypothesis and conclude that the Poisson regression
coefficient for mathnce is not statistically different from zero, given that langnce and female
are in the model.

langnce - The Wald Chi-Square test statistic for testing the null hypothesis that the slope for
langnce on daysabs is zero, given that the other variables are in the model, is
(-0.0121521/0.0018348)^2 = 43.865, with an associated p-value of <0.0001. If we set our alpha
level at 0.05, we would reject the null hypothesis and conclude that the Poisson regression
coefficient for langnce is statistically different from zero, given that mathnce and female are
in the model.
