
DATA ANALYSIS WORKBOOK

LAB 3

INTERPRETING STANDARD DEVIATIONS


OVERVIEW
The purpose of this lab is to use properties of the normal distribution to aid in the
interpretation of the standard deviation of a variable measured at either the interval or
ratio levels (or, in the case of an ordinal variable, treated as interval). The mean and the
standard deviation are the two basic descriptive statistics which are the foundation of
more advanced statistics. Understanding how to interpret the standard deviation depends
upon the metric of that variable.

CONCEPTS: standard deviation, z-scores, normal distribution


Standard Deviation
A measure of dispersion describes the amount of variation in a variable. A measure of
central tendency gives us a sense of the typical value in a distribution. A measure of
dispersion gives us an idea of the degree to which the cases differ from this typical value.
In fact, one could argue that the need for statistics arises because of the variability in the
phenomena that social scientists study.
Social scientists typically work with two measures of variation: the variance and the
standard deviation. The variance is the arithmetic average of the squared deviations from
the mean. (See footnote 1.) The standard deviation is the square root of the variance. The standard
deviation is more informative than the variance because it expresses the variability in terms
of the metric used in the measure of the variable rather than the square of the metric. For
example, if the variable is dollar income, the variance expresses the variability in terms of
dollars squared, while the standard deviation expresses variability in terms of dollars. The
variance is still valuable since it has mathematical properties that make it easier to work
with than the standard deviation. For this reason, we work with both the standard
deviation and the variance in labs to come. In this lab and, indeed, those labs that focus on
the distribution of a single variable, however, we focus on just the standard deviation.
Footnote 1. To get the mean deviation for a case, subtract the mean of the variable from the value of the variable
for the case. To get the variance, square the deviations, sum them, and divide by the number of cases
(or, in the case of sample estimates of the population variance, the number of cases minus one).
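
The calculation described in footnote 1 can be sketched in a few lines of Python. This is only an illustration: the function names and the income values are hypothetical and do not come from the GSS data.

```python
import math

def variance(values, sample=False):
    """Average squared deviation from the mean.

    With sample=True the sum of squares is divided by N - 1 instead of N.
    """
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(y - mean) ** 2 for y in values]
    divisor = n - 1 if sample else n
    return sum(squared_deviations) / divisor

def standard_deviation(values, sample=False):
    """Square root of the variance, in the original metric of the variable."""
    return math.sqrt(variance(values, sample))

# Hypothetical incomes in thousands of dollars (made up for illustration).
incomes = [20, 30, 40, 50, 60]

print(variance(incomes))            # in (thousands of dollars) squared
print(standard_deviation(incomes))  # in thousands of dollars
```

The variance comes out in squared units, while the standard deviation is back in the metric of the variable, which is why the standard deviation is the easier of the two to interpret.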


Table 1. Calculating z-scores from raw scores and raw scores from z-scores

                                                Compute z-score (1)         Compute raw score (2)
Equation                                        $z = (y - \bar{y}) / s_y$   $y = \bar{y} + (z) s_y$
Example ($\bar{y} = 50$, $s_y = 10$, $y = 42$)  z = (42 - 50) / 10          y = 50 + (-.8) 10
                                                z = -.8                     y = 42

Z-scores
We interpret the standard deviation as the typical deviation of a case from the mean. To
educate your intuition about the standard deviation, we introduce the concept of
z-scores and the properties of the normal distribution. A z-score expresses the
difference between the value of a variable and the mean of that variable in standard
deviation units. As equation (1) in Table 1 shows, you get a z-score from a raw score y by
dividing the difference between y and the mean ($\bar{y}$) by the standard deviation ($s_y$).
For example, if the mean of a variable is 50 and the standard deviation is 10, the raw
score 42 corresponds to a z-score of -.8. (The raw score 42 is .8 of a standard deviation
less than the mean.) You can use equation (2) to transform a z-score back to a raw score:
multiply the z-score by the standard deviation ($s_y$) and add the product to the mean.
Transforming an original variable by subtracting the mean centers the variable. As a
consequence of centering, the mean of the new variable will be zero. Transforming the
values of a variable by computing z-scores in effect changes the metric of the original
variable (for example, years in the case of age) to standard deviation units. A consequence
of standardizing a variable is that its standard deviation (and variance) will be one.
(You can think of a set of z-scores as a standardized, centered variable.) Table 2 uses the
respondent's age from the 1987 NORC GSS to illustrate the relation between raw and
standardized (z) scores.
Table 2. The mean and standard deviation of age expressed as raw and standard scores.

Variable   Type        Metric               Mean        Standard deviation
Age        raw score   years                44.92 yrs   17.71 yrs
Age        z-score     standard deviation   0 s.d.      1 s.d.

Because a z-score is measured in standard deviation units, we can get some insight into
the standard deviation by examining a distribution of z-scores. We can increase our
understanding by viewing the distribution of z-scores from the perspective of a standard
normal distribution.
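
As a rough sketch, the two equations in Table 1 and the effect of standardizing can be written in Python as follows. The function names and the list of ages are hypothetical, and the standard deviation here uses the population formula (dividing by the number of cases).

```python
from statistics import fmean, pstdev

def z_score(y, mean, sd):
    """Equation (1): express a raw score in standard deviation units."""
    return (y - mean) / sd

def raw_score(z, mean, sd):
    """Equation (2): transform a z-score back into the original metric."""
    return mean + z * sd

def standardize(values):
    """Center each value on the mean and divide by the standard deviation."""
    mean = fmean(values)
    sd = pstdev(values)  # population formula: divide by N
    return [z_score(y, mean, sd) for y in values]

# The worked example from Table 1: mean 50, standard deviation 10, raw score 42.
print(z_score(42, 50, 10))
print(raw_score(-0.8, 50, 10))

# Hypothetical ages, made up for illustration only.
ages = [18, 25, 33, 47, 52, 68, 89]
z_ages = standardize(ages)
print(fmean(z_ages), pstdev(z_ages))  # approximately 0 and 1
```

Whatever the original metric, the standardized variable ends up with a mean of (approximately) zero and a standard deviation of one, as in the second row of Table 2.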


Normal Distributions
The normal distribution is symmetric and bell-shaped. The center of the distribution is the
mean, median, and mode. It is called normal because there is no skew, and the
distribution is neither flat nor peaked. Consequently, the distribution of a normal variable
can be determined completely by its mean and its standard deviation. A standard normal
distribution is a normal distribution of a variable that has been transformed into z-scores.
Like all standardized variables, it has a mean of zero and a standard deviation of one.
(Note that standardizing a variable does NOT make it normal.)
Table 3 contains the relative frequencies for the standard normal distribution partitioned
into intervals or areas, with a z-score as the boundary of each interval. There are six
intervals: A, B, C, D, E, and F. For example, the first interval, A, has a lower limit z-score
value of -3 and an upper limit z-score value of -2. Remember that according to the
convention used in this course, we would place a case at the lower limit of -3 in this
interval along with all other cases with z-scores from -3.0 up to but not including
-2.0. A case with a z-score of -2 would be contained in the next interval, B. (See footnote 2.)

In a normal distribution, interval A contains (roughly) 2.1% of the cases. Interval B
contains 13.6% of the cases, and interval C contains 34.1% of the cases. Note that the
upper limit of interval C, 0, corresponds to the mean of the raw scores. Due to the
symmetry of a normal distribution, intervals D, E, and F also contain 34.1%, 13.6%, and
2.1% of the cases, respectively.
Table 3. The Relative Frequency Distribution of the Standard Normal Distribution Partitioned into
Six Intervals of One, Two, and Three Standard Deviations Above and Below the Mean.

Interval                  A        B        C        D        E        F
Lower limit (z-score)    -3       -2       -1        0        1        2
Upper limit (z-score)    -2       -1        0        1        2        3
Proportion               2.1%    13.6%    34.1%    34.1%    13.6%     2.1%

Intervals C and D (z from -1 to +1): 68%
Intervals B to E  (z from -2 to +2): 95%
Intervals A to F  (z from -3 to +3): 99.75%

The percentages in Table 3 are only approximate. For example, the interval that contains
exactly 95% of the cases has limits of -1.96 and +1.96 rather than -2 and +2. (You can
confirm this fact with a table of z-scores.) Keeping this qualification in mind, we can
combine the intervals A to F, as shown at the bottom of Table 3, in order to develop the
following statements about the distribution of cases in a normal distribution:

68% of the cases fall in an interval one standard deviation below and above the mean
95% of the cases fall in an interval two standard deviations below and above the mean
99.75% (or nearly all) of the cases fall in an interval three standard deviations below
and above the mean.

Footnote 2. Technically, a normal distribution begins at minus infinity and ends at plus infinity. Consequently, there
will be a few cases with z-scores either less than -3 or greater than +3.
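
If you do not have a table of z-scores at hand, the areas in Table 3 can be checked with Python's standard library. This is a minimal sketch: `area_between` is a hypothetical helper name, and the exact areas differ slightly from the rounded figures used in this lab.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0, sigma=1)

def area_between(lower_z, upper_z):
    """Proportion of cases with z-scores between the two limits."""
    return std_normal.cdf(upper_z) - std_normal.cdf(lower_z)

# Intervals A to F from Table 3, each one standard deviation wide.
intervals = {"A": (-3, -2), "B": (-2, -1), "C": (-1, 0),
             "D": (0, 1), "E": (1, 2), "F": (2, 3)}
for label, (lower, upper) in intervals.items():
    print(label, round(100 * area_between(lower, upper), 1))

# Combined intervals of one, two, and three standard deviations.
for k in (1, 2, 3):
    print(k, round(100 * area_between(-k, k), 2))

# The limits that enclose 95% of the cases are -1.96 and +1.96, not -2 and +2.
print(round(100 * area_between(-1.96, 1.96), 2))
```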

At some point during this discussion you might wonder about the relevance of this
information for the study of social science variables. After all, few of the ratio, interval, or
ordinal (treated as interval) measures that social scientists study have a normal
distribution. There are two answers to this objection. First, when we get to inferential
statistics in subsequent labs, we will find that a class of distributions (called sampling
distributions) are normal (or approximately normal). Second, and more relevant to this lab,
you will see that the three statements above apply to many distributions that are NOT
normal. For this reason, we refer to these statements as the empirical rule. It works
because a large percentage of cases in, say, interval C will be offset by a small
percentage of cases in interval D, so the percentage of cases in intervals C and D together
comes pretty close to 68%. We refer to this rule as empirical because it happens to work for
many variables. Remember, however, that it does not always work, and even when it
does, it often works only approximately.
The final point to keep in mind when using the empirical rule is that it is meant to give you
a basis for interpreting a standard deviation. For example, if the mean and standard
deviation of age are 45 and 15 years, respectively, you can expect that approximately 68%
of the people are between the ages of 30 and 60, 95% are between the ages of 15 and
75, and nearly everybody is between the ages of zero and 90. (See footnote 3.)
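
The age example above amounts to adding and subtracting one, two, and three standard deviations from the mean. A small sketch, with a hypothetical function name, makes the arithmetic explicit:

```python
def empirical_rule_ranges(mean, sd):
    """Lower and upper limits one, two, and three standard deviations from the mean."""
    return {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}

# Mean age of 45 years and standard deviation of 15 years, as in the text.
ranges = empirical_rule_ranges(45, 15)
for k, (lower, upper) in ranges.items():
    print(f"within {k} s.d.: {lower} to {upper} years")
```

The limits are only a guide. For a real variable, some of them may fall outside the values the variable can actually take.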

DATA ANALYSIS EXAMPLE


Research Question
Summarize the distribution of the variable age (AGE, v04). Use the standard deviation and
mean of the variable in your description. How well does the distribution of age (AGE, v04)
in the 1987 GSS compare with the characteristics of a normal distribution? (That is, do
approximately 68% of the cases fall in an interval one standard deviation below and above
the mean? Do 13.6% of the cases fall in the interval between one and two standard deviations
below the mean, etc.?)

Footnote 3. In the case of a survey of the adult population, the minimum age will be around 18. Interval
A is empty, and the lower limit of B is below this minimum, because the distribution of age is
positively skewed.

Results
First, become familiar with the measurement characteristics of the variable age. Use the
blue code book and complete the variable attributes information (1) in Table 4.
Table 4. Example from a Yellow Sheet for Lab 3.

(1) Variable Attributes                       (2) Statistics
Index and Name         v4  AGE                Mean          44.92
Description            Respondent's age       Std Dev       17.71
Minimum                18                     Minimum       18
Maximum                89                     Maximum       89
Metric                 years                  Valid Cases   1807
Level of Measurement   ratio                  NA            NA

The descriptive statistics for age and the z-score for age are given in Table 5 below. The
mean age of the respondents in the 1987 NORC GSS is 44.92 years, and the standard
deviation is 17.71 years. These statistics are based on 1807 cases out of 1819 possible
cases. Copy these results into the statistics column (2) of Table 4.
Table 5. Descriptive Statistics for Age from SPSS.

Descriptive Statistics
                      N      Minimum   Maximum   Mean    Std. Deviation
AGE                   1807   18        89        44.92   17.705
Valid N (listwise)    1807

To find the percentage of cases that fall in the intervals one, two and three standard
deviations below and above the mean, we have to find the age that falls three standard
deviations below the mean of 44.92 and use this number as the starting point in
constructing a distribution. That is, we have to transform the z-score -3 into a raw score.
We use equation (2) in Table 1 to do this. Plugging the values for the mean and standard
deviation into the equation, we find that -8.21 is the value for age that is three standard
deviations below the mean. (The fact that it is impossible to get this value means that the
distribution of age is positively skewed. See footnote 3.)
(e.g.) y = 44.92 + (-3)17.71 = 44.92 - 53.13 = -8.21
Footnote 4. You may need to use a z-score that is more extreme than -3 (e.g., -4 or -5) in the case of variables with
an extreme negative skew. (See footnote 2.)
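
The same use of equation (2) gives every interval limit, not just the starting point. The sketch below uses the mean and standard deviation reported for age; the variable names are hypothetical.

```python
mean_age = 44.92  # mean of AGE from Table 5
sd_age = 17.71    # standard deviation of AGE, rounded as in the text

# Raw-score limits for z = -3, -2, -1, 0, 1, 2, 3 using y = mean + z * sd.
boundaries = [round(mean_age + z * sd_age, 2) for z in range(-3, 4)]
print(boundaries)

# Pair neighbouring limits to form the six intervals, each one s.d. wide.
intervals = list(zip(boundaries[:-1], boundaries[1:]))
for lower, upper in intervals:
    print(f"{lower} up to but not including {upper}")
```

The first limit is the -8.21 computed above, and each later limit is 17.71 years higher than the one before it.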

Using -8.21 as the starting point, we divided age into six intervals, each one standard
deviation (17.71 years) wide. The resulting intervals (which we label z04) correspond to the
intervals A to F in Table 3. Table 6 contains the frequency distribution for this variable
generated by SPSS. We examine the valid percents since SPSS omits cases with missing
values when making these computations.
Table 6. Frequency Distribution of Age Divided into Intervals of 1, 2, and 3 Standard
Deviations Above and Below the Mean (z04)

z04
Value Label      Value   Frequency   Percent   Valid Percent   Cum Percent
-8.2 to 9.5      -3      0           0.0       0.0             0.0
9.5 to 27.2      -2      321         17.6      17.8            17.8
27.2 to 44.9     -1      683         37.5      37.8            55.6
44.9 to 62.6     0       426         23.4      23.6            79.1
62.6 to 80.3     1       325         17.9      18.0            97.1
80.3 to 98.0     2       52          2.9       2.9             100.0
98.0 to          3       0           0.0       0.0             100.0
                         12          .6        Missing
TOTAL                    1819        100.0     100.0
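
The Valid Percent and Cum Percent columns of Table 6 can be reproduced from the Frequency column. The following is a sketch of that arithmetic, not a description of how SPSS computes it internally; the variable names are hypothetical.

```python
from itertools import accumulate

# Frequencies from Table 6, keyed by the z-score value of each interval.
frequencies = {-3: 0, -2: 321, -1: 683, 0: 426, 1: 325, 2: 52, 3: 0}
missing = 12

valid_n = sum(frequencies.values())  # cases with a valid age
total_n = valid_n + missing          # all cases, valid or missing

# Valid percent: each frequency as a percentage of the valid cases only.
valid_percent = {z: round(100 * f / valid_n, 1) for z, f in frequencies.items()}

# Cumulative percent: running total of frequencies as a percentage of valid cases.
running_totals = accumulate(frequencies.values())
cum_percent = {z: round(100 * c / valid_n, 1)
               for z, c in zip(frequencies, running_totals)}

print(valid_n, total_n)
print(valid_percent)
print(cum_percent)
```

Because the 12 missing cases are left out of the denominator, the valid percents are slightly larger than the percents based on all 1819 cases.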

To see whether the empirical rule applies to this distribution, we can compare the valid
percents with the percents given by the empirical rule. (In addition to the percentages in
Table 3, the yellow sheets also contain this distribution.) As the criterion for deciding whether
the observed percentage distribution behaves according to the empirical rule, use five
percentage points. If the two percentages for an interval differ by more than five
percentage points, conclude that the distribution of the variable does NOT behave according
to the empirical rule.
In the case of the age distribution, we observe the following: First, 61.4% (= 37.8 +
23.6) of the cases fall in the interval one standard deviation below and above the mean
(27.2 to 62.6). Second, 97.2% (= 17.8 + 61.4 + 18.0) fall in the interval two standard
deviations below and above the mean (9.5 to 80.3). Third, all (100%) of the cases fall in
the interval three standard deviations below and above the mean (-8.2 to 98.0). In
deciding whether the empirical rule applies, we could conclude that the first statement is
off by a bit since 61.4% differs from 68% by more than 5 percentage points. On the other
hand, the two other statements fit the data. (Both 97.2% and 100% are less than 5 percentage
points away from 95% and 99.75%, respectively.)
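
The comparison just described can be sketched as a simple check against the five-point criterion. The function name is hypothetical; the expected percentages are the ones used in this lab, and the observed percentages are the ones reported above for age.

```python
# Percentages expected within 1, 2, and 3 standard deviations of the mean,
# as stated in this lab's version of the empirical rule.
EXPECTED = {1: 68.0, 2: 95.0, 3: 99.75}

def check_empirical_rule(observed, tolerance=5.0):
    """Return (difference, fits) for each interval width.

    An interval fits when observed and expected differ by no more than
    `tolerance` percentage points.
    """
    results = {}
    for k, expected in EXPECTED.items():
        difference = abs(observed[k] - expected)
        results[k] = (round(difference, 2), difference <= tolerance)
    return results

# Observed percentages for age, taken from the paragraph above.
age_observed = {1: 61.4, 2: 97.2, 3: 100.0}
for k, (difference, fits) in check_empirical_rule(age_observed).items():
    print(f"{k} s.d.: off by {difference} points, fits = {fits}")
```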


Explanations
The box below provides an example of how you would summarize the results. Be sure to
include the descriptive statistics (the mean and standard deviation), the range of scores
for the intervals one and two standard deviations above and below the mean, and the
percentage of cases in that range. It is important to be simple in the description and use
the metric of the variable in your description. In the interpretation part, briefly describe
the fit of the empirical rule to the frequency distribution.

For the variable age, a typical deviation from the mean of 45 years is about 18 years
(standard deviation). Sixty-one per cent of the cases are between 27 and 63 years of
age. About 97% of the cases are between 18 and 80 years of age, and all of the cases
are between the ages of 18 and 89. Although 61% is less than the expected 68%, the
other two percentages fit the empirical rule. (See footnote 5.)

RESEARCH QUESTIONS
Examine the frequency distribution of the following four variables. Does the empirical rule
fit the distribution for each of the variables?
1. The prestige of the spouse's occupation. (v28)
2. The total income of the respondent's family. (v32)
3. The number of children the respondent has had. (v19)
4. The number of organizations to which the respondent belongs. (v36)

Step 1 Variable Attributes (Lab Exercise 3.1)

Use the blue code book to determine the attributes of each of the variables.
What is the metric and level of measurement of the variables?
Copy the information from the code book to the yellow sheet.

Step 2 Statistically Describe the Variables (Lab Exercise 3.1)

Use SPSS (Statistics/Summarize/Descriptives) to get the statistics (Mean, Standard
Deviation, Valid Cases).
Copy the information from the screen to the yellow sheet.
Repeat this step for each of the variables by replacing the appropriate variable in the
variable box.

Footnote 5. Because the respondents had to be a minimum of 18 years old to participate in the study, the low end of
the distribution is truncated. In addition, the upper end is also truncated since NORC codes everybody
aged ninety or over as 89.

Step 3 Distributions of Variables in Standard Deviation Units

To compare the distribution of a variable with the normal distribution, the variable
must be coded in standard deviation units. We have transformed the raw
values of each variable into six categories whose limits are 1, 2, and 3 z-score values
below and above the mean. These variables are named z28, z32, z19, and z36. (We use
the prefix z rather than v since the limits of the intervals represent z-scores.) You
will need the transformed variables to get the observed frequency distributions.
Use SPSS (Statistics/Summarize/Frequencies) to get a frequency distribution of the
transformed or z-score variables.
Copy the information from the screen to the yellow sheet (one sheet per variable).
Note that the Value of the variable is now in z-score units; one unit equals one standard
deviation of that variable. The value denotes the lower limit (i.e., the left-end value or
boundary) of the category. Z-scores are continuous scores.
Note that the Value Label denotes the lower and upper limit values or scores of the
variable in its original metric and treats the variable as if it were continuous.
Use the continuous score lower and upper limits to determine the discrete score lower
and upper limits. Hint: refer to the minimum and maximum values and the metric of
the variable.
Note the Valid Percent of cases in each standard deviation category.
Repeat this step for each of the variables by replacing the appropriate z-score variable
in the variable box.
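
The z-score variables were prepared for you, but the recoding can be sketched as follows. This is an assumption about how such a recode could be done, not the procedure actually used to build z28, z32, z19, and z36. The function names and the short list of ages are hypothetical; the mean and standard deviation are those of AGE.

```python
import math
from collections import Counter

def z_category(y, mean, sd):
    """Lower-limit z value of the one-standard-deviation interval containing y.

    A case exactly at a lower limit belongs to that interval, following the
    convention used in this course.
    """
    return math.floor((y - mean) / sd)

def valid_percent_by_category(values, mean, sd):
    """Percentage of valid (non-missing) cases in each z-score category."""
    valid = [y for y in values if y is not None]  # None marks a missing value
    counts = Counter(z_category(y, mean, sd) for y in valid)
    return {z: round(100 * counts[z] / len(valid), 1) for z in sorted(counts)}

# Hypothetical ages, with one missing value.
ages = [18, 27, 28, 45, 62, 63, 89, None]
print([z_category(y, 44.92, 17.71) for y in ages if y is not None])
print(valid_percent_by_category(ages, 44.92, 17.71))
```

Because age is recorded in whole years, the continuous interval 27.2 to 44.9 holds the discrete ages 28 through 44, which is the kind of translation the hint above asks you to make.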

Step 4 Compare the Observed Distributions with a Normal Distribution

Compare the observed and expected percentage distributions as you noted in Step 3
and determine if the variable is approximately normally distributed. Note any large
differences (greater than 5%) between the observed and expected percentages.
From Step 3, determine the discrete raw scores that are plus and minus one, two and
three standard deviations from the mean. What percentage of the cases lie in these
intervals?
For each variable on the back side of the yellow sheet in narrative form, describe the
distribution of the variable using the mean and standard deviation. Report the range
of scores which are plus and minus one standard deviation of the mean and the
percentage of cases in the actual distribution within that range. Likewise, report the
range and percentage for scores within two standard deviations of the mean. Refer to
the example in the white sheets.
