
Central Tendency & Variability

I. Measures of Central Tendency


1. Mean (including weighted & trimmed)
2. Median
3. Mode
II. Measures of Variability
1. Range (including IQR & SIQR)
2. Mean Deviation
3. Variance
4. Standard Deviation
III. Estimation
IV. Overall Example - [Minitab]

Practice Problems (Answers)


Homework

In addition to describing the form or shape of a distribution, it is also necessary to describe its
central tendency and variability (or spread).

I. Measures of Central Tendency (or Averages)

Here, we are interested in the typical, most representative score. There are three measures of
central tendency that you should be familiar with. Note that when reporting these values, one
additional decimal of accuracy is given compared to what is available in the raw data (even if the
additional decimal is a zero, e.g., 43.0).

1. Mean
It is simply the arithmetic average: the sum of the scores divided by the number of scores.
It is symbolized as:

$\bar{X}$ (read as "X-bar") when computed on a sample.

$\mu$ (read as "mu") when computed on a population.

Computation - Example:

X
2
3
5
10
$\Sigma X = 20$
N = 4

$\bar{X} = \Sigma X / N = 20 / 4 = 5.0$

Since means are typically reported with one more digit of accuracy than is present
in the data, I reported the mean as 5.0 rather than just 5.

When working with grouped frequency distributions, we can use an
approximation:

$\bar{X} \approx \Sigma(\text{Mid} \cdot f) / N$

For example:

Interval   Midpoint   f    Mid*f
95-99      97         1    97
90-94      92         3    276
85-89      87         5    435
80-84      82         6    492
75-79      77         4    308
70-74      72         3    216
65-69      67         1    67
60-64      62         2    124
           $\Sigma f = 25 = N$   $\Sigma(\text{Mid} \cdot f) = 2015$

$\bar{X} \approx 2015 / 25 = 80.6$

When computed on the raw data, we get:

Thus the formula for computing the mean with grouped data gives us a
good approximation of the actual mean. In fact, when we report the mean
with one decimal more accuracy than what is in the data, the two
techniques give the same result.
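The grouped-data approximation can be sketched in a few lines of Python. This is only an illustration: the function name `grouped_mean` is made up for this example, and the midpoints and frequencies are copied from the table above.

```python
# (midpoint, frequency) pairs copied from the grouped frequency table above
intervals = [(97, 1), (92, 3), (87, 5), (82, 6), (77, 4), (72, 3), (67, 1), (62, 2)]

def grouped_mean(midpoint_freq_pairs):
    """Approximate the mean as sum(midpoint * f) / N (hypothetical helper)."""
    n = sum(f for _, f in midpoint_freq_pairs)               # N = sum of f
    total = sum(mid * f for mid, f in midpoint_freq_pairs)   # sum of Mid*f
    return total / n

approx_mean = grouped_mean(intervals)
print(round(approx_mean, 1))
```

With the table values, N is 25 and the sum of Mid*f is 2015, so the result matches the hand calculation.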

Properties

1. It is sensitive to all of the scores. In other words, if one score in the distribution is
changed, the mean will change too. Example:

Xs           $\bar{X}$
1, 2, 3      2
1, 2, 30     11
1, 2, 300    101

2. The sum of the deviations about the mean equals zero. A deviation is symbolized
as $x$ (little x) and refers to the difference between a score and its mean. That is:

$x = X - \bar{X}$

Thus, this second property of the mean states that:

$\Sigma x = \Sigma(X - \bar{X}) = 0$

What follows is some sample data demonstrating this property.

X     $\bar{X}$    $x$
2     5.0    -3
3     5.0    -2
5     5.0    0
10    5.0    5
             $\Sigma x = 0$

3. The sum of the squared deviations about the mean is less than the sum of the
squared deviations about any other value. Example (with "4" as the arbitrary
"other value"):

X     $x$    $x^2$   X-4   $(X-4)^2$
2     -3     9       -2    4
3     -2     4       -1    1
5     0      0       1     1
10    5      25      6     36
             $\Sigma x^2 = 38$     $\Sigma(X-4)^2 = 42$

So, 38 is less than 42. This relationship would hold with any "other value."
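The last two properties can be checked numerically. The following is a small sketch using the same four scores; the helper name `sum_sq_dev` and the grid of candidate "other values" (0.0 to 10.0 in steps of 0.1) are assumptions made for this illustration.

```python
scores = [2, 3, 5, 10]
mean = sum(scores) / len(scores)          # 5.0

def sum_sq_dev(values, center):
    """Sum of squared deviations of the values about an arbitrary center."""
    return sum((v - center) ** 2 for v in values)

# Property: the deviations about the mean sum to zero
sum_dev = sum(v - mean for v in scores)

# Property: the squared deviations are smallest about the mean
ss_about_mean = sum_sq_dev(scores, mean)  # about the mean (5)
ss_about_four = sum_sq_dev(scores, 4)     # about the "other value" 4

# Try many candidate centers and keep the one with the smallest sum
candidates = [c / 10 for c in range(0, 101)]
best_center = min(candidates, key=lambda c: sum_sq_dev(scores, c))

print(sum_dev, ss_about_mean, ss_about_four, best_center)
```

Among the candidates tried, the center with the smallest sum of squares is the mean itself.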

Variations

Weighted Mean
Each quantity to be averaged is assigned a weight. These weightings determine the
relative importance of each quantity in the average. We will see an example in our grade
postings, where the homework assignments are weighted at 20% and the exams are
weighted at 80%.
Trimmed Mean
A mean computed after discarding the most extreme scores in each tail (e.g., a mean computed
on the middle 95% of the distribution). It can be a more stable estimate than the regular mean
because it is less sensitive to outliers.
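As a rough sketch of both variations: the 20%/80% weights come from the grade example above, while the homework and exam averages, the helper names, and the trimming rule (dropping a fixed number of scores from each tail) are hypothetical choices made only for illustration.

```python
def weighted_mean(values, weights):
    """Sum of value*weight divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, drop_per_tail):
    """Mean after dropping `drop_per_tail` scores from each end (hypothetical rule)."""
    ordered = sorted(values)
    kept = ordered[drop_per_tail:len(ordered) - drop_per_tail]
    return sum(kept) / len(kept)

# Hypothetical student: homework average 90, exam average 80
course_grade = weighted_mean([90.0, 80.0], [0.20, 0.80])

# Hypothetical scores with one outlier
with_outlier = [1, 2, 3, 4, 300]
regular = sum(with_outlier) / len(with_outlier)   # pulled up by the outlier
trimmed = trimmed_mean(with_outlier, 1)           # outlier (and lowest score) removed

print(course_grade, regular, trimmed)
```

The exams dominate the weighted grade because they carry four times the weight of the homework, and the trimmed mean stays near the bulk of the scores while the regular mean does not.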

2. Median or Md
The score that cuts the distribution into two equal halves (or the middle score in the
distribution).

Computation - There are several situations possible:

1. An odd number of scores and no duplication near the middle: the median is
the middle score.
Ex: 1, 2, 2, 4, 6, 7, 7. N=7 & Md = 4.
2. An even number of scores and no duplication near the middle: the median is
the average of the two middle scores.
Ex: 2, 2, 4, 6, 7, 7. N=6 & Md = (4+6)/2 = 5.
3. Duplication near the middle.
Ex: 4, 5, 5, 5, 6, 6. N=6 & Md = ?

o The duplicated score near the middle is 5, whose exact limits are 4.5 and 5.5. So the
median is somewhere between 4.5 and 5.5.
o The lower exact limit of the score near the middle that is duplicated is 4.5. Half of the
scores is N/2 = 3, and one score falls below 4.5, so we need 2 of the 3 scores in the
interval with the duplication.
o Thus, Md = 4.5 + 2/3 = 4.5 + .67 = 5.17, which is reported as 5.2.

Fortunately, there is a formula to take care of the more complicated situations,
including computing the median for grouped frequency distributions.

$Md = L + \left( \frac{N/2 - n_b}{n_w} \right) i$

Where:

L = Lower exact limit of the interval containing Md.
$n_b$ = number of scores below L.
$n_w$ = number of scores within the interval containing Md.
i = the width of the interval (for ungrouped data i=1).
N = the Number of scores.

Using our last example:

L = 4.5
$n_b$ = 1
$n_w$ = 3
i = 1
N = 6

$Md = 4.5 + \left( \frac{6/2 - 1}{3} \right)(1) = 4.5 + .67 = 5.17$, reported as 5.2.
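The median formula can be sketched as a short Python function. This is an illustrative helper, not part of the course materials: the name `interpolated_median` is made up, and it assumes ungrouped whole-number scores so that each score value is its own interval of width i = 1. It applies the formula literally to whichever score value contains the N/2 point.

```python
def interpolated_median(scores, width=1):
    """Md = L + ((N/2 - nb) / nw) * i, treating each score value as an interval."""
    ordered = sorted(scores)
    half = len(ordered) / 2                  # N/2
    below = 0                                # nb: scores below the current interval
    for value in sorted(set(ordered)):
        within = ordered.count(value)        # nw: scores within the interval
        if below + within >= half:
            lower_limit = value - width / 2  # L: lower exact limit
            return lower_limit + ((half - below) / within) * width
        below += within

md_duplicates = interpolated_median([4, 5, 5, 5, 6, 6])
md_odd = interpolated_median([1, 2, 2, 4, 6, 7, 7])
print(round(md_duplicates, 2), md_odd)
```

For the duplicated-scores example the function reproduces the hand calculation (L = 4.5, nb = 1, nw = 3), and for the odd-N example it returns the middle score.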

Properties

1. Not sensitive to all scores.

Xs            $\bar{X}$    Md
1, 2, 3       2       2
1, 2, 30      11      2
1, 2, 300     101     2
1, 2, 3000    1001    2

2. Most useful with skewed distributions.

3. Mode
The most frequently occurring score. Note:

o There can be more than one. Can have bi- or tri-modal distributions and then
speak of major and minor modes.
o It is symbolized as Mo.

o For grouped data, we have a Modal Interval and a Crude Mo.

Note that the presence and direction of skew in the distribution can be determined from the
mean and median. The key to understanding this is to be aware that the mean is sensitive to all
scores, while the median is not. There are three rules:

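1. If $\bar{X} - Md > 0$, then the distribution is positively skewed (+skew).

2. If $\bar{X} - Md < 0$, then the distribution is negatively skewed (-skew).

3. If $\bar{X} - Md = 0$, then the distribution is symmetrical (as in a normal distribution)
and all three measures of central tendency coincide.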

II. Measures of Variability

Variability refers to the extent to which the scores in a distribution differ from each other. An
equivalent definition (that is easier to work with mathematically) says that variability refers to
the extent to which the scores in a distribution differ from their mean. If a distribution is lacking
in variability, we may say that it is homogeneous (the opposite would be heterogeneous). Note
that when reporting these values, two additional decimals of accuracy are given compared to
what is available in the raw data (even if the last decimal is a zero, e.g., 4.30). The exception is
the range, where no extra decimals are needed because it is a crude measure (as we will see in a
moment).

We will discuss four measures of variability for now: the range, mean or average deviation,
variance and standard deviation.

1. Range
As we noted when discussing the rules for creation of a grouped frequency distribution,
the range is given by the highest score in the distribution minus the lowest score plus one.

$R = X_H - X_L + 1$

Example:
Distribution A has a larger range (and more variability) than Distribution B.

Because only the two extreme scores are used in computing the range, however, it is a
crude measure. For example:

The range of Distribution A and B is the same, although Distribution A has more
variability.

Variations

Inter Quartile Range (or IQR)
The range computed for the middle 50% of the distribution.
Semi Inter Quartile Range (or SIQR)
Simply one half of the IQR, which makes the measure a distance from the mean. This will
make more sense after we cover more measures of variability. It is the preferred measure
of variability for skewed data.

2. Mean (or Average) Deviation
If a deviation ($x$) is the difference of a score from its mean and variability is the extent to
which the scores differ from their mean, then summing all the deviations and dividing by
the number of them should give us a measure of variability. The problem though is that
the deviations sum to zero. However, computing the absolute value of the deviations
before summing them eliminates this problem. Thus, the formula for the MD is given by:

$MD = \Sigma|x| / N = \Sigma|X - \bar{X}| / N$

The problem with the MD is that due to the use of the absolute value, it is a terminal
procedure. In other words, it cannot be used in further calculations (which is something
that we would like to be able to do).

3. Variance
Another solution to the problem of the deviations summing to zero is to square the
deviations. That is:

$\text{Variance} = MS = \Sigma x^2 / N = \Sigma(X - \bar{X})^2 / N$

Thus another name for the Variance is the Mean of the Squared Deviations About the
Mean (or more simply, the Mean of Squares (MS)). The problem with the MS is that its
units are squared and thus represent space, rather than a distance on the X axis like the
other measures of variability.

4. Standard Deviation
A simple solution to the problem of the MS representing a space is to compute its square
root. That is:

$\text{Standard Deviation} = \sqrt{MS} = \sqrt{\Sigma x^2 / N}$

Since standard deviations can sometimes be very small, you may need more than
the 2 additional decimals of accuracy (beyond what is available in the original data)
suggested at the outset of this section.

Properties of the Variance & Standard Deviation:

1. Are always positive (or zero).


2. Equal zero when all scores are identical (i.e., there is no variability).

3. Like the mean, they are sensitive to all scores.

4. The standard deviation is the preferred measure of variability for normal
distributions.

III. Estimation
Estimation is the goal of inferential statistics. We use sample values to estimate population
values. The symbols are as follows:

Measure              Sample       Population
Mean                 $\bar{X}$    $\mu$
Variance             $s^2$        $\sigma^2$
Standard Deviation   $s$          $\sigma$

It is important that the sample values (estimators) be unbiased. An unbiased estimator of a
parameter is one whose average over all possible random samples of a given size equals the
value of the parameter.

While $\bar{X}$ is an unbiased estimator of $\mu$, $s^2$ (when computed with N in the denominator) is not
an unbiased estimator of $\sigma^2$.

In order to make it an unbiased estimator, we use N-1 in the denominator of the formula rather
than just N. Thus:

$s^2 = \frac{\Sigma(X - \bar{X})^2}{N - 1}$ and $s = \sqrt{\frac{\Sigma(X - \bar{X})^2}{N - 1}}$

Note that this is a defining formula and, as we will see below, is not the best choice when
actually doing the calculations.
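The definition of an unbiased estimator can be illustrated by brute force. The sketch below makes several assumptions for the sake of a small example: the "population" is just the four scores 2, 3, 5, 10, samples have size 2, and "all possible random samples" is taken to mean every ordered sample drawn with replacement. Exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product

# Hypothetical tiny population (exact fractions avoid rounding error)
population = [Fraction(v) for v in (2, 3, 5, 10)]

def variance(values, denominator):
    """Sum of squared deviations about the mean, divided by `denominator`."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / denominator

sigma_sq = variance(population, len(population))   # population variance (divide by N)

n = 2
samples = list(product(population, repeat=n))      # all 16 ordered samples of size 2

# Average each version of the sample variance over all possible samples
avg_with_n = sum(variance(s, n) for s in samples) / len(samples)
avg_with_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)

print(sigma_sq, avg_with_n, avg_with_n_minus_1)
```

Under these assumptions, the version with N-1 in the denominator averages out to exactly the population variance, while the version with N in the denominator averages out to a smaller value.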

IV. Overall Example - [Minitab]

Let's reconsider an example from above of two distributions (A & B):

Consider a possibility for the scores that go with these distributions:


Distribution   A               B
Data           150             150
               145             110
               100             100
               100             100
               55              90
               50              50
$\Sigma X$     600             600
N              6               6
$\bar{X}$      100             100
Range          150-50+1=101    150-50+1=101

Notice that the central tendency and range of the two distributions are the same. That is, the
mean, median, and mode all equal 100 for both distributions and the range is 101 for both
distributions. However, while Distributions A and B have the same measures of central tendency
and the same range, they differ in their variability. Distribution A has more of it. Let us prove this
by computing the standard deviation in each case. First, for Distribution A:

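X (A)   $\bar{X}$   $x$    $x^2$
150     100    50     2500
145     100    45     2025
100     100    0      0
100     100    0      0
55      100    -45    2025
50      100    -50    2500
$\Sigma X = 600$      $\Sigma x = 0$    $\Sigma x^2 = 9050$
N = 6

Plugging the appropriate values into the defining formula gives:

$s^2 = \frac{\Sigma(X - \bar{X})^2}{N - 1} = \frac{9050}{5} = 1810.00$

$s = \sqrt{1810.00} = 42.54$
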
Note that calculating the variance and standard deviation in this manner requires computing the
mean and subtracting it from each score. Since this is not very efficient and can be less accurate
as a result of rounding error, computational formulas are typically used. They are given as
follows:

$s^2 = \frac{\Sigma X^2 - (\Sigma X)^2 / N}{N - 1}$

and

$s = \sqrt{\frac{\Sigma X^2 - (\Sigma X)^2 / N}{N - 1}}$

Redoing the computations for Distribution A in this manner gives:

X (A)   $X^2$
150     22500
145     21025
100     10000
100     10000
55      3025
50      2500
$\Sigma X = 600$   $\Sigma X^2 = 69050$
N = 6

Then, plugging in the appropriate values into the computational formula gives:

$s^2 = \frac{69050 - (600)^2 / 6}{6 - 1} = \frac{69050 - 60000}{5} = \frac{9050}{5} = 1810.00$

$s = \sqrt{1810.00} = 42.54$

Note that the defining and computational formulas give the same result, but the computational
formula is easier to work with (and potentially more accurate due to less rounding error).

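Doing the same calculations for Distribution B yields:

X (B)   $X^2$
150     22500
110     12100
100     10000
100     10000
90      8100
50      2500
$\Sigma X = 600$   $\Sigma X^2 = 65200$
N = 6

Then, plugging in the appropriate values into the computational formula gives:

$s^2 = \frac{65200 - (600)^2 / 6}{6 - 1} = \frac{65200 - 60000}{5} = \frac{5200}{5} = 1040.00$

$s = \sqrt{1040.00} = 32.25$

Thus, Distribution A clearly has more variability than Distribution B.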
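For readers without Minitab, the same comparison can be sketched in Python. The function names are made up for this illustration; both use N-1 in the denominator, as in the Estimation section, and the standard library's `statistics.stdev` (which also uses N-1) serves as a cross-check.

```python
from math import sqrt
from statistics import stdev

dist_a = [150, 145, 100, 100, 55, 50]
dist_b = [150, 110, 100, 100, 90, 50]

def sd_defining(scores):
    """Defining formula: sqrt(sum((X - mean)^2) / (N - 1))."""
    n = len(scores)
    m = sum(scores) / n
    return sqrt(sum((x - m) ** 2 for x in scores) / (n - 1))

def sd_computational(scores):
    """Computational formula: sqrt((sum(X^2) - (sum X)^2 / N) / (N - 1))."""
    n = len(scores)
    return sqrt((sum(x * x for x in scores) - sum(scores) ** 2 / n) / (n - 1))

for name, data in (("A", dist_a), ("B", dist_b)):
    print(name, round(sd_defining(data), 2), round(sd_computational(data), 2),
          round(stdev(data), 2))
```

All three approaches agree for each distribution, and Distribution A has the larger standard deviation.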


Copyright 2015 Arvella Albay, Ph.D.
Comments? arvellamedinaalbay@yahoo.com

Homework - Central Tendency & Variability


DIRECTIONS: When computation is required, be sure to show all work neatly (i.e., each and
every step as indicated in the class examples given). Attach Minitab output when the problem
requests it. Indicate on the output whether the values agree with what you have calculated by
hand.

Problems

1. Compute the mean, range, mean deviation, variance, and standard deviation for the
following sample data. Compute the standard deviation using the definitional formula and
then again using the computational formula. The idea is for you to see how much easier
the computational formula is to use than the definitional formula. When you are through
with the manual computations, use Minitab to do them. The data represent the number of
fish caught while ice fishing for 10 people (30 points).
9, 7, 5, 8, 6, 3, 0, 6, 6, 1.
2. Compute the mean, range, mean deviation, variance, and standard deviation for the
following sample data. Compute the standard deviation using the definitional formula and
then again using the computational formula. The idea is for you to see how much easier
the computational formula is to use than the definitional formula. When you are through
with the manual computations, use Minitab to do them. The data represent the average
number of hours of sleep per night for a group of teenagers (30 points).
5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 7, 8, 8, 8, 8, 9, 10, 10, 10, 10, 11, 11.
3. Compute the mean and standard deviation for the following by whatever manual
method you prefer, and then use Minitab to do them. The data represent hypothetical
scores on an exam for a sample of students (30 points).
63 88 79 92 86 87 83 78 41 67
68 76 46 81 92 77 84 76 70 66
77 75 98 81 82 81 87 78 80 60
94 79 52 82 77 61 77 70 74 61

Multiple Choice (10 points)

1. Which measure of central tendency is the MOST sensitive to extreme scores?


a. mean
b. median
c. mode
d. trimmed mean
e. variance

2. Suppose the mean of our first exam is 77 and the median is 82. This would indicate a
_____ distribution.
a. normal
b. bimodal
c. rectangular
d. positively skewed
e. negatively skewed


Measures of Relative Standing


I. Percentiles & Percentile Rank
1. Percentile Rank
A. Rationale
B. Formula
C. Example
2. Percentile Points
3. Additional Measures of Variability

II. Changing the Properties of Scales


1. Introduction
2. Review & Additional Background
3. Application

III. Standard Scores & the Normal Distribution


1. Comments
2. Standard Scores
A. Theory
B. Application
3. The Normal Distribution

IV. The Standard Normal Distribution


1. Comments
2. Simple Applications
3. Parameters & the Standard Normal Distribution
4. More Complex Applications

Practice Problems (Answers)


Homework

I. Percentiles & Percentile Rank


1. Percentile Rank

A. Rationale

Consider an example where you receive the same grade on a test in two different
classes.
In which class did you do better?

This example should make clear that just because you got the same grade in two
classes doesn't necessarily mean that you did equally well in both classes. We
need a precise way to measure this. One measure often used is called the
Percentile Rank (or PR). Note that the cumulative percent (C%) gives the PR or
% falling at or below a given score. However, unlike the C%, which is only given
for the upper exact limit of the interval, we need to be able to compute this for any
score.

The percentile rank is obtained through a procedure called linear interpolation.
This is the same procedure used to generate the formula for the median (discussed
earlier). The rationale goes something like this. (Don't worry about it too much
though, because we are given a formula.)

First, we assume a rectangular distribution of the scores within an interval. For
example:

Quantity Value
i 5
Apparent Limits 30 - 34
Exact Limits 29.5 - 34.5
f 6

So, we assume each score to have a frequency of 6/5, that is:


Given this assumption, consider the logic for PR calculations:

So if score X is halfway through the interval, then score X must have half of the
scores in that interval falling below it. For example, consider the expected scores
for Exam 1 data set. Let's say we want to know the percentile rank of the score of
82, then:

B. Formula for Percentile Rank

$PR = \left( \frac{n_b + \left( \frac{X - L}{i} \right) n_w}{N} \right) \times 100$

Where:

PR = Percentile Rank.
X = the score we are interested in.
L = the Lower exact limit of the interval containing X.
$n_b$ = number (frequency) below the lower exact limit of the interval containing X.
$n_w$ = number (frequency) within the interval containing X.
i = the interval width.
N = the Number of scores.

Note that your book uses a slightly different formula. That is, the
proportion (p) is computed and then it needs to be multiplied by
100 to get the percentile rank. The formula given above combines
these two steps.

C. Example

Consider the expected scores for Exam 1 data set. By looking at the data, we can
see that the score of 82 will have a PR between 40 and 64. In this case:

PR = ?
X = 82
estimate = 40-64
L = 79.5
$n_b$ = 10
$n_w$ = 6
i = 5
N = 25

Substituting into the formula:

$PR = \left( \frac{10 + \left( \frac{82 - 79.5}{5} \right) 6}{25} \right) \times 100 = \left( \frac{10 + 3}{25} \right) \times 100 = 52$

So 52% of folks scored at or below a score of 82. (And note that 52 is between 40
& 64 as predicted. So this is a helpful check on our work.)
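The percentile rank calculation can be sketched as a small Python function. The function and parameter names are hypothetical; the inputs are the values listed in the example (X = 82, L = 79.5, nb = 10, nw = 6, i = 5, N = 25).

```python
def percentile_rank(x, lower_limit, n_below, n_within, width, n_total):
    """PR = 100 * (nb + ((X - L) / i) * nw) / N, by linear interpolation."""
    fraction_into_interval = (x - lower_limit) / width   # how far X is into the interval
    at_or_below = n_below + fraction_into_interval * n_within
    return 100 * at_or_below / n_total

pr_82 = percentile_rank(82, lower_limit=79.5, n_below=10, n_within=6,
                        width=5, n_total=25)
print(pr_82)
```

A score of 82 sits halfway through the 80-84 interval, so half of that interval's 6 scores are counted along with the 10 scores below it.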

2. Percentile Points

Percentile points are like pennies, deciles are like dimes, and quartiles are like quarters. They
are essentially a way of dividing up the 100% that comprises the entire group. Note that the
median equals P50, D5, or Q2.

In dealing with percentile rank, we asked the question, "What percent of the group scored at or
below a particular score?" We can also ask the reverse. For example, "What score did 90% of
the group fall at or below?" Actually, we have already dealt with this question, that is, the
median is the score that has 50% falling at or below it. The formula for the score at a given
percentile point and the formula for the median usually use a slightly different notation. That is,
the median uses N/2, while the formula for the score at a given percentile point uses P(N) where
P is the percentile expressed as a proportion. Note that if P=.5, then P(N)=N/2.

Formula for the score at a given percentile point:

$X_P = L + \left( \frac{P(N) - n_b}{n_w} \right) i$

where the symbols for the quantities involved are the same as for the PR formula. Let's do two
examples. We will compute the score at the first and third quartile points. Our first step is to
determine the relevant interval in each case.

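Relevant
Quantity    Q1=P25    Q3=P75
P           .25       .75
Interval    75-79     85-89
L           74.5      84.5
N           25        25
$n_b$       6         16
$n_w$       4         5
i           5         5

Plugging the values into the formula gives:

For P25: $X = 74.5 + \left( \frac{.25(25) - 6}{4} \right) 5 = 74.5 + \left( \frac{.25}{4} \right) 5 = 74.5 + .31 = 74.81$

For P75: $X = 84.5 + \left( \frac{.75(25) - 16}{5} \right) 5 = 84.5 + \left( \frac{2.75}{5} \right) 5 = 84.5 + 2.75 = 87.25$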
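The score at a given percentile point, L + ((P(N) - nb) / nw) * i, can be sketched the same way. The function name is hypothetical and the inputs are the values from the table for Q1 and Q3.

```python
def score_at_percentile(p, lower_limit, n_below, n_within, width, n_total):
    """X = L + ((P*N - nb) / nw) * i, with P expressed as a proportion."""
    return lower_limit + ((p * n_total - n_below) / n_within) * width

q1 = score_at_percentile(0.25, lower_limit=74.5, n_below=6, n_within=4,
                         width=5, n_total=25)
q3 = score_at_percentile(0.75, lower_limit=84.5, n_below=16, n_within=5,
                         width=5, n_total=25)

print(round(q1, 2), round(q3, 2), round(q3 - q1, 2))
```

The third printed value is the distance between the two quartile points.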

3. Additional Measures of Variability

With the Q1 and Q3, we can compute a new measure of variability called the Inter Quartile
Range or IQR. Thus:

IQR = (X at Q3) - (X at Q1)

Using the example above:

IQR = 87.25 - 74.81 = 12.44

In order to turn the IQR into a distance (on the x axis) from the mean, the Semi Inter Quartile
Range is sometimes computed. This makes it more similar to the standard deviation. It takes the
form:

SIQR = IQR/2

Using the example above:

SIQR = (87.25 - 74.81)/2 = 12.44/2 = 6.22

These measures of variability are the preferred measures with skewed distributions.

II. Changing the Properties of Scales
1. Introduction

The ideas of percentile rank and percentile points are helpful; however, they don't always answer
the questions we have. For example, consider the following distributions of test scores.

Class A = 60, 70, & 86
Class B = 84, 85, & 86

You received an 86 on both tests. I bet that you would be happier with your performance in class
A. Your PR, however, would be the same in both cases. Thus, we need additional ways of
getting at the issue of relative standing.

2. Review & Additional Background

Earlier we noted that parametric scales have a zero point or origin and, by definition, these scales
are measured with units. For example, Fahrenheit (°F) and Celsius (°C) have different origins
and units. However, since they measure the same thing, one can convert back and forth with a
formula. For example, to get Fahrenheit from Celsius, we can use the following:

°F = °C (1.8) + 32°

To go in the other direction, we solve for °C, that is:

°C = °F (.56) - 17.78°

Note:

The formula involves two constants, that is, °F = °C (C1) + C2

C1 (1.8 in this case) changes the unit. It is also called the conversion factor.

C2 (32 in this case) changes the origin (or zero point).

Consider where 1.8 & 32 come from:

Point on Scale   °F     °C
Boiling          212    100
Freezing         32     0
Difference       180    100

So 1.8 units on the Fahrenheit scale equal 1 unit on the Celsius scale (180/100 = 1.8). Or looking
at it the other way, .56 of a unit on the Celsius scale equals 1 unit on the Fahrenheit scale
(100/180 = .56).

The moral of the story is that if we have a set of scores and we want to change the unit, we need
to multiply all scores by a constant. (Note that division can be viewed as multiplication by a
reciprocal.) In addition, if we want to change the origin, we need to add a constant to all the
scores. (Note that subtraction can be viewed as addition of a negative.)

Now let us look at what happens to the mean and standard deviation of a distribution when we
change the properties of a scale.

If a constant (c) is added to all scores, then the new mean ($\bar{X}'$, or "X-bar prime") will be equal to the
old mean plus the constant. That is:

$\bar{X}' = \bar{X} + c$

Furthermore, the variability remains unchanged. That is:

$s'^2 = s^2$

and

$s' = s$

Thus, adding a constant to all the scores simply shifts the score values. Assume that $\bar{X}$=5 and
c=3.

So the new mean would be 8 (5+3) and the standard deviation would remain unchanged.

If all scores are multiplied by a constant, then the new mean will equal the old mean times the
constant. That is:

$\bar{X}' = c \cdot \bar{X}$

Furthermore, in this case the variability is changed. That is:

$s'^2 = c^2 \cdot s^2$

and

$s' = c \cdot s$ (for a positive constant c)

Thus, multiplying all scores by a constant changes the score values as well as their variability.
To see this visually, consider the following example (where c=3).
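The same two rules can also be checked numerically. The sketch below uses hypothetical scores with a mean of 5 and the constant c = 3; `statistics.stdev` uses N-1 in the denominator, but the rules hold for either denominator.

```python
from statistics import mean, stdev

scores = [2, 3, 5, 10]      # hypothetical scores with a mean of 5
c = 3

shifted = [x + c for x in scores]   # add a constant: changes the origin
scaled = [x * c for x in scores]    # multiply by a constant: changes the unit

print("original:", mean(scores), round(stdev(scores), 2))
print("plus c:  ", mean(shifted), round(stdev(shifted), 2))
print("times c: ", mean(scaled), round(stdev(scaled), 2))
```

Adding c moves the mean from 5 to 8 and leaves the standard deviation alone, while multiplying by c triples both the mean and the standard deviation.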
3. Application

Now suppose you took two tests; Biology & Chemistry.

Test X
Biology 60
Chemistry 80

In which class did you do better? From what we have discussed thus far, it should be apparent
that:

We do not have enough info to answer the question.


It is possible that you did better (relative to the rest of your class) on the biology
test, even though your score on the chemistry test was higher. That is:

So let's see what we can make of this when we are given the means and standard deviations.

Test         X     $\bar{X}$    s
Biology      60    55      5
Chemistry    80    85      10

Note that this data supports the figure above. That is, you scored above the mean on the biology
test and below the mean on the chemistry test.

Based on what we just covered about changing the properties of scales, one strategy for solving
this problem would be to transform the biology distribution to make its mean and standard
deviation equal to that of the chemistry distribution. In other words:

Test                                X     $\bar{X}$    s
Biology                             60    55      5
Chemistry                           80    85      10
Transformed Biology distribution    ?     85      10

So we need to determine the constants in the conversion formula and then transform the biology
score. Our goal is to come up with a biology score that uses a scale with the same mean and
standard deviation as the chemistry score.

Remember that C1 changes the unit and C2 changes the origin. When using the formula, be
concerned first with C1 and then worry about C2.

$\bar{X}_{chem} = C1 \cdot \bar{X}_{bio} + C2$

Substituting the values:

85 = C1*55 + C2

Therefore, let:

C1 = 2 (do first, to fix the unit: 10/5 = 2 makes the SDs equal)
C2 = -25 (do second, to fix the origin: 85 - 2*55 = -25 makes the means equal)

Thus:

85 = 2*55 + -25

Which shows the constants work & now we can use them:

X' = 2*X + -25

Thus:

X' = 2*60 + -25
= 120 - 25
= 95

So the biology score was equivalent to a 95 on the chemistry scale. Thus, relative to the rest of
the class, you did much better on the biology than on the chemistry test.
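The two-constant conversion can be sketched as a short function. The name `rescale` and its parameters are hypothetical; the logic follows the steps above, with C1 taken as the ratio of the standard deviations and C2 chosen so that the means line up.

```python
def rescale(x, mean_from, sd_from, mean_to, sd_to):
    """Convert a score from one scale to another with X' = C1*X + C2."""
    c1 = sd_to / sd_from              # fixes the unit (makes the SDs equal)
    c2 = mean_to - c1 * mean_from     # fixes the origin (makes the means equal)
    return c1 * x + c2

# Biology score of 60 (mean 55, SD 5) expressed on the chemistry scale (mean 85, SD 10)
bio_on_chem_scale = rescale(60, mean_from=55, sd_from=5, mean_to=85, sd_to=10)
print(bio_on_chem_scale)
```

Here C1 works out to 2 and C2 to -25, the same constants used in the hand calculation.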

III. Standard Scores & the Normal Distribution


1. Comments
As we have now seen, if we want to compare the scores from two distributions, we can do it in
either of two ways:

1. Compute the percentile ranks (but we saw that this can have its problems).
2. Transform the scores of one distribution such that the means and standard
deviations are equal in the two distributions.

A logical extension of the second procedure allows us to change all the scores to a standard scale.

2. Standard Scores

A. Theory

In this standard distribution (or one that employs a standard scale), it would be useful to have:

o A mean equal to zero. This would allow us to tell if a score is greater or
less than the mean by its sign.
o A standard deviation of one. This would allow us to tell how much a score
deviates from its mean by its magnitude.

Let z equal the standard score. It will tell us how many standard deviations a score differs from
its mean. Then the formula for z would be:

$z = \frac{X - \bar{X}}{s}$

So a z score is a score minus its mean, divided by the standard deviation. If we do this for all the
scores (letting $\bar{z}$ equal the new mean), then subtracting the constant $\bar{X}$ from every score
and dividing by the constant $s$ gives:

$\bar{z} = \frac{\bar{X} - \bar{X}}{s}$

and so

$\bar{z} = 0$

and we have the mean of zero that we wanted.

If we let $s_z$ equal the new standard deviation, then subtracting a constant leaves the standard
deviation unchanged and dividing by $s$ divides it by $s$:

$s_z = \frac{s}{s}$

and so

$s_z = 1$

So z scores are in SD units (i.e., a distance on the x axis). They tell us how much a score
deviates from its mean.

B. Application
Let's redo the earlier chemistry versus biology test example to see how much easier it is using
this strategy.

Class    X     $\bar{X}$    SD    z
Bio      60    55      5     (60 - 55)/5 = +1.00
Chem     80    85      10    (80 - 85)/10 = -0.50

Now, we can quickly see that while we scored a standard deviation above the mean in the
biology course, we scored a half of a standard deviation below the mean in the chemistry class.

Note that this technique is a lot easier to do than manually transforming the scores of one
distribution to those of another.
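As a minimal sketch, the z score calculation for the two tests looks like this in Python (the function name `z_score` is just an illustrative choice):

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z_bio = z_score(60, mean=55, sd=5)
z_chem = z_score(80, mean=85, sd=10)
print(z_bio, z_chem)
```

The sign shows the direction from the mean and the magnitude shows the distance in SD units, so the two tests can be compared directly.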

3. The Normal Distribution

Since standard scores are most often used with normal distributions, we need to learn a little bit
more about these distributions.

As we have already discussed, the normal distribution is a theoretical distribution (meaning it
doesn't really exist). A number of human behavioral characteristics approximately fit this
distribution (e.g., IQ, anxiety level, drug responsiveness, etc.).

There are actually several types of normal distributions differing in their kurtosis (or
peakedness).

Finally, all normal distributions have three properties in common.

1. The three measures of central tendency (mean, median, & mode) all coincide.
2. They are bilaterally symmetrical.
3. The tails are asymptotic to the x axis, meaning they come closer and closer but
never actually touch it. More formally, let $\infty$ equal infinity. Then the tails go
from $-\infty$ to $+\infty$.

IV. The Standard Normal Distribution


1. Comments

If we take a normal distribution and transform all of the scores to z scores, we have a standard
normal distribution. In this case, we are dealing with theoretical (population) values, so the
formula is:

$z = \frac{X - \mu}{\sigma}$

This distribution has a very special characteristic, that is, we can compute the proportion of area
under different portions of the curve. The figure below shows this in more detail. Note that the
values are obtained using integral calculus, but are provided in most statistics books in the form
of a table to save folks from having to perform these complex calculations.

Essentially the same info in a tabular view:

Distance from     % of cases between the
mean (in SDs)     mean and that distance
1                 34.13
2                 47.72
3                 49.87
Applying this to IQ scores gives:

Thus, in a normal distribution, virtually all of the scores fall within three standard deviations
from the mean. More detailed figures can be obtained from the z table in the back of your
book. Note that only positive values are given since the curve is bilaterally symmetrical.

What follows is a very small portion of a typical z table:

Z (listed to two     Area between the     Area beyond
decimal places)      mean & the z         the z
...
1.00                 .3413                .1587
1.01                 .3438                .1562
...
1.48                 .4306                .0694
1.49                 .4319                .0681
1.50                 .4332                .0668
1.51                 .4345                .0655
1.52                 .4357                .0643
...

2. Simple Applications

The z table can help us answer a lot of questions fairly quickly. However, it is best to draw
diagrams when trying to answer these questions. Consider three examples:

1. What percent of the distribution scored between the mean and a z of 1.5?

2. What percentage scored above a z score of 1.5?


3. What percentage scored below a z score of -1.5?
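If no z table is at hand, the three areas can be sketched with Python's standard library. This assumes Python 3.8 or later, where `statistics.NormalDist` is available; its `cdf` method gives the proportion of the distribution falling at or below a value.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# 1. Area between the mean (z = 0) and z = 1.5
between_mean_and_z = standard_normal.cdf(1.5) - 0.5

# 2. Area above z = 1.5
above_z = 1 - standard_normal.cdf(1.5)

# 3. Area below z = -1.5 (equal to the area above +1.5 by symmetry)
below_negative_z = standard_normal.cdf(-1.5)

print(round(between_mean_and_z, 4), round(above_z, 4), round(below_negative_z, 4))
```

The rounded results agree with the z = 1.50 row of the table: .4332 between the mean and the z, and .0668 beyond it in either tail.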

3. Parameters & the Standard Normal Distribution

Before showing additional examples of the application of the standard normal curve, we need to
talk a little bit more about parameters and statistics. As we noted earlier, since we are dealing
with theoretical (population) values, the z score formula is:

$z = \frac{X - \mu}{\sigma}$

However, sample estimates can be used if both of the following conditions are met:

1. The population from which the sample was drawn is normal in shape.
2. The sample must be reasonably large.

Thus, we are back to:

$z = \frac{X - \bar{X}}{s}$

4. More Complex Applications

Let's look at four examples that are representative of the different types of applications.

1. What is the percent of scores falling between a z of -1.5 and a z of 2?

First make a diagram of what is being asked. Then use it to develop a strategy to answer the
question. Here is the diagram.

C is what the question is asking and we can obtain A (the area between the mean and a z of 2)
and B (the area between the mean and a z of -1.5) from the z tables. C=A+B.

A .4772
+B .4332
=C .9104

Thus, the answer is that 91% of the scores fall between a z of -1.5 and a z of 2.

2. What percent of the population has an IQ falling between 110 and 120? (Note
that IQ is distributed normally and has $\mu$ = 100 & $\sigma$ = 15.) In a group of 50
folks chosen at random, how many can we expect to have an IQ between 110
and 120?

Again, make a diagram of what is being asked. Then use it to develop a strategy to answer the
question. Here is the diagram.

C is what the question is asking and we can obtain A and B from the z tables. C=A-B.

This time, though, we need to compute the z scores before we can look up the appropriate values
in the z tables.

$z_{120} = \frac{120 - 100}{15} = 1.33$ and $z_{110} = \frac{110 - 100}{15} = .67$

And now we can obtain the proportions under the curve from the z tables.

A .4082
-B .2486
=C .1596

Thus, the answer is that 16% of people would be expected to have an IQ between 110 and
120. Furthermore, 8 people (.1596 * 50 = approximately 8) out of a randomly selected 50 would
be expected to have an IQ in that range.
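As a cross-check on the table lookup, the same question can be sketched with `statistics.NormalDist` (Python 3.8 or later). Because this works with unrounded z scores rather than z values rounded to two decimals, the proportion differs slightly from the table-based .1596, so only an approximate match should be expected.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)

# Proportion of the population with an IQ between 110 and 120
proportion = iq.cdf(120) - iq.cdf(110)

# Expected number of such people in a random group of 50
expected_in_50 = proportion * 50

print(round(proportion, 2), round(expected_in_50))
```

Both routes lead to the same practical answer: roughly 16% of the population, or about 8 people out of 50.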

3. What is the Percentile Rank of a score of 155 on the Authoritarian scale
(assume $\mu$ = 150 & $\sigma$ = 20)?

Again, make a diagram of what is being asked. Then use it to develop a strategy to answer the
question. Here is the diagram.

C is what the question is asking and we can obtain A and B from the z tables. C=A+B.

We will need to compute the z of 155. Note that the z of the mean/median (150) is 0 and thus
half of the scores fall below it.

$z = \frac{155 - 150}{20} = .25$

And now we can obtain the proportions under the curve from the z tables.

A .5000
+B .0987
=C .5987

Therefore, the PR of 155 is 60. Sixty percent of the distribution falls at or below a score of 155.

4. In the distribution described in the problem above, what is the score at P90
(i.e., the score that has 90% of the distribution falling below it)?

Again, make a diagram of what is being asked. Then use it to develop a strategy to answer the
question. Here is the diagram.

In this case we are given the proportion (i.e., .9) and we will need to find the z value associated
with this proportion in the table and then use it to compute the score. From the tables, we see
that a z value of 1.28 has 40% of the distribution between it and the mean.

$X = \mu + z\sigma = 150 + 1.28(20) = 150 + 25.6 = 175.6$

So the score that has 90% of the distribution falling below it is 175.6. To check this, you can
solve it in reverse, that is, find the PR of 175.6.

If you are going to do this type of problem a bunch of times, it is easiest to just derive the
formula. That is:

$z = \frac{X - \mu}{\sigma}$, so $X = \mu + z\sigma$
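Solving the z score formula for the score gives X = mean + z * SD, and the last two examples can be sketched with `statistics.NormalDist` (Python 3.8 or later). Its `inv_cdf` method returns the score at a given proportion. Since it uses an unrounded z (about 1.2816 rather than the table's 1.28), the P90 score differs from 175.6 by a few hundredths.

```python
from statistics import NormalDist

authoritarian = NormalDist(mu=150, sigma=20)

# Example 3: percentile rank of a score of 155
pr_155 = 100 * authoritarian.cdf(155)

# Example 4: score at the 90th percentile point, two ways
score_p90 = authoritarian.inv_cdf(0.90)     # unrounded z
score_p90_table = 150 + 1.28 * 20           # X = mean + z*SD with the table's z

print(round(pr_155), round(score_p90, 1), round(score_p90_table, 1))
```

Working the problem in reverse, as suggested above, amounts to feeding the P90 score back into `cdf` and confirming that the result is close to .90.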


Homework - Measures of Relative Standing


DIRECTIONS: When computation is required, be sure to show all work neatly. Include clear and
well labeled diagrams explaining the logic to your answers.

Problems

1. Considering the expected scores for Exam 1 data set (the first example in the section on
grouped frequency distributions), What is the percentile rank of a score of 65? What is
the score at the 80th percentile point? (5 points for each question = 10 points)

2. Consider Juan's scores on his first two tests in college. He received a 75 in Biology and an
85 in Psychology. What follows are some summary statistics from the two tests. Which
grade do you think Juan was happier with? (Note that we demonstrated two ways to solve
this type of problem in class, one of which was considerably easier than the other.) (10
points)

        General    General
        Biology    Psychology
mean    71         81
SD      20         10

3. Consider the following data for the heights of players (in inches) on two basketball
teams. (10 points)

        Chicago    Milwaukee
        Bulls      Bucks
mean    76         78
SD      2          4

Now assume that the tallest player on the Bulls' team is 6'10" and the tallest player on the
Bucks' team is 7'2". Which player would stand out more from the other players on their
respective teams?

4. A distribution has a mean of 72 and a standard deviation of 16. For each part of the
problem, be sure to include the appropriate diagrams with appropriate shading and
labeling to clearly show the strategy you used to solve the problem (just like we have
been doing in class).

A. What is the percent of folks scoring between a z of -1.5 and +2? (10 points)
B. What is the percent of folks scoring between 60 and 70? (15 points)
How many students would be expected to score within this range in a typical class
(N=25)?
C. What is the percentile rank of a score of 65? (10 points)
D. What is the score at the 80th percentile point? (10 points)
5. In an IQ distribution (mean=100, standard deviation=15), how many people would we
expect to have an IQ above 130 living in Stevens Point? Note that this problem makes
two assumptions. First, Stevens Point has a population of 23,000 people. The second is
that folks in Stevens Point are representative of the population as a whole. (15 points)

Multiple Choice (10 points)

1. A z score is
a. a deviation that is standardized for all groups.
b. the number of standard deviations that a given score deviates from the mean.
c. the mean minus the median divided by the standard deviation.
d. a standard deviation.

2. What percent of the distribution falls between the mean and a z score of -1.00?
a. 13.59
b. 34.13
c. 47.72
d. 50.00

Correlation
I. Important Concepts
1. Correlation
2. Correlation Coefficient
3. Scatterplot

II. Range of a Correlation Coefficient


III. Pearson's r
1. Rationale for Computation
2. Computational Formula & Example - [Minitab]

IV. Spearman's rho


1. Example 1 - [Minitab]
2. Example 2 - [Minitab]

V. Important Issues With Correlation


1. Factors Influencing It
a. Curvilinearity
b. Limited (Restricted & Truncated) Ranges
c. Extreme Groups
d. An Extreme Score
2. Relation to Causality
3. Some Specific Uses of it

I. Important Concepts
1. Correlation
Earlier in the semester we noted that scientists are interested in relationships between
variables. When two variables vary together (a change in one is accompanied by a change
in the other), we say they are correlated.
2. Correlation Coefficient
Expresses quantitatively the extent to which two variables are related. There are several
such coefficients; we will learn about two (Pearson's r and Spearman's rho).

3. Scatterplot
A graph of a collection of pairs of scores. Example:

Note that in scatterplots, the X and Y axes are equal in length and thus this type of graph
does not obey the 3/4 high rule.

II. Range of a Correlation Coefficient


Is best illustrated with examples:
1. Perfect positive (all points fall on a straight line)

As the number of hours studied increased so did the grade. This is also called a "direct"
relationship.

More realistic example

2. Perfect negative

As the number of beers drunk increased, the grade decreased. This is also called an
"inverse" or "indirect" relationship.

More realistic example


3. No correlation

So, basically, there is no relationship between toe size and grade.

More realistic example

Summary

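r =    ±                  (0 to 1)
       Sign               Magnitude
       Gives direction    Gives strength

That is, r can range from -1 (perfect negative) through 0 (no correlation) to +1 (perfect positive).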

III. Pearson's r
Termed Pearson's Product Moment Correlation Coefficient.
Good with metric data.

Probably the most popular correlation coefficient.

Inferences based on r (e.g., significance tests) assume that both variables involved are
approximately normally distributed.

Represents quantitatively the extent to which scores on two variables occupy the same
relative position.

1. Rationale for Computation


We have seen that z scores provide information about the relative position of a score
compared to other scores in the distribution. Pearson's r uses this:

$r = \frac{\sum z_X z_Y}{N}$

Thus, r is the mean of the products of the z scores for the two variables. What
follows is a demonstration of why this works in the case of a perfect positive relationship
(variables X & Y) and in the case of a perfect negative relationship (variables X & W).

First, the perfect positive relationship between X & Y.

X      zX       zX²      Y     zY       zX·zY
3      -1.42    2.02     1     -1.42    2.02
5      -.71     .50      2     -.71     .50
7      0        0        3     0        0
9      .71      .50      4     .71      .50
11     1.42     2.02     5     1.42     2.02

For X: Mean = 7, SD = 2.82, ΣzX² = 5 = N
For Y: Mean = 3, SD = 1.41
ΣzXzY = ΣzX²
N = 5

If the relative position of the scores on the two variables is the same (as in the present
case), then the z scores of each of the variables will be the same and ΣzXzY would be
equal to ΣzX². As we saw above, ΣzX² is equal to N and thus r would equal N/N or 1.

Now for the perfect negative relationship between X & W.

X      zX       zX²      W     zW       zX·zW
3      -1.42    2.02     5     1.42     -2.02
5      -.71     .50      4     .71      -.50
7      0        0        3     0        0
9      .71      .50      2     -.71     -.50
11     1.42     2.02     1     -1.42    -2.02

For X: Mean = 7, SD = 2.82, ΣzX² = 5 = N
ΣzXzW = -5
N = 5

The scores again have the same relative position, but this time the relationship is indirect.
In this case, ΣzXzW would be equal to -N and r would be equal to -N/N or -1.
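The two demonstrations can be reproduced with a short sketch. It uses the X, Y, and W values from the tables and computes z scores with the population standard deviation (dividing by N), which is what the tabled SD of 2.82 implies. The helper names `z_scores` and `r_from_z` are illustrative only.

```python
from math import sqrt

def z_scores(values):
    """Convert raw scores to z scores using the population SD (divide by N)."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def r_from_z(x, y):
    """Pearson's r as the mean of the products of paired z scores."""
    zx, zy = z_scores(x), z_scores(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)

X = [3, 5, 7, 9, 11]
Y = [1, 2, 3, 4, 5]   # same relative positions as X
W = [5, 4, 3, 2, 1]   # reversed relative positions

r_xy = r_from_z(X, Y)
r_xw = r_from_z(X, W)
print(round(r_xy, 2), round(r_xw, 2))   # 1.0 -1.0
```

With unrounded z scores the sums come out to exactly N and -N (up to floating-point error), whereas the hand table shows 5.04 because its z scores are rounded to two decimals.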

2. Computational Formula & Example


Since the standard score formula is cumbersome, a computational formula was developed
which doesn't require the calculation of z scores for all of the scores.
Example: Scores on 20 point math and science quizzes. [Minitab]

Person Math (X) Science (Y)


A 11 11
B 13 10
C 18 17
D 12 13
E 16 14
N=5

First step would be to create a scatterplot:

Since the scatterplot looks promising (suggests a strong positive relationship), create the
necessary grid for the computations.

Person    Math (X)    Science (Y)    XY         X²          Y²
A         11          11             121        121         121
B         13          10             130        169         100
C         18          17             306        324         289
D         12          13             156        144         169
E         16          14             224        256         196
N=5       ΣX=70       ΣY=65          ΣXY=937    ΣX²=1014    ΣY²=875

Then perform the computations:
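A worked version of this step, assuming the computational formula in question is the standard raw-score form of Pearson's r (an assumption; the sums are taken from the grid):

```latex
r = \frac{N\sum XY - \sum X \sum Y}
         {\sqrt{\left[N\sum X^2 - \left(\sum X\right)^2\right]
                \left[N\sum Y^2 - \left(\sum Y\right)^2\right]}}
  = \frac{5(937) - (70)(65)}
         {\sqrt{\left[5(1014) - 70^2\right]\left[5(875) - 65^2\right]}}
  = \frac{135}{\sqrt{(170)(150)}}
  = \frac{135}{159.69}
  \approx .85
```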


As was suggested by the scatterplot, there is indeed a strong positive correlation between the
math and science scores.
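The quiz example can also be checked with a few lines of code. This is a sketch that assumes the standard raw-score formula for Pearson's r; `pearson_r` is an illustrative name, not a library function.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r from raw-score sums (no z scores needed)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

math_scores = [11, 13, 18, 12, 16]      # X
science_scores = [11, 10, 17, 13, 14]   # Y

r = pearson_r(math_scores, science_scores)
print(round(r, 2))   # 0.85
```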

IV. Spearman's Rho


A variant of Pearson's r that is used with rank data is called Spearman's Rho (rs). This
correlation coefficient is appropriate when either of the following two conditions is met:

a. One variable is on an ordinal scale and the other is on an ordinal scale or higher.
b. One of the distributions is markedly skewed.

In either case, both scales must be converted to ranks. If we computed Pearson's r on
the ranked data, it would give Spearman's Rho (the two agree exactly when there are no
tied ranks). However, for computations by hand, there is a simpler formula:

$r_s = 1 - \frac{6\sum D^2}{N(N^2 - 1)}$

where D = Rank of X - Rank of Y (i.e., a difference score).

1. Example 1. Beauty & Sociability. [Minitab]

Person Beauty Sociability


A 3 3
B 1=most 2
C 2 1=most
D 5 4
E 4 5
N=5
The first step would be to create a scatterplot.

Since the scatterplot looks promising (suggests a strong positive relationship), create the
necessary grid for the computations.

Person    Beauty    Sociability    D       D²
A         3         3              0       0
B         1=most    2              -1      1
C         2         1=most         1       1
D         5         4              1       1
E         4         5              -1      1
N=5                                ΣD=0    ΣD²=4
Then perform the computations:
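Assuming the hand formula is the usual Spearman shortcut based on the squared rank differences, the computation with ΣD² = 4 and N = 5 works out as:

```latex
r_s = 1 - \frac{6\sum D^2}{N(N^2 - 1)}
    = 1 - \frac{6(4)}{5(5^2 - 1)}
    = 1 - \frac{24}{120}
    = .80
```

This agrees with the strong positive relationship suggested by the scatterplot.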


2. Example 2. Beauty & Science scores. [Minitab]

Since the science score is a ratio variable, it makes sense to rank it from low to high, that
is, where low ranks represent low scores. If we are going to correlate beauty with this
score, it makes sense to rerank the beauty scores so that they go from low to high as well.

Person    Beauty    Beauty        Science    Science
                    (reranked)               (ranked)
A         3         3             11         2
B         1=most    5=most        10         1
C         2         4             17         5=most
D         5         1             13         3
E         4         2             14         4
N=5

Then we would create a scatterplot of the ranked scores.

The data do not look very promising, but let's prepare the grid for the computations
anyway.

Person    Beauty        Science     D       D²
          (reranked)    (ranked)
A         3             2           1       1
B         5=most        1           4       16
C         4             5=most      -1      1
D         1             3           -2      4
E         2             4           -2      4
N=5                                 ΣD=0    ΣD²=26
Then perform the computations:
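Again assuming the usual Spearman shortcut formula, the computation with ΣD² = 26 and N = 5 works out as:

```latex
r_s = 1 - \frac{6\sum D^2}{N(N^2 - 1)}
    = 1 - \frac{6(26)}{5(5^2 - 1)}
    = 1 - \frac{156}{120}
    = -.30
```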

So, as the scatterplot indicated, there wasn't much of a correlation.
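Both Spearman examples can be verified with a short sketch. It assumes the standard shortcut formula based on squared rank differences, which is valid when there are no tied ranks; `spearman_rho` is an illustrative name.

```python
def spearman_rho(rank_x, rank_y):
    """Spearman's rho from two lists of untied ranks via the D-squared shortcut."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))

# Example 1: beauty and sociability ranks (1 = most), persons A-E
beauty = [3, 1, 2, 5, 4]
sociability = [3, 2, 1, 4, 5]
rho_1 = spearman_rho(beauty, sociability)

# Example 2: beauty reranked low-to-high, science scores ranked low-to-high
beauty_reranked = [3, 5, 4, 1, 2]
science_ranked = [2, 1, 5, 3, 4]
rho_2 = spearman_rho(beauty_reranked, science_ranked)

print(round(rho_1, 2), round(rho_2, 2))   # 0.8 -0.3
```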


Note: Tied scores each receive the average of the ranks they would otherwise occupy.

Examples:

Pair of tied scores:

Person    X    Y     Y (rank)
A         3    11    4.5
B         1    11    4.5
C         2    17    1
D         5    13    3
E         4    14    2
N=5

Three scores tied:

Person    X    Y     Y (rank)
A         3    11    4
B         1    11    4
C         2    11    4
D         5    13    2
E         4    14    1
N=5
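The averaging rule for ties can be sketched as a small function. In line with the two tables, rank 1 goes to the highest score; the function name `average_ranks` is hypothetical.

```python
def average_ranks(scores):
    """Rank scores with 1 = highest; tied scores share the mean of their ranks."""
    ordered = sorted(scores, reverse=True)         # highest score first
    rank_of = {}
    for value in set(scores):
        first = ordered.index(value) + 1           # first rank position held by this value
        count = ordered.count(value)               # how many scores are tied at this value
        rank_of[value] = first + (count - 1) / 2   # mean of positions first..first+count-1
    return [rank_of[s] for s in scores]

pair_tied = average_ranks([11, 11, 17, 13, 14])
three_tied = average_ranks([11, 11, 11, 13, 14])
print(pair_tied)    # [4.5, 4.5, 1.0, 3.0, 2.0]
print(three_tied)   # [4.0, 4.0, 4.0, 2.0, 1.0]
```

In the first case the two scores of 11 would occupy ranks 4 and 5, so each gets 4.5; in the second, the three scores of 11 would occupy ranks 3, 4, and 5, so each gets 4.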

V. Important Issues With Correlation


1. Factors Influencing the Correlation

These are the reasons why it is important to create a scatterplot.

a. Curvilinearity
A linear (or monotonic) relationship is best characterized by a straight line.
Both r and rs assume this.
Example linear relationship:

Example of a curvilinear (or nonmonotonic) relationship:

In general, curvilinearity in a relationship will result in an r that
underestimates the true relationship.
b. Limited (Restricted & Truncated) Ranges
These refer to situations in which the sample is somehow limited. In both cases,
the result is an underestimated r.

Example of a Restricted Range - Foot size and age in 6 year olds:

Example of a Truncated Range - ACT scores and GPA in college students:

c. Extreme Groups
Results in an overestimated r. Consider looking at the relationship of
reading ability and IQ, but only in poor and excellent readers:
d. An Extreme Score
Also results in an overestimated r. This is more of a problem when using small
sample sizes. Example:
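Three of these effects (curvilinearity, a restricted range, and an extreme score) can be illustrated numerically. The sketch below uses small made-up data sets chosen only for illustration; none of the numbers come from the course examples, and `pearson_r` is an illustrative helper.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r from deviation scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# a. Curvilinearity (made-up data): Y is completely determined by X, yet r is 0
x_curve = [-2, -1, 0, 1, 2]
y_curve = [v ** 2 for v in x_curve]
r_curve = pearson_r(x_curve, y_curve)

# b. Restricted range (made-up data): full range vs. only X from 4 to 6
x_full = list(range(1, 11))
y_full = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]
r_full = pearson_r(x_full, y_full)
r_restricted = pearson_r(x_full[3:6], y_full[3:6])

# d. An extreme score (made-up data): five unrelated pairs, then one outlier added
x_small = [1, 2, 3, 4, 5]
y_small = [2, 1, 3, 1, 2]
r_without = pearson_r(x_small, y_small)
r_with = pearson_r(x_small + [20], y_small + [20])

print(round(r_full, 2), round(r_restricted, 2), round(r_with, 2))
```

In this toy data, r drops from about .94 over the full range to about .65 in the restricted range, and a single extreme pair moves r from essentially zero to above .9, which is why inspecting the scatterplot first matters.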

2. Relation to Causality

Possible causal relationships between X and Y if they are correlated include:

Possibility    Symbols               Explanation
a.             X → Y                 X causes Y
b.             X ← Y                 Y causes X
c.             X ← A → Y             A causes both X & Y
d.             B → C → X, B → Y      Etc.

The main point is that correlation doesn't tell us much about causality. It should be
noted that inferring causality from a correlation is an error that is all too common.

3. Some Specific Uses of Correlation


a. Determining Reliabilities
Compare two raters' (interobserver) or the same rater's (intraobserver)
observations of behavior to see if they agree.
b. Determining Validities
If ACT scores are highly correlated with GPAs, then we can say that ACT
scores are a valid predictor of GPAs.
c. For Prediction
A set of procedures similar to correlation called regression is used for
predicting one variable from one or more other variables.
