Here, we are interested in the typical, most representative score. There are three measures of
central tendency that you should be familiar with. Note that when reporting these values, one
additional decimal of accuracy is given compared to what is available in the raw data (even if the
additional decimal is a zero, e.g., 43.0).
1. Mean
It is simply the arithmetic average, that is, the sum of the scores divided by the number of scores.
It is symbolized as $\bar{X}$ ("X-bar") and computed as $\bar{X} = \frac{\sum X}{N}$.
Computation - Example:
X
2
3
5
10
$\sum X = 20$
N = 4
$\bar{X} = \frac{\sum X}{N} = \frac{20}{4} = 5.0$
Since means are typically reported with one more digit of accuracy than is present
in the data, I reported the mean as 5.0 rather than just 5.
For example:
Thus the formula for computing the mean with grouped data gives us a
good approximation of the actual mean. In fact, when we report the mean
with one decimal more accuracy than what is in the data, the two
techniques give the same result.
Properties
1. It is sensitive to all of the scores. In other words, if one score in the distribution is
changed, the mean will change too. Example:
Xs            Mean
1, 2, 3       2
1, 2, 30      11
1, 2, 300     101
2.
3. The sum of the deviations about the mean equals zero. A deviation is symbolized
as x (little x) and refers to the difference between a score and the mean. That is, $x = X - \bar{X}$:
X     Mean   x
2     5.0    -3
3     5.0    -2
5     5.0    0
10    5.0    5
$\sum x = 0$
4. The sum of the squared deviations about the mean is less than the sum of the
squared deviations about any other value. Example (with "4" as the arbitrary
"other value"):
X     X - 5   (X - 5)^2   X - 4   (X - 4)^2
2     -3      9           -2      4
3     -2      4           -1      1
5     0       0           1       1
10    5       25          6       36
$\sum (X - 5)^2 = 38$     $\sum (X - 4)^2 = 42$
So, 38 is less than 42. This relationship would hold for any "other value."
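The two properties just shown (deviations summing to zero, and the sum of squares being smallest about the mean) can be checked with a few lines of Python. This is only a minimal sketch using the four example scores; the function name `sum_of_squares` is made up for illustration.

```python
# Example scores from the mean computation above.
scores = [2, 3, 5, 10]
mean = sum(scores) / len(scores)


def sum_of_squares(values, center):
    """Sum of squared deviations of the values about an arbitrary center."""
    return sum((x - center) ** 2 for x in values)


# Property: the deviations about the mean sum to zero.
deviation_sum = sum(x - mean for x in scores)

# Property: the sum of squares about the mean is smaller than about any other value.
ss_about_mean = sum_of_squares(scores, mean)
ss_about_four = sum_of_squares(scores, 4)

print(mean, deviation_sum, ss_about_mean, ss_about_four)
```

Trying other values in place of 4 gives a sum of squares larger than the one about the mean each time.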
Variations
Weighted Mean
Each quantity to be averaged is assigned a weight. These weightings determine the
relative importance of each quantity in the average. We will see an example in our grade
postings, where the homework assignments are weighted at 20% and the exams are
weighted at 80%.
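A weighted mean with the 20% and 80% weights mentioned above might be sketched as follows. The homework and exam averages (90 and 75) are hypothetical values chosen only for illustration, and `weighted_mean` is an assumed helper name.

```python
def weighted_mean(values, weights):
    """Sum of (value * weight) divided by the sum of the weights."""
    total = sum(v * w for v, w in zip(values, weights))
    return total / sum(weights)


# Hypothetical averages: 90 on homework, 75 on exams.
homework_avg, exam_avg = 90, 75

# Weights expressed as percentages: homework 20, exams 80.
grade = weighted_mean([homework_avg, exam_avg], [20, 80])
print(grade)
```

Because the exams carry four times the weight of the homework, the result sits much closer to the exam average than the simple average of the two numbers would.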
Trimmed Mean
A mean that is computed on the middle 95% of the distribution. It can be a more stable
estimate than the regular mean because it is less sensitive to outliers.
2. Median or Md
The score that cuts the distribution into two equal halves (or the middle score in the
distribution).
1. With an odd number of scores and no duplication near the middle, the median is
the middle score.
Ex: 1, 2, 2, 4, 6, 7, 7. N=7 & Md = 4.
2. With an even number of scores and no duplication near the middle, the median is
the average of the two middle scores.
Ex: 2, 2, 4, 6, 7, 7. N=6 & Md = (4+6)/2 = 5.
3. Duplication near the middle.
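Ex: 4, 5, 5, 5, 6, 6. N=6. Here the median falls within the interval of tied scores (4.5 to 5.5) and is found by interpolation:
$Md = L + \left(\frac{N/2 - n_b}{n_w}\right) i = 4.5 + \left(\frac{3 - 1}{3}\right)(1) = 5.17$
Where:
L = 4.5 (lower exact limit of the interval containing the median)
nb = 1 (number of scores below L)
nw = 3 (number of scores within the interval)
i = 1 (interval width)
N = 6 (number of scores)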
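The interpolation for a median with duplication near the middle can be sketched in Python. This assumes the usual grouped-data median formula, Md = L + ((N/2 - nb)/nw) * i, which is consistent with the quantities listed above; the function name is made up for illustration.

```python
def interpolated_median(lower_limit, n_below, n_within, width, n):
    """Median by linear interpolation within the interval that contains it.

    lower_limit: lower exact limit of the interval (L)
    n_below:     number of scores below L (nb)
    n_within:    number of scores within the interval (nw)
    width:       interval width (i)
    n:           total number of scores (N)
    """
    return lower_limit + ((n / 2 - n_below) / n_within) * width


# Scores 4, 5, 5, 5, 6, 6: the three 5s occupy the interval 4.5 to 5.5.
md = interpolated_median(lower_limit=4.5, n_below=1, n_within=3, width=1, n=6)
print(round(md, 2))
```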
Properties
Xs             Mean    Md
1, 2, 3        2       2
1, 2, 30       11      2
1, 2, 300      101     2
1, 2, 3000     1001    2
3. Mode
The most frequently occurring score. Note:
- There can be more than one. A distribution can be bi- or tri-modal, in which case we
speak of major and minor modes.
- It is symbolized as Mo.
Note that the presence and direction of skew in the distribution can be determined from the
mean and median. The key to understanding this is to be aware that the mean is sensitive to all
scores, while the median is not. There are three rules:
Variability refers to the extent to which the scores in a distribution differ from each other. An
equivalent definition (that is easier to work with mathematically) says that variability refers to
the extent to which the scores in a distribution differ from their mean. If a distribution is lacking
in variability, we may say that it is homogeneous (note the opposite would be heterogeneous). Note
that when reporting these values, two additional decimals of accuracy are given compared to
what is available in the raw data (even if the last decimal is a zero, e.g., 4.30). The exception is
the range, where no extra decimals are needed because it is a crude measure (as we will see in a
moment).
We will discuss four measures of variability for now: the range, mean or average deviation,
variance and standard deviation.
1. Range
As we noted when discussing the rules for creation of a grouped frequency distribution,
the range is given by the highest score in the distribution minus the lowest score plus one.
$R = X_H - X_L + 1$
Example:
Distribution A has a larger range (and more variability) than Distribution B.
Because only the two extreme scores are used in computing the range, however, it is a
crude measure. For example:
The range of Distribution A and B is the same, although Distribution A has more
variability.
Variations
The problem with the MD is that due to the use of the absolute value, it is a terminal
procedure. In other words, it cannot be used in further calculations (which is something
that we would like to be able to do).
3. Variance
Another solution to the problem of the deviations summing to zero is to square the
deviations. That is, the variance is $\frac{\sum (X - \bar{X})^2}{N}$.
Thus another name for the Variance is the Mean of the Squared Deviations About the
Mean (or more simply, the Mean of Squares (MS)). The problem with the MS is that its
units are squared and thus represent space, rather than a distance on the X axis like the
other measures of variability.
4. Standard Deviation
A simple solution to the problem of the MS representing a space is to compute its square
root. That is, the standard deviation is $\sqrt{\frac{\sum (X - \bar{X})^2}{N}}$.
Since standard deviations can sometimes be very small, you may need more than
the 2 additional decimals of accuracy beyond what is available in the original data,
as was suggested at the outset of this section.
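The four measures of variability discussed in this section can be sketched in Python as follows. This is an illustration only: it reuses the scores 2, 3, 5, 10 from the mean example, takes the mean deviation to be the mean of the absolute deviations, and divides by N (the descriptive form, before the N-1 adjustment used for estimation).

```python
import math

scores = [2, 3, 5, 10]
n = len(scores)
mean = sum(scores) / n

# Range as defined in these notes: highest score minus lowest score plus one.
score_range = max(scores) - min(scores) + 1

# Mean (average) deviation: mean of the absolute deviations about the mean.
mean_deviation = sum(abs(x - mean) for x in scores) / n

# Variance: mean of the squared deviations about the mean.
variance = sum((x - mean) ** 2 for x in scores) / n

# Standard deviation: square root of the variance.
standard_deviation = math.sqrt(variance)

print(score_range, mean_deviation, variance, round(standard_deviation, 2))
```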
III. Estimation
Estimation is the goal of inferential statistics. We use sample values to estimate population
values. The symbols are as follows:
Quantity             Sample statistic   Population parameter
Mean                 $\bar{X}$          $\mu$
Variance             $s^2$              $\sigma^2$
Standard Deviation   $s$                $\sigma$
In order to make the sample variance an unbiased estimator, we use N-1 in the denominator of the formula rather
than just N. Thus:
$s^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$ and $s = \sqrt{s^2}$
Note that this is a defining formula and, as we will see below, is not the best choice when
actually doing the calculations.
Notice that the central tendency and range of the two distributions are the same. That is, the
mean, median, and mode all equal 100 for both distributions and the range is 101 for both
distributions. However, while Distributions A and B have the same measures of central tendency
and the same range, they differ in their variability. Distribution A has more of it. Let us prove this
by computing the standard deviation in each case. First, for Distribution A:
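A      x = A - 100   x^2
150    50            2500
145    45            2025
100    0             0
100    0             0
55     -45           2025
50     -50           2500
$\sum x^2 = 9050$, so $s^2 = 9050/(6 - 1) = 1810.00$ and $s = 42.54$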
Note that calculating the variance and standard deviation in this manner requires computing the
mean and subtracting it from each score. Since this is not very efficient and can be less accurate
as a result of rounding error, computational formulas are typically used. They are given as
follows:
$s^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N - 1}$
and
$s = \sqrt{s^2}$
A      X^2
150    22500
145    21025
100    10000
100    10000
55     3025
50     2500
$\sum X = 600$     $\sum X^2 = 69050$
N = 6
Then, plugging in the appropriate values into the computational formula gives:
$s^2 = \frac{69050 - 600^2/6}{6 - 1} = \frac{9050}{5} = 1810.00$ and $s = \sqrt{1810} = 42.54$
Note that the defining and computational formulas give the same result, but the computational
formula is easier to work with (and potentially more accurate due to less rounding error).
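To see that the defining and computational formulas agree, here is a minimal Python sketch using the Distribution A scores. The function names are made up for illustration, and both functions use N-1 in the denominator as described in the Estimation section.

```python
import math


def variance_defining(values):
    """Sample variance: sum of squared deviations about the mean over N - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def variance_computational(values):
    """Sample variance from the sum of X and the sum of X squared only."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


dist_a = [150, 145, 100, 100, 55, 50]

var_def = variance_defining(dist_a)
var_comp = variance_computational(dist_a)
sd_a = math.sqrt(var_comp)

print(var_def, var_comp, round(sd_a, 2))
```

The same two functions can be applied to the Distribution B scores to compare the variability of the two distributions.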
B      X^2
150    22500
110    12100
100    10000
100    10000
90     8100
50     2500
$\sum X = 600$     $\sum X^2 = 65200$
N = 6
Then, plugging in the appropriate values into the computational formula gives:
$s^2 = \frac{65200 - 600^2/6}{6 - 1} = \frac{5200}{5} = 1040.00$ and $s = \sqrt{1040} = 32.25$
Problems
1. Compute the mean, range, mean deviation, variance, and standard deviation for the
following sample data. Compute the standard deviation using the definitional formula and
then again using the computational formula. The idea is for you to see how much easier
the computational formula is to use than the definitional formula. When you are through
with the manual computations, use Minitab to do them. The data represent the number of
fish caught while ice fishing for 10 people (30 points).
9, 7, 5, 8, 6, 3, 0, 6, 6, 1.
2. Compute the mean, range, mean deviation, variance, and standard deviation for the
following sample data. Compute the standard deviation using the definitional formula and
then again using the computational formula. The idea is for you to see how much easier
the computational formula is to use than the definitional formula. When you are through
with the manual computations, use Minitab to do them. The data represent the average
number of hours of sleep per night for a group of teenagers (30 points).
5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 7, 8, 8, 8, 8, 9, 10, 10, 10, 10, 11, 11.
3. Compute the mean and standard deviation for the following by whatever manual
method you prefer, and then use Minitab to do them. The data represent hypothetical
scores on an exam for a sample of students (30 points).
63 88 79 92 86 87 83 78 41 67
68 76 46 81 92 77 84 76 70 66
77 75 98 81 82 81 87 78 80 60
94 79 52 82 77 61 77 70 74 61
2. Suppose the mean of our first exam is 77 and the median is 82. This would indicate a
_____ distribution.
a. normal
b. bimodal
c. rectangular
d. positively skewed
e. negatively skewed
A. Rationale
Consider an example where you receive the same grade on a test in two different
classes.
In which class did you do better?
This example should make clear that just because you got the same grade in two
classes doesn't necessarily mean that you did equally well in both classes. We
need a precise way to measure this. One measure often used is called the
Percentile Rank (or PR). Note that the cumulative percent (C%) gives the PR or
% falling at or below a given score. However, unlike the C% which is only given
for the upper exact limit of the interval, we need to be able to compute this for any
score.
Quantity Value
i 5
Apparent Limits 30 - 34
Exact Limits 29.5 - 34.5
f 6
So if score X is halfway through the interval, then score X must have half of the
scores in that interval falling below it. For example, consider the expected scores
for Exam 1 data set. Let's say we want to know the percentile rank of the score of
82, then:
$PR = \left(\frac{n_b + \left(\frac{X - L}{i}\right) n_w}{N}\right) \times 100$
Where:
PR = Percentile Rank.
X = the score we are interested in.
L = the Lower exact limit of the interval containing X.
nb = number (frequency) below the lower exact limit of the interval containing X.
nw = number (frequency) within the interval containing X.
i = the interval width.
N = the Number of scores.
Note that your book uses a slightly different formula. That is, the
proportion (p) is computed and then it needs to be multiplied by
100 to get the percentile rank. The formula given above combines
these two steps.
B. Example
Consider the expected scores for Exam 1 data set. By looking at the data, we can
see that the score of 82 will have PR between 40 and 64. In this case:
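PR = ?
X = 82
estimate = 40-64
L = 79.5
nb = 10
nw = 6
i = 5
N = 25
$PR = \left(\frac{10 + \left(\frac{82 - 79.5}{5}\right)(6)}{25}\right) \times 100 = \frac{13}{25} \times 100 = 52$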
So 52% of folks scored at or below a score of 82. (And note that 52 is between 40
& 64 as predicted. So this is a helpful check on our work.)
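The percentile rank computation for the score of 82 can be sketched in Python. The formula assumed here, PR = ((nb + ((X - L)/i) * nw) / N) * 100, reproduces the 52% reported above; the function name is made up for illustration.

```python
def percentile_rank(x, lower_limit, n_below, n_within, width, n):
    """Percent of scores falling at or below x, by interpolation within an interval."""
    # Share of the interval's scores assumed to fall below x.
    below_within_interval = ((x - lower_limit) / width) * n_within
    return (n_below + below_within_interval) / n * 100


pr_82 = percentile_rank(x=82, lower_limit=79.5, n_below=10, n_within=6, width=5, n=25)
print(round(pr_82))
```

A score of 82 is halfway through the interval 79.5 to 84.5, so half of the 6 scores in that interval are counted as falling below it.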
2. Percentile Points
Percentile points are like pennies, deciles are like dimes, and quartiles are like quarters. They
are essentially a way of dividing up the 100% that comprises the entire group. Note that the
median equals P50, D5, or Q2.
In dealing with percentile rank, we asked the question, "What percent of the group scored at or
below a particular score?" We can also ask the reverse. For example, "What score did 90% of
the group fall at or below?" Actually, we have already dealt with this question, that is, the
median is the score that has 50% falling at or below it. The formula for the score at a given
percentile point and the formula for the median usually use a slightly different notation. That is,
the median uses N/2, while the formula for the score at a given percentile point uses P(N) where
P is the percentile expressed as a proportion. Note that if P=.5, then P(N)=N/2.
$\text{Score}_P = L + \left(\frac{P(N) - n_b}{n_w}\right) i$
where the symbols for the quantities involved are the same as for the PR formula. Let's do two
examples. We will compute the score at the first and third quartile points. Our first step is to
determine the relevant interval in each case.
Relevant Quantity   Q1=P25    Q3=P75
P                   .25       .75
Interval            75-79     85-89
L                   74.5      84.5
N                   25        25
nb                  6         16
nw                  4         5
i                   5         5
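For P25: $74.5 + \left(\frac{.25(25) - 6}{4}\right)(5) = 74.5 + 0.31 = 74.81$
For P75: $84.5 + \left(\frac{.75(25) - 16}{5}\right)(5) = 84.5 + 2.75 = 87.25$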
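With Q1 and Q3, we can compute a new measure of variability called the Interquartile
Range or IQR. Thus:
IQR = Q3 - Q1
A related measure is the Semi-Interquartile Range or SIQR:
SIQR = IQR/2
Using the example above:
IQR = 87.25 - 74.81 = 12.44
SIQR = 12.44/2 = 6.22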
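The quartile computations can be checked with a short Python sketch. It assumes the percentile point formula L + ((P(N) - nb)/nw) * i with the quantities from the table above; the function name is made up for illustration.

```python
def score_at_percentile(p, lower_limit, n_below, n_within, width, n):
    """Score that has proportion p of the distribution falling at or below it."""
    return lower_limit + ((p * n - n_below) / n_within) * width


# Values taken from the table of relevant quantities above.
q1 = score_at_percentile(p=0.25, lower_limit=74.5, n_below=6, n_within=4, width=5, n=25)
q3 = score_at_percentile(p=0.75, lower_limit=84.5, n_below=16, n_within=5, width=5, n=25)

iqr = q3 - q1    # interquartile range
siqr = iqr / 2   # semi-interquartile range

print(round(q1, 2), round(q3, 2), round(iqr, 2), round(siqr, 2))
```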
The ideas of percentile rank and percentile points are helpful; however, they don't always answer
the questions we have. For example, consider the following distributions of test scores.
Earlier we noted that parametric scales have a zero point or origin and, by definition, these scales
are measured with units. For example, Fahrenheit (°F) and Celsius (°C) have different origins
and units. However, since they measure the same thing, one can convert back and forth with a
formula. For example, to get Fahrenheit from Celsius, we can use the following:
°F = °C (1.8) + 32°
To go in the other direction, we solve for °C, that is:
°C = °F (.56) - 17.78°
Note:
Point on Scale   °F     °C
Boiling          212    100
Freezing         32     0
Difference       180    100
So 1.8 units on the Fahrenheit scale equal 1 unit on the Celsius scale. Or looking at it the other
way, .56 of a unit on the Celsius scale equals 1 unit on the Fahrenheit scale.
The moral of the story is that if we have a set of scores and we want to change the unit, we need
to multiply all scores by a constant. (Note that division can be viewed as multiplication by a
reciprocal.) In addition, if we want to change the origin, we need to add a constant to all the
scores. (Note that subtraction can be viewed as addition of a negative.)
Now let us look at what happens to the mean and standard deviation of a distribution when we
change the properties of a scale.
If a constant (c) is added to all scores, then the new mean (or x-bar prime) will be equal to the
old mean plus the constant. That is:
$\bar{X}' = \bar{X} + c$
and
$s' = s$
Thus, adding a constant to all the scores simply shifts the score values. Assume that $\bar{X} = 5$ and
c = 3.
So the new mean would be 8 (5+3) and the standard deviation would remain unchanged.
If all scores are multiplied by a (positive) constant, then the new mean will equal the old mean times the
constant, and the same is true of the standard deviation. That is:
$\bar{X}' = c\bar{X}$
and
$s' = cs$
Thus, multiplying all scores by a constant will shift the score values as well as their variability.
To see this visually, consider the following example (where c=3).
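The effect of adding or multiplying by a constant can also be checked numerically. The sketch below is an illustration only: it reuses the scores 2, 3, 5, 10 (mean of 5) with c=3 and relies on Python's standard `statistics` module, whose `stdev` uses N-1 in the denominator.

```python
import statistics

scores = [2, 3, 5, 10]
c = 3

shifted = [x + c for x in scores]   # change of origin
scaled = [x * c for x in scores]    # change of unit

old_mean, old_sd = statistics.mean(scores), statistics.stdev(scores)
shifted_mean, shifted_sd = statistics.mean(shifted), statistics.stdev(shifted)
scaled_mean, scaled_sd = statistics.mean(scaled), statistics.stdev(scaled)

# Adding c moves the mean by c and leaves the standard deviation alone.
print(old_mean, shifted_mean, round(old_sd, 2), round(shifted_sd, 2))

# Multiplying by c multiplies both the mean and the standard deviation by c.
print(scaled_mean, round(scaled_sd, 2), round(c * old_sd, 2))
```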
3. Application
Test X
Biology 60
Chemistry 80
In which class did you do better? From what we have discussed thus far, it should be apparent
that:
So let's see what we can make of this when we are given the means and standard deviations.
Test        X     Mean   s
Biology     60    55     5
Chemistry   80    85     10
Note that this data supports the figure above. That is, you scored above the mean on the biology
test and below the mean on the chemistry test.
Based on what we just covered about changing the properties of scales, one strategy for solving
this problem would be to transform the biology distribution to make its mean and standard
deviation equal to that of the chemistry distribution. In other words:
Test                                X     Mean   s
Biology                             60    55     5
Chemistry                           80    85     10
Transformed Biology distribution    ?     85     10
So we need to determine the constants in the conversion formula and then transform the biology
score. Our goal is to come up with a biology score that uses a scale with the same mean and
standard deviation as the chemistry score.
Remember that C1 changes the unit and C2 changes the origin. When using the conversion formula
$X' = C_1 X + C_2$, be concerned first with C1 and then worry about C2.
$s' = C_1 s$, so $10 = C_1 (5)$
$\bar{X}' = C_1 \bar{X} + C_2$, so $85 = C_1 (55) + C_2$
Therefore, let:
C1 = 2 (do first, to fix the unit and make the SDs equal)
C2 = -25 (do second, to fix the origin and make the means equal)
Thus:
85 = 2*55 + -25
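Which shows the constants work, and now we can use them to transform the biology score:
X' = 2*60 + -25 = 95
So the biology score of 60 is equivalent to a 95 on the chemistry scale, which is higher than the chemistry score of 80.
Thus, we have two ways to compare scores from different distributions: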
1. Compute the percentile ranks (but we saw that this can have its problems).
2. Transform the scores of one distribution such that the means and standard
deviations are equal in the two distributions.
A logical extension of the second procedure allows us to change all the scores to a standard scale.
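The biology-to-chemistry transformation worked through above can be sketched in Python. The helper name `matching_constants` is made up for illustration; it solves for C1 from the standard deviations first and then for C2 from the means, in the order recommended above.

```python
def matching_constants(old_mean, old_sd, new_mean, new_sd):
    """Constants for X' = c1 * X + c2 that give the new mean and SD."""
    c1 = new_sd / old_sd              # fixes the unit (makes the SDs equal)
    c2 = new_mean - c1 * old_mean     # fixes the origin (makes the means equal)
    return c1, c2


# Biology: mean 55, SD 5. Chemistry: mean 85, SD 10.
c1, c2 = matching_constants(old_mean=55, old_sd=5, new_mean=85, new_sd=10)

biology_score = 60
transformed_biology = c1 * biology_score + c2

print(c1, c2, transformed_biology)
```

On the chemistry scale the biology score becomes 95, which can be compared directly with the chemistry score of 80.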
2. Standard Scores
A. Theory
In this standard distribution (or one that employs a standard scale), it would be useful to have:
Let z equal the standard score. It will tell us how many standard deviations a score differs from
its mean. Then the formula for z would be:
$z = \frac{X - \bar{X}}{s}$
So a z score is a score minus its mean divided by the standard deviation. If we do this for all the
scores in a distribution, the resulting z scores have a mean of 0 and a standard deviation of 1.
So Z scores are in SD units (i.e., a distance on the x axis). They tell us how much a score
deviates from its mean.
B. Application
Let's redo the earlier chemistry versus biology test example to see how much easier it is using
this strategy.
Class   X     Mean   SD    z
Bio     60    55     5     (60 - 55)/5 = 1.0
Chem    80    85     10    (80 - 85)/10 = -0.5
Now, we can quickly see that while we scored a standard deviation above the mean in the
biology course, we scored a half of a standard deviation below the mean in the chemistry class.
Note that this technique is a lot easier to do than manually transforming the scores of one
distribution to those of another.
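The z score comparison above takes only a few lines in Python. This is a minimal sketch; `z_score` is a made-up helper name.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


z_bio = z_score(60, mean=55, sd=5)
z_chem = z_score(80, mean=85, sd=10)

better = "Biology" if z_bio > z_chem else "Chemistry"
print(z_bio, z_chem, better)
```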
Since standard scores are most often used with normal distributions, we need to learn a little bit
more about these distributions.
There are actually several types of normal distributions differing in their kurtosis (or
peakedness).
Finally, all normal distributions have three properties in common.
1. The three measures of central tendency (mean, median, & mode) all coincide.
2. They are bilaterally symmetrical.
3. The tails are asymptotic to the x axis, meaning they come closer and closer but
never actually touch it. More formally, let $\infty$ equal infinity. Then the tails go
from $-\infty$ to $+\infty$.
If we take a normal distribution and transform all of the scores to z scores, we have a standard
normal distribution. In this case, we are dealing with theoretical (population) values, so the
formula is:
$z = \frac{X - \mu}{\sigma}$
This distribution has a very special characteristic, that is, we can compute the proportion of area
under different portions of the curve. The figure below shows this in more detail. Note that the
values are obtained using integral calculus, but are provided in most statistics books in the form
of a table to save folks from having to perform these complex calculations.
Essentially the same info in a tabular view:
Distance from mean (in SDs)   % of cases
1                             34.13
2                             47.72
3                             49.87
Thus, in a normal distribution, virtually all of the scores fall within three standard deviations
from the mean. More detailed figures can be obtained from the z table in the back of your
book. Note that only positive values are given since the curve is bilaterally symmetrical.
The z table can help us answer a lot of questions fairly quickly. However, it is best to draw
diagrams when trying to answer these questions. Consider three examples:
1. What percent of the distribution scored between the mean and a z of 1.5?
Before showing additional examples of the application of the standard normal curve, we need to
talk a little bit more about parameters and statistics. As we noted earlier, since we are dealing
with theoretical (population) values, the z score formula is:
$z = \frac{X - \mu}{\sigma}$
However, sample estimates can be used if both of the following conditions are met:
1. The population from which the sample was drawn is normal in shape.
2. The sample must be reasonably large.
Let's look at four examples that are representative of the different types of applications.
First make a diagram of what is being asked. Then use it to develop a strategy to answer the
question. Here is the diagram.
C is what the question is asking and we can obtain A and B from the z tables. C=A+B.
A (mean to z = +2)      .4772
+B (mean to z = -1.5)   .4332
=C                      .9104
Thus, the answer is that 91% of the scores fall between a z of -1.5 and a z of +2.
2. What percent of the population has an IQ falling between 110 and 120? (Note
that IQ is distributed normally and has $\mu = 100$ and $\sigma = 15$.) In a group of 50
folks chosen at random, how many can we expect to have an IQ between 110
and 120?
Again, make a diagram of what is being asked. Then use it to develop a strategy to answer the
question. Here is the diagram.
C is what the question is asking and we can obtain A and B from the z tables. C=A-B.
This time, though, we need to compute the z scores before we can look up the appropriate values
in the z tables.
$z_{120} = \frac{120 - 100}{15} = 1.33$ and $z_{110} = \frac{110 - 100}{15} = .67$
And now we can obtain the proportions under the curve from the z tables.
A .4082
-B .2486
=C .1596
Thus, the answer is that 16% of people would be expected to have an IQ between 110 and
120. Furthermore, 8 people (.1596 * 50 = approximately 8) out of a randomly selected 50 would
be expected to have an IQ in that range.
Again, make a diagram of what is being asked. Then use it to develop a strategy to answer the
question. Here is the diagram.
C is what the question is asking and we can obtain A and B from the z tables. C=A+B.
We will need to compute the z of 155. Note that the z of the mean/median (150) is 0 and thus
half of the scores fall below it.
And now we can obtain the proportions under the curve from the z tables.
A .5000
+B .0987
=C .5987
Therefore, the PR of 155 is 60. Sixty percent of the distribution falls at or below a score of 155.
4. In the distribution described in the problem above, what is the score at P90
(i.e., the score that has 90% of the distribution falling below it)?
Again, make a diagram of what is being asked. Then use it to develop a strategy to answer the
question. Here is the diagram.
In this case we are given the proportion (i.e., .9) and we will need to find the z value associated
with this proportion in the table and then use it to compute the score. From the tables, we see
that a z value of 1.28 has 40% of the distribution between it and the mean (and 50% + 40% = 90%).
So the score that has 90% of the distribution falling below it is 175.6. To check this, you can
solve it in reverse, that is, find the PR of 175.6.
If you are going to do this type of problem a bunch of times, it is easiest to just derive the
formula by solving the z score formula for X. That is:
$X = \mu + z\sigma$
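The z table lookups in the four examples above can be reproduced with `NormalDist` from Python's standard `statistics` module (Python 3.8 or later). This is a sketch under one stated assumption: the last two examples use a mean of 150 and a standard deviation of 20, and the 20 is inferred from the worked answers (a z of 0.25 for a score of 155, and 175.6 at P90) rather than stated in the surviving text.

```python
from statistics import NormalDist

# Example 1: area between z = -1.5 and z = +2 on the standard normal curve.
standard_normal = NormalDist()  # mean 0, SD 1
area_between_z = standard_normal.cdf(2) - standard_normal.cdf(-1.5)

# Example 2: proportion of IQs between 110 and 120 (mean 100, SD 15),
# and the expected count in a random group of 50.
iq = NormalDist(mu=100, sigma=15)
iq_area = iq.cdf(120) - iq.cdf(110)
expected_in_50 = iq_area * 50

# Examples 3 and 4: mean 150; the SD of 20 is an inferred assumption.
dist = NormalDist(mu=150, sigma=20)
pr_155 = dist.cdf(155) * 100     # percentile rank of a score of 155
score_p90 = dist.inv_cdf(0.90)   # score at the 90th percentile point

print(round(area_between_z, 4), round(iq_area, 4), round(expected_in_50))
print(round(pr_155), round(score_p90, 1))
```

The IQ result differs slightly from the .1596 obtained with the table because the table lookup rounds the z scores to two decimals first.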
Problems
1. Considering the expected scores for Exam 1 data set (the first example in the section on
grouped frequency distributions), What is the percentile rank of a score of 65? What is
the score at the 80th percentile point? (5 points for each question = 10 points)
2. Consider Juan's scores on his first two tests in college. He received a 75 in Biology and a
85 in Psychology. What follows are some summary statistics from the two tests. Which
grade do you think Juan was happier with? (Note that we demonstrated two ways to solve
this type of problem in class, one of which was considerably easier than the other.) (10
points)
General General
Biology Psychology
mean 71 81
SD 20 10
3. Consider the following data for the heights of players (in inches) on two basketball
teams. (10 points)
Chicago Milwaukee
Bulls Bucks
mean 76 78
SD 2 4
Now assume that the tallest player on the Bulls' team is 6'10" and the tallest player on the
Bucks' team is 7'2". Which player would stand out more from the other players on their
respective teams?
4. A distribution has a mean of 72 and a standard deviation of 16. For each part of the
problem, be sure to include the appropriate diagrams with appropriate shading and
labeling to clearly show the strategy you used to solve the problem (just like we have
been doing in class).
A. What is the percent of folks scoring between a z of -1.5 and +2? (10 points)
B. What is the percent of folks scoring between 60 and 70? (15 points)
How many students would be expected to score within this range in a typical class
(N=25)?
C. What is the percentile rank of a score of 65? (10 points)
D. What is the score at the 80th percentile point? (10 points)
5. In an IQ distribution (mean=100, standard deviation=15), how many people would we
expect to have an IQ above 130 living in Stevens Point? Note that this problem makes
two assumptions. First, Stevens Point has a population of 23,000 people. The second is
that folks in Stevens Point are representative of the population as a whole. (15 points)
1. A z score is
a. a deviation that is standardized for all groups.
b. the number of standard deviations that a given score deviates from the mean.
c. the mean minus the median divided by the standard deviation.
d. a standard deviation.
2. What percent of the distribution falls between the mean and a z score of -1.00?
a. 13.59
b. 34.13
c. 47.72
d. 50.00
Copyright 2015 Arvella Albay,Phd
Comments? arvellamedinaalbay@yahoo.com
Correlation
I. Important Concepts
1. Correlation
2. Correlation Coefficient
3. Scatterplot
I. Important Concepts
1. Correlation
Earlier in the semester we noted that scientists are interested in relationships between
variables. When two variables vary together (a change in one is accompanied by a change
in the other), we say they are correlated.
2. Correlation Coefficient
Expresses quantitatively the extent to which two variables are related. There are several.
We will learn about two.
3. Scatterplot
A graph of a collection of pairs of scores. Example:
Note that in scatterplots, the X and Y axes are equal in length and thus this type of graph
does not obey the 3/4 high rule.
As the number of hours studied increased so did the grade. This is also called a "direct"
relationship.
2. Perfect negative
As the number of beers drunk increased, the grade decreased. This is also called an
"inverse" or "indirect" relationship.
Summary
r = ± (0 to 1)
Sign: gives direction
Magnitude: gives strength
III. Pearson's r
Termed Pearson's Product Moment Correlation Coefficient.
Good with metric data.
Represents quantitatively the extent to which scores on two variables occupy the same
relative position.
$r = \frac{\sum Z_X Z_Y}{N}$
Thus, r is the mean of the products of the z scores for the two variables. What
follows is a demonstration of why this works in the case of perfect positive relationship
(variables X & Y) and in the case of a perfect negative relationship (variables X & W).
X     ZX      ZX^2    Y    ZY      ZXZY
3     -1.42   2.02    1    -1.42   2.02
5     -.71    .50     2    -.71    .50
7     0       0       3    0       0
9     .71     .50     4    .71     .50
11    1.42    2.02    5    1.42    2.02
SD of X = 2.82     $\sum Z_X^2 = 5 = N$     SD of Y = 1.41     $\sum Z_X Z_Y = \sum Z_X^2$
Mean of X = 7      Mean of Y = 3
N = 5
If the relative position of the scores on the two variables is the same (as in the present
case), then the z scores of each of the variables will be the same and $\sum Z_X Z_Y$ would be
equal to $\sum Z_X^2$. As we saw above, $\sum Z_X^2$ is equal to N and thus r would equal N/N or 1.
X     ZX      ZX^2    W    ZW      ZXZW
3     -1.42   2.02    5    1.42    -2.02
5     -.71    .50     4    .71     -.50
7     0       0       3    0       0
9     .71     .50     2    -.71    -.50
11    1.42    2.02    1    -1.42   -2.02
SD of X = 2.82     $\sum Z_X^2 = 5 = N$     $\sum Z_X Z_W = -5$
Mean of X = 7
N = 5
The scores again have the same relative position, but this time the relationship is indirect.
In this case, $\sum Z_X Z_W$ would be equal to -N and r would be equal to -N/N or -1.
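The two demonstrations above can be reproduced with a short Python sketch that computes r as the mean of the products of the z scores. The function names are made up for illustration, and `pstdev` (which divides by N) is used so that the z scores match the tables.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Convert raw scores to z scores using the mean and the SD computed with N."""
    m, sd = mean(values), pstdev(values)
    return [(v - m) / sd for v in values]


def pearson_r(xs, ys):
    """Pearson's r as the mean of the products of paired z scores."""
    products = [zx * zy for zx, zy in zip(z_scores(xs), z_scores(ys))]
    return sum(products) / len(products)


x = [3, 5, 7, 9, 11]
y = [1, 2, 3, 4, 5]   # same relative positions as X
w = [5, 4, 3, 2, 1]   # reversed relative positions

r_xy = pearson_r(x, y)
r_xw = pearson_r(x, w)
print(round(r_xy, 2), round(r_xw, 2))
```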
Since the scatterplot looks promising (suggests a strong positive relationship), create the
necessary grid for the computations.
a. One variable is an ordinal scale and the other is an ordinal scale or higher.
b. One of the distributions is markedly skewed.
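In either case, both scales must be converted to ranks. If we computed Pearson's r on
the ranked data, it would give Spearman's Rho. However, for computations by hand, there
is a simpler formula:
$r_s = 1 - \frac{6 \sum D^2}{N(N^2 - 1)}$
where D is the difference between the two ranks for each pair of scores.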
Since the scatterplot looks promising (suggests a strong positive relationship), create the
necessary grid for the computations.
4.
Since the science score is a ratio variable, it makes sense to rank it from low to high, that
is, where low ranks represent low scores. If we are going to correlate beauty with this
score, it makes sense to rerank the beauty scores so that they go from low to high as well.
Person   Beauty    Beauty (reranked)   Science   Science (ranked)
A        3         3                   11        2
B        1=most    5=most              10        1
C        2         4                   17        5=most
D        5         1                   13        3
E        4         2                   14        4
N=5
6.
7. Then we would create a scatterplot of the ranked scores.
The data do not look very promising, but let's prepare the grid for the computations
anyway.
Person   Beauty (reranked)   Science (ranked)   D     D^2
A        3                   2                  1     1
B        5=most              1                  4     16
C        4                   5=most             -1    1
D        1                   3                  -2    4
E        2                   4                  -2    4
N=5                                             $\sum D = 0$   $\sum D^2 = 26$
8. Then perform the computations:
$r_s = 1 - \frac{6(26)}{5(5^2 - 1)} = 1 - \frac{156}{120} = -.30$
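The Spearman computation for the beauty and science ranks can be sketched in Python. It assumes the usual hand-computation formula, 1 - 6ΣD²/(N(N² - 1)), which applies when there are no tied ranks; the function name is made up for illustration.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman's rho from two lists of ranks (assumes no tied ranks)."""
    n = len(ranks_a)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))


# Ranks for persons A through E from the grid above.
beauty_reranked = [3, 5, 4, 1, 2]
science_ranked = [2, 1, 5, 3, 4]

rho = spearman_rho(beauty_reranked, science_ranked)
print(round(rho, 2))
```

The weak negative value agrees with the scatterplot not looking very promising.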
Examples:
a. Curvilinearity
A linear (or monotonic) relationship is best characterized by a straight line.
Both r and rs assume this.
Example linear relationship:
c. Extreme Groups
Results in an overestimated r. Consider looking at the relationship of
reading ability and IQ, but only in poor and excellent readers:
d. An Extreme Score
Also results in an overestimated r. Is more of a problem when using small
sample sizes. Example:
2. Relation to Causality
b. X ← Y (Y causes X)
BCX
d. Etc.
BY
The main point is that correlation doesn't tell us much about causality. It should be
noted that inferring causality from a correlation is an error that is all too common.