
Module 2 Notes: Central Tendency & Dispersion

Central Tendency

Dispersion

Central tendency: the most typical value in a set of numbers.

Note: The measure most appropriate for a given set of numbers depends primarily on the level of
measurement of that set of numbers (and, for the mean, on whether the data contain extreme values).

1. Mode: the value that occurs most often.
The mode is the only measure of central tendency that can be used with nominal variables.
It is also appropriate for ordinal, interval, and ratio variables.
Note: some sets of numbers have no mode, and others have multiple modes.
Example: favorite breakfast food of 20 of your classmates

biscuits    cereal         ham            bacon         eggs
cereal      miso soup      fresh fruit    croissants    rice porridge
potatoes    bacon          eggs           potatoes      bacon
bacon       fresh fruit    churros        bacon         cheese

In this example, we see a wide range of types of breakfast foods. The most
commonly-given favorite breakfast food is bacon -- 5 out of 20, or 1 in 4 people said that
bacon is their favorite. Therefore, bacon is the mode in this set of data.
If you had to guess what the favorite food of a 21st classmate is, your best bet would be
to guess bacon.
If 5 people had said bacon, and 5 others had said eggs, then we would have concluded
that this set of data has two modes -- bacon and eggs, indicating that these are the two
most commonly-mentioned favorite breakfast foods in this group.
If, instead, each type of food was only mentioned one time, then there would be no
most commonly-occurring type of food, and we would say that this set of data has
no mode.
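The three cases above (one mode, several modes, no mode) can be sketched in a few lines of Python. This is only an illustration: the function name `modes` is hypothetical, and treating "every value occurs once" as "no mode" is the convention from these notes, not a universal rule.

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest count, or [] if there is no mode."""
    counts = Counter(values)
    if not counts:
        return []  # empty data: nothing to report
    top = max(counts.values())
    if top == 1 and len(counts) > 1:
        return []  # every value occurs exactly once: no mode (convention from the notes)
    return [value for value, n in counts.items() if n == top]

# The 20 favorite breakfast foods from the example above.
breakfast = [
    "biscuits", "cereal", "ham", "bacon", "eggs",
    "cereal", "miso soup", "fresh fruit", "croissants", "rice porridge",
    "potatoes", "bacon", "eggs", "potatoes", "bacon",
    "bacon", "fresh fruit", "churros", "bacon", "cheese",
]

print(modes(breakfast))                    # the single mode
print(Counter(breakfast)["bacon"])         # how many classmates chose it
print(modes(["ham", "eggs", "churros"]))   # each value once: no mode
```

Returning a list rather than a single value keeps the "two modes" case from being silently hidden.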

2. Median: the value at the exact center of a distribution, so that 50% of cases are below
and 50% are above (when the scores are arranged in rank order).
The median is appropriate for ordinal, interval, and ratio variables, but not nominal
variables (because nominal data cannot be arranged in rank order).
Example: how 15 of your classmates rate the last movie they saw

excellent    awful        pretty good    pretty good    not great    pretty good    excellent    not great
excellent    not great    awful          excellent      pretty good  excellent      not great

In order to figure out the median of this group of data, we need to rank order it, either
from most favorable to least favorable opinion or vice versa (the answer will be the
same either way).
excellent    excellent    excellent    excellent    excellent    pretty good    pretty good    pretty good
pretty good  not great    not great    not great    not great    awful          awful

In a set of data with 15 data points, the median is at the 8th position -- 7 below it, and 7 above it. In
this case, the median value is pretty good. Above the median spot, there are 5 excellents and 2 pretty goods,
and below it there are 1 pretty good, 4 not greats, and 2 awfuls.
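As a rough sketch of the same rank-and-count procedure in Python: the list `RANK` (an assumed ordering from most to least favorable) supplies the rank order, and the variable names are illustrative only.

```python
# Assumed rank order for the ordinal scale, most to least favorable.
RANK = ["excellent", "pretty good", "not great", "awful"]

# The 15 movie ratings from the example, in the order they were given.
ratings = [
    "excellent", "awful", "pretty good", "pretty good", "not great",
    "pretty good", "excellent", "not great",
    "excellent", "not great", "awful", "excellent", "pretty good",
    "excellent", "not great",
]

# Rank order the data by each rating's place on the scale.
ranked = sorted(ratings, key=RANK.index)

# With an odd number of values, the median sits at 1-based position (N + 1) / 2.
middle_position = (len(ranked) + 1) // 2
median_rating = ranked[middle_position - 1]  # convert to a 0-based index

print(middle_position, median_rating)
```

Sorting in the opposite direction (least to most favorable) picks out the same middle case.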

In this set of data, we simply rank ordered the values and identified the middle case. This is a
pretty simple task when there are only 15 data points. However, with a dataset with hundreds
or thousands of data points, finding the median can be quite tedious if rank ordering and/or
finding the middle category is done by hand.
A quick way to find the median in any dataset is to use a formula that identifies the
position of the median in the rank-ordered data (not the value of the data in that position).
For datasets with an odd number of values: the position is (N+1)/2, where N = the number of
values in the dataset. In our movie rating example above, we had 15 data points, so we can use
this formula to find out which position the median is in: (15+1)/2 = 16/2 = 8th position.
For datasets with an even number of values: the median is the midpoint between the values at
positions N/2 and (N/2)+1. For example, in a dataset with 20 values, the median is
the value between the values at the 10th (20/2) and 11th (10+1) positions. If the values
at the 10th and 11th positions are the same, that value is the median. If the two values
are different, the midpoint between those two values is the median.
Dataset A:

low low low medium medium high high high

Dataset B:

low low medium medium high high high high

In both Datasets A & B, there are 8 values. Because this is an even number of values, we find
the position of the median at the midpoint between (N/2) and (N/2) +1, or between the 4th
(8/2) and 5th (4+1) positions.

In Dataset A, the values at positions 4 and 5 are both medium, so the median is medium.
In Dataset B, the values of positions 4 and 5 are different, so we have to find a midpoint
between them. In this case, we might say medium high.
If the values in this example had been numeric rather than words (say 2 for medium and
3 for high), we would take the average of the two values, and conclude that the median
is 2.5.
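The position rules for odd and even N can be sketched as follows. The helper names `median_positions` and `ordinal_median` are hypothetical, and returning both middle values as a pair (rather than inventing a label such as "medium high") is a design choice made for this illustration.

```python
def median_positions(n):
    """Return the 1-based position(s) that locate the median among n ranked values."""
    if n < 1:
        raise ValueError("need at least one value")
    if n % 2 == 1:
        return ((n + 1) // 2,)       # odd N: a single middle position
    return (n // 2, n // 2 + 1)      # even N: the two positions around the midpoint

def ordinal_median(ranked_values):
    """Median of already rank-ordered values; a pair if the two middle values differ."""
    picks = [ranked_values[p - 1] for p in median_positions(len(ranked_values))]
    return picks[0] if len(set(picks)) == 1 else tuple(picks)

dataset_a = ["low", "low", "low", "medium", "medium", "high", "high", "high"]
dataset_b = ["low", "low", "medium", "medium", "high", "high", "high", "high"]

print(median_positions(15), median_positions(20))
print(ordinal_median(dataset_a))   # both middle values agree
print(ordinal_median(dataset_b))   # middle values differ, so both are reported

# With numeric codes (2 for medium, 3 for high) the midpoint is a simple average.
low, high = 2, 3
print((low + high) / 2)
```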

It's probably rather rare that you are given a set of data points and need to summarize them (as
we have done in the last two examples). Instead, you're more likely to come across a table or
chart, and will want to be able to quickly summarize the main points.
Example: A new policy is being proposed at your work or in your neighborhood, and
results from a survey taken to assess people's support of that policy were just released:

                                   All survey respondents    Women    Men
Strongly support the new policy    8%                        12%      3%
Somewhat support the new policy    33%                       42%      20%
Somewhat oppose the new policy     40%                       21%      69%
Strongly oppose the new policy     19%                       26%      9%

First, we can identify the mode for the entire survey. Because 40% of all survey respondents
somewhat oppose the policy, somewhat oppose is the modal category for all respondents.
However, we also see that the responses are separated by gender. The greatest percentage of
women somewhat support the policy, so somewhat support is the modal category for
women, while somewhat oppose is the modal category for men, as this category was most
often chosen by men.

Next, we can easily identify the median for all survey respondents, for women, and for men by
taking a quick look at each column. To find the median from a table like this one, we find the
category that encompasses the 50% mark when we calculate a cumulative percentage (when
looking at the data in rank order, such as from strongly support to strongly oppose, as is the
case with this table).

A cumulative percentage is the percentage of respondents who fall into a particular category
plus all categories that come before it (in rank order). We'll start by looking at the results for all
respondents. The cumulative percentage for strongly support is simply 8%, because no
category comes before it.

The cumulative percentage for somewhat support is 41% (8% + 33%). Because these two
categories don't include 50% of respondents, we have not yet found the median category.
Next, we calculate the cumulative percentage for somewhat oppose, which is 81% (41% +
40%).

At this point, we have reached and surpassed the 50% mark, which means that we have found
the median category -- somewhat oppose. Note that we would have gotten the same answer
if we had started from the bottom of the table and worked up. The cumulative percentage for
strongly oppose and somewhat oppose is 59% (19% + 40%), meaning that the halfway point of
this dataset, or median, is somewhat oppose, just as we found above. This tells us that the
halfway point of the responses falls among those who somewhat oppose the policy. In total, 59%
of respondents oppose the policy (strongly or somewhat) and 41% support it. That is, there are
more people who oppose the policy than support it.

Finally, we can use this same method to find the median category for women and for men.
For women, starting from the top of the chart, we see that we reach the 50% mark when we find the
cumulative percentage for strongly support and somewhat support (12% + 42% = 54%), or
starting from the bottom, we reach the 50% mark when we find the cumulative percentage for
strongly oppose, somewhat oppose, and somewhat support (26% + 21% + 42% = 89%).

Therefore, the median category for women is somewhat support.

Similarly, the median category for men can be found using this same method. Starting from the
top of the chart, we see that we reach the 50% mark when we find the cumulative percentage for
strongly support, somewhat support, and somewhat oppose (3% + 20% + 69% = 92%), or starting
from the bottom, we reach the 50% mark when we find the cumulative percentage for strongly oppose
and somewhat oppose (9% + 69% = 78%).

Therefore, the median category for men is somewhat oppose.
In summary:

          all respondents    women               men
mode      somewhat oppose    somewhat support    somewhat oppose
median    somewhat oppose    somewhat support    somewhat oppose
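The cumulative-percentage method for reading a mode and median off a table might be sketched like this. The names `CATEGORIES`, `survey`, `modal_category`, and `median_category` are hypothetical, and the sketch assumes no cumulative total lands exactly on 50% (in which case the median would fall between two categories).

```python
# Response categories in rank order, top of the table to bottom.
CATEGORIES = ["strongly support", "somewhat support",
              "somewhat oppose", "strongly oppose"]

# Percentages from the survey table, one list per column.
survey = {
    "all respondents": [8, 33, 40, 19],
    "women": [12, 42, 21, 26],
    "men": [3, 20, 69, 9],
}

def modal_category(pairs):
    """Category with the largest percentage."""
    return max(pairs, key=lambda pair: pair[1])[0]

def median_category(pairs):
    """First category whose cumulative percentage reaches the 50% mark."""
    running = 0
    for category, percent in pairs:
        running += percent
        if running >= 50:
            return category
    raise ValueError("percentages never reach 50")

for group, percents in survey.items():
    pairs = list(zip(CATEGORIES, percents))
    top_down = median_category(pairs)
    bottom_up = median_category(list(reversed(pairs)))  # same answer from the bottom
    print(group, "| mode:", modal_category(pairs), "| median:", top_down, bottom_up)
```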

3. Mean: mathematical average (sum of all values divided by the number of values).
The mean is only appropriate with interval and ratio level data, and is the only measure
of central tendency that incorporates all of the scores in a dataset.

\bar{X} = \frac{\sum X_i}{N}

In words: \sum X_i = the sum of all of the scores (of all of the Xs), N = the number of scores
Example: average number of children of 20 of your classmates

0 1 4 3 5

2 1 0 2 1

3 1 2 2 3

0 2 5 2 1

Adding all 20 values from this table gives us a total of 40. To calculate the mean of this
data set, we divide the total (40) by the number of values in the dataset (20), which
gives us 2. This indicates that the average, or most typical, number of children in this
set of data is 2.
An important cautionary note about using the mean is that this measure of central
tendency is sensitive to extreme (unusually high or low) values, whereas the mode and
median are not, as they do not take into account the data on the endpoints of a ranked
dataset.

For example: the monthly income of 25 passengers on each of two recent flights into Paris:

Flight 1:

1000 5000 4000 5000 9000

6000 2000 3000 8000 5000

4000 6000 4000 3000 7000

3000 4000 5000 2000 8000

9000 2000 10000 6000 4000

Flight 2:

1000 5000 4000 5000 9000

6000 2000 3000 8000 5000

4000 6000 4000 3000 7000

3000 4000 5000 2000 8000

9000 2000 10000 6000 229000


As you see, the passengers on the two flights have identical monthly incomes except for the
25th passenger, whose data is recorded in the bottom right-hand corner of each table.
In Flight 1, the total of all of the reported incomes is 125,000, and in Flight 2, the total is
350,000. To calculate the mean for each flight, we divide each total by the number of
passengers on each flight (25): 125,000/25 = 5000 and 350,000/25 = 14,000.

For Flight 1, 5,000 is a typical value that represents the passengers fairly well. In fact,
take a look and you'll find that the median of the Flight 1 data is 5,000 as well.
Therefore, the mean serves as an effective measure of central tendency, since central
tendency is intended to be a measure of the most typical value in a dataset.
However, in the Flight 2 data, the mean of 14,000 is not typical of anyone's data. All of
the passengers other than the 25th have monthly incomes below 14,000, and the
25th passenger has a monthly income well above 14,000. The fact that the mean takes
into account all of the data points, including this one very high value, makes it sensitive
to extreme values.
The median and mode, meanwhile, do not take into account the data on the endpoints
of a ranked dataset, and therefore are immune to such extremes. Therefore, when
calculating the central tendency of a dataset with extreme values, it's better to use the
median than the mean.
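The effect of the one extreme income can be checked with Python's standard `statistics` module. This is a small illustration using the two flight tables above; the variable names are made up for the sketch.

```python
from statistics import mean, median

# Monthly incomes of the 25 passengers on Flight 1, row by row.
flight_1 = [
    1000, 5000, 4000, 5000, 9000,
    6000, 2000, 3000, 8000, 5000,
    4000, 6000, 4000, 3000, 7000,
    3000, 4000, 5000, 2000, 8000,
    9000, 2000, 10000, 6000, 4000,
]

# Flight 2 is identical except for the 25th passenger.
flight_2 = flight_1[:-1] + [229000]

for name, incomes in (("Flight 1", flight_1), ("Flight 2", flight_2)):
    print(name, "total:", sum(incomes),
          "mean:", mean(incomes), "median:", median(incomes))
```

The mean moves from 5,000 to 14,000 while the median stays at 5,000 on both flights, which is the sensitivity the passage describes.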

Dispersion: how much the scores in a distribution vary from the typical score. Synonyms for
dispersion are variation, diversity, and heterogeneity.

Once we know the typical score of a set of values, we might want to know how typical it is of the
entire group of data, or how much the scores vary from that typical score.

1. The range of a dataset gives the difference between the largest and smallest value. Therefore,
the range only takes the two most extreme values into account.

Example 1: exam scores


A B+ A B- B A- B+ A A- B-

B+ C B C+ A- B- D B- C- C+

In this example, the highest exam score was an A, and the lowest score was a D. Therefore, the range
was from A to D. However, if we split the data into the top and bottom row, with the top row
representing the students who studied for a midterm exam for two or more days, and the bottom row
representing those who waited until the day before the exam to study, we can use the range of each
row to give a meaningful comparison.
Top row range: A to B-
Bottom row range: A- to D
The discrepancy between these two ranges indicates that studying for longer than one day for this
exam yields different results than waiting until the day before to study.

Example 2: prices paid for hotels in a particular city on weekends in March 2015

203 120 173 90 228 101 186 115

147 87 204 147 172 87 175 564

179 122 136 241 161 105 128 243

679 471 342 467 233 357 353 318

In this example, the highest price paid was $679, and the lowest price was $87. Therefore, the range
was $87 - $679, a difference of $592. If we divide the data into standard rooms (the first three rows
of data) and suites (the fourth row of data), we can again find one range for each group:
standard rooms: $87 - $564 (a difference of $477)
suites: $233 - $679 (a difference of $446)
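A short sketch of the range calculation for the hotel prices, using the row split described above. The helper `value_range` is a hypothetical name; it reports the smallest value, the largest value, and the difference between them.

```python
# First three rows of the price table: standard rooms.
standard_rooms = [
    203, 120, 173, 90, 228, 101, 186, 115,
    147, 87, 204, 147, 172, 87, 175, 564,
    179, 122, 136, 241, 161, 105, 128, 243,
]
# Fourth row: suites.
suites = [679, 471, 342, 467, 233, 357, 353, 318]

def value_range(values):
    """Return (smallest, largest, largest - smallest)."""
    lowest, highest = min(values), max(values)
    return lowest, highest, highest - lowest

print("all rooms:", value_range(standard_rooms + suites))
print("standard rooms:", value_range(standard_rooms))
print("suites:", value_range(suites))
```

Because only the two endpoints enter the calculation, the single $564 standard room sets the top of that group's range on its own.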

2. Variation Ratio
When data are measured at the nominal level, we can only compare the size of each category
represented in the dataset. One way to do this is to calculate the Variation Ratio (VR), which tells us
how much variation there is in the dataset from the modal category. Therefore, if our variable of
interest is religious identification, and we find out that the modal category of a particular group is
Catholic, the VR will tell us how much variation there is in the group from that mode, or, what
proportion of the entire group does not fit into the modal category (is not Catholic).
Example 3: Favorite vacation activity -- if we ask 200 people what their favorite thing to do on vacation
is, we might get these answers:
Go to the beach 100
Ski or snowboard 55
Visit museums or landmarks 30
Try new food 15

In order to calculate the VR, we first need to identify the modal category, or the mode of this group of
data. In this case, the highest number of people fell into the beach category, so this is the mode.
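The next thing to do is plug the appropriate numbers into the formula VR = 1 - (f_modal / N), where
f_modal is the number of cases in the modal category and N is the total number of cases:

VR = 1 - (100/200) = 1 - .5 = .5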

In this case, the VR value of .5 tells us that half of the cases in this dataset fall outside the modal
category. If the VR had been .25, then 25% of the cases would have been different than go to the
beach as in this example:
Go to the beach 150
Ski or snowboard 25
Visit museums or landmarks 20
Try new food 5

VR = 1 - (150/200) = 1 - .75 = .25

In the second case, the variation in the group of 200 is lower, meaning that people in the second group
are more likely to prefer the beach to the other vacation options than in the first group. As variation
within a dataset decreases, the VR will also decrease, or the lower the variation (or dispersion) in a
dataset, the lower the VR will be. (The same will be true for all measures of dispersion).
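The two Variation Ratio calculations can be reproduced with a small function. The name `variation_ratio` and the dictionary layout are assumptions made for this sketch.

```python
def variation_ratio(counts):
    """Proportion of cases that fall outside the modal category."""
    total = sum(counts.values())
    modal_count = max(counts.values())  # size of the largest category
    return 1 - modal_count / total

first_group = {
    "Go to the beach": 100,
    "Ski or snowboard": 55,
    "Visit museums or landmarks": 30,
    "Try new food": 15,
}
second_group = {
    "Go to the beach": 150,
    "Ski or snowboard": 25,
    "Visit museums or landmarks": 20,
    "Try new food": 5,
}

print(variation_ratio(first_group))
print(variation_ratio(second_group))
```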

3. The mean deviation gives us a measure of the typical difference (or deviation) from the mean. If
most scores are very similar to the mean, then the mean deviation score will be low, indicating high
similarity within the data. If there is great variation among scores, then the mean deviation score will
be high, indicating low similarity within the data.

First, let's take a look at the formula:

MD = \frac{\sum |X_i - \bar{X}|}{N}

This should look very similar to the formula of the mean. To calculate the mean, we sum the scores in
a dataset, and divide by the number of scores. The mean deviation does the same thing, but instead of
summing the scores, we sum the deviations. The deviation for each data point is simply its difference
from the mean, in absolute value (without negative signs). Let's look at an example.
Example: You are the director of Human Resources for a company in a large city, and you are
considering implementing a transportation subsidy to encourage employees to use
public transportation rather than driving their own cars to work. First, you need to conduct a feasibility
study to find out if this would be a worthwhile endeavor. You need to find out how far most of your
employees' commutes are (measured in miles), and how much variation there is among the data.
Therefore, you conduct a survey of 20 employees at random and get these results:

37 24 13 4 13

7 22 35 30 18

15 25 3 30 26

42 27 25 35 3

First, to find the commute distance that is most typical for this group of people, you need to calculate a
measure of central tendency. Because number of miles to work is a ratio level variable, you can
calculate the mean, if there are no extreme values in the dataset. Looking at the data above, we don't
see any values that are vastly different from the others (an example would be 250 miles). Therefore,
the mean is an appropriate measure of central tendency. To calculate the mean, we simply divide the
sum of scores (434) by the number of scores (20), which gives us 21.7. This tells us that the most
typical commute distance for these 20 employees is 21.7 miles.

Next, we want to find out how much variation there is around this mean, or how much individuals in
this group of 20 have commute distances that vary from 21.7. If you look again at the data, you'll see
that no one has a commute distance of exactly 21.7 miles, so each person varies from the mean at
least a little. The individual with the commute distance closest to 21.7 has a commute of 22 miles, so her
commute distance only varies from the mean by 0.3 miles. The individual whose commute
distance is the most different from the mean is the person who commutes 42 miles to work, giving him
a deviation score of 20.3 miles (42 - 21.7), meaning that he has a commute that is 20.3 miles longer
than the average commute for his co-workers in this dataset.
To get a measure of the average deviation among these 20 employees, we can calculate the mean
deviation. There are five steps to calculating it:
1. find the mean of the dataset
2. calculate the deviation for each score
3. take the absolute value of each deviation
4. sum the absolute values of the deviations
5. divide that sum by N

This is a cumbersome, but not difficult formula to calculate, made easier with the chart provided here.
We already calculated the mean above by dividing the sum of scores by the number of scores, so we
can start filling our chart by entering our scores and mean:

Scores    Mean    Deviation    |Deviation|
37        21.7
7         21.7
15        21.7
42        21.7
13        21.7
13        21.7
35        21.7
3         21.7
25        21.7
26        21.7
24        21.7
22        21.7
25        21.7
27        21.7
18        21.7
4         21.7
30        21.7
30        21.7
35        21.7
3         21.7

Next, we can move on to Step 2, which tells us to calculate the deviation of each score
from the mean. To do this, we subtract the mean from each score, and fill in our chart
accordingly:

Scores    Mean    Deviation    |Deviation|
37        21.7    15.3
7         21.7    -14.7
15        21.7    -6.7
42        21.7    20.3
13        21.7    -8.7
13        21.7    -8.7
35        21.7    13.3
3         21.7    -18.7
25        21.7    3.3
26        21.7    4.3
24        21.7    2.3
22        21.7    0.3
25        21.7    3.3
27        21.7    5.3
18        21.7    -3.7
4         21.7    -17.7
30        21.7    8.3
30        21.7    8.3
35        21.7    13.3
3         21.7    -18.7

In Step 3, we take the absolute value of each deviation score, which simply means removing any
negative signs from the deviation scores:
Scores    Mean    Deviation    |Deviation|
37        21.7    15.3         15.3
7         21.7    -14.7        14.7
15        21.7    -6.7         6.7
42        21.7    20.3         20.3
13        21.7    -8.7         8.7
13        21.7    -8.7         8.7
35        21.7    13.3         13.3
3         21.7    -18.7        18.7
25        21.7    3.3          3.3
26        21.7    4.3          4.3
24        21.7    2.3          2.3
22        21.7    0.3          0.3
25        21.7    3.3          3.3
27        21.7    5.3          5.3
18        21.7    -3.7         3.7
4         21.7    -17.7        17.7
30        21.7    8.3          8.3
30        21.7    8.3          8.3
35        21.7    13.3         13.3
3         21.7    -18.7        18.7

In Step 4, we add these values, the absolute values of the deviations:

Scores    Mean    Deviation    |Deviation|
37        21.7    15.3         15.3
7         21.7    -14.7        14.7
15        21.7    -6.7         6.7
42        21.7    20.3         20.3
13        21.7    -8.7         8.7
13        21.7    -8.7         8.7
35        21.7    13.3         13.3
3         21.7    -18.7        18.7
25        21.7    3.3          3.3
26        21.7    4.3          4.3
24        21.7    2.3          2.3
22        21.7    0.3          0.3
25        21.7    3.3          3.3
27        21.7    5.3          5.3
18        21.7    -3.7         3.7
4         21.7    -17.7        17.7
30        21.7    8.3          8.3
30        21.7    8.3          8.3
35        21.7    13.3         13.3
3         21.7    -18.7        18.7

Sum of |Deviation| = 195.2

Finally, in Step 5, we divide this sum (195.2) by the number of scores (20), which gives us 9.76, or
about 9.8, for our value of the mean deviation.

Putting all of this together, we've discovered that the mean commute distance for these 20 employees
is 21.7 miles, and the average distance from the mean is 9.8 miles, meaning that the typical commute is
about 22 miles, and the typical employee's commute differs from this mean by about 10 miles (either
10 miles longer or 10 miles shorter).
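The five steps of the mean deviation can be traced in code as a check on the hand calculation. This is a minimal sketch; `mean_deviation` is a hypothetical helper name, and small floating-point rounding differences from the hand-worked figures are expected.

```python
# Commute distances (miles) for the 20 surveyed employees.
commutes = [
    37, 24, 13, 4, 13,
    7, 22, 35, 30, 18,
    15, 25, 3, 30, 26,
    42, 27, 25, 35, 3,
]

def mean_deviation(scores):
    """Average absolute distance of the scores from their mean."""
    n = len(scores)
    mean = sum(scores) / n                        # Step 1: find the mean
    deviations = [x - mean for x in scores]       # Step 2: deviation of each score
    absolute = [abs(d) for d in deviations]       # Step 3: drop the negative signs
    total = sum(absolute)                         # Step 4: sum the absolute deviations
    return total / n                              # Step 5: divide by N

print(round(sum(commutes) / len(commutes), 1))    # the mean commute
print(round(mean_deviation(commutes), 2))         # the mean deviation
```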

4. The standard deviation of a dataset gives a measure of how much the values in a dataset vary from the
mean (much like the Variation Ratio tells us how much variation there is from the modal category in a
dataset, and very similar to the mean deviation, discussed just above). More specifically, calculating the
standard deviation involves finding the difference between each value and the mean (that value's deviation
from the mean), and then averaging those deviations after squaring them (thus making the standard
deviation an average, or standard, measure of deviation). This is very similar to the mean deviation, and
indeed gives us quite similar information, substantively. Because of some of the mathematical
characteristics of the standard deviation calculation, it will be the measure of dispersion that we use
most in the rest of the class.

First, let's take a look at the formula:

s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{N}}

In words: the square root of the average squared deviation. (Remember, deviation is the difference
between each score and the mean.)

Six steps for calculating the standard deviation


1. find the mean of the dataset
2. calculate the deviation for each score
3. square each deviation
4. sum the squared deviations
5. divide that sum by N
6. take the square root of the result

As you'll see, these steps are almost identical to the steps for calculating the mean
deviation. The primary difference is that we square the deviations rather than taking the
absolute value of them (both of them remove the negative signs, but in different ways), and then
because we've squared our deviations in the middle of the calculation, we have to take the square root at
the end of the calculation (remember that taking the square root of a number is the opposite of
squaring it).

Again, we'll use a chart similar to the one we used before for the calculation of mean deviation. Here
is that chart, with a small sample dataset already filled in:

Scores    Mean    Deviation    Deviation squared
2
6
10
4
8

Now we'll work through the six steps one by one:


1. Find the mean of the dataset. The sum of the scores is 30, and there are 5 scores, so the mean
is 6. *Fill this mean into each of the open cells in the second column of the table, as shown
below:

Scores    Mean    Deviation    Deviation squared
2         6
6         6
10        6
4         6
8         6

2. Calculate the deviation for each score. The deviation is the mean subtracted
from the score. In the first row, the score is 2, so the deviation is -4.
*Fill in the deviation for each score in the third column of the table, as shown below:

Scores    Mean    Deviation    Deviation squared
2         6       -4
6         6       0
10        6       4
4         6       -2
8         6       2

3. Square each deviation. *The square of -4 is 16, so we'll insert 16 into the top
empty cell in column 4, and then complete the rest of the cells for the next four rows, as shown below:

Scores    Mean    Deviation    Deviation squared
2         6       -4           16
6         6       0            0
10        6       4            16
4         6       -2           4
8         6       2            4
4. Sum the squared deviations. We just calculated the squared deviation for each score in step 3, so
for this step, we simply add those values together. In this case, 16 + 0 + 16 + 4 + 4 = 40, which has been
added to the table below:

Scores    Mean    Deviation    Deviation squared
2         6       -4           16
6         6       0            0
10        6       4            16
4         6       -2           4
8         6       2            4

Sum of squared deviations = 40

5. Divide that sum by N. In this example, the sum is 40, and the N is 5, so our result for this step is 8.

*Note the similarity between this step and the calculation of the mean. Both are the sum of scores
divided by the number of scores. The mean calculation gives us a representation of a typical score.
The standard deviation calculation gives us a representation of a typical deviation (which is a difference
from a mean).

6. Take the square root of that result. The square root of 8 is about 2.8. This is the standard deviation
of our dataset. This number should make some intuitive sense to you if you look at the original list of
scores in relation to the mean (which is 6). All of the scores were within 4 of the mean. Because 0 was
the smallest deviation from the mean, and 4 was the largest (ignoring signs), we were guaranteed to get a
standard deviation somewhere between 0 and 4. Just like the mean, the standard deviation is sensitive to
extreme values: if we have a deviation that is much larger than the other deviations (resulting from a
score that is much higher or lower than the other scores), then our standard deviation will be pulled
upward (just as the mean is pulled toward an extreme score).
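The six steps can be followed in code for the five sample scores. The function `population_sd` is a hypothetical name for this sketch; it divides by N, as these notes do, which matches `statistics.pstdev` in Python's standard library (by contrast, `statistics.stdev` divides by N - 1 and would give a larger value).

```python
import math
from statistics import pstdev

def population_sd(scores):
    """Standard deviation using N in the denominator, as in the six steps."""
    n = len(scores)
    mean = sum(scores) / n                        # Step 1: find the mean
    deviations = [x - mean for x in scores]       # Step 2: deviation of each score
    squared = [d ** 2 for d in deviations]        # Step 3: square each deviation
    total = sum(squared)                          # Step 4: sum the squared deviations
    variance = total / n                          # Step 5: divide by N
    return math.sqrt(variance)                    # Step 6: take the square root

scores = [2, 6, 10, 4, 8]

print(round(population_sd(scores), 1))   # hand-worked result, rounded to one decimal
print(round(pstdev(scores), 1))          # library cross-check with the same denominator
```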
