
Chapter 2

Describing Distributions
with Numbers

BPS - 5th Ed.

Chapter 2

Chapter 2 Overview
Three characteristics of a quantitative variable's distribution:
Shape: described visually via graphs (Chapter 1).
Center (typical value): numeric summary (Chapter 2).
Spread (dispersion): numeric summary (Chapter 2).

The appropriate measures of center and spread depend on the distribution's shape.

Review: Populations vs. Samples
Analyzing populations versus analyzing samples.
For populations:
We know all of the data.
Descriptive measures of populations are called parameters.
Parameters are often written using Greek letters ($\mu$, $\sigma$).

For samples:
We know only part of the entire data.
Descriptive measures of samples are called statistics.
Statistics are often written using Roman letters ($\bar{x}$, $s$).

Descriptive Measures: Numerical Summaries
Center of the data: Numeric values that represent the average or typical value of a quantitative variable.
Examples of measures of central tendency:
mean
median

Descriptive Measures: Numerical Summaries (cont.)
Variation or measures of dispersion: Numeric values that represent the degree to which the values are spread out.
range
quartiles (interquartile range)
variance
standard deviation

Central Tendency: Mean
The mean (arithmetic mean) of a variable is often what people mean by the "average": add up all the values and divide by the number of measurements in the data set.
To compute the arithmetic mean of:
6, 1, 5
Add up the three numbers and divide by 3.
(6 + 1 + 5) / 3 = 4.0
The arithmetic mean is 4.0, reported to one more decimal place than the data (the General Rounding Rule for Reporting Statistics).
One interpretation: The arithmetic mean can be thought of as the center of gravity, the point where the yardstick balances.
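The add-up-and-divide recipe for the mean can be sketched in a few lines of Python. This is only an illustration; the function name `arithmetic_mean` is an assumption and is not part of the course materials.

```python
def arithmetic_mean(values):
    """Add up all the values and divide by the number of measurements."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


data = [6, 1, 5]
mean_value = arithmetic_mean(data)

# Report one more decimal place than the data (General Rounding Rule).
print(f"mean = {mean_value:.1f}")
```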

Central Tendency: Mean
The mean, more accurately known as the arithmetic mean, is an arithmetic average of the elements of the data set.

The mean of a sample of n measurements is denoted by $\bar{x}$ and equals:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \text{(statistic; } n = \text{sample size)}$$
If the data are from a population, the mean is denoted by $\mu$ (mu) and equals:
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad \text{(parameter; } N = \text{population size)}$$

Review: Summation Notation
Used to simplify summation instructions:
Each observation in a data set is identified by a subscript: x1, x2, x3, x4, x5, ..., xn
Notation used to sum the above numbers together is:
$$\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \cdots + x_n$$
Is $\sum x_i^2$ the same as $\left(\sum x_i\right)^2$?

Data set: 1, 2, 3, 4 (that is, x1 = 1, x2 = 2, x3 = 3, x4 = 4)
$$\sum_{i=1}^{4} x_i = 1 + 2 + 3 + 4 = 10$$
$$\sum_{i=1}^{4} x_i^2 = 1^2 + 2^2 + 3^2 + 4^2 = 1 + 4 + 9 + 16 = 30$$
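Summation notation maps directly onto a loop over the data. The sketch below, using the data set 1, 2, 3, 4, computes the sum of the values, the sum of the squared values, and the square of the sum, so the three quantities can be compared. The variable names are illustrative assumptions.

```python
x = [1, 2, 3, 4]

sum_x = sum(x)                            # x1 + x2 + x3 + x4
sum_x_squared = sum(xi ** 2 for xi in x)  # square each value, then add
square_of_sum = sum(x) ** 2               # add the values, then square the total

print(sum_x, sum_x_squared, square_of_sum)
```

Squaring before summing and summing before squaring give different results, so the order of operations in the notation matters.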

Example A: Finding the mean
Sample data set: 1, 4, 5, 10
x1 = 1, x2 = 4, x3 = 5, x4 = 10
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1 + 4 + 5 + 10}{4} = \frac{20}{4} = 5.0$$

Example A: Finding the mean using MINITAB
Sample data set: 1, 4, 5, 10
Put data in column 1 in MINITAB
Use the following command:
MTB > mean c1

The output in the Session window:
Mean of Example A
Mean of Example A = 5

Central Tendency: Median (M)
A resistant measure of the data's center.
With ordered data (that is, data arranged in ascending order):
At least half of the ordered values are less than or equal to the median value.
At least half of the ordered values are greater than or equal to the median value.
If n is odd, the median is the middle ordered data value.
If n is even, the median is the average of the two middle ordered data values.

Median (M): Finding the position of the median in your ordered data set
Location of the median: L(M) = (n+1)/2, where n = sample size.
Example: Take this ordered data set:
x1, x2, x3, ..., x25
x1 < x2 < ... < x25

The position of the median would be the (25+1)/2 = 13th ordered value. That is, the median = x13.

Median
Example 1 data: 2 4 6
[ x1 = 2, x2 = 4, x3 = 6 ]
L(M) = (3+1)/2 = 2
Median: M = x2 = 4.0

Example 2 data: 2 4 6 8
[ x1 = 2, x2 = 4, x3 = 6, x4 = 8 ]
L(M) = (4+1)/2 = 2.5
so we must average data values x2 and x3 to get the median
M = (4 + 6)/2 = 5.0

Example 3 data: 6 2 4
Why is the median not equal to 2.0?

Median
Example 3 data: 6 2 4
The median is not equal to 2.0 because we did not order our data. After ordering our data, it is just like Example 1.

Example 4 data: 6, 1, 11, 2, 11
Arrange in order: 1, 2, 6, 11, 11
L(M) = (5+1)/2 = 3, so x3 is the median. M = 6.0

Example 5 data: 68, 71, 74, 77, 82, 84, 88, 90
Already in order
L(M) = (8+1)/2 = 4.5, so the median is the average of x4 and x5
Median: M = (77+82)/2 = 79.5
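The location rule L(M) = (n+1)/2 can be turned into a short Python sketch. The function name `median_by_location` is an assumption made for illustration; the data sets are Examples 1 through 5.

```python
def median_by_location(values):
    """Median via the location rule L(M) = (n + 1) / 2 on ordered data."""
    ordered = sorted(values)          # the rule only works on ordered data
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    location = (n + 1) / 2            # 1-based position in the ordered data
    if location.is_integer():         # n odd: a single middle value
        return float(ordered[int(location) - 1])
    lower = ordered[int(location) - 1]  # value just below the location
    upper = ordered[int(location)]      # value just above the location
    return (lower + upper) / 2          # n even: average the two middle values


examples = {
    "Example 1": [2, 4, 6],
    "Example 2": [2, 4, 6, 8],
    "Example 3": [6, 2, 4],
    "Example 4": [6, 1, 11, 2, 11],
    "Example 5": [68, 71, 74, 77, 82, 84, 88, 90],
}
for name, data in examples.items():
    print(name, median_by_location(data))
```

Sorting inside the function is what keeps Example 3 from giving the wrong answer.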

Using MINITAB to find the Median in Example 1
Put data in column 1 (it does not have to be in ascending order; MINITAB will order it when it calculates the median).
Use the following command:
MTB > median c1

Output in the Session window:
Median of Example 1
Median of Example 1 = 4

Comparing the Mean & Median


The mean and median of data from a
symmetric distribution should be close together.
The actual (true) mean and median of a
symmetric distribution are exactly the same.
In a skewed distribution, the mean is farther out
in the long tail than is the median [the mean is
pulled in the direction of the possible
outlier(s)].
The next slide will illustrate how knowing the
mean and median can give clues about the
shape of a distribution.


Comparing the Mean & Median
The mean and the median are often different, and this difference gives us clues about the shape of the distribution (symmetric? skewed left? skewed right?).
Symmetric: the mean will usually be close to the median.
Skewed left: the mean will usually be smaller than the median.
Skewed right: the mean will usually be larger than the median.
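As a rough illustration of how the mean and median hint at shape, the sketch below compares the two for three small data sets. The data sets and the function name `compare_center` are made up for this example and do not come from the text; the comparison is only a clue, since the slides say "usually", and a graph should still be checked.

```python
from statistics import mean, median


def compare_center(values):
    """Return a hedged hint about shape from the mean and the median."""
    m, med = mean(values), median(values)
    if m > med:
        return "mean > median: consistent with a right skew"
    if m < med:
        return "mean < median: consistent with a left skew"
    return "mean = median: consistent with symmetry"


# Hypothetical data sets, chosen only to illustrate each case.
long_right_tail = [1, 2, 2, 3, 3, 4, 20]
long_left_tail = [1, 17, 18, 18, 19, 19, 20]
symmetric = [1, 2, 3, 4, 5]

for data in (long_right_tail, long_left_tail, symmetric):
    print(data, "->", compare_center(data))
```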

Resistant Statistic
What if one value is extremely different from the
others?
Example: What if we made a mistake and
6, 1, 2
was recorded as
6000, 1, 2 ?
The mean is now ( 6000 + 1 + 2 ) / 3 = 2001.0
The median is still 2.0
Conclusion: The median is resistant to extreme values
while the mean is not resistant.
When data have extreme values (i.e., unusually large or small values) or are skewed, the median is preferred to the mean because it is more representative of the "typical" value of the variable.
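The recording mistake can be replayed with Python's standard `statistics` module to see how each measure reacts. This is a small sketch; the variable names are illustrative.

```python
from statistics import mean, median

correct = [6, 1, 2]
mistyped = [6000, 1, 2]   # 6 recorded as 6000 by mistake

for label, data in (("correct", correct), ("mistyped", mistyped)):
    print(f"{label}: mean = {mean(data):.1f}, median = {median(data):.1f}")
```

The mean moves from 3.0 to 2001.0 while the median stays at 2.0, which is what "resistant" means here.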

Summary: Central Tendency


Mean

Center of gravity.
Useful for roughly symmetric quantitative data
with no extreme values.
Median

Splits the data into halves.


Useful for highly skewed quantitative data (or
data with extreme values).


Question
A recent newspaper article in California said
that the median price of single-family homes
sold in the past year in the local area was
$136,000 and the mean price was $149,160.
Which do you think is more useful to
someone considering the purchase of a
home, the median or the mean?


Spread, or Variability
If all values are the same, then they
all equal the mean. There is no
variability.
Variability exists when some values
are different from (above or below)
the mean.


Measuring Spread or Variability


Measures of spread, or measures of
dispersion:
Numerical values that represent the
degree to which the values are spread out.
The range
The quartiles
First quartile or 25th percentile
Second quartile or 50th percentile or Median
Third quartile or 75th percentile

The variance
The standard deviation


Range
The range of a variable is the
largest data value (called the maximum)
minus the smallest data value (called the
minimum).
Example 6: Compute the range of 6, 1, 2, 6,
11, 7, 3, 3
The largest value is 11.
The smallest value is 1.
Subtracting the two, 11 − 1 = 10, so the range is 10.0, reported to one more decimal place than the data (General Rounding Rule for Reporting Statistics).

Example 6: Finding the range using MINITAB
Sample data set: 6, 1, 2, 6, 11, 7, 3, 3
Put data in column 1 in MINITAB
Use the following command:
MTB > range c1

The output in the Session window:
Range of Example 6
Range of Example 6 = 10

Range
Note: The range uses only two values in the data set, the largest value and the smallest value. So, the range is not resistant.
Example 2: If we made a mistake and 6, 1, 2 was recorded as 6000, 1, 2,
the range is now (6000 − 1) = 5999.0 instead of (6 − 1) = 5.0.

Quartiles
Three numbers which divide the ordered data into four equal-sized groups.
Q1 has 25% of the data below it.
Q2 has 50% of the data below it. (Median)
Q3 has 75% of the data below it.

Quartiles
[Figure: a uniform distribution with Q1, Q2, and Q3 marked.]

Obtaining the Quartiles


Order the data.
For Q2, just find the median.
For Q1, look at the lower half of the data
values, those to the left of the median
location; find the median of this lower half.
For Q3, look at the upper half of the data
values, those to the right of the median
location; find the median of this upper
half.


Example 7
Even Data Set: 5 25 7 23 10 11 15 21 18 20
1. Put the data set in order and find the location of the median:
5 7 10 11 15 18 20 21 23 25
L(M) = (10+1)/2 = 5.5, so the median or 2nd quartile is the average of the 5th and 6th data values.
2. M = (15 + 18)/2 = 16.5
3. There are 5 data points below and 5 above the median.
4. Find the median of the first 5 data points; that is the 1st quartile.

Example 7 (cont.)
Data Set: 5 7 10 11 15 18 20 21 23 25
To find the 1st quartile, we must find the median of the first half of the data (data to the left of the median).
The L(M) of the first 5 data points is
L(M) = (5+1)/2 = 3
so the 1st quartile is x3 = 10.0

Example 7 (cont.)
Data Set: 5 7 10 11 15 18 20 21 23 25
To find the 3rd quartile, we must find the median of the second half of the data (data to the right of the median).
The L(M) of the last 5 data points (18 20 21 23 25) is
L(M) = (5+1)/2 = 3
so the 3rd quartile is the 3rd data point in that set: Q3 = 21.0

Example 8
Odd Data Set: 5 25 7 23 10 11 15 21 18 20 27
1. Put the data set in order and find the location of the median:
5 7 10 11 15 18 20 21 23 25 27
L(M) = (11+1)/2 = 6, so the median or 2nd quartile is the 6th data value, x6 = 18
2. M = 18.0
3. There are 5 data points below and 5 above the median.
4. Find the median of the first 5 data points; that is the 1st quartile.

Example 8 (cont.)
Data Set: 5 7 10 11 15 18 20 21 23 25 27
To find the 1st quartile, we must find the median of the first half of the data (data to the left of the median).
The L(M) of the first 5 data points is
L(M) = (5+1)/2 = 3
so the 1st quartile is x3 = 10.0

Example 8 (cont.)
Data Set: 5 7 10 11 15 18 20 21 23 25 27
To find the 3rd quartile, we must find the median of the second half of the data (data to the right of the median).
The L(M) of the last 5 data points (20 21 23 25 27) is
L(M) = (5+1)/2 = 3
so the 3rd quartile is the 3rd data point in that set: Q3 = 23.0
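The median-of-halves procedure used in Examples 7 and 8 can be sketched in Python as follows. The function names are assumptions for illustration. As in the examples, the median itself is left out of both halves when n is odd.

```python
def median_of_sorted(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return float(ordered[mid])
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]       # values to the left of the median location
    upper = ordered[n - half:]   # values to the right of the median location
    return (median_of_sorted(lower),
            median_of_sorted(ordered),
            median_of_sorted(upper))


example_7 = [5, 25, 7, 23, 10, 11, 15, 21, 18, 20]
example_8 = [5, 25, 7, 23, 10, 11, 15, 21, 18, 20, 27]
print(quartiles(example_7))
print(quartiles(example_8))
```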

Example 9: Weight Data: Sorted

L(M)=(53+1)/2=27

L(Q1)=(26+1)/2=13.5


Weight Data: Quartiles


Q1= 127.5
Q2= 165.0 (Median)
Q3= 185.0


Weight Data: Quartiles
Stemplot (stem = hundreds and tens, leaf = ones)
10 | 0166
11 | 009
12 | 0034578      first quartile
13 | 00359
14 | 08
15 | 00257
16 | 555          median or second quartile
17 | 000255
18 | 000055567    third quartile
19 | 245
20 | 3
21 | 025
22 | 0
23 |
24 |
25 |
26 | 0

Interquartile Range (IQR)


The interquartile range (IQR) is the difference between the third and first quartiles.
IQR = Q3 − Q1
The IQR is a resistant measure of dispersion.

IQR for Examples 7-9
Example 7: Q1 = 10.0 and Q3 = 21.0
So the IQR = 21.0 − 10.0
IQR = 11.0

Example 8: Q1 = 10.0 and Q3 = 23.0
So the IQR = 23.0 − 10.0
IQR = 13.0

Example 9: Q1 = 127.5 and Q3 = 185.0
So the IQR = 185.0 − 127.5
IQR = 57.5

Five-Number Summary
The five-number summary gives a concise description of the distribution of a variable:
Smallest value (Min)
First quartile (Q1 or P25)
Median (M or Q2 or P50)
Third quartile (Q3 or P75)
Largest value (Max)

The smallest value (min) and the largest value (max):
Information about the tails of the data.
Not resistant.
The first quartile and the third quartile:
Information about the spread of the data (IQR).
Resistant.
The median:
Information about the center of the data.
Resistant.

Five-Number Summary for Example 7
minimum = 5.0
Q1 = 10.0
M = 16.5
Q3 = 21.0
maximum = 25.0

Interquartile Range (IQR) = Q3 − Q1 = 11.0
IQR gives the spread of the middle 50% of the data.

Five-Number Summary for Example 8
minimum = 5.0
Q1 = 10.0
M = 18.0
Q3 = 23.0
maximum = 27.0

Interquartile Range (IQR) = Q3 − Q1 = 13.0
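The five-number summaries for Examples 7 and 8 can be reproduced with a short Python sketch. It assumes the hand method from the slides (quartiles as medians of the lower and upper halves); the function names and the dictionary keys are illustrative choices.

```python
def median_of_sorted(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return float(ordered[mid])
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Min, Q1, M, Q3, Max, with quartiles as medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    return {
        "min": float(ordered[0]),
        "Q1": median_of_sorted(ordered[:half]),
        "M": median_of_sorted(ordered),
        "Q3": median_of_sorted(ordered[n - half:]),
        "max": float(ordered[-1]),
    }


def interquartile_range(summary):
    """IQR = Q3 - Q1, the spread of the middle 50% of the data."""
    return summary["Q3"] - summary["Q1"]


example_7 = [5, 25, 7, 23, 10, 11, 15, 21, 18, 20]
example_8 = [5, 25, 7, 23, 10, 11, 15, 21, 18, 20, 27]
summary_7 = five_number_summary(example_7)
summary_8 = five_number_summary(example_8)
print(summary_7, interquartile_range(summary_7))
print(summary_8, interquartile_range(summary_8))
```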

Five-Number Summary for Example 9
minimum = 100
Q1 = 127.5
M = 165.0
Q3 = 185.0
maximum = 260

Interquartile Range (IQR) = Q3 − Q1 = 57.5

Boxplot
A boxplot is a graphical representation of
the five-number summary
Central box spans Q1 and Q3.
A line in the box marks the median M (Q2).
Lines extend from the box out to the
minimum and maximum. (These lines are
sometimes called whiskers)


Boxplot of Example 9: Weight Data
[Figure: boxplot of Weight on an axis from 100 to 275, with Q1, M, Q3, and max labeled.]

Boxplot
WARNING: When using MINITAB, the whiskers do not necessarily extend to the min and max of the data set. They extend to the smallest and largest values that lie within what are called the lower and upper fences. Values smaller than the lower fence or greater than the upper fence are called outliers, and asterisks are used to indicate them.
We will talk about what an outlier is and a way we can check our data for outliers.

Outliers
Outliers are extreme observations in the data. They are values that are unusually high or unusually low relative to the spread of the data.
Outliers should be identified and investigated.
Outliers could be:
Chance occurrences
Measurement errors
Data entry errors
Sampling errors

Outliers are not necessarily invalid data.
One way to check for outliers uses the quartiles.

Outliers (cont.)
Fence Rule for checking for outliers using the quartiles:
Calculate lower and upper fences:
Lower fence = LF = Q1 − (1.5 × IQR)
Upper fence = UF = Q3 + (1.5 × IQR)

Values (strictly) less than the lower fence or (strictly) greater than the upper fence could be considered outliers.
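The fence rule can be sketched in Python as below. The quartiles are passed in rather than computed, so the sketch works with whichever quartile method is used. The function names are assumptions, and the "altered" data set is hypothetical: it is Example 7 with the largest value replaced by 60, purely to show a flagged value.

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) from the first and third quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values, q1, q3):
    """Values strictly below the lower fence or strictly above the upper fence."""
    lower_fence, upper_fence = fences(q1, q3)
    return [v for v in values if v < lower_fence or v > upper_fence]


example_7 = [5, 7, 10, 11, 15, 18, 20, 21, 23, 25]
print(fences(10.0, 21.0))                    # fences for Example 7
print(find_outliers(example_7, 10.0, 21.0))  # no values outside the fences

# Hypothetical altered copy: 25 replaced by 60 (Q1 and Q3 are unchanged).
altered = [5, 7, 10, 11, 15, 18, 20, 21, 23, 60]
print(find_outliers(altered, 10.0, 21.0))
```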

Boxplot in MINITAB of Example 7
The MINITAB commands used to produce this boxplot (data was in column 1):
MTB> boxplot c1;
SUBC> transpose.

From the boxplot produced by MINITAB, we see that there are no outliers (no asterisks).

Boxplot in MINITAB of Altered Weight Data
The MINITAB commands used to produce this boxplot (data was in column 2):
MTB> boxplot c2;
SUBC> transpose.

From the boxplot produced by MINITAB, we see that there are outliers. The left whisker goes to the minimum data value, but the right whisker stops at the largest data value that is not above the upper fence. Each outlier is marked with an asterisk; in this example there is only one outlier, and it happens to be the maximum value in the data set.

Example from Text: Boxplots


Variance and Standard Deviation
Recall that variability exists when some values are different from (above or below) the mean.
Each data value has an associated deviation from the mean.

Variance
The variance is based on the deviations from the mean. (How far is each observation from the typical value?)
Deviations:
$(x_i - \mu)$ for populations
$(x_i - \bar{x})$ for samples

So that positive and negative differences are treated alike, we square the deviations:
$(x_i - \mu)^2$ for populations
$(x_i - \bar{x})^2$ for samples

The larger these values, the further observations are from the mean, i.e., the more spread out the data values.

Population Variance
The population variance (denoted $\sigma^2$) of a variable is the sum of these squared deviations divided by the number in the population (i.e., population size N).
$$\sigma^2 = \frac{(x_1-\mu)^2 + (x_2-\mu)^2 + \cdots + (x_N-\mu)^2}{N} = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2$$
(In other words, we are calculating the average of the squared distances of the data points from the mean.)

Note: If you are calculating by hand, use as many decimal places as allowed by your calculator.

Sample Variance
The sample variance (denoted $s^2$) of a variable is the sum of these squared deviations divided by one less than the number in the sample (i.e., sample size minus 1).
$$s^2 = \frac{(x_1-\bar{x})^2 + (x_2-\bar{x})^2 + \cdots + (x_n-\bar{x})^2}{n-1} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$$
We say that this statistic has n − 1 degrees of freedom.

Note: For accuracy when calculating by hand, use as many decimal places as allowed by your calculator.

Why do we use different formulas for the population variance and the sample variance? See page 51 in your book.
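To make the difference between the two divisors concrete, the sketch below computes both variances for the Example A data (1, 4, 5, 10), treating the same four numbers first as a whole population and then as a sample. The function names are illustrative assumptions.

```python
def sum_of_squared_deviations(values):
    """Sum of squared deviations of the values from their mean."""
    center = sum(values) / len(values)
    return sum((x - center) ** 2 for x in values)


def population_variance(values):
    """Divide by N, the population size."""
    return sum_of_squared_deviations(values) / len(values)


def sample_variance(values):
    """Divide by n - 1, the degrees of freedom."""
    if len(values) < 2:
        raise ValueError("a sample variance needs at least two observations")
    return sum_of_squared_deviations(values) / (len(values) - 1)


data = [1, 4, 5, 10]   # mean 5; squared deviations 16, 1, 0, 25
print(population_variance(data))
print(sample_variance(data))
```

The sample variance is always the larger of the two, because the same sum is divided by a smaller number.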

Standard Deviation
The standard deviation is the square root of the variance.
The population standard deviation
Is the square root of the population variance ($\sigma^2$).
Is represented by $\sigma$.

The sample standard deviation
Is the square root of the sample variance ($s^2$).
Is represented by s.
To avoid round-off error, never use the rounded value of the variance to compute the standard deviation.

The variance and standard deviation are not resistant measures of dispersion.

Deviations
What is a typical deviation from the mean? (standard deviation)
Small values of this typical deviation indicate small variability in the data.
Large values of this typical deviation indicate large variability in the data.

Variance and Standard Deviation
Example from Text
Metabolic rates of 7 men (cal./24hr.):
1792 1666 1362 1614 1460 1867 1439
$$\bar{x} = \frac{1792 + 1666 + 1362 + 1614 + 1460 + 1867 + 1439}{7} = \frac{11{,}200}{7} = 1600$$

Variance and Standard Deviation
Example from Text
Observations $x_i$     Deviations $x_i - \bar{x}$     Squared deviations $(x_i - \bar{x})^2$
1792
1666
1362
1614
1460
1867
1439
                       Sum =                          Sum =

Variance and Standard Deviation
Example from Text
Observations $x_i$     Deviations $x_i - \bar{x}$     Squared deviations $(x_i - \bar{x})^2$
1792                   1792 − 1600 = 192              (192)² = 36,864
1666                   1666 − 1600 = 66               (66)² = 4,356
1362                   1362 − 1600 = −238             (−238)² = 56,644
1614                   1614 − 1600 = 14               (14)² = 196
1460                   1460 − 1600 = −140             (−140)² = 19,600
1867                   1867 − 1600 = 267              (267)² = 71,289
1439                   1439 − 1600 = −161             (−161)² = 25,921
                       sum = 0                        sum = 214,870

Variance and Standard Deviation
Example from Text
$$s^2 = \frac{214{,}870}{7 - 1} = 35{,}811.67$$
$$s = \sqrt{35{,}811.67} = 189.24 \text{ calories}$$
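The metabolic-rate example can be checked step by step in Python. The sketch follows the same order as the hand calculation (mean, deviations, squared deviations, variance, standard deviation); the variable names are illustrative.

```python
import math

rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]  # cal./24hr. for 7 men

n = len(rates)
mean_rate = sum(rates) / n
deviations = [x - mean_rate for x in rates]
squared_deviations = [d ** 2 for d in deviations]

sample_variance = sum(squared_deviations) / (n - 1)  # n - 1 degrees of freedom
sample_sd = math.sqrt(sample_variance)               # use the unrounded variance

print(f"mean = {mean_rate:.0f}")
print(f"sum of deviations = {sum(deviations):.0f}")
print(f"sum of squared deviations = {sum(squared_deviations):,.0f}")
print(f"s^2 = {sample_variance:,.2f}")
print(f"s = {sample_sd:.2f} calories")
```

Taking the square root of the unrounded variance follows the advice on avoiding round-off error.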

Choosing Measures of Center and Spread (A Summary)
Outliers affect the values of the mean and standard deviation.
The five-number summary should be used to describe center and spread for skewed distributions, or when outliers are present.
Use the mean and standard deviation for reasonably symmetric distributions that are free of outliers.

Example 10: Number of Books Read for Pleasure: Sorted
L(M)=(52+1)/2=26.5
[Table: the 52 sorted values; surviving entries: 10, 10, 12, 13, 14, 14, 15, 15, 20, 20, 30, 99]

Example 10: Five-Number Summary: Boxplot
Median = 3
Interquartile range (IQR) = 5.5 − 1.0 = 4.5
Range = 99 − 0 = 99
Mean = 7.06
s.d. = 14.43
Boxplots in the book do not show outliers. MINITAB boxplots will show outliers as *.
[Figure: boxplot of Number of books on an axis from 0 to 100.]

Five-Number Summary using MINITAB
Data is put into column 1 in MINITAB.
Using the describe command, I will get not only the five-number summary but also the mean and standard deviation:
MINITAB Command:
MTB > describe c1

Output in the Session window:
Descriptive Statistics: Num of Books
Variable       N  N*  Mean  SE Mean  StDev  Minimum    Q1  Median    Q3  Maximum
Num of Books  52   0  7.06     2.00  14.43     0.00  1.00    3.00  5.75    99.00

Note: MINITAB's Q3 (5.75) is different from our hand calculation (5.5). MINITAB calculates quartiles a little differently.
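One common reason software quartiles differ from hand-calculated ones is interpolation. The sketch below assumes MINITAB places Q1 and Q3 at positions (n+1)/4 and 3(n+1)/4 in the ordered data and interpolates linearly between neighbors; that assumption is not stated in the slides. Python's `statistics.quantiles` with `method="exclusive"` uses this position rule. Because the books data is not reproduced here, the Example 7 data is used instead.

```python
from statistics import quantiles

example_7 = [5, 25, 7, 23, 10, 11, 15, 21, 18, 20]

# Hand method from the slides: medians of the lower and upper halves.
hand_q1, hand_q3 = 10.0, 21.0

# (n + 1) * p position rule with linear interpolation.
interpolated = quantiles(example_7, n=4, method="exclusive")
print("hand:        ", hand_q1, hand_q3)
print("interpolated:", interpolated[0], interpolated[2])
```

Both methods agree on the median (16.5) but give slightly different first and third quartiles, so the IQR and the fences can differ a little between hand work and software output.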

Which summary to use for Example 10?
Looking at the mean and median, it seems like the data is ________ skewed.
Taking a look at the probability plot and boxplot that we did in class, it also shows that the distribution is _________ skewed. So the summary we should use is the ________________.
Blanks to be filled in during class.

Statistical Problem


Tornadoes Example
The following data give the number of tornadoes in Oklahoma, Kansas, and Nebraska for the years 1990 to 2004.

a) Construct boxplots for all three states on the same scale.
b) Describe the shape of the distribution for each state.
c) Which state had the highest number of tornadoes in a year? Was this number of tornadoes unusual for that state?
d) Which state tends to have the highest yearly number of tornadoes?
e) What measure of center and spread should be reported for this data?

Year   Oklahoma   Kansas   Nebraska
1990   30         88       88
1991   73         116      63
1992   64         92       74
1993   64         113      69
1994   40         42       55
1995   79         73       26
1996   47         68       60
1997   55         62       30
1998   83         71       65
(rows for 1999 to 2004 not shown)

MINITAB Tornadoes Example
C1, which contains the year variable, is not used in creating the boxplot.

MTB > boxp c2 c3 c4;
SUBC> title "Distribution of Yearly Number of Tornadoes for 1990-2004";
SUBC> axlabel 1 "States";
SUBC> axlabel 2 "Number of Tornadoes (Per Year)";
SUBC> transpose;
SUBC> over.

Tornadoes Example (cont.)
[Figure: side-by-side boxplots titled "Distribution of Yearly Number of Tornadoes for 1990-2004"; States (Oklahoma, Kansas, Nebraska) against Number of Tornadoes (Per Year), axis from 20 to 160.]

Title of graph: distribution of car color preferences for town A for a sample of 50 people.
Correct titles for the vertical and horizontal axes.
Vertical: percents (relative frequencies) or frequencies (counts).
