Unit 1: Exploratory Data Analysis

Prof. Siswanto Agus Wilopo, S.U., M.Sc., Sc.D.


Department of Public Health
Faculty of Medicine
Universitas Gadjah Mada

sawilopo@yahoo.com

Universitas Gadjah Mada, Faculty of Medicine, Department of Public Health

Table:
Assess the use of tables for each type of data.
Describe a frequency distribution.
Create a frequency table from raw data.
Construct relative frequency, cumulative frequency, and relative cumulative frequency tables.
Construct grouped frequency tables.
Construct a cross-tabulation table.
Illustrate the use of a contingency table.
Create a table with ranked data.
Biostatistics I: 2013

Graph:
Assess the most appropriate chart for a given data type.
Construct pie charts and simple, clustered, and stacked bar charts.
Create histograms.
Create step charts and ogives.
Construct time series charts, including statistical process control (SPC) charts.
Interpret and assess what a chart reveals.
Assess the meaning of the shape of a frequency distribution.
Appraise negatively skewed, symmetric, and positively skewed distributions.
Describe a bimodal distribution.
Describe the approximate shape of a frequency distribution from a frequency table or chart.
Assess whether data can be considered normally distributed.


Numeric Summary:
Describe what a summary measure of location is, and understand the meaning of, and the difference between, the mode, the median, and the mean.
Compute the mode, median, and mean for a set of values.
Formulate the role of data type and distributional shape in choosing the most appropriate measure of location.
Describe what a percentile is, and calculate any given percentile value.
Describe what a summary measure of spread is.
Differentiate between, and calculate, the range, the interquartile range, and the standard deviation.
Interpret and estimate percentile values.
Formulate the role of data type and distributional shape in choosing the most appropriate measure of spread.


The Big Picture


Recall The Big Picture, the four-step process
that encompasses statistics (as it is presented in
this course):
1. Producing Data: choosing a sample from the
population of interest and collecting data.
2. Exploratory Data Analysis (EDA) {Descriptive
Statistics}: summarizing the data we've collected.
3. Probability and
4. Inference: drawing conclusions about the entire population
based on the data collected from the sample.

Even though in practice it is the second step in the
process, we are going to look at Exploratory Data
Analysis (EDA) first.

Goals of EDA
Exploratory Data Analysis (EDA) is how
we make sense of the data by
converting them from their raw form to
a more informative one.


EDA consists of:


organizing and summarizing the raw
data,
discovering important features and
patterns in the data and any striking
deviations from those patterns, and then
interpreting our findings in the context of
the problem


(continued)
And can be useful for:
describing the distribution of a single
variable (center, spread, shape, outliers)
checking data (for errors or other
problems)
checking assumptions for more complex
statistical analyses
investigating relationships between
variables

EDA
Exploratory data analysis (EDA) methods are
often called Descriptive Statistics due to the
fact that they simply describe, or provide
estimates based on, the data at hand.
In Unit 4 we will cover methods of Inferential
Statistics which use the results of a sample to
make inferences about the population under
study.
Comparisons can be visualized and values of
interest estimated using EDA but descriptive
statistics alone will provide no information
about the certainty of our conclusions.

Important Features of Exploratory Data Analysis
There are two important features to the
structure of the EDA unit in this course:
The material in this unit covers two
broad topics:
Examining Distributions: exploring data one
variable at a time.
Examining Relationships: exploring data two
variables at a time.


Important Features of Exploratory Data Analysis
In Exploratory Data Analysis, our
exploration of data will always consist
of the following two elements:
visual displays, supplemented by
numerical measures.

Try to remember these structural
themes, as they will help you orient
yourself along the path of this unit.

EXAMINING DISTRIBUTIONS


Examining Distributions
We will begin the EDA part of the course
by exploring (or looking at) one variable
at a time.
As we have seen, the data for each
variable consist of a long list of values
(whether numerical or not), and are not
very informative in that form.


Examining Distributions
In order to convert these raw data into
useful information, we need to summarize
and then examine the distribution of the
variable.
By distribution of a variable, we mean:
what values the variable takes, and
how often the variable takes those values.

We will first learn how to summarize and
examine the distribution of a single
categorical variable, and then do the same
for a single quantitative variable.

ONE CATEGORICAL VARIABLE


Example:
Distribution of One Categorical Variable
What is your perception of your own
body? Do you feel that you are
overweight, underweight, or about right?
A random sample of 1,200 college
students were asked this question as
part of a larger survey. The following
table shows part of the responses:


Example: Raw Data (excerpt from the 1,200 students)

Student       Body Image
student 25    overweight
student 26    about right
student 27    underweight
student 28    about right
student 29    about right


Here is some information that would be
interesting to get from these data:
What percentage of the sampled students fall into
each category?
How are students divided across the three body
image categories?
Are they equally divided? If not, do the
percentages follow some other kind of pattern?


There is no way that we can answer these
questions by looking at the raw data, which
are in the form of a long list of 1,200
responses, and thus not very useful.
However, these questions will be
easily answered once we summarize and
look at the distribution of the variable Body
Image (i.e., once we summarize how often
each of the categories occurs).

Numerical Measures
In order to summarize the distribution of
a categorical variable, we first create a
table of the different values (categories)
the variable takes, how many times each
value occurs (count) and, more
importantly, how often each value occurs
(by converting the counts to
percentages).
The result is often called a Frequency
Distribution or Frequency Table.

A Frequency Distribution or Frequency Table

Category       Count     Percent
About right    855       (855/1200)*100 = 71.3%
Overweight     235       (235/1200)*100 = 19.6%
Underweight    110       (110/1200)*100 = 9.2%
Total          n=1200    100%
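
A frequency table like this one can be built directly from the raw responses. The following is a minimal sketch, assuming the responses are held in a Python list; the list here is rebuilt from the reported counts rather than read from the actual survey file, and `frequency_table` is a hypothetical helper name.

```python
from collections import Counter
from decimal import Decimal, ROUND_HALF_UP


def frequency_table(values):
    """Return {category: (count, percent)} for a list of categorical values.

    Percentages are rounded half-up to one decimal place with Decimal,
    because Python's built-in round() rounds exact halves to the even digit
    (71.25 would become 71.2 rather than 71.3).
    """
    counts = Counter(values)
    n = len(values)
    table = {}
    for category, count in counts.most_common():
        percent = (Decimal(100 * count) / Decimal(n)).quantize(
            Decimal("0.1"), rounding=ROUND_HALF_UP
        )
        table[category] = (count, percent)
    return table


# Illustrative raw data, rebuilt from the counts in the table.
responses = (
    ["about right"] * 855 + ["overweight"] * 235 + ["underweight"] * 110
)

table = frequency_table(responses)
for category, (count, percent) in table.items():
    print(f"{category:<12} {count:>5} {percent:>6}%")
```

Sorting by count (via `most_common`) is a presentation choice; any category order gives the same distribution.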


Visual or Graphical Displays: Pie Chart


Visual or Graphical Displays

OR


ONE QUANTITATIVE VARIABLE


To display data from one quantitative
variable graphically, we can use either
a histogram or boxplot.
We will also present several by-hand
displays such as the stemplot and
dotplot.


Numerical Measures
The overall pattern of the distribution of
a quantitative variable is described by
its shape, center, and spread.
By inspecting the histogram or boxplot,
we can describe the shape of the
distribution, but we can only get a rough
estimate for the center and spread.


Numerical Measures
A description of the distribution of a
quantitative variable must include, in addition
to the graphical display, a more
precise numerical description of the center
and spread of the distribution.
In this lecture you will learn:
how to quantify the center and spread of a distribution with
various numerical measures;
some of the properties of those numerical measures; and
how to choose the appropriate numerical measures of center
and spread to supplement the histogram.
We will also discuss a few measures of position or location
which allow us to quantify where a particular value is in
the distribution of all values.

How To Create Histograms


Here are the exam grades of 15 students:
88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73

Score       Count
[40-50)     1
[50-60)     2
[60-70)     4
[70-80)     5
[80-90)     2
[90-100)    1
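
The counting step behind a histogram can be sketched in a few lines. This is a minimal illustration, assuming equal-width intervals that are closed on the left and open on the right, as in the Score column; `bin_counts` is a hypothetical helper, not part of any library.

```python
grades = [88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73]


def bin_counts(values, start, stop, width):
    """Count values in the intervals [start, start+width), ..., up to stop."""
    edges = list(range(start, stop, width))
    counts = [0] * len(edges)
    for value in values:
        if not start <= value < stop:
            raise ValueError(f"{value} is outside [{start}, {stop})")
        counts[(value - start) // width] += 1
    return [(low, low + width, count) for low, count in zip(edges, counts)]


# Print a text histogram: one asterisk per student in each interval.
for low, high, count in bin_counts(grades, 40, 100, 10):
    print(f"[{low}-{high})".ljust(10), "*" * count, count)
```

The bar heights of the histogram are exactly these counts, one bar per interval.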

Stemplot (Stem and Leaf Plot)


The stemplot (also called stem and leaf plot) is
another graphical display of the distribution of
a quantitative variable.
The idea is to separate each data point into a
stem and leaf, as follows:
The leaf is the right-most digit.
The stem is everything except the right-most digit.
So, if the data point is 34, then 3 is the stem and 4 is the leaf.
If the data point is 3.41, then 3.4 is the stem and 1 is the leaf.

Note: For this to work, ALL data points should
be rounded to the same number of decimal
places.

Stemplot (Stem and Leaf Plot)


EXAMPLE: Best Actress Oscar Winners
We will use the Best Actress Oscar winners
example:
34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21
41 26 80 43 29 33 35 45 49 39 34 26 25 35 33
To make a stemplot:
Separate each observation into a stem and a leaf.
Write the stems in a vertical column with the
smallest at the top, and draw a vertical line at the
right of this column.
Go through the data points, and write each leaf in
the row to the right of its stem.
Rearrange the leaves in an increasing order.
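
The steps above can be sketched in code. This is a minimal illustration for two-digit whole numbers, where the stem is the tens digit and the leaf is the units digit; `stemplot` is a hypothetical helper name, and the data are the values listed in the example.

```python
from collections import defaultdict

data = [34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61, 21,
        41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33]


def stemplot(values):
    """Map each stem (tens digit) to its leaves (units digits) in order."""
    stems = defaultdict(list)
    # Sorting first means the leaves in each row come out already ordered.
    for value in sorted(values):
        stems[value // 10].append(value % 10)
    return stems


stems = stemplot(data)
# Print every stem from smallest to largest, including empty ones,
# so that gaps in the distribution stay visible.
for stem in range(min(stems), max(stems) + 1):
    leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
    print(f"{stem} | {leaves}")
```

Keeping the empty stem rows is a deliberate choice: dropping them would hide how far the largest values sit from the rest.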

When rotated 90 degrees counterclockwise, the stemplot
visually resembles a histogram:


Summary Measures
Describing Data Numerically

Central Tendency: Arithmetic Mean, Median, Mode, Geometric Mean
Quartiles
Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
Shape: Skewness


Central Tendency


Measures of Central Tendency
Overview

Arithmetic Mean:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

Median: midpoint of ranked values
Mode: most frequently observed value
Geometric Mean:

$$\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$$


Arithmetic Mean
The arithmetic mean (sample mean)
is the most common measure of
central tendency
For a sample of size n:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$

where n is the sample size and $X_1, \ldots, X_n$ are the observed values


Arithmetic Mean (continued)

The most common measure of central tendency
Mean = sum of values divided by the number of
values
Affected by extreme values (outliers)

Values 1, 2, 3, 4, 5: Mean = (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3
Values 1, 2, 3, 4, 10: Mean = (1 + 2 + 3 + 4 + 10)/5 = 20/5 = 4


Median
In an ordered array, the median is the
middle number (50% above, 50%
below)

[Figure: two dot plots on a 0 to 10 axis; Median = 3 in both]

Not affected by extreme values


Finding the Median

The location of the median:

$$\text{Median position} = \frac{n+1}{2} \text{ position in the ordered data}$$

If the number of values is odd, the median is the middle
number
If the number of values is even, the median is the average of
the two middle numbers

Note that (n+1)/2 is not the value of the median, only
the position of the median in the ranked data
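
The distinction between the position and the value of the median can be made concrete with a short sketch. `median_by_position` is a hypothetical helper written for illustration; the first data set is the one used on the arithmetic mean slide, the second is made up to show the even case.

```python
def median_by_position(values):
    """Find the median via its (n + 1) / 2 position in the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    position = (n + 1) / 2  # 1-based position, not the median itself
    if n % 2 == 1:
        # Odd n: the position is a whole number; take that value.
        return ordered[int(position) - 1]
    # Even n: the position ends in .5; average the two middle values.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2


print(median_by_position([1, 2, 3, 4, 10]))  # n = 5, position 3
print(median_by_position([1, 2, 3, 4]))      # n = 4, position 2.5
```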

Mode
A measure of central tendency
Value that occurs most often
Not affected by extreme values
Used for either numerical or
categorical (nominal) data
There may be no mode
There may be several modes

[Figure: two dot plots; the first, on a 0 to 14 axis, has Mode = 9; the second, on a 0 to 6 axis, has no mode]


Problem
Which measure of location
is the best?
The mean is generally used, unless
extreme values (outliers) exist.
Then the median is often used, since the
median is not sensitive to extreme
values.


Measures of Location
Comparison of Mean and Median
Let us use cholesterol data as an example:

145, 159, 166, 166, 195, 205, 250

We found the mean is 183.7 and the
median is 166.

Measures of Location
Comparison of Mean and Median
Suppose we replace 250 with 215:

145, 159, 166, 166, 195, 205, 215

We will find the mean is 178.7 and the
median remains 166.
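
The comparison can be checked with Python's standard `statistics` module. This is a small sketch using the two cholesterol data sets from the slides; the variable names are illustrative.

```python
from statistics import mean, median

original = [145, 159, 166, 166, 195, 205, 250]
modified = [145, 159, 166, 166, 195, 205, 215]  # 250 replaced with 215

for label, data in (("original", original), ("modified", modified)):
    # The mean shifts with the largest value; the median does not.
    print(f"{label}: mean = {mean(data):.1f}, median = {median(data)}")
```

Changing only the largest value moves the mean by 5 units while the median stays where it is, which is the sense in which the median is not sensitive to extreme values.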

Geometric Mean
Geometric mean
Used to measure the rate of change of a variable
over time

$$\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$$

Geometric mean rate of return
Measures the status of an investment over time

$$\bar{R}_G = [(1 + R_1) \times (1 + R_2) \times \cdots \times (1 + R_n)]^{1/n} - 1$$

where $R_i$ is the rate of return in time period i

Example
An investment of $100,000 declined to $50,000 at
the end of year one and rebounded to $100,000
at the end of year two:

X1 = $100,000
X2 = $50,000 (50% decrease)
X3 = $100,000 (100% increase)

The overall two-year return is zero, since it started and
ended at the same level.

Example (continued)

Use the 1-year returns to compute the
arithmetic mean and the geometric mean:

Arithmetic mean rate of return (misleading result):

$$\bar{X} = \frac{(-50\%) + (100\%)}{2} = 25\%$$

Geometric mean rate of return (more accurate result):

$$\bar{R}_G = [(1 + R_1) \times (1 + R_2) \times \cdots \times (1 + R_n)]^{1/n} - 1$$

$$= [(1 + (-50\%)) \times (1 + (100\%))]^{1/2} - 1$$

$$= [(0.50) \times (2)]^{1/2} - 1 = 1^{1/2} - 1 = 0\%$$
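
The two averages for the investment example can be reproduced with a short sketch. Returns are written as decimals (a 50% decrease is -0.50); the function names are hypothetical helpers, and `math.prod` assumes Python 3.8 or later.

```python
import math


def arithmetic_mean_return(returns):
    """Plain average of the per-period rates of return."""
    return sum(returns) / len(returns)


def geometric_mean_return(returns):
    """Geometric mean rate of return: [(1+R1)...(1+Rn)]^(1/n) - 1."""
    growth = math.prod(1 + r for r in returns)
    return growth ** (1 / len(returns)) - 1


returns = [-0.50, 1.00]  # year one: 50% decrease; year two: 100% increase

print(f"arithmetic: {arithmetic_mean_return(returns):.0%}")
print(f"geometric:  {geometric_mean_return(returns):.0%}")
```

The geometric mean works on growth factors, so a halving followed by a doubling multiplies out to 1 and the average rate is zero, matching the fact that the investment ended where it started.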


MEASURE OF VARIATION


Measures of Variation
Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation

Measures of variation give
information on the spread
or variability of the data
values.

[Figure: two distributions with the same center but different variation]

Range
Simplest measure of variation
Difference between the largest and
the smallest values in a set of data:

$$\text{Range} = X_{\text{largest}} - X_{\text{smallest}}$$

Example:
[Figure: dot plot on a 0 to 14 axis]
Range = 14 - 1 = 13

Disadvantages of the Range

Ignores the way in which data are
distributed
[Figure: two dot plots, each spanning 7 to 12 but with the values distributed differently; in both, Range = 12 - 7 = 5]

Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119


Quartiles
Quartiles split the ranked data into 4 segments
with an equal number of values per segment
25% | Q1 | 25% | Q2 | 25% | Q3 | 25%

The first quartile, Q1, is the value for which 25% of the
observations are smaller and 75% are larger
Q2 is the same as the median (50% are smaller, 50% are
larger)
Only 25% of the observations are greater than the third
quartile

Quartile Formulas
Find a quartile by determining the value in the
appropriate position in the ranked data, where
First quartile position: Q1 = (n+1)/4
Second quartile position: Q2 = (n+1)/2 (the median position)
Third quartile position: Q3 = 3(n+1)/4

where n is the number of observed values


Calculating Quartiles
Example: Find the first quartile
Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22
(n = 9)
Q1 is in the (9+1)/4 = 2.5 position of the ranked data,
so use the value halfway between the 2nd and 3rd values:
Q1 = 12.5

Q1 and Q3 are measures of noncentral location
Q2 = median, a measure of central tendency

Quartiles (continued)

Example:
Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22

(n = 9)
Q1 is in the (9+1)/4 = 2.5 position of the ranked data,
so Q1 = 12.5
Q2 is in the (9+1)/2 = 5th position of the ranked data,
so Q2 = median = 16
Q3 is in the 3(9+1)/4 = 7.5 position of the ranked data,
so Q3 = 19.5
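
The position formulas can be turned into a short sketch. The slides only show positions ending in .0 or .5; treating other fractional positions by linear interpolation between neighbouring values is an assumption made here, and statistical packages use several different quartile conventions. `value_at_position` and `quartiles` are hypothetical helper names.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in ordered data."""
    lower = int(position)
    fraction = position - lower
    if fraction == 0:
        return ordered[lower - 1]
    # Assumption: interpolate linearly between the two neighbouring values
    # (for a .5 position this is the value halfway between them).
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)


def quartiles(values):
    """Return (Q1, Q2, Q3) using the positions (n+1)/4, (n+1)/2, 3(n+1)/4."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 3:
        raise ValueError("need at least 3 values for these position formulas")
    return tuple(value_at_position(ordered, k * (n + 1) / 4) for k in (1, 2, 3))


sample = [11, 12, 13, 16, 16, 17, 18, 21, 22]
q1, q2, q3 = quartiles(sample)
print(q1, q2, q3)
print("IQR =", q3 - q1)
```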

Interquartile Range
Can eliminate some outlier problems by
using the interquartile range
Eliminate some high- and low-valued
observations and calculate the range
from the remaining values
Interquartile range = 3rd quartile - 1st quartile
= Q3 - Q1

Interquartile Range
Example:

X minimum    Q1    Median (Q2)    Q3    X maximum
12           30    45             57    70

(25% of the observations fall in each of the four segments)

Interquartile range
= 57 - 30 = 27


Variance
Average (approximately) of squared
deviations of values from the mean

Sample variance:

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

where
$\bar{X}$ = mean
n = sample size
$X_i$ = ith value of the variable X


Standard Deviation
Most commonly used measure of
variation
Shows variation about the mean
Is the square root of the variance
Has the same units as the original data

Sample standard deviation:

$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$


Calculation Example:
Sample Standard Deviation
Sample Data ($X_i$): 10 12 14 15 17 18 18 24

n = 8
Mean = $\bar{X}$ = 16

$$S = \sqrt{\frac{(10 - \bar{X})^2 + (12 - \bar{X})^2 + (14 - \bar{X})^2 + \cdots + (24 - \bar{X})^2}{n - 1}}$$

$$= \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}}$$

$$= \sqrt{\frac{130}{7}} = 4.3095$$

A measure of the average
scatter around the mean
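
The same calculation can be followed step by step in code. This is a minimal sketch with a hypothetical helper, `sample_std`, written out by hand so that each step of the formula is visible; the standard library's `statistics.stdev` computes the same quantity.

```python
import math

data = [10, 12, 14, 15, 17, 18, 18, 24]


def sample_std(values):
    """Sample standard deviation with the n - 1 divisor."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    # Sum of squared deviations from the mean (130 for the data above).
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))


print(f"mean = {sum(data) / len(data)}")
print(f"S = {sample_std(data):.4f}")
```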


Measuring variation
Small standard deviation

Large standard deviation


Comparing Standard Deviations


[Figure: three dot plots, each on an 11 to 21 axis]
Data A: Mean = 15.5, S = 3.338
Data B: Mean = 15.5, S = 0.926
Data C: Mean = 15.5, S = 4.567


Advantages of Variance and Standard Deviation
Each value in the data set is used in
the calculation
Values far from the mean are given
extra weight
(because deviations from the mean are squared)


Coefficient of Variation
Measures relative variation
Always in percentage (%)
Shows variation relative to mean
Can be used to compare two or more
sets of data measured in different
units

$$CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\%$$


Comparing Coefficient
of Variation
Hospital A:
Average surplus in the last 10 years = 50 Billion Rp.
Standard deviation = 5 Billion Rp.

$$CV_A = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \frac{5 \text{ Bill Rp.}}{50 \text{ Bill Rp.}} \cdot 100\% = 10\%$$

Hospital B:
Average surplus in the last 10 years = 100 Billion Rp.
Standard deviation = 5 Billion Rp.

$$CV_B = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \frac{5 \text{ Bill Rp.}}{100 \text{ Bill Rp.}} \cdot 100\% = 5\%$$

Both hospitals have the same standard
deviation, but hospital B is
less variable relative to its
surplus
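
The hospital comparison can be reproduced with a short sketch. `coefficient_of_variation` is a hypothetical helper name; the inputs are the standard deviations and average surpluses given above, in billion Rp.

```python
def coefficient_of_variation(std_dev, mean):
    """Coefficient of variation as a percentage: (S / mean) * 100."""
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined for mean 0")
    return 100 * std_dev / mean


cv_a = coefficient_of_variation(5, 50)    # Hospital A
cv_b = coefficient_of_variation(5, 100)   # Hospital B

print(f"CV_A = {cv_a:.0f}%")
print(f"CV_B = {cv_b:.0f}%")
```

Because the units cancel in the ratio, the result is a pure percentage, which is what allows data sets measured in different units to be compared.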


Standardized Scores (Z-Scores)


Z-scores use the mean and standard deviation as the
primary measures of center and spread and are therefore
most useful when the mean and standard deviation are
appropriate, i.e. when the distribution is reasonably
symmetric with no extreme outliers.
For any individual, the z-score tells us how many standard
deviations the raw score for that individual deviates from
the mean and in what direction.
To calculate a z-score, we take the individual value and
subtract the mean and then divide this difference by the
standard deviation.
A positive z-score indicates the individual is above
average and a negative z-score indicates the individual is
below average.

Z Scores
A measure of distance from the mean (for
example, a Z-score of 2.0 means that a value is 2.0
standard deviations from the mean)
The difference between a value and the mean,
divided by the standard deviation
A Z score above 3.0 or below -3.0 is considered an
outlier

$$Z = \frac{X - \bar{X}}{S}$$

Z Scores (continued)

Example:
If the mean is 14.0 and the standard deviation is
3.0, what is the Z score for the value 18.5?

$$Z = \frac{X - \bar{X}}{S} = \frac{18.5 - 14.0}{3.0} = 1.5$$

The value 18.5 is 1.5 standard deviations above the
mean
(A negative Z-score would mean that a value is less
than the mean)
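
A small sketch of the z-score calculation, using the numbers from this example. `z_score` and `is_z_outlier` are hypothetical helper names; the cutoff of 3.0 is the one quoted on the previous slide.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies from the mean."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (value - mean) / std_dev


def is_z_outlier(value, mean, std_dev, cutoff=3.0):
    """Flag values whose z-score is above +cutoff or below -cutoff."""
    return abs(z_score(value, mean, std_dev)) > cutoff


z = z_score(18.5, mean=14.0, std_dev=3.0)
print(z)                              # positive: above the mean
print(is_z_outlier(18.5, 14.0, 3.0))
```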

Quantitative and Graphical Approach:

MEASURE SPREAD AND DISTRIBUTION


DESCRIBING DISTRIBUTIONS


Features of Distributions of
Quantitative Variables


Shape
When describing the shape of a
distribution, we should consider:
Symmetry/skewness of the
distribution.
Peakedness (modality): the
number of peaks (modes) the
distribution has.


Symmetry/skewness of the
distribution.


Shape of a Distribution
Describes how data are distributed
Measures of shape
Symmetric or skewed

Left-Skewed: Mean < Median
Symmetric: Mean = Median
Right-Skewed: Median < Mean


Numerical Measures
for a Population
Population summary measures are called
parameters
The population mean is the sum of the values in the
population divided by the population size, N

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}$$

where
$\mu$ = population mean
N = population size
$X_i$ = ith value of the variable X


Population Variance
Average of squared deviations of
values from the mean

Population variance:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$

where
$\mu$ = population mean
N = population size
$X_i$ = ith value of the variable X


Population Standard Deviation

Most commonly used measure of
variation
Shows variation about the mean
Is the square root of the population
variance
Has the same units as the original data

Population standard deviation:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$
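
The only difference between the population and sample formulas is the divisor (N versus n - 1). A minimal sketch with Python's standard `statistics` module shows the effect; the eight values are reused from the sample standard deviation example, and treating them as a complete population is purely illustrative.

```python
from statistics import pstdev, pvariance, stdev, variance

values = [10, 12, 14, 15, 17, 18, 18, 24]  # sum of squared deviations = 130

# Population formulas divide by N; sample formulas divide by n - 1.
print(f"population variance = {pvariance(values):.4f}")  # 130 / 8
print(f"sample variance     = {variance(values):.4f}")   # 130 / 7
print(f"population std dev  = {pstdev(values):.4f}")
print(f"sample std dev      = {stdev(values):.4f}")
```

For the same data the sample versions are always slightly larger, and the gap shrinks as the number of values grows.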


The Sample Covariance


The sample covariance measures the strength of
the linear relationship between two variables
(called bivariate data)
The sample covariance:

$$\text{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$

Only concerned with the strength of the relationship
No causal effect is implied


Interpreting Covariance
Covariance between two random
variables:

cov(X,Y) > 0: X and Y tend to move in the same direction
cov(X,Y) < 0: X and Y tend to move in opposite directions
cov(X,Y) = 0: X and Y have no linear relationship (independent variables have zero covariance, but zero covariance alone does not guarantee independence)



Coefficient of Correlation
Measures the relative strength of the
linear relationship between two
variables
Sample coefficient of correlation:

$$r = \frac{\text{cov}(X, Y)}{S_X S_Y}$$

where

$$\text{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$

$$S_X = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} \qquad S_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}}$$
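
The formulas for the covariance and the correlation coefficient translate directly into code. The sketch below uses hypothetical helper names and a made-up data set in which Y is an exact linear function of X, so r should come out at +1 up to floating-point rounding.

```python
import math


def sample_covariance(xs, ys):
    """Sample covariance with the n - 1 divisor."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return products / (n - 1)


def sample_std(values):
    """Sample standard deviation: the square root of cov(X, X)."""
    return math.sqrt(sample_covariance(values, values))


def correlation(xs, ys):
    """Sample coefficient of correlation r = cov(X, Y) / (S_X * S_Y)."""
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))


# Hypothetical data for illustration only: y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

print("cov =", sample_covariance(xs, ys))
print("r   =", round(correlation(xs, ys), 6))
```

The covariance depends on the units of X and Y, while dividing by both standard deviations removes the units and leaves r on a fixed scale.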


Features of
Correlation Coefficient, r
Unit free
Ranges between -1 and 1
The closer to -1, the stronger the
negative linear relationship
The closer to 1, the stronger the
positive linear relationship
The closer to 0, the weaker the linear
relationship

Scatter Plots of Data with Various Correlation Coefficients

[Figure: six scatter plots of Y against X, with r = -1, r = -.6, r = 0, r = +1, r = +.3, and r = 0]


The Empirical Rule

If the data distribution is approximately
bell-shaped, then the interval:
$\mu \pm 1\sigma$ contains about 68% of the values
in the population or the sample

[Figure: bell curve with the central 68% marked over $\mu \pm 1\sigma$]

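The Empirical Rule

$\mu \pm 2\sigma$ contains about 95% of the
values in the population or the sample
$\mu \pm 3\sigma$ contains about 99.7% of the values in
the population or the sample

[Figure: bell curves with the central 95% marked over $\mu \pm 2\sigma$ and the central 99.7% marked over $\mu \pm 3\sigma$]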

Chebyshev Rule
Regardless of how the data are distributed, at least (1 - 1/k^2) x 100% of the values will fall within k standard deviations of the mean (for k > 1)
Examples:
At least (1 - 1/1^2) x 100% = 0% within k = 1 (μ ± 1σ)
At least (1 - 1/2^2) x 100% = 75% within k = 2 (μ ± 2σ)
At least (1 - 1/3^2) x 100% = 89% within k = 3 (μ ± 3σ)
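
The three examples follow directly from the formula. A short sketch (the function name is an illustrative choice) reproduces them:

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k < 1:
        # For k < 1 the formula goes negative and says nothing useful
        raise ValueError("the bound is only informative for k >= 1")
    return 1 - 1 / k**2

for k in (1, 2, 3):
    print(f"k = {k}: at least {100 * chebyshev_lower_bound(k):.0f}%")
# k = 1: at least 0%
# k = 2: at least 75%
# k = 3: at least 89%
```

These are guaranteed minimums for any distribution, which is why they are much lower than the 68%, 95% and 99.7% of the Empirical Rule.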


MEASURES OF SPREAD


Five-Number Summary
The combination of the five numbers (min, Q1, M, Q3, Max) is called the five-number summary.
It provides a quick numerical description of both the center and spread of a distribution.
Each of the values represents a measure of position in the dataset.
The min and max provide the boundaries, and the quartiles and median provide information about the 25th, 50th, and 75th percentiles.

Inter-Quartile Range (IQR)


The 1.5(IQR) Criterion for Outliers
An observation is considered a suspected outlier or potential outlier if it is:
below Q1 - 1.5(IQR) or
above Q3 + 1.5(IQR)


The following picture (not to scale) illustrates this rule:
[Figure not reproduced]


EXAMPLE: Best Actress Oscar Winners
We will continue with the Best Actress Oscar winners example:
34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33 35 45 49 39 34 26 25 35 33
We can now use the 1.5(IQR) criterion to check whether the three highest ages should indeed be classified as potential outliers:
For this example, we found Q1 = 32 and Q3 = 41.5, which give IQR = 9.5
Q1 - 1.5(IQR) = 32 - (1.5)(9.5) = 17.75
Q3 + 1.5(IQR) = 41.5 + (1.5)(9.5) = 55.75
The 1.5(IQR) criterion tells us that any observation with an age below 17.75 or above 55.75 is considered a suspected outlier.
We therefore conclude that the observations with ages of 61, 74 and 80 should be flagged as suspected outliers in the distribution of ages.
Note that since the smallest observation is 21, there are no suspected low outliers in this distribution.
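
The check can be reproduced in a few lines of Python. This sketch takes Q1 and Q3 as reported in the example rather than recomputing them; the variable names are illustrative.

```python
ages = [34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61,
        21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33]

q1, q3 = 32, 41.5              # quartiles as reported in the example
iqr = q3 - q1                  # 9.5
lower_fence = q1 - 1.5 * iqr   # 17.75
upper_fence = q3 + 1.5 * iqr   # 55.75

# Flag every observation that falls outside the two fences
suspected = sorted(a for a in ages if a < lower_fence or a > upper_fence)

print(lower_fence, upper_fence)  # 17.75 55.75
print(suspected)                 # [61, 74, 80]
```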

Possible methods for handling outliers in practice
Why is it important to identify possible outliers, and how should they be dealt with? The answers to these questions depend on the reasons for the outlying values.
Here are several possibilities:
Even though it is an extreme value, if an outlier can be understood to have been produced by essentially the same sort of physical or biological process as the rest of the data, and if such extreme values are expected to eventually occur again, then such an outlier indicates something important and interesting about the process you're investigating, and it should be kept in the data.


If an outlier can be explained to have been produced under fundamentally different conditions from the rest of the data (or by a fundamentally different process), such an outlier can be removed from the data if your goal is to investigate only the process that produced the rest of the data.
An outlier might indicate a mistake in the data (like a typo, or a measuring error), in which case it should be corrected if possible or else removed from the data before calculating summary statistics or making inferences from the data (and the reason for the mistake should be investigated).

Identification for suspected outliers

BOXPLOTS


EXAMPLE: Best Actress Oscar Winners
We will use data on the Best Actress Oscar winners as an example:
34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33 35 45 49 39 34 26 25 35 33
The five-number summary of the age of Best Actress Oscar winners (1970-2001) is:
min = 21, Q1 = 32, M = 35, Q3 = 41.5, Max = 80
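
The summary above is consistent with taking Q1 and Q3 as the medians of the lower and upper halves of the sorted data. The sketch below assumes that convention (the slides do not state which quartile method was used), and the function name is illustrative.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, M, Q3, max), with quartiles as medians of each half."""
    s = sorted(values)
    n = len(s)
    half = n // 2
    lower = s[:half]
    # For an odd number of values, the middle value belongs to neither half
    upper = s[half:] if n % 2 == 0 else s[half + 1:]
    return s[0], median(lower), median(s), median(upper), s[-1]

ages = [34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61,
        21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33]

summary = five_number_summary(ages)
print(summary)  # (21, 32.0, 35.0, 41.5, 80)
```

Library routines such as `statistics.quantiles` or `numpy.percentile` interpolate between observations by default and can return slightly different quartiles for the same data.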

Box Plot and Outliers
Lines extend from the edges of the box to the smallest and largest observations that were not classified as suspected outliers (using the 1.5(IQR) criterion).
In our example, we have no low outliers, so the bottom line goes down to the smallest observation, which is 21.
Since we have three high outliers (61, 74, and 80), the top line extends only up to 49, which is the largest observation that has not been flagged as an outlier.
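
The ends of the two lines (whiskers) can be found by discarding the flagged observations and taking the smallest and largest of what remains. This sketch uses the fences 17.75 and 55.75 from the example; the variable names are illustrative.

```python
ages = [34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61,
        21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33]

lower_fence, upper_fence = 17.75, 55.75   # from the 1.5(IQR) criterion

# Observations that were not flagged as suspected outliers
inside = [a for a in ages if lower_fence <= a <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(whisker_low, whisker_high)  # 21 49
```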

The following information is visually depicted in the boxplot:
the five-number summary (blue)
the range and IQR (red)
outliers (green)


Side-by-side boxplots of the age distributions by gender


Box Plot Summarized
The five-number summary of a distribution consists of M, Q1, Q3 and the extremes Min, Max.
The median describes the center, and the extremes (which give the range) and the quartiles (which give the IQR) describe the spread.
The boxplot visually displays the five-number summary and any suspected outliers identified by the 1.5(IQR) criterion.
Boxplots are presented side by side to compare and contrast distributions from two or more groups.

ROLE-TYPE CLASSIFICATION


Classification
In most studies involving two variables, each of the variables has a role. We distinguish between:
the response variable (dependent): the outcome of the study; and
the explanatory variable (independent): the variable that is claimed to explain, predict or affect the response.
The variable we wish to predict is commonly called the dependent variable, the outcome variable, or the response variable.
Any variable we are using to predict (or explain differences in) the outcome is commonly called an explanatory variable, an independent variable, a predictor variable, or a covariate.

If we further classify each of the two relevant variables according to type (categorical or quantitative), we get the following 4 possibilities for role-type classification:
Categorical explanatory and quantitative response
Categorical explanatory and categorical response
Quantitative explanatory and quantitative response
Quantitative explanatory and categorical response


Case CQ:
Exploring the relationship amounts
to comparing the distributions of the
quantitative response variable for each
category of the explanatory variable.
To do this, we use:
Display: side-by-side boxplots.
Numerical summaries: descriptive statistics of the
response variable, for each value (category) of the
explanatory variable separately.
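
A minimal sketch of the numerical summaries for this case: group the response values by category, then describe each group separately. The records, category labels and variable names are hypothetical and serve only to show the mechanics.

```python
from statistics import mean, median, stdev

# Hypothetical observations: (category of explanatory variable, response value)
records = [("A", 12.0), ("A", 15.0), ("A", 11.0), ("A", 14.0),
           ("B", 20.0), ("B", 18.0), ("B", 25.0), ("B", 21.0)]

# Collect the response values for each category
groups = {}
for category, value in records:
    groups.setdefault(category, []).append(value)

# Descriptive statistics of the response, one set per category
summary = {
    category: {
        "n": len(vals),
        "mean": mean(vals),
        "median": median(vals),
        "sd": stdev(vals),   # sample standard deviation (n - 1 denominator)
    }
    for category, vals in groups.items()
}

for category, stats in summary.items():
    print(category, stats)
```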


Case CC:
Exploring the relationship amounts
to comparing the distributions of the
categorical response variable, for
each category of the explanatory
variable.
To do this, we use:
Display: two-way table.
Numerical summaries: conditional percentages (of
the response variable for each value (category) of
the explanatory variable separately).
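
Conditional percentages are row percentages when each row of the two-way table is one category of the explanatory variable. The sketch below uses hypothetical counts and labels; only the calculation itself is the point.

```python
# Hypothetical counts: each row is a category of the explanatory variable,
# each column a category of the response variable.
table = {
    "group 1": {"yes": 30, "no": 90},
    "group 2": {"yes": 40, "no": 40},
}

def conditional_percentages(two_way):
    """Convert each row of counts to percentages of that row's total."""
    result = {}
    for row, counts in two_way.items():
        total = sum(counts.values())
        result[row] = {col: 100 * c / total for col, c in counts.items()}
    return result

percentages = conditional_percentages(table)
for row, pcts in percentages.items():
    print(row, pcts)
# group 1 {'yes': 25.0, 'no': 75.0}
# group 2 {'yes': 50.0, 'no': 50.0}
```

Comparing 25% with 50% is meaningful even though the two groups have different sizes, which is why percentages rather than raw counts are compared.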


Here is the two-way table for example:


Another way to visualize the conditional percentages, instead of a table, is the double bar chart


Case QQ
We examine the relationship using:
Display: scatterplot.
When describing the relationship as displayed by the scatterplot, be sure to consider:
Overall pattern: direction, form, strength.
Deviations from the pattern: outliers.
Labeling the scatterplot (including a relevant third categorical variable in our analysis) might add some insight into the nature of the relationship.

Scatter Plot


Interpreting Scatterplots
How do we explore the relationship between two
quantitative variables using the scatterplot?
What should we look at, or pay attention to?


The direction of the relationship can be positive, negative, or neither:


The strength of the linear relationship


In the special case where the scatterplot displays a linear relationship (and only then), we supplement the scatterplot with:
Numerical summaries: Pearson's correlation coefficient (r) measures the direction and, more importantly, the strength of the linear relationship.
The closer r is to 1 (or -1), the stronger the positive (or negative) linear relationship. r is unitless, influenced by outliers, and should be used only as a supplement to the scatterplot.


linear relationship and outliers


When the relationship is linear (as displayed by the scatterplot, and supported by the correlation r), we can summarize the linear pattern using the least squares regression line.
Remember that:
The slope of the regression line tells us the average change in the response variable associated with a 1-unit increase in the explanatory variable.
When using the regression line for predictions, you should beware of extrapolation, that is, predicting for values of the explanatory variable outside the range of the observed data.
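
A minimal sketch of fitting the least squares regression line with the standard closed-form slope and intercept, plus a prediction helper that refuses to extrapolate. The data, the function names and the choice to raise an error are illustrative assumptions.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) of the least squares regression line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

intercept, slope = least_squares_line(x, y)
print(round(intercept, 2), round(slope, 2))  # 0.09 1.97

def predict(x_new):
    """Predict y from the fitted line, only within the observed range of x."""
    if not min(x) <= x_new <= max(x):
        raise ValueError("x_new is outside the observed range (extrapolation)")
    return intercept + slope * x_new

print(round(predict(3), 2))  # 6.0
```

Here the slope of 1.97 means that y is on average 1.97 units higher for each 1-unit increase in x, within the observed range of 1 to 5.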


Least squares regression line


When examining the relationship between two variables (regardless of the case), any observed relationship (association) does not imply causation, due to the possible presence of lurking variables.
When we include a lurking variable in our analysis, we might need to rethink the direction of the relationship: Simpson's paradox.

Simpson's paradox


Note that despite our earlier finding that Hospital A has the higher overall death rate (3% vs. 2%), when we take the lurking variable into account we find that it is actually Hospital B that has the higher death rate, both among the severely ill patients (4% vs. 3.8%) and among the not severely ill patients (1.3% vs. 1%).
Thus, we see that accounting for a lurking variable can change the direction of an association.
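
The reversal can be reproduced numerically. The counts in this sketch are illustrative values chosen so that the rates match the percentages quoted above; the underlying table is not part of this text, so treat the counts as an assumption.

```python
# Illustrative (deaths, patients) counts chosen to reproduce the quoted
# percentages; they are an assumption, not data taken from the slides.
data = {
    "Hospital A": {"severely ill": (57, 1500), "not severely ill": (6, 600)},
    "Hospital B": {"severely ill": (8, 200), "not severely ill": (8, 600)},
}

def death_rate(deaths, patients):
    """Death rate as a percentage."""
    return 100 * deaths / patients

# Overall rates ignore the lurking variable (severity of illness)
overall = {}
for hospital, strata in data.items():
    total_deaths = sum(d for d, _ in strata.values())
    total_patients = sum(p for _, p in strata.values())
    overall[hospital] = death_rate(total_deaths, total_patients)

# Stratified rates compare like with like
by_stratum = {
    hospital: {s: death_rate(d, p) for s, (d, p) in strata.items()}
    for hospital, strata in data.items()
}

for hospital in data:
    rates = ", ".join(f"{s}: {r:.1f}%" for s, r in by_stratum[hospital].items())
    print(f"{hospital}: overall {overall[hospital]:.1f}% ({rates})")
# Hospital A: overall 3.0% (severely ill: 3.8%, not severely ill: 1.0%)
# Hospital B: overall 2.0% (severely ill: 4.0%, not severely ill: 1.3%)
```

With these counts, Hospital A treats far more severely ill patients, which pulls its overall rate up even though its rate is lower within each severity group.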

END
