sawilopo@yahoo.com
Table:
Assess the use of tables for each type of data.
Describe what a frequency distribution is.
Create a frequency table from raw data.
Construct relative frequency, cumulative frequency and relative cumulative frequency tables.
Construct grouped frequency tables.
Construct a cross-tabulation table.
Illustrate the use of a contingency table.
Create a table with rank data.
Biostatistics I: 2013
Universitas Gadjah Mada, Faculty of Medicine, Department of Public Health
Graph:
Assess the most appropriate chart for a given data type.
Construct pie charts and simple, clustered and stacked bar charts.
Create histograms.
Create step charts and ogives.
Construct time series charts, including statistical process control (SPC) charts.
Interpret and assess what a chart reveals.
Assess the meaning of a frequency distribution by looking at its shape.
Appraise negatively skewed, symmetric and positively skewed distributions.
Describe a bimodal distribution.
Describe the approximate shape of a frequency distribution from a frequency table or chart.
Assess whether data can be considered normally distributed.
Numeric Summary:
Describe what a summary measure of location is, and understand the meaning of, and the difference between, the mode, the median and the mean.
Compute the mode, median and mean for a set of values.
Formulate the role of data type and distributional shape in choosing the most appropriate measure of location.
Describe what a percentile is, and calculate any given percentile value.
Describe what a summary measure of spread is.
Differentiate between, and calculate, the range, the interquartile range and the standard deviation.
Estimate and interpret percentile values.
Formulate the role of data type and distributional shape in choosing the most appropriate measure of spread.
Goals of EDA
Exploratory Data Analysis (EDA) is how
we make sense of the data by
converting them from their raw form to
a more informative one.
EDA can be useful for:
describing the distribution of a single variable (center, spread, shape, outliers)
checking data (for errors or other problems)
checking assumptions for more complex statistical analyses
investigating relationships between variables
EDA
Exploratory data analysis (EDA) methods are
often called Descriptive Statistics due to the
fact that they simply describe, or provide
estimates based on, the data at hand.
In Unit 4 we will cover methods of Inferential
Statistics which use the results of a sample to
make inferences about the population under
study.
Comparisons can be visualized and values of
interest estimated using EDA but descriptive
statistics alone will provide no information
about the certainty of our conclusions.
EXAMINING DISTRIBUTIONS
Examining Distributions
We will begin the EDA part of the course
by exploring (or looking at) one variable
at a time.
As we have seen, the data for each
variable consist of a long list of values
(whether numerical or not), and are not
very informative in that form.
Examining Distributions
In order to convert these raw data into
useful information, we need to summarize
and then examine the distribution of the
variable.
By distribution of a variable, we mean:
what values the variable takes, and
how often the variable takes those values.
Example:
Distribution of One Categorical Variable
What is your perception of your own
body? Do you feel that you are
overweight, underweight, or about right?
A random sample of 1,200 college
students were asked this question as
part of a larger survey. The following
table shows part of the responses:
Student       Body Image
student 25    overweight
student 26    about right
student 27    underweight
student 28    about right
student 29    about right
Numerical Measures
In order to summarize the distribution of
a categorical variable, we first create a
table of the different values (categories)
the variable takes, how many times each
value occurs (count) and, more
importantly, how often each value occurs
(by converting the counts to
percentages).
The result is often called a Frequency
Distribution or Frequency Table.
Count       Percent
855         (855/1200)*100 = 71.3%
235         (235/1200)*100 = 19.6%
110         (110/1200)*100 = 9.2%
n=1200      100%
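The count-to-percentage step can be sketched in Python. This is a minimal illustration, not the course's own procedure: the category labels are placeholders, because the extracted table does not show which response goes with which count, and the helper name `frequency_table` is hypothetical.

```python
import math
from collections import Counter

def frequency_table(values):
    """Return {category: (count, percent)} ordered by descending count."""
    counts = Counter(values)
    n = len(values)
    table = {}
    for category, count in counts.most_common():
        # Round half up to one decimal place, as in a hand calculation.
        percent = math.floor(100 * count / n * 10 + 0.5) / 10
        table[category] = (count, percent)
    return table

# Placeholder labels; only the three counts come from the slide.
responses = (["category 1"] * 855 + ["category 2"] * 235
             + ["category 3"] * 110)
table = frequency_table(responses)

for category, (count, percent) in table.items():
    print(f"{category:<12}{count:>6}{percent:>8}%")
```

Percentages are rounded independently, so they can add up to slightly more or less than 100%.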
Numerical Measures
The overall pattern of the distribution of
a quantitative variable is described by
its shape, center, and spread.
By inspecting the histogram or boxplot,
we can describe the shape of the
distribution, but we can only get a rough
estimate for the center and spread.
Numerical Measures
A description of the distribution of a
quantitative variable must include, in addition
to the graphical display, a more
precise numerical description of the center
and spread of the distribution.
In this lecture you will learn:
how to quantify the center and spread of a distribution with
various numerical measures;
some of the properties of those numerical measures; and
how to choose the appropriate numerical measures of center
and spread to supplement the histogram.
We will also discuss a few measures of position or location
which allow us to quantify where a particular value lies in
the distribution of all values.
[Figure: counts by class interval [40-50), [50-60), [60-70), [70-80), [80-90), [90-100)]
Summary Measures
Describing Data Numerically
Central Tendency: Arithmetic Mean, Median, Mode, Geometric Mean
Quartiles
Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
Shape: Skewness
Central Tendency
Arithmetic Mean: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
Median: midpoint of ranked values
Mode: most frequently observed value
Geometric Mean: $\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$
Arithmetic Mean
The arithmetic mean (sample mean)
is the most common measure of
central tendency
For a sample of size n:
$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$
where n is the sample size and $X_1, \ldots, X_n$ are the observed values
Arithmetic Mean
(continued)
Data 1, 2, 3, 4, 5: Mean = (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3
Data 1, 2, 3, 4, 10: Mean = (1 + 2 + 3 + 4 + 10)/5 = 20/5 = 4
Median
In an ordered array, the median is the middle number (50% above, 50% below)
[Figure: two dot plots on a 0 to 10 axis, each with Median = 3]
Median position = $\frac{n+1}{2}$ position in the ordered data
If the number of values is odd, the median is the middle number
If the number of values is even, the median is the average of the two middle numbers
Note that $\frac{n+1}{2}$ is not the value of the median, only the position of the median in the ranked data
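The position rule can be sketched in Python as follows. The function name `median_by_position` is hypothetical. The two data sets are the ones used in the arithmetic mean example, chosen to show that replacing 5 with 10 moves the mean but not the median.

```python
def median_by_position(values):
    """Median found via the (n + 1) / 2 position in the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2              # 1-based position, may end in .5
    if n % 2 == 1:
        # Odd n: the position is a whole number, take that value.
        return ordered[int(position) - 1]
    # Even n: the position falls halfway between two values, average them.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2

data_1 = [1, 2, 3, 4, 5]
data_2 = [1, 2, 3, 4, 10]               # same data with 5 replaced by 10

median_1 = median_by_position(data_1)
median_2 = median_by_position(data_2)
mean_1 = sum(data_1) / len(data_1)
mean_2 = sum(data_2) / len(data_2)

print(median_1, median_2)               # the median stays the same
print(mean_1, mean_2)                   # the mean shifts with the extreme value
```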
Mode
A measure of central tendency
Value that occurs most often
Not affected by extreme values
Used for either numerical or
categorical (nominal) data
There may be no mode
There may be several modes
[Figure: dot plot on a 0 to 14 axis, Mode = 9]
[Figure: dot plot on a 0 to 6 axis, No Mode]
Problem
Which measure of location
is the best?
Mean is generally used, unless
extreme values (outliers) exist
Then median is often used, since the
median is not sensitive to extreme
values.
Measures of Location
Comparison of Mean and Median
Let us use cholesterol data as an example:
Measures of Location
Comparison of Mean and Median
Suppose we replace 250 with 215:
Geometric Mean
Used to measure the rate of change of a variable over time
$\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$
Geometric mean rate of return:
$\bar{R}_G = [(1 + R_1) \times (1 + R_2) \times \cdots \times (1 + R_n)]^{1/n} - 1$
Where $R_i$ is the rate of return in time period i
Example
An investment of $100,000 declined to $50,000 at
the end of year one and rebounded to $100,000
at end of year two:
X1 = $100,000
X2 = $50,000 (50% decrease)
X3 = $100,000 (100% increase)
Example
(continued)
Arithmetic mean rate of return:
$\bar{X} = \frac{(-50\%) + (100\%)}{2} = 25\%$ (misleading result)
Geometric mean rate of return:
$\bar{R}_G = [(1 + R_1) \times (1 + R_2)]^{1/2} - 1 = [(0.50) \times (2.00)]^{1/2} - 1 = 0\%$ (more accurate result)
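The two averages can be checked with a short Python sketch. The function name `geometric_mean_return` is hypothetical, and the rates are the -50% and +100% changes from the investment example, written as decimals.

```python
import math

def geometric_mean_return(rates):
    """Geometric mean rate of return for per-period rates given as decimals."""
    growth = math.prod(1 + r for r in rates)   # overall growth factor
    return growth ** (1 / len(rates)) - 1

rates = [-0.50, 1.00]                          # year one: -50%, year two: +100%

arithmetic = sum(rates) / len(rates)
geometric = geometric_mean_return(rates)

print(f"arithmetic mean rate: {arithmetic:.0%}")
print(f"geometric mean rate:  {geometric:.0%}")
```

The investment ends at its starting value of $100,000, which is why the geometric result of 0% describes it better than the arithmetic 25%.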
MEASURES OF VARIATION
Measures of Variation
Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
Range
Simplest measure of variation
Difference between the largest and
the smallest values in a set of data:
Range = $X_{\text{largest}} - X_{\text{smallest}}$
Example:
[Figure: dot plot on a 0 to 14 axis]
Range = 14 - 1 = 13
[Figure: two data sets with different distributions on the same axis, each with Range = 12 - 7 = 5]
Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
Quartiles
Quartiles split the ranked data into 4 segments
with an equal number of values per segment
25% | Q1 | 25% | Q2 | 25% | Q3 | 25%
The first quartile, Q1, is the value for which 25% of the
observations are smaller and 75% are larger
Q2 is the same as the median (50% are smaller, 50% are
larger)
Only 25% of the observations are greater than the third
quartile
Quartile Formulas
Find a quartile by determining the value in the
appropriate position in the ranked data, where
First quartile position: Q1 = (n+1)/4
Second quartile position: Q2 = (n+1)/2 (the median position)
Third quartile position: Q3 = 3(n+1)/4
Calculating Quartiles
Example: Find the first quartile
Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22
(n = 9)
Q1 is in the (9+1)/4 = 2.5 position of the ranked data
so use the value half way between the 2nd and 3rd values,
so
Q1 = 12.5
Quartiles
(continued)
Example:
Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22
(n = 9)
Q1 is in the (9+1)/4 = 2.5 position of the ranked data,
so Q1 = 12.5
Q2 is in the (9+1)/2 = 5th position of the ranked data,
so Q2 = median = 16
Q3 is in the 3(9+1)/4 = 7.5 position of the ranked data,
so Q3 = 19.5
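The position-based calculation can be sketched in Python. The function names are hypothetical. One assumption goes beyond the slides: a fractional position is resolved by linear interpolation between the two neighbouring values, which agrees with the "half way between" rule used here for positions ending in .5.

```python
def value_at_position(ordered, position):
    """Value at a 1-based position in ordered data, interpolating if needed."""
    lower = int(position)
    fraction = position - lower
    if fraction == 0:
        return ordered[lower - 1]
    # Assumption: interpolate linearly between the two neighbouring values.
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)

def quartiles(values):
    """Return (Q1, Q2, Q3) using the positions k(n + 1)/4 for k = 1, 2, 3."""
    ordered = sorted(values)
    n = len(ordered)
    return tuple(value_at_position(ordered, k * (n + 1) / 4) for k in (1, 2, 3))

data = [11, 12, 13, 16, 16, 17, 18, 21, 22]    # ordered array from the example
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)
```

Other software may use slightly different quartile conventions, so results for small samples can differ from this rule.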
Interquartile Range
Can eliminate some outlier problems by using the interquartile range
Eliminate some high- and low-valued observations and calculate the range from the remaining values
Interquartile range = 3rd quartile - 1st quartile = Q3 - Q1
Interquartile Range
Example:
X minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, X maximum = 70
25% of the values fall in each of the four segments between these points
Interquartile range = 57 - 30 = 27
Variance
Average (approximately) of squared
deviations of values from the mean
Sample variance:
$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
Where
$\bar{X}$ = mean
n = sample size
$X_i$ = ith value of the variable X
Standard Deviation
Most commonly used measure of
variation
Shows variation about the mean
Is the square root of the variance
Has the same units as the original data
Sample standard deviation:
$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$
Calculation Example:
Sample Standard Deviation
Sample Data (Xi): 10 12 14 15 17 18 18 24
n = 8, Mean = $\bar{X}$ = 16
$S = \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}} = \sqrt{\frac{130}{7}} = 4.3095$
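The same calculation can be followed step by step in Python. This is a sketch using the eight sample values from the example; the variable names are illustrative, and `statistics.stdev` from the standard library is used only as a cross-check.

```python
import math
import statistics

data = [10, 12, 14, 15, 17, 18, 18, 24]

n = len(data)
mean = sum(data) / n
# Sum of squared deviations from the mean (the numerator of the variance).
squared_deviations = sum((x - mean) ** 2 for x in data)
variance = squared_deviations / (n - 1)     # sample variance, divisor n - 1
s = math.sqrt(variance)                     # sample standard deviation

print(f"mean = {mean}, sum of squares = {squared_deviations}")
print(f"S = {s:.4f}")
print(f"statistics.stdev gives {statistics.stdev(data):.4f}")
```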
Measuring variation
Small standard deviation
[Figure: three dot plots on an 11 to 21 axis, each with Mean = 15.5]
Data A: Mean = 15.5, S = 3.338
Data B: Mean = 15.5, S = 0.926
Data C: Mean = 15.5, S = 4.567
Coefficient of Variation
Measures relative variation
Always in percentage (%)
Shows variation relative to mean
Can be used to compare two or more
sets of data measured in different
units
$CV = \left( \frac{S}{\bar{X}} \right) \times 100\%$
Comparing Coefficient
of Variation
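Hospital A:
Average surplus in the last 10 years = 50 Billion Rp.
Standard deviation = 5 Billion Rp.
$CV_A = \left( \frac{S}{\bar{X}} \right) \times 100\% = \frac{5 \text{ Bill Rp.}}{50 \text{ Bill Rp.}} \times 100\% = 10\%$
Hospital B:
Average surplus in the last 10 years = 100 Billion Rp.
Standard deviation = 5 Billion Rp.
$CV_B = \left( \frac{S}{\bar{X}} \right) \times 100\% = \frac{5 \text{ Bill Rp.}}{100 \text{ Bill Rp.}} \times 100\% = 5\%$
Both hospitals have the same standard deviation, but Hospital B is less variable relative to its mean.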
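A minimal Python sketch of the comparison, using the hospital figures above (in billion Rp.). The function name `coefficient_of_variation` is hypothetical.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return 100 * std_dev / mean

cv_a = coefficient_of_variation(std_dev=5, mean=50)     # Hospital A
cv_b = coefficient_of_variation(std_dev=5, mean=100)    # Hospital B

print(f"CV_A = {cv_a:.0f}%, CV_B = {cv_b:.0f}%")
```

Because the units cancel in the ratio, the same function can compare data sets measured in different units.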
Z Scores
A measure of distance from the mean (for example, a Z-score of 2.0 means that a value is 2.0 standard deviations from the mean)
The difference between a value and the mean, divided by the standard deviation
A Z score above 3.0 or below -3.0 is considered an outlier
$Z = \frac{X - \bar{X}}{S}$
Z Scores
(continued)
Example:
If the mean is 14.0 and the standard deviation is
3.0, what is the Z score for the value 18.5?
$Z = \frac{X - \bar{X}}{S} = \frac{18.5 - 14.0}{3.0} = 1.5$
The value 18.5 is 1.5 standard deviations above the mean
(A negative Z-score would mean that a value is less than the mean)
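The Z score and the outlier rule of thumb can be sketched in Python. The function names are hypothetical; the 3.0 cut-off is the one stated on the Z Scores slide.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations a value lies from the mean."""
    return (value - mean) / std_dev

def is_outlier(z, threshold=3.0):
    """Flag Z scores above +threshold or below -threshold."""
    return abs(z) > threshold

z = z_score(18.5, mean=14.0, std_dev=3.0)
print(z, is_outlier(z))
```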
DESCRIBING DISTRIBUTIONS
Features of Distributions of
Quantitative Variables
Shape
When describing the shape of a
distribution, we should consider:
Symmetry/skewness of the distribution.
Peakedness (modality): the number of peaks (modes) the distribution has.
Symmetry/skewness of the
distribution.
Shape of a Distribution
Describes how data are distributed
Measures of shape
Symmetric or skewed
Left-Skewed: typically Mean < Median
Symmetric: Mean = Median
Right-Skewed: typically Median < Mean
Numerical Measures
for a Population
Population summary measures are called
parameters
The population mean is the sum of the values in the
population divided by the population size, N
$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}$
Where
$\mu$ = population mean
N = population size
$X_i$ = ith value of the variable X
Population Variance
Average of squared deviations of
values from the mean
Population variance:
$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
Where
$\mu$ = population mean
N = population size
$X_i$ = ith value of the variable X
Population standard deviation:
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$
Sample covariance:
$\text{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$
Interpreting Covariance
Covariance between two random variables:
cov(X,Y) > 0: X and Y tend to move in the same direction
cov(X,Y) < 0: X and Y tend to move in opposite directions
cov(X,Y) = 0: X and Y have no linear relationship
Coefficient of Correlation
Measures the relative strength of the
linear relationship between two
variables
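Sample coefficient of correlation:
$r = \frac{\text{cov}(X, Y)}{S_X S_Y}$
where
$\text{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$
$S_X = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$
$S_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}}$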
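These formulas translate directly into Python. The sketch below uses hypothetical data, not data from the lecture: `y` is exactly twice `x`, so the expected correlation is +1. The function names are illustrative.

```python
import math

def sample_covariance(xs, ys):
    """Sum of cross-products of deviations, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def sample_std(xs):
    """Sample standard deviation; the variance is cov(X, X)."""
    return math.sqrt(sample_covariance(xs, xs))

def correlation(xs, ys):
    """Sample coefficient of correlation r = cov(X, Y) / (S_X * S_Y)."""
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))

# Hypothetical illustration data: y is exactly 2 * x.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

cov_xy = sample_covariance(x, y)
r = correlation(x, y)
print(f"cov = {cov_xy}, r = {r:.3f}")
```

Dividing by the two standard deviations is what makes r unit free, unlike the covariance.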
Features of
Correlation Coefficient, r
Unit free
Ranges between -1 and 1
The closer to -1, the stronger the negative linear relationship
The closer to 1, the stronger the positive linear relationship
The closer to 0, the weaker the linear relationship
[Figure: scatter plots of Y against X illustrating r = -1, r = -0.6, r = 0, r = +1, r = +0.3 and r = 0]
For bell-shaped (approximately normal) data:
$\mu \pm 1\sigma$ contains about 68% of the values in the population or the sample
$\mu \pm 2\sigma$ contains about 95% of the values in the population or the sample
$\mu \pm 3\sigma$ contains about 99.7% of the values in the population or the sample
Chebyshev Rule
Regardless of how the data are distributed, at least $(1 - 1/k^2) \times 100\%$ of the values will fall within k standard deviations of the mean (for k > 1)
Examples:
At least                         within
(1 - 1/1^2) x 100% = 0%          k = 1 ($\mu \pm 1\sigma$)
(1 - 1/2^2) x 100% = 75%         k = 2 ($\mu \pm 2\sigma$)
(1 - 1/3^2) x 100% = 89%         k = 3 ($\mu \pm 3\sigma$)
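The bound is easy to tabulate in Python. The function name `chebyshev_lower_bound` is hypothetical; it returns the guaranteed minimum percentage, not the actual share observed in any particular data set.

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of values within k standard deviations of the mean."""
    if k <= 0:
        raise ValueError("k must be positive")
    # The bound carries no information for k <= 1, so it is floored at 0%.
    return max(0.0, (1 - 1 / k ** 2) * 100)

bounds = {k: chebyshev_lower_bound(k) for k in (1, 2, 3)}
for k, bound in bounds.items():
    print(f"k = {k}: at least {bound:.0f}%")
```

These guarantees are much weaker than the 68%, 95% and 99.7% figures, which apply only to bell-shaped data.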
MEASURES OF SPREAD
Five-Number Summary
The combination of the five numbers (min, Q1, M, Q3, Max) is called the five-number summary.
It provides a quick numerical description of both the center and spread of a distribution.
Each of the values represents a measure of position in the dataset.
The min and max provide the boundaries, and the quartiles and median provide information about the 25th, 50th, and 75th percentiles.
EXAMPLE:
Best Actress Oscar Winners
We will continue with the Best Actress Oscar winners example
34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33
35 45 49 39 34 26 25 35 33
We can now use the 1.5(IQR) criterion to check whether the
three highest ages should indeed be classified as potential
outliers:
For this example, we found Q1 = 32 and
Q3 = 41.5 which give an IQR = 9.5
Q1 - 1.5 (IQR) = 32 - (1.5)(9.5) = 17.75
Q3 + 1.5 (IQR) = 41.5 + (1.5)(9.5) = 55.75
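The check can be sketched in Python with the 32 ages listed above. It assumes that the quartiles are taken as the medians of the lower and upper halves of the ordered data, which reproduces the stated Q1 = 32 and Q3 = 41.5. The function name `quartiles_by_halves` is hypothetical.

```python
import statistics

ages = [34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61,
        21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33]

def quartiles_by_halves(values):
    """Q1 and Q3 as medians of the lower and upper halves (assumed method)."""
    ordered = sorted(values)
    n = len(ordered)
    lower_half = ordered[: n // 2]
    upper_half = ordered[(n + 1) // 2:]    # skips the middle value when n is odd
    return statistics.median(lower_half), statistics.median(upper_half)

q1, q3 = quartiles_by_halves(ages)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers.
outliers = [a for a in sorted(ages) if a < lower_fence or a > upper_fence]
print(q1, q3, iqr, lower_fence, upper_fence, outliers)
```

Under this rule the three highest ages (61, 74 and 80) lie above the upper fence, and no age lies below the lower fence.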
BOXPLOTS
ROLE-TYPE CLASSIFICATION
Classification
In most studies involving two variables, each of the
variables has a role. We distinguish between:
the response variable (dependent): the outcome of the study; and
the explanatory variable (independent): the variable that claims to explain, predict or affect the response.
Case CQ:
Exploring the relationship amounts
to comparing the distributions of the
quantitative response variable for each
category of the explanatory variable.
To do this, we use:
Display: side-by-side boxplots.
Numerical summaries: descriptive statistics of the
response variable, for each value (category) of the
explanatory variable separately.
Case CC:
Exploring the relationship amounts
to comparing the distributions of the
categorical response variable, for
each category of the explanatory
variable.
To do this, we use:
Display: two-way table.
Numerical summaries: conditional percentages (of
the response variable for each value (category) of
the explanatory variable separately).
Case QQ
We examine the relationship using:
Display: scatterplot.
When describing the relationship as displayed by the scatterplot, be sure to consider:
Overall pattern: direction, form, strength.
Deviations from the pattern: outliers.
Scatter Plot
Interpreting Scatterplots
How do we explore the relationship between two
quantitative variables using the scatterplot?
What should we look at, or pay attention to?
Simpson's paradox
Note that despite our earlier finding that overall Hospital A has a
higher death rate (3% vs. 2%) when we take into account the lurking
variable, we find that actually it is Hospital B that has the higher
death rate both among the severely ill patients (4% vs. 3.8%) and
among the not severely ill patients (1.3% vs. 1%).
Thus, we see that adding a lurking variable can change the direction
of an association.
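The reversal can be reproduced in Python. The patient counts below are hypothetical, chosen only so that they give the percentages quoted above; the underlying tables did not survive in these notes. The names `counts`, `death_rate` and `overall_rate` are illustrative.

```python
# Hypothetical (deaths, patients) counts that reproduce the quoted percentages.
counts = {
    ("A", "severely ill"): (57, 1500),
    ("B", "severely ill"): (8, 200),
    ("A", "not severely ill"): (6, 600),
    ("B", "not severely ill"): (8, 600),
}

def death_rate(deaths, patients):
    """Death rate as a percentage."""
    return 100 * deaths / patients

def overall_rate(hospital):
    """Death rate for one hospital, pooling both severity groups."""
    deaths = sum(d for (h, _), (d, _) in counts.items() if h == hospital)
    patients = sum(p for (h, _), (_, p) in counts.items() if h == hospital)
    return death_rate(deaths, patients)

overall = {h: overall_rate(h) for h in ("A", "B")}
by_group = {key: death_rate(*value) for key, value in counts.items()}

print("overall:", overall)
for (hospital, group), rate in by_group.items():
    print(f"Hospital {hospital}, {group}: {rate:.1f}%")
```

With these counts Hospital A treats far more severely ill patients, which pulls its overall rate up even though its rate is lower within each severity group.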
END