Biostatistics I

Unit 1: Exploratory Data Analysis
Prof. Siswanto Agus Wilopo, S.U., M.Sc., Sc.D.

Department of Public Health
Faculty of Medicine
Universitas Gadjah Mada
1
sawilopo@yahoo.com
Universitas Gadjah Mada, Faculty of Medicine, Department of Public Health
Table:
Assess the use of tables for each type of data.
Describe what a frequency distribution is.
Create a frequency table from raw data.
Construct relative frequency, cumulative frequency and relative cumulative frequency tables.
Construct grouped frequency tables.
Construct a cross-tabulation table.
Illustrate the use of a contingency table.
Create a table with ranked data.
Graph:
Assessing the most appropriate chart for a given data type.
Construct pie charts and simple, clustered and stacked, bar charts.
Create histograms.
Create step charts and ogives.
Construct time series charts, including statistical process control
(SPC) charts.
Interpret and assess what a chart reveals.
Assess the meaning by looking at the shape of a frequency
distribution.
Appraise negatively skewed, symmetric and positively skewed
distributions.
Describe a bimodal distribution.
Describe the approximate shape of a frequency distribution from a
frequency table or chart.
Assess whether data can be considered normally distributed.
Numeric Summary:
Describe what a summary measure of location is, and understand
the meaning of, and the difference between, the mode, the
median and the mean.
Compute the mode, median and mean for a set of values.
Formulate the role of data type and distributional shape in
choosing the most appropriate measure of location.
Describe what a percentile is, and calculate any given
percentile value.
Describe what a summary measure of spread is.
Differentiate between, and calculate, the
range, the interquartile range and the standard deviation.
Interpret estimated percentile values.
Formulate the role of data type and distributional shape in
choosing the most appropriate measure of spread.
The Big Picture

Recall The Big Picture, the four-step process
that encompasses statistics (as it is presented in
this course):
1. Producing Data: choosing a sample from the
population of interest and collecting data.
2. Exploratory Data Analysis (EDA) {Descriptive
Statistics}: summarizing the data we've collected.
3. and 4. Probability and Inference: drawing conclusions
about the entire population based on the data collected
from the sample.
Even though in practice it is the second step in the
process, we are going to look at Exploratory Data
Analysis (EDA) first.
Goals of EDA
Exploratory Data Analysis (EDA) is how
we make sense of the data by
converting them from their raw form to
a more informative one.
EDA consists of:

organizing and summarizing the raw
data,
discovering important features and
patterns in the data and any striking
deviations from those patterns, and then
interpreting our findings in the context of
the problem
(continued)
EDA can also be useful for:
describing the distribution of a single
variable (center, spread, shape, outliers)
checking data (for errors or other
problems)
checking the assumptions of more complex
statistical analyses
investigating relationships between
variables
EDA
Exploratory data analysis (EDA) methods are
often called Descriptive Statistics due to the
fact that they simply describe, or provide
estimates based on, the data at hand.
In Unit 4 we will cover methods of Inferential
Statistics which use the results of a sample to
make inferences about the population under
study.
Comparisons can be visualized and values of
interest estimated using EDA but descriptive
statistics alone will provide no information
about the certainty of our conclusions.
Important Features of Exploratory Data Analysis
There are two important features to the
structure of the EDA unit in this course:
The material in this unit covers two
broad topics:
Examining Distributions: exploring data one
variable at a time.
Examining Relationships: exploring data two
variables at a time.
Important Features of Exploratory Data Analysis
In Exploratory Data Analysis, our
exploration of data will always consist
of the following two elements:
visual displays, supplemented by
numerical measures.
Try to remember these structural
themes, as they will help you orient
yourself along the path of this unit.
EXAMINING DISTRIBUTIONS
Examining Distributions
We will begin the EDA part of the course
by exploring (or looking at) one variable
at a time.
As we have seen, the data for each
variable consist of a long list of values
(whether numerical or not), and are not
very informative in that form.
Examining Distributions
In order to convert these raw data into
useful information, we need to summarize
and then examine the distribution of the
variable.
By distribution of a variable, we mean:
what values the variable takes, and
how often the variable takes those values.
We will first learn how to summarize and
examine the distribution of a single
categorical variable, and then do the same
for a single quantitative variable.
ONE CATEGORICAL VARIABLE
Example:
Distribution of One Categorical Variable
What is your perception of your own
body? Do you feel that you are
overweight, underweight, or about right?
A random sample of 1,200 college
students were asked this question as
part of a larger survey. The following
table shows part of the responses:
Example: Raw Data (excerpt from the 1,200 students)

Student       Body Image
student 25    overweight
student 26    about right
student 27    underweight
student 28    about right
student 29    about right
Here is some information that would be
interesting to get from these data:
What percentage of the sampled students fall into
each category?
How are students divided across the three body
image categories?
Are they equally divided? If not, do the
percentages follow some other kind of pattern?
There is no way that we can answer these
questions by looking at the raw data, which
are in the form of a long list of 1,200
responses, and thus not very useful.
However, both of these questions will be
easily answered once we summarize and
look at the distribution of the variable Body
Image (i.e., once we summarize how often
each of the categories occurs).
Numerical Measures
In order to summarize the distribution of
a categorical variable, we first create a
table of the different values (categories)
the variable takes, how many times each
value occurs (count) and, more
importantly, how often each value occurs
(by converting the counts to
percentages).
The result is often called a Frequency
Distribution or Frequency Table.
A Frequency Distribution or Frequency Table

Category       Count       Percent
About right    855         (855/1200)*100 = 71.3%
Overweight     235         (235/1200)*100 = 19.6%
Underweight    110         (110/1200)*100 = 9.2%
Total          n = 1200    100%
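The count-to-percent conversion behind a frequency table can be sketched in Python. This is a minimal illustration, not part of the lecture: the helper name frequency_table is hypothetical, the first data set is the five-row raw-data excerpt (students 25 to 29), and the second uses the full-sample counts from the table.

```python
from collections import Counter


def frequency_table(values):
    """Return {category: (count, percent)} for a list of categorical values."""
    counts = Counter(values)
    n = len(values)
    return {cat: (cnt, 100 * cnt / n) for cat, cnt in counts.items()}


# The five responses shown in the raw-data excerpt (students 25 to 29).
excerpt = ["overweight", "about right", "underweight", "about right", "about right"]
table = frequency_table(excerpt)
for category, (count, percent) in table.items():
    print(f"{category:<12} {count:>3} {percent:6.1f}%")

# The same percentage calculation applied to the full-sample counts.
full_counts = {"About right": 855, "Overweight": 235, "Underweight": 110}
n = sum(full_counts.values())
full_percents = {cat: 100 * cnt / n for cat, cnt in full_counts.items()}
for category, percent in full_percents.items():
    print(f"{category:<12} {full_counts[category]:>5} {percent:6.2f}%")
```

The second loop prints two decimals; the table rounds the same values to one decimal.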
Visual or Graphical Displays: Pie Chart
Visual or Graphical Displays
ONE QUANTITATIVE VARIABLE
To display data from one quantitative
variable graphically, we can use either
a histogram or a boxplot.
We will also present several by-hand
displays, such as the stemplot and
dotplot.
Numerical Measures
The overall pattern of the distribution of
a quantitative variable is described by
its shape, center, and spread.
By inspecting the histogram or boxplot,
we can describe the shape of the
distribution, but we can only get a rough
estimate for the center and spread.
Numerical Measures
A description of the distribution of a
quantitative variable must include, in addition
to the graphical display, a more
precise numerical description of the center
and spread of the distribution.
In this lecture you will learn:
how to quantify the center and spread of a distribution with
various numerical measures;
some of the properties of those numerical measures; and
how to choose the appropriate numerical measures of center
and spread to supplement the histogram.
We will also discuss a few measures of position or location,
which allow us to quantify where a particular value lies in
the distribution of all values.
How To Create Histograms

Here are the exam grades of 15 students:
88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73

Score       Count
[40-50)     1
[50-60)     2
[60-70)     4
[70-80)     5
[80-90)     2
[90-100)    1
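The counting step behind the histogram can be sketched as follows. This is an illustrative sketch only: the helper name bin_counts is hypothetical, and the intervals are assumed to be left-closed and right-open, as the [40-50) notation suggests.

```python
grades = [88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73]


def bin_counts(values, start, stop, width):
    """Count values in left-closed, right-open intervals [lo, lo + width)."""
    counts = {}
    for lo in range(start, stop, width):
        hi = lo + width
        counts[(lo, hi)] = sum(1 for v in values if lo <= v < hi)
    return counts


counts = bin_counts(grades, 40, 100, 10)

# Print a text histogram: one row per interval, one star per observation.
for (lo, hi), c in counts.items():
    print(f"[{lo}-{hi}) {c:>2} {'*' * c}")
```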
Stemplot (Stem and Leaf Plot)

The stemplot (also called stem and leaf plot) is
another graphical display of the distribution of
a quantitative variable.
The idea is to separate each data point into a
stem and leaf, as follows:
The leaf is the right-most digit.
The stem is everything except the right-most digit.
So, if the data point is 34, then 3 is the stem and 4 is the leaf.
If the data point is 3.41, then 3.4 is the stem and 1 is the leaf.
Note: For this to work, ALL data points should
be rounded to the same number of decimal
places.
Stemplot (Stem and Leaf Plot)

EXAMPLE: Best Actress Oscar Winners
We will use the Best Actress Oscar winners
example (ages of the winners):
34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21
41 26 80 43 29 33 35 45 49 39 34 26 25 35 33
To make a stemplot:
Separate each observation into a stem and a leaf.
Write the stems in a vertical column with the
smallest at the top, and draw a vertical line at the
right of this column.
Go through the data points, and write each leaf in
the row to the right of its stem.
Rearrange the leaves in an increasing order.
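The stemplot steps above can be sketched in Python. This is a minimal sketch for non-negative whole numbers; the function name stemplot is hypothetical, and showing stems that have no leaves (here, stem 5) is an assumption made so that gaps in the data stay visible.

```python
from collections import defaultdict

ages = [34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61, 21,
        41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33]


def stemplot(values):
    """Return stemplot rows (strings) for a list of non-negative integers."""
    leaves = defaultdict(list)
    for v in sorted(values):          # sorting puts the leaves in increasing order
        stem, leaf = divmod(v, 10)    # leaf = right-most digit, stem = the rest
        leaves[stem].append(leaf)
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):   # smallest stem at the top
        row_leaves = "".join(str(d) for d in leaves[stem])
        rows.append(f"{stem} | {row_leaves}")
    return rows


rows = stemplot(ages)
print("\n".join(rows))
```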
When rotated 90 degrees counterclockwise, the stemplot
visually resembles a histogram.
Summary Measures
Describing Data Numerically
Central Tendency: Arithmetic Mean, Median, Mode, Geometric Mean
Quartiles
Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
Shape: Skewness
Central Tendency
Measures of Central Tendency

Overview
Arithmetic Mean: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
Median: midpoint of ranked values
Mode: most frequently observed value
Geometric Mean: $\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$
Arithmetic Mean
The arithmetic mean (sample mean)
is the most common measure of
central tendency
For a sample of size n:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$
where $n$ is the sample size and $X_1, \ldots, X_n$ are the observed values
Arithmetic Mean
(continued)
The most common measure of central tendency
Mean = sum of values divided by the number of
values
Affected by extreme values (outliers)
Values 1, 2, 3, 4, 5: Mean = (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3
Values 1, 2, 3, 4, 10: Mean = (1 + 2 + 3 + 4 + 10)/5 = 20/5 = 4
Median
In an ordered array, the median is the
middle number (50% above, 50%
below)
[Figure: two number lines from 0 to 10; Median = 3 in both]
Not affected by extreme values
Finding the Median

The location of the median:
$$\text{Median position} = \frac{n+1}{2} \text{ position in the ordered data}$$
If the number of values is odd, the median is the middle
number
If the number of values is even, the median is the average of
the two middle numbers
Note that $\frac{n+1}{2}$ is not the value of the median, only
the position of the median in the ranked data
Mode
A measure of central tendency
Value that occurs most often
Not affected by extreme values
Used for either numerical or
categorical (nominal) data
There may be no mode
There may be several modes
[Figure: number line from 0 to 14, Mode = 9; number line from 0 to 6, No Mode]
Problem
Which measure of location
is the best?
The mean is generally used, unless
extreme values (outliers) exist.
The median is then often used, since the
median is not sensitive to extreme
values.
Measures of Location
Comparison of Mean and Median
Let us use cholesterol data as an example:
145, 159, 166, 166, 195, 205, 250

We found that the mean is 183.7 and the
median is 166.
Measures of Location
Comparison of Mean and Median
Suppose we replace 250 with 215:
145, 159, 166, 166, 195, 205, 215

We will find that the mean is 178.7 and the
median remains 166.
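The comparison can be checked with Python's standard statistics module. This is a small sketch; the variable names original and modified are illustrative.

```python
from statistics import mean, median

original = [145, 159, 166, 166, 195, 205, 250]
modified = [145, 159, 166, 166, 195, 205, 215]   # 250 replaced with 215

for label, data in (("original", original), ("modified", modified)):
    # The mean shifts when the extreme value changes; the median does not.
    print(f"{label}: mean = {mean(data):.1f}, median = {median(data)}")
```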
Geometric Mean
Geometric mean
Used to measure the rate of change of a variable
over time
$$\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$$
Geometric mean rate of return
Measures the status of an investment over time
$$\bar{R}_G = [(1 + R_1) \times (1 + R_2) \times \cdots \times (1 + R_n)]^{1/n} - 1$$
where $R_i$ is the rate of return in time period $i$
Example
An investment of $100,000 declined to $50,000 at
the end of year one and rebounded to $100,000
at end of year two:
X1 = $100,000
X2 = $50,000 (50% decrease)
X3 = $100,000 (100% increase)
The overall two-year return is zero, since it started and
ended at the same level.
Example
(continued)
Use the 1-year returns to compute the
arithmetic mean and the geometric mean:
Arithmetic mean rate of return (misleading result):
$$\bar{X} = \frac{(-50\%) + (100\%)}{2} = 25\%$$
Geometric mean rate of return (more accurate result):
$$\bar{R}_G = [(1 + R_1) \times (1 + R_2)]^{1/2} - 1 = [(1 + (-50\%)) \times (1 + (100\%))]^{1/2} - 1 = [(0.50) \times (2)]^{1/2} - 1 = 1^{1/2} - 1 = 0\%$$
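The two calculations can be reproduced with a short Python sketch. The function names are hypothetical, and returns are written as decimal fractions (-0.50 for a 50% decrease, 1.00 for a 100% increase).

```python
import math


def arithmetic_mean_return(returns):
    """Plain average of the per-period rates of return."""
    return sum(returns) / len(returns)


def geometric_mean_return(returns):
    """Geometric mean rate of return: [(1+R1)(1+R2)...(1+Rn)]^(1/n) - 1."""
    growth = math.prod(1 + r for r in returns)
    return growth ** (1 / len(returns)) - 1


returns = [-0.50, 1.00]   # year one: 50% decrease, year two: 100% increase

arith = arithmetic_mean_return(returns)
geo = geometric_mean_return(returns)
print(f"arithmetic mean return: {arith:.0%}")
print(f"geometric mean return:  {geo:.0%}")
```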
MEASURES OF VARIATION
Measures of Variation
Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
Measures of variation give
information on the spread
or variability of the data
values.
[Figure: two distributions with the same center but different variation]
Range
Simplest measure of variation
Difference between the largest and
the smallest values in a set of data:
$$\text{Range} = X_{\text{largest}} - X_{\text{smallest}}$$
Example (data shown on a number line from 0 to 14, with smallest value 1 and largest value 14):
Range = 14 - 1 = 13
Disadvantages of the Range

Ignores the way in which data are
distributed
[Figure: two differently distributed data sets on a number line from 7 to 12, each with Range = 12 - 7 = 5]
Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
Quartiles
Quartiles split the ranked data into 4 segments
with an equal number of values per segment
25% | Q1 | 25% | Q2 | 25% | Q3 | 25%
The first quartile, Q1, is the value for which 25% of the
observations are smaller and 75% are larger
Q2 is the same as the median (50% are smaller, 50% are
larger)
Only 25% of the observations are greater than the third
quartile
Quartile Formulas
Find a quartile by determining the value in the
appropriate position in the ranked data, where
First quartile position: Q1 = (n+1)/4
Second quartile position: Q2 = (n+1)/2 (the median position)
Third quartile position: Q3 = 3(n+1)/4
where n is the number of observed values
Calculating Quartiles
Example: Find the first quartile
Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22
(n = 9)
Q1 is in the (9+1)/4 = 2.5 position of the ranked data,
so use the value halfway between the 2nd and 3rd values:
Q1 = 12.5
Q1 and Q3 are measures of noncentral location
Q2 = median, a measure of central tendency
Quartiles
(continued)
Example:
Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22
(n = 9)
Q1 is in the (9+1)/4 = 2.5 position of the ranked data,
so Q1 = 12.5
Q2 is in the (9+1)/2 = 5th position of the ranked data,
so Q2 = median = 16
Q3 is in the 3(9+1)/4 = 7.5 position of the ranked data,
so Q3 = 19.5
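The position formulas can be sketched in Python as follows. The function names are hypothetical. Fractional positions are resolved here by linear interpolation between the two neighbouring values, which is an assumption: it matches the halfway rule used in the example (positions 2.5 and 7.5), but other conventions exist for positions such as 2.25. The sketch assumes at least three observations.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in ordered data.

    Fractional positions are interpolated linearly between the two
    neighbouring values (an assumed convention).
    """
    lower = int(position)             # 1-based index of the value just below
    frac = position - lower
    if frac == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])


def quartiles(values):
    """Return (Q1, Q2, Q3) using positions (n+1)/4, (n+1)/2 and 3(n+1)/4."""
    ordered = sorted(values)
    n = len(ordered)
    return tuple(value_at_position(ordered, k * (n + 1) / 4) for k in (1, 2, 3))


data = [11, 12, 13, 16, 16, 17, 18, 21, 22]
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)
```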
Interquartile Range
Can eliminate some outlier problems by
using the interquartile range
Eliminate some high- and low-valued
observations and calculate the range
from the remaining values
Interquartile range = 3rd quartile - 1st quartile
= Q3 - Q1
Interquartile Range
Example:
X minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, X maximum = 70
(25% of the observations fall in each of the four segments)
Interquartile range
= 57 - 30 = 27
Variance
Average (approximately) of squared
deviations of values from the mean
Sample variance:
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
where
$\bar{X}$ = mean
$n$ = sample size
$X_i$ = ith value of the variable X
Standard Deviation
Most commonly used measure of
variation
Shows variation about the mean
Is the square root of the variance
Has the same units as the original data
Sample standard deviation:
$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$
Calculation Example:
Sample Standard Deviation
Sample Data ($X_i$): 10 12 14 15 17 18 18 24
n = 8, Mean = $\bar{X}$ = 16
$$S = \sqrt{\frac{(10 - \bar{X})^2 + (12 - \bar{X})^2 + (14 - \bar{X})^2 + \cdots + (24 - \bar{X})^2}{n - 1}}$$
$$= \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}} = \sqrt{\frac{130}{7}} = 4.3095$$
A measure of the average
scatter around the mean
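The calculation can be traced step by step in Python. This is a sketch using the sample data above; the cross-check uses statistics.stdev from the standard library, which also divides by n - 1.

```python
import math
from statistics import stdev

data = [10, 12, 14, 15, 17, 18, 18, 24]

n = len(data)
mean = sum(data) / n                                 # 16.0
sum_sq = sum((x - mean) ** 2 for x in data)          # sum of squared deviations
variance = sum_sq / (n - 1)                          # sample variance, divisor n - 1
s = math.sqrt(variance)                              # sample standard deviation

print(f"mean = {mean}, sum of squares = {sum_sq}, S = {s:.4f}")
print(f"statistics.stdev gives {stdev(data):.4f}")
```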
Measuring variation
[Figure: two distributions, one with a small standard deviation and one with a large standard deviation]
Comparing Standard Deviations

[Figure: three dot plots on a common axis from 11 to 21]
Data A: Mean = 15.5, S = 3.338
Data B: Mean = 15.5, S = 0.926
Data C: Mean = 15.5, S = 4.567
Advantages of Variance and Standard Deviation
Each value in the data set is used in
the calculation
Values far from the mean are given
extra weight
(because deviations from the mean are squared)
Coefficient of Variation
Measures relative variation
Always in percentage (%)
Shows variation relative to mean
Can be used to compare two or more
sets of data measured in different
units
$$CV = \left(\frac{S}{\bar{X}}\right) \times 100\%$$
Comparing Coefficient
of Variation
Hospital A:
Average surplus in the last 10 years = 50 Billion Rp.
Standard deviation = 5 Billion Rp.
$$CV_A = \left(\frac{S}{\bar{X}}\right) \times 100\% = \frac{5 \text{ Billion Rp.}}{50 \text{ Billion Rp.}} \times 100\% = 10\%$$
Hospital B:
Average surplus in the last 10 years = 100 Billion Rp.
Standard deviation = 5 Billion Rp.
$$CV_B = \left(\frac{S}{\bar{X}}\right) \times 100\% = \frac{5 \text{ Billion Rp.}}{100 \text{ Billion Rp.}} \times 100\% = 5\%$$
Both hospitals have the same standard deviation, but hospital B is
less variable relative to its surplus.
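The hospital comparison can be written as a short Python sketch. The function name coefficient_of_variation is hypothetical, and the figures are in billions of rupiah, as above.

```python
def coefficient_of_variation(std_dev, mean):
    """Coefficient of variation as a percentage: (S / mean) * 100."""
    return 100 * std_dev / mean


cv_a = coefficient_of_variation(5, 50)     # Hospital A
cv_b = coefficient_of_variation(5, 100)    # Hospital B

print(f"CV_A = {cv_a:.0f}%, CV_B = {cv_b:.0f}%")
```

Because the units cancel in the ratio, the same function can compare data sets measured in different units.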
Standardized Scores (Z-Scores)

Z-scores use the mean and standard deviation as the
primary measures of center and spread and are therefore
most useful when the mean and standard deviation are
appropriate, i.e. when the distribution is reasonably
symmetric with no extreme outliers.
For any individual, the z-score tells us how many standard
deviations the raw score for that individual deviates from
the mean and in what direction.
To calculate a z-score, we take the individual value,
subtract the mean, and then divide this difference by the
standard deviation.
A positive z-score indicates the individual is above
average and a negative z-score indicates the individual is
below average.
Z Scores
A measure of distance from the mean (for
example, a Z-score of 2.0 means that a value is 2.0
standard deviations from the mean)
The difference between a value and the mean,
divided by the standard deviation
A Z score above 3.0 or below -3.0 is considered an
outlier
$$Z = \frac{X - \bar{X}}{S}$$
Z Scores
(continued)
Example:
If the mean is 14.0 and the standard deviation is
3.0, what is the Z score for the value 18.5?
$$Z = \frac{X - \bar{X}}{S} = \frac{18.5 - 14.0}{3.0} = 1.5$$
The value 18.5 is 1.5 standard deviations above the
mean
(A negative Z-score would mean that a value is less
than the mean)
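The worked example translates directly into Python. This is a minimal sketch; the function names are hypothetical, and the outlier check applies the rule stated earlier (a Z score above 3.0 or below -3.0).

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev


def is_z_outlier(z, threshold=3.0):
    """Flag a value whose Z score is above 3.0 or below -3.0."""
    return abs(z) > threshold


z = z_score(18.5, 14.0, 3.0)
print(f"Z = {z}, outlier: {is_z_outlier(z)}")
```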
Quantitative and Graphical Approach:
MEASURE SPREAD AND DISTRIBUTION
DESCRIBING DISTRIBUTIONS
Features of Distributions of
Quantitative Variables
Shape
When describing the shape of a
distribution, we should consider:
Symmetry/skewness of the
distribution.
Peakedness (modality): the
number of peaks (modes) the
distribution has.
Symmetry/skewness of the
distribution.
Shape of a Distribution
Describes how data are distributed
Measures of shape
Symmetric or skewed
Left-Skewed: Mean < Median
Symmetric: Mean = Median
Right-Skewed: Median < Mean
Numerical Measures
for a Population
Population summary measures are called
parameters
The population mean is the sum of the values in the
population divided by the population size, N
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}$$
where
$\mu$ = population mean
$N$ = population size
Population Variance
Average of squared deviations of
values from the mean
Population variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$
where
$\mu$ = population mean
$N$ = population size
Population Standard Deviation

Most commonly used measure of
variation
Shows variation about the mean
Is the square root of the population
variance
Has the same units as the original data
Population standard deviation:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$
The Sample Covariance

The sample covariance measures the strength of
the linear relationship between two variables
(called bivariate data)
The sample covariance:
$$\text{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$
Only concerned with the strength of the relationship
No causal effect is implied
Interpreting Covariance
Covariance between two random
variables:
cov(X,Y) > 0: X and Y tend to move in the same direction
cov(X,Y) < 0: X and Y tend to move in opposite directions
cov(X,Y) = 0: X and Y have no linear relationship (independent variables have cov(X,Y) = 0, although cov(X,Y) = 0 alone does not guarantee independence)

Coefficient of Correlation
Measures the relative strength of the
linear relationship between two
variables
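Sample coefficient of correlation:
$$r = \frac{\text{cov}(X, Y)}{S_X S_Y}$$
where
$$\text{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$
$$S_X = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} \qquad S_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}}$$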
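The covariance and correlation formulas can be sketched in Python. The lecture gives no data set for this topic, so the x and y values below are hypothetical and chosen only for illustration; the function names are also hypothetical.

```python
import math


def sample_covariance(xs, ys):
    """cov(X, Y) = sum((Xi - Xbar)(Yi - Ybar)) / (n - 1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def sample_std(values):
    """Sample standard deviation with divisor n - 1."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def correlation(xs, ys):
    """r = cov(X, Y) / (S_X * S_Y)."""
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))


# Hypothetical illustrative data (not from the lecture).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_xy = sample_covariance(x, y)
r = correlation(x, y)
print(f"cov = {cov_xy}, r = {r:.4f}")
```

Unlike the covariance, r is unit free, so rescaling either variable by a positive factor leaves it unchanged.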
Features of
Correlation Coefficient, r
Unit free
Ranges between -1 and 1
The closer to -1, the stronger the
negative linear relationship
The closer to 1, the stronger the
positive linear relationship
The closer to 0, the weaker the linear
relationship
Scatter Plots of Data with Various Correlation Coefficients
[Figure: six scatter plots of Y against X with r = -1, r = -.6, r = 0, r = +1, r = +.3 and r = 0]
The Empirical Rule

If the data distribution is approximately
bell-shaped, then the interval:
$\mu \pm 1\sigma$ contains about 68% of the values
in the population or the sample
The Empirical Rule
$\mu \pm 2\sigma$ contains about 95% of the
values in the population or the sample
$\mu \pm 3\sigma$ contains about 99.7% of the values in
the population or the sample
Chebyshev Rule
Regardless of how the data are
distributed, at least $(1 - 1/k^2) \times 100\%$ of
the values will fall within k standard
deviations of the mean (for k > 1)
Examples:
At least                               within
$(1 - 1/1^2) \times 100\% = 0\%$       k = 1 ($\mu \pm 1\sigma$)
$(1 - 1/2^2) \times 100\% = 75\%$      k = 2 ($\mu \pm 2\sigma$)
$(1 - 1/3^2) \times 100\% = 89\%$      k = 3 ($\mu \pm 3\sigma$)
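The Chebyshev bound can be tabulated next to the Empirical Rule percentages with a short Python sketch. The function name chebyshev_lower_bound is hypothetical; the Empirical Rule figures (68%, 95%, 99.7%) apply only to approximately bell-shaped data, whereas the Chebyshev bound holds for any distribution.

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of values within k standard deviations of the mean."""
    if k < 1:
        raise ValueError("the bound is only informative for k >= 1")
    return (1 - 1 / k**2) * 100


# Empirical Rule percentages for bell-shaped data, as given in the lecture.
empirical = {1: 68.0, 2: 95.0, 3: 99.7}

for k in (1, 2, 3):
    bound = chebyshev_lower_bound(k)
    print(f"k = {k}: at least {bound:.0f}% (Chebyshev), about {empirical[k]}% (bell-shaped)")
```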
MEASURES OF SPREAD
Five-Number Summary
The combination of the five numbers (min, Q1,
M, Q3, Max) is called the five-number
summary.
It provides a quick numerical description of
both the center and spread of a distribution.
Each of the values represents a measure of
position in the dataset.
The min and max provide the boundaries,
and the quartiles and median provide
information about the 25th, 50th, and 75th
percentiles.
Inter-Quartile Range (IQR)
The 1.5(IQR) Criterion for Outliers

An observation is considered
a suspected outlier or potential
outlier if it is:
below Q1 - 1.5(IQR) or
above Q3 + 1.5(IQR)
[Figure (not to scale): illustration of the 1.5(IQR) rule]
EXAMPLE:
Best Actress Oscar Winners
We will continue with the Best Actress Oscar winners example
34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33
35 45 49 39 34 26 25 35 33
We can now use the 1.5(IQR) criterion to check whether the
three highest ages should indeed be classified as potential
outliers:
For this example, we found Q1 = 32 and
Q3 = 41.5, which give an IQR = 9.5
Q1 - 1.5(IQR) = 32 - (1.5)(9.5) = 17.75
Q3 + 1.5(IQR) = 41.5 + (1.5)(9.5) = 55.75
The 1.5(IQR) criterion tells us that any
observation with an age that is below
17.75 or above 55.75 is considered a
suspected outlier.
We therefore conclude that the
observations with ages of 61, 74 and
80 should be flagged as suspected
outliers in the distribution of ages.
Note that since the smallest observation is 21,
there are no suspected low outliers in this
distribution.
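The check above can be reproduced with a Python sketch. The function name quartiles_by_halves is hypothetical. The quartiles are computed here as the medians of the lower and upper halves of the sorted data; this is an assumption, chosen because it reproduces the values Q1 = 32 and Q3 = 41.5 quoted for this example, and other quartile conventions can give slightly different numbers.

```python
from statistics import median

ages = [34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61, 21,
        41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33]


def quartiles_by_halves(values):
    """Q1 and Q3 as medians of the lower and upper halves of the sorted data.

    For an odd number of values the overall median is left out of both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[-half:])


q1, q3 = quartiles_by_halves(ages)
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

# Observations outside the fences are flagged as suspected outliers.
outliers = sorted(a for a in ages if a < low_fence or a > high_fence)
# Largest and smallest observations that are not flagged.
inside = [a for a in ages if low_fence <= a <= high_fence]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"fences: {low_fence} and {high_fence}")
print(f"suspected outliers: {outliers}")
print(f"smallest and largest non-outliers: {min(inside)}, {max(inside)}")
```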
Possible methods for handling outliers in practice
Why is it important to identify possible outliers, and how should
they be dealt with? The answers to these questions depend on the
reasons for the outlying values.
Here are several possibilities:
Even though it is an extreme value, if an outlier can be
understood to have been produced by essentially the same sort
of physical or biological process as the rest of the data, and if
such extreme values are expected to eventually occur again,
then such an outlier indicates something important and
interesting about the process you're investigating, and it should
be kept in the data.
If an outlier can be explained to have been produced
under fundamentally different conditions from the rest of
the data (or by a fundamentally different process), such
an outlier can be removed from the data if your goal is to
investigate only the process that produced the rest of the
data.
An outlier might indicate a mistake in the data (like a
typo, or a measuring error), in which case it should be
corrected if possible or else removed from the data
before calculating summary statistics or making
inferences from the data (and the reason for the mistake
should be investigated).
Identification for suspected outliers
BOXPLOTS
EXAMPLE: Best Actress Oscar Winners

We will use data on the Best Actress Oscar
winners as an example
34 34 26 37 42 41 35 31 41 33 30 74 33 49
38 61 21 41 26 80 43 29 33 35 45 49 39 34
26 25 35 33
The five-number summary of the age of
Best Actress Oscar winners (1970-2001) is:
min = 21, Q1 = 32, M = 35,
Q3 = 41.5, Max = 80
Box Plot and Outliers

Lines extend from the
edges of the box to the
smallest and largest
observations that were
not classified as
suspected outliers
(using the 1.5xIQR
criterion).
In our example, we have
no low outliers, so the
bottom line goes down
to the smallest
observation, which is
21.
Since we have three
high outliers (61,74, and
80), the top line extends
only up to 49, which is
the largest observation
that has not been
flagged as an outlier.
The following information is visually
depicted in the boxplot:
the five-number summary (blue)
the range and IQR (red)
outliers (green)
Side-by-side boxplots of the age
distributions by gender
Box Plot Summarized

The five-number summary of a distribution
consists of M, Q1, Q3 and the extremes Min, Max.
The median describes the center, and the
extremes (which give the range) and the quartiles
(which give the IQR) describe the spread.
The boxplot displays the five-number summary
and any suspected outliers identified by the
1.5 × IQR criterion.
Boxplots are presented side by side to compare
and contrast distributions from two or more
groups.
ROLE-TYPE CLASSIFICATION
Classification
In most studies involving two variables, each of the
variables has a role. We distinguish between:
the response variable (dependent): the outcome of the
study; and
the explanatory variable (independent): the variable that
is claimed to explain, predict or affect the response.
The variable we wish to predict is commonly called
the dependent variable, the outcome variable, or
the response variable.
Any variable we are using to predict (or explain
differences in) the outcome is commonly called
an explanatory variable, an independent variable,
a predictor variable, or a covariate.
If we further classify each of the two relevant
variables according to type (categorical or
quantitative), we get the following 4 possibilities
for role-type classification:
Categorical explanatory and quantitative response (Case CQ)
Categorical explanatory and categorical response (Case CC)
Quantitative explanatory and quantitative response (Case QQ)
Quantitative explanatory and categorical response (Case QC)
Case CQ:
Exploring the relationship amounts
to comparing the distributions of the
quantitative response variable for each
category of the explanatory variable.
To do this, we use:
Display: side-by-side boxplots.
Numerical summaries: descriptive statistics of the
response variable, for each value (category) of the
explanatory variable separately.
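The numerical summaries for Case CQ can be sketched as follows. The records below are hypothetical values invented for illustration (they do not come from the lecture), and summarize_by_group is an illustrative helper, not a library function.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# Hypothetical records: (category of explanatory variable, quantitative response)
records = [("A", 12), ("A", 15), ("A", 11), ("A", 14),
           ("B", 18), ("B", 21), ("B", 17), ("B", 20)]


def summarize_by_group(rows):
    """Descriptive statistics of the response, separately for each category."""
    groups = defaultdict(list)
    for category, value in rows:
        groups[category].append(value)
    return {
        category: {
            "n": len(values),
            "mean": mean(values),
            "median": median(values),
            "sd": stdev(values),   # sample standard deviation
            "min": min(values),
            "max": max(values),
        }
        for category, values in groups.items()
    }


for category, stats in summarize_by_group(records).items():
    print(category, stats)
```

Comparing the rows of this output plays the same role numerically as comparing the boxes in a side-by-side boxplot.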
Case CC:
Exploring the relationship amounts
to comparing the distributions of the
categorical response variable, for
each category of the explanatory
variable.
To do this, we use:
Display: two-way table.
Numerical summaries: conditional percentages (of
the response variable for each value (category) of
the explanatory variable separately).
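Conditional percentages can be computed from a two-way table of counts as in the sketch below. The counts and category labels are hypothetical, chosen only to illustrate the calculation; rows are assumed to hold the explanatory variable and columns the response variable.

```python
# Hypothetical two-way table of counts:
# outer keys = categories of the explanatory variable,
# inner keys = categories of the response variable
counts = {
    "group 1": {"yes": 30, "no": 90},
    "group 2": {"yes": 40, "no": 40},
}


def conditional_percentages(table):
    """Percentage of each response category within each explanatory category."""
    result = {}
    for row, cells in table.items():
        row_total = sum(cells.values())
        result[row] = {col: 100 * n / row_total for col, n in cells.items()}
    return result


for row, percents in conditional_percentages(counts).items():
    print(row, percents)
# group 1 {'yes': 25.0, 'no': 75.0}
# group 2 {'yes': 50.0, 'no': 50.0}
```

Each row is divided by its own total, so the percentages in every row add up to 100 and the rows can be compared even when the group sizes differ.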
Here is an example of a two-way table:
Another way to visualize the conditional percentages, instead of a
table, is the double bar chart.
Case QQ:
We examine the relationship using:
Display: scatterplot.
When describing the relationship as
displayed by the scatterplot, be sure to
consider:
Overall pattern: direction, form, strength.
Deviations from the pattern: outliers.
Labeling the scatterplot (including a
relevant third categorical variable in our
analysis) might add some insight into the
nature of the relationship.
Scatter Plot
Interpreting Scatterplots
How do we explore the relationship between two
quantitative variables using the scatterplot?
What should we look at, or pay attention to?
The direction of the relationship can
be positive, negative, or neither:
The strength of the linear relationship
In the special case where
the scatterplot displays a linear
relationship (and only then), we supplement
the scatterplot with:
Numerical summaries: Pearson's correlation
coefficient (r) measures the direction and, more
importantly, the strength of the linear relationship.
The closer r is to 1 (or -1), the stronger the positive
(or negative) linear relationship. r is unitless,
influenced by outliers, and should be used only as
a supplement to the scatterplot.
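The properties of r listed above can be illustrated with a small Python sketch. The data values are hypothetical and pearson_r is an illustrative helper written from the usual definition of the correlation coefficient; it is not taken from the lecture.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data with a strong positive linear pattern
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(x, y)
r_rescaled = pearson_r([100 * v for v in x], y)  # same data, x in other units
r_outlier = pearson_r(x + [6], y + [0.0])        # one outlier added

print(round(r, 3), round(r_rescaled, 3), round(r_outlier, 3))
```

Changing the units of x leaves r unchanged (r is unitless), while a single outlying point pulls r down sharply, which is why r should be read alongside the scatterplot rather than on its own.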
Linear relationship and outliers
When the relationship is linear (as
displayed by the scatterplot, and
supported by the correlation r), we can
summarize the linear pattern using
the least squares regression line.
Remember that:
The slope of the regression line tells us the average change in
the response variable associated with a 1-unit increase in the
explanatory variable.
When using the regression line for predictions, you should
beware of extrapolation.
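A minimal sketch of the least squares regression line follows, using the standard formulas slope = Sxy / Sxx and intercept = mean(y) - slope × mean(x). The data are hypothetical, and the helper names least_squares_line and predict are assumptions made for this example; predict refuses to extrapolate beyond the observed range of x.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x_new, xs, slope, intercept):
    """Predict the response, but only inside the observed range of x."""
    if not min(xs) <= x_new <= max(xs):
        raise ValueError("x_new is outside the observed range (extrapolation)")
    return intercept + slope * x_new


# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

slope, intercept = least_squares_line(x, y)
print(round(slope, 2), round(intercept, 2))          # 0.6 2.2
print(round(predict(3.5, x, slope, intercept), 2))   # 4.3
```

Here the slope of 0.6 means that, on average, the response is 0.6 units higher for each 1-unit increase in the explanatory variable, within the observed range of 1 to 5.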
Least squares regression line
When examining the relationship between
two variables (regardless of the case),
any observed
relationship (association) does not imply
causation, due to the possible presence
of lurking variables.
When we include a lurking variable in our
analysis, we might need to rethink the
direction of the relationship: Simpson's
paradox.
Simpson's paradox
Note that despite our earlier finding that overall Hospital A has a
higher death rate (3% vs. 2%), when we take into account the lurking
variable (severity of illness), we find that it is actually Hospital B
that has the higher death rate, both among the severely ill patients
(4% vs. 3.8%) and among the not severely ill patients (1.3% vs. 1%).
Thus, we see that adding a lurking variable can change the direction
of an association.
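The reversal can be checked numerically with the sketch below. The counts are assumed for illustration: they are chosen so that they reproduce the percentages quoted above, and they are not taken from the slides' own tables. The helper names are likewise hypothetical.

```python
# Assumed counts as (deaths, patients) per hospital and severity group,
# chosen to reproduce the percentages quoted in the text
data = {
    "Hospital A": {"severely ill": (57, 1500), "not severely ill": (6, 600)},
    "Hospital B": {"severely ill": (8, 200), "not severely ill": (8, 600)},
}


def stratum_rate(hospital, stratum):
    """Death rate (%) within one severity group of one hospital."""
    deaths, patients = data[hospital][stratum]
    return 100 * deaths / patients


def overall_rate(hospital):
    """Death rate (%) over all patients of one hospital."""
    deaths = sum(d for d, _ in data[hospital].values())
    patients = sum(n for _, n in data[hospital].values())
    return 100 * deaths / patients


for hospital in data:
    print(hospital, "overall:", round(overall_rate(hospital), 1))
    for stratum in data[hospital]:
        print("  ", stratum, round(stratum_rate(hospital, stratum), 1))
```

Under these assumed counts, Hospital A treats far more severely ill patients (1500 of 2100) than Hospital B (200 of 800), which is how A can have the higher overall death rate while having the lower rate within each severity group.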
END