Biological statistics
Li, Xinhai
Phone: 64807898
Email: lixh@ioz.ac.cn
Homepage: http://people.gucas.ac.cn/~LiXinhai
Blog: http://blog.sciencenet.cn/u/lixinhai
Miniblog: http://weibo.com/lixinhaiblog
Biostatistics
Lecture 1. Brief history, basic concepts and descriptive statistics Xinhai Li
Text books
Biostatistics or biometry
What is statistics?
http://teeky.org/search-engine-optimization/determine-success-via-website-statistics/
Charles J. Krebs. 1999. Ecological Methodology, 2nd ed. Addison-Wesley Educational Publishers, Inc.
Ilker Ercan, Berna Yazıcı, Yaning Yang, Güven Özkaya, Şengül Cangür, Bülent Ediz, İsmet Kan. 2007. Misusage of Statistics in Medical Research. Eur J Gen Med 4(3):128-134.
Contents
Brief history, basic concepts, and descriptive statistics
Probability distribution
Hypothesis testing
Analysis of variance (ANOVA)
Simple linear regression and correlation
Analysis of covariance (ANCOVA)
Nonparametric statistics
Multivariate analysis
Generalized linear model
Common mistakes
Statistical software
R http://cran.r-project.org
In 1995, R was initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics of the University of Auckland in Auckland, New Zealand.
Citation
R Development Core Team. 2011. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria. ISBN: 3-900051-07-0. http://www.R-project.org.
Today's contents
Introduction to biological statistics
History
Data in biology
Descriptive statistics
History
John Graunt (1620-1674, British) and William Petty (1623-1687, British):
developed early human statistical and census methods that later provided a
framework for modern demography based on life table, mean value, census,
longevity, and mortality.
Sewall G. Wright (1889-1988, American) used statistics in the development of modern population genetics.
Francis Galton
http://www.sil.si.edu/digitalcollections/hst/scientific-identity/fullsize/SIL14-G001-05a.jpg
Karl Pearson
http://www.economics.soton.ac.uk/staff/aldrich/New%20Folder/kpreader1.htm
Ronald A. Fisher
http://en.wikipedia.org/wiki/Image:RonaldFisher.jpg
Fisher was described by Anders Hald as "a genius who almost single-handedly created the foundations for modern statistical science" [1], and Richard Dawkins described him as "the greatest biologist since Darwin" [2] (from Wikipedia).
Contributions: analysis of variance, maximum likelihood, Fisher information
Fisher, R.A. 1925. Statistical Methods for Research Workers
Fisher, R.A. 1935. The Design of Experiments
[1] Hald, Anders (1998). A History of Mathematical Statistics. New York: Wiley.
[2] Dawkins, Richard (1995). River out of Eden.
The Biometrics Section of the American Statistical Association began to publish the Biometrics Bulletin in 1945.
A story of statistics in industry
http://www.census.gov/history/www/census_then_now/notable_alumni/w_edwards_deming.html
Data
A datum is one observation of the variable being measured.
Derived variables
Ratio: sex ratio
Index: S&P 500 index (stock market)
Rate: growth rate
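Accuracy of data
Mean square error for estimating the population mean ($\mu$) using the sample mean ($M$):
$\mathrm{Bias} = E(M) - \mu$
$\mathrm{MSE}(M) = E[(M - \mu)^2] = \mathrm{Var}(M) + [E(M) - \mu]^2$
Accuracy (the MSE) combines precision (the variance term) and bias (the squared-bias term).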
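The decomposition of the mean square error into a variance term and a squared-bias term can be checked numerically. The following is a minimal R sketch, not part of the original slides; the population values (mu = 10, sd = 2), the sample size of 5, and the 2000 replicates are illustrative assumptions.

```r
set.seed(1)                       # make the simulation reproducible
mu <- 10                          # assumed population mean

# 2000 simulated sample means, each from n = 5 draws of N(mu, sd = 2)
M <- replicate(2000, mean(rnorm(5, mean = mu, sd = 2)))

mse      <- mean((M - mu)^2)        # empirical mean square error
variance <- mean((M - mean(M))^2)   # empirical variance (divisor n, so the identity is exact)
bias     <- mean(M) - mu            # empirical bias
decomp   <- variance + bias^2       # precision term + squared bias

c(MSE = mse, Var_plus_Bias2 = decomp)
```

The two printed numbers agree up to floating-point error, because the identity holds for empirical moments as well as for expectations.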
Summarizing Data
Frequency Distribution
Cumulative Distributions
Relative Frequency Distribution
Percent Frequency Distribution
Bar Graph
Histogram
Pie Chart
Dot Plot
The objective is to provide insights about the data that cannot be quickly obtained by looking only at the original data.
Frequency Distribution
Guests staying at Holiday Inn were asked to rate the quality of their accommodations as being excellent, above average, average, below average, or poor.
Rating          Frequency
Poor            2
Below Average   3
Average         5
Above Average   9
Excellent       1
Total           20
91 78 93 57 75 52 99 80 97 62
71 69 72 89 66 75 79 75 72 76
104 74 62 68 97 105 77 65 80 109
85 97 88 68 83 68 71 69 67 74
62 82 98 101 79 105 79 69 62 73
Frequency Distribution
Guidelines for selecting number of classes
Frequency Distribution
For Hudson Auto Repair, if we choose six classes:
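As a hedged illustration, the 50 values listed above can be grouped into six classes in R. The class limits used here (width 10, starting at 50) are an assumption inferred from the cumulative table shown later in the lecture; the variable names are hypothetical.

```r
# The 50 Hudson Auto Repair values listed in the lecture
cost <- c(91, 78, 93, 57, 75, 52, 99, 80, 97, 62,
          71, 69, 72, 89, 66, 75, 79, 75, 72, 76,
          104, 74, 62, 68, 97, 105, 77, 65, 80, 109,
          85, 97, 88, 68, 83, 68, 71, 69, 67, 74,
          62, 82, 98, 101, 79, 105, 79, 69, 62, 73)

# Six classes of width 10; right = FALSE gives intervals [50,60), ..., [100,110)
classes <- cut(cost, breaks = seq(50, 110, by = 10), right = FALSE,
               labels = c("50-59", "60-69", "70-79", "80-89", "90-99", "100-109"))

freq <- table(classes)   # frequency distribution
freq
```

A rough guide for the class width is (largest value - smallest value) / number of classes, here (109 - 52) / 6 = 9.5, which is rounded up to 10.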
Relative Frequency Distribution
Relative Frequency and Percent Frequency Distributions
Holiday Inn Quality Ratings
Rating          Relative Frequency   Percent Frequency
Poor            .10                  10
Below Average   .15                  15
Average         .25                  25
Above Average   .45                  45
Excellent       .05                  5
Total           1.00                 100
Example: the relative frequency for Excellent is 1/20 = .05.
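The relative and percent frequencies in the table follow directly from the counts. A minimal R sketch, using the Holiday Inn frequencies from the lecture (the variable names are hypothetical):

```r
# Frequencies from the Holiday Inn quality ratings table
freq <- c(Poor = 2, "Below Average" = 3, Average = 5,
          "Above Average" = 9, Excellent = 1)

rel_freq <- freq / sum(freq)   # relative frequency = frequency / n
pct_freq <- 100 * rel_freq     # percent frequency = relative frequency x 100

data.frame(Frequency = freq, Relative = rel_freq, Percent = pct_freq)
```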
Relative Frequency and Percent Frequency Distributions
Hudson Auto Repair
Insights gained from the percent frequency distribution
Our class
students <- read.csv('D:/ioz/statistics/2012/students.csv', header=T)
students$ID <- as.character(students$ID)
head(students)
# order  ID               name  visits  email
# 33     201028007610020        2093    163.com
# 3      201028016215017        222     163.com
# 111    201028016215018        99      163.com
# 4      201028016215019        130     163.com
# 25     201128000206033        130     163.com
# 56     201128000206061        282     mails.gucas.ac.cn
nrow(students) #115
family.name <- substr(students$name,1,1)
length(unique(family.name)) #62
f.name <- table(family.name)[table(family.name)>1]
f.name <- as.table(f.name)
barplot(f.name, ylab='Number')
[Figure: bar plot of family names shared by more than one student; y-axis: Number]
Our class
email <- table(students$email)[table(students$email) > 2]
class(email) # array
email <- as.table(email)
barplot(email, ylab='Number')
[Figure: bar plot of email domains used by more than two students; y-axis: Number]
Our class
hist(students$visits, freq=T, nclass=15, xlab='Times')
[Figure: Histogram of students$visits; x-axis: Times; y-axis: Frequency]
Bar Graph
barplot()
[Figure: bar graph; y-axis: Number]
Histogram
Another common graphical presentation of quantitative data is a
histogram.
The variable of interest is placed on the horizontal axis.
A rectangle is drawn above each class interval with its height corresponding to the interval's frequency, relative frequency, or percent frequency.
Unlike a bar graph, a histogram has no natural separation between
rectangles of adjacent classes.
R code
hist(rnorm(100),nclass=6)
Pie Chart
R code
x=sample(1:100,6,replace=TRUE)
names(x)=c('A','B','C','D','E','F')
pie(x)
The pie chart is a commonly used graphical device for presenting relative
frequency distributions for qualitative data.
First draw a circle; then use the relative frequencies to subdivide the
circle into sectors that correspond to the relative frequency for each class.
Since there are 360 degrees in a circle, a class with a relative frequency
of .25 would consume .25(360) = 90 degrees of the circle.
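The sector angles can be computed directly from the relative frequencies. A minimal R sketch, reusing the Holiday Inn relative frequencies from earlier in the lecture as an illustrative input (the variable names are hypothetical):

```r
# Relative frequencies from the Holiday Inn quality ratings
rel_freq <- c(Poor = 0.10, "Below Average" = 0.15, Average = 0.25,
              "Above Average" = 0.45, Excellent = 0.05)

degrees <- rel_freq * 360   # each class takes its share of the 360-degree circle
degrees

pie(rel_freq)               # pie() scales the sectors in the same proportions
```

The class with relative frequency .25 (Average) receives 90 degrees, matching the arithmetic in the text.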
Dot Plot
One of the simplest graphical summaries of data is a dot plot.
A horizontal axis shows the range of data values.
Then each data value is represented by a dot placed above the axis.
[Figure: dot plot of Tune-up Parts Cost; x-axis: Cost ($), 50 to 110]
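Cumulative Distributions
Cumulative frequency distribution: shows the number of items with values less than or equal to the upper limit of each class.
Cumulative relative frequency and cumulative percent frequency distributions: show the proportion and the percentage of items with values less than or equal to the upper limit of each class.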
R code
x=seq(-5,5,by=0.1)
plot(x, pnorm(x,mean=0,sd=1), type='l')  # pass x so the horizontal axis shows the values, not the index
Cumulative Distributions
Hudson Auto Repair
Cost ($)   Cumulative Frequency   Cumulative Relative Frequency   Cumulative Percent Frequency
≤ 59       2                      .04                             4
≤ 69       15                     .30                             30
≤ 79       31                     .62                             62
≤ 89       38                     .76                             76
≤ 99       45                     .90                             90
≤ 109      50                     1.00                            100
Example: for the second class, 2 + 13 = 15 and 15/50 = .30.
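The cumulative columns are running sums of the class frequencies. A minimal R sketch; the class frequencies below are obtained by differencing the cumulative frequency column of the table, and the class labels are an assumption based on the upper limits shown there.

```r
# Class frequencies implied by the cumulative table (2, 15 - 2, 31 - 15, ...)
freq <- c("50-59" = 2, "60-69" = 13, "70-79" = 16,
          "80-89" = 7, "90-99" = 7, "100-109" = 5)

cum_freq <- cumsum(freq)            # cumulative frequency
cum_rel  <- cum_freq / sum(freq)    # cumulative relative frequency
cum_pct  <- 100 * cum_rel           # cumulative percent frequency

data.frame(CumFreq = cum_freq, CumRel = cum_rel, CumPct = cum_pct)
```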
A stem-and-leaf display shows both the rank order and shape of the distribution of the data.
It is similar to a histogram on its side, but it has the advantage of showing the actual data values.
The first digits of each data item are arranged to the left of a vertical line.
To the right of the vertical line we record the last digit for each item in rank order.
Each line in the display is referred to as a stem.
Each digit on a stem is a leaf.
Example: Leaf Unit = 0.1
If we have data with values such as
8.6 11.7 9.4 9.1 10.2 11.0 8.8
a stem-and-leaf display of these data will be
 8 | 6 8
 9 | 1 4
10 | 2
11 | 0 7
Example: Leaf Unit = 10
With a leaf unit of 10, a stem-and-leaf display will be
16 | 8
17 | 1 9
18 | 0 3
19 | 1 7
The 82 in 1682 is rounded down to 80 and is represented as an 8.
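The first example (Leaf Unit = 0.1) can be built by hand in R. This is a minimal sketch of the procedure described above, not the lecture's own code; the variable names are hypothetical.

```r
# Data from the Leaf Unit = 0.1 example, sorted into rank order
x <- sort(c(8.6, 11.7, 9.4, 9.1, 10.2, 11.0, 8.8))

stems  <- floor(x)                  # first digits: the integer part
leaves <- round((x - stems) * 10)   # last digit: the first decimal place

# Collect the leaves belonging to each stem into one string
display <- sapply(split(leaves, stems), paste, collapse = " ")

# Print one line per stem: "stem | leaves"
for (s in names(display)) cat(sprintf("%2s | %s\n", s, display[[s]]))
```

Base R also provides stem(x), which prints a stem-and-leaf display directly, although its layout and choice of stems may differ from the hand-built version.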
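The probability that X falls in the interval [a, b] is the area under the density curve: $P(a \le X \le b) = \int_a^b f(x)\,dx$
Intuitively, if a probability distribution has density f(x), then the infinitesimal interval [x, x + dx] has probability f(x) dx.
x=seq(-5,5,by=0.1)
plot(x, dnorm(x,mean=0,sd=1), type='l')  # pass x so the horizontal axis shows the values, not the index
The total area under the graph is 1: $\int_{-\infty}^{\infty} f(x)\,dx = 1$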
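Both statements about the density can be checked numerically for the standard normal distribution. A minimal R sketch using the base functions integrate(), dnorm() and pnorm(); the interval [-1, 1] and the step dx = 0.001 are illustrative assumptions.

```r
# Total area under the standard normal density
total <- integrate(dnorm, lower = -Inf, upper = Inf)
total$value

# Probability of an interval [a, b] = area under the density between a and b
p <- integrate(dnorm, lower = -1, upper = 1)
p$value

# For a small interval [x, x + dx], the probability is close to f(x) * dx
x0 <- 0
dx <- 0.001
approx_p <- dnorm(x0) * dx
exact_p  <- pnorm(x0 + dx) - pnorm(x0)
c(approx = approx_p, exact = exact_p)
```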
Descriptive statistics
Mode
Example of Mode
Measurements x: 3 5 5 1 7 2 6 7 0 4
In this case the data have two modes: 5 and 7.
Both measurements occur twice.
Example of Mode
Measurements x: 3 5 1 1 4 7 3 8 3
Mode: 3
Notice that it is possible for a data set not to have any mode.
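Base R has no function that returns the statistical mode (mode() reports the storage type of an object), so a small helper is needed. The function below is a hypothetical sketch, not part of the lecture; it returns every value tied for the highest count, and an empty vector when no value is repeated.

```r
# Hypothetical helper: all values that occur most often
modes <- function(x) {
  counts <- table(x)                            # frequency of each distinct value
  if (max(counts) == 1) return(numeric(0))      # no repeated value: no mode
  as.numeric(names(counts)[counts == max(counts)])
}

modes(c(3, 5, 5, 1, 7, 2, 6, 7, 0, 4))   # first example: two modes
modes(c(3, 5, 1, 1, 4, 7, 3, 8, 3))      # second example: one mode
modes(c(1, 2, 3))                        # no mode
```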
Median
Score at the 50th percentile
For a normal distribution the median is the same as the mode
Arrange scores from lowest to highest; if there is an odd number of scores, the median is the one in the middle; if there is an even number of scores, average the two scores in the middle
Used with an ordinal scale and when the distribution is skewed
Example of Median
Measurements x:          3 5 5 1 7 2 6 7 0 4
Measurements ranked x:   0 1 2 3 4 5 5 6 7 7
Median: (4+5)/2 = 4.5
Notice that only the two central values are used in the computation.
The median is not sensitive to extreme values (for example, if the largest value were 40 instead of 7, the median would still be 4.5).
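A minimal R sketch of this example, which also contrasts the median with the mean when one value becomes extreme. Replacing one 7 with 40 is an illustrative assumption about what the extra value on the slide is meant to show.

```r
x <- c(3, 5, 5, 1, 7, 2, 6, 7, 0, 4)

sort(x)        # ranked measurements
median(x)      # average of the two central values, (4 + 5) / 2
mean(x)

# Replace one of the largest values (7) with an extreme value (40)
x_extreme <- x
x_extreme[which(x_extreme == 7)[1]] <- 40

median(x_extreme)   # unchanged
mean(x_extreme)     # pulled towards the extreme value
```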
Mean
Score at the exact mathematical center of the distribution (average)
Used with interval and ratio scales, and when the distribution is symmetrical and unimodal
Not accurate when the distribution is skewed because it is pulled towards the tail
$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}$
Range
Measures of Variability
Why do we care about variability?
Where would you rather vacation, LA Bungalows, where the mean temperature is 24 degrees, or Sahara Condos, where the mean temperature is also 24 degrees?
LA temperature range:
day = 26
night = 22
Sahara temperature range:
day = 40
night = 8
Variance
Uses the deviation from the mean
Remember, the sum of the deviations always equals zero, so you have to square each of the deviations
$S_X^2$ = sum of squared deviations divided by the number of scores
number of scores
Provides information about the relative variability
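A minimal R sketch of these points, using the ten measurements from the mode and median examples as illustrative data. Note that R's built-in var() divides by n - 1 rather than by the number of scores, so it gives a slightly larger value than the formula above.

```r
x <- c(3, 5, 5, 1, 7, 2, 6, 7, 0, 4)

dev <- x - mean(x)    # deviations from the mean
sum(dev)              # always zero (up to rounding)

ss  <- sum(dev^2)     # sum of squared deviations
S2  <- ss / length(x) # variance as defined above: divide by the number of scores
S2

var(x)                # built-in sample variance: divides by n - 1
```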
Some Limits
S, the sample standard deviation
$S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$
The larger the standard deviation, the more variability there is in the scores
The deviation (definitional) formula for the sample standard deviation
$S = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$
Standard Deviation: Example
X      X - X̄     (X - X̄)²
21     -5.8      33.64
25     -1.8      3.24
24     -2.8      7.84
30     3.2       10.24
34     7.2       51.84
Mean of X: 26.8
Sum of deviations: 0
Sum of squared deviations: 106.8
Mean squared deviation: 106.8 / 5 = 21.36
$S = \sqrt{21.36} = 4.62$
Raw-score formula:
$S = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N}}$
S, the sample standard deviation, is usually a little smaller than the population standard deviation. Why?
Definitional formula: $s = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}}$
Raw-score formula: $s = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N - 1}}$
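The definitional and raw-score formulas can be compared on the five scores from the standard deviation example (21, 25, 24, 30, 34). This is a minimal R sketch with hypothetical variable names; R's built-in sd() uses the N - 1 divisor.

```r
X <- c(21, 25, 24, 30, 34)
N <- length(X)

ss_def <- sum((X - mean(X))^2)      # definitional sum of squares
ss_raw <- sum(X^2) - sum(X)^2 / N   # raw-score sum of squares (same quantity)

S <- sqrt(ss_def / N)               # divisor N, as in the worked example
s <- sqrt(ss_def / (N - 1))         # divisor N - 1

c(S = S, s = s, sd = sd(X))
```

Dividing by N - 1 instead of N makes s slightly larger than S, which compensates for the deviations being measured from the sample mean rather than from the population mean.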
Population variance: $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
Sample variance: $s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$
Standard error
$SE_{\bar{x}} = \frac{s}{\sqrt{n}}$
Coefficient of variation
$CV = \frac{\sigma}{\mu} \times 100$
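A minimal R sketch of the standard error and the coefficient of variation, using the five scores from the standard deviation example as illustrative data and the sample estimates (s and the sample mean) in place of the population values:

```r
X <- c(21, 25, 24, 30, 34)
n <- length(X)

s  <- sd(X)                # sample standard deviation (divisor n - 1)
se <- s / sqrt(n)          # standard error of the mean
cv <- s / mean(X) * 100    # coefficient of variation, in percent

c(SE = se, CV = cv)
```

Because the CV is expressed relative to the mean, it has no units and can be used to compare variability between variables measured on different scales.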
Skewness
Symmetrical distribution
Left tail is the mirror image of the right tail
Examples: heights and weights of people
[Figure: symmetric histogram; y-axis: Relative Frequency, 0 to .35]
Skewness
Asymmetrical distribution
Moderately Skewed Left
A longer tail to the left
Example: exam scores
[Figure: histogram moderately skewed left; y-axis: Relative Frequency, 0 to .35]
Skewness
Asymmetrical distribution
Examples: income, populations of countries
[Figure: skewed distribution; x-axis: Value; y-axis: Frequency]
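Skewness
A measure of skewness based on the 3rd moment about the mean:
$\mathrm{skewness} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^3}{(N - 1)\, s^3}$
Expanding $s$ (with sample size $n = N$):
$\mathrm{skewness} = \frac{(n - 1)^{3/2} \sum_{i=1}^{n} (x_i - \bar{x})^3}{(n - 1) \left[ \sum_{i=1}^{n} (x_i - \bar{x})^2 \right]^{3/2}}$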
[Figure: skewness; x-axis: Value; y-axis: Frequency]
Kurtosis
Measures of Kurtosis
Kurtosis
$\mathrm{kurtosis} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^4}{(N - 1)\, s^4}$
[Figure: kurtosis; curves for k > 3, k = 3, and k < 3; x-axis: Value; y-axis: Frequency]
Describing data
          Statistic (mean based)                       Statistic (non-mean based)
Center    Mean                                         Mode, median
Spread    Variance, SD (standard deviation), SE, CV    Range, interquartile range
Skew      Skewness                                     --
Peaked    Kurtosis                                     --
R code
x = rnorm(100)
mean(x)
sd(x)
var(x)
min(x)
max(x)
median(x)
range(x)
quantile(x)
summary(x)
# moment-based skewness
skewness = sum((x-mean(x))^3/sqrt(var(x))^3)/length(x); skewness
# excess kurtosis: 3 is subtracted, so a normal distribution gives about 0 rather than 3
kurtosis = sum((x-mean(x))^4/var(x)^2)/length(x) - 3; kurtosis
SAS Example
/****************************************************************/
/* SAS SAMPLE LIBRARY                                           */
/*                                                              */
/* NAME: UNIVAR                                                 */
/* TITLE: Simple Descriptive Statistics using PROC UNIVARIATE   */
/* PRODUCT: SAS                                                 */
/* SYSTEM: ALL                                                  */
/* KEYS: DESCRIPTIVE STATISTICS,                                */
/* PROCS: UNIVARIATE                                            */
/* DATA:                                                        */
/*                                                              */
/* REF:                                                         */
/* MISC:                                                        */
/* DESC: INPUT A SMALL DATA SET USING THE CARDS STATEMENT.      */
/*       RUN UNIVARIATE USING THE FREQ, PLOT AND NORMAL         */
/*       PROC OPTIONS. ANALYZE THE VARIABLE POP AND             */
/*       RETAIN THE VARIABLE STATE USING THE ID STATEMENT.      */
/****************************************************************/
OPTIONS LS=75 NODATE;
/* The DATA step is missing from the slide. The three statements */
/* below are a reconstruction; the name STATEPOP is an assumption. */
DATA STATEPOP;
INPUT STATE $ POP @@;
CARDS;
ALA 3.44 ALASKA 0.30 ARIZ 1.77 ARK 1.92 CALIF 19.95
COLO 2.21 CONN 3.03 DEL 0.55 FLA 6.79 GA 4.59
HAW 0.77 IDAHO 0.71 ILL 11.01 IND 5.19 IOWA 2.83
KAN 2.25 KY 3.22 LA 3.64 ME 0.99 MD 3.92
MASS 5.69 MICH 8.88 MINN 3.81 MISS 2.22 MO 4.68
MONT 0.69 NEB 1.48 NEV 0.49 NH 0.74 NJ 7.17
NM 1.02 NY 18.24 NC 5.08 ND 0.62 OHIO 10.65
OKLA 2.56 ORE 2.09 PA 11.79 RI 0.95 SC 2.59
SD 0.67 TENN 3.92 TEXAS 11.2 UTAH 1.06 VT 0.44
VA 4.65 WASH 3.41 W.VA 1.74 WIS 4.42 WYO 0.33
;
PROC UNIVARIATE FREQ PLOT NORMAL;
VAR POP; ID STATE;
run;
SAS results
The SAS System
Moments
N                 50            Sum Weights        50
Mean              4.0472        Sum Observations   202.36
Std Deviation     4.32931867    Variance           18.7430002
Skewness          2.05521839    Kurtosis           4.54561679
Uncorrected SS    1737.3984     Corrected SS       918.407008
Coeff Variation   106.970712    Std Error Mean     0.61225812
Basic statistics
Location              Variability
Mean     4.047200     Std Deviation         4.32932
Median   2.710000     Variance              18.74300
Mode     3.920000     Range                 19.65000
                      Interquartile Range   3.69000
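As a cross-check, the location and variability statistics reported by SAS can be reproduced in R from the same 50 state populations. This is a minimal sketch with hypothetical variable names. The interquartile range, skewness and kurtosis are left out because R's default quantile definition and the moment formulas used earlier in this lecture differ from the ones SAS applies, so those values need not match exactly.

```r
# Populations (millions) of the 50 states, from the SAS example
pop <- c(3.44, 0.30, 1.77, 1.92, 19.95, 2.21, 3.03, 0.55, 6.79, 4.59,
         0.77, 0.71, 11.01, 5.19, 2.83, 2.25, 3.22, 3.64, 0.99, 3.92,
         5.69, 8.88, 3.81, 2.22, 4.68, 0.69, 1.48, 0.49, 0.74, 7.17,
         1.02, 18.24, 5.08, 0.62, 10.65, 2.56, 2.09, 11.79, 0.95, 2.59,
         0.67, 3.92, 11.2, 1.06, 0.44, 4.65, 3.41, 1.74, 4.42, 0.33)

n    <- length(pop)
m    <- mean(pop)
s    <- sd(pop)                 # divisor n - 1, as in SAS
v    <- var(pop)
med  <- median(pop)
rng  <- diff(range(pop))        # maximum minus minimum
se   <- s / sqrt(n)             # Std Error Mean
cv   <- s / m * 100             # Coeff Variation

round(c(N = n, Mean = m, SD = s, Variance = v, Median = med,
        Range = rng, SE = se, CV = cv), 4)
```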
Quantiles
Quantile Estimate
Assignment
Be familiar with the following terms:
Probability density function (PDF)
Deviation
Variance
Standard deviation
Standard error
Range
Mode
Quantile
Coefficient of variation