
[QMM] Statistical formulas

1. Mean
The mean, or average, of a collection of numbers x_1, x_2, ..., x_N is
\[
\bar{x} = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{1}{N} \sum_{i=1}^{N} x_i.
\]
2. Standard deviation
The standard deviation is defined as
\[
S = \sqrt{\frac{(x_1 - \bar{x})^2 + \cdots + (x_N - \bar{x})^2}{N - 1}} = \sqrt{\frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x})^2}.
\]
One may find in some textbooks an alternative version, with N in the denominator. When the author wishes to distinguish between both versions, the N version is presented as the population standard deviation, while the N − 1 version is the sample standard deviation.
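The handout contains no code; as a quick sketch, both formulas (and the N versus N − 1 distinction) can be checked in plain Python. The data values below are invented for illustration.

```python
import math

def mean(xs):
    # x-bar = (x_1 + ... + x_N) / N
    return sum(xs) / len(xs)

def population_sd(xs):
    # N in the denominator (population version)
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def sample_sd(xs):
    # N - 1 in the denominator (sample version, as in the formula above)
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # invented example data
print(mean(data))           # 5.0
print(population_sd(data))  # 2.0
print(sample_sd(data))      # slightly larger, about 2.14
```

For this data set the sample version is larger than the population version, as it always is; the gap shrinks as N grows.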
3. The normal distribution
The normal density curve is given by a function of the form
\[
f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right).
\]
In this formula, μ and σ are two parameters which are different for each application of the model. A normal density curve has a bell shape (Figure 1). The parameter μ, called the population mean, has a straightforward interpretation: the density curve peaks at x = μ. The parameter σ, called the population standard deviation, measures the spread of the distribution: the higher σ, the flatter the bell. The case μ = 0, σ = 1 is called the standard normal.
Probabilities for the normal distribution are calculated as (numerical) integrals of the density. For
most people, the only probability needed is
\[
P\left( \mu - 1.96\,\sigma < X < \mu + 1.96\,\sigma \right) = 0.95.
\]
This formula provides us with an interval which contains 95% of the population. The tails
contain the remaining 5%.
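The 0.95 value above can be verified numerically: the standard normal CDF has a closed form in terms of the error function, Φ(z) = (1 + erf(z/√2))/2, which Python's math module provides. A sketch, with arbitrary invented values for μ and σ:

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    # P(X <= x) for X normal with mean mu and standard deviation sigma,
    # via the error-function form of the standard normal CDF
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

mu, sigma = 10.0, 2.0  # invented parameter values
p = normal_cdf(mu + 1.96 * sigma, mu, sigma) - normal_cdf(mu - 1.96 * sigma, mu, sigma)
print(p)  # roughly 0.95, for any choice of mu and sigma
```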
4. Confidence limits for the mean
The formula for the 95% confidence limits for the mean is
\[
\bar{x} \pm 1.96\, \frac{S}{\sqrt{N}}.
\]
[QMM] Statistical formulas 1 20120301
[Figure 1. Three normal density curves]
Here, N is the number of data points, x̄ the sample mean, and S the sample standard deviation.
Textbooks recommend replacing the factor 1.96, derived from the normal distribution, by a factor
taken from the Student t distribution, but the correction becomes irrelevant when N is high.
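As a sketch of the formula in plain Python (data values invented for illustration, and the normal-based factor 1.96 used as in the handout):

```python
import math

def ci95(xs):
    # x-bar +/- 1.96 * S / sqrt(N), S being the sample standard deviation
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    half_width = 1.96 * s / math.sqrt(n)
    return m - half_width, m + half_width

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # invented example data
lo, hi = ci95(data)
print(lo, hi)  # an interval centered at the sample mean, 5.0
```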
5. Correlation
For two-dimensional data (x_1, y_1), (x_2, y_2), ..., (x_N, y_N), the (linear) correlation is
\[
R = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}}.
\]
Always −1 ≤ R ≤ 1.
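A direct translation of the correlation formula into Python might look as follows (the data are invented; points lying exactly on a line give R = ±1):

```python
import math

def correlation(xs, ys):
    # R = sum((x_i - x-bar)(y_i - y-bar)) / sqrt(sum((x_i - x-bar)^2) * sum((y_i - y-bar)^2))
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Invented, nearly linear data: R should be close to 1
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 10.0]
print(correlation(xs, ys))
```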
6. Coefficients of the regression line
Given N data points (x_1, y_1), (x_2, y_2), ..., (x_N, y_N), the regression line has an equation y = b_0 + b_1 x, in which b_0 and b_1 are the regression coefficients: b_1 is the slope, and b_0 the intercept. The formulas are
\[
b_1 = R\, \frac{S_Y}{S_X}, \qquad b_0 = \bar{y} - b_1 \bar{x}.
\]
R is the linear correlation. ȳ and x̄ are the means of Y and X, respectively. S_Y and S_X are the corresponding standard deviations.
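The two coefficient formulas can be sketched in Python along the same route the handout takes, via R and the two standard deviations; the data below are invented and lie exactly on y = 1 + 2x, so the recovered coefficients should match:

```python
import math

def regression_line(xs, ys):
    # b1 = R * S_Y / S_X  (slope), b0 = y-bar - b1 * x-bar  (intercept)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)   # linear correlation R
    sx = math.sqrt(sxx / (n - 1))    # S_X
    sy = math.sqrt(syy / (n - 1))    # S_Y
    b1 = r * sy / sx
    b0 = my - b1 * mx
    return b0, b1

# Invented data lying exactly on y = 1 + 2x
b0, b1 = regression_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(b0, b1)  # intercept close to 1, slope close to 2
```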
7. R square statistic
In a linear regression equation, the R² statistic is the proportion of the total variability of the dependent variable explained by the equation:
\[
R^2 = \frac{\text{Explained variability}}{\text{Total variability}}.
\]
[Figure 2. Regression lines with R = 0.8 and R = 0.2]
More explicitly, if y_1, y_2, ..., y_N are the observed values of the dependent variable Y, with mean ȳ, and ŷ_1, ŷ_2, ..., ŷ_N are the values predicted by the equation,
\[
R^2 = \frac{\sum \left( \hat{y}_i - \bar{y} \right)^2}{\sum \left( y_i - \bar{y} \right)^2}.
\]
Always 0 ≤ R² ≤ 1. In simple regression (a single independent variable), R² coincides with the square of the correlation.
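That coincidence can be checked numerically. The sketch below fits a least-squares line to invented data, computes R² as explained over total variability, and compares it with the squared correlation:

```python
import math

# Invented, nearly linear data
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)

# Least-squares line y = b0 + b1 x and its predicted values
b1 = sxy / sxx
b0 = my - b1 * mx
y_hat = [b0 + b1 * x for x in xs]

# R^2 = explained variability / total variability
r2 = sum((yh - my) ** 2 for yh in y_hat) / syy

# In simple regression, R^2 equals the squared correlation
r = sxy / math.sqrt(sxx * syy)
print(r2)      # close to 1 for nearly linear data
print(r ** 2)  # numerically the same value
```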
8. Adjusted R square
An adjusted R² statistic, defined as
\[
\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(N - 1)}{N - p - 1},
\]
is sometimes used to compare regression equations. N is the number of data points and p the number of independent variables in the equation. The adjustment becomes irrelevant when N is high.
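A one-line sketch of the adjustment; the R², N, and p values below are invented, and the second call shows the correction fading for large N:

```python
def adjusted_r2(r2, n, p):
    # Adjusted R^2 = 1 - (1 - R^2)(N - 1) / (N - p - 1)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Invented values: the same R^2 = 0.90 with p = 3 predictors
print(adjusted_r2(0.90, 25, 3))    # noticeably below 0.90 for small N
print(adjusted_r2(0.90, 1000, 3))  # nearly 0.90: the adjustment fades as N grows
```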