
Physical meaning of mean, variance and standard deviation?

Mean is the average value. Average is sometimes misleading.

For example, suppose a class has 5 students aged 13, 14, 14, 15 and 34. The average age comes to 18.

None of the students is that age, or anywhere near it.

This happens because one value has distorted the whole picture.

By contrast, the variance and standard deviation reveal this at once.

The SD tells you how widely the data are spread.

Related measures tell you whether the data are skewed,

or whether they are peaked, and so on.

Together they give greater insight into the data.

Mean represents a sort of "weighted center" of a distribution of data.

Physically, the "mean" of a probability distribution is equivalent to its "center of mass."

In other words, if you cut out a probability distribution (like a bell curve) then you could balance it on your finger right at the value of
its mean.

For example, picture a probability distribution where there is a 20% chance of getting a 5 and an 80% chance of getting a 10.

In that case, the mean is 0.2*5 + 0.8*10 = 1 + 8 = 9. As you can see, there is 80% "mass" exactly 1 away from 9 and there is 20% mass
exactly 4 away from 9, and 80%*1 = 20%*4. If the probability distribution were a lever (like a see-saw) with the fulcrum at 9, which is 4 away from
the lighter side, the lever would be perfectly balanced.

So that's the "physical" interpretation of mean. It's a "center of mass."
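
The balance argument can be checked numerically. The following is only a small sketch: the helper names `mean` and `net_torque` are made up for illustration, and exact fractions are used so that no rounding creeps in.

```python
from fractions import Fraction

# Two-point distribution from the text: value -> probability ("mass").
distribution = {5: Fraction(1, 5), 10: Fraction(4, 5)}

def mean(dist):
    """Probability-weighted average of the values, i.e. the balance point."""
    return sum(value * prob for value, prob in dist.items())

def net_torque(dist, fulcrum):
    """Sum of mass times signed distance from the fulcrum; zero means balanced."""
    return sum(prob * (value - fulcrum) for value, prob in dist.items())

mu = mean(distribution)
print(mu)                            # 9
print(net_torque(distribution, mu))  # 0
```

Placing the fulcrum anywhere other than the mean gives a non-zero net torque, which is the see-saw tipping to one side.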

Variance is a measure of the amount of "noise power."

It is a measure of how much a random variable varies.

For example, if there was a 100% chance of getting a 5, then the variable would not be random. It would be deterministic. It would
thus have 0 variance. However, if there was only an 80% chance of getting a 5, a 10% chance of getting a 4 and a 10% chance of
getting a 6, then the random variable would have a variance of 0.2, which reflects the 10% and 10% chances on the left and right of 5.
Each of those outcomes lies a squared distance of 1 from the mean, so the variance is 2*10%*1 = 0.2. If you change those 10% to 5%, you get a variance of 0.1 because 2*5%*1 = 0.1. The variance is
measuring the "spread" of the data.

It's measuring the amount of noise you're going to get around your mean (the mean in this case is 5).
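
The variances quoted above can be reproduced with a few lines of code. This is a sketch under the assumption that a distribution is stored as a value-to-probability dictionary; the names `deterministic`, `noisy` and `quieter` are illustrative only.

```python
from fractions import Fraction

def mean(dist):
    """Expected value of a discrete distribution given as {value: probability}."""
    return sum(value * prob for value, prob in dist.items())

def variance(dist):
    """Expected squared distance from the mean."""
    mu = mean(dist)
    return sum(prob * (value - mu) ** 2 for value, prob in dist.items())

deterministic = {5: Fraction(1)}
noisy = {4: Fraction(1, 10), 5: Fraction(8, 10), 6: Fraction(1, 10)}
quieter = {4: Fraction(1, 20), 5: Fraction(18, 20), 6: Fraction(1, 20)}

print(variance(deterministic))  # 0
print(variance(noisy))          # 1/5
print(variance(quieter))        # 1/10
```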

The standard deviation is just the square root of the variance.

The variance is in square units because it is actually the "expected" squared distance from the mean. In other words, if the
variance is 1, we expect that if we square the distance from the mean, we'll get a value around 1.

The standard deviation just converts this expectation back into our old units.

That way, if we have a variance of 4, we'll have a standard deviation of 2. It's just more convenient to express the spread in normal
units rather than square units.

I hope that's "physical" enough.


-Mean is the sum of the values, divided by the total number of values.
-Variance is the average of the squares of the distance that each value is from the mean.
-Standard deviation is the square root of the variance.
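
As a rough illustration of these three definitions, the class of five students from the start of this answer can be run through Python's standard `statistics` module. The population versions (`pvariance`, `pstdev`) are used because the definitions above divide by the number of values; the variable names are made up for the example.

```python
import statistics

ages = [13, 14, 14, 15, 34]

mean_age = statistics.mean(ages)       # sum of values / number of values
variance = statistics.pvariance(ages)  # average squared distance from the mean
std_dev = statistics.pstdev(ages)      # square root of the variance

print(mean_age, variance, round(std_dev, 2))

# Dropping the one outlying age shows how much it inflated the spread.
typical = [13, 14, 14, 15]
typical_variance = statistics.pvariance(typical)
print(statistics.mean(typical), typical_variance)
```

The mean alone (18) hides the outlier, while the large variance next to the much smaller one for the four younger students makes it visible.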

A probability density function describes the relative likelihood for a random variable to occur at a given point in the
observation space.

The probability of a random variable falling within a given set is given by the integral of its density over the set.

A distribution function describes the range of possible values that a random variable can attain and
the probability that the value of the random variable is within any (measurable) subset of that range.

The probability density function (PDF) of a continuous random variable is a function
that describes the relative likelihood for this random variable to occur
at a point in the observation space.

The PDF is the derivative of the probability distribution function (also known as the
cumulative distribution function (CDF)), which describes the entire range of values
(distribution) a continuous random variable takes in a domain.
The CDF is used to determine the probability that a continuous random variable falls in any (measurable) subset of that
range.
This is done by integrating the PDF over some range (i.e., taking the area under the PDF curve between two
values), which equals the difference between the CDF values at the two endpoints.
NOTE: Over the entire domain the total area under the PDF curve is equal to 1.
NOTE: A continuous random variable can take on an infinite number of values. The probability that it will equal a
specific value is always zero.

Example using the CDF of a normal distribution:

Suppose test scores are normally distributed with mean 100 and standard deviation 10. The probability
that a score is between 90 and 110 is:
P( 90 < X < 110 ) = P( X < 110 ) - P( X < 90 )
= 0.84 - 0.16 = 0.68,
i.e. approximately 68%.
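
The two CDF values in this example can be computed rather than looked up. The sketch below assumes the standard closed form of the normal CDF in terms of the error function; `normal_cdf` is a hypothetical helper name, not something from the original text.

```python
import math

def normal_cdf(x, mean, sd):
    """P(X < x) for a normal distribution with the given mean and SD."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

p_below_110 = normal_cdf(110, 100, 10)
p_below_90 = normal_cdf(90, 100, 10)
p_between = p_below_110 - p_below_90

print(round(p_below_110, 2), round(p_below_90, 2), round(p_between, 2))
# 0.84 0.16 0.68
```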

The standard deviation (represented by the symbol sigma, $\sigma$) shows how much variation or "dispersion" exists from the average
(mean, or expected value). A low standard deviation indicates that the data points tend to be very close to the mean; a high standard
deviation indicates that the data points are spread out over a large range of values.
Interpretation and application

Figure: example of two sample populations with the same mean and different standard deviations. The red population has mean 100 and SD
10; the blue population has mean 100 and SD 50.

A large standard deviation indicates that the data points are far from the mean and a small standard deviation indicates that they are
clustered closely around the mean.
For example, each of the three populations {0, 0, 14, 14}, {0, 6, 8, 14} and {6, 6, 8, 8} has a mean of 7. Their standard deviations are
7, 5, and 1, respectively.
The third population has a much smaller standard deviation than the other two because its values are all close to 7.
The standard deviation has the same units as the data points themselves.
If, for instance, the data set {0, 6, 8, 14} represents the ages of a population of four siblings in years, the standard deviation is 5
years.
As another example, the population {1000, 1006, 1008, 1014} may represent the distances traveled by four athletes, measured in
meters. It has a mean of 1007 meters, and a standard deviation of 5 meters.
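
These figures are easy to verify. The sketch below uses `statistics.pstdev` (the population standard deviation) on the four data sets from the text; the dictionary keys are labels invented for this example.

```python
import statistics

populations = {
    "wide": [0, 0, 14, 14],
    "medium": [0, 6, 8, 14],
    "narrow": [6, 6, 8, 8],
    "athletes (m)": [1000, 1006, 1008, 1014],
}

spreads = {name: statistics.pstdev(values) for name, values in populations.items()}

for name, values in populations.items():
    print(name, statistics.mean(values), spreads[name])
```

The athletes' distances are the sibling ages shifted by 1000, so the mean moves from 7 to 1007 while the standard deviation stays at 5: shifting every value by the same amount does not change the spread.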
Standard deviation may serve as a measure of uncertainty. In physical science, for example, the reported standard deviation of a
group of repeated measurements gives the precision of those measurements. When deciding whether measurements agree with a
theoretical prediction, the standard deviation of those measurements is of crucial importance: if the mean of the measurements is
too far away from the prediction (with the distance measured in standard deviations), then the theory being tested probably needs to
be revised. This makes sense since they fall outside the range of values that could reasonably be expected to occur, if the
prediction were correct and the standard deviation appropriately quantified. See prediction interval.
While the standard deviation does measure how far typical values tend to be from the mean, other measures are available. An
example is the mean absolute deviation, which might be considered a more direct measure of average distance, compared to
the root mean square distance inherent in the standard deviation.
Application examples
The practical value of understanding the standard deviation of a set of values is in appreciating how much variation there is from the
"average" (mean).
Climate
As a simple example, consider the average daily maximum temperatures for two cities, one inland and one on the coast. It is helpful
to understand that the range of daily maximum temperatures for cities near the coast is smaller than for cities inland. Thus, while
these two cities may each have the same average maximum temperature, the standard deviation of the daily maximum temperature
for the coastal city will be less than that of the inland city as, on any particular day, the actual maximum temperature is more likely to
be farther from the average maximum temperature for the inland city than for the coastal one.
Particle physics
Particle physics uses a standard of "5 sigma" for the declaration of a discovery. [3] At five-sigma there is only one chance in nearly
two million that a random fluctuation would yield the result. This level of certainty prompted the announcement that a particle
consistent with the Higgs boson has been discovered in two independent experiments at CERN.[4]
Sports
Another way of seeing it is to consider sports teams. In any set of categories, there will be teams that rate highly at some things and
poorly at others. Chances are, the teams that lead in the standings will not show such disparity but will perform well in most
categories. The lower the standard deviation of their ratings in each category, the more balanced and consistent they will tend to be.
Teams with a higher standard deviation, however, will be more unpredictable. For example, a team that is consistently bad in most
categories will have a low standard deviation. A team that is consistently good in most categories will also have a low standard
deviation. However, a team with a high standard deviation might be the type of team that scores a lot (strong offense) but also
concedes a lot (weak defense), or, vice versa, that might have a poor offense but compensates by being difficult to score on.
Trying to predict which teams, on any given day, will win, may include looking at the standard deviations of the various team "stats"
ratings, in which anomalies can match strengths vs. weaknesses to attempt to understand what factors may prevail as stronger
indicators of eventual scoring outcomes.
In racing, a driver is timed on successive laps. A driver with a low standard deviation of lap times is more consistent than a driver
with a higher standard deviation. This information can be used to help understand where opportunities might be found to reduce lap
times.
Finance
In finance, standard deviation is often used as a measure of the risk associated with price-fluctuations of a given asset (stocks,
bonds, property, etc.), or the risk of a portfolio of assets [5] (actively managed mutual funds, index mutual funds, or ETFs). Risk is an
important factor in determining how to efficiently manage a portfolio of investments because it determines the variation in returns on
the asset and/or portfolio and gives investors a mathematical basis for investment decisions (known as mean-variance optimization).
The fundamental concept of risk is that as it increases, the expected return on an investment should increase as well, an increase
known as the "risk premium." In other words, investors should expect a higher return on an investment when that investment carries
a higher level of risk or uncertainty. When evaluating investments, investors should estimate both the expected return and the
uncertainty of future returns. Standard deviation provides a quantified estimate of the uncertainty of future returns.
For example, let's assume an investor had to choose between two stocks. Stock A over the past 20 years had an average return of
10 percent, with a standard deviation of 20 percentage points (pp) and Stock B, over the same period, had average returns of 12
percent but a higher standard deviation of 30 pp. On the basis of risk and return, an investor may decide that Stock A is the safer
choice, because Stock B's additional two percentage points of return is not worth the additional 10 pp standard deviation (greater
risk or uncertainty of the expected return). Stock B is likely to fall short of the initial investment (but also to exceed the initial
investment) more often than Stock A under the same circumstances, and is estimated to return only two percent more on average.
In this example, Stock A is expected to earn about 10 percent, plus or minus 20 pp (a range of 30 percent to -10 percent), in about
two-thirds of future years. When considering more extreme possible returns or outcomes in future, an investor should
expect results of as much as 10 percent plus or minus 60 pp, or a range from 70 percent to -50 percent, which includes outcomes
for three standard deviations from the average return (about 99.7 percent of probable returns).
Calculating the average (or arithmetic mean) of the return of a security over a given period will generate the expected return of the
asset. For each period, subtracting the expected return from the actual return results in the difference from the mean. Squaring the
difference in each period and taking the average gives the overall variance of the return of the asset. The larger the variance, the
greater risk the security carries. Finding the square root of this variance will give the standard deviation of the investment tool in
question.
Population standard deviation is used to set the width of Bollinger Bands, a widely adopted technical analysis tool. For example, the
upper Bollinger Band is given as $\bar{x} + n\sigma_x$. The most commonly used value for $n$ is 2; there is about a five percent chance of going
outside, assuming a normal distribution of returns.
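
A minimal sketch of how such bands might be computed is shown below. The function name `bollinger_bands`, the default window of 20 periods, and the price list are all assumptions made for illustration; only the "mean plus or minus n population standard deviations" rule comes from the text.

```python
import statistics

def bollinger_bands(prices, window=20, n=2.0):
    """Return (middle, upper, lower) lists, one entry per full window.

    middle is the rolling mean; upper and lower are the mean plus or
    minus n population standard deviations of the same window.
    """
    middle, upper, lower = [], [], []
    for end in range(window, len(prices) + 1):
        recent = prices[end - window:end]
        mean = statistics.fmean(recent)
        sd = statistics.pstdev(recent)
        middle.append(mean)
        upper.append(mean + n * sd)
        lower.append(mean - n * sd)
    return middle, upper, lower

# Made-up prices, chosen so the first and last windows are easy to check by hand.
prices = [0, 6, 8, 14, 6, 6, 8, 8]
middle, upper, lower = bollinger_bands(prices, window=4, n=2)
print(middle[0], upper[0], lower[0])
```

The first window {0, 6, 8, 14} has mean 7 and SD 5, so its bands sit at 17 and -3; the last window {6, 6, 8, 8} has SD 1, so the bands tighten to 9 and 5.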
Geometric interpretation


To gain some geometric insights and clarification, we will start with a population of three values, $x_1, x_2, x_3$. This defines a point $P =
(x_1, x_2, x_3)$ in $\mathbb{R}^3$. Consider the line $L = \{(r, r, r) : r \in \mathbb{R}\}$. This is the "main diagonal" going through the origin. If our three given values
were all equal, then the standard deviation would be zero and $P$ would lie on $L$. So it is not unreasonable to assume that the
standard deviation is related to the distance of $P$ to $L$. And that is indeed the case. To move orthogonally from $L$ to the point $P$, one
begins at the point
$$M = (\bar{x}, \bar{x}, \bar{x}),$$
whose coordinates are the mean of the values we started out with. A little algebra shows that the distance
between $P$ and $M$ (which is the same as the orthogonal distance between $P$ and the line $L$) is equal to the standard deviation of
the vector $(x_1, x_2, x_3)$, multiplied by the square root of the number of dimensions of the vector (3 in this case).
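
The "little algebra" can be sanity-checked numerically. In this sketch the point `P` is an arbitrary choice and `distance_to_diagonal` is a hypothetical helper; the check is that the distance from the point to the diagonal equals the population standard deviation times the square root of 3.

```python
import math
import statistics

def distance_to_diagonal(point):
    """Distance from the point to M = (mean, mean, ..., mean) on the main diagonal."""
    mean = statistics.fmean(point)
    return math.sqrt(sum((x - mean) ** 2 for x in point))

P = (2.0, 4.0, 9.0)  # arbitrary example values
M = (statistics.fmean(P),) * len(P)

distance = distance_to_diagonal(P)
scaled_sd = statistics.pstdev(P) * math.sqrt(len(P))

# P - M is orthogonal to the direction (1, 1, 1) of the line L.
dot_with_diagonal = sum(p - m for p, m in zip(P, M))

print(distance, scaled_sd, dot_with_diagonal)
```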
Chebyshev's inequality

An observation is rarely more than a few standard deviations away from the mean.
Chebyshev's inequality ensures that, for all distributions for which the standard deviation is defined,
the amount of data within a number of standard deviations of the mean is at least as much as given in the following table.

Minimum population    Distance from mean
50%                   $\sqrt{2}\,\sigma$
75%                   $2\sigma$
89%                   $3\sigma$
94%                   $4\sigma$
96%                   $5\sigma$
97%                   $6\sigma$

[6]
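
The table entries follow from the usual statement of Chebyshev's inequality, that at least $1 - 1/k^2$ of the data lie within $k$ standard deviations of the mean. The sketch below assumes that form; `chebyshev_min_fraction` is an illustrative name.

```python
import math

def chebyshev_min_fraction(k):
    """Lower bound on the fraction of data within k standard deviations of the mean."""
    if k <= 0:
        raise ValueError("k must be positive")
    return max(0.0, 1.0 - 1.0 / k ** 2)

bounds = {k: chebyshev_min_fraction(k) for k in (math.sqrt(2), 2, 3, 4, 5, 6)}
for k, bound in bounds.items():
    print(f"within {k:.3f} standard deviations: at least {bound:.2%}")
```

For k below 1 the bound is clamped to zero, since the inequality then says nothing useful.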

Rules for normally distributed data

Figure: dark blue is less than one standard deviation from the mean. For the normal distribution, this accounts for 68.27 percent of the
set; while two standard deviations from the mean (medium and dark blue) account for 95.45 percent; three standard deviations
(light, medium, and dark blue) account for 99.73 percent; and four standard deviations account for 99.994 percent. The two
points of the curve that are one standard deviation from the mean are also the inflection points.

The central limit theorem says that the distribution of an average of many independent, identically distributed random
variables tends toward the famous bell-shaped normal distribution with a probability density function of:
$$f(x; \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$
where $\mu$ is the expected value of the random variables, $\sigma$ equals their distribution's standard deviation divided by $n^{1/2}$,
and $n$ is the number of random variables. The standard deviation therefore is simply a scaling variable that adjusts
how broad the curve will be, though it also appears in the normalizing constant.
If a data distribution is approximately normal, then the proportion of data values within $z$ standard deviations of the
mean is defined by:

$$\text{Proportion} = \operatorname{erf}\left(\frac{z}{\sqrt{2}}\right)$$
where $\operatorname{erf}$ is the error function. If a data distribution is approximately normal then about 68 percent of the data
values are within one standard deviation of the mean (mathematically, $\mu \pm \sigma$, where $\mu$ is the arithmetic mean),
about 95 percent are within two standard deviations ($\mu \pm 2\sigma$), and about 99.7 percent lie within three standard
deviations ($\mu \pm 3\sigma$).
This is known as the 68-95-99.7 rule, or the empirical rule.
For various values of $z$, the percentage of values expected to lie in and outside the symmetric interval,
$\text{CI} = (-z\sigma, z\sigma)$, are as follows:

z           Percentage within CI    Percentage outside CI    Fraction outside CI
0.674490    50%                     50%                      1 / 2
0.994458    68%                     32%                      1 / 3.125
1           68.2689492%             31.7310508%              1 / 3.1514872
1.281552    80%                     20%                      1 / 5
1.644854    90%                     10%                      1 / 10
1.959964    95%                     5%                       1 / 20
2           95.4499736%             4.5500264%               1 / 21.977895
2.575829    99%                     1%                       1 / 100
3           99.7300204%             0.2699796%               1 / 370.398
3.290527    99.9%                   0.1%                     1 / 1,000
3.890592    99.99%                  0.01%                    1 / 10,000
4           99.993666%              0.006334%                1 / 15,787
4.417173    99.999%                 0.001%                   1 / 100,000
4.891638    99.9999%                0.0001%                  1 / 1,000,000
5           99.9999426697%          0.0000573303%            1 / 1,744,278
5.326724    99.99999%               0.00001%                 1 / 10,000,000
5.730729    99.999999%              0.000001%                1 / 100,000,000
6           99.9999998027%          0.0000001973%            1 / 506,797,346
6.109410    99.9999999%             0.0000001%               1 / 1,000,000,000
6.466951    99.99999999%            0.00000001%              1 / 10,000,000,000
6.806502    99.999999999%           0.000000001%             1 / 100,000,000,000
7           99.9999999997440%       0.000000000256%          1 / 390,682,215,445
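
Rows of this table can be regenerated in both directions. The sketch below is one possible way to do it, assuming a normal distribution: `coverage` goes from z to the three percentage columns using the complementary error function (which keeps precision in the far tails), and `z_for_coverage` goes back from a target percentage to z using `statistics.NormalDist` from the Python standard library. Both function names are illustrative.

```python
import math
from statistics import NormalDist

def coverage(z):
    """Return (fraction within, fraction outside, 'one in N' outside) for +/- z SDs."""
    outside = math.erfc(z / math.sqrt(2.0))
    within = 1.0 - outside
    return within, outside, 1.0 / outside

def z_for_coverage(p):
    """z such that a fraction p of a normal population lies within +/- z SDs."""
    return NormalDist().inv_cdf((1.0 + p) / 2.0)

for z in (1, 2, 3):
    within, outside, one_in = coverage(z)
    print(f"z={z}: within {within:.7%}, outside {outside:.7%}, 1 / {one_in:.6g}")

print(round(z_for_coverage(0.95), 6))
```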

COVARIANCE
In probability theory and statistics, covariance is a measure of how much two random variables change together.
If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for
the smaller values, i.e., the variables tend to show similar behavior, the covariance is a positive number.
In the opposite case, when the greater values of one variable mainly correspond to the smaller values of the other, i.e., the
variables tend to show opposite behavior, the covariance is negative.
The sign of the covariance therefore shows the tendency in the linear relationship between the variables. The magnitude of
the covariance is not that easy to interpret.
The normalized version of the covariance, the correlation coefficient, however, shows by its magnitude the strength of the
linear relation.
A distinction must be made between
(1) the covariance of two random variables, which is a population parameter that can be seen as a property of the joint probability
distribution, and
(2) the sample covariance, which serves as an estimated value of the parameter.
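
The points above about sign, magnitude and the population/sample distinction can be seen in a small sketch. The data sets (`hours`, `score`, `errors`) are invented, and the `ddof` argument is an assumed convention in which 0 divides by N (population) and 1 divides by N - 1 (sample).

```python
import math

def covariance(xs, ys, ddof=0):
    """Covariance of paired data; ddof=0 is the population form, ddof=1 the sample form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - ddof)

def correlation(xs, ys):
    """Covariance normalized by both standard deviations; always between -1 and 1."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 68]   # tends to rise with hours
errors = [9, 7, 6, 4, 2]       # tends to fall with hours

print(covariance(hours, score), covariance(hours, errors))
print(covariance(hours, score, ddof=1))

# Rescaling one variable changes the covariance but not the correlation.
scaled = [100 * s for s in score]
print(covariance(hours, scaled), correlation(hours, score), correlation(hours, scaled))
```

The covariance jumps by a factor of 100 when the scores are rescaled, which is why its magnitude is hard to interpret, while the correlation is unchanged.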
VARIANCE
In probability theory and statistics, the variance is a measure of how far a set of numbers is spread out.
It is one of several descriptors of a probability distribution, describing how far the numbers lie from the mean (expected value).
In particular, the variance is one of the moments of a distribution.
In that context, it forms part of a systematic approach to distinguishing between probability distributions. While other such
approaches have been developed, those based on moments are advantageous in terms of mathematical and computational
simplicity.
The variance is a parameter describing in part either the actual probability distribution of an observed population of
numbers, or the theoretical probability distribution of a sample (a not-fully-observed population) of numbers.
In the latter case a sample of data from such a distribution can be used to construct an estimate of its variance: in the simplest
cases this estimate can be the sample variance, defined below.

Definition

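The covariance between two jointly distributed real-valued random variables $x$ and $y$ with finite second moments is defined[1] as:

$$\operatorname{Cov}(x, y) = \operatorname{E}\big[(x - \operatorname{E}[x])(y - \operatorname{E}[y])\big],$$

where $\operatorname{E}[x]$ is the expected value of $x$, also known as the mean of $x$. By using the linearity property of expectations, this can be
simplified to

$$\operatorname{Cov}(x, y) = \operatorname{E}[xy] - \operatorname{E}[x]\operatorname{E}[y],$$

where $\operatorname{E}[xy]$ is the correlation between $x$ and $y$;
if this term is zero then the random variables are orthogonal.[2]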
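
For readers who want the intermediate steps, the simplification by linearity of expectation can be written out as follows. This is a standard derivation, using $\operatorname{Cov}$ and $\operatorname{E}$ for covariance and expectation, and treating $\operatorname{E}[x]$ and $\operatorname{E}[y]$ as constants that can be pulled out of an expectation.

```latex
\begin{align*}
\operatorname{Cov}(x, y)
  &= \operatorname{E}\big[(x - \operatorname{E}[x])(y - \operatorname{E}[y])\big] \\
  &= \operatorname{E}\big[xy - x\operatorname{E}[y] - \operatorname{E}[x]\,y
       + \operatorname{E}[x]\operatorname{E}[y]\big] \\
  &= \operatorname{E}[xy] - \operatorname{E}[x]\operatorname{E}[y]
       - \operatorname{E}[x]\operatorname{E}[y] + \operatorname{E}[x]\operatorname{E}[y] \\
  &= \operatorname{E}[xy] - \operatorname{E}[x]\operatorname{E}[y].
\end{align*}
```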

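For random vectors $\mathbf{x}$ and $\mathbf{y}$ (of dimension $m$ and $n$ respectively) the $m \times n$ covariance matrix is equal to
$$\operatorname{Cov}(\mathbf{x}, \mathbf{y}) = \operatorname{E}\big[(\mathbf{x} - \operatorname{E}[\mathbf{x}])(\mathbf{y} - \operatorname{E}[\mathbf{y}])^{\mathrm{T}}\big] = \operatorname{E}[\mathbf{x}\mathbf{y}^{\mathrm{T}}] - \operatorname{E}[\mathbf{x}]\operatorname{E}[\mathbf{y}]^{\mathrm{T}},$$
where $\mathbf{m}^{\mathrm{T}}$ is the transpose of the vector (or matrix) $\mathbf{m}$.
The $(i,j)$-th element of this matrix is equal to the covariance $\operatorname{Cov}(x_i, y_j)$ between the $i$-th scalar component of $\mathbf{x}$ and
the $j$-th scalar component of $\mathbf{y}$. In particular, $\operatorname{Cov}(\mathbf{y}, \mathbf{x})$ is the transpose of $\operatorname{Cov}(\mathbf{x}, \mathbf{y})$.

For a vector $\mathbf{x} = (x_1, \ldots, x_m)^{\mathrm{T}}$ of $m$ jointly distributed random variables with finite second
moments, its covariance matrix is defined as:
$$\Sigma(\mathbf{x}) = \operatorname{Cov}(\mathbf{x}, \mathbf{x}).$$

Random variables whose covariance is zero are called uncorrelated.

The units of measurement of the covariance $\operatorname{Cov}(x, y)$ are those of $x$ times those of $y$.
By contrast, correlation coefficients, which depend on the covariance, are a dimensionless measure of linear
dependence. (In fact, correlation coefficients can simply be understood as a normalized version of
covariance.)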
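Properties

Variance is a special case of the covariance when the two variables are identical:
$$\operatorname{Cov}(x, x) = \operatorname{Var}(x).$$

If $x$, $y$, $W$, and $V$ are real-valued random variables and $a$, $b$, $c$, $d$ are constant ("constant" in this
context means non-random), then the following facts are a consequence of the definition of
covariance:
$$\operatorname{Cov}(x, a) = 0$$
$$\operatorname{Cov}(x, y) = \operatorname{Cov}(y, x)$$
$$\operatorname{Cov}(ax, by) = ab\,\operatorname{Cov}(x, y)$$
$$\operatorname{Cov}(x + a, y + b) = \operatorname{Cov}(x, y)$$
$$\operatorname{Cov}(ax + by, cW + dV) = ac\,\operatorname{Cov}(x, W) + ad\,\operatorname{Cov}(x, V) + bc\,\operatorname{Cov}(y, W) + bd\,\operatorname{Cov}(y, V)$$

For sequences $x_1, \ldots, x_n$ and $y_1, \ldots, y_m$ of random variables, we have
$$\operatorname{Cov}\left(\sum_{i=1}^{n} x_i, \sum_{j=1}^{m} y_j\right) = \sum_{i=1}^{n}\sum_{j=1}^{m} \operatorname{Cov}(x_i, y_j).$$

For a sequence $x_1, \ldots, x_n$ of random variables, and constants $a_1, \ldots, a_n$, we have
$$\operatorname{Var}\left(\sum_{i=1}^{n} a_i x_i\right) = \sum_{i=1}^{n} a_i^2 \operatorname{Var}(x_i) + 2\sum_{i<j} a_i a_j \operatorname{Cov}(x_i, x_j).$$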

A more general identity for covariance matrices

Let $\mathbf{x}$ be a random vector, let $\Sigma(\mathbf{x})$ denote its covariance matrix, and let $A$ be a
matrix that can act on $\mathbf{x}$. The result of applying this matrix to $\mathbf{x}$ is a new vector with
covariance matrix
$$\Sigma(A\mathbf{x}) = A\,\Sigma(\mathbf{x})\,A^{\mathrm{T}}.$$
This is a direct result of the linearity of expectation and is useful when applying
a linear transformation, such as a whitening transformation, to a vector.
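Uncorrelatedness and independence
If $x$ and $y$ are independent, then their covariance is zero. This follows because
under independence,
$$\operatorname{E}[xy] = \operatorname{E}[x]\operatorname{E}[y].$$
The converse, however, is not generally true. For example, let $x$ be uniformly
distributed in $[-1, 1]$ and let $y = x^2$. Clearly, $x$ and $y$ are dependent, but
$$\operatorname{Cov}(x, y) = \operatorname{Cov}(x, x^2) = \operatorname{E}[x \cdot x^2] - \operatorname{E}[x]\operatorname{E}[x^2] = \operatorname{E}[x^3] - \operatorname{E}[x]\operatorname{E}[x^2] = 0 - 0 \cdot \operatorname{E}[x^2] = 0.$$
In this case, the relationship between $y$ and $x$ is non-linear, while
correlation and covariance are measures of linear dependence between
two variables. Still, as in the example, if two variables are uncorrelated,
that does not imply that they are independent.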
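
This example can be worked exactly. For x uniform on [-1, 1], the moment E[x^k] is the integral of x^k / 2 over that interval, which is 0 for odd k and 1/(k + 1) for even k. The sketch below encodes that closed form in a hypothetical helper `uniform_moment` and uses it to evaluate the covariance, plus one non-linear statistic that does expose the dependence.

```python
from fractions import Fraction

def uniform_moment(k):
    """E[x**k] for x uniform on [-1, 1]: 0 for odd k, 1/(k+1) for even k."""
    if k % 2 == 1:
        return Fraction(0)
    return Fraction(1, k + 1)

# With y = x**2: Cov(x, y) = E[x**3] - E[x] * E[x**2]
cov_x_y = uniform_moment(3) - uniform_moment(1) * uniform_moment(2)

# Cov(x**2, y) = E[x**4] - E[x**2] * E[x**2], which is not zero.
cov_x2_y = uniform_moment(4) - uniform_moment(2) * uniform_moment(2)

print(cov_x_y, cov_x2_y)  # 0 4/45
```

The covariance between x and y vanishes, yet y is completely determined by x; the dependence only shows up once a non-linear function of x is compared with y.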
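Relationship to inner products
Many of the properties of covariance can be extracted elegantly by
observing that it satisfies similar properties to those of an inner product:

1. bilinear: for constants $a$ and $b$ and random variables $x$, $y$, $z$,
$\operatorname{Cov}(ax + by, z) = a\,\operatorname{Cov}(x, z) + b\,\operatorname{Cov}(y, z)$
2. symmetric: $\operatorname{Cov}(x, y) = \operatorname{Cov}(y, x)$
3. positive semi-definite: $\operatorname{Var}(x) = \operatorname{Cov}(x, x) \ge 0$, and $\operatorname{Cov}(x, x) = 0$
implies that $x$ is a constant random variable ($K$).
In fact these properties imply that the covariance defines an inner
product over the quotient vector space obtained by taking the subspace
of random variables with finite second moment and identifying any two
that differ by a constant. (This identification turns the positive
semi-definiteness above into positive definiteness.) That quotient vector
space is isomorphic to the subspace of random variables with finite
second moment and mean zero; on that subspace, the covariance is
exactly the $L^2$ inner product of real-valued functions on the sample
space.
As a result, for random variables with finite variance the following
inequality holds via the Cauchy-Schwarz inequality:
$$|\operatorname{Cov}(x, y)| \le \sqrt{\operatorname{Var}(x)\operatorname{Var}(y)}.$$
Proof: If $\operatorname{Var}(y) = 0$, then it holds trivially. Otherwise, let random
variable
$$Z = x - \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(y)}\, y.$$
Then we have:
$$0 \le \operatorname{Var}(Z) = \operatorname{Cov}\left(x - \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(y)}\, y,\; x - \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(y)}\, y\right) = \operatorname{Var}(x) - \frac{\operatorname{Cov}(x, y)^2}{\operatorname{Var}(y)}.$$
QED.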
Calculating the sample covariance

The sample covariance of $N$ observations of $K$ variables
is the $K$-by-$K$ matrix $Q = [q_{jk}]$ with the entries
$$q_{jk} = \frac{1}{N-1}\sum_{i=1}^{N} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k),$$
which is an estimate of the covariance between
variable $j$ and variable $k$.
The sample mean and the sample covariance
matrix are unbiased estimates of the mean and
the covariance matrix of the random vector $\mathbf{X}$, a
row vector whose $j$th element ($j = 1, \ldots, K$) is one of
the random variables. The reason the sample
covariance matrix has $N - 1$ in the
denominator rather than $N$ is essentially that the
population mean $\operatorname{E}(\mathbf{X})$ is not known and is
replaced by the sample mean $\bar{\mathbf{x}}$. If the population
mean $\operatorname{E}(\mathbf{X})$ is known, the analogous unbiased
estimate is given by
$$q_{jk} = \frac{1}{N}\sum_{i=1}^{N} (x_{ij} - \operatorname{E}(X_j))(x_{ik} - \operatorname{E}(X_k)).$$
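
A direct, unoptimized transcription of the sample covariance matrix might look like the sketch below. The function name, the optional `known_mean` argument, and the three-row data set are assumptions for illustration: with no known mean the sample means are used and the divisor is N - 1, and with a known population mean the divisor is N.

```python
def sample_covariance_matrix(rows, known_mean=None):
    """K-by-K covariance estimate from N observations (rows) of K variables.

    If known_mean is None, center on the sample means and divide by N - 1.
    If the population mean is supplied, center on it and divide by N.
    """
    n = len(rows)
    k = len(rows[0])
    if known_mean is None:
        center = [sum(row[j] for row in rows) / n for j in range(k)]
        denominator = n - 1
    else:
        center = list(known_mean)
        denominator = n
    return [
        [
            sum((row[j] - center[j]) * (row[l] - center[l]) for row in rows) / denominator
            for l in range(k)
        ]
        for j in range(k)
    ]

# Made-up data: 3 observations of 2 variables.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 9.0]]
Q = sample_covariance_matrix(data)
Q_known = sample_covariance_matrix(data, known_mean=[2.0, 5.0])
print(Q)
print(Q_known)
```

The matrix is symmetric, and its diagonal entries are the sample variances of the individual variables.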

Comments

The covariance is sometimes called a measure of "linear dependence" between the two random variables.
That does not mean the same thing as in the context of linear algebra (see linear dependence).
When the covariance is normalized, one obtains the correlation coefficient. From it, one can obtain the Pearson coefficient,
which gives us the goodness of the fit for the best possible linear function describing the relation between the variables.
In this sense covariance is a linear gauge of dependence.
MOMENT
In mathematics, a moment is, loosely speaking, a quantitative measure of the shape of a set of points.
The "second moment", for example, is widely used and measures the "width" (in a particular sense) of a set of points in one
dimension or in higher dimensions measures the shape of a cloud of points as it could be fit by an ellipsoid.
Other moments describe other aspects of a distribution such as how the distribution is skewed from its mean, or peaked. The
mathematical concept is closely related to the concept of moment in physics, although moment in physics is often represented
somewhat differently. Any distribution can be characterized by a number of features (such as the mean, the variance, the skewness,
etc.), and the moments of a function[1] describe the nature of its distribution.
The 1st moment is denoted by $\mu_1$. The first moment of the distribution of the random variable $X$ is the expectation operator, i.e.,
the population mean (if the first moment exists).
In higher orders, the central moments (moments about the mean) are more interesting than the moments about zero. The first
central moment is 0. The zeroth central moment, $\mu_0$, is one. The second central moment is the variance.

Other moments may also be defined. For example, the $n$th inverse moment about zero is $\operatorname{E}(X^{-n})$ and the $n$th logarithmic
moment about zero is $\operatorname{E}(\ln^n(X))$.

Significance of the moments

The $n$th moment of a real-valued continuous function $f(x)$ of a real variable about a value $c$ is
$$\mu_n = \int_{-\infty}^{\infty} (x - c)^n f(x)\,dx.$$
It is possible to define moments for random variables in a more general fashion than moments for real values; see moments
in metric spaces. The moment of a function, without further explanation, usually refers to the above expression with $c = 0$.
Usually, except in the special context of the problem of moments, the function $f(x)$ will be a probability density function.
The $n$th moment about zero of a probability density function $f(x)$ is the expected value of $X^n$ and is called a raw
moment or crude moment.[2] The moments about its mean are called central moments; these describe the shape of the
function, independently of translation.
If $f$ is a probability density function, then the value of the integral above is called the $n$th moment of the probability distribution.
More generally, if $F$ is a cumulative probability distribution function of any probability distribution, which may not have a density
function, then the $n$th moment of the probability distribution is given by the Riemann-Stieltjes integral
$$\mu'_n = \operatorname{E}[X^n] = \int_{-\infty}^{\infty} x^n\,dF(x)$$
where $X$ is a random variable that has this cumulative distribution $F$, and $\operatorname{E}$ is the expectation operator or mean.
When
$$\operatorname{E}[|X^n|] = \int_{-\infty}^{\infty} |x^n|\,dF(x) = \infty,$$
then the moment is said not to exist. If the $n$th moment about any point exists, so does the $(n - 1)$th moment, and all
lower-order moments, about every point.
The zeroth moment of any probability density function is 1, since the area under any probability density
function must be equal to one.

Significance of moments (raw, central, standardized) and cumulants (raw, standardized), in connection with
named properties of distributions

Moment number   Raw moment   Central moment   Standardized moment    Raw cumulant   Standardized cumulant
1               mean         0                0                      mean           N/A
2               -            variance         1                      variance       1
3               -            -                skewness               -              skewness
4               -            -                historical kurtosis    -              modern kurtosis (i.e. excess kurtosis)
5+              -            -                -                      -              -

Mean
The first raw moment is the mean.
Variance
The second central moment is the variance. Its positive square root is the standard deviation $\sigma$.
Normalized moments
The normalized $n$th central moment or standardized moment is the $n$th central moment divided by $\sigma^n$; the
normalized $n$th central moment of $x$ is $\operatorname{E}((x - \mu)^n)/\sigma^n$. These normalized central moments are dimensionless
quantities, which represent the distribution independently of any linear change of scale.
Skewness
The third central moment is a measure of the lopsidedness of the distribution;
any symmetric distribution will have a third central moment, if defined, of zero.
The normalized third central moment is called the skewness, often $\gamma$.
A distribution that is skewed to the left (the tail of the distribution is heavier on the left) will have a negative
skewness.
A distribution that is skewed to the right (the tail of the distribution is heavier on the right) will have a positive
skewness.
For distributions that are not too different from the normal distribution,
the median will be somewhere near $\mu - \gamma\sigma/6$; the mode about $\mu - \gamma\sigma/2$.
Kurtosis
The fourth central moment is a measure of whether the distribution is tall and skinny or short and squat, compared
to the normal distribution of the same variance. Since it is the expectation of a fourth power, the fourth central
moment, where defined, is always non-negative; and except for a point distribution, it is always strictly positive. The
fourth central moment of a normal distribution is $3\sigma^4$.
The kurtosis $\kappa$ is defined to be the normalized fourth central moment minus 3. (Equivalently, as in the next section, it
is the fourth cumulant divided by the square of the variance.) Some authorities[3][4] do not subtract three, but it is
usually more convenient to have the normal distribution at the origin of coordinates. If a distribution has a peak at
the mean and long tails, the fourth moment will be high and the kurtosis positive (leptokurtic); and conversely; thus,
bounded distributions tend to have low kurtosis (platykurtic).
The kurtosis can be positive without limit, but $\kappa$ must be greater than or equal to $\gamma^2 - 2$; equality only holds for binary
distributions. For unbounded skew distributions not too far from normal, $\kappa$ tends to be somewhere in the area of
$\gamma^2$ and $2\gamma^2$.
The inequality can be proven by considering
$$\operatorname{E}\big[(T^2 - aT - 1)^2\big]$$
where $T = (X - \mu)/\sigma$. This is the expectation of a square, so it is non-negative whatever $a$ is; on the other hand,
it's also a quadratic expression in $a$. Its discriminant must be non-positive, which gives the required relationship.
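
Skewness and kurtosis as described here are just standardized third and fourth moments, so they can be sketched in a few lines. The function names and the two data sets below are made up; the code treats each list as a whole population (dividing by the number of values) and uses the "minus 3" convention for kurtosis.

```python
import math

def standardized_moment(values, n):
    """n-th central moment divided by sigma**n, treating values as a population."""
    count = len(values)
    mean = sum(values) / count
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / count)
    return sum(((v - mean) / sd) ** n for v in values) / count

def skewness(values):
    return standardized_moment(values, 3)

def excess_kurtosis(values):
    return standardized_moment(values, 4) - 3.0

right_skewed = [1, 1, 2, 2, 3, 10]   # long tail on the right
binary = [0, 0, 0, 1]                # only two distinct values

for data in (right_skewed, binary):
    g, k = skewness(data), excess_kurtosis(data)
    print(round(g, 4), round(k, 4), round(g ** 2 - 2, 4))
```

For the binary data the kurtosis equals the squared skewness minus 2, the equality case of the bound; for the other data set it is strictly larger.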
Mixed moments
Mixed moments are moments involving multiple variables.
Some examples are covariance, coskewness and cokurtosis. While there is a unique covariance, there are
multiple co-skewnesses and co-kurtoses.
Higher moments
High-order moments are moments beyond 4th-order moments. The higher the moment, the harder it is to
estimate, in the sense that larger samples are required in order to obtain estimates of similar quality.
Cumulants

The first moment and the second and third unnormalized central moments are additive in the sense that
if X and Y are independent random variables then

$$\mu_1(X + Y) = \mu_1(X) + \mu_1(Y),$$

$$\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$$

and

$$\mu_3(X + Y) = \mu_3(X) + \mu_3(Y).$$

(These can also hold for variables that satisfy weaker conditions than independence. The first
always holds; if the second holds, the variables are called uncorrelated.)
In fact, these are the first three cumulants and all cumulants share this additivity property.
Sample moments

The moments of a population can be estimated using the sample k-th moment

$$\frac{1}{n} \sum_{i=1}^{n} X_i^k$$

applied to a sample X1, X2, ..., Xn drawn from the population.

It can be shown that the expected value of the sample moment is equal to the k-th
moment of the population, if that moment exists, for any sample size n. It is thus an
unbiased estimator.
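The estimator above is simple to compute directly. This is a small sketch; the function name `sample_moment` is an assumption made for illustration.

```python
def sample_moment(xs, k):
    """k-th raw sample moment: the average of x**k over the sample."""
    if not xs:
        raise ValueError("sample must be non-empty")
    return sum(x ** k for x in xs) / len(xs)


sample = [1, 2, 3, 4]
first = sample_moment(sample, 1)   # the sample mean
second = sample_moment(sample, 2)  # the mean of the squares
```

The first moment is the ordinary sample mean; higher values of k weight large observations more heavily, which is one reason high-order estimates need larger samples.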
Problem of moments

The problem of moments seeks characterizations of sequences { μ′_n : n = 1, 2, 3, ... }
that are sequences of moments of some function f.
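Partial moments

Partial moments are sometimes referred to as "one-sided moments." The nth order lower
and upper partial moments with respect to a reference point r may be expressed as

$$\mu_n^-(r) = \int_{-\infty}^{r} (r - x)^n f(x)\,dx,$$

$$\mu_n^+(r) = \int_{r}^{\infty} (x - r)^n f(x)\,dx.$$
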
Partial moments are normalized by being raised to the power 1/n. The upside
potential ratio may be expressed as a ratio of a first-order upper partial
moment to a normalized second-order lower partial moment.
Central Moments in metric spaces

Let (M, d) be a metric space, and let B(M) be the Borel σ-algebra on M, the σ-algebra
generated by the d-open subsets of M. (For technical reasons, it is
also convenient to assume that M is a separable space with respect to
the metric d.) Let 1 ≤ p ≤ +∞.
The pth central moment of a measure μ on the measurable space (M, B(M))
about a given point x0 in M is defined to be

$$\int_M d(x, x_0)^p \,\mathrm{d}\mu(x).$$

μ is said to have finite pth central moment if the pth central moment
of μ about x0 is finite for some x0 ∈ M.
This terminology for measures carries over to random variables in the
usual way: if (Ω, Σ, P) is a probability space and X : Ω → M is a random
variable, then the pth central moment of X about x0 ∈ M is defined to be

$$\int_M d(x, x_0)^p \,\mathrm{d}\left(X_*(P)\right)(x) = \int_\Omega d(X(\omega), x_0)^p \,\mathrm{d}P(\omega),$$

and X has finite pth central moment if the pth central moment
of X about x0 is finite for some x0 ∈ M.
INDEPENDENCE
In probability theory, to say that two events are independent (alternatively statistically independent, marginally
independent or absolutely independent[1]) means that the occurrence of one does not affect the probability of the other. Similarly,
two random variables are independent if the observed value of one does not affect the probability distribution of the other.
The concept of independence extends to dealing with collections of more than two events or random variables.

Definition

For events
Two events

Two events A and B are independent iff their joint probability equals the product of their probabilities:

$$P(A \cap B) = P(A)P(B).$$

Why this defines independence is made clear by rewriting with conditional probabilities:

$$P(A \cap B) = P(A)P(B) \iff P(A \mid B) = \frac{P(A \cap B)}{P(B)} = P(A),$$

and similarly

$$P(A \cap B) = P(A)P(B) \iff P(B \mid A) = \frac{P(A \cap B)}{P(A)} = P(B).$$

Thus, the occurrence of B does not affect the probability of A, and vice versa. Although the derived
expressions may seem more intuitive, they are not the preferred definition, as the conditional probabilities may
be undefined if P(A) or P(B) are 0. Furthermore, the preferred definition makes clear by symmetry
that when A is independent of B, B is also independent of A.
More than two events

A finite set of events {A_i} is pairwise independent iff every pair of events is independent[2]. That is, iff for
all pairs of indices m, n (m ≠ n)

$$P(A_m \cap A_n) = P(A_m)P(A_n).$$

A finite set of events is mutually independent iff every event is independent of any intersection of the
other events[3]. That is, iff for every subset {A_1, ..., A_n} of the events

$$P\left(\bigcap_{i=1}^{n} A_i\right) = \prod_{i=1}^{n} P(A_i).$$

This is called the multiplication rule for independent events.
For more than two events, a mutually independent set of events is pairwise independent, but the
converse is not necessarily true.
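A standard illustration of the gap between the two notions uses two fair coin tosses. The sketch below is an assumption-laden example (the events A, B, C and the helper names are chosen here, not taken from the text): A is "first toss is heads", B is "second toss is heads", and C is "both tosses agree".

```python
from fractions import Fraction
from itertools import combinations

# Sample space of two fair coin tosses, each outcome equally likely.
omega = {"HH", "HT", "TH", "TT"}


def prob(event):
    """Probability of an event (a subset of omega) under the uniform measure."""
    return Fraction(len(event), len(omega))


A = {"HH", "HT"}  # first toss is heads
B = {"HH", "TH"}  # second toss is heads
C = {"HH", "TT"}  # both tosses agree


def is_pairwise_independent(events):
    """Check P(E ∩ F) = P(E) P(F) for every pair of events."""
    return all(prob(e & f) == prob(e) * prob(f)
               for e, f in combinations(events, 2))


def is_mutually_independent(events):
    """Check the multiplication rule for every subset of two or more events."""
    for size in range(2, len(events) + 1):
        for subset in combinations(events, size):
            intersection = set(omega).intersection(*subset)
            product = Fraction(1)
            for e in subset:
                product *= prob(e)
            if prob(intersection) != product:
                return False
    return True


pairwise = is_pairwise_independent([A, B, C])
mutual = is_mutually_independent([A, B, C])
```

Every pair multiplies correctly (1/4 = 1/2 · 1/2), but the triple intersection {HH} has probability 1/4 rather than 1/8, so the three events are pairwise but not mutually independent.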
For random variables
Two random variables

Two random variables X and Y are independent iff for every a and b, the
events {X ≤ a} and {Y ≤ b} are independent events (as defined above). That
is, X and Y with cumulative distribution functions F_X(x) and F_Y(y), and probability
densities f_X(x) and f_Y(y), are independent iff the combined random variable (X, Y)
has a joint cumulative distribution function

$$F_{X,Y}(x, y) = F_X(x) F_Y(y),$$

or equivalently, a joint density

$$f_{X,Y}(x, y) = f_X(x) f_Y(y).$$

More than two random variables

A set of random variables is pairwise independent iff every pair of random variables is
independent.
A set of random variables is mutually independent iff for any finite
subset X_1, ..., X_n and any finite set of numbers a_1, ..., a_n, the
events {X_1 ≤ a_1}, ..., {X_n ≤ a_n} are independent events (as
defined above).

The measure-theoretically inclined may prefer to substitute events {X ∈ A} for
events {X ≤ a} in the above definition, where A is any Borel set. That definition
is exactly equivalent to the one above when the values of the random variables are real
numbers. It has the advantage of working also for complex-valued random variables or
for random variables taking values in any measurable space (which includes topological
spaces endowed by appropriate σ-algebras).
Conditional independence

Intuitively, two random variables X and Y are conditionally independent given Z if,
once Z is known, the value of Y does not add any additional information about X. For
instance, two measurements X and Y of the same underlying quantity Z are not
independent, but they are conditionally independent given Z (unless the errors in the
two measurements are somehow connected).
The formal definition of conditional independence is based on the idea of conditional
distributions. If X, Y, and Z are discrete random variables, then we define X and Y to
be conditionally independent given Z if

$$P(X = x, Y = y \mid Z = z) = P(X = x \mid Z = z)\, P(Y = y \mid Z = z)$$

for all x, y and z such that P(Z = z) > 0. On the other hand, if the random variables
are continuous and have a joint probability density function p,
then X and Y are conditionally independent given Z if

$$p_{XY \mid Z}(x, y \mid z) = p_{X \mid Z}(x \mid z)\, p_{Y \mid Z}(y \mid z)$$

for all real numbers x, y and z such that p_Z(z) > 0.

If X and Y are conditionally independent given Z, then

$$P(X = x \mid Y = y, Z = z) = P(X = x \mid Z = z)$$

for any x, y and z with P(Y = y, Z = z) > 0. That is, the conditional distribution
for X given Y and Z is the same as that given Z alone. A similar equation
holds for the conditional probability density functions in the continuous
case.
Independence can be seen as a special kind of conditional
independence, since probability can be seen as a kind of conditional
probability given no events.
Independent σ-algebras
The definitions above are both generalized by the following definition of
independence for σ-algebras. Let (Ω, Σ, Pr) be a probability space and
let 𝒜 and ℬ be two sub-σ-algebras of Σ. 𝒜 and ℬ are said to
be independent if, whenever A ∈ 𝒜 and B ∈ ℬ,

$$\Pr[A \cap B] = \Pr[A] \Pr[B].$$

The new definition relates to the previous ones very directly:

Two events are independent (in the old sense) if and only
if the σ-algebras that they generate are independent (in the
new sense). The σ-algebra generated by an event E ∈ Σ is,
by definition,

$$\sigma(\{E\}) = \{\emptyset, E, \Omega \setminus E, \Omega\}.$$

Two random variables X and Y defined over Ω are
independent (in the old sense) if and only if the σ-algebras
that they generate are independent (in the new
sense). The σ-algebra generated by a random
variable X taking values in some measurable
space S consists, by definition, of all subsets of Ω of the
form X⁻¹(U), where U is any measurable subset of S.
Using this definition, it is easy to show that if X and Y are
random variables and Y is constant, then X and Y are
independent, since the σ-algebra generated by a constant
random variable is the trivial σ-algebra {∅, Ω}. Probability
zero events cannot affect independence so independence
also holds if Y is only Pr-almost surely constant.
Properties

Self-dependence
Note that an event is independent of itself iff

$$P(A) = P(A \cap A) = P(A)P(A) \iff P(A) = 0 \text{ or } 1.$$

Thus if an event or its complement almost
surely occurs, it is independent of itself. For
example, if A is choosing any number but 0.5
from a uniform distribution on the unit interval,
A is independent of itself, even
though, tautologically, A fully determines A.
Expectation and covariance

If X and Y are independent, then the expectation operator E has the property

$$\operatorname{E}[XY] = \operatorname{E}[X]\operatorname{E}[Y],$$

and for the covariance, since we have

$$\operatorname{cov}(X, Y) = \operatorname{E}[XY] - \operatorname{E}[X]\operatorname{E}[Y],$$

the covariance is zero. (The converse of these, i.e. the proposition that if two random variables
have a covariance of 0 they must be independent, is not true.
See uncorrelated.)
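A small counterexample shows why zero covariance does not imply independence. The sketch below assumes X is uniform on {−1, 0, 1} and Y = X²; this particular pair is chosen here for illustration and does not come from the text.

```python
from fractions import Fraction

# X is uniform on {-1, 0, 1}; Y = X**2 is completely determined by X.
support = [-1, 0, 1]
p = Fraction(1, 3)


def expect(f):
    """Expectation of f(X) under the uniform distribution on `support`."""
    return sum(p * f(x) for x in support)


e_x = expect(lambda x: x)
e_y = expect(lambda x: x * x)
e_xy = expect(lambda x: x * (x * x))
covariance = e_xy - e_x * e_y

# Compare the joint probability P(X = 0, Y = 0) with the product of marginals.
p_joint = sum(p for x in support if x == 0 and x * x == 0)
p_x0 = sum(p for x in support if x == 0)
p_y0 = sum(p for x in support if x * x == 0)
independent = (p_joint == p_x0 * p_y0)
```

The covariance is exactly zero, yet P(X = 0, Y = 0) = 1/3 differs from P(X = 0) P(Y = 0) = 1/9, so the variables are uncorrelated but dependent.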
Characteristic function
Two independent random variables X and Y have the property that the characteristic
function of their sum is the product of their marginal characteristic functions:

$$\varphi_{X+Y}(t) = \varphi_X(t)\,\varphi_Y(t),$$

but the reverse implication is not true (see subindependence).
Examples

Rolling a die
The event of getting a 6 the first time a die is rolled and the event of getting a 6 the second time
are independent. By contrast, the event of getting a 6 the first time a die is rolled and the event that the
sum of the numbers seen on the first and second trials is 8 are not independent.
Drawing cards
If two cards are drawn with replacement from a deck of cards, the event of drawing a red card on the first trial
and that of drawing a red card on the second trial are independent. By contrast, if two cards are
drawn without replacement from a deck of cards, the event of drawing a red card on the first trial
and that of drawing a red card on the second trial are again not independent.
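The die example can be verified by enumerating all 36 equally likely outcomes of two rolls. This is a minimal sketch; the helper names `prob` and `independent` are assumptions introduced for illustration.

```python
from fractions import Fraction
from itertools import product

# All ordered outcomes (first roll, second roll) of a fair six-sided die.
outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event given as a predicate on (first, second)."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))


def independent(e, f):
    """True when P(E and F) equals P(E) * P(F)."""
    return prob(lambda o: e(o) and f(o)) == prob(e) * prob(f)


def first_six(o):
    return o[0] == 6


def second_six(o):
    return o[1] == 6


def sum_eight(o):
    return o[0] + o[1] == 8


six_six = independent(first_six, second_six)
six_sum = independent(first_six, sum_eight)
```

Here P(first is 6 and sum is 8) = 1/36, while P(first is 6) · P(sum is 8) = 1/6 · 5/36 = 5/216, which confirms the dependence.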
Pairwise and mutual independence

Figure: Pairwise independent, but not mutually independent, events.

Figure: Mutually independent events.

Consider the two probability spaces shown. In both cases, P(A) = P(B) = 1/2 and P(C) = 1/4. The first space
is pairwise independent but not mutually independent. The second space is
mutually independent. To illustrate the difference, consider conditioning on two
events. In the pairwise independent case, although, for example, A is
independent of both B and C, it is not independent of B ∩ C:

$$P(A \mid B \cap C) \neq P(A), \qquad P(B \mid A \cap C) \neq P(B), \qquad P(C \mid A \cap B) \neq P(C).$$

In the mutually independent case however:

$$P(A \mid B \cap C) = P(A), \qquad P(B \mid A \cap C) = P(B), \qquad P(C \mid A \cap B) = P(C).$$

See also [4] for a three-event example in which

$$P(A \cap B \cap C) = P(A)P(B)P(C),$$

and yet no two of the three events are pairwise independent.
In layperson's terms, two events are 'mutually exclusive' if they cannot occur at the same time. An example is tossing a coin once,
which can result in either heads or tails, but not both.
In the coin-tossing example, both outcomes are collectively exhaustive, which means that at least one of the outcomes must
happen, so these two possibilities together exhaust all the possibilities. However, not all mutually exclusive events are collectively
exhaustive. For example, the outcomes 1 and 4 of a single roll of a six-sided die are mutually exclusive (cannot both happen) but
not collectively exhaustive (there are other possible outcomes; 2,3,5,6).
Logic

In logic, two mutually exclusive propositions are propositions that logically cannot be true at the same time. Another term for
mutually exclusive is "disjoint". To say that more than two propositions are mutually exclusive may mean, depending on context, either
that no two of them can be true at the same time or only that they cannot all be true together. The term pairwise mutually exclusive always
means that no two of them can be true simultaneously.
Probability

In probability theory, events E1, E2, ..., En are said to be mutually exclusive if the occurrence of any one of them automatically
implies the non-occurrence of the remaining n − 1 events. Therefore, two mutually exclusive events cannot both occur. Formally
said, the intersection of each two of them is empty (the null event): A ∩ B = ∅. In consequence, mutually exclusive events have
the property: P(A and B) = 0.[1]
For example, one cannot draw a card that is both red and a club because clubs are always black. If one draws just one card from
the deck, either a red card or a club can be drawn. When A and B are mutually exclusive, P(A or B) = P(A) + P(B).[2] One might ask,
"What is the probability of drawing a red card or a club?" This problem would be solved by adding together the probability of drawing
a red card and the probability of drawing a club. In a standard 52-card deck, there are twenty-six red cards and thirteen clubs: 26/52
+ 13/52 = 39/52 or 3/4.
One would have to draw at least two cards in order to draw both a red card and a club. The probability of doing so in two draws
would depend on whether the first card drawn were replaced before the second drawing, since without replacement there would be
one fewer card after the first card was drawn. The probabilities of the individual events (red, and club) would be multiplied rather
than added. The probability of drawing a red card and then a club in two drawings without replacement would be 26/52 * 13/51 = 338/2652, or
13/102. With replacement, the probability would be 26/52 * 13/52 = 338/2704, or 1/8.
In probability theory the word "or" allows for the possibility of both events happening. The probability of one or both events occurring
is denoted P(A or B) and in general it equals P(A) + P(B) − P(A and B).[2] Therefore, if one asks, "What is the probability of drawing a
red card or a king?", drawing any of a red king, a red non-king, or a black king is considered a success. In a standard 52-card deck,
there are twenty-six red cards and four kings, two of which are red, so the probability of drawing a red or a king is 26/52 + 4/52
− 2/52 = 28/52. However, with mutually exclusive events the last term in the formula, P(A and B), is zero, so the formula simplifies to
the one given in the previous paragraph.
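Both versions of the addition rule can be checked by enumerating a deck. The following is a sketch under the assumption of a standard 52-card deck; the rank and suit labels and the helper `prob` are illustrative choices.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(RANKS, SUITS))  # 52 (rank, suit) pairs


def prob(event):
    """Probability that a single card drawn uniformly satisfies `event`."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))


def is_red(card):
    return card[1] in ("hearts", "diamonds")


def is_club(card):
    return card[1] == "clubs"


def is_king(card):
    return card[0] == "K"


# Mutually exclusive events: the probabilities simply add.
p_red_or_club = prob(lambda c: is_red(c) or is_club(c))

# Overlapping events: the overlap (the two red kings) is subtracted once.
p_red_or_king = prob(lambda c: is_red(c) or is_king(c))
p_by_formula = prob(is_red) + prob(is_king) - prob(lambda c: is_red(c) and is_king(c))
```

Counting directly and applying P(A) + P(B) − P(A and B) give the same value, 28/52, which reduces to 7/13.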
Events are collectively exhaustive if all the possibilities for outcomes are exhausted by those possible events, so at least one of
those outcomes must occur. The probability that at least one of the events will occur is equal to 1. [3] For example, there are
theoretically only two possibilities for flipping a coin. Flipping a head and flipping a tail are collectively exhaustive events, and there
is a probability of 1 of flipping either a head or a tail. Events can be both mutually exclusive and collectively exhaustive. [3] In the case
of flipping a coin, flipping a head and flipping a tail are also mutually exclusive events. Both outcomes cannot occur for a single trial
(i.e., when a coin is flipped only once). The probability of flipping a head and the probability of flipping a tail can be added to yield a
probability of 1: 1/2 + 1/2 =1.[4]
Statistics

In statistics and regression analysis, an independent variable that can take on only two possible values is called a dummy variable.
For example, it may take on the value 0 if an observation is of a male subject or 1 if the observation is of a female subject. The two
possible categories associated with the two possible values are mutually exclusive, so that no observation falls into more than one
category, and the categories are exhaustive, so that every observation falls into some category. Sometimes there are three or more
possible categories, which are pairwise mutually exclusive and are collectively exhaustive: for example, under 18 years of age, 18
to 64 years of age, and age 65 or above. In this case a set of dummy variables is constructed, each dummy variable having two
mutually exclusive and jointly exhaustive categories. In this example, one dummy variable (called D1) would equal 1 if age is less
than 18, and would equal 0 otherwise; a second dummy variable (called D2) would equal 1 if age is in the range 18-64, and 0
otherwise. In this set-up, the dummy variable pairs (D1, D2) can have the values (1,0) (under 18), (0,1) (between 18 and 64), or (0,0)
(65 or older) (but not (1,1), which would nonsensically imply that an observed subject is both under 18 and between 18 and 64).
Then the dummy variables can be included as independent (explanatory) variables in a regression. Note that the number of dummy
variables is always one less than the number of categories: with the two categories male and female there is a single dummy
variable to distinguish them, while with the three age categories two dummy variables are needed to distinguish them.
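The age example can be written as a small encoding function. This is a minimal sketch; the function name `age_dummies` is hypothetical, and D1 and D2 follow the definitions given in the paragraph above.

```python
def age_dummies(age):
    """Encode an age into the dummy-variable pair (D1, D2).

    D1 = 1 if age is under 18, D2 = 1 if age is between 18 and 64.
    Age 65 or above is the reference category and maps to (0, 0).
    """
    d1 = 1 if age < 18 else 0
    d2 = 1 if 18 <= age <= 64 else 0
    return d1, d2


encoded = [age_dummies(age) for age in (12, 40, 70)]
```

Because the three age bands are mutually exclusive, the pair (1, 1) can never be produced, and two dummy variables are enough to distinguish three categories.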
Such qualitative data can also be used for dependent variables. For example, a researcher might want to predict whether someone
goes to college or not, using family income, a gender dummy variable, and so forth as explanatory variables. Here the variable to be
explained is a dummy variable that equals 0 if the observed subject does not go to college and equals 1 if the subject does go to
college. In such a situation, ordinary least squares (the basic regression technique) is widely seen as inadequate; instead probit
regression or logistic regression is used. Further, sometimes there are three or more categories for the dependent variable, for
example, no college, community college, and four-year college. In this case, the multinomial probit or multinomial logit technique is
used.
In mathematics, two sets are said to be disjoint if they have no element in common. For example, {1, 2, 3} and {4, 5, 6} are disjoint
sets.[1]
Explanation

Formally, two sets A and B are disjoint if their intersection is the empty set, i.e. if

$$A \cap B = \emptyset.$$

This definition extends to any collection of sets. A collection of sets is pairwise disjoint or mutually disjoint if, given any two
sets in the collection, those two sets are disjoint.
Formally, let I be an index set, and for each i in I, let A_i be a set. Then the family of sets {A_i : i ∈ I} is pairwise disjoint if for
any i and j in I with i ≠ j,

$$A_i \cap A_j = \emptyset.$$

For example, the collection of sets { {1}, {2}, {3}, ... } is pairwise disjoint. If {A_i} is a pairwise disjoint collection (containing
at least two sets), then clearly its intersection is empty:

$$\bigcap_{i \in I} A_i = \emptyset.$$

However, the converse is not true: the intersection of the collection {{1, 2}, {2, 3}, {3, 1}} is empty, but the collection
is not pairwise disjoint. In fact, there are no two disjoint sets in this collection.
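The difference between an empty overall intersection and pairwise disjointness is easy to test with Python sets. This is a small sketch; the helper names are assumptions chosen for illustration.

```python
from itertools import combinations


def is_pairwise_disjoint(sets):
    """True when every two distinct sets in the collection share no element."""
    return all(a.isdisjoint(b) for a, b in combinations(sets, 2))


def common_intersection(sets):
    """Intersection of all sets in a non-empty collection."""
    return set.intersection(*sets)


collection = [{1, 2}, {2, 3}, {3, 1}]
overall = common_intersection(collection)
pairwise = is_pairwise_disjoint(collection)
```

For this collection the overall intersection is empty although every pair of sets overlaps, whereas a collection such as [{1}, {2}, {3}] passes both checks.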
A partition of a set X is any collection of non-empty subsets {A_i : i ∈ I} of X such that {A_i} are pairwise disjoint and

$$\bigcup_{i \in I} A_i = X.$$

In probability theory and statistics, a sequence or other collection of random variables is independent and identically
distributed (i.i.d.) if each random variable has the same probability distribution as the others and all are mutually independent.[1]
The abbreviation i.i.d. is particularly common in statistics (often as iid, sometimes written IID), where observations in a sample are
often assumed to be effectively i.i.d. for the purposes of statistical inference. The assumption (or requirement) that observations be
i.i.d. tends to simplify the underlying mathematics of many statistical methods (see mathematical statistics and statistical theory).
However, in practical applications of statistical modeling the assumption may or may not be realistic. The generalization
of exchangeable random variables is often sufficient and more easily met.
The assumption is important in the classical form of the central limit theorem, which states that the probability distribution of the sum
(or average) of i.i.d. variables with finite variance approaches a normal distribution.
Note that IID refers to sequences of random variables. "Independent and identically distributed" implies an element in the sequence
is independent of the random variables that came before it. In this way, an IID sequence is different from a Markov sequence, where
the probability distribution for the nth random variable is a function of the previous random variable in the sequence (for a first order
Markov sequence). An IID sequence does not imply the probabilities for all elements of the sample space or event space must be
the same.[2] For example, repeated throws of loaded dice will produce a sequence that is IID, despite the outcomes being biased.
Examples

Uses in modeling
The following are examples or applications of independent and identically distributed (i.i.d.) random variables:

A sequence of outcomes of spins of a fair or unfair roulette wheel is i.i.d. One implication of this is that if the roulette ball lands
on "red", for example, 20 times in a row, the next spin is no more or less likely to be "black" than on any other spin (see
the Gambler's fallacy).
A sequence of fair or loaded dice rolls is i.i.d.
A sequence of fair or unfair coin flips is i.i.d.
In signal processing and image processing the notion of transformation to IID implies two specifications, the "ID"
(ID = identically distributed) part and the "I" (I = independent) part:

(ID) the signal level must be balanced on the time axis;

(I) the signal spectrum must be flattened, i.e. transformed by filtering (such as deconvolution) to a white signal (one

where all frequencies are equally present).


Uses in inference

One of the simplest statistical tests, the z-test, is used to test hypotheses about means of random variables. When using the z-
test, one assumes (requires) that all observations are i.i.d. in order to satisfy the conditions of the central limit theorem.
Generalizations

Many results that are initially stated for i.i.d. variables are true more generally.
Exchangeable random variables

The most general notion which shares the main properties of i.i.d. variables is that of exchangeable random variables, introduced
by Bruno de Finetti. Exchangeability means that while variables may not be independent or identically distributed, future ones
behave like past ones. Formally, any value of a finite sequence is as likely as any permutation of those values: the joint probability
distribution is invariant under the symmetric group.
This provides a useful generalization (for example, sampling without replacement is not independent, but is exchangeable) and is
widely used in Bayesian statistics.
Lévy process

In stochastic calculus, i.i.d. variables are thought of as a discrete time Lévy process: each variable gives how much one changes
from one time to another. For example, a sequence of Bernoulli trials is interpreted as the Bernoulli process. One may generalize
this to include continuous time Lévy processes, and many Lévy processes can be seen as limits of i.i.d. variables; for instance,
the Wiener process is the limit of the Bernoulli process.
In linear algebra, a family of vectors is linearly independent if none of them can be written as a linear combination of finitely many
other vectors in the family. A family of vectors which is not linearly independent is called linearly dependent. For instance, in the
three-dimensional real vector space R³ we have the following example:

$$(0, 0, 1), \quad (0, 2, -2), \quad (1, -2, 1), \quad (4, 2, 3).$$

Here the first three vectors are linearly independent; but the fourth vector equals 9 times the first plus 5 times the second plus
4 times the third, so the four vectors together are linearly dependent. Linear dependence is a property of the family, not of any
particular vector; for example in this case we could just as well write the first vector as a linear combination of the last three.

In probability theory and statistics there is an unrelated measure of linear dependence between random variables.

Definition

A finite subset of n vectors, v1, v2, ..., vn, from the vector space V, is linearly dependent if and only if there exists a set
of n scalars, a1, a2, ..., an, not all zero, such that

$$a_1 \mathbf{v}_1 + a_2 \mathbf{v}_2 + \cdots + a_n \mathbf{v}_n = \mathbf{0}.$$

Note that the zero on the right is the zero vector, not the number zero.
If such scalars do not exist, then the vectors are said to be linearly independent.
Alternatively, linear independence can be directly defined as follows: a set of vectors is linearly independent if and
only if the only representations of the zero vector as linear combinations of its elements are trivial solutions, i.e.,
whenever a1, a2, ..., an are scalars such that

$$a_1 \mathbf{v}_1 + a_2 \mathbf{v}_2 + \cdots + a_n \mathbf{v}_n = \mathbf{0},$$

then ai = 0 for i = 1, 2, ..., n.

A set of vectors is then said to be linearly dependent if it is not linearly independent.
More generally, let V be a vector space over a field K, and let {vi | i ∈ I} be a family of elements of V. The family
is linearly dependent over K if there exists a family {aj | j ∈ J} of elements of K, not all zero, such that

$$\sum_{j \in J} a_j \mathbf{v}_j = \mathbf{0},$$

where the index set J is a nonempty, finite subset of I.

A set X of elements of V is linearly independent if the corresponding family {x}, x ∈ X, is linearly independent.
Equivalently, a family is dependent if a member is in the linear span of the rest of the family, i.e., a
member is a linear combination of the rest of the family.
A set of vectors which is linearly independent and spans some vector space, forms a basis for that vector
space. For example, the vector space of all polynomials in x over the reals has for a basis the (infinite)
subset {1, x, x², ...}.
Geometric meaning

A geographic example may help to clarify the concept of linear independence. A person describing the
location of a certain place might say, "It is 5 miles north and 6 miles east of here." This is sufficient
information to describe the location, because the geographic coordinate system may be considered as a
2-dimensional vector space (ignoring altitude). The person might add, "The place is 7.81 miles northeast
of here." Although this last statement is true, it is not necessary.
In this example the "5 miles north" vector and the "6 miles east" vector are linearly independent. That is
to say, the north vector cannot be described in terms of the east vector, and vice versa. The third "7.81
miles northeast" vector is a linear combination of the other two vectors, and it makes the set of
vectors linearly dependent, that is, one of the three vectors is unnecessary.
Also note that if altitude is not ignored, it becomes necessary to add a third vector to the linearly
independent set. In general, n linearly independent vectors are required to describe any location in n-
dimensional space.
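Example I

The vectors (1, 1) and (3, 2) in R² are linearly independent.

Proof
Let λ1 and λ2 be two real numbers such that

$$\lambda_1 (1, 1) + \lambda_2 (3, 2) = (0, 0).$$

Taking each coordinate alone, this means

$$\lambda_1 + 3\lambda_2 = 0, \qquad \lambda_1 + 2\lambda_2 = 0.$$

Solving for λ1 and λ2, we find that λ1 = 0 and λ2 = 0.
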
Alternative method using determinants

An alternative method uses the fact that n vectors in Rⁿ are linearly dependent if and only
if the determinant of the matrix formed by taking the vectors as its columns is zero.
In this case, the matrix formed by the vectors is

$$A = \begin{bmatrix} 1 & 3 \\ 1 & 2 \end{bmatrix}.$$

We may write a linear combination of the columns as

$$A\Lambda = \begin{bmatrix} 1 & 3 \\ 1 & 2 \end{bmatrix} \begin{bmatrix} \lambda_1 \\ \lambda_2 \end{bmatrix}.$$

We are interested in whether AΛ = 0 for some nonzero vector Λ. This depends on
the determinant of A, which is

$$\det A = 1 \cdot 2 - 1 \cdot 3 = -1 \neq 0.$$

Since the determinant is non-zero, the vectors (1, 1) and (3, 2) are linearly
independent.
Otherwise, suppose we have m vectors of n coordinates, with m < n.
Then A is an n×m matrix and Λ is a column vector with m entries, and we are
again interested in AΛ = 0. As we saw previously, this is equivalent to a list
of n equations. Consider the first m rows of A, the first m equations; any
solution of the full list of equations must also be true of the reduced list. In
fact, if ⟨i1,...,im⟩ is any list of m rows, then the equation must be true for those
rows:

$$A_{\langle i_1, \dots, i_m \rangle} \Lambda = \mathbf{0}.$$

Furthermore, the reverse is true. That is, we can test whether
the m vectors are linearly dependent by testing whether

$$\det A_{\langle i_1, \dots, i_m \rangle} = 0$$

for all possible lists of m rows. (In case m = n, this requires only
one determinant, as above. If m > n, then it is a theorem that the
vectors must be linearly dependent.) This fact is valuable for
theory; in practical calculations more efficient methods are
available.
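One such practical method is Gaussian elimination: the vectors are linearly independent exactly when the rank of the matrix they form equals the number of vectors. The sketch below uses exact rational arithmetic; the function names `rank` and `is_linearly_independent` are assumptions introduced for illustration, not an established API.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a matrix (a list of equal-length rows) by Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0  # number of pivot rows found so far
    ncols = len(m[0]) if m else 0
    for c in range(ncols):
        # Find a row at or below position r with a non-zero entry in column c.
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        # Eliminate column c from all rows below the pivot row.
        for i in range(r + 1, len(m)):
            factor = m[i][c] / m[r][c]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r


def is_linearly_independent(vectors):
    """Vectors are independent iff the rank equals the number of vectors."""
    return rank(vectors) == len(vectors)


example_one = is_linearly_independent([(1, 1), (3, 2)])
parallel = is_linearly_independent([(1, 2), (2, 4)])
too_many = is_linearly_independent([(1, 0), (0, 1), (5, 6)])
```

The same function handles the case of m vectors with n coordinates for any m and n, so no separate determinant needs to be computed for each choice of rows.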
Example II

Let V = Rⁿ and consider the following elements in V:

$$\mathbf{e}_1 = (1, 0, 0, \ldots, 0), \quad \mathbf{e}_2 = (0, 1, 0, \ldots, 0), \quad \ldots, \quad \mathbf{e}_n = (0, 0, 0, \ldots, 1).$$

Then e1, e2, ..., en are linearly independent.
Proof
Suppose that a1, a2, ..., an are elements of R such that

$$a_1 \mathbf{e}_1 + a_2 \mathbf{e}_2 + \cdots + a_n \mathbf{e}_n = \mathbf{0}.$$

Since

$$a_1 \mathbf{e}_1 + a_2 \mathbf{e}_2 + \cdots + a_n \mathbf{e}_n = (a_1, a_2, \ldots, a_n),$$

then ai = 0 for all i in {1, ..., n}.
Example III

Let V be the vector space of all functions of a real variable t. Then
the functions e^t and e^(2t) in V are linearly independent.
Proof
Suppose a and b are two real numbers such that

$$a e^{t} + b e^{2t} = 0$$

for all values of t. We need to show that a = 0 and b = 0. In order to do this,
we divide through by e^t (which is never zero) and subtract to obtain

$$b e^{t} = -a.$$

In other words, the function b e^t must be independent of t, which only occurs
when b = 0. It follows that a is also zero.
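Example IV

The following vectors in R⁴ are linearly dependent:

$$(1, 4, 2, -3), \quad (7, 10, -4, -1), \quad (-2, 1, 5, -4).$$

Proof

We need to find scalars λ1, λ2 and λ3, not all zero, such that

$$\lambda_1 (1, 4, 2, -3) + \lambda_2 (7, 10, -4, -1) + \lambda_3 (-2, 1, 5, -4) = (0, 0, 0, 0).$$

Forming the simultaneous equations:

$$\begin{aligned} \lambda_1 + 7\lambda_2 - 2\lambda_3 &= 0 \\ 4\lambda_1 + 10\lambda_2 + \lambda_3 &= 0 \\ 2\lambda_1 - 4\lambda_2 + 5\lambda_3 &= 0 \\ -3\lambda_1 - \lambda_2 - 4\lambda_3 &= 0 \end{aligned}$$

we can solve (using, for example, Gaussian elimination) to obtain:

$$\lambda_1 = -\tfrac{3}{2}\lambda_3, \qquad \lambda_2 = \tfrac{1}{2}\lambda_3,$$

where λ3 can be chosen arbitrarily.
Since these are nontrivial results, the vectors are linearly dependent.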
Projective space of linear dependences

A linear dependence among vectors v1, ..., vn is a tuple (a1, ..., an)
with n scalar components, not all zero, such that

$$a_1 \mathbf{v}_1 + \cdots + a_n \mathbf{v}_n = \mathbf{0}.$$

If such a linear dependence exists, then the n vectors are linearly dependent. It
makes sense to identify two linear dependences if one arises as a non-zero
multiple of the other, because in this case the two describe the same
linear relationship among the vectors. Under this identification, the set of all
linear dependences among v1, ..., vn is a projective space.
Linear dependence between random variables

The covariance is sometimes called a measure of "linear dependence"
between two random variables. That does not mean the same thing as in the
context of linear algebra. When the covariance is normalized, one obtains
the correlation matrix. From it, one can obtain the Pearson coefficient, which
gives us the goodness of the fit for the best possible linear function describing
the relation between the variables. In this sense covariance is a linear gauge
of dependence.
Orthogonality comes from the Greek orthos, meaning "straight", and gonia, meaning "angle". It has somewhat different
meanings depending on the context, but most involve the idea of perpendicular, non-overlapping, or uncorrelated.
In mathematics, two lines or curves are orthogonal if they are perpendicular at their point of intersection. Two vectors are
orthogonal if and only if their dot product is zero.[1] In computer science, orthogonality has to do with the ability of a
language, method, or object to vary without side effects.[2]
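Using integral calculus, it is common to define the inner product of two mathematical functions f and g as

$$\langle f, g \rangle_w = \int_a^b f(x)\, g(x)\, w(x)\,dx.$$

Here we introduce a nonnegative weight function w(x) in the definition of this inner product. In simple cases, w(x) = 1,
exactly.
We say that these functions are orthogonal if that inner product is zero:

$$\int_a^b f(x)\, g(x)\, w(x)\,dx = 0.$$

We write the norms with respect to this inner product and the weight function as

$$\|f\|_w = \sqrt{\langle f, f \rangle_w}.$$

The members of a set of functions { fi : i = 1, 2, 3, ... } are:

orthogonal on the closed interval [a, b] if

$$\langle f_i, f_j \rangle = \int_a^b f_i(x)\, f_j(x)\, w(x)\,dx = \|f_i\|^2 \delta_{i,j} = \|f_j\|^2 \delta_{i,j},$$

orthonormal on the interval [a, b] if

$$\langle f_i, f_j \rangle = \int_a^b f_i(x)\, f_j(x)\, w(x)\,dx = \delta_{i,j},$$

where

$$\delta_{i,j} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$$

is the "Kronecker delta" function. In other words, any two of them are orthogonal, and the norm of
each is 1 in the case of the orthonormal sequence. See in particular the orthogonal polynomials.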
Examples

The vectors (1, 3, 2), (3, −1, 0), (1/3, 1, −5/3) are orthogonal to each other, since (1)(3) + (3)(−1) + (2)(0) = 0, (3)(1/3) + (−1)(1) + (0)(−5/3) = 0, and (1)(1/3) + (3)(1) + (2)(−5/3) = 0.
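
The arithmetic in this example can be checked mechanically. The following is a minimal Python sketch, assuming the signed vectors (1, 3, 2), (3, −1, 0) and (1/3, 1, −5/3); the helper name `dot` is purely illustrative and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import combinations

def dot(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

vectors = [
    (Fraction(1), Fraction(3), Fraction(2)),
    (Fraction(3), Fraction(-1), Fraction(0)),
    (Fraction(1, 3), Fraction(1), Fraction(-5, 3)),
]

# Dot product of every distinct pair; each should vanish for an orthogonal set.
pairwise = [dot(u, v) for u, v in combinations(vectors, 2)]
print(pairwise)
```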

The vectors (1, 0, 1, 0, ...)T and (0, 1, 0, 1, ...)T are orthogonal to each other. The dot product of these vectors is 0. We can then make the generalization to consider the vectors in $\mathbb{Z}_2^n$:

$\mathbf{v}_k = \sum_{i \ge 0,\; ai + k < n} \mathbf{e}_{ai+k}$

for some positive integer a, and for 1 ≤ k ≤ a − 1, where $\mathbf{e}_j$ denotes the j-th standard basis vector. These vectors are orthogonal; for example, (1, 0, 0, 1, 0, 0, 1, 0)T, (0, 1, 0, 0, 1, 0, 0, 1)T, (0, 0, 1, 0, 0, 1, 0, 0)T are orthogonal.

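Take the two polynomials 2t + 3 and 5t² + t − 17/9. These functions are orthogonal with respect to a unit weight function on the interval from −1 to 1. The product of these two functions is 10t³ + 17t² − (7/9)t − 17/3, and now,

$\int_{-1}^{1} \left( 10t^3 + 17t^2 - \tfrac{7}{9}t - \tfrac{17}{3} \right) dt = \left[ \tfrac{5}{2}t^4 + \tfrac{17}{3}t^3 - \tfrac{7}{18}t^2 - \tfrac{17}{3}t \right]_{-1}^{1} = 0.$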
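
The same computation can be reproduced exactly in a few lines. This is a minimal Python sketch, assuming the polynomials are 2t + 3 and 5t² + t − 17/9 on [−1, 1] with unit weight; the helper names `poly_mul` and `integrate` are illustrative, not part of any library.

```python
from fractions import Fraction

def poly_mul(p, q):
    """Multiply polynomials given as coefficient lists, lowest degree first."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def integrate(p, lo, hi):
    """Exact definite integral of a polynomial over [lo, hi]."""
    def antideriv(t):
        return sum(c * t ** (k + 1) / (k + 1) for k, c in enumerate(p))
    return antideriv(Fraction(hi)) - antideriv(Fraction(lo))

f = [Fraction(3), Fraction(2)]                    # 2t + 3
g = [Fraction(-17, 9), Fraction(1), Fraction(5)]  # 5t^2 + t - 17/9

product = poly_mul(f, g)           # coefficients of the product, lowest degree first
inner = integrate(product, -1, 1)  # inner product with unit weight on [-1, 1]
print(product, inner)
```

The odd-degree terms integrate to zero by symmetry, and the two even-degree terms cancel, which is why the inner product vanishes.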

The functions 1, sin(nx), cos(nx) : n = 1, 2, 3, ... are orthogonal with respect to Riemann integration on the intervals [0, 2π], [−π, π], or any other closed interval of length 2π. This fact is a central one in Fourier series.
In communications, multiple-access schemes are orthogonal when an ideal receiver can completely reject arbitrarily
strong unwanted signals from the desired signal using different basis functions. One such scheme is TDMA, where the
orthogonal basis functions are nonoverlapping rectangular pulses ("time slots").

Another scheme is orthogonal frequency-division multiplexing (OFDM), which refers to the use, by a single transmitter, of
a set of frequency multiplexed signals with the exact minimum frequency spacing needed to make them orthogonal so
that they do not interfere with each other. Well known examples include (a, g, and n) versions of 802.11 Wi-
Fi; WiMAX; ITU-T G.hn, DVB-T, the terrestrial digital TV broadcast system used in most of the world outside North
America; and DMT (Discrete Multi Tone), the standard form of ADSL.

In OFDM, the subcarrier frequencies are chosen so that the subcarriers are orthogonal to each other, meaning that
crosstalk between the subchannels is eliminated and intercarrier guard bands are not required. This greatly simplifies the
design of both the transmitter and the receiver. Unlike in conventional FDM, a separate filter for each subchannel is not
required.
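
The orthogonality of OFDM subcarriers can be illustrated with a small discrete-time sketch. It assumes N samples per symbol and complex subcarriers at integer multiples of the symbol rate; the value N = 64 and the helper names `subcarrier` and `inner` are illustrative choices, not taken from any standard.

```python
import cmath

N = 64  # samples per OFDM symbol (illustrative choice)

def subcarrier(k):
    """Samples of the k-th complex subcarrier over one symbol period."""
    return [cmath.exp(2j * cmath.pi * k * n / N) for n in range(N)]

def inner(x, y):
    """Discrete inner product: sum of x[n] times the conjugate of y[n]."""
    return sum(a * b.conjugate() for a, b in zip(x, y))

same = inner(subcarrier(3), subcarrier(3))       # a subcarrier with itself
different = inner(subcarrier(3), subcarrier(4))  # two adjacent subcarriers
print(abs(same), abs(different))
```

A subcarrier correlated with itself gives N, while two distinct subcarriers give zero up to rounding error, so a receiver that correlates against one subcarrier sees no contribution from the others.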

In probability theory, two events R and B are conditionally independent given a third event Y precisely if the occurrence or non-occurrence of R and the occurrence or non-occurrence of B are independent events in their conditional probability distribution given Y. In other words, R and B are conditionally independent given Y if and only if, given knowledge that Y occurs, knowledge of whether R occurs provides no information on the likelihood of B occurring, and knowledge of whether B occurs provides no information on the likelihood of R occurring.

In the standard notation of probability theory, R and B are conditionally independent given Y if and only if

$\Pr(R \cap B \mid Y) = \Pr(R \mid Y)\, \Pr(B \mid Y),$

or equivalently,

$\Pr(R \mid B \cap Y) = \Pr(R \mid Y).$

Two random variables X and Y are conditionally independent given a third random variable Z if and only if they are independent in their conditional probability distribution given Z. That is, X and Y are conditionally independent given Z if and only if, given any value of Z, the probability distribution of X is the same for all values of Y and the probability distribution of Y is the same for all values of X.

Two events R and B are conditionally independent given a σ-algebra Σ if

$\Pr(R \cap B \mid \Sigma) = \Pr(R \mid \Sigma)\, \Pr(B \mid \Sigma)$ almost surely,

where $\Pr(A \mid \Sigma)$ denotes the conditional expectation of the indicator function of the event A, $\chi_A$, given the sigma algebra Σ. That is,

$\Pr(A \mid \Sigma) := E[\chi_A \mid \Sigma].$

Two random variables X and Y are conditionally independent given a σ-algebra Σ if the above equation holds for all R in σ(X) and B in σ(Y).

Two random variables X and Y are conditionally independent given a random variable W if they are independent given σ(W): the σ-algebra generated by W. This is commonly written:

$X \perp\!\!\!\perp Y \mid W$

or

$X \perp Y \mid W.$

This is read "X is independent of Y, given W"; the conditioning applies to the whole statement: "(X is independent of Y)
given W".

If W assumes a countable set of values, this is equivalent to the conditional independence of X and Y for the events of the
form [W = w]. Conditional independence of more than two events, or of more than two random variables, is defined
analogously.

The following two examples show that X ⊥ Y neither implies nor is implied by X ⊥ Y | W. First, suppose W is 0 with probability 0.5 and is the value 1 otherwise. When W = 0 take X and Y to be independent, each having the value 0 with probability 0.99 and the value 1 otherwise. When W = 1, X and Y are again independent, but this time they take the value 1 with probability 0.99. Then X ⊥ Y | W. But X and Y are dependent, because Pr(X = 0) < Pr(X = 0 | Y = 0). This is because Pr(X = 0) = 0.5, but if Y = 0 then it's very likely that W = 0 and thus that X = 0 as well, so Pr(X = 0 | Y = 0) > 0.5. For the second example, suppose X ⊥ Y, each taking the values 0 and 1 with probability 0.5. Let W be the product XY. Then when W = 0, Pr(X = 0 | W = 0) = 2/3, but Pr(X = 0 | Y = 0, W = 0) = 1/2, so X ⊥ Y | W is false. This is also an example of Explaining Away. See Kevin Murphy's tutorial [2] where X and Y take the values "brainy" and "sporty".
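
Both examples are small enough to enumerate exactly. The sketch below is one way to do so in Python, assuming outcomes are stored as (w, x, y) triples; the helper names `prob` and `cond` are hypothetical and chosen only for this illustration.

```python
from fractions import Fraction
from itertools import product

def prob(joint, pred):
    """Total probability of the outcomes (w, x, y) that satisfy pred."""
    return sum(p for outcome, p in joint.items() if pred(*outcome))

def cond(joint, pred, given):
    """Conditional probability Pr(pred | given)."""
    both = prob(joint, lambda w, x, y: pred(w, x, y) and given(w, x, y))
    return both / prob(joint, given)

# Example 1: W is a fair bit; given W, X and Y are independent
# and each equals W with probability 0.99.
hit = Fraction(99, 100)
joint1 = {}
for w, x, y in product((0, 1), repeat=3):
    px = hit if x == w else 1 - hit
    py = hit if y == w else 1 - hit
    joint1[(w, x, y)] = Fraction(1, 2) * px * py

p_x0 = prob(joint1, lambda w, x, y: x == 0)
p_x0_given_y0 = cond(joint1, lambda w, x, y: x == 0, lambda w, x, y: y == 0)

# Example 2: X and Y are independent fair bits and W = X * Y.
joint2 = {(x * y, x, y): Fraction(1, 4) for x, y in product((0, 1), repeat=2)}
p_x0_given_w0 = cond(joint2, lambda w, x, y: x == 0, lambda w, x, y: w == 0)
p_x0_given_y0_w0 = cond(joint2, lambda w, x, y: x == 0,
                        lambda w, x, y: y == 0 and w == 0)

print(p_x0, p_x0_given_y0, p_x0_given_w0, p_x0_given_y0_w0)
```

In the first example conditioning on Y = 0 moves Pr(X = 0) well above 1/2, so X and Y are dependent although they are conditionally independent given W. In the second, the two conditional probabilities given W = 0 differ, so conditional independence fails although X and Y are independent.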
In probability theory, two random variables being uncorrelated does not imply their independence. In some contexts, uncorrelatedness implies at least pairwise independence (as when the random variables involved have Bernoulli distributions).
It is sometimes mistakenly thought that one context in which uncorrelatedness implies independence is when the random variables involved are normally distributed. However, this is incorrect if the variables are merely marginally normally distributed but not jointly normally distributed.
Suppose two random variables X and Y are jointly normally distributed. That is the same as saying that the random vector (X, Y)
has a multivariate normal distribution. It means that the joint probability distribution of X and Y is such that for any two constant (i.e.,
non-random) scalars a and b, the random variable aX + bY is normally distributed. In that case if X and Y are uncorrelated, i.e.,
their covariance cov(X, Y) is zero, then they are independent. [1] However, it is possible for two random variables X and Y to be so
distributed jointly that each one alone is marginally normally distributed, and they are uncorrelated, but they are not independent;
examples are given below.

Examples

A symmetric example

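Suppose X has a normal distribution with expected value 0 and variance 1. Let W = 1 or −1, each with probability 1/2, and assume W is independent of X. Let Y = WX. Then

X and Y are uncorrelated;
both have the same normal distribution; and
X and Y are not independent.

Note that the distribution of X + Y concentrates positive probability at 0: Pr(X + Y = 0) = 1/2. To see that X and Y are uncorrelated, consider

$\operatorname{cov}(X, Y) = E(XY) - E(X)E(Y) = E(XY) - 0 = E(E(XY \mid W))$
$= E(X^2)\Pr(W = 1) + E(-X^2)\Pr(W = -1) = 1 \cdot \tfrac{1}{2} + (-1) \cdot \tfrac{1}{2} = 0.$

To see that Y has the same normal distribution as X, consider

$\Pr(Y \le x) = E(\Pr(Y \le x \mid W)) = \Pr(X \le x)\Pr(W = 1) + \Pr(-X \le x)\Pr(W = -1)$
$= \Phi(x) \cdot \tfrac{1}{2} + \Phi(x) \cdot \tfrac{1}{2} = \Phi(x),$

where Φ is the cumulative distribution function of the standard normal distribution (since X and −X both have the same normal distribution).

To see that X and Y are not independent, observe that |Y| = |X| or that Pr(Y > 1 | X = 1/2) = 0.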
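
A short simulation makes the three claims tangible. This is only a sketch using Python's standard library, assuming W takes the values −1 and 1 with equal probability; the sample size, the seed and all variable names are arbitrary choices for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable
n = 100_000

xs = [random.gauss(0.0, 1.0) for _ in range(n)]  # X ~ N(0, 1)
ws = [random.choice((-1, 1)) for _ in range(n)]  # W = +1 or -1, independent of X
ys = [w * x for w, x in zip(ws, xs)]             # Y = W * X

mean_x = sum(xs) / n
mean_y = sum(ys) / n
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
var_x = sum((x - mean_x) ** 2 for x in xs) / n
var_y = sum((y - mean_y) ** 2 for y in ys) / n
corr = cov / (var_x * var_y) ** 0.5

# Dependence: |Y| always equals |X|, and X + Y is exactly 0 whenever W = -1.
same_magnitude = all(abs(x) == abs(y) for x, y in zip(xs, ys))
frac_sum_zero = sum(1 for x, y in zip(xs, ys) if x + y == 0) / n
print(corr, same_magnitude, frac_sum_zero)
```

The sample correlation should come out close to zero, while the magnitudes agree on every draw and roughly half of the sums are exactly zero, which a jointly normal pair could not produce.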
An asymmetric example

Suppose X has a normal distribution with expected value 0 and variance 1. Let

$Y = -X$ if $|X| < c$, and $Y = X$ if $|X| \ge c$,

where c is a positive number to be specified below. If c is very small, then the correlation corr(X, Y) is near 1; if c is very large, then corr(X, Y) is near −1. Since the correlation is a continuous function of c, the intermediate value theorem implies there is some particular value of c that makes the correlation 0. That value is approximately 1.54. In that case, X and Y are uncorrelated, but they are clearly not independent, since X completely determines Y.
To see that Y is normally distributed (indeed, that its distribution is the same as that of X), let us find its cumulative distribution function:

$\Pr(Y \le x) = \Pr(|X| < c \text{ and } -X \le x) + \Pr(|X| \ge c \text{ and } X \le x)$
$= \Pr(|X| < c \text{ and } X \le x) + \Pr(|X| \ge c \text{ and } X \le x)$
$= \Pr(X \le x).$

(The second equality follows from the symmetry of the distribution of X and the symmetry of the condition that |X| < c.)
Observe that the sum X + Y is nowhere near being normally distributed, since it has a substantial probability (about 0.88) of being equal to 0, whereas the normal distribution, being a continuous distribution, has no discrete part, i.e., does not concentrate more than zero probability at any single point. Consequently X and Y are not jointly normally distributed, even though they are separately normally distributed.
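
The value of c and the probability quoted above can be recovered numerically. The sketch below assumes the construction Y = −X for |X| < c and Y = X otherwise (the reading consistent with the 0.88 figure for X + Y = 0); the function names are illustrative, and the closed form for the truncated second moment comes from integrating by parts.

```python
import math

def partial_second_moment(c):
    """E[X^2 * 1{|X| < c}] for a standard normal X (integration by parts)."""
    pdf = math.exp(-c * c / 2) / math.sqrt(2 * math.pi)
    return math.erf(c / math.sqrt(2)) - 2 * c * pdf

def covariance(c):
    """cov(X, Y) when Y = -X for |X| < c and Y = X otherwise."""
    return 1 - 2 * partial_second_moment(c)

# Bisection: the covariance is positive at lo and negative at hi.
lo, hi = 0.0, 5.0
for _ in range(60):
    mid = (lo + hi) / 2
    if covariance(mid) > 0:
        lo = mid
    else:
        hi = mid

c = (lo + hi) / 2
prob_sum_zero = math.erf(c / math.sqrt(2))  # Pr(|X| < c) = Pr(X + Y = 0)
print(round(c, 3), round(prob_sum_zero, 3))
```

Since var(X) = var(Y) = 1, the covariance here equals the correlation, so the root of `covariance` is the c that makes X and Y uncorrelated.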
In probability theory, a set of events is jointly or collectively exhaustive if at least one of the events must occur. For example,
when rolling a six-sided die, the outcomes 1, 2, 3, 4, 5, and 6 are collectively exhaustive, because they encompass the entire range
of possible outcomes.
Another way to describe collectively exhaustive events is that their union must cover all the events within the entire sample space. For example, events A and B are said to be collectively exhaustive if

$A \cup B = S,$

where S is the sample space.


Compare this to the concept of a set of outcomes which are mutually exclusive, which means that at most one of the events
may occur. The set of all possible die rolls is both collectively exhaustive and mutually exclusive. The outcomes 1 and 6 are
mutually exclusive but not collectively exhaustive. The outcomes "even" (2,4 or 6) and "not-6" (1,2,3,4, or 5) are collectively
exhaustive but not mutually exclusive.
One example of a collectively exhaustive and mutually exclusive event is tossing a coin. P(Head or Tail) = 1, so the outcomes
are collectively exhaustive. When head occurs tail can't occur or P(Head and Tail) = 0, so the outcomes are mutually exclusive
also.
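
The die examples above translate directly into set operations. This is a minimal Python sketch, assuming events are represented as sets of die faces; the function names `collectively_exhaustive` and `mutually_exclusive` are hypothetical helpers written for this illustration.

```python
sample_space = frozenset(range(1, 7))  # faces of a six-sided die

def collectively_exhaustive(events):
    """True if the union of the events covers the whole sample space."""
    return frozenset().union(*events) == sample_space

def mutually_exclusive(events):
    """True if no outcome belongs to more than one event."""
    return sum(len(e) for e in events) == len(frozenset().union(*events))

singletons = [{k} for k in sample_space]
one_and_six = [{1}, {6}]
even_and_not_six = [{2, 4, 6}, {1, 2, 3, 4, 5}]

for name, events in [("singletons", singletons),
                     ("1 and 6", one_and_six),
                     ("even and not-6", even_and_not_six)]:
    print(name, collectively_exhaustive(events), mutually_exclusive(events))
```

The three cases reproduce the text: the individual faces are both exhaustive and exclusive, "1" and "6" are exclusive only, and "even" with "not-6" is exhaustive only.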
The term "collectively exhaustive" is a relatively new term.[citation needed] This is due to changes in the meaning of "mutually exclusive" (Knuth[full citation needed]). Heads and tails are the classic example. They are "exclusive" because one flip cannot be both heads and tails, and they are "mutual" because one flip must be either heads or tails. Formal logic used to describe events that are both mutual and exclusive as "mutually exclusive". When the meaning of "exclusive" was expanded and became "mutually exclusive", a term was needed to describe what used to be "mutual". Thus "collectively exhaustive" entered the literature. Events that used to be referred to as "mutually exclusive" are now referred to as "mutually exclusive and collectively exhaustive".[citation needed]

Estimator

From Wikipedia, the free encyclopedia

In statistics, an estimator is a rule for calculating an estimate of a given quantity based on observed data: thus the rule and its result

(the estimate) are distinguished.

There are point and interval estimators. The point estimators yield single-valued results, although this includes the possibility of

single vector-valued results and results that can be expressed as a single function. This is in contrast to an interval estimator, where

the result would be a range of plausible values (or vectors or functions).

Statistical theory is concerned with the properties of estimators; that is, with defining properties that can be used to compare

different estimators (different rules for creating estimates) for the same quantity, based on the same data. Such properties can be

used to determine the best rules to use under given circumstances. However, in robust statistics, statistical theory goes on to

consider the balance between having good properties, if tightly defined assumptions hold, and having less good properties that hold

under wider conditions.



Background

An "estimator" or "point estimate" is a statistic (that is, a function of the data) that is used to infer the value of an unknown parameter in a statistical model. The parameter being estimated is sometimes called the estimand.[citation needed] It can be either finite-dimensional (in parametric and semi-parametric models), or infinite-dimensional (semi-nonparametric and non-parametric models).[citation needed] If the parameter is denoted $\theta$ then the estimator is typically written by adding a "hat" over the symbol: $\hat\theta$. Being a function of the data, the estimator is itself a random variable; a particular realization of this random variable is called the "estimate". Sometimes the words "estimator" and "estimate" are used interchangeably.

The definition places virtually no restrictions on which functions of the data can be called "estimators". The attractiveness of different estimators can be judged by looking at their properties, such as unbiasedness, mean square error, consistency, asymptotic distribution, etc. The construction and comparison of estimators are the subjects of estimation theory. In the context of decision theory, an estimator is a type of decision rule, and its performance may be evaluated through the use of loss functions.

When the word "estimator" is used without a qualifier, it usually refers to point estimation. The estimate in this case is a single point

in the parameter space. Other types of estimators also exist: interval estimators, where the estimates are subsets of the parameter

space.

The problem of density estimation arises in two applications. Firstly, in estimating the probability density functions of random

variables and secondly in estimating the spectral density function of a time series. In these problems the estimates are functions that

can be thought of as point estimates in an infinite dimensional space, and there are corresponding interval estimation problems.

Definition

Suppose there is a fixed parameter $\theta$ that needs to be estimated. Then an "estimator" is a function that maps the sample space to a set of sample estimates. An estimator of $\theta$ is usually denoted by the symbol $\hat\theta$. It is often convenient to express the theory using the algebra of random variables: thus if X is used to denote a random variable corresponding to the observed data, the estimator (itself treated as a random variable) is symbolised as a function of that random variable, $\hat\theta(X)$. The estimate for a particular observed dataset (i.e. for X = x) is then $\hat\theta(x)$, which is a fixed value. Often an abbreviated notation is used in which $\hat\theta$ is interpreted directly as a random variable, but this can cause confusion.

Quantified properties

The following definitions and attributes apply:

Error

For a given sample $x$, the "error" of the estimator $\hat\theta$ is defined as

$e(x) = \hat\theta(x) - \theta,$

where $\theta$ is the parameter being estimated. Note that the error, e, depends not only on the estimator (the estimation formula or procedure), but on the sample.

Mean squared error

The mean squared error of $\hat\theta$ is defined as the expected value (probability-weighted average, over all samples) of the squared errors; that is,

$\operatorname{MSE}(\hat\theta) = E[(\hat\theta(X) - \theta)^2].$

It is used to indicate how far, on average, the collection of estimates are from the single parameter being estimated.

Consider the following analogy. Suppose the parameter is the bull's-eye of a target, the estimator is the process of

shooting arrows at the target, and the individual arrows are estimates (samples). Then high MSE means the average

distance of the arrows from the bull's-eye is high, and low MSE means the average distance from the bull's-eye is low.

The arrows may or may not be clustered. For example, even if all arrows hit the same point, yet grossly miss the target,

the MSE is still relatively large. Note, however, that if the MSE is relatively low, then the arrows are likely more highly

clustered (than highly dispersed).

Sampling deviation

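For a given sample $x$, the sampling deviation of the estimator $\hat\theta$ is defined as

$d(x) = \hat\theta(x) - E(\hat\theta(X)) = \hat\theta(x) - E(\hat\theta),$

where $E(\hat\theta(X))$ is the expected value of the estimator. Note that the sampling deviation, d, depends not only on the estimator, but on the sample.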

Variance

The variance of $\hat\theta$ is simply the expected value of the squared sampling deviations; that is, $\operatorname{var}(\hat\theta) = E[(\hat\theta - E(\hat\theta))^2]$. It is used to indicate how far, on average, the collection of estimates are from the expected value of the estimates. Note the difference between MSE and variance. If the parameter is

the bull's-eye of a target, and the arrows are estimates, then a relatively high variance means the arrows are
dispersed, and a relatively low variance means the arrows are clustered. Some things to note: even if the variance

is low, the cluster of arrows may still be far off-target, and even if the variance is high, the diffuse collection of

arrows may still be unbiased. Finally, note that even if all arrows grossly miss the target, if they nevertheless all hit

the same point, the variance is zero.

Bias

The bias of $\hat\theta$ is defined as $B(\hat\theta) = E(\hat\theta) - \theta$. It is the distance between the average of the collection of estimates, and the single parameter being estimated. It also is the expected value of the error, since $E(\hat\theta) - \theta = E(\hat\theta - \theta)$. If the parameter is the bull's-eye of a target, and the arrows are

estimates, then a relatively high absolute value for the bias means the average position of the arrows is off-target,

and a relatively low absolute bias means the average position of the arrows is on target. They may be dispersed, or

may be clustered. The relationship between bias and variance is analogous to the relationship between accuracy

and precision.

Unbiased

The estimator $\hat\theta$ is an unbiased estimator of $\theta$ if and only if its bias is zero, that is, $E(\hat\theta) - \theta = 0$. Note that bias is a property of the

estimator, not of the estimate. Often, people refer to a "biased estimate" or an "unbiased estimate," but they really

are talking about an "estimate from a biased estimator," or an "estimate from an unbiased estimator." Also, people

often confuse the "error" of a single estimate with the "bias" of an estimator. Just because the error for one estimate

is large, does not mean the estimator is biased. In fact, even if all estimates have astronomical absolute values for

their errors, if the expected value of the error is zero, the estimator is unbiased. Also, just because an estimator is

biased, does not preclude the error of an estimate from being zero (we may have gotten lucky). The ideal situation,

of course, is to have an unbiased estimator with low variance, and also try to limit the number of samples where the

error is extreme (that is, have few outliers). Yet unbiasedness is not essential. Often, if just a little bias is permitted,

then an estimator can be found with lower MSE and/or fewer outlier sample estimates.

An alternative to the version of "unbiased" above, is "median-unbiased", where the median of the distribution of

estimates agrees with the true value; thus, in the long run half the estimates will be too low and half too high. While

this applies immediately only to scalar-valued estimators, it can be extended to any measure of central tendency of

a distribution: see median-unbiased estimators.

Relationships

The MSE, variance, and bias are related: $\operatorname{MSE}(\hat\theta) = \operatorname{var}(\hat\theta) + (E(\hat\theta) - \theta)^2$, i.e. mean squared error = variance + square of bias. In particular, for an unbiased estimator, the variance equals the MSE.
The standard deviation of an estimator of $\theta$ (the square root of the variance), or an estimate of the standard deviation of an estimator of $\theta$, is called the standard error of $\theta$.
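
The relationship between MSE, variance and bias can be checked exactly on a small example. The sketch below is an illustration under stated assumptions, not a general recipe: the data are two rolls of a fair die, the parameter is the die's variance (35/12), and the two estimators divide the sum of squared deviations by n and by n − 1. The names `spread` and `summarize` are hypothetical.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
n = 2  # sample size; all 6**n samples are equally likely

mu = Fraction(sum(faces), 6)
theta = sum((Fraction(f) - mu) ** 2 for f in faces) / 6  # true variance of one roll

def spread(sample, divisor):
    """Sum of squared deviations from the sample mean, divided by `divisor`."""
    m = Fraction(sum(sample), len(sample))
    return sum((Fraction(v) - m) ** 2 for v in sample) / divisor

def summarize(divisor):
    """Exact bias, variance and MSE of the estimator over all possible samples."""
    estimates = [spread(s, divisor) for s in product(faces, repeat=n)]
    expected = sum(estimates) / len(estimates)
    bias = expected - theta
    variance = sum((e - expected) ** 2 for e in estimates) / len(estimates)
    mse = sum((e - theta) ** 2 for e in estimates) / len(estimates)
    return bias, variance, mse

for divisor in (n, n - 1):
    bias, variance, mse = summarize(divisor)
    print(divisor, bias, variance, mse, mse == variance + bias ** 2)
```

Dividing by n − 1 gives zero bias, so its MSE equals its variance; dividing by n gives a nonzero bias, and its MSE is the variance plus the squared bias.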

Behavioural properties

Consistency

Main article: Consistent estimator

A consistent sequence of estimators is a sequence of estimators that converge in probability to the quantity being

estimated as the index (usually the sample size) grows without bound. In other words, increasing the sample size

increases the probability of the estimator being close to the population parameter.

Mathematically, a sequence of estimators {tn; n ≥ 0} is a consistent estimator for parameter $\theta$ if and only if, for all $\varepsilon > 0$, no matter how small, we have

$\lim_{n \to \infty} \Pr\{ |t_n - \theta| < \varepsilon \} = 1.$

The consistency defined above may be called weak consistency. The sequence is strongly consistent, if

it converges almost surely to the true value.

An estimator that converges to a multiple of a parameter can be made into a consistent estimator by

multiplying the estimator by a scale factor, namely the true value divided by the asymptotic value of the

estimator. This occurs frequently in estimation of scale parameters by measures of statistical dispersion.
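
Weak consistency can be seen numerically for a simple case. The following sketch assumes the estimator is the sample mean of n Bernoulli draws, with an illustrative true parameter of 1/2 and tolerance of 1/10; it computes Pr(|t_n − θ| < ε) exactly from the binomial distribution rather than by simulation.

```python
from fractions import Fraction
from math import comb

p = Fraction(1, 2)     # true parameter (illustrative)
eps = Fraction(1, 10)  # tolerance epsilon (illustrative)

def prob_within(n):
    """Exact Pr(|t_n - p| < eps), where t_n is the mean of n Bernoulli(p) draws."""
    total = Fraction(0)
    for k in range(n + 1):
        if abs(Fraction(k, n) - p) < eps:
            total += comb(n, k) * p ** k * (1 - p) ** (n - k)
    return total

probs = [float(prob_within(n)) for n in (10, 100, 1000)]
print(probs)
```

The probability of landing within ε of the true value grows toward 1 as n increases, which is what convergence in probability requires.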

Asymptotic normality

Main article: Asymptotic normality

An asymptotically normal estimator is a consistent estimator whose distribution around the true parameter $\theta$ approaches a normal distribution with standard deviation shrinking in proportion to $1/\sqrt{n}$ as the sample size n grows. Using $\xrightarrow{D}$ to denote convergence in distribution, tn is asymptotically normal if

$\sqrt{n}\,(t_n - \theta) \xrightarrow{D} N(0, V)$

for some V, which is called the asymptotic variance of the estimator.

The central limit theorem implies asymptotic normality of the sample mean as an estimator of the true mean. More generally, maximum likelihood estimators are asymptotically normal under fairly weak regularity conditions; see the asymptotics section of the maximum likelihood article. However, not all estimators are asymptotically normal, the simplest examples being cases where the true value of a parameter lies on the boundary of the allowable parameter region.

Efficiency

Main article: Efficiency (statistics)


Two naturally desirable properties of estimators are for them to be unbiased and have minimal mean

squared error (MSE). These cannot in general both be satisfied simultaneously: a biased estimator may

have lower mean squared error (MSE) than any unbiased estimator: despite having bias, the estimator

variance may be sufficiently smaller than that of any unbiased estimator, and it may be preferable to use,

despite the bias; see estimator bias.

Among unbiased estimators, there often exists one with the lowest variance, called the minimum variance unbiased estimator (MVUE). In some cases an unbiased efficient estimator exists, which, in addition to having the lowest variance among unbiased estimators, satisfies the Cramér–Rao bound, which is an absolute lower bound on variance for statistics of a variable.

Concerning such "best unbiased estimators", see also Cramér–Rao bound, Gauss–Markov theorem, Lehmann–Scheffé theorem, Rao–Blackwell theorem.