
For example, in a class we have 5 students of ages 13, 14, 14, 15, and 34; the average comes to 18.

This has happened because one extreme value has distorted the whole picture.

If we look at the variance and standard deviation instead, the distortion is revealed at once.

In other words, if you cut out a probability distribution (like a bell curve), you could balance it on your finger right at the value of its mean.

For example, picture a probability distribution where there is a 20% chance of getting a 5 and an 80% chance of getting a 10. In that case, the mean is 0.2*5 + 0.8*10 = 1 + 8 = 9. As you can see, there is 80% "mass" exactly 1 away from 9 and there is 20% "mass" exactly 4 away from 9, and 80%*1 = 20%*4. If the probability distribution were a lever (like a see-saw) with the fulcrum at the mean, 9, the torques on the two sides would cancel and the lever would be perfectly balanced.
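The lever analogy can be checked numerically; a minimal sketch in Python, using the same 20%/80% distribution as above:

```python
# Discrete distribution from the example: 20% chance of 5, 80% chance of 10.
values = [5, 10]
probs = [0.2, 0.8]

# The mean is the probability-weighted sum of the values.
mean = sum(p * v for p, v in zip(probs, values))
print(mean)  # 9.0

# "Torque" about the mean: each probability mass times its signed
# distance from the mean. A balanced lever has zero net torque.
torque = sum(p * (v - mean) for p, v in zip(probs, values))
print(abs(torque) < 1e-12)  # True: the see-saw balances at the mean
```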

For example, if there were a 100% chance of getting a 5, the variable would not be random; it would be deterministic, and would thus have 0 variance. However, if there were an 80% chance of getting a 5, a 10% chance of getting a 4, and a 10% chance of getting a 6, the random variable would have a variance of 0.2, which reflects the 10% and 10% chances at distance 1 on the left and right of 5. In other words, 2*10% = 0.2. If you change those 10% chances to 5%, you get a variance of 0.1, because 2*5% = 0.1. The variance is, in effect, measuring the "spread" of the data.

It's measuring the amount of noise you're going to get around your mean (the mean in this case is 5).
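The 80/10/10 example can be reproduced directly from the definition of variance as the expected squared distance from the mean; a minimal sketch in Python:

```python
# Distribution from the example: 80% chance of 5, 10% of 4, 10% of 6.
values = [4, 5, 6]
probs = [0.1, 0.8, 0.1]

mean = sum(p * v for p, v in zip(probs, values))
# Variance: probability-weighted average of squared distances from the mean.
variance = sum(p * (v - mean) ** 2 for p, v in zip(probs, values))
print(mean)      # 5.0
print(variance)  # 0.2 (up to float rounding)
```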

The variance is in square units because it is actually the "expected" squared distance from the mean. In other words, if the

variance is 1, we expect that if we square the distance from the mean, we'll get a value around 1.

The standard deviation just converts this expectation back into our old units.

That way if we have a variance of 4, we'll have a standard deviation of 2. It's just more convenient to express variance in normal

units rather than square units.

-Mean is the sum of the values, divided by the total number of values.

-Variance is the average of the squares of the distance that each value is from the mean.

-Standard deviation is the square root of the variance.
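These three definitions translate directly into code; a minimal sketch in Python, using the class of ages from the opening example:

```python
import math

# Ages from the earlier example; the outlier (34) drags the mean up to 18.
ages = [13, 14, 14, 15, 34]

mean = sum(ages) / len(ages)
# Population variance: average of squared distances from the mean.
variance = sum((a - mean) ** 2 for a in ages) / len(ages)
std_dev = math.sqrt(variance)

print(mean)      # 18.0
print(variance)  # 64.4
print(std_dev)   # ~8.02, in the same units as the ages
```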

The probability density function (PDF) describes the relative likelihood for a random variable to take a given value in the observation space.

The probability of a random variable falling within a given set is given by the integral of its density over the set.

The distribution function describes the range of possible values that a random variable can attain, and the probability that the value of the random variable lies within any (measurable) subset of that range.

The cumulative distribution function (CDF) describes the entire range of values (distribution) a continuous random variable takes in a domain.

The CDF is used to determine the probability that a continuous random variable falls in any (measurable) subset of that range.

This is computed by integrating the PDF over the range in question (i.e., taking the area under the PDF curve between two values).

NOTE: Over the entire domain, the total area under the PDF curve is equal to 1.

NOTE: A continuous random variable can take on an infinite number of values. The probability that it will equal a

specific value is always zero.

If test scores are normally distributed with mean 100 and standard deviation 10, the probability that a score is between 90 and 110 is:

P(90 < X < 110) = P(X < 110) - P(X < 90) = 0.84 - 0.16 = 0.68,

i.e., approximately 68%.
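This calculation can be reproduced with the normal CDF, which is expressible through the standard library's error function; a minimal sketch in Python (the rounded 0.84 and 0.16 above come from the exact values computed here):

```python
import math

def normal_cdf(x, mu=100.0, sigma=10.0):
    """P(X < x) for a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

p = normal_cdf(110) - normal_cdf(90)
print(round(p, 4))  # 0.6827
```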

Standard deviation (represented by the symbol sigma, σ) shows how much variation or "dispersion" exists from the average (mean, or expected value). A low standard deviation indicates that the data points tend to be very close to the mean; a high standard deviation indicates that the data points are spread out over a large range of values.

Interpretation and application

[Figure: two sample populations with the same mean and different standard deviations.]

A large standard deviation indicates that the data points are far from the mean and a small standard deviation indicates that they are

clustered closely around the mean.

For example, each of the three populations {0, 0, 14, 14}, {0, 6, 8, 14} and {6, 6, 8, 8} has a mean of 7. Their standard deviations are

7, 5, and 1, respectively.

The third population has a much smaller standard deviation than the other two because its values are all close to 7.

It will have the same units as the data points themselves.

If, for instance, the data set {0, 6, 8, 14} represents the ages of a population of four siblings in years, the standard deviation is 5

years.

As another example, the population {1000, 1006, 1008, 1014} may represent the distances traveled by four athletes, measured in

meters. It has a mean of 1007 meters, and a standard deviation of 5 meters.
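The three equal-mean populations above, and the fact that the standard deviation carries the data's own units, can be verified with a short helper; a minimal sketch in Python:

```python
import math

def pop_std(xs):
    """Population standard deviation: root of the average squared deviation."""
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

# All three populations have mean 7, but very different spreads.
print(pop_std([0, 0, 14, 14]))  # 7.0
print(pop_std([0, 6, 8, 14]))   # 5.0
print(pop_std([6, 6, 8, 8]))    # 1.0
```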

Standard deviation may serve as a measure of uncertainty. In physical science, for example, the reported standard deviation of a

group of repeated measurements gives the precision of those measurements. When deciding whether measurements agree with a

theoretical prediction, the standard deviation of those measurements is of crucial importance: if the mean of the measurements is

too far away from the prediction (with the distance measured in standard deviations), then the theory being tested probably needs to

be revised. This makes sense since they fall outside the range of values that could reasonably be expected to occur, if the

prediction were correct and the standard deviation appropriately quantified. See prediction interval.

While the standard deviation does measure how far typical values tend to be from the mean, other measures are available. An

example is the mean absolute deviation, which might be considered a more direct measure of average distance, compared to

the root mean square distance inherent in the standard deviation.

Application examples

The practical value of understanding the standard deviation of a set of values is in appreciating how much variation there is from the

"average" (mean).

Climate

As a simple example, consider the average daily maximum temperatures for two cities, one inland and one on the coast. It is helpful

to understand that the range of daily maximum temperatures for cities near the coast is smaller than for cities inland. Thus, while

these two cities may each have the same average maximum temperature, the standard deviation of the daily maximum temperature

for the coastal city will be less than that of the inland city as, on any particular day, the actual maximum temperature is more likely to

be farther from the average maximum temperature for the inland city than for the coastal one.

Particle physics

Particle physics uses a standard of "5 sigma" for the declaration of a discovery. [3] At five-sigma there is only one chance in nearly

two million that a random fluctuation would yield the result. This level of certainty prompted the announcement that a particle

consistent with the Higgs boson has been discovered in two independent experiments at CERN.[4]

Sports

Another way of seeing it is to consider sports teams. In any set of categories, there will be teams that rate highly at some things and

poorly at others. Chances are, the teams that lead in the standings will not show such disparity but will perform well in most

categories. The lower the standard deviation of their ratings in each category, the more balanced and consistent they will tend to be.

Teams with a higher standard deviation, however, will be more unpredictable. For example, a team that is consistently bad in most

categories will have a low standard deviation. A team that is consistently good in most categories will also have a low standard

deviation. However, a team with a high standard deviation might be the type of team that scores a lot (strong offense) but also

concedes a lot (weak defense), or, vice versa, that might have a poor offense but compensates by being difficult to score on.

Trying to predict which teams, on any given day, will win, may include looking at the standard deviations of the various team "stats"

ratings, in which anomalies can match strengths vs. weaknesses to attempt to understand what factors may prevail as stronger

indicators of eventual scoring outcomes.

In racing, a driver is timed on successive laps. A driver with a low standard deviation of lap times is more consistent than a driver

with a higher standard deviation. This information can be used to help understand where opportunities might be found to reduce lap

times.

Finance

In finance, standard deviation is often used as a measure of the risk associated with price-fluctuations of a given asset (stocks,

bonds, property, etc.), or the risk of a portfolio of assets [5] (actively managed mutual funds, index mutual funds, or ETFs). Risk is an

important factor in determining how to efficiently manage a portfolio of investments because it determines the variation in returns on

the asset and/or portfolio and gives investors a mathematical basis for investment decisions (known as mean-variance optimization).

The fundamental concept of risk is that as it increases, the expected return on an investment should increase as well, an increase

known as the "risk premium." In other words, investors should expect a higher return on an investment when that investment carries

a higher level of risk or uncertainty. When evaluating investments, investors should estimate both the expected return and the

uncertainty of future returns. Standard deviation provides a quantified estimate of the uncertainty of future returns.

For example, let's assume an investor had to choose between two stocks. Stock A over the past 20 years had an average return of

10 percent, with a standard deviation of 20 percentage points (pp) and Stock B, over the same period, had average returns of 12

percent but a higher standard deviation of 30 pp. On the basis of risk and return, an investor may decide that Stock A is the safer

choice, because Stock B's additional two percentage points of return is not worth the additional 10 pp standard deviation (greater

risk or uncertainty of the expected return). Stock B is likely to fall short of the initial investment (but also to exceed the initial

investment) more often than Stock A under the same circumstances, and is estimated to return only two percent more on average.

In this example, Stock A is expected to earn about 10 percent, plus or minus 20 pp (a range of -10 percent to 30 percent), in about two-thirds of the future year returns. When considering more extreme possible returns or outcomes in future, an investor should expect results of as much as 10 percent plus or minus 60 pp, or a range from -50 percent to 70 percent, which includes outcomes for three standard deviations from the average return (about 99.7 percent of probable returns).

Calculating the average (or arithmetic mean) of the return of a security over a given period will generate the expected return of the

asset. For each period, subtracting the expected return from the actual return results in the difference from the mean. Squaring the

difference in each period and taking the average gives the overall variance of the return of the asset. The larger the variance, the

greater risk the security carries. Finding the square root of this variance will give the standard deviation of the investment tool in

question.

Population standard deviation is used to set the width of Bollinger Bands, a widely adopted technical analysis tool. For example, the upper Bollinger Band is given as x̄ + nσ_x. The most commonly used value for n is 2; there is about a five percent chance of going outside, assuming a normal distribution of returns.

Geometric interpretation


To gain some geometric insight and clarification, we will start with a population of three values, x1, x2, x3. This defines a point P = (x1, x2, x3) in R³. Consider the line L = {(r, r, r) : r ∈ R}. This is the "main diagonal" going through the origin. If our three given values were all equal, then the standard deviation would be zero and P would lie on L. So it is not unreasonable to assume that the standard deviation is related to the distance of P to L, and that is indeed the case. To move orthogonally from L to the point P, one begins at the point

M = (x̄, x̄, x̄),

whose coordinates are the mean of the values we started out with. A little algebra shows that the distance between P and M (which is the same as the orthogonal distance between P and the line L) is equal to the standard deviation of the vector (x1, x2, x3), multiplied by the square root of the number of dimensions of the vector (3 in this case): |P - M| = σ√3.

Chebyshev's inequality

Main article: Chebyshev's inequality

An observation is rarely more than a few standard deviations away from the mean.

Chebyshev's inequality ensures that, for all distributions for which the standard deviation is defined,

the amount of data within a number of standard deviations of the mean is at least as much as given in the following table.

Minimum population within distance | Distance from mean (in standard deviations)
50%  | √2
75%  | 2
89%  | 3
94%  | 4
96%  | 5
97%  | 6

[6]
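The table follows directly from the bound itself: at least 1 - 1/k² of any distribution lies within k standard deviations of the mean. A minimal sketch in Python:

```python
import math

# Chebyshev's bound: at least 1 - 1/k^2 of the data lies within
# k standard deviations of the mean, for any distribution.
ks = [math.sqrt(2), 2, 3, 4, 5, 6]
bounds = [1.0 - 1.0 / k ** 2 for k in ks]

for k, b in zip(ks, bounds):
    print(f"k = {k:.3f}: at least {b:.1%}")
```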

[Figure: Dark blue is less than one standard deviation from the mean. For the normal distribution, this accounts for 68.27 percent of the set; two standard deviations from the mean (medium and dark blue) account for 95.45 percent; three standard deviations (light, medium, and dark blue) account for 99.73 percent; and four standard deviations account for 99.994 percent. The two points of the curve that are one standard deviation from the mean are also the inflection points.]

The central limit theorem says that the distribution of an average of many independent, identically distributed random variables tends toward the famous bell-shaped normal distribution with a probability density function of

f(x) = (1 / (σₐ √(2π))) e^(-(x - μ)² / (2σₐ²)),

where μ is the expected value of the random variables, σₐ = σ / n^(1/2) equals their distribution's standard deviation divided by n^(1/2), and n is the number of random variables. The standard deviation therefore is simply a scaling variable that adjusts how broad the curve will be, though it also appears in the normalizing constant.

If a data distribution is approximately normal, then the proportion of data values within z standard deviations of the mean is given by

Proportion = erf(z / √2),

where erf is the error function. If a data distribution is approximately normal, then about 68 percent of the data values are within one standard deviation of the mean (mathematically, μ ± σ, where μ is the arithmetic mean), about 95 percent are within two standard deviations (μ ± 2σ), and about 99.7 percent lie within three standard deviations (μ ± 3σ).

This is known as the 68-95-99.7 rule, or the empirical rule.
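The rule's percentages come straight from the error function; a minimal sketch in Python:

```python
import math

# Proportion of a normal distribution lying within z standard
# deviations of the mean: erf(z / sqrt(2)).
proportions = {z: math.erf(z / math.sqrt(2.0)) for z in (1, 2, 3)}

for z, p in proportions.items():
    print(f"within {z} sd: {p:.2%}")  # 68.27%, 95.45%, 99.73%
```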

For various values of z, the percentage of values expected to lie in and outside the symmetric interval CI = (-zσ, zσ) is as follows:

z        | inside CI   | outside CI | fraction outside
1.959964 | 95%         | 5%         | 1 / 20
4        | 99.993666%  | 0.006334%  | 1 / 15,787

COVARIANCE

In probability theory and statistics, covariance is a measure of how much two random variables change together.

If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for

the smaller values,

i.e., the variables tend to show similar behavior, the covariance is a positive number.

In the opposite case, when the greater values of one variable mainly correspond to the smaller values of the other, i.e., the

variables tend to show opposite behavior, the covariance is negative.

The sign of the covariance therefore shows the tendency in the linear relationship between the variables. The magnitude of

the covariance is not that easy to interpret.

The normalized version of the covariance, the correlation coefficient, however, shows by its magnitude the strength of the

linear relation.

A distinction must be made between

(1) the covariance of two random variables, which is a population parameter that can be seen as a property of the joint probability

distribution, and

(2) the sample covariance, which serves as an estimated value of the parameter.

VARIANCE

In probability theory and statistics,

the variance is a measure of how far a set of numbers is spread out.

It is one of several descriptors of a probability distribution, describing how far the numbers lie from the mean (expected value).

In particular, the variance is one of the moments of a distribution.

In that context, it forms part of a systematic approach to distinguishing between probability distributions. While other such

approaches have been developed, those based on moments are advantageous in terms of mathematical and computational

simplicity.

The variance is a parameter describing in part either the actual probability distribution of an observed population of

numbers, or

the theoretical probability distribution of a sample (a not-fully-observed population) of numbers.

In the latter case a sample of data from such a distribution can be used to construct an estimate of its variance: in the simplest

cases this estimate can be the sample variance, defined below.


Definition

The covariance between two jointly distributed real-valued random variables x and y with finite second moments is defined[1] as

Cov(x, y) = E[(x - E[x])(y - E[y])],

where E[x] is the expected value of x, also known as the mean of x. By using the linearity property of expectations, this can be simplified to

Cov(x, y) = E[xy] - E[x]E[y].

If E[xy] is zero, the random variables are orthogonal.[2]

For random vectors x and y (of dimension m and n respectively), the m × n covariance matrix is equal to

Cov(x, y) = E[(x - E[x])(y - E[y])ᵀ],

where mᵀ denotes the transpose of the vector (or matrix) m. The (i, j)-th element of this matrix is equal to the covariance Cov(xᵢ, yⱼ) between the i-th scalar component of x and the j-th scalar component of y. In particular, Cov(y, x) is the transpose of Cov(x, y).

For a random vector x with finite second moments, its covariance matrix is defined as Σ = Cov(x, x) = E[(x - E[x])(x - E[x])ᵀ].

The units of measurement of the covariance Cov(x, y) are those of x times those of y.

By contrast, correlation coefficients, which depend on the covariance, are a dimensionless measure of linear

dependence. (In fact, correlation coefficients can simply be understood as a normalized version of

covariance.)

Properties

Variance is a special case of the covariance when the two variables are identical:

Cov(x, x) = Var(x).

If x, y, W, and V are real-valued random variables and a, b, c, d are constants ("constant" in this context means non-random), then the following facts are a consequence of the definition of covariance:

Cov(x, a) = 0
Cov(x, x) = Var(x)
Cov(x, y) = Cov(y, x)
Cov(ax, by) = ab Cov(x, y)
Cov(x + a, y + b) = Cov(x, y)
Cov(ax + bW, cy + dV) = ac Cov(x, y) + ad Cov(x, V) + bc Cov(W, y) + bd Cov(W, V)

For sequences x1, ..., xn and y1, ..., ym of random variables, we have

Cov(Σᵢ xᵢ, Σⱼ yⱼ) = Σᵢ Σⱼ Cov(xᵢ, yⱼ).

For a sequence x1, ..., xn of random variables and constants a1, ..., an, we have

Var(Σᵢ aᵢxᵢ) = Σᵢ Σⱼ aᵢaⱼ Cov(xᵢ, xⱼ).

Let X be a random vector with covariance matrix Σ, and let A be a matrix that can act on X. The result of applying this matrix to X is a new vector with covariance matrix

Cov(AX) = A Σ Aᵀ.

This is a direct result of the linearity of expectation and is useful when applying a linear transformation, such as a whitening transformation, to a vector.

Uncorrelatedness and independence

If x and y are independent, then their covariance is zero. This follows because under independence E[xy] = E[x]E[y], so

Cov(x, y) = E[xy] - E[x]E[y] = 0.

The converse, however, is not generally true. For example, let x be uniformly distributed in [-1, 1] and let y = x². Clearly, x and y are dependent, but Cov(x, y) = E[x³] - E[x]E[x²] = 0 - 0 = 0. Correlation and covariance are measures of linear dependence between two variables, so, as in this example, two variables being uncorrelated does not imply that they are independent.
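The counterexample is easy to see numerically: sample x on a symmetric grid, set y = x², and the covariance vanishes even though y is completely determined by x. A minimal sketch in Python:

```python
# x uniform on a symmetric grid over [-1, 1]; y = x^2 depends on x
# entirely, yet the covariance between them is zero.
n = 2001
xs = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
ys = [x * x for x in xs]

mean_x = sum(xs) / n
mean_y = sum(ys) / n
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
print(abs(cov) < 1e-9)  # True: uncorrelated despite total dependence
```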

Relationship to inner products

Many of the properties of covariance can be extracted elegantly by observing that it satisfies similar properties to those of an inner product:

1. bilinear: for constants a, b and random variables x, y, z: Cov(ax + by, z) = a Cov(x, z) + b Cov(y, z)
2. symmetric: Cov(x, y) = Cov(y, x)
3. positive semi-definite: σ²(x) = Cov(x, x) ≥ 0, and Cov(x, x) = 0 implies that x is a constant random variable (K).

In fact these properties imply that the covariance defines an inner

product over the quotient vector space obtained by taking the subspace

of random variables with finite second moment and identifying any two

that differ by a constant. (This identification turns the positive semi-definiteness above into positive definiteness.) That quotient vector

space is isomorphic to the subspace of random variables with finite

second moment and mean zero; on that subspace, the covariance is

exactly the L2 inner product of real-valued functions on the sample

space.

As a result, for random variables with finite variance the following inequality holds via the Cauchy–Schwarz inequality:

|Cov(x, y)| ≤ √(Var(x) Var(y)).

Calculating the sample covariance

The sample covariance between variable j and variable k, based on N observations of each, is

q_jk = (1 / (N - 1)) Σᵢ₌₁ᴺ (x_ij - x̄_j)(x_ik - x̄_k),

which is an estimate of the covariance between variable j and variable k.

The sample mean and the sample covariance matrix are unbiased estimates of the mean and the covariance matrix of the random vector X, a row vector whose jth element (j = 1, ..., K) is one of the random variables. The reason the sample covariance matrix has N - 1 in the denominator rather than N is essentially that the population mean E(X) is not known and is replaced by the sample mean x̄. If the population mean E(X) is known, the analogous unbiased estimate is given by

q_jk = (1 / N) Σᵢ₌₁ᴺ (x_ij - E(x_j))(x_ik - E(x_k)).
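The N - 1 denominator can be sketched as a small helper; in the example below (an illustrative choice, not from the source), ys is an exact linear function of xs, so the sample covariance equals the slope times the sample variance of xs:

```python
def sample_cov(xs, ys):
    """Sample covariance with the unbiased N - 1 denominator."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # ys = 2 * xs
print(sample_cov(xs, ys))   # 10/3: twice the sample variance of xs
```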

Comments

The covariance is sometimes called a measure of "linear dependence" between the two random variables. That does not mean the same thing as in the context of linear algebra (see linear dependence). When the covariance is normalized, one obtains the correlation coefficient. From it, one can obtain the Pearson coefficient, which gives the goodness of the fit for the best possible linear function describing the relation between the variables. In this sense covariance is a linear gauge of dependence.


MOMENT

In mathematics, a moment is, loosely speaking, a quantitative measure of the shape of a set of points.

The "second moment", for example, is widely used and measures the "width" (in a particular sense) of a set of points in one

dimension or in higher dimensions measures the shape of a cloud of points as it could be fit by an ellipsoid.

Other moments describe other aspects of a distribution such as how the distribution is skewed from its mean, or peaked. The

mathematical concept is closely related to the concept of moment in physics, although moment in physics is often represented

somewhat differently. Any distribution can be characterized by a number of features (such as the mean, the variance, the skewness,

etc.), and the moments of a function[1] describe the nature of its distribution.

The 1st moment is denoted by μ₁. The first moment of the distribution of the random variable X is its expected value E[X], i.e., the population mean (if the first moment exists).

In higher orders, the central moments (moments about the mean) are more interesting than the moments about zero. The first central moment is 0. The zeroth central moment, μ₀, is one. The second central moment is the variance.

Other moments may also be defined. For example, the nth inverse moment about zero is E[X⁻ⁿ] and the nth logarithmic moment about zero is E[lnⁿ(X)].


Significance of the moments

The nth moment of a real-valued continuous function f(x) of a real variable about a value c is

μₙ = ∫ (x - c)ⁿ f(x) dx.

It is possible to define moments for random variables in a more general fashion than moments for real values (see moments in metric spaces). The moment of a function, without further explanation, usually refers to the above expression with c = 0. Usually, except in the special context of the problem of moments, the function f(x) will be a probability density function. The nth moment about zero of a probability density function f(x) is the expected value of Xⁿ and is called a raw moment or crude moment.[2] The moments about its mean are called central moments; these describe the shape of the function, independently of translation.

If f is a probability density function, then the value of the integral above is called the nth moment of the probability distribution. More generally, if F is a cumulative probability distribution function of any probability distribution, which may not have a density function, then the nth moment of the probability distribution is given by the Riemann–Stieltjes integral

μ′ₙ = E[Xⁿ] = ∫ xⁿ dF(x),

where X is a random variable that has this cumulative distribution F, and E is the expectation operator or mean. When

E[|Xⁿ|] = ∫ |xⁿ| dF(x) = ∞,

the moment is said not to exist. If the nth moment about any point exists, so does the (n - 1)th moment, and all lower-order moments, about every point.

The zeroth moment of any probability density function is 1, since the area under any probability density function must be equal to one.

Significance of moments (raw, central, standardized) and cumulants (raw, standardized), in connection with named properties of distributions:

Moment number | Raw moment | Central moment | Standardized moment | Raw cumulant | Standardized cumulant
2             | -          | variance       | 1                   | variance     | 1
3             | -          | -              | skewness            | -            | skewness
4             | -          | -              | historical kurtosis | -            | modern kurtosis
5+            | -          | -              | -                   | -            | -

Mean

Main article: Mean

The first raw moment is the mean.

Variance

Main article: Variance

The second central moment is the variance. Its positive square root is the standard deviation σ.

Normalized moments

The normalized nth central moment or standardized moment is the nth central moment divided by σⁿ; the normalized nth central moment of x is E[(x - μ)ⁿ] / σⁿ. These normalized central moments are dimensionless quantities, which represent the distribution independently of any linear change of scale.

Skewness

Main article: Skewness

Any symmetric distribution will have a third central moment, if defined, of zero. The normalized third central moment is called the skewness, often γ. A distribution that is skewed to the left (the tail of the distribution is heavier on the left) will have a negative skewness. A distribution that is skewed to the right (the tail of the distribution is heavier on the right) will have a positive skewness.

For distributions that are not too different from the normal distribution, the median will be somewhere near μ - γσ/6; the mode about μ - γσ/2.

Kurtosis

Main article: Kurtosis

The fourth central moment is a measure of whether the distribution is tall and skinny or short and squat, compared to the normal distribution of the same variance. Since it is the expectation of a fourth power, the fourth central moment, where defined, is always non-negative; and except for a point distribution, it is always strictly positive. The fourth central moment of a normal distribution is 3σ⁴.

The kurtosis κ is defined to be the normalized fourth central moment minus 3. (Equivalently, as in the next section, it is the fourth cumulant divided by the square of the variance.) Some authorities[3][4] do not subtract three, but it is usually more convenient to have the normal distribution at the origin of coordinates. If a distribution has a peak at the mean and long tails, the fourth moment will be high and the kurtosis positive (leptokurtic); conversely, bounded distributions tend to have low kurtosis (platykurtic).

The kurtosis can be positive without limit, but κ must be greater than or equal to γ² - 2; equality only holds for binary distributions. For unbounded skew distributions not too far from normal, κ tends to be somewhere in the area of γ² and 2γ².

The inequality can be proven by considering

E[(T² - aT - 1)²],

where T = (X - μ)/σ. This is the expectation of a square, so it is non-negative whatever a is; on the other hand, it is also a quadratic polynomial in a. Its discriminant must be non-positive, which gives the required relationship.

Mixed moments

Mixed moments are moments involving multiple variables.

Some examples are covariance, coskewness and cokurtosis. While there is a unique covariance, there are

multiple co-skewnesses and co-kurtoses.

Higher moments

High-order moments are moments beyond 4th-order moments. The higher the moment, the harder it is to estimate, in the sense that larger samples are required in order to obtain estimates of similar quality.

Cumulants

The first moment and the second and third unnormalized central moments are additive, in the sense that if X and Y are independent random variables then

E[X + Y] = E[X] + E[Y],
Var(X + Y) = Var(X) + Var(Y), and
μ₃(X + Y) = μ₃(X) + μ₃(Y).

(These can also hold for variables that satisfy weaker conditions than independence. The first always holds; if the second holds, the variables are called uncorrelated.)

In fact, these are the first three cumulants and all cumulants share this additivity property.

Sample moments

The moments of a population can be estimated using the sample k-th moment

(1/n) Σᵢ₌₁ⁿ Xᵢᵏ.

It can be shown that the expected value of the sample moment is equal to the k-th moment of the population, if that moment exists, for any sample size n. It is thus an unbiased estimator.
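The k-th sample moment is just the average of k-th powers; a minimal sketch in Python:

```python
def sample_moment(xs, k):
    """k-th sample moment: the average of the k-th powers of the data."""
    return sum(x ** k for x in xs) / len(xs)

data = [1.0, 2.0, 3.0]
print(sample_moment(data, 1))  # 2.0: the sample mean
print(sample_moment(data, 2))  # 14/3: the average of the squares
```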

Problem of moments

The problem of moments seeks characterizations of sequences {μ′ₙ : n = 1, 2, 3, ...} that are sequences of moments of some function f.

Partial moments

Partial moments are sometimes referred to as "one-sided moments." The nth-order lower and upper partial moments with respect to a reference point r may be expressed as

μₙ⁻(r) = ∫ from -∞ to r of (r - x)ⁿ f(x) dx,
μₙ⁺(r) = ∫ from r to ∞ of (x - r)ⁿ f(x) dx.

Partial moments are normalized by being raised to the power 1/n. The upside potential ratio may be expressed as a ratio of a first-order upper partial moment to a normalized second-order lower partial moment.

Central moments in metric spaces

Let (M, d) be a metric space, and let B(M) be the Borel σ-algebra on M, the σ-algebra generated by the d-open subsets of M. (For technical reasons, it is also convenient to assume that M is a separable space with respect to the metric d.) Let 1 ≤ p ≤ +∞.

The pth central moment of a measure μ on the measurable space (M, B(M)) about a given point x₀ in M is defined to be

∫ d(x, x₀)ᵖ dμ(x),

and μ is said to have finite pth central moment if the pth central moment of μ about x₀ is finite for some x₀ ∈ M.

This terminology for measures carries over to random variables in the usual way: if (Ω, Σ, P) is a probability space and X : Ω → M is a random variable, then the pth central moment of X about x₀ ∈ M is defined to be

E[d(X, x₀)ᵖ],

and X has finite pth central moment if the pth central moment of X about x₀ is finite for some x₀ ∈ M.

INDEPENDENCE

In probability theory, to say that two events are independent (alternatively statistically independent, marginally independent or absolutely independent[1]) means that the occurrence of one does not affect the probability of the other. Similarly,

two random variables are independent if the observed value of one does not affect the probability distribution of the other.

The concept of independence extends to dealing with collections of more than two events or random variables.


Definition

For events

Two events

Two events A and B are independent iff their joint probability equals the product of their probabilities:

P(A ∩ B) = P(A) P(B).

Why this defines independence is made clear by rewriting with conditional probabilities:

P(A | B) = P(A) and P(B | A) = P(B).

Thus, the occurrence of B does not affect the probability of A, and vice versa. Although the derived expressions may seem more intuitive, they are not the preferred definition, as the conditional probabilities may be undefined if P(A) or P(B) is 0. Furthermore, the preferred definition makes clear by symmetry that when A is independent of B, B is also independent of A.

More than two events

A finite set of events {Ai} is pairwise independent iff every pair of events is independent[2]. That is, iff for any distinct pair of indices m, k,

P(Am ∩ Ak) = P(Am) P(Ak).

A finite set of events is mutually independent iff every event is independent of any intersection of the other events. That is, iff for every subset {Ai1, ..., Ain},

P(Ai1 ∩ ... ∩ Ain) = P(Ai1) ... P(Ain).

This is called the multiplication rule for independent events.

For more than two events, a mutually independent set of events is pairwise independent, but the converse is not necessarily true.
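The gap between pairwise and mutual independence can be checked exhaustively on a small sample space. A Python sketch using the classical two-coin construction (the choice of events A, B, C is ours, added for illustration):

```python
from itertools import product
from fractions import Fraction

# Sample space: two fair coin tosses, each of the 4 outcomes has probability 1/4.
outcomes = list(product("HT", repeat=2))
p = Fraction(1, 4)

A = {o for o in outcomes if o[0] == "H"}   # first toss is heads
B = {o for o in outcomes if o[1] == "H"}   # second toss is heads
C = {o for o in outcomes if o[0] == o[1]}  # the two tosses agree

def prob(event):
    return len(event) * p

# Every pair satisfies the multiplication rule ...
pairwise = all(prob(X & Y) == prob(X) * prob(Y)
               for X, Y in [(A, B), (A, C), (B, C)])
# ... but the triple intersection does not, so the events are
# pairwise independent without being mutually independent.
mutual = prob(A & B & C) == prob(A) * prob(B) * prob(C)
```

Here P(A ∩ B ∩ C) = 1/4 while P(A) P(B) P(C) = 1/8, so the multiplication rule fails for the full set.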

For random variables

Two random variables

Two random variables X and Y are independent iff for every x and y, the events {X ≤ x} and {Y ≤ y} are independent events; equivalently, iff the pair (X, Y) has a joint cumulative distribution function that factors as

F(x, y) = F(x) F(y).

A set of random variables is pairwise independent iff every pair of random variables is independent.

A set of random variables is mutually independent iff for any finite subset X1, ..., Xn and any finite set of numbers a1, ..., an, the events {X1 ≤ a1}, ..., {Xn ≤ an} are mutually independent events (as defined above).

The measure-theoretic definition, which substitutes events {X ∈ A} for events {X ≤ a}, is exactly equivalent to the one above when the values of the random variables are real numbers. It has the advantage of working also for complex-valued random variables or for random variables taking values in any measurable space (which includes topological spaces endowed with appropriate σ-algebras).

Conditional independence

Main article: Conditional independence

Intuitively, two random variables X and Y are conditionally independent given Z if, once Z is known, the value of Y does not add any additional information about X. For instance, two measurements X and Y of the same underlying quantity Z are not independent, but they are conditionally independent given Z (unless the errors in the two measurements are somehow connected).

The formal definition of conditional independence is based on the idea of conditional distributions. If X, Y, and Z are discrete random variables, then we define X and Y to be conditionally independent given Z if

P(X = x, Y = y | Z = z) = P(X = x | Z = z) P(Y = y | Z = z)

for all x, y and z such that P(Z = z) > 0. On the other hand, if the random variables are continuous and have a joint probability density function p, then X and Y are conditionally independent given Z if

p(x, y | z) = p(x | z) p(y | z).

If X and Y are conditionally independent given Z, then

P(X = x | Y = y, Z = z) = P(X = x | Z = z)

for any x, y and z with P(Z = z) > 0. That is, the conditional distribution for X given Y and Z is the same as that given Z alone. A similar equation holds for the conditional probability density functions in the continuous case.

Independence can be seen as a special kind of conditional independence, since probability can be seen as a kind of conditional probability given no events.
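The "two noisy measurements of the same quantity" picture can be verified exactly on a tiny discrete model. A Python sketch (the model, a fair bit Z read twice through independent 10% bit-flip noise, is our illustrative assumption):

```python
from fractions import Fraction
from itertools import product
from collections import defaultdict

# Z is a fair bit; X and Y are independent noisy readings of Z,
# each flipped with probability 1/10.
flip = Fraction(1, 10)
joint = defaultdict(Fraction)
for z, ex, ey in product((0, 1), repeat=3):      # ex, ey: measurement errors
    p = Fraction(1, 2)
    p *= flip if ex else 1 - flip
    p *= flip if ey else 1 - flip
    joint[(z ^ ex, z ^ ey, z)] += p

def p_xyz(x, y, z): return joint[(x, y, z)]
def p_z(z):    return sum(p_xyz(x, y, z) for x in (0, 1) for y in (0, 1))
def p_xz(x, z): return sum(p_xyz(x, y, z) for y in (0, 1))
def p_yz(y, z): return sum(p_xyz(x, y, z) for x in (0, 1))

# P(X, Y | Z) factors into P(X | Z) P(Y | Z) in every cell:
cond_indep = all(
    p_xyz(x, y, z) / p_z(z) == (p_xz(x, z) / p_z(z)) * (p_yz(y, z) / p_z(z))
    for x, y, z in product((0, 1), repeat=3)
)

# ... yet X and Y are marginally dependent (seeing one reading tells you
# something about the other, through Z):
p_x0 = sum(p_xz(0, z) for z in (0, 1))
p_y0 = sum(p_yz(0, z) for z in (0, 1))
p_x0y0 = sum(p_xyz(0, 0, z) for z in (0, 1))
marginally_independent = (p_x0y0 == p_x0 * p_y0)
```

Exact rational arithmetic (fractions.Fraction) makes the equalities literal rather than approximate.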

Independent σ-algebras

The definitions above are both generalized by the following definition of independence for σ-algebras. Let (Ω, Σ, Pr) be a probability space and let A and B be two sub-σ-algebras of Σ. A and B are said to be independent if, whenever A ∈ A and B ∈ B,

Pr(A ∩ B) = Pr(A) Pr(B).

Two events are independent (in the old sense) if and only if the σ-algebras that they generate are independent (in the new sense). The σ-algebra generated by an event E ∈ Σ is, by definition,

σ({E}) = {∅, E, Ω \ E, Ω}.

Likewise, two random variables are independent (in the old sense) if and only if the σ-algebras that they generate are independent (in the new sense). The σ-algebra generated by a random variable X taking values in some measurable space S consists, by definition, of all subsets of Ω of the form X⁻¹(U), where U is any measurable subset of S.

Using this definition, it is easy to show that if X and Y are random variables and Y is constant, then X and Y are independent, since the σ-algebra generated by a constant random variable is the trivial σ-algebra {∅, Ω}. Probability-zero events cannot affect independence, so independence also holds if Y is only Pr-almost surely constant.

Properties

Self-dependence

Note that an event A is independent of itself iff

Pr(A) = Pr(A ∩ A) = Pr(A) Pr(A), i.e. Pr(A) = 0 or Pr(A) = 1.

Thus if an event or its complement almost surely occurs, it is independent of itself. For example, if the event A is choosing any number but 0.5 from a uniform distribution on the unit interval, A is independent of itself, even though, tautologically, A fully determines A.

Expectation and covariance

If X and Y are independent, the expectation operator has the property

E[X Y] = E[X] E[Y],

so the covariance cov(X, Y) = E[X Y] − E[X] E[Y] is zero. (The converse of these, i.e. the proposition that if two random variables have a covariance of 0 they must be independent, is not true. See uncorrelated.)

Characteristic function

Two independent random variables X and Y have the property that the characteristic function of their sum is the product of their marginal characteristic functions:

φ(t) of X + Y = φ(t) of X times φ(t) of Y,

but the reverse implication is not true (see subindependence).

Examples

Rolling a die

The event of getting a 6 the first

time a die is rolled and the event

of getting a 6 the second time

are independent. By contrast, the

event of getting a 6 the first time a

die is rolled and the event that the

sum of the numbers seen on the

first and second trials is 8

are not independent.
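Both claims about the die can be checked by exact enumeration of the 36 equally likely outcomes. A Python sketch (the event names are ours):

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))   # all 36 ordered pairs
p = Fraction(1, 36)

six_first = {r for r in rolls if r[0] == 6}    # 6 on the first roll
six_second = {r for r in rolls if r[1] == 6}   # 6 on the second roll
sum_is_8 = {r for r in rolls if sum(r) == 8}   # the two rolls sum to 8

def prob(event):
    return len(event) * p

# "6 first" and "6 second" satisfy the multiplication rule ...
indep_pair = prob(six_first & six_second) == prob(six_first) * prob(six_second)
# ... while "6 first" and "sum is 8" do not: 1/36 versus (1/6)(5/36).
dep_pair = prob(six_first & sum_is_8) == prob(six_first) * prob(sum_is_8)
```

Knowing the first roll is a 6 raises the chance that the sum is 8 (from 5/36 to 1/6), which is exactly the failure of independence.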

Drawing cards

If two cards are

drawn with replacement from a

deck of cards, the event of

drawing a red card on the first trial

and that of drawing a red card on

the second trial are independent.

By contrast, if two cards are

drawn without replacement from a

deck of cards, the event of

drawing a red card on the first trial

and that of drawing a red card on

the second trial are

not independent.

Pairwise and mutual independence

[Figure: pairwise independent, but not mutually independent, events.]

[Figure: mutually independent events.]

Consider the two probability spaces shown. The first space is pairwise independent but not mutually independent; the second space is mutually independent. To illustrate the difference, consider conditioning on two events. In the pairwise independent case, although, for example, A is independent of both B and C individually, it is not independent of the intersection B ∩ C:

Pr(A | B ∩ C) ≠ Pr(A).

In the mutually independent case, however:

Pr(A | B ∩ C) = Pr(A).

See also [4] for a three-event example in which the multiplication rule holds for the triple intersection and yet no pair of the three events is pairwise independent.

MUTUAL EXCLUSIVITY

In layperson's terms, two events are 'mutually exclusive' if they cannot occur at the same time. An example is tossing a coin once, which can result in either heads or tails, but not both.

In the coin-tossing example, both outcomes are collectively exhaustive, which means that at least one of the outcomes must

happen, so these two possibilities together exhaust all the possibilities. However, not all mutually exclusive events are collectively

exhaustive. For example, the outcomes 1 and 4 of a single roll of a six-sided die are mutually exclusive (cannot both happen) but

not collectively exhaustive (there are other possible outcomes; 2,3,5,6).


Logic

In logic, two mutually exclusive propositions are propositions that logically cannot be true at the same time. Another term for

mutually exclusive is "disjoint". To say that more than two propositions are mutually exclusive, depending on context, means that

one cannot be true if the other one is true, or at least one of them cannot be true. The term pairwise mutually exclusive always

means two of them cannot be true simultaneously.

Probability

In probability theory, events E1, E2, ..., En are said to be mutually exclusive if the occurrence of any one of them automatically implies the non-occurrence of the remaining n − 1 events. Therefore, two mutually exclusive events cannot both occur. Formally said, the intersection of each two of them is empty (the null event): A ∩ B = ∅. In consequence, mutually exclusive events have the property: P(A and B) = 0.[1]

For example, one cannot draw a card that is both red and a club because clubs are always black. If one draws just one card from

the deck, either a red card or a club can be drawn. When A and B are mutually exclusive, P(A or B) = P(A) + P(B).[2] One might ask,

"What is the probability of drawing a red card or a club?" This problem would be solved by adding together the probability of drawing

a red card and the probability of drawing a club. In a standard 52-card deck, there are twenty-six red cards and thirteen clubs: 26/52

+ 13/52 = 39/52 or 3/4.

One would have to draw at least two cards in order to draw both a red card and a club. The probability of doing so in two draws

would depend on whether the first card drawn were replaced before the second drawing, since without replacement there would be

one fewer card after the first card was drawn. The probabilities of the individual events (red, and club) would be multiplied rather

than added. The probability of drawing a red and a club in two drawings without replacement would be 26/52 * 13/51 = 338/2652, or

13/102. With replacement, the probability would be 26/52 * 13/52 = 338/2704, or 13/104.

In probability theory the word "or" allows for the possibility of both events happening. The probability of one or both events occurring is denoted P(A or B) and in general it equals P(A) + P(B) − P(A and B).[2] Therefore, if one asks, "What is the probability of drawing a red card or a king?", drawing any of a red king, a red non-king, or a black king is considered a success. In a standard 52-card deck, there are twenty-six red cards and four kings, two of which are red, so the probability of drawing a red or a king is 26/52 + 4/52 − 2/52 = 28/52. However, with mutually exclusive events the last term in the formula, P(A and B), is zero, so the formula simplifies to

the one given in the previous paragraph.
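Both the mutually exclusive case and the overlapping case can be computed by exact enumeration over a deck. A Python sketch (the deck encoding, ranks 1–13 in suits s/h/d/c, is our illustrative choice):

```python
from fractions import Fraction

# A standard deck: 13 ranks in each of four suits (spades, hearts, diamonds, clubs).
deck = [(rank, suit) for suit in "shdc" for rank in range(1, 14)]
red = {c for c in deck if c[1] in "hd"}
clubs = {c for c in deck if c[1] == "c"}
kings = {c for c in deck if c[0] == 13}

def prob(event):
    return Fraction(len(event), len(deck))

# Red and club are mutually exclusive, so the addition rule applies directly:
p_red_or_club = prob(red | clubs)      # 26/52 + 13/52 = 3/4

# Red and king overlap (two red kings), so the intersection is subtracted:
p_red_or_king = prob(red) + prob(kings) - prob(red & kings)   # 28/52
```

Counting the union directly (prob(red | clubs)) and applying the addition rule give the same answer precisely because P(red and club) = 0.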

Events are collectively exhaustive if all the possibilities for outcomes are exhausted by those possible events, so at least one of

those outcomes must occur. The probability that at least one of the events will occur is equal to 1. [3] For example, there are

theoretically only two possibilities for flipping a coin. Flipping a head and flipping a tail are collectively exhaustive events, and there

is a probability of 1 of flipping either a head or a tail. Events can be both mutually exclusive and collectively exhaustive. [3] In the case

of flipping a coin, flipping a head and flipping a tail are also mutually exclusive events. Both outcomes cannot occur for a single trial

(i.e., when a coin is flipped only once). The probability of flipping a head and the probability of flipping a tail can be added to yield a

probability of 1: 1/2 + 1/2 =1.[4]

Statistics

In statistics and regression analysis, an independent variable that can take on only two possible values is called a dummy variable. For example, it may take on the value 0 if an observation is of a male subject or 1 if the observation is of a female subject. The two possible categories associated with the two possible values are mutually exclusive, so that no observation falls into more than one category, and the categories are exhaustive, so that every observation falls into some category. Sometimes there are three or more possible categories, which are pairwise mutually exclusive and collectively exhaustive: for example, under 18 years of age, 18 to 64 years of age, and age 65 or above. In this case a set of dummy variables is constructed, each dummy variable having two mutually exclusive and jointly exhaustive categories. In this example, one dummy variable (called D1) would equal 1 if age is less than 18, and would equal 0 otherwise; a second dummy variable (called D2) would equal 1 if age is in the range 18 to 64, and 0 otherwise. In this set-up, the dummy variable pairs (D1, D2) can have the values (1,0) (under 18), (0,1) (between 18 and 64), or (0,0) (65 or older) (but not (1,1), which would nonsensically imply that an observed subject is both under 18 and between 18 and 64). Then the dummy variables can be included as independent (explanatory) variables in a regression. Note that the number of dummy variables is always one less than the number of categories: with the two categories male and female there is a single dummy variable to distinguish them, while with the three age categories two dummy variables are needed to distinguish them.

Such qualitative data can also be used for dependent variables. For example, a researcher might want to predict whether someone goes to college or not, using family income, a gender dummy variable, and so forth as explanatory variables. Here the variable to be explained is a dummy variable that equals 0 if the observed subject does not go to college and equals 1 if the subject does go to college. In such a situation, ordinary least squares (the basic regression technique) is widely seen as inadequate; instead probit regression or logistic regression is used. Further, sometimes there are three or more categories for the dependent variable: for example, no college, community college, and four-year college. In this case, the multinomial probit or multinomial logit technique is used.

DISJOINT SETS

In mathematics, two sets are said to be disjoint if they have no element in common. For example, {1, 2, 3} and {4, 5, 6} are disjoint sets.[1]

Explanation

Formally, two sets A and B are disjoint if their intersection is the empty set, i.e. if

A ∩ B = ∅.

This definition extends to any collection of sets. A collection of sets is pairwise disjoint or mutually disjoint if, given any two sets in the collection, those two sets are disjoint.

Formally, let I be an index set, and for each i in I, let Ai be a set. Then the family of sets {Ai : i ∈ I} is pairwise disjoint if for any i and j in I with i ≠ j,

Ai ∩ Aj = ∅.

For example, the collection of sets { {1}, {2}, {3}, ... } is pairwise disjoint. If {Ai} is a pairwise disjoint collection (containing at least two sets), then clearly its intersection is empty:

the intersection over all i ∈ I of Ai = ∅.

However, the converse is not true: the intersection of the collection {{1, 2}, {2, 3}, {3, 1}} is empty, but the collection is not pairwise disjoint. In fact, there are no two disjoint sets in this collection.

A partition of a set X is any collection of non-empty subsets {Ai : i ∈ I} of X such that {Ai} are pairwise disjoint and their union is all of X:

the union over all i ∈ I of Ai = X.
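The distinction between "pairwise disjoint" and "empty total intersection" is easy to demonstrate with Python's built-in sets (the helper name pairwise_disjoint is ours):

```python
from itertools import combinations

def pairwise_disjoint(sets):
    """True iff every pair of sets in the collection has empty intersection."""
    return all(a.isdisjoint(b) for a, b in combinations(sets, 2))

# A pairwise disjoint collection ...
singletons = [{1}, {2}, {3}]

# ... versus the counterexample from the text: the total intersection is
# empty, yet no two of the sets are disjoint.
overlapping = [{1, 2}, {2, 3}, {3, 1}]
empty_total_intersection = set.intersection(*overlapping) == set()
```

set.isdisjoint avoids building the intersection explicitly, but a.isdisjoint(b) is equivalent to a & b == set().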

INDEPENDENT AND IDENTICALLY DISTRIBUTED RANDOM VARIABLES

In probability theory and statistics, a sequence or other collection of random variables is independent and identically distributed (i.i.d.) if each random variable has the same probability distribution as the others and all are mutually independent.[1]

The abbreviation i.i.d. is particularly common in statistics (often as iid, sometimes written IID), where observations in a sample are

often assumed to be effectively i.i.d. for the purposes of statistical inference. The assumption (or requirement) that observations be

i.i.d. tends to simplify the underlying mathematics of many statistical methods (see mathematical statistics and statistical theory).

However, in practical applications of statistical modeling the assumption may or may not be realistic. The generalization of exchangeable random variables is often sufficient and more easily met.

The assumption is important in the classical form of the central limit theorem, which states that the probability distribution of the sum

(or average) of i.i.d. variables with finite variance approaches a normal distribution.

Note that IID refers to sequences of random variables. "Independent and identically distributed" implies an element in the sequence

is independent of the random variables that came before it. In this way, an IID sequence is different from a Markov sequence, where

the probability distribution for the nth random variable is a function of the previous random variable in the sequence (for a first order

Markov sequence). An IID sequence does not imply the probabilities for all elements of the sample space or event space must be

the same.[2] For example, repeated throws of loaded dice will produce a sequence that is IID, despite the outcomes being biased.


Examples

Uses in modeling

The following are examples or applications of independent and identically distributed (i.i.d.) random variables:

A sequence of outcomes of spins of a fair or unfair roulette wheel is i.i.d. One implication of this is that if the roulette ball lands

on "red", for example, 20 times in a row, the next spin is no more or less likely to be "black" than on any other spin (see

the Gambler's fallacy).

A sequence of fair or loaded dice rolls is i.i.d.

A sequence of fair or unfair coin flips is i.i.d.

In signal processing and image processing the notion of transformation to IID implies two specifications, the "ID" (ID = identically distributed) part and the "I" (I = independent) part:

(I) the signal spectrum must be flattened, i.e. transformed by filtering (such as deconvolution) to a white signal (one where all frequencies are equally present).

Uses in inference

One of the simplest statistical tests, the z-test, is used to test hypotheses about means of random variables. When using the z-

test, one assumes (requires) that all observations are i.i.d. in order to satisfy the conditions of the central limit theorem.

Generalizations

Many results that are initially stated for i.i.d. variables are true more generally.

Exchangeable random variables

Main article: Exchangeable random variables

The most general notion which shares the main properties of i.i.d. variables is exchangeable random variables, introduced by Bruno de Finetti. Exchangeability means that while variables may not be independent or identically distributed, future ones behave like past ones; formally, any value of a finite sequence is as likely as any permutation of those values, i.e. the joint probability distribution is invariant under the symmetric group.

This provides a useful generalization; for example, sampling without replacement is not independent, but is exchangeable, and is widely used in Bayesian statistics.

Lévy process

Main article: Lévy process

In stochastic calculus, i.i.d. variables are thought of as a discrete time Lévy process: each variable gives how much one changes from one time to another. For example, a sequence of Bernoulli trials is interpreted as the Bernoulli process. One may generalize this to include continuous time Lévy processes, and many Lévy processes can be seen as limits of i.i.d. variables; for instance, the Wiener process is the limit of the Bernoulli process.

LINEAR INDEPENDENCE

In linear algebra, a family of vectors is linearly independent if none of them can be written as a linear combination of finitely many other vectors in the family. A family of vectors which is not linearly independent is called linearly dependent. For instance, in the three-dimensional real vector space R³ we have the following example:

(1, 0, 0), (0, 1, 0), (0, 0, 1), (9, 5, 4).

Here the first three vectors are linearly independent; but the fourth vector equals 9 times the first plus 5 times the second plus 4 times the third, so the four vectors together are linearly dependent. Linear dependence is a property of the family, not of any particular vector; for example in this case we could just as well write the first vector as a linear combination of the last three.

In probability theory and statistics there is an unrelated measure of linear dependence between random variables.


Definition

A finite subset of n vectors, v1, v2, ..., vn, from the vector space V, is linearly dependent if and only if there exists a set of n scalars, a1, a2, ..., an, not all zero, such that

a1 v1 + a2 v2 + ... + an vn = 0.

Note that the zero on the right is the zero vector, not the number zero.

If such scalars do not exist, then the vectors are said to be linearly independent.

Alternatively, linear independence can be directly defined as follows: a set of vectors is linearly independent if and only if the only representations of the zero vector as linear combinations of its elements are trivial solutions, i.e., whenever a1, a2, ..., an are scalars such that

a1 v1 + a2 v2 + ... + an vn = 0,

then ai = 0 for all i = 1, 2, ..., n.

A set of vectors is then said to be linearly dependent if it is not linearly independent.

More generally, let V be a vector space over a field K, and let {vi : i ∈ I} be a family of elements of V. The family is linearly dependent over K if there exists a family {aj : j ∈ J} of elements of K, not all zero, such that

the sum over j ∈ J of aj vj = 0, where the index set J is a nonempty, finite subset of I.

A set X of elements of V is linearly independent if the corresponding family {x : x ∈ X} is linearly independent. Equivalently, a family is dependent if a member is in the linear span of the rest of the family, i.e., a member is a linear combination of the rest of the family.

A set of vectors which is linearly independent and spans some vector space forms a basis for that vector space. For example, the vector space of all polynomials in x over the reals has for a basis the (infinite) subset {1, x, x², ...}.

Geometric meaning

A geographic example may help to clarify the concept of linear independence. A person describing the

location of a certain place might say, "It is 5 miles north and 6 miles east of here." This is sufficient

information to describe the location, because the geographic coordinate system may be considered as a

2-dimensional vector space (ignoring altitude). The person might add, "The place is 7.81 miles northeast

of here." Although this last statement is true, it is not necessary.

In this example the "5 miles north" vector and the "6 miles east" vector are linearly independent. That is

to say, the north vector cannot be described in terms of the east vector, and vice versa. The third "7.81

miles northeast" vector is a linear combination of the other two vectors, and it makes the set of

vectors linearly dependent, that is, one of the three vectors is unnecessary.

Also note that if altitude is not ignored, it becomes necessary to add a third vector to the linearly

independent set. In general, n linearly independent vectors are required to describe any location in n-

dimensional space.

Example I

The vectors (1, 1) and (−3, 2) in R² are linearly independent.

Proof

Let λ1 and λ2 be two real numbers such that

λ1 (1, 1) + λ2 (−3, 2) = (0, 0).

Taking each coordinate gives λ1 − 3 λ2 = 0 and λ1 + 2 λ2 = 0; subtracting the first equation from the second yields 5 λ2 = 0, so λ2 = 0 and hence λ1 = 0.

Alternative method using determinants

An alternative method uses the fact that n vectors in Rⁿ are linearly dependent if and only if the determinant of the matrix formed by taking the vectors as its columns is zero.

In this case, the matrix formed by the vectors is the 2×2 matrix with columns (1, 1) and (−3, 2). We may write a linear combination of the columns as A Λ = 0, where Λ = (λ1, λ2)ᵀ; this has a nontrivial solution exactly when the determinant of A, which is

det A = (1)(2) − (1)(−3) = 5,

is zero. Since the determinant is non-zero, the vectors (1, 1) and (−3, 2) are linearly independent.
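The determinant test for two vectors in R² fits in a few lines. A Python sketch (the function names are ours; the example vectors are those used above, with the sign of −3 assumed from the determinant value 5):

```python
def det2(v, w):
    """Determinant of the 2x2 matrix whose columns are v and w."""
    return v[0] * w[1] - v[1] * w[0]

def independent2(v, w):
    """Two vectors in R^2 are linearly independent iff det != 0."""
    return det2(v, w) != 0
```

For (1, 1) and (−3, 2) the determinant is (1)(2) − (1)(−3) = 5, so the pair is independent; for a pair like (1, 2) and (2, 4), where one vector is a multiple of the other, the determinant vanishes.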

Otherwise, suppose we have m vectors of n coordinates, with m < n. Then A is an n×m matrix and Λ is a column vector with m entries, and we are again interested in A Λ = 0. As we saw previously, this is equivalent to a list of n equations. Consider the first m rows of A, the first m equations; any solution of the full list of equations must also be true of the reduced list. In fact, if i1, ..., im is any list of m rows, then the equation must be true for those rows.

One can check whether the m vectors are linearly dependent by testing whether the determinant of the m×m submatrix formed from rows i1, ..., im is zero for all possible lists of m rows. (In case m = n, this requires only one determinant, as above. If m > n, then it is a theorem that the vectors must be linearly dependent.) This fact is valuable for theory; in practical calculations more efficient methods are available.
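One of those more efficient methods in practice is a rank computation: vectors are linearly independent exactly when the matrix having them as columns has rank equal to the number of vectors. A Python/NumPy sketch (the wrapper name is ours):

```python
import numpy as np

def linearly_independent(vectors):
    """True iff the given vectors (as rows of numbers) are linearly
    independent, i.e. the column matrix they form has full column rank."""
    A = np.column_stack([np.asarray(v, dtype=float) for v in vectors])
    return np.linalg.matrix_rank(A) == A.shape[1]
```

This handles all three cases at once: m < n, m = n, and m > n (where the rank can be at most n < m, so the test always reports dependence, in line with the theorem cited above).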

Example II

Let V = Rⁿ and consider the standard basis vectors

e1 = (1, 0, 0, ..., 0), e2 = (0, 1, 0, ..., 0), ..., en = (0, 0, 0, ..., 1).

Then e1, e2, ..., en are linearly independent.

Proof

Suppose that a1, a2, ..., an are elements of R such that

a1 e1 + a2 e2 + ... + an en = 0.

Since

a1 e1 + a2 e2 + ... + an en = (a1, a2, ..., an),

then ai = 0 for all i in {1, ..., n}.

Example III

Let V be the vector space of all functions of a real variable t. Then the functions eᵗ and e²ᵗ in V are linearly independent.

Proof

Suppose a and b are two real numbers such that

a eᵗ + b e²ᵗ = 0

for all values of t. We need to show that a = 0 and b = 0. In order to do this, we divide through by eᵗ (which is never zero) and subtract to obtain

b eᵗ = −a.

In other words, the function b eᵗ must be independent of t, which only occurs when b = 0. It follows that a is also zero.

Example IV

The following vectors in R⁴ are linearly dependent.

Proof

We need to find scalars λ1, λ2 and λ3, not all zero, such that the corresponding linear combination of the three vectors equals the zero vector. Forming the simultaneous equations, one for each coordinate, we can solve (using, for example, Gaussian elimination) to obtain a nontrivial solution in which λ3 can be chosen arbitrarily. Since these are nontrivial results, the vectors are linearly dependent.

Projective space of linear dependences

A linear dependence among vectors v1, ..., vn is a tuple (a1, ..., an) with n scalar components, not all zero, such that

a1 v1 + ... + an vn = 0.

If such a linear dependence exists, then the n vectors are linearly dependent. It makes sense to identify two linear dependences if one arises as a non-zero multiple of the other, because in this case the two describe the same linear relationship among the vectors. Under this identification, the set of all linear dependences among v1, ..., vn is a projective space.

Linear dependence between random variables

The covariance is sometimes called a measure of "linear dependence" between two random variables. That does not mean the same thing as in the context of linear algebra. When the covariance is normalized, one obtains the correlation matrix. From it, one can obtain the Pearson coefficient, which gives us the goodness of the fit for the best possible linear function describing the relation between the variables. In this sense covariance is a linear gauge of dependence.

ORTHOGONALITY

Orthogonality comes from the Greek orthos, meaning "straight", and gonia, meaning "angle". It has somewhat different meanings depending on the context, but most involve the idea of perpendicular, non-overlapping, or uncorrelated.

In mathematics, two lines or curves are orthogonal if they are perpendicular at their point of intersection. Two vectors are orthogonal if and only if their dot product is zero.[1] In computer science, orthogonality has to do with the ability of a language, method, or object to vary without side effects.[2]

In integral calculus, it is common to use the following to define the inner product of two mathematical functions f and g:

⟨f, g⟩ = ∫ from a to b of f(x) g(x) w(x) dx.

Here we introduce a nonnegative weight function w(x) in the definition of this inner product. In simple cases, w(x) = 1, exactly.

We say that these functions are orthogonal if that inner product is zero:

⟨f, g⟩ = 0.

We write the norms with respect to this inner product and the weight function as

‖f‖ = √⟨f, f⟩.

The members of a sequence of functions {f1, f2, f3, ...} are orthogonal on the closed interval [a, b] if

⟨fi, fj⟩ = 0 for i ≠ j, and orthonormal if in addition ‖fi‖ = 1; compactly, ⟨fi, fj⟩ = δij,

where δij is the "Kronecker delta" function. In other words, any two of them are orthogonal, and the norm of each is 1 in the case of the orthonormal sequence. See in particular the orthogonal polynomials.

Examples

The vectors (1, 3, 2), (3, −1, 0), (1/3, 1, −5/3) are orthogonal to each other, since (1)(3) + (3)(−1) + (2)(0) = 0, (3)(1/3) + (−1)(1) + (0)(−5/3) = 0, and (1)(1/3) + (3)(1) + (2)(−5/3) = 0.

The vectors (1, 0, 1, 0, ...)ᵀ and (0, 1, 0, 1, ...)ᵀ are orthogonal to each other. The dot product of these vectors is 0. We can then make the generalization to consider such shifted patterns of vectors in Z₂ⁿ: for a suitable period and shift the supports never overlap, so the vectors are orthogonal, for example (1, 0, 0, 1, 0, 0, 1, 0)ᵀ and (0, 1, 0, 0, 1, 0, 0, 1)ᵀ.

Take two quadratic functions 2t + 3 and 5t² + t − 17/9. These functions are orthogonal with respect to a unit weight function on the interval from −1 to 1. The product of these two functions is 10t³ + 17t² − (7/9)t − 17/3, and its integral over [−1, 1] is zero: the odd powers contribute nothing, while ∫ 17t² dt = 34/3 exactly cancels ∫ 17/3 dt = 34/3.

The functions 1, sin(nx), cos(nx) for n = 1, 2, 3, ... are orthogonal with respect to Riemann integration on the intervals [0, 2π], [−π, π], or any other closed interval of length 2π. This fact is a central one in Fourier series.
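The polynomial example above can be verified exactly with rational arithmetic rather than numerical quadrature. A Python sketch (the helper names are ours):

```python
from fractions import Fraction as F

def poly_mul(p, q):
    """Multiply polynomials given as coefficient lists [c0, c1, c2, ...]."""
    out = [F(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def integrate(p, a, b):
    """Exact definite integral of the polynomial p over [a, b]."""
    return sum(c * (F(b) ** (k + 1) - F(a) ** (k + 1)) / (k + 1)
               for k, c in enumerate(p))

f = [F(3), F(2)]                  # 2t + 3
g = [F(-17, 9), F(1), F(5)]       # 5t^2 + t - 17/9
# Inner product with unit weight w(t) = 1 on [-1, 1]:
inner = integrate(poly_mul(f, g), -1, 1)
```

Because fractions.Fraction is exact, inner is literally zero, confirming the orthogonality claimed in the text rather than merely approximating it.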

In communications, multiple-access schemes are orthogonal when an ideal receiver can completely reject arbitrarily

strong unwanted signals from the desired signal using different basis functions. One such scheme is TDMA, where the

orthogonal basis functions are nonoverlapping rectangular pulses ("time slots").

Another scheme is orthogonal frequency-division multiplexing (OFDM), which refers to the use, by a single transmitter, of

a set of frequency multiplexed signals with the exact minimum frequency spacing needed to make them orthogonal so

that they do not interfere with each other. Well known examples include (a, g, and n) versions of 802.11 Wi-

Fi; WiMAX; ITU-T G.hn, DVB-T, the terrestrial digital TV broadcast system used in most of the world outside North

America; and DMT (Discrete Multi Tone), the standard form of ADSL.

In OFDM, the subcarrier frequencies are chosen so that the subcarriers are orthogonal to each other, meaning that

crosstalk between the subchannels is eliminated and intercarrier guard bands are not required. This greatly simplifies the

design of both the transmitter and the receiver. Unlike in conventional FDM, a separate filter for each subchannel is not

required.

CONDITIONAL INDEPENDENCE

In probability theory, two events R and B are conditionally independent given a third event Y precisely if the occurrence or non-occurrence of R and the occurrence or non-occurrence of B are independent events in their conditional probability distribution given Y. In other words, R and B are conditionally independent given Y if and only if, given knowledge that Y occurs, knowledge of whether R occurs provides no information on the likelihood of B occurring, and knowledge of whether B occurs provides no information on the likelihood of R occurring.

In the standard notation of probability theory, R and B are conditionally independent given Y if and only if

Pr(R ∩ B | Y) = Pr(R | Y) Pr(B | Y),

or equivalently,

Pr(R | B ∩ Y) = Pr(R | Y).

Two random variables X and Y are conditionally independent given a third random variable Z if and only if they are independent in their conditional probability distribution given Z. That is, X and Y are conditionally independent given Z if and only if, given any value of Z, the probability distribution of X is the same for all values of Y and the probability distribution of Y is the same for all values of X.

Two events R and B are conditionally independent given a σ-algebra Σ if

Pr(R ∩ B | Σ) = Pr(R | Σ) Pr(B | Σ) almost surely,

where Pr(A | Σ) denotes the conditional expectation of the indicator function of the event A, χA, given the sigma algebra Σ. That is,

Pr(A | Σ) = E[χA | Σ].

Two random variables X and Y are conditionally independent given a σ-algebra Σ if the above equation holds for all R in σ(X) and B in σ(Y).

Two random variables X and Y are conditionally independent given a random variable W if they are independent given σ(W): the σ-algebra generated by W. This is commonly written:

X ⊥⊥ Y | W

or

X ⊥ Y | W.

This is read "X is independent of Y, given W"; the conditioning applies to the whole statement: "(X is independent of Y) given W".

If W assumes a countable set of values, this is equivalent to the conditional independence of X and Y for the events of the form [W = w]. Conditional independence of more than two events, or of more than two random variables, is defined analogously.

The following two examples show that X ⊥ Y neither implies nor is implied by X ⊥ Y | W. First, suppose W is 0 with probability 0.5 and is the value 1 otherwise. When W = 0 take X and Y to be independent, each having the value 0 with probability 0.99 and the value 1 otherwise. When W = 1, X and Y are again independent, but this time they take the value 1 with probability 0.99. Then X ⊥ Y | W. But X and Y are dependent, because Pr(X = 0) < Pr(X = 0 | Y = 0). This is because Pr(X = 0) = 0.5, but if Y = 0 then it's very likely that W = 0 and thus that X = 0 as well, so Pr(X = 0 | Y = 0) > 0.5. For the second example, suppose X ⊥ Y, each taking the values 0 and 1 with probability 0.5. Let W be the product XY. Then when W = 0, Pr(X = 0) = 2/3, but Pr(X = 0 | Y = 0) = 1/2, so X ⊥ Y | W is false. This is also an example of explaining away. See Kevin Murphy's tutorial [2] where X and Y take the values "brainy" and "sporty".
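Both numerical claims in these examples can be checked by exact enumeration. The sketch below (plain Python with exact fractions; the variable names are illustrative) reproduces Pr(X = 0) = 1/2 and Pr(X = 0 | Y = 0) > 1/2 for the first example, and Pr(X = 0 | W = 0) = 2/3 versus Pr(X = 0 | Y = 0, W = 0) = 1/2 for the second.

```python
from fractions import Fraction as F

# First example: W is 0 or 1 with probability 1/2; given W, X and Y are
# independent and each equals W with probability 0.99.
joint = {}  # (w, x, y) -> probability
for w in (0, 1):
    for x in (0, 1):
        for y in (0, 1):
            px = F(99, 100) if x == w else F(1, 100)
            py = F(99, 100) if y == w else F(1, 100)
            joint[(w, x, y)] = F(1, 2) * px * py

def pr(pred):
    return sum(p for k, p in joint.items() if pred(*k))

p_x0 = pr(lambda w, x, y: x == 0)   # marginal Pr(X = 0)
p_x0_given_y0 = pr(lambda w, x, y: x == 0 and y == 0) / pr(lambda w, x, y: y == 0)
print(p_x0, p_x0_given_y0)          # 1/2 4901/5000 -- X and Y are dependent

# Second example: X, Y i.i.d. fair bits, W = X*Y; condition on W = 0.
w0 = {(x, y): F(1, 4) for x in (0, 1) for y in (0, 1) if x * y == 0}
z = sum(w0.values())
p_x0_w0 = sum(p for (x, y), p in w0.items() if x == 0) / z
p_x0_y0_w0 = w0[(0, 0)] / (w0[(0, 0)] + w0[(1, 0)])
print(p_x0_w0, p_x0_y0_w0)          # 2/3 1/2 -- conditional independence fails
```

Using `Fraction` keeps every probability exact, so the inequalities in the text are verified without any floating-point slack.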

In probability theory, two random variables being uncorrelated does not imply their independence. In some contexts,

uncorrelatedness implies at least pairwise independence (as when the random variables involved have Bernoulli distributions).

It is sometimes mistakenly thought that one context in which uncorrelatedness implies independence is when the random variables involved are normally distributed. However, this is incorrect if the variables are merely marginally normally distributed but not jointly normally distributed.

Suppose two random variables X and Y are jointly normally distributed. That is the same as saying that the random vector (X, Y)

has a multivariate normal distribution. It means that the joint probability distribution of X and Y is such that for any two constant (i.e.,

non-random) scalars a and b, the random variable aX + bY is normally distributed. In that case if X and Y are uncorrelated, i.e.,

their covariance cov(X, Y) is zero, then they are independent. [1] However, it is possible for two random variables X and Y to be so

distributed jointly that each one alone is marginally normally distributed, and they are uncorrelated, but they are not independent;

examples are given below.

Contents

1 Examples
2 References

Examples

Suppose X has a normal distribution with expected value 0 and variance 1. Let W = 1 or −1, each with probability 1/2, and assume W is independent of X. Let Y = WX. Then:

X and Y are uncorrelated;
both have the same normal distribution; and
X and Y are not independent.

Note that the distribution of X + Y concentrates positive probability at 0: Pr(X + Y = 0) = 1/2. To see that X and Y are uncorrelated, consider

cov(X, Y) = E[XY] − E[X]E[Y] = E[WX²] − 0 = E[W]E[X²] = 0 · 1 = 0,

using the independence of W and X. To see that Y has the same normal distribution as X, consider

Pr(Y ≤ x) = ½ Pr(X ≤ x) + ½ Pr(−X ≤ x) = Pr(X ≤ x),

by the symmetry of the normal distribution. To see that X and Y are not independent, observe that |Y| = |X| or that Pr(Y > 1 | X = 1/2) = 0.
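A quick Monte Carlo check of this construction (a sketch in plain Python; the seed and sample size are arbitrary choices) shows all three properties at once: the sample covariance is near 0, about half of the draws have X + Y exactly 0, and |Y| always equals |X|, so the variables cannot be independent.

```python
import random

random.seed(0)
n = 100_000
xs, ys = [], []
for _ in range(n):
    x = random.gauss(0.0, 1.0)
    w = random.choice((-1.0, 1.0))   # W = +/-1 with prob 1/2, independent of X
    xs.append(x)
    ys.append(w * x)                 # Y = WX

mx, my = sum(xs) / n, sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
frac_zero = sum(x + y == 0.0 for x, y in zip(xs, ys)) / n

print(abs(cov) < 0.05)                                 # True: uncorrelated
print(abs(frac_zero - 0.5) < 0.02)                     # True: atom at 0
print(all(abs(y) == abs(x) for x, y in zip(xs, ys)))   # True: not independent
```

The atom at 0 is exact in floating point because y = −x gives x + y = 0.0 whenever W = −1.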

An asymmetric example

Suppose X has a normal distribution with expected value 0 and variance 1. Let

Y = X if |X| > c, and Y = −X if |X| ≤ c,

where c is a positive number to be specified below. If c is very small, then the correlation corr(X, Y) is near 1; if c is very large, then corr(X, Y) is near −1. Since the correlation is a continuous function of c, the intermediate value theorem implies there is some particular value of c that makes the correlation 0. That value is approximately 1.54. In that case, X and Y are uncorrelated, but they are clearly not independent, since X completely determines Y.

To see that Y is normally distributed (indeed, that its distribution is the same as that of X), let us find its cumulative distribution function:

Pr(Y ≤ x) = Pr(|X| ≤ c and −X ≤ x) + Pr(|X| > c and X ≤ x)
          = Pr(|X| ≤ c and X ≤ x) + Pr(|X| > c and X ≤ x) = Pr(X ≤ x)

(this follows from the symmetry of the distribution of X and the symmetry of the condition |X| ≤ c).

Observe that the sum X + Y is nowhere near being normally distributed, since it has a substantial probability (about 0.88) of being equal to 0, whereas the normal distribution, being a continuous

distribution, has no discrete part, i.e., does not concentrate more than zero probability at any single point.

Consequently X and Y are not jointly normally distributed, even though they are separately normally

distributed.
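The value c ≈ 1.54 can be recovered numerically. Under the construction above, cov(X, Y) = E[X²; |X| > c] − E[X²; |X| ≤ c] = 1 − 2[(2Φ(c) − 1) − 2cφ(c)], using the closed form for the truncated second moment of a standard normal; the sketch below (plain Python, standard library only) finds the root by bisection.

```python
import math

def pdf(x):   # standard normal density
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def cdf(x):   # standard normal cumulative distribution function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def cov_xy(c):
    # E[X^2; |X| <= c] = (2*cdf(c) - 1) - 2*c*pdf(c)  (truncated second moment)
    inside = (2 * cdf(c) - 1) - 2 * c * pdf(c)
    return 1 - 2 * inside          # cov(X, Y) = 1 - 2*E[X^2; |X| <= c]

# cov_xy is monotone decreasing, positive at c = 1 and negative at c = 2,
# so bisection on [1, 2] converges to the unique root.
lo, hi = 1.0, 2.0
for _ in range(60):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if cov_xy(mid) > 0 else (lo, mid)

print(round((lo + hi) / 2, 2))     # 1.54, matching the value quoted above
```

Monotonicity follows because d/dc of the truncated moment is 2c²φ(c) > 0, so the bracket never loses the root.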

In probability theory, a set of events is jointly or collectively exhaustive if at least one of the events must occur. For example,

when rolling a six-sided die, the outcomes 1, 2, 3, 4, 5, and 6 are collectively exhaustive, because they encompass the entire range

of possible outcomes.

Another way to describe collectively exhaustive events is that their union must cover all the events within the entire sample space. For example, events A and B are said to be collectively exhaustive if

A ∪ B = S,

where S is the sample space.

Compare this to the concept of a set of outcomes which are mutually exclusive, which means that at most one of the events may occur. The set of all possible die rolls is both collectively exhaustive and mutually exclusive. The outcomes 1 and 6 are mutually exclusive but not collectively exhaustive. The outcomes "even" (2, 4, or 6) and "not-6" (1, 2, 3, 4, or 5) are collectively exhaustive but not mutually exclusive.
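The die-roll examples are mechanical enough to check with Python sets. In this sketch, events are subsets of the sample space; the helper functions `exhaustive` and `exclusive` are illustrative definitions, not standard library calls.

```python
space = {1, 2, 3, 4, 5, 6}       # sample space of one die roll

def exhaustive(*events):
    # collectively exhaustive: the union covers the whole sample space
    return set().union(*events) == space

def exclusive(*events):
    # mutually exclusive: pairwise disjoint, so at most one event can occur
    return all(a.isdisjoint(b) for i, a in enumerate(events)
               for b in events[i + 1:])

singletons = [{k} for k in space]
print(exhaustive(*singletons), exclusive(*singletons))      # True True
print(exhaustive({1}, {6}), exclusive({1}, {6}))            # False True
even, not_six = {2, 4, 6}, {1, 2, 3, 4, 5}
print(exhaustive(even, not_six), exclusive(even, not_six))  # True False
```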

One example of a collectively exhaustive and mutually exclusive set of events is tossing a coin: P(Head or Tail) = 1, so the outcomes are collectively exhaustive. When heads occurs, tails cannot occur, i.e. P(Head and Tail) = 0, so the outcomes are mutually exclusive as well.

The term "collectively exhaustive" is relatively new.[citation needed] This is due to changes in the meaning of "mutually exclusive" (Knuth[full citation needed]). Heads and tails are the classic example: they are "exclusive" because one flip cannot be both heads and tails, and "mutual" because one flip must be either heads or tails. Formal logic used to define events that are both mutual and exclusive as "mutually exclusive". When the meaning of "exclusive" was expanded into "mutually exclusive", a term was needed for what used to be called "mutual"; thus "collectively exhaustive" entered the literature. Events that used to be referred to as "mutually exclusive" are now referred to as "mutually exclusive and collectively exhaustive".[citation needed]

Estimator

In statistics, an estimator is a rule for calculating an estimate of a given quantity based on observed data: thus the rule (the estimator), the quantity of interest (the estimand) and its result (the estimate) are distinguished.

There are point and interval estimators. Point estimators yield single-valued results, although this includes the possibility of single vector-valued results and results that can be expressed as a single function. This is in contrast to interval estimators, where the result is a range of plausible values.

Statistical theory is concerned with the properties of estimators; that is, with defining properties that can be used to compare different estimators (different rules for creating estimates) for the same quantity, based on the same data. Such properties can be used to determine the best rules to use under given circumstances. However, in robust statistics, statistical theory goes on to consider the balance between having good properties, if tightly defined assumptions hold, and having less good properties that hold under wider conditions.

Contents

1 Background
2 Definition
3 Quantified properties
4 Behavioural properties
5 See also
6 References
7 External links

Background

An "estimator" or "point estimate" is a statistic (that is, a function of the data) that is used to infer the value of an

unknown parameter in a statistical model. The parameter being estimated is sometimes called the estimand.[citation needed] It can be

either finite-dimensional (in parametric and semi-parametric models), or infinite-dimensional (semi-nonparametric and non-parametric models).[citation needed] If the parameter is denoted θ then the estimator is typically written by adding a "hat" over the symbol: θ̂. Being a function of the data, the estimator is itself a random variable; a particular realization of this random variable is

called the "estimate". Sometimes the words "estimator" and "estimate" are used interchangeably.

The definition places virtually no restrictions on which functions of the data can be called the "estimators". The attractiveness of

different estimators can be judged by looking at their properties, such as unbiasedness, mean square error, consistency, asymptotic distribution, etc. The construction and comparison of estimators are the subjects of estimation theory. In the context of decision

theory, an estimator is a type of decision rule, and its performance may be evaluated through the use of loss functions.

When the word "estimator" is used without a qualifier, it usually refers to point estimation. The estimate in this case is a single point

in the parameter space. Other types of estimators also exist: interval estimators, where the estimates are subsets of the parameter

space.

The problem of density estimation arises in two applications. Firstly, in estimating the probability density functions of random

variables and secondly in estimating the spectral density function of a time series. In these problems the estimates are functions that

can be thought of as point estimates in an infinite dimensional space, and there are corresponding interval estimation problems.

Definition

Suppose there is a fixed parameter θ that needs to be estimated. Then an "estimator" is a function that maps the sample space to a set of sample estimates. An estimator of θ is usually denoted by the symbol θ̂. It is often convenient to express the theory using the algebra of random variables: thus if X is used to denote a random variable corresponding to the observed data, the estimator (itself treated as a random variable) is symbolised as a function of that random variable, θ̂(X). The estimate for a particular observed dataset (i.e. for X = x) is then θ̂(x), which is a fixed value. Often an abbreviated notation is used in which θ̂ is interpreted directly as a random variable.
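As a concrete illustration of this notation (a minimal sketch, with the sample mean standing in for θ̂): the estimator is the rule, i.e. a function of the data, while the estimate is the fixed number obtained by applying it to one observed dataset.

```python
def theta_hat(sample):
    # the estimator: a rule mapping an observed dataset to a number
    return sum(sample) / len(sample)   # here, the sample mean

x = [2.0, 4.0, 9.0]        # one particular observed dataset (X = x)
estimate = theta_hat(x)    # the estimate for this dataset: a fixed value
print(estimate)            # 5.0
```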

Quantified properties

Error

For a given sample x, the error of the estimator θ̂ is defined as

e(x) = θ̂(x) − θ,

where θ is the parameter being estimated. Note that the error, e, depends not only on the estimator (the estimation formula or procedure), but also on the sample.

Mean squared error

The mean squared error of θ̂ is defined as the expected value (probability-weighted average, over all samples) of the squared errors; that is,

MSE(θ̂) = E[(θ̂(X) − θ)²].

It is used to indicate how far, on average, the collection of estimates are from the single parameter being estimated.

Consider the following analogy. Suppose the parameter is the bull's-eye of a target, the estimator is the process of

shooting arrows at the target, and the individual arrows are estimates (samples). Then high MSE means the average

distance of the arrows from the bull's-eye is high, and low MSE means the average distance from the bull's-eye is low.

The arrows may or may not be clustered. For example, even if all arrows hit the same point, yet grossly miss the target, the MSE is still relatively large. Note, however, that if the MSE is relatively low, then the arrows are likely more highly clustered (than highly dispersed) around the target.
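The "average squared distance of the arrows from the bull's-eye" reading of MSE is easy to simulate. This sketch (seed, sample size, and trial count are arbitrary choices) approximates the MSE of the sample mean of n = 25 normal observations, which should be close to σ²/n = 4/25 = 0.16.

```python
import random

random.seed(1)
theta, sigma, n = 5.0, 2.0, 25   # true mean, noise scale, sample size

sq_errors = []
for _ in range(4000):            # each trial fires one "arrow"
    sample = [random.gauss(theta, sigma) for _ in range(n)]
    estimate = sum(sample) / n   # one estimate of theta
    sq_errors.append((estimate - theta) ** 2)

mse = sum(sq_errors) / len(sq_errors)
print(mse)                       # close to sigma**2 / n = 0.16
```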

Sampling deviation

For a given sample x, the sampling deviation of the estimator θ̂ is defined as

d(x) = θ̂(x) − E[θ̂(X)],

where E[θ̂(X)] is the expected value of the estimator. Note that the sampling deviation, d, depends not only on the estimator, but also on the sample.

Variance

The variance of θ̂ is simply the expected value of the squared sampling deviations; that is, var(θ̂) = E[(θ̂ − E[θ̂])²]. It is used to indicate how far, on average, the collection of estimates are from the expected value of the estimates. Note the difference between MSE and variance. If the parameter is

the bull's-eye of a target, and the arrows are estimates, then a relatively high variance means the arrows are

dispersed, and a relatively low variance means the arrows are clustered. Some things to note: even if the variance

is low, the cluster of arrows may still be far off-target, and even if the variance is high, the diffuse collection of

arrows may still be unbiased. Finally, note that even if all arrows grossly miss the target, if they nevertheless all hit the same point, the variance is zero.

Bias

The bias of θ̂ is defined as B(θ̂) = E[θ̂] − θ. It is the distance between the average of the collection of estimates, and the single parameter being estimated. It is also the expected value of the error, since E[θ̂] − θ = E[θ̂ − θ]. If the parameter is the bull's-eye of a target, and the arrows are

estimates, then a relatively high absolute value for the bias means the average position of the arrows is off-target,

and a relatively low absolute bias means the average position of the arrows is on target. They may be dispersed, or

may be clustered. The relationship between bias and variance is analogous to the relationship between accuracy

and precision.

Unbiased

The estimator θ̂ is an unbiased estimator of θ if and only if B(θ̂) = 0, that is, E[θ̂] = θ. Note that bias is a property of the

estimator, not of the estimate. Often, people refer to a "biased estimate" or an "unbiased estimate," but they really

are talking about an "estimate from a biased estimator," or an "estimate from an unbiased estimator." Also, people

often confuse the "error" of a single estimate with the "bias" of an estimator. Just because the error for one estimate

is large, does not mean the estimator is biased. In fact, even if all estimates have astronomical absolute values for

their errors, if the expected value of the error is zero, the estimator is unbiased. Also, just because an estimator is

biased, does not preclude the error of an estimate from being zero (we may have gotten lucky). The ideal situation,

of course, is to have an unbiased estimator with low variance, and also try to limit the number of samples where the

error is extreme (that is, have few outliers). Yet unbiasedness is not essential. Often, if just a little bias is permitted,

then an estimator can be found with lower MSE and/or fewer outlier sample estimates.
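The trade-off in the last sentence can be seen with variance estimation from normal data: dividing the sum of squares by n + 1 instead of the unbiased n − 1 introduces bias but, for normal samples, yields a lower MSE. This is a Monte Carlo sketch (seed, sample size, and trial count are arbitrary), not a proof.

```python
import random

random.seed(2)
sigma2, n, trials = 4.0, 10, 6000   # true variance, sample size, repetitions

se_unbiased, se_biased = [], []
for _ in range(trials):
    xs = [random.gauss(0.0, 2.0) for _ in range(n)]
    m = sum(xs) / n
    ss = sum((x - m) ** 2 for x in xs)
    se_unbiased.append((ss / (n - 1) - sigma2) ** 2)  # unbiased estimator
    se_biased.append((ss / (n + 1) - sigma2) ** 2)    # biased, smaller variance

mse_unbiased = sum(se_unbiased) / trials
mse_biased = sum(se_biased) / trials
print(mse_biased < mse_unbiased)    # True: the biased estimator wins on MSE
```

Using the same samples for both estimators (common random numbers) makes the comparison sharp at this trial count.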

An alternative to the version of "unbiased" above, is "median-unbiased", where the median of the distribution of

estimates agrees with the true value; thus, in the long run half the estimates will be too low and half too high. While

this applies immediately only to scalar-valued estimators, it can be extended to any measure of central tendency of the distribution of estimates.

Relationships

The MSE, variance, and bias are related: MSE(θ̂) = var(θ̂) + (B(θ̂))², i.e. mean squared error = variance + square of bias. In particular, for an unbiased estimator, the variance equals the MSE.
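The identity MSE = variance + bias² can be verified exactly on a toy discrete distribution for θ̂ (the values and probabilities below are arbitrary illustrations), using exact rational arithmetic so the equality is not clouded by floating point.

```python
from fractions import Fraction as F

theta = F(6)                                          # true parameter
dist = {F(4): F(1, 4), F(5): F(1, 2), F(8): F(1, 4)}  # distribution of theta_hat

mean = sum(v * p for v, p in dist.items())                # E[theta_hat] = 11/2
bias = mean - theta                                       # -1/2
var = sum((v - mean) ** 2 * p for v, p in dist.items())   # 9/4
mse = sum((v - theta) ** 2 * p for v, p in dist.items())  # 5/2

print(mse == var + bias ** 2)                             # True, exactly
```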

The standard deviation of an estimator of θ (the square root of the variance), or an estimate of the standard deviation of an estimator of θ, is called the standard error of θ̂.

Behavioural properties

Consistency

A consistent sequence of estimators is a sequence of estimators that converge in probability to the quantity being

estimated as the index (usually the sample size) grows without bound. In other words, increasing the sample size

increases the probability of the estimator being close to the population parameter.

Mathematically, a sequence of estimators {tn; n ≥ 0} is a consistent estimator for parameter θ if and only if, for all ε > 0, no matter how small, we have

lim(n→∞) Pr(|tn − θ| < ε) = 1.

The consistency defined above may be called weak consistency. The sequence is strongly consistent if the convergence holds almost surely, i.e. Pr(lim(n→∞) tn = θ) = 1.

An estimator that converges to a multiple of a parameter can be made into a consistent estimator by

multiplying the estimator by a scale factor, namely the true value divided by the asymptotic value of the

estimator. This occurs frequently in estimation of scale parameters by measures of statistical dispersion.
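A standard instance of this rescaling remark: for centred normal data, the mean absolute deviation converges to σ√(2/π), not σ, so multiplying by √(π/2) turns it into a consistent estimator of σ. A Monte Carlo sketch (the seed, σ, and sample size are arbitrary choices):

```python
import math, random

random.seed(3)
sigma = 2.0

def rescaled_mad(n):
    xs = [random.gauss(0.0, sigma) for _ in range(n)]
    mad = sum(abs(x) for x in xs) / n     # converges to sigma*sqrt(2/pi)
    return mad * math.sqrt(math.pi / 2)   # rescaled: consistent for sigma

est = rescaled_mad(200_000)
print(abs(est - sigma) < 0.05)            # True for large n
```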

Asymptotic normality

An asymptotically normal estimator is a consistent estimator whose distribution around the true parameter θ approaches a normal distribution, with standard deviation shrinking in proportion to 1/√n as the sample size n grows.

The central limit theorem implies asymptotic normality of the sample mean as an estimator of the true

mean. More generally, maximum likelihood estimators are asymptotically normal under fairly weak

regularity conditions (see the asymptotics section of the maximum likelihood article). However, not all estimators are asymptotically normal; the simplest examples are found when the true value of a parameter lies on the boundary of the allowable parameter region.

Efficiency

Two naturally desirable properties of estimators are for them to be unbiased and have minimal mean

squared error (MSE). These cannot in general both be satisfied simultaneously: a biased estimator may

have lower mean squared error (MSE) than any unbiased estimator: despite having bias, the estimator's variance may be sufficiently smaller than that of any unbiased estimator that it is preferable to use.

Among unbiased estimators, there often exists one with the lowest variance, called the minimum

variance unbiased estimator (MVUE). In some cases an unbiased efficient estimator exists, which, in addition to having the lowest variance among unbiased estimators, attains the Cramér–Rao bound, an absolute lower bound on variance for statistics of a variable.

Concerning such "best unbiased estimators", see also the Cramér–Rao bound, the Gauss–Markov theorem, the Lehmann–Scheffé theorem, and the Rao–Blackwell theorem.
