
Stat 340 Course Notes Spring 2010

Generously Funded by MEF

Contributions:

Riley Metzger Michaelangelo Finistauri

Special Thanks:

Without the following people and groups, these course notes would never have been completed: MEF, Don McLeish.

Table of Contents

Chapter 1: Probability
Exercises for Chapter 1
Solutions

Chapter 2: Statistics
Exercises for Chapter 2
Solutions

Chapter 3: Validation
Exercises for Chapter 3
Solutions

Chapter 4: Queuing Systems
Exercises for Chapter 4
Solutions

Chapter 5: Generating Random Variables
Exercises for Chapter 5
Solutions

Chapter 6: Variance Reduction
Exercises for Chapter 6
Solutions

Statistical Tables

Chapter 1

Probability

Three approaches to defining probability are:

1. The classical definition: Let the sample space (denoted by $S$) be the set of all possible distinct outcomes of an experiment. The probability of some event is

$$\frac{\text{number of ways the event can occur}}{\text{number of outcomes in } S},$$

provided all points in $S$ are equally likely. For example, when a die is rolled, the probability of getting a 2 is $\frac{1}{6}$ because one of the six faces is a 2.

2. The relative frequency definition: The probability of an event is the proportion (or fraction) of times the event occurs over a very long (theoretically infinite) series of repetitions of an experiment or process. For example, this definition could be used to argue that the probability of getting a 2 from a rolled die is $\frac{1}{6}$. On the other hand, if we roll the die 100 times but get a 2 on 30 of them, we may suspect that the probability of getting a 2 is roughly $\frac{1}{3}$ rather than $\frac{1}{6}$.

3. The subjective probability definition: The probability of an event is a measure of how sure the person making the statement is that the event will happen. For example, after considering all available data, a weather forecaster might say that the probability of rain today is 30% or 0.3.

Unfortunately, all three of these definitions have serious limitations.
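The relative frequency definition can be illustrated with a short simulation. The following is a minimal sketch, assuming a fair die simulated with Python's standard `random` module; the function name `relative_frequency` and the fixed seed are illustrative choices, not part of the notes.

```python
import random

def relative_frequency(target, n_rolls, seed=0):
    """Fraction of n_rolls simulated fair-die rolls that show `target`."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) == target)
    return hits / n_rolls

# The proportion of 2s should settle near 1/6 as the number of rolls grows.
for n in (100, 10_000, 100_000):
    print(n, relative_frequency(2, n))
```

With only 100 rolls the proportion can be noticeably off from 1/6, which is one practical limitation of the relative frequency definition: the "very long series" is never actually infinite.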


1.1 Sample Spaces and Probability

Consider a phenomenon or process which is repeatable, at least in theory, and suppose that certain events (outcomes) $A_1, A_2, A_3, \ldots$ are defined. We will often refer to this phenomenon or process as an "experiment," and refer to a single repetition of the experiment as a "trial." Then, the probability of an event $A$, denoted by $P(A)$, is a number between 0 and 1.

Definition 1. A sample space $S$ is a set of distinct outcomes for an experiment or process, with the property that in a single trial, one and only one of these outcomes occurs. The outcomes that make up the sample space are called sample points.

Definition 2. Let $S = \{A_1, A_2, A_3, \ldots\}$ be a discrete sample space. Then probabilities $P(A_i)$ are numbers attached to the $A_i$'s ($i = 1, 2, 3, \ldots$) such that the following two conditions hold:

(1) $0 \le P(A_i) \le 1$

(2) $\sum_i P(A_i) = 1$

The set of values $\{P(A_i),\ i = 1, 2, \ldots\}$ is called a probability distribution on $S$.

Definition 3. An event in a discrete sample space is a subset $A \subseteq S$. If the event contains only one point, e.g. $A = \{A_1\}$, we call it a simple event. An event $A$ made up of two or more simple events, such as $A = \{A_1, A_2\}$, is called a compound event.

Definition 4. The probability $P(A)$ of an event $A$ is the sum of the probabilities of all the simple events that make up $A$.

Example: Suppose a 6-sided die is rolled, and let the sample space be $S = \{1, 2, 3, 4, 5, 6\}$, where 1 means the number 1 occurs, and so on. If the die is an ordinary one, we would find it useful to define probabilities as

$$P(i) = 1/6 \quad \text{for } i = 1, 2, 3, 4, 5, 6,$$

because if the die were tossed repeatedly (as in some games or gambling situations), then each number would occur close to $1/6$ of the time. However, if the die were weighted so that one face was favoured over the others, these numerical values would not be so useful. Note that if we wish to consider a compound event, the probability is easily obtained. For example, if $A$ = "even number," then because $A = \{2, 4, 6\}$ we get $P(A) = P(2) + P(4) + P(6) = 1/2$.
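Definition 4 translates directly into code. This is a small sketch, assuming the fair-die distribution from the example and using exact fractions; the helper name `prob` is hypothetical.

```python
from fractions import Fraction

# Probability distribution on S = {1, ..., 6} for an ordinary die: P(i) = 1/6.
P = {face: Fraction(1, 6) for face in range(1, 7)}

def prob(event, dist):
    """Sum the probabilities of the sample points that make up `event`."""
    return sum(dist[point] for point in event)

even = {2, 4, 6}          # the compound event "even number"
print(prob(even, P))      # 1/2
```

A weighted die would only change the values stored in `P`; the summation in `prob` stays the same.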


1.2 Conditional Probability

It is often important to know how the outcome of one event will affect the outcome of another. Consider flipping a coin: if we know that the first flip is a head, how does this affect the outcome of a second flip? Logically this should have no effect. Now, suppose we are interested in the values of two rolled dice (assuming each die is a standard six-faced die, with faces numbered 1 through 6, hereafter referred to as a D6). If the total sum of the dice is 5 and we know the value of one of the dice, then how does this affect the outcome of the second die? This is the basic concept behind Conditional Probability.

1.2.1 Definition

Given two events $A$ and $B$ with $P(B) > 0$, and given that we know $B$ occurs, the Conditional Probability of $A \mid B$ (said "A given B") is defined as:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

Recall that $A \cap B$ denotes the intersection of events $A$ and $B$ (hereafter, this will be shortened to $AB$). Note: If $A$ and $B$ are independent, $P(AB) = P(A)P(B)$, so

$$P(A \mid B) = \frac{P(A)P(B)}{P(B)} = P(A).$$

1.2.2 Example

Consider the first example: the coin. Assume that the coin is fair, which is to say that the probability of either outcome is exactly $\frac{1}{2}$. Event $A$ is the event that the first toss will land a head, and event $B$ is the event that the second toss will land a tail. Then the conditional probability of $B \mid A$ is:

$$P(B \mid A) = \frac{P(B \cap A)}{P(A)} = \frac{1/4}{1/2} = \frac{1}{2}$$

Notice that the conditional probability of $B \mid A$ is the same as the probability of $B$. This is not always the case.
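Both situations from this section (the two coin flips and the two D6 with sum 5) can be checked by enumerating equally likely outcomes. The sketch below is illustrative only; the helper names `prob` and `conditional` are assumptions, not notation from the notes.

```python
from fractions import Fraction
from itertools import product

def prob(outcomes, event):
    """P(event) when every outcome in `outcomes` is equally likely."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))

def conditional(outcomes, a, b):
    """P(a | b) = P(a and b) / P(b); assumes P(b) > 0."""
    return prob(outcomes, lambda o: a(o) and b(o)) / prob(outcomes, b)

# Two fair coin flips: second is a tail, given the first is a head.
coins = list(product("HT", repeat=2))
p_tail2_given_head1 = conditional(coins, lambda o: o[1] == "T", lambda o: o[0] == "H")

# Two D6: first die shows 1, given that the sum is 5.
dice = list(product(range(1, 7), repeat=2))
p_first1 = prob(dice, lambda o: o[0] == 1)
p_first1_given_sum5 = conditional(dice, lambda o: o[0] == 1, lambda o: sum(o) == 5)

print(p_tail2_given_head1, p_first1, p_first1_given_sum5)
```

For the coins the conditional probability equals the unconditional one, while for the dice knowing the sum changes the probability from 1/6 to 1/4.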


1.3 Random Variables

There is a far more intuitive and useful representation of probabilistic events: Random Variables (r.v.). A random variable can be thought of as an unknown numeric value which is determined by chance.

Definition 5. A random variable is a function that assigns a real number to each point in a sample space $S$.

Example 1. From the previous examples we can let $X$ be the random variable representing the outcome of a coin toss. We could assign $X$ the value 1 if the coin turns out to be a head, and 0 otherwise. We can even have a sequence of $n$ random variables for a series of $n$ coin tosses. In such a case, $X$ might be the number of heads in the $n$ coin tosses.

Random variables come in three main types: Discrete, Continuous and Mixed (a combination of Discrete and Continuous). Which of these categories a random variable falls into depends on the set of values it can take (its support). Both the coin and dice examples have a discrete support. Time, height or temperature are examples where the support may indicate a continuous random variable.

1.4 Discrete Random Variables

A discrete random variable is one whose sample space is finite or countably infinite. Common sample spaces are (proper or improper) subsets of the integers.

Definition 6. The probability function (p.f.) of a random variable $X$ is the function

$$f(x) = P(X = x), \quad \text{defined for all } x \in A,$$

where $A$ is the sample space of the random variable.

The set of pairs $\{(x, f(x)) : x \in A\}$ is called the probability distribution of $X$. All probability functions must have two properties:

1. $f(x) \ge 0$ for all values of $x$ (i.e. for $x \in A$)

2. $\sum_{\text{all } x \in A} f(x) = 1$

By implication, these properties ensure that $f(x) \le 1$ for all $x$. We consider a few "toy" examples before dealing with more complicated problems.


1.5 Expectation, Averages, Variability

Definition 7. The expected value (also called the mean or the expectation) of a discrete random variable $X$ with probability function $f(x)$ is

$$E(X) = \sum_{\text{all } x} x f(x).$$

The expected value of $X$ is also often denoted by the Greek letter $\mu$. The expected value of $X$ can be thought of as the average of the $X$-values that would occur in an infinite series of repetitions of the process where $X$ is defined.

Theorem 1. Suppose the random variable $X$ has probability function $f(x)$. Then, the expected value of some function $g(X)$ of $X$ is given by

$$E[g(X)] = \sum_{\text{all } x} g(x) f(x)$$

Notes:

(1) You can interpret $E[g(X)]$ as the average value of $g(X)$ in an infinite series of repetitions of the process where $X$ is defined.

(2) $E[g(X)]$ is also known as the expected value of $g(X)$. This name is somewhat misleading since the average value of $g(X)$ may be a value which $g(X)$ never takes, hence unexpected!

(3) The case where $g(X) = X$ reduces to our earlier definition of $E(X)$.

Properties of Expectation:

If your linear algebra is good, it may help if you think of $E$ as being a linear operator. Otherwise, you'll have to remember these and subsequent properties.

1. For constants $a$ and $b$,

$$E[a g(X) + b] = a E[g(X)] + b$$

Proof:

$$E[a g(X) + b] = \sum_{\text{all } x} [a g(x) + b] f(x)$$

$$= \sum_{\text{all } x} [a g(x) f(x) + b f(x)]$$

$$= a \sum_{\text{all } x} g(x) f(x) + b \sum_{\text{all } x} f(x)$$

$$= a E[g(X)] + b \quad \text{since } \sum_{\text{all } x} f(x) = 1$$

2. For constants $a$ and $b$ and functions $g_1$ and $g_2$, it is also easy to show

$$E[a g_1(X) + b g_2(X)] = a E[g_1(X)] + b E[g_2(X)]$$

Variability:

While an average is a useful summary of a set of observations, or of a probability distribution, it omits another important piece of information, namely the amount of variability. For example, it would be possible for car doors to be the right width, on average, and still have no doors fit properly. In the case of fitting car doors, we would also want the door widths to all be close to this correct average. We give a way of measuring the amount of variability next. You might think we could use the average difference between $X$ and $\mu$ to indicate the amount of variation. In terms of expectation, this would be $E(X - \mu)$. However, $E(X - \mu) = E(X) - \mu$ (since $\mu$ is a constant) $= 0$. We soon realize that to measure variability we need a function that has the same sign for $X > \mu$ and for $X < \mu$. We now define

Definition 8. The variance of a r.v. $X$ is $E\left[(X - \mu)^2\right]$, and is denoted by $\sigma^2$ or by $\mathrm{Var}(X)$.

In words, the variance is the average squared distance from the mean. This turns out to be a very useful measure of the variability of $X$. Example: Let $X$ be the number of heads when a fair coin is tossed 4 times. Then $X \sim \text{Binomial}\left(4, \frac{1}{2}\right)$, and so $\mu = np = (4)\left(\frac{1}{2}\right) = 2$. Without doing any calculations, we know $\sigma^2 \le 4$ because $X$ is always between 0 and 4. Hence it can never be further away from $\mu$ than 2. This makes the average squared distance from $\mu$ at most 4. The values of $f(x)$ are

x       0       1       2       3       4
f(x)    1/16    4/16    6/16    4/16    1/16

since $f(x) = \binom{4}{x}\left(\frac{1}{2}\right)^x\left(\frac{1}{2}\right)^{4-x} = \binom{4}{x}\left(\frac{1}{2}\right)^4$.

The value of $\mathrm{Var}(X)$ (i.e. $\sigma^2$) is easily found here:

$$\sigma^2 = E\left[(X - \mu)^2\right] = \sum_{x=0}^{4} (x - \mu)^2 f(x)$$

$$= (0-2)^2\frac{1}{16} + (1-2)^2\frac{4}{16} + (2-2)^2\frac{6}{16} + (3-2)^2\frac{4}{16} + (4-2)^2\frac{1}{16}$$

$$= 1$$
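The same calculation can be reproduced exactly in a few lines. This is a sketch under the assumptions of the example above (4 tosses of a fair coin); the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

n, p = 4, Fraction(1, 2)

# Probability function of the number of heads in 4 tosses of a fair coin.
f = {x: comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)}

mean = sum(x * fx for x, fx in f.items())               # E(X)
var = sum((x - mean) ** 2 * fx for x, fx in f.items())  # E[(X - mean)^2]

print(mean, var)  # 2 1
```

Using `Fraction` keeps every probability exact, so the result matches the hand calculation without rounding error.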

Definition 9. The standard deviation of a random variable $X$ is $\sigma = \sqrt{E\left[(X - \mu)^2\right]}$.

Both variance and standard deviation are commonly used to measure variability.

The basic definition of variance is often awkward to use for mathematical calculation of $\sigma^2$, whereas the following two results are often useful:

(1) $\sigma^2 = E(X^2) - \mu^2$

(2) $\sigma^2 = E[X(X - 1)] + \mu - \mu^2$

Properties of Mean and Variance: If $a$ and $b$ are constants and $Y = aX + b$, then

$$\mu_Y = a\mu_X + b \quad \text{and} \quad \sigma_Y^2 = a^2 \sigma_X^2$$

(where $\mu_X$ and $\sigma_X^2$ are the mean and variance of $X$, and $\mu_Y$ and $\sigma_Y^2$ are the mean and variance of $Y$). The proof of this is left to the reader as an exercise.

1.6 Moment Generating Functions

We have now seen two functions which characterize a distribution, the probability function and the cumulative distribution function. There is a third type of function, the moment generating function, which uniquely determines a distribution. The moment generating function is closely related to other transforms used in mathematics: the Laplace and Fourier transforms.

Definition 10. Consider a discrete random variable $X$ with probability function $f(x)$. The moment generating function (m.g.f.) of $X$ is defined as

$$M(t) = E(e^{tX}) = \sum_x e^{tx} f(x).$$

We will assume that the moment generating function is defined and finite for values of $t$ in an interval around 0 (i.e. for some $a > 0$, $\sum_x e^{tx} f(x) < \infty$ for all $t \in [-a, a]$).

The moments of a random variable $X$ are the expectations of the functions $X^r$ for $r = 1, 2, \ldots$. The expected value $E(X^r)$ is called the $r$th moment of $X$. The mean $\mu = E(X)$ is therefore the first moment, $E(X^2)$ is the second, and so on. It is often easy to find the moments of a probability distribution mathematically by using the moment generating function. This often gives easier derivations of means and variances than the direct summation methods in the preceding section. The following theorem gives a useful property of m.g.f.'s.

Theorem 2. Let the random variable $X$ have m.g.f. $M(t)$. Then

$$E(X^r) = M^{(r)}(0), \quad r = 1, 2, \ldots$$

where $M^{(r)}(0)$ stands for $d^r M(t)/dt^r$ evaluated at $t = 0$.

Proof: $M(t) = \sum_x e^{tx} f(x)$ and if the sum converges, then

$$M^{(r)}(t) = \frac{d^r}{dt^r} \sum_x e^{tx} f(x) = \sum_x \frac{d^r}{dt^r}\left(e^{tx}\right) f(x) = \sum_x x^r e^{tx} f(x)$$

Therefore $M^{(r)}(0) = \sum_x x^r f(x) = E(X^r)$, as stated. This sometimes gives a simple way to find the moments for a distribution.

Example 1. Suppose $X$ has a Binomial$(n, p)$ distribution. Then its moment generating function is

$$M(t) = \sum_{x=0}^{n} e^{tx} \binom{n}{x} p^x (1-p)^{n-x} = \sum_{x=0}^{n} \binom{n}{x} (pe^t)^x (1-p)^{n-x} = (pe^t + 1 - p)^n$$

Therefore

$$M'(t) = npe^t (pe^t + 1 - p)^{n-1}$$

$$M''(t) = npe^t (pe^t + 1 - p)^{n-1} + n(n-1)p^2 e^{2t} (pe^t + 1 - p)^{n-2}$$

and so

$$E[X] = M'(0) = np,$$

$$E[X^2] = M''(0) = np + n(n-1)p^2$$

$$\mathrm{Var}(X) = E(X^2) - E(X)^2 = np(1-p)$$
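As a numerical sanity check of Theorem 2 for the binomial case, the derivatives of $M(t)$ at $t = 0$ can be approximated by central finite differences. This is only a sketch: the step size `h`, the parameter values, and the function names are assumptions chosen for illustration, and finite differences are a stand-in for the exact differentiation done above.

```python
from math import exp

def mgf_binomial(t, n, p):
    """Closed-form m.g.f. of Binomial(n, p): (p e^t + 1 - p)^n."""
    return (p * exp(t) + 1 - p) ** n

def derivative_at(func, t, order, h=1e-3):
    """Central finite-difference approximation of the 1st or 2nd derivative."""
    if order == 1:
        return (func(t + h) - func(t - h)) / (2 * h)
    if order == 2:
        return (func(t + h) - 2 * func(t) + func(t - h)) / h**2
    raise ValueError("only orders 1 and 2 are supported in this sketch")

n, p = 10, 0.3
M = lambda t: mgf_binomial(t, n, p)

first_moment = derivative_at(M, 0.0, 1)    # approximates E[X] = np
second_moment = derivative_at(M, 0.0, 2)   # approximates E[X^2]
variance = second_moment - first_moment**2  # approximates np(1 - p)

print(first_moment, second_moment, variance)
```

The approximations should agree with $np$, $np + n(n-1)p^2$ and $np(1-p)$ up to a small discretisation error.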

1.7 Discrete Distributions

1.7.1 Discrete Uniform

The Discrete Uniform is used when every outcome is equiprobable, such as with fair dice, coins, and simple random sampling (a surveying method). This is the simplest discrete random variable. Let $X \sim U(a, b)$ where the parameters $a$ and $b$ are the integer minimum and maximum of the support respectively (the support is in the integers). For coin tosses $a = 0$ and $b = 1$, and for dice $a = 1$ and $b = 6$. For the discrete uniform, the sample space $S$ is the set of all integers from $a$ to $b$ inclusive. The uniform has the following properties.

$$f(x) = \frac{1}{b - a + 1}, \quad x \in S$$

$$F(x) = \frac{x - a + 1}{b - a + 1}, \quad x \in S$$

$$E[X] = \frac{a + b}{2}$$

$$\mathrm{Var}(X) = \frac{(b - a + 1)^2 - 1}{12}$$

1.7.2 Bernoulli

Given a trial which results in either a success or a failure (like flipping a coin; we can consider a head as a success and a tail as a failure), we can use a random variable to represent the outcome. This situation is modelled by a Bernoulli random variable, named after the scientist Jacob Bernoulli. This is also known as a Bernoulli Trial with a probability of success, $p$, and probability of failure, $q = 1 - p$. This random variable has only one parameter, $p$, which is the probability of a success. The support is simply 0 and 1 (i.e. failure or success respectively).

$$f(x) = \begin{cases} q & x = 0 \\ p & x = 1 \end{cases}$$

$$F(x) = \begin{cases} q & x = 0 \\ 1 & x = 1 \end{cases}$$

$$E[X] = p$$

$$\mathrm{Var}(X) = p(1 - p)$$

1.7.3 Binomial

When there are $n$ independent and identically distributed Bernoulli random variables and the value of interest is the number of successes (e.g. the number of heads one obtains after flipping a coin a certain number of times), then this can be described by a binomial random variable. The Binomial random variable has two parameters: the number of trials, $n$, and the probability of success, $p$. The sample space is the set of possible numbers of successes in $n$ trials, $x \in \{0, 1, \ldots, n\}$.

$$f(x) = \binom{n}{x} p^x (1 - p)^{n - x}$$

$$F(x) = \sum_{i=0}^{x} f(i)$$

$$E[X] = np$$

$$\mathrm{Var}(X) = np(1 - p)$$

1.7.4 Geometric

Given a series of independent Bernoulli trials, if we are interested in the number of trials before the first success (that is, the number of failures that precede it) then we would use a Geometric random variable. This has one parameter, the probability of success, $p$. The sample space is the non-negative integers, which is very intuitive. If the first trial is a success then 0 additional trials were required; however, we can have up to infinitely many trials until our first success. Especially if the coin rolls into a nearby storm drain.

$$f(x) = p(1 - p)^x$$

$$F(x) = 1 - (1 - p)^{x + 1}$$

$$E[X] = \frac{1 - p}{p}$$

$$\mathrm{Var}(X) = \frac{1 - p}{p^2}$$

1.7.5 Negative Binomial

Consider the previous random variable, except we are now interested in the $k$th success. Then this is a negative binomial distribution. This distribution has two parameters: the number of desired successes, $k$, and the probability of each success occurring, $p$. The sample space for this random variable, again, is the set of non-negative integers.

$$f(x) = \binom{x + k - 1}{k - 1} p^k (1 - p)^x$$

$$F(x) = \sum_{i=0}^{x} f(i)$$

$$E[X] = \frac{k(1 - p)}{p}$$

$$\mathrm{Var}(X) = \frac{k(1 - p)}{p^2}$$

1.7.6 Hypergeometric

Now assume that we have a population of size $N$ which can be split into two subgroups arbitrarily called A and B (a common example is an urn full of coloured balls). The subgroups have size $M$ and $N - M$ respectively. If we take a sample of size $n$ without replacement from the population, the number of elements taken from subgroup A is the random variable of interest. Note that the two subgroups can usually be swapped in terms of notation without consequence. Consider the example of an urn full of $N$ balls. There are $M$ blue balls and $N - M$ red balls. If we take a sample of $n$ balls out of the urn, how many blue balls are there? This assumes that each remaining ball is equally likely to be selected at each selection. The parameters and support for this random variable have very specific restrictions. $N$, the population size, is a positive integer (preferably greater than 2). $M$, the size of a subpopulation (i.e. the number of blue balls), can be any integer in the set $[0, N]$. $n$, the number of items selected, is an integer in the set $[0, N]$. Now the sample space is a little tricky: the support is $x \in \{\max(0, n + M - N), \ldots, \min(M, n)\}$.

$$f(x) = \frac{\binom{M}{x}\binom{N - M}{n - x}}{\binom{N}{n}}$$

$$F(x) = \sum_{i = \max(0,\, n + M - N)}^{x} f(i)$$

$$E[X] = \frac{nM}{N}$$

$$\mathrm{Var}(X) = n \, \frac{M}{N}\left(1 - \frac{M}{N}\right)\frac{N - n}{N - 1}$$
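The support restriction and the moment formulas for the hypergeometric can be checked exactly for a small urn. The sketch below uses illustrative parameter values ($N = 20$, $M = 7$, $n = 5$) and a hypothetical helper name `hypergeom_pf`; neither comes from the notes.

```python
from fractions import Fraction
from math import comb

def hypergeom_pf(x, N, M, n):
    """P(X = x): x blue balls in a sample of n drawn without replacement
    from an urn with M blue balls and N - M red balls."""
    return Fraction(comb(M, x) * comb(N - M, n - x), comb(N, n))

N, M, n = 20, 7, 5
support = range(max(0, n + M - N), min(M, n) + 1)

total = sum(hypergeom_pf(x, N, M, n) for x in support)
mean = sum(x * hypergeom_pf(x, N, M, n) for x in support)
var = sum((x - mean) ** 2 * hypergeom_pf(x, N, M, n) for x in support)

print(total, mean, var)
```

The exact sums can then be compared with $nM/N$ and the variance formula above.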

1.7.7 Poisson

The number of independent events that occur at a common rate, $\lambda$, over a fixed period of time, $t$, follows the Poisson distribution. This random variable has the parameter $\lambda t$, where $\lambda$ is the rate of arrivals (a strictly positive real number) and $t$ is the length of the time period of interest. The support for this random variable is the non-negative integers.

$$f(x) = \frac{(\lambda t)^x e^{-\lambda t}}{x!}$$

$$F(x) = e^{-\lambda t} \sum_{i=0}^{x} \frac{(\lambda t)^i}{i!}$$

$$E[X] = \lambda t$$

$$\mathrm{Var}(X) = \lambda t$$
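A short numerical check of the Poisson formulas follows. It is a sketch that truncates the infinite support at 60 terms, which is an assumption that is adequate only because the illustrative parameter $\lambda t = 3$ is small; the function names are hypothetical.

```python
from math import exp, factorial

def poisson_pf(x, rate, t):
    """P(X = x) for a Poisson count with parameter rate * t."""
    mu = rate * t
    return mu**x * exp(-mu) / factorial(x)

def poisson_cdf(x, rate, t):
    """P(X <= x): the sum starts at i = 0, since 0 events is possible."""
    return sum(poisson_pf(i, rate, t) for i in range(x + 1))

rate, t = 2.0, 1.5            # illustrative values, so rate * t = 3
xs = range(60)                # truncated support; the tail is negligible here

mean = sum(x * poisson_pf(x, rate, t) for x in xs)
var = sum((x - mean) ** 2 * poisson_pf(x, rate, t) for x in xs)

print(mean, var, poisson_cdf(59, rate, t))
```

Both the mean and the variance come out close to $\lambda t$, and the cumulative probability over the truncated support is close to 1.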

1.8 Discrete Multivariate Distributions

1.8.1 Joint Probability Functions:

First, suppose there are two random variables $X$ and $Y$, and define the function

$$f(x, y) = P(X = x \text{ and } Y = y) = P(X = x, Y = y).$$

We call $f(x, y)$ the joint probability function of $(X, Y)$. In general,

$$f(x_1, x_2, \ldots, x_n) = P(X_1 = x_1 \text{ and } X_2 = x_2 \text{ and } \ldots \text{ and } X_n = x_n)$$

if there are $n$ random variables $X_1, \ldots, X_n$.

The properties of a joint probability function are similar to those for a single variable; for two random variables we have $f(x, y) \ge 0$ for all $(x, y)$ and

$$\sum_{\text{all } (x, y)} f(x, y) = 1.$$


Example: Consider the following numerical example, where we show $f(x, y)$ in a table.

f(x, y)    x = 0    x = 1    x = 2
y = 1      .1       .2       .3
y = 2      .2       .1       .1

For example, $f(0, 2) = P(X = 0 \text{ and } Y = 2) = 0.2$. We can check that $f(x, y)$ is a proper joint probability function since $f(x, y) \ge 0$ for all 6 combinations of $(x, y)$ and the sum of these 6 probabilities is 1. When there are only a few values for $X$ and $Y$, it is often easier to tabulate $f(x, y)$ than to find a formula for it. We'll use this example below to illustrate other definitions for multivariate distributions, but first we give a short example where we need to find $f(x, y)$.

Example: Suppose the range for $(X, Y)$, which is the set of possible values $(x, y)$, is the following: $X$ can be 0, 1, 2, or 3, and $Y$ can be 0 or 1. We'll see that not all 8 combinations $(x, y)$ are possible in the following table of $f(x, y) = P(X = x, Y = y)$.

f(x, y)    x = 0    x = 1    x = 2    x = 3
y = 0      1/8      2/8      1/8      0
y = 1      0        1/8      2/8      1/8

Note that the range or joint p.f. for $(X, Y)$ is a little awkward to write down here in formulas, so we just use the table.

Marginal Distributions: We may be given a joint probability function involving more variables than we're interested in using. How can we eliminate variables that are not of interest? Look at the first example above: if we're only interested in $X$, and don't care what value $Y$ takes, we can see that

$$P(X = 0) = P(X = 0, Y = 1) + P(X = 0, Y = 2),$$

so $P(X = 0) = f(0, 1) + f(0, 2) = 0.3$. Similarly

$$P(X = 1) = f(1, 1) + f(1, 2) = .3 \quad \text{and}$$

$$P(X = 2) = f(2, 1) + f(2, 2) = .4$$

The distribution of $X$ obtained in this way from the joint distribution is called the marginal probability function of $X$:

x       0     1     2
f(x)    .3    .3    .4

In the same way, if we were only interested in $Y$, we obtain

$$P(Y = 1) = f(0, 1) + f(1, 1) + f(2, 1) = .6$$

since $X$ can be 0, 1, or 2 when $Y = 1$. The marginal probability function of $Y$ would be:

y       1     2
f(y)    .6    .4

We generally put a subscript on the $f$ to indicate whether it is the marginal probability function for the first or second variable. So $f_1(1)$ would be $P(X = 1) = .3$, while $f_2(1)$ would be $P(Y = 1) = 0.6$. An alternative notation that you may see is $f_X(x)$ and $f_Y(y)$. In general, to find $f_1(x)$ we add over all values of $y$ where $X = x$, and to find $f_2(y)$ we add over all values of $x$ with $Y = y$. Then

$$f_1(x) = \sum_{\text{all } y} f(x, y) \quad \text{and} \quad f_2(y) = \sum_{\text{all } x} f(x, y).$$

This reasoning can be extended beyond two variables. For example, with three variables $(X_1, X_2, X_3)$,

$$f_1(x_1) = \sum_{\text{all } (x_2, x_3)} f(x_1, x_2, x_3) \quad \text{and}$$

$$f_{1,3}(x_1, x_3) = \sum_{\text{all } x_2} f(x_1, x_2, x_3) = P(X_1 = x_1, X_3 = x_3)$$

where $f_{1,3}(x_1, x_3)$ is the marginal joint distribution of $(X_1, X_3)$.
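Summing out the unwanted variable is mechanical, so it is easy to express in code. The sketch below stores the joint probability function of the first example ($x = 0, 1, 2$ and $y = 1, 2$) as a dictionary keyed by $(x, y)$; the dictionary layout and the function name `marginal` are illustrative assumptions.

```python
from fractions import Fraction

# Joint probability function f(x, y) from the first example, keyed by (x, y).
joint = {
    (0, 1): Fraction(1, 10), (1, 1): Fraction(2, 10), (2, 1): Fraction(3, 10),
    (0, 2): Fraction(2, 10), (1, 2): Fraction(1, 10), (2, 2): Fraction(1, 10),
}

def marginal(joint, axis):
    """Sum the joint p.f. over the other variable.
    axis=0 gives f1(x); axis=1 gives f2(y)."""
    out = {}
    for point, p in joint.items():
        out[point[axis]] = out.get(point[axis], 0) + p
    return out

f1 = marginal(joint, 0)
f2 = marginal(joint, 1)
print(f1, f2)
```

The same function would work for three or more variables if it were extended to keep a tuple of axes instead of a single one.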

1.8.2 Conditional Probability Functions (described using Random Variables):

Again, we can extend a definition from events to random variables. For events $A$ and $B$, recall that $P(A \mid B) = \frac{P(AB)}{P(B)}$. Since $P(X = x \mid Y = y) = P(X = x, Y = y)/P(Y = y)$, we make the following definition.

Definition 11. The conditional probability function of $X$ given $Y = y$ is $f(x \mid y) = \frac{f(x, y)}{f_2(y)}$.

Similarly, $f(y \mid x) = \frac{f(x, y)}{f_1(x)}$ (provided, of course, the denominator is not zero).

In our first example let us find $f(x \mid Y = 1)$:

$$f(x \mid Y = 1) = \frac{f(x, 1)}{f_2(1)}.$$

This gives:

x               0              1              2
f(x | Y = 1)    .1/.6 = 1/6    .2/.6 = 1/3    .3/.6 = 1/2

As you would expect, marginal and conditional probability functions are probability functions in that they are always $\ge 0$ and their sum is 1.
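Definition 11 can be applied to the first example's table in a few lines. This is a sketch using the same dictionary representation of $f(x, y)$ as an assumption; the function name `conditional_x_given_y` is hypothetical.

```python
from fractions import Fraction

# Joint probability function f(x, y) from the first example, keyed by (x, y).
joint = {
    (0, 1): Fraction(1, 10), (1, 1): Fraction(2, 10), (2, 1): Fraction(3, 10),
    (0, 2): Fraction(2, 10), (1, 2): Fraction(1, 10), (2, 2): Fraction(1, 10),
}

def conditional_x_given_y(joint, y):
    """f(x | y) = f(x, y) / f2(y), defined only when f2(y) is not zero."""
    f2_y = sum(p for (_, yy), p in joint.items() if yy == y)
    if f2_y == 0:
        raise ValueError("conditioning on an event of probability zero")
    return {x: p / f2_y for (x, yy), p in joint.items() if yy == y}

cond = conditional_x_given_y(joint, 1)
print(cond)
```

The resulting values are non-negative and sum to 1, as a probability function must.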

1.9 Multinomial Distribution

There is only this one multivariate model distribution introduced in this course, though other multivariate distributions exist. The multinomial distribution defined below is very important. It is a generalization of the binomial model to the case where each trial has $k$ possible outcomes.

Physical Setup: This distribution is the same as the Binomial except there are $k$ outcomes rather than two. An experiment is repeated independently $n$ times with $k$ distinct outcomes each time. Let the probabilities of these $k$ outcomes be $p_1, p_2, \ldots, p_k$ each time. Let $X_1$ be the number of times the 1st outcome occurs, $X_2$ the number of times the 2nd outcome occurs, $\ldots$, $X_k$ the number of times the $k$th outcome occurs. Then $(X_1, X_2, \ldots, X_k)$ has a multinomial distribution.

Notes:

(1) $p_1 + p_2 + \cdots + p_k = 1$

(2) $X_1 + X_2 + \cdots + X_k = n$

If we wish, we can drop one of the variables (say the last), and just note that $X_k$ equals $n - X_1 - X_2 - \cdots - X_{k-1}$.

Illustration:

1. Suppose student marks are given in letter grades: A, B, C, D, or F. In a class of 80 students, the number of students getting A, B, $\ldots$, F might have a multinomial distribution with $n = 80$ and $k = 5$.


Joint Probability Function: The joint probability function of $X_1, \ldots, X_k$ is given by extending the argument in the sprinters example from $k = 3$ to general $k$. There are $\frac{n!}{x_1!\, x_2! \cdots x_k!}$ different outcomes of the $n$ trials in which $x_1$ are of the 1st type, $x_2$ are of the 2nd type, etc. Each of these arrangements has probability $p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}$ since $p_1$ is multiplied $x_1$ times in some order, etc.

Therefore

$$f(x_1, x_2, \ldots, x_k) = \frac{n!}{x_1!\, x_2! \cdots x_k!}\, p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}$$

The restrictions on the $x_i$'s are $x_i = 0, 1, \ldots, n$ and $\sum_{i=1}^{k} x_i = n$.

As a check that $\sum f(x_1, x_2, \ldots, x_k) = 1$ we use the multinomial theorem to get

$$\sum \frac{n!}{x_1!\, x_2! \cdots x_k!}\, p_1^{x_1} \cdots p_k^{x_k} = (p_1 + p_2 + \cdots + p_k)^n = 1.$$

Here is another simple example.

Example: Every person has one of four blood types: A, B, AB and O. (This is important in determining, for example, who may give a blood transfusion to a person.) In a large population, let the fractions that have type A, B, AB and O, respectively, be $p_1, p_2, p_3, p_4$. Then, if $n$ persons are randomly selected from the population, the numbers $X_1, X_2, X_3, X_4$ of types A, B, AB, O have a multinomial distribution with $k = 4$ (for Caucasian people, the values of the $p_i$'s are approximately $p_1 = .45$, $p_2 = .08$, $p_3 = .03$, $p_4 = .44$).

Note: We sometimes use the notation $(X_1, \ldots, X_k) \sim \text{Mult}(n; p_1, \ldots, p_k)$ to indicate that $(X_1, \ldots, X_k)$ has a multinomial distribution.
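The multinomial probability function and the "sums to 1" check can be sketched for the blood type example. The sample size $n = 5$ and the function name `multinomial_pf` are illustrative assumptions; only the four probabilities come from the example above.

```python
from itertools import product
from math import factorial, prod

def multinomial_pf(counts, probs):
    """n! / (x1! ... xk!) * p1^x1 ... pk^xk, with n = sum(counts)."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)      # each partial quotient is an integer
    return coef * prod(p**c for p, c in zip(probs, counts))

probs = (0.45, 0.08, 0.03, 0.44)   # blood types A, B, AB, O
n = 5                              # hypothetical sample size

# Probability of observing 2 type A, 1 type B, 0 type AB, 2 type O.
example = multinomial_pf((2, 1, 0, 2), probs)

# Check that the p.f. sums to 1 over all count vectors with total n.
total = sum(
    multinomial_pf(c, probs)
    for c in product(range(n + 1), repeat=len(probs))
    if sum(c) == n
)
print(example, total)
```

The enumeration grows quickly with $n$ and $k$, so this brute-force check is only practical for small cases.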

1.10 Expectation for Multivariate Distributions: Covariance and Correlation

It is easy to extend the definition of expected value to multiple variables. Generalizing $E[g(X)] = \sum_{\text{all } x} g(x) f(x)$ leads to the definition of expected value in the multivariate case.

Definition 12.

$$E[g(X, Y)] = \sum_{\text{all } (x, y)} g(x, y) f(x, y)$$

and

$$E[g(X_1, X_2, \ldots, X_n)] = \sum_{\text{all } (x_1, x_2, \ldots, x_n)} g(x_1, x_2, \ldots, x_n)\, f(x_1, \ldots, x_n)$$

As before, these represent the average value of $g(X, Y)$ and $g(X_1, \ldots, X_n)$. $E[g(X, Y)]$ could also be determined by finding the probability function $f_Z(z)$ of $Z = g(X, Y)$ and then using the definition of expected value $E(Z) = \sum_{\text{all } z} z f_Z(z)$.

Example: Let the joint probability function, $f(x, y)$, be given by

f(x, y)    x = 0    x = 1    x = 2
y = 1      .1       .2       .3
y = 2      .2       .1       .1

Find $E(XY)$ and $E(X)$.

Solution:

$$E(XY) = \sum_{\text{all } (x, y)} xy f(x, y)$$

$$= (0 \times 1 \times .1) + (1 \times 1 \times .2) + (2 \times 1 \times .3) + (0 \times 2 \times .2) + (1 \times 2 \times .1) + (2 \times 2 \times .1)$$

$$= 1.4$$

To find $E(X)$ we have a choice of methods. First, taking $g(x, y) = x$ we get

$$E(X) = \sum_{\text{all } (x, y)} x f(x, y)$$

$$= (0 \times .1) + (1 \times .2) + (2 \times .3) + (0 \times .2) + (1 \times .1) + (2 \times .1)$$

$$= 1.1$$

Alternatively, since $E(X)$ only involves $X$, we could find $f_1(x)$ and use

$$E(X) = \sum_{x=0}^{2} x f_1(x) = (0 \times .3) + (1 \times .3) + (2 \times .4) = 1.1$$

x=0

Property of Multivariate Expectation: It is easily proved (make sure you can do this) that

$$E[a g_1(X, Y) + b g_2(X, Y)] = a E[g_1(X, Y)] + b E[g_2(X, Y)]$$

This can be extended beyond 2 functions $g_1$ and $g_2$, and beyond 2 variables $X$ and $Y$.


1.10.1 Relationships between Variables:

Definition 13. The covariance of $X$ and $Y$, denoted $\mathrm{Cov}(X, Y)$ or $\sigma_{XY}$, is

$$\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$$

For calculation purposes, it is easier to express the formula in the following form:

$$\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E(XY - \mu_X Y - X \mu_Y + \mu_X \mu_Y)$$

$$= E(XY) - \mu_X E(Y) - \mu_Y E(X) + \mu_X \mu_Y$$

$$= E(XY) - E(X)E(Y) - E(Y)E(X) + E(X)E(Y)$$

Therefore $\mathrm{Cov}(X, Y) = E(XY) - E(X)E(Y)$

Example: In the example with joint probability function

f(x, y)    x = 0    x = 1    x = 2
y = 1      .1       .2       .3
y = 2      .2       .1       .1

find $\mathrm{Cov}(X, Y)$.

Solution: We previously calculated $E(XY) = 1.4$ and $E(X) = 1.1$. Similarly, $E(Y) = (1 \times .6) + (2 \times .4) = 1.4$.

Therefore $\mathrm{Cov}(X, Y) = 1.4 - (1.1)(1.4) = -.14$
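The covariance in this example can be recomputed exactly from the table using the shortcut $\mathrm{Cov}(X, Y) = E(XY) - E(X)E(Y)$. The sketch below again assumes a dictionary keyed by $(x, y)$; the helper name `expectation` is hypothetical.

```python
from fractions import Fraction

# Joint probability function f(x, y) from the example, keyed by (x, y).
joint = {
    (0, 1): Fraction(1, 10), (1, 1): Fraction(2, 10), (2, 1): Fraction(3, 10),
    (0, 2): Fraction(2, 10), (1, 2): Fraction(1, 10), (2, 2): Fraction(1, 10),
}

def expectation(joint, g):
    """E[g(X, Y)] = sum of g(x, y) * f(x, y) over all (x, y)."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

e_xy = expectation(joint, lambda x, y: x * y)
e_x = expectation(joint, lambda x, y: x)
e_y = expectation(joint, lambda x, y: y)
cov = e_xy - e_x * e_y

print(e_xy, e_x, e_y, cov)  # 7/5 11/10 7/5 -7/50
```

The exact value $-7/50$ is the same as $-0.14$; the negative sign reflects that larger values of $X$ tend to occur with the smaller value of $Y$ in this table.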

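Theorem 3. If $X$ and $Y$ are independent then $\mathrm{Cov}(X, Y) = 0$.

Proof: Recall $E(X - \mu_X) = E(X) - \mu_X = 0$. Let $X$ and $Y$ be independent. Then $f(x, y) = f_1(x) f_2(y)$.

$$\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = \sum_{\text{all } y} \sum_{\text{all } x} (x - \mu_X)(y - \mu_Y) f_1(x) f_2(y)$$

$$= \sum_{\text{all } y} \left[(y - \mu_Y) f_2(y) \sum_{\text{all } x} (x - \mu_X) f_1(x)\right]$$

$$= \sum_{\text{all } y} \left[(y - \mu_Y) f_2(y) E(X - \mu_X)\right]$$

$$= \sum_{\text{all } y} 0 = 0$$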

The following theorem gives a direct proof of the result above, and is useful in many other situations.

Theorem 4. Suppose random variables $X$ and $Y$ are independent. Then, if $g_1(X)$ and $g_2(Y)$ are any two functions,

$$E[g_1(X) g_2(Y)] = E[g_1(X)] E[g_2(Y)].$$

To prove Theorem 3 using Theorem 4, we just note that if $X$ and $Y$ are independent then

$$\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E(X - \mu_X) E(Y - \mu_Y) = 0$$

Caution: This result is not reversible. If $\mathrm{Cov}(X, Y) = 0$ we cannot conclude that $X$ and $Y$ are independent.

Example: Let $(X, Y)$ have the joint probability function $f(0, 0) = 0.2$, $f(1, 1) = 0.6$, $f(2, 0) = 0.2$; i.e. $(X, Y)$ only takes on 3 values. Then

x         0     1     2
f1(x)     .2    .6    .2

and

y         0     1
f2(y)     .4    .6

are the marginal probability functions. Since $f_1(x) f_2(y) \ne f(x, y)$, $X$ and $Y$ are not independent. However,

$$E(XY) = (0 \times 0 \times .2) + (1 \times 1 \times .6) + (2 \times 0 \times .2) = .6$$

$$E(X) = (0 \times .2) + (1 \times .6) + (2 \times .2) = 1 \quad \text{and} \quad E(Y) = (0 \times .4) + (1 \times .6) = .6$$

Therefore, $\mathrm{Cov}(X, Y) = E(XY) - E(X)E(Y) = .6 - (1)(.6) = 0$

So, $X$ and $Y$ have covariance 0 but are not independent. If $\mathrm{Cov}(X, Y) = 0$ we say that $X$ and $Y$ are uncorrelated, because of the definition of correlation given below.
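The caution can be verified directly: for this three-point distribution the covariance is exactly zero, yet the joint probability function does not factor into the product of its marginals. The sketch below is illustrative; the variable names and the dictionary representation are assumptions.

```python
from fractions import Fraction

# (X, Y) takes only three values; every other (x, y) has probability 0.
joint = {(0, 0): Fraction(1, 5), (1, 1): Fraction(3, 5), (2, 0): Fraction(1, 5)}

# Marginal probability functions f1(x) and f2(y).
f1, f2 = {}, {}
for (x, y), p in joint.items():
    f1[x] = f1.get(x, 0) + p
    f2[y] = f2.get(y, 0) + p

e_xy = sum(x * y * p for (x, y), p in joint.items())
e_x = sum(x * p for x, p in f1.items())
e_y = sum(y * p for y, p in f2.items())
cov = e_xy - e_x * e_y

# Independence requires f(x, y) = f1(x) * f2(y) for every pair (x, y).
independent = all(
    joint.get((x, y), 0) == f1[x] * f2[y] for x in f1 for y in f2
)
print(cov, independent)  # 0 False
```

Here $Y$ is completely determined by $X$ ($Y = 1$ exactly when $X = 1$), which is about as far from independence as possible, even though the linear association measured by the covariance is zero.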

Definition 14. The correlation coefficient of $X$ and $Y$ is

$$\rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$$

The correlation coefficient measures the strength of the linear relationship between $X$ and $Y$ and is simply a rescaled version of the covariance, scaled to lie in the interval $[-1, 1]$.

Properties of $\rho$:

1) Since $\sigma_X$ and $\sigma_Y$, the standard deviations of $X$ and $Y$, are both positive, $\rho$ will have the same sign as $\mathrm{Cov}(X, Y)$. Hence the interpretation of the sign of $\rho$ is the same as for $\mathrm{Cov}(X, Y)$, and $\rho = 0$ if $X$ and $Y$ are independent. When $\rho = 0$ we say that $X$ and $Y$ are uncorrelated.

2) $-1 \le \rho \le 1$, and as $\rho \to \pm 1$ the relation between $X$ and $Y$ becomes one-to-one and linear.


1.11 Mean and Variance of a Linear Combination of Random Variables

Many problems require us to consider linear combinations of random variables; examples will be given below and in Chapter 9. Although writing down the formulas is somewhat tedious, we give here some important results about their means and variances.

Results for Means:

1. $E(aX + bY) = aE(X) + bE(Y) = a\mu_X + b\mu_Y$, when $a$ and $b$ are constants. (This follows from the definition of expected value.) In particular, $E(X + Y) = \mu_X + \mu_Y$ and $E(X - Y) = \mu_X - \mu_Y$.

2. Let $a_i$ be constants (real numbers) and $E(X_i) = \mu_i$. Then $E\left(\sum a_i X_i\right) = \sum a_i \mu_i$. In particular, $E\left(\sum X_i\right) = \sum E(X_i)$.

3. Let $X_1, X_2, \ldots, X_n$ be random variables which have a mean $\mu$. (You can imagine these being some sample results from an experiment such as recording the number of occupants in cars travelling over a toll bridge.) The sample mean is $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$. Then $E(\bar{X}) = \mu$.

Results for Covariance:

1. $\mathrm{Cov}(X, X) = E[(X - \mu_X)(X - \mu_X)] = E\left[(X - \mu)^2\right] =$