
Introduction to Probability

Franco Vivaldi
School of Mathematical Sciences

© The University of London, 2015


Last updated: April 29, 2015

Abstract
These notes are a summary of what was lectured. They do not contain all examples or
explanations, and are not a substitute for your own notes.

Introduction

We begin with some questions.


1. Would you rather be given 5, or toss a coin winning 10 if it comes up heads?
2. How many people do you need in a room to have a better than 50% chance that two of them
share a birthday?
3. A suspect's fingerprints match those found on a murder weapon. The chance of a match
occurring by chance is around 1 in 50,000. Is the suspect likely to be innocent?
4. A network is formed by taking 100 points and connecting each pair with some fixed probability. What can we say about the number and size of connected components in the network?
These questions all relate to situations where there is some randomness: events we cannot
predict with certainty. This could be either because they happen by chance (tossing a coin) or
because they are beyond our knowledge (the innocence or guilt of a suspect). Probability is the mathematical
discipline that studies the laws that govern random phenomena. Since randomness is all around us,
it is a useful theory; it is also mathematically appealing.
The first question is not entirely a mathematical one and there is no right or wrong answer. It will
depend on circumstances. To bring it closer to mathematics, we modify it by introducing a variable
quantity p as follows:
1. Would you rather (a) be given 5, or (b) toss a coin winning p if it comes up heads?
Now the answer depends on the value of p. If p ≤ 5, then everybody would answer (a), and most
people would still answer (a) for p = 6. For p = 100 most people would answer (b), and there
must be an intermediate value of p for which opinions are as divided as possible. Such a value is
p = 10, but how can we arrive at this conclusion? Let N be a large integer, and let N people choose
(a). Then the total amount paid out will be exactly 5N. If N people choose (b), then the total
amount paid out should be close to N p/2, the closer the larger the value of N. The ratio of these
figures is near (N p/2)/(5N) = p/10. As N → ∞, the value p = 10 will emerge as the only value for
which this ratio converges to 1.
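The limiting-frequency argument above can be checked with a small simulation. This is a sketch of ours, not part of the notes; the function name and the fixed seed are our own choices:

```python
import random

def average_payout_b(p, n_people, seed=0):
    """Average payout per person when each of n_people tosses a fair coin,
    winning p on heads and nothing on tails."""
    rng = random.Random(seed)
    total = sum(p if rng.random() < 0.5 else 0 for _ in range(n_people))
    return total / n_people

# Option (a) pays exactly 5 per person; option (b) pays about p/2 on average.
# For p = 10 the two averages agree for large N, as the argument predicts.
avg = average_payout_b(10, 100_000)
print(avg)  # close to 5
```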
The second question is a simple calculation which we will see how to do later. If you haven't
seen this before then try to guess what the answer will be. Most people have rather poor intuition
for this kind of question, so I would expect there to be a wide range of guesses and for many of
them to be far from the correct answer.
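For those who want to check their guess against the exact value straight away, the probability can be computed through the complement, the probability that all k birthdays are distinct. This snippet is ours, not from the notes, and assumes 365 equally likely birthdays:

```python
def p_shared_birthday(k):
    """Exact probability that at least two of k people share a birthday,
    assuming 365 equally likely birthdays and independent choices."""
    p_all_distinct = 1.0
    for i in range(k):
        p_all_distinct *= (365 - i) / 365
    return 1 - p_all_distinct

# Smallest group size with a better than 50% chance of a shared birthday:
k = 1
while p_shared_birthday(k) <= 0.5:
    k += 1
print(k)  # 23
```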
The third question emphasises how important it is to determine exactly what information we are
given and what we want to know. The court would have to consider the probability that the suspect
is innocent under the assumption that the fingerprints match. The numerical probability given in
the question is the probability that the fingerprints match under the assumption that the suspect
is innocent. These are in general completely different quantities. The erroneous assumption that
they are equal is sometimes called the prosecutor's fallacy. The mathematical tool for considering
probabilities given certain assumptions is conditional probability.
The final question describes the construction of a random graph. These objects are well studied
both as pure mathematical objects in their own right and as models to help understand networks in
a range of areas such as epidemiology and computing.
The broader question of what probability is does not have an entirely satisfactory answer. We
will associate to an event a number which measures the likelihood of it occurring. But what does
this really mean? As we saw above, you can think of it as a limiting frequency: if I toss a coin
many times under the same conditions, I would expect the proportion of tosses which come up
heads to approach 1/2. Under such a circumstance, we assign the number 1/2 to this event, as a
measure of the likelihood of it occurring. If it is not natural to imagine repeating an experiment, we
could think of probability as describing the degree of belief we have that something will happen.
Each of these notions though is rather inexact. Later we will give a precise mathematical statement
which defines probability as a function which possesses certain properties.

Sample Space and Events

The general setting is this: we perform an experiment and record an outcome. Outcomes must be
precisely specified, mutually exclusive and cover all possibilities. To describe these phenomena
we shall employ the mathematical language of sets. The basic terminology and notation is covered
in the Mathematical Structures module.
Definition. The sample space is the set of all possible outcomes for the experiment. It is denoted
by Ω (or S).
Definition. An event is a subset of the sample space. The event occurs if the actual outcome is an
element of the set. An event comprising just one element of Ω is called simple or elementary.
EXAMPLE. A coin is tossed three times and the sequence of heads/tails is recorded. We let

Ω = {hhh, hht, hth, htt, thh, tht, tth, ttt}

where htt means the first toss is a head, the second is a tail, the third is a tail, etc. The elements of
the set are words made of three letters from the alphabet {h, t}. The sample space could equally
be represented as a set of binary sequences

Ω = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)},

or as a set of integers: Ω = {0, 1, 2, 3, 4, 5, 6, 7} (what is 5?). If A is the event "the second toss is a tail"
then

A = {hth, htt, tth, ttt}.

We see that A ⊆ Ω.
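The example can be reproduced directly on a computer by enumerating the sample space; a small sketch of ours, using the same word representation as above:

```python
from itertools import product

# Sample space for three coin tosses, as words over the alphabet {h, t}.
omega = ["".join(w) for w in product("ht", repeat=3)]
print(omega)  # ['hhh', 'hht', 'hth', 'htt', 'thh', 'tht', 'tth', 'ttt']

# The event "the second toss is a tail" as a subset of omega.
A = {w for w in omega if w[1] == "t"}
print(sorted(A))  # ['hth', 'htt', 'tth', 'ttt']
```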
Since events are sets, the set operations (union, intersection, difference, taking the complement)
have natural interpretations in the language of events.
There are many more examples in your lecture notes, which illustrate among other things
that the sample space may be finite or infinite.
We want to assign a numerical value to an event, which reflects the chance that it occurs. The
simplest way of doing this is to say that the probability of an event A is the ratio of the number of
outcomes in A to the total number of outcomes in Ω. We characterise this as the situation where all
outcomes are equally likely, which may or may not be a reasonable description of the phenomenon
under study. In the example above, this would be sensible if the coin is fair, in which case we would
get that the event A has probability 4/8 = 1/2. However, if the coin is biased this would not be a
reasonable notion of probability.
We also run into difficulties if Ω is infinite. If Ω = ℕ (the set of positive integers) then there
is no reasonable way to choose an element of Ω with all outcomes equally likely (why?). There
are however ways to choose a random positive integer in which every outcome has some chance of
occurring.

If Ω is a bounded region of the plane (the disc of unit radius, for instance), then there are similar
problems with defining probability in terms of ratios. However, here we do have some notion of
choosing a point with all possibilities equally likely. We could do this by setting the probability of
an event to be its area divided by the area of Ω.
In the next section we will see how to define a mathematical notion of probability which can
deal with outcomes not all being equally likely and with infinite sample spaces.

The Axiomatic Approach to Probability

Formally, a probability is a standard mathematical object: a function satisfying certain conditions,
called axioms. The axioms are chosen so as to make the construction plausible, and as
general as possible.
Definition (Kolmogorov's Axioms for Probability). A probability is a function P which assigns to
each event A of a sample space Ω a real number P(A) such that:
I. For every event A, we have P(A) ≥ 0;
II. P(Ω) = 1;
III. If A1, A2, . . . , An are events and Ai ∩ Aj = ∅ for all i ≠ j, then

P(A1 ∪ A2 ∪ ··· ∪ An) = ∑_{i=1}^{n} P(Ai).

If A1, A2, . . . are events and Ai ∩ Aj = ∅ for all i ≠ j, then

P(A1 ∪ A2 ∪ ···) = ∑_{i=1}^{∞} P(Ai).

Think of P as a function that assigns a weight to the subsets of Ω. Axiom I says that the weight
of a set must be non-negative, and Axiom II says that Ω itself has unit weight. Axiom III says that
the total weight of a collection of objects with no common elements may be computed by adding
up the weights of individual objects.
The domain of P is the set of all subsets of Ω, called the power set of Ω, and denoted by
P(Ω). The co-domain of P may be taken as the set of non-negative real numbers. Soon we shall
see that it can be taken as the closed unit interval, namely the interval with endpoints 0 and 1, both
included.
We say that events satisfying III are pairwise disjoint or mutually exclusive. Notice that Axiom
III has a version for finitely many events and a version for a countably infinite number of events. If
Ω is finite then we need only worry about the first one of these. One issue with the infinite version
of Axiom III is that we haven't said what the notation ∑_{i=1}^{∞} means (it is not sufficient just to think
of adding up infinitely many numbers). For this module we will only need infinite sums which
you already know how to deal with (for instance summing a geometric progression).
If Ω is infinite (more particularly, if it is not countable, e.g., an interval or a disc) then there is
one subtlety. It is not possible for every subset of Ω to be an event, since some sets are too weird to
give probabilities to in a meaningful way (you can't assign a weight to them). The reason for this is
very subtle and is beyond the scope of this module.¹ It will not concern us, since every reasonable
set you would ever want to consider does not have this problem.

¹If you want to find out more about this then you could read up on non-measurable sets and the Banach-Tarski
paradox, but be aware that this is post-graduate mathematics.

We write |A| to denote the number of elements (or cardinality) of a finite set A. If Ω is finite,
then you can check that setting P(A) = |A|/|Ω| gives a probability. This is the case when every
outcome in the sample space is equally likely.
WARNING: Do not assume that every outcome is equally likely without good reason.
Starting from the axioms we can deduce various properties. Hopefully, these will agree with
our intuition about probability (if they did not then this would suggest that we had not made a good
choice of axioms). The proofs of all of these are simple deductions from the axioms. The proofs
that are not given here are in your lecture notes.
Proposition 3.1. If A is an event then

P(A^c) = 1 − P(A).

PROOF. Let A be any event. Set A1 = A and A2 = A^c. By definition of the complement, A1 ∩ A2 = ∅
and so we can apply Axiom III (with n = 2) to get

P(A1 ∪ A2) = P(A1) + P(A2).

But (again by definition of the complement) A1 ∪ A2 = Ω, so

P(Ω) = P(A1) + P(A2) = P(A) + P(A^c).

By Axiom II we have P(Ω) = 1 and so

1 = P(A) + P(A^c).

Rearranging this gives the result. □
Notice that each line of the proof is justified by one of the axioms or a definition (or is a simple
manipulation). We can use the results we have proved to deduce further ones such as the following
corollary.
Corollary 3.2. P(∅) = 0.

PROOF. By definition of complement, we have Ω^c = ∅. Hence by Proposition 3.1

P(∅) = 1 − P(Ω) = 1 − 1 = 0,

where the second equality uses Axiom II. □
Corollary 3.3. If A is an event, then P(A) ≤ 1.

Proposition 3.4. If A and B are events and A ⊆ B, then

P(A) ≤ P(B).

Proposition 3.5. If A = {ω1, ω2, . . . , ωn} is a finite event, then

P(A) = ∑_{i=1}^{n} P({ωi}).

We often write P(ωi) for P({ωi}) although, strictly speaking, the former is incorrect. Idiomatic
expressions are common in mathematics; this will not be the only one in this course.
Proposition 3.6 (Inclusion-exclusion for two events²). For any two events A and B we have

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

The idea here is that if we just take P(A) + P(B) then we have counted A ∩ B twice, so this
must be subtracted. Here is a proof:

PROOF. The events A ∖ B, B ∖ A and A ∩ B are pairwise disjoint and their union is A ∪ B. Hence

P(A ∪ B) = P(A ∖ B) + P(B ∖ A) + P(A ∩ B) (Axiom III).

Also (again by Axiom III),

P(A) = P(A ∖ B) + P(A ∩ B),
P(B) = P(B ∖ A) + P(A ∩ B).

It follows that

P(A ∪ B) = P(A) + P(B) − P(A ∩ B). □
Proposition 3.7 (Inclusion-exclusion for three events). For any three events A, B and C we have

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).

As for two events, there is an intuitive argument and a proof from the axioms. There is also an
inclusion-exclusion formula for n events. If you're feeling keen you could try to work it out and
prove it.
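Inclusion-exclusion for two events can be verified on a concrete sample space. The sketch below, ours rather than from the notes, uses two fair dice with the uniform probability on the 36 outcomes:

```python
from fractions import Fraction
from itertools import product

# Two fair dice: uniform probability on the 36 ordered outcomes.
omega = list(product(range(1, 7), repeat=2))

def P(event):
    """P(A) = |A| / |omega| for the uniform probability."""
    return Fraction(len([w for w in omega if event(w)]), len(omega))

A = lambda w: w[0] == 6          # the first die shows 6
B = lambda w: w[0] + w[1] == 7   # the sum is 7

lhs = P(lambda w: A(w) or B(w))                       # P(A ∪ B)
rhs = P(A) + P(B) - P(lambda w: A(w) and B(w))        # P(A) + P(B) − P(A ∩ B)
print(lhs, lhs == rhs)  # 11/36 True
```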

3.1 Remarks on words and symbols

Let Ω be a set of integers, and let

A = {ω ∈ Ω : ω is even}.

The explicit expression for P(A) is heavy,

P({ω ∈ Ω : ω is even}),
²If we are in the situation where P(A) = |A|/|Ω| then this reduces to an expression involving cardinalities of sets.
This was the version of inclusion-exclusion that you saw in Mathematical Structures.

and so we adopt the short-hand notation

P(ω is even)   or   P(2 | ω).

Formally, these expressions are incorrect because the argument of a probability function P should
be a set (the set of all ω which are even), not a sentence ("ω is even"). However, replacing an event
by the sentence that defines it is very much part of the probability idiom, and this transgression
will not be considered a mistake.
This is not the first time we break a syntax rule; we've already done it when we wrote P(ω) to
mean P({ω}). The important thing is to state clearly our conventions.
This dual identity of events may be exploited to write statements, both symbolic and verbal,
in different ways:

SETS:        A ⊆ B
SENTENCES:   "The event A is contained in the event B"
             "The event A implies the event B"

Note that the expression "The event A implies the event B" is acceptable only within probability.
There is no such convention elsewhere, and so a statement such as "The set A implies the set B" is
plainly wrong.

Sampling

If the sample space Ω is finite, and if all elements of the sample space are equally likely, then
calculating the probability of an event A involves counting elements of sets:

P(A) = |A| / |Ω|.

To count, we are often interested in finding how many ways there are of choosing r objects from
an n-element set. This is called sampling from an n-element set. The result of sampling depends
on exactly what we mean by selection: is the order important? Is repetition allowed?

Ordered selection with replacement


If we make an ordered selection of r objects from a set X with replacement (that is, we allow
elements to be repeated) then the sample space is the set of all ordered r-tuples of elements of X:
Ω = {(x1, x2, . . . , xr) : xi ∈ X},

and if |X| = n we have

|Ω| = n^r.

(There are n choices for x1; for each of these there are n choices for x2, and so on.)
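The count n^r can be checked by listing the tuples explicitly; a small sketch of ours using the standard library:

```python
from itertools import product

X = ["a", "b", "c", "d"]               # n = 4
samples = list(product(X, repeat=2))   # ordered selection with replacement, r = 2
print(len(samples))  # 16, which is 4**2
```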

Ordered selection without replacement


If we make an ordered selection of r objects from a set X without replacement (that is to say, we
do not allow elements to be repeated) then the sample space is the set of all ordered r-tuples of
distinct elements of X:
Ω = {(x1, x2, . . . , xr) : xi ∈ X with xi ≠ xj for all i ≠ j}.

To find the cardinality of this set, notice that if |X| = n then there are n choices for x1; for each of
these choices there are n − 1 choices for x2; for each of these there are n − 2 choices for x3, and so
on. Hence

|Ω| = n(n − 1)(n − 2) ··· (n − r + 1) = n!/(n − r)!

(there are r terms in this product).
An important special case of this is that there are n! ways of making an ordered selection of n
objects from an n-element set; that is, there are n! permutations of an n-element set.³
³The convention that 0! = 1 means that the formula n!/(n − r)! is still valid in this situation.
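The formula n!/(n − r)! can likewise be checked by enumeration; this sketch of ours lists the ordered selections without replacement:

```python
from itertools import permutations
from math import factorial

X = range(5)                           # n = 5
r = 3
samples = list(permutations(X, r))     # ordered selection without replacement
print(len(samples))                    # 60
print(factorial(5) // factorial(5 - r))  # 60, the formula n!/(n − r)!
```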

Unordered selection without replacement


If we make an unordered selection of r objects from a set X without replacement then the sample
space is the set of all unordered lists of r distinct elements of X. An unordered list of r distinct
objects is a set of cardinality r and so

Ω = {A ⊆ X : |A| = r}.

An ordered sample is obtained by taking an element of this sample space and putting its elements
in order. Each element of the sample space can be ordered in r! ways and so (using the formula for
ordered selections without repetition) if |X| = n we have that

r! · |Ω| = n!/(n − r)!,

and so

|Ω| = n!/((n − r)! r!).

This expression is usually written as \binom{n}{r} (read as "n choose r") and is called a binomial
coefficient. By convention \binom{n}{r} = 0 when r > n.
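The binomial coefficient, and the division by r! it encodes, can be checked by enumeration; a small sketch of ours:

```python
from itertools import combinations
from math import comb, factorial

X = range(6)                            # n = 6
r = 2
samples = list(combinations(X, r))      # unordered selection without replacement
print(len(samples), comb(6, r))         # 15 15
# Each unordered sample corresponds to r! = 2 ordered ones:
print(len(samples) * factorial(r))      # 30, which is n!/(n − r)! = 6 * 5
```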

Unordered selection with replacement


We probably won't use this, and you certainly won't need to know it for the exam, but for
completeness: the number of ways of choosing r objects from an n-element set, unordered but with
replacement, is

\binom{n + r − 1}{r}.

This is harder than the other three cases. Have a go at proving it if you're feeling bold. Even if you
don't manage to prove this, try to see why the naive generalisation of the argument for unordered
sampling without replacement doesn't work. In other words, why is the answer not n^r / r!?
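Enumeration confirms the count and also makes the hint concrete: n^r / r! need not even be an integer. A sketch of ours:

```python
from itertools import combinations_with_replacement
from math import comb

n, r = 4, 3
samples = list(combinations_with_replacement(range(n), r))
print(len(samples))        # 20
print(comb(n + r - 1, r))  # 20, the formula above
# The naive guess fails: 4**3 / 3! = 64/6 is not an integer, because samples
# with repeated elements correspond to fewer than r! ordered selections.
print(n**r / 6)            # 10.666...
```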

Summary
We have shown the following:
Theorem. The number of ways of selecting (sampling) r objects from a set of n objects is:

     Ordered?   Replacement?   Number of ways
1)   yes        yes            n^r
2)   yes        no             n(n − 1) ··· (n − r + 1)
3)   no         no             \binom{n}{r}
4)   no         yes            \binom{n + r − 1}{r}

Item 4) is harder, and the proof is omitted.


When answering questions involving sampling, you must decide what sort of sampling is involved.
Specifically: how many objects are you selecting, what set are you selecting from, does
the order matter, and is repetition allowed? There are many examples in your notes and on
problem sheets. Sometimes more than one sort can be used but you must be consistent (see for
example the silver and copper coins example in your lecture notes).

The Binomial Theorem


The following famous theorem, called the binomial theorem, is an application of counting
selections.

Theorem. For all real numbers x and all non-negative integers n, the following holds:

(1 + x)^n = ∑_{r=0}^{n} \binom{n}{r} x^r.   (1)

PROOF. We have

(1 + x)^n = (1 + x)(1 + x) ··· (1 + x)   (n factors)
          = a0 + a1 x + a2 x^2 + ··· + an x^n.

Label each factor (1 + x) in the product, from 1 to n, so as to regard these factors as distinct objects.
They form a set with n elements. To construct the monomial x^r we must select r distinct factors from
which to take x, while taking 1 from the remaining n − r factors. This is an unordered selection of
r objects from an n-element set without repetition, so there are \binom{n}{r} ways of doing it. This is the
coefficient ar, and the result follows. □
If we replace x with y/x in (1) and multiply both sides by x^n, we obtain the following
alternative statement of the theorem:

(x + y)^n = ∑_{r=0}^{n} \binom{n}{r} x^{n−r} y^r,   x, y ∈ ℝ, n ≥ 0.

The binomial theorem can also be proved by induction on n.
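Identity (1) is easy to spot-check numerically; a sketch of ours comparing both sides for particular values:

```python
from math import comb

def binomial_expansion(x, n):
    """Right-hand side of (1): the sum of C(n, r) * x**r over r = 0, ..., n."""
    return sum(comb(n, r) * x**r for r in range(n + 1))

x, n = 3, 5
print((1 + x)**n, binomial_expansion(x, n))  # 1024 1024
```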

EXERCISES

EXERCISE 4.1. Two balls are drawn from an urn containing one white, three black, and four green
balls. (a) What is the probability that the first is white and the second is black? (b) What is the
probability if the first ball is replaced before the second drawing?
[Ans: (a) 3/56; (b) 3/64.]

EXERCISE 4.2. Two fair dice are rolled. Find the probability that the sum of the outcomes is not
equal to 4.
[Ans: 11/12.]
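The stated answer can be confirmed by exact enumeration over the 36 equally likely outcomes; a sketch of ours:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes for two fair dice.
outcomes = list(product(range(1, 7), repeat=2))
p = Fraction(len([w for w in outcomes if sum(w) != 4]), len(outcomes))
print(p)  # 11/12
```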
EXERCISE 4.3. The word "drawer" is spelled with six scrabble tiles. The tiles are then randomly
rearranged. What is the probability of the rearranged tiles spelling the word "reward"?
[Ans: 1/360.]
EXERCISE 4.4. One of the numbers 2, 4, 6, 7, 8, 11, 12 and 13 is chosen at random as the numerator
of a fraction, and then one of the remaining numbers is chosen at random as the denominator of
the fraction. What is the probability of the fraction being in lowest terms?
[Ans: 9/14.]
EXERCISE 4.5. Three cards are drawn at random from a full deck. What is the probability of
getting a three, a seven, and an ace?
[Ans: 2⁴/(5² · 13 · 17).]
EXERCISE 4.6. A three-digit integer is formed at random from the digits {1, 2, 3, 4}, with no
repeated digits. (a) What is the probability that such an integer is even? (b) What is the same
probability if we use the digits {1, 2, 3, 4, 5}?
[Ans: (a) 1/2; (b) 2/5.]
EXERCISE 4.7. An integer is chosen at random from the first 2000 positive integers. What is the
probability that the integer chosen is divisible by 6 or 8?
[Ans: 1/4.]
EXERCISE 4.8. From five married couples, four people are selected. What is the probability that
two men and two women are chosen?
[Ans: 10/21.]
EXERCISE 4.9. Five items are chosen at random from a batch of 100 items and then inspected.
The whole batch is rejected if any of the items is found to be defective. What is the probability of
the batch being rejected if it contains 5 defective items?
[Ans: 1 − (95 · 94 · 93 · 92 · 91)/(100 · 99 · 98 · 97 · 96) ≈ 0.23.]
EXERCISE 4.10. A wooden cube with painted faces is sawed up into 1000 little cubes, all of the
same size. The little cubes are then mixed up, and one is chosen at random. What is the probability
of it having just two painted faces?
[Ans: 0.096.]
EXERCISE 4.11. I have ten books, five each of two titles, and I place them at random on a shelf.
What is the probability that (a) five copies of one title follow five copies of the other title on the
shelf? (b) the two titles alternate on the shelf?
[Ans: (a) 1/126; (b) same as (a).]


EXERCISE 4.12. A batch of 100 items contains 5 defective items. Fifty items are chosen at random
and then inspected. Suppose the whole batch is accepted if no more than one of the 50 inspected
items is defective. What is the probability of accepting the whole batch?
[Ans: (47 · 37)/(99 · 97) ≈ 0.18.]


Conditional Probability

If you are told that event B has occurred, what effect does this have on the probability that A has
occurred?

Definition. If A and B are events and P(B) ≠ 0 then the conditional probability P(A|B) of A given
B is

P(A|B) := P(A ∩ B) / P(B).   (2)
This important definition is often the source of great confusion, partly due to the idiosyncratic
notation. Let us analyse it in detail. The assignment operator := gives meaning to the expression
on the left in terms of what appears on the right. On the right we have a function P, two sets A and
B, their intersection A ∩ B, and a ratio. The condition P(B) ≠ 0 ensures that the right-hand side is
well-defined.
What's on the left-hand side is less transparent. A clearer notation could be

G(A, B) := P(A ∩ B) / P(B).

Now we see that P(A|B) is really a new function of two variables, as in the expression

g(x, y) = f(x + y) / f(y),   f(y) ≠ 0.

Thus, while the symbols A and B have the same meaning on the left and right hand sides of (2),
the symbol P has a different meaning! Also, don't confuse P(A|B) with P(A ∖ B): they are totally
different objects. In particular, A ∖ B has a meaning (it's a set), whereas the expression A|B, taken
in isolation, is meaningless (the symbol | does not represent a set operator, or anything else in
the present context).
Note that the definition does not require that A happens after B. One way of thinking of this
is to imagine that B has been promoted to be the new space of outcomes, and that all elementary
outcomes outside B are ignored. Another way of thinking of this is to imagine that the experiment
is performed secretly and the fact that B occurred is revealed to you (without the full outcome
being revealed).
The conditional probability of A given B is the new probability of A in these circumstances.
Conditional probability measures how the occurrence of some event influences the
chance of another event occurring. For instance, if P(A|B) < P(A), then B occurring makes A
less probable. On the other hand, if P(A|B) > P(A), then B occurring makes A more probable.
The special case that P(A|B) = P(A) is particularly important, and we will discuss it in the next
chapter. Using the definition of conditional probability we see that the condition P(A|B) = P(A)
is equivalent to P(A ∩ B) = P(A)P(B). As we will see later, events satisfying this are said to be
independent.
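Definition (2) can be exercised on a concrete sample space. The following sketch is ours, using two fair dice and exact arithmetic:

```python
from fractions import Fraction
from itertools import product

# Two fair dice: uniform probability on the 36 ordered outcomes.
omega = list(product(range(1, 7), repeat=2))

def P(event):
    return Fraction(len([w for w in omega if event(w)]), len(omega))

def P_given(A, B):
    """P(A|B) = P(A ∩ B) / P(B), assuming P(B) != 0."""
    return P(lambda w: A(w) and B(w)) / P(B)

A = lambda w: sum(w) == 8   # the sum is 8
B = lambda w: w[0] == 6     # the first die shows 6
print(P_given(A, B), P(A))  # 1/6 versus 5/36: here B occurring makes A more probable
```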
One use of conditional probability is to calculate the probability of events which are made up of
smaller events. This is useful when we can work out a conditional probability just by considering

the experiment (in the lecture notes the example of the conditional probability that the second
pen is black given that the first pen is red for instance). In this kind of situation we can use
conditional probability to calculate the probability of the intersection of two events by rearranging
the definition to get:
P(A B) = P(B)P(A|B).
In the lectures we saw some examples of using this approach. We also saw some examples using a
similar method to calculate the probability of an intersection of 3 or 4 events. The full generalisation to the intersection of n events is the content of the next theorem.
For this we need to consider probabilities like P(A3 | A1 ∩ A2), which is defined to be
P(A1 ∩ A2 ∩ A3) / P(A1 ∩ A2) (just apply the definition of conditional probability).

Theorem. For any events A1, A2, . . . , An, we have

P(A1 ∩ A2 ∩ ··· ∩ An) = P(A1) P(A2|A1) P(A3|A1 ∩ A2) ··· P(An | A1 ∩ A2 ∩ ··· ∩ A_{n−1}),

provided that all of the conditional probabilities involved are defined.
The statement of the theorem depends on a natural number n, and we shall prove it by induction
on n (see Mathematical Structures lecture notes). To do this, we first prove that the statement is
true for 2 events. We then prove that, given any k > 2, if the statement is true for k events then it is
also true for k + 1 events. The principle of induction then ensures that the statement is true for all
natural numbers n.
PROOF. Base (starting) case:
When n = 2 the statement is P(A1 ∩ A2) = P(A1)P(A2|A1). This is true by the definition of
conditional probability, rearranged.
Inductive step:
Let k ≥ 2 and suppose that the statement is true for k events. We will use this to prove that the
statement is true for k + 1 events. If we can do this then we may conclude that it is true for all n.
By definition,

P(A_{k+1} | A1 ∩ A2 ∩ ··· ∩ Ak) = P(A1 ∩ ··· ∩ A_{k+1}) / P(A1 ∩ ··· ∩ Ak).

Rearranging this we get

P(A1 ∩ ··· ∩ A_{k+1}) = P(A1 ∩ ··· ∩ Ak) P(A_{k+1} | A1 ∩ A2 ∩ ··· ∩ Ak).

The fact that the result is true for k events means that

P(A1 ∩ ··· ∩ Ak) = P(A1) P(A2|A1) ··· P(Ak | A1 ∩ A2 ∩ ··· ∩ A_{k−1}).

Substituting this into the previous equation gives

P(A1 ∩ ··· ∩ A_{k+1}) = P(A1) P(A2|A1) ··· P(Ak | A1 ∩ ··· ∩ A_{k−1}) P(A_{k+1} | A1 ∩ ··· ∩ Ak).

But this is precisely the statement of the theorem for k + 1 events. □
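A standard use of the multiplication rule is computing the probability of a run of draws without replacement. This example is ours, not from the notes: the probability that the first three cards drawn from a 52-card deck are all aces, built factor by factor.

```python
from fractions import Fraction

# P(A1 ∩ A2 ∩ A3) = P(A1) P(A2|A1) P(A3|A1 ∩ A2)
# where Ai is the event "the ith card drawn is an ace".
p = Fraction(4, 52) * Fraction(3, 51) * Fraction(2, 50)
print(p)  # 1/5525
```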

5.1 Ordered Sampling Revisited

Conditional probability provides an alternative approach to questions involving ordered sampling.


Suppose that I select r things, in order, from a set X with |X| = n.
If x1, . . . , xr are elements of X and we let Ai be the event "the ith pick is xi", then we can use
Theorem 5.1 to say that

P(I pick (x1, . . . , xr)) = P(A1 ∩ A2 ∩ ··· ∩ Ar)
                          = P(A1) P(A2|A1) P(A3|A1 ∩ A2) ··· P(Ar | A1 ∩ A2 ∩ ··· ∩ A_{r−1}).

Also, if Y ⊆ X with |Y| = m and we let Bi be the event "the ith pick is in Y", then the same theorem
gives

P(All r selections are elements of Y) = P(B1 ∩ B2 ∩ ··· ∩ Br)
                                      = P(B1) P(B2|B1) P(B3|B1 ∩ B2) ··· P(Br | B1 ∩ B2 ∩ ··· ∩ B_{r−1}).
Now suppose that the selection is done without replacement and that x1, . . . , xr are distinct. If
A1 ∩ ··· ∩ A_{i−1} occurs then the ith pick involves selecting from the set X ∖ {x1, . . . , x_{i−1}}, where xi
is not equal to any of x1, . . . , x_{i−1}. So P(Ai | A1 ∩ ··· ∩ A_{i−1}) = 1/(n − i + 1). It follows that

P(I pick (x1, . . . , xr)) = (1/n) (1/(n−1)) ··· (1/(n−r+1)) = (n − r)!/n!.

Similarly, P(Bi | B1 ∩ ··· ∩ B_{i−1}) = (m − i + 1)/(n − i + 1), since if B1 ∩ ··· ∩ B_{i−1} occurs then the ith pick involves
selecting from a set of n − i + 1 elements, m − i + 1 of which are in Y. Hence

P(All r selections are elements of Y) = (m/n) ((m−1)/(n−1)) ··· ((m−r+1)/(n−r+1))
                                      = m!(n − r)! / (n!(m − r)!).
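The formula m!(n − r)!/(n!(m − r)!) can be checked against direct counting over the sample space of ordered samples; a sketch of ours with small parameters:

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

n, r, m = 5, 3, 3
X = range(n)
Y = set(range(m))                  # a subset of X with m elements

omega = list(permutations(X, r))   # ordered sampling without replacement
good = [s for s in omega if set(s) <= Y]
by_counting = Fraction(len(good), len(omega))
by_formula = Fraction(factorial(m) * factorial(n - r),
                      factorial(n) * factorial(m - r))
print(by_counting, by_counting == by_formula)  # 1/10 True
```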

If the selection is done with replacement then the probability of the ith pick being xi does not
depend on earlier picks:

P(Ai | A1 ∩ A2 ∩ ··· ∩ A_{i−1}) = P(I pick element xi from the set X) = 1/n.

So for any x1, . . . , xr ∈ X we get

P(I pick (x1, . . . , xr)) = P(1st pick is x1) P(2nd is x2) ··· P(rth is xr)
                          = (1/n) (1/n) ··· (1/n)
                          = 1/n^r

and

P(All r selections are elements of Y) = P(1st pick is in Y) ··· P(rth pick is in Y)
                                      = (m/n) (m/n) ··· (m/n)
                                      = m^r/n^r.
We will see later that the with-replacement case was simpler because the events corresponding
to each pick are mutually independent.
Of course, the answers we get here are the same as the ones you would get by using the ideas
in the sampling section and calculating P(A) = |A|/|Ω|, where Ω is the sample space for the complete
experiment of picking r things. Some of you will prefer this method while some will prefer the
previous one. This method has the advantage that it is a little bit slicker. Also, the assumption we
make (that each pick is equally likely to be each of the remaining elements) is slightly simpler than
the assumption we made previously (that each possible selection of r things is equally likely).


Independence

As we saw in lectures, the probability P(A ∩ B) may or may not be equal to P(A)P(B); don't fall
into the trap of assuming that this is always true.

Definition. We say that the events A and B are independent if

P(A ∩ B) = P(A) · P(B).
You may assume that events are independent in the following situations:
i) they are clearly physically unrelated (e.g., depend on different coin tosses),
ii) you calculate their probabilities and find that P(A B) = P(A)P(B) (e.g., to test whether
events are correlated),
iii) the question tells you that the events are independent!
WARNING: Independence is not the same as "physically unrelated". For example, we saw that if a
fair die is rolled twice then the event "first roll is a 6" and the event "both rolls produce the same
number" are independent.
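The die example from the warning can be verified directly; a sketch of ours using exact arithmetic over the 36 outcomes:

```python
from fractions import Fraction
from itertools import product

# Two rolls of a fair die: uniform probability on the 36 ordered outcomes.
omega = list(product(range(1, 7), repeat=2))

def P(event):
    return Fraction(len([w for w in omega if event(w)]), len(omega))

A = lambda w: w[0] == 6       # first roll is a 6
B = lambda w: w[0] == w[1]    # both rolls produce the same number
both = P(lambda w: A(w) and B(w))
print(both, P(A) * P(B), both == P(A) * P(B))  # 1/36 1/36 True
```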
The next theorem connects independence to conditional probabilities.
Theorem. Let A and B be events with P(A) > 0 and P(B) > 0. The following are equivalent:
1. A and B are independent
2. P(A|B) = P(A)
3. P(B|A) = P(B).
This result says roughly that if A and B are independent then telling you that A occurred does
not change the probability that B occurred.
We spent some time discussing what a result of this type really means. One way of thinking of
it is that for any two events A and B we have that the 3 statements in the theorem are either all true
or all false.
PROOF. It will be sufficient to show that 1 ⇒ 2, 2 ⇒ 3, and 3 ⇒ 1 (think about why this is
enough).
1 ⇒ 2: (That is, if A and B are independent then P(A|B) = P(A).)
Suppose that A and B are independent. By the definition of conditional probability we have
P(A|B) = P(A ∩ B)/P(B) = P(A)P(B)/P(B), because A and B are independent. Cancelling, we get
P(A|B) = P(A) as required.
2 ⇒ 3:
Suppose that P(A|B) = P(A). By the definition of conditional probability this means that
P(A ∩ B)/P(B) = P(A). Rearranging (using that P(A) ≠ 0) we get P(A ∩ B)/P(A) = P(B). That is,
P(B|A) = P(B).
3 ⇒ 1:
Suppose that P(B|A) = P(B). By definition this means that P(A ∩ B)/P(A) = P(B). Rearranging we
get P(A ∩ B) = P(A)P(B). That is, A and B are independent events. □

If we want to define independence between more than two events, things get more complicated.
We saw an example where any two of A, B, C were independent but

P(A ∩ B ∩ C) ≠ P(A)P(B)P(C).
Definition. We say that the events A1, A2, . . . , An are mutually independent (sometimes also written
as "each event is independent of all the others") if for every 1 ≤ i1 < i2 < ··· < it ≤ n, 1 ≤ t ≤ n, we
have

P(A_{i1} ∩ A_{i2} ∩ ··· ∩ A_{it}) = P(A_{i1}) P(A_{i2}) ··· P(A_{it}).

That is to say, for every subset I of the events, the probability that all events in I occur is the
product of the probabilities of the individual events in I. If you find this confusing then think about what
it says in the case n = 3 first.
Do not confuse mutually independent with mutually exclusive; they mean completely different
things. In fact, two mutually exclusive events are never independent (unless one has probability 0).
(Why not?)
Example. A coin which has probability p of showing heads is tossed n times in succession. What
is the probability that heads comes up exactly r times?
We may assume that the results of the tosses are mutually independent (they are physically
unrelated things). So to find the probability of a particular sequence of heads and tails we just
multiply together the appropriate probability for each toss. That is
    P(hh…h tt…t) = p^r (1 − p)^(n−r)
    (r heads followed by n − r tails).

Similarly any sequence of r heads and n − r tails will have probability p^r (1 − p)^(n−r). There are C(n, r) such sequences (where C(n, r) denotes the binomial coefficient "n choose r"), and so

    P(exactly r heads) = C(n, r) p^r (1 − p)^(n−r).

We will see this again when we discuss the binomial distribution.
We saw several examples of calculating probabilities of events like this which are built up of
several independent events. There are more examples on the exercise sheets.
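For a small n, the formula in the example above can be checked by brute force, summing the probability of every one of the 2ⁿ head/tail sequences; a minimal sketch (the values n = 5 and p = 0.3 are arbitrary choices, not from the notes):

```python
from itertools import product
from math import comb

def prob_exactly_r_heads(n, r, p):
    """P(exactly r heads in n independent tosses) via C(n, r) p^r (1-p)^(n-r)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

def prob_by_enumeration(n, r, p):
    """Brute-force check: sum the probability of every length-n sequence with r heads."""
    total = 0.0
    for seq in product("ht", repeat=n):
        heads = seq.count("h")
        if heads == r:
            # independence: multiply the probability of each individual toss
            total += p**heads * (1 - p)**(n - heads)
    return total

n, p = 5, 0.3
for r in range(n + 1):
    assert abs(prob_exactly_r_heads(n, r, p) - prob_by_enumeration(n, r, p)) < 1e-12
```

The agreement of the two functions for every r illustrates that all C(n, r) sequences with r heads really do carry the same probability.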


7 More on Conditional Probability

We mentioned that conditional probability can be used as an aid to calculating probabilities. For
this purpose, the total probability theorem is a very useful tool. To state it, we need a definition.
Definition. The events E₁, E₂, …, Eₙ partition Ω (or form a partition of Ω) if they are pairwise disjoint and their union is the whole of Ω.
Theorem 7.1. [Theorem of total probability] If E₁, E₂, …, Eₙ partition Ω and P(Eᵢ) ≠ 0 for all i then for any event A we have

    P(A) = Σ_{i=1}^n P(A|Eᵢ)P(Eᵢ).

Often we know the conditional probability of an event under certain assumptions. This result tells us how to aggregate these conditional probabilities to find the probability of the event, a technique which is sometimes called conditioning; it is a form of "divide and conquer". You saw several examples of calculating probabilities by this method in the lecture notes and on problem sheet 6.
A useful special case of Theorem 7.1 is that if 0 < P(E) < 1 (hence P(E) ≠ 0 and P(Eᶜ) ≠ 0), then

    P(A) = P(A|E)P(E) + P(A|Eᶜ)P(Eᶜ).

PROOF. [Proof of Theorem 7.1] Let Aᵢ = A ∩ Eᵢ. The events A₁, …, Aₙ are pairwise disjoint (because E₁, …, Eₙ are pairwise disjoint) and their union is A. Applying Axiom III we get

    P(A) = P(A₁ ∪ ⋯ ∪ Aₙ) = Σ_{i=1}^n P(Aᵢ).

Now, using the definition of conditional probability, we have P(Aᵢ) = P(A ∩ Eᵢ) = P(A|Eᵢ)P(Eᵢ), and hence P(A) = Σ_{i=1}^n P(A|Eᵢ)P(Eᵢ). □
There is also an analogue of the theorem of total probability for conditional probabilities.
Theorem. If E₁, E₂, …, Eₙ partition Ω, and A and B are events with P(B ∩ Eᵢ) > 0 for all i then

    P(A|B) = Σ_{i=1}^n P(A|B ∩ Eᵢ)P(Eᵢ|B).

The proof is in your lecture notes.


As we have seen, P(A|B) and P(B|A) are very different things. The following theorem relates these two conditional probabilities and is used when you need to calculate P(B|A) from P(A|B).
Theorem. [Bayes' theorem] If A and B are events with P(A), P(B) > 0 then

    P(B|A) = P(A|B)P(B) / P(A).

Remark. The conclusion of this theorem is sometimes stated as

    P(B|A) = P(A|B)P(B) / (P(A|B)P(B) + P(A|Bᶜ)P(Bᶜ))

(we have just applied the theorem of total probability to the denominator).

A popular example concerns medical trials. We gave an example in lectures of a test for a disease which has a 90% chance of correctly identifying the presence of the disease (in other words, the conditional probability of a positive result given that you have the disease is 9/10) and only a 1% chance of giving a false positive (i.e., identifying the disease in a healthy patient).
Despite this, if the disease is a rare one affecting only 2% of the population (and if there was no reason to assume that the person has the disease before the test was performed), the conditional probability of not having the disease given a positive test is surprisingly large (around 0.35).
The prosecutor's fallacy mentioned in lecture 1 is another example with some similar features. One way of thinking of these examples is that we have changed our assessment of the likelihood of B from P(B) to P(B|A) given the information that A occurs. This updating of probabilities in the light of evidence is called Bayesian inference by statisticians.
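The medical-test numbers quoted above can be checked directly with Bayes' theorem; a short sketch:

```python
# Bayes' theorem on the medical-test example from the text.
# B = "patient has the disease", A = "test is positive".
p_B = 0.02             # prevalence P(B)
p_A_given_B = 0.90     # P(A|B): the test detects the disease 90% of the time
p_A_given_notB = 0.01  # P(A|B^c): false-positive rate

# denominator P(A) via the theorem of total probability
p_A = p_A_given_B * p_B + p_A_given_notB * (1 - p_B)

p_B_given_A = p_A_given_B * p_B / p_A
assert abs((1 - p_B_given_A) - 0.3525) < 1e-3  # P(no disease | positive) ≈ 0.35
```

The posterior probability of actually having the disease given a positive test is only about 0.65, which matches the "around 0.35" figure for not having it.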


8 Introduction to Random Variables

As usual, we have a sample space Ω and a probability function P which assigns a number to each event. Sometimes we are interested not in the exact outcome but only in some aspect of it. For instance, if the outcome is a sequence of coin tosses, we may only want to know how many heads occur. Random variables are introduced to handle this sort of situation.
Definition. A random variable is a function from Ω to R.
Note that a random variable is not a variable and is not random. As we develop the theory of
random variables, the reasons for this seemingly incongruous terminology will emerge.
We remark that if Ω is uncountable (namely infinite but not in one-to-one correspondence with N, e.g., Ω = R), then the given definition of random variable is not quite correct. It turns out that some functions are so pathological that they cannot be regarded as random variables. This should not be entirely surprising: we have already seen that if Ω is uncountable there are sets which are too pathological to be regarded as events. This subtlety is well beyond the scope of this module and will not concern us at all. Every reasonable function that you would ever want to use will be a legitimate random variable.
We use capital letters for random variables. Informally, a random variable allows us to describe a measurement on the outcome of an experiment, as long as this measurement is a real number.
A statement like X = 3 represents an event, namely the event

    A = {ω ∈ Ω : X(ω) = 3},

that is, the set of all outcomes on which X takes the value 3. In other words, A is the subset of Ω which consists of all ω for which the mathematical sentence X(ω) = 3 is true. Thus if X does not assume the value 3, then A = ∅ (the impossible event). Likewise, if X is the constant function ω ↦ 3 (not a very useful random variable!), then A = Ω, the certain event.
Definition. Let X be a random variable. The function which, given k, has value (output) P(X = k) is called the probability mass function or pmf of X, denoted by p_X (or, simply, p). In symbols:

    p_X(k) = P({ω ∈ Ω : X(ω) = k}).

Thus the domain of p_X is R. Give yourself some time to absorb this definition and the associated notation.
We saw several examples of random variables in the lectures. Here is one of them.
EXAMPLE. A coin which has probability p of coming up heads is tossed three times. Let X be the number of heads observed, for example, X(hht) = 2. Then X is a random variable, namely a function from Ω to R: each element of Ω is mapped by X to a real number, in this case to 2.
The event X = 2 is {hht, hth, thh} and so

    P(X = 2) = P({hht, hth, thh}) = 3p²(1 − p).

Similarly, you can work out P(X = 0), P(X = 1), P(X = 3) and get that the probability mass function is

    k        | 0        | 1          | 2          | 3
    P(X = k) | (1 − p)³ | 3p(1 − p)² | 3p²(1 − p) | p³
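The pmf worked out above can be reproduced by listing the sample space explicitly; a minimal sketch (the value p = 0.3 is an arbitrary choice):

```python
from itertools import product

# Reconstruct the pmf of X = number of heads in three tosses by listing all
# eight outcomes of the sample space explicitly.
p = 0.3
pmf = {k: 0.0 for k in range(4)}
for outcome in product("ht", repeat=3):        # Omega = {hhh, hht, ..., ttt}
    heads = outcome.count("h")
    pmf[heads] += p ** heads * (1 - p) ** (3 - heads)

expected = {0: (1-p)**3, 1: 3*p*(1-p)**2, 2: 3*p**2*(1-p), 3: p**3}
for k in range(4):
    assert abs(pmf[k] - expected[k]) < 1e-12
assert abs(sum(pmf.values()) - 1) < 1e-12      # the pmf sums to 1
```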
Definition. A random variable X is discrete if the set of values it assumes,

    {X(ω) : ω ∈ Ω}

(the range of X), is either finite or countably infinite.
Note that the definition of random variable implies that the set in the formula above is a subset of R. For instance, if the set of values a random variable assumes is N (such as the number of coin tosses until the first head is seen) or Z, then this variable is discrete. If the set of values a random variable assumes is R or some interval of real numbers (for example the height of a randomly chosen person), then the random variable is not discrete.
Discrete random variables are conceptually simpler, and we will mainly be studying these.
Later in the module we will look at a special type of non-discrete random variables called continuous random variables.
Lemma 8.1. If X is a discrete random variable then

    Σ_k p_X(k) = Σ_k P(X = k) = 1,

where the sum is over all values the random variable takes⁴.
If X assumes finitely many values k₁, k₂, …, kₙ, then this result may be written as

    Σ_{i=1}^n P(X = kᵢ) = 1.

If X assumes a countably infinite number of values k₁, k₂, …, we write instead

    Σ_{i=1}^∞ P(X = kᵢ) = 1.

To treat the finite and infinite cases with the same notation, we write

    Σ_{i≥1} P(X = kᵢ) = 1.

Lemma 8.1 provides a useful check when calculating the pmf.

PROOF. Let Eᵢ be the event X = kᵢ, for i = 1, 2, …. Then the sets Eᵢ partition Ω. Indeed, if i ≠ j, then we must have Eᵢ ∩ Eⱼ = ∅, because if Eᵢ and Eⱼ had a point ω in common, then the function X would have two different values at ω. Likewise, the union of the Eᵢ is Ω, because Ω is the domain of the function X. The lemma now follows from Axiom III.
For an alternative proof that the Eᵢ partition Ω, we consider the sentence X(ω₁) = X(ω₂) (with ω₁, ω₂ ∈ Ω) and we note that this is a relation ω₁ R ω₂ on the set Ω. It is immediate to see that this relation is reflexive, symmetric, and transitive (check it!), namely it is an equivalence relation. The equivalence classes of this relation are precisely the sets Eᵢ, and we know from general theory that the equivalence classes of an equivalence relation on a set partition that set. □

⁴ If the set of values X can take is infinite we don't have a formal definition of what this sum means. However, in most cases we will look at you will be able to work it out (for instance you know how to find the sum of infinitely many terms of a geometric progression).
In what follows, whenever we write Σ_k, we mean the summation over all values assumed by a random variable X. It may be helpful to arrange such values in a finite or infinite sequence k₁, k₂, … (which is always possible if X is discrete) and then consider an appropriate sum of the type Σ_{i=1}^n, Σ_{i=1}^∞, or Σ_{i≥1}.
A random variable is characterised by two key properties: its expectation and variance.
Definition. The expectation of a discrete random variable X is

    Σ_k k P(X = k),

where the summation is over all values X assumes; it is denoted by E(X).
Note that the value of the sum could be infinite, in which case we say that the expectation of X is infinite. If X assumes the sequence of values k₁, k₂, … (we can form such a sequence, finite or infinite, because X is discrete), then we can rewrite the expectation as

    Σ_{i≥1} kᵢ P(X = kᵢ) = k₁P(X = k₁) + k₂P(X = k₂) + ⋯

Thus to compute the expectation of X, we must first determine its pmf (the values kᵢ assumed by X and the corresponding probabilities), and then perform a summation.
Definition. The variance of a discrete random variable X is

    Σ_k (k − E(X))² P(X = k),

where the summation is over all values assumed by X; it is denoted by Var(X).


The expectation of a random variable is a generalisation of the notion of mean of a sequence k₁, k₂, …, kₙ of numbers. The latter is given by

    (1/n) Σ_{i=1}^n kᵢ = Σ_{i=1}^n kᵢ · (1/n).

We see that the mean coincides with the expectation if we regard these numbers as equiprobable
values of a random variable. The variance measures how concentrated the values of X are about
its expectation, whereby a small variance means sharply concentrated and a large variance means
spread out.
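Both definitions translate directly into code; a minimal sketch, with the pmf stored as a dict of value–probability pairs (the helper names and the sample values are invented for illustration):

```python
# Expectation and variance of a discrete random variable straight from the
# definitions, with the pmf stored as a dict {value: probability}.
def expectation(pmf):
    return sum(k * pk for k, pk in pmf.items())

def variance(pmf):
    mu = expectation(pmf)
    return sum((k - mu) ** 2 * pk for k, pk in pmf.items())

# The mean of a sequence coincides with the expectation when the values are
# regarded as equiprobable (three arbitrary values, each with probability 1/3).
data = [2, 4, 9]
pmf = {k: 1/3 for k in data}
assert abs(expectation(pmf) - sum(data) / 3) < 1e-12
assert abs(variance(pmf) - 26/3) < 1e-12   # spread of the values about the mean
```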
Alternative formulae for mean and variance are given by
Proposition 8.2 (Alternative formulae for expectation and variance). If X is a discrete random variable, then

    i) E(X) = Σ_{ω∈Ω} X(ω)P(ω).

    ii) Var(X) = Σ_k k² P(X = k) − E(X)².
Some mathematicians regard formula ii) as the definition of variance. Since this formula and
the one given above have the same value, this makes no difference.
EXAMPLE. For the three coins example described earlier we have

    E(X) = 0·(1 − p)³ + 1·3p(1 − p)² + 2·3p²(1 − p) + 3·p³
         = 3p[(1 − p)² + 2p(1 − p) + p²] = 3p

and (using Prop. 8.2)

    Var(X) = 0²·(1 − p)³ + 1²·3p(1 − p)² + 2²·3p²(1 − p) + 3²·p³ − (3p)²
           = 3p(1 − p).
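These identities can be confirmed exactly for a particular value of p with rational arithmetic; a sketch using the arbitrary choice p = 1/4:

```python
from fractions import Fraction

# Exact check that the three-coin pmf gives E(X) = 3p and Var(X) = 3p(1-p).
p = Fraction(1, 4)
pmf = {0: (1 - p)**3, 1: 3*p*(1 - p)**2, 2: 3*p**2*(1 - p), 3: p**3}

E = sum(k * pk for k, pk in pmf.items())
Var = sum(k**2 * pk for k, pk in pmf.items()) - E**2   # Prop. 8.2 ii)

assert E == 3 * p
assert Var == 3 * p * (1 - p)
```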
Sometimes it is useful to apply a function to a random variable to produce a new random variable. If f : R → R then f(X) is a random variable; its expectation is given by⁵

    E(f(X)) = Σ_k f(k)P(X = k).

This follows immediately from Proposition 8.2 i).
With this notation we can rewrite the definition of the variance of a discrete random variable X as

    Var(X) = E((X − E(X))²)

((X − E(X))² is the random variable obtained by applying the function x ↦ (x − E(X))² to X). Also, Proposition 8.2 ii) becomes

    Var(X) = E(X²) − E(X)² = E(X²) − (E(X))².

(Be sure to distinguish between E(X²) and (E(X))²; they are completely different things.)

⁵ There is a slight subtlety here. According to the definition we should take the sum over all values y that f(X) takes of yP(f(X) = y), rather than the sum over all values x that X takes of f(x)P(X = x). The individual terms may be different if several values for X yield the same value for f(X). However a moment's thought shows that the sum of them is the same.

Proposition 8.3 (Properties of expectation). Let X and Y be discrete random variables, and let c ∈ R.

i) E(c) = c.

ii) E(cX) = cE(X).

iii) E(X + Y) = E(X) + E(Y).

iv) If m ≤ X(s) ≤ M for all s ∈ S then m ≤ E(X) ≤ M.

v) If there exists a with P(X = a − t) = P(X = a + t) for all t > 0 (that is, X is symmetric about a) then E(X) = a.
PROOF.
i) The constant random variable c takes the value c with probability 1. So E(c) = c · 1 = c.
ii)

    E(cX) = Σ_k ck P(X = k) = c Σ_k k P(X = k) = cE(X).

iii) We use Proposition 8.2 i).

    E(X + Y) = Σ_ω [X(ω) + Y(ω)]P(ω) = Σ_ω X(ω)P(ω) + Σ_ω Y(ω)P(ω) = E(X) + E(Y).

iv) Since every value that X takes does not exceed M, we have that

    E(X) = Σ_k k P(X = k) ≤ M Σ_k P(X = k) = M.

Similarly, since every value that X takes is ≥ m, we have that

    E(X) = Σ_k k P(X = k) ≥ m Σ_k P(X = k) = m.

v)

    E(X) = Σ_k k P(X = k)
         = aP(X = a) + Σ_{t>0} ((a − t)P(X = a − t) + (a + t)P(X = a + t))
         = aP(X = a) + Σ_{t>0} (aP(X = a − t) + aP(X = a + t))
           (the terms involving t cancel because P(X = a − t) = P(X = a + t))
         = a Σ_k P(X = k)
         = a. □

Proposition 8.4 (Properties of variance). Let X be a discrete random variable and c ∈ R be a constant.

i) Var(X) ≥ 0.

ii) Var(c) = 0.

iii) Var(X + c) = Var(X).

iv) Var(cX) = c²Var(X).
PROOF. For most of these we can either use the definition of variance or Proposition 8.2. However, one or other of these approaches may be easier.

i) By definition Var(X) = Σ_k (k − E(X))² P(X = k). Since the square of any real number is non-negative and probabilities are all non-negative, each summand is non-negative. It follows that Var(X) ≥ 0.

ii) Since E(c) = c we have that Var(c) = (c − c)² · 1 = 0.

iii) By Proposition 8.2 ii)

    Var(X + c) = E((X + c)²) − (E(X + c))²
               = E(X² + 2cX + c²) − (E(X) + c)²
               = E(X²) + 2cE(X) + c² − ((E(X))² + 2cE(X) + c²)
               = E(X²) − (E(X))²
               = Var(X).

iv) By Proposition 8.2 ii)

    Var(cX) = E((cX)²) − (E(cX))²
            = c²E(X²) − (cE(X))²      (by Proposition 8.3 ii))
            = c²(E(X²) − (E(X))²)
            = c²Var(X). □

You could combine parts of Propositions 8.3 and 8.4 into the following statements, valid for any random variables X, Y and real numbers a, b:

    E(aX + bY) = aE(X) + bE(Y),
    Var(aX + b) = a²Var(X).
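Both combined statements can be verified exactly on a small pmf by transforming the pmf directly: the values move to ak + b while the probabilities stay put. A sketch with an arbitrary pmf and arbitrary a, b:

```python
from fractions import Fraction

# Check E(aX + b) = aE(X) + b and Var(aX + b) = a^2 Var(X) exactly.
pmf = {0: Fraction(1, 2), 1: Fraction(1, 3), 5: Fraction(1, 6)}
a, b = Fraction(-2), Fraction(7)

def E(p):
    return sum(k * pk for k, pk in p.items())

def Var(p):
    return sum(k * k * pk for k, pk in p.items()) - E(p) ** 2

pmf_Y = {a * k + b: pk for k, pk in pmf.items()}  # pmf of Y = aX + b

assert E(pmf_Y) == a * E(pmf) + b
assert Var(pmf_Y) == a ** 2 * Var(pmf)
```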
EXAMPLE. Consider our example of tossing a coin three times. Let Y be the number of tails observed. Clearly Y = 3 − X (where as before X is the number of heads observed)⁶. We worked out that E(X) = 3p and Var(X) = 3p(1 − p). So, using parts i), ii) and iii) of Proposition 8.3, E(Y) = 3 − E(X) = 3(1 − p). Using parts iii) and iv) of Proposition 8.4, Var(Y) = (−1)²Var(X) = 3p(1 − p).

⁶ Remember that this identity connects two functions. More explicitly, it means that for any i ∈ S we have Y(i) = 3 − X(i).


9 Some Special Discrete Random Variables

We say that two discrete random variables X and Y have the same distribution if the probability mass functions of X and Y are the same. In this case we write X ∼ Y.
If X = Y, then clearly X ∼ Y, but the converse is not true. For instance, in a sequence of fair coin tosses the number of heads and the number of tails seen are different random variables, but they have the same distribution.
Certain distributions occur so often that they have been given special names. In this section we
study some of them (Bernoulli, Binomial, Hypergeometric, Geometric, Poisson). In each case we
will determine expectation and variance, and describe the sort of situation the distribution occurs
in.

Bernoulli distribution
Suppose that a trial (experiment) with two outcomes is performed. We will call the outcomes success and failure and let P(success) = p (and hence P(failure) = 1 − p). Such a trial is called a Bernoulli trial (or a Bernoulli(p) trial if we wish to emphasise the probability).
The random variable X has the Bernoulli(p) distribution (write X ∼ Bernoulli(p)) if it has probability mass function

    k        | 0     | 1
    P(X = k) | 1 − p | p

The expectation and variance of X can be calculated from the pmf in the usual way.

    E(X) = 0·(1 − p) + 1·p = p,
    Var(X) = 0²·(1 − p) + 1²·p − p² = p(1 − p).

Let A be any event. The random variable I_A, defined as

    I_A(ω) = 1 if ω ∈ A,
             0 if ω ∉ A,

called the indicator function (or characteristic function) of the set A, has the Bernoulli(P(A)) distribution.

Binomial distribution
If n independent⁷ Bernoulli trials are performed, and we let X be the number of trials which result in success, then X has the binomial distribution. We write X ∼ Bin(n, p). The random variable X takes values in {0, 1, 2, …, n} and has pmf

    P(X = k) = C(n, k) p^k (1 − p)^(n−k)   for 0 ≤ k ≤ n.

⁷ This means that if we let Eᵢ be the event that the ith trial results in success then the events E₁, …, Eₙ are mutually independent.

Figure 1: Binomial distribution Bin(n, p), for n = 50 and p = 1/10 (blue), p = 1/2 (red), p = 3/4 (black).

(Check: Σ_{k=0}^n C(n, k) p^k (1 − p)^(n−k) = (p + (1 − p))^n = 1.)
We showed in lectures that:

    E(X) = np,    Var(X) = np(1 − p).
For fixed n, the expectation is proportional to p. By contrast, the variance is a quadratic function of
p, which achieves its maximum at p = 1/2. The binomial distribution for n = 50 and three values
of p is displayed in figure 1.
We showed this directly in lectures using the Binomial theorem and the trick of looking at E(X(X − 1)). We will see a further proof later in the course and there is another method using generating functions which you may see in books.
If X ∼ Bin(n, p), then Y = n − X is the number of failures in n independent Bernoulli(p) trials. It follows that Y ∼ Bin(n, 1 − p).
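The binomial facts above are easy to sanity-check numerically from the pmf; a sketch with arbitrary parameters n = 50, p = 0.3:

```python
from math import comb

# Check that the Bin(n, p) pmf sums to 1 and that its mean and variance
# match np and np(1-p).
n, p = 50, 0.3
pmf = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]

mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum(k * k * pk for k, pk in enumerate(pmf)) - mean ** 2

assert abs(sum(pmf) - 1) < 1e-12
assert abs(mean - n * p) < 1e-9           # E(X) = np
assert abs(var - n * p * (1 - p)) < 1e-9  # Var(X) = np(1-p)
```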
One example of the binomial distribution relates to sampling. Suppose that a bag contains N balls, M of which are red, and I pick n of them randomly with replacement. Each random pick will result in a red ball with probability M/N and the outcome of each pick is independent of all the others (since we are sampling with replacement). So the number of red balls has distribution Bin(n, M/N). It has expectation n(M/N) and variance n(M/N)((N − M)/N).

Hypergeometric distribution
Suppose that I perform the previous sampling example but this time without replacement. Let R be the number of red balls among the n balls chosen. The random variable R takes values in {0, 1, 2, …, M} (some with zero probability if n < M or n > N − M) and has pmf

    P(R = k) = C(M, k) C(N − M, n − k) / C(N, n)   for 0 ≤ k ≤ M.

This is just a standard sampling problem. There are C(M, k) ways of choosing the k red balls; for each of these there are C(N − M, n − k) ways of choosing the n − k balls which are not red; the sample space has size C(N, n).
We say that R has the hypergeometric distribution, and write R ∼ Hg(n, M, N). It can be shown that:

    E(R) = n(M/N),
    Var(R) = n(M/N)((N − M)/N)((N − n)/(N − 1)).

The derivation of these formulae will be dealt with in a problem sheet.
It is interesting to compare this with Bin(n, M/N) (the analogous random variable when the sampling is done with replacement). The expectation is the same as for the hypergeometric case and the variance differs by a factor of (N − n)/(N − 1); for n large enough, this makes the variance smaller so the random variable is more sharply concentrated. However when N is very large compared with n the factor (N − n)/(N − 1) is close to 1, so the variance is close to that of Bin(n, M/N).

Geometric distribution
Now suppose that we perform an unlimited number of independent Bernoulli trials and let X be the number of trials up to and including the first success. In this case X has the geometric distribution. We write X ∼ Geom(p). The random variable X takes values in {1, 2, 3, …} and has pmf

    P(X = k) = p(1 − p)^(k−1)   for k ≥ 1.

Notice also that the event X > k occurs if and only if the first k trials all result in failure. Hence,

    P(X > k) = (1 − p)^k.

We showed in lectures that:

    E(X) = 1/p,    Var(X) = (1 − p)/p².
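The geometric facts can be checked numerically by truncating the infinite sums at K terms, where the neglected tail (1 − p)^K is astronomically small; a sketch with the arbitrary choice p = 0.4:

```python
# Numerical check of the Geom(p) pmf, tail probability and expectation.
p, K = 0.4, 200
pmf = [p * (1 - p) ** (k - 1) for k in range(1, K + 1)]   # pmf[j] = P(X = j+1)

assert abs(sum(pmf) - 1) < 1e-12                 # total probability (to K terms)
for k in (1, 3, 10):
    tail = sum(pmf[j] for j in range(k, K))      # P(X > k)
    assert abs(tail - (1 - p) ** k) < 1e-12      # P(X > k) = (1-p)^k

mean = sum(k * pk for k, pk in zip(range(1, K + 1), pmf))
assert abs(mean - 1 / p) < 1e-9                  # E(X) = 1/p
```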

Poisson distribution
Suppose that incidents of some kind occur at random times, but at an average rate λ per unit time. Two examples are the number of emissions from a radioactive source and the number of phone calls received by a call centre. Let X be the number of these incidents which occur in unit time.
The random variable X has a Poisson distribution (write X ∼ Poisson(λ)). It assumes the values {0, 1, 2, …}, and has pmf

    P(X = k) = e^(−λ) λ^k/k!   for k ≥ 0.

Where does this expression come from? It turns out that the Poisson distribution arises as a limit of binomial distributions. To see this, let us divide the unit of time we are interested in into n subintervals. By choosing a sufficiently large value of n, we ensure that the quantity λ/n is very small. Under these circumstances, the occurrence of 1 incident in each subinterval is a rare event, and we expect the probability of this event to be λ/n. Because λ/n is very small, the probability that there be 2 or more incidents in a subinterval is so small that we can neglect it (you can verify that this probability is very close to (λ/n)²).
   k 

n

nk
P(X = k)
1
k
n
n




k k
n
n(n 1) . . . (n k + 1)
1
1
.
=
n
k!
n
nk
As n is made large, this gives a better and better approximation. Now


 

1
2
nk+1
k
n(n 1)(n 2) (n k + 1) = n 1
1
1
,
n
n
n
and therefore

where

k
P(X = k) U(n, k)
k!



n
1
n



 


1
2
nk+1
k
U(n, k) = 1
1
1
1
.
n
n
n
n

Now fix k, and consider the limit n . Since each term in the above product tends to 1, we
have
lim U(n, k) = 1.
n

Moreover, a classic result of analysis states that



x n
lim 1 +
= ex
n
n

x R.

Putting everything together, we obtain


k
P(X = k) = lim U(n, k)
n
k!



n k
1
= e
n
k!

which is the pmf for the Poisson distribution.
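The convergence of the binomial probabilities to the Poisson pmf can be watched numerically; a sketch with the arbitrary choices λ = 2 and k = 3:

```python
from math import comb, exp, factorial

# The Bin(n, lam/n) probability of k incidents approaches e^(-lam) lam^k / k!
# as n grows.
lam, k = 2.0, 3
poisson = exp(-lam) * lam ** k / factorial(k)

def binom_approx(n):
    return comb(n, k) * (lam / n) ** k * (1 - lam / n) ** (n - k)

gaps = [abs(binom_approx(n) - poisson) for n in (10, 100, 1000, 10**6)]
assert gaps[0] > gaps[1] > gaps[2] > gaps[3]   # the approximation improves with n
assert gaps[-1] < 1e-4
```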


Using the power series of the exponential function

    e^x = Σ_{k≥0} x^k/k! = 1 + x + x²/2 + x³/6 + x⁴/24 + ⋯

we showed in lectures that:

    E(X) = λ,    Var(X) = λ.
The binomial, geometric and Poisson distributions are among the most important discrete distributions. Condense this section of the notes into half a page by writing your own summary of their
properties (pmf, expectation, variance, the sort of situation in which they occur, and an example of
their occurrence).

The Cumulative Distribution Function


The cumulative distribution function (cdf) F_X of a random variable X is defined as

    F_X : R → R,   F_X(x) = P(X ≤ x).

Note that the definition of the cdf does not require the random variable X to be discrete.
In the case of a discrete random variable X, the cumulative distribution function provides an alternative, and often very useful, representation of the information stored in the probability mass function. In other words, if we know one of the pmf and cdf then we can determine the other. Indeed, given the pmf, we can determine the cdf as follows:

    F_X(x) = Σ_{k≤x} P(X = k).

Conversely, let X take values x₁, x₂, x₃, … with x₁ < x₂ < x₃ < ⋯ (this sequence could also be doubly infinite: ⋯ < x₋₂ < x₋₁ < x₀ < x₁ < x₂ < ⋯). Then

    P(X = xᵢ) = F_X(xᵢ) − F_X(xᵢ₋₁).

Statistical tables often give the cdf rather than the pmf, so you may need to use this fact to find the pmf from tables.
The cdf F_X of a random variable X satisfies the following properties:

1. 0 ≤ F_X(t) ≤ 1 (since F_X(t) is a probability);

2. we have

    lim_{t→∞} F_X(t) = 1   and   lim_{t→−∞} F_X(t) = 0;

3. F_X is a non-decreasing function, that is, if a < b then F_X(a) ≤ F_X(b) (the event X ≤ a, being a subset of the event X ≤ b, has smaller probability);

4. P(a < X ≤ b) = F_X(b) − F_X(a);

5. if P(X = t) > 0 then F_X has a discontinuity at t.
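The round trip between pmf and cdf can be sketched on a small example (the three values and their probabilities are arbitrary):

```python
from fractions import Fraction

# F_X(x) = sum over k <= x of P(X = k), and
# P(X = x_i) = F_X(x_i) - F_X(x_{i-1}).
values = [1, 2, 3]
pmf = {1: Fraction(1, 2), 2: Fraction(1, 3), 3: Fraction(1, 6)}

def cdf(x):
    return sum(pk for k, pk in pmf.items() if k <= x)

assert cdf(0) == 0 and cdf(3) == 1          # the limiting values (property 2)
recovered = {values[0]: cdf(values[0])}
for prev, cur in zip(values, values[1:]):
    recovered[cur] = cdf(cur) - cdf(prev)
assert recovered == pmf                      # the pmf recovered from the cdf
```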
In the next section we will use the cdf to study a new type of random variables, called continuous random variables.

10 Continuous Random Variables

Many natural random variables have the property that the set of values they take is R (or an interval
of R). In these cases we can no longer work with the pmf. However, the cdf can still be useful.
One important family of random variables is described by the following definition.
Definition. A random variable X is continuous if its cdf FX is a continuous function.
Recall that the cdf is always a non-decreasing function with F_X(t) → 0 as t → −∞ and F_X(t) → 1 as t → ∞. If you ever calculate a cdf and find that it doesn't satisfy these conditions then you have made a mistake.
Definition. The number m is a median of X if FX (m) = 1/2.
The numbers l, u are lower and upper quartiles of X if FX (l) = 1/4 and FX (u) = 3/4.
The number ak is a kth percentile of X if FX (ak ) = k/100.
This definition also holds for discrete random variables. However, for a discrete random variable the median (and the quartiles and percentiles) may not exist or may not be unique. If the
random variable is continuous they are guaranteed to exist and are generally unique (can you see
why?).
It is a fact (from calculus) that the cdf of a continuous random variable is differentiable except, possibly, at finitely many points, so we make the following definition.
Definition. The probability density function (or pdf) of a continuous random variable X is the function

    (d/dt) F_X(t)

(defined arbitrarily where this derivative does not exist). We shall denote it by f_X.
This definition is a little imprecise as f_X is not determined uniquely at points where F_X is not differentiable (technically we are defining not a single pdf but a whole family of possible pdfs, any of which will do equally well). Whatever values you give the function here will make no difference to integrals involving it, so everything that follows is unaffected by the value of f_X at these bad points.
We can work out some properties of the pdf. In particular f_X(t) ≥ 0 for all t (because the cdf is non-decreasing). Also, by the fundamental theorem of calculus⁸ we have that

    P(a ≤ X ≤ b) = F_X(b) − F_X(a) = ∫ₐᵇ f_X(t) dt.

⁸ Strictly speaking we need some mild conditions on f_X to be able to apply the fundamental theorem of calculus. If the pdf is particularly unpleasant things might go wrong. For any reasonable function (and certainly anything we will meet in this module) this problem will not arise. However if you go on to study probability or analysis to a higher level you will have to worry about this kind of problem.



It follows from this that ∫_{−∞}^{∞} f_X(t) dt = 1; this gives a way of checking that a calculated pdf is plausible (analogous to checking Σ_y P(Y = y) = 1 for a discrete random variable).
Usually it is easier to calculate the cdf of a continuous random variable and then find the pdf by differentiating. However the cdf can be found from the pdf by integrating. Specifically,

    F_X(x) = ∫_{−∞}^{x} f_X(t) dt.

In the theory of continuous random variables the pdf plays the role of the pmf in the theory of discrete random variables, as the following definitions show.
Definition. If X is a continuous random variable with pdf f_X then

    E(X) = ∫_{−∞}^{∞} t f_X(t) dt,
    Var(X) = ∫_{−∞}^{∞} (t − E(X))² f_X(t) dt.

The variance can also be written as follows (compare the discrete case again):

    Var(X) = ∫_{−∞}^{∞} t² f_X(t) dt − (E(X))².

The properties of E and Var that we proved in the discrete case still hold, as does the fact that if g : R → R is a function then

    E(g(X)) = ∫_{−∞}^{∞} g(t) f_X(t) dt.

Note that in all these definitions the integrals go from −∞ to ∞. However, in practice the pdf is often 0 outside a smaller range and so we can integrate over this smaller range only (see examples in notes and on problem sheets).

Special Continuous Random Variables


As for the discrete case some random variables occur so frequently that we give them special
names. We look at two such.
Suppose that a number X is chosen from the interval [a, b], with X being equally likely to be anywhere in the interval. We will interpret this condition as meaning that the probability that X is in a sub-interval of length ℓ is proportional to ℓ. We say that X has the uniform distribution and write X ∼ Uniform[a, b] or X ∼ U[a, b]. The pdf and cdf of X are given by

    f_X(x) = 1/(b − a)   if a ≤ x ≤ b,
           = 0           otherwise;

    F_X(x) = 0                 if x < a,
           = (x − a)/(b − a)   if a ≤ x ≤ b,
           = 1                 if x > b.

If you substitute this f_X into the definitions of expectation and variance you will find that

    E(X) = (a + b)/2,    Var(X) = (b − a)²/12.
The second special random variable we look at is related to the Poisson distribution. Suppose that the number of incidents occurring in any interval of time of length t is distributed Poisson(λt). Instead of counting the number of incidents in a fixed interval (this would give a discrete random variable with the Poisson distribution) we look at the time T at which the first incident occurs. This gives a continuous random variable which we say has the exponential distribution. We write T ∼ Exponential(λ) or T ∼ Exp(λ). We can use the connection with the Poisson distribution to show that the cdf is given by:

    F_T(t) = P(T ≤ t)
           = 1 − P(T > t)
           = 1 − P(there are no incidents in the interval (0, t])
           = 1 − e^(−λt) (λt)⁰/0!
           = 1 − e^(−λt)

if t ≥ 0, and F_T(t) = 0 if t < 0. Note that F_T is a non-decreasing continuous function which tends to 1. Differentiating gives the pdf

    f_T(t) = 0           if t < 0,
           = λe^(−λt)    if t ≥ 0.

The expectation and variance of the exponential distribution can be found by integrating (hint: use integration by parts).

    E(T) = 1/λ,    Var(T) = 1/λ².

Another important continuous random variable is the normal distribution; you will meet this in
statistics courses.
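The exponential facts above can be checked by simulation. The sampling trick used here (inverse-transform sampling: if U ∼ U[0, 1] then −ln(1 − U)/λ has cdf 1 − e^(−λt)) is an aside, not from the notes; λ = 2 is an arbitrary choice:

```python
import random
from math import exp, log

# Monte Carlo check of the Exp(lam) mean and cdf via inverse-transform sampling.
random.seed(0)
lam = 2.0
samples = [-log(1 - random.random()) / lam for _ in range(200_000)]

mean = sum(samples) / len(samples)
assert abs(mean - 1 / lam) < 0.01            # E(T) = 1/lam

emp = sum(t <= 1.0 for t in samples) / len(samples)
assert abs(emp - (1 - exp(-lam))) < 0.01     # F_T(1) = 1 - e^(-lam)
```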

Monotonic transformations of random variables


I will present this material in somewhat more generality than I did in lectures. Before reading this
section, I suggest that you make sure you understand the more concrete examples of transformations from your lecture notes.
If g is any function which is defined on the range of X (the set of values that X takes), and which
takes values in R, we can define a new random variable Y = g(X). If g is monotonic (increasing or
decreasing) then the pdf of Y can be found from the pdf of X as follows.


If g is increasing, then it can be shown that g has an inverse function g⁻¹ which is also increasing (this is intuitively obvious by considering the graph of g). Now,

    F_Y(y) = P(Y ≤ y)
           = P(g(X) ≤ y)
           = P(X ≤ g⁻¹(y))   (as g⁻¹ is increasing)
           = F_X(g⁻¹(y)).

Now, provided that g⁻¹ is differentiable, we can differentiate with respect to y and obtain (using the chain rule) that

    f_Y(y) = (d/dy)(g⁻¹(y)) · f_X(g⁻¹(y)).

Similarly, if g is decreasing then it can be shown that g has an inverse function which is also decreasing. Now,

    F_Y(y) = P(Y ≤ y)
           = P(g(X) ≤ y)
           = P(X > g⁻¹(y))   (as g⁻¹ is decreasing)
           = 1 − F_X(g⁻¹(y)).

Provided that g⁻¹ is differentiable we can differentiate with respect to y and (again using the chain rule) obtain that

    f_Y(y) = −(d/dy)(g⁻¹(y)) · f_X(g⁻¹(y)).

Note that this is non-negative because (d/dy)(g⁻¹(y)) ≤ 0.
We could express both of these by saying that if g is monotonic and g⁻¹ is differentiable then

    f_Y(y) = |(d/dy)(g⁻¹(y))| · f_X(g⁻¹(y)).

However it is better to learn the method than to memorise this formula.


Note that if Y is related to X by a non-monotone function then this method does not work. For instance, if Y = X² then

    F_Y(y) = P(X² ≤ y) = P(−√y ≤ X ≤ √y) = F_X(√y) − F_X(−√y),

so we can still relate the cdf of Y to the cdf of X, but not in such a simple way.
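The increasing case can be sketched by simulation with a concrete g of my choosing: X ∼ U[0, 1] and Y = g(X) = e^X, so g⁻¹(y) = ln y and the formula predicts f_Y(y) = (1/y)·f_X(ln y) = 1/y on [1, e], i.e. F_Y(y) = ln y:

```python
import random
from math import exp, log, e

# Monte Carlo check of the change-of-variables formula for an increasing g.
random.seed(1)
samples = [exp(random.random()) for _ in range(200_000)]   # Y = e^X, X ~ U[0,1]

for y in (1.2, 1.7, 2.5):
    empirical = sum(s <= y for s in samples) / len(samples)
    assert abs(empirical - log(y)) < 0.01    # matches F_Y(y) = ln y
```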


11 Working with Several Random Variables

Sometimes it is useful to consider more than one random variable at the same time, or to write a
random variable as a combination of other random variables. In this section we develop some of
this theory in the discrete case. All random variables mentioned are assumed to be discrete. Much
of what follows is also true for continuous random variables (but with sums replaced by integrals
and pmf replaced by pdf).
If we have two discrete random variables X and Y defined on the same sample space Ω, the function

    p_{X,Y} : R² → R,   (x, y) ↦ P(X = x, Y = y)

is called the joint probability mass function (jpmf) of X and Y. The domain of a jpmf is the Cartesian plane, namely the set of ordered pairs (two-element sequences) of real numbers. The jpmf of X and Y can be presented as a table. The example we had in lectures gave the following:

          R = 0    R = 1    R = 2    R = 3
B = 0       0       3/35     6/35     1/35
B = 1      2/35    12/35     6/35      0
B = 2      2/35     3/35      0        0

Here, for example, the top right entry means that P(R = 3, B = 0) = 1/35.
You can work out the pmfs of X and Y from the joint pmf by

P(X = x) = Σ_y P(X = x, Y = y),    P(Y = y) = Σ_x P(X = x, Y = y).

We sometimes refer to these as the marginal pmfs of X and Y to emphasise that they came from a joint pmf.
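To make the marginal formulas concrete, here is a short illustration (my own, not from the notes) that stores the joint pmf of R and B from the table above and recovers both marginals by summing over the other variable:

```python
from fractions import Fraction as Fr

# Joint pmf of R and B from the table above, keyed by (r, b).
joint = {
    (0, 0): Fr(0),     (1, 0): Fr(3, 35),  (2, 0): Fr(6, 35), (3, 0): Fr(1, 35),
    (0, 1): Fr(2, 35), (1, 1): Fr(12, 35), (2, 1): Fr(6, 35), (3, 1): Fr(0),
    (0, 2): Fr(2, 35), (1, 2): Fr(3, 35),  (2, 2): Fr(0),     (3, 2): Fr(0),
}
assert sum(joint.values()) == 1  # a pmf must sum to 1

# Marginals: sum the joint pmf over the other variable.
p_R = {r: sum(p for (rr, b), p in joint.items() if rr == r) for r in range(4)}
p_B = {b: sum(p for (r, bb), p in joint.items() if bb == b) for b in range(3)}

print("pmf of R:", p_R)
print("pmf of B:", p_B)  # p_B[0] is 10/35, as used later in the notes
```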
If X, Y are random variables and g is a real-valued function of two variables then g(X, Y) is another random variable. It satisfies

E(g(X, Y)) = Σ_x Σ_y g(x, y) P(X = x, Y = y).

Theorem 11.1. If X and Y are discrete random variables then

E(X + Y) = E(X) + E(Y).


Proof.

E(X + Y) = Σ_x Σ_y (x + y) P(X = x, Y = y)
         = Σ_x Σ_y x P(X = x, Y = y) + Σ_x Σ_y y P(X = x, Y = y)
         = Σ_x x Σ_y P(X = x, Y = y) + Σ_y y Σ_x P(X = x, Y = y)
         = Σ_x x P(X = x) + Σ_y y P(Y = y)
         = E(X) + E(Y). □

If we have more than two random variables then applying this theorem repeatedly and using basic properties of expectation we obtain:

Corollary 11.2 (Linearity of expectation). If X1, X2, . . . , Xn are discrete random variables and c1, c2, . . . , cn ∈ R then

E(Σ_{i=1}^n c_i X_i) = Σ_{i=1}^n c_i E(X_i).

Linearity is an important property of expectation, which you will see often if you do any further
probability modules. The next example gives one illustration of its use; there are more examples
on the final problem sheet.
Example. We provide an alternative way of deriving the expectation of a binomial random variable. Suppose that X ~ Bin(n, p) and for every 1 ≤ i ≤ n define a random variable

X_i = 1 if the ith trial results in success, and X_i = 0 if the ith trial results in failure.

Clearly X = X1 + X2 + ⋯ + Xn. Also, for every i we have X_i ~ Bernoulli(p) and so E(X1) = E(X2) = ⋯ = E(Xn) = p. Corollary 11.2 now tells us that

E(X) = E(X1) + E(X2) + ⋯ + E(Xn) = np.

The X_i in this argument are sometimes called indicator variables.
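The indicator argument can be checked by simulation. In this sketch (my own, with arbitrary values of n and p) each sample of X is built as a sum of n indicators, and the sample mean is compared with np:

```python
import random

random.seed(1)
n, p, trials = 20, 0.3, 100_000

# Each sample of X is a sum of n indicator variables X_i.
samples = [sum(1 for _ in range(n) if random.random() < p) for _ in range(trials)]
mean = sum(samples) / trials

print(f"sample mean {mean:.3f}, np = {n * p}")
assert abs(mean - n * p) < 0.05
```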
Definition. Two discrete random variables X and Y are independent if for all x and y we have
P(X = x,Y = y) = P(X = x)P(Y = y)
that is, if the events X = x and Y = y are independent.
More generally, the random variables X1 , X2 , . . . , Xn are independent if for all x1 , x2 , . . . , xn we
have
P(X1 = x1 , X2 = x2 , . . . , Xn = xn ) = P(X1 = x1 )P(X2 = x2 ) . . . P(Xn = xn ).

It is worth noting (because it is a frequent misconception) that we do not require X and Y to be independent in Theorem 11.1. However, if X and Y are independent then we can say more.

Theorem 11.3. If X and Y are independent discrete random variables then:
i) E(XY) = E(X)E(Y),
ii) Var(X + Y) = Var(X) + Var(Y).
Proof.
i)
E(XY) = Σ_x Σ_y xy P(X = x, Y = y).
Since X and Y are independent we have that
E(XY) = Σ_x Σ_y xy P(X = x) P(Y = y).
Using properties of summation, we can express the right-hand side as a product to get:
E(XY) = (Σ_x x P(X = x)) (Σ_y y P(Y = y)) = E(X)E(Y).
(If you are having trouble seeing the last step, then write it out in full in a special case, e.g., where x assumes three values x1, x2, x3, and y assumes two values y1, y2.)
ii)
Var(X + Y) = E((X + Y)²) − (E(X + Y))²
           = E(X² + 2XY + Y²) − (E(X) + E(Y))²
           = E(X²) + 2E(XY) + E(Y²) − E(X)² − 2E(X)E(Y) − E(Y)²
           = Var(X) + Var(Y) + 2(E(XY) − E(X)E(Y)).
Now since X and Y are independent we use part i) to deduce that the term in brackets is 0. Hence,
Var(X + Y) = Var(X) + Var(Y). □

The converse of this is false; it is possible to have X and Y not independent but still have E(XY) = E(X)E(Y).
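A standard counterexample illustrates this (my own sketch, not from the notes): take X uniform on {−1, 0, 1} and Y = X². Then E(XY) = E(X³) = 0 = E(X)E(Y), yet X and Y are certainly not independent:

```python
from fractions import Fraction as Fr

# X uniform on {-1, 0, 1} and Y = X^2; joint pmf keyed by (x, y).
pmf = {(-1, 1): Fr(1, 3), (0, 0): Fr(1, 3), (1, 1): Fr(1, 3)}

E_X = sum(x * p for (x, _), p in pmf.items())       # 0
E_Y = sum(y * p for (_, y), p in pmf.items())       # 2/3
E_XY = sum(x * y * p for (x, y), p in pmf.items())  # 0

assert E_XY == E_X * E_Y  # so Cov(X, Y) = 0 ...
# ... yet X and Y are not independent:
# P(X = 0, Y = 0) = 1/3, but P(X = 0) P(Y = 0) = 1/9.
assert pmf[(0, 0)] != Fr(1, 3) * Fr(1, 3)
```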
Applying this result repeatedly and using basic properties of variance we obtain:


Corollary 11.4. If X1, X2, . . . , Xn are independent discrete random variables and c1, c2, . . . , cn ∈ R then

Var(Σ_{i=1}^n c_i X_i) = Σ_{i=1}^n c_i² Var(X_i).

Example. We provide an alternative way of deriving the variance of a binomial random variable. Suppose that X ~ Bin(n, p) and we define indicator variables X_i as in the previous example. The X_i are independent random variables (because the trials involved in the definition of the binomial distribution are independent). Since X_i ~ Bernoulli(p) we have that Var(X1) = Var(X2) = ⋯ = Var(Xn) = p(1 − p). Corollary 11.4 now tells us that

Var(X) = Var(X1) + Var(X2) + ⋯ + Var(Xn) = np(1 − p).
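As with the expectation example above, this can be checked by simulation (a sketch of mine with arbitrary n and p); the sample variance should be close to np(1 − p):

```python
import random

random.seed(2)
n, p, trials = 20, 0.3, 100_000

samples = [sum(1 for _ in range(n) if random.random() < p) for _ in range(trials)]
mean = sum(samples) / trials
var = sum((s - mean) ** 2 for s in samples) / trials

print(f"sample variance {var:.3f}, np(1 - p) = {n * p * (1 - p)}")
assert abs(var - n * p * (1 - p)) < 0.1
```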
The next definition gives a measure of how far from being independent two random variables are. You will meet these concepts again in statistics modules.

Definition. The covariance Cov(X, Y) of two discrete random variables X and Y is

Cov(X, Y) = E(XY) − E(X)E(Y).

It can be shown that

Cov(X, Y) = E((X − E(X))(Y − E(Y))).

The correlation coefficient Corr(X, Y) of X and Y is defined to be

Corr(X, Y) = Cov(X, Y) / √(Var(X)Var(Y)).

The rough idea of these definitions is that:
- If X and Y are independent then Cov(X, Y) = Corr(X, Y) = 0 (by Theorem 11.3).
- If X and Y have a tendency to either both be large or both be small then the covariance and correlation coefficient will be positive.
- If there is a tendency for one of X and Y to be large while the other is small then the covariance and correlation coefficient will be negative.
- The more extreme this tendency is, the larger in magnitude the covariance and correlation are.
- The correlation coefficient (as we shall see later) is normalised to lie between −1 and 1, and to be unchanged when X and Y are scaled linearly.

Here, "X being large (or small)" really means X being larger (or smaller) than E(X), and similarly for Y.
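As an illustration (mine, not from the notes), the covariance of the variables R and B from the joint table earlier in this section comes out negative, as this intuition predicts: drawing more red balls goes with drawing fewer blue ones.

```python
from fractions import Fraction as Fr

# Joint pmf of R and B, as in the table earlier in this section.
joint = {
    (0, 0): Fr(0),     (1, 0): Fr(3, 35),  (2, 0): Fr(6, 35), (3, 0): Fr(1, 35),
    (0, 1): Fr(2, 35), (1, 1): Fr(12, 35), (2, 1): Fr(6, 35), (3, 1): Fr(0),
    (0, 2): Fr(2, 35), (1, 2): Fr(3, 35),  (2, 2): Fr(0),     (3, 2): Fr(0),
}

def E(g):
    # Expectation of g(R, B) against the joint pmf.
    return sum(g(r, b) * p for (r, b), p in joint.items())

cov = E(lambda r, b: r * b) - E(lambda r, b: r) * E(lambda r, b: b)
print("Cov(R, B) =", cov)  # negative: many reds go with few blues
assert cov < 0
```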
The following properties of covariance and correlation can be deduced from their definitions.

Proposition 11.5. If X and Y are discrete random variables and a, b, c, d ∈ R then:
i) Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y),
ii) Cov(aX + b, cY + d) = ac Cov(X, Y),
iii) If a, c > 0 then Corr(aX + b, cY + d) = Corr(X, Y),
iv) −1 ≤ Corr(X, Y) ≤ 1.
The point of parts ii) and iii) is that if I decide to scale one or both of the random variables
by a linear transformation (for example measuring a temperature in degrees Fahrenheit rather than
degrees Celsius) then the covariance may change but the correlation coefficient will not.
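The temperature remark can be checked directly. In this sketch the data are made up (Celsius readings plus a correlated quantity), but the point is general: converting to Fahrenheit rescales the covariance by the factor a = 9/5 while leaving the correlation unchanged.

```python
import math
import random

random.seed(3)
x = [random.gauss(20, 5) for _ in range(10_000)]   # readings in degrees Celsius
y = [0.5 * v + random.gauss(0, 2) for v in x]      # a quantity correlated with x

def cov(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

def corr(u, v):
    return cov(u, v) / math.sqrt(cov(u, u) * cov(v, v))

x_f = [9 / 5 * v + 32 for v in x]                  # the same readings in Fahrenheit
assert abs(cov(x_f, y) - 9 / 5 * cov(x, y)) < 1e-6  # covariance picks up the factor a
assert abs(corr(x_f, y) - corr(x, y)) < 1e-9        # correlation is unchanged
```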
Proof.
i) During the proof of Theorem 11.3 ii) we showed that, for any X and Y,
Var(X + Y) = Var(X) + Var(Y) + 2(E(XY) − E(X)E(Y))
(look back to check that we really did prove this for any X, Y; in other words, up to this point of the proof we did not use that X, Y were independent). The result follows from this.
ii) We will use the second definition of covariance, so that
Cov(aX + b, cY + d) = E((aX + b − E(aX + b))(cY + d − E(cY + d))).
Using basic properties of expectation we get
Cov(aX + b, cY + d) = E((aX + b − aE(X) − b)(cY + d − cE(Y) − d))
                    = E(a(X − E(X)) c(Y − E(Y)))
                    = ac E((X − E(X))(Y − E(Y)))
                    = ac Cov(X, Y).
iii) We have that Var(aX + b) = a² Var(X) and Var(cY + d) = c² Var(Y). This and the previous part give that
Corr(aX + b, cY + d) = Cov(aX + b, cY + d) / √(Var(aX + b)Var(cY + d))
                     = ac Cov(X, Y) / √(a² Var(X) c² Var(Y)).
Now, since a, c > 0 we have that √(a²) = a and √(c²) = c, and so
Corr(aX + b, cY + d) = ac Cov(X, Y) / (ac √(Var(X)Var(Y))) = Corr(X, Y).
iv) Proof omitted and non-examinable. □




Conditional Random Variables

As in the previous section, all random variables will be assumed to be discrete.
It is important in this section (as in the whole course) to always be clear about which letters
stand for events and which stand for random variables. Recall that if X is a random variable and
x R then X = x is an event.
If X is a random variable and E is an event then, by the definition of conditional probability,

P(X = k | E) = P(X = k and E) / P(E).

This gives the pmf of a new random variable called "X given E" or X|E (or "X conditioned on E"). We can consider properties of X|E such as its expectation and variance. By definition,

E(X|E) = Σ_k k P(X = k | E).

Example. A standard fair die is rolled twice. Let X be the number showing on the first roll and E be the event "at least one odd number is rolled". Find the distribution of X|E.
We have that P(E) = 3/4.
If r is odd then

P(X = r | E) = P(X = r and E) / P(E) = P(X = r) / P(E) = (1/6) / (3/4) = 2/9.

If r is even then

P(X = r | E) = P(X = r and E) / P(E) = P(X = r) P(second roll is odd) / P(E) = (1/6 × 1/2) / (3/4) = 1/9.

It follows that the conditional pmf of X|E is

n            1    2    3    4    5    6
P(X = n|E)  2/9  1/9  2/9  1/9  2/9  1/9

You can now work out that E(X|E) = (2 + 2 + 6 + 4 + 10 + 6)/9 = 10/3.
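A quick simulation (my own check of this example, not part of the notes): roll two dice repeatedly, keep only the outcomes where at least one odd number appears, and average the first roll; the result should be close to 10/3.

```python
import random

random.seed(4)
kept = []
for _ in range(200_000):
    a, b = random.randint(1, 6), random.randint(1, 6)
    if a % 2 == 1 or b % 2 == 1:  # condition on the event E
        kept.append(a)            # record X, the first roll

estimate = sum(kept) / len(kept)
print(f"estimate of E(X|E): {estimate:.3f} (exact value 10/3)")
assert abs(estimate - 10 / 3) < 0.03
```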

Example. Often the event E is given by another random variable. Look back at the joint distribution of R and B in the last chapter. To find the distribution of R|B = 0 means to find P(R = r | B = 0). Hence we divide the top row of the joint distribution by P(B = 0). To find P(B = 0) we take the sum of all elements in the top row of the joint distribution. We get P(B = 0) = 10/35, and the conditional pmf is:

n               1     2     3
P(R = n|B=0)  3/10  6/10  1/10
(This section is not examinable.)


The following theorem is similar to the theorem of total probability.

Theorem. If E1, E2, . . . , En partition Ω (that is, they are pairwise disjoint and their union is Ω) and P(Ei) > 0 for all i then

E(X) = Σ_{i=1}^n P(Ei) E(X|Ei).

Proof. By definition

E(X) = Σ_k k P(X = k).

If we apply the Theorem of Total Probability with the partition E1, . . . , En to P(X = k) we get

E(X) = Σ_k k Σ_{i=1}^n P(X = k|Ei) P(Ei)
     = Σ_{i=1}^n Σ_k k P(X = k|Ei) P(Ei)
     = Σ_{i=1}^n P(Ei) Σ_k k P(X = k|Ei)
     = Σ_{i=1}^n P(Ei) E(X|Ei). □


This theorem can sometimes be used to calculate the expectation of a random variable without having to calculate the whole pmf. We saw one example in lectures and here is a second example giving an alternative derivation of the expectation and variance of a geometric random variable.

Example. If X ~ Geom(p) (so X is the number of trials up to and including the first success in a sequence of Bernoulli trials) and A is the event "the first trial is a success" then

E(X) = P(A)E(X|A) + P(A^c)E(X|A^c).

Clearly, if A occurs then X = 1 with probability 1 and so E(X|A) = 1.
If A does not occur then the number of subsequent trials (not including the first one) to the first success is also distributed Geom(p) and so

E(X|A^c) = 1 + E(X).

Substituting into the above equation we obtain an equation for E(X) which when solved gives E(X) = 1/p (agreeing with our previous method of working this out).
Similarly,

E(X²) = P(A)E(X²|A) + P(A^c)E(X²|A^c).

As before E(X²|A) = 1. Since the number of trials after the first one is also distributed Geom(p) we have that E(X²|A^c) = E((1 + X)²). Substituting this gives

E(X²) = p + (1 − p)E(1 + 2X + X²)
      = p + (1 − p)(1 + 2/p + E(X²))

which when solved gives E(X²) = (2 − p)/p² and hence Var(X) = (1 − p)/p².
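These formulas are easy to check by simulation (a sketch of mine with an arbitrary choice of p): generate geometric samples trial by trial and compare the sample mean and variance with 1/p and (1 − p)/p².

```python
import random

random.seed(5)
p, trials = 0.25, 200_000

def geom(p):
    # Number of Bernoulli(p) trials up to and including the first success.
    k = 1
    while random.random() >= p:
        k += 1
    return k

xs = [geom(p) for _ in range(trials)]
mean = sum(xs) / trials
var = sum((x - mean) ** 2 for x in xs) / trials

assert abs(mean - 1 / p) < 0.05           # E(X) = 1/p = 4
assert abs(var - (1 - p) / p ** 2) < 0.5  # Var(X) = (1 - p)/p^2 = 12
print(f"sample mean {mean:.3f}, sample variance {var:.3f}")
```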


12 Appendix: Words and Symbols of Higher Mathematics

The content of this section is not examinable.

12.1 Logic and Proof

In this short digression there are a few brief remarks on how to understand and write proofs.
If p and q are mathematical statements, then the statement

p ⇒ q

means "p implies q" or equivalently "if p then q". This statement is true unless there is some situation where p is true but q is false. The symbol ⇒ is much abused; use it as little as possible, and if you write p ⇒ q, make sure to read it aloud as "if p then q", to check it makes sense.
The proof of a statement of the form p ⇒ q should look like the following:
Proof. Suppose that p is true. Then we have . . .
So . . .
Hence q is true. □
Each line must follow clearly from the previous ones. To prove that p does not imply q (that is, that p ⇒ q is false), you need to find a situation where p is true but q is false.
Notice that p ⇒ q and q ⇒ p are different statements, and confusing them is perhaps the most common logical error. It is quite possible for one to be true and the other to be false. For example, x = 2 implies that x² = 4 but x² = 4 does not imply that x = 2 (we could equally well have x = −2).
The statement

p ⇔ q

means "p ⇒ q and q ⇒ p". This is usually read as "p if and only if q" ("if and only if" is often abbreviated to "iff") or "p and q are equivalent".
To prove that p ⇔ q you need to show that p ⇒ q and q ⇒ p. Sometimes it is possible to do both of these at once but it is often clearer to show them separately (see for example the proof of Proposition 2.3).
Another way of showing that p ⇒ q is to show that

not q ⇒ not p.

That is, if q is false then p is false. This statement is called the contrapositive (think about why it is equivalent to p ⇒ q).
The statement "not p" which appears above is called the negation of p. Working out the negation of a statement can be subtle, so we give a few examples. Make sure you understand all of these.
- If p is the statement "x = 2", then the negation of p is the statement "x ≠ 2".
- If p is the statement "x = 2 and y = 3", then the negation of p is the statement "either x ≠ 2 or y ≠ 3" (where "or" is being used in its usual mathematical sense to include both occurring).
- If p is the statement "either x = 2 or y = 3", then the negation of p is the statement "x ≠ 2 and y ≠ 3".
- If p is the statement "every student at QM is hard-working", then the negation of p is "there exists at least one lazy student at QM".
- If p is the statement "for all x, y ∈ A with x ≠ y we have f(x) ≠ f(y)", then the negation of p is the statement "there exist x, y ∈ A with x ≠ y and f(x) = f(y)".

12.2 Sets

A set is a collection of distinct, well-defined objects, called the elements of the set. Order and repetitions of the elements are irrelevant. (What does the expression "well-defined" mean?)
A set can be specified in several ways.
- By listing its elements between braces ({, }) separated by commas, e.g., {1, 2, 3, 4}.
- By listing enough elements to identify a pattern, e.g., {2, 4, 6, 8, . . . } (the set of positive even integers) or {1, π, π², . . . , π¹⁰⁰} (the set of increasing non-negative powers of π).
- By giving a rule, e.g., {x : x is an even integer} (read as "the set of all x such that x is an even integer").
- By giving a rule which identifies a subset of a given set, e.g., {x ∈ Z : x is a cube} (read as "the set of integers which are cubes").
If X is a set, we write x ∈ X to mean that x is an element of X, and x ∉ X to mean that x is not an element of X.
Two sets are equal if they contain precisely the same elements. To show that two sets are equal, A = B say, we can either
i) Show that if x ∈ A then x ∈ B, and if x ∈ B then x ∈ A,
or
ii) Show that if x ∈ A then x ∈ B, and if x ∉ A then x ∉ B.
The empty set is the set with no elements; it is denoted by ∅. If X is finite then the cardinality of X is the number of elements of X; it is denoted by |X|. The cardinality of the empty set is zero.
If every element of X is also an element of Y then we say that X is a subset of Y and write X ⊆ Y (or X ⊂ Y). Note that X ⊆ X.


12.2.1 Operations on Sets

Let X and Y be sets.
- X ∪ Y ("X union Y") is the set of elements of X or Y, or both.
- X ∩ Y ("X intersection Y") is the set of elements of both X and Y.
- X ∖ Y is the set of elements in X but not in Y.
- X △ Y (the symmetric difference of X and Y) is the set of elements in either X or Y but not both.
- If all sets are subsets of some fixed set S then X^c (the complement of X) is S ∖ X (the set of all elements of S which are not elements of X).
If X, Y are events, then these have probabilistic interpretations:
- X ∪ Y is the event "either X or Y occurs";
- X ∩ Y is the event "both X and Y occur";
- X ∖ Y is the event "X occurs but Y doesn't";
- X △ Y is the event "exactly one of X and Y occurs";
- X^c is the event "X doesn't occur".
The sets X and Y are disjoint if X ∩ Y = ∅.
12.2.2 Set Identities

Proposition 12.1. If X, Y and Z are sets then
a) Commutative laws
i) X ∪ Y = Y ∪ X
ii) X ∩ Y = Y ∩ X
iii) X △ Y = Y △ X
b) Associative laws
i) X ∪ (Y ∪ Z) = (X ∪ Y) ∪ Z
ii) X ∩ (Y ∩ Z) = (X ∩ Y) ∩ Z
c) Distributive laws
i) X ∩ (Y ∪ Z) = (X ∩ Y) ∪ (X ∩ Z)
ii) X ∪ (Y ∩ Z) = (X ∪ Y) ∩ (X ∪ Z)
d) De Morgan laws
i) (X ∪ Y)^c = X^c ∩ Y^c
ii) (X ∩ Y)^c = X^c ∪ Y^c
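These identities are easy to spot-check with Python's built-in set type (a quick illustration of mine; the particular sets chosen are arbitrary):

```python
X, Y, Z = {1, 2, 3}, {2, 3, 4}, {3, 4, 5}
S = X | Y | Z | {6}  # a fixed universe, so complements make sense

def comp(A):
    return S - A

# Distributive laws
assert X & (Y | Z) == (X & Y) | (X & Z)
assert X | (Y & Z) == (X | Y) & (X | Z)

# De Morgan laws
assert comp(X | Y) == comp(X) & comp(Y)
assert comp(X & Y) == comp(X) | comp(Y)

# Symmetric difference (^) is commutative
assert X ^ Y == Y ^ X
print("all identities hold for this choice of X, Y, Z")
```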

12.3 Ordered Pairs

We denote the ordered pair "x then y" by (x, y). So (1, 2) ≠ (2, 1), unlike for sets where {1, 2} = {2, 1}. The Cartesian product of two sets A and B, denoted by A × B, is defined by

A × B = {(a, b) : a ∈ A, b ∈ B}.

We denote A × A by A².
More generally, an ordered n-tuple is written (x1, x2, . . . , xn) and

A1 × A2 × ⋯ × An = {(a1, a2, . . . , an) : ai ∈ Ai}.

We write Aⁿ for A × A × ⋯ × A (n terms).

12.4 Functions

A function from a set A to a set B is a rule which assigns an element of B to each element of A. The set A is called the domain. The set B is called the codomain. If f is a function and a ∈ A we denote the element of B assigned to a by f(a). The point f(a) is also called the value of f at a and we say that f maps a to f(a).
We use the notation

f : A → B

to mean "f is a function from A to B", and

f : x ↦ f(x)

to mean "f is the function which maps x to f(x)".
A function f is injective if there are no two elements a1, a2 ∈ A with a1 ≠ a2 but f(a1) = f(a2) (no two distinct elements of A map to the same element of B).
It is surjective if for every b ∈ B there is an a ∈ A with f(a) = b (every element in B is mapped to by something in A). It is bijective if it is both injective and surjective. An injective function is also called an injection. A surjective function is also called a surjection. A bijective function is also called a bijection.
The idea of a bijection is related to inverse functions.
If f : A → B and g : B → C are functions then the composite function h : A → C is defined by h(x) = g(f(x)). We denote this by g ∘ f (or just gf). Note that g ∘ f and f ∘ g may be different.
If f : A → B is a function, then a function g : B → A is an inverse to f if (g ∘ f)(a) = a for all a ∈ A and (f ∘ g)(b) = b for all b ∈ B.
It is important to remember that not every function has an inverse. However, if an inverse does exist then it is unique.
Proposition 12.2. The function f : A → B has an inverse if and only if it is a bijection.
Proof. There are two things to prove here. Firstly we must show that if f has an inverse then it is a bijection. Secondly, we must show that if f is a bijection then it has an inverse.
Suppose that f has an inverse and denote it by g. Given any x ∈ B set y = g(x). Now f(y) = f(g(x)) = x, so f is surjective.
If f(a) = f(b) then (applying g to both sides) g(f(a)) = g(f(b)) and so a = b. Thus, f is injective.
So f is both surjective and injective, so it is bijective, and we have proved the first part of the result.
Given x ∈ B there exists a y ∈ A with f(y) = x (since f is surjective). Moreover, there is only one such y (since f is injective). Define g(x) to be equal to this y. Now g(f(y)) = y for any y and f(g(x)) = x for any x, and so g is an inverse to f. This completes the proof of the second part of the result. □
The following proposition relates the existence of injections, surjections and bijections to cardinalities of sets.

Proposition 12.3. Let A and B be finite sets and f : A → B.
i) If f is injective then |B| ≥ |A|.
ii) If f is surjective then |A| ≥ |B|.
iii) If f is bijective then |B| = |A|.
Proof. For part i) suppose that f is injective. For each element b ∈ B there is at most one element a ∈ A with f(a) = b. It follows that the number of elements of A is at most the number of elements of B. That is, |B| ≥ |A|.
For part ii) suppose that f is surjective. For each element b ∈ B there is at least one element a ∈ A with f(a) = b. It follows that the number of elements of A is at least the number of elements of B. That is, |A| ≥ |B|.
If f is bijective then it is both injective and surjective, and so |B| ≥ |A| (by part i) and |A| ≥ |B| (by part ii). It follows that |A| = |B|. □

12.5 Cardinality of Infinite Sets

Motivated by the previous proposition, we say that two infinite sets have the same cardinality if there is a bijection between them.
A set S is countable if there is an injection from S to N. It can be shown that if S is infinite and there is an injection S → N then there is also a bijection S → N. So a set is countable if it is either finite or has the same cardinality as N.
We showed in lectures that Z is countable and N² is countable. We also hinted at the proof that Q is countable. However, there are infinite sets which are not countable. Two examples are R and the set of all subsets of N. So there are infinite sets which are, in some sense, bigger than N.
This surprising phenomenon was discovered by Cantor in the 1870s.

12.6 Sequences

An ordered n-tuple (a1, a2, . . . , an) is also called a sequence of length n. Sometimes this is written as (a_i)_{i=1}^n.
A sequence can also be infinite, written in several equivalent ways:

(a1, a2, a3, . . . )    (a_i)_{i≥1}    (a_i)_{i=1}^∞.

If (a1, a2, . . . ) is a sequence of numbers we write

Σ_{i=1}^n a_i = a1 + a2 + ⋯ + an.

The symbol i here is called the dummy variable. Note that

Σ_{i=1}^n a_i = Σ_{k=1}^n a_k = a1 + a2 + ⋯ + an.

Example. Let (a1, a2, . . . ) and (b1, b2, . . . ) be sequences of numbers and c, d be numbers. Then:

i) Σ_{k=1}^n (c a_k + d b_k) = c Σ_{k=1}^n a_k + d Σ_{k=1}^n b_k;

ii) (Σ_{i=1}^n a_i)(Σ_{j=1}^n b_j) = Σ_{i=1}^n Σ_{j=1}^n a_i b_j.

The proof is an easy exercise: just rearrange the terms.


You may also see the notation

Σ_{i=1}^∞ a_i    or    Σ_{i≥1} a_i

for summing an infinite sequence. We can't really say what this means without a little more analysis (Calculus II). One case we will need is when a_i = a r^{i−1} for −1 < r < 1. In this case

Σ_{i=1}^∞ a_i = a / (1 − r) = (first term) / (1 − common ratio).
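For example, with a = 1 and r = 1/2 the sum is 1/(1 − 1/2) = 2, and the partial sums approach it quickly; a quick check:

```python
a, r = 1.0, 0.5
partial = sum(a * r ** (i - 1) for i in range(1, 51))  # first 50 terms
print(partial)  # very close to a / (1 - r) = 2
assert abs(partial - a / (1 - r)) < 1e-12
```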

