(ST111 Probability A)
Another student:
You keep telling me to read books.
A rumour:
The exam will only be taken by a sample of 40 students selected at random
from this class.
A coin:
This module tossed me around in my sleep!
Niels Bohr:
Einstein, stop telling God what to do with his dice.
ebay:
Shadowfist CCG Probability Manipulator SE Rare
0 Bids £0.99 + Postage £0.75 12h 21m
Oscar Wilde:
Always be a little improbable.
1 Preface
Lecture notes
1 Royal Blackboard Beauty Contest for the Counties of the Midlands
2 http://www.bbc.co.uk/cbeebies/charlieandlola/stories/
3 get the idea at http://www.youtube.com/watch?v=I8aW4YuFSiY&feature=related
Websites
Confusingly enough, you may come across a few different websites when searching for this
module:
http://www2.warwick.ac.uk/fac/sci/statistics/courses/modules/year1/st111/
http://www2.warwick.ac.uk/fac/sci/statistics/courses/modules/year1/st111/resources
The first one is very general and is intended to help students with their module choice or
with anticipating the role of this module within the course program. The second website,
the resources website, is the one that is relevant for you now. There, I post lecture notes,
exercise sheets, information about the module organisation, old exams, computer code
and interesting links about probability. In previous years, some of the lecturers have used
the MathStuff website by the Warwick Mathematics Department to post exercises, lecture
notes, old exams and other information. This could still be useful for you and you can
find it there under ST111 > Module Pages > Archived Material; or at this address:
http://mathstuff.warwick.ac.uk/ST111/archive
Finally, there is my website:
http://www2.warwick.ac.uk/fac/sci/statistics/staff/academic/brettschneider/
Textbooks
Please check out the list on the resources website. Which one is best? Some people say
the one by Ross tends to more or less suit the majority of students. The other obvious
one is the textbook by Pitman.
By the way, have you experienced the law of the second source? The second source where
you read about a piece of mathematics new to you tends to be the one that explains it
better – regardless of the order of the sources. Interestingly, there is no such law for the
third source. In fact, there seems to be a k ≥ 3 such that while reading the kth source
you end up being more confused than while reading the first one, again regardless of the
order of the sources. Without getting to that point you will never know how well you had
already understood the material before getting hold of source k. For this and many other
reasons,
∑_{i=1}^{∞} “read another book”
Picture sources
The top cover picture showing the foyer of the Warwick Mathematics and Statistics De-
partment, known as The Street, is from
http://www.maths.warwick.ac.uk/general/news/images/wall-400.jpg
Other picture sources are mentioned in the figure captions. If no source is mentioned, such
as for the bottom picture on the cover, I made the picture. For simulations I am typically
using the statistical programming language R which is publicly available at
http://www.r-project.org
The one most frequently asked question
How do I study for the exams? There is a lot of advice on the resource website.
In a nutshell:
∑_{i=1}^{∞} “do another exercise”
Synopsis
Week I
Lecture 1: Randomness, probability as mathematical discipline
Lecture 2: A few questions (birthday problem, door problem, random sequences)
Week II
Lecture 3: Terminology for random events, classical examples for random experiments
Lecture 4: Classical probability
Lecture 5: Combinatorics
Week III
Lecture 6: Axioms of probability
Lecture 7: Conditional probability
Lecture 8: Total probability theorem, Bayes theorem
Week IV
Lecture 9: General multiplication rule, independence, binomial distribution
Lecture 10: Geometric distribution, negative binomial distribution
Lecture 11: Poisson approximation to the binomial, Poisson distribution
Week V
Lecture 12: In-class test
Lecture 13: Random variables
Lecture 14: Expectation, variance
We start by looking at a few pictures from the real world that exhibit random phenomena.
The slide show used during this lecture can be downloaded from the ST111 resource
website. It includes ice crystals, animal patterns, genetics, traffic, networks, statistical
mechanics, stock markets, gambling, knitting and more.
The common definition of mathematics as the science of patterns goes back at least to
G.H. Hardy. Number theorists look for patterns in the integers, topologists in shapes,
analysts in motion and change, and so on.
Look at the bottom picture on the cover sheet of these lecture notes. Are the black and
white dot sequences regular or not? In which sense (not)?
Row 1: Yes, it is constant
Row 2: Yes, it is periodic.
Row 3: No, but almost. It is the sequence from row (2) with a few errors.
Row 4: No, there seems to be no rule that generates this sequence.
Row 5: No, there seems to be no rule that generates this sequence.
Question 2.1. Statistical regularity in dot sequences. The sequences in Row 4 and Row 5
clearly are not regular in the classical sense: There is no deterministic rule that generates
them. But are they statistically regular in some sense? And if so, do they have the same
kind of statistical regularity in the sense that they show similar statistical characteristics?
One of them has actually been generated by flipping a fair coin 100 times and printing a
white dot for head and a black dot for tail. Can you say which one? And why do you think
so?
See end of this section for an answer.
ω: But excuse me, I do not like black and white. My very most favourite colour is totally
tomato red and my other favourite colour is bitter chocolate brown with lime green sparkles
in it and YOU should really try this link: http://biscuitsandjam.com/stripe_maker.php
And this one about statistical mechanics makes some pretty shape, too:
http://physics.ucsc.edu/~peter/ising/ising.html
Ω: Speaking of black and white, have you heard that half of the students taking this
module love us and the other half hate us? Don’t cry, omi, you know as well as I do that we
can be quite annoying and silly...
I have decided that the most sensible step for us is to turn grey and to shrink a bit, so we do not overstay
our welcome.
It seems desirable to get more insight into the phenomenon of randomness, but what kind
of discoveries can we expect from studying something as inconceivable as randomness?
The actual science of logic is conversant at present only with things either
certain, impossible, or entirely doubtful, none of which (fortunately) we have to
reason on. Therefore the true logic for this world is the calculus of probabilities,
which takes account of the magnitude of the probability which is, or ought to
be, in a reasonable man’s mind.
— James Clerk Maxwell
How dare we speak of the laws of chance? Is not chance the antithesis of all
law?
— Joseph Bertrand, Calcul des probabilités
Why would anyone want to study randomness? There are a number of reasons including
• describing and understanding patterns in random phenomena,
• unravelling scientific principles,
• predicting and forecasting events,
• learning and developing degrees of belief about events,
• quantifying and comparing risks,
• taking decisions under uncertain circumstances.
Probability theory provides a formal theory to describe and analyse random phenomena.
Probability can be practiced as pure math. The birth of probability theory as a math-
ematical discipline is often associated with the axiomatisation the field underwent during
the twenties and early thirties of the 20th century.
It borders other fields in mathematics. Most obvious are the overlaps with measure theory,
analysis, functional analysis, PDEs, geometry, combinatorics, computing, ergodic theory,
number theory and mathematical physics. The last International Congress of Mathemati-
cians (2006 in Madrid) recognised and honoured such connections. Andrei Okounkov from
Russia received a Fields Medal for his contributions bridging probability, representation
theory and algebraic geometry. Wendelin Werner from France received a Fields Medal
for his contributions to the development of stochastic Loewner evolution, the geometry of
two-dimensional Brownian motion, and conformal field theory.
The probability group at Warwick is among the leading ones in the UK. Their major home
is P@W; see http://www2.warwick.ac.uk/fac/sci/statistics/paw/
Probability can be practiced as applied mathematics. (To keep things open we will not
impose p + q = 1.)
The instrument that mediates between theory and practice, between thought
and observation, is mathematics; it builds the bridge and makes it stronger
and stronger. Thus it happens that our entire present day culture, to the
degree that it reflects intellectual achievement and the harnessing of nature, is
founded on mathematics.
— David Hilbert, radio speech (1930)
Probability theory has been serving the human mind’s inquiries and activities in many
areas. Besides statistics (see below), these include the following:
Physical world: e.g. astronomy (in particular, theories about measurement error), me-
chanics, statistical physics, quantum mechanics.
Living world: e.g. integrative biology, evolution, genetics, genomics, medicine, neuro-
science.
Social world: e.g. economics (in particular, financial markets), sociology, demography.
Engineering: e.g. computer science (in particular, machine learning), electrical engineer-
ing, risk assessment, reliability, operations research, information theory, communication
theory, control theory, traffic engineering.
Humanities: e.g. epistemology, determinism, philosophy of language, philosophy of
religion, philosophy of logic, political philosophy, belief systems.
Games of chance: e.g. roulette, dice, card games.
Arts: e.g. stochastic music, visual art (inspired by chaos theory and fractals, patterns
simulated by stochastic algorithms).
Probability theory is based on a set of axioms that can be summarised in just a few
lines. What makes it applicable to so many different kinds of applications is a huge
repertoire of probabilistic models. For example: Classical discrete probability models
in games of chance, genetic inheritance, quality control, measurement error, electrical
circuits, and lattice models for particle interaction. Discrete time stochastic processes (often
involving a certain depth of dependency on the past) in queuing, coding and stochastic
composition algorithms. Continuous stochastic processes for particle motion and stock
prices. Probabilistic networks in sociology, mobile phone networks and genomics.
Before trying our models in the real world, let us state some thoughts on the application
of probability theory from Saul Jacka’s lecture notes: ”It is a very powerful language, for
if we accept a few simple propositions, such as involve mapping a practical situation on
to some axioms, the whole battery of a theory’s deductions follow as true, conditional
upon the validity of that mapping. Of course, if that first mapping is nonsense then, no
matter how good our mathematics is, the result is almost certainly rubbish and can have
disastrous practical consequences.”
A quick note: You will need every bit of Probability A and Probability B (and more) to
understand Mathematical Statistics A and B in your second year. The latter modules are
the basis for a lot of exciting modules, theoretical and applied, ahead of you in the final
years. Please make sure you are on top of the probability modules.
Probability and statistics are linked by studying similar objects from different perspectives:
Probability starts out with a model for a random experiment that describes potential
outcomes and assumes certain basic parameters. Probabilities for events of interest are
deduced from these model assumptions. For statistics, the starting point is the observed
outcomes (a.k.a. observations or data) of an experiment that has already been performed.
The task is to infer characteristics of an optimal model for the unknown (or only partially
known) mechanism of the experiment. That means we are trying to find the model that
is, among a certain class of models, most likely to have created the observations.
“Probability and statistics used to be married, then they separated, then they got divorced
and now they hardly see each other,” says David Williams. At Warwick they have moved
on to be good friends offering a wide range of research activities. They cover any possible
blend of the two disciplines from pure probability to applied statistics and extending to
interdisciplinary collaborations with other Warwick research groups such as Complexity,
Systems Biology and Finance.
In his classical book An Introduction to Probability Theory and its Applications, William
Feller stresses that in probability, as in other mathematical disciplines, we must distinguish
three aspects of the theory:
(a) the formal logical content,
(b) the intuitive background,
(c) the applications.
He continues: ”The character, and the charm, of the whole structure cannot be appreciated
without considering all three aspects in their proper relation.”
Considering both the history of probability and the fact that this is a first-year module, we
start with models for equally likely outcomes. This approach is minimalistic in the formal
sense but already provides enough mathematical foundation to properly dive into some
interesting examples to motivate probability as a field. While some students are extremely
good at solving probability problems intuitively, others prefer to see a more formal
approach first, which is why we will then introduce the modern axiomatic approach
to probability, including the notions of σ-algebras, probability measures, and random vari-
ables. Building on these objects, the mathematical theory of probability includes crucial
concepts such as conditioning and independence, expectations and variances. In the easier
examples, the introduction of such formalism may feel like breaking a butterfly on a wheel.
On such occasions it serves as an introduction of the formalism needed for more complex
questions, some of which even have counterintuitive answers. Along with the theory we
will get to know classical probability distributions such as Bernoulli, uniform, binomial,
geometric and Poisson. All of this will be motivated by real world questions and illustrated
by further applications and by examples constructed from probabilistic experiments (such
as coin tossing or gambling examples).
ω: But confusion completely does not scare me. When I am big I will be working as a researching researcher.
And also it is your birthday and Mum said I should help you make sure you have an extremely lovely,
happy birthday, so that’s why I will stick around.
Ω: OK. Thank you omi. Let us sort out a problem together then. . .
Question 2.2. Suppose there are r students in this class. What is the probability that at
least two students in the class have the same birthday?
ω: But I need to know everybody’s birthday first to find out whether any two are on the same day!
Ω: Listen, in the question they don’t care about exactly which students these are. They just want to know
the chances of two identical birthdays in any class of r people.
ω: But when it’s your birthday that means you are really special. How can you feel really special if all you
know is your odds of having a birthday today are say one in two, or one in a million or whatever?
Ω: Don’t be silly. I’m just curious to find out how likely it is that a random class of r students has at least
two students in it with identical birthdays.
ω: You are the one being silly. A random class is what our school teacher calls us when we are naughty.
Ω: Not that kind of random. I mean a class picked at random from all the classes of r students in the
universe.
ω: Universe, outer space, probability space – you are just making up funny words. How is the universe
going to help you out? You do not even know who is out there. Maybe on Mars everybody is born in May,
and on Saturn everybody is born on a Saturday, and in the Milky Way everybody is born on a Lucky Day!
Ω: OK. That is a very good point. We need a realistic probability space to answer this question on Earth.
I will make the following assumptions:
• Every year has 365 days.
• Every student is equally likely to be born on any of these days.
• Birthdays are independent of each other.
ω: But excuse me, there are leap years and everybody knows there are more birthdays during some parts
of the year and twins have the same birthday.
Ω: You’re absolutely right, but I just want to start somewhere and my assumptions still capture the essence
of the problem. Once we have solved it this way we can see how important the assumptions are in the
calculation and then tackle it with a different set of assumptions.
ω: There are 23 kids in my class and they are all very special. I’m not worried that any two of us have
the same birthday. The risk Alphie is born on my birthday is only 1 in 365, for Betty it is also 1 in 365
and so on, which makes 22/365. Now, Alphie makes the same calculation for herself, and so does Betty and
so on, which makes 23 times 22/365.
Ω: That is about 1.39. In other words 139%. Did you know that probabilities cannot be bigger than
100%?
ω: OK, I see, it’s just that I counted some things twice, like Alphie on the same day as me is the same as
me on the same day as Alphie. I will try to answer that question again. . . .
Ω: I know you like questions. But this is actually my question. I was just going to look at a simpler version
of this problem. There are three students in a class and we want to know what are the chances that at least
two of them have their birthday in the same month. I will call the months a, b, c, . . . , l. Now I will figure
out all possibilities for the birthday months of the three students and count how many of them have at
least two identical months in it: aaa, aab, aac, . . . , aal obviously all belong in this group, makes 12. Now,
aba, abb, abc, . . . , abl. Of those, only two have at least two identical months. Same for all the ones starting
with ac or ad and so on up to al. So far, we have 12 plus 2 times 11. Now, there are baa, bab, . . . , bal and
bba, bbb, . . . , bbl. . .
Problem solving technique: Consider all options.
ω: And bla, bla, bla and. . . Omega, I am hungry, and there’s got to be a better way for this!
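There is indeed a better way, which we derive next, but Omega’s brute-force enumeration can at least be handed to a computer. A minimal sketch (the notes use R for simulations; this Python version is just an illustration of “consider all options”):

```python
from itertools import product

months = "abcdefghijkl"                      # the 12 months, in Omega's labelling
# all possible birth-month triples for the three students
triples = list(product(months, repeat=3))
# those with at least two identical months, i.e. fewer than 3 distinct entries
with_repeat = [t for t in triples if len(set(t)) < 3]
print(len(triples))       # 1728 = 12^3
print(len(with_repeat))   # 408 = 1728 - 12*11*10
```

The count confirms the hand calculation: 408 of the 1728 equally likely triples contain a repeated month, giving a probability of 408/1728 ≈ 0.236.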
We order the students in some way. With the above assumptions our model is that
the birthdays are equally likely to form any ordered r-tuple with numbers chosen from
{1, 2, . . . , n} (here n = 365).
We want to compute the probability pn,r that there are at least two students with the
same birthday among the r students in the class. Obviously, if r > n this has to happen,
so pn,r = 1. Otherwise, we can compute it by dividing the number of r-tuples containing
repeats by the total number n^r of r-tuples from a set of n elements.
Omega attempted to list all possible r-tuples, identify which of them have repeats in them
and count those. While this is a correct way of solving the problem, it is more efficient
to just list and count those r-tuples that do not have any repeats in them. This will allow
us to compute the probability qn,r of the opposite event, that is, that no two students have
the same birthday.
Problem solving technique: Check whether the opposite event is easier to handle.
In how many ways can birthdays be assigned without ever repeating one? Let the first
student have whatever birthday. Avoiding that day for the second student leaves n − 1
options. The third student has to avoid both previously taken birthdays which leaves n−2
choices. The fourth has n − 3 choices and so on up to the n − r + 1 choices for the rth
student. So we have
qn,r = n · (n − 1) · (n − 2) · . . . · (n − r + 1)/n^r ,
which yields
pn,r = 1 − n · (n − 1) · (n − 2) · . . . · (n − r + 1)/n^r .
Let us look at some numerical results. In particular, we can see that the smallest number
of students r such that the probability of having at least two of them with the same
birthday exceeds 1/2 is r = 23.
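The formula for pn,r is easy to evaluate numerically. A short sketch (again in Python rather than the R used elsewhere in these notes) computes pn,r and finds the threshold r = 23:

```python
from math import prod

def p_same_birthday(n, r):
    """Probability that at least two of r people share a birthday,
    assuming n equally likely days and independent birthdays."""
    if r > n:
        return 1.0
    # q_{n,r} = n (n-1) ... (n-r+1) / n^r, the probability of no repeat
    q = prod(n - i for i in range(r)) / n**r
    return 1 - q

# smallest r with probability above 1/2 for n = 365 days
r = 1
while p_same_birthday(365, r) <= 0.5:
    r += 1
print(r)                                     # 23
print(round(p_same_birthday(365, 23), 3))    # 0.507
```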
This is an old kind of problem that became famous worldwide via the Monty Hall game show in
1990. You are being shown three closed doors. Behind one of the doors is a car; behind
each of the other doors is a goat. You choose one of the doors. The show’s host, who
knows which door conceals the car, opens one of the two remaining doors which he knows
will definitively reveal a goat. Then he asks you whether or not you want to switch your
choice to the remaining closed door. What are you going to do? Here are some attempts
to answer this question.
(i) Either the prize is behind the door you bet on originally or behind the other still
closed door, which makes it a fifty-fifty chance for each, so it doesn’t matter whether you
change or not.
(ii) All doors were equally likely originally. That is not going to change because of whatever
the show’s host did. So it doesn’t matter whether you change or not.
(iii) Imagine the same problem but with 100 doors instead of 3. Behind one of the doors
is a car, behind all the other 99 doors is a goat. The show’s host knows where the
prize is and opens all doors but the one you picked and one other one. Now it seems
intuitive that you would want to switch.
Problem solving technique: Exaggerating the original question helps to reveal
the essential characteristics of a problem. Often, the qualitative aspects of the answer
to the exaggerated question can be carried over to answer the original question.
(iv) We compare the two different strategies stick (with your original choice) and switch.
There are two possibilities:
Case 1: Your original choice is the door with the car. If you switch, you get a goat.
Case 2: Your original choice is a door with a goat. If you switch, you get the car.
The probability for Case 1 is 1/3, the probability for Case 2 is 2/3. If your strategy
is stick then your chance of winning the car is 1/3. If your strategy is switch then
your chance of winning the car is 2/3.
Problem solving technique: Distinguishing cases allows us to find solutions of the
problem under more constrained conditions. In probability, it is usually done to get
rid of (some of ) the randomness.
(v) A qualitative approach. The show’s host knows where the car is and opens, on
purpose, a door that reveals a goat. If you make use of that information it should
increase your chances of getting the car; at least it should not decrease them.
(vi) A variation of the problem. The show’s host does not know where the car is, he just
happens to reveal a goat when he opens the door. Does this change your answer to
the problem?
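The case analysis in (iv) can also be checked by simulation. A minimal sketch (Python here instead of the R used for the simulations on the resource website), assuming the car is placed uniformly at random and the host always opens a goat door:

```python
import random

def monty_hall(switch, trials=100_000, seed=1):
    """Estimated winning probability for the stick or switch strategy."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        car = rng.randrange(3)          # door hiding the car
        choice = rng.randrange(3)       # contestant's first pick
        # the host opens a goat door that is neither the car nor the pick;
        # if there are two options, which one he opens does not matter here
        opened = next(d for d in range(3) if d != car and d != choice)
        if switch:
            choice = next(d for d in range(3) if d != choice and d != opened)
        wins += (choice == car)
    return wins / trials

print(f"stick:  {monty_hall(switch=False):.3f}")   # close to 1/3
print(f"switch: {monty_hall(switch=True):.3f}")    # close to 2/3
```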
Question 2.3. Which one of the following birth orders in a family with six children is
more likely, or are they the same? (G means girl, B means boy.)
GBGBBG BGBBBB
Question 2.4. A hexagonal die with 2 red and 4 green faces is thrown 20 times. You have
to bet on the occurrence of one of the following patterns (R for a red face, G for a green
face). Which one would you choose?
(i) RGRRR
(ii) GRGRRR
Answers to these two questions are given below. Before you read them, try to come up
with your own answers.
About Question 2.3: Both are equally likely. Experiments with people who have no
training in probability have shown that they tend to believe that the first order is more
likely than the second order. The second sequence is perceived to be too regular. The
answers suggest that the respondents compare the frequency of families with 3 girls and 3
boys with the frequency of families with 5 boys and 1 girl. Yet both exact orders of birth,
GBGBBG and BGBBBB, are equally likely, because they both represent one of 64 equally
likely possibilities. (See Kahneman and Tversky, 1972b, p. 432, Chapter 5, Neglecting
exact birth order.)
About Question 2.4: Your best bet is (i). Most people bet on (ii), because G is more
likely than R and (ii) has two Gs in it. Yet (i) is more likely, simply because RGRRR is
nested in GRGRRR. This is an example of committing the conjunction fallacy without
realising that there is a conjunction. (See Kahneman and Tversky, 1983, p.303, Failing to
detect a hidden binary sequence.)
The conjunction fallacy has been studied by many researchers using the example about
the (hypothetical) person Linda. The description reads:
Linda is 31 years old, single, outspoken, and very bright. She majored in phi-
losophy. As a student, she was deeply concerned with issues of discrimination
and social justice, and also participated in anti-nuclear demonstrations.
Then people are asked to rank statements about Linda by their probability, in particular
the following two:
(i) Linda is a bank teller.
(ii) Linda is a bank teller and is active in the feminist movement.
People assigned higher probabilities to the second statement. This is not in tune with
the mathematical rules about probability and has stimulated a discussion about how real
people deal with probabilities. (See Kahneman and Tversky, 1982b, p.92.)
ω: People’s minds don’t follow the usual probability rules, small particles do not follow them either; are
they of any use?
Ω: Oh yes, they work in many situations. Besides, I believe that human minds do follow them for the most
part; it’s just that we enjoy the examples where we’re getting mixed up. And these fractions of elementary
particles, well, that’s why quantum probability was invented.
About Question 2.1: Most likely, the second one (Row 5) is the coin toss. Does the
first one look more random to you? Well, it does switch more often between black and
white, and we seem to take that as a sign of proper randomness. But could this be too
much of a good thing? Considering that staying with the same colour is as likely as
switching to the other colour, we would expect a sequence of 100 coin tosses to have about
50 colour changes. However, there are 66 colour changes in the first one (Row 4) and 46
in the second (Row 5). While we cannot say so with certainty, we have argued that the
difference in the number of observed colour changes makes Row 5 much more likely than
Row 4 to have been generated by coin tossing.
This is an example of the thinking used in hypothesis testing, a branch of statistics. This
particular argument underlies the runs test, which is used to check the independence
assumption.
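Counting colour changes is straightforward to automate. A small sketch (Python rather than R, purely for illustration) generates a fair-coin dot sequence and counts its colour changes; over 99 adjacent pairs we expect roughly 50 changes:

```python
import random

def colour_changes(seq):
    """Count positions where the colour differs from its predecessor."""
    return sum(a != b for a, b in zip(seq, seq[1:]))

# a simulated row of 100 dots: 'b' for black, 'w' for white
rng = random.Random(111)
row = "".join(rng.choice("bw") for _ in range(100))
print(colour_changes(row))   # typically close to 50
```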
The set of all outcomes is called outcome space or sample space and is usually denoted
by Ω; an individual outcome is denoted by ω. Before the experiment, ω is not known;
after the experiment, it is known. However, we will often calculate just probabilities for ω
to belong to a certain subset of Ω. These subsets are called events.
Example 3.1. Toss a coin and see what it is facing. Ω = {h, t} with h for the coin facing
head and t for the coin facing tail.
Example 3.2. Roll a die and observe the number of dots it faces. Ω = {1, 2, 3, 4, 5, 6}
and events can be any subsets of Ω. For example, ”rolling an odd number” is {1, 3, 5} and
”rolling a number bigger than 4” is {5, 6} and ”rolling a six” is {6}.
Example 3.3. An icosahedron die has 20 faces, all equilateral triangles of the same
size, with the numbers 1 to 20 written on them. Roll an icosahedron die and observe
the number it faces. Ω = {1, 2, . . . , 20} and events can be any subsets of Ω. For more
information on different kinds of dice and a nice picture of all the Platonic-solid-shaped
dice you may look under ”dice” in Wikipedia.
Example 3.4. There are several ways of modelling the experiment of tossing two coins.
For example:
(i) Observe which sides are up. This corresponds to Ω = {hh, ht, th, tt}.
(ii) Count the total number of heads. This corresponds to Ωheads = {0, 1, 2}.
Note that (i) contains more information than (ii).
Example 3.5. The deck most often seen in English-speaking cultures, and common in
other countries where the deck has been introduced, is the Anglo-American poker deck.
This deck contains 52 unique cards in the four French suits, Spade (♠), Heart (♥), Dia-
mond (♦) and Club (♣) and thirteen ranks running from two (deuce) to ten, Jack, Queen,
King, and Ace. Draw a card from the deck and observe the suit and the rank.
For simplicity, introduce the coding Ace=1, Jack=11, Queen=12, King=13 and choose
Ω = {Sk | S = ♠, ♥, ♦, ♣; k = 1, 2, . . . , 13}. Events can be any subsets of Ω, for example,
”an Ace”={♠1, ♥1, ♦1, ♣1}, or ”a Spade”={♠k | k = 1, 2, . . . , 13}.
Alternatively, we can define outcome spaces that reflect only a partial observation of the
characteristics of the cards. For example Ωsuit = {♠, ♥, ♦, ♣} or Ωnumber = {1, 2, . . . , 13}.
Example 3.6. A box contains a finite number n of tickets enumerated 1, 2, . . . , n. Draw
one at random. The standard choice for an outcome space is Ω = {1, 2, . . . , n}. Events can
be subsets of Ω, for example, ”a 3”= {3}, ”a number between 5 and 10”= {5, 6, 7, 8, 9, 10},
or ”an even number”= {2k | k ∈ N, k ≤ n/2}.
Note that the previous example is a general form of a random experiment with finitely
many outcomes. Examples 3.1 to 3.5 can be represented in this form. The next two
examples are both infinite models. However, the first one is countable whereas the second
one is continuous.
Example 3.7. Coin tossing until heads come up.
A coin is tossed until heads come up. The outcomes are all sequences consisting of some
number of tails t and then end with a head h, so the outcome space can be represented as
Ω = {h, th, tth, ttth, tttth, ttttth, . . .}. Note that while there are infinitely many different
outcomes, each individual outcome is finite. Some examples for events are: ”it takes 3
trials to get a head”, ”it takes at least 3 trials to get a head”, ”it takes at most 3 trials to
get a head”.
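Outcomes from this countably infinite outcome space are easy to simulate. A minimal sketch (Python instead of the R used elsewhere in these notes), assuming a fair coin:

```python
import random

def toss_until_heads(rng):
    """Generate one outcome of Ω = {h, th, tth, ttth, ...}:
    toss a fair coin until the first head appears."""
    outcome = ""
    while True:
        face = rng.choice("ht")
        outcome += face
        if face == "h":
            return outcome

rng = random.Random(3)
samples = [toss_until_heads(rng) for _ in range(8)]
print(samples)
# the event "it takes at most 3 trials to get a head"
print([s for s in samples if len(s) <= 3])
```

Note that every simulated outcome is a finite string, even though Ω itself is infinite.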
Example 3.8. Infinitely many coin tosses.
A coin is tossed infinitely many times. In each toss, it is recorded which face shows up.
The outcomes are all sequences consisting of h and t, so Ω = {h, t}^N . Some examples for
events are: ”first toss is a tail”, ”at least 20 heads”, ”the pattern hhhthhh occurs at least
once”, ”infinitely many heads”.
Since there is a one-to-one map of the set of binary sequences onto the interval [0, 1], this
corresponds to picking a number at random between 0 and 1, as random number generators
do. It is now obvious that this is an example for a continuous outcome space.
Ω: Mmh, I would say that in Example 3.8 I would use the approach for finitely many outcomes to define
probabilities for events that just specify the first n tosses. Then let n go to infinity. Honestly,
though, I have no idea what it means for probabilities to converge. These concepts will come up in the
modules Probability B and Random events...
ω: For Example 3.7 I’m sure about one thing: it is absolutely completely not possible to give every outcome
the same chance. Why not? You just have to imagine they did. Then there was this number p > 0 and
the probability that one of the first n outcomes occurred was n · p and the probability that one of the first
n + 1 of them occurred was (n + 1) · p. And so on. That goes to infinity. But the probability that one of
all of them occurred should be at most 1. So it just can’t be! And this is why I’m never ever allowed to
invite infinitely many friends to my birthday party. You cannot cut a cake into infinitely many equally
big pieces.
The axioms for classical probability we will define below are only a slight extension of
a model with equally likely, or equiprobable, outcomes as in (1). Classical probability
models are based on symmetric characteristics of the random mechanism generating the
outcomes. Often, these are physical characteristics such as a die with faces of the same
shape or a box with balls of the same size. Calculations can usually be performed directly
or indirectly using the equiprobable model.
Definition 3.9. Algebra of sets.
Let Ω be a set of points. A system A of subsets of Ω is called an algebra if
(a) Ω ∈ A,
(b) A, B ∈ A =⇒ A ∪ B ∈ A, A ∩ B ∈ A,
(c) A ∈ A =⇒ Ac ∈ A.
Example 3.10. Algebra generated by a partition.
The algebra A generated by a partition B1 , . . . , Bn of Ω consists of the empty set and all
possible unions of elements of the partition. A consists of 2^n subsets of Ω.
Definition 3.11. Classical measurable space.
Let Ω be a non-empty set of points and F the algebra generated by a partition B1 , . . . , Bn
of Ω. Then (Ω, F) is called a classical measurable space. Ω is called sample space and
B1 , . . . , Bn are called basic events.
Definition 3.12. Axioms for classical probability.
(Ω, F, P ) is called classical probability space if the following two axioms are fulfilled:
C1: (Ω, F) is a classical measurable space.
C2: P : F −→ [0, 1] is a set function such that, for any event A ∈ F which comprises
exactly k distinct basic events Bi , we have P (A) = k/n.
Obvious examples for classical probability spaces are examples 3.1 to 3.6 with P defined
as in (1).
Ω: Stop this equally likely approach to probability! David Williams (you can find him in Wikipedia) calls
it an invitation to disaster.
ω: But I like it. I think it’s more fair to give everybody the same chance.
Ω: Ah, are you bringing socialism back on the stage? Anyway, from the mathematical point of view, this
equally likely approach to probability requires only a minimal amount of techniques and terminology.
All you need is counts, counts, counts, is all you need.
All you need is counts (all together now),
All you need is counts (everybody).
3.3.1 Counting
“I think you’re begging the question,” said Haydock, “and I can see looming
ahead one of those terrible exercises where six men have white hats and six
men have black hats and you have to work it out by mathematics how likely it
is that the hats will get mixed up and in what proportion. If you start thinking
about things like that, you would go round the bend. Let me assure you of
that!” — Agatha Christie, The Mirror Crack’d
A way to visualise this is to draw a tree-shaped graph. First, it shows three branches, one
for each building. Then each of these branches grows two branches, one for each storey.
Finally, each of these branches out further into four limbs, one for each room on that
storey.
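The tree can be checked by brute force; in this sketch the building, storey and room labels are made up, and only the counts 3, 2 and 4 come from the text:

```python
from itertools import product

# The tree in the text: 3 buildings, each with 2 storeys, each with 4 rooms.
buildings = ["B1", "B2", "B3"]
storeys = ["ground", "first"]
rooms = [1, 2, 3, 4]

# Each leaf of the tree is one (building, storey, room) triple.
leaves = list(product(buildings, storeys, rooms))
print(len(leaves))  # 3 * 2 * 4 = 24
```

Every leaf corresponds to exactly one choice at each level, which is the fundamental counting rule in action.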
Let us now formally state the counting methods we have just applied intuitively.
(a_{i1} , . . . , x_{iv} ).
Ω: I see you are taking the notation very literally. Look, just forget about using the latin alphabet
completely and exclusively. You can squeeze any number of characters and more between a and x. For
example, a, b, c, x makes quadruples, and a, α, β, γ, δ, b, c, . . . , x makes 29-tuples. a and x are just used in
this way to avoid more indices; otherwise we would have something like (a_{1,i1} , . . . , a_{v,iv} ), you see? It’s all
fine, there’s no accounting for taste.
Proof.
(i) is obvious: Arrange the pairs in a rectangular array with pair (ai , bj ) at the intersection
of the ith row and the jth column.
(ii) By induction. True for v = 2 from (i). Now suppose it is true for v − 1.
Expressing the v-tuple as a pair ((a_{i1} , . . . , w_{iv−1} ), x_{iv} ) and applying (i) with m = n1 · n2 ·
. . . · n_{v−1} and n = nv shows the claim for v.
Ω: Well, eh, that will reduce the number of options. I guess that’s a great exercise for first year students
learning about combinatorics. . .
Definition 3.16. A permutation of a finite set is any ordering of its elements in a list.
Ω: Actually, this is not what I learned in my algebra class. They said, a permutation of a finite set is a
bijective map of that set onto itself.
ω: It doesn’t matter to me. Bijective maps uniquely describe how to reorder the set, and list orderings are
the results of these reordering processes. Practically speaking, it’s all the same.
Proof. Let a1 , . . . , an be the n distinct elements of the set. We compute the number of
permutations following the idea of the calculation in Question 3.15. There are n choices
of position for an , n − 1 choices for a_{n−1} , n − 2 choices for a_{n−2} , and so on, up to 2 choices
for a2 and just one option for a1 . Using the fundamental rule, that makes n! different
orderings.
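The n! count is easy to verify by direct enumeration; a quick sketch:

```python
from itertools import permutations
from math import factorial

# Check the counting argument: a set of n distinct elements has n! orderings.
for n in range(1, 7):
    orderings = list(permutations(range(n)))
    assert len(orderings) == factorial(n)

print(factorial(4))  # a 4-element set has 24 orderings
```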
3.3.2 Sampling
Ω: I am collecting a collection.
Use Saul Jacka’s notes p.9-11, 13.
ω: That’s silly. First she cares about order, then she doesn’t.
Ω: Look, over here are seven M&Ms. Three of them are green, two are orange and two are violet. You
would care about the order of the colours as colours go, but you would not care about the order of
the individual pieces within one colour. For example, you can reorder the green ones among themselves
without even noticing. . .
ω: Did you notice I just ate all the seven while you weren’t looking? I was much faster than three seconds
and I can not even taste the order.
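Ω's point can be made concrete by counting: orderings that differ only within one colour collapse to the same colour pattern, leaving 7!/(3! · 2! · 2!) distinct patterns. A sketch:

```python
from itertools import permutations
from math import factorial

# Ω's M&Ms: 3 green (g), 2 orange (o), 2 violet (v).
sweets = "gggoovv"

# Orderings that differ only by swapping same-coloured pieces look identical,
# so collect the distinct colour patterns in a set.
patterns = set(permutations(sweets))
print(len(patterns))  # 7! / (3! * 2! * 2!) = 210
assert len(patterns) == factorial(7) // (factorial(3) * factorial(2) * factorial(2))
```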
4 Axioms of probability
Use Saul Jacka’s notes p.6-7.
Ω: Obviously, this case is very boring, and so is the case p = 0. They are included in the model, but they
are the extreme cases. Such experiments are not actually random, they are deterministic.
P ({ωi }) = pi (i = 1, 2, . . .) (3)
ω: I need to invite infinitely many friends to my birthday party. Because, whenever I think I will invite
these n friends I realise I forgot one. But how can I divide my birthday cake into infinitely many equally
large pieces?
Ω: That would not be LARGE pieces, omega, but small pieces! You are even in much bigger trouble here.
Just read on. . .
Ω: In part B of this module you can make a model for equally sharing a cake with a continuum of friends.
Is that what you would like to do for your next birthday, omega?
ω: But I do not have a continuum of friends. Even if I added all my imaginary ones.
Ω: Try nextgenerationfacebook.
5 Conditioning
Find a way to get 6 male and 4 female students to come to the front and form a circle.
Offer 3 of the women and 2 of the men to put a hat on their heads. Get another student
and call him Mr. Randomness. Bring him into the centre of the circle, blindfold him and
let him spin around until the audience calls him to stop. He faces one of the other 10
students, say X. What is the probability that X is wearing a hat? Five of the ten have a
hat, so it’s a half.
Now Mr. Random is asked to exchange “hi” with the student he is facing. What is the
probability that X is wearing a hat? Is there anything different? Well, Mr. Random did
get some information. From the voice saying “hi” he can tell whether X is male or female.
As Mr. Random had seen the students with their hats before getting blindfolded, he knows
that 3 of the 4 women wear hats, whereas only 2 of the 6 men wear hats. In this run of
the experiment it turns out that X is a woman. What he is going to calculate now is the
probability that the person has a hat on given he already knows the person is a woman.
Among the women, what is the chance she has got a hat on? He takes the number of
women with hats and divides by the number of women.
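The two answers (1/2 before the “hi”, 3/4 after) come straight from counting; a small sketch of the circle:

```python
from fractions import Fraction

# The circle: 6 men (2 with hats) and 4 women (3 with hats).
students = ([("man", True)] * 2 + [("man", False)] * 4
            + [("woman", True)] * 3 + [("woman", False)] * 1)

hats = sum(1 for _, hat in students if hat)
print(Fraction(hats, len(students)))      # P(hat) = 5/10 = 1/2

# After the "hi": condition on X being a woman.
women = [s for s in students if s[0] == "woman"]
women_hats = sum(1 for _, hat in women if hat)
print(Fraction(women_hats, len(women)))   # P(hat | woman) = 3/4
```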
ω: Why does the left-hand side in Definition 5.1 look symmetric, but then the right-hand side is not?!
Ω: P (A | B) is just a short form for the right-hand side and it is not symmetric. But I agree, that the
symbolic expression (A | B) is misleading in terms of its symmetric appearance. But what actually is the
relationship between P (A | B) and P (B | A)?
Some implications of the definition are both easy to prove and very frequently used:
Multiplication rule
Bayes’ rule
P (B | A) = P (A | B) · P (B) / [ P (A | B) · P (B) + P (A | B^c ) · (1 − P (B)) ]   (8)
for all A, B ∈ F with P (A) > 0, 0 < P (B) < 1.
The first two statements are immediate consequences of Definition 5.1. The last two
statements are special cases of the more general Theorems 5.6 and 5.7. Before we state and
prove these theorems, we will show a few examples for the use of the rules (4) to (8).
Ω: Well, you’re right... but you’re not. To come up with the first answer you need an idea about how
to exploit the specific conditions of this example. The second answer is a standard application of the
multiplication rule. It should be in your repertoire. In particular, when things get a bit more complex.
What if the coin is biased? What if there is more than one ticket numbered 3, and in both boxes? What
if the tickets in a box are not equally likely to be drawn? You can still use the first approach, but it won’t
look so elegant anymore.
P (Se | G) = P (Se )/P (G) · P (G | Se ) = (1/2)/(5/12) · 1/2 = 3/5.
Furthermore, P (Su | G) = 1−P (Se | G) = 2/5. Therefore it is 50% more likely that Omega’s
golden ticket came from the shop with equal probabilities than from the other shop.
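The calculation can be reproduced step by step; in this sketch the likelihood 1/2 for the first shop is as stated, while the value 1/3 for the second shop is not stated explicitly and is inferred from P(G) = 5/12:

```python
from fractions import Fraction

# Priors: the two shops are equally likely sources.
p_se = Fraction(1, 2)            # the shop with equal probabilities
p_su = 1 - p_se                  # the other shop

# Likelihoods of a golden ticket from each shop; 1/2 is stated in the text,
# 1/3 is the value consistent with P(G) = 5/12 (an inferred assumption).
p_g_given_se = Fraction(1, 2)
p_g_given_su = Fraction(1, 3)

# Total probability of a golden ticket:
p_g = p_g_given_se * p_se + p_g_given_su * p_su
print(p_g)              # 5/12

# Bayes' rule:
p_se_given_g = p_g_given_se * p_se / p_g
print(p_se_given_g)     # 3/5
```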
ω: I know how I can answer Question 5.4 much faster! Say there’s a place with only one Wonka bar with
a golden ticket in it, and there is a place with a million Wonka bars one of which has a golden ticket in it
whereas the other 999,999 have none. The chance of getting a golden ticket when you search in place one
is, well, there isn’t even any chance, you surely get one. But when you are searching for a golden ticket
in the second place, it’s like 1 in a million, so your chances are incy wincy small. If you actually got a
ticket from one of those places, well, chances are it’s the first place where you got it from. The boxes in
the example are not quite as extreme, but the answer still goes in the same direction.
Problem solving technique: Exaggerating the original question (see also dialogue following Question
2.7.2).
Ω: Yes. And did you notice you argued the other way round from usual? Here is another example. Who
knitted the Klein bottle hat, the lecturer or her husband? Only a small fraction of the men know how to
knit, whereas, at least in their generation, a decent fraction of women do. So it’s more likely she knitted
that hat than he did it. This kind of thinking is a probabilistic version of reverse engineering also known
as statistical inference. Regular people actually use it in everyday life all the time.
ω: Like when you blamed me for destroying your favourite sunglasses. Just because I broke your fabulous
prize-winning rocket, I dropped your report card in the loo, I reset your password and I played frisbee with
your data CD, you think that I also stepped on your glasses. But it wasn’t me!
Proof. Since A ∩ B1 , . . . , A ∩ Bn are mutually exclusive, P (A) = Σ_{i=1}^{n} P (A ∩ Bi ). Applying
Definition 5.1 transforms the addends to P (A | Bi )P (Bi ).
P (Bk | A) = P (A | Bk )P (Bk ) / [ P (A | B1 )P (B1 ) + . . . + P (A | Bn )P (Bn ) ]   (10)
Proof. Fix k ∈ {1, . . . , n}. Formula (6) yields P (Bk | A) = P (A | Bk )P (Bk )/P (A), which
results in (10) after replacing P (A) by the right-hand side in Theorem 5.6.
Terminology
P (Bi ) are called prior probabilities – that’s because it’s before knowing about A
P (A | Bi ) are called likelihoods – probabilities of A given the different options Bi
P (Bi | A) are called posterior probabilities – that’s because it’s after knowing about A
ω: How come a theorem that is so easy to prove became so famous?
Ω: Sometimes, it’s not the technique that makes a result famous, but the idea to even think of looking
at the mathematical objects in a new way. The key idea of first the “flip around” formula and then also
Bayes’ theorem is to use the information about the probability for A given a few different Bi s to sort of
reverse engineer which of the Bi ’s happened in the first place.
ω: You mean which one is most likely to have occurred given A did. Actually, the theorem calculates the
probability for each of the Bi s given we know A happened. My teacher is using this all the time. Whenever
it’s noisy in the classroom while she’s writing at the board, she blames me! But just because I’m talking
a lot doesn’t mean it’s always me!
ω: So what?
Ω: You are being told you’ve got this disease that’s going to kill you within 5 years. You spend a few weeks
being horrified and you end up dying of a heart attack. So, indeed you died. However, the likelihood you
do not have the disease given your test was positive is actually about two in three!
Ω: Not actually having the disease despite being tested positive is an event called false positive. Related
to that is the false positive rate and the specificity and the sensitivity of the test and...
ω: I don’t care about all this medical terminology. I just want to know this: If they gave me a positive
test result, what is the probability that I actually have the disease.
Ω: That’s what the physicians call predictive value of a positive test and it’s just what we computed above.
It does depend on the quality of the test but also on the prevalence of the disease (i.e., how
common it is) in the population you are drawn from. In the example above, the prevalence was only 1%. This
made the numerator small, but inflated the denominator because of the term 1 − P (D) = 99%.
Ω: Actually, I was being a bit cynical about the heart attack... There are ways to apply the test in a more
meaningful way.
The medical diagnostics example is a very common situation. The false positive issue is a
concern when the prevalence is low even with good quality tests. This is often the case for a
screening test for a disease in an unspecific population, unless the disease is very common.
Depending on the nature of the diagnosis, such news may create stress and anxiety in
the individual who received them, even if it is understood that the diagnosis has a lot of
uncertainty attached to it. Often, the individual will have to wait long periods of time
until further tests can be done and deliver refined results. The aim of avoiding stress for
the patient has to be weighed against the usefulness of potentially true news, for example
if early treatment is crucial for survival. Examples for population-wide screening with low
incidence rates of the condition tested for may be found, for example, in antenatal care or
detection of drug users.
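The false-positive effect can be checked with a few lines; the prevalence of 1% is from the example above, while the sensitivity and specificity below are illustrative assumptions chosen to give a predictive value near one third:

```python
# Prevalence of 1% is from the text; sensitivity/specificity are assumptions.
prevalence = 0.01      # P(D)
sensitivity = 1.00     # P(positive | D), assumed
specificity = 0.98     # P(negative | no D), assumed

# Total probability of a positive test, then Bayes' rule:
p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
ppv = sensitivity * prevalence / p_pos   # predictive value of a positive test
print(round(ppv, 3))   # about 1/3: a roughly two-in-three chance of a false positive
```

Even with a very good test, the tiny prior P(D) keeps the posterior low, which is exactly the point of the dialogue.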
ω: I want to tell you something. I really very much hate word problems.
Ω: Why?
Ω: You mean one part of your brain is in charge of math and the other part of your brain is in charge of
the rest of the world, and the two don’t mix? You are basically saying that applied mathematics can not
be done. Either you are coming up with the most beautiful piece of math to describe a certain real world
application but you neglect the point in the application, or you pour all you thoughts into an excellent
description of a piece of the real world but then you have no brains left to work out mathematically rigorous
answers to questions? I suggest you activate a third part of your brain to get the first two parts to act in
concert.
If the occurrence of the event B has no impact on the occurrence of the event A then
P (A | B) = P (A). Using the multiplication rule (4), this can be expressed as P (A ∩ B) =
P (A) · P (B). Or it could be expressed as P (B | A) = P (B). Note that in formulations
using conditional probabilities we need to assume that the probability of the event that
comprises the condition is not zero.
Definition 5.10. Independence of two events.
Let (Ω, F, P ) be a probability space and A ∈ F and B ∈ F events. A and B are called
independent, if P (A ∩ B) = P (A) · P (B).
ω: It just means they have nothing to do with each other. For example, if I toss a coin and I want it to
come up heads and you want it to come up tails then what you and I want is independent...
Ω: Eh, I’m not so sure about that. Let’s see... P ({h}) · P ({t}) = 1/2 · 1/2 = 1/4 and P ({h} ∩ {t}) =
P (∅) = 0, so they are disjoint, but not independent.
ω: Ah sure, they have a lot to do with each other. Because if it’s heads it can’t be tails!
The multiplication rule (4) can be generalized to more than two events. This will be used
to define a notion of independence for more than two events.
Theorem 5.13. General multiplication rule.
Let (Ω, F, P ) be a probability space and A1 , . . . , An ∈ F with P (A1 ∩. . .∩An−1 ) > 0. Then
P (A1 ∩ . . . ∩ An ) = P (A1 ) · P (A2 | A1 ) · . . . · P (An | A1 ∩ . . . ∩ An−1 ). (12)
Proof. Exercise.
(Hint: The result can be shown by induction based on multiple application of (4).)
Definition 5.14. (Mutual) independence.
Let (Ω, F, P ) be a probability space and A1 , . . . , An ∈ F events. The events are called
(mutually) independent, if for every subset {i1 , . . . , ik } of {1, . . . , n},
P (A_{i1} ∩ . . . ∩ A_{ik} ) = P (A_{i1} ) · . . . · P (A_{ik} ) (13)
Definition 5.15. Pairwise independence.
Let (Ω, F, P ) be a probability space and A1 , . . . , An ∈ F events. The events are called
pairwise independent, if P (Ai ∩ Aj ) = P (Ai ) · P (Aj ) for all i ≠ j.
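Pairwise independence is genuinely weaker than mutual independence. A standard illustration (not from the notes) uses two fair coin tosses:

```python
from fractions import Fraction
from itertools import product

# Toss a fair coin twice; all four outcomes are equally likely.
omega = list(product("HT", repeat=2))

def P(event):
    # Probability of an event given as a predicate on outcomes.
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def A(w): return w[0] == "H"     # first toss is heads
def B(w): return w[1] == "H"     # second toss is heads
def C(w): return w[0] == w[1]    # both tosses agree

# Pairwise independent:
assert P(lambda w: A(w) and B(w)) == P(A) * P(B)
assert P(lambda w: A(w) and C(w)) == P(A) * P(C)
assert P(lambda w: B(w) and C(w)) == P(B) * P(C)

# ... but not mutually independent: P(A ∩ B ∩ C) = 1/4, not 1/8.
assert P(lambda w: A(w) and B(w) and C(w)) != P(A) * P(B) * P(C)
```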
6 Repeated experiments
This section is about doing the same experiment several times. We consider a simple, but
not trivial, situation: a series of independent Bernoulli trials (see Example 4.2).
Consider a Bernoulli trials process of length n with success probability p ∈ [0, 1] and failure
probability q = 1 − p as described in Model 6.1. We want to compute probabilities for the
outcomes. Let us illustrate this in a special case.
Problem solving technique: Consider a special case or an example.
Example 6.2. Three independent Bernoulli trials.
Let n = 3. Using independence we observe
Go back to the general case of a Bernoulli trial process of length n. For ω ∈ Ω let ω(i) be the
ith place in the tuple. A closer look, using independence, shows that the probability of
an outcome only depends on the number of successes and failures. Given the total number
of trials is fixed, it is enough to record only the number of successes. The events
can be represented using the 0-1-coding we chose to represent failure and success (check
yourself):
Sk = { ω ∈ Ω | Σ_{i=1}^{n} ω(i) = k }   (k = 0, 1, . . . , n). (16)
This implies
To calculate the probability of the event Sk it remains to count the number of outcomes
in Sk . In how many ways can we choose (exactly) k successes in n trials? According to
Section 3.3, using the formula for “without replacement, not ordered”, this is (n choose k), so
P (Sk ) = (n choose k) · p^k · q^{n−k}   (k = 0, 1, . . . , n). (17)
This defines a probability measure (see Example 4.3) on the set of possible outcomes for
the number of successes.
P (“at least 5 are not checked out”) = P (“at most 2 are checked out”)
= P (S0 ∪ S1 ∪ S2 ) = Σ_{k=0}^{2} P (Sk ) = Σ_{k=0}^{2} (7 choose k) (1/10)^k (9/10)^{7−k}
= 9^5/10^7 · (9^2 + 7 · 9 + 21) ≈ 0.9743
So the probability that you can get at least 5 of the books from your list is about 97.43%.
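The arithmetic of the library example is quick to check with the binomial pmf (17):

```python
from math import comb

# 7 books on the list; each is checked out independently with probability 1/10.
n, p = 7, 0.1
q = 1 - p

def binom_pmf(k):
    # P(S_k): probability that exactly k of the n books are checked out.
    return comb(n, k) * p**k * q**(n - k)

# "At least 5 available" is the same event as "at most 2 checked out".
prob = sum(binom_pmf(k) for k in range(3))
print(round(prob, 4))  # 0.9743
```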
In the last section we calculated the probability for a certain number of successes in
a fixed number of trials. In this section, we work with the model for infinitely many
Bernoulli trials (see Model 6.1). We keep trying until the first success. That means
Ω = {1, 01, 001, 0001, 00001, . . .}. Ω is countable. This can be shown, for example, by
using the one-to-one map between Ω and N0 that is defined by counting the number of
zeros in any outcome ω ∈ Ω.
Define events based on the waiting time:
To calculate the probability for Wk just look at the meaning: The first k − 1 trials are
failures, and the kth trial is a success. Since the trials are independent, the probability
factorizes:
To get a feeling for this distribution it is helpful to look at the consecutive odds ratios
(and they are useful to quickly sketch a histogram of the distribution).
P ({k + 1}) / P ({k}) = q   (k = 1, 2, . . .). (22)
Define the events Vk = “it takes at most k trials until the first success” (k = 1, 2, . . .). Let k ∈ N.
Then Vk = W1 ∪ W2 ∪ . . . ∪ Wk with W1 , W2 , . . . , Wk mutually exclusive. Using results
about geometric sums (see Analysis textbooks) we get
P (Vk ) = Σ_{i=1}^{k} P (Wi ) = Σ_{i=1}^{k} q^{i−1} · p = (1 − q^k)/(1 − q) · p = 1 − q^k . (23)
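The geometric-sum identity (23) can be verified numerically; the success probability below is just an illustrative choice:

```python
# Geometric waiting time for the first success: P(W_k) = q^(k-1) * p.
# Check the closed form P(V_k) = 1 - q^k against the partial sums.
p = 0.25   # an illustrative success probability
q = 1 - p

for k in [1, 5, 20]:
    cumulative = sum(q**(i - 1) * p for i in range(1, k + 1))
    assert abs(cumulative - (1 - q**k)) < 1e-12

print(1 - q**5)  # chance of a success within the first 5 trials
```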
ω: ‘Excuse me UK, can you spare some cash? Excuse me EU, can you spare some cash?’ Is the Penny
example about illiquidity?
Ω: It certainly is an application of no less importance than Archeology, Biology, Coding, Demography, Elec-
trical engineering, Genetics, Hurricanes, Insurances, Juggling (most times), Knitting (sometimes), Logic,
Magnetism, Networks, Operations research, Psychology, Queuing, Risk, Statistics, Trees, Uncertainty,
Volcano science, Walks, some random applications X and Y, and the Zeta function.
As in the last section, consider the model for infinitely many Bernoulli trials (see Model
6.1). Now keep trying until you have obtained r successes. The outcome space Ωr now
consists of all 0-1-patterns that have exactly r 1’s in them, one of them at the end.
For example, for r = 2 : Ωr = {11, 011, 101, 0011, 0101, 1001, 00011, 00101, 01001, 10001, . . .}.
Proof. There is no canonical candidate for a one-to-one map between Ωr and the integers
as in the last section, but it is still easy to do. One could arrange the outcomes in a
triangle and then define a map from the integers to Ω by enumerating the outcomes row
by row. For example, for r = 2 use
11
011 101
0011 0101 1001
00011 00101 01001 10001
...
and define f (1) = 11, f (2) = 011, f (3) = 101, f (4) = 0011, . . .
It works similarly for larger r, but the explicit construction of such a map becomes more
cumbersome as r gets bigger. Here is another way to show that Ωr is countable: Every
ω ∈ Ωr can be described by its length and the locations of the first r − 1 successes.
This suggests a surjection from the countable set N^{r−1} × N onto Ωr , which implies that
Ωr is countable.
We are interested in the probabilities for waiting a certain number of trials until the rth
success. Proceed similarly to the last section.
Define events based on the waiting time until the rth success:
To calculate the probability for Wk just look at the meaning: The kth trial is a success,
and there are r − 1 successes and k − 1 − (r − 1) failures among the first k − 1 trials. For any
outcome of this kind the probability is p^{r−1} · q^{k−1−(r−1)} · p = p^r q^{k−r} . So, similarly to the derivation of
(17) we obtain
P (Wk ) = (k−1 choose r−1) · p^r · q^{k−r}   (k = r, r + 1, . . .). (25)
This probability measure on (Ωr , P(Ωr )) can also be represented as a probability measure on
(N, P(N)).
To get a feeling for the probabilities of these events it is helpful to look at the consecutive
odds ratios (and they are handy to quickly sketch a histogram). A simple calculation yields
P (Wk+1 ) / P (Wk ) = k/(k + 1 − r) · q   (k = r, r + 1, . . .). (26)
Here are some applications of the negative binomial distribution. In other words, the
Bernoulli trials process may be an appropriate model and we are interested in looking at
the waiting time for the rth success:
• A rope consists of several cables. It has to be replaced after 3 of them have broken.
• How likely is it that milk bottles get stolen more than twice in a year?
• Parents of pupils who arrive late more than 3 times a term need to see the
head teacher.
ω: We use this for counting fish. And the Sauterelle distribution for counting grasshoppers and the de
Chevalier’s distribution for counting horsemen.
Ω: Oh you’re being silly. Read this old story about Chevalier de Méré... e.g. at
http://www.ualberta.ca/MATH/gauss/fcm/BscIdeas/probability/DeMere.htm.
P (N ≥ 1) = 1 − P (N = 0) = 1 − e^{−0.5} · 0.5^0/0! = 1 − e^{−0.5} ≈ 0.393.
In other words, there is a likelihood of almost 40% for a misprint on page 20 (or on any
other page for that matter).
P (N ≤ 2) = P (N = 0) + P (N = 1) + P (N = 2)
= e^{−3.2} + 3.2 · e^{−3.2} + (3.2^2/2!) · e^{−3.2}
= (1 + 3.2 + 3.2^2/2!) · e^{−3.2} ≈ 0.37989.
Example 6.15. Earthquakes.
Earthquakes are very unpredictable. A common simple model is to assume they follow a
Poisson distribution. For a region in the West of the US it is known that the rate is about
2 earthquakes per week. We will now calculate a few different kinds of probabilities using
this model.
(i) What is the probability that there are at least 3 earthquakes next week? Let N be
the number of earthquakes next week.
P (N ≥ 3) = 1 − Σ_{k=0}^{2} P (N = k) = 1 − (2^2/2! + 2^1/1! + 2^0/0!) · e^{−2} = 1 − 5 e^{−2} .
(ii) What is the probability that there are exactly k earthquakes during the next 3 weeks?
This corresponds to changing the unit to 3 weeks with a new intensity parameter 6.
The probability is therefore e^{−6} · 6^k/k!.
More generally, what is the probability that there are exactly k earthquakes during
the next m weeks?
Reasoning as before we obtain e^{−2m} · (2m)^k/k!.
(iii) Let us now change the perspective from which we look at the situation and define T to be the
time (in weeks) until the next earthquake. Let Nm be the number of earthquakes
during the next m weeks. The crucial point here is to understand that the sets
{T > m} and {Nm = 0} are identical. Therefore P (T > m) = P (Nm = 0) = e^{−2m} .
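The three earthquake calculations can be replayed in a few lines; the pmf below is the standard Poisson formula e^{−λ} λ^k/k!:

```python
from math import exp, factorial

def poisson_pmf(k, rate):
    # P(N = k) for a Poisson random variable with the given rate.
    return exp(-rate) * rate**k / factorial(k)

# (i) At least 3 earthquakes next week (rate 2 per week):
p_at_least_3 = 1 - sum(poisson_pmf(k, 2) for k in range(3))
print(round(p_at_least_3, 4))  # 1 - 5e^(-2) ≈ 0.3233

# (ii) Exactly k earthquakes in the next 3 weeks: the rate becomes 2 * 3 = 6.
print(poisson_pmf(4, 6))

# (iii) No earthquake in the next m weeks: P(T > m) = P(N_m = 0) = e^(-2m).
m = 1
assert abs(poisson_pmf(0, 2 * m) - exp(-2 * m)) < 1e-12
```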
Some other situations where a Poisson distribution is a good candidate for a suitable
model:
• number of people who survive age 100
• number of floods in the UK per year
• number of occurrences of a certain pattern in a stretch of DNA
Some comments about the T-shirt with the random scatter — not relevant for the exam.
In many examples the aim is to compute the probability of an expression of the form
For example:
N = “number of heads in n coin tosses”
N = “number of coin tosses until third head”
N = “number of α-particles emitted by 1 gram of radioactive material in 1 second”
N = “number of people taking the 18:22 bus to Leamington Spa”
N = “number of goals scored by Coventry City FC in the 2009-2010 season”
N = “number of broken eggs in an egg carton selected at random from your grocery store”
More generally, a lot of examples call for the probability of an expression of the form
For example:
X = “sum of two dice”
X = “test score of a randomly selected student from this class”
X = “phone number selected at random from the Coventry phone book”
X = “weight (rounded off to the nearest stone) of a randomly selected British citizen”
X = “minimum due date of ten books you borrowed from the library”
In other words, we are interested in events that are characterised by the level sets of some
function of the outcomes of the random experiment. Formally, this is captured by the
following concept.
ω: I will not ever never use a random variable. They are totally unnecessary, because I completely already
know everything from the sample space. If I want to know the number of heads in three coin tosses then
I choose Ω = {0, 1, 2, 3} and my probabilities are defined right there.
Ω: Okay, whether or not you use random variables is your choice, but consider that I’m witnessing the
exact same coin tossing experiment, and it would be nice if we could simply share the outcome space. I’m
interested in whether or not the last toss is a head, and your Ω = {0, 1, 2, 3} does NOT tell me that.
Remark 7.2. About notation and about the rationale for this definition.
(i) The first term in (27) is not a typographic error, but a very common short form
used in probability theory and measure theory. Here is what it means: {X = s} =
{ω ∈ Ω | X(ω) = s} = X^{−1}({s}). Expressions like {X > s} and {X ≥ s} are defined
accordingly.
(ii) Typical choices for S are finite sets, N or Z. S does not need to be numerical; for
example, a random variable describing the colour of a stripe picked at random from
the mural in the Street is in a finite set of colours. Another example for a colour-
valued random variable is shown in Example 7.8 below.
(iii) Condition (27) is called measurability of the function X, a basic concept in measure
theory that will be used massively in modules on advanced probability theory and
measure theory. The immediate use here is that because of the measurability of X
any probability measure P on the space (Ω, F) canonically defines probabilities for
the level sets of X via the equation pX := P ◦ X^{−1} as detailed in Definition 7.6.
(iv) Condition (27) may seem artificial in the discrete setting, because one could always
force it to be true simply by choosing F = P(Ω). By contrast, this is not always possible
in the continuous case and not always appropriate in the discrete case (see Remark
4.1).
Example 7.3. Indicator functions.
Let (Ω, F) be a measurable space and A ∈ F. The function 1A on Ω defined by
1A (ω) = 1 for ω ∈ A and 1A (ω) = 0 for ω ∈ A^c   (28)
is called the indicator function of the event A.
Ω: After your dislike of random variables earlier, that comment does surprise me.
x 0 1 2 3 y -3 -1 1 3
pX (x) 1/8 3/8 3/8 1/8 pY (y) 1/8 3/8 3/8 1/8
x 0 1 t 1 2 3 4
pX (x) 1/2 1/2 pT (t) 1/2 1/4 1/8 1/8
pX (x) = P (X = x) (x ∈ S) (29)
The above example is the easiest non-trivial example. We have already seen a number of
more complicated probability mass functions for random variables such as the binomial distribution
(X = “number of successes”), the geometric distribution (X = “number of trials until the first
success”) and the negative binomial distribution (X = “number of trials until the rth success”).
The following example shows a probability mass function obtained from (real-world) data.
Proof. The range of Y is ψ(S). It is discrete because its cardinality is at most the
cardinality of S. It remains to show condition (27) for Y. For any y ∈ ψ(S), {Y = y} =
{ω ∈ Ω | ψ(X(ω)) = y} = {ω ∈ Ω | X(ω) ∈ ψ^{−1}(y)} = ∪_{x∈ψ^{−1}(y)} {X = x}. By (27) for
X, each {X = x} is in F, and since this union is countable, {Y = y} ∈ F as well.
Example 7.5. Some random variables from three coin tosses (ct.)
Y = ψ ◦ X for ψ(x) = 2x − n and pY (y) = Σ_{x:ψ(x)=y} pX (x).
In particular, pY (3) = pX (3), pY (1) = pX (2), pY (−1) = pX (1), pY (−3) = pX (0).
Choose Ỹ = ψ̃ ◦ X for ψ̃(x) = | 2x − n |. In other words, Ỹ is the gain in the game,
regardless who receives it.
Now pỸ (y) = Σ_{x:ψ̃(x)=y} pX (x), but pỸ (3) = pX (3) + pX (0) and pỸ (1) = pX (2) + pX (1).
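The pushed-forward pmfs in this example can be replayed by enumeration; a sketch for n = 3 fair coin tosses, with ψ(x) = 2x − n and ψ̃(x) = |2x − n|:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# X = number of heads in n = 3 fair coin tosses.
n = 3
omega = list(product("HT", repeat=n))
p_outcome = Fraction(1, len(omega))

pX = Counter()
for w in omega:
    pX[w.count("H")] += p_outcome   # pX: 1/8, 3/8, 3/8, 1/8

# Push the pmf through psi(x) = 2x - n and psi~(x) = |2x - n|:
pY = Counter()
pY_tilde = Counter()
for x, prob in pX.items():
    pY[2 * x - n] += prob
    pY_tilde[abs(2 * x - n)] += prob

assert pY[-3] == pX[0] and pY[3] == pX[3]
assert pY_tilde[3] == pX[3] + pX[0] == Fraction(1, 4)
```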
Lemma 7.10. Combination of random variables.
Let (Ω, F) be a measurable space. Let X and Y be discrete random variables with ranges
S and R, which are discrete subsets of R. Then the following expressions are also
discrete random variables: max(X, Y ), min(X, Y ), X + Y, X − Y, X · Y and, if Y (ω) ≠ 0
for all ω ∈ Ω, X/Y.
Proof. Let Z := (X, Y ). This is a random variable on (Ω̃, F̃), where Ω̃ = Ω × Ω is the
product space and F̃ is the product σ-algebra. The operations max, min, +, −, · and /
between the two random variables can be understood as a function ψ of the pair Z. For
example, X + Y = ψ1 ◦ Z, where ψ1 is the function on S × R defined by ψ1 (x, y) := x + y,
and max(X, Y ) = ψ2 ◦ Z, for ψ2 (x, y) := max(x, y). Since S and R are discrete subsets of
R, so is the range of ψi ◦ Z. By Lemma 7.9, ψi ◦ Z is a discrete random variable.
Remark 7.11. Continuous analogues.
Both Lemma 7.9 and Lemma 7.10 have analogues for continuous random variables, but
additional assumptions are needed and more care needs to be taken in the proofs.
Example 7.12. Simple random variables built from indicator functions.
Let αi ∈ R (i = 1, . . . , n) and let Ai ∈ F (i = 1, . . . , n). By Example 7.3, the latter
define discrete random variables. Applying Lemma 7.9 to ψi (x) = αi x yields that
αi 1Ai (i = 1, . . . , n) are also discrete random variables and multiple application of Lemma
7.10 ensures that Σ_{i=1}^{n} αi 1Ai is also a discrete random variable. Random variables of
this form are called simple random variables.
Note that in the proof above we made use of the pair of two random variables defined
on the product space. Thinking about them in this way will often come in handy when
computing their probability mass function. The following definition will help to formalise
this.
Definition 7.13. Joint PMF/joint distribution.
Let (Ω, F, P ) be a probability space and let X and Y be discrete random variables. Then
                         range(X)
                         0       1      distribution of Y
range(Y)    0           4/12    3/12      7/12                   (33)
            1           2/12    3/12      5/12
distribution of X       6/12    6/12
The distribution of Y is obtained by the sums of the rows: pY (0) = 7/12 and pY (1) = 5/12.
7.1 Expectation
Ω: Look, I’ve got a box with two 50p-coins, one £1-coin and two £2-coins. You can pick one at random.
What do you expect to get?
Ω: This can be rewritten as p(0.5) · 0.5 + p(1) · 1 + p(2) · 2, where p(ω) = P ({ω}) for ω ∈ Ω = {0.5, 1, 2}
with p(0.5) = P ({£0.5}) = 2/5, p(1) = P ({£1}) = 1/5 and p(2) = P ({£2}) = 2/5.
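Ω's box can be worked out directly; a quick sketch (note the £1 coin has probability 1/5, since there is only one of it among five coins):

```python
from fractions import Fraction

# Ω's box: two 50p coins, one £1 coin and two £2 coins; one is picked uniformly.
coins = [Fraction(1, 2), Fraction(1, 2), Fraction(1), Fraction(2), Fraction(2)]

# E = sum of value * probability; with a uniform pick this is just the mean.
expected = sum(coins) / len(coins)
print(expected)  # 6/5, i.e. £1.20
```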
average An := 1/n · Sn . Let us compute their expected values using the linearity of E.
E[Sn ] = Σ_{i=1}^{n} E[Xi ] = n · 2.5,
E[An ] = E[(1/n) · Sn ] = (1/n) · E[Sn ] = 2.5.
Question 7.24. Expected number of working components.
In a system of n components each works with a certain probability pi (i = 1, . . . , n).
Compute the expected number of working components.
Answer: Use Ω0 = {w, f} as an outcome space describing the state of an individual
component using w for “works” and f for “fails”. Let Ω = Ωn0 be the outcome space
describing the state of the whole system. Let Ai (i = 1, . . . , n) be the event that the i th
component works. (Formally, this looks for example as follows: A1 = {(w, ω2 , . . . , ωn ) | ωi ∈
Ω0 for i = 2, . . . , n}.) The number X of working components can be represented as the
sum of indicator functions: X(ω) = 1A1 (ω) + . . . + 1An (ω). Using (36) yields
E[X] = E[ Σ_{i=1}^{n} 1Ai ] = Σ_{i=1}^{n} E[1Ai ] = Σ_{i=1}^{n} pi . (37)
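Formula (37) can be sanity-checked by simulation; the component probabilities below are made-up illustrative values (note that no independence between components is needed for the expectation):

```python
import random

# n components, the i-th works with probability p_i; X = number working.
p = [0.9, 0.5, 0.75, 0.6]     # illustrative values
expected = sum(p)             # E[X] = p_1 + ... + p_n, by (37)
print(expected)               # 2.75

# A quick simulation to back up the indicator argument:
random.seed(0)
trials = 200_000
mean = sum(sum(random.random() < pi for pi in p) for _ in range(trials)) / trials
assert abs(mean - expected) < 0.02
```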
The method used in the answer above is very general and has got a name:
Proof. Using X(ω) = Σ_{i=1}^{n} 1Ai (ω) (ω ∈ Ω) the statement follows from (37).
The proof is just a formalisation of the simple idea used in the explicit example above. It
deserves to be a lemma, because the technique is often quite useful. However, it is a bit
of an art to recognise when it can be applied.
Ω: They forgot the independence assumption.
is called variance of X.
In other words, we measure how much the outcomes deviate from the expected value by
the square of their difference. Then we take the expected values of that expression. Using
the square makes the expression always non-negative. Alternatively, absolute values could
be used (and they are used in the so-called L1 -norm approach to statistics).
The variance in (iii) is high. This is related to the issue discussed in Remark 7.18.
The expectation is a measure of location and the variance is a measure of spread. They are
examples of one-parameter summaries of distributions or data sets. Their mathematical
expressions can be computed by the recipes provided. However, whether they do provide
good summaries or whether they give an oversimplified or even misleading picture of the
situation has to be checked in the particular cases.