
GENERAL ⎜ ARTICLE
What is Probability Theory?
K B Athreya
This issue of Resonance features Joseph Leonard Doob, who played a critical role in the development of probability theory in the world from 1935 onwards. The goal of the present article is to explain to the readers of Resonance what probability theory is all about.

K B Athreya is a retired professor of mathematics and statistics at Iowa State University, Ames, Iowa, in the USA. He spends several months in India visiting schools, colleges and universities. He enjoys teaching mathematics and statistics at all levels. He loves Indian classical and folk music.

Probability theory provides the mathematical basis for the study of random phenomena, that is, phenomena whose outcome is not predictable ahead of time. In this article, we try to provide a more detailed answer.

Introduction

Let us start with an example each of random and non-random (also called deterministic) phenomena:
i) What will be the temperature at 4pm a week from now at the 18th Cross and Margosa Road intersection in Bengaluru?
ii) We throw a stone up from a spot inside an open football ground and observe whether it falls down to the ground or not.

Which one of these can be termed random and which one non-random?
By and large, most physical and natural phenomena can be classified into one or other of these two categories. The readers are invited to construct their own examples of real world phenomena of both kinds (say, two each).

Keywords: Random variables, distribution function, statistical inference, error function, law of large numbers.

292 RESONANCE ⎜ April 2015

Over the last few centuries mathematical methods have been developed to study many deterministic (i.e. non-random) phenomena, especially those evolving over time. The study of motion of physical objects over time by Newton led to his famous three laws of motion as well as many important developments in the theory of ordinary differential equations.
Similarly, the construction and study of buildings led to important results in geometry in many parts of the world such as India, China, the Middle East and Greece. Also, advances in quantum mechanics, relativity, etc., were based on deep results from the theory of ordinary and partial differential equations.
Early Beginnings
A mathematical study of random phenomena could be said to have originated in the calculations of odds in some gambling problems in 18th century Europe. The principal models considered were binomial distributions and their Poisson approximations, and later on normal approximations. For example, if a coin is tossed n times independently (i.e., the outcomes of any subset of these n tosses have no effect on the outcomes of the remaining tosses) and the probability of 'heads' in any one toss is p, 0 ≤ p ≤ 1, then it can be shown that the probability of getting r heads in n tosses is simply

$$p_{n,r} \equiv \binom{n}{r} p^r (1-p)^{n-r}, \quad r = 0, 1, 2, \cdots, n, \quad \text{where } \binom{n}{r} = \frac{n!}{r!\,(n-r)!}.$$

This collection of (n + 1) numbers $\{p_{n,r},\ r = 0, 1, 2, \cdots, n\}$ is called the binomial probability distribution B(n, p), 0 ≤ p ≤ 1, n = 0, 1, 2, · · · . Note that $p_{n,r}$ is non-negative and $\sum_{r=0}^{n} p_{n,r} = (p + (1-p))^n$ by the binomial theorem and hence is equal to 1. Later on, it was shown by Poisson that this quantity $p_{n,r} \equiv \binom{n}{r} p^r (1-p)^{n-r}$ could be approximated for each r, 0 ≤ r ≤ n, by $p_r \equiv e^{-\lambda} \frac{\lambda^r}{r!}$ if n is large and p is small but np is neither large nor small but close to some λ, 0 < λ < ∞. This collection $\{p_r \equiv e^{-\lambda} \frac{\lambda^r}{r!},\ r = 0, 1, 2, \cdots\}$ of numbers is called the Poisson (λ) probability distribution, 0 < λ < ∞. It may be noted that $p_r \ge 0$ for all r = 0, 1, 2, · · · , and $\sum_{r=0}^{\infty} p_r = 1$.
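The binomial formula and its Poisson approximation are easy to check numerically. The following is a minimal sketch, not part of the original article; the function names and the values n = 1000, p = 0.003 (so that np = 3) are chosen here purely for illustration.

```python
from math import comb, exp, factorial

def binomial_pmf(n: int, p: float, r: int) -> float:
    """Probability of exactly r heads in n independent tosses."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

def poisson_pmf(lam: float, r: int) -> float:
    """Poisson(lam) probability of the value r."""
    return exp(-lam) * lam**r / factorial(r)

# Assumed illustration values: n large, p small, np = 3.
n, p = 1000, 0.003
lam = n * p

# The (n + 1) binomial probabilities should add up to 1.
total = sum(binomial_pmf(n, p, r) for r in range(n + 1))

# Largest pointwise gap between the two distributions for r = 0..20.
max_gap = max(abs(binomial_pmf(n, p, r) - poisson_pmf(lam, r)) for r in range(21))

print(f"sum of binomial probabilities = {total:.12f}")
print(f"largest gap for r = 0..20     = {max_gap:.2e}")
```

Repeating the experiment with a larger p (say p = 0.3) shows the gap growing, which is why the approximation is stated only for small p.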
A bit later, De Moivre and Laplace proved the following. Let n be large and let $p_n$ be not necessarily small but such that $\sigma_n := \sqrt{n p_n (1 - p_n)}$ tends to infinity. Then, the binomial probabilities can be approximated by a Gaussian distribution. More precisely, the sum of the binomial probabilities $\binom{n}{r} p_n^r (1 - p_n)^{n-r}$ over all r in an interval of the form $(a\sigma_n + n p_n,\ b\sigma_n + n p_n)$ will converge to $\Phi(b) - \Phi(a)$, where $\Phi(y)$ is equal to the integral of the error function $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ over $(-\infty, y)$. This can be translated into a probability statement: as $n \to \infty$,

$$\mathrm{Prob}\left(a < \frac{X_n - n p_n}{\sqrt{n p_n (1 - p_n)}} < b\right) \to \mathrm{Prob}(a < Y < b) \equiv \int_a^b \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}\, dx,$$

where $X_n$ is a random variable with the distribution binomial $(n, p_n)$ and Y is a random variable that is normally distributed with mean $EY \equiv 0$ and variance $V(Y) = EY^2 - (EY)^2 = 1$. This is also referred to as an example of the Central Limit Theorem (CLT) (see footnote 1) and could be thought of as a refinement of the weak law of large numbers which says: for each $\epsilon > 0$,

$$\mathrm{Prob}\left(\left|\frac{X_n}{n} - p_n\right| > \epsilon\right) \to 0$$

as $n \to \infty$ provided $n p_n (1 - p_n) \to \infty$. Later, both these results were proved for a much larger class of distributions than just the binomial $(n, p_n)$ cases.

Footnote 1: The great mathematician George Pólya coined the term 'central', meaning fundamental. An issue of Resonance, Vol.19, No.4, 2014 is devoted to Pólya.
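The normal approximation of De Moivre and Laplace can also be checked by direct summation. The sketch below is an illustration only: the helper names and the values n = 1000, p = 0.3, a = −1, b = 2 are assumptions made here, and Φ is written in terms of the standard library function `math.erf`.

```python
from math import comb, erf, sqrt

def Phi(y: float) -> float:
    """Standard normal distribution function, expressed through math.erf."""
    return 0.5 * (1.0 + erf(y / sqrt(2.0)))

def binomial_window(n: int, p: float, a: float, b: float) -> float:
    """Sum of binomial(n, p) probabilities over r with a < (r - np)/sigma < b."""
    mean = n * p
    sigma = sqrt(n * p * (1 - p))
    return sum(
        comb(n, r) * p**r * (1 - p) ** (n - r)
        for r in range(n + 1)
        if a < (r - mean) / sigma < b
    )

# Assumed illustration values.
a, b = -1.0, 2.0
exact = binomial_window(1000, 0.3, a, b)
limit = Phi(b) - Phi(a)
print(f"binomial sum = {exact:.4f}, Phi(b) - Phi(a) = {limit:.4f}")
```

The two numbers agree to about two decimal places at n = 1000; the remaining gap comes mostly from the discreteness of the binomial distribution.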
Kolmogorov’s Model
A mathematical theory as a basis for studying random phenomena was provided by the great Russian mathematician A N Kolmogorov (see footnote 2) around 1930. About twenty years earlier, Henri Lebesgue of France extended the notion of length of intervals in R, the real line, to a large class M of sets in R, now called Lebesgue measurable sets. The extended function λ on M satisfied the condition that (R, M, λ) is a measure space, i.e., M is what is now known as a σ-algebra of subsets of R that included all intervals and λ : M → [0, ∞] was such that λ is a measure. (See precise definition later.)

Footnote 2: Resonance, Vol.3, No.4, 1998.

Kolmogorov saw in Lebesgue's theory of measure on R an appropriate mathematical model for studying random phenomena.
First, one identifies the set Ω of possible outcomes associated with the given random phenomenon. This set Ω is called the sample space and a typical individual element ω in Ω is called a sample point. Even though the outcome of the experiment is not predictable ahead of time, one may be able to determine the 'chances' that some particular statement about the outcome is valid. The set of ω's for which a given statement is valid is called an event. Thus, an event is a subset of the sample space Ω.

After identifying the sample space, one identifies a class F of subsets of Ω (not necessarily all of P(Ω), the power set of Ω, i.e., the collection of all possible subsets of Ω) and then a set function P on F such that for an event A in F, P(A) will represent the chance of the event A happening. Thus, to a given random phenomenon, one associates a triplet (Ω, F, P) where Ω is the set of all possible outcomes (called the sample space), F ⊂ P(Ω) is a collection of subsets (called the events collection) and P is a function from F to [0, ∞] (called a probability distribution). It is reasonable to impose the following conditions on F and P.
i) $A \in \mathcal{F}$ implies $A^c \in \mathcal{F}$ (where $A^c$ is the complement of A, i.e., $A^c = \{\omega : \omega \notin A\}$), i.e., if A is an event then A not happening, i.e., $A^c$, should also be an event.
ii) $A_1, A_2 \in \mathcal{F}$ should imply $A_1 \cup A_2 \in \mathcal{F}$, i.e., if $A_1$ and $A_2$ are events then at least one of the two events $A_1$ and $A_2$ happening should also be an event.
iii) For all A in F, P(A) should be in [0, 1] with P(Ω) = 1 and P(∅) = 0, where ∅ is the empty set.
iv) $A_1, A_2 \in \mathcal{F}$, $A_1 \cap A_2 = \emptyset$ should imply
$$P(A_1 \cup A_2) = P(A_1) + P(A_2),$$
i.e., if $A_1$ and $A_2$ are mutually exclusive events then the probability of at least one of them happening should simply be the sum of the probabilities of $A_1$ and $A_2$.
The above conditions (i−iv) imply that F is an algebra and P is a finitely additive set function on F, i.e., F is closed under complementation and finite unions and

$$P\left(\bigcup_{i=1}^{k} A_i\right) = \sum_{i=1}^{k} P(A_i)$$

for $k < \infty$ and $A_1, A_2, \cdots, A_k \in \mathcal{F}$ with $A_i \cap A_j = \emptyset$ for $i \neq j$.
Next, it is reasonable to require that F be closed under monotone increasing unions and P be monotone continuous from below. That is, if $\{A_n\}_{n \ge 1}$ is a sequence of events in F such that $A_n \subset A_{n+1}$ for each n ≥ 1, then the 'event' $A \equiv \bigcup_{n=1}^{\infty} A_n$ of at least one of the $A_n$'s happening should be in F and P(A) should equal $\lim_{n \to \infty} P(A_n)$.

This requirement is imposed by the practical idea that if A is a complicated subset of Ω but can be approximated by a sequence $\{A_n\}_{n \ge 1}$ of non-decreasing events such that the above holds then A should be an event and $P(A_n)$ should be close to P(A) for large n. Thus, in addition to conditions (i−iv) on F and P, it is natural to require the following:

v) $A_n \in \mathcal{F}$, $A_n \subset A_{n+1}$ for all n = 1, 2, · · · should imply $A \equiv \bigcup_{n=1}^{\infty} A_n \in \mathcal{F}$ and $P(A_n) \uparrow P(A)$ as $n \to \infty$.

Let us call this last condition: P(·) is monotone continuous from below (mcfb). This last condition (v) looks very natural but along with (i−iv) forces (Ω, F, P) to be a measure space, i.e., the following holds:

vi) F is a σ-algebra (i.e., F is closed under complementation and countable unions) and P : F → [0, 1] is a measure, i.e., P is countably additive, i.e.,

$$P\left(\bigcup_{n=1}^{\infty} B_n\right) = \sum_{n=1}^{\infty} P(B_n)$$

for any $\{B_n\}_{n \ge 1} \subset \mathcal{F}$ such that $B_n \cap B_m = \emptyset$ for $n \neq m$.

Indeed, for disjoint $B_n$ the sets $A_n \equiv B_1 \cup \cdots \cup B_n$ increase to $\bigcup_{n=1}^{\infty} B_n$, so finite additivity and (v) together give $P\left(\bigcup_{n=1}^{\infty} B_n\right) = \lim_{n \to \infty} \sum_{i=1}^{n} P(B_i)$.

Since we demand P(Ω) = 1, (Ω, F, P) is called a probability space. Thus, Kolmogorov's model for the study of a random phenomenon E is to determine Ω, the sample space, i.e., the set of all possible outcomes of E, a collection F of events and a probability set function P mapping F to [0, 1] so that the triplet (Ω, F, P) is a measure space, i.e., the condition (vi) holds with P(Ω) = 1.
Some Examples.
Example 1 (Finite Sample Space). Let $\Omega \equiv \{\omega_1, \omega_2, \cdots, \omega_k\}$, $k < \infty$, $\mathcal{F} \equiv \mathcal{P}(\Omega)$, the power set of Ω, i.e., the collection of all possible subsets of Ω (show that there are exactly $2^k$ of them). Now every probability set function P on F is necessarily of the form:

$$P(A) = \sum_{i=1}^{k} p_i\, I_A(\omega_i),$$

where $\{p_i\}_{i=1}^{k}$ are such that $p_i \ge 0$ for all i and $\sum_{i=1}^{k} p_i = 1$, and $I_A(\omega) = 1$ if ω is in A and 0 if ω is not in A.
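For a small finite sample space the whole model can be written out and the conditions imposed on P checked by brute force. The sketch below is only an illustration: the three sample points and their weights are hypothetical, and exact fractions are used so that the equalities hold without rounding.

```python
from fractions import Fraction
from itertools import chain, combinations

# Hypothetical sample space with k = 3 points and weights p_i summing to 1.
omega = ("w1", "w2", "w3")
weights = {"w1": Fraction(1, 2), "w2": Fraction(1, 3), "w3": Fraction(1, 6)}

def P(event) -> Fraction:
    """P(A) = sum of p_i over the sample points that lie in A."""
    return sum((weights[w] for w in event), Fraction(0))

def power_set(points):
    """All subsets of a finite set, returned as frozensets."""
    subsets = chain.from_iterable(
        combinations(points, r) for r in range(len(points) + 1)
    )
    return [frozenset(s) for s in subsets]

events = power_set(omega)

# Finite additivity: P(A u B) = P(A) + P(B) whenever A and B are disjoint.
additive = all(
    P(A | B) == P(A) + P(B) for A in events for B in events if not (A & B)
)

print(len(events), P(frozenset(omega)), P(frozenset()), additive)
```

The count of events printed first is $2^k$ with k = 3, in line with the exercise suggested in the text.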

This is a probability model for random experiments with finitely many possible outcomes. An important example of this is in finite population sampling, used extensively by the National Sample Survey Organization of the Government of India as well as many market research groups.

Let $\{U_1, U_2, \cdots, U_N\}$ be a finite population of N units or objects. These could be individuals in a city, districts in a state, acreage under cultivation of some crops, etc. In a typical sample survey procedure, one chooses a subset of size n (n usually small compared to N), makes measurements on the chosen subset and uses this data to make inferences about the big population. Here, each sample point is a subset of size n. Thus, the sample space Ω consists of $k = \binom{N}{n} = \frac{N!}{n!\,(N-n)!}$ sample points and the probabilities $p_i$ of selecting the ith sample are determined by a given sampling scheme. In the so-called simple random sampling without replacement (SRSWOR), each $p_i = \frac{1}{k}$, i = 1, 2, · · · , k, where $k = \binom{N}{n}$. Other examples include coin tossing (finite number of times), roll of dice, card games such as Bridge.
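As a small illustration of SRSWOR, the sketch below lists the sample space for a hypothetical population of N = 5 units with n = 2 and then draws one sample. It relies on the standard library call `random.sample`, which selects distinct units; the unit labels and the seed are assumptions made here.

```python
import random
from itertools import combinations
from math import comb

population = ["U1", "U2", "U3", "U4", "U5"]   # hypothetical units, N = 5
n = 2                                         # sample size

# The sample space: every subset of size n is one sample point.
samples = list(combinations(population, n))
k = len(samples)
p_each = 1 / k                                # SRSWOR gives every subset probability 1/k

# One SRSWOR draw; the seed is fixed only to make the run repeatable.
rng = random.Random(0)
draw = rng.sample(population, n)

print(k, comb(len(population), n), p_each, draw)
```

For realistic values of N and n the list `samples` is far too large to enumerate, which is why survey practice draws the sample directly, as in the last step.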
Another important example with a finite sample space is from statistical mechanics in particle physics. Suppose $S \equiv \{s = (i_1, i_2, i_3),\ i_j \in \{0, 1, -1\},\ j = 1, 2, 3\}$ is a set of sites. Note that there are 3 × 3 × 3 = 27 sites in S. Suppose at each site s in S, there is a spin ω(s) that could be +1 or −1. Consider the collection Ω of all spin functions ω mapping S to {+1, −1}. Then the size of Ω is $2^{27}$, a finite, but large number. Call a typical element ω in Ω a configuration. Physicists assign probabilities to any configuration ω by using a parameter β, temperature T, and a function V(ω), called the potential function. It is of the form

$$p(\omega) = \frac{e^{-\frac{\beta}{T} V(\omega)}}{Z_{\beta,T}},$$

where $Z_{\beta,T} \equiv \sum_{\omega' \in \Omega} e^{-\frac{\beta}{T} V(\omega')}$ is called the partition function. Here 0 < β < ∞, 0 < T < ∞.

The probability distribution {p(ω) : ω ∈ Ω} is called the Gibbs distribution. Computing {p(ω)} is a very challenging task since computing the partition function $Z_{\beta,T}$ is quite difficult. Even more so is computing the mean and variance of some function g : Ω → R with respect to the Gibbs distribution, that is, computing $\lambda_1$ and $\lambda_2 - \lambda_1^2$ where $\lambda_k = \sum_{\omega} (g(\omega))^k\, p(\omega)$, k a positive integer. For this, the physicists Metropolis et al [2] invented a method in the early 1950's. Statisticians discovered this paper in the early 1990's and coined the term Markov Chain Monte Carlo (MCMC), and since then this subject, i.e., MCMC, has seen some rapid growth (see [1], Section 9.3, and [2]).
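To see what the Metropolis idea buys, it helps to look at a system small enough that the partition function can still be computed by enumeration. The sketch below is a toy illustration and not the method as set out in [2]: it assumes only 4 sites on a line, a hypothetical potential V (minus the sum of products of neighbouring spins) and the assumed value 0.5 for the ratio β/T. It compares the exact mean of V under the Gibbs distribution with the average of V along a single-spin-flip Metropolis chain, which never needs the partition function.

```python
import itertools
import math
import random

SITES = 4      # hypothetical toy system: 4 sites on a line
C = 0.5        # assumed value of the ratio beta / T

def V(omega):
    """Hypothetical potential: minus the sum of neighbouring spin products."""
    return -sum(omega[i] * omega[i + 1] for i in range(SITES - 1))

# Exact Gibbs distribution by enumerating all 2**SITES configurations.
configs = list(itertools.product((+1, -1), repeat=SITES))
Z = sum(math.exp(-C * V(w)) for w in configs)            # partition function
p = {w: math.exp(-C * V(w)) / Z for w in configs}
exact_mean = sum(V(w) * p[w] for w in configs)           # lambda_1 for g = V

def metropolis_mean(g, steps, rng):
    """Average of g along a single-spin-flip Metropolis chain."""
    w = [1] * SITES
    total = 0.0
    for _ in range(steps):
        s = rng.randrange(SITES)
        proposal = w.copy()
        proposal[s] = -proposal[s]
        # Accept with probability min(1, p(proposal) / p(w)); Z cancels in the ratio.
        if rng.random() < math.exp(-C * (V(proposal) - V(w))):
            w = proposal
        total += g(w)
    return total / steps

estimate = metropolis_mean(V, 100_000, random.Random(1))
print(f"exact mean of V = {exact_mean:.4f}, Metropolis estimate = {estimate:.4f}")
```

With 27 sites the enumeration step would need $2^{27}$ terms, while the chain in `metropolis_mean` costs the same per step regardless of the size of Ω.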
Example 2 (Countably Infinite Sample Space). Here, $\Omega \equiv \{\omega_1, \omega_2, \cdots\}$ is a countably infinite set, $\mathcal{F} = \mathcal{P}(\Omega)$, the power set of Ω, and

$$P(A) = \sum_{i=1}^{\infty} p_i\, I_A(\omega_i),$$

where $p_i \ge 0$ and $\sum_{i=1}^{\infty} p_i = 1$.

An example of this is the experiment of recording the number of radioactive emissions during a given period [0, T] from a specified radioactive source. Here, Ω = {0, 1, 2, · · · } and $\{p_i\}_{i \ge 0}$ is typically a Poisson distribution of the form

$$p_i = \frac{e^{-\lambda} \lambda^i}{i!}, \quad i = 0, 1, 2, \cdots, \quad 0 < \lambda < \infty.$$
Example 3 (Real-Valued Random Variables). Let $\Omega \equiv \mathbb{R}$, $\mathcal{F} \equiv \mathcal{B}(\mathbb{R})$, the Borel σ-algebra in R, i.e., the smallest σ-algebra containing all intervals. (See the definition of a σ-algebra given earlier.) Let F : R → [0, 1] be a cumulative distribution function (CDF), i.e.,
i) $x_1 \le x_2 \Rightarrow F(x_1) \le F(x_2)$,
ii) $\lim_{x \to -\infty} F(x) = 0$,
iii) $\lim_{x \to +\infty} F(x) = 1$.

Then it was shown by Stieltjes [1] that there is a probability measure $\mu_F$ on (R, B(R)) such that

$$\mu_F((a, b]) = F(b+) - F(a+) \quad \forall\ -\infty < a < b < \infty,$$

where $F(x+) \equiv \lim_{y \downarrow x} F(y)$. Let X : Ω → Ω be the identity map, i.e. X(ω) = ω. This serves as a model for a single real-valued random variable X. We give below a number of examples of F's that are probability distribution functions on R = (−∞, ∞).
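i) Normal or Gaussian $(\mu, \sigma^2)$: $-\infty < \mu < \infty$, $0 < \sigma < \infty$. Here,

$$F(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{x} e^{-\frac{(u-\mu)^2}{2\sigma^2}}\, du.$$

ii) Gamma (α, p): 0 < α, p < ∞. Here

$$F(x) = \begin{cases} 0, & x \le 0 \\ \dfrac{\alpha^p}{\Gamma(p)} \displaystyle\int_0^x e^{-\alpha u} u^{p-1}\, du, & x > 0 \end{cases}$$

where $\Gamma(p) = \int_0^{\infty} e^{-u} u^{p-1}\, du$.

iii) Beta (α, β): 0 < α < ∞, 0 < β < ∞,

$$F(x) = \begin{cases} 0, & x \le 0 \\ \dfrac{1}{B(\alpha, \beta)} \displaystyle\int_0^x y^{\alpha-1} (1-y)^{\beta-1}\, dy, & 0 \le x \le 1 \\ 1, & x > 1 \end{cases}$$

where $B(\alpha, \beta) = \int_0^1 y^{\alpha-1} (1-y)^{\beta-1}\, dy$.

iv) Uniform (0, 1) is Beta (1, 1).

v) Cauchy (γ, σ): −∞ < γ < ∞, 0 < σ < ∞,

$$F(x) = \frac{1}{\pi} \int_{-\infty}^{x} \frac{1}{\sigma}\, \frac{1}{\left(\frac{y-\gamma}{\sigma}\right)^2 + 1}\, dy, \quad -\infty < x < \infty.$$

vi) Binomial (n, p): n a positive integer, 0 ≤ p ≤ 1. With [x] denoting the integer part of x, i.e., [x] = k for k ≤ x < k + 1, k an integer,

$$F(x) = \begin{cases} 0, & x < 0 \\ \displaystyle\sum_{r=0}^{[x]} \binom{n}{r} p^r (1-p)^{n-r}, & 0 \le x \le n \\ 1, & x > n \end{cases}$$

vii) Geometric (p): 0 < p < 1. The value r = 0, 1, 2, · · · has probability $(1-p)^r p$, so that

$$F(x) = \begin{cases} 0, & x < 0 \\ \displaystyle\sum_{r=0}^{[x]} (1-p)^r p = 1 - (1-p)^{[x]+1}, & x \ge 0 \end{cases}$$

viii) Poisson (λ): 0 < λ < ∞,

$$F(x) = \begin{cases} 0, & x < 0 \\ \displaystyle\sum_{r=0}^{[x]} \frac{e^{-\lambda} \lambda^r}{r!}, & x \ge 0 \end{cases}$$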
Stieltjes's result is more general [1]. It shows that, given a function F : R → R that is non-decreasing, i.e., $x_1 \le x_2 \Rightarrow F(x_1) \le F(x_2)$, there is a measure $\mu_F$ defined on the Borel σ-algebra B(R) of R such that

$$\mu_F((a, b]) = F(b+) - F(a+),$$

where $F(x+) \equiv \lim_{y \downarrow x} F(y)$.
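For a right-continuous CDF the relation above reduces to $\mu_F((a, b]) = F(b) - F(a)$, which is simple to evaluate. The sketch below is an illustration only; the function names are chosen here, the normal CDF is written with the standard library function `math.erf`, and the Poisson CDF is the finite sum over r = 0, ..., [x].

```python
from math import erf, exp, factorial, floor, sqrt

def normal_cdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """CDF of the Normal(mu, sigma^2) distribution."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def poisson_cdf(x: float, lam: float) -> float:
    """CDF of the Poisson(lam) distribution: sum of e^-lam lam^r / r! for r <= [x]."""
    if x < 0:
        return 0.0
    return sum(exp(-lam) * lam**r / factorial(r) for r in range(floor(x) + 1))

def interval_measure(F, a: float, b: float) -> float:
    """mu_F((a, b]) = F(b) - F(a) for a right-continuous CDF F."""
    return F(b) - F(a)

# Mass that the standard normal distribution gives to (-1.96, 1.96].
normal_mass = interval_measure(normal_cdf, -1.96, 1.96)

# Mass that Poisson(1) gives to (0, 2], i.e. to the values 1 and 2.
poisson_mass = interval_measure(lambda x: poisson_cdf(x, 1.0), 0.0, 2.0)

print(f"{normal_mass:.4f} {poisson_mass:.4f}")
```

The second number equals $e^{-1}(1 + \tfrac{1}{2})$, the sum of the Poisson(1) probabilities of the values 1 and 2, which shows how the jumps of a discrete CDF carry the mass.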
Example 4 (Random Vectors). Let $\Omega \equiv \mathbb{R}^k$, $\mathcal{F} \equiv \mathcal{B}(\mathbb{R}^k)$, the Borel σ-algebra in $\mathbb{R}^k$, i.e., the smallest σ-algebra containing all sets of the form

$$(a_1, b_1) \times (a_2, b_2) \times \cdots \times (a_k, b_k),$$

where $a_i < b_i$, i = 1, 2, · · · , k. Let μ be a measure on $\mathcal{B}(\mathbb{R}^k)$ such that $\mu(\mathbb{R}^k) = 1$. Let $F(\mathbf{x}) \equiv \mu((-\boldsymbol{\infty}, \mathbf{x}])$ where $\mathbf{x} = (x_1, x_2, \cdots, x_k) \in \mathbb{R}^k$, $\boldsymbol{\infty} = (\infty, \infty, \cdots, \infty)$, and

$$(-\boldsymbol{\infty}, \mathbf{x}] \equiv \{\mathbf{y} : \mathbf{y} = (y_1, y_2, \cdots, y_k),\ -\infty < y_i \le x_i,\ i \le k\}.$$

Then, F is called a k-variate CDF. It satisfies some well-known conditions. Conversely, given such an F, there exists a unique probability measure $\mu_F$ on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$ with F as its CDF. For details see [1], Section 1.3. The identity map X(ω) = ω is a model for the notion of a random vector of k dimensions.
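Example 5 (Random Sequences). Let $\Omega \equiv \mathbb{R}^{\infty} \equiv \{\omega : \omega : \mathbb{N} \to \mathbb{R}\}$, N = {1, 2, 3, · · · }. For each k ∈ N, let $\mu_k$ be a probability measure on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$ as in Example 4. Suppose $\{\mu_k\}_{k \ge 1}$ satisfies $\mu_{k+1}(A \times \mathbb{R}) = \mu_k(A)$ for all $A \in \mathcal{B}(\mathbb{R}^k)$. Let F be the σ-algebra generated by the class C of finite dimensional sets of the form $A \times \mathbb{R} \times \mathbb{R} \times \mathbb{R} \times \cdots$, where $A \in \mathcal{B}(\mathbb{R}^k)$ for some 1 ≤ k < ∞. Then by Kolmogorov's consistency theorem [1], there exists a probability measure μ on $(\mathbb{R}^{\infty}, \mathcal{F})$ such that for all k < ∞ and $A \in \mathcal{B}(\mathbb{R}^k)$, $\mu(A \times \mathbb{R} \times \mathbb{R} \times \mathbb{R} \times \cdots) = \mu_k(A)$. This is a model for a sequence of random variables $\{X_k\}_{k \ge 1}$ such that for every 1 ≤ k < ∞, the probability distribution of $(X_1, X_2, \cdots, X_k)$ under μ will coincide with $\mu_k$.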
Example 6 (Random Functions). Let T be a nonempty set. For example, T could be a finite set or a countable set or an interval or a subset of some Euclidean space. Let Ω ≡ {f : T → R} be the set of all real-valued functions ω on T. Suppose we want to model the choice of an element ω from Ω by a random mechanism. Kolmogorov proved a result known as the consistency theorem to make this precise. Suppose for every finite vector $(t_1, t_2, \cdots, t_k)$, k < ∞, of elements from T there is a probability measure $\mu_{(t_1, t_2, \cdots, t_k)}(\cdot)$ on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$. Suppose this family of probability measures satisfies:

(i) $\mu_{(t_1, t_2, \cdots, t_k)}(A_1 \times A_2 \times \cdots \times A_k) = \mu_{(t_{\pi(1)}, t_{\pi(2)}, \cdots, t_{\pi(k)})}(A_{\pi(1)} \times A_{\pi(2)} \times \cdots \times A_{\pi(k)})$ for every permutation π of (1, 2, · · · , k), where $A_1, A_2, \cdots, A_k$ are Borel sets in R.

(ii) $\mu_{(t_1, t_2, \cdots, t_k, t_{k+1})}(A_1 \times A_2 \times \cdots \times A_k \times \mathbb{R}) = \mu_{(t_1, t_2, \cdots, t_k)}(A_1 \times A_2 \times \cdots \times A_k)$.

Then, there exists a σ-algebra $\mathcal{B}_T$ of subsets of Ω and a probability measure $\mu_T$ on $\mathcal{B}_T$ such that for any $(t_1, t_2, \cdots, t_k)$ and $A_1, A_2, \cdots, A_k$ in B(R), the Borel σ-algebra of R,

$$\mu_T(\omega(t_1) \in A_1, \cdots, \omega(t_k) \in A_k) = \mu_{(t_1, t_2, \cdots, t_k)}(A_1 \times A_2 \times \cdots \times A_k).$$

See [1], Section 1.3 for a proof and further details.

An example of this when T = [0, ∞) is the standard Brownian motion. Here, for every $t_1, t_2, \cdots, t_k \in T = [0, \infty)$, the probability distribution $\mu_{(t_1, t_2, \cdots, t_k)}(\cdot)$ is that of a k-variate normal distribution with mean vector (0, 0, · · · , 0) and covariance matrix $\sigma_{ij} \equiv \min(t_i, t_j)$ ([1], Section 10.2).
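The finite dimensional distributions of Brownian motion are easy to sample. The sketch below is an illustration under stated assumptions: it uses the standard fact that sums of independent normal increments with variances $t_i - t_{i-1}$ form a mean-zero normal vector with covariance $\min(t_i, t_j)$, and the function names, time points, seed and number of repetitions are all chosen here for the example.

```python
import random
from math import sqrt

def brownian_at(times, rng):
    """Sample (B(t_1), ..., B(t_k)) for increasing times via independent increments."""
    values = []
    prev_t, prev_b = 0.0, 0.0
    for t in times:
        # The increment B(t) - B(prev_t) is normal with mean 0 and variance t - prev_t.
        prev_b += rng.gauss(0.0, sqrt(t - prev_t))
        prev_t = t
        values.append(prev_b)
    return values

rng = random.Random(2)
times = [0.5, 1.0, 2.0]                    # assumed time points
paths = [brownian_at(times, rng) for _ in range(50_000)]

def empirical_cov(i: int, j: int) -> float:
    """Average of B(t_i) * B(t_j); this estimates the covariance since the means are 0."""
    return sum(path[i] * path[j] for path in paths) / len(paths)

for i in range(len(times)):
    row = [round(empirical_cov(i, j), 2) for j in range(len(times))]
    print(times[i], row)
```

Each printed row should be close to the corresponding row of the matrix with entries $\min(t_i, t_j)$, up to sampling error.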
If T is a singleton and Ω = R, then the random element ω is called a random variable. If T is a finite set, the random element ω of $\Omega_T \equiv \mathbb{R}^T$, the set of all functions from T to R, is called a random vector. If T is a countable set it is called a random sequence. If T is an interval, it is called a random function. If T is a subset of $\mathbb{R}^k$, it is called a random field. Typically, $\Omega \equiv \mathbb{R}^T$, the collection of all real-valued functions on T, is very large. But the σ-algebra $\mathcal{B}_T$ in Kolmogorov's construction is not very large. This makes many interesting quantities such as M ≡ sup{|ω(t)| : t ∈ T} not $\mathcal{B}_T$-measurable, i.e., {ω : M(ω) ≤ a} need not be in $\mathcal{B}_T$ for all a in R. Since the Kolmogorov construction gives probabilities only for sets in $\mathcal{B}_T$, the probability that M ≤ m, where m is a given real number, cannot be discussed. J L Doob devised a method called separability to take care of this problem [3].
Mean, Variance, Moments of a Random Variable
Let (Ω, F, P) be a probability space. Then a function X : Ω → R is called a random variable on (Ω, F, P) if the sets of the form {ω : X(ω) ≤ a} are in F for each a ∈ R. Such sets are hence events, and one can talk about the probability distribution of X, i.e., $F_X(a) \equiv P(\omega : X(\omega) \le a)$. This $F_X(\cdot)$ is called the cumulative distribution function of X. It satisfies:
i) $x_1 \le x_2 \Rightarrow F_X(x_1) \le F_X(x_2)$,
ii) $F_X(x+) \equiv \lim_{y \downarrow x} F_X(y) = F_X(x)$ (right continuity), and
iii) $F_X(-\infty) \equiv \lim_{x \downarrow -\infty} F_X(x) = 0$, $F_X(\infty) \equiv \lim_{x \uparrow \infty} F_X(x) = 1$.

For any Borel set A in R, $P\{\omega : X(\omega) \in A\} \equiv P(X^{-1}(A))$ will coincide with $\mu_{F_X}(A)$, where $\mu_{F_X}(\cdot)$ is the Stieltjes measure on (R, B(R)) induced by $F_X(\cdot)$.
If X is a simple random variable, i.e., it takes only finitely many distinct real values $\{a_1, a_2, \cdots, a_k\}$, then the expected value EX of X or the mean value of X is defined as

$$EX \equiv \sum_{i=1}^{k} a_i p_i,$$

where $p_i \equiv P\{\omega : X(\omega) = a_i\}$.

If X is a non-negative random variable, then X can be approximated by a sequence $\{X_n\}_{n \ge 1}$ of simple random variables such that for each sample point ω in Ω, $X_n(\omega) \ge 0$, $X_n(\omega) \le X_{n+1}(\omega)$ for all n ≥ 1, and $\lim_{n} X_n(\omega) = X(\omega)$. Call such a sequence admissible for X. It is natural to define the mean value of X, i.e., EX, by setting it equal to $\lim_{n} EX_n$. It can be shown [1] that $\{EX_n\}_{n \ge 1}$ is a non-decreasing sequence in n and that $\lim_{n} EX_n$ will be the same for all admissible sequences. It could be +∞. Here the properties that F is a σ-algebra and P is countably additive are crucially used.
Next, for any real-valued random variable X on (Ω, F, P) to R, let

$$X^+(\omega) \equiv \max\{X(\omega), 0\}, \qquad X^-(\omega) \equiv \max\{-X(\omega), 0\}.$$

Then it can be shown that both $X^+$ and $X^-$ are non-negative random variables on (Ω, F, P) and for every ω, $X(\omega) = X^+(\omega) - X^-(\omega)$. So, it is natural to define EX, the expected value of X, as $EX = EX^+ - EX^-$ provided at least one of the two quantities $EX^+$, $EX^-$ is finite. Typically one requires both $EX^+$ and $EX^-$ to be finite. This renders E|X| < ∞. So we see that EX is well defined for any random variable X such that E|X| < ∞.

For any random variable X, the kth moment of X for a positive integer k is defined as $EX^k$ provided $E|X|^k < \infty$. The variance of a random variable X is defined as $V(X) \equiv E(X - EX)^2$ provided $EX^2 < \infty$. It can be seen that if $EX^2 < \infty$, then $V(X) = EX^2 - (EX)^2$. The reader is invited to compute the mean EX and the variance V(X) for random variables X with probability distributions F(·) mentioned earlier in Example 3.
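For a simple random variable both quantities are finite sums, so the exercise can be checked mechanically in the discrete cases. The sketch below is an illustration with assumed values n = 10 and p = 1/4 for the binomial distribution; it uses exact fractions, and the helper name is chosen here.

```python
from fractions import Fraction
from math import comb

def mean_and_variance(values, probs):
    """EX = sum of a_i p_i and V(X) = EX^2 - (EX)^2 for a simple random variable."""
    ex = sum(a * q for a, q in zip(values, probs))
    ex2 = sum(a * a * q for a, q in zip(values, probs))
    return ex, ex2 - ex * ex

# Assumed illustration: X is Binomial(n, p) with n = 10 and p = 1/4.
n, p = 10, Fraction(1, 4)
values = list(range(n + 1))
probs = [comb(n, r) * p**r * (1 - p) ** (n - r) for r in values]

ex, var = mean_and_variance(values, probs)
print(ex, var)   # compare with n*p and n*p*(1 - p)
```

The output agrees with the closed forms np and np(1 − p), which the reader can derive by hand from the binomial theorem.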
Laws of Large Numbers and CLT.
There are two results in probability theory that make the subject very useful in applications. This area of application of probability theory to the real world is often termed the field of statistics. It involves collecting data (i.e., generating random variables) according to well-defined rules of probability theory and then making inferences about the underlying population based on the data (referred to as statistical inference). A fundamental notion needed for these two results is that of the independence of random variables. Let E be a random experiment, (Ω, F, P) be a probability space associated with E and $X_1, X_2, \cdots, X_k$ be k real-valued random variables (k < ∞) defined on (Ω, F, P). Recall that a real-valued random variable X on a probability space (Ω, F, P) is simply a function X from Ω to R such that for each a in R the set {ω : X(ω) ≤ a} is in F. This is often expressed by saying that X is a measurable function on (Ω, F) to R. Note that X being measurable depends on F and not on P. It can also be verified that X : Ω → R is a random variable on (Ω, F) if and only if {ω : X(ω) ∈ B} ∈ F for all B in B(R), the Borel σ-algebra of R, i.e., the smallest σ-algebra containing intervals of the form (α, β), α, β ∈ R. This property is expressed by saying that X is (F, B(R)) measurable [1].
A finite collection $X_1, X_2, \cdots, X_k$, k < ∞, of real-valued random variables on a space (Ω, F) is said to be independent with respect to the probability measure (distribution) P if for any $a_1, a_2, \cdots, a_k$ in R,

$$P\{\omega : X_1(\omega) \le a_1, X_2(\omega) \le a_2, \cdots, X_k(\omega) \le a_k\} = P\{\omega : X_1(\omega) \le a_1\} \cdots P\{\omega : X_k(\omega) \le a_k\},$$

i.e., the joint distribution function

$$F_{(X_1, X_2, \cdots, X_k)}(a_1, a_2, \cdots, a_k) \equiv P\{\omega : X_1(\omega) \le a_1, X_2(\omega) \le a_2, \cdots, X_k(\omega) \le a_k\}$$

is equal to the product of all the marginal distribution functions, i.e., it equals $\prod_{i=1}^{k} F_{X_i}(a_i)$, where $F_{X_i}(a_i) \equiv P\{\omega : X_i(\omega) \le a_i\}$.
A family $\{X_t(\omega) : t \in T\}$ of real-valued random variables on a probability space (Ω, F, P), where T is an arbitrary index set, is said to be independent with respect to P if for every finite set $\{t_1, t_2, \cdots, t_k\} \subset T$, k < ∞, the random variables $\{X_{t_1}(\omega), X_{t_2}(\omega), \cdots, X_{t_k}(\omega)\}$ are independent with respect to P.
An example of an infinite sequence of independent random variables is the following. Let Ω = [0, 1], F = B[0, 1], the Borel σ-algebra of [0, 1], P = Lebesgue measure. For each ω, let $\omega \equiv \sum_{i=1}^{\infty} \frac{\delta_i(\omega)}{2^i}$ be the expansion of ω in base 2. Then, it can be shown that for each k < ∞, the functions $\delta_1(\omega), \cdots, \delta_k(\omega)$ are independent on this (Ω, F, P) with each $\delta_i$ having distribution

$$P\{\omega : \delta_i(\omega) = 0\} = \frac{1}{2} = P\{\omega : \delta_i(\omega) = 1\},$$

called the Bernoulli $(\frac{1}{2})$ distribution. One just needs to verify that $\{\omega : \delta_1(\omega) = s_1, \cdots, \delta_k(\omega) = s_k\}$ for any given $s_1, \cdots, s_k \in \{0, 1\}$ is simply an interval of length $\frac{1}{2^k}$ in [0, 1]. A similar result holds for expansion to base p, where p is an integer > 1.
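The verification suggested above can be carried out exactly for small k. The sketch below is an illustration with helper names chosen here; it uses exact fractions, takes k = 3, and follows the convention that a dyadic rational is given its terminating expansion.

```python
from fractions import Fraction
from itertools import product

def digit(omega: Fraction, i: int) -> int:
    """The i-th binary digit delta_i(omega) of a point omega in [0, 1)."""
    return int(omega * 2**i) % 2

def cylinder(bits):
    """Interval [left, right) on which delta_1, ..., delta_k equal the given bits."""
    k = len(bits)
    left = sum(Fraction(b, 2 ** (i + 1)) for i, b in enumerate(bits))
    return left, left + Fraction(1, 2**k)

k = 3
for bits in product((0, 1), repeat=k):
    left, right = cylinder(bits)
    midpoint = (left + right) / 2
    digits = tuple(digit(midpoint, i) for i in range(1, k + 1))
    print(bits, left, right, right - left, digits == bits)
```

Every one of the $2^k$ digit patterns gets an interval of Lebesgue measure $2^{-k} = (\tfrac{1}{2})^k$, which is exactly the product form required for independence of Bernoulli $(\tfrac{1}{2})$ digits.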
The following results, known as the 'laws of large numbers', are consequences of slightly more general results due to Kolmogorov.

Theorem 1 (Weak) Law of Large Numbers. Let $X_1, X_2, X_3, \cdots, X_n$ be independent random variables on some probability space (Ω, F, P) such that
i) they are identically distributed, i.e., $P\{\omega : X_i(\omega) \le a\} \equiv F(a)$, a ∈ R, is the same for all i = 1, 2, · · · , n,
ii) the mean value of $X_1$ is well defined, i.e., $E|X_1| < \infty$ (see definition given earlier).

Then, for each $\epsilon > 0$,

$$P\{|\bar{X}_n - EX_1| > \epsilon\} \to 0 \text{ as } n \to \infty,$$

where $\bar{X}_n \equiv \frac{X_1 + X_2 + \cdots + X_n}{n}$ (called the sample mean), and $EX_1$ is as defined earlier.

Theorem 2 (Strong) Law of Large Numbers. Let $X_1, X_2, \cdots$ be a sequence of random variables on some probability space (Ω, F, P) such that for each n < ∞, $X_1, X_2, \cdots, X_n$ satisfy the hypothesis of Theorem 1. Then

$$P\{\omega : \bar{X}_n(\omega) \to EX_1 \text{ as } n \to \infty\} = 1.$$
These two laws of large numbers are what make the subject of statistics very useful. If the mean value λ of a random variable X is not known it can be estimated from sample data. More precisely, let $X_1, X_2, \cdots, X_n$ be a sample of n independent copies of X; then by the law of large numbers, i.e., Theorem 1, the sample mean $\bar{X}_n$ converges to λ in probability as n tends to infinity. This is called the IID Monte Carlo (IIDMC) method.
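The IID Monte Carlo idea takes only a few lines to try out. The sketch below is an illustration: the random variable X = U² with U uniform on (0, 1) is chosen here because its mean 1/3 is known and can be used to judge the estimates; the function name and the seed are likewise assumptions.

```python
import random

def iid_monte_carlo(draw, n, rng):
    """Sample mean of n independent copies produced by draw(rng)."""
    return sum(draw(rng) for _ in range(n)) / n

def square_of_uniform(rng):
    """One copy of X = U**2, with U uniform on (0, 1); EX = 1/3."""
    return rng.random() ** 2

rng = random.Random(3)   # fixed seed so the run is repeatable
for n in (100, 10_000, 100_000):
    print(n, iid_monte_carlo(square_of_uniform, n, rng))
```

The printed sample means settle near 1/3 as n grows, with fluctuations that shrink roughly like $1/\sqrt{n}$.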
An example of this is opinion polls in election surveys. Suppose there are two candidates A and B contesting for a position in a city with a large electorate. Suppose the organizers of the candidate A want to estimate the support for A in that city. They choose a small sample of people from that city, find out the support that A has in that sample, and use that to estimate the support A enjoys in the whole city.
The estimate $\bar{X}_n \equiv \frac{X_1 + X_2 + \cdots + X_n}{n}$ based on n independent observations $X_1, X_2, \cdots, X_n$ of a random variable X is often referred to as a point estimate for the quantity λ ≡ EX. Another kind of estimate, called an interval estimate or a confidence interval $I_n$ for λ ≡ EX based on the observations $X_1, X_2, \cdots, X_n$, is generated by the use of the CLT in probability theory (referred to earlier in this article). We give this below.
Central Limit Theorem: Let $X_1, X_2, \cdots, X_n, \cdots$ be independent identically distributed real-valued random variables. Let $EX_1^2 < \infty$. Let $EX_1 = \mu$ and $0 < V(X_1) \equiv EX_1^2 - (EX_1)^2 \equiv \sigma^2 < \infty$. Let $\bar{X}_n \equiv \frac{1}{n} \sum_{i=1}^{n} X_i$ for n = 1, 2, · · · . Then, for any −∞ < a < b < ∞,

i) $$\lim_{n \to \infty} P\left(a \le \sqrt{n}\, \frac{(\bar{X}_n - \mu)}{\sigma} \le b\right) = \int_a^b \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}\, dx$$

ii) $$\lim_{n \to \infty} P\left(a \le \sqrt{n}\, \frac{(\bar{X}_n - \mu)}{\sigma_n} \le b\right) = \int_a^b \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}\, dx$$

where

$$\sigma_n^2 \equiv \frac{1}{n} \sum_{i=1}^{n} \left(X_i - \bar{X}_n\right)^2, \quad n \ge 1.$$

The function $\Phi(y) \equiv \int_{-\infty}^{y} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}\, dx$, −∞ < y < ∞, is called the standard normal distribution function or Gaussian distribution, named after the great German mathematician Carl F Gauss (see footnote 3). The function $\phi(y) = \frac{d\Phi(y)}{dy} = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{y^2}{2}}$ is called the standard normal probability density function. The graph of the curve (x, φ(x)) as x varies over (−∞, ∞) looks like a bell and is referred to as the bell curve.

Footnote 3: Resonance, Vol.2, No.6, 1997.
Suppose given 0 < α < 1 one wants to produce an interval $I_n$ based on observations $X_1, X_2, \cdots, X_n$ such that

$$P(\mu \in I_n) \to (1 - \alpha) \text{ as } n \to \infty,$$

where $\mu = EX_1$. For this, one first chooses a ∈ (0, ∞) such that

$$\Phi(a) - \Phi(-a) = \int_{-a}^{+a} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, du = (1 - \alpha).$$

Note that if

$$I_n \equiv \left(\bar{X}_n - \frac{a\sigma_n}{\sqrt{n}},\ \bar{X}_n + \frac{a\sigma_n}{\sqrt{n}}\right)$$

then the CLT (ii) above assures us that

$$P(\mu \in I_n) = P\left(-a \le \frac{(\bar{X}_n - \mu)\sqrt{n}}{\sigma_n} \le a\right) \to (1 - \alpha).$$

This random interval $I_n$ is referred to as an approximate confidence interval of level (1 − α) for the parameter $\mu = EX_1$. Typically, one chooses α to be 0.05 and the corresponding interval $I_n$ is called a 95% level confidence interval.
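The recipe for the interval translates directly into code. The sketch below is an illustration under stated assumptions: the constant a is obtained from the standard library call `statistics.NormalDist().inv_cdf`, the data are simulated exponential variables with mean 1 (a choice made here), and the coverage is estimated by repeating the experiment 2000 times.

```python
import random
from math import sqrt
from statistics import NormalDist

def confidence_interval(xs, alpha=0.05):
    """Approximate (1 - alpha) interval: sample mean +/- a * sigma_n / sqrt(n)."""
    n = len(xs)
    mean = sum(xs) / n
    sigma_n = sqrt(sum((x - mean) ** 2 for x in xs) / n)
    a = NormalDist().inv_cdf(1 - alpha / 2)   # so that Phi(a) - Phi(-a) = 1 - alpha
    half_width = a * sigma_n / sqrt(n)
    return mean - half_width, mean + half_width

# Assumed experiment: exponential data with true mean mu = 1, n = 200 per sample.
rng = random.Random(5)
mu, n, repetitions = 1.0, 200, 2000
hits = 0
for _ in range(repetitions):
    low, high = confidence_interval([rng.expovariate(1.0) for _ in range(n)])
    hits += low <= mu <= high

coverage = hits / repetitions
print(f"fraction of intervals containing mu = {coverage:.3f}")
```

The observed fraction is close to, though not exactly, 0.95; the CLT guarantees the level only in the limit as n tends to infinity.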
It may be noted that the CLT is a refinement of the law of large numbers, which says that if $E|X_1| < \infty$ then $\bar{X}_n - \mu \to 0$ as $n \to \infty$ with probability one. The CLT says that if $EX_1^2 < \infty$, then while $(\bar{X}_n - \mu) \to 0$, it decays at the rate of $\frac{1}{\sqrt{n}}$.

There are similar results when $EX_1^2 = \infty$ but $E|X_1| < \infty$. This requires the study of what are called stable distributions [1].
Address for Correspondence
K B Athreya
Department of Mathematics
Iowa State University
Ames, Iowa, USA
Email: kbathreya@gmail.com

Suggested Reading

[1] K B Athreya and S N Lahiri, Measure Theory and Probability Theory, Springer, New York. (see also TRIM Series Vol.36 and 41, 2006).
[2] K B Athreya, M Delampady and T Krishnan, MCMC Methods, Resonance, April, July, October, December, 2003.
[3] J L Doob, Stochastic Processes, John Wiley, New York, 1953.
