\sum_{k=0}^{n} \binom{n}{k} x^{n-k} y^k = (x + y)^n   (Binomial Theorem)

\sum_{k=0}^{\infty} a r^k = \frac{a}{1 - r}, \quad -1 < r < 1   (Geometric Summation)

\sum_{i=1}^{n} i = \frac{n(n+1)}{2}   (Definite Arithmetic Summation)

\sum_{i=1}^{n} i^k \approx \frac{1}{k+1} n^{k+1}   (Definite Polynomial Summation)

\sum_{i=0}^{\infty} x^i = \frac{1}{1-x}   (Indefinite Exponential Summation)

\sum_{i=0}^{\infty} i x^i = \frac{x}{(1-x)^2}   (Indefinite Exponential Summation)

\int u \, dv = uv - \int v \, du   (Integration by Parts)

(fg)' = f'g + fg'   (Product Rule)

\int e^{2x} \, dx = \frac{1}{2} e^{2x}   (Integrating with Euler's number)

\frac{d}{dx} e^{2x} = 2e^{2x}   (Differentiating with Euler's number)

\sum_{i=0}^{\infty} \frac{x^i}{i!} = e^x   (Summation Representation of e^x)
4 Axioms of Probability
4.1 Sample Space & Events
The sample space is the set of all possible outcomes, and an event is a subset of that space. The sets can be described in set-builder notation and manipulated with the conventional set operators.
e.g. Probability of a hard drive crash.
The sample space, S, of the month m in which a hard drive will crash: S = \{m : m \geq 0\}
The event, E, that a hard drive will crash after 36 months: E = \{m : m > 36\}
The complement of the event E: E^c = \{m : m \leq 36\}.
4.2 Axioms
P is called a probability if it satisfies the following axioms:
1. P(A) \geq 0: The probability of the occurrence of an event is always nonnegative.
2. P(S) = 1: The probability of the certain event S is 1. This ensures that S covers all possible outcomes.
3. If A = \{A_1, A_2, A_3, \ldots\} is a set of mutually exclusive events, then P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i) (countable additivity).
When these axioms are satisfied, S and P can be considered a probability model.
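As a quick sanity check, the axioms can be verified numerically on a small discrete model. The fair six-sided die below is an assumed example, not one from the notes:

```python
from fractions import Fraction

# A probability model for a fair six-sided die: each outcome has mass 1/6.
S = {1, 2, 3, 4, 5, 6}
P = {outcome: Fraction(1, 6) for outcome in S}

def prob(event):
    """P(A) for an event A, a subset of the sample space S."""
    return sum(P[o] for o in event)

# Axiom 1: nonnegativity (checked on singleton events here).
assert all(prob({o}) >= 0 for o in S)

# Axiom 2: the certain event has probability 1.
assert prob(S) == 1

# Axiom 3: additivity for mutually exclusive events.
evens, odds = {2, 4, 6}, {1, 3, 5}
assert prob(evens | odds) == prob(evens) + prob(odds)

print(prob(evens))  # 1/2
```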
4.3 Basic Theorems
1. P(\emptyset) = 0
2. P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} P(A_i) for mutually exclusive A_i
3. P(A^c) = 1 - P(A)
4.4 Inclusion-Exclusion Principle
In order to calculate P(A_1 \cup A_2 \cup \cdots \cup A_n), the unioned events must have mutually exclusive sample points. Therefore, if A_1 \cap A_2 \neq \emptyset, in order to calculate P(A_1 \cup A_2), the overlap must be corrected for:
P(A_1 \cup A_2) = P(A_1) + P(A_2) - P(A_1 \cap A_2)
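A small sketch of the principle on a single die roll, with two overlapping events chosen for illustration:

```python
from fractions import Fraction

# Roll one fair die; A = "even", B = "greater than 3". These overlap,
# so P(A or B) needs the inclusion-exclusion correction.
S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}
B = {4, 5, 6}

def prob(event):
    return Fraction(len(event), len(S))

lhs = prob(A | B)                       # direct probability of the union
rhs = prob(A) + prob(B) - prob(A & B)   # inclusion-exclusion
assert lhs == rhs == Fraction(2, 3)
print(lhs)  # 2/3
```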
4.5 Continuity of Probability Function
For any increasing or decreasing sequence of events, \{E_n : n \geq 1\},
\lim_{n \to \infty} E_n = \bigcup_{n=1}^{\infty} E_n   (increasing)
\lim_{n \to \infty} E_n = \bigcap_{n=1}^{\infty} E_n   (decreasing)
\lim_{n \to \infty} P(E_n) = P\left(\lim_{n \to \infty} E_n\right)
e.g. Probability of a population dying out.
Some individuals in a population, E_i, produce offspring that form a successive generation, E_{i+1}. If the probability of extinction by the n^{th} generation is e^{-(2n^2+7)/(6n^2)}, what is the probability of the population surviving forever?
P\{\text{surviving forever}\} = 1 - P\{\text{extinction}\}
= 1 - P\left(\bigcup_{i=1}^{\infty} E_i\right)
= 1 - P\left(\lim_{n \to \infty} E_n\right)
= 1 - \lim_{n \to \infty} e^{-(2n^2+7)/(6n^2)}
= 1 - e^{-1/3}
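The limit in the last step can be checked numerically, since the exponent (2n^2 + 7)/(6n^2) tends to 1/3 as n grows:

```python
import math

# P(extinct by generation n) = exp(-(2n^2 + 7) / (6n^2)); as n grows the
# exponent tends to -1/3, so P(surviving forever) = 1 - exp(-1/3).
def p_extinct(n):
    return math.exp(-(2 * n**2 + 7) / (6 * n**2))

limit = 1 - math.exp(-1 / 3)
approx = 1 - p_extinct(10**6)   # large n approximates the limit
assert abs(approx - limit) < 1e-9
print(round(limit, 6))  # 0.283469
```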
5 Combinatorics
Let n be the total number of elements in the set and k the number of elements to be chosen.
Per(n, k) = \frac{n!}{(n-k)!}   (Permutation: Order matters.)
C(n, k) = \frac{n!}{k!(n-k)!}   (Combination: Order doesn't matter.)
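Python's standard library exposes both counts directly, which makes the formulas easy to spot-check:

```python
import math

# Per(n, k) = n! / (n-k)!;  C(n, k) = n! / (k! (n-k)!)
n, k = 5, 2
per = math.perm(n, k)   # ordered selections
comb = math.comb(n, k)  # unordered selections

assert per == math.factorial(n) // math.factorial(n - k) == 20
assert comb == per // math.factorial(k) == 10
print(per, comb)  # 20 10
```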
6 Conditional Probability & Independence
6.1 Conditional Probability
Determines the probability of an event, A, given a subset of the sample space, B.
P(A|B) = \frac{P(AB)}{P(B)}
6.2 Law of Multiplication
The theorem of conditional probability can also be used to find the intersection of two events:
P(AB) = P(A)P(B|A) if P(A) \neq 0
P(AB) = P(B)P(A|B) if P(B) \neq 0
P(A_1 A_2 A_3 \ldots A_{n-1} A_n) = P(A_1)\,P(A_2|A_1)\,P(A_3|A_1 A_2) \ldots P(A_n|A_1 A_2 A_3 \ldots A_{n-1})
6.3 Law of Total Probability
Given the following constraints over \Omega = \{A_1, \ldots, A_n\}:
1. A_i \cap A_j = \emptyset if i \neq j (mutually exclusive)
2. P(A_i) > 0, \quad i = 1, \ldots, n
3. \bigcup_i A_i = \Omega (collectively exhaustive)
P(A) = P(A_1)P(A|A_1) + \cdots + P(A_n)P(A|A_n)
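A minimal worked sketch with assumed numbers (three suppliers partitioning the sample space, each with its own defect rate; none of these figures come from the notes):

```python
from fractions import Fraction as F

# Hypothetical supplier shares (mutually exclusive, collectively
# exhaustive) and P(defective | supplier) for each.
p_supplier = {1: F(1, 2), 2: F(3, 10), 3: F(1, 5)}      # sums to 1
p_defect_given = {1: F(1, 100), 2: F(2, 100), 3: F(5, 100)}

# Law of total probability: P(A) = sum over i of P(A_i) P(A | A_i).
p_defect = sum(p_supplier[i] * p_defect_given[i] for i in p_supplier)

assert sum(p_supplier.values()) == 1    # the A_i really partition Omega
print(p_defect)  # 21/1000
```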
6.4 Independence
Given two events A and B, they are independent iff:
P(AB) = P(A)P(B)
Therefore it can be inferred that:
P(AB) = P(A|B)P(B) = P(B|A)P(A)
P(AB) = P(A)P(B) = P(B)P(A)
... and generalized to mutually independent events:
P(A_1 A_2 \ldots A_n) = P(A_1)P(A_2) \ldots P(A_n)
6.5 Bayes' Formula
Given the conditioning event A and a set of events, A_1, \ldots, A_n, that are mutually exclusive and collectively exhaustive, it follows that:
P(A_i|A) = \frac{P(A_i)P(A|A_i)}{\sum_{j=1}^{n} P(A_j)P(A|A_j)}
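A short sketch of the formula with assumed priors and likelihoods (hypothetical numbers, chosen only for illustration):

```python
from fractions import Fraction as F

# Hypothetical priors P(A_i) and likelihoods P(A | A_i) for three
# mutually exclusive, collectively exhaustive events.
prior = {1: F(1, 2), 2: F(3, 10), 3: F(1, 5)}
likelihood = {1: F(1, 100), 2: F(2, 100), 3: F(5, 100)}

def posterior(i):
    # Denominator is the law of total probability over all A_j.
    total = sum(prior[j] * likelihood[j] for j in prior)
    return prior[i] * likelihood[i] / total

# The posteriors form a proper distribution over the A_i.
assert sum(posterior(i) for i in prior) == 1
print(posterior(3))  # 10/21
```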
7 Random Variables & Distributions
7.1 Random Variable
A real-valued function that associates a number with the outcome of a random experiment:
X : S \to \mathbb{R}
\{X = x\} \equiv \{\omega : X(\omega) = x\}
\{X \leq x\} \equiv \{\omega : X(\omega) \leq x\}
\{y < X \leq x\} \equiv \{\omega : y < X(\omega) \leq x\}
7.2 Cumulative/Probability Distribution Function (CDF)
Given some value x, let F(x) be the function over all values up to and including x; it represents the sum of all probabilities up to x:
F(x) = P(X \leq x), \quad x \in \mathbb{R}
The following constraints must hold for this to be true.
1. F must be non-decreasing; it must be accumulating for each greater value.
2. \lim_{x \to +\infty} F(x) = 1
3. \lim_{x \to -\infty} F(x) = 0
It follows that to find any probability range, given a CDF, the following cases can be derived (F(a^-) denotes the left limit of F at a):
P(X \leq a) = F(a) \qquad P(a < X \leq b) = F(b) - F(a)
P(X > a) = 1 - F(a) \qquad P(a < X < b) = F(b^-) - F(a)
P(X < a) = F(a^-) \qquad P(a \leq X \leq b) = F(b) - F(a^-)
P(X \geq a) = 1 - F(a^-) \qquad P(a \leq X < b) = F(b^-) - F(a^-)
P(X = a) = F(a) - F(a^-)
7.3 Probability Mass Function (PMF - Discrete)
Given a collection of discrete points over the random variable X, let the sum of their probabilities be certain.
\sum_{i=1}^{\infty} p(x_i) = 1
With respect to a CDF, given the sum of all probabilities up to and including t:
P(X \leq t) = F(t) = \sum_{i=1}^{n-1} p(x_i) \quad \text{where } x_{n-1} \leq t < x_n
Likewise a PMF value can be derived by:
p(x_n) = F(x_n) - F(x_{n-1})
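The PMF/CDF round trip above can be sketched for a small assumed distribution over three ordered points:

```python
from fractions import Fraction as F
from itertools import accumulate

# A small PMF over ordered points x_1 < x_2 < x_3; masses sum to 1.
xs = [1, 2, 3]
pmf = [F(1, 4), F(1, 2), F(1, 4)]
cdf = list(accumulate(pmf))            # F(x_n) = sum of p(x_i) up to n

assert cdf[-1] == 1                    # total probability is certain
# Recover the PMF from CDF differences: p(x_n) = F(x_n) - F(x_{n-1}).
recovered = [cdf[0]] + [cdf[i] - cdf[i - 1] for i in range(1, len(cdf))]
assert recovered == pmf
print(cdf)  # [Fraction(1, 4), Fraction(3, 4), Fraction(1, 1)]
```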
7.4 Probability Density Function (PDF - Continuous)
Given a continuous distribution, as opposed to mass points, let a and b be the bounds of a distribution function f(x).
\int_{-\infty}^{\infty} f(x) \, dx = 1
With respect to a CDF, given the sum of all probabilities bounded by a and b:
P(a \leq X \leq b) = \int_a^b f(x) \, dx
7.5 Expected Value (First Moment)
Given a random variable X, some function h(X), and a distribution function p(x) or f(x), let the expected value be the weighted average / mean:
\mu = E[h(X)] = \sum_{x_i} h(x_i)\, p(x_i) \quad \text{if discrete and converges}
\mu = E[h(X)] = \int_a^b h(x) f(x) \, dx \quad \text{if continuous and converges}
7.6 Variance, Covariance, Standard Deviation & Correlation
Let \sigma^2 be the variance and \sigma be the standard deviation of a random variable X:
\sigma^2 = Var[X] = E[(X - E[X])^2] = E[X^2] - E[X]^2
= \sum_{x_i} (x_i - E[X])^2\, p(x_i) \quad \text{if discrete}
= \int_{-\infty}^{\infty} (x - E[X])^2 f(x) \, dx \quad \text{if continuous}
Let X and Y be random variables, and a and b be constants; the variance has the following properties:
Var[X \pm Y] = Var[X] + Var[Y] \pm 2\,Cov[X, Y]
Var[aX \pm b] = Var[aX] = a^2\, Var[X]
Var[aX \pm bY] = a^2\, Var[X] + b^2\, Var[Y] \pm 2ab\, Cov[X, Y]
Let X and Y be random variables, and a and b be constants; the covariance has the following properties:
Cov[X, Y] = E[XY] - E[X]E[Y]
Cov[X, a] = 0
Cov[X, X] = Var[X]
Cov[aX, bY] = ab\, Cov[X, Y]
Cov[X + a, Y + b] = Cov[X, Y]
Cov[X, Y] = 0 if X and Y are independent (the converse does not hold in general)
The correlation coefficient of two random variables:
\rho_{X,Y} = \frac{Cov[X, Y]}{\sigma_X \sigma_Y}
\rho_{X,Y}^2 = \frac{Cov^2[X, Y]}{Var[X]\,Var[Y]}
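The two variance forms and the Var[aX + b] property can be checked exactly on a small assumed PMF (the points, masses, and constants below are arbitrary choices for illustration):

```python
from fractions import Fraction as F

# Exact moments for a small PMF.
xs = [0, 1, 2]
p = [F(1, 4), F(1, 2), F(1, 4)]

def E(h):
    """Expected value of h(X) under the PMF above."""
    return sum(h(x) * px for x, px in zip(xs, p))

mean = E(lambda x: x)
var = E(lambda x: (x - mean) ** 2)
# Both definitions of the variance agree: E[(X-mu)^2] = E[X^2] - mu^2.
assert var == E(lambda x: x * x) - mean ** 2

# Var[aX + b] = a^2 Var[X]: the shift b drops out, the scale a squares.
a, b = 3, 7
mean_y = E(lambda x: a * x + b)
var_y = E(lambda x: (a * x + b - mean_y) ** 2)
assert var_y == a ** 2 * var
print(var, var_y)  # 1/2 9/2
```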
7.7 Joint & Marginal Distribution (Bivariate - Discrete)
Let X and Y be discrete random variables defined on a sample space; their possible values are A and B respectively.
p_{XY}(x, y) = P(X = x, Y = y), \qquad \sum_{x \in A} \sum_{y \in B} p(x, y) = 1
The function input value can be fixed along a particular value for x or y to yield the marginal distribution.
p_X(x) = \sum_{y \in B} p(x, y) \qquad p_Y(y) = \sum_{x \in A} p(x, y)
The expected value can be derived from the marginal distribution:
E[X] = \sum_{x \in A} x\, p_X(x) \qquad E[Y] = \sum_{y \in B} y\, p_Y(y)
The joint conditional can be expressed as:
E[X^k | Y = y_j] = \sum_{x_i} x_i^k\, p_{X|Y}(x_i | y_j) : k = 1, 2, \ldots
7.8 Joint & Marginal Density (Bivariate - Continuous)
Let X and Y be continuous random variables defined on an uncountably infinite sample space (\mathbb{R} \times \mathbb{R}); their possible values are A and B respectively.
P(X \in A, Y \in B) = \int_B \int_A f(x, y) \, dx \, dy, \qquad \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y) \, dx \, dy = 1
The function input value can be fixed along a particular value for x or y to yield the marginal density.
f_X(x) = \int_{-\infty}^{\infty} f(x, y) \, dy \qquad f_Y(y) = \int_{-\infty}^{\infty} f(x, y) \, dx
The expected value can be derived from the marginal density:
E[X] = \int_{-\infty}^{\infty} x f_X(x) \, dx \qquad E[Y] = \int_{-\infty}^{\infty} y f_Y(y) \, dy
The joint conditional can be expressed as:
E[X^k | Y = y] = \int_{-\infty}^{\infty} x^k f_{X|Y}(x|y) \, dx : k = 1, 2, \ldots
8 Case Distributions
8.1 Exponential
Let \lambda be the exponential parameter and x the point along the distribution; the CDF is:
F(x) = 1 - e^{-\lambda x}
Correspondingly, the probability density function is the derivative of the CDF:
f(x) = \frac{d}{dx} F(x) = \frac{d}{dx}\left(1 - e^{-\lambda x}\right) = \lambda e^{-\lambda x}
The n^{th} expected value can be derived by solving E[X^n] = \int_0^{\infty} x^n \lambda e^{-\lambda x} \, dx:
E[X^n] = \frac{n!}{\lambda^n} \qquad Var[X] = \frac{1}{\lambda^2}
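The moment formula E[X^n] = n!/\lambda^n can be checked by Monte Carlo for n = 1, 2; a sketch with an assumed rate and loose tolerances to absorb sampling noise:

```python
import math
import random

# Check E[X] = 1/lam and E[X^2] = 2/lam^2 for the exponential
# distribution by sampling (seeded for reproducibility).
random.seed(0)
lam = 2.0
samples = [random.expovariate(lam) for _ in range(200_000)]

mean = sum(samples) / len(samples)
second = sum(x * x for x in samples) / len(samples)

assert abs(mean - 1 / lam) < 0.01          # n!/lam^n with n = 1
assert abs(second - 2 / lam ** 2) < 0.02   # n!/lam^n with n = 2
print(round(mean, 2))
```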
8.2 Poisson Distribution
Let \lambda be the rate parameter and k be the number of occurrences of an event for the PMF:
p(k) = e^{-\lambda} \frac{\lambda^k}{k!}
Considering two independent Poisson random variables with parameters \lambda and \mu, their sum can be used in the general form:
p(k) = e^{-(\lambda + \mu)} \frac{(\lambda + \mu)^k}{k!}
The expected value of a Poisson distribution can be derived by solving the general form:
E[X] = \sum_{i=1}^{\infty} i\, e^{-\lambda} \frac{\lambda^i}{i!} = \lambda e^{-\lambda} \sum_{i=1}^{\infty} \frac{\lambda^{i-1}}{(i-1)!} = \lambda e^{-\lambda} \sum_{i=0}^{\infty} \frac{\lambda^i}{i!} = \lambda e^{-\lambda} e^{\lambda} = \lambda
Intuitively, this makes sense since a binomial random variable with parameters n and p would have the average np = \lambda. Further, it can be shown that E[X^2] = \lambda + \lambda^2, so Var[X] = \lambda = E[X]. Therefore
E[X] = Var[X] = \lambda
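Summing the PMF directly (truncated where the tail is negligible) confirms both moments; the rate below is an arbitrary choice:

```python
import math

# Truncated Poisson PMF for an assumed rate; the tail past k = 100 is
# negligible for small lambda.
lam = 3.0
pmf = [math.exp(-lam) * lam**k / math.factorial(k) for k in range(100)]

mean = sum(k * p for k, p in enumerate(pmf))
second = sum(k * k * p for k, p in enumerate(pmf))

assert abs(sum(pmf) - 1) < 1e-12               # total mass is 1
assert abs(mean - lam) < 1e-9                  # E[X] = lambda
assert abs(second - mean**2 - lam) < 1e-9      # Var[X] = lambda as well
print(round(mean, 6))
```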
8.3 Bernoulli & Binomial Distribution
Given an experiment with only two outcomes, let n be the total number of trials where k is the number of successful outcomes, p is the probability of success, q is the probability of failure (q = 1 - p), and X is a Bernoulli random variable; X \in \{0, 1\}.
Since the expected value of a Bernoulli random variable X with parameter p is simply E[X] = 0 \cdot P(X = 0) + 1 \cdot P(X = 1) = P(X = 1), the expected value, variance and standard deviation can be expressed as:
E[X] = p \qquad Var[X] = pq \qquad \sigma_X = \sqrt{pq}
Let Y be a binomial random variable with parameters n and p (k = 0, 1, \ldots, n).
P(Y = k) = \binom{n}{k} p^k q^{n-k} = \binom{n}{k} p^k (1 - p)^{n-k}
The expected value and variance of the binomial random variable are similar to the Bernoulli but different in that they consider the number of trials, n:
E[Y] = np \qquad Var[Y] = npq \qquad \sigma_Y = \sqrt{npq}
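The binomial PMF, mean, and variance can be verified directly with `math.comb`; n and p below are assumed example values:

```python
import math

# Binomial PMF via math.comb; confirms total mass 1, E[Y] = np, Var = npq.
n, p = 10, 0.3
q = 1 - p
pmf = [math.comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]

assert abs(sum(pmf) - 1) < 1e-12
mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))
assert abs(mean - n * p) < 1e-12       # E[Y] = np
assert abs(var - n * p * q) < 1e-12    # Var[Y] = npq
print(round(mean, 6), round(var, 6))  # 3.0 2.1
```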
8.4 Geometric Distribution
Let k be the number of Bernoulli failures before a success; p is the probability of success and q is the probability of failure:
p(k) = q^k p = (1 - p)^k p
The expected value of a Geometric distribution can be derived by solving the general form:
E[X] = \sum_{k=0}^{\infty} k\, q^k p = \sum_{k=0}^{\infty} k (1 - p)^k p
Therefore the expected value and variance are:
E[X] = \frac{q}{p} \qquad Var[X] = \frac{q}{p^2}
8.5 Continuous Uniform Distribution
Let a and b be two points bounding a continuous distribution; the PDF is:
f(x) = \frac{1}{b - a}, \quad a \leq x \leq b
The expected value of a continuous uniform distribution can be derived by solving the general form:
E[X] = \int_a^b x f(x) \, dx = \frac{1}{2}(a + b)
Therefore the expected value and variance are:
E[X] = \frac{1}{2}(a + b) \qquad Var[X] = \frac{(b - a)^2}{12}
8.6 Normal/Gaussian Distribution
Let \mu and \sigma^2 be the mean (center peak) and variance (spread) of a continuous Gaussian distribution respectively; the curve is defined as N(\mu, \sigma^2) and the PDF is:
f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}
The distribution can be standardized in standard normal form (\mu = 0, \sigma = 1) as:
\phi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2}
The continuous CDF cannot be solved for in closed form:
\Phi(x) = \int_{-\infty}^{x} \phi(t) \, dt
Therefore, it can only be evaluated using approximations (see Approximations: Standard Normal Distribution Approximation).
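In practice a lookup table is not needed in code: the error function, which the standard library does provide, gives \Phi(x) = (1 + \mathrm{erf}(x/\sqrt{2}))/2. A small sketch:

```python
import math

# Phi has no elementary closed form, but math.erf evaluates it:
# Phi(x) = (1 + erf(x / sqrt(2))) / 2.
def normal_cdf(x, mu=0.0, sigma=1.0):
    z = (x - mu) / sigma                     # standardize first
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

assert abs(normal_cdf(0.0) - 0.5) < 1e-12    # symmetric about the mean
assert abs(normal_cdf(1.96) - 0.975) < 1e-3  # familiar 95% two-sided point
assert abs(normal_cdf(1.0) + normal_cdf(-1.0) - 1.0) < 1e-12
print(round(normal_cdf(1.0), 4))  # 0.8413
```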
9 Moments
The expected value E[X] = \mu is the first moment; E[X^2] is considered the second moment and E[X^n] is the n^{th} moment. The general form is:
E[X^n] = \sum_{x_i} x_i^n\, p(x_i) \quad \text{if discrete and converges}
E[X^n] = \int_a^b x^n f(x) \, dx \quad \text{if continuous and converges}
9.1 Law of Total Moments
Let X and Y be random variables and k denote the k^{th} moment of a given random variable.
E[E[X^k | Y]] = E[X^k]
= \sum_{x_i} x_i^k\, p_X(x_i)
= \int_{-\infty}^{\infty} x^k f_X(x) \, dx
9.2 Moment Generating Function (MGF)
Let \theta \in \mathbb{R} be some fixed value and X be the random variable that \theta is defined over.
M_X(\theta) = E[e^{\theta X}]
= \sum_{x_i} e^{\theta x_i}\, p(x_i) \quad \text{if discrete}
= \int_{-\infty}^{\infty} e^{\theta x} f(x) \, dx \quad \text{if continuous}
The n^{th} expected value can be derived from the MGF by taking the n^{th} derivative evaluated at zero.
E[X^n] = \left.\frac{d^n M_X(\theta)}{d\theta^n}\right|_{\theta = 0}
Let X and Y be independent random variables; the MGF of their sum is the product of the separate MGFs:
M_{X+Y}(\theta) = M_X(\theta)\, M_Y(\theta)
Special distribution cases for the Poisson and exponential respectively:
M(\theta) = e^{\lambda(e^{\theta} - 1)} \qquad M(\theta) = \frac{\lambda}{\lambda - \theta}
9.3 Generating Function (GF)
Let z be some fixed value and N be the discrete random variable that z is defined over.
g(z) = \sum_{n=0}^{\infty} z^n\, p_N(n)
Note that this is simply a re-parameterization of the MGF with z = e^{\theta}. The moments follow from derivatives at z = 1:
E[N] = g'(1) \qquad Var[N] = [g''(1) + g'(1)] - (g'(1))^2
10 Approximations
It is not always practical to determine the distribution of a random variable, but an approximation can be made if the characteristics of the distribution are known (i.e. mean, variance, standard deviation).
10.1 Markov & Chebyshev's Inequalities
Recall that \mu is the expected value and \sigma^2 the variance. Let X be a discrete non-negative random variable and t > 0:
P(X \geq t) \leq \frac{E[X]}{t}   (Markov's)
P(|X - \mu| \geq t) \leq \frac{\sigma^2}{t^2}   (Chebyshev's)
Chebyshev's inequality can also be reparameterized where t = k\sigma:
P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}
Given n Bernoulli trials, let \epsilon and \delta be the desired rate of error and probability of failure respectively. Therefore, to find an upper bound to the number of trials (n) needed to attain the given rate of error (\epsilon) and failure probability (\delta):
n \geq \frac{1}{4\epsilon^2 \delta}
10.2 One-Sided Chebyshev's Inequality
P(X < t) \leq \frac{\sigma^2}{\sigma^2 + (t - E[X])^2} \quad \text{if } t < E[X]
P(X > t) \leq \frac{\sigma^2}{\sigma^2 + (t - E[X])^2} \quad \text{if } t > E[X]
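Chebyshev's bound can be compared against the true tail probability by simulation; the uniform(0, 1) variable below is an assumed example (\mu = 1/2, \sigma^2 = 1/12), and the bound is typically loose:

```python
import random

# Empirically confirm Chebyshev's inequality for a uniform(0, 1) variable.
random.seed(1)
mu, var = 0.5, 1 / 12
samples = [random.random() for _ in range(100_000)]

for t in (0.2, 0.3, 0.4):
    actual = sum(abs(x - mu) >= t for x in samples) / len(samples)
    bound = var / t**2                  # Chebyshev's upper bound
    assert actual <= bound              # the bound holds (often loosely)
print("Chebyshev bound holds")
```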
10.3 Law of Large Numbers
Given a number of successes S_n among n Bernoulli trials where success is defined as when A occurs. Let S_n/n and P(A) be the experimental and theoretical probabilities respectively, and \epsilon be the confidence interval.
P\left(\left|\frac{S_n}{n} - P(A)\right| \geq \epsilon\right) \leq \frac{P(A)(1 - P(A))}{n\epsilon^2} \leq \frac{0.25}{n\epsilon^2} \quad \text{(maximum value when } P(A) = \tfrac{1}{2}\text{)}
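A simulation sketch of the law (P(A) and \epsilon below are assumed values; the seed makes the run reproducible):

```python
import random

# Simulate Bernoulli trials for A with P(A) = 0.3 and watch S_n/n
# approach the theoretical probability as n grows.
random.seed(2)
p_A, eps = 0.3, 0.02
n = 50_000
successes = sum(random.random() < p_A for _ in range(n))
empirical = successes / n

# With n = 50,000 the Chebyshev bound on the deviation is tiny.
chebyshev_bound = p_A * (1 - p_A) / (n * eps**2)
assert abs(empirical - p_A) < eps
assert chebyshev_bound <= 0.25 / (n * eps**2)
print(round(empirical, 3))
```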
10.4 Standard Normal Distribution Approximation
Given a normal distribution with mean \mu and variance \sigma^2, an approximation to the CDF at point x can be found by standardizing and determining the approximate value of \Phi(z) in a lookup table.
P(Z \leq z) = \Phi(z) : z = \frac{x - \mu}{\sigma}
10.5 Normal Approximation of Bernoulli Trials
For n Bernoulli trials with success probability p (q = 1 - p), the deviation of the sample proportion S_n/n from p can be approximated with the standard normal CDF:
P\left(\left|\frac{S_n}{n} - p\right| \geq \epsilon\right) \approx 2\left(1 - \Phi\left(\epsilon\sqrt{\frac{n}{pq}}\right)\right)
10.6 Central-Limit Theorem
Given iid random variables \{X_1, X_2, \ldots, X_n\} where \mu and \sigma are finite, \sigma > 0, and S_n = \sum_{i=1}^{n} X_i. Let there be a cumulative probability in bounds x and y along a standard normal distribution with a significantly large number of trials:
\lim_{n \to \infty} P\left(x \leq \frac{S_n - n\mu}{\sigma\sqrt{n}} \leq y\right) = \Phi(y) - \Phi(x)
10.7 r^{th} Percentile
Given an exponential random variable, X, and a percentile function, \pi(r), find the r^{th} percentile for all X \leq \pi(r):
P(X \leq \pi(r)) = \frac{r}{100} = r\%
1 - e^{-\lambda \pi(r)} = \frac{r}{100}
e^{-\lambda \pi(r)} = \frac{100 - r}{100}
\pi(r) = \frac{1}{\lambda} \ln\left(\frac{100}{100 - r}\right)
Therefore the final function becomes:
\pi(r) = E[X] \ln\left(\frac{100}{100 - r}\right)
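The derivation can be verified by plugging \pi(r) back into the exponential CDF; the rate below is an assumed example value:

```python
import math

# pi(r) = (1/lambda) ln(100 / (100 - r)) for an exponential variable.
def percentile(r, lam):
    return (1 / lam) * math.log(100 / (100 - r))

lam = 2.0
for r in (25, 50, 90):
    x = percentile(r, lam)
    # Plugging back into the CDF must return r% exactly.
    assert abs((1 - math.exp(-lam * x)) - r / 100) < 1e-12

# Sanity check: the exponential median is ln(2)/lambda.
assert abs(percentile(50, lam) - math.log(2) / lam) < 1e-12
print(round(percentile(50, lam), 4))  # 0.3466
```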
11 Stochastic Processes
11.1 Poisson Process
Given some system with time t elapsed and n countable elements, let N(t) be a random variable related to n. Constrain the system to have independent increments and stationary increments over non-overlapping intervals; it constitutes a Poisson process with rate \lambda. Therefore, the probability of n elements in a system at time t can be expressed as:
P(N(t) = n) = \frac{e^{-\lambda t} (\lambda t)^n}{n!} = P_n(t)
In addition, the wait time until the next occurrence is exponentially distributed.
11.2 Markov Chains
Given some countable set of states, S_n = \{0, 1, \ldots, n\}, let S_1 be a two-state set where \alpha and \beta are the transition probabilities out of states 0 and 1 respectively.
(State diagram: 0 \to 1 with probability \alpha, 1 \to 0 with probability \beta; self-loops with probabilities 1 - \alpha and 1 - \beta.)
For each state, all outgoing transitions will sum to 1 and can be described by a transition matrix:
P = \begin{pmatrix} 1 - \alpha & \alpha \\ \beta & 1 - \beta \end{pmatrix}
The above example can be extended to n states for an n \times n transition matrix.
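The two-state chain can be sketched directly; the transition probabilities below are assumed example values, and repeated steps drive the distribution to its stationary point (\beta, \alpha)/(\alpha + \beta):

```python
# Two-state transition matrix P = [[1-a, a], [b, 1-b]]; multiplying a
# distribution row vector by P advances the chain one step.
a, b = 0.3, 0.4  # assumed transition probabilities for illustration

P = [[1 - a, a],
     [b, 1 - b]]

def step(dist, P):
    """One step of the chain: new_j = sum_i dist_i * P[i][j]."""
    return [sum(dist[i] * P[i][j] for i in range(2)) for j in range(2)]

dist = [1.0, 0.0]             # start in state 0
for _ in range(100):          # iterate toward the stationary distribution
    dist = step(dist, P)

# Each row of P sums to 1, and the chain converges to (b, a)/(a + b).
assert all(abs(sum(row) - 1.0) < 1e-12 for row in P)
assert abs(dist[0] - b / (a + b)) < 1e-9
print([round(x, 4) for x in dist])  # [0.5714, 0.4286]
```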
11.3 Continuous-Time Markov Chains
(State diagram: states 0, 1, 2 with forward transition rates \lambda_0, \lambda_1 and backward rates \mu_1, \mu_2.)