
From Algorithms to Z-Scores:

Probabilistic and Statistical Modeling in


Computer Science
Norm Matloff, University of California, Davis
f_X(t) = c e^{-0.5 (t-μ)' Σ^{-1} (t-μ)}

library(MASS)
x <- mvrnorm(mu, sgm)

[Cover graphic: plot of a bivariate normal density, with axes x1 and x2 running from -10 to 10 and density values z up to about 0.015.]
See Creative Commons license at
http://heather.cs.ucdavis.edu/~matloff/probstatbook.html
The author has striven to minimize the number of errors, but no guarantee is made as to accuracy
of the contents of this book.
Author's Biographical Sketch
Dr. Norm Matloff is a professor of computer science at the University of California at Davis, and
was formerly a professor of statistics at that university. He is a former database software developer
in Silicon Valley, and has been a statistical consultant for firms such as the Kaiser Permanente
Health Plan.
Dr. Matloff was born in Los Angeles, and grew up in East Los Angeles and the San Gabriel Valley.
He has a PhD in pure mathematics from UCLA, specializing in probability theory and statistics. He
has published numerous papers in computer science and statistics, with current research interests
in parallel processing, statistical computing, and regression methodology.
Prof. Matloff is a former appointed member of IFIP Working Group 11.3, an international com-
mittee concerned with database software security, established under UNESCO. He was a founding
member of the UC Davis Department of Statistics, and participated in the formation of the UCD
Computer Science Department as well. He is a recipient of the campuswide Distinguished Teaching
Award and Distinguished Public Service Award at UC Davis.
Dr. Matloff is the author of two published textbooks, and of a number of widely-used Web tutorials
on computer topics, such as the Linux operating system and the Python programming language.
He and Dr. Peter Salzman are authors of The Art of Debugging with GDB, DDD, and Eclipse.
Prof. Matloff's book on the R programming language, The Art of R Programming, was published
in 2011. He is also the author of several open-source textbooks, including From Algorithms to Z-
Scores: Probabilistic and Statistical Modeling in Computer Science (http://heather.cs.ucdavis.edu/probstatbook),
and Programming on Parallel Machines (http://heather.cs.ucdavis.edu/~matloff/ParProcBook.pdf).
Contents
1 Time Waste Versus Empowerment 1
2 Basic Probability Models 3
2.1 ALOHA Network Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 The Crucial Notion of a Repeatable Experiment . . . . . . . . . . . . . . . . . . . . 5
2.3 Our Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Mailing Tubes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.5 Basic Probability Computations: ALOHA Network Example . . . . . . . . . . . . . 10
2.6 Bayes' Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.7 ALOHA in the Notebook Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.8 Solution Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.9 Example: Divisibility of Random Integers . . . . . . . . . . . . . . . . . . . . . . . . 16
2.10 Example: A Simple Board Game . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.11 Example: Bus Ridership . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.12 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.12.1 Example: Rolling Dice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.12.2 Improving the Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.12.3 Simulation of the ALOHA Example . . . . . . . . . . . . . . . . . . . . . . . 23
2.12.4 Example: Bus Ridership, contd. . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.12.5 Back to the Board Game Example . . . . . . . . . . . . . . . . . . . . . . . . 24
2.12.6 How Long Should We Run the Simulation? . . . . . . . . . . . . . . . . . . . 25
2.13 Combinatorics-Based Probability Computation . . . . . . . . . . . . . . . . . . . . . 25
2.13.1 Which Is More Likely in Five Cards, One King or Two Hearts? . . . . . . . . 25
2.13.2 Example: Lottery Tickets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.13.3 Association Rules in Data Mining . . . . . . . . . . . . . . . . . . . . . . . 27
2.13.4 Multinomial Coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.13.5 Example: Probability of Getting Four Aces in a Bridge Hand . . . . . . . . . 29
3 Discrete Random Variables 35
3.1 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Discrete Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3 Independent Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4 Expected Value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.1 Generality: Not Just for Discrete Random Variables . . . . . . . . . . . . . . 36
3.4.1.1 What Is It? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4.2 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4.3 Computation and Properties of Expected Value . . . . . . . . . . . . . . . . . 37
3.4.4 Mailing Tubes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.4.5 Casinos, Insurance Companies and Sum Users, Compared to Others . . . . 43
3.5 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5.2 Central Importance of the Concept of Variance . . . . . . . . . . . . . . . . . 47
3.5.3 Intuition Regarding the Size of Var(X) . . . . . . . . . . . . . . . . . . . . . . 47
3.5.3.1 Chebychev's Inequality . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.5.3.2 The Coefficient of Variation . . . . . . . . . . . . . . . . . . . . . . . 48
3.6 Indicator Random Variables, and Their Means and Variances . . . . . . . . . . . . . 48
3.7 A Combinatorial Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.8 A Useful Fact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.9 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.10 Expected Value, Etc. in the ALOHA Example . . . . . . . . . . . . . . . . . . . . . 52
3.11 Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.11.1 Example: Toss Coin Until First Head . . . . . . . . . . . . . . . . . . . . . . 54
3.11.2 Example: Sum of Two Dice . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.11.3 Example: Watts-Strogatz Random Graph Model . . . . . . . . . . . . . . . . 55
3.12 Parametric Families of pmfs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.12.1 The Geometric Family of Distributions . . . . . . . . . . . . . . . . . . . . . . 56
3.12.1.1 R Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.12.1.2 Example: a Parking Space Problem . . . . . . . . . . . . . . . . . . 59
3.12.2 The Binomial Family of Distributions . . . . . . . . . . . . . . . . . . . . . . 60
3.12.2.1 R Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.12.2.2 Example: Flipping Coins with Bonuses . . . . . . . . . . . . . . . . 62
3.12.2.3 Example: Analysis of Social Networks . . . . . . . . . . . . . . . . . 63
3.12.3 The Negative Binomial Family of Distributions . . . . . . . . . . . . . . . . . 64
3.12.3.1 Example: Backup Batteries . . . . . . . . . . . . . . . . . . . . . . . 65
3.12.4 The Poisson Family of Distributions . . . . . . . . . . . . . . . . . . . . . . . 65
3.12.4.1 R Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.12.5 The Power Law Family of Distributions . . . . . . . . . . . . . . . . . . . . . 66
3.13 Recognizing Some Parametric Distributions When You See Them . . . . . . . . . . . 68
3.13.1 Example: a Coin Game . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.13.2 Example: Tossing a Set of Four Coins . . . . . . . . . . . . . . . . . . . . . . 70
3.13.3 Example: the ALOHA Example Again . . . . . . . . . . . . . . . . . . . . . . 70
3.14 Example: the Bus Ridership Problem Again . . . . . . . . . . . . . . . . . . . . . . . 71
3.15 A Preview of Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.15.1 Example: Die Game . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.15.2 Long-Run State Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.15.3 Example: 3-Heads-in-a-Row Game . . . . . . . . . . . . . . . . . . . . . . . . 74
3.15.4 Example: ALOHA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.15.5 Example: Bus Ridership Problem . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.15.6 An Inventory Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.16 A Cautionary Tale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.16.1 Trick Coins, Tricky Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.16.2 Intuition in Retrospect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.16.3 Implications for Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.17 Why Not Just Do All Analysis by Simulation? . . . . . . . . . . . . . . . . . . . . . 79
3.18 Proof of Chebychev's Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.19 Reconciliation of Math and Intuition (optional section) . . . . . . . . . . . . . . . . . 81
4 Continuous Probability Models 87
4.1 A Random Dart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.2 Continuous Random Variables Are Useful Unicorns . . . . . . . . . . . . . . . . . 88
4.3 But Equation (4.2) Presents a Problem . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.4 Density Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.4.1 Motivation, Denition and Interpretation . . . . . . . . . . . . . . . . . . . . 92
4.4.2 Properties of Densities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.4.3 A First Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.5 Famous Parametric Families of Continuous Distributions . . . . . . . . . . . . . . . . 97
4.5.1 The Uniform Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.5.1.1 Density and Properties . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.5.1.2 R Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.5.1.3 Example: Modeling of Disk Performance . . . . . . . . . . . . . . . 97
4.5.1.4 Example: Modeling of Denial-of-Service Attack . . . . . . . . . . . . 98
4.5.2 The Normal (Gaussian) Family of Continuous Distributions . . . . . . . . . . 98
4.5.2.1 Density and Properties . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.5.2.2 Example: Network Intrusion . . . . . . . . . . . . . . . . . . . . . . 101
4.5.2.3 Example: Class Enrollment Size . . . . . . . . . . . . . . . . . . . . 102
4.5.2.4 The Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . 102
4.5.2.5 Example: Cumulative Roundoff Error . . . . . . . . . . . . . . . . . 103
4.5.2.6 Example: Bug Counts . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.5.2.7 Example: Coin Tosses . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.5.2.8 Museum Demonstration . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.5.2.9 Optional topic: Formal Statement of the CLT . . . . . . . . . . . . 106
4.5.2.10 Importance in Modeling . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.5.3 The Chi-Squared Family of Distributions . . . . . . . . . . . . . . . . . . . . 107
4.5.3.1 Density and Properties . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.5.3.2 Example: Error in Pin Placement . . . . . . . . . . . . . . . . . . . 107
4.5.3.3 Importance in Modeling . . . . . . . . . . . . . . . . . . . . . . . . . 108
4.5.4 The Exponential Family of Distributions . . . . . . . . . . . . . . . . . . . . . 108
4.5.4.1 Density and Properties . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.5.4.2 R Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.5.4.3 Example: Refunds on Failed Components . . . . . . . . . . . . . . . 109
4.5.4.4 Example: Overtime Parking Fees . . . . . . . . . . . . . . . . . . . . 110
4.5.4.5 Connection to the Poisson Distribution Family . . . . . . . . . . . . 110
4.5.4.6 Importance in Modeling . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.5.5 The Gamma Family of Distributions . . . . . . . . . . . . . . . . . . . . . . . 112
4.5.5.1 Density and Properties . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.5.5.2 Example: Network Buffer . . . . . . . . . . . . . . . . . . . . . . . . 114
4.5.5.3 Importance in Modeling . . . . . . . . . . . . . . . . . . . . . . . . . 114
4.5.6 The Beta Family of Distributions . . . . . . . . . . . . . . . . . . . . . . . . . 114
4.6 Choosing a Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.7 A General Method for Simulating a Random Variable . . . . . . . . . . . . . . . . . 117
4.8 Hybrid Continuous/Discrete Distributions . . . . . . . . . . . . . . . . . . . . . . . 118
5 Describing Failure 121
5.1 Memoryless Property . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.1.1 Derivation and Intuition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.1.2 Continuous-Time Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.1.3 Example: Nonmemoryless Light Bulbs . . . . . . . . . . . . . . . . . . . . . 123
5.2 Hazard Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.2.1 Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.2.2 Example: Software Reliability Models . . . . . . . . . . . . . . . . . . . . . . 126
5.3 A Cautionary Tale: the Bus Paradox . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.3.1 Length-Biased Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.3.2 Probability Mass Functions and Densities in Length-Biased Sampling . . . . 127
5.4 Residual-Life Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.4.1 Renewal Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.4.2 Intuitive Derivation of Residual Life for the Continuous Case . . . . . . . . . 130
5.4.3 Age Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
5.4.4 Mean of the Residual and Age Distributions . . . . . . . . . . . . . . . . . . . 133
5.4.5 Example: Estimating Web Page Modification Rates . . . . . . . . . . . . . . 133
5.4.6 Example: Disk File Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
5.4.7 Example: Memory Paging Model . . . . . . . . . . . . . . . . . . . . . . . . . 134
6 Stop and Review 137
7 Covariance and Random Vectors 141
7.1 Measuring Co-variation of Random Variables . . . . . . . . . . . . . . . . . . . . . . 141
7.1.1 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
7.1.2 Example: Variance of Sum of Nonindependent Variables . . . . . . . . . . . . 143
7.1.3 Example: the Committee Example Again . . . . . . . . . . . . . . . . . . . . 143
7.1.4 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
7.1.5 Example: a Catchup Game . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.2 Sets of Independent Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.2.1 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.2.1.1 Expected Values Factor . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.2.1.2 Covariance Is 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.2.1.3 Variances Add . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.2.2 Examples Involving Sets of Independent Random Variables . . . . . . . . . . 147
7.2.2.1 Example: Dice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.2.2.2 Example: Variance of a Product . . . . . . . . . . . . . . . . . . . . 148
7.2.2.3 Example: Ratio of Independent Geometric Random Variables . . . 148
7.3 Matrix Formulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
7.3.1 Properties of Mean Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
7.3.2 Covariance Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
7.3.3 Example: Easy Sum Again . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.3.4 Example: (X,S) Dice Example Again . . . . . . . . . . . . . . . . . . . . . . . 152
7.3.5 Example: Dice Game . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
8 Multivariate PMFs and Densities 159
8.1 Multivariate Probability Mass Functions . . . . . . . . . . . . . . . . . . . . . . . . . 159
8.2 Multivariate Densities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
8.2.1 Motivation and Denition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
8.2.2 Use of Multivariate Densities in Finding Probabilities and Expected Values . 162
8.2.3 Example: a Triangular Distribution . . . . . . . . . . . . . . . . . . . . . . . 163
8.2.4 Example: Train Rendezvous . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
8.3 More on Sets of Independent Random Variables . . . . . . . . . . . . . . . . . . . . . 167
8.3.1 Probability Mass Functions and Densities Factor in the Independent Case . . 167
8.3.2 Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
8.3.3 Example: Ethernet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
8.3.4 Example: Analysis of Seek Time . . . . . . . . . . . . . . . . . . . . . . . . . 169
8.3.5 Example: Backup Battery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
8.3.6 Example: Minima of Uniformly Distributed Random Variables . . . . . . . . 171
8.3.7 Example: Minima of Independent Exponentially Distributed Random Variables . . 171
8.3.8 Example: Computer Worm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
8.3.9 Example: Ethernet Again . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
8.4 Example: Finding the Distribution of the Sum of Nonindependent Random Variables 175
8.5 Parametric Families of Multivariate Distributions . . . . . . . . . . . . . . . . . . . . 175
8.5.1 The Multinomial Family of Distributions . . . . . . . . . . . . . . . . . . . . 175
8.5.1.1 Probability Mass Function . . . . . . . . . . . . . . . . . . . . . . . 175
8.5.1.2 Example: Component Lifetimes . . . . . . . . . . . . . . . . . . . . 177
8.5.1.3 Mean Vectors and Covariance Matrices in the Multinomial Family . 177
8.5.1.4 Application: Text Mining . . . . . . . . . . . . . . . . . . . . . . . . 180
8.5.2 The Multivariate Normal Family of Distributions . . . . . . . . . . . . . . . 181
8.5.2.1 Densities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
8.5.2.2 Geometric Interpretation . . . . . . . . . . . . . . . . . . . . . . . . 182
8.5.2.3 Properties of Multivariate Normal Distributions . . . . . . . . . . . 185
8.5.2.4 The Multivariate Central Limit Theorem . . . . . . . . . . . . . . . 186
8.5.2.5 Example: Finishing the Loose Ends from the Dice Game . . . . . . 187
8.5.2.6 Application: Data Mining . . . . . . . . . . . . . . . . . . . . . . . . 187
9 Advanced Multivariate Methods 193
9.1 Conditional Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
9.1.1 Conditional Pmfs and Densities . . . . . . . . . . . . . . . . . . . . . . . . . . 193
9.1.2 Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
9.1.3 The Law of Total Expectation (advanced topic) . . . . . . . . . . . . . . . . . 194
9.1.3.1 Conditional Expected Value As a Random Variable . . . . . . . . . 194
9.1.3.2 Famous Formula: Theorem of Total Expectation . . . . . . . . . . . 195
9.1.4 What About the Variance? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
9.1.5 Example: Trapped Miner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
9.1.6 Example: More on Flipping Coins with Bonuses . . . . . . . . . . . . . . . . 198
9.1.7 Example: Analysis of Hash Tables . . . . . . . . . . . . . . . . . . . . . . . . 198
9.2 Simulation of Random Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
9.3 Mixture Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
9.4 Transform Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
9.4.1 Generating Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
9.4.2 Moment Generating Functions . . . . . . . . . . . . . . . . . . . . . . . . . . 205
9.4.3 Transforms of Sums of Independent Random Variables . . . . . . . . . . . . . 206
9.4.4 Example: Network Packets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
9.4.4.1 Poisson Generating Function . . . . . . . . . . . . . . . . . . . . . . 206
9.4.4.2 Sums of Independent Poisson Random Variables Are Poisson Distributed . . 206
9.4.5 Random Number of Bits in Packets on One Link . . . . . . . . . . . . . . . . 207
9.4.6 Other Uses of Transforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
9.5 Vector Space Interpretations (for the mathematically adventurous only) . . . . . . . 209
9.6 Properties of Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
9.7 Conditional Expectation As a Projection . . . . . . . . . . . . . . . . . . . . . . . . . 210
9.8 Proof of the Law of Total Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . 212
10 Introduction to Confidence Intervals 217
10.1 Sampling Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
10.1.1 Random Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
10.1.2 Example: Subpopulation Considerations . . . . . . . . . . . . . . . . . . . . . 219
10.1.3 The Sample Mean: a Random Variable . . . . . . . . . . . . . . . . . . . . . 220
10.1.4 Sample Means Are Approximately Normal, No Matter What the Population Distribution Is . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
10.1.5 The Sample Variance: Another Random Variable . . . . . . . . . . . . . . . 222
10.1.6 A Good Time to Stop and Review! . . . . . . . . . . . . . . . . . . . . . . . . 223
10.2 The Margin of Error and Confidence Intervals . . . . . . . . . . . . . . . . . . . . 223
10.3 Confidence Intervals for Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
10.3.1 Confidence Intervals for Population Means . . . . . . . . . . . . . . . . . . . 225
10.3.2 Example: Simulation Output . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
10.4 Meaning of Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
10.4.1 A Weight Survey in Davis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
10.4.2 One More Point About Interpretation . . . . . . . . . . . . . . . . . . . . . . 228
10.5 General Formation of Confidence Intervals from Approximately Normal Estimators 228
10.6 Confidence Intervals for Proportions . . . . . . . . . . . . . . . . . . . . . . . . . . 229
10.6.1 Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
10.6.2 Simulation Example Again . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
10.6.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
10.6.4 Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
10.6.5 (Non-)Effect of the Population Size . . . . . . . . . . . . . . . . . . . . . . . 232
10.6.6 Planning Ahead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
10.7 Confidence Intervals for Differences of Means or Proportions . . . . . . . . . . . . . 233
10.7.1 Independent Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
10.7.2 Example: Network Security Application . . . . . . . . . . . . . . . . . . . . . 235
10.7.3 Dependent Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
10.7.4 Example: Machine Classification of Forest Covers . . . . . . . . . . . . . . . 237
10.8 R Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
10.9 Example: Amazon Links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
10.10 The Multivariate Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
10.10.1 Sample Mean and Sample Covariance Matrix . . . . . . . . . . . . . . . . . 240
10.10.2 Growth Rate Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
10.11 Advanced Topics in Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . 241
10.12 And What About the Student-t Distribution? . . . . . . . . . . . . . . . . . . . . 241
10.13 Other Confidence Levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
10.14 Real Populations and Conceptual Populations . . . . . . . . . . . . . . . . . . . . 243
10.15 One More Time: Why Do We Use Confidence Intervals? . . . . . . . . . . . . . . 243
11 Introduction to Significance Tests 247
11.1 The Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
11.2 General Testing Based on Normally Distributed Estimators . . . . . . . . . . . . . . 249
11.3 Example: Network Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
11.4 The Notion of p-Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
11.5 R Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
11.6 One-Sided H_A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
11.7 Exact Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
11.7.1 Example: Test for Biased Coin . . . . . . . . . . . . . . . . . . . . . . . . . . 252
11.7.1.1 Example: Light Bulbs . . . . . . . . . . . . . . . . . . . . . . . . . . 252
11.7.2 Example: Test Based on Range Data . . . . . . . . . . . . . . . . . . . . . . . 253
11.7.3 Exact Tests under a Normal Distribution Assumption . . . . . . . . . . . . . 254
11.8 What's Wrong with Significance Testing, and What to Do Instead . . . . . . . . . 254
11.8.1 History of Significance Testing, and Where We Are Today . . . . . . . . . . 255
11.8.2 The Basic Fallacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
11.8.3 You Be the Judge! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
11.8.4 What to Do Instead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
11.8.5 Decide on the Basis of the Preponderance of Evidence . . . . . . . . . . . . 258
11.8.6 Example: the Forest Cover Data . . . . . . . . . . . . . . . . . . . . . . . . . 259
11.8.7 Example: Assessing Your Candidate's Chances for Election . . . . . . . . . . 259
12 General Statistical Estimation and Inference 261
12.1 General Methods of Parametric Estimation . . . . . . . . . . . . . . . . . . . . . . . 261
12.1.1 Example: Guessing the Number of Raffle Tickets Sold . . . . . . . . . . . . . 261
12.1.2 Method of Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
12.1.3 Method of Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . 263
12.1.4 Example: Estimating the Parameters of a Gamma Distribution . . . . . . . 264
12.1.4.1 Method of Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
12.1.4.2 MLEs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
12.1.4.3 R's mle() Function . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
12.1.5 More Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
12.1.6 What About Confidence Intervals? . . . . . . . . . . . . . . . . . . . . . . . 269
12.2 Bias and Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
12.2.1 Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
12.2.2 Why Divide by n-1 in s²? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
12.2.2.1 Example of Bias Calculation . . . . . . . . . . . . . . . . . . . . . . 273
12.2.3 Tradeoff Between Variance and Bias . . . . . . . . . . . . . . . . . . . . . . . 273
12.3 More on the Issue of Independence/Nonindependence of Samples . . . . . . . . . . . 274
12.4 Nonparametric Distribution Estimation . . . . . . . . . . . . . . . . . . . . . . . . . 277
12.4.1 The Empirical cdf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
12.4.2 Basic Ideas in Density Estimation . . . . . . . . . . . . . . . . . . . . . . . . 279
12.4.3 Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
12.4.4 Kernel-Based Density Estimation . . . . . . . . . . . . . . . . . . . . . . . . . 282
12.4.5 Proper Use of Density Estimates . . . . . . . . . . . . . . . . . . . . . . . . . 284
12.5 Bayesian Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
12.5.1 How It Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
12.5.1.1 Empirical Bayes Methods . . . . . . . . . . . . . . . . . . . . . . . . 287
12.5.2 Extent of Usage of Subjective Priors . . . . . . . . . . . . . . . . . . . . . . . 287
12.5.3 Arguments Against Use of Subjective Priors . . . . . . . . . . . . . . . . . . . 288
12.5.4 What Would You Do? A Possible Resolution . . . . . . . . . . . . . . . . . . 289
13 Advanced Statistical Estimation and Inference 293
13.1 Slutsky's Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
13.1.1 The Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
13.1.2 Why It's Valid to Substitute s for σ . . . . . . . . . . . . . . . . . . . . . . . 294
13.1.3 Example: Confidence Interval for a Ratio Estimator . . . . . . . . . . . . . . 294
13.2 The Delta Method: Confidence Intervals for General Functions of Means or Proportions . . 295
13.2.1 The Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
13.2.2 Example: Square Root Transformation . . . . . . . . . . . . . . . . . . . . . . 298
13.2.3 Example: Confidence Interval for σ² . . . . . . . . . . . . . . . . . . . . . . . 299
13.2.4 Example: Confidence Interval for a Measurement of Prediction Ability . . . . 301
13.3 Simultaneous Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
13.3.1 The Bonferroni Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
13.3.2 Scheffe's Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
13.3.3 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
13.3.4 Other Methods for Simultaneous Inference . . . . . . . . . . . . . . . . . . . . 306
13.4 The Bootstrap Method for Forming Confidence Intervals . . . . . . . . . . . . . . . 306
13.4.1 Basic Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
13.4.2 Example: Confidence Intervals for a Population Variance . . . . . . . . . . . 308
13.4.3 Computation in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
13.4.4 General Applicability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
13.4.5 Why It Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
14 Introduction to Model Building 311
14.1 Desperate for Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
14.1.1 Known Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
14.1.2 Estimated Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
14.1.3 The Bias/Variance Tradeoff . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
14.1.4 Implications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
14.2 Assessing Goodness of Fit of a Model . . . . . . . . . . . . . . . . . . . . . . . . . 316
14.2.1 The Chi-Square Goodness of Fit Test . . . . . . . . . . . . . . . . . . . . . . 316
14.2.2 Kolmogorov-Smirnov Confidence Bands . . . . . . . . . . . . . . . . . . . . . 317
14.3 Bias Vs. Variance, Again . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
14.4 Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
15 Relations Among Variables: Linear Regression 323
15.1 The Goals: Prediction and Understanding . . . . . . . . . . . . . . . . . . . . . . . . 323
15.2 Example Applications: Software Engineering, Networks, Text Mining . . . . . . . . . 324
15.3 Adjusting for Covariates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
15.4 What Does Relationship Really Mean? . . . . . . . . . . . . . . . . . . . . . . . . 325
15.5 Estimating That Relationship from Sample Data . . . . . . . . . . . . . . . . . . . . 326
15.6 Multiple Regression: More Than One Predictor Variable . . . . . . . . . . . . . . . . 329
15.7 Interaction Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
15.8 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
15.9 Preview of Linear Regression Analysis with R . . . . . . . . . . . . . . . . . . . . . . 331
15.10 Parametric Estimation of Linear Regression Functions . . . . . . . . . . . . . . . 334
15.10.1 Meaning of "Linear" . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
15.10.2 Point Estimates and Matrix Formulation . . . . . . . . . . . . . . . . . . . 334
15.10.3 Back to Our ALOHA Example . . . . . . . . . . . . . . . . . . . . . . . . . 337
15.10.4 Approximate Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . 344
15.10.5 Once Again, Our ALOHA Example . . . . . . . . . . . . . . . . . . . . . . 346
15.10.6 Exact Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
15.11 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
15.11.1 The Overfitting Problem in Regression . . . . . . . . . . . . . . . . . . . . 348
15.11.2 Multicollinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
15.11.3 Methods for Predictor Variable Selection . . . . . . . . . . . . . . . . . . . 350
15.11.4 A Rough Rule of Thumb . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
15.12 Nominal Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
15.13 Regression Diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
15.14 Case Study: Prediction of Network RTT . . . . . . . . . . . . . . . . . . . . . . . 353
15.15 The Famous "Error Term" . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
16 Relations Among Variables: Advanced 355
16.1 Nonlinear Parametric Regression Models . . . . . . . . . . . . . . . . . . . . . . . . . 355
16.2 The Classification Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
16.2.1 The Mean Here Is a Probability . . . . . . . . . . . . . . . . . . . . . . . . . . 356
16.2.2 Logistic Regression: a Common Parametric Model for the Regression Function in Classification Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
16.2.2.1 The Logistic Model: Intuitive Motivation . . . . . . . . . . . . . . . 358
16.2.2.2 The Logistic Model: Theoretical Motivation . . . . . . . . . . . . . 358
16.2.3 Variable Selection in Classification Problems . . . . . . . . . . . . . . . . . . 359
16.2.3.1 Problems Inherited from the Regression Context . . . . . . . . . . . 359
16.2.3.2 Example: Forest Cover Data . . . . . . . . . . . . . . . . . . . . . . 360
16.2.4 Y Must Have a Marginal Distribution! . . . . . . . . . . . . . . . . . . . . . . 361
16.3 Nonparametric Estimation of Regression and Classification Functions . . . . . . . . 362
16.3.1 Methods Based on Estimating m_{Y;X}(t) . . . . . . . . . . . . . . . . . . . 362
16.3.1.1 Kernel-Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . 362
16.3.1.2 Nearest-Neighbor Methods . . . . . . . . . . . . . . . . . . . . . . . 363
16.3.1.3 The Naive Bayes Method . . . . . . . . . . . . . . . . . . . . . . . . 363
16.3.2 Methods Based on Estimating Classification Boundaries . . . . . . . . . . . . 364
16.3.2.1 Support Vector Machines (SVMs) . . . . . . . . . . . . . . . . . . . 364
16.3.2.2 CART . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
16.3.3 Comparison of Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
16.4 Symmetric Relations Among Several Variables . . . . . . . . . . . . . . . . . . . . . 368
16.4.1 Principal Components Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 369
16.4.2 How to Calculate Them . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
16.4.3 Example: Forest Cover Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
16.4.4 Log-Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
16.4.4.1 The Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
16.4.4.2 The Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
16.4.4.3 The Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
16.4.4.4 Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 374
16.4.4.5 The Goal: Parsimony Again . . . . . . . . . . . . . . . . . . . . . . 375
16.5 Simpson's (Non-)Paradox . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
16.6 Linear Regression with All Predictors Being Nominal Variables: Analysis of Variance 380
16.6.1 It's a Regression! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
16.6.2 Interaction Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
16.6.3 Now Consider Parsimony . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
16.6.4 Reparameterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
16.7 Optimality Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384
16.7.1 Optimality of the Regression Function for General Y . . . . . . . . . . . . . . 384
16.7.2 Optimality of the Regression Function for 0-1-Valued Y . . . . . . . . . . . . 385
17 Markov Chains 387
17.1 Discrete-Time Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
17.1.1 Example: Finite Random Walk . . . . . . . . . . . . . . . . . . . . . . . . . . 387
17.1.2 Long-Run Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388
17.1.2.1 Derivation of the Balance Equations . . . . . . . . . . . . . . . . . . 389
17.1.2.2 Solving the Balance Equations . . . . . . . . . . . . . . . . . . . . . 389
17.1.2.3 Periodic Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
17.1.2.4 The Meaning of the Term Stationary Distribution . . . . . . . . . 391
17.1.3 Example: Stuck-At 0 Fault . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
17.1.3.1 Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
17.1.3.2 Initial Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
17.1.3.3 Going Beyond Finding π . . . . . . . . . . . . . . . . . . . . . . . . 394
17.1.4 Example: Shared-Memory Multiprocessor . . . . . . . . . . . . . . . . . . . . 396
17.1.4.1 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
17.1.4.2 Going Beyond Finding π . . . . . . . . . . . . . . . . . . . . . . . . 398
17.1.5 Example: Slotted ALOHA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
17.1.5.1 Going Beyond Finding π . . . . . . . . . . . . . . . . . . . . . . . . 400
17.2 Simulation of Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
17.3 Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
17.4 Continuous-Time Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
17.4.1 Holding-Time Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
17.4.2 The Notion of Rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
17.4.3 Stationary Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
17.4.3.1 Intuitive Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
17.4.3.2 Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
17.4.4 Example: Machine Repair . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
17.4.5 Example: Migration in a Social Network . . . . . . . . . . . . . . . . . . . . . 409
17.4.6 Continuous-Time Birth/Death Processes . . . . . . . . . . . . . . . . . . . . . 409
17.5 Hitting Times Etc. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
17.5.1 Some Mathematical Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . 411
17.5.2 Example: Random Walks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
17.5.3 Finding Hitting and Recurrence Times . . . . . . . . . . . . . . . . . . . . . . 413
17.5.4 Example: Finite Random Walk . . . . . . . . . . . . . . . . . . . . . . . . . . 414
17.5.5 Example: Tree-Searching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
18 Introduction to Queuing Models 421
18.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
18.2 M/M/1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
18.2.1 Steady-State Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
18.2.2 Mean Queue Length . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
18.2.3 Distribution of Residence Time/Little's Rule . . . . . . . . . . . . . . . . . . 423
18.3 Multi-Server Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
18.3.1 M/M/c . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
18.3.2 M/M/2 with Heterogeneous Servers . . . . . . . . . . . . . . . . . . . . . . . 427
18.4 Loss Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
18.4.1 Cell Communications Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
18.4.1.1 Stationary Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 430
18.4.1.2 Going Beyond Finding the π . . . . . . . . . . . . . . . . . . . . . . 431
18.5 Nonexponential Service Times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431
18.6 Reversed Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
18.6.1 Markov Property . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
18.6.2 Long-Run State Proportions . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
18.6.3 Form of the Transition Rates of the Reversed Chain . . . . . . . . . . . . . . 434
18.6.4 Reversible Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
18.6.4.1 Conditions for Checking Reversibility . . . . . . . . . . . . . . . . . 435
18.6.4.2 Making New Reversible Chains from Old Ones . . . . . . . . . . . . 435
18.6.4.3 Example: Distribution of Residual Life . . . . . . . . . . . . . . . . 436
18.6.4.4 Example: Queues with a Common Waiting Area . . . . . . . . . . . 436
18.6.4.5 Closed-Form Expression for π for Any Reversible Markov Chain . . 437
18.7 Networks of Queues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
18.7.1 Tandem Queues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
18.7.2 Jackson Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
18.7.2.1 Open Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
18.7.3 Closed Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
A Review of Matrix Algebra 443
A.1 Terminology and Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
A.1.1 Matrix Addition and Multiplication . . . . . . . . . . . . . . . . . . . . . . . 444
A.2 Matrix Transpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
A.3 Linear Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
A.4 Determinants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
A.5 Matrix Inverse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
A.6 Eigenvalues and Eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
B R Quick Start 449
B.1 Correspondences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
B.2 Starting R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
B.3 First Sample Programming Session . . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
B.4 Second Sample Programming Session . . . . . . . . . . . . . . . . . . . . . . . . . . . 453
B.5 Third Sample Programming Session . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
B.6 Complex Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
B.7 Other Sources for Learning R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
B.8 Online Help . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
B.9 Debugging in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458
Preface
Why is this book different from all other books on mathematical probability and statistics, especially
those for computer science students? What sets this text apart is its consistently applied approach,
in a number of senses:
First, there is a strong emphasis on intuition, with less mathematical formalism. In my experience,
defining probability via sample spaces, the standard approach, is a major impediment to doing
good applied work. The same holds for defining expected value as a weighted average. Instead, I
use the intuitive, informal approach of long-run frequency and long-run average.
However, in spite of the relative lack of formalism, all models and so on are described precisely
in terms of random variables and distributions. And the material is actually somewhat more
mathematical than most at this level in the sense that it makes extensive usage of linear algebra.
Second, the book stresses real-world applications. Many similar texts, notably the elegant and
interesting book by Mitzenmacher, focus on probability, in fact discrete probability. Their intended
class of applications is the theoretical analysis of algorithms. I instead focus on the actual use
of the material in the real world, which tends to be more continuous than discrete, and more in
the realm of statistics than probability. This should prove especially valuable, as big data and
machine learning now play a significant role in applications of computers.
Third, there is a strong emphasis on modeling. Considerable emphasis is placed on questions such
as: What do probabilistic models really mean, in real-life terms? How does one choose a model?
How do we assess the practical usefulness of models? This aspect is so important that there is
a separate chapter for this, titled Introduction to Model Building. Throughout the text, there is
considerable discussion of the real-world meaning of probabilistic concepts. For instance, when
probability density functions are introduced, there is an extended discussion regarding the intuitive
meaning of densities in light of the inherently-discrete nature of real data, due to the finite precision
of measurement.
Finally, the R statistical/data manipulation language is used throughout. Again, several excellent
texts on probability and statistics have been written that feature R, but this book, by virtue of
having a computer science audience, uses R in a more sophisticated manner. My open source
tutorial on R programming, R for Programmers (http://heather.cs.ucdavis.edu/~matloff/R/RProg.pdf),
can be used as a supplement. (More advanced R programming is covered in my book,
The Art of R Programming, No Starch Press, 2011.)
As prerequisites, the student must know calculus, basic matrix algebra, and have some skill in
programming. As with any text in probability and statistics, it is also necessary that the student
has a good sense of math intuition, and does not treat mathematics as simply memorization of
formulas.
A couple of points regarding computer usage:
- In the mathematical exercises, the instructor is urged to require that the students not only do
the mathematical derivations but also check their results by writing R simulation code. This
gives the students better intuition, and has the huge practical benefit that it gives partial
confirmation that the student's answer is correct.
- In the chapters on statistics, it is crucial that students apply the concepts in thought-provoking
exercises on real data. Nowadays there are many good sources for real data sets available.
Here are a few to get you started:
  - UC Irvine Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.html
  - UCLA Statistics Dept. data sets, http://www.stat.ucla.edu/data/
  - Dr. B's Wide World of Web Data, http://research.ed.asu.edu/multimedia/DrB/Default.htm
  - StatSci.org, at http://www.statsci.org/datasets.html
  - University of Edinburgh School of Informatics, http://www.inf.ed.ac.uk/teaching/courses/dme/html/datasets0405.html
Note that R has the capability of reading files on the Web, e.g.
> z <- read.table("http://heather.cs.ucdavis.edu/~matloff/z")
This work is licensed under a Creative Commons Attribution-No Derivative Works 3.0 United States
License. The details may be viewed at http://creativecommons.org/licenses/by-nd/3.0/us/,
but in essence it states that you are free to use, copy and distribute the work, but you must
attribute the work to me and not alter, transform, or build upon it. If you are using the book,
either in teaching a class or for your own learning, I would appreciate your informing me. I retain
copyright in all non-U.S. jurisdictions, but permission to use these materials in teaching is still
granted, provided the licensing information here is displayed.
Chapter 1
Time Waste Versus Empowerment
I took a course in speed reading, and read War and Peace in 20 minutes. It's about Russia. (comedian Woody Allen)

I learned very early the difference between knowing the name of something and knowing something. (Richard Feynman, Nobel laureate in physics)

The main goal [of this course] is self-actualization through the empowerment of claiming your education. (UCSC, and former UCD, professor Marc Mangel, in the syllabus for his calculus course)

What does this really mean? Hmm, I've never thought about that. (UCD PhD student in statistics, in answer to a student who asked the actual meaning of a very basic concept)

You have a PhD in mechanical engineering. You may have forgotten technical details like (d/dt) sin(t) = cos(t), but you should at least understand the concepts of rates of change. (the author, gently chiding a friend who was having trouble following a simple quantitative discussion of trends in California's educational system)
The field of probability and statistics (which, for convenience, I will refer to simply as statistics
below) impacts many aspects of our daily lives: business, medicine, the law, government and so
on. Consider just a few examples:
- The statistical models used on Wall Street made the quants (quantitative analysts) rich,
but also contributed to the worldwide financial crash of 2008.
- In a court trial, large sums of money or the freedom of an accused may hinge on whether the
judge and jury understand some statistical evidence presented by one side or the other.
- Wittingly or unconsciously, you are using probability every time you gamble in a casino, and
every time you buy insurance.
- Statistics is used to determine whether a new medical treatment is safe/effective for you.
- Statistics is used to flag possible terrorists, but sometimes unfairly singling out innocent
people while other times missing ones who really are dangerous.
Clearly, statistics matters. But it only has value when one really understands what it means and
what it does. Indeed, blindly plugging into statistical formulas can be not only valueless but in
fact highly dangerous, say if a bad drug goes onto the market.
Yet most people view statistics as exactly that: mindless plugging into boring formulas. If even
the statistics graduate student quoted above thinks this, how can the students taking the course
be blamed for taking that attitude?
I once had a student who had an unusually good understanding of probability. It turned out that
this was due to his being highly successful at playing online poker, winning lots of cash. No blind
formula-plugging for him! He really had to understand how probability works.
Statistics is not just a bunch of formulas. On the contrary, it can be mathematically deep, for those
who like that kind of thing. (Much of statistics can be viewed as the Pythagorean Theorem in
n-dimensional or even infinite-dimensional space.) But the key point is that anyone who has taken
a calculus course can develop true understanding of statistics, of real practical value. As Professor
Mangel says, that's empowering.
So as you make your way through this book, always stop to think, What does this equation really
mean? What is its goal? Why are its ingredients defined in the way they are? Might there be a
better way? How does this relate to our daily lives? Now THAT is empowering.
Chapter 2
Basic Probability Models
This chapter will introduce the general notions of probability. Most of it will seem intuitive to you,
but pay careful attention to the general principles which are developed; in more complex settings
intuition may not be enough, and the tools discussed here will be very useful.
2.1 ALOHA Network Example
Throughout this book, we will be discussing both classical probability examples involving coins,
cards and dice, and also examples involving applications to computer science. The latter will involve
diverse fields such as data mining, machine learning, computer networks, software engineering and
bioinformatics.
In this section, an example from computer networks is presented which will be used at a number
of points in this chapter. Probability analysis is used extensively in the development of new, faster
types of networks.
Todays Ethernet evolved from an experimental network developed at the University of Hawaii,
called ALOHA. A number of network nodes would occasionally try to use the same radio channel to
communicate with a central computer. The nodes couldn't hear each other, due to the obstruction
of mountains between them. If only one of them made an attempt to send, it would be successful,
and it would receive an acknowledgement message in response from the central computer. But if
more than one node were to transmit, a collision would occur, garbling all the messages. The
sending nodes would timeout after waiting for an acknowledgement which never came, and try
sending again later. To avoid having too many collisions, nodes would engage in random backoff,
meaning that they would refrain from sending for a while even though they had something to send.
One variation is slotted ALOHA, which divides time into intervals which I will call epochs. Each
epoch will have duration 1.0, so epoch 1 extends from time 0.0 to 1.0, epoch 2 extends from 1.0 to
2.0 and so on. In the version we will consider here, in each epoch, if a node is active, i.e. has a
message to send, it will either send or refrain from sending, with probability p and 1-p. The value
of p is set by the designer of the network. (Real Ethernet hardware does something like this, using
a random number generator inside the chip.)
The other parameter q in our model is the probability that a node which had been inactive generates
a message during an epoch, i.e. the probability that the user hits a key, and thus becomes active.
Think of what happens when you are at a computer. You are not typing constantly, and when you
are not typing, the time until you hit a key again will be random. Our parameter q models that
randomness.
Let n be the number of nodes, which we'll assume for simplicity is two. Assume also for simplicity that the timing is as follows: arrival of a new message happens in the middle of an epoch, and the decision as to whether to send versus back off is made near the end of an epoch, say 90% into the epoch.
For example, say that at the beginning of the epoch which extends from time 15.0 to 16.0, node A
has something to send but node B does not. At time 15.5, node B will either generate a message
to send or not, with probability q and 1-q, respectively. Suppose B does generate a new message.
At time 15.9, node A will either try to send or refrain, with probability p and 1-p, and node B will
do the same. Suppose A refrains but B sends. Then B's transmission will be successful, and at the
start of epoch 16 B will be inactive, while node A will still be active. On the other hand, suppose
both A and B try to send at time 15.9; both will fail, and thus both will be active at time 16.0,
and so on.
Be sure to keep in mind that in our simple model here, during the time a node is active, it won't generate any additional new messages.
(Note: The definition of this ALOHA model is summarized concisely on page 10.)
Let's observe the network for two epochs, epoch 1 and epoch 2. Assume that the network consists of just two nodes, called node 1 and node 2, both of which start out active. Let X_1 and X_2 denote the numbers of active nodes at the very end of epochs 1 and 2, after possible transmissions. We'll take p to be 0.4 and q to be 0.8 in this example.
Let's find P(X_1 = 2), the probability that X_1 = 2, and then get to the main point, which is to ask what we really mean by this probability.

How could X_1 = 2 occur? There are two possibilities:

* both nodes try to send; this has probability p^2
* neither node tries to send; this has probability (1-p)^2
1,1 1,2 1,3 1,4 1,5 1,6
2,1 2,2 2,3 2,4 2,5 2,6
3,1 3,2 3,3 3,4 3,5 3,6
4,1 4,2 4,3 4,4 4,5 4,6
5,1 5,2 5,3 5,4 5,5 5,6
6,1 6,2 6,3 6,4 6,5 6,6
Table 2.1: Sample Space for the Dice Example
Thus

P(X_1 = 2) = p^2 + (1-p)^2 = 0.52   (2.1)
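As a quick sanity check (an added sketch, not part of the original text), we can verify this figure by simulating epoch 1 directly in R; the variable names here are just hypothetical choices:

p <- 0.4
nsends <- replicate(100000, sum(runif(2) < p))  # number of the two active nodes that try to send
mean(nsends != 1)  # X1 = 2 exactly when 0 or 2 nodes send; should be near 0.52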
2.2 The Crucial Notion of a Repeatable Experiment

It's crucial to understand what that 0.52 figure really means in a practical sense. To this end, let's put the ALOHA example aside for a moment, and consider the experiment consisting of rolling two dice, say a blue one and a yellow one. Let X and Y denote the number of dots we get on the blue and yellow dice, respectively, and consider the meaning of P(X+Y = 6) = 5/36.
In the mathematical theory of probability, we talk of a sample space, which (in simple cases) consists of the possible outcomes (X,Y), seen in Table 2.1. In a theoretical treatment, we place weights of 1/36 on each of the points in the space, reflecting the fact that each of the 36 points is equally likely, and then say, "What we mean by P(X+Y = 6) = 5/36 is that the outcomes (1,5), (2,4), (3,3), (4,2), (5,1) have total weight 5/36."
Unfortunately, the notion of sample space becomes mathematically tricky when developed for more
complex probability models. Indeed, it requires graduate-level math. And much worse, one loses all
the intuition. In any case, most probability computations do not rely on explicitly writing down a
sample space. In this particular example it is useful for us as a vehicle for explaining the concepts,
but we will NOT use it much. Those who wish to get a more theoretical grounding can get a start
in Section 3.19.
But the intuitive notion, which is FAR more important, of what P(X+Y = 6) = 5/36 means is the following. Imagine doing the experiment many, many times, recording the results in a large notebook:
* Roll the dice the first time, and write the outcome on the first line of the notebook.
notebook line outcome blue+yellow = 6?
1 blue 2, yellow 6 No
2 blue 3, yellow 1 No
3 blue 1, yellow 1 No
4 blue 4, yellow 2 Yes
5 blue 1, yellow 1 No
6 blue 3, yellow 4 No
7 blue 5, yellow 1 Yes
8 blue 3, yellow 6 No
9 blue 2, yellow 5 No
Table 2.2: Notebook for the Dice Problem
* Roll the dice the second time, and write the outcome on the second line of the notebook.
* Roll the dice the third time, and write the outcome on the third line of the notebook.
* Roll the dice the fourth time, and write the outcome on the fourth line of the notebook.
* Imagine you keep doing this, thousands of times, filling thousands of lines in the notebook.
The first 9 lines of the notebook might look like Table 2.2. Here 2/9 of these lines say Yes. But after many, many repetitions, approximately 5/36 of the lines will say Yes. For example, after doing the experiment 720 times, approximately (5/36) × 720 = 100 lines will say Yes.

This is what probability really is: In what fraction of the lines does the event of interest happen? It sounds simple, but if you always think about this "lines in the notebook" idea, probability problems are a lot easier to solve. And it is the fundamental basis of computer simulation.
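Here is a quick illustration of the notebook idea in R (an added sketch, not part of the original text); each call made by replicate() plays the role of one notebook line:

rolls <- replicate(100000, sum(sample(1:6, 2, replace=TRUE)))  # 100,000 notebook lines
mean(rolls == 6)  # fraction of lines saying Yes; approaches 5/36 = 0.139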
2.3 Our Definitions

These definitions are intuitive, rather than rigorous math, but intuition is what we need. Keep in mind that we are making definitions below, not listing properties.

* We assume an experiment which is (at least in concept) repeatable. The experiment of rolling two dice is repeatable, and even the ALOHA experiment is so. (We simply watch the network for a long time, collecting data on pairs of consecutive epochs in which there are two active stations at the beginning.) On the other hand, the econometricians, in forecasting 2009, cannot repeat 2008. Yet all of the econometricians' tools assume that events in 2008 were affected by various sorts of randomness, and we think of repeating the experiment in a conceptual sense.
* We imagine performing the experiment a large number of times, recording the result of each repetition on a separate line in a notebook.

* We say A is an event for this experiment if it is a possible boolean (i.e. yes-or-no) outcome of the experiment. In the above example, here are some events:

  * X+Y = 6
  * X = 1
  * Y = 3
  * X-Y = 4

* A random variable is a numerical outcome of the experiment, such as X and Y here, as well as X+Y, 2XY and even sin(XY).
* For any event of interest A, imagine a column on A in the notebook. The kth line in the notebook, k = 1,2,3,..., will say Yes or No, depending on whether A occurred or not during the kth repetition of the experiment. For instance, we have such a column in our table above, for the event A = "blue+yellow = 6."

* For any event of interest A, we define P(A) to be the long-run fraction of lines with Yes entries.

* For any events A, B, imagine a new column in our notebook, labeled "A and B." In each line, this column will say Yes if and only if there are Yes entries for both A and B. P(A and B) is then the long-run fraction of lines with Yes entries in the new column labeled "A and B."[1]
* For any events A, B, imagine a new column in our notebook, labeled "A or B." In each line, this column will say Yes if and only if at least one of the entries for A and B says Yes.[2]

* For any events A, B, imagine a new column in our notebook, labeled "A | B" and pronounced "A given B." In each line:

  * This new column will say NA ("not applicable") if the B entry is No.
  * If it is a line in which the B column says Yes, then this new column will say Yes or No, depending on whether the A column says Yes or No.
[1] In most textbooks, what we call "A and B" here is written A ∩ B, indicating the intersection of two sets in the sample space. But again, we do not take a sample space point of view here.
[2] In the sample space approach, this is written A ∪ B.
8 CHAPTER 2. BASIC PROBABILITY MODELS
Think of probabilities in this "notebook" context:

* P(A) means the long-run fraction of lines in the notebook in which the A column says Yes.
* P(A or B) means the long-run fraction of lines in the notebook in which the A-or-B column says Yes.
* P(A and B) means the long-run fraction of lines in the notebook in which the A-and-B column says Yes.
* P(A | B) means the long-run fraction of lines in the notebook in which the A | B column says Yes, among the lines which do NOT say NA.
A hugely common mistake is to confuse P(A and B) and P(A | B). This is where the notebook view becomes so important. Compare the quantities P(X = 1 and S = 6) = 1/36 and P(X = 1 | S = 6) = 1/5, where S = X+Y (think of adding an S column to the notebook, too):

* After a large number of repetitions of the experiment, approximately 1/36 of the lines of the notebook will have the property that both X = 1 and S = 6 (since "X = 1 and S = 6" is equivalent to "X = 1 and Y = 5").

* After a large number of repetitions of the experiment, if we look only at the lines in which S = 6, then among those lines, approximately 1/5 of those lines will show X = 1.
The quantity P(A|B) is called the conditional probability of A, given B.
Note that "and" has higher logical precedence than "or." For example, P(A and B or C) means P[(A and B) or C]. Also, "not" has higher precedence than "and."
Here are some more very important definitions and properties:
Definition 1 Suppose A and B are events such that it is impossible for them to occur in the same line of the notebook. They are said to be disjoint events.

If A and B are disjoint events, then

P(A or B) = P(A) + P(B)   (2.2)

Again, this terminology "disjoint" stems from the set-theoretic sample space approach, where it means that A ∩ B = ∅. That mathematical terminology works fine for our dice example,
but in my experience people have major difficulty applying it correctly in more complicated problems. This is another illustration of why I put so much emphasis on the "notebook" framework.
If A and B are not disjoint, then

P(A or B) = P(A) + P(B) - P(A and B)   (2.3)

In the disjoint case, that subtracted term is 0, so (2.3) reduces to (2.2).
Definition 2 Events A and B are said to be stochastically independent, usually just stated as "independent" ("stochastic" is just a fancy synonym for "random"), if

P(A and B) = P(A) × P(B)   (2.4)
In calculating an "and" probability, how does one know whether the events are independent? The answer is that this will typically be clear from the problem. If we toss the blue and yellow dice, for instance, it is clear that one die has no impact on the other, so events involving the blue die are independent of events involving the yellow die. On the other hand, in the ALOHA example, it's clear that events involving X_1 are NOT independent of those involving X_2.
If A and B are not independent, the equation (2.4) generalizes to

P(A and B) = P(A) P(B|A)   (2.5)

Note that if A and B actually are independent, then P(B|A) = P(B), and (2.5) reduces to (2.4).

Note too that (2.5) implies

P(B|A) = P(A and B) / P(A)   (2.6)
2.4 Mailing Tubes
"If I ever need to buy some mailing tubes, I can come here." (A friend of the author's, while browsing through an office supplies store)
Examples of the above properties, e.g. (2.4) and (2.5), will be given starting in Section 2.5. But first, a crucial strategic point in learning probability must be addressed.
Some years ago, a friend of mine was in an office supplies store, and he noticed a rack of mailing tubes. My friend made the remark shown above. Well, (2.4) and (2.5) are "mailing tubes": make a mental note to yourself saying, "If I ever need to find a probability involving 'and,' one thing I can try is (2.4) and (2.5)." Be ready for this!

This "mailing tube" metaphor will be mentioned often, such as in Section 3.4.4.
2.5 Basic Probability Computations: ALOHA Network Example
Please keep in mind that the notebook idea is simply a vehicle to help you understand what the
concepts really mean. This is crucial for your intuition and your ability to apply this material in
the real world. But the notebook idea is NOT for the purpose of calculating probabilities. Instead,
we use the properties of probability, as seen in the following.
Let's look at all of this in the ALOHA context. Here's a summary:

* We have n network nodes, sharing a common communications channel.

* Time is divided in epochs. X_k denotes the number of active nodes at the end of epoch k, which we will sometimes refer to as the state of the system in epoch k.

* If two or more nodes try to send in an epoch, they collide, and the message doesn't get through.

* We say a node is active if it has a message to send.

* If a node is active near the end of an epoch, it tries to send with probability p.

* If a node is inactive at the beginning of an epoch, then at the middle of the epoch it will generate a message to send with probability q.

* In our examples here, we have n = 2 and X_0 = 2, i.e. both nodes start out active.
Now, in Equation (2.1) we found that

P(X_1 = 2) = p^2 + (1-p)^2 = 0.52   (2.7)

How did we get this? Let C_i denote the event that node i tries to send, i = 1,2. Then using the definitions above, our steps would be
P(X_1 = 2) = P([C_1 and C_2] or [not C_1 and not C_2])   (2.8)
           = P(C_1 and C_2) + P(not C_1 and not C_2)   (from (2.2))   (2.9)
           = P(C_1) P(C_2) + P(not C_1) P(not C_2)   (from (2.4))   (2.10)
           = p^2 + (1-p)^2   (2.11)

(The brackets in (2.8) do not represent some esoteric mathematical operation. They are there simply to make the grouping clearer, corresponding to events G and H defined below.)
Here are the reasons for these steps:

(2.8): We listed the ways in which the event X_1 = 2 could occur.

(2.9): Write G = "C_1 and C_2" and H = "D_1 and D_2," where D_i = "not C_i," i = 1,2. Then the events G and H are clearly disjoint; if in a given line of our notebook there is a Yes for G, then definitely there will be a No for H, and vice versa.

(2.10): The two nodes act physically independently of each other. Thus the events C_1 and C_2 are stochastically independent, so we applied (2.4). Then we did the same for D_1 and D_2.
Now, what about P(X_2 = 2)? Again, we break big events down into small events, in this case according to the value of X_1:

P(X_2 = 2) = P(X_1 = 0 and X_2 = 2, or X_1 = 1 and X_2 = 2, or X_1 = 2 and X_2 = 2)
           = P(X_1 = 0 and X_2 = 2)   (2.12)
             + P(X_1 = 1 and X_2 = 2)
             + P(X_1 = 2 and X_2 = 2)
Since X_1 cannot be 0, that first term, P(X_1 = 0 and X_2 = 2), is 0. To deal with the second term, P(X_1 = 1 and X_2 = 2), we'll use (2.5). Due to the time-sequential nature of our experiment here, it is natural (but certainly not mandated, as we'll often see situations to the contrary) to take A and B to be {X_1 = 1} and {X_2 = 2}, respectively. So, we write

P(X_1 = 1 and X_2 = 2) = P(X_1 = 1) P(X_2 = 2 | X_1 = 1)   (2.13)
To calculate P(X_1 = 1), we use the same kind of reasoning as in Equation (2.1). For the event in question to occur, either node A would send and B wouldn't, or A would refrain from sending and B would send. Thus

P(X_1 = 1) = 2p(1-p) = 0.48   (2.14)
Now we need to find P(X_2 = 2 | X_1 = 1). This again involves breaking big events down into small ones. If X_1 = 1, then X_2 = 2 can occur only if both of the following occur:

* Event A: Whichever node was the one to successfully transmit during epoch 1 (and we are given that there indeed was one, since X_1 = 1) now generates a new message.

* Event B: During epoch 2, no successful transmission occurs, i.e. either they both try to send or neither tries to send.

Recalling the definitions of p and q in Section 2.1, we have that

P(X_2 = 2 | X_1 = 1) = q[p^2 + (1-p)^2] = 0.41   (2.15)

Thus P(X_1 = 1 and X_2 = 2) = 0.48 × 0.41 = 0.20.
We go through a similar analysis for P(X_1 = 2 and X_2 = 2): We recall that P(X_1 = 2) = 0.52 from before, and find that P(X_2 = 2 | X_1 = 2) = 0.52 as well. So we find P(X_1 = 2 and X_2 = 2) to be 0.52^2 = 0.27. Putting all this together, we find that P(X_2 = 2) = 0.47.
Let's do one more; let's find P(X_1 = 1 | X_2 = 2). [Pause a minute here to make sure you understand that this is quite different from P(X_2 = 2 | X_1 = 1).] From (2.6), we know that

P(X_1 = 1 | X_2 = 2) = P(X_1 = 1 and X_2 = 2) / P(X_2 = 2)   (2.16)

We computed both numerator and denominator here before, in Equations (2.13) and (2.12), so we see that P(X_1 = 1 | X_2 = 2) = 0.20/0.47 = 0.43.
So, in our notebook view, if we were to look only at lines in the notebook for which X_2 = 2, a fraction 0.43 of those lines would have X_1 = 1.
You might be bothered that we are looking "backwards in time" in (2.16), kind of guessing the past from the present. There is nothing wrong or unnatural about that. Jurors in court trials do it all the time, though presumably not with formal probability calculation. And evolutionary biologists do use formal probability models to guess the past.
Note by the way that events involving X_2 are NOT independent of those involving X_1. For instance, we found in (2.16) that

P(X_1 = 1 | X_2 = 2) = 0.43   (2.17)

yet from (2.14) we have

P(X_1 = 1) = 0.48   (2.18)
2.6 Bayes' Rule

(This section should not be confused with Section 12.5. The latter is highly controversial, while the material in this section is not controversial at all.)
Following (2.16) above, we noted that the ingredients had already been computed, in (2.13) and (2.12). If we go back to the derivations in those two equations and substitute into (2.16), we have

P(X_1 = 1 | X_2 = 2)
   = P(X_1 = 1 and X_2 = 2) / P(X_2 = 2)   (2.19)
   = P(X_1 = 1 and X_2 = 2) / [P(X_1 = 1 and X_2 = 2) + P(X_1 = 2 and X_2 = 2)]   (2.20)
   = P(X_1 = 1) P(X_2 = 2 | X_1 = 1) / [P(X_1 = 1) P(X_2 = 2 | X_1 = 1) + P(X_1 = 2) P(X_2 = 2 | X_1 = 2)]   (2.21)
Looking at this in more generality, for events A and B we would find that

P(A | B) = P(A) P(B|A) / [P(A) P(B|A) + P(not A) P(B|not A)]   (2.22)

This is known as Bayes' Theorem or Bayes' Rule. It can be extended easily to cases with several terms in the denominator, arising from situations that need to be broken down into several subevents rather than just A and not-A.
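As an illustration (an added sketch, not part of the original text), (2.22) is a one-line function in R; the function and argument names are hypothetical:

bayes <- function(pa, pbgiva, pbgivnota) {
   pa*pbgiva / (pa*pbgiva + (1-pa)*pbgivnota)
}
bayes(0.48, 0.41, 0.52)  # about 0.42, matching P(X1 = 1 | X2 = 2) up to rounding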
2.7 ALOHA in the Notebook Context
Think of doing the ALOHA experiment many, many times.
notebook line   X_1 = 2   X_2 = 2   X_1 = 2 and X_2 = 2   X_2 = 2 | X_1 = 2
1               Yes       No        No                    No
2               No        No        No                    NA
3               Yes       Yes       Yes                   Yes
4               Yes       No        No                    No
5               Yes       Yes       Yes                   Yes
6               No        No        No                    NA
7               No        Yes       No                    NA

Table 2.3: Top of Notebook for Two-Epoch ALOHA Experiment
* Run the network for two epochs, starting with both nodes active, the first time, and write the outcome on the first line of the notebook.
* Run the network for two epochs, starting with both nodes active, the second time, and write the outcome on the second line of the notebook.
* Run the network for two epochs, starting with both nodes active, the third time, and write the outcome on the third line of the notebook.
* Run the network for two epochs, starting with both nodes active, the fourth time, and write the outcome on the fourth line of the notebook.
* Imagine you keep doing this, thousands of times, filling thousands of lines in the notebook.
The first seven lines of the notebook might look like Table 2.3. We see that:

* Among those first seven lines in the notebook, 4/7 of them have X_1 = 2. After many, many lines, this fraction will be approximately 0.52.

* Among those first seven lines in the notebook, 3/7 of them have X_2 = 2. After many, many lines, this fraction will be approximately 0.47. (Don't make anything of the fact that these probabilities nearly add up to 1.)

* Among those first seven lines in the notebook, 3/7 of them have X_1 = 2 and X_2 = 2. After many, many lines, this fraction will be approximately 0.27.

* Among the first seven lines in the notebook, four of them do not say NA in the X_2 = 2 | X_1 = 2 column. Among these four lines, two say Yes, a fraction of 2/4. After many, many lines, this fraction will be approximately 0.52.
2.8 Solution Strategies

The example in Section 2.5 shows typical strategies in exploring solutions to probability problems, such as:

* Name what seem to be the important variables and events, in this case X_1, X_2, C_1, C_2 and so on.

* Write the given probability in terms of those named variables, e.g.

  P(X_1 = 2) = P([C_1 and C_2] or [not C_1 and not C_2])   (2.23)

  above.
* Ask the famous question, "How can it happen?" Break big events down into small events; in the above case the event X_1 = 2 can happen if "C_1 and C_2" or "not C_1 and not C_2."
* Do not write/think nonsense. For example, the expression "P(A) or P(B)" is nonsense. Do you see why? Probabilities are numbers, not boolean expressions, so "P(A) or P(B)" is like saying "0.2 or 0.5," which is meaningless.

  Similarly, say we have a random variable X. The "probability" P(X) is invalid. P(X = 3) is valid, but P(X) is meaningless.
  Please note that = is not like a comma, or equivalent to the English word "therefore." It needs a left side and a right side; "a = b" makes sense, but "= b" doesn't.
  Similarly, don't use formulas that you didn't learn and that are in fact false. For example, in an expression involving a random variable X, one can NOT replace X by its mean. (How would you like it if your professor were to lose your exam, and then tell you, "Well, I'll just assign you a score that is equal to the class mean"?)
* In the beginning of your learning of probability methods, meticulously write down all your steps, with reasons, as in the computation of P(X_1 = 2) in Equations (2.8)-(2.11). After you gain more experience, you can start skipping steps, but not in the initial learning period.
* Solving probability problems, and even more so building useful probability models, is like computer programming: It's a creative process.

  One can NOT, repeat, NOT, teach someone how to write programs. All one can do is show the person how the basic building blocks work, such as loops, if-else and arrays, then show a number of examples. But the actual writing of a program is a creative act, not formula-based. The programmer must creatively combine the various building blocks to produce the desired result. The teacher cannot teach the student how to do this.
  The same is true for solving probability problems. The basic building blocks were presented above in Section 2.5, and many more "mailing tubes" will be presented in the rest of this book. But it is up to the student to try using the various building blocks in a way that solves the problem. Sometimes use of one block may prove to be unfruitful, in which case one must try other blocks.

  For instance, in using probability formulas like P(A and B) = P(A) P(B|A), there is no magic rule as to how to choose A and B.
  Moreover, if you need P(B|A), there is no magic rule on how to find it. On the one hand, you might calculate it from (2.6), as we did in (2.16), but on the other hand you may be able to reason out the value of P(B|A), as we did following (2.14). Just try some cases until you find one that works, in the sense that you can evaluate both factors. It's the same as trying various programming ideas until you find one that works.
2.9 Example: Divisibility of Random Integers

Suppose at step i we generate a random integer between 1 and 1000, and check whether it's evenly divisible by i, for i = 5,4,3,2,1 in turn. Let N denote the number of steps needed to reach an evenly divisible number.

Let's find P(N = 2). Let q(i) denote the fraction of numbers in 1,...,1000 that are evenly divisible by i, so that for instance q(5) = 200/1000 = 1/5 while q(3) = 333/1000. Then since the random numbers are independent from step to step, we have

P(N = 2) = P(fail in step 5 and succeed in step 4)   ("How can it happen?")   (2.24)
         = P(fail in step 5) P(succeed in step 4 | fail in step 5)   (from (2.5))   (2.25)
         = [1 - q(5)] q(4)   (2.26)
         = (4/5)(1/4)   (2.27)
         = 1/5   (2.28)
But there's more.

First, note that q(i) is either equal or approximately equal to 1/i. Then following the derivation in (2.24), you'll find that

P(N = j) ≈ 1/5   (2.29)
for ALL j in 1,...,5.

That may seem counterintuitive. Yet the example here is in essence the same as one found as an exercise in many textbooks on probability:

  A man has five keys. He knows one of them opens a given lock, but he doesn't know which. So he tries the keys one at a time until he finds the right one. Find P(N = j), j = 1,...,5, where N is the number of keys he tries until he succeeds.

Here too the answer is 1/5 for all j. But this one makes intuitive sense: Each of the keys has chance 1/5 of being the right key, so each of the values 1,...,5 is equally likely for N.

This is then an example of the fact that sometimes we can gain insight into one problem by considering a mathematically equivalent problem in a quite different setting.
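A short simulation (an added sketch, not part of the original text; the helper name divsim is hypothetical) confirms (2.29):

divsim <- function() {
   for (i in 5:1)  # the steps use divisors 5,4,3,2,1 in turn
      if (sample(1:1000,1) %% i == 0) return(6-i)  # success at divisor i is step 6-i
}
N <- replicate(100000, divsim())
table(N)/100000  # each of N = 1,...,5 should have proportion near 0.2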
2.10 Example: A Simple Board Game

Consider a board game, which for simplicity we'll assume consists of two squares per side, on four sides. A player's token advances around the board. The squares are numbered 0-7, and play begins at square 0.

A token advances according to the roll of a single die. If a player lands on square 3, he/she gets a bonus turn. Let's find the probability that a player has yet to make a complete circuit of the board, i.e. has not reached or passed 0, after the first turn (including the bonus, if any). Let R denote his first roll, and let B be his bonus if there is one, with B being set to 0 if there is no bonus. Then (using commas as a shorthand notation for "and")

P(doesn't reach or pass 0) = P(R + B ≤ 7)   (2.30)
   = P(R ≤ 6, R ≠ 3, or R = 3, B ≤ 4)   (2.31)
   = P(R ≤ 6, R ≠ 3) + P(R = 3, B ≤ 4)   (2.32)
   = P(R ≤ 6, R ≠ 3) + P(R = 3) P(B ≤ 4)   (2.33)
   = 5/6 + (1/6)(4/6)   (2.34)
   = 17/18   (2.35)
Now, here's a shorter way (there are always multiple ways to do a problem):
P(don't reach or pass 0) = 1 - P(do reach or pass 0)   (2.36)
   = 1 - P(R + B > 7)   (2.37)
   = 1 - P(R = 3, B > 4)   (2.38)
   = 1 - (1/6)(2/6)   (2.39)
   = 17/18   (2.40)
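Either derivation can be checked by a quick simulation (an added sketch, not part of the original text; the helper name is hypothetical):

circsim <- function() {  # simulate one first turn, including any bonus roll
   r <- sample(1:6,1)
   b <- if (r == 3) sample(1:6,1) else 0
   r + b <= 7  # TRUE if the token does not reach or pass square 0
}
mean(replicate(100000, circsim()))  # should be near 17/18 = 0.944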
Now suppose that, according to a telephone report of the game, you hear that on A's first turn, his token ended up at square 4. Let's find the probability that he got there with the aid of a bonus roll.

Note that this is a conditional probability: we're finding the probability that A got a bonus roll, given that we know he ended up at square 4. The word "given" wasn't there, but it was implied.

A little thought reveals that we cannot end up at square 4 after making a complete circuit of the board, which simplifies the situation quite a bit. So, write
P(B > 0 | R + B = 4)
   = P(R + B = 4, B > 0) / P(R + B = 4)   (2.41)
   = P(R + B = 4, B > 0) / P(R + B = 4, B > 0, or R + B = 4, B = 0)   (2.42)
   = P(R + B = 4, B > 0) / [P(R + B = 4, B > 0) + P(R + B = 4, B = 0)]   (2.43)
   = P(R = 3, B = 1) / [P(R = 3, B = 1) + P(R = 4)]   (2.44)
   = (1/6)(1/6) / [(1/6)(1/6) + 1/6]   (2.45)
   = 1/7   (2.46)
We could have used Bayes' Rule to shorten the derivation a little here, but will prefer to derive everything, at least in this introductory chapter.

Pay special attention to that third equality above, as it is a frequent mode of attack in probability problems. In considering the probability P(R + B = 4, B > 0), we ask: what is a simpler, but still equivalent, description of this event? Well, we see that "R + B = 4, B > 0" boils down to "R = 3, B = 1," so we replace the above probability with P(R = 3, B = 1).
Again, this is a very common approach. But be sure to take care that we are in an "if and only if" situation. Yes, "R + B = 4, B > 0" implies "R = 3, B = 1," but we must make sure that the converse is true as well. In other words, we must also confirm that "R = 3, B = 1" implies "R + B = 4, B > 0." That's trivial in this case, but one can make a subtle error in some problems if one is not careful; otherwise we will have replaced a higher-probability event by a lower-probability one.
2.11 Example: Bus Ridership

Consider the following analysis of bus ridership. (In order to keep things easy, it will be quite oversimplified, but the principles will be clear.) Here is the model:

* At each stop, each passenger alights from the bus, independently, with probability 0.2 each.

* Either 0, 1 or 2 new passengers get on the bus, with probabilities 0.5, 0.4 and 0.1, respectively.

* Assume the bus is so large that it never becomes full, so the new passengers can always get on.

* Suppose the bus is empty when it arrives at its first stop.
Let L_i denote the number of passengers on the bus as it leaves its ith stop, i = 1,2,3,... Let's find some probabilities, say P(L_2 = 0).

For convenience, let B_i denote the number of new passengers who board the bus at the ith stop.
Then

P(L_2 = 0) = P(B_1 = 0 and L_2 = 0, or B_1 = 1 and L_2 = 0, or B_1 = 2 and L_2 = 0)   (2.47)
   = Σ_{i=0}^{2} P(B_1 = i and L_2 = 0)   (2.48)
   = Σ_{i=0}^{2} P(B_1 = i) P(L_2 = 0 | B_1 = i)   (2.49)
   = 0.5^2 + (0.4)(0.2)(0.5) + (0.1)(0.2^2)(0.5)   (2.50)
   = 0.292   (2.51)
For instance, where did that first term, 0.5^2, come from? Well, P(B_1 = 0) = 0.5, and what about P(L_2 = 0 | B_1 = 0)? If B_1 = 0, then the bus approaches the second stop empty. For it to then leave that second stop empty, it must be the case that B_2 = 0, which has probability 0.5.
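The 0.292 figure is also easy to check by simulation (an added sketch, not part of the original text; the helper name is hypothetical):

twostops <- function() {  # returns L2 under the model of this section
   b1 <- sample(0:2, 1, prob=c(0.5,0.4,0.1))  # boarders at stop 1
   stay <- rbinom(1, b1, 0.8)                 # those who do not alight at stop 2
   b2 <- sample(0:2, 1, prob=c(0.5,0.4,0.1))  # boarders at stop 2
   stay + b2
}
mean(replicate(100000, twostops()) == 0)  # should be near 0.292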
2.12 Simulation
Note to readers: The R simulation examples in this book provide a valuable supplement to your
developing insight into this material.
To learn about the syntax (e.g. <- as the assignment operator), see Appendix B.
To simulate whether a simple event occurs or not, we typically use the R function runif(). This function generates random numbers from the interval (0,1), with all the points inside being equally likely. So for instance the probability that the function returns a value in (0,0.5) is 0.5. Thus here
is code to simulate tossing a coin:
if (runif(1) < 0.5) heads <- TRUE else heads <- FALSE
The argument 1 means we wish to generate just one random number from the interval (0,1).
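By the way (a small added note, not part of the original text), since the comparison itself already produces a logical value, the same thing can be written more compactly:

heads <- runif(1) < 0.5  # TRUE with probability 0.5, FALSE otherwise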
2.12.1 Example: Rolling Dice
If we roll three dice, what is the probability that their total is 8? We could count all the possibilities, or we could get an approximate answer via simulation:

# roll d dice; find P(total = k)

# simulate roll of one die; the possible return values are 1,2,3,4,5,6,
# all equally likely
roll <- function() return(sample(1:6,1))

probtotk <- function(d,k,nreps) {
   count <- 0
   # do the experiment nreps times
   for (rep in 1:nreps) {
      sum <- 0
      # roll d dice and find their sum
      for (j in 1:d) sum <- sum + roll()
      if (sum == k) count <- count + 1
   }
   return(count/nreps)
}

The call to the built-in R function sample() here says to take a sample of size 1 from the sequence of numbers 1,2,3,4,5,6. That's just what we want in order to simulate the rolling of a die. The code

for (j in 1:d) sum <- sum + roll()

then simulates the tossing of a die d times, and computes the sum.
2.12.2 Improving the Code

Since applications of R often use large amounts of computer time, good R programmers are always looking for ways to speed things up. Here is an alternate version of the above program:

# roll d dice; find P(total = k)

probtotk <- function(d,k,nreps) {
   count <- 0
   # do the experiment nreps times
   for (rep in 1:nreps) {
      total <- sum(sample(1:6,d,replace=TRUE))
      if (total == k) count <- count + 1
   }
   return(count/nreps)
}

Here the code

sample(1:6,d,replace=TRUE)

simulates tossing the die d times (the argument replace says this is sampling with replacement, so for instance we could get two 6s). That returns a d-element array, and we then call R's built-in function sum() to find the total of the d dice.
Note the call to R's sum() function, a nice convenience.

The second version of the code here is more compact and easier to read. It also eliminates one explicit loop, which is the key to writing fast code in R.
Actually, further improvements are possible. Consider this code:

# roll d dice; find P(total = k)

# simulate roll of nd dice; the possible return values are 1,2,3,4,5,6,
# all equally likely
roll <- function(nd) return(sample(1:6,nd,replace=TRUE))

probtotk <- function(d,k,nreps) {
   sums <- vector(length=nreps)
   # do the experiment nreps times
   for (rep in 1:nreps) sums[rep] <- sum(roll(d))
   return(mean(sums==k))
}
There is quite a bit going on here.

We are storing the various "notebook lines" in a vector sums. We first call vector() to allocate space for it.

But the heart of the above code is the expression sums==k, which involves the very essence of the R idiom, vectorization. At first, the expression looks odd, in that we are comparing a vector (remember, this is what languages like C call an array), sums, to a scalar, k. But in R, every scalar is actually considered a one-element vector.

Fine, k is a vector, but wait! It has a different length than sums, so how can we compare the two vectors? Well, in R a vector is recycled, i.e. extended in length by repeating its values, in order to conform to longer vectors it will be involved with. For instance:

> c(2,5) + 4:6
[1] 6 10 8

Here we added the vector (2,5) to (4,5,6). The former was first recycled to (2,5,2), resulting in a sum of (6,10,8).[6]

So, in evaluating the expression sums==k, R will recycle k to a vector consisting of nreps copies of k, thus conforming to the length of sums. The result of the comparison will then be a vector of length nreps, consisting of TRUE and FALSE values. In numerical contexts, these are treated as 1s and 0s, respectively. R's mean() function will then average those values, resulting in the fraction of 1s! That's exactly what we want.
Even better:

roll <- function(nd) return(sample(1:6,nd,replace=TRUE))

probtotk <- function(d,k,nreps) {
   # do the experiment nreps times
   sums <- replicate(nreps,sum(roll(d)))
   return(mean(sums==k))
}

R's replicate() function does what its name implies, in this case executing the call sum(roll(d)) nreps times. That produces a vector, which we then assign to sums. And note that we don't have to allocate space for sums; replicate() produces a vector, allocating space, and then we merely point sums to that vector.
The various improvements shown above compactify the code, and in many cases, make it much faster.[7] Note, though, that this comes at the expense of using more memory.

[6] There was also a warning message, not shown here. The circumstances under which warnings are or are not generated are beyond our scope here, but recycling is a very common R operation.
[7] You can measure times using R's system.time() function, e.g. via the call system.time(probtotk(3,7,10000)).
2.12.3 Simulation of the ALOHA Example

Following is a computation via simulation of the approximate values of P(X_1 = 2), P(X_2 = 2) and P(X_2 = 2 | X_1 = 1), using the R statistical language, the language of choice of professional statisticians. It is open source, it's statistically correct (not all statistical packages are so), has dazzling graphics capabilities, etc.
1 # finds P(X1 = 2), P(X2 = 2) and P(X2 = 2|X1 = 1) in ALOHA example
2 sim <- function(p,q,nreps) {
3 countx2eq2 <- 0
4 countx1eq1 <- 0
5 countx1eq2 <- 0
6 countx2eq2givx1eq1 <- 0
7 # simulate nreps repetitions of the experiment
8 for (i in 1:nreps) {
9 numsend <- 0 # no messages sent so far
10 # simulate A's and B's decisions on whether to send in epoch 1
11 for (i in 1:2)
12 if (runif(1) < p) numsend <- numsend + 1
13 if (numsend == 1) X1 <- 1
14 else X1 <- 2
15 if (X1 == 2) countx1eq2 <- countx1eq2 + 1
16 # now simulate epoch 2
17 # if X1 = 1 then one node may generate a new message
18 numactive <- X1
19 if (X1 == 1 && runif(1) < q) numactive <- numactive + 1
20 # send?
21 if (numactive == 1)
22 if (runif(1) < p) X2 <- 0
23 else X2 <- 1
24 else { # numactive = 2
25 numsend <- 0
26 for (i in 1:2)
27 if (runif(1) < p) numsend <- numsend + 1
28 if (numsend == 1) X2 <- 1
29 else X2 <- 2
30 }
31 if (X2 == 2) countx2eq2 <- countx2eq2 + 1
32 if (X1 == 1) { # do tally for the cond. prob.
33 countx1eq1 <- countx1eq1 + 1
34 if (X2 == 2) countx2eq2givx1eq1 <- countx2eq2givx1eq1 + 1
35 }
36 }
37 # print results
38 cat("P(X1 = 2):",countx1eq2/nreps,"\n")
39 cat("P(X2 = 2):",countx2eq2/nreps,"\n")
40 cat("P(X2 = 2 | X1 = 1):",countx2eq2givx1eq1/countx1eq1,"\n")
41 }
Note that each of the nreps iterations of the main for loop is analogous to one line in our hypothetical notebook. So, to find (the approximate value of) P(X_1 = 2), we divide the count of the number of times X_1 = 2 occurred by the number of iterations.
Note especially that the way we calculated P(X_2 = 2 | X_1 = 1) was to count the number of times X_2 = 2, among those times that X_1 = 1, just like in the notebook case.

Also: Keep in mind that we did NOT use (2.22) or any other formula in our simulation. We stuck to basics, the "notebook" definition of probability. This is really important if you are using simulation to confirm something you derived mathematically. On the other hand, if you are using simulation because you CAN'T derive something mathematically (the usual situation), using some of the "mailing tubes" might speed up the computation.
2.12.4 Example: Bus Ridership, cont'd.

Consider the example in Section 2.11. Let's find the probability that after visiting the tenth stop, the bus is empty. This is too complicated to solve analytically, but can easily be simulated:
nreps <- 10000
nstops <- 10
count <- 0
for (i in 1:nreps) {
   passengers <- 0
   for (j in 1:nstops) {
      if (passengers > 0)
         for (k in 1:passengers)
            if (runif(1) < 0.2)
               passengers <- passengers - 1
      newpass <- sample(0:2,1,prob=c(0.5,0.4,0.1))
      passengers <- passengers + newpass
   }
   if (passengers == 0) count <- count + 1
}
print(count/nreps)
Note the different usage of the sample() function in the call

sample(0:2,1,prob=c(0.5,0.4,0.1))

Here we take a sample of size 1 from the set {0,1,2}, but with probabilities 0.5, 0.4 and 0.1 for the three values. Since the third argument for sample() is replace, not prob, we need to specify the latter by name in our call.
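As a quick check that sample() behaves as claimed (an added note, not part of the original text), we can draw many values at once and tabulate:

table(sample(0:2, 100000, replace=TRUE, prob=c(0.5,0.4,0.1))) / 100000
# the proportions of 0s, 1s and 2s should be near 0.5, 0.4 and 0.1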
2.12.5 Back to the Board Game Example

Recall the board game in Section 2.10. Below is simulation code to find the probability in (2.41):

boardsim <- function(nreps) {
   count4 <- 0
   countbonusgiven4 <- 0
   for (i in 1:nreps) {
      position <- sample(1:6,1)
      if (position == 3) {
         bonus <- TRUE
         position <- (position + sample(1:6,1)) %% 8
      } else bonus <- FALSE
      if (position == 4) {
         count4 <- count4 + 1
         if (bonus) countbonusgiven4 <- countbonusgiven4 + 1
      }
   }
   return(countbonusgiven4/count4)
}
2.12.6 How Long Should We Run the Simulation?

Clearly, the larger the value of nreps in our examples above, the more accurate our simulation results are likely to be. But how large should this value be? Or, more to the point, what measure is there for the degree of accuracy one can expect (whatever that means) for a given value of nreps? These questions will be addressed in Chapter 10.
2.13 Combinatorics-Based Probability Computation

"And though the holes were rather small, they had to count them all" (from the Beatles song "A Day in the Life")

In some probability problems all the outcomes are equally likely. The probability computation is then simply a matter of counting all the outcomes of interest and dividing by the total number of possible outcomes. Of course, sometimes even such counting can be challenging, but it is simple in principle. We'll discuss several examples here.
2.13.1 Which Is More Likely in Five Cards, One King or Two Hearts?

Suppose we deal a 5-card hand from a regular 52-card deck. Which is larger, P(1 king) or P(2 hearts)? Before continuing, take a moment to guess which one is more likely.

Now, here is how we can compute the probabilities. The key point is that all possible hands are equally likely, which implies that all we need do is count them. There are C(52,5) possible hands, so this is our denominator. For P(1 king), our numerator will be the number of hands consisting of one king and four non-kings. Since there are four kings in the deck, the number of ways to choose one king is C(4,1) = 4. There are 48 non-kings in the deck, so there are C(48,4) ways to choose them.
Every choice of one king can be combined with every choice of four non-kings, so the number of hands consisting of one king and four non-kings is 4 × C(48,4). Thus

P(1 king) = 4 × C(48,4) / C(52,5) = 0.299   (2.52)
The same reasoning gives us

P(2 hearts) = C(13,2) × C(39,3) / C(52,5) = 0.274   (2.53)
So, the 1-king hand is just slightly more likely.
Note that an unstated assumption here was that all 5-card hands are equally likely. That is a realistic assumption, but it's important to understand that it plays a key role here.
By the way, I used the R function choose() to evaluate these quantities, running R in interactive
mode, e.g.:
> choose(13,2) * choose(39,3) / choose(52,5)
[1] 0.2742797
R also has a very nice function combn() which will generate all the C(n,k) combinations of k things chosen from n, and also will at your option call a user-specified function on each combination. This allows you to save a lot of computational work. See the examples in R's online documentation.
Here's how we could do the 1-king problem via simulation:

# use simulation to find P(1 king) when dealing a 5-card hand from a
# standard deck

# think of the 52 cards as being labeled 1-52, with the 4 kings having
# numbers 1-4

sim <- function(nreps) {
   count1king <- 0  # count of number of hands with 1 king
   for (rep in 1:nreps) {
      hand <- sample(1:52,5,replace=FALSE)  # deal hand
      kings <- intersect(1:4,hand)  # find which kings, if any, are in hand
      if (length(kings) == 1) count1king <- count1king + 1
   }
   print(count1king/nreps)
}

Here the intersect() function performs set intersection, in this case between the set {1,2,3,4} and the one in the variable hand. Applying the length() function then gets us the number of kings.
2.13.2 Example: Lottery Tickets

Twenty tickets are sold in a lottery, numbered 1 to 20, inclusive. Five tickets are drawn for prizes. Let's find the probability that two of the five winning tickets are even-numbered.

Since there are 10 even-numbered tickets, there are C(10,2) sets of two such tickets. Continuing along these lines, we find the desired probability to be

C(10,2) × C(10,3) / C(20,5)   (2.54)
Now let's find the probability that two of the five winning tickets are in the range 1 to 5, two are in 6 to 10, and one is in 11 to 20.

Picture yourself picking your tickets. Again there are C(20,5) ways to choose the five tickets. How many of those ways satisfy the stated condition?

Well, first, there are C(5,2) ways to choose two tickets from the range 1 to 5 (note that this range contains just five tickets). Once you've done that, there are C(5,2) ways to choose two tickets from the range 6 to 10, and so on. The desired probability is then

C(5,2) × C(5,2) × C(10,1) / C(20,5)   (2.55)
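Again, R's choose() function makes these quantities easy to evaluate (an added check, not part of the original text):

choose(10,2) * choose(10,3) / choose(20,5)               # (2.54), about 0.348
choose(5,2) * choose(5,2) * choose(10,1) / choose(20,5)  # (2.55), about 0.064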
2.13.3 Association Rules in Data Mining

The field of data mining is a branch of computer science, but it is largely an application of various statistical methods to really huge databases.

One of the applications of data mining is called the market basket problem. Here the data consists of records of sales transactions, say of books at Amazon.com. The business goal is exemplified by Amazon's suggestion to customers that "patrons who bought this book also tended to buy the following books."[8] The goal of the market basket problem is to sift through sales transaction records to produce association rules, patterns in which sales of some combinations of books imply likely sales of other related books.

The notation for association rules is A, B ⇒ C, D, E, meaning in the book sales example that customers who bought books A and B also tended to buy books C, D and E. Here A and B are called the antecedents of the rule, and C, D and E are called the consequents. Let's suppose here that we are only interested in rules with a single consequent.

[8] Some customers appreciate such tips, while others view it as insulting or an invasion of privacy, but we'll not address such issues here.
We will present some methods for finding good rules in another chapter, but for now, let's look at how many possible rules there are. Obviously, it would be impractical to use rules with a large number of antecedents.[9] Suppose the business has a total of 20 products available for sale. What percentage of potential rules have three or fewer antecedents?[10]

For each k = 1,...,19, there are C(20,k) possible sets of k antecedents, and for each such set there are C(20-k,1) possible consequents. The fraction of potential rules using three or fewer antecedents is then

[Σ_{k=1}^{3} C(20,k) C(20-k,1)] / [Σ_{k=1}^{19} C(20,k) C(20-k,1)] = 23180/10485740 = 0.0022   (2.56)

[9] In addition, there are serious statistical problems that would arise, to be discussed in another chapter.
[10] Be sure to note that this is also a probability, namely the probability that a randomly chosen rule will have three or fewer antecedents.
So, this is just scratching the surface. And note that with only 20 products, there are already over ten million possible rules. With 50 products, this number is 2.81 × 10^16! Imagine what happens in a case like Amazon, with millions of products. These staggering numbers show what a tremendous challenge data miners face.
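The figures in (2.56) can be verified directly in R (an added check, not part of the original text):

k <- 1:19
nrules <- choose(20,k) * (20-k)  # number of rules with k antecedents and one consequent
sum(nrules[1:3]) / sum(nrules)   # 23180/10485740, about 0.0022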
2.13.4 Multinomial Coefficients

Question: We have a group consisting of 6 Democrats, 5 Republicans and 2 Independents, who will participate in a panel discussion. They will be sitting at a long table. How many seating arrangements are possible, with regard to political affiliation? (So we do not care, for instance, about permuting the individual Democrats within the seats assigned to Democrats.)

Well, there are C(13,6) ways to choose the Democratic seats. Once those are chosen, there are C(7,5) ways to choose the Republican seats. The Independent seats are then already determined, i.e. there will be only one way at that point, but let's write it as C(2,2). Thus the total number of seating arrangements is

13!/(6!7!) × 7!/(5!2!) × 2!/(2!0!)   (2.57)

That reduces to

13!/(6!5!2!)   (2.58)
The same reasoning yields the following:

Multinomial Coefficients: Suppose we have c objects and r bins. Then the number of ways to choose c_1 of them to put in bin 1, c_2 of them to put in bin 2, ..., and c_r of them to put in bin r is

c!/(c_1! ... c_r!),   where c_1 + ... + c_r = c   (2.59)

Of course, the "bins" may just be metaphorical. In the political party example above, the "bins" were political parties, and the "objects" were seats.
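In R (an added note, not part of the original text), we can evaluate (2.58) either from factorials or as the product of binomial coefficients in (2.57):

factorial(13) / (factorial(6) * factorial(5) * factorial(2))  # 36036
choose(13,6) * choose(7,5) * choose(2,2)                      # same answer, 36036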
2.13.5 Example: Probability of Getting Four Aces in a Bridge Hand

A standard deck of 52 cards is dealt to four players, 13 cards each. One of the players is Millie. What is the probability that Millie is dealt all four aces?

Well, there are

52!/(13! 13! 13! 13!)   (2.60)

possible deals. (The "objects" are the 52 cards, and the "bins" are the 4 players.) The number of deals in which Millie holds all four aces is the same as the number of deals of 48 cards, 9 of which go to Millie and 13 each to the other three players, i.e.

48!/(13! 13! 13! 9!)   (2.61)

Thus the desired probability is

[48!/(13! 13! 13! 9!)] / [52!/(13! 13! 13! 13!)] = 0.00264   (2.62)
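As a check (an added note, not part of the original text), the ratio in (2.62) simplifies: the 13!13!13! factors for the other three players cancel, leaving choose(48,9)/choose(52,13), which is easy to evaluate in R:

choose(48,9) / choose(52,13)  # about 0.00264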
Exercises

1. This problem concerns the ALOHA network model of Section 2.1. Feel free to use (but cite) computations already in the example.

(a) Find P(X_1 = 2 and X_2 = 1), for the same values of p and q as in the examples.

(b) Find P(X_2 = 0).

(c) Find P(X_1 = 1 | X_2 = 1).
2. Urn I contains three blue marbles and three yellow ones, while Urn II contains five and seven of these colors, respectively. We draw a marble at random from Urn I and place it in Urn II. We then draw a marble at random from Urn II.

(a) Find P(second marble drawn is blue).

(b) Find P(first marble drawn is blue | second marble drawn is blue).
3. Consider the example of association rules in Section 2.13.3. How many two-antecedent, two-
consequent rules are possible from 20 items? Express your answer in terms of combinatorial (n
choose k) symbols.
4. Suppose 20% of all C++ programs have at least one major bug. Out of five programs, what is the probability that exactly two of them have a major bug?
5. Assume the ALOHA network model as in Section 2.1, i.e. n = 2 and X_0 = 2, but with general values for p and q. Find the probability that a new message is created during epoch 2.
6. Say we choose six cards from a standard deck, one at a time WITHOUT replacement. Let N be the number of kings we get. Does N have a binomial distribution? Choose one: (i) Yes. (ii) No, since trials are not independent. (iii) No, since the probability of success is not constant from trial to trial. (iv) No, since the number of trials is not fixed. (v) (ii) and (iii). (vi) (ii) and (iv). (vii) (iii) and (iv).
7. You bought three tickets in a lottery, for which 60 tickets were sold in all. There will be five prizes given. Find the probability that you win at least one prize, and the probability that you win exactly one prize.
8. Two five-person committees are to be formed from your group of 20 people. In order to foster communication, we set a requirement that the two committees have the same chair but no other overlap. Find the probability that you and your friend are both chosen for some committee.
9. Consider a device that lasts either one, two or three months, with probabilities 0.1, 0.7 and 0.2,
respectively. We carry one spare. Find the probability that we have some device still working just
before four months have elapsed.
10. A building has six floors, and is served by two freight elevators, named Mike and Ike. The destination floor of any order of freight is equally likely to be any of floors 2 through 6. Once an elevator reaches any of these floors, it stays there until summoned. When an order arrives at the building, whichever elevator is currently closer to floor 1 will be summoned, with elevator Ike being the one summoned in the case in which they are both on the same floor.

Find the probability that after the summons, elevator Mike is on floor 3. Assume that only one order of freight can fit in an elevator at a time. Also, suppose the average time between arrivals of freight to the building is much larger than the time for an elevator to travel between the bottom and top floors; this assumption allows us to neglect travel time.
11. Without resorting to using the fact that C(n,k) = n!/[k!(n-k)!], find c and d such that

C(n,k) = C(n-1,k) + C(c,d)   (2.63)
12. Consider the ALOHA example from the text, for general p and q, and suppose that X_0 = 0, i.e. there are no active nodes at the beginning of our observation period. Find P(X_1 = 0).
13. Consider a three-sided die, as opposed to the standard six-sided type. The die is cylinder-shaped, and gives equal probabilities to one, two and three dots. The game is to keep rolling the die until we get a total of at least 3. Let N denote the number of times we roll the die. For example, if we get a 3 on the first roll, N = 1. If we get a 2 on the first roll, then N will be 2 no matter what we get the second time. The largest N can be is 3. The rule is that one wins if one's final total is exactly 3.

(a) Find the probability of winning.

(b) Find P(our first roll was a 1 | we won).

(c) How could we construct such a die?
14. Consider the ALOHA simulation example in Section 2.12.3.

(a) Suppose we wish to find P(X_2 = 1 | X_1 = 1) instead of P(X_2 = 2 | X_1 = 1). What line(s) would we change, and how would we change them?

(b) In which line(s) are we in essence checking for a collision?
15. Jack and Jill keep rolling a four-sided and a three-sided die, respectively. The first player to get the face having just one dot wins, except that if they both get a 1, it's a tie, and play continues. Let N denote the number of turns needed. Find the following:

(a) P(N = 1), P(N = 2).

(b) P(the first turn resulted in a tie | N = 2).
16. In the ALOHA network example in Section 2.1, suppose X_0 = 1, i.e. we start out with just one active node. Find P(X_2 = 0), as an expression in p and q.
17. Suppose a box contains two pennies, three nickels and five dimes. During transport, two coins fall out, unseen by the bearer. Assume each type of coin is equally likely to fall out. Find: P(at least $0.10 worth of money is lost); P(both lost coins are of the same denomination).
18. Suppose we have the track record of a certain weather forecaster. Of the days for which he predicts rain, a fraction c actually do have rain. Among days for which he predicts no rain, he is correct a fraction d of the time. Among all days, he predicts rain g of the time, and predicts no rain 1-g of the time. Find P(he predicted rain | it does rain), P(he predicts wrong) and P(it does rain | he was wrong). Write R simulation code to verify. (Partial answer: For the case c = 0.8, d = 0.6 and g = 0.2, P(he predicted rain | it does rain) = 1/3.)
19. The Game of Pit is really fun because there are no turns. People shout out bids at random, chaotically. Here is a slightly simplified version of the game:

There are four suits, Wheat, Barley, Corn and Rye, with nine cards each, 36 cards in all. There are four players. At the opening, the cards are all dealt out, nine to each player. The players hide their cards from each other's sight.

Players then start trading. In computer science terms, trading is asynchronous, no turns; a player can bid at any time. The only rule is that a trade must be homogeneous in suit, e.g. all Rye. (The player trading Rye need not trade all the Rye he has, though.) The player bids by shouting out the number she wants to trade, say "2!" If another player wants to trade two cards (again, homogeneous in suit), she yells out, "OK, 2!" and they trade. When one player acquires all nine of a suit, he shouts "Corner!"

Consider the situation at the time the cards have just been dealt. Imagine that you are one of the players, and Jane is another. Find the following probabilities:

(a) P(you have no Wheats).

(b) P(you have seven Wheats).

(c) P(Jane has two Wheats | you have seven Wheats).

(d) P(you have a corner) (note: someone else might too; whoever shouts it out first wins).
20. In the board game example in Section 2.10, suppose that the telephone report is that A ended up at square 1 after his first turn. Find the probability that he got a bonus.
21. Consider the bus ridership example in Section 2.11 of the text. Suppose the bus is initially empty, and let X_n denote the number of passengers on the bus just after it has left the nth stop, n = 1,2,... Find the following:

(a) P(X_2 = 1)

(b) P(at least one person boarded the bus at the first stop | X_2 = 1)
22. Suppose committees of sizes 3, 4 and 5 are to be chosen at random from 20 people, among whom are persons A and B. Find the probability that A and B are on the same committee.

23. Consider the ALOHA simulation in Section 2.12.3.

(a) On what line do we simulate the possible creation of a new message?

(b) Change line 10 so that it uses sample() instead of runif().
Chapter 3

Discrete Random Variables

This chapter will introduce entities called discrete random variables. Some properties will be derived for means of such variables, with most of these properties actually holding for random variables in general. Well, all of that seems abstract to you at this point, so let's get started.
3.1 Random Variables

Definition 3 A random variable is a numerical outcome of our experiment.

For instance, consider our old example in which we roll two dice, with X and Y denoting the number of dots we get on the blue and yellow dice, respectively. Then X and Y are random variables, as they are numerical outcomes of the experiment. Moreover, X+Y, 2XY, sin(XY) and so on are also random variables.

In a more mathematical formulation, with a formal sample space defined, a random variable would be defined to be a real-valued function whose domain is the sample space.
3.2 Discrete Random Variables

In our dice example, the random variable X could take on six values in the set {1,2,3,4,5,6}. This is a finite set.

In the ALOHA example, X_1 and X_2 each take on values in the set {0,1,2}, again a finite set.[1]

[1] We could even say that X_1 takes on only values in the set {1,2}, but if we were to look at many epochs rather than just two, it would be easier not to make an exceptional case.
Now think of another experiment, in which we toss a coin until we get heads. Let N be the number of tosses needed. Then N can take on values in the set {1,2,3,...}. This is a countably infinite set.[2]

Now think of one more experiment, in which we throw a dart at the interval (0,1), and assume that the place that is hit, R, can take on any of the values between 0 and 1. This is an uncountably infinite set.
We say that X, X
1
, X
2
and N are discrete random variables, while R is continuous. Well discuss
continuous random variables in a later chapter.
3.3 Independent Random Variables
We already have a definition for the independence of events; what about independence of random
variables? Here it is:
Random variables X and Y are said to be independent if for any sets I and J, the
events {X is in I} and {Y is in J} are independent, i.e. P(X is in I and Y is in J) =
P(X is in I) P(Y is in J).
Sounds innocuous, but the notion of independent random variables is absolutely central to the field
of probability and statistics, and will pervade this entire book.
3.4 Expected Value
3.4.1 Generality: Not Just for Discrete Random Variables
The concepts and properties introduced in this section form the very core of probability and
statistics. Except for some specific calculations, these apply to both discrete and continuous
random variables.
The properties developed for variance, defined later in this chapter, also hold for both discrete and
continuous random variables.
²This is a concept from the fundamental theory of mathematics. Roughly speaking, it means that the set can
be assigned an integer labeling, i.e. item number 1, item number 2 and so on. The set of positive even numbers is
countable, as we can say 2 is item number 1, 4 is item number 2 and so on. It can be shown that even the set of all
rational numbers is countable.
3.4.1.1 What Is It?
The term "expected value" is one of the many misnomers one encounters in tech circles. The
expected value is actually not something we expect to occur. On the contrary, it's often pretty
unlikely.
For instance, let H denote the number of heads we get in tossing a coin 1000 times. The expected
value, you'll see later, is 500 (i.e. the mean). Yet P(H = 500) turns out to be about 0.025. In other
words, we certainly should not expect H to be 500.
In spite of being misnamed, expected value plays an absolutely central role in probability and
statistics.
3.4.2 Definition
Consider a repeatable experiment with random variable X. We say that the expected value of X
is the long-run average value of X, as we repeat the experiment indefinitely.
In our notebook, there will be a column for X. Let X_i denote the value of X in the i-th row of the
notebook. Then the long-run average of X is

\lim_{n \to \infty} \frac{X_1 + ... + X_n}{n}   (3.1)
Suppose for instance our experiment is to toss 10 coins. Let X denote the number of heads we get
out of 10. We might get four heads in the first repetition of the experiment, i.e. X_1 = 4, seven
heads in the second repetition, so X_2 = 7, and so on. Intuitively, the long-run average value of X
will be 5. (This will be proven below.) Thus we say that the expected value of X is 5, and write
E(X) = 5.
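To make the notebook notion concrete, here is a small R sketch (our own code, not from the book;
the variable names are ours) that builds such a "notebook" for this experiment and takes the
long-run average (3.1); the average should come out near 5:

nreps <- 100000
# each call tosses 10 fair coins and records X, the number of heads
x <- replicate(nreps, sum(runif(10) < 0.5))
mean(x)   # long-run average of the X column; should be close to 5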
3.4.3 Computation and Properties of Expected Value
Continuing the coin toss example above, let K_{in} be the number of times the value i occurs among
X_1, ..., X_n, i = 0,...,10, n = 1,2,3,... For instance, K_{4,20} is the number of times we get four heads,
in the first 20 repetitions of our experiment. Then
E(X) = \lim_{n \to \infty} \frac{X_1 + ... + X_n}{n}   (3.2)

     = \lim_{n \to \infty} \frac{0 \cdot K_{0n} + 1 \cdot K_{1n} + 2 \cdot K_{2n} + ... + 10 \cdot K_{10,n}}{n}   (3.3)

     = \sum_{i=0}^{10} i \cdot \lim_{n \to \infty} \frac{K_{in}}{n}   (3.4)
To understand that second equation, suppose when n = 5 we have 2, 3, 1, 2 and 1 for our values
of X_1, X_2, X_3, X_4, X_5. Then we can group the 2s together and group the 1s together, and write

2 + 3 + 1 + 2 + 1 = 2 \cdot 2 + 2 \cdot 1 + 1 \cdot 3   (3.5)
But \lim_{n \to \infty} K_{in}/n is the long-run fraction of the time that X = i. In other words, it's P(X = i)! So,

E(X) = \sum_{i=0}^{10} i \cdot P(X = i)   (3.6)
So in general we have a
Property A:
The expected value of a discrete random variable X which takes values in the set A is

E(X) = \sum_{c \in A} c \cdot P(X = c)   (3.7)

Note that (3.7) is the formula we'll use. The preceding equations were a derivation, to motivate the
formula. Note too that (3.7) is not the definition of expected value; that was given in (3.1). It is quite
important to distinguish between all of these, in terms of goals.
It will be shown in Section 3.12.2 that in our example above in which X is the number of heads we
get in 10 tosses of a coin,

P(X = i) = \binom{10}{i} 0.5^i (1-0.5)^{10-i}   (3.8)
So

E(X) = \sum_{i=0}^{10} i \binom{10}{i} 0.5^i (1-0.5)^{10-i}   (3.9)

It turns out that E(X) = 5.
For X in our dice example,

E(X) = \sum_{c=1}^{6} c \cdot \frac{1}{6} = 3.5   (3.10)
It is customary to use capital letters for random variables, e.g. X here, and lower-case letters for
values taken on by a random variable, e.g. c here. Please adhere to this convention.
By the way, it is also customary to write EX instead of E(X), whenever removal of the parentheses
does not cause any ambiguity. An example in which it would produce ambiguity is E(U^2). The
expression EU^2 might be taken to mean either E(U^2), which is what we want, or (EU)^2, which is
not what we want.
For S = X+Y in the dice example,

E(S) = 2 \cdot \frac{1}{36} + 3 \cdot \frac{2}{36} + 4 \cdot \frac{3}{36} + ... + 12 \cdot \frac{1}{36} = 7   (3.11)
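As a quick check of (3.11) in R (a sketch of ours, not the book's code), we can enumerate all 36
equally likely (blue, yellow) outcomes:

s <- outer(1:6, 1:6, "+")   # 6x6 matrix of all possible sums
mean(s)                     # each outcome has probability 1/36, so this is E(S); prints 7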
In the case of N, tossing a coin until we get a head:

E(N) = \sum_{c=1}^{\infty} c \cdot \frac{1}{2^c} = 2   (3.12)

(We will not go into the details here concerning how the sum of this particular infinite series is
computed.)
Some people like to think of E(X) using a center of gravity analogy. Forget that analogy! Think
notebook! Intuitively, E(X) is the long-run average value of X among all the lines of
the notebook. So for instance in our dice example, E(X) = 3.5, where X was the number of dots
on the blue die, means that if we do the experiment thousands of times, with thousands of lines in
our notebook, the average value of X in those lines will be about 3.5. With S = X+Y, E(S) = 7.
This means that the long-run average in column S in Table 3.1 is 7.
notebook line   outcome            blue+yellow = 6?   S
1               blue 2, yellow 6   No                 8
2               blue 3, yellow 1   No                 4
3               blue 1, yellow 1   No                 2
4               blue 4, yellow 2   Yes                6
5               blue 1, yellow 1   No                 2
6               blue 3, yellow 4   No                 7
7               blue 5, yellow 1   Yes                6
8               blue 3, yellow 6   No                 9
9               blue 2, yellow 5   No                 7

Table 3.1: Expanded Notebook for the Dice Problem
Of course, by symmetry, E(Y) will be 3.5 too, where Y is the number of dots showing on the
yellow die. That means we wasted our time calculating in Equation (3.11); we should have realized
beforehand that E(S) is 2 × 3.5 = 7.
In other words:
Property B:
For any random variables U and V, the expected value of a new random variable D = U+V is the
sum of the expected values of U and V:

E(U + V) = E(U) + E(V)   (3.13)

Note carefully that U and V do NOT need to be independent random variables for this relation
to hold. You should convince yourself of this fact intuitively by thinking about the notebook
notion. Say we look at 10000 lines of the notebook, which has columns for the values of U, V and
U+V. It makes no difference whether we average U+V in that column, or average U and V in their
columns and then add; either way, we'll get the same result.
While you are at it, use the notebook notion to convince yourself of the following:
Properties C:
For any random variable U and constant a,

E(aU) = a \cdot EU   (3.14)

For random variables X and Y, not necessarily independent, and constants a and b, we have

E(aX + bY) = a \cdot EX + b \cdot EY   (3.15)

This follows by taking U = aX and V = bY in (3.13), and then using (3.14).
For any constant b, we have

E(b) = b   (3.16)

For instance, say U is temperature in Celsius. Then the temperature in Fahrenheit is W = (9/5) U + 32.
So, W is a new random variable, and we can get its expected value from that of U by using (3.15)
(taking Y to be the constant 1) with a = 9/5 and b = 32.
Another important point:
Property D: If U and V are independent, then

E(UV) = EU \cdot EV   (3.17)

In the dice example, for instance, let D denote the product of the numbers of blue dots and yellow
dots, i.e. D = XY. Then

E(D) = 3.5^2 = 12.25   (3.18)
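Here is a quick simulation check of (3.18), again a sketch with variable names of our own choosing;
since the two dice are independent, the average of the products should be near 12.25:

nreps <- 100000
x <- sample(1:6, nreps, replace=TRUE)   # blue die
y <- sample(1:6, nreps, replace=TRUE)   # yellow die, generated independently of x
mean(x*y)                               # approximates E(XY) = 3.5^2 = 12.25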
Equation (3.17) doesn't have an easy "notebook proof." It is proved in Section 8.3.1.
Consider a function g() of one variable, and let W = g(X). W is then a random variable too. Say
X takes on values in A, as in (3.7). Then W takes on values in B = {g(c) : c \in A}. Define

A_d = {c : c \in A, g(c) = d}   (3.19)

Then

P(W = d) = P(X \in A_d)   (3.20)

so

E[g(X)] = E(W)   (3.21)
        = \sum_{d \in B} d \cdot P(W = d)   (3.22)
        = \sum_{d \in B} d \sum_{c \in A_d} P(X = c)   (3.23)
        = \sum_{c \in A} g(c) P(X = c)   (3.24)
Property E:
If E[g(X)] exists, then
E[g(X)] = \sum_{c} g(c) \cdot P(X = c)   (3.25)
where the sum ranges over all values c that can be taken on by X.
For example, suppose for some odd reason we are interested in finding E(\sqrt{X}), where X is the
number of dots we get when we roll one die. Let W = \sqrt{X}. Then W is another random variable,
and is discrete, since it takes on only a finite number of values. (The fact that most of the values
are not integers is irrelevant.) We want to find EW.
Well, W is a function of X, with g(t) = \sqrt{t}. So, (3.25) tells us to make a list of the values that W
takes on, i.e. \sqrt{1}, \sqrt{2}, ..., \sqrt{6}, and a list of the corresponding probabilities for X, which are all 1/6.
Substituting into (3.25), we find that

E(\sqrt{X}) = \frac{1}{6} \sum_{i=1}^{6} \sqrt{i}   (3.26)
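In R, (3.26) is a one-liner, and we can also verify it by simulation (a sketch of ours, not the book's
code):

sum(sqrt(1:6)) / 6                              # exact value of (3.26), about 1.81
mean(sqrt(sample(1:6, 100000, replace=TRUE)))   # simulation check; should be close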
3.4.4 Mailing Tubes
The properties of expected value discussed above are key to the entire remainder of
this book. You should notice immediately when you are in a setting in which they
are applicable. For instance, if you see the expected value of the sum of two random
variables, you should instinctively think of (3.13) right away.
As discussed in Section 2.4, these properties are "mailing tubes." For instance, (3.13) is a mailing
tube; make a mental note to yourself saying, "If I ever need to find the expected value of the sum
of two random variables, I can use (3.13)." Similarly, (3.25) is a mailing tube; tell yourself, "If I
ever see a new random variable that is a function of one whose probabilities I already know, I can
find the expected value of the new random variable using (3.25)."
You will encounter mailing tubes throughout this book. For instance, (3.33) below is a very
important mailing tube. Constantly remind yourself: "Remember the mailing tubes!"
3.4.5 Casinos, Insurance Companies and "Sum Users," Compared to Others
The expected value is intended as a measure of central tendency, i.e. as some sort of definition
of the probabilistic "middle" in the range of a random variable. There are various other such
measures one can use, such as the median, the halfway point of a distribution, and today they are
recognized as being superior to the mean in certain senses. For historical reasons, though, the mean plays
an absolutely central role in probability and statistics. Yet one should understand its limitations.
(Warning: The concept of the mean is likely so ingrained in your consciousness that you simply
take it for granted that you know what the mean means, no pun intended. But try to take a step
back, and think of the mean afresh in what follows.)
First, the term expected value itself is a misnomer. We do not expect W to be 91/6 in this last
example; in fact, it is impossible for W to take on that value.
Second, the expected value is what we call the mean in everyday life. And the mean is terribly
overused. Consider, for example, an attempt to describe how wealthy (or not) people are in the
city of Davis. If suddenly Bill Gates were to move into town, that would skew the value of the
mean beyond recognition.
But even without Gates, there is a question as to whether the mean has that much meaning. After
all, what is so meaningful about summing our data and dividing by the number of data points?
The median has an easy intuitive meaning, but although the mean has familiarity, one would be
hard pressed to justify it as a measure of central tendency.
What, for example, does Equation (3.1) mean in the context of people's heights in Davis? We
would sample a person at random and record his/her height as X_1. Then we'd sample another
person, to get X_2, and so on. Fine, but in that context, what would (3.1) mean? The answer is,
not much. So the significance of the mean height of people in Davis would be hard to explain.
For a casino, though, (3.1) means plenty. Say X is the amount a gambler wins on a play of a
roulette wheel, and suppose (3.1) is equal to $1.88. Then after, say, 1000 plays of the wheel (not
necessarily by the same gambler), the casino knows from (3.1) that it will have paid out a total of
about $1,880. So if the casino charges, say, $1.95 per play, it will have made a profit of about $70
over those 1000 plays. It might be a bit more or less than that amount, but the casino can be
pretty sure that it will be around $70, and they can plan their business accordingly.
The same principle holds for insurance companies, concerning how much they pay out in claims.
With a large number of customers, they know ("expect"!) approximately how much they will pay
out, and thus can set their premiums accordingly. Here the mean has a tangible, practical meaning.
The key point in the casino and insurance company examples is that they are interested in totals,
such as total payouts on a blackjack table over a month's time, or total insurance claims paid in
a year. Another example might be the number of defectives in a batch of computer chips; the
manufacturer is interested in the total number of defective chips produced, say in a month. Since
the mean is by definition a total (divided by the number of data points), the mean will be of direct
interest to casinos etc.
By contrast, in describing how tall people of a town are, the total height of all the residents is
not relevant. Similarly, in describing how well students did on an exam, the sum of the scores of all
the students doesn't tell us much. (Unless the professor gets $10 for each point in the exam scores
of each of the students!) A better description for heights and exam scores might be the median
height or score.
Nevertheless, the mean has certain mathematical properties, such as (3.13), that have allowed the
rich development of the fields of probability and statistics over the years. The median, by contrast,
does not have nice mathematical properties. In many cases, the mean won't be too different from
the median anyway (barring Bill Gates moving into town), so you might think of the mean as a
convenient substitute for the median. The mean has become entrenched in statistics, and we will
use it often.
3.5 Variance
As in Section 3.4, the concepts and properties introduced in this section form the very core of
probability and statistics. Except for some specific calculations, these apply to both
discrete and continuous random variables.
3.5.1 Definition
While the expected value tells us the average value a random variable takes on, we also need
a measure of the random variable's variability: how much does it wander from one line of the
notebook to another? In other words, we want a measure of dispersion. The classical measure is
variance, defined to be the mean squared difference between a random variable and its mean:
Definition 4 For a random variable U for which the expected values written below exist, the
variance of U is defined to be

Var(U) = E[(U - EU)^2]   (3.27)

For X in the die example, this would be

Var(X) = E[(X - 3.5)^2]   (3.28)
Remember what this means: We have a random variable X, and we're creating a new random
variable, W = (X - 3.5)^2, which is a function of the old one. We are then finding the expected
value of that new random variable W.
In the notebook view, E[(X - 3.5)^2] is the long-run average of the W column:

line   X   W
1      2   2.25
2      5   2.25
3      6   6.25
4      3   0.25
5      5   2.25
6      1   6.25
To evaluate this, apply (3.25) with g(c) = (c - 3.5)^2:

Var(X) = \sum_{c=1}^{6} (c - 3.5)^2 \cdot \frac{1}{6} = 2.92   (3.29)
You can see that variance does indeed give us a measure of dispersion. In the expression
Var(U) = E[(U - EU)^2], if the values of U are mostly clustered near its mean, then (U - EU)^2 will usually
be small, and thus the variance of U will be small; if there is wide variation in U, the variance will
be large.
The properties of E() developed above can be used to show:
Property F:

Var(U) = E(U^2) - (EU)^2   (3.30)

The term E(U^2) is again evaluated using (3.25).
Thus, for example, suppose X is the number of dots which come up when we roll a die. Then, from (3.30),

Var(X) = E(X^2) - (EX)^2   (3.31)

Let's find that first term (we already know the second, since EX = 3.5). From (3.25),

E(X^2) = \sum_{i=1}^{6} i^2 \cdot \frac{1}{6} = \frac{91}{6}   (3.32)

Thus Var(X) = E(X^2) - (EX)^2 = 91/6 - 3.5^2.
Remember, though, that (3.30) is a shortcut formula for finding the variance, not the definition of
variance.
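The following R sketch (our own, for illustration) computes Var(X) for the die both ways,
confirming that the shortcut (3.30) and the definition (3.27) agree:

x <- 1:6
px <- rep(1/6, 6)        # pmf of the die
ex <- sum(x * px)        # EX = 3.5
ex2 <- sum(x^2 * px)     # E(X^2) = 91/6
ex2 - ex^2               # Var(X) via the shortcut (3.30); about 2.92
sum((x - ex)^2 * px)     # Var(X) via the definition (3.27); same value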
An important behavior of variance is:
Property G:
Var(cU) = c^2 Var(U)   (3.33)
for any random variable U and constant c. It should make sense to you: If we multiply a random
variable by 5, say, then its average squared distance to its mean should increase by a factor of 25.
Let's prove (3.33). Define V = cU. Then

Var(V) = E[(V - EV)^2]       (def.)      (3.34)
       = E[(cU - E(cU))^2]   (subst.)    (3.35)
       = E[(cU - c EU)^2]    (by (3.14)) (3.36)
       = E[c^2 (U - EU)^2]   (algebra)   (3.37)
       = c^2 E[(U - EU)^2]   (by (3.14)) (3.38)
       = c^2 Var(U)          (def.)      (3.39)
Shifting data over by a constant does not change the amount of variation in them:
Property H:
Var(U + d) = Var(U)   (3.40)
for any constant d.
Intuitively, the variance of a constant is 0; after all, it never varies! You can show this formally
using (3.30):

Var(c) = E(c^2) - [E(c)]^2 = c^2 - c^2 = 0   (3.41)
The square root of the variance is called the standard deviation.
Again, we use variance as our main measure of dispersion for historical and mathematical reasons,
not because it's the most meaningful measure. The squaring in the definition of variance produces
some distortion, by exaggerating the importance of the larger differences. It would be more natural
to use the mean absolute deviation (MAD), E(|U - EU|). However, this is less tractable
mathematically, so the statistical pioneers chose to use the mean squared difference, which lends
itself to lots of powerful and beautiful math, in which the Pythagorean Theorem pops up in abstract
vector spaces. (See Section 9.7 for details.)
As with expected values, the properties of variance discussed above, and also in Section 7.1.1
below, are key to the entire remainder of this book. You should notice
immediately when you are in a setting in which they are applicable. For instance,
if you see the variance of the sum of two random variables, you should instinctively
think of (3.65) right away.
3.5.2 Central Importance of the Concept of Variance
No one needs to be convinced that the mean is a fundamental descriptor of the nature of a random
variable. But the variance is of central importance too, and will be used constantly throughout the
remainder of this book.
The next section gives a quantitative look at our notion of variance as a measure of dispersion.
3.5.3 Intuition Regarding the Size of Var(X)
"A billion here, a billion there, pretty soon, you're talking real money" (attributed to the late
Senator Everett Dirksen, replying to a statement that some federal budget item cost "only" a billion dollars)
Recall that the variance of a random variable X is supposed to be a measure of the dispersion of X,
meaning the amount that X varies from one instance (one line in our notebook) to the next. But
if Var(X) is, say, 2.5, is that a lot of variability or not? We will pursue this question here.
3.5.3.1 Chebychev's Inequality
This inequality states that for a random variable X with mean \mu and variance \sigma^2,

P(|X - \mu| \geq c \sigma) \leq \frac{1}{c^2}   (3.42)

In other words, X strays more than, say, 3 standard deviations from its mean at most only 1/9 of
the time. This gives some concrete meaning to the concept of variance/standard deviation.
You've probably had exams in which the instructor says something like "An A grade is 1.5 standard
deviations above the mean." Here c in (3.42) would be 1.5.
We'll prove the inequality in Section 3.18.
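Here is an R sketch (ours, not from the book) that checks (3.42) empirically for the die example;
the observed proportion must come out below the bound 1/c²:

x <- sample(1:6, 100000, replace=TRUE)
mu <- 3.5
sigma <- sqrt(35/12)            # exact SD of one die roll, about 1.71
cc <- 1.4
mean(abs(x - mu) >= cc*sigma)   # observed proportion; about 1/3 here
1/cc^2                          # Chebychev bound, about 0.51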
3.5.3.2 The Coefficient of Variation
Continuing our discussion of the magnitude of a variance, look at our remark following (3.42):
In other words, X does not often stray more than, say, 3 standard deviations from its
mean. This gives some concrete meaning to the concept of variance/standard deviation.
Or, think of the price of, say, widgets. If the price hovers around $1 million, but the variation
around that figure is only about a dollar, you'd say there is essentially no variation. But a variation
of about a dollar in the price of a hamburger would be a lot.
These considerations suggest that any discussion of the size of Var(X) should relate to the size of
E(X). Accordingly, one often looks at the coefficient of variation, defined to be the ratio of the
standard deviation to the mean:

coef. of var. = \frac{\sqrt{Var(X)}}{EX}   (3.43)

This is a scale-free measure (e.g. inches divided by inches), and serves as a good way to judge
whether a variance is large or not.
3.6 Indicator Random Variables, and Their Means and Variances
Definition 5 A random variable that has the value 1 or 0, according to whether a specified event
occurs or not, is called an indicator random variable for that event.
You'll often see later in this book that the notion of an indicator random variable is a very handy
device in certain derivations. But for now, let's establish its properties in terms of mean and
variance.
Handy facts: Suppose X is an indicator random variable for the event A. Let p denote
P(A). Then

E(X) = p   (3.44)
Var(X) = p(1 - p)   (3.45)

These two facts are easily derived. In the first case we have, using our properties for expected value,

EX = 1 \cdot P(X = 1) + 0 \cdot P(X = 0) = P(X = 1) = P(A) = p   (3.46)

The derivation for Var(X) is similar (use (3.30)).
3.7 A Combinatorial Example
A committee of four people is drawn at random from a set of six men and three women. Suppose
we are concerned that there may be quite a gender imbalance in the membership of the committee.
Toward that end, let M and W denote the numbers of men and women in our committee, and let
D = M - W. Let's find E(D), in two different ways.
D can take on the values 4-0, 3-1, 2-2 and 1-3, i.e. 4, 2, 0 and -2. So from (3.7),

ED = -2 \cdot P(D = -2) + 0 \cdot P(D = 0) + 2 \cdot P(D = 2) + 4 \cdot P(D = 4)   (3.47)

Now, using reasoning along the lines in Section 2.13, we have

P(D = -2) = P(M = 1 and W = 3) = \frac{\binom{6}{1} \binom{3}{3}}{\binom{9}{4}}   (3.48)

After similar calculations for the other probabilities in (3.47), we find that ED = 4/3.
Note what this means: If we were to perform this experiment many times, i.e. choose committees
again and again, on average we would have a little more than one more man than woman on the
committee.
Now let's use our mailing tubes to derive ED a different way:

ED = E(M - W)      (3.49)
   = E[M - (4 - M)]   (3.50)
   = E(2M - 4)     (3.51)
   = 2 EM - 4      (from (3.14) and (3.16))   (3.52)
Now, let's find EM by using indicator random variables. Let G_i denote the indicator random
variable for the event that the i-th person we pick is male, i = 1,2,3,4. Then

M = G_1 + G_2 + G_3 + G_4   (3.53)

so

EM = E(G_1 + G_2 + G_3 + G_4)   (3.54)
   = EG_1 + EG_2 + EG_3 + EG_4   [from (3.13)]   (3.55)
   = P(G_1 = 1) + P(G_2 = 1) + P(G_3 = 1) + P(G_4 = 1)   [from (3.44)]   (3.56)
Note carefully that the second equality here, which uses (3.13), is true in spite of the fact that the
G_i are not independent. Equation (3.13) does not require independence.
Another key point is that, due to symmetry, P(G_i = 1) is the same for all i.
(Note that we did not write a conditional probability here.) To see this, suppose the six men that
are available for the committee are named Alex, Bo, Carlo, David, Eduardo and Frank. When we
select our first person, any of these men has the same chance of being chosen (1/9). But that is
also true for the second pick. Think of a notebook, with a column named "second pick." In some
lines, that column will say Alex, in some it will say Bo, and so on, and in some lines there will be
women's names. But in that column, Bo will appear the same fraction of the time as Alex, due to
symmetry, and that will be the same as the fraction for, say, Alice, again 1/9.
Now,

P(G_1 = 1) = \frac{6}{9} = \frac{2}{3}   (3.57)

Thus

ED = 2 \cdot \left(4 \cdot \frac{2}{3}\right) - 4 = \frac{4}{3}   (3.58)
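We can confirm ED = 4/3 by simulation. The following sketch (our own code) represents the nine
people as six 1s (men) and three 0s (women), and draws committees without replacement:

nreps <- 100000
pool <- c(rep(1,6), rep(0,3))             # 1 = man, 0 = woman
dvals <- replicate(nreps, {
   committee <- sample(pool, 4)           # draw 4 people without replacement
   sum(committee) - (4 - sum(committee))  # M - W
})
mean(dvals)                               # should be near 4/3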
3.8 A Useful Fact
For a random variable X, consider the function
g(c) = E[(X - c)^2]   (3.59)

Remember, the quantity E[(X - c)^2] is a number, so g(c) really is a function, mapping a real
number c to some real output.
We can ask the question, What value of c minimizes g(c)? To answer that question, write:

g(c) = E[(X - c)^2] = E(X^2 - 2cX + c^2) = E(X^2) - 2c EX + c^2   (3.60)

where we have used the various properties of expected value derived in recent sections.
Now differentiate with respect to c, and set the result to 0. Remembering that E(X^2) and EX are
constants, we have

0 = -2 EX + 2c   (3.61)

so the minimizing c is c = EX!
In other words, the minimum value of E[(X - c)^2] occurs at c = EX.
Moreover: Plugging c = EX into (3.60) shows that the minimum value of g(c) is E[(X - EX)^2],
which is Var(X)!
3.9 Covariance
This is a topic we'll cover fully in Chapter 8, but we at least introduce it here.
A measure of the degree to which U and V vary together is their covariance,

Cov(U, V) = E[(U - EU)(V - EV)]   (3.62)

Except for a divisor, this is essentially correlation. If U is usually large at the same time V is
small, for instance, then you can see that the covariance between them will be negative. On the
other hand, if they are usually large together or small together, the covariance will be positive.
Again, one can use the properties of E() to show that

Cov(U, V) = E(UV) - EU \cdot EV   (3.63)

Also,

Var(U + V) = Var(U) + Var(V) + 2 Cov(U, V)   (3.64)

Suppose U and V are independent. Then (3.17) and (3.63) imply that Cov(U,V) = 0. In that case,

Var(U + V) = Var(U) + Var(V)   (3.65)

By the way, (3.65) is actually the Pythagorean Theorem in a certain esoteric, infinite-dimensional
vector space (related to a similar remark made earlier). This is pursued in Section 9.7 for the
mathematically inclined.
3.10 Expected Value, Etc. in the ALOHA Example
Finding expected values etc. in the ALOHA example is straightforward. For instance,
EX_1 = 0 \cdot P(X_1 = 0) + 1 \cdot P(X_1 = 1) + 2 \cdot P(X_1 = 2) = 1 \cdot 0.48 + 2 \cdot 0.52 = 1.52   (3.66)
Here is R code to find various values approximately by simulation:

# finds E(X1), E(X2), Var(X2), Cov(X1,X2)
sim <- function(p,q,nreps) {
   sumx1 <- 0
   sumx2 <- 0
   sumx2sq <- 0
   sumx1x2 <- 0
   for (i in 1:nreps) {
      # epoch 1: both nodes are active; each sends with probability p
      numsend <- 0
      for (j in 1:2)
         if (runif(1) < p) numsend <- numsend + 1
      if (numsend == 1) X1 <- 1
      else X1 <- 2
      numactive <- X1
      # the idle node (if any) may create a message, with probability q
      if (X1 == 1 && runif(1) < q) numactive <- numactive + 1
      if (numactive == 1) {
         if (runif(1) < p) X2 <- 0
         else X2 <- 1
      } else {  # numactive = 2
         numsend <- 0
         for (j in 1:2)
            if (runif(1) < p) numsend <- numsend + 1
         if (numsend == 1) X2 <- 1
         else X2 <- 2
      }
      sumx1 <- sumx1 + X1
      sumx2 <- sumx2 + X2
      sumx2sq <- sumx2sq + X2^2
      sumx1x2 <- sumx1x2 + X1*X2
   }
   # print results
   meanx1 <- sumx1 / nreps
   cat("E(X1):",meanx1,"\n")
   meanx2 <- sumx2 / nreps
   cat("E(X2):",meanx2,"\n")
   cat("Var(X2):",sumx2sq/nreps - meanx2^2,"\n")
   cat("Cov(X1,X2):",sumx1x2/nreps - meanx1*meanx2,"\n")
}
As a check on your understanding so far, you should find at least one of these values by hand, and
see if it jibes with the simulation output.
3.11 Distributions
The idea of the distribution of a random variable is central to probability and statistics.
Definition 6 Let U be a discrete random variable. Then the distribution of U is simply a list of
all the values U takes on, and their associated probabilities.
Example: Let X denote the number of dots one gets in rolling a die. Then the values X can take
on are 1,2,3,4,5,6, each with probability 1/6. So

distribution of X = {(1, 1/6), (2, 1/6), (3, 1/6), (4, 1/6), (5, 1/6), (6, 1/6)}   (3.67)

Example: Recall the ALOHA example. There X_1 took on the values 1 and 2, with probabilities
0.48 and 0.52, respectively. So,

distribution of X_1 = {(0, 0.00), (1, 0.48), (2, 0.52)}   (3.68)
Example: Recall our example in which N is the number of tosses of a coin needed to get the first
head. N can take on the values 1,2,3,..., the probabilities of which we found earlier to be 1/2, 1/4,
1/8,... So,

distribution of N = {(1, 1/2), (2, 1/4), (3, 1/8), ...}   (3.69)

It is common to express this in functional notation:
Definition 7 The probability mass function (pmf) of a discrete random variable V, denoted
p_V, is defined as

p_V(k) = P(V = k)   (3.70)

for any value k which V can take on.
(Please keep in mind the notation. It is customary to use the lower-case p, with a subscript
consisting of the name of the random variable.)
Note that p_V() is just a function, like any function (with integer domain) you've had in your
previous math courses. For each input value, there is an output value.
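A pmf can also be estimated from simulated (or real) data, simply by tabulating proportions.
Here is a sketch of ours in R, foreshadowing the sum-of-two-dice example below:

nreps <- 100000
s <- sample(1:6, nreps, replace=TRUE) + sample(1:6, nreps, replace=TRUE)
table(s) / nreps   # estimated pmf of the sum, for k = 2,...,12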
3.11.1 Example: Toss Coin Until First Head
In (3.69),

p_N(k) = \frac{1}{2^k}, k = 1, 2, ...   (3.71)
3.11.2 Example: Sum of Two Dice
In the dice example, in which S = X+Y,

p_S(k) =
   1/36,  k = 2
   2/36,  k = 3
   3/36,  k = 4
   ...
   1/36,  k = 12
   (3.72)
It is important to note that there may not be some nice closed-form expression for p_V like that of
(3.71). There was no such form in (3.72), nor is there in our ALOHA example for p_{X_1} and p_{X_2}.
3.11.3 Example: Watts-Strogatz Random Graph Model
Random graph models are used to analyze many types of link systems, such as power grids, social
networks and even movie stars. The following is a variation on a famous model of that type, due
to Duncan Watts and Steven Strogatz.
We have a graph of n nodes (e.g. each node is a person).³ Think of them as being linked in a circle,
so we already have n links. One can thus reach any node in the graph from any other, by following
the links of the circle. (We'll assume all links are bidirectional.)
We now randomly add k more links (k is thus a parameter of the model), which will serve as
shortcuts. There are \binom{n}{2} = n(n-1)/2 possible links between nodes, but remember, we already
have n of those in the graph, so there are only n(n-1)/2 - n = n^2/2 - 3n/2 possibilities left. We'll
be forming k new links, chosen at random from those n^2/2 - 3n/2 possibilities.
Let M denote the number of links attached to a particular node, known as the degree of the
node. M is a random variable (we are choosing the shortcut links randomly), so we can talk of its
pmf, p_M, termed the degree distribution, which we'll calculate now.
Well, p_M(r) is the probability that this node has r links. Since the node already had 2 links before
the shortcuts were constructed, p_M(r) is the probability that r-2 of the k shortcuts attach to this
node.
This problem is similar in spirit to (though admittedly more difficult to think about than) the
kings-and-hearts example of Section 2.13.1. Other than the two neighboring links in the original circle
and the link of a node to itself, there are n-3 possible shortcut links to attach to our given node.
We're interested in the probability that r-2 of them are chosen, and that k-(r-2) are chosen from
the other possible links. Thus our probability is:

p_M(r) = \frac{\binom{n-3}{r-2} \binom{n^2/2 - 3n/2 - (n-3)}{k-(r-2)}}{\binom{n^2/2 - 3n/2}{k}} = \frac{\binom{n-3}{r-2} \binom{n^2/2 - 5n/2 + 3}{k-(r-2)}}{\binom{n^2/2 - 3n/2}{k}}   (3.73)

³The word "graph" here doesn't mean "graph" in the sense of a picture. Here we are using the computer science
sense of the word, meaning a system of vertices and edges. It's common to call those nodes and links.
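Equation (3.73) is easy to transcribe into R using choose(). Here is a sketch; the function name
pMr is ours, not from the book:

# pMr is a hypothetical helper, a direct transcription of (3.73)
pMr <- function(n, k, r) {
   tot <- n^2/2 - 3*n/2   # number of possible shortcut links
   choose(n-3, r-2) * choose(tot - (n-3), k - (r-2)) / choose(tot, k)
}
pMr(10, 5, 3)   # e.g., P(a given node has degree 3) with n = 10 nodes, k = 5 shortcuts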
3.12 Parametric Families of pmfs
Consider plotting the curves sin(ct). For each c, we get the familiar sine function. For larger c,
the curve is more "squished," and for c strictly between 0 and 1, we get a broadened sine curve.
So we have a family of sine curves of different proportions. We say the family is indexed by the
parameter c, meaning that each c gives us a different member of the family, i.e. a different curve.
Probability mass functions, and in the next chapter, probability density functions, can also come
in families, indexed by one or more parameters. In fact, we just had an example above, in Section
3.11.3. Since we get a different function p_M for each different value of k, that was a parametric
family of pmfs, indexed by k.
Some parametric families of pmfs have been found to be so useful over the years that they've been
given names. We will discuss some of those families here. But remember, they are famous just
because they have been found useful, i.e. they fit real data well in various settings. Do not
jump to the conclusion that we always must use pmfs from some family.
3.12.1 The Geometric Family of Distributions
Recall our example of tossing a coin until we get the first head, with N denoting the number of
tosses needed. In order for this to take k tosses, we need k-1 tails and then a head. Thus

p_N(k) = \left(1 - \frac{1}{2}\right)^{k-1} \cdot \frac{1}{2}, k = 1, 2, ...   (3.74)
We might call getting a head a "success," and refer to a tail as a "failure." Of course, these words
don't mean anything; we simply refer to the outcome of interest as "success."
Define M to be the number of rolls of a die needed until the number 5 shows up. Then

p_M(k) = \left(1 - \frac{1}{6}\right)^{k-1} \cdot \frac{1}{6}, k = 1, 2, ...   (3.75)

reflecting the fact that the event {M = k} occurs if we get k-1 non-5s and then a 5. Here "success"
is getting a 5.
The tosses of the coin and the rolls of the die are known as Bernoulli trials, which is a sequence
of independent events. We call the occurrence of the event "success" and the nonoccurrence "failure"
(just convenient terms, not value judgments). The associated indicator random variables are denoted
B_i, i = 1,2,3,... So B_i is 1 for success on the i-th trial, 0 for failure, with success probability p. For
instance, p is 1/2 in the coin case, and 1/6 in the die example.
In general, suppose the random variable W is defined to be the number of trials needed to get a
success in a sequence of Bernoulli trials. Then

p_W(k) = (1-p)^{k-1} p, k = 1, 2, ...   (3.76)

Note that there is a different distribution for each value of p, so we call this a parametric family
of distributions, indexed by the parameter p. We say that W is geometrically distributed with
parameter p.⁴
It should make good intuitive sense to you that

E(W) = \frac{1}{p}   (3.77)

This is indeed true, which we will now derive. First we'll need some facts (which you should file
mentally for future use as well):
Properties of Geometric Series:
(a) For any t ≠ 1 and any nonnegative integers r ≤ s,

\sum_{i=r}^{s} t^i = t^r \cdot \frac{1 - t^{s-r+1}}{1 - t}   (3.78)

This is easy to derive for the case r = 0, using mathematical induction. For the general case,
just factor out t^r.
(b) For |t| < 1,

\sum_{i=0}^{\infty} t^i = \frac{1}{1 - t}   (3.79)

To prove this, just take r = 0 and let s → ∞ in (3.78).
(c) For |t| < 1,

\sum_{i=1}^{\infty} i t^{i-1} = \frac{1}{(1 - t)^2}   (3.80)

This is derived by applying d/dt to (3.79).⁵
⁴Unfortunately, we have overloaded the letter p here, using it to denote the probability mass function on the left
side, and the unrelated parameter p, our success probability, on the right side. It's not a problem as long as you are
aware of it, though.
⁵To be more careful, we should differentiate (3.78) and take limits.
Deriving (3.77) is then easy, using (3.80):

EW = \sum_{i=1}^{\infty} i (1-p)^{i-1} p   (3.81)
   = p \sum_{i=1}^{\infty} i (1-p)^{i-1}   (3.82)
   = p \cdot \frac{1}{[1 - (1-p)]^2}   (3.83)
   = \frac{1}{p}   (3.84)

Using similar computations, one can show that

Var(W) = \frac{1-p}{p^2}   (3.85)
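Both (3.77) and (3.85) are easy to check by simulation, using R's rgeom() function (covered in the
next section; note that it counts failures rather than trials, so we add 1). A sketch:

p <- 0.25
w <- rgeom(100000, p) + 1   # number of trials to first success
mean(w); 1/p                # compare: both should be near 4
var(w); (1-p)/p^2           # compare: both should be near 12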
We can also find a closed-form expression for the quantities P(W ≤ m), m = 1,2,... (This function
of m has a formal name, F_W(m), as will be seen later in Section 4.3.) For any positive integer m we have

F_W(m) = P(W ≤ m)   (3.86)
       = 1 - P(W > m)   (3.87)
       = 1 - P(the first m trials are all failures)   (3.88)
       = 1 - (1-p)^m   (3.89)

By the way, if we were to think of an experiment involving a geometric distribution in terms of our
notebook idea, the notebook would have an infinite number of columns, one for each B_i. Within
each row of the notebook, the B_i entries would be 0 until the first 1, then NA ("not applicable")
after that.
3.12.1.1 R Functions
You can simulate geometrically distributed random variables via R's rgeom() function. Its first
argument specifies the number of such random variables you wish to generate, and the second is
the success probability p.
For example, if you run

> y <- rgeom(2,0.5)

then it's simulating tossing a coin until you get a head (y[1]) and then tossing the coin until a head
again (y[2]). Of course, you could simulate on your own, say using sample() and while(), but R
makes it convenient for you.
Here's the full set of functions for a geometrically distributed random variable X with success
probability p:
dgeom(i,p), to find P(X = i)
pgeom(i,p), to find P(X ≤ i)
qgeom(q,p), to find c such that P(X ≤ c) = q
rgeom(n,p), to generate n variates from this geometric distribution
Important note: Some books define geometric distributions slightly differently, as the number
of failures before the first success, rather than the number of trials to the first success. R uses that
failures-before-first-success definition, so for example in calling dgeom(), subtract 1 from the value
used in our definition.
3.12.1.2 Example: a Parking Space Problem
Suppose there are 10 parking spaces per block on a certain street. You turn onto the street at
the start of one block, and your destination is at the start of the next block. You take the first
parking space you encounter. Let D denote the distance of the parking place you find from your
destination, measured in parking spaces. Suppose each space is open with probability 0.15, with
the spaces being independent. Find ED.
To solve this problem, you might at first think that D follows a geometric distribution. But don't
jump to conclusions! Actually this is not the case; D is a somewhat complicated
distance. But clearly D is a function of N, where the latter denotes the number of parking spaces
you see until you find an empty one, and N is geometrically distributed.
As noted, D is a function of N:

D = 11 - N, if N ≤ 10
D = N - 11, if N > 10
   (3.90)

Since D is a function of N, we can use (3.25):

ED = \sum_{i=1}^{10} (11-i) \cdot 0.85^{i-1} \cdot 0.15 + \sum_{i=11}^{\infty} (i-11) \cdot 0.85^{i-1} \cdot 0.15   (3.91)

This can now be evaluated using the properties of geometric series presented above.
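For instance, here is a sketch of a direct numerical evaluation of (3.91) in R; we truncate the
infinite tail, which is harmless since the terms shrink geometrically:

i1 <- 1:10
i2 <- 11:500   # truncation of the infinite tail
sum((11 - i1) * 0.85^(i1-1) * 0.15) +
   sum((i2 - 11) * 0.85^(i2-1) * 0.15)   # ED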
Alternatively, here's how we could find the result by simulation:

parksim <- function(nreps) {
   # do the experiment nreps times, recording the values of N
   nvals <- rgeom(nreps, 0.15) + 1
   # now find the values of D
   dvals <- ifelse(nvals <= 10, 11 - nvals, nvals - 11)
   # return ED
   return(mean(dvals))
}

Note the vectorized addition and recycling (Section 2.12.2) in the line

nvals <- rgeom(nreps, 0.15) + 1

The call to ifelse() is another instance of R's vectorization, a vectorized if-then-else. The first
argument evaluates to a vector of TRUE and FALSE values. For each TRUE, the corresponding
element of dvals will be set to the corresponding element of the vector 11-nvals (again involving
vectorized arithmetic and recycling), and for each FALSE, the element of dvals will be set to the
element of nvals-11.
3.12.2 The Binomial Family of Distributions
A geometric distribution arises when we have Bernoulli trials with parameter p, with a variable
number of trials (N) but a fixed number of successes (1). A binomial distribution arises when
we have the opposite: a fixed number of Bernoulli trials (n) but a variable number of successes
(say X).⁶
For example, say we toss a coin five times, and let X be the number of heads we get. We say that
X is binomially distributed with parameters n = 5 and p = 1/2. Let's find P(X = 2). There are
many orders in which that could occur, such as HHTTT, TTHHT, HTTHT and so on. Each order
has probability 0.5^2 (1-0.5)^3, and there are \binom{5}{2} orders. Thus

P(X = 2) = \binom{5}{2} 0.5^2 (1-0.5)^3 = \binom{5}{2}/32 = 5/16   (3.92)

⁶Note again the custom of using capital letters for random variables, and lower-case letters for constants.
For general n and p,

P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}   (3.93)

So again we have a parametric family of distributions, in this case a family having two parameters,
n and p.
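As a sanity check on (3.92) and (3.93), here is a sketch of ours that enumerates all 32 equally
likely sequences of 5 tosses in R:

tosses <- expand.grid(rep(list(0:1), 5))   # all 2^5 = 32 sequences
mean(rowSums(tosses) == 2)                 # direct count: 10/32 = 5/16
choose(5,2) * 0.5^2 * (1-0.5)^3            # same value, via (3.93)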
Let's write X as a sum of those 0-1 Bernoulli variables we used in the discussion of the geometric
distribution above:

X = \sum_{i=1}^{n} B_i   (3.94)

where B_i is 1 or 0, depending on whether there is success on the i-th trial or not. Note again that
the B_i are indicator random variables (Section 3.6), so

EB_i = p   (3.95)

and

Var(B_i) = p(1-p)   (3.96)

Then the reader should use our earlier properties of E() and Var() in Sections 3.4 and 3.5 to fill
in the details in the following derivations of the expected value and variance of a binomial random
variable:

EX = E(B_1 + ... + B_n) = EB_1 + ... + EB_n = np   (3.97)

and from (3.65),

Var(X) = Var(B_1 + ... + B_n) = Var(B_1) + ... + Var(B_n) = np(1-p)   (3.98)

Again, (3.97) should make good intuitive sense to you.
3.12.2.1 R Functions
Relevant functions for a binomially distributed random variable X for k trials and with success
probability p are:
dbinom(i,k,p), to find P(X = i)
pbinom(i,k,p), to find P(X ≤ i)
qbinom(q,k,p), to find c such that P(X ≤ c) = q
rbinom(n,k,p), to generate n independent values of X
3.12.2.2 Example: Flipping Coins with Bonuses
A game involves flipping a coin k times. Each time you get a head, you get a bonus flip, not counted
among the k. (But if you get a head from a bonus flip, that does not give you its own bonus flip.)
Let X denote the number of heads you get among all flips, bonus or not. Let's find the distribution
of X.
As with the parking space example above, we should be careful not to come to hasty conclusions.
The situation here "sounds" binomial, but X, based on a variable number of trials, doesn't fit the
definition of binomial.
But let Y denote the number of heads you obtain through nonbonus flips. Y then has a binomial
distribution with parameters k and 0.5. To find the distribution of X, we'll condition on Y.
We will as usual ask, "How can it happen?", but we need to take extra care in forming our sums,
recognizing constraints on Y:
Y ≥ X/2
Y ≤ X
Y ≤ k
Keeping those points in mind, we have
p_X(m) = P(X = m)   (3.99)
       = \sum_{i=ceil(m/2)}^{min(m,k)} P(X = m and Y = i)   (3.100)
       = \sum_{i=ceil(m/2)}^{min(m,k)} P(X = m | Y = i) P(Y = i)   (3.101)
       = \sum_{i=ceil(m/2)}^{min(m,k)} \binom{i}{m-i} 0.5^i \binom{k}{i} 0.5^k   (3.102)
       = 0.5^k \sum_{i=ceil(m/2)}^{min(m,k)} \frac{k!}{(m-i)! (2i-m)! (k-i)!} 0.5^i   (3.103)

There doesn't seem to be much further simplification possible here.
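But (3.103) is simple to evaluate numerically. Here is a sketch; the function name pXm is ours, and
the simulation cross-check exploits the conditioning on Y described above:

# pXm is a hypothetical helper evaluating (3.103); valid for 0 <= m <= 2k
pXm <- function(m, k) {
   i <- ceiling(m/2):min(m, k)
   sum(choose(i, m-i) * 0.5^i * choose(k, i) * 0.5^k)
}
# simulation cross-check, for k = 4 and m = 3
nreps <- 100000
xs <- replicate(nreps, {
   y <- rbinom(1, 4, 0.5)   # heads among the 4 nonbonus flips
   y + rbinom(1, y, 0.5)    # plus heads among the y bonus flips
})
mean(xs == 3); pXm(3, 4)    # both should be near 0.21875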
3.12.2.3 Example: Analysis of Social Networks
Let's continue our earlier discussion from Section 3.11.3.
One of the earliest, and now the simplest, models of social networks is due to Erdos and Renyi.
Say we have n people (or n Web sites, etc.), with \binom{n}{2} potential links between pairs. (We are
assuming an undirected graph here.) In this model, each potential link is an actual link with
probability p, and a nonlink with probability 1-p, with all the potential links being independent.
Recall the notion of degree distribution from Section 3.11.3. Clearly the degree distribution here
for a single node is binomial with parameters n-1 and p. But consider k nodes, and the total T of
their degrees. Let's find the distribution of T.
That distribution is again binomial, but the number of trials is not k(n-1), due to overlap. There
are \binom{k}{2} potential links among these k nodes, and each of the k nodes has n-k potential links to
the "outside world," i.e. to the remaining n-k nodes. So, the distribution of T is binomial with

k(n-k) + \binom{k}{2}   (3.104)

trials and success probability p.
3.12.3 The Negative Binomial Family of Distributions
Recall that a typical example of the geometric distribution family (Section 3.12.1) arises as N, the
number of tosses of a coin needed to get our first head. Now generalize that, with N now being
the number of tosses needed to get our r-th head, where r is a fixed value. Let's find P(N = k), k
= r, r+1, ... For concreteness, look at the case r = 3, k = 5. In other words, we are finding the
probability that it will take us 5 tosses to accumulate 3 heads.
First note the equivalence of two events:

{N = 5} = {2 heads in the first 4 tosses and head on the 5th toss}   (3.105)

The event described before the "and" corresponds to a binomial probability:

P(2 heads in the first 4 tosses) = \binom{4}{2} \left(\frac{1}{2}\right)^4   (3.106)

Since the probability of a head on the 5th toss is 1/2 and the tosses are independent, we find that

P(N = 5) = \binom{4}{2} \left(\frac{1}{2}\right)^5 = \frac{3}{16}   (3.107)

The negative binomial distribution family, indexed by parameters r and p, corresponds to random
variables that count the number of independent trials with success probability p needed until we
get r successes. The pmf is

P(N = k) = \binom{k-1}{r-1} (1-p)^{k-r} p^r, k = r, r+1, ...   (3.108)
We can write

N = G_1 + ... + G_r   (3.109)

where G_i is the number of tosses between success number i-1 and success number i. But each G_i has a
geometric distribution! Since the mean of that distribution is 1/p, we have that

E(N) = r \cdot \frac{1}{p}   (3.110)

In fact, those r geometric variables are also independent, so we know the variance of N is the sum
of their variances:

Var(N) = r \cdot \frac{1-p}{p^2}   (3.111)
3.12.3.1 Example: Backup Batteries
A machine contains one active battery and two spares. Each battery has a 0.1 chance of failure
each month. Let L denote the lifetime of the machine, i.e. the time in months until the third
battery failure. Find P(L = 12).
The number of months until the third failure has a negative binomial distribution, with r = 3 and
p = 0.1. Thus the answer is obtained from the negative binomial pmf (3.108), with k = 12:

P(L = 12) = \binom{11}{2} (1 - 0.1)^9 \cdot 0.1^3   (3.112)
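We can evaluate (3.112) directly, or via R's dnbinom(); note that dnbinom() counts the failures
before the r-th success, so we pass 12 - 3 = 9. A sketch:

choose(11,2) * 0.9^9 * 0.1^3   # direct computation of (3.112), about 0.0213
dnbinom(9, 3, 0.1)             # same value, via R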
3.12.4 The Poisson Family of Distributions
Another famous parametric family of distributions is the set of Poisson distributions.
This family is a little different from the geometric, binomial and negative binomial families, in
the sense that in those cases there were qualitative descriptions of the settings in which such
distributions arise. Geometrically distributed random variables, for example, occur as the number
of Bernoulli trials needed to get the first success.
By contrast, the Poisson family does not really have this kind of qualitative description.⁷ It is
merely something that people have found to be a reasonably accurate model of actual data. We
might be interested, say, in the number of disk drive failures in periods of a specified length of time.
If we have data on this, we might graph it, and if it looks like the pmf form below, then we might
adopt it as our model.
The pmf is

P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}, k = 0, 1, 2, ...   (3.113)

⁷Some such descriptions are possible in the Poisson case, but they are complicated and difficult to verify.
It turns out that

EX = \lambda   (3.114)
Var(X) = \lambda   (3.115)

The derivations of these facts are similar to those for the geometric family in Section 3.12.1. One
starts with the Maclaurin series expansion for e^t:

e^t = \sum_{i=0}^{\infty} \frac{t^i}{i!}   (3.116)

and finds its derivative with respect to t, and so on. The details are left to the reader.
The Poisson family is very often used to model count data. For example, if you go to a certain
bank every day and count the number of customers who arrive between 11:00 and 11:15 a.m., you
will probably find that that distribution is well approximated by a Poisson distribution for some \lambda.
There is a lot more to the Poisson story than we see in this short section. We'll return to this
distribution family in Section 4.5.4.5.
3.12.4.1 R Functions
Relevant functions for a Poisson distributed random variable X with parameter lambda are:
dpois(i,lambda), to find P(X = i)
ppois(i,lambda), to find P(X ≤ i)
qpois(q,lambda), to find c such that P(X ≤ c) = q
rpois(n,lambda), to generate n independent values of X
3.12.5 The Power Law Family of Distributions
Here

p_X(k) = c k^{-\gamma}, k = 1, 2, 3, ...   (3.117)

It is required that \gamma > 1, as otherwise the sum of the probabilities will be infinite. For \gamma satisfying that
condition, the value c is chosen so that that sum is 1.0:

1.0 = \sum_{k=1}^{\infty} c k^{-\gamma} \approx c \int_{1}^{\infty} k^{-\gamma} dk = c/(\gamma - 1)   (3.118)

so c \approx \gamma - 1.
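Since the integral comparison in (3.118) is only an approximation, in practice one can obtain c
numerically, by normalizing a (truncated) version of the sum. A sketch in R:

gam <- 2.1
cval <- 1 / sum((1:1000000)^(-gam))   # normalizing constant for this gamma
cval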
Here again we have a parametric family of distributions, indexed by the parameter \gamma.
The power law family is an old-fashioned model (an old-fashioned term for distribution is "law"),
but there has been a resurgence of interest in it in recent years. Analysts have found that many
types of social networks in the real world exhibit approximately power law behavior in their degree
distributions.
For instance, in a famous study of the Web (A. Barabasi and R. Albert, Emergence of Scaling in
Random Networks, Science, 1999, 509-512) of degree distribution on the Web (a directed graph, with
incoming links being the ones of interest here), it was found that the number of links leading to a
Web page has an approximate power law distribution with \gamma = 2.1. The number of links leading
out of a Web page was found to be approximately power-law distributed, with \gamma = 2.7.
Much of the interest in power laws stems from their "fat tails," a term meaning that values far
from the mean are more likely under a power law than they would be under a normal distribution
with the same mean. In recent popular literature, values far from the mean have often been called
"black swans." The financial crash of 2008, for example, is blamed by some on the ignorance by
"quants" (people who develop probabilistic models for guiding investment) in underestimating the
probabilities of values far from the mean.
Some examples of real data that are, or are not, fit well by power law models are given in the
paper Power-Law Distributions in Empirical Data, by A. Clauset, C. Shalizi and M. Newman, at
http://arxiv.org/abs/0706.1062. Methods for estimating the parameter \gamma are discussed and
evaluated.
A variant of the power law model is the power law with exponential cutoff, which essentially
consists of a blend of the power law and a geometric distribution. Here

p_X(k) = c k^{-\gamma} q^k   (3.119)

This now is a two-parameter family, the parameters being \gamma and q. Again c is chosen so that the
pmf sums to 1.0.
This model is said to work better than a pure power law for some types of data. Note, though,
that this version does not really have the fat tail property, as the tail decays exponentially now.
3.13 Recognizing Some Parametric Distributions When You See Them
Three of the discrete distribution families we've considered here arise in settings with very definite
structure, all dealing with independent trials:
the binomial family gives the distribution of the number of successes in a fixed number of trials
the geometric family gives the distribution of the number of trials needed to obtain the first success
the negative binomial family gives the distribution of the number of trials needed to obtain the
r-th success
Such situations arise often, hence the fame of these distribution families.
By contrast, the Poisson and power law distributions have no underlying structure. They are
famous for a different reason: it has been found empirically that they provide a good fit to
many real data sets.
In other words, the Poisson and power law distributions are typically fit to data, in an attempt to find
a good model, whereas in the binomial, geometric and negative binomial cases, the fundamental
nature of the setting implies one of those distributions.
You should make a strong effort to get to the point at which you automatically
recognize such settings when you encounter them.
3.13.1 Example: a Coin Game
"Life is unfair" (former President Jimmy Carter)
Consider a game played by Jack and Jill. Each of them tosses a coin many times, but Jack gets a
head start of two tosses. So by the time Jack has had, for instance, 8 tosses, Jill has had only 6;
when Jack tosses for the 15th time, Jill has her 13th toss; etc.
Let X_k denote the number of heads Jack has gotten through his k-th toss, and let Y_k be the head
count for Jill at that same time, i.e. among only k-2 tosses for her. (So, Y_1 = Y_2 = 0.) Let's find
the probability that Jill is winning after Jack's 6th toss, i.e. P(Y_6 > X_6).
Your first reaction might be, "Aha, binomial distribution!" You would be on the right track, but
the problem is that you would not be thinking precisely enough. Just WHAT has a binomial
distribution? The answer is that both X_6 and Y_6 have binomial distributions, both with p = 0.5,
but n = 6 for X_6 while n = 4 for Y_6.
Now, as usual, ask the famous question, "How can it happen?" How can it happen that Y_6 > X_6?
Well, we could have, for example, Y_6 = 3 and X_6 = 1, as well as many other possibilities. Let's
write it mathematically:

P(Y_6 > X_6) = \sum_{i=1}^{4} \sum_{j=0}^{i-1} P(Y_6 = i and X_6 = j)   (3.120)

Make SURE you understand this equation.
Now, to evaluate P(Y_6 = i and X_6 = j), we see the "and" so we ask whether Y_6 and X_6 are
independent. They in fact are; Jill's coin tosses certainly don't affect Jack's. So,

P(Y_6 = i and X_6 = j) = P(Y_6 = i) \cdot P(X_6 = j)   (3.121)

It is at this point that we finally use the fact that X_6 and Y_6 have binomial distributions. We have

P(Y_6 = i) = \binom{4}{i} 0.5^i (1-0.5)^{4-i}   (3.122)

and

P(X_6 = j) = \binom{6}{j} 0.5^j (1-0.5)^{6-j}   (3.123)
We would then substitute (3.122) and (3.123) in (3.120). We could then evaluate it by hand, but
it would be more convenient to use R's dbinom() function:

prob <- 0
for (i in 1:4)
   for (j in 0:(i-1))
      prob <- prob + dbinom(i,4,0.5) * dbinom(j,6,0.5)
print(prob)

We get an answer of about 0.17. If Jack and Jill were to play this game repeatedly, stopping each
time after the 6th toss, then Jill would win about 17% of the time.
3.13.2 Example: Tossing a Set of Four Coins
Consider a game in which we have a set of four coins. We keep tossing the set of four until we have
a situation in which exactly two of them come up heads. Let N denote the number of times we must
toss the set of four coins.
For instance, on the first toss of the set of four, the outcome might be HTHH. The second might
be TTTH, and the third could be THHT. In that situation, N = 3.
Let's find P(N = 5). Here we recognize that N has a geometric distribution, with "success" defined
as getting two heads in our set of four coins. What value does the parameter p have here?
Well, p is P(X = 2), where X is the number of heads we get from a toss of the set of four coins.
We recognize that X is binomial! Thus

p = \binom{4}{2} 0.5^4 = \frac{3}{8}   (3.124)

Thus, using the fact that N has a geometric distribution,

P(N = 5) = (1-p)^4 p \approx 0.057   (3.125)
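We can confirm this with R's dgeom(), remembering that it counts the failures before the first
success. A sketch:

p <- choose(4,2) * 0.5^4   # P(exactly two heads in a toss of the four) = 3/8
(1 - p)^4 * p              # P(N = 5), about 0.057
dgeom(4, p)                # same value, via R (4 failures, then the first success)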
3.13.3 Example: the ALOHA Example Again
As an illustration of how commonly these parametric families arise, let's again look at the ALOHA
example. Consider the general case, with transmission probability p, message creation probability
q, and m network nodes. We will not restrict our observation to just two epochs.
Suppose X_i = m, i.e. at the end of epoch i all nodes have a message to send. Then the number
which attempt to send during epoch i+1 will be binomially distributed, with parameters m and p.⁸
For instance, the probability that there is a successful transmission is equal to the probability that
exactly one of the m nodes attempts to send,

\binom{m}{1} p (1-p)^{m-1} = m p (1-p)^{m-1}   (3.126)

Now in that same setting, X_i = m, let K be the number of epochs it will take before some message
actually gets through. In other words, we will have X_{i+1} = m, X_{i+2} = m, ..., X_{i+K-1} = m, but finally
X_{i+K} = m - 1. Then K will be geometrically distributed, with success probability equal to
(3.126).
⁸Note that this is a conditional distribution, given X_i = m.
There is no Poisson distribution in this example, but the Poisson family is central to the analysis of
Ethernet, and almost any other network. We will discuss this at various points in later chapters.
3.14 Example: the Bus Ridership Problem Again
Recall the bus ridership example of Section 2.11. Let's calculate some expected values, for instance
E(B_1):

E(B_1) = 0 \cdot P(B_1 = 0) + 1 \cdot P(B_1 = 1) + 2 \cdot P(B_1 = 2) = 0.4 + 2 \cdot 0.1 = 0.6   (3.127)
Now suppose the company charges $3 for passengers who board at the first stop, but charges $2
for those who join at the second stop. (The latter passengers get a possibly shorter ride, thus pay
less.) So, the total revenue from the first two stops is T = 3B_1 + 2B_2. Let's find E(T). We'd write

E(T) = 3 E(B_1) + 2 E(B_2)   (3.128)

making use of (3.15). We'd then compute the terms as in (3.127).
Suppose the bus driver has the habit of exclaiming, "What? No new passengers?!" every time he comes to a stop at which B_i = 0. Let N denote the number of the stop (1,2,...) at which this first occurs. Find P(N = 3):
N has a geometric distribution, with p equal to the probability that there are 0 new passengers at a stop, i.e. 0.5. Thus p_N(3) = (1 - 0.5)^2 · 0.5, by (3.76).
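As a check, dgeom() again applies; since it counts failures before the first success, p_N(3) corresponds to 2 failures:

dgeom(2, 0.5)  # (1-0.5)^2 * 0.5 = 0.125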
Let T denote the number of stops, out of the first 6, at which 2 new passengers board. For example, T would be 3 if B_1 = 2, B_2 = 2, B_3 = 0, B_4 = 1, B_5 = 0, and B_6 = 2. Find p_T(4):
T has a binomial distribution, with n = 6 and p = probability of 2 new passengers at a stop = 0.1. Then

p_T(4) = C(6,4) · 0.1^4 · (1 - 0.1)^(6-4)   (3.129)
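As a check, (3.129) can be evaluated directly with dbinom():

dbinom(4, 6, 0.1)  # C(6,4) * 0.1^4 * 0.9^2, approximately 0.0012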
By the way, we can exploit our knowledge of binomial distributions to simplify the simulation code in Section 2.12.4. The lines

for (k in 1:passengers)
   if (runif(1) < 0.2)
      passengers <- passengers - 1

simulate finding the number of passengers who alight at that stop. But that number is binomially distributed, so the above code can be compactified (and sped up in execution) as

passengers <- passengers - rbinom(1,passengers,0.2)
3.15 A Preview of Markov Chains
Here we introduce Markov chains, a topic covered in much more detail in Chapter 17.
The basic idea is that we have random variables X_1, X_2, ..., with the index representing time. Each one can take on any value in a given set, called the state space; X_n is then the state of the system at time n.
The key aspect is that we assume the Markov property, which in rough terms can be described as:
    The probabilities of future states, given the present state and the past states, depend only on the present state; the past is irrelevant.
In formal terms:

P(X_{t+1} = s_{t+1} | X_t = s_t, X_{t-1} = s_{t-1}, ..., X_0 = s_0) = P(X_{t+1} = s_{t+1} | X_t = s_t)   (3.130)
We define p_ij to be the probability of going from state i to state j in one time step. This forms a matrix P, whose row i, column j element is p_ij, which is called the transition matrix.
3.15.1 Example: Die Game
As our first example of Markov chains, consider the following game. One repeatedly rolls a die, keeping a running total. Each time the total exceeds 10, we receive one dollar, and continue playing, resuming where we left off, mod 10. Say for instance we have a total of 8, then roll a 5. We receive a dollar, and now our total is 3.
This process clearly satisfies the Markov property. If our current total is 6, for instance, then the probability that we next have a total of 9 is 1/6, regardless of what happened on our previous rolls. We have p_25, p_72 and so on all equal to 1/6, while for instance p_29 = 0. Here's the code to find P:
p <- matrix(rep(0,100),nrow=10)
onesixth <- 1/6
for (i in 1:10) {
   for (j in 1:6) {
      k <- i + j
      if (k > 10) k <- k - 10
      p[i,k] <- onesixth
   }
}
3.15.2 Long-Run State Probabilities
Let N_it denote the number of times we have visited state i during times 1,...,t. Then as discussed in Section 17.1.2, in typical applications

π_i = lim_{t→∞} N_it / t   (3.131)
exists for each state i. Under a couple more conditions, we have the stronger result,

lim_{t→∞} P(X_t = i) = π_i   (3.132)
These quantities π_i are typically the focus of analyses of Markov chains.
In Chapter 17 it is shown that the π_i are easy to find (in the case of finite state spaces, the subject of this section here), by solving the matrix equation

(I - P') π = 0   (3.133)

subject to the constraint

Σ_i π_i = 1   (3.134)
Here I is the identity matrix, and ' denotes matrix transpose. R code to do all this (after some algebraic manipulations), findpi1(), is provided in Section 17.1.2.2, reproduced here for convenience:

findpi1 <- function(p) {
   n <- nrow(p)
   imp <- diag(n) - t(p)  # I - P'
   imp[n,] <- rep(1,n)    # replace the last equation by the sum constraint
   rhs <- c(rep(0,n-1),1)
   pivec <- solve(imp,rhs)
   return(pivec)
}
Consider the die game example above. Guess what! All the π_i turn out to be 1/10. In retrospect, this should be obvious. If we were to draw the states 1 through 10 as a ring, with 1 following 10, it should be clear that all the states are completely symmetric.
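As a quick check, we can feed the matrix p built in the die game code above to findpi1():

findpi1(p)  # a vector of 10 entries, each equal to 0.1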
3.15.3 Example: 3-Heads-in-a-Row Game
How about the following game? We keep tossing a coin until we get three consecutive heads. What is the expected value of the number of tosses we need?
We can model this as a Markov chain with states 0, 1, 2 and 3, where state i means that we have accumulated i consecutive heads so far. If we simply stop playing the game when we reach state 3, that state would be known as an absorbing state, one that we never leave.
We could proceed on this basis, but to keep things elementary, let's just model the game as being played repeatedly, as in the die game above. You'll see that that will still allow us to answer the original question. Note that now that we are taking that approach, it will suffice to have just three states, 0, 1 and 2.
Clearly we have transition probabilities such as p_01, p_12, p_10 and so on all equal to 1/2. Note that from state 2 we can only go to state 0, so p_20 = 1.
Here's the code below. Of course, since R subscripts start at 1 instead of 0, we must recode our states as 1, 2 and 3.
p <- matrix(rep(0,9),nrow=3)
onehalf <- 1/2
p[1,1] <- onehalf
p[1,2] <- onehalf
p[2,3] <- onehalf
p[2,1] <- onehalf
p[3,1] <- 1
findpi1(p)
It turns out that

π = (0.5714286, 0.2857143, 0.1428571)   (3.135)

So, in the long run, about 57.1% of our tosses will be done while in state 0, 28.6% while in state 1, and 14.3% in state 2.
Now, look at that latter figure. Of the tosses we do while in state 2, half will be heads, so half will be wins. In other words, about 0.071 of our tosses will be wins. And THAT figure answers our original question, through the following reasoning:
Think of, say, 10000 tosses. There will be about 710 wins sprinkled among those 10000 tosses. Thus the average number of tosses between wins will be about 10000/710 = 14.1. In other words, the expected time until we get three consecutive heads is about 14.1 tosses.
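A simulation gives a sanity check on that 14.1 figure. Here is a minimal sketch; the function name sim3heads and the 10000 repetitions are arbitrary choices, not from the text:

sim3heads <- function(nreps) {
   total <- 0
   for (rep in 1:nreps) {
      consec <- 0  # current run of consecutive heads
      ntoss <- 0
      while (consec < 3) {
         ntoss <- ntoss + 1
         if (runif(1) < 0.5) consec <- consec + 1 else consec <- 0
      }
      total <- total + ntoss
   }
   total / nreps  # average number of tosses needed
}
sim3heads(10000)  # approximately 14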
3.15.4 Example: ALOHA
Consider our old friend, the ALOHA network model. (You may wish to review the statement of the
model in Section 2.5 before continuing.) The key point in that system is that it was memoryless,
in that the probability of what happens at time k+1 depends only on the state of the system at
time k.
For instance, consider what might happen at time 6 if X_5 = 2. Recall that the latter means that at the end of epoch 5, both of our two network nodes were active. The possibilities for X_6 are then:
• X_6 will be 2 again, with probability p^2 + (1-p)^2
• X_6 will be 1, with probability 2p(1-p)
The central point here is that the past history of the system, i.e. the values of X_1, X_2, X_3, X_4 and X_5, don't have any impact. We can state that precisely:
The quantity

P(X_6 = j | X_1 = i_1, X_2 = i_2, X_3 = i_3, X_4 = i_4, X_5 = i)   (3.136)

does not depend on i_m, m = 1,...,4. Thus we can write (3.136) simply as P(X_6 = j | X_5 = i).
Furthermore, that probability is the same as P(X_9 = j | X_8 = i) and in general P(X_{k+1} = j | X_k = i). We denote this probability by p_ij, and refer to it as the transition probability from state i to state j.
Since this is a three-state chain, the p_ij form a 3x3 matrix:

P =
  [ (1-q)^2 + 2q(1-q)p    2q(1-q)(1-p) + 2q^2 p(1-p)    q^2 [p^2 + (1-p)^2] ]
  [ (1-q)p                2qp(1-p) + (1-q)(1-p)         q [p^2 + (1-p)^2]   ]
  [ 0                     2p(1-p)                       p^2 + (1-p)^2       ]   (3.137)
For instance, the element in row 0, column 2, p_02, is q^2 [p^2 + (1-p)^2], reflecting the fact that to go from state 0 to state 2 would require that both inactive nodes become active (which has probability q^2), and then either both try to send or both refrain from sending (probability p^2 + (1-p)^2).
For the ALOHA example here, with p = 0.4 and q = 0.3, the solution is π_0 = 0.47, π_1 = 0.43 and π_2 = 0.10.
So we know that in the long run, about 47% of the epochs will have no active nodes, 43% will have one, and 10% will have two. From this we see that the long-run average number of active nodes is

0 · 0.47 + 1 · 0.43 + 2 · 0.10 = 0.63   (3.138)
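As a check, we can build (3.137) in R and feed it to findpi1() from above. This is a sketch; the function name aloha_tm is ours, not the book's, and states 0, 1, 2 are recoded as rows 1, 2, 3:

aloha_tm <- function(p,q) {
   tm <- matrix(0, nrow=3, ncol=3)
   tm[1,1] <- (1-q)^2 + 2*q*(1-q)*p
   tm[1,2] <- 2*q*(1-q)*(1-p) + 2*q^2*p*(1-p)
   tm[1,3] <- q^2 * (p^2 + (1-p)^2)
   tm[2,1] <- (1-q)*p
   tm[2,2] <- 2*q*p*(1-p) + (1-q)*(1-p)
   tm[2,3] <- q * (p^2 + (1-p)^2)
   tm[3,2] <- 2*p*(1-p)
   tm[3,3] <- p^2 + (1-p)^2
   tm
}
findpi1(aloha_tm(0.4,0.3))  # approximately (0.47, 0.43, 0.10)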
3.15.5 Example: Bus Ridership Problem
Consider the bus ridership problem in Section 2.11. Make the same assumptions now, but add a new one: there is a maximum capacity of 20 passengers on the bus.
The random variables L_i, i = 1,2,3,... form a Markov chain. Let's look at some of the transition probabilities:
p_00 = 0.5   (3.139)

p_01 = 0.4   (3.140)

p_20 = (0.2)^2 (0.5) = 0.02   (3.141)

p_{20,20} = (0.8)^20 + 20 (0.2)(0.8)^19 (0.4) + 190 (0.2)^2 (0.8)^18 (0.1)   (3.142)
After finding the π vector as above, we can find quantities such as the long-run average number of passengers on the bus,

Σ_{i=0}^{20} π_i i   (3.143)

and the long-run average number of would-be passengers who fail to board the bus,

1 · [π_19 (0.1) + π_20 (0.4)] + 2 · [π_20 (0.1)]   (3.144)
3.15.6 An Inventory Model
Consider the following simple inventory model. A store has 1 or 2 customers for a certain item each day, with probabilities p and q (p + q = 1). Each customer is allowed to buy only 1 item.
When the stock on hand reaches 0 on a day, it is replenished to r items immediately after the store closes that day.
If at the start of a day the stock is only 1 item and 2 customers wish to buy the item, only one customer will complete the purchase, and the other customer will leave empty-handed.
Let X_n be the stock on hand at the end of day n (after replenishment, if any). Then X_1, X_2, ... form a Markov chain, with state space 1,2,...,r.
Let's write a function inventory(p,q,r) that returns the π vector for this Markov chain. It will call findpi1(), similarly to the two code snippets on page ??.
inventory <- function(p,q,r) {
   tm <- matrix(rep(0,r^2),nrow=r)
   for (i in 3:r) {
      tm[i,i-1] <- p
      tm[i,i-2] <- q
   }
   tm[2,1] <- p
   tm[2,r] <- q
   tm[1,r] <- 1
   return(findpi1(tm))
}
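For instance, with hypothetical parameters (say 1 customer per day with probability 0.7, 2 with probability 0.3, and restocking level r = 5; these numbers are ours, chosen only for illustration), the call would be:

inventory(0.7,0.3,5)  # the stationary vector pi for states 1,...,5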
3.16 A Cautionary Tale
3.16.1 Trick Coins, Tricky Example
Suppose we have two trick coins in a box. They look identical, but one of them, denoted coin 1, is heavily weighted toward heads, with a 0.9 probability of heads, while the other, denoted coin 2, is biased in the opposite direction, with a 0.9 probability of tails. Let C_1 and C_2 denote the events that we get coin 1 or coin 2, respectively.
Our experiment consists of choosing a coin at random from the box, and then tossing it n times. Let B_i denote the outcome of the i-th toss, i = 1,2,3,..., where B_i = 1 means heads and B_i = 0 means tails. Let X_i = B_1 + ... + B_i, so X_i is a count of the number of heads obtained through the i-th toss.
The question is: Does the random variable X_i have a binomial distribution? Or, more simply, the question is: Are the random variables B_i independent? To most people's surprise, the answer is No (to both questions). Why not?
The variables B_i are indeed 0-1 variables, and they have a common success probability. But they are not independent! Let's see why they aren't.
Consider the events A_i = {B_i = 1}, i = 1,2,3,... In fact, just look at the first two. By definition, they are independent if and only if

P(A_1 and A_2) = P(A_1) P(A_2)   (3.145)
First, what is P(A_1)? Now, wait a minute! Don't answer, "Well, it depends on which coin we get," because this is NOT a conditional probability. Yes, the conditional probabilities P(A_1|C_1) and P(A_1|C_2) are 0.9 and 0.1, respectively, but the unconditional probability is P(A_1) = 0.5. You can deduce that either by the symmetry of the situation, or by

P(A_1) = P(C_1) P(A_1|C_1) + P(C_2) P(A_1|C_2) = (0.5)(0.9) + (0.5)(0.1) = 0.5   (3.146)
You should think of all this in the notebook context. Each line of the notebook would consist of a report of three things: which coin we get; the outcome of the first toss; and the outcome of the second toss. (Note by the way that in our experiment we don't know which coin we get, but conceptually it should have a column in the notebook.) If we do this experiment for many, many lines in the notebook, about 90% of the lines in which the coin column says 1 will show Heads in the second column. But 50% of the lines overall will show Heads in that column.
So, the right-hand side of Equation (3.145) is equal to 0.25. What about the left-hand side?

P(A_1 and A_2) = P(A_1 and A_2 and C_1) + P(A_1 and A_2 and C_2)   (3.147)
              = P(A_1 and A_2 | C_1) P(C_1) + P(A_1 and A_2 | C_2) P(C_2)   (3.148)
              = (0.9)^2 (0.5) + (0.1)^2 (0.5)   (3.149)
              = 0.41   (3.150)
Well, 0.41 is not equal to 0.25, so you can see that the events are not independent, contrary to our first intuition. And that also means that X_i is not binomial.
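A short simulation makes the nonindependence vivid; each iteration below is one line of the notebook (the 100000 lines are an arbitrary choice of ours):

nreps <- 100000
coin <- sample(1:2, nreps, replace=TRUE)    # which coin we draw
phead <- ifelse(coin == 1, 0.9, 0.1)        # heads probability for that coin
toss1 <- runif(nreps) < phead               # first toss
toss2 <- runif(nreps) < phead               # second toss
mean(toss1 & toss2)        # approximately 0.41
mean(toss1) * mean(toss2)  # approximately 0.25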
3.16.2 Intuition in Retrospect
To get some intuition here, think about what would happen if we tossed the chosen coin 10000 times instead of just twice. If the tosses were independent, then for example knowledge of the first 9999 tosses should not tell us anything about the 10000th toss. But that is not the case at all. After 9999 tosses, we are going to have a very good idea as to which coin we had chosen, because by that time we will have gotten about 9000 heads (in the case of coin C_1) or about 1000 heads (in the case of C_2). In the former case, we know that the 10000th toss is likely to be a head, while in the latter case it is likely to be tails. In other words, earlier tosses do indeed give us information about later tosses, so the tosses aren't independent.
3.16.3 Implications for Modeling
The lesson to be learned is that independence can definitely be a tricky thing, not to be assumed cavalierly. And in creating probability models of real systems, we must give very, very careful thought to the conditional and unconditional aspects of our models; it can make a huge difference, as we saw above. Also, the conditional aspects often play a key role in formulating models of nonindependence.
This trick coin example is just that (tricky), but similar situations occur often in real life. If in some medical study, say, we sample people at random from the population, the people are independent of each other. But if we sample families from the population, and then look at children within the families, the children within a family are not independent of each other.
3.17 Why Not Just Do All Analysis by Simulation?
Now that computer speeds are so fast, one might ask why we need to do mathematical probability analysis; why not just do everything by simulation? There are a number of reasons:
• Even with a fast computer, simulations of complex systems can take days, weeks or even months.
• Mathematical analysis can provide us with insights that may not be clear in simulation.
• Like all software, simulation programs are prone to bugs. The chance of having an uncaught bug in a simulation program is reduced by doing mathematical analysis for a special case of the system being simulated. This serves as a partial check.
• Statistical analysis is used in many professions, including engineering and computer science, and in order to conduct meaningful, useful statistical analysis, one needs a firm understanding of probability principles.

notebook line   Y      dZ   Y ≥ dZ?
1               0.36   0    yes
2               3.6    3    yes
3               2.6    0    yes

Table 3.2: Illustration of Y and Z
An example of that second point arose in the computer security research of a graduate student at UCD, Senthilkumar Cheetancheri, who was working on a way to more quickly detect the spread of a malicious computer worm. He was evaluating his proposed method by simulation, and found that things hit a wall at a certain point. He wasn't sure if this was a real limitation; maybe, for example, he just wasn't running his simulation on the right set of parameters to go beyond this limit. But a mathematical analysis showed that the limit was indeed real.
3.18 Proof of Chebychev's Inequality
To prove (3.42), let's first state and prove Markov's Inequality: For any nonnegative random variable Y,

P(Y ≥ d) ≤ EY/d   (3.151)
To prove (3.151), let Z be the indicator random variable for the event Y ≥ d (Section 3.6).
Now note that

Y ≥ dZ   (3.152)

To see this, just think of a notebook, say with d = 3. Then the notebook might look like Table 3.2. So

EY ≥ dEZ   (3.153)

(Again think of the notebook. The long-run average in the Y column will be at least as large as the corresponding average for the dZ column.)
The right-hand side of (3.153) is dP(Y ≥ d), so (3.151) follows.
Now to prove (3.42), define

Y = (X - μ)^2   (3.154)

and set d = c^2 σ^2. Then (3.151) says

P[(X - μ)^2 ≥ c^2 σ^2] ≤ E[(X - μ)^2] / (c^2 σ^2)   (3.155)
Since

(X - μ)^2 ≥ c^2 σ^2 if and only if |X - μ| ≥ cσ   (3.156)

the left-hand side of (3.155) is the same as the left-hand side of (3.42). The numerator of the right-hand side of (3.155) is simply Var(X), i.e. σ^2, so we are done.
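For a concrete feel for Markov's Inequality, here is a small numeric illustration, with Y taken to be U(0,1) and d = 0.75 (both arbitrary choices of ours for the sketch):

y <- runif(100000)
mean(y >= 0.75)  # P(Y >= d), approximately 0.25
0.5 / 0.75       # EY/d = 0.667, indeed an upper bound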
3.19 Reconciliation of Math and Intuition (optional section)
Here is a more theoretical definition of probability, as opposed to the intuitive notebook idea in this book. The definition is an abstraction of the notions of events (the sets A in J below) and probabilities of those events (the values of the function P(A)):

Definition 8 Let S be a set, and let J be a collection of subsets of S. Let P be a real-valued function on J. Then S, J and P form a probability space if the following conditions hold:
• S ∈ J.
• J is closed under complements (if a set is in J, then the set's complement with respect to S is in J too) and under unions of countably many members of J.
• P(A) ≥ 0 for any A in J.
• If A_1, A_2, ... ∈ J and the A_i are pairwise disjoint, then

P(∪_i A_i) = Σ_i P(A_i)   (3.157)
A random variable is any function X : S → R. (The function must also have a property called measurability, which we will not discuss here.)
Using just these simple axioms, one can prove (with lots of heavy math) theorems like the Strong Law of Large Numbers:

Theorem 9 Consider a random variable U, and a sequence of independent random variables U_1, U_2, ... which all have the same distribution as U. Then

lim_{n→∞} (U_1 + ... + U_n)/n = E(U) with probability 1   (3.158)
In other words, the average value of U in all the lines of the notebook will indeed converge to EU.
Exercises
1. Consider a game in which one rolls a single die until one accumulates a total of at least four dots. Let X denote the number of rolls needed. Find P(X > 2) and E(X).
2. Recall the committee example in Section 3.7. Suppose now, though, that the selection protocol is that there must be at least one man and at least one woman on the committee. Find E(D) and Var(D).
3. Suppose a bit stream is subject to errors, with each bit having probability p of error, and with
the bits being independent. Consider a set of four particular bits. Let X denote the number of
erroneous bits among those four.
(a) Find P(X = 2) and EX.
(b) What famous parametric family of distributions does the distribution of X belong to?
(c) Let Y denote the maximum number of consecutive erroneous bits. Find P(Y = 2) and Var(Y).
4. Derive (3.85).
5. Finish the computation in (3.91).
6. Derive the facts that for a Poisson-distributed random variable X with parameter λ, EX = Var(X) = λ. Use the hints in Section 3.12.4.
7. A civil engineer is collecting data on a certain road. She needs to have data on 25 trucks, and 10 percent of the vehicles on that road are trucks. State the famous parametric family that is relevant here, and find the probability that she will need to wait for more than 200 vehicles to pass before she gets the needed data.
8. In the ALOHA example:
(a) Find E(X_1) and Var(X_1), for the case p = 0.4, q = 0.8. You are welcome to use quantities already computed in the text, e.g. P(X_1 = 1) = 0.48, but be sure to cite equation numbers.
(b) Find P(collision during epoch 1) for general p, q.
9. Our experiment is to toss a nickel until we get a head, taking X tosses, and then toss a dime until we get a head, taking Y tosses. Find:
(a) Var(X+Y).
(b) The long-run average in a notebook column labeled X^2.
10. Consider the game in Section 3.13.1. Find E(Z) and Var(Z), where Z = Y_6 - X_6.
11. Say we choose six cards from a standard deck, one at a time WITHOUT replacement. Let N be the number of kings we get. Does N have a binomial distribution? Choose one: (i) Yes. (ii) No, since trials are not independent. (iii) No, since the probability of success is not constant from trial to trial. (iv) No, since the number of trials is not fixed. (v) (ii) and (iii). (vi) (ii) and (iv). (vii) (iii) and (iv).
12. Suppose we have n independent trials, with the probability of success on the i-th trial being p_i. Let X = the number of successes. Use the fact that "the variance of the sum is the sum of the variances" for independent random variables to derive Var(X).
13. Prove Equation (3.30).
14. Show that if X is a nonnegative-integer valued random variable, then

EX = Σ_{i=1}^∞ P(X ≥ i)   (3.159)

Hint: Write i = Σ_{j=1}^{i} 1, and when you see an iterated sum, reverse the order of summation.
15. Suppose we toss a fair coin n times, resulting in X heads. Show that the term expected value is a misnomer, by showing that

lim_{n→∞} P(X = n/2) = 0   (3.160)

Use Stirling's approximation,

k! ≈ √(2πk) (k/e)^k   (3.161)
16. Suppose X and Y are independent random variables with standard deviations 3 and 4, respectively.
(a) Find Var(X+Y).
(b) Find Var(2X+Y).
17. Fill in the blanks in the following simulation, which finds the approximate variance of N, the number of rolls of a die needed to get the face having just one dot.
onesixth <- 1/6
sumn <- 0
sumn2 <- 0
for (i in 1:10000) {
n <- 0
while(TRUE) {
________________________________________
if (______________________________ < onesixth) break
}
sumn <- sumn + n
sumn2 <- sumn2 + n^2
}
approxvarn <- ____________________________________________
cat("the approx. value of Var(N) is ",approx,"\n")
18. Let X be the total number of dots we get if we roll three dice. Find an upper bound for P(X ≥ 15), using our course materials.
19. Suppose X and Y are independent random variables, and let Z = XY. Show that Var(Z) = E(X^2) E(Y^2) - [E(X)]^2 [E(Y)]^2.
20. This problem involves a very simple model of the Web. (Far more complex ones exist.)
Suppose we have n Web sites. For each pair of sites i and j, i ≠ j, there is a link from i to j with probability p, and no link (in that direction) with probability 1-p. Let N_i denote the number of sites that site i is linked to; note that N_i can range from 0 to n-1. Also, let M_ij denote the number of outgoing links that i and j have in common, not counting the one between them, if any. Assume that each site forms its outgoing links independently of the others.
Say n = 10, p = 0.2. Find the following:
(a) P(N_1 = 3)
(b) P(N_1 = 3 and N_2 = 2)
(c) Var(N_1)
(d) Var(N_1 + N_2)
(e) P(M_12 = 4)
Note: There are some good shortcuts in some of these problems, making the work much easier.
But you must JUSTIFY your work.
21. Let X denote the number of heads we get by tossing a coin 50 times. Consider Chebychev's
Inequality for the case of 2 standard deviations. Compare the upper bound given by the inequality
to the exact probability.
22. Suppose the number N of cars arriving during a given time period at a toll booth has a Poisson distribution with parameter λ. Each car has a probability p of being in a car pool. Let M be the number of car-pool cars that arrive in the given period. Show that M also has a Poisson distribution, with parameter pλ. (Hint: Use the Maclaurin series for e^x.)
23. Consider a three-sided die, as on page 31.
(a) (10) State the value of p_X(2).
(b) (10) Find EX and Var(X).
(c) (15) Suppose you win $2 for each dot. Find EW, where W is the amount you win.
24. Consider the parking space problem in Section 3.12.1.2. Find Var(M), where M is the number of empty spaces in the first block, and Var(D).
25. Suppose X and Y are independent, with variances 1 and 2, respectively. Find the value of c
that minimizes Var[cX + (1-c)Y].
26. In the cards example in Section 2.13.1, let H denote the number of hearts. Find EH and Var(H).
27. In the bank example in Section 3.12.4, suppose you observe the bank for n days. Let X denote
the number of days in which at least 2 customers entered during the 11:00-11:15 observation period.
Find P(X = k).
28. Find E(X^3), where X has a geometric distribution with parameter p.
29. Suppose we have a nonnegative random variable X, and define a new random variable Y, which is equal to X if X > 8 and equal to 0 otherwise. Assume X takes on only a finite number of values (just a mathematical nicety, not really an issue). Which one of the following is true:
(i) EY ≤ EX.
(ii) EY ≥ EX.
(iii) Either of EY and EX could be larger than the other, depending on the situation.
(iv) EY is undefined.
30. Say we roll two dice, a blue one and a yellow one. Let B and Y denote the number of dots we get, respectively, and let S = B + Y. Now let G denote the indicator random variable for the event S = 2. Find E(G).
31. Suppose I_1, I_2 and I_3 are independent indicator random variables, with P(I_j = 1) = p_j, j = 1,2,3. Find the following in terms of the p_j, writing your derivation with reasons in the form of mailing tube numbers.
32. Consider the ALOHA example, Section 3.13.3. Write a call to the built-in R function dbinom() to evaluate (3.126) for general m and p.
33. Consider the bus ridership example, Section 2.11. Suppose upon arrival to a certain stop, there
are 2 passengers. Let A denote the number of them who choose to alight at that stop.
(a) State the parametric family that the distribution of A belongs to.
(b) Find p_A(1) and F_A(1), writing each answer in decimal expression form, e.g. (12/8)(0.32) + 0.3333.
34. Suppose you have a large disk farm, so heavily used that the lifetimes L are measured in months. They come from two different factories, in proportions q and 1-q. The disks from factory i have geometrically distributed lifetime with parameter p_i, i = 1,2. Find Var(L) in terms of q and the p_i.
Chapter 4
Continuous Probability Models
There are other types of random variables besides the discrete ones you studied in Chapter 3. This
chapter will cover another major class, continuous random variables. It is for such random variables
that the calculus prerequisite for this book is needed.
4.1 A Random Dart
Imagine that we throw a dart at random at the interval (0,1). Let D denote the spot we hit. By "at random" we mean that all subintervals of equal length are equally likely to get hit. For instance, the probability of the dart landing in (0.7,0.8) is the same as for (0.2,0.3), (0.537,0.637) and so on. Because of that randomness,

P(u ≤ D ≤ v) = v - u   (4.1)

for any case of 0 ≤ u < v ≤ 1.
The first crucial point to note is that

P(D = c) = 0   (4.2)

for any individual point c. This may seem counterintuitive, but it can be seen in a couple of ways:
• Take for example the case c = 0.3. Then

P(D = 0.3) ≤ P(0.29 ≤ D ≤ 0.31) = 0.02   (4.3)

the last equality coming from (4.1).
So, P(D = 0.3) ≤ 0.02. But we can replace 0.29 and 0.31 in (4.3) by 0.299 and 0.301, say, and get P(D = 0.3) ≤ 0.002. So, P(D = 0.3) must be smaller than any positive number, and thus it's actually 0.
• Reason that there are infinitely many points, and if they all had some nonzero probability w, say, then the probabilities would sum to infinity instead of to 1; thus they must have probability 0.
Remember, we have been looking at probability as being the long-run fraction of the time an event occurs, in infinitely many repetitions of our experiment. So (4.2) doesn't say that D = c can't occur; it merely says that it happens so rarely that the long-run fraction of occurrence is 0.
4.2 Continuous Random Variables Are Useful Unicorns
The above discussion of the random dart may still sound odd to you, but remember, this is an idealization. D actually cannot be just any old point in (0,1). Our dart has nonzero thickness, our measuring instrument has only finite precision, and so on.
So this modeling of the position of the dart as continuously distributed really is an idealization. Indeed, in practice there are NO continuous random variables. But the continuous model can be an excellent approximation, and thus the concept is extremely useful. It's like the assumption of "massless string" in physics analyses; there is no such thing, but it's a good approximation to reality.
Indeed, most applications of probability and statistics, especially the latter, are based on continuous distributions. We'll be using them heavily for the remainder of this book.
4.3 But Equation (4.2) Presents a Problem
But Equation (4.2) presents a problem for us in defining the term distribution for variables like this. In Section 3.11, we defined this for a discrete random variable Y as a list of the values Y takes on, together with their probabilities. But that would be impossible here; all the probabilities of individual values here are 0.
Instead, we define the distribution of a random variable W which puts 0 probability on individual points in another way. To set this up, we first must define a key function:

Definition 10 For any random variable W (including discrete ones), its cumulative distribution function (cdf), F_W, is defined by

F_W(t) = P(W ≤ t), -∞ < t < ∞   (4.4)

(Please keep in mind the notation. It is customary to use capital F to denote a cdf, with a subscript consisting of the name of the random variable.)
What is t here? It's simply an argument to a function. The function here has domain (-∞, ∞), and we must thus define that function for every value of t. This is a simple point, but a crucial one.
For an example of a cdf, consider our "random dart" example above. We know that, for example for t = 0.23,

F_D(0.23) = P(D ≤ 0.23) = P(0 ≤ D ≤ 0.23) = 0.23   (4.5)

Also,

F_D(-10.23) = P(D ≤ -10.23) = 0   (4.6)

and

F_D(10.23) = P(D ≤ 10.23) = 1   (4.7)
In general for our dart,

F_D(t) = 0 for t ≤ 0
       = t for 0 < t < 1
       = 1 for t ≥ 1   (4.8)

Here is the graph of F_D:
[Figure: graph of F_D(t) against t; the cdf rises linearly from 0 at t = 0 to 1 at t = 1]
The cdf of a discrete random variable is defined as in Equation (4.4) too. For example, say Z is the number of heads we get from two tosses of a coin. Then

F_Z(t) = 0 for t < 0
       = 0.25 for 0 ≤ t < 1
       = 0.75 for 1 ≤ t < 2
       = 1 for t ≥ 2   (4.9)
For instance, F_Z(1.2) = P(Z ≤ 1.2) = P(Z = 0 or Z = 1) = 0.25 + 0.50 = 0.75. (Make sure you confirm this!) F_Z is graphed below.
[Figure: graph of F_Z(t) against t; a step function with jumps at t = 0, 1 and 2]
The fact that one cannot get a noninteger number of heads is what makes the cdf of Z flat between consecutive integers.
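By the way, since Z is binomial with n = 2 and p = 0.5, we can confirm F_Z(1.2) = 0.75 with R's binomial cdf:

pbinom(1.2, 2, 0.5)  # P(Z <= 1) = 0.75; the cdf is flat between integers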
In the graphs you see that F_D in (4.8) is continuous while F_Z in (4.9) has jumps. For this reason, we call random variables like D, ones which have 0 probability for individual points, continuous random variables.
Students sometimes ask, "What is t?" The answer is that it's simply the argument of a mathematical function, just like the role of t in, say, g(t) = sin(t), -∞ < t < ∞. F_Z() is a function, just like this g(t) or the numerous functions that you worked with in calculus. Each input yields an output; the input 1.2 yields the output 0.75 in the case of F_Z(), while the input π yields the output 0 in the case of g(t).
At this level of study of probability, most random variables are either discrete or continuous, but
some are not.
4.4 Density Functions
Intuition is key here. Make SURE you develop a good intuitive understanding of density functions,
as it is vital in being able to apply probability well. We will use it a lot in our course.
4.4.1 Motivation, Definition and Interpretation
OK, now we have a name for random variables that have probability 0 for individual points, continuous, and we have solved the problem of how to describe their distribution. Now we need something which will be the continuous random variables' analog of a probability mass function. (The reader may wish to review pmfs in Section 3.11.)
Think as follows. From (4.4) we can see that for a discrete random variable, its cdf can be calculated by summing its pmf. Recall that in the continuous world, we integrate instead of sum. So, our continuous-case analog of the pmf should be something that integrates to the cdf. That of course is the derivative of the cdf, which is called the density:
Definition 11 (Oversimplified from a theoretical math point of view.) Consider a continuous random variable W. Define

f_W(t) = (d/dt) F_W(t), -∞ < t < ∞   (4.10)

wherever the derivative exists. The function f_W is called the density of W.
(Please keep in mind the notation. It is customary to use lower-case f to denote a density, with a
subscript consisting of the name of the random variable.)
Recall from calculus that an integral is the area under the curve, derived as the limit of the sums of areas of rectangles drawn at the curve, as the rectangles become narrower and narrower. Since the integral is a limit of sums, its symbol ∫ is shaped like an S.
Now look at Figure 4.1, depicting a density function f_X. (It so happens that in this example, the density is an increasing function, but most are not.) A rectangle is drawn, positioned horizontally at 1.3 ± 0.1, and with height equal to f_X(1.3). The area of the rectangle approximates the area under the curve in that region, which in turn is a probability:

2(0.1) f_X(1.3) ≈ ∫_{1.2}^{1.4} f_X(t) dt   (rect. approx. to slice of area)   (4.11)
              = F_X(1.4) - F_X(1.2)   (f_X = F'_X)   (4.12)
              = P(1.2 < X ≤ 1.4)   (def. of F_X)   (4.13)
              = P(1.2 < X < 1.4)   (prob. of single pt. is 0)   (4.14)
[Figure 4.1: Approximation of Probability by a Rectangle; the curve is the density f_X, with a rectangle of width 0.2 centered at x = 1.3]
In other words, for any density f_X at any point t, and for small values of c,

2c f_X(t) ≈ P(t - c < X < t + c)   (4.15)
Thus we have:

Interpretation of Density Functions
For any density f_X and any two points r and s,

P(r - c < X < r + c) / P(s - c < X < s + c) ≈ f_X(r) / f_X(s)   (4.16)

So, X will take on values in regions in which f_X is large much more often than in regions where it is small, with the ratio of frequencies being proportional to the values of f_X.
For our dart random variable D, f_D(t) = 1 for t in (0,1), and it's 0 elsewhere.[1] Again, f_D(t) is NOT P(D = t), since the latter value is 0, but it is still viewable as a "relative likelihood." The fact that f_D(t) = 1 for all t in (0,1) can be interpreted as meaning that all the points in (0,1) are equally likely to be hit by the dart. More precisely put, you can view the constant nature of this density as meaning that all subintervals of the same length within (0,1) have the same probability of being hit.
Note too that if, say, X has the density 2t/15 on (1,4), 0 elsewhere (the example analyzed in Section 4.4.3 below), then f_X(3) = 6/15 = 0.4 and thus, by (4.15), P(2.99 < X < 3.01) ≈ 0.008. Using our notebook viewpoint, think of many repetitions of the experiment, with each line in the notebook recording the value of X in that repetition. Then in the long run, about 0.8% of the lines would have X in (2.99,3.01).
The interpretation of the density is, as seen above, via the relative heights of the curve at various
points. The absolute heights are not important. Think of what happens when you view a histogram
of grades on an exam. Here too you are just interested in relative heights. (In a later unit, you will
see that a histogram is actually an estimate for a density.)
4.4.2 Properties of Densities
Equation (4.10) implies:

Property A:

P(a < W ≤ b) = F_W(b) - F_W(a) = ∫_a^b f_W(t) dt   (4.17)

Since P(W = c) = 0 for any single point c, this also means:

Property B:

P(a < W ≤ b) = P(a ≤ W ≤ b) = P(a ≤ W < b) = P(a < W < b) = ∫_a^b f_W(t) dt   (4.18)

This in turn implies:

Property C:

∫_{-∞}^{∞} f_W(t) dt = 1   (4.19)
[1] The derivative does not exist at the points 0 and 1, but that doesn't matter.
Note that in the above integral, f_W(t) will be 0 in various ranges of t corresponding to values W cannot take on. For the dart example, for instance, this will be the case for t < 0 and t > 1.
What about E(W)? Recall that if W were discrete, we'd have

E(W) = Σ_c c p_W(c)   (4.20)

where the sum ranges over all values c that W can take on. If for example W is the number of dots we get in rolling two dice, c will range over the values 2,3,...,12.
So, the analog for continuous W is:

Property D:

E(W) = ∫_t t f_W(t) dt   (4.21)
where here t ranges over the values W can take on, such as the interval (0,1) in the dart case. Again, we can also write this as

E(W) = ∫_{-∞}^{∞} t f_W(t) dt   (4.22)

in view of the previous comment that f_W(t) might be 0 for various ranges of t.
And of course,

E(W^2) = ∫_t t^2 f_W(t) dt   (4.23)

and in general, similarly to (3.25):

Property E:

E[g(W)] = ∫_t g(t) f_W(t) dt   (4.24)
Most of the properties of expected value and variance stated previously for discrete random variables
hold for continuous ones too:
Property F:
Equations (3.13), (??), (3.17), (3.30), (3.33), (4.5.2.1) still hold in the continuous case.
4.4.3 A First Example
Consider the density function equal to 2t/15 on the interval (1,4), 0 elsewhere. Say X has this density. Here are some computations we can do:

EX = ∫_1^4 t · 2t/15 dt = 2.8   (4.25)

P(X > 2.5) = ∫_{2.5}^4 2t/15 dt = 0.65   (4.26)

F_X(s) = ∫_1^s 2t/15 dt = (s^2 - 1)/15 for s in (1,4)   (cdf is 0 for s < 1, and 1 for s > 4)   (4.27)

Var(X) = E(X^2) - (EX)^2   (from (3.30))   (4.28)
       = ∫_1^4 t^2 · 2t/15 dt - 2.8^2   (from (4.25))   (4.29)
       = 8.5 - 7.84 = 0.66   (4.30)

P(tenths digit of X is even) = Σ_{i=0,2,4,...,28} P[1 + i/10 < X < 1 + (i+1)/10]   (4.31)
                            = Σ_{i=0,2,4,...,28} ∫_{1+i/10}^{1+(i+1)/10} 2t/15 dt   (4.32)
                            = ... (integration left to the reader)   (4.33)
Suppose L is the lifetime of a light bulb (say in years), with the density that X has above. Let's find some quantities in that context:

Proportion of bulbs with lifetime less than the mean lifetime:

P(L < 2.8) = ∫_1^{2.8} 2t/15 dt = (2.8^2 - 1)/15   (4.34)

Mean of 1/L:

E(1/L) = ∫_1^4 (1/t) · 2t/15 dt = 2/5   (4.35)
In testing many bulbs, mean number of bulbs that it takes to find two that have lifetimes longer than 2.5:
Use (3.108) with k = 2 and p = 0.65.
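These integrals are simple enough to do by hand, but R's integrate() offers a quick numeric check; here f is our own name for the density:

f <- function(t) 2*t/15
integrate(function(t) t * f(t), 1, 4)      # EX, approximately 2.8
integrate(f, 2.5, 4)                       # P(X > 2.5), approximately 0.65
integrate(f, 1, 2.8)                       # P(L < 2.8)
integrate(function(t) (1/t) * f(t), 1, 4)  # E(1/L), approximately 0.4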
4.5 Famous Parametric Families of Continuous Distributions
4.5.1 The Uniform Distributions
4.5.1.1 Density and Properties
In our dart example, we can imagine throwing the dart at the interval (q,r) (so this will be a two-parameter family). Then to be a uniform distribution, i.e. with all the points being equally likely, the density must be constant in that interval. But it also must integrate to 1 [see (4.19)]. So, that constant must be 1 divided by the length of the interval:

f_D(t) = 1/(r - q)   (4.36)

for t in (q,r), 0 elsewhere.
It is easily shown that E(D) = (q+r)/2 and Var(D) = (1/12)(r - q)^2.
The notation for this family is U(q,r).
4.5.1.2 R Functions
Relevant functions for a uniformly distributed random variable X on (r,s) are:
• punif(q,r,s), to find P(X ≤ q)
• qunif(q,r,s), to find c such that P(X ≤ c) = q
• runif(n,r,s), to generate n independent values of X
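As a quick simulation check of the mean and variance formulas above, for an arbitrarily chosen U(2,5):

x <- runif(100000, 2, 5)
mean(x)  # approximately 3.5 = (2+5)/2
var(x)   # approximately 0.75 = (5-2)^2/12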
4.5.1.3 Example: Modeling of Disk Performance
Uniform distributions are often used to model computer disk requests. Recall that a disk consists of a large number of concentric rings, called tracks. When a program issues a request to read or write a file, the read/write head must be positioned above the track of the first part of the file. This move, which is called a seek, can be a significant factor in disk performance in large systems, e.g. a database for a bank.
If the number of tracks is large, the position of the read/write head, which I'll denote as X, is like a continuous random variable, and often this position is modeled by a uniform distribution. This situation may hold just before a defragmentation operation. After that operation, the files tend to be bunched together in the central tracks of the disk, so as to reduce seek time, and X will not have a uniform distribution anymore.
Each track consists of a certain number of sectors of a given size, say 512 bytes each. Once the read/write head reaches the proper track, we must wait for the desired sector to rotate around and pass under the read/write head. It should be clear that a uniform distribution is a good model for this rotational delay.
4.5.1.4 Example: Modeling of Denial-of-Service Attack
In one facet of computer security, it has been found that a uniform distribution is actually a warning of trouble, a possible indication of a denial-of-service attack. Here the attacker tries to monopolize, say, a Web server, by inundating it with service requests. According to the research of David Marchette,[2] attackers choose uniformly distributed false IP addresses, a pattern not normally seen at servers.
4.5.2 The Normal (Gaussian) Family of Continuous Distributions
These are the famous "bell-shaped curves," so called because their densities have that shape.[3]
4.5.2.1 Density and Properties
Density and Parameters:
The density for a normal distribution is

f_W(t) = (1/(σ√(2π))) e^(-0.5((t-μ)/σ)^2), -∞ < t < ∞   (4.37)

Again, this is a two-parameter family, indexed by the parameters μ and σ, which turn out to be the mean[4] and standard deviation. The notation for it is N(μ, σ^2) (it is customary to state the variance σ^2 rather than the standard deviation).

[2] Statistical Methods for Network and Computer Security, David J. Marchette, Naval Surface Warfare Center, rion.math.iastate.edu/IA/2003/foils/marchette.pdf.
[3] Note that other parametric families, notably the Cauchy, also have bell shapes. The difference lies in the rate at which the tails of the distribution go to 0. However, due to the Central Limit Theorem, to be presented below, the normal family is of prime interest.
Closure Under Affine Transformation:
The family is closed under affine transformations, meaning that if X has the distribution N(μ, σ^2), then Y = cX + d has the distribution N(cμ + d, c^2 σ^2), i.e. Y too has a normal distribution.
Consider this statement carefully. It is saying much more than simply that Y has mean cμ + d and variance c^2 σ^2, which would follow from (4.5.2.1) even if X did not have a normal distribution. The key point is that this new variable Y is also a member of the normal family, i.e. its density is still given by (4.37), now with the new mean and variance.
Let's derive this. For convenience, suppose c > 0. Then

F_Y(t) = P(Y ≤ t)   (definition of F_Y)   (4.38)
       = P(cX + d ≤ t)   (definition of Y)   (4.39)
       = P(X ≤ (t - d)/c)   (algebra)   (4.40)
       = F_X((t - d)/c)   (definition of F_X)   (4.41)
Therefore

f_Y(t) = (d/dt) F_Y(t)   (definition of f_Y)   (4.42)
       = (d/dt) F_X((t - d)/c)   (from (4.41))   (4.43)
       = f_X((t - d)/c) · (d/dt)[(t - d)/c]   (definition of f_X and the Chain Rule)   (4.44)
       = (1/c) · (1/(σ√(2π))) e^(-0.5(((t-d)/c - μ)/σ)^2)   (from (4.37))   (4.45)
       = (1/(cσ√(2π))) e^(-0.5((t - (cμ+d))/(cσ))^2)   (algebra)   (4.46)

That last expression is the N(cμ + d, c^2 σ^2) density, so we are done!

[4] Remember, this is a synonym for expected value.
Closure Under Independent Summation:
If X and Y are independent random variables, each having a normal distribution, then their sum S = X + Y also is normally distributed.
This is a pretty remarkable phenomenon, not true for most other parametric families. If for instance X and Y each had, say, a U(0,1) distribution, then the density of S turns out to be triangle-shaped, NOT another uniform distribution. (This can be derived using the methods of Section 8.3.2.)
Note that if X and Y are independent and normally distributed, then the two properties above imply that cX + dY will also have a normal distribution, for any constants c and d.
Evaluating Normal cdfs:
The function in (4.37) does not have a closed-form indefinite integral. Thus probabilities involving normal random variables must be approximated. Traditionally, this is done with a table for the cdf of N(0,1). This one table is sufficient for the entire normal family, because if X has the distribution N(μ, σ^2) then

(X - μ)/σ   (4.47)

has a N(0,1) distribution too, due to the affine transformation closure property discussed above.
By the way, the N(0,1) cdf is traditionally denoted by Φ. As noted, traditionally it has played a central role, as one could transform any probability involving some normal distribution to an equivalent probability involving N(0,1). One would then use a table of N(0,1) to find the desired probability.
Nowadays, probabilities for any normal distribution, not just N(0,1), are easily available by com-
puter. In the R statistical package, the normal cdf for any mean and variance is available via the
function pnorm(). The signature is
pnorm(q,mean=0,sd=1)
This returns the value of the cdf evaluated at q, for a normal distribution having the specied mean
and standard deviation (default values of 0 and 1).
We can use rnorm() to simulate normally distributed random variables. The call is
rnorm(n,mean=0,sd=1)
which returns a vector of n random variates from the specied normal distribution.
We'll use both methods in our first couple of examples below.
4.5.2.2 Example: Network Intrusion
As an example, let's look at a simple version of the network intrusion problem. Suppose we have found that in Jill's remote logins to a certain computer, the number X of disk sectors she reads or writes has an approximate normal distribution with a mean of 500 and a standard deviation of 15.
Before we continue, a comment on modeling: Since the number of sectors is discrete, it could not have an exact normal distribution. But then, no random variable in practice has an exact normal or other continuous distribution, as discussed in Section 4.2, and the distribution can indeed be approximately normal.
Now, say our network intrusion monitor finds that Jill, or someone posing as her, has logged in and has read or written 535 sectors. Should we be suspicious?
To answer this question, let's find P(X ≥ 535): Let Z = (X - 500)/15. From our discussion above, we know that Z has a N(0,1) distribution, so

P(X ≥ 535) = P(Z ≥ (535 - 500)/15) = 1 - Φ(35/15) = 0.01   (4.48)
Again, traditionally we would obtain that 0.01 value from a N(0,1) cdf table in a book. With R, we would just use the function pnorm():

> 1 - pnorm(535,500,15)
[1] 0.009815329

Anyway, that 0.01 probability makes us suspicious. While it could really be Jill, this would be unusual behavior for Jill, so we start to suspect that it isn't her. It's suspicious enough for us to probe more deeply, e.g. by looking at which files she (or the impostor) accessed; were they rare for Jill too?
Now suppose there are two logins to Jill's account, accessing X and Y sectors, with X+Y = 1088. Is this rare for her, i.e. is P(X + Y > 1088) small?
We'll assume X and Y are independent. We'd have to give some thought as to whether this assumption is reasonable, depending on the details of how we observed the logins, etc., but let's move ahead on this basis.
From page 100, we know that the sum S = X+Y is again normally distributed. Due to the properties in Chapter 3, we know S has mean 2 · 500 and variance 2 · 15^2. The desired probability is then found via

1 - pnorm(1088,1000,sqrt(450))
which is about 0.00002. That is indeed a small number, and we should be highly suspicious.
Note again that the normal model (or any other continuous model) can only be approximate,
especially in the tails of the distribution, in this case the right-hand tail. But it is clear that S is
only rarely larger than 1088, and the matter mandates further investigation.
Of course, this is very crude analysis, and real intrusion detection systems are much more complex,
but you can see the main ideas here.
4.5.2.3 Example: Class Enrollment Size
After years of experience with a certain course, a university has found that online pre-enrollment in the course is approximately normally distributed, with mean 28.8 and standard deviation 3.1. Suppose that in some particular offering, pre-enrollment was capped at 25, and it hit the cap. Find the probability that the actual demand for the course was at least 30.
Note that this is a conditional probability! Evaluate it as follows. Let N be the actual demand. Then the key point is that we are given that N ≥ 25, so
P(N ≥ 30 | N ≥ 25) = P(N ≥ 30 and N ≥ 25) / P(N ≥ 25)   (by (2.5))   (4.49)
                   = P(N ≥ 30) / P(N ≥ 25)   (4.50)
                   = [1 - Φ((30 - 28.8)/3.1)] / [1 - Φ((25 - 28.8)/3.1)]   (4.51)
                   = 0.39   (4.52)
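In R, the same computation is a one-liner with pnorm():

(1 - pnorm(30, 28.8, 3.1)) / (1 - pnorm(25, 28.8, 3.1))  # approximately 0.39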
Sounds like it may be worth moving the class to a larger room before school starts.
Since we are approximating a discrete random variable by a continuous one, it might be more
accurate here to use a correction for continuity, described in Section 4.5.2.7.
4.5.2.4 The Central Limit Theorem
The Central Limit Theorem (CLT) says, roughly speaking, that a random variable which is a sum of many components will have an approximate normal distribution. So, for instance, human weights are approximately normally distributed, since a person is made of many components. The same is true for SAT test scores,[5] as the total score is the sum of scores on the individual problems.

[5] This refers to the raw scores, before scaling by the testing company.
There are many versions of the CLT. The basic one requires that the summands be independent and identically distributed:[6]

Theorem 12 Suppose X_1, X_2, ... are independent random variables, all having the same distribution which has mean m and variance v^2. Form the new random variable T = X_1 + ... + X_n. Then for large n, the distribution of T is approximately normal with mean nm and variance nv^2.
The larger n is, the better the approximation, but typically n = 20 or even n = 10 is enough.
4.5.2.5 Example: Cumulative Roundoff Error
Suppose that computer roundoff error in computing the square roots of numbers in a certain range is distributed uniformly on (-0.5,0.5), and that we will be computing the sum of n such square roots. Suppose we compute a sum of 50 square roots. Let's find the approximate probability that the sum is more than 2.0 higher than it should be. (Assume that the error in the summing operation is negligible compared to that of the square root operation.)
Let U_1, ..., U_50 denote the errors on the individual terms in the sum. Since we are computing a sum, the errors are added too, so our total error is

T = U_1 + ... + U_50   (4.53)

By the Central Limit Theorem, T has an approximately normal distribution, with mean 50 EU and variance 50 Var(U), where U is a random variable having the distribution of the U_i. From Section 4.5.1.1, we know that

EU = (-0.5 + 0.5)/2 = 0,  Var(U) = (1/12)[0.5 - (-0.5)]^2 = 1/12   (4.54)

So, the approximate distribution of T is N(0, 50/12). We can then use R to find our desired probability:

> 1 - pnorm(2, mean=0, sd=sqrt(50/12))
[1] 0.1635934

[6] A more mathematically precise statement of the theorem is given in Section 4.5.2.9.
4.5.2.6 Example: Bug Counts
As an example, suppose the number of bugs per 1,000 lines of code has a Poisson distribution with mean 5.2. Let's find the probability of having more than 106 bugs in 20 sections of code, each 1,000 lines long. We'll assume the different sections act independently in terms of bugs.
Here X_i is the number of bugs in the i-th section of code, and T is the total number of bugs. Since each X_i has a Poisson distribution, m = v^2 = 5.2. So, T is approximately distributed normally with mean and variance 20 · 5.2. So, we can find the approximate probability of having more than 106 bugs:

> 1 - pnorm(106,20*5.2,sqrt(20*5.2))
[1] 0.4222596
4.5.2.7 Example: Coin Tosses
Binomially distributed random variables, though discrete, also are approximately normally distributed. Here's why:
Say T has a binomial distribution with n trials. Then we can write T as a sum of indicator random variables (Section 3.6):

T = T_1 + ... + T_n   (4.55)

where T_i is 1 for a success and 0 for a failure on the i-th trial. Since we have a sum of independent, identically distributed terms, the CLT applies. Thus we use the CLT if we have binomial distributions with large n.
For example, let's find the approximate probability of getting more than 12 heads in 20 tosses of a coin. X, the number of heads, has a binomial distribution with n = 20 and p = 0.5. Its mean and variance are then np = 10 and np(1-p) = 5. So, let Z = (X - 10)/√5, and write

P(X > 12) = P(Z > (12 - 10)/√5) ≈ 1 - Φ(0.894) = 0.186   (4.56)
Or:
> 1 - pnorm(12,10,sqrt(5))
[1] 0.1855467
The exact answer is 0.132. Remember, the reason we could do this was that X is approximately normal, from the CLT. This is an approximation of the distribution of a discrete random variable by a continuous one, which introduces additional error.
We can get better accuracy by using the correction for continuity, which can be motivated as follows. As an alternative to (4.56), we might write

P(X > 12) = P(X ≥ 13) = P(Z > (13 - 10)/√5) ≈ 1 - Φ(1.342) = 0.090   (4.57)

That value of 0.090 is considerably smaller than the 0.186 we got from (4.56). We could split the difference this way:

P(X > 12) = P(X ≥ 12.5) = P(Z > (12.5 - 10)/√5) ≈ 1 - Φ(1.118) = 0.132   (4.58)

(Think of the number 13 "owning" the region between 12.5 and 13.5, 14 "owning" the part between 13.5 and 14.5, and so on.) Since the exact answer to seven decimal places is 0.131588, the strategy has improved accuracy substantially.
The term correction for continuity alludes to the fact that we are approximating a discrete distribution by a continuous one.
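We can compare the exact binomial value with the two normal approximations directly in R:

1 - pbinom(12, 20, 0.5)       # exact: 0.131588
1 - pnorm(12, 10, sqrt(5))    # no correction: 0.186
1 - pnorm(12.5, 10, sqrt(5))  # with the correction: 0.132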
4.5.2.8 Museum Demonstration
Many science museums have the following visual demonstration of the CLT.
There are many balls in a chute, with a triangular array of r rows of pins beneath the chute. Each ball falls through the rows of pins, bouncing left and right with probability 0.5 each, eventually being collected into one of r+1 bins, numbered 0 to r. A ball will end up in bin i if it bounces rightward in i of the r rows of pins, i = 0,1,...,r. Key point:

Let X denote the bin number at which a ball ends up. X is the number of rightward bounces ("successes") in r rows ("trials"). Therefore X has a binomial distribution with n = r and p = 0.5.

Each bin is wide enough for only one ball, so the balls in a bin will stack up. And since there are many balls, the height of the stack in bin i will be approximately proportional to P(X = i). And since the latter will be approximately given by the CLT, the stacks of balls will roughly look like the famous bell-shaped curve!
There are many online simulations of this museum demonstration, such as http://www.mathsisfun.com/data/quincunx.html. By collecting the balls in bins, the apparatus basically simulates a histogram for X, which will then be approximately bell-shaped.
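A few lines of R mimic the apparatus, since each ball's bin number is binomial; the choices of r = 20 rows and 5000 balls below are arbitrary:

r <- 20
bins <- rbinom(5000, r, 0.5)  # simulated bin numbers for 5000 balls
hist(bins)                    # roughly bell-shaped, per the CLT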
4.5.2.9 Optional topic: Formal Statement of the CLT
Definition 13 A sequence of random variables L_1, L_2, L_3, ... converges in distribution to a random variable M if

lim_{n→∞} P(L_n ≤ t) = P(M ≤ t), for all t   (4.59)
Note by the way that these random variables need not be defined on the same probability space.
The formal statement of the CLT is:

Theorem 14 Suppose X_1, X_2, ... are independent random variables, all having the same distribution which has mean m and variance v^2. Then

Z = (X_1 + ... + X_n - nm) / (v√n)   (4.60)

converges in distribution to a N(0,1) random variable.
4.5.2.10 Importance in Modeling
Needless to say, there are no random variables in the real world that are exactly normally distributed. In addition to our comments at the beginning of this chapter that no real-world random variable has a continuous distribution, there are no practical applications in which a random variable is not bounded on both ends. This contrasts with normal distributions, which extend from -∞ to ∞.
Yet, since many things in nature do have approximate normal distributions, normal distributions play a key role in statistics. Most of the classical statistical procedures assume that one has sampled from a population having an approximately normal distribution. This should come as no surprise, knowing the CLT. In addition, the CLT tells us that in many of these cases the quantities used for statistical estimation are approximately normal, even if the data they are calculated from are not.
4.5.3 The Chi-Squared Family of Distributions
4.5.3.1 Density and Properties
Let Z_1, Z_2, ..., Z_k be independent N(0,1) random variables. Then the distribution of

Y = Z_1^2 + ... + Z_k^2   (4.61)

is called chi-squared with k degrees of freedom. We write such a distribution as χ^2_k. Chi-squared is a one-parameter family of distributions, and arises quite frequently in statistical applications, as will be seen in future chapters.
We can derive the mean of a chi-squared distribution as follows. In (4.61), note that

E(Z_i^2) = Var(Z_i) + (EZ_i)^2 = 1 + 0^2 = 1   (4.62)

Then EY in (4.61) is k. One can also show that Var(Y) = 2k.
It turns out that chi-squared is a special case of the gamma family in Section 4.5.5 below, with r = k/2 and λ = 0.5.
The R functions dchisq(), pchisq(), qchisq() and rchisq() give us the density, cdf, quantile function and random number generator for the chi-squared family. The second argument in each case is the number of degrees of freedom. The first argument is the argument to the corresponding math function in all cases but rchisq(), in which it is the number of random variates to be generated.
For instance, to get the value of f_X(5.2) for a chi-squared random variable having 3 degrees of freedom, we make the following call:

> dchisq(5.2,3)
[1] 0.06756878
4.5.3.2 Example: Error in Pin Placement
Consider a machine that places a pin in the middle of a flat, disk-shaped object. The placement is subject to error. Let X and Y be the placement errors in the horizontal and vertical directions, respectively, and let W denote the distance from the true center to the pin placement. Suppose X and Y are independent and have normal distributions with mean 0 and variance 0.04. Let's find P(W > 0.6).
Since a distance is the square root of a sum of squares, this sounds like the chi-squared distribution might be relevant. So, let's first convert the problem to one involving squared distance:

P(W > 0.6) = P(W^2 > 0.36)   (4.63)

But W^2 = X^2 + Y^2, so

P(W > 0.6) = P(X^2 + Y^2 > 0.36)   (4.64)
This is not quite chi-squared, as that distribution involves the sum of squares of independent N(0,1) random variables. But due to the normal family's closure under affine transformations (page 99), we know that X/0.2 and Y/0.2 do have N(0,1) distributions. So write

P(W > 0.6) = P[(X/0.2)^2 + (Y/0.2)^2 > 0.36/0.2^2]   (4.65)

Now evaluate the right-hand side:

> 1 - pchisq(0.36/0.04, 2)
[1] 0.01110900
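As a quick sanity check, here is a simulation sketch of the pin placement process:

# X and Y are N(0, 0.04), i.e. standard deviation 0.2
x <- rnorm(100000, 0, 0.2)
y <- rnorm(100000, 0, 0.2)
mean(sqrt(x^2 + y^2) > 0.6)  # should be near 0.0111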
4.5.3.3 Importance in Modeling
This distribution is used widely in statistical applications. As will be seen in our chapters on statistics, many statistical methods involve a sum of squared normal random variables. (The motivation for the term degrees of freedom will be explained in those chapters too.)
4.5.4 The Exponential Family of Distributions
Please note: We have been talking here of parametric families of distributions, and in this section
will introduce one of the most famous, the family of exponential distributions. This should not be
confused, though, with the term exponential family that arises in mathematical statistics, which
includes exponential distributions but is much broader.
4.5.4.1 Density and Properties
The densities in this family have the form

f_W(t) = λe^{−λt}, 0 < t < ∞   (4.66)

This is a one-parameter family of distributions.

After integration, one finds that E(W) = 1/λ and Var(W) = 1/λ². You might wonder why it is customary to index the family via λ rather than 1/λ (see (4.66)), since the latter is the mean. But this is actually quite natural, for the reason cited in the following subsection.
4.5.4.2 R Functions
Relevant functions for an exponentially distributed random variable X with parameter lambda are

dexp(t,lambda), to find f_X(t)

pexp(q,lambda), to find P(X ≤ q)

qexp(q,lambda), to find c such that P(X ≤ c) = q

rexp(n,lambda), to generate n independent values of X
4.5.4.3 Example: Refunds on Failed Components
Suppose a manufacturer of some electronic component finds that its lifetime L is exponentially distributed with mean 10000 hours. They give a refund if the item fails before 500 hours. Let M be the number of items they have sold, up to and including the one on which they make the first refund. Let's find EM and Var(M).

First, notice that M has a geometric distribution! It is the number of independent trials until the first success, where a trial is one component, success (no value judgment, remember) is giving a refund, and the success probability is

P(L < 500) = ∫_0^500 0.0001 e^{−0.0001t} dt = 1 − e^{−0.05} ≈ 0.05   (4.67)

Then plug p = 0.05 into (3.84) and (3.85).
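Those formulas, per the geometric-distribution summary in Chapter 6, give EM = 1/p = 20 and Var(M) = (1 − p)/p² = 380. As a check, here is a simulation sketch (the function name simrefund is ours):

simrefund <- function() {
   m <- 0
   repeat {
      m <- m + 1
      # lifetime is exponential with mean 10000 hours, i.e. rate 0.0001
      if (rexp(1, 0.0001) < 500) return(m)
   }
}
ms <- replicate(10000, simrefund())
mean(ms)  # should be near 20
var(ms)   # should be near 380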
4.5.4.4 Example: Overtime Parking Fees
A certain public parking garage charges parking fees of $1.50 for the first hour, and $1 per hour after that. Suppose parking times T are exponentially distributed with mean 1.5 hours. Let W denote the total fee paid. Let's find E(W) and Var(W).

The key point is that W is a function of T:

W = 1.5T, if T ≤ 1;  W = 1.5 + 1·(T − 1) = T + 0.5, if T > 1   (4.68)
EW =
_

0
g(t)
1
1.5
e

1
1.5
t
dt =
_
1
0
1.5t
1
1.5
e

1
1.5
t
dt +
_

1
(t + 0.5)
1
1.5
e

1
1.5
t
dt (4.69)
The integration is left to the reader.
Now, what about Var(W)? As is often the case, it's easier to use (3.30), so we need to find E(W^2). The above integration becomes

E(W^2) = ∫_0^∞ g^2(t) (1/1.5) e^{−t/1.5} dt = ∫_0^1 1.5^2 t^2 · (1/1.5) e^{−t/1.5} dt + ∫_1^∞ (t + 0.5)^2 (1/1.5) e^{−t/1.5} dt   (4.70)

After evaluating this, we subtract (EW)^2, giving us the variance of W.
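Alternatively, we can do the integration numerically in R; here is a sketch using integrate():

g <- function(t) ifelse(t <= 1, 1.5*t, t + 0.5)  # the fee function (4.68)
dens <- function(t) dexp(t, 1/1.5)               # density of T, mean 1.5
ew <- integrate(function(t) g(t) * dens(t), 0, Inf)$value    # (4.69)
ew2 <- integrate(function(t) g(t)^2 * dens(t), 0, Inf)$value # (4.70)
ew          # E(W)
ew2 - ew^2  # Var(W)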
4.5.4.5 Connection to the Poisson Distribution Family
Suppose the lifetimes of a set of light bulbs are independent and identically distributed (i.i.d.), and consider the following process. At time 0, we install a light bulb, which burns for an amount of time X_1. Then we install a second light bulb, with lifetime X_2. Then a third, with lifetime X_3, and so on.

Let

T_r = X_1 + ... + X_r   (4.71)

denote the time of the r-th replacement. Also, let N(t) denote the number of replacements up to and including time t. Then it can be shown that if the common distribution of the X_i is exponential, then N(t) has a Poisson distribution with mean λt. And the converse is true too: If the X_i are independent and identically distributed and N(t) is Poisson, then the X_i must have exponential distributions. In summary:
Theorem 15 Suppose X_1, X_2, ... are i.i.d. nonnegative continuous random variables. Define

T_r = X_1 + ... + X_r   (4.72)

and

N(t) = max{k : T_k ≤ t}   (4.73)

Then the distribution of N(t) is Poisson with parameter λt for all t if and only if the X_i have an exponential distribution with parameter λ.

In other words, N(t) will have a Poisson distribution if and only if the lifetimes are exponentially distributed.
Proof

Only if part:

The key is to notice that the event X_1 > t is exactly equivalent to N(t) = 0. If the first light bulb lasts longer than t, then the count of burnouts at time t is 0, and vice versa. Then

P(X_1 > t) = P[N(t) = 0]  (see above equivalence)   (4.74)
           = [(λt)^0 / 0!] e^{−λt}  (from (3.113))   (4.75)
           = e^{−λt}   (4.76)
Then

f_{X_1}(t) = d/dt (1 − e^{−λt}) = λe^{−λt}   (4.77)

That shows that X_1 has an exponential distribution, and since the X_i are i.i.d., that implies that all of them have that distribution.
If part:
We need to show that if the X_i are exponentially distributed with parameter λ, then for u nonnegative and each positive integer k,

P[N(u) = k] = (λu)^k e^{−λu} / k!   (4.78)

The proof for the case k = 0 just reverses (4.74) above. The general case, not shown here, notes that N(u) ≤ k is equivalent to T_{k+1} > u. The probability of the latter event can be found by integrating (4.79) from u to infinity. One needs to perform k−1 integrations by parts, and eventually one arrives at (4.78), summed from 1 to k, as required.
The collection of random variables N(t), t ≥ 0, is called a Poisson process.

The relation E[N(t)] = λt says that replacements are occurring at an average rate of λ per unit time. Thus λ is called the intensity parameter of the process. It is this rate interpretation that makes λ a natural indexing parameter in (4.66).
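Here is a small simulation sketch of the "only if" direction: with exponential lifetimes, the count N(t) behaves like a Poisson random variable with mean λt.

# lambda = 0.1, t = 20, so E[N(t)] = 0.1 * 20 = 2
nt <- replicate(10000, {
   tot <- 0
   count <- 0
   repeat {
      tot <- tot + rexp(1, 0.1)
      if (tot > 20) break
      count <- count + 1
   }
   count
})
mean(nt)  # should be near 2
var(nt)   # also near 2, as the Poisson family requires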
4.5.4.6 Importance in Modeling
Many quantities in real life have been found to be approximately exponentially distributed. A famous example is the lifetimes of air conditioners on airplanes. Another famous example is interarrival times, such as customers coming into a bank or messages going out onto a computer network. The exponential family is used in software reliability studies too.

Exponential distributions are the only continuous ones that are memoryless. This point is pursued in Chapter 5. Due to this property, exponential distributions play a central role in Markov chains (Chapter 17).
4.5.5 The Gamma Family of Distributions
4.5.5.1 Density and Properties
Recall Equation (4.71), in which the random variable T_r was defined to be the time of the r-th light bulb replacement. T_r is the sum of r independent exponentially distributed random variables with parameter λ. The distribution of T_r is called an Erlang distribution, with density

f_{T_r}(t) = [1/(r−1)!] λ^r t^{r−1} e^{−λt}, t > 0   (4.79)
This is a two-parameter family.
Again, it's helpful to think in notebook terms. Say r = 8. Then we watch the lamp for the durations of eight lightbulbs, recording T_8, the time at which the eighth burns out. We write that time in the first line of our notebook. Then we watch a new batch of eight bulbs, and write the value of T_8 for those bulbs in the second line of our notebook, and so on. Then after recording a very large number of lines in our notebook, we plot a histogram of all the T_8 values. The point is then that that histogram will look like (4.79).
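This notebook experiment is easy to carry out in R; here is a sketch with λ = 1:

# each notebook line records T_8, the sum of eight exponential lifetimes
t8 <- replicate(10000, sum(rexp(8, 1)))
hist(t8, breaks=50, freq=FALSE)
curve(dgamma(x, 8, 1), add=TRUE)  # overlay the Erlang density (4.79), r = 8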
We can generalize this by allowing r to take noninteger values, by defining a generalization of the factorial function:

Γ(r) = ∫_0^∞ x^{r−1} e^{−x} dx   (4.80)
This is called the gamma function, and it gives us the gamma family of distributions, more general than the Erlang:

f_W(t) = [1/Γ(r)] λ^r t^{r−1} e^{−λt}, t > 0   (4.81)

(Note that Γ(r) is merely serving as the constant that makes the density integrate to 1.0. It doesn't have meaning of its own.)
This is again a two-parameter family, with r and λ as parameters.

A gamma distribution has mean r/λ and variance r/λ². In the case of integer r, this follows from (4.71) and the fact that an exponentially distributed random variable has mean 1/λ and variance 1/λ², and it can be derived in general. Note again that the gamma reduces to the exponential when r = 1.
Recall from above that the gamma distribution, or at least the Erlang, arises as a sum of independent
random variables. Thus the Central Limit Theorem implies that the gamma distribution should
be approximately normal for large (integer) values of r. We see in Figure 4.2 that even with r =
10 it is rather close to normal.
It also turns out that the chi-square distribution with d degrees of freedom is a gamma distribution, with r = d/2 and λ = 0.5.
4.5.5.2 Example: Network Buffer

Suppose in a network context (not our ALOHA example), a node does not transmit until it has accumulated five messages in its buffer. Suppose the times between message arrivals are independent and exponentially distributed with mean 100 milliseconds. Let's find the probability that more than 552 ms will pass before a transmission is made, starting with an empty buffer.

Let X_1 be the time until the first message arrives, X_2 the time from then to the arrival of the second message, and so on. Then the time until we accumulate five messages is Y = X_1 + ... + X_5. Then from the definition of the gamma family, we see that Y has a gamma distribution with r = 5 and λ = 0.01.
Then

P(Y > 552) = ∫_552^∞ (1/4!) 0.01^5 t^4 e^{−0.01t} dt   (4.82)

This integral could be evaluated via repeated integration by parts, but let's use R instead:

> 1 - pgamma(552,5,0.01)
[1] 0.3544101
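A simulation sketch confirms the pgamma() computation:

# each repetition: time to accumulate five messages, interarrival mean 100 ms
y <- replicate(10000, sum(rexp(5, 0.01)))
mean(y > 552)  # should be near 0.3544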
4.5.5.3 Importance in Modeling
As seen in (4.71), sums of exponentially distributed random variables often arise in applications.
Such sums have gamma distributions.
You may ask what the meaning is of a gamma distribution in the case of noninteger r. There is no particular meaning, but when we have a real data set, we often wish to summarize it by fitting a parametric family to it, meaning that we try to find a member of the family that approximates our data well.

In this regard, the gamma family provides us with densities which rise near t = 0, then gradually decrease to 0 as t becomes large, so the family is useful if our data seem to look like this. Graphs of some gamma densities are shown in Figure 4.2.
4.5.6 The Beta Family of Distributions
As seen in Figure 4.2, the gamma family is a good choice to consider if our data are nonnegative, with the density having a peak near 0 and then gradually tapering off to the right. What about data in the range (0,1)? The beta family provides a very flexible model for this kind of setting, allowing us to model many different concave up or concave down curves.
[Figure 4.2: Various Gamma Densities, showing the gamma density for r = 1.0, r = 5.0 and r = 10.0, with lambda = 1.0 in each case]
The densities of the family have the following form:
[Γ(α+β) / (Γ(α)Γ(β))] (1 − t)^{β−1} t^{α−1}, 0 < t < 1   (4.83)
There are two parameters, α and β. Here are two possibilities.

[Figure: two beta densities on (0,1), one with α = 2, β = 2 and one with α = 0.5, β = 0.8]
The mean and variance are

α/(α+β)   (4.84)

and

αβ / [(α+β)^2 (α+β+1)]   (4.85)
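The two possibilities mentioned above can be drawn with R's dbeta() (a sketch):

curve(dbeta(x, 2, 2), 0, 1, ylim=c(0, 2), ylab="density")  # alpha = 2, beta = 2
curve(dbeta(x, 0.5, 0.8), add=TRUE, lty=2)                 # alpha = 0.5, beta = 0.8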
4.6 Choosing a Model
The parametric families presented here are often used in the real world. As indicated previously, this may be done on an empirical basis. We would collect data on a random variable X, and plot the frequencies of its values in a histogram. If for example the plot looks roughly like the curves in Figure 4.2, we could choose this as the family for our model.

Or, our choice may arise from theory. If for instance our knowledge of the setting in which we are working says that our distribution is memoryless, that forces us to use the exponential density family.

In either case, the question as to which member of the family we choose will be settled by using some kind of procedure which finds the member of the family which best fits our data. We will discuss this in detail in our chapters on statistics, especially Chapter 14.

Note that we may choose not to use a parametric family at all. We may simply find that our data do not fit any of the common parametric families (there are many others than those presented here) very well. Procedures that do not assume any parametric family are termed nonparametric.
4.7 A General Method for Simulating a Random Variable
Suppose we wish to simulate a random variable X with cdf F_X for which there is no R function. This can be done via F_X^{−1}(U), where U has a U(0,1) distribution. In other words, we call runif() and then plug the result into the inverse of the cdf of X. Here "inverse" is in the sense that, for instance, squaring and square-rooting, exp() and ln(), etc. are inverse operations of each other.
For example, say X has the density 2t on (0,1). Then F_X(t) = t^2, so F_X^{−1}(s) = s^{0.5}. We can then generate X in R as sqrt(runif(1)). Here's why:

For brevity, denote F_X^{−1} as G and F_X as H. Our generated random variable is G(U). Then

P[G(U) ≤ t] = P[U ≤ G^{−1}(t)] = P[U ≤ H(t)] = H(t)   (4.86)

In other words, the cdf of G(U) is F_X! So, G(U) has the same distribution as X.
Note that this method, though valid, is not necessarily practical, since computing F_X^{−1} may not be easy.
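Packaging the example above as a function (the name gen2t is ours), and checking it against the known mean E(X) = 2/3:

gen2t <- function(n) sqrt(runif(n))  # inverse-cdf method for the density 2t on (0,1)
x <- gen2t(10000)
mean(x)  # should be near 2/3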
4.8 Hybrid Continuous/Discrete Distributions
A random variable could have a distribution that is partly discrete and partly continuous. Recall our first example, from Section 4.1, in which D is the position that a dart hits when thrown at the interval (0,1). Suppose our measuring instrument is broken, and registers any value of D past 0.8 as being equal to 0.8. Let W denote the actual value recorded by this instrument.

Then P(W = 0.8) = 0.2, so W is not a continuous random variable, in which every point has mass 0. On the other hand, P(W = t) = 0 for every t before 0.8, so W is not discrete either.

In the advanced theory of probability, some very odd mixtures, beyond this simple discrete/continuous example, can occur, though primarily of theoretical interest.
Exercises
1. Fill in the blanks, in the following statements about continuous random variables. Make sure to use our book's notation.

(a) d/dt P(X ≤ t) = _______

(b) P(a < X < b) = _______

2. Suppose X has a uniform distribution on (-1,1), and let Y = X^2. Find f_Y.

3. In the network intrusion example in Section 4.5.2.2, suppose X is not normally distributed, but instead has a uniform distribution on (450,550). Find P(X ≤ 535) in this case.

4. Suppose X has an exponential distribution with parameter λ. Show that EX = 1/λ and Var(X) = 1/λ².

5. Suppose f_X(t) = 3t^2 for t in (0,1) and is zero elsewhere. Find F_X(0.5) and E(X).
6. Suppose light bulb lifetimes X are exponentially distributed with mean 100 hours.

(a) Find the probability that a light bulb burns out before 25.8 hours.

In the remaining parts, suppose we have two light bulbs. We install the first at time 0, and then when it burns out, immediately replace it with the second.

(b) Find the probability that the first light bulb lasts less than 25.8 hours and the lifetime of the second is more than 120 hours.
(c) Find the probability that the second burnout occurs after time 192.5.
7. Suppose for some continuous random variable X, f_X(t) is equal to 2(1-t) for t in (0,1) and is 0 elsewhere.

(a) Why is the constant here 2? Why not, say, 168?

(b) Find F_X(0.2) and Var(X).

(c) Using the method in Section 4.7, write an R function, named oneminust(), that generates a random variate sampled from this distribution. Then use this function to verify your answers in (b) above.
8. The company Wrong Turn Criminal Mismanagement makes predictions every day. They tend
to err on the side of overpredicting, with the error having a uniform distribution on the interval
(-0.5,1.5). Find the following:
(a) The mean and variance of the error.
(b) The mean of the absolute error.
(c) The probability that exactly two errors are greater than 0.25 in absolute value, out of 10
predictions. Assume predictions are independent.
9. All that glitters is not gold, and not every bell-shaped density is normal. The family of Cauchy distributions, having density

f_X(t) = (1/πc) · 1/[1 + ((t−b)/c)^2], −∞ < t < ∞   (4.87)

is bell-shaped but definitely not normal.

Here the parameters b and c correspond to mean and standard deviation in the normal case, but actually neither the mean nor standard deviation exists for Cauchy distributions. The mean's failure to exist is due to technical problems involving the theoretical definition of integration. In the case of variance, it does not exist because there is no mean, but even more significantly, E[(X − b)^2] = ∞.

However, a Cauchy distribution does have a median, b, so we'll use that instead of a mean. Also, instead of a standard deviation, we'll use as our measure of dispersion the interquartile range, defined (for any distribution) to be the difference between the 75th and 25th percentiles.

We will be investigating the Cauchy distribution that has b = 0 and c = 1.
(a) Find the interquartile range of this Cauchy distribution.
(b) Find the normal distribution that has the same median and interquartile range as this Cauchy
distribution.
(c) Use R to plot the densities of the two distributions on the same graph, so that we can see that they are both bell-shaped, but different.
10. Consider the following game. A dart will hit the random point Y in (0,1) according to the density f_Y(t) = 2t. You must guess the value of Y. (Your guess is a constant, not random.) You will lose $2 per unit error if Y is to the left of your guess, and will lose $1 per unit error on the right. Find the best guess in terms of expected loss.
11. Fill in the blank: Density functions for continuous random variables are analogs of the _______ functions that are used for discrete random variables.
12. Suppose for some random variable W, F_W(t) = t^3 for 0 < t < 1, with F_W(t) being 0 and 1 for t < 0 and t > 1, respectively. Find f_W(t) for 0 < t < 1.
13. Suppose X has a binomial distribution with parameters n and p. Then X is approximately
normally distributed with mean np and variance np(1-p). For each of the following, answer either
A or E, for approximately or exact, respectively:
(a) the distribution of X is normal
(b) E(X) is np
(c) Var(X) is np(1-p)
14. Consider the density f_Z(t) = 2t/15 for 1 < t < 4 and 0 elsewhere. Find the median of Z, as well as Z's third moment, E(Z^3), and its third central moment, E[(Z − EZ)^3].
15. Suppose X has a uniform distribution on the interval (20,40), and we know that X is greater
than 25. What is the probability that X is greater than 32?
16. Suppose U and V have the 2t/15 density on (1,4). Let N denote the number of values among
U and V that are greater than 1.5, so N is either 0, 1 or 2. Find Var(N).
17. Find the value of E(X^4) if X has an N(0,1) distribution. (Give your answer as a number, not an integral.)
Chapter 5
Describing Failure
In addition to density functions, another useful description of a distribution is its hazard function. Again think of the lifetimes of light bulbs, not necessarily assuming an exponential distribution. Intuitively, the hazard function states the likelihood of a bulb failing in the next short interval of time, given that it has lasted up to now. To understand this, let's first talk about a certain property of the exponential distribution family.
5.1 Memoryless Property
One of the reasons the exponential family of distributions is so famous is that it has a property that
makes many practical stochastic models mathematically tractable: The exponential distributions
are memoryless.
5.1.1 Derivation and Intuition
What the term memoryless means for a random variable W is that for all positive t and u

P(W > t+u | W > t) = P(W > u)   (5.1)

Any exponentially distributed random variable has this property. Let's derive this:
P(W > t+u | W > t) = P(W > t+u and W > t) / P(W > t)   (5.2)
                   = P(W > t+u) / P(W > t)   (5.3)
                   = [∫_{t+u}^∞ λe^{−λs} ds] / [∫_t^∞ λe^{−λs} ds]   (5.4)
                   = e^{−λu}   (5.5)
                   = P(W > u)   (5.6)
We say that this means that "time starts over" at time t, or that W "doesn't remember" what happened before time t.
It is difficult for the beginning modeler to fully appreciate the memoryless property. Let's make it concrete. Consider the problem of waiting to cross the railroad tracks on Eighth Street in Davis, just west of J Street. One cannot see down the tracks, so we don't know whether the end of the train will come soon or not.

If we are driving, the issue at hand is whether to turn off the car's engine. If we leave it on, and the end of the train does not come for a long time, we will be wasting gasoline; if we turn it off, and the end does come soon, we will have to start the engine again, which also wastes gasoline. (Or, we may be deciding whether to stay there, or go way over to the Covell Rd. railroad overpass.)
Suppose our policy is to turn off the engine if the end of the train won't come for at least s seconds. Suppose also that we arrived at the railroad crossing just when the train first arrived, and we have already waited for r seconds. Will the end of the train come within s more seconds, so that we will keep the engine on? If the length of the train were exponentially distributed (if there are typically many cars, we can model it as continuous even though it is discrete), Equation (5.1) would say that the fact that we have waited r seconds so far is of no value at all in predicting whether the train will end within the next s seconds. The chance of it lasting at least s more seconds right now is no more and no less than the chance it had of lasting at least s seconds when it first arrived.
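We can also verify (5.1) numerically in R; with λ = 0.1, t = 8 and u = 5, for instance, the two sides agree (a sketch):

(1 - pexp(13, 0.1)) / (1 - pexp(8, 0.1))  # P(W > t+u | W > t)
1 - pexp(5, 0.1)                          # P(W > u); same value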
By the way, the exponential distributions are the only continuous distributions which are memoryless. (Note the word continuous; in the discrete realm, the family of geometric distributions is also uniquely memoryless.) This too has implications for the theory. A rough proof of this uniqueness is as follows:
Suppose some continuous random variable V has the memoryless property, and let R(t) denote 1 − F_V(t). Then from (5.1), we would have

R(t+u)/R(t) = R(u)   (5.7)

or

R(t+u) = R(t)R(u)   (5.8)

Differentiating both sides with respect to t, we'd have

R′(t+u) = R′(t)R(u)   (5.9)

Setting t to 0, this would say

R′(u) = R′(0)R(u)   (5.10)

This is a well-known differential equation, whose solution is

R(u) = e^{cu}   (5.11)

which is exactly 1 minus the cdf for an exponentially distributed random variable.
5.1.2 Continuous-Time Markov Chains
The memorylessness of exponential distributions implies that a Poisson process N(t) also has a "time starts over" property: Recall our example in Section 4.5.4.5 in which N(t) was the number of light bulb burnouts up to time t. The memorylessness property means that if we start counting afresh from time, say, z, then the number of burnouts after time z, i.e. Q(u) = N(z+u) − N(z), also forms a Poisson process. In other words, Q(u) has a Poisson distribution with parameter λu. Moreover, Q(u) is independent of N(t) for any t < z.

All this should remind you of Markov chains, which we introduced in Section 3.15, and it should. Continuous-time Markov chains are defined in the same way as the discrete-time ones in Section 3.15, but with the process staying in each state for a random amount of time. From the considerations here, you can now see that that time must have an exponential distribution. This will be discussed at length in Chapter 17.
5.1.3 Example: Nonmemoryless Light Bulbs
Suppose the lifetimes in years of light bulbs have the density 2t/15 on (1,4), 0 elsewhere. Say I've been using bulb A for 2.5 years now in a certain lamp, and am continuing to use it. But at this time I put a new bulb, B, in a second lamp. I am curious as to which bulb is more likely to burn out within the next 1.2 years. Let's find the two probabilities.

For bulb A:

P(L > 3.7 | L > 2.5) = P(L > 3.7) / P(L > 2.5) = 0.24   (5.12)

For bulb B:

P(X > 1.2) = ∫_{1.2}^4 2t/15 dt = 0.97   (5.13)

So you can see that the bulbs do have "memory."
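These two numbers are easily verified in R with integrate() (a sketch):

f <- function(t) 2*t/15                      # the density on (1,4)
pgt <- function(a) integrate(f, a, 4)$value  # P(lifetime > a)
pgt(3.7) / pgt(2.5)  # bulb A: about 0.24
pgt(1.2)             # bulb B: about 0.97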
5.2 Hazard Functions
5.2.1 Basic Concepts
Suppose the lifetimes of light bulbs L were discrete. Suppose a particular bulb has already lasted 80 hours. The probability of it failing in the next hour would be

P(L = 81 | L > 80) = P(L = 81 and L > 80) / P(L > 80) = P(L = 81) / P(L > 80) = p_L(81) / [1 − F_L(80)]   (5.14)
In general, for discrete L, we define its hazard function as

h_L(i) = p_L(i) / [1 − F_L(i−1)]   (5.15)

By analogy, for continuous L we define

h_L(t) = f_L(t) / [1 − F_L(t)]   (5.16)

Again, the interpretation is that h_L(t) is the likelihood of the item failing very soon after t, given that it has lasted t amount of time.
Note carefully that the word failure here should not be taken literally. In our Davis railroad crossing example above, "failure" means that the train ends, a "failure" which those of us who are waiting will welcome!
Since we know that exponentially distributed random variables are memoryless, we would expect intuitively that their hazard functions are constant. We can verify this by evaluating (5.16) for an exponential density with parameter λ; sure enough, the hazard function is constant, with value λ.
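Explicitly, since f_W(t) = λe^{−λt} and 1 − F_W(t) = e^{−λt}, we have

h_W(t) = λe^{−λt} / e^{−λt} = λ, for all t > 0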
The reader should verify that, in contrast to an exponential distribution's constant failure rate, a uniform distribution has an increasing failure rate (IFR). Some distributions have decreasing failure rates, while most have non-monotone rates.
Hazard function models have been used extensively in software testing. Here "failure" is the discovery of a bug, and the quantities of interest include the mean time until the next bug is discovered, and the total number of bugs.

Some parametric families of distributions have strictly increasing failure rates (IFR). Some have strictly decreasing failure rates (DFR). People have what is called a "bathtub-shaped" hazard function: it is high near 0 (reflecting infant mortality) and after, say, age 70, but is low and rather flat in between.
You may have noticed that the right-hand side of (5.16) is the derivative of −ln[1 − F_L(t)]. Therefore

∫_0^t h_L(s) ds = −ln[1 − F_L(t)]   (5.17)

so that

1 − F_L(t) = e^{−∫_0^t h_L(s) ds}   (5.18)

and thus¹

f_L(t) = h_L(t) e^{−∫_0^t h_L(s) ds}   (5.19)

In other words, just as we can find the hazard function knowing the density, we can also go in the reverse direction. This establishes that there is a one-to-one correspondence between densities and hazard functions.
This may guide our choice of parametric family for modeling some random variable. We may not
only have a good idea of what general shape the density takes on, but may also have an idea of
what the hazard function looks like. These two pieces of information can help guide us in our choice
of model.
¹Recall that the derivative of the integral of a function is the original function!
5.2.2 Example: Software Reliability Models
Hazard function models have been used successfully to model the arrivals (i.e. discoveries) of bugs in software. Questions that arise are, for instance, "When are we ready to ship?", meaning when can we believe with some confidence that most bugs have been found?

Typically one collects data on bug discoveries from a number of projects of similar complexity, and estimates the hazard function from that data. One such investigation is Ohishia et al, Gompertz Software Reliability Model: Estimation Algorithm and Empirical Validation, Journal of Systems and Software, 82, 3, 2009, 535-543.

See Accurate Software Reliability Estimation, by Jason Allen Denton, Dept. of Computer Science, Colorado State University, 1999, and the many references therein.
5.3 A Cautionary Tale: the Bus Paradox
Suppose you arrive at a bus stop, at which buses arrive according to a Poisson process with intensity parameter 0.1, i.e. 0.1 arrival per minute. Recall that this means that the interarrival times have an exponential distribution with mean 10 minutes. What is the expected value of your waiting time until the next bus?

Well, our first thought might be that since the exponential distribution is memoryless, "time starts over" when we reach the bus stop. Therefore our mean wait should be 10.

On the other hand, we might think that on average we will arrive halfway between two consecutive buses. Since the mean time between buses is 10 minutes, the halfway point is at 5 minutes. Thus it would seem that our mean wait should be 5 minutes.

Which analysis is correct? Actually, the correct answer is 10 minutes. So, what is wrong with the second analysis, which concluded that the mean wait is 5 minutes? The problem is that the second analysis did not take into account the fact that although inter-bus intervals have an exponential distribution with mean 10, the particular inter-bus interval that we encounter is special.
5.3.1 Length-Biased Sampling
Imagine a bag full of sticks, of different lengths. We reach into the bag and choose a stick at random. The key point is that not all pieces are equally likely to be chosen; the longer pieces will have a greater chance of being selected.

Say for example there are 50 sticks in the bag, with ID numbers from 1 to 50. Let X denote the length of the stick we obtain if we select a stick on an equal-probability basis, i.e. each stick having probability 1/50 of being chosen. (We select a random number I from 1 to 50, and choose the stick with ID number I.) On the other hand, let Y denote the length of the stick we choose by reaching into the bag and pulling out whichever stick we happen to touch first. Intuitively, the distribution of Y should favor the longer sticks, so that for instance EY > EX.
Let's look at this from a notebook point of view. We pull a stick out of the bag by random ID number, and record its length in the X column of the first line of the notebook. Then we replace the stick, choose a stick by the first-touch method, and record its length in the Y column of the first line. Then we do all this again, recording on the second line, and so on. Again, because the first-touch method will favor the longer sticks, the long-run average of the Y column will be larger than the one for the X column.
Another example was suggested to me by UCD grad student Shubhabrata Sengupta. Think of a large parking lot on which hundreds of buckets are placed, of various diameters. We throw a ball high into the sky, and see what size bucket it lands in. Here the density would be proportional to the area of the bucket, i.e. to the square of the diameter.
Similarly, the particular inter-bus interval that we hit is likely to be a longer interval. To see this,
suppose we observe the comings and goings of buses for a very long time, and plot their arrivals
on a time line on a wall. In some cases two successive marks on the time line are close together,
sometimes far apart. If we were to stand far from the wall and throw a dart at it, we would hit
the interval between some pair of consecutive marks. Intuitively we are more apt to hit a wider
interval than a narrower one.
The formal name for this is length-biased sampling.
Once one recognizes this and carefully derives the density of that interval (see below), we discover that that interval does indeed tend to be longer, so much so that the expected value of this interval is 20 minutes! Thus the halfway point comes at 10 minutes, consistent with the analysis which appealed to the memoryless property, thus resolving the paradox.

In other words, if we throw a dart at the wall, say, 1000 times, the mean of the 1000 intervals we would hit would be about 20. This is in contrast to the mean of all of the intervals on the wall, which would be 10.
5.3.2 Probability Mass Functions and Densities in Length-Biased Sampling
Actually, we can intuitively reason out what the density is of the length of the particular inter-bus
interval that we hit, as follows.
First consider the bag-of-sticks example, and suppose (somewhat artificially) that stick length X is a discrete random variable. Let Y denote the length of the stick that we pick by randomly touching a stick in the bag.

Again, note carefully that for the reasons we've been discussing here, the distributions of X and Y are different. Say we have a list of all sticks, and we choose a stick at random from the list. Then the length of that stick will be X. But if we choose by touching a stick in the bag, that length will be Y.

Now suppose that, say, stick lengths 2 and 6 each comprise 10% of the sticks in the bag, i.e.

p_X(2) = p_X(6) = 0.1   (5.20)
Intuitively, one would then reason that

p_Y(6) = 3 p_Y(2)   (5.21)

In other words, even though the sticks of length 2 are just as numerous as those of length 6, the latter are three times as long, so they should have triple the chance of being chosen. So, the chance of our choosing a stick of length j depends not only on p_X(j) but also on j itself.

We could write that formally as

p_Y(j) ∝ j p_X(j)   (5.22)

where ∝ is the "is proportional to" symbol. Thus

p_Y(j) = c j p_X(j)   (5.23)
for some constant of proportionality c.

But a probability mass function must sum to 1. So, summing over all possible values of j (whatever they are), we have

1 = Σ_j p_Y(j) = Σ_j c j p_X(j)   (5.24)

That last term is c E(X)! So, c = 1/EX, and

p_Y(j) = (1/EX) j p_X(j)   (5.25)
The continuous analog of (5.25) is

f_Y(t) = (1/EX) t f_X(t)   (5.26)

So, for our bus example, in which f_X(t) = 0.1 e^{−0.1t}, t > 0 and EX = 10,

f_Y(t) = 0.01 t e^{−0.1t}   (5.27)

You may recognize this as an Erlang density with r = 2 and λ = 0.1. That distribution does indeed have mean 20.
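Here is a simulation sketch of the dart-throwing experiment described above:

hitint <- replicate(5000, {
   arrivals <- cumsum(rexp(3000, 0.1))  # bus arrival times; mean spacing 10
   i <- min(which(arrivals > 10000))    # the interval containing "dart" time 10000
   arrivals[i] - arrivals[i-1]          # its length
})
mean(hitint)  # about 20, not 10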
5.4 Residual-Life Distribution
In the bus-paradox example, if we had been working with light bulbs instead of buses, the analog of the time we wait for the next bus would be the remaining lifetime of the current light bulb. The time from a fixed time point t until the next bulb replacement is known as the residual life. (Another name for it is the forward recurrence time.)

Our aim here is to derive the distribution of the residual life. To do this, let's first bring in some terminology from renewal theory.
5.4.1 Renewal Theory
Recall the light bulb example of Section 4.5.4.5. Every time a light bulb burns out, we immediately replace it with a new one. The time of the r-th replacement is denoted by T_r, and satisfies the relation

N(t) = max{k : T_k ≤ t}   (5.28)

where N(t) is the number of replacements that have occurred by time t and X_i is the lifetime of the i-th bulb. The random variables X_1, X_2, ... are assumed independent and identically distributed (i.i.d.); we will NOT assume that their common distribution is exponential, though.

Note that for each t > 0, N(t) is a random variable, so we have a collection of random variables indexed by t. This collection is called a renewal process, the name being motivated by the idea of "renewals" occurring when light bulbs burn out. We say that N(t) is the number of renewals by time t.
In the bus paradox example, we can think of bus arrivals as renewals too, with the interbus times being analogous to the light bulb lifetimes, and with N(t) being the number of buses that have arrived by time t.

Note the following for general renewal processes:

Duality Between "Lifetime Domain" and "Counts Domain":

A very important property of renewal processes is that

N(t) ≥ k if and only if T_k ≤ t   (5.29)

This is just a formal mathematical statement of common sense: There have been at least k renewals by now if and only if the k-th renewal has already occurred! But it is a very important device in renewal analysis.

Equation (5.29) might be described as relating the "counts domain" (left-hand side of the equation) to the "lifetimes domain" (right-hand side).
There is a very rich theory of renewal processes, but let's move on to our goal of finding the distribution of residual life.
5.4.2 Intuitive Derivation of Residual Life for the Continuous Case
Here is a derivation for the case of continuous X_i. For concreteness think of the bus case, but the derivation is general.

Denote by V the length of the interbus interval that we happen to hit when we arrive at the bus stop, and let D denote the residual life, i.e. the time until the next bus. The key point is that, given V, D is uniformly distributed on (0,V). To see this, think of the stick example. If the stick that we happen to touch first has length V, the point at which we touched it could be anywhere from one end to the other with equal likelihood. So,

f_{D|V}(s,t) = 1/t, 0 < s < t   (5.30)

Thus (9.2) yields

f_{D,V}(s,t) = (1/t) f_V(t), 0 < s < t   (5.31)

Then (8.17) shows
f_D(s) = ∫_s^∞ (1/t) f_V(t) dt   (5.32)
       = ∫_s^∞ (1/EX) f_X(t) dt   (5.33)
       = [1 − F_X(s)] / EX   (5.34)

This is a classic result, of central importance and usefulness, as seen in our upcoming examples later in this section.²
It should be noted that all of this assumes a long-run situation. In our bus example, for instance, it implicitly assumes that when we arrive at the bus stop at 5:00, the buses have been running for quite a while. To state this more precisely, let's let D depend on t: D(t) will be the residual life at time t, e.g. the time we must wait for the next bus if we arrive at the stop at time t. Then (5.32) is really the limiting density of f_{D(t)}, as t → ∞.
5.4.3 Age Distribution
Analogous to the residual lifetime D(t), let A(t) denote the age (sometimes called the "backward recurrence time") of the current light bulb, i.e. the length of time it has been in service. (In the bus-paradox example, A(t) would be the time which has elapsed since the last arrival of a bus, to the current time t.) Using an approach similar to that taken above, one can show that

lim_{t→∞} f_{A(t)}(w) = [1 − F_L(w)] / E(L)   (5.35)

In other words, A(t) has the same long-run distribution as D(t)!
Here is a derivation for the case in which the lifetimes are discrete. (We'll call them L_i here, with L being the generic random variable.) Remember, our fixed observation point t is assumed large, so that the system is in steady-state. Let W denote the lifetime so far for the current bulb. Say we have a new bulb at time 52. Then W is 0 at that time. If the total lifetime turns out to be, say, 12, then W will be 0 again at time 64.

Then we have a Markov chain in which our state at any time is the value of W. In fact, the transition probabilities for this chain are the values of the hazard function of L:
²If you are wondering about the first equality in (5.32), it is basically a continuous analog of

P(A) = P(A and B_1 or A and B_2 or ...) = P(A|B_1)P(B_1) + P(A|B_2)P(B_2) + ...

for disjoint events B_1, B_2, .... This is stated more precisely in Section 9.1.3.
First note that when we are in state i, i.e. W = i, we know that the current bulb's lifetime is at least i+1. If its lifetime is exactly i+1, our next state will be 0. So,

p_{i,0} = P(L = i+1 | L > i) = p_L(i+1) / [1 − F_L(i)]   (5.36)

p_{i,i+1} = [1 − F_L(i+1)] / [1 − F_L(i)]   (5.37)

Define

q_i = [1 − F_L(i+1)] / [1 − F_L(i)]   (5.38)

and write

π_{i+1} = π_i q_i   (5.39)
Applying (5.39) recursively, we have

π_{i+1} = π_0 q_i q_{i−1} ⋯ q_0   (5.40)

But the right-hand side of (5.40) telescopes down to

π_{i+1} = π_0 [1 − F_L(i+1)]   (5.41)
Then

1 = Σ_{i=0}^∞ π_i = π_0 Σ_{i=0}^∞ [1 − F_L(i)] = π_0 E(L)   (5.42)

Thus

π_i = [1 − F_L(i)] / EL   (5.43)

in analogy to (5.35).
5.4.4 Mean of the Residual and Age Distributions
Taking the expected value of (5.32) or (5.35), we get a double integral. Reversing the order of integration, we find that the mean residual life or age is given by

E(L^2) / (2EL)   (5.44)
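As a sanity check, for exponential lifetimes with parameter λ we have E(L^2) = 2/λ², so (5.44) yields (2/λ²)/(2/λ) = 1/λ, agreeing with the mean wait of 10 minutes found in the bus paradox when λ = 0.1.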
5.4.5 Example: Estimating Web Page Modication Rates
My paper, Estimation of Internet File-Access/Modification Rates, ACM Transactions on Modeling and Computer Simulation, 2005, 15, 3, 233-253, concerns the following problem.

Suppose we are interested in the rate of modification of a file in some FTP repository on the Web. We have a spider visit the site at regular intervals. At each visit, the spider records the time of last modification to the site. We do not observe how MANY times the site was modified. The problem then is how to estimate the modification rate from the last-modification time data that we do have.

I assumed that the modifications follow a renewal process. Then the difference between the spider visit time and the time of last modification is equal to the age A(t). I then applied a lot of renewal theory to develop statistical estimators for the modification rate.
5.4.6 Example: Disk File Model
Suppose a disk will store backup files. We place the first file in the first track on the disk, then the second file right after the first in the same track, etc. Occasionally we will run out of room on a track, and the file we are placing at the time must be split between this track and the next. Suppose the amount of room X taken up by a file (a continuous random variable in this model) is uniformly distributed between 0 and 3 tracks.

Some tracks will contain data from only one file. (The file may extend onto other tracks as well.) Let's find the long-run proportion of tracks which have this property.

Think of the disk as consisting of a Very Long Line, with the end of one track being followed immediately by the beginning of the next track. The points at which files begin then form a renewal process, with "time" being distance along the Very Long Line. If we observe the disk at the end of the k-th track, this is observing at "time" k. That track consists entirely of one file if and only if the age A of the current file, i.e. the distance back to the beginning of that file, is greater than 1.0.
Then from Equation (5.35), we have

f_A(w) = (1 − w/3) / 1.5 = 2/3 − (2/9)w   (5.45)

Then

P(A > 1) = ∫_1^3 [2/3 − (2/9)w] dw = 4/9   (5.46)
5.4.7 Example: Memory Paging Model
(Adapted from Probability and Statistics, with Reliability, Queuing and Computer Science Applications, by K.S. Trivedi, Prentice-Hall, 1982 and 2002.)

Consider a computer with an address space consisting of n pages, and a program which generates a sequence of memory references with addresses (page numbers) D_1, D_2, ... In this simple model, the D_i are assumed to be i.i.d. integer-valued random variables.

For each page i, let T_{ij} denote the time at which the j-th reference to page i occurs. Then for each fixed i, the T_{ij} form a renewal process, and thus all the theory we have developed here applies.³
Let F_i be the cumulative distribution function for the interrenewal distribution, i.e.

F_i(m) = P(L_{ij} ≤ m), where L_{ij} = T_{ij} − T_{i,j−1}, for m = 0, 1, 2, ...
Let W(t, τ) denote the working set at time t, i.e. the collection of page numbers of pages accessed during the time interval (t − τ, t), and let S(t, τ) denote the size of that set. We are interested in finding the value of

s(τ) = lim_{t→∞} E[S(t, τ)]   (5.47)

Since the definition of the working set involves looking backward τ amount of time from time t, a good place to look for an approach to finding s(τ) might be to use the limiting distribution of backward-recurrence time, given by Equation (5.43).

Accordingly, let A_i(t) be the age at time t for page i. Then page i is in the working set if and only if it has been accessed after time t − τ, i.e. A_i(t) < τ.

³Note, though, that all random variables here are discrete, not continuous.
Thus, using (5.43) and letting 1_i be 1 or 0 according to whether or not A_i(t) < τ, we have that

s(τ) = lim_{t→∞} E(Σ_{i=1}^n 1_i)
     = lim_{t→∞} Σ_{i=1}^n P(A_i(t) < τ)
     = Σ_{i=1}^n Σ_{j=0}^{τ−1} [1 − F_i(j)] / E(L_i)   (5.48)
Exercises
1. Use R to plot the hazard functions for the gamma distributions plotted in Figure 4.2, plus the
case r = 0.5. Comment on the implications for trains at 8th and J Streets in Davis.
2. Consider the random bucket example in Section 5.3. Suppose bucket diameter D, measured
in meters, has a uniform distribution on (1,2). Let W denote the diameter of the bucket in which
the tossed ball lands.
(a) Find the density, mean and variance of W, and also P(W > 1.5)
(b) Write an R function that will generate random variates having the distribution of W.
3. In Section 5.1, we showed that the exponential distribution is memoryless. In fact, it is the only
continuous distribution with that property. Show that the U(0,1) distribution does NOT have that
property. To do this, evaluate both sides of (5.1).
4. Suppose f_X(t) = 1/t^2 on (1, ∞), 0 elsewhere. Find h_X(2.0).
5. Consider the three-sided die on page 31. Find the hazard function h_V(t), where V is the number of dots obtained on one roll (1, 2 or 3).
6. Suppose f_X(t) = 2t for 0 < t < 1 and the density is 0 elsewhere.

(a) Find h_X(0.5).

(b) Which statement concerning this distribution is correct? (i) IFR. (ii) DFR. (iii) U-shaped failure rate. (iv) Sinusoidal failure rate. (v) Failure rate is undefined for t > 0.5.
Chapter 6
Stop and Review
There's quite a lot of material in the preceding chapters, but it's crucial that you have a good command of it before proceeding, as the coming chapters will continue to build on it.

With that aim, here are the highlights of what we've covered so far, with links to the places at which they were covered:
expected value (Section 3.4):

Consider random variables X and Y (not assumed independent), and constants c_1 and c_2. We have:

E(X + Y) = EX + EY   (6.1)

E(c_1 X) = c_1 EX   (6.2)

E(c_1 X + c_2 Y) = c_1 EX + c_2 EY   (6.3)

By induction,

E(a_1 U_1 + ... + a_k U_k) = a_1 EU_1 + ... + a_k EU_k   (6.4)

for random variables U_i and constants a_i.
variance (Section 3.5):
Consider random variables X and Y (now assumed independent), and constants c_1 and c_2. We have:

Var(X + Y) = Var(X) + Var(Y)   (6.5)

Var(c_1 X) = c_1^2 Var(X)   (6.6)

By induction,

Var(a_1 U_1 + ... + a_k U_k) = a_1^2 Var(U_1) + ... + a_k^2 Var(U_k)   (6.7)

for independent random variables U_i and constants a_i.
indicator random variables (Section 3.6):

Equal 1 or 0, depending on whether a specified event A occurs.

If T is an indicator random variable for the event A, then

ET = P(A), Var(T) = P(A)[1 − P(A)]   (6.8)

distributions:

cdfs (Section 4.3):

For any random variable X,

F_X(t) = P(X ≤ t), −∞ < t < ∞   (6.9)

pmfs (Section 3.11):

For a discrete random variable X,

p_X(k) = P(X = k)   (6.10)

density functions (Section 3.11):

For a continuous random variable X,

f_X(t) = d/dt F_X(t), −∞ < t < ∞   (6.11)

and

P(X in A) = ∫_A f_X(s) ds   (6.12)
famous parametric families of distributions:

Just as one can have a family of curves, say sin(2πnt) (a different curve for each n), certain families of distributions have been found useful. They're called parametric families, because they are indexed by one or more parameters, analogously to n above.

discrete:

geometric (Section 3.12.1):

Number of i.i.d. trials until first success. For success probability p:

p_N(k) = (1 − p)^{k−1} p   (6.13)

EN = 1/p, Var(N) = (1 − p)/p^2   (6.14)

binomial (Section 3.12.2):

Number of successes in n i.i.d. trials, probability p of success per trial:

p_N(k) = (n choose k) p^k (1 − p)^{n−k}   (6.15)

EN = np, Var(N) = np(1 − p)   (6.16)

Poisson (Section 3.12.4):

Has often been found to be a good model for counts over time periods.

One parameter, often called λ. Then

p_N(k) = e^{−λ} λ^k / k!, k = 0, 1, 2, ...   (6.17)

EN = Var(N) = λ   (6.18)

negative binomial (Section 3.12.3):

Number of i.i.d. trials until the r-th success. For success probability p:

p_N(k) = (k−1 choose r−1) (1 − p)^{k−r} p^r, k = r, r+1, ...   (6.19)

E(N) = r · (1/p), Var(N) = r (1 − p)/p^2   (6.20)
continuous:
uniform (Section 4.5.1.1):

All points "equally likely." If the interval is (q,r),

f_X(t) = 1/(r − q), q < t < r   (6.21)

EX = (q + r)/2, Var(X) = (1/12)(r − q)^2   (6.22)

normal (Gaussian) (Section 4.5.2):

Bell-shaped curves. Useful due to the Central Limit Theorem (Section 4.5.2.4). (Thus a good approximation to the binomial distribution.)

Closed under affine transformations (Section 4.5.2.1)!

Parameterized by mean and variance, μ and σ^2:

f_X(t) = [1/(σ√(2π))] e^{−0.5((t−μ)/σ)^2}, −∞ < t < ∞   (6.23)

exponential (Section 4.5.4):

Memoryless! One parameter, usually called λ. Connected to the Poisson family.

f_X(t) = λe^{−λt}, 0 < t < ∞   (6.24)

EX = 1/λ, Var(X) = 1/λ^2   (6.25)

gamma (Section 4.5.5):

A special case, the Erlang family, arises as the distribution of the sum of i.i.d. exponential random variables.

f_X(t) = [1/Γ(r)] λ^r t^{r−1} e^{−λt}, t > 0   (6.26)
Chapter 7
Covariance and Random Vectors
Most applications of probability and statistics involve the interaction between variables. For in-
stance, when you buy a book at Amazon.com, the software will likely inform you of other books
that people bought in conjunction with the one you selected. Amazon is relying on the fact that
sales of certain pairs or groups of books are correlated.
Thus we need the notion of distributions that describe how two or more variables vary together.
This chapter develops that notion, which forms the very core of statistics.
7.1 Measuring Co-variation of Random Variables
7.1.1 Covariance
Definition 16 The covariance between random variables X and Y is defined as

Cov(X,Y) = E[(X − EX)(Y − EY)]   (7.1)
Suppose that typically when X is larger than its mean, Y is also larger than its mean, and vice versa for below-mean values. Then (7.1) will likely be positive. In other words, if X and Y are positively correlated (a term we will define formally later but keep intuitive for now), then their covariance is positive. Similarly, if X is often smaller than its mean whenever Y is larger than its mean, the covariance and correlation between them will be negative. All of this is roughly speaking, of course, since it depends on how much and how often X is larger or smaller than its mean, etc.
Linearity in both arguments:

Cov(aX + bY, cU + dV) = ac Cov(X,U) + ad Cov(X,V) + bc Cov(Y,U) + bd Cov(Y,V)   (7.2)

for any constants a, b, c and d.

Insensitivity to additive constants:

Cov(X, Y + q) = Cov(X,Y)   (7.3)

for any constant q, and so on.

Covariance of a random variable with itself:

Cov(X,X) = Var(X)   (7.4)

for any X with finite variance.

Shortcut calculation of covariance:

Cov(X,Y) = E(XY) − EX · EY   (7.5)
The proof will help you review some important issues, namely (a) E(U+V) = EU + EV, (b) E(cU) = c EU and Ec = c for any constant c, and (c) EX and EY are constants in (7.5).

Cov(X,Y) = E[(X − EX)(Y − EY)]  (definition)   (7.6)
         = E[XY − EX · Y − EY · X + EX · EY]  (algebra)   (7.7)
         = E(XY) − E[EX · Y] − E[EY · X] + E[EX · EY]  (E[U+V] = EU + EV)   (7.8)
         = E(XY) − EX · EY  (E[cU] = c EU, Ec = c)   (7.9)
Variance of sums:
Var(X + Y) = Var(X) + Var(Y) + 2Cov(X,Y)   (7.10)

This comes from (7.5), the relation Var(X) = E(X^2) − (EX)^2 and the corresponding one for Y. Just substitute and do the algebra.
By induction, (7.10) generalizes to more than two variables:

Var(W_1 + ... + W_r) = Σ_{i=1}^r Var(W_i) + 2 Σ_{1 ≤ j < i ≤ r} Cov(W_i, W_j)   (7.11)
7.1.2 Example: Variance of Sum of Nonindependent Variables
Consider random variables X_1 and X_2, for which Var(X_i) = 1.0 for i = 1,2, and Cov(X_1,X_2) = 0.5. Let's find Var(X_1 + X_2).

This is quite straightforward, from (7.10):

Var(X_1 + X_2) = Var(X_1) + Var(X_2) + 2Cov(X_1,X_2) = 3   (7.12)
7.1.3 Example: the Committee Example Again
Let's find Var(M) in the committee example of Section 3.7. In (3.53), we wrote M as a sum of indicator random variables:

M = G_1 + G_2 + G_3 + G_4   (7.13)

and found that

P(G_i = 1) = 2/3   (7.14)

for all i.

You should review why this value is the same for all i, as this reasoning will be used again below. Also review Section 3.6.

Applying (7.11) to (7.13), we have

Var(M) = 4 Var(G_1) + 12 Cov(G_1,G_2)   (7.15)

Finding that first term is easy, from (3.45):

Var(G_1) = (2/3)(1 − 2/3) = 2/9   (7.16)
Now, what about Cov(G_1,G_2)? Equation (7.5) will be handy here:

Cov(G_1,G_2) = E(G_1 G_2) − E(G_1)E(G_2)   (7.17)

The first term in (7.17) is

E(G_1 G_2) = P(G_1 = 1 and G_2 = 1)   (7.18)
           = P(choose a man on both the first and second pick)   (7.19)
           = (6/9)(5/8)   (7.20)
           = 5/12   (7.21)

The second term in (7.17) is, again from Section 3.6,

(2/3)^2 = 4/9   (7.22)

All that's left is to put this together in (7.15), left to the reader.
7.1.4 Correlation
Covariance does measure how much or little X and Y vary together, but it is hard to decide whether a given value of covariance is "large" or not. For instance, if we are measuring lengths in feet and change to inches, then (7.2) shows that the covariance will increase by a factor of 12^2 = 144. Thus it makes sense to scale covariance according to the variables' standard deviations. Accordingly, the correlation between two random variables X and Y is defined by

ρ(X,Y) = Cov(X,Y) / [√Var(X) √Var(Y)]   (7.23)

So, correlation is unitless, i.e. does not involve units like feet, pounds, etc.

It is shown later in this chapter that

−1 ≤ ρ(X,Y) ≤ 1

|ρ(X,Y)| = 1 if and only if X and Y are exact linear functions of each other, i.e. Y = cX + d for some constants c and d
7.1.5 Example: a Catchup Game
Consider the following simple game. There are two players, who take turns playing. One's position after k turns is the sum of one's winnings in those turns. Basically, a turn consists of generating a random U(0,1) variable, with one difference: if that player is currently losing, he gets a "bonus" of 0.2 to help him catch up.

Let X and Y be the total winnings of the two players after 10 turns. Intuitively, X and Y should be positively correlated, due to the 0.2 bonus which brings them closer together. Let's see if this is true.

Though very simply stated, this problem is far too tough to solve mathematically in an elementary course (or even an advanced one). So, we will use simulation. In addition to finding the correlation between X and Y, we'll also find F_{X,Y}(5.8, 5.2).
taketurn <- function(a,b) {
   win <- runif(1)
   if (a >= b) return(win)
   else return(win+0.2)
}

nreps <- 10000  # number of simulated games; not set in the original listing
nturns <- 10
xyvals <- matrix(nrow=nreps,ncol=2)
for (rep in 1:nreps) {
   x <- 0
   y <- 0
   for (turn in 1:nturns) {
      # x's turn
      x <- x + taketurn(x,y)
      # y's turn
      y <- y + taketurn(y,x)
   }
   xyvals[rep,] <- c(x,y)
}
print(cor(xyvals[,1],xyvals[,2]))
# estimate F_{X,Y}(5.8,5.2), i.e. P(X <= 5.8 and Y <= 5.2), as a sample proportion
print(mean(xyvals[,1] <= 5.8 & xyvals[,2] <= 5.2))
The output is 0.65. So, X and Y are indeed positively correlated as we had surmised.
Note the use of R's built-in function cor() to compute correlation, a shortcut that allows us to avoid summing all the products xy and so on, from (7.5). The reader should make sure he/she understands how this would be done.
7.2 Sets of Independent Random Variables
Recall from Section 3.3:
Definition 17 Random variables X and Y are said to be independent if for any sets I and J, the events {X is in I} and {Y is in J} are independent, i.e. P(X is in I and Y is in J) = P(X is in I) P(Y is in J).

Intuitively, though, it simply means that knowledge of the value of X tells us nothing about the value of Y, and vice versa.

Great mathematical tractability can be achieved by assuming that the X_i in a random vector X = (X_1, ..., X_k) are independent. In many applications, this is a reasonable assumption.
7.2.1 Properties
In the next few sections, we will look at some commonly-used properties of sets of independent
random variables. For simplicity, consider the case k = 2, with X and Y being independent (scalar)
random variables.
7.2.1.1 Expected Values Factor
If X and Y are independent, then

E(XY) = E(X) E(Y)    (7.24)
7.2.1.2 Covariance Is 0
If X and Y are independent, we have

Cov(X,Y) = 0    (7.25)

and thus ρ(X,Y) = 0 as well.

This follows from (7.24) and (7.5).
However, the converse is false. A counterexample is the random pair (X,Y) that is uniformly distributed on the unit disk, {(s,t) : s^2 + t^2 ≤ 1}. Clearly 0 = E(XY) = EX = EY due to the symmetry of the distribution about (0,0), so Cov(X,Y) = 0 by (7.5).

But X and Y just as clearly are not independent. If for example we know that X > 0.8, say, then Y^2 < 1 - 0.8^2 and thus |Y| < 0.6. If X and Y were independent, knowledge of X should not tell us anything about Y, which is not the case here, and thus they are not independent. If we also know that X and Y are bivariate normally distributed (Section 8.5.2.1), then zero covariance does imply independence.
7.2.1.3 Variances Add
If X and Y are independent, then we have

Var(X + Y) = Var(X) + Var(Y)    (7.26)

This follows from (7.10) and (7.24).
7.2.2 Examples Involving Sets of Independent Random Variables
7.2.2.1 Example: Dice
In Section 7.1.1, we speculated that the correlation between X, the number on the blue die, and S, the total of the two dice, was positive. Let's compute it.

Write S = X + Y, where Y is the number on the yellow die. Then using the properties of covariance presented above, we have that
Cov(X,S) = Cov(X, X+Y)  (def. of S)    (7.27)
         = Cov(X,X) + Cov(X,Y)  (from (7.2))    (7.28)
         = Var(X) + 0  (from (7.4), (7.25))    (7.29)

Also, from (7.26),

Var(S) = Var(X+Y) = Var(X) + Var(Y)    (7.30)

But Var(Y) = Var(X). So the correlation between X and S is

ρ(X,S) = Var(X) / [√Var(X) √(2 Var(X))] = 1/√2 ≈ 0.707    (7.31)
Since correlation is at most 1 in absolute value, 0.707 is considered a fairly high correlation. Of
course, we did expect X and S to be highly correlated.
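A one-minute simulation check of (7.31) may be helpful (a sketch; the number of rolls here is an arbitrary choice):

# simulate many rolls of the blue and yellow dice
n <- 100000
x <- sample(1:6, n, replace=TRUE)  # blue die
y <- sample(1:6, n, replace=TRUE)  # yellow die
print(cor(x, x+y))  # should be near 1/sqrt(2) = 0.707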
7.2.2.2 Example: Variance of a Product
Suppose X_1 and X_2 are independent random variables with EX_i = μ_i and Var(X_i) = σ_i^2, i = 1,2. Let's find an expression for Var(X_1 X_2).
Var(X_1 X_2) = E(X_1^2 X_2^2) - [E(X_1 X_2)]^2  (from (3.30))    (7.32)
             = E(X_1^2) E(X_2^2) - μ_1^2 μ_2^2  (from (7.24))    (7.33)
             = (σ_1^2 + μ_1^2)(σ_2^2 + μ_2^2) - μ_1^2 μ_2^2    (7.34)
             = σ_1^2 σ_2^2 + μ_1^2 σ_2^2 + μ_2^2 σ_1^2    (7.35)
7.2.2.3 Example: Ratio of Independent Geometric Random Variables
Suppose X and Y are independent geometrically distributed random variables with success probability p. Let Z = X/Y. We are interested in EZ and F_Z.

First, by (7.24), we have

EZ = E(X/Y) = EX · E(1/Y) = (1/p) E(1/Y)    (7.36)
so we need to find E(1/Y):

E(1/Y) = Σ_{i=1}^∞ (1/i) (1-p)^{i-1} p    (7.37)

Unfortunately, no further simplification seems possible.

Now let's find F_Z(m) for a positive integer m.
F_Z(m) = P(X/Y ≤ m)    (7.38)
       = P(X ≤ mY)    (7.39)
       = Σ_{i=1}^∞ P(Y = i) P(X ≤ mY | Y = i)    (7.40)
       = Σ_{i=1}^∞ (1-p)^{i-1} p P(X ≤ mi)    (7.41)
       = Σ_{i=1}^∞ (1-p)^{i-1} p [1 - (1-p)^{mi}]    (7.42)
this last step coming from (3.89).
We can actually reduce (7.42) to closed form, by writing

(1-p)^{i-1} (1-p)^{mi} = (1-p)^{mi+i-1} = [1/(1-p)] [(1-p)^{m+1}]^i    (7.43)
and then using (3.79). Details are left to the reader.
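Carrying out those details gives, with q = 1-p, F_Z(m) = 1 - (p/q) q^{m+1} / (1 - q^{m+1}). Here is a sketch comparing that closed form with a truncated version of the sum (7.42); the values of p and m are arbitrary test choices:

p <- 0.4; m <- 3; q <- 1 - p
i <- 1:1000  # truncating the infinite sum at 1000 terms is plenty here
print(sum(q^(i-1) * p * (1 - q^(m*i))))      # (7.42), truncated
print(1 - (p/q) * q^(m+1) / (1 - q^(m+1)))   # closed form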
7.3 Matrix Formulations
(Note that there is a review of matrix algebra in Appendix A.)
In your first course in matrices and linear algebra, your instructor probably motivated the notion of a matrix by using an example involving linear equations, as follows.

Suppose we have a system of equations
a_{i1} x_1 + ... + a_{in} x_n = b_i, i = 1,...,n,    (7.44)

where the x_i are the unknowns to be solved for.

This system can be represented compactly as

AX = B,    (7.45)

where A is n x n and X and B are n x 1.
That compactness coming from the matrix formulation applies to statistics too, though in different ways, as we will see. (Linear algebra in general is used widely in statistics: matrices, rank and subspace, eigenvalues, even determinants.)

When dealing with multivariate distributions, some very messy equations can be greatly compactified through the use of matrix algebra. We will introduce this here.
Throughout this section, consider a random vector W = (W_1, ..., W_k)′, where ′ denotes matrix transpose, and a vector written horizontally like this without a ′ means a row vector.
7.3.1 Properties of Mean Vectors
Definition 18 The expected value of W is defined to be the vector

EW = (EW_1, ..., EW_k)′    (7.46)

The linearity of the components implies that of the vectors: For any scalar constants c and d, and any random vectors V and W, we have

E(cV + dW) = c EV + d EW    (7.47)

where the multiplication and equality is now in the vector sense.

Also, multiplication by a constant matrix factors: If A is a nonrandom matrix having k columns, then

E(AW) = A EW    (7.48)
7.3.2 Covariance Matrices
Definition 19 The covariance matrix Cov(W) of W = (W_1, ..., W_k)′ is the k x k matrix whose (i,j)th element is Cov(W_i, W_j).

Note that this implies that the diagonal elements of the matrix are the variances of the W_i, and that the matrix is symmetric.

As you can see, in the statistics world, the Cov() notation is overloaded. If it has two arguments, it is ordinary covariance, between two variables. If it has one argument, it is the covariance matrix, consisting of the covariances of all pairs of components in the argument. When people mean the matrix form, they always say so, i.e. they say "covariance MATRIX" instead of just "covariance."
The covariance matrix is just a way to compactly do operations on ordinary covariances. Here are
some important properties:
Say c is a constant scalar. Then cW is a k-component random vector like W, and

Cov(cW) = c^2 Cov(W)    (7.49)

Suppose A is an r x k nonrandom matrix. Then AW is an r-component random vector, and

Cov(AW) = A Cov(W) A′    (7.50)

Suppose V and W are independent random vectors, meaning that each component in V is independent of each component of W. (But this does NOT mean that the components within V are independent of each other, and similarly for W.) Then

Cov(V + W) = Cov(V) + Cov(W)    (7.51)

Of course, this is also true for sums of any (nonrandom) number of independent random vectors.

In analogy with (3.30), for any random vector Q,

Cov(Q) = E(QQ′) - EQ (EQ)′    (7.52)
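These mailing tubes are easy to verify numerically. Here is a sketch checking (7.50) on simulated data; the choices of Cov(W) and A are arbitrary:

library(MASS)  # for mvrnorm()
sig <- rbind(c(1,0.5), c(0.5,1))
w <- mvrnorm(100000, mu=c(0,0), Sigma=sig)  # each row is a sample of W
a <- rbind(c(1,1), c(2,-1))
print(cov(w %*% t(a)))     # sample covariance matrix of AW
print(a %*% sig %*% t(a))  # A Cov(W) A', per (7.50)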
7.3.3 Example: Easy Sum Again
Let's redo the example in Section 7.1.2, this time using matrix methods.
First note that

X_1 + X_2 = (1, 1) (X_1, X_2)′    (7.53)

so take A = (1,1). Then from (7.50),

Var(X_1 + X_2) = (1, 1) ( 1    0.5 ) ( 1 )
                        ( 0.5  1   ) ( 1 ) = 3    (7.54)
Of course using the matrix formulation didn't save us much time here, but for complex problems it's invaluable.
7.3.4 Example: (X,S) Dice Example Again
Recall Sec. 7.2.2.1. We rolled two dice, getting X and Y dots, and set S to X+Y. We then found ρ(X,S). Let's find ρ(X,S) using matrix methods.

The key is finding a proper choice for A in (7.50). A little thought shows that

( X )   ( 1  0 ) ( X )
( S ) = ( 1  1 ) ( Y )    (7.55)
Thus the covariance matrix of (X,S)′ is

Cov[(X,S)′] = ( 1  0 ) ( Var(X)    0   ) ( 1  1 )
              ( 1  1 ) (   0    Var(Y) ) ( 0  1 )    (7.56)

            = ( Var(X)     0    ) ( 1  1 )
              ( Var(X)   Var(Y) ) ( 0  1 )    (7.57)

            = ( Var(X)       Var(X)        )
              ( Var(X)   Var(X) + Var(Y)   )    (7.58)
since X and Y are independent. We would then proceed as before.
This matches what we found earlier, as it should, but shows how matrix methods can be used. This
example was fairly simple, so those methods did not produce a large amount of streamlining, but
in other examples later in the book, the matrix approach will be key.
7.3.5 Example: Dice Game
This example will necessitate a sneak preview of material in Chapter 8, but it will be worthwhile to present this example now, in order to show why covariance matrices and their properties are so important.

Suppose we roll a die 50 times. Let X denote the number of rolls in which we get one dot, and let Y be the number of times we get either two or three dots. For convenience, let's also define Z to be the number of times we get four or more dots, though our focus will be on X and Y. Suppose also that we win $5 for each roll of a one, and $2 for each roll of a two or three.

Let's find the approximate values of the following:
• P(X ≤ 12 and Y ≤ 16)
• P(win more than $90)
• P(X > Y > Z)
The exact multinomial probabilities could in principle be calculated, but that would be rather cumbersome. However, as will be shown in Section 8.5.2, the triple (X,Y,Z) has an approximate multivariate normal distribution. The latter is a generalization of the normal distribution, again covered in that section, but all we need to know here is that:
(a) If a random vector W has a multivariate normal distribution, and A is a constant matrix,
then the new random vector AW is also multivariate normally distributed.
(b) R provides functions that compute probabilities involving this family of distributions.
Just as the univariate normal family is parameterized by the mean and variance, the multivariate
normal family has as its parameters the mean vector and the covariance matrix.
We'll of course need to know the mean vector and covariance matrix of the random vector (X,Y,Z)′. Once again, this will be shown later (using (8.99) and (8.112)), but for now take them on faith:

E[(X,Y,Z)′] = (50/6, 50/3, 50/2)′    (7.59)

and

Cov[(X,Y,Z)′] = 50 (  5/36  -1/18  -1/12 )
                   ( -1/18   2/9   -1/6  )
                   ( -1/12  -1/6    1/4  )    (7.60)
We use the R function pmvnorm(), which computes probabilities of "rectangular" regions for multivariate normally distributed random vectors W. (You must first load the mvtnorm library to use this function.) The arguments we'll use for this function here are:

• mean: the mean vector
• sigma: the covariance matrix
• lower, upper: bounds for a multidimensional "rectangular" region of interest
Since a multivariate normal distribution is characterized by its mean vector and covariance matrix, the first two arguments above shouldn't surprise you. But what about the other two?

The function finds the probability of our random vector falling into a multidimensional rectangular region that we specify, through the arguments lower and upper. Note that these will typically be specified via R's c() function, but default values are "recycled" versions of -Inf and Inf, built-in R constants for -∞ and ∞.

An important special case is that in which we specify upper but allow lower to be the default values, yielding

P(W_1 ≤ c_1, ..., W_r ≤ c_r)    (7.61)

just what we need to find P(X ≤ 12 and Y ≤ 16).
To account for the integer nature of X and Y, we call the function with upper limits of 12.5 and 16.5,
rather than 12 and 16, which is often used to get a better approximation. (Recall the correction
for continuity, Section 4.5.2.7.) Our code is
library(mvtnorm)  # provides pmvnorm()
p1 <- 1/6
p23 <- 1/3
meanvec <- 50*c(p1,p23)
var1 <- 50*p1*(1-p1)
var23 <- 50*p23*(1-p23)
covar123 <- -50*p1*p23
covarmat <- matrix(c(var1,covar123,covar123,var23),nrow=2)
print(pmvnorm(upper=c(12.5,16.5),mean=meanvec,sigma=covarmat))
We find that

P(X ≤ 12 and Y ≤ 16) ≈ 0.43    (7.62)
Now, let's find the probability that our total winnings, T, is over $90. We know that T = 5X + 2Y, and property (a) above applies. We simply choose the matrix A to be

A = (5, 2, 0)    (7.63)

since

(5, 2, 0) (X, Y, Z)′ = 5X + 2Y    (7.64)
Then property (a) tells us that 5X + 2Y also has an approximate multivariate normal distribution, which of course is univariate normal here. In other words, T has an approximate normal distribution, great since we know how to find probabilities involving that distribution!
We thus need the mean and variance of T. The mean is easy:

ET = E(5X + 2Y) = 5 EX + 2 EY = 250/6 + 100/3 = 75    (7.65)

For the variance, use (7.50). (Since T is a 1-element vector, its covariance matrix reduces to simply Var(T).)

Var(T) = A Cov[(X,Y,Z)′] A′ = (5, 2, 0) · 50 (  5/36  -1/18  -1/12 ) ( 5 )
                                             ( -1/18   2/9   -1/6  ) ( 2 ) = 162.5    (7.66)
                                             ( -1/12  -1/6    1/4  ) ( 0 )

So, proceeding as in Chapter 4, we have

P(T > 90) = 1 - Φ[(90 - 75)/√162.5] ≈ 0.12    (7.67)
Now to find P(X > Y > Z), we need to work with (U,V)′ = (X - Y, Y - Z)′, so set

A = ( 1  -1   0 )
    ( 0   1  -1 )    (7.68)

and then proceed as before to find P(U > 0, V > 0). Now we take lower to be (0,0), and upper to be the default values, in pmvnorm().
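For concreteness, here is a sketch of that third calculation, with the mean vector and covariance matrix of (U,V)′ obtained via (7.48) and (7.50):

library(mvtnorm)
Sigma <- 50 * rbind(c(5/36,-1/18,-1/12), c(-1/18,2/9,-1/6), c(-1/12,-1/6,1/4))
A <- rbind(c(1,-1,0), c(0,1,-1))
print(pmvnorm(lower=c(0,0), mean=as.vector(A %*% c(50/6,50/3,50/2)),
   sigma=A %*% Sigma %*% t(A)))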
Exercises
1. Suppose the pair (X,Y) has mean vector (0,2) and covariance matrix

( 1  2 )
( 2  6 )

Find the covariance matrix of the pair U = (X+Y, X-2Y).
2. Show that

ρ(aX + b, cY + d) = ρ(X,Y)    (7.69)

for any constants a, b, c and d with ac > 0.
3. Suppose X, Y and Z are i.i.d. (independent, identically distributed) random variables, with E(X^k) being denoted by ν_k, k = 1,2,3. Find Cov(XY, XZ) in terms of the ν_k.
4. Using the properties of covariance in Section 7.1.1, show that for any random variables X and
Y, Cov(X+Y,X-Y) = Var(X) - Var(Y).
5. Suppose we wish to predict a random variable Y by using another random variable, X. We may consider predictors of the form cX + d for constants c and d. Show that the values of c and d that minimize the mean squared prediction error, E[(Y - cX - d)^2], are

c = [E(XY) - EX · EY] / Var(X)    (7.70)

d = [E(X^2) · EY - EX · E(XY)] / Var(X)    (7.71)
6. Programs A and B consist of r and s modules, respectively, of which c modules are common to both. As a simple model, assume that each module has probability p of being correct, with the modules acting independently. Let X and Y denote the numbers of correct modules in A and B, respectively. Find the correlation ρ(X,Y) as a function of r, s, c and p.

Hint: Write X = X_1 + ... + X_r, where X_i is 1 or 0, depending on whether module i of A is correct. Of those, let X_1, ..., X_c correspond to the modules in common to A and B. Similarly, write Y = Y_1 + ... + Y_s for the modules in B, again having the first c of them correspond to the modules in common.
7. Suppose we have random variables X and Y, and define the new random variable Z = 8Y. Then which of the following is correct? (i) ρ(X,Z) = ρ(X,Y). (ii) ρ(X,Z) = 0. (iii) ρ(Y,Z) = 0. (iv) ρ(X,Z) = 8ρ(X,Y). (v) ρ(X,Z) = (1/8)ρ(X,Y). (vi) There is no special relationship.
8. Derive (7.3). Hint: A constant, q here, is a random variable, trivially, with 0 variance.
9. Consider a three-card hand drawn from a 52-card deck. Let X and Y denote the number of hearts and diamonds, respectively. Find ρ(X,Y).
10. Consider the lightbulb example in Section 4.5.4.5. Use the "mailing tubes" on Var() and Cov() to find ρ(X_1, T_2).
11. Find the following quantities for the dice example in Section 7.2.2.1:

(a) Cov(X, 2S)

(b) Cov(X, S+Y)

(c) Cov(X+2Y, 3X-Y)

(d) p_{X,S}(3, 8)
12. Suppose X_i, i = 1,2,3,4,5 are independent and each have mean 0 and variance 1. Let Y_i = X_{i+1} - X_i, i = 1,2,3,4. Using the material in Section 7.3, find the covariance matrix of Y = (Y_1, Y_2, Y_3, Y_4).
Chapter 8
Multivariate PMFs and Densities
Individual pmfs p_X and densities f_X don't describe correlations between variables. We need something more. We need ways to describe multivariate distributions.
8.1 Multivariate Probability Mass Functions
Recall that for a single discrete random variable X, the distribution of X was defined to be a list of all the values of X, together with the probabilities of those values. The same is done for a pair of discrete random variables U and V, as follows.

Suppose we have a bag containing two yellow marbles, three blue ones and four green ones. We choose four marbles from the bag at random, without replacement. Let Y and B denote the number of yellow and blue marbles that we get. Then define the two-dimensional pmf of Y and B to be

p_{Y,B}(i,j) = P(Y = i and B = j) = C(2,i) C(3,j) C(4, 4-i-j) / C(9,4)    (8.1)
Here is a table displaying all the values of P(Y = i and B = j):

 i \ j     0      1      2      3
  0      0.002  0.024  0.036  0.008
  1      0.162  0.073  0.048  0.004
  2      0.012  0.024  0.006  0.000
So this table is the distribution of the pair (Y,B).
Recall further that in the discrete case, we introduced a symbolic notation for the distribution of a random variable X, defined as p_X(i) = P(X = i), where i ranged over all values that X takes on.
We do the same thing for a pair of random variables:
Definition 20 For discrete random variables U and V, their probability mass function is defined to be

p_{U,V}(i,j) = P(U = i and V = j)    (8.2)

where (i,j) ranges over all values taken on by (U,V). Higher-dimensional pmfs are defined similarly, e.g.

p_{U,V,W}(i,j,k) = P(U = i and V = j and W = k)    (8.3)

So in our marble example above, p_{Y,B}(1,2) = 0.048, p_{Y,B}(2,0) = 0.012 and so on.
Just as in the case of a single discrete random variable X we have

P(X ∈ A) = Σ_{i∈A} p_X(i)    (8.4)

for any subset A of the range of X, for a discrete pair (U,V) and any subset A of the pair's range, we have

P[(U,V) ∈ A] = Σ_{(i,j)∈A} p_{U,V}(i,j)    (8.5)
Again, consider our marble example. Suppose we want to find P(Y < B). Doing this "by hand," we would simply sum the relevant probabilities in the table above, marked with asterisks below (they are the entries with i < j):

 i \ j     0        1        2        3
  0      0.002   *0.024*  *0.036*  *0.008*
  1      0.162    0.073   *0.048*  *0.004*
  2      0.012    0.024    0.006    0.000
The desired probability would then be 0.024+0.036+0.008+0.048+0.004 = 0.12.
Writing it in the more formal way using (8.5), we would set

A = {(i,j) : i < j}    (8.6)
and then

P(Y < B) = P[(Y,B) ∈ A] = Σ_{i=0}^{2} Σ_{j=i+1}^{3} p_{Y,B}(i,j)    (8.7)
Note that the lower bound in the inner sum is j = i+1. This reflects the common-sense point that in the event Y < B, B must be at least equal to Y+1.

Of course, this sum still works out to 0.12 as before, but it's important to be able to express this as a double sum of p_{Y,B}(), as above. We will rely on this to motivate the continuous case in the next section.
Expected values are calculated in the analogous manner. Recall that for a function g() of X,

E[g(X)] = Σ_i g(i) p_X(i)    (8.8)

So, for any function g() of two discrete random variables U and V, define

E[g(U,V)] = Σ_i Σ_j g(i,j) p_{U,V}(i,j)    (8.9)
For instance, if for some bizarre reason we wish to find the expected value of the product of the numbers of yellow and blue marbles above (not so bizarre, we'll find in Section 7.1.1), the calculation would be

E(YB) = Σ_{i=0}^{2} Σ_{j=0}^{3} i j p_{Y,B}(i,j) = 0.255    (8.10)
The univariate pmfs, called marginal pmfs, can of course be recovered from the multivariate pmf:

p_U(i) = P(U = i) = Σ_j P(U = i, V = j) = Σ_j p_{U,V}(i,j)    (8.11)
For example, look at the table following (8.5). Evaluating (8.11) for i = 2, say, with U = Y and V = B, would give us 0.012 + 0.024 + 0.006 + 0.000 = 0.042. Then all that (8.11) tells us is that P(Y = 2) = 0.042, which is obvious from the table; (8.11) simply is an application of our old principle, "Break big events down into small events."
Needless to say, we can recover the marginal distribution of V similarly to (8.11):

p_V(j) = P(V = j) = Σ_i P(U = i, V = j) = Σ_i p_{U,V}(i,j)    (8.12)
8.2 Multivariate Densities
8.2.1 Motivation and Denition
Extending our previous definition of cdf for a single variable, we define the two-dimensional cdf for a pair of random variables X and Y (discrete or continuous) as

F_{X,Y}(u,v) = P(X ≤ u and Y ≤ v)    (8.13)

If X and Y were discrete, we would evaluate that cdf via a double sum of their bivariate pmf. You may have guessed by now that the analog for continuous random variables would be a double integral, and it is. The integrand is the bivariate density:

f_{X,Y}(u,v) = ∂² F_{X,Y}(u,v) / ∂u ∂v    (8.14)

Densities in higher dimensions are defined similarly. (Just as we noted in Section 4.8 that some random variables are neither discrete nor continuous, there are some pairs of continuous random variables whose cdfs do not have the requisite derivatives. We will not pursue such cases here.)

As in the univariate case, a bivariate density shows which regions of the X-Y plane occur more frequently, and which occur less frequently.
8.2.2 Use of Multivariate Densities in Finding Probabilities and Expected Values

Again by analogy, for any region A in the X-Y plane,

P[(X,Y) ∈ A] = ∫∫_A f_{X,Y}(u,v) du dv    (8.15)
So, just as probabilities involving a single variable X are found by integrating f_X over the region in question, for probabilities involving X and Y, we take the double integral of f_{X,Y} over that region.

Also, for any function g(X,Y),

E[g(X,Y)] = ∫_{-∞}^{∞} ∫_{-∞}^{∞} g(u,v) f_{X,Y}(u,v) du dv    (8.16)

where it must be kept in mind that f_{X,Y}(u,v) may be 0 in some regions of the U-V plane. Note that there is no set A here as in (8.15). See (8.20) below for an example.

Finding marginal densities is also analogous to the discrete case, e.g.

f_X(s) = ∫_t f_{X,Y}(s,t) dt    (8.17)
Other properties and calculations are analogous as well. For instance, the double integral of the
density is equal to 1, and so on.
8.2.3 Example: A Triangular Distribution

Suppose (X,Y) has the density

f_{X,Y}(s,t) = 8st, 0 < t < s < 1    (8.18)

The density is 0 outside the region 0 < t < s < 1.

First, think about what this means, say in our notebook context. We do the experiment many times. Each line of the notebook records the values of X and Y. Each of these (X,Y) pairs is a point in the triangular region 0 < t < s < 1. Since the density is highest near the point (1,1) and lowest near (0,0), (X,Y) will be observed near (1,1) much more often than near (0,0), with points near, say, (1,0.5) occurring with middling frequencies.
Let's find P(X + Y > 1). This calculation will involve a double integral. The region A in (8.15) is {(s,t) : s + t > 1, 0 < t < s < 1}. We have a choice of integrating in the order ds dt or dt ds. The latter will turn out to be more convenient.

To see how the limits in the double integral are obtained, first review (8.7). We use the same reasoning here, changing from sums to integrals and applying the current density, as shown in this figure:
[Figure: the support region 0 < t < s < 1 in the s-t plane, bounded by the lines t = s and t = 1-s; the subregion corresponding to the event X+Y > 1 is shaded.]
Here s represents X and t represents Y. The gray area is the region in which (X,Y) ranges. The subregion A in (8.15), corresponding to the event X+Y > 1, is shown in the striped area in the figure.

The dark vertical line shows all the points (s,t) in the striped region for a typical value of s in the integration process. Since s is the variable in the outer integral, consider it fixed for the time being and ask where t will range for that s. We see that for X = s, Y will range from 1-s to s; thus we set the inner integral's limits to 1-s and s. Finally, we then ask where s can range, and see from the picture that it ranges from 0.5 to 1. Thus those are the limits for the outer integral.
P(X + Y > 1) = ∫_{0.5}^{1} ∫_{1-s}^{s} 8st dt ds = ∫_{0.5}^{1} 8s(s - 0.5) ds = 5/6    (8.19)

Following (8.16),

E[√(X+Y)] = ∫_0^1 ∫_0^s √(s+t) · 8st dt ds    (8.20)
Let's find the marginal density f_Y(t). Just as we "summed out" in (8.11), in the continuous case we must integrate out the s in (8.18):

f_Y(t) = ∫_t^1 8st ds = 4t - 4t^3    (8.21)

for 0 < t < 1, 0 elsewhere.
Let's find the correlation between X and Y for this density.

E(XY) = ∫_0^1 ∫_0^s st · 8st dt ds    (8.22)
      = ∫_0^1 8s^2 · s^3/3 ds    (8.23)
      = 4/9    (8.24)

f_X(s) = ∫_0^s 8st dt    (8.25)
       = 4st^2 |_{t=0}^{t=s}    (8.26)
       = 4s^3    (8.27)

f_Y(t) = ∫_t^1 8st ds    (8.28)
       = 4t s^2 |_{s=t}^{s=1}    (8.29)
       = 4t(1 - t^2)    (8.30)

EX = ∫_0^1 s · 4s^3 ds = 4/5    (8.31)

E(X^2) = ∫_0^1 s^2 · 4s^3 ds = 2/3    (8.32)

Var(X) = 2/3 - (4/5)^2 = 0.027    (8.33)

EY = ∫_0^1 t (4t - 4t^3) dt = 4/3 - 4/5 = 8/15    (8.34)

E(Y^2) = ∫_0^1 t^2 (4t - 4t^3) dt = 1 - 4/6 = 1/3    (8.35)

Var(Y) = 1/3 - (8/15)^2 = 0.049    (8.36)

Cov(X,Y) = 4/9 - (4/5)(8/15) = 0.018    (8.37)

ρ(X,Y) = 0.018 / √(0.027 · 0.049) = 0.49    (8.38)
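As a sanity check on (8.38), we can simulate from this density. The sketch below generates X by inverting its cdf s^4, and then, given X = s, uses the fact that Y/s has cdf u^2 on (0,1):

n <- 100000
x <- runif(n)^(1/4)      # X has cdf F_X(s) = s^4
y <- x * sqrt(runif(n))  # given X = s, Y = s*sqrt(U)
print(cor(x, y))         # should be near 0.49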
8.2.4 Example: Train Rendezvous

Train lines A and B intersect at a certain transfer point, with the schedule stating that trains from both lines will arrive there at 3:00 p.m. However, they are often late, by amounts X and Y, measured in hours, for the two trains. The bivariate density is

f_{X,Y}(s,t) = 2 - s - t, 0 < s,t < 1    (8.39)

Two friends agree to meet at the transfer point, one taking line A and the other B. Let W denote the time in minutes the person arriving on line B must wait for the friend. Let's find P(W > 6).

First, convert this to a problem involving X and Y, since they are the random variables for which we have a density, and then use (8.15):

P(W > 0.1) = P(Y + 0.1 < X)    (8.40)
           = ∫_{0.1}^{1} ∫_0^{s-0.1} (2 - s - t) dt ds    (8.41)
8.3 More on Sets of Independent Random Variables
8.3.1 Probability Mass Functions and Densities Factor in the Independent Case
If X and Y are independent, then

p_{X,Y} = p_X p_Y    (8.42)

in the discrete case, and

f_{X,Y} = f_X f_Y    (8.43)

in the continuous case. In other words, the joint pmf/density is the product of the marginal ones.

This is easily seen in the discrete case:

p_{X,Y}(i,j) = P(X = i and Y = j)  (definition)    (8.44)
             = P(X = i) P(Y = j)  (independence)    (8.45)
             = p_X(i) p_Y(j)  (definition)    (8.46)
Here is the proof for the continuous case:

f_{X,Y}(u,v) = ∂²/∂u∂v [F_{X,Y}(u,v)]    (8.47)
             = ∂²/∂u∂v [P(X ≤ u and Y ≤ v)]    (8.48)
             = ∂²/∂u∂v [P(X ≤ u) · P(Y ≤ v)]    (8.49)
             = ∂²/∂u∂v [F_X(u) F_Y(v)]    (8.50)
             = f_X(u) f_Y(v)    (8.51)
8.3.2 Convolution
Definition 21 Suppose g and h are densities of continuous random variables X and Y, respectively. The convolution of g and h, denoted g*h (the reason for the asterisk, suggesting a product, will become clear in Section 9.4.3), is another density, defined to be that of the random variable X+Y. In other words, convolution is a binary operation on the set of all densities.

If X and Y are nonnegative and independent, then the convolution reduces to

f_Z(t) = ∫_0^t g(s) h(t-s) ds    (8.52)
You can get intuition on this by considering the discrete case. Say U and V are nonnegative integer-valued random variables, and set W = U+V. Let's find p_W:

p_W(k) = P(W = k)  (by definition)    (8.53)
       = P(U + V = k)  (substitution)    (8.54)
       = Σ_{i=0}^{k} P(U = i and V = k-i)  ("In what ways can it happen?")    (8.55)
       = Σ_{i=0}^{k} p_{U,V}(i, k-i)  (by definition)    (8.56)
       = Σ_{i=0}^{k} p_U(i) p_V(k-i)  (from Section 8.3.1)    (8.57)

Review the analogy between densities and pmfs in our unit on continuous random variables, Section 4.4.1, and then see how (8.52) is analogous to (8.53) through (8.57):

• k in (8.53) is analogous to t in (8.52)
• the limits 0 to k in (8.57) are analogous to the limits 0 to t in (8.52)
• the expression k-i in (8.57) is analogous to t-s in (8.52)
• and so on
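To see the discrete convolution (8.57) in action, here is a sketch computing the pmf of the sum of two fair dice:

pU <- rep(1/6, 6); pV <- rep(1/6, 6)
pW <- sapply(2:12, function(k) {
   i <- pmax(1, k-6):pmin(6, k-1)  # the i for which pU[i] and pV[k-i] are both defined
   sum(pU[i] * pV[k-i])
})
print(rbind(2:12, pW))  # triangular pmf, peaking at k = 7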
8.3.3 Example: Ethernet
Consider this network, essentially Ethernet. Here nodes can send at any time. Transmission time is 0.1 seconds. Nodes can also "hear" each other; one node will not start transmitting if it hears that another has a transmission in progress, and even when that transmission ends, the node that had been waiting will wait an additional random time, to reduce the possibility of colliding with some other node that had been waiting.
Suppose two nodes hear a third transmitting, and thus refrain from sending. Let X and Y be their random backoff times, i.e. the random times they wait before trying to send. (In this model, assume that they do not do "listen before talk" after a backoff.) Let's find the probability that they clash, which is P(|X - Y| ≤ 0.1).

Assume that X and Y are independent and exponentially distributed with mean 0.2, i.e. they each have density 5e^{-5u} on (0,∞). Then from (8.43), we know that their joint density is the product of their marginal densities,

f_{X,Y}(s,t) = 25 e^{-5(s+t)}, s,t > 0    (8.58)
Now

P(|X - Y| ≤ 0.1) = 1 - P(|X - Y| > 0.1) = 1 - P(X > Y + 0.1) - P(Y > X + 0.1)    (8.59)

Look at that first probability. Applying (8.15) with A = {(s,t) : s > t + 0.1, 0 < s,t}, we have

P(X > Y + 0.1) = ∫_0^∞ ∫_{t+0.1}^∞ 25 e^{-5(s+t)} ds dt = 0.303    (8.60)

By symmetry, P(Y > X + 0.1) is the same. So, the probability of a clash is 0.394, rather high. We may wish to increase our mean backoff time, though a more detailed analysis is needed.
8.3.4 Example: Analysis of Seek Time
This will be an analysis of seek time on a disk. Suppose we have mapped the innermost track to 0 and the outermost one to 1, and assume that (a) the number of tracks is large enough to treat the position H of the read/write head on the interval [0,1] as a continuous random variable, and (b) the track number requested has a uniform distribution on that interval.

Consider two consecutive service requests for the disk, denoting their track numbers by X and Y. In the simplest model, we assume that X and Y are independent, so that the joint distribution of X and Y is the product of their marginals, and is thus equal to 1 on the square 0 ≤ X,Y ≤ 1.
The seek distance will be |X - Y|. Its mean value is found by taking g(s,t) in (8.16) to be |s - t|:

∫_0^1 ∫_0^1 |s - t| · 1 ds dt = 1/3    (8.61)
Let's find the density of the seek time S = |X - Y|:

F_S(v) = P(|X - Y| ≤ v)    (8.62)
       = P(-v ≤ X - Y ≤ v)    (8.63)
       = 1 - P(X - Y < -v) - P(X - Y > v)    (8.64)
       = 1 - (1 - v)^2    (8.65)

where for instance P(X - Y > v) is the integral of 1 on the triangle with vertices (v,0), (1,0) and (1,1-v), thus equal to the area of that triangle, 0.5(1-v)^2.

Then

f_S(v) = d/dv F_S(v) = 2(1 - v)    (8.66)
By the way, what about the assumptions here? The independence would be a good assumption, for instance, for a heavily-used file server accessed by many different machines. Two successive requests are likely to be from different machines, thus independent. In fact, even within the same machine, if we have a lot of users at this time, successive requests can be assumed independent. On the other hand, successive requests from a particular user probably can't be modeled this way.

As mentioned in our unit on continuous random variables, page 98, if it's been a while since we've done a defragmenting operation, the assumption of a uniform distribution for requests is probably good.

Once again, this is just scratching the surface. Much more sophisticated models are used for more detailed work.
8.3.5 Example: Backup Battery
Suppose we have a portable machine that has compartments for two batteries. The main battery has lifetime X with mean 2.0 hours, and the backup's lifetime Y has mean 1.0 hours. One replaces the first by the second as soon as the first fails. The lifetimes of the batteries are exponentially distributed and independent. Let's find the density of W, the time that the system is operational (i.e. the sum of the lifetimes of the two batteries).

Recall that if the two batteries had the same mean lifetimes, W would have a gamma distribution. But that's not the case here. However, we notice that the distribution of W is a convolution of two exponential densities, as it is the sum of two nonnegative independent random variables. Using (8.52), we have

f_W(t) = ∫_0^t f_X(s) f_Y(t-s) ds = ∫_0^t 0.5 e^{-0.5s} e^{-(t-s)} ds = e^{-0.5t} - e^{-t}, 0 < t < ∞    (8.67)
8.3.6 Example: Minima of Uniformly Distributed Random Variables
Suppose X and Y are independent and each have a uniform distribution on the interval (0,1). Let Z = min(X,Y). Find f_Z:

F_Z(t) = P(Z ≤ t)  (def. of cdf)    (8.68)
       = 1 - P(Z > t)    (8.69)
       = 1 - P(X > t and Y > t)  (min(u,v) > t iff both u,v > t)    (8.70)
       = 1 - P(X > t) P(Y > t)  (indep.)    (8.71)
       = 1 - (1 - t)^2  (U(0,1) distr.)    (8.72)

The density of Z is then the derivative of that last expression:

f_Z(t) = 2(1 - t), 0 < t < 1    (8.74)
8.3.7 Example: Minima of Independent Exponentially Distributed Random Variables

The memoryless property of the exponential distribution leads to other key properties. Here's a famous one:

Theorem 22 Suppose W_1, ..., W_k are independent random variables, with W_i being exponentially distributed with parameter λ_i. Let Z = min(W_1, ..., W_k). Then

(a) Z is exponentially distributed with parameter λ_1 + ... + λ_k

(b) P(Z = W_i) = λ_i / (λ_1 + ... + λ_k)
Comments:
• In notebook terms, we would have k+1 columns, one each for the W_i and one for Z. For any given line, the value in the Z column will be the smallest of the values in the columns for W_1, ..., W_k; Z will be equal to one of them, but not the same one in every line. Then for instance P(Z = W_3) is interpretable in notebook form as the long-run proportion of lines in which the Z column equals the W_3 column.

• It's pretty remarkable that the minimum of independent exponential random variables turns out again to be exponential. Contrast that with what we found in Section 8.3.6, where the minimum of independent uniform random variables did NOT turn out to have a uniform distribution.

• The sum λ_1 + ... + λ_k in (a) should make good intuitive sense to you, for the following reasons. Recall from Section 4.5.4.5 that the parameter λ in an exponential distribution is interpretable as a "light bulb burnout rate." Say we have persons 1 and 2. Each has a lamp. Person i uses Brand i light bulbs, i = 1,2. Say Brand i light bulbs have exponential lifetimes with parameter λ_i. Suppose each time person i replaces a bulb, he shouts out, "New bulb!" and each time anyone replaces a bulb, I shout out "New bulb!" Persons 1 and 2 are shouting at rates of λ_1 and λ_2, respectively, so I am shouting at a rate of λ_1 + λ_2.

• Similarly, (b) should be intuitively clear as well from the above thought experiment, since for instance a proportion λ_1/(λ_1 + λ_2) of my shouts will be in response to person 1's shouts. Also, at any given time, the memoryless property of exponential distributions implies that the time at which I shout next will be the minimum of the times at which persons 1 and 2 shout next.
Proof

Properties (a) and (b) above are easy to prove, using the same approach as in Section 8.3.6:

F_Z(t) = P(Z ≤ t)  (def. of cdf)    (8.75)
       = 1 - P(Z > t)    (8.76)
       = 1 - P(W_1 > t and ... and W_k > t)  (min > t iff all W_i > t)    (8.77)
       = 1 - Π_i P(W_i > t)  (indep.)    (8.78)
       = 1 - Π_i e^{-λ_i t}  (expon. distr.)    (8.79)
       = 1 - e^{-(λ_1 + ... + λ_k) t}    (8.80)

Taking d/dt of both sides shows (a).
For (b), suppose k = 2. We have that

P(Z = W_1) = P(W_1 < W_2)    (8.81)
           = ∫_0^∞ ∫_t^∞ λ_1 e^{-λ_1 t} λ_2 e^{-λ_2 s} ds dt    (8.82)
           = λ_1 / (λ_1 + λ_2)    (8.83)

The case for general k can be done by induction, writing min(W_1, ..., W_{c+1}) = min(min(W_1, ..., W_c), W_{c+1}).
Note carefully: Just as the probability that a continuous random variable takes on a specific value is 0, the probability that two continuous and independent random variables are equal to each other is 0. Thus in the above analysis, P(W_1 = W_2) = 0.
This property of minima of independent exponentially-distributed random variables developed in
this section is key to the structure of continuous-time Markov chains, in Section 17.4.
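Both parts of Theorem 22 are easy to check by simulation. A sketch, with k = 2 and arbitrarily chosen rates:

n <- 100000
w1 <- rexp(n, rate=1.5); w2 <- rexp(n, rate=2.5)
z <- pmin(w1, w2)
print(1/mean(z))      # estimated rate of Z; should be near 1.5 + 2.5 = 4
print(mean(z == w1))  # P(Z = W1); should be near 1.5/4 = 0.375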
8.3.8 Example: Computer Worm
A computer science graduate student at UCD, Senthilkumar Cheetancheri, was working on a worm alert mechanism. A simplified version of the model is that network hosts are divided into groups of size g, say on the basis of sharing the same router. Each infected host tries to infect all the others in the group. When g-1 group members are infected, an alert is sent to the outside world.

The student was studying this model via simulation, and found some surprising behavior. No matter how large he made g, the mean time until an external alert was raised seemed bounded. He asked me for advice.
I modeled the nodes as operating independently, and assumed that if node A is trying to infect node B, it takes an exponentially-distributed amount of time to do so. This is a continuous-time Markov chain. Again, this topic is much more fully developed in Section 17.4, but all we need here is the result of Section 8.3.7.

In state i, there are i infected hosts, each trying to infect all of the g-i noninfected hosts. When the process reaches state g-1, the process ends; we call this state an absorbing state, i.e. one from which the process never leaves.

Scale time so that for hosts A and B above, the mean time to infection is 1.0. Since in state i there are i(g-i) such pairs, the time to the next state transition is the minimum of i(g-i) exponentially-distributed random variables with mean 1. Thus the mean time to go from state i to state i+1 is 1/[i(g-i)].

Then the mean time to go from state 1 to state g-1 is

Σ_{i=1}^{g-1} 1/[i(g-i)]    (8.84)
Using a calculus approximation, we have

∫_1^{g-1} 1/[x(g-x)] dx = (1/g) ∫_1^{g-1} [1/x + 1/(g-x)] dx = (2/g) ln(g-1)    (8.85)

The latter quantity goes to zero as g → ∞. This confirms that the behavior seen by the student in simulations holds in general: (8.84) remains bounded as g → ∞. This is a very interesting result, since it says that the mean time to alert is bounded no matter how big our group size is.

So, even though our model here was quite simple, probably overly so, it did explain why the student was seeing the surprising behavior in his simulations.
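One can also evaluate (8.84) directly for increasing g (a sketch):

meantime <- function(g) sum(1 / ((1:(g-1)) * (g - (1:(g-1)))))
print(sapply(c(10, 100, 1000, 10000), meantime))  # bounded; in fact decreasing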
8.3.9 Example: Ethernet Again
In the Ethernet example in Section 8.3.3, we assumed that transmission time was a constant, 0.1. Now let's account for messages of varying sizes, by assuming that transmission time T for a message is random, exponentially distributed with mean 0.1, i.e. with rate 10. Let's find P(X < Y and there is no collision).

That probability is equal to P(X + T < Y). Well, this sounds like we're going to have to deal with triple integrals, but actually not. The derivation in Section 8.3.5 shows that the density of S = X + T is

f_S(t) = 10 (e^{-5t} - e^{-10t}), 0 < t < ∞    (8.86)
Thus the joint density of S and Y is

f_{S,Y}(u,v) = 10 (e^{-5u} - e^{-10u}) · 5 e^{-5v}, 0 < u,v < ∞    (8.87)

We can then evaluate P(S < Y) as a double integral, along the same lines as we did for instance in (8.19).
8.4 Example: Finding the Distribution of the Sum of Nonindependent Random Variables

In Section 8.3.2, we found a general formula for the distribution of the sum of two independent random variables. What about the nonindependent case?

Suppose for instance f_{X,Y}(s,t) = 2 on 0 < t < s < 1, 0 elsewhere. Let's find f_{X+Y}(w) for the case 0 < w < 1.
Since X and Y are not independent, we cannot use convolution. But:

F_{X+Y}(w) = P(X + Y ≤ w)    (8.88)
           = ∫_0^{w/2} ∫_t^{w-t} 2 ds dt    (8.89)
           = w^2/2    (8.90)

So f_{X+Y}(w) = w.
The case 1 < w < 2 is similar.
8.5 Parametric Families of Multivariate Distributions
Since there are so many ways in which random variables can correlate with each other, there are rather few parametric families commonly used to model multivariate distributions (other than those arising from sets of independent random variables having a distribution in a common parametric univariate family). We will discuss two here.
8.5.1 The Multinomial Family of Distributions
8.5.1.1 Probability Mass Function
This is a generalization of the binomial family.

Suppose one rolls a die 8 times. What is the probability that the results consist of two 1s, one 2, one 4, three 5s and one 6? Well, if the rolls occur in that order, i.e. the two 1s come first, then the 2, etc., then the probability is

(1/6)^2 (1/6)^1 (1/6)^0 (1/6)^1 (1/6)^3 (1/6)^1    (8.91)
But there are many different orderings, in fact

8! / (2! 1! 0! 1! 3! 1!)    (8.92)

of them, from Section 2.13.4, and thus

P(two 1s, one 2, no 3s, one 4, three 5s, one 6) = [8! / (2! 1! 0! 1! 3! 1!)] (1/6)^2 (1/6)^1 (1/6)^0 (1/6)^1 (1/6)^3 (1/6)^1    (8.93)
From this, we can more generally see the following. Suppose:

• we have n trials, each of which has r possible outcomes or categories
• the trials are independent
• the i-th outcome has probability p_i

Let X_i denote the number of trials with outcome i, i = 1,...,r. In the die example above, for instance, r = 6 for the six possible outcomes of one trial, i.e. one roll of the die, and X_1 is the number of times we got one dot, in our n = 8 rolls.
Then we say that the vector X = (X_1, ..., X_r) has a multinomial distribution. Since the X_i are discrete random variables, they have a joint pmf p_{X_1,...,X_r}(). Taking the above die example for illustration again, the probability of interest there is p_X(2,1,0,1,3,1). We then have in general,

p_{X_1,...,X_r}(j_1, ..., j_r) = [n! / (j_1! ... j_r!)] p_1^{j_1} ... p_r^{j_r}    (8.94)

Note that this family of distributions has r+1 parameters.

R has the function dmultinom() for the multinomial pmf. The call dmultinom(x,n,prob) evaluates (8.94), where x is the vector (j_1, ..., j_r) and prob is (p_1, ..., p_r).
We can simulate multinomial random vectors in R using the sample() function:
# n is the number of trials, p the vector of probabilities of the r categories
multinom <- function(n,p) {
   r <- length(p)
   outcome <- sample(x=1:r,size=n,replace=TRUE,prob=p)
   counts <- numeric(r)  # counts of the various categories, initially all 0
   # tabulate the counts (could be done more efficiently)
   for (i in 1:n) {
      j <- outcome[i]
      counts[j] <- counts[j] + 1
   }
   return(counts)
}
8.5.1.2 Example: Component Lifetimes
Say the lifetimes of some electronic component, say a disk drive, are exponentially distributed with
mean 4.5 years. If we have six of them, what is the probability that two fail before 1 year, two last
between 1 and 2 years, and the remaining two last more than 2 years?
Let (X,Y,Z) be the numbers that last in the three time intervals. Then this vector has a multinomial
distribution, with n = 6 trials, and
p_1 = ∫_0^1 (1/4.5) e^{-t/4.5} dt = 0.20    (8.95)

p_2 = ∫_1^2 (1/4.5) e^{-t/4.5} dt = 0.16    (8.96)

p_3 = ∫_2^∞ (1/4.5) e^{-t/4.5} dt = 0.64    (8.97)

We then use (8.94) to find the specified probability, which is:

[6! / (2! 2! 2!)] 0.20^2 0.16^2 0.64^2    (8.98)
8.5.1.3 Mean Vectors and Covariance Matrices in the Multinomial Family
Consider a multinomially distributed random vector X = (X_1, ..., X_r)′, with n trials and category probabilities p_i. Let's find its mean vector and covariance matrix.
First, note that the marginal distributions of the X_i are binomial! So,

EX_i = n p_i and Var(X_i) = n p_i (1 - p_i)    (8.99)

So we know EX now:

EX = (n p_1, ..., n p_r)′    (8.100)
We also know the diagonal elements of Cov(X): n p_i (1 - p_i) is the i-th diagonal element, i = 1,...,r.

But what about the rest? The derivation will follow in the footsteps of that of (3.98), but now in a vector context. Prepare to use your indicator random variable, random vector and covariance matrix skills! Also, this derivation will really build up your "probabilistic stamina level." So, it's good for you! But now is the time to review (3.98), Section 3.6 and Section 7.3, before continuing.
We'll continue the notation of the last section. In order to keep an eye on the concrete, we'll often illustrate the notation with the die example above; there we rolled a die 8 times, and defined 6 categories (one dot, two dots, etc.). We were interested in probabilities involving the number of trials that result in each of the 6 categories.

Define the random vector T_i to be the outcome of the i-th trial. It is a vector of indicator random variables, one for each of the r categories. In the die example, for instance, consider the second roll, which is recorded in T_2. If that roll turns out to be, say, 5, then
T_2 = (0, 0, 0, 0, 1, 0)′    (8.101)
Here is the key observation:

(X_1, ..., X_r)′ = Σ_{i=1}^{n} T_i    (8.102)
Keep in mind, (8.102) is a vector equation. In the die example, the first element of the left-hand side, X_1, is the number of times the 1-dot face turns up, and on the right-hand side, the first element of T_i is 1 or 0, according to whether the 1-dot face turns up on the i-th roll. Make sure you believe this equation before continuing.
Since the trials are independent, (7.51) and (8.102) now tell us that

Cov[(X_1, ..., X_r)′] = Σ_{i=1}^{n} Cov(T_i)    (8.103)

But the trials are not only independent, but also identically distributed. (The die, for instance, has the same probabilities on each trial.) So the last equation becomes

Cov[(X_1, ..., X_r)′] = n Cov(T_1)    (8.104)
One more step to go. Remember, T_1 is a vector, recording what happens on the first trial, e.g. the first roll of the die. Write it as

T_1 = (U_1, ..., U_r)′    (8.105)

Then the covariance matrix of T_1 consists of elements of the form

Cov(U_i, U_j)    (8.106)

Let's evaluate them.
Case 1: i = j

Cov(U_i, U_j) = Var(U_i)  (from (7.4))    (8.107)
              = p_i (1 - p_i)  (from (3.45))    (8.108)

Case 2: i ≠ j

Cov(U_i, U_j) = E(U_i U_j) - EU_i EU_j  (from (7.5))    (8.109)
              = E(U_i U_j) - p_i p_j  (from (3.44))    (8.110)
              = -p_i p_j    (8.111)
with that last step coming from the fact that U_i and U_j can never both be 1 (e.g. never on the same line of our notebook). Thus the product U_i U_j is always 0, and thus so is its expected value. In the die example, for instance, if our roll resulted in the 2-dot face turning upward, then the 5-dot face definitely did NOT turn upward, so U_2 = 1 while U_5 = 0.
So, we've now found Cov(T_1), and using this in (8.104), we see that

Cov[(X_1, ..., X_r)′] = n ( p_1(1-p_1)   -p_1 p_2     ...   -p_1 p_r   )
                          ( -p_1 p_2     p_2(1-p_2)   ...   -p_2 p_r   )
                          ( ...          ...           ...   ...        )
                          ( -p_1 p_r     -p_2 p_r      ...   p_r(1-p_r) )    (8.112)

Note too that if we define R = X/n, so that R is the vector of proportions in the various categories (e.g. X_1/n is the fraction of trials that resulted in category 1), then from (8.112) and (7.49), we have

Cov(R) = (1/n) ( p_1(1-p_1)   -p_1 p_2     ...   -p_1 p_r   )
               ( -p_1 p_2     p_2(1-p_2)   ...   -p_2 p_r   )
               ( ...          ...           ...   ...        )
               ( -p_1 p_r     -p_2 p_r      ...   p_r(1-p_r) )    (8.113)
Whew! That was a workout, but these formulas will become very useful later on, both in this
chapter and subsequent ones.
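As a check, here is a sketch comparing (8.112) with the sample covariance matrix of simulated multinomial vectors, for the die example (n = 50 rolls, r = 6 categories):

xs <- rmultinom(100000, size=50, prob=rep(1/6,6))  # one column per replication
print(cov(t(xs))[1:2,1:2])  # compare: 50*(1/6)*(5/6) = 6.94 on the diagonal,
                            # -50*(1/6)*(1/6) = -1.39 off the diagonal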
8.5.1.4 Application: Text Mining
One of the branches of computer science in which the multinomial family plays a prominent role is in text mining. One goal is automatic document classification. We want to write software that will make reasonably accurate guesses as to whether a document is about sports, the stock market, elections etc., based on the frequencies of various key words the program finds in the document.

Many of the simpler methods for this use the bag of words model. We have r key words we've decided are useful for the classification process, and the model assumes that statistically the frequencies of those words in a given document category, say sports, follow a multinomial distribution. Each category has its own set of probabilities p_1, ..., p_r. For instance, if "Barry Bonds" is considered one word, its probability will be much higher in the sports category than in the elections category, say. So, the observed frequencies of the words in a particular document will hopefully enable our software to make a fairly good guess as to the category the document belongs to.

Once again, this is a very simple model here, designed to just introduce the topic to you. Clearly the multinomial assumption of independence between trials is grossly incorrect here, so most models are much more complex than this.
8.5.2 The Multivariate Normal Family of Distributions
Note to the reader: This is a more difficult section, but worth putting extra effort into, as so many statistical applications in computer science make use of it. It will seem hard at times, but in the end won't be too bad.
8.5.2.1 Densities
Intuitively, this family has densities which are shaped like multidimensional bells, just like the
univariate normal has the famous one-dimensional bell shape.
Let's look at the bivariate case first. The joint distribution of X_1 and X_2 is said to be bivariate normal if their density is

f_{X,Y}(s,t) = [1 / (2π σ_1 σ_2 √(1-ρ^2))] exp{ -[1/(2(1-ρ^2))] [ (s-μ_1)^2/σ_1^2 + (t-μ_2)^2/σ_2^2 - 2ρ(s-μ_1)(t-μ_2)/(σ_1 σ_2) ] },  -∞ < s,t < ∞    (8.114)
This looks horrible, and it is. But don't worry, as we won't work with this directly. It's important for conceptual reasons, as follows.
First, note the parameters here: μ_1, μ_2, σ_1 and σ_2 are the means and standard deviations of X and Y, while ρ is the correlation between X and Y. So, we have a five-parameter family of distributions.

The multivariate normal family of distributions is parameterized by one vector-valued quantity, the mean μ, and one matrix-valued quantity, the covariance matrix Σ. Specifically, suppose the random vector X = (X_1, ..., X_k)′ has a k-variate normal distribution. The density has this form:

f_X(t) = c e^{-0.5 (t-μ)′ Σ^{-1} (t-μ)}    (8.115)
Here c is a constant, needed to make the density integrate to 1.0. It turns out that

c = 1 / [(2π)^{k/2} √det(Σ)]    (8.116)

but we'll never use this fact.

Here again ′ denotes matrix transpose, -1 denotes matrix inversion and det() means determinant. Again, note that t is a k x 1 vector.

Since the matrix Σ is symmetric, there are k(k+1)/2 distinct parameters there, and k parameters in the mean vector, for a total of k(k+3)/2 parameters for this family of distributions.
8.5.2.2 Geometric Interpretation
Now, let's look at some pictures, generated by R code which I've adapted from one of the entries in the R Graph Gallery, http://addictedtor.free.fr/graphiques/graphcode.php?graph=42. (There appears to be an error in their definition of the function f(); the assignment to term5 should not have a negative sign at the beginning.)

Both are graphs of bivariate normal densities, with EX_1 = EX_2 = 0, Var(X_1) = 10, Var(X_2) = 15 and a varying value of the correlation ρ between X_1 and X_2. Figure 8.1 is for the case ρ = 0.2.
The surface is bell-shaped, though now in two dimensions instead of one. Again, the height of the surface at any (s,t) point represents the relative likelihood of X_1 being near s and X_2 being near t. Say for instance that X_1 is height and X_2 is weight. If the surface is high near, say, (70,150) (for height of 70 inches and weight of 150 pounds), it means that there are a lot of people whose height and weight are near those values. If the surface is rather low there, then there are rather few people whose height and weight are near those values.

Now compare that picture to Figure 8.2, with ρ = 0.8.
Again we see a bell shape, but in this case "narrower." In fact, you can see that when X_1 (i.e. s) is large, X_2 (i.e. t) tends to be large too, and the same for "large" replaced by "small." By contrast, the surface near (5,5) is much higher than near (5,-5), showing that the random vector (X_1, X_2) is near (5,5) much more often than (5,-5).

All of this reflects the high correlation (0.8) between the two variables. If we were to continue to increase ρ toward 1.0, we would see the bell become narrower and narrower, with X_1 and X_2 coming closer and closer to a linear relationship, one which can be shown to be

X_1 - μ_1 = (σ_1/σ_2)(X_2 - μ_2)    (8.117)
[Figure 8.1: Bivariate Normal Density, ρ = 0.2]
[Figure 8.2: Bivariate Normal Density, ρ = 0.8]
In this case, that would be

X_1 = √(10/15) X_2 = 0.82 X_2    (8.118)
8.5.2.3 Properties of Multivariate Normal Distributions
Theorem 23 Suppose X = (X_1, ..., X_k)′ has a multivariate normal distribution with mean vector μ and covariance matrix Σ. Then:

(a) The contours of f_X are k-dimensional ellipsoids. In the case k = 2 for instance, where we can visualize the density of X as a three-dimensional surface, the contours for points at which the bell has the same height (think of a topographical map) are elliptical in shape. The larger the correlation (in absolute value) between X_1 and X_2, the more elongated the ellipse. When the absolute correlation reaches 1, the ellipse degenerates into a straight line.

(b) Let A be a constant (i.e. nonrandom) matrix with k columns. Then the random vector Y = AX also has a multivariate normal distribution. (Note that this is a generalization of the material on affine transformations on page 99.) The parameters of this new normal distribution must be EY = Aμ and Cov(Y) = AΣA′, by (7.48) and (7.50).

(c) If U_1, ..., U_m are each univariate normal and they are independent, then they jointly have a multivariate normal distribution. (In general, though, having a normal distribution for each U_i does not imply that they are jointly multivariate normal.)

(d) Suppose W has a multivariate normal distribution. The conditional distribution of some components of W, given other components, is again multivariate normal.

Part (b) has some important implications:

(i) The lower-dimensional marginal distributions are also multivariate normal. For example, if k = 3, the pair (X_1, X_3)′ has a bivariate normal distribution, as can be seen by setting

A = ( 1  0  0 )
    ( 0  0  1 )    (8.119)

in (b) above.
(ii) Scalar linear combinations of X are normal. In other words, for constant scalars a_1, ..., a_k, set a = (a_1, ..., a_k)′. Then the quantity Y = a_1 X_1 + ... + a_k X_k has a univariate normal distribution with mean a′μ and variance a′Σa.

(iii) Vector linear combinations are multivariate normal. Again using the case k = 3 as our example, consider (U,V)′ = (X_1 - X_3, X_2 - X_3)′. Then set

A = ( 1  0  -1 )
    ( 0  1  -1 )    (8.120)

(iv) The r-component random vector X has a multivariate normal distribution if and only if c′X has a univariate normal distribution for all constant r-component vectors c.
In R the density, cdf and quantiles of the multivariate normal distribution are given by the func-
tions dmvnorm(), pmvnorm() and qmvnorm() in the library mvtnorm. You can simulate a
multivariate normal distribution by using mvrnorm() in the library MASS.
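For instance, for the ρ = 0.2 density pictured in Figure 8.1 (a sketch):

library(mvtnorm); library(MASS)
sig <- matrix(c(10, 0.2*sqrt(150), 0.2*sqrt(150), 15), nrow=2)
print(dmvnorm(c(1,1), mean=c(0,0), sigma=sig))   # density height at (1,1)
print(pmvnorm(upper=c(1,1), mean=c(0,0), sigma=sig))  # cdf F(1,1)
x <- mvrnorm(10000, mu=c(0,0), Sigma=sig)        # simulated sample
print(cov(x))                                    # should be near sig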
8.5.2.4 The Multivariate Central Limit Theorem
The multidimensional version of the Central Limit Theorem holds. A sum of independent identically distributed (iid) random vectors has an approximate multivariate normal distribution. Here is the theorem:

Theorem 24 Suppose X_1, X_2, ... are independent random vectors, all having the same distribution, which has mean vector μ and covariance matrix Σ. Form the new random vector T = X_1 + ... + X_n. Then for large n, the distribution of T is approximately multivariate normal with mean nμ and covariance matrix nΣ.
For example, since a person's body consists of many different components, the CLT (a non-independent, non-identically distributed version of it) explains intuitively why heights and weights are approximately bivariate normal. Histograms of heights will look approximately bell-shaped, and the same is true for weights. The multivariate CLT says that three-dimensional histograms, plotting frequency along the Z axis against height and weight along the X and Y axes, will be approximately three-dimensional bell-shaped.
The proof of the multivariate CLT is easy, from Property (iv) above. Say we have a sum of iid random vectors:

S = X_1 + ... + X_n    (8.121)

Then

c′S = c′X_1 + ... + c′X_n    (8.122)

Now on the right side we have a sum of iid scalars, not vectors, so the univariate CLT applies! We thus know the right-hand side is approximately normal for all c, which means c′S is also approximately normal for all c, which then by (iv) above means that S itself is approximately multivariate normal.
8.5.2.5 Example: Finishing the Loose Ends from the Dice Game
Recall the game example in Section 7.3.5: Suppose we roll a die 50 times. Let X denote the number of rolls in which we get one dot, and let Y be the number of times we get either two or three dots. For convenience, let's also define Z to be the number of times we get four or more dots, though our focus will be on X and Y. Suppose also that we win $5 for each roll of a one, and $2 for each roll of a two or three.

Our analysis relied on the vector (X,Y,Z) having an approximate multivariate normal distribution. Where does that come from? Well, first note that the exact distribution of (X,Y,Z) is multinomial. Then recall (8.102). The latter makes (X,Y,Z) a sum of iid vectors, so that the multivariate CLT applies.
8.5.2.6 Application: Data Mining
The multivariate normal family plays a central role in multivariate statistical methods.
For instance, a major issue in data mining is dimension reduction, which means trying to reduce what may be hundreds or thousands of variables down to a manageable level. One of the tools for this, called principal components analysis (PCA), is based on multivariate normal distributions. Google uses this kind of thing quite heavily. We'll discuss PCA in Section 16.4.1.
To see a bit of how this works, note that in Figure 8.2, $X_1$ and $X_2$ had nearly a linear relationship with each other. That means that one of them is nearly redundant, which is good if we are trying to reduce the number of variables we must work with.

In general, the method of principal components takes r original variables, in the vector X, and forms r new ones in a vector Y, each of which is some linear combination of the original ones. These new ones are independent. In other words, there is a square matrix A such that the components of Y = AX are independent. (The matrix A consists of the eigenvectors of Cov(X); more on this in Section 16.4.1 of our unit on statistical relations.)
We then discard the $Y_i$ with small variance, as that means they are nearly constant and thus do not carry much information. That leaves us with a smaller set of variables that still captures most of the information of the original ones.
Many analyses in bioinformatics involve data that can be modeled well by multivariate normal distributions. For example, in automated cell analysis, two important variables are forward light scatter (FSC) and sideward light scatter (SSC). The joint distribution of the two is approximately bivariate normal.^6
Exercises
1. Suppose the random pair (X,Y) has the density $f_{X,Y}(s,t) = 8st$ on the triangle $\{(s,t): 0 < t < s < 1\}$.

(a) Find $f_X(s)$.

(b) Find $P(X < Y/2)$.
2. Suppose packets on a network are of three types. In general, 40% of the packets are of type A,
40% have type B and 20% have type C. We observe six packets, and denote the numbers of packets
of types A, B and C by X, Y and Z, respectively.
(a) Find P(X = Y = Z = 2).
(b) Find Cov(X,Y+Z).
(c) To what parametric family in this book does the distribution of Y+Z belong?
3. Suppose X and Y are independent, each having an exponential distribution with means 1.0 and 2.0, respectively. Find $P(Y > X^2)$.
4. Suppose the pair (X,Y) has a bivariate normal distribution with mean vector (0,2) and covariance matrix

$$\begin{pmatrix} 1 & 2 \\ 2 & 6 \end{pmatrix}$$

(a) Set up (but do not evaluate) the double integral for the exact value of $P(X^2 + Y^2 \le 2.8)$.
^6 See Bioinformatics and Computational Biology Solutions Using R and Bioconductor, edited by Robert Gentleman, Wolfgang Huber, Vincent J. Carey, Rafael A. Irizarry and Sandrine Dudoit, Springer, 2005.
(b) Using the matrix methods of Section 7.3, find the covariance matrix of the pair $U = (X+Y, X-2Y)'$. Does U have a bivariate normal distribution?
5. Suppose X and Y are independent, and each has a U(0,1) distribution. Let V = X + Y.

(a) Find $f_V$. (Advice: It will be a two-part function, i.e. the type we have to describe by saying something like, "The function has value 2z for z < 6 and 1/z for z > 6.")

(b) Verify your answer in (a) by finding EV from your answer in (a) and then using the fact that EX = EY = 0.5.
In the general population of parents who have 10-year-old kids, the parent/kid weight pairs have an exact bivariate normal distribution.

- Parents' weights have mean 152.6 and standard deviation 25.0.
- Weights of kids have mean 62 and standard deviation 6.4.
- The correlation between the parents' and kids' weights is 0.4.

Use R functions (not simulation) in the following:

(a) Find the fraction of parents who weigh more than 160.

(b) Find the fraction of kids who weigh less than 56.

(c) Find the fraction of parent/child pairs in which the parent weighs more than 160 and the child weighs less than 56.

(d) Suppose a ride at an amusement park charges by weight, one cent for each pound of weight in the parent and child. State the exact distribution of the fee, and find the fraction of parent/child pairs who are charged less than $2.00.
6. Newspapers at a certain vending machine cost 25 cents. Suppose 60% of the customers pay with quarters, 20% use two dimes and a nickel, 15% insert a dime and three nickels, and 5% deposit five nickels. When the vendor collects the money, five coins fall to the ground. Let X, Y and Z denote the numbers of quarters, dimes and nickels among these five coins.

(a) Is the joint distribution of (X,Y,Z) a member of a parametric family presented in this chapter? If so, which one?
(b) Find P(X = 2, Y = 2, Z = 1).
(c) Find $\rho(X, Y)$.
7. Jack and Jill play a dice game, in which one wins $1 per dot. There are three dice, die A, die B and die C. Jill always rolls dice A and B. Jack always rolls just die C, but he also gets credit for 90% of die B. For instance, say in a particular roll A, B and C are 3, 1 and 6, respectively. Then Jill would win $4 and Jack would get $6.90. Let X and Y be Jill's and Jack's total winnings after 100 rolls. Use the Central Limit Theorem to find the approximate values of P(X > 650, Y < 660) and P(Y > 1.06X).
Hints: This will follow a similar pattern to the dice game in Section 7.3.5, in which we win $5 for one dot, and $2 for two or three dots. Remember, in that example, the key was that we noticed that the pair (X,Y) was a sum of random pairs. That meant that (X,Y) had an approximate bivariate normal distribution, so we could find probabilities if we had the mean vector and covariance matrix of (X,Y). Thus we needed to find EX, EY, Var(X), Var(Y) and Cov(X,Y). We used the various properties of E(), Var() and Cov() to get those quantities.
You will do the same thing here. Write $X = U_1 + ... + U_{100}$, where $U_i$ is Jill's winnings on the $i$th roll. Write Y as a similar sum of the $V_i$. You probably will find it helpful to define $A_i$, $B_i$ and $C_i$ as the numbers of dots appearing on dice A, B and C on the $i$th roll. Then find EX etc. Again, make sure to utilize the various properties of E(), Var() and Cov().
8. Consider the coin game in Section 3.13.1. Find $F_{X_3,Y_3}(0,0)$.
9. Suppose the random vector $X = (X_1, X_2, X_3)'$ has mean $(2.0, 3.0, 8.2)'$ and covariance matrix

$$\begin{pmatrix} 1 & 0.4 & 0.2 \\ & 1 & 0.25 \\ & & 3 \end{pmatrix} \qquad (8.123)$$
(a) Fill in the three missing entries.

(b) Find $Cov(X_1, X_3)$.

(c) Find $\rho(X_2, X_3)$.

(d) Find $Var(X_3)$.

(e) Find the covariance matrix of $(X_1 + X_2, X_2 + X_3)'$.

(f) If in addition we know that $X_1$ has a normal distribution, find $P(1 < X_1 < 2.5)$, in terms of $\Phi()$.
(g) Consider the random variable $W = X_1 + X_2$. Which of the following is true? (i) $Var(W) = Var(X_1 + X_2)$. (ii) $Var(W) > Var(X_1 + X_2)$. (iii) $Var(W) < Var(X_1 + X_2)$. (iv) In order to determine which of the two variances is the larger one, we would need to know whether the variables $X_i$ have a multivariate normal distribution. (v) $Var(X_1 + X_2)$ doesn't exist.
10. Find the (approximate) output of this R code, by using the analytical techniques of this
chapter:
count <- 0
for (i in 1:10000) {
count1 <- 0
count2 <- 0
count3 <- 0
for (j in 1:20) {
x <- runif(1)
if (x < 0.2) {
count1 <- count1 + 1
} else if (x < 0.6) count2 <- count2 + 1 else
count3 <- count3 + 1
}
if (count1 == 9 && count2 == 2 && count3 == 9) count <- count + 1
}
cat(count/10000)
11. Use the convolution formula (8.52) to derive (4.79) for the case r = 2. Explain your steps
carefully!
12. In the book Last Man Standing, author D. McDonald writes the following about the practice of combining many mortgage loans into a single package sold to investors:

Even if every single [loan] in the [package] had a 30 percent risk of default, the thinking went, the odds that most of them would default at once were arguably infinitesimal... What [this argument] missed was the auto-synchronous relationship of many loans... [If several of them] are all mortgages for houses sitting next to each other on a beach... one strong hurricane and the [loan package] would be decimated.

Fill in the blank with a term from this book: The author is referring to an unwarranted assumption of ______________.
13. Consider the computer worm example in Section 8.3.8. Let R denote the time it takes to go from state 1 to state 3. Find $f_R(v)$. (Leave your answer in integral form.)
14. Suppose (X,Y) has a bivariate normal distribution, with EX = EY = 0, Var(X) = Var(Y) = 1, and $\rho(X,Y) = 0.2$. Find the following, leaving your answers in integral form:

(a) $E(X^2 + XY^{0.5})$

(b) $P(Y > 0.5X)$

(c) $F_{X,Y}(0.6, 0.2)$
Chapter 9
Advanced Multivariate Methods
9.1 Conditional Distributions
The key to good probability modeling and statistical analysis is to understand conditional proba-
bility. The issue arises constantly.
9.1.1 Conditional Pmfs and Densities
First, let's review: In many repetitions of our experiment, P(A) is the long-run proportion of the time that A occurs. By contrast, P(A|B) is the long-run proportion of the time that A occurs, among those repetitions in which B occurs. Keep this in your mind at all times.
Now we apply this to pmfs, densities, etc. We define the conditional pmf as follows for discrete random variables X and Y:

$$p_{Y|X}(j|i) = P(Y = j \mid X = i) = \frac{p_{X,Y}(i,j)}{p_X(i)} \qquad (9.1)$$

By analogy, we define the conditional density for continuous X and Y:

$$f_{Y|X}(t|s) = \frac{f_{X,Y}(s,t)}{f_X(s)} \qquad (9.2)$$
9.1.2 Conditional Expectation
Conditional expectations are defined as straightforward extensions of (9.1) and (9.2):

$$E(Y \mid X = i) = \sum_j j\, p_{Y|X}(j|i) \qquad (9.3)$$

$$E(Y \mid X = s) = \int_t t\, f_{Y|X}(t|s)\, dt \qquad (9.4)$$
9.1.3 The Law of Total Expectation (advanced topic)
9.1.3.1 Conditional Expected Value As a Random Variable
For a random variable Y and an event A, the quantity E(Y|A) is the long-run average of Y, among the times when A occurs. Note several things about the expression E(Y|A):

- The item to the left of the | symbol is a random variable (Y).
- The item to the right of the | symbol is an event (A).
- The overall expression evaluates to a constant.
By contrast, for the quantity E(Y|W), to be defined shortly for a random variable W, it is the case that:

- The item to the left of the | symbol is a random variable (Y).
- The item to the right of the | symbol is a random variable (W).
- The overall expression itself is a random variable, not a constant.

It will be very important to keep these differences in mind.
Consider the function g(t) defined as^1

$$g(t) = E(Y \mid W = t) \qquad (9.5)$$

In this case, the item to the right of the | is an event (that W = t), and thus g(t) is a constant (for each value of t), not a random variable.

^1 Of course, the t is just a placeholder, and any other letter could be used.
Definition 25 Define g() as in (9.5). Form the new random variable Q = g(W). Then the quantity E(Y|W) is defined to be Q.

(Before reading any further, re-read the two sets of bulleted items above, and make sure you understand the difference between E(Y|W=t) and E(Y|W).)
One can view E(Y|W) as a projection in an abstract vector space. This is very elegant, and actually aids the intuition. If (and only if) you are mathematically adventurous, read the details in Section 9.7.
9.1.3.2 Famous Formula: Theorem of Total Expectation
An extremely useful formula, given only scant or no mention in most undergraduate probability courses, is

$$E(Y) = E[E(Y|W)] \qquad (9.6)$$

for any random variables Y and W (for which the expectations are defined).

The RHS of (9.6) looks odd at first, but it's merely E[g(W)]; since Q = E(Y|W) is a random variable, we can certainly ask what its expected value is.

Equation (9.6) is a bit abstract. It's a very useful abstraction, enabling streamlined writing and thinking about the probabilistic structures at hand. Still, you may find it helpful to consider the case of discrete W, in which (9.6) has the more concrete form

$$EY = \sum_i P(W = i)\, E(Y \mid W = i) \qquad (9.7)$$
To see this intuitively, think of measuring the heights and weights of all the adults in Davis. Say we measure height to the nearest inch, so that height is discrete. We look at all the adults in Davis who are 72 inches tall, and write down their mean weight. Then we write down the mean weight of all adults of height 68. Then we write down the mean weight of all adults of height 75, and so on. Then (9.6) says that if we take the average of all the numbers we write down (the average of the averages), then we get the mean weight among all adults in Davis.

Note carefully, though, that this is a weighted average. If for instance people of height 69 inches are more numerous in the population, then their mean weight will receive greater emphasis in our overall average of all the means we've written down. This is seen in (9.7), with the weights being the quantities P(W=i).
The relation (9.6) is proved in the discrete case in Section 9.8.
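As a quick sanity check of (9.7), here is a small simulation (a sketch only; the distributions are made up for illustration):

# W is 1, 2 or 3 with probabilities 0.5, 0.3, 0.2;
# given W = i, Y is normal with mean 10*i, so E(Y|W=i) = 10*i
n <- 100000
w <- sample(1:3,n,replace=TRUE,prob=c(0.5,0.3,0.2))
y <- rnorm(n,mean=10*w)
mean(y)                         # direct estimate of EY
sum(c(0.5,0.3,0.2) * 10*(1:3))  # right-hand side of (9.7), i.e. 17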
9.1.4 What About the Variance?
By the way, one might guess that the analog of the Theorem of Total Expectation for variance is

$$Var(Y) = E[Var(Y|W)] \qquad (9.8)$$

But this is false. Think for example of the extreme case in which Y = W. Then Var(Y|W) would be 0, but Var(Y) would be nonzero.
The correct formula, called the Law of Total Variance, is

$$Var(Y) = E[Var(Y|W)] + Var[E(Y|W)] \qquad (9.9)$$

Deriving this formula is easy, by simply evaluating both sides of (9.9), and using the relation $Var(X) = E(X^2) - (EX)^2$. This exercise is left to the reader. See also Section 9.7.
9.1.5 Example: Trapped Miner
(Adapted from Stochastic Processes, by Sheldon Ross, Wiley, 1996.)
A miner is trapped in a mine, and has a choice of three doors. Though he doesn't realize it, if he chooses to exit the first door, it will take him to safety after 2 hours of travel. If he chooses the second one, it will lead back to the mine after 3 hours of travel. The third one leads back to the mine after 5 hours of travel. Suppose the doors look identical, and if he returns to the mine he does not remember which door(s) he tried earlier. What is the expected time until he reaches safety?
Let Y be the time it takes to reach safety, and let W denote the number of the door chosen (1, 2 or 3) on the first try. Then let us consider what values E(Y|W) can have. If W = 1, then Y = 2, so

$$E(Y \mid W = 1) = 2 \qquad (9.10)$$

If W = 2, things are a bit more complicated. The miner will go on a 3-hour excursion, and then be back in his original situation, and thus have a further expected wait of EY, since "time starts over." In other words,

$$E(Y \mid W = 2) = 3 + EY \qquad (9.11)$$
Similarly,

$$E(Y \mid W = 3) = 5 + EY \qquad (9.12)$$

In summary, now considering the random variable E(Y|W), we have

$$Q = E(Y|W) = \begin{cases} 2, & \text{w.p. } 1/3 \\ 3 + EY, & \text{w.p. } 1/3 \\ 5 + EY, & \text{w.p. } 1/3 \end{cases} \qquad (9.13)$$
where w.p. means with probability. So, using (9.6) or (9.7), we have
$$EY = EQ = 2 \cdot \frac{1}{3} + (3 + EY) \cdot \frac{1}{3} + (5 + EY) \cdot \frac{1}{3} = \frac{10}{3} + \frac{2}{3}\, EY \qquad (9.14)$$
Equating the extreme left and extreme right ends of this series of equations, we can solve for EY, which we find to be 10.
It is no accident that the answer, 10, is 2+3+5. This was discovered by UCD grad student Ahmed Ahmedin. Here's why (different from Ahmed's reasoning):
Let N denote the total number of attempts the miner makes before escaping (including the successful attempt at the end), and let $U_i$ denote the time spent traveling during the $i$th attempt, i = 1,...,N. Then

$$EY = E(U_1 + ... + U_N) \qquad (9.15)$$
$$= E[E(U_1 + ... + U_N \mid N)] \qquad (9.16)$$
Given N, each of $U_1, ..., U_{N-1}$ takes on the values 3 and 5, with probability 0.5 each, while $U_N$ is the constant 2. Thus

$$E(U_1 + ... + U_N \mid N) = (N-1) \cdot \frac{3+5}{2} + 2 = 4N - 2 \qquad (9.17)$$
N has a geometric distribution with p = 1/3, thus mean 3. Putting all this together, we have

$$EY = E(U_1 + ... + U_N) = E(4N - 2) = 10 \qquad (9.18)$$
This would be true if 2, 3 and 5 were replaced by a, b and c. In other words, intuitively: It takes an average of 3 attempts to escape, with mean travel time of (a+b+c)/3 per attempt, so the mean time overall is a+b+c.
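Here is a quick simulation confirming the answer (a sketch; it assumes, as above, that the miner never remembers which doors he has tried):

# one escape attempt sequence; returns the total time to safety
onetrial <- function() {
   tot <- 0.0
   repeat {
      door <- sample(1:3,1)
      if (door == 1) return(tot + 2)
      tot <- tot + ifelse(door == 2,3,5)
   }
}
mean(replicate(100000,onetrial()))  # should be near 10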
It is left to the reader to see how this would change if we assume that the miner remembers which doors he has already tried.
9.1.6 Example: More on Flipping Coins with Bonuses
Recall the situation of Section 3.12.2.2: A game involves flipping a coin k times. Each time you get a head, you get a bonus flip, not counted among the k. (But if you get a head from a bonus flip, that does not give you its own bonus flip.) Let X denote the number of heads you get among all flips, bonus or not. We'll compute EX.

As before, let Y denote the number of heads you obtain through nonbonus flips. This is a natural situation in which to try the Theorem of Total Expectation, conditioning on Y. Reason as follows:

It would be tempting to say that, given Y = m, X has a binomial distribution with parameters m and 0.5. That is not correct, but what is true is that the random variable X - m does have that distribution, since given Y = m there are m bonus flips, each a head with probability 0.5. (Note by the way that X - Y is the number of heads obtained from bonus flips.)
Then

$$EX = E[E(X|Y)] \qquad (9.19)$$
$$= E[E(X - Y + Y \mid Y)] \qquad (9.20)$$
$$= E[0.5Y + Y] \qquad (9.21)$$
$$= 1.5\, EY \qquad (9.22)$$
$$= 0.75k \qquad (9.23)$$
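A simulation check of that 0.75k value (a sketch, with k = 10 as an arbitrary choice):

onegame <- function(k) {
   nonbonus <- rbinom(1,k,0.5)      # Y, heads among the k main flips
   bonus <- rbinom(1,nonbonus,0.5)  # heads among the Y bonus flips
   nonbonus + bonus                 # X, total number of heads
}
mean(replicate(100000,onegame(10)))  # should be near 0.75*10 = 7.5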
9.1.7 Example: Analysis of Hash Tables
(Famous example, adapted from various sources.)
Consider a database table consisting of m cells, only some of which are currently occupied. Each time a new key must be inserted, it is used in a hash function to find an unoccupied cell. Since multiple keys map to the same table cell, we may have to probe multiple times before finding an unoccupied cell.

We wish to find E(Y), where Y is the number of probes needed to insert a new key. One approach to doing so would be to condition on W, the number of currently occupied cells at the time we do a search. After finding E(Y|W), we can use the Theorem of Total Expectation to find EY. We will make two assumptions (to be discussed later):
(a) Given that W = k, each probe will collide with an existing cell with probability k/m, with successive probes being independent.

(b) W is uniformly distributed on the set {1,2,...,m}, i.e. P(W = k) = 1/m for each k.
To calculate E(Y|W=k), we note that given W = k, Y is the number of independent trials until a "success" is reached, where "success" means that our probe turns out to be to an unoccupied cell. This is a geometric distribution, i.e.

$$P(Y = r \mid W = k) = \left(\frac{k}{m}\right)^{r-1}\left(1 - \frac{k}{m}\right) \qquad (9.24)$$
The mean of this geometric distribution is, from (3.77),

$$\frac{1}{1 - \frac{k}{m}} \qquad (9.25)$$
Then

$$EY = E[E(Y|W)] \qquad (9.26)$$
$$= \sum_{k=1}^{m-1} \frac{1}{m}\, E(Y \mid W = k) \qquad (9.27)$$
$$= \sum_{k=1}^{m-1} \frac{1}{m-k} \qquad (9.28)$$
$$= 1 + \frac{1}{2} + \frac{1}{3} + ... + \frac{1}{m-1} \qquad (9.29)$$
$$\approx \int_1^m \frac{1}{u}\, du \qquad (9.30)$$
$$= \ln(m) \qquad (9.31)$$
where the approximation is something you might remember from calculus (you can picture it by drawing rectangles to approximate the area under the curve).
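A quick numerical look at that approximation (just a check; the gap between the two values is roughly the Euler-Mascheroni constant, about 0.577):

m <- 10000
sum(1/(1:(m-1)))  # exact harmonic sum, about 9.79
log(m)            # ln(m), about 9.21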
Now, what about our assumptions, (a) and (b)? The assumption in (a) of each cell having probability k/m should be reasonably accurate if k is much smaller than m, because hash functions tend to distribute probes uniformly, and the assumption of independence of successive probes is all right too, since it is very unlikely that we would hit the same cell twice. However, if k is not much smaller than m, the accuracy will suffer.
Assumption (b) is more subtle, with differing interpretations. For example, the model may concern one specific database, in which case the assumption may be questionable. Presumably W grows over time, in which case the assumption would make no sense; W wouldn't even have a distribution. We could instead think of a database which grows and shrinks as time progresses. However, even here, it would seem that W would probably oscillate around some value like m/2, rather than being uniformly distributed as assumed here. Thus, this model is probably not very realistic. However, even idealized models can sometimes provide important insights.
9.2 Simulation of Random Vectors
Let $X = (X_1, ..., X_k)'$ be a random vector having a specified distribution. How can we write code to simulate it? It is not always easy to do this. We'll discuss a couple of easy cases here, and illustrate what one may do in other situations.
The easiest case (and a very frequently-occurring one) is that in which the $X_i$ are independent. One simply simulates them individually, and that simulates X!
Another easy case is that in which X has a multivariate normal distribution. We noted in Section 8.5.2.1 that R includes the function mvrnorm() (in the MASS library), which we can use to simulate our X here. The way this function works is to use the notion of principal components mentioned in Section 8.5.2.6. We construct Y = AX for the matrix A discussed there. The $Y_i$ are independent, thus easily simulated, and then we transform back to X via $X = A^{-1}Y$.
In general, though, things may not be so easy. For instance, consider the distribution in (8.18). There is no formulaic solution here, but the following strategy works.

First we find the (marginal) density of X. As in the case for Y shown in (8.21), we compute

$$f_X(s) = \int_0^s 8st\, dt = 4s^3 \qquad (9.32)$$
Using the method shown in our unit on continuous probability, Section 4.7, we can simulate X as

$$X = F_X^{-1}(W) \qquad (9.33)$$

where W is a U(0,1) random variable, generated as runif(1). Since $F_X(u) = u^4$, we have $F_X^{-1}(v) = v^{0.25}$, and thus our code to simulate X is

runif(1)^0.25
Now that we have X, we can get Y. We know that

$$f_{Y|X}(t|s) = \frac{8st}{4s^3} = \frac{2}{s^2}\, t \qquad (9.34)$$

Remember, here s is considered constant. So again we use the inverse-cdf method to find Y, given X, and then we have our pair (X,Y).
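Putting both steps together (a sketch; the conditional inverse cdf below is worked out from (9.34), namely $F_{Y|X}(t|s) = t^2/s^2$, with inverse $s\sqrt{w}$):

# simulate one (X,Y) pair from the density f(s,t) = 8st, 0 < t < s < 1
simxy <- function() {
   x <- runif(1)^0.25       # inverse-cdf method for X
   y <- x * sqrt(runif(1))  # inverse-cdf method for Y given X = x
   c(x,y)
}
xy <- t(replicate(10000,simxy()))
mean(xy[,1])  # should be near E(X) = 4/5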
9.3 Mixture Models
To introduce this topic, suppose men's heights are normally distributed with mean 70 and standard deviation 3, with women's heights being normal with mean 66 and standard deviation 2.5. Let H denote the height of a randomly selected person from the entire population, and let G be the person's gender, 1 for male and 2 for female.

Then the conditional distribution of H, given G = 1, is N(70,9), and a similar statement holds for G = 2. But what about the unconditional distribution of H? Taking the two genders to be equally numerous, so that P(G = 1) = P(G = 2) = 0.5, we can derive it:
$$f_H(t) = \frac{d}{dt} F_H(t) \qquad (9.35)$$
$$= \frac{d}{dt} P(H \le t) \qquad (9.36)$$
$$= \frac{d}{dt} P(H \le t \text{ and } G = 1, \text{ or } H \le t \text{ and } G = 2) \qquad (9.37)$$
$$= \frac{d}{dt}\left[0.5\, P(H \le t \mid G = 1) + 0.5\, P(H \le t \mid G = 2)\right] \qquad (9.38)$$
$$= \frac{d}{dt}\left[0.5\, F_{H|G=1}(t) + 0.5\, F_{H|G=2}(t)\right] \qquad (9.39)$$
$$= 0.5\, f_{H|G=1}(t) + 0.5\, f_{H|G=2}(t) \qquad (9.40)$$
So the density of H in the grand population is the average of the densities of H in the two subpopulations. This makes intuitive sense.

In terms of shape, $f_H$, being the average of two bells that are spaced apart, will look like a two-humped camel, instead of a bell. We call the distribution of H a mixture distribution, with the name alluding to the fact that we mixed the two bells to get the two-humped camel.
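We can plot this density directly (a quick illustration; note that with means only 4 inches apart the two humps largely blend together, so for a clearly two-humped picture, try moving the first mean out to, say, 80):

t <- seq(50,90,0.1)
fh <- 0.5*dnorm(t,70,3) + 0.5*dnorm(t,66,2.5)  # equation (9.40)
plot(t,fh,type="l",xlab="height",ylab="density")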
Another example is that of overdispersion in connection with Poisson models. Recall the following
about the Poisson distribution family:
(a) This family is often used to model counts.
(b) For any Poisson distribution, the variance equals the mean.
In some cases in which we are modeling count data, one may try to fit a mixture of several Poisson distributions, instead of a single one. This frees us of constraint (b), as can be seen as follows:
Suppose X can equal 1,2,...,k, with probabilities $p_1, ..., p_k$ that sum to 1. Say the distribution of Y given X = i is Poisson with parameter $\lambda_i$. Then by the Law of Total Expectation,

$$EY = E[E(Y|X)] \qquad (9.41)$$
$$= E(\lambda_X) \qquad (9.42)$$
$$= \sum_{i=1}^k p_i \lambda_i \qquad (9.43)$$
Note that in the above, the expression $\lambda_X$ is a random variable, since its subscript X is random. Indeed, it is a function of X, so Equation (3.25) then applies, yielding the final equation. The random variable $\lambda_X$ takes on the values $\lambda_1, ..., \lambda_k$ with probabilities $p_1, ..., p_k$, hence that final sum.
The corresponding formula for variance, (9.9), can be used to derive Var(Y):

$$Var(Y) = E[Var(Y|X)] + Var[E(Y|X)] \qquad (9.44)$$
$$= E(\lambda_X) + Var(\lambda_X) \qquad (9.45)$$
We already evaluated the first term, in (9.41). The second term is evaluated the same way: This is the variance of a random variable that takes on the values $\lambda_1, ..., \lambda_k$ with probabilities $p_1, ..., p_k$, which is

$$\sum_{i=1}^k p_i (\lambda_i - \mu)^2 \qquad (9.46)$$

where

$$\mu = E\lambda_X = \sum_{i=1}^k p_i \lambda_i \qquad (9.47)$$
Thus

$$EY = \mu \qquad (9.48)$$

and

$$Var(Y) = \mu + \sum_{i=1}^k p_i (\lambda_i - \mu)^2 \qquad (9.49)$$
So, as long as the $\lambda_i$ are not all equal and no $p_i = 1$, we have

$$Var(Y) > EY \qquad (9.50)$$

in this Poisson mixture model, in contrast to the single-Poisson case in which Var(Y) = EY. You can now see why the Poisson mixture model is called an overdispersion model.

So, if one has count data in which the variance is greater than the mean, one might try using this model.
In mixing the Poissons, there is no need to restrict to discrete X. In fact, it is not hard to derive
the fact that if X has a gamma distribution with parameters r and p/(1-p) for some 0 < p < 1, and
Y given X has a Poisson distribution with mean X, then the resulting Y neatly turns out to have
a negative binomial distribution.
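A simulation sketch of that gamma-Poisson mixture, showing the overdispersion (the parameter values here are arbitrary, chosen just for illustration):

n <- 100000
r <- 2; p <- 0.4
lam <- rgamma(n,shape=r,rate=p/(1-p))  # random Poisson means
y <- rpois(n,lam)                      # Y given lambda is Poisson(lambda)
mean(y)  # about 3
var(y)   # about 7.5, clearly exceeding the mean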
9.4 Transform Methods
We often use the idea of transform functions. For example, you may have seen Laplace transforms in a math or engineering course. The functions we will see here differ from those by just a change of variable.
Though in the form used here they involve only univariate distributions, their applications are often
multivariate, as will be the case here.
9.4.1 Generating Functions
Let's start with the generating function. For any nonnegative-integer valued random variable V, its generating function is defined by

$$g_V(s) = E(s^V) = \sum_{i=0}^{\infty} s^i\, p_V(i), \quad 0 \le s \le 1 \qquad (9.51)$$
For instance, suppose N has a geometric distribution with parameter p, so that $p_N(i) = (1-p)p^{i-1}$, i = 1,2,... Then

$$g_N(s) = \sum_{i=1}^{\infty} s^i (1-p) p^{i-1} = \frac{1-p}{p} \sum_{i=1}^{\infty} s^i p^i = \frac{1-p}{p} \cdot \frac{ps}{1-ps} = \frac{(1-p)s}{1-ps} \qquad (9.52)$$
Why restrict s to the interval [0,1]? The answer is that for s > 1 the series in (9.51) may not converge, while for $0 \le s \le 1$ the series does converge. To see this, note that if s = 1, we just get the sum of all the probabilities, which is 1.0. If a nonnegative s is less than 1, then $s^i$ will also be less than 1, so we still have convergence.
One use of the generating function is, as its name implies, to generate the probabilities of values for the random variable in question. In other words, if you have the generating function but not the probabilities, you can obtain the probabilities from the function. Here's why: For clarity, write (9.51) as

$$g_V(s) = P(V = 0) + sP(V = 1) + s^2 P(V = 2) + ... \qquad (9.53)$$
From this we see that

$$g_V(0) = P(V = 0) \qquad (9.54)$$

So, we can obtain P(V = 0) from the generating function. Now differentiating (9.53) with respect to s, we have

$$g_V'(s) = \frac{d}{ds}\left[P(V = 0) + sP(V = 1) + s^2 P(V = 2) + ...\right] = P(V = 1) + 2sP(V = 2) + ... \qquad (9.55)$$
So $g_V'(0) = P(V = 1)$. Differentiating once more gives $g_V''(0) = 2P(V = 2)$, so we can obtain P(V = 2) from $g_V''(0)$, and in a similar manner can calculate the other probabilities from the higher derivatives.
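We can even let R do the differentiating, via its symbolic derivative function D(). Here is a small check using the geometric generating function (9.52):

p <- 0.3
g <- expression((1-p)*s/(1-p*s))
g1 <- D(g,"s")   # first derivative of the generating function
g2 <- D(g1,"s")  # second derivative
s <- 0
eval(g1)         # P(N=1) = 1-p = 0.7
eval(g2)/2       # P(N=2) = (1-p)p = 0.21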
9.4.2 Moment Generating Functions
The generating function is handy, but it is limited to discrete random variables. More generally, we can use the moment generating function, defined for any random variable X as

$$m_X(t) = E[e^{tX}] \qquad (9.56)$$

for any t for which the expected value exists.
That last restriction is anathema to mathematicians, so they use the characteristic function,

$$\phi_X(t) = E[e^{itX}] \qquad (9.57)$$

which exists for any t. However, it makes use of pesky complex numbers, so we'll stay clear of it here.
Differentiating (9.56) with respect to t, we have

$$m_X'(t) = E[Xe^{tX}] \qquad (9.58)$$

We see then that

$$m_X'(0) = EX \qquad (9.59)$$

So, if we just know the moment generating function of X, we can obtain EX from it. Also,

$$m_X''(t) = E(X^2 e^{tX}) \qquad (9.60)$$

so

$$m_X''(0) = E(X^2) \qquad (9.61)$$

In this manner, we can for various k obtain $E(X^k)$, the kth moment of X, hence the name.
9.4.3 Transforms of Sums of Independent Random Variables
Suppose X and Y are independent and their moment generating functions are defined. Let Z = X+Y. Then

$$m_Z(t) = E[e^{t(X+Y)}] = E[e^{tX} e^{tY}] = E(e^{tX}) \cdot E(e^{tY}) = m_X(t)\, m_Y(t) \qquad (9.62)$$
In other words, the mgf of the sum is the product of the mgfs! This is true for other transforms,
by the same reasoning.
Similarly, it's clear that the mgf of a sum of three independent variables is again the product of their mgfs, and so on.
9.4.4 Example: Network Packets
As an example, suppose the number of packets N received on a network link in a given time period has a Poisson distribution with mean $\lambda$, i.e.

$$P(N = k) = \frac{e^{-\lambda}\lambda^k}{k!}, \quad k = 0, 1, 2, 3, ... \qquad (9.63)$$
9.4.4.1 Poisson Generating Function
Let's first find its generating function:

$$g_N(t) = \sum_{k=0}^{\infty} t^k \frac{e^{-\lambda}\lambda^k}{k!} = e^{-\lambda} \sum_{k=0}^{\infty} \frac{(\lambda t)^k}{k!} = e^{-\lambda + \lambda t} \qquad (9.64)$$
where we made use of the Taylor series from calculus,

$$e^u = \sum_{k=0}^{\infty} u^k/k! \qquad (9.65)$$
9.4.4.2 Sums of Independent Poisson Random Variables Are Poisson Distributed
Suppose packets come in to a network node from two independent links, with counts $N_1$ and $N_2$, Poisson distributed with means $\lambda_1$ and $\lambda_2$. Let's find the distribution of $N = N_1 + N_2$, using a transform approach.
From Section 9.4.3:

$$g_N(t) = g_{N_1}(t)\, g_{N_2}(t) = e^{-\lambda + \lambda t} \qquad (9.66)$$

where $\lambda = \lambda_1 + \lambda_2$.
But the last expression in (9.66) is the generating function for a Poisson distribution too! And since there is a one-to-one correspondence between distributions and transforms, we can conclude that N has a Poisson distribution with parameter $\lambda$. We of course knew that N would have mean $\lambda$, but did not know that N would have a Poisson distribution.

So: A sum of two independent Poisson variables itself has a Poisson distribution. By induction, this is also true for sums of k independent Poisson variables.
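A quick numerical confirmation (a sketch; it computes the pmf of the sum of Poisson(1) and Poisson(2) variables by direct convolution and compares to the Poisson(3) pmf):

psum <- sapply(0:15, function(k)
   sum(dpois(0:k,1) * dpois(k:0,2)))  # P(N1+N2 = k) by convolution
max(abs(psum - dpois(0:15,3)))        # essentially 0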
9.4.5 Random Number of Bits in Packets on One Link
Consider just one of the two links now, and for convenience denote the number of packets on the link by N, and its mean as $\lambda$. Continue to assume that N has a Poisson distribution.

Let B denote the number of bits in a packet, with $B_1, ..., B_N$ denoting the bit counts in the N packets. We assume the $B_i$ are independent and identically distributed. The total number of bits received during that time period is

$$T = B_1 + ... + B_N \qquad (9.67)$$
(9.67)
Suppose the generating function of B is known to be h(s). Then what is the generating function of
T?
g
T
(s) = E(s
T
) (9.68)
= E[E(s
T
[N)] (9.69)
= E[E(s
B
1
+...+B
N
[N)] (9.70)
= E[E(s
B
1
[N)...E(s
B
N
[N)] (9.71)
= E[h(s)
N
] (9.72)
= g
N
[h(s)] (9.73)
= e
+h(s)
(9.74)
Here is how these steps were made:

- From the first line to the second, we used the Theorem of Total Expectation.
- From the second to the third, we just used the definition of T.
- From the third to the fourth lines, we used algebra plus the fact that the expected value of a product of independent random variables is the product of their individual expected values.
- From the fourth to the fifth, we used the definition of h(s).
- From the fifth to the sixth, we used the definition of $g_N$.
- From the sixth to the last, we used the formula for the generating function of a Poisson distribution with mean $\lambda$.
We can then get all the information about T we need from this formula, such as its mean, variance,
probabilities and so on, as seen previously.
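For instance, differentiating (9.74) and setting s = 1 gives $ET = \lambda\, EB$. Here is a simulation sketch of that fact, with a made-up distribution for B (uniform on {100,...,200}, an arbitrary choice):

lambda <- 5
onet <- function() {
   n <- rpois(1,lambda)                 # number of packets
   if (n == 0) return(0)
   sum(sample(100:200,n,replace=TRUE))  # total bits; here EB = 150
}
mean(replicate(50000,onet()))  # should be near 5*150 = 750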
9.4.6 Other Uses of Transforms
Transform techniques are used heavily in queuing analysis, including for models of computer net-
works. The techniques are also used extensively in modeling of hardware and software reliability.
Transforms also play key roles in much of theoretical probability, the Central Limit Theorems^2 being a good example. Here's an outline of the proof of the basic CLT, assuming the notation of Section 4.5.2.9:
First rewrite Z as

$$Z = \sum_{i=1}^n \frac{X_i - m}{v\sqrt{n}} \qquad (9.75)$$
Then work with the characteristic function of Z:

$$c_Z(t) = E(e^{itZ}) \quad \text{(def.)} \qquad (9.76)$$
$$= \prod_{i=1}^n E\left[e^{it(X_i - m)/(v\sqrt{n})}\right] \quad \text{(indep.)} \qquad (9.77)$$
$$= \prod_{i=1}^n E\left[e^{it(X_1 - m)/(v\sqrt{n})}\right] \quad \text{(ident. distr.)} \qquad (9.78)$$
$$= \left[g\left(\frac{t}{\sqrt{n}}\right)\right]^n \qquad (9.79)$$

^2 The plural is used here because there are many different versions, which for instance relax the condition that the summands be independent and identically distributed.
where g(s) is the characteristic function of $(X_1 - m)/v$, i.e.

$$g(s) = E\left[e^{is(X_1 - m)/v}\right] \qquad (9.80)$$
Now expand (9.79) in a Taylor series around 0, and use the fact that $g'(0) = iE[(X_1 - m)/v]$, which is 0:

$$\left[g\left(\frac{t}{\sqrt{n}}\right)\right]^n = \left[1 - \frac{t^2}{2n} + o\left(\frac{t^2}{n}\right)\right]^n \qquad (9.81)$$
$$\to e^{-t^2/2} \quad \text{as } n \to \infty \qquad (9.82)$$

where we've also used the famous fact that $(1 - s/n)^n$ converges to $e^{-s}$ as $n \to \infty$.
But (9.82) is the characteristic function of the N(0,1) distribution, and since there is a one-to-one correspondence between distributions and transforms, this shows that the distribution of Z converges to N(0,1).
9.5 Vector Space Interpretations (for the mathematically adventurous only)
The abstract vector space notion in linear algebra has many applications to statistics. We develop
some of that material in this section.
Consider the set of all random variables associated with some experiment, in our "notebook" sense from Section 2.2. (In more mathematical treatments, we would refer here to the set of all random variables defined on some probability space.) Note that some of these random variables are independent of each other, while others are not; we are simply considering the totality of all random variables that arise from our experiment.
Let $\mathcal{V}$ be the set of all such random variables having finite variance and mean 0. We can set up $\mathcal{V}$ as a vector space. For that, we need to define a sum and a scalar product. Define the sum of any two vectors X and Y to be the random variable X+Y. For any constant c, the vector cX is the random variable cX. Note that $\mathcal{V}$ is closed under these operations, as it must be: If X and Y both have mean 0, then X+Y does too, and so on.
Define an inner product on this space:

$$(X,Y) = E(XY) = Cov(X,Y) \qquad (9.83)$$
(Recall that Cov(X,Y) = E(XY) - EX·EY, and that we are working with random variables that have mean 0.) Thus the norm of a vector X is

$$\|X\| = (X,X)^{0.5} = \sqrt{E(X^2)} = \sqrt{Var(X)} \qquad (9.84)$$

again since E(X) = 0.
9.6 Properties of Correlation
The famous Cauchy-Schwarz Inequality for inner products says

$$|(X,Y)| \le \|X\| \cdot \|Y\| \qquad (9.85)$$

i.e.

$$|\rho(X,Y)| \le 1 \qquad (9.86)$$

Also, the Cauchy-Schwarz Inequality yields equality if and only if one vector is a scalar multiple of the other, i.e. Y = cX for some c. When we then translate this to random variables of nonzero means, we get Y = cX + d.
In other words, the correlation between two random variables is between -1 and 1, with equality if
and only if one is an exact linear function of the other.
9.7 Conditional Expectation As a Projection
For a random variable X in $\mathcal{V}$, let $\mathcal{W}$ denote the subspace of $\mathcal{V}$ consisting of all functions h(X) with mean 0 and finite variance. (Again, note that this subspace is indeed closed under vector addition and scalar multiplication.)

Now consider any Y in $\mathcal{V}$. Recall that the projection of Y onto $\mathcal{W}$ is the closest vector T in $\mathcal{W}$ to Y, i.e. T minimizes $\|Y - T\|$. That latter quantity is

$$\left(E[(Y - T)^2]\right)^{0.5} \qquad (9.87)$$
To find the minimizing T, consider first the minimization of

$$E[(S - c)^2] \qquad (9.88)$$

with respect to a constant c, for some random variable S. We already solved this problem back in Section 3.60. The minimizing value is c = ES.
Getting back to (9.87), use the Law of Total Expectation to write

$$E[(Y - T)^2] = E\left[E[(Y - T)^2 \mid X]\right] \qquad (9.89)$$

From what we learned with (9.88), applied to the conditional (i.e. inner) expectation in (9.89), we see that the T which minimizes (9.89) is T = E(Y|X).
In other words, the conditional mean is a projection! Nice, but is this useful in any way? The answer is yes, in the sense that it guides the intuition. All this is related to issues of statistical prediction (here we would be predicting Y from X), and the geometry can really guide our insight. This is not very evident without getting deeply into the prediction issue, but let's explore some of the implications of the geometry.

For example, a projection is perpendicular to the line connecting the projection to the original vector. So

$$0 = (E(Y|X),\ Y - E(Y|X)) = Cov[E(Y|X),\ Y - E(Y|X)] \qquad (9.90)$$

This says that the prediction E(Y|X) is uncorrelated with the prediction error, Y - E(Y|X). This in turn has statistical importance. Of course, (9.90) could have been derived directly, but the geometry of the vector space interpretation is what suggested we look at the quantity in the first place. Again, the point is that the vector space view can guide our intuition.
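Here is a quick simulation check of (9.90) (a sketch; Y is generated so that E(Y|X) is known in closed form):

x <- rnorm(100000)
y <- x^2 + rnorm(100000)  # so that E(Y|X) = X^2
pred <- x^2
cov(pred, y - pred)       # should be near 0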
Similarly, the Pythagorean Theorem holds, so

$$\|Y\|^2 = \|E(Y|X)\|^2 + \|Y - E(Y|X)\|^2 \qquad (9.91)$$

which means that

$$Var(Y) = Var[E(Y|X)] + Var[Y - E(Y|X)] \qquad (9.92)$$

Equation (9.92) is a common theme in linear models in statistics, the decomposition of variance.
There is an equivalent form that is useful as well, derived as follows from the second term in (9.92). Since

$$E[Y - E(Y|X)] = EY - E[E(Y|X)] = EY - EY = 0 \qquad (9.93)$$

we have

$$Var[Y - E(Y|X)] = E\left[(Y - E(Y|X))^2\right] \qquad (9.94)$$
$$= E\left[Y^2 - 2Y E(Y|X) + (E(Y|X))^2\right] \qquad (9.95)$$
Now consider the middle term, $E[2Y E(Y|X)]$. Conditioning on X and using the Law of Total Expectation, we have

$$E[2Y E(Y|X)] = 2E\left[(E(Y|X))^2\right] \qquad (9.96)$$
Then (9.94) becomes

$$Var[Y - E(Y|X)] = E(Y^2) - E\left[(E(Y|X))^2\right] \qquad (9.97)$$
$$= E\left[E(Y^2|X)\right] - E\left[(E(Y|X))^2\right] \qquad (9.98)$$
$$= E\left[E(Y^2|X) - (E(Y|X))^2\right] \qquad (9.99)$$
$$= E[Var(Y|X)] \qquad (9.100)$$

the latter coming from our old friend, $Var(U) = E(U^2) - (EU)^2$, with U being Y here, under conditioning by X.
In other words, we have just derived another famous formula:

$$Var(Y) = E[Var(Y|X)] + Var[E(Y|X)] \qquad (9.101)$$
9.8 Proof of the Law of Total Expectation
Let's prove (9.6) for the case in which W and Y take values only in the set {1,2,3,...}. Recall that if T is an integer-valued random variable and we have some function h(), then L = h(T) is another random variable, and its expected value can be calculated as^3

$$E(L) = \sum_k h(k) P(T = k) \qquad (9.102)$$

^3 This is sometimes called The Law of the Unconscious Statistician, by nasty probability theorists who look down on statisticians. Their point is that technically $EL = \sum_k k P(L = k)$, and that (9.102) must be proven, whereas the statisticians supposedly think it's a definition.
In our case here, Q is a function of W, so we find its expectation from the distribution of W:

$$E(Q) = \sum_{i=1}^{\infty} g(i) P(W = i)$$
$$= \sum_{i=1}^{\infty} E(Y \mid W = i) P(W = i)$$
$$= \sum_{i=1}^{\infty} \left[\sum_{j=1}^{\infty} j P(Y = j \mid W = i)\right] P(W = i)$$
$$= \sum_{j=1}^{\infty} j \sum_{i=1}^{\infty} P(Y = j \mid W = i) P(W = i)$$
$$= \sum_{j=1}^{\infty} j P(Y = j)$$
$$= E(Y)$$
In other words,

$$E(Y) = E[E(Y|W)] \qquad (9.103)$$
Exercises
1. In the catchup game in Section 7.1.5, let V and W denote the winnings of the two players after
only one turn. Find P(V > 0.4).
2. Use transform methods to derive some properties of the Poisson family:
(a) Show that for any Poisson random variable, its mean and variance are equal.
(b) Suppose X and Y are independent random variables, each having a Poisson distribution.
Show that Z = X +Y again has a Poisson distribution.
3. Suppose one keeps rolling a die. Let $S_n$ denote the total number of dots after n rolls, mod 8, and let T be the number of rolls needed for the event $S_n = 0$ to occur. Find E(T), using an approach like that in the trapped miner example in Section 9.1.5.
4. In our ordinary coins which we use every day, each one has a slightly different probability of heads, which we'll call H. Say H has the distribution $N(0.5, 0.03^2)$. We choose a coin from a batch at random, then toss it 10 times. Let N be the number of heads we get. Find Var(N).
5. Suppose the number N of bugs in a certain number of lines of code has a Poisson distribution,
with parameter L, where L varies from one programmer to another. Show that Var(N) = EL +
Var(L).
6. This problem arises from the analysis of random graphs, which for concreteness we will treat
here as social networks such as Facebook.
In the model here, each vertex in the graph has N friends, N being a random variable with the
same distribution at every vertex. One thinks of each vertex as generating its links, unterminated,
i.e. not tied yet to a second vertex. Then the unterminated links of a vertex pair o at random
with those of other vertices. (Those that fail will just pair in self loops, but well ignore that.)
Let M denote the number of friends a friend of mine has. That is, start at a vertex A, and follow
a link from A to another vertex, say B. M is the number of friends B has (well include A in this
number).
(a) Since an unterminated link from A is more likely to pair up with a vertex that has a lot of links, a key assumption is that P(M = k) = ck P(N = k) for some constant c. Fill in the blank: This is an example of the setting we studied called ______________.

(b) Show the following relation of generating functions: $g_M(s) = g_N'(s)/EN$.
7. Suppose Type 1 batteries have exponentially distributed lifetimes with mean 2.0 hours, while Type 2 battery lifetimes are exponentially distributed with mean 1.5. We have a large box containing a mixture of the two types of batteries, in proportions q and 1-q. We reach into the box, choose a battery at random, then use it. Let Y be the lifetime of the battery we choose. Use the Law of Total Variance, (9.9), to find Var(Y).
8. In the backup battery example in Section 8.3.5, find Var(W), using the Law of Total Expectation.
9. Let X denote the number we obtain when we roll a single die once. Let $G_X(s)$ denote the generating function of X.

(a) Find $G_X(s)$.

(b) Suppose we roll the die 5 times, and let T denote the total number of dots we get from the 5 rolls. Find $G_T(s)$.
10. Consider this model of disk seeks. For simplicity, we'll assume a very tiny number of tracks, 3. Let $X_1$ and $X_2$ denote the track numbers of two successive disk requests. Each has a uniform distribution on {1,2,3}. But given $X_1 = i$, then $X_2 = i$ with probability 0.4, with $X_2$ being j with probability 0.3 for each $j \ne i$. (Convince yourself that these last two sentences are consistent with each other.) Find the following:

(a) $P(|X_1 - X_2| \le 1)$

(b) $E(|X_1 - X_2|)$

(c) $F_{X_1,X_2}(2,2)$
11. Consider the computer worm example in Section 8.3.8. Let R denote the time it takes to go from state 1 to state 3. Find $f_R(v)$. (Leave your answer in integral form.)
12. Suppose (X,Y) has a bivariate normal distribution, with EX = EY = 0, Var(X) = Var(Y) = 1, and $\rho(X,Y) = 0.2$. Find the following, in integral forms:

(a) $E(X^2 + XY^{0.5})$

(b) $P(Y > 0.5X)$

(c) $F_{X,Y}(0.6, 0.2)$
13. Suppose $X_i$, i = 1,2,3,4,5 are independent and each has mean 0 and variance 1. Let $Y_i = X_{i+1} - X_i$, i = 1,2,3,4. Using the material in Section 7.3, find the covariance matrix of $Y = (Y_1, Y_2, Y_3, Y_4)$.
Chapter 10
Introduction to Confidence Intervals
Consider the following problems:

- Suppose you buy a ticket for a raffle, and get ticket number 68. Two of your friends bought tickets too, getting numbers 46 and 79. Let c be the total number of tickets sold. You don't know the value of c, but hope it's small, so you have a better chance of winning. How can you estimate the value of c, from the data, 68, 46 and 79?
- It's presidential election time. A poll says that 56% of the voters polled support candidate X, with a margin of error of 2%. The poll was based on a sample of 1200 people. How can a sample of 1200 people out of more than 100 million voters have a margin of error that small? And what does the term "margin of error" really mean, anyway?
- A satellite detects a bright spot in a forest. Is it a fire? How can we design the software on the satellite to estimate the probability that this is a fire?
If you think that statistics is nothing more than adding up columns of numbers and plugging into formulas, you are badly mistaken. Actually, statistics is an application of probability theory. We employ probabilistic models for the behavior of our sample data, and infer from the data accordingly; hence the name, statistical inference.
Arguably the most powerful use of statistics is prediction. This has applications from medicine to
marketing to movie animation. We will study prediction in Chapter 15.
10.1 Sampling Distributions
We first will set up some infrastructure, which will be used heavily throughout the next few chapters.
10.1.1 Random Samples
Definition 26 Random variables $X_1, X_2, X_3, ...$ are said to be i.i.d. if they are independent and identically distributed. The latter term means that $p_{X_i}$ or $f_{X_i}$ is the same for all i.
For i.i.d. $X_1, X_2, X_3, ...$, we often use X to represent a generic random variable having the common distribution of the $X_i$.
Definition 27 We say that $X_1, X_2, X_3, ..., X_n$ is a random sample of size n from a population if the $X_i$ are i.i.d. and their common distribution is that of the population.
If the sampled population is finite,^1 then a random sample must be drawn in this manner. Say there are k entities in the population, e.g. k people, with values $v_1, ..., v_k$. If we are interested in people's heights, for instance, then $v_1, ..., v_k$ would be the heights of all people in our population. Then a random sample is drawn this way:
(a) The sampling is done with replacement.
(b) Each $X_i$ is drawn from $v_1, ..., v_k$, with each $v_j$ having probability $\frac{1}{k}$ of being drawn.
Condition (a) makes the $X_i$ independent, while (b) makes them identically distributed.
If sampling is done without replacement, we call the data a simple random sample. Note how this implies lack of independence of the $X_i$. If for instance $X_1 = v_3$, then we know that no other $X_i$ has that value, contradicting independence; if the $X_i$ were independent, knowledge of one should not give us knowledge concerning others.
But we assume true random sampling from here onward.
Note most carefully that each $X_i$ has the same distribution as the population. If for instance a third of the population, i.e. a third of the $v_j$, are less than 28, then $P(X_i < 28)$ will be 1/3. This point is easy to see, but keep it in mind at all times, as it will arise again and again.
We will often make statements like, "Let X be distributed according to the population." This simply means that $P(X = v_j) = \frac{1}{k}$, j = 1,...,k.
What about drawing from an infinite population? This may sound odd at first, but it relates to the fact, noted at the outset of Chapter 4, that although continuous random variables don't really exist, they often make a good approximation. In our human height example above, for instance, heights do tend to follow a bell-shaped curve which is well-approximated by a normal distribution.
^1 You might wonder how it could be infinite. This will be discussed shortly.
In this case, each $X_i$ is modeled as having a continuum of possible values, corresponding to a theoretically infinite population. Each $X_i$ then has the same density as the population density.
10.1.2 Example: Subpopulation Considerations
To get a better understanding of the fact that the $X_i$ are random variables, consider an election poll in the following setting:

- The total population size is m.
- We sample n people at random.
- In the population, there are d Democrats, r Republicans and o people we'll refer to as Others.
Let D, R and O denote the number of people of the three types that we get in our sample. It would be nice if our sample contained Democrats, Republicans and Others in proportions roughly the same as in the population. In order to see how likely this is to occur, let's find the probability mass function of the random vector (D,R,O),

$$p_{D,R,O}(i,j,k) = P(D = i, R = j, O = k) \qquad (10.1)$$
Case I: Random Sample

Here the $X_i$ are i.i.d., with each one being one of the three categories (Democrat, Republican, Other). Moreover, the random variables D, R and O are the total counts of the number of times each of the three categories occurs. In other words, this is exactly the setting of Section 8.5.1.1, and the random vector (D,R,O) has a multinomial distribution!

So, we evaluate (10.1) by using (8.94) with

$$p_1 = d/m,\quad p_2 = r/m,\quad p_3 = o/m \qquad (10.2)$$
Case II: Simple Random Sample

This is a combinatorial problem, from Section 2.13:

$$P(D = i, R = j, O = k) = \frac{\binom{d}{i}\binom{r}{j}\binom{o}{k}}{\binom{m}{n}} \qquad (10.3)$$
10.1.3 The Sample Mean, a Random Variable
A large part of this chapter will concern the sample mean,

$$\bar{X} = \frac{X_1 + X_2 + X_3 + ... + X_n}{n} \qquad (10.4)$$
Since $X_1, X_2, X_3, ..., X_n$ are random variables, $\bar{X}$ is a random variable too.
Make absolutely sure to distinguish between the sample mean and the population mean.
The point that $\bar{X}$ is a random variable is another simple yet crucial concept. Let's illustrate it with a tiny example. Suppose we have a population of three people, with heights 69, 72 and 70, and we draw a random sample of size 2. Here $\bar{X}$ can take on six values:

$$\frac{69+69}{2} = 69,\ \frac{69+72}{2} = 70.5,\ \frac{69+70}{2} = 69.5,\ \frac{70+70}{2} = 70,\ \frac{70+72}{2} = 71,\ \frac{72+72}{2} = 72 \qquad (10.5)$$
The probabilities of these values are 1/9, 2/9, 2/9, 1/9, 2/9 and 1/9, respectively. So,

$$p_{\bar{X}}(69) = \frac{1}{9},\ p_{\bar{X}}(70.5) = \frac{2}{9},\ p_{\bar{X}}(69.5) = \frac{2}{9},\ p_{\bar{X}}(70) = \frac{1}{9},\ p_{\bar{X}}(71) = \frac{2}{9},\ p_{\bar{X}}(72) = \frac{1}{9} \qquad (10.6)$$
Viewing it in notebook terms, we might have, in the first three lines:

notebook line | $X_1$ | $X_2$ | $\bar{X}$
1 | 70 | 70 | 70
2 | 69 | 70 | 69.5
3 | 72 | 70 | 71
Again, the point is that all of $X_1$, $X_2$ and $\bar{X}$ are random variables.
Now, returning to the case of general n and our sample $X_1, ..., X_n$, since $\bar{X}$ is a random variable, we can ask about its expected value and variance.
Let $\mu$ denote the population mean. Remember, each $X_i$ is distributed as is the population, so $EX_i = \mu$.

This then implies that the mean of $\bar{X}$ is also $\mu$. Here's why:
$$E(\bar{X}) = E\left(\frac{1}{n}\sum_{i=1}^n X_i\right) \quad \text{(def. of } \bar{X}\text{)} \qquad (10.7)$$
$$= \frac{1}{n}\, E\left(\sum_{i=1}^n X_i\right) \quad \text{(for const. c, } E(cU) = cEU\text{)} \qquad (10.8)$$
$$= \frac{1}{n}\sum_{i=1}^n EX_i \quad (E[U+V] = EU + EV) \qquad (10.9)$$
$$= \frac{1}{n}\, n\mu \quad (EX_i = \mu) \qquad (10.10)$$
$$= \mu \qquad (10.11)$$

$$Var(\bar{X}) = Var\left(\frac{1}{n}\sum_{i=1}^n X_i\right) \qquad (10.12)$$
$$= \frac{1}{n^2}\, Var\left(\sum_{i=1}^n X_i\right) \quad \text{(for const. c, } Var[cU] = c^2 Var[U]\text{)} \qquad (10.13)$$
$$= \frac{1}{n^2}\sum_{i=1}^n Var(X_i) \quad \text{(for U,V indep., } Var[U+V] = Var[U] + Var[V]\text{)} \qquad (10.14)$$
$$= \frac{1}{n^2}\, n\sigma^2 \qquad (10.15)$$
$$= \frac{1}{n}\,\sigma^2 \qquad (10.16)$$
10.1.4 Sample Means Are Approximately Normal, No Matter What the Population Distribution Is
The Central Limit Theorem tells us that the numerator in (10.4) has an approximate normal distribution. That means that affine transformations of that numerator are also approximately normally distributed (page 99). So:
Approximate distribution of (centered and scaled) $\bar{X}$: The quantity

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \qquad (10.17)$$

has an approximately N(0,1) distribution, where $\sigma^2$ is the population variance.
Make sure you understand why it is the N that is approximate here, not the 0 or 1.
So even if the population distribution is very skewed, multimodal and so on, the sample mean will
still have an approximate normal distribution. This will turn out to be the core of statistics.
10.1.5 The Sample Variance, Another Random Variable
Later we will be using the sample mean $\bar{X}$, a function of the $X_i$, to estimate the population mean $\mu$. What other function of the $X_i$ can we use to estimate the population variance $\sigma^2$?
Let X denote a generic random variable having the distribution of the $X_i$, which, note again, is the distribution of the population. Because of that property, we have

$$Var(X) = \sigma^2 \quad (\sigma^2 \text{ is the population variance}) \qquad (10.18)$$

Recall that by definition

$$Var(X) = E[(X - EX)^2] \qquad (10.19)$$
Let's estimate $Var(X) = \sigma^2$ by taking sample analogs in (10.19). Here are the correspondences:

population entity | sample entity
$EX$ | $\bar{X}$
$X$ | $X_i$
$E[\,]$ | $\frac{1}{n}\sum_{i=1}^n$

The sample analog of $\mu$ is $\bar{X}$. What about the sample analog of the E()? Well, since E() means averaging over the whole population of X's, the sample analog is to average over the sample. So, our sample analog of (10.19) is
$$s^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2 \qquad (10.20)$$
In other words, just as it is natural to estimate the population mean of X by its sample mean, the same holds for Var(X):

The population variance of X is the mean squared distance from X to its population mean, as X ranges over all of the population. Therefore it is natural to estimate Var(X) by the average squared distance of X from its sample mean, among our sample values $X_i$, shown in (10.20).^2

^2 Note the similarity to (3.30).
We use $s^2$ as our symbol for this estimate of population variance.^3 It should be noted that it is common to divide by n-1 instead of by n in (10.20). Though we will not take that approach here, it will be discussed in Section 12.2.2.

^3 Though I try to stick to the convention of using only capital letters to denote random variables, it is conventional to use lower case in this instance.
By the way, it can be shown that (10.20) is equal to

$$\frac{1}{n}\sum_{i=1}^n X_i^2 - \bar{X}^2 \qquad (10.21)$$

This is a handy way to calculate $s^2$, though it is subject to more roundoff error. Note that (10.21) is a sample analog of (3.30).
10.1.6 A Good Time to Stop and Review!
The material we've discussed in this section, i.e. since the beginning of Section 10.1, is absolutely key, forming the very basis of statistics. It will be used throughout all our chapters here on statistics. It would be highly worthwhile for the reader to review this section before continuing.
10.2 The Margin of Error and Confidence Intervals
To explain the idea of margin of error, let's begin with a problem that has gone unanswered so far: In our simulations in previous units, it was never quite clear how long the simulation should be run, i.e. what value to set for nreps in Section 2.12.3. Now we will finally address this issue.
As our example, recall the Bus Paradox in Section 5.3: Buses arrive at a certain bus stop at random times, with interarrival times being independent exponentially distributed random variables with mean 10 minutes. You arrive at the bus stop every day at a certain time, say four hours (240 minutes) after the buses start their morning run. What is your mean wait for the next bus?
We later found mathematically that, due to the memoryless property of the exponential distribution, our wait is again exponentially distributed with mean 10. But suppose we didn't know that, and we wished to find the answer via simulation. (Note to reader: Keep in mind throughout this example that we will be pretending that we don't know the mean wait is actually 10. Reminders of this will be brought up occasionally.)
We could write a program to do this:
doexpt <- function(opt) {
   lastarrival <- 0.0
   while (lastarrival < opt)
      lastarrival <- lastarrival + rexp(1,0.1)
   return(lastarrival-opt)
}

observationpt <- 240
nreps <- 1000
waits <- vector(length=nreps)
for (rep in 1:nreps) waits[rep] <- doexpt(observationpt)
cat("approx. mean wait = ",mean(waits),"\n")
Running the program yields
approx. mean wait = 9.653743
Note that $\mu$ here is a population mean, where our population is the set of all possible bus wait times (some more frequent than others). Our simulation, then, drew a sample of size 1000 from that population. The expression mean(waits) was our sample mean.
Now, was 1000 iterations enough? How close is this value 9.653743 to the true expected value of waiting time?^4
What we would like to do is something like what the pollsters do during presidential elections, when they say "Ms. X is supported by 62% of the voters, with a margin of error of 4%." In other words, we want to be able to attach a margin of error to that figure of 9.653743 above. We do this in the next section.
10.3 Confidence Intervals for Means
We are now set to make use of the infrastructure that we've built up in the preceding sections of this chapter. Everything will hinge on understanding that the sample mean is a random variable, with a known approximate distribution.
The goal of this section (and several that follow) is to develop a notion of margin of
error, just as you see in the election campaign polls. This raises two questions:
^4 Of course, continue to ignore the fact that we know that this value is 10.0. What we're trying to do here is figure out how to answer "how close is it?" questions in general, when we don't know the true mean.
(a) What do we mean by margin of error?
(b) How can we calculate it?
10.3.1 Confidence Intervals for Population Means
So, suppose we have a random sample $W_1, ..., W_n$ from some population with mean $\mu$ and variance $\sigma^2$.
Recall that (10.17) has an approximate N(0,1) distribution. We will be interested in the central 95% of the distribution N(0,1). Due to symmetry, that distribution has 2.5% of its area in the left tail and 2.5% in the right one. Through the R call qnorm(0.025), or by consulting a N(0,1) cdf table in a book, we find that the cutoff points are at -1.96 and 1.96. In other words, if some random variable T has a N(0,1) distribution, then P(-1.96 < T < 1.96) = 0.95.
Thus

$$0.95 \approx P\left(-1.96 < \frac{\bar{W} - \mu}{\sigma/\sqrt{n}} < 1.96\right) \qquad (10.22)$$
(Note the approximation sign.) Doing a bit of algebra on the inequalities yields

$$0.95 \approx P\left(\bar{W} - 1.96\frac{\sigma}{\sqrt{n}} < \mu < \bar{W} + 1.96\frac{\sigma}{\sqrt{n}}\right) \qquad (10.23)$$
Now remember, not only do we not know $\mu$, we also don't know $\sigma$. But we can estimate it, as we saw, via (10.20). One can show (the details will be given in Section 13.1) that (10.23) is still valid if we substitute s for $\sigma$, i.e.

$$0.95 \approx P\left(\bar{W} - 1.96\frac{s}{\sqrt{n}} < \mu < \bar{W} + 1.96\frac{s}{\sqrt{n}}\right) \qquad (10.24)$$
In other words, we are about 95% sure that the interval
(W 1.96
s

n
, W + 1.96
s

n
) (10.25)
contains . This is called a 95% condence interval for . The quantity 1.96
s

n
is the margin
of error.
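To make the recipe concrete, here is a small R sketch that computes (10.25) by hand; the data vector x is hypothetical, simulated just for illustration:

# toy illustration of (10.25): approximate 95% CI for a population mean
x <- rnorm(100,mean=50,sd=10)   # pretend this is our sample W_1,...,W_n
n <- length(x)
wbar <- mean(x)
s <- sqrt(mean(x^2) - wbar^2)   # "divide by n" sample standard deviation, as in (10.20)
radius <- 1.96*s/sqrt(n)
c(wbar-radius,wbar+radius)      # the interval (10.25)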
10.3.2 Example: Simulation Output

We could add this feature to our program in Section 10.2:

doexpt <- function(opt) {
   lastarrival <- 0.0
   while (lastarrival < opt)
      lastarrival <- lastarrival + rexp(1,0.1)
   return(lastarrival-opt)
}

observationpt <- 240
nreps <- 10000
waits <- vector(length=nreps)
for (rep in 1:nreps) waits[rep] <- doexpt(observationpt)
wbar <- mean(waits)
cat("approx. mean wait =",wbar,"\n")
s2 <- mean(waits^2) - wbar^2   # "divide by n" sample variance, as in (10.21)
s <- sqrt(s2)
radius <- 1.96*s/sqrt(nreps)
cat("approx. CI for EW =",wbar-radius,"to",wbar+radius,"\n")
When I ran this, I got 10.02565 for the estimate of EW, and got an interval of (9.382715, 10.66859). Note that the margin of error is the radius of that interval, about 0.64. We would then say, "We are about 95% confident that the true mean wait time is between 9.38 and 10.67."

What does this really mean? This question is of the utmost importance. We will devote an entire section to it, Section 10.4.

Note that our analysis here is approximate, based on the Central Limit Theorem, which was applicable because W̄ involves a sum. We are making no assumption about the density of the population from which the W_i are drawn. However, if that population density itself is normal, then an exact confidence interval can be constructed. This will be discussed in Section 10.12.
10.4 Meaning of Confidence Intervals

10.4.1 A Weight Survey in Davis

Consider the question of estimating the mean weight, denoted by μ, of all adults in the city of Davis. Say we sample 1000 people at random, and record their weights, with W_i being the weight of the i-th person in our sample.

[Footnote 5: Do you like our statistical pun here? Typically an example like this would concern people's heights, not weights. But it would be nice to use the same letter for random variables as in Section 10.3, i.e. the letter W, so we'll have our example involve people's weights instead of heights. It works out neatly, because the word "weight" has the same sound as "wait."]
Now remember, we don't know the true value of that population mean μ; again, that's why we are collecting the sample data, to estimate μ! Our estimate will be our sample mean, W̄. But we don't know how accurate that estimate might be. That's the reason we form the confidence interval, as a gauge of the accuracy of W̄ as an estimate of μ.

Say our interval (10.25) turns out to be (142.6,158.8). We say that we are about 95% confident that the mean weight μ of all adults in Davis is contained in this interval. What does this mean?

Say we were to perform this experiment many, many times, recording the results in a notebook: We'd sample 1000 people at random, then record our interval (W̄ - 1.96 s/√n, W̄ + 1.96 s/√n) on the first line of the notebook. Then we'd sample another 1000 people at random, and record what interval we got that time on the second line of the notebook. This would be a different set of 1000 people (though possibly with some overlap), so we would get a different value of W̄, and thus a different interval; it would have a different center and a different radius. Then we'd do this a third time, a fourth, a fifth and so on.

Again, each line of the notebook would contain the information for a different random sample of 1000 people. There would be two columns for the interval, one each for the lower and upper bounds. And though it's not immediately important here, note that there would also be columns for W_1 through W_1000, the weights of our 1000 people, and columns for W̄ and s.

Now here is the point: Approximately 95% of all those intervals would contain μ, the mean weight in the entire adult population of Davis. The value of μ would be unknown to us (once again, that's why we'd be sampling 1000 people in the first place), but it does exist, and it would be contained in approximately 95% of the intervals.

As a variation on the notebook idea, think of what would happen if you and 99 friends each do this experiment. Each of you would sample 1000 people and form a confidence interval. Since each of you would get a different sample of people, you would each get a different confidence interval. What we mean when we say the confidence level is 95% is that of the 100 intervals formed, by you and your 99 friends, about 95 of them will contain the true population mean weight. Of course, you hope you yourself will be one of the 95 lucky ones! But remember, you'll never know whose intervals are correct and whose aren't.

Now remember, in practice we only take one sample of 1000 people. Our notebook idea here is merely for the purpose of understanding what we mean when we say that we are about 95% confident that the one interval we form does contain the true value of μ.
10.4.2 One More Point About Interpretation

Some statistics instructors give students the odd warning, "You can't say that the probability is 95% that μ is IN the interval; you can only say that the probability is 95% that the interval CONTAINS μ." This of course is nonsense. As any fool can see, the following two statements are equivalent:

- "μ is in the interval"

- "the interval contains μ"

So it is ridiculous to say that the first is incorrect. Yet many instructors of statistics say so.

Where did this craziness come from? Well, way back in the early days of statistics, some instructor was afraid that a statement like "The probability is 95% that μ is in the interval" would make it sound like μ is a random variable. Granted, that was a legitimate fear, because μ is not a random variable, and without proper warning, some learners of statistics might think incorrectly. The random entity is the interval (both its center and radius), not μ. This is clear in our program above: the 10 is constant, while wbar and s vary from interval to interval.

So, it was reasonable for teachers to warn students not to think of μ as a random variable. But later on, some idiot must have then decided that it is incorrect to say "μ is in the interval," and other idiots then followed suit. They continue to this day, sadly.
10.5 General Formation of Confidence Intervals from Approximately Normal Estimators

Recall that the idea of a confidence interval is really simple: We report our estimate, plus or minus a margin of error. In (10.25),

margin of error = 1.96 × (estimated standard deviation of W̄) = 1.96 s/√n

Remember, W̄ is a random variable. In our Davis people example, each line of the notebook would correspond to a different sample of 1000 people, and thus each line would have a different value for W̄. Thus it makes sense to talk about Var(W̄), and to refer to the square root of that quantity, i.e. the standard deviation of W̄. In (10.16), we found this to be σ/√n, and decided to estimate it by s/√n. The latter is called the standard error of the estimate (or just the standard error, s.e.), meaning the estimate of the standard deviation of the estimate W̄. (The word "estimate" was used twice in the preceding sentence. Make sure to understand the two different settings that they apply to.)
That gives us a general way to form confidence intervals, as long as we use approximately normally distributed estimators:

Definition 28 Suppose θ̂ is a sample-based estimator of a population quantity θ. The sample-based estimate of the standard deviation of θ̂ is called the standard error of θ̂.

[Footnote 6: The quantity θ̂ is pronounced "theta-hat." The "hat" symbol is traditional for "estimate of."]

We can see from (10.25) what to do in general:

Suppose θ̂ is a sample-based estimator of a population quantity θ, and that, due to being composed of sums or some other reason, θ̂ is approximately normally distributed. Then the quantity

\frac{\hat{\theta} - \theta}{s.e.(\hat{\theta})} \qquad (10.26)

has an approximate N(0,1) distribution. [Footnote 7: This also presumes that θ̂ is a consistent estimator of θ, meaning that θ̂ converges to θ as n → ∞.]

That means we can mimic the derivation that led to (10.25), showing that an approximate 95% confidence interval for θ is

\hat{\theta} \pm 1.96 \cdot s.e.(\hat{\theta}) \qquad (10.27)

In other words, the margin of error is 1.96 s.e.(θ̂).

The standard error of the estimate is one of the most commonly-used quantities in statistical applications. You will encounter it frequently in the output of R, for instance, and in the subsequent portions of this book. Make sure you understand what it means and how it is used.
10.6 Confidence Intervals for Proportions

So we know how to find confidence intervals for means. How about proportions?
10.6.1 Derivation

It turns out that we already have our answer, from Section 3.6. We found there that proportions are special cases of means: If Y is an indicator random variable with P(Y = 1) = p, then EY = p.

For example, in an election opinion poll, we might be interested in the proportion p of people in the entire population who plan to vote for candidate A. Each voter has a value of Y, 1 if he/she plans to vote for A, 0 otherwise. Then p is the population mean of Y.

We will estimate p by taking a random sample of n voters, and finding p̂, the sample proportion of voters who plan to vote for A. Let Y_i be the value of Y for the i-th person in our sample. Then

\hat{p} = \bar{Y} \qquad (10.28)

So, in order to get a confidence interval for p from p̂, we can use (10.25)! We have that an approximate 95% confidence interval for p is

\left( \hat{p} - 1.96 s/\sqrt{n},\ \hat{p} + 1.96 s/\sqrt{n} \right) \qquad (10.29)

where as before s² is the sample variance among the Y_i.

But there's more, because we can exploit the fact that in this special case, each Y_i is either 1 or 0. Recalling the convenient form of s², (10.21), we have

s^2 = \frac{1}{n} \sum_{i=1}^{n} Y_i^2 - \bar{Y}^2 \qquad (10.30)

= \frac{1}{n} \sum_{i=1}^{n} Y_i - \bar{Y}^2 \qquad (10.31)

= \bar{Y} - \bar{Y}^2 \qquad (10.32)

= \hat{p} - \hat{p}^2 \qquad (10.33)

Then (10.29) becomes

\left( \hat{p} - 1.96\sqrt{\hat{p}(1-\hat{p})/n},\ \hat{p} + 1.96\sqrt{\hat{p}(1-\hat{p})/n} \right) \qquad (10.34)

And note again that √(p̂(1-p̂)/n) is the standard error of p̂.
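As a quick check of (10.34), here is a hypothetical R sketch, say for a poll in which 620 of 1000 sampled voters favor candidate A (the counts are made up for illustration):

# approximate 95% CI for a population proportion, per (10.34)
phat <- 620/1000
n <- 1000
radius <- 1.96*sqrt(phat*(1-phat)/n)
c(phat-radius,phat+radius)   # roughly 0.62 plus or minus 0.03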
10.6.2 Simulation Example Again

In our bus example above, suppose we also want our simulation to print out the (estimated) probability that one must wait longer than 6.4 minutes. As before, we'd also like a margin of error for the output.

We incorporate (10.34) into our program:
doexpt <- function(opt) {
   lastarrival <- 0.0
   while (lastarrival < opt)
      lastarrival <- lastarrival + rexp(1,0.1)
   return(lastarrival-opt)
}

observationpt <- 240
nreps <- 1000
waits <- vector(length=nreps)
for (rep in 1:nreps) waits[rep] <- doexpt(observationpt)
wbar <- mean(waits)
cat("approx. mean wait =",wbar,"\n")
s2 <- mean(waits^2) - wbar^2
s <- sqrt(s2)
radius <- 1.96*s/sqrt(nreps)
cat("approx. CI for EW =",wbar-radius,"to",wbar+radius,"\n")
# now the estimated probability of waiting more than 6.4 minutes,
# with its own margin of error
prop <- length(waits[waits > 6.4]) / nreps
s2 <- prop*(1-prop)   # the special-case sample variance (10.33)
s <- sqrt(s2)
radius <- 1.96*s/sqrt(nreps)
cat("approx. P(W > 6.4) =",prop,", with a margin of error of",radius,"\n")
When I ran this, the value printed out for p̂ was 0.54, with a margin of error of 0.03, thus an interval of (0.51,0.57). We would say, "We don't know the exact value of P(W > 6.4), so we ran a simulation. The latter estimates this probability to be 0.54, with a 95% margin of error of 0.03."
10.6.3 Examples

Note again that this uses the same principles as our Davis weights example. Suppose we were interested in estimating the proportion of adults in Davis who weigh more than 150 pounds. Suppose that proportion is 0.45 in our sample of 1000 people. This would be our estimate p̂ for the population proportion p, and an approximate 95% confidence interval (10.34) for the population proportion would be (0.42,0.48). We would then say, "We are 95% confident that the true population proportion p of people who weigh over 150 pounds is between 0.42 and 0.48."

Note also that although we've used the word proportion in the Davis weights example instead of probability, they are the same. If I choose an adult at random from the population, the probability that his/her weight is more than 150 is equal to the proportion of adults in the population who have weights of more than 150.

And the same principles are used in opinion polls during presidential elections. Here p is the population proportion of people who plan to vote for the given candidate. This is an unknown quantity, which is exactly the point of polling a sample of people: to estimate that unknown quantity p. Our estimate is p̂, the proportion of people in our sample who plan to vote for the given candidate, and n is the number of people that we poll. We again use (10.34).
10.6.4 Interpretation

The same interpretation holds as before. Consider the examples in the last sections:

- If each of you and 99 friends were to run the R program in Section 10.6.2, you 100 people would get 100 confidence intervals for P(W > 6.4). About 95 of you would have intervals that do contain that number.

- If each of you and 99 friends were to sample 1000 people in Davis and come up with confidence intervals for the true population proportion of people who weigh more than 150 pounds, about 95 of you would have intervals that do contain that true population proportion.

- If each of you and 99 friends were to sample 1200 people in an election campaign, to estimate the true population proportion of people who will vote for candidate X, about 95 of you would have intervals that do contain this population proportion.

Of course, this is just a thought experiment, whose goal is to understand what the term "95% confident" really means. In practice, we have just one sample and thus compute just one interval. But we say that the interval we compute has a 95% chance of containing the population value, since 95% of all intervals will contain it.
10.6.5 (Non-)Effect of the Population Size

Note that in both the Davis and election examples, it doesn't matter what the size of the population is. The approximate distribution of p̂ is N(p, p(1-p)/n), so the accuracy of p̂ depends only on p and n. So when people ask, "How can a presidential election poll get by with sampling only 1200 people, when there are more than 100,000,000 voters in the U.S.?" now you know the answer. (We'll discuss the question "Why 1200?" below.)

Another way to see this is to think of a situation in which we wish to estimate the probability p of heads for a certain coin. We toss the coin n times, and use p̂ as our estimate of p. Here our population, the population of all coin tosses, is infinite, yet it is still the case that 1200 tosses would be enough to get a good estimate of p.
10.6.6 Planning Ahead

Now, why do the pollsters sample 1200 people?

First, note that the maximum possible value of p(1-p) is 0.25. [Footnote 8: Use calculus to find the maximum value of f(x) = x(1-x).] Then the pollsters know that their margin of error with n = 1200 will be at most 1.96 × 0.5/√1200, or about 3%, even before they poll anyone. They consider 3% to be sufficiently accurate for their purposes, so 1200 is the n they choose.
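Here is that worst-case calculation in R; the 0.5 is the value of p that maximizes p(1-p):

# worst-case margin of error for a poll of n people
n <- 1200
1.96*0.5/sqrt(n)   # about 0.028, i.e. roughly 3%

One could also invert the computation, solving 1.96 × 0.5/√n = m for n, to find the sample size needed for a desired margin of error m; e.g. n = (1.96 × 0.5/0.03)², about 1068, for a 3% margin.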
10.7 Confidence Intervals for Differences of Means or Proportions

10.7.1 Independent Samples

Suppose in our sampling of people in Davis we are mainly interested in the difference in weights between men and women. Let X̄ and n_1 denote the sample mean and sample size for men, and let Ȳ and n_2 be the corresponding quantities for the women. Denote the population means and variances by μ_i and σ_i², i = 1,2. We wish to find a confidence interval for μ_1 - μ_2. The natural estimator for that quantity is X̄ - Ȳ.

So, how can we form a confidence interval for μ_1 - μ_2 using X̄ - Ȳ? Since the latter quantity is composed of sums, we can use (10.27). Here:

- θ is μ_1 - μ_2

- θ̂ is X̄ - Ȳ

So, we need to find the standard error of X̄ - Ȳ.

Let's find the standard deviation of X̄ - Ȳ, and then estimate it from the data. We have

\text{std.dev.}(\bar{X} - \bar{Y}) = \sqrt{Var(\bar{X} - \bar{Y})} \quad \text{(def.)} \qquad (10.35)

= \sqrt{Var[\bar{X} + (-1)\bar{Y}]} \quad \text{(algebra)} \qquad (10.36)

= \sqrt{Var(\bar{X}) + Var[(-1)\bar{Y}]} \quad \text{(indep.)} \qquad (10.37)

= \sqrt{Var(\bar{X}) + Var(\bar{Y})} \quad \text{(3.33)} \qquad (10.38)

= \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \quad \text{(10.16)} \qquad (10.39)
Note that we used the fact that X̄ and Ȳ are independent, as they come from separate people. Replacing the σ_i² values by their sample estimates,

s_1^2 = \frac{1}{n_1} \sum_{i=1}^{n_1} (X_i - \bar{X})^2 \qquad (10.40)

and

s_2^2 = \frac{1}{n_2} \sum_{i=1}^{n_2} (Y_i - \bar{Y})^2 \qquad (10.41)

we finally have

s.e.(\bar{X} - \bar{Y}) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \qquad (10.42)
Thus (10.27) tells us that an approximate 95% confidence interval for μ_1 - μ_2 is

\left( \bar{X} - \bar{Y} - 1.96\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}},\ \bar{X} - \bar{Y} + 1.96\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \right) \qquad (10.43)
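Here is a small R sketch of (10.43); the two samples x and y are simulated stand-ins for the men's and women's weights:

# approximate 95% CI for the difference of two population means, per (10.43)
x <- rnorm(500,mean=170,sd=20)   # hypothetical men's weights
y <- rnorm(500,mean=140,sd=18)   # hypothetical women's weights
n1 <- length(x); n2 <- length(y)
dif <- mean(x) - mean(y)
se <- sqrt(var(x)/n1 + var(y)/n2)   # (10.42); var() divides by n-1, a negligible difference here
c(dif-1.96*se,dif+1.96*se)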
What about confidence intervals for the difference in two population proportions p_1 - p_2? Recalling that in Section 10.6 we noted that proportions are special cases of means, we see that finding a confidence interval for the difference in two proportions is covered by (10.43). Here

- X̄ reduces to p̂_1

- Ȳ reduces to p̂_2

- s_1² reduces to p̂_1(1 - p̂_1)

- s_2² reduces to p̂_2(1 - p̂_2)

So, (10.43) reduces to

\left( \hat{p}_1 - \hat{p}_2 - 1.96\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}},\ \hat{p}_1 - \hat{p}_2 + 1.96\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \right) \qquad (10.44)
10.7.2 Example: Network Security Application

In a network security application, C. Mano et al [Footnote 9: RIPPS: Rogue Identifying Packet Payload Slicer Detecting Unauthorized Wireless Hosts Through Network Traffic Conditioning, C. Mano and a ton of other authors, ACM Transactions on Information Systems and Security, May 2007.] compare round-trip travel time for packets involved in the same application in certain wired and wireless networks. The data was as follows:

sample     sample mean   sample s.d.   sample size
wired      2.000         6.299         436
wireless   11.520        9.939         344

We had observed quite a difference, 11.52 versus 2.00, but could it be due to sampling variation? Maybe we have unusual samples? This calls for a confidence interval!

Then a 95% confidence interval for the difference between wireless and wired networks is

11.520 - 2.000 \pm 1.96\sqrt{\frac{9.939^2}{344} + \frac{6.299^2}{436}} = 9.52 \pm 1.22 \qquad (10.45)

So you can see that there is a big difference between the two networks, even after allowing for sampling variation.
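As a check, (10.45) is easy to compute directly in R from the summary statistics in the table:

# reproducing (10.45)
se <- sqrt(9.939^2/344 + 6.299^2/436)
c(11.520 - 2.000 - 1.96*se, 11.520 - 2.000 + 1.96*se)   # about 9.52 plus or minus 1.22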
10.7.3 Dependent Samples

Note carefully, though, that a key point above was the independence of the two samples. By contrast, suppose we wish, for instance, to find a confidence interval for μ_1 - μ_2, the difference in mean heights in Davis of 15-year-old and 10-year-old children, and suppose our data consist of pairs of height measurements at the two ages on the same children. In other words, we have a sample of n children, and for the i-th child we have his/her height U_i at age 15 and V_i at age 10. Let Ū and V̄ denote the sample means.
The problem is that the two sample means are not independent. If a child is taller than his/her peers at age 15, he/she was probably taller than them when they were all age 10. In other words, for each i, V_i and U_i are positively correlated, and thus the same is true for V̄ and Ū. Thus we cannot use (10.43).

As always, it is instructive to consider this in "notebook" terms. Suppose on one particular sample at age 10 (one line of the notebook) we just happen to have a lot of big kids. Then V̄ is large. Well, if we look at the same kids later at age 15, they're liable to be bigger than the average 15-year-old too. In other words, among the notebook lines in which V̄ is large, many of them will have Ū large too.

Since Ū is approximately normally distributed with mean μ_1, about half of the notebook lines will have Ū > μ_1. Similarly, about half of the notebook lines will have V̄ > μ_2. But the nonindependence will be reflected in MORE than one-fourth of the lines having both Ū > μ_1 and V̄ > μ_2. (If the two sample means were 100% correlated, that fraction would be 1.0.)

Contrast that with a sampling scheme in which we sample some 10-year-olds and some 15-year-olds, say at the same time. Now there are different kids in each of the two samples. So, if by happenstance we get some big kids in the first sample, that has no impact on which kids we get in the second sample. In other words, V̄ and Ū will be independent. In this case, one-fourth of the lines will have both Ū > μ_1 and V̄ > μ_2.

So, we cannot get a confidence interval for μ_1 - μ_2 from (10.43), since the latter assumes that the two sample means are independent. What to do?
The key to the resolution of this problem is that the random variables T_i = V_i - U_i, i = 1,2,...,n, are still independent. Thus we can use (10.25) on these values, so that our approximate 95% confidence interval is

\left( \bar{T} - 1.96\frac{s}{\sqrt{n}},\ \bar{T} + 1.96\frac{s}{\sqrt{n}} \right) \qquad (10.46)

where T̄ and s² are the sample mean and sample variance of the T_i.
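In R, then, the dependent-samples interval is just the one-sample interval applied to the differences. A sketch, with hypothetical paired height data u (age 15) and v (age 10), made up for illustration:

# paired-sample approximate 95% CI, per (10.46)
v <- c(140,138,150,143,146)   # hypothetical heights at age 10
u <- c(168,165,180,170,173)   # the same children at age 15
tdiff <- v - u                # the differences T_i
n <- length(tdiff)
tbar <- mean(tdiff)
s <- sqrt(mean(tdiff^2) - tbar^2)
c(tbar-1.96*s/sqrt(n),tbar+1.96*s/sqrt(n))

(With a sample this tiny the normal approximation is of course poor; the point is only the mechanics.)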
A common situation in which we have dependent samples is that in which we are comparing two dependent proportions. Suppose for example that there are three candidates running for a political office, A, B and C. We poll 1,000 voters and ask whom they plan to vote for. Let p_A, p_B and p_C be the three population proportions of people planning to vote for the various candidates, and let p̂_A, p̂_B and p̂_C be the corresponding sample proportions.
Suppose we wish to form a confidence interval for p_A - p_B. Clearly, the two sample proportions are not independent random variables, since for instance if p̂_A = 1 then we know for sure that p̂_B is 0. Or to put it another way, define the indicator variables U_i and V_i as above, with for example U_i being 1 or 0, according to whether the i-th person in our sample plans to vote for A or not, with V_i being defined similarly for B. Since U_i and V_i are measurements on the same person, they are not independent, and thus p̂_A and p̂_B are not independent either.

Note by the way that while the two sample means in our kids' height example above were positively correlated, in this voter poll example the two sample proportions are negatively correlated.
So, we cannot form a confidence interval for p_A - p_B by using (10.44). What can we do instead? We'll use the fact that the vector (N_A, N_B, N_C)' has a multinomial distribution, where N_A, N_B and N_C denote the numbers of people in our sample who state they will vote for the various candidates (so that for instance p̂_A = N_A/1000).

Now to compute Var(p̂_A - p̂_B), we make use of (7.10):

Var(\hat{p}_A - \hat{p}_B) = Var(\hat{p}_A) + Var(\hat{p}_B) - 2\,Cov(\hat{p}_A, \hat{p}_B) \qquad (10.47)

Or, we could have taken a matrix approach, using (7.50) with A equal to the row vector (1,-1,0).

So, using (8.113), the standard error of p̂_A - p̂_B is

\sqrt{0.001\,\hat{p}_A(1-\hat{p}_A) + 0.001\,\hat{p}_B(1-\hat{p}_B) + 0.002\,\hat{p}_A\,\hat{p}_B} \qquad (10.48)
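Here is (10.48) as a short R sketch, with hypothetical poll counts invented for illustration:

# s.e. of phat_A - phat_B for multinomially dependent proportions, per (10.48)
nA <- 470; nB <- 440   # hypothetical counts among n = 1000 voters
n <- 1000
pA <- nA/n; pB <- nB/n
se <- sqrt(pA*(1-pA)/n + pB*(1-pB)/n + 2*pA*pB/n)
c(pA - pB - 1.96*se, pA - pB + 1.96*se)   # approximate 95% CI for p_A - p_B

Note that 1/n = 0.001 and 2/n = 0.002 here, matching the coefficients in (10.48).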
10.7.4 Example: Machine Classification of Forest Covers

Remote sensing is machine classification of type from variables observed aerially, typically by satellite. The application we'll consider here involves forest cover type for a given location; there are seven different types. (See Blackard, Jock A. and Denis J. Dean, 2000, "Comparative Accuracies of Artificial Neural Networks and Discriminant Analysis in Predicting Forest Cover Types from Cartographic Variables," Computers and Electronics in Agriculture, 24(3):131-151.) Direct observation of the cover type is either too expensive or may suffer from land access permission issues. So, we wish to guess cover type from other variables that we can more easily obtain.

One of the variables was the amount of hillside shade at noon, which we'll call HS12. Here's our goal: Let μ_1 and μ_2 be the population mean HS12 among sites having cover types 1 and 2, respectively. If μ_1 - μ_2 is large, then HS12 would be a good predictor of whether the cover type is 1 or 2.
So, we wish to estimate μ_1 - μ_2 from our data, in which we do know cover type. There were over 50,000 observations, but for simplicity we'll just use the first 1,000 here. Let's find an approximate 95% confidence interval for μ_1 - μ_2. The two sample means were 223.8 and 226.3, with s values of 15.3 and 14.3, and the sample sizes were 226 and 585.

Using (10.43), we have that the interval is

223.8 - 226.3 \pm 1.96\sqrt{\frac{15.3^2}{226} + \frac{14.3^2}{585}} = -2.5 \pm 2.3 = (-4.8, -0.3) \qquad (10.49)

Given that HS12 values are in the 200 range (see the sample means), this difference between them actually is not very large. This is a great illustration of an important principle, one that will turn out to be central in Section 11.8.
As another illustration of confidence intervals, let's find one for the difference in population proportions of sites that have cover types 1 and 2. Our sample estimate is

\hat{p}_1 - \hat{p}_2 = 0.226 - 0.585 = -0.359 \qquad (10.50)

The standard error of this quantity, from (10.48), is

\sqrt{0.001(0.226)(0.774) + 0.001(0.585)(0.415) + 0.002(0.226)(0.585)} = 0.019 \qquad (10.51)

That gives us a confidence interval of

-0.359 \pm 1.96 \cdot 0.019 = (-0.397, -0.321) \qquad (10.52)
10.8 R Computation

The R function t.test() forms confidence intervals for a single mean or for the difference of two means. In the latter case, the two samples must be independent; otherwise, do the single-mean CI on the differences, as in Section 10.7.3.

This function uses the Student-t distribution, rather than the normal, but as discussed in Section 10.12, the difference is negligible except in small samples.

Thus you can conveniently use t.test() to form a confidence interval for a single mean, instead of computing (10.25) yourself (or writing the R code yourself).
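For instance, continuing the bus simulation, a call like the following (reusing the waits vector from Section 10.3.2) prints an interval nearly identical to the one we computed by hand; with nreps that large, the t-versus-normal distinction is immaterial:

> t.test(waits)$conf.int   # approximate 95% CI for EW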
It's slightly more complicated in the case of forming a confidence interval for the difference of two means. The t.test() function will do that for you too, making the assumption that σ_1² = σ_2² in Section 10.7.1 if you set its argument var.equal=TRUE. Unless you believe there is a huge difference between the two population variances, this assumption is not a bad one.
10.9 Example: Amazon Links

This example involves the Amazon product co-purchasing network, March 2, 2003. The data set is large but simple. It stores a directed graph of what links to what: If a record shows i then j, it means that i is often co-purchased with j (though not necessarily vice versa). Let's find a confidence interval for the mean number of inlinks, i.e. links into a node.

Actually, even the R manipulations are not so trivial, so here is the complete code (the data are at http://snap.stanford.edu/data/amazon0302.html):

mzn <- read.table("amazon0302.txt",header=F)
# cut down the data set for convenience
mzn1000 <- mzn[mzn[,1] <= 1000 & mzn[,2] <= 1000,]
# make an R list, one element per value of j
degrees1000 <- split(mzn1000,mzn1000[,2])
# by finding the number of rows in each matrix, we get the numbers of
# inlinks
indegrees1000 <- sapply(degrees1000,nrow)

Now run t.test():

> t.test(indegrees1000)

        One Sample t-test

data:  indegrees1000
t = 35.0279, df = 1000, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 3.728759 4.171340
sample estimates:
mean of x
  3.95005
10.10 The Multivariate Case

In the last few sections, the standard error has been key to finding confidence intervals for univariate quantities. Recalling that the standard error, squared, is the estimated variance of our estimator, one might guess that the analogous quantity in the multivariate case is the estimated covariance matrix of the estimator. This turns out to be correct.

10.10.1 Sample Mean and Sample Covariance Matrix

Say W is an r-element random vector, and we have a random sample W_1, ..., W_n from the distribution of W.

In analogy with (10.20), we have

\widehat{Cov}(W) = \frac{1}{n} \sum_{i=1}^{n} (W_i - \bar{W})(W_i - \bar{W})' \qquad (10.53)

where

\bar{W} = \frac{1}{n} \sum_{i=1}^{n} W_i \qquad (10.54)

Note that (10.53) and (10.54) are of size r x r and r x 1, respectively, and are estimators of Cov(W) and EW.

Note too that (10.54) is a sum, thus reminding us of the Central Limit Theorem. In this case it's the Multivariate Central Limit Theorem, which implies that W̄ has an approximate multivariate normal distribution. If you didn't read that chapter, the key content is the following:

Let

c = (c_1, ..., c_r)' \qquad (10.55)

denote any constant (i.e. nonrandom) r-element vector. Then the quantity

c'\bar{W} \qquad (10.56)

has an approximate normal distribution with mean c'EW and variance

c'\,Cov(\bar{W})\,c = \frac{1}{n}\, c'\,Cov(W)\,c \qquad (10.57)

An approximate 95% confidence interval for c_1 EW_1 + ... + c_r EW_r is then

c'\bar{W} \pm 1.96 \sqrt{\frac{1}{n}\, c'\,\widehat{Cov}(W)\,c} \qquad (10.58)

where the estimated covariance matrix is given in (10.53).
More generally, here is the extension of the material in Section 10.5:

Suppose we are estimating some r-component vector θ, using an approximately r-variate normal estimator θ̂. Let C denote the estimated covariance matrix of θ̂. Then an approximate 95% confidence interval for c'θ is

c'\hat{\theta} \pm 1.96 \sqrt{c'\,C\,c} \qquad (10.59)
10.10.2 Growth Rate Example

Suppose we are studying children's growth patterns, and have data on heights at ages 6, 10 and 18, denoted (X,Y,Z) = W. We're interested in the growths between 6 and 10, and between 10 and 18, denoted by G_1 and G_2, respectively. Say we wish to form a confidence interval for EG_2 - EG_1, based on a random sample W_i = (X_i, Y_i, Z_i), i = 1, ..., n.

This fits right into the context of the previous section. We're interested in

(EZ - EY) - (EY - EX) = EZ - 2EY + EX \qquad (10.60)

So, we can set c = (1,-2,1) in (10.55), and then use (10.58).
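Here is a hypothetical R sketch of that computation, assuming the heights sit in an n x 3 matrix w (a made-up name), one row per child, with columns for ages 6, 10 and 18:

# approximate 95% CI for EZ - 2EY + EX, per (10.58)
cvec <- c(1,-2,1)
n <- nrow(w)
wbar <- colMeans(w)                # the vector W-bar, (10.54)
covw <- cov(w) * (n-1)/n           # (10.53); cov() divides by n-1, so rescale
est <- sum(cvec*wbar)              # c'W-bar
se <- sqrt(as.numeric(t(cvec) %*% covw %*% cvec) / n)   # the 1/n gives Cov(W-bar)
c(est-1.96*se,est+1.96*se)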
10.11 Advanced Topics in Confidence Intervals

10.12 And What About the Student-t Distribution?

"Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise." (John Tukey, pioneering statistician at Bell Labs)
Another thing we are not doing here is to use the Student t-distribution. That is the name of the distribution of the quantity

T = \frac{\bar{W} - \mu}{s/\sqrt{n}} \qquad (10.61)

where s² is the version of the sample variance in which we divide by n-1, instead of by n as in (10.21).

Note carefully that we are assuming that the W_i themselves, not just W̄, have a normal distribution. The exact distribution of T is called the Student t-distribution with n-1 degrees of freedom. These distributions thus form a one-parameter family, with the degrees of freedom being the parameter.

This distribution has been tabulated. In R, for instance, the functions dt(), pt() and so on play the same roles as dnorm(), pnorm() etc. do for the normal family. The call qt(0.975,9) returns 2.26. This enables us to get an interval for μ from a sample of size 10, at EXACTLY a 95% confidence level, rather than at an APPROXIMATE 95% level as we have had here, as follows.

We start with (10.22), replacing 1.96 by 2.26, (W̄ - μ)/(σ/√n) by T, and ≈ by =. Doing the same algebra, we find the following confidence interval for μ:

\left( \bar{W} - 2.26\frac{s}{\sqrt{10}},\ \bar{W} + 2.26\frac{s}{\sqrt{10}} \right) \qquad (10.62)

Of course, for general n, replace 2.26 by t_{0.975,n-1}, the 0.975 quantile of the t-distribution with n-1 degrees of freedom. The distribution is tabulated by the R functions dt(), pt() and so on.
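The shrinking difference between the two families is easy to see in R:

> qt(0.975,9)    # sample size 10
[1] 2.262157
> qt(0.975,99)   # sample size 100
[1] 1.984217
> qnorm(0.975)
[1] 1.959964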
I do not use the t-distribution here because:

- It depends on the parent population having an exact normal distribution, which is never really true. In the Davis case, for instance, people's weights are approximately normally distributed, but definitely not exactly so. For that to be exactly the case, some people would have to have weights of, say, a billion pounds, or negative weights, since any normal distribution takes on all values from -∞ to ∞.

- For large n, the difference between the t-distribution and N(0,1) is negligible anyway.
10.13 Other Confidence Levels

We have been using 95% as our confidence level. This is common, but of course not unique. We can for instance use 90%, which gives us a narrower interval (in (10.25), we multiply by 1.65 instead of by 1.96, which the reader should check), at the expense of lower confidence.

A confidence interval's error rate is usually denoted by α, with the confidence level being 1 - α; so a 95% confidence level has α = 0.05.
10.14 Real Populations and Conceptual Populations

In our example in Section 10.4.1, we were sampling from a real population. However, in many, probably most, applications of statistics, either the population or the sampling is more conceptual.

Consider an experiment we will discuss in Section 15.2, in which we compare the programmability of three scripting languages. (You need not read ahead.) We divide our programmers into three groups, and assign each group to program in one of the languages. We then compare how long it took the three groups to finish writing and debugging the code, and so on.

We think of our programmers as being a random sample from the population of all programmers, but that is probably an idealization. We probably did NOT choose our programmers randomly; we just used whoever we had available. But we can think of them as a random sample from the rather conceptual population of all programmers who might work at this company. [Footnote 10: You're probably wondering why we haven't discussed other factors, such as differing levels of experience among the programmers. This will be dealt with in our unit on regression analysis, Chapter 15.]

You can see from this that if one chooses to apply statistics carefully (which you absolutely should do), there sometimes are some knotty problems of interpretation to think about.
10.15 One More Time: Why Do We Use Confidence Intervals?

After all the variations on a theme in the very long Section 10.2, it is easy to lose sight of the goal, so let's review:

Almost everyone is familiar with the term "margin of error," given in every TV news report during elections. The report will say something like, "In our poll, 62% stated that they plan to vote for Ms. X. The margin of error is 3%." Those two numbers, 62% and 3%, form the essence of confidence intervals:

- The 62% figure is our estimate of p, the true population fraction of people who plan to vote for Ms. X.

- Recognizing that that 62% figure is only a sample estimate of p, we wish to have a measure of how accurate the figure is, i.e. our margin of error. Though the poll reports don't say this, what they are actually saying is that we are 95% sure that the true population value p is in the range 0.62 ± 0.03.

So, a confidence interval is nothing more than the concept of the "a ± b" range that we are so familiar with.
Exercises

1. Consider Equation (10.24). In each of the entries in the table below, fill in either R for random, or NR for nonrandom:

quantity   R or NR?
W̄
s
μ
n

2. Consider p̂, the estimator of a population proportion p, based on a sample of size n. Give the expression for the standard error of p̂.

3. Suppose we take a simple random sample of size 2 from a population consisting of just three values, 66, 67 and 69. Let X̄ denote the resulting sample mean. Find p_{X̄}(67.5).

4. Suppose we have a random sample W_1, ..., W_n, and we wish to estimate the population mean μ, as usual. But we decide to place double weight on W_1, so our estimator for μ is

U = \frac{2W_1 + W_2 + ... + W_n}{n+1} \qquad (10.63)

Find E(U) and Var(U) in terms of μ and the population variance σ².

5. Suppose a random sample of size n is drawn from a population in which, unknown to the analyst, X actually has an exponential distribution with mean 10. Suppose the analyst forms an approximate 95% confidence interval for the mean, using (10.24). Use R simulation to find the true confidence level, for n = 10, 25, 100 and 500.

6. Suppose we draw a sample of size 2 from a population in which X has the values 10, 15 and 12. Find p_{X̄}, first assuming sampling with replacement, then assuming sampling without replacement.

7. We ask 100 randomly sampled programmers whether C++ is their favorite language, and 12 answer yes. Give a numerical expression for an approximate 95% confidence interval for the population fraction of programmers who have C++ as their favorite language.

8. In Equation (10.25), suppose 1.96 is replaced by 1.88 in both instances. Then of course the confidence level will be smaller than 95%. Give a call to an R function (not a simulation), that will find the new confidence level.

9. Candidates A, B and C are vying for election. Let p_1, p_2 and p_3 denote the fractions of people planning to vote for them. We poll n people at random, yielding estimates p̂_1, p̂_2 and p̂_3. Candidate B claims that she has more supporters than the other two candidates combined. Give a formula for an approximate 95% confidence interval for p_2 - (p_1 + p_3).

10. Suppose Jack and Jill each collect random samples of size n from a population having unknown mean μ but KNOWN variance σ². They each form an approximate 95% confidence interval for μ, using (10.25) but with s replaced by σ. Find the approximate probability that their intervals do not overlap. Express your answer in terms of Φ, the cdf of the N(0,1) distribution.

11. In the example of the population of three people, page 220, find the following:

(a) p_{X_1}(70)

(b) p_{X_1,X_2}(69, 70)

(c) F_{X̄}(69.5)

(d) the probability that X̄ overestimates the population mean μ

(e) p_{X̄}(69) if our sample size is three rather than two (remember, we are sampling with replacement)

12. In the derivation (10.11), suppose instead we have a simple random sample. Which one of the following statements is correct?

(a) E(X̄) will still be equal to μ.

(b) E(X̄) will not exist.

(c) E(X̄) will exist, but may be less than μ.

(d) E(X̄) will exist, but may be greater than μ.

(e) None of the above is necessarily true.

13. Consider a toy example in which we take a random sample of size 2 (done with replacement) from a population of size 2. The two values in the population (say heights in some measure system) are 40 and 60. Find p_{s²}(100).
Chapter 11

Introduction to Significance Tests

Suppose (just for fun, but with the same pattern as in more serious examples) you have a coin that will be flipped at the Super Bowl to see who gets the first kickoff. (We'll assume slightly different rules here. The coin is not "called." Instead, it is agreed beforehand that if the coin comes up heads, Team A will get the kickoff, and otherwise it will be Team B.) You want to assess the coin for fairness. Let p be the probability of heads for the coin.

You could toss the coin, say, 100 times, and then form a confidence interval for p using (10.34). The width of the interval would tell you the margin of error, i.e. it tells you whether 100 tosses were enough for the accuracy you want, and the location of the interval would tell you whether the coin is fair enough.

For instance, if your interval were (0.49,0.54), you might feel satisfied that this coin is reasonably fair. In fact, note carefully that even if the interval were, say, (0.502,0.506), you would still consider the coin to be reasonably fair; the fact that the interval did not contain 0.5 is irrelevant, as the entire interval would be reasonably near 0.5.

However, this process would not be the way it's traditionally done. Most users of statistics would use the toss data to test the null hypothesis

H_0: p = 0.5 \qquad (11.1)

against the alternate hypothesis

H_A: p ≠ 0.5 \qquad (11.2)

For reasons that will be explained below, this procedure is called significance testing. It forms the very core of statistical inference as practiced today. This, however, is unfortunate, as there are some serious problems that have been recognized with this procedure. We will first discuss the mechanics of the procedure, and then look closely at the problems with it in Section 11.8.
11.1 The Basics

Here's how significance testing works.

The approach is to consider H_0 "innocent until proven guilty," meaning that we assume H_0 is true unless the data give strong evidence to the contrary. KEEP THIS IN MIND: we are continually asking, "What if...?"

The basic plan of attack is this:

We will toss the coin n times. Then we will believe that the coin is fair unless the number of heads is "suspiciously" extreme, i.e. much less than n/2 or much more than n/2.

Let p denote the true probability of heads for our coin. As in Section 10.6.1, let p̂ denote the proportion of heads in our sample of n tosses. We observed in that section that p̂ is a special case of a sample mean (it's a mean of 1s and 0s). We also found that the standard deviation of p̂ is √(p(1-p)/n). [Footnote 1: This is the exact standard deviation. The estimated standard deviation is √(p̂(1-p̂)/n).] In other words,

\frac{\hat{p} - p}{\sqrt{\frac{1}{n}\,p(1-p)}} \qquad (11.3)

has an approximate N(0,1) distribution.

But remember, we are going to assume H_0 for now, until and unless we find strong evidence to the contrary. Thus we are assuming, for now, that the test statistic

Z = \frac{\hat{p} - 0.5}{\sqrt{\frac{1}{n}\,(0.5)(1-0.5)}} \qquad (11.4)

has an approximate N(0,1) distribution.
Now recall from the derivation of (10.25) that -1.96 and 1.96 are the lower- and upper-2.5% points of the N(0,1) distribution. Thus,

P(Z < -1.96 \text{ or } Z > 1.96) \approx 0.05 \qquad (11.5)

Now here is the point: After we collect our data, in this case by tossing the coin n times, we compute p̂ from that data, and then compute Z from (11.4). If Z is smaller than -1.96 or larger than 1.96, we reason as follows:

"Hmmm, Z would stray that far from 0 only 5% of the time. So, either I have to believe that a rare event has occurred, or I must abandon my assumption that H_0 is true."

For instance, say n = 100 and we get 62 heads in our sample. That gives us Z = 2.4, in that "rare" range. We then reject H_0, and announce to the world that this is an unfair coin. We say, "The value of p̂ is significantly different from 0.5."

The 5% "suspicion criterion" used above is called the significance level, typically denoted α. One common statement is "We rejected H_0 at the 5% level."
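In R, the computation for this example is just:

# test statistic (11.4) for 62 heads in n = 100 tosses
phat <- 62/100
n <- 100
z <- (phat - 0.5) / sqrt(0.25/n)
z   # 2.4, beyond the 1.96 cutoff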
On the other hand, suppose we get 47 heads in our sample. Then Z = -0.60. Again, taking 5% as our significance level, this value of Z would not be deemed suspicious, as it occurs frequently. We would then say "We accept H_0 at the 5% level," or "We find that p is not significantly different from 0.5."

The word "significant" is misleading. It should NOT be confused with "important." It simply is saying that we don't believe the observed value of Z is a rare event (which it would be under H_0); we have instead decided to abandon our belief that H_0 is true.
11.2 General Testing Based on Normally Distributed Estimators

In Section 10.5, we developed a method of constructing confidence intervals for general approximately normally distributed estimators. Now we do the same for significance testing.

Suppose θ̂ is an approximately normally distributed estimator of some population value θ. Then to test H_0: θ = c, form the test statistic

Z = \frac{\hat{\theta} - c}{s.e.(\hat{\theta})} \qquad (11.6)

where s.e.(θ̂) is the standard error of θ̂ [Footnote 2: See Section 10.5. Or, if we know the exact standard deviation of θ̂ under H_0, which was the case in our coin example above, we could use that, for a better normal approximation.], and proceed as before:

Reject H_0: θ = c at the significance level of α = 0.05 if |Z| ≥ 1.96.
11.3 Example: Network Security

Let's look at the network security example in Section 10.7.1 again. Here θ̂ = X̄ - Ȳ, and c is presumably 0 (depending on the goals of Mano et al). From (10.42), the standard error works out to 0.61. So, our test statistic (11.6) is

Z = \frac{\bar{X} - \bar{Y} - 0}{0.61} = \frac{11.52 - 2.00}{0.61} = 15.61 \qquad (11.7)

This is definitely larger in absolute value than 1.96, so we reject H_0, and conclude that the population mean round-trip times are different in the wired and wireless cases.
11.4 The Notion of "p-Values"

Recall the coin example in Section 11.1, in which we got 62 heads, i.e. Z = 2.4. Since 2.4 is considerably larger than our cutoff for rejection, 1.96, we might say that in some sense we not only rejected H_0, we actually strongly rejected it.

To quantify that notion, we compute something called the observed significance level, more often called the p-value.

We ask, "We rejected H_0 at the 5% level. Clearly, we would have rejected it even at some smaller, thus more stringent, levels. What is the smallest such level?"

By checking a table of the N(0,1) distribution, or by computing 1 - pnorm(2.40) in R, we would find that the N(0,1) distribution has area 0.008 to the right of 2.40, and of course by symmetry there is an equal area to the left of -2.40. That's a total area of 0.016. In other words, we would have been able to reject H_0 even at the much more stringent significance level of 0.016 (the 1.6% level) instead of 0.05. So, Z = 2.40 would be considered even more significant than Z = 1.96. In the research community it is customary to say, "The p-value was 0.016." [Footnote 3: The "p" in "p-value" of course stands for "probability," meaning the probability that a N(0,1) random variable would stray as far, or further, from 0 as our observed Z here. By the way, be careful not to confuse this with the quantity p in our coin example, the probability of heads.] The smaller the p-value, the more significant the results are considered.
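In R, the two-sided p-value for this example is

> 2*(1 - pnorm(2.40))   # area in both tails beyond 2.40
[1] 0.01639507

matching the 0.016 above.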
In our network security example above, in which Z was 15.61, the value is literally "off the chart"; pnorm(15.61) returns a value of 1. Of course, it's a tiny bit less than 1, but it is so far out in the right tail of the N(0,1) distribution that the area to the right is essentially 0. So the p-value would be essentially 0, and the result would be treated as very, very highly significant.

It is customary to denote small p-values by asterisks. This is generally one asterisk for p under 0.05, two for p less than 0.01, three for 0.001, etc. The more asterisks, the more significant the data is supposed to be.
11.5 R Computation

The R function t.test(), discussed in Section 10.8, does both confidence intervals and tests, including p-values in the latter case.
11.6 One-Sided H_A

Suppose that, somehow, we are sure that our coin in the example above is either fair or is more heavily weighted towards heads. Then we would take our alternate hypothesis to be

H_A: p > 0.5 \qquad (11.8)

A "rare event" which could make us abandon our belief in H_0 would now be if Z in (11.4) is very large in the positive direction. So, with α = 0.05, we call qnorm(0.95), and find that our rule would now be to reject H_0 if Z > 1.65.

One-sided tests are not common, as their assumptions are often difficult to justify.
11.7 Exact Tests

Remember, the tests we've seen so far are all approximate. In (11.4), for instance, p̂ had an approximate normal distribution, so that the distribution of Z was approximately N(0,1). Thus the significance level α was approximate, as were the p-values and so on. [Footnote 4: Another class of probabilities which would be approximate would be the power values. These are the probabilities of rejecting H_0 if the latter is not true. We would speak, for instance, of the power of our test at p = 0.55, meaning the chances that we would reject the null hypothesis if the true population value of p were 0.55.]

But the only reason our tests were approximate is that we only had the approximate distribution of our test statistic Z, or equivalently, we only had the approximate distribution of our estimator, e.g. p̂. If we have an exact distribution to work with, then we can perform an exact test.
11.7.1 Example: Test for Biased Coin

Let's consider the coin example again, with the one-sided alternative (11.8). To keep things simple, let's suppose we toss the coin 10 times. We will make our decision based on X, the number of heads out of 10 tosses. Suppose we set our threshold for strong evidence against H_0 to be 8 heads, i.e. we will reject H_0 if X ≥ 8. What will α be?

\alpha = \sum_{i=8}^{10} P(X = i) = \sum_{i=8}^{10} \binom{10}{i} \left(\frac{1}{2}\right)^{10} = 0.055 \qquad (11.9)

That's not the usual 0.05. Clearly we cannot get an exact significance level of 0.05 [Footnote 5: Actually, it could be done by introducing some randomization to our test.], but our α is exactly 0.055, so this is an exact test.

So, we will believe that this coin is perfectly balanced, unless we get eight or more heads in our 10 tosses. The latter event would be very unlikely (probability only 5.5%) if H_0 were true, so we decide not to believe that H_0 is true.
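The α computation (11.9) is a one-liner in R:

> sum(dbinom(8:10,10,0.5))   # P(X = 8) + P(X = 9) + P(X = 10) under H0
[1] 0.0546875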
11.7.1.1 Example: Light Bulbs

Suppose lifetimes of lightbulbs are exponentially distributed with mean μ. In the past, μ = 1000, but there is a claim that the new light bulbs are improved and μ > 1000. To test that claim, we will sample 10 lightbulbs, getting lifetimes X_1, ..., X_10, and compute the sample mean X̄. We will then perform a significance test of

H_0: \mu = 1000 \qquad (11.10)

vs.

H_A: \mu > 1000 \qquad (11.11)

It is natural to have our test take the form in which we reject H_0 if

\bar{X} > w \qquad (11.12)

for some constant w chosen so that

P(\bar{X} > w) = 0.05 \qquad (11.13)

under H_0. Suppose we want an exact test, not one based on a normal approximation.

Recall that 10X̄, the sum of the X_i, has a gamma distribution, with r = 10 and λ = 0.001. So, we can find the w for which P(10X̄ > 10w) = 0.05 by using R's qgamma():

> qgamma(0.95,10,0.001)
[1] 15705.22

So, we reject H_0 if our sample mean is larger than 1570.5.
11.7.2 Example: Test Based on Range Data

Suppose lifetimes of some electronic component formerly had an exponential distribution with mean 100.0. However, it's claimed that now the mean has increased. (Suppose we are somehow sure it has not decreased.) Someone has tested 50 of these new components, and has recorded their lifetimes, X_1, ..., X_50. Unfortunately, they only reported to us the range of the data, R = max_i X_i - min_i X_i, not the individual X_i. We will need to do a significance test with this limited data, at the 0.05 level.

Recall that the variance of an exponential random variable is the square of its mean. Intuitively, then, the larger this population mean of X, the larger the mean of the range R. In other words, the form of the test should be to reject H_0 if R is greater than some cutoff value c. So, we need to find the value of c that makes α equal to 0.05.

Unfortunately, we can't do this analytically, i.e. mathematically, as the distribution of R is far too complex. Thus we'll have to resort to simulation. [Footnote 6: I am still referring to the following as an exact test, as we are not using any statistical approximation, such as the Central Limit Theorem.] Here is code to do that:

# code to determine the cutoff point for significance
# at the 0.05 level

nreps <- 200000
n <- 50

rvec <- vector(length=nreps)
for (i in 1:nreps) {
   x <- rexp(n,0.01)    # sample of size 50, mean 100, under H0
   rng <- range(x)
   rvec[i] <- rng[2] - rng[1]
}

rvec <- sort(rvec)
cutoff <- rvec[ceiling(0.95*nreps)]   # the 95th percentile of the simulated R values
cat("reject H0 if R >",cutoff,"\n")
Here we generate nreps samples of size 50 from an exponential distribution having mean 100. Note that since we are setting α, a probability defined in the setting in which H_0 is true, we assume the mean is 100. For each of the nreps samples we find the value of R, recording it in rvec. We then take the 95th percentile of those values, which is the c for which P(R > c) = 0.05. [Footnote 7: Of course, this is approximate. The greater the value of nreps, the better the approximation.]

The value of c output by the code was 220.4991. A second run yielded 220.9304, and a third 220.7099. The fact that these values varied little among themselves indicates that our value of nreps, 200000, was sufficiently large.
11.7.3 Exact Tests under a Normal Distribution Assumption

If you are willing to assume that you are sampling from a normally-distributed population, then the Student-t test is nominally exact. The R function t.test() performs this operation, with the argument alternative set to either "less" or "greater".
11.8 What's Wrong with Significance Testing, and What to Do Instead

"The first principle is that you must not fool yourself, and you are the easiest person to fool. So you have to be very careful about that. After you've not fooled yourself, it's easy not to fool other scientists." (Richard Feynman, Nobel laureate in physics)

"Sir Ronald [Fisher] has befuddled us, mesmerized us, and led us down the primrose path." (Paul Meehl, professor of psychology and the philosophy of science)
Significance testing is a time-honored approach, used by tens of thousands of people every day. But it is "wrong." I use the quotation marks here because, although significance testing is mathematically correct, it is at best noninformative and at worst seriously misleading.
11.8.1 History of Significance Testing, and Where We Are Today

We'll see why significance testing has serious problems shortly, but first a bit of history.

When the concept of significance testing, especially the 5% value for α, was developed in the 1920s by Sir Ronald Fisher, many prominent statisticians opposed the idea, for good reason, as we'll see below. But Fisher was so influential that he prevailed, and thus significance testing became the core operation of statistics.

So, significance testing became entrenched in the field, in spite of being widely recognized as faulty, to this day. Most modern statisticians understand this, even if many continue to engage in the practice. (Many are forced to do so, e.g. to comply with government standards in pharmaceutical testing.) Here are a few places you can read criticism of testing:

- There is an entire book on the subject, The Cult of Statistical Significance, by S. Ziliak and D. McCloskey. Interestingly, on page 2, they note the prominent people who have criticized testing. Their list is a virtual "who's who" of statistics, as well as physics Nobel laureate Richard Feynman and economics Nobelists Kenneth Arrow and Milton Friedman.

- See http://www.indiana.edu/~stigtsts/quotsagn.html for a nice collection of quotes from famous statisticians on this point.

- There is an entire chapter devoted to this issue in one of the best-selling elementary statistics textbooks in the nation. [Footnote 8: Statistics, third edition, by David Freedman, Robert Pisani, Roger Purves, pub. by W.W. Norton, 1997.]

- The Federal Judicial Center, which is the educational and research arm of the federal court system, commissioned two prominent academics, one a statistics professor and the other a law professor, to write a guide to statistics for judges: Reference Guide on Statistics, by David H. Kaye and David A. Freedman, at http://www.fjc.gov/public/pdf.nsf/lookup/sciman02.pdf/$file/sciman02.pdf. There is quite a bit here on the problems of significance testing; see especially p.129.
11.8.2 The Basic Fallacy

To begin with, it's absurd to test H_0 in the first place, because we know a priori that H_0 is false.

Consider the coin example, for instance. No coin is absolutely perfectly balanced, and yet that is the question that significance testing is asking:

H_0: p = 0.5000000000000000000000000000... \qquad (11.14)

We know before even collecting any data that the hypothesis we are testing is false, and thus it's nonsense to test it.

But much worse is this word "significant." Say our coin actually has p = 0.502. From anyone's point of view, that's a fair coin! But look what happens in (11.4) as the sample size n grows. If we have a large enough sample, eventually the denominator in (11.4) will be small enough, and p̂ will be close enough to 0.502, that Z will be larger than 1.96 and we will declare that p is "significantly" different from 0.5. But it isn't! Yes, 0.502 is different from 0.5, but NOT in any significant sense in terms of our deciding whether to use this coin in the Super Bowl.

The same is true for government testing of new pharmaceuticals. We might be comparing a new drug to an old drug. Suppose the new drug works only, say, 0.4% (i.e. 0.004) better than the old one. Do we want to say that the new one is "significantly" better? This wouldn't be right, especially if the new drug has much worse side effects and costs a lot more (a given, for a new drug).

Note that in our analysis above, in which we considered what would happen in (11.4) as the sample size increases, we found that eventually everything becomes "significant," even if there is no practical difference. This is especially a problem in computer science applications of statistics, because they often use very large data sets. A data mining application, for instance, may consist of hundreds of thousands of retail purchases. The same is true for data on visits to a Web site, network traffic data and so on. In all of these, the standard use of significance testing can result in our pouncing on very small differences that are quite insignificant to us, yet will be declared "significant" by the test.

Conversely, if our sample is too small, we can miss a difference that actually is significant, i.e. important to us, and we would declare that p is NOT significantly different from 0.5. In the example of the new drug, this would mean that it would be declared as "not significantly better" than the old drug, even if the new one is much better but our sample size wasn't large enough to show it.

In summary, the basic problems with significance testing are

- H_0 is improperly specified. What we are really interested in here is whether p is near 0.5, not whether it is exactly 0.5 (which we know is not the case anyway).

- Use of the word "significant" is grossly improper (or, if you wish, grossly misinterpreted).
Significance testing forms the very core usage of statistics, yet you can now see that it is, as I said above, "at best noninformative and at worst seriously misleading." This is widely recognized by thinking statisticians and prominent scientists, as noted above. But the practice of significance testing is too deeply entrenched for things to have any prospect of changing.
11.8.3 You Be the Judge!
This book has been written from the point of view that every educated person should understand statistics. It impacts many vital aspects of our daily lives, and many people with technical degrees find a need for it at some point in their careers.

In other words, statistics is something to be used, not just learned for a course. You should think about it critically, especially this material here on the problems of significance testing. You yourself should decide whether the latter's widespread usage is justified.
11.8.4 What to Do Instead
Note carefully that I am not saying that we should not make a decision. We do have to decide, e.g. decide whether a new hypertension drug is safe, or in this case decide whether this coin is "fair" enough for practical purposes, say for determining which team gets the kickoff in the Super Bowl. But it should be an informed decision, and even testing the modified H_0 below would be much less informative than a confidence interval.
In fact, the real problem with significance tests is that they take the decision out of our hands. They make our decision mechanically for us, not allowing us to interject issues of importance to us, such as possible side effects in the drug case.
So, what can we do instead?
In the coin example, we could set limits of fairness, say require that p be no more than 0.01 from 0.5 in order to consider it "fair." We could then test the hypothesis

H_0: 0.49 ≤ p ≤ 0.51   (11.15)
Such an approach is almost never used in practice, as it is somewhat difficult to use and explain. But even more importantly, what if the true value of p were, say, 0.51001? Would we still really want to reject the coin in such a scenario?
Forming a confidence interval is the far superior approach. The width of the interval shows us whether n is large enough for p̂ to be reasonably accurate, and the location of the interval tells us whether the coin is fair enough for our purposes.
Note that in making such a decision, we do NOT simply check whether 0.5 is in the interval. That would make the confidence interval reduce to a significance test, which is what we are trying to avoid. If for example the interval is (0.502,0.505), we would probably be quite satisfied that the coin is fair enough for our purposes, even though 0.5 is not in the interval.
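Here is a small R sketch of ours of that decision process (the counts are made up); we act on the interval's location relative to the 0.01 fairness limits suggested above, not on whether it contains 0.5:

n <- 10^6
x <- 503500                   # hypothetical number of heads observed
phat <- x / n
ci <- phat + c(-1.96,1.96) * sqrt(phat * (1 - phat) / n)
ci                            # approximate 95% interval, about (0.5025, 0.5045)
all(ci > 0.49 & ci < 0.51)    # TRUE: fair enough, even though 0.5 is outside the interval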
On the other hand, say the interval comparing the new drug to the old one is quite wide, and covers more or less equal positive and negative territory. Then the interval is telling us that the sample size just isn't large enough to say much at all.
Significance testing is also used for model building, such as for predictor variable selection in regression analysis (a method to be covered in Chapter 15). The problem is even worse there, because there is no reason to use α = 0.05 as the cutoff point for selecting a variable. In fact, even if one uses significance testing for this purpose (again, very questionable), some studies have found that the best values of α for this kind of application are in the range 0.25 to 0.40, far outside the range people use in testing.
In model building, we still can and should use confidence intervals. However, it does take more work to do so. We will return to this point in our unit on modeling, Chapter 14.
11.8.5 Decide on the Basis of the Preponderance of Evidence
I was in search of a one-armed economist, so that the guy could never make a statement and then say: on the other hand. (President Harry S Truman)

If all economists were laid end to end, they would not reach a conclusion. (Irish writer George Bernard Shaw)
In the movies, you see stories of murder trials in which the accused must be proven guilty "beyond the shadow of a doubt." But in most noncriminal trials, the standard of proof is considerably lighter, "preponderance of evidence." This is the standard you must use when making decisions based on statistical data. Such data cannot "prove" anything in a mathematical sense. Instead, it should be taken merely as evidence. The width of the confidence interval tells us the likely accuracy of that evidence. We must then weigh that evidence against other information we have about the subject being studied, and then ultimately make a decision on the basis of the preponderance of all the evidence.
Yes, juries must make a decision. But they don't base their verdict on some formula. Similarly, you the data analyst should not base your decision on the blind application of a method that is usually of little relevance to the problem at hand: significance testing.
11.8.6 Example: the Forest Cover Data
In Section 10.7.4, we found that an approximate 95% confidence interval for μ_1 − μ_2 was

223.8 − 226.3 ± 2.3 = (−4.8, −0.3)   (11.16)
Clearly, the difference in HS12 between cover types 1 and 2 is tiny when compared to the general size of HS12, in the 200s. Thus HS12 is not going to help us guess which cover type exists at a given location. Yet with the same data, we would reject the hypothesis

H_0: μ_1 = μ_2   (11.17)

and say that the two means are "significantly" different, which sounds like there is an important difference, which there is not.
11.8.7 Example: Assessing Your Candidates Chances for Election
Imagine an election between Ms. Smith and Mr. Jones, with you serving as campaign manager for Smith. You've just gotten the results of a very small voter poll, and the confidence interval for p, the fraction of voters who say they'll vote for Smith, is (0.45,0.85). Most of the points in this interval are greater than 0.5, so you would be highly encouraged! You are certainly not sure of the final election result, as a small part of the interval is below 0.5, and anyway voters might change their minds between now and the election. But the results would be highly encouraging.
Yet a significance test would say "There is no significant difference between the two candidates. It's a dead heat." Clearly that is not telling the whole story. The point, once again, is that the confidence interval is giving you much more information than is the significance test.
Exercises
1. In the light bulb example on page 252, suppose the actual observed value of X turns out to be
15.88. Find the p-value.
Chapter 12
General Statistical Estimation and Inference
In the last chapter, we often referred to certain estimators as being "natural." For example, if we are estimating a population mean, an obvious choice of estimator would be the sample mean. But in many applications, it is less clear what a "natural" estimate for a parameter of interest would be.1 We will present general methods for estimation in this section.

We will also discuss advanced methods of inference.
12.1 General Methods of Parametric Estimation
Let's begin with a simple motivating example.
12.1.1 Example: Guessing the Number of Raffle Tickets Sold
You've just bought a raffle ticket, and find that you have ticket number 68. You check with a couple of friends, and find that their numbers are 46 and 79. Let c be the total number of tickets. How should we estimate c, using our data 68, 46 and 79?
It is reasonable to assume that each of the three of you is equally likely to get assigned any of the numbers 1,2,...,c. In other words, the numbers we get, X_i, i = 1,2,3, are uniformly distributed on the set {1,2,...,c}. We can also assume that they are independent; that's not exactly true, since we are sampling without replacement, but for large c (or better stated, for n/c small) it's close enough.

1 Recall that we are using the term parameter to mean any population quantity, rather than an index into a parametric family of distributions.
So, we are assuming that the X_i are independent and identically distributed (famously written as "i.i.d." in the statistics world) on the set {1,2,...,c}. How do we use the X_i to estimate c?
12.1.2 Method of Moments
One approach, an intuitive one, would be to reason as follows. Note first that

E(X) = (c + 1)/2   (12.1)

Let's solve for c:

c = 2EX − 1   (12.2)
We know that we can use

X̄ = (1/n) Σ_{i=1}^n X_i   (12.3)

to estimate EX, so by (12.2), 2X̄ − 1 is an intuitive estimate of c. Thus we take our estimator for c to be

ĉ = 2X̄ − 1   (12.4)

This estimator is called the Method of Moments estimator of c.
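A one-line R check of this estimate with the ticket numbers above (our own sketch):

x <- c(68,46,79)          # the observed ticket numbers
chat <- 2 * mean(x) - 1   # Method of Moments estimate, (12.4)
chat                      # about 127.67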
Let's step back and review what we did:

• We wrote our parameter as a function of the population mean EX of our data item X. Here, that resulted in (12.2).

• In that function, we substituted our sample mean X̄ for EX, and substituted our estimator ĉ for the parameter c, yielding (12.4). We then solved for our estimator.
We say that an estimator θ̂ of some parameter θ is consistent if

lim_{n→∞} θ̂ = θ   (12.5)
where n is the sample size. In other words, as the sample size grows, the estimator eventually converges to the true population value.

Of course here X̄ is a consistent estimator of EX. Thus you can see from (12.2) and (12.4) that ĉ is a consistent estimator of c. In other words, the Method of Moments generally gives us consistent estimators.
What if we have more than one parameter to estimate? We generalize what we did above:

• Suppose we are estimating a parametric distribution with parameters θ_1, ..., θ_r.

• Let η_i denote the i-th moment of X, E(X^i).

• For i = 1,...,r we write η_i as a function g_i of all the θ_k.

• For i = 1,...,r set

η̂_i = (1/n) Σ_{j=1}^n X_j^i   (12.6)

• Substitute the η̂_k in the g_i and then solve for the estimators θ̂_k.
In the above example with the raffle, we had r = 1, θ_1 = c, g_1(c) = (c + 1)/2 and so on. A two-parameter example will be given below.
12.1.3 Method of Maximum Likelihood
Another method, much more commonly used, is called the Method of Maximum Likelihood. In our example above, it means asking the question, "What value of c would have made our data (68, 46, 79) most likely to happen?" Well, let's find what is called the likelihood, i.e. the probability of our particular data values occurring:

L = P(X_1 = 68, X_2 = 46, X_3 = 79) =
   (1/c)^3,  if c ≥ 79
   0,        otherwise   (12.7)
Now keep in mind that c is a fixed, though unknown, constant. It is not a random variable. What we are doing here is just asking "What if" questions, e.g. "If c were 85, how likely would our data be? What about c = 91?"

Well then, what value of c maximizes (12.7)? Clearly, it is c = 79. Any smaller value of c gives us a likelihood of 0. And for c larger than 79, the larger c is, the smaller (12.7) is. So, our maximum likelihood estimator (MLE) is 79. In general, if our sample size in this problem were n, our MLE for c would be

ĉ = max_i X_i   (12.8)
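Again trivial to compute in R (our sketch):

x <- c(68,46,79)
max(x)   # the MLE (12.8); here 79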
12.1.4 Example: Estimating the Parameters of a Gamma Distribution
As another example, suppose we have a random sample X_1, ..., X_n from a gamma distribution,

f_X(t) = (1/Γ(c)) λ^c t^{c−1} e^{−λt},  t > 0   (12.9)

for some unknown c and λ. How do we estimate c and λ from the X_i?
12.1.4.1 Method of Moments
Let's try the Method of Moments, as follows. We have two population parameters to estimate, c and λ, so we need to involve two moments of X. That could be EX and E(X²), but here it would more conveniently be EX and Var(X). We know from our previous unit on continuous random variables, Chapter 4, that

EX = c/λ   (12.10)

Var(X) = c/λ²   (12.11)
In our earlier notation, this would be r = 2, θ_1 = c, θ_2 = λ, and g_1(c,λ) = c/λ, g_2(c,λ) = c/λ².
Switching to sample analogs and estimates, we have

ĉ/λ̂ = X̄   (12.12)

ĉ/λ̂² = s²   (12.13)
Dividing the two quantities yields

λ̂ = X̄/s²   (12.14)

which then gives

ĉ = X̄²/s²   (12.15)
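In R these two estimates are immediate. Here is a sketch of ours on simulated data (note that var() divides by n−1 rather than n, a negligible difference for large samples):

set.seed(1234)
x <- rgamma(1000, shape=2, rate=1.5)   # simulated data; true c = 2, lambda = 1.5
lamhat <- mean(x) / var(x)             # (12.14)
chat <- mean(x)^2 / var(x)             # (12.15)
c(chat, lamhat)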
12.1.4.2 MLEs
What about the MLEs of c and λ? Remember, the X_i are continuous random variables, so the likelihood function, i.e. the analog of (12.7), is the product of the density values:

L = Π_{i=1}^n [ (1/Γ(c)) λ^c X_i^{c−1} e^{−λ X_i} ]   (12.16)

  = [λ^c/Γ(c)]^n (Π_{i=1}^n X_i)^{c−1} e^{−λ Σ_{i=1}^n X_i}   (12.17)
In general, it is usually easier to maximize the log likelihood (and maximizing this is the same as maximizing the original likelihood):

l = (c − 1) Σ_{i=1}^n ln(X_i) − λ Σ_{i=1}^n X_i + nc ln(λ) − n ln(Γ(c))   (12.18)
One then takes the partial derivatives of (12.18) with respect to c and λ, and sets the derivatives to zero. The solution values, ĉ and λ̂, are then the MLEs of c and λ. Unfortunately, in this case, these equations do not have closed-form solutions. So the equations must be solved numerically. (In fact, numerical methods are needed even more in this case, because finding the derivative of Γ(c) is not easy.)
12.1.4.3 R's mle() Function
R provides a function, mle(), for finding MLEs in mathematically intractable situations such as the one in the last section. Here's an example in that context. We'll simulate some data from a gamma distribution with given parameter values, then pretend we don't know those, and find the MLEs from the data:
library(stats4)   # mle() is in the stats4 package
x <- rgamma(100,shape=2) # Erlang, r = 2
n <- length(x)
ll <- function(c,lambda) {
   # negative log likelihood, from (12.18)
   loglik <- (c-1) * sum(log(x)) - sum(x)*lambda + n*c*log(lambda) -
      n*log(gamma(c))
   return(-loglik)
}
summary(mle(minuslogl=ll,start=list(c=1,lambda=1)))
Maximum likelihood estimation
Call:
mle(minuslogl = ll, start = list(c = 1, lambda = 1))
Coefficients:
Estimate Std. Error
c 1.993399 0.1770996
lambda 1.027275 0.1167195
-2 log L: 509.8227
How did this work? The main task we have is to write a function that calculates the negative of the log likelihood; that function's arguments are the parameters to be estimated. (Note that in R, log() calculates the natural logarithm by default.) Fortunately for us, mle() calculates the derivatives numerically too, so we didn't need to specify them in the log likelihood function. (Needless to say, this function thus cannot be used in a problem in which derivatives cannot be used, such as the raffle example above.)
We also need to supply mle() with initial guesses for the parameters. That's done in the start argument. I more or less arbitrarily chose 1.0 for these values. You may have to experiment, though, as some sets of initial values may not result in convergence.
The standard errors of the estimated parameters are also printed out, enabling the formation of confidence intervals and significance tests. See for instance Section 10.5. In fact, you can get the estimated covariance matrix for the vector of estimated parameters. In our case here:
> mleout <- mle(minuslogl=ll,start=list(c=2,lambda=2))
Warning messages:
1: In log(lambda) : NaNs produced
2: In log(lambda) : NaNs produced
3: In log(lambda) : NaNs produced
> solve(mleout@details$hessian)
c lambda
c 0.08434476 0.04156666
lambda 0.04156666 0.02582428
By the way, there were also some warning messages, due to the fact that during the iterative maximization process, some iterations generated guesses for λ̂ that were at or near 0, causing problems with log().
12.1.5 More Examples
Suppose f_W(t) = c t^{c−1} for t in (0,1), with the density being 0 elsewhere, for some unknown c > 0. We have a random sample W_1, ..., W_n from this density.
Let's find the Method of Moments estimator.

EW = ∫_0^1 t · c t^{c−1} dt = c/(c + 1)   (12.19)

So, set

W̄ = ĉ/(ĉ + 1)   (12.20)

yielding

ĉ = W̄/(1 − W̄)   (12.21)
What about the MLE?

L = Π_{i=1}^n c W_i^{c−1}   (12.22)

so

l = n ln c + (c − 1) Σ_{i=1}^n ln W_i   (12.23)

Then set

0 = n/ĉ + Σ_{i=1}^n ln W_i   (12.24)

and thus

ĉ = 1 / [ −(1/n) Σ_{i=1}^n ln W_i ]   (12.25)
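Both estimators take one line of R apiece. Here is a sketch of ours on simulated data, generated by the inverse-cdf method (the cdf here is F(t) = t^c, so F^{-1}(u) = u^{1/c}):

set.seed(5678)
c.true <- 3
w <- runif(1000)^(1/c.true)          # simulated sample from f_W
c.mom <- mean(w) / (1 - mean(w))     # (12.21)
c.mle <- 1 / (-mean(log(w)))         # (12.25)
c(c.mom, c.mle)                      # both near 3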
As in Section 12.1.3, not every MLE can be determined by taking derivatives. Consider a continuous analog of the example in that section, with f_W(t) = 1/c on (0,c), 0 elsewhere, for some c > 0.

The likelihood is

(1/c)^n   (12.26)

as long as

c ≥ max_i W_i   (12.27)

and is 0 otherwise. So,

ĉ = max_i W_i   (12.28)

as before.
Now consider a different problem. Suppose the random variable X is equal to 1, 2 and 3, with probabilities c, c and 1−2c, respectively. The value c is thus a population parameter. We have a random sample X_1, ..., X_n from this population. Let's find the Method of Moments Estimator of c, and its bias.

First,

EX = c · 1 + c · 2 + (1 − 2c) · 3 = 3 − 3c   (12.29)

Thus

c = (3 − EX)/3   (12.30)

and so set

ĉ = (3 − X̄)/3   (12.31)
Next,

E ĉ = E[(3 − X̄)/3]   (12.32)

    = (1/3)(3 − E X̄)   (12.33)

    = (1/3)[3 − EX]   (12.34)

    = (1/3)[3 − (3 − 3c)]   (12.35)

    = c   (12.36)

So, the bias is 0.
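A quick simulation check of that unbiasedness (our own sketch):

set.seed(99)
c.true <- 0.25
chats <- replicate(10000, {
   x <- sample(1:3, 50, replace=TRUE, prob=c(c.true,c.true,1-2*c.true))
   (3 - mean(x)) / 3          # the estimator (12.31)
})
mean(chats)                   # very close to c.true = 0.25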
12.1.6 What About Confidence Intervals?
Usually we are not satisfied with simply forming estimates (called point estimates). We also want some indication of how accurate these estimates are, in the form of confidence intervals (interval estimates).

In many special cases, finding confidence intervals can be done easily on an ad hoc basis. Look, for instance, at the Method of Moments Estimator in Section 12.1.2. Our estimator (12.4) is a linear function of X̄, so we easily obtain a confidence interval for c from one for EX.
Another example is (12.25). Taking the limit as n → ∞, the equation shows us (and we could verify) that

c = 1/E[−ln W]   (12.37)

Defining X_i = −ln W_i and X̄ = (X_1 + ... + X_n)/n, we can obtain a confidence interval for EX in the usual way. We then see from (12.37) that we can form a confidence interval for c by simply taking the reciprocal of each endpoint of the interval, and swapping the left and right endpoints.
What about in general? For the Method of Moments case, our estimators are functions of the sample moments, and since the latter are formed from sums and thus are asymptotically normal, the delta method (Section 13.2) can be used to show that our estimators are asymptotically normal and to obtain asymptotic variances for them.

There is a well-developed asymptotic theory for MLEs, which under certain conditions shows asymptotic normality with a certain asymptotic variance, thus enabling confidence intervals. The theory also establishes that MLEs are in a certain sense optimal among all estimators. We will not pursue this here, but will note that mle() does give standard errors for the estimates, thus enabling the formation of confidence intervals.
12.2 Bias and Variance
The notions of bias and variance play central roles in the evaluation of goodness of estimators.
12.2.1 Bias
Definition 29 Suppose θ̂ is an estimator of θ. Then the bias of θ̂ is

bias = E(θ̂) − θ   (12.38)

If the bias is 0, we say that the estimator is unbiased.
It's very important to note that, in spite of the pejorative-sounding name, bias is not an inherently bad property for an estimator to have. Indeed, most good estimators are at least slightly biased. We'll explore this in the next section.
12.2.2 Why Divide by n−1 in s²?
It should be noted that it is customary in (10.20) to divide by n−1 instead of n, for reasons that are largely historical. Here's the issue:

If we divide by n, as we have been doing, then it turns out that s² is biased:

E(s²) = [(n − 1)/n] · σ²   (12.39)
Think about this in the Davis people example, once again in the notebook context. Remember, here n is 1000, and each line of the notebook represents our taking a different random sample of 1000 people. Within each line, there will be entries for W_1 through W_1000, the weights of our 1000 people, and for W̄ and s. For convenience, let's suppose we record that last column as s² instead of s.
Now, say we want to estimate the population variance σ². As discussed earlier, the natural estimator for it would be the sample variance, s². What (12.39) says is that after looking at an infinite number of lines in the notebook, the average value of s² would be just...a...little...bit...too...small. All the s² values would average out to 0.999σ², rather than to σ². We might say that s² has a little bit more tendency to underestimate σ² than to overestimate it.
So, (12.39) implies that s² is a biased estimator of the population variance σ², with the amount of bias being

[(n − 1)/n] σ² − σ² = −(1/n) σ²   (12.40)
Let's prove (12.39). As before, let W be a random variable distributed as the population, and let W_1, ..., W_n be a random sample from that population. So, EW_i = μ and Var(W_i) = σ², where again μ and σ² are the population mean and variance.
It will be more convenient to work with ns² than s², since it will avoid a lot of dividing by n. So, write

ns² = Σ_{i=1}^n (W_i − W̄)²   (def.)   (12.41)

    = Σ_{i=1}^n [(W_i − μ) + (μ − W̄)]²   (alg.)   (12.42)

    = Σ_{i=1}^n (W_i − μ)² + 2(μ − W̄) Σ_{i=1}^n (W_i − μ) + n(μ − W̄)²   (alg.)   (12.43)
But that middle sum is

Σ_{i=1}^n (W_i − μ) = Σ_{i=1}^n W_i − nμ = nW̄ − nμ   (12.44)
So,

ns² = Σ_{i=1}^n (W_i − μ)² − n(W̄ − μ)²   (12.45)
Now let's take the expected value of (12.45). First,

E[Σ_{i=1}^n (W_i − μ)²] = Σ_{i=1}^n E[(W_i − μ)²]   (E is lin.)   (12.46)

 = Σ_{i=1}^n E[(W_i − EW_i)²]   (W_i distr. as pop.)   (12.47)

 = Σ_{i=1}^n Var(W_i)   (def. of Var())   (12.48)

 = Σ_{i=1}^n σ²   (W_i distr. as pop.)   (12.49)

 = nσ²   (12.50)
Also,

E[(W̄ − μ)²] = E[(W̄ − EW̄)²]   ((10.11))   (12.51)

 = Var(W̄)   (def. of Var())   (12.52)

 = σ²/n   ((10.16))   (12.53)
Applying these last two findings to (12.45), we get (12.39):

E(s²) = [(n − 1)/n] · σ²   (12.54)

The earlier developers of statistics were bothered by this bias, so they introduced a "fudge factor" by dividing by n−1 instead of n in (10.20). We will call that s̃²:

s̃² = [1/(n − 1)] Σ_{i=1}^n (W_i − W̄)²   (12.55)

This is the classical definition of sample variance, in which we divide by n−1 instead of n.
But we will use n. After all, when n is large (which is what we are assuming by using the Central Limit Theorem in the entire development so far) it doesn't make any appreciable difference. Clearly it is not important in our Davis example, or our bus simulation example.
Moreover, speaking generally now rather than necessarily for the case of s², there is no particular reason to insist that an estimator be unbiased anyway. An alternative estimator may have a little bias but much smaller variance, and thus might be preferable. And anyway, even though the classical version of s², i.e. s̃², is an unbiased estimator for σ², s̃ is not an unbiased estimator for σ, the population standard deviation. In other words, unbiasedness is not such an important property.
The R functions var() and sd() calculate the versions of s² and s, respectively, that have a divisor of n−1.
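A quick check of that divisor (our own sketch, with made-up numbers):

w <- c(1.2,3.4,2.2,5.1)
n <- length(w)
var(w)                          # R's version ...
sum((w - mean(w))^2) / (n-1)    # ... uses the n-1 divisor
sum((w - mean(w))^2) / n        # the divide-by-n version used in this book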
12.2.2.1 Example of Bias Calculation
Let's find the bias of the estimator (12.28).

The bias is E ĉ − c. To get E ĉ we need the density of that estimator, which we get as follows:

P(ĉ ≤ t) = P(all W_i ≤ t)   (definition)   (12.56)

         = (t/c)^n   (density of W_i)   (12.57)

So,

f_ĉ(t) = (n/c^n) t^{n−1}   (12.58)

Integrating t f_ĉ(t) over (0,c), we find that

E ĉ = [n/(n + 1)] c   (12.59)

So the bias is −c/(n+1), not bad at all.
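A simulation sketch of ours confirming (12.59):

set.seed(2020)
c.true <- 10; n <- 25
chats <- replicate(50000, max(runif(n,0,c.true)))   # the estimator (12.28)
mean(chats)                # near (n/(n+1)) * c.true = 9.615
mean(chats) - c.true       # bias, near -c.true/(n+1) = -0.385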
12.2.3 Tradeoff Between Variance and Bias
Consider a general estimator Q of some population value b. Then a common measure of the quality (of course there are many others) of the estimator Q is the mean squared error (MSE),

E[(Q − b)²]   (12.60)

Of course, the smaller the MSE, the better.
One can break (12.60) down into variance and (squared) bias components, as follows. (In reading the derivation, keep in mind that EQ and b are constants.)

MSE(Q) = E[(Q − b)²]   (definition)   (12.61)

 = E[((Q − EQ) + (EQ − b))²]   (algebra)   (12.62)

 = E[(Q − EQ)²] + 2E[(Q − EQ)(EQ − b)] + E[(EQ − b)²]   (alg., E props.)   (12.63)

 = E[(Q − EQ)²] + E[(EQ − b)²]   (factor out the constant EQ − b; then E(Q − EQ) = 0)   (12.64)

 = Var(Q) + (EQ − b)²   (def. of Var(), fact that EQ − b is const.)   (12.65)

 = variance + squared bias   (12.66)
In other words, in discussing the accuracy of an estimator, especially in comparing two or more candidates to use for our estimator, the average squared error has two main components, one for variance and one for bias. In building a model, these two components are often at odds with each other; we may be able to find an estimator with smaller bias but more variance, or vice versa.

We also see from (12.66) that a little bias in an estimator may be quite tolerable, as long as the variance is low. This is good, because as mentioned earlier, most estimators are in fact biased.

These points will become central in Chapters 14 and 15.
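The decomposition is easy to see numerically. In this sketch of ours, the divide-by-n estimator s² of the variance of a N(0,1) population has its MSE computed directly, then as variance plus squared bias:

set.seed(7)
n <- 10
s2 <- replicate(100000, { w <- rnorm(n); mean((w - mean(w))^2) })
mean((s2 - 1)^2)               # MSE, estimated directly (true sigma^2 = 1)
var(s2) + (mean(s2) - 1)^2     # variance + squared bias; same value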
12.3 More on the Issue of Independence/Nonindependence of Samples
In Section 10.7.1, we derived confidence intervals for the difference between two population means (or proportions). The derivation depended crucially on the fact that the two sample means, X̄ and Ȳ, were independent. This in turn stemmed from the fact that the corresponding sample data sets were separate.

On the other hand, in Section 10.7.3, we had an example in which the two sample means, X̄ and Ȳ, were not independent, as they came from the same set of kids. The confidence intervals derived in Section 10.7.1 were thus invalid, and new ones were derived, based on differences.
Note that in both cases, the observations within a sample were also independent. In the example of children's heights in Section 10.7.3, for instance, the fact that Mary was chosen as the first child in the sample had no effect on whether Jane was chosen as the second one. This was important for the derivations too, as they used (10.16), which assumed independence.
In this section, we will explore these points further, with our aim being to state the concepts in precise random variable terms.
As our concrete example, consider an election survey, in a small city. Say there are equal numbers of men and women in the city, 5000 each. We wish to estimate the population proportion of people who plan to vote for candidate A. We take a random sample of size n from the population. Define the following:

• Let V denote the indicator variable for the event that the person plans to vote for A.

• We might be interested in differences between men and women in A's support, so let G be 1 for male, 2 for female.

• Let p denote the population proportion of people who plan to vote for A.

• Let p_1 and p_2 denote the population proportions planning to vote for A, among men and women respectively. Note that

p = 0.5p_1 + 0.5p_2   (12.67)
• Denote our data by (V_1, G_1), ..., (V_n, G_n), recording both the planned vote and gender for each person in the sample.

• For convenience, relabel the data by gender, with M_1, ..., M_{N_1} and F_1, ..., F_{N_2} denoting the planned votes of the men and women.
Clearly, the male data and female data are independent. The fact that Jack is chosen in the male sample has no impact on whether Jill is chosen in the female one.

But what about data within a gender group? For example, are M_1 and M_2, the planned votes of the first two men in our male sample, independent? Or are they correlated, since these two people have the same gender?
The answer is that M_1 and M_2 are indeed independent. The first man could be any of the 5000 men in the city, with probability 1/5000 each, and the same is true of the second man. Moreover, the choice of the first man has no effect at all on the choice of the second one. (Remember, in random samples we sample with replacement.)
Our estimate of p is our usual sample proportion,

p̂ = (V_1 + ... + V_n)/n   (12.68)
Then we can use (10.34) to find a confidence interval for p. But again, the reader might question this, saying something like, "What if G_1 and G_2 are both 1, i.e. the first two people in our sample are both men? Won't V_1 and V_2 then be correlated?" The answer is no, because the reader would be referring to the conditional distribution of V given G, whereas our use of (10.34) does not involve gender, i.e. it concerns the unconditional distribution of V.
This point is subtle, and is difficult for the beginning modeler to grasp. It is related to issues in our first discussions of probability in Chapter 2. In the ALOHA model there, for instance, beginning students who are asked to find P(X_2 = 1) often object, "Well, it depends on what X_1 is." That is incorrect thinking, because they are confusing P(X_2 = 1) with P(X_2 = 1 | X_1 = i). That confusion is resolved by thinking in notebook terms, with P(X_2 = 1) meaning the long-run proportion of notebook lines in which X_2 = 1, regardless of the value of X_1. In our case here, the reader must avoid confusing P(V = 1) (which is p) with P(V = 1 | G = i) (which is p_i).
Continuing this point a bit more, note that our p̂ above is an unbiased estimate of p:

E p̂ = E(V_1)   ((10.11))   (12.69)

 = P(V_1 = 1)   ((3.44))   (12.70)

 = P(G_1 = 1)P(V_1 = 1 | G_1 = 1) + P(G_1 = 2)P(V_1 = 1 | G_1 = 2)   (Chapter 2)   (12.71)

 = 0.5p_1 + 0.5p_2   (12.72)

 = p   ((12.67))   (12.73)
Due to the independence of the male and female samples, we can use (10.43) to find a confidence interval for p_1 − p_2, so that we can compare male and female support of A. Note by the way that M̄ will be an unbiased estimate of p_1, with a similar statement holding for the women.
Now, contrast all that with a different kind of sampling, as follows. We choose a gender group at random, and then sample n people from that gender group. Let R denote the group chosen, so that G_i = R for all i. So, what about the answers to the above questions in this new setting?
Conditionally on R, the V_i are again independent, using the same argument as we used to show that M_1 and M_2 were independent above. And (12.69) still works, so our p̂ is still unbiased.
However: The V_i are no longer unconditionally independent:

P(V_1 = 1 and V_2 = 1) = 0.5p_1² + 0.5p_2²   (12.74)

(the reader should fill in the details, with a conditioning argument like that in (12.69)), while

P(V_1 = 1) · P(V_2 = 1) = p² = (0.5p_1 + 0.5p_2)²   (12.75)

So,

P(V_1 = 1 and V_2 = 1) ≠ P(V_1 = 1) · P(V_2 = 1)   (12.76)

and thus V_1 and V_2 are not unconditionally independent.
This setting is very common. We might, for instance, choose k trees at random, and then collect
data on r leaves in each tree.
12.4 Nonparametric Distribution Estimation
Here we will be concerned with estimating distribution functions and densities in settings in which
we do not assume our distribution belongs to some parametric model.
12.4.1 The Empirical cdf
Recall that F_X, the cdf of X, is defined as

F_X(t) = P(X ≤ t),  −∞ < t < ∞   (12.77)

Define its sample analog, called the empirical distribution function, by

F̂_X(t) = [# of X_i in (−∞, t)] / n   (12.78)

In other words, F_X(t) is the proportion of X that are below t in the population, and F̂_X(t) is the value of that proportion in our sample. F̂_X(t) estimates F_X(t) for each t.
Graphically, F̂_X is a step function, with jumps at the values of the X_i. Specifically, let Y_j, j = 1,...,n denote the sorted version of the X_i.3 Then

F̂_X(t) =
   0,    for t < Y_1
   j/n,  for Y_j ≤ t < Y_{j+1}
   1,    for t > Y_n   (12.79)

3 A common notation for this is Y_j = X_(j), meaning that Y_j is the j-th smallest of the X_i. These are called the order statistics of our sample.
Here is a simple example. Say n = 4 and our data are 4.8, 1.2, 2.2 and 6.1. We can plot the empirical cdf by calling R's ecdf() function:

> x <- c(4.8,1.2,2.2,6.1)
> plot(ecdf(x))
Here is the graph:

[Figure: plot of ecdf(x), a step function titled "ecdf(x)", with x on the horizontal axis and Fn(x), rising from 0.0 to 1.0, on the vertical axis]
Consider the Bus Paradox example again. Recall that W denoted the time until the next bus arrives. This is called the forward recurrence time. The backward recurrence time is the time since the last bus was here, which we will denote by R.

Suppose we are interested in estimating the density of R, f_R(), based on the sample data R_1, ..., R_n that we gather in our simulation in Section ??, where n = 1000. How can we do this?4
We could, of course, assume that f_R is a member of some parametric family of distributions, say the two-parameter gamma family. We would then estimate those two parameters as in Section 12.1, and possibly check our assumption using goodness-of-fit procedures, discussed in our unit on modeling, Chapter 14. On the other hand, we may wish to estimate f_R without making any parametric assumptions. In fact, one reason we may wish to do so is to visualize the data in order to search for a suitable parametric model.
If we do not assume any parametric model, we have in essence changed our problem from estimating a finite number of parameters to an infinite-parameter problem; the "parameters" are the values of f_R(t) for all the different values of t. Of course, we probably are willing to assume some structure on f_R, such as continuity, but then we still would have an infinite-parameter problem.

4 Actually, one can prove that R has an exponential distribution. However, here we'll pretend we don't know that.
We call such estimation nonparametric, meaning that we don't use a parametric model. However, you can see that it is really infinite-parametric estimation.

As discussed in our unit on modeling, Chapter 14, the more complex the model, the higher the variance of its estimator. So, nonparametric estimators will have higher variance than parametric ones. The nonparametric estimators will also generally have smaller bias, of course.
12.4.2 Basic Ideas in Density Estimation
Recall that

f_R(t) = (d/dt) F_R(t) = (d/dt) P(R ≤ t)   (12.80)

From calculus, that means that

f_R(t) ≈ [P(R ≤ t + h) − P(R ≤ t − h)] / (2h)   (12.81)

       = P(t − h < R ≤ t + h) / (2h)   (12.82)

if h is small. We can then form an estimate f̂_R(t) by plugging in sample analogs in the right-hand side of (12.81):

f̂_R(t) ≈ [#(t − h, t + h)/n] / (2h)   (12.83)

        = #(t − h, t + h) / (2hn)   (12.84)

where the notation #(a, b) means the number of R_i in the interval (a,b).
There is an important issue of how to choose the value of h here, but let's postpone that for now. For the moment, let's take

h = [max_i R_i − min_i R_i] / 100   (12.85)

i.e. take h to be 0.01 of the range of our data.
At this point, we'd then compute (12.84) at lots of different points t. Although it would seem that theoretically we must compute (12.84) at infinitely many such points, the graph of the function is actually a step function. Imagine t moving to the right, starting at min_i R_i. The interval (t − h, t + h) moves along with it. Whenever the interval moves enough to the right to either pick up a new R_i or lose one that it had had, (12.84) will change value, but not at any other time. So, we only need to evaluate the function at about 2n values of t.
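Equations (12.84) and (12.85) translate directly to R. A sketch of ours (the data vector r is a placeholder):

estf <- function(t,r,h) sum(abs(r - t) < h) / (2 * h * length(r))   # (12.84)
r <- rexp(1000,0.1)                    # placeholder sample
h <- (max(r) - min(r)) / 100           # (12.85)
estf(20,r,h)                           # estimated density at t = 20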
12.4.3 Histograms
If for some reason we really want to save on computation, let's say that we first break the interval (min_i R_i, max_i R_i) into 100 subintervals of size h given by (12.85). We then compute (12.84) only at the midpoints of those intervals, and pretend that the graph of f̂_R(t) is constant within each subinterval. Do you know what we get from that? A histogram! Yes, a histogram is a form of density estimation. (Usually a histogram merely displays counts. We do so here too, but we have scaled things so that the total area under the curve is 1.)
Let's see how this works with our Bus Paradox simulation. We'll use R's hist() to draw a histogram. First, here's our simulation code:
doexpt <- function(opt) {
   lastarrival <- 0.0
   while (TRUE) {
      newlastarrival <- lastarrival + rexp(1,0.1)
      if (newlastarrival > opt)
         return(opt-lastarrival)
      else lastarrival <- newlastarrival
   }
}

observationpt <- 240
nreps <- 10000
waits <- vector(length=nreps)
for (rep in 1:nreps) waits[rep] <- doexpt(observationpt)
hist(waits)
Note that I used the default number of intervals, 20. Here is the result:
[Figure: histogram of waits, with waits (0 to 100) on the horizontal axis and Frequency (0 to 4000) on the vertical axis]
The density seems to have a shape like that of the exponential parametric family. (This is not surprising, because it is exponential, but remember we're pretending we don't know that.)
Here is the plot with 100 intervals:
[Figure: histogram of waits with 100 intervals, waits (0 to 80) on the horizontal axis, Frequency (0 to 800) on the vertical axis]
Again, a similar shape, though more raggedy.
12.4.4 Kernel-Based Density Estimation
No matter what the interval width is, the histogram will consist of a bunch of rectangles, rather than a curve. That is basically because, for any particular value of t, f̂_X(t) depends only on the X_i that fall into that interval. We could get a smoother result if we used all our data to estimate f_X(t) but put more weight on the data that is closer to t. One way to do this is called kernel-based density estimation, which in R is handled by the function density().
We need a set of weights, more precisely a weight function k, called the kernel. Any nonnegative function which integrates to 1 (i.e. a density function in its own right) will work. Our estimator is then

f̂_R(t) = (1/nh) Σ_{i=1}^n k[(t − R_i)/h]   (12.86)
To make this idea concrete, take k to be the uniform density on (-1,1), which has the value 0.5 on (-1,1) and 0 elsewhere. Then (12.86) reduces to (12.84). Note how the parameter h, called the bandwidth, continues to control how far away from t we wish to go for data points.

[Figure 12.1: Kernel estimate, default bandwidth; plot of density.default(x = r), N = 1000, bandwidth = 1.942]
But as mentioned, what we really want is to include all data points, so we typically use a kernel with support on all of (−∞, ∞). In R, the default kernel is that of the N(0,1) density. The bandwidth h controls how much smoothing we do; smaller values of h place heavier weights on data points near t and much lighter weights on the distant points. The default bandwidth in R is taken to be the standard deviation of k.

For our data here, I took the defaults:
plot(density(r))
The result is seen in Figure 12.1.
I then tried it with a bandwidth of 0.5. See Figure 12.2. This curve oscillates a lot, so an analyst
might think 0.5 is too small. (We are prejudiced here, because we know the true population density
is exponential.)
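The call for that second figure was presumably (our reconstruction, matching the figure's title):

plot(density(r,bw=0.5))   # same data, much smaller bandwidth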
[Figure 12.2: Kernel estimate, bandwidth 0.5; plot of density.default(x = r, bw = 0.5), N = 1000]
12.4.5 Proper Use of Density Estimates
There is no good, practical way to choose a good bin width or bandwidth. Moreover, there is also no good way to form a reasonable confidence band for a density estimate.

So, density estimates should be used as exploratory tools, not as firm bases for decision making. You will probably find it quite unsettling to learn that there is no exact answer to the problem. But that's real life!
12.5 Bayesian Methods
Everyone is entitled to his own opinion, but not his own facts. (Daniel Patrick Moynihan, senator from New York, 1976-2000)

Black cat, white cat, it doesn't matter as long as it catches mice. (Deng Xiaoping, when asked about his plans to give private industry a greater role in China's economy)

Whiskey's for drinkin' and water's for fightin' over. (Mark Twain, on California water jurisdiction battles)
The most controversial topic in statistics by far is that of Bayesian methods. In fact, it is so controversial that a strident Bayesian colleague of mine even took issue with my calling it "controversial"!

The name stems from Bayes' Rule (Section 2.6),

P(A|B) = P(A)P(B|A) / [P(A)P(B|A) + P(not A)P(B|not A)]   (12.87)

No one questions the validity of Bayes' Rule, and thus there is no controversy regarding statistical procedures that make use of probability calculations based on that rule. But the key word is probability. As long as the various terms in (12.87) are real probabilities, there is no controversy.
But instead, the debate stems from the cases in which Bayesians replace some of the probabilities in the theorem with "feelings," i.e. non-probabilities, arising from what they call subjective prior distributions. The key word is then subjective. Our section here will concern the controversy over the use of subjective priors.5
Say we wish to estimate a population mean. Here the Bayesian analyst, before even collecting data, says, "Well, I think the population mean could be 1.2, with probability, oh, let's say 0.28, but on the other hand, it might also be 0.88, with probability, well, I'll put it at 0.49..." etc. This is the analyst's subjective prior distribution for the population mean. The analyst does this before even collecting any data. Note carefully that he is NOT claiming these are real probabilities; he's just trying to quantify his hunches. The analyst then collects the data, and uses some mathematical procedure that combines these "feelings" with the actual data, and which then outputs an estimate of the population mean or other quantity of interest.
The Bayesians justify this by saying one should use all available information, even if it is just a hunch. "The analyst is typically an expert in the field under study. You wouldn't want to throw away his/her expertise, would you?" Moreover, they cite theoretical analyses that show the Bayes estimator doing very well in terms of criteria such as mean squared error, even if the priors are not valid.
The non-Bayesians, known as frequentists, on the other hand dismiss this as unscientific and lacking in impartiality. "In research on a controversial health issue, say, you wouldn't want the researcher to incorporate his/her personal political biases into the number crunching, would you?"

5 By contrast, there is no controversy if the prior makes use of real data. I will explain this in Section 12.5.1.1 below, but in the meantime, note that my use of the term Bayesian refers only to subjective priors.
They also point out that in the real world one must typically perform inference (confidence intervals or significance tests), not just compute point estimates; Bayesian methods are not really suited for inference.
Note carefully the key role of data. One might ask, for instance, "Why this sharp distinction between the Bayesians and the frequentists over the subjectivity issue? Don't the frequentists make subjective decisions too? Consider an analysis of disk drive lifetime data, for instance. Some frequentist statistician might use a normal model, instead of, say, a gamma model. Isn't that subjectivity?" The answer is no, because the statistician can use the data to assess the validity of her model, employing the methods of Section 14.2.
12.5.1 How It Works
To introduce the idea, consider again the example of estimating p, the probability of heads for a certain penny. Suppose we were to say, before tossing the penny even once, "I think p could be any number, but more likely near 0.5, something like a normal distribution with mean 0.5 and standard deviation, oh, let's say 0.1."6 The prior distribution is then N(0.5, 0.1²). But again, note that the Bayesians do not consider it to be a distribution in the sense of probability. It just quantifies our gut feeling here, our hunch.
Nevertheless, in terms of the mathematics involved, it's as if the Bayesians are treating p as random, with p's distribution being whatever the analyst specifies as the prior. Under this "random p" assumption, the Maximum Likelihood Estimate (MLE), for instance, would change. Just as in the frequentist approach, the data here is X, the number of heads we get from n tosses of the penny. But in contrast to the frequentist approach, in which the likelihood would be

L = (n choose X) p^X (1 − p)^{n−X}   (12.88)
it now becomes

L = [1/(√(2π) · 0.1)] exp{−0.5[(p − 0.5)/0.1]²} · (n choose X) p^X (1 − p)^{n−X}   (12.89)
This is basically P(A and B) = P(A) P(B|A), though using a density rather than a probability mass function. We would then find the value of p which maximizes L, and take that as our estimate.
6 Of course, the true value of p is between 0 and 1, while the normal distribution extends from −∞ to ∞. Still, as noted in Section 4.5.2.10, the use of normal distributions is common for modeling many bounded quantities. Nevertheless, many Bayesians prefer to use a beta distribution for the prior in this kind of setting.
A Bayesian would use Bayes' Rule to compute the "distribution" of p given X, called the posterior distribution. The analog of (12.87) would be (12.89) divided by the integral of (12.89) as p ranges from 0 to 1, with the resulting quotient then being treated as a density. The MLE would then be the mode, i.e. the point of maximal density, of the posterior distribution.

But we could use any measure of central tendency, and in fact typically the mean is used, rather than the mode. In other words:
To estimate a population value θ, the Bayesian constructs a prior "distribution" for θ (again, the quotation marks indicate that it is just a quantified gut feeling, rather than a real probability distribution). Then she uses the prior together with the actual observed data to construct the posterior distribution. Finally, she takes her estimate θ̂ to be the mean of the posterior distribution.
Note how this procedure achieves a kind of balance between what our hunch says and what our data say. In (12.89), suppose the mean of the prior for p is 0.5 but n = 20 and X = 12. Then the frequentist estimator would be X/n = 0.6, while the Bayes estimator would be about 0.56. (Computation not shown here.) So our Bayesian approach pulled our estimate away from the frequentist estimate, toward our hunch that p is at or very near 0.5. This "pulling" effect would be stronger for smaller n or for a smaller standard deviation of the prior "distribution."
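The omitted computation is easy to sketch numerically in R (our own illustration): normalize (12.89) into a posterior density and integrate p against it.

n <- 20; x <- 12
post <- function(p) dnorm(p,0.5,0.1) * dbinom(x,n,p)   # prior times likelihood
normconst <- integrate(post,0,1)$value
integrate(function(p) p * post(p),0,1)$value / normconst   # posterior mean
# roughly 0.55, between the hunch 0.5 and the frequentist 0.6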
12.5.1.1 Empirical Bayes Methods
Note carefully that if the prior distribution in our model is not subjective, but is a real distribution verifiable from data, the above analysis on p would not be controversial at all. Say p does vary a substantial amount from one penny to another, so that there is a physical distribution involved. Suppose we have a sample of many pennies, tossing each one n times. If n is very large, we'll get a pretty accurate estimate of the value of p for each coin, and we can then plot these values in a histogram and compare it to the N(0.5, 0.1²) density, to check whether our prior is reasonable. This is called an empirical Bayes model, because we can empirically estimate our prior distribution, and check its validity. In spite of the name, frequentists would not consider this to be Bayesian analysis. Note that we could also assume that p has a general N(μ, σ²) distribution, and estimate μ and σ from the data.
12.5.2 Extent of Usage of Subjective Priors
Though some academics are staunch, often militantly proselytizing Bayesians, only a small minority
of statisticians in practice use the Bayesian approach. It is not mainstream.
One way to see that Bayesian methodology is not mainstream is through the R programming language. For example, as of December 2010, only about 65 of the more than 3000 packages on CRAN, the R repository, involve Bayesian techniques. (See http://cran.r-project.org/web/packages/tgp/index.html.) There is actually a book on the topic, Bayesian Computation with R, by Jim Albert, Springer, 2007, and among those who use Bayesian techniques, many use R for that purpose. However, almost all general-purpose books on R do not cover Bayesian methodology at all.
Significantly, even among Bayesian academics, many use frequentist methods when they work on real, practical problems. Choose a Bayesian academic statistician at random, and you'll likely find on the Web that he/she does not use Bayesian methods when working on real applications.
On the other hand, use of subjective priors has become very common in the computer science
research community. Papers using Bayesian methods appear frequently (no pun intended) in the
CS research literature, and seldom is heard a discouraging word.
12.5.3 Arguments Against Use of Subjective Priors
As noted, most professional statisticians are frequentists. What are the arguments made in this
regard?
Ultimately, the use of any statistical analysis is to make a decision about something. This could be a very formal decision, such as occurs when the Food and Drug Administration (FDA) decides whether to approve a new drug, or it could be informal, for instance when an ordinary citizen reads a newspaper article reporting on a study analyzing data on traffic accidents, and she decides what to conclude from the study.
There is nothing wrong with using one's gut feelings to make a final decision, but they should not be part of the mathematical analysis of the data. One's hunches can play a role in deciding the "preponderance of evidence," as discussed in Section 11.8.5, but that should be kept separate from our data analysis.
If for example the FDA's data shows the new drug to be effective, but at the same time the FDA scientists still have their doubts, they may decide to delay approval of the drug pending further study. So they can certainly act on their hunch, or on non-data information they have, concerning approval of the drug. But the FDA, as a public agency, has a responsibility to the citizenry to state what the data say, i.e. to report the frequentist estimate, rather than merely reporting a number (the Bayesian estimate) that mixes fact and hunch.
In many if not most applications of statistics, there is a need for impartial estimates. As noted above, even if the FDA acts on a hunch to delay approval of a drug in spite of favorable data, the FDA owes the public (and the pharmaceutical firm) an impartial report of what the data say. Bayesian estimation is by definition not impartial. One Bayesian statistician friend put it very well, saying "I believe my own subjective priors, but I don't believe those of other people."
Furthermore, in practice we are typically interested in inference, i.e. confidence intervals and significance tests, rather than just point estimation. We are sampling from populations, and want to be able to legitimately make inferences about those populations. For instance, though one can derive a Bayesian 95% confidence interval for p for our coin, it really has very little meaning, and again is certainly not impartial.
12.5.4 What Would You Do? A Possible Resolution
Consider the following scenario. Steven is running for president. Leo, his campaign manager, has commissioned Lynn to conduct a poll to assess Steven's current support among the voters. Lynn takes her poll, and finds that 57% of those polled support Steven. But her own gut feeling, as an expert in politics, is that Steven's support is only 48%. She then combines these two numbers in some Bayesian fashion, and comes up with 50.2% as her estimate of Steven's support.

So, here the frequentist estimate is 57%, while Lynn's Bayesian estimate is 50.2%.

Lynn then gives Steven only the 50.2% figure, not reporting the 57% number to him. Leo asks Lynn how she arrived at that number, and she explains that she combined her prior distribution with the data.
If you were Leo, what would you do? Consider two choices as to instructions you might give Lynn:

(a) You could say, "Lynn, I trust your judgment, so as the election campaign progresses, always give me only your Bayesian estimate."

(b) You might say, "Lynn, I trust your judgment, but as the election campaign progresses, always give me both your Bayesian estimate and what the impartial data actually say."

I believe that choice (b) is something that both the Bayesian and frequentist camps would generally agree upon.
Exercises
Note to instructor: See the Preface for a list of sources of real data on which exercises can be
assigned to complement the theoretical exercises below.
1. Consider the raffle ticket example in Section 12.1.1. Suppose 500 tickets are sold, and you have data on 8 of them. Continue to assume sampling with replacement. Consider the Maximum Likelihood and Method of Moments estimators.
(a) Find the probability that the MLE is exactly equal to the true value of c.
(b) Find the exact probability that the MLE is within 50 of the true value.
(c) Find the approximate probability that the Method of Moments estimator is within 50 of the
true value.
2. Suppose I = 1 or 0, with probability p and 1−p, respectively. Given I, X has a Poisson distribution with mean λ_I. Suppose we have X_1, ..., X_n, a random sample of size n from the (unconditional) distribution of X. (We do not know the associated values of I, i.e. I_1, ..., I_n.) This kind of situation occurs in various applications. The key point is the effect of the unseen variable. In terms of estimation, note that there are three parameters to be estimated.
(a) Set up the likelihood function, which if maximized with respect to the three parameters would yield the MLEs for them.

(b) The words if and would in that last sentence allude to the fact that the MLEs cannot be derived in closed form. However, R's mle() function can be used to find their values numerically. Write R code to do this. In other words, write a function with a single argument x, representing the X_i, and returning the MLEs for the three parameters.
3. Find the Method of Moments and Maximum Likelihood estimators of the following parameters in famous distribution families:

• p in the binomial family (n known)

• p in the geometric family

• μ in the normal family (σ known)

• λ in the Poisson family
4. For each of the following quantities, state whether the given estimator is unbiased in the given context:

(a) (4.15), p. 97, as an estimator of σ²

(b) p̂, as an estimator of p, p. 105

(c) p̂(1 − p̂), as an estimator of p(1−p), p. 105

(d) X̄ − Ȳ, as an estimator of μ_1 − μ_2, p. 107

(e) (1/n) Σ_{i=1}^n (X_i − μ_1)², assuming μ_1 is known, as an estimator of σ_1², p. 107

(f) X̄, as an estimator of μ_1, p. 107, but sampling (from the population of Davis) without replacement
5. Consider the Method of Moments Estimator ĉ in the raffle example, Section 12.1.1. Find the exact value of Var(ĉ). Use the facts that 1 + 2 + ... + r = r(r + 1)/2 and 1² + 2² + ... + r² = r(r + 1)(2r + 1)/6.
6. Suppose W has a uniform distribution on (−c,c), and we draw a random sample of size n, W_1, ..., W_n. Find the Method of Moments and Maximum Likelihood estimators. (Note that in the Method of Moments case, the first moment won't work.)
7. An urn contains ω marbles, one of which is black and the rest of which are white. We draw marbles from the urn one at a time, without replacement, until we draw the black one; let N denote the number of draws needed. Find the Method of Moments estimator of ω based on N.
8. Suppose X_1, ..., X_n are uniformly distributed on (0,c). Find the Method of Moments and Maximum Likelihood estimators of c, and compare their mean squared error.
Hint: You will need the density of M = max_i X_i. Derive this by noting that M ≤ t if and only if X_i ≤ t for all i = 1,2,...,n.
9. Add a single line to the code on page 226 that will print out the estimated value of Var(W).
10. In the raffle example, Section 12.1.1, find a (1 − α)% confidence interval for c based on ĉ, the Maximum Likelihood Estimate of c.
11. In many applications, observations come in correlated clusters. For instance, we may sample r trees at random, then s leaves within each tree. Clearly, leaves from the same tree will be more similar to each other than leaves on different trees.

In this context, suppose we have a random sample X_1, ..., X_n, n even, such that there is correlation within pairs. Specifically, suppose the pair (X_{2i+1}, X_{2i+2}) has a bivariate normal distribution with mean (μ, μ) and covariance matrix

( 1  ρ )
( ρ  1 )   (12.90)

i = 0,...,n/2−1, with the n/2 pairs being independent. Find the Method of Moments estimators of μ and ρ.
12. Suppose we have a random sample X_1, ..., X_n from some population in which EX = μ and Var(X) = σ². Let X̄ = (X_1 + ... + X_n)/n be the sample mean. Suppose the data points X_i are collected by a machine, and that due to a defect, the machine always records the last number as 0, i.e. X_n = 0. Each of the other X_i is distributed as the population, i.e. each has mean μ and variance σ². Find the mean squared error of X̄ as an estimator of μ, separating the MSE into variance and squared bias components as in Section 12.2.
13. Suppose we have a random sample X_1, ..., X_n from a population in which X is uniformly distributed on the region (0, 1) ∪ (2, c) for some unknown c > 2. Find closed-form expressions for the Method of Moments and Maximum Likelihood Estimators, to be denoted by T_1 and T_2, respectively.
Chapter 13
Advanced Statistical Estimation and
Inference
13.1 Slutsky's Theorem

(The reader should review Section 4.5.2.9 before continuing.)

Since one generally does not know the value of $\sigma$ in (10.23), we replace it by s, yielding (10.24). Why was that legitimate?
The answer depends on the theorem below. First, we need a definition.

Definition 30 We say that a sequence of random variables $L_n$ converges in probability to the random variable L if for every $\epsilon > 0$,

$$\lim_{n \rightarrow \infty} P(|L_n - L| > \epsilon) = 0 \qquad (13.1)$$
This is a little weaker than convergence with probability 1, as in the Strong Law of Large Numbers
(SLLN, Section 3.19). Convergence with probability 1 implies convergence in probability but not
vice versa.
So for example, if $Q_1, Q_2, Q_3, ...$ are i.i.d. with mean $\mu$, then the SLLN implies that

$$L_n = \frac{Q_1 + ... + Q_n}{n} \qquad (13.2)$$

converges with probability 1 to $\mu$, and thus $L_n$ converges in probability to $\mu$ too.
13.1.1 The Theorem
Theorem 31 Slutsky's Theorem (abridged version): Consider random variables $X_n$, $Y_n$, and X, such that $X_n$ converges in distribution to X and $Y_n$ converges in probability to a constant c. Then:

(a) $X_n + Y_n$ converges in distribution to X + c.

(b) $X_n/Y_n$ converges in distribution to X/c.
13.1.2 Why It's Valid to Substitute s for $\sigma$
We now return to the question raised above. In our context here, we take

$$X_n = \frac{\overline{W} - \mu}{\sigma/\sqrt{n}} \qquad (13.3)$$

$$Y_n = \frac{s}{\sigma} \qquad (13.4)$$

We know that (13.3) converges in distribution to N(0,1) while (13.4) converges in probability to 1. Thus for large n, we have that

$$\frac{\overline{W} - \mu}{s/\sqrt{n}} \qquad (13.5)$$

has an approximate N(0,1) distribution, so that (10.24) is valid.
13.1.3 Example: Confidence Interval for a Ratio Estimator

Again consider the example in Section 10.4.1 of weights of men and women in Davis, but this time suppose we wish to form a confidence interval for the ratio of the means,

$$\gamma = \frac{\mu_1}{\mu_2} \qquad (13.6)$$

Again, the natural estimator is

$$\hat{\gamma} = \frac{\overline{X}}{\overline{Y}} \qquad (13.7)$$

How can we construct a confidence interval from this estimator? If it were a linear combination of $\overline{X}$ and $\overline{Y}$, we'd have no problem, since a linear combination of multivariate normal random variables is again normal.

That is not exactly the case here, but it's close. Since $\overline{Y}$ converges in probability to $\mu_2$, Slutsky's Theorem (Section 13.1) tells us that the problem here really is one of such a linear combination. We can form a confidence interval for $\mu_1$, then divide both endpoints of the interval by $\overline{Y}$, yielding a confidence interval for $\gamma$.
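To make this concrete, here is a minimal R sketch of the procedure just described; the simulated data, sample sizes and the 1.96 multiplier for a 95% level are illustrative assumptions, not part of the original example.

x <- rnorm(100, mean=180, sd=25)   # illustrative sample for the numerator mean
y <- rnorm(100, mean=130, sd=20)   # illustrative sample for the denominator mean
# 95% confidence interval for E(X), as in (10.24)
ci.mu1 <- mean(x) + c(-1.96, 1.96) * sd(x) / sqrt(length(x))
# divide both endpoints by the sample mean of Y, per Slutsky's Theorem
ci.gamma <- ci.mu1 / mean(y)
ci.gamma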
13.2 The Delta Method: Confidence Intervals for General Functions of Means or Proportions
The delta method is a great way to derive asymptotic distributions of quantities that are functions
of random variables whose asymptotic distributions are already known.
13.2.1 The Theorem
Theorem 32 Suppose $R_1, ..., R_k$ are estimators of $\theta_1, ..., \theta_k$ based on a random sample of size n. Let R denote the vector whose components are the $R_i$, and let $\theta$ denote the corresponding vector for the $\theta_i$. Suppose the random vector

$$\sqrt{n} \, (R - \theta) = \sqrt{n} \begin{pmatrix} R_1 - \theta_1 \\ R_2 - \theta_2 \\ ... \\ R_k - \theta_k \end{pmatrix} \qquad (13.8)$$

is known to have an asymptotically multivariate normal distribution with mean 0 and nonsingular covariance matrix $\Sigma = (\sigma_{ij})$.
Let h be a smooth scalar function$^1$ of k variables, with $h_i$ denoting its $i$th partial derivative. Consider
the random variable

$$Y = h(R_1, ..., R_k) \qquad (13.10)$$

Then $\sqrt{n} \, [Y - h(\theta_1, ..., \theta_k)]$ converges in distribution to a normal distribution with mean 0 and variance

$$[\nu_1, ..., \nu_k] \; \Sigma \; [\nu_1, ..., \nu_k]' \qquad (13.11)$$

provided not all of

$$\nu_i = h_i(\theta_1, ..., \theta_k), \; i = 1, ..., k \qquad (13.12)$$

are 0.

$^1$The word "smooth" here refers to mathematical conditions such as existence of derivatives, which we will not worry about here. Similarly, the reason that we multiply by $\sqrt{n}$ is also due to theoretical considerations we will not go into here, other than to note that it is related to the formal statement of the Central Limit Theorem in Section 4.5.2.9. If we replace $X_1 + ... + X_n$ in (4.60) by $n\overline{X}$, we get

$$Z = \sqrt{n} \; \frac{\overline{X} - m}{v} \qquad (13.9)$$
Informally, the theorem says, with R, $\theta$, $\Sigma$, h() and Y defined above:

Suppose R is asymptotically multivariate normally distributed with mean $\theta$ and covariance matrix $\Sigma/n$. Then Y will be approximately normal with mean $h(\theta_1, ..., \theta_k)$ and variance 1/n times (13.11).
Note carefully that the theorem is not saying, for example, that $E[h(R)] = h(\theta)$ for fixed, finite n, which is not true. Nor is it saying that h(R) is normally distributed, which is definitely not true; recall for instance that if X has a N(0,1) distribution, then $X^2$ has a chi-square distribution with one degree of freedom, hardly the same as N(0,1). But the theorem says that for the purpose of asymptotic distributions, we can operate as if these things were true.
The theorem can be used to form confidence intervals for $h(\theta_1, ..., \theta_k)$, because it provides us with a standard error (Section 10.5):

$$\text{std. err. of } h(R) = \sqrt{\frac{1}{n} \, [\nu_1, ..., \nu_k] \; \Sigma \; [\nu_1, ..., \nu_k]'} \qquad (13.13)$$
Of course, these quantities are typically estimated from the sample, e.g.

$$\hat{\nu}_i = h_i(R_1, ..., R_k) \qquad (13.14)$$

So, our approximate 95% confidence interval for $h(\theta_1, ..., \theta_k)$ is

$$h(R_1, ..., R_k) \pm 1.96 \sqrt{\frac{1}{n} \, [\hat{\nu}_1, ..., \hat{\nu}_k] \; \hat{\Sigma} \; [\hat{\nu}_1, ..., \hat{\nu}_k]'} \qquad (13.15)$$
Note that here we are considering scalar functions h(), but the theorem can easily be extended to
vector-valued h().
Now, how is the theorem derived?

Proof

We'll cover the case k = 1 (dropping the subscript 1 for convenience).
The intuitive version of the proof cites the fact from calculus$^2$ that a curve is close to its tangent line if we are close to the point of tangency. Here that means

$$h(R) \approx h(\theta) + h'(\theta)(R - \theta) \qquad (13.16)$$

if R is near $\theta$, which will be the case for large n. Note that in the right-hand side of (13.16), the only random quantity is R; the rest are constants. In other words, the right-hand side has the form c+dQ, where Q is approximately normal. Since a linear function of a normally distributed random variable itself has a normal distribution, (13.16) implies that h(R) is approximately normal with mean $h(\theta)$ and variance $[h'(\theta)]^2 \, Var(R)$.
Reasoning more carefully, recall the Mean Value Theorem from calculus:

$$h(R) = h(\theta) + h'(W)(R - \theta) \qquad (13.17)$$

for some W between $\theta$ and R. Rewriting this, we have

$$\sqrt{n} \, [h(R) - h(\theta)] = \sqrt{n} \; h'(W)(R - \theta) \qquad (13.18)$$
It can be shown (and should be intuitively plausible to you) that if a sequence of random variables converges in distribution to a constant, the convergence is in probability too. So, $R - \theta$ converges in probability to 0, forcing W to converge in probability to $\theta$. Then from Slutsky's Theorem, the asymptotic distribution of (13.18) is the same as that of $\sqrt{n} \; h'(\theta)(R - \theta)$. The result follows.

$^2$This is where the "delta" in the name of the method comes from, an allusion to the fact that derivatives are limits of difference quotients.
13.2.2 Example: Square Root Transformation
Here is an example of the delta method with k = 1. It will be a rather odd example, in that our goal is actually not to form a confidence interval for anything, but it will illustrate how the delta method is used.

It used to be common, and to some degree is still common today, for statistical analysts to apply a square-root transformation to Poisson data. The delta method sheds light on the motivation for this, as follows.
First, note that we cannot even apply the delta method unless we have approximately normally distributed inputs, i.e. the $R_i$ in the theorem. But actually, any Poisson-distributed random variable T is approximately normally distributed if its mean, $\lambda$, is large. To see this, recall from Section 9.4.4.2 that sums of independent Poisson random variables are themselves Poisson distributed. So, if for instance ET is an integer k, then T has the same distribution as

$$U_1 + ... + U_k \qquad (13.19)$$

where the $U_i$ are i.i.d. Poisson random variables each having mean 1. By the Central Limit Theorem, T then has an approximate normal distribution, with mean $\lambda$ and variance $\lambda$. (This is not quite a rigorous argument, so our treatment here is informal.)
Now that we know that T is approximately normal, we can apply the delta method. So, what h() should we use? The pioneers of statistics chose $h(t) = \sqrt{t}$. Let's see why.

Set $Y = h(T) = \sqrt{T}$ (so that T is playing the role of R in the theorem). Here $\theta$ is $ET = \lambda$.

We have $h'(t) = 1/(2\sqrt{t})$. Then the delta method says that since T is approximately normally distributed with mean $\lambda$ and variance $\lambda$, Y too has an approximate normal distribution, with mean

$$h(\lambda) = \sqrt{\lambda} \qquad (13.20)$$
What about the variance? Well, in one dimension, (13.11) reduces to

$$\nu^2 \, Var(R) \qquad (13.21)$$

so we have

$$[h'(\lambda)]^2 \, Var(T) = \left( \frac{1}{2\sqrt{t}} \Big|_{t=\lambda} \right)^2 \lambda = \frac{1}{4\lambda} \cdot \lambda = \frac{1}{4} \qquad (13.22)$$

So, the (asymptotic) variance of $\sqrt{T}$ is a constant, independent of $\lambda$, and we say that the square root function is a variance stabilizing transformation. This becomes relevant in regression analysis, where, as we will discuss in Chapter 15, a classical assumption is that a certain collection of random variables all have the same variance. If those random variables are Poisson-distributed, then their square roots will all have approximately the same variance.
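The variance-stabilizing property is easy to check empirically. Here is a minimal R sketch (the particular $\lambda$ values and the number of replicates are arbitrary choices for illustration); for each $\lambda$, the sample variance of $\sqrt{T}$ should come out near 1/4.

# check that Var(sqrt(T)) is roughly 1/4 for Poisson T, whatever lambda is
for (lambda in c(10, 50, 100, 500)) {
   tsamp <- rpois(100000, lambda)
   cat("lambda =", lambda, " var(sqrt(T)) =", var(sqrt(tsamp)), "\n")
}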
13.2.3 Example: Confidence Interval for $\sigma^2$
Recall that in Section 12.2.2 we noted that (10.25) is only an approximate confidence interval for the mean. An exact interval is available using the Student t-distribution, if the population is normally distributed. We pointed out that (10.25) is very close to the exact interval for even moderately large n anyway, and since no population is exactly normal, (10.25) is good enough. Note that one of the implications of this and the fact that (10.25) did not assume any particular population distribution is that a Student-t based confidence interval works well even for non-normal populations. We say that the Student-t interval is robust to the normality assumption.

But what about a confidence interval for a variance? It can be shown that one can form an exact interval based on the chi-square distribution, if the population is normal. In this case, though, the interval does NOT work well for non-normal populations; it is NOT robust to the normality assumption. So, let's derive an interval that doesn't assume normality; we'll use the delta method. (Warning: This will be a lengthy derivation, but it will cause you to review many concepts, which is good.)
As before, say we have $W_1, ..., W_n$, a random sample from our population, and with W representing a random variable having the population distribution. Write

$$\sigma^2 = E(W^2) - (EW)^2 \qquad (13.23)$$

and from (10.21) write our estimator of $\sigma^2$ as

$$s^2 = \frac{1}{n} \sum_{i=1}^n W_i^2 - \overline{W}^2 \qquad (13.24)$$
This suggests how we can use the delta method. We define

$$R_1 = \overline{W} \qquad (13.25)$$

$$R_2 = \frac{1}{n} \sum_{i=1}^n W_i^2 \qquad (13.26)$$

$R_1$ is an estimator of EW, and $R_2$ estimates $E(W^2)$. Furthermore, we'll see below that $R_1$ and $R_2$ are approximately bivariate normal, by the multivariate Central Limit Theorem, so we can use the delta method.

And most importantly, our estimator of interest, $s^2$, is a function of $R_1$ and $R_2$:

$$s^2 = R_2 - R_1^2 \qquad (13.27)$$
So, we take our function h to be

$$h(u,v) = -u^2 + v \qquad (13.28)$$
Now we must find $\Sigma$ in the theorem. That means we'll need the covariance matrix of $R_1$ and $R_2$. But since
$$\begin{pmatrix} R_1 \\ R_2 \end{pmatrix} = \frac{1}{n} \sum_{i=1}^n \begin{pmatrix} W_i \\ W_i^2 \end{pmatrix} \qquad (13.29)$$
we can derive the covariance matrix of $R_1$ and $R_2$, as follows.
Remember, the covariance matrix is the multidimensional analog of variance. So, after reviewing
the reasoning in (10.16), we have in the vector-valued version of that derivation that
$$Cov\left[ \begin{pmatrix} R_1 \\ R_2 \end{pmatrix} \right] = \frac{1}{n^2} \, Cov\left[ \sum_{i=1}^n \begin{pmatrix} W_i \\ W_i^2 \end{pmatrix} \right] \qquad (13.30)$$

$$= \frac{1}{n^2} \sum_{i=1}^n Cov\left[ \begin{pmatrix} W_i \\ W_i^2 \end{pmatrix} \right] \qquad (13.31)$$

$$= \frac{1}{n^2} \sum_{i=1}^n Cov\left[ \begin{pmatrix} W \\ W^2 \end{pmatrix} \right] \qquad (13.32)$$

$$= \frac{1}{n} \, Cov\left[ \begin{pmatrix} W \\ W^2 \end{pmatrix} \right] \qquad (13.33)$$
So

$$\Sigma = Cov\left[ \begin{pmatrix} W \\ W^2 \end{pmatrix} \right] \qquad (13.34)$$
Now we must estimate $\Sigma$. Taking sample analogs of (7.52), we set

$$\hat{\Sigma} = \frac{1}{n} \sum_{i=1}^n \begin{pmatrix} W_i \\ W_i^2 \end{pmatrix} (W_i, W_i^2) - R \, R' = \frac{1}{n} \sum_{i=1}^n \begin{pmatrix} W_i^2 & W_i^3 \\ W_i^3 & W_i^4 \end{pmatrix} - R \, R' \qquad (13.35)$$

where $R = (R_1, R_2)'$.
Also, $h'(u,v) = (-2u, 1)'$, so

$$h'(R_1, R_2) = (-2R_1, 1)' \qquad (13.36)$$
Whew! We're done. We can now plug everything into (13.15).
Note that all these quantities are expressions in $E(W^k)$ for various k. It should be noted that estimating means of higher powers of a random variable requires larger samples in order to achieve comparable accuracy. Our confidence interval here may need a rather large sample to be accurate, as opposed to the situation with (10.25), in which even n = 20 should work well.
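To pull the pieces together, here is a minimal R sketch of the derivation above; the function name and the simulated exponential data are illustrative assumptions only.

# delta-method confidence interval for the population variance,
# following (13.25)-(13.36); w is the data vector
var.ci <- function(w, conf=0.95) {
   n <- length(w)
   r <- c(mean(w), mean(w^2))       # R1 and R2
   s2 <- r[2] - r[1]^2              # s^2 = R2 - R1^2, as in (13.27)
   # estimated Sigma, as in (13.35)
   sigmahat <- rbind(c(mean(w^2), mean(w^3)),
                     c(mean(w^3), mean(w^4))) - r %*% t(r)
   hprime <- c(-2*r[1], 1)          # h'(R1,R2), as in (13.36)
   se <- sqrt(t(hprime) %*% sigmahat %*% hprime / n)   # as in (13.13)
   z <- qnorm(1 - (1 - conf)/2)
   c(s2 - z*se, s2 + z*se)
}
var.ci(rexp(1000))   # illustration; the true variance here is 1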
13.2.4 Example: Confidence Interval for a Measurement of Prediction Ability
Suppose we have a random sample $X_1, ..., X_n$ from some population. In other words, the $X_i$ are independent and each is distributed as in the population. Let X represent a generic random variable having that distribution. Here we are allowing the $X_i$ and X to be random vectors, though they won't play much explicit role anyway.
Let A and B be events associated with X. If for example X is a random vector (U,V), we might have A and B being the events U > 12 and U-V < 5. The question of interest here will be to what extent we can predict A from B.

One measure of that might be the quantity $\eta = P(A|B) - P(A)$. The larger $\eta$ is (in absolute value), the stronger the ability of B to predict A. (We could look at variations of this, such as the quotient of those two probabilities, but will not do so here.)
Let's use the delta method to derive an approximate 95% confidence interval for $\eta$. To that end, think of four categories: A and B; A and not B; not A and B; and not A and not B. Each $X_i$ falls into one of those categories, so the four-component vector Y consisting of counts of the numbers of $X_i$ falling into the four categories has a multinomial distribution with r = 4.
To use the theorem, set R = Y/n, so that R is the vector of the sample proportions. For instance, $R_1$ will be the number of $X_i$ satisfying both events A and B, divided by n. The vector $\theta$ will then be the corresponding vector of population proportions, so that for instance

$$\theta_2 = P(\text{A and not B}) \qquad (13.37)$$
We are interested in

$$\eta = P(A|B) - P(A) \qquad (13.38)$$

$$= \frac{P(\text{A and B})}{P(\text{A and B}) + P(\text{not A and B})} - \left[ P(\text{A and B}) + P(\text{A and not B}) \right] \qquad (13.39)$$

$$= \frac{\theta_1}{\theta_1 + \theta_3} - (\theta_1 + \theta_2) \qquad (13.40)$$
By the way, since $\theta_4$ is not involved, let's shorten R to $(R_1, R_2, R_3)'$.
What about $\Sigma$? Since Y is multinomial, Equation (8.113) provides us $\Sigma$:

$$\Sigma = \frac{1}{n} \begin{pmatrix} \theta_1(1-\theta_1) & -\theta_1\theta_2 & -\theta_1\theta_3 \\ -\theta_1\theta_2 & \theta_2(1-\theta_2) & -\theta_2\theta_3 \\ -\theta_1\theta_3 & -\theta_2\theta_3 & \theta_3(1-\theta_3) \end{pmatrix} \qquad (13.41)$$
We then get $\hat{\Sigma}$ by substituting $R_i$ for $\theta_i$. After deriving the $\nu_i$ from (13.40), we make the same substitution there, and then compute (13.15).
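Here is a minimal R sketch of that computation; the function name and the example counts are hypothetical, and the gradient below is just the vector of partial derivatives of (13.40).

# delta-method CI for eta = P(A|B) - P(A), from the four category counts
# y = c(#(A and B), #(A and not B), #(not A and B), #(not A and not B))
eta.ci <- function(y, conf=0.95) {
   n <- sum(y)
   th <- (y/n)[1:3]       # R1, R2, R3; the fourth proportion is not needed
   eta <- th[1]/(th[1]+th[3]) - (th[1]+th[2])
   # gradient of (13.40) with respect to (theta1,theta2,theta3)
   nu <- c(th[3]/(th[1]+th[3])^2 - 1, -1, -th[1]/(th[1]+th[3])^2)
   # multinomial covariance matrix (13.41), with R_i substituted for theta_i
   sig <- (diag(th) - th %*% t(th)) / n
   se <- sqrt(t(nu) %*% sig %*% nu)
   z <- qnorm(1 - (1 - conf)/2)
   c(eta - z*se, eta + z*se)
}
eta.ci(c(30, 20, 25, 25))   # hypothetical counts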
13.3 Simultaneous Condence Intervals
Suppose in our study of heights, weights and so on of people in Davis, we are interested in estimating
a number of dierent quantities, with our forming a condence interval for each one. Though our
condence level for each one of them will be 95%, our overall condence level will be less than that.
In other words, we cannot say we are 95% condent that all the intervals contain their respective
population values.
In some cases we may wish to construct condence intervals in such a way that we can say we are
95% condent that all the intervals are correct. This branch of statistics is known as simultaneous
inference or multiple inference.
Usually this kind of methodology is used in the comparison of several treatments. Though this term originated in the life sciences, e.g. comparing the effectiveness of several different medications for controlling hypertension, it can be applied in any context. For instance, we might be interested in comparing how well programmers do in several different programming languages, say Python, Ruby and Perl. We'd form three groups of programmers, one for each language, with say 20 programmers per group. Then we would have them write code for a given application. Our measurement could be the length of time T that it takes for them to develop the program to the point at which it runs correctly on a suite of test cases.
Let $T_{ij}$ be the value of T for the $j$th programmer in the $i$th group, i = 1,2,3, j = 1,2,...,20. We would then wish to compare the three "treatments," i.e. programming languages, by estimating $\mu_i = ET_{i1}$, i = 1,2,3. Our estimators would be $U_i = \sum_{j=1}^{20} T_{ij}/20$, i = 1,2,3. Since we are comparing the three population means, we may not be satisfied with simply forming ordinary 95% confidence intervals for each mean. We may wish to form confidence intervals which jointly have confidence level 95%.$^3$

$^3$The word "may" is important here. It really is a matter of philosophy as to whether one uses simultaneous inference procedures.
Note very, very carefully what this means. As usual, think of our notebook idea. Each line of the notebook would contain the 60 observations; different lines would involve different sets of 60 people. So, there would be 60 columns for the raw data, and three columns for the $U_i$. We would also have six more columns for the confidence intervals (lower and upper bounds) for the $\mu_i$. Finally, imagine three more columns, one for each confidence interval, with the entry for each being either Right or Wrong. A confidence interval is labeled Right if it really does contain its target population value, and otherwise is labeled Wrong.

Now, if we construct individual 95% confidence intervals, that means that in a given Right/Wrong column, in the long run 95% of the entries will say Right. But for simultaneous intervals, we hope that within a line we see three Rights, and that 95% of all lines will have that property.
In our context here, if we set up our three intervals to have individual confidence levels of 95%, their simultaneous level will be $0.95^3 = 0.86$, since the three confidence intervals are independent. Conversely, if we want a simultaneous level of 0.95, we could take each one at a 98.3% level, since $0.95^{1/3} \approx 0.983$.
However, in general the intervals we wish to form will not be independent, so the above "cube root method" would not work. Here we will give a short introduction to more general procedures.

Note that nothing in life is free. If we want simultaneous confidence intervals, they will be wider.

Another reason to form simultaneous confidence intervals is that it gives you "license to browse," i.e. to rummage through the data looking for interesting nuggets.
13.3.1 The Bonferroni Method

One simple approach is Bonferroni's Inequality:

Lemma 33 Suppose $A_1, ..., A_g$ are events. Then

$$P(A_1 \text{ or ... or } A_g) \leq \sum_{i=1}^g P(A_i) \qquad (13.42)$$
You can easily see this for g = 2:

$$P(A_1 \text{ or } A_2) = P(A_1) + P(A_2) - P(A_1 \text{ and } A_2) \leq P(A_1) + P(A_2) \qquad (13.43)$$
One can then prove the general case by mathematical induction.
Now to apply this to forming simultaneous confidence intervals, take $A_i$ to be the event that the $i$th confidence interval is incorrect, i.e. fails to include the population quantity being estimated. Then (13.42) says that if, say, we form two confidence intervals, each having individual confidence level (100-5/2)%, i.e. 97.5%, then the overall collective confidence level for those two intervals is at least 95%. Here's why: Let $A_1$ be the event that the first interval is wrong, and $A_2$ the corresponding event for the second interval. Then

$$\text{overall conf. level} = P(\text{not } A_1 \text{ and not } A_2) \qquad (13.44)$$
$$= 1 - P(A_1 \text{ or } A_2) \qquad (13.45)$$
$$\geq 1 - P(A_1) - P(A_2) \qquad (13.46)$$
$$= 1 - 0.025 - 0.025 \qquad (13.47)$$
$$= 0.95 \qquad (13.48)$$
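As a concrete illustration, here is a minimal R sketch that forms Bonferroni-adjusted intervals for the means of g groups; the three simulated samples and the 0.95 overall target are assumptions made just for the example.

# Bonferroni: for overall level 0.95 with g intervals, give each
# interval the individual level 1 - 0.05/g
bonf.ci <- function(xlist, overall=0.95) {
   g <- length(xlist)
   alpha <- (1 - overall)/g
   z <- qnorm(1 - alpha/2)
   t(sapply(xlist, function(x)
      mean(x) + c(-z, z) * sd(x)/sqrt(length(x))))
}
bonf.ci(list(rnorm(20, 5), rnorm(20, 6), rnorm(20, 7)))   # three hypothetical samples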
13.3.2 Scheffé's Method

The Bonferroni method is unsuitable for more than a few intervals; each one would have to have such a high individual confidence level that the intervals would be very wide. Many alternatives exist, a famous one being Scheffé's method.$^4$
Theorem 34 Suppose $R_1, ..., R_k$ have an approximately multivariate normal distribution, with mean vector $\theta = (\theta_i)$ and covariance matrix $\Sigma = (\sigma_{ij})$. Let $\hat{\Sigma}$ be a consistent estimator of $\Sigma$, meaning that it converges in probability to $\Sigma$ as the sample size goes to infinity.

For any constants $c_1, ..., c_k$, consider linear combinations of the $R_i$,

$$\sum_{i=1}^k c_i R_i \qquad (13.49)$$

which estimate

$$\sum_{i=1}^k c_i \theta_i \qquad (13.50)$$

Form the confidence intervals

$$\sum_{i=1}^k c_i R_i \pm \sqrt{k \, \chi^2_{\alpha;k}} \; s(c_1, ..., c_k) \qquad (13.51)$$

where

$$[s(c_1, ..., c_k)]^2 = (c_1, ..., c_k)^T \, \hat{\Sigma} \, (c_1, ..., c_k) \qquad (13.52)$$

and where $\chi^2_{\alpha;k}$ is the upper-$\alpha$ percentile of a chi-square distribution with k degrees of freedom.$^5$

Then all of these intervals (for infinitely many values of the $c_i$!) have simultaneous confidence level $1 - \alpha$.

$^4$The name is pronounced "sheh-FAY."

$^5$Recall that the distribution of the sum of squares of g independent N(0,1) random variables is called chi-square with g degrees of freedom. It is tabulated in the R statistical package's function qchisq().
By the way, if we are interested in only constructing confidence intervals for contrasts, i.e. $c_i$ having the property that $\sum_i c_i = 0$, the number of degrees of freedom reduces to k-1, thus producing narrower intervals.

Just as in Section 12.2.2 we avoided the t-distribution, here we have avoided the F distribution, which is used instead of chi-square in the exact form of Scheffé's method.
13.3.3 Example
For example, again consider the Davis heights example in Section 10.7. Suppose we want to find approximate 95% confidence intervals for two population quantities, $\mu_1$ and $\mu_2$. These correspond to values of $(c_1, c_2)$ of (1,0) and (0,1). Since the two samples are independent, $\sigma_{12} = 0$. The chi-square value is 5.99,$^6$ so the square root in (13.51) is 3.46. So, we would compute (10.25) for $\overline{X}$ and then for $\overline{Y}$, but would use 3.46 instead of 1.96.

This actually is not as good as Bonferroni in this case. For Bonferroni, we would find two 97.5% confidence intervals, which would use 2.24 instead of 1.96.

Scheffé's method is too conservative if we just are forming a small number of intervals, but it is great if we form a lot of them. Moreover, it is very general, usable whenever we have a set of approximately normal estimators.

$^6$Obtained from R via qchisq(0.95,2).
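The two multipliers just mentioned come directly from qchisq() and qnorm(); here is a minimal sketch of the comparison for k = 2 intervals at an overall 95% level.

k <- 2
sqrt(k * qchisq(0.95, k))    # Scheffe multiplier: sqrt(2*5.99) = 3.46
qnorm(1 - 0.05/(2*k))        # Bonferroni multiplier for two 97.5% intervals: 2.24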
13.3.4 Other Methods for Simultaneous Inference
There are many other methods for simultaneous inference. It should be noted, though, that many of them are limited in scope, in contrast to Scheffé's method, which is usable whenever one has multivariate normal estimators, and Bonferroni's method, which is universally usable.
13.4 The Bootstrap Method for Forming Confidence Intervals

Many statistical applications can be quite complex, which makes them very difficult to analyze mathematically. Fortunately, there is a fairly general method for finding confidence intervals called the bootstrap. Here is a brief overview of the type of bootstrap confidence interval construction called Efron's percentile method.
13.4.1 Basic Methodology
Say we are estimating some population value $\theta$ based on i.i.d. random variables $Q_i$, i = 1,...,n. Note that $\theta$ and the $Q_i$ could be vector-valued.

Our estimator of $\theta$ is of course some function of the $Q_i$, $h(Q_1, ..., Q_n)$. For example, if we are estimating a population mean by a sample mean, then the function h() is defined by

$$h(u_1, ..., u_n) = \frac{u_1 + ... + u_n}{n} \qquad (13.53)$$
Our procedure is as follows:

• Estimate $\theta$ based on the original sample, i.e. set

$$\hat{\theta} = h(Q_1, ..., Q_n) \qquad (13.54)$$

• For j = 1,2,...,k:

  - Resample, i.e. create a new sample, $\tilde{Q}_1, ..., \tilde{Q}_n$, by drawing n times with replacement from $Q_1, ..., Q_n$.

  - Calculate the value of $\hat{\theta}$ based on the $\tilde{Q}_i$ instead of the $Q_i$, i.e. set

$$\tilde{\theta}_j = h(\tilde{Q}_1, ..., \tilde{Q}_n) \qquad (13.55)$$

• Sort the values $\tilde{\theta}_j$, j = 1,...,k, and let $\tilde{\theta}_{(i)}$ denote the $i$th-smallest of them.

• Let A and B denote the 0.025 and 0.975 quantiles of the $\tilde{\theta}_j - \hat{\theta}$, i.e.

$$A = \tilde{\theta}_{(0.025k)} - \hat{\theta} \;\text{ and }\; B = \tilde{\theta}_{(0.975k)} - \hat{\theta} \qquad (13.56)$$

(The quantities 0.025k and 0.975k must be rounded, say to the nearest integer in the range 1,...,k.)

• Then your approximate 95% confidence interval for $\theta$ is

$$(\hat{\theta} - B, \; \hat{\theta} - A) \qquad (13.57)$$
13.4.2 Example: Confidence Intervals for a Population Variance

As noted in Section 13.2.3, the classical chi-square method for finding a confidence interval for a population variance $\sigma^2$ is not robust to the assumption of a normally distributed parent population. In that section, we showed how to find the desired confidence interval using the delta method.

That was a solution, but the derivation was complex. An alternative would be to use the bootstrap. We resample many times, calculate the sample variance on each of the new samples, and then form a confidence interval for $\sigma^2$ as in (13.56). We show the details using R in Section 13.4.3.
13.4.3 Computation in R
R includes the boot() function to do the mechanics of this for us. To illustrate its usage, let's consider finding a confidence interval for the population variance $\sigma^2$, based on the sample variance, $s^2$. Here is the code:
# R base doesn't include the boot package, so we must load it
library(boot)
# finds the sample variance on the resampled data x[inds]
s2 <- function(x,inds) {
   return(var(x[inds]))
}
bt <- boot(x,s2,R=200)
# approximate 95% confidence interval for the population variance
cilow <- quantile(bt$t,0.025)
cihi <- quantile(bt$t,0.975)
How does this work? The line
bt <- boot(x,s2,R=200)
instructs R to apply the bootstrap to the data set x, with the statistic of interest being specified
by the user in the function s2(). The argument R here is what we called k in Section 13.4.1 above,
i.e. the number of times we resample n items from x.
Our argument inds in s2() is less obvious. Here's what happens: As noted, the boot() function merely shortens our work. Without it, we could simply call sample() to do our resampling. Say for simplicity that n is 4. We might make the call

j <- sample(1:4,replace=T)

and j might turn out to be, say, c(4,1,3,3). We would then apply the statistic to be bootstrapped, in our case here the sample variance, to the data x[4], x[1], x[3], x[3], more compactly and efficiently expressed as x[c(4,1,3,3)]. That's what boot() does for us. So, in our example above, the argument inds would be c(4,1,3,3) here.
In the example here, our statistic to be bootstrapped was a very common one, and thus there was already an R function for it, var(). In more complex settings, we'd write our own function.
13.4.4 General Applicability
Much theoretical work has been done on the bootstrap, and it is amazingly general. It has become the statistician's Swiss army knife. However, there are certain types of estimators on which the bootstrap fails. How can one tell in general?

One approach would be to consult the excellent book Bootstrap Methods and Their Application, by A. C. Davison and D. V. Hinkley, Cambridge University Press, 1997.

But a simpler method would be to test the bootstrap in the proposed setting by simulation: Write R code to generate many samples; get a bootstrap confidence interval on each one; and then see whether the proportion of intervals containing the true population value is approximately 95%.
In the sample variance example above, the code could be:
sim <- function(n,nreps,alp) {
   cilow <- vector(length=nreps)
   cihi <- vector(length=nreps)
   for (rep in 1:nreps) {
      # N(0,1) data, so the true population variance is 1.0
      x <- rnorm(n)
      bt <- boot(x,s2,R=200)
      cilow[rep] <- quantile(bt$t,alp)
      cihi[rep] <- quantile(bt$t,1-alp)
   }
   # proportion of intervals covering the true variance
   print(mean(cilow <= 1.0 & 1.0 <= cihi))
}
13.4.5 Why It Works
The mathematical theory of the bootstrap can get extremely involved, but we can at least get a glimpse of why it works here.

First, let's review our notation:

• Our random sample data is $Q_1, ..., Q_n$.

• Our estimator of $\theta$ is $\hat{\theta} = h(Q_1, ..., Q_n)$.
• Our resampled estimators of $\theta$ are $\tilde{\theta}_1, ..., \tilde{\theta}_k$.
Remember, to get any confidence interval from an estimator, we need the distribution of that estimator. Here in our bootstrap context, our goal is to find the approximate distribution of $\hat{\theta}$. The bootstrap achieves that goal very simply.

In essence, we are performing a simulation, drawing samples from the empirical distribution function of our $Q_i$ data. Since the empirical cdf is an estimate of the population cdf $F_Q$, the $\tilde{\theta}_j$ act like a random sample from the distribution of $\hat{\theta}$.
Indeed, if we calculate the sample standard deviation of the $\tilde{\theta}_j$, that is an estimate of the standard error of $\hat{\theta}$. If, due to the delta method or other considerations, we know that the asymptotic distribution of $\hat{\theta}$ is normal, then an approximate 95% confidence interval for $\theta$ would be

$$\hat{\theta} \pm 1.96 \cdot \text{standard deviation of the } \tilde{\theta}_j \qquad (13.58)$$
Efron's percentile method is more general, and works better for small samples. The idea is that the above discussion implies that the values

$$\tilde{\theta}_j - \hat{\theta} \qquad (13.59)$$

have approximately the same distribution as the values

$$\hat{\theta} - \theta \qquad (13.60)$$

Accordingly, the probability that (13.60) is between A and B is approximately 0.95, thus giving us (13.57).
Chapter 14
Introduction to Model Building
"All models are wrong, but some are useful." -- George Box$^1$

"[Mathematical models] should be made as simple as possible, but not simpler." -- Albert Einstein$^2$

"Beware of geeks bearing formulas." -- Warren Buffett, 2009, on the role of "quants" (Wall Street analysts who form probabilistic models for currency, bonds etc.) in the 2008 financial collapse.
The above quote by Box says it all. Consider for example the family of normal distributions. In real life, random variables are bounded (no person's height is negative or greater than 500 inches) and are inherently discrete, due to the finite precision of our measuring instruments. Thus, technically, no random variable in practice can have an exact normal distribution. Yet the assumption of normality pervades statistics, and has been enormously successful, provided one understands its approximate nature.

The situation is similar to that of physics. Paraphrasing Box, we might say that the physical models used when engineers design an airplane wing are all wrong, but they are useful. We know that in many analyses of bodies in motion, we can neglect the effect of air resistance. But we also know that in some situations one must include that factor in our model.

So, the field of probability and statistics is fundamentally about modeling. The field is extremely useful, provided the user understands the modeling issues well. For this reason, this book contains this separate chapter on modeling issues.
$^1$George Box (1919-) is a famous statistician, with several statistical procedures named after him.

$^2$The reader is undoubtedly aware of Einstein's (1879-1955) famous theories of relativity, but may not know his connections to probability theory. His work on Brownian motion, which describes the path of a molecule as it is bombarded by others, is probabilistic in nature, and later developed into a major branch of probability theory. Einstein was also a pioneer in quantum mechanics, which is probabilistic as well. At one point, he doubted the validity of quantum theory, and made his famous remark, "God does not play dice with the universe."
14.1 Desperate for Data
Suppose we have the samples of men's and women's heights, $X_1, ..., X_n$ and $Y_1, ..., Y_n$. Assume for simplicity that the variance of height is the same for each gender, $\sigma^2$. The means of the two populations are designated by $\mu_1$ and $\mu_2$.
Say we wish to guess the height of a new person who we know to be a man but for whom we know
nothing else. We do not see him, etc.
14.1.1 Known Distribution
Suppose for just a moment that we actually know the distribution of X, i.e. the population distribution of male heights. What would be the best constant g to use as our guess for a person about whom we know nothing other than gender?

Well, we might borrow from Section 12.2 and use mean squared error,

$$E[(g - X)^2] \qquad (14.1)$$

as our criterion of goodness of guessing. But we already know what the best g is, from Section 3.60: The best g is $\mu_1$. Our best guess for this unseen man's height is the mean height of all men in the population.
14.1.2 Estimated Mean
Of course, we don't know $\mu_1$, but we can do the next-best thing, i.e. use an estimate of it from our sample.

The natural choice for that estimator would be

$$T_1 = \overline{X}, \qquad (14.2)$$

the mean height of men in our sample.

But what if n is really small, say n = 5? That's awfully small. We may wish to consider adding the women's heights to our estimate, in order to get a larger sample. Then we would estimate $\mu_1$ by

$$T_2 = \frac{\overline{X} + \overline{Y}}{2}, \qquad (14.3)$$
It may at first seem obvious that $T_1$ is the better estimator. Women tend to be shorter, after all, so pooling the data from the two genders would induce a bias. On the other hand, we found in Section 12.2 that for any estimator,

$$\text{MSE} = \text{variance of the estimator} + \text{bias of the estimator}^2 \qquad (14.4)$$

In other words, some amount of bias may be tolerable, if it will buy us a substantial reduction in variance. After all, women are not that much shorter than men, so the bias might not be too bad. Meanwhile, the pooled estimate should have lower variance, as it is based on 2n observations instead of n; (10.11) indicates that.

Before continuing, note first that $T_2$ is based on a simpler model than is $T_1$, as $T_2$ ignores gender. We thus refer to $T_1$ as being based on the more complex model.

Which one is better? The answer will need a criterion for goodness of estimation, which we will take to be mean squared error, MSE. So, the question becomes, which has the smaller MSE, $T_1$ or $T_2$? In other words:

Which is smaller, $E[(T_1 - \mu_1)^2]$ or $E[(T_2 - \mu_1)^2]$?
14.1.3 The Bias/Variance Tradeoff

We could calculate MSE from scratch, but it would probably be better to make use of the work we already went through, producing (12.66). This is especially true in that we know a lot about variance of sample means, and we will take this route.

So, let's find the biases of the two estimators.

• $T_1$: $T_1$ is unbiased, from (10.11). So,

bias of $T_1$ = 0

• $T_2$:

$$E(T_2) = E(0.5\overline{X} + 0.5\overline{Y}) \quad \text{(definition)} \qquad (14.5)$$
$$= 0.5E\overline{X} + 0.5E\overline{Y} \quad \text{(linearity of E())} \qquad (14.6)$$
$$= 0.5\mu_1 + 0.5\mu_2 \quad \text{[from (10.11)]} \qquad (14.7)$$

So,
bias of $T_2$ = $(0.5\mu_1 + 0.5\mu_2) - \mu_1$

On the other hand, $T_2$ has a smaller variance than $T_1$:

• $T_1$: Recalling (10.16), we have

$$Var(T_1) = \frac{\sigma^2}{n} \qquad (14.8)$$

• $T_2$:

$$Var(T_2) = Var(0.5\overline{X} + 0.5\overline{Y}) \qquad (14.9)$$
$$= 0.5^2 \, Var(\overline{X}) + 0.5^2 \, Var(\overline{Y}) \quad \text{(properties of Var())} \qquad (14.10)$$
$$= 2 \cdot 0.25 \cdot \frac{\sigma^2}{n} \quad \text{[from (10.16)]} \qquad (14.11)$$
$$= \frac{\sigma^2}{2n} \qquad (14.12)$$
These findings are highly instructive. You might at first think that of course $T_1$ would be the better predictor than $T_2$. But for a small sample size, the smaller (actually 0) bias of $T_1$ is not enough to counteract its larger variance. $T_2$ is biased, yes, but it is based on double the sample size and thus has half the variance.

In light of (12.66), we see that $T_1$, the "true" predictor, may not necessarily be the better of the two predictors. Granted, it has no bias whereas $T_2$ does have a bias, but the latter has a smaller variance.

So, under what circumstances will $T_1$ be better than $T_2$? Let's answer this by using (12.65):
$$\text{MSE}(T_1) = \frac{\sigma^2}{n} + 0^2 = \frac{\sigma^2}{n} \qquad (14.13)$$

$$\text{MSE}(T_2) = \frac{\sigma^2}{2n} + \left( \frac{\mu_1 + \mu_2}{2} - \mu_1 \right)^2 = \frac{\sigma^2}{2n} + \left( \frac{\mu_2 - \mu_1}{2} \right)^2 \qquad (14.14)$$

$T_1$ is a better predictor than $T_2$ if (14.13) is smaller than (14.14), which is true if

$$\left( \frac{\mu_2 - \mu_1}{2} \right)^2 > \frac{\sigma^2}{2n} \qquad (14.15)$$
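This tradeoff is easy to see in a small simulation. Here is a minimal R sketch; the parameter values are invented for illustration, and are chosen so that (14.15) fails, i.e. so that the pooled estimator wins.

# compare MSEs of T1 = Xbar and T2 = (Xbar+Ybar)/2 as estimators of mu1
mu1 <- 70; mu2 <- 68; sigma <- 4; n <- 5
x <- matrix(rnorm(10000*n, mu1, sigma), nrow=10000)   # 10000 notebook lines
y <- matrix(rnorm(10000*n, mu2, sigma), nrow=10000)
t1 <- rowMeans(x)
t2 <- (rowMeans(x) + rowMeans(y))/2
# theoretical values from (14.13) and (14.14): 3.2 and 2.6
c(mse.t1 = mean((t1 - mu1)^2), mse.t2 = mean((t2 - mu1)^2))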
Granted, we don't know the values of $\mu_1$ and $\mu_2$, so in a real situation, we won't really know whether to use $T_1$ or $T_2$. But the above analysis makes the point that under some circumstances, it really is better to pool the data in spite of bias.
14.1.4 Implications
So you can see that $T_1$ is better only if either

• n is large enough, or

• the difference in population mean heights between men and women is large enough, or

• there is not much variation within each population, e.g. most men have very similar heights

Since that third item, small within-population variance, is rarely seen, let's concentrate on the first two items. The big revelation here is that:

A more complex model is more accurate than a simpler one only if either

• we have enough data to support it, or

• the complex model is sufficiently different from the simpler one
In the height/gender example above, if n is too small, we are "desperate for data," and thus make use of the female data to augment our male data. Though women tend to be shorter than men, the bias that results from that augmentation is offset by the reduction in estimator variance that we get. But if n is large enough, the variance will be small in either model, so when we go to the more complex model, the advantage gained by reducing the bias will more than compensate for the increase in variance.

THIS IS AN ABSOLUTELY FUNDAMENTAL NOTION IN STATISTICS.
This was a very simple example, but you can see that in complex settings, fitting too rich a model can result in very high MSEs for the estimates. In essence, everything becomes noise. (Some people have cleverly coined the term noise mining, a play on the term data mining.) This is the famous overfitting problem.

In our unit on statistical relations, Chapter 15, we will show the results of a scary experiment done at the Wharton School, the University of Pennsylvania's business school. The researchers deliberately added fake data to a prediction equation, and standard statistical software identified it as "significant"! This is partly a problem with the word itself, as we saw in Section 11.8, but also a problem of using far too complex a model, as will be seen in that future unit.
Note that of course (14.15) contains several unknown population quantities. I derived it here merely to establish a principle, namely that a more complex model may perform more poorly under some circumstances.

It would be possible, though, to make (14.15) into a practical decision tool, by estimating the unknown quantities, e.g. replacing $\mu_1$ by $\overline{X}$. This then creates possible problems with confidence intervals, whose derivation did not include this extra decision step. Such estimators, termed adaptive, are beyond the scope of this book.
14.2 Assessing Goodness of Fit of a Model
Our example in Section 12.1.4 concerned how to estimate the parameters of a gamma distribution, given a sample from the distribution. But that assumed that we had already decided that the gamma model was reasonable in our application. Here we will be concerned with how we might come to such decisions.

Assume we have a random sample $X_1, ..., X_n$ from a distribution having density $f_X$.
14.2.1 The Chi-Square Goodness of Fit Test
The classic way to do this would be the Chi-Square Goodness of Fit Test. We would set

$$H_0: f_X \text{ is a member of the exponential parametric family} \qquad (14.16)$$

This would involve partitioning $(0, \infty)$ into k intervals $(s_{i-1}, s_i)$ of our choice, and setting

$$N_i = \text{number of } X_j \text{ in } (s_{i-1}, s_i) \qquad (14.17)$$

We would then find the Maximum Likelihood Estimate (MLE) of $\lambda$, on the assumption that the distribution of X really is exponential. The MLE turns out to be the reciprocal of the sample mean, i.e.

$$\hat{\lambda} = 1/\overline{X} \qquad (14.18)$$

This would be considered the parameter of the best-fitting exponential density for our data. We would then estimate the probabilities

$$p_i = P[X \in (s_{i-1}, s_i)] = e^{-\lambda s_{i-1}} - e^{-\lambda s_i}, \; i = 1, ..., k \qquad (14.19)$$

by

$$\hat{p}_i = e^{-\hat{\lambda} s_{i-1}} - e^{-\hat{\lambda} s_i}, \; i = 1, ..., k \qquad (14.20)$$
Note that $N_i$ has a binomial distribution, with n trials and success probability $p_i$. Using this, $EN_i$ is estimated to be

$$\hat{\nu}_i = n(e^{-\hat{\lambda} s_{i-1}} - e^{-\hat{\lambda} s_i}), \; i = 1, ..., k \qquad (14.21)$$

Our test statistic would then be

$$Q = \sum_{i=1}^k \frac{(N_i - \hat{\nu}_i)^2}{\hat{\nu}_i} \qquad (14.22)$$

where $\hat{\nu}_i$ is the estimated expected value of $N_i$ under the assumption of exponentiality. It can be shown that Q is approximately chi-square distributed with k-2 degrees of freedom.$^3$ Note that only large values of Q should be suspicious, i.e. should lead us to reject $H_0$; if Q is small, it indicates a good fit. If Q were large enough to be a rare event, say larger than $\chi^2_{0.95,k-2}$, we would decide NOT to use the exponential model; otherwise, we would use it.

$^3$We have k intervals, but the $N_i$ must sum to n, so there are only k-1 free values. We then subtract one more degree of freedom, having estimated the parameter $\lambda$.
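Here is a minimal R sketch of the whole procedure; the simulated data and the particular interval endpoints are arbitrary illustrative choices.

# chi-square goodness-of-fit test for an exponential model
x <- rexp(500, rate=2)              # illustrative data
n <- length(x)
s <- c(0, 0.25, 0.5, 1, 2, Inf)     # k = 5 intervals (s_{i-1}, s_i)
k <- length(s) - 1
lamhat <- 1/mean(x)                 # MLE, as in (14.18)
nobs <- table(cut(x, breaks=s))     # the N_i
nuhat <- n * (exp(-lamhat*s[-(k+1)]) - exp(-lamhat*s[-1]))   # as in (14.21)
q <- sum((nobs - nuhat)^2 / nuhat)  # the test statistic (14.22)
q > qchisq(0.95, k-2)               # TRUE would mean rejecting the model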
Hopefully the reader has immediately recognized the problem here. If we have a large sample, this procedure will pounce on tiny deviations from the exponential distribution, and we would decide not to use the exponential model, even if those deviations were quite minor. Again, no model is 100% correct, and thus a goodness of fit test will eventually tell us not to use any model at all.
14.2.2 Kolmogorov-Smirnov Confidence Bands

Again consider the problem above, in which we were assessing the fit of an exponential model. In line with our major point that confidence intervals are far superior to hypothesis tests, we now present Kolmogorov-Smirnov confidence bands, which work as follows.

Recall the concept of empirical cdfs, presented in Section 12.4.1. It turns out that the distribution of

$$M = \max_{-\infty < t < \infty} |\hat{F}_X(t) - F_X(t)| \qquad (14.23)$$

is the same for all distributions having a density. This fact (whose proof is related to the general method for simulating random variables having a given density, in Section 4.7) tells us that, without knowing anything about the distribution of X, we can be sure that M has the same distribution. And it turns out that

$$F_M(1.358 \, n^{-1/2}) \approx 0.95 \qquad (14.24)$$
Define upper and lower functions

$$U(t) = \hat{F}_X(t) + 1.358 \, n^{-1/2}, \quad L(t) = \hat{F}_X(t) - 1.358 \, n^{-1/2} \qquad (14.25)$$

So, what (14.23) and (14.24) tell us is

$$0.95 \approx P(\text{the curve } F_X \text{ is entirely between U and L}) \qquad (14.26)$$

So, the pair of curves, (L(t), U(t)), is called a 95% confidence band for $F_X$.
The usefulness is similar to that of confidence intervals. If the band is very wide, we know we really don't have enough data to decide much about the distribution of X. But if the band is narrow and some member of the family comes reasonably close to the band, we would probably decide that the model is a good one, even if no member of the family falls within the band. Once again, we should NOT pounce on tiny deviations from the model.

Warning: The Kolmogorov-Smirnov procedure available in the R language performs only a hypothesis test, rather than forming a confidence band. In other words, it simply checks to see whether a member of the family falls within the band. This is not what we want, because we may be perfectly happy if a member is only near the band.
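Computing and plotting the band itself takes only a few lines of R. Here is a minimal sketch (the simulated data and the grid of t values are illustrative); note that this uses ecdf(), not the hypothesis-test function ks.test().

# 95% Kolmogorov-Smirnov confidence band for F_X, per (14.25)
x <- rexp(100)                 # illustrative data
n <- length(x)
fhat <- ecdf(x)                # the empirical cdf
t <- seq(0, 5, 0.01)
u <- pmin(fhat(t) + 1.358/sqrt(n), 1)   # U(t), clipped to [0,1]
l <- pmax(fhat(t) - 1.358/sqrt(n), 0)   # L(t), clipped to [0,1]
plot(t, u, type="l", ylab="F(t)")
lines(t, l)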
Of course, another way, this one less formal, of assessing data for suitability for some model is to plot the data in a histogram or something of that nature.
14.3 Bias vs. Variance, Again
In our unit on estimation, Section 12.4, we saw a classic tradeoff in histogram- and kernel-based density estimators. With histograms, for instance, a wider bin width produces a graph which is smoother, but possibly too smooth, i.e. with less oscillation than the true population curve has. The same problem occurs with larger values of h in the kernel case.
This is actually yet another example of the bias/variance tradeoff, discussed above and, as mentioned, ONE OF THE MOST RECURRING NOTIONS IN STATISTICS. A large bin width, or a large value of h, produces more bias. In general, the larger the bin width or h, the further $E[\hat{f}_R(t)]$ is from the true value of $f_R(t)$. This occurs because we are making use of points which are not so near t, and thus at which the density height is different from that of $f_R(t)$. On the other hand, because we are making use of more points, $Var[\hat{f}_R(t)]$ will be smaller.
THERE IS NO GOOD WAY TO CHOOSE THE BIN WIDTH OR h. Even though there is a lot of theory to suggest how to choose the bin width or h, no method is foolproof. This is made even worse by the fact that the theory generally has a goal of minimizing integrated mean squared error,

$$\int_{-\infty}^{\infty} E\left[ \left( \hat{f}_R(t) - f_R(t) \right)^2 \right] dt \qquad (14.27)$$

rather than, say, the mean squared error at a particular point of interest, v:

$$E\left[ \left( \hat{f}_R(v) - f_R(v) \right)^2 \right] \qquad (14.28)$$
14.4 Robustness
Traditionally, the term robust in statistics has meant resilience to violations in assumptions. For example, in Section 10.12, we presented Student-t, a method for finding exact confidence intervals for means, assuming normally-distributed populations. But as noted at the outset of this chapter, no population in the real world has an exact normal distribution. The question at hand (which we will address below) is, does the Student-t method still give approximately correct results if the sampled population is not normal? If so, we say that Student-t is robust to the normality assumption.

Later, there was quite a lot of interest among statisticians in estimation procedures that do well even if there are outliers in the data, i.e. erroneous observations that are in the fringes of the sample. Such procedures are said to be robust to outliers.
Our interest here is on robustness to assumptions. Let us first consider the Student-t example. As discussed in Section 10.12, the main statistic here is

$$T = \frac{\overline{X} - \mu}{s/\sqrt{n}} \qquad (14.29)$$

where $\mu$ is the population mean and s is the square root of the unbiased version of the sample variance:

$$s = \sqrt{\frac{\sum_{i=1}^n (X_i - \overline{X})^2}{n-1}} \qquad (14.30)$$
The distribution of T, under the assumption of a normal population, has been tabulated, and tables
for it appear in virtually every textbook on statistics. But what if the population is not normal,
as is inevitably the case?
The answer is that it doesn't matter. For large n, even for samples having, say, n = 20, the distribution of T is close to N(0,1) by the Central Limit Theorem, regardless of whether the population is normal.
By contrast, consider the classic procedure for performing hypothesis tests and forming confidence intervals for a population variance $\sigma^2$, which relies on the statistic

$$K = \frac{(n-1)s^2}{\sigma^2} \qquad (14.31)$$

where again $s^2$ is the unbiased version of the sample variance. If the sampled population is normal, then K can be shown to have a chi-square distribution with n-1 degrees of freedom. This then sets up the tests or intervals. However, it has been shown that these procedures are not robust to the assumption of a normal population. See The Analysis of Variance: Fixed, Random, and Mixed Models, by Hardeo Sahai and Mohammed I. Ageel, Springer, 2000, and the earlier references they cite, especially the pioneering work of Scheffé.
Exercises
Note to instructor: See the Preface for a list of sources of real data on which exercises can be
assigned to complement the theoretical exercises below.
1. In our example in Section 14.1, assume $\mu_1 = 70$, $\mu_2 = 66$, $\sigma = 4$ and the distribution of height is normal in the two populations. Suppose we are predicting the height of a man who, unknown to us, has height 68. We hope to guess within two inches. Find $P(|T_1 - 68| < 2)$ and $P(|T_2 - 68| < 2)$ for various values of n.
2. In Section 13.3 we discussed simultaneous inference, the forming of confidence intervals whose joint confidence level was 95% or some other target value. The Kolmogorov-Smirnov confidence band in Section 14.2.2 allows us to compute infinitely many confidence intervals for $F_X(t)$ at different values of t, at a "price" of only 1.358. Still, if we are just estimating $F_X(t)$ at a single value of t, an individual confidence interval using (10.34) would be narrower than that given to us by Kolmogorov-Smirnov. Compare the widths of these two intervals in a situation in which the true value of $F_X(t)$ is 0.4.
3. Say we have a random sample $X_1, ..., X_n$ from a population with mean $\mu$ and variance $\sigma^2$. The usual estimator of $\mu$ is the sample mean $\overline{X}$, but here we will use what is called a shrinkage estimator: Our estimate of $\mu$ will be $0.9\overline{X}$. Find the mean squared error of this estimator, and give an inequality (you don't have to algebraically simplify it) that shows under what circumstances $0.9\overline{X}$ is better than $\overline{X}$. (Strong advice: Do NOT reinvent the wheel. Make use of what we have already derived.)
Chapter 15
Relations Among Variables: Linear
Regression
In many senses, this chapter and the next one form the real core of statistics, especially from a
computer science point of view.
In this chapter we are interested in relations between variables, in two main senses:

• In regression analysis, we are interested in the relation of one variable with one or more others.

• In other kinds of analyses covered in this chapter, we are interested in relations among several variables, symmetrically, i.e. not having one variable play a special role.
15.1 The Goals: Prediction and Understanding
"Prediction is difficult, especially when it's about the future." -- Yogi Berra$^1$

Before beginning, it is important to understand the typical goals in regression analysis.

• Prediction: Here we are trying to predict one variable from one or more others.

$^1$Yogi Berra (1925-) is a former baseball player and manager, famous for his malapropisms, such as "When you reach a fork in the road, take it"; "That restaurant is so crowded that no one goes there anymore"; and "I never said half the things I really said."
• Understanding: Here we wish to determine which of several variables have a greater effect on (or relation to) a given variable. An important special case is that in which we are interested in determining the effect of one predictor variable, after the effects of the other predictors are removed.
Denote the predictor variables by $X^{(1)}, ..., X^{(r)}$. They are also called independent variables. The variable to be predicted, Y, is often called the response variable, or the dependent variable.

A common statistical methodology used for such analyses is called regression analysis. In the important special cases in which the response variable Y is an indicator variable (Section 3.6),$^2$ taking on just the values 1 and 0 to indicate class membership, we call this the classification problem. (If we have more than two classes, we need several Ys.)

$^2$Sometimes called a dummy variable.
In the above context, we are interested in the relation of a single variable Y with other variables $X^{(i)}$. But in some applications, we are interested in the more symmetric problem of relations among the variables $X^{(i)}$ (with there being no Y). A typical tool for the case of continuous random variables is principal components analysis, and a popular one for the discrete case is the log-linear model; both will be discussed later in this chapter.
15.2 Example Applications: Software Engineering, Networks, Text
Mining
Example: As an aid in deciding which applicants to admit to a graduate program in computer science, we might try to predict Y, a faculty rating of a student after completion of his/her first year in the program, from $X^{(1)}$ = the student's CS GRE score, $X^{(2)}$ = the student's undergraduate GPA and various other variables. Here our goal would be Prediction, but educational researchers might do the same thing with the goal of Understanding. For an example of the latter, see "Predicting Academic Performance in the School of Computing & Information Technology (SCIT)," 35th ASEE/IEEE Frontiers in Education Conference, by Paul Golding and Sophia McNamarah, 2005.
Example: In the paper "Estimation of Network Distances Using Off-line Measurements," Computer Communications, by Prasun Sinha, Danny Raz and Nidhan Choudhuri, 2006, the authors wanted to predict Y, the round-trip time (RTT) for packets in a network, using the predictor variables $X^{(1)}$ = geographical distance between the two nodes, $X^{(2)}$ = number of router-to-router hops, and other off-line variables. The goal here was primarily Prediction.
Example: In the paper "Productivity Analysis of Object-Oriented Software Developed in a Commercial Environment," Software: Practice and Experience, by Thomas E. Potok, Mladen Vouk and Andy Rindos, 1999, the authors mainly had an Understanding goal: What impact, positive or negative, does the use of object-oriented programming have on programmer productivity? Here they predicted Y = number of person-months needed to complete the project, from $X^{(1)}$ = size of the project as measured in lines of code, $X^{(2)}$ = 1 or 0 depending on whether an object-oriented or procedural approach was used, and other variables.
Example: Most text mining applications are classification problems. For example, the paper "Untangling Text Data Mining," Proceedings of ACL'99, by Marti Hearst, 1999, cites, inter alia, an application in which the analysts wished to know what proportion of patents come from publicly funded research. They were using a patent database, which of course is far too huge to feasibly search by hand. That meant that they needed to be able to (reasonably reliably) predict Y = 1 or 0 according to whether the patent was publicly funded, from a number of $X^{(i)}$, each of which was an indicator variable for a given key word, such as "NSF." They would then treat the predicted Y values as the real ones, and estimate the proportion from them.
15.3 Adjusting for Covariates
The first statistical consulting engagement I ever worked on involved something called adjusting for covariates. I was retained by the Kaiser hospital chain to investigate how heart attack patients fared at the various hospitals: did patients have a better chance to survive in some hospitals than in others? There were four hospitals of particular interest.

I could have simply computed raw survival rates, say the proportion of patients who survive for a month following a heart attack, and then used the methods of Section 10.6, for instance. This could have been misleading, though, because one of the four hospitals served a largely elderly population. A straight comparison of survival rates might then unfairly paint that particular hospital as giving lower quality of care than the others.

So, we want to somehow adjust for the effects of age. I did this by setting Y to 1 or 0, for survival, $X^{(1)}$ to age, and $X^{(1+i)}$ to be an indicator random variable for whether the patient was at hospital i, i = 1,2,3.$^3$

$^3$Note that there is no i = 4 case, since if the first three hospital variables are all 0, that already tells us that this patient was at the fourth hospital.
15.4 What Does Relationship Really Mean?
Consider the Davis city population example again. In addition to the random variable W for weight, let H denote the person's height. Suppose we are interested in exploring the relationship between height and weight.

As usual, we must first ask, what does that really mean? What do we mean by "relationship"? Clearly, there is no exact relationship; for instance, a person's weight is not an exact function of his/her height.
Intuitively, though, we would guess that mean weight increases with height. To state this precisely, take Y to be the weight W and $X^{(1)}$ to be the height H, and define

$$m_{W;H}(t) = E(W \, | \, H = t) \qquad (15.1)$$

This looks abstract, but it is just common-sense stuff. For example, $m_{W;H}(68)$ would be the mean weight of all people in the population of height 68 inches. The value of $m_{W;H}(t)$ varies with t, and we would expect that a graph of it would show an increasing trend with t, reflecting that taller people tend to be heavier.
We call $m_{W;H}$ the regression function of W on H. In general, $m_{Y;X}(t)$ means the mean of Y among all units in the population for which X = t.

Note the word population in that last sentence. The function m() is a population function.

So we have:

Major Point 1: When we talk about the relationship of one variable to one or more others, we are referring to the regression function, which expresses the mean of the first variable as a function of the others. The key word here is mean!
15.5 Estimating That Relationship from Sample Data
As noted, though, $m_{W;H}(t)$ is a population function, dependent on population distributions. How can we estimate this function from sample data?

Toward that end, let's again suppose we have a random sample of 1000 people from Davis, with

$$(H_1, W_1), ..., (H_{1000}, W_{1000}) \qquad (15.2)$$

being their heights and weights. We again wish to use this data to estimate population values. But the difference here is that we are estimating a whole function now, the whole curve $m_{W;H}(t)$. That means we are estimating infinitely many values, with one $m_{W;H}(t)$ value for each t.$^4$ How do we do this?

$^4$Of course, the population of Davis is finite, but there is the conceptual population of all people who could live in Davis.
One approach would be as follows. Say we wish to nd m
W;H
(t) (note the hat, for estimate of!)
at t = 70.2. In other words, we wish to estimate the mean weightin the populationamong all
people of height 70.2. What we could do is look at all the people in our sample who are within, say,
1.0 inch of 70.2, and calculate the average of all their weights. This would then be our m
W;H
(t).
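A minimal sketch of that windowing idea in R, with h and w as hypothetical vectors of sample heights and weights:

# estimated mean weight among sample people within 1.0 inch of height t
mhat <- function(t, h, w) mean(w[abs(h - t) <= 1.0])
mhat(70.2, h, w)  # our estimate at t = 70.2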
There are many methods like this (see Section 16.3), but the traditional method is to choose a parametric model for the regression function. That way we estimate only a finite number of quantities instead of an infinite number. This would be good in light of Section 14.1.
Typically the parametric model chosen is linear, i.e. we assume that $m_{W;H}(t)$ is a linear function of t:

$$m_{W;H}(t) = ct + d \qquad (15.3)$$

for some constants c and d. If this assumption is reasonable, meaning that though it may not be exactly true it is reasonably close, then it is a huge gain for us over a nonparametric model. Do you see why? Again, the answer is that instead of having to estimate an infinite number of quantities, we now must estimate only two quantities: the parameters c and d.

Equation (15.3) is thus called a parametric model of $m_{W;H}()$. The set of straight lines indexed by c and d is a two-parameter family, analogous to parametric families of distributions, such as the two-parameter gamma family; the difference, of course, is that in the gamma case we were modeling a density function, and here we are modeling a regression function.

Note that c and d are indeed population parameters in the same sense that, for instance, r and $\lambda$ are parameters in the gamma distribution family. We must estimate c and d from our sample data.
So we have:

Major Point 2: The function $m_{W;H}(t)$ is a population entity, so we must estimate it from our sample data. To do this, we have a choice of either assuming that $m_{W;H}(t)$ takes on some parametric form, or making no such assumption.

If we opt for a parametric approach, the most common model is linear, i.e. (15.3). Again, the quantities c and d in (15.3) are population values, and as such, we must estimate them from the data.
So, how can we estimate these population values c and d? We'll go into details in Section 15.10, but here is a preview:

Using the result on page 51, together with the Law of Total Expectation in Section 9.1.3, we have that the minimum value of the quantity

$$E\left[(W - g(H))^2\right] \qquad (15.4)$$

over all possible functions g(H), is attained by setting

$$g(H) = m_{W;H}(H) \qquad (15.5)$$

In other words, $m_{W;H}(H)$ is the best predictor of W among all possible functions of H, in the sense of minimizing mean squared prediction error.⁵
Since we are assuming the model (15.3), this in turn means that:

The quantity

$$E\left[(W - (rH + s))^2\right] \qquad (15.6)$$

is minimized by setting r = c and s = d.
This then gives us a clue as to how to estimate c and d from our data, as follows.

If you recall, in earlier chapters we've often chosen estimators by using sample analogs, e.g. $s^2$ as an estimator of $\sigma^2$. Well, the sample analog of (15.6) is

$$\frac{1}{n} \sum_{i=1}^{n} [W_i - (rH_i + s)]^2 \qquad (15.7)$$

Here (15.6) is the mean squared prediction error using r and s in the population, and (15.7) is the mean squared prediction error using r and s in our sample. Since r = c and s = d minimize (15.6), it is natural to estimate c and d by the r and s that minimize (15.7).

These are then the classical least-squares estimators of c and d.
Major Point 3: In statistical regression analysis, one uses a linear model as in (15.3), estimating the coefficients by minimizing (15.7).

We will elaborate on this in Section 15.10.

⁵But if we wish to minimize the mean absolute prediction error, $E(|W - g(H)|)$, the best function turns out to be g(H) = median(W | H).
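As a quick numerical illustration of the least-squares idea (a sketch with simulated data, not from the text), we can check that directly minimizing (15.7) reproduces, approximately, what R's lm() function computes:

set.seed(1)
h <- rnorm(1000, 69, 3)                  # simulated heights
w <- -100 + 3.7*h + rnorm(1000, 0, 10)   # simulated weights
# sample mean squared prediction error, as in (15.7)
mspe <- function(rs) mean((w - (rs[1]*h + rs[2]))^2)
optim(c(0, 0), mspe)$par   # numerical minimizer: estimates of (c, d)
coef(lm(w ~ h))            # lm() reports the same, as (intercept, slope)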
15.6 Multiple Regression: More Than One Predictor Variable
Note that X and t could be vector-valued. For instance, we could have Y be weight and have X be the pair

$$X = \left(X^{(1)}, X^{(2)}\right) = (H, A) = (\text{height, age}) \qquad (15.8)$$

so as to study the relationship of weight with height and age. If we used a linear model, we would write for $t = (t_1, t_2)$,

$$m_{W;H,A}(t) = \beta_0 + \beta_1 t_1 + \beta_2 t_2 \qquad (15.9)$$
In other words,

$$\text{mean weight} = \beta_0 + \beta_1 \, \text{height} + \beta_2 \, \text{age} \qquad (15.10)$$

(It is traditional to use the Greek letter $\beta$ to name the coefficients in a linear regression model.)

So for instance $m_{W;H,A}(68, 37.2)$ would be the mean weight in the population of all people having height 68 and age 37.2.
In analogy with (15.7), we would estimate the $\beta_i$ by minimizing

$$\frac{1}{n} \sum_{i=1}^{n} [W_i - (u + vH_i + wA_i)]^2 \qquad (15.11)$$

with respect to u, v and w. The minimizing values would be denoted $\widehat{\beta}_0$, $\widehat{\beta}_1$ and $\widehat{\beta}_2$.
We might consider adding a third predictor, gender:

$$\text{mean weight} = \beta_0 + \beta_1 \, \text{height} + \beta_2 \, \text{age} + \beta_3 \, \text{gender} \qquad (15.12)$$

where gender is an indicator variable, 1 for male, 0 for female. Note that we would not have two gender variables, since knowledge of the value of one such variable would tell us for sure what the other one is. (It would also make a certain matrix noninvertible, as we'll discuss later.)
15.7 Interaction Terms
Equation (15.9) implicitly says that, for instance, the effect of age on weight is the same at all height levels. In other words, the difference in mean weight between 30-year-olds and 40-year-olds is the same regardless of whether we are looking at tall people or short people. To see that, just plug 40 and 30 for age in (15.9), with the same number for height in both, and subtract; you get $10\beta_2$, an expression that has no height term.
That assumption may not be a good one, since people tend to get heavier as they age. If we don't like this assumption, we can add an interaction term to (15.9), consisting of the product of the two original predictors. Our new predictor variable $X^{(3)}$ is equal to $X^{(1)} X^{(2)}$, and thus our regression function is

$$m_{W;H,A}(t) = \beta_0 + \beta_1 t_1 + \beta_2 t_2 + \beta_3 t_1 t_2 \qquad (15.13)$$
If you perform the same subtraction described above, you'll see that this more complex model does not assume, as the old one did, that the difference in mean weight between 30-year-olds and 40-year-olds is the same regardless of whether we are looking at tall people or short people.

Recall the study of object-oriented programming in Section 15.1. The authors there set $X^{(3)} = X^{(1)} X^{(2)}$. The reader should make sure to understand that without this term, we are basically saying that the effect (whether positive or negative) of using object-oriented programming is the same for any code size.
Though the idea of adding interaction terms to a regression model is tempting, it can easily get out of hand. If we have k basic predictor variables, then there are $\binom{k}{2}$ potential two-way interaction terms, $\binom{k}{3}$ three-way terms and so on. Unless we have a very large amount of data, we run a big risk of overfitting (Section 15.11.1). And with so many interaction terms, the model would be difficult to interpret.

So, we may have a decision to make here, as to whether to introduce interaction terms. For that matter, it may be the case that age is actually not that important, so we might even consider dropping that variable altogether. These questions will be pursued in Section 15.11.
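In R's model formulas, interaction terms can be requested directly; here is a small sketch (data frame and column names hypothetical):

# dat has columns w (weight), h (height), a (age)
lm(w ~ h + a, data = dat)   # additive model, as in (15.9)
lm(w ~ h * a, data = dat)   # also includes the h:a product term, as in (15.13)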
15.8 Prediction
Let's return to our weight/height/age example. We are informed of a certain person, of height 70.4 and age 24.8, but weight unknown. What should we predict his weight to be?

The intuitive answer (justified formally by Section 16.7.1) is that we predict his weight to be the mean weight for his height/age group,

$$m_{W;H,A}(70.4, 24.8) \qquad (15.14)$$

But that is a population value. Say we estimate the function $m_{W;H,A}$ using our data, yielding $\widehat{m}_{W;H,A}$. Then we could take as our prediction for the new person's weight

$$\widehat{m}_{W;H,A}(70.4, 24.8) \qquad (15.15)$$

If our model is (15.9), then (15.15) is

$$\widehat{m}_{W;H,A}(70.4, 24.8) = \widehat{\beta}_0 + \widehat{\beta}_1 \cdot 70.4 + \widehat{\beta}_2 \cdot 24.8 \qquad (15.16)$$

where the $\widehat{\beta}_i$ are estimated from our data by least-squares.
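A hedged sketch of this step in R (continuing the hypothetical data frame dat above), using predict() on a fitted lm object:

fit <- lm(w ~ h + a, data = dat)   # least-squares estimates of the beta_i
predict(fit, newdata = data.frame(h = 70.4, a = 24.8))   # computes (15.16)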
15.9 Preview of Linear Regression Analysis with R
There is a downloadable R library called ISwR that contains a number of useful real data sets. One of them is bp.obese, data on a set of 102 adults, on the variables gender, obesity and systolic blood pressure. Here obesity is measured relative to the ideal weight for a given height, and thus is centered around 1.00. Gender is 1 for male, 0 for female. Let's run the regression analysis:

> library(ISwR)  # load ISwR library; must install first
> bpvob <- lm(bp.obese$bp ~ bp.obese$obese)

Here we use R's lm() ("linear model") function, using the variables bp and obese in the data set bp.obese. R uses a dollar sign to denote members of class objects, so here for example bp.obese$bp means the bp member of the object bp.obese.

We could have also used the matrix-like notation that R data frames allow:

> bpvob <- lm(bp.obese[,3] ~ bp.obese[,2])

referring to columns 3 and 2 of bp.obese. But if the columns have names, as they do here, it's clearer to use them.
The tilde, ~, in the call to lm() indicates what is predicting what. Here the obese variable is predicting the bp variable. In other words, our model is

$$\text{mean blood pressure} = \beta_0 + \beta_1 \, \text{obese} \qquad (15.17)$$

The result returned by the call is another class object, an instance of the "lm" class. (All R classes have quoted names.) We've stored it in a variable we've named bpvob, for "blood pressure versus obesity."
An object of the "lm" class has many, many members. The more central ones can be listed by calling the summary() function:

> summary(bpvob)

Call:
lm(formula = bp.obese$bp ~ bp.obese$obese)

Residuals:
    Min      1Q  Median      3Q     Max
-27.570 -11.241  -2.400   9.116  71.390

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)      96.818      8.920   10.86  < 2e-16
bp.obese$obese   23.001      6.667    3.45 0.000822
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.28 on 100 degrees of freedom
Multiple R-squared: 0.1064,  Adjusted R-squared: 0.09743
F-statistic: 11.9 on 1 and 100 DF,  p-value: 0.0008222

The residuals are the differences between the actual and predicted values of the response variable. These are useful in advanced model fitting.
Look at the Coefficients section. We see that $\widehat{\beta}_0 = 96.818$ and $\widehat{\beta}_1 = 23.001$. Remember, these are just estimates of the true population $\beta_i$, so we might consider confidence intervals and significance tests regarding them, especially for $\beta_1$.
Using the standard errors listed above, and recalling Section 10.5, we have that an approximate 95% confidence interval for $\beta_1$ is

$$23.001 \pm 1.96 \cdot 6.667 = (9.93, 36.07) \qquad (15.18)$$

So obesity does seem to have a substantial effect on blood pressure, with the latter rising somewhere between 1 and 3.6 points for each rise of 0.1 in obesity.
For significance tests on the $\beta_i$, R conveniently provides us with p-values, in the Pr(>|t|) column.

The $R^2$ and adjusted $R^2$ values measure how well the predictor variables predict the response variable. More on this in Section 15.11.3.
Now let's bring in the gender variable:

> summary(lm(bp.obese$bp ~ bp.obese$obese + bp.obese$sex))

Call:
lm(formula = bp.obese$bp ~ bp.obese$obese + bp.obese$sex)

Residuals:
    Min      1Q  Median      3Q     Max
-24.263 -11.613  -2.057   6.424  72.207

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)      93.287      8.937  10.438  < 2e-16
bp.obese$obese   29.038      7.172   4.049 0.000102
bp.obese$sex     -7.730      3.715  -2.081 0.040053
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17 on 99 degrees of freedom
Multiple R-squared: 0.1438,  Adjusted R-squared: 0.1265
F-statistic: 8.314 on 2 and 99 DF,  p-value: 0.0004596
In the model specification,

bp.obese$bp ~ bp.obese$obese + bp.obese$sex

the "+" doesn't mean addition; it simply is a delimiter in the list of the predictors. (Actually, if we use "*" instead of "+", this is a signal to R that we also want interaction terms included.)
Note that the estimated value of the coefficient for gender is negative. Since our coding had 1 for male, 0 for female, this means that men of any given obesity level on average have a lower blood pressure than do women of the same level of obesity, by around 8 points.

Note, though, that a confidence interval for that quantity would range from about -15.2 points to -0.2 points, so the gender difference might actually be small. Of course, the significance test is much less subtle, and simply says that men have "significantly" lower blood pressures than women, a bit of an overstatement.
Note too that the $R^2$ and adjusted $R^2$ values increased by about 40% when we added the gender variable. However, they are still rather low (their maximum possible value is 1.00), indicating that there are lots of other factors in blood pressure that are not measured in our data, say age, physical activity, diet and so on.
15.10 Parametric Estimation of Linear Regression Functions
15.10.1 Meaning of "Linear"
Here we model $m_{Y;X}$ as a linear function of $X^{(1)}, ..., X^{(r)}$:

$$m_{Y;X}(t) = \beta_0 + \beta_1 t^{(1)} + ... + \beta_r t^{(r)} \qquad (15.19)$$

Note that the term linear regression does NOT necessarily mean that the graph of the regression function is a straight line or a plane. We could, for instance, have one predictor variable set equal to the square of another, as in (15.33).

Instead, the word linear refers to the regression function being linear in the parameters. So, for instance, (15.33) is a linear model; if for example we multiply $\beta_0$, $\beta_1$ and $\beta_2$ by 8, then $m_{A;b}$ is multiplied by 8.

A more literal look at the meaning of "linear" comes from the matrix formulation (15.24) below.
15.10.2 Point Estimates and Matrix Formulation
So, how do we estimate the $\beta_i$? Look for instance at (15.33). Keep in mind that in (15.33), the $\beta_i$ are population values. We need to estimate them from our data. How do we do that? As previewed in Section 15.5, the usual method is least-squares. Here we will go into the details.

Let's define $(b_i, A_i)$ to be the $i^{th}$ pair from the simulation. In the program, this is md[i,]. Our estimated parameters will be denoted by $\widehat{\beta}_i$. As in (15.7), the estimation methodology involves finding the values of the $\widehat{\beta}_i$ which minimize the sum of squared differences between the actual A values and their predicted values:

$$\sum_{i=1}^{100} \left[A_i - (\widehat{\beta}_0 + \widehat{\beta}_1 b_i + \widehat{\beta}_2 b_i^2)\right]^2 \qquad (15.20)$$

Obviously, this is a calculus problem. We set the partial derivatives of (15.20) with respect to the $\widehat{\beta}_i$ to 0, giving us three linear equations in three unknowns, and then solve.
For the general case (15.19), we have r+1 equations in r+1 unknowns. This is most conveniently expressed in matrix terms. Let $X^{(j)}_i$ be the value of $X^{(j)}$ for the $i^{th}$ observation in our sample, and let $Y_i$ be the corresponding Y value. Plugging this data into (15.19), we have

$$E\left(Y_i \mid X^{(1)}_i, ..., X^{(r)}_i\right) = \beta_0 + \beta_1 X^{(1)}_i + ... + \beta_r X^{(r)}_i, \quad i = 1, ..., n \qquad (15.21)$$
That's a system of n linear equations, which from your linear algebra class you know can be represented more compactly by a matrix, as follows.

Let Q be the n x (r+1) matrix whose (i,j) element is $X^{(j)}_i$, with $X^{(0)}_i$ taken to be 1. For instance, if we are predicting weight from height and age based on a sample of 100 people, then Q would look like this:

$$Q = \begin{pmatrix} 1 & H_1 & A_1 \\ 1 & H_2 & A_2 \\ \vdots & \vdots & \vdots \\ 1 & H_{100} & A_{100} \end{pmatrix} \qquad (15.22)$$

For example, row 5 of Q would consist of a 1, then the height and age of the fifth person in our sample.
Also, let

$$V = (Y_1, ..., Y_n)' \qquad (15.23)$$

Then the system (15.21) in matrix form is

$$E(V \mid Q) = Q\beta \qquad (15.24)$$

where

$$\beta = (\beta_0, \beta_1, ..., \beta_r)' \qquad (15.25)$$
Keep in mind that the derivation below is conditional on the $X^{(j)}_i$, i.e. conditional on Q, as shown above. This is the standard approach, especially since it also covers the case of nonrandom X. Thus we will later get conditional confidence intervals, which is fine. To avoid clutter, I will sometimes not show the conditioning explicitly, and thus for instance will write Cov(V) instead of Cov(V|Q).
Now to estimate the $\beta_i$, let

$$\widehat{\beta} = (\widehat{\beta}_0, \widehat{\beta}_1, ..., \widehat{\beta}_r)' \qquad (15.26)$$

with our goal now being to find $\widehat{\beta}$. The matrix form of (15.20) (now for the general case, not just ALOHA) is

$$(V - Q\widehat{\beta})'(V - Q\widehat{\beta}) \qquad (15.27)$$

Then it can be shown that, after all the partial derivatives are taken and set to 0, the solution is

$$\widehat{\beta} = (Q'Q)^{-1} Q'V \qquad (15.28)$$
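A quick sketch of (15.28) in R, on simulated data (all names here are hypothetical), confirming that it matches lm():

set.seed(2)
n <- 100
h <- rnorm(n, 69, 3); a <- rnorm(n, 40, 10)
w <- -100 + 3.5*h + 0.2*a + rnorm(n, 0, 10)
Q <- cbind(1, h, a)    # the matrix in (15.22)
V <- w                 # the vector in (15.23)
betahat <- solve(t(Q) %*% Q, t(Q) %*% V)   # (15.28), solving (Q'Q) x = Q'V
betahat
coef(lm(w ~ h + a))    # same numbers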
By the way, recall that in (15.12) we had only one indicator variable for gender, not two. If we were to have variables for both male and female, the corresponding columns in Q would add up to the first column. Q would then not be of full rank, so $Q'Q$ would not be invertible above.
It turns out that $\widehat{\beta}$ is an unbiased estimate of $\beta$:⁶

$$E\widehat{\beta} = E[(Q'Q)^{-1} Q'V] \qquad \text{(by (15.28))} \qquad (15.29)$$
$$= (Q'Q)^{-1} Q' \, EV \qquad \text{(linearity of E())} \qquad (15.30)$$
$$= (Q'Q)^{-1} Q'Q\beta \qquad \text{(by (15.24))} \qquad (15.31)$$
$$= \beta \qquad (15.32)$$

In some applications, we assume there is no constant term $\beta_0$ in (15.19). This means that our Q matrix no longer has the column of 1s on the left end, but everything else above is valid.

⁶Note that here we are taking the expected value of a vector. This is covered in Chapter 8.
15.10.3 Back to Our ALOHA Example

In our weight/height/age example above, all three variables are random. If we repeat the "experiment," i.e. we choose another sample of 1000 people, these new people will have different weights, different heights and different ages from the people in the first sample.

But we must point out that the regression function $m_{Y;X}$ of Y on X makes sense even if X is nonrandom. To illustrate this, let's look at the ALOHA network example in our introductory chapter on discrete probability, Section 2.1.
# simulation of simple form of slotted ALOHA

# a node is active if it has a message to send (it will never have more
# than one in this model), inactive otherwise

# the inactives have a chance to go active earlier within a slot, after
# which the actives (including those newly-active) may try to send; if
# there is a collision, no message gets through

# parameters of the system:
#    s = number of nodes
#    b = probability an active node refrains from sending
#    q = probability an inactive node becomes active

# parameters of the simulation:
#    nslots = number of slots to be simulated
#    nb = number of values of b to run; they will be evenly spaced in (0,1)

# will find mean message delay as a function of b

# we will rely on the "ergodicity" of this process, which is a Markov
# chain (see http://heather.cs.ucdavis.edu/~matloff/132/PLN/Markov.tex),
# which means that we look at just one repetition of observing the chain
# through many time slots

# main loop, running the simulation for many values of b
alohamain <- function(s,q,nslots,nb) {
   deltab <- 0.7 / nb  # we'll try nb values of b in (0.2,0.9)
   md <- matrix(nrow=nb,ncol=2)
   b <- 0.2
   for (i in 1:nb) {
      b <- b + deltab
      md[i,] <- alohasim(s,b,q,nslots)  # one simulation run per value of b
   }
   return(md)
}

# simulate the process for nslots slots
alohasim <- function(s,b,q,nslots) {
   # status[i,1] = 1 or 0, for node i active or not
   # status[i,2] = if node i active, then epoch in which msg was created
   # (could try a list structure instead of a matrix)
   status <- matrix(nrow=s,ncol=2)
   # start with all active with msg created at time 0
   for (node in 1:s) status[node,] <- c(1,0)
   nsent <- 0  # number of successful transmits so far
   sumdelay <- 0  # total delay among successful transmits so far
   # now simulate the nslots slots
   for (slot in 1:nslots) {
      # check for new actives
      for (node in 1:s) {
         if (!status[node,1])  # inactive
            if (runif(1) < q) status[node,] <- c(1,slot)
      }
      # check for attempted transmissions
      ntrysend <- 0
      for (node in 1:s) {
         if (status[node,1])  # active
            if (runif(1) > b) {
               ntrysend <- ntrysend + 1
               whotried <- node
            }
      }
      if (ntrysend == 1) {  # something gets through iff exactly one tries
         # do our bookkeeping
         sumdelay <- sumdelay + slot - status[whotried,2]
         # this node now back to inactive
         status[whotried,1] <- 0
         nsent <- nsent + 1
      }
   }
   return(c(b,sumdelay/nsent))
}
A minor change is that I replaced the probability p, the probability that an active node would send in the original example, by b, the probability of not sending ("b" for "backoff"). Let A denote the time (measured in slots) between the creation of a message and the time it is successfully transmitted.

We are interested in mean delay, i.e. the mean of A. (Note that our $Y_i$ here are sample mean values of A, whereas we want to draw inferences about the population mean value of A.) We are particularly interested in the effect of b here on that mean. Our goal here, as described in Section 15.1, could be Prediction, so that we could have an idea of how much delay to expect in future settings. Or, we may wish to explore finding an optimal b, i.e. one that minimizes the mean delay, in which case our goal would be more in the direction of Understanding.
I ran the program with certain arguments, and then plotted the data:
> md <- alohamain(4,0.1,1000,100)
> plot(md,cex=0.5,xlab="b",ylab="A")
The plot is shown in Figure 15.1.
Figure 15.1: Scatter Plot
Note that though our values of b here are nonrandom, the A values are indeed random. To dramatize that point, I ran the program again. (Remember, unless you specify otherwise, R will use a different seed for its random number stream each time you run a program.) I've superimposed this second data set on the first, using filled circles this time to represent the points:
md2 <- alohamain(4,0.1,1000,100)
points(md2,cex=0.5,pch=19)
The plot is shown in Figure 15.2.
Figure 15.2: Scatter Plot, Two Data Sets

We do expect some kind of U-shaped relation, as seen here. For b too small, the nodes are clashing
with each other a lot, causing long delays to message transmission. For b too large, we are needlessly backing off in many cases in which we actually would get through.

So, a model that expresses mean A as a linear function of b, as in our height-weight example, is clearly inappropriate. However, you may be surprised to know that we can still use a linear regression model! And this is common. Here are the details:
This looks like a quadratic relationship, meaning the following. Take our response variable Y to be A, take our first predictor $X^{(1)}$ to be b, and take our second predictor $X^{(2)}$ to be $b^2$. Then when we say A and b have a quadratic relationship, we mean

$$m_{A;b}(b) = \beta_0 + \beta_1 b + \beta_2 b^2 \qquad (15.33)$$

for some constants $\beta_0$, $\beta_1$, $\beta_2$. So, we are using a three-parameter family for our model of $m_{A;b}$.

No model is exact, but our data seem to indicate that this one is reasonably good, and if further investigation confirms that, it provides for a nice compact summary of the situation.

As mentioned, this is a linear model, in the sense that the $\beta_i$ enter into (15.33) in a linear manner. The fact that that equation is quadratic in b is irrelevant. By the way, one way to look at the degree-2 term is to consider it to model the "interaction" of b with itself.

Again, we'll see how to estimate the $\beta_i$ in Section 15.10.
We could also try adding two more predictor variables, consisting of $X^{(3)} = q$ and $X^{(4)} = s$, the node activation probability and number of nodes, respectively. We would collect more data, in which we varied the values of q and s, and then could entertain the model

$$m_{A;b,q,s}(u, v, w) = \beta_0 + \beta_1 u + \beta_2 u^2 + \beta_3 v + \beta_4 w \qquad (15.34)$$
R or any other statistical package does the work for us. In R, we can use the lm() (linear model)
function:
> md <- cbind(md,md[,1]^2)
> lmout <- lm(md[,2] ~ md[,1] + md[,3])
First I added a new column to the data matrix, consisting of $b^2$. I then called lm(), with the argument

md[,2] ~ md[,1] + md[,3]

R documentation calls this model specification argument the formula. It states that I wish to use the first and third columns of md, i.e. b and $b^2$, as predictors, and use A, i.e. the second column, as the response variable.⁷

The return value from this call, which I've stored in lmout, is an object of class "lm". One of the member variables of that class, coefficients, is the vector $\widehat{\beta}$:

⁷Unfortunately, R did not allow me to put the squared column directly into the formula, forcing me to use cbind() to make a new matrix.
> lmout$coefficients
(Intercept) md[, 1] md[, 3]
27.56852 -90.72585 79.98616
So, $\widehat{\beta}_0 = 27.57$ and so on.

The result is

$$\widehat{m}_{A;b}(t) = 27.57 - 90.73t + 79.99t^2 \qquad (15.35)$$

(Do you understand why there is a hat above the m?)
Another member variable in the "lm" class is fitted.values. This is the fitted curve, meaning the values of (15.35) at $b_1, ..., b_{100}$, i.e. (15.35) evaluated at our data points. I plotted this curve on the same graph,

> lines(cbind(md[,1],lmout$fitted.values))

See Figure 15.3. As you can see, the fit looks fairly good. What should we look for?
Remember, we don't expect the curve to go through the points; we are estimating the mean of A for each b, not the A values themselves. There is always variation around the mean. If for instance we are looking at the relationship between people's heights and weights, the mean weight for people of height 70 inches might be, say, 160 pounds, but we know that some 70-inch-tall people weigh more than this and some weigh less.
However, there seems to be a tendency for our estimates $\widehat{m}_{A;b}(t)$ to be too low for values in the middle range of t, and possibly too high for t around 0.3 or 0.4. However, with a sample size of only 100, it's difficult to tell. It's always important to keep in mind that the data are random; a different sample may show somewhat different patterns. Nevertheless, we should consider a more complex model.

So I tried a quartic, i.e. fourth-degree, polynomial model. I added third- and fourth-power columns to md, calling the result md4, and invoked the call

lm(md4[,2] ~ md4[,1] + md4[,3] + md4[,4] + md4[,5])
The result was
> lmout$coefficients
(Intercept) md4[, 1] md4[, 3] md4[, 4] md4[, 5]
95.98882 -664.02780 1731.90848 -1973.00660 835.89714
Figure 15.3: Quadratic Fit Superimposed
In other words, we have an estimated regression function of

$$\widehat{m}_{A;b}(t) = 95.98882 - 664.02780\,t + 1731.90848\,t^2 - 1973.00660\,t^3 + 835.89714\,t^4 \qquad (15.36)$$

The fit is shown in Figure 15.4. It looks much better. On the other hand, we have to worry about overfitting. We return to this issue in Section 15.11.1.
Figure 15.4: Fourth Degree Fit Superimposed
15.10.4 Approximate Confidence Intervals
As usual, we should not be satisfied with just point estimates, in this case the $\widehat{\beta}_i$. We need an indication of how accurate they are, so we need confidence intervals. In other words, we need to use the $\widehat{\beta}_i$ to form confidence intervals for the $\beta_i$.

For instance, recall the study on object-oriented programming in Section 15.1. The goal there was primarily Understanding, specifically assessing the impact of OOP. That impact is measured by $\beta_2$. Thus, we want to find a confidence interval for $\beta_2$.
Equation (15.28) shows that the $\widehat{\beta}_i$ are sums of the components of V, i.e. the $Y_j$. So, the Central Limit Theorem implies that the $\widehat{\beta}_i$ are approximately normally distributed. That in turn means that, in order to form confidence intervals, we need standard errors for the $\widehat{\beta}_i$. How will we get them?

Note carefully that so far we have made NO assumptions other than (15.19). Now, though, we need to add an assumption:⁸

$$Var(Y \mid X = t) = \sigma^2 \qquad (15.37)$$
for all t. Note that this and the independence of the sample observations (e.g. the various people sampled in the Davis height/weight example are independent of each other) implies that

$$Cov(V \mid Q) = \sigma^2 I \qquad (15.38)$$

where I is the usual identity matrix (1s on the diagonal, 0s off the diagonal).

Be sure you understand what this means. In the Davis weights example, for instance, it means that the variance of weight among 72-inch-tall people is the same as that for 65-inch-tall people. That is not quite true; the taller group has larger variance. But research into this has found that as long as the discrepancy is not too bad, violations of this assumption won't affect things much.
We can derive the covariance matrix of $\widehat{\beta}$ as follows. Again to avoid clutter, let $B = (Q'Q)^{-1}$. A theorem from linear algebra says that Q'Q is symmetric and thus B is too. Another theorem says that for any conformable matrices U and V, (UV)' = V'U'. Armed with that knowledge, here we go:

$$Cov(\widehat{\beta}) = Cov(BQ'V) \qquad \text{(by (15.28))} \qquad (15.39)$$
$$= BQ' \, Cov(V) \, (BQ')' \qquad \text{(by (7.50))} \qquad (15.40)$$
$$= BQ' \, \sigma^2 I \, (BQ')' \qquad \text{(by (15.38))} \qquad (15.41)$$
$$= \sigma^2 BQ'QB \qquad \text{(lin. alg.)} \qquad (15.42)$$
$$= \sigma^2 (Q'Q)^{-1} \qquad \text{(def. of B)} \qquad (15.43)$$
Whew! That's a lot of work for you, if your linear algebra is rusty. But it's worth it, because (15.43) now gives us what we need for confidence intervals. Here's how:

First, we need to estimate $\sigma^2$. Recalling that for any random variable U, $Var(U) = E[(U - EU)^2]$, we have

⁸Actually, we could derive some usable, though messy, standard errors without this assumption.
$$\sigma^2 = Var(Y \mid X = t) \qquad (15.44)$$
$$= Var(Y \mid X^{(1)} = t_1, ..., X^{(r)} = t_r) \qquad (15.45)$$
$$= E\left[\left(Y - m_{Y;X}(t)\right)^2\right] \qquad (15.46)$$
$$= E\left[(Y - \beta_0 - \beta_1 t_1 - ... - \beta_r t_r)^2\right] \qquad (15.47)$$
Thus, a natural estimate for $\sigma^2$ would be the sample analog, where we replace E() by averaging over our sample, and replace population quantities by sample estimates:

$$s^2 = \frac{1}{n} \sum_{i=1}^{n} \left(Y_i - \widehat{\beta}_0 - \widehat{\beta}_1 X^{(1)}_i - ... - \widehat{\beta}_r X^{(r)}_i\right)^2 \qquad (15.48)$$
As in Chapter 12, this estimate of $\sigma^2$ is biased, and classically one divides by n-(r+1) instead of n. But again, it's not an issue unless r+1 is a substantial fraction of n, in which case you are overfitting and shouldn't be using a model with so large a value of r.
So, the estimated covariance matrix for $\widehat{\beta}$ is

$$\widehat{Cov}(\widehat{\beta}) = s^2 (Q'Q)^{-1} \qquad (15.49)$$

The diagonal elements here are the squared standard errors (recall that the standard error of an estimator is its estimated standard deviation) of the $\widehat{\beta}_i$. (And the off-diagonal elements are the estimated covariances between the $\widehat{\beta}_i$.) Since the first standard errors you ever saw, in Section 10.5, included factors like $1/\sqrt{n}$, you might wonder why you don't see such a factor in (15.49).

The answer is that such a factor is essentially there, in the following sense. Q'Q consists of various sums of products of the X values, and the larger n is, the larger the elements of Q'Q are. So, $(Q'Q)^{-1}$ already has something like a "1/n factor" in it.
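A small sketch of forming (15.48) and (15.49) by hand in R, continuing the simulated-data example from Section 15.10.2 above (Q, V and betahat as computed there):

fitted <- Q %*% betahat           # predicted values
s2 <- mean((V - fitted)^2)        # (15.48); lm() divides by n-(r+1) instead
covhat <- s2 * solve(t(Q) %*% Q)  # (15.49)
sqrt(diag(covhat))                # standard errors of the betahat_i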
15.10.5 Once Again, Our ALOHA Example
In R we can obtain (15.49) via the generic function vcov():
> vcov(lmout)
(Intercept) md4[, 1] md4[, 3] md4[, 4] md4[, 5]
(Intercept) 92.73734 -794.4755 2358.860 -2915.238 1279.981
md4[, 1] -794.47553 6896.8443 -20705.705 25822.832 -11422.355
md4[, 3] 2358.86046 -20705.7047 62804.912 -79026.086 35220.412
md4[, 4] -2915.23828 25822.8320 -79026.086 100239.652 -44990.271
md4[, 5] 1279.98125 -11422.3550 35220.412 -44990.271 20320.809
What is this telling us? For instance, it says that the (4,4) position (starting the count at (0,0)) in the matrix (15.49) is equal to 20320.809, so the standard error of $\widehat{\beta}_4$ is the square root of this, 142.6. Thus an approximate 95% confidence interval for the true population $\beta_4$ is

$$835.89714 \pm 1.96 \cdot 142.6 = (556.4, 1115.4) \qquad (15.50)$$

That interval is quite wide. The margin of error, 1.96 · 142.6 = 279.5, is more than half of the left endpoint of the interval, 556.4. Remember what this tells us: our sample of size 100 is not very large. On the other hand, the interval is quite far from 0, which indicates that our fourth-degree model is substantially better than our quadratic one.
Applying the R function summary() to a linear model object such as lmout here gives standard errors for the $\widehat{\beta}_i$ (and lots of other information), so we didn't really need to call vcov(). But that call can give us more:

Note that we can apply (7.50) to the estimated covariance matrix of $\widehat{\beta}$! Recall our old example of measuring the relation between people's weights and heights,

$$m_{W;H}(t) = \beta_0 + \beta_1 t \qquad (15.51)$$

Suppose we estimate $\beta$ from our data, and wish to find a confidence interval for the mean weight of all people of height 70 inches, which is

$$\beta_0 + 70\beta_1 \qquad (15.52)$$

Our estimate is

$$\widehat{\beta}_0 + 70\widehat{\beta}_1 \qquad (15.53)$$

That latter quantity is

$$(1, 70)\,\widehat{\beta} \qquad (15.54)$$

perfect for (10.59). Thus

$$\widehat{Var}(\widehat{\beta}_0 + 70\widehat{\beta}_1) = (1, 70)\, C \begin{pmatrix} 1 \\ 70 \end{pmatrix} \qquad (15.55)$$

where C is the output from vcov(). The square root of this is then the standard error for (15.53). (Recall Section 10.5.)
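A hedged sketch of that computation in R (assuming fit is a model fitted via, say, lm(w ~ h) on hypothetical height/weight data):

C <- vcov(fit)                          # estimated covariance matrix of betahat
ell <- c(1, 70)
est <- sum(ell * coef(fit))             # (15.53)
se <- sqrt(drop(t(ell) %*% C %*% ell))  # square root of (15.55)
est + c(-1, 1) * 1.96 * se              # approximate 95% CI for (15.52)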
15.10.6 Exact Confidence Intervals
Note carefully that we have not assumed that Y, given X, is normally distributed. In the height/weight context, for example, such an assumption would mean that weights in a specific height subpopulation, say all people of height 70 inches, have a normal distribution.

This issue is similar to that of Section 10.12. If we do make such a normality assumption, then we can get exact confidence intervals (which of course only hold if we really do have an exact normal distribution in the population). This again uses Student-t distributions. In that analysis, $s^2$ has n-(r+1) in its denominator instead of our n, just as there was n-1 in the denominator for $s^2$ when we estimated a single population variance. The number of degrees of freedom in the Student-t distribution is likewise n-(r+1). But as before, for even moderately large n, it doesn't matter.
15.11 Model Selection
The issues raised in Chapter 14 become crucial in regression and classification problems. In this chapter, we will typically deal with models having large numbers of parameters. A central principle will be that simpler models are preferable, provided of course they fit the data well. Hence the Einstein quote in Chapter 14! Simpler models are often called parsimonious.

Here I use the term model selection to mean which predictor variables (including powers and interactions) we will use. If we have data on many predictors, we almost certainly will not be able to use them all, for the following reason:
15.11.1 The Overfitting Problem in Regression
Recall (15.33). There we assumed a second-degree polynomial for $m_{A;b}$. Later we extended it to a fourth-degree model. Why not a fifth-degree, or sixth, and so on?

You can see that if we carry this notion to its extreme, we get absurd results. If we fit a polynomial of degree 99 to our 100 points, we can make our fitted curve exactly pass through every point! This clearly would give us a meaningless, useless curve. We are simply fitting the noise.

Recall that we analyzed this problem in Section 14.1.4 in our chapter on modeling. There we noted an absolutely fundamental principle in statistics:

In choosing between a simpler model and a more complex one, the latter is more accurate only if either

• we have enough data to support it, or
• the complex model is sufficiently different from the simpler one

This is extremely important in regression analysis, because we often have so many variables we can use, and thus can often make highly complex models.
In the regression context, the phrase "we have enough data to support the model" means (in the parametric model case) that we have enough data so that the confidence intervals for the $\beta_i$ will be reasonably narrow. For fixed n, the more complex the model, the wider the resulting confidence intervals will tend to be.

If we use too many predictor variables,⁹ our data is "diluted," by being shared by so many $\widehat{\beta}_i$. As a result, $Var(\widehat{\beta}_i)$ will be large, with big implications: Whether our goal is Prediction or Understanding, our estimates will be so poor that neither goal is achieved.

On the other hand, if some predictor variable is really important (i.e. its $\beta_i$ is far from 0), then it may pay to include it, even though the confidence intervals might get somewhat wider.

For example, look at our regression model for A against b in the ALOHA simulation in earlier sections. The relation between A and b was so far from a straight line that we should use at least a quadratic model, even if the sample size is pretty small.

The questions raised in turn by the above considerations, i.e. "How much data is enough data?" and "How different from 0 is 'quite different'?", are addressed below in Section 15.11.3.
A detailed mathematical example of overfitting in regression is presented in my paper A Careful Look at the Use of Statistical Methodology in Data Mining (book chapter), by N. Matloff, in Foundations of Data Mining and Granular Computing, edited by T.Y. Lin, Wesley Chu and L. Matzlack, Springer-Verlag Lecture Notes in Computer Science, 2005.
15.11.2 Multicollinearity
In typical applications, the $X^{(i)}$ are correlated with each other, to various degrees. If the correlation is high, a condition termed multicollinearity, problems may occur.

Consider (15.28). Suppose one predictor variable were to be fully correlated with another. That would mean that the first is exactly equal to a linear function of the other, which would mean that in Q one column is an exact linear combination of the first column and another column. Then $(Q'Q)^{-1}$ would not exist.

Well, if one predictor is strongly (but not fully) correlated with another, $(Q'Q)^{-1}$ will exist, but it will be numerically unstable. Moreover, even without numeric roundoff errors, $(Q'Q)^{-1}$ would be very large, and thus (15.43) would be large, giving us large standard errors. Not good!

⁹In the ALOHA example above, b, $b^2$, $b^3$ and $b^4$ are separate predictors, even though they are of course correlated.
Thus we have yet another reason to limit our set of predictor variables.
15.11.3 Methods for Predictor Variable Selection
So, we typically must discard some, maybe many, of our predictor variables. In the weight/height/age example, we may need to discard the age variable. In the ALOHA example, we might need to discard $b^4$ and even $b^3$. How do we make these decisions?

Note carefully that this is an unsolved problem. If anyone claims they have a foolproof way to do this, then they do not understand the problem in the first place. Entire books have been written on this subject (e.g. Subset Selection in Regression, by Alan Miller, pub. by Chapman and Hall, 2002), discussing myriad different methods. But again, none of them is foolproof.
Hypothesis testing:

The most commonly used methods for variable selection use hypothesis testing in one form or another. Typically this takes the form

$$H_0: \beta_i = 0 \qquad (15.56)$$

In the context of (15.10), for instance, a decision as to whether to include age as one of our predictor variables would mean testing

$$H_0: \beta_2 = 0 \qquad (15.57)$$

If we reject $H_0$, then we use the age variable; otherwise we discard it.
I hope I've convinced the reader, in Sections 11.8 and 14.2.1, that this is not a good idea. As usual, the hypothesis test is asking the wrong question. For instance, in the weight/height/age example, the test is asking whether $\beta_2$ is zero or not, yet we know it is not zero, before even looking at our data. What we want to know is whether $\beta_2$ is far enough from 0 for age to give us better predictions of weight. Those are two very, very different questions.
A very interesting example of overfitting using real data may be found in the paper Honest Confidence Intervals for the Error Variance in Stepwise Regression, by Foster and Stine, www-stat.wharton.upenn.edu/~stine/research/honests2.pdf. The authors, of the University of Pennsylvania Wharton School, took real financial data and deliberately added a number of extra "predictors" that were in fact random noise, independent of the real data. They then tested the hypothesis (15.56). They found that each of the fake predictors was "significantly" related to Y! This illustrates both the dangers of hypothesis testing and the possible need for multiple inference procedures.¹⁰ This problem has always been known by thinking statisticians, but the Wharton study certainly dramatized it.

¹⁰They added so many predictors that r became greater than n. However, the problems they found would have been there to a large degree even if r were less than n but r/n was substantial.
Confidence intervals:

Well, then, what can be done instead? First, there is the same alternative to hypothesis testing that we discussed before: confidence intervals. We saw an example of that in (15.50). Granted, the interval was very wide, telling us that it would be nice to have more data. But even the lower bound of that interval is far from zero, so it looks like $b^4$ is worth using as a predictor.

On the other hand, suppose in the weight/height/age example our confidence interval for $\beta_2$ is (0.04, 0.06). In other words, we estimate $\beta_2$ to be 0.05, with a margin of error of 0.01. The 0.01 is telling us that our sample size is good enough for an accurate assessment of the situation, but the interval's location, centered at 0.05, says that, for instance, a 10-year difference in age only makes about half a pound difference in mean weight. In that situation age would be of almost no value in predicting weight.

An example of this using real data is given in Section 16.2.3.2.
Predictive ability indicators:

Suppose you have several competing models, some using more predictors, some using fewer. If we had some measure of predictive power, we could decide to use whichever model has the maximum value of that measure. Here are some of the more commonly used methods of this type:

• One such measure is called adjusted R-squared. To explain it, we must discuss ordinary $R^2$ first. Let $\rho$ denote the population correlation between actual Y and predicted Y, i.e. the correlation between Y and $m_{Y;X}(X)$, where X is the vector of predictor variables in our model. Then $|\rho|$ is a measure of the power of X to predict Y, but it is traditional to use $\rho^2$ instead.¹¹ R is then the sample analog, computed from the $Y_i$ and their predicted values, and the sample $R^2$ is then an estimate of $\rho^2$. However, the former is a biased estimate: over infinitely many samples, the long-run average value of $R^2$ is higher than $\rho^2$. And the worse the overfitting, the greater the bias. Indeed, if we have n-1 predictors and n observations, we get a perfect fit, with $R^2 = 1$, yet obviously that "perfection" is meaningless. Adjusted $R^2$ is a tweaked version of $R^2$ with less bias. So, in deciding which of several models to use, we might choose the one with maximal adjusted $R^2$. Both measures are reported when one calls summary() on the output of lm().

• The most popular alternative to hypothesis testing for variable selection today is probably cross validation (see the sketch after this list). Here we split our data into a training set, which we use to estimate the $\beta_i$, and a validation set, in which we see how well our fitted model predicts new data, say in terms of average squared prediction error. We do this for several models, i.e. several sets of predictors, and choose the one which does best in the validation set. I like this method very much, though I often simply stick with confidence intervals.

• A method that enjoys some popularity in certain circles is the Akaike Information Criterion (AIC). It uses a formula, backed by some theoretical analysis, which creates a tradeoff between richness of the model and size of the standard errors of the $\widehat{\beta}_i$. Here we choose the model with minimal AIC. The R statistical package includes a function AIC() for this, which is used by step() in the regression case.

¹¹That quantity can be shown to be the proportion of variance of Y attributable to X.
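As a minimal sketch of the cross-validation idea (simulated data, all names hypothetical), comparing a quadratic and a quartic model on a held-out validation set:

set.seed(3)
b <- runif(200, 0.2, 0.9)
A <- 30 - 90*b + 80*b^2 + rnorm(200, 0, 2)
dat <- data.frame(A, b)
train <- dat[1:100, ]; valid <- dat[101:200, ]
m2 <- lm(A ~ b + I(b^2), data = train)                    # quadratic
m4 <- lm(A ~ b + I(b^2) + I(b^3) + I(b^4), data = train)  # quartic
mean((valid$A - predict(m2, valid))^2)  # validation MSPE, quadratic
mean((valid$A - predict(m4, valid))^2)  # validation MSPE, quartic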
15.11.4 A Rough Rule of Thumb
A rough rule of thumb is that one should have $r < \sqrt{n}$, where r is the number of predictors.¹²
15.12 Nominal Variables
Recall our example in Section 15.2 concerning a study of software engineer productivity. To review, the authors of the study predicted Y = number of person-months needed to complete the project, from $X^{(1)}$ = size of the project as measured in lines of code, $X^{(2)}$ = 1 or 0 depending on whether an object-oriented or procedural approach was used, and other variables.

As mentioned at the time, $X^{(2)}$ is an indicator variable. Let's generalize that a bit. Suppose we are comparing two different object-oriented languages, C++ and Java, as well as the procedural language C. Then we could change the definition of $X^{(2)}$ to have the value 1 for C++ and 0 for non-C++, and we could add another variable, $X^{(3)}$, which has the value 1 for Java and 0 for non-Java. Use of the C language would be implied by the situation $X^{(2)} = X^{(3)} = 0$.

Here we are dealing with a nominal variable, Language, which has three values, C++, Java and C, and we are representing it by the two indicator variables $X^{(2)}$ and $X^{(3)}$. Note that we do NOT want to represent Language by a single variable having the values 0, 1 and 2, which would imply that C has, for instance, double the impact of Java.

You can see that if a nominal variable takes on q values, we need q-1 indicator variables to represent it. We say that the variable has q levels. Note carefully that although we speak of this as one variable, it is implemented as q-1 variables.

¹²Asymptotic Behavior of Likelihood Methods for Exponential Families When the Number of Parameters Tends to Infinity, Stephen Portnoy, Annals of Statistics, 1968.
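In R, this indicator coding happens automatically when the nominal variable is stored as a factor; a small sketch (data frame and column names hypothetical):

# lang is a factor with levels "C", "C++", "Java"; R builds the
# q-1 = 2 indicator columns itself, with the first level as baseline
projects$lang <- factor(projects$lang)
lm(personmonths ~ linescode + lang, data = projects)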
15.13 Regression Diagnostics
Researchers in regression analysis have devised some diagnostic methods, meaning methods to check the fit of a model, the validity of assumptions [e.g. (15.37)], search for data points that may have an undue influence (and may actually be in error), and so on. The residuals tend to play a central role here.

The R package has tons of diagnostic methods. See for example Chapter 4 of Linear Models with R, Julian Faraway, Chapman and Hall, 2005.
15.14 Case Study: Prediction of Network RTT
Recall the paper by Raz et al, introduced in Section 15.2. They wished to predict network round-trip travel time (RTT) from offline variables. Now that we know how regression analysis works, let's look at some details of that paper.

First, they checked for multicollinearity. One measure of that is the ratio of largest to smallest eigenvalue of the matrix of correlations among the predictors. A rule of thumb is that there are problems if this value is greater than 15, but they found it was only 2.44, so they did not worry about multicollinearity.

They took a backwards stepwise approach to predictor variable selection, meaning that they started with all the variables, and removed them one-by-one while monitoring a goodness-of-fit criterion. They chose AIC for the latter.

Their initial predictors were DIST, the geographic distance between source and destination node, HOPS, the number of network hops (router processing), and an online variable, AS, the number of autonomous systems (large network routing regions) a message goes through. They measured the latter using the network tool traceroute.

But AS was the first variable they ended up eliminating. They found that removing it increased AIC only slightly, from about 12.6 million to 12.9 million, and reduced $R^2$ only a bit, from 0.785 to 0.778. They decided that AS was expendable, especially since they were hoping to use only offline variables.

Based on a scatter plot of RTT versus DIST, they then decided to try adding a quadratic term in that variable. This increased $R^2$ substantially, to 0.877. So, the final prediction equation they settled on predicts RTT from a quadratic function of DIST and a linear term for HOPS.
15.15 The Famous Error Term
Books on linear regression analysis, and there are hundreds, if not thousands of these, generally introduce the subject as follows. They consider the linear case with r = 1, and write

$$Y = \beta_0 + \beta_1 X + \epsilon, \quad E\epsilon = 0 \qquad (15.58)$$

with $\epsilon$ being independent of X. They also assume that $\epsilon$ has a normal distribution with variance $\sigma^2$.

Let's see how this compares to what we have been assuming here so far. In the linear case with r = 1, we would write

$$m_{Y;X}(t) = E(Y \mid X = t) = \beta_0 + \beta_1 t \qquad (15.59)$$

Note that in our context, we could define $\epsilon$ as

$$\epsilon = Y - m_{Y;X}(X) \qquad (15.60)$$

Equation (15.58) is consistent with (15.59): The former has $E\epsilon = 0$, and so does the latter, since

$$E\epsilon = EY - E[m_{Y;X}(X)] = EY - E[E(Y \mid X)] = EY - EY = 0 \qquad (15.61)$$

In order to produce confidence intervals, we later added the assumption (15.37), which you can see is consistent with (15.58) since the latter assumes that $Var(\epsilon) = \sigma^2$ no matter what value X has.

Now, what about the normality assumption in (15.58)? That would be equivalent to saying that in our context, the conditional distribution of Y given X is normal, which is an assumption we did not make. Note that in the weight/height example, this assumption would say that, for instance, the distribution of weights among people of height 68.2 inches is normal.

No matter what the context is, the variable $\epsilon$ is called the error term. Originally this was an allusion to measurement error, e.g. in chemistry experiments, but the modern interpretation would be prediction error, i.e. how much error we make when we use $m_{Y;X}(t)$ to predict Y.
Chapter 16
Relations Among Variables:
Advanced
16.1 Nonlinear Parametric Regression Models
We pointed out in Section 15.10.1 that the word linear in "linear regression model" means linear in $\beta$, not in t. This is the most popular approach, as it is computationally easy, but nonlinear models are often used.

The most famous of these is the logistic model, for the case in which Y takes on only the values 0 and 1. As we have seen before (Section 3.6), in this case the expected value becomes a probability. The logistic model for a nonvector X is then

$$m_{Y;X}(t) = P(Y = 1 \mid X = t) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 t)}} \qquad (16.1)$$

It extends to the case of vector-valued X in the obvious way.

The logistic model is quite widely used in computer science, in medicine, economics, psychology and so on.
Here is an example of a nonlinear model used in kinetics of chemical reactions, with r = 3:¹

$$m_{Y;X}(t) = \frac{\beta_1 t^{(2)} - t^{(3)}/\beta_5}{1 + \beta_2 t^{(1)} + \beta_3 t^{(2)} + \beta_4 t^{(3)}} \qquad (16.2)$$

Here the X vector is (hydrogen, n-pentane, isopentane) and Y is the reaction rate.

Unfortunately, in most cases, the least-squares estimates of the parameters in nonlinear regression do not have closed-form solutions, and numerical methods must be used. But R does that for you, via the nls() function in general, and via glm() for the logistic and related models in particular.

¹See http://www.mathworks.com/access/helpdesk/help/toolbox/stats/rsmdemo.html.
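A hedged sketch of both calls in R (data frames, column names and starting values here are hypothetical):

# logistic model via glm(); y is 0/1, x numeric, both columns of d
glm(y ~ x, family = binomial, data = d)

# a nonlinear model like (16.2) via nls(); start gives initial guesses
nls(rate ~ (b1*x2 - x3/b5) / (1 + b2*x1 + b3*x2 + b4*x3),
    data = kinetics,
    start = list(b1 = 1, b2 = 0.1, b3 = 0.1, b4 = 0.1, b5 = 1))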
16.2 The Classification Problem
As mentioned earlier, in the special case in which Y is an indicator variable, with the value 1 if the object is in a class and 0 if not, the regression problem is called the classification problem. In electrical engineering it is called pattern recognition, and the predictors are called features. In computer science the term machine learning usually refers to classification problems. Different terms, same concept.

If there are c classes, we need c (or c-1) Y variables, which I will denote by $Y^{(i)}$, i = 1,...,c.
Here are some examples:

• A forest fire is now in progress. Will the fire reach a certain populated neighborhood? Here Y would be 1 if the fire reaches the neighborhood, 0 otherwise. The predictors might be wind direction, distance of the fire from the neighborhood, air temperature and humidity, and so on.

• Is a patient likely to develop diabetes? This problem has been studied by many researchers, e.g. Using Neural Networks To Predict the Onset of Diabetes Mellitus, Murali S. Shanker, J. Chem. Inf. Comput. Sci., 1996, 36 (1), pp 35-41. A famous data set involves Pima Indian women, with Y being 1 or 0, depending on whether the patient does ultimately develop diabetes, and the predictors being the number of times pregnant, plasma glucose concentration, diastolic blood pressure, triceps skin fold thickness, serum insulin level, body mass index, diabetes pedigree function and age.

• Is a disk drive likely to fail soon? This has been studied for example in Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application, by Joseph F. Murray, Gordon F. Hughes, and Kenneth Kreutz-Delgado, Journal of Machine Learning Research 6 (2005) 783-816. Y was 1 or 0, depending on whether the drive failed, and the predictors were temperature, number of read errors, and so on.
16.2.1 The Mean Here Is a Probability
Now, here is a key point: As we have frequently noted, the mean of any indicator random variable is the probability that the variable is equal to 1 (Section 3.6). Thus in the case in which our response variable Y takes on only the values 0 and 1, i.e. classification problems, the regression function reduces to

$$m_{Y;X}(t) = P(Y = 1 \mid X = t) \qquad (16.3)$$

(Remember that X and t are vector-valued.)

As a simple but handy example, suppose Y is gender (1 for male, 0 for female), $X^{(1)}$ is height and $X^{(2)}$ is weight, i.e. we are predicting a person's gender from the person's height and weight. Then for example, $m_{Y;X}(70, 150)$ is the probability that a person of height 70 inches and weight 150 pounds is a man. Note again that this probability is a population fraction, the fraction of men among all people of height 70 and weight 150 in our population.
Make a mental note of the optimal prediction rule, if we know the population regression function:

Given X = t, the optimal prediction rule is to predict that Y = 1 if and only if $m_{Y;X}(t) > 0.5$.

So, if we know a certain person is of height 70 and weight 150, our best guess for the person's gender is to predict that the person is male if and only if $m_{Y;X}(70, 150) > 0.5$.

The optimality makes intuitive sense, and is proved in Section 16.7.2.
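A tiny sketch of this rule in R, assuming a hypothetical fitted logistic model fit and a data frame newx of new predictor values:

phat <- predict(fit, newdata = newx, type = "response")  # estimates of m_{Y;X}(t)
yhat <- as.numeric(phat > 0.5)  # predict Y = 1 iff the estimated probability exceeds 0.5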
16.2.2 Logistic Regression: a Common Parametric Model for the Regression Function in Classification Problems
Remember, we often try a parametric model for our regression function first, as it means we are estimating a finite number of quantities, instead of an infinite number. Probably the most commonly-used model is that of the logistic function (often called the logit), introduced in Section 16.1. Its r-predictor form is

$$m_{Y;X}(t) = P(Y = 1 \mid X = t) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 t_1 + ... + \beta_r t_r)}} \qquad (16.4)$$

For instance, consider the patent example in Section 15.2. Under the logistic model, the population proportion of all patents that are publicly funded, among those that contain the word "NSF", do not contain "NIH", and make five claims would have the value

$$\frac{1}{1 + e^{-(\beta_0 + \beta_1 + 5\beta_3)}} \qquad (16.5)$$
16.2.2.1 The Logistic Model: Intuitive Motivation
The logistic function itself,

$$\frac{1}{1 + e^{-u}} \qquad (16.6)$$

has values between 0 and 1, and is thus a candidate for modeling a probability. Also, it is monotonic in u, making it further attractive, as in many classification problems we believe that $m_{Y;X}(t)$ should be monotonic in the predictor variables.
16.2.2.2 The Logistic Model: Theoretical Motivation
But there are much stronger reasons to use the logit model, as it includes many common parametric models for X. To see this, note that we can write, for vector-valued discrete X and t,

$$P(Y = 1 \mid X = t) = \frac{P(Y = 1 \text{ and } X = t)}{P(X = t)} \qquad (16.7)$$
$$= \frac{P(Y = 1)\, P(X = t \mid Y = 1)}{P(X = t)} \qquad (16.8)$$
$$= \frac{P(Y = 1)\, P(X = t \mid Y = 1)}{P(Y = 1)\, P(X = t \mid Y = 1) + P(Y = 0)\, P(X = t \mid Y = 0)} \qquad (16.9)$$
$$= \frac{1}{1 + \frac{(1-q)\, P(X = t \mid Y = 0)}{q\, P(X = t \mid Y = 1)}} \qquad (16.10)$$

where q = P(Y = 1) is the proportion of members of the population which have Y = 1. (Keep in mind that this probability is unconditional! In the patent example, for instance, if say q = 0.12, then 12% of all patents in the patent population, without regard to words used, numbers of claims, etc., are publicly funded.)

If X is a continuous random vector, then the analog of (16.10) is

$$P(Y = 1 \mid X = t) = \frac{1}{1 + \frac{(1-q)\, f_{X|Y=0}(t)}{q\, f_{X|Y=1}(t)}} \qquad (16.11)$$
Now suppose X, given Y, has a normal distribution. In other words, within each class, X is normally distributed. Consider the case of just one predictor variable, i.e. r = 1. Suppose that given Y = i, X has the distribution N(μ_i, σ²), i = 0,1. Then

f_{X|Y=i}(t) = (1 / (σ√(2π))) exp[-0.5((t - μ_i)/σ)²]   (16.12)
After doing some elementary but rather tedious algebra, (16.11) reduces to the logistic form

1 / (1 + e^{-(β_0 + β_1 t)})   (16.13)

where β_0 and β_1 are functions of μ_0, μ_1 and σ.
In other words, if X is normally distributed in both classes, with the same variance but different means, then m_{Y;X} has the logistic form! And the same is true if X is multivariate normal in each class, with different mean vectors but equal covariance matrices. (The algebra is even more tedious here, but it does work out.)

So, not only does the logistic model have an intuitively appealing form, it is also implied by one of the most famous distributions X can have within each class: the multivariate normal.
If you reread the derivation above, you will see that the logit model will hold for any within-class distributions for which

ln[f_{X|Y=0}(t) / f_{X|Y=1}(t)]   (16.14)

(or its discrete analog) is linear in t. Well, guess what: this condition is true for exponential distributions too! Work it out for yourself.
In fact, a number of famous distributions imply the logit model. For this reason, it is common to
assume a logit model even if we make no assumption on the distribution of X given Y. Again, it is
a good intuitive model, as discussed in Section 16.2.2.1.
16.2.3 Variable Selection in Classification Problems

16.2.3.1 Problems Inherited from the Regression Context

In Section 15.11.3, it was pointed out that the problem of predictor variable selection in regression is unsolved. Since the classification problem is a special case of regression, there is no surefire way to select predictor variables there either.
16.2.3.2 Example: Forest Cover Data
And again, using hypothesis testing to choose predictors is not the answer. To illustrate this, let's look again at the forest cover data we saw in Section 10.7.4.

There were seven classes of forest cover there. Let's restrict attention to classes 1 and 2. In my R analysis I had the class 1 and 2 data in objects cov1 and cov2, respectively. I combined them,
> cov1and2 <- rbind(cov1,cov2)
and created a new variable to serve as Y, recoding the 1,2 class names to 1,0:
cov1and2[,56] <- ifelse(cov1and2[,55] == 1,1,0)
Let's see how well we can predict a site's class from the variable HS12 (hillside shade at noon) that we investigated in that past chapter, using a logistic model.
In R we fit logistic models via the glm() function, for generalized linear models. The word generalized here refers to models in which some function of m_{Y;X}(t) is linear in the parameters β_i. For the classification model,

ln(m_{Y;X}(t) / [1 - m_{Y;X}(t)]) = β_0 + β_1 t^(1) + ... + β_r t^(r)   (16.15)
(Recall the discussion surrounding (16.14).)
This kind of generalized linear model is specified in R by setting the named argument family to binomial. Here is the call:
> g <- glm(cov1and2[,56] ~ cov1and2[,8],family=binomial)
The result was:
> summary(g)
Call:
glm(formula = cov1and2[, 56] ~ cov1and2[, 8], family = binomial)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.165 -0.820 -0.775 1.504 1.741
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.515820 1.148665 1.320 0.1870
cov1and2[, 8] -0.010960 0.005103 -2.148 0.0317 *
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 959.72 on 810 degrees of freedom
Residual deviance: 955.14 on 809 degrees of freedom
AIC: 959.14
Number of Fisher Scoring iterations: 4
So, β̂_1 = -0.01. This is tiny, as can be seen from our data in the last chapter. There we found that the estimated mean values of HS12 for cover types 1 and 2 were 223.8 and 226.3, a difference of only 2.5. That difference in essence gets multiplied by -0.01. More concretely, in (16.1), plug in our estimates 1.52 and -0.01 from our R output above, first taking t to be 223.8 and then 226.3. The results are 0.328 and 0.322, respectively. In other words, HS12 isn't having much effect on the probability of cover type 1, and so it cannot be a good predictor of cover type.
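As a quick sanity check (this little snippet is just an illustration, not part of the original analysis), we can reproduce those two numbers by plugging the rounded coefficient estimates into the one-predictor logistic function:

logit1 <- function(b0,b1,t) 1 / (1 + exp(-(b0 + b1*t)))
logit1(1.52,-0.01,223.8)  # about 0.328
logit1(1.52,-0.01,226.3)  # about 0.322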
Yet the R output says that β_1 is significantly different from 0, with a p-value of 0.03. Thus, we see once again that hypothesis testing does not achieve our goal. Again, cross validation is a better method for choosing predictors.
16.2.4 Y Must Have a Marginal Distribution!
In our material here, we have tacitly assumed that the vector (Y,X) has a distribution. That may seem like an odd and puzzling remark to make here, but it is absolutely crucial. Let's see what it means.
Consider the study on object-oriented programming in Section 15.1, but turned around. (This example will be somewhat contrived, but it will illustrate the principle.) Suppose we know how many lines of code are in a project, which we will still call X^(1), and we know how long it took to complete, which we will now take as X^(2), and from this we want to guess whether object-oriented or procedural programming was used (without being able to look at the code, of course), which is now our new Y.
Here is our huge problem: Given our sample data, there is no way to estimate q in (16.10). That's because the authors of the study simply took two groups of programmers and had one group use object-oriented programming and had the other group use procedural programming. If we had sampled programmers at random from actual projects done at this company, that would enable us to estimate q, the population proportion of projects done with OOP. But we can't do that with the data that we do have. Indeed, in this setting, it may not even make sense to speak of q in the first place.
Mathematically speaking, if you think about the process under which the data was collected in this
study, there does exist some conditional distribution of X given Y, but Y itself has no distribution.
So, we can NOT estimate P(Y = 1 | X). About the best we can do is try to guess Y on the basis of whichever value of i makes f_{X|Y=i}(X) larger.
16.3 Nonparametric Estimation of Regression and Classification Functions
In some applications, there may be no good parametric model, say linear or logistic, for m_{Y;X}. Or, we may have a parametric model that we are considering, but we would like to have some kind of nonparametric estimation method available as a means of checking the validity of our parametric model. So, how do we estimate a regression function nonparametrically?
Many, many methods have been developed. We introduce a few here.
16.3.1 Methods Based on Estimating m_{Y;X}(t)
To guide our intuition on this, let's turn again to the example of estimating the relationship between height and weight. Consider estimation of the quantity m_{W;H}(68.2), the population mean weight of all people of height 68.2.
16.3.1.1 Kernel-Based Methods
We could take our estimate of m_{W;H}(68.2), denoted m̂_{W;H}(68.2), to be the average weight of all the people in our sample who have that height. But we may have very few people of that height (or even none), so that our estimate may have a high variance, i.e. may not be very accurate.
What we could do instead is to take the mean weight of all the people in our sample whose heights are near 68.2, say between 67.7 and 68.7. That would bias things a bit, but we'd get a lower variance. This is again an illustration of the variance/bias tradeoff introduced in Section 14.1.3.

All nonparametric regression/classification methods work like this, though with many variations. (As noted earlier, the classification problem is a special case of regression, so in the following material we will usually not distinguish between the two.)

As our definition of "near," we could take all people in our sample whose heights are within h amount of 68.2. This should remind you of our density estimators in Section 12.4 of our chapter on estimation and testing. As we saw there, a generalization would be to use a kernel method.
For instance, for univariate X and t:

m̂_{Y;X}(t) = [Σ_{i=1}^n Y_i k((t - X_i)/h)] / [Σ_{i=1}^n k((t - X_i)/h)]   (16.16)
This looks imposing, but it is simply a weighted average of the Y values in our sample, with the
larger weights being placed on observations for which X is close to t.
As before, the choice of h here involves a bias/variance tradeoff. We might try choosing h via cross validation, as discussed in Section 15.11.3.
There is an R package that includes a function nkreg() for kernel regression. The R base has a
similar method, called LOESS. Note: That is the class name, but the R function is called lowess().
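To make (16.16) concrete, here is a minimal sketch of a kernel regression estimator, using a Gaussian kernel via R's dnorm(); the function and variable names are made up for illustration:

# kernel estimate of m_{Y;X}(t) as in (16.16), univariate X;
# x and y are the sample vectors, h is the bandwidth
kernreg <- function(t,x,y,h) {
   w <- dnorm((t-x)/h)  # kernel weights, larger when X_i is near t
   sum(w*y) / sum(w)  # weighted average of the Y values
}

For the height/weight example, kernreg(68.2,ht,wt,0.5) would estimate m_{W;H}(68.2) from hypothetical sample vectors ht and wt.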
16.3.1.2 Nearest-Neighbor Methods
Similarly, we could take a nearest-neighbor approach, for instance estimating m_{Y;X}(68.2) to be the mean weight of the k people in our sample with heights nearest 68.2. Here k controls the bias/variance tradeoff.
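Again as a rough sketch (not from the original text, with the same made-up variable names as above), the nearest-neighbor idea can be coded directly:

# k-nearest-neighbor estimate of m_{Y;X}(t), univariate X
knnreg <- function(t,x,y,k) {
   nearest <- order(abs(x-t))[1:k]  # indices of the k closest X values
   mean(y[nearest])
}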
16.3.1.3 The Naive Bayes Method
The NB method is not Bayesian in the sense of Section 12.5. Instead, its name comes simply from
its usage of Bayes Rule for conditional probability. It basically makes the same computations as
in Section 16.2.2.2, for the case in which the predictors are indicator variables and are independent
of each other, given the class.
Under that assumption, the numerator in (16.10) becomes

P(Y = 1) · P[X^(1) = t_1 | Y = 1] · ... · P[X^(r) = t_r | Y = 1]   (16.17)

All of those quantities (and similarly, those in the denominator of (16.10)) can be estimated directly as sample proportions. For example, P̂[X^(1) = t_1 | Y = 1] would be the fraction of the X^(1)_j that are equal to t_1, among those observations for which Y_j = 1.
A common example of the use of Naive Bayes is text mining, as in Section 8.5.1.4. Our independence
assumption in this case means that the probability that, for instance, a document of a certain class
contains both of the words baseball and strike is the product of the individual probabilities of those
words.
Clearly the independence assumption is not justified in this application. But if our vocabulary is large, that assumption limits the complexity of our model, which may be necessary from a bias/variance tradeoff point of view (Section 14.1.3).
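As an estimation sketch (hypothetical names: a matrix x of 0-1 predictors, one row per observation, and a 0-1 response vector y), the sample proportions in (16.17) can be computed directly:

# estimate the numerator of (16.10) under the Naive Bayes assumption,
# at a given vector t of 0s and 1s
nbnumer <- function(t,x,y) {
   qhat <- mean(y == 1)  # estimate of q = P(Y = 1)
   x1 <- x[y == 1,,drop=FALSE]  # the class-1 observations
   # estimated P[X^(m) = t_m | Y = 1], for each predictor m
   condprobs <- sapply(1:ncol(x1),function(m) mean(x1[,m] == t[m]))
   qhat * prod(condprobs)
}

Doing the same for the class-0 term in the denominator and forming the ratio gives the estimated P(Y = 1 | X = t).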
16.3.2 Methods Based on Estimating Classification Boundaries

In the methods presented above, we are estimating the function m_{Y;X}(t). But with support vector machines and CART below, we are in a way working backwards. In the classification case (which is what we will focus on), for instance, our goal is to estimate the values of t for which the regression function equals 0.5:

B = {t : m_{Y;X}(t) = 0.5}   (16.18)
Recall that r is the number of predictor variables we have. Then note the geometric form that the
set B in (16.18) will take on: discrete points if r = 1; a curve if r = 2; a surface if r = 3; and a
hypersurface if r > 3.
The motivation for using (16.18) stems from the fact, noted in Section 16.2.1, that if we know m_{Y;X}(t), we will predict Y to be 1 if and only if m_{Y;X}(t) > 0.5. Since (16.18) represents the boundary between the portions of the X space for which m_{Y;X}(t) is either larger or smaller than 0.5, it is the boundary for our prediction rule, i.e. the boundary separating the regions in X space in which we predict Y to be 1 or 0.
Lest this become too abstract, again consider the simple example of predicting gender from height
and weight. Consider the (u,v) plane, with u and v representing height and weight, respectively.
Then (16.18) is some curve in that plane. If a person's (height,weight) pair is on one side of the curve, we guess that the person is male, and otherwise guess female.
If the logistic model (16.4) holds, then that curve is actually a straight line. To see this, note that in (16.4), the equation (16.18) boils down to

β_0 + β_1 u + β_2 v = 0   (16.19)

whose geometric form is a straight line.
16.3.2.1 Support Vector Machines (SVMs)
This method has been getting a lot of publicity in computer science circles (maybe too much; see below). It is better explained for the classification case.
In the form of a dot product (or inner product) from linear algebra, (16.19) is

(β_1, β_2)'(u, v) = -β_0   (16.20)
What SVM does is to generalize this, for instance changing the criterion to, say,

γ_0 u² + γ_1 uv + γ_2 v² + γ_3 u + γ_4 v = 1   (16.21)
Now our (u, v) plane is divided by a curve instead of by a straight line (though it includes straight lines as special cases), thus providing more flexibility and thus potentially better accuracy.
In SVM terminology, (16.21) uses a different kernel than the regular dot product. (This of course should not be confused with the term kernel in kernel-based regression above.) The actual method is more complicated than this, involving transforming the original predictor variables and then using an ordinary inner product in the transformed space. In the above example, the transformation consists of squaring and multiplying our variables. That takes us from two-dimensional space (just u and v) to five dimensions (u, v, u², v² and uv).
There are various other details that we've omitted here, but the essence of the method is as shown above.

Of course, a good choice of the kernel is crucial to the successful usage of this method. It is the analog of h and k in the nearness-based methods above.
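In R, one implementation is the svm() function in the e1071 package; the call below is only a sketch, with a hypothetical data frame dat containing columns y, u and v:

library(e1071)
# fit an SVM classifier with a polynomial kernel, giving a curved
# boundary as in (16.21)
fit <- svm(as.factor(y) ~ u + v, data=dat, kernel="polynomial", degree=2)
predict(fit, data.frame(u=0.5,v=0.7))  # predicted class at (0.5,0.7)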
16.3.2.2 CART
Another nonparametric method is that of Classification and Regression Trees (CART). It's again easiest explained in the classification context, say the diabetes example above.
In the diabetes example, we might try to use the glucose variable as our first predictor. The data may show that a high glucose value implies a high likelihood of developing diabetes, while a low value does the opposite. We would then find a split on this variable, meaning a cutoff value that defines high and low. Pictorially, we draw this as the root of a tree, with the left branch indicating a tentative guess of no diabetes and the right branch corresponding to a guess of diabetes.

Actually, we could do this for all our predictor variables, and find which one produces the best split at the root stage. But let's assume that we find that glucose is that variable.

Now we repeat the process. For the left branch, all the subset of our data corresponding to low glucose, we find the variable that best splits that branch, say body mass index. We do the same for the right branch, say finding that age gives the best split. We keep going until the resulting cells are too small for a reasonable split.
To predict a new case, we would first look at its glucose value. If it is either really high or really low, we predict diabetes from this information alone and stop. If not, we then look at body mass index, and so on.
An example with real data is given in a tutorial on the use of rpart, an R package that does
analysis of the CART type, An Introduction to Recursive Partitioning Using the RPART Routines,
by Terry Therneau and Elizabeth Atkinson. The data was on treatment of cardiac arrest patients
by emergency medical technicians.
The response variable here is whether the technicians were able to revive the patient, with predictors X^(1) = initial heart rhythm, X^(2) = initial response to defibrillation, and X^(3) = initial response to drugs. The resulting tree was as follows (figure omitted).

So, if for example a patient has X^(1) = 1 and X^(2) = 3, we would guess him to be revivable.
CART is a boundary method, as SVM is. Say for instance we have two variables, represented graphically by s and t, and our root node rule is s > 0.62. In the left branch, the rule is t > 0.8 and in the right branch it's t > 0.58. This boils down to a boundary line as follows:
(Figure omitted: the piecewise-linear classification boundary implied by these splits, drawn in the (s,t) plane with both axes running from 0.0 to 1.0.)
CART obviously has an intuitive appeal, is easily explained to nonstatisticians, and is quite easy to implement. It also has the virtue of working equally well with discrete or continuous predictor variables.
The analogs here of the h in the kernel method and the k in nearest-neighbor regression are the choice of where to define the splits, and when to stop splitting. Cross validation is often used for making such decisions.
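As a brief sketch of how such an analysis might be run in R, using the rpart package mentioned above (the data frame diab and its column names are hypothetical):

library(rpart)
# grow a classification tree for the diabetes example
tr <- rpart(diabetes ~ glucose + bmi + age, data=diab, method="class")
print(tr)  # shows the split chosen at each node
printcp(tr)  # cross-validation results, useful in deciding where to prune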
16.3.3 Comparison of Methods
Beware! There are no magic solutions to statistical problems. The statements one sees by some computer science researchers to the effect that SVMs are generally superior to other prediction methods are unfounded.

First, note that every one of the above methods involves some choice of tuning parameter, such as h in the kernel method, k in the nearest-neighbor method, the split points in CART, and in the case of SVM, the form of which kernel to use. For SVM the choice of kernel is crucial, yet difficult.
Second, the comparisons are often unfair, notably comparisons of the logit model to SVM. Such comparisons usually limit the logit experiments to first-degree terms without interactions. But in (16.4) we could throw in second-degree terms, etc., thus producing a curved partitioning line just like SVM does.
I highly recommend the site www.dtreg.com/benchmarks.htm, which compares six different types of classification function estimators, including logistic regression and SVM, on several dozen real data sets. The overall percent misclassification rates, averaged over all the data sets, were fairly close, ranging from a high of 25.3% to a low of 19.2%. The much-vaunted SVM came in at 20.3%. That's nice, but it was only a tad better than logit's 20.9%, and remember, that's with logit running under the handicap of having only first-degree terms.
Or consider the annual KDDCup competition, in which teams from around the world compete to solve a given classification problem with the lowest misclassification rate. In KDDCup2009, for instance, none of the top teams used SVM. See SIGKDD Explorations, December 2009 issue.
Considering that logit has a big advantage, in that one gets an actual equation for the classification function, complete with parameters which we can estimate and make confidence intervals for, it is not clear just what role SVM and the other nonparametric estimators should play, in general, though in specific applications they may be appropriate.
16.4 Symmetric Relations Among Several Variables
"It is a very sad thing that nowadays there is so little useless information." (Oscar Wilde, famous 19th century writer)
Unlike the case of regression analysis, where the response/dependent variable plays a central role,
we are now interested in symmetric relations among several variables. Often our goal is dimension
reduction, meaning compressing our data into just a few important variables.
Dimension reduction ties in to the Oscar Wilde quote above, which is a complaint that there is
too much information of the useful variety. We are concerned here with reducing the complexity of
that information to a more manageable, simple set of variables.
Here we cover two of the most widely-used methods, principal components analysis for continuous variables, and the log-linear model for the discrete case.
16.4.1 Principal Components Analysis
Consider a random vector X = (X_1, X_2)'. Suppose the two components of X are highly correlated with each other. Then for some constants c and d,

X_2 ≈ c + dX_1   (16.22)

Then in a sense there is really just one random variable here, as the second is nearly equal to some linear combination of the first. The second provides us with almost no new information, once we have the first.

In other words, even though the vector X roams in two-dimensional space, it usually sticks close to a one-dimensional object, namely the line (16.22). We saw a graph illustrating this in our chapter on multivariate distributions, page 183.
In general, consider a k-component random vector

X = (X_1, ..., X_k)'   (16.23)

We again wish to investigate whether just a few, say w, of the X_i tell almost the whole story, i.e. whether most X_j can be expressed approximately as linear combinations of these few X_i. In other words, even though X is k-dimensional, it tends to stick close to some w-dimensional subspace.
Note that although (16.22) is phrased in prediction terms, we are not (or more accurately, not necessarily) interested in prediction here. We have not designated one of the X^(i) to be a response variable and the rest to be predictors.
Once again, the Principle of Parsimony is key. If we have, say, 20 or 30 variables, it would be nice
if we could reduce that to, for example, three or four. This may be easier to understand and work
with, albeit with the complication that our new variables would be linear combinations of the old
ones.
16.4.2 How to Calculate Them

Here's how it works. The theory of linear algebra says that since the covariance matrix Σ of X is symmetric, it is diagonalizable, i.e. there is a real matrix Q for which

Q'ΣQ = D   (16.24)

where D is a diagonal matrix. (This is a special case of singular value decomposition.) The columns C_i of Q are the eigenvectors of Σ, and it turns out that they are orthogonal to each other, i.e. their dot product is 0.
Let

W_i = C_i' X, i = 1, ..., k   (16.25)

so that the W_i are scalar random variables, and set

W = (W_1, ..., W_k)'   (16.26)

Then

W = Q'X   (16.27)

Now, use the material on covariance matrices from our chapter on multivariate analysis, page 151:

Cov(W) = Cov(Q'X) = Q' Cov(X) Q = D   (from (16.24))   (16.28)

Note too that if X has a multivariate normal distribution (which we are not assuming), then W does too.
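Here is a small numerical sketch of (16.27) and (16.28), simulating bivariate normal data (the covariance matrix sgm is made up for illustration):

library(MASS)
sgm <- rbind(c(2,1),c(1,2))  # an illustrative covariance matrix
x <- mvrnorm(1000,mu=c(0,0),Sigma=sgm)
q <- eigen(sgm)$vectors  # the columns are the eigenvectors of sigma
w <- x %*% q  # each row is a W vector, per (16.27)
cov(w)  # should be nearly diagonal, per (16.28)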
Let's recap:

- We have created new random variables W_i as linear combinations of our original X_j.

- The W_i are uncorrelated. Thus if in addition X has a multivariate normal distribution, so that W does too, then the W_i will be independent.

- The variance of W_i is given by the i-th diagonal element of D.

The W_i are called the principal components of the distribution of X.
It is customary to relabel the W_i so that W_1 has the largest variance, W_2 has the second-largest, and so on. We then choose those W_i that have the larger variances, and discard the others, because the latter, having small variances, are close to constant and thus carry no information.
All this will become clearer in the example below.
16.4.3 Example: Forest Cover Data
Let's try using principal component analysis on the forest cover data set we've looked at before. There are 10 continuous variables (also many discrete ones, but there is another tool for that case, the log-linear model, discussed in Section 16.4.4).

In my R run, the data set (not restricted to just two forest cover types, but consisting only of the first 1000 observations) was in the object f. Here are the call and the results:
> prc <- prcomp(f[,1:10])
> summary(prc)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 1812.394 1613.287 1.89e+02 1.10e+02 96.93455 30.16789
Proportion of Variance 0.552 0.438 6.01e-03 2.04e-03 0.00158 0.00015
Cumulative Proportion 0.552 0.990 9.96e-01 9.98e-01 0.99968 0.99984
PC7 PC8 PC9 PC10
Standard deviation 25.95478 16.78595 4.2 0.783
Proportion of Variance 0.00011 0.00005 0.0 0.000
Cumulative Proportion 0.99995 1.00000 1.0 1.000
You can see from the variance values here that R has scaled the W_i so that their variances sum to 1.0. (It has not done so for the standard deviations, which are for the nonscaled variables.) This is fine, as we are only interested in the variances relative to each other, i.e. saving the principal components with the larger variances.
What we see here is that eight of the 10 principal components have very small variances, i.e. are close to constant. In other words, though we have 10 variables X_1, ..., X_10, there are really only two variables' worth of information carried in them.
So for example if we wish to predict forest cover type from these 10 variables, we should only use two of them. We could use W_1 and W_2, but for the sake of interpretability we stick to the original X vector; we can use any two of the X_i.
The coefficients of the linear combinations which produce W from X, i.e. the Q matrix, are available via prc$rotation.
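A usage note (not part of the original run): the principal component scores, i.e. the value of W for each observation, are returned as well, or can be formed from the rotation matrix directly:

w <- prc$x  # the scores W for each observation
# equivalently, apply the rotation Q to the centered data
w2 <- scale(f[,1:10],center=TRUE,scale=FALSE) %*% prc$rotation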
16.4.4 Log-Linear Models
Here we discuss a procedure which is something of an analog of principal components for discrete
variables. Our material on ANOVA will also come into play. It is recommended that the reader
review Sections 16.6 and 16.4.1 before continuing.
16.4.4.1 The Setting
Let's consider a variation on the software engineering example in Sections 15.2 and 16.6. Assume we have the factors, IDE, Language and Education. Our change, of extreme importance, is that we will now assume that these factors are RANDOM. What does this mean?

In the original example described in Section 15.2, programmers were assigned to languages, and in our extensions of that example, we continued to assume this. Thus for example the number of programmers who use an IDE and program in Java was fixed; if we repeated the experiment, that number would stay the same. If we were sampling from some programmer population, our new sample would have new programmers, but the number using an IDE and Java would be the same as before, as our study procedure specifies this.

By contrast, let's now assume that we simply sample programmers at random, and ask them whether they prefer to use an IDE or not, and which language they prefer.² Then for example the number of programmers who prefer to use an IDE and program in Java will be random, not fixed; if we repeat the experiment, we will get a different count.

Suppose we now wish to investigate relations between the factors. Are choice of platform and language related to education, for instance?
16.4.4.2 The Data
Denote our three factors by X^(s), s = 1,2,3. Here X^(1), IDE, will take on the values 1 and 2 instead of 1 and 0 as before, 1 meaning that the programmer prefers to use an IDE, and 2 meaning not so. X^(3) changes this way too, and X^(2) will take on the values 1 for C++, 2 for Java and 3 for C. Note that we no longer use indicator variables.
Let X^(s)_r denote the value of X^(s) for the r-th programmer in our sample, r = 1,2,...,n. Our data are the counts

N_{ijk} = number of r such that X^(1)_r = i, X^(2)_r = j and X^(3)_r = k   (16.29)
For instance, if we sample 100 programmers, our data might look like this:

prefers to use IDE:

        Bachelor's or less   Master's or more
C++             18                  15
Java            22                  10
C                6                   4

prefers not to use IDE:

        Bachelor's or less   Master's or more
C++              7                   4
Java             6                   2
C                3                   3

²Other sampling schemes are possible too.

So for example N_{122} = 10 and N_{212} = 4.
Here we have a three-dimensional contingency table. Each N_{ijk} value is a cell in the table.
16.4.4.3 The Models

Let p_{ijk} be the population probability of a randomly-chosen programmer falling into cell ijk, i.e.

p_{ijk} = P(X^(1) = i and X^(2) = j and X^(3) = k) = E(N_{ijk})/n   (16.30)
As mentioned, we are interested in relations between the factors, in the form of independence, full and partial. Consider first the case of full independence:

p_{ijk} = P(X^(1) = i and X^(2) = j and X^(3) = k)   (16.31)

= P(X^(1) = i) · P(X^(2) = j) · P(X^(3) = k)   (16.32)
Taking logs of both sides in (16.32), we see that independence of the three factors is equivalent to saying

log(p_{ijk}) = a_i + b_j + c_k   (16.33)

for some numbers a_i, b_j and c_k. The numbers must be nonpositive, and since

Σ_m P(X^(s) = m) = 1   (16.34)

we must have, for instance,

Σ_{g=1}^{2} exp(c_g) = 1   (16.35)
The point is that (16.33) looks like our no-interaction ANOVA models, e.g. (16.49). On the other hand, if we assume instead that Education is independent of IDE and Language but that IDE and Language are not independent of each other, our model would be

log(p_{ijk}) = log[P(X^(1) = i and X^(2) = j) · P(X^(3) = k)]   (16.36)

= a_i + b_j + d_{ij} + c_k   (16.37)

Here we have written the log of P(X^(1) = i and X^(2) = j) as a sum of main effects a_i and b_j, and interaction effects d_{ij}, analogous to ANOVA.
Another possible model would have IDE and Language conditionally independent, given Education, meaning that at any level of education, a programmer's preference to use IDE or not, and his choice of programming language, are not related. We'd write the model this way:

log(p_{ijk}) = log[P(X^(1) = i | X^(3) = k) · P(X^(2) = j | X^(3) = k) · P(X^(3) = k)]   (16.38)

= a_i + b_j + f_{ik} + h_{jk} + c_k   (16.39)
Note carefully that the type of independence in (16.39) has a quite different interpretation than that in (16.37).
The full model, with no independence assumptions at all, would have three two-way interaction
terms, as well as a three-way interaction term.
16.4.4.4 Parameter Estimation
Remember, whenever we have parametric models, the statistician's Swiss army knife is maximum likelihood estimation. That is what is most often used in the case of log-linear models.

How, then, do we compute the likelihood of our data, the N_{ijk}? It's actually quite straightforward, because the N_{ijk} have a multinomial distribution. Then

L = [n! / Π_{i,j,k} N_{ijk}!] · Π_{i,j,k} p_{ijk}^{N_{ijk}}   (16.40)
We then write the p_{ijk} in terms of our model parameters. Take for example (16.37), where we write

p_{ijk} = e^{a_i + b_j + d_{ij} + c_k}   (16.41)
We then substitute (16.41) in (16.40), and maximize the latter with respect to the a_i, b_j, d_{ij} and c_k, subject to constraints such as (16.35).
The maximization may be messy. But certain cases have been worked out in closed form, and in
any case today one would typically do the computation by computer. In R, for example, there is
the loglin() function for this purpose.
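As a sketch of the call, here fitting the model (16.37) to the illustrative counts given earlier (the table construction below is just for demonstration):

# the 2x3x2 table of counts N_ijk from the example above:
# i = IDE preference, j = language, k = education
n <- array(c(18,7,22,6,6,3,15,4,10,2,4,3),dim=c(2,3,2))
# margins list(c(1,2),3) allow the d_ij interaction term plus
# the c_k main effect, as in (16.37)
fit <- loglin(n,margin=list(c(1,2),3),param=TRUE)
fit$param  # the estimated main-effect and interaction terms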
16.4.4.5 The Goal: Parsimony Again
Again, we'd like the simplest model possible, but not simpler. This means a model with as much independence between factors as possible, subject to the model being accurate.

Classical log-linear model procedures do model selection by hypothesis testing, testing whether various interaction terms are 0. The tests often parallel ANOVA testing, with chi-square distributions arising instead of F-distributions.
16.5 Simpson's (Non-)Paradox
Suppose each individual in a population either possesses or does not possess traits A, B and C, and that we wish to predict trait A. Let B̄ and C̄ denote the situations in which the individual does not possess the given trait. Simpson's Paradox then describes a situation in which

P(A|B) > P(A|B̄)   (16.42)

and yet

P(A|B, C) < P(A|B̄, C)   (16.43)
In other words, the possession of trait B seems to have a positive predictive power for A by itself,
but when in addition trait C is held constant, the relation between B and A turns negative.
An example is given by Fabris and Freitas,³ concerning a classic study of tuberculosis mortality in 1910. Here the attribute A is mortality, B is city (Richmond, with B̄ being New York), and C is race (African-American, with C̄ being Caucasian). In probability terms, the data show that (these of course are sample estimates)

³C.C. Fabris and A.A. Freitas. Discovering Surprising Patterns by Detecting Occurrences of Simpson's Paradox. In Research and Development in Intelligent Systems XVI (Proc. ES99, The 19th SGES Int. Conf. on Knowledge-Based Systems and Applied Artificial Intelligence), 148-160. Springer-Verlag, 1999.
P(mortality | Richmond) = 0.0022
P(mortality | New York) = 0.0019
P(mortality | Richmond, black) = 0.0033
P(mortality | New York, black) = 0.0056
P(mortality | Richmond, white) = 0.0016
P(mortality | New York, white) = 0.0018
The data also show that
P(black | Richmond) = 0.37
P(black | New York) = 0.002
a point which will become relevant below.
At first, New York looks like it did a better job than Richmond. However, once one accounts for race, we find that New York is actually worse than Richmond. Why the reversal? The answer stems from the fact that racial inequities being what they were at the time, blacks with the disease fared much worse than whites. Richmond's population was 37% black, proportionally far more than New York's 0.2%. So, Richmond's heavy concentration of blacks made its overall mortality rate look worse than New York's, even though things were actually much worse in New York.
But is this really a paradox? Closer consideration of this example reveals that the only reason this example (and others like it) is surprising is that the predictors were used in the wrong order. One normally looks for predictors one at a time, first finding the best single predictor, then the best pair of predictors, and so on. If this were done on the above data set, the first predictor variable chosen would be race, not city. In other words, the sequence of analysis would look something like this:
P(mortality | Richmond) = 0.0022
P(mortality | New York) = 0.0019
P(mortality | black) = 0.0048
P(mortality | white) = 0.0018
P(mortality | black, Richmond) = 0.0033
P(mortality | black, New York) = 0.0056
P(mortality | white, Richmond) = 0.0016
P(mortality | white, New York) = 0.0018
The analyst would have seen that race is a better predictor than city, and thus would have chosen
race as the best single predictor. The analyst would then investigate the race/city predictor pair,
and would never reach a point in which city alone were in the selected predictor set. Thus no
anomalies would arise.
Exercises
Note to instructor: See the Preface for a list of sources of real data on which exercises can be
assigned to complement the theoretical exercises below.
1. Suppose we are interested in documents of a certain type, which we'll call Type 1. Everything that is not Type 1 we'll call Type 2, with a proportion q of all documents being Type 1. Our goal will be to try to guess document type by the presence or absence of a certain word; we will guess Type 1 if the word is present, and otherwise will guess Type 2.

Let T denote document type, and let W denote the event that the word is in the document. Also, let p_i be the proportion of documents that contain the word, among all documents of Type i, i = 1,2. The event C will denote our guessing correctly.

Find the overall probability of correct classification, P(C), and also P(C|W).

Hint: Be careful of your conditional and unconditional probabilities here.
2. In the quartic model in the ALOHA simulation example, find an approximate 95% confidence interval for the true population mean wait if our backoff parameter b is set to 0.6.

Hint: You will need to use the fact that a linear combination of the components of a multivariate normal random vector has a univariate normal distribution, as discussed in Section 8.5.2.1.
3. Consider the linear regression model with one predictor, i.e. r = 1. Let Y_i and X_i represent the values of the response and predictor variables for the i-th observation in our sample.

(a) Assume as in Section 15.10.4 that Var(Y|X = t) is a constant in t, σ². Find the exact value of Cov(β̂_0, β̂_1), as a function of the X_i and σ². Your final answer should be in scalar, i.e. non-matrix, form.

(b) Suppose we wish to fit the model m_{Y;X}(t) = β_1 t, i.e. the usual linear model but without the constant term β_0. Derive a formula for the least-squares estimate of β_1.
4. Suppose the random pair (X, Y) has density 8st on 0 < t < s < 1. Find m_{Y;X}(s) and Var(Y|X = s), 0 < s < 1.
5. We showed that (16.11) reduces to the logistic model in the case in which the distribution of X given Y is normal. Show that this is also true in the case in which that distribution is exponential, i.e.

f_{X|Y}(t, i) = λ_i e^{-λ_i t}, t > 0   (16.44)
6. The code below reads in a file, data.txt, with the header record

"age", "weight", "systolic blood pressure", "height"

and then does the regression analysis.

Suppose we wish to estimate β in the model

mean weight = β_0 + β_1 height + β_2 age

Fill in the blanks in the code:
dt <- ____________(____________________________________)
regr <- lm(___________________________________________________)
cvmat <- _______________(regr)
print("the estimated value of beta2-beta0 is",
____________________________________________________)
print("the estimated variance of beta2 - beta0 is",
_______________________ %*% cvmat %*% _________________________)
# calculate the matrix Q
q <- cbind(______________________________________________)
7. In this problem, you will conduct an R simulation experiment similar to that of Foster and Stine on overfitting, discussed in Section 15.11.3.

Generate data X^(j)_i, i = 1, ..., n, j = 1, ..., r from a N(0,1) distribution, and ε_i, i = 1, ..., n from N(0,4). Set Y_i = X^(1)_i + ε_i, i = 1, ..., n. This simulates drawing a random sample of n observations from an (r+1)-variate population.

Now suppose the analyst, unaware that Y is related to only X^(1), fits the model

m_{Y;X^(1),...,X^(r)}(t_1, ..., t_r) = β_0 + β_1 t^(1) + ... + β_r t^(r)   (16.45)

In actuality, β_j = 0 for j > 1 (and for j = 0). But the analyst wouldn't know this. Suppose the analyst selects predictors by testing the hypotheses H_0: β_i = 0, as in Section 15.11.3, with α = 0.05.

Do this for various values of r and n. You should find that, for fixed n and increasing r, some of the predictors are declared to be significantly related to Y (complete with asterisks) when in fact they are not, while X^(1), which really is related to Y, may be declared NOT significant. This illustrates the folly of using hypothesis testing to do variable selection.
8. Suppose given X = t, the distribution of Y has mean βt and variance σ², for all t in (0,1). This is a fixed-X regression setting, i.e. X is nonrandom: For each i = 1,...,n we observe Y_i drawn at random from the distribution of Y given X = i/n. The quantities β and σ² are unknown.

Our goal is to estimate m_{Y;X}(0.75). We have two choices for our estimator:

- We can estimate β in the usual least-squares manner, denoting our estimate by G, and then use as our estimator T_1 = 0.75G.

- We can take our estimator T_2 to be (Y_1 + ... + Y_n)/n.

Perform a tradeoff analysis similar to that of Section 12.2, determining under what conditions T_1 is superior to T_2 and vice versa. Our criterion is mean squared error (MSE), E[(T_i - m_{Y;X}(0.75))²]. Make your expressions as closed-form as possible.

Advice: This is a linear model, albeit one without an intercept term. The quantity G here is simply β̂. G will turn out to be a linear combination of the Y_i (the X_i being constants), so its variance is easy to find.
9. Suppose X has an N(μ, μ²) distribution, i.e. with the standard deviation equal to the mean. (A common assumption in regression contexts.) Show that h(X) = ln(X) will be a variance-stabilizing transformation, a concept discussed in Section 13.2.2.
10. Consider a random pair (X, Y) for which the linear model E(Y|X) = β_0 + β_1 X holds, and think about predicting Y, first without X and then with X, minimizing mean squared prediction error (MSPE) in each case. From Section 16.7.1, we know that without X, the best predictor is EY, while with X it is E(Y|X), which under our assumption here is β_0 + β_1 X. Show that the reduction in MSPE accrued by using X, i.e.

(E[(Y - EY)²] - E[(Y - E(Y|X))²]) / E[(Y - EY)²]   (16.46)

is equal to ρ²(X, Y).
11. In an analysis published on the Web (Sparks et al., Disease Progress over Time, The Plant Health Instructor, 2008), the following R output is presented:
> severity.lm <- lm(diseasesev~temperature,data=severity)
> summary(severity.lm)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.66233 1.10082 2.418 0.04195 *
temperature 0.24168 0.06346 3.808 0.00518 **
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Fill in the blanks:

(a) The model here is

mean ________ = β_0 + β_1 ________

(b) The two null hypotheses being tested here are H_0: ________ and H_0: ________.
12. In the notation of this chapter, give matrix and/or vector expressions for each of the following in the linear regression model:

(a) s², our estimator of σ²

(b) the standard error of the estimated value of the regression function m_{Y;X}(t) at t = c, where c = (c_0, c_1, ..., c_r)
16.6 Linear Regression with All Predictors Being Nominal Variables: Analysis of Variance
(Note to readers: The material in this section is arguably of lesser value to computer science.
As such, it can easily be skipped. However, it does provide motivation for our treatment of the
log-linear model in Section 16.4.4.)
Continuing the ideas in Section 15.12, suppose in the software engineering study they had kept the project size constant, and instead of X^(1) being project size, this variable recorded whether the programmer uses an integrated development environment (IDE). Say X^(1) is 1 or 0, depending on whether the programmer uses the Eclipse IDE or no IDE, respectively. Continue to assume the study included the nominal Language variable, i.e. assume the study included the indicator variables X^(2) (C++) and X^(3) (Java). Now all of our predictors would be nominal/indicator variables. Regression analysis in such settings is called analysis of variance (ANOVA).
Each nominal variable is called a factor. So, in our software engineering example, the factors are IDE and Language. Note again that in terms of the actual predictor variables, each factor is represented by one or more indicator variables; here IDE has one indicator variable and Language has two.
Analysis of variance is a classic statistical procedure, used heavily in agriculture, for example. We will not go into details here, but mention it briefly both for the sake of completeness and for its relevance to Sections 15.7 and 16.4.4. (The reader is strongly advised to review Section 15.7 before continuing.)
16.6.1 It's a Regression!

The term analysis of variance is a misnomer. A more appropriate name would be analysis of means, as it is in fact a regression analysis, as follows.
First, note in our software engineering example we basically are talking about six groups, because there are six different combinations of values for the triple (X^(1), X^(2), X^(3)). For instance, the triple (1,0,1) means that the programmer is using an IDE and programming in Java. Note that triples of the form (w,1,1) are impossible.

So, all that is happening here is that we have six groups with six means. But that is a regression! Remember, for variables U and V, m_{V;U}(t) is the mean of all values of V in the subpopulation group of people (or cars or whatever) defined by U = t. If U is a continuous variable, then we have infinitely many such groups, thus infinitely many means. In our software engineering example, we only have six groups, but the principle is the same. We can thus cast the problem in regression terms:
m_{Y;X}(i, j, k) = E(Y | X^(1) = i, X^(2) = j, X^(3) = k), i, j, k = 0, 1, j + k ≤ 1   (16.47)
Note the restriction j + k ≤ 1, which reflects the fact that j and k can't both be 1.
Again, keep in mind that we are working with means. For instance, m_{Y;X}(0, 1, 0) is the population mean project completion time for the programmers who do not use Eclipse and who program in C++.
Since the triple (i,j,k) can take on only six values, m can be modeled fully generally in the following six-parameter linear form:

m_{Y;X}(i, j, k) = β_0 + β_1 i + β_2 j + β_3 k + β_4 ij + β_5 ik   (16.48)

where β_4 and β_5 are the coefficients of the two interaction terms, as in Section 15.7.
16.6.2 Interaction Terms
It is crucial to understand the interaction terms. Without the ij and ik terms, for instance, our model would be

m_{Y;X}(i, j, k) = β_0 + β_1 i + β_2 j + β_3 k   (16.49)

which would mean (as in Section 15.7) that the difference between using Eclipse and no IDE is the same for all three programming languages, C++, Java and C. That common difference would be β_1. If this condition, that the impact of using an IDE is the same across languages, doesn't hold, at least approximately, then we would use the full model, (16.48). More on this below.
Note carefully that there is no interaction term corresponding to jk, since that quantity is 0, and
thus there is no three-way interaction term corresponding to ijk either.
But suppose we add a third factor, Education, represented by the indicator X^(4), having the value 1 if the programmer has at least a Master's degree, 0 otherwise. Then m would take on 12 values, and the full model would have 12 parameters:

m_{Y;X}(i, j, k, l) = β_0 + β_1 i + β_2 j + β_3 k + β_4 l + β_5 ij + β_6 ik + β_7 il + β_8 jl + β_9 kl + β_{10} ijl + β_{11} ikl   (16.50)
Again, there would be no ijkl term, as jk = 0.
Here β_1, β_2, β_3 and β_4 are called the main effects, as opposed to the coefficients of the interaction terms, called of course the interaction effects.
The no-interaction version would be

m_{Y;X}(i, j, k, l) = β_0 + β_1 i + β_2 j + β_3 k + β_4 l   (16.51)
16.6.3 Now Consider Parsimony
In the three-factor example above, we have 12 groups and 12 means. Why not just treat it that way, instead of applying the powerful tool of regression analysis? The answer lies in our desire for parsimony, as noted in Section 15.11.1.

If for example (16.51) were to hold, at least approximately, we would have a far more satisfying model. We could for instance then talk of the effect of using an IDE, rather than qualifying such a statement by stating what the effect would be for each different language and education level.
Moreover, if our sample size is not very large, we would get more accurate estimates of the various subpopulation means, once again due to the bias/variance tradeoff.
Or it could be that, while (16.51) doesn't hold, a model with only two-way interactions,

m_{Y;X}(i, j, k, l) = β_0 + β_1 i + β_2 j + β_3 k + β_4 l + β_5 ij + β_6 ik + β_7 il + β_8 jl + β_9 kl   (16.52)

does work well. This would not be as nice as (16.51), but it still would be more parsimonious than (16.50).
Accordingly, the major thrust of ANOVA is to decide how rich a model is needed to do a good job of describing the situation under study. There is an implied hierarchy of models of interest here:

- the full model, including two- and three-way interactions, (16.50)

- the model with two-factor interactions only, (16.52)

- the no-interaction model, (16.51)

Traditionally these are determined via hypothesis testing, which involves certain partitionings of sums of squares similar to (15.20). (This is where the name analysis of variance stems from.) The null distribution of the test statistic often turns out to be an F-distribution. Of course, in this book, we consider hypothesis testing inappropriate, preferring to give some careful thought to the estimated parameters, but it is standard. Further testing can be done on the individual β_i and so on. Often people use simultaneous inference procedures, discussed briefly in Section 13.3 of our chapter on estimation and testing, since many tests are performed.
16.6.4 Reparameterization
Classical ANOVA uses a somewhat different parameterization than that we've considered here. For instance, consider a single-factor setting (called one-way ANOVA) with three levels. Our predictors are then X^(1) and X^(2). Taking our approach here, we would write

m_{Y;X}(i, j) = β_0 + β_1 i + β_2 j   (16.53)

The traditional formulation would be

μ_i = μ + α_i, i = 1, 2, 3   (16.54)
where

μ = (μ_1 + μ_2 + μ_3)/3   (16.55)

and

α_i = μ_i - μ   (16.56)

Of course, the two formulations are equivalent. It is left to the reader to check that, for instance,

μ = β_0 + (β_1 + β_2)/3   (16.57)
There are similar formulations for ANOVA designs with more than one factor.
Note that the classical formulation overparameterizes the problem. In the one-way example above, for instance, there are four parameters (μ, α_1, α_2, α_3) but only three groups. This would make the system indeterminate, but we add the constraint

Σ_{i=1}^{3} α_i = 0   (16.58)

Equation (15.28) then must make use of generalized matrix inverses.
16.7 Optimality Issues
Being optimal is highly dependent on models being correct and appropriate, but optimality does give us further confidence in a model. In this section, we present two optimality results.
16.7.1 Optimality of the Regression Function for General Y
In predicting Y from X (with X random), we might assess our predictive ability by the mean squared prediction error (MSPE):

MSPE = E[(Y - w(X))²]   (16.59)

where w is some function we will use to form our prediction for Y based on X. What w is best, i.e. which w minimizes MSPE?
To answer this question, condition on X in (16.59):

MSPE = E{E[(Y - w(X))² | X]}   (16.60)
Theorem 35 The best w is m, i.e. the best way to predict Y from X is to plug in X in the regression function.

Recall from Section 14.1.1:

Lemma 36 For any random variable Z, the constant c which minimizes

E[(Z - c)²]   (16.61)

is

c = EZ   (16.62)

Apply the lemma to the inner expectation in (16.60), with Z being Y and c being some function of X. The minimizing value is EZ, i.e. E(Y|X), since our expectation here is conditional on X.

All of this tells us that the best function w in (16.59) is m_{Y;X}. This proves the theorem.
Note carefully that all of this was predicated on the use of a quadratic loss function, i.e. on
minimizing mean squared error. If instead we wished to minimize mean absolute error, the solution
would turn out to be to use the conditional median of Y given X, not the mean.
16.7.2 Optimality of the Regression Function for 0-1-Valued Y
Again, our context is that we want to guess Y, knowing X. Since Y is 0-1 valued, our guess for Y based on X, g(X), should be 0-1 valued too. What is the best g?

Again, since Y and g are 0-1 valued, our criterion should be what I will call Probability of Correct Classification (PCC):⁴

PCC = P[Y = g(X)]   (16.63)

⁴This assumes equal costs for the two kinds of classification errors, i.e. that guessing Y = 1 when Y = 0 is no more or no less serious than the opposite error.
Now proceed as in (16.60):

PCC = E{P[Y = g(X) | X]}   (16.64)

The analog of Lemma 36 is

Lemma 37 Suppose W takes on values in the set A = {0,1}, and consider the problem of maximizing

P(W = c), c ∈ A   (16.65)

The solution is

c = 1 if P(W = 1) > 0.5, and c = 0 otherwise   (16.66)

Proof

Again recalling that c is either 1 or 0, we have

P(W = c) = P(W = 1)c + [1 - P(W = 1)](1 - c)   (16.67)

= [2P(W = 1) - 1]c + 1 - P(W = 1)   (16.68)

The result follows.
Applying this to (16.64), we see that the best g is given by

g(t) = 1 if m_{Y;X}(t) > 0.5, and g(t) = 0 otherwise   (16.69)

So we find that the regression function is again optimal, in this new context.
Chapter 17
Markov Chains
One of the most famous stochastic models is that of a Markov chain. This type of model is widely
used in computer science, biology, physics, business and so on.
17.1 Discrete-Time Markov Chains
17.1.1 Example: Finite Random Walk
To motivate this discussion, let us start with a simple example: Consider a random walk on the set of integers between 1 and 5, moving randomly through that set, say one move per second, according to the following scheme. If we are currently at position i, then one time period later we will be at either i-1, i or i+1, according to the outcome of rolling a fair die: we move to i-1 if the die comes up 1 or 2, stay at i if the die comes up 3 or 4, and move to i+1 in the case of a 5 or 6. For the special cases i = 1 and i = 5, we simply move back to 2 or 4, respectively. (In random walk terminology, these are called reflecting barriers.)
The integers 1 through 5 form the state space for this process; if we are currently at 4, for instance, we say we are in state 4. Let X_t represent the position of the particle at time t, t = 0,1,2,...
The random walk is a Markov process. The process is memoryless, meaning that we can forget the past; given the present and the past, the future depends only on the present:

P(X_{t+1} = s_{t+1} | X_t = s_t, X_{t-1} = s_{t-1}, ..., X_0 = s_0) = P(X_{t+1} = s_{t+1} | X_t = s_t)   (17.1)
The term Markov process is the general one. If the state space is discrete, i.e. finite or countably infinite, then we usually use the more specialized term, Markov chain.
Although this equation has a very complex look, it has a very simple meaning: The distribution of our next position, given our current position and all our past positions, is dependent only on the current position. In other words, the system is memoryless, somewhat in analogy to the properties of the exponential distribution discussed in Section 5.1. (In fact exponential distributions will play a key role when we get to continuous-time Markov chains in Section 17.4.) It is clear that the random walk process above does have this memoryless property; for instance, if we are now at position 4, the probability that our next state will be 3 is 1/3, no matter where we were in the past.
Continuing this example, let p_{ij} denote the probability of going from position i to position j in one step. For example, p_{21} = p_{23} = 1/3 while p_{24} = 0 (we can reach position 4 from position 2 in two steps, but not in one step). The numbers p_{ij} are called the one-step transition probabilities of the process. Denote by P the matrix whose entries are the p_{ij}:

     0    1    0    0    0
    1/3  1/3  1/3   0    0
     0   1/3  1/3  1/3   0
     0    0   1/3  1/3  1/3
     0    0    0    1    0       (17.2)
By the way, it turns out that the matrix P^k gives the k-step transition probabilities. In other words, the element (i,j) of this matrix gives the probability of going from i to j in k steps.
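For example (a quick illustrative computation, not in the original text), the two-step transition probabilities of the random walk can be obtained by squaring P:

p <- matrix(c(0,1,0,0,0,
              1/3,1/3,1/3,0,0,
              0,1/3,1/3,1/3,0,
              0,0,1/3,1/3,1/3,
              0,0,0,1,0), nrow=5, byrow=TRUE)
p2 <- p %*% p  # two-step transition probabilities
p2[2,4]  # P(state 2 to state 4 in two steps) = 1/9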
17.1.2 Long-Run Distribution
In typical applications we are interested in the long-run distribution of the process, for example the long-run proportion of the time that we are at position 4. For each state i, define

π_i = lim_{t→∞} N_{it}/t   (17.3)

where N_{it} is the number of visits the process makes to state i among times 1, 2,..., t. In most practical cases, this proportion will exist and be independent of our initial position X_0. The π_i are called the steady-state probabilities, or the stationary distribution of the Markov chain.
Intuitively, the existence of the π_i implies that as t approaches infinity, the system approaches steady-state, in the sense that

lim_{t→∞} P(X_t = i) = π_i   (17.4)
Actually, the limit (17.4) may not exist in some cases. We'll return to that point later, but for typical cases it does exist, and we will usually assume this.
17.1.2.1 Derivation of the Balance Equations
Equation (17.4) suggests a way to calculate the values π_i, as follows.

First note that

P(X_{t+1} = i) = Σ_k P(X_t = k and X_{t+1} = i) = Σ_k P(X_t = k) P(X_{t+1} = i | X_t = k) = Σ_k P(X_t = k) p_{ki}   (17.5)
where the sum goes over all states k. For example, in our random walk example above, we would have

P(X_{t+1} = 3) = Σ_{k=1}^{5} P(X_t = k and X_{t+1} = 3) = Σ_{k=1}^{5} P(X_t = k) P(X_{t+1} = 3 | X_t = k) = Σ_{k=1}^{5} P(X_t = k) p_{k3}   (17.6)
Then letting t → ∞ in Equation (17.5), intuitively we would have

π_i = Σ_k π_k p_{ki}   (17.7)

Remember, here we know the p_{ki} and want to find the π_i. Solving these balance equations (one for each i) gives us the π_i.
For the random walk problem above, for instance, the solution is π = (1/11, 3/11, 3/11, 3/11, 1/11). Thus in the long run we will spend 1/11 of our time at position 1, 3/11 of our time at position 2, and so on.
17.1.2.2 Solving the Balance Equations
A matrix formulation is also useful. Letting π denote the row vector of the elements π_i, i.e. π = (π_1, π_2, ...), these equations (one for each i) then have the matrix form

π = πP   (17.8)
or

(I - P')π' = 0   (17.9)

where as usual ' denotes matrix transpose.
Note that there is also the constraint

Σ_i π_i = 1   (17.10)

One of the equations in the system is redundant. We thus eliminate one of them, say by removing the last row of I - P' in (17.9). This can be used to calculate the π_i.
To reflect (17.10), which in matrix form is

1_n' π' = 1   (17.11)

where 1_n is a column vector of n 1s (n being the number of states), we replace the removed row in I - P' by a row of all 1s, and in the right-hand side of (17.9) we replace the last 0 by a 1. We can then solve the system.
All this can be done with R's solve() function:

findpi1 <- function(p) {
   n <- nrow(p)
   imp <- diag(n) - t(p)  # I - P'
   imp[n,] <- rep(1,n)  # replace the last row by a row of 1s
   rhs <- c(rep(0,n-1),1)  # form the right-hand-side vector
   pivec <- solve(imp,rhs)  # solve for pi
   return(pivec)
}
Or one can note from (17.8) that π is a left eigenvector of P with eigenvalue 1, so one can use R's eigen() function. It can be proven that if P is irreducible and aperiodic (defined later in this chapter), every eigenvalue other than 1 is smaller than 1 (so we can speak of the eigenvalue 1), and the eigenvector corresponding to 1 has all components real.

Since π is a left eigenvector, the argument in the call must be P' rather than P. In addition, since an eigenvector is only unique up to scalar multiplication, we must deal with the fact that the return value of eigen() may have negative components, and will likely not satisfy (17.10). Here is the code:
findpi2 <- function(p) {
   n <- nrow(p)
   # find the first eigenvector of P'
   pivec <- eigen(t(p))$vectors[,1]
   # guaranteed to be real, but could be negative
   if (pivec[1] < 0) pivec <- -pivec
   # normalize to satisfy (17.10)
   pivec <- pivec / sum(pivec)
   return(pivec)
}
But Equation (17.9) may not be easy to solve. For instance, if the state space is infinite, then this matrix equation represents infinitely many scalar equations. In such cases, you may need to try to find some clever trick which will allow you to solve the system, or in many cases a clever trick to analyze the process in some way other than explicit solution of the system of equations.

And even for finite state spaces, the matrix may be extremely large. In some cases, you may need to resort to numerical methods.
17.1.2.3 Periodic Chains

Note again that even if Equation (17.9) has a solution, this does not imply that (17.4) holds. For instance, suppose we alter the random walk example above so that

p_{i,i-1} = p_{i,i+1} = 1/2     (17.12)

for i = 2, 3, 4, with transitions out of states 1 and 5 remaining as before. In this case, the solution to Equation (17.9) is π = (1/8, 1/4, 1/4, 1/4, 1/8). This solution is still valid, in the sense that Equation (17.3) will hold. For example, we will spend 1/4 of our time at Position 4 in the long run. But the limit of P(X_i = 4) will not be 1/4, and in fact the limit will not even exist. If say X_0 is even, then X_i can be even only for even values of i. We say that this Markov chain is periodic with period 2, meaning that returns to a given state can only occur after amounts of time which are multiples of 2.
17.1.2.4 The Meaning of the Term Stationary Distribution

Though we have informally defined the term stationary distribution in terms of long-run proportions, the technical definition is this:

Definition 38 Consider a Markov chain. Suppose we have a vector π of nonnegative numbers that sum to 1. Let X_0 have the distribution π. If that results in X_1 having that distribution too (and thus also all X_n), we say that π is the stationary distribution of this Markov chain.
Note that this definition stems from (17.5).

For instance, in our (first) random walk example above, this would mean that if we have X_0 distributed on the integers 1 through 5 with probabilities (1/11, 3/11, 3/11, 3/11, 1/11), then for example P(X_1 = 1) = 1/11, P(X_1 = 4) = 3/11, etc. This is indeed the case, as you can verify using (17.5) with t = 0.

In our notebook view, here is what we would do. Imagine that we generate a random integer between 1 and 5 according to the probabilities (1/11, 3/11, 3/11, 3/11, 1/11), say by rolling an 11-sided die, and set X_0 to that number. We would then generate another random number, by rolling an ordinary die, and going left, right or staying put, with probability 1/3 each. We would then write down X_0 and X_1 on the first line of our notebook. We would then do this experiment again, recording the results on the second line, then again and again. In the long run, 3/11 of the lines would have, for instance, X_0 = 4, and 3/11 of the lines would have X_1 = 4. In other words, X_1 would have the same distribution as X_0.
17.1.3 Example: Stuck-At 0 Fault

17.1.3.1 Description

In the above example, the labels for the states consisted of single integers i. In some other examples, convenient labels may be r-tuples, for example 2-tuples (i,j).

Consider a serial communication line. Let B_1, B_2, B_3, ... denote the sequence of bits transmitted on this line. It is reasonable to assume the B_i to be independent, and that P(B_i = 0) and P(B_i = 1) are both equal to 0.5.

Suppose that the receiver will eventually fail, with the type of failure being stuck at 0, meaning that after failure it will report all future received bits to be 0, regardless of their true value. Once failed, the receiver stays failed, and should be replaced. Eventually the new receiver will also fail, and we will replace it; we continue this process indefinitely.

Let ρ denote the probability that the receiver fails on any given bit, with independence between bits in terms of receiver failure. Then the lifetime of the receiver, that is, the time to failure, is geometrically distributed with success probability ρ, i.e. the probability of failing on receipt of the i-th bit after the receiver is installed is (1 - ρ)^{i-1} ρ for i = 1, 2, 3, ...

However, the problem is that we will not know whether a receiver has failed (unless we test it once in a while, which we are not including in this example). If the receiver reports a long string of 0s, we should suspect that the receiver has failed, but of course we cannot be sure that it has; it is still possible that the message being transmitted just happened to contain a long string of 0s.

Suppose we adopt the policy that, if we receive k consecutive 0s, we will replace the receiver with a
new unit. Here k is a design parameter; what value should we choose for it? If we use a very small value, then we will incur great expense, due to the fact that we will be replacing receiver units at an unnecessarily high rate. On the other hand, if we make k too large, then we will often wait too long to replace the receiver, and the resulting error rate in received bits will be sizable. Resolution of this tradeoff between expense and accuracy depends on the relative importance of the two. (There are also other possibilities, involving the addition of redundant bits for error detection, such as parity bits. For simplicity, we will not consider such refinements here. However, the analysis of more complex systems would be similar to the one below.)

17.1.3.2 Initial Analysis

A natural state space in this example would be

{(i,j) : i = 0, 1, ..., k-1; j = 0, 1; i + j ≠ 0}     (17.13)

where i represents the number of consecutive 0s that we have received so far, and j represents the state of the receiver (0 for failed, 1 for nonfailed). Note that when we are in a state of the form (k-1,j), if we receive a 0 on the next bit (whether it is a true 0 or the receiver has failed), our new state will be (0,1), as we will install a new receiver. Note too that there is no state (0,0), since if the receiver is down it must have received at least one bit.

The calculation of the transition matrix P is straightforward, though it requires careful thought. For example, suppose the current state is (2,1), and that we are investigating the expense and bit accuracy corresponding to a policy having k = 5. What can happen upon receipt of the next bit? The next bit will have a true value of either 0 or 1, with probability 0.5 each. The receiver will change from working to failed status with probability ρ. Thus our next state could be:

(3,1), if a 0 arrives, and the receiver does not fail;

(0,1), if a 1 arrives, and the receiver does not fail; or

(3,0), if the receiver fails

The probabilities of these three transitions out of state (2,1) are:

p_{(2,1),(3,1)} = 0.5(1 - ρ)     (17.14)

p_{(2,1),(0,1)} = 0.5(1 - ρ)     (17.15)

p_{(2,1),(3,0)} = ρ     (17.16)
Other entries of the matrix P can be computed similarly. Note by the way that from state (4,1) we will go to (0,1), no matter what happens.

Formally specifying the matrix P using the 2-tuple notation as above would be very cumbersome. In this case, it would be much easier to map to a one-dimensional labeling. For example, if k = 5, the nine states (1,0),...,(4,0),(0,1),(1,1),...,(4,1) could be renamed states 1,2,...,9. Then we could form P under this labeling, and the transition probabilities above would appear as

p_{78} = 0.5(1 - ρ)     (17.17)

p_{75} = 0.5(1 - ρ)     (17.18)

p_{73} = ρ     (17.19)
17.1.3.3 Going Beyond Finding π

Finding the π_i should be just the first step. We then want to use them to calculate various quantities of interest. (Note that unlike a classroom setting, where those quantities would be listed for the students to calculate, in research we must decide on our own which quantities are of interest.) For instance, in this example, it would also be useful to find the error rate ε, and the mean time (i.e., the mean number of bit receptions) between receiver replacements, τ. We can find both ε and τ in terms of the π_i, in the following manner.

The quantity ε is the proportion of the time during which the true value of the received bit is 1 but the receiver is down, which is 0.5 times the proportion of the time spent in states of the form (i,0):

ε = 0.5(π_1 + π_2 + π_3 + π_4)     (17.20)
This should be clear intuitively, but it would also be instructive to present a more formal derivation of the same thing. Let E_n be the event that the n-th bit is received in error, with D_n denoting the event that the receiver is down. Then

ε = lim_{n→∞} P(E_n)     (17.21)
  = lim_{n→∞} P(B_n = 1 and D_n)     (17.22)
  = lim_{n→∞} P(B_n = 1) P(D_n)     (17.23)
  = 0.5(π_1 + π_2 + π_3 + π_4)     (17.24)

Here we used the fact that B_n and the receiver state are independent.
Note that with the interpretation of π as the stationary distribution of the process, in Equations (17.21) above, we do not even need to take limits.

Equations (17.21) follow a pattern we'll use repeatedly in this chapter. In subsequent examples we will not show the steps with the limits, but the limits are indeed there. Make sure to mentally go through these steps yourself.

Now to get τ in terms of the π_i, note that since τ is the long-run average number of bits between receiver replacements, it is the reciprocal of η, the long-run fraction of bits that result in replacements. For example, say we replace the receiver on average every 20 bits. Over a period of 1000 bits, then (speaking on an intuitive level) that would mean about 50 replacements. Thus approximately 0.05 (50 out of 1000) of all bits result in replacements. So,

τ = 1/η     (17.25)
Again suppose k = 5. A replacement will occur only from states of the form (4,j), and even then only under the condition that the next reported bit is a 0. In other words, there are three possible ways in which replacement can occur:

(a) We are in state (4,0). Here, since the receiver has failed, the next reported bit will definitely be a 0, regardless of that bit's true value. We will then have a total of k = 5 consecutive received 0s, and therefore will replace the receiver.

(b) We are in the state (4,1), and the next bit to arrive is a true 0. It then will be reported as a 0, our fifth consecutive 0, and we will replace the receiver, as in (a).

(c) We are in the state (4,1), and the next bit to arrive is a true 1, but the receiver fails at that time, resulting in the reported value being a 0. Again we have five consecutive reported 0s, so we replace the receiver.

Therefore, using the one-dimensional labeling above, under which (4,0) is state 4 and (4,1) is state 9,

η = π_4 + π_9 (0.5 + 0.5ρ)     (17.26)

Again, make sure you work through the full version of (17.26), using the pattern in (17.21). (The other way to work this out rigorously is to assume that X_0 has the distribution π, as in Section 17.1.2.4; then no limits are needed in (17.21). But this may be more difficult to understand.)
Thus

τ = 1/η = 1 / [π_4 + 0.5 π_9 (1 + ρ)]     (17.27)

This kind of analysis could be used as the core of a cost-benefit tradeoff investigation to determine a good value of k. (Note that the π_i are functions of k, and that the above equations for the case k = 5 must be modified for other values of k.)
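As a concrete illustration, here is one way the whole calculation might be coded for the case k = 5, reusing findpi1() from Section 17.1.2.2. The function name stuckmat() and the loop structure are mine, not part of the text; this is a sketch rather than a definitive implementation:

# build the 9x9 matrix P for k = 5, with states (1,0),...,(4,0) numbered
# 1-4 and (0,1),...,(4,1) numbered 5-9, as in Section 17.1.3.2
stuckmat <- function(rho) {
   p <- matrix(0, nrow=9, ncol=9)
   for (i in 1:3) p[i,i+1] <- 1  # (i,0) -> (i+1,0); reported bit surely 0
   p[4,5] <- 1  # (4,0) -> (0,1); fifth consecutive 0, so we replace
   for (s in 5:8) {  # s represents (i,1), with i = s - 5
      p[s,s+1] <- 0.5 * (1-rho)  # true 0 arrives, receiver stays up
      p[s,5] <- p[s,5] + 0.5 * (1-rho)  # true 1 arrives, receiver stays up
      p[s,s-4] <- rho  # receiver fails -> (i+1,0)
   }
   p[9,5] <- 1  # from (4,1) we go to (0,1) no matter what
   return(p)
}
rho <- 0.01  # an illustrative value
pivec <- findpi1(stuckmat(rho))
eps <- 0.5 * sum(pivec[1:4])  # error rate, (17.20)
eta <- pivec[4] + pivec[9] * (0.5 + 0.5*rho)  # (17.26)
tau <- 1/eta  # mean time between replacements, (17.27)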
17.1.4 Example: Shared-Memory Multiprocessor

(Adapted from Probability and Statistics, with Reliability, Queuing and Computer Science Applications, by K.S. Trivedi, Prentice-Hall, 1982 and 2002, but similar to many models in the research literature.)
17.1.4.1 The Model

Consider a shared-memory multiprocessor system with m memory modules and m CPUs. The address space is partitioned into m chunks, based on either the most-significant or least-significant log_2 m bits in the address. (You may recognize these as high-order and low-order interleaving, respectively.)

The CPUs will need to access the memory modules in some random way, depending on the programs they are running. To make this idea concrete, consider the Intel assembly language instruction

add %eax, (%ebx)

which adds the contents of the EAX register to the word in memory pointed to by the EBX register. Execution of that instruction will (absent cache and other similar effects, as we will assume here and below) involve two accesses to memory: one to fetch the old value of the word pointed to by EBX, and another to store the new value. Moreover, the instruction itself must be fetched from memory. So, altogether the processing of this instruction involves three memory accesses.

Since different programs are made up of different instructions, use different register values and so on, the sequences of addresses in memory that are generated by CPUs are modeled as random variables. In our model here, the CPUs are assumed to act independently of each other, and successive requests from a given CPU are independent of each other too. A CPU will choose the i-th module with probability q_i. A memory request takes one unit of time to process, though the wait may be longer due to queuing. In this very simplistic model, as soon as a CPU's memory request
is fulfilled, it generates another one. On the other hand, while a CPU has one memory request pending, it does not generate another.

Let's assume a crossbar interconnect, which means there are m² separate paths from CPUs to memory modules, so that if the m CPUs have memory requests to m different memory modules, then all the requests can be fulfilled simultaneously. Also, assume as an approximation that we can ignore communication delays.

How good are these assumptions? One weakness, for instance, is that many instructions do not use memory at all, except for the instruction fetch, and as mentioned, even the latter may be suppressed due to cache effects.

Another example of potential problems with the assumptions involves the fact that many programs will have code like

for (i = 0; i < 10000; i++) sum += x[i];

Since the elements of the array x will be stored in consecutive addresses, successive memory requests from the CPU while executing this code will not be independent. The assumption would be more justified if we were including cache effects, or (as noticed by Earl Barr) if we are studying a timesharing system with a small quantum size.

Thus, many models of systems like this have been quite complex, in order to capture the effects of various things like caching and nonindependence in the model. Nevertheless, one can often get some insight from even very simple models. In any case, for our purposes here it is best to stick to simple models, so as to understand more easily.
Our state will be an m-tuple (N_1, ..., N_m), where N_i is the number of requests currently pending at memory module i. Recalling our assumption that a CPU generates another memory request immediately after the previous one is fulfilled, we always have that N_1 + ... + N_m = m.

It is straightforward to find the transition probabilities p_{ij}. Here are a couple of examples, with m = 2:

p_{(2,0),(1,1)}: Recall that state (2,0) means that currently there are two requests pending at Module 1, one being served and one in the queue, and no requests at Module 2. For the transition (2,0) → (1,1) to occur, when the request being served at Module 1 is done, it will make a new request, this time for Module 2. This will occur with probability q_2. Meanwhile, the request which had been queued at Module 1 will now start service. So, p_{(2,0),(1,1)} = q_2.

p_{(1,1),(1,1)}: In state (1,1), both pending requests will finish in this cycle. To go to (1,1) again, that would mean that the two CPUs request different modules from each other: CPUs 1 and 2 choose Modules 1 and 2, or 2 and 1. Each of those two possibilities has probability q_1 q_2, so p_{(1,1),(1,1)} = 2 q_1 q_2.
We then solve for the π, using (17.7). It turns out, for example, that

π_{(1,1)} = q_1 q_2 / (1 - 2 q_1 q_2)     (17.28)
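Here is a sketch of a numerical check of (17.28), using findpi1() from Section 17.1.2.2; the matrix entries follow the same reasoning as the two worked examples above, and the values of q_1, q_2 are chosen purely for illustration:

# states (2,0), (1,1), (0,2), in that order
q1 <- 0.4; q2 <- 0.6  # must satisfy q1 + q2 = 1
p <- matrix(c(q1,   q2,      0,
              q1^2, 2*q1*q2, q2^2,
              0,    q1,      q2), nrow=3, byrow=TRUE)
findpi1(p)[2]  # pi_(1,1)
q1*q2 / (1 - 2*q1*q2)  # (17.28); the two values agree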
17.1.4.2 Going Beyond Finding π

Let B denote the number of memory requests completed in a given memory cycle. Then we may be interested in E(B), the number of requests completed per unit time, i.e. per cycle. We can find E(B) as follows. Let S denote the current state. Then, continuing the case m = 2, we have from the Law of Total Expectation,

E(B) = E[E(B|S)]     (17.29)
     = P(S = (2,0)) E(B|S = (2,0)) + P(S = (1,1)) E(B|S = (1,1)) + P(S = (0,2)) E(B|S = (0,2))     (17.30)
     = π_{(2,0)} E(B|S = (2,0)) + π_{(1,1)} E(B|S = (1,1)) + π_{(0,2)} E(B|S = (0,2))     (17.31)

All this equation is doing is finding the overall mean of B by breaking down into the cases for the different states.

Now if we are in state (2,0), only one request will be completed this cycle, and B will be 1. Thus E(B|S = (2,0)) = 1. Similarly, E(B|S = (1,1)) = 2 and so on. After doing all the algebra, we find that

E(B) = (1 - q_1 q_2) / (1 - 2 q_1 q_2)     (17.32)

The maximum value of E(B) occurs when q_1 = q_2 = 1/2, in which case E(B) = 1.5. This is a lot less than the maximum capacity of the memory system, which is m = 2 requests per cycle.
(Actually, we could take a more direct route in this case, noting that B can only take on the values 1 and 2. Then E(B) = P(B = 1) + 2 P(B = 2) = π_{(2,0)} + π_{(0,2)} + 2 π_{(1,1)}. But the analysis above extends better to the case of general m.)
17.1.5 Example: Slotted ALOHA

Recall the slotted ALOHA model from Chapter 2:

Time is divided into slots or epochs.

There are n nodes, each of which is either idle or has a single message transmission pending. So, a node doesn't generate a new message until the old one is successfully transmitted (a very unrealistic assumption, but we're keeping things simple here).

In the middle of each time slot, each of the idle nodes generates a message with probability q.

Just before the end of each time slot, each active node attempts to send its message with probability p, with the transmission occupying the remainder of the slot.

If more than one node attempts to send within a given time slot, there is a collision, and each of the transmissions involved will fail.

The send probability p thus acts as a backoff mechanism: with probability 1 - p an active node refrains from sending in that slot. So, p is a design parameter, which must be chosen carefully. If p is too large, we will have too many collisions, thus increasing the average time to send a message. If p is too small, a node will often refrain from sending even if no other node is there to collide with.
Define our state for any given time slot to be the number of nodes currently having a message to send at the very beginning of the time slot (before new messages are generated). Then for 0 < i < n and 0 < j < n - i (there will be a few special boundary cases to consider too), we have

p_{i,i-1} = (1-q)^{n-i} · i(1-p)^{i-1} p     (17.33)

(the first factor giving the probability of no new messages, the second that of exactly one transmission, which then succeeds);

p_{ii} = (1-q)^{n-i} [1 - i(1-p)^{i-1} p] + (n-i)(1-q)^{n-i-1} q · (i+1)(1-p)^{i} p     (17.34)

(no new messages and no successful transmission, or one new message and one successful transmission); and

p_{i,i+j} = C(n-i, j) q^j (1-q)^{n-i-j} [1 - (i+j)(1-p)^{i+j-1} p] + C(n-i, j+1) q^{j+1} (1-q)^{n-i-j-1} (i+j+1)(1-p)^{i+j} p     (17.35)

(j new messages and no successful transmission, or j+1 new messages and one successful transmission), where C(a,b) denotes the binomial coefficient "a choose b."

Note that in (17.34) and (17.35), we must take into account the fact that a node with a newly-created message might try to send it. In (17.35), for instance, in the first term we have j new messages, on top of the i we already had, so i+j messages might try to send. The probability that there is no successful transmission is then 1 - (i+j)(1-p)^{i+j-1} p.
The matrix P is then quite complex. We always hope to find a closed-form solution, but that is unlikely in this case. Solving it on a computer is easy, though, say by using the solve() function in the R statistical language.
17.1.5.1 Going Beyond Finding π

Once again various interesting quantities can be derived as functions of the π, such as the system throughput τ, i.e. the number of successful transmissions in the network per unit time. Here's how to get τ:

First, suppose for concreteness that in steady-state the probability of there being a successful transmission in a given slot is 20%. Then after, say, 100,000 slots, about 20,000 will have successful transmissions, for a throughput of 0.2. So, the long-run probability of successful transmission is the same as the long-run fraction of slots in which there are successful transmissions! That in turn can be broken down in terms of the various states:

τ = P(success xmit)     (17.36)
  = Σ_s P(success xmit | in state s) P(in state s)

Now, to calculate P(success xmit | in state s), recall that in state s we start the slot with s nonidle nodes, but that we may acquire some new ones; each of the n-s idle nodes will create a new message,
with probability q. So,

P(success xmit | in state s) = Σ_{j=0}^{n-s} C(n-s, j) q^j (1-q)^{n-s-j} (s+j)(1-p)^{s+j-1} p     (17.37)

Substituting into (17.36), we have

τ = Σ_{s=0}^{n} Σ_{j=0}^{n-s} C(n-s, j) q^j (1-q)^{n-s-j} (s+j)(1-p)^{s+j-1} p · π_s     (17.38)
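For computation, (17.38) translates directly into a few lines of R. The function name alohatau() below is mine, and pivec is assumed to hold (π_0, ..., π_n), obtained, say, from findpi1(); a sketch, assuming 0 < p < 1:

# throughput tau of (17.38); pivec[s+1] holds pi_s, s = 0,...,n
alohatau <- function(pivec, n, p, q) {
   tau <- 0
   for (s in 0:n) {
      j <- 0:(n-s)
      succprob <- sum(choose(n-s,j) * q^j * (1-q)^(n-s-j) *
                      (s+j) * (1-p)^(s+j-1) * p)  # (17.37)
      tau <- tau + succprob * pivec[s+1]
   }
   return(tau)
}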
With some more subtle reasoning, one can derive the mean time a message waits before being successfully transmitted, as follows:

Focus attention on one particular node, say Node 0. It will repeatedly cycle through idle and busy periods, I and B. We wish to find E(B). I has a geometric distribution with parameter q, so

E(I) = 1/q     (17.39)

(If a message is sent in the same slot in which it is created, we will count B as 1. If it is sent in the following slot, B = 2, etc. B will have a modified geometric distribution starting at 0 instead of 1, but we will ignore this here for the sake of simplicity.)

Then if we can find E(I+B), we will get E(B) by subtraction.

To find E(I+B), note that there is a one-to-one correspondence between I+B cycles and successful transmissions; each I+B period ends with a successful transmission at Node 0. Imagine again observing this node for, say, 100,000 time slots, and say E(I+B) is 50. That would mean we'd have about 2000 cycles, thus about 2000 successful transmissions from this node. In other words, the throughput would be approximately 2000/100000 = 0.02 = 1/E(I+B). So, a fraction

1 / E(I+B)     (17.40)

of the time slots have successful transmissions from this node.

But that quantity is the throughput for this node (number of successful transmissions per unit time), and due to the symmetry of the system, that throughput is 1/n of the total throughput of the n nodes in the network, which we denoted above by τ.
So,

E(I+B) = n/τ     (17.41)

Thus from (17.39) we have

E(B) = n/τ - 1/q     (17.42)

where of course τ is the function of the π_i given in (17.36).
Now let's find the proportion of attempted transmissions which are successful. This will be

E(number of successful transmissions in a slot) / E(number of attempted transmissions in a slot)     (17.43)

(To see why this is the case, again think of watching the network for 100,000 slots.) Then the proportion of successful transmissions during that period of time is the number of successful transmissions divided by the number of attempted transmissions. Those two numbers are approximately proportional to the numerator and denominator of (17.43).

Now, how do we evaluate (17.43)? Well, the numerator is easy, since it is τ, which we found before. The denominator will be

Σ_s π_s [sp + (n-s)qp]     (17.44)

The factor sp + (n-s)qp comes from the following reasoning. If we are in state s, the s nodes which already have something to send will each transmit with probability p, so there will be an expected number sp of them that try to send. Also, of the n-s nodes which are idle at the beginning of the slot, an expected (n-s)q of them will generate new messages, and of those, an expected (n-s)qp will try to send.
17.2 Simulation of Markov Chains

Simulation of Markov chains is identical to the patterns we've seen in earlier chapters, except for one somewhat subtle difference. To see this, consider the first simulation code presented in this book, in Section 2.12.3.

There we were simulating X_1 and X_2, the state of the system during the first two time slots. A rough outline of the code is

do nreps times
   simulate X1 and X2
   record X1, X2 and update counts
calculate probabilities as counts/nreps

We "played the movie" nreps times, calculating the behavior of X_1 and X_2 over many plays.

But suppose instead that we had been interested in finding

lim_{n→∞} (X_1 + ... + X_n) / n     (17.45)

i.e. the long-run average number of active nodes over infinitely many time slots. In that case, we would need to play the movie only once; this follows from the fact that the limit in (17.3) occurs even within one play.

Here's an example, simulating the stuck-at 0 example from Section 17.1.3:
Heres an example, simulating the stuck-at 0 example from Section 17.1.3:
1 # simulates the stuck-at 0 fault example, finding mean time between
2 # replacements; well keep simulating until we have nreplace replacements
3 # of the receiver, then divide that into the number of bits received, to
4 # get the mean time between replacements
5 sasim <- function(nreplace,rho,k) {
6 replace <- 0 # number of receivers replaced so far
7 up <- TRUE # receiver is up
8 nbits <- 0 # number of bits received so far
9 ncsec0 <- 0 # current number of consecutive 0s
10 while (TRUE) {
11 bit <- sample(0:1,1)
12 nbits <- nbits + 1
13 if (runif(1) < rho) {
14 up <- FALSE
15 bit <- 0
16 }
17 if (bit == 0) {
18 ncsec0 <- ncsec0 + 1
19 if (ncsec0 == k) {
20 replace <- replace + 1
21 ncsec0 <- 0
22 up <- TRUE
23 }
24 }
25 if (replace == nreplace) break
26 }
27 return(nbits/nreplace)
28 }
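For instance, with ρ = 0.01 and k = 5 (values chosen purely for illustration), a call might look like

sasim(10000,0.01,5)  # estimated mean number of bits between replacements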
17.3 Hidden Markov Models

The word hidden in the term Hidden Markov Model (HMM) refers to the fact that the state of the process is hidden, i.e. unobservable.

Actually, we've already seen an example of this, back in Section 17.1.3. There the state, actually just part of it, was unobservable, namely the status of the receiver being up or down. But there we were not trying to guess X_n from Y_n (see below), so it probably would not be considered an HMM. Note too the connection to mixture models, Section 9.3.

An HMM consists of a Markov chain X_n which is unobservable, together with observable values Y_n. The X_n are governed by the transition probabilities p_{ij}, and the Y_n are generated from the X_n according to

r_{km} = P(Y_n = m | X_n = k)     (17.46)

Typically the idea is to guess the X_n from the Y_n and our knowledge of the p_{ij} and r_{km}. The details are too complex to give here, but you can at least understand that Bayes' Rule comes into play.

A good example of HMMs would be in text mining applications. Here the Y_n might be words in the text, and X_n would be their parts of speech (POS): nouns, verbs, adjectives and so on. Consider the word round, for instance. Your first thought might be that it is an adjective, but it could be a noun (e.g. an elimination round in a tournament) or a verb (e.g. to round off a number or round a corner). The HMM would help us to guess which, and therefore guess the true meaning of the word.

HMMs are also used in speech processing, DNA modeling and many other applications.
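Though the inference details are beyond our scope, the generative structure in (17.46) is easy to simulate. Here is a minimal sketch (the function name hmmsim() and its argument conventions are mine), with states and observation values numbered 1, 2, ...:

# simulate nsteps of an HMM: p is the transition matrix of the hidden
# chain, r is the matrix of the r_km in (17.46), x0 the initial state
hmmsim <- function(p, r, x0, nsteps) {
   x <- numeric(nsteps)  # hidden states, unobservable in practice
   y <- numeric(nsteps)  # observations
   xcur <- x0
   for (i in 1:nsteps) {
      xcur <- sample(1:ncol(p), 1, prob=p[xcur,])  # one step of the chain
      x[i] <- xcur
      y[i] <- sample(1:ncol(r), 1, prob=r[xcur,])  # emit an observation
   }
   list(x=x, y=y)
}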
17.4 Continuous-Time Markov Chains

In the Markov chains we analyzed above, events occur only at integer times. However, many Markov chain models are of the continuous-time type, in which events can occur at any time. Here the holding time, i.e. the time the system spends in one state before changing to another state, is a continuous random variable.

The state of a Markov chain at any time now has a continuous subscript. Instead of the chain consisting of the random variables X_n, n = 1, 2, 3, ... (you can also start n at 0 in the sense of Section 17.1.2.4), it now consists of {X_t : t ∈ [0, ∞)}. The Markov property is now

P(X_{t+u} = k | X_s for all 0 ≤ s ≤ t) = P(X_{t+u} = k | X_t), for all t, u ≥ 0     (17.47)
17.4.1 Holding-Time Distribution

In order for the Markov property to hold, the distribution of holding time at a given state needs to be memoryless. You may recall that exponentially distributed random variables have this property. In other words, if a random variable W has density

f(t) = λ e^{-λt}     (17.48)

for some λ, then

P(W > r+s | W > r) = P(W > s)     (17.49)

for all positive r and s. Actually, one can show that exponential distributions are the only continuous distributions which have this property. Therefore, holding times in Markov chains must be exponentially distributed.
It is difficult for the beginning modeler to fully appreciate the memoryless property. You are urged to read the material on exponential distributions in Section 4.5.4.1 before continuing.

Because it is central to the Markov property, the exponential distribution is assumed for all basic activities in Markov models. In queuing models, for instance, both the interarrival times and service times are assumed to be exponentially distributed (though of course with different values of λ). In reliability modeling, the lifetime of a component is assumed to have an exponential distribution.

Such assumptions have in many cases been verified empirically. If you go to a bank, for example, and record data on when customers arrive at the door, you will find the exponential model to work well (though you may have to restrict yourself to a given time of day, to account for nonrandom effects such as heavy traffic at the noon hour). In a study of time to failure for airplane air conditioners, the distribution was also found to be well fitted by an exponential density. On the other hand, in many cases the distribution is not close to exponential, and purely Markovian models cannot be used for anything more than a rough approximation.
17.4.2 The Notion of Rates

A key point is that the parameter λ in (17.48) has the interpretation of a rate, in the sense we will now discuss. First, recall that 1/λ is the mean. Say light bulb lifetimes have an exponential distribution with mean 100 hours, so λ = 0.01. In our lamp, whenever its bulb burns out, we immediately replace it with a new one. Imagine watching this lamp for, say, 100,000 hours. During that time, we will have done approximately 100000/100 = 1000 replacements. That would be using 1000 light bulbs in 100,000 hours, so we are using bulbs at the rate of 0.01 bulb per hour. For a general λ, we would use light bulbs at the rate of λ bulbs per hour. This concept is crucial to what follows.
17.4.3 Stationary Distribution

We again define π_i to be the long-run proportion of time the system is in state i, and we again will derive a system of linear equations to solve for these proportions.

17.4.3.1 Intuitive Derivation

To this end, let λ_i denote the parameter in the holding-time distribution at state i, and define the quantities

ρ_rs = λ_r p_rs     (17.50)

with the following interpretation. In the context of the ideas in our example of the rate of light bulb replacements in Section 17.4.2, one can view (17.50) as the rate of transitions from r to s, during the time we are in state r.

Then, equating the rate of transitions into i and the rate out of i, we have

π_i λ_i = Σ_{j≠i} π_j λ_j p_ji     (17.51)

These equations can then be solved for the π_i.
17.4.3.2 Computation

Motivated by (17.51), define the matrix Q by

q_ij = λ_j p_ji,  if i ≠ j
q_ii = -λ_i     (17.52)

Q is called the infinitesimal generator of the system, so named because it is the basis of the system of differential equations that can be used to find the finite-time probabilistic behavior of X_t.

The name also reflects the rates notion we've been discussing, due to the fact that, say in our light bulb example in Section 17.4.2,

P(bulb fails in next h time) = λh + o(h)     (17.53)

Then (17.51) is stated in matrix form as

Q π′ = 0     (17.54)

Here is R code to solve the system:

findpicontin <- function(q) {
   n <- nrow(q)
   # here q is assumed given with q[i,j] the rate from i to j for i != j,
   # and -lambda_i on the diagonal; t(q) is then the Q of (17.52)
   newq <- t(q)
   newq[n,] <- rep(1,n)  # replace the last equation by sum(pi) = 1
   rhs <- c(rep(0,n-1),1)
   pivec <- solve(newq,rhs)
   return(pivec)
}
To solve the equations (17.51), we'll need a property of exponential distributions derived previously in Section 8.3.7, copied here for convenience:

Theorem 39 Suppose W_1, ..., W_k are independent random variables, with W_i being exponentially distributed with parameter λ_i. Let Z = min(W_1, ..., W_k). Then

(a) Z is exponentially distributed with parameter λ_1 + ... + λ_k

(b) P(Z = W_i) = λ_i / (λ_1 + ... + λ_k)
17.4.4 Example: Machine Repair

Suppose the operations in a factory require the use of a certain kind of machine. The manager has installed two of these machines. This is known as a gracefully degrading system: When both machines are working, the fact that there are two of them, instead of one, leads to a shorter wait time for access to a machine. When one machine has failed, the wait is longer, but at least the factory operations may continue. Of course, if both machines fail, the factory must shut down until at least one machine is repaired.

Suppose the time until failure of a single machine, carrying the full load of the factory, has an exponential distribution with mean 20.0, but the mean is 25.0 when the other machine is working, since it is not so loaded. Repair time is exponentially distributed with mean 8.0.
We can take as our state space {0,1,2}, where the state is the number of working machines. Now, let us find the parameters λ_i and p_ji for this system. For example, what about λ_2? The holding time in state 2 is the minimum of the two lifetimes of the machines, and thus from the results of Section 8.3.7, has parameter 1/25.0 + 1/25.0 = 0.08.

For λ_1, a transition out of state 1 will be either to state 2 (the down machine is repaired) or to state 0 (the up machine fails). The time until transition will be the minimum of the lifetime of the up machine and the repair time of the down machine, and thus will have parameter 1/20.0 + 1/8.0 = 0.175. Similarly, λ_0 = 1/8.0 + 1/8.0 = 0.25.
It is important to understand how the Markov property is being used here. Suppose we are in state 1, and the down machine is repaired, sending us into state 2. Remember, the machine which had already been up has lived for some time now. But the memoryless property of the exponential distribution implies that this machine is now "born again."

What about the parameters p_ji? Well, p_21 is certainly easy to find; since the transition 2 → 1 is the only transition possible out of state 2, p_21 = 1.

For p_12, recall that transitions out of state 1 are to states 0 and 2, with rates 1/20.0 and 1/8.0, respectively. So, by Theorem 39(b),

p_12 = (1/8.0) / (1/20.0 + 1/8.0) ≈ 0.71     (17.55)
Working in this manner, we finally arrive at the complete system of equations (17.51):

π_2 (0.08) = π_1 (0.125)     (17.56)

π_1 (0.175) = π_2 (0.08) + π_0 (0.25)     (17.57)

π_0 (0.25) = π_1 (0.05)     (17.58)

Of course, we also have the constraint π_2 + π_1 + π_0 = 1. The solution turns out to be

π = (0.072, 0.362, 0.566)     (17.59)

Thus for example, during 7.2% of the time, there will be no machine available at all.
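We can check this with findpicontin() from Section 17.4.3.2. Taking its argument with q[i,j] the rate of going from i to j (rows ordered 0, 1, 2, with -λ_i on the diagonal), a quick sketch:

q <- matrix(c(-0.25,  0.25,   0,      # state 0: a repair sends us to 1
               0.05, -0.175,  0.125,  # state 1: failure to 0, repair to 2
               0,     0.08,  -0.08),  # state 2: a failure sends us to 1
            nrow=3, byrow=TRUE)
findpicontin(q)  # 0.0724 0.3620 0.5656, matching (17.59)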
Several variations of this problem could be analyzed. We could compare the two-machine system with a one-machine version. It turns out that the proportion of down time (i.e. time when no machine is available) increases to 28.6%. Or we could analyze the case in which only one repairperson is employed by this factory, so that only one machine can be repaired at a time, compared to the situation above, in which we (tacitly) assumed that if both machines are down, they can be repaired in parallel. We leave these variations as exercises for the reader.
17.4.5 Example: Migration in a Social Network

The following is a simplified version of research in online social networks.

There is a town with two social groups, with each person being in exactly one group. People arrive from outside town, with exponentially distributed interarrival times at rate λ, and join one of the groups with probability 0.5 each. Each person will occasionally switch groups, with one possible switch being to leave town entirely. A person's time before switching is exponentially distributed with rate μ; the switch will either be to the other group or to the outside world, with probabilities q and 1-q, respectively. Let the state of the system be (i,j), where i and j are the number of current members in groups 1 and 2, respectively.

Let's find a typical balance equation, say for the state (8,8):

π_{(8,8)} (λ + 16μ) = (π_{(9,8)} + π_{(8,9)}) · 9μ(1-q) + (π_{(9,7)} + π_{(7,9)}) · 9μq + (π_{(8,7)} + π_{(7,8)}) · 0.5λ     (17.60)

The reasoning is straightforward. How can we move out of state (8,8)? Well, there could be an arrival (rate λ), or any one of the 16 people could switch groups (rate 16μ). And how can we move into it? From (9,8) or (8,9), one of the 9 members of the larger group could leave town; from (9,7) or (7,9), one of the 9 members of the larger group could transfer to the other group; and from (8,7) or (7,8), an arrival could join the smaller group.

Now, in a "going beyond finding the π" vein, let's find the long-run fraction of transfers into group 1 that come from group 2, as opposed to from the outside.

The rate of transitions into that group from outside is 0.5λ. When the system is in state (i,j), the rate of transitions into group 1 from group 2 is jμq, so the overall rate is Σ_{i,j} π_{(i,j)} jμq. Thus the fraction of new members coming into group 1 from transfers is

Σ_{i,j} π_{(i,j)} jμq / [0.5λ + Σ_{i,j} π_{(i,j)} jμq]     (17.61)

The above reasoning is very common, quite applicable in many situations. By the way, note that Σ_{i,j} π_{(i,j)} jμq = μq·E(N), where N is the long-run number of members of group 2.
17.4.6 Continuous-Time Birth/Death Processes

We noted earlier that the system of equations for the π_i may not be easy to solve. In many cases, for instance, the state space is infinite and thus the system of equations is infinite too. However, there is a rich class of Markov chains for which closed-form solutions have been found, called birth/death processes. (Though we treat the continuous-time case here, there is also a discrete-time analog.)
Here the state space consists of (or has been mapped to) the set of nonnegative integers, and p_ji is nonzero only in cases in which |i - j| = 1. (The name birth/death has its origin in Markov models of biological populations, in which the state is the current population size.) Note for instance that the example of the gracefully degrading system above has this form. An M/M/1 queue (one server, Markov, i.e. exponential, interarrival times and Markov service times) is also a birth/death process, with the state being the number of jobs in the system.

Because the p_ji have such a simple structure, there is hope that we can find a closed-form solution to (17.51), and it turns out we can. Let u_i = ρ_{i,i+1} and d_i = ρ_{i,i-1} (u for "up," d for "down"). Then (17.51) is

π_{i+1} d_{i+1} + π_{i-1} u_{i-1} = π_i λ_i = π_i (u_i + d_i),  i ≥ 1     (17.62)

π_1 d_1 = π_0 λ_0 = π_0 u_0     (17.63)
In other words,

π_{i+1} d_{i+1} - π_i u_i = π_i d_i - π_{i-1} u_{i-1},  i ≥ 1     (17.64)

π_1 d_1 - π_0 u_0 = 0     (17.65)

Applying (17.64) recursively to the base (17.65), we see that

π_i d_i - π_{i-1} u_{i-1} = 0,  i ≥ 1     (17.66)

so that

π_i = π_{i-1} u_{i-1} / d_i,  i ≥ 1     (17.67)

and thus

π_i = π_0 r_i     (17.68)

where

r_i = Π_{k=1}^{i} (u_{k-1} / d_k)     (17.69)

and where r_i = 0 for i > m if the chain has no states past m.
Then since the π_i must sum to 1, we have that

π_0 = 1 / (1 + Σ_{i=1}^{∞} r_i)     (17.70)

and the other π_i are then found via (17.68).

Note that the chain might be finite, i.e. have u_i = 0 for some i. In that case it is still a birth/death chain, and the formulas above for π still apply.
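For a finite chain, (17.68)-(17.70) amount to just a few lines of R. Here is a small sketch (the function name bdpi() is mine), where u[i] is the up rate out of state i-1 and d[i] the down rate out of state i, for states 0, 1, ..., m:

bdpi <- function(u, d) {
   r <- cumprod(u/d)  # the r_i of (17.69)
   pi0 <- 1 / (1 + sum(r))  # (17.70)
   c(pi0, pi0*r)  # (pi_0, pi_1, ..., pi_m), by (17.68)
}
# e.g. the machine repair chain of Section 17.4.4, states 0, 1, 2:
bdpi(c(0.25,0.125), c(0.05,0.08))  # 0.0724 0.3620 0.5656, as in (17.59)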
17.5 Hitting Times Etc.

In this section we're interested in the amount of time it takes to get from one state to another, including cases in which this might be infinite.

17.5.1 Some Mathematical Conditions

There is a rich mathematical theory regarding the asymptotic behavior of Markov chains. We will not present such material here in this brief introduction, but we will give an example of the implications the theory can have.

A state in a Markov chain is called recurrent if it is guaranteed that, if we start at that state, we will return to the state infinitely many times. A nonrecurrent state is called transient.

Let T_ii denote the time needed to return to state i if we start there. Keep in mind that T_ii is the time from one entry to state i to the next entry to state i. So, it includes time spent in i, which is 1 unit of time for a discrete-time chain and a random exponential amount of time in the continuous-time case, and then time spent away from i, up to the time of next entry to i. Note that an equivalent definition of recurrence is that P(T_ii < ∞) = 1, i.e. we are sure to return to i at least once. By the Markov property, if we are sure to return once, then we are sure to return again once after that, and so on, so this implies infinitely many visits.

A recurrent state i is called positive recurrent if E(T_ii) < ∞, while a state which is recurrent but not positive recurrent is called null recurrent.
Let T_ij be the time it takes to get to state j if we are now in i. Note that this is measured from the time that we enter state i to the time we enter state j.

One can show that in the discrete-time case, a state i is recurrent if and only if

Σ_{n=0}^{∞} P(X_n = i | X_0 = i) = ∞     (17.71)

This can be easily seen in the "only if" case: Let A_n denote the indicator random variable for the event X_n = i (Section 3.6). Then P(X_n = i | X_0 = i) = E(A_n | X_0 = i), so the left-hand side of (17.71) is the expected value of the total number of visits to state i. If state i is recurrent, then we will visit i infinitely often, and thus that sum should be equal to infinity.

Consider an irreducible Markov chain, meaning one which has the property that one can get from any state to any other state (though not necessarily in one step). One can show that in an irreducible chain, if one state is recurrent then they all are. The same statement holds if "recurrent" is replaced by "positive recurrent."

Again, this should make intuitive sense to you for the recurrent case: We make infinitely many visits to state i, and each time we have a nonzero probability of going to state j from there. Thus we should make infinitely many visits to j as well.
17.5.2 Example: Random Walks

Consider the famous random walk on the full set of integers: At each time step, one goes left one integer or right one integer (e.g. to +3 or +5 from +4), with probability 1/2 each. In other words, we flip a coin and go left for heads, right for tails.

If we start at 0, then we are back at 0 whenever we have accumulated an equal number of heads and tails. So for even-numbered n, i.e. n = 2m, we have

P(X_n = 0 | X_0 = 0) = P(m heads and m tails) = C(2m, m) (1/2)^{2m}     (17.72)

One can use Stirling's approximation,

m! ≈ √(2π) e^{-m} m^{m+1/2}     (17.73)

to show that the series in (17.71) diverges in this case. So, this chain (meaning all states in the chain) is recurrent. However, it turns out not to be positive recurrent, as we'll see below.
The same is true for the corresponding random walk on the two-dimensional integer lattice (moving
up, down, left or right with probability 1/4 each). However, in the three-dimensional case, the chain
is not even null recurrent; it is transient.
17.5.3 Finding Hitting and Recurrence Times

For a positive recurrent state i in a discrete-time Markov chain,

π_i = 1 / E(T_ii)     (17.74)

The approach to deriving this is similar to that of Section 17.1.5.1. Define alternating On and Off subcycles, where On means we are at state i and Off means we are elsewhere. An On subcycle has duration 1, and an Off subcycle has duration T_ii - 1. Define a full cycle to consist of an On subcycle followed by an Off subcycle.

Then intuitively the proportion of time we are in state i is

π_i = E(On) / [E(On) + E(Off)] = 1 / E(T_ii)     (17.75)
The equation is similar for the continuous-time case. Here E(On) = 1/λ_i. The Off subcycle has mean duration E(T_ii) - 1/λ_i. Note again that T_ii is measured from the time we enter state i once until the time we enter it again. We then have

π_i = (1/λ_i) / E(T_ii)     (17.76)

Thus positive recurrence means that π_i > 0. For a null recurrent chain, the limits in Equation (17.3) are 0, which means that there may be rather little one can say of interest regarding the long-run behavior of the chain.
We are often interested in finding quantities of the form E(T_ij). We can do so by setting up systems of equations similar to the balance equations used for finding stationary distributions.

First consider the discrete case. Conditioning on the first step we take after being at state i, and using the Law of Total Expectation, we have

E(T_ij) = Σ_{k≠j} p_ik [1 + E(T_kj)] + p_ij · 1 = 1 + Σ_{k≠j} p_ik E(T_kj)     (17.77)
By varying i and j in (17.77), we get a system of linear equations which we can solve to find the E(T_ij). Note that (17.74) gives us equations we can use here too.

The continuous version uses the same reasoning:

E(T_ij) = Σ_{k≠j} p_ik [1/λ_i + E(T_kj)] + p_ij · (1/λ_i) = 1/λ_i + Σ_{k≠j} p_ik E(T_kj)     (17.78)
One can use a similar analysis to determine the probability of ever reaching a state, in chains in which this probability is not 1. (Some chains have transient or even absorbing states, i.e. states u such that p_uv = 0 whenever v ≠ u.)

For fixed j, define

ρ_ij = P(T_ij < ∞)     (17.79)

Then denoting by S the state we next visit after i, we have

ρ_ij = P(T_ij < ∞)     (17.80)
     = Σ_k P(S = k and T_ij < ∞)     (17.81)
     = Σ_{k≠j} P(S = k and T_kj < ∞) + P(S = j)     (17.82)
     = Σ_{k≠j} P(S = k) P(T_kj < ∞ | S = k) + P(S = j)     (17.83)
     = Σ_{k≠j} p_ik ρ_kj + p_ij     (17.84)

So, again we have a system of linear equations that we can solve for the ρ_ij.
17.5.4 Example: Finite Random Walk

Let's go back to the example in Section 17.1.1.

Suppose we start our random walk at 2. How long will it take to reach state 4? Set b_i = E(T_i4). From (17.77) we could set up equations like

b_2 = (1/3)(1 + b_1) + (1/3)(1 + b_2) + (1/3)(1 + b_3)     (17.86)
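Writing out the analogous equations for b_1 and b_3 (a transition into state 4 contributes no further waiting, and the chain cannot reach state 5 without first passing through 4), the system can be solved directly with R's solve(); a small sketch:

# unknowns (b1,b2,b3); b1 = 1 + b2, b2 = 1 + (b1+b2+b3)/3,
# b3 = 1 + (b2+b3)/3, rearranged into the form A b = 1
a <- matrix(c( 1,   -1,    0,
              -1/3,  2/3, -1/3,
               0,   -1/3,  2/3), nrow=3, byrow=TRUE)
solve(a, rep(1,3))  # gives (12, 11, 7); from 2 it takes 11 steps on average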
Now change the model a little, and make states 1 and 5 absorbing. Suppose we start at position 3. What is the probability that we eventually are absorbed at 5 rather than 1? We could set up equations like (17.80) to find this.
17.5.5 Example: Tree-Searching

Consider the following Markov chain with infinite state space {0,1,2,3,...}. (Adapted from Performance Modelling of Communication Networks and Computer Architectures, by P. Harrison and N. Patel, pub. by Addison-Wesley, 1993.) The transition matrix is defined by p_{i,i+1} = q_i and p_{i0} = 1 - q_i. This kind of model has many different applications, including in computer science tree-searching algorithms. (The state represents the level in the tree where the search currently is, and a return to 0 represents a backtrack. More general backtracking can be modeled similarly.)

The question at hand is: What conditions on the q_i will give us a positive recurrent chain?

Assuming 0 < q_i < 1 for all i, the chain is clearly irreducible. Thus, to check for recurrence, we need check only one state, say state 0.

For state 0 (and thus the entire chain) to be recurrent, we need to show that P(T_00 < ∞) = 1. But

P(T_00 > n) = Π_{i=0}^{n-1} q_i     (17.87)
Therefore, the chain is recurrent if and only if

lim_{n→∞} Π_{i=0}^{n-1} q_i = 0     (17.88)

For positive recurrence, we need E(T_00) < ∞. Now, one can show that for any nonnegative integer-valued random variable Y,

E(Y) = Σ_{n=0}^{∞} P(Y > n)     (17.89)

Thus for positive recurrence, our condition on the q_i is

Σ_{n=0}^{∞} Π_{i=0}^{n-1} q_i < ∞     (17.90)
Exercises

1. Consider a wraparound variant of the random walk in Section 17.1.1. We still have a reflecting barrier at 1, but at 5, we go back to 4, stay at 5 or wrap around to 1, each with probability 1/3. Find the new set of stationary probabilities.

2. Consider the Markov model of the shared-memory multiprocessor system in our PLN. In each part below, your answer will be a function of q_1, ..., q_m.

(a) For the case m = 3, find p_{(2,0,1),(1,1,1)}.

(b) For the case m = 6, give a compact expression for p_{(1,1,1,1,1,1),(i,j,k,l,m,n)}.

Hint: We have an instance of a famous parametric distribution family here.
3. This problem involves the analysis of call centers. This is a subject of much interest in the business world, with commercial simulators being sold to analyze various scenarios. Here are our assumptions:

Calls come in according to a Poisson process with intensity parameter λ.

Call duration is exponentially distributed with parameter μ.

There are always at least b operators in service, and at most b+r.

Operators work from home, and can be brought into or out of service instantly when needed. They are paid only for the time in service.

If a call comes in when the current number of operators is larger than b but smaller than b+r, another operator is brought into service to process the call.

If a call comes in when the current number of operators is b+r, the call is rejected.

When an operator completes processing a call, and the current number of operators (including this one) is greater than b, then that operator is taken out of service.

Note that this is a birth/death process, with the state being the number of calls currently in the system.

(a) Find approximate closed-form expressions for the π_i for large b+r, in terms of b, r, λ and μ. (You should not have any summation symbols.)

(b) Find the proportion of rejected calls, in terms of the π_i and b, r, λ and μ.

(c) An operator is paid while in service, even if he/she is idle, in which case the wages are wasted. Express the proportion of wasted time in terms of the π_i and b, r, λ and μ.

(d) Suppose b = r = 2, and λ = μ = 1.0. When a call completes while we are in state b+1, an operator is sent away. Find the mean time until we make our next summons to the reserve pool.
4. The bin-packing problem arises in many computer science applications. Items of various sizes must be placed into fixed-sized bins. The goal is to find a packing arrangement that minimizes unused space. Toward that end, work the following problem.

We are working in one dimension, and have a continuing stream of items arriving, of lengths L_1, L_2, L_3, .... We place the items in the bins in the order of arrival, i.e. without optimizing. We continue to place items in a bin until we encounter an item that will not fit in the remaining space, in which case we go to the next bin.

Suppose the bins are of length 5, and an item has length 1, 2, 3 or 4, with probability 0.25 each. Find the long-run proportion of wasted space.

Hint: Set up a discrete-time Markov chain, with time being the number of items packed so far, and the state being the amount of occupied space in the current bin. Define T_n to be 1 or 0, according to whether the n-th item causes us to begin packing a new bin, so that the number of bins used by time n is T_1 + ... + T_n.

5. Suppose we keep rolling a die. Find the mean number of rolls needed to get three consecutive 4s.

Hint: Use the material in Section 17.5.
6. A system consists of two machines, with exponentially distributed lifetimes having mean 25.0. There is a single repairperson, but he is not usually on site. When a breakdown occurs, he is summoned (unless he is already on his way or on site), and it takes him a random amount of time to reach the site, exponentially distributed with mean 2.0. Repair time is exponentially distributed with mean 8.0. If after completing a repair the repairperson finds that the other machine needs fixing, he will repair it; otherwise he will leave. Repair is performed on a First Come, First Served schedule. Find the following:

(a) The long-run proportion of the time that the repairperson is on site.

(b) The rate per unit time of calls to the repairperson.

(c) The mean time to repair, i.e. the mean time between a breakdown of a machine and completion of repair of that machine.
(d) The probability that, when two machines are up and one of them goes down, the second machine fails before the repairperson arrives.

7. Consider again the random walk in Section 17.1.1. Find

lim_{n→∞} ρ(X_n, X_{n+1})     (17.91)

Hint: Apply the Law of Total Expectation to E(X_n X_{n+1}).
8. Consider a random variable X that has a continuous density. That implies that G(u) = P(X > u) has a derivative. Differentiate (17.49) with respect to r, then set r = 0, resulting in a differential equation for G. Solve that equation to show that the only continuous densities that produce the memoryless property are those in the exponential family.

9. Suppose we model a certain database as follows. New items arrive according to a Poisson process with intensity parameter λ. Each item stays in the database for an exponentially distributed amount of time with parameter μ, independently of the other items. Our state at time t is the number of items in the database at that time. Find closed-form expressions for the stationary distribution and the long-run average size of the database.
10. Consider our machine repair example in Section 17.4.4, with the following change: The repairperson is offsite, and will not be summoned unless both machines are down. Once the repairperson arrives, she will not leave until both machines are up. So for example, if she arrives and repairs machine B, then while repairing A finds that B has gone down again, she will start work on B immediately after finishing with A. Travel time to the site from the maintenance office is 0. Repair is performed on a First Come, First Served schedule. The time a machine is in working order has an exponential distribution with rate λ, and repair time is exponentially distributed with rate μ. Find the following in terms of λ and μ:

(a) The long-run proportion of the time that the repairperson is on site.

(b) The rate per unit time of calls to the repairperson.

(c) The mean time to repair, i.e. the mean time between a breakdown of a machine and completion of repair of that machine. (Hint: The best approach is to look at rates. First, find the number of breakdowns per unit time. Then, ask how many of these occur during a time when both machines are up, etc. In each case, what is the mean time to repair for the machine that breaks?)
11. There is a town with two social groups, with the following dynamics:

Everyone is in exactly one group at a time.

People arrive from outside town, with exponentially distributed interarrival times at rate λ, and join one of the groups with probability 0.5 each.

Each person will occasionally switch groups, with one possible switch being to leave town entirely (never to return). A person's time before switching groups is exponentially distributed with rate μ. The switch will either be to the other group or to the outside world, with probabilities q and 1-q, respectively.

Let the state of the system be (i,j), where i and j are the number of current members in groups 1 and 2, respectively. Answer in terms of λ, μ and q:

(a) Give the balance equation for the state (8,8).

(b) Fill in the blank: The president of Group 1 tells a reporter, "We've found over the years that _____% of entries into our group come as transfers from the other group."
Chapter 18

Introduction to Queuing Models

It seems like we spend large parts of our lives standing in line (or as they say in New York, standing on line). This can be analyzed probabilistically, a subject to which we will be introduced in this chapter.

18.1 Introduction

Like other areas of applied stochastic processes, queuing theory has a vast literature, covering a huge number of variations on different types of queues. Our tutorial here can only scratch the surface of this field.
Here is a rough overview of a few of the large categories of queuing theory:

Single-server queues.

Networks of queues, including open networks (in which jobs arrive from outside the network, visit some of the servers in the network, then leave) and closed networks (in which jobs continually circulate within the network, never leaving).

Non-First Come, First Served (FCFS) service orderings. For example, there are Last Come, First Served (i.e. stacks) and Processor Sharing (which models CPU timesharing).

In this brief introduction, we will not discuss non-FCFS queues, and will only scratch the surface on the other topics.
18.2 M/M/1

The first M here stands for "Markov" or "memoryless," alluding to the fact that arrivals to the queue are Markovian, i.e. interarrival times are i.i.d. exponentially distributed. The second M means that the service times are also i.i.d. exponential. Denote the reciprocal-mean interarrival and service times by λ and μ.

The 1 in M/M/1 refers to the fact that there is a single server. We will assume FCFS job scheduling here, but close inspection of the derivation will show that it applies to some other kinds of scheduling too.

This system is a continuous-time Markov chain, with the state X_t at time t being the number of jobs in the system (not just in the queue but also including the one currently being served, if any).
18.2.1 Steady-State Probabilities

Intuitively the steady-state probabilities π_i will exist only if λ < μ. Otherwise jobs would come in faster than they could be served, and the queue would become infinite. So, we assume that u < 1, where u = λ/μ.

Clearly this is a birth-and-death chain. For state k, the birth rate ρ_{k,k+1} is λ and the death rate ρ_{k,k−1} is μ, k = 0,1,2,... (except that the death rate at state 0 is 0). Using the formula derived for birth/death chains, we have that

π_i = u^i π_0,  i ≥ 0   (18.1)

and

π_0 = 1 / Σ_{j=0}^∞ u^j = 1 − u   (18.2)

In other words,

π_i = u^i (1 − u),  i ≥ 0   (18.3)
Note by the way that since π_0 = 1 − u, u is the utilization of the server, i.e. the proportion of the time the server is busy. In fact, this can be seen intuitively: Think of a very long period of time of length t. During this time approximately λt jobs will have arrived, keeping the server busy for approximately λt · (1/μ) time. Thus the fraction of time during which the server is busy is approximately

(λt · 1/μ) / t = λ/μ   (18.4)
18.2.2 Mean Queue Length

Another way to look at Equation (18.3) is as follows. Let the random variable N have the long-run distribution of X_t, so that

P(N = i) = u^i (1 − u),  i ≥ 0   (18.5)

Then this says that N+1 has a geometric distribution, with success probability 1−u. (N itself is not quite geometrically distributed, since N's values begin at 0 while a geometric distribution begins at 1.)

Thus the long-run average value E(N) of X_t will be the mean of that geometric distribution, minus 1, i.e.

E(N) = 1/(1 − u) − 1 = u/(1 − u)   (18.6)

The long-run mean queue length E(Q) will be this value minus the mean number of jobs being served. The latter quantity is 1 − π_0 = u, so

E(Q) = u²/(1 − u)   (18.7)
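Let's check (18.3), (18.6) and (18.7) numerically. Here is a quick R sketch (the rates below are arbitrary example values, and we truncate the infinite state space):

lambda <- 0.8; mu <- 1.0
u <- lambda / mu
i <- 0:1000                 # truncate the infinite state space
pii <- u^i * (1 - u)        # (18.3)
sum(pii)                    # should be essentially 1.0
sum(i * pii)                # E(N) by direct summation
u / (1 - u)                 # E(N) from (18.6); should agree
u^2 / (1 - u)               # E(Q) from (18.7)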
18.2.3 Distribution of Residence Time/Little's Rule

Let R denote the residence time of a job, i.e. the time elapsed from the job's arrival to its exit from the system. Little's Rule says that

E(N) = λ E(R)   (18.8)

This property holds for a variety of queuing systems, including this one. It can be proved formally, but here is the intuition:

Think of a particular job (in the literature of queuing theory, it is called a tagged job) at the time it has just exited the system. If this is an average job, then it has been in the system for E(R) amount of time, during which an average of λE(R) new jobs have arrived behind it. These jobs now comprise the total number of jobs in the system, which in the average case is E(N).

Applying Little's Rule here, we know E(N) from Equation (18.6), so we can solve for E(R):

E(R) = (1/λ) · u/(1 − u) = (1/μ) / (1 − u)   (18.9)
With a little more work, we can find the actual distribution of R, not just its mean. This will enable us to obtain quantities such as Var(R) and P(R > z). Here is our approach:

When a job arrives, say there are N jobs ahead of it, including one in service. Then this job's value of R can be expressed as

R = S_self + S_{1,resid} + S_2 + ... + S_N   (18.10)

where S_self is the service time for this job, S_{1,resid} is the remaining time for the job now being served (i.e. the residual life), and for i > 1 the random variable S_i is the service time for the i-th waiting job.
Then the Laplace transform of R, evaluated at say w, is

E(e^{−wR}) = E[e^{−w(S_self + S_{1,resid} + S_2 + ... + S_N)}]   (18.11)
           = E{ E[e^{−w(S_self + S_{1,resid} + S_2 + ... + S_N)} | N] }   (18.12)
           = E[E(e^{−wS})^{N+1}]   (18.13)
           = E[g(w)^{N+1}]   (18.14)

where

g(w) = E(e^{−wS})   (18.15)

is the Laplace transform of the service variable, i.e. of an exponential distribution with parameter equal to the service rate μ. Here we have made use of these facts:

• The Laplace transform of a sum of independent random variables is the product of their individual Laplace transforms.

• Due to the memoryless property, S_{1,resid} has the same distribution as do the other S_i.

• The distribution of the service times S_i and queue length N observed by our tagged job is the same as the distributions of those quantities at all times, not just at arrival times of tagged jobs. This property can be proven for this kind of queue and many others, and is called PASTA: Poisson Arrivals See Time Averages.

(Note that the PASTA property is not obvious. On the contrary, given our experience with the Bus Paradox and length-biased sampling in Section 5.3, we should be wary of such things. But the PASTA property does hold and can be proven.)
But that last term in (18.14), E[g(w)^{N+1}], is the generating function of N+1, evaluated at g(w). And we know from Section 18.2.2 that N+1 has a geometric distribution. The generating function for a geometrically distributed random variable K with success probability p is

g_K(s) = E(s^K) = Σ_{i=1}^∞ s^i (1 − p)^{i−1} p = ps / (1 − s(1 − p))   (18.16)

In (18.14), we have p = 1 − u and s = g(w). So,

E[g(w)^{N+1}] = g(w)(1 − u) / (1 − u · g(w))   (18.17)
Finally, by definition of the Laplace transform,

g(w) = E(e^{−wS}) = ∫_0^∞ e^{−wt} μ e^{−μt} dt = μ / (w + μ)   (18.18)
So, from (18.11), (18.17) and (18.18), the Laplace transform of R is

μ(1 − u) / (w + μ(1 − u))   (18.19)

In principle, Laplace transforms can be inverted, and we could use numerical methods to retrieve the distribution of R from (18.19). But hey, look at that! Equation (18.19) has the same form as (18.18). In other words, we have discovered that R has an exponential distribution too, only with parameter μ(1 − u) instead of μ.

This is quite remarkable. The fact that the service and interarrival times are exponential doesn't mean that everything else will be exponential too, so it is surprising that R does turn out to have an exponential distribution.
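We can check this by simulation. Here is a minimal sketch (the rates are arbitrary example values); it uses the Lindley recursion, W_{n+1} = max(0, W_n + S_n − A_{n+1}), for the waiting times of successive jobs in a FCFS single-server queue, a standard device though not one derived in this chapter:

set.seed(1)
lambda <- 0.8; mu <- 1.0; n <- 100000
A <- rexp(n, lambda)   # interarrival times
S <- rexp(n, mu)       # service times
W <- numeric(n)        # W[i] = queuing delay of job i; job 1 finds an empty system
for (i in 2:n) W[i] <- max(0, W[i-1] + S[i-1] - A[i])
R <- W + S             # residence times
mean(R)                # should be near 1/(mu*(1-u))
1 / (mu * (1 - lambda/mu))

One could go further, e.g. comparing a histogram of R against the density of an exponential distribution with rate μ(1 − u).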
It is even more surprising in that R is a sum of independent exponential random variables, as we saw in (18.10), and we know that such sums have Erlang distributions. The resolution of this seeming paradox is that the number of terms N in (18.10) is itself random. Conditional on N, R has an Erlang distribution, but unconditionally R has an exponential distribution.
18.3 Multi-Server Models

Here we have c servers, with a common queue. There are many variations.

18.3.1 M/M/c

Here the servers are homogeneous. When a job gets to the head of the queue, it is served by the first available server.
The state is again the number of jobs in the system, including any jobs at the servers. Again it is a birth/death chain, with birth rate ρ_{i,i+1} = λ and death rate

ρ_{i,i−1} = iμ  if 0 < i < c,
ρ_{i,i−1} = cμ  if i ≥ c   (18.20)

The solution turns out to be

π_k = π_0 (λ/μ)^k / k!  for k < c,
π_k = π_0 (λ/μ)^k / (c! c^{k−c})  for k ≥ c   (18.21)

where

π_0 = [ Σ_{k=0}^{c−1} (cu)^k / k! + ((cu)^c / c!) · 1/(1 − u) ]^{−1}   (18.22)

and

u = λ / (cμ)   (18.23)

Note that the latter quantity is still the utilization per server, using an argument similar to that which led to (18.4).
Recalling that the Taylor series for e^z is Σ_{k=0}^∞ z^k/k!, we see that

π_0 ≈ e^{−cu}   (18.24)
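Here is a quick R sketch of (18.22) and the approximation (18.24); the function name and the values of c and u are arbitrary choices for illustration:

mmc.pi0 <- function(c, u) {
   k <- 0:(c-1)
   1 / (sum((c*u)^k / factorial(k)) + (c*u)^c / (factorial(c) * (1-u)))
}
mmc.pi0(10, 0.5)   # exact value of pi_0, from (18.22)
exp(-10 * 0.5)     # the approximation (18.24)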
18.3.2 M/M/2 with Heterogeneous Servers

Here the servers have different rates. We'll treat the case in which c = 2. Assume μ_1 < μ_2. When a job reaches the head of the queue, it chooses machine 2 if that machine is idle, and otherwise waits for the first available machine. Once it starts on a machine, it cannot be switched to the other.

Denote the state by (i,j,k), where

• i is the number of jobs at server 1

• j is the number of jobs at server 2

• k is the number of jobs in the queue

The key is to notice that the states 110, 111, 112, 113, ... act like an M/M/1 queue with service rate μ_1 + μ_2. This will reduce finding the solution of the balance equations to solving a finite system of linear equations.
For k ≥ 1 we have

(λ + μ_1 + μ_2) π_{11k} = (μ_1 + μ_2) π_{11,k+1} + λ π_{11,k−1}   (18.25)

Collecting terms as in the derivation of the stationary distribution for birth/death processes, (18.25) becomes

λ (π_{11k} − π_{11,k−1}) = (μ_1 + μ_2) (π_{11,k+1} − π_{11k}),  k = 1, 2, ...   (18.26)

Then we have

(μ_1 + μ_2) π_{11,k+1} − λ π_{11k} = (μ_1 + μ_2) π_{11k} − λ π_{11,k−1}   (18.27)

So, we now have all the π_{11i}, i = 2,3,... in terms of π_{111} and π_{110}, thus reducing our task to solving a finite set of linear equations, as promised. Here are the rest of the equations:

λ π_{000} = μ_2 π_{010} + μ_1 π_{100}   (18.28)
(λ + μ_2) π_{010} = λ π_{000} + μ_1 π_{110}   (18.29)

(λ + μ_1) π_{100} = μ_2 π_{110}   (18.30)

(λ + μ_1 + μ_2) π_{110} = λ π_{010} + λ π_{100} + (μ_1 + μ_2) π_{111}   (18.31)

From (18.31), we have

(μ_1 + μ_2) π_{111} − λ π_{110} = (μ_1 + μ_2) π_{110} − λ (π_{010} + π_{100})   (18.32)
Look at that last term, λ(π_{010} + π_{100}). By adding (18.29) and (18.30), we have that

λ (π_{010} + π_{100}) = λ π_{000} + μ_1 π_{110} + μ_2 π_{110} − μ_1 π_{100} − μ_2 π_{010}   (18.33)

Substituting (18.28) changes (18.33) to

λ (π_{010} + π_{100}) = μ_1 π_{110} + μ_2 π_{110}   (18.34)

So...(18.32) becomes

(μ_1 + μ_2) π_{111} − λ π_{110} = 0   (18.35)

By induction in (18.27), we have

(μ_1 + μ_2) π_{11,k+1} − λ π_{11k} = 0,  k = 1, 2, ...   (18.36)

and

π_{11i} = r^i π_{110},  i = 0, 1, 2, ...   (18.37)

where

r = λ / (μ_1 + μ_2)   (18.38)
Finally, using the fact that the probabilities must sum to 1, we have

1 = Σ_{i,j,k} π_{ijk}   (18.39)
  = π_{000} + π_{010} + π_{100} + Σ_{i=0}^∞ π_{11i}   (18.40)
  = π_{000} + π_{010} + π_{100} + π_{110} Σ_{i=0}^∞ r^i   (18.41)
  = π_{000} + π_{010} + π_{100} + π_{110} · 1/(1 − r)   (18.42)

Finding closed-form expressions for the π_{ijk} is then straightforward.
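For instance, here is an R sketch of that computation (the rates are arbitrary example values). The unknowns are π_{000}, π_{010}, π_{100} and π_{110}; the rows of the coefficient matrix encode (18.28), (18.29), (18.30) and the normalization (18.42):

lambda <- 1.0; mu1 <- 1.5; mu2 <- 2.5
r <- lambda / (mu1 + mu2)
a <- rbind(c(-lambda,  mu2,          mu1,           0),         # (18.28)
           c( lambda, -(lambda+mu2), 0,             mu1),       # (18.29)
           c( 0,       0,           -(lambda+mu1),  mu2),       # (18.30)
           c( 1,       1,            1,             1/(1-r)))   # (18.42)
b <- c(0, 0, 0, 1)
p <- solve(a, b)   # pi000, pi010, pi100, pi110
p
# the remaining probabilities follow from (18.37): pi_{11i} = r^i * p[4]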
18.4 Loss Models

One of the earliest queuing models was M/M/c/c: Markovian interarrival and service times, c servers and a buffer space of c jobs. Any job which arrives when c jobs are already in the system is lost. This model was used by telephone companies to find the proportion of lost calls for a bank of c trunk lines.
18.4.1 Cell Communications Model

Let's consider a more modern example of this sort, involving cellular phone systems. (This is an extension of the example treated in K.S. Trivedi, Probability and Statistics, with Reliability and Computer Science Applications (second edition), Wiley, 2002, Sec. 8.2.3.2, which is in turn based on two papers in the IEEE Transactions on Vehicular Technology.)

We consider one particular cell in the system. Mobile phone users drift in and out of the cell as they move around the city. A call can either be a new call, i.e. a call which someone has just dialed, or a handoff call, i.e. a call which had already been in progress in a neighboring cell but now has moved to this cell.

Each call in a cell needs a channel.¹ There are n channels available in the cell. We wish to give handoff calls priority over new calls.² This is accomplished as follows.

¹This could be a certain frequency or a certain time slot position.

²We would rather give the caller of a new call a polite rejection message, e.g. "No lines available at this time," than suddenly terminate an existing conversation.
The system always reserves g channels for handoff calls. When a request for a new call (i.e. a non-handoff call) arrives, the system looks at X_t, the current number of calls in the cell. If that number is less than n−g, so that there are more than g idle channels available, the new call is accepted; otherwise it is rejected.

We assume that new calls originate from within the cell according to a Poisson process with rate λ_1, while handoff calls drift in from neighboring cells at rate λ_2. Meanwhile, call durations are exponential with rate μ_1, while the time that a call remains within the cell is exponential with rate μ_2.
18.4.1.1 Stationary Distribution

We again have a birth/death process, though a bit more complicated than our earlier ones. Let λ = λ_1 + λ_2 and μ = μ_1 + μ_2. Then here is a sample balance equation, focused on transitions into (left-hand side in the equation) and out of (right-hand side) state 1:

π_0 λ + π_2 · 2μ = π_1 (λ + μ)   (18.43)

Here's why: How can we enter state 1? Well, we could do so from state 0, where there are no calls; this occurs if we get a new call (rate λ_1) or a handoff call (rate λ_2). From state 2, we enter state 1 if one of the two calls ends (rate 2μ_1) or one of the two calls leaves the cell (rate 2μ_2). The same kind of reasoning shows that we leave state 1 at rate λ + μ.

As another example, here is the equation for state n−g:

π_{n−g} [λ_2 + (n−g)μ] = π_{n−g+1} (n−g+1)μ + π_{n−g−1} λ   (18.44)

Note the term λ_2 in (18.44), rather than λ as in (18.43).
Using our birth/death formula for the π_i, we find that

π_k = π_0 A^k / k!  for k ≤ n−g,
π_k = π_0 A^{n−g} A_1^{k−(n−g)} / k!  for k ≥ n−g   (18.45)

where A = λ/μ, A_1 = λ_2/μ and

π_0 = [ Σ_{k=0}^{n−g−1} A^k / k! + Σ_{k=n−g}^{n} A^{n−g} A_1^{k−(n−g)} / k! ]^{−1}   (18.46)
18.4.1.2 Going Beyond Finding the π

One can calculate a number of interesting quantities from the π_i:

• The probability of a handoff call being rejected is π_n.

• The probability of a new call being dropped is

Σ_{k=n−g}^{n} π_k   (18.47)

• Since the per-channel utilization in state i is i/n, the overall long-run per-channel utilization is

Σ_{i=0}^{n} π_i · i/n   (18.48)

• The long-run proportion of accepted calls which are handoff calls is the rate at which handoff calls are accepted, divided by the rate at which calls are accepted:

λ_2 Σ_{i=0}^{n−1} π_i / ( λ_1 Σ_{i=0}^{n−g−1} π_i + λ_2 Σ_{i=0}^{n−1} π_i )   (18.49)
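Here is an R sketch of these computations (the function name and all parameter values are arbitrary illustrations):

cellpi <- function(n, g, lam1, lam2, mu1, mu2) {
   lam <- lam1 + lam2; mu <- mu1 + mu2
   A <- lam/mu; A1 <- lam2/mu
   k <- 0:n
   unnorm <- ifelse(k <= n-g, A^k / factorial(k),
                    A^(n-g) * A1^(k-(n-g)) / factorial(k))   # (18.45)
   unnorm / sum(unnorm)
}
pii <- cellpi(n=20, g=3, lam1=15, lam2=5, mu1=1, mu2=2)
pii[20+1]                    # P(handoff call rejected); state i has index i+1
sum(pii[(20-3+1):(20+1)])    # P(new call dropped), from (18.47)
sum(pii * (0:20) / 20)       # per-channel utilization, from (18.48)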
18.5 Nonexponential Service Times

The Markov property is of course crucial to the analyses we made above. Thus dropping the exponential assumption presents a major analytical challenge.

One queuing model which has been found tractable is M/G/1: exponential interarrival times, general service times, one server. In fact, the mean queue length and related quantities can be obtained fairly easily, as follows.
Consider the residence time R for a tagged job. R is the time that our tagged job must first wait for completion of service of all jobs, if any, which are ahead of it (queued or now in service), plus the tagged job's own service time. Let T_1, T_2, ... be i.i.d. with the distribution of a generic service time random variable S. T_1 represents the service time of the tagged job itself. T_2, ..., T_N represent the service times of the queued jobs, if any.

Let N be the number of jobs in the system, either being served or queued; B be either 1 or 0, depending on whether the system is busy (i.e. N > 0) or not; and S_{1,resid} be the remaining service time of the job currently being served, if any. Finally, we define, as before, u = λ/(1/ES) = λ E(S), the utilization. Note that this implies E(B) = u.
Then the distribution of R is that of

B · S_{1,resid} + (T_1 + ... + T_N) + (1 − B) T_1   (18.50)

Note that if N = 0, then T_1 + ... + T_N is considered to be 0, i.e. not present in (18.50).
Then

E(R) = u E(S_{1,resid}) + E(T_1 + ... + T_N) + (1 − u) E(T_1)   (18.51)
     = u E(S_{1,resid}) + E(N) E(S) + (1 − u) ES   (18.52)
     = u E(S_{1,resid}) + λ E(R) E(S) + (1 − u) ES   (18.53)

The last equality is due to Little's Rule. Note also that we have made use of the PASTA property here, so that the distribution of N is the same at arrival times as at general times.

Then

E(R) = u E(S_{1,resid}) / (1 − u) + ES   (18.54)

Note that the two terms here represent the mean residence time as the mean queuing time plus the mean service time.
So we must find E(S_{1,resid}). This is just the mean of the remaining-life distribution which we saw in Section 5.4 of our unit on renewal theory. Then

E(S_{1,resid}) = ∫_0^∞ t · (1 − F_S(t))/ES dt   (18.55)
             = (1/ES) ∫_0^∞ t ∫_t^∞ f_S(u) du dt   (18.56)
             = (1/ES) ∫_0^∞ f_S(u) ∫_0^u t dt du   (18.57)
             = E(S²) / (2 ES)   (18.58)
So,

E(R) = u E(S²) / (2 ES (1 − u)) + ES   (18.59)

What is remarkable about this famous formula (a form of the Pollaczek-Khinchine formula) is that E(R) depends not only on the mean service time but also on the variance. This result, which is not so intuitively obvious at first glance, shows the power of modeling. We might observe the dependency of E(R) on the variance of service time empirically if we do simulation, but here is a compact formula that shows it for us.
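As a small numerical illustration (with arbitrary example values), take λ = 0.5 and two service distributions with the same mean 1.0 but different second moments: exponential, with E(S²) = 2, and uniform on (0,2), with E(S²) = 4/3:

ER <- function(lambda, ES, ES2) {
   u <- lambda * ES
   u * ES2 / (2 * ES * (1-u)) + ES   # (18.59)
}
ER(0.5, 1, 2)     # exponential service: 2.0, agreeing with (18.9)
ER(0.5, 1, 4/3)   # uniform service: about 1.67

The lower-variance uniform service time yields a smaller mean residence time, even though the mean service times are identical.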
18.6 Reversed Markov Chains

We can get insight into some kinds of queuing systems by making use of the concept of reversed Markov chains, which involves playing the Markov chain backward, just as we could play a movie backward.

Consider a continuous-time, irreducible, positive recurrent Markov chain X(t).³ For a fixed time τ (typically thought of as large), define the reversed version of X(t) as Y(t) = X(τ − t), for 0 ≤ t ≤ τ. We will discuss a number of properties of reversed chains. These properties will enable what mathematicians call soft analysis of some Markov chains, especially those related to queues. This term refers to short, simple, elegant proofs or derivations.
18.6.1 Markov Property

The first property to note is that Y(t) is a Markov chain! Here is our first chance for soft analysis. The hard analysis approach would be to start with the definition, which in continuous time would be that

P(Y(t) = k | Y(u), u ≤ s) = P(Y(t) = k | Y(s))   (18.60)

for all 0 < s < t and all k, using the fact that X(t) has the same property. That would involve making substitutions in Equation (18.60) like Y(t) = X(τ − t), etc.

³Recall that a Markov chain is irreducible if it is possible to get from each state to each other state in a finite number of steps, and that the term positive recurrent means that the chain has a long-run state distribution π. Also, concerning our assumption here of continuous time, we should note that there are discrete-time analogs of the various points we'll make below.
But it is much easier to simply observe that the Markov property holds if and only if, conditional
on the present, the past and the future are independent. Since that property holds for X(t), it also
holds for Y(t) (with the roles of the past and the future interchanged).
18.6.2 Long-Run State Proportions

Clearly, if the long-run proportion of the time X(t) = k is π_k, the same long-run proportion will hold for Y(t). This of course only makes sense if you think of larger and larger τ.
18.6.3 Form of the Transition Rates of the Reversed Chain

Let ρ̃_{ij} denote the transition rate from state i to state j in the reversed chain, with ρ_{ij} the corresponding rate in the original chain. The long-run number of transitions per unit time from i to j in the reversed chain must be equal to the long-run number of transitions per unit time from j to i in the original chain. Therefore,

π_i ρ̃_{ij} = π_j ρ_{ji}   (18.61)

This gives us a formula for the ρ̃_{ij}:

ρ̃_{ij} = π_j ρ_{ji} / π_i   (18.62)
18.6.4 Reversible Markov Chains

In some cases, the reversed chain has the same probabilistic structure as the original one! Note carefully what that would mean. In the continuous-time case, it would mean that ρ̃_{ij} = ρ_{ij} for all i and j, where the ρ̃_{ij} are the transition rates of Y(t).⁴ If this is the case, we say that X(t) is reversible.

That is a very strong property. An example of a chain which is not reversible is the tree-search model in Section 17.5.5.⁵ There the state space consists of all the nonnegative integers, and transitions were possible from states n to n+1 and from n to 0. Clearly this chain is not reversible, since we can go from n to 0 in one step but not vice versa.

⁴Note that for a continuous-time Markov chain, the transition rates do indeed uniquely determine the probabilistic structure of the chain, not just the long-run state proportions. The short-run behavior of the chain is also determined by the transition rates, and at least in theory can be calculated by solving differential equations whose coefficients make use of those rates.

⁵That is a discrete-time example, but the principle here is the same.
18.6.4.1 Conditions for Checking Reversibility

Equation (18.61) shows that the original chain X(t) is reversible if and only if

π_i ρ_{ij} = π_j ρ_{ji}   (18.63)

for all i and j. These equations are called the detailed balance equations, as opposed to the general balance equations,

Σ_{j≠i} π_j ρ_{ji} = π_i ρ_i   (18.64)

(where ρ_i = Σ_{j≠i} ρ_{ij} is the total rate out of state i), which are used to find the π values. Recall that (18.64) arises from equating the flow into state i with the flow out of it. By contrast, Equation (18.63) equates the flow into i from a particular state j to the flow from i to j. Again, that is a much stronger condition, so we can see that most chains are not reversible. However, a number of important ones are reversible, as we'll see.

For example, consider birth/death chains. Here, the only cases in which ρ_{rs} is nonzero are those in which |r − s| = 1. Now, Equation (17.64) in our derivation of π for birth/death chains is exactly (18.63)! So we see that birth/death chains are reversible.
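For instance, for the M/M/1 chain we can verify (18.63) numerically (with arbitrary example rates): the flow from state k to k+1, π_k λ, should equal the flow from k+1 to k, π_{k+1} μ:

lambda <- 0.8; mu <- 1.0; u <- lambda/mu
k <- 0:9
pik <- u^k * (1-u)                       # (18.3), states 0 through 9
piknext <- u^(k+1) * (1-u)               # states 1 through 10
all.equal(pik * lambda, piknext * mu)    # TRUE: detailed balance holds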
More generally, equations (18.63) may not be so easy to check, since for complex chains we may not be able to find closed-form expressions for the π values. Thus it is desirable to have another test available for reversibility. One such test is Kolmogorov's Criterion:

The chain is reversible if and only if for any loop of states, the product of the transition rates is the same in both the forward and backward directions.

For example, consider the loop i → j → k → i. Then we would check whether ρ_{ij} ρ_{jk} ρ_{ki} = ρ_{ik} ρ_{kj} ρ_{ji}.

Technically, we do have to check all loops. However, in many cases it should be clear that just a few loops are representative, as the other loops have the same structure.

Again consider birth/death chains. Kolmogorov's Criterion trivially shows that they are reversible, since any loop involves a path which is the same path when traversed in reverse.
18.6.4.2 Making New Reversible Chains from Old Ones

Since reversible chains are so useful (when we are lucky enough to have them), a very useful trick is to be able to form new reversible chains from old ones. The following two properties are very handy in that regard:

(a) Suppose U(t) and V(t) are independent reversible Markov chains, and define W(t) to be the tuple [U(t),V(t)]. Then W(t) is reversible.

(b) Suppose X(t) is a reversible Markov chain, and A is an irreducible subset of the state space of the chain, with long-run state distribution π. Define a chain W(t) with transition rates ρ′_{ij} for i ∈ A, where ρ′_{ij} = ρ_{ij} if j ∈ A and ρ′_{ij} = 0 otherwise. Then W(t) is reversible, with long-run state distribution given by

π′_i = π_i / Σ_{j∈A} π_j   (18.65)
18.6.4.3 Example: Distribution of Residual Life

In Section 5.4.3, we used Markov chain methods to derive the age distribution at a fixed observation point in a renewal process. From remarks made there, we know that residual life has the same distribution. This could be proved similarly, at some effort, but it comes almost immediately from reversibility considerations. After all, the residual life in the reversed process is the age in the original process.
18.6.4.4 Example: Queues with a Common Waiting Area

Consider two M/M/1 queues, with chains G(t) and H(t), with independent arrival streams but having a common waiting area, with jobs arriving to a full waiting area simply being lost.⁶

First consider the case of an infinite waiting area. Let u_1 and u_2 be the utilizations of the two queues, as in (18.3). G(t) and H(t), being birth/death processes, are reversible. Then by property (a) above, the chain [G(t),H(t)] is also reversible. The long-run proportion of the time that there are m jobs in the first queue and n jobs in the second is

π_{mn} = (1 − u_1) u_1^m (1 − u_2) u_2^n   (18.66)

for m,n = 0,1,2,3,...

Now consider what would happen if these two queues were to have a common, finite waiting area. Denote the amount of space in the waiting area by w. The new process is the restriction of the original process to a subset of states A as in (b) above. (The set A will be precisely defined below.) It is easily verified from the Kolmogorov Criterion that the new process is also reversible.

⁶Adapted from Ronald Wolff, Stochastic Modeling and the Theory of Queues, Prentice Hall, 1989.
Recall that the state m in the original queue G(t) is the number of jobs, including the one in service if any. That means the number of jobs waiting is (m−1)⁺, where x⁺ = max(x, 0). So for our new system, with the common waiting area, we should take our subset A to be

{ (m,n) : m, n ≥ 0, (m−1)⁺ + (n−1)⁺ ≤ w }   (18.67)

So, by property (b) above, we know that the long-run state distribution for the queue with the finite common waiting area is

π_{mn} = (1/a) (1 − u_1) u_1^m (1 − u_2) u_2^n   (18.68)

where

a = Σ_{(i,j)∈A} (1 − u_1) u_1^i (1 − u_2) u_2^j   (18.69)

In this example, reversibility was quite useful. It would have been essentially impossible to derive (18.68) algebraically. And even if intuition had suggested that solution as a guess, it would have been quite messy to verify the guess.
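Numerically, though, (18.68) and (18.69) are simple to evaluate. Here is a sketch (the utilizations and the value of w are arbitrary examples):

u1 <- 0.5; u2 <- 0.6; w <- 3
mmax <- w + 1                       # (m-1)^+ <= w forces m <= w+1
states <- expand.grid(m=0:mmax, n=0:mmax)
inA <- pmax(states$m - 1, 0) + pmax(states$n - 1, 0) <= w
a <- sum((1-u1) * u1^states$m[inA] * (1-u2) * u2^states$n[inA])   # (18.69)
# e.g. the long-run probability of state (m,n) = (2,1), from (18.68):
(1/a) * (1-u1) * u1^2 * (1-u2) * u2^1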
18.6.4.5 Closed-Form Expression for π for Any Reversible Markov Chain

(Adapted from Randolph Nelson, Probability, Stochastic Processes and Queueing Theory, Springer-Verlag, 1995.)

Recall that most Markov chains, especially those with infinite state spaces, do not have closed-form expressions for the steady-state probabilities. But we can always get such expressions for reversible chains, as follows.

Choose a fixed state s, and find paths from s to all other states. Denote the path to i by

s = j_{i1} → j_{i2} → ... → j_{i,m_i} = i   (18.70)

Define

w_i = 1 for i = s, and w_i = Π_{k=1}^{m_i − 1} r_{ik} for i ≠ s   (18.71)

where

r_{ik} = ρ(j_{ik}, j_{i,k+1}) / ρ(j_{i,k+1}, j_{ik})   (18.72)

Then the steady-state probabilities are

π_i = w_i / Σ_k w_k   (18.73)

You may notice that this looks similar to the derivation for birth/death processes, which, as has been pointed out, are reversible.
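To make this concrete, consider the M/M/1 chain, taking s = 0 and the path to state i to be 0 → 1 → ... → i, so that each ratio r_{ik} is λ/μ. A small R check (with arbitrary rates and a truncated state space):

lambda <- 0.8; mu <- 1.0
nstates <- 200                               # truncation of the infinite state space
w <- cumprod(c(1, rep(lambda/mu, nstates)))  # the weights of (18.71)
pii <- w / sum(w)                            # (18.73)
pii[1:4]                                     # compare with (18.3): (1-u) u^i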
18.7 Networks of Queues

18.7.1 Tandem Queues

Let's first consider an M/M/1 queue. As mentioned earlier, this is a birth/death process, thus reversible. This has an interesting and very useful application, as follows.

Think of the times at which jobs depart this system, i.e. the times at which jobs finish service. In the reversed process, these times are arrivals. Due to the reversibility, that means that the distribution of departure times is the same as that of arrival times. In other words:

• Departures from this system behave as a Poisson process with rate λ.

Also, let the initial state X(0) be distributed according to the steady-state probabilities π.⁷ Due to the PASTA property of Poisson arrivals, the distribution of the system state at arrival times is the same as the distribution of the system state at nonrandom times t. Then by reversibility, we have that:

• The state distribution at departure times is the same as at nonrandom times.

And finally, noting as in Section 18.6.1 that, given X(t), the states X(s), s ≤ t of the queue before time t are statistically independent of the arrival process after time t, reversibility gives us that:

⁷Recall Section 17.1.2.4.
• Given t, the departure process before time t is statistically independent of the states X(s), s ≥ t of the queue after time t.
Let's apply that to tandem queues, which are queues acting in series. Suppose we have two such queues, with the first, X_1(t), feeding its output to the second one, X_2(t), as input. Suppose the input into X_1(t) is a Poisson process with rate λ, and service times at both queues are exponentially distributed, with rates μ_1 and μ_2.

X_1(t) is an M/M/1 queue, so its steady-state probabilities are given by Equation (18.3), with u = λ/μ_1.

By the first bulleted item above, we know that the input into X_2(t) is also Poisson. Therefore, X_2(t) also is an M/M/1 queue, with steady-state probabilities as in Equation (18.3), with u = λ/μ_2.

Now, what about the joint distribution of [X_1(t), X_2(t)]? The third bulleted item above says that the input to X_2(t) up to time t is independent of X_1(s), s ≥ t. So, using the fact that we are assuming that X_1(0) has the steady-state distribution, we have that

P[X_1(t) = i, X_2(t) = j] = (1 − u_1) u_1^i · P[X_2(t) = j]   (18.74)
Now letting t → ∞, we get that the long-run probability of the vector [X_1(t), X_2(t)] being equal to (i,j) is

(1 − u_1) u_1^i (1 − u_2) u_2^j   (18.75)

In other words, the steady-state distribution for the vector has the two components of the vector being independent.

Equation (18.75) is called a product form solution to the balance equations for steady-state probabilities.

By the way, the vector [X_1(t), X_2(t)] is not reversible.
18.7.2 Jackson Networks

The tandem queues discussed in the last section comprise a special case of what are known as Jackson networks. Once again, there exists an enormous literature on Jackson and other kinds of queuing networks. The material can become very complicated (even the notation is very complex), and we will only present an introduction here. Our presentation is adapted from I. Mitrani, Modelling of Computer and Communication Systems, Cambridge University Press, 1987.
Our network consists of N nodes, and jobs move from node to node. There is a queue at each node, and service time at node i is exponentially distributed with mean 1/μ_i.
18.7.2.1 Open Networks

Each job originally arrives externally to the network, with the arrival rate at node i being r_i. After moving among various nodes, the job will eventually leave the network. Specifically, after a job completes service at node i, it moves to node j with probability q_{ij}, where

Σ_j q_{ij} < 1   (18.76)

reflecting the fact that the job will leave the network altogether with probability 1 − Σ_j q_{ij}.⁸ It is assumed that the movement from node to node is memoryless.

As an example, you may wish to think of movement of packets among routers in a computer network, with the packets being jobs and the routers being nodes.

Let λ_i denote the total traffic rate into node i. By the usual equating of flow in and flow out, we have

λ_i = r_i + Σ_{j=1}^N λ_j q_{ji}   (18.77)

Note that in Equations (18.77), the knowns are the r_i and the q_{ji}. We can solve this system of linear equations for the unknowns, the λ_i.

⁸By the way, q_{ii} can be nonzero, allowing for feedback loops at nodes.
The utilization at node i is then u_i = λ_i/μ_i, as before. Jackson's Theorem then says that in the long run, node i acts as an M/M/1 queue with that utilization, and that the nodes are independent in the long run:⁹

lim_{t→∞} P[X_1(t) = i_1, ..., X_N(t) = i_N] = Π_{k=1}^N (1 − u_k) u_k^{i_k}   (18.78)

So, again we have a product form solution.

⁹We do not present the proof here, but it really is just a matter of showing that the distribution here satisfies the balance equations.
Let L_i denote the average number of jobs at node i. From Equation (18.6), we have L_i = u_i/(1 − u_i). Thus the mean number of jobs in the system is

L = Σ_{i=1}^N u_i / (1 − u_i)   (18.79)

From this we can get W, the mean time that jobs stay in the network: from Little's Rule, L = γW, so

W = (1/γ) Σ_{i=1}^N u_i / (1 − u_i)   (18.80)

where γ = r_1 + ... + r_N is the total external arrival rate.

Jackson networks are not generally reversible. The reversed versions of Jackson networks are worth studying for other reasons, but we cannot pursue them here.
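Here is an R sketch of these computations (the 3-node routing matrix and all rates are arbitrary examples). In matrix form, (18.77) says λ = r + Q′λ, i.e. (I − Q′)λ = r:

r <- c(1.0, 0.5, 0)                 # external arrival rates r_i
q <- rbind(c(0,   0.6, 0.3),        # q[i,j] = P(move to node j after node i)
           c(0.1, 0,   0.5),
           c(0,   0.2, 0))
mu <- c(4, 3, 2)                    # service rates at the nodes
lambda <- solve(diag(3) - t(q), r)  # solve the traffic equations (18.77)
u <- lambda / mu                    # per-node utilizations; must all be < 1
L <- sum(u / (1-u))                 # (18.79)
W <- L / sum(r)                     # (18.80), with gamma = sum of the r_i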
18.7.3 Closed Networks

In a closed Jackson network, we have for all i,

r_i = 0 and Σ_j q_{ij} = 1   (18.81)

In other words, jobs never enter or leave the network. There have been many models like this in the computer performance modeling literature. For instance, a model might consist of some nodes representing CPUs, some representing disk drives, and some representing users at terminals.

It turns out that we again get a product form solution.¹⁰ The notation is more involved, so we will not present it here.

¹⁰This is confusing, since the different nodes are now not independent, due to the fact that the number of jobs in the overall system is constant.
Exercises

1. Investigate the robustness of the M/M/1 queue model with respect to the assumption of exponential service times, as follows. Suppose the service time is actually uniformly distributed on (0,c), so that the mean service time would be c/2. Assume that arrivals do follow the exponential model, with mean interarrival time 1.0. Find the mean residence time, using (18.9), and compare it to the true value obtained from (18.59). Do this for various values of c, and graph the two curves using R.
2. Many mathematical analyses of queuing systems use finite source models. There are always a fixed number j of jobs in the system. A job queues up for the server, gets served in time S, then waits a random time W before queuing up for the server again.

A typical example would be a file server with j clients. The time W would be the time a client does work before it needs to access the file server again.

(a) Use Little's Rule, on two or more appropriately chosen boxes, to derive the following relation:

E(R) = j E(S) / U − E(W)   (18.82)

where R is the residence time (time spent in the queue plus service time) in one cycle for a job and U is the utilization fraction of the server.

(b) Set up a continuous-time Markov chain, assuming exponential distributions for S and W, with state being the number of jobs currently at the server. Derive closed-form expressions for the π_i.
3. Consider the following variant of an M/M/1 queue: Each customer has a certain amount of patience, varying from one customer to another, exponentially distributed with rate η. When a customer's patience wears out while the customer is in the queue, he/she leaves (but not if his/her job is now in service). Arrival and service rates are λ and μ, respectively.

(a) Express the π_i in terms of λ, μ and η.

(b) Express the proportion of lost jobs as a function of the π_i, λ, μ and η.
4. A shop has two machines, with service time in machine i being exponentially distributed with rate μ_i, i = 1,2. Here μ_1 > μ_2. When a job reaches the head of the queue, it chooses machine 1 if that machine is idle, and otherwise waits for the first available machine. If when a job finishes on machine 1 there is a job in progress at machine 2, the latter job will be transferred to machine 1, getting priority over any queued jobs. Arrivals follow the usual Poisson process, with parameter λ.

(a) Find the mean residence time.

(b) Find the proportion of jobs that are originally assigned to machine 2.
Appendix A

Review of Matrix Algebra

This book assumes the reader has had a course in linear algebra (or has self-studied it, always the better approach). This appendix is intended as a review of basic matrix algebra, or a quick treatment for those lacking this background.

A.1 Terminology and Notation

A matrix is a rectangular array of numbers. A vector is a matrix with only one row (a row vector) or only one column (a column vector).

The expression, "the (i,j) element of a matrix," will mean its element in row i, column j.

Please note the following conventions:

• Capital letters, e.g. A and X, will be used to denote matrices and vectors.

• Lower-case letters with subscripts, e.g. a_{2,15} and x_8, will be used to denote their elements.

• Capital letters with subscripts, e.g. A_{13}, will be used to denote submatrices and subvectors.

If A is a square matrix, i.e. one with equal numbers n of rows and columns, then its diagonal elements are a_{ii}, i = 1,...,n.

The norm (or length) of an n-element vector X is

‖X‖ = √( Σ_{i=1}^n x_i² )   (A.1)
A.1.1 Matrix Addition and Multiplication

• For two matrices having the same numbers of rows and columns, addition is defined elementwise, e.g.

[ 1 5 ]   [ 6 2 ]   [ 7 7 ]
[ 0 3 ] + [ 0 1 ] = [ 0 4 ]   (A.2)
[ 4 8 ]   [ 4 0 ]   [ 8 8 ]
• Multiplication of a matrix by a scalar, i.e. a number, is also defined elementwise, e.g.

      [ 7 7 ]   [ 2.8 2.8 ]
0.4 · [ 0 4 ] = [ 0   1.6 ]   (A.3)
      [ 8 8 ]   [ 3.2 3.2 ]
• The inner product or dot product of equal-length vectors X and Y is defined to be

Σ_{k=1}^n x_k y_k   (A.4)

• The product of matrices A and B is defined if the number of rows of B equals the number of columns of A (A and B are said to be conformable). In that case, the (i,j) element of the product C is defined to be

c_{ij} = Σ_{k=1}^n a_{ik} b_{kj}   (A.5)
For instance,

[ 7 6 ]              [ 19 66 ]
[ 0 4 ]  [ 1 6 ]  =  [  8 16 ]   (A.6)
[ 8 8 ]  [ 2 4 ]     [ 24 80 ]

It is helpful to visualize c_{ij} as the inner product of row i of A and column j of B. For instance, the element 16 in the product above is the inner product of row 2 of the first matrix, (0,4), and column 2 of the second, (6,4):

0 · 6 + 4 · 4 = 16   (A.7)
Matrix multiplication is associative and distributive, but in general not commutative:

A(BC) = (AB)C   (A.8)
A(B + C) = AB + AC   (A.9)
AB ≠ BA   (A.10)
A.2 Matrix Transpose

• The transpose of a matrix A, denoted A′ or A^T, is obtained by exchanging the rows and columns of A, e.g.

[ 7 70 ]′
[ 8 16 ]   =  [  7  8  8 ]
[ 8 80 ]      [ 70 16 80 ]   (A.11)

• If A + B is defined, then

(A + B)′ = A′ + B′   (A.12)

• If A and B are conformable, then

(AB)′ = B′A′   (A.13)
A.3 Linear Independence

Equal-length vectors X_1, ..., X_k are said to be linearly independent if it is impossible for

a_1 X_1 + ... + a_k X_k = 0   (A.14)

to hold unless all the a_i are 0.
A.4 Determinants

Let A be an n×n matrix. The definition of the determinant of A, det(A), involves an abstract formula featuring permutations. It will be omitted here, in favor of the following computational method.

Let A_{(i,j)} denote the submatrix of A obtained by deleting its i-th row and j-th column. Then the determinant can be computed recursively across the k-th row of A as

det(A) = Σ_{m=1}^n (−1)^{k+m} a_{km} det(A_{(k,m)})   (A.15)

where

det [ s t ]
    [ u v ]  =  sv − tu   (A.16)
A.5 Matrix Inverse

• The identity matrix I of size n has 1s in all of its diagonal elements but 0s in all off-diagonal elements. It has the property that AI = A and IA = A whenever those products are defined.

• If A is a square matrix and AB = I, then B is said to be the inverse of A, denoted A^{−1}. Then BA = I will hold as well.

• A^{−1} exists if and only if the rows (or columns) of A are linearly independent.

• A^{−1} exists if and only if det(A) ≠ 0.

• If A and B are square, conformable and invertible, then AB is also invertible, and

(AB)^{−1} = B^{−1} A^{−1}   (A.17)
A.6 Eigenvalues and Eigenvectors

Let A be a square matrix.¹

• A scalar λ and a nonzero vector X that satisfy

AX = λX   (A.18)

are called an eigenvalue and eigenvector of A, respectively.

• A matrix U is said to be orthogonal if its rows have norm 1 and are orthogonal to each other, i.e. their inner product is 0. U thus has the property that UU′ = I, i.e. U^{−1} = U′.

• If A is symmetric and real, then it is diagonalizable, i.e. there exists an orthogonal matrix U such that

U′AU = D   (A.19)

for a diagonal matrix D. The elements of D are the eigenvalues of A, and the columns of U are the eigenvectors of A.

¹For nonsquare matrices, the discussion here would generalize to the topic of singular value decomposition.
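In R (see Appendix B), eigenvalues and eigenvectors are computed by the eigen() function. A small illustration, using an arbitrary symmetric matrix:

a <- rbind(c(2,1), c(1,2))
ev <- eigen(a)
ev$values          # eigenvalues: 3 and 1
u <- ev$vectors    # orthogonal, since a is symmetric
t(u) %*% a %*% u   # recovers the diagonal matrix D, as in (A.19)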
Appendix B

R Quick Start

Here we present a quick introduction to the R data/statistical programming language. Further learning resources are listed at http://heather.cs.ucdavis.edu/~matloff/r.html.

R syntax is similar to that of C. It is object-oriented (in the sense of encapsulation, polymorphism and everything being an object) and is a functional language (i.e. almost no side effects, every action is a function call, etc.).
B.1 Correspondences

aspect                            C/C++                              R
assignment                        =                                  <- (or =)
array terminology                 array                              vector, matrix, array
subscripts                        start at 0                         start at 1
array notation                    m[2][3]                            m[2,3]
2-D array storage                 row-major order                    column-major order
mixed container                   struct, members accessed by .      list, members accessed by $ or [[ ]]
return mechanism                  return                             return() or last value computed
primitive types                   int, float, double, char, bool     integer, float, double, character, logical
mechanism for combining modules   include, link                      library()
run method                        batch                              interactive, batch
B.2 Starting R

To invoke R, just type "R" into a terminal window. On a Windows machine, you probably have an R icon to click.

If you prefer to run from an IDE, you may wish to consider ESS for Emacs, StatET for Eclipse or RStudio, all open source.

R is normally run in interactive mode, with > as the prompt. Among other things, that makes it easy to try little experiments to learn from; remember my slogan, "When in doubt, try it out!"
B.3 First Sample Programming Session

Below is a commented R session, to introduce the concepts. I had a text editor open in another window, constantly changing my code, then loading it via R's source() command. The original contents of the file odd.R were:

oddcount <- function(x) {
   k <- 0  # assign 0 to k
   for (n in x) {
      if (n %% 2 == 1) k <- k+1  # %% is the modulo operator
   }
   return(k)
}

By the way, we could have written that last statement as simply

k

because the last computed value of an R function is returned automatically.
The R session is shown below. You may wish to type it yourself as you go along, trying little experiments of your own along the way.¹

> source("odd.R")  # load code from the given file
> ls()  # what objects do we have?
[1] "oddcount"
> # what kind of object is oddcount (well, we already know)?
> class(oddcount)
[1] "function"
> # while in interactive mode, can print any object by typing its name;
> # otherwise use print(), e.g. print(x+y)
> oddcount
function(x) {
   k <- 0  # assign 0 to k
   for (n in x) {
      if (n %% 2 == 1) k <- k+1  # %% is the modulo operator
   }
   return(k)
}

> # let's test oddcount(), but look at some properties of vectors first
> y <- c(5,12,13,8,88)  # c() is the concatenate function
> y
[1]  5 12 13  8 88
> y[2]  # R subscripts begin at 1, not 0
[1] 12
> y[2:4]  # extract elements 2, 3 and 4 of y
[1] 12 13  8
> y[c(1,3:5)]  # elements 1, 3, 4 and 5
[1]  5 13  8 88
> oddcount(y)  # should report 2 odd numbers
[1] 2

> # change code (in the other window) to vectorize the count operation,
> # for much faster execution
> source("odd.R")
> oddcount
function(x) {
   x1 <- (x %% 2) == 1  # x1 now a vector of TRUEs and FALSEs
   x2 <- x[x1]  # x2 now has the elements of x that were TRUE in x1
   return(length(x2))
}

> # try it on subset of y, elements 2 through 3
> oddcount(y[2:3])
[1] 1
> # try it on subset of y, elements 2, 4 and 5
> oddcount(y[c(2,4,5)])
[1] 0

> # further compactify the code
> source("odd.R")
> oddcount
function(x) {
   length(x[x %% 2 == 1])  # last value computed is auto returned
}
> oddcount(y)  # test it
[1] 2

> # now have ftn return odd count AND the odd numbers themselves, using
> # the R list type
> source("odd.R")
> oddcount
function(x) {
   x1 <- x[x %% 2 == 1]
   return(list(odds=x1, numodds=length(x1)))
}
> # R's list type can contain any type; components delineated by $
> oddcount(y)
$odds
[1]  5 13

$numodds
[1] 2

> ocy <- oddcount(y)  # save the output in ocy, which will be a list
> ocy
$odds
[1]  5 13

$numodds
[1] 2

> ocy$odds
[1]  5 13
> ocy[[1]]  # can get list elements using [[ ]] instead of $
[1]  5 13
> ocy[[2]]
[1] 2

¹The source code for this file is at http://heather.cs.ucdavis.edu/~matloff/MiscPLN/R5MinIntro.tex.
Note that the function of the R function function() is to produce functions! Thus assignment is used. For example, here is what odd.R looked like at the end of the above session:

oddcount <- function(x) {
   x1 <- x[x %% 2 == 1]
   return(list(odds=x1, numodds=length(x1)))
}

We created some code, and then used function() to create a function object, which we assigned to oddcount.

Note that we eventually vectorized our function oddcount(). This means taking advantage of the vector-based, functional language nature of R, exploiting R's built-in functions instead of loops. This changes the venue from interpreted R to C level, with a potentially large increase in speed. For example:

> x <- runif(1000000)  # 1000000 random numbers from the interval (0,1)
> system.time(sum(x))
   user  system elapsed
  0.008   0.000   0.006
> system.time({s <- 0; for (i in 1:1000000) s <- s + x[i]})
   user  system elapsed
  2.776   0.004   2.859
B.4 Second Sample Programming Session

A matrix is a special case of a vector, with added class attributes, the numbers of rows and columns.

> # rbind() function combines rows of matrices; there's a cbind() too
> m1 <- rbind(1:2, c(5,8))
> m1
     [,1] [,2]
[1,]    1    2
[2,]    5    8
> rbind(m1, c(6,-1))
     [,1] [,2]
[1,]    1    2
[2,]    5    8
[3,]    6   -1

> # form matrix from 1,2,3,4,5,6, in 2 cols; R uses column-major storage
> m2 <- matrix(1:6, nrow=2)
> m2
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> ncol(m2)
[1] 3
> nrow(m2)
[1] 2
> m2[2,3]  # extract element in row 2, col 3
[1] 6
> # get submatrix of m2, cols 2 and 3, any row
> m3 <- m2[, 2:3]
> m3
     [,1] [,2]
[1,]    3    5
[2,]    4    6

> m1 * m3  # elementwise multiplication
     [,1] [,2]
[1,]    3   10
[2,]   20   48
> 2.5 * m3  # scalar multiplication (but see below)
     [,1] [,2]
[1,]  7.5 12.5
[2,] 10.0 15.0
> m1 %*% m3  # linear algebra matrix multiplication
     [,1] [,2]
[1,]   11   17
[2,]   47   73

> # matrices are special cases of vectors, so can treat them as vectors
> sum(m1)
[1] 16
> ifelse(m2 %% 3 == 1, 0, m2)  # (see below)
     [,1] [,2] [,3]
[1,]    0    3    5
[2,]    2    0    6
The scalar multiplication above is not quite what you may think, even though the result may be. Here's why:

In R, scalars don't really exist; they are just one-element vectors. However, R usually uses recycling, i.e. replication, to make vector sizes match. In the example above in which we evaluated the expression 2.5 * m3, the number 2.5 was recycled to the matrix

[ 2.5 2.5 ]
[ 2.5 2.5 ]   (B.1)

in order to conform with m3 for (elementwise) multiplication.

The ifelse() function is another example of vectorization. Its call has the form

ifelse(booleanvectorexpression1, vectorexpression2, vectorexpression3)

All three vector expressions must be the same length, though R will lengthen some via recycling. The action will be to return a vector of the same length (and if matrices are involved, then the result also has the same shape). Each element of the result will be set to its corresponding element in vectorexpression2 or vectorexpression3, depending on whether the corresponding element in vectorexpression1 is TRUE or FALSE.

In our example above,

> ifelse(m2 %% 3 == 1, 0, m2)  # (see below)

the expression m2 %% 3 == 1 evaluated to the boolean matrix

[ T F F ]
[ F T F ]   (B.2)

(TRUE and FALSE may be abbreviated to T and F.)

The 0 was recycled to the matrix

[ 0 0 0 ]
[ 0 0 0 ]   (B.3)

while vectorexpression3, m2, evaluated to itself.
B.5 Third Sample Programming Session

This time, we focus on vectors and matrices.

> m <- rbind(1:3, c(5,12,13))  # "row bind," combine rows
> m
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    5   12   13
> t(m)  # transpose
     [,1] [,2]
[1,]    1    5
[2,]    2   12
[3,]    3   13
> ma <- m[, 1:2]
> ma
     [,1] [,2]
[1,]    1    2
[2,]    5   12
> rep(1,2)  # "repeat," make multiple copies
[1] 1 1
> ma %*% rep(1,2)  # matrix multiply
     [,1]
[1,]    3
[2,]   17
> solve(ma, c(3,17))  # solve linear system
[1] 1 1
> solve(ma)  # matrix inverse
     [,1] [,2]
[1,]  6.0 -1.0
[2,] -2.5  0.5
B.6 Complex Numbers

Here is a sample of use of the main functions of interest:

> za <- complex(real=2, imaginary=3.5)
> za
[1] 2+3.5i
> zb <- complex(real=1, imaginary=-5)
> zb
[1] 1-5i
> za * zb
[1] 19.5-6.5i
> Re(za)
[1] 2
> Im(za)
[1] 3.5
> za^2
[1] -8.25+14i
> abs(za)
[1] 4.031129
> exp(complex(real=0, imaginary=pi/4))
[1] 0.7071068+0.7071068i
> cos(pi/4)
[1] 0.7071068
> sin(pi/4)
[1] 0.7071068

Note that operations with complex-valued vectors and matrices work as usual; there are no special complex functions.
B.7 Other Sources for Learning R

There are tons of resources for R on the Web. You may wish to start with the links at http://heather.cs.ucdavis.edu/~matloff/r.html.
B.8 Online Help

R's help() function, which can be invoked also with a question mark, gives short descriptions of the R functions. For example, typing

> ?rep

will give you a description of R's rep() function.

An especially nice feature of R is its example() function, which gives nice examples of whatever function you wish to query. For instance, typing

> example(wireframe)

will show examples (R code and resulting pictures) of wireframe(), one of R's 3-dimensional graphics functions.
B.9 Debugging in R

The internal debugging tool in R, debug(), is usable but rather primitive. Here are some alternatives:

• The StatET IDE for R on Eclipse has a nice debugging tool. Works on all major platforms, but can be tricky to install.

• Revolution Analytics' IDE for R is good too, but requires Microsoft Visual Studio.

• My own debugging tool, debugR, is extensive and easy to install, but for the time being is limited to Linux, Mac and other Unix-family systems. See http://heather.cs.ucdavis.edu/debugR.html.