
Introduction to Probability and Statistics: notes for a short course

Jonathan G. Campbell Department of Computing, Letterkenny Institute of Technology, Co. Donegal, Ireland. email: jonathan dot campbell (at) gmail.com, jonathan.campbell@lyit.ie URL: http://www.jgcampbell.com/stats/stats.pdf Report No: jc/09/0004/r Revision 0.3 18th August 2009

Contents

1 Introduction
  1.1 Purpose and Scope
  1.2 Why use R?
  1.3 Relevant textbooks and web sources
    1.3.1 General Books on Probability and Statistics
    1.3.2 Books on R and Statistics using R
    1.3.3 Bayesian Statistics
    1.3.4 Web Links
  1.4 Outline
2 Simple Data Analysis and Visualisation and Introduction to R
  2.1 Introduction
    2.1.1 Installation of R
    2.1.2 Running R
  2.2 Visualisation and Exploratory Data Analysis
3 Averages
  3.1 Introduction
  3.2 Arithmetic Mean
    3.2.1 Arithmetic Mean using Frequencies
  3.3 Median
  3.4 Mode
  3.5 Other Means
4 Measures of Data Variability
  4.1 Introduction
  4.2 Variance and Standard Deviation
    4.2.1 Equalising the means
    4.2.2 Variability and spread
    4.2.3 Variance and Standard Deviation
  4.3 Standard Scores and Normalising Marks
    4.3.1 Standard Scores
5 Probability and Random Variables
  5.1 Introduction
  5.2 Basic Probability and Random Variables
    5.2.1 Introduction
    5.2.2 Probability and Events
    5.2.3 A Point on Terminology
    5.2.4 Probability of Non-disjoint Events
    5.2.5 Finite Sample Spaces
  5.3 Random Variables
  5.4 Computing probabilities
  5.5 Enumerating more complex events and sample spaces
    5.5.1 Multiplication of outcomes
    5.5.2 Addition of outcomes
    5.5.3 Permutations
    5.5.4 Combinations
  5.6 Conditional Probability
    5.6.1 Venn diagrams
    5.6.2 Probability Trees
    5.6.3 Joint Probability
  5.7 Bayes' Rule
  5.8 Independent Events
  5.9 Betting and Odds
  5.10 Classical versus Bayesian Interpretations of Probability
6 One Dimensional Random Variables
  6.1 Introduction
    6.1.1 Definition: Random Variable
    6.1.2 Probability associated with a Random Variable
  6.2 Probability Mass Function (pmf) of a Discrete r.v.
  6.3 Some Discrete Random Variables
    6.3.1 Point Mass Distribution
    6.3.2 Discrete Uniform Distribution
    6.3.3 Bernoulli Distribution
    6.3.4 Binomial Distribution
    6.3.5 Geometric Distribution
    6.3.6 Poisson Distribution
  6.4 Some Continuous Random Variables
    6.4.1 Probability Density Function (PDF)
    6.4.2 Cumulative Distribution Function (cdf)
    6.4.3 Uniform Distribution
    6.4.4 Normal (Gaussian) Distribution
    6.4.5 Exponential Distribution
    6.4.6 Gamma Distribution
    6.4.7 Beta Distribution
    6.4.8 Student t Distribution
    6.4.9 Cauchy Distribution
    6.4.10 Chi-squared Distribution
  6.5 Range spaces terminology
  6.6 Parameters
7 Two- and Multi-Dimensional Random Variables
  7.1 Introduction
  7.2 Probability Function of a Discrete Two-dimensional r.v.
  7.3 PDF of a Continuous Two-dimensional r.v.
  7.4 Marginal Probability Distributions
  7.5 Conditional Probability Distributions
  7.6 Independent Random Variables
  7.7 Two-dimensional (Bivariate) Normal Distribution
8 Characterisations of Random Variables
  8.1 Introduction
  8.2 Expected Value (Mean) of a Random Variable
  8.3 Variance of a Random Variable
  8.4 Expectations in Two-dimensions
    8.4.1 Mean
    8.4.2 Covariance
9 The Normal Distribution
  9.1 Introduction
  9.2 Cumulative Distribution Function (cdf)
  9.3 Normal Cdf
  9.4 Using the Normal Cdf
  9.5 Sum of Independent Normal Random Variables
  9.6 Differences of Normal Random Variables
  9.7 Linear Transformations of Normal Random Variables
  9.8 The Central Limit Theorem
10 Statistical Inference
  10.1 Introduction
11 Statistical Estimation
  11.1 Introduction
  11.2 Populations and Samples
  11.3 Estimating the Mean
  11.4 Estimating the Standard Deviation
  11.5 Sampling Distributions
    11.5.1 Sampling Distribution of the mean
    11.5.2 Sampling Distribution for Estimates of the Standard Deviation
  11.6 Confidence Intervals
12 Hypothesis Testing
  12.1 Introduction
13 Sampling
  13.1 Introduction
14 Classification and Pattern Recognition
  14.1 Introduction
15 Simple Classifier Methods
  15.1 Thresholding for one-dimensional data
  15.2 Linear separating lines/planes for two-dimensions
  15.3 Nearest mean classifier
  15.4 Normal form of the separating line, projections, and linear discriminants
  15.5 Projection and linear discriminant
  15.6 Projections and linear discriminants in p dimensions
  15.7 Template Matching and Discriminants
  15.8 Nearest neighbour methods
16 Statistical Classifier Methods
  16.1 One-dimensional classification revisited
  16.2 Bayes' Rule for the Inversion of Conditional Probabilities
  16.3 Parametric Methods
  16.4 Discriminants based on Normal Density
  16.5 Bayes-Gauss Classifier Special Cases
    16.5.1 Equal and Diagonal Covariances
    16.5.2 Equal but General Covariances
  16.6 Least square error trained classifier
  16.7 Generalised linear discriminant function
17 Linear Discriminant Analysis and Principal Components Analysis
  17.1 Principal Components Analysis
  17.2 Fisher's Linear Discriminant Analysis
18 Neural Network Methods
  18.1 Neurons for Boolean Functions
  18.2 Three-layer neural network for arbitrarily complex decision regions
  18.3 Sigmoid activation functions
19 Unsupervised Classification (Clustering)
20 Regression
  20.1 Linear Regression
A Basic Mathematical Notation
  A.1 Sets
    A.1.1 Set Definition and Membership
    A.1.2 Important Number Sets
    A.1.3 Set Operations
    A.1.4 Venn Diagrams
  A.2 Iterated Summation and Product Notation
  A.3 Iterated Union and Intersection
  A.4 Cartesian Product Sets
B Matrices and Linear Algebra
  B.1 Introduction
  B.2 Linear Simultaneous Equations
  B.3 Vectors and Matrices
  B.4 Basic Matrix Arithmetic
    B.4.1 Matrix Multiplication
    B.4.2 Multiplication by a Scalar
    B.4.3 Addition
  B.5 Special Matrices
    B.5.1 Identity Matrix
    B.5.2 Orthogonal Matrix
    B.5.3 Diagonal
    B.5.4 Transpose of a Matrix
  B.6 Inverse Matrix
  B.7 Multidimensional (Multivariate) Random Variables

Chapter 1 Introduction
1.1 Purpose and Scope

This report is written as the basis for a short course on statistics to be presented to postgraduate students at Letterkenny Institute of Technology.

The notes have a mixed objective. I started writing a set of notes based on the traditional approach to probability and statistics, namely: basic probability, up to and including conditional probability, independence and Bayes' Law; then some one-dimensional discrete and continuous distributions and some of their properties; and then on to sampling, parameter estimation, point estimates, confidence intervals, and hypothesis testing. However, after discussion with someone who knows potential consumers of the course, I was persuaded to start with a more gentle introduction. Hence I start off with simple visualisation, then look at averages (central tendency), then variance, and then return to the main line.

As I say, the notes have a mixed objective. One objective is to serve as notes for a gentle introduction to statistics; another is to include a set of reference results that one would refer to during a course. That is, a course presenter might not want to spend time on the details of, for example, the Binomial distribution, or even full details of the Normal, but it would be useful for students to have access to some of these details without having to consult one or more textbooks.

When I give a course, I may give attendees a printout of all the notes, including an outline of the objective of the course and the plan of coverage, mentioning the chapters that will be used. Alternatively, I may do a specialised printout that includes only the chapters to be covered. The notes you see here include everything.

1.2 Why use R?

Let me quote from the R website http://www.r-project.org/:

"R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.

R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, . . . ) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.

One of R's strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control."

When I have to choose a software package for teaching or for practical use (I mean generally; it could be a development system for a programming language, a computer games engine, a statistics package, . . . ) I look primarily at the following criteria:

- Is it easily available, i.e. is it already installed on our laboratory machines, or is it easy (and cheap) to acquire? R does well on this criterion; it is free to download and install, see Section 2.1.1.

- Is it well supported by textbooks and online documentation? Again, R does well. In the past ten years, and accelerating greatly in the last five, a great many top class books have appeared on R and on particular statistical techniques using R; see Section 1.3. I notice that books that used to have just numerical examples now, in recent editions, give R examples. There is a top class mailing list supported by volunteers of the highest calibre: https://stat.ethz.ch/mailman/listinfo/r-help. Via that mailing list, I have received assistance from world-class statisticians.

- Is it widely used? Yes.

1.3 Relevant textbooks and web sources

1.3.1 General Books on Probability and Statistics

These notes are mostly based on (Meyer 1966), which was used for a college course on statistics that I attended; (Wasserman 2004), which is a good summary of all the statistics you might ever need, but is not an introduction; (Griffiths 2009) and (Milton 2009), which are excellent introductions though very wordy; (Crawley 2005); and (Spiegel & Stephens 2008). The last of these has plenty of examples, including some on the use of the Excel spreadsheet.

(Dytham 2009) seems to be a good introduction for biologists, and the more advanced (Quinn & Keough 2002) receives a lot of recommendations.

Hacking's book (Hacking 2001) is maybe a good introduction to probability and the philosophy and practice of probabilistic inference.

The bibliography contains books in my collection which I may have used in some small way and/or which may be useful to users of these notes.

1.3.2 Books on R and Statistics using R

Crawley may be the best general book (Crawley 2005); for bio-scientists it has the advantage that Crawley's research area is bio-science. Venables and Ripley's MASS (Venables & Ripley 2002) is top class; note, do not be confused by the title, Modern Applied Statistics with S: R is an open-source version of S (and S-Plus) and the book covers any differences, which are minimal. Maindonald (Maindonald & Braun 2007) is good for R graphics; R code for all his diagrams is available online (free). Matloff's R for Programmers (Matloff 2008) has the advantage that it is available online. See also the extensive list at http://www.r-project.org/doc/bib/R-books.html

1.3.3 Bayesian Statistics

Not that we'll be emphasising the Bayesian approach, but see (Sivia 2006), the best introduction to Bayesian statistics, (MacKay 2002) and (Lee 2004).

1.3.4 Web Links

General: http://www.jgcampbell.com/links/stats.html; R: http://www.r-project.org/.

1.4 Outline

Chapter 5 gives an introduction to probability; if you want to understand basic statistics you must have a basic understanding of probability. However, we note that probability is to a great extent common sense. Before starting you should have a quick run through Appendix A, just to familiarise yourself with basic mathematical notation. The mathematical notation used is no more than shorthand; it would be difficult to write these notes without employing that shorthand and, in addition, you will encounter similar shorthand in books and research papers.

Chapter 2 gives a very brief introduction to simple statistical techniques and visualisation and to the statistical package R.

Chapter 3 gives a brief introduction to averages, or what statisticians call central tendency.

Chapter 4 introduces methods of describing data variability, most notably variance and standard deviation.

Chapter 6 introduces random variables and lists the common one-dimensional probability distributions.

Chapter 7 gives a brief introduction to multivariate random variables and some distributions. Note that Appendix B gives a gentle introduction to vector and matrix mathematics, which are necessary in multivariate statistics.

Chapter 8 discusses important characteristics of random variables such as mean and variance.

Chapter 9 gives specialised treatment to the normal distribution in view of its importance in applications.

Chapter 10 introduces statistical inference, that is, how we can infer properties of a population from statistics derived from a sample.

One aspect of statistical inference is parameter estimation; Chapter 11 introduces point estimation and confidence interval estimation.

Hypothesis testing is strongly related to estimation; Chapter 12 gives an introduction to hypothesis testing.

Chapter 13 discusses some of the intricacies of sampling.

As of 2009-08-18 this is work in progress and will remain so for the foreseeable future.

Chapter 2 Simple Data Analysis and Visualisation and Introduction to R


2.1 Introduction

The objectives of this chapter are to give a very brief introduction to simple statistical techniques and visualisation and to the statistical package R.

2.1.1 Installation of R

Click on http://www.r-project.org/ and find the Download link. For Windows users there is an exe file which does everything. You may need Administrator rights on your machine; contact Computer Services as necessary. Linux users are probably best advised to rely on the installer of their particular Linux distribution.

2.1.2 Running R

Start R by clicking on the R desktop icon. R will open up a window with something like the following in it.

    R version 2.7.1 (2008-06-23)
    Copyright (C) 2008 The R Foundation for Statistical Computing
    ISBN 3-900051-07-0

    R is free software and comes with ABSOLUTELY NO WARRANTY.
    You are welcome to redistribute it under certain conditions.
    Type 'license()' or 'licence()' for distribution details.

    Type 'demo()' for some demos, 'help()' for on-line help, or
    'help.start()' for an HTML browser interface to help.
    Type 'q()' to quit R.

    >

The > is R asking you to enter something, as on a calculator; R can operate as a simple calculator, but of course we are interested in its use as a powerful statistical calculator.

    > 2 + 3
    [1] 5
    > sqrt(26)
    [1] 5.09902
    > 3^4
    [1] 81

For the remainder of this chapter we'll look at a significant example involving visualisation and exploratory data analysis on a data set.

2.2 Visualisation and Exploratory Data Analysis

We're going to read in some examination result data and analyse them. The file exam.txt contains data as follows:

    exam
    65
    60
    47
    ... etc.

There are 66 results in total. The name of the column is exam and we tell R to pay attention to that (header = T). In what follows, # is a comment symbol and R ignores anything after the # until the next line.

Anything after > is something that you typed, a request to R. If something appears without a >, that is an R response.

    > ex <- read.table("exam.txt", header = T)
    > attach(ex)
    > exam    # print exam data on the screen
     [1] 65 60 47 43 51 32 62 71  0 56 52 59 15 49 54 67 44  2 47 61 45 95 62 80 46
    [26] 52 61 12 62 69 78 62 48 56 56 58 60  0 48 71 50 90 51 53  5 51 63 35 39 10
    [51] 57 53 20 54 22 44 53 52 25 60 55 39 30 53 67 50

That printout is quite uninformative; for example, you have no idea what the maximum is, nor the range, nor have you even a rough idea of what the average mark is. Let us look at a histogram.

    > hist(exam)

And we get Figure 2.1. Often, like me here, you want to save the diagram to a file so that you can include it in a report. Here is how to do that; vis1-1.pdf is a filename that I made up.

    > pdf("vis1-1.pdf", onefile=FALSE, height=8, width=6, pointsize=8, paper="special")
    > hist(exam)
    > devoff()
    Error: could not find function "devoff"    # R complaining ...
    > dev.off()    # do this to finalise and close the file
    # if you don't, it's like forgetting to save in a word processor.
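The open-plot-close pattern above can be tried in isolation. The following is only a minimal sketch: the short marks vector and the use of tempfile() for the output name are assumptions made for illustration, and in practice you would pass your own filename, as with vis1-1.pdf.

```r
# Illustrative data only; any numeric vector would do.
marks <- c(65, 60, 47, 43, 51, 32, 62, 71, 0, 56)

# tempfile() gives a throwaway filename; substitute your own, e.g. "vis1-1.pdf".
outfile <- tempfile(fileext = ".pdf")

# Open the PDF device, draw into it, then close it.
pdf(outfile, onefile = FALSE, height = 8, width = 6, pointsize = 8, paper = "special")
hist(marks)
invisible(dev.off())   # finalise and close the file

# Check that something was actually written.
saved <- file.exists(outfile) && file.info(outfile)$size > 0
print(saved)
```

Nothing reaches the file until dev.off() is called, which is why forgetting it tends to leave an empty or unreadable PDF.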

[Figure 2.1: Histogram of exam marks. Plot title: Histogram of exam; x-axis: exam; y-axis: Frequency.]

Let us see what the average mark is and the range of marks:

    > mean(exam)
    [1] 49.07576
    > range(exam)
    [1]  0 95

We could have used:

    > length(exam)
    [1] 66    # 66 results in exam
    > sum(exam)/length(exam)
    [1] 49.07576

Let us see the data in sorted order, a good deal more informative than unsorted:

    > sort(exam)
     [1]  0  0  2  5 10 12 15 20 22 25 30 32 35 39 39 43 44 44 45 46 47 47 48 48 49
    [26] 50 50 51 51 51 52 52 52 53 53 53 53 54 54 55 56 56 56 57 58 59 60 60 60 61
    [51] 61 62 62 62 62 63 65 67 67 69 71 71 78 80 90 95

Now read in the corresponding continuous assessment (CA) marks (coursework); they came from a spreadsheet so there's a load of digits after the decimal point and that makes the data even more incomprehensible, so we use round to round them to the nearest integer. It looks like the CA marks are more generous than the exam marks, and mean(ca) confirms this, as does the histogram in Figure 2.2.

    > cw <- read.table("ca.txt", header = T)
    > attach(cw)
    > ca
     [1] 91.34390 85.54622 72.65543 63.10473 73.22074 50.99642 85.69151 97.06528
     [9] 18.58191 83.30836 78.78221 77.68898 21.07860 76.04457 76.56793 86.90106
    [17] 61.70048 16.28892 69.57387 83.08058 74.19594 97.12300 81.58833 98.12345
    [25] 60.17263 79.49133 89.35610 27.89478 98.06673 92.34510 96.19500 88.69131
    [33] 69.70333 85.23094 86.99767 82.89807 77.35877 15.12655 72.41332 90.07670
    [41] 75.20815 97.17500 65.78075 70.29256 14.20315 73.02363 87.38178 52.74194
    [49] 60.66164 20.05529 78.16085 73.58862 34.07182 78.03601 39.31353 69.57565
    [57] 77.53929 77.20521 52.67979 89.10232 76.78222 54.16873 40.23080 81.09443
    [65] 89.12518 67.58763
    > car = round(ca)
    > car
     [1] 91 86 73 63 73 51 86 97 19 83 79 78 21 76 77 87 62 16 70 83 74 97 82 98 60
    [26] 79 89 28 98 92 96 89 70 85 87 83 77 15 72 90 75 97 66 70 14 73 87 53 61 20
    [51] 78 74 34 78 39 70 78 77 53 89 77 54 40 81 89 68
    > sort(car)
     [1] 14 15 16 19 20 21 28 34 39 40 51 53 53 54 60 61 62 63 66 68 70 70 70 70 72
    [26] 73 73 73 74 74 75 76 77 77 77 77 78 78 78 78 79 79 81 82 83 83 83 85 86 86
    [51] 87 87 87 89 89 89 89 90 91 92 96 97 97 97 98 98
    > mean(ca)
    [1] 70.10692
    > hist(ca)
    # and save another one to a file
    > pdf("vis1-ca.pdf", onefile=FALSE, height=4, width=6, pointsize=8, paper="special")
    > hist(ca)
    > dev.off()

[Figure 2.2: Histogram of CA marks. Plot title: Histogram of ca; x-axis: ca; y-axis: Frequency.]
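If exam.txt is not to hand, the summary steps above can be reproduced by typing the marks in directly. This is a sketch only: the vector below is copied from the sorted printout of the exam marks, the three CA values are the first three printed for ca, and the variable names mean_by_hand and ca_rounded are made up for illustration.

```r
# The 66 exam marks, copied from the sorted printout (three rows of 25, 25 and 16).
exam <- c(
   0,  0,  2,  5, 10, 12, 15, 20, 22, 25, 30, 32, 35, 39, 39, 43, 44, 44, 45, 46, 47, 47, 48, 48, 49,
  50, 50, 51, 51, 51, 52, 52, 52, 53, 53, 53, 53, 54, 54, 55, 56, 56, 56, 57, 58, 59, 60, 60, 60, 61,
  61, 62, 62, 62, 62, 63, 65, 67, 67, 69, 71, 71, 78, 80, 90, 95
)

n <- length(exam)                 # number of results
mean_by_hand <- sum(exam) / n     # the arithmetic mean, computed the long way
exam_range <- range(exam)         # smallest and largest mark

# round() takes a numeric vector and rounds each element to the nearest integer.
ca_first3 <- c(91.34390, 85.54622, 72.65543)
ca_rounded <- round(ca_first3)

print(n)
print(mean_by_hand)
print(mean(exam))                 # same value, using the built-in function
print(exam_range)
print(ca_rounded)
```

Typing data in with c() is practical only for small data sets; for anything larger, read.table on a file, as in the main text, is less error-prone.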

Boxplots are another way of examining a data set. Figure 2.3 shows boxplots for the examination and CA results. The construction of the boxplot is as follows: (a) the heavy line across the interior of the box corresponds to the median value (see Chapter 3); (b) the bottom and top of the box correspond to, respectively, the lower quartile and upper quartile, i.e. 25% of the data are below the lower quartile and 25% are above the upper quartile (or, if you like, 75% are below it). The so-called whiskers show the smallest and largest values excluding boxplot's interpretation of outliers. The outliers are then shown as single points.

Quartile is a specialisation of the general term quantile, see Chapter 4. In Chapters 9, 11 and 12, we'll come across, for example, 5% and 95% quantiles. The median is the centre of the data, i.e. as many of the data are above the median as are below it; see Chapter 3.

To determine what are outliers, boxplot by default uses the interquartile range (IQR), i.e. the height of the box: any data lying more than 1.5 times the IQR below the lower quartile or above the upper quartile are labelled as outliers.

[Figure 2.3: Boxplot of: left, examination marks; right, CA marks.]
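The numbers behind a boxplot can be inspected without drawing anything. Here is a small sketch using boxplot.stats() on the exam marks, typed in by hand from the sorted printout; it assumes R's default whisker rule (coef = 1.5), and the names bs, five and outliers are made up for illustration.

```r
# The 66 exam marks, copied from the sorted printout.
exam <- c(
   0,  0,  2,  5, 10, 12, 15, 20, 22, 25, 30, 32, 35, 39, 39, 43, 44, 44, 45, 46, 47, 47, 48, 48, 49,
  50, 50, 51, 51, 51, 52, 52, 52, 53, 53, 53, 53, 54, 54, 55, 56, 56, 56, 57, 58, 59, 60, 60, 60, 61,
  61, 62, 62, 62, 62, 63, 65, 67, 67, 69, 71, 71, 78, 80, 90, 95
)

bs <- boxplot.stats(exam)   # default coef = 1.5

# stats holds: lower whisker end, lower hinge, median, upper hinge, upper whisker end.
five <- bs$stats

# out holds the points that boxplot would draw individually as outliers.
outliers <- sort(bs$out)

print(five)
print(outliers)
```

Note that boxplot.stats() reports hinges, which are close to, but not always identical to, the quartiles returned by quantile().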

How to look at the two data sets together? There must be a way of superimposing one histogram on another, but I haven't found that yet. So let us display a two-dimensional scatter plot of the two data sets, see Figure 2.4.

    > library(lattice)    # first we must load a library that has xyplot in it
    > xyplot(exam ~ ca)
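On the question of superimposing histograms, one possible approach (not the one used in these notes) is to draw the second histogram with add = TRUE in base graphics. The sketch below uses two short made-up vectors in place of exam and ca, a shared set of bins, and semi-transparent colours; all names and values are assumptions for illustration.

```r
# Made-up marks for illustration; replace with exam and ca.
a <- c(35, 42, 47, 51, 53, 55, 58, 60, 62, 71)
b <- c(50, 61, 66, 70, 73, 77, 80, 84, 88, 95)

# Common bins, so that the bars of the two histograms line up.
brks <- seq(0, 100, by = 10)

# Draw into a throwaway PDF; a screen device would work equally well.
outfile <- tempfile(fileext = ".pdf")
pdf(outfile)
hist(a, breaks = brks, col = rgb(0, 0, 1, 0.4),
     main = "Two histograms", xlab = "mark")
# add = TRUE draws on top of the existing plot instead of starting a new one.
hist(b, breaks = brks, col = rgb(1, 0, 0, 0.4), add = TRUE)
invisible(dev.off())

overlay_saved <- file.exists(outfile) && file.info(outfile)$size > 0
print(overlay_saved)
```

The vertical axis is fixed by the first histogram, so if the second has taller bars you may need to set ylim in the first call.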

[Figure 2.4: Scatter plot of Exam. marks versus CA marks. x-axis: ca; y-axis: exam.]

Someone says "those CA and exam. marks look quite correlated, I wonder how accurately we could have predicted the exam. results using the CA?". This is regression territory and, given that Figure 2.4 shows a sort of straight line relationship, we'll try linear regression, your old friend y = mx + c, or in this case exam = m × ca + c. It is more usual to use a and b, exam = a + b × ca, where a is the intercept, i.e. where the fitted straight line meets the y-axis at x = 0, and b is the slope.

    > fitres = lm(exam ~ ca)
    > summary(fitres)

    Call:
    lm(formula = exam ~ ca)

    Residuals:
         Min       1Q   Median       3Q      Max
    -10.9697  -3.1181  -0.7405   3.1036  22.8368

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) -10.83639    2.21002  -4.903 6.77e-06 ***
    ca            0.85458    0.03002  28.469  < 2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 5.482 on 64 degrees of freedom
    Multiple R-squared: 0.9268,     Adjusted R-squared: 0.9257
    F-statistic: 810.5 on 1 and 64 DF,  p-value: < 2.2e-16

R prints a lot of information that we'll find out about in Chapter 20; for now all we need to know are a = -10.83639 (intercept) and b = 0.85458 (coefficient multiplying ca), i.e. the fitted line is exam = -10.83639 + 0.85458 × ca. Figure 2.5 shows the results of the straight line fitting.

[Figure 2.5: Straight line fitting, Exam. marks versus CA marks. x-axis: ca; y-axis: exam.]
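To see how the intercept a and slope b can be pulled out of a fitted model and used for prediction, here is a minimal sketch. The data are made up and lie exactly on the line y = 2 + 3x, so the answer is known in advance; the names fit, new_x, pred_by_hand and pred_by_r are illustrative only.

```r
# Made-up data lying exactly on y = 2 + 3x, so the fit is known in advance.
x <- c(1, 2, 3, 4, 5)
y <- 2 + 3 * x

fit <- lm(y ~ x)

# coef() returns the fitted coefficients: intercept first, then slope.
a <- unname(coef(fit)[1])
b <- unname(coef(fit)[2])

# Predict y for a new x, once by hand and once with predict().
new_x <- 10
pred_by_hand <- a + b * new_x
pred_by_r <- unname(predict(fit, newdata = data.frame(x = new_x)))

print(c(a = a, b = b, prediction = pred_by_r))
```

With the marks data, the same calls on lm(exam ~ ca) would give the a and b quoted in the text, and predict() would then estimate an exam mark from a CA mark.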

Finally, we can save all those commands:

    > savehistory("20090508-3.txt")
    # which we could load again at a later time with
    > loadhistory("20090508-3.txt")
    # but in any case, when you use q() to quit, R will offer you the
    # option of saving and these saved commands will be loaded the
    # next time you run R.
    > q()
    Save workspace image? [y/n/c]: y

That's enough for an introduction.

Chapter 3 Averages
3.1 Introduction

This chapter gives a brief introduction to averages or what statisticians call central tendency. These are often, but not always, useful in summarising a set of data, especially when we wish to compare the data set with another. There are some pitfalls in using the common-or-garden average and we will note some of these.

3.2 Arithmetic Mean

The most familiar average value is the arithmetic mean, i.e. sum the values and divide by the number of data. Just to get used to some mathematical notation, see A.2, we'll write this as you'll see it in textbooks (the data are $x_i$, $i = 1, \ldots, n$):

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i. \tag{3.1}$$

R-Example 1. As before, we'll read the data and print them. This time they are already sorted, so much easier to read, even in list form.

> ex2 = read.table("exam2.txt", header = T)
> attach(ex2)
> exam2
 [1] 43 43 43 44 46 48 48 50 51 53 53 53 55 56 56 57 57 58 58 59 59 59 60 60 60
[26] 61 62 62 64 69

We can compute the mean by summing and dividing, see below, but not unexpectedly, R has a function mean that does it for us.

> sum(exam2)
[1] 1647
> length(exam2)
[1] 30
> sum(exam2)/length(exam2)
[1] 54.9
> mean(exam2)
[1] 54.9

In spite of its simplicity, it is possible to compute the arithmetic mean wrongly.

R-Example 2. The following are a set of homework marks, marked out of 10. We read the data in and print them. Then we produce a summarising table, marks versus frequency, which tells us that we have three students with four (4) marks, three with five, six with six, etc.

> df.homew <- read.table("hw.txt", header = T)
> attach(df.homew)
> hw
 [1] 6 8 5 7 6 5 6 4 6 5 8 4 8 8 7 6 6 7 4 7
> table(hw)
hw
4 5 6 7 8    # marks
3 3 6 4 4    # frequencies

If we were not using a computer, we might think that we have a quick way to compute the mean: we have just five marks, namely 4, 5, 6, 7, 8, so we'll take the average of those, 4 + 5 + 6 + 7 + 8 = 30, so mean = 30/5 = 6. But R thinks differently:

> mean(hw)
[1] 6.15

The method we used works only if the frequencies are the same for each mark; it would be a rare fluke if this were the case. But we'll pursue the matter further, because (a) computing an arithmetic mean using a frequency table, done properly, can be a (correct) shortcut if you have a lot of numbers and just a calculator or pencil and paper; (b) using frequencies prepares the ground for topics covered in later chapters.

3.2.1 Arithmetic Mean using Frequencies

We'll rewrite the table, now calling the data (marks) $x$; we'll label them with $i$ so that we have $x_i$, $i = 1, \ldots, n$, and $n = 5$.

> table(hw)
hw
i =   1  2  3  4  5
-------------------
x_i   4  5  6  7  8    # marks
f_i   3  3  6  4  4    # frequencies

If we want to use the frequency table, we have to replace eqn. 3.1 with

$$\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}. \tag{3.2}$$

Applying eqn. 3.2 to our frequency table above gives (3×4 + 3×5 + 6×6 + 4×7 + 4×8)/(3 + 3 + 6 + 4 + 4) = (12 + 15 + 36 + 28 + 32)/20 = 123/20 = 6.15. If we look at the sum-divided-by-number calculation in R, we see that the frequency calculation ends up with not only the same result, but the same division:

> length(hw)
[1] 20
> sum(hw)
[1] 123
> sum(hw)/length(hw)
[1] 6.15

If you look at the sum of $f_i x_i$ you will see that it is the same as 4 + 4 + 4 + 5 + ... + 8 + 8 + 8 + 8; the sorted hw marks are below:

> sort(hw)
 [1] 4 4 4 5 5 5 6 6 6 6 6 6 7 7 7 7 8 8 8 8

And the sum of the frequencies is 20, i.e. the number of data.
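The frequency calculation of eqn. 3.2 can also be checked directly in R. The following is a minimal sketch, assuming the hw marks are typed in by hand rather than read from hw.txt; the variable names (freq, x, f, mean_freq, mean_naive) are illustrative and not part of the notes.

```r
# Homework marks copied from R-Example 2 (typed in, not read from hw.txt).
hw <- c(6, 8, 5, 7, 6, 5, 6, 4, 6, 5, 8, 4, 8, 8, 7, 6, 6, 7, 4, 7)

freq <- table(hw)                  # frequency of each distinct mark
x <- as.numeric(names(freq))       # the distinct marks x_i
f <- as.vector(freq)               # the frequencies f_i

mean_freq  <- sum(f * x) / sum(f)  # eqn. 3.2: frequency-weighted mean
mean_naive <- mean(x)              # the tempting shortcut that ignores f_i

print(mean_freq)
print(mean_naive)
print(weighted.mean(x, f))         # base R function doing the same as eqn. 3.2
```

The naive shortcut and the frequency-weighted mean agree only when all the frequencies are equal, which is the point made in section 3.2.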

3.3 Median

Sometimes neither the mean nor the mode gives us what we would expect from a central value. Look at the following speed data (speed of cars at a speed check). Here the mean, 37.1, is well off the centre, and that offset is caused by an outlier, the 75. The offset would be a lot worse if the outlier was 1000; that is not likely in the case of speeds, but outliers of this magnitude are possible in the case of some measurement systems. A common example is a mineralisation survey taken across an area of land. For the sake of argument, assume that we are looking for zinc. A sample that coincides with the dumping of an old bucket will produce a huge outlier. Now if we want to produce contour plots based on smoothed values (averages over regions), then mean smoothing will show a (false) hot-spot, while median smoothing will not.

> sp = read.table("cars.txt", header = T)
> attach(sp)
> speed
[1] 25 31 33 31 30 35 75
> mean(speed)
[1] 37.14286

The median gives a truer central value. If we sort the speeds, we see that the central value (the fourth) is 31; median gives the same result.

> sort(speed)
[1] 25 30 31 31 33 35 75
> median(speed)
[1] 31
> sort(speed)[4]
[1] 31

In the example above there are seven values, so the central one is the fourth; if we had an even number of values, we would take the average of the two central values. It can be said that the median is a measure of central tendency that is robust against outliers.
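To see the robustness claim in numbers, here is a small sketch that replaces the outlier 75 by 1000, as suggested in the text, and compares mean and median. The names speed_bad and even_set are made up for this illustration.

```r
# Speeds from the example above, typed in by hand.
speed <- c(25, 31, 33, 31, 30, 35, 75)

# Replace the outlier 75 by a wildly wrong reading of 1000.
speed_bad <- replace(speed, speed == 75, 1000)

# The mean is dragged far from the centre; the median does not move.
print(c(mean = mean(speed), median = median(speed)))
print(c(mean = mean(speed_bad), median = median(speed_bad)))

# With an even number of values, median() averages the two central values.
even_set <- c(25, 30, 31, 33)
print(median(even_set))
```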

3.4 Mode

Sometimes the mean does not give us what we would expect from a central value; for example, in the homework example, the mean (6.15) gives us a value that appears nowhere in the original data. That's normally not a big deal, but it suggests the mode as a possible average value. The mode is the most frequent value, i.e. obtained from a frequency table or from a histogram, Figure 3.1. Here the mode is 6, which occurs six times.

> table(hw)
hw
x_i   4  5  6  7  8    # marks
f_i   3  3  6  4  4    # frequencies
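Base R has no function that returns the statistical mode (the built-in mode() reports the storage type of an object instead), so the mode has to be read off the frequency table. A minimal sketch, with the hw marks typed in by hand and the names f and modes chosen for illustration:

```r
# Homework marks from R-Example 2.
hw <- c(6, 8, 5, 7, 6, 5, 6, 4, 6, 5, 8, 4, 8, 8, 7, 6, 6, 7, 4, 7)

freq <- table(hw)
f <- as.vector(freq)                       # frequencies as a plain vector

# Keep every value whose frequency equals the maximum; with ties
# (multimodal data) this returns more than one value.
modes <- as.numeric(names(freq)[f == max(f)])
print(modes)
```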

Figure 3.1: Histogram of hw.

Multimodal Data. Now that we've mentioned the mode, we'd better take the opportunity of warning about multimodal data. File hw2.txt contains data which has two peaks in its histogram, Figure 3.2.

> df.homew2 <- read.table("hw2.txt", header = T)
> attach(df.homew2)
> sort(hw2)
 [1] 3 4 4 4 4 5 7 8 8 8 8 8 9
> hist(hw2)
> mean(hw2)
[1] 6.153846

Figure 3.2: Histogram of hw2, multimodal.

We can calculate the mean, but does it convey much about the centre of the data? No, and using the mean as such may be quite misleading. For example, an average of 6.15 may indicate that the homework was, on average, completed satisfactorily; in fact, we had two sets of results, one good, one poor, and the average of 6.15 adequately represents neither. Multimodality is pretty obvious in that small and one-dimensional data set. In much larger data sets, and especially in multidimensional data, multimodality may be difficult to detect. Much later, in Chapter 19, we'll look at methods for separating multimodal data into different classes or clusters.

3.5 Other Means

Read up in (Crawley 2005) on: geometric mean and harmonic mean.

Chapter 4 Measures of Data Variability


4.1 Introduction

This chapter introduces methods of describing data variability, most notably variance and standard deviation.

4.2 Variance and Standard Deviation

We are now going to work through an example based on two examination results, exam3 and exam4, see below.

> df.exam3 = read.table("exam3.txt", header = T)
> attach(df.exam3)
> df.exam4 = read.table("exam4.txt", header = T)
> attach(df.exam4)
> exam3
 [1] 68 70 71 72 72 73 73 73 74 75 75 75 75 75 76 76 76 76 76 77 77 78 78 80 82
> exam4
 [1] 43 43 43 44 46 48 48 50 51 53 53 53 55 56 56 57 57 58 58 59 59 59 60 60 60
[26] 61 62 62 64 69 73

We are going to assume that these examinations are from two optional modules that final year BSc Honours students can take, that is, students take one or other of these modules and not both. Final Honours classifications depend on these results; but we can see already that the students who took exam3 are at an advantage: except for one, they all achieved first class honours in that examination. If we assume that the exam3 students are equally capable as the exam4 students, then can we correct the imbalance? Before you start to be incredulous, this technique was practised at a well-known university where I worked. First of all let us look at the histograms, Figure 4.1, and the box-plots, Figure 4.2.

> hist(exam3)
> hist(exam4)

Figure 4.1: Histograms of exam3 and exam4.

> boxplot(exam3)
> boxplot(exam4)

Figure 4.2: Boxplots of exam3 and exam4.

The means confirm the difference.

> mean(exam3)
[1] 74.92
> mean(exam4)
[1] 55.48387
> diff <- mean(exam3) - mean(exam4)
> diff
[1] 19.43613

4.2.1 Equalising the means

Can we shift one of the means so that the two data sets have the same mean?

> diff
[1] 19.43613
> exam4new <- round(exam4 + diff)
> exam4new
 [1] 62 62 62 63 65 67 67 69 70 72 72 72 74 75 75 76 76 77 77 78 78 78 79 79 79
[26] 80 81 81 83 88 92
> fpdfsmall()    # not a base R function; presumably a local helper that opens a graphics device
> hist(exam4new)

Figure 4.3: Histograms of exam3 and exam4 shifted by 19.

4.2.2 Variability and spread

That is a bit better, but there remains a greater spread in exam4new (mean shifted). Can we quantify spread? range gives us the minimum and maximum, but we would like one number.

> range(exam3)
[1] 68 82
> range(exam4new)
[1] 62 92

From our experience with the mean, maybe we can take the mean (expected value) of deviations from the means,

> mean(exam3 - mean(exam3))
[1] -1.705372e-15    # effectively zero
> mean(exam4new - mean(exam4new))
[1] -4.586385e-16

Not much good; from the definition of the mean we should have known in advance that these means (or sums) of deviations would be zero: the negative deviations cancel the positive. Squaring the deviations before averaging makes every term non-negative, so nothing cancels:

> mean((exam4new - mean(exam4new))^2)
[1] 53.6691
> mean((exam3 - mean(exam3))^2)
[1] 9.0336

We can achieve the same using sum and length,

> sum((exam3 - mean(exam3))^2)/length(exam3)
[1] 9.0336

4.2.3 Variance and Standard Deviation

The variance, which is the expected value of the squared deviations from the mean, is the built-in function to use (var in R), see eqn. 4.1,

$$Var(X) = E[(X - \mu)^2] = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2. \tag{4.1}$$

> var(exam3)
[1] 9.41
> var(exam4new)
[1] 55.45806

Immediately, we see that it is not an illusion that the variability of exam4new is much greater than that of exam3. Note that the variance as calculated by var is slightly different from that calculated using mean; we'll return to that below. The variance values, since they are sums of squares, give us a measure of squared variability; that can be hard to interpret and use. What we want is the square root of the variance, or the standard deviation (sd in R), see eqn. 4.2,

$$\sigma_X = SD(X) = \sqrt{Var[X]}. \tag{4.2}$$

> sqrt(var(exam4new))
[1] 7.447017
> sqrt(var(exam3))
[1] 3.067572
> sd(exam4new)
[1] 7.447017
> sd(exam3)
[1] 3.067572

Variance different from mean of squared deviations? We return to the problem of the variance being different from the mean of squared deviations. The clue is given below,

> sum((exam3 - mean(exam3))^2)/length(exam3)
[1] 9.0336
> sum((exam3 - mean(exam3))^2)/(length(exam3) - 1)
[1] 9.41

In fact, rather than eqn. 4.1, this particular implementation of var computes what is called the sample variance using eqn. 4.3,

$$Var(X) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \mu)^2. \tag{4.3}$$

This gives an unbiased estimate of the variance.
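The relationship between eqns. 4.1 and 4.3 can be made explicit in a few lines. This is a sketch with the exam3 marks typed in by hand; var_pop and var_sample are illustrative names, not functions or variables from the notes.

```r
# exam3 marks from section 4.2.
exam3 <- c(68, 70, 71, 72, 72, 73, 73, 73, 74, 75, 75, 75, 75, 75,
           76, 76, 76, 76, 76, 77, 77, 78, 78, 80, 82)

n    <- length(exam3)
dev2 <- (exam3 - mean(exam3))^2      # squared deviations from the mean

var_pop    <- sum(dev2) / n          # eqn. 4.1: divide by n
var_sample <- sum(dev2) / (n - 1)    # eqn. 4.3: divide by n - 1

print(c(var_pop, var_sample, var(exam3)))

# The two differ only by the factor (n - 1)/n.
print(var(exam3) * (n - 1) / n)
```

For large n the factor (n - 1)/n is close to one, so the distinction matters mostly for small data sets such as these.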

4.3 Standard Scores and Normalising Marks

We now return to our desire to manipulate (fairly) the two data sets, exam3 and exam4, such that students in each class have roughly the same opportunity; see section 4.2.1, where we equalised the means, but where we noted that the difference in variability remained a problem.

4.3.1 Standard Scores

The normal way to equalise data sets like these (the proper term is either standardise or normalise) is to use the standard score as in

$$X_{ss} = \frac{X - \mu}{\sigma}. \tag{4.4}$$

Eqn. 4.4 gives a set of scores with mean zero and standard deviation one, $\mu_{ss} = 0$, $\sigma_{ss} = 1$. Thus, if we apply eqn. 4.4 to the two sets of marks, using the mean and standard deviation of each, we get two sets of marks with the same mean (0) and the same spread (standard deviation 1). That is fine for purely comparison purposes, but what if we need marks to publish? What we are going to do is: (i) use eqn. 4.4 to standardise the scores; then (ii) multiply by whatever (new) standard deviation, call it $\sigma_{new}$, that we require; (iii) finally, add the (new) mean, $\mu_{new}$, that we require. The whole operation is given in eqn. 4.5,

$$X_{new} = \frac{X_{old} - \mu_{old}}{\sigma_{old}}\,\sigma_{new} + \mu_{new}. \tag{4.5}$$

We'll now apply this to exam4, i.e. we want to make exam4 as close as possible to exam3 (in terms of mean and standard deviation).

> sd3 <- sd(exam3)
> sd3
[1] 3.067572
> m3 <- mean(exam3)
> sd4 <- sd(exam4)
> sd4
[1] 7.447017
> m4 <- mean(exam4)
> m4
[1] 55.48387
> m3
[1] 74.92
> exam4new = round(((exam4 - m4)/sd4)*sd3 + m3)
> exam4new
 [1] 70 70 70 70 71 72 72 73 73 74 74 74 75 75 75 76 76 76 76 76 76 76 77 77 77
[26] 77 78 78 78 80 82
> mean(exam3)
[1] 74.92
> mean(exam4new)
[1] 74.96774    # difference due to rounding
> sd(exam3)
[1] 3.067572
> sd(exam4new)
[1] 2.99426     # difference due to rounding

And let us compare the histograms in Figure 4.4.

Figure 4.4: Histograms of exam3 and exam4new (exam4 equalised with exam3).
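Eqns. 4.4 and 4.5 can be wrapped in a small reusable function. The sketch below is one possible way of doing so; the function name rescale_marks and its arguments are hypothetical (they do not appear in the notes), and the marks are typed in by hand.

```r
# Rescale marks x to a required mean and standard deviation.
rescale_marks <- function(x, new_mean, new_sd) {
  z <- (x - mean(x)) / sd(x)   # eqn. 4.4: standard scores (mean 0, sd 1)
  z * new_sd + new_mean        # eqn. 4.5: stretch, then shift
}

exam3 <- c(68, 70, 71, 72, 72, 73, 73, 73, 74, 75, 75, 75, 75, 75,
           76, 76, 76, 76, 76, 77, 77, 78, 78, 80, 82)
exam4 <- c(43, 43, 43, 44, 46, 48, 48, 50, 51, 53, 53, 53, 55, 56, 56,
           57, 57, 58, 58, 59, 59, 59, 60, 60, 60, 61, 62, 62, 64, 69, 73)

exam4_scaled <- rescale_marks(exam4, mean(exam3), sd(exam3))

# Before rounding, mean and sd match those of exam3.
print(c(mean(exam4_scaled), sd(exam4_scaled)))

# Rounding to whole marks introduces the small differences noted above.
print(round(exam4_scaled))
```

Keeping the unrounded values until the final step avoids accumulating rounding error if further adjustments are applied.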


Chapter 5 Probability and Random Variables


5.1 Introduction

This chapter gathers together some basic definitions, symbols and terminology to do with probability, random variables, and random processes; the topics are chosen according to their applicability to basic statistics for bio-scientists, as well as pattern recognition, image processing and data compression. We will use some of the notation from Appendix A; you should have a quick look at that first. We emphasise that such notation is merely shorthand for common-sense concepts which would otherwise be confusing and long-winded if written in English.

5.2 Basic Probability and Random Variables

5.2.1 Introduction

Let there be a set of outcomes to an experiment, $\{\omega_1, \omega_2, \ldots, \omega_n\} = \Omega$, where, to each $\omega_i$, we associate a probability $p_i$. The definition of probability includes the following constraints:

$$0 \le p_i \le 1, \tag{5.1}$$

$$\sum_{i=1}^{n} p_i = 1. \tag{5.2}$$

The above simple definition of probability over outcomes is satisfactory for simple applications, but for many applications we need to extend it to apply to subsets of $\Omega$. We could call the outcomes above elementary events, i.e. indivisible events, and we could call the subsets below composite, i.e. they are a composition of one or more outcomes. $\Omega$ is often called the sample space, i.e. as defined above, the set of all possible outcomes of the experiment. Elements of $\Omega$ are called outcomes, sample outcomes, or realisations. One of the problems of learning probability and statistics is the confusion caused by the multiplicity of terms for the same concept. In addition, different fields of study, e.g. bio-science, engineering, social science, ..., have their own terminology.

Example 1. Six sided dice. $\Omega = \{i \mid i \in \{1, \ldots, 6\}\} = \{1, 2, \ldots, 6\}$.

Example 2. Toss two six sided dice. $\Omega = \{(i, j) \mid i, j \in \{1, \ldots, 6\}\} = \{(1, 1), (1, 2), \ldots, (1, 6), (2, 1), \ldots, (6, 6)\}$.

Example 3. Two sided coin. $\Omega = \{H, T\}$. Outcomes need not be numbers.

5.2.2 Probability and Events

Let there be subsets of $\Omega$ called events, with a general event $a_i$; the set of all $a_i$s is $\mathcal{A}$. We define a probability measure $P$ on $\mathcal{A}$; $P(a)$ is a number and satisfies the following axioms:

$$P(a) \ge 0, \tag{5.3}$$

$$P(\Omega) = 1 \quad \text{(certain event, something happens).} \tag{5.4}$$

If $a_1, a_2, \ldots$ are disjoint, i.e. $a_i \cap a_j = \emptyset$, $\forall i, j$, $i \ne j$, then

$$P\left(\bigcup_{i=1}^{\infty} a_i\right) = \sum_{i=1}^{\infty} P(a_i). \tag{5.5}$$

Disjoint (subsets) is another term for mutually exclusive, i.e. they cannot possibly happen together. $\cap$ denotes set intersection, i.e. in eqn. 5.5 we are requiring that there is no overlap between any of the subsets, and $\cup$ denotes union. Put simply, eqn. 5.5 says that probabilities add for events that do not overlap. $\emptyset$ denotes the empty set. There is a fourth axiom, strictly a corollary of eqns. 5.4 and 5.5,

$$P(\emptyset) = 0 \quad \text{(impossible event).} \tag{5.6}$$

Example 4. Six sided dice. $\Omega = \{1, 2, \ldots, 6\}$. Let $a$ be the event "score greater than three"; i.e. $a = \{4, 5, 6\}$.

Example 5. Toss two six sided dice. $\Omega = \{(i, j) \mid i, j \in \{1, \ldots, 6\}\}$. Let $a$ be the event "score less than four". Then $a = \{(1, 1), (1, 2), (2, 1)\}$.

Partition. When $a_1 \cup a_2 \cup \ldots \cup a_n = \Omega$ and $a_1, a_2, \ldots, a_n$ are disjoint, we say that $\{a_1, a_2, \ldots, a_n\}$ form a partition of $\Omega$.


5.2.3 A Point on Terminology

Above we have $P(a_i)$ for the probability that the outcome is in set $a_i$. "The outcome is in set $a_i$" is what is called a proposition. A proposition is a sentence which may be true or false, but only one or the other and not in between. We should note that in most textbooks, and later in these notes, the arguments of probability functions $P(\cdot)$ will be propositions, e.g. $P(A)$ means the probability that A will occur, or that A will be true. Then, when we write $P(AB)$ or $P(A, B)$ (they mean the same), we mean the probability of A and B being both true: logical and.

Not or set complement. We may want to talk about the probability that A will be false, i.e. the probability that the outcome will be in the complement set to A, i.e. any of the outcomes in $\Omega$ but not in A's set. Not A is denoted $\bar{A}$. We now can write a further axiom,

$$P(\bar{A}) = 1 - P(A). \tag{5.7}$$

Example 6. Six sided dice. $\Omega = \{1, 2, \ldots, 6\}$. Let $A = \{1, 2, 3, 4\}$, so $\bar{A} = \{5, 6\}$. $P(\bar{A}) = 1 - P(A) = 1 - \frac{4}{6} = \frac{2}{6} = \frac{1}{3}$.

5.2.4 Probability of Non-disjoint Events

We saw in eqn. 5.5 that to compute the probability of the union of two disjoint events you can add probabilities. For events A and B that are not necessarily disjoint (there may be overlap), we can write

$$P(A \cup B) = P(A) + P(B) - P(AB). \tag{5.8}$$

Example 7. Six sided dice. $\Omega = \{1, 2, \ldots, 6\}$. Let $A = \{1, 2, 3, 4\}$ and $B = \{4, 5\}$; so $A \cup B = \{1, 2, 3, 4, 5\}$ and $A \cap B = \{4\}$. $P(A \cup B) = P(A) + P(B) - P(A \cap B) = \frac{4}{6} + \frac{2}{6} - \frac{1}{6} = \frac{5}{6}$, and we can see that, computed directly, $P(A \cup B) = P(\{1, 2, 3, 4, 5\}) = \frac{5}{6}$.

We note that eqn. 5.8 collapses to eqn. 5.5 when AB is false (no overlap, the two cannot be true together), because of eqn. 5.6, i.e. $P(\emptyset) = 0$, and $P(A \cup B) = P(A) + P(B) - P(\emptyset) = P(A) + P(B) - 0 = P(A) + P(B)$.
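Example 7 and the complement rule of eqn. 5.7 can be checked with R's set functions. This is a sketch under the assumption of equally likely outcomes; the helper prob is a hypothetical name introduced for this illustration.

```r
omega <- 1:6                 # sample space of a six sided dice
A <- c(1, 2, 3, 4)
B <- c(4, 5)

# Probability of an event when all outcomes are equally likely.
prob <- function(event) length(event) / length(omega)

lhs <- prob(union(A, B))                          # P(A or B), computed directly
rhs <- prob(A) + prob(B) - prob(intersect(A, B))  # eqn. 5.8
print(c(lhs, rhs))

not_A <- setdiff(omega, A)                        # complement of A
print(c(prob(not_A), 1 - prob(A)))                # eqn. 5.7
```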


5.2.5 Finite Sample Spaces

In Example 1 we could identify and list all possible outcomes, and we have a finite sample space. On the other hand, if the outcome was a weight, for example of a precipitate, then we could not list all possible weights and we would have an infinite sample space.

5.3 Random Variables

If, to every outcome, $\omega$, of an experiment, we assign a number, $X(\omega)$, X is called a random variable (r.v.). X is a function over the set $\Omega = \{\omega_1, \omega_2, \ldots\}$ of outcomes; if the range of X is the real numbers or some subset of them, X is a continuous r.v.; if the range of X is some integer set, then X is a discrete r.v. Chapter 6 contains an extensive discussion on random variables and an introduction to probability distributions.

5.4 Computing probabilities

We have already done this in examples, but we need to formalise a bit. The number of elements in a (finite) set, say $a$, is called its cardinality and written $|a|$.

Example 8. Six sided dice. $\Omega = \{1, 2, \ldots, 6\}$, $|\Omega| = 6$. Let $a = \{4, 5, 6\}$, $|a| = 3$.

If the outcomes are equally likely (which $\{1, 2, \ldots, 6\}$ are), then we can compute the probability of an event $a$ as the ratio:

$$P(a) = \frac{|a|}{|\Omega|}. \tag{5.9}$$

Example 9. Six sided dice. $\Omega = \{1, 2, \ldots, 6\}$, $|\Omega| = 6$. Let $a = \{4, 5, 6\}$, $|a| = 3$, so $P(a) = \frac{|a|}{|\Omega|} = \frac{3}{6} = \frac{1}{2}$.

5.5 Enumerating more complex events and sample spaces

We see above $P(a) = \frac{|a|}{|\Omega|}$. But $|a|$ or $|\Omega|$ may not be simple to enumerate or count.


5.5.1 Multiplication of outcomes

Let an event correspond to the combined outcomes of two experiments performed in sequence. Let the first have $n_1$ outcomes and the second $n_2$ outcomes. Any of the $n_1$ outcomes of the first may be followed by any of the $n_2$ outcomes of the second, so the number of outcomes in the combined experiment is $n_1 n_2$.

Example 10. Toss two six sided dice in sequence (but the result is the same if we throw them together). $n_1 = |\Omega_1| = 6$, $n_2 = |\Omega_2| = 6$, so, for the combined experiment, $|\Omega| = n_1 n_2 = 36$, which we can also compute by counting the elements in $\Omega = \{(i, j) \mid i, j \in \{1, \ldots, 6\}\}$.

5.5.2 Addition of outcomes

Suppose again that we have two experiments. Let the first have $n_1$ outcomes and the second $n_2$ outcomes. This time we perform the first experiment or the second, but not both, and which of them gets performed is chosen randomly; how many outcomes? We have the $n_1$ outcomes of the first, or the $n_2$ outcomes of the second, so the total number of outcomes in the combined experiment is $n_1 + n_2$.

Example 11. Toss one six sided dice or toss a two sided coin. $n_1 = |\Omega_1| = 6$, $n_2 = |\Omega_2| = 2$, so, for the combined experiment, $|\Omega| = n_1 + n_2 = 8$, which we can also compute by counting the elements in $\Omega = \{1, 2, 3, 4, 5, 6, H, T\}$.

5.5.3 Permutations

Suppose we have $n$ items and we wish to place them in a sequence: just any sequence, not ordered according to size or any other attribute. How many ways to do this? The first position may be filled by any of the $n$ items; the second position may be filled by any of the remaining $n - 1$ items, and so on, so that the number of possible different sequences (orderings) is

$$n(n - 1)(n - 2) \cdots 1 = n! \quad \text{($n$-factorial).} \tag{5.10}$$

Suppose now we have $n$ items and we wish to choose any $r$ of them and place these in a sequence. How many ways to do this? The first position may be filled by any of the $n$ items; the second position may be filled by any of the remaining $n - 1$ items, and so on until we have $r$ in the sequence. The number of possible different sequences (orderings) is

$$n(n - 1)(n - 2) \cdots (n - (r - 1)) = n(n - 1)(n - 2) \cdots (n - r + 1) = \frac{n!}{(n - r)!} = {}^{n}P_{r}. \tag{5.11}$$

${}^{n}P_{r}$ is the name for the number of permutations of $r$ from $n$.

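5.5.4 Combinations

Suppose again we have $n$ items and we wish to choose any $r$ of them, but we do not need to place the $r$ in a sequence. How many ways, ${}^{n}C_{r}$, to do this? We can appeal to eqns. 5.11 and 5.10: each unordered choice of $r$ items can be arranged in $r!$ different sequences, so

$$\frac{n!}{(n - r)!} = {}^{n}C_{r} \times (\text{number of ways of permuting } r) = r!\,{}^{n}C_{r},$$

which leads to

$${}^{n}C_{r} = \frac{n!}{r!\,(n - r)!} = \binom{n}{r}. \tag{5.12}$$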
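The counting rules of sections 5.5.1 to 5.5.4 map onto a few base R functions. The sketch below uses factorial, choose and expand.grid, which are standard; n_perm and n_comb are hypothetical helper names written for this illustration.

```r
# Permutations of r from n, eqn. 5.11.
n_perm <- function(n, r) factorial(n) / factorial(n - r)

# Combinations of r from n, eqn. 5.12, derived from the permutation count.
n_comb <- function(n, r) n_perm(n, r) / factorial(r)

print(factorial(4))    # orderings of 4 items, eqn. 5.10
print(n_perm(5, 2))    # ordered choices of 2 from 5
print(n_comb(5, 2))    # unordered choices of 2 from 5
print(choose(5, 2))    # base R function for the same quantity

# Multiplication of outcomes (Example 10): enumerate two dice and count.
two_dice <- expand.grid(die1 = 1:6, die2 = 1:6)
print(nrow(two_dice))
```

For large n, choose(n, r) is preferable to the factorial ratio, since factorial(n) overflows to Inf well before the binomial coefficient itself becomes unrepresentable.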

5.6 Conditional Probability

Example 12. $\Omega = \{1, 2, 3, 4, 5, 6\}$. I throw the dice. What is the probability of getting greater-than-three, $P(> 3)$? Let A be greater-than-three so that $A = \{4, 5, 6\}$, and the cardinality of this set is $n_A = |A| = 3$, and $n_{dice} = |\Omega| = 6$, see section 5.4; there are three possibilities greater-than-three, so $P(A) = P(> 3) = n_A / n_{dice} = 3/6 = 1/2$.

Now, I have a peek and I tell you that we have an odd number; let us call this event B (odd). What now is the probability of A (> 3)? The probability surely has changed because the only possibilities now are $B = \text{odd} = \{1, 3, 5\}$. Within this set, 5 is the only (one) possibility that satisfies greater-than-three, so, forgetting about any ideas we had before, we say that the conditional probability of greater-than-three, given that we already know that an odd number has occurred, is 1/3, i.e. the probability has dropped from 1/2 to 1/3 based on the information that an odd has occurred. We write this $P(> 3 \mid \text{odd})$, the probability of a > 3 conditional on the fact that we already know that an odd number has occurred. This is conditional probability; we computed the probability of A conditional on B, $P(A|B)$.

5.6.1 Venn diagrams

Venn diagrams, see section A.1.4, can be used to think about conditional probabilities such as the one in Example 12. Here $\Omega = \{1, 2, 3, 4, 5, 6\}$ corresponds to the universal set (the set of all possibilities). Once we have been told that the number is odd, we can reduce our sample space to set odd; then $\text{odd} \cap (> 3) = \{5\}$.

Example 13. If, after hearing first that we have an odd number, we are then told that greater-than-three has occurred, we are then asked (a) what is the probability of a six?, (b) what is the probability of a five? Think about it: once we have the two pieces of information, odd, then greater-than-three, the possibilities are very greatly reduced. To what?

Figure 5.1: Dice: (a) universal set; (b) sets odd, even; (c) sets (> 3) and (<= 3) superimposed to show that, for example, odd & (> 3) = (set-odd) $\cap$ (set > 3) = $\{1, 3, 5\} \cap \{4, 5, 6\} = \{5\}$.
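The reduced-sample-space reasoning of Example 12 can be written out in R. This is a minimal sketch assuming equally likely outcomes; the variable names are chosen for illustration only.

```r
omega <- 1:6
A <- omega[omega > 3]          # greater-than-three: 4, 5, 6
B <- omega[omega %% 2 == 1]    # odd: 1, 3, 5

p_A <- length(A) / length(omega)   # unconditional P(A)

# Conditional probability by shrinking the sample space to B:
# count the outcomes of B that are also in A.
p_A_given_B <- length(intersect(A, B)) / length(B)

print(p_A)
print(p_A_given_B)

# The same value as a ratio of probabilities over the full sample space,
# P(AB)/P(B), which is the formula given as eqn. 5.13 later in this chapter.
p_AB <- length(intersect(A, B)) / length(omega)
p_B  <- length(B) / length(omega)
print(p_AB / p_B)
```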

5.6.2 Probability Trees

Probability trees, see (Griffiths 2009, p. 158), are another way to think graphically about conditional probability. In mathematics, trees can grow sideways or even upside down. Figure 5.2 shows a probability tree for Example 12. When we split into branches as in Figure 5.2, any branching must represent all possibilities; in this case we first have odd and even; if we call odd B, we have even = not-odd = $\bar{B}$. In the diagram we have no bar symbol, so we write not-B for $\bar{B}$. Next we have (> 3) and (<= 3). Thus, at any branching the probabilities in the branches must sum to one. The diagram shows how to compute joint probabilities using conditional probabilities and the probability of the conditioning event, for example $P(> 3 \;\&\; \text{odd}) = P(> 3 \mid \text{odd})\,P(\text{odd})$.

Figure 5.3 shows a general probability tree. The following may help us to think about conditional probability and joint probability. Think of the tree as having probability flowing in its branches. We start off at the root with all the probability (one, 1); proportions of the probability flow into the first set of branches (the proportions sum to one). Follow one of those branches: at the next branching point, we split the remaining probability into proportions that again sum to one. It is just the proportions that sum to one; if there is, for example, 0.4 flowing into the branching point, and the proportions are 0.4, 0.4, 0.2 (a three-way branch), then we will have probability flows of 0.16, 0.16, 0.08. And so on.


Figure 5.2: Probability tree for the dice example. We start off on the left with the root and everything possible. Then we split into branches odd and even, P(odd) = P(even) = 1/2. Next we split odd into (> 3) and (<= 3), with P(>3|odd) = 1/3 and P(<=3|odd) = 2/3; same for the even branch, with P(>3|even) = 2/3 and P(<=3|even) = 1/3. Joint probabilities are products along a path, e.g. P(>3 & even) = P(>3|even) x P(even) = 2/3 x 1/2 = 1/3 [P(4 or 6)] and P(<=3 & even) = P(<=3|even) x P(even) = 1/3 x 1/2 = 1/6 [P(2)].

Figure 5.3: Probability tree. The root splits into B (B has occurred), with probability P(B), and not-B (B has not occurred), with probability $P(\bar{B})$. Each of these splits into A and not-A, with conditional probabilities such as P(A|B) and $P(\bar{A}|B)$; the leaves carry the joint probabilities, e.g. P(A & B) = P(AB) = P(A|B) x P(B).


Symbolically, and referring to Figure 5.3: suppose we have proportion $P(B)$ in a branch and that branch then splits into proportions $P(A|B)$ and $P(\bar{A}|B)$. These (relative) proportions again sum to one, but their total probability sums to whatever flowed into the branching point. Then the $P(A|B)$ branch must carry an absolute amount of probability equal to $P(A|B)\,P(B)$, and this is $P(AB)$.

Formula for Conditional Probability. We now give the formula for computing conditional probabilities,

$$P(A|B) = \frac{P(AB)}{P(B)}, \tag{5.13}$$

provided that $P(B) > 0$. Alternatively, as in Figure 5.3,

$$P(AB) = P(A|B)\,P(B). \tag{5.14}$$

5.6.3 Joint Probability

$P(AB)$ is the joint probability of A and B happening together. Sometimes we write $P(AB)$, sometimes $P(A \& B)$, sometimes $P(A \text{ and } B)$, and sometimes, using set notation, $P(A \cap B)$.

5.7 Bayes' Rule

If we reverse the conditionality in eqn. 5.13, and noting that $P(AB) = P(BA)$, we have

$$P(B|A) = \frac{P(AB)}{P(A)}, \tag{5.15}$$

leading to

$$P(A)\,P(B|A) = P(AB), \tag{5.16}$$

and eqn. 5.13 gives us

$$P(B)\,P(A|B) = P(AB), \tag{5.17}$$

so that

$$P(A)\,P(B|A) = P(B)\,P(A|B), \tag{5.18}$$

leading to Bayes' rule:

$$P(A|B) = \frac{P(A)\,P(B|A)}{P(B)}. \tag{5.19}$$

Eqn. 5.19 allows us to invert or reverse the conditionality.

Example 14. Let A be "has disease-X"; let B be "has swollen ankles". From a sample of former disease-X patients, we can estimate $P(B|A)$; say it is $P(B|A) = 0.3$. Let us assume that we also know the proportion of the general population that have swollen ankles, $P(B) = 0.01$. Also we assume that we have the incidence of disease-X in the general population, $P(A) = 0.005$. Eqn. 5.19 allows us to compute the probability that the patient has disease-X given that the swollen ankles symptom (B) is present, $P(A|B)$. Of course, in general, $P(A|B) \ne P(B|A)$.

$$P(A|B) = \frac{P(A)\,P(B|A)}{P(B)} = \frac{0.005 \times 0.3}{0.01} = 0.15. \tag{5.20}$$

Bayes' rule may be written in a more general manner. First we need a result called the law of total probabilities. Let $A_1, A_2, \ldots, A_n$ be a partition of $\Omega$ (see section 5.2.2 for a definition of partition), then

$$P(B) = \sum_{i=1}^{n} P(B|A_i)\,P(A_i). \tag{5.21}$$

We write the more general form of Bayes' rule as

$$P(A_i|B) = \frac{P(B|A_i)\,P(A_i)}{\sum_{i=1}^{n} P(B|A_i)\,P(A_i)}. \tag{5.22}$$

Let us return to Example 14 and apply eqn. 5.22. When we said "proportion of the general population that have swollen ankles, $P(B) = 0.01$", we strictly meant the probability for people with disease-X together with those without disease-X. We can restate the problem with $A_1$ = "has disease-X" and $A_2$ = "has not disease-X", so that they form a partition of the general population. Assume that we now have $P(B|A_2) = 0.01$ (i.e. we are changing the story slightly to associate this probability with people who do not have disease-X) and, as before, $P(B|A_1) = 0.3$; we need also $P(A_1) = 0.005$, as before. What is $P(A_2)$? It is $P(\bar{A}_1)$ (the probability that a general person does not have disease-X), and this is $1 - P(A_1) = 0.995$. Eqn. 5.21 now gives a revised figure for $P(B)$,

$$P(B) = \sum_{i=1}^{n} P(B|A_i)\,P(A_i) = P(B|A_1)\,P(A_1) + P(B|A_2)\,P(A_2) = 0.3 \times 0.005 + 0.01 \times 0.995 = 0.01145,$$

and we can rework Example 14 (or use eqn. 5.22),

$$P(A_1|B) = \frac{P(A_1)\,P(B|A_1)}{P(B)} = \frac{0.005 \times 0.3}{0.01145} = 0.131.$$
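The reworked disease-X calculation can be expressed as a short R function implementing eqns. 5.21 and 5.22 over a partition. This is a sketch only; bayes_posterior and the vector names are hypothetical, and the numbers are those assumed in the example above.

```r
# Posterior P(A_i | B) for every member A_i of a partition.
# prior:      vector of P(A_i), summing to one
# likelihood: vector of P(B | A_i), in the same order
bayes_posterior <- function(prior, likelihood) {
  joint <- prior * likelihood   # P(B | A_i) P(A_i) for each i
  joint / sum(joint)            # eqn. 5.22; sum(joint) is P(B), eqn. 5.21
}

prior      <- c(disease = 0.005, no_disease = 0.995)
likelihood <- c(disease = 0.3,   no_disease = 0.01)

p_B  <- sum(prior * likelihood)   # law of total probabilities
post <- bayes_posterior(prior, likelihood)

print(p_B)
print(round(post, 3))
```

Working with whole vectors means the same function handles a partition with any number of members, not just the two-way split used here.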

5.8 Independent Events

We have already discussed disjoint events, i.e. events which cannot occur simultaneously; thus, for disjoint events A, B, $A \cap B = \emptyset$. Consequently, we can state that $P(A|B) = 0$ (if B has occurred, A cannot). At the opposite extreme, let $A \subseteq B$, i.e. A is a subset of B, and if A has occurred, then so must B, with certainty, so in this case $P(B|A) = 1$.

Example 15. $\Omega = \{1, 2, 3, 4, 5, 6\}$. Let $B = \{2, 4, 6\}$ (even number) and $A = \{6\}$. If we know that a 6 has been thrown (A has occurred), what is $P(B|A)$? The answer is 1: we know that 6 is even, so B is a "sure thing" in punter parlance :-).

But there are cases where A and B are totally unrelated: they are independent events.

Example 16. Throw a dice ($\Omega_1$) and toss a coin ($\Omega_2$). $\Omega_1 = \{1, 2, 3, 4, 5, 6\}$, $\Omega_2 = \{H, T\}$ and the combined sample space is $\Omega = \{(1, H), (1, T), (2, H), \ldots, (6, H), (6, T)\}$, with $|\Omega| = 12$. Let $A = \{4, 6\}$ and $B = \{H\}$, so that $AB = \{(4, H), (6, H)\}$ (two out of 12 equally likely outcomes), so $P(AB) = 1/6$. Also $P(A) = 1/3$, $P(B) = 1/2$. From eqn. 5.15 we have

$$P(B|A) = \frac{P(AB)}{P(A)} = \frac{1}{6} \Big/ \frac{1}{3} = \frac{1}{2}.$$

Because the result of the dice throw is unrelated to the result of the coin toss, we are not surprised to find that

$$P(B|A) = P(B) = \frac{1}{2}.$$

This leads us to a more general definition of independent events,

$$P(B|A) = P(B) = \frac{P(AB)}{P(A)},$$

so that A and B are independent events if and only if

$$P(AB) = P(A)\,P(B). \tag{5.23}$$
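Example 16 can be verified by enumerating the combined sample space. The sketch below builds the 12 equally likely (dice, coin) pairs with expand.grid; the column names die and coin and the logical vectors in_A and in_B are illustrative choices.

```r
# All 12 equally likely (dice, coin) outcomes.
omega <- expand.grid(die = 1:6, coin = c("H", "T"), stringsAsFactors = FALSE)

in_A <- omega$die %in% c(4, 6)   # event A: dice shows 4 or 6
in_B <- omega$coin == "H"        # event B: coin shows heads

# With equally likely outcomes, a probability is the proportion of rows.
p_A  <- mean(in_A)
p_B  <- mean(in_B)
p_AB <- mean(in_A & in_B)

print(c(p_A, p_B, p_AB))
print(isTRUE(all.equal(p_AB, p_A * p_B)))   # independence check, eqn. 5.23
print(p_AB / p_A)                           # P(B|A), equal to P(B)
```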


5.9 Betting and Odds

In circumstances where the terms have meaning, the probability of A can be computed as the ratio of the number of equal-probability events favourable to A, $n_A$, to the total number of equal-probability events, $n_T$,

$$P(A) = n_A / n_T. \tag{5.24}$$

Odds, on the other hand, are computed as the ratio of the number of equal-probability events favourable to A, $n_A$, to the number of equal-probability events unfavourable to A, $n_{\bar{A}}$,

$$O(A) = n_A / n_{\bar{A}}. \tag{5.25}$$

Thus, the probability of a 1 on the throw of a dice is $\frac{1}{6}$, whilst the odds are $\frac{1}{5}$; bookmakers express this as five-to-one against.

The probability for any number less than five (1 to 4) would be $\frac{4}{6}$, whilst the odds are $\frac{4}{2} = \frac{2}{1}$; bookmakers express this as two-to-one on. You can calculate probability from odds using

$$P(A) = \frac{O(A)}{1 + O(A)}. \tag{5.26}$$

Thus, for any number less than five (1 to 4) on a dice throw,

$$P(A) = \frac{O(A)}{1 + O(A)} = \frac{2/1}{1 + 2/1} = \frac{2}{3}.$$

You can calculate odds from probability using

$$O(A) = \frac{P(A)}{1 - P(A)}, \tag{5.27}$$

that is, the ratio of probability-for (favourable) to probability-against (unfavourable). Thus, for a one on a dice throw,

$$O(A) = \frac{1/6}{1 - 1/6} = \frac{1}{5}.$$


Bookmakers' odds and probabilities. Bookmakers' probabilities do not add to 1, unlike proper probabilities, which add to one over all possible events, see eqn. 5.2. Let's say we have four horses, each with an equal probability of winning ($P(A_i) = \frac{1}{4}$, for $i = 1, 2, 3, 4$). We would expect odds of

$$O(A) = \frac{1/4}{1 - 1/4} = \frac{1}{3},$$

or three-to-one against. But the bookmaker has to make a living, and not just provide a mutual service for his punters. In this case, if four punters each bet 10 Euro, one on each horse (bookie gets 40 Euro), one punter gets paid 30 Euro plus his stake returned = 40 Euro, and the bookie makes nothing for his work.

The bookie is likely to give odds of something like two-to-one against, $O'(A) = \frac{1}{2}$, and, computing probabilities, we find

$$P'(A) = \frac{O'(A)}{1 + O'(A)} = \frac{1/2}{1 + 1/2} = \frac{1}{3},$$

and the sum of probabilities is $\frac{4}{3}$.

In this amended case, if four punters each bet 10 Euro, one on each horse (bookie gets 40 Euro), one punter gets paid 20 Euro plus his stake returned = 30 Euro, and the bookie makes 10 Euro.

5.10 Classical versus Bayesian Interpretations of Probability

In many books and discussions you will see a distinction made between the classical and the Bayesian interpretation of probability; also, in this context the term frequentist may be used as a synonym for classical. As an interpretation of probability, the term Bayesian has little to do with Bayes' rule, section 5.7, that is until we get to statistical inference, Chapter 10. Broadly speaking, Bayesians interpret probability as belief; frequentists interpret probability as relative frequency.

Bayesian (belief) interpretation Take the case of the tossed (fair) dice. If you were asked to rate, on a scale of [0, 1], your belief that 2 will be the outcome, you would, I hope, agree that the probability is $\frac{1}{6}$; for an even number of dots: $\frac{3}{6} = \frac{1}{2}$; and for any number 1 to 6, a sure thing, the probability is 1. Here 0 corresponds to complete disbelief and 1 to complete belief.


Relative frequency interpretation The frequentist says that the probability of 2 is the relative frequency with which 2 occurs in a large number of hypothetical throws. Let us then run an experiment involving a large number ($n = 600$) of throws, and let $y_i$ = the count of each outcome $\omega_i$ obtained. We might expect to obtain something like $y_1 = 95$, $y_2 = 110$, $y_3 = 90$, $y_4 = 97$, $y_5 = 105$, $y_6 = 103$. We then use $\hat{p}(\omega_i) = y_i / n$; the hat indicates that $\hat{p}(\omega_i)$ is an approximation to $p(\omega_i)$; however, $\hat{p}(\omega_i) \to p(\omega_i)$ as $n \to \infty$. We have

$$\hat{p}(\omega_i) = \frac{y_i}{n} = \{95/600, 110/600, 90/600, 97/600, 105/600, 103/600\} = \{0.158, 0.183, 0.15, 0.162, 0.175, 0.172\}.$$

The correct value is $p(\omega_i) = \frac{1}{6} = 0.1667$.

The errors above are not a real indictment of the frequentist method; a thought experiment allows us to reason that $p(\omega_i) = \frac{1}{6}$. On the other hand, when you want to bet on a football match and would like to estimate the probability and hence the odds, it makes no sense to think of an infinity of matches.
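The relative-frequency experiment is simple to simulate. The following R sketch uses an arbitrary seed (chosen only so that a run is repeatable); the counts it produces will differ from the illustrative counts quoted above.

```r
set.seed(1)                        # arbitrary seed, for repeatability only
n <- 600                           # number of simulated throws
throws <- sample(1:6, size = n, replace = TRUE)

y     <- tabulate(throws, nbins = 6)   # counts y_1, ..., y_6
p_hat <- y / n                         # relative frequencies, estimates of p = 1/6

max_err <- max(abs(p_hat - 1/6))       # largest deviation from the true value
```

Re-running with a larger n (say 60000) should, in general, shrink max_err, which is the sense in which the estimate approaches the true probability as n grows.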


Chapter 6 One Dimensional Random Variables


6.1 Introduction

We have already introduced the notion of a random variable in section 5.3, i.e. where we associate a number with the outcome of an experiment governed by probability. In most cases, your (scientific) data will already be numerical, but it nonetheless remains worthwhile to be cognisant of the details of probability and sample space described in Chapter 5. In some of the examples in Chapter 5, namely those involving the dice, the outcome already is a number, i.e. $\{1, \ldots, 6\}$; in some considerations, this number is more a label than a number, but in any case, the association of a number with the outcome is made trivial. In the coin example we had $\{H, T\}$; in this case we could use the association $\{H \to 1, T \to 0\}$.

6.1.1 Definition: Random Variable

If, to every outcome, $\omega$, of an experiment, we assign a number, $X(\omega)$, $X$ is called a random variable (r.v.). $X$ is a function over the set $\Omega = \{\omega_1, \omega_2, \ldots\}$ of outcomes; if the range of $X$ is the real numbers or some subset of them, $X$ is a continuous r.v.; if the range of $X$ is some integer set, then $X$ is a discrete r.v. The space of all possible values of $X$ is called the range space of $X$, $R_X$. In discussing random variables we label the r.v. with an upper case letter, e.g. $X$, but particular values of it are labelled with lower case, e.g. $x$, or $x_i$.

Example 17 Toss two coins. $\Omega = \{TT, TH, HT, HH\}$. Let a r.v. $X$ be defined as the number of heads in the outcome, i.e. $\{TT \to 0, TH \to 1, HT \to 1, HH \to 2\}$. Notice that two outcomes map to the same number (1); this is not a problem or a mistake. $R_X = \{0, 1, 2\}$.

6.1.2 Probability associated with a Random Variable

Suppose we have an event $B$ with respect to a range space $R_X$. Let the event $A$ with respect to $\Omega$ be defined as

$$A = \{\omega \in \Omega \mid X(\omega) \in B\}. \tag{6.1}$$

Then $A$ and $B$ are equivalent events and we can carry the definitions and equations of Chapter 5 over to random variables.

Example 18 Two coins as in Example 17. Examples of equivalent events are: $A = \{TT\}$, $B = \{0\}$; $A = \{TH, HT\}$, $B = \{1\}$; $A = \{HH\}$, $B = \{2\}$.

In the case of eqn. 6.1, we can say

$$P(B) = P(A). \tag{6.2}$$

Example 19 Two coins as in Example 18. $A = \{TT\}$, $P(A) = \frac{1}{4}$, $B = \{0\}$, $P(X = 0) = \frac{1}{4}$; $A = \{TH, HT\}$, $P(A) = \frac{1}{2}$, $B = \{1\}$, $P(X = 1) = \frac{1}{2}$; $A = \{HH\}$, $P(A) = \frac{1}{4}$, $B = \{2\}$, $P(X = 2) = \frac{1}{4}$.
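The mapping from outcomes to values of the random variable, and the probabilities that result, can be reproduced mechanically. This R sketch assumes fair coins (so the four outcomes are equally likely); the names omega, X and pmf are chosen here for illustration only.

```r
# Sample space for two coin tosses: TT, HT, TH, HH (four equally likely outcomes).
omega <- expand.grid(coin1 = c("T", "H"), coin2 = c("T", "H"),
                     stringsAsFactors = FALSE)

# The random variable X maps each outcome to its number of heads.
X <- (omega$coin1 == "H") + (omega$coin2 == "H")

# Probability of each value of X = proportion of outcomes mapped to it.
pmf <- table(X) / nrow(omega)
```

The entries of pmf, in the order X = 0, 1, 2, should match the probabilities listed in Example 19.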

6.2 Probability Mass Function (pmf) of a Discrete r.v.

Let a r.v. $X$ have a range space $R_X = \{x_1, x_2, \ldots, x_n\}$. We denote the probability of a particular value $X = x_i$ as $p_X(x_i) = P(X = x_i)$. The probabilities $p_X(x_i)$, $i = 1, 2, \ldots, n$, in keeping with eqns. 5.3 and 5.4, must satisfy

$$p_X(x_i) \ge 0, \quad i = 1, 2, \ldots, n, \tag{6.3}$$

$$\sum_{i=1}^{n} p_X(x_i) = 1. \tag{6.4}$$

$p_X$ is called the probability function or the probability mass function of the r.v. $X$. We'll attempt to standardise on probability mass function and its abbreviation pmf. We use the shorthand $X \sim p_X$ to state that the r.v. $X$ has a pmf $p_X$. Often, where there is no ambiguity, you will find the subscript $X$ omitted: $p_X(x) \equiv p(x)$.

6.3 Some Discrete Random Variables

This section identifies and describes the pmfs of some commonly occurring discrete random variables.

6.3.1 Point Mass Distribution

If $X$ can take on only one value, $a$, it has a point mass distribution at $a$; $X \sim \delta_a$.

$$p_X(x) = \begin{cases} 1, & \text{for } x = a, \\ 0, & \text{elsewhere.} \end{cases} \tag{6.5}$$

6.3.2 Discrete Uniform Distribution

$X$ has a discrete uniform distribution on $\{1, \ldots, k\}$, $U(1, k)$, if

$$p_X(x) = \begin{cases} \frac{1}{k}, & \text{for } x = 1, \ldots, k, \\ 0, & \text{elsewhere.} \end{cases} \tag{6.6}$$

Example 20 Lottery machine, $k$ balls. First draw, $X \sim U(1, k)$.

6.3.3 Bernoulli Distribution

Let $X$ be the result of a (binary outcome) experiment with probability $p$ of one outcome, $X = 1$, say, and $1 - p$ for the other, $X = 0$; for example a coin flip. There's overuse of the symbol $p$ here, but we need to keep to standard notation; context should resolve any ambiguities between the parameter $p = P(X = 1)$ and the pmf $p_X(x)$.

$$p_X(x) = p^x (1 - p)^{1 - x}, \quad \text{for } x \in \{0, 1\}. \tag{6.7}$$

6.3.4 Binomial Distribution

Repeat the experiment above (Bernoulli distribution, coin flip) $n$ times and let $X$ be the number of 1s (e.g. heads) obtained.

$$p_X(x) = \begin{cases} \binom{n}{x} p^x (1 - p)^{n - x}, & \text{for } x \in \{0, 1, \ldots, n\}, \\ 0, & \text{otherwise.} \end{cases} \tag{6.8}$$

Where does the $\binom{n}{x}$ come from? We have already introduced it in eqn. 5.12; it is the number of ways of selecting $x$ items from $n$. The probability of the $x$ 1s is $p^x$ and the probability of the $n - x$ 0s is $(1 - p)^{n - x}$; the flips are independent so we can multiply the probabilities to get $p^x (1 - p)^{n - x}$. However, there are $\binom{n}{x} = \frac{n!}{x!(n - x)!}$ possible ways of getting the $X = x$ 1s.

Take $n = 3$; the sample space is $\Omega = \{TTT, TTH, THT, THH, HTT, HTH, HHT, HHH\}$ and the event corresponding to $x = 2$ (two heads, any two heads) is $A = \{THH, HTH, HHT\}$, i.e. there are three outcomes that give two heads.

$$\binom{n}{x} = \binom{3}{2} = \frac{3!}{2!\,1!} = \frac{6}{2} = 3.$$
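As a quick check of eqn. 6.8, the formula can be evaluated directly and compared with R's built-in binomial pmf, dbinom. This is a sketch for the three-coin case with a fair coin; the variable names are illustrative.

```r
n <- 3      # number of flips
p <- 0.5    # probability of a head on each flip (fair coin assumed)
x <- 0:n    # possible numbers of heads

# Eqn. 6.8 written out: (n choose x) * p^x * (1 - p)^(n - x).
by_formula <- choose(n, x) * p^x * (1 - p)^(n - x)

# The same probabilities from R's built-in binomial pmf.
by_dbinom <- dbinom(x, size = n, prob = p)

ways_two_heads <- choose(3, 2)   # number of outcomes giving exactly two heads
```

Both vectors should agree and sum to 1, and ways_two_heads should equal the three outcomes counted by hand.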

6.3.5 Geometric Distribution

$X$ has a geometric distribution with parameter $p$, $X \sim \mathrm{Geom}(p)$, $p \in (0, 1)$, if

$$P(X = k) = p(1 - p)^{k - 1}, \quad k = 1, 2, \ldots \tag{6.9}$$

Example 21 Distribution of the number of coin flips until the first head.

6.3.6 Poisson Distribution

$X$ has a Poisson distribution with parameter $\lambda$, $X \sim \mathrm{Poisson}(\lambda)$, if

$$p_X(x) = e^{-\lambda} \frac{\lambda^x}{x!}, \quad x = 0, 1, 2, \ldots \tag{6.10}$$

Example 22 Distribution of rare events like traffic accidents; there can be long periods of inactivity, but clumping of events is possible, e.g. waiting a long time for a town bus and three arrive in quick succession!

6.4 Some Continuous Random Variables

This section identifies and describes the probability density functions of some commonly occurring continuous random variables. First we must introduce a continuous alternative to the probability mass function.

6.4.1 Probability Density Function (PDF)

When we discussed discrete r.v.s we let $X$ have a range space $R_X = \{x_1, x_2, \ldots, x_n\}$; the number of values in the range space was countable. Let the range space be $R_X = \{0, 0.01, 0.02, \ldots, 0.99, 1.0\}$; this is still a discrete r.v. But what if $R_X = [0, 1]$, i.e. all real numbers in the range 0 to 1? A number of problems arise, the chief of which are: (i) the random variable is now continuous, i.e. the elements of the range space are not countable; (ii) the probability of any particular value of the r.v. is in fact zero.

Example: you buy 0.5-kg of cheese in Tesco; what is the chance of it being exactly 0.5-kg? Zero. Same goes for the weight of a product of a chemical experiment.

Hence we cannot use probability mass functions. We now must use a different probability function called a probability density function (pdf). A pdf, over a range space $R_X$, must satisfy (c.f. eqns. 6.3 and 6.4 for discrete r.v.s)

$$f_X(x) \ge 0, \quad \text{all } x \in R_X, \tag{6.11}$$

$$\int_{R_X} f_X(x)\,dx = 1. \tag{6.12}$$

We emphasise that $f_X(x)$ is not a probability, but $f_X(x)\,dx$ is. If you want to speak of a probability over a continuous r.v. you must state something like: the probability that $X$ is in the range $a$ to $b$, inclusive, is $P(a \le X \le b)$, i.e. $\int_a^b f_X(x)\,dx$. The term probability density function is used (in contrast to probability mass function for discrete r.v.s) because, with a continuous r.v., you simply cannot pick a value ($X = x$), say, and state $P(X = x)$, which is in fact zero.

Discrete probability mass versus Continuous probability density Think of a ruler upon which we place (stick with Blu-Tack) ball bearings of various sizes along its length; the ball bearings represent discrete masses and we can state that we have a mass $m_1$ at ruling $x_1$; we can also compute the total mass as $\sum_i m_i$. Now think of a rod of varying diameter laid along the ruler; we cannot pick a point $x$ and say that the mass at precisely that point is $m(x)$, but we can say that the mass in a little length, $x$ to $x + \Delta x$, is $d(x)\Delta x$, where $d$ is the mass per unit length at $x$ (the density). In this case we can compute the total mass as $\int_{\text{length}} d(x)\,dx$.

6.4.2 Cumulative Distribution Function (cdf)

Many textbooks base their treatment of continuous r.v.s on the cumulative distribution function (cdf); the cdf does give a probability.

$$F_X(x) = P(X \le x), \tag{6.13}$$

$$F_X(x) = \int_{-\infty}^{x} f_X(u)\,du. \tag{6.14}$$

6.4.3 Uniform Distribution

$X$ has a uniform distribution on $[a, b]$, $X \sim \mathrm{Uniform}(a, b)$, if

$$f_X(x) = \begin{cases} \frac{1}{b - a}, & \text{for } x \in [a, b], \\ 0, & \text{otherwise.} \end{cases} \tag{6.15}$$

The cumulative distribution function (cdf) is

$$F_X(x) = \begin{cases} 0, & x < a, \\ \frac{x - a}{b - a}, & x \in [a, b], \\ 1, & x > b. \end{cases} \tag{6.16}$$
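The relationship between pdf, cdf and probability can be illustrated numerically for the uniform distribution. In this R sketch the interval endpoints (a = 2, b = 5) and the sub-interval (3 to 4.5) are arbitrary values picked for the example.

```r
a <- 2;  b <- 5       # hypothetical endpoints of the uniform distribution
lo <- 3; hi <- 4.5    # we want P(lo <= X <= hi)

# (i) Integrate the pdf over [lo, hi].
by_integration <- integrate(function(x) dunif(x, min = a, max = b),
                            lower = lo, upper = hi)$value

# (ii) Difference of the cdf at the two ends, F(hi) - F(lo).
by_cdf <- punif(hi, min = a, max = b) - punif(lo, min = a, max = b)

# (iii) By hand from eqn. 6.16.
by_hand <- (hi - lo) / (b - a)
```

All three routes should give the same answer (up to the numerical tolerance of integrate), which is the practical reason the cdf is so convenient.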

6.4.4 Normal (Gaussian) Distribution

$X$ has a Normal (Gaussian) distribution with parameters $\mu$ and $\sigma$, $X \sim N(\mu, \sigma)$, if

$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right], \quad -\infty < x < \infty. \tag{6.17}$$

The Normal distribution is often used to model measurements taken in the presence of error or noise. If the true value of a variable $X$ is $\mu$, then the measurement (random) variable is distributed as $N(\mu, \sigma)$ where $\sigma$ (the standard deviation) is a measure of the size of the errors.

We say $X$ has a standard Normal distribution if $\mu = 0$ and $\sigma = 1$; standard Normal r.v.s are typically denoted by $Z$; $Z \sim N(0, 1)$. The CDF for $Z$ is denoted by $\Phi(z)$; although there is no closed-form formula for $\Phi(z)$, it is tabulated. In the days before widespread use of computers, tables such as those for $\Phi(z)$ were of great importance to those involved in statistics and statistical inference. Nowadays statistics packages and even some calculators will compute $\Phi(z)$ for you, or even remove the necessity by calculating the thing that required $\Phi(z)$ as an intermediate value.

If $X \sim N(\mu, \sigma)$ then $Z = (X - \mu)/\sigma \sim N(0, 1)$. Conversely, if $Z \sim N(0, 1)$ then $X = \mu + \sigma Z \sim N(\mu, \sigma)$. Also, if $X \sim N(\mu, \sigma)$ and $Y = aX + b$, then $Y \sim N(a\mu + b, |a|\sigma)$.

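6.4.5 Exponential Distribution

$X$ has an Exponential distribution with parameter $\beta$, $\beta > 0$, $X \sim \mathrm{Exp}(\beta)$, if

$$f_X(x) = \frac{1}{\beta} \exp(-x/\beta), \quad x > 0. \tag{6.18}$$

The Exponential distribution is used to model the waiting times between infrequent events, c.f. the Poisson distribution, see section 6.3.6.

6.4.6 Gamma Distribution

$X$ has a Gamma distribution with parameters $\alpha$, $\beta$; $\alpha, \beta > 0$, $X \sim \mathrm{Gamma}(\alpha, \beta)$, if

$$f_X(x) = \frac{1}{\beta^{\alpha}\,\Gamma(\alpha)}\, x^{\alpha - 1} \exp(-x/\beta), \quad x > 0. \tag{6.19}$$

The Gamma function, for parameter $\alpha > 0$, is given by

$$\Gamma(\alpha) = \int_0^{\infty} y^{\alpha - 1} e^{-y}\,dy. \tag{6.20}$$

The Exponential distribution is Gamma with parameter $\alpha = 1$, $\mathrm{Gamma}(1, \beta)$.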
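The statement that the Exponential is a special case of the Gamma can be checked numerically with R's density functions. Note the parameterisation, which is an easy source of error: R's dexp takes a rate (here 1/beta), whereas dgamma accepts a scale argument that corresponds to beta in eqn. 6.19. The value beta = 2 and the grid of x values are arbitrary choices for this sketch.

```r
beta <- 2                         # hypothetical scale parameter
x <- seq(0.1, 10, by = 0.1)       # grid of positive x values

exp_direct   <- (1 / beta) * exp(-x / beta)            # eqn. 6.18 written out
exp_builtin  <- dexp(x, rate = 1 / beta)               # R parameterises by rate
gamma_alpha1 <- dgamma(x, shape = 1, scale = beta)     # Gamma(1, beta)

gamma_fn_check <- gamma(5)        # Gamma function; equals 4! for integer argument 5
```

The three density vectors should coincide, and gamma(5) should equal 24.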

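6.4.7 Beta Distribution

$X$ has a Beta distribution with parameters $\alpha$, $\beta$; $\alpha, \beta > 0$, $X \sim \mathrm{Beta}(\alpha, \beta)$, if

$$f_X(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1}, \quad 0 < x < 1. \tag{6.21}$$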

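6.4.8 Student t Distribution

$X$ has a Student t distribution (or just t distribution) with $\nu$ degrees of freedom, $X \sim t_{\nu}$, if

$$f_X(x) = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\sqrt{\nu\pi}\;\Gamma\left(\frac{\nu}{2}\right)} \cdot \frac{1}{\left(1 + \frac{x^2}{\nu}\right)^{(\nu + 1)/2}}. \tag{6.22}$$

6.4.9 Cauchy Distribution

The Cauchy distribution, $X \sim \mathrm{Cauchy}$, is a special case of the t distribution with $\nu = 1$,

$$f_X(x) = \frac{1}{\pi(1 + x^2)}. \tag{6.23}$$

6.4.10 Chi-squared Distribution

$X$ has a $\chi^2$ distribution with $n$ degrees of freedom, $X \sim \chi^2_n$, if

$$f_X(x) = \frac{1}{\Gamma(n/2)\,2^{n/2}}\, x^{(n/2) - 1} e^{-x/2}, \quad x > 0. \tag{6.24}$$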
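The claim that the Cauchy is the t distribution with one degree of freedom can be checked against R's built-in densities dt and dcauchy, and against the Cauchy formula written out. The grid of x values below is an arbitrary choice for this sketch.

```r
x <- seq(-5, 5, by = 0.5)               # arbitrary grid of x values

t_one_df       <- dt(x, df = 1)         # t density with 1 degree of freedom
cauchy_builtin <- dcauchy(x)            # standard Cauchy density
cauchy_formula <- 1 / (pi * (1 + x^2))  # Cauchy pdf written out by hand

same_density <- isTRUE(all.equal(t_one_df, cauchy_builtin)) &&
                isTRUE(all.equal(cauchy_builtin, cauchy_formula))
```

If the three vectors agree, same_density is TRUE.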

6.5 Range spaces terminology

In discussing discrete r.v.s we mentioned, for example, a range space $R_X = \{x_1, x_2, \ldots, x_n\}$. If the range space is all the integers, we could use the common symbol $R_X = \mathbb{Z}$. If the range space is all the real numbers, we could use the common symbol $R_X = \mathbb{R}$. If the range space is a subset of $\mathbb{R}$, we use, for example, $R_X = [0, 1]$ to state that the r.v. can be anywhere from 0 to 1 inclusive. For a discrete (integer) subset we use, for example, $\{1, 2, \ldots, 10\}$.

6.6 Parameters

In discussing the Binomial distribution, eqn. 6.8, and the Normal, eqn. 6.17, see below,

$$p_X(x) = \begin{cases} \binom{n}{x} p^x (1 - p)^{n - x}, & \text{for } x \in \{0, 1, \ldots, n\}, \\ 0, & \text{otherwise,} \end{cases}$$

$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right], \quad -\infty < x < \infty,$$

we note that $n$ and $p$ for the Binomial, and $\mu$, $\sigma$ for the Normal, completely specify the distributions. We call these parameters and we will see distributions written as, for example, $f_X(x; \theta_1, \theta_2)$, where $\theta$ is a common symbol for a parameter. A lot of practical statistics involves parameter estimation, where, for example, we may have a set (sample) of data $x_1, x_2, \ldots, x_n$, which we know to be drawn from a population with distribution $f_X(x; \theta_1, \theta_2)$ and we want to compute an estimate $\hat{\theta}_1$ for $\theta_1$.


Chapter 7 Two- and Multi-Dimensional Random Variables


7.1 Introduction

Chapter 6 has introduced one dimensional random variables and certain well known distributions. Both discrete and continuous r.v.s were covered. In many cases, your (scientific) data will consist not just of single numbers, for example, the weight of a chemical in a mixture, but two or more numbers. If the numbers correspond to independent events, see section 5.8, it may be possible or desirable to treat them separately as individual one-dimensional r.v.s, but, generally, you will want to treat pairs or triples or multiple numbers together. In section 5.6 and eqn. 5.13 we introduced the notion of the probability of two events happening together, $P(AB)$, the joint probability of $A$ and $B$. Here we introduce first two-dimensional r.v.s and then go on to generalise to multi-dimensional r.v.s.

Range spaces terminology for two and more dimensions See section 6.5 where we introduced some symbols and terminology used in describing range spaces for one-dimensional r.v.s. If we have a two-dimensional continuous random variable, a pair $(X, Y)$, each member of which can take on any real value, we say that the range space is $\mathbb{R} \times \mathbb{R}$; for general multi-dimensions, say $p$ dimensions, where the random variable is a random vector, we use $\mathbb{R}^p$. For subsets of $\mathbb{R}$, we use, for example, $[0, 1] \times [0, 1]$ and $[0, 1]^p$. The term for a combination (product) of sets such as $[0, 1] \times [0, 1]$ is Cartesian product.

Two-dimensional (Bivariate) Random Variables If, to every outcome, $\omega$, of an experiment, we assign two numbers, $X(\omega)$, $Y(\omega)$, the pair $(X, Y)$ is called a two-dimensional random variable. As with one dimension, we have discrete and continuous two-dimensional random variables; the term random vector is also used, especially when there are more than two dimensions.

Much of what we present here is just a two-dimensional analogue of what was covered in Chapter 6. Also, what is described here in terms of two-dimensions transfers immediately to multiple dimensions.

7.2 Probability Function of a Discrete Two-dimensional r.v.

By analogy with eqns. 6.3 and 6.4, for one dimension, we have $p_{X,Y}(x_i, y_j) = P(X = x_i, Y = y_j)$ (or just $p(x_i, y_j)$) and it must satisfy the following

$$p(x_i, y_j) \ge 0, \quad i = 1, 2, \ldots;\; j = 1, 2, \ldots \tag{7.1}$$

$$\sum_{j=1}^{m} \sum_{i=1}^{n} p(x_i, y_j) = 1. \tag{7.2}$$

As with one-d., $p_{X,Y}$ or just $p$ is called the probability function or the joint probability function for the r.v. $(X, Y)$.

Example 23 From (Meyer 1966, p. 85). There are two production lines; the first has a capacity to produce up to five items in a day; its actual production is a random variable $X$; the second has a capacity to produce up to three items in a day and its actual production is a random variable $Y$. The pair of random variables is the two-dimensional random variable $(X, Y)$ and the joint probability function is given in Table 7.1. Each entry represents $P(X = x_i, Y = y_j)$; so $p(2, 3) = 0.04$. Such a table could be estimated by noting $(X, Y)$ over a large number of days.

    X \ Y     0       1       2       3
      0     0.00    0.01    0.01    0.01
      1     0.01    0.02    0.03    0.02
      2     0.03    0.04    0.05    0.04
      3     0.05    0.05    0.05    0.06
      4     0.07    0.06    0.05    0.06
      5     0.09    0.08    0.06    0.05

Table 7.1: Example of a two-dimensional probability function

We can verify that the table does represent a proper probability function in that requirement eqn. 7.1 is satisfied, and, by summing over all entries, that requirement eqn. 7.2 is satisfied: the entries sum to 1.

7.3 PDF of a Continuous Two-dimensional r.v.

By analogy with eqns. 6.11 and 6.12, for one dimension, we have the (joint) PDF $f(x, y)$ and it must satisfy the following

$$f(x, y) \ge 0, \quad \text{all } (x, y) \in \mathbb{R} \times \mathbb{R}, \tag{7.3}$$

$$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1. \tag{7.4}$$

We emphasise again that $f(x, y)$ is not a probability, but $f(x, y)\,dx\,dy$ is.

7.4 Marginal Probability Distributions

Example 24 Suppose in Example 23 (Table 7.1) we want to compute the probability functions for $X$ and $Y$ on their own. These are called marginal probability functions. The marginal probability function for $X$ is given by

$$p_X(x_i) = P(X = x_i) = P(X = x_i, Y = y_1, \text{ or } \ldots, \text{ or } X = x_i, Y = y_m) = \sum_{j=1}^{m} p(x_i, y_j). \tag{7.5}$$

Similarly, the marginal probability function for $Y$ is given by

$$p_Y(y_j) = \sum_{i=1}^{n} p(x_i, y_j).$$

Table 7.2 shows the corresponding sums.

    X \ Y     0       1       2       3      Sum
      0     0.00    0.01    0.01    0.01    0.03
      1     0.01    0.02    0.03    0.02    0.08
      2     0.03    0.04    0.05    0.04    0.16
      3     0.05    0.05    0.05    0.06    0.21
      4     0.07    0.06    0.05    0.06    0.24
      5     0.09    0.08    0.06    0.05    0.28
     Sum    0.25    0.26    0.25    0.24    1.00

Table 7.2: Example of marginal probability functions (row sums give $p_X$, column sums give $p_Y$)

We can verify that the sums corresponding to $p(x_i)$ and $p(y_j)$ do represent proper probability functions in that requirement 6.3 is satisfied, and, by summing the marginals, that requirement 6.4 is satisfied: both sets of marginals sum to 1.

For continuous random variables, we can state the equivalent equation for marginal PDFs:

$$f_X(x) = \int_{R_Y} f_{X,Y}(x, y)\,dy. \tag{7.6}$$
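The marginal sums for the discrete production-line example can be reproduced in R by storing the joint probability function as a matrix. This is only a sketch: the layout (rows for X = 0 to 5, columns for Y = 0 to 3) and the names joint, p_X and p_Y are choices made here for illustration.

```r
# Joint probability function of Table 7.1: rows are X = 0..5, columns are Y = 0..3.
joint <- matrix(c(0.00, 0.01, 0.01, 0.01,
                  0.01, 0.02, 0.03, 0.02,
                  0.03, 0.04, 0.05, 0.04,
                  0.05, 0.05, 0.05, 0.06,
                  0.07, 0.06, 0.05, 0.06,
                  0.09, 0.08, 0.06, 0.05),
                nrow = 6, byrow = TRUE,
                dimnames = list(X = as.character(0:5), Y = as.character(0:3)))

p_X <- rowSums(joint)    # marginal of X: sum over y for each x (eqn. 7.5)
p_Y <- colSums(joint)    # marginal of Y: sum over x for each y

total <- sum(joint)      # should be 1 for a proper probability function
```

p_X and p_Y should match the Sum column and Sum row of Table 7.2 respectively.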


7.5 Conditional Probability Distributions

In section 5.6 we introduced conditional probability, i.e. the probability of an event $B$ when we know that event $A$ has occurred:

$$P(B|A) = \frac{P(AB)}{P(A)}. \tag{7.7}$$

We can do the same for probability functions.

Example 25 Suppose in Example 24 (Table 7.2) we want to compute the conditional probability $P(X = 2|Y = 1)$. Applying eqn. 7.7 we have

$$P(X = 2|Y = 1) = \frac{P(X = 2, Y = 1)}{P(Y = 1)} = \frac{0.04}{0.26} = 0.154.$$

We can give general rules, writing $p(x_i)$ and $q(y_j)$ for the marginal probability functions of $X$ and $Y$ given by eqn. 7.5,

$$p(x_i|y_j) = \frac{p(x_i, y_j)}{q(y_j)} \quad \text{if } q(y_j) > 0, \tag{7.8}$$

$$p(y_j|x_i) = \frac{p(x_i, y_j)}{p(x_i)} \quad \text{if } p(x_i) > 0. \tag{7.9}$$

We can give similar general rules for continuous random variables, writing $g(x)$ and $h(y)$ for the marginal pdfs of $X$ and $Y$ given by eqn. 7.6,

$$f(x|y) = \frac{f(x, y)}{h(y)} \quad \text{if } h(y) > 0, \tag{7.10}$$

$$f(y|x) = \frac{f(x, y)}{g(x)} \quad \text{if } g(x) > 0. \tag{7.11}$$

7.6 Independent Random Variables

We can define the notion of independent random variables using the definition of independent events given in section 5.8; we had: $A$ and $B$ are independent events if and only if

$$P(AB) = P(A)P(B). \tag{7.12}$$

(The occurrence of event $A$ in no way influences the occurrence of $B$ and vice-versa.)

Independent Discrete Random Variables Given the two-d. discrete random variable $(X, Y)$, $X$ and $Y$ are said to be independent if and only if

$$p(x_i, y_j) = q(x_i)\,r(y_j), \tag{7.13}$$

for all $i$, $j$, where $q(x_i)$, $r(y_j)$ are the marginal probability functions given by eqn. 7.5.

Independent Continuous Random Variables Similarly, given the two-d. continuous random variable $(X, Y)$, $X$ and $Y$ are said to be independent if and only if

$$f(x, y) = g(x)\,h(y), \tag{7.14}$$

where $g(x)$, $h(y)$ are marginal pdfs.
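The discrete criterion can be tested numerically on the joint probability function of Table 7.1: form the product of the marginals for every pair of values and compare it with the joint table. This is a sketch in R; the matrix layout (rows X = 0 to 5, columns Y = 0 to 3) and the variable names are choices made for this illustration.

```r
# Joint probability function of Table 7.1: rows are X = 0..5, columns are Y = 0..3.
joint <- matrix(c(0.00, 0.01, 0.01, 0.01,
                  0.01, 0.02, 0.03, 0.02,
                  0.03, 0.04, 0.05, 0.04,
                  0.05, 0.05, 0.05, 0.06,
                  0.07, 0.06, 0.05, 0.06,
                  0.09, 0.08, 0.06, 0.05),
                nrow = 6, byrow = TRUE)

q_X <- rowSums(joint)              # marginal of X
r_Y <- colSums(joint)              # marginal of Y

# Table of products q(x_i) * r(y_j): what the joint would be under independence.
product_table <- outer(q_X, r_Y)

independent <- isTRUE(all.equal(joint, product_table, check.attributes = FALSE))
max_gap <- max(abs(joint - product_table))   # largest departure from independence
```

For this table independent comes out FALSE: for instance the entry for X = 0, Y = 0 is zero although neither marginal is zero, so the outputs of the two production lines are not independent.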

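7.7 Two-dimensional (Bivariate) Normal Distribution

We can extend the one-d. Normal (Gaussian) distribution to two-d.

$$f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1 - \rho^2}} \exp\left\{-\frac{1}{2(1 - \rho^2)}\left[\left(\frac{x - \mu_x}{\sigma_x}\right)^2 - 2\rho\,\frac{(x - \mu_x)(y - \mu_y)}{\sigma_x\sigma_y} + \left(\frac{y - \mu_y}{\sigma_y}\right)^2\right]\right\}, \tag{7.15}$$

for $-\infty < x < \infty$, $-\infty < y < \infty$.

Before you start protesting that eqn. 7.15 is incomprehensible, (i) it isn't and I can explain it; (ii) there is a much better way of handling multivariate random variables that is better even for two-d. See Chapter B and section B.7.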


Chapter 8 Characterisations of Random Variables


8.1 Introduction

We introduced the notion of a random variable in Chapters 6 and 7. We identified probability functions (for discrete r.v.s) and probability density functions for some commonly occurring r.v.s. Here we identify and define some parameters (numbers) that characterise some aspects of r.v. distributions. Generally, the expected value or expectation of some function of the r.v. is found useful and the expected value of the r.v. itself (the mean) is first amongst these.

8.2 Expected Value (Mean) of a Random Variable

The expected value of a r.v. $X$, or expectation, or mean, is the average value of $X$.

Definition: Expected Value, Discrete R.V. Discrete r.v., range space $R_X = \{x_1, \ldots, x_N\}$; probability mass function $p(x_i) = P(X = x_i)$. The expected value or expectation ($E(X)$), or mean of $X$ is given by

$$E(X) = \mu_X = \sum_{i=1}^{N} x_i\,p(x_i). \tag{8.1}$$

Definition: Expected Value, Continuous R.V. Continuous r.v., range space $R_X = \mathbb{R}$; probability density function $f(x)$. The expected value or expectation ($E(X)$), or mean of $X$ is given by

$$E(X) = \mu_X = \int_{\mathbb{R}} x f(x)\,dx. \tag{8.2}$$

Example 26 Toss two coins as in Example 18. $X$ = number of heads. $A = \{TT\}$, $P(A) = \frac{1}{4}$, $X = 0$, $P(X = 0) = \frac{1}{4}$; $A = \{TH, HT\}$, $P(A) = \frac{1}{2}$, $X = 1$, $P(X = 1) = \frac{1}{2}$; $A = \{HH\}$, $P(A) = \frac{1}{4}$, $X = 2$, $P(X = 2) = \frac{1}{4}$.

$$E(X) = \mu_X = \sum_{i=1}^{N} x_i\,p(x_i) = 0 \times \frac{1}{4} + 1 \times \frac{1}{2} + 2 \times \frac{1}{4} = 0 + 0.5 + 0.5 = 1.$$

Example 27 Toss a dice and take $X$ = the number of dots obtained; $p(x_i) = \frac{1}{6}$, $i = 1, \ldots, 6$.

$$E(X) = \mu_X = \sum_{i=1}^{N} x_i\,p(x_i) = \frac{1}{6}\sum_{i=1}^{6} x_i = 21/6 = 3.5. \tag{8.3}$$

Note that in Example 27 $\mu_X = 3.5$ is not one of the possible values of $X$. It is useful, particularly in two-d. cases, to think of $\mu_X$ as the centre of mass, where $p(x_i)$ is a mass and $x_i$ is a position along a lever arm; $\mu_X$ is the position to place the fulcrum in order to achieve a balance.
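Eqn. 8.1 is a one-line computation in R: multiply each value by its probability and sum. The sketch below repeats Examples 26 and 27; the variable names are illustrative only.

```r
# Example 26: number of heads in two coin tosses.
x_coins <- c(0, 1, 2)
p_coins <- c(1/4, 1/2, 1/4)
mean_coins <- sum(x_coins * p_coins)   # eqn. 8.1

# Example 27: number of dots on a fair dice.
x_dice <- 1:6
p_dice <- rep(1/6, 6)
mean_dice <- sum(x_dice * p_dice)      # eqn. 8.1

# weighted.mean() does the same thing when the weights are the probabilities.
mean_dice_alt <- weighted.mean(x_dice, w = p_dice)
```

The results should be 1 for the coins and 3.5 for the dice, as calculated by hand.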

Aside: Sample Averages In later chapters we will encounter samples and sample averages. By sample we mean that we run an experiment and take some example values, say $n$ of them, of the r.v., $x_1, x_2, \ldots, x_n$. Here we use $n$ for the size of the sample rather than $N$ as in eqn. 8.1 and note that the range space $R_X = \{x_1, \ldots, x_N\}$ describes the population, rather than a sample of it. Then we can compute a sample mean, $\bar{X}$ (pronounced x-bar), as

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i. \tag{8.4}$$

That is, compute the average like we learned in early arithmetic.

Ordinarily, we'll make a strong distinction between sample mean and true mean. But let us consider the case of a large sample of dice throws, say $n = 600$. Let $y_i$ = the count of each value $x_i$ obtained. We might expect to obtain something like $y_1 = 95$, $y_2 = 110$, $y_3 = 90$, $y_4 = 97$, $y_5 = 105$, $y_6 = 103$, so that for eqn. 8.4

$$\bar{X} = \frac{1}{600}\left(95 \times 1 + 110 \times 2 + 90 \times 3 + 97 \times 4 + 105 \times 5 + 103 \times 6\right) = \frac{2116}{600} \approx 3.53.$$

If we look more carefully at eqn. 8.4 for this example, we can interpret it as a sample version of eqn. 8.1,

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{6} y_i x_i = \sum_{i=1}^{6} x_i \frac{y_i}{n}, \tag{8.5}$$

and, comparing with eqn. 8.1, we have $\frac{y_i}{n}$ in place of $p(x_i)$; we note that $\frac{y_i}{n} = \hat{p}(x_i) = \{95/600, 110/600, 90/600, 97/600, 105/600, 103/600\} = \{0.158, 0.183, 0.15, 0.162, 0.175, 0.172\}$, i.e. we have sample estimates of the probability mass function, which are inexact. The error, $\bar{X} \ne \mu_X$, is due to the errors in the $\hat{p}(x_i)$. Generally, as $n \to \infty$, $\hat{p}(x_i) \to p(x_i)$ and $\bar{X} \to \mu_X$.

Definition: Expected Value of a function of a r.v. The expected value ($E(r(X))$) of a function $Y = r(X)$ of $X$ is given by

$$E(Y) = E(r(X)) = \sum_{i=1}^{N} r(x_i)\,p(x_i). \tag{8.6}$$

Example 28 Let us use a dice as a one number slot-machine (one-armed-bandit). We pay 5c to play and the machine pays whatever number comes up (1 to 6); thus our payout for each play is $x_i - 5$. What is the expected value of the payout? (Think: play for an hour, 1000 plays, inserting 5000c, what do we expect to win or lose?)

$$E(Y) = \sum_{i=1}^{6} r(x_i)\,p(x_i) = \sum_{i=1}^{6} (x_i - 5)\frac{1}{6} = -4/6 - 3/6 - 2/6 - 1/6 + 0/6 + 1/6 = -9/6 = -1.5.$$

That is, we lose on average 1.5c for every play and would lose 1500c in 1000 plays. (Maybe better than the average slot-machine?)
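The slot-machine calculation is a direct application of eqn. 8.6 and can be sketched in R as follows; the function name payout is invented for this illustration.

```r
x <- 1:6                       # number shown by the dice
p <- rep(1/6, 6)               # fair dice assumed

payout <- function(x) x - 5    # r(x): amount won (in cents) net of the 5c stake

expected_payout <- sum(payout(x) * p)        # eqn. 8.6
expected_over_1000 <- 1000 * expected_payout # expected net result of 1000 plays
```

expected_payout should be -1.5 (a loss of 1.5c per play) and expected_over_1000 should be -1500.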

Expected values for two-dimensions and higher Eqns. 8.1 and 8.2 carry over to two and more dimensions.

Discrete r.v., range space $R_{X,Y} = \{x_1, \ldots, x_N\} \times \{y_1, \ldots, y_M\}$; probability mass function $p(x_i, y_j) = P(X = x_i, Y = y_j)$. The expected value or expectation, $E[(X, Y)]$, or mean of the pair $(X, Y)$ is given by

$$E[(X, Y)] = \mu_{X,Y} = (\mu_X, \mu_Y) = \sum_{i=1}^{N} \sum_{j=1}^{M} (x_i, y_j)\,p(x_i, y_j). \tag{8.7}$$

And similarly for two-d. (and multidimensional) continuous r.v.s, where multiple integrals replace single integrals.

Useful facts For $X_1, \ldots, X_n$ random variables and constants $a_1, \ldots, a_n$,

$$E\left(\sum_i a_i X_i\right) = \sum_i a_i E(X_i). \tag{8.8}$$

For $X_1, \ldots, X_n$ independent random variables

$$E\left(\prod_{i=1}^{n} X_i\right) = \prod_{i=1}^{n} E(X_i). \tag{8.9}$$

8.3 Variance of a Random Variable

Variance measures the spread of a distribution. The variance is the expected value (mean value) of the squared deviation from the mean.

Definition: Variance Discrete r.v., range space $R_X = \{x_1, \ldots, x_N\}$; probability mass function $p(x_i)$, mean $\mu_X$. The variance is given by

$$V(X) = \sigma_X^2 = E[(X - \mu_X)^2] = \sum_{i=1}^{N} (x_i - \mu_X)^2\,p(x_i). \tag{8.10}$$

Continuous r.v.

$$V(X) = \sigma_X^2 = E[(X - \mu_X)^2] = \int_{\mathbb{R}} (x - \mu_X)^2 f(x)\,dx. \tag{8.11}$$

The following formula is sometimes useful

$$V(X) = E(X^2) - (E(X))^2 = E(X^2) - \mu_X^2. \tag{8.12}$$

Aside: Sample Variance Eqn. 8.4 gives the sample mean of a random variable; the sample variance is given by

$$s^2 = \frac{1}{n - 1}\sum_{i=1}^{n} (x_i - \bar{X})^2. \tag{8.13}$$

You may wonder about the $(n - 1)$ instead of $n$; if we divided by $n$, the estimate would be biased.

Standard Deviation Standard deviation: $\sigma_X = \sqrt{V(X)}$.

Useful facts about variance For constants $a$, $b$,

$$V(aX + b) = a^2 V(X). \tag{8.14}$$

For $X_1, \ldots, X_n$ independent random variables and constants $a_1, \ldots, a_n$,

$$V\left(\sum_{i=1}^{n} a_i X_i\right) = \sum_{i=1}^{n} a_i^2\,V(X_i). \tag{8.15}$$

If $X_1, \ldots, X_n$ are independent and identically distributed (IID) random variables with $\mu = E(X)$, $\sigma^2 = V(X)$, then

$$E(\bar{X}) = \mu, \quad V(\bar{X}) = \sigma^2 / n, \quad E(s^2) = \sigma^2. \tag{8.16}$$
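The definition of variance and the shortcut formula should give the same number. The R sketch below checks this for a fair dice, and also shows the distinction between the population variance and R's var(), which computes the sample variance with the (n - 1) divisor. The small data vector is made up for illustration.

```r
x <- 1:6                  # values of a fair dice
p <- rep(1/6, 6)          # their probabilities

mu <- sum(x * p)                           # mean, eqn. 8.1

var_definition <- sum((x - mu)^2 * p)      # expected squared deviation, eqn. 8.10
var_shortcut   <- sum(x^2 * p) - mu^2      # E(X^2) - mu^2, eqn. 8.12
sd_dice        <- sqrt(var_definition)     # standard deviation

# Sample variance of some made-up data: var() divides by (n - 1), eqn. 8.13.
sample_data <- c(2, 4, 4, 5, 6, 3)
s2_builtin  <- var(sample_data)
s2_by_hand  <- sum((sample_data - mean(sample_data))^2) / (length(sample_data) - 1)
```

For the dice both variance calculations give 35/12, about 2.92.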

8.4 Expectations in Two-dimensions

8.4.1 Mean

Two-d. discrete r.v., range space $R_{X,Y} = \{x_1, \ldots, x_N\} \times \{y_1, \ldots, y_M\}$; probability mass function $p(x_i, y_j)$. The expected value or expectation ($E[(X, Y)]$), or mean of $(X, Y)$ is given by

$$E[(X, Y)] = \mu_{X,Y} = \sum_{j=1}^{M} \sum_{i=1}^{N} (x_i, y_j)\,p(x_i, y_j). \tag{8.17}$$

Similarly for a continuous r.v.: double integral replaces summation, pdf replaces probability mass function.

8.4.2 Covariance

Let $X$, $Y$ be r.v.s with means $\mu_X$, $\mu_Y$ and standard deviations $\sigma_X$, $\sigma_Y$. The covariance between $X$ and $Y$ is defined as

$$\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]. \tag{8.18}$$

$\mathrm{Cov}(X, Y) = \mathrm{Cov}(Y, X)$. The correlation between $X$ and $Y$ is defined as

$$\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}. \tag{8.19}$$
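As an illustration of eqns. 8.18 and 8.19, the covariance and correlation can be computed for the production-line joint probability function of Table 7.1. This is a sketch in R under the assumption that the table is laid out with rows X = 0 to 5 and columns Y = 0 to 3; all variable names are chosen for this example.

```r
# Joint probability function of Table 7.1: rows are X = 0..5, columns are Y = 0..3.
joint <- matrix(c(0.00, 0.01, 0.01, 0.01,
                  0.01, 0.02, 0.03, 0.02,
                  0.03, 0.04, 0.05, 0.04,
                  0.05, 0.05, 0.05, 0.06,
                  0.07, 0.06, 0.05, 0.06,
                  0.09, 0.08, 0.06, 0.05),
                nrow = 6, byrow = TRUE)
x <- 0:5
y <- 0:3

p_X <- rowSums(joint);  p_Y <- colSums(joint)     # marginals
mu_X <- sum(x * p_X);   mu_Y <- sum(y * p_Y)      # means
sd_X <- sqrt(sum((x - mu_X)^2 * p_X))             # standard deviations
sd_Y <- sqrt(sum((y - mu_Y)^2 * p_Y))

# Eqn. 8.18: sum over all (i, j) of (x_i - mu_X)(y_j - mu_Y) p(x_i, y_j).
cov_XY <- sum(outer(x - mu_X, y - mu_Y) * joint)

rho_XY <- cov_XY / (sd_X * sd_Y)                  # eqn. 8.19
```

The covariance comes out slightly negative for this table, and the correlation necessarily lies between -1 and 1.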


Chapter 9 The Normal Distribution


9.1 Introduction

Here we introduce some uses of the Normal distribution, eqn. 6.17. The Normal distribution can be used as a model or approximate model in so many cases that a large amount of mathematics has been built up around it. Note: we use Normal (capitalised) to distinguish from the word normal (expected, typical) and because most other distribution names are capitalised. The probability density function (pdf) is given by:

$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right], \quad -\infty < x < \infty. \tag{9.1}$$

We say $X \sim N(\mu, \sigma)$; note: some writers use $X \sim N(\mu, \sigma^2)$, i.e. they use the variance for the second parameter of $N$; we will attempt to standardise on $N(\mu, \sigma)$. It is well worth checking carefully when reading books and papers; there can be a great difference between $\sigma$ and $\sigma^2$!

Because the pdf is different for each $\mu$, $\sigma$, it is convenient to create a standardised Normal in which $\mu = 0$, $\sigma = 1$. We standardise the r.v. $X$ as follows; first we shift to zero mean, and then we divide by $\sigma$ to obtain unit standard deviation.

$$Z = (X - \mu)/\sigma. \tag{9.2}$$

When we standardise $X$, we obtain $Z = (X - \mu)/\sigma \sim N(0, 1)$, and eqn. 9.1 becomes eqn. 9.3,

$$f_Z(z) = \frac{1}{\sqrt{2\pi}} \exp(-z^2/2). \tag{9.3}$$

The pdf for $N(0, 1)$ is shown in Figure 9.1. As you can see, most of the probability is located in $-3 < Z < 3$; between these limits we have probability 0.9973, i.e. $P(-3 < Z < 3) = 0.9973$; that is, if we have a standard Normal random variable $Z$, we can be pretty sure it will fall between these limits; you may have heard the term three-sigma to denote nearly all occurrences. Likewise $P(-1.96 < Z < 1.96) = 0.95$, so that the probability outside these limits is 0.05 or 5%.


R-Example 3 The following R code computes and plots Figure 9.1.

    z = seq(-6, 6, length = 200)
    pdf = dnorm(z, 0, 1)   ## dnorm for d(ensity) normal
    plot(z, pdf, type = "l", lwd = 3)

9.2 Cumulative Distribution Function (cdf)

As we indicated in section 6.4.1, the pdf does not represent a probability, but a probability density; the numbers we refer to above, for example, $P(-1.96 < Z < 1.96) = 0.95$, are obtained by integration,

$$P(-1.96 < Z < 1.96) = 0.95 = \int_{-1.96}^{1.96} f_Z(z)\,dz. \tag{9.4}$$

However, for the Normal distribution, there is no easy way to compute $\int_a^b f_X(x)\,dx$, which is where the cdf comes in; we recall that the cdf is given by eqns. 9.5 and 9.6,

$$F_Z(z) = P(Z \le z), \tag{9.5}$$

$$\Phi(z) = F_Z(z) = \int_{-\infty}^{z} f_Z(u)\,du = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} \exp(-u^2/2)\,du. \tag{9.6}$$

Because it is so commonly used, the standardised Normal cdf gets its own symbol, $\Phi(z)$. $\Phi(z)$ is plotted in Figure 9.2 which was created using the code in R-Example 4.

R-Example 4 The following R code computes and plots Figure 9.2.

    z = seq(-6, 6, length = 200)
    cdf = pnorm(z, 0, 1)   ## pnorm for p(robability) normal
    plot(z, cdf, type = "l", lwd = 3)
    ### add these if you want a figure for a report
    pdf("normcdf.pdf", onefile = FALSE, height = 4, width = 4, pointsize = 8, paper = "special")
    plot(z, cdf, type = "l", lwd = 3)
    dev.off()   ### necessary to flush diagram into the file normcdf.pdf

Following the discussion above on how most of the probability is located between $-3 < Z < 3$, we are not surprised to see that $\Phi(z)$ is close to zero at $z = -3$; it rises to 0.5 at $z = 0$ (one half of the probability is below 0, the other above 0) and then flattens out at $z = 3$ after which there is almost no probability for the integral to add in.

Figure 9.1: Standardised Normal distribution, N(0, 1), probability density function (pdf).

Figure 9.2: Normal cumulative distribution function (cdf).


9.3

Normal Cdf

Traditionally, statistics books, and books of tables contained tabulations of the Normal cdf, (z). We will see below how these tables are used. However, because most statistics is now conducted using software packages, tables may be less frequently used, and may be less commonly encountered in textbooks. R-Example 5 . The following R code computes Table 9.1. z = seq(-4, 4, cdf = pnorm(z, z [1] -4 -3 -2 -1 cdf [1] 3.167124e-05 [6] 8.413447e-01 z Phi(z) length = 9) 0, 1) 0 1 2 3 4

1.349898e-03 2.275013e-02 1.586553e-01 5.000000e-01 9.772499e-01 9.986501e-01 9.999683e-01

-4 -3 -2 -1 0 1 3.2e-05 1.35e-03 2.28e-02 0.159 0.5 0.84 Table 9.1: Erf(z) for z = -4 to + 4.

2 3 4 0.977 0.999 0.99997

What does (z = 2) = 2.28 1002 = 0.0228 mean? Referring to Figure 9.1 it means that the amount of probability to the left of Z = 2 is 0.0228, i.e. as indicated by eqn. 9.5. Owing to the symmetry of Figure 9.1, we can state that the amount of probability to the right of of Z = +2 is also 0.0228. Hence the probability P (Z < 2 or Z > +2) = 2 0.0228 = 0.0456 or 4.56%. If we move a little closer to the mean, we get P (Z < 1.96 or Z > +1.96) = 2 0.025 = 0.05 or 5%. This 5% quartile (+/ 1.96) is used a lot in statistics. If P (Z < 1.96 or Z > +1.96) = 0.05 then P (1.96 < Z < +1.96) = 0.95. In a similar way, we can determine that P (Z < 1 or Z > +1) = 2 0.159 = 0.318; that is, a standard Normal random variable Z is between plus or minus one standard deviation of the mean 3.18% of the time. The 0.159 number is used below in Example 29.
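The tail probabilities quoted above can be reproduced directly with `pnorm`, and the 1.96 cut-off with the quantile function `qnorm`. This is a small added sketch; the variable names are illustrative, and the values differ slightly from the text because the text uses the rounded entries of Table 9.1.

```r
p_out2 <- 2 * pnorm(-2)      # P(Z < -2 or Z > +2), about 0.0455
p_out1 <- 2 * pnorm(-1)      # P(Z < -1 or Z > +1), about 0.317
p_in1  <- 1 - p_out1         # within one standard deviation, about 0.683
z_975  <- qnorm(0.975)       # cut-off leaving 0.025 in the upper tail, about 1.96
c(p_out2, p_out1, p_in1, z_975)
```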

9.4 Using the Normal Cdf

Example 29 Suppose we have a manufacturing process which takes fixed quantities of raw materials A (1000 g) and B (500 g) which react together to produce a product C in the form of a solid cake. The weights of the cakes, $X$, are monitored and those below a certain weight are set aside as B-grade. The manufacturer of the machine gives the yield expected value as $E(X) = 165$ grams with a variance of 9 and has determined that the yield follows the Normal distribution; that is, $\mu_X = 165$, $\sigma_X = \sqrt{9} = 3$ and $X \sim N(165, 3)$. We have decided that cakes below 162 grams should be marked as B-grade.

What is the probability that a randomly selected output will be less than 162 grams? We have no tables for $N(165, 3)$, but we do have tables for $N(0, 1)$, that is, the cdf for the standardised Normal, $\Phi(z)$.

Solution. (i) First we standardise using eqn. 9.2, $Z = (X - \mu)/\sigma = (X - 165)/3$, in which case the standardised weight corresponding to 162 is $Z_{162} = (162 - 165)/3 = -1$.

(ii) The probability that $Z < Z_{162}$ is just $\Phi(Z_{162}) = \Phi(-1)$ and we can read that from Table 9.1, i.e. the probability is 0.159 and 15.9% of the output will be B-grade.

(iii) Or, we can use R.

    pnorm(-1, 0, 1)   ## here explicitly giving mu and sigma.
    [1] 0.1586553
    pnorm(-1)         ## if none given, R assumes mu = 0, sigma = 1
    [1] 0.1586553

(iv) We can even let R handle the standardisation.

    pnorm(162, 165, 3)   ## here explicitly giving mu and sigma.
    [1] 0.1586553

Normal distribution appropriate? In Example 29 there can be an immediate objection to the Normal model. $X$ can never be less than zero, but $N(165, 3)$ assigns a probability greater than zero (but very, very small) to $X < 0$. In defence, we can argue that this probability is negligibly small, so that use of the Normal model should not introduce significant errors. If we had a weight with $E(X) = 4$, $V(X) = 9$, $\sigma = 3$, then we would have to question the Normal model.
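To put numbers on this argument, the sketch below (an added illustration; the variable names are hypothetical) computes the probability that each Normal model assigns to impossible negative weights.

```r
p_neg_large <- pnorm(0, mean = 165, sd = 3)  # z = -55: so small it underflows to zero
p_neg_small <- pnorm(0, mean = 4, sd = 3)    # z = -4/3: roughly 0.09
c(p_neg_large, p_neg_small)
```

With a mean of 4 and standard deviation of 3, roughly 9% of the modelled weights would be negative, which is why the Normal model would be questionable there.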

9.5 Sum of Independent Normal Random Variables

If $X_1 \sim N(\mu_1, \sigma_1)$ and $X_2 \sim N(\mu_2, \sigma_2)$ are independent random variables,

$$X = X_1 + X_2 \sim N(\mu, \sigma), \qquad (9.7)$$

where $\mu = \mu_1 + \mu_2$ and $Var(X) = \sigma^2 = \sigma_1^2 + \sigma_2^2$.

Add the means, add the variances; note: do not add the standard deviations.

Need example here.

Eqn. 9.7 generalises to give the distribution of a sum of $n$ independent observations of the same random variable. If $X_i \sim N(\mu, \sigma)$,

$$X = X_1 + X_2 + \ldots + X_n = \sum_{i=1}^{n} X_i \sim N(n\mu, \sqrt{n}\,\sigma). \qquad (9.8)$$

That is, add $n$ means, and add $n$ variances, so that $\sigma_{sum} = \sqrt{n\,Var(X)} = \sqrt{n}\,\sigma$.

Need example here.
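One possible illustration of eqns. 9.7 and 9.8, offered as a simulation sketch rather than the author's own example: the means, standard deviations and simulation sizes below are arbitrary assumptions chosen for easy arithmetic.

```r
set.seed(1)
n_sim <- 100000
x1 <- rnorm(n_sim, mean = 10, sd = 3)
x2 <- rnorm(n_sim, mean = 20, sd = 4)
s  <- x1 + x2
c(mean(s), sd(s))      # theory (eqn 9.7): 30 and sqrt(3^2 + 4^2) = 5, not 3 + 4 = 7

## eqn 9.8: sum of n = 16 independent observations from N(10, 3)
tot <- replicate(20000, sum(rnorm(16, mean = 10, sd = 3)))
c(mean(tot), sd(tot))  # theory: 16 * 10 = 160 and sqrt(16) * 3 = 12
```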

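9.6 Differences of Normal Random Variables

If $X_1 \sim N(\mu_1, \sigma_1)$ and $X_2 \sim N(\mu_2, \sigma_2)$ are independent random variables,

$$X = X_1 - X_2 \sim N(\mu, \sigma), \qquad (9.9)$$

where $\mu = \mu_1 - \mu_2$ and $Var(X) = \sigma^2 = \sigma_1^2 + \sigma_2^2$.

Take the difference of the means and add the variances (not the difference of the variances). Need example here.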
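A possible worked example of eqn. 9.9, with hypothetical numbers: suppose two independent processes give $X_1 \sim N(165, 3)$ and $X_2 \sim N(160, 4)$, and we ask how often a cake from the first process is lighter than one from the second, i.e. $P(X_1 - X_2 < 0)$.

```r
mu_d <- 165 - 160                      # difference of the means
sd_d <- sqrt(3^2 + 4^2)                # variances add, so sd = 5
p <- pnorm(0, mean = mu_d, sd = sd_d)  # P(X1 - X2 < 0) = Phi(-1), about 0.159
p
```

The standardised value is $(0 - 5)/5 = -1$, so the answer is the same $\Phi(-1)$ that appears in Table 9.1.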

9.7 Linear Transformations of Normal Random Variables

If $X \sim N(\mu, \sigma)$,

$$Y = aX + b \sim N(a\mu + b, |a|\sigma). \qquad (9.10)$$

Need example here.
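A possible example of eqn. 9.10, with hypothetical numbers: a temperature $X \sim N(20, 2)$ in degrees Celsius is converted to Fahrenheit with $Y = 1.8X + 32$. The same event should have the same probability on either scale.

```r
a <- 1.8; b <- 32
mu_x <- 20; sd_x <- 2
mu_y <- a * mu_x + b                 # 68
sd_y <- abs(a) * sd_x                # 3.6
p_y <- 1 - pnorm(75.2, mu_y, sd_y)   # P(Y > 75.2 F)
p_x <- 1 - pnorm(24, mu_x, sd_x)     # P(X > 24 C), the same event
c(p_y, p_x)                          # both about 0.0228
```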

9.8 The Central Limit Theorem

Why is the Normal distribution (a) so common and (b) so popular amongst statisticians? First, the Central Limit Theorem (CLT) states, roughly speaking, that if a random variable has been created by summing a large number of (independent) random variables, then the sum will have an approximately Normal distribution. Second, it is popular not just because of its common occurrence but because mathematics involving the distribution, eqn. 9.1 and its multivariate counterpart, is in many cases rather easy, or a good deal easier than mathematics involving some other distributions. A compact statement of the CLT, from (Wasserman 2004), is as follows.

Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed r.v.s with mean $\mu$ and standard deviation $\sigma$. Let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then, as $n \to \infty$,

$$Z_n = \frac{\bar{X}_n - \mu}{\sqrt{Var(\bar{X}_n)}} = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \rightsquigarrow Z, \qquad (9.11)$$

where $Z \sim N(0, 1)$.
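The CLT can be seen at work in a small simulation. The sketch below is an added illustration under assumed settings: it draws samples of size 50 from an exponential distribution (mean 1, standard deviation 1, and clearly not Normal), standardises each sample mean as in eqn. 9.11, and checks that the results behave like $N(0, 1)$.

```r
set.seed(42)
n <- 50
zn <- replicate(10000, {
  x <- rexp(n, rate = 1)          # skewed distribution with mean 1 and sd 1
  (mean(x) - 1) / (1 / sqrt(n))   # standardised sample mean, eqn 9.11
})
c(mean(zn), sd(zn))               # close to 0 and 1
mean(abs(zn) < 1.96)              # close to 0.95
```

A histogram of `zn` (for example `hist(zn)`) would show the familiar bell shape even though the underlying data are strongly skewed.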


Chapter 10 Statistical Inference


10.1 Introduction

We use the Normal distribution, eqn. 6.17, repeated here, to introduce statistical inference.

$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right], \quad -\infty < x < \infty. \qquad (10.1)$$

We may write $f_X$ as $f_X(x; \mu, \sigma)$ or $f_X(x; \theta_1, \theta_2)$, where $\theta_1, \theta_2$ are parameters. We may think of a family of Normal distributions, $N$, parametrised or labelled or indexed by $\theta_1, \theta_2$. Let us say we have performed an experiment and have collected a sample of random variables $X$, $x_1, x_2, \ldots, x_n$; we assume that $X \sim N(\mu, \sigma)$ but we do not know either one or other (or both) of the parameters.

Point Estimation. Parameter estimation is concerned with estimating parameters. A point estimate for, say, $\mu$ is an approximate value computed from the sample. Typically, in addition to the estimate, $\hat{\mu}$, we give some qualifications such as the variance of the estimate, that is, an indication of how variable we think $\hat{\mu}$ might be if we repeated the experiment a number of times.

Interval Estimation. An interval estimate (set estimate, confidence interval) for, say, $\mu$ is an interval $[\mu_1, \mu_2]$ computed from the sample which we claim to contain the real $\mu$. Typically, we give some indication of how plausible the interval is in the form of some sort of probability value.

Hypothesis Testing. A typical hypothesis testing example is when a scientist needs to test the efficacy of a new method. An experiment is performed where there are two methods, $M_1$ and $M_2$. Often, $M_1$ is a control (say the old method) and $M_2$ is the new method whose efficacy we wish to test. Let us keep the hypothesis simple by assuming that we wish to test whether $M_2$ will give a better yield than $M_1$.

Chapter 11 Statistical Estimation


11.1 Introduction

When we state for example $X \sim f_X(x; \theta_1, \theta_2)$, we indicate that the distribution depends on parameters $\theta_1, \theta_2$. For example, we may think of a family of Normal distributions, $N(\theta_1, \theta_2)$, parametrised or labelled or indexed by $\theta_1 = \mu$, $\theta_2 = \sigma$.

11.2 Populations and Samples

When we quote values of parameters, for example the mean and standard deviation of a Normally distributed r.v., $X \sim N(\mu, \sigma)$, we are talking about population parameters. Let us collect a sample of random variables $X$, $x_1, x_2, \ldots, x_n$; we assume that $X \sim N(\mu, \sigma)$ but we do not know either of the parameters. We must estimate them, and an obvious first attempt is to use the sample mean and standard deviation.

Note the difference: population versus sample. A population includes all possible random variables; a sample contains, well, a sample taken from the population. If you wanted a quick estimate of the mean salary of lecturers in the college, you could ask a number of lecturers you know and take the average of that sample. The Human Resources Department could give you an exact figure, because they have the data for the (complete) population of $N$ lecturers. They would compute the true population parameters as,

$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad (11.1)$$

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2. \qquad (11.2)$$

You could imagine that the larger your sample, the better the sample mean would approximate the population mean.

Random Sample However, apart from being a small sample, lecturers you know could contain another source of inaccuracy, namely that the sample is not random and so it may contain a bias due to the fact that, for example, the lecturers in your sample tend to be younger. By random sample we mean that each member of the population has an equal chance of being sampled. Achieving a random sample is not always easy, see Chapter 13.

11.3 Estimating the Mean

A point estimate for, say, $\mu$ is an approximate value computed from the sample. Typically, in addition to the estimate, $\hat{\mu}$, we give some qualifications such as the variance of the estimate, that is, an indication of how variable we think $\hat{\mu}$ might be if we repeated the experiment a number of times. The hat symbol, as in $\hat{\mu}$, is used to indicate that we have an estimate of $\mu$.

The most obvious estimate for $\mu$ is to copy eqn. 11.1, noting that we use capital $N$ for the size of the population and lower-case $n$ for the size of the sample,

$$\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i. \qquad (11.3)$$

In this context the bar, as in $\bar{x}$ (x bar), indicates mean or average. Need example here.
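A possible example, using made-up data: six lecturers' salaries (in thousands) form the sample, and eqn. 11.3 is applied directly and then compared with R's built-in `mean`. The numbers are hypothetical.

```r
x <- c(52, 48, 61, 55, 49, 58)   # hypothetical sample of salaries, in thousands
n <- length(x)
mu_hat <- sum(x) / n             # eqn 11.3
c(mu_hat, mean(x))               # the two agree
```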

11.4 Estimating the Standard Deviation

The best estimate for $\sigma$ is less obvious and eqn. 11.2 is modified slightly to,

$$\hat{\sigma}^2 = s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2. \qquad (11.4)$$

Thus, we not only replace $\mu$ by its estimate, $\bar{x}$, we divide by $n - 1$ instead of $n$. It is usual to use $s^2$ to denote sample variance. The reason for the $n - 1$ is that dividing by $n$ would generally lead to a systematic underestimate, a so-called bias. This may be discussed in a later chapter; (reference it if we do). Need example here.
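A possible example with the same kind of made-up salary data: eqn. 11.4 computed by hand agrees with R's `var`, which also divides by $n - 1$, while dividing by $n$ gives a smaller value. The data are hypothetical.

```r
x <- c(52, 48, 61, 55, 49, 58)          # hypothetical sample, in thousands
n <- length(x)
s2   <- sum((x - mean(x))^2) / (n - 1)  # eqn 11.4
s2_n <- sum((x - mean(x))^2) / n        # dividing by n: systematically smaller
c(s2, var(x), s2_n)
sqrt(s2)                                # sample standard deviation, same as sd(x)
```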

11.5 Sampling Distributions

11.5.1 Sampling Distribution of the mean

The estimate of the mean given by eqn. 11.3 is itself a random variable; we can imagine taking $m$ samples, each of size $n$, and each of these yielding a $\bar{x}_j$ for $j = 1, 2, \ldots, m$.

$$E(\bar{x}) = \mu, \qquad (11.5)$$

$$Var(\bar{x}) = \sigma^2/n. \qquad (11.6)$$

Therefore, the standard deviation of the estimate of the mean is $\sigma/\sqrt{n}$. We already encountered this in section 9.5 and eqn. 9.8. Eqns. 11.5 and 11.6 are both rather comforting: (a) the expected value of $\bar{x}$ is $\mu$ and (b) the standard deviation of $\bar{x}$ is $\sigma/\sqrt{n}$, that is, as $n$ increases the standard deviation decreases and will decrease to zero as $n \to \infty$.

Finally, we can state that the sampling distribution of $\bar{x}$ is $N(\mu, \sigma/\sqrt{n})$. This means that if we conduct a number of sample experiments (take a sample of $n$ $X$s and compute the mean $\bar{x}$), then $\bar{x}$ will be found to have a normal distribution centred on the true mean $\mu$. We note emphatically that we do not know $\mu$. In the first part of the discussion below, we assume that $\sigma^2$ is known. However, this is typically untrue, and we must use an estimate for the standard deviation, as in eqn. 11.4.

Figure 11.1 (Maindonald & Braun 2007, p. 103) shows two sampling distributions, for a random variable $X$ which has $\mu_X = 10$, $\sigma = 1$; Figure 11.1(a) shows the sampling distribution for a sample size of $n = 4$, while Figure 11.1(b) shows the sampling distribution for a sample size of $n = 9$; the distribution of $X$, corresponding to a sample size of $n = 1$, is shown for comparison.

The useful formula now is, including standardisation: if the estimator for $\mu$ (unknown) is $\bar{x}$ and $\sigma$ is known then

$$\frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1). \qquad (11.7)$$

On the other hand, if $\sigma$ is unknown, and we must replace $\sigma$ with an estimate, $s$, see eqn. 11.4, then

$$\frac{\bar{x} - \mu}{s/\sqrt{n}} \sim t_{n-1}, \qquad (11.8)$$

where $t_{n-1}$ is the Student t distribution with $n - 1$ degrees of freedom; see section 6.22. As with $N(0, 1)$, we have tables for the t distribution.
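Eqns. 11.5 and 11.6 can be checked by simulation. The sketch below is an added illustration with assumed values ($\mu = 10$, $\sigma = 1.5$, $n = 9$, chosen only so that $\sigma/\sqrt{n} = 0.5$); it also compares the Normal and t cut-off points that eqns. 11.7 and 11.8 lead to.

```r
set.seed(7)
mu <- 10; sigma <- 1.5; n <- 9           # illustrative values only
xbar <- replicate(20000, mean(rnorm(n, mu, sigma)))
c(mean(xbar), sd(xbar))                  # theory: 10 and 1.5 / sqrt(9) = 0.5
c(qnorm(0.975), qt(0.975, df = n - 1))   # the t cut-off is wider than 1.96
```

The wider t cut-off reflects the extra uncertainty from estimating $\sigma$ by $s$ in a small sample.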


Figure 11.1: (a) Sampling distribution for a sample size of n = 4; (b) sampling distribution for a sample size of n = 9; the distribution of X, corresponding to a sample size of n = 1 is shown for comparison.

11.5.2 Sampling Distribution for Estimates of the Standard Deviation

If the estimator for $\sigma$ (unknown) is $s$, see eqn. 11.4, and $\mu$ is also unknown, with estimate $\bar{x}$, then

$$\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{\sigma}\right)^2 = \frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}, \qquad (11.9)$$

where $\chi^2_n$ is the Chi-squared distribution with $n$ degrees of freedom (so here we have $n - 1$ degrees of freedom); see section 6.4.10. As with $N(0, 1)$ and $t$ we have tables for the $\chi^2_n$ distribution.
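As a rough check of eqn. 11.9, the following simulation sketch (an added illustration; $n = 10$, $\mu = 5$ and $\sigma = 2$ are arbitrary assumptions) computes $(n-1)s^2/\sigma^2$ for many samples and compares its mean and variance with those of a Chi-squared distribution with $n - 1 = 9$ degrees of freedom, namely 9 and 18.

```r
set.seed(11)
n <- 10; sigma <- 2
q <- replicate(20000, (n - 1) * var(rnorm(n, mean = 5, sd = sigma)) / sigma^2)
c(mean(q), var(q))   # chi-squared with 9 df has mean 9 and variance 18
```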

11.6 Confidence Intervals

In section 11.5.1 we established that the distribution of the sample mean is $\bar{x} \sim N(\mu, \sigma/\sqrt{n})$ or equivalently, eqn. 11.7, $\frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$. This tells us that the estimate has a distribution that is centred on the mean, that the expected value of the estimate is the mean, and that the distribution will have a standard deviation (spread) of $\sigma/\sqrt{n}$.

Thus, referring to Figure 11.1(a), we can say that the mean of $\bar{x}_4$ is $\mu$, the true mean which we do not know, and that different samples would vary between about 1.5 above and below the true mean. Hence if the true mean is 10 as in the diagram, and we kept repeating our sampling experiment, we would expect the estimate $\bar{x}_4$ to vary between about 8.5 and 11.5. On the other hand, if we used sample size $n = 9$, we would expect the estimate $\bar{x}_9$ to vary between about 9.0 and 11.0, see Figure 11.1(b).

The previous few sentences should be suggesting that we should be able to give a plausible interval estimate such as "we estimate that the mean is between 9 and 11", together with a probability for that assertion, e.g. about 0.95 as discussed in section 9.3 for $P(-1.96 < Z < +1.96)$. But unfortunately we cannot, for we do not know the true mean.

What can we say? Well, for example, that $P\left(-1.96 < \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} < +1.96\right) = 0.95$. Still not much good, for we do not know $\mu$ and we must be satisfied with the less useful statement that the estimate $\bar{x}$ is within plus-or-minus $1.96\,\sigma/\sqrt{n}$ of $\mu$, with a probability of 0.95.

More explanation may be needed. What if $\bar{x}$ is at one of these extremes, namely $\mu - 1.96\,\sigma/\sqrt{n}$; this would correspond to about 9 in Figure 11.1(a). We can then say that $\bar{x} + 1.96\,\sigma/\sqrt{n}$ just about reaches up to $\mu$. If we repeat the sampling, the upper end $\bar{x} + 1.96\,\sigma/\sqrt{n}$ will reach $\mu$ or beyond with a probability $1 - 0.025$, i.e. the amount of probability up to $Z = -1.96$ is 0.025. Similarly, take the case that $\bar{x}$ is at the other extreme, namely $\mu + 1.96\,\sigma/\sqrt{n}$; this would correspond to about 11 in Figure 11.1(a). We can now say that $\bar{x} - 1.96\,\sigma/\sqrt{n}$ just about reaches down to $\mu$. If we repeat the sampling, the lower end will reach down to $\mu$ or beyond with a probability $1 - 0.025$ (recall the symmetry argument in section 9.3). Consequently, if we take $\bar{x} \pm 1.96\,\sigma/\sqrt{n}$ we can say that this interval will capture $\mu$ with probability 0.95.

This allows us to construct a confidence interval which we can claim contains $\mu$; that is, we compute not $\hat{\mu}$, but $(L, U)$, an interval between (L)ower and (U)pper limits which we believe contains $\mu$. In the case of confidence (probability) 0.95 = 95%, we can compute

$$(L, U) = \left(\bar{x} - 1.96\,\frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96\,\frac{\sigma}{\sqrt{n}}\right). \qquad (11.10)$$

Summary on Point Estimation and Confidence Interval for the Mean when Variance Known. Refer to Figure 11.1, part (b) of which is based on a sample size of $n = 9$. If we take a point estimate for the mean, it will be distributed according to the narrow distribution, i.e. if the true mean is 10, our estimate can be anywhere between 9 and 11.

If we decide to give an interval estimate, we need to decide on a confidence (probability); the wider the interval, the greater the confidence we can have in it, but a huge interval with confidence of 100% is not much use to anyone. The usual confidence that is chosen is 95%. We would like to be able to look at Figure 11.1(b) and say that our interval for the mean is 9 to 11 with confidence 95% (based on the diagram this is approximate; $10 - 1.96 \times 0.5$ to $10 + 1.96 \times 0.5$ are the precise values for 95%). But we cannot make a statement like the latter, for we do not know that $\mu = 10$.

The best we can do is (a) take our estimate, $\bar{x}$; (b) place a distribution like that in Figure 11.1(b) about it; (c) compute the $\bar{x} \pm 1.96\,\sigma/\sqrt{n}$ (roughly $\pm 2\sigma/\sqrt{n}$) interval (eqn. 11.10). This allows us to state: if we repeated our sampling a large number of times, and we computed eqn. 11.10 each time (getting a different interval), then 95% of these intervals would contain the true mean $\mu$.

Excel-Example 1 Need Excel example here.

Need section on t-distribution and small sample sampling distrib. for mean with std.-dev. unknown.
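The repeated-sampling interpretation of eqn. 11.10 can be illustrated by simulation. In this added sketch the values $\mu = 10$, $\sigma = 1.5$ and $n = 9$ are assumptions chosen for illustration; the simulation knows $\mu$ only so that it can count how often the interval captures it.

```r
set.seed(3)
mu <- 10; sigma <- 1.5; n <- 9             # illustrative; mu is "unknown" to the analyst
half <- qnorm(0.975) * sigma / sqrt(n)     # 1.96 * sigma / sqrt(n), about 0.98
covered <- replicate(10000, {
  xbar <- mean(rnorm(n, mu, sigma))
  (xbar - half < mu) && (mu < xbar + half) # does (L, U) of eqn 11.10 contain mu?
})
mean(covered)                              # close to 0.95
```

Each simulated sample gives a different interval, and about 95% of them contain the true mean.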


Chapter 12 Hypothesis Testing


12.1 Introduction

In Chapter 11 we discussed estimation of parameters, both point estimates and interval estimates (with confidence value attached). This chapter is also based on sampling theory but here we are interested in decisions rather than estimates. For example, based on a sample of occurrences of heads and tails in a sample of $n = 10$ tosses of a coin, we might wish to come to the decision whether the coin is fair. We might want to decide whether application of a new fertiliser really does increase cropping yield, based on samples involving (i) the current fertiliser and (ii) the new one.

The hypothesis testing technique involves the postulation of a hypothesis (an assumption, a statement about population distributions or their parameters) and then designing an experiment which will yield a sample upon which we can decide whether the hypothesis is true.

A typical hypothesis test is as follows. We make a hypothesis that a random variable is distributed according to $f_X(x)$, e.g. $X \sim N(\mu, \sigma)$, where we assume that $\sigma$ is known. We identify a null hypothesis, $H_0: \mu = \mu_0$, and an alternative hypothesis, $H_A: \mu > \mu_0$. We compute a test statistic (a sample estimate with sample size $n$), for example $\hat{\mu} = \bar{X}_n$, and reject $H_0$ if $\bar{X}_n > c$, where $c$ is some constant to be determined; $\bar{X}_n > c$ is the critical region; $\bar{X}_n \le c$ is called the acceptance region.

The greater we make $c$, the smaller the significance level of the test $\bar{X}_n > c$, that is, the more extreme a sample mean must be before we reject $H_0$. We can set $c$ using the same considerations we used in setting confidence levels for a confidence interval in section 11.6. As in eqn. 11.7, we know that

$$Z = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \sim N(0, 1), \qquad (12.1)$$

so that we can use $\Phi(z)$ to choose a $c' = z$ such that $P(Z > c') = 0.025 = P\left(\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} > c'\right) = P\left(\bar{X}_n > c'\frac{\sigma}{\sqrt{n}} + \mu\right)$, say, for a 2.5% significance level. (I've chosen 2.5% = 0.025 because it corresponds to a cutoff point ($Z = 1.96$) that we have already encountered.)

That is, $Z > c'$ would occur only 2.5% of the time if $H_0$ is true; in other words the critical region stretches from $c'$ to the right of it. The acceptance region stretches to the left of $c'$, i.e. including everywhere that $\bar{X}_n \le c$, where $c = c'\frac{\sigma}{\sqrt{n}} + \mu_0$. Recalling $P(Z > +1.96) = 0.025$, we can set $c' = 1.96$ for a significance level of 0.025. The latter corresponds to a one sided test. The standard normal pdf and the relevant critical region are shown in Figure 12.1 (Maindonald & Braun 2007, p. 106).

Figure 12.1: One sided hypothesis test, significance level $\alpha = 0.025$; critical region is shaded to the right of 1.96. For a two sided test with significance level $\alpha = 0.05$, we include in the critical region also the marked region to the left of -1.96.

Let us keep the original null hypothesis, $H_0: \mu = \mu_0$, and now choose a different alternative hypothesis, namely $H_A: \mu \ne \mu_0$. A suitable acceptance region for this might be $c_l < \bar{X}_n < c_h$, with the critical (rejection) region being all points below $c_l$ and all points above $c_h$.

If we now choose a significance level of 0.05, we arrive at the familiar $P(Z < -1.96 \text{ or } Z > +1.96) = 0.05$, that is, if we have $\mu = \mu_0$, then values of $Z < -1.96$ or $Z > +1.96$, or $\frac{\bar{X}_n - \mu_0}{\sigma/\sqrt{n}} < -1.96$ or $\frac{\bar{X}_n - \mu_0}{\sigma/\sqrt{n}} > +1.96$, should occur only 5% of the time, and this is a sufficiently significant deviation for us to reject the null hypothesis. This is a two sided test.

The significance level, usually denoted $\alpha$, corresponds to the probability of rejecting $H_0$ when $H_0$ is true, that is, the extreme values in the critical region could occur, but with a small probability, $\alpha$. Table 12.1 shows the possible outcomes of the hypothesis test.

                H0 true                     HA true
    Accept H0   correct                     Type 2 error, prob. beta
    Reject H0   Type 1 error, prob. alpha   correct

Table 12.1: Outcomes of a hypothesis test.
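The one sided and two sided tests described in this chapter can be sketched in a few lines of R. The numbers are hypothetical (null mean 165, known $\sigma = 3$, sample of $n = 9$ with observed mean 167.2), and the variable names are illustrative only.

```r
mu0 <- 165; sigma <- 3; n <- 9; xbar <- 167.2   # hypothetical values
z <- (xbar - mu0) / (sigma / sqrt(n))           # eqn 12.1 under H0; here z = 2.2
c_prime <- qnorm(1 - 0.025)                     # one sided cut-off, about 1.96
reject_one_sided <- z > c_prime                 # TRUE: z falls in the critical region
c_crit <- c_prime * sigma / sqrt(n) + mu0       # the same cut-off on the scale of xbar
p_one <- 1 - pnorm(z)                           # one sided p-value
p_two <- 2 * (1 - pnorm(abs(z)))                # two sided p-value
c(z, c_crit, p_one, p_two)
```

Here the sample mean 167.2 exceeds the critical value of about 166.96, so $H_0$ would be rejected at the 0.025 level (one sided) and at the 0.05 level (two sided).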


Chapter 13 Sampling
13.1 Introduction

To be completed.


Chapter 14 Classification and Pattern Recognition


14.1 Introduction

The terms classification and pattern recognition are used almost synonymously; statisticians tend to favour classification, while engineers tend to use pattern recognition. This chapter merely introduces the concepts; Chapters 15, 16, 17, 18 and 19 fill in the details. These chapters are a reworking of some of the basic pattern recognition and neural network material covered in (Campbell 2005), (Campbell & Murtagh 1998) and (Campbell 2000). We define/summarize a pattern recognition system using the block diagram in Figure 14.1.

[Figure 14.1 block diagram: sensed data $x$ enters the Pattern Recognition System (Classifier), which outputs the class $\omega$.]

Figure 14.1: Pattern recognition system; $x$ a tuple of $p$ measurements, output $\omega$ class label.

Typically textbooks distinguish between supervised classification and unsupervised classification.

Supervised classification. Supervised (trained) classification may be posed as a prediction problem rather like regression. The prediction involves class labels. We have a set of examples, a sample, which we call training data, $X_T = \{x_i, \omega_i\}_{i=1}^{n}$. We learn population parameters from the sample of $x$s.

Warning: in some classification and pattern recognition literature, the term sample takes on a different meaning from the standard statistical term, where a statistical sample means a set of random vectors taken from a population; in the pattern recognition literature a sample may mean a single random vector, so that a statistical sample will have to be termed a set of samples.

$x$ is the pattern vector; of course, in certain situations $x$ is a simple scalar. $\omega$ is the class label, $\omega \in \Omega = \{\omega_1, \ldots, \omega_c\}$. Then, given an unseen pattern $x$ (a random vector), we predict $\omega$. In general, $x = (x_0\ x_1\ \ldots\ x_{p-1})^T$, a $p$-dimensional vector; $T$ denotes transposition.

Unsupervised classification. Unsupervised classification is more of an exploratory data analysis technique than is supervised classification. In this case we have a set of patterns (random vectors) $X_T = \{x_i\}_{i=1}^{n}$ and we want to explore structure in the set. For example, are they clustered, thereby suggesting that the clusters identify a number of classes? Clustering involves assigning class labels to the $X_T = \{x_i\}_{i=1}^{n}$ based not on training data but on proximity of the $x$s or some other criterion.

Chapter 15 Simple Classifier Methods


15.1 Thresholding for one-dimensional data

Let us assume that we want to classify a chemical product, for example fake pharmaceutical drugs, according to the results of a chemical analysis. The analysis data comprise a vector $x$ where $x_1$ might be percentage mass of component 1, $x_2$ component 2, etc. The label $\omega$ might be country of origin, and it is this that we want to predict, given the results $x$ from an analysis of a newly seized batch.

For the moment, we'll assume just two classes $\omega_0$ and $\omega_1$; two-class problems are easy to describe, yet extension to n-class problems is easy. In our simplistic recognition system we require to recognise two sources, country 0 and country 1, $\omega_0$ and $\omega_1$. We start off with two components $x = (x_1\ x_2)^T$. As described in Chapter 14, we have earlier obtained examples of the drug from both countries, $X_T = \{x_i, \omega_i\}_{i=1}^{n}$, i.e. we have training data, or a sample.

Let us see whether we can recognise using component 1 alone ($x_1$). Figure 15.1 shows some (training) data. We see that a threshold ($T$) set at about $x_1 = 2.8$ is the best we can do; the classification algorithm is:

$$\omega = 1 \text{ when } x_1 \le T, \qquad (15.1)$$

$$\omega = 0 \text{ otherwise}. \qquad (15.2)$$
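The threshold rule of eqns. 15.1 and 15.2 is a one-liner in R. This is an added sketch: the function name `classify_threshold`, the argument name `thr` and the test measurements are hypothetical.

```r
## class 1 when x1 <= threshold, class 0 otherwise (eqns 15.1 and 15.2)
classify_threshold <- function(x1, thr = 2.8) ifelse(x1 <= thr, 1, 0)

x1_new <- c(1.5, 2.8, 3.0, 4.2)        # hypothetical new measurements
labels <- classify_threshold(x1_new)
labels                                  # 1 1 0 0
```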

Use of histograms, see Figure 15.2, might be a more methodical way of determining the threshold, $T$. If enough training data were available, $n \to \infty$, the histograms, $h_0(x_1)$, $h_1(x_1)$, properly normalised, would approach probability densities $p_0(x_1)$, $p_1(x_1)$, more properly called class conditional probability densities (pdfs): $p(x_1 \mid \omega)$, $\omega = 0, 1$, see Figure 15.3.

When the random vector is three-dimensional ($p = 3$) or more, it becomes impossible to estimate the pdfs using histogram binning: there are a great many bins, and most of them contain no data. In such cases it is usual to assume a distribution family, for example Normal, and to represent the class conditional pdfs using parameters estimated from a sample (training data; estimation = training); see Chapter 11.

[Figure 15.1 plot: training points labelled 0 and 1 along the $x_1$ axis, with the threshold $T$ marked.]

Figure 15.1: Component 1, $x_1$.

[Figure 15.2 plot: frequency histograms $h_1(x_1)$ and $h_0(x_1)$ over the same $x_1$ axis, with the threshold $T$ marked.]

Figure 15.2: Histogram of component 1, $x_1$.

[Figure 15.3 plot: class conditional densities $p(x_1 \mid 1)$ and $p(x_1 \mid 0)$ over the same $x_1$ axis, with the threshold $T$ marked.]

Figure 15.3: Class conditional pdfs.

The use of explicitly statistical methods is described in Chapter 16, but for now we'll try some intuitive methods; as you will see, we are never far from statistics.

15.2 Linear separating lines/planes for two-dimensions

Since there is overlap in the component-1, $x_1$, measurement, let us use the two components, $x = (x_1\ x_2)^T$, i.e. (component-1, component-2). Figure 15.4 shows a scatter plot of these data (the sample).

[Figure 15.4 plot: scatter of points labelled 0 and 1 in the ($x_1$, $x_2$) plane, with a dotted separating line.]

Figure 15.4: Two dimensions, scatter plot.

The dotted line shows that the data are separable by a straight line; it intercepts the axes at $x_1 = 4.5$ and $x_2 = 6$. Apart from plotting the data and drawing the line, how could we derive the separating line from the data? (Thinking of a computer program.)

15.3 Nearest mean classifier

First we estimate the class conditional means $\mu_0 = E(x \mid \omega = 0)$ and $\mu_1 = E(x \mid \omega = 1)$. Figure 15.5 shows the line joining the class means and the perpendicular bisector of this line; the perpendicular bisector turns out to be the separating line. We can derive the equation of the separating line using the fact that points on it are equidistant to both means, $\mu_0$, $\mu_1$, and expand using Pythagoras's theorem,

$$|x - \mu_0|^2 = |x - \mu_1|^2, \qquad (15.3)$$

$$(x_1 - \mu_{01})^2 + (x_2 - \mu_{02})^2 = (x_1 - \mu_{11})^2 + (x_2 - \mu_{12})^2. \qquad (15.4)$$

We eventually obtain

$$2(\mu_{01} - \mu_{11})x_1 + 2(\mu_{02} - \mu_{12})x_2 - (\mu_{01}^2 + \mu_{02}^2 - \mu_{11}^2 - \mu_{12}^2) = 0, \qquad (15.5)$$

which is of the form

$$b_1 x_1 + b_2 x_2 - b_0 = 0. \qquad (15.6)$$

[Figure 15.5 plot: the scatter of Figure 15.4 with the two class means, the line joining them, and its perpendicular bisector.]

Figure 15.5: Two dimensional scatter plot showing means and separating line.

In Figure 15.5, $\mu_{01} = 4$, $\mu_{02} = 3$, $\mu_{11} = 2$, $\mu_{12} = 1.5$; with these values, eqn 15.6 becomes

$$4x_1 + 3x_2 - 18.75 = 0, \qquad (15.7)$$

which intercepts the $x_1$ axis at $18.75/4 \approx 4.7$ and the $x_2$ axis at $18.75/3 = 6.25$.
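As a sketch of how a program might derive the separating line, the R code below computes the coefficients of eqn. 15.7 from the two class means quoted for Figure 15.5 and also classifies a point directly by its nearest mean. The function name `nearest_mean` and the test points are illustrative assumptions.

```r
mu0 <- c(4, 3); mu1 <- c(2, 1.5)     # class means quoted for Figure 15.5
b  <- 2 * (mu0 - mu1)                # (b1, b2) = (4, 3)
b0 <- sum(mu0^2) - sum(mu1^2)        # 18.75
intercepts <- b0 / b                 # x1 axis at 4.6875, x2 axis at 6.25

## classify by comparing squared distances to the two means (eqn 15.3)
nearest_mean <- function(x) if (sum((x - mu0)^2) < sum((x - mu1)^2)) 0 else 1
nearest_mean(c(5, 4))                # 0
nearest_mean(c(1, 1))                # 1
```

The midpoint of the two means, (3, 2.25), lies exactly on the line, as a perpendicular bisector requires.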

15.4 Normal form of the separating line, projections, and linear discriminants

Eqn 15.6 becomes more interesting and useful in its normal form,

$$a_1 x_1 + a_2 x_2 - a_0 = 0, \qquad (15.8)$$

where $a_1^2 + a_2^2 = 1$; eqn 15.8 can be obtained from eqn 15.6 by dividing across by $\sqrt{b_1^2 + b_2^2}$.

Figure 15.6 shows interpretations of the normal form straight line equation, eqn 15.8. The coefficients of the unit vector normal to the line are $n = (a_1\ a_2)^T$ and $a_0$ is the perpendicular distance from the line to the origin. Incidentally, the components correspond to the direction cosines of $n = (a_1\ a_2)^T = (\cos\theta\ \sin\theta)^T$. Thus, (Foley, van Dam, Feiner, Hughes & Phillips 1994) $n$ corresponds to one row of a (frame) rotating matrix; in other words, see below, section 15.5, the dot product of the vector expression of a point with $n$ corresponds to projection onto $n$. (Note that $\cos(\pi/2 - \theta) = \sin\theta$.)

[Figure 15.6 diagram: the line $a_1x_1 + a_2x_2 - a_0 = 0$ with its normal vector $(a_1, a_2)$ at angle $\theta$, axis intercepts $a_0/a_1$ and $a_0/a_2$, and perpendicular distance $a_0$ from the origin; $a_1x_1 + a_2x_2 - a_0 > 0$ on the side to which the normal points and $< 0$ on the other side.]

Figure 15.6: Normal form of a straight line, interpretations.

Also as shown in Figure 15.6, points $x = (x_1\ x_2)^T$ on the side of the line to which $n = (a_1\ a_2)^T$ points have $a_1x_1 + a_2x_2 - a_0 > 0$, whilst points on the other side have $a_1x_1 + a_2x_2 - a_0 < 0$; as we know, points on the line have $a_1x_1 + a_2x_2 - a_0 = 0$.

15.5 Projection and linear discriminant

We know that $a_1x_1 + a_2x_2 = n^T x$, the dot product of $n = (a_1\ a_2)^T$ and $x$, represents the projection of points $x$ onto $n$, yielding the scalar value along $n$, with $a_0$ fixing the origin. This is plausible: projecting onto $n$ yields optimum separability. Such a projection,

$$g(x) = a_1x_1 + a_2x_2, \qquad (15.9)$$

is called a linear discriminant; now we can adapt eqn. 15.2,

$$\omega = 0 \text{ when } g(x) > a_0, \qquad (15.10)$$

$$\omega = 1, \quad g(x) < a_0, \qquad (15.11)$$

$$\omega = \text{tie}, \quad g(x) = a_0. \qquad (15.12)$$

Linear discriminants, eqn. 15.9, are often written as

$$g(x) = a_1x_1 + a_2x_2 - a_0, \qquad (15.13)$$

whence eqns. 15.10 to 15.12 become

$$\omega = 0 \text{ when } g(x) > 0, \qquad (15.14)$$

$$\omega = 1, \quad g(x) < 0, \qquad (15.15)$$

$$\omega = \text{tie}, \quad g(x) = 0. \qquad (15.16)$$
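Putting sections 15.4 and 15.5 together, the following sketch normalises the line $4x_1 + 3x_2 - 18.75 = 0$ of eqn. 15.7 and uses the resulting linear discriminant to classify points. The function names `g` and `classify`, and the use of `NA` to mark a tie, are illustrative choices.

```r
b <- c(4, 3); b0 <- 18.75             # coefficients of eqn 15.7
norm_b <- sqrt(sum(b^2))              # sqrt(4^2 + 3^2) = 5
a  <- b / norm_b                      # unit normal n = (0.8, 0.6)
a0 <- b0 / norm_b                     # perpendicular distance to origin, 3.75

g <- function(x) sum(a * x) - a0      # eqn 15.13
classify <- function(x) {             # eqns 15.14 to 15.16; NA marks a tie
  v <- g(x)
  if (v > 0) 0 else if (v < 0) 1 else NA
}
classify(c(4, 3))                     # 0: the class 0 mean lies on the positive side
classify(c(2, 1.5))                   # 1
```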

15.6 Projections and linear discriminants in p dimensions

Equation 15.13 readily generalises to $p$ dimensions; $n$ is a unit vector in $p$ dimensional space, normal to the $(p - 1)$-dimensional separating hyperplane. For example, when $p = 3$, $n$ is the unit vector normal to the separating plane. Other important projections used in pattern recognition are Principal Components Analysis (PCA) and Fisher's Linear Discriminant Analysis (LDA), see Chapter 17.

15.7 Template Matching and Discriminants

An intuitive (but well founded) classification method is that of template matching or correlation matching. Here we have perfect or average examples of classes stored in vectors $\{z_j\}_{j=1}^{c}$, one for each class. Without loss of generality, we assume that all vectors are normalised to unit length. Classification of a newly arrived vector $x$ entails computing its template/correlation match with all $c$ templates:

$$x^T z_j; \qquad (15.17)$$

the class is chosen as the $j$ corresponding to the maximum of eqn. 15.17. Yet again we see that classification involves dot product, projection, and a linear discriminant.
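A minimal sketch of template matching, assuming two made-up templates and one made-up input vector, all scaled to unit length; the helper `unit` and the numbers are hypothetical.

```r
unit <- function(v) v / sqrt(sum(v^2))        # scale a vector to unit length

Z <- rbind(unit(c(4, 3)), unit(c(2, 0.5)))    # hypothetical templates, one row per class
x <- unit(c(3.5, 3.2))                        # hypothetical newly arrived vector
scores <- as.vector(Z %*% x)                  # x^T z_j for each template (eqn 15.17)
best <- which.max(scores)                     # index j of the best-matching template
best
```

Because all vectors have unit length, each score is the cosine of the angle between $x$ and a template, so no score can exceed 1.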

15.8 Nearest neighbour methods

Obviously, we may not always have the linear separability of Figure 15.5. One non-parametric method is to go beyond nearest mean, see eqn. 15.4, to compute the nearest neighbour in the entire training data set, and to decide class according to the class of the nearest neighbour. A variation is k-nearest neighbour, where a vote is taken over the classes of the k nearest neighbours.
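A minimal base-R sketch of the k-nearest neighbour rule just described, using Euclidean distance and a majority vote. The function `knn_predict` and the six training points are hypothetical, and ties in the vote are resolved simply by taking the first class in sorted order.

```r
knn_predict <- function(train_x, train_y, x, k = 3) {
  d <- sqrt(rowSums(sweep(train_x, 2, x)^2))   # distance from x to every training point
  nearest <- order(d)[1:k]                     # indices of the k closest points
  votes <- table(train_y[nearest])             # count class labels among them
  as.numeric(names(votes)[which.max(votes)])   # majority class
}

train_x <- rbind(c(4, 3), c(4.5, 3.5), c(3.8, 2.6),   # hypothetical class 0 examples
                 c(2, 1.5), c(1.6, 1.2), c(2.4, 1.9)) # hypothetical class 1 examples
train_y <- c(0, 0, 0, 1, 1, 1)
knn_predict(train_x, train_y, c(3.9, 3.1), k = 1)     # 0
knn_predict(train_x, train_y, c(2.1, 1.4), k = 3)     # 1
```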

Chapter 16 Statistical Classifier Methods


16.1 One-dimensional classification revisited

Recall Figure 15.3, repeated here as Figure 16.1.

[Figure 16.1 plot: class conditional densities $p(x_1 \mid 1)$ and $p(x_1 \mid 0)$ over the $x_1$ axis, with the threshold $T$ marked.]

Figure 16.1: Class conditional densities.

We have class conditional pdfs: $p(x_1 \mid \omega)$, $\omega = 0, 1$; given a newly arrived $x_1$ we might decide on its class according to the maximum class conditional pdf at $x_1$, i.e. set a threshold $T$ where $p(x_1 \mid 0)$ and $p(x_1 \mid 1)$ cross, see Figure 16.1.

This is not completely correct. What we want is the probability of each class, its posterior probability, based on the evidence supplied by the data, combined with any prior evidence.

In what follows, $P(\omega_i \mid x)$ is the posterior probability or a posteriori probability of class $\omega_i$ given the observation $x$; $P(\omega_i)$ is the prior probability or a priori probability. We use upper case $P(.)$ for discrete probabilities, whilst lower case $p(.)$ denotes probability densities.

Bayes Rule

Recall Bayes rule from eqn. 5.22 and repeated here,

$$P(A_i \mid B) = \frac{P(B \mid A_i)P(A_i)}{\sum_{j=1}^{n} P(B \mid A_j)P(A_j)}. \qquad (16.1)$$

This says that the posterior probability of $A_i$ given $B$ (conditional on $B$ having occurred) is the product of the conditional probability of $B$ given $A_i$ and the prior probability of $A_i$, all divided by $P(B) = \sum_{i=1}^{n} P(B \mid A_i)P(A_i)$.

We can rewrite eqn. 16.1 in terms of our random variable $x$ (= $B$) and our classes $\omega_0, \omega_1$ (= $A_i$, $i = 0, 1$) to get

$$P(\omega_i \mid x) = \frac{P(x \mid \omega_i)P(\omega_i)}{\sum_{j=0}^{1} P(x \mid \omega_j)P(\omega_j)}. \qquad (16.2)$$

$P(\omega_i \mid x)$ is the posterior probability of class $\omega_i$ given that our analysis has yielded $x$; $P(\omega_i)$ is the prior probability; if we have no prior preference, then $P(\omega_0) = 0.5$, $P(\omega_1) = 0.5$.

Eqn. 16.2 forms a Bayes decision rule: compute the two posterior probabilities and take the class which has the maximum. Let the Bayes decision rule be represented by a function $g(.)$ of the feature vector $x$:

$$g(x) = \arg\max_{\omega_j} [P(\omega_j \mid x)]. \qquad (16.3)$$

To show that the Bayes decision rule, eqn. 16.3, achieves the minimum probability of error, we compute the probability of error conditional on the feature vector $x$, the conditional risk associated with it:

$$R(g(x) = \omega_j \mid x) = \sum_{k=1, k \ne j}^{c} P(\omega_k \mid x). \qquad (16.4)$$

That is to say, for the point $x$ we compute the posterior probabilities of all the $c - 1$ classes not chosen. Since $\Omega = \{\omega_1, \ldots, \omega_c\}$ form a partition (they are mutually exclusive and exhaustive) and the $\{P(\omega_k \mid x)\}_{k=1}^{c}$ are probabilities and so sum to unity, eqn. 16.4 reduces to:

$$R(g(x) = \omega_j) = 1 - P(\omega_j \mid x). \qquad (16.5)$$

It immediately follows that, to minimise $R(g(x) = \omega_j)$, we maximise $P(\omega_j \mid x)$, thus establishing the optimality of eqn. 16.3. The problem now is to determine $P(\omega \mid x)$, which brings us to Bayes rule.
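To make the Bayes decision rule concrete, the sketch below assumes (purely for illustration) that the class conditional densities are Normal, with class 0 as $N(4, 1)$ and class 1 as $N(2, 1)$, and computes the posterior probabilities and the resulting decision. The function names `posterior` and `bayes_class` are hypothetical.

```r
## assumed class conditional densities: class 0 ~ N(4, 1), class 1 ~ N(2, 1)
posterior <- function(x1, prior = c(0.5, 0.5)) {
  lik <- c(dnorm(x1, mean = 4, sd = 1),   # p(x | class 0)
           dnorm(x1, mean = 2, sd = 1))   # p(x | class 1)
  lik * prior / sum(lik * prior)          # Bayes rule: posterior for each class
}
## decision rule: pick the class with the largest posterior (classes coded 0 and 1)
bayes_class <- function(x1, prior = c(0.5, 0.5)) which.max(posterior(x1, prior)) - 1

posterior(3)                              # 0.5 0.5 at the midpoint with equal priors
bayes_class(2.5)                          # 1
bayes_class(2.9)                          # 1 with equal priors
bayes_class(2.9, prior = c(0.9, 0.1))     # 0: a strong prior for class 0 changes the decision
```

The last two calls show why the maximum class conditional density alone is "not completely correct": the same observation is assigned to different classes once the priors differ.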

16.2

Bayes Rule for the Inversion of Conditional Probabilities

****[Needs tidying and made compatible with previous section.]

From the definition of conditional probability, we have:

$$p(\omega, x) = P(\omega \mid x)\,p(x), \tag{16.6}$$

and, owing to the fact that the events in a joint probability are interchangeable, we can equate the joint probabilities:

$$p(\omega, x) = p(x, \omega) = p(x \mid \omega)\,P(\omega). \tag{16.7}$$

Therefore, equating the right hand sides of these equations, and rearranging, we arrive at Bayes' rule for the posterior probability $P(\omega \mid x)$:

$$P(\omega \mid x) = \frac{p(x \mid \omega)\,P(\omega)}{p(x)}. \tag{16.8}$$

$P(\omega)$ expresses our belief that $\omega$ will occur, prior to any observation. If we have no prior knowledge, we can assume equal priors for each class: $P(\omega_1) = P(\omega_2) = \ldots = P(\omega_c)$, with $\sum_{j=1}^{c} P(\omega_j) = 1$. Although we avoid further discussion here, we note that the choice of prior probabilities is the subject of considerable discussion, especially in the literature on Bayesian inference; see, for example, (Sivia 1996). $p(x)$ is the unconditional probability density of $x$, and can be obtained by summing the conditional densities, each weighted by its prior:

$$p(x) = \sum_{j=1}^{c} p(x \mid \omega_j)\,P(\omega_j). \tag{16.9}$$

Thus, to solve eqn. 16.8, it remains to estimate the conditional densities.

16.3

Parametric Methods

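Where we can assume that the densities follow a particular form, for example Gaussian, the density estimation problem is reduced to that of estimation of parameters. The multivariate normal density, see section B.7, p-dimensional, is given by:

$$p(x \mid \omega_j) = \frac{1}{(2\pi)^{p/2}\,|K_j|^{1/2}} \exp\left[ -\frac{1}{2} (x - \mu_j)^T K_j^{-1} (x - \mu_j) \right]. \tag{16.10}$$

$p(x \mid \omega_j)$ is completely specified by $\mu_j$, the p-dimensional mean vector, and $K_j$, the corresponding $p \times p$ covariance matrix:

$$\mu_j = E[x]_{\omega = \omega_j}, \tag{16.11}$$

$$K_j = E[(x - \mu_j)(x - \mu_j)^T]_{\omega = \omega_j}. \tag{16.12}$$

The respective maximum likelihood estimates are:

$$\hat{\mu}_j = \frac{1}{N_j} \sum_{n=1}^{N_j} x_n, \tag{16.13}$$

and,

$$\hat{K}_j = \frac{1}{N_j - 1} \sum_{n=1}^{N_j} (x_n - \hat{\mu}_j)(x_n - \hat{\mu}_j)^T, \tag{16.14}$$

where we have separated the training data $X_T = \{x_n, \omega_n\}_{n=1}^{N}$ into sets according to class. (Strictly, the maximum likelihood estimate of the covariance divides by $N_j$; the divisor $N_j - 1$ in eqn. 16.14 gives the unbiased estimate.)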
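The estimates of eqns. 16.13 and 16.14 can be sketched in a few lines of Python. This is an illustrative fragment, not the author's code; the function names and the tiny labelled data set are assumptions, and plain lists are used so that nothing beyond the standard library is needed.

```python
def split_by_class(features, labels):
    """Separate the training vectors into one list per class label."""
    groups = {}
    for x, w in zip(features, labels):
        groups.setdefault(w, []).append(x)
    return groups

def mean_vector(samples):
    """Sample mean, following eqn. 16.13."""
    n = len(samples)
    p = len(samples[0])
    return [sum(x[i] for x in samples) / n for i in range(p)]

def covariance_matrix(samples):
    """Sample covariance with divisor N_j - 1, following eqn. 16.14."""
    n = len(samples)
    p = len(samples[0])
    mu = mean_vector(samples)
    return [[sum((x[i] - mu[i]) * (x[k] - mu[k]) for x in samples) / (n - 1)
             for k in range(p)]
            for i in range(p)]

# Hypothetical training set: five 2-dimensional feature vectors, two classes.
features = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0), (10.0, 0.0), (12.0, 2.0)]
labels = [0, 0, 0, 1, 1]
groups = split_by_class(features, labels)
estimates = {w: (mean_vector(xs), covariance_matrix(xs)) for w, xs in groups.items()}
```

Each entry of `estimates` holds the pair of parameters that specifies the Gaussian density of eqn. 16.10 for one class.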

16.4

Discriminants based on Normal Density

We may write eqn. 16.8 as a discriminant function:

$$g_j(x) = P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\,P(\omega_j)}{p(x)}, \tag{16.15}$$

so that classification, eqn. 16.3, becomes a matter of assigning $x$ to class $\omega_j$ if

$$g_j(x) > g_k(x), \quad \forall k \neq j. \tag{16.16}$$

Since $p(x)$, the denominator of eqn. 16.15, is the same for all $g_j(x)$, and since eqn. 16.16 involves comparison only, we may rewrite eqn. 16.15 as

$$g_j(x) = p(x \mid \omega_j)\,P(\omega_j). \tag{16.17}$$

We may derive a further possible discriminant by taking the logarithm of eqn. 16.17; since the logarithm is a monotonically increasing function, application of it preserves the relative order of its arguments:

$$g_j(x) = \log p(x \mid \omega_j) + \log P(\omega_j). \tag{16.18}$$

In the multivariate Gaussian case, eqn. 16.18 becomes (Duda & Hart 1973),

$$g_j(x) = -\frac{1}{2} (x - \mu_j)^T K_j^{-1} (x - \mu_j) - \frac{p}{2} \log 2\pi - \frac{1}{2} \log |K_j| + \log P(\omega_j). \tag{16.19}$$

Henceforth, we refer to eqn. 16.19 as the Bayes-Gauss classifier. The multivariate normal (Gaussian) density provides a good characterisation of pattern (vector) distribution where we can model the generation of patterns as ideal pattern plus measurement noise; for an instance of a measured vector $x$ from class $\omega_j$:

$$x_n = \mu_j + e_n, \tag{16.20}$$

where $e_n \sim N(0, K_j)$, that is, the noise covariance is class dependent.
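To make eqn. 16.19 concrete, here is a hedged Python sketch of the Bayes-Gauss discriminant for the two-dimensional case (p = 2), where the inverse and determinant of the covariance matrix can be written out by hand. The restriction to two dimensions, the function names and the example class parameters are all assumptions for illustration only.

```python
import math

def inverse_and_det_2x2(K):
    """Inverse and determinant of a 2 x 2 covariance matrix."""
    (a, b), (c, d) = K
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]], det

def bayes_gauss_discriminant(x, mu, K, prior):
    """g_j(x) of eqn. 16.19 for p = 2."""
    K_inv, det = inverse_and_det_2x2(K)
    diff = [x[0] - mu[0], x[1] - mu[1]]
    quad = sum(diff[i] * K_inv[i][k] * diff[k] for i in range(2) for k in range(2))
    p = 2
    return (-0.5 * quad - 0.5 * p * math.log(2.0 * math.pi)
            - 0.5 * math.log(det) + math.log(prior))

def classify(x, classes):
    """Assign x to the class with the largest discriminant (eqn. 16.16)."""
    scores = [bayes_gauss_discriminant(x, mu, K, prior) for mu, K, prior in classes]
    return max(range(len(scores)), key=lambda j: scores[j])

# Hypothetical classes: (mean vector, covariance matrix, prior probability).
identity = [[1.0, 0.0], [0.0, 1.0]]
classes = [((0.0, 0.0), identity, 0.5), ((3.0, 3.0), identity, 0.5)]
```

Because only comparisons between the $g_j(x)$ matter, the constant $-\frac{p}{2}\log 2\pi$ could be dropped without changing any decision; it is kept here so that the code matches eqn. 16.19 term by term.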

16.5

Bayes-Gauss Classifier Special Cases

(Duda & Hart 1973, pp. 26-31) Revealing comparisons with the other learning paradigms which play an important role in this thesis are made possible if we examine particular forms of noise covariance in which the Bayes-Gauss classifier reduces to certain interesting limiting forms:

- Equal and Diagonal Covariances ($K_j = \sigma^2 I$, $\forall j$, where $I$ is the unit matrix); in this case certain important equivalences with eqn. 16.19 can be demonstrated:
  - Nearest mean classifier;
  - Linear discriminant;
  - Template matching;
  - Matched filter;
  - Single layer neural network classifier.
- Equal but Non-diagonal Covariance Matrices:
  - Nearest mean classifier using Mahalanobis distance; and, as in the case of diagonal covariance,
  - Linear discriminant function;
  - Single layer neural network.

16.5.1

Equal and Diagonal Covariances

When each class has the same covariance matrix, and these are diagonal, we have $K_j = \sigma^2 I$, so that $K_j^{-1} = \frac{1}{\sigma^2} I$. Since the covariance matrices are equal, we can eliminate the $\frac{1}{2} \log |K_j|$ term; the $\frac{p}{2} \log 2\pi$ term is constant in any case; thus, including the simplification of the $(x - \mu_j)^T K_j^{-1} (x - \mu_j)$ term, eqn. 16.19 may be rewritten:

$$g_j(x) = -\frac{1}{2\sigma^2} (x - \mu_j)^T (x - \mu_j) + \log P(\omega_j) \tag{16.21}$$

$$\phantom{g_j(x)} = -\frac{1}{2\sigma^2} \| x - \mu_j \|^2 + \log P(\omega_j). \tag{16.22}$$

Nearest mean classifier

If we assume equal prior probabilities $P(\omega_j)$, the second term in eqn. 16.22 may be eliminated for comparison purposes and we are left with a nearest mean classifier.

Linear discriminant

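If we further expand the squared distance term, we have,

$$g_j(x) = -\frac{1}{2\sigma^2} (x^T x - 2\mu_j^T x + \mu_j^T \mu_j) + \log P(\omega_j). \tag{16.23}$$

The term $x^T x$ is the same for every class and so may be dropped for comparison purposes; what remains may be rewritten as a linear discriminant:

$$g_j(x) = w_{j0} + w_j^T x, \tag{16.24}$$

where

$$w_{j0} = -\frac{1}{2\sigma^2} (\mu_j^T \mu_j) + \log P(\omega_j), \tag{16.25}$$

and

$$w_j = \frac{1}{\sigma^2} \mu_j. \tag{16.26}$$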

Template matching

In this latter form the Bayes-Gauss classifier may be seen to be performing template matching or correlation matching, where $w_j = \text{constant} \times \mu_j$, that is, the prototypical pattern for class $j$, the mean $\mu_j$, is the template.

Matched filter

In radar and communications systems a matched filter detector is an optimum detector of (subsequence) signals, for example, communication symbols. If the vector $x$ is written as a time series (a digital signal), $x[n]$, $n = 0, 1, \ldots$, then the matched filter for each signal $j$ may be implemented as a convolution:

$$y_j[n] = x[n] * h_j[n] = \sum_{m=0}^{N-1} x[n - m]\, h_j[m], \tag{16.27}$$

where the kernel $h_j[\cdot]$ is a time-reversed template; that is, at each time instant, the correlation between the template and the last $N$ samples of $x[\cdot]$ is computed. Provided some threshold is exceeded, the signal achieving the maximum correlation is detected.
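The convolution of eqn. 16.27 can be sketched directly in Python. This is an illustrative fragment under stated assumptions: the function name, the treatment of samples before n = 0 as zero, and the example signal and template are not from the original notes.

```python
def matched_filter(x, template):
    """Convolve x with the time-reversed template, following eqn. 16.27."""
    h = template[::-1]  # kernel h[.] is the template reversed in time
    N = len(h)
    y = []
    for n in range(len(x)):
        acc = 0.0
        for m in range(N):
            if n - m >= 0:  # samples before the start of x are taken as zero
                acc += x[n - m] * h[m]
        y.append(acc)
    return y

# Hypothetical signal containing one copy of the template at samples 2..4.
template = [1.0, 2.0, 3.0]
signal = [0.0, 0.0, 1.0, 2.0, 3.0, 0.0, 0.0]
response = matched_filter(signal, template)
peak_index = response.index(max(response))
```

The output peaks at the sample where the last N samples of the signal line up with the template, which is where a threshold detector would fire.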

Single Layer Neural Network

If we restrict the problem to two classes, we can write the classification rule as:

$$g(x) = g_1(x) - g_2(x) \geq 0 : \omega_1, \text{ otherwise } \omega_2 \tag{16.28}$$

$$\phantom{g(x)} = w_0 + w^T x, \tag{16.29}$$

where $w_0 = -\frac{1}{2\sigma^2} (\mu_1^T \mu_1 - \mu_2^T \mu_2) + \log \frac{P(\omega_1)}{P(\omega_2)}$ and $w = \frac{1}{\sigma^2} (\mu_1 - \mu_2)$.

In other words, eqn. 16.29 implements a linear combination, adds a bias, and thresholds the result; that is, a single layer neural network with a hard-limit activation function. (Duda & Hart 1973) further demonstrate that eqn. 16.22 implements a hyper-plane partitioning of the feature space.

16.5.2

Equal but General Covariances

When each class has the same covariance matrix, $K$, eqn. 16.19 reduces to:

$$g_j(x) = -\frac{1}{2} (x - \mu_j)^T K^{-1} (x - \mu_j) + \log P(\omega_j). \tag{16.30}$$

Nearest Mean Classifier, Mahalanobis Distance

If we have equal prior probabilities $P(\omega_j)$, we arrive at a nearest mean classifier where the distance calculation is weighted. The Mahalanobis distance $(x - \mu_j)^T K^{-1} (x - \mu_j)$ effectively weights contributions according to inverse variance. Points of equal Mahalanobis distance from $\mu_j$ correspond to points of equal conditional density $p(x \mid \omega_j)$.

Linear Discriminant

Eqn. 16.30 may be rewritten as a linear discriminant, see also section 15.5:

$$g_j(x) = w_{j0} + w_j^T x, \tag{16.31}$$

where

$$w_{j0} = -\frac{1}{2} (\mu_j^T K^{-1} \mu_j) + \log P(\omega_j), \tag{16.32}$$

and

$$w_j = K^{-1} \mu_j. \tag{16.33}$$

Weighted template matching, matched filter

In this latter form the Bayes-Gauss classifier may be seen to be performing weighted template matching.

Single Layer Neural Network

As for the diagonal covariance matrix, it can be easily demonstrated that, for two classes, eqns. 16.31 to 16.33 may be implemented by a single neuron. The only difference from eqn. 16.29 is that the non-bias weights, instead of being simply a difference between means, are now weighted by the inverse of the covariance matrix.

16.6

Least square error trained classifier

We can formulate the problem of classification as a least-square-error problem. If we require the classifier to output a class membership indicator $\in [0, 1]$ for each class, we can write:

$$d = f(x), \tag{16.34}$$

where $d = (d_1, d_2, \ldots, d_c)^T$ is the c-dimensional vector of class indicators and $x$, as usual, the p-dimensional feature vector. We can express individual class membership indicators as:

$$d_j = b_0 + \sum_{i=1}^{p} b_i x_i + e. \tag{16.35}$$

In order to continue the analysis we need to refer to the theory of linear regression, see Chapter 20. We repeat eqn. 20.12 here,

$$\hat{B} = (X^T X)^{-1} X^T Y. \tag{16.36}$$

$X^T Y$ is a $(p + 1) \times c$ matrix, and $\hat{B}$ is a $(p + 1) \times c$ matrix of coefficients, that is, one column of $p + 1$ coefficients for each class. Eqn. 16.36 defines the training algorithm of our classifier. Application of the classifier to a feature vector $x$ may be expressed as:

$$\hat{y} = \hat{B}^T x, \tag{16.37}$$

where $x$ is here the augmented $(p + 1)$-dimensional feature vector, so that the dimensions agree. It remains to find the maximum of the $c$ components of $\hat{y}$. In a two-class case, this least-square-error training algorithm yields an identical discriminant to Fisher's linear discriminant (Duda & Hart 1973). Fisher's linear discriminant is described in Chapter 17.

16.7

Generalised linear discriminant function

Eqn. 15.13 may be adapted to cope with any function(s) of the features $x_i$; we can define a new feature vector $x'$ where:

$$x'_k = f_k(x). \tag{16.38}$$

In the pattern recognition literature, the solution of eqn. 16.38, involving now the vector $x'$, is called the generalised linear discriminant function (Duda & Hart 1973). It is desirable to escape from the fixed model of eqn. 16.38: the form of the $f_k(x)$ must be known in advance. Multilayer perceptron (MLP) neural networks provide such a solution. We have already shown the correspondence between the linear model, eqn. 20.8, and a single layer neural network with a single output node and linear activation function. An MLP with appropriate nonlinear activation functions, e.g. sigmoid, provides a model-free and arbitrary non-linear solution to learning the mapping between $x$ and $y$ (Bishop 1995).

Chapter 17 Linear Discriminant Analysis and Principal Components Analysis


17.1 Principal Components Analysis

Principal component analysis (PCA), also called the Karhunen-Loève transform (Duda, Hart & Stork 2000), is a linear transformation which maps a p-dimensional feature vector $x \in R^p$ to another vector $y \in R^p$, where the transformation is optimised such that the components of $y$ contain maximum information in a least-square-error sense. In other words, if we take the first $q \leq p$ components ($y \in R^q$), then using the inverse transformation, we can reproduce $x$ with minimum error. Yet another view is that the first few components of $y$ contain most of the variance, that is, in those components, the transformation stretches the data maximally apart. It is this that makes PCA good for visualisation of the data in two dimensions, i.e. the first two principal components give an optimum view of the spread of the data. We note however that, unlike linear discriminant analysis, see section 17.2, PCA does not take account of class labels. Hence it is typically a more useful visualisation of the inherent variability of the data. In general $x$ can be represented, without error, by the following expansion:

$$x = U y = \sum_{i=1}^{p} y_i u_i, \tag{17.1}$$

where $y_i$ is the ith component of $y$ and where

$$U = (u_1, u_2, \ldots, u_p) \tag{17.2}$$

is an orthonormal matrix:

$$u_j^t u_k = \delta_{jk} \tag{17.3}$$

$$\phantom{u_j^t u_k} = 1 \text{ when } j = k; \text{ otherwise } = 0. \tag{17.4}$$

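If we truncate the expansion at $i = q$,

$$\hat{x} = U_q y = \sum_{i=1}^{q} y_i u_i, \tag{17.5}$$

we obtain a least square error approximation of $x$, i.e.

$$|x - \hat{x}| = \text{minimum}. \tag{17.6}$$

The optimum transformation matrix $U$ turns out to be the eigenvector matrix of the sample covariance matrix $C$:

$$C = \frac{1}{N} A^t A, \tag{17.7}$$

where $A$ is the $N \times p$ sample matrix (assumed here to have the sample mean subtracted from each row). With the eigenvectors as the columns of $U$, as in eqn. 17.1,

$$U^t C U = \Lambda, \tag{17.8}$$

the diagonal matrix of eigenvalues.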
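The following Python sketch illustrates eqns. 17.5 to 17.8 for two-dimensional data, where the eigenvectors of the symmetric 2 x 2 covariance matrix have a closed form. It is an illustration under stated assumptions, not the author's implementation: the restriction to p = 2, the function names and the toy data (points lying exactly on a line) are all hypothetical, and the data are mean-centred before the covariance is formed.

```python
import math

def mean_and_covariance(points):
    """Sample mean and 2 x 2 covariance (divisor N, eqn. 17.7) of 2-D data."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cxx = sum((p[0] - mx) ** 2 for p in points) / n
    cyy = sum((p[1] - my) ** 2 for p in points) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return (mx, my), [[cxx, cxy], [cxy, cyy]]

def principal_axes(C):
    """Eigenvalues (largest first) and unit eigenvectors of a symmetric 2 x 2 matrix."""
    a, b, d = C[0][0], C[0][1], C[1][1]
    centre = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    theta = 0.5 * math.atan2(2.0 * b, a - d)  # angle of the first principal axis
    u1 = (math.cos(theta), math.sin(theta))
    u2 = (-math.sin(theta), math.cos(theta))
    return (centre + radius, centre - radius), (u1, u2)

def reconstruct_from_first_component(point, mean, u1):
    """Project onto u1 (q = 1) and map back, following eqn. 17.5."""
    y1 = (point[0] - mean[0]) * u1[0] + (point[1] - mean[1]) * u1[1]
    return (mean[0] + y1 * u1[0], mean[1] + y1 * u1[1])

points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
mean, C = mean_and_covariance(points)
eigenvalues, (u1, u2) = principal_axes(C)
```

Because these toy points lie on a straight line, all of the variance falls on the first principal component and the q = 1 reconstruction reproduces each point; with real data the second eigenvalue measures the squared error incurred by the truncation.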

17.2

Fisher's Linear Discriminant Analysis

In contrast with PCA (see section 17.1), linear discriminant analysis (LDA) transforms the data to provide optimal class separability (Duda et al. 2000) (Fisher 1936). Fisher's original LDA, for two-class data, is obtained as follows. We introduce a linear discriminant $u$ (a p-dimensional vector of weights; the weights are very similar to the weights used in neural networks) which, via a dot product, maps a feature vector $x$ to a scalar,

$$y = u^t x. \tag{17.9}$$

$u$ is optimised to maximise simultaneously (a) the separability of the classes (between-class separability), and (b) the clustering together of same-class data (within-class clustering). Mathematically, this criterion can be expressed as:

$$J(u) = \frac{u^t S_B u}{u^t S_W u}, \tag{17.10}$$

where $S_B$ is the between-class covariance,

$$S_B = (m_1 - m_2)(m_1 - m_2)^t, \tag{17.11}$$

and

$$S_W = C_1 + C_2, \tag{17.12}$$

the sum of the class-conditional covariance matrices, see section 17.1. $m_1$ and $m_2$ are the class means. $u$ is now computed as:

$$u = S_W^{-1} (m_1 - m_2). \tag{17.13}$$

There are other formulations of LDA (Duda et al. 2000) (Venables & Ripley 2002), particularly extensions from two-class to multi-class data. In addition, there are extensions (Duda et al. 2000) (Venables & Ripley 2002) which form a second discriminant, orthogonal to the first, which optimises the separability and clustering criteria, subject to the orthogonality constraint. The second dimension/discriminant is useful to allow the data to be viewed as a two-dimensional scatter plot.
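As a hedged illustration of eqns. 17.12 and 17.13, the Python sketch below computes Fisher's discriminant for two small two-dimensional classes. The function names, the use of the divisor N in the class covariances, and the toy data are assumptions made for the example; only the formula $u = S_W^{-1}(m_1 - m_2)$ comes from the text.

```python
def mean_2d(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def covariance_2d(points, m):
    """Class-conditional 2 x 2 covariance (divisor N, an assumption here)."""
    n = len(points)
    cxx = sum((p[0] - m[0]) ** 2 for p in points) / n
    cyy = sum((p[1] - m[1]) ** 2 for p in points) / n
    cxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points) / n
    return [[cxx, cxy], [cxy, cyy]]

def fisher_discriminant(class1, class2):
    """u = S_W^{-1} (m1 - m2), following eqns. 17.12 and 17.13."""
    m1, m2 = mean_2d(class1), mean_2d(class2)
    C1, C2 = covariance_2d(class1, m1), covariance_2d(class2, m2)
    Sw = [[C1[i][k] + C2[i][k] for k in range(2)] for i in range(2)]
    det = Sw[0][0] * Sw[1][1] - Sw[0][1] * Sw[1][0]
    Sw_inv = [[Sw[1][1] / det, -Sw[0][1] / det],
              [-Sw[1][0] / det, Sw[0][0] / det]]
    d = (m1[0] - m2[0], m1[1] - m2[1])
    return (Sw_inv[0][0] * d[0] + Sw_inv[0][1] * d[1],
            Sw_inv[1][0] * d[0] + Sw_inv[1][1] * d[1])

def project(u, x):
    """Scalar y = u^t x of eqn. 17.9."""
    return u[0] * x[0] + u[1] * x[1]

# Hypothetical classes: two unit squares separated along the x1 axis.
class1 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
class2 = [(3.0, 0.0), (4.0, 0.0), (3.0, 1.0), (4.0, 1.0)]
u = fisher_discriminant(class1, class2)
```

For these data the discriminant points along the x1 axis, the only direction in which the class means differ, and the projected values of the two classes do not overlap.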


Chapter 18 Neural Network Methods


Here we show that a single neuron implements a linear discriminant (and hence also implements a separating hyperplane). Then we proceed to a discussion which indicates that a neural network comprising three processing layers can implement any arbitrarily complex decision region. Recall eqn. 15.12, with $a_i \rightarrow w_i$, and now (arbitrarily) allocating discriminant value zero to class 0,

$$g(x) = \sum_{i=1}^{p} w_i x_i - w_0 \quad \begin{cases} \leq 0 & \Rightarrow \omega = 0, \\ > 0 & \Rightarrow \omega = 1. \end{cases} \tag{18.1}$$

Figure 18.1 shows a single artificial neuron which implements precisely eqn. 18.1.

Figure 18.1: Single neuron (inputs $x_1, x_2, \ldots, x_p$ with weights $w_1, w_2, \ldots, w_p$, plus a bias input +1 with weight $w_0$).

The signal flows into the neuron (circle) are weighted; the neuron receives $w_i x_i$. The neuron sums and applies a hard limit (output = 1 when sum > 0, otherwise 0). Later we will introduce a sigmoid activation function (softer transition) instead of the hard limit. The threshold term in the linear discriminant ($a_0$ in eqn. 15.13) is provided by $w_0 \times (+1)$. Another interpretation of bias, useful in mathematical analysis of neural networks, see section 16.6, is to represent it by a constant component, +1, as the zeroth component of the augmented feature vector.

Just to re-emphasise the linear boundary nature of linear discriminants (and hence neural networks), examine the two-dimensional case,

$$w_1 x_1 + w_2 x_2 - w_0 \quad \begin{cases} \leq 0 & \Rightarrow \omega = 0, \\ > 0 & \Rightarrow \omega = 1. \end{cases} \tag{18.2}$$

The boundary between classes, given by $w_1 x_1 + w_2 x_2 - w_0 = 0$, is a straight line with $x_1$-axis intercept at $w_0 / w_1$ and $x_2$-axis intercept at $w_0 / w_2$, see Figure 18.2.

Figure 18.2: Separating line implemented by two-input neuron.

18.1

Neurons for Boolean Functions

A neuron with weights $w_0 = 0.5$, and $w_1 = w_2 = 0.35$ implements a Boolean AND:

x1  x2  AND(x1,x2)  Neuron summation                          Hard-limit (> 0?)
-------------------------------------------------------------------------------
0   0   0           sum = -0.5 + 0.35x1 + 0.35x2 = -0.5       => output = 0
1   0   0           sum = -0.5 + 0.35x1 + 0.35x2 = -0.15      => output = 0
0   1   0           sum = -0.5 + 0.35x1 + 0.35x2 = -0.15      => output = 0
1   1   1           sum = -0.5 + 0.35x1 + 0.35x2 = +0.2       => output = 1
-------------------------------------------------------------------------------

Similarly, a neuron with weights $w_0 = 0.25$, and $w_1 = w_2 = 0.35$ implements a Boolean OR. Figure 18.3 shows the $x_1$-$x_2$-plane representation of AND, OR, and XOR (exclusive-or).

Figure 18.3: AND, OR, XOR (three panels, each drawn in the $x_1$-$x_2$ plane with both axes running from 0 to 1).

It is noted that XOR cannot be implemented by a single neuron; in fact it requires two layers. Two layers were a big problem in the first wave of neural network research in the 1960s, when it was not known how to train more than one layer.
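The AND and OR neurons above, and a two-layer XOR, can be checked with a short Python sketch. The AND and OR weights are those given in the text; the NAND weights and the particular way XOR is assembled from OR, NAND and AND are assumptions chosen for illustration, since the notes do not specify a construction.

```python
def neuron(inputs, weights, w0):
    """Hard-limit neuron of eqn. 18.1: output 1 when sum(w_i x_i) - w0 > 0."""
    s = sum(w * x for w, x in zip(weights, inputs)) - w0
    return 1 if s > 0 else 0

def and_gate(x1, x2):
    return neuron((x1, x2), (0.35, 0.35), 0.5)

def or_gate(x1, x2):
    return neuron((x1, x2), (0.35, 0.35), 0.25)

def nand_gate(x1, x2):
    # Hypothetical weights: the AND neuron with all signs reversed.
    return neuron((x1, x2), (-0.35, -0.35), -0.5)

def xor_gate(x1, x2):
    """Two layers: OR and NAND in the first, AND of their outputs in the second."""
    return and_gate(or_gate(x1, x2), nand_gate(x1, x2))

truth_table = [(x1, x2, and_gate(x1, x2), or_gate(x1, x2), xor_gate(x1, x2))
               for x1 in (0, 1) for x2 in (0, 1)]
```

Each row of `truth_table` holds the inputs followed by the AND, OR and XOR outputs; the XOR column needs the second layer because no single line in the $x_1$-$x_2$ plane separates its two classes.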

18.2

Three-layer neural network for arbitrarily complex decision regions

The purpose of this section is to give an intuitive argument as to why three processing layers can implement an arbitrarily complex decision region. Figure 18.4 shows such a decision region in two dimensions. As shown in the figure, however, each island of class 1 may be delineated using a series of boundaries, $d_{11}, d_{12}, d_{13}, d_{14}$ and $d_{21}, d_{22}, d_{23}, d_{24}$. Figure 18.5 shows a three-layer network which can implement this decision region. First, just as before, input neurons implement separating lines (hyperplanes), $d_{11}$, etc. Next, in layer 2, we AND together the decisions from the separating hyperplanes to obtain the decisions "in island 1" and "in island 2". Finally, in the output layer, we OR together the latter decisions; thus we can construct an arbitrarily complex partitioning.

Figure 18.4: Complex decision region required (two islands of class 1 in a background of class 0, drawn in the $x_1$-$x_2$ plane and bounded by the lines $d_{11}$ to $d_{14}$ and $d_{21}$ to $d_{24}$).

Of course, this is merely an intuitive argument. A three layer neural network trained with backpropagation or some other technique might well achieve the partitioning in quite a different manner.

18.3

Sigmoid activation functions

If a neural network is to be trained using backpropagation or a similar technique, hard limit activation functions cause problems (associated with differentiation). Sigmoid activation functions are used instead. A sigmoid activation function corresponding to the hard limit progresses from output value 0 at $-\infty$, passes through 0 with value 0.5 and flattens out at value 1 at $+\infty$.
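One common sigmoid with exactly these properties is the logistic function $1 / (1 + e^{-s})$; the notes do not name a specific function, so its use here is an assumption. The Python sketch below sets it beside the hard limit and also gives its derivative, which is what makes gradient-based training possible.

```python
import math

def hard_limit(s):
    """Hard-limit activation: 1 when the sum is positive, otherwise 0."""
    return 1 if s > 0 else 0

def sigmoid(s):
    """Logistic sigmoid: tends to 0 at -infinity, 0.5 at s = 0, 1 at +infinity."""
    return 1.0 / (1.0 + math.exp(-s))

def sigmoid_derivative(s):
    """Derivative of the logistic sigmoid, defined for every s."""
    y = sigmoid(s)
    return y * (1.0 - y)

samples = [(s, hard_limit(s), sigmoid(s)) for s in (-20.0, -1.0, 0.0, 1.0, 20.0)]
```

The hard limit has zero slope everywhere except at the jump, where it is not differentiable, whereas the sigmoid has a well-defined, non-zero slope for all finite inputs.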

Figure 18.5: Three-layer neural network implementing an arbitrarily complex decision region (inputs $x_1, x_2, \ldots, x_p$ and bias inputs +1 feed first-layer neurons $d_{11}$ to $d_{14}$ and $d_{21}$ to $d_{24}$; two AND neurons form the second layer; a single OR neuron outputs the class).

Chapter 19 Unsupervised Classification (Clustering)

Chapter 20 Regression
20.1 Linear Regression

The simplest linear model, $y = mx + c$, of school mathematics, is given by:

$$y = b_0 + b_1 x + e, \tag{20.1}$$

which shows the dependence of the dependent variable $y$ on the independent variable $x$. In other words, $y$ is a linear function of $x$ and the observation is subject to noise, $e$; $e$ is assumed to be a zero-mean random process. Strictly, eqn. 20.1 is affine, since $b_0$ is included, but common usage dictates the use of "linear". Taking the nth observation of $(x, y)$, we have (Beck & Arnold 1977, p. 133):

$$y_n = b_0 + b_1 x_n + e_n. \tag{20.2}$$

Least square error estimators for $b_0$ and $b_1$, $\hat{b}_0$ and $\hat{b}_1$, may be obtained from a set of paired observations $\{x_n, y_n\}_{n=1}^{N}$ by minimising the sum of squared residuals:

$$S = \sum_{n=1}^{N} r_n^2 = \sum_{n=1}^{N} (y_n - \hat{y}_n)^2, \tag{20.3}$$

$$S = \sum_{n=1}^{N} (y_n - b_0 - b_1 x_n)^2. \tag{20.4}$$

Minimising with respect to $b_0$ and $b_1$, and replacing these with their estimators, $\hat{b}_0$ and $\hat{b}_1$, gives the familiar result:

$$\hat{b}_1 = \frac{N \sum y_n x_n - (\sum y_n)(\sum x_n)}{N (\sum x_n^2) - (\sum x_n)^2}, \tag{20.5}$$

$$\hat{b}_0 = \frac{\sum y_n}{N} - \hat{b}_1 \frac{\sum x_n}{N}. \tag{20.6}$$

The validity of these estimates does not depend on the distribution of the errors $e_n$; that is, assumption of Gaussianity is not essential. On the other hand, all the simplest estimation procedures, including eqns. 20.5 and 20.6, assume the $x_n$ to be error free, and that the error $e_n$ is associated with $y_n$.
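Eqns. 20.5 and 20.6 translate almost word for word into code. The Python sketch below is illustrative only; the function name and the noise-free example data are assumptions.

```python
def fit_line(xs, ys):
    """Least square error estimates (b0, b1), following eqns. 20.5 and 20.6."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    b1 = (n * sum_xy - sum_y * sum_x) / (n * sum_xx - sum_x ** 2)
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1

# Hypothetical noise-free observations generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
b0, b1 = fit_line(xs, ys)
```

With noise-free data the estimates recover the generating intercept and slope; with noisy data they give the line that minimises the sum of squared residuals of eqn. 20.4.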

In the case where $y$, still one-dimensional, is a function of many independent variables ($p$ in our usual formulation of p-dimensional feature vectors), eqn. 20.2 becomes:

$$y_n = b_0 + \sum_{i=1}^{p} b_i x_{in} + e_n, \tag{20.7}$$

where $x_{in}$ is the i-th component of the n-th feature vector. Eqn. 20.7 can be written compactly as:

$$y_n = x_n^T b + e_n, \tag{20.8}$$

where $b = (b_0, b_1, \ldots, b_p)^T$ is a $(p + 1)$-dimensional vector of coefficients, and $x_n = (1, x_{1n}, x_{2n}, \ldots, x_{pn})^T$ is the augmented feature vector. The constant 1 in the augmented vector corresponds to the coefficient $b_0$; that is, it is the so-called bias term of neural networks, see section 15.5 and Chapter 18. All $N$ observation equations may now be collected together:

$$y = X b + e, \tag{20.9}$$

where $y = (y_1, y_2, \ldots, y_n, \ldots, y_N)^T$ is the $N \times 1$ vector of observations of the dependent variable, and $e = (e_1, e_2, \ldots, e_n, \ldots, e_N)^T$. $X$ is the $N \times (p + 1)$ matrix formed by $N$ rows of $p + 1$ independent variables. Now, the sum of squared residuals, eqn. 20.3, becomes:

$$S = (y - X b)^T (y - X b). \tag{20.10}$$

Minimising with respect to $b$, just as eqn. 20.3 was minimised with respect to $b_0$ and $b_1$, leads to a solution for $b$ (Beck & Arnold 1977, p. 235):

$$\hat{b} = (X^T X)^{-1} X^T y. \tag{20.11}$$

The jk-th element of the $(p + 1) \times (p + 1)$ matrix $X^T X$ is $\sum_{n=1}^{N} x_{nj} x_{nk}$, in other words, just $N$ times the jk-th element of the autocorrelation matrix, $R$, of the vector of independent variables $x$ estimated from the $N$ sample vectors. If we have multiple dependent variables ($y$), in this case $c$ of them, we can replace $y$ in eqn. 20.11 with an appropriate $N \times c$ matrix $Y$ formed by $N$ rows each of $c$ observations. Now, eqn. 20.11 becomes:

$$\hat{B} = (X^T X)^{-1} X^T Y. \tag{20.12}$$

$X^T Y$ is a $(p + 1) \times c$ matrix, and $\hat{B}$ is a $(p + 1) \times c$ matrix of coefficients. Eqn. 20.12 has one significant weakness: it depends on the condition of the matrix $X^T X$. As with any autocorrelation or auto-covariance matrix, this cannot be guaranteed; for example, linearly dependent features will render the matrix singular. In fact, there is an elegant indirect implementation of eqn. 20.12 involving the singular value decomposition (SVD) (Press, Flannery, Teukolsky & Vetterling 1992), (Golub & Van Loan 1989). The Widrow-Hoff iterative gradient-descent training procedure (Widrow & Lehr 1990), developed in the early 1960s, tackles the problem in a different manner.
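A minimal sketch of eqn. 20.11 in plain Python follows. Rather than forming the inverse explicitly, it solves the normal equations $(X^T X)\,b = X^T y$ by Gaussian elimination with partial pivoting, which is a choice made for this example and not the SVD route mentioned in the text. The function names, the singularity tolerance and the example data are assumptions.

```python
def solve_linear_system(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:  # tolerance is an arbitrary assumption
            raise ValueError("X^T X is singular or badly conditioned")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        z[i] = (M[i][n] - sum(M[i][k] * z[k] for k in range(i + 1, n))) / M[i][i]
    return z

def least_squares(features, ys):
    """Estimate b = (b0, ..., bp) from the normal equations of eqn. 20.11."""
    X = [[1.0] + list(f) for f in features]  # augmented feature vectors
    q = len(X[0])
    XtX = [[sum(row[j] * row[k] for row in X) for k in range(q)] for j in range(q)]
    Xty = [sum(row[j] * y for row, y in zip(X, ys)) for j in range(q)]
    return solve_linear_system(XtX, Xty)

# Hypothetical noise-free data generated from y = 1 + 2*x1 + 3*x2.
features = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
ys = [1.0, 3.0, 4.0, 6.0, 8.0]
b_hat = least_squares(features, ys)
```

If the features are linearly dependent, the pivot test fails and the sketch raises an error, which is the weakness of eqn. 20.12 noted above.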

Bibliography
Beck, J. & Arnold, K. (1977). Parameter Estimation in Engineering and Science, John Wiley & Sons, New York.
Berger, J. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd edn, Springer Verlag.
Berry, D. (1996). Statistics: a Bayesian Perspective, Duxbury Press.
Bishop, C. (1995). Neural Networks for Pattern Recognition, Oxford University Press, Oxford, U.K.
Boslaugh, S. & Watters, P. A. (2008). Statistics in a Nutshell, O'Reilly.
Campbell, J. (2000). Fuzzy Logic and Neural Network Techniques in Data Analysis, PhD thesis, University of Ulster.
Campbell, J. (2005). Lecture notes on pattern recognition and image processing, Technical report, Letterkenny Institute of Technology. http://www.jgcampbell.com/ip/pr.pdf (accessed 2009-05-01).
Campbell, J. & Murtagh, F. (1998). Image processing and pattern recognition, Technical report, Computer Science, Queen's University Belfast. Available at: http://www.jgcampbell.com/ip (2009-05-01).
Casella, G. & Berger, R. (2001). Statistical Inference, 2nd edn, McGraw-Hill.
Crawley, M. J. (2005). Statistics: an introduction using R, John Wiley. Good introduction to statistics using R.
Duda, R. & Hart, P. (1973). Pattern Classification and Scene Analysis, Wiley-Interscience, New York.
Duda, R., Hart, P. & Stork, D. (2000). Pattern Classification, Wiley-Interscience.
Duntsch, I. & Gediga, G. (2000). Sets, Relations, Functions, Methodos Publishers. Available via http://www.cosc.brocku.ca/~duentsch/papers/methprimer1.html (2009-04-30).
Dytham, C. (2009). Choosing and Using Statistics: A Biologist's Guide, 2nd edn, Blackwell Publishing. ISBN-13: 978-1-4051-0243-8.
Feller, W. (1968). An Introduction to Probability Theory and its Applications, volume 1, 3rd edn, John Wiley & Sons, New York.
Fisher, R. (1936). The use of multiple measurements in taxonomic problems, Annals of Eugenics 7: 179-188.
Foley, J., van Dam, A., Feiner, S., Hughes, J. & Phillips, R. (1994). Introduction to Computer Graphics, Addison Wesley.
Frey, B. (2006). Statistics Hacks, O'Reilly.
Gelman, A., Carlin, J., Stern, H. & Rubin, D. (1995). Bayesian Data Analysis, Chapman and Hall.
Gelman, A. & Nolan, D. (2002). Teaching statistics: a bag of tricks, Oxford University Press.
Golub, G. & Van Loan, C. (1989). Matrix Computations, 2nd edn, Johns Hopkins University Press, Baltimore.
Griffiths, D. (2009). Head First Statistics, O'Reilly. ISBN-10: 0596527586. Excellent introduction.
Hacking, I. (2001). An Introduction to Probability and Inductive Logic, Oxford University Press.
Hastie, T., Tibshirani, R. & Friedman, J. (2001). The Elements of Statistical Learning, Springer.
Hsu, H. (1997). Theory and Problems of Probability, Random Variables, and Random Processes (Schaum's Outlines), McGraw-Hill.
Jaynes, E. & (editor), L. B. (2003). Probability Theory: The Logic of Science, Cambridge University Press. Jaynes was one of the chief advocates of the Bayesian method.
Jeffreys, H. (1961/1998). Theory of Probability, 3rd edn, Oxford University Press (Oxford Classics Series 1998), Oxford, U.K.
Larson, H. (1982). Introduction to Probability and Statistical Inference, 3rd edn, John Wiley.
Lee, P. M. (2004). Bayesian Statistics: an introduction, 3rd edn, Arnold. Reputedly one of the best introductions to Bayesian statistics; contains examples in R.
MacKay, D. J. C. (2002). Information Theory, Inference and Learning Algorithms, Cambridge University Press. MacKay is a major advocate of Bayesian methods.
Maindonald, J. & Braun, J. (2007). Data Analysis and Graphics Using R: an example-based approach, 2nd edn, Cambridge University Press, Cambridge, U.K. ISBN: 978-0-521-86116-8; good R examples, including graphics.
Matloff, N. (2008). R for programmers, Technical report, University of California, Davis. http://heather.cs.ucdavis.edu/~matloff/R/RProg.pdf (accessed 2009-04-25).
Meyer, P. L. (1966). Introductory Probability and Statistical Applications, Addison-Wesley, Reading, MA. Excellent introduction, but now out of print.
Milton, M. (2009). Head First Data Analysis: A learner's guide to big numbers, statistics, and good decisions, O'Reilly. ISBN-10: 0596153937. Another excellent introduction. Uses R.
Murtagh, F. (2005). Correspondence Analysis and Data Coding with Java and R, Chapman and Hall/CRC Press.
O'Hagan, A. (1994). Kendall's Advanced Theory of Statistics, Vol. 2B, Bayesian Inference, Edward Arnold.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems (revised second printing), Morgan Kaufmann, San Francisco, CA.
Press, W., Flannery, B., Teukolsky, S. & Vetterling, W. (1992). Numerical Recipes in C, 2nd edn, Cambridge University Press, Cambridge, UK.
Quinn, G. P. & Keough, M. J. (2002). Experimental Design and Data Analysis for Biologists, Cambridge University Press. ISBN-13: 978-0521009768.
Ripley, B. (1996). Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, U.K.
Rosenkrantz, R. D. (ed.) (1983). E.T. Jaynes. Papers on Probability, Statistics and Statistical Physics, Kluwer, Dordrecht.
Salsburg, D. (2001). The Lady Tasting Tea: How Statistics Revolutionized Science in the 20th Century, W.H. Freeman. Great introduction to the origins of statistics.
Silvey, S. (1975). Statistical Inference, Chapman and Hall.
Sivia, D. (1996). Data Analysis, A Bayesian Tutorial, Oxford University Press, Oxford, U.K.
Sivia, D. (2006). Data Analysis, A Bayesian Tutorial, 2nd edn, Oxford University Press. Best introduction to Bayesian inference there is.
Spiegel, M. R., Schiller, J. & Srinivasan, R. A. (2009). Theory and Problems of Probability and Statistics (Schaum's Outlines), 3rd edn, McGraw-Hill.
Spiegel, M. R. & Stephens, L. J. (2008). Statistics (Schaum's Outlines), 4th edn, McGraw-Hill. Highly recommended; if you have to buy one book, this is the one; has examples using a few packages, most notably Excel.
Taylor, P. (2008). Probability (manuscript notes on mathematical foundations), Technical report, University of Manchester. http://www.paultaylor.eu/tripos/Probability.pdf (accessed 2009-04-25).
Therrien, C. (1989). Decision, Estimation, and Classification, John Wiley and Sons, Chichester, UK.
Thisted, R. (1988). Elements of Statistical Computing, Chapman and Hall/CRC Press.
Venables, W. & Ripley, B. (2000). S Programming, Springer-Verlag.
Venables, W. & Ripley, B. (2002). Modern Applied Statistics with S, 4th edn, Springer-Verlag. Highly recommended for learning R (R is a free version of S).
Wasserman, L. (2004). All of Statistics: a concise course in statistical inference, Springer Verlag, New York, NY. ISBN: 0-387-40272-1; top class encyclopedic reference.
Widrow, B. & Lehr, M. (1990). 30 Years of Adaptive Neural Networks, Proc. IEEE 78(9): 1415-1442.

Appendix A Basic Mathematical Notation


The notation described here is merely shorthand for common sense concepts which would otherwise be confusing and long-winded if written in English. Casual familiarity with the most important items will also allow you to read papers using statistics without becoming confused. The online book Sets, Relations, Functions (Duntsch & Gediga 2000) is an ideal introduction; we take these notes from that book.

A.1 Sets

A.1.1 Set Definition and Membership

A set is a very basic mathematical entity and hence is a bit hard to define. Let's say that a set is a collection of objects; there cannot be repetition (duplication) of objects. We can specify a set by writing all its members within curly brackets, $\{ \}$.

Example 30. Six-sided die, set of possible faces (identified by the number of spots); call the set $D$. We can write $D$ as $D = \{1, 2, 3, 4, 5, 6\}$. When there is an obvious sequence, we can write $D = \{1, 2, \ldots, 6\}$. Sometimes we specify a rule for making the set; we have, for example, the trivial rule-generated set $D = \{i \mid i \in \{1, \ldots, 6\}\} = \{1, \ldots, 6\}$; the set of even numbers between 1 and 6 is given by $D_{even} = \{i \mid i \in \{1, \ldots, 6\} \text{ and } i \text{ even}\} = \{2, 4, 6\}$.

We use the membership symbol $\in$ to state that an object is a member of a set, for example, $1 \in \{1, 2, 3\}$; we can state non-membership by $\notin$, for example, $6 \notin \{1, 2, 3\}$.

There is no ordering of position in a set: $\{1, 2, 3\}$ and $\{2, 3, 1\}$ represent the same set. If there is repetition, it is understood that the repeated elements have no effect, so that $\{1, 2, 3\}$ and $\{2, 3, 1, 1, 2\}$ represent the same set.


A.1.2

Important Number Sets

Natural numbers: $\mathbb{N} = \{0, 1, 2, \ldots\}$.
Positive natural numbers: $\mathbb{N}^{+} = \{1, 2, \ldots\}$.
Integers: $\mathbb{Z} = \{\ldots, -2, -1, 0, 1, 2, \ldots\}$.
Real numbers: $\mathbb{R}$.

A.1.3

Set Operations

Intersection. The set formed by the intersection of sets $A$, $B$ is written $C = A \cap B = \{x : x \in A \text{ and } x \in B\}$.

Example 31. $A = \{1, 2, 3, 4\}$, $B = \{3, 4, 5\}$, $A \cap B = \{3, 4\}$.

Union. The set formed by the union of sets $A$, $B$ is written $C = A \cup B = \{x : x \in A \text{ or } x \in B\}$, where "or" means inclusive or, that is, "a or b" means either a or b, or both.

Example 32. $A = \{1, 2, 3, 4\}$, $B = \{3, 4, 5\}$, $A \cup B = \{1, 2, 3, 4, 5\}$.

Set difference. The set formed by the difference of sets $A$, $B$ is written $C = A \setminus B = \{x : x \in A \text{ and } x \notin B\}$. That is, remove from $A$ any members of $B$.

Example 33. $A = \{1, 2, 3, 4\}$, $B = \{3, 4\}$, $A \setminus B = \{1, 2\}$.

Set complement (with respect to a universal set, $U$). $\bar{A} = \{x : x \notin A \text{ and } x \in U\}$.

Example 34. $U = \{1, 2, 3, 4, 5, 6\}$; $A = \{3, 4, 5\}$, $\bar{A} = \{1, 2, 6\}$.

Comment. In case the notion of a universal set causes difficulty: the universal set depends on the problem at hand; when talking about a class of students, $U$ would be the set of all students in the class. You might have $A$ as the set of all students (in that class, that is, in that universal set) from County Donegal; then $\bar{A}$ is the set of all students from outside County Donegal, that is, not from County Donegal.
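Examples 31 to 34 can be reproduced with Python's built-in `set` type, whose operators `&`, `|` and `-` correspond to intersection, union and difference. The variable names below are chosen only for this illustration; the complement is computed as a difference from the universal set, since Python has no separate complement operator.

```python
A = {1, 2, 3, 4}
B = {3, 4, 5}

intersection = A & B        # Example 31: members of both A and B
union = A | B               # Example 32: members of A or B (inclusive or)
difference = A - {3, 4}     # Example 33: remove from A any members of {3, 4}

U = {1, 2, 3, 4, 5, 6}      # universal set of Example 34
complement = U - {3, 4, 5}  # members of U that are not in {3, 4, 5}

# Order and repetition do not matter in a set.
same_set = {1, 2, 3} == {2, 3, 1, 1, 2}
```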

A.1.4

Venn Diagrams

Set operations such as intersection, union, difference and complement are often illustrated using Venn diagrams such as those shown in Figure A.1.

Figure A.1: Set operations illustrated using Venn diagrams; (a) intersection of A, B, (b) union of A, B (all shaded area), (c) complement of A within the universal set U.

Subset

When a set $A$ has no members, or some or all of the members of $B$, but no more, we say that $A$ is a subset of $B$: $A \subseteq B$.

Example 35. $B = \{1, 2, 3, 4, 5, 6\}$; $A = \{3, 4, 5\}$, so that $A \subseteq B$.

Equality of sets

When a set A has the same members as B, or each is empty, we say that they are equal: $A = B$. Another way of looking at this is: if $A \subseteq B$ and $B \subseteq A$, then $A = B$.
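Subset and equality tests can be sketched in R as follows. The helper `is_subset` is hypothetical (it is not part of base R); it relies on the base operator `%in%`. Equality is checked in two ways: with the base function `setequal`, and as mutual inclusion.

```r
# Hypothetical helper: TRUE when every member of a is also a member of b.
is_subset <- function(a, b) all(a %in% b)

B <- 1:6
A <- c(3, 4, 5)

print(is_subset(A, B))           # TRUE: A is a subset of B
print(is_subset(B, A))           # FALSE: B has members that A lacks
print(is_subset(integer(0), B))  # TRUE: the empty set is a subset of any set

# Equality: same members, regardless of the order in which they are listed.
S <- c(1, 2, 3)
T <- c(3, 1, 2)
print(setequal(S, T))                     # TRUE
print(is_subset(S, T) && is_subset(T, S)) # TRUE: mutual inclusion
```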

Empty Set

If a set contains no members, we call it the empty set; symbol $\emptyset$.

Cardinality of a Set

The number of elements in a set A is called its cardinality and written |A|.

Example 36 $A = \{1, 2, \ldots, 6\}$, $|A| = 6$. $B = \{\text{John}, \text{Mary}, \text{Jean}\}$, $|B| = 3$.

Power Set

(Probably not necessary for basic probability.)

Given a set A, the power set of A, $P(A)$, is the set of all subsets of A. $|P(A)| = 2^{|A|}$. Notice that you can have a set of sets, for example, the set of all classes in the computing department.

Example 37 $A = \{a, b, c\}$, $|A| = 3$. $P(A) = \{\emptyset, \{c\}, \{b\}, \{a\}, \{b, c\}, \{a, c\}, \{a, b\}, \{a, b, c\}\}$. Verify that $|P(A)| = 2^{|A|} = 2^3 = 8$.
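The count in Example 37 can be checked by enumeration. The sketch below uses a hypothetical function `power_set`, built on the base R function `combn`, which lists all subsets of a given size. It assumes the set is a character vector with at least two distinct members (`combn` treats a single number as a range, so one-member numeric sets would need special handling).

```r
# Hypothetical helper: list every subset of s, from size 0 up to size |s|.
power_set <- function(s) {
  subsets <- list(s[0])  # start with the empty set
  for (k in seq_along(s)) {
    # combn() returns all subsets of size k as a list of vectors
    subsets <- c(subsets, combn(s, k, simplify = FALSE))
  }
  subsets
}

A <- c("a", "b", "c")
P <- power_set(A)

print(length(P))      # 8 subsets
print(2^length(A))    # 2^3 = 8, as the formula predicts
```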


Finite and Infinite Sets

Roughly speaking, if $|A| = n$ where n is some number we can identify, then we say that A is a finite set. Most of the sets in our examples are finite sets; otherwise the set is infinite. $\mathbb{N}$, $\mathbb{Z}$, $\mathbb{R}$ are infinite sets. This is an example of a finite set of integer numbers: $A = \{1, 2, \ldots, n\}$; in contrast, an infinite set of integer numbers would be written $A = \{1, 2, \ldots\}$, which means $A = \{1, 2, \ldots, \infty\}$.

Disjoint Sets

We say that $A_1, A_2, \ldots$ are disjoint if $A_i \cap A_j = \emptyset$, $\forall i, j$, $i \neq j$.

$\forall$ denotes "for all".

A.2 Iterated Summation and Product Notation

If we want to write down the operation of summing the numbers from 1 to 6, we could write $s = 1 + 2 + 3 + 4 + 5 + 6$ or $s = 1 + 2 + \cdots + 6$. But this becomes tedious or impossible for larger lists. We have the summation notation $\sum_{i=1}^{6} i$.

Similarly, if we want to write down the operation of multiplying together all the numbers from 1 to 6, we use the product notation $\prod_{i=1}^{6} i$.
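In R the two notations correspond to the base functions `sum` and `prod` applied to the sequence `1:6`. A minimal sketch, with an explicit loop alongside to show what the summation sign abbreviates:

```r
i <- 1:6

s <- sum(i)    # 1 + 2 + ... + 6
p <- prod(i)   # 1 * 2 * ... * 6

# The same sum written out as a loop over the index i.
s_loop <- 0
for (k in i) {
  s_loop <- s_loop + k
}

print(s)       # 21
print(p)       # 720
print(s_loop)  # 21
```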

A.3 Iterated Union and Intersection

If we want to write down the operation of taking the union (see section A.1.3) of a list of sets from $A_1$ to $A_6$, we could write $B = A_1 \cup A_2 \cup \cdots \cup A_6$. But this becomes tedious or impossible for larger lists. Similar to the summation notation we have $B = \bigcup_{i=1}^{6} A_i$.

For intersection we have $B = \bigcap_{i=1}^{6} A_i$.
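An iterated union or intersection can be sketched in R by folding the two-set operation over a list of sets with the base function `Reduce`. The three sets below are made-up illustrations, not taken from the notes.

```r
# A list of sets A_1, A_2, A_3 (illustrative values only).
sets <- list(c(1, 2, 3),
             c(2, 3, 4),
             c(3, 4, 5))

# Reduce applies the two-argument function cumulatively:
# union(union(A_1, A_2), A_3), and likewise for intersect.
big_union        <- Reduce(union, sets)
big_intersection <- Reduce(intersect, sets)

print(big_union)         # 1 2 3 4 5
print(big_intersection)  # 3
```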

A.4 Cartesian Product Sets

Quite often we need to make new sets by making pairs (or triples or n-tuples) from existing sets.

Example 38 Let $A = \{1, 2, 3, 4, 5, 6\}$, the set of outcomes from throwing a six-sided die, and $B = \{H, T\}$, the set of outcomes of a coin toss. If we perform an experiment where we throw the die and toss the coin, and we want to describe the set of all possible pairs, $C = \{(1, H), (1, T), (2, H), \ldots, (6, H), (6, T)\}$, we call set C the Cartesian product of A and B.

The Cartesian product of A and B is written $A \times B$. The cardinality of $A \times B$ is $|A \times B| = |A| \times |B|$. So in Example 38, we have $|A \times B| = |A| \times |B| = 6 \times 2 = 12$.

Note: pairs such as $(1, H)$, $(1, T)$, or generally n-tuples, enclosed in round brackets ( ), are not sets.
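The die-and-coin product of Example 38 can be enumerated in R with the base function `expand.grid`, which returns one row per pair. This is only a sketch; the names `die`, `coin` and `pairs` are illustrative.

```r
die  <- 1:6
coin <- c("H", "T")

# One row per (die, coin) pair; keep the coin faces as plain strings.
pairs <- expand.grid(die = die, coin = coin, stringsAsFactors = FALSE)

print(nrow(pairs))                  # 12 pairs
print(length(die) * length(coin))   # 6 * 2 = 12, the product of the cardinalities
```

Each row is an ordered pair, so the column order (die first, coin second) carries meaning, in line with the note that tuples are not sets.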

Appendix B Matrices and Linear Algebra


B.1 Introduction

In Chapters 7 and 8 we introduce two-dimensional random variables, that is, pairs of random variables which, for one reason or another, we want to treat as pairs rather than separately. Much of what we do in one dimension generalises to two dimensions, and much of what we do in two dimensions generalises to many.

B.2 Linear Simultaneous Equations

Eqns. B.1 and B.2 are a pair of linear simultaneous equations,

\[ y_1 = 3x_1 + 1x_2, \tag{B.1} \]
\[ y_2 = 2x_1 + 4x_2. \tag{B.2} \]

Practically, these equations could express the following: price of an apple = $x_1$, price of an orange = $x_2$ (both unknown). Person A buys 3 apples and 1 orange and the total bill is 5c ($y_1$). Person B buys 2 apples and 4 oranges and the total bill is 10c ($y_2$). Now, what is $x_1$, the price of apples, and $x_2$, the price of oranges?

We want to solve for the unknowns $x_1$, $x_2$. Matrix algebra gives us a nice technique for solving such problems, see section B.6, but first we'll see how to solve it without matrices. Substitute $y_1 = 5$ and $y_2 = 10$ into eqns. B.1 and B.2:

\[ 5 = 3x_1 + 1x_2, \tag{B.3} \]
\[ 10 = 2x_1 + 4x_2. \tag{B.4} \]

Eqn. B.3 gives $x_2 = 5 - 3x_1$, which, substituted into eqn. B.4, gives:

\[ 10 = 2x_1 + 4(5 - 3x_1), \]
\[ 10 = 2x_1 + 20 - 12x_1, \]
\[ -10 = -10x_1, \]
\[ x_1 = 1. \]

Now, substitute $x_1 = 1$ into eqn. B.3: $5 = 3 + x_2$, so $x_2 = 2$. We have determined our unknowns, $x_1 = 1$ and $x_2 = 2$.

Ex. Substitute $x_1 = 1$ and $x_2 = 2$ into eqns. B.3 and B.4 to check the correctness of the result.

B.3 Vectors and Matrices

Eqns. B.1 and B.2 can be written in matrix form as

\[ y = Ax, \tag{B.5} \]

where A is a 2-row, 2-column matrix, $A = \begin{pmatrix} 3 & 1 \\ 2 & 4 \end{pmatrix}$, y is a one-column, two-row matrix representing a tuple, and what we will from now on call a vector, $y = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}$, and x is another one-column, two-row matrix, $x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$.

Vectors

We could be extra careful and continue to call objects like x and y tuples. But everyone in the statistical world uses the term vector for tuple, and, because we are using vector and matrix arithmetic and algebra, this gives another reason to use vector. A vector is nothing more than an ordered collection of one-dimensional variables; however, vector and matrix mathematics have been developed to allow us to do mathematics on vectors without having to deal with each of the elements of $(X_1, X_2, \ldots, X_p)$ separately.

It will rarely be helpful to think of these vectors as being like vectors of physics and having magnitude and direction; but it is often helpful to think of two-dimensional vectors as representing points in a Euclidean plane and to think of general multidimensional vectors (p dimensions, say) as representing points in p-dimensional space.

Generally, a system of m equations in n variables, $x_1, x_2, \ldots, x_n$,

\[
\begin{aligned}
y_1 &= a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n \\
y_2 &= a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n \\
    &\;\;\vdots \\
y_r &= a_{r1}x_1 + a_{r2}x_2 + \cdots + a_{rn}x_n \\
    &\;\;\vdots \\
y_m &= a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n
\end{aligned}
\tag{B.6}
\]

can be written in matrix form as

\[ y = Ax, \tag{B.7} \]

where y is an $m \times 1$ vector, $y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}$, x is an $n \times 1$ vector, $x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$, and A is an m-row, n-column matrix,

\[
A = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & a_{rc} & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{pmatrix}.
\]

That is, the matrix A is a rectangular array of numbers whose element in row r, column c is $a_{rc}$ (rows are horizontal, think rows of teeth; columns are vertical). The matrix A is said to be $m \times n$, i.e. m rows, n columns.

Eqn. B.7 can be interpreted as the definition of a function which takes n arguments $(x_1, x_2, \ldots, x_n)$ and returns m variables $(y_1, y_2, \ldots, y_m)$. Such a function is also called a transformation: it transforms n-dimensional vectors to m-dimensional vectors. Such equations are linear transformations because there are no terms in $x_r^2$ or higher, only in $x_r = x_r^1$, and no constant terms like 5 ($5x_r^0 = 5 \times 1 = 5$).
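The view of $y = Ax$ as a transformation can be sketched in R, where `%*%` is the matrix product. The first matrix is the A of eqns. B.1 and B.2; the second, `M`, is a made-up $3 \times 2$ matrix included only to show a transformation from 2 dimensions to 3.

```r
# The matrix of eqns. B.1 and B.2, entered row by row.
A <- matrix(c(3, 1,
              2, 4), nrow = 2, byrow = TRUE)
x <- c(1, 2)

y <- A %*% x       # x is treated as a 2 x 1 column vector
print(dim(A))      # 2 2
print(y)           # a 2 x 1 matrix holding 5 and 10

# A 3 x 2 matrix maps 2-dimensional vectors to 3-dimensional vectors.
M <- matrix(1:6, nrow = 3)   # filled column by column: rows (1,4), (2,5), (3,6)
z <- M %*% x
print(dim(z))      # 3 1
```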

Uses of Vectors and Transformations in Statistics

Instead of denoting a two-d. random variable as $(X, Y)$, it is much more convenient to denote it as a vector $X = \begin{pmatrix} X_1 = X \\ X_2 = Y \end{pmatrix}$. This is particularly true when we get to larger dimensions, when equations like eqn. 7.15 get enormous or impossible.

Why transformations? In other places, we have used combinations of random variables such as $U = aX + bY$; and we might also have $V = cX + dY$. Thus, we create a new two-d. random variable $(U, V)$ using linear combinations of $(X, Y)$; we transform $(X, Y)$ to yield $(U, V)$. This can be neatly expressed using matrix notation as $y = Ax$, where y is a $2 \times 1$ vector, $y = \begin{pmatrix} U \\ V \end{pmatrix}$, x is a $2 \times 1$ vector, $x = \begin{pmatrix} X \\ Y \end{pmatrix}$, and A is a 2-row, 2-column matrix,

\[ A = \begin{pmatrix} a_{11} = a & a_{12} = b \\ a_{21} = c & a_{22} = d \end{pmatrix}. \]

The more general eqn. B.7 allows us to create an m-dimensional random variable, y, as a linear combination of the n random variables in the n-dimensional vector x.
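As an illustration of transforming $(X, Y)$ into $(U, V)$, the sketch below applies a $2 \times 2$ matrix to a small simulated sample. The coefficients ($a = b = c = 1$, $d = -1$, giving $U = X + Y$ and $V = X - Y$) and the use of standard Normal samples are assumptions made purely for the example.

```r
# Hypothetical coefficients: U = X + Y, V = X - Y.
A <- matrix(c(1,  1,
              1, -1), nrow = 2, byrow = TRUE)

set.seed(1)          # make the simulated sample reproducible
X <- rnorm(5)
Y <- rnorm(5)

xy <- rbind(X, Y)    # 2 x 5: each column is one observed (X, Y) pair
uv <- A %*% xy       # 2 x 5: each column is the transformed (U, V) pair

U <- uv[1, ]
V <- uv[2, ]
print(all.equal(U, X + Y))   # TRUE
print(all.equal(V, X - Y))   # TRUE
```

Applying the matrix to all pairs at once is what makes the notation pay off: the same line of code works for any number of observations.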

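B.4 Basic Matrix Arithmetic

B.4.1 Matrix Multiplication

We may multiply two matrices A, $m \times n$, and B, $q \times p$, as long as $n = q$. Such a multiplication produces an $m \times p$ result. Thus,

\[ \underset{m \times p}{C} = \underset{m \times n}{A} \; \underset{n \times p}{B}. \tag{B.8} \]

Method: The element at the r-th row and c-th column of C is the product (sum of component-wise products) of the r-th row of A with the c-th column of B.

[Diagram: A (m rows, n columns) times B (n rows, p columns) gives C (m rows, p columns).]

For $2 \times 2$ matrices, $C = AB$ with

\[ A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}, \qquad B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}, \]

so the product is

\[ C = \begin{pmatrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\ a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} \end{pmatrix}. \]

Example. Consider Eqn. B.7, $y = Ax$. The product of A ($m \times n$) and x ($n \times 1$) is

\[
\begin{aligned}
y_1 &= a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n, \\
    &\;\;\vdots \\
y_m &= a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n.
\end{aligned}
\]

In summation notation,

\[ y_r = \sum_{c=1}^{n} a_{rc} x_c. \]

The product is $(m \times n) \times (n \times 1)$ so the result is $(m \times 1)$, which checks okay, for y is $(m \times 1)$.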
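The row-by-column method can be written out directly. The function `mat_mult` below is a hypothetical teaching sketch that follows the method step by step and is then compared with R's built-in `%*%`; in practice one would simply use `%*%`. The two matrices are made-up examples.

```r
# Hypothetical helper: multiply A (m x n) by B (n x p) element by element.
mat_mult <- function(A, B) {
  stopifnot(ncol(A) == nrow(B))          # inner dimensions must agree
  C <- matrix(0, nrow = nrow(A), ncol = ncol(B))
  for (r in seq_len(nrow(A))) {
    for (cc in seq_len(ncol(B))) {
      # sum of component-wise products of row r of A and column cc of B
      C[r, cc] <- sum(A[r, ] * B[, cc])
    }
  }
  C
}

A <- matrix(1:6,  nrow = 2)   # 2 x 3
B <- matrix(1:12, nrow = 3)   # 3 x 4
C <- mat_mult(A, B)

print(dim(C))                  # 2 4, i.e. (m x n)(n x p) gives m x p
print(all.equal(C, A %*% B))   # TRUE
```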

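B.4.2 Multiplication by a Scalar

As with vectors (when represented as components), we simply multiply each component by the scalar,

\[ c \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} = \begin{pmatrix} ca_{11} & ca_{12} \\ ca_{21} & ca_{22} \end{pmatrix}. \]

B.4.3 Addition

As with vectors (when represented as components), we add component-wise,

\[ \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} + \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix} = \begin{pmatrix} a_{11} + b_{11} & a_{12} + b_{12} \\ a_{21} + b_{21} & a_{22} + b_{22} \end{pmatrix}. \]

Clearly, the matrices must be the same size, i.e. row and column dimensions must be equal.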
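Both operations are component-wise in R as well, using the ordinary `*` and `+` operators. A minimal sketch with made-up matrices, including what happens when the sizes differ:

```r
A <- matrix(c(1, 2,
              3, 4), nrow = 2, byrow = TRUE)
B <- matrix(c(10, 20,
              30, 40), nrow = 2, byrow = TRUE)

scaled <- 3 * A    # every component multiplied by the scalar 3
total  <- A + B    # component-wise sum

print(scaled)
print(total)

# Adding a 2 x 2 matrix to a 2 x 3 matrix is an error in R.
bad <- tryCatch(A + matrix(1:6, nrow = 2),
                error = function(e) "sizes differ")
print(bad)         # "sizes differ"
```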

B.5 Special Matrices

B.5.1 Identity Matrix

\[ I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \]

i.e. it produces no transformation effect. Thus, $IA = A$.

We can define the matrix inverse as follows: if $AB = I$ then $B = A^{-1}$, see section B.6.
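In R, `diag(n)` builds the $n \times n$ identity matrix. The sketch below checks that multiplying by it, on either side, leaves the matrix of eqns. B.1 and B.2 unchanged.

```r
I <- diag(2)                       # 2 x 2 identity matrix
A <- matrix(c(3, 1,
              2, 4), nrow = 2, byrow = TRUE)

print(I)
print(all.equal(I %*% A, A))       # TRUE: IA = A
print(all.equal(A %*% I, A))       # TRUE: AI = A as well
```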


B.5.2 Orthogonal Matrix

A matrix which satisfies the property

\[ AA^t = I, \]

i.e. the transpose of the matrix is its inverse, see section B.6.

Another way of viewing orthogonality in matrices is: for each row of the matrix, $(a_{r1}\; a_{r2}\; \ldots\; a_{rn})$, the scalar product with itself is 1, and with all other rows, 0. I.e.

\[ \sum_{c=1}^{n} a_{rc} a_{pc} = \begin{cases} 1 & \text{for } r = p, \\ 0 & \text{otherwise.} \end{cases} \]
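A standard example of an orthogonal matrix, not taken from these notes, is a rotation of the plane by an angle $\theta$. The sketch below checks both characterisations numerically; `all.equal` is used because floating-point results are only equal up to rounding.

```r
theta <- pi / 6   # an arbitrary angle chosen for illustration

# Rotation matrix (an assumed example of an orthogonal matrix).
Q <- matrix(c(cos(theta), -sin(theta),
              sin(theta),  cos(theta)), nrow = 2, byrow = TRUE)

print(all.equal(Q %*% t(Q), diag(2)))   # TRUE: Q Q^t = I
print(all.equal(t(Q), solve(Q)))        # TRUE: the transpose is the inverse

# Row view: each row has scalar product 1 with itself, 0 with the other row.
print(sum(Q[1, ] * Q[1, ]))             # 1, up to rounding
print(sum(Q[1, ] * Q[2, ]))             # 0
```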

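B.5.3 Diagonal

\[ A = \begin{pmatrix} S_x & 0 \\ 0 & S_y \end{pmatrix} \]

is diagonal, i.e. the only non-zero elements are on the diagonal. The inverse of a diagonal matrix

\[ \begin{pmatrix} a_{11} & 0 \\ 0 & a_{22} \end{pmatrix} \quad \text{is} \quad \begin{pmatrix} 1/a_{11} & 0 \\ 0 & 1/a_{22} \end{pmatrix}. \]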
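The inverse rule for diagonal matrices can be checked in R, where `diag` applied to a vector builds a diagonal matrix. The diagonal values 2 and 5 are arbitrary non-zero numbers chosen for the sketch.

```r
d <- c(2, 5)            # arbitrary non-zero diagonal entries
D     <- diag(d)        # diagonal matrix with 2 and 5 on the diagonal
D_inv <- diag(1 / d)    # reciprocals on the diagonal

print(D)
print(all.equal(D %*% D_inv, diag(2)))   # TRUE: the product is the identity
print(all.equal(solve(D), D_inv))        # TRUE: agrees with the general inverse
```

The rule needs every diagonal entry to be non-zero; a zero on the diagonal would mean dividing by zero.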

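B.5.4 Transpose of a Matrix

$A^t$, spoken "A-transpose". If

\[ A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \quad \text{then} \quad A^t = \begin{pmatrix} a_{11} & a_{21} \\ a_{12} & a_{22} \end{pmatrix}, \]

i.e. replace column 1 with row 1 etc. The transpose is sometimes written $A^T$ or $A'$.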
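In R the transpose is the base function `t`. The sketch uses a made-up $2 \times 3$ matrix so that the swap of rows and columns also shows up in the dimensions.

```r
A <- matrix(c(1, 2, 3,
              4, 5, 6), nrow = 2, byrow = TRUE)   # 2 x 3

At <- t(A)                 # 3 x 2: rows of A become columns of At

print(dim(A))              # 2 3
print(dim(At))             # 3 2
print(At[3, 1] == A[1, 3]) # TRUE: element (r, c) moves to (c, r)
print(identical(t(At), A)) # TRUE: transposing twice gives A back
```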


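B.6 Inverse Matrix

Only for square matrices ($m = n$). Consider again Eqns. B.1 and B.2:

\[ y_1 = 3x_1 + 1x_2, \]
\[ y_2 = 2x_1 + 4x_2, \]

i.e. $y = Ax$, with

\[ A = \begin{pmatrix} 3 & 1 \\ 2 & 4 \end{pmatrix}. \]

Apply this to

\[ x = \begin{pmatrix} 1 \\ 2 \end{pmatrix} \]

to get

\[ y_1 = 3 \times 1 + 1 \times 2 = 5, \]
\[ y_2 = 2 \times 1 + 4 \times 2 = 10. \]

What if you know $y = (5 \;\; 10)^t$ and you want to retrieve $x = (x_1 \;\; x_2)^t$? In other words, can matrices help us solve for $x_1$, $x_2$ as we did in section B.2?

The answer is yes. Find the inverse of A, $A^{-1}$, and then apply the inverse transformation to y, that is, multiply y by the inverse of the matrix, $x = A^{-1}y$.

In the case of a $2 \times 2$ matrix

\[ A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}, \tag{B.9} \]

\[ A^{-1} = \frac{1}{|A|} \begin{pmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{pmatrix}, \tag{B.10} \]

where the determinant of the array, A, is

\[ |A| = a_{11}a_{22} - a_{12}a_{21}. \]

If $|A| = 0$, then A is not invertible; it is singular. Inverse matrices give us the equivalent of division. If $|A| = 0$, attempting to find the inverse is the equivalent of calculating 1/0.

Thus for

\[ A = \begin{pmatrix} 3 & 1 \\ 2 & 4 \end{pmatrix} \]

we have $|A| = 3 \times 4 - 2 \times 1 = 10$, so

\[ A^{-1} = \frac{1}{10} \begin{pmatrix} 4 & -1 \\ -2 & 3 \end{pmatrix} = \begin{pmatrix} 0.4 & -0.1 \\ -0.2 & 0.3 \end{pmatrix}. \]

Therefore, apply $A^{-1}$ to $y = \begin{pmatrix} 5 \\ 10 \end{pmatrix}$. We find:

\[ A^{-1}y = \begin{pmatrix} 0.4 & -0.1 \\ -0.2 & 0.3 \end{pmatrix} \begin{pmatrix} 5 \\ 10 \end{pmatrix} = \begin{pmatrix} 5 \times 0.4 + 10 \times (-0.1) \\ 5 \times (-0.2) + 10 \times 0.3 \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \]

which is the answer we got in section B.2. In fact, in section B.2 what we did was something very similar to how one inverts a matrix in a computer program.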
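The worked example can be reproduced in R. The function `inverse_2x2` below is a hypothetical helper that implements eqn. B.10 directly; it is compared with the base R functions `det` and `solve`, which handle square matrices of any size.

```r
# Hypothetical helper implementing eqn. B.10 for a 2 x 2 matrix M.
inverse_2x2 <- function(M) {
  d <- M[1, 1] * M[2, 2] - M[1, 2] * M[2, 1]   # determinant |M|
  if (d == 0) stop("matrix is singular, no inverse exists")
  matrix(c( M[2, 2], -M[1, 2],
           -M[2, 1],  M[1, 1]), nrow = 2, byrow = TRUE) / d
}

A <- matrix(c(3, 1,
              2, 4), nrow = 2, byrow = TRUE)
y <- c(5, 10)

A_inv <- inverse_2x2(A)
x <- A_inv %*% y                      # x = A^{-1} y

print(det(A))                         # 10, up to rounding
print(A_inv)                          # 0.4 -0.1 / -0.2 0.3
print(x)                              # 1 and 2, the prices found in section B.2
print(all.equal(A_inv, solve(A)))     # TRUE: agrees with R's general inverse
print(solve(A, y))                    # solves Ax = y without forming the inverse
```

For larger systems, `solve(A, y)` is usually preferred to computing the inverse explicitly and then multiplying.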

B.7 Multidimensional (Multivariate) Random Variables

We can now generalise two-d. random variables to p dimensions by extending $(X, Y)$ to $(X_1, X_2, \ldots, X_p)$. It is usual to call the p-dimensional (multivariate) random variable a random vector and to use vector notation: $X = (X_1, X_2, \ldots, X_p)$.

The multivariate Normal pdf, p-dimensional, is given by:

\[ f(x) = \frac{1}{(2\pi)^{p/2} |K|^{1/2}} \exp\left[ -\frac{1}{2} (x - \mu)^T K^{-1} (x - \mu) \right], \tag{B.11} \]

where $\mu$ is the p-dimensional mean vector and K is the $p \times p$ covariance matrix.
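Eqn. B.11 translates almost symbol for symbol into R. The function `mvn_pdf` below is a hypothetical sketch for illustration, assuming that `mu` is the mean vector and `K` an invertible covariance matrix; it is not a substitute for a tested library routine. As a sanity check, with $p = 1$ it should agree with the one-dimensional Normal density `dnorm`.

```r
# Hypothetical helper: evaluate the p-dimensional Normal pdf at the point x.
mvn_pdf <- function(x, mu, K) {
  p <- length(mu)
  d <- x - mu                                     # deviation from the mean
  q <- as.numeric(t(d) %*% solve(K) %*% d)        # (x - mu)^T K^{-1} (x - mu)
  exp(-0.5 * q) / ((2 * pi)^(p / 2) * sqrt(det(K)))
}

# One dimension: mean 1, variance 4 (standard deviation 2).
f1 <- mvn_pdf(2, mu = 1, K = matrix(4))
print(all.equal(f1, dnorm(2, mean = 1, sd = 2)))   # TRUE

# Two independent components: the pdf is the product of the two 1-d densities.
f2 <- mvn_pdf(c(0.5, -1), mu = c(0, 0), K = diag(c(1, 4)))
print(all.equal(f2, dnorm(0.5) * dnorm(-1, sd = 2)))   # TRUE
```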
