Learning Statistics
Concepts and Applications in R
Course Guidebook
Talithia Williams, Ph.D.
Associate Professor of Mathematics
Harvey Mudd College
and mathematics ) fields. Dr. Williams has made it her life’s work to get
people—students, parents, educators, and community members—more
excited about the possibilities inherent in a STEM education.
Dr. Williams develops statistical models that emphasize the spatial and
temporal structure of data and has partnered with the World Health
Organization in developing a model to predict the annual number of
cataract surgeries needed to eliminate blindness in Africa. Through her
research and work in the community at large, she is helping change the
collective mindset regarding STEM in general and math in particular—
rebranding the field of mathematics as anything but dry, technical, or
male-dominated, and instead as a logical, productive career path that is
crucial to the future of the country.
Dr. Williams is cohost of the PBS series NOVA Wonders, a 6-part series that
journeys to the frontiers of science, where researchers are tackling some of
the most intriguing questions about life and the cosmos. She has delivered
speeches tailored to a wide range of audiences within the educational field,
including speaking throughout the country about the value of statistics in
quantifying personal health information.
INTRODUCTION
Professor Biography
Course Scope
R and RStudio
LECTURE GUIDES
SUPPLEMENTARY MATERIAL
Solutions
Bibliography
We begin the course with a look at the descriptive properties of data and
learn exploratory visualization techniques using R. This helps us begin to
see the shape of data, find trends, and locate outliers. The field of statistics
is really a branch of mathematics that deals with analyzing and making
decisions based on data.
We learn to check for independence of events, and set up and work with
discrete random variables (lecture 4), including those that follow the
Bernoulli, binomial, geometric, and Poisson distributions. Probability
distributions allow us to see, graphically and by calculation, how likely
the possible values of our random variables are.
We apply the central limit theorem of statistics (lecture 8), which tells
us that as our sample size increases, the distribution of our sample means
approaches a normal distribution, no matter what distribution the data
originate from.
For data that has categorical predictors, such as gender, we turn to what is
called analysis of variance (ANOVA), which allows us to compare the means
of 2 or more groups in lecture 16, and multivariate analysis of variance (MANOVA)
and analysis of covariance (ANCOVA) in lecture 17. We also explore how
ANOVA can be used in statistical design of experiments (lecture 18), as
pioneered by the great statistician Sir Ronald Fisher.
ANOVA and linear regression depend on key assumptions that are often
not met, including linearity, independence, homogeneity, and constant
variance. So, in lectures 19 through 23, we consider how to do statistical
analysis when one or more of those assumptions do not hold. Regression
trees and classification trees (known more generally as decision trees)
don’t require assumptions such as linearity, are even easier to use than
linear regression, and work well even when some values are missing.
However, not all data have natural splits amenable to decision trees, so
we turn in lecture 20 to polynomial regression (adding nonlinear terms
to our linear model) and to step functions (which apply different models
over different ranges of the predictor).
Once you have installed R and RStudio, you can install additional packages
that are required for this course. The following instructions assume that
you are in the RStudio environment and know the package names needed.
1 In the RStudio console, at the prompt >, type the following command and press
the enter or return key to install a package. For example, let’s install the “swirl”
package.
> install.packages("swirl")
2 Then, R will fetch all the required package files from CRAN (the Comprehensive R
Archive Network) and install them for you. Once installed, load the package with
the library() function.
> library("swirl")
Unlike other packages, the “swirl” package will immediately begin interacting
with you, suggesting that you type the following to begin using a training
session in “swirl”:
> swirl()
3 Type in package names in the “Packages” field. Try typing “swirl” because this is
the first package that is recommended for you to use.
4 Click “Install” to let R install the package along with any other packages it
depends on. You’ll see the installation progress in the R console.
5 Once all the package files are downloaded and installed on your computer, you’ll
find the package name in the “Packages” pane (scroll through), or use the search
bar on the top-right side of the “Packages” panel. To load the package you just
installed, click on the checkbox.
ൖൖ graphics
ൖൖ stats
ൖൖ utils
If you don’t know package names, the best place to get an overview of the
best available packages is the “Task Views” section on the CRAN website,
available at https://cran.r-project.org/web/views/.
ൖൖ http://rprogramming.net/download-and-install-rstudio/.
ൖൖ RStudio: http://web.cs.ucla.edu/~gulzar/rstudio/index.html.
HOW TO SUMMARIZE
DATA WITH STATISTICS
To truly appreciate statistical information, we have
to understand the language ( and assumptions )
of statistics—and how to reason in the face of
uncertainty. In effect, we have to become masters at the
art of learning from data, which has 2 sides: accurately
describing and summarizing the data we have; and going
beyond the data we have, making inferences about data we
don’t have. Statistics is both descriptive and inferential.
WHAT IS STATISTICS?
ۧۧ Statistics is a branch of mathematics, but it’s also a science. It
involves the collection of data, analysis of data ( working with data ),
interpretation of data to reach conclusions, and presentation of data.
ۧۧ Quantitative data are always numbers. This type of data is often the
result of measuring a characteristic about a population ( e.g., height,
number of people living in your town, or percentage of registered voters ).
ۧۧ The purpose was to determine which feed ( if any ) led to the heaviest
chickens. In this example, weight is a continuous, quantitative variable
giving the chick weight, and feed is a categorical, qualitative variable
giving the feed type.
ۧۧ We denote the sample mean of a variable by placing a bar over it ( e.g., X̄ ).
ۧۧ The mean value, or average, tells us the center of the data. We find it
by adding all of the data points and dividing by the total number. The
following are the weights of chicks that were given a horsebean feed.
x = c(179, 160, 136, 227, 217, 168, 108, 124, 143, 140)
sum(x); sum(x)/10; mean(x)
[1] 1602
[1] 160.2
[1] 160.2
ۧۧ The median is another way of measuring the center of the data. Think
of the median as the middle value, although it doesn’t actually have to
be one of the observed values. To find the median, order the data and
locate a number that splits the data into 2 equal parts.
(143 + 160) / 2
[1] 151.5
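R’s built-in median( ) function returns the same value directly.
median(x)
[1] 151.5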
ۧۧ The median is a number that separates ordered data into halves. Half
the values are the same size or smaller than the median, and half the
values are the same size or larger than the median.
ۧۧ If our dataset instead had 11 values, then the median would be equal to
the number located at location 6 when the data is sorted.
x = c(179, 160, 136, 227, 217, 168, 108, 124, 143, 140,
500)
y = sort(x)
y
[1] 108 124 136 140 143 160 168 179 217 227 500
ۧۧ Now the median is 160. But notice that the mean changes to 191.1.
mean(x)
[1] 191.0909
ۧۧ The median is generally a better measure of the center when your data
has extreme values, or outliers. The median is not affected by extreme
values. So, if your mean is far away from the median, that’s a hint that
the median might be a better representative of your data.
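The table below can be produced with R’s summary( ) function applied to the chickwts dataset that ships with R.
library(datasets)
data(chickwts)
summary(chickwts)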
weight feed
Min. : 108.0 casein : 12
1st Qu. : 204.5 horsebean : 10
Median : 258.0 linseed : 12
Mean : 261.3 meatmeal : 11
3rd Qu. : 323.5 soybean : 14
Max. : 423.0 sunflower : 12
ۧۧ The summary output gives us the mean and median of the weight data,
along with minimum and maximum values and first and third quartile.
For feed, we get a summary of how many chicks are in each group.
x = c(179, 160, 136, 227, 217, 168, 108, 124, 143, 140)
179 - mean(x)
[1] 18.8
160 - mean(x)
[1] -0.2
130 - mean(x)
[1] -30.2
ۧۧ We could add all of the deviations, but we’d just get a sum of 0. We
could add the absolute values of all the deviations and average them to
get a mean absolute deviation. A closely related measure, the median
absolute deviation, is available in R with the command “mad( ).”
x = c(179, 160, 136, 227, 217, 168, 108, 124, 143, 140)
round(sum(x - mean(x)),10)
[1] 0
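A quick sketch of both spread measures in R; note that mad( ) returns the median absolute deviation scaled by 1.4826 by default, so the mean absolute deviation is computed directly instead.
mean(abs(x - mean(x)))    # mean absolute deviation
[1] 30.04
mad(x)                    # scaled median absolute deviation (R's default)
[1] 32.6172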
ۧۧ The variance is in squared units and doesn’t have the same units as
the data. We can get back to our original units by taking the square
root, giving what is called the standard deviation, which measures the
spread in the same units as the data.
x.bar = mean(x)
sum((x - x.bar)^2)/(length(x)-1)
[1] 1491.956
sqrt(sum((x - x.bar)^2)/(length(x)-1))
[1] 38.62584
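R’s built-in var( ) and sd( ) functions give the same results as the manual calculation.
var(x)
[1] 1491.956
sd(x)
[1] 38.62584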
STATISTICAL GRAPHS
ۧۧ But what if data are not evenly spread around the mean? That’s called
skewness. What can we do when data are highly skewed?
ۧۧ We can use the median. A common statistical graph for showing the
spread of data around the median is the box plot, which is a graphical
display of the concentration of the data, centered on the median.
ۧۧ Box plots show us the visual spread of the data values. They give us the
smallest value, the first quartile, the median, the third quartile, and
the largest value. Quartiles are numbers that separate the data into
quarters. Like the median, quartiles may be located on a data point or
between 2 data points.
ۧۧ To find the quartiles, we first find the median, which is the second quartile.
The first quartile is the middle value of the lower half of the data, and the
third quartile is the middle value of the upper half of the data.
sort(x)
[1] 108 124 136 140 143 160 168 179 217 227
ۧۧ The lower half of the data is 108 through 143. The middle value of the
lower half is 136. One-quarter of the values are ≤ 136, and 3/4 of the
values are > 136. The upper half of the data is 160 through 227.
ۧۧ The middle value of the upper half is 179, which represents the third
quartile, Q3 . Three-quarters of the values are < 179, and 1/4 of the values
are ≥ 179.
ۧۧ A box plot is a vertical rectangular box with 2 vertical whiskers that
extend from the ends of the box to the smallest and largest data values
that are not outliers. Outlier values, if any exist, are marked as points
above or below the endpoints of the whiskers.
ۧۧ The smallest and largest non-outlier data values label the endpoints of
the axis. The first quartile marks the lower end of the box, and the third
quartile marks the upper end of the box. The central 50% of the data
falls within the box.
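A minimal sketch of a box plot of the horsebean weights in R; the title is illustrative.
boxplot(x, main = "Chick Weights, Horsebean Feed")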
ۧۧ It’s possible to do some of the basic statistics that will be covered in this
course using spreadsheet software, such as Excel, but the best way to
learn R is to start with the basics, not wait until you get to something
your spreadsheet can’t handle. And an added bonus to beginning with R
for this course is that many of the datasets we use come bundled with R.
STATISTICAL ASSUMPTIONS
ۧۧ No matter what we do in statistics, it’s important to keep track of the
statistical assumptions underlying what we’re doing.
SUGGESTED READING
Diez, Barr, and Cetinkaya-Rundel, OpenIntro Statistics, “Introduction to Data,”
sections 1.1–1.8.
Yau, R Tutorial, “R Introduction,” http://www.r-tutor.com/r-introduction.
——— , R Tutorial, “Numerical Measures,” http://www.r-tutor.com/
elementary-statistics/numerical-measures.
PROBLEMS
1 Eight athletes competed in the 100-yard dash during a local high school
tournament, resulting in the following completion times: 13.8, 14.1, 15.7, 14.5,
13.3, 14.9, 15.1, 14.0. Calculate the mean, median, variance, and standard
deviation of the data.
a) mean
b) median
c) standard deviation
d) variance
EXPLORATORY DATA
VISUALIZATION IN R
This course uses a powerful computer programming
language known as R to help us analyze and
understand data. R is the leading tool for statistics,
data analysis, and machine learning. It is more than
a statistical package; it’s a programming language,
so you can create your own objects, functions, and
packages. There are more than 2000 cutting-edge,
user-contributed packages available online at the
Comprehensive R Archive Network ( CRAN ).
WHY DO WE USE R?
ۧۧ We use R for several reasons. It’s free, and it’s open source, meaning
that anyone can examine the source code to see exactly what it’s doing.
It explicitly documents the steps of your analysis. R makes it easy to
correct, reproduce, and update your work. You can use it anywhere, on
any operating system.
ۧۧ With R, everything is accomplished via code. You load your data into R
and explore and manipulate that data by running scripts. It’s easy to
reproduce your work on other datasets. Because all data manipulation
ۧۧ It’s easy to get help online; you can show exactly what you’re using and
ask very specific questions. In fact, most of the time when you get help
online, people will post the exact code that addresses your issue. Stack
Overflow ( http://stackoverflow.com/ ) is a community of roughly 7
million programmers helping each other.
ۧۧ You can load any data into R. It doesn’t matter where your data is or
what form it’s in. You can load CSV files. The first time, it’ll ask you to
install required packages. Just say yes.
#install.packages("readr")
#library(readr)
#shoes <- read_csv("C:/Users/tawilliams/Desktop/shoes.csv")
#View(shoes)
ۧۧ Open RStudio and locate the R Console window on the left ( or lower
left, if you have 4 panes ). Type immediately after the > prompt the
expression 3 + 5 and then hit the return key.
3+5
[1] 8
ۧۧ The prompt > indicates that the system is ready to receive commands.
Writing an expression, such as 5 + 5, and hitting the return key sends
the expression to be executed.
x = 3
x
[1] 3
y = 5
y
[1] 5
x+y
[1] 8
x * y
[1] 15
x / y
[1] 0.6
z = x / y
z
[1] 0.6
c(3,0,10,-4,0.5)
[1] 3.0 0.0 10.0 -4.0 0.5
ۧۧ For example, if we want to save the vector of data under the name
“widget,” then write the following expression at the prompt.
widget = c(3,0,10,-4,0.5)
widget
[1] 3.0 0.0 10.0 -4.0 0.5
widget + 2
[1] 5.0 2.0 12.0 -2.0 2.5
widget * widget
[1] 9.00 0.00 100.00 16.00 0.25
widget^2
[1] 9.00 0.00 100.00 16.00 0.25
PLOTTING IN R
ۧۧ You’ll usually want to save your work. To do that, we need to open a
script file. Go to File → New File → R Script. That opens a panel in the
upper left of your screen. In that script window, we can try the code
below to generate our first plot.
x = c(1,2,3,4,5)
y = c(1,8,27,64,125)
plot(x,y)
install.packages("datasets")
library(datasets)
data(faithful)
plot(faithful)
ۧۧ When a window pops up, type the name of the package in the space for
packages. In this case, type “datasets” and press Install. The package
will automatically update to your computer.
ۧۧ From your R script, type and highlight “library( datasets )” and run that
line of code by clicking the “Run” button to run your selected lines. This
loads the datasets library.
data(faithful)
plot(faithful)
HISTOGRAMS
ۧۧ A histogram is a plot that lets you discover and show the underlying
shape of a set of continuous data. You can also inspect the data for
outliers and overall spread.
ۧۧ To get the histogram, count the occurrence of each value of the variable
and plot the number for each count ( the frequency ) on the 𝑦-axis. The
values can be displayed as frequencies or percentages.
hist(faithful$waiting)
x = rnorm(30); qqnorm(x); qqline(x)    # normal Q-Q plot of 30 simulated values
PITFALL
ۧۧ Be careful not to overwrite built-in functions in R.
ൖൖ Don’t do this:
mean = (5+7)/2
mean
?mean    # open the help page for the mean( ) function
?c       # help for c( )
?t       # help for t( ), the built-in transpose function
PROBLEMS
1 Exploratory data analysis can be used to
library(MASS)
data("painters")
# Use the table function to create barplots
barplot(table(painters$Composition), main = "Composition Score")
barplot(table(painters$Drawing), main = "Drawing Score")
barplot(table(painters$Colour), main = "Colour Score")
barplot(table(painters$Expression), main = "Expression Score")
SAMPLING AND
PROBABILITY
Statistics sharpens our knowledge of how
randomness is all around us. Life is full of having
to make decisions under uncertainty. Two
fundamental ideas in statistics are uncertainty and
variation. Probability is the foundation that helps us
understand uncertainty and variation. This is why
probability plays a key role in statistics. Probability is
a mathematical language used to measure uncertain
events. Whenever we collect data or make measurements,
the process that we use is subject to variation, meaning
that if the same measurements were repeated, the
answer would be slightly different.
PROBABILITY
ۧۧ Data is the raw information from which statistics are created. Data
is being collected everywhere all the time. When we collect data, we
convert information to numbers. Statistical thinking gives you the tools
to intelligently extract information from data.
1 State the question that we’re interested in. Maybe we want to know
whether a new cold medicine will relieve coughing within 48 hours.
Whatever the question is, we need to be able to gather information
that will help us make a decision.
2 Collect data that helps answer the question. Suppose that you give
some samples of cold medicine to your coughing friends and record
how many of them stopped coughing within a 48-hour period.
You just introduced a bias into your data. It’s likely that your close
friends have similar characteristics as you do. They’re in your same
age bracket, live in the same city, or are the same gender as you. For
your results to be as widely applicable as possible, you have to collect
data in a way that is objective and rigorous. To remove bias entirely,
you have to collect a sample where every person has an equal chance
of being selected.
ۧۧ If you sample 10 people and they are all right-handed, you can’t conclude
that the probability of being right-handed is 100%. It takes a much
larger sample of people to accurately predict the proportion of people
that are truly right-handed.
ۧۧ What happens when we have real, but limited, data, for which we’d
like to calculate probabilities? Suppose that you want to understand
how a grocery store displays cereal boxes and whether different types
of breakfast cereals are targeted to adults or children. You go into your
local grocery store and notice that there are 6 rows of shelves and
that each shelf has 5 boxes of cereal. You can group the 6 shelves into
3 categories: the bottom 2 shelves, the middle 2 shelves, and the top
2 shelves.
ۧۧ It’s natural to think about how 2 events relate to each other. For
example, what’s the probability that the cereal is targeted at adults
and is located on the middle 2 shelves? In this case, we want to look
at our data to see where those 2 events occur together. Notice that 3
( not 2 ) types of breakfast cereal are located in the middle 2 shelves and
targeted at adults. So, the probability would be 3 out of the 30 cereals, or 1/10.
ۧۧ We call the probability of A and B the intersection, the place where the
2 events overlap.
ۧۧ What is the probability of C and B, where B is the event that the cereal is
located on the middle 2 shelves? Again, we’re looking at the overlap, or
intersection, of these 2 events.
ۧۧ From experiments, we can build a sample space, which is the set of all
possible outcomes of that experiment. When flipping a fair coin, the
sample space, S, would be equal to either H ( for heads ) or T ( for tails ):
S = {H, T}.
ۧۧ Likewise, if we are rolling a fair die twice, the sample space is all
possible combinations of those 2 rolls: {( 1, 1 ), ( 1, 2 ), …, ( 1, 6 ), …, ( 6, 1 ),
…, ( 6, 5 ), ( 6, 6 )}.
ۧۧ When flipping a fair coin, the probability that we flip a head is 1/2. We get
that by taking the number of events in flipping a head, which is H, or 1
event, out of the total number of possible events, which is 2.
ۧۧ Likewise, if we roll a fair die twice, the probability that the sum equals
4 is the number of outcomes giving that sum ( {( 1, 3 ), ( 2, 2 ), ( 3, 1 )} ) out of
the total 36 possibilities: 3 divided by 36.
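We can verify this by enumerating the sample space in R (a quick sketch).
rolls <- expand.grid(die1 = 1:6, die2 = 1:6)    # all 36 equally likely outcomes
mean(rolls$die1 + rolls$die2 == 4)              # proportion of outcomes with a sum of 4
[1] 0.08333333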
ۧۧ At first, this might not seem very useful. In fact, it seems rather circular
that we’ve just rewritten P( A and B ) and P( B and A ) and set them equal.
But if you divide both sides by P( B ), you’re left with a famous and useful
result that relates conditional probabilities known as Bayes’s rule.
ۧۧ The disease is rare and deadly and occurs in 1 out of every 10,000
people. Unfortunately, your test result is positive. What’s the chance
that you actually have the disease?
B = test is positive
ۧۧ So, the test is positive, and the test is accurate 98% of the time. However,
you have less than a 1% chance of having the disease.
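A sketch of the Bayes’s rule calculation in R, assuming the 98% accuracy applies both to people who have the disease and to people who don’t (sensitivity = specificity = 0.98).
prior <- 1 / 10000                                  # P(have the disease)
sens  <- 0.98                                       # P(test positive | disease), assumed
spec  <- 0.98                                       # P(test negative | no disease), assumed
p_pos <- sens * prior + (1 - spec) * (1 - prior)    # total probability of a positive test
sens * prior / p_pos                                # P(disease | positive), roughly 0.005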
PROBLEMS
1 On a single toss of a fair coin, the probability of heads is 0.5 and the probability
of tails is 0.5. If you toss a coin twice and get tails on the first toss, are you more
likely to get heads on the second toss?
2 Isabella runs a small jewelry store. Last week, she counted 143 people who
walked by her store. Of the 143 people, 79 of them came in. Of the 79 that came
in, 53 people bought something in the store.
a) What’s the probability that a person who walks by the store will buy
something?
b) What’s the probability that a person who walks in the store will buy
something?
c) What’s the probability that a person who walks in the store will buy
nothing?
d) What’s the probability that a person who walks by the store will come
in and buy something?
DISCRETE DISTRIBUTIONS
Random variables are used to model situations in
which the outcome, before the fact, is uncertain.
In other words, a random variable is a real
number whose value is based on the random outcome
of an experiment. A list of all possible outcomes for a
given random variable is called a sample space. This
space includes the outcome that eventually did take
place but also all other outcomes that could have
taken place but never did. The idea of a sample space
puts the outcome that did happen in a larger context
of all possible outcomes. A random variable can be
either discrete or continuous. A discrete random
variable takes on discrete, or countable, values.
DISCRETE DISTRIBUTIONS
ۧۧ Certain discrete distributions appear frequently in real life and have
special names.
ൖൖ For example, the number of times that heads might appear out of 10
coin flips follows a binomial distribution.
ൖൖ There’s also a limiting case of the binomial, where each actual event
is rare, almost like the number of times the coin lands on neither
heads nor tails. This is called the Poisson distribution, and it’s always
about an unusual outcome—for example, the number of defects on a
semiconductor chip.
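R has built-in functions for these distributions; for example (a sketch, with illustrative numbers):
dbinom(6, size = 10, prob = 0.5)    # P(exactly 6 heads in 10 fair coin flips)
dpois(2, lambda = 0.5)              # P(exactly 2 defects), if defects average 0.5 per chip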
ۧۧ Suppose that you are paid $1 for each head that appears in a coin-
flipping experiment. Up to how much should you be willing to pay
to play this game if you plan to play it only once? To help you decide,
imagine that you can play the game a large number of times and observe
how much you win on average.
ۧۧ You should be willing to pay up to $1.50 to play this game to come out
ahead, on average. This is the idea of expected value.
ۧۧ Suppose that if X heads come up, you win $X². Now how much should
you be willing to pay?
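A simulation sketch, assuming the game consists of 3 fair coin flips (which makes the average winnings $1.50).
set.seed(1)
heads <- rbinom(100000, size = 3, prob = 0.5)    # heads in each simulated 3-flip game
mean(heads)      # average winnings in dollars, close to 1.5
mean(heads^2)    # average winnings if the payout is $X^2 instead, close to 3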
ۧۧ This is a valid PMF because it sums to 1 over all of the possible values
of X.
ۧۧ For example, if you have 4 items and you want to know how many ways
you can pick 2 items out of those 4, you can plug it into this formula to
get 6 ways.
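In R, the choose( ) function computes this directly.
choose(4, 2)
[1] 6
factorial(4) / (factorial(2) * factorial(2))    # the same, from the formula n!/(k!(n-k)!)
[1] 6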
ۧۧ What is the expected value? If you flip a coin 𝑛 times and each time has
a probability 𝑝 of yielding heads, on average how many heads do you
expect to get?
ۧۧ Suppose that we again have a series of Bernoulli trials. Let’s define the
random variable X as the number of trials until r successes occur. Then,
X is a negative binomial random variable with parameters 0 < 𝑝 < 1 and
r = 1, 2, 3, … .
ۧۧ The time you need to wait for the emission of 10 alpha particles might
be a sum of exponential distributions known as the gamma distribution.
ۧۧ The variance of the negative binomial is Var( X ) = r( 1 − 𝑝 )/𝑝².
ۧۧ Because X is uniform, f( 𝑥 ) = 𝑐 for some constant 𝑐, and we need the area
under the curve to equal 1.
ۧۧ Therefore, for a uniform distribution on the interval [ a, b ]:
ൖൖ 𝑐 = 1/( b − a )
ൖൖ E( X ) = ( a + b )/2
ൖൖ Var( X ) = ( b − a )²/12
ۧۧ For example, if the outcomes of interest are “has cancer” and “does not
have cancer,” the probabilities of having cancer are ( in most cases )
much less than 1/2. The number of possible outcomes in an experiment
doesn’t necessarily say anything about the probability of the outcomes.
SUGGESTED READING
Diez, Barr, and Cetinkaya-Rundel, OpenIntro Statistics, “Probability,” section
2.5, and “Distributions of Random Variables,” sections 3.3–3.5.
Yau, R Tutorial, “Probability Distributions,” http://www.r-tutor.com/
elementary-statistics/probability-distributions.
PROBLEMS
1 If X has a binomial distribution with 𝑛 = 20 trials and a mean of 5, then the
success probability 𝑝 is:
a) 0.10.
b) 0.20.
c) 0.25.
d) Need to first take a sample.
2 Suppose that each ticket purchased in the local lottery has a 20% chance of
winning. Let X equal the number of winning tickets out of 10 that are purchased.
CONTINUOUS AND
NORMAL DISTRIBUTIONS
The normal distribution is one of the most common and
widely used distributions in statistics. Normal
distributions come in many means and standard
deviations, but they all have a signature shape, where
the data values fall into a smooth, bell-shaped curve.
The data are concentrated in the center, but some of
them are more spread out than others. The spread of the
distribution is determined by the standard deviation.
NORMAL DISTRIBUTION
ۧۧ Every normal distribution has certain properties that distinctly
characterize it.
1 The shape is symmetric, meaning that if you were to cut the distribution
in half, the left side would be a mirror image of the right side.
3 The mean, median, and mode are all the same, and we can find them
directly in the center of the distribution.
ۧۧ So, rather than directly solving a problem where X ~ N( μ, σ ), we use an
indirect approach.
Z = ( X − μ )/σ ~ N( 0, 1 )
lower.tail = FALSE means return the probability contained in the upper tail,
i.e. P( X ≥ 𝑎 ).
pnorm(2) - pnorm(-2)
= 0.9544997
pnorm(3) - pnorm(-3)
= 0.9973002
ۧۧ Suppose that X ~ N( 13, 4 ). What’s the probability that X falls within ±1
standard deviation of its mean?
pnorm(1) - pnorm(-1)
= 0.6826895
ۧۧ To find the 90th percentile of X ~ N( 10, 5 ), we seek the value 𝑎 such that
P( X ≤ 𝑎 ) = 0.90.
qnorm(0.90)
= 1.2815516
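We can also ask qnorm( ) for the percentile of the unstandardized distribution directly.
qnorm(0.90, mean = 10, sd = 5)
= 16.40776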
ۧۧ This person will need to consult a doctor if his or her cholesterol level
is > 158.
ۧۧ So, the probability that a randomly selected person will need to consult
a doctor is approximately 10%.
ۧۧ What’s the cholesterol level below which 95% of this population lies?
1 Solve directly:
qnorm(0.95)
= 1.6448536.
SUGGESTED READING
Diez, Barr, and Cetinkaya-Rundel, OpenIntro Statistics, “Distributions of
Random Variables,” sections 3.1–3.2.
Yau, R Tutorial, “Probability Distributions,” http://www.r-tutor.com/
elementary-statistics/probability-distributions.
PROBLEMS
1 A normal density curve has which of the following properties?
a) It is symmetric.
b) The peak of the distribution is centered above its mean.
c) The spread of the curve is proportional to the standard deviation.
d) All of the above.
COVARIANCE AND
CORRELATION
If you’re new to statistics, you may be ready to jump
on the cause-and-effect bandwagon when you find
a strong relationship between 2 variables. But have
you ever thought about why 2 variables might be
correlated? So far, when we’ve considered variance,
we’ve limited ourselves to 1 variable. But what if
we have 2 variables that we think might be related?
How might they vary together? This brings us to the
idea of covariance, and from there to correlation.
COVARIANCE
ۧۧ Suppose that you poll a statistics class and ask them the total number of
hours they spent studying for their last exam and collect the following
data.
Hours Studied
X = {2, 3, 5, 6, 8, 9, 10, 13}
ۧۧ You want to see if studying has any relationship to their actual test
scores.
Test Scores
Y = {58, 75, 71, 77, 80, 88, 83, 95}
x = c(2,3,5,6,8,9,10,13)
y = c(58,75,71,77,80,88,83,95)
plot(x,y,main = "Hours Spent Studying vs. Test Score",
xlab = "Hours Spent Studying",
ylab = "Test Score",pch=20)
ۧۧ Notice that there’s variability along the 𝑥-axis and variability along the
𝑦-axis.
x = c(2,3,5,6,8,9,10,13)
y = c(58,75,71,77,80,88,83,95)
#First Deviation
(2 - 7) * (58 - 78.4)
[1] 102
-5 * -20.4
[1] 102
x = c(2,3,5,6,8,9,10,13)
y = c(58,75,71,77,80,88,83,95)
#Second Deviation
(3 - 7) * (75 - 78.4)
[1] 13.6
-4 * -3.4
[1] 13.6
x = c(2,3,5,6,8,9,10,13)
y = c(58,75,71,77,80,88,83,95)
#Last (8th) Deviation
(13 - 7) * (95 - 78.4)
[1] 99.6
6 * 16.6
[1] 99.6
x = c(2,3,5,6,8,9,10,13)
y = c(58,75,71,77,80,88,83,95)
cov(x,y)
[1] 38
ۧۧ In this case, the covariance of X and Y is 38. But what does that mean? Is
that a large covariance or a small covariance?
ۧۧ The problem is that the covariance can take on values of any size. One
person might have a dataset with a covariance of 500 and another might
have a dataset with a covariance of 5. Unless their data is measured in the
exact same units, they can’t even compare those 2 numbers.
ۧۧ The problem with covariance is that it can’t tell us how strong the relationship
is between X and Y. We need to go one step further.
CORRELATION
ۧۧ If we take the covariance and divide through by the product of the 2
standard deviations, then magic begins to happen. What we’ve done is
scale it to a dimensionless measure, meaning that it has no units attached
to it. It’s called the correlation coefficient, and it’s a popular way to measure
the strength of a linear relationship between 2 random variables.
ۧۧ This is a very strong positive relationship, as you can see from the
original scatterplot.
ۧۧ In R:
x = c(2,3,5,6,8,9,10,13)
y = c(58,75,71,77,80,88,83,95)
cov(x,y) / (sd(x)*sd(y))
[1] 0.9173286
cor(x,y)
[1] 0.9173286
ۧۧ In fact, the correlation and covariance will always have the same sign—
either both positive or both negative.
ۧۧ Let’s look at our Old Faithful dataset, which compares the waiting time
to the length of eruptions of the Old Faithful geyser. In R, we’re able to
calculate the correlation for an entire dataset using the “cor” function.
data(faithful)
round(cor(faithful),4)
ۧۧ R returns a 2-by-2 matrix. We call this the correlation matrix. In the first
column, “eruptions” is perfectly correlated with itself and is also highly
correlated with “waiting,” at a value of 0.9008. The second column
likewise gives the correlation between waiting and eruptions and the
correlation between waiting and itself, equal to 1.
eruptions waiting
eruptions 1.0000 0.9008
waiting 0.9008 1.0000
library(datasets)
data("Harman23.cor")
round(Harman23.cor$cov,2)
height arm.span forearm lower.leg weight
height 1.00 0.85 0.80 0.86 0.47
arm.span 0.85 1.00 0.88 0.83 0.38
forearm 0.80 0.88 1.00 0.80 0.38
lower.leg 0.86 0.83 0.80 1.00 0.44
weight 0.47 0.38 0.38 0.44 1.00
bitro.diameter 0.40 0.33 0.32 0.33 0.76
chest.girth 0.30 0.28 0.24 0.33 0.73
chest.width 0.38 0.42 0.34 0.36 0.63
ۧۧ Notice that along the diagonal, the values all equal 1. This is because
each variable is perfectly correlated with itself.
ۧۧ Find some of the higher correlations. Height and lower leg have a
correlation of 0.86. This makes sense, because if a person is tall, that
person is likely to have long legs. Arm span and forearm have a correlation
of 0.88, which is also logical because the forearm is included in arm span.
library(car)
data("Salaries")
head(Salaries)
ۧۧ The Salaries dataset has the 2008 to 2009 9-month academic salary
for assistant professors, associate professors, and full professors in a
particular college in the United States.
data("Salaries")
head(Salaries)
rank discipline yrs.since.phd yrs.service sex salary
1 Prof B 19 18 Male 139750
2 Prof B 20 16 Male 173200
3 AsstProf B 4 3 Male 79750
4 Prof B 45 39 Male 115000
5 Prof B 40 41 Male 141500
6 AssocProf B 6 6 Male 97000
cor(Salaries$yrs.since.phd, Salaries$yrs.service)
[1] 0.9096491
cor(Salaries[,c(3,4,6)])
yrs.since.phd yrs.service salary
yrs.since.phd 1.0000000 0.9096491 0.4192311
yrs.service 0.9096491 1.0000000 0.3347447
salary 0.4192311 0.3347447 1.0000000
PITFALLS
ۧۧ The correlation coefficient r looks for a linear relationship and assumes
that the 2 variables are normally distributed. If you suspect a nonlinear
relationship, consider transforming the data, for example by taking the log or
raising it to a power.
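A sketch of the idea on simulated data; the data-generating model here is hypothetical, chosen to have a curved (exponential) trend.
set.seed(1)
x <- runif(100, 1, 10)
y <- exp(x + rnorm(100, sd = 0.5))    # hypothetical data with an exponential trend
cor(x, y)          # weaker on the raw scale
cor(x, log(y))     # much stronger after a log transform of y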
ۧۧ Correlation does not imply causality. Just because X and Y are correlated
does not mean that X causes Y. They could both be caused by some other
factor Z, or Y might cause X instead.
PROBLEMS
1 Suppose that you find a correlation of 0.65 between an individual’s income
and the number of years of college that individual has completed. Which of the
following 4 statements can we conclude?
2 The “cars” dataset in R contains the speed of cars and the distances taken to
stop in the 1920s.
library(datasets)
data("cars")
summary(cars)
cor(cars$speed, cars$dist)
cov(cars$speed, cars$dist)
VALIDATING STATISTICAL
ASSUMPTIONS
Statistical graphs are useful in helping us visualize
data. Through graphs, we understand data
properties, such as the mean, median, and standard
deviation; find patterns in data, such as clustering and
correlation; suggest an underlying model that could have
generated the data; verify our assumptions; check and
fortify our analysis; summarize; and communicate results.
This lecture will define and identify basic summaries
of data, both numerical and graphical, and use R for
calculating descriptive statistics, making graphs, and
even writing functions that work on multiple datasets.
IRIS DATA
ۧۧ The “iris” dataset is widely used throughout statistical science for
illustrating various problems in statistical graphics, multivariate
statistics, and machine learning. It’s a small but nontrivial dataset. The
data values are real ( as opposed to simulated ) and are of high quality
( collected with minimal error ). The data were used by the celebrated
British statistician Ronald Fisher in 1936.
library(datasets)
library(RColorBrewer)
attach(iris)
ۧۧ The iris species are so similar that they are difficult to separate visually.
So, American botanist Edgar Anderson gathered the data we now have
to look for statistical differences that might help identify each species.
BAR PLOTS
ۧۧ Bar plots are useful for showing comparisons across several groups.
Although it looks like a histogram, a bar plot is plotted over a label that
represents a category ( e.g., iris type ).
ۧۧ One difference you might notice is that the bars of a bar plot are
separated with spaces in between, while in a histogram, the values are
plotted right next to one another, with no space in between.
BOX PLOTS
ۧۧ The summary function is a quick and easy way to assess the statistical
properties of each attribute. These values are displayed graphically in
a box plot.
summary(iris[,1: 2])
Sepal.Length Sepal.Width
Min.: 4.300 Min.: 2.000
1st Qu.: 5.100 1st Qu.: 2.800
Median: 5.800 Median: 3.000
Mean: 5.843 Mean: 3.057
3rd Qu.: 6.400 3rd Qu.: 3.300
Max.: 7.900 Max.: 4.400
ۧۧ Box plots are used to compactly show many pieces of information about
a variable’s distribution. They are great for visualizing the spread of the
data. Box plots show 5 statistically important numbers: the minimum,
the 25th percentile, the median, the 75th percentile, and the maximum.
ۧۧ A box plot can also be used to show how one attribute, such as petal
length, varies with another attribute, such as iris type.
ۧۧ A color palette is a group of colors that is used to make the graph more
appealing and help create visual distinctions in the data.
boxplot(iris$Sepal.Length ~ iris$Species, col = heat.colors(3),
main = "Sepal Length vs. Species")
SCATTERPLOTS
ۧۧ Scatterplots are helpful for visualizing data and simple data inspection.
Let’s try the following code.
ۧۧ Scatterplots are used to plot 2 variables against each other. We can add
a third dimension by coloring the data values according to their species.
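A sketch of the kind of scatterplot described above, with points colored by species; the choice of variables is illustrative.
plot(iris$Sepal.Length, iris$Petal.Length,
     col = iris$Species, pch = 20,
     xlab = "Sepal Length", ylab = "Petal Length",
     main = "Iris Measurements by Species")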
ۧۧ For datasets with only a few attributes, we can construct and view
all the pairwise scatterplots. In the first row, all the 𝑦-values are
represented by Sepal.Length on the 𝑦-axis. In the first column, all the
𝑥-axis values are represented by Sepal.Length on the 𝑥-axis.
ۧۧ Likewise, in the second row, all the 𝑦-values are represented by Sepal.
Width on the 𝑦-axis. In the second column, all the 𝑥-axis values are
represented by Sepal.Width on the 𝑥-axis.
ۧۧ Because the upper and lower graphs are duplicates of each other, let’s
change our code to show the correlation between our variables in the
upper level.
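One way to do this (a sketch): pairs( ) accepts a custom upper.panel function, which we can use to print the correlation in each upper panel.
panel.cor <- function(x, y, ...) {
  par(usr = c(0, 1, 0, 1))                       # use a simple 0-1 coordinate system
  text(0.5, 0.5, format(cor(x, y), digits = 2))  # print the correlation in the panel
}
pairs(iris[, 1:4], col = iris$Species, upper.panel = panel.cor)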
ۧۧ R has a package called ggplot2 that allows you to produce visually
appealing figures. It is used for making quick, professional-looking
plots with minimal code.
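A minimal ggplot2 sketch; the package must be installed first, and the variables shown are illustrative.
# install.packages("ggplot2")    # if not already installed
library(ggplot2)
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point()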
ۧۧ Let’s create some histograms of our iris data. The number of bins in the
histogram is variable.
hist(iris$Petal.Width, breaks=13)
hist(iris$Petal.Width, breaks=25)
dens.pw = density(iris$Petal.Width)
plot(dens.pw, ylab = "Frequency", xlab = "Width", main = "Petal Width Density")
CONTOUR PLOTS
ۧۧ Density estimation is available for higher-dimensional data using
contour plots. A contour plot is a graph that explores the potential
relationship among 3 variables.
ۧۧ The plot may also be viewed as a heat map, with brighter colors denoting
more values in those regions.
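A sketch of a 2-dimensional density estimate of petal length versus petal width, shown both as a contour plot and as a heat map; this uses kde2d( ) from the MASS package.
library(MASS)                                                  # for kde2d()
dens2d <- kde2d(iris$Petal.Length, iris$Petal.Width, n = 50)   # 2-D kernel density estimate
contour(dens2d, xlab = "Petal Length", ylab = "Petal Width")   # contour view
image(dens2d, xlab = "Petal Length", ylab = "Petal Width")     # heat map view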
qqnorm(quantile.virginica, main="Virginica")
qqline(quantile.virginica)
qqnorm(quantile.versicolor, main="Versicolor")
qqline(quantile.versicolor)
shapiro.test(quantile.setosa)
data: quantile.setosa
W = 0.96247, p-value = 0.4658
shapiro.test(quantile.versicolor)
data: quantile.versicolor
W = 0.96319, p-value = 0.4815
shapiro.test(quantile.virginica)
data: quantile.virginica
W = 0.97161, p-value = 0.6861
PITFALLS
ۧۧ With histograms, varying the bin width can be helpful, but it can also
be problematic.
ۧۧ Let’s do a histogram for Petal.Width, and let’s only give it 3 bins, or set
the break sequence to 3. Here’s what that histogram looks like.
ۧۧ This histogram doesn’t really give us information about the shape of the
data. It has poor bin width. One solution is to overlay the density plot.
The density is sort of like a smooth version of the histogram. This is one
way that we can tell if our histogram is accurately picking up the shape
of the spread of our underlying data.
ۧۧ But what happens when we overlay the histogram with the density
function? The density function in the second bin is showing us that
maybe there’s some stuff that’s not quite coming up in our graph.
hist(iris$Petal.Length, prob=TRUE)    # histogram alone, on a density scale
hist(iris$Petal.Length, prob=TRUE)    # redraw the histogram . . .
lines(density(iris$Petal.Length))     # . . . and overlay the density estimate
ۧۧ Overall, graphical data analysis has become a major way to avoid many
pitfalls in statistics. Once upon a time, graphical data analysis was
rather challenging ( or at least time consuming ) to do, but that’s all
changed. You should always take advantage of how easy it has become
to display your data and begin your analysis in a very visual way.
SUGGESTED READING
Diez, Barr, and Cetinkaya-Rundel, OpenIntro Statistics, “Evaluating the Normal
Approximation,” section 3.2.
PROBLEMS
1 The “cars” dataset in R contains the speed of cars and the distances taken to
stop in the 1920s.
a) Use the following code to load the data and graph the quantile-quantile
(Q-Q) plot for the variables “distance” and “speed.” Comment on whether
the true underlying distribution appears to be normally distributed.
b) Use the Shapiro-Wilk test to determine whether the data are normally
distributed. (Recall that, typically, if the p-value is greater than 0.05, then
the data are normally distributed.)
shapiro.test(cars$dist)
shapiro.test(cars$speed)
Statisticians are often called on to do consulting,
whether for individuals, companies, or nonprofits.
This lecture focuses on a consulting example
involving a large metropolitan city that is considering
building a new hospital to meet the needs of residents.
They want to survey the population to better understand
the typical emergency room ( ER ) demand so that
they can plan an appropriate number of beds.
SAMPLE MEANS
ۧۧ One thousand residents are sampled and their numbers of visits to the
ER are recorded. The number of times a person visited the ER ranges
from 0 to 59 times in a year. The vector “counts” gives how many residents
reported each number of visits, and the vector “visits” repeats each visit
count once for each corresponding resident.
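A toy sketch of how such a “visits” vector can be built with rep( ); the values below are hypothetical, not the survey data.
times  <- c(0, 1, 2, 3)       # hypothetical distinct numbers of ER visits
counts <- c(4, 3, 2, 1)       # hypothetical number of residents reporting each
visits <- rep(times, counts)  # one entry per sampled resident
visits
[1] 0 0 0 0 1 1 1 2 2 3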
mean.age<-function(n) {
trials = 1000
my.samples <- matrix(sample(visits, size=n*trials,
replace=TRUE), trials)
means <- apply(my.samples, 1, mean)
means
}
ۧۧ Let X̄𝑛 = ( X1 + ⋯ + X𝑛 )/𝑛 be the sample mean of the visits of 𝑛 people drawn
at random with replacement from our dataset. For each of the 1000 trials,
X̄𝑛 is computed. The 1000 sample means are returned in a vector, “means.”
par(mfrow=c(1,2))
hist(mean.age(1), main="Mean of 1 Visit", xlab="Number of Visits")
hist(visits, main="ER Visits Data", xlab="Number of Visits")
par(mfrow=c(2,3))
MA1<-mean.age(1)
MA2<-mean.age(2)
MA10<-mean.age(10)
MA20<-mean.age(20)
MA100<-mean.age(100)
MA200<-mean.age(200)
hist(MA1, xlim=c(0,60))
hist(MA2, xlim=c(0,60))
hist(MA10, xlim=c(0,60))
hist(MA20, xlim=c(0,60))
hist(MA100, xlim=c(0,60))
hist(MA200, xlim=c(0,60))
vars
n variance
1 1 95.5395355
2 2 46.8052412
3 10 8.9583081
4 20 4.4299850
5 100 0.9604743
6 200 0.4707666
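One way such a table could be built, using the mean.age( ) function defined above (a sketch; exact values vary with the random draws).
sizes <- c(1, 2, 10, 20, 100, 200)
vars  <- data.frame(n = sizes,
                    variance = sapply(sizes, function(n) var(mean.age(n))))
vars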
ۧۧ Plot the variance of each sample mean versus the sample size.
plot(vars$n, vars$variance)
QUANTILE-QUANTILE PLOTS
ۧۧ We can test how close the distribution of X̄𝑛 is to the normal distribution
by examining quantile-quantile ( Q-Q ) plots. In a Q-Q plot, quantiles of
the sample are plotted against quantiles of a proposed distribution,
also known as theoretical quantiles.
par(mfrow=c(2,3))
qqnorm(MA1)
qqnorm(MA2)
qqnorm(MA10)
qqnorm(MA20)
qqnorm(MA100)
qqnorm(MA200)
par(mfrow=c(2,3))
shapiro.test(MA1)
shapiro.test(MA2)
shapiro.test(MA10)
shapiro.test(MA20)
shapiro.test(MA100)
shapiro.test(MA200)
data: MA1
W = 0.92764, p-value < 2.2e-16
data: MA2
W = 0.96688, p-value = 2.603e-14
data: MA10
W = 0.9976, p-value = 0.153
data: MA20
W = 0.99629, p-value = 0.01751
data: MA100
W = 0.99821, p-value = 0.3811
data: MA200
W = 0.99833, p-value = 0.4477
SAMPLING DISTRIBUTIONS
ۧۧ A statistic is a value computed from data. In our example, our statistic is
the average number of ER visits.
ۧۧ We’ve seen empirically ( i.e., from data ) that the sampling distribution
of the mean approaches a normal distribution for the ER visits data.
Does this happen in general, for any distribution of data, or just for data
similar to the ER data?
ۧۧ The mean tells us the center of that distribution. The standard deviation
tells us the spread. The central limit theorem tells us that, no matter
what the population distribution looks like, the distribution of the
sample means will approach a normal distribution.
ۧۧ Let X equal the number of home team fans in attendance. This takes us
back to the binomial distribution, because X ~ Bin( 𝑛 = 3000, 𝑝 = 0.60 ).
ۧۧ We wouldn’t want to solve this using the binomial distribution. It’s too
tedious of a calculation. But what we can do is approximate the binomial
with the normal distribution.
pnorm(1.8820239, lower.tail=FALSE)
= 0.0299164
ۧۧ So, there’s a 2.99% chance that we have more than 1850 fans in
attendance.
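A sketch of where the z value above comes from, using the normal approximation to the binomial with a continuity correction of 0.5.
n <- 3000
p <- 0.60
mu <- n * p                       # 1800 expected home team fans
sigma <- sqrt(n * p * (1 - p))    # about 26.83
z <- (1850.5 - mu) / sigma        # about 1.882, with the continuity correction
pnorm(z, lower.tail = FALSE)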
ۧۧ How far off are we from the binomial calculation? We can use R to do
the exact calculation for the binomial, and it’s called the pbinom.
1 - pbinom(1850,3000,.60)
= 0.029692
ۧۧ The value that we get is really close to what we got in our simpler
approximation with the normal distribution.
pnorm(1.86339, lower.tail=FALSE)
= 0.0312037
ۧۧ Notice that we’re slightly off. The actual binomial value was 0.029692.
ۧۧ Our normal approximation with the correction was the closer value
of 0.0299164, and if we did that normal approximation without the
correction, we would have been even further away from our binomial,
at 0.0312037.
PITFALLS
ۧۧ Often when statisticians consult with clients designing an experiment,
one of the clients’ top priorities is keeping costs down, which translates
to them wanting to take fewer samples and magically invoke the power of
the central limit theorem.
ۧۧ Unfortunately, it doesn’t work that way. Unless you know that your
true population is normally distributed, you need at least 30 or 40
samples before the central limit theorem kicks in.
PROBLEMS
1 If the central limit theorem is applicable, this means that the sampling
distribution of a population can be treated as normal because
the is .
2 Suppose that X equals the birth weight in grams of babies born in Yugoslavia. Let
E( X ) = 3325 and var( X ) = 6802. Let X̄ = the sample mean of a random sample of
size 𝑛 = 360 babies.
Descriptive statistics focus on summarizing
characteristics of our data, such as
calculating the sample mean or plotting
histograms, and describe the dataset that’s being
analyzed but don’t let us draw any conclusions or
make any inferences about the data. On the other
hand, statistical inference uses our dataset to extract
information about populations or answer real-world
questions with a stated level of confidence. It builds on the methods of
descriptive statistics, letting us draw conclusions
about the population based on data from a sample.
POINT ESTIMATES
ۧۧ Any time we have a random sample, we can calculate the sample mean,
variance, and standard deviation. These values are called point estimates.
library(datasets)
data("Orange")
ۧۧ We see the means from our box plot, but we can pull them out directly
using the mean function.
mean(Orange$age)
[1] 922.1429
mean(Orange$circumference)
[1] 115.8571
ۧۧ A point estimate for the average orange tree age would be 922.14
days, and a point estimate for orange circumference would be 115.85
millimeters.
median(Orange$age)
[1] 1004
median(Orange$circumference)
[1] 115
ۧۧ Another point estimate for the average orange tree age would be 1004,
and another point estimate for orange circumference would be 115.
ESTIMATION
ۧۧ The objective of estimation is to approximate the value of a population
parameter on the basis of a sample statistic. We can estimate both
points and intervals.
ۧۧ The most common point estimate is the sample mean X̄, which is used to
estimate the population mean μ. A point estimator gives us a particular
value to estimate our parameter with, whereas a confidence interval
gives us a range of possible estimator values.
ۧۧ Note that in inferential statistics, you’ll often see Greek letters for the
population parameters, with the corresponding sample statistic wearing a
hat on top. So, if you see μ̂ or σ̂, this refers to the point estimate computed
from the sample.
data("women")
attach(women)
summary(women)
data("women")
attach(women)
head(women)
height weight
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
6 63 129
plot(height,weight)
error of estimation = θ̂ − θ
ۧۧ Ideally, an estimator should have low variability ( to be precise ) and low
bias ( to be accurate ).
ۧۧ Other estimates exist, such as the range, or the average of the largest
and smallest values. But these estimates are not unbiased and should
not be used to estimate the mean.
CONSISTENCY
ۧۧ An unbiased estimator is consistent if the variance of the estimator
approaches 0 as our sample size ( 𝑛 ) approaches infinity. In other words,
the difference between the estimator and the population parameter
becomes smaller as the sample size increases.
ۧۧ For example, the sample mean has a variance of s2/𝑛. This variance goes
to 0 as the sample size increases.
ۧۧ For example, for a normal population, both the sample mean and median
are unbiased estimators, but the sample median has more variability
than the sample mean for a fixed sample size. Let’s look at this in R.
set.seed(1234)
x = cbind(rnorm(100,0,1),rnorm(100,0,1),rnorm(100,0,1),
rnorm(100,0,1),rnorm(100,0,1),rnorm(100,0,1), rnorm(100,
0,1),rnorm(100,0,1),rnorm(100,0,1), rnorm(100,0,1))
apply(x,2,'mean')
apply(x,2,'median')
[1] -0.157 0.041 0.155 -0.008 -0.022 -0.137 -0.088 -0.001
0.018 -0.068
[1] -0.385 0.033 0.278 -0.043 -0.009 -0.067 -0.050 -0.104
-0.052 -0.035
round(mean(apply(x,2,'mean')),3)
[1] -0.027
round(mean(apply(x,2,'median')),3)
[1] -0.043
round(var(apply(x,2,'mean')),3)
[1] 0.008
round(var(apply(x,2,'median')),3)
[1] 0.026
PITFALLS
ۧۧ A point estimate only gives a single number for a population
parameter. Several point estimates on the same dataset would give
you too many estimates to logically choose from. And each estimate
has its own associated error. Remember that a sample statistic is
always a random variable.
ۧۧ The problem becomes how to pick the best estimator. If our estimate
is unbiased, efficient, and precise, that makes a great point estimate.
But is there a better solution? Perhaps we could give a range of
values that an estimator could take on. That kind of range is called a
confidence interval.
PROBLEMS
1 If the mean of a sample statistic is not equal to the population parameter, then
the sample statistic is called
a) an unbiased estimator.
b) a biased estimator.
c) an interval estimator.
d) a point estimator.
2 The “cars” dataset in R contains the speed of cars and the distances taken
to stop in the 1920s. Find a point estimate for the population mean for both
“distance” and “speed.”
When we calculate a point estimate, our chances
of hitting the target population parameter are
not very likely. We’ll often get close but will
seldom hit the mark. In this lecture, you will learn about
confidence intervals, with which we can increase our
chances of capturing the true population parameter.
CONFIDENCE INTERVALS
ۧۧ A confidence interval draws inferences about a population by estimating
the value of an unknown parameter using an interval. We’re looking
for an interval that covers the true population parameter with some
amount of certainty or confidence.
ۧۧ But the problem is that a different random sample ( with different
observed values ) would yield different estimates for μ. Which of those
estimates ( the sample mean from our first sample or the sample mean
from our second sample ) would be closest to the true value? We’d really
have no way of knowing.
CONFIDENCE LEVELS
ۧۧ Confidence intervals allow us to estimate population parameters using
a range of values—a range of values that are more likely to capture the
true population parameter.
ۧۧ To do that, we need to set the confidence level, which tells us how likely it
is that the population parameter is actually contained in the confidence
interval. It’s a measure of the degree of the reliability of the interval.
ۧۧ The 95% means that we used a method that captures the true mean
95% of the time.
ۧۧ Let’s define zα/2 as our critical value. This is the value such that
P( Z > zα/2 ) = P( Z < −zα/2 ) = α/2, where Z ~ N( 0, 1 ).
ۧۧ The area between −zα/2 and zα/2 under the standard normal curve is
1 − α. In other words, P( −zα/2 < Z < zα/2 ) = 1 − α.
ۧۧ Again, P( −zα/2 < Z < zα/2 ) = 1 − α, so ( 1 − α ) is the area in the center of our
normal curve. That means that each of our tail ends has an area equal to
α/2 so that the total area adds up to 1.
ۧۧ Suppose that α = 0.05. P( −zα/2 < Z < zα/2 ) = 1 − α = 95%, and there would
be 5% left over in the tails ( the shaded area ): 2.5% in the left shaded
area and 2.5% in the right shaded area.
ൖൖ Likewise, when α equals 0.05, we get 95% confidence and 2.5% left
in each of the tails.
ൖൖ When α equals 0.01, we get 99% confidence and 0.5% area in each of
the tails.
qnorm(0.05)
[1] -1.644854
qnorm(0.025)
[1] -1.959964
qnorm(0.005)
[1] -2.575829
set.seed(343)
milk = 129-rexp(100000,0.95)
hist(milk, main="Histogram
of Milk Population",
col="red")
true_mean = mean(milk)
true_sd = sd(milk)
true_mean
[1] 127.943
true_sd
[1] 1.058831
ۧۧ We can take a sample from our population to see how close we are to
the actual true mean.
set.seed(343)
sample_milk <- sample(milk, size=50, rep=T)
sample_mean <- mean(sample_milk)
sample_mean
[1] 127.9848
sample_mean-true_mean
[1] 0.04179896
ۧۧ Our sample mean is only slightly larger than the population mean, by
0.04 ounces.
ۧۧ We can calculate the 95% confidence interval for the sample mean in R.
n=50
sample_milk <- sample(milk, size=50, rep=T)
sample_mean <- mean(sample_milk)
sample_mean - 1.96 * sd(sample_milk) / sqrt(n)
[1] 127.5935
sample_mean + 1.96 * sd(sample_milk) / sqrt(n)
[1] 128.13
ۧۧ Our point estimate based on a sample of just 50 milk jugs only slightly
overestimates the true population mean. This illustrates an important point:
We can get a fairly accurate estimate of a large population by sampling a
relatively small subset of individuals.
hist(sample_milk)
ۧۧ Suppose that we take 1000 samples of size 𝑛 = 50 and look at the sample
distribution.
milk_mean = numeric(0)
for (i in 1: 1000)
milk_mean[i] = mean(sample(milk, 50, rep=T))
hist(milk_mean)
qqnorm(milk_mean)
qqline(milk_mean)
ۧۧ Now let’s return to our samples of size 𝑛 = 50 and take 100 such samples.
samp_mean = numeric(0)
for (i in 1: 100)
samp_mean[i] = mean(sample(milk,50, rep=T))
hist(samp_mean)
PITFALL
ۧۧ There’s a possible pitfall: 18 out of the 20 confidence intervals we calculated
( for the 90% confidence interval ) captured the true population mean. But that
means that 2 of them didn’t. When we only collect 1 sample, we have no way of
knowing if it covers the true mean or not. So, remember, our confidence is in
the method.
PROBLEMS
1 Of the following, which is not needed to calculate a confidence interval for a
population mean?
a) A confidence level.
b) A point estimate of the population mean.
c) A sample size of at least 10.
d) An estimate of the population variance.
e) All of the above are needed.
2 The “cars” dataset in R contains the speed of cars and the distances taken to
stop in the 1920s. Find a 95% confidence interval for the mean “distance” and
mean “speed.” Compare to a 90% confidence interval.
HYPOTHESIS TESTING:
1 SAMPLE
The past few lectures have been focused on parameter
estimation: How do we use our sample data to
estimate population parameters, such as the mean?
But there’s another way that we can look at our data.
Instead of using our data to estimate the population
parameter, we could guess a value for our population
parameter and then ask ourselves if we think our sample
could have come from that particular population. Instead
of going from data to parameter, we go from parameter
to data. This new approach is called hypothesis testing.
HYPOTHESIS TESTING
ۧۧ In general, a hypothesis is an educated guess about something in the
world around you—something that should be testable, either by
experiment or observation. In hypothesis testing, we want to know
whether the characteristics of our sample match the underlying
characteristics of our assumed population.
ൖൖ A rejection region: The set of all test statistic values for which H0 will
be rejected.
ۧۧ The null hypothesis is assumed true, and we have to provide the statistical
evidence needed to reject it in favor of an alternative hypothesis.
H0 : μ = μ0
H0 : μ = 0
Ha : μ > μ0
Ha : μ < μ0
Ha : μ ≠ μ0
SIGNIFICANCE LEVEL
ۧۧ We want α and β to both be small, but there is a trade-off. We can
decrease the rejection region to get a smaller α. However, a smaller
rejection region results in a larger β.
ۧۧ Let’s look at this in terms of criminal trials, where, in the United States,
the initial assumption is that a defendant is innocent until proven guilty.
UPPER-TAILED TEST
ۧۧ Suppose that we have a random sample X1, X2 , …, X𝑛 from a N( μ, σ )
with σ known and we want to do an upper-tailed test. Test H0:
μ = μ0 versus Ha: μ > μ0 . The test procedure is as follows:
1 Test statistic: Z = ( X̄ − μ0 )/( σ/√𝑛 ).
2 Significance level: α.
3 Rejection region: Z ≥ zα .
4 Decision: If the test statistic falls in the rejection region, we reject H0 ;
otherwise, we fail to reject it. For example, in a lower-tailed test with a
computed statistic of 0.05, because 0.05 ≥ −1.645, our test statistic does not
fall in the rejection region and we fail to reject the null hypothesis.
ۧۧ When 𝑛 is small: Suppose that we have a random sample X1, X2 , …, X𝑛 from
a N( μ, σ ) with σ unknown. Test H0: μ = μ0 versus Ha: μ > μ0 . The test
procedure is as follows:
1 Test statistic: T = ( X̄ − μ0 )/( s/√𝑛 ), which follows a t distribution with
𝑛 − 1 degrees of freedom.
2 Significance level: α.
3 Rejection region: T ≥ tα, 𝑛−1 .
ൖൖ High 𝑝-values: Our data are likely under a true null hypothesis.
ൖൖ Low 𝑝-values: Our data are unlikely under a true null hypothesis.
ۧۧ A low 𝑝-value suggests that our sample provides enough evidence that
we can reject the null hypothesis.
ۧۧ The following graph shows the distribution under the null hypothesis.
Our observed data point is the test statistic calculated from our sample.
The shaded area is the probability of seeing our observed sample, or a
sample more extreme, by chance, when the null hypothesis is true. The
closer our 𝑝-value gets to the very unlikely zone, the more evidence our
data provides to reject the null hypothesis.
𝑝-value = P( Z ≥ z │ μ = μ0 ) for the upper-tailed test, where z is the test
statistic calculated from our sample.
𝑝-value = P( Z ≤ z │ μ = μ0 ) for the lower-tailed test, where z is the test
statistic calculated from our sample.
𝑝-value = 2 P( Z ≥ |z| │ μ = μ0 ) for the 2-sided test. We take twice the tail
area because the test is 2-sided.
If the 𝑝-value ≤ α, we reject H0 .
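ۧۧ A minimal sketch of those calculations in R ( assuming xbar, mu0, sigma, and
n have already been defined ):
z = (xbar - mu0)/(sigma/sqrt(n))    # observed test statistic
1 - pnorm(z)                        # upper-tailed p-value
2*(1 - pnorm(abs(z)))               # 2-sided p-value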
PITFALLS
ۧۧ A statistical test is not designed to prove or disprove hypotheses.
It weighs the evidence provided by the data and decides what is
warranted.
SUGGESTED READING
Diez, Barr, and Cetinkaya-Rundel, OpenIntro Statistics, “Inference for
Numerical Data,” section 5.1.
Yau, R Tutorial, “Hypothesis Testing,” http://www.r-tutor.com/elementary-
statistics/hypothesis-testing.
a) α is small.
b) the 𝑝-value is less than α.
c) α = 0.05.
d) α = 0.01.
e) the 𝑝-value is greater than α.
a) population parameters.
b) sample parameters.
c) sample statistics.
d) It depends—sometimes population parameters and sometimes sample
statistics.
HYPOTHESIS TESTING:
2 SAMPLES, PAIRED TEST
Suppose that we want to test whether 2 samples are
from the same distribution. We can look at their
descriptive statistics, such as histograms or box plots,
but that won’t confirm a statistical difference. We need a
more formal method to determine a true difference. The
goal of this lecture is to demonstrate how to determine
if 2 samples are similar, meaning that they come from
the same underlying distribution, or different, meaning
that they come from different underlying distributions.
ۧۧ If the feed and weight are independent, meaning that neither one affects
the other, then the distributions of the 2 samples should be the same. In
other words, no matter what feed the chickens eat, their weights should
be roughly the same.
ۧۧ Let’s use our chicken weight data, which has newly hatched chicks that
were randomly placed in 6 groups, with each group given a different
feed supplement.
library("datasets")
data(chickwts)
summary(chickwts)
weight feed
Min.: 108.0 casein: 12
1st Qu.: 204.5 horsebean: 10
Median: 258.0 linseed: 12
Mean: 261.3 meatmeal: 11
3rd Qu.: 323.5 soybean: 14
Max.: 423.0 sunflower: 12
ۧۧ We also see the 6 different feed types, along with the corresponding
number of chickens in that feed group. The casein feed had 12 chickens,
horsebean feed had 10 chickens, linseed had 12, and so on.
data("chickwts")
attach(chickwts)
meat = chickwts[chickwts$feed=="meatmeal",1]
horse = chickwts[chickwts$feed=="horsebean",1]
ۧۧ Notice that our dataset is uneven. Meat has 11 entries while horse has 10.
[1] 325 257 303 315 380 153 263 242 206 344 258
[1] 179 160 136 227 217 168 108 124 143 140
ۧۧ Showing box plots for both samples is a great way for us to compare
them. Here’s the command to do that in R. Remember, our goal for this
lecture is to determine if 2 samples are similar ( meaning that they
come from the same underlying distribution ) or different ( meaning
that they come from different underlying distributions ). Your box plot
is helping shine light on that goal. We see that the bulk of the chickens
in meatmeal, from the first quartile onward, weigh more than all the
chickens that had the horsebean feed supplement.
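ۧۧ A minimal version of that box plot command ( the labels are assumptions ):
boxplot(meat, horse, names = c("meatmeal", "horsebean"),
        ylab = "Weight in grams")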
ۧۧ We take those 2 ordered sets of data and pair them up and plot them. If
feed had no effect on chicken weight, then we would expect to see our
points fall around the line Y = X. But if feed did have an effect, we would
expect our points to be shifted, either above or below the Y = X line.
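ۧۧ One way to draw that paired plot is with a quantile-quantile plot ( a sketch;
the lecture’s exact plotting command may differ ):
qqplot(meat, horse, xlab = "meatmeal weights", ylab = "horsebean weights")
abline(0, 1)   # the Y = X reference line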
ۧۧ Notice that all of our points fall below the line, being pulled by the larger
values in meatmeal.
ۧۧ H0 is that the 2 samples are from the same distribution ( and have the
same mean ).
ۧۧ Here we’ve taken the mean and standard deviation of meatmeal and
horsebean and calculated our test statistic, T. The value that we get is
5.059.
mean.meat = mean(meat)
mean.horse = mean(horse)
sd.meat = sd(meat)/sqrt(length(meat))
sd.horse = sd(horse)/sqrt(length(horse))
T.stat = (mean.meat - mean.horse)/sqrt(sd.meat^2 + sd.horse^2)
T.stat
[1] 5.059444
ۧۧ R does the heavy lifting for you with the “t.test” function. The command
is “t.test( meat, horse ),” and it returns the Welch’s 2-sample t-test.
ۧۧ We can confirm our test statistic value of 5.059, and we have a very low
𝑝-value, which would lead us to reject the null hypothesis that the 2
means are equal.
PAIRED T-TESTS
ۧۧ The 2-sample t-test procedure only applies when the 2 samples are
independent and the underlying distributions are normal, as in our
chicken weight example.
ۧۧ The paired-sample t-test works the same as a 1-sample t-test, but each
observation in one sample is correlated to an observation in the second
sample.
install.packages("PairedData")
library(PairedData)
data(IceSkating)
attach(IceSkating)
ۧۧ The first column shows that we have 7 subjects in this dataset, along
with extended speed measurements in column 2 and flexed speed
measurements in column 3.
IceSkating
Subject Extension Flexion
1 S1 2.13 1.90
2 S2 1.77 1.55
3 S3 1.68 1.62
4 S4 2.04 1.89
5 S5 2.12 2.01
6 S6 1.92 1.91
7 S7 2.08 2.10
ۧۧ The following is a paired plot of our data, with “extension” on the 𝑥-axis
and “flexion” on the 𝑦-axis. If there was no difference, we’d expect the
points to fall along this line, where X = Y. And they’re not too far off.
They fall slightly below, with a preference toward extension.
with(IceSkating,plot(paired(Extension,Flexion),
type="McNeil"))
hist(Extension-Flexion)
with(IceSkating,qqnorm(Extension-Flexion))
with(IceSkating,qqline(Extension-Flexion))
shapiro.test(Extension-Flexion)
data: Extension - Flexion
W = 0.93721, p-value = 0.6137
ۧۧ Let’s walk through some of the details of the paired t-test. For each
matched pair, we create a new variable, d, which represents the
difference between the 2 samples: d = Extension − Flexion. In R, we
assign d to the difference in speed between extension and flexion.
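d = Extension - Flexion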
mean(Extension) - mean(Flexion)
[1] 0.1085714
mean(d)
[1] 0.1085714
H0: d = 0
Ha: d ≠ 0
ۧۧ So, T = mean( d )/( sd( d )/√𝑛 ) = 2.9346765.
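ۧۧ In R ( using the d vector defined above ):
T.stat = mean(d)/(sd(d)/sqrt(length(d)))
T.stat   # 2.9346765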
ۧۧ Notice that at the 95% confidence level, we reject the null hypothesis
because T > 2.45. We would conclude that there is a difference in speed
when the leg is extended versus flexed.
ۧۧ But at the 99% confidence level, we fail to reject the null because
T < 3.71. This is where the 𝑝-value’s importance is critical, because it
tells us exactly where the cutoff is for this particular sample.
2*(1-pt(2.9346765,6))
[1] 0.02612775
ۧۧ We can compare this to the output that R gives us. Notice that our
𝑝-value of 0.026 matches the result that we get from R.
t.test(Extension,Flexion,paired=TRUE)
Paired t-test
data: Extension and Flexion
t = 2.9347, df = 6, p-value = 0.02613
alternative hypothesis: true difference in means is not
equal to 0
95 percent confidence interval:
0.01804536 0.19909749
sample estimates:
mean of the differences
0.1085714
ۧۧ With a 𝑝-value of 0.3046, we’re no longer able to reject the null hypothesis
at any of the usual significance levels. This difference arises because the
degrees of freedom are lower for paired data than for independent data. With
paired data, each pair of values contributes only 1 difference, and hence 1
degree of freedom; when the data are independent, each data point contributes
its own degree of freedom.
t.test(Extension,Flexion)
Welch Two Sample t-test
data: Extension and Flexion
t = 1.0731, df = 11.857, p-value = 0.3046
alternative hypothesis: true difference in means is not
equal to 0
95 percent confidence interval:
-0.1121619 0.3293047
sample estimates:
mean of x mean of y
1.962857 1.854286
ۧۧ If the differences between pairs are non-normal, don’t use the t-test for
that. Your results won’t be valid. In this case, it would be better to use
a non-parametric test, such as the Wilcoxon signed-rank test, which
doesn’t depend on an underlying assumption of normality.
SUGGESTED READING
Diez, Barr, and Cetinkaya-Rundel, OpenIntro Statistics, “Inference for
Numerical Data,” sections 5.2–5.3.
Yau, R Tutorial, “Inference about Two Populations,” http://www.r-tutor.com/
elementary-statistics/inference-about-two-populations.
Regular car mileage = (14, 16, 20, 20, 21, 21, 23, 24, 23,
22, 23, 22, 27, 25, 27, 28, 30, 29, 31, 30, 35, 34)
Premium car mileage = (16, 17, 19, 22, 24, 24, 25, 25, 26,
26, 24, 27, 26, 28, 32, 33, 33, 31, 35, 31, 37, 40)
LINEAR REGRESSION
MODELS AND
ASSUMPTIONS
In regression, we look at the association between 2 or
more quantitative variables. We’re going to begin to
look for whether one variable causes or predicts changes
in another variable. The response variable, which is the
dependent variable, might measure an outcome of a study
or experiment. The explanatory variable, which is the
independent or predictor variable, explains or is related
to changes in the response variable. We now have pairs
of observations: ( 𝑥1, 𝑦1 ), ( 𝑥2 , 𝑦2 ), …, ( 𝑥𝑛 , 𝑦𝑛 ). In linear
regression, we’re looking for the association between
2 variables to be centered on a line. In essence, we’re
looking for the effect that one variable has on another.
LINEAR REGRESSION
ۧۧ Imagine that we are wheat farmers in Kansas. For the past 10 years, at
the end of the season, we’ve recorded the total amount of rainfall and
the average height of our wheat. We’d like to know if rainfall has any
effect on our wheat yield.
rainfall = c(3.07,3.55,3.90,4.38,
4.79,5.30,5.42,5.99,6.45,6.77)
wheat = c(78,82,85,91,92,96,97,104,111,119)
summary(cbind(rainfall, wheat))
rainfall wheat
Min: 3.070 Min.: 78.0
1st Qu.: 4.020 1st Qu.: 86.5
Median: 5.045 Median: 94.0
Mean: 4.962 Mean: 95.5
3rd Qu.: 5.848 3rd Qu.: 102.2
Max.: 6.770 Max.: 119.0
ۧۧ Our linear regression model, with unknown parameters, looks like this:
𝑦 = β0 + β1 𝑥 + 𝜖, where β0 is the 𝑦-intercept, β1 is the slope, and 𝜖 is the
error term.
ۧۧ Notice the Greek letters: Our model is about the underlying population.
ۧۧ We’ll use our data to estimate the slope and 𝑦-intercept. Notice that
when we talk about estimates from our sample data, we use hats over
the βs.
ۧۧ Our linear regression line, with parameters now estimated from the
sample data, is ŷ = β̂0 + β̂1 𝑥.
ۧۧ We estimate β̂0 and β̂1 by minimizing the sum of squared differences between
the observed and fitted values: min Σ( 𝑦i − ŷi )².
ۧۧ R will conveniently fit a linear regression for us. We use the function
“lm,” which stands for linear model. After “lm” comes our response
variable, “wheat,” followed by a tilde, which tells R that we want to
model wheat on the explanatory variables that follow it. In this case,
that’s rainfall.
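ۧۧ A hedged reconstruction of that command ( the object name wheat.lm matches
the residual plot code later in this lecture ):
wheat.lm = lm(wheat ~ rainfall)
summary(wheat.lm)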
Call:
lm(formula = wheat ~ rainfall)
Residuals:
Min 1Q Median 3Q Max
-3.158 -1.903 0.334 1.278 5.114
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 45.0408 3.6677 12.28 1.80e-06 ***
rainfall 10.1691 0.7191 14.14 6.08e-07 ***
---
1 The “call” tells us what our regression model is. In this case, wheat is
being regressed on rainfall.
2 With the summary statistics for the residuals, we can quickly see if
our residuals have any outliers and check to see that the median is
close to 0.
ۧۧ The following is a plot of our wheat data with the estimated regression
line.
ۧۧ What is the equation of the linear regression line? We can actually solve
for our regression coefficients directly from the data.
ۧۧ When we plug our 𝑥- and 𝑦-values into the least-squares formula for the
slope, we get 10.1691. This matches the result from R.
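ۧۧ Those least-squares estimates can be computed by hand in R using the
standard formulas b1 = Σ( 𝑥i − 𝑥̄ )( 𝑦i − 𝑦̄ )/Σ( 𝑥i − 𝑥̄ )² and b0 = 𝑦̄ − b1 𝑥̄:
b1 = sum((rainfall - mean(rainfall))*(wheat - mean(wheat)))/
     sum((rainfall - mean(rainfall))^2)
b1   # 10.1691, matching the rainfall coefficient from R
b0 = mean(wheat) - b1*mean(rainfall)
b0   # 45.0408, matching the intercept from R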
RESIDUALS
ۧۧ We need a way of understanding how well our regression line fits the
data. The residuals help us do just that. A residual is the difference
between an observed value ( Yi ) and the value predicted by the
regression line ( Ŷi ): ei = 𝑦i − ŷi .
ۧۧ Some people get confused at this point, because it seems like the
residuals, ei, are the same as our errors, 𝜖i .
𝜖i = 𝑦i − ( β0 + β1 𝑥i )
ۧۧ The residuals, on the other hand, measure the difference between the
observed value ( 𝑦i ) and the estimated regression line ( ŷi ).
ei = 𝑦i − ŷi
ei = 𝑦i − ( β̂0 + β̂1 𝑥i )
ۧۧ The following plot shows what the residual is. For a given observation,
𝑦i, we find the distance from that point to the regression line. The
value on the regression line ŷi is the estimated regression fit for 𝑦i . The
difference between 𝑦i and ŷi is the residual.
rainfall = c(3.07,3.55,3.90,4.38,4.79,
5.30,5.42,5.99,6.45,6.77)
wheat = c(78,82,85,91,92,96,97,104,111,119)
wheat.lm = lm(wheat~rainfall)
plot(wheat.lm$fitted.values,wheat.lm$residuals,
main = "Residuals vs. Fitted Values",
xlab = "Fitted Wheat Values", ylab = "Residuals")
abline(h=0, col="red")
SUGGESTED READING
Diez, Barr, and Cetinkaya-Rundel, OpenIntro Statistics, “Introduction to Linear
Regression,” sections 7.1–7.2.
Faraway, Linear Models with R, “Estimation,” chap. 2.
Yau, R Tutorial, “Simple Linear Regression,” http://www.r-tutor.com/
elementary-statistics/simple-linear-regression.
PROBLEMS
1 What does the linear regression slope b1 represent?
a) There is no correlation.
b) The slope b1 is negative.
c) Variable X is larger than variable Y.
d) The variance of X is negative.
REGRESSION PREDICTIONS,
CONFIDENCE INTERVALS
As you will learn in this lecture, transforming
the response variable can help us eliminate
heteroscedasticity ( increasing or decreasing
variance ) and satisfy the assumptions of normality,
independence, and linearity. In this lecture, you will
learn how to use linear regression to make predictions
as well as how to learn about population parameters
through confidence intervals and hypothesis tests.
TRANSFORMATIONS: LN(Y)
ۧۧ The following data has an exponential shape and doesn’t satisfy the
assumption of linearity.
ۧۧ Notice that the residuals are large in magnitude, not centered at 0, and
not balanced. They’re heteroscedastic with increasing variance. The
histogram is slightly skewed to the right.
ۧۧ After just taking the natural log, our data appear more linear. Our
residuals look better behaved: they are centered at 0 and have less of a
pattern, and our histogram has shifted toward normality.
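ۧۧ A minimal sketch of that refit ( the names x and y are placeholders; the
lecture’s actual variables aren’t listed here ):
log.fit = lm(log(y) ~ x)
plot(log.fit$fitted.values, log.fit$residuals)   # residuals after the log transformation
hist(log.fit$residuals)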
TRANSFORMATIONS: Y²
ۧۧ The following is a slightly different dataset with a curved pattern.
The residuals have a clear pattern, and the histogram is heavily
skewed to the left. In this case, the solution is to square the data.
Notice what a great job that transformation does in helping us satisfy
our model assumptions.
ۧۧ For example, the solid line in the following graph represents our true
underlying population. Let’s sample 5 points from this population and
fit a regression line to those 5 points. Sometimes our 5 points do a nice
job of estimating the population slope. But it’s difficult to precisely
estimate the slope with so few points.
SIGNIFICANT β̂0 AND β̂1
ۧۧ An example of a linear
regression fit where both
the slope and 𝑦-intercept
estimates are significant
is shown at right. Our
data lies close enough to
the 𝑦-axis that we can
extrapolate out to 𝑦 = 0
fairly accurately. And in
spite of the variability
in our data, we can fit a
regression line that has a
nonzero slope.
ۧۧ But we could still estimate the 𝑦-intercept as the constant Ȳ. Our model
for this data would be a straight line centered at the mean of Y.
ۧۧ The following data was extracted from the 1974 Motor Trend magazine
and comprises fuel consumption ( 𝑦 ) and 10 aspects of automobile
design and performance ( 𝑥 ) for 32 automobiles ( 1973–1974 models ).
Let’s explore the “mtcars” dataset and use linear regression to predict
vehicle gas mileage based on vehicle weight.
library(datasets)
data(mtcars)
summary(mtcars)
mpg cyl disp hp
Min.: 10.40 Min.: 4.000 Min.: 71.1 Min.: 52.0
1st Qu.: 15.43 1st Qu.: 4.000 1st Qu.: 120.8 1st Qu.: 96.5
Median: 19.20 Median: 6.000 Median: 196.3 Median: 123.0
Mean: 20.09 Mean: 6.188 Mean: 230.7 Mean: 146.7
3rd Qu.: 22.80 3rd Qu.: 8.000 3rd Qu.: 326.0 3rd Qu.: 180.0
Max.: 33.90 Max.: 8.000 Max.: 472.0 Max.: 335.0
drat wt qsec vs
Min.: 2.760 Min.: 1.513 Min.: 14.50 Min.: 0.0000
1st Qu.: 3.080 1st Qu.: 2.581 1st Qu.: 16.89 1st Qu.: 0.0000
Median: 3.695 Median: 3.325 Median: 17.71 Median: 0.0000
Mean: 3.597 Mean: 3.217 Mean: 17.85 Mean: 0.4375
3rd Qu.: 3.920 3rd Qu.: 3.610 3rd Qu.: 18.90 3rd Qu.: 1.0000
Max.: 4.930 Max.: 5.424 Max.: 22.90 Max.: 1.0000
am gear carb
Min.: 0.0000 Min.: 3.000 Min.: 1.000
1st Qu.: 0.0000 1st Qu.: 3.000 1st Qu.: 2.000
Median: 0.0000 Median: 4.000 Median: 2.000
Mean: 0.4062 Mean: 3.688 Mean: 2.812
3rd Qu.: 1.0000 3rd Qu.: 4.000 3rd Qu.: 4.000
Max.: 1.0000 Max.: 5.000 Max.: 8.000
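ۧۧ The output below comes from regressing miles per gallon on weight. Here is
that fit, using the object name mpg_model that appears in the residual checks
later in this lecture:
mpg_model = lm(mpg ~ wt, data = mtcars)
summary(mpg_model)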
Residuals:
Min 1Q Median 3Q Max
-4.5432 -2.3647 -0.1252 1.4096 6.8727
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
---
ۧۧ The output above shows the formula used to make the model, followed
by a 5-number summary of the residuals and a summary of the model
coefficients.
ۧۧ The coefficients are the constants used to create the best fit line: In this
case, the 𝑦-intercept term β̂0 is set to 37.2851, and the slope coefficient β̂1
for the weight variable is −5.3445. The model fit the line mpg = 37.2851 −
5.3445 × wt.
qt(.975,df=28)
[1] 2.048407
ۧۧ We are 95% confident that the true slope, regressing miles per gallon
on weight, is between −6.4897 and −4.1992 miles per gallon per 1000
pounds.
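ۧۧ That interval can be reproduced as estimate ± t* × standard error, or pulled
directly from the fitted model ( a sketch ):
-5.3445 + c(-1, 1)*qt(.975, df = 28)*0.5591   # by hand
confint(mpg_model, "wt", level = 0.95)        # from the model object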
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
pt(-9.559,df=28)
[1] 1.29007e-10
qqnorm(mpg_model$residuals)
qqline(mpg_model$residuals)
shapiro.test(mpg_model$residuals)
data: mpg_model$residuals
W = 0.94508, p-value = 0.1044
PITFALL
ۧۧ Clearly, we need to add more predictors to our model, but we run into
a pitfall when we do that. Collinearity occurs when 2 or more predictor
variables are closely related to one another, or highly correlated. The
presence of collinearity can create problems in the regression context,
because it can be difficult to separate out the individual effects of
collinear variables on the response. But it’s fixable.
SUGGESTED READING
Crawley, The R Book, “Regression,” chap. 10.
Diez, Barr, and Cetinkaya-Rundel, OpenIntro Statistics, “Introduction to Linear
Regression,” sections 7.3–7.4.
Yau, R Tutorial, “Simple Linear Regression,” http://www.r-tutor.com/
elementary-statistics/simple-linear-regression.
a) A negative correlation.
b) No apparent relationship.
c) A statistically significant relationship.
d) A positive correlation.
e) A heteroskedastic relationship.
MULTIPLE LINEAR
REGRESSION
When we first started examining the relationship
between 2 variables, we used t-tests, which
allowed us to determine if there was a
statistically significant relationship between 2 variables.
We then moved to simple linear regression, which allowed
us to fit a line to a predictor versus response. While these
are useful in the case where we only have 2 variables, it’s
more often the case to work with data that has multiple
predictors. This lecture is about multiple linear regression.
install.packages("MASS")
library(MASS)
data(Pima.tr)
head(Pima.tr)
pima = Pima.tr
  npreg glu bp skin  bmi   ped age type
1     5  86 68   28 30.2 0.364  24   No
2     7 195 70   33 25.1 0.163  55  Yes
3     5  77 82   41 35.8 0.156  35   No
4     0 165 76   43 47.9 0.259  26   No
5     0 107 60   25 26.4 0.133  23   No
6     5  97 76   27 35.6 0.378  52  Yes
dim(pima)
[1] 200 8
ۧۧ The median and mean of bmi are 32.8 and 32.31, respectively. When you
see a median and a mean so close together, it gives you some assurance
that that might be an underlying normal population.
ۧۧ All of the variables are quantitative except for type, which is categorical
( “ Yes” for diseased or “No” for non-diseased ).
hist(pima$bp)
ۧۧ In the following pairs plot of 4 of the data values, there doesn’t seem to be
a strong relationship among them, and skin shows a possible outlier.
pairs(pima[1:4])
round(cor(pima[1:7]), 2)
      npreg  glu    bp  skin   bmi   ped   age
npreg  1.00 0.17  0.25  0.11  0.06 -0.12  0.60
glu    0.17 1.00  0.27  0.22  0.22  0.06  0.34
bp     0.25 0.27  1.00  0.26  0.24 -0.05  0.39
skin   0.11 0.22  0.26  1.00  0.66  0.10  0.25
bmi    0.06 0.22  0.24  0.66  1.00  0.19  0.13
ped   -0.12 0.06 -0.05  0.10  0.19  1.00 -0.07
age    0.60 0.34  0.39  0.25  0.13 -0.07  1.00
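ۧۧ The regression output below models bmi on the remaining predictors; a hedged
reconstruction of that call ( the object name lm1 is an assumption ):
lm1 = lm(bmi ~ npreg + glu + bp + skin + ped + age + type, data = pima)
summary(lm1)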
Residuals:
Min 1Q Median 3Q Max
-19.9065 -2.5723 -0.1412 2.6039 11.2664
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.203109 2.346040 8.185 3.69e-14 ***
npreg 0.018970 0.120858 0.157 0.8754
glu 0.006432 0.011981 0.537 0.5920
bp 0.046753 0.031272 1.495 0.1366
skin 0.322989 0.029362 11.000 < 2e-16 ***
ped 2.060288 1.094140 1.883 0.0612 .
age -0.061772 0.040459 -1.527 0.1285
typeYes 1.494968 0.827212 1.807 0.0723 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.528 on 192 degrees of freedom
Multiple R-squared: 0.4735, Adjusted R-squared: 0.4543
F-statistic: 24.67 on 7 and 192 DF, p-value: < 2.2e-16
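ۧۧ The next output regresses glu on the other variables. This matches the lm3
object used in the plots below ( the predictor order is inferred from the
coefficient table ):
lm3 = lm(glu ~ bmi + npreg + bp + skin + ped + age + type, data = pima)
summary(lm3)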
Residuals:
    Min      1Q  Median      3Q     Max
-66.595 -17.396  -1.641  12.952  89.977
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 68.96159 15.62700 4.413 1.70e-05 ***
bmi 0.23301 0.43405 0.537 0.5920
npreg -0.84245 0.72494 -1.162 0.2466
bp 0.31786 0.18792 1.691 0.0924 .
skin 0.07046 0.22559 0.312 0.7551
ped -2.36903 6.64389 -0.357 0.7218
age 0.55832 0.24166 2.310 0.0219 *
typeYes 26.29928 4.64856 5.658 5.51e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 '
' 1
Residual standard error: 27.26 on 192 degrees of freedom
Multiple R-squared: 0.2853, Adjusted R-squared: 0.2592
F-statistic: 10.95 on 7 and 192 DF, p-value: 1.312e-11
plot(lm3$fitted.values, pima$glu)
predict(lm3) #Compare to predicted value
( output: one predicted glu value for each observation in the dataset )
plot(lm3$fitted.values,lm3$residuals)
hist(lm3$residuals)
qqnorm(lm3$residuals)
qqline(lm3$residuals)
ۧۧ Let’s return to the Motor Trend cars data and fit a multiple regression model
by including more explanatory variables in a linear regression model.
library(datasets)
data(mtcars)
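ۧۧ A hedged reconstruction of the 2-predictor fit summarized below ( the object
name is an assumption ):
mpg.lm2 = lm(mpg ~ wt + hp, data = mtcars)
summary(mpg.lm2)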
Residuals:
Min 1Q Median 3Q Max
-3.941 -1.600 -0.182 1.050 5.854
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.22727 1.59879 23.285 < 2e-16 ***
wt -3.87783 0.63273 -6.129 1.12e-06 ***
hp -0.03177 0.00903 -3.519 0.00145 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 '
' 1
Residual standard error: 2.593 on 29 degrees of freedom
Multiple R-squared: 0.8268, Adjusted R-squared: 0.8148
F-statistic: 69.21 on 2 and 29 DF, p-value: 9.109e-12
ۧۧ For example, as weight increases by 1 unit, the miles per gallon will
decrease by 3.877. Intuitively, this makes sense: The heavier a car gets,
the lower miles per gallon it would tend to get.
cor(mtcars$wt, mtcars$hp)
[1] 0.6587479
ۧۧ We see that “hp” and “wt” are correlated with a correlation coefficient
of 0.658. It’s possible that these predictors are collinear, meaning that
they’re either highly correlated or they contribute the same information
to the model.
Residuals:
Min 1Q Median 3Q Max
-3.4506 -1.6044 -0.1196 1.2193 4.6271
ۧۧ None of our variables are significant. They all have 𝑝-values greater than
0.05. In fact, a model using all the variables doesn’t perform as well as
our weight-and-horsepower–based model. Our adjusted R2 actually
decreased—from 0.8148 to 0.8066.
ۧۧ We see that when we add variables that have little relationship with the
response or even variables that are too correlated to one another, we
can get poor results.
ۧۧ One problem in our data is that our variables are correlated. We can see
this in the pairs plots.
pairs(mtcars[, c(1, 3:4)])
pairs(mtcars[, c(5:7)])
round(cor(mtcars[, c(1, 3:7)]), 2)
       mpg  disp    hp  drat    wt  qsec
mpg   1.00 -0.85 -0.78  0.68 -0.87  0.42
disp -0.85  1.00  0.79 -0.71  0.89 -0.43
hp   -0.78  0.79  1.00 -0.45  0.66 -0.71
drat  0.68 -0.71 -0.45  1.00 -0.71  0.09
wt   -0.87  0.89  0.66 -0.71  1.00 -0.17
qsec  0.42 -0.43 -0.71  0.09 -0.17  1.00
ۧۧ We can find the best model by pruning. We “step” through the predictor
variables and remove the ones that are not significant.
ۧۧ Choose the model with the highest adjusted R2. This assumes that we
choose to evaluate the success of our model in terms of the percentage of
the variability in the response explained by the explanatory variables.
ۧۧ Let’s do a stepwise regression on our linear model fit of miles per gallon
with all of our data. This will automatically spit out the best model.
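ۧۧ A sketch of that stepwise call ( the lecture’s exact starting model may
differ ):
full.lm = lm(mpg ~ ., data = mtcars)
step(full.lm)   # backward elimination by AIC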
Step: AIC=66.97
mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
Df Sum of Sq RSS AIC
- carb 1 0.6855 148.53 65.121
- gear 1 2.1437 149.99 65.434
- drat 1 2.2139 150.06 65.449
- disp 1 3.6467 151.49 65.753
- hp 1 7.1060 154.95 66.475
<none> 147.84 66.973
- am 1 11.5694 159.41 67.384
- qsec 1 15.6830 163.53 68.200
- wt 1 27.3799 175.22 70.410
Step: AIC=65.12
mpg ~ disp + hp + drat + wt + qsec + am + gear
Df Sum of Sq RSS AIC
- gear 1 1.565 150.09 63.457
- drat 1 1.932 150.46 63.535
<none> 148.53 65.121
- disp 1 10.110 158.64 65.229
- am 1 12.323 160.85 65.672
- hp 1 14.826 163.35 66.166
- qsec 1 26.408 174.94 68.358
- wt 1 69.127 217.66 75.350
Step: AIC=62.16
mpg ~ disp + hp + wt + qsec + am
Df Sum of Sq RSS AIC
- disp 1 6.629 160.07 61.515
<none> 153.44 62.162
- hp 1 12.572 166.01 62.682
- qsec 1 26.470 179.91 65.255
- am 1 32.198 185.63 66.258
- wt 1 69.043 222.48 72.051
Step: AIC=61.52
mpg ~ hp + wt + qsec + am
Df Sum of Sq RSS AIC
- hp 1 9.219 169.29 61.307
<none> 160.07 61.515
- qsec 1 20.225 180.29 63.323
- am 1 25.993 186.06 64.331
- wt 1 78.494 238.56 72.284
Step: AIC=61.31
mpg ~ wt + qsec + am
Df Sum of Sq RSS AIC
<none> 169.29 61.307
- am 1 26.178 195.46 63.908
- qsec 1 109.034 278.32 75.217
- wt 1 183.347 352.63 82.790
Residuals:
Min 1Q Median 3Q Max
-3.4811 -1.5555 -0.7257 1.4110 4.6610
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.6178 6.9596 1.382 0.177915
wt -3.9165 0.7112 -5.507 6.95e-06 ***
qsec 1.2259 0.2887 4.247 0.000216 ***
am 2.9358 1.4109 2.081 0.046716 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.459 on 28 degrees of freedom
Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
ۧۧ The output from our stepwise regression model is as follows. This is the
best model.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.6178 6.9596 1.382 0.177915
wt -3.9165 0.7112 -5.507 6.95e-06 ***
qsec 1.2259 0.2887 4.247 0.000216 ***
am 2.9358 1.4109 2.081 0.046716 *
PITFALLS
ۧۧ What if one of your assumptions isn’t met? With nonlinearity, you have
to transform the data. You can use polynomial regression.
ۧۧ What if the variance of your residuals is not constant? You can do a
weighted version of your least-squares regression.
PROBLEMS
1 In least-squares regression, which of the following is not a required assumption
about the error term 𝜖?
ANALYSIS OF VARIANCE:
COMPARING 3 MEANS
Variation and randomness are everywhere.
Whether you’re looking for a mechanical part
failure, determining a new drug’s effectiveness,
or wondering if it will rain tomorrow, almost
everything has variation. One of the most commonly
used statistical methods is ANOVA, which is an
acronym for the phrase “analysis of variance.” The
whole purpose of ANOVA is to break up variation into
component parts and then look at their significance.
ۧۧ They have the same means as before, at 0, 1, and 2, but they have a larger
variance. It’s more probable that these 3 samples could come from the
same underlying population.
ۧۧ Recall that the t-test was a ratio of group difference for 2 groups divided
by the sampling variability ( where sampling variability is the standard
error ).
ۧۧ The F-test is named after the man who invented the idea, Sir Ronald
Fisher, who was analyzing fertilizer data. Agricultural researchers
had been trying to figure out which fertilizer worked best by using a
different one each year. Fisher developed much better tests to control
for weather and land conditions.
ۧۧ In fitting the ANOVA model, we more or less assume the same conditions
as multiple linear regression.
ۧۧ What if you want to assess more than 1 factor? There are different
types of ANOVA.
ۧۧ You may hear another commonly used term for ANOVA: factorial
design. A 3-way factorial design is the same as a 3-way ANOVA.
ൖൖ H0: The mean outcome is the same across all categories: μ1 = μ2 = … = μk.
ൖൖ Ha: The mean of the outcome is different for some ( or all ) groups.
In other words, there is at least one mean difference among the
populations where μi represents the mean of the outcome for
observations in category i.
ൖൖ Some means are different from others while some are similar.
ۧۧ The distance from any data point to the mean is the deviation from this
point to the mean: ( Xi − X̄ ).
2 Variation within groups: For each data value, we look at the difference
between that value and the mean of its group. This is called the sum
of squares within ( SSW ), which is the sum of the squared deviations
within each group.
ۧۧ The sum of squares total ( SST ) is the sum of the squared deviations
between each observation and the overall mean.
ۧۧ MSW ( mean square within ) is also called the within-groups mean square.
THE F-STATISTIC
ۧۧ Our goal is to compare the 2 sources of variability: MSW and MSB. Our
test statistic is the ratio F = MSB/MSW.
ۧۧ What we’ve just computed is called the F-statistic or F-ratio. Unlike the
t-statistic, which is based on sample means, the F-ratio is based on a
ratio of sample variances. The variance in the numerator measures the
size of differences among sample means. Variance in the denominator
measures the other differences expected if group means were not
different from one another.
ۧۧ If the variances are unequal, then the grouping has an effect. The
between-group variation ( MSB ) becomes large compared to the within-
group variation ( MSW ), and the F-ratio would be greater than 1.
ۧۧ For example, if the degrees of freedom of the numerator were 20 and the
degrees of freedom of the denominator were 19, then our critical
value from the F-distribution would be 2.1555. We would compare our
F-ratio to this value and reject H0 if our ratio were larger than 2.1555 or
fail to reject H0 if our ratio were smaller than 2.1555.
1 The first column lists the source of the variation, either between-
group or within-group, followed by the total variation.
2 The second column gives us the sums of squares ( SSB ), ( SSW ), and ( SST ).
3 The third column lists the degrees of freedom ( k − 1 ) and ( N − k ), and
if you add both of those, we get the total degrees of freedom, ( N − 1 ).
4 The fourth column is the mean square between and within group.
Summary ANOVA
ۧۧ The first step in our analysis is to graphically compare the means of the
variable of interest across groups. To do that, we can create side-by-side
box plots of the measurements organized in groups using a function.
require(stats); require(graphics)
boxplot(weight ~ feed, data = chickwts, col = "lightgray",
main = "Chickwts data", ylab = "Weight in grams",
xlab="Type of Feed")
summary(chickwts)
weight feed
Min.: 108.0 casein: 12
1st Qu.: 204.5 horsebean: 10
Median: 258.0 linseed: 12
Mean: 261.3 meatmeal: 11
3rd Qu.: 323.5 soybean: 14
Max.: 423.0 sunflower: 12
ۧۧ Our group sizes only range between 10 and 14, but what if we had larger
variation in sample size?
ۧۧ A variable-width box plot can show whether your groups have the same
number and shape. In a variable-width box plot, the width of the box
plot represents the number in each group. The height, as usual, shows
the spread in the data.
ۧۧ Once the ANOVA model is fit, we can look at the results using the
“summary( )” function. This produces the standard ANOVA table.
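ۧۧ A minimal sketch of that fit ( the object name chick.aov is an assumption ):
chick.aov = aov(weight ~ feed, data = chickwts)
summary(chick.aov)   # the standard ANOVA table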
ۧۧ The ANOVA F-test answers the question whether there are significant
differences in the k population means. But it doesn’t give us any
information about how they differ. That’s because ANOVA compares all
individual mean differences simultaneously, in 1 test.
TUKEY’S METHOD
ۧۧ A common multiple comparisons procedure is Tukey’s method, named
for John Tukey, an inventive mathematics professor with a joint
appointment at Bell Labs. He was first to use the terms “software” and
“bit” in computer science, and in 1970, he created box plots.
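ۧۧ In R, Tukey’s method can be applied to a fitted ANOVA object ( continuing
the chick.aov sketch above ):
TukeyHSD(chick.aov)        # all pairwise differences in mean weight, with adjusted intervals
plot(TukeyHSD(chick.aov))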
PITFALLS
ۧۧ ANOVA depends on the same assumptions as least-squares linear
regression—only more so.
ۧۧ Could these 3 groups come from the same underlying population? This
is a possible pitfall. Don’t be misled by the name ANOVA into expecting
it to analyze any kind of variance. ANOVA assumes a shared variance
( i.e., roughly the same variance ) across all groups. It’s looking only at
whether means with that shared variance value also come from the
same distribution.
ۧۧ When the F-test shows that means come from different distributions,
then that says, for example, that the new fertilizer you’re testing gives
statistically different results from other fertilizers.
PROBLEMS
1 1-way ANOVA is used when
ANALYSIS OF
COVARIANCE AND
MULTIPLE ANOVA
If you’re studying cancer in patients and you want to
know which of 4 new treatments is most effective, you
would use ANOVA, but you’d also want to be careful that
you aren’t missing a continuous factor that may co-vary
with your results, such as distance from a major source
of pollution. ANOVA won’t model a continuous predictor
variable; it only works for categorical variables. Analysis
of covariance can be used to address this problem.
ۧۧ We open their database and notice that there are 12 patients with
esophageal cancer. We place them in 4 groups of 3 each. Let’s analyze
the data as a 1-way ANOVA.
months = c(78,93,86,57,45,60,28,31,22,9,12,4)
treat = gl(4,3)
lm.mod = lm(months ~ treat)
summary(lm.mod)
Call:
lm(formula = months ~ treat)
Residuals:
Min 1Q Median 3Q Max
-9.0000 -4.5000 0.8333 3.7500 7.3333
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 85.667 3.613 23.709 1.07e-08 ***
treat2 -31.667 5.110 -6.197 0.00026 ***
treat3 -58.667 5.110 -11.481 3.00e-06 ***
treat4 -77.333 5.110 -15.134 3.60e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.258 on 8 degrees of freedom
Multiple R-squared: 0.9702, Adjusted R-squared: 0.959
F-statistic: 86.73 on 3 and 8 DF, p-value: 1.925e-06
ۧۧ In our initial analysis, we didn’t consider the stage to which the cancer
had progressed at the time that treatment begins. This is important,
because those at earlier stages of disease will naturally live longer
on average. Stage of disease is a covariate. We should have been more
intentional in using randomization to balance out our groups.
set.seed(1234)
months2 = c(sample(c(78,93,86,57,45,60,28,
31,22,9,12,4),12,replace=F))
treat = gl(4,3)
years2 = c(sample(c(2.3,3.4,1.8,5.8,6.2,7.3,
9.6,11.0,12.2,14.8,17.3,16.0), 12,replace=F))
ۧۧ Notice that we have much more spread in the survival time post-
treatment. There’s not a clear treatment that outperforms the others.
ۧۧ Notice that even when we add years to the model, both variables remain
insignificant.
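ۧۧ A hedged sketch of that model ( the object name is an assumption ):
ancova.mod = lm(months2 ~ treat + years2)
summary(ancova.mod)   # treatment effects after adjusting for the covariate years2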
ۧۧ As the graphs suggest, box plots for the measurements show that
versicolor and virginica are more similar to each other than either is
to setosa.
library(MASS)
data(iris)
attach(iris)
# MANOVA test
man.mod = manova(cbind(Sepal.Length, Petal.Length) ~ Species,
data = iris)
man.mod
Call:
manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)
Terms:
Species Residuals
resp 1 63.2121 38.9562
resp 2 437.1028 27.2226
Deg. of Freedom 2 147
Residual standard errors: 0.5147894 0.4303345
Estimated effects may be unbalanced
summary(man.mod)
Df Pillai approx F num Df den Df Pr(>F)
Species 2 0.9885 71.829 4 294 < 2.2e-16 ***
Residuals 147
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary.aov(man1)[2]
Df Sum Sq Mean Sq F value Pr(>F)
Species 2 11.35 5.672 49.16 <2e-16 ***
Residuals 147 16.96 0.115
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary.aov(man1)[3]
Df Sum Sq Mean Sq F value Pr(>F)
Species 2 437.1 218.55 1180 <2e-16 ***
Residuals 147 27.2 0.19
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary.aov(man1)[4]
Df Sum Sq Mean Sq F value Pr(>F)
Species 2 80.41 40.21 960 <2e-16 ***
Residuals 147 6.16 0.04
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
SUGGESTED READING
Crawley, The R Book, “Analysis of Covariance,” chap. 12.
Faraway, Linear Models with R, “Analysis of Covariance,” chap. 13; and
“Factorial Designs,” chap. 15.
PROBLEMS
1 Which of the following statements about ANCOVA are true?
The explanatory variables are X1 = pen, X2 = pencil, and X3 = marker. How would
you model this situation?
a) ANOVA
b) ANCOVA
c) MANOVA
d) regression tree
STATISTICAL DESIGN
OF EXPERIMENTS
In this lecture, you will gain an understanding of how
experiments should be designed so that you collect
sound statistical data. You will be introduced to
the basic terminology. You will also learn techniques
for effective experimental design, along with 2 well-
known experimental design models: the randomized
block design and the 2ᵏ factorial design.
DESIGN OF EXPERIMENTS
ۧۧ The steps needed to plan and conduct an experiment closely follow the
scientific method.
ۧۧ It’s important that we take the necessary time and effort to organize
the experiment appropriately so that we have a sufficient amount of
data to answer the question. This process is called experimental design.
ൖൖ Variables are all of the factors, their treatments, and the measured
responses.
RANDOMIZATION
ۧۧ A technique for effective experimental design is randomization.
Treatments should be allocated to experimental units randomly.
ۧۧ Replication gives us more power to reject null hypotheses ( that the
treatment effects are all 0 ) and helps when we have missing data or possible
outliers. When possible, we want to base our theories on reproducible
results ( although this applies more to replicating your whole study
than to just using larger samples ).
ۧۧ The more replication we have, the more variability we can observe in the
response variable, separate from the treatment effects. When we increase
the number of replications, we increase the reliability of the outcome.
BLOCKING
ۧۧ The next technique is blocking, which refers to the distribution of the
experimental units into blocks in such a way that the units within each
block are homogeneous.
SAMPLE SIZE
ۧۧ Another technique is sample size. The decision between the sample size
and the cost will always be a compromise—unless you have infinite
resources.
3 The third column gives the degrees of freedom for sum of squares
for treatments, for sum of squares for blocks, for error, and for total.
4 The fourth column provides us with the mean square for treatments,
the mean square for blocks, and the mean square error.
5 Under the null hypothesis, our group means are all equal to 0
and there would be no treatment effect. If our F ratio is close to
1, meaning that the mean square for treatments is close to the
mean square error, we fail to reject H0 . Otherwise, if there’s a valid
treatment effect, then the mean square for treatments will be large
when compared to the mean square error and our F statistic will
allow us to reject the null hypothesis.
Source    Sum of Squares    Degrees of Freedom        Mean Square
Blocks    SSBlocks          𝑏 − 1                     SSBlocks/( 𝑏 − 1 )
Error     SSE               ( 𝑎 − 1 )( 𝑏 − 1 )        SSE/[( 𝑎 − 1 )( 𝑏 − 1 )]
ۧۧ Each of the factors have 2 levels ( e.g., “low” or “high” ), which may be
qualitative or quantitative.
ۧۧ Let’s begin by specifying the first 4 of factor A to be high ( + + + + ), which
requires the last 4 of factor A to be low ( − − − − ). Once A is factored, we
have to let B factor within A. For A’s 4 positives, B will be both ( + + ) and
( − − ). Likewise, for A’s 4 negatives, B will be both ( + + ) and
ۧۧ All that’s left is to factor in C. For A both positive and B both negative,
factor C comes in as a positive and a negative. Likewise, for A both
positive and B both positive, factor C comes in as both a positive and a
negative.
2² Factorial Design
ۧۧ R can help you here. Load the “BHH2” library and run “ffDesMatrix( 2 ),”
which generates 2ᵏ factorial designs.
library(BHH2)
print(ffDesMatrix(2))
2² = 4 factorial design
[,1] [,2]
[1,] -1 -1
[2,] 1 -1
[3,] -1 1
[4,] 1 1
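ۧۧ The same function generates the 2³ pattern described earlier ( a sketch; the
row ordering may differ from the lecture’s table ):
print(ffDesMatrix(3))   # 2^3 = 8 runs; columns give the -1/+1 settings for A, B, and C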
ۧۧ Once your design is specified and data has been collected, the analysis
of the experiment is similar to many of the ANOVA/ANCOVA and
MANOVA techniques that you’ve already learned about.
PITFALL
ۧۧ We can use blocking to control for a parameter that may not be of
immediate interest but that has to be accounted for in the analysis. But
don’t forget about ANCOVA. It’s possible that you could save a step and
use ANCOVA to measure and remove the effect of that factor from the
analysis. We would only have to adjust statistically to account for the
covariate, whereas in blocking, we would have to build the block factor
into the design of the experiment from the start.
PROBLEMS
1 Replication tells us the number of samples that we need to take for each
treatment. Which of the following statements about replications are true?
A decision tree is a graph that uses a branching
method to determine all possible outcomes of a
decision. The structure of a decision tree is similar
to a real tree, with a root, branches, and even leaves, but
it is an upside-down tree. We start at the root up top and
work our way down to the leaves. Classification trees and
regression trees are easily understandable and transparent
methods for predicting or classifying new records.
1 Take all of your data. Consider all possible values of all predictor
variables.
2 Choose the predictor variable and split value that give the greatest
separation in the response ( the split that most reduces the sum of
squared errors ).
3 If Xi < 3, then send the data to the left; otherwise, send data points to
the right. Notice that we just do binary splits.
ۧۧ Trees give us rules that are easy to interpret and implement. Decision
trees more closely mirror the human decision-making approach
than linear regression. Trees can be displayed graphically and are
easily interpreted even by a nonexpert. Also, trees don’t require the
assumptions of statistical models and work well even when some data
values are missing.
ۧۧ Trees for continuous outcomes are called regression trees, while trees
for categorical outcomes are called classification trees.
REGRESSION TREES
ۧۧ Regression trees are a simple yet powerful way to predict the response
variable based on partitioning the predictor variables. The idea is to
split the data into partitions and to fit a constant model of the response
variable in each partition.
ۧۧ Regression trees use the sum of squares. In fact, the way that we split
our data is to find the point of greatest separation in ∑[𝑦 − E( 𝑦 )]².
ൖൖ Parent node ( A ): A node at the top that splits into lower child nodes.
ۧۧ The first model we’ll consider is one using just 1 predictor variable of
weight to model mileage. As long as mileage is a numeric variable, “tree”
assumes that we want a regression tree model:
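ۧۧ A hedged reconstruction of that first fit ( car.test.frame ships with the
rpart package; the object name my.tree matches the plotting commands below ):
library(tree)
library(rpart)                 # provides the car.test.frame dataset
data(car.test.frame)
my.tree = tree(Mileage ~ Weight, data = car.test.frame)
summary(my.tree)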
ۧۧ The summary function that is associated with “tree” lists the formula
along with the associated datasets for the number of terminal nodes
( or leaves ) and gives us the residual mean deviance. ( This is the mean
square, which equals the sum of squares divided by N minus the number
of nodes, or 60 – 6, which gives us 54. ) We also have the 5-number
summary of the residuals.
Regression tree:
tree(formula = Mileage ~ Weight, data = car.test.frame)
Number of terminal nodes: 6
Residual mean deviance: 4.249 = 229.4 / 54
Distribution of residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.8890 -1.0620 -0.0625 0.0000 1.2330 4.3750
plot(my.tree)
text(my.tree)
ۧۧ The second and third splits occurred at weights of 2280 and 3087.5. The
fourth and fifth splits occurred at weights of 2747.5 and 3637.5.
ۧۧ Values at the leaves show the mean mileage ( miles per gallon )
associated with each of the 6 final data subsets.
ۧۧ For classic regression trees, the local model is the average of all values
in the last terminal node. The flexibility of a tree—its ability to help us
correctly classify data—depends on how many leaves it has.
ۧۧ How large should we grow our tree? A large tree will fragment the data
into smaller and smaller samples. This often leads to a model that overfits
the sample data and fails if we want to predict. On the other hand, a small
tree might not capture the important relationships among the variables.
ۧۧ There are default values in “tree” that determine the stopping rules
associated with having a few remaining samples or splits that add
information to the model. We use stopping rules to decide when the
tree-growing process should stop.
ۧۧ There are a few common stopping rules: The node won’t be split
if the size of a node is less than the user-specified minimum node
size, if the split of a node results in a child node whose node size is
less than the user-specified minimum child-node size value, or if
the improvement at the next split is smaller than the user-specified
minimum improvement.
ۧۧ The resulting tree has 7 splits with 8 terminal nodes. We could change the
parameters even more to get a bigger tree. But given the tendency for
tree models to overfit data, how do we know when we have a good model?
ۧۧ Tree models use a cross-validation technique that splits the data into a
training set for model fitting and a testing set to evaluate how good the
fit is. The following is how cross-validation works:
1 Split the data into a training set and a testing set.
2 Grow the tree using only the training data.
3 Prune the tree. At each pair of leaf nodes with a common parent,
calculate the sum-of-squares error on the testing data. Check to see
if the error would be smaller by removing those 2 nodes and making
their parent a leaf. ( Go around snipping the children off. )
ۧۧ The dotted line is a guide for a cutoff value relative to our complexity
parameter. The optimal choice of tree size is 5, because this is the first
value that falls below the dotted line. Going from 5 to 6 splits doesn’t
reduce the complexity parameter by a minimum of 0.01.
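ۧۧ With the tree package, the cross-validation and pruning steps look roughly
like this ( a sketch using the my.tree object from the earlier fit; the
lecture’s own plots may come from different commands ):
cv.results = cv.tree(my.tree)                 # cross-validated deviance for each tree size
plot(cv.results)
pruned.tree = prune.tree(my.tree, best = 5)   # keep the 5-leaf tree suggested above
plot(pruned.tree); text(pruned.tree)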
ۧۧ Let’s create a regression tree that predicts car mileage from price,
country, reliability, and car type. We have an optimal number of splits at
4, because the complexity parameter, “cp,” doesn’t decrease by at least
0.01 from 4 to 5 splits.
ۧۧ The decision-making input variables that are used to split the data can
be numerical or categorical. Outcome is categorical, so we use the mode
of the terminal nodes as the predicted value.
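ۧۧ A hedged reconstruction of the classification tree fit summarized below:
iris.tree = tree(Species ~ Sepal.Width + Petal.Width, data = iris)
summary(iris.tree)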
Classification tree:
tree(formula = Species ~ Sepal.Width + Petal.Width, data =
iris)
Number of terminal nodes: 5
Residual mean deviance: 0.204 = 29.57 / 145
Misclassification error rate: 0.03333 = 5 / 150
ۧۧ Of our 5 nodes, our classification tree only misclassifies 5 out of the 150
flowers, for a misclassification rate of around 3%.
ۧۧ We’re much less confident in our classification at point 107. We’re only
40% confident that it belongs to versicolor and 60% confident that it
belongs to virginica. This is where misclassifications take place.
PITFALL
ۧۧ Classification trees and regression trees may not perform well when
you have structure in your data that isn’t well captured by horizontal
or vertical splits. For example, in the following plot, while there’s
separation in the data, horizontal and vertical splits won’t help us
classify the data very well.
SUGGESTED READING
Crawley, The R Book, “Tree Models,” chap. 23.
Faraway, Extending the Linear Model with R, “Trees,” chap. 16.
PROBLEMS
1 The “kyphosis” dataset has 81 rows and 4 columns representing data on
children who have had corrective spinal surgery. It contains the following
variables:
Install the “tree” library and fit a regression tree to the “kyphosis” data using
only the variable “age.” Plot the tree and comment on the residual mean
deviance and misclassification rate.
library(tree)
library(rpart)   # the kyphosis dataset is provided by the rpart package
data(kyphosis)
tree1 <- tree(Kyphosis ~ Age, data = kyphosis)
plot(tree1)
text(tree1)
summary(tree1)
POLYNOMIAL AND
LOGISTIC REGRESSION
Linear models can be used to model a variety of
data and are relatively easy to fit in R. Applying
transformations to data that might not fit the
normality assumption gives us even more modeling
flexibility. But what about data that still doesn’t conform
to normality even after a transformation? Trees are one
possibility, but even tree-fitting methods aren’t effective
on data that doesn’t have natural splits. What can we do
when we have data for which transformations and tree
algorithms aren’t effective? In this lecture, you will learn
about polynomial regression and logistic regression.
POLYNOMIAL REGRESSION
ۧۧ Polynomial regression lets us extend the linear model by adding powers
of predictors to our model. This gives us a clean way to give a nonlinear
fit to our data.
library(MASS)
data(Boston)
names(Boston)
[1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
[8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
ۧۧ Here are the variables in the dataset. Let’s restrict our attention to
the last 2: the median house value for select neighborhoods in Boston
( medv ), which is the response variable, and the lower status of the
population by percentage ( lstat ), which is the explanatory variable. Let’s
use simple linear regression.
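ۧۧ A sketch of that simple fit ( the object name is an assumption ):
lstat.lm = lm(medv ~ lstat, data = Boston)
summary(lstat.lm)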
Residuals:
Min 1Q Median 3Q Max
-15.168 -3.990 -1.318 2.034 24.500
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.55384 0.56263 61.41 <2e-16 ***
lstat -0.95005 0.03873 -24.53 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
Call:
lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
Residuals:
Min 1Q Median 3Q Max
-15.2834 -3.8313 -0.5295 2.3095 25.4148
ۧۧ This just fits the same model as median value on lower status. Inside a
model formula, the “^” symbol is interpreted as a formula operator rather
than arithmetic, so “lstat^2” collapses back to “lstat.” The “I( )” function
tells R to treat the squaring as ordinary arithmetic, so it is necessary.
Call:
lm(formula = medv ~ lstat + lstat^2, data = Boston)
Residuals:
Min 1Q Median 3Q Max
-15.168 -3.990 -1.318 2.034 24.500
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.55384 0.56263 61.41 <2e-16 ***
lstat -0.95005 0.03873 -24.53 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
Residuals:
Min 1Q Median 3Q Max
-14.5441 -3.7122 -0.5145 2.4846 26.4153
ۧۧ Instead, we want our model to approximate the true model for the
entire population. Our model shouldn’t just fit the current sample; it
should fit new samples, too.
ۧۧ Imagine that we break X into bins and fit a different constant in each
bin. Essentially, we create the bins by selecting K cut points in the range
of X and then construct K + 1 new variables, which behave like dummy
variables ( with only 1 or 0 as values ).
ۧۧ Let’s consider a 5-year step pattern for the Boston data. Here’s what a
step function would look like on our Boston dataset.
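ۧۧ One way to fit such a step function in R is to bin the predictor with
cut( ) ( a sketch; the number of bins is an assumption ):
step.fit = lm(medv ~ cut(lstat, breaks = 5), data = Boston)
summary(step.fit)   # one estimated constant per bin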
LOGISTIC REGRESSION
ۧۧ Another class of cases where we might want to fit a model to nonlinear
data is logistic regression. Often, we face the problem of modeling
binary data. We might want to predict whether patients will live or die,
species will thrive or go extinct, buildings will fall or stay standing, or a
patient will accept or reject an organ transfer.
ۧۧ The problem is that we’re dealing with a binary outcome. One solution
would be to just code “alive” as 1 and “dead” as 0 and fit a line through
those points as we would with linear regression. It’s possible. We could
then try to interpret values on the fitted line as the probability of
survival.
ۧۧ The problem is that we’re dealing with a probability, but the regression
can easily give us illegal values less than 0 or greater than 1. Also,
linear models don’t handle probabilities well. For example, smoking
2 cigarettes per day might double your risk of cancer compared to
smoking 1 per day, but increasing from 11 to 12 per day may not make
such a big difference. And once we predict that a patient has a 100%
chance of an outcome, we’re maxed out, and no new information about
the patient could improve his or her odds.
ۧۧ The odds give us a value that ranges from nearly 0 ( for very small
probabilities ) to positive infinity ( for probabilities essentially at 1 ).
ۧۧ But with one more transformation, we can get a value that’s unbounded
over the real numbers. This is the logit function: 𝑦 = log[𝑝/( 1 − 𝑝 )].
Probabilities transformed through the logit function are known as
logit values.
ۧۧ In our case, there’s a variety of options, but the most commonly used
is the logit function ( and the similar-looking probit function, which is
based on a normal distribution ).
                  am cyl  hp    wt
Mazda RX4          1   6 110 2.620
Mazda RX4 Wag      1   6 110 2.875
Datsun 710         1   4  93 2.320
Hornet 4 Drive     0   6 110 3.215
Hornet Sportabout  0   8 175 3.440
Valiant            0   6 105 3.460
ۧۧ We use the “glm( )” function to create the regression model and get its
summary for analysis. In R, we fit a GLM in the same way as a linear
model, except using “glm” instead of “lm,” and we must also specify the
type of GLM to fit using the “family” argument.
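ۧۧ A hedged sketch of that call ( the object name and predictor set are
inferred from the output below ):
logit.mod = glm(am ~ cyl + hp + wt, data = mtcars, family = binomial)
summary(logit.mod)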
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 19.70288 8.11637 2.428 0.0152 *
cyl 0.48760 1.07162 0.455 0.6491
hp 0.03259 0.01886 1.728 0.0840 .
wt -9.14947 4.15332 -2.203 0.0276 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43.2297 on 31 degrees of freedom
Residual deviance: 9.8415 on 28 degrees of freedom
AIC: 17.841
Number of Fisher Scoring iterations: 8
ۧۧ We see the associated z-value. This is the z-statistic that tests whether
our coefficients are significant.
ۧۧ The last column is the 𝑝-value. Remember that 𝑝-values less than 0.05
indicate that the coefficient is statistically significant.
ۧۧ Because the 𝑝-value in the last column is more than 0.05 for the
variables “cyl” and “hp,” we consider them to be insignificant in
contributing to the value of the variable “am” ( t ransmission ). Only “wt”
( weight ) impacts the “am” value in this logistic regression model.
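ۧۧ Refitting with weight alone ( a sketch matching the output below ):
logit.wt = glm(am ~ wt, data = mtcars, family = binomial)
summary(logit.wt)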
Deviance Residuals:
Min 1Q Median 3Q Max
-2.11400 -0.53738 -0.08811 0.26055 2.19931
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 12.040 4.510 2.670 0.00759 **
wt -4.024 1.436 -2.801 0.00509 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43.230 on 31 degrees of freedom
Residual deviance: 19.176 on 30 degrees of freedom
AIC: 23.176
Number of Fisher Scoring iterations: 6
ۧۧ Our estimate for weight changed from approximately −9 to −4, and the
𝑝-value decreased from 0.0276 to 0.005.
ۧۧ We need to work our way back through the logit score’s construction
process.
ۧۧ Thus, the model says that a car that weighs 4000 pounds has a 1.7%
chance of being a manual.
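ۧۧ That back-transformation can be checked directly with the coefficients
above ( plogis( ) is R’s inverse logit ):
logit.4000 = 12.040 - 4.024*4   # wt is measured in 1000s of pounds, so 4000 pounds is wt = 4
plogis(logit.4000)              # about 0.017, a 1.7% chance of being a manual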
PITFALLS
ۧۧ Don’t try to extrapolate from fitted polynomials.
ۧۧ Higher-order polynomials are great fits to the data at hand. They can
often overfit the data. But predictions or extrapolations for values of 𝑥
outside of its range will often be incorrect and invalid.
SUGGESTED READING
Diez, Barr, and Cetinkaya-Rundel, OpenIntro Statistics, “Multiple and Logistic
Regression,” section 8.4.
Yau, R Tutorial, “Logistic Regression,” http://www.r-tutor.com/elementary-
statistics/logistic-regression.
PROBLEMS
1 In polynomial regression, different ________ are added to an
equation to see whether they increase the ________ significantly.
a) powers of X; variance
b) functions of X; degrees of freedom
c) powers of X; R2
d) significant; mean square error
SPATIAL STATISTICS
We tend to think of data as being fixed in time and
space, collected in the moment to be analyzed in
the moment. But what happens when our data
is on the move—when values of our variables change over
space? We can use a branch of statistics called spatial
statistics, which extends traditional statistics to support
the analysis of geographic data. It gives us techniques to
analyze spatial patterns and measure spatial relationships.
SPATIAL STATISTICS
ۧۧ Spatial statistics is designed specifically for use with spatial data—
with geographic data. These methods actually use space ( area, length,
direction, orientation, or some notion of how the features in a dataset
interact with each other ), and space is right in the statistics. That’s
what makes spatial statistics ( also known as geostatistics ) different
from traditional statistical methods.
ൖൖ library( sp );
ൖൖ library( spdep );
ൖൖ library( rgeos );
ൖൖ library( geoR ); and
ൖൖ library( gstat ).
ۧۧ There are other packages that help with visualization of a spatial analysis:
ൖൖ library( raster );
ൖൖ library( rasterVis );
ൖൖ library( maptools ); and
ൖൖ library( ggmap ).
ۧۧ Spatial data come in several different formats, such as points, lines, and
polygons. For example, customers might be points, roads could be lines,
and zoning districts could be polygons. In R, any of that information can
be handled in a vector format or a raster format.
ۧۧ The package “sp” defines a set of spatial objects for R, including points,
lines, and polygon vectors; and gridded/pixel raster data.
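ۧۧ As a small sketch ( the three coordinates here are made up ), a set of point locations can be stored as a SpatialPoints object:
library(sp)
coords <- matrix(c(-78.6, 35.8,
                   -78.9, 36.0,
                   -79.1, 35.9), ncol = 2, byrow = TRUE)   # longitude, latitude pairs
pts <- SpatialPoints(coords)   # a point (vector) object from the "sp" package
summary(pts)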
ۧۧ Here is data for 100 counties in North Carolina that includes the counts
of live births and deaths due to sudden infant death syndrome ( SIDS )
for 2 periods: July 1974–June 1978 and July 1979–June 1984.
ۧۧ This map has the latitude and longitude locations for the North Carolina
counties along with other data from the late 1970s and early 1980s.
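ۧۧ A sketch of loading this dataset ( in current packages it ships as “nc.sids” with the companion “spData” package, which “spdep” attaches ):
library(spdep)   # attaches spData, which provides the nc.sids data frame
data(nc.sids)
names(nc.sids)   # includes BIR74, SID74, BIR79, SID79, and county coordinates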
ۧۧ There are various ways that we define the neighbor relationship, but 3
of the most popular are rook, queen, and K-nearest neighbors. Rook and
queen neighbors relate to the moves that those 2 pieces can make in a
chess game.
ۧۧ The queen’s neighbors include all of the rook’s neighbors plus diagonal
neighbors. In the queen’s neighbor, any areas sharing any boundary
point are taken as neighbors. Here are the queen’s neighbors for the
North Carolina dataset.
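ۧۧ A hedged sketch with the “spdep” package ( “nc_shape” is a hypothetical SpatialPolygons object holding the 100 county boundaries ):
library(sp)
library(spdep)
queen_nb <- poly2nb(nc_shape, queen = TRUE)    # queen contiguity: any shared boundary point
rook_nb  <- poly2nb(nc_shape, queen = FALSE)   # rook contiguity: shared boundary segments only
knn_nb   <- knn2nb(knearneigh(coordinates(nc_shape), k = 1))   # 1-nearest neighbor
summary(queen_nb)   # reports the number of nonzero links and the average per location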
ۧۧ The queen’s neighbor structure has 490 nonzero links, with an average of 4.9
links per spatial location, whereas the 1-nearest-neighbor structure has only 200
nonzero links, with an average of 2 links per spatial location.
ۧۧ We can choose the type of weight that we want to model on our data—
whether queen, rook, or K-nearest neighbor—based on our research
question and underlying spatial autocorrelation of the data. Most
statisticians compare multiple model fits to determine the best one.
ۧۧ We can test for spatial autocorrelation with Moran’s I. The null hypothesis for
the test is that the data is randomly dispersed: there’s no spatial correlation.
The alternate hypothesis is that the data is more spatially clustered than would
be expected by chance alone.
ۧۧ We can calculate Moran’s I for our North Carolina SIDS data. Notice that
the 𝑝-value of 0.007 is significant, so we reject the null hypothesis that
our spatial locations are random.
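ۧۧ A hedged sketch of the calculation ( “queen_nb” is the neighbor list from the earlier sketch, and “sids_rate” is a hypothetical vector of county SIDS rates in the same county order ):
library(spdep)
lw <- nb2listw(queen_nb, style = "W")   # row-standardized spatial weights
moran.test(sids_rate, lw)               # a small p-value (about 0.007 here) rejects spatial randomness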
ۧۧ Blue values are low, with a relative risk of less than 5%, while redder
values have a relative risk of greater than 95%. Notice the clustering of
counties with higher relative risk of SIDS.
SEMIVARIOGRAMS
ۧۧ Here’s a locally weighted plot of
samples of coal taken over a spatial
area. The z-axis shows the amount of
coal ( not elevation ). Peaks, or high
points, represent spatial areas where
larger amounts of coal were found.
Valleys, or low points, represent areas
where smaller amounts were found.
ۧۧ Imagine that you stand in the center of your data and begin to walk
north. You can calculate the semivariogram as you go along to find
out at which point you’re no longer correlated with where you started.
( That distance is the range; the semivariance value at which the curve
levels off is the sill. )
ۧۧ Now imagine that you walked east from the center, then south, and then
west. Those 4 semivariograms give us a spatial landscape.
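ۧۧ A minimal sketch of this with the “gstat” package ( the data frame “coal_df” and its columns x, y, and thickness are hypothetical stand-ins for the coal samples ):
library(sp)
library(gstat)
coordinates(coal_df) <- ~ x + y              # promote the data frame to a spatial object
v <- variogram(thickness ~ 1, coal_df,
               alpha = c(0, 45, 90, 135))    # directional semivariograms: N, NE, E, SE
plot(v)   # estimate the nugget, sill, and range in each direction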
ۧۧ We could also fit a spatial model to our data and analyze the residuals,
much like we did in linear regression.
ۧۧ What do you think the semivariogram would look like if there was no
spatial correlation—no spatial pattern left in the residuals?
PITFALL
ۧۧ Unlike other instances where R’s default settings are usually fine, R
can’t always pick the best way to divide spatial data for you. You need
to look at spatial data from all different directions and choose the
semivariogram that most closely models your research needs.
SUGGESTED READING
Crawley, The R Book, “Spatial Statistics,” chap. 26.
PROBLEMS
1 Estimate the range, sill, and nugget
in the semivariogram shown here.
2 Which of the following statements about semivariograms tend to be true?
a) Two sample points at the same location are likely to have the same
semivariance.
b) Values that occur at a distance prior to where the graph starts leveling out
are spatially autocorrelated.
c) When distance increases, the semivariance increases.
d) If there are fewer pairs of points separated by far distances, the
correlations between them will tend to decrease.
e) All of the above tend to be true.
Time series analysis gives us a way to model response
data that’s correlated with itself from one point in
time to the next—data that has a time dependence.
To analyze that type of data, we need new methods that
can model a dependency on time. We’ve traditionally
looked at modeling the relationship of ( predictor ) X versus
( response ) Y, whether by linear, polynomial, or logistic
regression. But now, we have a single measurement
from a population and we want to understand how
that measurement changes over time. Time is the
independent predictor variable. Our goal is to understand
how our response, Y, varies with respect to time.
TIME SERIES
ۧۧ A time series is a collection of evenly spaced numerical observations
that are measured successively in time.
ۧۧ The data file “wages” contains monthly values of the average hourly
wages ( in dollars ) for workers in the U.S. apparel and textile products
industry for July 1981 through June 1987.
library(TSA)
data(wages)
plot(wages,ylab='Monthly Wages',type='o')
ۧۧ Time series analysis follows in much the same way: Fit a line to the
data as a function of time ( predictor ) and check the residuals ( what’s
left over from the fit ). They should look like noise ( not have any
pattern or trend ).
wages.lm=lm(wages~time(wages))
summary(wages.lm)
Call: lm(formula = wages ~ time(wages))
Residuals:
Min 1Q Median 3Q Max
-0.23828 -0.04981 0.01942 0.05845 0.13136
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.490e+02 1.115e+01 -49.24 <2e-16 ***
time(wages) 2.811e-01 5.618e-03 50.03 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
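ۧۧ We can then plot the standardized residuals from this straight-line fit ( a sketch; the plotting call for this step isn’t reproduced in the text ):
plot(y = rstudent(wages.lm), x = as.vector(time(wages)),
     ylab = 'Standardized Residuals', type = 'o', main = "Residual Plot")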
ۧۧ These residuals don’t look random at all. They hang together too closely,
suggesting that there’s still more structure to be removed from the
data. Let’s try a quadratic fit on the wages time series.
wages.lm2=lm(wages~time(wages)+I(time(wages)^2))
summary(wages.lm2)
Call: lm(formula = wages ~ time(wages) + I(time(wages)^2))
Residuals:
Min 1Q Median 3Q Max
-0.148318 -0.041440 0.001563 0.050089 0.139839
plot(y=rstudent(wages.lm2), x=as.vector(time(wages)), ylab='Standardized
Residuals', type='o', main="Residual Plot")
hist(rstudent(wages.lm2),xlab='Standardized Residuals')
qqnorm(rstudent(wages.lm2))
qqline(rstudent(wages.lm2))
TYPES OF MODELS
ۧۧ The basic objective usually is to determine a model that describes
the pattern of the time series. Let’s consider the 2 types of models
for the pattern in a time series: autoregressive models and moving
average models.
ۧۧ An autoregressive model of order 1, written AR( 1 ), predicts the current value
of the series from the immediately preceding value: Yt = δ + φ1Yt−1 + et. The
assumptions of the AR( 1 ) model are that the errors et are independently
distributed with a normal distribution that has mean 0 and constant variance
and that the errors are independent of past values of the series.
ۧۧ Moving average ( MA ) models are ones that relate the present value of a
series to past prediction errors: an MA( 1 ) model, for example, has the form
Yt = μ + et + θet−1.
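ۧۧ As a quick illustration ( simulated data, not from the lecture ), we can generate short AR( 1 ) and MA( 1 ) series and compare their behavior:
set.seed(1)
ar1 <- arima.sim(model = list(ar = 0.8), n = 120)   # AR(1): each value depends on the previous value
ma1 <- arima.sim(model = list(ma = 0.8), n = 120)   # MA(1): each value depends on the previous error
par(mfrow = c(1, 2))
plot(ar1, main = "Simulated AR(1)"); plot(ma1, main = "Simulated MA(1)")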
SEASONAL SERIES
ۧۧ The other common pattern in a time series dataset is seasonality, which
is any kind of periodicity.
ۧۧ A time series repeats itself after a regular period of time. The business
cycle plays an important role in economics. In time series analysis, the
business cycle is typically represented by a seasonal ( or periodic ) model.
ۧۧ The first difference of a time series is the series of changes from one
period to the next. If Yt denotes the value of the time series Y at period t,
then the first difference of Y at period t is equal to Yt − Yt−1.
plot(diff(log(airpass)),type='o',ylab='Difference of
Log(Air Passengers)')
plot(diff(log(airpass)),type='l',ylab='Difference of Log(Air Passengers)')
ۧۧ Let’s check out the time series plot of the seasonal difference of the first
difference of the logged series.
plot(diff(diff(log(airpass)),lag=12),type='l',ylab='First
& Seasonal Differences of Log(AirPass)')
points(y=diff(diff(log(airpass)),lag=12),
x=time(diff(diff(log(airpass)),lag=12)),
pch=as.vector(season(diff(diff(log(airpass)),lag=12))))
ۧۧ We’ve accounted for the structure in our data through taking the log,
the first difference, and the first seasonal difference.
ۧۧ Here, the ACF gives correlations between the series 𝑦t and lagged values
of the series for lags of 1, 2, 3, and so on. This is a visual plot of 𝜌1, 𝜌2 , ….
ۧۧ The ACF can be used to identify the possible structure of time series
data. That can be tricky because there often isn’t a single clear-cut
interpretation of a sample ACF.
ۧۧ In a different way, the ACF of the residuals for a model is also useful.
The ideal for an ACF of residuals is that there aren’t any significant
correlations for any lag, because then your model has taken into account
all of the structure in the data.
ۧۧ The following is the ACF of the residuals for the wages example, where
we used an AR( 1 ) model. The lag ( time span between observations ) is
shown along the horizontal, and the autocorrelation is on the vertical.
The dotted lines indicate bounds for statistical significance.
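ۧۧ A hedged sketch of such a fit ( the lecture’s exact call isn’t shown; here an AR( 1 ) model with a linear time trend is fit to the wages series ):
wages.ar1 <- arima(wages, order = c(1, 0, 0), xreg = time(wages))
acf(residuals(wages.ar1), main = 'ACF of AR(1) Residuals')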
ۧۧ Here’s the ACF of the seasonal difference of the first difference of the
logged series for air passengers.
acf(as.vector(diff(diff(log(airpass)),lag=12)),ci.type='ma',
main='First & Seasonal Differences of Log(AirPass)')
model=arima(log(airpass),order=c(0,1,1),
seasonal=list(order=c(0,1,1), period=12))
model
Call: arima(x = log(airpass), order = c(0, 1, 1), seasonal
= list(order = c(0, 1, 1), period = 12))
Coefficients:
ma1 sma1
-0.4018 -0.5569
s.e. 0.0896 0.0731
sigma^2 estimated as 0.001348: log likelihood = 244.7,
aic = -485.4
plot(model,n1=c(1969,1),n.ahead=24,pch=19,ylab='Log(Air
Passengers)')
ۧۧ The forecasts follow the seasonal and upward trend of the time series
nicely. The forecast limits provide us with a clear measure of the
uncertainty in the forecasts. We can also plot the forecasts and limits
in original terms.
plot(model,n1=c(1969,1),n.ahead=24,pch=19,ylab='Air
Passengers',transform=exp)
ۧۧ In original terms, it is easier to see that the forecast limits spread out as
we get further into the future.
ۧۧ Just because you found one kind of seasonality in time series doesn’t
mean that there’s not a second and a third kind as well.
ۧۧ Any time series of currency ( dollars, pounds, etc. ) has to take into
account the fact that the value of money changes over time. So, if we
were modeling changes in home prices over time, then we would need
to correct our data to some common base. Usually this is done by
starting with a base year ( often the start or end of the series, or the
current year ) and adjusting values based on changes to some official
inflation statistic ( e.g., the consumer price index ).
ۧۧ Perhaps the biggest advantage of using time series is that we can use it
to model the past—and better understand the future.
PROBLEMS
1 In the R library “TSA,” there’s a dataset called “beersales” that contains the
monthly U.S. beer sales (in millions of barrels) from January 1975 through
December 1990.
a) Make a time series plot for “beersales” using the first letter of each month
as the plotting symbols.
library(TSA)
data("beersales")
plot(beersales, main="Monthly US Beer Sales",type='l')
points(y=beersales,x=time(beersales), pch=as.
vector(season(beersales)))
beer.model=lm(beersales~season(beersales))
summary(beer.model)
plot(y=beer.model$residuals, x=as.vector(time(beersales)), type='l',
main="Residuals", xlab="Year", ylab="Residuals")
points(y=beer.model$residuals, x=as.vector(time(beersales)),
pch=as.vector(season(beersales)))
hist(beer.model$residuals, main="Beer Residuals")
qqnorm(beer.model$residuals, pch=20)
qqline(beer.model$residuals)
shapiro.test(beer.model$residuals)
b) Take first seasonal differences of the data, with a season = 12 months. Fit a
seasonal-means trend to the data and examine the results.
c) Check for normality of the residuals by plotting the residuals, along with
their histogram and Q-Q plot, and performing a Shapiro-Wilk test.
plot(y=beer.model2$residuals, x=as.vector(time(beer.diff)), type='l',
main="Seasonal Differenced Residuals", xlab="Year", ylab="Residuals")
points(y=beer.model2$residuals, x=as.vector(time(beer.diff)),
pch=as.vector(season(beer.diff)))
qqnorm(beer.model2$residuals, pch=20)
qqline(beer.model2$residuals)
Thus far, our approach to statistical inference
has been greedy for new data. Getting more
data and increasing our sample size are
the keys to making our inferences even more reliable.
The approach we have been following is called a
frequentist approach. But what about also using prior
data and prior information? This is the central idea of
Bayesian inferential statistics, which gives us a way
to update our prior beliefs based on observed data.
BAYESIAN STATISTICS
ۧۧ Bayesian statistics is an entirely different approach for doing statistical
inference. In other words, it’s not just another technique like regression,
hypothesis testing, or ANOVA.
ۧۧ The methods are based on the idea that before we ever observe data, we
have some prior belief about it, perhaps based on experience or other
experiments. As we observe new data, we use the data to update our belief.
ۧۧ A lot of what you’ve learned in this course stays the same with Bayesian
statistics. Histograms, box plots, and numerical summaries stay
largely the same. Discrete and continuous distributions, along with
special cases such as the binomial, exponential, uniform, and normal
distributions, also carry over unchanged.
ۧۧ But the way that we have been drawing conclusions about a population
of interest through t-tests and hypothesis testing looks very different in
a Bayesian approach to inference.
ۧۧ Most of the methods learned so far in this course take this frequentist
approach, where we assume the following:
ۧۧ Statistical analyses are judged by how well they perform in the long run
over an infinite number of hypothetical repetitions of the experiment.
ۧۧ For example, your prior belief might be that a coin is fair and P( heads ) =
0.5. You might observe 50 flips of the coin, where 42 out of 50 flips were
heads. This might lead you to change your belief that the coin was fair.
ۧۧ Suppose that your uncertainty about an unknown parameter is described by a
prior distribution and that you then obtain some data relevant to that parameter.
The data changes your uncertainty, which is then described by a new probability
distribution, called the posterior distribution, which reflects the
information both in the prior distribution and the data. In other words,
in Bayesian statistics, we start with a prior belief about a parameter, obtain
some data, and update our belief about the parameter.
2 If the hypothesis predicted the data well—that is, the data was what
we would have expected to occur if the hypothesis had been true.
2 P( θ ) is the prior probability, which describes how sure we were that
θ was true before we observed the data.
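ۧۧ For reference, these pieces combine through Bayes’ theorem: the posterior is
proportional to the likelihood times the prior, P( θ | data ) = P( data | θ ) × P( θ ) / P( data ).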
ۧۧ For squared error loss, the posterior mean is the Bayes estimator.
ۧۧ For absolute error loss, the posterior median is the Bayes estimator.
ۧۧ The absolute error is the dashed line, and the squared error is the solid
line. Depending on where our data fall along the 𝑥-axis, the squared
error is lowest or the absolute error is lowest.
ۧۧ The domain of the beta distribution is ( 0, 1 ), which lines up with the
appropriate range for our batting average. We expect that the player’s
season-long batting average will be most likely around 0.26 but that it
could reasonably range from 0.20 to 0.36. This can be represented with
a beta distribution with parameters 𝑎 = 78 and 𝑏 = 222:
ۧۧ But here’s why the beta distribution is so amazing. Imagine that our
player gets a single hit. The player’s record for the season is now 1 hit
and 1 at bat. We can update our probability, given this new information.
ۧۧ Suppose that halfway through the season the player has been up to bat
300 times, hitting 105 out of those times. The new distribution would
be beta( 79 + 105, 222 + 195 ).
ۧۧ The curve is now thinner and slightly shifted to the right to reflect the
higher batting average. In fact, the new expected value is our posterior
estimate of the player’s batting average. We can calculate it as 𝑎/( 𝑎 + 𝑏 ).
ۧۧ After 105 hits of 300 real at bats, the expected value of the new beta
distribution is 𝑎/( 𝑎 + 𝑏 ) = ( 79 + 105 )/( 79 + 105 + 222 + 195 ) = 0.306.
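ۧۧ A short sketch of this update in R, using the numbers from the example above:
x <- seq(0, 1, length.out = 500)
plot(x, dbeta(x, 78, 222), type = "l", lty = 2,
     xlab = "Batting average", ylab = "Density")   # prior: beta(78, 222)
lines(x, dbeta(x, 79 + 105, 222 + 195))            # posterior halfway through the season
(79 + 105) / (79 + 105 + 222 + 195)                # posterior mean, about 0.306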
ۧۧ Recall that a random variable has the uniform( 0, 1 ) distribution if its
probability density function is constant over the interval [0, 1] and 0
everywhere else: π( 𝑥 ) = 1 for 0 ≤ 𝑥 ≤ 1.
ۧۧ On the other hand, the beta ( 𝑎, 𝑏 ) distribution is another commonly
used distribution for a continuous random variable that can only take
on values [0, 1].
ۧۧ The most important thing is that 𝑥^( 𝑎−1 )( 1 − 𝑥 )^( 𝑏−1 ) determines the shape
of the curve.
ۧۧ Notice that uniform ( 0, 1 ) is a special case of the beta distribution
where 𝑎 = 1 and 𝑏 = 1. The shape of the distribution changes when 𝑎 and
𝑏 are much less than 1, when they are much greater than 1, and for all
values in between.
ۧۧ Suppose that your data followed a beta distribution and you wanted to use
an exponential prior. The posterior would be a beta times an exponential
distribution. We don’t know the mean or variance of that distribution. It’s
actually more computationally intensive to try to estimate.
SUGGESTED READING
Bolstad, Introduction to Bayesian Statistics, “Bayesian Inference for Binomial
Proportion,” chap. 8; and “Bayesian Inference for Normal Mean,” chap. 11.
PROBLEMS
1 Which of the following statements are true?
2 If you have data that follows a binomial distribution and would like to assign a
prior, which of the following methods would not be appropriate?
c) Choose the uniform prior that gives equal weight to all values.
d) Construct a discrete prior at several values and interpolate them to create
a continuous prior distribution.
One of the coolest things you can do in R is write your
own functions. Custom functions allow you to define
a specific action you are interested in, which you
can then easily apply to new data. This can be anything
from calculating a unique statistic, to creating a custom
plot, to combining several outputs into a single display.
ۧۧ For example, let’s create a custom function that replicates the “mean( )”
function in R:
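ۧۧ The definition itself isn’t reproduced here; one minimal version, consistent with the calls shown below, is:
mean.fun <- function(x) {
  sum(x) / length(x)   # add up the values and divide by how many there are
}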
ۧۧ Once you’ve defined a function and given it a name, you can then use the
function on new data:
set.seed(3456)
mean.fun(0:20)
[1] 10
mean.fun(rnorm(300))
[1] 0.02367274
ۧۧ While the function “mean.fun” only had 1 input ( the vector 𝑥 ), we can
also define multiple inputs. For example, let’s adjust the function so that
it has an input “delete.outliers,” which, when true, automatically deletes
outliers before calculating the mean. This will come in handy whenever
you do exploratory data analysis.
2 If you include a default value for a function input, then if the user doesn’t
specify the input value, R will use the default value. You can even include
a default data vector for 𝑥 by including “x = rnorm( 10 ),” for example.
3 When you have a logical input ( which is either true or false, such as
“delete.outliers” ), it’s common to use “if” statements that define the
action for when the input is true and for when it’s false.
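ۧۧ A hedged sketch of the extended function ( the lecture’s exact outlier rule isn’t shown; here outliers are taken to be points more than 1.5 IQRs beyond the quartiles ):
mean.fun <- function(x = rnorm(10), delete.outliers = FALSE) {
  if (delete.outliers == TRUE) {
    q <- quantile(x, c(0.25, 0.75))
    spread <- 1.5 * IQR(x)
    x <- x[x >= q[1] - spread & x <= q[2] + spread]   # drop values outside the fences
  }
  sum(x) / length(x)
}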
set.seed(3456)
Data <- c(rnorm(n = 100, mean = 0, sd = 1),400)
mean.fun(Data, delete.outliers = F)
[1] 3.936704
mean.fun(Data, delete.outliers = T)
[1] -0.02392943
ۧۧ If you find that you’re frequently adding certain things to plots, such as
reference lines for the median or mean, then creating a custom plotting
function can make things easier for you.
ۧۧ Use “abline” to add a straight line to an existing plot; the argument “v = ” draws
a vertical line at the given 𝑥-value.
set.seed(3456)
hist.fun(rexp(100), add.median = T)
data(mtcars)
hist.fun(mtcars$mpg, add.median = T, add.mean = T, add.legend = T)
# num.breaks controls how many bins the histogram uses
hist.fun <- function(x, add.median = T, add.mean = T,
                     add.legend = T, num.breaks = 12) {
  b <- seq(min(x), max(x), length = num.breaks)   # bin breakpoints spanning the data
  hist(x, col = "cadetblue", breaks = b)
  if(add.median == T) {abline(v = median(x), lwd = 3, col = "blue")}       # solid line at the median
  if(add.mean == T) {abline(v = mean(x), lty = 2, lwd = 3, col = "red")}   # dashed line at the mean
  if(add.legend == T) {
    legend("topright", c("median", "mean"), lwd = c(3, 3),
           lty = c(1, 2), col = c("blue", "red"))
  }
}
data(mtcars)
hist.fun(mtcars$mpg, add.median = T, add.mean = T, add.legend = T)
par(mfrow=c(1,2))
hist.fun(chick.no.out,num.breaks = 5)
hist.fun(chick.no.out,num.breaks = 20)
par(mfrow=c(1,3))
hist.fun(chick.no.out,num.breaks = 5)
hist.fun(chick.no.out,num.breaks = 20)
hist.fun(chick.no.out,num.breaks = 30)
if(add.mean == T) {
abline(h = mean(y))
abline(v = mean(x))
}
}
scatter.fun()
set.seed(3456)
x <- rnorm(100)
y.uncorr <- rnorm(100)
y.corr <- x + rnorm(100, 0, .3)
par(mfrow = c(1, 2))
scatter.fun(x, y.uncorr)
scatter.fun(x, y.corr)
...
Model <- lm(y ~ x)
p.value <- anova(Model)$"Pr(>F)"[1]
if(p.value <= 0.05) {Reg.Line.Col <- "green"}
if(p.value > 0.05) {Reg.Line.Col <- "darkred"}
abline(lm(y ~ x), lty = 2, lwd = 4,
col = Reg.Line.Col)}
}
set.seed(3456)
par(mfrow = c(1, 2))
scatter.fun(x, y.uncorr)
scatter.fun(x, y.corr)
if(add.mean == T) {
abline(h = mean(y))
abline(v = mean(x))
}
if(add.conclusion == T) {
C.Test <- cor.test(x, y)
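ۧۧ A hedged assembly of the “scatter.fun” pieces shown above ( the argument names, defaults, and legend are assumptions; the lecture’s full definition isn’t reproduced here ):
scatter.fun <- function(x = rnorm(100), y = rnorm(100),
                        add.mean = T, add.regression = T, add.conclusion = T) {
  plot(x, y, pch = 20)
  if(add.mean == T) {
    abline(h = mean(y))   # horizontal line at the mean of y
    abline(v = mean(x))   # vertical line at the mean of x
  }
  if(add.regression == T) {
    Model <- lm(y ~ x)
    p.value <- anova(Model)$"Pr(>F)"[1]
    # color the fitted line by whether the slope is significant at the 0.05 level
    if(p.value <= 0.05) {Reg.Line.Col <- "green"}
    if(p.value > 0.05)  {Reg.Line.Col <- "darkred"}
    abline(Model, lty = 2, lwd = 4, col = Reg.Line.Col)
  }
  if(add.conclusion == T) {
    C.Test <- cor.test(x, y)
    legend("topleft", bty = "n",
           legend = paste("r =", round(C.Test$estimate, 2),
                          ", p =", signif(C.Test$p.value, 2)))
  }
}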
ۧۧ For example, if R’s working directory is the desktop and all R files are kept in
a folder on the desktop called “R files,” the following command will make
R look in that folder:
setwd("R files")
ۧۧ Once R is looking in the folder of interest, the files ( data, R code, etc. ) in
that folder can then be accessed.
ۧۧ The “source” command, for example source( "myscript.R" ), accesses and runs
the code contained in the named file ( the file name here is just an illustration ).
EXCHANGING DATA
ۧۧ You can make R do so many more things for you. You can even share
your work with others.
ۧۧ There are several commands for bringing data into R, from local files or the web:
ൖൖ read.csv( “filename.txt” );
ൖൖ download.file( url ); and
ൖൖ read.csv( url( “http://any.where.com/data.csv” ) ).
ۧۧ You can specify that the data will be separated by commas ( or whatever
characters you want ) with the “sep” argument:
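ۧۧ For example ( the file name here is hypothetical ), a tab-separated file can be read with:
read.csv("mydata.txt", sep = "\t")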
ۧۧ Once you’ve done all of this work, there are many ways to share it. The
easiest is to create an R Markdown document.
ۧۧ If you’re in RStudio, click on “File,” then “New File,” and then select “R
Markdown.” It will ask you for a title and to select your default output:
HTML, PDF, or Microsoft Word. Then, RStudio creates the R Markdown file
for you. This allows you to publish your work as books, websites, articles, and more.
ۧۧ One caution: avoid reusing the names of built-in functions for your own objects.
plot = hist(x)   # "plot" now also names the saved histogram object
plot(x)          # risky: the name "plot" is now ambiguous and easy to misuse
rm(plot)         # removes your object and defaults back to the base definition
SUGGESTED READING
Crawley, The R Book, “Writing R Functions,” section 2.15.
Phillips, “Writing Custom Functions,” https://rstudio-pubs-static.
s3.amazonaws.com/47500_f7c2ec48e68446f99bc04f935195f955.html.
PROBLEMS
1 Which of the following are good reasons to write custom functions in R?
a) You can address a specific analytical need that isn’t covered in a built-in R
function.
b) You can automate a procedure that you use repeatedly.
c) You can save time when analyzing a new dataset.
d) All of the above.
2 Create a custom function that takes in the numbers a and b and returns ( a + b )^2,
a^b, and the square root of the absolute value of ( a × b ).
Lecture 01
1 In R, you define a variable ( here called “x” ) to hold your data using the “c”
command:
> mean(x)
13.72222
> median(x)
14
> var(x)
3.649444
> sd(x)
1.910352
2 Only the median. If outliers are too important to drop from your data but you
still want to get an idea of the central tendency of your data, the median is an
appropriate statistic to use.
Lecture 02
1 e) Exploratory data analysis is one of the first things we do to get an idea of the
shape, spread, central tendency, and overall distribution of our data.
Although the scale ranges from 0 to 20, none of the painters were given a score
above 18. Composition and drawing scores peak at 15 but are relatively uniform
in distribution. The colour score has more frequency at 6, 8, 10, 16, and 17, while
the expression scores are more frequently below 9.
Lecture 03
1 No. The probability of heads on the second toss is 0.5 regardless of the outcome
of the first toss. However, if a particular coin consistently gives heads more than
tails, then it may become appropriate to include that prior information about
your particular coin and begin adjusting the probabilities for that particular
coin accordingly.
Lecture 04
1 c) 0.25. We can calculate this from the expected value. E( X ) = 𝑛𝑝; 5 = 20 × 𝑝; 𝑝
= 0.25.
2 a) X ~ Bin( 10, 0.2 )
b) P( X = 2 ) = C( 10, 2 )( 0.2 )^2( 0.8 )^8 ≈ 0.302
Lecture 05
1 d) All of the above.
Lecture 06
1 d) Correlation does not imply causation.
2 a)
> summary(cars)
speed dist
Min.: 4.0 Min.: 2.00
1st Qu.: 12.0 1st Qu.: 26.00
Median: 15.0 Median: 36.00
Mean: 15.4 Mean: 42.98
3rd Qu.: 19.0 3rd Qu.: 56.00
Max.: 25.0 Max.: 120.00
b)
> cor(cars$speed, cars$dist)
0.8068949
> cov(cars$speed, cars$dist)
109.9469
plot(cars$speed, cars$dist)
Lecture 07
1
> shapiro.test(cars$dist)
data: cars$dist
W = 0.95144, p-value = 0.0391
> shapiro.test(cars$speed)
data: cars$speed
W = 0.97765, p-value = 0.4576
From the results of the Shapiro-Wilk test, “distance” is not normally distributed
( 𝑝-value = 0.0391 ) while “speed” is normally distributed ( 𝑝-value = 0.4576 ).
It’s clearer from the histogram and density plot that “distance” is skewed, with
most values at the low end and a long right tail, departing from
normality. “Speed” is more symmetric, with less extreme data in the tails,
resulting in data that follows an underlying normal distribution.
2 a) 3325
b) 680^2/360
c) pnorm( 3623, mean=3325, sd=680 ) – pnorm( 2980, mean=3325, sd=680 ) =
0.3634385
Lecture 09
1 b) If the mean of a sample statistic is not equal to the population parameter,
then the sample statistic is called a biased estimator.
2 A good point estimate for the population mean is the sample mean. We can find
it in R using the following commands:
# Point Estimate
mean(cars$speed)
15.4
mean(cars$dist)
42.98
Lecture 10
1 c) Although a large sample size results in a more precise confidence interval,
it’s not true that the sample must be at least 10 to calculate the confidence
interval.
2 95% Confidence Interval for Speed
( 5.036217, 25.76378 )
mean(cars$speed)-1.96*sd(cars$speed)
5.036217
mean(cars$speed)+1.96*sd(cars$speed)
25.76378
mean(cars$dist)+qt(0.95,length(cars$dist))*sd(cars$dist)
86.16703
mean(cars$dist)-qt(0.95,length(cars$dist))*sd(cars$dist)
-0.2070292
mean(cars$speed)+1.65*sd(cars$speed)
24.12461
mean(cars$speed)-1.65*sd(cars$speed)
6.675387
mean(cars$dist)+qt(0.90,length(cars$dist))*sd(cars$dist)
76.44704
mean(cars$dist)-qt(0.90,length(cars$dist))*sd(cars$dist)
9.512957
Lecture 11
1 b) When the 𝑝-value is less than α, we have enough evidence against the null
hypothesis, resulting in statistically significant data.
Lecture 12
1 Based on the results, we are highly confident, at the 0.05 level, that the true
mean difference in gas mileage between regular and premium gas is between
1.727599 and 3.363310 miles per gallon.
> Regular = c(14, 16, 20, 20, 21, 21, 23, 24, 23, 22, 23,
22, 27, 25, 27, 28, 30, 29, 31, 30, 35, 34)
> Premium = c(16, 17, 19, 22, 24, 24, 25, 25, 26, 26, 24,
27, 26, 28, 32, 33, 33, 31, 35, 31, 37, 40)
> t.test(Premium,Regular, paired=TRUE)
Paired t-test
data: Premium and Regular
t = 6.4725, df = 21, p-value = 2.055e-06
alternative hypothesis: true difference in means is not
equal to 0
95 percent confidence interval:
1.727599 3.363310
sample estimates:
mean of the differences
2.545455
2 b) Paired tests are only done when you have 2 samples in which observations in one
sample can be paired with the observations in the other sample.
Lecture 13
1 b) The estimated change in average Y per unit change in X.
Lecture 14
1 b) No apparent relationship.
2 a) We can use the correlation coefficient to measure the strength of the linear
relationship between 2 numerical variables.
Lecture 15
1 a) The expected value of the error terms is assumed to be 0.
2 b) Residuals tell us how far off our actual Y values are from our predicted
regression line values.
Lecture 16
1 a) With only 2 sample means, we can use a t-test. ANOVA is used when we have
3 or more means.
2 d) We need the number of groups and the sample size to find the critical
F-value.
Lecture 17
1 a), b), c), and e) are all true; d) is not true. If we fail to include an important
covariate, our results are likely to be invalid.
Lecture 19
1
> summary(tree1)
Classification tree:
tree(formula = Kyphosis ~ Age, data = kyphosis)
Number of terminal nodes: 6
Residual mean deviance: 0.8445 = 63.34 / 75
Misclassification error rate: 0.2099 = 17 / 81
2
Classification tree:
tree(formula = Kyphosis ~ Age + Number + Start, data =
kyphosis)
Number of terminal nodes: 10
Residual mean deviance: 0.5809 = 41.24 / 71
Misclassification error rate: 0.1235 = 10 / 81
Our residual mean deviance decreases, and we only incorrectly classify 10 of the
81 values. Using additional variables ( “age” along with “number” and “start” )
gives us a better tree by improving the classification rate.
Lecture 20
1 c) In polynomial regression, different powers of X variables are added to an
equation to see whether they increase the R2 significantly.
Lecture 22
1 By including the letter for each month as a plotting symbol, we can see the
seasonality in the series. Higher beer sales tend to occur in the summer months
of May, June, July, and August, while lower sales occur in winter months of
November, December, and January. There’s also an upward trend from 1975 to
around 1982.
> summary(beer.model)
Call:
lm(formula = beersales ~ season(beersales))
Residuals:
Min 1Q Median 3Q Max
-3.5745 -0.4772 0.1759 0.7312 2.1023
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.48568 0.26392 47.309 < 2e-16 ***
season(beersales)February -0.14259 0.37324 -0.382 0.702879
season(beersales)March 2.08219 0.37324 5.579 8.77e-08 ***
season(beersales)April 2.39760 0.37324 6.424 1.15e-09 ***
season(beersales)May 3.59896 0.37324 9.643 < 2e-16 ***
season(beersales)June 3.84976 0.37324 10.314 < 2e-16 ***
season(beersales)July 3.76866 0.37324 10.097 < 2e-16 ***
season(beersales)August 3.60877 0.37324 9.669 < 2e-16 ***
season(beersales)September 1.57282 0.37324 4.214 3.96e-05 ***
season(beersales)October 1.25444 0.37324 3.361 0.000948 ***
season(beersales)November -0.04797 0.37324 -0.129 0.897881
season(beersales)December -0.42309 0.37324 -1.134 0.258487
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
lm(formula = beer.diff ~ season(beer.diff) + time(beer.diff))
Residuals:
Min 1Q Median 3Q Max
-2.23411 -0.54159 0.04528 0.48127 1.88851
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.4216296 23.0675027 0.062 0.950927
season(beer.diff)February -0.7342015 0.2651087 -2.769 0.006211 **
season(beer.diff)March 1.6332146 0.2650927 6.161 4.70e-09 ***
season(beer.diff)April -0.2761317 0.2650803 -1.042 0.298968
season(beer.diff)May 0.6098594 0.2650714 2.301 0.022566 *
season(beer.diff)June -0.3406745 0.2650661 -1.285 0.200377
season(beer.diff)July -0.6725333 0.2650643 -2.537 0.012031 *
season(beer.diff)August -0.7512797 0.2650661 -2.834 0.005123 **
season(beer.diff)September -2.6273198 0.2650714 -9.912 < 2e-16 ***
season(beer.diff)October -0.9097037 0.2650803 -3.432 0.000746 ***
season(beer.diff)November -1.8937063 0.2650927 -7.144 2.24e-11 ***
season(beer.diff)December -0.9663776 0.2651087 -3.645 0.000350 ***
time(beer.diff) -0.0004187 0.0116322 -0.036 0.971330
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
data: beer.model2$residuals
W = 0.99439, p-value = 0.6911
Lecture 23
1 a) A major difference between Bayesian and frequentist statistics is the use of
prior information.
and
c) In Bayesian statistics, the population parameters, such as the mean and
median, are assumed to be random variables.
Lecture 24
1 d) Custom functions help us do all of these and much more.
2 Your function may be labeled differently, but here’s an example of one that
works:
my.fun = function(a,b)
{
return(list((a+b)^2, a^b, sqrt(abs(a*b))))
}
Diez, David M., Christopher D. Barr, and Mine Çetinkaya-Rundel. OpenIntro Statistics.
https://www.openintro.org/stat/textbook.php?stat_book=os.
You can order a print copy (hardcover or paperback) for less than $20
(as of May 2017).
Cryer, Jonathan D., and Kung-Sik Chan. Time Series Analysis with
Applications in R. New York: Springer, 2010.
Faraway, Julian J. Linear Models with R. Boca Raton, FL: CRC Press, 2005.
———. Extending the Linear Model with R. Boca Raton, FL: CRC Press, 2016.
Phillips. “Writing Custom Functions.” https://rstudio-pubs-static.s3.amazonaws.com/47500_f7c2ec48e68446f99bc04f935195f955.html.
Yau, Chi. R Tutorial. http://www.r-tutor.com.