Learning Statistics
Concepts and Applications in R
Course Guidebook
Talithia Williams, Ph.D.
Associate Professor of Mathematics
Harvey Mudd College
and mathematics ) fields. Dr. Williams has made it her life’s work to get
people—students, parents, educators, and community members—more
excited about the possibilities inherent in a STEM education.
Dr. Williams develops statistical models that emphasize the spatial and
temporal structure of data and has partnered with the World Health
Organization in developing a model to predict the annual number of
cataract surgeries needed to eliminate blindness in Africa. Through her
research and work in the community at large, she is helping change the
collective mindset regarding STEM in general and math in particular—
rebranding the field of mathematics as anything but dry, technical, or
male-dominated, and instead as a logical, productive career path that is
crucial to the future of the country.
Dr. Williams is cohost of the PBS series NOVA Wonders, a 6-part series that
journeys to the frontiers of science, where researchers are tackling some of
the most intriguing questions about life and the cosmos. She has delivered
speeches tailored to a wide range of audiences within the educational field,
including speaking throughout the country about the value of statistics in
quantifying personal health information.
INTRODUCTION
Professor Biography
Course Scope
R and RStudio
LECTURE GUIDES
SUPPLEMENTARY MATERIAL
Solutions
Bibliography
We begin the course with a look at the descriptive properties of data and
learn exploratory visualization techniques using R. This helps us begin to
see the shape of data, find trends, and locate outliers. The field of statistics
is really a branch of mathematics that deals with analyzing and making
decisions based on data.
We learn to check for independence of events, and set up and work with
discrete random variables (lecture 4), including those that follow the
Bernoulli, binomial, geometric, and Poisson distributions. Probability
distributions allow us to see, graphically and by calculation, how likely
the possible values of our random variables are.
We apply the central limit theorem of statistics (lecture 8), which tells
us that as our sample size increases, the distribution of our sample means
approaches a normal distribution, no matter what distribution the data
originate from.
For data that has categorical predictors, such as gender, we turn to what is
called analysis of variance (ANOVA), which allows us to compare the means
of 2 or more groups in lecture 16, and multivariate analysis of variance (MANOVA)
and analysis of covariance (ANCOVA) in lecture 17. We also explore how
ANOVA can be used in statistical design of experiments (lecture 18), as
pioneered by the great statistician Sir Ronald Fisher.
ANOVA and linear regression depend on key assumptions that are often
not met, including linearity, independence, homogeneity, and constant
variance. So, in lectures 19 through 23, we consider how to do statistical
analysis when one or more of those assumptions do not hold. Regression
trees and classification trees (known more generally as decision trees)
don’t require assumptions such as linearity, are even easier to use than
linear regression, and work well even when some values are missing.
However, not all data have natural splits amenable to decision trees, so
we turn in lecture 20 to polynomial regression (adding nonlinear terms
to our linear model) and to step functions (which apply different models
over different ranges of the predictor).
Once you have installed R and RStudio, you can install additional packages
that are required for this course. The following instructions assume that
you are in the RStudio environment and know the package names needed.
1 In the RStudio console, at the prompt >, type the following command and press
the enter or return key to install a package. For example, let’s install the “swirl”
package.
> install.packages("swirl")
2 Then, R will fetch all the required package files from CRAN (the Comprehensive R
Archive Network) and install them for you. Once installed, load the package with
the library() function.
> library("swirl")
Unlike other packages, the “swirl” package will immediately begin interacting
with you, suggesting that you type the following to begin using a training
session in “swirl”:
> swirl()
3 Type in package names in the “Packages” field. Try typing “swirl” because this is
the first package that is recommended for you to use.
4 Click “Install” to let R install the package along with any other packages it
depends on. You’ll see the installation progress in the R console.
5 Once all the package files are downloaded and installed on your computer, you’ll
find the package name in the “Packages” pane (scroll through), or use the search
bar on the top-right side of the “Packages” panel. To load the package you just
installed, click on the checkbox.
ൖൖ graphics
ൖൖ stats
ൖൖ utils
If you don’t know package names, the best place to get an overview of the
best available packages is the “Task Views” section on the CRAN website,
available at https://cran.r-project.org/web/views/.
ൖൖ http://rprogramming.net/download-and-install-rstudio/.
ൖൖ RStudio: http://web.cs.ucla.edu/~gulzar/rstudio/index.html.
HOW TO SUMMARIZE
DATA WITH STATISTICS
To truly appreciate statistical information, we have
to understand the language ( and assumptions )
of statistics—and how to reason in the face of
uncertainty. In effect, we have to become masters at the
art of learning from data, which has 2 sides: accurately
describing and summarizing the data we have; and going
beyond the data we have, making inferences about data we
don’t have. Statistics is both descriptive and inferential.
WHAT IS STATISTICS?
ۧۧ Statistics is a branch of mathematics, but it’s also a science. It
involves the collection of data, analysis of data ( working with data ),
interpretation of data to reach conclusions, and presentation of data.
ۧۧ Quantitative data are always numbers. This type of data is often the
result of measuring a characteristic about a population ( e.g., height,
number of people living in your town, or percentage of registered voters ).
ۧۧ The purpose was to determine which feed ( if any ) led to the heaviest
chickens. In this example, weight is a continuous, quantitative variable
giving the chick weight, and feed is a categorical, qualitative variable
giving the feed type.
ۧۧ We denote the sample mean of a variable by placing a bar over it ( e.g., X̄ ).
ۧۧ The mean value, or average, tells us the center of the data. We find it
by adding all of the data points and dividing by the total number. The
following are the weights of chicks that were given a horsebean feed.
x = c(179, 160, 136, 227, 217, 168, 108, 124, 143, 140)
sum(x); sum(x)/10; mean(x)
[1] 1602
[1] 160.2
[1] 160.2
ۧۧ The median is another way of measuring the center of the data. Think
of the median as the middle value, although it doesn’t actually have to
be one of the observed values. To find the median, order the data and
locate a number that splits the data into 2 equal parts.
(143 + 160) / 2
[1] 151.5
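R’s built-in median( ) function returns the same value directly.
median(x)
[1] 151.5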
ۧۧ The median is a number that separates ordered data into halves. Half
the values are the same size or smaller than the median, and half the
values are the same size or larger than the median.
ۧۧ If our dataset instead had 11 values, then the median would be equal to
the number located at location 6 when the data is sorted.
x = c(179, 160, 136, 227, 217, 168, 108, 124, 143, 140,
500)
y = sort(x)
y
[1] 108 124 136 140 143 160 168 179 217 227 500
ۧۧ Now the median is 160. But notice that the mean changes to 191.1.
mean(x)
[1] 191.0909
ۧۧ The median is generally a better measure of the center when your data
has extreme values, or outliers. The median is not affected by extreme
values. So, if your mean is far away from the median, that’s a hint that
the median might be a better representative of your data.
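The table below can be produced with R’s summary( ) function applied to the chickwts dataset that ships with R.
library(datasets)
data(chickwts)
summary(chickwts)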
weight feed
Min. : 108.0 casein : 12
1st Qu. : 204.5 horsebean : 10
Median : 258.0 linseed : 12
Mean : 261.3 meatmeal : 11
3rd Qu. : 323.5 soybean : 14
Max. : 423.0 sunflower : 12
ۧۧ The summary output gives us the mean and median of the weight data,
along with minimum and maximum values and first and third quartile.
For feed, we get a summary of how many chicks are in each group.
x = c(179, 160, 136, 227, 217, 168, 108, 124, 143, 140)
179 - mean(x)
[1] 18.8
160 - mean(x)
[1] -0.2
130 - mean(x)
[1] -30.2
ۧۧ We could add all of the deviations, but we’d just get a sum of 0. We
could add the absolute values of all the deviations and average them to
get a mean absolute deviation. A closely related measure, the median
absolute deviation, is available in R with the command “mad( ).”
x = c(179, 160, 136, 227, 217, 168, 108, 124, 143, 140)
round(sum(x - mean(x)),10)
[1] 0
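A quick sketch of both spread measures in R; note that mad( ) returns the median absolute deviation scaled by 1.4826 by default, so the mean absolute deviation is computed directly instead.
mean(abs(x - mean(x)))    # mean absolute deviation
[1] 30.04
mad(x)                    # scaled median absolute deviation (R's default)
[1] 32.6172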
ۧۧ The variance is in squared units and doesn’t have the same units as
the data. We can get back to our original units by taking the square
root, giving what is called the standard deviation, which measures the
spread in the same units as the data.
x.bar = mean(x)
sum((x - x.bar)^2)/(length(x)-1)
[1] 1491.956
sqrt(sum((x - x.bar)^2)/(length(x)-1))
[1] 38.62584
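R’s built-in var( ) and sd( ) functions give the same results as the manual calculation.
var(x)
[1] 1491.956
sd(x)
[1] 38.62584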
STATISTICAL GRAPHS
ۧۧ But what if data are not evenly spread around the mean? That’s called
skewness. What can we do when data are highly skewed?
ۧۧ We can use the median. A common statistical graph for showing the
spread of data around the median is the box plot, which is a graphical
display of the concentration of the data, centered on the median.
ۧۧ Box plots show us the visual spread of the data values. They give us the
smallest value, the first quartile, the median, the third quartile, and
the largest value. Quartiles are numbers that separate the data into
quarters. Like the median, quartiles may be located on a data point or
between 2 data points.
ۧۧ To find the quartiles, we first find the median, which is the second quartile.
The first quartile is the middle value of the lower half of the data, and the
third quartile is the middle value of the upper half of the data.
sort(x)
[1] 108 124 136 140 143 160 168 179 217 227
ۧۧ The lower half of the data is 108 through 143. The middle value of the
lower half is 136. One-quarter of the values are ≤ 136, and 3/4 of the
values are > 136. The upper half of the data is 160 through 227.
ۧۧ The middle value of the upper half is 179, which represents the third
quartile, Q3 . Three-quarters of the values are < 179, and 1/4 of the values
are ≥ 179.
ۧۧ A box plot is a vertical rectangular box with 2 vertical whiskers that
extend from the ends of the box to the smallest and largest data values
that are not outliers. Outlier values, if any exist, are marked as points
above or below the endpoints of the whiskers.
ۧۧ The smallest and largest non-outlier data values label the endpoints of
the axis. The first quartile marks the lower end of the box, and the third
quartile marks the upper end of the box. The central 50% of the data
falls within the box.
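A minimal sketch of a box plot of the horsebean weights in R; the title is illustrative.
boxplot(x, main = "Chick Weights, Horsebean Feed")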
ۧۧ It’s possible to do some of the basic statistics that will be covered in this
course using spreadsheet software, such as Excel, but the best way to
learn R is to start with the basics, not wait until you get to something
your spreadsheet can’t handle. And an added bonus to beginning with R
for this course is that many of the datasets we use come bundled with R.
STATISTICAL ASSUMPTIONS
ۧۧ No matter what we do in statistics, it’s important to keep track of the
statistical assumptions underlying what we’re doing.
SUGGESTED READING
Diez, Barr, and Cetinkaya-Rundel, OpenIntro Statistics, “Introduction to Data,”
sections 1.1–1.8.
Yau, R Tutorial, “R Introduction,” http://www.r-tutor.com/r-introduction.
——— , R Tutorial, “Numerical Measures,” http://www.r-tutor.com/
elementary-statistics/numerical-measures.
PROBLEMS
1 Eight athletes competed in the 100-yard dash during a local high school
tournament, resulting in the following completion times: 13.8, 14.1, 15.7, 14.5,
13.3, 14.9, 15.1, 14.0. Calculate the mean, median, variance, and standard
deviation of the data.
a) mean
b) median
c) standard deviation
d) variance
EXPLORATORY DATA
VISUALIZATION IN R
This course uses a powerful computer programming
language known as R to help us analyze and
understand data. R is the leading tool for statistics,
data analysis, and machine learning. It is more than
a statistical package; it’s a programming language,
so you can create your own objects, functions, and
packages. There are more than 2000 cutting-edge,
user-contributed packages available online at the
Comprehensive R Archive Network ( CRAN ).
WHY DO WE USE R?
ۧۧ We use R for several reasons. It’s free, and it’s open source, meaning
that anyone can examine the source code to see exactly what it’s doing.
It explicitly documents the steps of your analysis. R makes it easy to
correct, reproduce, and update your work. You can use it anywhere, on
any operating system.
ۧۧ With R, everything is accomplished via code. You load your data into R
and explore and manipulate that data by running scripts. It’s easy to
reproduce your work on other datasets. Because all data manipulation
ۧۧ It’s easy to get help online; you can show exactly what you’re using and
ask very specific questions. In fact, most of the time when you get help
online, people will post the exact code that addresses your issue. Stack
Overflow ( http://stackoverflow.com/ ) is a community of roughly 7
million programmers helping each other.
ۧۧ You can load any data into R. It doesn’t matter where your data is or
what form it’s in. You can load CSV files. The first time, it’ll ask you to
install required packages. Just say yes.
#install.packages("readr")
#library(readr)
#shoes <- read_csv("C:/Users/tawilliams/Desktop/shoes.csv")
#View(shoes)
ۧۧ Open RStudio and locate the R Console window on the left ( or lower
left, if you have 4 panes ). Type immediately after the > prompt the
expression 3 + 5 and then hit the return key.
3+5
[1] 8
ۧۧ The prompt > indicates that the system is ready to receive commands.
Writing an expression, such as 5 + 5, and hitting the return key sends
the expression to be executed.
x = 3
x
[1] 3
y = 5
y
[1] 5
x+y
[1] 8
x * y
[1] 15
x / y
[1] 0.6
z = x / y
z
[1] 0.6
c(3,0,10,-4,0.5)
[1] 3.0 0.0 10.0 -4.0 0.5
ۧۧ For example, if we want to save the vector of data under the name
“widget,” then write the following expression at the prompt.
widget = c(3,0,10,-4,0.5)
widget
[1] 3.0 0.0 10.0 -4.0 0.5
widget + 2
[1] 5.0 2.0 12.0 -2.0 2.5
widget * widget
[1] 9.00 0.00 100.00 16.00 0.25
widget^2
[1] 9.00 0.00 100.00 16.00 0.25
PLOTTING IN R
ۧۧ You’ll usually want to save your work. To do that, we need to open a
script file. Go to File → New File → R Script. That opens a panel in the
upper left of your screen. In that script window, we can try the code
below to generate our first plot.
x = c(1,2,3,4,5)
y = c(1,8,27,64,125)
plot(x,y)
install.packages("datasets")
library(datasets)
data(faithful)
plot(faithful)
ۧۧ When a window pops up, type the name of the package in the space for
packages. In this case, type “datasets” and press Install. The package
will automatically update to your computer.
ۧۧ From your R script, type and highlight “library( datasets )” and run that
line of code by clicking the “Run” button to run your selected lines. This
loads the datasets library.
data(faithful)
plot(faithful)
HISTOGRAMS
ۧۧ A histogram is a plot that lets you discover and show the underlying
shape of a set of continuous data. You can also inspect the data for
outliers and overall spread.
ۧۧ To get the histogram, count the occurrence of each value of the variable
and plot the number for each count ( the frequency ) on the 𝑦-axis. The
values can be displayed as frequencies or percentages.
hist(faithful$waiting)
x = rnorm(30); qqnorm(x); qqline(x)    # normal Q-Q plot of 30 simulated values
PITFALL
ۧۧ Be careful not to overwrite built-in functions in R.
ൖൖ Don’t do this:
mean = (5+7)/2
mean
?mean    # open the help page for the mean( ) function
?c       # help for c( )
?t       # help for t( ), the built-in transpose function
PROBLEMS
1 Exploratory data analysis can be used to
library(MASS)
data("painters")
# Use the table function to create barplots
barplot(table(painters$Composition), main = "Composition Score")
barplot(table(painters$Drawing), main = "Drawing Score")
barplot(table(painters$Colour), main = "Colour Score")
barplot(table(painters$Expression), main = "Expression Score")
SAMPLING AND
PROBABILITY
Statistics sharpens our knowledge of how
randomness is all around us. Life is full of having
to make decisions under uncertainty. Two
fundamental ideas in statistics are uncertainty and
variation. Probability is the foundation that helps us
understand uncertainty and variation. This is why
probability plays a key role in statistics. Probability is
a mathematical language used to measure uncertain
events. Whenever we collect data or make measurements,
the process that we use is subject to variation, meaning
that if the same measurements were repeated, the
answer would be slightly different.
PROBABILITY
ۧۧ Data is the raw information from which statistics are created. Data
is being collected everywhere all the time. When we collect data, we
convert information to numbers. Statistical thinking gives you the tools
to intelligently extract information from data.
1 State the question that we’re interested in. Maybe we want to know
whether a new cold medicine will relieve coughing within 48 hours.
Whatever the question is, we need to be able to gather information
that will help us make a decision.
2 Collect data that helps answer the question. Suppose that you give
some samples of cold medicine to your coughing friends and record
how many of them stopped coughing within a 48-hour period.
You just introduced a bias into your data. It’s likely that your close
friends have similar characteristics as you do. They’re in your same
age bracket, live in the same city, or are the same gender as you. For
your results to be as widely applicable as possible, you have to collect
data in a way that is objective and rigorous. To remove bias entirely,
you have to collect a sample where every person has an equal chance
of being selected.
ۧۧ If you sample 10 people and they are all right-handed, you can’t conclude
that the probability of being right-handed is 100%. It takes a much
larger sample of people to accurately predict the proportion of people
that are truly right-handed.
ۧۧ What happens when we have real, but limited, data, for which we’d
like to calculate probabilities? Suppose that you want to understand
how a grocery store displays cereal boxes and whether different types
of breakfast cereals are targeted to adults or children. You go into your
local grocery store and notice that there are 6 rows of shelves and
that each shelf has 5 boxes of cereal. You can group the 6 shelves into
3 categories: the bottom 2 shelves, the middle 2 shelves, and the top
2 shelves.
ۧۧ It’s natural to think about how 2 events relate to each other. For
example, what’s the probability that the cereal is targeted at adults
and is located on the middle 2 shelves? In this case, we want to look
at our data to see where those 2 events occur together. Notice that 3
( not 2 ) types of breakfast cereal are located in the middle 2 shelves and
targeted at adults. So, the probability would be 3 out of the 30 cereals, or 1/10.
ۧۧ We call the probability of A and B the intersection, the place where the
2 events overlap.
ۧۧ What is the probability of C and B, where B is the event that the cereal is
located on the middle 2 shelves? Again, we’re looking at the overlap, or
intersection, of these 2 events.
ۧۧ From experiments, we can build a sample space, which is the set of all
possible outcomes of that experiment. When flipping a fair coin, the
sample space, S, would be equal to either H ( for heads ) or T ( for tails ):
S = {H, T}.
ۧۧ Likewise, if we are rolling a fair die twice, the sample space is all
possible combinations of those 2 rolls: {( 1, 1 ), ( 1, 2 ), …, ( 1, 6 ), …, ( 6, 1 ),
…, ( 6, 5 ), ( 6, 6 )}.
ۧۧ When flipping a fair coin, the probability that we flip a head is 1/2. We get
that by taking the number of events in flipping a head, which is H, or 1
event, out of the total number of possible events, which is 2.
ۧۧ Likewise, if we roll a fair die twice, the probability that the sum equals
4 is the number of outcomes giving that sum ( {( 1, 3 ), ( 2, 2 ), ( 3, 1 )} ) out of
the total 36 possibilities: 3 divided by 36.
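We can verify this by enumerating the sample space in R (a quick sketch).
rolls <- expand.grid(die1 = 1:6, die2 = 1:6)    # all 36 equally likely outcomes
mean(rolls$die1 + rolls$die2 == 4)              # proportion of outcomes with a sum of 4
[1] 0.08333333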
ۧۧ At first, this might not seem very useful. In fact, it seems rather circular
that we’ve just rewritten P( A and B ) and P( B and A ) and set them equal.
But if you divide both sides by P( B ), you’re left with a famous and useful
result that relates conditional probabilities known as Bayes’s rule.
ۧۧ The disease is rare and deadly and occurs in 1 out of every 10,000
people. Unfortunately, your test result is positive. What’s the chance
that you actually have the disease?
B = test is positive
ۧۧ So, the test is positive, and the test is accurate 98% of the time. However,
you have less than a 1% chance of having the disease.
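A sketch of the Bayes’s rule calculation in R, assuming the 98% accuracy applies both to people who have the disease and to people who don’t (sensitivity = specificity = 0.98).
prior <- 1 / 10000                                  # P(have the disease)
sens  <- 0.98                                       # P(test positive | disease), assumed
spec  <- 0.98                                       # P(test negative | no disease), assumed
p_pos <- sens * prior + (1 - spec) * (1 - prior)    # total probability of a positive test
sens * prior / p_pos                                # P(disease | positive), roughly 0.005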
PROBLEMS
1 On a single toss of a fair coin, the probability of heads is 0.5 and the probability
of tails is 0.5. If you toss a coin twice and get tails on the first toss, are you more
likely to get heads on the second toss?
2 Isabella runs a small jewelry store. Last week, she counted 143 people who
walked by her store. Of the 143 people, 79 of them came in. Of the 79 that came
in, 53 people bought something in the store.
a) What’s the probability that a person who walks by the store will buy
something?
b) What’s the probability that a person who walks in the store will buy
something?
c) What’s the probability that a person who walks in the store will buy
nothing?
d) What’s the probability that a person who walks by the store will come
in and buy something?
DISCRETE DISTRIBUTIONS
Random variables are used to model situations in
which the outcome, before the fact, is uncertain.
In other words, a random variable is a real
number whose value is based on the random outcome
of an experiment. A list of all possible outcomes for a
given random variable is called a sample space. This
space includes the outcome that eventually did take
place but also all other outcomes that could have
taken place but never did. The idea of a sample space
puts the outcome that did happen in a larger context
of all possible outcomes. A random variable can be
either discrete or continuous. A discrete random
variable takes on discrete, or countable, values.
DISCRETE DISTRIBUTIONS
ۧۧ Certain discrete distributions appear frequently in real life and have
special names.
ൖൖ For example, the number of times that heads might appear out of 10
coin flips follows a binomial distribution.
ൖൖ There’s also a limiting case of the binomial, where each actual event
is rare, almost like the number of times the coin lands on neither
heads nor tails. This is called the Poisson distribution, and it’s always
about an unusual outcome—for example, the number of defects on a
semiconductor chip.
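R has built-in functions for these distributions; for example (a sketch, with illustrative numbers):
dbinom(6, size = 10, prob = 0.5)    # P(exactly 6 heads in 10 fair coin flips)
dpois(2, lambda = 0.5)              # P(exactly 2 defects), if defects average 0.5 per chip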
ۧۧ Suppose that you are paid $1 for each head that appears in a coin-
flipping experiment. Up to how much should you be willing to pay
to play this game if you plan to play it only once? To help you decide,
imagine that you can play the game a large number of times and observe
how much you win on average.
ۧۧ You should be willing to pay up to $1.50 to play this game to come out
ahead, on average. This is the idea of expected value.
ۧۧ Suppose that if X heads come up, you win $X². Now how much should
you be willing to pay?
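A simulation sketch, assuming the game consists of 3 fair coin flips (which makes the average winnings $1.50).
set.seed(1)
heads <- rbinom(100000, size = 3, prob = 0.5)    # heads in each simulated 3-flip game
mean(heads)      # average winnings in dollars, close to 1.5
mean(heads^2)    # average winnings if the payout is $X^2 instead, close to 3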
ۧۧ This is a valid PMF because it sums to 1 over all of the possible values
of X.
ۧۧ For example, if you have 4 items and you want to know how many ways
you can pick 2 items out of those 4, you can plug it into this formula to
get 6 ways.
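In R, the choose( ) function computes this directly.
choose(4, 2)
[1] 6
factorial(4) / (factorial(2) * factorial(2))    # the same, from the formula n!/(k!(n-k)!)
[1] 6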
ۧۧ What is the expected value? If you flip a coin 𝑛 times and each time has
a probability 𝑝 of yielding heads, on average how many heads do you
expect to get?
ۧۧ Suppose that we again have a series of Bernoulli trials. Let’s define the
random variable X as the number of trials until r successes occur. Then,
X is a negative binomial random variable with parameters 0 < 𝑝 < 1 and
r = 1, 2, 3, … .
ۧۧ The time you need to wait for the emission of 10 alpha particles might
be a sum of exponential distributions known as the gamma distribution.
ۧۧ The variance of the negative binomial is Var( X ) = r( 1 − 𝑝 )/𝑝².
ۧۧ Because X is uniform, f( 𝑥 ) = 𝑐 for some constant 𝑐, and we need the area
under the curve to equal 1.
ۧۧ Therefore, for a uniform distribution on the interval [ a, b ]:
ൖൖ 𝑐 = 1/( b − a )
ൖൖ E( X ) = ( a + b )/2
ൖൖ Var( X ) = ( b − a )²/12
ۧۧ For example, if the outcomes of interest are “has cancer” and “does not
have cancer,” the probabilities of having cancer are ( in most cases )
much less than 1/2. The number of possible outcomes in an experiment
doesn’t necessarily say anything about the probability of the outcomes.
SUGGESTED READING
Diez, Barr, and Cetinkaya-Rundel, OpenIntro Statistics, “Probability,” section
2.5, and “Distributions of Random Variables,” sections 3.3–3.5.
Yau, R Tutorial, “Probability Distributions,” http://www.r-tutor.com/
elementary-statistics/probability-distributions.
PROBLEMS
1 If X has a binomial distribution with 𝑛 = 20 trials and a mean of 5, then the
success probability 𝑝 is:
a) 0.10.
b) 0.20.
c) 0.25.
d) Need to first take a sample.
2 Suppose that each ticket purchased in the local lottery has a 20% chance of
winning. Let X equal the number of winning tickets out of 10 that are purchased.
CONTINUOUS AND
NORMAL DISTRIBUTIONS
The normal distribution is one of the most common and
widely used distributions in statistics. Normal
distributions come in many means and standard
deviations, but they all have a signature shape, where
the data values fall into a smooth, bell-shaped curve.
The data are concentrated in the center, but some of
them are more spread out than others. The spread of the
distribution is determined by the standard deviation.
NORMAL DISTRIBUTION
ۧۧ Every normal distribution has certain properties that distinctly
characterize it.
1 The shape is symmetric, meaning that if you were to cut the distribution
in half, the left side would be a mirror image of the right side.
3 The mean, median, and mode are all the same, and we can find them
directly in the center of the distribution.
ۧۧ So, rather than directly solving a problem where X ~ N( μ, σ ), we use an
indirect approach.
Z = ( X − μ )/σ ~ N( 0, 1 )
lower.tail = FALSE means return the probability contained in the upper tail,
i.e. P( X ≥ 𝑎 ).
pnorm(2) - pnorm(-2)
= 0.9544997
pnorm(3) - pnorm(-3)
= 0.9973002
ۧۧ Suppose that X ~ N( 13, 4 ). What’s the probability that X falls within ±1
standard deviation of its mean?
pnorm(1) - pnorm(-1)
= 0.6826895
ۧۧ To find the 90th percentile of X ~ N( 10, 5 ), we seek the value 𝑎 such that
P( X ≤ 𝑎 ) = 0.90.
qnorm(0.90)
= 1.2815516
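We can also ask qnorm( ) for the percentile of the unstandardized distribution directly.
qnorm(0.90, mean = 10, sd = 5)
= 16.40776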
ۧۧ This person will need to consult a doctor if his or her cholesterol level
is > 158.
ۧۧ So, the probability that a randomly selected person will need to consult
a doctor is approximately 10%.
ۧۧ What’s the cholesterol level below which 95% of this population lies?
1 Solve directly:
qnorm(0.95)
= 1.6448536.
SUGGESTED READING
Diez, Barr, and Cetinkaya-Rundel, OpenIntro Statistics, “Distributions of
Random Variables,” sections 3.1–3.2.
Yau, R Tutorial, “Probability Distributions,” http://www.r-tutor.com/
elementary-statistics/probability-distributions.
PROBLEMS
1 A normal density curve has which of the following properties?
a) It is symmetric.
b) The peak of the distribution is centered above its mean.
c) The spread of the curve is proportional to the standard deviation.
d) All of the above.
COVARIANCE AND
CORRELATION
If you’re new to statistics, you may be ready to jump
on the cause-and-effect bandwagon when you find
a strong relationship between 2 variables. But have
you ever thought about why 2 variables might be
correlated? So far, when we’ve considered variance,
we’ve limited ourselves to 1 variable. But what if
we have 2 variables that we think might be related?
How might they vary together? This brings us to the
idea of covariance, and from there to correlation.
COVARIANCE
ۧۧ Suppose that you poll a statistics class and ask them the total number of
hours they spent studying for their last exam and collect the following
data.
Hours Studied
X = {2, 3, 5, 6, 8, 9, 10, 13}
ۧۧ You want to see if studying has any relationship to their actual test
scores.
Test Scores
Y = {58, 75, 71, 77, 80, 88, 83, 95}
x = c(2,3,5,6,8,9,10,13)
y = c(58,75,71,77,80,88,83,95)
plot(x,y,main = "Hours Spent Studying vs. Test Score",
xlab = "Hours Spent Studying",
ylab = "Test Score",pch=20)
ۧۧ Notice that there’s variability along the 𝑥-axis and variability along the
𝑦-axis.
x = c(2,3,5,6,8,9,10,13)
y = c(58,75,71,77,80,88,83,95)
#First Deviation
(2 - 7) * (58 - 78.4)
[1] 102
-5 * -20.4
[1] 102
x = c(2,3,5,6,8,9,10,13)
y = c(58,75,71,77,80,88,83,95)
#Second Deviation
(3 - 7) * (75 - 78.4)
[1] 13.6
-4 * -3.4
[1] 13.6
x = c(2,3,5,6,8,9,10,13)
y = c(58,75,71,77,80,88,83,95)
#Last (8th) Deviation
(13 - 7) * (95 - 78.4)
[1] 99.6
6 * 16.6
[1] 99.6
x = c(2,3,5,6,8,9,10,13)
y = c(58,75,71,77,80,88,83,95)
cov(x,y)
[1] 38
ۧۧ In this case, the covariance of X and Y is 38. But what does that mean? Is
that a large covariance or a small covariance?
ۧۧ The problem is that the covariance can take on values of any size. One
person might have a dataset with a covariance of 500 and another might
have a dataset with a covariance of 5. Unless their data is measured in the
exact same units, they can’t even compare those 2 numbers.
ۧۧ The problem with covariance is that it can’t tell us how strong the relationship
is between X and Y. We need to go one step further.
CORRELATION
ۧۧ If we take the covariance and divide through by the product of the 2
standard deviations, then magic begins to happen. What we’ve done is
scale it to a dimensionless measure, meaning that it has no units attached
to it. It’s called the correlation coefficient, and it’s a popular way to measure
the strength of a linear relationship between 2 random variables.
ۧۧ This is a very strong positive relationship, as you can see from the
original scatterplot.
ۧۧ In R:
x = c(2,3,5,6,8,9,10,13)
y = c(58,75,71,77,80,88,83,95)
cov(x,y) / (sd(x)*sd(y))
[1] 0.9173286
cor(x,y)
[1] 0.9173286
ۧۧ In fact, the correlation and covariance will always have the same sign—
either both positive or both negative.
ۧۧ Let’s look at our Old Faithful dataset, which compares the waiting time
to the length of eruptions of the Old Faithful geyser. In R, we’re able to
calculate the correlation for an entire dataset using the “cor” function.
data(faithful)
round(cor(faithful),4)
ۧۧ R returns a 2-by-2 matrix. We call this the correlation matrix. In the first
column, “eruptions” is perfectly correlated with itself and is also highly
correlated with “waiting,” at a value of 0.9008. The second column
likewise gives the correlation between waiting and eruptions and the
correlation between waiting and itself, equal to 1.
eruptions waiting
eruptions 1.0000 0.9008
waiting 0.9008 1.0000
library(datasets)
data("Harman23.cor")
round(Harman23.cor$cov,2)
height arm.span forearm lower.leg weight
height 1.00 0.85 0.80 0.86 0.47
arm.span 0.85 1.00 0.88 0.83 0.38
forearm 0.80 0.88 1.00 0.80 0.38
lower.leg 0.86 0.83 0.80 1.00 0.44
weight 0.47 0.38 0.38 0.44 1.00
bitro.diameter 0.40 0.33 0.32 0.33 0.76
chest.girth 0.30 0.28 0.24 0.33 0.73
chest.width 0.38 0.42 0.34 0.36 0.63
ۧۧ Notice that along the diagonal, the values all equal 1. This is because
each variable is perfectly correlated with itself.
ۧۧ Find some of the higher correlations. Height and lower leg have a
correlation of 0.86. This makes sense, because if a person is tall, that
person is likely to have long legs. Arm span and forearm have a correlation
of 0.88, which is also logical because the forearm is included in arm span.
library(car)
data("Salaries")
head(Salaries)
ۧۧ The Salaries dataset has the 2008 to 2009 9-month academic salary
for assistant professors, associate professors, and full professors in a
particular college in the United States.
data("Salaries")
head(Salaries)
rank discipline yrs.since.phd yrs.service sex salary
1 Prof B 19 18 Male 139750
2 Prof B 20 16 Male 173200
3 AsstProf B 4 3 Male 79750
4 Prof B 45 39 Male 115000
5 Prof B 40 41 Male 141500
6 AssocProf B 6 6 Male 97000
cor(Salaries$yrs.since.phd, Salaries$yrs.service)
[1] 0.9096491
cor(Salaries[,c(3,4,6)])
yrs.since.phd yrs.service salary
yrs.since.phd 1.0000000 0.9096491 0.4192311
yrs.service 0.9096491 1.0000000 0.3347447
salary 0.4192311 0.3347447 1.0000000
PITFALLS
ۧۧ The correlation coefficient r looks for a linear relationship and assumes
that the 2 variables are normally distributed. If you suspect a nonlinear
relationship, consider transforming the data, for example by taking the log or
raising it to a power.
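A sketch of the idea on simulated data; the data-generating model here is hypothetical, chosen to have a curved (exponential) trend.
set.seed(1)
x <- runif(100, 1, 10)
y <- exp(x + rnorm(100, sd = 0.5))    # hypothetical data with an exponential trend
cor(x, y)          # weaker on the raw scale
cor(x, log(y))     # much stronger after a log transform of y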
ۧۧ Correlation does not imply causality. Just because X and Y are correlated
does not mean that X causes Y. They could both be caused by some other
factor Z, or Y might cause X instead.
PROBLEMS
1 Suppose that you find a correlation of 0.65 between an individual’s income
and the number of years of college that individual has completed. Which of the
following 4 statements can we conclude?
2 The “cars” dataset in R contains the speed of cars and the distances taken to
stop in the 1920s.
library(datasets)
data("cars")
summary(cars)
cor(cars$speed, cars$dist)
cov(cars$speed, cars$dist)
VALIDATING STATISTICAL
ASSUMPTIONS
Statistical graphs are useful in helping us visualize
data. Through graphs, we understand data
properties, such as the mean, median, and standard
deviation; find patterns in data, such as clustering and
correlation; suggest an underlying model that could have
generated the data; verify our assumptions; check and
fortify our analysis; summarize; and communicate results.
This lecture will define and identify basic summaries
of data, both numerical and graphical, and use R for
calculating descriptive statistics, making graphs, and
even writing functions that work on multiple datasets.
IRIS DATA
ۧۧ The “iris” dataset is widely used throughout statistical science for
illustrating various problems in statistical graphics, multivariate
statistics, and machine learning. It’s a small but nontrivial dataset. The
data values are real ( as opposed to simulated ) and are of high quality
( collected with minimal error ). The data were used by the celebrated
British statistician Ronald Fisher in 1936.
library(datasets)
library(RColorBrewer)
attach(iris)
ۧۧ The iris species are so similar that they are difficult to separate visually.
So, American botanist Edgar Anderson gathered the data we now have
to look for statistical differences that might help identify each species.
BAR PLOTS
ۧۧ Bar plots are useful for showing comparisons across several groups.
Although it looks like a histogram, a bar plot is plotted over a label that
represents a category ( e.g., iris type ).
ۧۧ One difference you might notice is that the bars of a bar plot are
separated with spaces in between, while in a histogram, the values are
plotted right next to one another, with no space in between.
BOX PLOTS
ۧۧ The summary function is a quick and easy way to assess the statistical
properties of each attribute. These values are displayed graphically in
a box plot.
summary(iris[,1: 2])
Sepal.Length Sepal.Width
Min.: 4.300 Min.: 2.000
1st Qu.: 5.100 1st Qu.: 2.800
Median: 5.800 Median: 3.000
Mean: 5.843 Mean: 3.057
3rd Qu.: 6.400 3rd Qu.: 3.300
Max.: 7.900 Max.: 4.400
ۧۧ Box plots are used to compactly show many pieces of information about
a variable’s distribution. They are great for visualizing the spread of the
data. Box plots show 5 statistically important numbers: the minimum,
the 25th percentile, the median, the 75th percentile, and the maximum.
ۧۧ A box plot can also be used to show how one attribute, such as petal
length, varies with another attribute, such as iris type.
ۧۧ A color palette is a group of colors that is used to make the graph more
appealing and help create visual distinctions in the data.
boxplot(iris$Sepal.Length ~ iris$Species, col = heat.colors(3),
main = "Sepal Length vs. Species")
SCATTERPLOTS
ۧۧ Scatterplots are helpful for visualizing data and simple data inspection.
Let’s try the following code.
ۧۧ Scatterplots are used to plot 2 variables against each other. We can add
a third dimension by coloring the data values according to their species.
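A sketch of the kind of scatterplot described above, with points colored by species; the choice of variables is illustrative.
plot(iris$Sepal.Length, iris$Petal.Length,
     col = iris$Species, pch = 20,
     xlab = "Sepal Length", ylab = "Petal Length",
     main = "Iris Measurements by Species")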
ۧۧ For datasets with only a few attributes, we can construct and view
all the pairwise scatterplots. In the first row, all the 𝑦-values are
represented by Sepal.Length on the 𝑦-axis. In the first column, all the
𝑥-axis values are represented by Sepal.Length on the 𝑥-axis.
ۧۧ Likewise, in the second row, all the 𝑦-values are represented by Sepal.
Width on the 𝑦-axis. In the second column, all the 𝑥-axis values are
represented by Sepal.Width on the 𝑥-axis.
ۧۧ Because the upper and lower graphs are duplicates of each other, let’s
change our code to show the correlation between our variables in the
upper level.
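One way to do this (a sketch): pairs( ) accepts a custom upper.panel function, which we can use to print the correlation in each upper panel.
panel.cor <- function(x, y, ...) {
  par(usr = c(0, 1, 0, 1))                       # use a simple 0-1 coordinate system
  text(0.5, 0.5, format(cor(x, y), digits = 2))  # print the correlation in the panel
}
pairs(iris[, 1:4], col = iris$Species, upper.panel = panel.cor)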
ۧۧ R has a package called ggplot2 that allows you to produce visually
appealing figures. It is used for making quick, professional-looking
plots with minimal code.
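A minimal ggplot2 sketch; the package must be installed first, and the variables shown are illustrative.
# install.packages("ggplot2")    # if not already installed
library(ggplot2)
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point()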
ۧۧ Let’s create some histograms of our iris data. The number of bins in the
histogram is variable.
hist(iris$Petal.Width, breaks=13)
hist(iris$Petal.Width, breaks=25)
dens.pw = density(iris$Petal.Width)
plot(dens.pw, ylab = "Frequency", xlab = "Width", main = "Petal Width Density")
CONTOUR PLOTS
ۧۧ Density estimation is available for higher-dimensional data using
contour plots. A contour plot is a graph that explores the potential
relationship among 3 variables.
ۧۧ The plot may also be viewed as a heat map, with brighter colors denoting
more values in those regions.
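A sketch of a 2-dimensional density estimate of petal length versus petal width, shown both as a contour plot and as a heat map; this uses kde2d( ) from the MASS package.
library(MASS)                                                  # for kde2d()
dens2d <- kde2d(iris$Petal.Length, iris$Petal.Width, n = 50)   # 2-D kernel density estimate
contour(dens2d, xlab = "Petal Length", ylab = "Petal Width")   # contour view
image(dens2d, xlab = "Petal Length", ylab = "Petal Width")     # heat map view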
qqnorm(quantile.virginica, main="Virginica")
qqline(quantile.virginica)
qqnorm(quantile.versicolor, main="Versicolor")
qqline(quantile.versicolor)
shapiro.test(quantile.setosa)
data: quantile.setosa
W = 0.96247, p-value = 0.4658
shapiro.test(quantile.versicolor)
data: quantile.versicolor
W = 0.96319, p-value = 0.4815
shapiro.test(quantile.virginica)
data: quantile.virginica
W = 0.97161, p-value = 0.6861
PITFALLS
ۧۧ With histograms, varying the bin width can be helpful, but it can also
be problematic.
ۧۧ Let’s do a histogram for Petal.Width, and let’s only give it 3 bins, or set
the break sequence to 3. Here’s what that histogram looks like.
ۧۧ This histogram doesn’t really give us information about the shape of the
data. It has poor bin width. One solution is to overlay the density plot.
The density is sort of like a smooth version of the histogram. This is one
way that we can tell if our histogram is accurately picking up the shape
of the spread of our underlying data.
ۧۧ But what happens when we overlay the histogram with the density
function? The density function in the second bin is showing us that
maybe there’s some stuff that’s not quite coming up in our graph.
hist(iris$Petal.Length, prob=TRUE)    # histogram alone, on a density scale
hist(iris$Petal.Length, prob=TRUE)    # redraw the histogram . . .
lines(density(iris$Petal.Length))     # . . . and overlay the density estimate
ۧۧ Overall, graphical data analysis has become a major way to avoid many
pitfalls in statistics. Once upon a time, graphical data analysis was
rather challenging ( or at least time consuming ) to do, but that’s all
changed. You should always take advantage of how easy it has become
to display your data and begin your analysis in a very visual way.
SUGGESTED READING
Diez, Barr, and Cetinkaya-Rundel, OpenIntro Statistics, “Evaluating the Normal
Approximation,” section 3.2.
PROBLEMS
1 The “cars” dataset in R contains the speed of cars and the distances taken to
stop in the 1920s.
a) Use the following code to load the data and graph the quantile-quantile
(Q-Q) plot for the variables “distance” and “speed.” Comment on whether
the true underlying distribution appears to be normally distributed.
b) Use the Shapiro-Wilk test to determine whether the data are normally
distributed. (Recall that, typically, if the p-value is greater than 0.05, then
the data are normally distributed.)
shapiro.test(cars$dist)
shapiro.test(cars$speed)
Statisticians are often called on to do consulting,
whether for individuals, companies, or nonprofits.
This lecture focuses on a consulting example
involving a large metropolitan city that is considering
building a new hospital to meet the needs of residents.
They want to survey the population to better understand
the typical emergency room ( ER ) demand so that
they can plan an appropriate number of beds.
SAMPLE MEANS
ۧۧ One thousand residents are sampled and their numbers of visits to the
ER are recorded. The number of times a person visited the ER ranges
from 0 to 59 times in a year. The vector “counts” gives how many residents
reported each number of visits, and the vector “visits” repeats each visit
count once for each corresponding resident.
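A toy sketch of how such a “visits” vector can be built with rep( ); the values below are hypothetical, not the survey data.
times  <- c(0, 1, 2, 3)       # hypothetical distinct numbers of ER visits
counts <- c(4, 3, 2, 1)       # hypothetical number of residents reporting each
visits <- rep(times, counts)  # one entry per sampled resident
visits
[1] 0 0 0 0 1 1 1 2 2 3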
mean.age<-function(n) {
trials = 1000
my.samples <- matrix(sample(visits, size=n*trials,
replace=TRUE), trials)
means <- apply(my.samples, 1, mean)
means
}
ۧۧ Let X̄𝑛 = ( X1 + ⋯ + X𝑛 )/𝑛 be the sample mean of the visits of 𝑛 people drawn
at random with replacement from our dataset. For each of the 1000 trials,
X̄𝑛 is computed. The 1000 sample means are returned in a vector, “means.”
par(mfrow=c(1,2))
hist(mean.age(1), main="Mean of 1 Visit", xlab="Number of Visits")
hist(visits, main="ER Visits Data", xlab="Number of Visits")
par(mfrow=c(2,3))
MA1<-mean.age(1)
MA2<-mean.age(2)
MA10<-mean.age(10)
MA20<-mean.age(20)
MA100<-mean.age(100)
MA200<-mean.age(200)
hist(MA1, xlim=c(0,60))
hist(MA2, xlim=c(0,60))
hist(MA10, xlim=c(0,60))
hist(MA20, xlim=c(0,60))
hist(MA100, xlim=c(0,60))
hist(MA200, xlim=c(0,60))
vars
n variance
1 1 95.5395355
2 2 46.8052412
3 10 8.9583081
4 20 4.4299850
5 100 0.9604743
6 200 0.4707666
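One way such a table could be built, using the mean.age( ) function defined above (a sketch; exact values vary with the random draws).
sizes <- c(1, 2, 10, 20, 100, 200)
vars  <- data.frame(n = sizes,
                    variance = sapply(sizes, function(n) var(mean.age(n))))
vars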
ۧۧ Plot the variance of each sample mean versus the sample size.
plot(vars$n, vars$variance)
QUANTILE-QUANTILE PLOTS
ۧۧ We can test how close the distribution of X̄𝑛 is to the normal distribution
by examining quantile-quantile ( Q-Q ) plots. In a Q-Q plot, quantiles of
the sample are plotted against quantiles of a proposed distribution,
also known as theoretical quantiles.
par(mfrow=c(2,3))
qqnorm(MA1)
qqnorm(MA2)
qqnorm(MA10)
qqnorm(MA20)
qqnorm(MA100)
qqnorm(MA200)
par(mfrow=c(2,3))
shapiro.test(MA1)
shapiro.test(MA2)
shapiro.test(MA10)
shapiro.test(MA20)
shapiro.test(MA100)
shapiro.test(MA200)
data: MA1
W = 0.92764, p-value < 2.2e-16
data: MA2
W = 0.96688, p-value = 2.603e-14
data: MA10
W = 0.9976, p-value = 0.153
data: MA20
W = 0.99629, p-value = 0.01751
data: MA100
W = 0.99821, p-value = 0.3811
data: MA200
W = 0.99833, p-value = 0.4477
SAMPLING DISTRIBUTIONS
ۧۧ A statistic is a value computed from data. In our example, our statistic is
the average number of ER visits.
ۧۧ We’ve seen empirically ( i.e., from data ) that the sampling distribution
of the mean approaches a normal distribution for the ER visits data.
Does this happen in general, for any distribution of data, or just for data
similar to the ER data?
ۧۧ The mean tells us the center of that distribution. The standard deviation
tells us the spread. The central limit theorem tells us that, no matter
what the population distribution looks like, the distribution of the
sample means will approach a normal distribution.
ۧۧ Let X equal the number of home team fans in attendance. This takes us
back to the binomial distribution, because X ~ Bin( 𝑛 = 3000, 𝑝 = 0.60 ).
ۧۧ We wouldn’t want to solve this using the binomial distribution. It’s too
tedious of a calculation. But what we can do is approximate the binomial
with the normal distribution.
pnorm(1.8820239, lower.tail=FALSE)
= 0.0299164
ۧۧ So, there’s a 2.99% chance that we have more than 1850 fans in
attendance.
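A sketch of where the z value above comes from, using the normal approximation to the binomial with a continuity correction of 0.5.
n <- 3000
p <- 0.60
mu <- n * p                       # 1800 expected home team fans
sigma <- sqrt(n * p * (1 - p))    # about 26.83
z <- (1850.5 - mu) / sigma        # about 1.882, with the continuity correction
pnorm(z, lower.tail = FALSE)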
ۧۧ How far off are we from the binomial calculation? We can use R to do
the exact calculation for the binomial, and it’s called the pbinom.
1 - pbinom(1850,3000,.60)
= 0.029692
ۧۧ The value that we get is really close to what we got in our simpler
approximation with the normal distribution.
pnorm(1.86339, lower.tail=FALSE)
= 0.0312037
ۧۧ Notice that we’re slightly off. The actual binomial value was 0.029692.
ۧۧ Our normal approximation with the correction was the closer value
of 0.0299164, and if we did that normal approximation without the
correction, we would have been even further away from our binomial,
at 0.0312037.
PITFALLS
ۧۧ Often when statisticians consult with clients designing an experiment,
one of the clients’ top priorities is keeping costs down, which translates
to them wanting to take fewer samples and magically invoke the power of
the central limit theorem.
ۧۧ Unfortunately, it doesn’t work that way. Unless you know that your
true population is normally distributed, you need at least 30 or 40
samples before the central limit theorem kicks in.
PROBLEMS
1 If the central limit theorem is applicable, this means that the sampling
distribution of a population can be treated as normal because
the is .
2 Suppose that X equals the birth weight in grams of babies born in Yugoslavia. Let
E( X ) = 3325 and var( X ) = 6802. Let X̄ = the sample mean of a random sample of
size 𝑛 = 360 babies.
Descriptive statistics focus on summarizing
characteristics of our data, such as
calculating the sample mean or plotting
histograms, and describe the dataset that’s being
analyzed but don’t let us draw any conclusions or
make any inferences about the data. On the other
hand, statistical inference uses our dataset to extract
information about populations or answer real-world
questions with a stated level of confidence. It builds on the methods of
descriptive statistics, letting us draw conclusions
about the population based on data from a sample.
POINT ESTIMATES
ۧۧ Any time we have a random sample, we can calculate the sample mean,
variance, and standard deviation. These values are called point estimates.
library(datasets)
data("Orange")
ۧۧ We see the means from our box plot, but we can pull them out directly
using the mean function.
mean(Orange$age)
[1] 922.1429
mean(Orange$circumference)
[1] 115.8571
ۧۧ A point estimate for the average orange tree age would be 922.14
days, and a point estimate for orange circumference would be 115.85
millimeters.
median(Orange$age)
[1] 1004
median(Orange$circumference)
[1] 115
ۧۧ Another point estimate for the average orange tree age would be 1004,
and another point estimate for orange circumference would be 115.
ESTIMATION
ۧۧ The objective of estimation is to approximate the value of a population
parameter on the basis of a sample statistic. We can estimate both
points and intervals.
ۧۧ The most common point estimate is the sample mean X̄, which is used to
estimate the population mean μ. A point estimator gives us a particular
value to estimate our parameter with, whereas a confidence interval
gives us a range of possible estimator values.
ۧۧ Note that in inferential statistics, you’ll often see Greek letters for the
population parameters, with the corresponding sample statistic wearing a
hat on top. So, if you see μ̂ or σ̂, this refers to the point estimate computed
from the sample.
data("women")
attach(women)
summary(women)
data("women")
attach(women)
head(women)
height weight
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
6 63 129
plot(height,weight)
error of estimation = θ̂ − θ
ۧۧ Ideally, an estimator should have low variability ( to be precise ) and low
bias ( to be accurate ).
ۧۧ Other estimates exist, such as the range, or the average of the largest
and smallest values. But these estimates are not unbiased and should
not be used to estimate the mean.
CONSISTENCY
ۧۧ An unbiased estimator is consistent if the variance of the estimator
approaches 0 as our sample size ( 𝑛 ) approaches infinity. In other words,
the difference between the estimator and the population parameter
becomes smaller as the sample size increases.
ۧۧ For example, the sample mean has a variance of s2/𝑛. This variance goes
to 0 as the sample size increases.
ۧۧ For example, for a normal population, both the sample mean and median
are unbiased estimators, but the sample median has more variability
than the sample mean for a fixed sample size. Let’s look at this in R.
set.seed(1234)
x = cbind(rnorm(100,0,1),rnorm(100,0,1),rnorm(100,0,1),
rnorm(100,0,1),rnorm(100,0,1),rnorm(100,0,1), rnorm(100,
0,1),rnorm(100,0,1),rnorm(100,0,1), rnorm(100,0,1))
apply(x,2,'mean')
apply(x,2,'median')
[1] -0.157 0.041 0.155 -0.008 -0.022 -0.137 -0.088 -0.001
0.018 -0.068
[1] -0.385 0.033 0.278 -0.043 -0.009 -0.067 -0.050 -0.104
-0.052 -0.035
round(mean(apply(x,2,'mean')),3)
[1] -0.027
round(mean(apply(x,2,'median')),3)
[1] -0.043
round(var(apply(x,2,'mean')),3)
[1] 0.008
round(var(apply(x,2,'median')),3)
[1] 0.026
PITFALLS
ۧۧ A point estimate only gives a single number for a population
parameter. Several point estimates on the same dataset would give
you too many estimates to logically choose from. And each estimate
has its own associated error. Remember that a sample statistic is
always a random variable.
ۧۧ The problem becomes how to pick the best estimator. If our estimate
is unbiased, efficient, and precise, that makes a great point estimate.
But is there a better solution? Perhaps we could give a range of
values that an estimator could take on. That kind of range is called a
confidence interval.
PROBLEMS
1 If the mean of a sample statistic is not equal to the population parameter, then
the sample statistic is called
a) an unbiased estimator.
b) a biased estimator.
c) an interval estimator.
d) a point estimator.
2 The “cars” dataset in R contains the speed of cars and the distances taken
to stop in the 1920s. Find a point estimate for the population mean for both
“distance” and “speed.”
When we calculate a point estimate, our chances
of hitting the target population parameter are
not very likely. We’ll often get close but will
seldom hit the mark. In this lecture, you will learn about
confidence intervals, with which we can increase our
chances of capturing the true population parameter.
CONFIDENCE INTERVALS
ۧۧ A confidence interval draws inferences about a population by estimating
the value of an unknown parameter using an interval. We’re looking
for an interval that covers the true population parameter with some
amount of certainty or confidence.
ۧۧ But the problem is that a different random sample ( with different
observed values ) would yield different estimates for μ. Which of those
estimates ( the sample mean from our first sample or the sample mean
from our second sample ) would be closest to the true value? We’d really
have no way of knowing.
CONFIDENCE LEVELS
ۧۧ Confidence intervals allow us to estimate population parameters using
a range of values—a range of values that are more likely to capture the
true population parameter.
ۧۧ To do that, we need to set the confidence level, which tells us how likely it
is that the population parameter is actually contained in the confidence
interval. It’s a measure of the degree of the reliability of the interval.
ۧۧ The 95% means that we used a method that captures the true mean
95% of the time.
ۧۧ Let’s define zα/2 as our critical value. This is the value such that
P( Z > zα/2 ) = P( Z < −zα/2 ) = α/2, where Z ~ N( 0, 1 ).
ۧۧ The area between −zα/2 and zα/2 under the standard normal curve is
1 − α. In other words, P( −zα/2 < Z < zα/2 ) = 1 − α.
ۧۧ Again, P( −zα/2 < Z < zα/2 ) = 1 − α, so ( 1 − α ) is the area in the center of our
normal curve. That means that each of our tail ends has an area equal to
α/2 so that the total area adds up to 1.
ۧۧ Suppose that α = 0.05. P( −zα/2 < Z < zα/2 ) = 1 − α = 95%, and there would
be 5% left over in the tails ( the shaded area ): 2.5% in the left shaded
area and 2.5% in the right shaded area.
ൖൖ Likewise, when α equals 0.05, we get 95% confidence and 2.5% left
in each of the tails.
ൖൖ When α equals 0.01, we get 99% confidence and 0.5% area in each of
the tails.
qnorm(0.05)
[1] -1.644854
qnorm(0.025)
[1] -1.959964
qnorm(0.005)
[1] -2.575829
set.seed(343)
milk = 129-rexp(100000,0.95)
hist(milk, main="Histogram
of Milk Population",
col="red")
true_mean = mean(milk)
true_sd = sd(milk)
true_mean
[1] 127.943
true_sd
[1] 1.058831
ۧۧ We can take a sample from our population to see how close we are to
the actual true mean.
set.seed(343)
sample_milk <- sample(milk, size=50, rep=T)
sample_mean <- mean(sample_milk)
sample_mean
[1] 127.9848
sample_mean-true_mean
[1] 0.04179896
ۧۧ Our sample mean is only slightly larger than the population mean, by
0.04 ounces.
ۧۧ We can calculate the 95% confidence interval for the sample mean in R.
n=50
sample_milk <- sample(milk, size=50, rep=T)
sample_mean <- mean(sample_milk)
sample_mean - 1.96 * sd(sample_milk) / sqrt(n)
[1] 127.5935
sample_mean + 1.96 * sd(sample_milk) / sqrt(n)
[1] 128.13
ۧۧ Our point estimate based on a sample of just 50 milk jugs only slightly
overestimates the true population mean. This illustrates an important point:
We can get a fairly accurate estimate of a large population by sampling a
relatively small subset of individuals.
hist(sample_milk)
ۧۧ Suppose that we take 1000 samples of size 𝑛 = 50 and look at the sample
distribution.
milk_mean = numeric(0)
for (i in 1: 1000)
milk_mean[i] = mean(sample(milk, 50, rep=T))
hist(milk_mean)
qqnorm(milk_mean)
qqline(milk_mean)
ۧۧ Now let’s return to our samples of size 𝑛 = 50 and take 100 such samples.
samp_mean = numeric(0)
for (i in 1: 100)
samp_mean[i] = mean(sample(milk,50, rep=T))
hist(samp_mean)
PITFALL
ۧۧ There’s a possible pitfall: 18 out of the 20 confidence intervals we calculated
( for the 90% confidence interval ) captured the true population mean. But that
means that 2 of them didn’t. When we only collect 1 sample, we have no way of
knowing if it covers the true mean or not. So, remember, our confidence is in
the method.
PROBLEMS
1 Of the following, which is not needed to calculate a confidence interval for a
population mean?
a) A confidence level.
b) A point estimate of the population mean.
c) A sample size of at least 10.
d) An estimate of the population variance.
e) All of the above are needed.
2 The “cars” dataset in R contains the speed of cars and the distances taken to
stop in the 1920s. Find a 95% confidence interval for the mean “distance” and
mean “speed.” Compare to a 90% confidence interval.
HYPOTHESIS TESTING:
1 SAMPLE
The past few lectures have been focused on parameter
estimation: How do we use our sample data to
estimate population parameters, such as the mean?
But there’s another way that we can look at our data.
Instead of using our data to estimate the population
parameter, we could guess a value for our population
parameter and then ask ourselves if we think our sample
could have come from that particular population. Instead
of going from data to parameter, we go from parameter
to data. This new approach is called hypothesis testing.
HYPOTHESIS TESTING
ۧۧ In general, a hypothesis is an educated guess about something in the
world around you—something that should be testable, either by
experiment or observation. In hypothesis testing, we want to know
whether the characteristics of our sample match the underlying
characteristics of our assumed population.
ൖൖ A rejection region: The set of all test statistic values for which H0 will
be rejected.
ۧۧ The null hypothesis is assumed true, and we have to provide the statistical
evidence needed to reject it in favor of an alternative hypothesis.
H0 : μ = μ0
H0 : μ = 0
Ha : μ > μ0
Ha : μ < μ0
Ha : μ ≠ μ0
SIGNIFICANCE LEVEL
ۧۧ We want α and β to both be small, but there is a trade-off. We can
decrease the rejection region to get a smaller α. However, a smaller
rejection region results in a larger β.
ۧۧ Let’s look at this in terms of criminal trials, where, in the United States,
the initial assumption is that a defendant is innocent until proven guilty.
UPPER-TAILED TEST
ۧۧ Suppose that we have a random sample X1, X2 , …, X𝑛 from a N( μ, σ )
with σ known and we want to do an upper-tailed test. Test H0:
μ = μ0 versus Ha: μ > μ0 . The test procedure is as follows:
1 Test statistic: Z = ( X̄ − μ0 )/( σ/√𝑛 ).
2 Significance level: α.
3 Rejection region: Z ≥ zα .
4 Decision: If the test statistic falls in the rejection region, we reject H0 ;
otherwise, we fail to reject it. For example, in a lower-tailed test with a
computed statistic of 0.05, because 0.05 ≥ −1.645, our test statistic does not
fall in the rejection region and we fail to reject the null hypothesis.
ۧۧ When 𝑛 is small: Suppose that we have a random sample X1, X2 , …, X𝑛 from
a N( μ, σ ) with σ unknown. Test H0: μ = μ0 versus Ha: μ > μ0 . The test
procedure is as follows:
1 Test statistic: T = ( X̄ − μ0 )/( s/√𝑛 ), which follows a t distribution with
𝑛 − 1 degrees of freedom.
2 Significance level: α.
3 Rejection region: T ≥ tα, 𝑛−1 .
ൖൖ High 𝑝-values: Our data are likely under a true null hypothesis.
ൖൖ Low 𝑝-values: Our data are unlikely under a true null hypothesis.
ۧۧ A low 𝑝-value suggests that our sample provides enough evidence that
we can reject the null hypothesis.
ۧۧ The following graph shows the distribution under the null hypothesis.
Our observed data point is the test statistic calculated from our sample.
The shaded area is the probability of seeing our observed sample, or a
sample more extreme, by chance, when the null hypothesis is true. The
closer our 𝑝-value gets to the very unlikely zone, the more evidence our
data provides to reject the null hypothesis.
𝑝-value = P( Z ≥ z │ μ = μ0 ) for the upper-tailed test, where z is the test
statistic calculated from our sample.
𝑝-value = P( Z ≤ z │ μ = μ0 ) for the lower-tailed test, where z is the test
statistic calculated from our sample.
𝑝-value = 2 P( Z ≥ |z| │ μ = μ0 ) for the 2-sided test. We take twice the tail
area because the test is 2-sided.
If the 𝑝-value ≤ α, we reject H0 .
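ۧۧ A minimal sketch of those calculations in R ( assuming xbar, mu0, sigma, and
n have already been defined ):
z = (xbar - mu0)/(sigma/sqrt(n))    # observed test statistic
1 - pnorm(z)                        # upper-tailed p-value
2*(1 - pnorm(abs(z)))               # 2-sided p-value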
PITFALLS
ۧۧ A statistical test is not designed to prove or disprove hypotheses.
It weighs the evidence provided by the data and decides what is
warranted.
SUGGESTED READING
Diez, Barr, and Cetinkaya-Rundel, OpenIntro Statistics, “Inference for
Numerical Data,” section 5.1.
Yau, R Tutorial, “Hypothesis Testing,” http://www.r-tutor.com/elementary-
statistics/hypothesis-testing.
a) α is small.
b) the 𝑝-value is less than α.
c) α = 0.05.
d) α = 0.01.
e) the 𝑝-value is greater than α.
a) population parameters.
b) sample parameters.
c) sample statistics.
d) It depends—sometimes population parameters and sometimes sample
statistics.
HYPOTHESIS TESTING:
2 SAMPLES, PAIRED TEST
Suppose that we want to test whether 2 samples are
from the same distribution. We can look at their
descriptive statistics, such as histograms or box plots,
but that won’t confirm a statistical difference. We need a
more formal method to determine a true difference. The
goal of this lecture is to demonstrate how to determine
if 2 samples are similar, meaning that they come from
the same underlying distribution, or different, meaning
that they come from different underlying distributions.
ۧۧ If the feed and weight are independent, meaning that neither one affects
the other, then the distributions of the 2 samples should be the same. In
other words, no matter what feed the chickens eat, their weights should
be roughly the same.
ۧۧ Let’s use our chicken weight data, which has newly hatched chicks that
were randomly placed in 6 groups, with each group given a different
feed supplement.
library("datasets")
data(chickwts)
summary(chickwts)
weight feed
Min.: 108.0 casein: 12
1st Qu.: 204.5 horsebean: 10
Median: 258.0 linseed: 12
Mean: 261.3 meatmeal: 11
3rd Qu.: 323.5 soybean: 14
Max.: 423.0 sunflower: 12
ۧۧ We also see the 6 different feed types, along with the corresponding
number of chickens in that feed group. The casein feed had 12 chickens,
horsebean feed had 10 chickens, linseed had 12, and so on.
data("chickwts")
attach(chickwts)
meat = chickwts[chickwts$feed=="meatmeal",1]
horse = chickwts[chickwts$feed=="horsebean",1]
ۧۧ Notice that our dataset is uneven. Meat has 11 entries while horse has 10.
[1] 325 257 303 315 380 153 263 242 206 344 258
[1] 179 160 136 227 217 168 108 124 143 140
ۧۧ Showing box plots for both samples is a great way for us to compare
them. Here’s the command to do that in R. Remember, our goal for this
lecture is to determine if 2 samples are similar ( meaning that they
come from the same underlying distribution ) or different ( meaning
that they come from different underlying distributions ). Your box plot
is helping shine light on that goal. We see that the bulk of the chickens
in meatmeal, from the first quartile onward, weigh more than all the
chickens that had the horsebean feed supplement.
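ۧۧ A minimal version of that box plot command ( the labels are assumptions ):
boxplot(meat, horse, names = c("meatmeal", "horsebean"),
        ylab = "Weight in grams")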
ۧۧ We take those 2 ordered sets of data and pair them up and plot them. If
feed had no effect on chicken weight, then we would expect to see our
points fall around the line Y = X. But if feed did have an effect, we would
expect our points to be shifted, either above or below the Y = X line.
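ۧۧ One way to draw that paired plot is with a quantile-quantile plot ( a sketch;
the lecture’s exact plotting command may differ ):
qqplot(meat, horse, xlab = "meatmeal weights", ylab = "horsebean weights")
abline(0, 1)   # the Y = X reference line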
ۧۧ Notice that all of our points fall below the line, being pulled by the larger
values in meatmeal.
ۧۧ H0 is that the 2 samples are from the same distribution ( and have the
same mean ).
ۧۧ Here we’ve taken the mean and standard deviation of meatmeal and
horsebean and calculated our test statistic, T. The value that we get is
5.059.
mean.meat = mean(meat)
mean.horse = mean(horse)
sd.meat = sd(meat)/sqrt(length(meat))
sd.horse = sd(horse)/sqrt(length(horse))
T.stat = (mean.meat - mean.horse)/sqrt(sd.meat^2 + sd.horse^2)
T.stat
[1] 5.059444
ۧۧ R does the heavy lifting for you with the “t.test” function. The command
is “t.test( meat, horse ),” and it returns the Welch’s 2-sample t-test.
ۧۧ We can confirm our test statistic value of 5.059, and we have a very low
𝑝-value, which would lead us to reject the null hypothesis that the 2
means are equal.
PAIRED T-TESTS
ۧۧ The 2-sample t-test procedure only applies when the 2 samples are
independent and the underlying distributions are normal, as in our
chicken weight example.
ۧۧ The paired-sample t-test works the same as a 1-sample t-test, but each
observation in one sample is correlated to an observation in the second
sample.
install.packages("PairedData")
library(PairedData)
data(IceSkating)
attach(IceSkating)
ۧۧ The first column shows that we have 7 subjects in this dataset, along
with extended speed measurements in column 2 and flexed speed
measurements in column 3.
IceSkating
Subject Extension Flexion
1 S1 2.13 1.90
2 S2 1.77 1.55
3 S3 1.68 1.62
4 S4 2.04 1.89
5 S5 2.12 2.01
6 S6 1.92 1.91
7 S7 2.08 2.10
ۧۧ The following is a paired plot of our data, with “extension” on the 𝑥-axis
and “flexion” on the 𝑦-axis. If there was no difference, we’d expect the
points to fall along this line, where X = Y. And they’re not too far off.
They fall slightly below, with a preference toward extension.
with(IceSkating,plot(paired(Extension,Flexion),
type="McNeil"))
hist(Extension-Flexion)
with(IceSkating,qqnorm(Extension-Flexion))
with(IceSkating,qqline(Extension-Flexion))
shapiro.test(Extension-Flexion)
data: Extension - Flexion
W = 0.93721, p-value = 0.6137
ۧۧ Let’s walk through some of the details of the paired t-test. For each
matched pair, we create a new variable, d, which represents the
difference between the 2 samples: d = Extension − Flexion. In R, we
assign d to the difference in speed between extension and flexion.
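d = Extension - Flexion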
mean(Extension) - mean(Flexion)
[1] 0.1085714
mean(d)
[1] 0.1085714
H0: d = 0
Ha: d ≠ 0
ۧۧ So, T = mean( d )/( sd( d )/√𝑛 ) = 2.9346765.
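ۧۧ In R ( using the d vector defined above ):
T.stat = mean(d)/(sd(d)/sqrt(length(d)))
T.stat   # 2.9346765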
ۧۧ Notice that at the 95% confidence level, we reject the null hypothesis
because T > 2.45. We would conclude that there is a difference in speed
when the leg is extended versus flexed.
ۧۧ But at the 99% confidence level, we fail to reject the null because
T < 3.71. This is where the 𝑝-value’s importance is critical, because it
tells us exactly where the cutoff is for this particular sample.
2*(1-pt(2.9346765,6))
[1] 0.02612775
ۧۧ We can compare this to the output that R gives us. Notice that our
𝑝-value of 0.026 matches the result that we get from R.
t.test(Extension,Flexion,paired=TRUE)
Paired t-test
data: Extension and Flexion
t = 2.9347, df = 6, p-value = 0.02613
alternative hypothesis: true difference in means is not
equal to 0
95 percent confidence interval:
0.01804536 0.19909749
sample estimates:
mean of the differences
0.1085714
ۧۧ With a 𝑝-value of 0.3046, we’re no longer able to reject the null hypothesis
at any of the usual significance levels. This difference arises because the
degrees of freedom are lower for paired data than for independent data. With
paired data, each pair of values contributes only 1 difference, and hence 1
degree of freedom; when the data are independent, each data point contributes
its own degree of freedom.
t.test(Extension,Flexion)
Welch Two Sample t-test
data: Extension and Flexion
t = 1.0731, df = 11.857, p-value = 0.3046
alternative hypothesis: true difference in means is not
equal to 0
95 percent confidence interval:
-0.1121619 0.3293047
sample estimates:
mean of x mean of y
1.962857 1.854286
ۧۧ If the differences between pairs are non-normal, don’t use the t-test for
that. Your results won’t be valid. In this case, it would be better to use
a non-parametric test, such as the Wilcoxon signed-rank test, which
doesn’t depend on an underlying assumption of normality.
SUGGESTED READING
Diez, Barr, and Cetinkaya-Rundel, OpenIntro Statistics, “Inference for
Numerical Data,” sections 5.2–5.3.
Yau, R Tutorial, “Inference about Two Populations,” http://www.r-tutor.com/
elementary-statistics/inference-about-two-populations.
Regular car mileage = (14, 16, 20, 20, 21, 21, 23, 24, 23,
22, 23, 22, 27, 25, 27, 28, 30, 29, 31, 30, 35, 34)
Premium car mileage = (16, 17, 19, 22, 24, 24, 25, 25, 26,
26, 24, 27, 26, 28, 32, 33, 33, 31, 35, 31, 37, 40)
LINEAR REGRESSION
MODELS AND
ASSUMPTIONS
In regression, we look at the association between 2 or
more quantitative variables. We’re going to begin to
look for whether one variable causes or predicts changes
in another variable. The response variable, which is the
dependent variable, might measure an outcome of a study
or experiment. The explanatory variable, which is the
independent or predictor variable, explains or is related
to changes in the response variable. We now have pairs
of observations: ( 𝑥1, 𝑦1 ), ( 𝑥2 , 𝑦2 ), …, ( 𝑥𝑛 , 𝑦𝑛 ). In linear
regression, we’re looking for the association between
2 variables to be centered on a line. In essence, we’re
looking for the effect that one variable has on another.
LINEAR REGRESSION
ۧۧ Imagine that we are wheat farmers in Kansas. For the past 10 years, at
the end of the season, we’ve recorded the total amount of rainfall and
the average height of our wheat. We’d like to know if rainfall has any
effect on our wheat yield.
rainfall = c(3.07,3.55,3.90,4.38,
4.79,5.30,5.42,5.99,6.45,6.77)
wheat = c(78,82,85,91,92,96,97,104,111,119)
summary(cbind(rainfall, wheat))
rainfall wheat
Min: 3.070 Min.: 78.0
1st Qu.: 4.020 1st Qu.: 86.5
Median: 5.045 Median: 94.0
Mean: 4.962 Mean: 95.5
3rd Qu.: 5.848 3rd Qu.: 102.2
Max.: 6.770 Max.: 119.0
ۧۧ Our linear regression model, with unknown parameters, looks like this:
𝑦 = β0 + β1 𝑥 + 𝜖, where β0 is the 𝑦-intercept, β1 is the slope, and 𝜖 is the
error term.
ۧۧ Notice the Greek letters: Our model is about the underlying population.
ۧۧ We’ll use our data to estimate the slope and 𝑦-intercept. Notice that
when we talk about estimates from our sample data, we use hats over
the βs.
ۧۧ Our linear regression line, with parameters now estimated from the
sample data, is ŷ = β̂0 + β̂1 𝑥.
ۧۧ We estimate β̂0 and β̂1 by minimizing the sum of squared differences between
the observed and fitted values: min Σ( 𝑦i − ŷi )².
ۧۧ R will conveniently fit a linear regression for us. We use the function
“lm,” which stands for linear model. After “lm” comes our response
variable, “wheat,” followed by a tilde, which tells R that we want to
model wheat on the explanatory variables that follow it. In this case,
that’s rainfall.
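ۧۧ A hedged reconstruction of that command ( the object name wheat.lm matches
the residual plot code later in this lecture ):
wheat.lm = lm(wheat ~ rainfall)
summary(wheat.lm)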
Call:
lm(formula = wheat ~ rainfall)
Residuals:
Min 1Q Median 3Q Max
-3.158 -1.903 0.334 1.278 5.114
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 45.0408 3.6677 12.28 1.80e-06 ***
rainfall 10.1691 0.7191 14.14 6.08e-07 ***
---
1 The “call” tells us what our regression model is. In this case, wheat is
being regressed on rainfall.
2 With the summary statistics for the residuals, we can quickly see if
our residuals have any outliers and check to see that the median is
close to 0.
ۧۧ The following is a plot of our wheat data with the estimated regression
line.
ۧۧ What is the equation of the linear regression line? We can actually solve
for our regression coefficients directly from the data.
ۧۧ When we plug our 𝑥- and 𝑦-values into the least-squares formula for the
slope, we get 10.1691. This matches the result from R.
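ۧۧ Those least-squares estimates can be computed by hand in R using the
standard formulas b1 = Σ( 𝑥i − 𝑥̄ )( 𝑦i − 𝑦̄ )/Σ( 𝑥i − 𝑥̄ )² and b0 = 𝑦̄ − b1 𝑥̄:
b1 = sum((rainfall - mean(rainfall))*(wheat - mean(wheat)))/
     sum((rainfall - mean(rainfall))^2)
b1   # 10.1691, matching the rainfall coefficient from R
b0 = mean(wheat) - b1*mean(rainfall)
b0   # 45.0408, matching the intercept from R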
RESIDUALS
ۧۧ We need a way of understanding how well our regression line fits the
data. The residuals help us do just that. A residual is the difference
between an observed value ( Yi ) and the value predicted by the
regression line ( Ŷi ): ei = 𝑦i − ŷi .
ۧۧ Some people get confused at this point, because it seems like the
residuals, ei, are the same as our errors, 𝜖i .
𝜖i = 𝑦i − ( β0 + β1 𝑥i )
ۧۧ The residuals, on the other hand, measure the difference between the
observed value ( 𝑦i ) and the estimated regression line ( ŷi ).
ei = 𝑦i − ŷi
ei = 𝑦i − ( β̂0 + β̂1 𝑥i )
ۧۧ The following plot shows what the residual is. For a given observation,
𝑦i, we find the distance from that point to the regression line. The
value on the regression line ŷi is the estimated regression fit for 𝑦i . The
difference between 𝑦i and ŷi is the residual.
rainfall = c(3.07,3.55,3.90,4.38,4.79,
5.30,5.42,5.99,6.45,6.77)
wheat = c(78,82,85,91,92,96,97,104,111,119)
wheat.lm = lm(wheat~rainfall)
plot(wheat.lm$fitted.values,wheat.lm$residuals,
main = "Residuals vs. Fitted Values",
xlab = "Fitted Wheat Values", ylab = "Residuals")
abline(h=0, col="red")
SUGGESTED READING
Diez, Barr, and Cetinkaya-Rundel, OpenIntro Statistics, “Introduction to Linear
Regression,” sections 7.1–7.2.
Faraway, Linear Models with R, “Estimation,” chap. 2.
Yau, R Tutorial, “Simple Linear Regression,” http://www.r-tutor.com/
elementary-statistics/simple-linear-regression.
PROBLEMS
1 What does the linear regression slope b1 represent?
a) There is no correlation.
b) The slope b1 is negative.
c) Variable X is larger than variable Y.
d) The variance of X is negative.
REGRESSION PREDICTIONS,
CONFIDENCE INTERVALS
As you will learn in this lecture, transforming
the response variable can help us eliminate
heteroscedasticity ( increasing or decreasing
variance ) and satisfy the assumptions of normality,
independence, and linearity. In this lecture, you will
learn how to use linear regression to make predictions
as well as how to learn about population parameters
through confidence intervals and hypothesis tests.
TRANSFORMATIONS: LN(Y)
ۧۧ The following data has an exponential shape and doesn’t satisfy the
assumption of linearity.
ۧۧ Notice that the residuals are large in magnitude, not centered at 0, and
not balanced. They’re heteroscedastic with increasing variance. The
histogram is slightly skewed to the right.
ۧۧ After just taking the natural log, our data appear more linear. Our
residuals look better behaved: they are centered at 0 and have less of a
pattern, and our histogram has shifted toward normality.
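ۧۧ A minimal sketch of that refit ( the names x and y are placeholders; the
lecture’s actual variables aren’t listed here ):
log.fit = lm(log(y) ~ x)
plot(log.fit$fitted.values, log.fit$residuals)   # residuals after the log transformation
hist(log.fit$residuals)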
TRANSFORMATIONS: Y²
ۧۧ The following is a slightly different dataset with a curved pattern.
The residuals have a clear pattern, and the histogram is heavily
skewed to the left. In this case, the solution is to square the data.
Notice what a great job that transformation does in helping us satisfy
our model assumptions.
ۧۧ For example, the solid line in the following graph represents our true
underlying population. Let’s sample 5 points from this population and
fit a regression line to those 5 points. Sometimes our 5 points do a nice
job of estimating the population slope. But it’s difficult to precisely
estimate the slope with so few points.
SIGNIFICANT β̂0 AND β̂1
ۧۧ An example of a linear
regression fit where both
the slope and 𝑦-intercept
estimates are significant
is shown at right. Our
data lies close enough to
the 𝑦-axis that we can
extrapolate out to 𝑦 = 0
fairly accurately. And in
spite of the variability
in our data, we can fit a
regression line that has a
nonzero slope.
ۧۧ But we could still estimate the 𝑦-intercept as the constant Ȳ. Our model
for this data would be a straight line centered at the mean of Y.
ۧۧ The following data was extracted from the 1974 Motor Trend magazine
and comprises fuel consumption ( 𝑦 ) and 10 aspects of automobile
design and performance ( 𝑥 ) for 32 automobiles ( 1973–1974 models ).
Let’s explore the “mtcars” dataset and use linear regression to predict
vehicle gas mileage based on vehicle weight.
library(datasets)
data(mtcars)
summary(mtcars)
mpg cyl disp hp
Min.: 10.40 Min.: 4.000 Min.: 71.1 Min.: 52.0
1st Qu.: 15.43 1st Qu.: 4.000 1st Qu.: 120.8 1st Qu.: 96.5
Median: 19.20 Median: 6.000 Median: 196.3 Median: 123.0
Mean: 20.09 Mean: 6.188 Mean: 230.7 Mean: 146.7
3rd Qu.: 22.80 3rd Qu.: 8.000 3rd Qu.: 326.0 3rd Qu.: 180.0
Max.: 33.90 Max.: 8.000 Max.: 472.0 Max.: 335.0
drat wt qsec vs
Min.: 2.760 Min.: 1.513 Min.: 14.50 Min.: 0.0000
1st Qu.: 3.080 1st Qu.: 2.581 1st Qu.: 16.89 1st Qu.: 0.0000
Median: 3.695 Median: 3.325 Median: 17.71 Median: 0.0000
Mean: 3.597 Mean: 3.217 Mean: 17.85 Mean: 0.4375
3rd Qu.: 3.920 3rd Qu.: 3.610 3rd Qu.: 18.90 3rd Qu.: 1.0000
Max.: 4.930 Max.: 5.424 Max.: 22.90 Max.: 1.0000
am gear carb
Min.: 0.0000 Min.: 3.000 Min.: 1.000
1st Qu.: 0.0000 1st Qu.: 3.000 1st Qu.: 2.000
Median: 0.0000 Median: 4.000 Median: 2.000
Mean: 0.4062 Mean: 3.688 Mean: 2.812
3rd Qu.: 1.0000 3rd Qu.: 4.000 3rd Qu.: 4.000
Max.: 1.0000 Max.: 5.000 Max.: 8.000
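ۧۧ The output below comes from regressing miles per gallon on weight. Here is
that fit, using the object name mpg_model that appears in the residual checks
later in this lecture:
mpg_model = lm(mpg ~ wt, data = mtcars)
summary(mpg_model)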
Residuals:
Min 1Q Median 3Q Max
-4.5432 -2.3647 -0.1252 1.4096 6.8727
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
---
ۧۧ The output above shows the formula used to make the model, followed
by a 5-number summary of the residuals and a summary of the model
coefficients.
ۧۧ The coefficients are the constants used to create the best fit line: In this
case, the 𝑦-intercept term β̂0 is set to 37.2851, and the slope coefficient β̂1
for the weight variable is −5.3445. The model fit the line mpg = 37.2851 −
5.3445 × wt.
qt(.975,df=28)
[1] 2.048407
ۧۧ We are 95% confident that the true slope, regressing miles per gallon
on weight, is between −6.4897 and −4.1992 miles per gallon per 1000
pounds.
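ۧۧ That interval can be reproduced as estimate ± t* × standard error, or pulled
directly from the fitted model ( a sketch ):
-5.3445 + c(-1, 1)*qt(.975, df = 28)*0.5591   # by hand
confint(mpg_model, "wt", level = 0.95)        # from the model object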
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
pt(-9.559,df=28)
[1] 1.29007e-10
qqnorm(mpg_model$residuals)
qqline(mpg_model$residuals)
shapiro.test(mpg_model$residuals)
data: mpg_model$residuals
W = 0.94508, p-value = 0.1044
PITFALL
ۧۧ Clearly, we need to add more predictors to our model, but we run into
a pitfall when we do that. Collinearity occurs when 2 or more predictor
variables are closely related to one another, or highly correlated. The
presence of collinearity can create problems in the regression context,
because it can be difficult to separate out the individual effects of
collinear variables on the response. But it’s fixable.
SUGGESTED READING
Crawley, The R Book, “Regression,” chap. 10.
Diez, Barr, and Cetinkaya-Rundel, OpenIntro Statistics, “Introduction to Linear
Regression,” sections 7.3–7.4.
Yau, R Tutorial, “Simple Linear Regression,” http://www.r-tutor.com/
elementary-statistics/simple-linear-regression.
a) A negative correlation.
b) No apparent relationship.
c) A statistically significant relationship.
d) A positive correlation.
e) A heteroskedastic relationship.
MULTIPLE LINEAR
REGRESSION
When we first started examining the relationship
between 2 variables, we used t-tests, which
allowed us to determine if there was a
statistically significant relationship between 2 variables.
We then moved to simple linear regression, which allowed
us to fit a line to a predictor versus response. While these
are useful in the case where we only have 2 variables, it’s
more often the case to work with data that has multiple
predictors. This lecture is about multiple linear regression.
install.packages("MASS")
library(MASS)
data(Pima.tr)
head(Pima.tr)
pima = Pima.tr
  npreg glu bp skin  bmi   ped age type
1     5  86 68   28 30.2 0.364  24   No
2     7 195 70   33 25.1 0.163  55  Yes
3     5  77 82   41 35.8 0.156  35   No
4     0 165 76   43 47.9 0.259  26   No
5     0 107 60   25 26.4 0.133  23   No
6     5  97 76   27 35.6 0.378  52  Yes
dim(pima)
[1] 200 8
ۧۧ The median and mean of bmi are 32.8 and 32.31, respectively. When you
see a median and a mean so close together, it gives you some assurance
that that might be an underlying normal population.
ۧۧ All of the variables are quantitative except for type, which is categorical
( “ Yes” for diseased or “No” for non-diseased ).
hist(pima$bp)
ۧۧ In the following pairs plot of 4 of the data values, there doesn’t seem to be
a strong relationship among them, and skin shows a possible outlier.
pairs(pima[1:4])
round(cor(pima[1:7]), 2)
      npreg  glu    bp  skin   bmi   ped   age
npreg  1.00 0.17  0.25  0.11  0.06 -0.12  0.60
glu    0.17 1.00  0.27  0.22  0.22  0.06  0.34
bp     0.25 0.27  1.00  0.26  0.24 -0.05  0.39
skin   0.11 0.22  0.26  1.00  0.66  0.10  0.25
bmi    0.06 0.22  0.24  0.66  1.00  0.19  0.13
ped   -0.12 0.06 -0.05  0.10  0.19  1.00 -0.07
age    0.60 0.34  0.39  0.25  0.13 -0.07  1.00
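ۧۧ The regression output below models bmi on the remaining predictors; a hedged
reconstruction of that call ( the object name lm1 is an assumption ):
lm1 = lm(bmi ~ npreg + glu + bp + skin + ped + age + type, data = pima)
summary(lm1)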
Residuals:
Min 1Q Median 3Q Max
-19.9065 -2.5723 -0.1412 2.6039 11.2664
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.203109 2.346040 8.185 3.69e-14 ***
npreg 0.018970 0.120858 0.157 0.8754
glu 0.006432 0.011981 0.537 0.5920
bp 0.046753 0.031272 1.495 0.1366
skin 0.322989 0.029362 11.000 < 2e-16 ***
ped 2.060288 1.094140 1.883 0.0612 .
age -0.061772 0.040459 -1.527 0.1285
typeYes 1.494968 0.827212 1.807 0.0723 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.528 on 192 degrees of freedom
Multiple R-squared: 0.4735, Adjusted R-squared: 0.4543
F-statistic: 24.67 on 7 and 192 DF, p-value: < 2.2e-16
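ۧۧ The next output regresses glu on the other variables. This matches the lm3
object used in the plots below ( the predictor order is inferred from the
coefficient table ):
lm3 = lm(glu ~ bmi + npreg + bp + skin + ped + age + type, data = pima)
summary(lm3)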
Residuals:
    Min      1Q  Median      3Q     Max
-66.595 -17.396  -1.641  12.952  89.977
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 68.96159 15.62700 4.413 1.70e-05 ***
bmi 0.23301 0.43405 0.537 0.5920
npreg -0.84245 0.72494 -1.162 0.2466
bp 0.31786 0.18792 1.691 0.0924 .
skin 0.07046 0.22559 0.312 0.7551
ped -2.36903 6.64389 -0.357 0.7218
age 0.55832 0.24166 2.310 0.0219 *
typeYes 26.29928 4.64856 5.658 5.51e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 '
' 1
Residual standard error: 27.26 on 192 degrees of freedom
Multiple R-squared: 0.2853, Adjusted R-squared: 0.2592
F-statistic: 10.95 on 7 and 192 DF, p-value: 1.312e-11
plot(lm3$fitted.values, pima$glu)
predict(lm3) #Compare to predicted value
( output: one predicted glu value for each observation in the dataset )
plot(lm3$fitted.values,lm3$residuals)
hist(lm3$residuals)
qqnorm(lm3$residuals)
qqline(lm3$residuals)
ۧۧ Let’s return to the Motor Trend cars data and fit a multiple regression model
by including more explanatory variables in a linear regression model.
library(datasets)
data(mtcars)
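ۧۧ A hedged reconstruction of the 2-predictor fit summarized below ( the object
name is an assumption ):
mpg.lm2 = lm(mpg ~ wt + hp, data = mtcars)
summary(mpg.lm2)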
Residuals:
Min 1Q Median 3Q Max
-3.941 -1.600 -0.182 1.050 5.854
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.22727 1.59879 23.285 < 2e-16 ***
wt -3.87783 0.63273 -6.129 1.12e-06 ***
hp -0.03177 0.00903 -3.519 0.00145 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 '
' 1
Residual standard error: 2.593 on 29 degrees of freedom
Multiple R-squared: 0.8268, Adjusted R-squared: 0.8148
F-statistic: 69.21 on 2 and 29 DF, p-value: 9.109e-12
ۧۧ For example, as weight increases by 1 unit, the miles per gallon will
decrease by 3.877. Intuitively, this makes sense: The heavier a car gets,
the lower miles per gallon it would tend to get.
cor(mtcars$wt, mtcars$hp)
[1] 0.6587479
ۧۧ We see that “hp” and “wt” are correlated with a correlation coefficient
of 0.658. It’s possible that these predictors are collinear, meaning that
they’re either highly correlated or they contribute the same information
to the model.
Residuals:
Min 1Q Median 3Q Max
-3.4506 -1.6044 -0.1196 1.2193 4.6271
ۧۧ None of our variables are significant. They all have 𝑝-values greater than
0.05. In fact, a model using all the variables doesn’t perform as well as
our weight-and-horsepower–based model. Our adjusted R2 actually
decreased—from 0.8148 to 0.8066.
ۧۧ We see that when we add variables that have little relationship with the
response or even variables that are too correlated to one another, we
can get poor results.
ۧۧ One problem in our data is that our variables are correlated. We can see
this in the pairs plots.
pairs(mtcars[, c(1, 3:4)])
pairs(mtcars[, c(5:7)])
round(cor(mtcars[, c(1, 3:7)]), 2)
       mpg  disp    hp  drat    wt  qsec
mpg   1.00 -0.85 -0.78  0.68 -0.87  0.42
disp -0.85  1.00  0.79 -0.71  0.89 -0.43
hp   -0.78  0.79  1.00 -0.45  0.66 -0.71
drat  0.68 -0.71 -0.45  1.00 -0.71  0.09
wt   -0.87  0.89  0.66 -0.71  1.00 -0.17
qsec  0.42 -0.43 -0.71  0.09 -0.17  1.00
ۧۧ We can find the best model by pruning. We “step” through the predictor
variables and remove the ones that are not significant.
ۧۧ Choose the model with the highest adjusted R2. This assumes that we
choose to evaluate the success of our model in terms of the percentage of
the variability in the response explained by the explanatory variables.
ۧۧ Let’s do a stepwise regression on our linear model fit of miles per gallon
with all of our data. This will automatically spit out the best model.
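ۧۧ A sketch of that stepwise call ( the lecture’s exact starting model may
differ ):
full.lm = lm(mpg ~ ., data = mtcars)
step(full.lm)   # backward elimination by AIC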
Step: AIC=66.97
mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
Df Sum of Sq RSS AIC
- carb 1 0.6855 148.53 65.121
- gear 1 2.1437 149.99 65.434
- drat 1 2.2139 150.06 65.449
- disp 1 3.6467 151.49 65.753
- hp 1 7.1060 154.95 66.475
<none> 147.84 66.973
- am 1 11.5694 159.41 67.384
- qsec 1 15.6830 163.53 68.200
- wt 1 27.3799 175.22 70.410
Step: AIC=65.12
mpg ~ disp + hp + drat + wt + qsec + am + gear
Df Sum of Sq RSS AIC
- gear 1 1.565 150.09 63.457
- drat 1 1.932 150.46 63.535
<none> 148.53 65.121
- disp 1 10.110 158.64 65.229
- am 1 12.323 160.85 65.672
- hp 1 14.826 163.35 66.166
- qsec 1 26.408 174.94 68.358
- wt 1 69.127 217.66 75.350
Step: AIC=62.16
mpg ~ disp + hp + wt + qsec + am
Df Sum of Sq RSS AIC
- disp 1 6.629 160.07 61.515
<none> 153.44 62.162
- hp 1 12.572 166.01 62.682
- qsec 1 26.470 179.91 65.255
- am 1 32.198 185.63 66.258
- wt 1 69.043 222.48 72.051
Step: AIC=61.52
mpg ~ hp + wt + qsec + am
Df Sum of Sq RSS AIC
- hp 1 9.219 169.29 61.307
<none> 160.07 61.515
- qsec 1 20.225 180.29 63.323
- am 1 25.993 186.06 64.331
- wt 1 78.494 238.56 72.284
Step: AIC=61.31
mpg ~ wt + qsec + am
Df Sum of Sq RSS AIC
<none> 169.29 61.307
- am 1 26.178 195.46 63.908
- qsec 1 109.034 278.32 75.217
- wt 1 183.347 352.63 82.790
Residuals:
Min 1Q Median 3Q Max
-3.4811 -1.5555 -0.7257 1.4110 4.6610
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.6178 6.9596 1.382 0.177915
wt -3.9165 0.7112 -5.507 6.95e-06 ***
qsec 1.2259 0.2887 4.247 0.000216 ***
am 2.9358 1.4109 2.081 0.046716 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.459 on 28 degrees of freedom
Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
ۧۧ The output from our stepwise regression model is as follows. This is the
best model.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.6178 6.9596 1.382 0.177915
wt -3.9165 0.7112 -5.507 6.95e-06 ***
qsec 1.2259 0.2887 4.247 0.000216 ***
am 2.9358 1.4109 2.081 0.046716 *
PITFALLS
ۧۧ What if one of your assumptions isn’t met? With nonlinearity, you have
to transform the data. You can use polynomial regression.
ۧۧ What if the variance of your residuals is not constant? You can do a
weighted version of your least-squares regression.
PROBLEMS
1 In least-squares regression, which of the following is not a required assumption
about the error term 𝜖?
ANALYSIS OF VARIANCE:
COMPARING 3 MEANS
Variation and randomness are everywhere.
Whether you’re looking for a mechanical part
failure, determining a new drug’s effectiveness,
or wondering if it will rain tomorrow, almost
everything has variation. One of the most commonly
used statistical methods is ANOVA, which is an
acronym for the phrase “analysis of variance.” The
whole purpose of ANOVA is to break up variation into
component parts and then look at their significance.
ۧۧ They have the same means as before, at 0, 1, and 2, but they have a larger
variance. It’s more probable that these 3 samples could come from the
same underlying population.
ۧۧ Recall that the t-test was a ratio of group difference for 2 groups divided
by the sampling variability ( where sampling variability is the standard
error ).
ۧۧ The F-test is named after the man who invented the idea, Sir Ronald
Fisher, who was analyzing fertilizer data. Agricultural researchers
had been trying to figure out which fertilizer worked best by using a
different one each year. Fisher developed much better tests to control
for weather and land conditions.
ۧۧ In fitting the ANOVA model, we more or less assume the same conditions
as multiple linear regression.
ۧۧ What if you want to assess more than 1 factor? There are different
types of ANOVA.
ۧۧ You may hear another commonly used term for ANOVA: factorial
design. A 3-way factorial design is the same as a 3-way ANOVA.
ൖൖ H0: The mean outcome is the same across all categories: μ1 = μ2 = … = μk.
ൖൖ Ha: The mean of the outcome is different for some ( or all ) groups.
In other words, there is at least one mean difference among the
populations where μi represents the mean of the outcome for
observations in category i.
ൖൖ Some means are different from others while some are similar.
ۧۧ The distance from any data point to the mean is the deviation from this
point to the mean: ( Xi − X̄ ).
2 Variation within groups: For each data value, we look at the difference
between that value and the mean of its group. This is called the sum
of squares within ( SSW ), which is the sum of the squared deviations
within each group.
ۧۧ The sum of squares total ( SST ) is the sum of the squared deviations
between each observation and the overall mean.
ۧۧ MSW ( mean square within ) is also called the within-groups mean square.
THE F-STATISTIC
ۧۧ Our goal is to compare the 2 sources of variability: MSW and MSB. Our
test statistic is the ratio F = MSB/MSW.
ۧۧ What we’ve just computed is called the F-statistic or F-ratio. Unlike the
t-statistic, which is based on sample means, the F-ratio is based on a
ratio of sample variances. The variance in the numerator measures the
size of differences among sample means. Variance in the denominator
measures the other differences expected if group means were not
different from one another.
ۧۧ If the variances are unequal, then the grouping has an effect. The
between-group variation ( MSB ) becomes large compared to the within-
group variation ( MSW ), and the F-ratio would be greater than 1.
ۧۧ For example, if the degrees of freedom of the numerator were 20 and the
degrees of freedom of the denominator were 19, then our critical
value from the F-distribution would be 2.1555. We would compare our
F-ratio to this value and reject H0 if our ratio were larger than 2.1555 or
fail to reject H0 if our ratio were smaller than 2.1555.
1 The first column lists the source of the variation, either between-
group or within-group, followed by the total variation.
2 The second column gives us the sums of squares ( SSB ), ( SSW ), and ( SST ).
3 The third column lists the degrees of freedom ( k − 1 ) and ( N − k ), and
if you add both of those, we get the total degrees of freedom, ( N − 1 ).
4 The fourth column is the mean square between and within group.
Summary ANOVA
ۧۧ The first step in our analysis is to graphically compare the means of the
variable of interest across groups. To do that, we can create side-by-side
box plots of the measurements organized in groups using a function.
require(stats); require(graphics)
boxplot(weight ~ feed, data = chickwts, col = "lightgray",
main = "Chickwts data", ylab = "Weight in grams",
xlab="Type of Feed")
summary(chickwts)
weight feed
Min.: 108.0 casein: 12
1st Qu.: 204.5 horsebean: 10
Median: 258.0 linseed: 12
Mean: 261.3 meatmeal: 11
3rd Qu.: 323.5 soybean: 14
Max.: 423.0 sunflower: 12
ۧۧ Our group sizes only range between 10 and 14, but what if we had larger
variation in sample size?
ۧۧ A variable-width box plot can show whether your groups have the same
number and shape. In a variable-width box plot, the width of the box
plot represents the number in each group. The height, as usual, shows
the spread in the data.
ۧۧ Once the ANOVA model is fit, we can look at the results using the
“summary( )” function. This produces the standard ANOVA table.
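ۧۧ A minimal sketch of that fit ( the object name chick.aov is an assumption ):
chick.aov = aov(weight ~ feed, data = chickwts)
summary(chick.aov)   # the standard ANOVA table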
ۧۧ The ANOVA F-test answers the question whether there are significant
differences in the k population means. But it doesn’t give us any
information about how they differ. That’s because ANOVA compares all
individual mean differences simultaneously, in 1 test.
TUKEY’S METHOD
ۧۧ A common multiple comparisons procedure is Tukey’s method, named
for John Tukey, an inventive mathematics professor with a joint
appointment at Bell Labs. He was first to use the terms “software” and
“bit” in computer science, and in 1970, he created box plots.
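ۧۧ In R, Tukey’s method can be applied to a fitted ANOVA object ( continuing
the chick.aov sketch above ):
TukeyHSD(chick.aov)        # all pairwise differences in mean weight, with adjusted intervals
plot(TukeyHSD(chick.aov))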
PITFALLS
ۧۧ ANOVA depends on the same assumptions as least-squares linear
regression—only more so.
ۧۧ Could these 3 groups come from the same underlying population? This
is a possible pitfall. Don’t be misled by the name ANOVA into expecting
it to analyze any kind of variance. ANOVA assumes a shared variance
( i.e., roughly the same variance ) across all groups. It’s looking only at
whether means with that shared variance value also come from the
same distribution.
ۧۧ When the F-test shows that means come from different distributions,
then that says, for example, that the new fertilizer you’re testing gives
statistically different results from other fertilizers.
PROBLEMS
1 1-way ANOVA is used when
ANALYSIS OF
COVARIANCE AND
MULTIPLE ANOVA
If you’re studying cancer in patients and you want to
know which of 4 new treatments is most effective, you
would use ANOVA, but you’d also want to be careful that
you aren’t missing a continuous factor that may co-vary
with your results, such as distance from a major source
of pollution. ANOVA won’t model a continuous predictor
variable; it only works for categorical variables. Analysis
of covariance can be used to address this problem.
ۧۧ We open their database and notice that there are 12 patients with
esophageal cancer. We place them in 4 groups of 3 each. Let’s analyze
the data as a 1-way ANOVA.
months = c(78,93,86,57,45,60,28,31,22,9,12,4)
treat = gl(4,3)
lm.mod = lm(months ~ treat)
summary(lm.mod)
Call:
lm(formula = months ~ treat)
Residuals:
Min 1Q Median 3Q Max
-9.0000 -4.5000 0.8333 3.7500 7.3333
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 85.667 3.613 23.709 1.07e-08 ***
treat2 -31.667 5.110 -6.197 0.00026 ***
treat3 -58.667 5.110 -11.481 3.00e-06 ***
treat4 -77.333 5.110 -15.134 3.60e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.258 on 8 degrees of freedom
Multiple R-squared: 0.9702, Adjusted R-squared: 0.959
F-statistic: 86.73 on 3 and 8 DF, p-value: 1.925e-06
ۧۧ In our initial analysis, we didn’t consider the stage to which the cancer
had progressed at the time that treatment begins. This is important,
because those at earlier stages of disease will naturally live longer
on average. Stage of disease is a covariate. We should have been more
intentional in using randomization to balance out our groups.
set.seed(1234)
months2 = c(sample(c(78,93,86,57,45,60,28,
31,22,9,12,4),12,replace=F))
treat = gl(4,3)
years2 = c(sample(c(2.3,3.4,1.8,5.8,6.2,7.3,
9.6,11.0,12.2,14.8,17.3,16.0), 12,replace=F))
ۧۧ Notice that we have much more spread in the survival time post-
treatment. There’s not a clear treatment that outperforms the others.
ۧۧ Notice that even when we add years to the model, both variables remain
insignificant.
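ۧۧ A hedged sketch of that model ( the object name is an assumption ):
ancova.mod = lm(months2 ~ treat + years2)
summary(ancova.mod)   # treatment effects after adjusting for the covariate years2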
ۧۧ As the graphs suggest, box plots for the measurements show that
versicolor and virginica are more similar to each other than either is
to setosa.
library(MASS)
data(iris)
attach(iris)
# MANOVA test
man.mod = manova(cbind(Sepal.Length, Petal.Length) ~ Species,
data = iris)
man.mod
Call:
manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)
Terms:
Species Residuals
resp 1 63.2121 38.9562
resp 2 437.1028 27.2226
Deg. of Freedom 2 147
Residual standard errors: 0.5147894 0.4303345
Estimated effects may be unbalanced
summary(man.mod)
Df Pillai approx F num Df den Df Pr(>F)
Species 2 0.9885 71.829 4 294 < 2.2e-16 ***
Residuals 147
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary.aov(man1)[2]
Df Sum Sq Mean Sq F value Pr(>F)
Species 2 11.35 5.672 49.16 <2e-16 ***
Residuals 147 16.96 0.115
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary.aov(man1)[3]
Df Sum Sq Mean Sq F value Pr(>F)
Species 2 437.1 218.55 1180 <2e-16 ***
Residuals 147 27.2 0.19
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary.aov(man1)[4]
Df Sum Sq Mean Sq F value Pr(>F)
Species 2 80.41 40.21 960 <2e-16 ***
Residuals 147 6.16 0.04
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
SUGGESTED READING
Crawley, The R Book, “Analysis of Covariance,” chap. 12.
Faraway, Linear Models with R, “Analysis of Covariance,” chap. 13; and
“Factorial Designs,” chap. 15.
PROBLEMS
1 Which of the following statements about ANCOVA are true?
The explanatory variables are X1 = pen, X2 = pencil, and X3 = marker. How would
you model this situation?
a) ANOVA
b) ANCOVA
c) MANOVA
d) regression tree
STATISTICAL DESIGN
OF EXPERIMENTS
In this lecture, you will gain an understanding of how
experiments should be designed so that you collect
sound statistical data. You will be introduced to
the basic terminology. You will also learn techniques
for effective experimental design, along with 2 well-
known experimental design models: the randomized
block design and the 2ᵏ factorial design.
DESIGN OF EXPERIMENTS
ۧۧ The steps needed to plan and conduct an experiment closely follow the
scientific method.
ۧۧ It’s important that we take the necessary time and effort to organize
the experiment appropriately so that we have a sufficient amount of
data to answer the question. This process is called experimental design.
ൖൖ Variables are all of the factors, their treatments, and the measured
responses.
RANDOMIZATION
ۧۧ A technique for effective experimental design is randomization.
Treatments should be allocated to experimental units randomly.
ۧۧ Replication gives us more power to reject null hypotheses ( that the
treatment effects are all 0 ) and helps when we have missing data or possible
outliers. When possible, we want to base our theories on reproducible
results ( although this applies more to replicating your whole study
than to just using larger samples ).
ۧۧ The more replication we have, the more variability we can observe in the
response variable, separate from the treatment effects. When we increase
the number of replications, we increase the reliability of the outcome.
BLOCKING
ۧۧ The next technique is blocking, which refers to the distribution of the
experimental units into blocks in such a way that the units within each
block are homogeneous.
SAMPLE SIZE
ۧۧ Another technique is sample size. The decision between the sample size
and the cost will always be a compromise—unless you have infinite
resources.
3 The third column gives the degrees of freedom for sum of squares
for treatments, for sum of squares for blocks, for error, and for total.
4 The fourth column provides us with the mean square for treatments,
the mean square for blocks, and the mean square error.
5 Under the null hypothesis, our group means are all equal to 0
and there would be no treatment effect. If our F ratio is close to
1, meaning that the mean square for treatments is close to the
mean square error, we fail to reject H0 . Otherwise, if there’s a valid
treatment effect, then the mean square for treatments will be large
when compared to the mean square error and our F statistic will
allow us to reject the null hypothesis.
Source    Sum of Squares    Degrees of Freedom        Mean Square
Blocks    SSBlocks          𝑏 − 1                     SSBlocks/( 𝑏 − 1 )
Error     SSE               ( 𝑎 − 1 )( 𝑏 − 1 )        SSE/[( 𝑎 − 1 )( 𝑏 − 1 )]
ۧۧ Each of the factors have 2 levels ( e.g., “low” or “high” ), which may be
qualitative or quantitative.
ۧۧ Let’s begin by specifying the first 4 of factor A to be high ( + + + + ), which
requires the last 4 of factor A to be low ( − − − − ). Once A is factored, we
have to let B factor within A. For A’s 4 positives, B will be both ( + + ) and
( − − ). Likewise, for A’s 4 negatives, B will be both ( + + ) and
ۧۧ All that’s left is to factor in C. For A both positive and B both negative,
factor C comes in as a positive and a negative. Likewise, for A both
positive and B both positive, factor C comes in as both a positive and a
negative.
2² Factorial Design
ۧۧ R can help you here. Load the “BHH2” library and run “ffDesMatrix( 2 ),”
which generates 2ᵏ factorial designs.
library(BHH2)
print(ffDesMatrix(2))
2² = 4 factorial design
[,1] [,2]
[1,] -1 -1
[2,] 1 -1
[3,] -1 1
[4,] 1 1
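ۧۧ The same function generates the 2³ pattern described earlier ( a sketch; the
row ordering may differ from the lecture’s table ):
print(ffDesMatrix(3))   # 2^3 = 8 runs; columns give the -1/+1 settings for A, B, and C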
ۧۧ Once your design is specified and data has been collected, the analysis
of the experiment is similar to many of the ANOVA/ANCOVA and
MANOVA techniques that you’ve already learned about.
PITFALL
ۧۧ We can use blocking to control for a parameter that may not be of
immediate interest but that has to be accounted for in the analysis. But
don’t forget about ANCOVA. It’s possible that you could save a step and
use ANCOVA to measure and remove the effect of that factor from the
analysis. We would only have to adjust statistically to account for the
covariate, whereas in blocking, we would have to build the block factor
into the design of the experiment from the start.
PROBLEMS
1 Replication tells us the number of samples that we need to take for each
treatment. Which of the following statements about replications are true?
A decision tree is a graph that uses a branching
method to determine all possible outcomes of a
decision. The structure of a decision tree is similar
to a real tree, with a root, branches, and even leaves, but
it is an upside-down tree. We start at the root up top and
work our way down to the leaves. Classification trees and
regression trees are easily understandable and transparent
methods for predicting or classifying new records.
1 Take all of your data. Consider all possible values of all predictor
variables.
2 Choose the predictor variable and split value that give the greatest
separation in the response ( the split that most reduces the sum of
squared errors ).
3 If Xi < 3, then send the data to the left; otherwise, send data points to
the right. Notice that we just do binary splits.
ۧۧ Trees give us rules that are easy to interpret and implement. Decision
trees more closely mirror the human decision-making approach
than linear regression. Trees can be displayed graphically and are
easily interpreted even by a nonexpert. Also, trees don’t require the
assumptions of statistical models and work well even when some data
values are missing.
ۧۧ Trees for continuous outcomes are called regression trees, while trees
for categorical outcomes are called classification trees.
REGRESSION TREES
ۧۧ Regression trees are a simple yet powerful way to predict the response
variable based on partitioning the predictor variables. The idea is to
split the data into partitions and to fit a constant model of the response
variable in each partition.
ۧۧ Regression trees use the sum of squares. In fact, the way that we split
our data is to find the point of greatest separation in ∑[𝑦 − E( 𝑦 )]².
ൖൖ Parent node ( A ): A node at the top that splits into lower child nodes.
ۧۧ The first model we’ll consider is one using just 1 predictor variable of
weight to model mileage. As long as mileage is a numeric variable, “tree”
assumes that we want a regression tree model:
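ۧۧ A hedged reconstruction of that first fit ( car.test.frame ships with the
rpart package; the object name my.tree matches the plotting commands below ):
library(tree)
library(rpart)                 # provides the car.test.frame dataset
data(car.test.frame)
my.tree = tree(Mileage ~ Weight, data = car.test.frame)
summary(my.tree)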
ۧۧ The summary function that is associated with “tree” lists the formula
along with the associated datasets for the number of terminal nodes
( or leaves ) and gives us the residual mean deviance. ( This is the mean
square, which equals the sum of squares divided by N minus the number
of nodes, or 60 – 6, which gives us 54. ) We also have the 5-number
summary of the residuals.
Regression tree:
tree(formula = Mileage ~ Weight, data = car.test.frame)
Number of terminal nodes: 6
Residual mean deviance: 4.249 = 229.4 / 54
Distribution of residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.8890 -1.0620 -0.0625 0.0000 1.2330 4.3750
plot(my.tree)
text(my.tree)
ۧۧ The second and third splits occurred at weights of 2280 and 3087.5. The
fourth and fifth splits occurred at weights of 2747.5 and 3637.5.
ۧۧ Values at the leaves show the mean mileage ( miles per gallon )
associated with each of the 6 final data subsets.
ۧۧ For classic regression trees, the local model is the average of all values
in the last terminal node. The flexibility of a tree—its ability to help us
correctly classify data—depends on how many leaves it has.
ۧۧ How large should we grow our tree? A large tree will fragment the data
into smaller and smaller samples. This often leads to a model that overfits
the sample data and fails if we want to predict. On the other hand, a small
tree might not capture the important relationships among the variables.
ۧۧ There are default values in “tree” that determine the stopping rules
associated with having a few remaining samples or splits that add
information to the model. We use stopping rules to decide when the
tree-growing process should stop.
ۧۧ There are a few common stopping rules: The node won’t be split
if the size of a node is less than the user-specified minimum node
size, if the split of a node results in a child node whose node size is
less than the user-specified minimum child-node size value, or if
the improvement at the next split is smaller than the user-specified
minimum improvement.
ۧۧ The resulting tree has 7 splits with 8 terminal nodes. We could change the
parameters even more to get a bigger tree. But given the tendency for
tree models to overfit data, how do we know when we have a good model?
ۧۧ Tree models use a cross-validation technique that splits the data into a
training set for model fitting and a testing set to evaluate how good the
fit is. The following is how cross-validation works:
1 Split the data into a training set and a testing set.
2 Grow the tree using only the training data.
3 Prune the tree. At each pair of leaf nodes with a common parent,
calculate the sum-of-squares error on the testing data. Check to see
if the error would be smaller by removing those 2 nodes and making
their parent a leaf. ( Go around snipping the children off. )
ۧۧ The dotted line is a guide for a cutoff value relative to our complexity
parameter. The optimal choice of tree size is 5, because this is the first
value that falls below the dotted line. Going from 5 to 6 splits doesn’t
reduce the complexity parameter by a minimum of 0.01.
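ۧۧ With the tree package, the cross-validation and pruning steps look roughly
like this ( a sketch using the my.tree object from the earlier fit; the
lecture’s own plots may come from different commands ):
cv.results = cv.tree(my.tree)                 # cross-validated deviance for each tree size
plot(cv.results)
pruned.tree = prune.tree(my.tree, best = 5)   # keep the 5-leaf tree suggested above
plot(pruned.tree); text(pruned.tree)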
ۧۧ Let’s create a regression tree that predicts car mileage from price,
country, reliability, and car type. We have an optimal number of splits at
4, because the complexity parameter, “cp,” doesn’t decrease by at least
0.01 from 4 to 5 splits.
ۧۧ The decision-making input variables that are used to split the data can
be numerical or categorical. Outcome is categorical, so we use the mode
of the terminal nodes as the predicted value.
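ۧۧ A hedged reconstruction of the classification tree fit summarized below:
iris.tree = tree(Species ~ Sepal.Width + Petal.Width, data = iris)
summary(iris.tree)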
Classification tree:
tree(formula = Species ~ Sepal.Width + Petal.Width, data =
iris)
Number of terminal nodes: 5
Residual mean deviance: 0.204 = 29.57 / 145
Misclassification error rate: 0.03333 = 5 / 150
ۧۧ Of our 5 nodes, our classification tree only misclassifies 5 out of the 150
flowers, for a misclassification rate of around 3%.
ۧۧ We’re much less confident in our classification at point 107. We’re only
40% confident that it belongs to versicolor and 60% confident that it
belongs to virginica. This is where misclassifications take place.
PITFALL
ۧۧ Classification trees and regression trees may not perform well when
you have structure in your data that isn’t well captured by horizontal
or vertical splits. For example, in the following plot, while there’s
separation in the data, horizontal and vertical splits won’t help us
classify the data very well.
SUGGESTED READING
Crawley, The R Book, “Tree Models,” chap. 23.
Faraway, Extending the Linear Model with R, “Trees,” chap. 16.
PROBLEMS
1 The “kyphosis” dataset has 81 rows and 4 columns representing data on
children who have had corrective spinal surgery. It contains the following
variables:
Install the “tree” library and fit a regression tree to the “kyphosis” data using
only the variable “age.” Plot the tree and comment on the residual mean
deviance and misclassification rate.
library(tree)
library(rpart)   # the kyphosis dataset is provided by the rpart package
data(kyphosis)
tree1 <- tree(Kyphosis ~ Age, data = kyphosis)
plot(tree1)
text(tree1)
summary(tree1)
POLYNOMIAL AND
LOGISTIC REGRESSION
Linear models can be used to model a variety of
data and are relatively easy to fit in R. Applying
transformations to data that might not fit the
normality assumption gives us even more modeling
flexibility. But what about data that still doesn’t conform
to normality even after a transformation? Trees are one
possibility, but even tree-fitting methods aren’t effective
on data that doesn’t have natural splits. What can we do
when we have data for which transformations and tree
algorithms aren’t effective? In this lecture, you will learn
about polynomial regression and logistic regression.
POLYNOMIAL REGRESSION
ۧۧ Polynomial regression lets us extend the linear model by adding powers
of predictors to our model. This gives us a clean way to give a nonlinear
fit to our data.
library(MASS)
data(Boston)
names(Boston)
[1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
[8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
ۧۧ Here are the variables in the dataset. Let’s restrict our attention to
the last 2: the median house value for select neighborhoods in Boston
( medv ), which is the response variable, and the lower status of the
population by percentage ( lstat ), which is the explanatory variable. Let’s
use simple linear regression.
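ۧۧ A sketch of that simple fit ( the object name is an assumption ):
lstat.lm = lm(medv ~ lstat, data = Boston)
summary(lstat.lm)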
Residuals:
Min 1Q Median 3Q Max
-15.168 -3.990 -1.318 2.034 24.500
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.55384 0.56263 61.41 <2e-16 ***
lstat -0.95005 0.03873 -24.53 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
Call:
lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
Residuals:
Min 1Q Median 3Q Max
-15.2834 -3.8313 -0.5295 2.3095 25.4148
ۧۧ This just fits the same model as median value on lower status. Inside a
model formula, the “^” symbol is interpreted as a formula operator rather
than arithmetic, so “lstat^2” collapses back to “lstat.” The “I( )” function
tells R to treat the squaring as ordinary arithmetic, so it is necessary.
Call:
lm(formula = medv ~ lstat + lstat^2, data = Boston)
Residuals:
Min 1Q Median 3Q Max
-15.168 -3.990 -1.318 2.034 24.500
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.55384 0.56263 61.41 <2e-16 ***
lstat -0.95005 0.03873 -24.53 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
Residuals:
Min 1Q Median 3Q Max
-14.5441 -3.7122 -0.5145 2.4846 26.4153
ۧۧ Instead, we want our model to approximate the true model for the
entire population. Our model shouldn’t just fit the current sample; it
should fit new samples, too.
ۧۧ Imagine that we break X into bins and fit a different constant in each
bin. Essentially, we create the bins by selecting K cut points in the range
of X and then construct K + 1 new variables, which behave like dummy
variables ( with only 1 or 0 as values ).
ۧۧ Let’s consider a 5-year step pattern for the Boston data. Here’s what a
step function would look like on our Boston dataset.
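ۧۧ One way to fit such a step function in R is to bin the predictor with
cut( ) ( a sketch; the number of bins is an assumption ):
step.fit = lm(medv ~ cut(lstat, breaks = 5), data = Boston)
summary(step.fit)   # one estimated constant per bin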
LOGISTIC REGRESSION
ۧۧ Another class of cases where we might want to fit a model to nonlinear
data is logistic regression. Often, we face the problem of modeling
binary data. We might want to predict whether patients will live or die,
species will thrive or go extinct, buildings will fall or stay standing, or a
patient will accept or reject an organ transfer.
ۧۧ The problem is that we’re dealing with a binary outcome. One solution
would be to just code “alive” as 1 and “dead” as 0 and fit a line through
those points as we would with linear regression. It’s possible. We could
then try to interpret values on the fitted line as the probability of
survival.
ۧۧ The problem is that we’re dealing with a probability, but the regression
can easily give us illegal values less than 0 or greater than 1. Also,
linear models don’t handle probabilities well. For example, smoking
2 cigarettes per day might double your risk of cancer compared to
smoking 1 per day, but increasing from 11 to 12 per day may not make
such a big difference. And once we predict that a patient has a 100%
chance of an outcome, we’re maxed out, and no new information about
the patient could improve his or her odds.
ۧۧ The odds give us a value that ranges from nearly 0 ( for very small
probabilities ) to positive infinity ( for probabilities essentially at 1 ).
ۧۧ But with one more transformation, we can get a value that’s unbounded
over the real numbers. This is the logit function: 𝑦 = log[𝑝/( 1 − 𝑝 )].
Probabilities transformed through the logit function are known as
logit values.
ۧۧ In our case, there’s a variety of options, but the most commonly used
is the logit function ( and the similar-looking probit function, which is
based on a normal distribution ).
                  am cyl  hp    wt
Mazda RX4          1   6 110 2.620
Mazda RX4 Wag      1   6 110 2.875
Datsun 710         1   4  93 2.320
Hornet 4 Drive     0   6 110 3.215
Hornet Sportabout  0   8 175 3.440
Valiant            0   6 105 3.460
ۧۧ We use the “glm( )” function to create the regression model and get its
summary for analysis. In R, we fit a GLM in the same way as a linear
model, except using “glm” instead of “lm,” and we must also specify the
type of GLM to fit using the “family” argument.
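ۧۧ A hedged sketch of that call ( the object name and predictor set are
inferred from the output below ):
logit.mod = glm(am ~ cyl + hp + wt, data = mtcars, family = binomial)
summary(logit.mod)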
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 19.70288 8.11637 2.428 0.0152 *
cyl 0.48760 1.07162 0.455 0.6491
hp 0.03259 0.01886 1.728 0.0840 .
wt -9.14947 4.15332 -2.203 0.0276 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43.2297 on 31 degrees of freedom
Residual deviance: 9.8415 on 28 degrees of freedom
AIC: 17.841
Number of Fisher Scoring iterations: 8
ۧۧ We see the associated z-value. This is the z-statistic that tests whether
our coefficients are significant.
ۧۧ The last column is the 𝑝-value. Remember that 𝑝-values less than 0.05
indicate that the coefficient is statistically significant.
ۧۧ Because the 𝑝-value in the last column is more than 0.05 for the
variables “cyl” and “hp,” we consider them to be insignificant in
contributing to the value of the variable “am” ( t ransmission ). Only “wt”
( weight ) impacts the “am” value in this logistic regression model.
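ۧۧ Refitting with weight alone ( a sketch matching the output below ):
logit.wt = glm(am ~ wt, data = mtcars, family = binomial)
summary(logit.wt)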
Deviance Residuals:
Min 1Q Median 3Q Max
-2.11400 -0.53738 -0.08811 0.26055 2.19931
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 12.040 4.510 2.670 0.00759 **
wt -4.024 1.436 -2.801 0.00509 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43.230 on 31 degrees of freedom
Residual deviance: 19.176 on 30 degrees of freedom
AIC: 23.176
Number of Fisher Scoring iterations: 6
ۧۧ Our estimate for weight changed from approximately −9 to −4, and the
𝑝-value decreased from 0.0276 to 0.005.
ۧۧ We need to work our way back through the logit score’s construction
process.
ۧۧ Thus, the model says that a car that weighs 4000 pounds has a 1.7%
chance of being a manual.
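ۧۧ That back-transformation can be checked directly with the coefficients
above ( plogis( ) is R’s inverse logit ):
logit.4000 = 12.040 - 4.024*4   # wt is measured in 1000s of pounds, so 4000 pounds is wt = 4
plogis(logit.4000)              # about 0.017, a 1.7% chance of being a manual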
PITFALLS
ۧۧ Don’t try to extrapolate from fitted polynomials.
ۧۧ Higher-order polynomials are great fits to the data at hand. They can
often overfit the data. But predictions or extrapolations for values of 𝑥
outside of its range will often be incorrect and invalid.
SUGGESTED READING
Diez, Barr, and Cetinkaya-Rundel, OpenIntro Statistics, “Multiple and Logistic
Regression,” section 8.4.
Yau, R Tutorial, “Logistic Regression,” http://www.r-tutor.com/elementary-
statistics/logistic-regression.
PROBLEMS
1 In polynomial regression, different ________ are added to an
equation to see whether they increase the ________ significantly.
a) powers of X; variance
b) functions of X; degrees of freedom
c) powers of X; R2
d) significant; mean square error
SPATIAL STATISTICS
We tend to think of data as being fixed in time and
space, collected in the moment to be analyzed in
the moment. But what happens when our data
is on the move—when values of our variables change over
space? We can use a branch of statistics called spatial
statistics, which extends traditional statistics to support
the analysis of geographic data. It gives us techniques to
analyze spatial patterns and measure spatial relationships.
SPATIAL STATISTICS
ۧۧ Spatial statistics is designed specifically for use with spatial data—
with geographic data. These methods actually use space ( area, length,
direction, orientation, or some notion of how the features in a dataset
interact with each other ), and space is right in the statistics. That’s
what makes spatial statistics ( also known as geostatistics ) different
from traditional statistical methods.
ൖൖ library( sp );
ൖൖ library( spdep );
ൖൖ library( rgeos );
ൖൖ library( geoR ); and
ൖൖ library( gstat ).
ۧۧ There are other packages that help with visualization of a spatial analysis:
ൖൖ library( raster );
ൖൖ library( rasterVis );
ൖൖ library( maptools ); and
ൖൖ library( ggmap ).
ۧۧ Spatial data come in several different formats, such as points, lines, and
polygons. For example, customers might be points, roads could be lines,
and zoning districts could be polygons. In R, any of that information can
be handled in a vector format or a raster format.
ۧۧ The package “sp” defines a set of spatial objects for R, including points,
lines, and polygon vectors; and gridded/pixel raster data.
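ۧۧ As a small sketch ( the three coordinates here are made up ), a set of point locations can be stored as a SpatialPoints object:
library(sp)
coords <- matrix(c(-78.6, 35.8,
                   -78.9, 36.0,
                   -79.1, 35.9), ncol = 2, byrow = TRUE)   # longitude, latitude pairs
pts <- SpatialPoints(coords)   # a point (vector) object from the "sp" package
summary(pts)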
ۧۧ Here is data for 100 counties in North Carolina that includes the counts
of live births and deaths due to sudden infant death syndrome ( SIDS )
for 2 periods: July 1974–June 1978 and July 1979–June 1984.
ۧۧ This map has the latitude and longitude locations for the North Carolina
counties along with other data from the late 1970s and early 1980s.
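ۧۧ A sketch of loading this dataset ( in current packages it ships as “nc.sids” with the companion “spData” package, which “spdep” attaches ):
library(spdep)   # attaches spData, which provides the nc.sids data frame
data(nc.sids)
names(nc.sids)   # includes BIR74, SID74, BIR79, SID79, and county coordinates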
ۧۧ There are various ways that we define the neighbor relationship, but 3
of the most popular are rook, queen, and K-nearest neighbors. Rook and
queen neighbors relate to the moves that those 2 pieces can make in a
chess game.
ۧۧ The queen’s neighbors include all of the rook’s neighbors plus diagonal
neighbors. In the queen’s neighbor, any areas sharing any boundary
point are taken as neighbors. Here are the queen’s neighbors for the
North Carolina dataset.
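ۧۧ A hedged sketch with the “spdep” package ( “nc_shape” is a hypothetical SpatialPolygons object holding the 100 county boundaries ):
library(sp)
library(spdep)
queen_nb <- poly2nb(nc_shape, queen = TRUE)    # queen contiguity: any shared boundary point
rook_nb  <- poly2nb(nc_shape, queen = FALSE)   # rook contiguity: shared boundary segments only
knn_nb   <- knn2nb(knearneigh(coordinates(nc_shape), k = 1))   # 1-nearest neighbor
summary(queen_nb)   # reports the number of nonzero links and the average per location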
ۧۧ The queen’s neighbor structure has 490 nonzero links, with an average of 4.9
links per spatial location, whereas the 1-nearest-neighbor structure has only 200
nonzero links, with an average of 2 links per spatial location.
ۧۧ We can choose the type of weight that we want to model on our data—
whether queen, rook, or K-nearest neighbor—based on our research
question and underlying spatial autocorrelation of the data. Most
statisticians compare multiple model fits to determine the best one.
ۧۧ We can test for spatial autocorrelation with Moran’s I. The null hypothesis for
the test is that the data is randomly dispersed: there’s no spatial correlation.
The alternate hypothesis is that the data is more spatially clustered than would
be expected by chance alone.
ۧۧ We can calculate Moran’s I for our North Carolina SIDS data. Notice that
the 𝑝-value of 0.007 is significant, so we reject the null hypothesis that
our spatial locations are random.
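ۧۧ A hedged sketch of the calculation ( “queen_nb” is the neighbor list from the earlier sketch, and “sids_rate” is a hypothetical vector of county SIDS rates in the same county order ):
library(spdep)
lw <- nb2listw(queen_nb, style = "W")   # row-standardized spatial weights
moran.test(sids_rate, lw)               # a small p-value (about 0.007 here) rejects spatial randomness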
ۧۧ Blue values are low, with a relative risk of less than 5%, while redder
values have a relative risk of greater than 95%. Notice the clustering of
counties with higher relative risk of SIDS.
SEMIVARIOGRAMS
ۧۧ Here’s a locally weighted plot of
samples of coal taken over a spatial
area. The z-axis shows the amount of
coal ( not elevation ). Peaks, or high
points, represent spatial areas where
larger amounts of coal were found.
Valleys, or low points, represent areas
where smaller amounts were found.
ۧۧ Imagine that you stand in the center of your data and begin to walk
north. You can calculate the semivariogram as you go along to find
out at which point you’re no longer correlated with where you started.
( That distance is the range; the semivariance value at which the curve
levels off is the sill. )
ۧۧ Now imagine that you walked east from the center, then south, and then
west. Those 4 semivariograms give us a spatial landscape.
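ۧۧ A minimal sketch of this with the “gstat” package ( the data frame “coal_df” and its columns x, y, and thickness are hypothetical stand-ins for the coal samples ):
library(sp)
library(gstat)
coordinates(coal_df) <- ~ x + y              # promote the data frame to a spatial object
v <- variogram(thickness ~ 1, coal_df,
               alpha = c(0, 45, 90, 135))    # directional semivariograms: N, NE, E, SE
plot(v)   # estimate the nugget, sill, and range in each direction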
ۧۧ We could also fit a spatial model to our data and analyze the residuals,
much like we did in linear regression.
ۧۧ What do you think the semivariogram would look like if there was no
spatial correlation—no spatial pattern left in the residuals?
PITFALL
ۧۧ Unlike other instances where R’s default settings are usually fine, R
can’t always pick the best way to divide spatial data for you. You need
to look at spatial data from all different directions and choose the
semivariogram that most closely models your research needs.
SUGGESTED READING
Crawley, The R Book, “Spatial Statistics,” chap. 26.
PROBLEMS
1 Estimate the range, sill, and nugget
in the semivariogram shown here.
2 Which of the following statements about semivariograms tend to be true?
a) Two sample points at the same location are likely to have the same
semivariance.
b) Values that occur at a distance prior to where the graph starts leveling out
are spatially autocorrelated.
c) When distance increases, the semivariance increases.
d) If there are fewer pairs of points separated by far distances, the
correlations between them will tend to decrease.
e) All of the above tend to be true.
Time series analysis gives us a way to model response
data that’s correlated with itself from one point in
time to the next—data that has a time dependence.
To analyze that type of data, we need new methods that
can model a dependency on time. We’ve traditionally
looked at modeling the relationship of ( predictor ) X versus
( response ) Y, whether by linear, polynomial, or logistic
regression. But now, we have a single measurement
from a population and we want to understand how
that measurement changes over time. Time is the
independent predictor variable. Our goal is to understand
how our response, Y, varies with respect to time.
TIME SERIES
ۧۧ A time series is a collection of evenly spaced numerical observations
that are measured successively in time.
ۧۧ The data file “wages” contains monthly values of the average hourly
wages ( in dollars ) for workers in the U.S. apparel and textile products
industry for July 1981 through June 1987.
library(TSA)
data(wages)
plot(wages,ylab='Monthly Wages',type='o')
ۧۧ Time series analysis follows in much the same way: Fit a line to the
data as a function of time ( predictor ) and check the residuals ( what’s
left over from the fit ). They should look like noise ( not have any
pattern or trend ).
wages.lm=lm(wages~time(wages))
summary(wages.lm)
Call: lm(formula = wages ~ time(wages))
Residuals:
Min 1Q Median 3Q Max
-0.23828 -0.04981 0.01942 0.05845 0.13136
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.490e+02 1.115e+01 -49.24 <2e-16 ***
time(wages) 2.811e-01 5.618e-03 50.03 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
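ۧۧ We can then plot the standardized residuals from this straight-line fit ( a sketch; the plotting call for this step isn’t reproduced in the text ):
plot(y = rstudent(wages.lm), x = as.vector(time(wages)),
     ylab = 'Standardized Residuals', type = 'o', main = "Residual Plot")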
ۧۧ These residuals don’t look random at all. They hang together too closely,
suggesting that there’s still more structure to be removed from the
data. Let’s try a quadratic fit on the wages time series.
wages.lm2=lm(wages~time(wages)+I(time(wages)^2))
summary(wages.lm2)
Call: lm(formula = wages ~ time(wages) + I(time(wages)^2))
Residuals:
Min 1Q Median 3Q Max
-0.148318 -0.041440 0.001563 0.050089 0.139839
plot(y=rstudent(wages.lm2), x=as.vector(time(wages)), ylab='Standardized
Residuals', type='o', main="Residual Plot")
hist(rstudent(wages.lm2),xlab='Standardized Residuals')
qqnorm(rstudent(wages.lm2))
qqline(rstudent(wages.lm2))
TYPES OF MODELS
ۧۧ The basic objective usually is to determine a model that describes
the pattern of the time series. Let’s consider the 2 types of models
for the pattern in a time series: autoregressive models and moving
average models.
ۧۧ An autoregressive model of order 1, written AR( 1 ), predicts the current value
of the series from the immediately preceding value: Yt = δ + φ1Yt−1 + et. The
assumptions of the AR( 1 ) model are that the errors et are independently
distributed with a normal distribution that has mean 0 and constant variance
and that the errors are independent of past values of the series.
ۧۧ Moving average ( MA ) models are ones that relate the present value of a
series to past prediction errors: an MA( 1 ) model, for example, has the form
Yt = μ + et + θet−1.
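ۧۧ As a quick illustration ( simulated data, not from the lecture ), we can generate short AR( 1 ) and MA( 1 ) series and compare their behavior:
set.seed(1)
ar1 <- arima.sim(model = list(ar = 0.8), n = 120)   # AR(1): each value depends on the previous value
ma1 <- arima.sim(model = list(ma = 0.8), n = 120)   # MA(1): each value depends on the previous error
par(mfrow = c(1, 2))
plot(ar1, main = "Simulated AR(1)"); plot(ma1, main = "Simulated MA(1)")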
SEASONAL SERIES
ۧۧ The other common pattern in a time series dataset is seasonality, which
is any kind of periodicity.
ۧۧ A time series repeats itself after a regular period of time. The business
cycle plays an important role in economics. In time series analysis, the
business cycle is typically represented by a seasonal ( or periodic ) model.
ۧۧ The first difference of a time series is the series of changes from one
period to the next. If Yt denotes the value of the time series Y at period t,
then the first difference of Y at period t is equal to Yt − Yt−1.
plot(diff(log(airpass)),type='o',ylab='Difference of
Log(Air Passengers)')
plot(diff(log(airpass)),type='l',ylab='Difference of Log(Air Passengers)')
ۧۧ Let’s check out the time series plot of the seasonal difference of the first
difference of the logged series.
plot(diff(diff(log(airpass)),lag=12),type='l',ylab='First
& Seasonal Differences of Log(AirPass)')
points(y=diff(diff(log(airpass)),lag=12),
x=time(diff(diff(log(airpass)),lag=12)),
pch=as.vector(season(diff(diff(log(airpass)),lag=12))))
ۧۧ We’ve accounted for the structure in our data through taking the log,
the first difference, and the first seasonal difference.
ۧۧ Here, the ACF gives correlations between the series 𝑦t and lagged values
of the series for lags of 1, 2, 3, and so on. This is a visual plot of 𝜌1, 𝜌2 , ….
ۧۧ The ACF can be used to identify the possible structure of time series
data. That can be tricky because there often isn’t a single clear-cut
interpretation of a sample ACF.
ۧۧ In a different way, the ACF of the residuals for a model is also useful.
The ideal for an ACF of residuals is that there aren’t any significant
correlations for any lag, because then your model has taken into account
all of the structure in the data.
ۧۧ The following is the ACF of the residuals for the wages example, where
we used an AR( 1 ) model. The lag ( time span between observations ) is
shown along the horizontal, and the autocorrelation is on the vertical.
The dotted lines indicate bounds for statistical significance.
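ۧۧ A hedged sketch of such a fit ( the lecture’s exact call isn’t shown; here an AR( 1 ) model with a linear time trend is fit to the wages series ):
wages.ar1 <- arima(wages, order = c(1, 0, 0), xreg = time(wages))
acf(residuals(wages.ar1), main = 'ACF of AR(1) Residuals')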
ۧۧ Here’s the ACF of the seasonal difference of the first difference of the
logged series for air passengers.
acf(as.vector(diff(diff(log(airpass)),lag=12)),ci.type='ma',
main='First & Seasonal Differences of Log(AirPass)')
model=arima(log(airpass),order=c(0,1,1),
seasonal=list(order=c(0,1,1), period=12))
model
Call: arima(x = log(airpass), order = c(0, 1, 1), seasonal
= list(order = c(0, 1, 1), period = 12))
Coefficients:
ma1 sma1
-0.4018 -0.5569
s.e. 0.0896 0.0731
sigma^2 estimated as 0.001348: log likelihood = 244.7,
aic = -485.4
plot(model,n1=c(1969,1),n.ahead=24,pch=19,ylab='Log(Air
Passengers)')
ۧۧ The forecasts follow the seasonal and upward trend of the time series
nicely. The forecast limits provide us with a clear measure of the
uncertainty in the forecasts. We can also plot the forecasts and limits
in original terms.
plot(model,n1=c(1969,1),n.ahead=24,pch=19,ylab='Air
Passengers',transform=exp)
ۧۧ In original terms, it is easier to see that the forecast limits spread out as
we get further into the future.
ۧۧ Just because you found one kind of seasonality in time series doesn’t
mean that there’s not a second and a third kind as well.
ۧۧ Any time series of currency ( dollars, pounds, etc. ) has to take into
account the fact that the value of money changes over time. So, if we
were modeling changes in home prices over time, then we would need
to correct our data to some common base. Usually this is done by
starting with a base year ( often the start or end of the series, or the
current year ) and adjusting values based on changes to some official
inflation statistic ( e.g., the consumer price index ).
ۧۧ Perhaps the biggest advantage of using time series is that we can use it
to model the past—and better understand the future.
PROBLEMS
1 In the R library “TSA,” there’s a dataset called “beersales” that contains the
monthly U.S. beer sales (in millions of barrels) from January 1975 through
December 1990.
a) Make a time series plot for “beersales” using the first letter of each month
as the plotting symbols.
library(TSA)
data("beersales")
plot(beersales, main="Monthly US Beer Sales",type='l')
points(y=beersales,x=time(beersales), pch=as.
vector(season(beersales)))
beer.model=lm(beersales~season(beersales))
summary(beer.model)
plot(y=beer.model$residuals, x=as.vector(time(beersales)), type='l',
main="Residuals", xlab="Year", ylab="Residuals")
points(y=beer.model$residuals, x=as.vector(time(beersales)),
pch=as.vector(season(beersales)))
hist(beer.model$residuals, main="Beer Residuals")
qqnorm(beer.model$residuals, pch=20)
qqline(beer.model$residuals)
shapiro.test(beer.model$residuals)
b) Take first seasonal differences of the data, with a season = 12 months. Fit a
seasonal-means trend to the data and examine the results.
c) Check for normality of the residuals by plotting the residuals, along with
their histogram and Q-Q plot, and performing a Shapiro-Wilk test.
plot(y=beer.model2$residuals, x=as.vector(time(beer.diff)), type='l',
main="Seasonal Differenced Residuals", xlab="Year", ylab="Residuals")
points(y=beer.model2$residuals, x=as.vector(time(beer.diff)),
pch=as.vector(season(beer.diff)))
qqnorm(beer.model2$residuals, pch=20)
qqline(beer.model2$residuals)
Thus far, our approach to statistical inference
has been greedy for new data. Getting more
data and increasing our sample size are
the keys to making our inferences even more reliable.
The approach we have been following is called a
frequentist approach. But what about also using prior
data and prior information? This is the central idea of
Bayesian inferential statistics, which gives us a way
to update our prior beliefs based on observed data.
BAYESIAN STATISTICS
ۧۧ Bayesian statistics is an entirely different approach for doing statistical
inference. In other words, it’s not just another technique like regression,
hypothesis testing, or ANOVA.
ۧۧ The methods are based on the idea that before we ever observe data, we
have some prior belief about it, perhaps based on experience or other
experiments. As we observe new data, we use the data to update our belief.
ۧۧ A lot of what you’ve learned in this course stays the same with Bayesian
statistics. Histograms, box plots, and numerical summaries stay
largely the same. Discrete and continuous distributions, along with
special cases such as the binomial, exponential, uniform, and normal
distributions, also carry over unchanged.
ۧۧ But the way that we have been drawing conclusions about a population
of interest through t-tests and hypothesis testing looks very different in
a Bayesian approach to inference.
ۧۧ Most of the methods learned so far in this course take this frequentist
approach, where we assume the following:
ۧۧ Statistical analyses are judged by how well they perform in the long run
over an infinite number of hypothetical repetitions of the experiment.
ۧۧ For example, your prior belief might be that a coin is fair and P( heads ) =
0.5. You might observe 50 flips of the coin, where 42 out of 50 flips were
heads. This might lead you to change your belief that the coin was fair.
ۧۧ Suppose that your uncertainty about an unknown parameter is described by a
prior distribution and that you then obtain some data relevant to that parameter.
The data changes your uncertainty, which is then described by a new probability
distribution, called the posterior distribution, which reflects the
information both in the prior distribution and the data. In other words,
in Bayesian statistics, we start with a prior belief about a parameter, obtain
some data, and update our belief about the parameter.
2 If the hypothesis predicted the data well—that is, the data was what
we would have expected to occur if the hypothesis had been true.
2 P( θ ) is the prior probability, which describes how sure we were that
θ was true before we observed the data.
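ۧۧ For reference, these pieces combine through Bayes’ theorem: the posterior is
proportional to the likelihood times the prior, P( θ | data ) = P( data | θ ) × P( θ ) / P( data ).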
ۧۧ For squared error loss, the posterior mean is the Bayes estimator.
ۧۧ For absolute error loss, the posterior median is the Bayes estimator.
ۧۧ The absolute error is the dashed line, and the squared error is the solid
line. Depending on where our data fall along the 𝑥-axis, the squared
error is lowest or the absolute error is lowest.
ۧۧ The domain of the beta distribution is ( 0, 1 ), which lines up with the
appropriate range for our batting average. We expect that the player’s
season-long batting average will be most likely around 0.26 but that it
could reasonably range from 0.20 to 0.36. This can be represented with
a beta distribution with parameters 𝑎 = 78 and 𝑏 = 222:
ۧۧ But here’s why the beta distribution is so amazing. Imagine that our
player gets a single hit. The player’s record for the season is now 1 hit
and 1 at bat. We can update our probability, given this new information.
ۧۧ Suppose that halfway through the season the player has been up to bat
300 times, hitting 105 out of those times. The new distribution would
be beta( 79 + 105, 222 + 195 ).
ۧۧ The curve is now thinner and slightly shifted to the right to reflect the
higher batting average. In fact, the new expected value is our posterior
estimate of the player’s batting average. We can calculate it as 𝑎/( 𝑎 + 𝑏 ).
ۧۧ After 105 hits of 300 real at bats, the expected value of the new beta
distribution is 𝑎/( 𝑎 + 𝑏 ) = ( 79 + 105 )/( 79 + 105 + 222 + 195 ) = 0.306.
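ۧۧ A short sketch of this update in R, using the numbers from the example above:
x <- seq(0, 1, length.out = 500)
plot(x, dbeta(x, 78, 222), type = "l", lty = 2,
     xlab = "Batting average", ylab = "Density")   # prior: beta(78, 222)
lines(x, dbeta(x, 79 + 105, 222 + 195))            # posterior halfway through the season
(79 + 105) / (79 + 105 + 222 + 195)                # posterior mean, about 0.306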
ۧۧ Recall that a random variable has the uniform( 0, 1 ) distribution if its
probability density function is constant over the interval [0, 1] and 0
everywhere else: π( 𝑥 ) = 1 for 0 ≤ 𝑥 ≤ 1.
ۧۧ On the other hand, the beta ( 𝑎, 𝑏 ) distribution is another commonly
used distribution for a continuous random variable that can only take
on values [0, 1].
ۧۧ The most important thing is that 𝑥^( 𝑎−1 )( 1 − 𝑥 )^( 𝑏−1 ) determines the shape
of the curve.
ۧۧ Notice that uniform ( 0, 1 ) is a special case of the beta distribution
where 𝑎 = 1 and 𝑏 = 1. The shape of the distribution changes when 𝑎 and
𝑏 are much less than 1, when they are much greater than 1, and for all
values in between.
ۧۧ Suppose that your data followed a beta distribution and you wanted to use
an exponential prior. The posterior would be a beta times an exponential
distribution. We don’t know the mean or variance of that distribution. It’s
actually more computationally intensive to try to estimate.
SUGGESTED READING
Bolstad, Introduction to Bayesian Statistics, “Bayesian Inference for Binomial
Proportion,” chap. 8; and “Bayesian Inference for Normal Mean,” chap. 11.
PROBLEMS
1 Which of the following statements are true?
2 If you have data that follows a binomial distribution and would like to assign a
prior, which of the following methods would not be appropriate?
c) Choose the uniform prior that gives equal weight to all values.
d) Construct a discrete prior at several values and interpolate them to create
a continuous prior distribution.
One of the coolest things you can do in R is write your
own functions. Custom functions allow you to define
a specific action you are interested in, which you
can then easily apply to new data. This can be anything
from calculating a unique statistic, to creating a custom
plot, to combining several outputs into a single display.
ۧۧ For example, let’s create a custom function that replicates the “mean( )”
function in R:
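ۧۧ The definition itself isn’t reproduced here; one minimal version, consistent with the calls shown below, is:
mean.fun <- function(x) {
  sum(x) / length(x)   # add up the values and divide by how many there are
}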
ۧۧ Once you’ve defined a function and given it a name, you can then use the
function on new data:
set.seed(3456)
mean.fun(0:20)
[1] 10
mean.fun(rnorm(300))
[1] 0.02367274
ۧۧ While the function “mean.fun” only had 1 input ( the vector 𝑥 ), we can
also define multiple inputs. For example, let’s adjust the function so that
it has an input “delete.outliers,” which, when true, automatically deletes
outliers before calculating the mean. This will come in handy whenever
you do exploratory data analysis.
2 If you include a default value for a function input, then if the user doesn’t
specify the input value, R will use the default value. You can even include
a default data vector for 𝑥 by including “x = rnorm( 10 ),” for example.
3 When you have a logical input ( which is either true or false, such as
“delete.outliers” ), it’s common to use “if” statements that define the
action for when the input is true and for when it’s false.
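ۧۧ A hedged sketch of the extended function ( the lecture’s exact outlier rule isn’t shown; here outliers are taken to be points more than 1.5 IQRs beyond the quartiles ):
mean.fun <- function(x = rnorm(10), delete.outliers = FALSE) {
  if (delete.outliers == TRUE) {
    q <- quantile(x, c(0.25, 0.75))
    spread <- 1.5 * IQR(x)
    x <- x[x >= q[1] - spread & x <= q[2] + spread]   # drop values outside the fences
  }
  sum(x) / length(x)
}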
set.seed(3456)
Data <- c(rnorm(n = 100, mean = 0, sd = 1),400)
mean.fun(Data, delete.outliers = F)
[1] 3.936704
mean.fun(Data, delete.outliers = T)
[1] -0.02392943
ۧۧ If you find that you’re frequently adding certain things to plots, such as
reference lines for the median or mean, then creating a custom plotting
function can make things easier for you.
ۧۧ Use “abline” to add a straight line to an existing plot; the argument “v = ” draws
a vertical line at the given 𝑥-value.
set.seed(3456)
hist.fun(rexp(100), add.median = T)
data(mtcars)
hist.fun(mtcars$mpg, add.median = T, add.mean = T, add.legend = T)
# num.breaks controls how many bins the histogram uses
hist.fun <- function(x, add.median = T, add.mean = T,
                     add.legend = T, num.breaks = 12) {
  b <- seq(min(x), max(x), length = num.breaks)   # bin breakpoints spanning the data
  hist(x, col = "cadetblue", breaks = b)
  if(add.median == T) {abline(v = median(x), lwd = 3, col = "blue")}       # solid line at the median
  if(add.mean == T) {abline(v = mean(x), lty = 2, lwd = 3, col = "red")}   # dashed line at the mean
  if(add.legend == T) {
    legend("topright", c("median", "mean"), lwd = c(3, 3),
           lty = c(1, 2), col = c("blue", "red"))
  }
}
data(mtcars)
hist.fun(mtcars$mpg, add.median = T, add.mean = T, add.legend = T)
par(mfrow=c(1,2))
hist.fun(chick.no.out,num.breaks = 5)
hist.fun(chick.no.out,num.breaks = 20)
par(mfrow=c(1,3))
hist.fun(chick.no.out,num.breaks = 5)
hist.fun(chick.no.out,num.breaks = 20)
hist.fun(chick.no.out,num.breaks = 30)
if(add.mean == T) {
abline(h = mean(y))
abline(v = mean(x))
}
}
scatter.fun()
set.seed(3456)
x <- rnorm(100)
y.uncorr <- rnorm(100)
y.corr <- x + rnorm(100, 0, .3)
par(mfrow = c(1, 2))
scatter.fun(x, y.uncorr)
scatter.fun(x, y.corr)
...
Model <- lm(y ~ x)
p.value <- anova(Model)$"Pr(>F)"[1]
if(p.value <= 0.05) {Reg.Line.Col <- "green"}
if(p.value > 0.05) {Reg.Line.Col <- "darkred"}
abline(lm(y ~ x), lty = 2, lwd = 4,
col = Reg.Line.Col)}
}
set.seed(3456)
par(mfrow = c(1, 2))
scatter.fun(x, y.uncorr)
scatter.fun(x, y.corr)
if(add.mean == T) {
abline(h = mean(y))
abline(v = mean(x))
}
if(add.conclusion == T) {
C.Test <- cor.test(x, y)
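ۧۧ A hedged assembly of the “scatter.fun” pieces shown above ( the argument names, defaults, and legend are assumptions; the lecture’s full definition isn’t reproduced here ):
scatter.fun <- function(x = rnorm(100), y = rnorm(100),
                        add.mean = T, add.regression = T, add.conclusion = T) {
  plot(x, y, pch = 20)
  if(add.mean == T) {
    abline(h = mean(y))   # horizontal line at the mean of y
    abline(v = mean(x))   # vertical line at the mean of x
  }
  if(add.regression == T) {
    Model <- lm(y ~ x)
    p.value <- anova(Model)$"Pr(>F)"[1]
    # color the fitted line by whether the slope is significant at the 0.05 level
    if(p.value <= 0.05) {Reg.Line.Col <- "green"}
    if(p.value > 0.05)  {Reg.Line.Col <- "darkred"}
    abline(Model, lty = 2, lwd = 4, col = Reg.Line.Col)
  }
  if(add.conclusion == T) {
    C.Test <- cor.test(x, y)
    legend("topleft", bty = "n",
           legend = paste("r =", round(C.Test$estimate, 2),
                          ", p =", signif(C.Test$p.value, 2)))
  }
}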
ۧۧ For example, if R’s working directory is the desktop and all R files are kept in
a folder on the desktop called “R files,” the following command will make
R look in that folder:
setwd("R files")
ۧۧ Once R is looking in the folder of interest, the files ( data, R code, etc. ) in
that folder can then be accessed.
ۧۧ The “source” command, for example source( "myscript.R" ), accesses and runs
the code contained in the named file ( the file name here is just an illustration ).
EXCHANGING DATA
ۧۧ You can make R do so many more things for you. You can even share
your work with others.
ۧۧ There are several commands for bringing data into R, from local files or the web:
ൖൖ read.csv( “filename.txt” );
ൖൖ download.file( url ); and
ൖൖ read.csv( url( “http://any.where.com/data.csv” ) ).
ۧۧ You can specify that the data will be separated by commas ( or whatever
characters you want ) with the “sep” argument:
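ۧۧ For example ( the file name here is hypothetical ), a tab-separated file can be read with:
read.csv("mydata.txt", sep = "\t")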
ۧۧ Once you’ve done all of this work, there are many ways to share it. The
easiest is to create an R Markdown document.
ۧۧ If you’re in RStudio, click on “File,” then “New File,” and then select “R
Markdown.” It will ask you for a title and to select your default output:
HTML, PDF, or Microsoft Word. Then, RStudio creates the R Markdown file
for you. This allows you to publish your work as books, websites, articles, and more.
ۧۧ One caution: avoid reusing the names of built-in functions for your own objects.
plot = hist(x)   # "plot" now also names the saved histogram object
plot(x)          # risky: the name "plot" is now ambiguous and easy to misuse
rm(plot)         # removes your object and defaults back to the base definition
SUGGESTED READING
Crawley, The R Book, “Writing R Functions,” section 2.15.
Phillips, “Writing Custom Functions,” https://rstudio-pubs-static.
s3.amazonaws.com/47500_f7c2ec48e68446f99bc04f935195f955.html.
PROBLEMS
1 Which of the following are good reasons to write custom functions in R?
a) You can address a specific analytical need that isn’t covered in a built-in R
function.
b) You can automate a procedure that you use repeatedly.
c) You can save time when analyzing a new dataset.
d) All of the above.
2 Create a custom function that takes in the numbers a and b and returns ( a + b )^2,
a^b, and the square root of the absolute value of ( a × b ).
Lecture 01
1 In R, you define a variable ( here called “x” ) to hold your data using the “c”
command:
> mean(x)
13.72222
> median(x)
14
> var(x)
3.649444
> sd(x)
1.910352
2 Only the median. If outliers are too important to drop from your data but you
still want to get an idea of the central tendency of your data, the median is an
appropriate statistic to use.
Lecture 02
1 e) Exploratory data analysis is one of the first things we do to get an idea of the
shape, spread, central tendency, and overall distribution of our data.
Although the scale ranges from 0 to 20, none of the painters were given a score
above 18. Composition and drawing scores peak at 15 but are relatively uniform
in distribution. The colour score has more frequency at 6, 8, 10, 16, and 17, while
the expression scores are more frequently below 9.
Lecture 03
1 No. The probability of heads on the second toss is 0.5 regardless of the outcome
of the first toss. However, if a particular coin consistently gives heads more than
tails, then it may become appropriate to include that prior information about
your particular coin and begin adjusting the probabilities for that particular
coin accordingly.
Lecture 04
1 c) 0.25. We can calculate this from the expected value. E( X ) = 𝑛𝑝; 5 = 20 × 𝑝; 𝑝
= 0.25.
2 a) X ~ Bin( 10, 0.2 )
b) P( X = 2 ) = C( 10, 2 )( 0.2 )^2( 0.8 )^8 ≈ 0.302
Lecture 05
1 d) All of the above.
Lecture 06
1 d) Correlation does not imply causation.
2 a)
> summary(cars)
speed dist
Min.: 4.0 Min.: 2.00
1st Qu.: 12.0 1st Qu.: 26.00
Median: 15.0 Median: 36.00
Mean: 15.4 Mean: 42.98
3rd Qu.: 19.0 3rd Qu.: 56.00
Max.: 25.0 Max.: 120.00
b)
> cor(cars$speed, cars$dist)
0.8068949
> cov(cars$speed, cars$dist)
109.9469
plot(cars$speed, cars$dist)
Lecture 07
1
> shapiro.test(cars$dist)
data: cars$dist
W = 0.95144, p-value = 0.0391
> shapiro.test(cars$speed)
data: cars$speed
W = 0.97765, p-value = 0.4576
From the results of the Shapiro-Wilk test, “distance” is not normally distributed
( 𝑝-value = 0.0391 ) while “speed” is normally distributed ( 𝑝-value = 0.4576 ).
It’s clearer from the histogram and density plot that “distance” is skewed, with
most values at the low end and a long right tail, departing from
normality. “Speed” is more symmetric, with less extreme data in the tails,
resulting in data that follows an underlying normal distribution.
2 a) 3325
b) 680^2/360
c) pnorm( 3623, mean=3325, sd=680 ) – pnorm( 2980, mean=3325, sd=680 ) =
0.3634385
Lecture 09
1 b) If the mean of a sample statistic is not equal to the population parameter,
then the sample statistic is called a biased estimator.
2 A good point estimate for the population mean is the sample mean. We can find
it in R using the following commands:
# Point Estimate
mean(cars$speed)
15.4
mean(cars$dist)
42.98
Lecture 10
1 c) Although a large sample size results in a more precise confidence interval,
it’s not true that the sample must be at least 10 to calculate the confidence
interval.
2 95% Confidence Interval for Speed
( 5.036217, 25.76378 )
mean(cars$speed)-1.96*sd(cars$speed)
5.036217
mean(cars$speed)+1.96*sd(cars$speed)
25.76378
mean(cars$dist)+qt(0.95,length(cars$dist))*sd(cars$dist)
86.16703
mean(cars$dist)-qt(0.95,length(cars$dist))*sd(cars$dist)
-0.2070292
mean(cars$speed)+1.65*sd(cars$speed)
24.12461
mean(cars$speed)-1.65*sd(cars$speed)
6.675387
mean(cars$dist)+qt(0.90,length(cars$dist))*sd(cars$dist)
76.44704
mean(cars$dist)-qt(0.90,length(cars$dist))*sd(cars$dist)
9.512957
Lecture 11
1 b) When the 𝑝-value is less than α, we have enough evidence against the null
hypothesis, resulting in statistically significant data.
Lecture 12
1 Based on the results, we are highly confident, at the 0.05 level, that the true
mean difference in gas mileage between regular and premium gas is between
1.727599 and 3.363310 miles per gallon.
> Regular = c(14, 16, 20, 20, 21, 21, 23, 24, 23, 22, 23,
22, 27, 25, 27, 28, 30, 29, 31, 30, 35, 34)
> Premium = c(16, 17, 19, 22, 24, 24, 25, 25, 26, 26, 24,
27, 26, 28, 32, 33, 33, 31, 35, 31, 37, 40)
> t.test(Premium,Regular, paired=TRUE)
Paired t-test
data: Premium and Regular
t = 6.4725, df = 21, p-value = 2.055e-06
alternative hypothesis: true difference in means is not
equal to 0
95 percent confidence interval:
1.727599 3.363310
sample estimates:
mean of the differences
2.545455
2 b) Paired tests are only done when you have 2 samples in which observations in one
sample can be paired with the observations in the other sample.
Lecture 13
1 b) The estimated change in average Y per unit change in X.
Lecture 14
1 b) No apparent relationship.
2 a) We can use the correlation coefficient to measure the strength of the linear
relationship between 2 numerical variables.
Lecture 15
1 a) The expected value of the error terms is assumed to be 0.
2 b) Residuals tell us how far off our actual Y values are from our predicted
regression line values.
Lecture 16
1 a) With only 2 sample means, we can use a t-test. ANOVA is used when we have
3 or more means.
2 d) We need the number of groups and the sample size to find the critical
F-value.
Lecture 17
1 a), b), c), and e) are all true; d) is not true. If we fail to include an important
covariate, our results are likely to be invalid.
Lecture 19
1
> summary(tree1)
Classification tree:
tree(formula = Kyphosis ~ Age, data = kyphosis)
Number of terminal nodes: 6
Residual mean deviance: 0.8445 = 63.34 / 75
Misclassification error rate: 0.2099 = 17 / 81
2
Classification tree:
tree(formula = Kyphosis ~ Age + Number + Start, data =
kyphosis)
Number of terminal nodes: 10
Residual mean deviance: 0.5809 = 41.24 / 71
Misclassification error rate: 0.1235 = 10 / 81
Our residual mean deviance decreases, and we only incorrectly classify 10 of the
81 values. Using additional variables ( “age” along with “number” and “start” )
gives us a better tree by improving the classification rate.
Lecture 20
1 c) In polynomial regression, different powers of X variables are added to an
equation to see whether they increase the R2 significantly.
Lecture 22
1 By including the letter for each month as a plotting symbol, we can see the
seasonality in the series. Higher beer sales tend to occur in the summer months
of May, June, July, and August, while lower sales occur in winter months of
November, December, and January. There’s also an upward trend from 1975 to
around 1982.
> summary(beer.model)
Call:
lm(formula = beersales ~ season(beersales))
Residuals:
Min 1Q Median 3Q Max
-3.5745 -0.4772 0.1759 0.7312 2.1023
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.48568 0.26392 47.309 < 2e-16 ***
season(beersales)February -0.14259 0.37324 -0.382 0.702879
season(beersales)March 2.08219 0.37324 5.579 8.77e-08 ***
season(beersales)April 2.39760 0.37324 6.424 1.15e-09 ***
season(beersales)May 3.59896 0.37324 9.643 < 2e-16 ***
season(beersales)June 3.84976 0.37324 10.314 < 2e-16 ***
season(beersales)July 3.76866 0.37324 10.097 < 2e-16 ***
season(beersales)August 3.60877 0.37324 9.669 < 2e-16 ***
season(beersales)September 1.57282 0.37324 4.214 3.96e-05 ***
season(beersales)October 1.25444 0.37324 3.361 0.000948 ***
season(beersales)November -0.04797 0.37324 -0.129 0.897881
season(beersales)December -0.42309 0.37324 -1.134 0.258487
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
lm(formula = beer.diff ~ season(beer.diff) + time(beer.diff))
Residuals:
Min 1Q Median 3Q Max
-2.23411 -0.54159 0.04528 0.48127 1.88851
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.4216296 23.0675027 0.062 0.950927
season(beer.diff)February -0.7342015 0.2651087 -2.769 0.006211 **
season(beer.diff)March 1.6332146 0.2650927 6.161 4.70e-09 ***
season(beer.diff)April -0.2761317 0.2650803 -1.042 0.298968
season(beer.diff)May 0.6098594 0.2650714 2.301 0.022566 *
season(beer.diff)June -0.3406745 0.2650661 -1.285 0.200377
season(beer.diff)July -0.6725333 0.2650643 -2.537 0.012031 *
season(beer.diff)August -0.7512797 0.2650661 -2.834 0.005123 **
season(beer.diff)September -2.6273198 0.2650714 -9.912 < 2e-16 ***
season(beer.diff)October -0.9097037 0.2650803 -3.432 0.000746 ***
season(beer.diff)November -1.8937063 0.2650927 -7.144 2.24e-11 ***
season(beer.diff)December -0.9663776 0.2651087 -3.645 0.000350 ***
time(beer.diff) -0.0004187 0.0116322 -0.036 0.971330
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
data: beer.model2$residuals
W = 0.99439, p-value = 0.6911
Lecture 23
1 a) A major difference between Bayesian and frequentist statistics is the use of
prior information.
and
c) In Bayesian statistics, the population parameters, such as the mean and
median, are assumed to be random variables.
Lecture 24
1 d) Custom functions help us do all of these and much more.
2 Your function may be labeled differently, but here’s an example of one that
works:
my.fun = function(a,b)
{
return(list((a+b)^2, a^b, sqrt(abs(a*b))))
}
Diez, David M., Christopher D. Barr, and Mine Çetinkaya-Rundel. OpenIntro Statistics.
https://www.openintro.org/stat/textbook.php?stat_book=os.
You can order a print copy (hardcover or paperback) for less than $20
(as of May 2017).
Cryer, Jonathan D., and Kung-Sik Chan. Time Series Analysis with
Applications in R. New York: Springer, 2010.
Faraway, Julian J. Linear Models with R. Boca Raton, FL: CRC Press, 2005.
———. Extending the Linear Model with R. Boca Raton, FL: CRC Press, 2016.
Phillips. “Writing Custom Functions.” https://rstudio-pubs-static.s3.amazonaws.com/47500_f7c2ec48e68446f99bc04f935195f955.html.
Yau, Chi. R Tutorial. http://www.r-tutor.com.