Page 1 of 61 Page 2 of 61
More Symbols

In the election example on page 1, we were interested in studying proportions, not averages. So when we are studying proportions, there are other symbols we use.

If we have the proportion for an entire population, like the proportion of voters voting for candidate A in the entire population, we use the letter “p” to denote this population proportion.

Aspects of Statistics

1.) Design – How to obtain the data to answer questions of interest.
Ex. use a survey, set up an experiment

2.) Description – summarizing the obtained data. Describe the sample data.
EX – bar graph
HW 4.2-4.3 Simple Random Sampling Problem

A randomized experiment investigates whether an herbal treatment is better than a placebo in treating subjects suffering from depression. Unknown to the researchers, the herbal treatment has no effect: Subjects have the same score on a rating scale for depression (for which higher scores represent worse depression) no matter which treatment they take.

(a) The study will use eight subjects numbered 1 to 8. Using random numbers, pick the four subjects who will take the herbal treatment. (Use the first row and first column in the table.)

Line/Col   (1)     (2)     (3)
1          10480   15011   01536
2          22368   46573   25595
3          24130   48360   22527
4          42167   93093   06243

Identify the four who will take the herbal treatment. (List in numerical order.)
   ,    ,    ,

What are different ways to sample?

Types of Sampling:

1. Simple Random Sampling – every subject has an equally likely chance of being selected for the sample. Usually, samples are chosen using a random number table.

2. Stratified Sampling – the population is divided into non-overlapping groups (called strata) and a simple random sample is then obtained from each group.

3. Cluster Sampling – the population is divided into non-overlapping groups and all individuals within a randomly selected group or groups are sampled.

4. Convenience Sampling – sampling where the individuals are easily obtained. Internet surveys are convenience samples. Studies that use convenience sampling generally have results that are suspect.

5. Systematic Sampling – selecting every kth subject from the population.

The difference between stratified and cluster sampling is that stratified sampling samples some individuals from all groups, whereas cluster sampling samples all individuals from some groups.
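For the random-number-table problem above, the selection can be sketched in Python. This is only an illustration under an assumed reading convention that the notes do not spell out: read one digit at a time from left to right, starting in row 1, column 1, and skip any digit that is not 1 through 8 or that has already been picked. The function name `pick_subjects` is made up for this sketch.

```python
# Row 1 of the random number table, columns (1) to (3), copied from the problem.
ROW_1 = ["10480", "15011", "01536"]


def pick_subjects(digit_groups, n_subjects=8, n_to_pick=4):
    """Read single digits left to right and keep each new digit in 1..n_subjects.

    Assumed convention: digits outside 1..n_subjects (here 0 and 9) and
    repeated digits are skipped.
    """
    chosen = []
    for group in digit_groups:
        for ch in group:
            digit = int(ch)
            if 1 <= digit <= n_subjects and digit not in chosen:
                chosen.append(digit)
                if len(chosen) == n_to_pick:
                    return sorted(chosen)
    raise ValueError("ran out of digits before picking enough subjects")


herbal_group = pick_subjects(ROW_1)
print(herbal_group)
```

A different reading convention (for example, two digits at a time) would generally give a different sample, which is fine as long as the rule is fixed before looking at the table.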
Example: Identify the type of sampling used below.

In order to determine the average IQ of ninth-grade students, a school psychologist obtains a list of all high schools in the local public school system. She randomly selects five of these schools and administers an IQ test to all ninth-grade students at the selected schools.
_______________________________________________
A member of Congress wishes to determine her county’s opinion regarding estate taxes. She divides her county into three income classes: low-income households, middle-income households, and upper-income households. She then takes a random sample of households from each income class.
________________________________________________
A radio station asks its listeners to call in their opinion regarding the use of American forces in peacekeeping missions.
________________________________________________
In an effort to identify whether an advertising campaign has been effective, a marketing firm conducts a nationwide poll by randomly selecting individuals from a list of known users of the product.
________________________________________________
A lobby has a list of the 100 senators of the U.S. In order to determine the Senate’s position regarding farm subsidies, they decide to talk with every seventh senator on the list starting with the third.
________________________________________________

Chapter Two – Exploring and Summarizing Data

Variable – Characteristic that we are studying

2.1 What are the Types of Data?

Two Kinds of Variables:

1.) Categorical – Classifies subjects based on some attribute or characteristic. Each observation belongs to a set of categories.

Ex – A person could live in a ‘house’, ‘condo’, ‘apartment’, ‘dormitory’, etc.

2.) Quantitative – Provides numerical measures of subjects. The variable takes on numerical values.

Ex – A person can be 56 inches in height, weigh 132 pounds or get a 92 on a test.
Two Kinds of Quantitative Variables:

1.) Discrete – a countable number of values.

EX – The number of people in this class, the number of words on this page.

2.) Continuous – an uncountable number of values. Continuous variables are usually variables that can take on all values on an interval.

EX – Height, Weight, Temperature

Example: Identify each of the following as categorical or quantitative variables. If quantitative, identify further as discrete or continuous.

1. The length of time until a pain reliever begins to work.

2. The colors used in a statistics textbook.

3. The number of files on a hard drive.
Pareto chart – a bar graph whose bars are drawn in decreasing order of frequency or proportion.

[Figure: "Favorite Cookie" chart; recovered labels: Peanut Butter 13.33%, Oreo 16.67%, Oatmeal Raisin 16.67%]

Graphs for Quantitative Variables

Histogram – a bar graph for quantitative data.

Ex. The table below shows the number of points scored by the UGA football team in the 2002-2003 season.
Stem and Leaf Plot – A stem and leaf plot is just a bar graph on its side. The stem consists of all digits except for the final one, which is the leaf.

Ex. The table below shows the number of points scored by the UGA football team in the 2002-2003 season.

Opponent          # of Points
Clemson           31
South Carolina    13
Northwestern St   45
New Mexico St.    41
Alabama           27
Tennessee         18
Vanderbilt        48
Kentucky          52
Florida           13
Ole Miss          31
Auburn            24
Georgia Tech      51
Arkansas          30
Florida St.       26

1|
2|
3|
4|
5|

Example

The following data represent the length of eruption in seconds for a random sample of eruptions of “Old Faithful”, a geyser at Yellowstone National Park. Draw a stem and leaf plot.

108  113  102   97  106
110   99  109  108  112
 97   76  107  104  114
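The stem-and-leaf construction described above can be sketched in a few lines of Python. This is an illustrative helper rather than anything the course requires (`stem_and_leaf` is a made-up name); it takes the stem as every digit except the last and the leaf as the last digit, using the UGA points from the table.

```python
from collections import defaultdict

# Points scored by UGA in each game, copied from the table above.
points = [31, 13, 45, 41, 27, 18, 48, 52, 13, 31, 24, 51, 30, 26]


def stem_and_leaf(values):
    """Return {stem: sorted list of leaves} for non-negative whole numbers."""
    plot = defaultdict(list)
    for v in sorted(values):
        plot[v // 10].append(v % 10)  # stem = all but last digit, leaf = last digit
    return dict(plot)


plot = stem_and_leaf(points)
# Print every stem from smallest to largest, including stems with no leaves.
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem}|{leaves}")
```

The same function works for three-digit data such as the Old Faithful eruption lengths, where a value like 102 has stem 10 and leaf 2.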
Shapes of Histograms

Symmetrical/Normal – the side of the distribution below the middle is a mirror image of the side above the middle.
[Figure: symmetric histogram]

Skewed left – left tail is stretched out longer than the right tail.
[Figure: left-skewed histogram]

Skewed right – right tail is stretched out longer than the left tail.
[Figure: right-skewed histogram]

NOTE: Many times we will use smooth curves to show the data rather than histograms.

Example

[Figure: histogram, "IQ's of 7th Graders"; vertical axis: Frequency; bar labels: 128, 52, 52, 15, 15, 2, 3, 3]

Which class has the highest frequency? What is its frequency?

Which class has the lowest frequency? What is its frequency?

What proportion of the students have an IQ between 120 and 129?

Describe the shape of the distribution – is it skewed right, skewed left or approximately normal?
2.3 How can we describe the center of quantitative data?

Mean (Average) – adding up all the values of the variable x and dividing by the number of these values, n.

$\text{Mean} = \frac{\sum x}{n}$

Example: What is the mean of 1, 3, 6, 7, 8?

Population Mean: μ (known as mu)

Sample Mean: x̄ (known as x-bar)

Median – The value of the data that occupies the middle position when the data are ranked in ascending order. It separates the bottom 50% of the data from the top 50% of the data.

Steps in Computing the Median of a Data Set:
1. Arrange the data from low to high.

Mode – The data value that occurs most frequently (has the highest frequency). It is important to point out that the mode is NOT equal to the frequency; the mode is the data value that corresponds with the highest frequency.

Example: 10 bags of M&M’s were opened and the number of M&M’s in each of the 10 bags is:

32 34 31 35 32 36 29 38 34 32

Example: What is the mode in the bar graph below?

[Figure: bar graph, "Favorite Cookie"; vertical axis: Frequency; categories: Oreo, Chocolate Chip, Oatmeal Raisin, Sugar, Peanut Butter, Brownie; bar labels: 9, 5, 5, 4, 4, 3]
Example: Using the previous UGA football example:
Number of Points = 31, 13, 45, 41, 27, 18, 48, 52, 13, 31, 24, 51, 30, 26

What is the sum of all points scored by the Bulldogs that year?

What is the mean number of points scored by the Bulldogs that year?

The following frequency table shows the number of children in a daycare separated out by their ages:

Age of Children   Frequency
2                 3
3                 7
4                 6
5                 1

Write out all the ages for the children at the daycare.

NOW, let’s see how StatCrunch can do these calculations for us.
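As a cross-check outside StatCrunch, the same calculations can be sketched with Python's standard `statistics` module. This is an aside rather than part of the course workflow, and the variable names are chosen for this illustration only. It also shows how a frequency table is expanded into the full list of values before averaging.

```python
import statistics

# UGA points per game, copied from the example above.
points = [31, 13, 45, 41, 27, 18, 48, 52, 13, 31, 24, 51, 30, 26]

total = sum(points)                            # sum of all the values
n = len(points)                                # number of values
mean = total / n                               # mean = sum / n
median = statistics.median(points)             # middle of the sorted data
modes = sorted(statistics.multimode(points))   # every value tied for most frequent

# Daycare frequency table: age -> number of children of that age.
age_freq = {2: 3, 3: 7, 4: 6, 5: 1}
# "Write out all the ages": repeat each age as many times as its frequency.
ages = [age for age, freq in age_freq.items() for _ in range(freq)]
mean_age = sum(ages) / len(ages)

print(total, n, round(mean, 2), median, modes)
print(ages, round(mean_age, 2))
```

`multimode` is used instead of `mode` because a data set can have more than one value tied for the highest frequency.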
It is important to note that the mean is sensitive to extreme values in the dataset, either very large or very small numbers. The median, however, is not. The median is resistant to extreme values.

Example:
Data set:

Find the mean

n =

Σx =          mean =

Find the median

Put values in order:

Median =

Mode =

If I asked for the number which best describes the “middle” of the data, what is the best answer? Why?

Mean = Median: The graph is approximately normal/symmetrical.

Mean < Median: The graph is skewed left.

This is true because a skewed left graph has a tail of low data values on the left. These low data values pull the mean down, making it less than the median.

Mean > Median: The graph is skewed right.

This is true because a skewed right graph has a tail of high data values on the right. These high data values pull the mean up, making it greater than the median.

Example: Match the histograms to these summary statistics.

     Mean   Median   Graph
1    42     42
2    31     36
3    31     26
2.4 How can we describe the spread of quantitative data?

Range – The difference between the largest and the smallest pieces of data.

Ex. Data Set: 2, 6, 7, 9, 11

$\bar{x} = \frac{2 + 6 + 7 + 9 + 11}{5} = \frac{35}{5} = 7$

x     x − x̄     Deviation
2     2 – 7
6     6 – 7
7     7 – 7
9     9 – 7
11    11 – 7

Sample Variance – the mean of the squared deviations, calculated using n – 1 as the divisor. In other words, when you calculate the sample variance you are averaging the squared deviations, except that you divide by n – 1 instead of n.

$\text{Variance} = \frac{\sum (x - \bar{x})^2}{n - 1}$

x     x − x̄     (x − x̄)²
2     2 – 7
6     6 – 7
7     7 – 7
9     9 – 7
11    11 – 7
SUM:

Variance =

Standard Deviation – the positive square root of the variance

$s = \sqrt{\text{Variance}}$

From the example above,

s =

In lab, you will learn how to use StatCrunch to calculate this sample standard deviation value without having to go through all these steps.
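The step-by-step calculation above can be mirrored in a short Python sketch. It is an illustration only (the variable names are not from the notes): it follows the same table columns of deviations and squared deviations for the data set 2, 6, 7, 9, 11, then compares the result with the standard library's `statistics.variance` and `statistics.stdev`, which also divide by n − 1.

```python
import math
import statistics

data = [2, 6, 7, 9, 11]

n = len(data)
x_bar = sum(data) / n                       # sample mean
deviations = [x - x_bar for x in data]      # the x - x_bar column
squared = [d ** 2 for d in deviations]      # the squared-deviation column
variance = sum(squared) / (n - 1)           # divide by n - 1, not n
s = math.sqrt(variance)                     # sample standard deviation
data_range = max(data) - min(data)          # largest minus smallest

print(deviations, squared, sum(squared))
print(variance, round(s, 4), data_range)

# The library functions use the same n - 1 divisor.
assert math.isclose(variance, statistics.variance(data))
assert math.isclose(s, statistics.stdev(data))
```

Note that the deviations always sum to zero, which is why they are squared before averaging.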
Variance and standard deviation measure how spread apart your data values are. The higher the variance and standard deviation, the more spread apart the data values will be.

Example: If we administered Test A and Test B to five students, and their scores were the following:

Just as we can look at either a population mean or a sample mean, depending on whether we have the entire population or just a sample from it, we also have symbols to represent population standard deviation and sample standard deviation:

Population Standard Deviation: σ (known as sigma)

Sample Standard Deviation: s

Example: Consider the following three data sets:

A: 50, 50, 50     B: 40, 50, 60     C: 30, 50, 70

Use these data sets to practice finding the sample standard deviation.

B     Deviation     Deviation²
40
50
60
As you can see in the distributions below, the distribution with a larger standard deviation is going to be wider, because its data values are more spread apart:

[Figure: A. Standard Deviation Equal to 1.0]

Empirical Rule – If a distribution is bell-shaped, we can approximate the percentage of data that lie within one, two, and three standard deviations of the mean.

μ ± 1σ (-1 to +1) ~ 68% of the data values
μ ± 2σ (-2 to +2) ~ 95% of the data values
μ ± 3σ (-3 to +3) ~ 99.7% (nearly all) of the data values
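To make the Empirical Rule concrete, here is a small Python sketch that turns a mean and standard deviation into the three intervals. The function name and the example values (mean 100, standard deviation 15) are illustrative assumptions, not numbers from the notes, and the percentages only apply when the distribution is roughly bell-shaped.

```python
# Approximate share of a bell-shaped distribution within k standard deviations.
COVERAGE = {1: "about 68%", 2: "about 95%", 3: "about 99.7% (nearly all)"}


def empirical_rule_intervals(mean, sd):
    """Return {k: (mean - k*sd, mean + k*sd)} for k = 1, 2, 3."""
    return {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}


# Illustrative values only (not taken from the notes).
intervals = empirical_rule_intervals(mean=100, sd=15)
for k, (low, high) in intervals.items():
    print(f"within {k} SD: {low} to {high} -> {COVERAGE[k]} of the data")
```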
Example: The weight, in grams, of both kidneys based upon a sample of 30 forty-five year old men resulted in a sample mean of 325 grams, with a sample standard deviation of 30 grams.

a. A histogram of the data indicates that the data follow a bell-shaped distribution. Draw a curve of these kidney weights.

2.5 How can we describe the position of values in quantitative data?

1. Percentiles

The pth percentile is a value such that p% of the observations in the data fall below or at that value.

This also means that the other (100 – p)% of the observations in the data are larger than that value.
Quartiles – specific percentiles that are useful. Each set of data has three quartiles.

First Quartile (Q1) – the value such that 25% of the data values are smaller than Q1, and 75% are larger. This is also known as the 25th percentile.

Second Quartile (Q2) – the value such that 50% of the data values are smaller than Q2, and 50% are larger. This is also known as the median and the 50th percentile.

Third Quartile (Q3) – the value such that 75% of the data values are smaller than Q3, and 25% are larger. This is also known as the 75th percentile.

Finding Quartiles

The following data represent the hemoglobin (in g/dL) for 20 randomly selected cats.

 5.7   7.7   7.8   8.7   8.9
 9.4   9.5   9.6   9.6   9.9
10.0  10.3  10.6  10.7  11.0
11.2  11.7  12.9  13.0  13.4

Determine the quartiles.
Outliers – extreme observations that occur because of error in the measurement of the variable, during data entry, or from errors in sampling.

Steps for Checking for Outliers:
1.) Determine the first and third quartiles of the dataset.
2.) Compute the interquartile range. The interquartile range or IQR is the difference between the third and first quartile.
    IQR = Q3 – Q1
3.) If a data value is less than Q1 – 1.5(IQR) or greater than Q3 + 1.5(IQR), it is considered an outlier.

Example (continued): Hemoglobin in Cats
The following data represent the hemoglobin (in g/dL) for 20 randomly selected cats.

 5.7   7.7   7.8   8.7   8.9
 9.4   9.5   9.6   9.6   9.9
10.0  10.3  10.6  10.7  11.0
11.2  11.7  12.9  13.0  13.4

Compute the IQR.

The 5-Number Summary and Boxplots

          25%      25%      25%      25%
Minimum        Q1       Q2       Q3       Maximum

This is the 5-number summary; it includes the minimum, Q1, Q2 (the median), Q3, and the maximum.

Boxplot – a graph of the five number summary.

Steps in Drawing a Boxplot:
1.) Determine Q1, Q2, and Q3.
2.) Draw vertical lines at Q1, the median (Q2), and Q3. Enclose these vertical lines in a box.
3.) Draw a line from Q1 to the smallest data value that is not an outlier. Draw a line from Q3 to the largest data value that is not an outlier.
4.) Any data values that are outliers are marked with an asterisk (*).
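The outlier check and the numbers needed for a boxplot can be sketched in Python for the cat hemoglobin data. One assumption is labeled in the code: quartiles are computed as the medians of the lower and upper halves of the sorted data. The notes do not state which quartile rule StatCrunch uses, and other software conventions can give slightly different Q1 and Q3.

```python
import statistics

# Hemoglobin (g/dL) for the 20 cats, copied from the example above.
hemoglobin = [5.7, 7.7, 7.8, 8.7, 8.9, 9.4, 9.5, 9.6, 9.6, 9.9,
              10.0, 10.3, 10.6, 10.7, 11.0, 11.2, 11.7, 12.9, 13.0, 13.4]


def quartiles(values):
    """Return (Q1, Q2, Q3) using medians of the lower and upper halves.

    Assumed convention: when n is odd, the middle value is left out of
    both halves. Requires at least two data values.
    """
    data = sorted(values)
    half = len(data) // 2
    lower, upper = data[:half], data[-half:]
    return statistics.median(lower), statistics.median(data), statistics.median(upper)


q1, q2, q3 = quartiles(hemoglobin)
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr       # values below this are outliers
high_fence = q3 + 1.5 * iqr      # values above this are outliers
outliers = [x for x in hemoglobin if x < low_fence or x > high_fence]

# Boxplot whiskers stop at the most extreme values that are NOT outliers.
inside = [x for x in hemoglobin if low_fence <= x <= high_fence]
whiskers = (min(inside), max(inside))

print(round(q1, 2), round(q2, 2), round(q3, 2), round(iqr, 2))
print(round(low_fence, 3), round(high_fence, 3), outliers, whiskers)
```

Because results are floating-point numbers, the printout is rounded for display.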
Example: Draw a boxplot for the cat data:

 5.7   7.7   7.8   8.7   8.9
 9.4   9.5   9.6   9.6   9.9
10.0  10.3  10.6  10.7  11.0
11.2  11.7  12.9  13.0  13.4

Step 1: Determine Q1, Q2, and Q3

Step 2: Draw vertical lines at Q1, the median (Q2), and Q3. Enclose these vertical lines in a box.

Step 3: Draw a line from Q1 to the smallest data value that is not an outlier. Draw a line from Q3 to the largest data value that is not an outlier.

Step 4: Any data values that are outliers are marked with an asterisk (*).

Distribution Shape Based upon Boxplot:
1.) If the median is near the center of the box and each horizontal line is approximately equal length, the distribution is approximately symmetric.
2.) If the median is to the left of the center of the box or the right line is much longer than the left line, the distribution is skewed right.
3.) If the median is to the right of the center of the box or the left line is much longer than the right line, the distribution is skewed left.
2. Z-score

Z-score – The position a value has relative to the mean, measured in standard deviations.

$z\text{-score} = \frac{\text{value} - \text{mean}}{\text{standard deviation}}$

The Z-score is the number of standard deviations a data value is from the mean.

Example:
From samples taken, the average 20-29 year-old man is 70.0 inches tall, with a standard deviation of 2.8 inches, while the average 20-29 year-old woman is 64.6 inches tall, with a standard deviation of 2.6 inches.

Find the z-score for a 75-inch tall man.

If the heights for males are normally distributed, draw a curve representing these heights. Label where the 75-inch tall man is under this curve, and see that it corresponds to his Z-score.

What height is exactly two standard deviations below the mean? Calculate the Z-score for this height to make sure it does equal -2.

Using Z-Scores to check for Outliers

Outliers for a bell-shaped curve:
A data value in a bell-shaped distribution is regarded as a potential outlier if it falls more than three standard deviations from the mean. Or, in other words, if a value has a Z-Score less than -3 or a Z-Score greater than +3, then it is a potential outlier.
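The z-score formula and the three-standard-deviation outlier check can be sketched in Python using the men's height figures from the example above. The function names are illustrative only.

```python
import math


def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd


def is_potential_outlier(z):
    """Bell-shaped rule from the notes: more than 3 SDs from the mean."""
    return abs(z) > 3


MEN_MEAN, MEN_SD = 70.0, 2.8                 # inches, from the example

z_75 = z_score(75, MEN_MEAN, MEN_SD)         # 5 / 2.8, a little under 2
two_below = MEN_MEAN - 2 * MEN_SD            # height two SDs below the mean
z_two_below = z_score(two_below, MEN_MEAN, MEN_SD)

print(round(z_75, 2), round(two_below, 1), round(z_two_below, 2))
print(is_potential_outlier(z_75))
```

A 75-inch man is tall but well inside three standard deviations, so he would not be flagged as a potential outlier.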
Chapter Three – Association: Contingency, Correlation, and Regression

In Chapter 3, we explore the relationships between two variables.

Response variable – a variable that can be explained by, or is determined by, another variable. This is our y-variable, the variable that goes on the vertical axis when we are graphing data.

Explanatory variable – explains, or affects, the response variable. This is our x-variable, the variable that goes on the horizontal axis when we are graphing data.

Ex. The amount you eat affects how much weight you gain. The amount you eat is the explanatory variable which determines weight gain, the response variable.

Association – an association exists between two variables if a particular value for one variable is more likely to occur with certain values of the other variable.

3.1 How can we explore the association between two categorical variables?

To do this, we use contingency tables.

Contingency or 2-way table – a table that relates 2 categorical variables. Each box inside the table is referred to as a cell.

Suppose we have the following data:

          Left-handed   Right-handed
Male      160           600
Female    140           560

Are these categorical variables?

What is the response variable?

What is the explanatory variable?
We can also calculate the proportion for each group. Total up the columns and rows again, and answer the questions:

          Left-handed   Right-handed   Total
Male      160           600            760
Female    140           560            700
Total     300           1160           1460

Ex. What proportion of the people in the data is female?

Ex. What proportion of the people in the data is left-handed?

Conditional Proportion – the proportion for a value of a variable, given a specific value of the other variable.

Total up the columns and rows again, and answer the questions:

          Left-handed   Right-handed   Total
Male      160           600            760
Female    140           560            700
Total     300           1160           1460

Ex. What proportion of the males is right-handed?

Relative Risk / Odds Ratio

We can use these conditional proportions to determine the comparative odds for each group.

Let’s create a table with these conditional proportions for categories of the response variable.

          Left-handed   Right-handed
Male
Female

$\text{relative risk} = \frac{\text{conditional proportion for one group}}{\text{conditional proportion for another group}}$

When we calculate relative risk, the higher conditional proportion goes in the numerator. We can use relative risk to see how many times more likely the outcome for one group is than the other group.

Example: Fill in the blank. A male is _____ times more likely to be left-handed than a female.
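The proportions, conditional proportions, and relative risk for the handedness table can be sketched in Python. The helper names are made up for this illustration. Following the rule in the notes, the larger conditional proportion goes in the numerator of the relative risk.

```python
# Handedness counts from the contingency table above.
counts = {
    "Male": {"Left-handed": 160, "Right-handed": 600},
    "Female": {"Left-handed": 140, "Right-handed": 560},
}


def conditional_proportion(table, group, outcome):
    """Proportion of `group` (a row) that falls in `outcome` (a column)."""
    row = table[group]
    return row[outcome] / sum(row.values())


def relative_risk(table, group_a, group_b, outcome):
    """Ratio of the two conditional proportions, larger one on top."""
    p_a = conditional_proportion(table, group_a, outcome)
    p_b = conditional_proportion(table, group_b, outcome)
    return max(p_a, p_b) / min(p_a, p_b)


grand_total = sum(sum(row.values()) for row in counts.values())
prop_female = sum(counts["Female"].values()) / grand_total
prop_left = sum(row["Left-handed"] for row in counts.values()) / grand_total

male_right = conditional_proportion(counts, "Male", "Right-handed")
rr_left = relative_risk(counts, "Male", "Female", "Left-handed")

print(round(prop_female, 4), round(prop_left, 4))
print(round(male_right, 4), round(rr_left, 4))
```

A relative risk close to 1 means the two groups have nearly the same conditional proportion, so the association is weak.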
We asked 1,795 people their political affiliation and whether they think marijuana should be legalized. Here is the data we received:

                        Legalize Marijuana?
Political Affiliation   Yes    No     Total
Democrat                240    326    566
Independent             292    446    738
Republican              121    370    491
Total                   653    1142   1795

Example:
Fill in the blank. A Democrat is _______ times more likely to favor legalization of marijuana than a Republican.

3.2 How can we explore the association between two quantitative variables?

When we have two quantitative variables, the first thing we do is make a scatterplot of the data.

Scatterplot – a graphical display for two quantitative variables. (Explanatory variable is on the horizontal axis, response variable is on the vertical axis, and the points are not connected.)
Here is a scatterplot of this data:

[Figure: scatterplot; horizontal axis: Spaces from GO]

positive association – as x increases, y increases.

Example: Determine the type of association for the following pairs of variables.

c) weight on a bar and number of repetitions a weightlifter can achieve

d) the temperature outside and my grade on a test
So we can state the association between two variables, but what if we want to take it one step further and determine if there is a linear relationship between the variables?

Then we calculate what we call correlation.

Coefficient of linear correlation (r) – the numerical measure of the strength of the linear relation between x and y.

[Figure: number line for r marked at -1, -.5, 0, .5, 1]
3.3-4 How to predict the outcome of a variable?

regression line – predicts the value for the response variable y as a straight-line function of the value x of the explanatory variable.

For each x value, there is an actual y value in our data, and a predicted y value using this regression line. The best regression line is going to be the one that has the predicted y values closest to the actual y values.

We use the actual data values to create the regression line. We won’t need to do this by hand; StatCrunch can do it for us.

Let’s take a look back at our Monopoly data and look at the regression line for that data:

Here is our data:

And here is the graph of the regression line for this data:

[Figure: scatterplot with regression line; vertical axis: Cost; horizontal axis: Spaces from GO]

Residual – the difference between the actual value and the predicted value of y.

$\text{residual} = \text{actual } y - \text{predicted } y = y - \hat{y}$

The line that “best” describes the relation between 2 variables is the one that makes the residuals as small as possible.
The formula for the regression line using the least squares method:

$\hat{y} = a + bx$

where a = y-intercept and b = slope

So for our Monopoly example, here is the regression line formula:

$\hat{y} = 147.016 + 4.159x$

We can get this in StatCrunch by going to Stat → Regression → Simple Linear.

Interpretations of y-intercept and slope:
y-intercept = the predicted value of y when x = 0.
Interpret the y-intercept in the above scenario:

Calculate the predicted cost and residual for each of the four properties in our data using the regression line formula:

Property            # spaces from GO   Actual Cost   Predicted Cost   Residual
Reading Railroad    5                  $200
Virginia Ave.       14                 $160
Illinois Ave.       24                 $240
N. Carolina Ave.    32                 $300

When we use our regression line to predict the cost for x values outside the range of our data, this is called extrapolation.

Predict the cost for Tennessee Ave., which is 18 spaces from GO.
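The predicted costs and residuals for the Monopoly example can be sketched in Python using the regression line given above (ŷ = 147.016 + 4.159x). The function and variable names are illustrative; the four properties and their costs are the ones listed in the notes.

```python
# Coefficients of the regression line from the notes: y-hat = a + b * x
A_INTERCEPT = 147.016
B_SLOPE = 4.159


def predict_cost(spaces_from_go):
    """Predicted cost (dollars) for a property this many spaces from GO."""
    return A_INTERCEPT + B_SLOPE * spaces_from_go


# (property, spaces from GO, actual cost in dollars)
properties = [
    ("Reading Railroad", 5, 200),
    ("Virginia Ave.", 14, 160),
    ("Illinois Ave.", 24, 240),
    ("N. Carolina Ave.", 32, 300),
]

rows = []
for name, x, actual in properties:
    predicted = predict_cost(x)
    residual = actual - predicted          # residual = actual y - predicted y
    rows.append((name, x, actual, predicted, residual))
    print(f"{name:<18}{x:>3}  ${actual}  {predicted:8.3f}  {residual:8.3f}")

tennessee = predict_cost(18)               # Tennessee Ave., 18 spaces from GO
print(f"Tennessee Ave. predicted cost: {tennessee:.3f}")
```

A positive residual means the property costs more than the line predicts; a negative residual means it costs less.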
Example
Weight (x)              40   80   100   120   150
Number of Reps (y)      25   20   18    15    10
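For the weight and repetitions data above, the correlation r and the least squares line can be computed from their textbook definitions. This Python sketch is an illustration of what software such as StatCrunch reports, not a description of how StatCrunch works internally; the function name is made up.

```python
import math

weight = [40, 80, 100, 120, 150]   # x: weight on the bar
reps = [25, 20, 18, 15, 10]        # y: number of repetitions


def correlation_and_line(x, y):
    """Return (r, a, b) where y-hat = a + b * x is the least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    r = sxy / math.sqrt(sxx * syy)   # correlation coefficient
    b = sxy / sxx                    # slope
    a = y_bar - b * x_bar            # y-intercept
    return r, a, b


r, a, b = correlation_and_line(weight, reps)
print(f"r = {r:.4f}, y-hat = {a:.3f} + ({b:.4f})x")
```

The negative sign of r and of the slope matches a negative association: as the weight increases, the number of repetitions decreases.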