
Basic Statistics

Introductory Workshop MSBAPM


John R Wilson
John.Wilson@business.uconn.edu
(Note that much of the material and definitions are from A First Course in Statistics, McClave and Sincich, 11th
edition)

Section One Overview of terms

Some definitions
Descriptive Statistics - Methods for describing a data set
as it is.
Inferential Statistics - Methods for reaching conclusions
that go beyond the immediate data
Sample data insight ----- Infers population insight

Parameter - summary measure for a population

Statistic - summary measure for a sample

More definitions!!
Population - The complete set of units that we are interested in studying
(e.g., registered voters, college students, MLB players,
etc.)
Sample - A subset of the population
Variables - Characteristics of the units in the population or sample
Measures - Assignment of a value for each characteristic
Could be continuous (e.g., age, weight) or categorical (A, B, C, etc.)

Why inferential statistics?


Population vs Sample
Measuring the whole population may not be practical or may be too expensive
Ex. We want to measure likely voter outcomes. Polling EVERY
voter is not practical, so we poll a random sample of voters
and then infer what we would get if we had polled ALL voters
There is a potential for error

Exercise One
A survey shows that the average age of TV viewers is
51. Fox believes the average age of their viewership is
less than 51. To test this, they sample 200 viewers.

Describe the population
Describe the variable of interest
Describe the sample
Describe the inference

Exercise One Answers


A survey shows that the average age of TV viewers is
51. Fox believes the average age of their viewership is
less than 51. To test this, they sample 200 viewers.

Describe the population (All Fox viewers)


Describe the variable of interest (Age)
Describe the sample (200 of the Fox viewers)
Describe the inference (The inference is that the average age
of the SAMPLE approximates the average age of the entire
population)

Reliability
As noted earlier, there could be error in a sample
statistic correctly approximating a population parameter
(ie Fox sample age correctly approximates age of ALL
Fox viewers)
We use measures of reliability to reflect the degree of
uncertainty about statistical inference
This is often referred to as the margin of error or
confidence interval
We will discuss these in more detail later

Data types
Quantitative (continuous)
Data that can be measured on a numerical scale

Qualitative Data (nominal or ordinal)
Data that must be classified or encoded

Sometimes, qualitative data APPEARS to be quantitative,
but it isn't really
Ex. Political Affiliation
1 = Republican, 2 = Democrat, 3 = Independent, 4 = Other

Exercise Two
The Army Corps of Engineers is measuring toxicity in
fish. They captured a total of 144 fish and noted the
following variables

River/creek where fish was captured


Species
Length
Weight
Toxic concentration in ppm

Which are quantitative/qualitative

Exercise Two
The Army Corps of Engineers is measuring toxicity in
fish. They captured a total of 144 fish and noted the
following variables

River/creek where fish was captured (Qual)
Species (Qual)
Length (Quant)
Weight (Quant)
Toxic concentration in ppm (Quant)

Note that species or body of water might be
represented numerically, but they are still considered
categorical

Section Two Descriptive Statistics

Describing Qualitative Data


Remember, qualitative data is usually reflected in
categories or classes
We can count the frequency, i.e. the number of
observations that appear in a specific class/category
Ex. Education type: No college 32, Some college 54,
College grad 34
Note that there are 120 observations (32 + 54 + 34)
The relative frequencies are 32/120, 54/120 and 34/120
The relative frequencies must ALWAYS sum to 1 (or 100%)
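
As a quick check of the education example, the short Python sketch below turns the class counts into relative frequencies and confirms they add up to 1. The variable names are illustrative only and are not part of the workshop material.

```python
# Class counts taken from the education example above.
counts = {"No college": 32, "Some college": 54, "College grad": 34}

n = sum(counts.values())  # total number of observations

# Relative frequency of each class = class count / total count.
relative_freq = {label: count / n for label, count in counts.items()}

for label, freq in relative_freq.items():
    print(f"{label}: {freq:.3f}")

print("Total observations:", n)
print("Sum of relative frequencies:", round(sum(relative_freq.values()), 10))
```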

Exercise Three
[Table: Patient # and insurance type, coded Insured (1), Medicare/Medicaid (2), Uninsured (3)]

What is the frequency of each of the 3 insurance types?
What type of variable is Patient #?

Quantitative Data
Notation
In statistics, notation is often used as shorthand for a variety
of calculations
Often, the summation sign is used (CAPITAL sigma, $\Sigma$;
we will use lowercase sigma, $\sigma$, to denote a different item
later).
We denote the measurements as $x_1, x_2, x_3, \ldots, x_n$,
where n is the total number of measurements.
To sum these up, we use the summation sign as such:
$\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n$


Let's assume the following column of values
45
23
8
56
9
22
65
24

What is n?
What value would correspond to $x_4$?
What is the value of $\sum x$?

Let's assume the following column of values
45
23
8
56
9
22
65
24

What is n? n is 8
What value would correspond to $x_4$? 56
What is the value of $\sum x$? 252
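
The same three answers can be reproduced in a few lines of Python. This is only a sketch; note that Python lists start counting at 0, so the statistical $x_4$ is the element at index 3.

```python
# The column of values from the slide. Python lists are 0-indexed,
# so the statistical x_1 is values[0] and x_4 is values[3].
values = [45, 23, 8, 56, 9, 22, 65, 24]

n = len(values)        # number of measurements
x4 = values[3]         # the fourth measurement, x_4
total = sum(values)    # the summation of all x_i

print(n, x4, total)
```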

Exercise Four
Assume a dataset of 3,8,4,5,3,4,6
Find
Find ^2
Find

Quantitative Data Location


Measures of Central Tendency - the tendency of
observations to cluster or center around certain numerical
values
Mean - The average of all observations
Median - The midpoint of the observations when they are sorted
Mode - The observation value that occurs most frequently (there may
be more than one)

Exercise Five

5
8
6
8
9
5
7
2
5
6
3
5
8
9

10

What is the mean?


Median?
Mode?

Excel for central measures


Open up a new Excel workbook
Enter the numbers from Exercise Five
Use Excel functions to calculate the mean, median and
mode
SAVE this workbook. We will use it later

Sample mean
The formula for the sample mean, which is denoted as $\bar{x}$:
$\bar{x} = \frac{\sum x}{n}$, where n is the number of observations
$\bar{x}$ is usually referred to as "x bar" (the population mean is
$\mu$, "mu")
The rounding of $\bar{x}$ is subject to the degree of accuracy
necessary

Median
When n is odd, the median is simply the middle number
after arranging all observations either in ascending or
descending order
When n is even, then we use the average of the middle
two numbers
Ex: 1,5,6,8,9,9
The median would be 7 (Average of 6,8)
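
For anyone who prefers code to a spreadsheet, the sketch below computes the three central measures for the median example using Python's standard `statistics` module. The variable names are illustrative.

```python
import statistics

# Example values from the median slide (n is even).
data = [1, 5, 6, 8, 9, 9]

mean = statistics.mean(data)        # sum of values divided by n
median = statistics.median(data)    # average of the two middle values when n is even
modes = statistics.multimode(data)  # list of all most frequent values

print(mean, median, modes)
```

`multimode` returns a list because, as noted earlier, a data set may have more than one mode.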

Mean or Median?
If the mean and median are close, we typically use the
mean
However, the mean is sensitive to extreme values
(outliers) whereas the median is not. In that case, the
median is often used.
Often, the median is used when considering household
incomes, which can have extreme outliers (Virginia sociology
majors)

Symmetric vs Skewed
When the mean is to the right of the median, the distribution is right
skewed (the skewness number would be positive)
When the mean is to the left of the median, it is left skewed
(the skewness number would be negative)
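
The sketch below computes a skewness number for two small made-up data sets so that its sign can be compared with the position of the mean relative to the median. It uses the simple moment-based formula; spreadsheet functions may apply a small-sample adjustment, so the magnitude can differ slightly, but the sign should agree.

```python
import statistics

def skewness(data):
    """Moment-based skewness: mean cubed deviation / (mean squared deviation ** 1.5)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

right_tail = [1, 2, 2, 3, 10]   # one large value pulls the mean above the median
left_tail = [1, 8, 9, 9, 10]    # one small value pulls the mean below the median

for name, data in (("right_tail", right_tail), ("left_tail", left_tail)):
    print(name, "mean:", statistics.mean(data),
          "median:", statistics.median(data),
          "skewness:", round(skewness(data), 3))
```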

Some other measures of quantitative values
Max and Min - the highest and lowest values
Range - the difference between the highest value and the
lowest value
Lower half - all values to the left of the median
Upper half - all values to the right of the median
Quartiles - the median of the lower half is the 1st quartile;
the median of the upper half is the 3rd quartile
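
A minimal sketch of the "median of each half" rule, applied to the column of values from the notation example. The helper name `five_numbers` is made up for illustration. Other quartile conventions exist (some software interpolates between values), so results can differ slightly from tool to tool.

```python
import statistics

def five_numbers(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    s = sorted(data)
    n = len(s)
    half = n // 2
    lower = s[:half]                                   # values left of the median
    upper = s[half:] if n % 2 == 0 else s[half + 1:]   # values right of the median
    return (s[0], statistics.median(lower), statistics.median(s),
            statistics.median(upper), s[-1])

values = [45, 23, 8, 56, 9, 22, 65, 24]
lo, q1, med, q3, hi = five_numbers(values)
print("min", lo, "Q1", q1, "median", med, "Q3", q3, "max", hi, "range", hi - lo)
```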

Exercise Six
Consider {3,7,8,5,12,14,21,13,18,14}
Calculate:

Min
Max
Range
1st quartile
3rd quartile

Section 3 Variance and Standard Deviation

Variance
Central measures are only part of the story
We are also interested in the spread of the distribution
We earlier considered range
Now we want to consider the variance and standard deviation

Variance
The distance from an observation to the mean of all
observations is called a deviation
Each deviation is written as $x - \bar{x}$ (note that this
can be positive or negative)
Ex.
Imagine a plot with three points {3,4,5}
The mean is 4, so the deviation for 3 is negative one and the
deviation for 5 is plus one (The deviation for 4 is 0)

Example of variance
Now consider these two data sets

{1,2,3,4,5} and {2,3,3,3,4}


How would you describe these datasets?
What are the raw deviations for each number in each set?
Why can't we simply average out the deviations?

Example of variance solution


{1,2,3,4,5} and {2,3,3,3,4}
How would you describe these datasets?
They both have a mean of 3
The ranges are 4 and 2, respectively
They both have 5 members, but there is greater variance in the first
set

What are the raw deviations for each number in each set?
{-2,-1,0,1,2} and {-1,0,0,0,1}

Why can't we simply average out the deviations to arrive at
a single measure of variability?
The average would always equal zero, because the positive and negative deviations cancel
There are two ways to handle this: use absolute values (which can
cause problems) or square each deviation

Calculating variance
Remember our earlier set {3,4,5}
We had deviations of -1, 0, 1
If we summed them up, they would equal zero, which would wrongly
suggest that there is no spread in the data
To resolve that, we square each of the deviations, sum them
up and then divide by n-1 (in this case, 2)
Therefore our variance is (1 + 0 + 1)/2 = 1
Our formula looks like this:
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

Standard deviation of sample


Our standard deviation is the square root of our
variance
In our last case, our variance was 1, so the square root of 1 is
1
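
The calculation can be sketched in Python as follows, using the three small data sets from the preceding slides. The function name `sample_variance` is illustrative. The standard library's `statistics.variance` also divides by n-1 and is printed alongside for comparison.

```python
import math
import statistics

def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

for data in ([3, 4, 5], [1, 2, 3, 4, 5], [2, 3, 3, 3, 4]):
    var = sample_variance(data)
    print(data, "variance:", var,
          "library variance:", statistics.variance(data),
          "std dev:", round(math.sqrt(var), 3))
```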

Six sigma
Quality control methodology
Based on standard deviations
For roughly bell-shaped (normal) data:
1 Standard Deviation encompasses about 68% of values
2 Standard Deviations encompass about 95% of values
3 Standard Deviations encompass about 99.7% of values
Remember, these are on either side of the mean, so a total of six
standard deviations (6 sigmas)
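
These percentages can be checked for a normal distribution with Python's standard library (`statistics.NormalDist`, available from Python 3.8). This is a sketch that assumes the data follow a normal curve; real process data may not.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
```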

Six sigma

Standard Deviation and Variance in Excel
Reopen our previous workbook
We can use the STDEV and VAR functions

Exercise seven
Using excel
Enter the following numbers into a column
25,28,24,26,29,27,31,20,28,29
Using excel functions, return the mean, median, mode,
variance and standard deviation

The need to visualize
The measures discussed (mean, std deviation, etc.)
don't always tell the whole story
Anscombe's quartet: four data sets with nearly identical summary statistics that look very different when plotted

Lunch Break

Probability
Why do professional poker players always seem to win?
LUCK??
Some, but they also have a great understanding of
probability and odds

Simple probability - the coin toss


Imagine tossing a coin
What is the probability of the toss resulting in Heads?
The answer: .5 or 50%

Simple probability
Let's take our last example: we flipped a coin and it
came up heads
We now want to flip the coin again
What is the probability that it will come up heads again?
If we flip it 10 times and it is heads all ten times, what is the
probability that it will come up heads the 11th time?

Experiments
These coin tosses are experiments: acts or observations that lead to a
single outcome that cannot be predicted with
certainty.
Each basic outcome is called a sample point or simple
event

Sample points
In the case of the coin, there were two sample points:
heads and tails
What are the sample points on a single die?
We have to be careful when considering the sample
points that are possible (see next slide)

Sample points
Problem: List all of the sample points if we flip TWO coins
We might think there are three sample points when
flipping these coins
Tails/Tails
Heads/Heads
Heads/Tails

Sample points --- sample space


But there are really 4 sample points
HH
TT
HT
TH
The set of sample points is referred to as a sample
space and, as observed above, is denoted as S: {HH,
TT, HT, TH}
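
Listing a sample space by hand gets error-prone quickly, so here is a small sketch that enumerates it with `itertools.product`. Treating the four sample points as equally likely assumes two fair coins.

```python
from itertools import product

# Each coin shows H or T; product lists every ordered pair of results.
sample_space = ["".join(outcome) for outcome in product("HT", repeat=2)]
print(sample_space)

# With equally likely sample points, each one has probability 1/len(S).
p_each = 1 / len(sample_space)
print(p_each)
```

Changing `repeat=2` to `repeat=3` would list the sample space for three coins.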

Exercise eight
Denote the sample space of a single die
Denote the sample space of the face cards (J, Q, K) in a
standard deck of cards

Probability defined
The probability of an event is the likelihood that an
outcome will occur when the experiment is performed
Probability is usually denoted as P
What is the likelihood that when we roll a standard die,
the roll will result in a 3?
It is 1 in 6, or approx. 16.7%

Probability expanded
Remember the 10 consecutive coin flips that resulted in
heads?
If we were to conduct that experiment (Flipping the coin)
one million times, we would very likely see a result that
suggests roughly half the flips were heads and half the
flips were tails.
This is the law of large numbers, which states that the
relative frequency of an outcome approaches its true or
theoretical probability the more times you repeat the experiment.
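
A simulation makes this easy to see. The sketch below flips a simulated fair coin an increasing number of times and prints the relative frequency of heads. The function name and the fixed seed are illustrative choices; a different seed gives different (but similar) numbers.

```python
import random

def heads_fraction(num_flips, seed=0):
    """Simulate fair coin flips and return the relative frequency of heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(1 for _ in range(num_flips) if rng.random() < 0.5)
    return heads / num_flips

for num_flips in (10, 100, 10_000, 100_000):
    print(num_flips, heads_fraction(num_flips))
```

With only 10 flips the fraction can be far from 0.5; with many flips it typically settles close to 0.5.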

Unclear sample spaces


You are starting a business. What is the probability that it will
succeed?
We could simply state 50%...it will either succeed or fail.
However, we know that is not true, so how would we assign a
probability?
We could consider experience in running a similar business
We could look at success rates of similar businesses
We could apply statistical techniques using variables such as capital,
location, etc.

Whichever techniques we use, the final assessment of
probability is still subjective

Probability rules
ALL probabilities, whether subjective or not, must obey
2 basic rules.
The probability of each sample point MUST lie between 0 and 1
The probabilities of all sample points within a sample space
must sum to 1

Exercise nine Class discussion


Suppose you are traveling to Orange County, California,
and are interested in staying at a hotel that has a water
conservation program.
What are the sample points for this experiment?
How would you go about assigning the probabilities of
your sample points?

Events
First, what are the sample points for a single die?
S: {1,2,3,4,5,6}
Suppose that instead of the probability of a single
sample point, we were interested in something like the
probability that the number will be even (or odd). This is
called an event
An event is typically denoted as A.
Event A (even in this case) contains 3 sample points,
all with probabilities of 1/6. We simply add them up to
get P(A) = 1/2

Steps for calculating P of event
Define the experiment, that is, describe the process
used to make an observation and the type of
observation that will be recorded
List the sample points
Assign probabilities to the sample points
Determine the collection of sample points contained in
the event
Sum the sample point probabilities to get the probability
of the event
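
The five steps can be sketched in Python for the "even number on one die" event. Exact fractions are used so the result prints as 1/2 rather than a rounded decimal. Treating the six faces as equally likely is an assumption about the die.

```python
from fractions import Fraction

# Steps 1-2: the experiment is one roll of a die; list the sample points.
sample_space = [1, 2, 3, 4, 5, 6]

# Step 3: assign probabilities (equally likely faces is an assumption).
prob = {point: Fraction(1, len(sample_space)) for point in sample_space}

# Step 4: collect the sample points contained in event A (an even number).
event_a = [point for point in sample_space if point % 2 == 0]

# Step 5: sum the sample point probabilities.
p_a = sum(prob[point] for point in event_a)
print(event_a, p_a)
```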

Exercise ten
This is a study of divorced people
Group   Description                                         Proportion
PP      Joint custody and get along well                    .12
CC      Occasional conflict                                 .38
AA      Cooperate on children, conflict otherwise           .25
FF      Hostile to each other, in conflict on every issue   .25

Suppose that 100 couples are selected at random:
List the sample points
Assign probabilities to the sample points
What is the probability that the spouses fall into the FF category?
What is the probability of at least some conflict?

Identifying sample points from large groups
So far, we have seen small combinations, so identifying
the sample points was easy
But let's assume we wanted to select any sample of 5
items from a jar of 100 items. To understand the
probability of any ONE group of 5 items, we need to
know how many different groups of 5 there are.
We could try to list them all, but that would be tedious and
doesn't scale well

So what do we do?

Combinatorial math
We utilize a combination formula
From our last scenario, let's assign N=100 (total number of
items in the jar) and n=5 (total number of items we wish to
select)
The formula we use is*:
$\binom{N}{n} = \frac{N!}{n!\,(N-n)!}$
The exclamation point is called a factorial (see next slide)
*This assumes that once an item is selected it is NOT replaced. If it were replaced, there would be
100^5 ordered selections, but that wouldn't make sense in this context because we need 5 DIFFERENT items

Factorial
Factorial simply means that we multiply a number by
each positive whole number below it
Ex 5!
This means 5 x 4 x 3 x 2 x 1 (0! is 1 by definition)
So 5! = 120

Back to combinations
So in our example, N=100 and n=5
So we would have 100!/(5!(100 - 5)!)
This equals a REALLY big number. Let's open up Excel
and use the FACT function to calculate it
You should get 75,287,520
That means there are over 75 million combinations
The probability of selecting any ONE of the
combinations is 1/75,287,520
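
As an alternative to the spreadsheet, the count can be reproduced in Python. The sketch below applies the factorial formula directly and then cross-checks it against `math.comb` (available from Python 3.8). The function name `combinations` is illustrative.

```python
import math

def combinations(N, n):
    """Number of ways to choose n items from N without replacement: N! / (n! (N-n)!)."""
    return math.factorial(N) // (math.factorial(n) * math.factorial(N - n))

groups = combinations(100, 5)
print(groups)        # number of distinct groups of 5
print(1 / groups)    # probability of any ONE specific group

# math.comb computes the same quantity directly.
print(groups == math.comb(100, 5))
```

Python integers have no size limit, so the very large intermediate factorials are handled exactly.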

Exercise eleven
Calculate how many samples of 5 items out of a
possible 20 there are.
(Answer is 15,504)
Let's calculate LOTTO!!
Assume 53 numbers and you need the right
combination of 6 to win. What is the probability?

Multiplicative probability of independent events
What if we have multiple events occurring?
For example, the P of surviving heart surgery is .9, and
the P of surviving the recovery is .94. What is the
probability that if you have heart surgery, you will
actually go home?
We simply multiply the probabilities, so the P of going
home is .9 * .94 = .846 (so the hospital would say their
procedures have an 85% success rate)

Exercise 11 (Lotto)
53!/((6!)(47!))
This equals 22,957,480
The probability is 1/22,957,480
The ticket would state that the odds of winning are
roughly 1 in 23 million

Discrete probability distribution


In our sample point problems, we were able to list,
either manually or using combinatorial math, the
sample points.
The fact that we can list these makes them what we
refer to as discrete random variables. This term implies
that the distinct possible values (or sample points) can be
counted
This differs from continuous variables, which can take any of the infinitely many values in an interval and
lead to the continuous probability distribution to be
discussed shortly

Discrete probability distribution
Using our example of two coins
[Figure: bar chart "Toss of two coins" with bars for HH, 1 H and 1 T, and TT]

Discrete Probability Distribution
[Figure: bar chart "Toss of two coins" showing the probability of HH, 1 H and 1 T, and TT]

Remember from our earlier example that there were
4 sample points {HH, HT, TH, TT}
If we treat HT and TH as a single value, we can
calculate the p of that value as 1/4 + 1/4, or 1/2
The graph above depicts our discrete probability
distribution
We can also depict this with a formula

Discrete Probability Distribution


The probability distribution of a discrete random variable is a graph, table
or formula that specifies the PROBABILITY associated with each possible
value that the random variable can assume

There are two conditions that must be met
$p(x) \ge 0$ for ALL values of x
$\sum p(x) = 1$

Discrete Probability Distribution


Sometimes, the distribution is discovered after many observations and
is not known a priori
For example, U of AZ researchers used historical records of droughts in
Texas to show that the distribution of x, where x = number of years that
must be sampled until a dry year is observed, could be described by this
formula: $p(x) = (.3)(.7)^{x-1}$, x = 1, 2, 3, ...
What is the p that exactly 3 years must be sampled until a dry year is observed?
$p(3) = (.3)(.7)^{3-1} = (.3)(.7)^2 = (.3)(.49) = .147$
Thus there is about a 15% chance that the first dry year is observed in the third year sampled
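
The drought formula is easy to evaluate in code. The sketch below computes p(3) and also checks the second condition for a probability distribution by adding up p(x) over a long run of x values. The parameter name `p_dry` is made up for illustration; its value .3 comes from the formula.

```python
def p(x, p_dry=0.3):
    """Probability that the first dry year is the x-th year sampled."""
    return p_dry * (1 - p_dry) ** (x - 1)

print(round(p(3), 3))

# The probabilities over x = 1, 2, 3, ... should add up to 1;
# a long partial sum gets very close.
print(sum(p(x) for x in range(1, 200)))
```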

Mean and std deviation of discrete random variable
[Figure: bar chart "Toss of two coins" showing the probability of HH (0), 1 H and 1 T (1), and TT (2)]

Let's assign values of 0, 1, 2 to the 3
possible outcomes. While we can
look at the graph to see that the
mean appears to be 1, we can
confirm that by multiplying our
values by their respective p and summing
So 0(.25) + 1(.5) + 2(.25) = 1
Thus the mean is one.
This mean is often referred to as the
expected value and denoted E(x)
-- You will see this later in the BAPM
program

Exercise twelve
Let's say you work for an insurance company and you
sell a one year $10,000 policy with a premium of $290.
The policy pays out if the customer dies, but the p of
that happening is .001. What is the expected gain of
this transaction?

Exercise twelve
Let's say you work for an insurance company and you
sell a one year $10,000 policy with a premium of $290.
The policy pays out if the customer dies, but the p of
that happening is .001. What is the expected gain of
this transaction?

Gain x                    Sample point      p
290                       Customer lives    .999
-9,710 (290 - 10,000)     Customer dies     .001

The expected gain is 290(.999) + (-9710)(.001) = $280
If the company were to sell a very large number of policies it would net,
on average, $280 per sale.
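
Expected value is just a probability-weighted sum, which the sketch below applies to both the two-coin distribution and the insurance policy. The function name and the list-of-pairs layout are illustrative choices.

```python
def expected_value(distribution):
    """Sum of each value times its probability; distribution is a list of (x, p) pairs."""
    return sum(x * p for x, p in distribution)

two_coins = [(0, 0.25), (1, 0.5), (2, 0.25)]
policy_gain = [(290, 0.999), (290 - 10_000, 0.001)]

print(expected_value(two_coins))
print(round(expected_value(policy_gain), 2))
```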

The variance (and std deviation) of a random variable
The variance of a random variable x is:
$\sigma^2 = E[(x-\mu)^2] = \sum (x-\mu)^2 \, p(x)$
This is referred to as the expected value of the squared
distance from the mean
The std deviation is simply the square root of $\sigma^2$

Exercise thirteen
A certain type of chemotherapy is successful 70% of the
time. Let x equal the number of successful cures out of
5.

x       0      1      2      3      4      5
p(x)    .002   .029   .132   .309   .360   .168

So this means that for any five treated patients, exactly 2 will
survive 13.2% of the time, 4 will survive 36% of the time, etc.
Find the mean, variance and standard deviation of the
distribution

Exercise thirteen

$\mu = E(x) = 0(.002) + 1(.029) + 2(.132) + 3(.309) + 4(.360) + 5(.168) = 3.5$
This means that the number of successful cures, on average, for five patients
will be 3.5 (70% success rate)
Variance = $\sigma^2 = E[(x-\mu)^2] = \sum (x-\mu)^2 \, p(x)$
= (0-3.5)^2 (.002) + (1-3.5)^2 (.029) + ... + (5-3.5)^2 (.168) = 1.05
Std deviation is the square root of 1.05 = 1.02
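
The same three quantities can be computed in a few lines of Python, as a way to check the arithmetic above. The dictionary layout and variable names are illustrative.

```python
import math

# Distribution from Exercise thirteen: x = number of cures out of 5.
dist = {0: 0.002, 1: 0.029, 2: 0.132, 3: 0.309, 4: 0.360, 5: 0.168}

mu = sum(x * p for x, p in dist.items())                 # expected value E(x)
var = sum((x - mu) ** 2 * p for x, p in dist.items())    # E[(x - mu)^2]
sd = math.sqrt(var)

print(round(mu, 2), round(var, 2), round(sd, 2))
```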

Conclusion of Session 1
Q&A
Session 2 Agenda

Continuous probability distribution


Sampling Distributions
Interval Estimates
Confidence Intervals
Z values, t test and p values
Introduction to hypothesis testing

Appendix
Website with statistical symbols
http://www.rapidtables.com/math/symbols/Statistical_Symbols.htm
Greek letters

Z score table

T-table
