
Applied Statistics and Research Methods

By
Dr. Abhijat Arun Abhyankar
Associate Professor
Course Contents
Sampling and data collection; Design of experiments: completely
randomized design, randomized block design and Latin square
design. Probability distributions: Bayes' rule, binomial, Poisson,
uniform, hypergeometric, multinormal, gamma, beta, chi-square
and F distributions; statistical estimation; hypothesis tests: z, t,
chi-square and ANOVA; Multivariate analysis: multiple regression,
discriminant analysis, multidimensional scaling and factor
analysis; Research process: identification of problem, formulation
of objectives and hypotheses; Research design: characteristics of good
research design; measurement scales: Likert, semantic differential
and Stapel scales; Surveys: types and techniques, questionnaire
design, data analysis; writing and presentation of research
reports; ethics.
Session Plan
Lecture No.   Topic to be covered
1             Introductory lecture
2             Normal distribution
3             Statistical estimation
4             Hypothesis testing
5             t and chi-square tests
6             ANOVA
7             Bayes' theorem and discrete distributions
8             Simple and multiple regression
9             Multivariate analysis
10            Class assignment
Reference Books

Research Methodology: Methods and Techniques, 2nd edition, C. R. Kothari
Statistics for Management, 7th edition, R. I. Levin and D. S. Rubin
A Class Textbook of Business Statistics, Malochan Hazarika
Statistics for Business and Economics, 8th edition, Anderson, Sweeney and Williams
Business Statistics for Contemporary Decision Making, 4th edition, Ken Black
Applied Multivariate Statistical Analysis, 5th edition, Richard A. Johnson and Dean W. Wichern
Any Guesses

R, R-square, adjusted R-square

Properties of Normal distribution

outliers

Null and alternative hypothesis

Linear and Multiple regression model

Test for Normality

Degree of freedom
Correlation coefficient, r
The quantity r, called the linear correlation coefficient, is a
descriptive measure of the strength and direction of the linear
association between two variables. It is sometimes referred to as the
Pearson product-moment correlation coefficient in honor of its
developer, Karl Pearson. For n paired observations (x, y), the
formula for computing r is:

$r = \dfrac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$
Correlation Coefficient (contd.)
The value of r is such that -1 ≤ r ≤ +1.

The + and - signs are used for positive linear correlations and negative linear correlations, respectively.

Positive correlation: If x and y have a strong positive linear correlation, r is close to +1. An r value of exactly +1
indicates a perfect positive fit. Positive values indicate a relationship between x and y such that as
values for x increase, values for y also increase.

Negative correlation: If x and y have a strong negative linear correlation, r is close to -1. An r value of exactly -1
indicates a perfect negative fit. Negative values indicate a relationship between x and y such that as values for x
increase, values for y decrease.

No correlation: If there is no linear correlation or only a weak linear correlation, r is close to 0. A value near zero
means that there is little or no linear relationship between the two variables; the points may be randomly scattered
or may follow a nonlinear pattern.

Note that r is a dimensionless quantity; that is, it does not depend on the units employed.

A perfect correlation of ±1 occurs only when the data points all lie exactly on a straight line. If r = +1, the slope of
this line is positive. If r = -1, the slope of this line is negative.

As a rule of thumb, a correlation with absolute value greater than 0.8 is generally described as strong, whereas one
with absolute value less than 0.5 is generally described as weak.
Coefficient of Determination, r², R²
The coefficient of determination, r², gives the proportion of the variance
(fluctuation) of one variable that is predictable from the other variable.

It allows us to determine how certain one can be in making predictions from a
certain model/graph. The coefficient of determination is the ratio of the explained
variation to the total variation. It satisfies 0 ≤ r² ≤ 1 and denotes the strength
of the linear association between x and y.

The coefficient of determination is usually read as the percentage of the variation
in y that the line of best fit accounts for. For example, if r = 0.922, then r² = 0.850,
which means that 85% of the total variation in y can be explained by the linear
relationship between x and y (as described by the regression equation). The other
15% of the total variation in y remains unexplained.

The coefficient of determination is a measure of how well the regression line
represents the data. If the regression line passes exactly through every point on the
scatter plot, it explains all of the variation. The further the line is
away from the points, the less it is able to explain.
Problem:
Student population and quarterly sales data for 10 Armand's Pizza
Parlors

Restaurant   Student population (1000s), xi   Quarterly sales ($1000s), yi
1            2                                58
2            6                                105
3            8                                88
4            8                                118
5            12                               117
6            16                               137
7            20                               157
8            20                               169
9            22                               149
10           26                               202
Sample data for the stereo and sound equipment store

Week   Number of commercials (x)   Sales volume (y)
1      2                           50
2      5                           57
3      1                           41
4      3                           54
5      4                           54
6      1                           38
7      5                           63
8      3                           48
9      4                           59
10     2                           46

Calculate correlation coefficient and coefficient of determination
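
As a way of checking hand calculations for the two data sets above, the following is a minimal Python sketch. It assumes the usual computational (sums-based) form of the Pearson formula; the function name pearson_r and the variable names are illustrative and not part of the course material.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient from the raw sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Pizza parlour data: student population (1000s) and quarterly sales ($1000s)
students = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
sales = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

# Stereo store data: number of commercials and sales volume
commercials = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
volume = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]

r_pizza = pearson_r(students, sales)
r_stereo = pearson_r(commercials, volume)

print(f"pizza:  r = {r_pizza:.4f}, r^2 = {r_pizza ** 2:.4f}")
print(f"stereo: r = {r_stereo:.4f}, r^2 = {r_stereo ** 2:.4f}")
```

Both data sets show a strong positive linear association: r is about 0.95 for the pizza data and about 0.93 for the stereo data.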


Slope and intercept of the best-fit line

Y = a + bX
where a is the intercept and b is the slope.
Y-intercept: $a = \bar{Y} - b\bar{X}$
Slope: $b = \dfrac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}$
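
The slope and intercept formulas can be tried on the pizza parlour data. This is a small sketch rather than a reference implementation; the helper name fit_line is hypothetical, and it assumes the slope is (ΣXY - n·X̄·Ȳ) / (ΣX² - n·X̄²) with intercept a = Ȳ - b·X̄.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for Y = a + bX."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    b = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
    a = y_bar - b * x_bar
    return a, b

# Student population (1000s) and quarterly sales ($1000s)
students = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
sales = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

a, b = fit_line(students, sales)
print(f"Y = {a:.1f} + {b:.1f} X")
```

For these data the fitted line is Y = 60 + 5X: each additional 1,000 students is associated with about $5,000 more in quarterly sales.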
Rank correlation
This is a measure of the correlation that exists between the two sets of ranks, a
measure of the degree of association between the variables that we would not have
been able to calculate otherwise.
We can compute a measure of association that is based on the ranks of the
observations, not the numerical values of the data.
This measure is called the Spearman rank correlation coefficient, in honor of the
statistician who developed it in the early 1900s.

Coefficient of rank correlation:

$r_s = 1 - \dfrac{6\sum d^2}{n(n^2 - 1)}$

where:
rs = coefficient of rank correlation (notice that the subscript s, from
Spearman, distinguishes this r from the one we calculated earlier)
n = number of paired observations
Σ = notation meaning "the sum of"
d = difference between the ranks for each pair of observations
Problem
Ranking of eleven cities

City   Air-quality rank (1)   Pulmonary-disease rank (2)
A      4                      5
B      7                      4
C      9                      7
D      1                      3
E      2                      1
F      10                     11
G      3                      2
H      5                      10
I      6                      8
J      8                      6
K      11                     9
Interpretation

A correlation coefficient of 0.736 suggests a substantial positive
association between average air quality and the occurrence of
pulmonary disease, at least in the eleven cities sampled; that is,
high levels of pollution go with high incidence of pulmonary
disease.
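
The value 0.736 can be reproduced from the two rank columns. The sketch below assumes the Spearman formula rs = 1 - 6Σd² / (n(n² - 1)) with no tied ranks; the variable names are illustrative.

```python
# Ranks for cities A to K
air_quality_rank = [4, 7, 9, 1, 2, 10, 3, 5, 6, 8, 11]
pulmonary_rank = [5, 4, 7, 3, 1, 11, 2, 10, 8, 6, 9]

n = len(air_quality_rank)

# d = difference between the two ranks for each city
d = [r1 - r2 for r1, r2 in zip(air_quality_rank, pulmonary_rank)]
sum_d2 = sum(di ** 2 for di in d)

r_s = 1 - (6 * sum_d2) / (n * (n ** 2 - 1))

print(f"sum of d^2 = {sum_d2}")
print(f"r_s = {r_s:.3f}")
```

Here Σd² = 58, so rs = 1 - 348/1320, which rounds to 0.736.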
Problem: Multiple regression
Data from IRS auditing records during last 10 months

            X1: Field-audit     X2: Computer      Y: Actual unpaid taxes
Month       labor hours         hours             discovered
            (00s omitted)       (00s omitted)     (millions of dollars)
January     45                  16                29
February    42                  14                24
March       44                  15                27
April       45                  13                25
May         43                  13                26
June        46                  14                28
July        44                  16                30
August      45                  16                28
September   44                  15                28
October     43                  15                27
Ŷ = a + b1X1 + b2X2 -------------------------(A)
where
Ŷ = estimated value of the dependent variable
a = Y-intercept
X1 and X2 = values of the two independent variables
b1 and b2 = slopes associated with X1 and X2, respectively
$\sum Y = na + b_1\sum X_1 + b_2\sum X_2$

$\sum X_1Y = a\sum X_1 + b_1\sum X_1^2 + b_2\sum X_1X_2$

$\sum X_2Y = a\sum X_2 + b_1\sum X_1X_2 + b_2\sum X_2^2$


Values for fitting least squares plane, where n = 10

Y      X1     X2     X1Y         X2Y         X1X2        X1²      X2²      Y²
(1)    (2)    (3)    (2) x (1)   (3) x (1)   (2) x (3)   (2)²     (3)²     (1)²
29     45     16     1,305       464         720         2,025    256      841
24     42     14     1,008       336         588         1,764    196      576
27     44     15     1,188       405         660         1,936    225      729
25     45     13     1,125       325         585         2,025    169      625
26     43     13     1,118       338         559         1,849    169      676
28     46     14     1,288       392         644         2,116    196      784
30     44     16     1,320       480         704         1,936    256      900
28     45     16     1,260       448         720         2,025    256      784
28     44     15     1,232       420         660         1,936    225      784
27     43     15     1,161       405         645         1,849    225      729
272    441    147    12,005      4,013       6,485       19,461   2,173    7,428
ΣY     ΣX1    ΣX2    ΣX1Y        ΣX2Y        ΣX1X2       ΣX1²     ΣX2²     ΣY²

Means: Ȳ = 27.2, X̄1 = 44.1, X̄2 = 14.7
272 = 10a + 441b1 + 147b2 -------------(1)
12,005 = 441a + 19,461b1 + 6,485b2 -------------(2)
4,013 = 147a + 6,485b1 + 2,173b2 -------------(3)

Step 1. Multiply Equation (1) by -441 and Equation (2) by 10, then add the
results. This eliminates a and produces Equation (4).
(1) x (-441): -119,952 = -4,410a - 194,481b1 - 64,827b2
(2) x (10)  :  120,050 =  4,410a + 194,610b1 + 64,850b2
(4)         :       98 =           129b1 + 23b2

Step 2. Multiply Equation (1) by -147 and Equation (3) by 10, then add the
results. This eliminates a and produces Equation (5).
(1) x (-147): -39,984 = -1,470a - 64,827b1 - 21,609b2
(3) x (10)  :  40,130 =  1,470a + 64,850b1 + 21,730b2
(5)         :     146 =           23b1 + 121b2
Step 3. Multiply Equation (4) by -23 and Equation (5) by 129, then add the
results to eliminate b1. This produces Equation (6), which can be solved for b2.
(4) x (-23): -2,254 = -2,967b1 - 529b2
(5) x (129): 18,834 =  2,967b1 + 15,609b2
(6)        : 16,580 =  15,080b2

b2 = 1.099

Step 4. Find the value of b1 by substituting the value for b2 into Equation (4):
(4): 98 = 129b1 + 23b2
98 = 129b1 + (23)(1.099)
98 = 129b1 + 25.277
72.723 = 129b1
b1 = 0.564
Step 5. Substitute the values of b1 and b2 into Equation (1) to determine the
value of a:
(1): 272 = 10a + 441b1 + 147b2
272 = 10a + (441)(0.564) + (147)(1.099)
272 = 10a + 248.724 + 161.553
- 138.277 = 10a
a= - 13.828

Step 6. Substitute the values of a, b1, and b2 into the general two-variable
regression equation (Equation A). The resulting Equation (7) describes the
relationship among the number of field-audit labor hours, the number of
computer hours, and the unpaid taxes discovered by the auditing division.

Ŷ = a + b1X1 + b2X2

(7): Ŷ = -13.828 + 0.564X1 + 1.099X2
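
The elimination steps can be cross-checked numerically. The sketch below builds the three normal equations from the IRS data and solves them with a small Gauss-Jordan routine; the helper names solve_linear and sum_prod are hypothetical, and only the Python standard library is assumed.

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [u - factor * v for u, v in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def sum_prod(u, v):
    """Sum of element-wise products, e.g. the sum of X1*Y."""
    return sum(p * q for p, q in zip(u, v))

x1 = [45, 42, 44, 45, 43, 46, 44, 45, 44, 43]  # field-audit labor hours
x2 = [16, 14, 15, 13, 13, 14, 16, 16, 15, 15]  # computer hours
y = [29, 24, 27, 25, 26, 28, 30, 28, 28, 27]   # unpaid taxes discovered
n = len(y)

# Coefficient matrix and right-hand side of the normal equations
A = [
    [n, sum(x1), sum(x2)],
    [sum(x1), sum_prod(x1, x1), sum_prod(x1, x2)],
    [sum(x2), sum_prod(x1, x2), sum_prod(x2, x2)],
]
rhs = [sum(y), sum_prod(x1, y), sum_prod(x2, y)]

a, b1, b2 = solve_linear(A, rhs)
print(f"a = {a:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
```

The slopes agree with the hand calculation to three decimals. The intercept comes out near -13.82 rather than -13.828, because the hand calculation substitutes the rounded values of b1 and b2.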


Problem:
Given the following set of data
a) Calculate the multiple regression plane.
b) Predict Y when X1 = 5 and X2 = 7

Y X1 X2
2 1 0
8 3 4
5 2 1
6 3 3
12 5 3
19 8 8
Radio station WILD is contemplating a new contest that will require listeners
to call the station and guess the identity of a secret spy. WILD hopes that the
contest will capture a larger share of the listening market. Prizes and the
number of times per day that calls will be accepted have yet to be determined.
The past 5 contests that WILD has run have yielded the following data:

X1: Number of calls   X2: Total     Y: % of listening market
per day               prizes ($)    during contest
15                    15            39
8                     3.5           23
19                    5             28
24                    10            35
10                    1.5           23

a) Calculate the least squares equation that best relates these 3 variables.
b) If WILD takes 13 calls per day and each prize is worth $7.50, what market
share should be expected during the contest?
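Coefficient of Multiple Regression

$R^2_{y.x_1x_2} = \dfrac{b_1\left(\sum x_1y - n\bar{x}_1\bar{y}\right) + b_2\left(\sum x_2y - n\bar{x}_2\bar{y}\right)}{\sum y^2 - n\bar{y}^2}$

where x̄1, x̄2 and ȳ are the sample means, n is the number of observations,
and b1 and b2 are the regression coefficients. The multiple correlation
coefficient R is the positive square root of this ratio.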
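
As an illustration, the ratio can be evaluated for the IRS audit data using the column sums and the rounded coefficients from the worked example. This sketch assumes the ratio is read as R² (explained variation over total variation); the variable names are illustrative.

```python
from math import sqrt

n = 10
sum_x1y, sum_x2y, sum_y2 = 12005, 4013, 7428  # column sums from the table
x1_bar, x2_bar, y_bar = 44.1, 14.7, 27.2      # sample means
b1, b2 = 0.564, 1.099                         # rounded regression coefficients

# Variation in y explained by the fitted plane
explained = b1 * (sum_x1y - n * x1_bar * y_bar) + b2 * (sum_x2y - n * x2_bar * y_bar)
# Total variation in y about its mean
total = sum_y2 - n * y_bar ** 2

r_squared = explained / total
r_multiple = sqrt(r_squared)

print(f"R^2 = {r_squared:.3f}, R = {r_multiple:.3f}")
```

With these inputs R² is about 0.73, so roughly 73% of the variation in unpaid taxes discovered is accounted for by the two kinds of hours.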
Adjusted multiple coefficient of determination
If a variable is added to the model, R² never decreases and usually
becomes larger, even if the variable added is not statistically
significant. The adjusted multiple coefficient of determination
compensates for the number of independent variables in the model.

$R^2_a = 1 - (1 - R^2)\dfrac{n - 1}{n - p - 1}$

where
R² = sample R-square
p = number of predictors (or independent variables)
n = total sample size
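
A short sketch of the adjustment follows. The function name adjusted_r2 is hypothetical, and the input R² = 0.729 with n = 10 and p = 2 is an illustrative value (roughly what the IRS audit data give), not a figure stated on the slides.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-square for n observations and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 observations")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Illustrative input: R^2 = 0.729 with 10 observations and 2 predictors
r2_adj = adjusted_r2(0.729, n=10, p=2)
print(f"adjusted R^2 = {r2_adj:.3f}")

# Adding a third predictor without improving R^2 lowers the adjusted value
r2_adj_extra = adjusted_r2(0.729, n=10, p=3)
print(f"with one more predictor: {r2_adj_extra:.3f}")
```

The second call shows the penalty at work: the same R² spread over more predictors yields a smaller adjusted value.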
Properties of a normal distribution
Forms a symmetric bell-shaped curve
50% of the scores lie above and 50% below the midpoint of the
distribution
Curve is asymptotic to the x axis
Mean, median, and mode coincide at the center of the distribution
The total area under the curve is equal to one.
Between μ - σ and μ + σ, the graph curves downward. The graph
curves upward to the left of μ - σ and to the right of μ + σ. The points
at which the curve changes from curving upward to curving downward,
x = μ - σ and x = μ + σ, are called the inflection points
Percentages of the Area Under the Standard Normal Curve
[Figure: the standardized normal curve (mean 0 and variance 1), with the areas between regions marked]

The areas between the regions are as follows:
-1 ≤ Z ≤ 1    68.27%
-2 ≤ Z ≤ 2    95.45%
-3 ≤ Z ≤ 3    99.73%
This means that 68.27% of the scores lie within 1 standard deviation of the mean.
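
These three percentages can be verified with the error function from Python's standard library. The sketch relies on the identity P(-k ≤ Z ≤ k) = erf(k/√2) for a standard normal Z; the function name is illustrative.

```python
from math import erf, sqrt

def area_within(k):
    """Area under the standard normal curve between -k and +k."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"-{k} <= Z <= {k}: {100 * area_within(k):.2f}%")
```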
Example: compute the mean, median and mode of the monthly
starting salaries for the business college graduates

Graduate   Monthly starting salary ($)
1          2850
2          2950
3          3050
4          2880
5          2755
6          2710
7          2890
8          3130
9          2940
10         3325
11         2920
12         2880
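
One quick way to check the three measures is Python's built-in statistics module. This is only a sketch; the variable names are illustrative.

```python
import statistics

salaries = [2850, 2950, 3050, 2880, 2755, 2710,
            2890, 3130, 2940, 3325, 2920, 2880]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)  # average of the two middle values
mode_salary = statistics.mode(salaries)      # the most frequent value

print(mean_salary, median_salary, mode_salary)
```

The mean is 2940, the median is 2905 and the mode is 2880, the only salary that occurs twice.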
Percentile
A percentile provides information about how the data are
spread over the interval from the smallest to the largest
value. For data that do not have numerous repeated values,
the pth percentile divides the data into two parts.
Approximately p percent of the observations have values
less than the pth percentile; approximately (100-p) percent of
the observations have values greater than the pth percentile.

Percentile
The pth percentile is a value such that at least p percent of
the observations are less than or equal to this value and at
least (100-p) percent of the observations are greater than or
equal to this value
Calculating the pth percentile
Step 1: Arrange the data in ascending order (smallest to largest
value).

Step 2: Compute an index i:

i = (p/100)n

where p is the percentile of interest and n is the number of
observations.

Step 3: (a) If i is not an integer, round up. The next integer greater
than i denotes the position of the pth percentile.
(b) If i is an integer, the pth percentile is the average of the values
in positions i and i+1.
Problem: Calculate the 85th and 50th
percentiles for the starting salary data
Step 1: Arrange the data in ascending order
2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 3325

Step 2:
i = (p/100)n = (85/100)12 = 10.2

Step 3: Because i is not an integer, round up. The position of the 85th percentile is the
next integer greater than 10.2, the 11th position.

From the data, the 85th percentile is the value in the 11th position, or 3130.

50th percentile:
Step 2: i = (p/100)n = (50/100)12 = 6

Because i is an integer, the 50th percentile is the average of the sixth and seventh values;
thus the 50th percentile is (2890 + 2920)/2 = 2905. Note that the 50th percentile is also the median.
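
The three-step rule translates directly into code. The sketch below assumes p is a whole number between 1 and 99 and uses integer arithmetic to decide whether i is an integer; the function name percentile is hypothetical. Note that library routines often use different interpolation rules and may return slightly different values.

```python
def percentile(data, p):
    """pth percentile by the index rule i = (p/100) * n."""
    values = sorted(data)        # Step 1: ascending order
    n = len(values)
    scaled = p * n               # equals 100 * i, kept as an integer
    if scaled % 100 == 0:        # Step 3(b): i is an integer
        i = scaled // 100
        return (values[i - 1] + values[i]) / 2  # positions i and i+1
    i = scaled // 100 + 1        # Step 3(a): round i up
    return values[i - 1]         # position i (1-based)

salaries = [2850, 2950, 3050, 2880, 2755, 2710,
            2890, 3130, 2940, 3325, 2920, 2880]

p85 = percentile(salaries, 85)
p50 = percentile(salaries, 50)
print(p85, p50)
```

The same function gives the quartiles when called with p = 25, 50 and 75.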
Quartiles
Quartiles are just specific percentiles, so the steps for
computing percentiles can be applied directly to the
computation of quartiles. The quartiles divide the data
into four parts.

Q1 = first quartile or 25th percentile
Q2 = second quartile or 50th percentile (also the median)
Q3 = third quartile or 75th percentile

Compute Q1 and Q3.


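Quartiles (contd.): Solution

Sorted data, with each quartile shown between the two values it averages:

2710
2755
2850
        Q1 = (2850 + 2880)/2 = 2865
2880
2880
2890
        Q2 = (2890 + 2920)/2 = 2905 (median)
2920
2940
2950
        Q3 = (2950 + 3050)/2 = 3000
3050
3130
3325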
Box plot

A box plot is a graphical summary of data based on the five-number
summary. A key to the development of a box plot
is the computation of the median and the quartiles, Q1 and
Q3. The interquartile range IQR = Q3 - Q1 is also used.

Five-number summary
1. Smallest value
2. First quartile
3. Median
4. Third quartile
5. Largest value
Box plot for detecting outliers
Draw a box from the 25th to the 75th percentile.

Split the box with a line at the median.

Draw a thin line (whisker) from the 75th percentile up to the maximum
value.

Draw another thin line from the 25th percentile down to the minimum
value.

The length of the box in a box plot, i.e., the distance between the 25th and
75th percentiles, is known as the interquartile range (IQR). You can use this
box length to detect outliers.

If any value in the series lies more than 1.5 times the length of the box
(1.5 x IQR) below Q1 or above Q3, then we have evidence of an outlier.
[Figure: box plot of the starting salary data, with lines showing the lower and upper limits]
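
The lower and upper limits for the starting salary data can be computed from the quartiles found earlier (Q1 = 2865, Q3 = 3000). This is a minimal sketch that assumes the limits sit 1.5 x IQR beyond the quartiles; the variable names are illustrative.

```python
salaries = [2710, 2755, 2850, 2880, 2880, 2890,
            2920, 2940, 2950, 3050, 3130, 3325]

q1, q3 = 2865, 3000          # quartiles from the worked solution
iqr = q3 - q1                # length of the box

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Values outside the limits are flagged as outliers
outliers = [s for s in salaries if s < lower_limit or s > upper_limit]

print(f"IQR = {iqr}, limits = ({lower_limit}, {upper_limit})")
print("outliers:", outliers)
```

With IQR = 135 the limits are 2662.5 and 3202.5, so the salary of 3325 is the only value flagged.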
Problem: Compute the sample standard deviation for starting salary data

Monthly salary   Sample mean   Deviation about the   Squared deviation about
(xi)             (x̄)           mean (xi - x̄)         the mean (xi - x̄)²
2850             2940          -90                   8100
2950             2940          10                    100
3050             2940          110                   12100
2880             2940          -60                   3600
2755             2940          -185                  34225
2710             2940          -230                  52900
2890             2940          -50                   2500
3130             2940          190                   36100
2940             2940          0                     0
3325             2940          385                   148225
2920             2940          -20                   400
2880             2940          -60                   3600
                                            Total:   301,850

Sample variance: s² = 301,850 / (12 - 1) = 27,440.91
Sample standard deviation: s = √27,440.91 = 165.65
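
The table can be reproduced step by step in a few lines. This sketch assumes the sample (n - 1) form of the variance and compares the result with the standard library's statistics.stdev; the variable names are illustrative.

```python
import statistics
from math import sqrt

salaries = [2850, 2950, 3050, 2880, 2755, 2710,
            2890, 3130, 2940, 3325, 2920, 2880]

n = len(salaries)
mean = sum(salaries) / n

# Squared deviations about the mean, as in the last column of the table
squared_devs = [(x - mean) ** 2 for x in salaries]
sum_sq = sum(squared_devs)

variance = sum_sq / (n - 1)   # sample variance divides by n - 1
s = sqrt(variance)

print(f"sum of squared deviations = {sum_sq:.0f}")
print(f"s = {s:.2f}")
print(f"statistics.stdev gives {statistics.stdev(salaries):.2f}")
```

The sum of squared deviations is 301,850, and dividing by n - 1 = 11 before taking the square root gives a sample standard deviation of about 165.65.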


Z scores: the standardized score
A z score helps in determining how far a particular value is
from the mean.

$z_i = \dfrac{x_i - \bar{x}}{s}$

where zi = z score for xi
x̄ = the sample mean
s = the sample standard deviation
Z score for the class size data (x̄ = 44, s = 8)

No. of students     Deviation from     Z score
in class (xi)       mean (xi - x̄)      (xi - x̄)/s
46                  2                  2/8 = 0.25
54                  10                 10/8 = 1.25
42                  -2                 -2/8 = -0.25
46                  2                  2/8 = 0.25
32                  -12                -12/8 = -1.50
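
The z scores for the class size data can be recomputed as follows. The sketch uses the sample standard deviation (n - 1 divisor) from the standard library; the variable names are illustrative.

```python
import statistics

class_sizes = [46, 54, 42, 46, 32]

x_bar = statistics.mean(class_sizes)   # sample mean
s = statistics.stdev(class_sizes)      # sample standard deviation

z_scores = [(x - x_bar) / s for x in class_sizes]

for x, z in zip(class_sizes, z_scores):
    print(f"x = {x}, z = {z:+.2f}")
```

The mean is 44 and the standard deviation is 8, so a class of 32 students lies 1.5 standard deviations below the mean.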
Null and alternative hypothesis
Given the test scores of two random samples of men and women, does one group
differ from the other?

A possible null hypothesis is that the mean male score is the same as the mean
female score:

H0: μ1 = μ2

where:
H0 = the null hypothesis
μ1 = the mean of population 1, and
μ2 = the mean of population 2.

The corresponding alternative hypothesis is that the two means differ:

H1: μ1 ≠ μ2

A stronger null hypothesis is that the two samples are drawn from the same
population, such that the variance and shape of the distributions are also equal.
Type of Sampling

1 Random sampling

2 Systematic sampling

3 Stratified Sampling

4 Cluster Sampling
Simple Random Sampling

Simple random sampling is the basic sampling technique
where we select a group of subjects (a sample) for study from
a larger group (a population).

Each individual is chosen entirely by chance and each
member of the population has an equal chance of being
included in the sample.

Every possible sample of a given size has the same chance of
selection; i.e. each member of the population is equally likely
to be chosen at any stage in the sampling process.
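
A simple random sample can be drawn with Python's random module, as sketched below. The population of 100 numbered subjects, the sample size of 5 and the seed are all hypothetical choices made for illustration.

```python
import random

# Hypothetical population: subjects numbered 1 to 100
population = list(range(1, 101))

rng = random.Random(42)   # fixed seed so the example is repeatable

# random.sample draws without replacement; every subset of size 5
# has the same chance of being selected
sample = rng.sample(population, 5)

print(sample)
```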
Systematic Sampling
A systematic sample is a sample design in which a list of the population is used as a
sampling frame and cases are selected by skipping through the list at regular
intervals.

Systematic sampling assumes that the list is well shuffled, which means there are
no clear trends such as ordering data by age, income, or education or having a
characteristic at regular intervals. If there is an ordering, there is the danger of
selecting a biased sample.

Examples:

If the population consists of 1,000 people and we want to select a sample of 250,
we will take every fourth person on the list after selecting a random starting point
between 1 and 4.

This skip interval is found by dividing the population size by the sample size
(1,000/250 = 4).
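
The 1,000-person example can be sketched in code as follows. The function name systematic_sample is hypothetical, people are represented simply by their list positions 1 to 1,000, and the population size is assumed to be an exact multiple of the sample size.

```python
import random

def systematic_sample(population, sample_size, rng):
    """Take every k-th element after a random start between 1 and k."""
    k = len(population) // sample_size   # skip interval
    start = rng.randint(1, k)            # random starting position (1-based)
    return population[start - 1::k]

population = list(range(1, 1001))        # list positions 1..1000
rng = random.Random(7)                   # fixed seed for repeatability

sample = systematic_sample(population, 250, rng)
print(len(sample), sample[:5])
```

Every selected position is exactly 4 apart, which is why any ordering or periodic pattern in the list can bias the sample.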
Stratified Sampling
A stratified sample is a sampling technique in which the researcher
divides the entire target population into different subgroups, or strata,
and then randomly selects the final subjects proportionally from the
different strata.

This type of sampling is used when the researcher wants to highlight
specific subgroups within the population.

For example, to obtain a stratified sample of university students, the
researcher would first organize the population by college class and
then select appropriate numbers of freshmen, sophomores, juniors,
and seniors. This ensures that the researcher has adequate numbers of
subjects from each class in the final sample.
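
Proportional selection from each stratum can be sketched as below. The class sizes (400, 300, 200 and 100 students), the student labels and the total sample size of 50 are hypothetical; the function name stratified_sample is also illustrative.

```python
import random

def stratified_sample(strata, n, rng):
    """Draw a proportional random sample from each stratum."""
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Proportional share; with awkward sizes the rounded shares
        # may not add up exactly to n
        k = round(n * len(members) / total)
        sample[name] = rng.sample(members, k)
    return sample

# Hypothetical university population organized by college class
strata = {
    "freshmen": [f"F{i}" for i in range(400)],
    "sophomores": [f"S{i}" for i in range(300)],
    "juniors": [f"J{i}" for i in range(200)],
    "seniors": [f"R{i}" for i in range(100)],
}

sample = stratified_sample(strata, 50, random.Random(1))
counts = {name: len(chosen) for name, chosen in sample.items()}
print(counts)
```

Each class contributes in proportion to its size: 20 freshmen, 15 sophomores, 10 juniors and 5 seniors.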
Cluster Sampling
Cluster sampling may be used when it is either impossible or
impractical to compile an exhaustive list of the elements that make
up the target population.

Usually, however, the population elements are already grouped into
subpopulations, and lists of those subpopulations already exist or can
be created.

For example, let's say the target population in a study was church
members in the United States. There is no list of all church members
in the country. The researcher could, however, create a list of
churches in the United States, choose a sample of churches, and then
obtain lists of members from those churches.
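
The church example can be sketched as a two-step procedure: sample the clusters first, then collect the members of the chosen clusters. The church names and member lists below are entirely hypothetical.

```python
import random

# Hypothetical clusters: each church has its own membership list
churches = {
    "Church A": ["A1", "A2", "A3"],
    "Church B": ["B1", "B2"],
    "Church C": ["C1", "C2", "C3", "C4"],
    "Church D": ["D1", "D2", "D3"],
}

rng = random.Random(3)   # fixed seed for repeatability

# Step 1: choose a random sample of clusters (churches)
chosen = rng.sample(sorted(churches), 2)

# Step 2: obtain the member lists from the chosen churches only
members = [m for church in chosen for m in churches[church]]

print(chosen, members)
```

Only the list of churches is needed up front; membership lists are required just for the churches that were selected.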
