Unit II - Parametric & Non-Parametric Tests

Business Analytics for
Managers
Using Excel for Data Analysis

Syllabus
Code: 302
Credits: 3
Unit I
Digital Data Introduction, Types of Digital Data: Structured Data,
Unstructured Data, and Semi-Structured Data; Exploring and Discovering
Data; Introduction to OLTP: Queries, Advantages and Challenges;
Introduction to OLAP: One-dimensional, Two-dimensional and Three-dimensional data; OLAP Architectures: MOLAP, ROLAP, HOLAP; Role of
OLAP Tools in BI Architecture.
Introduction to Big data; BI Component Frameworks: Business,
Administration and Operation, and Implementation Layer; BI is for
Whom?; BI Applications; BI Roles and Responsibilities; Basics of Data
Integration; Data Warehousing: Data Mart; What constitutes a Data
Warehouse?; Introduction to ETL; Data Integration.
UNIT II
Business Analytics: Introduction to Business Analytics; Introduction to R
Programming; Data Manipulation in R: Vectors, Basic Math, and Matrix Operations;
Summarizing Data: Numerical and Graphical Summaries; Data Visualization in R; Data
Transformation; Data Import Techniques in R; Time Series and Spatial Graphs; Graphs
for Categorical Responses and Panel Data.
Data Modeling using Excel: Overview of Excel; Basic Data Summaries: Measures
of Central Tendency, Measures of Dispersion, and Measures of Skewness and
Kurtosis; Introduction to Parametric Hypothesis Testing: One and Two Sample Tests,
Z Test and t-Test; Chi-square and Non-parametric Hypothesis Testing: Chi-square
Goodness-of-Fit Test, Sign Test, Wilcoxon Signed Rank Sum Test, Mann-Whitney U
Test; Linear Correlation and Regression Analysis; Time Series Data and Analysis.
Unit III
Statistics Essentials for Advanced Analytics: Linear and Logistic Regression;
Linear Discriminant Analysis; Exploratory Factor Analysis; Confirmatory Factor
Analysis; Conjoint Analysis; Data Mining: Clustering Techniques, Association Rule
Mining and Sentiment Analysis, Decision Trees and Random Forests; Structural
Equations Modeling.
Suggested Readings
Prasad R N and Acharya Seema (2013), Fundamentals of Business Analytics, Wiley India Pvt.
Ltd., New Delhi.
Glyn Davis and Branko Pecar (2013), Business Statistics using Excel, Oxford University Press,
New Delhi.
Halady Rao Purba (2013), Business Analytics an Application Focus, PHI Learning Private
Limited, New Delhi.
Jank Wolfgang (2011), Business Analytics for Managers, SpringerScience + Business Media,
ISBN 978-1-4614-0405-7.
Davenport Thomas H et al. (2008), Competing on Analytics, Pearson Publication, USA.
Decision support systems and business intelligence HBR Press.
E Turban, et al. (2008), Business Intelligence: A Managerial Approach, Pearson Prentice Hall.
Mosimann R et al. (2007), The performance manager: Proven strategies for turning information
into higher business performance, Cognos Press.
Articles
Thomas H. Davenport, Competing on Analytics, Harvard Business Review, 2006.

Solomon Negash, Business Intelligence, Communications of the Association for Information
Systems (Volume 13, 2004) 177-195.
Sara Philpott, Advanced Analytics: Unlocking the Power of Insight, IBM, 2010.
Pam Baker, Using Data Visualizations To Drive Business Decisions, Fiercebigdata, 2013.
Teaching Schedule (Session-wise)

Topic                                              Session
Overview of Excel; Basic Data Summaries            Session 1
o Measures of Central Tendency
o Measures of Dispersion
o Measures of Skewness & Kurtosis
Parametric Hypothesis Testing                      Session 2, 3
o One & Two Sample Tests: Z & t-Test
o Chi-square Test
Non-parametric Hypothesis Testing                  Session 4, 5
o Chi-square Goodness-of-Fit Test
o Sign Test
o Wilcoxon Signed Rank Sum Test
o Mann-Whitney U Test
Correlation Analysis                               Session 6
Linear Regression Analysis
Time Series Data and Analysis
Logistic Regression                                Session 7
Linear Discriminant Analysis                       Session 8
Exercises & Case Study
Exploratory Factor Analysis                        Session 9
Confirmatory Factor Analysis                       Session 10
Conjoint Analysis                                  Session 11
Data Mining                                        Session 12, 13, 14
o Clustering Techniques
o Association Rule Mining & Sentiment Analysis
o Decision Trees & Random Forests
Structural Equations Modeling
Exercises & Case Study
Introduction to Parametric Hypothesis Testing
We shall explore a range of hypothesis tests for one and two samples
where the population is considered to be normally distributed
Learning Objectives
On completing this unit you should be able to:
Understand the concept of the null & alternative hypothesis
Understand the difference between
o one & two samples
o parametric & non-parametric tests
o one and two tail tests
o type I and II errors
Understand the use of the p-value and critical test statistic
Conduct one sample hypothesis tests for the sample mean and proportion
Conduct two sample hypothesis tests for the sample mean and proportion
Conduct an F Test for two population variances
Solve hypothesis problems using Microsoft Excel
Hypothesis Testing
Rationale
Hypothesis: a statement about the value of a population parameter,
developed for the purpose of testing
Deciding between two possibilities based on data: is it real, or is it just
coincidence? A hypothesis is either TRUE or FALSE
Example: the hypothesis statement "the average salary of accountants is 31000" can be
measured & assessed through the variable salary
Hypothesis Testing: the procedure used to determine whether the
hypothesis is a reasonable statement & should not be rejected, or is an
unreasonable statement & should be rejected, based on sample
evidence & probability theory
Null hypothesis (H0): also known as the hypothesis of no difference; it
is formulated in anticipation of being rejected as false
Alternative hypothesis (H1): a positive proposition which states that a
significant difference exists
Example: The average salary of accountants is 31000
Null hypothesis H0: μ = 31000
Alternative hypothesis H1: μ ≠ 31000
Level of significance (LOS) - represents the amount of
risk an analyst will accept when making a decision
It represents the amount of error associated with rejecting the null
hypothesis when it is true
Usually expressed as a % and denoted by α (alpha)
It is normally 5% (0.05) or 1% (0.01), or sometimes 10% (0.1)
The value of α depends upon how sure you want to be that your
decisions are an accurate reflection of the true population
relationship
Example: a 5% LOS implies that there are about 5 chances in 100
of rejecting H0 when it is true; that is, we are 95% confident that we
will make a correct decision
1. If we sample from a population that is normally distributed then the
sampling distribution of the sample mean is normally distributed, with mean of the
sample means = population mean and sampling (standard) error $\sigma_{\bar{X}} = \sigma/\sqrt{n}$
2. For populations that are not normally distributed we can make use of the Central
Limit Theorem. For large n, the sampling distribution approximates to the normal
distribution.
3. For small sample sizes, when the population is normally distributed but σ is unknown,
we employ the Student t distribution with n - 1 degrees of freedom, with the sampling
error estimated using the sample standard deviation, $s/\sqrt{n}$
We stated earlier that the alternative hypothesis is of the form
H1: μ ≠ 31000
Two tailed test - The ≠ sign tells us that we are not sure what the
direction of the difference will be (< or >) but that a difference exists
One tailed test - It is possible that we are assessing whether
the average accountant salary is greater than 31000,
implying H1: μ > 31000, or
is smaller than 31000,
implying H1: μ < 31000
Two tail test H1: μ ≠ 100
Lower one tail test H1: μ < 100
Upper one tail test H1: μ > 100
The decision to accept/reject the null hypothesis H0 is made on the basis of
information supplied by sample data, and results in one of 2 types of errors
Type I Error: Committed by the test in rejecting a true null hypothesis.
Probability of committing a type I error is denoted by α
Type II Error: Committed by the test in accepting a false null hypothesis.
Probability of committing a type II error is denoted by β

                                   The Truth
Your Decision                      Null Hypothesis H0          Research Hypothesis H1
Accept Null Hypothesis H0          Correct Decision            Type II Error [not easily controlled]
Accept Research Hypothesis H1      Type I Error [level 0.05]   Correct Decision

2 approaches to accept/reject the hypothesis
1) Use the p-value (via Excel)
The p-value represents the probability of the calculated random sample test
statistic being this extreme if the null hypothesis is true
The p-value is compared with the chosen significance level (α) to make a decision
between accepting or rejecting the null hypothesis H0
If p < α, then reject the null hypothesis H0, because low probability events are
unlikely to occur, & accept the alternative hypothesis H1
2) Calculate the test statistic & compare with a critical test statistic
Calculate the test statistic and compare the value with a critical test statistic
estimated from an appropriate table or via Excel
The value of the critical test statistic will depend upon the significance level for
z test problems, and upon the significance level and number of degrees of
freedom for t test problems
If test statistic > critical test statistic then we would reject the null
hypothesis H0 and accept the alternative hypothesis H1
Parametric vs. Non-parametric Tests
Tests of hypothesis are usually classified into two methods: parametric
and non-parametric
Parametric methods - make assumptions about the underlying distribution
from which sample populations are selected
o assume that your data is approximately normally distributed (follows a classic
bell-shaped curve)
o data is at the interval/ratio level of measurement
Non-parametric methods - make no assumptions about the sample population's
distribution
o are often based upon data that has been ranked, rather than actual
measurement data
One sample test - involves testing a sample statistic (e.g. mean value)
against a perceived population value (e.g. accountant salary 31000) to
ascertain whether or not there is a significant difference between a sample
statistic and a population parameter
Example: H0: μ = 31000
Two sample tests - we test a sample against another sample to ascertain
whether or not there is a significant difference between the two samples and,
consequently, whether or not the two samples represent different populations
Choosing an appropriate test
Summary of Excel functions (α = level of significance)

Z-test
Test              P-value Method                       Critical Statistic Method
Two-tail test     =2*(1-NORMSDIST(ABS(Z-value)))       =NORMSINV(1-α/2) (Upper value)
                                                       =NORMSINV(α/2) (Lower value)
Lower-tail test   =NORMSDIST(Z-value)                  =NORMSINV(α)
Upper-tail test   =1-NORMSDIST(Z-value)                =NORMSINV(1-α)

T-test
Test              P-value Method                       Critical Statistic Method
Two-tail test     =TDIST(ABS(t-value),df,2)            =TINV(α,df) (Upper value)
                                                       = -TINV(α,df) (Lower value)
Lower-tail test   =TDIST(ABS(t-value),df,1)            = -TINV(2*α,df)
Upper-tail test   =TDIST(ABS(t-value),df,1)            =TINV(2*α,df)
(the one-tail p-value applies when the t-value lies in the direction of H1)

F-test
Test              P-value Method                       Critical Statistic Method
Two-tail test     =2*FDIST(F-value,df1,df2)            =FINV(α/2,df1,df2) (Upper value)
                  (for F-value > 1)                    =FINV(1-α/2,df1,df2) (Lower value)
Lower-tail test   =1-FDIST(F-value,df1,df2)            =FINV(1-α,df1,df2)
Upper-tail test   =FDIST(F-value,df1,df2)              =FINV(α,df1,df2)
(FDIST and FINV work with the right-hand tail of the F distribution)
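The Z-test critical values can be cross-checked outside Excel. The following is a minimal sketch in Python, assuming only the standard library; the helper name `z_critical` is hypothetical and simply mirrors the NORMSINV calls used for the Z-test.

```python
from statistics import NormalDist

def z_critical(alpha, tail="two"):
    """Return the critical z value(s) for a given significance level."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "two":
        upper = std_normal.inv_cdf(1 - alpha / 2)   # like NORMSINV(1-alpha/2)
        return (-upper, upper)                      # (lower value, upper value)
    if tail == "lower":
        return std_normal.inv_cdf(alpha)            # like NORMSINV(alpha)
    if tail == "upper":
        return std_normal.inv_cdf(1 - alpha)        # like NORMSINV(1-alpha)
    raise ValueError("tail must be 'two', 'lower' or 'upper'")

lower_crit, upper_crit = z_critical(0.05, "two")
upper_one_tail = z_critical(0.05, "upper")
print(round(upper_crit, 2), round(upper_one_tail, 3))
```

At α = 0.05 this gives roughly ±1.96 for the two-tail test and 1.645 for the upper-tail test.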
1. One sample Z-Test for Population Mean

Assumptions
Sample data is randomly collected from a population
Population is normally distributed
Population standard deviation is known
Example 1: Historically, the employees of a firm produce 100 units
per hour with a standard deviation of 20 units per hour. A new employee
is tested on 36 separate random occasions and found to have an output
of 90 units per hour. Does this indicate that the new employee's output
is significantly different from the average output?
H0: μ = 100, H1: μ ≠ 100
Given α = 5% = 0.05, μ = 100, σ = 20, n = 36, X̄ = 90
Solution: using the p-value method & the critical test statistic method (Zcri)
Step 1: State the null and the alternate hypothesis - H0: μ = 100, H1: μ ≠ 100
Step 2: Select the level of significance - α = 0.05
Step 3: Select the test statistic - Z-distribution (Large sample)
Step 4: Formulate the decision rule
P-value Method - If p < α, then reject the null hypothesis H0 & accept the
alternative hypothesis H1
$$Z_{cal} = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{90 - 100}{20/\sqrt{36}} = -3$$
From Excel, Two tail p-value = 2*(1-NORMSDIST(ABS(-3))) = 0.0026998
As 0.0027 < 0.05, Reject H0
Critical test Statistic Method - If |test statistic| > critical test statistic then we
would reject the null hypothesis H0 and accept the alternative hypothesis H1
From Excel, Two tail critical value = NORMSINV(1-α/2) = 1.96
As |-3| = 3 > 1.96, Reject H0
Step 5: Interpretation
Evidence suggests that the new employee's output is significantly different, at the 5%
level, from the firm's existing employee output, i.e. the sample mean value (90 units
per hour) is not close enough to the population mean value (100 units per hour)
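The arithmetic in Example 1 can be reproduced as a quick check. This is a minimal sketch in Python (standard library only); the variable names are illustrative and are not part of the original slides.

```python
from math import sqrt
from statistics import NormalDist

# Values given in Example 1
mu, sigma, n, x_bar, alpha = 100, 20, 36, 90, 0.05

# Test statistic: (sample mean - hypothesised mean) / standard error
z_cal = (x_bar - mu) / (sigma / sqrt(n))

# Two-tail p-value, equivalent to =2*(1-NORMSDIST(ABS(Z-value)))
p_two_tail = 2 * (1 - NormalDist().cdf(abs(z_cal)))

# Two-tail critical value, equivalent to =NORMSINV(1-alpha/2)
z_cri = NormalDist().inv_cdf(1 - alpha / 2)

reject_h0 = p_two_tail < alpha   # same decision as abs(z_cal) > z_cri
print(round(z_cal, 4), round(p_two_tail, 7), round(z_cri, 2), reject_h0)
```

Both decision rules lead to the same outcome here: H0 is rejected.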
2. One sample t-Test for Population Mean

Assumptions
The population standard deviation is not known, so the t-test uses the sample
standard deviation, s, as an estimate of the population standard deviation, σ
The population distribution is normal
Example 2: A local car dealer wants to know if the purchasing habits for extra
fittings by male buyers have changed. Based upon collected data he has estimated
that the distribution of extra fittings purchased is approximately normally
distributed with an average of 2000 per customer. To test this hypothesis he has
collected the data of the purchases made by the last seven male customers: 2300,
2386, 1920, 1578, 3065, 2312 and 1790. Test whether the extras purchased on
average have changed.
H0: μ = 2000, H1: μ ≠ 2000
Given α = 5% = 0.05, μ = 2000, σ = unknown, n = 7, X̄ = 2193
Solution: using the p-value method & the critical test statistic method (tcri)
Step 1: State the null and the alternate hypothesis - H0: μ = 2000, H1: μ ≠ 2000
Step 2: Select the level of significance - α = 0.05
Step 3: Select the test statistic - t-distribution (Small sample)
Step 4: Formulate the decision rule
P-value Method - If p < α, then reject the null hypothesis H0 & accept the
alternative hypothesis H1
$$t_{cal} = \frac{\bar{X} - \mu}{s/\sqrt{n}} = \frac{2193 - 2000}{489.63/\sqrt{7}} = 1.0429, \quad df = n - 1 = 6$$
From Excel, Two tail p-value = 0.337182452
As 0.337 > 0.05, Accept H0
Critical test Statistic Method - If |test statistic| > critical test statistic then we
would reject the null hypothesis H0 and accept the alternative hypothesis H1
From Excel, Two tail critical value = TINV(α, df) = 2.45
As -2.45 < 1.04 < 2.45, Accept H0
Step 5: Interpretation
There is no significant difference, at the 0.05 level, between the extras
purchased by the sample and the historical extras purchased of 2000.
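The sample mean, sample standard deviation and t statistic of Example 2 can be recomputed from the seven purchases. This is a minimal sketch in Python (standard library only). The critical value 2.45 is taken from the example rather than computed, because the standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, stdev

purchases = [2300, 2386, 1920, 1578, 3065, 2312, 1790]
mu0 = 2000      # hypothesised population mean
t_cri = 2.45    # two-tail critical value quoted in Example 2 (alpha = 0.05, df = 6)

n = len(purchases)
x_bar = mean(purchases)    # sample mean
s = stdev(purchases)       # sample standard deviation (n - 1 in the denominator)
df = n - 1

t_cal = (x_bar - mu0) / (s / sqrt(n))
accept_h0 = abs(t_cal) < t_cri
print(x_bar, round(s, 2), round(t_cal, 4), df, accept_h0)
```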

3. Two sample Z-Test for Population Mean

Example 3: A large organisation produces electric light bulbs in each of its two
factories (A and B). It is suspected that the quality of production from factory A
is better than from factory B. To test this assertion the organisation collects
samples from factory A and B and measures how long each light bulb works (in
hours) before the light bulb fails. Conduct an appropriate test to test this
hypothesis.
H0: μA ≤ μB
H1: μA > μB
Given α = 5% = 0.05, σ²A = 52783, σ²B = 61560, nA = 30, nB = 32
Solution: using the p-value method & the critical test statistic method (Zcri)
Step 1: State the null and the alternate hypothesis - H0: μA ≤ μB, H1: μA > μB
Step 2: Select the level of significance - α = 0.05
Step 3: Select the test statistic - Z-distribution (Large samples)
Step 4: Formulate the decision rule
P-value Method - If p < α, then reject the null hypothesis H0 & accept the
alternative hypothesis H1
$$Z_{cal} = \frac{\bar{X}_A - \bar{X}_B}{\sqrt{\frac{\sigma_A^2}{n_A} + \frac{\sigma_B^2}{n_B}}} = 4.16$$
From Excel, Upper tail P-value = 1-NORMSDIST(Z-value) = 0.0000159
As 0.0000159 < 0.05, Accept H1
Critical test Statistic Method - If test statistic > critical test statistic then we
would reject the null hypothesis H0 & accept the alternative hypothesis H1
From Excel, Upper tail critical value Zcri = NORMSINV(1-α) = +1.645
As 4.16 > 1.645, Accept H1
Step 5: Interpretation
At the 0.05 level of significance, the light bulbs from factory A have a significantly
longer lifetime than the light bulbs from factory B.

Analysis ToolPak solution
Select Data > Data Analysis > Z Test: Two Sample for Means

z-Test: Two Sample for Means
                                 Variable 1      Variable 2
Mean                             1135.333333     894.21875
Known Variance                   46516.6         57845.9
Observations                     30              32
Hypothesized Mean Difference
z                                4.160713018
P(Z<=z) one-tail                 1.58628E-05
z Critical one-tail              1.644853627
P(Z<=z) two-tail                 3.17256E-05
z Critical two-tail              1.959963
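The ToolPak z value can be reproduced from the summary figures. This is a minimal sketch in Python (standard library only). It assumes the means and known variances shown in the ToolPak output (46516.6 and 57845.9), since those are the values that give z = 4.16.

```python
from math import sqrt
from statistics import NormalDist

# Summary values as listed in the ToolPak output for Example 3
mean_a, mean_b = 1135.333333, 894.21875
var_a, var_b = 46516.6, 57845.9
n_a, n_b = 30, 32
alpha = 0.05

# Standard error of the difference between the two sample means
std_error = sqrt(var_a / n_a + var_b / n_b)
z_cal = (mean_a - mean_b) / std_error

# Upper-tail p-value and critical value (H1: mu_A > mu_B)
p_upper = 1 - NormalDist().cdf(z_cal)
z_cri = NormalDist().inv_cdf(1 - alpha)

accept_h1 = z_cal > z_cri
print(round(z_cal, 4), p_upper, round(z_cri, 3), accept_h1)
```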
4. Two sample Z-Test for Population Proportion

Example 4: A local police authority concerned with the number of passengers not
wearing rear seat belts in cars decided to undertake a series of surveys in two
large cities. The survey consisted of two independent random samples collected
from city A and B, and the police authority would like to know if the proportions
of passengers wearing seat belts in city A and B are different. Conduct an
appropriate test to test this hypothesis.

                             City A    City B
No. Interviewed, N           250       190
No. wearing seat belts, n    135       80

H0: πA = πB, H1: πA ≠ πB
Given α = 5% = 0.05, NA = 250, NB = 190, nA = 135, nB = 80; the population
proportions πA and πB are estimated by the sample proportions pA = nA/NA = 0.54
and pB = nB/NB = 0.421
Solution: using the p-value method & the critical test statistic method (Zcri)
Step 1: State the null and the alternate hypothesis - H0: πA = πB, H1: πA ≠ πB
Step 2: Select the level of significance - α = 0.05
Step 3: Select the test statistic - Z-distribution (Large samples)
Step 4: Formulate the decision rule
P-value Method - If p < α, then reject the null hypothesis H0 & accept the
alternative hypothesis H1
$$Z_{cal} = \frac{p_A - p_B}{\sqrt{\frac{p_A(1 - p_A)}{N_A} + \frac{p_B(1 - p_B)}{N_B}}} = 2.49$$
From Excel, Two tail P-value = 0.013
As 0.013 < 0.05, Accept H1
Critical test Statistic Method - If |test statistic| > critical test statistic then we
would reject the null hypothesis H0 & accept the alternative hypothesis H1
From Excel, Two tail critical value Zcri = NORMSINV(1-α/2) = 1.96
As 2.49 > 1.96, Accept H1
Step 5: Interpretation
At the 0.05 level of significance, we conclude that a significant difference exists between
the proportions of rear passengers wearing seat belts in city A and city B.
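The two-proportion Z statistic of Example 4 follows directly from the survey counts. This is a minimal sketch in Python (standard library only) that uses the unpooled standard error, as in the example's formula.

```python
from math import sqrt
from statistics import NormalDist

# Survey counts from Example 4
N_a, N_b = 250, 190    # number interviewed in city A and city B
n_a, n_b = 135, 80     # number wearing seat belts
alpha = 0.05

p_a = n_a / N_a        # sample proportion, city A
p_b = n_b / N_b        # sample proportion, city B

# Unpooled standard error of the difference between proportions
std_error = sqrt(p_a * (1 - p_a) / N_a + p_b * (1 - p_b) / N_b)
z_cal = (p_a - p_b) / std_error

p_two_tail = 2 * (1 - NormalDist().cdf(abs(z_cal)))
z_cri = NormalDist().inv_cdf(1 - alpha / 2)
accept_h1 = abs(z_cal) > z_cri
print(round(z_cal, 2), round(p_two_tail, 3), round(z_cri, 2), accept_h1)
```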
5. Two-Sample t Test for Population Mean
(independent samples, equal variances: Pooled t-test)
Example 5: A certain product of organic beans is packed in tins and sold by two
local shops. The local authority has received complaints from customers that the
amount of beans within the tins sold by the shops is different. To test this
statistically, two small random samples were collected from both shops.
H0: μ1 = μ2, H1: μ1 ≠ μ2
Given α = 5% = 0.05, σ1 = σ2 unknown (pooled), n1 = 18, n2 = 25
Solution: using the p-value method & the critical test statistic method (tcri)
Step 1: State the null and the alternate hypothesis - H0: μ1 = μ2, H1: μ1 ≠ μ2
Step 2: Select the level of significance - α = 0.05
Step 3: Select the test statistic - t-distribution
Step 4: Formulate the decision rule
P-value Method - If p < α, then reject the null hypothesis H0 & accept the
alternative hypothesis H1
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = 2082.017, \quad df = n_1 + n_2 - 2 = 41$$
$$t_{cal} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = 2.156$$
From Excel, Two tail P-value = 0.037
As 0.037 < 0.05, Accept H1
Critical test Statistic Method - If |test statistic| > critical test statistic then we
would reject the null hypothesis H0 & accept the alternative hypothesis H1
From Excel, Two tail critical value tcri = +2.019
As 2.156 > 2.019, Accept H1
Step 5: Interpretation
We conclude, based upon the sample data collected, that we have evidence
that the quantities of beans sold by shops A and B are significantly different at the
5% level of significance. It should be noted that the decision will change if you
use a 1% level of significance.

Analysis ToolPak solution
Select Data > Data Analysis > Two Sample Pooled t Test for Means

t-Test: Two-Sample Assuming Equal Variances
                                 Variable 1      Variable 2
Mean                             527.0555556     496.64
Variance                         2603.114379     1712.906667
Observations                     18              25
Pooled Variance                  2082.017182
Hypothesized Mean Difference     0
df                               41
t Stat                           2.156381653
P(T<=t) one-tail                 0.018485215
t Critical one-tail              1.682878003
P(T<=t) two-tail                 0.03697043
t Critical two-tail              2.0195409
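The pooled variance and t statistic in the ToolPak output can be recomputed from the summary statistics. This is a minimal sketch in Python (standard library only). The critical value is the one quoted in the example rather than computed.

```python
from math import sqrt

# Summary statistics from the ToolPak output for Example 5
mean_1, mean_2 = 527.0555556, 496.64
var_1, var_2 = 2603.114379, 1712.906667
n_1, n_2 = 18, 25
t_cri = 2.019   # two-tail critical value quoted in Example 5 (alpha = 0.05, df = 41)

df = n_1 + n_2 - 2

# Pooled variance: weighted average of the two sample variances
pooled_var = ((n_1 - 1) * var_1 + (n_2 - 1) * var_2) / df

t_cal = (mean_1 - mean_2) / sqrt(pooled_var * (1 / n_1 + 1 / n_2))
accept_h1 = abs(t_cal) > t_cri
print(df, round(pooled_var, 3), round(t_cal, 3), accept_h1)
```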
6. Two-Sample t-Test for Population Mean
(independent samples, unequal variances)
Example 6: A certain product of organic beans is packed in tins and sold by two
local shops. The local authority has received complaints from customers that the
amount of beans within the tins sold by the shops is different. To test this
statistically, two small random samples were collected from both shops.
H0: μ1 = μ2, H1: μ1 ≠ μ2
Given α = 5% = 0.05, σ1 and σ2 unknown, Distribution unknown, n1 = 18, n2 = 25
Solution: using the p-value method & the critical test statistic method (tcri)
Step 1: State the null and the alternate hypothesis - H0: μ1 = μ2, H1: μ1 ≠ μ2
Step 2: Select the level of significance - α = 0.05
Step 3: Select the test statistic - t-distribution
Step 4: Formulate the decision rule
P-value Method - If p < α, then reject the null hypothesis H0 & accept the
alternative hypothesis H1
$$t_{cal} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} = 2.083$$
$$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \frac{\left(s_2^2/n_2\right)^2}{n_2 - 1}} \approx 32$$
From Excel, Two tail P-value = 0.045
As 0.045 < 0.05, Accept H1
Critical test Statistic Method - If |test statistic| > critical test statistic then we
would reject the null hypothesis H0 & accept the alternative hypothesis H1
From Excel, Two tail critical value tcri = 2.03
As 2.083 > 2.03, Accept H1
Step 5: Interpretation
We conclude, based upon the sample data collected, that we have evidence
that the quantities of beans sold by shops A and B are significantly different at the
5% level of significance. It should be noted that the result in this case rests at ...

Analysis ToolPak solution
Select Data > Data Analysis > Two Sample t-Test for Means assuming unequal variances

t-Test: Two-Sample Assuming Unequal Variances
                                 Variable 1      Variable 2
Mean                             527.0555556     496.64
Variance                         2603.114379     1712.906667
Observations                     18              25
Hypothesized Mean Difference     0
df                               32
t Stat                           2.083385602
P(T<=t) one-tail                 0.022644068
t Critical one-tail              1.693888703
P(T<=t) two-tail                 0.045288136
t Critical two-tail              2.0369333
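The unequal-variance t statistic and its approximate degrees of freedom can be recomputed from the same summary statistics. This is a minimal sketch in Python (standard library only). The degrees of freedom are rounded to the nearest whole number to match the ToolPak output.

```python
from math import sqrt

# Summary statistics from the ToolPak output for Example 6
mean_1, mean_2 = 527.0555556, 496.64
var_1, var_2 = 2603.114379, 1712.906667
n_1, n_2 = 18, 25

# Per-sample variance of the mean
v1 = var_1 / n_1
v2 = var_2 / n_2

# t statistic without pooling the variances
t_cal = (mean_1 - mean_2) / sqrt(v1 + v2)

# Approximate degrees of freedom for unequal variances
df_exact = (v1 + v2) ** 2 / (v1 ** 2 / (n_1 - 1) + v2 ** 2 / (n_2 - 1))
df = round(df_exact)
print(round(t_cal, 3), round(df_exact, 2), df)
```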
7. Two-Sample t-test for Population Mean
(dependent samples: Paired t-test)
Example 7: Suppose that Super Slim is offering a weight reduction program that
they advertise will result in more than a 10 lb weight loss in the first 30 days.
Twenty six subjects were independently randomly selected for a study and their
weights before and after the weight loss program were recorded. Super Slim have
stated that the historical data shows that the populations are normally distributed.
H0: D = μ1 - μ2 ≤ 10
H1: D > 10
Given α = 5% = 0.05, Distribution unknown, σ1 unknown, σ2 unknown,
n1 = n2 = 26; Assume n large, CLT applies
Solution: using the p-value method & the critical test statistic method (tcri)
Step 1: State the null & the alternate hypothesis - H0: D = μ1 - μ2 ≤ 10, H1: D > 10
Step 2: Select the level of significance - α = 0.05
Step 3: Select the test statistic - t-distribution
Step 4: Formulate the decision rule
P-value Method - If p < α, then reject the null hypothesis H0 & accept the
alternative hypothesis H1
$$\bar{d} = \frac{\sum d}{n}, \quad s_d = \sqrt{\frac{\sum (d - \bar{d})^2}{n - 1}} = 14.57, \quad df = n - 1 = 25$$
$$t_{cal} = \frac{\bar{d} - D}{s_d/\sqrt{n}} = 2.5178$$
From Excel, Upper one tail P-value = 0.0093
As 0.0093 < 0.05, Accept H1
Critical test Statistic Method - If test statistic > critical test statistic then we
would reject the null hypothesis H0 & accept the alternative hypothesis H1
From Excel, Upper one tail critical value tcri = +1.708
As 2.5178 > 1.708, Accept H1
Step 5: Interpretation
Conclude that the average weight loss is more than 10 lb at a 5% level of
significance. Note that the p-value (0.0093) is only just below 0.01, so the decision
would be borderline at the 1% level.

Analysis ToolPak solution
Select Data > Data Analysis > Two Sample t Test Assuming Paired Samples
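The individual before and after weights for Example 7 are not reproduced in these slides, so the paired t calculation is illustrated here on a small hypothetical data set. This is a minimal sketch in Python (standard library only); the four pairs of weights and the helper name `paired_t` are assumptions made purely for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(before, after, hypothesised_diff=0.0):
    """Return (t statistic, degrees of freedom) for paired samples."""
    diffs = [b - a for b, a in zip(before, after)]   # d = before - after
    n = len(diffs)
    d_bar = mean(diffs)                              # mean difference
    s_d = stdev(diffs)                               # sample sd of the differences
    t_cal = (d_bar - hypothesised_diff) / (s_d / sqrt(n))
    return t_cal, n - 1

# Hypothetical weights in lb (not the Super Slim data)
before = [200, 210, 190, 205]
after = [185, 190, 180, 190]

t_cal, df = paired_t(before, after, hypothesised_diff=10)
print(round(t_cal, 4), df)
```

The resulting t value would then be compared with the upper one-tail critical value for df = n - 1, exactly as in Step 4 of the example.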
8. F-Test for Population Variances

Example 8: A certain product of organic beans is packed in tins and sold by two
local shops. The local authority has received complaints from customers that the
amount of beans within the tins sold by the shops is different. To test this
statistically, two small random samples were collected from both shops. Use the
F-test to check if the two population variances can be considered equal with 95%
confidence.
H0: σ1² = σ2²
H1: σ1² ≠ σ2²
Given α = 5% = 0.05
Solution: using the p-value method & the critical test statistic method (Fcri)
Step 1: State the null & the alternate hypothesis - H0: σ1² = σ2², H1: σ1² ≠ σ2²
Step 2: Select the level of significance - α = 0.05
Step 3: Select the test statistic - F-distribution
Step 4: Formulate the decision rule
P-value Method - If p < α, then reject the null hypothesis H0 & accept the
alternative hypothesis H1
$$F = \frac{s_A^2}{s_B^2} = 1.5197, \quad df_{numerator} = n_A - 1 = 17, \quad df_{denominator} = n_B - 1 = 24$$
From Excel, Two tail P-value = 0.3393282
As 0.3393282 > 0.05, Accept H0
Critical test Statistic Method - If the test statistic lies outside the lower and upper
critical values then we would reject the null hypothesis H0 & accept the alternative
hypothesis H1
From Excel, the two tail critical values are FL = 0.39 and FU = 2.38
As FL (0.39) < F (1.5197) < FU (2.38), Accept H0
Step 5: Interpretation
Conclude that the two population variances are not significantly different at the
95% level of confidence

Analysis ToolPak solution
Select Data > Data Analysis > F Test for Two Population Variances (Variance Ratio Test)

F-Test Two-Sample for Variances
                         Variable 1      Variable 2
Mean                     527.0555556     496.64
Variance                 2603.114379     1712.90667
Observations             18              25
df                       17              24
F                        1.519705907
P(F<=f) one-tail         0.169664107
F Critical one-tail      2.070283955
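The variance ratio of Example 8 can be checked from the two sample variances. This is a minimal sketch in Python. The critical values 0.39 and 2.38 are those quoted in the example rather than computed, because the standard library has no F distribution.

```python
# Sample variances and sizes from the ToolPak output for Example 8
var_a, n_a = 2603.114379, 18
var_b, n_b = 1712.906667, 25

# Two-tail critical values quoted in Example 8 (alpha = 0.05)
f_lower, f_upper = 0.39, 2.38

f_cal = var_a / var_b                 # variance ratio
df_num, df_den = n_a - 1, n_b - 1     # numerator and denominator df

# H0 is kept when F lies between the lower and upper critical values
accept_h0 = f_lower < f_cal < f_upper
print(round(f_cal, 4), df_num, df_den, accept_h0)
```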
Conclusion
In this presentation we explored the concept of hypothesis testing.
Chi Square and Non-Parametric Hypothesis Testing
This chapter will provide an overview of the chi squared distribution (χ²) and non-parametric tests that can be used when parametric methods are not appropriate.
Learning Objectives
On completing this unit you should be able to:
Apply the chi square test to measure the difference between
two proportions from two samples
Apply the chi-square test to test for association between
categorical variables
Apply the chi-square goodness-of-fit test
Apply the sign test to one sample
Apply the Wilcoxon signed rank T test to two paired samples
Apply the Mann-Whitney U test to two independent samples
Introduction
Parametric tests - assess whether the differences between means (or
variances, proportions) are statistically significant. Model assumptions
are:
a) Underlying population being measured varies as a normal distribution
b) The level of measurement is of equal interval or ratio scaling, and
c) The population variances are equal
Unfortunately, we will come across data that does not fit these
assumptions
a) How do we measure the difference between the attitudes of people
surveyed in assessing their favourite car, where the responses are in
the form of 1, 2, 3, ..., n? In this situation we have ordinal data in
which taking differences between the numbers (or ranks) is
meaningless.
b) Furthermore, if we are asking for opinions where the opinion is of a
categorical form (e.g. strongly agree, agree, do not agree) then the
concept of difference is again meaningless. The responses are words
not numbers, but you can, if you so wish, solve this problem by
allocating a number to each response, with 1 for strongly agree, 2 for
agree, and so on.
Choosing a Test
Chi-Square Test
Versatile test
Widely used test with data that is categorical (or nominal or
qualitative) in nature
This section will explore the application of chi square in solving 4 types
of problems:
1. Perform a χ² test of association (independence)
2. Perform a χ² test of the difference between two independent proportions
3. Perform a χ² test of the difference between two dependent proportions
(McNemar's test for matched pairs)
4. Perform a χ² test of goodness of fit to a theoretical probability distribution
For 2 and 3, you could use a Z/t test if you assume the population is
normally distributed.
1. Chi Square Test of Association
Used to determine whether the frequencies of occurrence for two (or more)
category variables are significantly related (or associated) to each other
The null hypothesis states that the row and column variables are not
associated
It can be shown that if the null hypothesis is true then the expected
frequencies (E) can be calculated using
$$E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}$$
To test the null hypothesis we compare the expected cell frequencies (E)
with the observed cell frequencies (O) and calculate the chi squared test
statistic given by
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
The number of degrees of freedom, df, is given by $df = (r - 1)(c - 1)$,
where r and c are the numbers of rows and columns
For the chi square test to give meaningful results the expected frequency for each
cell is required to be at least 5.
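The expected-frequency and χ² formulas can be written as a short routine. This is a minimal sketch in Python. The function name `chi_square_association` and the 2 × 2 table of counts are hypothetical, chosen only to show the mechanics.

```python
def chi_square_association(observed):
    """Return (chi-square statistic, df) for a contingency table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    chi_sq = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # E = Row Total x Column Total / Grand Total
            e = row_totals[i] * col_totals[j] / grand_total
            chi_sq += (o - e) ** 2 / e

    df = (len(observed) - 1) * (len(observed[0]) - 1)   # (r - 1)(c - 1)
    return chi_sq, df

# Hypothetical 2 x 2 table of observed counts
chi_sq, df = chi_square_association([[10, 20], [30, 40]])
print(round(chi_sq, 4), df)
```

The statistic would then be compared with the critical χ² value for the returned df, or converted to a p-value in Excel.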
Example 9: Suppose a university sampled 485 of its students to determine
whether males and females differed in preference for five courses offered. The
question we would like to answer is whether or not there is an association
between the courses chosen and the person's gender. In this case we have two
attributes, gender and course, both of which have been divided into categories:
2 for gender and 5 for course. Determine whether gender and course preference
are associated using the chi-square test of association on the contingency table.
Step 1: State the null and the alternate hypothesis
H0: Gender and course preference are not associated (or independent)
H1: There is an association between gender and course preference (or dependent)
Step 2: Select the level of significance - α = 0.05
Step 3: Select the test statistic - Chi-square distribution
Step 4: Formulate the decision rule
P-value Method - If p < α, then reject the null hypothesis H0 & accept the
alternative hypothesis H1
$$\chi^2_{cal} = \sum \frac{(O - E)^2}{E} = 63.36, \quad df = (r - 1)(c - 1) = (2 - 1)(5 - 1) = 4$$
From Excel, P-value = 5.7E-13
As 5.7E-13 < 0.05, Accept H1
Critical test Statistic Method - If test statistic > critical test statistic then we
would reject the null hypothesis H0 & accept the alternative hypothesis H1
From Excel, critical value = 9.4877
As 63.36 > 9.4877, Accept H1
Step 5: Interpretation
There is a significant relationship, or association, between the
category variables
2. Testing Differences for 2 Proportions

Example 10: A firm surveys whether or not employees use the train to travel to
work. The firm collects the data and has created a 2*2 contingency table to
summarise the responses for only the people who work on two days. The question
is now whether or not we have a significant difference between the Monday and
Wednesday employees who travel to work by train.
Step 1: State the null and the alternate hypothesis
H0: π1 = π2 (proportions travelling by train on the two days are the same)
H1: π1 ≠ π2 (proportions different)
Step 2: Select the level of significance - α = 0.05
Step 3: Select the test statistic - chi-square distribution
Step 4: Formulate the decision rule
P-value Method - If p < α, then reject the null hypothesis H0 & accept the
alternative hypothesis H1
$$\chi^2_{cal} = \sum \frac{(O - E)^2}{E} = 4.4373$$
As 0.035161 < 0.05, Accept H1
Critical test Statistic Method - If test statistic > critical test statistic then we
would reject the null hypothesis H0 & accept the alternative hypothesis H1
As 4.43 > 3.84, Accept H1
Step 5: Interpretation - Conclude that there is a significant difference in the proportions
travelling by train on Monday and Wednesday. Note: if we have a 1%
significance level then the decision would be reversed
3. Chi Square Test for the Difference between 2 Dependent Proportions (McNemar's test)
Example: Estimate the effectiveness of a political campaign on the voting patterns
of a group of voters. 2 groups of voters are selected at random and their voting
intentions (Drop CO2, Tax) for a local election are recorded. Both groups are then
subjected to the same campaign and their voting intentions are recorded again. The
question that arises is whether or not the campaign was effective on the voting
intentions of the voters.
In the problem, we shall look at whether or not the proportion voting Drop CO2
has significantly changed
In general the 2*2 contingency table can be structured with cell counts a, b, c and d
and total N
The Drop CO2 Before proportion (π1) and Drop CO2 After proportion (π2) are given by
the equations
$$\pi_1 = \frac{a + b}{N}, \quad \pi_2 = \frac{a + c}{N}, \quad df = (r - 1)(c - 1) = 1$$
Two tests are available:
To test the null hypothesis we can use the McNemar z-test statistic, which is
approximately normally distributed and is defined by the equation
$$Z = \frac{b - c}{\sqrt{b + c}}$$
To test the null hypothesis we can use the McNemar χ²-test statistic defined by the
equation
$$\chi^2 = \frac{(b - c)^2}{b + c}$$
For one df the relationship between χ² and Z is $\chi^2 = Z^2$
Step 1: State the null and the alternate hypothesis
H0: π1 = π2 (proportions voting Drop CO2 before and after are the same)
H1: π1 ≠ π2 (proportions different)
Step 2: Select the level of significance - α = 0.05
Step 3: Select the test statistic - McNemar z-test statistic
$$Z_{cal} = \frac{b - c}{\sqrt{b + c}} = \frac{89 - 45}{\sqrt{89 + 45}} = 3.801$$
Proportions, π1 = 0.60 and π2 = 0.53
Step 4: Formulate the decision rule
P-value Method - If p < α, then reject the null hypothesis H0 & accept the
alternative hypothesis H1
From Excel, the two tail p-value = 0.00014. As 0.00014 < 0.05, Accept H1
Critical test Statistic Method - As 3.801 > 1.96, Accept H1
Step 5: Interpretation - There is a significant difference in the voting intentions for Drop
CO2 before and after the campaign
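The McNemar statistics for the voting example depend only on the two counts b = 89 and c = 45 that appear in the calculation. This is a minimal sketch in Python (standard library only) showing both forms of the statistic and the two-tail p-value.

```python
from math import sqrt
from statistics import NormalDist

# The two counts used in the McNemar calculation for the voting example
b, c = 89, 45
alpha = 0.05

z_cal = (b - c) / sqrt(b + c)        # McNemar z-test statistic
chi_sq = (b - c) ** 2 / (b + c)      # McNemar chi-square statistic (1 df)

p_two_tail = 2 * (1 - NormalDist().cdf(abs(z_cal)))
z_cri = NormalDist().inv_cdf(1 - alpha / 2)

accept_h1 = abs(z_cal) > z_cri
print(round(z_cal, 3), round(chi_sq, 3), round(p_two_tail, 5), accept_h1)
```

The two statistics agree in the sense that χ² equals Z squared for one degree of freedom.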
4. Chi Square Goodness-of-Fit Test
In this section we will explore the concept of measuring how well a data set can be
modelled by a particular probability distribution using the goodness-of-fit test
It can be applied to discrete distributions, such as the Binomial & Poisson
For a chi-square goodness-of-fit test, the hypotheses take the form:
H0: The data are consistent with a specified distribution
H1: The data are not consistent with a specified distribution
The goodness of fit can be assessed by conducting a chi-square test on the
observed and expected frequencies as defined by the equation
$$\chi^2_{cal} = \sum \frac{(O - E)^2}{E}, \quad df = n - k - 1$$
where n is the number of categories and k is the number of parameters estimated from the sample
Example: A motorway safety officer believes that the number of accidents per
week occurring on a stretch of motorway can be modelled using a
Poisson distribution
$$P(X = r) = \frac{\lambda^r e^{-\lambda}}{r!}$$
If X denotes the number of accidents per week then the sample data can be
modelled by fitting a Poisson distribution to the sample data.
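The accident counts for this example are not reproduced in the slides, so the fitting procedure is illustrated on hypothetical data. This is a minimal sketch in Python (standard library only). The observed frequencies, and the choice to treat the last class as "4 or more", are assumptions for illustration only.

```python
from math import exp, factorial

def poisson_pmf(r, lam):
    """P(X = r) for a Poisson distribution with mean lam."""
    return lam ** r * exp(-lam) / factorial(r)

# Hypothetical number of weeks with 0, 1, 2, 3 and 4-or-more accidents
observed = [33, 42, 27, 12, 6]
total_weeks = sum(observed)

# Estimate the Poisson mean from the sample (last class counted as exactly 4)
lam = sum(r * o for r, o in enumerate(observed)) / total_weeks

# Expected frequencies; the last class takes the remaining probability
probs = [poisson_pmf(r, lam) for r in range(len(observed) - 1)]
probs.append(1 - sum(probs))
expected = [p * total_weeks for p in probs]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1 - 1    # n - k - 1, with k = 1 estimated parameter
print(round(lam, 2), [round(e, 2) for e in expected], round(chi_sq, 3), df)
```

The hypothetical counts were chosen so that every expected frequency is at least 5, in line with the rule stated for the chi square test.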
Step 1: State the null and the alternate hypothesis
H0: No. of accidents follows a Poisson distribution
H1: No. of accidents does not follow a Poisson distribution
Step 3: Select the test statistic - Chi-square distribution/ Goodness of fit test
P-value Method - If p < α, then reject the null hypothesis H0 & accept the
alternative hypothesis H1
From Excel, the p-value = 0.73
As 0.73 > 0.05, Accept H0
Critical test Statistic Method - If test statistic > critical test statistic then we would
reject the null hypothesis H0 & accept the alternative hypothesis H1
As 1.96 < 11.07, Accept H0
Step 5: Interpretation - Conclude that there is no significant difference between the
observed and expected frequencies. This implies that the data can be modelled
by a Poisson distribution
Non-Parametric Tests
Many statistical tests require that the data follow a normal distribution
Distribution free tests/ Non-parametric tests - Do not require the data to
follow a particular distribution
In this presentation we will explore three non-parametric tests
o Sign test
o Wilcoxon signed rank test
o Mann Whitney U test

Test                   Parametric test                   Non-parametric test
One sample             One sample z-test                 Sign test
                       One sample t-test                 Wilcoxon signed-rank test
Paired samples         Two paired sample Z-test          Sign test
                       Two paired sample t-test          Wilcoxon signed rank test
Independent samples    Two independent sample t-test     Mann Whitney U test (Wilcoxon rank sum test)
5.
The Sign Test
The sign test is used to test a set of data values against a perceived
hypothesis statement, including:
1. Assessing the validity of a population median value estimated from
collected sample data. This replaces the one-sample t-test, which assumes a
normal population and that the mean is a meaningful measure.
2. Assessing the validity that the difference between two population
medians is zero, based upon sample data. This replaces the paired t-test,
which assumes a normal population and that the mean is a meaningful measure.
3. Assessing the validity of proportions where the proportions are
estimated from ordered nominal (or categorical) data, where a numerical
scale is inappropriate but where we can rank the data observations. This
replaces the sample Z test for proportions, which assumes a normal
population.
If we rank the data, then under the null hypothesis we would expect half the ranks
to be less than the median (r1) and half the ranks to be greater than
the median (r2)
In this situation the null hypothesis can be modelled by a binomial
distribution with the probability of a data value being less than or greater
than the median being equal to p = 0.5, with sample size n
The sign test assumptions are
Randomly selected samples and
Continuous distribution
Sign test measures the number of counts that fall above and below the
median value
Under the null hypothesis, we would expect the distribution of counts
to be approximately symmetric around the median and the
values below and above the median to be distributed at random among
the ranks
The corresponding hypothesis statements for two tail and one tail tests
are:
Two tail test
H0: sample median = population median (0.5)
H1: sample median ≠ population median (0.5)
Upper one tail test
H1: sample median > population median (0.5)
Lower one tail test
H1: sample median < population median (0.5)
For a binomial distribution the value of the probability (P(X=r)), mean (μ) and
standard deviation (σ) are given by:
$$P(X = r) = \binom{n}{r} p^r q^{n-r}, \qquad \binom{n}{r} = \frac{n!}{r!\,(n-r)!}$$
$$\mu = np$$
$$\sigma = \sqrt{npq} = \sqrt{np(1-p)}$$
The probability distribution is a binomial distribution with the probability of
success p = 0.5 and the number of trials represented by the number of paired
observations (n), X ~ Bin (n, p)
Example: 16 randomly selected people were chosen to measure the
effectiveness of a new training programme on the value of
sales. Calculate test statistics:
(i) Binomial probability P(X ≥ 12)
For the training programme to be effective we
would expect the hypothesis statement to be
H1: the training programme results in an increase in the
average value of sales
Given random selection is made and no
information is given about the distribution, we will
use the sign test here
H0: The median sales difference is zero
H1: Median sales after training > Median sales before training
Step 3: Select the test statistic - Sign test
Calculate binomial probabilities, P(X ≥ x)
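P-value method - If p-value < α then we would reject the null hypothesis H0 & accept the alternative hypothesis H1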
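p = P(X ≥ 12) = P(X = 12, 13, 14, 15, 16)
= P(X = 12) + P(X=13) + P(X=14) + P(X=15) + P(X=16)
From Excel, μ = np = 8 and σ = sqrt(npq) = 2
$$Z = \frac{X_c - \mu}{\sigma} = \frac{11.5 - 8}{2} = 1.75$$
where X_c = 12 - 0.5 = 11.5 is the continuity-corrected count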
From Excel, upper one tail p-value = 0.0401
As 0.0401 < 0.05, Reject H0
As 1.75 > 1.6449, Reject H0
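The sign test calculation for this example can be checked with a short Python sketch (standard library only). It assumes, as in the example, n = 16 non-zero differences of which 12 are positive; the function name is illustrative.

```python
import math
from statistics import NormalDist


def sign_test_upper(n, x, p=0.5):
    """Upper one-tail sign test: exact binomial and normal-approximation p-values."""
    # Exact p-value: P(X >= x) for X ~ Bin(n, p).
    exact = sum(math.comb(n, r) * p ** r * (1 - p) ** (n - r)
                for r in range(x, n + 1))
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    # Normal approximation with continuity correction (x - 0.5).
    z = (x - 0.5 - mu) / sigma
    approx = 1.0 - NormalDist().cdf(z)
    return exact, z, approx


exact, z, approx = sign_test_upper(n=16, x=12)
print(round(exact, 4), round(z, 2), round(approx, 4))
```

The exact binomial p-value (about 0.0384) and the normal approximation (0.0401) lead to the same decision at the 5% level.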
6. Wilcoxon Signed Rank Sum Test (Matched Pairs Test)
The t-test is the standard test for testing whether the population
means of two paired samples are equal
If the populations are non-normal, particularly for small samples, then the t-test may not be valid
As for the sign test, the Wilcoxon signed rank sum test is another example of
a non-parametric/ distribution free test, used to test the null hypothesis that
the median of a distribution is equal to some value
It can be used in place of
1) One-sample t-test
2) Paired t-test
3) Ordered categorical data where a numerical scale is inappropriate but
where it is possible to rank the observations
The method considers the differences between n matched pairs as one
sample
If the two population distributions are identical, then we can show that the
sample statistic has a symmetric null distribution
Assumptions
The Wilcoxon signed rank sum test assumptions are:
1) Each matched data pair is randomly distributed
2) The matched pair differences should be symmetrically distributed
Although the Wilcoxon test assumes neither normality nor homogeneity
of variance, it does assume that the two samples are from populations with
the same distribution shape
It is also vulnerable to outliers although not to nearly the same extent as
the t-test
Suppose that Slim-Gym is offering a weight reduction program that they advertise
will result in more than a 10 lb weight loss in the first 30 days. Twenty subjects were
selected for a study and their weights before and after the weight loss program
were recorded.
Test | Hypothesis | Tcal
Two-tailed Test | H1: Population locations not centred at 0 | Tcal = Minimum of T- and T+
One-tailed Test | H1: Population differences are centred at a value > 0 | Tcal = T+
One-tailed Test | H1: Population differences are centred at a value < 0 | Tcal = T-
1. Rank data
2. Calculate the sum of the ranks, T- and T+
T- = Sum of -ve ranks = 35
T+ = Sum of +ve ranks = 265
3. Find Tcal
4. Calculate critical values
5. Make decision: From the sample data we have sufficient statistical
evidence that the weight loss is greater than 10 lbs.
$$T_- + T_+ = \frac{n'(n'+1)}{2} = 300$$
$$\mu_T = \frac{n'(n'+1)}{4} = 150$$
$$\sigma_T = \sqrt{\frac{n'(n'+1)(2n'+1)}{24}} = 35.0$$
$$Z = \frac{T_{cal} - 0.5 - \mu_T}{\sigma_T} = 3.2714$$
H0: The population median weight loss is at most 10 lbs (X-Y ≤ 10)
H1: The population median weight loss is greater than 10 lbs (X-Y-10>0)
Step 3: Select the test statistic - Wilcoxon signed rank test (Samples consist of ratio
data & no information about the form of the distribution)
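P-value method - If p-value < α then we would reject the null hypothesis H0 & accept the alternative hypothesis H1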
From Excel, upper one tail p-value = 0.000535
As 0.000535 < 0.05, Reject H0
As 3.2714 > 1.6449, Reject H0
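The normal approximation for the Wilcoxon signed rank test can be sketched in Python as follows (standard library only). The value n' = 24 is an assumption inferred from the figures on the slide (T- + T+ = 300 requires n'(n'+1)/2 = 300); the function name is illustrative.

```python
import math
from statistics import NormalDist


def wilcoxon_normal_approx(t_cal, n_prime):
    """Upper one-tail normal approximation for the Wilcoxon signed rank sum."""
    mu_t = n_prime * (n_prime + 1) / 4
    sigma_t = math.sqrt(n_prime * (n_prime + 1) * (2 * n_prime + 1) / 24)
    # Continuity correction of 0.5 for the upper tail.
    z = (t_cal - 0.5 - mu_t) / sigma_t
    p_upper = 1.0 - NormalDist().cdf(z)
    return mu_t, sigma_t, z, p_upper


# Tcal = T+ = 265 for the upper one-tail test; n' = 24 is inferred, not stated.
mu_t, sigma_t, z, p_upper = wilcoxon_normal_approx(t_cal=265, n_prime=24)
print(mu_t, sigma_t, round(z, 4), round(p_upper, 6))
```

Under these assumptions the sketch reproduces the mean of 150, the standard deviation of 35.0 and Z = 3.2714 quoted in the example.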
Small number of paired observations (n ≤ 20)
Calculate Tcal. Tcal = MIN (T- , T+) = MIN (35, 265)
= 35. The decision rule is to reject H0 if Tcal ≤ Tcri
Tcri = 92 for α = 0.05 with no tied ranks
Tcal ≤ Tcri (35 < 92), so we would reject H0 & accept H1
If you want the upper critical value then you can
calculate the value if you remember that the
distribution is symmetric about the median
(remember median = mean for symmetric
distributions):
μT - lower Tcri = upper Tcri - μT
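As a worked illustration using the values quoted in this example (mean μT = 150 and lower Tcri = 92), and assuming the same significance level applies to both tails, symmetry about the mean gives the upper critical value:

$$\text{upper } T_{cri} = 2\mu_T - \text{lower } T_{cri} = 2(150) - 92 = 208$$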
Dealing with ties (Tied Observations)
There are two types of tied observations that may arise when using the
Wilcoxon signed rank test:
1. Observations in the sample may be exactly equal to 0 in the case of
paired differences. Ignore such observations and adjust n
accordingly. For the previous example we removed any zero values and
used n' instead of n.
2. Two or more observations/differences may be equal. If so, average
the ranks across the tied observations and reduce the variance by
Equation (8.16) for each group of t tied ranks:
$$\frac{t^3 - t}{48}$$
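Both tie rules can be sketched in Python (standard library only). The paired differences below are hypothetical, the function name is illustrative, and the variance reduction assumes Equation (8.16) is the usual (t^3 - t)/48 correction per group of t tied ranks.

```python
def signed_rank_sums(differences):
    """Return (T-, T+, n', tie-adjusted variance) for paired differences."""
    # Rule 1: drop zero differences and work with the reduced sample size n'.
    nonzero = [d for d in differences if d != 0]
    ordered = sorted(abs(d) for d in nonzero)

    # Rule 2: tied absolute differences share the average of their ranks.
    positions = {}
    for rank, value in enumerate(ordered, start=1):
        positions.setdefault(value, []).append(rank)
    avg_rank = {v: sum(r) / len(r) for v, r in positions.items()}

    t_plus = sum(avg_rank[abs(d)] for d in nonzero if d > 0)
    t_minus = sum(avg_rank[abs(d)] for d in nonzero if d < 0)

    n_prime = len(nonzero)
    # Reduce the variance by (t^3 - t)/48 for each group of t tied ranks.
    tie_correction = sum((len(r) ** 3 - len(r)) / 48 for r in positions.values())
    variance = n_prime * (n_prime + 1) * (2 * n_prime + 1) / 24 - tie_correction
    return t_minus, t_plus, n_prime, variance


# Hypothetical paired differences: one zero and two groups of tied values.
diffs = [3, -1, 0, 2, 2, -4, 5, 1]
t_minus, t_plus, n_prime, variance = signed_rank_sums(diffs)
print(t_minus, t_plus, n_prime, variance)
```

A useful check is that T- + T+ always equals n'(n'+1)/2, whether or not ties are present.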
7. Mann-Whitney U Test for 2 Independent Samples
The Mann-Whitney U test is a non-parametric test that can be used in place
of an unpaired t-test
It is used to test the null hypothesis that two samples come from the same
population (i.e. have the same median) or, alternatively, whether
observations in one sample tend to be larger than observations in the other
Although it is a non-parametric test it does assume that the two distributions
are similar in shape
The basic premise of the test is that once all of the values in the two samples
are put into a single ordered list, if they come from the same parent
population, then the rank at which values from sample 1 and sample 2
appear will be by chance
If the two samples come from different populations, then the rank at which
the sample values will appear will not be random and there will be a
tendency for values from one of the samples to have lower ranks than values
from the other sample
We are thus testing for different locations of the two samples
The Mann-Whitney assumptions are as follows:
(1) independent random samples are obtained from each population
(2) the two populations are continuous and have the same shape
Example: A local training firm has developed an innovative programme to
improve the performance of students on the courses it offers. To assess
whether the new programme improves student performance, the firm has
collected two random samples from the population of students sitting an
accountancy examination, where sample 1 students have studied via the
traditional method and sample 2 students via the new programme.
The firm has analysed previous data, and the results provide evidence that
the distribution is not normal but is skewed to the left. This raises
concerns about the suitability of a two sample independent t-test, so the
firm decides to use a suitable distribution free test instead. In this case
the appropriate test is the Mann-Whitney U test.
$$Z = \frac{U_{cal} - \mu_U + 0.5}{\sigma_U} = \frac{9 - 28 + 0.5}{8.6410} = -2.1410$$
H0: No difference in examination performance between the two groups
H1: New programme has improved performance (M1 < M2)
Step 3: Select the test statistic - Mann-Whitney U test (Lower one tail test)
As the total number of pair wise comparisons is greater than 20 (n1n2 = 7*8 = 56 > 20), we can
approximate the Mann-Whitney distribution with a normal distribution
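P-value method - If p-value < α then we would reject the null hypothesis H0 & accept the alternative hypothesis H1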
From Excel, lower one tail p-value = 0.0161
As 0.0161 < 0.05, Reject H0
As -2.1410 < -1.65, Reject H0
Step 5: Interpretation - Based upon the data, there is sufficient evidence to indicate at a
5% significance level that the performance has improved. Note that if we modify
the level of significance to 1% then the decision would be a borderline decision.
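The normal approximation used in this example can be sketched in Python (standard library only), with Ucal = 9, n1 = 7 and n2 = 8 taken from the slides; the function name is illustrative.

```python
import math
from statistics import NormalDist


def mann_whitney_lower_tail(u_cal, n1, n2):
    """Lower one-tail normal approximation for the Mann-Whitney U statistic."""
    mu_u = n1 * n2 / 2
    sigma_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no adjustment for ties
    # Continuity correction of +0.5 for the lower tail.
    z = (u_cal + 0.5 - mu_u) / sigma_u
    p_lower = NormalDist().cdf(z)
    return mu_u, sigma_u, z, p_lower


mu_u, sigma_u, z, p_lower = mann_whitney_lower_tail(u_cal=9, n1=7, n2=8)
print(mu_u, round(sigma_u, 4), round(z, 4), round(p_lower, 4))
```

The p-value of about 0.016 lies between 1% and 5%, which is why the decision changes with the chosen significance level.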
$$U_1 = n_1 n_2 + \frac{n_1(n_1+1)}{2} - T_1$$
$$U_2 = n_1 n_2 + \frac{n_2(n_2+1)}{2} - T_2$$
$$U_{cal} = \text{Minimum}(U_1, U_2), \qquad \mu_U = \frac{n_1 n_2}{2}$$
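As a sketch of how U1, U2 and Ucal follow from the rank sums T1 and T2, the Python function below pools the two samples, ranks them (averaging tied ranks) and applies the formulas. The examination marks are hypothetical and the function name is illustrative.

```python
def mann_whitney_u(sample1, sample2):
    """Return (U1, U2, Ucal) from the rank sums of the pooled samples."""
    pooled = sorted(sample1 + sample2)
    # Average rank for each distinct value (ranks start at 1).
    positions = {}
    for rank, value in enumerate(pooled, start=1):
        positions.setdefault(value, []).append(rank)
    avg_rank = {v: sum(r) / len(r) for v, r in positions.items()}

    n1, n2 = len(sample1), len(sample2)
    t1 = sum(avg_rank[v] for v in sample1)  # rank sum of sample 1
    t2 = sum(avg_rank[v] for v in sample2)  # rank sum of sample 2
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - t1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - t2
    return u1, u2, min(u1, u2)


# Hypothetical examination marks.
traditional = [52, 61, 48, 70, 55]
new_programme = [66, 74, 59, 81, 68, 77]
u1, u2, u_cal = mann_whitney_u(traditional, new_programme)
print(u1, u2, u_cal)
```

A quick consistency check is that U1 + U2 equals n1*n2.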
Paired Comparisons
Small number of pair wise observations (n ≤ 20)
For a small number of paired comparisons (n = n1n2 ≤ 20) we use tables to
calculate an exact value of the critical test value (Ucri) or an exact p-value based
upon P (U ≤ 9). For a 5% two tail test with n1 = 7, n2 = 8: (i) the lower critical U
value, Ucri = 11. Since Ucal < Ucri (9 < 11), we reject H0 and accept H1, and (ii) the
lower p-value = 0.014.
Given that we have a two tailed test,
the two tail p-value = 2*0.014 =
0.028 < 0.05, so we reject H0
and accept H1. The theory suggests
that if the null hypothesis is true then
the U test statistic will be centred at
μU = 28 with critical regions identified
in Figure 8.11.
Tied Observations
Dealing with ties
If we find data with the same numerical value then we can deal with this problem
by allocating the average rank to each tied data value. In this situation
we would then have to use the normal approximation with the standard deviation
(σU) adjustment given by Equation (8.23):
$$\sigma_U = \sqrt{\frac{n_1 n_2}{(n_1+n_2)(n_1+n_2-1)} \left[ \frac{(n_1+n_2)^3 - (n_1+n_2)}{12} - \sum_{j=1}^{g} \frac{t_j^3 - t_j}{12} \right]}$$
Where g = number of ties and tj = the number of tied ranks in group j.
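A minimal Python sketch of the tie-adjusted standard deviation, assuming the standard form of the correction with N = n1 + n2. The function name and the tie group sizes in the second call are hypothetical.

```python
import math


def sigma_u_with_ties(n1, n2, tie_sizes):
    """Standard deviation of U adjusted for tied ranks.

    tie_sizes lists t_j, the number of tied ranks in each group j.
    """
    n = n1 + n2
    tie_term = sum((t ** 3 - t) / 12 for t in tie_sizes)
    return math.sqrt(n1 * n2 / (n * (n - 1)) * ((n ** 3 - n) / 12 - tie_term))


# With no ties the adjusted formula reduces to sqrt(n1*n2*(n1+n2+1)/12).
no_ties = sigma_u_with_ties(7, 8, [])
# Hypothetical tie pattern: one pair and one triple of tied ranks.
with_ties = sigma_u_with_ties(7, 8, [2, 3])
print(round(no_ties, 4), round(with_ties, 4))
```

Ties always shrink the standard deviation, so ignoring them (as in the example and exercises) makes the normal approximation slightly conservative.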

1. In the example and exercises we have not modified the solution for tied
ranks.
2. The Mann-Whitney U test is statistically equivalent to the Wilcoxon rank sum
test.
Conclusion
In this presentation we explored the concept of chi-square and non-parametric hypothesis testing.