
Business Analytics for Managers
Using Excel for Data Analysis

Syllabus
Code: 302

Credits: 3

Unit I
Digital Data Introduction, Types of Digital Data: Structured Data,
Unstructured Data, and Semi-Structured Data; Exploring and Discovering
Data; Introduction to OLTP: Queries, Advantages and Challenges;
Introduction to OLAP: One-dimensional, Two-dimensional and Three-dimensional data; OLAP Architectures: MOLAP, ROLAP, HOLAP; Role of
OLAP Tools in BI Architecture.
Introduction to Big data; BI Component Frameworks: Business,
Administration and Operation, and Implementation Layer; BI is for
Whom?; BI Applications; BI Roles and Responsibilities; Basics of Data
Integration; Data Warehousing: Data Mart; What constitutes a Data
Warehouse?; Introduction to ETL; Data Integration.

Unit II
Business Analytics: Introduction to Business Analytics; Introduction to R
Programming; Data Manipulation in R: Vectors, Basic Math, and Matrix Operations;
Summarizing Data: Numerical and Graphical Summaries; Data Visualization in R; Data
Transformation; Data Import Techniques in R; Time Series and Spatial Graphs; Graphs
for Categorical Responses and Panel Data.
Data Modeling using Excel: Overview of Excel; Basic Data Summaries: Measures
of Central Tendency, Measures of Dispersion, and Measures of Skewness and
Kurtosis; Introduction to Parametric Hypothesis Testing: One and Two Sample Tests,
Z-Test and t-Test; Chi-square and Non-parametric Hypothesis Testing: Chi-square
Goodness-of-Fit Test, Sign Test, Wilcoxon Signed Rank Sum Test, Mann-Whitney U
Test; Linear Correlation and Regression Analysis; Time Series Data and Analysis.

Unit III
Statistics Essentials for Advanced Analytics: Linear and Logistic Regression;
Linear Discriminant Analysis; Exploratory Factor Analysis; Confirmatory Factor
Analysis; Conjoint Analysis; Data Mining: Clustering Techniques, Association Rule
Mining and Sentiment Analysis, Decision Trees and Random Forests; Structural
Equations Modeling.

Suggested Readings

Prasad R N and Acharya Seema (2013), Fundamentals of Business Analytics, Wiley India Pvt.
Ltd., New Delhi.
Glyn Davis and Branko Pecar (2013), Business Statistics using Excel, Oxford University Press,
New Delhi.
Halady Rao Purba (2013), Business Analytics an Application Focus, PHI Learning Private
Limited, New Delhi.
Jank Wolfgang (2011), Business Analytics for Managers, Springer Science+Business Media,
ISBN 978-1-4614-0405-7.
Davenport Thomas H et al. (2008), Competing on Analytics, Pearson Publication, USA.
Decision support systems and business intelligence HBR Press.
E Turban, et al. (2008), Business Intelligence: A Managerial Approach, Pearson Prentice Hall.
Mosimann R et al. (2007), The performance manager: Proven strategies for turning information
into higher business performance, Cognos Press.

Articles

Thomas H. Davenport, Competing on Analytics, Harvard Business Review, 2006.


Solomon Negash, Business Intelligence, Communications of the Association for Information
Systems (Volume 13, 2004) 177-195.
Sara Philpott, Advanced Analytics: Unlocking the Power of Insight, IBM, 2010.
Pam Baker, Using Data Visualizations To Drive Business Decisions, Fiercebigdata, 2013.

Teaching Schedule (Session-wise)

Topic                                                   Session
Overview of Excel; Basic Data Summaries                 Session 1
  o Measures of Central Tendency
  o Measures of Dispersion
  o Measures of Skewness & Kurtosis
Parametric Hypothesis Testing                           Session 2, 3
  o One & Two Sample Tests: Z & t-Test
  o Chi-square Test
Non-parametric Hypothesis Testing                       Session 4, 5
  o Chi-square Goodness-of-Fit Test
  o Sign Test
  o Wilcoxon Signed Rank Sum Test
  o Mann-Whitney U Test
Correlation Analysis; Linear Regression Analysis;       Session 6
Time Series Data and Analysis
Logistic Regression                                     Session 7
Linear Discriminant Analysis                            Session 8
Exercises & Case Study
Exploratory Factor Analysis                             Session 9
Confirmatory Factor Analysis                            Session 10
Conjoint Analysis                                       Session 11
Data Mining                                             Session 12, 13, 14
  o Clustering Techniques
  o Association Rule Mining & Sentiment Analysis
  o Decision Trees & Random Forests
  o Structural Equations Modeling
Exercises & Case Study

Introduction to Parametric Hypothesis Testing

We shall explore a range of hypothesis tests for one and two samples where the
population is considered to be normally distributed.

Learning Objectives
On completing this unit you should be able to:
Understand the concept of null & alternative hypotheses
Understand the difference between
  one & two samples
  parametric & non-parametric tests
  one and two tail tests
  type I and II errors
Understand the use of the p-value and critical test statistic
Conduct one sample hypothesis tests for the sample mean and proportion
Conduct two sample hypothesis tests for the sample mean and proportion
Conduct an F Test for two population variances
Solve hypothesis problems using Microsoft Excel

Hypothesis Testing
Rationale
Hypothesis: a statement about the value of a population parameter, developed for
the purpose of testing
Deciding between two possibilities based on data: is it real, or is it just
coincidence? A hypothesis is either TRUE or FALSE
Example: the hypothesis statement "average salary of accountants is 31000" can be
measured & assessed through the variable salary

Hypothesis Testing: procedure used to determine whether the hypothesis is a
reasonable statement & should not be rejected, or is an unreasonable statement &
should be rejected, based on sample evidence & probability theory
Null hypothesis (H0): also known as the hypothesis of no difference; it is
formulated in anticipation of being rejected as false
Alternative hypothesis (H1): a positive proposition which states that a
significant difference exists
Example: The average salary of accountants is 31000
Null hypothesis H0: μ = 31000
Alternative hypothesis H1: μ ≠ 31000

Level of significance: represents the amount of risk an analyst will accept when
making a decision
It represents the amount of error associated with rejecting the null hypothesis
when it is true
Usually expressed as % and denoted by α (alpha)
It is normally 5% (0.05) or 1% (0.01), or sometimes 10% (0.1)
The value of α depends upon how sure you want to be that your decisions are an
accurate reflection of the true population relationship
Example: a 5% level of significance implies that there are about 5 chances in 100
of rejecting H0 when it is true, or that we are 95% confident that we will make
a correct decision

1. If we sample from a population that is normally distributed, then the sampling
distribution of the sample mean is normally distributed with mean equal to the
population mean and standard error σ_X̄ = σ/√n.
2. For populations that are not normally distributed we can make use of the Central
Limit Theorem: for large n, the sampling distribution of the sample mean is
approximately normal.
3. For small sample sizes we employ the Student t distribution: if the population is
normally distributed and σ is unknown, the standardised sample mean follows a t
distribution with n - 1 degrees of freedom, with the sampling error estimated using
the sample standard deviation, s.

We stated earlier that the alternative hypothesis is of the form H1: μ ≠ 31000

Two tailed test: the ≠ sign tells us that we are not sure what the direction of the
difference will be (< or >) but that a difference exists
One tailed test: it is possible that we are assessing whether the average
accountant salary
  is greater than 31000, implying H1: μ > 31000
  is smaller than 31000, implying H1: μ < 31000

Two tail test H1: μ ≠ 100
Lower one tail test H1: μ < 100
Upper one tail test H1: μ > 100

Decision to accept/reject the null hypothesis
The decision to accept or reject the null hypothesis H0 is made on the basis of
information supplied by sample data, and can result in one of 2 types of errors:
Type I Error: committed by the test in rejecting a true null hypothesis. The
probability of committing a type I error is denoted by α.
Type II Error: committed by the test in accepting a false null hypothesis. The
probability of committing a type II error is denoted by β.

The Truth                 Your Decision: Accept H0                 Your Decision: Accept H1
Null Hypothesis H0        Correct Decision                         Type I Error [α level 0.05]
Research Hypothesis H1    Type II Error [not easily controlled]    Correct Decision

2 approaches to accept/reject the hypothesis
1) Use the p-value (via Excel)
The p-value represents the probability of the calculated random sample test
statistic being this extreme if the null hypothesis is true
The p-value is compared with the chosen significance level (α) to make a decision
between accepting or rejecting the null hypothesis H0
If p < α, then reject the null hypothesis H0, because low probability events are
unlikely to occur, & accept the alternative hypothesis H1
2) Calculate the test statistic & compare with a critical test statistic
Calculate the test statistic and compare the value with a critical test statistic
estimated from an appropriate table or via Excel
The value of the critical test statistic will depend upon the significance level for
Z test problems, and upon the significance level and number of degrees of freedom
for t test problems
If test statistic > critical test statistic then we would reject the null hypothesis
H0 and accept the alternative hypothesis H1

Parametric vs. Non-parametric Tests

Tests of hypothesis are usually classified into two methods: parametric and
non-parametric.
Parametric methods make assumptions about the underlying distribution from which
sample populations are selected
  They assume that your data is approximately normally distributed (follows a
  classic bell-shaped curve)
  Data is at the interval/ratio level of measurement
Non-parametric methods make no assumptions about the distribution of the sample
populations
  They are often based upon data that has been ranked, rather than actual
  measurement data

One sample test: involves testing a sample parameter (e.g. mean value) against a
perceived population value (e.g. accountant salary 31000) to ascertain whether or
not there is a significant difference between a sample statistic and a population
parameter
Example: H0: μ = 31000

Two sample tests: we test a sample against another sample to ascertain whether or
not there is a significant difference between two samples and, consequently,
whether or not the two samples represent different populations

Choosing an appropriate test: summary of Excel functions

Test Statistic     P-value Method                      Critical Statistic Method

Z-test
Two-tail test      =2*(1-NORMSDIST(ABS(Z-value)))      =NORMSINV(1-α/2) (Upper value)
                                                       =NORMSINV(α/2) (Lower value)
Lower-tail test    =NORMSDIST(Z-value)                 =NORMSINV(α)
Upper-tail test    =1-NORMSDIST(Z-value)               =NORMSINV(1-α)

t-test
Two-tail test      =TDIST(ABS(t-value),df,2)           =TINV(α,df) (Upper value)
                                                       =-TINV(α,df) (Lower value)
Lower-tail test    =TDIST(ABS(t-value),df,1)           =-TINV(2*α,df)
Upper-tail test    =TDIST(ABS(t-value),df,1)           =TINV(2*α,df)

F-test
Two-tail test      =2*FDIST(F-value,df1,df2)           =FINV(α/2,df1,df2) (Upper value)
                   (for F-value > 1)                   =FINV(1-α/2,df1,df2) (Lower value)
Lower-tail test    =1-FDIST(F-value,df1,df2)           =FINV(1-α,df1,df2)
Upper-tail test    =FDIST(F-value,df1,df2)             =FINV(α,df1,df2)

Note: TINV always returns the positive two-tail value, so the lower-tail critical
value carries the minus sign. FDIST and FINV work with the upper-tail (right-hand)
probability, so the upper critical value is obtained from the smaller probability.
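For readers who want to cross-check the Z-test rows of the function table outside Excel, the sketch below mirrors them with Python's standard library. The function names `z_p_value` and `z_critical` are illustrative choices, not part of the original material, and `NormalDist` is assumed to play the role of NORMSDIST/NORMSINV.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def z_p_value(z, tail="two"):
    """P-value for a Z statistic; tail is 'two', 'lower' or 'upper'."""
    if tail == "two":
        return 2 * (1 - STD_NORMAL.cdf(abs(z)))  # =2*(1-NORMSDIST(ABS(z)))
    if tail == "lower":
        return STD_NORMAL.cdf(z)                 # =NORMSDIST(z)
    if tail == "upper":
        return 1 - STD_NORMAL.cdf(z)             # =1-NORMSDIST(z)
    raise ValueError("tail must be 'two', 'lower' or 'upper'")


def z_critical(alpha, tail="two"):
    """Critical value(s) of Z for significance level alpha."""
    if tail == "two":
        upper = STD_NORMAL.inv_cdf(1 - alpha / 2)  # =NORMSINV(1-alpha/2)
        return -upper, upper                       # (lower value, upper value)
    if tail == "lower":
        return STD_NORMAL.inv_cdf(alpha)           # =NORMSINV(alpha)
    if tail == "upper":
        return STD_NORMAL.inv_cdf(1 - alpha)       # =NORMSINV(1-alpha)
    raise ValueError("tail must be 'two', 'lower' or 'upper'")


lower, upper = z_critical(0.05, "two")
print(round(lower, 2), round(upper, 2))
```

The t and F rows are left out because the standard library has no t or F distribution; a statistics package would be needed for those.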

1. One sample Z-Test for Population Mean

Assumptions
Sample data is randomly collected from a population
Population normally distributed
Population standard deviation is known
Example 1: Historically, employees of a firm produce an average of 100 units per
hour with a standard deviation of 20 units per hour. A new employee is tested on 36
separate random occasions and found to have an average output of 90 units per
hour. Does this indicate that the new employee's output is significantly different
from the average output?
H0: μ = 100, H1: μ ≠ 100
Given α = 5% = 0.05, μ = 100, σ = 20, n = 36, X̄ = 90

Solution using the p-value method & critical test statistic method (Zcri)
Step 1: State the null and the alternate hypothesis - H0: μ = 100, H1: μ ≠ 100
Step 2: Select the level of significance - α = 0.05
Step 3: Select the test statistic - Z-distribution (Large sample)
Zcal = (X̄ - μ) / (σ/√n) = (90 - 100) / (20/√36) = -3
Step 4: Formulate the decision rule
P-value Method - If p < α, then reject the null hypothesis H0 & accept the
alternative hypothesis H1
From Excel, Two tail p-value = 0.0026998
As 0.0027 < 0.05, Reject H0
Critical test Statistic Method - If |test statistic| > critical test statistic then we
would reject the null hypothesis H0 and accept the alternative hypothesis H1
From Excel, Two tail critical value = NORMSINV(1-α/2) = 1.96 (lower value -1.96)
As |-3| = 3 > 1.96, Reject H0
Step 5: Interpretation
Evidence suggests that the new employee's output is significantly different, at the
5% level, from the firm's existing employee output, i.e. the sample mean value (90
units per hour) is not close enough to the population mean value (100 units per
hour).
Excel functions used (Z-test, two-tail)
P-value Method: =2*(1-NORMSDIST(ABS(Z-value)))
Critical Statistic Method: =NORMSINV(1-α/2) (Upper value), =NORMSINV(α/2) (Lower value)
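The arithmetic of Example 1 can be reproduced with a few lines of Python. This is only a sketch that assumes the figures quoted in the example (μ = 100, σ = 20, n = 36, X̄ = 90); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Figures quoted in Example 1
mu0, sigma, n, x_bar, alpha = 100, 20, 36, 90, 0.05

# Test statistic: Z = (X-bar - mu) / (sigma / sqrt(n))
z_cal = (x_bar - mu0) / (sigma / sqrt(n))

# P-value method (two-tail)
p_two_tail = 2 * (1 - NormalDist().cdf(abs(z_cal)))

# Critical test statistic method (two-tail upper value)
z_cri = NormalDist().inv_cdf(1 - alpha / 2)

reject_h0 = p_two_tail < alpha          # same decision as abs(z_cal) > z_cri
print(z_cal, round(p_two_tail, 7), round(z_cri, 2), reject_h0)
```

Both decision rules lead to the same conclusion here, which is the point of presenting them side by side.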

2. One sample t-Test for Population Mean

Assumptions
The population standard deviation is not known, so the t-test uses the sample
standard deviation, s, as an estimate of the population standard deviation, σ
The population distribution is normal
Example 2: A local car dealer wants to know if the purchasing habits for extra
fittings by male buyers have changed. Based upon collected data he has estimated
that the distribution of extra fittings purchased is approximately normally
distributed with an average of 2000 per customer. To test this hypothesis he has
collected the data of the purchases made by the last seven male customers: 2300,
2386, 1920, 1578, 3065, 2312 and 1790. Test whether the extras purchased on average
have changed.
H0: μ = 2000, H1: μ ≠ 2000
Given α = 5% = 0.05, μ = 2000, σ = unknown, n = 7, X̄ = 2193

Solution using the p-value method & critical test statistic method (tcri)
Step 1: State the null and the alternate hypothesis - H0: μ = 2000, H1: μ ≠ 2000
Step 2: Select the level of significance - α = 0.05
Step 3: Select the test statistic - t-distribution (Small sample)
tcal = (X̄ - μ) / (s/√n) = (2193 - 2000) / (489.62../√7) = 1.0429, df = n - 1 = 6
Step 4: Formulate the decision rule
P-value Method - If p < α, then reject the null hypothesis H0 & accept the
alternative hypothesis H1
From Excel, Two tail p-value = 0.337182452
As 0.337 > 0.05, Accept H0
Critical test Statistic Method - If |test statistic| > critical test statistic then we
would reject the null hypothesis H0 and accept the alternative hypothesis H1
From Excel, Two tail critical value = TINV(α,df) = 2.45
As -2.45 < 1.04 < 2.45, Accept H0
Step 5: Interpretation
There is no significant difference, at the 0.05 level, between the extras purchased
by the sample and the historical extras purchased of 2000.

Excel functions used (t-test, two-tail)
P-value Method: =TDIST(ABS(t-value),df,2)
Critical Statistic Method: =TINV(α,df) (Upper value), =-TINV(α,df) (Lower value)
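As a check on Example 2, the sketch below recomputes the sample mean, sample standard deviation and t statistic from the seven purchases listed in the example. Python's standard library has no t distribution, so the two-tail critical value is taken from the worked solution (TINV for df = 6, about 2.447) rather than computed; that constant is therefore an input here, not a result.

```python
from math import sqrt
from statistics import mean, stdev

purchases = [2300, 2386, 1920, 1578, 3065, 2312, 1790]  # data from Example 2
mu0 = 2000      # hypothesised population mean
t_cri = 2.447   # two-tail critical value for df = 6, alpha = 0.05 (from the text)

n = len(purchases)
x_bar = mean(purchases)
s = stdev(purchases)                    # sample standard deviation (n - 1 divisor)
t_cal = (x_bar - mu0) / (s / sqrt(n))   # t = (X-bar - mu) / (s / sqrt(n))
df = n - 1

reject_h0 = abs(t_cal) > t_cri
print(x_bar, round(s, 2), round(t_cal, 4), df, reject_h0)
```

Using `stdev` rather than `pstdev` matters: the t-test relies on the n - 1 divisor, which corresponds to Excel's STDEV.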

3. Two sample Z-Test for Population Mean

Example 3: A large organisation produces electric light bulbs in each of its two
factories (A and B). It is suspected that the quality of production from factory A
is better than from factory B. To test this assertion the organisation collects
samples from factory A and B and measures how long each light bulb works (in
hours) before the light bulb fails. Conduct an appropriate test of this hypothesis.
H0: μA ≤ μB
H1: μA > μB
Given α = 5% = 0.05, σ²A = 52783, σ²B = 61560, nA = 30, nB = 32

Solution using the p-value method & critical test statistic method (Zcri)
Step 1: State the null and the alternate hypothesis - H0: μA ≤ μB, H1: μA > μB
Step 2: Select the level of significance - α = 0.05
Step 3: Select the test statistic - Z-distribution (Large samples)
Zcal = (X̄A - X̄B) / √(σ²A/nA + σ²B/nB) = 4.16
Step 4: Formulate the decision rule
P-value Method - If p < α, then reject the null hypothesis H0 & accept the
alternative hypothesis H1
From Excel, Upper tail P-value = 0.0000159
As 0.0000159 < 0.05, Accept H1
Critical test Statistic Method - If test statistic > critical test statistic then we
would reject the null hypothesis H0 & accept the alternative hypothesis H1
From Excel, Upper tail critical value Zcri = +1.64
As 4.16 > 1.64, Accept H1
Step 5: Interpretation
At the 0.05 level of significance, the light bulbs from factory A have a
significantly longer life time than the light bulbs from factory B.

Excel functions used (Z-test, upper-tail)
P-value Method: =1-NORMSDIST(Z-value)
Critical Statistic Method: =NORMSINV(1-α)

Analysis ToolPak solution
Select Data > Data Analysis > z-Test: Two Sample for Means

z-Test: Two Sample for Means

                               Variable 1     Variable 2
Mean                           1135.333333    894.21875
Known Variance                 46516.6        57845.9
Observations                   30             32
Hypothesized Mean Difference   0
z                              4.160713018
P(Z<=z) one-tail               1.58628E-05
z Critical one-tail            1.644853627
P(Z<=z) two-tail               3.17256E-05
z Critical two-tail            1.959963
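The ToolPak figures for Example 3 can be reproduced from the summary values in the output table. The sketch below assumes the means, known variances and sample sizes shown in that table (rounded as printed), so the last digits may differ slightly from Excel.

```python
from math import sqrt
from statistics import NormalDist

# Summary values as printed in the ToolPak output for Example 3
mean_a, mean_b = 1135.333333, 894.21875
var_a, var_b = 46516.6, 57845.9   # "Known Variance" row
n_a, n_b = 30, 32
alpha = 0.05

# Z = (mean_A - mean_B) / sqrt(var_A/n_A + var_B/n_B)
z_cal = (mean_a - mean_b) / sqrt(var_a / n_a + var_b / n_b)

p_upper = 1 - NormalDist().cdf(z_cal)      # =1-NORMSDIST(Z-value)
z_cri = NormalDist().inv_cdf(1 - alpha)    # =NORMSINV(1-alpha)

print(round(z_cal, 4), p_upper, round(z_cri, 4), z_cal > z_cri)
```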

4. Two sample Z-Test for Population Proportion

Example 4: A local police authority concerned with the number of passengers not
wearing rear seat belts in cars decided to undertake a series of surveys in two
large cities. The survey consisted of two independent random samples collected
from city A and B, and the police authority would like to know if the proportions
of passengers wearing seat belts in city A and B are different. Conduct an
appropriate test of this hypothesis.
H0: πA = πB, H1: πA ≠ πB
Given α = 5% = 0.05, NA = 250, NB = 190, nA = 135, nB = 80, with the population
proportions πA and πB estimated by the sample proportions ρA = nA/NA and ρB = nB/NB

Solution using the p-value method & critical test statistic method (Zcri)
Step 1: State the null and the alternate hypothesis - H0: πA = πB, H1: πA ≠ πB
Step 2: Select the level of significance - α = 0.05
Step 3: Select the test statistic - Z-distribution (Large samples)
Zcal = (ρA - ρB) / √(ρA(1 - ρA)/NA + ρB(1 - ρB)/NB) = 2.49
Step 4: Formulate the decision rule
P-value Method - If p < α, then reject the null hypothesis H0 & accept the
alternative hypothesis H1
From Excel, Two tail P-value = 0.013
As 0.013 < 0.05, Accept H1
Critical test Statistic Method - If test statistic > critical test statistic then we
would reject the null hypothesis H0 & accept the alternative hypothesis H1
From Excel, Two tail critical value Zcri = 1.96
As 2.49 > 1.96, Accept H1
Step 5: Interpretation
At the 0.05 level of significance, we conclude that a significant difference exists
between the proportions of rear passengers wearing seat belts in city A and B.

Survey data
                             City A    City B
No. interviewed, N           250       190
No. wearing seat belts, n    135       80

Excel functions used (Z-test, two-tail)
P-value Method: =2*(1-NORMSDIST(ABS(Z-value)))
Critical Statistic Method: =NORMSINV(1-α/2) (Upper value), =NORMSINV(α/2) (Lower value)
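Example 4 can be recomputed directly from the four survey counts. The sketch below follows the formula used in the worked solution, i.e. the standard error built from the two separate sample proportions (no pooling); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Survey counts from Example 4
n_a, n_b = 250, 190          # number interviewed in city A and city B
belts_a, belts_b = 135, 80   # number wearing seat belts
alpha = 0.05

p_a, p_b = belts_a / n_a, belts_b / n_b   # sample proportions

# Standard error of the difference, using the separate sample proportions
se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
z_cal = (p_a - p_b) / se

p_two_tail = 2 * (1 - NormalDist().cdf(abs(z_cal)))
z_cri = NormalDist().inv_cdf(1 - alpha / 2)

print(round(z_cal, 2), round(p_two_tail, 3), round(z_cri, 2))
```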

5. Two-Sample t-Test for Population Mean
(independent samples, equal variances: Pooled t-test)
Example 5: A certain product of organic beans is packed in tins and sold by two
local shops. The local authority has received complaints from customers that the
amounts of beans within the tins sold by the two shops are different. To test this
statistically, two small random samples were collected, one from each shop.
H0: μ1 = μ2, H1: μ1 ≠ μ2
Given α = 5% = 0.05, σ1 = σ2 unknown (pooled)
n1 = 18, n2 = 25

Solution using the p-value method & critical test statistic method (tcri)
Step 1: State the null and the alternate hypothesis - H0: μ1 = μ2, H1: μ1 ≠ μ2
Step 2: Select the level of significance - α = 0.05
Step 3: Select the test statistic - t-distribution
Pooled variance: sp² = [(n1 - 1)s1² + (n2 - 1)s2²] / (n1 + n2 - 2) = 2082.017
tcal = (X̄1 - X̄2) / √(sp²(1/n1 + 1/n2)) = 2.156, df = n1 + n2 - 2 = 41
Step 4: Formulate the decision rule
P-value Method - If p < α, then reject the null hypothesis H0 & accept the
alternative hypothesis H1
From Excel, Two tail P-value = 0.037
As 0.037 < 0.05, Accept H1
Critical test Statistic Method - If test statistic > critical test statistic then we
would reject the null hypothesis H0 & accept the alternative hypothesis H1
From Excel, Two tail critical value tcri = +2.019
As 2.156 > 2.019, Accept H1
Step 5: Interpretation
We conclude, based upon the sample data collected, that we have evidence that the
quantities of beans sold by shops A and B are significantly different at the 5%
level of significance. It should be noted that the decision will change if you
Excel functions used (t-test, two-tail)
P-value Method: =TDIST(ABS(t-value),df,2)
Critical Statistic Method: =TINV(α,df) (Upper value), =-TINV(α,df) (Lower value)

Analysis ToolPak solution
Select Data > Data Analysis > t-Test: Two-Sample Assuming Equal Variances

t-Test: Two-Sample Assuming Equal Variances

                               Variable 1     Variable 2
Mean                           527.0555556    496.64
Variance                       2603.114379    1712.906667
Observations                   18             25
Pooled Variance                2082.017182
Hypothesized Mean Difference   0
df                             41
t Stat                         2.156381653
P(T<=t) one-tail               0.018485215
t Critical one-tail            1.682878003
P(T<=t) two-tail               0.03697043
t Critical two-tail            2.0195409
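The pooled variance and t statistic of Example 5 follow from the summary rows of the ToolPak output. The sketch below assumes those printed means, variances and sample sizes; the critical value is copied from the output rather than computed, because the standard library has no t distribution.

```python
from math import sqrt

# Summary values as printed in the ToolPak output for Example 5
mean_1, mean_2 = 527.0555556, 496.64
var_1, var_2 = 2603.114379, 1712.906667   # sample variances
n_1, n_2 = 18, 25
t_cri = 2.0195   # two-tail critical value for df = 41 (from the output table)

df = n_1 + n_2 - 2

# Pooled variance: weighted average of the two sample variances
pooled_var = ((n_1 - 1) * var_1 + (n_2 - 1) * var_2) / df

# t = (mean_1 - mean_2) / sqrt(pooled_var * (1/n_1 + 1/n_2))
t_cal = (mean_1 - mean_2) / sqrt(pooled_var * (1 / n_1 + 1 / n_2))

print(df, round(pooled_var, 3), round(t_cal, 4), abs(t_cal) > t_cri)
```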

6. Two-Sample t-Test for Population Mean
(independent samples, unequal variances)
Example 6: A certain product of organic beans is packed in tins and sold by two
local shops. The local authority has received complaints from customers that the
amounts of beans within the tins sold by the two shops are different. To test this
statistically, two small random samples were collected, one from each shop.
H0: μ1 = μ2, H1: μ1 ≠ μ2
Given α = 5% = 0.05,
σ1 and σ2 unknown, Distribution unknown
n1 = 18, n2 = 25

Solution using the p-value method & critical test statistic method (tcri)
Step 1: State the null and the alternate hypothesis - H0: μ1 = μ2, H1: μ1 ≠ μ2
Step 2: Select the level of significance - α = 0.05
Step 3: Select the test statistic - t-distribution
tcal = (X̄1 - X̄2) / √(s1²/n1 + s2²/n2) = 2.083
df = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 - 1) + (s2²/n2)²/(n2 - 1) ] = 32
Step 4: Formulate the decision rule
P-value Method - If p < α, then reject the null hypothesis H0 & accept the
alternative hypothesis H1
From Excel, Two tail P-value = 0.045
As 0.045 < 0.05, Accept H1
Critical test Statistic Method - If test statistic > critical test statistic then we
would reject the null hypothesis H0 & accept the alternative hypothesis H1
From Excel, Two tail critical value tcri = 2.037
As 2.083 > 2.037, Accept H1
Step 5: Interpretation
We conclude, based upon the sample data collected, that we have evidence that the
quantities of beans sold by shops A and B are significantly different at the 5%
level of significance. It should be noted that the result in this case rests at

Excel functions used (t-test, two-tail)
P-value Method: =TDIST(ABS(t-value),df,2)
Critical Statistic Method: =TINV(α,df) (Upper value), =-TINV(α,df) (Lower value)

Analysis ToolPak solution
Select Data > Data Analysis > t-Test: Two-Sample Assuming Unequal Variances

t-Test: Two-Sample Assuming Unequal Variances

                               Variable 1     Variable 2
Mean                           527.0555556    496.64
Variance                       2603.114379    1712.906667
Observations                   18             25
Hypothesized Mean Difference   0
df                             32
t Stat                         2.083385602
P(T<=t) one-tail               0.022644068
t Critical one-tail            1.693888703
P(T<=t) two-tail               0.045288136
t Critical two-tail            2.0369333
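For the unequal-variance case of Example 6, the sketch below recomputes the t statistic and the approximate degrees of freedom from the summary rows of the ToolPak output. It assumes the printed means, variances and sample sizes; rounding the degrees of freedom to the nearest whole number reproduces the df shown in the output.

```python
from math import sqrt

# Summary values as printed in the ToolPak output for Example 6
mean_1, mean_2 = 527.0555556, 496.64
var_1, var_2 = 2603.114379, 1712.906667   # sample variances
n_1, n_2 = 18, 25

# Squared standard error contributed by each sample
se2_1 = var_1 / n_1
se2_2 = var_2 / n_2

# t = (mean_1 - mean_2) / sqrt(s1^2/n1 + s2^2/n2)
t_cal = (mean_1 - mean_2) / sqrt(se2_1 + se2_2)

# Approximate degrees of freedom for unequal variances
df = (se2_1 + se2_2) ** 2 / (se2_1 ** 2 / (n_1 - 1) + se2_2 ** 2 / (n_2 - 1))

print(round(t_cal, 4), round(df, 2), round(df))
```

Compared with the pooled test, the statistic is slightly smaller and the degrees of freedom drop from 41 to about 32, which is why the result sits closer to the 5% boundary.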

7. Two-Sample t-Test for Population Mean
(dependent samples: Paired t-test)

Example 7: Suppose that Super Slim is offering a weight reduction program that
they advertise will result in more than a 10 lb weight loss in the first 30 days.
Twenty six subjects were independently randomly selected for a study and their
weights before and after the weight loss program were recorded. Super Slim have
stated that the historical data shows that the populations are normally
distributed.
H0: D = μ1 - μ2 ≤ 10
H1: D > 10
Given α = 5% = 0.05,
Distribution unknown
σ1 unknown, σ2 unknown
n1 = n2 = 26
Assume n large, CLT applies

Solution using the p-value method & critical test statistic method (tcri)
Step 1: State the null & the alternate hypothesis - H0: D = μ1 - μ2 ≤ 10, H1: D > 10
Step 2: Select the level of significance - α = 0.05
Step 3: Select the test statistic - t-distribution
Standard deviation of the differences: sd = √( Σ(d - d̄)² / (n - 1) ) = 14.57
tcal = (d̄ - D) / (sd/√n) = 2.5178, df = n - 1 = 25
Step 4: Formulate the decision rule
P-value Method - If p < α, then reject the null hypothesis H0 & accept the
alternative hypothesis H1
From Excel, Upper one tail P-value = 0.0093
As 0.0093 < 0.05, Accept H1
Critical test Statistic Method - If test statistic > critical test statistic then we
would reject the null hypothesis H0 & accept the alternative hypothesis H1
From Excel, Upper one tail critical value tcri = +1.708
As 2.5178 > 1.708, Accept H1
Step 5: Interpretation
Conclude that the average weight loss is more than 10 lbs at a 5% level of
significance. Again, you would have a borderline decision at 5%.

Analysis ToolPak solution
Select Data > Data Analysis > t-Test: Paired Two Sample for Means
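The paired t-test works on the per-subject differences only. Since the 26 before/after weights of Example 7 are not reproduced in the text, the sketch below uses a small hypothetical data set purely to show the mechanics; `paired_t` and the weights are illustrative assumptions, not the study data.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(before, after, hypothesised_diff=0.0):
    """Return (t statistic, degrees of freedom) for a paired t-test."""
    diffs = [b - a for b, a in zip(before, after)]   # d = before - after
    n = len(diffs)
    d_bar = mean(diffs)                              # mean difference
    s_d = stdev(diffs)                               # sd of differences (n - 1)
    t_cal = (d_bar - hypothesised_diff) / (s_d / sqrt(n))
    return t_cal, n - 1


# Hypothetical weights in lb for four subjects (NOT the Example 7 data)
before = [200, 215, 190, 224]
after = [188, 200, 181, 210]

t_cal, df = paired_t(before, after, hypothesised_diff=10)
print(round(t_cal, 4), df)
```

The resulting statistic would then be compared with an upper one-tail critical value, for example =TINV(2*α,df) in Excel.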

8. F-Test for Population Variances

Example 8: A certain product of organic beans is packed in tins and sold by two
local shops. The local authority has received complaints from customers that the
amounts of beans within the tins sold by the two shops are different. To test this
statistically, two small random samples were collected, one from each shop. Use
the F-test to check if the two population variances can be considered equal with
95% confidence.
H0: σ1² = σ2²
H1: σ1² ≠ σ2²
Given α = 5% = 0.05

Solution using the p-value method & critical test statistic method (Fcri)
Step 1: State the null & the alternate hypothesis - H0: σ1² = σ2², H1: σ1² ≠ σ2²
Step 2: Select the level of significance - α = 0.05
Step 3: Select the test statistic - F-distribution
F = s²A / s²B = 1.5197
df numerator = nA - 1, df denominator = nB - 1
Step 4: Formulate the decision rule
P-value Method - If p < α, then reject the null hypothesis H0 & accept the
alternative hypothesis H1
From Excel, Two tail P-value = 0.3393282
As 0.3393282 > 0.05, Accept H0
Critical test Statistic Method - If the test statistic lies outside the lower and
upper critical test statistics then we would reject the null hypothesis H0 & accept
the alternative hypothesis H1
From Excel, lower critical value FL = 0.39 and upper critical value FU = 2.38
As FL (0.39) < F (1.5197) < FU (2.38), Accept H0
Step 5: Interpretation
Conclude that the two population variances are not significantly different at the
95% level of confidence

Excel functions used (F-test, two-tail)
P-value Method: =2*FDIST(F-value,df1,df2) (for F-value > 1)
Critical Statistic Method: =FINV(1-α/2,df1,df2) (Lower value), =FINV(α/2,df1,df2) (Upper value)

Analysis ToolPak solution
Select Data > Data Analysis > F-Test Two-Sample for Variances

F-Test Two-Sample for Variances

                       Variable 1     Variable 2
Mean                   527.0555556    496.64
Variance               2603.114379    1712.90667
Observations           18             25
df                     17             24
F                      1.519705907
P(F<=f) one-tail       0.169664107
F Critical one-tail    2.070283955
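The variance ratio of Example 8 is a one-line calculation once the two sample variances are known. The sketch below assumes the variances and sample sizes printed in the ToolPak output; the lower and upper critical values (0.39 and 2.38) are copied from the worked solution because the standard library has no F distribution.

```python
# Summary values as printed in the ToolPak output for Example 8
var_a, var_b = 2603.114379, 1712.906667   # sample variances
n_a, n_b = 18, 25
f_lower, f_upper = 0.39, 2.38   # two-tail critical values quoted in the solution

f_cal = var_a / var_b                  # F = s_A^2 / s_B^2
df_num, df_den = n_a - 1, n_b - 1      # numerator and denominator df

# Two-tail rule: reject H0 only if F falls outside [f_lower, f_upper]
reject_h0 = f_cal < f_lower or f_cal > f_upper

print(round(f_cal, 4), df_num, df_den, reject_h0)
```

Because H0 is not rejected, the equal-variance (pooled) t-test of Example 5 is a defensible choice for these two samples.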

Conclusion
In this presentation we explored the concept of hypothesis testing:


Chi Square and Non-Parametric Hypothesis Testing

This chapter provides an overview of the chi-squared distribution (χ²) and of non-parametric tests that can be used when parametric methods are not appropriate.

Learning Objectives
On completing this unit you should be able to:
Apply the chi square test to measure the difference between
two proportions from two samples
Apply the chi-square test to test for association between
categorical variables
Apply the chi-square goodness-of-fit test
Apply the sign test to one sample
Apply the Wilcoxon signed rank T test to two paired samples
Apply the Mann-Whitney U test to two independent samples

Introduction
Parametric tests - assess whether the differences between means (or
variances, proportions) are statistically significant. Model assumptions
are:
a) Underlying population being measured varies as a normal distribution
b) The level of measurement is of equal interval or ratio scaling, and
c) The population variances are equal

Unfortunately, we will come across data that does not fit these
assumptions
a) How do we measure the difference between the attitudes of people
surveyed in assessing their favourite car, where the responses are in
the form of 1, 2, 3, ..., n? In this situation we have ordinal data in
which taking differences between the numbers (or ranks) is
meaningless.
b) Furthermore, if we are asking for opinions where the opinion is of a
categorical form (e.g. strongly agree, agree, do not agree) then the
concept of difference is again meaningless. The responses are words
not numbers, but you can, if you so wish, solve this problem by
allocating a number to each response, with 1 for strongly agree, 2 for

Choosing a Test

Chi-Square Test
Versatile test
Widely used test with data that is categorical (or nominal or
qualitative) in nature
This section will explore the application of chi square in solving 4 types
of problems:
1. Perform a χ² test of association (independence)
2. Perform a χ² test of the difference between two independent proportions
3. Perform a χ² test of the difference between two dependent proportions
(McNemar's test for matched pairs)
4. Perform a χ² test of goodness of fit to a theoretical probability distribution

For 2 and 3, you could use a Z/t test if you assume the population is
normally distributed.

1. Chi Square Test of Association

Used to determine whether the frequencies of occurrence for two (or more)
category variables are significantly related (or associated) to each other
The null hypothesis states that the row and column variables are not associated
It can be shown that if the null hypothesis is true then the expected frequencies
(E) can be calculated using
E = (Row Total × Column Total) / Grand Total
To test the null hypothesis we compare the expected cell frequencies with the
observed cell frequencies (O) and calculate the chi-squared test statistic given by
χ² = Σ (O - E)² / E
The number of degrees of freedom, df, is given by df = (r - 1)(c - 1), where r is
the number of rows and c the number of columns
For the chi square test to give meaningful results the expected frequency for each
cell is required to be at least 5.
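The three formulas above (expected frequencies, χ² statistic and degrees of freedom) translate into a short routine. The sketch below is illustrative only: `chi_square_association` is a hypothetical helper, and the 2 x 3 table is made-up data, not the contingency table of the course example.

```python
def chi_square_association(observed):
    """Return (chi2, df, expected) for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    expected = []
    for i, row in enumerate(observed):
        expected_row = []
        for j, o in enumerate(row):
            # E = (Row Total x Column Total) / Grand Total
            e = row_totals[i] * col_totals[j] / grand_total
            expected_row.append(e)
            chi2 += (o - e) ** 2 / e      # accumulate (O - E)^2 / E
        expected.append(expected_row)

    df = (len(observed) - 1) * (len(observed[0]) - 1)   # (r - 1)(c - 1)
    return chi2, df, expected


# Hypothetical 2 x 3 contingency table (made-up counts)
observed = [[20, 30, 50],
            [30, 30, 40]]
chi2, df, expected = chi_square_association(observed)

# The test is only meaningful if every expected frequency is at least 5
all_cells_ok = all(e >= 5 for row in expected for e in row)
print(round(chi2, 4), df, all_cells_ok)
```

In Excel the p-value for the resulting statistic could then be obtained with =CHIDIST(chi2, df).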

Example 9: Suppose a university sampled 485 of its students to determine whether
males and females differed in preference for five courses offered. The question we
would like to answer is whether or not we have an association between the courses
chosen and the person's gender. In this case we have two attributes, gender and
course, both of which have been divided into categories: 2 for gender and 5 for
course. Determine whether gender and course preference are associated using the
chi-square test of association on the contingency table.

E = (Row Total × Column Total) / Total Sample Size
χ² = Σ (O - E)² / E = 63.36
df = (r - 1)(c - 1) = 4

Step 1: State the null and the alternate hypothesis
H0: Gender and course preference are not associated (or independent)
H1: There is an association between gender and course preference (or dependent)
Step 2: Select the level of significance - α = 0.05
Step 3: Select the test statistic - Chi-square distribution
χ² = Σ (O - E)² / E = 63.36, with E = (Row Total × Column Total) / Total Sample Size
and df = (r - 1)(c - 1) = 4
Step 4: Formulate the decision rule
P-value Method - If p < α, then reject the null hypothesis H0 & accept the
alternative hypothesis H1
From Excel, P-value = 5.7E-13
As 5.7E-13 < 0.05, Accept H1
Critical test Statistic Method - If test statistic > critical test statistic then we
would reject the null hypothesis H0 & accept the alternative hypothesis H1
From Excel, critical value = 9.4877
As 63.36 > 9.4877, Accept H1
Step 5: Interpretation
There is a significant relationship, or association, between the category variables

2. Testing Differences for 2 Proportions

Example 10: A firm surveys whether or not its employees use the train to travel to
work. The firm collects the data and has created a 2*2 contingency table to
summarise the responses for only the people who work on two days. The question is
now whether or not we have a significant difference between the Monday and
Wednesday employees who travel to work by train.

Contd...
Step 1: State null and the alternate hypothesis
H0: π1 = π2 (proportions travelling by train on the two days are the same)
H1: π1 ≠ π2 (proportions different)
Step 2: Select the level of significance - α = 0.05
Step 3: Select the test statistic - chi-square distribution

$$\chi^2_{cal} = \sum \frac{(O - E)^2}{E} = 4.4373$$

Step 4: Formulate the decision rule
P-value Method - If p < α, then reject null hypothesis H0 & accept the alternative hypothesis H1
From Excel, upper tail P-value = 0.035161
As 0.035161 < 0.05, reject H0 and accept H1
Critical test Statistic Method - If test statistic > critical test statistic then we would reject the null hypothesis H0 & accept the alternative hypothesis H1
From Excel, upper tail critical value = 3.84
As 4.4373 > 3.84, reject H0 and accept H1
Step 5: Interpretation - Conclude that there is a significant difference in the proportions travelling by train on Monday and Wednesday. Note: if we have a 1% significance level then the decision would be reversed.
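The p-value and the remark about the 1% level can be checked with a short Python sketch. It uses only the test statistic 4.4373 quoted in the example. The function name is illustrative, and the identity used (for df = 1 the chi-square upper tail equals erfc(sqrt(x/2))) applies to one degree of freedom only.

```python
import math

def chi2_upper_tail_df1(x):
    """Upper-tail chi-square probability for df = 1 (equals the two-tail normal probability)."""
    return math.erfc(math.sqrt(x / 2))

chi2_cal = 4.4373          # test statistic quoted in the example
p_value = chi2_upper_tail_df1(chi2_cal)

# Compare the same p-value with the 5% and 1% significance levels.
for alpha in (0.05, 0.01):
    decision = "reject H0" if p_value < alpha else "do not reject H0"
    print(f"alpha = {alpha}: p = {p_value:.6f} -> {decision}")
```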

3. Chi Square Test for the Difference between 2 Dependent Proportions (McNemar's test)
Example:- Estimate the effectiveness of a political campaign on the voting patterns
of a group of voters. 2 groups of voters are selected at random and their voting
intentions (Drop CO2, Tax) for a local election are recorded. Both groups are then
subjected to the same campaign and their voting intentions are recorded. The
question that arises is whether or not the campaign was effective on the voting
intentions of the voters.
In the problem, we shall look at whether or not the proportion voting Drop CO2 has significantly changed.

Contd...
In general the 2*2 contingency table can be structured as shown above
Drop CO2 Before proportion (π1) and Drop CO2 After proportion (π2) are given by the equations

$$\pi_1 = \frac{a + b}{N}, \qquad \pi_2 = \frac{a + c}{N}$$

With r rows and c columns in the table, df = (r - 1)(c - 1) = (2 - 1)(2 - 1) = 1
Two Tests available:
To test the null hypothesis we can use the McNemar z-test statistic, which is approximately normally distributed and is defined by the equation

$$Z = \frac{b - c}{\sqrt{b + c}}$$

To test the null hypothesis we can also use the McNemar χ²-test statistic defined by the equation

$$\chi^2 = \frac{(b - c)^2}{b + c}$$

For one df the relationship between χ² and Z is χ² = Z²

Contd...
Step 1: State null and the alternate hypothesis
H0: π1 = π2 (proportions voting Drop CO2 before and after are the same)
H1: π1 ≠ π2 (proportions different)
Step 2: Select the level of significance - α = 0.05
Step 3: Select the test statistic - McNemar z-test statistic
Step 4: Formulate the decision rule

$$Z_{cal} = \frac{b - c}{\sqrt{b + c}} = \frac{89 - 45}{\sqrt{89 + 45}} = 3.801$$

P-value Method - If p < α, then reject null hypothesis H0 & accept the alternative hypothesis H1
Proportions, π1 = 0.60 and π2 = 0.53
From Excel, the two tail p-value = 0.00014. As 0.00014 < 0.05, reject H0 and accept H1
Critical test Statistic Method - If test statistic > critical test statistic then we would reject the null hypothesis H0 & accept the alternative hypothesis H1
From Excel, Two tail critical value = 1.96
As 3.801 > 1.96, reject H0 and accept H1
Step 5: Interpretation - There is a significant difference in the voting intentions for Drop CO2 before and after the campaign.
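The McNemar calculation can be reproduced with a minimal Python sketch that uses only the discordant counts b = 89 and c = 45 from the example. The function name `mcnemar` is illustrative, and the two-tail p-value is taken from the standard normal distribution via `math.erfc`.

```python
import math

def mcnemar(b, c):
    """Return (z, chi2, two-tail p) from the two discordant cell counts b and c."""
    z = (b - c) / math.sqrt(b + c)
    chi2 = (b - c) ** 2 / (b + c)
    p_two_tail = math.erfc(abs(z) / math.sqrt(2))  # 2 * (1 - Phi(|z|))
    return z, chi2, p_two_tail

z, chi2, p = mcnemar(89, 45)   # b = 89, c = 45 as in the worked example
print(round(z, 3), round(chi2, 3), round(p, 5))
```

Because χ² = Z² for one degree of freedom, both versions of the test lead to the same decision.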

4. Chi Square Goodness-of-Fit Test

In this section we will explore the concept of measuring how well a data set can be modelled by a particular probability distribution using the goodness-of-fit test

It can be applied to discrete distributions, such as the Binomial & Poisson

For a chi-square goodness-of-fit test, the hypotheses take the form:

H0: The data are consistent with a specified distribution

H1: The data are not consistent with a specified distribution

The goodness-of-fit can be assessed by conducting a chi-square test on the observed and expected frequencies as defined by the equation

$$\chi^2_{cal} = \sum \frac{(O - E)^2}{E}, \qquad df = n - k - 1$$

Contd...

A motorway safety officer believes that the number of accidents per
week occurring on a stretch of motorway can be modelled using a
Poisson distribution.
If X denotes the number of accidents per week then the sample data can be
modelled by fitting a Poisson distribution to the sample data.

$$P(X = r) = \frac{\lambda^r e^{-\lambda}}{r!}$$

Contd...
Step 1: State null and the alternate hypothesis
H0: No. of accidents follow a Poisson distribution
H1: No. of accidents do not follow a Poisson distribution
Step 2: Select the level of significance - α = 0.05
Step 3: Select the test statistic - Chi-square distribution/ Goodness of fit test
Step 4: Formulate the decision rule
P-value Method - If p < α, then reject null hypothesis H0 & accept the alternative hypothesis H1
From Excel, the p-value = 0.73
As 0.73 > 0.05, Accept H0
Critical test Statistic Method - If test statistic > critical test statistic then we would reject the null hypothesis H0 & accept the alternative hypothesis H1
From Excel, upper tail critical value = 11.07
As the calculated test statistic is less than 11.07 (consistent with p = 0.73 > 0.05), Accept H0
Step 5: Interpretation - Conclude that there is no significant difference between the observed and expected frequencies. This implies that the data can be modelled by a Poisson distribution.
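The fitting procedure can be sketched in Python as follows. The weekly accident counts are hypothetical (the officer's data are not reproduced in these notes), the function names are illustrative, and the sketch assumes that the Poisson mean is estimated from the sample, so k = 1. The quoted critical value 11.07 is the 5% chi-square critical value for df = 5, which is what seven frequency cells give.

```python
import math

def poisson_pmf(r, lam):
    """P(X = r) = lam^r * e^(-lam) / r!"""
    return lam ** r * math.exp(-lam) / math.factorial(r)

def poisson_goodness_of_fit(observed):
    """observed[r] = number of weeks with r accidents; the last cell means 'r or more'."""
    weeks = sum(observed)
    # Estimate lambda from the sample mean (the open last cell is counted at its lower bound).
    lam = sum(r * o for r, o in enumerate(observed)) / weeks
    probs = [poisson_pmf(r, lam) for r in range(len(observed) - 1)]
    probs.append(1 - sum(probs))               # probability of the open last cell
    expected = [weeks * p for p in probs]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1 - 1                 # n - k - 1 with k = 1 estimated parameter
    return lam, expected, chi2, df

# Hypothetical weekly accident counts for 0, 1, ..., 5 and "6 or more" accidents.
observed = [12, 24, 26, 18, 11, 6, 3]
lam, expected, chi2, df = poisson_goodness_of_fit(observed)
critical_value = 11.07                         # 5% critical value for df = 5
print(round(lam, 3), round(chi2, 3), df, chi2 < critical_value)
```

In practice, cells with very small expected frequencies are usually merged before the statistic is calculated; the sketch omits that step.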

Non-Parametric Tests
Many statistical tests require that the data follow a normal distribution
Distribution free tests/ Non-parametric tests - Do not require the data to
follow a particular distribution
In this presentation we will explore three non-parametric tests:
Sign test
Wilcoxon signed rank test
Mann Whitney U test
| Test | Parametric test | Non parametric test |
|---|---|---|
| One sample | One sample z-test; One sample t-test | Sign test; Wilcoxon signed-rank test |
| Paired samples | Two paired sample Z-test; Two paired sample t-test | Sign test; Wilcoxon signed rank test |
| Independent samples | Two independent sample t-test | Mann Whitney U test (Wilcoxon rank sum test) |

5. The Sign Test

The sign test is used to test a set of data values against a perceived
hypothesis statement, including:
1. Assessing the validity of a population median value estimated from collected sample data. This replaces the one-sample t-test, which assumes a normal population and that the mean value is meaningful.
2. Assessing the validity that the difference between two population medians is zero, based upon sample data. This replaces the paired t-test, which assumes a normal population and that the mean value is meaningful.
3. Assessing the validity of proportions where the proportions are estimated from ordered nominal (or categorical) data, where a numerical scale is inappropriate but where we can rank the data observations. This replaces the sample Z test for proportions, which assumes a normal population.

Contd...
If we rank the data then the null hypothesis would result in half the ranks
to be less than the median (r1) and half the ranks would be greater than
the median (r2)
In this situation the null hypothesis can be modelled by a binomial
distribution with the probability of a data value being less than or greater
than the median being equal to p = 0.5, with sample size n
The sign test assumptions are
Randomly selected samples and
Continuous distribution

Sign test measures the number of counts that fall above and below the
median value
Under the null hypothesis, we would expect the number of counts
distribution to be approximately symmetric around the median and the
distribution of values below and above to be distributed at random among
the ranks

Contd...
The corresponding hypothesis statements for two tail and one tail tests
are:
Two tail test
H0: sample median = population median (0.5)
H1: sample median ≠ population median (0.5)

Upper one tail test
H0: sample median ≤ population median (0.5)
H1: sample median > population median (0.5)

Lower one tail test
H0: sample median ≥ population median (0.5)
H1: sample median < population median (0.5)

For a binomial distribution the value of the probability (P(X = r)), mean (μ) and
standard deviation (σ) are given by:

$$P(X = r) = \binom{n}{r} p^r q^{n - r}, \qquad \binom{n}{r} = \frac{n!}{r!\,(n - r)!}$$

$$\mu = np, \qquad \sigma = \sqrt{npq} = \sqrt{np(1 - p)}$$

The probability distribution is a binomial distribution with the probability of success p = 0.5 and the number of trials represented by the number of paired observations (n), X ~ Bin (n, p)

16 randomly selected people were chosen to measure the effectiveness of a new training programme on the value of sales. Calculate test statistics:
(i) Binomial probability P(X ≥ 12)

For the training programme to be effective we would expect the hypothesis statement to be
H1: the training programme results in the average value in sales to increase
Given random selection is made and no information is given about the distribution, we will use the sign test here

Contd...
Step 1: State null and the alternate hypothesis
H0: The median sales difference is zero
H1: Median sales after training > Median sales before training
Step 2: Select the level of significance - α = 0.05
Step 3: Select the test statistic - Sign test
Step 4: Formulate the decision rule
Calculate binomial probabilities, P(X ≥ x)
P-value Method - If p < α, then reject null hypothesis H0 & accept the alternative hypothesis H1
p = P(X ≥ 12) = P(X = 12, 13, 14, 15, 16)
= P(X = 12) + P(X = 13) + P(X = 14) + P(X = 15) + P(X = 16)
From Excel, μ = np = 8 and σ = sqrt(npq) = 2. Using the normal approximation with a continuity correction (Xc = 11.5):

$$Z = \frac{X_c - \mu}{\sigma} = \frac{11.5 - 8}{2} = 1.75$$

From Excel, upper one tail p-value = 0.0401
As 0.0401 < 0.05, Reject H0
Critical test Statistic Method - If test statistic > critical test statistic then we would reject the null hypothesis H0 & accept the alternative hypothesis H1
From Excel, upper one tail critical value = 1.6449
As 1.75 > 1.6449, Reject H0
Step 5: Interpretation - Conclude that median sales after training are significantly greater than median sales before training, so the training programme appears to be effective.
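The sign test figures can be checked with a short Python sketch that uses n = 16, x = 12 and p = 0.5 from the example. The helper names are illustrative. It computes both the exact binomial probability P(X ≥ 12) and the normal approximation with continuity correction; the exact value (about 0.038) is slightly smaller than the approximate p-value of 0.0401, and both lead to the same decision at the 5% level.

```python
import math

def binomial_upper_tail(n, x, p=0.5):
    """Exact P(X >= x) for X ~ Bin(n, p)."""
    return sum(math.comb(n, r) * p ** r * (1 - p) ** (n - r) for r in range(x, n + 1))

def normal_upper_tail(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

n, x, p = 16, 12, 0.5
mu = n * p                               # mean of the binomial distribution
sigma = math.sqrt(n * p * (1 - p))       # standard deviation
z = (x - 0.5 - mu) / sigma               # continuity-corrected z value
exact_p = binomial_upper_tail(n, x)
approx_p = normal_upper_tail(z)
print(round(exact_p, 4), z, round(approx_p, 4))
```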

6. Wilcoxon Signed Rank Sum Test (Matched Pairs Test)

The t-test is the standard test of whether the population
means for two paired samples are equal
If the populations are non-normal, particularly for small samples, then the t-test may not be valid
As for the sign test, the Wilcoxon signed rank sum test is another example of
a non-parametric/ distribution free test, used to test the null hypothesis that
the median of a distribution is equal to some value
It can be used in place of
1) One-sample t-test
2) Paired t-test
3) Ordered categorical data where a numerical scale is inappropriate but
where it is possible to rank the observations
The method considers the differences between n matched pairs as one
sample
If the two population distributions are identical, then we can show that the
sample statistic has a symmetric null distribution

Contd...
Assumptions
The Wilcoxon signed rank sum test assumptions are:
1)Each matched data pair is randomly distributed
2)The matched pair differences should be symmetrically distributed
Although the Wilcoxon test assumes neither normality nor homogeneity
of variance, it does assume that the two samples are from populations with
the same distribution shape
It is also vulnerable to outliers although not to nearly the same extent as
the t-test

Contd...
Suppose that Slim-Gym is offering a weight reduction program that they advertise
will result in more than a 10 lb weight loss in the first 30 days. Twenty subjects were
selected for a study and their weights before and after the weight loss program
were recorded.

Contd...

| Test | Hypothesis | Tcal |
|---|---|---|
| Two-tailed Test | H1: Population locations not centred at 0 | Tcal = Minimum of T- and T+ |
| One-tailed Test | H1: Population differences are centred at a value > 0 | Tcal = T+ |
| One-tailed Test | H1: Population differences are centred at a value < 0 | Tcal = T- |

1. Rank data
2. Calculate the sum of the ranks, T- and T+
T- = Sum of -ve ranks = 35
T+ = Sum of +ve ranks = 265
3. Find Tcal
4. Calculate critical values
5. Make decision: From the sample data we have sufficient statistical evidence that the weight loss is greater than 10 lbs.

$$T_- + T_+ = \frac{n'(n' + 1)}{2} = 300$$

$$\mu_T = \frac{n'(n' + 1)}{4} = 150$$

$$\sigma_T = \sqrt{\frac{n'(n' + 1)(2n' + 1)}{24}} = 35.0$$

$$Z = \frac{T_{cal} - \mu_T - 0.5}{\sigma_T} = \frac{265 - 150 - 0.5}{35.0} = 3.2714$$

Contd...
Step 1: State null and the alternate hypothesis
H0: The population median weight loss is at most 10 lbs (X-Y-10 ≤ 0)
H1: The population median weight loss is greater than 10 lbs (X-Y-10 > 0)
Step 2: Select the level of significance - α = 0.05
Step 3: Select the test statistic - Wilcoxon signed rank test (Samples consist of ratio
data & no information about the form of the distribution)
Step 4: Formulate the decision rule
P-value Method - If p < α, then reject null hypothesis H0 & accept the alternative hypothesis H1
From Excel, upper one tail p-value = 0.000535
As 0.000535 < 0.05, Reject H0
Critical test Statistic Method - If test statistic > critical test statistic then we would reject the null hypothesis H0 & accept the alternative hypothesis H1
From Excel, upper one tail critical value = 1.6449
As 3.2714 > 1.6449, Reject H0
Step 5: Interpretation - Conclude that there is sufficient statistical evidence that the median weight loss is greater than 10 lbs.
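The Wilcoxon normal approximation can be reproduced with the Python sketch below. It uses T+ = 265 from the example and n' = 24, which is the value implied by T- + T+ = n'(n' + 1)/2 = 300. The function names and the short list of differences are hypothetical and only illustrate the ranking step (zero differences dropped, tied ranks averaged).

```python
import math

def signed_rank_sums(differences):
    """Return (T_plus, T_minus, n_prime); zero differences are dropped, ties get average ranks."""
    nonzero = [d for d in differences if d != 0]
    ordered = sorted(abs(d) for d in nonzero)
    rank_of = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        rank_of[ordered[i]] = (i + 1 + j) / 2   # average of ranks i+1 .. j
        i = j
    t_plus = sum(rank_of[abs(d)] for d in nonzero if d > 0)
    t_minus = sum(rank_of[abs(d)] for d in nonzero if d < 0)
    return t_plus, t_minus, len(nonzero)

def wilcoxon_upper_tail_z(t_plus, n_prime):
    """Normal approximation with continuity correction for an upper one tail test."""
    mu = n_prime * (n_prime + 1) / 4
    sigma = math.sqrt(n_prime * (n_prime + 1) * (2 * n_prime + 1) / 24)
    z = (t_plus - 0.5 - mu) / sigma
    p = 0.5 * math.erfc(z / math.sqrt(2))       # P(Z > z)
    return mu, sigma, z, p

# Figures from the worked example: T+ = 265 and n' = 24.
mu, sigma, z, p = wilcoxon_upper_tail_z(265, 24)
print(mu, sigma, round(z, 4), round(p, 6))

# Hypothetical differences, only to show the ranking step.
print(signed_rank_sums([1.5, -1.5, 3.0, 0.0, 4.2]))
```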

Contd...
Small number of paired observations (n ≤ 20)

Calculate Tcal. Tcal = MIN (T- , T+) = MIN (35, 265) = 35. The decision rule is to reject H0 if Tcal ≤ Tcri

Tcri = 92 for α = 0.05 with no tied ranks

Tcal ≤ Tcri (35 < 92), We would reject H0 & accept H1

If you want the upper critical value then you can calculate the value if you remember that the distribution is symmetric about the median (remember median = mean for symmetric distributions):

$$\mu_T - T_{cri}^{lower} = T_{cri}^{upper} - \mu_T$$
Dealing with ties (Tied Observations)
There are two types of tied observations that may arise when using the
Wilcoxon signed rank test:
1. Observations in the sample may be exactly equal to 0 in the case of paired differences. Ignore such observations and adjust n accordingly. For the previous example we removed any zero values and used n' instead of n.
2. Two or more observations/differences may be equal. If so, average the ranks across the tied observations and reduce the variance by Equation (8.16) for each group of t tied ranks.

$$\frac{t^3 - t}{48}$$

7. Mann-Whitney U Test for 2 Independent Samples

The Mann-Whitney U test is a non-parametric test that can be used in place of an unpaired t-test
It is used to test the null hypothesis that two samples come from the same
population (i.e. have the same median) or, alternatively, whether
observations in one sample tend to be larger than observations in the other
Although it is a non-parametric test it does assume that the two distributions
are similar in shape
The basic premise of the test is that once all of the values in the two samples
are put into a single ordered list, if they come from the same parent
population, then the rank at which values from sample 1 and sample 2
appear will be by chance
If the two samples come from different populations, then the rank at which
the sample values will appear will not be random and there will be a
tendency for values from one of the samples to have lower ranks than values
from the other sample
We are thus testing for different locations of the two samples
The Mann-Whitney assumptions are as follows:
(1) independent random samples are obtained from each population
(2) the two populations are continuous and have the same shape

A local training firm has developed an innovative programme to improve the
performance of students on the courses it offers. To assess whether the new
programme improves student performance the firm has collected two random samples
from the population of students sitting an accountancy examination, where sample 1
students have studied via the traditional method and sample 2 students via the new
programme.
The firm has analysed previous data and the results provide evidence that the
distribution is not normal but is skewed to the left. This raises concerns about the
suitability of using a two sample independent t-test to undertake the analysis, so the
firm instead decides to use a suitable distribution free test. In this case the
appropriate test is the Mann-Whitney U test.

Contd...

$$Z = \frac{U_{cal} - \mu_U + 0.5}{\sigma_U} = \frac{9 - 28 + 0.5}{8.6410} = -2.1410$$

Step 1: State null and the alternate hypothesis


H0: No difference in examination performance between the two groups
H1: New programme has improved performance (M1 < M2)
Step 2: Select the level of significance - α = 0.05
Step 3: Select the test statistic - Mann-Whitney U test (Lower one tail test)
Step 4: Formulate the decision rule
As the total number of pair wise comparisons exceeds 20 (n1n2 = 7*8 = 56 > 20), we can
approximate the Mann-Whitney distribution with a normal distribution
P-value Method - If p < α, then reject null hypothesis H0 & accept the alternative hypothesis H1
From Excel, lower one tail p-value = 0.0161
As 0.0161 < 0.05, Reject H0
Critical test Statistic Method - For a lower one tail test, if test statistic < critical test statistic then we would reject the null hypothesis H0 & accept the alternative hypothesis H1
From Excel, lower one tail critical value = -1.6449
As -2.1410 < -1.6449, Reject H0
Step 5: Interpretation - Based upon the data, there is sufficient evidence to indicate at a 5% significance level that the performance has improved. Note that if we modify the level of significance to 1% then the decision would be a borderline decision.
$$U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - T_1, \qquad U_2 = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - T_2$$

$$U_{cal} = \text{Minimum}(U_1, U_2), \qquad \mu_U = \frac{n_1 n_2}{2}$$
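The Mann-Whitney calculation can be sketched in Python as below. The normal approximation uses Ucal = 9, n1 = 7 and n2 = 8 from the example. The untied standard deviation formula sqrt(n1 n2 (n1 + n2 + 1) / 12) is an assumption of this sketch (it is the standard form and reproduces the quoted 8.6410). The function names and the two short samples of marks are hypothetical and only illustrate the ranking step.

```python
import math

def average_ranks(values):
    """Map each distinct value to its rank in the pooled sample (ties share the average rank)."""
    ordered = sorted(values)
    rank_of = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        rank_of[ordered[i]] = (i + 1 + j) / 2
        i = j
    return rank_of

def mann_whitney_u(sample1, sample2):
    """Return (U1, U2, Ucal) from the rank sums T1 and T2 of the pooled data."""
    n1, n2 = len(sample1), len(sample2)
    rank_of = average_ranks(list(sample1) + list(sample2))
    t1 = sum(rank_of[v] for v in sample1)
    t2 = sum(rank_of[v] for v in sample2)
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - t1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - t2
    return u1, u2, min(u1, u2)

def lower_tail_z(u_cal, n1, n2):
    """Normal approximation (no tie correction) with continuity correction."""
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_cal - mu + 0.5) / sigma
    p = 0.5 * math.erfc(-z / math.sqrt(2))   # P(Z < z)
    return mu, sigma, z, p

# Figures from the worked example: Ucal = 9 with n1 = 7 and n2 = 8.
mu, sigma, z, p = lower_tail_z(9, 7, 8)
print(mu, round(sigma, 4), round(z, 4), round(p, 4))

# Hypothetical marks, only to show the ranking step.
print(mann_whitney_u([45, 52, 58, 61], [55, 63, 70, 74, 80]))
```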

Paired Comparisons
Small number of pair wise observations (n ≤ 20)
For a small number of paired comparisons (n = n1n2 ≤ 20) we use tables to
calculate an exact value of the critical test value (Ucri) or an exact p-value based
upon P(U ≤ 9). For a 5% two tail test with n1 = 7, n2 = 8: (i) the lower critical U
value, Ucri = 11. Since Ucal < Ucri (9 < 11), we reject H0 and accept H1, and (ii) the
lower p-value = 0.014.
Given that we have a two tailed test then the two tail p-value = 2*0.014 =
0.028 < 0.05, we reject H0 and accept H1. The theory suggests that if the null
hypothesis is true then the U test statistic will be centered at μU = 28 with
critical regions identified in Figure 8.11.

Tied Observations
Dealing with ties
If we find data with the same number value then we can deal with this problem
by allocating the average tie value to each shared data value. In this situation
we would then have to use the normal approximation with the standard deviation
σU adjustment given by Equation (8.23):

$$\sigma_U = \sqrt{\frac{n_1 n_2}{(n_1 + n_2)(n_1 + n_2 - 1)} \left[ \frac{(n_1 + n_2)^3 - (n_1 + n_2)}{12} - \sum_{j=1}^{g} \frac{t_j^3 - t_j}{12} \right]}$$

Where g = number of groups of tied ranks and tj = the number of tied ranks in group j.
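A minimal Python sketch of the tie adjustment, assuming the standard form of the tie-corrected standard deviation for Equation (8.23). The function name and the tie group sizes in the second call are hypothetical. With no ties the result reduces to the unadjusted value used in the example (8.6410 for n1 = 7, n2 = 8), and any ties make it smaller.

```python
import math

def sigma_u_with_ties(n1, n2, tie_group_sizes=()):
    """Tie-adjusted standard deviation of U; tie_group_sizes lists t_j for each tied group."""
    n = n1 + n2
    tie_term = sum((t ** 3 - t) / 12 for t in tie_group_sizes)
    return math.sqrt(n1 * n2 / (n * (n - 1)) * ((n ** 3 - n) / 12 - tie_term))

print(round(sigma_u_with_ties(7, 8), 4))            # no ties
print(round(sigma_u_with_ties(7, 8, (2, 3)), 4))    # hypothetical: one pair and one triple tied
```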


1. In the example and exercises we have not modified the solution for tied
ranks.
2. The Mann-Whitney U test is statistically equivalent to the Wilcoxon rank sum
test.

Conclusion
In this presentation we explored the concept of chi-square and non-parametric hypothesis testing:
