Managers
Using Excel for Data Analysis
Credits: 3
Unit I
Digital Data Introduction, Types of Digital Data: Structured Data,
Unstructured Data, and Semi-Structured Data; Exploring and Discovering
Data; Introduction to OLTP: Queries, Advantages and Challenges;
Introduction to OLAP: One-dimensional, Two-dimensional and Three-dimensional data; OLAP Architectures: MOLAP, ROLAP, HOLAP; Role of
OLAP Tools in BI Architecture.
Introduction to Big data; BI Component Frameworks: Business,
Administration and Operation, and Implementation Layer; BI is for
Whom?; BI Applications; BI Roles and Responsibilities; Basics of Data
Integration; Data Warehousing: Data Mart; What constitutes a Data
Warehouse?; Introduction to ETL; Data Integration.
CONTD...
UNIT II
Unit III
CONTD...
Suggested Readings
Prasad R N and Acharya Seema (2013), Fundamentals of Business Analytics, Wiley India Pvt.
Ltd., New Delhi.
Glyn Davis and Branko Pecar (2013), Business Statistics using Excel, Oxford University Press,
New Delhi.
Halady Rao Purba (2013), Business Analytics an Application Focus, PHI Learning Private
Limited, New Delhi.
Jank Wolfgang (2011), Business Analytics for Managers, Springer Science+Business Media,
ISBN 978-1-4614-0405-7.
Davenport Thomas H et al. (2008), Competing on Analytics, Pearson Publication, USA.
Decision support systems and business intelligence HBR Press.
E Turban, et al. (2008), Business Intelligence: A Managerial Approach, Pearson Prentice Hall.
Mosimann R et al. (2007), The performance manager: Proven strategies for turning information
into higher business performance, Cognos Press.
Articles
Session
Session 1
Session 2,3
Session 4,5
Correlation Analysis
Linear Regression Analysis
Time Series Data and Analysis
Session 6
Logistic Regression
Session 7
Session 8
Contd...
Topic
Session
Session 9
Session 10
Conjoint Analysis
Session 11
Data Mining
Session 12,13,14
Clustering Techniques
Association Rule Mining & Sentiment Analysis
Decision Trees & Random Forests
Structural Equations Modeling
Introduction to Parametric
Hypothesis Testing
Learning Objectives
On completing this unit you should be able to:
Understand concept of
Hypothesis Testing
Rationale
Hypothesis: statement about the value of a population parameter
developed for the purpose of testing
Deciding between two possibilities based on data - Is it real? Or is it just
coincidence? A hypothesis is either TRUE or FALSE
Example: - Hypothesis statement average salary of accountants is 31000 can be
measured & assessed through the variable salary
Contd...
Level of significance - represents the amount of
risk an analyst will accept when making a decision
It represents the amount of error associated with rejecting the null
hypothesis when it is true
Usually expressed as a % and denoted by α (alpha)
It is normally 5% (0.05) or 1% (0.01), or sometimes 10% (0.1)
The value of α depends upon how sure you want to be that your
decisions are an accurate reflection of the true population
relationship
Example: a 5% level of significance implies that there are about 5 chances in 100
of rejecting H0 when it is true, or that we are 95% confident that we
will make a correct decision
1. If we sample from a population that is normally distributed then the
sampling distribution of the sample mean will be normally distributed with sample
mean = population mean, and sampling error $\sigma_{\bar X} = \sigma/\sqrt{n}$
2. For populations that are not normally distributed we can make use of the Central
Limit Theorem. For large n, the sampling distribution approximates to the normal
distribution.
3. For small sample sizes we employ the Student t distribution: if the
population is normally distributed then the sample mean is distributed with
sample mean = population mean, with the sampling error estimated using the sample
standard deviation
Contd...
We stated earlier that the alternative hypothesis is of the form
H1: μ ≠ 31000
Two tailed test - The ≠ sign tells us that we are not sure what the
direction of the difference will be (< or >) but that a difference exists
One tailed test - It is possible that we are assessing that
the average accountant salary is greater than 31000,
implying H1: μ > 31000, or
is smaller than 31000,
implying H1: μ < 31000
Two tail test H1: μ ≠ 100
                                The Truth
Decision                        Null Hypothesis H0              Research Hypothesis H1
Accept Null Hypothesis H0       Correct Decision                Type II Error [not easily controlled]
Accept Research Hypothesis H1   Type I Error [α level 0.05]     Correct Decision
are often based upon data that has been ranked, rather than actual
measurement data
One sample test - involves testing a sample statistic (e.g. mean value)
against a perceived population value (e.g. accountant salary 31000) to
ascertain whether or not there is a significant difference between the sample
statistic and the population parameter
Summary of Excel functions

Test Statistic   Tail              P-value Method                    Critical test statistic
Z-test           Two-tail tests    =2*(1-NORMSDIST(ABS(Z-value)))
                 Lower-tail test   =NORMSDIST(Z-value)               =NORMSINV(α)
                 Upper-tail test   =1-NORMSDIST(Z-value)             =NORMSINV(1-α)
T-test           Two-tail tests    =TDIST(ABS(t-value),df,2)
                 Lower-tail test   =TDIST(ABS(t-value),df,1)         =-TINV(2*α,df)
                 Upper-tail test   =TDIST(ABS(t-value),df,1)         =TINV(2*α,df)
F-test           Two-tail tests    =FDIST(F-value,df1,df2)
                 Lower-tail test   =1-FDIST(F-value,df1,df2)         =FINV(1-α,df1,df2)
                 Upper-tail test   =FDIST(F-value,df1,df2)           =FINV(α,df1,df2)
Test Statistic
$Z_{cal} = \dfrac{\bar X - \mu}{\sigma/\sqrt{n}} = \dfrac{90 - 100}{20/\sqrt{36}} = -3$
P-value Method (two-tail tests): =2*(1-NORMSDIST(ABS(Z-value)))
From Excel, Two tail p-value = 0.0026998
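The one-sample Z calculation behind the two-tail p-value of 0.0026998 can be checked outside Excel. The sketch below is illustrative only: it assumes the slide figures read as sample mean 90, hypothesised mean 100, population standard deviation 20 and n = 36, and the function name `one_sample_z` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z(sample_mean, pop_mean, pop_sd, n):
    """Return (z, two_tail_p) for a one-sample Z-test with known sigma."""
    z = (sample_mean - pop_mean) / (pop_sd / sqrt(n))
    # Mirrors the Excel formula =2*(1-NORMSDIST(ABS(Z-value)))
    p_two_tail = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_tail


# Assumed reading of the slide: X-bar = 90, mu = 100, sigma = 20, n = 36
z_cal, p_value = one_sample_z(sample_mean=90, pop_mean=100, pop_sd=20, n=36)
print(f"Z = {z_cal:.2f}, two-tail p-value = {p_value:.7f}")
```

Because the p-value is far below 0.05, the same decision follows from either the p-value method or the critical value method.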
$\bar X = 2193$
$df = n - 1 = 6$
Critical test Statistic Method - If test statistic > critical test statistic then we
would reject null hypothesis H0 and accept alternative hypothesis H1
From Excel, Two tail critical value = TINV(α, df) = 2.45
As -2.45 < 1.04 < 2.45, Accept H0
Step 5: Interpretation
extras
2000.
$t_{cal} = \dfrac{\bar X - \mu}{s/\sqrt{n}} = \dfrac{2193 - 2000}{489.62/\sqrt{7}} = 1.04$
$df = n - 1 = 6$
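The one-sample t statistic can be reproduced with a few lines of Python. This is a minimal sketch that assumes the slide figures read as sample mean 2193, hypothesised mean 2000, sample standard deviation 489.62 and n = 7. The critical value 2.45 is taken from the slide rather than computed, and `one_sample_t` is a hypothetical helper name.

```python
from math import sqrt


def one_sample_t(sample_mean, hypothesised_mean, sample_sd, n):
    """Return (t, df) for a one-sample t-test."""
    t = (sample_mean - hypothesised_mean) / (sample_sd / sqrt(n))
    return t, n - 1


# Assumed reading of the slide: X-bar = 2193, mu = 2000, s = 489.62, n = 7
t_cal, df = one_sample_t(2193, 2000, 489.62, 7)

t_cri = 2.45  # two-tail critical value quoted on the slide for df = 6
reject_h0 = abs(t_cal) > t_cri
print(f"t = {t_cal:.2f}, df = {df}, reject H0: {reject_h0}")
```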
Test
Statistic
Two-tail
tests
P-value Method
=TDIST(ABS(t-value),df,2)
H1: μA > μB
$Z_{cal} = \dfrac{\bar X_A - \bar X_B}{\sqrt{\dfrac{\sigma_A^2}{n_A} + \dfrac{\sigma_B^2}{n_B}}} = 4.16$
Critical test Statistic Method - If test statistic > critical test statistic then we
would reject the null hypothesis H0 & accept the alternative hypothesis H1
From Excel, Upper tail critical value Zcri = +1.64
As 4.16 > 1.64, Accept H1
Step 5: Interpretation
At the 0.05 level of significance, the light bulbs from factory A have significantly
longer life time than the light bulbs from factory B.
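The two-sample Z statistic for the light bulb example can be recomputed from the summary figures in the Analysis ToolPak output. The sketch below is an illustration under that assumption (means 1135.333333 and 894.21875, known variances 46516.6 and 57845.9, sample sizes 30 and 32); `two_sample_z` is a hypothetical name.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(mean_a, mean_b, var_a, var_b, n_a, n_b):
    """Return (z, upper_tail_p) for a two-sample Z-test with known variances."""
    z = (mean_a - mean_b) / sqrt(var_a / n_a + var_b / n_b)
    # Mirrors the Excel formula =1-NORMSDIST(Z-value)
    p_upper = 1 - NormalDist().cdf(z)
    return z, p_upper


z_cal, p_upper = two_sample_z(1135.333333, 894.21875, 46516.6, 57845.9, 30, 32)
z_cri = NormalDist().inv_cdf(0.95)  # upper-tail critical value at alpha = 0.05
print(f"Z = {z_cal:.4f}, p = {p_upper:.3e}, Zcri = {z_cri:.4f}")
```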
Test Statistic
$Z_{cal} = \dfrac{\bar X_A - \bar X_B}{\sqrt{\dfrac{\sigma_A^2}{n_A} + \dfrac{\sigma_B^2}{n_B}}} = 4.16$
P-value Method (upper-tail test): =1-NORMSDIST(Z-value)
Analysis ToolPak solution
Select Data > Data Analysis > Z Test: Two Sample for Means

                               Variable 1     Variable 2
Mean                           1135.333333    894.21875
Known Variance                 46516.6        57845.9
Observations                   30             32
Hypothesized Mean Difference
z                              4.160713018
P(Z<=z) one-tail               1.58628E-05
z Critical one-tail            1.644853627
P(Z<=z) two-tail               3.17256E-05
z Critical two-tail            1.959963
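H0: πA = πB   H1: πA ≠ πB
Given α = 5% = 0.05, NA = 250, NB = 190, nA = 135, nB = 80, pA ≈ πA, pB ≈ πB
$Z_{cal} = \dfrac{p_A - p_B}{\sqrt{\dfrac{p_A(1 - p_A)}{N_A} + \dfrac{p_B(1 - p_B)}{N_B}}}$, where $p_A = n_A/N_A$ and $p_B = n_B/N_B$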
Critical test Statistic Method - If test statistic > critical test statistic then we
would reject the null hypothesis H0 & accept the alternative hypothesis H1
From Excel, Two tail critical value Zcri = ±1.96
Test Statistic
Two-tail tests
P-value Method
=2*(1-NORMSDIST(ABS(Z-value)))
                             City A   City B
No. interviewed, N           250      190
No. wearing seat belts, n    135      80
$Z_{cal} = \dfrac{p_A - p_B}{\sqrt{\dfrac{p_A(1 - p_A)}{N_A} + \dfrac{p_B(1 - p_B)}{N_B}}} = 2.49$
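The seat belt figures (135 of 250 in City A, 80 of 190 in City B) are enough to recompute the Z statistic of 2.49. The following is a minimal sketch assuming the unpooled standard error, which is what reproduces the slide value; `two_proportion_z` is a hypothetical name.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(x_a, n_a, x_b, n_b):
    """Return (z, two_tail_p) for the difference between two proportions."""
    p_a, p_b = x_a / n_a, x_b / n_b
    # Unpooled standard error of the difference in sample proportions
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = (p_a - p_b) / se
    p_two_tail = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_tail


z_cal, p_value = two_proportion_z(135, 250, 80, 190)
print(f"Z = {z_cal:.2f}, two-tail p-value = {p_value:.4f}")
```

Since the statistic exceeds the two-tail critical value of 1.96, the sketch agrees with rejecting H0 at the 5% level.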
Pooled variance: $\hat\sigma^2_{A+B} = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = 2082.017$
$df = n_1 + n_2 - 2 = 41$
Critical test Statistic Method - If test statistic > critical test statistic then we
would reject the null hypothesis H0 & accept the alternative hypothesis H1
From Excel, Two tail critical value tcri = +2.019
As 2.156 > 2.019, Accept H1
Step 5: Interpretation
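$t_{cal} = \dfrac{\bar X_1 - \bar X_2}{\sqrt{\hat\sigma^2_{A+B}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = 2.156$, where $\hat\sigma^2_{A+B}$ is the pooled variance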
We conclude that based upon the sample data collected that we have evidence
that the quantity of beans sold by shops A and B are significantly different at the
5% level of significance. It should be noted that the decision will change if you
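Test Statistic
P-value Method (two-tail tests): =TDIST(ABS(t-value),df,2)
Pooled variance: $\hat\sigma^2_{A+B} = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = 2082.017$
$df = n_1 + n_2 - 2 = 41$
$t_{cal} = \dfrac{\bar X_1 - \bar X_2}{\sqrt{\hat\sigma^2_{A+B}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = 2.156$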
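The pooled-variance t statistic can be rebuilt from the summary statistics reported in the Analysis ToolPak output (means 527.0555556 and 496.64, variances 2603.114379 and 1712.906667, sample sizes 18 and 25). The sketch below is illustrative and the helper name `pooled_t` is hypothetical.

```python
from math import sqrt


def pooled_t(mean1, mean2, var1, var2, n1, n2):
    """Return (pooled_variance, t, df) for an equal-variance two-sample t-test."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    t = (mean1 - mean2) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return pooled_var, t, df


sp2, t_cal, df = pooled_t(527.0555556, 496.64, 2603.114379, 1712.906667, 18, 25)
print(f"pooled variance = {sp2:.3f}, t = {t_cal:.3f}, df = {df}")
```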
Analysis ToolPak solution

                               Variable 1     Variable 2
Mean                           527.0555556    496.64
Variance                       2603.114379    1712.906667
Observations                   18             25
Pooled Variance                2082.017182
Hypothesized Mean Difference   0
df                             41
t Stat                         2.156381653
P(T<=t) one-tail               0.018485215
t Critical one-tail            1.682878003
P(T<=t) two-tail               0.03697043
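As 0.044 < 0.05, Accept H1
Critical test Statistic Method - If test statistic > critical test statistic then we
would reject the null hypothesis H0 & accept the alternative hypothesis H1
$df = \dfrac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}} = 32$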
Step 5: Interpretation
We conclude that based upon the sample data collected that we have evidence
that the quantity of beans sold by shops A and B are significantly different at the
5% level of significance. It should be noted that the result in this case rests at
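Test Statistic
P-value Method (two-tail tests): =TDIST(ABS(t-value),df,2)
$t_{cal} = \dfrac{\bar X_1 - \bar X_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = 2.083$
$df = \dfrac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}} = 32$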
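The unequal-variance t statistic and its approximate degrees of freedom can be checked from the same summary statistics. This is a sketch under the assumption that the degrees of freedom follow the Welch-Satterthwaite form and are rounded to the nearest integer, which reproduces the slide value of 32; `welch_t` is a hypothetical name.

```python
from math import sqrt


def welch_t(mean1, mean2, var1, var2, n1, n2):
    """Return (t, df) for a two-sample t-test without assuming equal variances."""
    a, b = var1 / n1, var2 / n2
    t = (mean1 - mean2) / sqrt(a + b)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, df


t_cal, df_exact = welch_t(527.0555556, 496.64, 2603.114379, 1712.906667, 18, 25)
df = round(df_exact)
print(f"t = {t_cal:.3f}, df = {df_exact:.2f} (rounded to {df})")
```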
Analysis ToolPak solution
Two Sample t-Test for Means assuming unequal variances

                               Variable 1     Variable 2
Mean                           527.0555556    496.64
Variance                       2603.114379    1712.906667
Observations                   18             25
Hypothesized Mean Difference   0
df                             32
t Stat                         2.083385602
P(T<=t) one-tail               0.022644068
t Critical one-tail            1.693888703
P(T<=t) two-tail               0.045288136
t Critical two-tail            2.0369333
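H1: D > 10
Given α = 5% = 0.05
Distribution unknown
σ1 unknown, σ2 unknown
n1 = n2 = 26
Assume n large, CLT applies
Test statistic: t-distribution
P-value Method - If p < α, then reject the null hypothesis H0 & accept the alternative
hypothesis H1
From Excel, Upper one tail P-value = 0.0093
As 0.0093 < 0.05, Accept H1
$\bar d = \dfrac{\sum d}{n}$
$s_d = \sqrt{\dfrac{\sum (d - \bar d)^2}{n - 1}} = 14.57$
$df = n - 1 = 25$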
Critical test Statistic Method - If test statistic > critical test statistic then we
would reject the null hypothesis H0 & accept the alternative hypothesis H1
From Excel, Upper one tail critical value tcri = + 1.708
As 2.5178 > 1.708, Accept H1
$t_{cal} = \dfrac{\bar d - D}{s_d/\sqrt{n}} = 2.5178$
Step 5: Interpretation
Conclude that the average weight loss is more than 10 lbs at a 5% level of
significance. Again, you would have a borderline decision at 5%.
Analysis ToolPak solution
$s_d = \sqrt{\dfrac{\sum (d - \bar d)^2}{n - 1}} = 14.57$
$df = n - 1 = 25$
$t_{cal} = \dfrac{\bar d - D}{s_d/\sqrt{n}} = 2.5178$
Select Data > Data Analysis > Two Sample t Test Assuming Paired Samples
H1: $\sigma_1^2 \ne \sigma_2^2$
$F = \dfrac{s_A^2}{s_B^2} = 1.5197$
Critical test Statistic Method - If test statistic > critical test statistic then we
would reject the null hypothesis H0 & accept the alternative hypothesis H1
From Excel, the lower critical value FL = 0.39 and the upper critical value FU = 2.38
As FL(0.39) < F(1.5197) < FU(2.38), Accept H0
df numerator = nA - 1 = 17
df denominator = nB - 1 = 24
Step 5: Interpretation
Conclude that the two population variances are not significantly different at the
95% level of confidence
Test Statistic
P-value Method (two-tail tests): =FDIST(F-value, df1,df2)
$F = \dfrac{s_A^2}{s_B^2} = 1.5197$
df numerator = nA - 1
df denominator = nB - 1
                Variable 1     Variable 2
Mean            527.0555556    496.64
Variance        2603.114379    1712.90667
Observations    18             25
df              17             24
F               1.519705907
Select Data > Data Analysis > F Test for Two Population Variances (Variance Ratio Test)
Conclusion
In this presentation we explored the concept of hypothesis testing:
Thank You
This chapter provides an overview of the chi-squared distribution ($\chi^2$) and non-parametric tests that can be used when parametric methods are not appropriate.
Learning Objectives
On completing this unit you should be able to:
Apply the chi square test to measure the difference between
two proportions from two samples
Apply the chi-square test to test for association between
categorical variables
Apply the chi-square goodness-of-fit test
Apply the sign test to one sample
Apply the Wilcoxon signed rank T test to two paired samples
Apply the Mann-Whitney U test to two independent samples
Introduction
Parametric tests - assess whether the differences between means (or
variances, proportions) are statistically significant. Model assumptions
are:
a) Underlying population being measured varies as a normal distribution
b) The level of measurement is of equal interval or ratio scaling, and
c) The population variances are equal
Unfortunately, we will come across data that does not fit these
assumptions
a) How do we measure the difference between the attitudes of people
surveyed in assessing their favourite car, where the responses are in
the form of 1, 2, 3, ..., n? In this situation we have ordinal data in
which taking differences between the numbers (or ranks) is
meaningless.
b) Furthermore, if we are asking for opinions where the opinion is of a
categorical form (e.g. strongly agree, agree, do not agree) then the
concept of difference is again meaningless. The responses are words
not numbers, but you can, if you so wish, solve this problem by
allocating a number to each response, with 1 for strongly agree, 2 for
Choosing a Test
Chi-Square Test
Versatile test
Widely used test with data that is categorical (or nominal or
qualitative) in nature
This section will explore the application of chi square in solving 4 types
of problems:
1. Perform a χ² test of association (independence)
2. Perform a χ² test of the difference between two independent proportions
3. Perform a χ² test of the difference between two dependent proportions
(McNemar's test for matched pairs)
4. Perform a χ² test of goodness of fit to a theoretical probability distribution
For 2 and 3, you could use a Z/t test if you assume the population is
normally distributed.
1.
Chi-squared test statistic given by
$\chi^2 = \sum \dfrac{(O - E)^2}{E}$
Contd...
Example 9:- Suppose a university
sampled 485 of its students to determine
whether males and females differed in
preference for five courses offered. The
question we would like to answer is to
confirm whether or not we have an
association between the courses chosen
and the person's gender. In this case we
have two attributes, gender and course,
both of which have been divided into
categories: 2 for gender and 5 for
course. Determine whether gender and
course preference are associated using
chi-square test of association on the
contingency table.
$\chi^2 = \sum \dfrac{(O - E)^2}{E} = 63.2$
$df = (r - 1)(c - 1) = 4$
Contd...
Step 1: State null and the alternate hypothesis
H0: Gender and course preference are not associated (or independent)
H1: There is an association between sex and course preference (or dependent)
Step 2: Select the level of significance - α = 0.05
Step 3: Select the test statistic - Chi-square distribution
$\chi^2_{cal} = \sum \dfrac{(O - E)^2}{E} = 63.2$
Critical test Statistic Method - If test statistic > critical test statistic then we
would reject the null hypothesis H0 & accept the alternative hypothesis H1
$df = (r - 1)(c - 1) = 4$
From Excel, Two tail critical value = 9.4877
As 63.36 > 9.4877 , Accept H1
Step 5: Interpretation
There is a significant relationship, or association, between the
category variables
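The chi-square test of association follows the same arithmetic for any contingency table: expected counts come from the row and column totals, and the statistic sums (O - E)²/E over the cells. The sketch below is illustrative only. The observed counts are hypothetical, because the university contingency table is not reproduced in the text, and `chi_square_association` is an assumed helper name.

```python
def chi_square_association(observed):
    """Return (chi2, df) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count under independence of rows and columns
            e = row_totals[i] * col_totals[j] / total
            chi2 += (o - e) ** 2 / e
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df


# Hypothetical 2 x 2 table of counts (not the data from Example 9)
hypothetical_counts = [[10, 20], [20, 10]]
chi2_cal, df = chi_square_association(hypothetical_counts)
print(f"chi-square = {chi2_cal:.4f}, df = {df}")
```

For a table with 2 rows and 5 columns, as in the gender and course example, the same function would return df = 4.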
Contd...
Step 1: State null and the alternate hypothesis
H0: π1 = π2
H1: π1 ≠ π2
(proportions different)
$\chi^2_{cal} = \sum \dfrac{(O - E)^2}{E} = 4.4373$
Critical test Statistic Method - If test statistic > critical test statistic then we
would reject the null hypothesis H0 & accept the alternative hypothesis H1
From Excel, Two tail critical value = 3.84
As 4.43 > 3.84 , Accept H1
Step 5:Interpretation - Conclude that there is a significant difference in the proportions
travelling by train on Monday and Wednesday. Note: if we have a 1%
significance level then the decision would be reversed
Contd...
In general the 2*2 contingency table can be structured as shown above
Drop CO2 Before proportion (π1) and Drop CO2 After proportion (π2) are given by the
equations
$\pi_1 = \dfrac{a + b}{N}$
$\pi_2 = \dfrac{a + c}{N}$
$df = (r - 1)(c - 1) = 1$
Two tests are available:
To test the null hypothesis we can use the McNemar Z-test statistic, which is
approximately normally distributed and defined by the equation
$Z = \dfrac{b - c}{\sqrt{b + c}}$
To test the null hypothesis we can use the McNemar χ²-test statistic defined by the
equation
$\chi^2 = \dfrac{(b - c)^2}{b + c}$
Contd...
Step 1: State null and the alternate hypothesis
H0: π1 = π2
H1: π1 ≠ π2
(proportions different)
$Z_{cal} = \dfrac{b - c}{\sqrt{b + c}} = \dfrac{89 - 45}{\sqrt{89 + 45}} = 3.801$
P-value Method - If p < α, then reject null hypothesis H0 & accept the alternative
hypothesis H1
Proportions, π1 = 0.60 and π2 = 0.53
From Excel, the two tail p-value = 0.00014. As 0.00014 < 0.05, Accept H1
Critical test Statistic Method - If test statistic > critical test statistic then we
would reject the null hypothesis H0 & accept the alternative hypothesis H1
From Excel, Two tail critical value = 1.96
As 3.801 > 1.96, Accept H1
Step 5: Interpretation - There is a significant difference in the voting intentions for Drop
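The McNemar statistics depend only on the two discordant cell counts, b and c. The sketch below recomputes the Z value and two-tail p-value for b = 89 and c = 45 as quoted in the worked example; the function name `mcnemar` is hypothetical, and the chi-square form is included to show that it equals Z squared.

```python
from math import sqrt
from statistics import NormalDist


def mcnemar(b, c):
    """Return (z, chi2, two_tail_p) from the discordant counts b and c."""
    z = (b - c) / sqrt(b + c)
    chi2 = (b - c) ** 2 / (b + c)  # equals z squared, with 1 degree of freedom
    p_two_tail = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, chi2, p_two_tail


z_cal, chi2_cal, p_value = mcnemar(89, 45)
print(f"Z = {z_cal:.3f}, chi-square = {chi2_cal:.3f}, p = {p_value:.5f}")
```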
4.
$\chi^2_{cal} = \sum \dfrac{(O - E)^2}{E}$
$df = n - k - 1$
In this section we will explore concept of measuring how well a data set can be
modelled by a particular probability distribution using the method of goodness-of-fit
test
For a chi-square goodness of fit test, the hypotheses takes the form:
Contd...
$P(X = r) = \dfrac{\lambda^r e^{-\lambda}}{r!}$
A motorway safety officer who believes that the number of accidents per
week occurring on a stretch of motorway can be modelled using a
Poisson distribution.
If X denotes the number of accidents per week then the sample data can be
modelled by fitting a Poisson distribution to the sample data.
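Fitting a Poisson distribution means turning the Poisson probabilities into expected frequencies that can be compared with the observed counts. The sketch below shows that step only. The mean of 2 accidents per week and the 100 weeks of observation are hypothetical values, since the accident data are not reproduced in the text, and both function names are assumptions.

```python
from math import exp, factorial


def poisson_pmf(r, lam):
    """P(X = r) for a Poisson distribution with mean lam."""
    return lam ** r * exp(-lam) / factorial(r)


def expected_frequencies(lam, max_r, total_weeks):
    """Expected number of weeks with 0, 1, ..., max_r accidents."""
    return [total_weeks * poisson_pmf(r, lam) for r in range(max_r + 1)]


lam = 2.0    # hypothetical mean number of accidents per week
weeks = 100  # hypothetical number of weeks observed
expected = expected_frequencies(lam, 5, weeks)
for r, e in enumerate(expected):
    print(f"r = {r}: expected weeks = {e:.2f}")
```

These expected frequencies would play the role of E in the chi-square goodness-of-fit statistic.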
Contd...
Step 1: State null and the alternate hypothesis
H0: No. of accidents follow a Poisson distribution
H1: No. of accidents do not follow a Poisson distribution
Step 2: Select the level of significance - α = 0.05
Step 3: Select the test statistic - Chi-square distribution/ Goodness of fit test
Step 4: Formulate the decision rule
P-value Method - If p < α, then reject null hypothesis H0 & accept the alternative
hypothesis H1
From Excel, the p-value = 0.73
As 0.73 > 0.05, Accept H0
Critical test Statistic Method - If test statistic > critical test statistic then we would
Non-Parametric Tests
Many statistical tests require that data follows normal distribution
Distribution free tests/ Non-parametric tests - Do not require the data to
follow a particular distribution
In this presentation we will explore three non-parametric tests
Sign test
Wilcoxon signed rank test
Mann Whitney U test
Test                  Parametric test
One sample            One sample z-test
                      One sample t-test
Paired samples        Two paired sample Z-test
                      Two paired sample t-test
Independent samples   Two independent sample t-test
5.
The sign test is used to test a set of data values against a perceived
hypothesis statement, including:
1. Assessing the validity of a population median value assessed from
collected sample data. This replaces the one-sample t-test, which assumes a
normal population and that a mean value has meaning.
2. Assessing the validity that the difference between two population
medians is zero based upon sample data. This replaces the paired t-test,
which assumes a normal population and that a mean value has meaning.
3. Assessing the validity of proportions where the proportions are
estimated from ordered nominal (or categorical) data where a numerical
scale is inappropriate but where we can rank the data observations.
This replaces the sample Z test for proportions, which assumes a normal
population.
Contd...
If we rank the data then the null hypothesis would result in half the ranks
to be less than the median (r1) and half the ranks would be greater than
the median (r2)
In this situation the null hypothesis can be modelled by a binomial
distribution with the probability of a data value being less than or greater
than the median being equal to p = 0.5, with sample size n
The sign test assumptions are
Randomly selected samples and
Continuous distribution
Sign test measures the number of counts that fall above and below the
median value
Under the null hypothesis, we would expect the number of counts
distribution to be approximately symmetric around the median and the
distribution of values below and above to be distributed at random among
the ranks
Contd...
The corresponding hypothesis statements for two tail and one tail tests
are:
Two tail test
H0: sample median = population median (0.5)
H1: sample median ≠ population median (0.5)
For a binomial distribution the value of the probability P(X = r), mean (μ) and
standard deviation (σ) are given by:
$P(X = r) = \binom{n}{r} p^r q^{n-r} = \dfrac{n!}{r!\,(n - r)!}\, p^r q^{n-r}$
$\mu = np$
$\sigma = \sqrt{npq} = \sqrt{np(1 - p)}$
Contd...
Contd...
Step 1: State null and the alternate hypothesis
H0: The median sales difference is zero
H1: Median sales after training > Median sales before training
Step 2: Select the level of significance - α = 0.05
Step 3: Select the test statistic Sign test
Step 4: Formulate the decision rule
Calculate binomial probabilities, P(X ≥ x)
P-value Method - If p < α, then reject null hypothesis H0 & accept the alternative
hypothesis H1
p = P(X ≥ 12) = P(X = 12, 13, 14, 15, 16)
= P(X = 12) + P(X = 13) + P(X = 14) + P(X = 15) + P(X = 16)
From Excel, μ = np = 8 and σ = sqrt(npq) = 2
With the continuity correction, $Z_{cal} = \dfrac{(X - 0.5) - \mu}{\sigma} = \dfrac{11.5 - 8}{2} = 1.75$
From Excel, upper one tail p-value = 0.0401
As 0.0401 < 0.05, Reject H0
Critical test Statistic Method - If test statistic > critical test statistic then we would
reject the null hypothesis H0 & accept the alternative hypothesis H1
From Excel, upper one tail critical value = 1.6449
As 1.75 > 1.6449, Reject H0
Step 5: Interpretation - Conclude that median sales after training are significantly
greater than median sales before training at the 5% level of significance.
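The sign test p-value can be obtained either exactly from the binomial distribution or from the normal approximation with a continuity correction. The sketch below compares the two, assuming n = 16 non-zero differences (implied by np = 8 with p = 0.5) and 12 positive signs; `sign_test_upper` is a hypothetical name.

```python
from math import comb, sqrt
from statistics import NormalDist


def sign_test_upper(n, x, p=0.5):
    """Upper-tail sign test: exact binomial p-value and normal approximation."""
    # Exact P(X >= x) under the binomial null distribution
    exact_p = sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(x, n + 1))
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    z = (x - 0.5 - mu) / sigma  # continuity correction
    approx_p = 1 - NormalDist().cdf(z)
    return exact_p, z, approx_p


exact_p, z_cal, approx_p = sign_test_upper(n=16, x=12)
print(f"exact p = {exact_p:.4f}, Z = {z_cal:.2f}, approximate p = {approx_p:.4f}")
```

Both p-values fall below 0.05, so the exact and approximate routes lead to the same decision here.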
6.
The t-test is the standard test for testing the difference between population
means for two paired samples that are equal
If the populations are non-normal, particularly for small samples, then the t-test may not be valid
As for the sign test, the Wilcoxon signed rank sum test is another example of
a non-parametric/ distribution free test, used to test the null hypothesis that
the median of a distribution is equal to some value
It can be used in place of
1) One-sample t-test
2) Paired t-test
3) Ordered categorical data where a numerical scale is inappropriate but
where it is possible to rank the observations
The method considers the differences between n matched pairs as one
sample
If the two population distributions are identical, then we can show that the
sample statistic has a symmetric null distribution
Contd...
Assumptions
The Wilcoxon signed rank sum test assumptions are:
1)Each matched data pair is randomly distributed
2)The matched pair differences should be symmetrically distributed
Although the Wilcoxon test assumes neither normality nor homogeneity
of variance, it does assume that the two samples are from populations with
the same distribution shape
It is also vulnerable to outliers although not to nearly the same extent as
the t-test
Contd...
Suppose that Slim-Gym is offering a weight reduction program that they advertise
will result in more than a 10 lb weight loss in the first 30 days. Twenty subjects were
selected for a study and their weights before and after the weight loss program
were recorded.
Contd...
Test              Hypothesis                                               Tcal
Two-tailed Test   H1: Population locations not centred at 0                Tcal = Minimum of T- and T+
One-tailed Test   H1: Population differences are centred at a value > 0    Tcal = T+
One-tailed Test   H1: Population differences are centred at a value < 0    Tcal = T-
1.
Rank data
2.
3.
Find Tcal
4.
5.
$Z_{cal} = \dfrac{T_{cal} - \mu_T - 0.5}{\sigma_T} = 3.2714$
Contd...
Step 1: State null and the alternate hypothesis
H0: The population median weight loss is at most 10 lbs (X - Y ≤ 10)
H1: The population median weight loss is greater than 10 lbs (X - Y - 10 > 0)
Step 2: Select the level of significance - α = 0.05
Step 3: Select the test statistic - Wilcoxon signed rank test (samples consist of ratio
data & no information about the form of the distribution)
Step 4: Formulate the decision rule
P-value Method - If p < α, then reject null hypothesis H0 & accept the alternative
hypothesis H1
From Excel, upper one tail p-value = 0.000535
As 0.000535 < 0.05, Reject H0
Critical test Statistic Method - If test statistic > critical test statistic then we would
reject the null hypothesis H0 & accept the alternative hypothesis H1
From Excel, upper one tail critical value = 1.6449
As 3.2714 > 1.6449, Reject H0
Step 5: Interpretation - Conclude that the median weight loss is significantly greater
than 10 lbs at the 5% level of significance.
Contd...
Small number of paired observations (n ≤ 20)
t
48
7.
Contd...
Contd...
Contd...
$|Z_{cal}| = 2.1410$
$\sigma_U = 8.6410$
Paired Comparisons
Small number of pairwise observations (n ≤ 20)
For a small number of paired comparisons (n1, n2 ≤ 20) we use tables to
calculate an exact value of the critical test value (Ucri) or an exact p-value based
upon P(U ≤ 9). For a 5% two tail test with n1 = 7, n2 = 8: (i) the lower critical U
value, Ucri = 11. Since Ucal < Ucri (9 < 11), we reject H0 and accept H1, and (ii) the
lower p-value = 0.014.
Given that we have a two tailed test,
the two tail p-value = 2*0.014 =
0.028 < 0.05, so we reject H0
and accept H1. The theory suggests
that if the null hypothesis is true then
the U test statistic will be centred at
U = 28 with critical regions identified
in Figure 8.11.
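For comparison with the table-based result, the normal approximation to the Mann-Whitney U statistic can be sketched as follows. It uses the standard results that U has mean n1*n2/2 and standard deviation sqrt(n1*n2*(n1+n2+1)/12) when there are no ties, with n1 = 7, n2 = 8 and Ucal = 9 from the example. The continuity correction of 0.5 and the name `mann_whitney_z` are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def mann_whitney_z(u, n1, n2):
    """Normal approximation for a lower-tail Mann-Whitney U statistic (no ties)."""
    mu_u = n1 * n2 / 2
    sigma_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu_u + 0.5) / sigma_u  # continuity correction towards the mean
    p_lower = NormalDist().cdf(z)
    return mu_u, sigma_u, z, p_lower


mu_u, sigma_u, z_cal, p_lower = mann_whitney_z(u=9, n1=7, n2=8)
print(f"mean = {mu_u}, sd = {sigma_u:.4f}, Z = {z_cal:.4f}, lower p = {p_lower:.4f}")
```

With samples this small the exact tables remain the preferred route; the approximation is shown only to illustrate where the centre U = 28 comes from.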
Tied Observations
Dealing with ties
If we find data with the same number value then we can deal with this problem
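by allocating the average tie value to each shared data value. In this situation
we would then have to use the normal approximation with the standard deviation
σU adjustment given by Equation (8.23):
$\sigma_U = \sqrt{\dfrac{n_1 n_2}{(n_1 + n_2)(n_1 + n_2 - 1)}\left[\dfrac{n^3 - n}{12} - \sum_{j=1}^{g}\dfrac{t_j^3 - t_j}{12}\right]}$
where n = n1 + n2, g is the number of groups of tied values and t_j is the number of
tied values in group j.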
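The tie adjustment reduces the standard deviation of U slightly, and with no ties it collapses to the usual untied value. The sketch below illustrates both points. The tie group sizes used in the second call are hypothetical, and `sigma_u_with_ties` is an assumed helper name.

```python
from math import sqrt


def sigma_u_with_ties(n1, n2, tie_sizes):
    """Standard deviation of U adjusted for ties.

    tie_sizes lists the number of tied observations in each group of ties.
    """
    n = n1 + n2
    tie_term = sum((t ** 3 - t) / 12 for t in tie_sizes)
    return sqrt(n1 * n2 / (n * (n - 1)) * ((n ** 3 - n) / 12 - tie_term))


sigma_no_ties = sigma_u_with_ties(7, 8, [])         # matches the untied formula
sigma_with_ties = sigma_u_with_ties(7, 8, [2, 3])   # hypothetical tie groups
print(f"no ties: {sigma_no_ties:.4f}, with ties: {sigma_with_ties:.4f}")
```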
Conclusion
In this presentation we explored the concept of chi squared and nonparametric hypothesis testing: