
IISWBM

C20 Data Analytics

ANINDITA MANDAL
BARNALI CHAUDHURI
JAYANTA KUMAR DAS
CHAPTER 01
INTRODUCTION TO DATA SCIENCE
The population is the set of entities under study. It is a collection of people, items, or events
about which you want to make inferences.

Sample: A sample is a subset of people, items, or events from a larger population that you collect
and analyze in order to make inferences about that population. To represent the population well, a
sample should be randomly collected and adequately large.

Importance of a Sample
It is typically impossible to survey or measure every member of a population, because not all
members are observable. Even when studying the entire population is possible, it is often costly
and takes a great deal of time. We therefore use a sample to draw inferences about the population
under study, given some conditions.

CHAPTER 02
POINT & INTERVAL ESTIMATION OF NUMERIC DATA

Statistic: A statistic is a measure of some characteristic of the data in a sample (e.g. the mean
height of the men in the sample).
Parameter: A parameter is a statistical constant that describes a feature of a population
(e.g. the mean height of all men in the population).

Mean (Arithmetic): The mean (or average) is the most popular and best-known measure of central tendency.

Best measure of central tendency by type of variable:

  Type of Variable               Best measure of central tendency
  Nominal                        Mode
  Ordinal                        Median
  Interval/Ratio (not skewed)    Mean
  Interval/Ratio (skewed)        Median

Advantages:
An important property of the mean is that it includes every value in the data set as part of the calculation.
In addition, the mean is the only measure of central tendency for which the sum of the deviations of each value from the
mean is always zero.
It can be used with both discrete and continuous data, although it is most often used with continuous data.

Disadvantage:
The mean has one main disadvantage: it is particularly susceptible to the influence of outliers.

Measure:
The mean is equal to the sum of all the values in the data set divided by the number of values in the data set. So, if a
data set has n values x1, x2, ..., xn, the sample mean, usually denoted by x-bar, is x-bar = (x1 + x2 + ... + xn) / n.

Median: The median is the middle score of a set of data that has been arranged in order of magnitude. The median is less
affected by outliers and skewed data.

Mode: The mode is the most frequent score in the data set. On a histogram it corresponds to the highest bar.

Standard Deviation: The standard deviation is a measure of the spread of scores within a set of data. We can estimate the
population standard deviation from a sample standard deviation. These two standard deviations - sample and population
standard deviations - are calculated differently.

The sample standard deviation formula is:
  s = sqrt( Σ (X - x-bar)² / (n - 1) )
where
  s = sample standard deviation
  X = a score
  x-bar = sample mean
  n = number of scores in the sample

The population standard deviation formula is:
  σ = sqrt( Σ (X - μ)² / N )
where
  μ = population mean
  N = number of scores in the population

Variance: Variance measures the variability (volatility) from an average or mean; since volatility is a measure of risk,
the variance statistic can help determine risk.
Importance of variance:
Use variance to see how individual numbers relate to each other within a data set. A drawback of variance is that it gives
added weight to numbers far from the mean (outliers).

The formula for the variance of a population is:
  σ² = Σ (X - μ)² / N

The formula to estimate the variance from a sample is:
  s² = Σ (X - x-bar)² / (n - 1)
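
The sample and population formulas above differ only in the divisor (n - 1 versus N). A minimal Python sketch contrasting
the two, using only the standard library's statistics module; the data values below are hypothetical:

    import statistics

    scores = [12, 8, 16, 19, 16, 10, 22, 20]    # hypothetical data set

    mean = statistics.mean(scores)              # sum of values / number of values
    median = statistics.median(scores)          # middle score of the ordered data
    mode = statistics.mode(scores)              # most frequent score

    s = statistics.stdev(scores)                # sample standard deviation (divides by n - 1)
    s2 = statistics.variance(scores)            # sample variance
    sigma = statistics.pstdev(scores)           # population standard deviation (divides by N)
    sigma2 = statistics.pvariance(scores)       # population variance

    print(mean, median, mode, s, sigma)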

SAMPLING VARIABILITY

The sampling variability of a statistic refers to how much the statistic varies from sample to sample and is usually
measured by its standard error; the smaller the standard error, the less the sampling variability. For example, the
standard error of the mean is a measure of the sampling variability of the mean. Recall that the formula for the standard
error of the mean is SEM = σ / √n.

Importance of Standard Error
The standard error is an estimate of the standard deviation of a statistic. The standard error is important because it is
used to compute other measures, like confidence intervals and margins of error.

Problem 1
Which of the following statements is true?
I. The standard error is computed solely from sample attributes.
II. The standard deviation is computed solely from sample attributes.
III. The standard error is a measure of central tendency.
(A) I only
(B) II only
(C) III only
(D) All of the above.
(E) None of the above.
Ans:
The correct answer is (E). The standard error can be computed from a knowledge of sample attributes - sample size and
sample statistics. The standard deviation cannot be computed solely from sample attributes; it requires a knowledge of one
or more population parameters. The standard error is a measure of variability, not a measure of central tendency.
Q. A teacher sets an exam for their pupils. The teacher wants to summarize the results the pupils attained as a mean and
standard deviation. Which standard deviation should be used?
A. Population standard deviation.

Q. A researcher has recruited males aged 45 to 65 years old for an exercise training study to investigate risk markers for
heart disease (e.g., cholesterol). Which standard deviation would most likely be used?
A. Sample standard deviation.

Q. One of the questions on a national census survey asks for respondents' age. Which standard deviation would be used to
describe the variation in all ages received from the census?
A. Population standard deviation.

What is a Z-Score?
A z-score is a measure of how many standard deviations below or above the population mean a raw score is. A z-score is
also known as a standard score and it can be placed on a normal distribution curve. For example, a z-score can tell you
where a person's weight stands compared to the population's mean weight.

The Z Score Formula: One Sample
The basic z score formula for a single observation is:
  z = (x - μ) / σ
Example: suppose we have a test score of ___, and the test has a mean of ___ and a standard deviation of ___. Assuming a
normal distribution, calculate the z score. The z score tells you how many standard deviations from the mean your score is.

Z Score Formula: Standard Error of the Mean
For a sample mean, we describe its distance from the population mean in units of the standard error using the z score
formula:
  z = (x-bar - μ) / (σ / √n)
This z-score tells you how many standard errors there are between the sample mean and the population mean.

Sample problem: Suppose the mean height of women is ___ with a standard deviation of ___. What is the probability of
finding a random sample of ___ women with a mean height of ___, assuming the heights are normally distributed?
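
A short Python sketch of both z-score formulas; the slide's original numbers were lost in extraction, so the values below
are hypothetical placeholders, and SciPy's normal CDF is used to turn the sample-mean z score into a probability:

    from math import sqrt
    from scipy.stats import norm

    # Hypothetical values; the original slide's numbers are not recoverable.
    mu, sigma = 65.0, 3.5        # population mean and standard deviation
    x = 70.0                     # one raw score
    n, xbar = 50, 65.9           # sample size and observed sample mean

    z_single = (x - mu) / sigma                  # z = (x - mu) / sigma
    z_mean = (xbar - mu) / (sigma / sqrt(n))     # z = (x-bar - mu) / SEM

    # Probability of a sample mean at least this far above mu, assuming normality.
    p_upper = 1 - norm.cdf(z_mean)
    print(round(z_single, 2), round(z_mean, 2), round(p_upper, 4))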

Population proportion
A population proportion is the part of a population with a particular attribute, expressed as a fraction, decimal, or
percentage of the whole population.
Formula: The population proportion is the number of members in the population with a particular attribute divided by the
total number of members in the population.

Statistical inference
Statistical inference is the process of inferring from the sample to a population with a calculated degree of certainty.
The two common forms of statistical inference are:
  Estimation
  Null hypothesis tests of significance (NHTS)

Estimation in Statistics
Estimation refers to the process by which one makes inferences about a population, based on information obtained from a
sample.

Examples of parameters include:
  p: the population proportion
  μ: the population mean
  σ: the population standard deviation

Example: a. If 6 out of 40 students plan to go to graduate school, the proportion of all students who plan to go to
graduate school is estimated as ________. The standard error of this estimate is ________.
b. If 54 out of 360 students plan to go to graduate school, the proportion of all students who plan to go to graduate
school is estimated as ________. The standard error of this estimate is ________.
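
The standard error of a sample proportion, sqrt(p-hat × q-hat / n), is introduced later in this chapter; a minimal Python
sketch using it to fill in the blanks above:

    from math import sqrt

    def proportion_estimate(successes, n):
        """Return the point estimate p-hat and its standard error sqrt(p-hat * q-hat / n)."""
        p_hat = successes / n
        se = sqrt(p_hat * (1 - p_hat) / n)
        return p_hat, se

    # a. 6 out of 40 students plan to go to graduate school.
    print(proportion_estimate(6, 40))     # p-hat = 0.15, SE is about 0.056
    # b. 54 out of 360 students plan to go to graduate school.
    print(proportion_estimate(54, 360))   # p-hat = 0.15, SE is about 0.019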
Statistical estimation: a random sample is drawn so that every member of the population has the same chance of being
selected in the sample, and statistics computed from the sample are used to estimate the population parameters.

Point estimates are single points that are used to infer parameters directly. For example,
  Sample proportion p-hat is the point estimator of p
  Sample mean x-bar is the point estimator of μ
  Sample standard deviation s is the point estimator of σ

Point estimates and parameters represent fundamentally different things.
  Point estimates are calculated from the data; parameters are not.
  Point estimates vary from study to study; parameters do not.
  Point estimates are random variables; parameters are constants.

Point Estimate vs. Interval Estimate

An estimate of a population parameter may be expressed in two ways:
  Point estimate: A point estimate of a population parameter is a single value of a statistic, e.g. the sample mean or the
  sample proportion.
  Interval estimate: An interval estimate is defined by two numbers, between which the population parameter is said to
  lie, e.g. a confidence interval for the mean or for a proportion. The point estimate always lies within the interval
  estimate.

Confidence Intervals
Statisticians use a confidence interval to express the accuracy and uncertainty associated with a particular sampling
method. A confidence interval consists of three parts:
  A confidence level.
  A statistic.
  A margin of error.
The confidence level describes the uncertainty of a sampling method; the probability part of a confidence interval is
called the confidence level. The statistic and the margin of error define an interval estimate that describes the accuracy
of the method. The interval estimate of a confidence interval is given by the sample statistic ± margin of error.

Interpreting a 95% confidence interval: if we used the same sampling method to select different samples and computed an
interval estimate from each, the true population parameter would fall within the range defined by the sample statistic ±
margin of error 95% of the time.
Confidence intervals are preferred to point estimates because confidence intervals indicate (a) the accuracy of the
estimate and (b) the uncertainty of the estimate.

Margin of Error
In a confidence interval, the range of values above and below the sample statistic is called the margin of error.
For example, suppose an election survey reports that the independent candidate will receive 30% of the vote, and that the
survey had a 5% margin of error and a confidence level of 95%. These findings result in the following confidence interval:
we are 95% confident that the independent candidate will receive between 25% and 35% of the vote.
Estimating μ with the help of the sampling distribution of the mean (SDM), with σ known

The SDM indicates that:
  x-bar is an unbiased estimate of μ;
  the SDM tends to be normal when the population is normal or when the sample is large;
  the standard deviation of the SDM is equal to σ/√n. This is called the standard error of the mean (SEM) and reflects the
  accuracy of x-bar as an estimate of μ:

  SEM = σ / √n

An estimate is reported as: estimate - margin of error, estimate, estimate + margin of error.

Exercise: Suppose a measurement has σ = 10. Calculate:
  o the SEM for a sample of n = 1;
  o the SEM for a sample of n = 4;
  o the SEM for a sample of n = 16.

Confidence Interval for μ with σ known
Let (1 - α)100% represent the confidence level of a confidence interval; α represents the lack of confidence.
A (1 - α)100% CI for μ is given by:

  x-bar ± (z(1-α/2))(SEM)

The reason we use z(1-α/2) instead of z(1-α) in this formula is that the random error is split between underestimates
(left tail of the SDM) and overestimates (right tail of the SDM). The confidence level 1 - α area lies between -z(1-α/2)
and +z(1-α/2).

The common levels of confidence and their associated alpha levels and z quantiles:

  (1-α)100%    α      z(1-α/2)
  90%          .10    1.64
  95%          .05    1.96
  99%          .01    2.58

Exercises:
i) Find the 90% CI for μ for a sample of size n = 10 with SEM = 4.30 and x-bar = 29.0 (z value = 1.64). What is the margin
of error?
ii) Find the 95% CI for μ and the margin of error with the same x-bar and SEM.
iii) Find the 99% CI for μ and the margin of error with the same x-bar and SEM.
For each confidence level, also report the confidence interval length.

Exercise: Suppose a population has σ = 15 and unknown mean μ. A random sample of 10 observations from this population
gives the following values: {21, 42, 5, 11, 30, 50, 28, 27, 24, 52}. Based on these 10 observations, find x-bar, the SEM,
and a 95% CI for μ.
Sample Size Requirements for estimating μ
Let m represent the margin of error and n the required sample size. Since m = (z(1-α/2))(σ/√n), solving for n gives:

  n = (z(1-α/2))² σ² / m²

Exercise: Given a standard deviation σ = 15, we want to estimate μ with 95% confidence.
i) What sample size is required to achieve a margin of error of 5?
ii) What sample size is required to achieve a margin of error of 2.5?

Problem 1
Which of the following statements is true?
I. When the margin of error is small, the confidence level is high.
II. When the margin of error is small, the confidence level is low.
III. A confidence interval is a type of point estimate.
IV. A population mean is an example of a point estimate.
(A) I only
(B) II only
(C) III only
(D) IV only
(E) None of the above.

Solution: (E) None of the above. The margin of error depends on both the confidence level and the sample size, so
statements I and II are not necessarily true; a confidence interval is an interval estimate, not a point estimate; and a
population mean is a parameter, not a point estimate.
Estimating p with the sampling distribution of the proportion
The sample proportion is p-hat = (number of successes in the sample) / n.
In large samples, the sampling distribution of p-hat is approximately normal with a mean of p and a standard error of the
proportion SEP = sqrt(pq / n), where q = 1 - p.

Confidence interval for p
This approach should be used only in samples that are large. If n × p-hat × q-hat ≥ 5, then proceed with this method (the
"npq rule").
An approximate (1 - α)100% CI for p is given by:
  p-hat ± (z(1-α/2))(SEP), where the estimated SEP = sqrt(p-hat × q-hat / n).

Example 1: A sample of 57 individuals reveals 17 smokers. Use the npq rule to determine the suitability of the method,
then estimate the 95% CI for p.
Example 2: Out of 2673 people surveyed, 170 have risk factor X. We want to determine the population prevalence of the risk
factor with 95% confidence.

Sample size requirement for estimating p
To achieve margin of error m, the required sample size is:
  n = (z(1-α/2))² p* q* / m²
where p* represents an educated guess for the proportion and q* = 1 - p*. When no reasonable guess of p is available, use
p* = 0.50.

Example 1: We want to sample a population and compute a 95% confidence interval for the prevalence of smoking. How large a
sample is needed to achieve a margin of error of 0.05 if we assume the prevalence of smoking is roughly 30%?
Example 2: How large a sample is needed to shrink the margin of error to 0.03?
5: Introduction to Estimation

Contents
  Acronyms and symbols
  Statistical inference
  Estimating μ with confidence
    Sampling distribution of the mean
    Confidence interval for μ when σ is known beforehand
    Sample size requirements for estimating μ with confidence
  Estimating p with confidence
    Sampling distribution of the proportion
    Confidence interval for p
    Sample size requirement for estimating p with confidence

Acronyms and symbols

q-hat   complement of the sample proportion (q-hat = 1 - p-hat)
x-bar   sample mean
p-hat   sample proportion
1 - α   confidence level
CI      confidence interval
LCL     lower confidence limit
m       margin of error
n       sample size
NHTS    null hypothesis test of significance
p       binomial success parameter (population proportion)
s       sample standard deviation
SDM     sampling distribution of mean (hypothetical probability model)
SEM     standard error of the mean
SEP     standard error of the proportion
UCL     upper confidence limit
α       alpha level
μ       expected value (population mean)
σ       standard deviation parameter



Statistical inference
Statistical inference is the act of generalizing from the data (sample) to a larger
phenomenon (population) with calculated degree of certainty. The act of generalizing
and deriving statistical judgments is the process of inference. [Note: There is a distinction
between causal inference and statistical inference. Here we consider only statistical
inference.]

The two common forms of statistical inference are:

Estimation
Null hypothesis tests of significance (NHTS)

There are two forms of estimation:

Point estimation (maximally likely value for parameter)


Interval estimation (also called confidence interval for parameter)

This chapter introduces estimation. The following chapter introduces NHTS.

Both estimation and NHTS are used to infer parameters. A parameter is a statistical
constant that describes a feature of a phenomenon, population, pmf, or pdf.

Examples of parameters include:

Binomial probability of success p (also called the population proportion)
Expected value μ (also called the population mean)
Standard deviation σ (also called the population standard deviation)

Point estimates are single points that are used to infer parameters directly. For example,

Sample proportion p-hat is the point estimator of p
Sample mean x-bar is the point estimator of μ
Sample standard deviation s is the point estimator of σ

Notice the use of different symbols to distinguish estimators and parameters. More
importantly, point estimates and parameters represent fundamentally different things.

Point estimates are calculated from the data; parameters are not.
Point estimates vary from study to study; parameters do not.
Point estimates are random variables: parameters are constants.



Estimating μ with confidence
Sampling distribution of the mean
Although the point estimate x-bar is a valuable reflection of the parameter μ, it provides no
information about the precision of the estimate. We ask: How precise is x-bar as an estimate of
μ? How much can we expect any given x-bar to vary from μ?

Understanding the variability of x-bar as the point estimate of μ starts by considering a hypothetical
distribution called the sampling distribution of a mean (SDM for short). Understanding
the SDM is difficult because it is based on a thought experiment that doesn't occur in
actuality, being a hypothetical distribution based on mathematical laws and probabilities.
The SDM imagines what would happen if we took repeated samples of the same size
from the same (or similar) populations under identical conditions. From this
hypothetical experiment we build a pmf or pdf that is used to determine probabilities
for various hypothetical outcomes.

Without going into too much detail, the SDM reveals that:

x-bar is an unbiased estimate of μ;
the SDM tends to be normal (Gaussian) when the population is normal or when
the sample is adequately large;
the standard deviation of the SDM is equal to σ/√n. This statistic, which is
called the standard error of the mean (SEM), predicts how closely the x-bars in
the SDM are likely to cluster around the value of μ and is a reflection of the
precision of x-bar as an estimate of μ:

SEM = σ / √n

Note that this formula is based on σ and not on the sample standard deviation s.
Recall that σ is NOT calculated from the data and is derived from an external
source. Also note that the SEM is inversely proportional to the square root of n.

Numerical example. Suppose a measurement has σ = 10.

o A sample of n = 1 for this variable derives SEM = σ/√n = 10 / √1 = 10
o A sample of n = 4 derives SEM = σ/√n = 10 / √4 = 5
o A sample of n = 16 derives SEM = σ/√n = 10 / √16 = 2.5

Each time we quadruple n, the SEM is cut in half. This is called the square root law
the precision of the mean is inversely proportional to the square root of the sample size.



Confidence Interval for μ when σ is known beforehand

To gain further insight into μ, we surround the point estimate with a margin of error:

This forms a confidence interval (CI). The lower end of the confidence interval is the
lower confidence limit (LCL). The upper end is the upper confidence limit (UCL).

Note: The margin of error is the plus-or-minus wiggle-room drawn around the point
estimate; it is equal to half the confidence interval length.

Let (1 - α)100% represent the confidence level of a confidence interval. The α (alpha)
level represents the lack of confidence and is the chance the researcher is willing to
take of not capturing the value of the parameter.

A (1 - α)100% CI for μ is given by:

x-bar ± (z(1-α/2))(SEM)

The z(1-α/2) in this formula is the z quantile associated with a 1 - α level of confidence. The
reason we use z(1-α/2) instead of z(1-α) in this formula is because the random error
(imprecision) is split between underestimates (left tail of the SDM) and overestimates
(right tail of the SDM). The confidence level 1 - α area lies between -z(1-α/2) and +z(1-α/2).



You may use the z/t table on the StatPrimer website to determine z quantiles for various
levels of confidence. Here are the common levels of confidence and their associated
alpha levels and z quantiles:

(1-α)100%   α      z(1-α/2)
90% .10 1.64
95% .05 1.96
99% .01 2.58

Numerical example, 90% CI for μ. Suppose we have a sample of n = 10 with SEM = 4.30
and x-bar = 29.0. The z quantile for 90% confidence is z(1-.10/2) = z(.95) = 1.64, and the 90% CI for μ =
29.0 ± (1.64)(4.30) = 29.0 ± 7.1 = (21.9, 36.1). We use this inference to address the population mean μ
and NOT the sample mean x-bar. Note that the margin of error for this estimate is 7.1.

Numerical example, 95% CI for μ. The z quantile for 95% confidence is z(1-.05/2) = z(.975) = 1.96.
The 95% CI for μ = 29.0 ± (1.96)(4.30) = 29.0 ± 8.4 = (20.6, 37.4). Note that the margin of error
for this estimate is 8.4.

Numerical example, 99% CI for μ. Using the same data, α = .01 for 99% confidence and the
99% CI for μ = 29.0 ± (2.58)(4.30) = 29.0 ± 11.1 = (17.9, 40.1). Note that the margin of error for
this estimate is 11.1.

Here are the confidence interval lengths (UCL - LCL) of the three intervals just calculated:

  Confidence Level    Confidence Interval    Confidence Interval Length
  90%                 (21.9, 36.1)           36.1 - 21.9 = 14.2
  95%                 (20.6, 37.4)           37.4 - 20.6 = 16.8
  99%                 (17.9, 40.1)           40.1 - 17.9 = 22.2

The confidence interval length grows as the level of confidence increases from 90% to
95% to 99%. This is because there is a trade-off between the confidence and margin of
error. You can achieve a smaller margin of error if you are willing to pay the price of less
confidence. Therefore, as Dr. Evil might say, 95% is pretty standard.

Numerical example. Suppose a population has σ = 15 (not calculated, but known ahead of time)
and unknown mean μ. We take a random sample of 10 observations from this population and
observe the following values: {21, 42, 5, 11, 30, 50, 28, 27, 24, 52}. Based on these 10
observations, x-bar = 29.0, SEM = 15/√10 = 4.74 and a 95% CI for μ = 29.0 ± (1.96)(4.74) =
29.0 ± 9.3 = (19.7, 38.3).

Interpretation notes:

The margin of error (m) is the plus-or-minus value surrounding the estimate. In this case m = 9.3.
We use this confidence interval to address potential locations of the population mean μ,
NOT the sample mean x-bar.
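
The same interval can be computed programmatically. A brief Python sketch, assuming σ is known and using SciPy to look up
the z quantile rather than the rounded table value:

    from math import sqrt
    from scipy.stats import norm

    def ci_for_mu(xbar, sigma, n, conf=0.95):
        """(1 - alpha)100% CI for mu with sigma known: x-bar +/- z * sigma/sqrt(n)."""
        sem = sigma / sqrt(n)
        z = norm.ppf(1 - (1 - conf) / 2)    # z quantile, e.g. 1.96 for 95%
        m = z * sem                         # margin of error
        return xbar - m, xbar + m, m

    data = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]
    xbar = sum(data) / len(data)            # 29.0
    print(ci_for_mu(xbar, sigma=15, n=len(data)))   # roughly (19.7, 38.3), m about 9.3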



Sample Size Requirements for estimating μ with confidence

One of the questions we often face is: How much data should be collected? Collecting
too much data is a waste of time and money. Also, by collecting fewer data points we can
devote more time and energy to making those measurements accurate. However,
collecting too little data renders our estimate too imprecise to be useful.

To address the question of sample size requirements, let m represent the desired margin
of error of an estimate. This is equivalent to half the ultimate confidence interval length.

Note that the margin of error is m = (z(1-α/2))(σ/√n). Solving this equation for n derives:

  n = (z(1-α/2))² σ² / m²

We always round results from this formula up to the next integer to ensure that we have a
margin of error no greater than m.

Note that determining the sample size requirement for estimating μ with a given level of
confidence requires specification of the z quantile based on the desired level of
confidence (z(1-α/2)), the population standard deviation (σ), and the desired margin of error (m).

Numerical examples. Suppose we have a variable with standard deviation σ = 15 and we
want to estimate μ with 95% confidence.

The sample size required to achieve a margin of error of 5 is
  n = (z(1-α/2))² σ² / m² = (1.96² × 15²) / 5² = 34.6, which we round up to 35.

The sample size required to achieve a margin of error of 2.5 is
  n = (1.96² × 15²) / 2.5² = 138.3, which we round up to 139.

Again, doubling the precision requires quadrupling the sample size.
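
A sketch of the same sample-size calculation in Python, rounding up to the next integer as described above:

    from math import ceil
    from scipy.stats import norm

    def sample_size_for_mu(sigma, m, conf=0.95):
        """n = z^2 * sigma^2 / m^2, rounded up so the margin of error is no greater than m."""
        z = norm.ppf(1 - (1 - conf) / 2)
        return ceil((z * sigma / m) ** 2)

    print(sample_size_for_mu(sigma=15, m=5))     # 35
    print(sample_size_for_mu(sigma=15, m=2.5))   # 139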



Estimating p with confidence
Sampling distribution of the proportion

Estimating parameter p is analogous to estimating parameter μ. However, instead of using x-bar as
an unbiased point estimate of μ, we use p-hat as an unbiased estimate of p.

The symbol p-hat represents the sample proportion:

  p-hat = (number of successes in the sample) / n

For example, if we find 17 smokers in an SRS of 57 individuals, p-hat = 17 / 57 = 0.2982. We ask:
How precise is p-hat as a reflection of p? How much can we expect any given p-hat to vary from p?

In samples that are large, the sampling distribution of p-hat is approximately normal with a mean of
p and a standard error of the proportion SEP = sqrt(pq / n), where q = 1 - p. The SEP quantifies the
precision of the sample proportion as an estimate of parameter p.

Confidence interval for p

This approach should be used only in samples that are large (see note a below). Use this rule to determine if
the sample is large enough: if n × p-hat × q-hat ≥ 5, proceed with this method (call this the npq rule).

An approximate (1 - α)100% CI for p is given by

  p-hat ± (z(1-α/2))(SEP)

where the estimated SEP = sqrt(p-hat × q-hat / n).

Numerical example. An SRS of 57 individuals reveals 17 smokers. Therefore, p-hat = 17 / 57 =
0.2982, q-hat = 1 - 0.2982 = 0.7018, and n × p-hat × q-hat = (.2982)(.7018)(57) = 11.9. Thus, the sample is large
enough to proceed with the above formula. The estimated SEP = sqrt(.2982 × .7018 / 57) = 0.06059 and
the 95% CI for p = .2982 ± (1.96)(.06059) = .2982 ± .1188 = (.1794, .4170). Thus, the population
prevalence is between 18% and 42% with 95% confidence.

a. A more precise formula that can be used in small samples is provided in a future chapter.
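
A Python sketch of the npq rule and the normal-approximation confidence interval for p, applied to the smoker example and
the risk-factor example:

    from math import sqrt
    from scipy.stats import norm

    def ci_for_p(successes, n, conf=0.95):
        """Normal-approximation CI for a proportion: p-hat +/- z * sqrt(p-hat * q-hat / n)."""
        p_hat = successes / n
        q_hat = 1 - p_hat
        if n * p_hat * q_hat < 5:
            raise ValueError("npq rule not satisfied; use an exact small-sample method")
        z = norm.ppf(1 - (1 - conf) / 2)
        sep = sqrt(p_hat * q_hat / n)
        return p_hat - z * sep, p_hat + z * sep

    # 17 smokers in an SRS of 57 individuals: npq = 11.9, so the method applies.
    print(ci_for_p(17, 57))      # roughly (0.18, 0.42)
    # 170 of 2673 people with risk factor X:
    print(ci_for_p(170, 2673))   # roughly (0.054, 0.073)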



Estimation of a proportion (step-by-step summary)

Step 1. Review the research question and identify the parameter. Read the research
question. Verify that we have a single sample that addresses a binomial proportion (p).

Step 2. Point estimate. Calculate the sample proportion (p-hat) as the point estimate of the
parameter.

Step 3. Confidence interval. Determine whether the z (normal approximation) formula
can be used with the npq rule. If so, determine the z percentile for the given level of
confidence (from the table) and the standard error of the proportion SEP = sqrt(p-hat × q-hat / n).
Apply the formula p-hat ± (z(1-α/2))(SEP).

Step 4. Interpret the results. In plain language, report the proportion and the variable it
addresses. Report the confidence interval, being clear about what population is being
addressed. Reported results should be rounded as appropriate for the reader.

Illustration

Of 2673 people surveyed, 170 have risk factor X. We want to determine the population
prevalence of the risk factor with 95% confidence.

Step 1. Prevalence is the proportion of individuals with a binary trait. Therefore we wish to
estimate parameter p.

Step 2. p-hat = 170 / 2673 = .0636 = 6.4%.

Step 3. n × p-hat × q-hat = 2673(.0636)(1 - .0636) = 159, so the z method is OK.

  SEP = sqrt( (.0636)(1 - .0636) / 2673 ) = .00472

  The 95% CI for p = p-hat ± (z(1-α/2))(SEP) = .0636 ± (1.96)(.00472) = .0636 ± .0093 = (.0543, .0729)
  = (5.4%, 7.3%)

Step 4. The prevalence in the sample was 6.4%. The prevalence in the population is between
5.4% and 7.3% with 95% confidence.



Sample size requirement for estimating p with confidence

In planning a study, we want to collect enough data to estimate p with adequate precision.
Earlier in the chapter we determined the sample size requirements to estimate μ with
confidence. We apply a similar method to determine the sample size requirements to
estimate p.

Let m represent the margin of error. This provides the wiggle room around p-hat for our
confidence interval and is equal to half the confidence interval length. To achieve margin
of error m,

  n = (z(1-α/2))² p* q* / m²

where p* represents an educated guess for the proportion and q* = 1 - p*.

When no reasonable guess of p is available, use p* = 0.50 to provide a worst-case


scenario sample size that will provide more than enough data.

Numeric example: We want to sample a population and compute a 95% confidence interval for
the prevalence of smoking. How large a sample is needed to achieve a margin of error of
0.05 if we assume the prevalence of smoking is roughly 30%?

Solution: To achieve a margin of error of 0.05,
  n = (z(1-α/2))² p* q* / m² = (1.96² × 0.30 × 0.70) / 0.05² = 322.7.
Round this up to 323 to ensure adequate precision.

How large a sample is needed to shrink the margin of error to 0.03?

To achieve a margin of error of 0.03,
  n = (1.96² × 0.30 × 0.70) / 0.03² = 896.4, so study 897 individuals.
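
A sketch of the corresponding sample-size calculation for a proportion in Python:

    from math import ceil
    from scipy.stats import norm

    def sample_size_for_p(p_star, m, conf=0.95):
        """n = z^2 * p* * q* / m^2, rounded up; use p* = 0.50 when no guess is available."""
        z = norm.ppf(1 - (1 - conf) / 2)
        return ceil(z ** 2 * p_star * (1 - p_star) / m ** 2)

    print(sample_size_for_p(0.30, 0.05))   # 323
    print(sample_size_for_p(0.30, 0.03))   # 897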




CHAPTER 03
DESIGN OF EXPERIMENTS

Experiments are performed by investigators in virtually all fields of inquiry, usually to discover
something about a particular process or system.

An experiment is a test or a series of tests.

DOE (design of experiments) is a formal mathematical method for systematically planning and
conducting scientific studies that change experimental variables together in order to determine
their effect on a given response.

Applications of Experimental Design

Marketing
  Product Development
  Market Segmentation
  Sales Promotion Program
  Product Launching
  Design of Distribution Channel
  Pricing
  Design of Advertisement
  Buying Behaviour

Production
  Quality Control
  Work Design
  Plant Location
  Inventory Control
  Aggregate Planning
  Project Management
  Materials Requirement Planning
Applications of Experimental Design (continued)

Finance
  Capital Budgeting
  Portfolio Management
  Inflation / Deflation
  Economic Growth Analysis

Personnel
  Manpower Planning
  Performance Appraisal System
  Training Methods
  Inflation / Deflation

Research Questions - examples
  A stock market analyst wants to study the impact of the type of company on the quarterly averages of the earnings per
  share (EPS).
  Four technology alternatives are available to manufacture a product. The R & D manager of a company wants to study
  whether there is some impact of technology on the hourly output (in units) of the product.
  The sales manager of a leading textile showroom at Kolkata wishes to study the opinion of its customers on the service
  quality on a 0-10 scale. In this experiment two factors were considered:
  1. The nature of profession
  2. The monthly income level of the customers

Research Question 1 (one-way ANOVA)

A stock market analyst wants to study the impact of the type of company on the quarterly averages of the earnings per
share (EPS).

Data collected: four quarterly averages of EPS data for four different companies during the last financial year, as
summarized below.

Research hypothesis:
H0: There is no significant difference between companies in terms of quarterly averages of EPS.
H1: There is a significant difference, for at least one pair of companies, in terms of quarterly averages of EPS.

Tool for analysis: ANOVA (one-way). Factor: Company. Treatments/levels: C1, C2, C3, C4. Replications: quarters Q1-Q4.

                   Company
                   C1    C2    C3    C4
  Quarter  Q1      12    16    25    13
           Q2       8    18    15     8
           Q3      16    10    22    20
           Q4      19    11     9     5

Research Question 2 (factor interaction, two-way ANOVA)

The sales manager of a leading textile showroom at Kolkata wishes to study the opinion of its customers on the service
quality on a 0-10 scale. Two factors were considered: (a) the nature of profession, (b) the monthly income level of the
customers.

Research hypotheses:
H0: There is no significant difference between income levels in terms of service quality.
H1: There is a significant difference between income levels in terms of service quality.
H0: There is no significant difference between professions in terms of service quality.
H1: There is a significant difference between professions in terms of service quality.
H0: There is no significant difference between different pairs of interaction terms of income level and profession in
terms of service quality.
H1: There is a significant difference between different pairs of interaction terms of income level and profession in
terms of service quality.

                           Nature of the Profession (P)
  Income Level (I)         Engineer    Doctor    Lawyer    Others
  < Rs 10,000              12          16        25        13
                            8          18        15         8
  >= Rs 10,000             16          10        22        20
                           19          11         9         5
BASIC STEPS IN DOE

Four elements associated with DOE:
1. The design of the experiment,
2. The collection of the data,
3. The statistical analysis of the data, and
4. The conclusions reached and recommendations made as a result of the experiment.

TERMINOLOGY

Replication - repetition of a basic experiment without changing any factor settings.
  It allows the experimenter to estimate the experimental error (noise) in the system.
  It allows the experimenter to obtain more statistical power (the ability to identify small / interaction effects).

TERMINOLOGY

Factors - experimental factors or independent variables (continuous or discrete) that an investigator manipulates to
capture any changes in the output of the process. Factors may be uncontrollable or controllable.

Responses - dependent variables measured to describe the output of the process.

Treatment Combinations (experimental unit / run) - an experimental trial where all factors are set at a specified level.

Fixed Effects Model - if the treatment levels are specifically chosen by the experimenter, then conclusions reached will
only apply to those levels.

Random Effects Model - if the treatment levels are randomly chosen from a population of many possible treatment levels,
then conclusions reached can be extended to all treatment levels in the population.

Design Of Experiments - Definitions:
Effect: The change in a response variable produced by a change in the factor level.
Degree of Freedom: The number of levels of a factor minus 1.
Interaction: Two or more factors that, together, produce a result different from what the result of their separate
effects would be.

Design Of Experiments - Definitions:
Noise factor: An uncontrollable (but measurable) source of variation in the functional characteristics of a product or
process.
Response variable: The variable(s) used to describe the reaction of a process to variations in control variables
(factors); the quality characteristic under study.

Characteristics of a Good Experiment Design
  The experiment should provide unbiased estimates of process variables and treatment effects (factors at different
  levels).
  The experiment should provide the precision necessary to enable the experimenter to detect important differences.
  The experiment should plan for the analysis of the results.
  The experiment should generate results that are free from ambiguity of interpretation.
  The experiment should point the experimenter in the direction of improvement.
  The experiment should be as simple as possible:
    easy to set up and carry out,
    simple to analyze and interpret,
    simple to communicate or explain to others.

ONE WAY ANALYSIS OF VARIANCE
Used to test the hypothesis that the means of several populations are equal.
Example: A production line has 7 fill needles and you wish to assess whether or not the average fill is the same for all
7 needles.
Experiment: sample 20 fills from each of the 7 needles and test at the 5% level of significance.
H0: μ1 = μ2 = μ3 = μ4 = μ5 = μ6 = μ7

RESULTS: ANALYSIS OF VARIANCE TABLE

Analysis of Variance
-----------------------------------------------------------------------------
Source            Sum of Squares    Df     Mean Square    F-Ratio    P-Value
-----------------------------------------------------------------------------
Between groups    1.10019             6    0.183364       18.66      0.0000
Within groups     1.30717           133    0.00982837
-----------------------------------------------------------------------------
Total (Corr.)     2.40736           139

SINCE NEEDLE MEANS ARE NOT ALL EQUAL, WHICH ONES ARE DIFFERENT?

Multiple Range Tests for 7 Needles
Method: 95.0 percent LSD
Col_2    Count    Mean       Homogeneous Groups
--------------------------------------------------------------------------------
N7       20       11.786     X
N2       20       11.9811    X
N1       20       11.9827    X
N6       20       11.9873    X
N3       20       11.9951    X
N5       20       11.9953    X
N4       20       12.11      X
TERMINOLOGY

Randomization - a statistical tool used to minimize potential uncontrollable biases in the experiment by randomly
assigning material, people, the order in which experimental trials are conducted, or any other factor not under the
control of the experimenter.
  It results in averaging out the effects of the extraneous factors that may be present, in order to minimize the risk of
  these factors affecting the experimental results.

Blocking - a technique used to increase the precision of an experiment by breaking the experiment into homogeneous
segments (blocks) in order to control any potential block-to-block variability (multiple lots of raw material, several
shifts, several machines, several inspectors). Any effects on the experimental results as a result of the blocking factor
will be identified and minimized.

Experimental Design Techniques
  Single Factor ANOVA
    Completely Randomized Design
    Randomized Complete Block Design
    Latin Square Design
  Complete Factorial Design
  Confounded Design
  Fractional Factorial Design
  Split Plot Design
  Regression Approach
  Multivariate Analysis of Variance (MANOVA)

Strategy of Experimentation
  One-Factor-At-A-Time
  Factorial Experiment
One-Factor-At-A-Time
  This method consists of selecting a baseline set of levels for each factor, then successively varying each factor over
  its range with the other factors held constant at the baseline level.
  After all tests are performed, a series of graphs is constructed showing how the response variable is affected by
  varying each factor with all other factors held constant.
  The major disadvantage is that it fails to consider any possible interaction between the factors.
  In the real world, interactions between factors are common.
  Randomize the order of experimental runs to satisfy the statistical requirement of independence of observations.

Example - Salesman
  Source       Degree of freedom    SS      Mean SS    F-ratio
  Treatment    1                    450     450        4.5
  Error        6                    600     100
  Total        N-1 = 7              1050
  F-critical(1,6) = 5.9874
  There is no significant difference in sale value with regard to salesman.

Example - Sales Region
  Source       Degree of freedom    SS      Mean SS    F-ratio
  Treatment    1                    525     525        6
  Error        6                    525     87.5
  Total        N-1 = 7              1050
  F-critical(1,6) = 5.9874
  There is a significant difference in sale value with regard to sales region.
Analysis of Variance

A hypothesis is an assumption about a population.
Two types of hypothesis:
  Null Hypothesis (H0) - the preferred assumption about a population.
  Alternate Hypothesis (H1) - the opposite of the null hypothesis.

Example:
H0: There is no significant difference between companies in terms of quarterly averages of EPS.
H1: There is a significant difference between companies, for at least one pair of companies, in terms of quarterly
averages of EPS.

ONE WAY ANALYSIS OF VARIANCE
Used to test the hypothesis that the means of several populations are equal.
Example: A production line has 7 fill needles and you wish to assess whether or not the average fill is the same for all
7 needles.
Experiment: sample 20 fills from each of the 7 needles and test at the 5% level of significance.
H0: μ1 = μ2 = μ3 = μ4 = μ5 = μ6 = μ7

RESULTS: ANALYSIS OF VARIANCE TABLE

Analysis of Variance
-----------------------------------------------------------------------------
Source            Sum of Squares    Df     Mean Square    F-Ratio    P-Value
-----------------------------------------------------------------------------
Between groups    1.10019             6    0.183364       18.66      0.0000
Within groups     1.30717           133    0.00982837
-----------------------------------------------------------------------------
Total (Corr.)     2.40736           139

SINCE NEEDLE MEANS ARE NOT ALL EQUAL, WHICH ONES ARE DIFFERENT?

Multiple Range Tests for 7 Needles
Method: 95.0 percent LSD
Col_2    Count    Mean       Homogeneous Groups
--------------------------------------------------------------------------------
N7       20       11.786     X
N2       20       11.9811    X
N1       20       11.9827    X
N6       20       11.9873    X
N3       20       11.9951    X
N5       20       11.9953    X
N4       20       12.11      X
Significance level (α)
The significance level is the probability of rejecting the null hypothesis (H0) when it is actually true (the probability
of a Type I error).

Tests of hypothesis - three types:
  Tests of hypothesis concerning mean(s)
  Tests of hypothesis concerning variances
  Tests of hypothesis concerning proportions

Test of Hypothesis Using the F-ratio

  F-ratio = MSS(Treatments) / MSS(Error)

Decision rule: if F-ratio > F-critical, reject H0 and accept H1; otherwise, do not reject H0.

a = number of treatments
N = total number of data elements

  Source of Variation    Degrees of Freedom    Sum of Squares    Mean Sum of Squares        F
  Between Treatments     a - 1                 SS Treatments     SS Treatments / (a - 1)    MSS Treatment / MSS Error
  Within Treatments      N - a                 SS Error          SS Error / (N - a)
  Total                  N - 1                 SS Total
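
A compact Python sketch of this table and decision rule, using SciPy only for the F critical value; the three treatment
groups below are hypothetical illustration data:

    from scipy.stats import f

    def one_way_anova(groups, alpha=0.05):
        """Build the one-way ANOVA quantities and apply the F-ratio decision rule."""
        all_obs = [y for g in groups for y in g]
        N, a = len(all_obs), len(groups)
        grand_mean = sum(all_obs) / N

        ss_treat = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
        ss_total = sum((y - grand_mean) ** 2 for y in all_obs)
        ss_error = ss_total - ss_treat

        ms_treat = ss_treat / (a - 1)        # mean sum of squares between treatments
        ms_error = ss_error / (N - a)        # mean sum of squares within treatments
        f_ratio = ms_treat / ms_error
        f_crit = f.ppf(1 - alpha, a - 1, N - a)
        return f_ratio, f_crit, f_ratio > f_crit   # True -> reject H0

    # Hypothetical data: three treatments with four replications each.
    groups = [[12, 8, 16, 19], [16, 18, 10, 11], [25, 15, 22, 9]]
    print(one_way_anova(groups))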

CRD (Completely Randomized Design)

Completely randomized design can be defined as studying the effects of a primary factor without taking other factors into
consideration in the design of the experiment. Completely randomized design falls within the category of true random
number generation; it is the simplest form of design. In a completely randomized design, subjects are assigned to the
various groups at random, without the involvement of any judgment. Completely randomized designs are analyzed by one-way
ANOVA.

Example data (quarterly EPS for two companies):
          Company
          C1    C2
  Q1      12    16
  Q2       8    18
  Q3      16    10
  Q4      19    11

Sums of squares for the CRD:
  SS_total = Σi Σj Y²ij - Y²../N
  SS_treatment = Σj (Y².j / n) - Y²../N
  SS_error = SS_total - SS_treatment
where
  a is the number of treatments,
  n is the number of replications under each treatment,
  Y.. is the sum of Yij over all values of i and j (the grand total),
  Y.j is the sum of Yij over all values of i for a given j (the treatment total), and
  N is the total number of observations in the experiment.

Disadvantage of CRD
  Unrestricted randomization means that units that receive one treatment may be inherently different from units that
  receive other treatments.
  Any variation in units shows up in the experimental error sum of squares. Unless the units are very similar, a CRD will
  have a larger experimental error than other designs.

Example: Alpha Engineering Ltd is facing a quality problem in terms of the surface finish of components which are machined
using 4 different machines (P, Q, R and S). The company has selected 4 operators from each of 4 different grades of
employees who will machine the components on the machines. The operators are A1, A2, A3 and A4 from grade A; B1, B2, B3
and B4 from grade B; C1, C2, C3 and C4 from grade C; and D1, D2, D3 and D4 from grade D, for allotment to different
machines during different weeks (W1, W2, W3 and W4) of the month of experimentation. The sixteen operators are randomly
assigned to different combinations of machine and week as shown in the following table:

Experimental Combinations of CRD

In the CRD, the operators are assigned to machine-week combinations completely at random. The grades assigned (by week
and machine) are:

            Machines
            P    Q    R    S
  Week W1   A    C    B    C
       W2   B    A    C    D
       W3   D    B    C    B
       W4   A    D    A    D

The observed surface finish values for this CRD layout are:

            Machines
            P          Q          R          S
  Week W1   A1 (23)    C1 (35)    B4 (20)    C3 (25)
       W2   B2 (20)    A2 (34)    C2 (29)    D1 (28)
       W3   D2 (40)    B3 (30)    C4 (45)    B1 (30)
       W4   A4 (34)    D4 (34)    A3 (45)    D3 (40)

Rearranged by grade of operator (the treatment), the same data are:

  Grade of Operators
  A          B          C          D
  A1 (23)    B1 (30)    C1 (35)    D1 (28)
  A2 (34)    B2 (20)    C2 (29)    D2 (40)
  A3 (45)    B3 (30)    C3 (25)    D3 (40)
  A4 (34)    B4 (20)    C4 (45)    D4 (34)

Note that with unrestricted randomization some weeks contain no operator of a given grade (for example, week W1 has no
grade D operator, week W3 has no grade A, and week W4 has no grade B or C), so week-to-week differences are mixed into
the experimental error.

Randomized complete block design (RCBD)
  It is perhaps the most commonly encountered design that can be analyzed as a two-way ANOVA.
  In this design, a set of experimental units is grouped (blocked) in a way that minimizes the variability among the
  units within groups (blocks).
  The objective is to keep the experimental error within each block as small as possible. Each block contains a complete
  set of treatments; therefore differences among blocks are not due to treatments, and this variability can be estimated
  as a separate source of variation.
  The greater the variability among blocks, the more efficient the design becomes.
  The CRD has more degrees of freedom for error, and a smaller F value is required for a significant difference among
  treatments.

Completely Randomized Block Design (CRBD)
Week is a blocking factor; each week (block) contains a complete set of the four grades (A, B, C, D) of operators.

            Machines
  Week      P          Q          R          S
  W1        A1 (23)    B1 (30)    C3 (25)    D1 (28)
  W2        B2 (20)    C2 (29)    A2 (34)    D4 (34)
  W3        A3 (45)    D3 (40)    B3 (30)    C4 (45)
  W4        C1 (35)    B4 (20)    D2 (40)    A4 (34)

The same CRBD data arranged by grade of operator:

            Grade of Operators
  Week      A          B          C          D
  W1        A1 (23)    B1 (30)    C3 (25)    D1 (28)
  W2        A2 (34)    B2 (20)    C2 (29)    D4 (34)
  W3        A3 (45)    B3 (30)    C4 (45)    D2 (40)
  W4        A4 (34)    B4 (20)    C1 (35)    D3 (40)

Latin Square Design (LSD) - an improvement over the CRBD
Latin square designs allow for two blocking factors. In other words, these designs are used to simultaneously control (or
eliminate) two sources of nuisance variability. Here both Week and Machine are blocking factors, and each grade appears
exactly once in every week and once on every machine.

            Machines
  Week      P          Q          R          S
  W1        A1 (23)    B1 (30)    C4 (45)    D1 (28)
  W2        B2 (20)    C2 (29)    D4 (34)    A2 (34)
  W3        C3 (25)    D3 (40)    A3 (45)    B3 (30)
  W4        D2 (40)    A4 (34)    B4 (20)    C1 (35)

Comparing the three designs on the same data

  CRD:  Blocking factor - NIL. Factor - Grade of Operators. Interaction between factors - NIL.
  CRBD: Blocking factor - Week. Factors - Week, Grade of Operators. Interaction between factors - NIL.
  LSD:  Blocking factors - Week, Machines. Factors - Week, Machines, Grade of Operators. Interaction between factors - NIL.

LSD data arranged by grade of operator:

            Grade of Operators
  Week      A          B          C          D
  W1        A1 (23)    B1 (30)    C4 (45)    D1 (28)
  W2        A2 (34)    B2 (20)    C2 (29)    D4 (34)
  W3        A3 (45)    B3 (30)    C3 (25)    D3 (40)
  W4        A4 (34)    B4 (20)    C1 (35)    D2 (40)

The model of the CRD
  Yij = μ + Tj + eij
where
  μ is the overall mean,
  Yij is the ith observation under the jth treatment of the factor,
  Tj is the effect of the jth treatment of the factor, and
  eij is the random error associated with the ith observation under the jth treatment of the factor.

Null hypothesis, H0: T1 = T2 = T3 = T4
H1: treatment means are not equal for at least one pair of treatments.

SS_Total = SS_Treatment + SS_error
  SS_total = Σi Σj Y²ij - Y²../N
  SS_treatment = Σj (Y².j / n) - Y²../N
  SS_error = SS_total - SS_treatment
where a is the number of treatments, n is the number of replications under each treatment, Y.. is the sum of Yij over all
i and j, Y.j is the sum of Yij over all i for a given j, and N is the total number of observations in the experiment.
The model of the CRBD
  Yij = μ + Bi + Tj + eij
where
  μ is the overall mean,
  Yij is the ith observation under the jth treatment of the factor,
  Bi is the effect of the ith block,
  Tj is the effect of the jth treatment of the factor, and
  eij is the random error associated with the ith observation under the jth treatment of the factor.

Null hypothesis w.r.t. treatment, H0: T1 = T2 = T3 = T4
H1: treatment means are not equal for at least one pair of treatments.
Null hypothesis w.r.t. block, H0: B1 = B2 = B3 = B4
H1: block means are not equal for at least one pair of blocks.

SS_Total = SS_Block + SS_Treatment + SS_error
  SS_Total = Σi Σj Y²ij - Y²../N
  SS_Treatment = Σj (Y².j / b) - Y²../N
  SS_Block = Σi (Y²i. / a) - Y²../N
  SS_error = SS_Total - SS_Treatment - SS_Block
where a is the number of treatments, b is the number of blocks, Y.j is the sum of Yij over all i for a given j, Y i. is
the sum of Yij over all j for a given i, and N is the total number of observations in the experiment.

The model of the LSD
  Yijk = μ + Bi + Tj + Mk + eijk
where
  μ is the overall mean,
  Yijk is the observation in the ith row and kth column for the jth treatment of the factor,
  Bi is the effect of the ith row,
  Tj is the effect of the jth treatment of the factor,
  Mk is the effect of the kth column, and
  eijk is the random error associated with the observation in the ith row and kth column for the jth treatment.

Null hypothesis w.r.t. treatments, with the alternate hypothesis H1 regarding treatments;
null hypothesis w.r.t. rows, with the alternate hypothesis H1 regarding rows;
null hypothesis w.r.t. columns, with the alternate hypothesis H1 regarding columns.

SS_Total = SS_Block(row) + SS_Block(column) + SS_Treatment + SS_error
  SS_Total = Σi Σk Y²ik - Y².../N
  SS_Treatment = Σj (Y².j. / b) - Y².../N
  SS_Block(row) = Σi (Y²i.. / c) - Y².../N
  SS_Block(column) = Σk (Y²..k / b) - Y².../N
  SS_error = SS_Total - SS_Treatment - SS_Block(row) - SS_Block(column)
where a is the number of treatments, b is the number of blocks (rows), c is the number of blocks (columns), Y... is the
sum of Yijk over all i, j, k; Y.j. is the treatment total; Yi.. is the row total; Y..k is the column total; and N is the
total number of observations in the experiment.

Worked example: LSD for the surface finish data

            Machines
  Week      P          Q          R          S          Row total Yi..
  W1        A1 (23)    B1 (30)    C4 (45)    D1 (28)    126
  W2        B2 (20)    C2 (29)    D4 (34)    A2 (34)    117
  W3        C3 (25)    D3 (40)    A3 (45)    B3 (30)    140
  W4        D2 (40)    A4 (34)    B4 (20)    C1 (35)    129

  Column (machine) totals Y..k: 108, 133, 144, 127.
  Grade (treatment) totals Y.j.: A = 136, B = 100, C = 134, D = 142.
  Grand total Y... = 512, N = 16.

  SS_Total = (23² + 30² + 45² + 28² + ... + 40² + 34² + 20² + 35²) - 512²/16 = 938
  SS_Block(week) = (126² + 117² + 140² + 129²)/4 - 512²/16 = 67.5
  SS_Block(machines) = (108² + 133² + 144² + 127²)/4 - 512²/16 = 170.5
  SS_Treatment(grade) = (136² + 100² + 134² + 142²)/4 - 512²/16 = 270
  SS_error = 938 - 67.5 - 170.5 - 270 = 430

Result for the LSD

  Source of Variation    Degrees of freedom    Sum of Squares    Mean Sum of Squares    F-Ratio
  Between Grades         3                     270.0             270.0 / 3 = 90.00      1.256 < 4.76
  Between Weeks          3                     67.5              67.5 / 3 = 22.50       0.314 < 4.76
  Between Machines       3                     170.5             170.5 / 3 = 56.83      0.793 < 4.76
  Error                  15 - 3 - 3 - 3 = 6    430               430 / 6 = 71.67
  Total                  15                    938
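
A short Python sketch that reproduces the Latin square sums of squares above directly from the cell values and the grade
assignment:

    # Rows = weeks W1..W4, columns = machines P..S, with the grade assigned to each cell.
    values = [[23, 30, 45, 28],
              [20, 29, 34, 34],
              [25, 40, 45, 30],
              [40, 34, 20, 35]]
    grades = [["A", "B", "C", "D"],
              ["B", "C", "D", "A"],
              ["C", "D", "A", "B"],
              ["D", "A", "B", "C"]]

    p = 4                                        # side of the square
    N = p * p
    grand = sum(sum(row) for row in values)      # 512
    cf = grand ** 2 / N                          # correction factor

    ss_total = sum(y * y for row in values for y in row) - cf
    row_totals = [sum(row) for row in values]                             # weeks
    col_totals = [sum(values[i][k] for i in range(p)) for k in range(p)]  # machines
    grade_totals = {g: 0 for g in "ABCD"}
    for i in range(p):
        for k in range(p):
            grade_totals[grades[i][k]] += values[i][k]

    ss_week = sum(t * t for t in row_totals) / p - cf
    ss_machine = sum(t * t for t in col_totals) / p - cf
    ss_grade = sum(t * t for t in grade_totals.values()) / p - cf
    ss_error = ss_total - ss_week - ss_machine - ss_grade
    print(ss_total, ss_week, ss_machine, ss_grade, ss_error)   # 938, 67.5, 170.5, 270, 430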


Completely Randomized Design - worked example

A stock market analyst wants to study the impact of the type of company on quarterly averages of EPS data of four
different companies during the last financial year, taken from 'Capital Market' and summarized below:

  C1    C2    C3    C4
  12    16    25    13
   8    18    15     8
  16    10    22    20
  19    11     9     5
  --    --    --    --
  55    55    71    46      (treatment totals; grand total = 227, N = 16)

  SS_Total = 12² + 16² + ... + 5² - 227²/16 = 478.4375
  SS_Treatment = (55² + 55² + 71² + 46²)/4 - 227²/16 = 81.1875
  SS_Error = SS_Total - SS_Treatment = 478.4375 - 81.1875 = 397.25

  Source of Variation    Degrees of Freedom    Sum of Squares    Mean Sum of Squares        F (calculated)
  Between Companies      4 - 1 = 3             81.1875           81.1875 / 3 = 27.063       27.063 / 33.104 = 0.82 < 3.49
  Error                  15 - 3 = 12           397.25            397.25 / 12 = 33.104
  Total                  16 - 1 = 15           478.4375

Since the calculated F is less than the critical value, there is no significant difference between companies in terms of
quarterly averages of EPS data.
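
The same conclusion can be cross-checked with SciPy's built-in one-way ANOVA on the four companies' quarterly EPS values:

    from scipy.stats import f_oneway, f

    c1, c2, c3, c4 = [12, 8, 16, 19], [16, 18, 10, 11], [25, 15, 22, 9], [13, 8, 20, 5]
    f_ratio, p_value = f_oneway(c1, c2, c3, c4)
    f_crit = f.ppf(0.95, 3, 12)                 # 3 and 12 degrees of freedom

    # F is about 0.82 with a large p-value, below the critical value of about 3.49,
    # so we do not reject H0: no significant difference between companies.
    print(round(f_ratio, 2), round(p_value, 3), round(f_crit, 2))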

FACTORIAL (2^k) DESIGNS

Experiments involving several factors (k = number of factors) where it is necessary to study the joint effect of these
factors on a specific response. Each of the factors is set at two levels (a low level and a high level), which may be
qualitative or quantitative.
Assumptions of the 2^k factorial design:
  Factors are assumed to be fixed (fixed effects model).
  Designs are completely randomized (experimental trials are run in a random order, etc.).
  The usual normality assumptions are satisfied.

Particularly useful in the early stages of experimental work, when you are likely to have many factors being investigated
and you want to minimize the number of treatment combinations (sample size) but, at the same time, study all k factors in
a complete factorial arrangement (the experiment collects data at all possible combinations of factor levels).

[Figure: the 2^2 design square with treatment combinations (1), a, b, ab at the low (L) and high (H) levels of factors A
and B, and the 2^3 design cube adding c, ac, bc, abc for the low and high levels of factor C.]

As k gets large, the sample size increases exponentially. If the experiment is replicated, the number of runs increases
again.

  k     # of runs
  2     4
  3     8
  4     16
  5     32
  6     64
  7     128
  8     256
  9     512
  10    1024

For the 2^2 design:
  Average effect of A = [ (a - (1))/2 + (ab - b)/2 ] / 2
  Average effect of B = [ (b - (1))/2 + (ab - a)/2 ] / 2
For the 2^3 design:
  Average effect of A = [ (a - (1))/2 + (ac - c)/2 + (abc - bc)/2 + (ab - b)/2 ] / 2
  Average effect of B = [ (b - (1))/2 + (ab - a)/2 + (abc - ac)/2 + (bc - c)/2 ] / 2
  Average effect of C = [ (c - (1))/2 + (ac - a)/2 + (abc - ab)/2 + (bc - b)/2 ] / 2
Data arranged in Yates' order

  2^3 design:                                          2^2 design:
  Run    A    B    C    AB    AC    BC    ABC          Run    A    B    AB
  (1)    -    -    -    +     +     +     -            (1)    -    -    +
  a      +    -    -    -     -     +     +            a      +    -    -
  b      -    +    -    -     +     -     +            b      -    +    -
  ab     +    +    -    +     -     -     -            ab     +    +    +
  c      -    -    +    +     -     -     +
  ac     +    -    +    -     +     -     -
  bc     -    +    +    -     -     +     -
  abc    +    +    +    +     +     +     +

Example: A stock market analyst wants to study the impact of the type of company and time period on the quarterly
averages of the earnings per share (EPS). So he collected four quarterly averages of EPS data of two different companies
during the last two financial years from 'Capital Market', summarized below:

  Type of Company (A)    Year (B)
                         1 (-)        2 (+)
  1 (-)                  12, 18       16, 15
  2 (+)                  16, 19       10, 11

  Run    A    B    AB    R1    R2    Y (total)
  (1)    -    -    +     12    18    30
  a      +    -    -     16    19    35
  b      -    +    -     16    15    31
  ab     +    +    +     10    11    21

  Contrast A  = (35 + 21) - (30 + 31) = -5
  Contrast B  = (31 + 21) - (30 + 35) = -13
  Contrast AB = (30 + 21) - (31 + 35) = -15

  SS-A  = (Contrast A)²  / (2^k × replicates) = (-5)²  / 8 = 3.125
  SS-B  = (Contrast B)²  / (2^k × replicates) = (-13)² / 8 = 21.125
  SS-AB = (Contrast AB)² / (2^k × replicates) = (-15)² / 8 = 28.125

  SS-Total = (12² + 18² + ... + 10² + 11²) - 117²/8 = 75.875
  SS-Error = SS-Total - SS-A - SS-B - SS-AB = 75.875 - 3.125 - 21.125 - 28.125 = 23.5

ANOVA for the 2^2 factorial design

  Source           DF                   SS         MSS             F-ratio                   F-crit
  Company (A)      2 - 1 = 1            3.125      3.125 / 1       3.125 / 5.875 = 0.532     7.71
  Year (B)         2 - 1 = 1            21.125     21.125 / 1      21.125 / 5.875 = 3.60     7.71
  Company x Year   1 × 1 = 1            28.125     28.125 / 1      28.125 / 5.875 = 4.787    7.71
  Error            7 - 1 - 1 - 1 = 4    23.5       23.5 / 4 = 5.875
  Total            N - 1 = 8 - 1 = 7    75.875

ANOVA for a 2^k factorial design using Yates' algorithm

  Run    A    B    AB    R1    R2    Y     C1             C2               Contrast    SS
  (1)    -    -    +     12    18    30    30+35 = 65     65+52 = 117
  a      +    -    -     16    19    35    31+21 = 52     5-10 = -5        A           (-5)² / 8 = 3.125
  b      -    +    -     16    15    31    35-30 = 5      52-65 = -13      B           (-13)² / 8 = 21.125
  ab     +    +    +     10    11    21    21-31 = -10    -10-5 = -15      AB          (-15)² / 8 = 28.125
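
A small Python sketch computing the contrasts and sums of squares of this 2^2 design from the run totals, following the
same formulas:

    # Totals of the two replicates for each treatment combination, in Yates' order.
    totals = {"(1)": 30, "a": 35, "b": 31, "ab": 21}
    n_rep, runs = 2, 4

    contrast_A = totals["a"] + totals["ab"] - totals["b"] - totals["(1)"]    # -5
    contrast_B = totals["b"] + totals["ab"] - totals["a"] - totals["(1)"]    # -13
    contrast_AB = totals["(1)"] + totals["ab"] - totals["a"] - totals["b"]   # -15

    denom = runs * n_rep                       # 2^k * replicates = 8
    ss_A = contrast_A ** 2 / denom             # 3.125
    ss_B = contrast_B ** 2 / denom             # 21.125
    ss_AB = contrast_AB ** 2 / denom           # 28.125

    obs = [12, 18, 16, 19, 16, 15, 10, 11]
    ss_total = sum(y * y for y in obs) - sum(obs) ** 2 / len(obs)   # 75.875
    ss_error = ss_total - ss_A - ss_B - ss_AB                       # 23.5
    ms_error = ss_error / 4
    print(ss_A / ms_error, ss_B / ms_error, ss_AB / ms_error)       # F-ratios vs F-crit 7.71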
Exercise 1: The surface finish of a product produced in a machine shop is suspected to be affected by a factor "Operator"
and another factor "Shift". The data of this experiment, with two replications in the different treatment combinations,
are summarized below. Perform an ANOVA using the general linear model, the 2^n factorial design and Yates' algorithm, and
check the significance of the components of the related model when α is 0.05.

  Operator (A)    Shift (B)
                  1          2
  1               65, 70     20, 40
  2               30, 35     50, 40

Exercise 2: A company is keen on assessing the contribution of its employees, on a 0-10 scale, in terms of value addition
to its business operations. In this connection the UG qualification, sex and work experience of the employees are
considered to be the factors. The corresponding ratings of the employees are shown below:

  Work              UG degree (B)
  Experience (A)    Engineering                Others
                    Sex (C)                    Sex (C)
                    Male        Female         Male        Female
  < 3 years         8, 7        2, 6           4, 8        2, 4
  >= 3 years        9, 9        4, 9           7, 8        5, 6

  Answer - F-values below, F-crit = 5.32:
  WE = 4, UG-Degree = 1.563, WE x UG-Degree = 0, Sex = 7.563, WE x Sex = 0.250,
  UG-Degree x Sex = 0.063, WE x UG-Degree x Sex = 0

FACTORIAL (2^k) DESIGNS (k = 2): GRAPHICAL OUTPUT

Typical graphical outputs (interaction plots) show four situations:
  Neither factor A nor factor B has an effect on the response variable.
  Factor A has an effect on the response variable, but factor B does not.
  Factor A and factor B both have an effect on the response variable.
  Factor B has an effect on the response variable, but only if factor A is set at the high level. This is called an
  interaction, and it basically means that the effect one factor has on a response depends on the level at which you set
  the other factors. Interactions can be major problems in a DOE if you fail to account for them when designing your
  experiment.
CHAPTER 04
STATISTICAL QUALITY CONTROL

Why control quality?
Controlling and improving quality has become an important business strategy for manufacturers, distributors, transportation companies, financial services organizations, health care providers, and government agencies.
Quality is a competitive advantage. A business that can delight customers by improving and controlling quality can dominate its competitors.

Definition of Quality
Quality means fitness for use. There are two general aspects of fitness for use:
  quality of design [Design of Experiments]
  quality of conformance - how well the product / service conforms to the specifications required by the design [Statistical Quality Control]
Quality is inversely proportional to variability. This implies: as variability in the important characteristics of a product decreases, the quality of the product increases.
Quality improvement is the reduction of variability in processes and products.
The largest allowable value for a quality characteristic is called the upper specification limit (USL), and the smallest allowable value for a quality characteristic is called the lower specification limit (LSL).
Why Statistical Quality Control ?
Since variability can only be described in statistical terms,
statistical methods play a central role in quality improvement
efforts.

Descriptive statistics include:
  The Mean - a measure of central tendency
  The Range - the difference between the largest and smallest observations in a set of data
  Standard Deviation - measures the amount of data dispersion around the mean
  Distribution of Data - shape: normal (bell shaped) or skewed
Classify data on quality characteristics as either attributes or variables data.

Three SQC Categories
Statistical quality control (SQC) is the term used to describe the set of statistical tools used by quality professionals. SQC encompasses three broad categories:
1. Statistical process control (SPC)
   Involves inspecting the output from a process
   Quality characteristics are measured and charted
   Helps identify in-process variations
2. Descriptive statistics - include the mean, standard deviation, and range
3. Acceptance sampling - used to randomly inspect a batch of goods to determine acceptance or rejection
   Does not help to catch in-process problems

Sources of Variation
Variation exists in all processes. Variation can be categorized as either:
Common or random causes of variation
   Random causes that we cannot identify
   Unavoidable, e.g. slight differences in process variables like diameter, weight, service time, temperature
Assignable causes of variation
   Causes that can be identified and eliminated: poor employee training, a worn tool, a machine needing repair
This model represents manufacturing or service processes.
Statistical Methods for Quality Control and
Improvement
statistical process control [ Online Tool ]
Design of experiments, [Offline Tool ]
They are often used during development activities and the early
stages of manufacturing.
Acceptance sampling.
Done at incoming raw materials or components point , or final
production.

Example: a process in a financial institution that processes automobile loan applications.

15-1.2: Statistical Process Control
Statistical process control is a collection of tools that, when used together, can result in process stability and variance reduction.
The seven major tools are:
1) Histogram
2) Pareto chart
3) Cause and effect diagram
4) Defect concentration diagram
5) Control chart
6) Scatter diagram
7) Check sheet

15-2: Introduction to Control Charts
15-2.1 Basic Principles
A process that is operating with only chance causes of variation present is said to be in statistical control.
A process that is operating in the presence of assignable causes is said to be out of control.
The eventual goal of SPC is the elimination of variability in the process.

[Source: Applied Statistics and Probability for Engineers, Montgomery and Runger, John Wiley & Sons, Inc.]
A control chart is one of the primary techniques of statistical process control (SPC).
This chart plots the averages of measurements of a quality characteristic in samples taken from the process versus time (or the sample number).
The center line represents where this process characteristic should fall if there are no unusual sources of variability present. The control limits are determined from some simple statistical considerations.
If the process is in control, all the plotted points should have an essentially random pattern.
Methods for looking for sequences or nonrandom patterns can be applied to control charts as an aid in detecting out-of-control conditions.
Problems in Process
An assignable cause can result in many different types of shifts in the process parameters:
  The mean could shift instantaneously to a new value and remain there (this is sometimes called a sustained shift);
  or it could shift abruptly, but the assignable cause could be short-lived and the mean could then return to its nominal or in-control value;
  or the assignable cause could result in a steady drift or trend in the value of the mean.
  The process mean may be OK but the process standard deviation / range increases or decreases. [X-bar and R chart combination], [X-bar and S chart combination]

X-bar and R or S Control Charts
UCL = μ + 3σ/√n
UWL = μ + 2σ/√n
LWL = μ - 2σ/√n
LCL = μ - 3σ/√n

The constants D3 and D4 are tabulated for various values of n.
There is a well-known relationship between the range of a sample from a normal distribution and the standard deviation of that distribution. The random variable W = R/σ is called the relative range. The parameters of the distribution of W are a function of the sample size n. The mean of W is d2.
Control limits for the S-chart [formulas shown in figure, omitted].
The value of σ used in the control limits depends on the method you use to estimate it. Three common methods for estimating σ from subgroup data are:
  Average of the subgroup ranges (most used): σ-hat = R-bar / d2
  Average of the subgroup standard deviations
  Pooled standard deviation
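A small Python sketch of the X-bar and R chart limits may help; it is illustrative only (the data are synthetic, and the constants d2, D3, D4 are the standard tabulated values assumed here for subgroup size 5):

# X-bar / R chart limits estimated from subgroup ranges (subgroup size n = 5).
import numpy as np

d2, D3, D4 = 2.326, 0.0, 2.114           # standard table values for n = 5

rng = np.random.default_rng(1)
data = rng.normal(loc=50.0, scale=2.0, size=(20, 5))   # 20 hypothetical subgroups

xbar = data.mean(axis=1)                  # subgroup means
r = data.max(axis=1) - data.min(axis=1)   # subgroup ranges
xbarbar, rbar = xbar.mean(), r.mean()
sigma_hat = rbar / d2                     # estimate of process sigma

n = data.shape[1]
ucl_x = xbarbar + 3 * sigma_hat / np.sqrt(n)
lcl_x = xbarbar - 3 * sigma_hat / np.sqrt(n)
ucl_r, lcl_r = D4 * rbar, D3 * rbar       # R-chart limits

print(f"X-bar chart: CL={xbarbar:.2f}  UCL={ucl_x:.2f}  LCL={lcl_x:.2f}")
print(f"R chart:     CL={rbar:.2f}  UCL={ucl_r:.2f}  LCL={lcl_r:.2f}")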
Interpretation of X-bar, R and S charts
A control chart can indicate an out-of-control condition even though no single point plots outside the control limits, if the pattern of the plotted points exhibits nonrandom or systematic behavior.

Cyclic patterns
Cause: environmental changes such as temperature, operator fatigue, regular rotation of operators and/or machines, or fluctuation in voltage or pressure, etc.

A mixture
A mixture is indicated when the plotted points tend to fall near or slightly outside the control limits, with relatively few points near the center line.
A mixture pattern can also occur when output product from several sources (such as parallel machines) is fed into a common stream which is then sampled for process monitoring purposes.

A shift in process level
Shifts may result from the introduction of new workers; changes in methods, raw materials, or machines; a change in the inspection method or standards; or a change in operator skill, attentiveness, or motivation.

A trend, or continuous movement in one direction
Trends are usually due to a gradual wearing out or deterioration of a tool or some other critical process component.
They can also result from human causes, such as operator fatigue or the presence of supervision. Finally, trends can result from seasonal influences, such as temperature.

Stratification
A tendency for the points to cluster artificially around the center line. There is a marked lack of natural variability in the observed pattern.
One potential cause of stratification is incorrect calculation of control limits.
For example, suppose that a sample of size 5 is obtained by taking one observation from each of five parallel processes. If the largest and smallest units in each sample are relatively far apart because they come from two different distributions, then R will be incorrectly inflated, causing the limits on the x-bar chart to be too wide. In this case R incorrectly measures the variability between the different underlying distributions, in addition to the chance-cause variation that it is intended to measure.

Designing a control chart - the sample size and the frequency of sampling
  Small process shift  -> larger sample size
  Large process shift  -> smaller sample size than in the previous case

Rational Subgroups
The rational-subgroup concept means that subgroups or samples should be selected so that, if assignable causes are present, the chance for differences between subgroups will be maximized, while the chance for differences due to these assignable causes within a subgroup will be minimized.
Time order is frequently a good basis for forming subgroups because it allows us to detect assignable causes that occur over time.
Each sample consists of units that were produced at the same time (or as closely together as possible).
This minimizes the chance of variability due to assignable causes within a sample, and it maximizes the chance of variability between samples if assignable causes are present. It also provides a better estimate of the standard deviation of the process in the case of variables control charts.

Average Run Length (ARL)
Another way to evaluate the decisions regarding sample size and sampling frequency is through the average run length (ARL) of the control chart.
ARL is the average number of points that must be plotted before a point indicates an out-of-control condition.
If the process observations are uncorrelated, then for any Shewhart control chart, ARL = 1/p, where p is the probability that any point exceeds the control limits. This equation can be used to evaluate the performance of the control chart.
For the x-bar chart with three-sigma limits, p = 0.0027 is the probability that a single point falls outside the limits when the process is in control. Therefore, the average run length of the x-bar chart when the process is in control (called ARL0) is ARL0 = 1/p = 1/0.0027 = 370. That is, even if the process remains in control, an out-of-control signal will be generated every 370 samples, on the average.
A set of decision rules for detecting nonrandom patterns on
control charts.
1.One or more points outside of the control limits.
2.Two of three consecutive points outside the two-sigma warning limits but
still inside the control limits.
3.Four of five consecutive points beyond the one-sigma limits.
4.A run of eight consecutive points on one side of the center line.
5.Six points in a row steadily increasing or decreasing.
6.Fifteen points in a row in zone C (both above and below the center line).
7.Fourteen points in a row alternating up and down.
8.Eight points in a row on both sides of the center line with none in zone C.
9.An unusual or nonrandom pattern in the data.
10.One or more points near a warning or control limit.

What is Process Capability?


Process capability compares the output of an in-control process to
the specification limits by using capability indices. The comparison
is made by forming the ratio of the spread between the process
specifications (the specification "width") to the spread of the
process values, as measured by 6 process standard deviation
units (the process "width").
Process capability ratio (PCR):

  Cp = (USL - LSL) / (6σ)

where σ can be estimated from the average range as σ-hat = R-bar / d2.
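The ratio is easy to compute; the Python sketch below is illustrative only (hypothetical specification limits and data; d2 = 2.326 is the tabulated constant for subgroups of size 5):

# Cp = (USL - LSL) / (6*sigma), sigma estimated as R-bar / d2.
import numpy as np

USL, LSL = 74.05, 73.95                      # hypothetical specification limits
d2 = 2.326                                   # constant for subgroup size n = 5

rng = np.random.default_rng(7)
subgroups = rng.normal(74.0, 0.012, size=(25, 5))    # 25 hypothetical subgroups

rbar = (subgroups.max(axis=1) - subgroups.min(axis=1)).mean()
sigma_hat = rbar / d2
cp = (USL - LSL) / (6 * sigma_hat)

print(f"sigma-hat = {sigma_hat:.4f},  Cp = {cp:.2f}")
# Cp > 1 suggests the natural process spread fits within the specification
# width; note that Cp says nothing about how well the process is centered.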
Cp is valuable in measuring process capability. However, it has one shortcoming: it assumes that process variability is centered on the specification range. Unfortunately, this is not always the case.

Control Charts for Attributes
Control charts for attributes are used to measure quality characteristics that are counted rather than measured. Attributes are discrete in nature and entail simple yes-or-no decisions.
For example, this could be the number of non-functioning light bulbs, the proportion of broken eggs in a carton, or the number of complaints issued.
C-charts count the actual number of defects. For example, we can count the number of complaints from customers in a month.
P-charts are used to measure the proportion of items in a sample that are defective. An example is the proportion of broken cookies in a batch.
P-Chart
The center line is computed as the average proportion defective in the population, p-bar. This is obtained by taking a number of samples of observations at random and computing the average value of p across all samples.

[P-chart example plot omitted: sample proportion defective for 10 samples plotted against the center line (CL), the control limits (UCL, LCL) and the warning limits.]
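A minimal Python sketch of the p-chart calculation (illustrative; the defect counts are hypothetical and the usual 3-sigma limits p-bar +/- 3*sqrt(p-bar(1-p-bar)/n) are assumed):

import numpy as np

n = 100                                                   # items inspected per sample
defectives = np.array([4, 6, 3, 5, 8, 2, 5, 7, 4, 6])     # defectives per sample
p = defectives / n                                        # sample proportions
pbar = p.mean()                                           # center line

sigma_p = np.sqrt(pbar * (1 - pbar) / n)
ucl = pbar + 3 * sigma_p
lcl = max(0.0, pbar - 3 * sigma_p)                        # a proportion cannot be negative

print(f"CL={pbar:.3f}  UCL={ucl:.3f}  LCL={lcl:.3f}")
print("Out-of-control samples:", np.where((p > ucl) | (p < lcl))[0] + 1)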
The Shewhart Control Chart for Individual
Measurements

There are many situations in which the sample size used for process
monitoring is n = 1; that is, the sample consists of an individual unit.
In process plants, such as paper making, measurements on some
parameter such as coating thickness across the roll will differ very little
and produce a standard deviation that is much too small if the objective is
to control coating thickness along the roll.
In many applications of the individuals control chart, we use the moving range of two successive observations as the basis for estimating the process variability.

Measuring Six Sigma for the Service Industry
Six Sigma can be successfully linked to a service industry, or to a service function such as finance and stores within a manufacturing organization.
It will help you find out DPMO levels and also the present sigma level of a company.

Why Six Sigma?
Failure costs: warranty repair, customer complaints.
Prevention costs: reliability analysis in design, customer surveys to reveal requirements.
If the prevention cost for an error is Rs 1, then it will cost Rs 10 to detect it in-house and Rs 100 if it is detected by the customer.
Setting Six Sigma targets
In a product-related industry, the customer or buyer can define certain specifications, which help identify and quantify parameters for Six Sigma implementation.
In the service industry, however, the output being intangible, the company has to set its own targets by identifying all the key characteristics of its service and identifying the process measures that have a direct impact on these key characteristics. Typically, the Six Sigma implementation strategy would suggest that the company takes the following steps:
  Identify the CTQ (Critical to Quality) that is the most significant
  Identify the root cause
  Design a solution that would address this root cause
  Implement this solution
  Verify the effect of the solution by conducting audits at regular intervals
  Improve the process if needed

Six Sigma in the Accounts department
The management notices that most of the customer complaints are related to vouchers handled by the company. To reduce the number of customer complaints, the company should aim to reduce the number of errors in the vouchers.
The CTQs identified in the accounts department are errors pertaining to Amount, Tax, Code and Date.
The team needs to find out the defects per million opportunities. For this they have to analyze sample vouchers, selected randomly.
The team inspected 1000 vouchers and found a total of 120 defects.
Therefore, defects per voucher = 120/1000 = 0.12, and defects per CTQ (opportunity) = 0.12/4 = 0.03.
This value is then expressed in terms of defects per million opportunities; it thus becomes 30,000 ppm.
The sigma table assumes a 1.5-sigma shift, because processes tend to exhibit instability of that magnitude over time.
Area: Call Center
Customer Quote: I consistently wait too long to speak to a representative.
CTQ Measure: Time on hold (seconds)
CTQ Specification: Less than 60 seconds from call connection to the
automated response system
Defect: Calls with hold time equal and greater than 60 seconds
Unit: Call
Opportunity: 1 per call

Calculate Sigma
Defects: 263 calls
Units: 21,501 calls
Opportunities: 1 per call
Defects per opportunity = 263 / 21,501 = 0.012 (about 12,232 DPMO)
Sigma: 3.75 (equivalent to Cpk = 3.75/3 = 1.25, which is less than the 2.0 of a Six Sigma process with its 3.4 ppm defect rate)
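The conversion from defect counts to a sigma level can be done directly; the Python sketch below (illustrative, using the conventional 1.5-sigma shift) reproduces the call-centre numbers above:

from scipy.stats import norm

defects, units, opportunities_per_unit = 263, 21_501, 1
dpo = defects / (units * opportunities_per_unit)     # defects per opportunity
dpmo = dpo * 1_000_000                               # defects per million opportunities

# long-term yield -> short-term sigma level (adds the 1.5 sigma shift)
sigma_level = norm.ppf(1 - dpo) + 1.5

print(f"DPMO = {dpmo:,.0f},  sigma level = {sigma_level:.2f}")   # about 3.75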
CHAPTER 05
QUANTITATIVE TECHNIQUES USED IN ADVANCED DECISION MAKING
Multi-criteria Decision Making and the Analytic Hierarchy Process

Multiple criteria decision making (MCDM) refers to making decisions in the presence of multiple, non-commensurable and conflicting criteria, different units of measurement among the criteria, and the presence of quite different alternatives.
MCDM problems are common in everyday life. Multi-criterion decision making analysis has some unique characteristics, such as the presence of multiple conflicting criteria. In a personal context, a house or a car one buys may be characterised in terms of price, size, style, safety, comfort, etc. In a business context, MCDM problems are more complicated and usually of large scale.

Types of MCDM
There exist two distinctive types of MCDM problems due to the different problem settings: one type has a finite number of alternative solutions and the other an infinite number of solutions.
Normally, in problems associated with selection and assessment, the number of alternative solutions is limited. In other settings the potential alternative solutions could be infinite; if this is the case, the problem is referred to as a multiple objective optimisation problem instead of a multiple attribute decision problem. Our focus will be on problems with a finite number of alternatives.
Main Features of MCDM
  Multiple attributes/criteria often form a hierarchy.
  Conflict among criteria.
  Hybrid nature: (1) incommensurable units, (2) mixture of qualitative and quantitative attributes, (3) mixture of deterministic and probabilistic attributes.
  Uncertainty: (1) uncertainty in subjective judgments, (2) uncertainty due to lack of data or incomplete information.
  Large scale.
  Assessment may not be conclusive.

MCDM Solutions
All criteria in a MCDM problem can be classified into two categories: criteria that are to be maximised are in the profit criteria category; similarly, criteria that are to be minimised are in the cost criteria category.
An ideal solution to a MCDM problem would maximise all profit criteria and minimise all cost criteria.
Types of solutions: non-dominated solutions, dominated alternatives, satisfying solutions, preferred solutions.

MCDM Methods
There are two types of MCDM methods: one is compensatory and the other is non-compensatory.
There are three steps in utilizing any decision-making technique involving numerical analysis of alternatives:
  Determine the relevant criteria and alternatives.
  Attach numerical measures to the relative importance of the criteria and to the impact of the alternatives on these criteria.
  Process the numerical values to determine a ranking of each alternative.

Numerous MCDM methods, such as ELECTRE-3 and 4, PROMETHEE-2, Compromise Programming, Cooperative Game Theory, Composite Programming, the Analytic Hierarchy Process, Multi-Attribute Utility Theory, Multicriterion Q-Analysis, etc., are employed for different applications.

The WSM Method
The weighted sum model (WSM) is probably the most commonly used approach, especially in single-dimensional problems. If there are m alternatives and n criteria, then the best alternative is the one with the largest WSM score:

  A*_WSM = max over i of  sum_{j=1..n} a_ij * w_j ,   for i = 1, 2, ..., m

where A*_WSM is the WSM score of the best alternative, n is the number of decision criteria, a_ij is the actual value of the i-th alternative in terms of the j-th criterion, and w_j is the weight of importance of the j-th criterion.
In single-dimensional cases, where all the criteria have the same units, the WSM can be used without difficulty. Difficulty with this method emerges when it is applied to multi-dimensional MCDM problems.

The WPM Method
The weighted product model (WPM) is very similar to the WSM. The main difference is that instead of addition there is multiplication. Each alternative is compared with the others by multiplying a number of ratios, one for each criterion. Each ratio is raised to the power equivalent to the relative weight of the corresponding criterion. In order to compare two alternatives A_K and A_L, the following product has to be calculated:

  R(A_K / A_L) = product over j=1..n of (a_Kj / a_Lj)^(w_j)

where n is the number of criteria, a_ij is the actual value of the i-th alternative in terms of the j-th criterion, and w_j is the weight of the j-th criterion.
If the term R(A_K / A_L) is greater than or equal to one, it indicates that alternative A_K is more desirable than alternative A_L (in the maximization case). The best alternative is the one that is better than or at least equal to all other alternatives.
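A short Python sketch of both scores (illustrative only; the 3 x 3 decision matrix and weights below are hypothetical benefit criteria with weights summing to 1):

import numpy as np

a = np.array([[25, 20, 15],      # alternative A1 scored on criteria C1..C3
              [10, 30, 20],      # alternative A2
              [30, 10, 30]])     # alternative A3
w = np.array([0.2, 0.3, 0.5])    # criterion weights

wsm = a @ w                      # weighted sum score per alternative
print("WSM scores:", wsm, "-> best: A", wsm.argmax() + 1)

# WPM: compare the products  prod_j a_ij ** w_j
wpm = np.prod(a ** w, axis=1)
print("WPM scores:", wpm, "-> best: A", wpm.argmax() + 1)

# Pairwise ratio R(A1/A2) = prod_j (a_1j / a_2j) ** w_j  (>= 1 means A1 at least as good)
print("R(A1/A2) =", np.prod((a[0] / a[1]) ** w))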
The AHP Method
The Analytic Hierarchy Process (AHP) decomposes a complex MCDM problem into a system of hierarchies. The final step in the AHP deals with the structure of an m x n matrix (where m is the number of alternatives and n is the number of criteria). The matrix is constructed by using the relative importance of the alternatives in terms of each criterion. AHP deals with complex problems which involve the consideration of multiple criteria/alternatives simultaneously.
AHP is based on the pairwise comparison method: a process of comparing entities in pairs to judge which entity in each pair is preferred, or has a greater amount of some quantitative property, or whether the two entities are identical. A paired comparison is usually a method of comparing one entity with another of similar status. Usually, such paired comparisons are made on the grounds of the overall performance of an individual.
Prof. Thomas L. Saaty (1980) originally developed the Analytic Hierarchy Process (AHP) to enable decision making in situations characterized by multiple attributes and alternatives.

Major steps in applying the AHP technique:
1. Develop a hierarchy of factors impacting the final decision. This is known as the AHP decision model. The last level of the hierarchy contains the alternatives (here, the three candidates).
2. Elicit pairwise comparisons between the factors using inputs from users/managers.
3. Evaluate relative importance weights at each level of the hierarchy.
4. Combine relative importance weights to obtain an overall ranking of the three candidates.

While comparing two criteria, we follow the simple rule recommended by Saaty (1980). Thus, while comparing two attributes X and Y, we assign values in the following manner, based on the relative preference of the decision maker. To fill the lower triangular matrix, we use the reciprocal values of the upper diagonal.
Table 1: Scale used for pairwise comparison (Saaty)

Intensity of Importance    Definition
1                          Equal importance
3                          Weak importance of one over the other
5                          Strong importance
7                          Demonstrated importance
9                          Absolute importance
2, 4, 6, 8                 Intermediate values
Reciprocals of the above   If activity i has one of the above numbers assigned to it when compared with activity j, then j has the reciprocal value when compared with i.
1.1 - 1.9                  When elements are close and nearly indistinguishable

Estimating the consistency (for sensitivity analysis)
Sensitivity analysis is an extension to AHP. It provides information about the robustness of any decision. It is applicable and necessary to explore the impact of an alternative priority structure on the rating of the employee. When the weights for the pairwise comparison were changed, it was found that the performance evaluation also changed accordingly.

Step 1. Multiply each value in the first column of the pairwise comparison matrix by the corresponding relative priority (weight).
Step 2. Repeat Step 1 for the remaining columns.
Step 3. Add the vectors resulting from steps 1 and 2.
Step 4. Divide each element of the vector of weighted sums obtained in steps 1-3 by the corresponding priority value.
Step 5. Compute the average of the values found in Step 4. Let λmax be this average.
Step 6. Compute the consistency index (CI), which is defined as CI = (λmax - n) / (n - 1).
Compute the random index, RI, using the ratio RI = 1.98 (n - 2) / n, or read it from the table below.
Compute the consistency ratio, CR = CI / RI.
Accept the matrix if the consistency ratio CR is less than 0.10. If CR < 0.10 the degree of consistency is satisfactory: the decision maker's comparisons are probably consistent enough to be useful.

Standard Random Index (RI) for the number of alternatives (n):
n    3      4     5      6      7      8
RI   0.58   0.9   1.12   1.24   1.32   1.41

Example: A company decided to outsource some parts of its product. Three different companies submitted their tenders for the required parts. Three factors are important to select the best fit: cost, reliability of the product, and delivery time of the orders. The prices offered are as follows:
  ABC - 100/- per gross
  XYZ - 80/- per gross
  PQR - 144/- per gross
(1 gross = 12 dozen = 144 units)

Criteria: Cost, Reliability, Delivery Time.
Alternatives under each criterion: ABC, XYZ, PQR.

In terms of price, the companies are compared as: XYZ is moderately preferred to ABC and very strongly preferred to PQR, whereas ABC is strongly to very strongly preferred to PQR.
Since XYZ is moderately preferred to ABC, ABC's entry in the XYZ row is 3 and XYZ's entry in the ABC row is 1/3.
Since XYZ is very strongly preferred to PQR, PQR's entry in the XYZ row is 7 and XYZ's entry in the PQR row is 1/7.
Since ABC is strongly to very strongly preferred to PQR, PQR's entry in the ABC row is 6 and ABC's entry in the PQR row is 1/6.

The final pairwise comparison matrix (based on price) looks like:

        ABC    XYZ    PQR
ABC      1     1/3     6
XYZ      3      1      7
PQR     1/6    1/7     1
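The priority vector and consistency ratio for this matrix can be computed with a few lines of Python. This is an illustrative sketch of the column-normalisation (approximate eigenvector) procedure described in the steps above, not part of the original material; for this matrix the consistency ratio should come out below the 0.10 threshold:

import numpy as np

names = ["ABC", "XYZ", "PQR"]
A = np.array([[1,   1/3, 6],
              [3,   1,   7],
              [1/6, 1/7, 1]])

# Priority vector: normalise each column, then average across each row
weights = (A / A.sum(axis=0)).mean(axis=1)

# lambda_max: weighted column sums divided by the priorities, then averaged
lam_max = ((A @ weights) / weights).mean()

n = A.shape[0]
CI = (lam_max - n) / (n - 1)
RI = 0.58                       # random index for n = 3 (standard table)
CR = CI / RI

for name, w in zip(names, weights):
    print(f"{name}: priority {w:.3f}")
print(f"lambda_max = {lam_max:.3f}, CI = {CI:.3f}, CR = {CR:.3f}")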
CHAPTER 05
QUANTITATIVE TECHNIQUES USED IN ADVANCED DECISION MAKING
PIVOT TABLE AND OPTIMIZATION USING SOLVER

Open Sales.xlsx and perform the following:
1. Show the region-wise selling pattern for all sales persons and their total sales amount.
2. Display the product-wise sales for each region.
3. Compare the monthly selling performance for each sales person.
4. Draw a pivot chart showing the monthly regional selling status. Change the chart according to product sales.
5. Open Student.xlsx. Display the month-wise sum of scores for all subjects and their grand total.
6. Display the highest score for each student.
7. Display the pivot chart for students' monthly scores.

The Data worksheet in the Groceriespt.xlsx file contains more than 900 rows of sales data. Each row contains the number of units sold and the revenue of a product at a store, as well as the month and year of the sale. The product group (fruit, milk, cereal, or ice cream) is also included. You would like to see a breakdown of sales during each year for each product group and product at each store. You would also like to be able to show this breakdown during any subset of months in a given year (for example, what the sales were from January through June).
Determine the following using the groceries worksheet:
  Amount spent per year in each store on each product
  Total spending at each store
  Total spending for each year

From information in a random sample of 925 people, I know the gender, the age, and the amount these people spent on travel last year. How can I use this data to determine how gender and age influence a person's travel expenditures? What can I conclude about the type of person to whom I should mail the brochure?
To understand this data, you need to break it down as follows:
  Average amount spent on travel by gender
  Average amount spent on travel for each age group
  Average amount spent on travel by gender for each age group
Solver Optimization Tool

Problem 1 (product mix): A manufacturer produces four products A, B, C and D by using two types of machines (lathe and milling machines). The time required on the two machines to manufacture one unit of each of the four products, the profit per unit of each product, and the total time available on the two types of machines per day are given below.
Find the number of units to be manufactured of each product per day for maximizing the profit.
Also find the profit value if the minimum quantities of Product A and Product B are 30 and 80, respectively.

Machine               Time required per unit (minutes)                     Total time available
                      Product A   Product B   Product C   Product D        per day (minutes)
Lathe machine         7           10          4           9                1200
Milling machine       3           40          1           1                800
Profit per unit (Rs.) 45          100         30          50

Problem 2 (transportation): The unit costs of transportation from sources M, P and T to the destinations A, B, C and D are given below.

      A      B      C      D
M     0.6    0.56   0.22   0.4
P     0.36   0.3    0.28   0.58
T     0.65   0.68   0.55   0.42

The available capacities at M, P and T are 9000, 12000 and 13000 units, respectively. The demands at the destinations are 7500, 8500, 9500 and 8000 units, respectively.
Formulate the transportation problem and solve it using Solver.

When you click Solve, you'll see the message "Solver could not find a feasible solution". This message does not mean that you made a mistake in your model but, rather, that with limited resources, you can't meet demand for all products.

A bank processes checks seven days a week. The number of workers needed each day to process checks is given below; for example, 13 workers are needed on Tuesday, 15 workers are needed on Wednesday, and so on. All bank employees work five consecutive days. Find the minimum number of employees that the bank can have and still meet its labour requirements, based on the following data.

Number     Day worker
starting   starts       Monday  Tuesday  Wednesday  Thursday  Friday  Saturday  Sunday
0          Monday       1       1        1          1         1       0         0
0          Tuesday      0       1        1          1         1       1         0
0          Wednesday    0       0        1          1         1       1         1
0          Thursday     1       0        0          1         1       1         1
0          Friday       1       1        0          0         1       1         1
0          Saturday     1       1        1          0         0       1         1
0          Sunday       1       1        1          1         0       0         1

Number working   0    0    0    0    0    0    0
                 >=   >=   >=   >=   >=   >=   >=
Number needed    17   13   15   17   9    9    12
CHAPTER 06
DATA ANALYSIS USING MS EXCEL
Advanced Statistical Applications

Performing statistical analysis in Excel is very convenient. Excel has various built-in functions that allow you to perform all sorts of statistical calculations.
The Analysis ToolPak in Excel also works as a tool for:
  Summarizing data
  Fitting data (simple linear regression, multiple regression)
  Hypothesis testing (t-test)

What is Data Analysis?
Data analysis is the process used to get results from raw data that can be used to make decisions. Results of data analysis can be used for detecting trends and making predictions.

Click on Data -> Select Data Analysis -> Descriptive Statistics -> Select input range -> Check Summary statistics -> OK
The Descriptive Statistics
Descriptive Statistics answer basic
questions about the central tendency
and dispersion of data observations.

* Numerical Summaries
* Measures of location
* Measures of variability

Hypothesis Testing
Data > Data Analysis > t-Test: Two Sample . . .
H0 = null hypothesis: there is no significant difference.
H1 = alternative hypothesis: there is a significant difference.
As a rule of thumb, you can use the equal-variances test if the ratio of the variances is less than 3.

Excel will automatically calculate t-values to compare:
  Means of two datasets with equal variances
  Means of two datasets with unequal variances
  Two sets of paired data

If abs(t-score) < abs(t-critical): accept H0.


t-Test: two sample
t-Test: Two-Sample Assuming Unequal Variances

                                Variable 1   Variable 2
Mean                            54.99931     50.90014
Variance                        1.262476     7.290012
Observations                    22           5
Hypothesized Mean Difference    0
df                              4
t Stat                          3.329922
P(T<=t) one-tail                0.014554
t Critical one-tail             2.131846
P(T<=t) two-tail                0.029108
t Critical two-tail             2.776451

Interpretation: the probability of drawing two random samples from a normally distributed population and getting the mean of sample #1 this much larger than the mean of sample #2 is small. The mean of sample #1 is larger at a significance level of α = 0.03 (the 3% significance level), because p < 0.03. Also, t > t-critical, so the mean of sample #1 is significantly different from the mean of sample #2.

Paired sample t-test
The paired sample t-test compares the means of two variables for a single group. The procedure computes the difference between the values of the two variables for each case and tests whether the average differs from zero.
Example: high blood pressure - all the patients are measured at the beginning of the study, given a medicine/treatment, and measured again after treatment.
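The same comparison can be run in Python. The sketch below is illustrative (the data are hypothetical); Welch's t-test with equal_var=False corresponds to Excel's "Two-Sample Assuming Unequal Variances":

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample1 = rng.normal(55.0, 1.1, size=22)    # e.g. 22 observations
sample2 = rng.normal(50.9, 2.7, size=5)     # e.g. 5 observations

t_stat, p_two_tail = stats.ttest_ind(sample1, sample2, equal_var=False)
print(f"t = {t_stat:.3f},  two-tail p = {p_two_tail:.4f}")
# A paired comparison (the same subjects before/after treatment) would
# instead use stats.ttest_rel(before, after).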
ANOVA: Analysis of Variance
A one-way analysis of variance is a way to test the equality of three or more means at one time by using variances.
Assumptions
  The populations from which the samples were obtained must be normally or approximately normally distributed.
  The samples must be independent.
  The variances of the populations must be equal.
Hypotheses
  The null hypothesis is that all population means are equal; the alternative hypothesis is that at least one mean is different.

Summary Table
All of this sounds like a lot to remember, and it is. However, there is a table which makes things really nice.

Source    SS              df     MS               F
Between   SS(B)           k-1    SS(B) / (k-1)    MS(B) / MS(W)
Within    SS(W)           N-k    SS(W) / (N-k)
Total     SS(W) + SS(B)   N-1
F test statistic
Recall that an F variable is the ratio of two independent chi-square variables divided by their respective degrees of freedom. Also recall that the F test statistic is the ratio of two sample variances; it turns out that's exactly what we have here. The F test statistic is found by dividing the between-group variance by the within-group variance. The degrees of freedom for the numerator are the degrees of freedom for the between group (k-1), and the degrees of freedom for the denominator are the degrees of freedom for the within group (N-k).

So how big is F?
Since F = Mean Square Between / Mean Square Within = MSG / MSE, a large value of F indicates relatively more difference between groups than within groups (evidence against H0).
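A short illustrative Python sketch (hypothetical data for three groups) that computes the one-way ANOVA both with scipy and via the SS(B)/SS(W) decomposition of the summary table above:

import numpy as np
from scipy import stats

groups = [np.array([24., 26., 31., 27.]),
          np.array([29., 31., 30., 36.]),
          np.array([29., 27., 34., 26.])]

f_scipy, p = stats.f_oneway(*groups)

all_obs = np.concatenate(groups)
grand_mean = all_obs.mean()
k, N = len(groups), all_obs.size
ss_b = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)   # between
ss_w = sum(((g - g.mean()) ** 2).sum() for g in groups)             # within
f_manual = (ss_b / (k - 1)) / (ss_w / (N - k))

print(f"F (scipy)  = {f_scipy:.3f}, p = {p:.4f}")
print(f"F (manual) = {f_manual:.3f}")     # matches the scipy value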

Correlation
Correlation is the extent to which variables in two different data series tend to move together (or apart) over time. A correlation coefficient describes the strength of the correlation between two series; values lie in the range (-1.0, 1.0).
Click on Data -> Select Data Analysis -> Correlation -> Select input range -> OK
[Scatter plot omitted: correlation coefficient r = 0.75]
[Line chart omitted: "Correlation: Player Salary and Ticket Price" - yearly change in ticket price vs. change in player salary, 1995-2001]

Regression Analysis
Correlation measures whether a linear relationship exists between two series of data. Linear regression attempts to find the relationship between the two series and expresses this relation with a linear equation of the form:
  y = mx + b
Data -> Data Analysis -> Regression
Select a dependent variable (y) and an independent variable (x).
What does this output tell us?
It describes the relationship in terms of an equation: Y = -66672 + 10.64X.
The value -66672 is known as the intercept, as it gives the value of Y when X = 0.
The value 10.64 is known as the slope (or gradient) and measures the increase in the value of Y that results from a one-unit increase in the value of X.
How good is the fit? The R-Square statistic describes how much of the variation in the Y variable was explained by variation in the X variable. R-Square = 1 is perfect; R-Square > 0.5 is considered good.

Multiple Regression
Multiple regression is the appropriate method of analysis when the research problem involves a single metric dependent variable presumed to be related to two or more metric independent variables. The objective of multiple regression analysis is to predict the changes in the dependent variable in response to changes in the independent variables. This objective is most often achieved through the statistical rule of least squares.
For example, an organization may attempt to predict its sales from information on its expenditure on advertising, the number of salespeople, and the number of stores carrying its product.
The regression analysis in the previous example has only considered situations in which the value of Y depended upon one independent X variable. However, it is easy to envisage circumstances in which Y depends upon two or more independent X variables - call them X1 and X2. Under these circumstances we would want to fit a regression equation of the form y = a + b*X1 + c*X2. This is known as multiple regression.

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.97
R Square            0.94
Adjusted R Square   0.92
Standard Error      1030.04
Observations        12.00

ANOVA
            df      SS              F       Significance F
Regression  2.00    145881218.66    68.75   0.00
Residual    9.00    9548781.34
Total       11.00   155430000.00

            Coefficients   Standard Error   P-value   Lower 95%   Upper 95%
Intercept   19668.96       2728.32          0.00      13497.07    25840.86
ADVEX       0.53           0.05             0.00      0.41        0.65
PRICE       -6.41          0.78             0.00      -8.18       -4.63
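The same kind of fit can be produced outside Excel. The Python sketch below is illustrative only (the advertising/price/sales figures are hypothetical, not the data behind the summary output above):

# Ordinary least squares for y = a + b*X1 + c*X2 with numpy.
import numpy as np

advex = np.array([23, 26, 30, 34, 43, 48, 52, 57, 58, 60, 65, 70], float)
price = np.array([9.8, 9.5, 9.0, 9.2, 8.0, 8.3, 7.5, 7.9, 7.2, 6.5, 6.3, 6.0])
sales = np.array([25, 26, 28, 30, 35, 36, 39, 40, 42, 44, 45, 48], float)

X = np.column_stack([np.ones_like(advex), advex, price])   # add intercept column
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)

fitted = X @ coef
r2 = 1 - ((sales - fitted) ** 2).sum() / ((sales - sales.mean()) ** 2).sum()

print("Intercept, ADVEX, PRICE coefficients:", coef.round(3))
print("R-Square:", round(r2, 3))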
Introduction to Monte Carlo Simulation Who uses Monte Carlo Simulation?

Data analysts would like to estimate the Many companies use Monte Carlo Simulation as an important part
of their decision making process.
probabilities of uncertain events accurately.
General Motors, P&G, Pfizer use simulation to estimate both
Monte Carlo simulation enables you to model
average return and the risk factor of new products.
situations that present uncertainty and then play Lilly uses simulation to determine the optimal plant capacity for
them out on a computer thousands of times. each drug.
P&G uses simulation to model and optimally hedge foreign
exchange risk.
Note: The term "Monte Carlo simulation" comes from the computer simulations performed during the 1930s and 1940s to estimate the probability that the chain reaction needed for an atom bomb to detonate would work successfully. The scientists involved in this work were big fans of gambling, so they gave the simulations the code name Monte Carlo.
Organizations use Monte Carlo simulation for forecasting net income, predicting structural and purchasing costs, and determining their susceptibility to different kinds of risks.
Financial planners use Monte Carlo simulation to determine optimal investment strategies for their clients' retirement.

Optimization and Simulation
  Optimization models: based on mathematical equations; yield decision variables as outputs; promise the best (optimal) solution to the model.
  Simulation models: require the decision variables as inputs; give only a satisfactory answer.

Differences Between Types of Simulation Models
  Continuous: used for simulating continuous values for all points in time. Example: the amount of time a person spends in a queue.
  Discrete: used for simulating specific values or specific points. Example: the number of people in a waiting line (queue).
Simulation Methodology:
  Estimate probabilities of future events
  Assign random number ranges to percentages (probabilities)
  Obtain random numbers
  Use random numbers to simulate events

What happens when I type =RAND() in a cell?
You get a number that is equally likely to assume any value between 0 and 1. Thus, around 25 percent of the time you should get a number less than or equal to 0.25; around 10 percent of the time you should get a number that is at least 0.90, and so on.

Simulation of a discrete random variable
Suppose the demand for a calendar is governed by the following discrete random variable:

Demand    Probability
10,000    0.10
20,000    0.35
40,000    0.30
60,000    0.25

How can we have Excel simulate this demand for calendars many times? Finally, we find that the simulated probabilities are close to the assumed demand probabilities.

Advantages of Simulation
  Simulation often leads to a better understanding of the real system.
  Years of experience in the real system can be compressed into seconds or minutes.
  Simulation does not disrupt ongoing activities of the real system.
  Simulation is far more general than mathematical models.
  Simulation can be used as a game for training experience (safety!).
Simulation Advantages (contd.)
  Simulation can be used when data is hard to come by.
  Simulation can provide a more realistic replication of a system than mathematical analysis.
  Simulation can be used to analyze transient conditions, whereas mathematical techniques usually cannot.
  Simulation considers variation and can calculate confidence intervals of model results.
  Simulation can model a system with multiple phases.

Disadvantages of Simulation
  There is no guarantee that the model will, in fact, provide good answers.
  There is no way to prove reliability.
  Simulation may be less accurate than mathematical analysis because it is randomly based.
  Building a simulation model can take a great deal of time (but if the payoff is great, then it is worth it!).
  A significant amount of computer time may be needed to run complex models (an old concern - no longer an issue!).
  The technique of simulation still lacks a standardized approach.

An Inventory Control Example: Foslins Housewares
Simulation can be used for models in which the question is "How much of this should we do?" We will now use an inventory control model to provide an illustration of simulation.

THE OMELET PAN PROMOTION: HOW MANY PANS TO ORDER?
In Foslins, certain sections of the housewares department have just suffered their second consecutive bad year.
Due to competition, the gourmet cooking, glassware, stainless flatware, and contemporary dishes sections of Foslins are not generating enough revenue to justify the amount of floor space.
To fight back, the chief buyer reorganized the sections that are in trouble to create a "store within a store". With these changes, plus the store's reputation for quality and service, she feels that Foslins can effectively compete.
An "International Dining Month" promotion will be featured in October to introduce the new facility.
Five specially made articles (each from a different country) will be featured on sale - for example, a copper omelet pan from France, a set of 12 long-stem wine glasses from Spain, etc.
All items must be ordered 6 months in advance. Any unsold items after October will be sold to a discount chain at a reduced price. If they run out, a more expensive item from the regular line will be substituted.
Consider the special omelet pans:
  Buying price: $22.00
  Selling price: $35.00
  Discounted price: $15.00 (at the end of October)
Regular pans:
  Buying price: $32.00
  Normal selling price: $65.00
  Selling price if substituted: $35.00

Now, without knowing the demand for this special product, how many pans should be ordered in advance?

For example, suppose you order 1000 pans and the demand turned out to be 1100 pans. In this situation, you would be 100 pans short and would have to buy 100 regular pans and sell them at the sale price in order to make up the deficit.
The net profit would be:
  $35(1100) - $32(100) - $22(1000) = $13,300
In general, let y = number of pans ordered and D = demand. Then, for D > y,
  Profit = 35D - 32(D - y) - 22y
         = 3D + 10y
In another scenario, suppose you order 1000 pans and the demand turned out to be 200 pans. In this situation, you would have an excess of 800 pans and would have to sell the additional pans at $15 each and take a loss.
The net profit would be:
  $35(200) + $15(800) - $22(1000) = -$3000
In general, for D < y,
  Profit = 35D + 15(y - D) - 22y
         = 20D - 7y
For D = y, the two formulas are identical.

The spreadsheet assumes an order of 11 omelet pans and a random demand of 8 (i.e., y = 11 and D = 8). It uses helper formulas such as =MAX(0,B3-B5) and =MAX(0,B5-B3) for the excess and shortfall quantities, and computes the profit with
  =IF(B3>=B5,E3*B5+E4*B7-E5*B3,E3*B5-E8*B9-E5*B3)
Since the order quantity is greater than demand (y > D), there are 3 extra pans. Thus, the profit is $83.
PROFIT VERSUS ORDER QUANTITY
Now, assume that demand has the following distribution:
  Prob {demand = 8}  = 0.1
  Prob {demand = 9}  = 0.2
  Prob {demand = 10} = 0.3
  Prob {demand = 11} = 0.2
  Prob {demand = 12} = 0.1
  Prob {demand = 13} = 0.1
(Note: these demands have been chosen artificially small in order to simplify the example.)
To generate random demand from this probability distribution in Crystal Ball, enter this discrete distribution in a two-column format for Crystal Ball.
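The same experiment can be run without Crystal Ball. The Python sketch below (illustrative only) simulates 500 demand draws from the distribution above for an order quantity of y = 11 pans, using the two profit formulas derived earlier:

import numpy as np

demands = np.array([8, 9, 10, 11, 12, 13])
probs   = np.array([0.1, 0.2, 0.3, 0.2, 0.1, 0.1])
y = 11                                     # order quantity

def profit(d, y):
    # 35D - 32(D - y) - 22y = 3D + 10y when demand exceeds the order (substitution),
    # 35D + 15(y - D) - 22y = 20D - 7y when pans are left over (sold at discount)
    return 3 * d + 10 * y if d > y else 20 * d - 7 * y

rng = np.random.default_rng(42)
sim_demand = rng.choice(demands, size=500, p=probs)        # 500 trials
avg_profit = np.mean([profit(d, y) for d in sim_demand])

expected = sum(p * profit(d, y) for d, p in zip(demands, probs))
print(f"Simulated average profit (500 trials): ${avg_profit:.2f}")
print(f"True expected profit for y = 11:       ${expected:.2f}")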

Now, click on cell B5. Next, click on the Define Assumptions icon, choose Custom distribution and click OK.
In the resulting dialog, click on the Data button.
Now, enter the spreadsheet cell range where the discrete After clicking OK, the distribution will be displayed:
distribution was placed and click OK.

Click OK again to accept these settings.

To determine the expected profit through the use of the In order to use simulation to calculate the average profit, first
simulations, click on cell B11 and then click on the Define Forecast generate a number of trials, setting y = 11.
icon. The profit that results on any given trial depends on the value of
demand that was generated on that trial.
The average profit over all trials is the expected profit.
To do this, click on the Run Preferences icon and change the
Maximum Number of Trials to 500.

Click OK.

If not already selected, choose Large as the window size and When
Stopped (Faster). Click OK.
Next, click on the Start Simulation icon and after Crystal Ball To look at the statistics from the simulation, go to the Crystal Ball
has run the 500 iterations, the following dialog will appear: View menu and choose Statistics.

Click OK and Crystal Ball will automatically produce a histogram.

Based on these trials, the best estimate of the expected profit of


ordering 11 omelet pans is $123.24.
Note that since both the demand and the average profit are
random variables, running the simulation again will most likely
result in a different average profit.

Expected Value versus Order Quantity. To calculate the true expected profit using the spreadsheet, simply enter each demand in cell B5 (one at a time) and record the resulting profit. These profits are then multiplied by their respective probabilities; the sum of these values gives the true expected profit.

Simulated versus Expected Profits. For any particular order quantity, the average profit generated by the spreadsheet simulator does not equal the true expected profit. The implication of this fact for the process of making a decision is interesting. The computed expected profits and simulated average profits for order sizes of 9, 10, 11 and 12 pans were compared (table not reproduced here). Based on the maximum profit, your decision would be to order 10 pans according to the true expected profit, or 11 pans according to the simulated average profit.
The previous example illustrates that simulation, in general, is not guaranteed to achieve optimality. A simple way to increase the likelihood of achieving optimality is to increase the number of trials. With simulation, your decision may be wrongly identified if care is not taken to simulate a sufficient number of trials.
In a real problem you would not both calculate the true expected profit and use simulation to calculate an average profit.
Use simulation when it is computationally impractical or not even possible to calculate the expected profit associated with the alternative decisions, or when it is important to assess the variability of the performance measure for various solutions.

RECAPITULATION
To summarize:
1. A spreadsheet simulator takes parameters and decisions as inputs and yields one or more performance measures as output.
2. Each iteration of the spreadsheet simulator will generally yield a different value for the performance measure.
3. The performance measure (for an order size of 11) was profit. The 500 trials taken together combine to produce a measure of goodness of the order size: average profit.
4. Increasing the number of simulation iterations will usually improve the accuracy of the estimate of the expected value of the performance measure.
5. In a simulation, we can never be sure that the optimal decision has been found, although a 95% or 99% confidence interval can be calculated.
6. Management must assess four main factors:
   a. Does the model capture the real problem?
   b. Are the influences of the starting and ending conditions of the simulation properly accounted for?
   c. Have enough trials been performed for each decision so that the average value of the measure(s) of performance is a good indication of the true expected value?
   d. Have enough decisions been evaluated so that you can believe that the best answer found is close enough to the optimum?
CHAPTER 07
DATA CLEANSING & PRE-PROCESSING
Data Preprocessing
  Why preprocess the data?
  Data cleaning
  Data integration and transformation
  Data reduction
  Discretization and concept hierarchy generation
  Summary

Why Data Preprocessing?
Data in the real world is dirty:
  incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  noisy: containing errors or outliers
  inconsistent: containing discrepancies in codes or names
No quality data, no quality mining results! Quality decisions must be based on quality data, and a data warehouse needs consistent integration of quality data.
Data Quality: Why Preprocess the Data?
Measures for data quality - a multidimensional view:
  Accuracy: correct or wrong, accurate or not
  Completeness: not recorded, unavailable, ...
  Consistency: some modified but some not, dangling, ...
  Timeliness: timely update?
  Believability: how trustable is it that the data are correct?
  Interpretability: how easily can the data be understood?

Major Tasks in Data Preprocessing
  Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
  Data integration: integration of multiple databases, data cubes, or files
  Data transformation: normalization and aggregation
  Data reduction: obtains a reduced representation in volume but produces the same or similar analytical results
  Data discretization: part of data reduction but with particular importance, especially for numerical data

Forms of data preprocessing [figure omitted]

Data Preprocessing
  Why preprocess the data?
  Data cleaning
  Data integration and transformation
  Data reduction
  Discretization and concept hierarchy generation
  Summary
Data Cleaning
Data cleaning is a technique that is applied to remove noisy data and correct the inconsistencies in data. Data cleaning involves transformations to correct wrong data. Data cleaning is performed as a data preprocessing step while preparing the data for a data warehouse.

Data in the real world is dirty: there is a lot of potentially incorrect data, e.g., from instrument faults, human or computer error, or transmission error.
  incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data - e.g., Occupation = "" (missing data)
  noisy: containing noise, errors, or outliers - e.g., an impossible Salary value (an error)
  inconsistent: containing discrepancies in codes or names - e.g., a recorded Age that does not agree with the Birthday; a rating that was "1, 2, 3" in one place and "A, B, C" in another; discrepancies between duplicate records
  intentional (e.g., disguised missing data) - Jan. 1 as everyone's birthday?

Data cleaning tasks:
  Fill in missing values
  Identify outliers and smooth out noisy data
  Correct inconsistent data

Missing Data
Data is not always available - e.g., many tuples have no recorded value for several attributes, such as customer income in sales data.
Missing data may be due to:
  equipment malfunction
  data being inconsistent with other recorded data and thus deleted
  data not entered due to misunderstanding
  certain data not being considered important at the time of entry
  the history or changes of the data not being registered

How to Handle Missing Data?
  Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
  Fill in the missing value manually: tedious + infeasible?
  Use a global constant to fill in the missing value, e.g., "unknown" - but this may create a new class!
  Use the attribute mean to fill in the missing value
  Use the most probable value to fill in the missing value: inference-based methods such as a Bayesian formula or a decision tree
Noisy Data
Noise: random error or variance in a measured variable.
Incorrect attribute values may be due to:
  faulty data collection instruments
  data entry problems
  data transmission problems
  technology limitations
  inconsistency in naming conventions
Other data problems which require data cleaning:
  duplicate records
  incomplete data
  inconsistent data

How to Handle Noisy Data?
  Binning method: first sort the data and partition it into (equi-depth) bins; then smooth by bin means, by bin medians, or by bin boundaries, etc.
  Clustering: detect and remove outliers
  Combined computer and human inspection: detect suspicious values and check by human
  Regression: smooth by fitting the data into regression functions
Simple Discretization Methods: Binning
Equal-width (distance) partitioning:
  Divides the range into N intervals of equal size (uniform grid).
  If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A) / N.
  The most straightforward method, but outliers may dominate the presentation and skewed data is not handled well.
Equal-depth (frequency) partitioning:
  Divides the range into N intervals, each containing approximately the same number of samples.
  Good data scaling, but managing categorical attributes can be tricky.

Data Preprocessing
  Why preprocess the data?
  Data cleaning
  Data integration and transformation
  Data reduction
  Discretization and concept hierarchy generation
  Summary
Data Integration
Data integration is a data preprocessing technique that merges data from multiple heterogeneous data sources into a coherent data store. Data integration may involve inconsistent data and therefore needs data cleaning.
  Schema integration: integrate metadata from different sources.
  Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id = B.cust-#.
  Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources are different; possible reasons are different representations or different scales, e.g., metric vs. British units.

Handling Redundancy in Data Integration
Redundant data occur often when integrating multiple databases:
  Object identification: the same attribute or object may have different names in different databases.
  Derivable data: one attribute may be a derived attribute in another table, e.g., annual revenue.
Redundant attributes may be detected by correlation analysis and covariance analysis.
Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality.

Correlation Analysis (Nominal Data)
Chi-square (χ²) test:
  χ² = Σ (Observed - Expected)² / Expected
The larger the χ² value, the more likely the variables are related.
The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count.
Correlation does not imply causality: the number of hospitals and the number of car thefts in a city are correlated, but both are causally linked to a third variable, population.

Correlation Analysis (Numeric Data)
Correlation coefficient (also called Pearson's product moment coefficient):

  r(A,B) = Σ_{i=1..n} (a_i - Ā)(b_i - B̄) / ((n - 1) σ_A σ_B)
         = ( Σ_{i=1..n} a_i b_i  -  n Ā B̄ ) / ((n - 1) σ_A σ_B)

where n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ a_i b_i is the sum of the AB cross-product.
If r(A,B) > 0, A and B are positively correlated (A's values increase as B's do); the higher the value, the stronger the correlation.
r(A,B) = 0: independent;  r(A,B) < 0: negatively correlated.
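A small illustrative Python sketch (hypothetical attribute values) that evaluates the Pearson coefficient both from the formula above and with numpy's built-in routine:

import numpy as np

a = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])
b = np.array([1.5, 3.9, 5.6, 8.4, 9.1, 12.3])

n = a.size
r_formula = ((a - a.mean()) * (b - b.mean())).sum() / (
    (n - 1) * a.std(ddof=1) * b.std(ddof=1))
r_numpy = np.corrcoef(a, b)[0, 1]

print(f"r (formula) = {r_formula:.4f}")
print(f"r (numpy)   = {r_numpy:.4f}")   # the two values agree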
Visually Evaluating Correlation
Scatter plots can show the similarity (correlation) from -1 to 1. [Scatter plot examples omitted]

Data Transformation
  Smoothing: remove noise from data
  Aggregation: summarization, data cube construction
  Generalization: concept hierarchy climbing
  Normalization: scaled to fall within a small, specified range
    min-max normalization
    z-score normalization
    normalization by decimal scaling

Data Transformation: Normalization
  min-max normalization:             v' = (v - min_A) / (max_A - min_A)
  z-score normalization:             v' = (v - mean_A) / stand_dev_A
  normalization by decimal scaling:  v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1

Data Preprocessing
  Why preprocess the data?
  Data cleaning
  Data integration and transformation
  Data reduction
  Discretization and concept hierarchy generation
  Summary
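A short illustrative Python sketch (hypothetical values) of the min-max, z-score and decimal-scaling normalizations defined above:

import numpy as np

v = np.array([120.0, 250.0, 480.0, 760.0, 986.0])

min_max = (v - v.min()) / (v.max() - v.min())            # scaled into [0, 1]
z_score = (v - v.mean()) / v.std(ddof=1)                 # mean 0, std 1
j = int(np.floor(np.log10(np.abs(v).max()))) + 1         # smallest j with max(|v'|) < 1
decimal_scaled = v / 10 ** j

print("min-max:", min_max.round(3))
print("z-score:", z_score.round(3))
print(f"decimal scaling (j={j}):", decimal_scaled)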
Data Reduction Strategies
A warehouse may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set
Data reduction: obtains a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results
Data reduction strategies:
  Data cube aggregation
  Dimensionality reduction
  Numerosity reduction
  Discretization and concept hierarchy generation

Data Cube Aggregation
The lowest level of a data cube holds the aggregated data for an individual entity of interest, e.g., a customer in a phone calling data warehouse
Multiple levels of aggregation in data cubes further reduce the size of the data to deal with
Reference appropriate levels: use the smallest representation which is enough to solve the task
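As a rough illustration of rolling data up to a higher level of the cube, the pandas sketch below aggregates invented quarterly sales to annual totals (the column names and figures are made up for the example):

```python
import pandas as pd

# invented quarterly sales data: the lowest level of the cube
sales = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [224, 408, 350, 586, 300, 416, 380, 594],
})

# roll up to a higher aggregation level: one row per year
annual = sales.groupby("year", as_index=False)["amount"].sum()
print(annual)
```

Yearly queries can now be answered from the smaller aggregated table instead of the full quarterly detail.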
Dimensionality Reduction
Curse of dimensionality:
  When dimensionality increases, data becomes increasingly sparse
  Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
  The possible combinations of subspaces grow exponentially
Dimensionality reduction:
  Avoids the curse of dimensionality
  Helps eliminate irrelevant features and reduce noise
  Reduces the time and space required in data mining
  Allows easier visualization
Dimensionality reduction techniques:
  Principal Component Analysis (PCA)
  Supervised and nonlinear techniques (e.g., feature selection)

Numerosity Reduction
Reduce data volume by choosing alternative, smaller forms of data representation
Parametric methods (e.g., regression):
  Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
Non-parametric methods:
  Do not assume models
  Major families: histograms, clustering, sampling
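A minimal sketch of Principal Component Analysis, listed above as a dimensionality reduction technique, using scikit-learn on randomly generated data; the sizes (100 tuples, 10 attributes, 3 components kept) are arbitrary choices for the example:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                          # 100 tuples, 10 attributes
X[:, 5:] = X[:, :5] + 0.1 * rng.normal(size=(100, 5))   # make half the attributes redundant

pca = PCA(n_components=3)          # keep only 3 principal components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                           # (100, 3)
print(pca.explained_variance_ratio_.round(3))    # variance captured per component
```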
Parametric Data Reduction: Regression
Linear regression:
  Data modeled to fit a straight line
  Often uses the least-squares method to fit the line
Multiple regression:
  Allows a response variable Y to be modeled as a linear function of a multidimensional feature vector

Regression Analysis
Regression analysis: a collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (also called response variable or measurement) and of one or more independent variables (also known as explanatory variables or predictors)
The parameters are estimated so as to give a "best fit" of the data
Most commonly the best fit is evaluated by using the least-squares method, but other criteria have also been used
Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships
[Figure: scatter plot of (x, y) data points with the fitted regression line y = x + 1; a sample point (X1, Y1) is marked.]
Regression Analysis
Linear regression: Y = w X + b
  Two regression coefficients, w and b, specify the line and are estimated by using the data at hand
  Fitted by applying the least-squares criterion to the known values of Y1, Y2, …, X1, X2, …
Multiple regression: Y = b0 + b1 X1 + b2 X2
  Many nonlinear functions can be transformed into the above form

Dimensionality Reduction
Feature selection (i.e., attribute subset selection):
  Select a minimum set of features such that the probability distribution of the different classes given the values for those features is as close as possible to the original distribution given the values of all features
  Reduces the number of attributes in the discovered patterns, making the patterns easier to understand
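A short NumPy sketch of the two regression forms above: a least-squares fit of Y = wX + b, and a multiple regression where one feature is a squared copy of X to illustrate how a nonlinear function can be cast into the linear form. The data points are invented:

```python
import numpy as np

# invented data roughly following y = x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 5.0, 6.1])

# least-squares fit of Y = w*X + b; only (w, b) need to be stored, not the raw data
w, b = np.polyfit(x, y, deg=1)
print(f"Y = {w:.3f} X + {b:.3f}")

# multiple regression Y = b0 + b1*X1 + b2*X2, here with X1 = x and X2 = x^2
X = np.column_stack([np.ones_like(x), x, x ** 2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print("b0, b1, b2 =", coef.round(3))
```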
Clustering
Partition the data set into clusters, and store only the cluster representations
Can be very effective if the data is clustered, but not if the data is "smeared"
Can have hierarchical clustering and be stored in multi-dimensional index tree structures
There are many choices of clustering definitions and clustering algorithms

Sampling
Sampling: obtaining a small sample s to represent the whole data set N
Allows a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
Key principle: choose a representative subset of the data
  Simple random sampling may have very poor performance in the presence of skew
  Develop adaptive sampling methods, e.g., stratified sampling
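A rough sketch of clustering-based reduction: k-means (scikit-learn) compresses an invented data set to a handful of cluster representatives (centroids), plus the count of points each centroid stands for; the data and the choice of 5 clusters are arbitrary:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
data = rng.normal(size=(10_000, 2))          # invented 2-D data set

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(data)

# store only the cluster representatives instead of all 10,000 points
representatives = kmeans.cluster_centers_
counts = np.bincount(kmeans.labels_)         # how many points each centroid represents
print(representatives.round(2))
print(counts)
```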
Types of Sampling
Simple random sampling
  There is an equal probability of selecting any particular item
Sampling without replacement
  Once an object is selected, it is removed from the population
Sampling with replacement
  A selected object is not removed from the population
Stratified sampling
  Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)
  Used in conjunction with skewed data

Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
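A minimal pandas sketch of the sampling types listed above, applied to an invented, skewed customer table; the 10% sampling fraction is an arbitrary choice for the example:

```python
import pandas as pd

# invented skewed data: many 'regular' customers, few 'premium' ones
df = pd.DataFrame({
    "customer_id": range(1000),
    "segment": ["regular"] * 950 + ["premium"] * 50,
})

# simple random sampling without replacement (SRSWOR)
srswor = df.sample(frac=0.1, replace=False, random_state=0)

# simple random sampling with replacement (SRSWR)
srswr = df.sample(frac=0.1, replace=True, random_state=0)

# stratified sampling: draw ~10% from each segment so the rare group is represented
stratified = df.groupby("segment", group_keys=False).sample(frac=0.1, random_state=0)

print(srswor["segment"].value_counts())
print(stratified["segment"].value_counts())
```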
Discretization
Three types of attributes:
  Nominal: values from an unordered set
  Ordinal: values from an ordered set
  Continuous: real numbers
Discretization: divide the range of a continuous attribute into intervals
  Some classification algorithms only accept categorical attributes
  Reduces data size
  Prepares the data for further analysis

Discretization and Concept Hierarchy
Discretization: reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
Concept hierarchies: reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) by higher-level concepts (such as young, middle-aged, or senior).
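As a tiny illustration of a concept hierarchy for a numeric attribute, the sketch below maps raw ages to the higher-level concepts young / middle-aged / senior; the cut-off values and the sample ages are assumptions for the example:

```python
def age_concept(age: int) -> str:
    """Map a raw age to a higher-level concept (cut-offs are arbitrary)."""
    if age < 35:
        return "young"
    elif age < 60:
        return "middle-aged"
    else:
        return "senior"

ages = [22, 38, 45, 61, 70, 29]
print([age_concept(a) for a in ages])
# ['young', 'middle-aged', 'middle-aged', 'senior', 'senior', 'young']
```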
Discretization for numeric data
Binning
Histogram analysis
Clustering analysis

Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
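A short sketch of binning as a discretization method, using pandas on invented prices: equal-width bins come from pd.cut and equal-frequency (equal-depth) bins from pd.qcut, and the three interval labels then replace the raw values:

```python
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])   # invented values

# equal-width binning into 3 intervals; labels replace the actual values
binned_width = pd.cut(prices, bins=3, labels=["low", "medium", "high"])
print(binned_width.tolist())

# equal-depth (equal-frequency) binning into 3 intervals
binned_depth = pd.qcut(prices, q=3, labels=["low", "medium", "high"])
print(binned_depth.tolist())
```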
Summary
Data preparation is a big issue for both warehousing and mining
Data preparation includes:
  Data cleaning and data integration
  Data reduction and feature selection
  Discretization
A lot of methods have been developed, but this is still an active area of research

Knowledge Discovery
Here is the list of steps involved in the knowledge discovery process:
  Data Cleaning
  Data Integration
  Data Selection
  Data Transformation
  Data Mining
  Pattern Evaluation
  Knowledge Presentation
Data Cleaning: In this step, the noise and inconsistent data are removed.
Data Integration: In this step, multiple data sources are combined.
Data Selection: In this step, data relevant to the analysis task are retrieved from the database.
Data Transformation: In this step, data is transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
Data Mining: In this step, intelligent methods are applied in order to extract data patterns.
Pattern Evaluation: In this step, data patterns are evaluated.
Knowledge Presentation: In this step, knowledge is represented.

Data Integration
Data Integration is a data preprocessing technique that merges the data from multiple heterogeneous data sources into a coherent data store. Data integration may involve inconsistent data and therefore needs data cleaning.

Data Cleaning
Data cleaning is a technique that is applied to remove the noisy data and correct the inconsistencies in data. Data cleaning involves transformations to correct wrong data. Data cleaning is performed as a data preprocessing step while preparing the data for a data warehouse.
Data Selection
Data Selection is the process where data relevant to the analysis task are retrieved from the database. Sometimes data transformation and consolidation are performed before the data selection process.

Clusters
A cluster refers to a group of similar kinds of objects. Cluster analysis refers to forming groups of objects that are very similar to each other but are highly different from the objects in other clusters.

Data Transformation
In this step, data is transformed or consolidated into forms appropriate for mining, by performing summary or aggregation operations.

Knowledge Discovery Process
Data mining is the core of the knowledge discovery process.
[Figure: the knowledge discovery process flow — Databases → Data Cleaning and Data Integration → Selection → Task-relevant / Preprocessed Data → Data transformations → Data Mining → Interpretation → Knowledge.]