
Ten Data Analysis Tools You Can't Afford to Be Without

Neil W. Polhemus, CTO, StatPoint Technologies, Inc.
George H. Dyson, Director of Six Sigma Services
Copyright 2010 by StatPoint Technologies, Inc.

Business Improvement Objectives


Businesses must create value. This output must be greater than the inputs needed to produce it. If the output meets the customers' needs, the business is effective. If the business creates added value with minimum resources, the business is efficient.

The Role of Six Sigma is to Help a Business Produce the Maximum Value While Using Minimum Resources
(Pyzdek 2003)

Six Sigma Business Successes


Cost reductions
Productivity improvements
Market-share growth
Customer relations improvements
Defect reductions
Culture changes
Product and service improvements
Cycle-time reductions
(CSSBB Primer, 2001) / (Pande, 2000)

All these Successes have a common thread.

DATA!!!
Data drives analysis.
Analysis uses statistical models and tools of all kinds:
To extract meaningful information
To uncover signals in the presence of noise
To understand the past
To monitor the present
To forecast the future
This understanding results in reduced costs and saved money.
Deming: "Doesn't anyone care about Profit?"

Examples
Product comparisons
Survey analysis
Distribution fitting
Comparison of multiple samples
Outlier detection
Curve fitting
Response surface modeling
Time series forecasting
Event rate modeling
Interactive maps

Problem #1 Product Comparisons


Consumer Reports 2010: Sedans (Family, Upscale, Luxury)
7 variables:
Price (dollars)
Road-test score (0-100)
Predicted reliability (1-5)
Owner satisfaction (1-5)
Owner cost (1-5)
Safety (1-5)
Fuel economy (overall mpg)

Data file: cars.sgd (n=30)

Multivariate Visualization
The data consist of 7 quantitative variables plus one categorical factor. Each car is thus a point in 7-dimensional space with one other feature. How can we visualize what's going on?
1. 2-D scatterplot
2. 3-D scatterplot
3. Scatterplot matrix
4. Bubble chart
5. Parallel coordinates plot
6. Star glyphs
7. Chernoff faces
8. Radar/spider plot

2-D Scatterplot
Useful for plotting 2 dimensions.
[Figure: Plot of MPG vs Price, with points coded by Class (Family, Luxury, Upscale). Axes: Price (X 1000), MPG.]

3-D Scatterplot
Useful for plotting 3 dimensions.
[Figure: Plot of MPG vs Price and Road Test, with points coded by Class (Family, Luxury, Upscale). Labeled points: Toyota Camry Hybrid, Nissan Altima Hybrid. Axes: Price (X 1000), Road Test, MPG.]

Bubble Chart
Uses size of bubble to illustrate a third dimension.
[Figure: Bubble Chart for Reliability, with bubbles coded by Class (Family, Luxury, Upscale). Axes: Price (X 1000), MPG.]

Scatterplot Matrix
Plots all pairs of variables.
[Figure: Scatterplot matrix of Price, Road Test, Reliability, Owner Satisfaction, Owner Cost, Safety, and MPG, with points coded by Class (Family, Luxury, Upscale).]

Parallel Coordinates Plot


Each case is shown as a line connecting the values of the variables.
[Figure: Parallel Coordinates Plot, with lines coded by Class (Family, Luxury, Upscale). Vertical axis scaled from 0 to 1. Variables: Price, Road Test, Reliability, Owner Satisfaction, Owner Cost, Safety, MPG.]
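The vertical axis of the parallel coordinates plot runs from 0 to 1, which suggests that each variable is rescaled before plotting. A minimal sketch of one common way to do this, min-max scaling, is shown below. The scaling rule, the function name, and the three sample rows are illustrative assumptions; they are not taken from cars.sgd or from the software used in the slides.

```python
def min_max_scale(rows):
    """Rescale each column of `rows` to [0, 1] (assumed scaling rule)."""
    columns = list(zip(*rows))
    scaled_columns = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = hi - lo
        # A constant column carries no information; map it to 0.5.
        scaled_columns.append(
            [0.5 if span == 0 else (v - lo) / span for v in col]
        )
    return [list(r) for r in zip(*scaled_columns)]


# Hypothetical cases: (price in dollars, road-test score, overall mpg)
cars = [(24000, 80, 34), (44000, 90, 23), (64000, 98, 20)]
scaled = min_max_scale(cars)
for case in scaled:
    print(case)
```

Each scaled row gives the heights at which one car's line crosses the variable axes.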

Star Glyphs
Each case is shown as a polygon with vertices scaled by the variables.

[Figure: star glyph key. Vertices: 1/Price, Owner Cost, MPG, Safety, Road Test, Owner Satisfaction, Reliability.]

Star Glyphs
Variables are scaled so that larger area is better.
[Figure: star glyphs for the 30 sedans, coded by Class (Family, Luxury, Upscale).]
Models shown: Nissan Altima, Honda Accord, Toyota Camry XLE, Volkswagen Passat, Toyota Camry Hybrid, Hyundai Sonata, Chevrolet Malibu, Ford Fusion, Mercury Milan, Nissan Altima Hybrid, Ford Taurus, Kia Optima, Chevrolet Impala, Lexus ES, Toyota Avalon, Acura TL, Hyundai Azera, Lincoln MKZ, Buick Lucerne, Volvo S60, Infiniti M35 AWD, Infiniti M35, Audi A6, Acura RL, BMW 535i, Cadillac STS, Cadillac DTS, Lexus GS 300, Lexus GS 450 Hybrid, Volvo S80

Chernoff Faces
Each variable is assigned to a different feature of the faces.
[Figure: Chernoff faces for each of the 30 sedans, coded by Class (Family, Luxury, Upscale).]

Feature Assignments
Unfortunately, some features have a greater impact than others.


Radar/Spider Plot
Good for comparing a small number of cases.
[Figure: Radar/Spider Plot comparing four models (Volkswagen Passat, Chevrolet Impala, Lincoln MKZ, Cadillac STS). Axes: MPG (15.0-25.0), Safety (0.0-5.0), Owner Cost (0.0-5.0), Road Test (50.0-100.0), Owner Satisfaction (0.0-5.0), Reliability (0.0-5.0), Price (25000.0-55000.0).]

Problem #2: Survey Analysis


The most commonly used statistical procedure is the calculation of a two-way table (a tabulation of responses that can be classified in 2 ways).
Such crosstabulations result in contingency tables that provide much useful information.
Example: On January 18-19, 2010, Rasmussen Reports asked 1000 likely voters: "How would you rate the U.S. healthcare system?"

Data file: healthcare.sgd


Mosaic Plot
Scales the area of each bar according to the counts in the table.


Chi-square Test
Tests whether the row and column classifications are independent; a small P-value indicates a lack of independence.
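As a rough illustration of what the chi-square test computes, the sketch below builds the expected counts under independence and sums the standardized squared differences. The function name and the 2 x 2 counts are hypothetical; they are not the Rasmussen survey results in healthcare.sgd.

```python
def chi_square_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical counts: rows = respondent groups, columns = ratings
counts = [[10, 20], [20, 10]]
stat, dof = chi_square_statistic(counts)
print(round(stat, 3), dof)
```

The statistic is then compared with a chi-square distribution with `dof` degrees of freedom to obtain the P-value; that step is left out here because it needs a distribution function that the Python standard library does not provide.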


Correspondence Analysis
Used to help visualize the important information in two-way tables.


Problem #3: Distribution Fitting


In many studies, determining the distribution of a quantitative variable is critical.
Common examples covered in Six Sigma include capability studies.
Distribution fitting is also quite critical in many design problems.

Data file: waves.sgd (n=26,304)


Source: www.iahr.net


Frequency Histogram
Shows the number of observations in non-overlapping intervals.
[Figure: Histogram of Height. Axes: Height (0 to 12), frequency (0 to 2400).]
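The counting behind a frequency histogram can be sketched in a few lines. The function name, the interval limits, and the sample heights below are hypothetical; they are not taken from waves.sgd.

```python
def histogram_counts(values, lower, upper, bins):
    """Count observations in `bins` equal-width intervals on [lower, upper]."""
    width = (upper - lower) / bins
    counts = [0] * bins
    for v in values:
        if lower <= v <= upper:
            # The top edge belongs to the last interval.
            index = min(int((v - lower) / width), bins - 1)
            counts[index] += 1
    return counts


# Hypothetical wave heights
heights = [0.5, 1.2, 1.8, 2.1, 2.2, 3.9, 4.0, 7.5, 12.0]
counts = histogram_counts(heights, lower=0, upper=12, bins=6)
print(counts)
```

Each observation falls in exactly one interval, so the counts sum to the number of observations inside the plotted range.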

Normal Distribution
The normal distribution is a poor model for these data.
[Figure: Histogram for Height with fitted Normal distribution. Axes: Height, frequency (X 1000).]

Comparison of Distributions
Fits many distributions and sorts them by goodness of fit.


Lognormal Distribution
The 3-parameter lognormal distribution is much better.
[Figure: Histogram for Height with fitted distributions: Largest Extreme Value, Lognormal (3-Parameter), Normal. Axes: Height, frequency.]

Problem #4: Multiple Samples


Data are frequently obtained from more than one sample.
Asserting a significant difference between the samples (or lack thereof) is an important application of data analysis.

Data file: thickness.sgd (n=480)


Box-and-Whisker Plots
A very useful plot for comparing samples (from John Tukey).
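[Figure: annotated box-and-whisker plot of Length (axis from 54 to 64).]
Box: 25th percentile to 75th percentile (the IQR), with a line at the median (50th percentile).
Upper whisker: extends to the maximum observation within 1.5 x IQR above the 75th percentile.
Lower whisker: extends to the minimum observation within 1.5 x IQR below the 25th percentile.
Outlier: any point more than 1.5 x IQR below the 25th percentile or more than 1.5 x IQR above the 75th percentile.
A plus sign may be added to show the sample mean.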
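The 1.5 x IQR rule that defines the whiskers and outliers can be sketched as follows. Quartile conventions differ between packages, so the "inclusive" method used here is an assumption and may not match the software behind the slides; the function name and sample values are also hypothetical.

```python
import statistics


def box_plot_summary(values, k=1.5):
    """Quartiles, whisker ends, and outliers using the k x IQR rule."""
    data = sorted(values)
    # Assumed quartile convention; other packages may interpolate differently.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Hypothetical measurements, not taken from thickness.sgd
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(summary)
```

The whiskers stop at the most extreme observations that still lie inside the fences, so a single wild value does not stretch the plot.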


Notched Box-and-Whisker Plots


Non-overlapping notches indicate significantly different medians.


HSD Intervals
Allow pairwise comparison of all level means.


Problem #5: Outlier Detection


Many data sets contain aberrant observations that don't come from the same distribution as the others.
Identifying outliers and treating them separately often results in better models.

Data file: bodytemp.sgd (n=130)


Outlier Plot
Shows each data value with lines at 1, 2, 3 and 4-sigma.


Grubbs' Test
Small P-value indicates that the extreme Studentized deviate (ESD) is highly unusual.
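The extreme Studentized deviate is the largest absolute deviation from the sample mean, divided by the sample standard deviation. A minimal sketch of that calculation follows; the function name and readings are hypothetical (not from bodytemp.sgd), and the P-value step is omitted because it requires Student's t critical values.

```python
import statistics


def grubbs_statistic(values):
    """Return (G, suspect value): the largest absolute Studentized deviate."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    suspect = max(values, key=lambda v: abs(v - mean))
    return abs(suspect - mean) / sd, suspect


# Hypothetical readings
g, suspect = grubbs_statistic([1, 2, 3, 4, 10])
print(round(g, 4), suspect)
```

A large value of G relative to its reference distribution is what produces the small P-value mentioned above.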


Problem #6: Curve Fitting


A common data analysis problem involves determining the relationship between a response variable Y and a predictor variable X.
If we can estimate a model where Y = f(X), then we can use that model to make predictions.

Data file: chlorine.sgd (n=44)


Simple Linear Regression


Fits a linear model of the form Y = mX + b.
[Figure: Plot of Fitted Model, chlorine = 0.48551 - 0.00271679*weeks, with 95% prediction limits for new observations and 95% confidence limits for the mean. Axes: weeks, chlorine.]

Comparison of Alternative Models


Fits many transformable nonlinear models and sorts by R-Squared.


Reciprocal X Model
A nonlinear model of the form Y = m/X + b is much better.
[Figure: Plot of Fitted Model, chlorine = 0.368053 + 1.02553/weeks. Axes: weeks, chlorine.]
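A model of the form Y = m/X + b is called transformable because it becomes linear once X is replaced by 1/X, so ordinary least squares can fit it. The sketch below illustrates the idea with hand-written helper functions (hypothetical names) and synthetic, noise-free data generated from the fitted equation shown above; these are not the observations in chlorine.sgd.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + b; returns (m, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x


def fit_reciprocal(xs, ys):
    """Fit y = m/x + b by regressing y on the transformed predictor 1/x."""
    return fit_line([1.0 / x for x in xs], ys)


# Synthetic data generated from the fitted equation (no noise added)
weeks = [8, 10, 16, 24, 40]
chlorine = [0.368053 + 1.02553 / w for w in weeks]
m, b = fit_reciprocal(weeks, chlorine)
print(round(m, 5), round(b, 6))
```

Because the data are noise-free, the fit recovers the coefficients used to generate them; with real measurements the estimates would carry sampling error.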


Problem #7: Response Surfaces


The MISSION: air dominance at the lowest possible price.
Use optimization models, built from performance data, to design the best aircraft engine.
Note: the data have been altered and are for demonstration purposes only.

Problem Statement
Optimize 3 response variables:
Minimize total fleet acquisition cost (Y1)
Maximize climb rate (Y2)
Maximize launch rate (Y3)
Input factors:
X1: Fan Pressure Ratio: 3.9 to 4.7
X2: Overall Pressure Ratio: 34 to 40
X3: Inlet airflow: 240 to 270 pps

Data file: engines.sgx (n=584)


Design Plot
Shows the location of the historical data within the factor space.
[Figure: Experimental Region, a 3-D plot of the historical data. Axes: FPR (3.9 to 4.7), OPR (34 to 40), Air Flow (240 to 270).]

Standardized Pareto Charts


Show the significant factors affecting each response.
[Figure: Standardized Pareto Chart for Y1 Cost. Effects as listed: C:Air Flow, A:FPR, B:OPR, AB, BB, BC, AA, CC, AC. Axis: Standardized effect. Bars coded by sign (+ or -).]
[Figure: Standardized Pareto Chart for Y2 Ps. Effects as listed: C:Air Flow, A:FPR, B:OPR, AA, BC, AB, CC, BB, AC. Axis: Standardized effect. Bars coded by sign (+ or -).]
[Figure: Standardized Pareto Chart for Y3 Launch. Effects as listed: C:Air Flow, A:FPR, B:OPR, AA, BC, AB, BB, CC, AC. Axis: Standardized effect. Bars coded by sign (+ or -).]

Desirability Function
Quantifies the desirability of a joint response (Y1, Y2, Y3).
For responses to be minimized:

For responses to be maximized:

Combined desirability: D = d(Y1) * d(Y2) * d(Y3)
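The formulas for the individual desirability functions are not present in the text above, so the sketch below assumes simple linear ramps between a lower and an upper limit, in the style of Derringer and Suich. Only the product form of the combined desirability comes from the slide; the function names, limits, and response values are hypothetical and not taken from engines.sgx.

```python
def d_minimize(y, low, high):
    """Assumed linear ramp: 1 at or below `low`, 0 at or above `high`."""
    if y <= low:
        return 1.0
    if y >= high:
        return 0.0
    return (high - y) / (high - low)


def d_maximize(y, low, high):
    """Assumed linear ramp: 0 at or below `low`, 1 at or above `high`."""
    if y <= low:
        return 0.0
    if y >= high:
        return 1.0
    return (y - low) / (high - low)


def combined_desirability(d_values):
    """Product of the individual desirabilities, as stated on the slide."""
    result = 1.0
    for d in d_values:
        result *= d
    return result


# Hypothetical responses and limits
d1 = d_minimize(120.0, low=100.0, high=200.0)   # cost, Y1
d2 = d_maximize(750.0, low=500.0, high=1000.0)  # climb rate, Y2
d3 = d_maximize(9.0, low=0.0, high=10.0)        # launch rate, Y3
D = combined_desirability([d1, d2, d3])
print(d1, d2, d3, D)
```

With a product, any response whose desirability is 0 forces the combined value to 0, so no single response can be sacrificed entirely.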



Optimal Conditions
Found at the levels shown below:


Response Surface
Shows the estimated desirability throughout the experimental region.
[Figure: Desirability Plot at Air Flow=254.697, with desirability shaded from 0.0 to 1.0. Axes: FPR, OPR, Air Flow.]

Problem #8: Time Series Data


Data recorded at equally spaced points in time is called a time series.
Time series models are used for various purposes:
Analysis of trends and seasonal effects
Forecasting
Control
Autocorrelation between adjacent observations requires special models.

Data file: customers.sgd (n=168)


Time Sequence Plot


Plots the data versus time.
[Figure: Time Series Plot for Customers. Axes: Month (1/96 to 1/11), Customers (X 1000).]

Autocorrelation Function
Estimates the correlation between observations at different lags.
[Figure: Estimated Autocorrelations for Customers. Axes: lag (0 to 36), Autocorrelations (-1 to 1).]
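The sample autocorrelation at lag k compares each observation with the one k steps earlier, after removing the mean. A minimal sketch follows; the function name and the short series are hypothetical, not taken from customers.sgd.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation at a given lag (lag 0 returns 1.0)."""
    n = len(series)
    mean = sum(series) / n
    deviations = [v - mean for v in series]
    c0 = sum(d * d for d in deviations)
    ck = sum(deviations[t] * deviations[t + lag] for t in range(n - lag))
    return ck / c0


# Hypothetical series
series = [1, 2, 3, 4, 5]
acf = [autocorrelation(series, k) for k in range(3)]
print(acf)
```

For monthly data with a yearly pattern, peaks would be expected near lags 12, 24, and 36, which is why the plot above extends to lag 36.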


Seasonal Decomposition
Shows the average value during each season (scaled to 100).
[Figure: Seasonal Index Plot for Customers. Axes: season (1 to 12), seasonal index (90 to 114).]
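To make the idea of a seasonal index scaled to 100 concrete, the sketch below uses a deliberately simplified calculation: the mean of each season as a percentage of the overall mean. This is an assumption for illustration only. It ignores trend, and decomposition procedures in statistical packages usually rely on ratios to a centered moving average instead. The function names and the quarterly series are hypothetical, not taken from customers.sgd.

```python
def simple_seasonal_indices(series, period):
    """Mean of each season as a percentage of the overall mean (simplified)."""
    overall_mean = sum(series) / len(series)
    indices = []
    for season in range(period):
        values = series[season::period]
        indices.append(100.0 * (sum(values) / len(values)) / overall_mean)
    return indices


def seasonally_adjust(series, indices):
    """Divide each observation by its seasonal index (rescaled from 100)."""
    period = len(indices)
    return [v / (indices[t % period] / 100.0) for t, v in enumerate(series)]


# Hypothetical quarterly series with no trend
series = [90, 110, 100, 100, 90, 110, 100, 100]
indices = simple_seasonal_indices(series, period=4)
adjusted = seasonally_adjust(series, indices)
print(indices)
print(adjusted)
```

An index above 100 marks a season that runs above average; dividing by the index removes that effect, which is the idea behind seasonally adjusted data.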


Seasonal Subseries Plot


Shows the seasonal averages and trend within each season.
[Figure: Seasonal Subseries Plot for Customers. Axes: Season, Customers (X 1000).]

Annual Subseries Plot


Shows the seasonal effect separately for each cycle.
[Figure: Annual Subseries Plot for Customers, one line per Cycle (1996 through 2009). Axes: Season, Customers (X 1000).]

Seasonally Adjusted Data


Removes the seasonal effects from the data.
[Figure: Seasonally Adjusted Data Plot for Customers. Axes: Month (1/96 to 1/11), seasonally adjusted (X 1000).]

Automatic Forecasting
Fits many models and automatically selects the best.


ARIMA Model
Selected model is a rather complicated seasonal ARIMA model.
[Figure: Time Sequence Plot for Customers with model ARIMA(2,1,1)x(1,1,2)12, showing actual values, forecast, and 95.0% limits. Axes: Month (1/96 to 1/12), Customers (X 1000).]

Problem #9: Event Rate Modeling


When the data to be analyzed consist of the times at which events occur, the process by which those events are generated is called a point process.
The most common type of point process is a Poisson process, in which the times between events are independent and follow a negative exponential distribution.
If the event rate is constant, the process is called homogeneous. If the event rate changes over time, then the process is called nonhomogeneous.

Data file: earthquakes.sgd (n=52)


Plot of Magnitude Versus Date


Shows an apparent increase in magnitude over the sampling period.
[Figure: Plot of Magnitude vs Date. Axes: Date (1/1/08 to 12/31/10), Magnitude (5.4 to 8.4).]

Point Process Plot


Shows the dates of occurrence only.
[Figure: Events Plot for Date, marking each earthquake occurrence. Axis: Time or distance (1/1/08 to 12/31/10).]

Event Rate
The critical parameter in a point process is the rate of events per unit time (such as earthquakes per year). This parameter is usually called λ.
The rate parameter is also related to the mean time between events, with MTBE = 1/λ.
λ may be constant (a homogeneous process) or vary over time (a nonhomogeneous process).
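For a homogeneous process, the rate can be estimated as the number of events divided by the length of the observation window, and the mean time between events is its reciprocal. A minimal sketch, with a hypothetical function name and hypothetical event times (not from earthquakes.sgd):

```python
def estimate_rate(event_times, observation_length):
    """Return (rate, MTBE) for a homogeneous process observed over [0, T]."""
    rate = len(event_times) / observation_length
    return rate, 1.0 / rate


# Hypothetical event times, in years since the start of observation
events = [0.2, 0.5, 0.9, 1.4, 1.6, 2.1, 2.8, 2.9]
rate, mtbe = estimate_rate(events, observation_length=3.0)
print(round(rate, 3), round(mtbe, 3))
```

Eight events in three years give a rate of about 2.667 events per year and an MTBE of 0.375 years.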


Cumulative Events Plot


The slope of the line is related to the event rate.
[Figure: Cumulative Events Plot for Date, with mean line. Axes: Time or distance (1/1/08 to 12/31/10), Number of events.]

Trend Test
A small P-value would indicate a significant trend.
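The slide does not say which trend test is used. One common choice for event times observed over a fixed window is the Laplace test, sketched below as an assumption rather than as the procedure behind the slide. The function name and event times are hypothetical.

```python
import math
from statistics import NormalDist


def laplace_trend_test(event_times, observation_length):
    """Laplace test for events observed on [0, T]; returns (U, two-sided P)."""
    n = len(event_times)
    mean_time = sum(event_times) / n
    # Under a constant rate, event times are uniform on [0, T],
    # so their mean should be close to T / 2.
    u = (mean_time - observation_length / 2) / (
        observation_length * math.sqrt(1.0 / (12 * n))
    )
    p_value = 2 * (1 - NormalDist().cdf(abs(u)))
    return u, p_value


# Hypothetical event times (days); events cluster late in the window
late = [60, 70, 75, 80, 85, 90, 92, 95, 97, 99]
u, p = laplace_trend_test(late, observation_length=100)
print(round(u, 2), round(p, 4))
```

A positive U means events are concentrated late in the window (an increasing rate), and a negative U means they are concentrated early.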


Other Uses
Point process models are also very useful for estimating failure rates.
[Figure: Events Plot for Aircraft 1 through Aircraft 13. Axis: Time or distance (0 to 2500).]

Between Group Comparisons


Tests can be made to determine whether there are significant differences.
[Figure: Cumulative Events Plot for Aircraft 1 through Aircraft 13, with mean line. Axes: Time or distance (0 to 2500), Number of events (0 to 30).]

Problem #10: Interactive Maps


Graphics that allow the user to interact with the data are extremely useful.
Maps are one important example.

Data file: census2000.sgd (n=51)


U.S. Map Statlet


The slider changes the cutoff between the red and blue states.
