
Data Cleaning

Tips for making your data suitable for analysis


Robyn R. Raszkowski

Why clean your data?


The screening process serves two purposes:
1. Detect errors, such as missing data and outliers
2. Make sure the data meet the assumptions for analysis, such as normality

2 Types of Screening
1. Preliminary data screening
   Screen one variable at a time on the entire data set before any analysis (today's focus)
   Steps:
     1. Check for missing data
     2. Check for normality
     3. Remove outliers
     4. Check for normality again
     5. Transform data

2. In conjunction with statistical analysis
   Dependent on the analysis being performed

Step 1: Check for missing data

Explore in SPSS:
Analyze > Descriptive Statistics > Explore
Under Plots, select Histogram and Normality plots with tests

Case Processing Summary
                                              Cases
                              Valid          Missing           Total
                            N   Percent     N   Percent     N   Percent
Starbucks Retail Stores    50   100.0%      0      .0%     50   100.0%
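Outside SPSS, the same valid/missing tally could be sketched in Python as follows. This is only an illustration: the `store_counts` values are made up (they are not the Starbucks data), the function name is hypothetical, and missing entries are assumed to be coded as `None`.

```python
# Hypothetical values; None marks a missing entry (an assumed coding).
store_counts = [4, 89, 152, None, 212, 2372, 35, None, 120, 60]


def case_summary(values):
    """Return counts and percentages of valid and missing cases."""
    total = len(values)
    missing = sum(1 for v in values if v is None)
    valid = total - missing
    return {
        "valid_n": valid,
        "valid_pct": 100.0 * valid / total,
        "missing_n": missing,
        "missing_pct": 100.0 * missing / total,
        "total_n": total,
    }


summary = case_summary(store_counts)
print(summary)
```

A data set with no missing entries would report 0 missing cases and 100% valid, which is the situation shown in the Case Processing Summary.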

Step 2: Check for normality

Still using the output from Explore in SPSS, look at:
1. Descriptives table
2. Tests of Normality table
3. Histogram
4. Box plot

1. Descriptives table

Descriptives: Starbucks Retail Stores
                                              Statistic   Std. Error
Mean                                             206.84       50.642
95% Confidence Interval     Lower Bound          105.07
  for Mean                  Upper Bound          308.61
5% Trimmed Mean                                  152.84
Median                                            89.00
Variance                                       128230.7
Std. Deviation                                  358.093
Minimum                                               4
Maximum                                            2372
Range                                              2368
Interquartile Range                                 212
Skewness                                          4.795         .337
Kurtosis                                         27.878         .662

1. Descriptives table: skewness
z score = skewness / standard error of skewness
4.795 / .337 = 14.23
Positive skew!
At alpha = .05, the critical value is z = +/-1.96, so a z score beyond +/-1.96 indicates significant skew.
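The z score calculation can be sketched in Python. The standard error function below uses the standard formula for the standard error of skewness, which gives about .337 for n = 50; the function names are illustrative and not part of SPSS.

```python
import math


def skewness_se(n):
    """Standard error of skewness for a sample of size n."""
    return math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def skewness_z(skewness, se):
    """z score for skewness: the statistic divided by its standard error."""
    return skewness / se


se_50 = skewness_se(50)           # standard error for n = 50
z = skewness_z(4.795, 0.337)      # values taken from the Descriptives table
significant = abs(z) > 1.96       # two-tailed comparison at alpha = .05
print(round(se_50, 3), round(z, 2), significant)
```

The same two functions apply after each later step; only the skewness value and the sample size change.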

2. Tests of Normality table

Tests of Normality
                            Kolmogorov-Smirnov(a)          Shapiro-Wilk
                           Statistic   df   Sig.      Statistic   df   Sig.
Starbucks Retail Stores       .286     50   .000         .500     50   .000
a. Lilliefors Significance Correction

If Sig. is less than .05, the data are not normally distributed.

3. Histogram

4. Box plot

Step 3: Remove outliers

One option: remove the data points highlighted in the box plot
Not the best method
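For reference, box plots conventionally flag points that lie more than 1.5 interquartile ranges beyond the quartiles. A rough Python sketch of that rule follows. The data are invented, the function name is hypothetical, and the quartile method used by `statistics.quantiles` may differ slightly from the hinges SPSS computes.

```python
import statistics


def boxplot_outliers(values, k=1.5):
    """Return values lying more than k * IQR beyond the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]   # made-up data
flagged = boxplot_outliers(sample)
print(flagged)
```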

Schweinle Method
Remove data that are more than 2.5 SD from the mean

1. Multiply the SD by 2.5: 358.093 x 2.5 = 895.23
2. Add that value to the mean: 895.23 + 206.84 = 1102.07
Remove any values above 1102.07

SPSS: Data > Select Cases
Select "If condition is satisfied" and enter: Variable <= 1102.07
Click Continue and OK
SPSS will not analyze data that are over 1102.07
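The cutoff rule can be sketched in Python as below. The sample values are hypothetical and the function names are illustrative. The mean and standard deviation passed in the last call are the ones reported in the Descriptives table (206.84 and 358.093). Like the slides, the filter only trims the high end.

```python
import statistics


def sd_cutoff(mean, sd, multiplier=2.5):
    """Upper cutoff: the mean plus `multiplier` standard deviations."""
    return mean + multiplier * sd


def keep_at_or_below(values, cutoff):
    """Mirror the SPSS condition `Variable <= cutoff`."""
    return [v for v in values if v <= cutoff]


# Made-up sample with one extreme value.
sample = [4, 35, 60, 89, 120, 152, 212, 300, 410, 5000]
cutoff = sd_cutoff(statistics.mean(sample), statistics.stdev(sample))
trimmed = keep_at_or_below(sample, cutoff)

# Cutoff computed from the reported mean and standard deviation.
reported_cutoff = sd_cutoff(206.84, 358.093)
print(round(cutoff, 2), trimmed, round(reported_cutoff, 2))
```

Filtering into a new list, rather than deleting values in place, keeps the original data available for the later comparison of results with and without the outlier.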

Step 4: Check for normality again!

z score: 1.613 / .34 = 4.74
Still has positive skew, but better than the z score from Step 2!
These are still significant too:

Tests of Normality
                            Kolmogorov-Smirnov(a)          Shapiro-Wilk
                           Statistic   df   Sig.      Statistic   df   Sig.
Starbucks Retail Stores       .186     49   .000         .802     49   .000
a. Lilliefors Significance Correction

Step 5: Transform data

[Figure: positive skew and negative skew, with the square root marked "Start here"]

Square root
SPSS: Transform > Compute Variable
1. Target Variable: enter a new name (ex: sqrt)
2. Click on Arithmetic under Function group
3. Click on Sqrt under Functions and Special Variables
4. Click on the up arrow to bring SQRT(?) to the Numeric Expression box
5. Highlight the variable to be transformed and click the right arrow to replace the (?)
6. Explore the data again to check for normality

Step 5: Transform data and explore (sqrt)

z score: .76 / .34 = 2.24
Still has positive skew, but better than the 4.74 from Step 4!
These are still significant too:

Tests of Normality
              Kolmogorov-Smirnov(a)          Shapiro-Wilk
             Statistic   df   Sig.      Statistic   df   Sig.
sqrt            .144     49   .013         .927     49   .005
a. Lilliefors Significance Correction

Step 5: Transform data (log10)

[Figure: positive skew and negative skew, with the earlier transformation marked "Start here" and log10 marked "Now this"]
Remind yourself: this is fun!

SPSS: Transform > Compute Variable
1. Click the Reset button to clear
2. Target Variable: enter a new name (ex: log10)
3. Click on Arithmetic under Function group
4. Click on Lg10 under Functions and Special Variables
5. Click on the up arrow to bring LG10(?) to the Numeric Expression box
6. Highlight the variable to be transformed and click the right arrow to replace the (?)
7. Explore the data again to check for normality

Step 5: Transform data and explore (log10)

z score: -.225 / .34 = -.66
Finally! Its absolute value is less than 1.96!
Finally, these are not significant:

Tests of Normality
              Kolmogorov-Smirnov(a)          Shapiro-Wilk
             Statistic   df   Sig.      Statistic   df   Sig.
log10           .107     49   .200*        .969     49   .217
*. This is a lower bound of the true significance.
a. Lilliefors Significance Correction
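The effect of the two transformations on skew can be sketched in Python. The data below are invented, right-skewed positive values (log10 is undefined for zero or negative numbers). `sample_skewness` is a hypothetical helper implementing the bias-adjusted sample skewness, assumed here to correspond to the statistic SPSS reports.

```python
import math
import statistics


def sample_skewness(values):
    """Bias-adjusted (Fisher-Pearson) sample skewness."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    cubed = sum(((v - mean) / sd) ** 3 for v in values)
    return n / ((n - 1) * (n - 2)) * cubed


# Made-up, positively skewed data (all values > 0).
raw = [4, 9, 16, 25, 40, 60, 90, 150, 300, 900]
sqrt_values = [math.sqrt(v) for v in raw]
log10_values = [math.log10(v) for v in raw]

for label, data in [("raw", raw), ("sqrt", sqrt_values), ("log10", log10_values)]:
    print(label, round(sample_skewness(data), 3))
```

For this sample the skewness shrinks from raw to square root to log10, the same ordering seen in the slides. Real data may need a different transformation, so each result should still be checked against the normality tests.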

The data are now normally distributed. Finally!

Keep in mind:
1. Do this with each dependent variable before analyzing data
2. Keep transformations consistent across all dependent variables
3. Although transformed data look pretty, they can be difficult to interpret
4. Run your analysis both with and without the transformation and compare the results

Great resource:
Mickey, R. M., Dunn, O. J., & Clark, V. A. (2004). Applied statistics: Analysis of variance and regression (3rd ed.). John Wiley & Sons, Inc.
Chapter 1: Data Screening
