4-Data Cleaning - Handout

Data Cleaning
Tips for making your data suitable for analysis

Robyn R. Raszkowski
Why clean your data?

Screening process
Detect errors
  Missing data
  Outliers
Make sure data meets assumptions for analysis
  Normality
2 Types of Screening
1. Preliminary data screening
   Screen one variable at a time on the entire data set before any analysis (today's focus)
   Steps
   1. Check for missing data
   2. Check for normality
   3. Remove outliers
   4. Check for normality again
   5. Transform data
2. In conjunction with statistical analysis
   Dependent on analysis being performed
Step 1 Check for missing data

Explore in SPSS: Analyze > Descriptive Statistics > Explore
Plots: Histogram, Normality plots with tests

Case Processing Summary
                          Valid            Missing          Total
                          N    Percent     N    Percent     N    Percent
Starbucks Retail Stores   50   100.0%      0    .0%         50   100.0%
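The same valid/missing count can be sketched outside SPSS. This is only an illustration: the function name `case_processing_summary` and the sample values are hypothetical, and missing entries are assumed to be coded as `None` or NaN.

```python
import math

def case_processing_summary(values):
    """Count valid and missing cases, similar to a Case Processing Summary."""
    total = len(values)
    missing = sum(
        1 for v in values
        if v is None or (isinstance(v, float) and math.isnan(v))
    )
    valid = total - missing

    def pct(k):
        # Percentage of all cases; 0.0 when the list is empty.
        return 100.0 * k / total if total else 0.0

    return {
        "valid_n": valid, "valid_pct": pct(valid),
        "missing_n": missing, "missing_pct": pct(missing),
        "total_n": total,
    }

# Hypothetical store counts (not the handout's data); None marks a missing entry.
stores = [4, 89, None, 212, 2372]
summary = case_processing_summary(stores)
print(summary)
```

A variable with 0 missing cases, as in the table above, would report `valid_pct` of 100.0.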
Step 2 Check for normality

Still using information from Explore in SPSS. Look at:
1. Descriptives table
2. Tests of Normality table
3. Histogram
4. Box plot

Step 2 Check for normality

1. Descriptives table

Descriptives: Starbucks Retail Stores
                                                  Statistic    Std. Error
Mean                                              206.84       50.642
95% Confidence Interval for Mean   Lower Bound    105.07
                                   Upper Bound    308.61
5% Trimmed Mean                                   152.84
Median                                            89.00
Variance                                          128230.7
Std. Deviation                                    358.093
Minimum                                           4
Maximum                                           2372
Range                                             2368
Interquartile Range                               212
Skewness                                          4.795        .337
Kurtosis                                          27.878       .662
Step 2 Check for normality

1. Descriptives table
z score = Skewness / Standard Error
4.795 / .337 = 14.23
Positive skew!
At α = .05, critical z = ±1.96
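The z-score check above is simple arithmetic, sketched here in Python. The helper names are hypothetical. The standard-error formula is the usual large-sample expression for the standard error of skewness; it is included only to show where a value like .337 for 50 cases could come from, on the assumption that SPSS uses the same expression.

```python
import math

Z_CRITICAL = 1.96  # two-tailed critical value at alpha = .05

def skewness_standard_error(n):
    """Standard error of sample skewness for n cases (requires n >= 3)."""
    return math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

def skewness_z(skewness, std_error):
    """z score for skewness: the statistic divided by its standard error."""
    return skewness / std_error

# Skewness and standard error as reported in the Descriptives table.
z = skewness_z(4.795, 0.337)
significant = abs(z) > Z_CRITICAL

print(round(z, 2), significant)
print(round(skewness_standard_error(50), 3))
```

A z score beyond ±1.96 means the skew is significant at α = .05; the sign gives the direction of the skew.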
Step 2 Check for normality

2. Tests of Normality table

Tests of Normality
                          Kolmogorov-Smirnov(a)        Shapiro-Wilk
                          Statistic   df   Sig.        Statistic   df   Sig.
Starbucks Retail Stores   .286        50   .000        .500        50   .000
a. Lilliefors Significance Correction

If significant (Sig. below .05) = not normally distributed
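To give a feel for what the Kolmogorov-Smirnov statistic measures, here is a minimal sketch that computes the largest gap between the data's empirical distribution and a normal curve fitted with the sample mean and SD. The function name and sample values are hypothetical. It does not reproduce the Sig. column: the Lilliefors correction relies on special tables or simulation, which this sketch leaves out.

```python
from statistics import NormalDist, fmean, stdev

def ks_statistic_vs_normal(values):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(fmean(xs), stdev(xs))  # mean and SD estimated from the data
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = fitted.cdf(x)
        # The empirical CDF steps from (i - 1) / n up to i / n at x.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d

symmetric = [-2, -1, 0, 1, 2]   # hypothetical, evenly spread around the mean
skewed = [1, 2, 3, 4, 100]      # hypothetical, one extreme high value

d_symmetric = ks_statistic_vs_normal(symmetric)
d_skewed = ks_statistic_vs_normal(skewed)
print(d_symmetric, d_skewed)
```

The larger the statistic, the further the data sit from the fitted normal curve, which is why the skewed sample scores higher here.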
Step 2 Check for normality

3. Histogram
[Figure: histogram]

Step 2 Check for normality

4. Box plot
[Figure: box plot]
Step 3 Remove outliers

Remove data points highlighted in box plot
Not the best method

Schweinle Method
Remove data that is more than 2.5 SD from the mean
1. SD x 2.5: 358.093 x 2.5 = 895.23
2. Add that value to the mean: 895.23 + 206.84 = 1102.07
Remove any values above 1102.07

SPSS: Data > Select Cases
Select "If condition is satisfied" and enter: Variable <= 1102.07
Click Continue and OK
SPSS will not analyze data that is over 1102.07
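The cutoff rule can be written out as a short sketch. It uses the mean (206.84) and the standard deviation reported in the Descriptives table (358.093). The function names and the sample list are hypothetical, and the filter only imitates what Select Cases does: values over the cutoff are left out of the analysis rather than deleted.

```python
def upper_cutoff(mean, sd, k=2.5):
    """Upper bound for keeping a case: mean plus k standard deviations."""
    return mean + k * sd

def keep_at_or_below(values, cutoff):
    """Keep only values that satisfy: value <= cutoff."""
    return [v for v in values if v <= cutoff]

# Mean and SD as reported in the Descriptives table.
cutoff = upper_cutoff(206.84, 358.093)

# Hypothetical values (not the handout's data set).
sample = [4, 89, 212, 1050, 2372]
kept = keep_at_or_below(sample, cutoff)

print(round(cutoff, 2), kept)
```

Because this variable is positively skewed, only the upper bound matters here; a lower bound of mean minus 2.5 SD would fall below zero.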
Step 4 Check for normality again!

z score: 1.613 / .34 = 4.74
Still has positive skew, but better than before! (4.795 / .337 = 14.23)
These are still significant too:

Tests of Normality
                          Shapiro-Wilk
                          Statistic   df   Sig.
Starbucks Retail Stores   .802        49   .000
Step 5 Transform data

[Figure: transformation options for positive skew and negative skew, with "Start here" marking the first option]

Step 5 Transform data square root

SPSS: Transform > Compute
Target variable: enter new name (Ex: sqrt)
1. Click on Arithmetic under Function group
2. Click on Sqrt under Functions and Special Variables
3. Click on the up arrow to bring SQRT(?) to the Numeric Expression box
4. Highlight variable to be transformed and click the right arrow to replace the (?)
5. Explore data again to check for normality
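For readers working outside SPSS, the Compute step amounts to taking the square root of every value. A minimal sketch follows, with a hypothetical function name and made-up values; it assumes the variable has no negative values, since the square root is undefined for them.

```python
import math

def sqrt_transform(values):
    """Return the square root of each value; values must be non-negative."""
    if any(v < 0 for v in values):
        raise ValueError("square root is undefined for negative values")
    return [math.sqrt(v) for v in values]

# Hypothetical positively skewed values (not the handout's data set).
sample = [4, 9, 16, 100, 400]
transformed = sqrt_transform(sample)
print(transformed)
```

The transformation pulls large values in more than small ones, which is why it reduces positive skew.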
Step 5 Transform data and explore

z score: .76 / .34 = 2.24
Still has positive skew, but better than before! (14.23 originally, 4.74 after removing outliers)
These are still significant too:

Tests of Normality
        Shapiro-Wilk
        Statistic   df   Sig.
sqrt    .927        49   .005
Step 5 Transform data

[Figure: transformation options for positive skew and negative skew, with "Start here" and "Now this" marking the order to try them]
Remind yourself: this is fun!

Step 5 Transform data log10

SPSS: Transform > Compute
Target variable: enter new name (Ex: log10)
1. Click Reset button to clear
2. Click on Arithmetic under Function group
3. Click on Lg10 under Functions and Special Variables
4. Click on the up arrow to bring LG10(?) to the Numeric Expression box
5. Highlight variable to be transformed and click the right arrow to replace the (?)
6. Explore data again to check for normality

Step 5 Transform data and explore

z score: -.225 / .34 = -.66
Finally! It's within ±1.96!
Finally, these are not significant:

Tests of Normality
         Kolmogorov-Smirnov(a)        Shapiro-Wilk
         Statistic   df   Sig.        Statistic   df   Sig.
log10    .107        49   .200*       .969        49   .217
*. This is a lower bound of the true significance.
a. Lilliefors Significance Correction

Data is normally distributed
Finally!
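The transform-then-recheck loop can be sketched in Python as well. The function names and data are hypothetical. The skewness formula is the adjusted Fisher-Pearson coefficient, which is assumed here to match the Skewness value SPSS reports. The log10 step assumes every value is greater than zero.

```python
import math
import statistics

def sample_skewness(values):
    """Adjusted Fisher-Pearson skewness coefficient (requires n >= 3)."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD, n - 1 denominator
    cubed = sum(((v - mean) / sd) ** 3 for v in values)
    return n / ((n - 1) * (n - 2)) * cubed

def log10_transform(values):
    """Return log10 of each value; values must be strictly positive."""
    if any(v <= 0 for v in values):
        raise ValueError("log10 is undefined for zero or negative values")
    return [math.log10(v) for v in values]

# Hypothetical, strongly right-skewed values (not the handout's data set).
raw = [1, 10, 100, 1000, 10000]
logged = log10_transform(raw)

skew_before = sample_skewness(raw)
skew_after = sample_skewness(logged)
print(skew_before, skew_after)
```

Dividing each skewness value by its standard error gives the z scores used in the steps above.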
Keep in mind:
Do this with each dependent variable before analyzing data
Keep transformations consistent across all dependent variables
Although transformed data looks pretty, it can be difficult to interpret
Run your analysis with transformed data and without the transformation and compare the results
Great resource:
Mickey, R. M., Dunn, O. J., and Clark, V. A. (2004). Applied statistics: Analysis of variance and regression, 3rd edition. John Wiley & Sons, Inc.
Chapter 1: Data Screening
