
Dr Kartika Fithriasari

1
 EDA is an approach to data analysis that uses a variety of
techniques (mostly graphical) to
 maximize insight into a data set;
 uncover underlying structure;
 extract important variables;
 detect outliers and anomalies;
 test underlying assumptions;
 develop parsimonious models.
 EDA is, strictly speaking, an approach rather than a set of
techniques: an attitude or philosophy about how a data
analysis should be carried out
 EDA is the first step of data analysis
2
Discover the structure

Find patterns

Identify relationships

3
EDA helps to prevent common statistical problems
 Most statistical techniques
require specific assumptions to hold before
they can be employed, and EDA can investigate these
assumptions

4
Classical (confirmatory) analysis: Problem → Data → Model → Analysis → Conclusion

EDA: Problem → Data → Analysis → Model → Conclusion

5
 CDA is Confirmatory Data Analysis
 Confirmatory
 Formulate the model before seeing the data
 Analyze the data
 Assess “significance” (inference) based on the model

 Both EDA and CDA are important


 EDA provides tools for exploring and investigating data

 CDA provides tools for validating hypotheses

6
 Given 4 datasets, please analyze the data.
 What is an appropriate method?

     I              II            III             IV
    x      y       x      y       x      y       x      y
 10.0   8.04    10.0   9.14    10.0   7.46     8.0   6.58
  8.0   6.95     8.0   8.14     8.0   6.77     8.0   5.76
 13.0   7.58    13.0   8.74    13.0  12.74     8.0   7.71
  9.0   8.81     9.0   8.77     9.0   7.11     8.0   8.84
 11.0   8.33    11.0   9.26    11.0   7.81     8.0   8.47
 14.0   9.96    14.0   8.10    14.0   8.84     8.0   7.04
  6.0   7.24     6.0   6.13     6.0   6.08     8.0   5.25
  4.0   4.26     4.0   3.10     4.0   5.39    19.0  12.50
 12.0  10.84    12.0   9.13    12.0   8.15     8.0   5.56
  7.0   4.82     7.0   7.26     7.0   6.42     8.0   7.91
  5.0   5.68     5.0   4.74     5.0   5.73     8.0   6.89

7
                X1     Y1     X2     Y2     X3     Y3     X4     Y4
                10   8.04     10   9.14     10   7.46      8   6.58
                 8   6.95      8   8.14      8   6.77      8   5.76
                13   7.58     13   8.74     13  12.74      8   7.71
                 9   8.81      9   8.77      9   7.11      8   8.84
                11   8.33     11   9.26     11   7.81      8   8.47
                14   9.96     14   8.10     14   8.84      8   7.04
                 6   7.24      6   6.13      6   6.08      8   5.25
                 4   4.26      4   3.10      4   5.39     19  12.50
                12  10.84     12   9.13     12   8.15      8   5.56
                 7   4.82      7   7.26      7   6.42      8   7.91
                 5   5.68      5   4.74      5   5.73      8   6.89
Mean             9   7.50      9   7.50      9   7.50      9   7.50
Std Dev       3.32   2.03   3.32   2.03   3.32   2.03   3.32   2.03
Correlation          0.82          0.82          0.82          0.82
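
The summary rows in the table can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the names `quartet`, `pearson`, and `summary` are illustrative choices, not part of the original slides, and the standard deviation is assumed to be the sample (n - 1) version.

```python
from math import sqrt
from statistics import mean, stdev

# The four datasets, copied from the table above.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# (mean x, mean y, std dev x, std dev y, correlation), rounded to 2 decimals.
summary = {}
for name, (xs, ys) in quartet.items():
    summary[name] = (
        round(mean(xs), 2),
        round(mean(ys), 2),
        round(stdev(xs), 2),  # sample standard deviation (n - 1)
        round(stdev(ys), 2),
        round(pearson(xs, ys), 2),
    )
    print(name, summary[name])
```

All four datasets give the same rounded values, which is why the numbers alone cannot tell them apart.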

8
9
10
 Dataset I consists of a set of points that appear to follow a rough linear
relationship with some variance.
 Dataset II fits a neat curve but doesn’t follow a linear relationship
(maybe it’s quadratic?).
 Dataset III looks like a tight linear relationship between x and y, except
for one large outlier.
 Dataset IV looks like x remains constant, except for one outlier as well.
 Computing summary statistics or staring at the data wouldn’t have
told us any of these stories.
 Instead, it’s important to visualize the data to get a clear picture of
what’s going on.
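
A plot is the real tool here, but the outlier story for Dataset III can also be checked numerically as a companion to the picture. The sketch below fits an ordinary least-squares line and looks for the point with the largest residual; `fit_line` is a hypothetical helper written for this example, not something from the slides.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope


# Dataset III from the table of four datasets.
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

intercept, slope = fit_line(x3, y3)
residuals = [y - (intercept + slope * x) for x, y in zip(x3, y3)]

# Index of the observation that sits farthest from the fitted line.
worst = max(range(len(x3)), key=lambda i: abs(residuals[i]))
print(f"fit: y = {intercept:.2f} + {slope:.2f} x")
print(f"largest residual at x = {x3[worst]}, y = {y3[worst]}")
```

The point flagged is the same one that stands out immediately in a scatter plot, which is the faster way to find it.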
Summary Statistics Can Be Dangerous.

11
 The distribution of starting salaries for new law graduates:
the National Association of Law Placement (NALP) reports
that in 2012, lawyers made $80,798 on average in starting
salary. However, a look at the salary distribution shows what
law salaries really look like:

12
 It turns out that law graduates usually fall into one of two groups.
 The majority of new lawyers make somewhere between $35,000
and $75,000 per year, and a sizable minority earns $160,000 per
year.
 What we have here is a bimodal distribution: there are two peaks
that arise from two distinct distributions happening within the same
dataset.
 The $80,798 figure reported as the average falls into the trough
between the two peaks, and few lawyers have salaries near that
number.
 A much more accurate statement would be that most law
graduates make around $50,000 on average, and those who go to
one of the top law schools make $160,000 on average.
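
To see how an average can land where almost nobody is, here is a small sketch with made-up numbers. The salaries below are hypothetical values chosen only to mimic a two-peaked shape; they are not the NALP data.

```python
from statistics import mean, median

# Hypothetical salaries in dollars (NOT the NALP data): a large cluster
# between $45k and $65k and a smaller cluster at $160k.
salaries = [45_000] * 30 + [55_000] * 30 + [65_000] * 15 + [160_000] * 25

avg = mean(salaries)
mid = median(salaries)

# How many people earn within $10,000 of the reported average?
near_avg = sum(1 for s in salaries if abs(s - avg) <= 10_000)

print(f"mean = {avg}, median = {mid}, within $10k of mean: {near_avg}")
```

In this toy sample the mean falls between the two clusters and no one earns close to it, while the median stays inside the larger cluster.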
13
 Let’s start off by plotting the atmospheric CO2
concentrations (in ppm) pulled from NOAA’s website.
 What patterns of interest do we see?

14
 We note two patterns of interest: an overall upward trend, and a
cyclical trend.
 Our first EDA task is to model the overall trend.
 We can attempt to fit a straight line to the data using a standard
regression analysis procedure.
 The fitted line is displayed in red in the following plot.

15
 If the fitted line is a straight line, the residual plot looks like this:
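
The residual check can be sketched in a few lines. The series below is synthetic, a slightly convex curve standing in for the CO2 record (it is not the NOAA data), and `fit_line` is a hypothetical helper written for this example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope


# Synthetic, slightly convex trend (assumed values, not NOAA measurements).
t = list(range(21))
y = [340 + 1.2 * ti + 0.03 * ti ** 2 for ti in t]

intercept, slope = fit_line(t, y)
residuals = [yi - (intercept + slope * ti) for ti, yi in zip(t, y)]

# Sign of each residual in time order: a run of +, then -, then + means the
# straight line has left a systematic, curved trend behind.
signs = "".join("+" if r > 0 else "-" for r in residuals)
print(signs)
```

If the straight line had captured the trend, the signs would be scattered with no long runs; a U-shaped run like this one suggests trying a curved model.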

16
 An overall trend is still present, despite having attempted to control for it. This
implies that our simple line model does not do a good job in smoothing out the
overall trend. It appears that the overall trend is slightly convex and has a small
peak around the 1990’s; we should try to fit the trend using a 3rd order polynomial
of the form:

CO2 trend = a + b(Year) + c(Year)^2 + d(Year)^3

17
 Now, let’s look at the residuals:

The residuals show a “W”-shaped trend.

18
 Using a smoothing technique: the LOESS curve

 Residual
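
For readers who want to see what a LOESS-style smoother does, here is a minimal sketch of local linear regression with tricube weights, written with the standard library only. It is an illustration under simplifying assumptions (no robustness iterations, evaluation only at the observed x values), not the implementation used to draw the curve in the slides; the function name `loess`, the `frac` parameter, and the demo series are all choices made for this example.

```python
import math


def loess(xs, ys, frac=0.5):
    """Smooth ys against xs with tricube-weighted local linear fits."""
    n = len(xs)
    k = min(n, max(3, math.ceil(frac * n)))  # neighbours used per local fit
    fitted = []
    for x0 in xs:
        dists = [abs(x - x0) for x in xs]
        h = sorted(dists)[k - 1]  # bandwidth: distance to the k-th nearest point
        if h == 0:
            h = 1.0  # all k neighbours share x0; avoid dividing by zero
        # Tricube weights: close to 1 near x0, falling to 0 at distance h.
        w = [(1 - (d / h) ** 3) ** 3 if d < h else 0.0 for d in dists]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0  # fall back to a weighted mean
        fitted.append(my + slope * (x0 - mx))
    return fitted


# Demo on a synthetic series: a straight line plus alternating +/-1 "noise".
t = list(range(21))
noisy = [0.5 * ti + (1 if ti % 2 else -1) for ti in t]
smooth = loess(t, noisy, frac=0.5)
for ti, s in zip(t, smooth):
    print(ti, round(s, 2))
```

A larger `frac` uses more neighbours per fit and gives a smoother curve; a smaller one follows the data more closely.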

19
“Visualization is critical to data analysis. It provides a
front line of attack, revealing intricate structure in data that
cannot be absorbed in any other way.”

–William S. Cleveland

20