DIAN EKA R
DATA PREPROCESSING: WHY IS IT NEEDED?
Data tend to be dirty
Incomplete: missing attribute values
Noisy: contains errors or outliers
Inconsistent: differing formats in codes and names
No quality data, no quality mining results
Quality decisions must be based on quality data
Centering
Normalization
Scaling
Mean: the average value of a data set
Formula:
\[ \bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N} \]
Sensitive to extreme values (outliers) → trimmed mean
Trimmed mean: drop the top and bottom 2% of the data before computing the mean
Salary data (in thousands of dollars), sorted from smallest to largest: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
\[ \bar{x} = \frac{30 + 36 + 47 + 50 + 52 + 52 + 56 + 60 + 63 + 70 + 70 + 110}{12} = \frac{696}{12} = 58 \]
So the mean salary is $58,000.
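The mean and the trimmed mean can be sketched in Python on the salary data. The helper `trimmed_mean` and its `proportion` parameter are hypothetical names introduced here for illustration, and rounding the cut down to a whole number of values is an assumption, not something the slides specify.

```python
from statistics import mean

# Salary data from the slide, in thousands of dollars (already sorted).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end.

    Hypothetical helper: the number of values cut per end is rounded down,
    and `proportion` is assumed to be below 0.5.
    """
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # values dropped at each end
    kept = ordered[k:len(ordered) - k]
    return mean(kept)


plain = mean(salaries)                  # 696 / 12
trimmed = trimmed_mean(salaries, 0.10)  # drops one value per end: 30 and 110
print(plain, trimmed)
```

With only 12 values, a 2% trim rounds down to zero values per end, so 10% is used here purely to make the effect visible.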
In the salary data, the values are already sorted and the count is even (12), so the median is the average of the two middle values (the 6th and 7th), which are 52 and 56.
\[ \text{median} = \frac{52 + 56}{2} = 54 \]
Midrange: the average of the smallest and largest values
\[ \text{midrange} = \frac{30 + 110}{2} = 70 \]
So the midrange is $70,000.
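As a quick check of the two worked values above, a minimal sketch using Python's standard `statistics` module (the variable names are illustrative):

```python
from statistics import median

# Salary data from the slide, in thousands of dollars (already sorted).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# For an even count, median() averages the two middle values (6th and 7th).
med = median(salaries)

# Midrange: average of the smallest and largest values.
midrange = (min(salaries) + max(salaries)) / 2

print(med, midrange)
```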
CENTRAL TENDENCY
Data with a unimodal, symmetric distribution (normally distributed) have the same mean, median, and mode, all located at the center.
Real-world data → asymmetric (positively skewed or negatively skewed)
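To see how this plays out on the salary data, the sketch below lists the mode(s) and compares the mean with the median. Reading "mean greater than median" as a sign of positive skew is only a rule of thumb assumed here, not a test taken from the slides.

```python
from statistics import mean, median, multimode

# Salary data from the slide, in thousands of dollars (already sorted).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

modes = multimode(salaries)  # every value tied for the highest count
avg, med = mean(salaries), median(salaries)

# Rule-of-thumb reading of skew (an assumption, not a formal test).
if avg > med:
    shape = "positively skewed"
elif avg < med:
    shape = "negatively skewed"
else:
    shape = "roughly symmetric"

print(modes, avg, med, shape)
```

The data have two modes (52 and 70), so they are bimodal rather than unimodal, and the mean sits above the median because of the single large value 110.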
Measuring the spread of data → five-number summary (range,
The 2-quantile is the data point dividing the lower and upper halves of the data distribution → median
The 4-quantiles are the three data points that split the data distribution into four equal parts; each part represents
one-fourth of the data distribution → quartiles
The 100-quantiles → percentiles (divide the data distribution into 100 equal-sized consecutive sets)
RANGE, QUARTILES, AND INTERQUARTILE RANGE (3)
The first quartile Q1 is the 25th percentile → cuts off the lowest 25% of the data.
The third quartile Q3 is the 75th percentile → cuts off the lowest 75% (or highest 25%) of the data.
The second quartile Q2 is the 50th percentile (median) → the center of the data distribution.
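Software computes quartiles by interpolating between data points, and the exact values depend on the convention chosen. A small sketch with Python's `statistics.quantiles` on the salary data (30, ..., 110) shows two such conventions. The results are what this library returns and need not match hand-picked quartiles.

```python
from statistics import quantiles

# Salary data from the slide, in thousands of dollars (already sorted).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# n=4 asks for the three cut points that split the data into four parts.
exclusive = quantiles(salaries, n=4)                      # default method
inclusive = quantiles(salaries, n=4, method="inclusive")  # treats data as the full population

print(exclusive)
print(inclusive)
```

Both conventions agree on the median (54) but differ on Q1 and Q3, which is why reported quartiles can vary slightly between tools.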
RANGE, QUARTILES, AND INTERQUARTILE RANGE (4)
Interquartile range (IQR) → the distance between the first and third quartiles
IQR = Q3 - Q1
The quartiles → three values that split the sorted data set into four equal parts
Outlier → values falling at least 1.5 x IQR above the third quartile or below the first quartile
EXAMPLE
For the salary data, the quartiles are the 3rd value (Q1), the 6th value (Q2), and the 9th value (Q3).
So:
Q1 = $47,000
Q2 = $52,000
Q3 = $63,000
IQR = 63 - 47 = $16,000
Outlier = value < Q1 - 1.5 x IQR or value > Q3 + 1.5 x IQR
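The example can be traced in Python. The helper `slide_quartiles` is a hypothetical name, and its position rule (1-based positions n/4, n/2, 3n/4) is an assumption chosen because it reproduces the 3rd, 6th, and 9th values used in this example. It is not a general-purpose quartile definition.

```python
# Salary data from the slide, in thousands of dollars (already sorted).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def slide_quartiles(sorted_values):
    """Quartiles taken at 1-based positions n/4, n/2 and 3n/4.

    Assumed rule: it matches the example's picks (3rd, 6th, 9th of 12)
    and is only meant for sorted data with at least 4 values.
    """
    n = len(sorted_values)
    return tuple(sorted_values[(k * n) // 4 - 1] for k in (1, 2, 3))


q1, q2, q3 = slide_quartiles(salaries)
iqr = q3 - q1

# Outlier fences: 1.5 x IQR below Q1 and above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in salaries if x < lower_fence or x > upper_fence]

print(q1, q2, q3, iqr, lower_fence, upper_fence, outliers)
```

The fences come out at 23 and 87 (thousand dollars), so the only value flagged is 110.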
FIVE-NUMBER SUMMARY & BOXPLOT
The figure shows boxplots for unit price data for items sold at
four branches of AllElectronics during a given time period.
For branch 1, we see that the median price of items sold is
$80, Q1 is $60, and Q3 is $100.
Notice that two outlying observations for this branch were
plotted individually, as their values of 175 and 202 lie more
than 1.5 x IQR (here 1.5 x 40 = 60) above Q3, i.e. above 160.
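The branch 1 reading can be checked with a few lines of Python. Only the numbers quoted in the text (Q1, median, Q3, and the two outlying values) are used. The remaining data behind the boxplot are not available here.

```python
# Branch 1 values quoted in the text (unit prices in dollars).
q1, median_price, q3 = 60, 80, 100
observations = [175, 202]

iqr = q3 - q1                     # width of the box
upper_fence = q3 + 1.5 * iqr      # values above this are drawn individually
flagged = [x for x in observations if x > upper_fence]

print(iqr, upper_fence, flagged)
```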
VARIANCE AND STANDARD DEVIATION
A low standard deviation means that the data observations tend to be very close to the mean, while a high standard
deviation indicates that the data are spread out over a large range of values.
VARIANCE AND STANDARD DEVIATION (2)
The standard deviation (σ) of the observations is the square root of the variance (σ²)
EXAMPLE
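A minimal sketch on the salary data introduced earlier (30, ..., 110, in thousands of dollars). It assumes the population form of the variance (dividing by N), which the symbol σ suggests but which is an assumption here. The sample form (dividing by N - 1) would give slightly larger values.

```python
from statistics import pstdev, pvariance

# Salary data from the slide, in thousands of dollars (already sorted).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Population variance: mean of the squared deviations from the mean (58).
variance = pvariance(salaries)

# Population standard deviation: square root of the variance.
std_dev = pstdev(salaries)

print(round(variance, 2), round(std_dev, 2))
```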
ASSIGNMENT 1
The age values for the data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
a) What is the mean of the data? What is the median?
b) What is the mode of the data? Comment on the data's modality (i.e., bimodal, trimodal, etc.).
c) What is the midrange of the data?
d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?
e) Give the five-number summary of the data.
f) Show a boxplot of the data.
g) How is a quantile–quantile plot different from a quantile plot?
ASSIGNMENT 2
Suppose that a hospital tested the age and body fat data for 18 randomly selected adults with the following results:
Calculate the mean, median, and standard deviation of age and %fat.
Draw the boxplots for age and %fat.
Draw a scatter plot and a q-q plot based on these two variables.