Getting to Know and Understanding Data
Data visualization
Record
Relational records
Transaction data
Ordered
Image data:
Video data:
[Table: transaction data with columns TID and Items, TIDs 1 to 5]
Data Objects
Example:
Columns -> attributes.
Attributes
Attribute Types
Hair color
Numeric Attributes
Calendar dates
No true zero-point
Ratio
Inherent zero-point
Discrete Attributes
Have a finite or countably infinite set of values
E.g., zip codes, the set of words in a collection of documents
Sometimes represented as integer variables
Note: binary attributes are a special case of discrete attributes
Continuous Attributes
Take real numbers as values
E.g., temperature, height, weight
Continuous attributes are typically represented as floating-point variables
Purpose
To understand the data: central tendency, variation, and spread
Data dispersion characteristics
Median, max, min, quantiles, outliers, variance, etc.
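Mean:
Sample mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Population mean: $\mu = \frac{\sum x}{N}$
Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Trimmed mean: the mean after chopping off extreme values
Median:
For grouped data, estimated by interpolation: $\text{median} = L_1 + \left( \frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}} \right) \times \text{width}$
where $L_1$ is the lower boundary of the median interval, $(\sum \text{freq})_l$ is the total frequency of the intervals below it, and width is the interval width
Mode
Empirical formula: $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$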
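The central-tendency measures above can be sketched with the Python standard library. This is only an illustration: the sample values, the helper names `weighted_mean` and `trimmed_mean`, and the trimming fraction are assumptions, not part of the slides.

```python
from statistics import mean, median, multimode

def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the sorted values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)       # number of values cut per end
    return mean(ordered[k:len(ordered) - k])

# Hypothetical sample, chosen only for illustration.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("mean:", mean(data))
print("median:", median(data))
print("modes:", multimode(data))           # a bimodal sample returns two values
print("10% trimmed mean:", trimmed_mean(data, 0.1))
print("weighted mean:", weighted_mean([1, 2, 3], [1, 1, 2]))
```

The trimmed mean is less sensitive to the extreme value 110 than the plain mean, which is the reason for trimming.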
Symmetric vs.
Skewed Data
[Figure: positively skewed, symmetric, and negatively skewed distributions]
Outlier: usually a value more than 1.5 x IQR above the third quartile or below the first quartile
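Sample variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$
Population variance: $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2$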
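A minimal sketch of the 1.5 x IQR outlier rule and the variance formulas, using only the standard library. The sample data and the helper names are assumptions for illustration; the quartile method ("inclusive") is one of several conventions and other tools may give slightly different quartiles.

```python
from statistics import quantiles, variance, pvariance

def iqr_outliers(values):
    """Return values lying more than 1.5 * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]

def sample_variance_shortcut(values):
    """Sample variance via the sum-of-squares form: one pass over the data."""
    n = len(values)
    total = sum(values)
    total_sq = sum(x * x for x in values)
    return (total_sq - total * total / n) / (n - 1)

# Hypothetical sample, chosen only for illustration.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("outliers:", iqr_outliers(data))
print("sample variance:", variance(data))
print("shortcut form:", sample_variance_shortcut(data))
print("population variance:", pvariance(data))
```

The shortcut form avoids a second pass over the data but can lose precision when the values are large relative to their spread.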
Boxplot Analysis
Boxplot
Normal curve
From μ−σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
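The 68% figure can be checked numerically. The sketch below uses `statistics.NormalDist` from the standard library; the function name `coverage` is a hypothetical helper.

```python
from statistics import NormalDist

def coverage(k):
    """Probability mass of a normal distribution within k standard deviations of the mean."""
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    return std_normal.cdf(k) - std_normal.cdf(-k)

print(f"within 1 sigma: {coverage(1):.4f}")
```

Because the result depends only on k, the same proportion holds for any mean and standard deviation.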
Histogram Analysis
Histogram: a graph that displays
tabulated data frequencies
The two histograms
have the same boxplot,
but their data distributions
differ
Scatter plot
Uncorrelated Data
Data Visualization
Methods
Scatterplot Matrices
Similarity
Range [0,1]
Minimum dissimilarity is 0
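Data matrix
n data points with p dimensions
Two modes
$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$
Dissimilarity matrix
n data points, but only the distances between them are recorded
Triangular matrix
Single mode
$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$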
$d(i, j) = \frac{p - m}{p}$
m: number of matches, p: total number of variables
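A minimal sketch of this matching-based dissimilarity, where m counts matching attributes and p is the total number of attributes. The function name and the sample objects are hypothetical.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """d(i, j) = (p - m) / p for two objects described by nominal attributes."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of attributes")
    p = len(obj_i)                                       # total number of variables
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)   # number of matches
    return (p - m) / p

# Hypothetical objects with three nominal attributes each.
print(nominal_dissimilarity(["red", "round", "small"], ["red", "square", "small"]))
```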
Object i
Example
Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N
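One way to compare the three patients in the table is an asymmetric binary dissimilarity, sketched below. Several things here are assumptions not stated in the surviving slide text: Y and P are coded as 1 and N as 0, Gender is left out as a symmetric attribute, and the measure used is d = (r + s) / (q + r + s), where q counts attributes that are 1 for both objects and r, s count the two kinds of mismatch.

```python
# Fever, Cough, Test-1 .. Test-4 for each patient (Gender deliberately omitted).
patients = {
    "Jack": ["Y", "N", "P", "N", "N", "N"],
    "Mary": ["Y", "N", "P", "N", "P", "N"],
    "Jim":  ["Y", "P", "N", "N", "N", "N"],
}

def encode(values):
    """Assumed coding: Y and P become 1, N becomes 0."""
    return [1 if v in ("Y", "P") else 0 for v in values]

def asymmetric_binary_dissimilarity(a, b):
    """d = (r + s) / (q + r + s); joint absences (0, 0) are ignored."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    denom = q + r + s
    return (r + s) / denom if denom else 0.0

coded = {name: encode(values) for name, values in patients.items()}
for first, second in [("Jack", "Mary"), ("Jack", "Jim"), ("Jim", "Mary")]:
    d = asymmetric_binary_dissimilarity(coded[first], coded[second])
    print(f"d({first}, {second}) = {d:.2f}")
```

Under these assumptions Jack and Mary come out as the most similar pair and Jim and Mary as the least similar.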
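Z-score: $z = \frac{x - \mu}{\sigma}$
Alternatively, standardize with the mean absolute deviation:
$m_f = \frac{1}{n} (x_{1f} + x_{2f} + \dots + x_{nf})$
$s_f = \frac{1}{n} (|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|)$
$z_{if} = \frac{x_{if} - m_f}{s_f}$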
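A short sketch of standardizing one attribute column with the mean absolute deviation. The function name `standardize` and the sample column are hypothetical.

```python
def standardize(column):
    """Return z_if = (x_if - m_f) / s_f using the mean absolute deviation s_f."""
    n = len(column)
    m_f = sum(column) / n                          # mean of attribute f
    s_f = sum(abs(x - m_f) for x in column) / n    # mean absolute deviation
    if s_f == 0:
        raise ValueError("attribute is constant; cannot standardize")
    return [(x - m_f) / s_f for x in column]

# Hypothetical attribute values.
print(standardize([2, 4, 6, 8]))
```

The mean absolute deviation does not square the deviations, so a single extreme value inflates it less than it inflates the standard deviation.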
Example:
Data Matrix and Dissimilarity Matrix
Data Matrix
Dissimilarity Matrix
(with Euclidean Distance)
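Going from a data matrix to a lower-triangular dissimilarity matrix with Euclidean distance can be sketched as follows. The points are hypothetical and the function name is an assumption; `math.dist` requires Python 3.8 or later.

```python
import math

def dissimilarity_matrix(points):
    """Lower-triangular matrix: row i holds d(i, 0) .. d(i, i)."""
    return [
        [math.dist(points[i], points[j]) for j in range(i + 1)]
        for i in range(len(points))
    ]

# Hypothetical data matrix: three objects, two numeric attributes each.
points = [(0, 0), (3, 4), (6, 8)]
for row in dissimilarity_matrix(points):
    print(row)
```

Only the lower triangle is stored because d(i, j) = d(j, i) and d(i, i) = 0.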
Properties
Manhattan (L1)
Euclidean (L2)
Supremum
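The three distances named above are special cases of the Minkowski distance (order 1, order 2, and the limit as the order grows). A minimal sketch, with hypothetical function names and sample points:

```python
import math

def minkowski(x, y, h):
    """Minkowski distance of order h (h >= 1)."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

def manhattan(x, y):
    """L1: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """L2: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def supremum(x, y):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))

# Hypothetical points.
x, y = (1, 2), (3, 5)
print(manhattan(x, y), euclidean(x, y), supremum(x, y))
```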
Ordinal Variables
$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
f is binary or nominal:
$d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
f is numeric: use the normalized distance
f is ordinal
Compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
Treat $z_{if}$ as interval-scaled
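The combined formula for attributes of mixed type can be sketched as below. The `schema` layout, the function names, and the sample objects are hypothetical. Two assumptions are made: the indicator for an attribute is 0 only when a value is missing (`None`), and numeric attributes are normalized by a known range (max minus min).

```python
def ordinal_to_interval(rank, num_states):
    """Map rank r_if in 1..M_f onto [0, 1]: z_if = (r_if - 1) / (M_f - 1)."""
    return (rank - 1) / (num_states - 1)

def mixed_dissimilarity(obj_i, obj_j, schema):
    """schema: one (kind, param) pair per attribute.

    kind "nominal": param unused (also covers binary attributes).
    kind "numeric": param is the attribute range (max - min).
    kind "ordinal": param is the number of states M_f; values are ranks.
    """
    numerator, denominator = 0.0, 0
    for a, b, (kind, param) in zip(obj_i, obj_j, schema):
        if a is None or b is None:
            continue                      # indicator = 0: attribute is skipped
        if kind == "nominal":
            d = 0.0 if a == b else 1.0
        elif kind == "numeric":
            d = abs(a - b) / param
        elif kind == "ordinal":
            d = abs(ordinal_to_interval(a, param) - ordinal_to_interval(b, param))
        else:
            raise ValueError(f"unknown attribute kind: {kind}")
        numerator += d
        denominator += 1
    if denominator == 0:
        raise ValueError("no comparable attributes")
    return numerator / denominator

# Hypothetical schema and objects.
schema = [("nominal", None), ("numeric", 10.0), ("ordinal", 3)]
print(mixed_dissimilarity(["A", 2.0, 1], ["B", 7.0, 3], schema))
print(mixed_dissimilarity(["A", 2.0, 1], ["A", None, 2], schema))
```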
Cosine Similarity
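As a reminder of the usual definition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below assumes that definition; the term-frequency vectors are hypothetical.

```python
import math

def cosine_similarity(x, y):
    """cos(x, y) = (x . y) / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)

# Hypothetical term-frequency vectors for two documents.
d1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
d2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]
print(f"{cosine_similarity(d1, d2):.2f}")
```

The measure depends only on the angle between the vectors, so scaling a document's counts by a constant does not change the result.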
KL Divergence: Comparing
Two Probability Distributions
Discrete form: $D_{KL}(p(x) \,\|\, q(x)) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}$
The KL divergence measures the expected number of extra bits
required to code samples from p(x) (true distribution) when
using a code based on q(x), which represents a theory, model,
description, or approximation of p(x)
Its continuous form: $D_{KL}(p(x) \,\|\, q(x)) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx$
The KL divergence is not a distance metric: it is asymmetric and
does not satisfy the triangle inequality
How to Compute the KL Divergence?
Based on the formula, $D_{KL}(P \,\|\, Q) \ge 0$, and $D_{KL}(P \,\|\, Q) = 0$ if and only if $P = Q$.
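A minimal sketch of the discrete KL divergence for two distributions given as lists of probabilities over the same outcomes. The function name, the default base 2 (which gives the result in bits), and the example distributions are assumptions. Terms with p(x) = 0 contribute nothing, and the sketch returns infinity when q(x) = 0 while p(x) > 0.

```python
import math

def kl_divergence(p, q, base=2.0):
    """D_KL(P || Q) = sum over x of p(x) * log(p(x) / q(x))."""
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0:
            continue                  # 0 * log(0 / q) is taken as 0
        if q_x == 0:
            return math.inf           # P puts mass where Q has none
        total += p_x * math.log(p_x / q_x, base)
    return total

# Hypothetical distributions over two outcomes.
p = [0.5, 0.5]
q = [0.25, 0.75]
print(kl_divergence(p, q))   # D_KL(P || Q)
print(kl_divergence(q, p))   # D_KL(Q || P), a different value
print(kl_divergence(p, p))   # zero when the distributions are equal
```

Swapping the arguments changes the result, which illustrates the asymmetry noted earlier.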