
Data Mining:
Getting to Know and Understanding Data

Getting to Know and Understanding Data

- Data objects and attribute types
- Descriptive statistics of data
- Data visualization
- Measuring similarity and dissimilarity of data

Types of Data Sets

- Record
  - Relational records
  - Data matrix, e.g., numerical matrix
  - Document data: text documents represented as term-frequency vectors
  - Transaction data
- Graph and network
  - World Wide Web
  - Social or information networks
- Ordered
  - Video data: sequence of images
  - Temporal data: time-series
- Spatial, image and multimedia
  - Spatial data: maps
  - Image data
  - Video data

Example of transaction data:

TID | Items
----|--------------------------
1   | Bread, Coke, Milk
2   | Beer, Bread
3   | Beer, Coke, Diaper, Milk
4   | Beer, Bread, Diaper, Milk
5   | Coke, Diaper, Milk

Data Objects

- A data object represents an entity
- Examples:
  - Sales database: customers, items sold, sales
  - Medical database: patients, treatments
  - University database: students, professors, courses
- Data objects are described by attributes
- Database rows -> data objects; columns -> attributes

Attributes

- An attribute (also called a dimension, feature, or variable) describes a characteristic or feature of a data object
  - E.g., customer_ID, name, address
- Types:
  - Nominal
  - Ordinal
  - Binary
  - Numeric:
    - Interval-scaled
    - Ratio-scaled

Attribute Types

- Nominal: categories, states, or names of things
  - E.g., hair color
  - Marital status, zip code, student ID, etc.
- Binary: a nominal attribute with only 2 states (0 and 1)
  - Symmetric binary: both outcomes are equally important
    - E.g., gender
  - Asymmetric binary: the outcomes are not equally important
    - E.g., a medical test (positive or negative)
    - Convention: assign 1 to the more important outcome (e.g., HIV positive)
- Ordinal
  - Values have a meaningful order (ranking), but the magnitude between successive values is not expressed numerically
  - E.g., size = {small, medium, large}, grades, military rank

Numeric Attributes

- Quantity (integer or real-valued)
- Interval-scaled
  - Measured on a scale of equal-sized units
  - Values have order
  - E.g., calendar dates
  - No true zero-point
- Ratio-scaled
  - Has an inherent zero-point
  - E.g., length, weight
  - We can speak of a value as being a multiple of another value
    - E.g., road A is twice as long as road B

Discrete vs. Continuous Attributes

- Discrete attribute
  - Has a finite or countably infinite set of values
  - E.g., zip codes, words in a collection of documents
  - Sometimes represented as integer variables
  - Note: binary attributes are a special case of discrete attributes
- Continuous attribute
  - Has real-number values
  - E.g., temperature, height, weight
  - In practice, continuous attributes are represented as floating-point variables

Basic Statistical Descriptions of Data

- Motivation
  - To better understand the data: central tendency, variation, and spread
- Data dispersion characteristics
  - Median, max, min, quantiles, outliers, variance, etc.

Measuring the Central Tendency

- Note: n is the sample size and N is the population size.
- Mean (algebraic measure) (sample vs. population):

  Sample: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$    Population: $\mu = \frac{\sum x}{N}$

  - Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
  - Trimmed mean: the mean computed after chopping off extreme values
- Median: the middle value of the ordered data (or the average of the two middle values)
  - Estimated by interpolation (for grouped data):

    $\text{median} = L_1 + \left(\frac{n/2 - (\sum freq)_l}{freq_{median}}\right) \cdot width$

- Mode
  - Value that occurs most frequently in the data
  - Unimodal, bimodal, trimodal
  - Empirical formula: $mean - mode \approx 3 \times (mean - median)$
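The measures above can be sketched with Python's standard library; the sample data and the trimmed_mean helper are illustrative additions, not from the slides:

```python
from statistics import mean, median, mode

data = [1, 2, 2, 3, 4, 7, 9]

def trimmed_mean(xs, k):
    """Mean after chopping off the k smallest and k largest values."""
    xs = sorted(xs)
    return mean(xs[k:len(xs) - k])

print(mean(data))            # arithmetic mean: 4
print(median(data))          # middle value of the ordered data: 3
print(mode(data))            # most frequent value: 2
print(trimmed_mean(data, 1)) # mean of [2, 2, 3, 4, 7]: 3.6
```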

Symmetric vs. Skewed Data

- Median, mean, and mode of symmetric, positively skewed, and negatively skewed data
- (Figures: positively skewed, symmetric, and negatively skewed distributions)

Data Mining: Concepts and Techniques, March 24, 2016

Measuring the Dispersion of Data

- Quartiles, outliers, and boxplots
  - Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  - Inter-quartile range: IQR = Q3 - Q1
  - Five-number summary: min, Q1, median, Q3, max
  - Outlier: usually a value more than 1.5 x IQR above Q3 or below Q1
- Variance and standard deviation (sample: s, population: $\sigma$)
  - Variance (algebraic, scalable computation):

    Sample: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$

    Population: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{n}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{n} x_i^2 - \mu^2$

  - The standard deviation s (or $\sigma$) is the square root of the variance $s^2$ (or $\sigma^2$)
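A minimal sketch of these dispersion measures; it assumes the common median-of-halves convention for quartiles (conventions differ between textbooks and libraries), and the sample data is made up for illustration:

```python
from statistics import median, stdev

data = [1, 2, 2, 3, 4, 7, 9, 30]

def quartiles(xs):
    """Q1 and Q3 as medians of the lower and upper halves of the data."""
    xs = sorted(xs)
    mid = len(xs) // 2
    q1 = median(xs[:mid])
    q3 = median(xs[mid + (len(xs) % 2):])
    return q1, q3

q1, q3 = quartiles(data)
iqr = q3 - q1
five_num = (min(data), q1, median(data), q3, max(data))
# Flag values more than 1.5 x IQR beyond the quartiles
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print(five_num, outliers, stdev(data))
```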

Boxplot Analysis

- Five-number summary of a distribution
  - Minimum, Q1, Median, Q3, Maximum
- Boxplot
  - The data is represented with a box
  - The ends of the box are the first and third quartiles; the height of the box is the IQR
  - The median is marked by a line within the box
  - Outliers: plotted individually outside the box

Properties of the Normal Distribution Curve

- The normal curve
  - From $\mu - \sigma$ to $\mu + \sigma$: contains about 68% of the measurements ($\mu$: mean, $\sigma$: standard deviation)
  - From $\mu - 2\sigma$ to $\mu + 2\sigma$: contains about 95% of the measurements
  - From $\mu - 3\sigma$ to $\mu + 3\sigma$: contains about 99.7% of the measurements

Histogram Analysis

- Histogram: a graph that displays a tabulation of data frequencies

Histograms Often Tell More than Boxplots

- Two different histograms can yield the same boxplot
  - Same values: min, Q1, median, Q3, max
- But their data distributions differ

Scatter Plot

- Displays bivariate data to look for clusters of points, outliers, etc.
- Each point shows a pair of coordinates of a data item

Positively and Negatively Correlated Data

- Upper left: positive correlation
- Upper right: negative correlation

Uncorrelated Data

- (Figure: scatter plots of uncorrelated data)

Data Visualization

- Why data visualization?
  - Gain insight into an information space by mapping data onto graphical primitives
  - Provide a qualitative overview of large data sets
  - Search for patterns, trends, structure, irregularities, and relationships among the data
  - Help find interesting regions and suitable parameters for further quantitative analysis

Geometric Projection Visualization Techniques

- Visualization of geometric transformations and projections of the data
- Methods
  - Scatterplot and scatterplot matrices

Scatterplot Matrices

- Matrix of scatterplots (x-y diagrams) of the k-dimensional data [a total of (k^2 - k)/2 distinct scatterplots]
- Used by permission of M. Ward, Worcester Polytechnic Institute

Similarity and Dissimilarity

- Similarity
  - A numerical measure of how alike two data objects are
  - Higher when the objects are more alike
  - Often falls in the range [0, 1]
- Dissimilarity (e.g., distance)
  - A numerical measure of how different two data objects are
  - Lower when the objects are more alike
  - Minimum dissimilarity is often 0

Data Matrix and Dissimilarity Matrix

- Data matrix
  - n data points with p dimensions
  - Two modes (rows are objects, columns are attributes)

    $\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$

- Dissimilarity matrix
  - n data points, but it registers only the distance between each pair
  - A triangular matrix
  - Single mode

    $\begin{bmatrix} 0 \\ d(2,1) & 0 \\ d(3,1) & d(3,2) & 0 \\ \vdots & \vdots & \vdots & \ddots \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$

Proximity Measure for Nominal Attributes

- A nominal attribute can take 2 or more states, e.g., red, yellow, blue, green (a generalization of a binary attribute)
- Method: simple matching
  - m: # of attributes on which i and j match, p: total # of attributes

    $d(i, j) = \frac{p - m}{p}$
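The simple-matching formula above can be sketched in a few lines; the attribute values in the example call are hypothetical:

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Simple matching: d(i, j) = (p - m) / p, where m is the number of
    matching attributes and p the total number of attributes."""
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

# Two objects described by three nominal attributes (2 of 3 match)
d = nominal_dissimilarity(["red", "small", "circle"],
                          ["red", "large", "circle"])
print(d)  # (3 - 2) / 3
```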

Proximity Measure for Binary Attributes

- A contingency table for binary data (q, r, s, t count the attribute combinations between object i and object j):

                Object j
                  1    0
    Object i 1    q    r
             0    s    t

- Distance measure for symmetric binary variables:

  $d(i, j) = \frac{r + s}{q + r + s + t}$

- Distance measure for asymmetric binary variables:

  $d(i, j) = \frac{r + s}{q + r + s}$

- Jaccard coefficient (similarity measure for asymmetric binary variables):

  $sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$

- Note: the Jaccard coefficient is the same as "coherence"

Dissimilarity between Binary Variables

- Example:

  Name | Gender | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4
  Jack |   M    |   Y   |   N   |   P    |   N    |   N    |   N
  Mary |   F    |   Y   |   N   |   P    |   N    |   P    |   N
  Jim  |   M    |   Y   |   P   |   N    |   N    |   N    |   N

- Gender is a symmetric attribute
- The remaining attributes are asymmetric binary
- Let the values Y and P be 1, and the value N be 0

  $d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33$

  $d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67$

  $d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75$
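The worked example above can be reproduced with a short sketch; the 0/1 encodings follow the slide's convention (Y/P -> 1, N -> 0, with the symmetric Gender attribute excluded):

```python
def asym_binary_dissim(x, y):
    """Asymmetric binary dissimilarity d = (r + s) / (q + r + s);
    matching 0-0 pairs (the t cell) are ignored."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    return (r + s) / (q + r + s)

# Attributes: Fever, Cough, Test-1, Test-2, Test-3, Test-4
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(asym_binary_dissim(jack, mary))  # 1/3
print(asym_binary_dissim(jack, jim))   # 2/3
print(asym_binary_dissim(jim, mary))   # 3/4
```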

Standardizing Numeric Data

- Z-score: $z = \frac{x - \mu}{\sigma}$
  - x: raw score to be standardized, $\mu$: mean of the population, $\sigma$: standard deviation
  - The distance between the raw score and the population mean, in units of the standard deviation
  - Negative when the raw score is below the mean, positive when above
- An alternative way: calculate the mean absolute deviation

  $s_f = \frac{1}{n}(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|)$

  where $m_f = \frac{1}{n}(x_{1f} + x_{2f} + \cdots + x_{nf})$

  - Standardized measure (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$
- Using the mean absolute deviation is more robust than using the standard deviation
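A sketch of standardization via the mean absolute deviation, as defined above; the input values are illustrative:

```python
def mad_zscores(xs):
    """Standardize with the mean absolute deviation s_f in place of the
    standard deviation: z_if = (x_if - m_f) / s_f."""
    n = len(xs)
    m = sum(xs) / n                         # m_f: the attribute mean
    s = sum(abs(x - m) for x in xs) / n     # s_f: mean absolute deviation
    return [(x - m) / s for x in xs]

z = mad_zscores([2.0, 4.0, 6.0, 8.0])  # m_f = 5.0, s_f = 2.0
print(z)
```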

Example: Data Matrix and Dissimilarity Matrix

- Data matrix (figure)
- Dissimilarity matrix, with Euclidean distance (figure)
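Since the slide's figures did not survive extraction, here is a hypothetical data matrix and the corresponding lower-triangular Euclidean dissimilarity matrix, matching the triangular single-mode layout described earlier:

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

# Hypothetical data matrix: 4 points with 2 dimensions
points = [(1, 2), (3, 5), (2, 0), (4, 5)]

# Lower-triangular dissimilarity matrix: row i holds d(i, j) for j < i
dissim = [[round(dist(points[i], points[j]), 2) for j in range(i)]
          for i in range(len(points))]

for row in dissim:
    print(row)
```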

Distance on Numeric Data: Minkowski Distance

- Minkowski distance: a popular distance measure

  $d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$

  where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
- Properties
  - d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
  - d(i, j) = d(j, i) (symmetry)
  - d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)

Special Cases of Minkowski Distance

- h = 1: Manhattan (city block, L1 norm) distance
  - E.g., the Hamming distance: the number of bits that are different between two binary vectors

    $d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$

- h = 2: Euclidean (L2 norm) distance

    $d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$

- h -> ∞: supremum (Lmax norm, L∞ norm) distance
  - This is the maximum difference between any component (attribute) of the vectors
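The three special cases can be sketched directly from the formulas; the vectors x and y below are made-up examples:

```python
def minkowski(x, y, h):
    """Minkowski (L-h norm) distance between two equal-length vectors;
    h = 1 gives Manhattan, h = 2 gives Euclidean."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

def supremum(x, y):
    """L-infinity (max) distance: the largest per-component difference."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (1, 0, 3), (4, 4, 3)
print(minkowski(x, y, 1))  # Manhattan: 3 + 4 + 0
print(minkowski(x, y, 2))  # Euclidean: sqrt(9 + 16)
print(supremum(x, y))      # max(3, 4, 0)
```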

Example: Minkowski Distance

- Dissimilarity matrices (figures): Manhattan (L1), Euclidean (L2), supremum

Ordinal Variables

- An ordinal variable can be discrete or continuous
- Order is important, e.g., rank
- Can be treated like interval-scaled
  - Replace each $x_{if}$ by its rank $r_{if} \in \{1, ..., M_f\}$
  - Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

    $z_{if} = \frac{r_{if} - 1}{M_f - 1}$

  - Compute the dissimilarity using methods for interval-scaled variables
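The rank-to-[0, 1] mapping above can be sketched as follows; the size levels are taken from the earlier slide's example:

```python
def ordinal_to_unit_interval(value, ordered_levels):
    """Map an ordinal value onto [0, 1]: z = (r - 1) / (M - 1), where r is
    the value's 1-based rank and M the number of ordered levels."""
    r = ordered_levels.index(value) + 1
    m = len(ordered_levels)
    return (r - 1) / (m - 1)

sizes = ["small", "medium", "large"]
print([ordinal_to_unit_interval(s, sizes) for s in sizes])
```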

Attributes of Mixed Type

- A database may contain all attribute types
  - Nominal, symmetric binary, asymmetric binary, numeric, ordinal
- One may use a weighted formula to combine their effects:

  $d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$

  - If f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
  - If f is numeric: use the normalized distance
  - If f is ordinal:
    - Compute the ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
    - Treat $z_{if}$ as interval-scaled

Cosine Similarity

- A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document.
- Other vector objects: gene features in micro-arrays, ...
- Applications: information retrieval, biologic taxonomy, gene feature mapping, ...
- Cosine measure: if d1 and d2 are two vectors (e.g., term-frequency vectors), then

  cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),

  where · indicates the vector dot product and ||d|| is the length of vector d

Example: Cosine Similarity

- cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||), where · indicates the vector dot product and ||d|| is the length of vector d
- Ex: find the similarity between documents 1 and 2.

  d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
  d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

  d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
  ||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = 42^0.5 = 6.481
  ||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = 17^0.5 = 4.123

  cos(d1, d2) = 25 / (6.481 * 4.123) = 0.94
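The worked example can be verified with a short sketch of the cosine measure:

```python
from math import sqrt

def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (||a|| ||b||) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))  # 0.94
```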

KL Divergence: Comparing Two Probability Distributions

- The Kullback-Leibler (KL) divergence measures the difference between two probability distributions over the same variable x
- From information theory; closely related to relative entropy, information divergence, and information for discrimination
- $D_{KL}(p(x) \| q(x))$: the divergence of q(x) from p(x), measuring the information lost when q(x) is used to approximate p(x)
- Discrete form:

  $D_{KL}(p(x) \| q(x)) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$

- The KL divergence measures the expected number of extra bits required to code samples from p(x) (the true distribution) when using a code based on q(x), which represents a theory, model, description, or approximation of p(x)
- Its continuous form:

  $D_{KL}(p(x) \| q(x)) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx$

- The KL divergence is not a distance measure, not a metric: it is asymmetric and does not satisfy the triangle inequality

How to Compute the KL Divergence?

- Based on the formula, $D_{KL}(P \| Q) \geq 0$, and $D_{KL}(P \| Q) = 0$ if and only if P = Q
- What about when p = 0 or q = 0?
  - $\lim_{p \to 0} p \log p = 0$
  - When p != 0 but q = 0, $D_{KL}(p \| q)$ is defined as $\infty$; i.e., if one event e is possible (p(e) > 0) but the other distribution predicts it is absolutely impossible (q(e) = 0), then the two distributions are absolutely different
- However, in practice, P and Q are derived from frequency distributions, which cannot account for unseen events; thus smoothing is needed
- Example: P: (a: 3/5, b: 1/5, c: 1/5), Q: (a: 5/9, b: 3/9, d: 1/9)
  - Introduce a small constant $\epsilon$, e.g., $\epsilon = 10^{-3}$
  - The sample sets observed: $S_P$ = {a, b, c}, $S_Q$ = {a, b, d}, $S_U$ = {a, b, c, d}
  - Smoothing: add the missing symbols to each distribution with probability $\epsilon$
    - P': (a: 3/5 - ε/3, b: 1/5 - ε/3, c: 1/5 - ε/3, d: ε)
    - Q': (a: 5/9 - ε/3, b: 3/9 - ε/3, c: ε, d: 1/9 - ε/3)
  - $D_{KL}(P' \| Q')$ can then be computed easily
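A sketch of the discrete KL divergence and the smoothing recipe above; smooth generalizes the slide's "subtract ε/3 from each observed symbol" step to any number of missing symbols:

```python
from math import log

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x p(x) * log(p(x) / q(x)); terms with p(x) = 0
    contribute 0 (lim p->0 of p log p = 0). Assumes q covers the support
    of p, e.g., after smoothing."""
    return sum(px * log(px / q[x]) for x, px in p.items() if px > 0)

def smooth(p, universe, eps=1e-3):
    """Give each unseen symbol probability eps, subtracting the added mass
    equally from the observed symbols so the total stays 1."""
    missing = [x for x in universe if x not in p]
    k = len(p)
    out = {x: px - eps * len(missing) / k for x, px in p.items()}
    for x in missing:
        out[x] = eps
    return out

P = {"a": 3/5, "b": 1/5, "c": 1/5}
Q = {"a": 5/9, "b": 3/9, "d": 1/9}
universe = set(P) | set(Q)
Pp, Qp = smooth(P, universe), smooth(Q, universe)
dkl = kl_divergence(Pp, Qp)
print(dkl)
```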
